The cost associated with a false negative (such as incorrectly predicting that a cancerous patient is not cancerous) is far greater than that of a false positive (incorrectly yet conservatively labeling a noncancerous patient as cancerous). In such cases, we can outweigh one type of error over another by assigning a different cost to each. These costs may consider the danger to the patient, financial costs of resulting therapies, and other hospital costs. Similarly, the benefits associated with a true positive decision may be different than those of a true negative. Up to now, to compute classifier accuracy, we have assumed equal costs and essentially divided the sum of true positives and true negatives by the total number of test tuples. Alternatively, we can incorporate costs and benefits by instead computing the average cost (or benefit) per decision. Other applications involving cost-benefit analysis include loan application decisions and target marketing mailouts. For example, the cost of loaning to a defaulter greatly exceeds that of the lost business incurred by denying a loan to a nondefaulter. Similarly, in an application that tries to identify households that are likely to respond to mailouts of certain promotional material, the cost of mailouts to numerous households that do not respond may outweigh the cost of lost business from not mailing to households that would have responded. Other costs to consider in the overall analysis include the costs to collect the data and to develop the classification tool.
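To make the cost-benefit idea concrete, here is a minimal sketch (not part of the original text) of computing the average cost per decision from a confusion matrix; the counts and the cost values below are purely hypothetical.

```python
# Confusion matrix counts for a hypothetical cancer-screening test set.
TP, FN = 90, 10        # true positives, false negatives (missed cancer cases)
FP, TN = 140, 9760     # false positives, true negatives
total = TP + FN + FP + TN

# Assumed costs per decision; correct decisions are given cost 0 here.
COST_FN = 100.0        # a missed cancerous patient is assumed far more costly ...
COST_FP = 1.0          # ... than a conservative false alarm

accuracy = (TP + TN) / total
avg_cost = (FN * COST_FN + FP * COST_FP) / total
print(f"accuracy = {accuracy:.4f}, average cost per decision = {avg_cost:.4f}")
```

Two classifiers with similar accuracy can thus differ widely in average cost once unequal error costs are taken into account.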
“Are there other cases where accuracy may not be appropriate?” In classification problems, it is commonly assumed that all tuples are uniquely classifiable, that is, that each training tuple can belong to only one class. Yet, owing to the wide diversity of data in large databases, it is not always reasonable to assume that all tuples are uniquely classifiable. Rather, it is more probable to assume that each tuple may belong to more than one class. How then can the accuracy of classifiers on large databases be measured? The accuracy measure is not appropriate, because it does not take into account the possibility of tuples belonging to more than one class.
Rather than returning a class label, it is useful to return a probability class distribution. Accuracy measures may then use a second guess heuristic, whereby a class prediction is judged as correct if it agrees with the first or second most probable class. Although this does take into consideration, to some degree, the nonunique classification of tuples, it is not a complete solution.
6.12.2 Predictor Error Measures
“How can we measure predictor accuracy?” Let D_T be a test set of the form (X_1, y_1), (X_2, y_2), ..., (X_d, y_d), where the X_i are the n-dimensional test tuples with associated known values, y_i, for a response variable, y, and d is the number of tuples in D_T. Since predictors return a continuous value rather than a categorical label, it is difficult to say exactly whether the predicted value, y'_i, for X_i is correct. Instead of focusing on whether y'_i is an “exact” match with y_i, we instead look at how far off the predicted value is from the actual known value. Loss functions measure the error between y_i and the predicted value, y'_i. The most common loss functions are

Absolute error: |y_i − y'_i|

Squared error: (y_i − y'_i)^2
Based on the above, the test error (rate), or generalization error, is the average loss over the test set. Thus, we get the following error rates.

Mean absolute error: (1/d) Σ_{i=1}^{d} |y_i − y'_i|

Mean squared error: (1/d) Σ_{i=1}^{d} (y_i − y'_i)^2

The mean squared error exaggerates the presence of outliers, while the mean absolute error does not. If we take the square root of the mean squared error, the resulting error measure is called the root mean squared error. This is useful in that it allows the error measured to be of the same magnitude as the quantity being predicted.

Sometimes, we may want the error to be relative to what it would have been if we had just predicted ȳ, the mean value for y from the training data, D. That is, we can normalize the total loss by dividing by the total loss incurred from always predicting the mean. Relative measures of error include:

Relative absolute error: Σ_{i=1}^{d} |y_i − y'_i| / Σ_{i=1}^{d} |y_i − ȳ|

Relative squared error: Σ_{i=1}^{d} (y_i − y'_i)^2 / Σ_{i=1}^{d} (y_i − ȳ)^2

where ȳ is the mean value of the y_i's of the training data. We can take the root of the relative squared error to obtain the root relative squared error so that the resulting error is of the same magnitude as the quantity predicted.

In practice, the choice of error measure does not greatly affect prediction model selection.
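As an illustration, the error measures above can be computed directly from a list of known values and predictions. The Python sketch below is an assumption of this edit (the text itself prescribes no implementation); y_train_mean stands for ȳ, the mean of y over the training data.

```python
from math import sqrt

def error_measures(y_true, y_pred, y_train_mean):
    """Return (MAE, RMSE, relative absolute error, root relative squared error)."""
    d = len(y_true)
    abs_err = [abs(y - yp) for y, yp in zip(y_true, y_pred)]
    sq_err = [(y - yp) ** 2 for y, yp in zip(y_true, y_pred)]
    mae = sum(abs_err) / d                               # mean absolute error
    rmse = sqrt(sum(sq_err) / d)                         # root mean squared error
    # Normalize by the loss incurred from always predicting the training mean.
    rae = sum(abs_err) / sum(abs(y - y_train_mean) for y in y_true)
    rrse = sqrt(sum(sq_err) / sum((y - y_train_mean) ** 2 for y in y_true))
    return mae, rmse, rae, rrse

# Small, made-up example.
print(error_measures([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0], y_train_mean=4.5))
```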
6.13 Evaluating the Accuracy of a Classifier or Predictor

How can we use the above measures to obtain a reliable estimate of classifier accuracy (or predictor accuracy in terms of error)? Holdout, random subsampling, cross-validation, and the bootstrap are common techniques for assessing accuracy based on randomly sampled partitions of the given data.
Figure 6.29 Estimating accuracy with the holdout method: the given data are randomly partitioned into a training set, used to derive the model, and a test set, used to estimate the model's accuracy.
The use of such techniques to estimate accuracy increases the overall computation time, yet is useful for model selection.
6.13.1 Holdout Method and Random Subsampling

The holdout method is what we have alluded to so far in our discussions about accuracy. In this method, the given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set. The training set is used to derive the model, whose accuracy is estimated with the test set (Figure 6.29). The estimate is pessimistic because only a portion of the initial data is used to derive the model.
Random subsampling is a variation of the holdout method in which the holdout method is repeated k times. The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration. (For prediction, we can take the average of the predictor error rates.)
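A minimal sketch of the holdout method and random subsampling follows, assuming a classifier object with scikit-learn style fit()/predict() methods; the function names and the 2/3 split are illustrative, not prescribed by the text.

```python
import random

def holdout_accuracy(model, X, y, train_frac=2/3, seed=None):
    """Estimate accuracy with a single random holdout split."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    model.fit([X[i] for i in train], [y[i] for i in train])
    pred = model.predict([X[i] for i in test])
    return sum(p == y[i] for p, i in zip(pred, test)) / len(test)

def random_subsampling_accuracy(model, X, y, k=10):
    """Repeat the holdout method k times and average the accuracies."""
    return sum(holdout_accuracy(model, X, y, seed=i) for i in range(k)) / k
```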
6.13.2 Cross-validation
In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or “folds,” D_1, D_2, ..., D_k, each of approximately equal size. Training and testing is performed k times. In iteration i, partition D_i is reserved as the test set, and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D_2, ..., D_k collectively serve as the training set in order to obtain a first model, which is tested on D_1; the second iteration is trained on subsets D_1, D_3, ..., D_k and tested on D_2; and so on. Unlike the holdout and random subsampling methods above, here each sample is used the same number of times for training and once for testing. For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data. For prediction, the error estimate can be computed as the total loss from the k iterations, divided by the total number of initial tuples.
Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial tuples. That is, only one sample is “left out” at a time for the test set. In stratified cross-validation, the folds are stratified so that the class distribution of the tuples in each fold is approximately the same as that in the initial data.
In general, stratified 10-fold cross-validation is recommended for estimating accuracy (even if computation power allows using more folds) due to its relatively low bias and variance.
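For example, stratified 10-fold cross-validation can be run with scikit-learn, an illustrative library choice rather than one used by the text; averaging the per-fold accuracies approximates the overall estimate described above when the folds are of roughly equal size.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)  # stratified folds
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=skf)
print("per-fold accuracy:", scores)
print("overall estimate :", scores.mean())
```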
6.13.3 Bootstrap
Unlike the accuracy estimation methods mentioned above, the bootstrap method samples the given training tuples uniformly with replacement. That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set. For instance, imagine a machine that randomly selects tuples for our training set. In sampling with replacement, the machine is allowed to select the same tuple more than once.
There are several bootstrap methods. A commonly used one is the .632 bootstrap, which works as follows. Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a bootstrap sample or training set of d samples. It is very likely that some of the original data tuples will occur more than once in this sample. The data tuples that did not make it into the training set end up forming the test set. Suppose we were to try this out several times. As it turns out, on average, 63.2% of the original data tuples will end up in the bootstrap, and the remaining 36.8% will form the test set (hence the name, .632 bootstrap).
“Where does the figure, 63.2%, come from?” Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is (1 − 1/d). We have to select d times, so the probability that a tuple will not be chosen during this whole time is (1 − 1/d)^d. If d is large, the probability approaches e^(−1) = 0.368. Thus, 36.8% of tuples will not be selected for training and thereby end up in the test set, and the remaining 63.2% will form the training set.
We can repeat the sampling procedure k times, where in each iteration, we use the current test set to obtain an accuracy estimate of the model obtained from the current bootstrap sample. The overall accuracy of the model is then estimated as
Acc(M) = Σ_{i=1}^{k} (0.632 × Acc(M_i)_{test set} + 0.368 × Acc(M_i)_{train set}),    (6.65)
where Acc(M_i)_{test set} is the accuracy of the model obtained with bootstrap sample i when it is applied to test set i. Acc(M_i)_{train set} is the accuracy of the model obtained with bootstrap sample i when it is applied to the original set of data tuples. The bootstrap method works well with small data sets.
Figure 6.30 Increasing model accuracy: bagging and boosting each generate a set of classification or prediction models, M_1, M_2, ..., M_k. Voting strategies are used to combine the predictions for a given new data sample (unknown tuple).
6.14 Ensemble Methods—Increasing the Accuracy

In Section 6.3.3, we saw how pruning can be applied to decision tree induction to help improve the accuracy of the resulting decision trees. Are there general strategies for improving classifier and predictor accuracy?
The answer is yes. Bagging and boosting are two such techniques (Figure 6.30). They are examples of ensemble methods, or methods that use a combination of models. Each combines a series of k learned models (classifiers or predictors), M_1, M_2, ..., M_k, with the aim of creating an improved composite model, M∗. Both bagging and boosting can be used for classification as well as prediction.
6.14.1 Bagging
We first take an intuitive look at how bagging works as a method of increasing accuracy. For ease of explanation, we will assume at first that our model is a classifier. Suppose that you are a patient and would like to have a diagnosis made based on your symptoms. Instead of asking one doctor, you may choose to ask several. If a certain diagnosis occurs more than any of the others, you may choose this as the final or best diagnosis. That is, the final diagnosis is made based on a majority vote, where each doctor gets an equal vote. Now replace each doctor by a classifier, and you have the basic idea behind bagging. Intuitively, a majority vote made by a large group of doctors may be more reliable than a majority vote made by a small group.
Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set, D_i, of d tuples is sampled with replacement from the original set of tuples, D. (Note that the term bagging stands for bootstrap aggregation.) Each training set is a bootstrap sample, as described in Section 6.13.3.
Algorithm: Bagging. The bagging algorithm—create an ensemble of models (classifiers or predictors) for a learning scheme where each model gives an equally weighted prediction.

Input:
D, a set of d training tuples;
k, the number of models in the ensemble;
a learning scheme (e.g., decision tree algorithm, backpropagation, etc.)

Output: A composite model, M∗.

Method:
(1) for i = 1 to k do // create k models:
(2)   create bootstrap sample, D_i, by sampling D with replacement;
(3)   use D_i to derive a model, M_i;
(4) endfor

To use the composite model on a tuple, X:
(1) if classification then
(2)   let each of the k models classify X and return the majority vote;
(3) if prediction then
(4)   let each of the k models predict a value for X and return the average predicted value;

Figure 6.31 Bagging.
Because sampling with replacement is used, some of the original tuples of D may not be included in D_i, whereas others may occur more than once. A classifier model, M_i, is learned for each training set, D_i. To classify an unknown tuple, X, each classifier, M_i, returns its class prediction, which counts as one vote. The bagged classifier, M∗, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple. The algorithm is summarized in Figure 6.31.

The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data. It will not be considerably worse and is more robust to the effects of noisy data. The increased accuracy occurs because the composite model reduces the variance of the individual classifiers. For prediction, it was theoretically proven that a bagged predictor will always have improved accuracy over a single predictor derived from D.
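A minimal sketch of bagging with majority voting is shown below; the use of scikit-learn decision trees as the base learner and the value k = 25 are illustrative choices, not prescribed by the text.

```python
import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    """Learn k classifiers, each from a bootstrap sample D_i of the training data."""
    rng = random.Random(seed)
    d, models = len(X), []
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]   # bootstrap sample D_i
        model = DecisionTreeClassifier()
        model.fit([X[i] for i in idx], [y[i] for i in idx])
        models.append(model)
    return models

def bagging_predict(models, x):
    """Each model gets one equal vote; return the class with the most votes."""
    votes = [m.predict([x])[0] for m in models]
    return Counter(votes).most_common(1)[0][0]
```

For prediction of continuous values, the majority vote in bagging_predict would simply be replaced by the average of the individual predictions.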
6.14.2 Boosting
We now look at the ensemble method of boosting. As in the previous section, suppose that as a patient, you have certain symptoms. Instead of consulting one doctor, you choose to consult several. Suppose you assign weights to the value or worth of each doctor's diagnosis, based on the accuracies of previous diagnoses they have made. The final diagnosis is then a combination of the weighted diagnoses. This is the essence behind boosting.
In boosting, weights are assigned to each training tuple. A series of k classifiers is iteratively learned. After a classifier M_i is learned, the weights are updated to allow the subsequent classifier, M_{i+1}, to “pay more attention” to the training tuples that were misclassified by M_i. The final boosted classifier, M∗, combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy. The boosting algorithm can be extended for the prediction of continuous values.
Adaboost is a popular boosting algorithm. Suppose we would like to boost the accuracy of some learning method. We are given D, a data set of d class-labeled tuples, (X_1, y_1), (X_2, y_2), ..., (X_d, y_d), where y_i is the class label of tuple X_i. Initially, Adaboost assigns each training tuple an equal weight of 1/d. Generating k classifiers for the ensemble requires k rounds through the rest of the algorithm. In round i, the tuples from D are sampled to form a training set, D_i, of size d. Sampling with replacement is used—the same tuple may be selected more than once. Each tuple's chance of being selected is based on its weight. A classifier model, M_i, is derived from the training tuples of D_i. Its error is then calculated using D_i as a test set. The weights of the training tuples are then adjusted according to how they were classified. If a tuple was incorrectly classified, its weight is increased. If a tuple was correctly classified, its weight is decreased. A tuple's weight reflects how hard it is to classify—the higher the weight, the more often it has been misclassified. These weights will be used to generate the training samples for the classifier of the next round. The basic idea is that when we build a classifier, we want it to focus more on the misclassified tuples of the previous round. Some classifiers may be better at classifying some “hard” tuples than others. In this way, we build a series of classifiers that complement each other. The algorithm is summarized in Figure 6.32.
Now, let's look at some of the math that's involved in the algorithm. To compute the error rate of model M_i, we sum the weights of each of the tuples in D_i that M_i misclassified. That is,

error(M_i) = Σ_{j=1}^{d} w_j × err(X_j),    (6.66)

where w_j is the weight of tuple X_j and err(X_j) is the misclassification error of X_j: if the tuple was misclassified, then err(X_j) is 1; otherwise, it is 0. If the performance of classifier M_i is so poor that its error exceeds 0.5, then we abandon it. Instead, we try again by generating a new D_i training set, from which we derive a new M_i.

The error rate of M_i affects how the weights of the training tuples are updated. If a tuple in round i was correctly classified, its weight is multiplied by error(M_i)/(1 − error(M_i)). Once the weights of all of the correctly classified tuples are updated, the weights for all tuples (including the misclassified ones) are normalized so that their sum remains the same as it was before. To normalize a weight, we multiply it by the sum of the old weights, divided by the sum of the new weights. As a result, the weights of misclassified tuples are increased and the weights of correctly classified tuples are decreased, as described above.
“Once boosting is complete, how is the ensemble of classifiers used to predict the class label of a tuple, X?”
Algorithm: Adaboost. A boosting algorithm—create an ensemble of classifiers. Each one gives a weighted vote.

Input:
D, a set of d class-labeled training tuples;
k, the number of rounds (one classifier is generated per round);
a classification learning scheme.

Output: A composite model.

Method:
(1) initialize the weight of each tuple in D to 1/d;
(2) for i = 1 to k do // for each round:
(3)   sample D with replacement according to the tuple weights to obtain D_i;
(4)   use training set D_i to derive a model, M_i;
(5)   compute error(M_i), the error rate of M_i (Equation 6.66);
(6)   if error(M_i) > 0.5 then
(7)     reinitialize the weights to 1/d;
(8)     go back to step 3 and try again;
(9)   endif
(10)  for each tuple in D_i that was correctly classified do
(11)    multiply the weight of the tuple by error(M_i)/(1 − error(M_i)); // update weights
(12)  normalize the weight of each tuple;
(13) endfor

To use the composite model to classify tuple, X:
(1) initialize the weight of each class to 0;
(2) for i = 1 to k do // for each classifier:
(3)   w_i = log((1 − error(M_i))/error(M_i)); // weight of the classifier's vote
(4)   c = class prediction for X from M_i;
(5)   add w_i to the weight for class c;
(6) endfor
(7) return the class with the largest weight;

Figure 6.32 Adaboost, a boosting algorithm.
Unlike bagging, where each classifier was assigned an equal vote, boosting assigns a weight to each classifier's vote, based on how well the classifier performed. The lower a classifier's error rate, the more accurate it is, and therefore, the higher its weight for voting should be. The weight of classifier M_i's vote is

log((1 − error(M_i)) / error(M_i)).

For each class, c, we sum the weights of each classifier that assigned class c to X. The class with the highest sum is the “winner” and is returned as the class prediction for tuple X.
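The sketch below follows the Adaboost procedure described above, with one simplification: instead of resampling D according to the tuple weights (steps 3 and 4 of Figure 6.32), it passes the weights directly to a base learner that accepts sample_weight, a common practical shortcut. The choice of base learner and all names are illustrative.

```python
import math
from collections import defaultdict
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=10):
    d = len(X)
    w = [1.0 / d] * d                                  # step (1): equal initial weights
    ensemble = []                                      # list of (model, vote weight) pairs
    for _ in range(k):
        model = DecisionTreeClassifier(max_depth=1)    # a weak base learner ("stump")
        model.fit(X, y, sample_weight=w)               # weighted fit instead of resampling D_i
        pred = model.predict(X)
        error = sum(wj for wj, p, yj in zip(w, pred, y) if p != yj)  # Equation (6.66)
        if error >= 0.5 or error == 0.0:               # stop (the text re-samples and retries)
            break
        for j in range(d):                             # shrink weights of correct tuples
            if pred[j] == y[j]:
                w[j] *= error / (1.0 - error)
        s = sum(w)
        w = [wj / s for wj in w]                       # renormalize the weights
        ensemble.append((model, math.log((1.0 - error) / error)))  # classifier's vote weight
    return ensemble

def adaboost_predict(ensemble, x):
    class_weight = defaultdict(float)
    for model, vote in ensemble:
        class_weight[model.predict([x])[0]] += vote    # sum the vote weights per class
    return max(class_weight, key=class_weight.get)     # class with the largest total weight
```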
“How does boosting compare with bagging?” Because of the way boosting focuses on the misclassified tuples, it risks overfitting the resulting composite model to such data. Therefore, sometimes the resulting “boosted” model may be less accurate than a single model derived from the same data. Bagging is less susceptible to model overfitting. While both can significantly improve accuracy in comparison to a single model, boosting tends to achieve greater accuracy.
6.15 Model Selection

Suppose that we have generated two models, M_1 and M_2 (for either classification or prediction), from our data. We have performed 10-fold cross-validation to obtain a mean error rate for each. How can we determine which model is best? It may seem intuitive to select the model with the lowest error rate; however, the mean error rates are just estimates of error on the true population of future data cases. There can be considerable variance between error rates within any given 10-fold cross-validation experiment. Although the mean error rates obtained for M_1 and M_2 may appear different, that difference may not be statistically significant. What if any difference between the two may just be attributed to chance? This section addresses these questions.
6.15.1 Estimating Confidence Intervals
To determine if there is any “real” difference in the mean error rates of two models,
we need to employ a test of statistical significance. In addition, we would like to obtain
some confidence limits for our mean error rates so that we can make statements like
“any observed mean will not vary by +/− two standard errors 95% of the time for future samples” or “one model is better than the other by a margin of error of +/− 4%.”
What do we need in order to perform the statistical test? Suppose that for each model, we did 10-fold cross-validation, say, 10 times, each time using a different 10-fold partitioning of the data. Each partitioning is independently drawn. We can average the 10 error rates obtained each for M_1 and M_2, respectively, to obtain the mean error rate for each model. For a given model, the individual error rates calculated in the cross-validations may be considered as different, independent samples from a probability distribution. In general, they follow a t distribution with k − 1 degrees of freedom where, here, k = 10. (This distribution looks very similar to a normal, or Gaussian, distribution even though the functions defining the two are quite different. Both are unimodal, symmetric, and bell-shaped.) This allows us to do hypothesis testing, where the significance test used is the t-test, or Student's t-test. Our hypothesis is that the two models are the same, or in other words, that the difference in mean error rate between the two is zero. If we can reject this hypothesis (referred to as the null hypothesis), then we can conclude that the difference between the two models is statistically significant, in which case we can select the model with the lower error rate.
In data mining practice, we may often employ a single test set, that is, the same test set can be used for both M_1 and M_2. In such cases, we do a pairwise comparison of the two models for each 10-fold cross-validation round. That is, for the ith round of 10-fold cross-validation, the same cross-validation partitioning is used to obtain an error rate for M_1 and an error rate for M_2. Let err(M_1)_i (or err(M_2)_i) be the error rate of model M_1 (or M_2) on round i. The error rates for M_1 are averaged to obtain a mean error rate for M_1, denoted err(M_1). Similarly, we can obtain err(M_2). The variance of the difference between the two models is denoted var(M_1 − M_2). The t-test computes the t-statistic with k − 1 degrees of freedom for k samples. In our example we have k = 10 since, here, the k samples are our error rates obtained from ten 10-fold cross-validations for each model. The t-statistic for pairwise comparison is computed as follows:

t = (err(M_1) − err(M_2)) / sqrt(var(M_1 − M_2) / k),

where

var(M_1 − M_2) = (1/k) Σ_{i=1}^{k} [err(M_1)_i − err(M_2)_i − (err(M_1) − err(M_2))]^2.    (6.69)
To determine whether M_1 and M_2 are significantly different, we compute t and select a significance level, sig. In practice, a significance level of 5% or 1% is typically used. We then consult a table for the t distribution, available in standard textbooks on statistics. This table is usually shown arranged by degrees of freedom as rows and significance levels as columns. Suppose we want to ascertain whether the difference between M_1 and M_2 is significantly different for 95% of the population, that is, sig = 5% or 0.05. We need to find the t distribution value corresponding to k − 1 degrees of freedom (or 9 degrees of freedom for our example) from the table. However, because the t distribution is symmetric, typically only the upper percentage points of the distribution are shown. Therefore, we look up the table value for z = sig/2, which in this case is 0.025, where z is also referred to as a confidence limit. If t > z or t < −z, then our value of t lies in the rejection region, within the tails of the distribution. This means that we can reject the null hypothesis that the means of M_1 and M_2 are the same and conclude that there is a statistically significant difference between the two models. Otherwise, if we cannot reject the null hypothesis, we then conclude that any difference between M_1 and M_2 can be attributed to chance.
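As a worked example, the paired t-statistic can be computed directly from per-round error rates. The sketch below uses the error rates listed in Exercise 6.18; the resulting |t| would then be compared against a t-distribution table value for 9 degrees of freedom at the chosen significance level.

```python
from math import sqrt

def paired_t_statistic(err1, err2):
    """t-statistic for a pairwise comparison of two models over k rounds."""
    k = len(err1)
    mean1, mean2 = sum(err1) / k, sum(err2) / k
    mean_diff = mean1 - mean2
    diffs = [e1 - e2 for e1, e2 in zip(err1, err2)]
    var = sum((d - mean_diff) ** 2 for d in diffs) / k   # var(M_1 - M_2), as defined above
    return mean_diff / sqrt(var / k)

t = paired_t_statistic([30.5, 32.2, 20.7, 20.6, 31.0, 41.0, 27.7, 26.0, 21.5, 26.0],
                       [22.4, 14.5, 22.4, 19.6, 20.7, 20.4, 22.1, 19.4, 16.2, 35.0])
print(f"t = {t:.3f}")
```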
If two test sets are available instead of a single test set, then a nonpaired version of the t-test is used, where the variance between the means of the two models is estimated as

var(M_1 − M_2) = sqrt(var(M_1)/k_1 + var(M_2)/k_2),

and k_1 and k_2 are the number of cross-validation samples (in our case, 10-fold cross-validation rounds) used for M_1 and M_2, respectively. When consulting the table of t distribution, the number of degrees of freedom used is taken as the minimum number of degrees of freedom of the two models.
6.15.2 ROC Curves
ROC curves are a useful visual tool for comparing two classification models. The name ROC stands for Receiver Operating Characteristic. ROC curves come from signal detection theory that was developed during World War II for the analysis of radar images. An ROC curve shows the trade-off between the true positive rate or sensitivity (proportion of positive tuples that are correctly identified) and the false-positive rate (proportion of negative tuples that are incorrectly identified as positive) for a given model. That is, given a two-class problem, it allows us to visualize the trade-off between the rate at which the model can accurately recognize ‘yes’ cases versus the rate at which it mistakenly identifies ‘no’ cases as ‘yes’ for different “portions” of the test set. Any increase in the true positive rate occurs at the cost of an increase in the false-positive rate. The area under the ROC curve is a measure of the accuracy of the model.
In order to plot an ROC curve for a given classification model, M, the model must be able to return a probability or ranking for the predicted class of each test tuple. That is, we need to rank the test tuples in decreasing order, where the one the classifier thinks is most likely to belong to the positive or ‘yes’ class appears at the top of the list. Naive Bayesian and backpropagation classifiers are appropriate, whereas others, such as decision tree classifiers, can easily be modified so as to return a class probability distribution for each prediction. The vertical axis of an ROC curve represents the true positive rate. The horizontal axis represents the false-positive rate. An ROC curve for M is plotted as follows. Starting at the bottom left-hand corner (where the true positive rate and false-positive rate are both 0), we check the actual class label of the tuple at the top of the list. If we have a true positive (that is, a positive tuple that was correctly classified), then on the ROC curve, we move up and plot a point. If, instead, the tuple really belongs to the ‘no’ class, we have a false positive. On the ROC curve, we move right and plot a point. This process is repeated for each of the test tuples, each time moving up on the curve for a true positive or toward the right for a false positive.

Figure 6.33 shows the ROC curves of two classification models. The plot also shows a diagonal line where for every true positive of such a model, we are just as likely to encounter a false positive. Thus, the closer the ROC curve of a model is to the diagonal line, the less accurate the model. If the model is really good, initially we are more likely to encounter true positives as we move down the ranked list. Thus, the curve would move steeply up from zero. Later, as we start to encounter fewer and fewer true positives, and more and more false positives, the curve eases off and becomes more horizontal.
To assess the accuracy of a model, we can measure the area under the curve. Several software packages are able to perform such a calculation. The closer the area is to 0.5, the less accurate the corresponding model is. A model with perfect accuracy will have an area of 1.0.
Figure 6.33 The ROC curves of two classification models, plotting the true positive rate against the false positive rate.
6.16 Summary

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. While classification predicts categorical labels (classes), prediction models continuous-valued functions.
Preprocessing of the data in preparation for classification and prediction can involve data cleaning to reduce noise or handle missing values, relevance analysis to remove irrelevant or redundant attributes, and data transformation, such as generalizing the data to higher-level concepts or normalizing the data.
Predictive accuracy, computational speed, robustness, scalability, and interpretability
are five criteria for the evaluation of classification and prediction methods.
ID3, C4.5, and CART are greedy algorithms for the induction of decision trees. Each algorithm uses an attribute selection measure to select the attribute tested for each nonleaf node in the tree. Pruning algorithms attempt to improve accuracy by removing tree branches reflecting noise in the data. Early decision tree algorithms typically assume that the data are memory resident—a limitation to data mining on large databases. Several scalable algorithms, such as SLIQ, SPRINT, and RainForest, have been proposed to address this issue.
Naïve Bayesian classification and Bayesian belief networks are based on Bayes' theorem of posterior probability. Unlike naïve Bayesian classification (which assumes class conditional independence), Bayesian belief networks allow class conditional independencies to be defined between subsets of variables.

A rule-based classifier uses a set of IF-THEN rules for classification. Rules can be extracted from a decision tree. Rules may also be generated directly from training data using sequential covering algorithms and associative classification algorithms.
Backpropagation is a neural network algorithm for classification that employs a method of gradient descent. It searches for a set of weights that can model the data so as to minimize the mean squared distance between the network's class prediction and the actual class label of data tuples. Rules may be extracted from trained neural networks in order to help improve the interpretability of the learned network.
A Support Vector Machine (SVM) is an algorithm for the classification of both linear and nonlinear data. It transforms the original data into a higher dimension, from where it can find a hyperplane for separation of the data using essential training tuples called support vectors.
Associative classification uses association mining techniques that search for frequently occurring patterns in large databases. The patterns may generate rules, which can be analyzed for use in classification.
Decision tree classifiers, Bayesian classifiers, classification by backpropagation, support vector machines, and classification based on association are all examples of eager learners in that they use training tuples to construct a generalization model and in this way are ready for classifying new tuples. This contrasts with lazy learners or instance-based methods of classification, such as nearest-neighbor classifiers and case-based reasoning classifiers, which store all of the training tuples in pattern space and wait until presented with a test tuple before performing generalization. Hence, lazy learners require efficient indexing techniques.
In genetic algorithms, populations of rules “evolve” via operations of crossover and mutation until all rules within a population satisfy a specified threshold. Rough set theory can be used to approximately define classes that are not distinguishable based on the available attributes. Fuzzy set approaches replace “brittle” threshold cutoffs for continuous-valued attributes with degree of membership functions.
Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Unlike decision trees, regression trees and model trees are used for prediction. In regression trees, each leaf stores a continuous-valued prediction. In model trees, each leaf holds a regression model.
Stratified k-fold cross-validation is a recommended method for accuracy estimation.
Bagging and boosting methods can be used to increase overall accuracy by learning and combining a series of individual models. For classifiers, sensitivity, specificity, and precision are useful alternatives to the accuracy measure, particularly when the main class of interest is in the minority. There are many measures of predictor error, such as the mean squared error, the mean absolute error, the relative squared error, and the relative absolute error.

Methods such as neural networks tend to be more computationally intensive than most decision tree methods.
Exercises
6.1 Briefly outline the major steps of decision tree classification.
6.2 Why is tree pruning useful in decision tree induction? What is a drawback of using a
separate set of tuples to evaluate pruning?
6.3 Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision tree and then converting the pruned tree to rules. What advantage does (a) have over (b)?

6.4 It is important to calculate the worst-case computational complexity of the decision tree algorithm. Given data set D, the number of attributes n, and the number of training tuples |D|, show that the computational cost of growing a tree is at most
n × |D| × log(|D|).
6.5 Why is naïve Bayesian classification called “naïve”? Briefly outline the major ideas of naïve Bayesian classification.
6.6 Given a 5 GB data set with 50 attributes (each containing 100 distinct values) and 512 MB of main memory in your laptop, outline an efficient method that constructs decision trees in such large data sets. Justify your answer by rough calculation of your main memory usage.

6.7 RainForest is an interesting scalable algorithm for decision tree induction. Develop a scalable naive Bayesian classification algorithm that requires just a single scan of the entire data set for most databases. Discuss whether such an algorithm can be refined to incorporate boosting to further enhance its classification accuracy.

6.8 Compare the advantages and disadvantages of eager classification (e.g., decision tree, Bayesian, neural network) versus lazy classification (e.g., k-nearest neighbor, case-based reasoning).
6.9 Design an efficient method that performs effective naïve Bayesian classification over an infinite data stream (i.e., you can scan the data stream only once). If we wanted to discover the evolution of such classification schemes (e.g., comparing the classification scheme at this moment with earlier schemes, such as one from a week ago), what modified design would you suggest?
6.10 What is associative classification? Why is associative classification able to achieve higher classification accuracy than a classical decision tree method? Explain how associative classification can be used for text document classification.
6.11 The following table consists of training data from an employee database. The data have been generalized. For example, “31...35” for age represents the age range of 31 to 35. For a given row entry, count represents the number of data tuples having the values for department, status, age, and salary given in that row.

department   status   age       salary      count
sales        senior   31...35   46K...50K   30
sales        junior   26...30   26K...30K   40
sales        junior   31...35   31K...35K   40
systems      junior   21...25   46K...50K   20
systems      senior   31...35   66K...70K   5
systems      junior   26...30   46K...50K   3
systems      senior   41...45   66K...70K   3
marketing    senior   36...40   46K...50K   10
marketing    junior   31...35   41K...45K   4
secretary    senior   46...50   36K...40K   4
secretary    junior   26...30   26K...30K   6
Let status be the class label attribute.
(a) How would you modify the basic decision tree algorithm to take into consideration the count of each generalized data tuple (i.e., of each row entry)?

(b) Use your algorithm to construct a decision tree from the given data.

(c) Given a data tuple having the values “systems,” “26...30,” and “46K...50K” for the attributes department, age, and salary, respectively, what would a naive Bayesian classification of the status for the tuple be?

(d) Design a multilayer feed-forward neural network for the given data. Label the nodes in the input and output layers.

(e) Using the multilayer feed-forward neural network obtained above, show the weight values after one iteration of the backpropagation algorithm, given the training instance “(sales, senior, 31...35, 46K...50K).” Indicate your initial weight values and biases, and the learning rate used.
6.12 The support vector machine (SVM) is a highly accurate classification method. However, SVM classifiers suffer from slow processing when training with a large set of data tuples. Discuss how to overcome this difficulty and develop a scalable SVM algorithm for efficient SVM classification in large data sets.
6.13 Write an algorithm for k-nearest-neighbor classification given k and n, the number of attributes describing each tuple.
6.14 The following table shows the midterm and final exam grades obtained for students in a database course.

(a) Plot the data. Do x and y seem to have a linear relationship?

(b) Use the method of least squares to find an equation for the prediction of a student's final exam grade based on the student's midterm grade in the course.

(c) Predict the final exam grade of a student who received an 86 on the midterm exam.
6.15 Some nonlinear regression models can be converted to linear models by applying transformations to the predictor variables. Show how the nonlinear regression equation y = αX^β can be converted to a linear regression equation solvable by the method of least squares.
6.16 What is boosting? State why it may improve the accuracy of decision tree induction.

6.17 Show that accuracy is a function of sensitivity and specificity, that is, prove Equation (6.58).

6.18 Suppose that we would like to select between two prediction models, M_1 and M_2. We have performed 10 rounds of 10-fold cross-validation on each model, where the same data partitioning in round i is used for both M_1 and M_2. The error rates obtained for M_1 are 30.5, 32.2, 20.7, 20.6, 31.0, 41.0, 27.7, 26.0, 21.5, 26.0. The error rates for M_2 are 22.4, 14.5, 22.4, 19.6, 20.7, 20.4, 22.1, 19.4, 16.2, 35.0. Comment on whether one model is significantly better than the other considering a significance level of 1%.
6.19 It is difficult to assess classification accuracy when individual data objects may belong to more than one class at a time. In such cases, comment on what criteria you would use to compare different classifiers modeled after the same data.
Bibliographic Notes
Classification from machine learning, statistics, and pattern recognition perspectives has been described in many books, such as Weiss and Kulikowski [WK91], Michie, Spiegelhalter, and Taylor [MST94], Russell and Norvig [RN95], Langley [Lan96], Mitchell [Mit97], Hastie, Tibshirani, and Friedman [HTF01], Duda, Hart, and Stork [DHS01], Alpaydin [Alp04], Tan, Steinbach, and Kumar [TSK05], and Witten and Frank [WF05]. Many of these books describe each of the basic methods of classification discussed in this chapter, as well as practical techniques for the evaluation of classifier performance. Edited collections containing seminal articles on machine learning can be found in Michalski, Carbonell, and Mitchell [MCM83, MCM86], Kodratoff and Michalski [KM90], Shavlik and Dietterich [SD90], and Michalski and Tecuci [MT94]. For a presentation of machine learning with respect to data mining applications, see Michalski, Bratko, and Kubat [MBK98].
The C4.5 algorithm is described in a book by Quinlan [Qui93]. The CART system is detailed in Classification and Regression Trees by Breiman, Friedman, Olshen, and Stone [BFOS84]. Both books give an excellent presentation of many of the issues regarding decision tree induction. C4.5 has a commercial successor, known as C5.0, which can be found at www.rulequest.com. ID3, a predecessor of C4.5, is detailed in Quinlan [Qui86]. It expands on pioneering work on concept learning systems, described by Hunt, Marin, and Stone [HMS66]. Other algorithms for decision tree induction include FACT (Loh and Vanichsetakul [LV88]), QUEST (Loh and Shih [LS97]), PUBLIC (Rastogi and Shim [RS98]), and CHAID (Kass [Kas80] and Magidson [Mag94]). INFERULE (Uthurusamy, Fayyad, and Spangler [UFS91]) learns decision trees from inconclusive data, where probabilistic rather than categorical classification rules are obtained. KATE (Manago and Kodratoff [MK91]) learns decision trees from complex structured data. Incremental versions of ID3 include ID4 (Schlimmer and Fisher [SF86a]) and ID5 (Utgoff [Utg88]), the latter of which is extended in Utgoff, Berkman, and Clouse [UBC97]. An incremental version of CART is described in Crawford [Cra89]. BOAT (Gehrke, Ganti, Ramakrishnan, and Loh [GGRL99]), a decision tree algorithm that addresses the scalability issue in data mining, is also incremental. Other decision tree algorithms that address scalability include SLIQ (Mehta, Agrawal, and Rissanen [MAR96]), SPRINT (Shafer, Agrawal, and Mehta [SAM96]), RainForest (Gehrke, Ramakrishnan, and Ganti [GRG98]), and earlier approaches, such as Catlet [Cat91], and Chan and Stolfo [CS93a, CS93b]. The integration of attribute-oriented induction with decision tree induction is proposed in Kamber, Winstone, Gong, et al. [KWG+97]. For a comprehensive survey of many salient issues relating to decision tree induction, such as attribute selection and pruning, see Murthy [Mur98].
For a detailed discussion on attribute selection measures, see Kononenko and Hong [KH97]. Information gain was proposed by Quinlan [Qui86] and is based on pioneering work on information theory by Shannon and Weaver [SW49]. The gain ratio, proposed as an extension to information gain, is described as part of C4.5 [Qui93]. The Gini index was proposed for CART [BFOS84]. The G-statistic, based on information theory, is given in Sokal and Rohlf [SR81]. Comparisons of attribute selection measures include Buntine and Niblett [BN92], Fayyad and Irani [FI92], Kononenko [Kon95], Loh and Shih [LS97], and Shih [Shi99]. Fayyad and Irani [FI92] show limitations of impurity-based measures such as information gain and Gini index. They propose a class of attribute selection measures called C-SEP (Class SEParation), which outperform impurity-based measures in certain cases. Kononenko [Kon95] notes that attribute selection measures based on the minimum description length principle have the least bias toward multivalued attributes. Martin and Hirschberg [MH95] proved that the time complexity of decision tree induction increases exponentially with respect to tree height in the worst case, and under fairly general conditions in the average case. Fayyad and Irani [FI90] found that shallow decision trees tend to have many leaves and higher error rates for a large variety of domains. Attribute (or feature) construction is described in Liu and Motoda [LM98, Le98]. Examples of systems with attribute construction include BACON by Langley, Simon, Bradshaw, and Zytkow [LSBZ87], Stagger by Schlimmer [Sch86], FRINGE by Pagallo [Pag89], and AQ17-DCI by Bloedorn and Michalski [BM98].

There are numerous algorithms for decision tree pruning, including cost complexity pruning (Breiman, Friedman, Olshen, and Stone [BFOS84]), reduced error pruning (Quinlan [Qui87]), and pessimistic pruning (Quinlan [Qui86]). PUBLIC (Rastogi and Shim [RS98]) integrates decision tree construction with tree pruning. MDL-based pruning methods can be found in Quinlan and Rivest [QR89], Mehta, Agrawal, and Rissanen [MRA95], and Rastogi and Shim [RS98]. Other methods include Niblett and Bratko [NB86], and Hosking, Pednault, and Sudan [HPS97]. For an empirical comparison of pruning methods, see Mingers [Min89] and Malerba, Floriana, and Semeraro [MFS95]. For a survey on simplifying decision trees, see Breslow and Aha [BA97].

There are several examples of rule-based classifiers. These include AQ15 (Hong, Mozetic, and Michalski [HMM86]), CN2 (Clark and Niblett [CN89]), ITRULE (Smyth and Goodman [SG92]), RISE (Domingos [Dom94]), IREP (Furnkranz and Widmer [FW94]), RIPPER (Cohen [Coh95]), FOIL (Quinlan and Cameron-Jones [Qui90, QCJ93]), and Swap-1 (Weiss and Indurkhya [WI98]). For the extraction of rules from decision trees, see Quinlan [Qui87, Qui93]. Rule refinement strategies that identify the most interesting rules among a given rule set can be found in Major and Mangano [MM95].

Thorough presentations of Bayesian classification can be found in Duda, Hart, and Stork [DHS01], Weiss and Kulikowski [WK91], and Mitchell [Mit97]. For an analysis of the predictive power of naïve Bayesian classifiers when the class conditional independence assumption is violated, see Domingos and Pazzani [DP96]. Experiments with kernel density estimation for continuous-valued attributes, rather than Gaussian estimation, have been reported for naïve Bayesian classifiers in John [Joh97]. For an introduction to Bayesian belief networks, see Heckerman [Hec96]. For a thorough
presentation of probabilistic networks, see Pearl [Pea88]. Solutions for learning the belief network structure from training data given observable variables are proposed in Cooper and Herskovits [CH92], Buntine [Bun94], and Heckerman, Geiger, and Chickering [HGC95]. Algorithms for inference on belief networks can be found in Russell and Norvig [RN95] and Jensen [Jen96]. The method of gradient descent, described in Section 6.4.4 for training Bayesian belief networks, is given in Russell, Binder, Koller, and Kanazawa [RBKK95]. The example given in Figure 6.11 is adapted from Russell et al. [RBKK95]. Alternative strategies for learning belief networks with hidden variables include application of Dempster, Laird, and Rubin's [DLR77] EM (Expectation Maximization) algorithm (Lauritzen [Lau95]) and methods based on the minimum description length principle (Lam [Lam98]). Cooper [Coo90] showed that the general problem of inference in unconstrained belief networks is NP-hard. Limitations of belief networks, such as their large computational complexity (Laskey and Mahoney [LM97]), have prompted the exploration of hierarchical and composable Bayesian models (Pfeffer, Koller, Milch, and Takusagawa [PKMT99] and Xiang, Olesen, and Jensen [XOJ00]). These follow an object-oriented approach to knowledge representation.

The perceptron is a simple neural network, proposed in 1958 by Rosenblatt [Ros58], which became a landmark in early machine learning history. Its input units are randomly connected to a single layer of output linear threshold units. In 1969, Minsky and Papert [MP69] showed that perceptrons are incapable of learning concepts that are linearly inseparable. This limitation, as well as limitations on hardware at the time, dampened enthusiasm for research in computational neuronal modeling for nearly 20 years. Renewed interest was sparked following presentation of the backpropagation algorithm in 1986 by Rumelhart, Hinton, and Williams [RHW86], as this algorithm can learn concepts that are linearly inseparable. Since then, many variations for backpropagation have been proposed, involving, for example, alternative error functions (Hanson and Burr [HB88]), dynamic adjustment of the network topology (Mézard and Nadal [MN89], Fahlman and Lebiere [FL90], Le Cun, Denker, and Solla [LDS90], and Harp, Samad, and Guha [HSG90]), and dynamic adjustment of the learning rate and momentum parameters (Jacobs [Jac88]). Other variations are discussed in Chauvin and Rumelhart [CR95]. Books on neural networks include Rumelhart and McClelland [RM86], Hecht-Nielsen [HN90], Hertz, Krogh, and Palmer [HKP91], Bishop [Bis95], Ripley [Rip96], and Haykin [Hay99]. Many books on machine learning, such as [Mit97, RN95], also contain good explanations of the backpropagation algorithm. There are several techniques for extracting rules from neural networks, such as [SN88, Gal93, TS93, Avn95, LSL95, CS96b, LGT97]. The method of rule extraction described in Section 6.6.4 is based on Lu, Setiono, and Liu [LSL95]. Critiques of techniques for rule extraction from neural networks can be found in Craven and Shavlik [CS97]. Roy [Roy00] proposes that the theoretical foundations of neural networks are flawed with respect to assumptions made regarding how connectionist learning models the brain. An extensive survey of applications of neural networks in industry, business, and science is provided in Widrow, Rumelhart, and Lehr [WRL94].
Support Vector Machines (SVMs) grew out of early work by Vapnik and Chervonenkis on statistical learning theory [VC71]. The first paper on SVMs was presented by Boser,
Guyon, and Vapnik [BGV92]. More detailed accounts can be found in books by Vapnik [Vap95, Vap98]. Good starting points include the tutorial on SVMs by Burges [Bur98] and textbook coverage by Kecman [Kec01]. For methods for solving optimization problems, see Fletcher [Fle87] and Nocedal and Wright [NW99]. These references give additional details alluded to as “fancy math tricks” in our text, such as transformation of the problem to a Lagrangian formulation and subsequent solving using Karush-Kuhn-Tucker (KKT) conditions. For the application of SVMs to regression, see Schölkopf, Bartlett, Smola, and Williamson [SBSW99], and Drucker, Burges, Kaufman, Smola, and Vapnik [DBK+97]. Approaches to SVM for large data include the sequential minimal optimization algorithm by Platt [Pla98], decomposition approaches such as in Osuna, Freund, and Girosi [OFG97], and CB-SVM, a microclustering-based SVM algorithm for large data sets, by Yu, Yang, and Han [YYH03].
Many algorithms have been proposed that adapt association rule mining to the task of classification. The CBA algorithm for associative classification was proposed by Liu, Hsu, and Ma [LHM98]. A classifier using emerging patterns was proposed by Dong and Li [DL99] and Li, Dong, and Ramamohanarao [LDR00]. CMAR (Classification based on Multiple Association Rules) was presented in Li, Han, and Pei [LHP01]. CPAR (Classification based on Predictive Association Rules) was proposed in Yin and Han [YH03b]. Cong, Tan, Tung, and Xu proposed a method for mining top-k covering rule groups for classifying gene expression data with high accuracy [CTTX05]. Lent, Swami, and Widom [LSW97] proposed the ARCS system, which was described in Section 5.3 on mining multidimensional association rules. It combines ideas from association rule mining, clustering, and image processing, and applies them to classification. Meretakis and Wüthrich [MW99] proposed to construct a naïve Bayesian classifier by mining long itemsets.
Nearest-neighbor classifiers were introduced in 1951 by Fix and Hodges [FH51]. A comprehensive collection of articles on nearest-neighbor classification can be found in Dasarathy [Das91]. Additional references can be found in many texts on classification, such as Duda et al. [DHS01] and James [Jam85], as well as articles by Cover and Hart [CH67] and Fukunaga and Hummels [FH87]. Their integration with attribute-weighting and the pruning of noisy instances is described in Aha [Aha92]. The use of search trees to improve nearest-neighbor classification time is detailed in Friedman, Bentley, and Finkel [FBF77]. The partial distance method was proposed by researchers in vector quantization and compression. It is outlined in Gersho and Gray [GG92]. The editing method for removing “useless” training tuples was first proposed by Hart [Har68]. The computational complexity of nearest-neighbor classifiers is described in Preparata and Shamos [PS85]. References on case-based reasoning (CBR) include the texts Riesbeck and Schank [RS89] and Kolodner [Kol93], as well as Leake [Lea96] and Aamodt and Plazas [AP94]. For a list of business applications, see Allen [All94]. Examples in medicine include CASEY by Koton [Kot88] and PROTOS by Bareiss, Porter, and Weir [BPW88], while Rissland and Ashley [RA87] is an example of CBR for law. CBR is available in several commercial software products. For texts on genetic algorithms, see Goldberg [Gol89], Michalewicz [Mic92], and Mitchell [Mit96]. Rough sets were introduced in Pawlak [Paw91]. Concise summaries of rough set theory in data mining include Ziarko [Zia91], and Cios, Pedrycz, and Swiniarski [CPS98]. Rough sets have been used for feature reduction and expert system design in many applications, including Ziarko [Zia91], Lenarcik and Piasta [LP97], and Swiniarski [Swi98]. Algorithms to reduce the computation intensity in finding reducts have been proposed in Skowron and Rauszer [SR92]. Fuzzy set theory was proposed by Zadeh in [Zad65, Zad83]. Additional descriptions can be found in [YZ94, Kec01].
Many good textbooks cover the techniques of regression. Examples include James [Jam85], Dobson [Dob01], Johnson and Wichern [JW02], Devore [Dev95], Hogg and Craig [HC95], Neter, Kutner, Nachtsheim, and Wasserman [NKNW96], and Agresti [Agr96]. The book by Press, Teukolsky, Vetterling, and Flannery [PTVF96] and accompanying source code contain many statistical procedures, such as the method of least squares for both linear and multiple regression. Recent nonlinear regression models include projection pursuit and MARS (Friedman [Fri91]). Log-linear models are also known in the computer science literature as multiplicative models. For log-linear models from a computer science perspective, see Pearl [Pea88]. Regression trees (Breiman, Friedman, Olshen, and Stone [BFOS84]) are often comparable in performance with other regression methods, particularly when there exist many higher-order dependencies among the predictor variables. For model trees, see Quinlan [Qui92].

Methods for data cleaning and data transformation are discussed in Kennedy, Lee, Van Roy, et al. [KLV+98], Weiss and Indurkhya [WI98], Pyle [Pyl99], and Chapter 2 of this book. Issues involved in estimating classifier accuracy are described in Weiss and Kulikowski [WK91] and Witten and Frank [WF05]. The use of stratified 10-fold cross-validation for estimating classifier accuracy is recommended over the holdout, cross-validation, leave-one-out (Stone [Sto74]), and bootstrapping (Efron and Tibshirani [ET93]) methods, based on a theoretical and empirical study by Kohavi [Koh95]. Bagging is proposed in Breiman [Bre96]. The boosting technique of Freund and Schapire [FS97] has been applied to several different classifiers, including decision tree induction (Quinlan [Qui96]) and naive Bayesian classification (Elkan [Elk97]). Sensitivity, specificity, and precision are discussed in Frakes and Baeza-Yates [FBY92]. For ROC analysis, see Egan [Ega75] and Swets [Swe88].
The University of California at Irvine (UCI) maintains a Machine Learning Repository of data sets for the development and testing of classification algorithms. It also maintains a Knowledge Discovery in Databases (KDD) Archive, an online repository of large data sets that encompasses a wide variety of data types, analysis tasks, and application areas. For information on these two repositories, see www.ics.uci.edu/~mlearn/MLRepository.html and http://kdd.ics.uci.edu.

No classification method is superior over all others for all data types and domains. Empirical comparisons of classification methods include [Qui88, SMT91, BCP93, CM94, MST94, BU95], and [LLS00].
7 Cluster Analysis
Imagine that you are given a set of data objects for analysis where, unlike in classification, the class label of each object is not known. This is quite common in large databases, because assigning class labels to a large number of objects can be a very costly process. Clustering is the process of grouping the data into classes or clusters, so that objects within a cluster have high similarity in comparison to one another but are very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects. Often, distance measures are used. Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning.

In this chapter, we study the requirements of clustering methods for large amounts of data. We explain how to compute dissimilarities between objects represented by various attribute or variable types. We examine several clustering techniques, organized into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (such as frequent pattern–based methods), and constraint-based clustering. Clustering can also be used for outlier detection, which forms the final topic of this chapter.
7.1 What Is Cluster Analysis?

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: First partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.
Cluster analysis is an important human activity. Early in childhood, we learn how to distinguish between cats and dogs, or between animals and plants, by continuously improving subconscious clustering schemes. By automated clustering, we can identify dense and sparse regions in object space and, therefore, discover overall distribution patterns and interesting correlations among data attributes. Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations. Clustering may also help in the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value, and geographic location, as well as the identification of groups of automobile insurance policy holders with a high average claim cost. It can also be used to help classify documents on the Web for information discovery.

Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce. For example, exceptional cases in credit card transactions, such as very expensive and frequent purchases, may be of interest as possible fraudulent activity. As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features.
Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research.

As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, clustering is an example of unsupervised learning. Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples. In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research focus on the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases.
Clustering is a challenging field of research in which its potential applications pose their own special requirements. The following are typical requirements of clustering in data mining:
Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.

Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.
Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data. That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.
High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.
Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.
With these requirements in mind, our study of cluster analysis proceeds as follows. First, we study different types of data and how they can influence clustering methods. Second, we present a general categorization of clustering methods. We then study each clustering method in detail, including partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. We also examine clustering in high-dimensional space, constraint-based clustering, and outlier analysis.
7.2 Types of Data in Cluster Analysis

In this section, we study the types of data that often occur in cluster analysis and how to preprocess them for such an analysis. Suppose that a data set to be clustered contains n objects, which may represent persons, houses, documents, countries, and so on. Main memory-based clustering algorithms typically operate on either of the following two data structures.
Data matrix (or object-by-variable structure): This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects × p variables):

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix} \qquad (7.1)
$$
Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:

$$
\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix} \qquad (7.2)
$$
where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes larger the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, we have the matrix in (7.2). Measures of dissimilarity are discussed throughout this section.
The rows and columns of the data matrix represent different entities, while those of the dissimilarity matrix represent the same entity. Thus, the data matrix is often called a two-mode matrix, whereas the dissimilarity matrix is called a one-mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If the data are presented in the form of a data matrix, it can first be transformed into a dissimilarity matrix before applying such clustering algorithms.
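As a concrete illustration of this transformation, the following is a minimal sketch (not from the book) that builds a dissimilarity matrix from a data matrix using Euclidean distance; the function name and the NumPy dependency are my own choices.

```python
import numpy as np

def euclidean_dissimilarity_matrix(data):
    """Build an n-by-n dissimilarity matrix from an n-by-p data matrix,
    here using Euclidean distance as the (one of many possible) measure."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            d[i, j] = d[j, i] = np.sqrt(np.sum((data[i] - data[j]) ** 2))
    return d

# Example: 4 objects described by 2 interval-scaled variables
X = [[1.0, 2.0], [3.0, 5.0], [2.0, 2.0], [8.0, 1.0]]
print(euclidean_dissimilarity_matrix(X))
```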
In this section, we discuss how object dissimilarity can be computed for objects described by interval-scaled variables; by binary variables; by categorical, ordinal, and ratio-scaled variables; or combinations of these variable types. Nonmetric similarity between complex objects (such as documents) is also described. The dissimilarity data can later be used to compute clusters of objects.
7.2.1 Interval-Scaled Variables
This section discusses interval-scaled variables and their standardization. It then describes distance measures that are commonly used for computing the dissimilarity of objects described by such variables. These measures include the Euclidean, Manhattan, and Minkowski distances.

"What are interval-scaled variables?" Interval-scaled variables are continuous measurements of a roughly linear scale. Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature. The measurement unit used can affect the clustering analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to a very different clustering structure. In general, expressing a variable in smaller units will lead to a larger range for that variable, and thus a larger effect on the resulting clustering structure. To help avoid dependence on the choice of measurement units, the data should be standardized. Standardizing measurements attempts to give all variables an equal weight. This is particularly useful when given no prior knowledge of the data. However, in some applications, users may intentionally want to give more weight to a certain set of variables than to others. For example, when clustering basketball player candidates, we may prefer to give more weight to the variable height.
"How can the data for a variable be standardized?" To standardize measurements, one choice is to convert the original measurements to unitless variables. Given measurements for a variable f, this can be performed as follows.

1. Calculate the mean absolute deviation, s_f:

$$ s_f = \frac{1}{n}\big(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\big), \qquad (7.3) $$

where x_{1f}, ..., x_{nf} are n measurements of f, and m_f is the mean value of f, that is, m_f = (x_{1f} + x_{2f} + ... + x_{nf})/n.

2. Calculate the standardized measurement, or z-score:

$$ z_{if} = \frac{x_{if} - m_f}{s_f}. \qquad (7.4) $$

The mean absolute deviation, s_f, is more robust to outliers than the standard deviation. When computing the mean absolute deviation, the deviations from the mean (i.e., |x_{if} - m_f|) are not squared; hence, the effect of outliers is somewhat reduced. There are more robust measures of dispersion, such as the median absolute deviation. However, the advantage of using the mean absolute deviation is that the z-scores of outliers do not become too small; hence, the outliers remain detectable.

Standardization may or may not be useful in a particular application. Thus the choice of whether and how to perform standardization should be left to the user. Methods of standardization are also discussed in Chapter 2 under normalization techniques for data preprocessing.
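A small sketch of this standardization procedure, assuming NumPy; the function name and the example height values are illustrative choices of mine, not from the text.

```python
import numpy as np

def z_scores_mean_absolute_deviation(values):
    """Standardize one variable's measurements using the mean absolute
    deviation s_f (Eq. 7.3) rather than the standard deviation, then
    return the z-scores z_if = (x_if - m_f) / s_f (Eq. 7.4)."""
    x = np.asarray(values, dtype=float)
    m_f = x.mean()                   # mean value of the variable
    s_f = np.abs(x - m_f).mean()     # mean absolute deviation
    return (x - m_f) / s_f

# Hypothetical height measurements (in cm); the outlier stays detectable
print(z_scores_mean_absolute_deviation([170, 180, 165, 210]))
```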
After standardization, or without standardization in certain applications, the dissimilarity (or similarity) between the objects described by interval-scaled variables is typically computed based on the distance between each pair of objects. The most popular distance measure is Euclidean distance, which is defined as

$$ d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2}, \qquad (7.5) $$
where i = (x_{i1}, x_{i2}, ..., x_{in}) and j = (x_{j1}, x_{j2}, ..., x_{jn}) are two n-dimensional data objects.
Another well-known metric is Manhattan (or city block) distance, defined as

$$ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|. \qquad (7.6) $$
Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function:
1. d(i, j) ≥ 0: Distance is a nonnegative number.
2. d(i, i) = 0: The distance of an object to itself is 0.
3. d(i, j) = d(j, i): Distance is a symmetric function.
4. d(i, j) ≤ d(i, h) + d(h, j): Going directly from object i to object j in space is no more than making a detour over any other object h (triangular inequality).
Example 7.1 Euclidean distance and Manhattan distance. Let x1 = (1, 2) and x2 = (3, 5) represent two objects as in Figure 7.1. The Euclidean distance between the two is $\sqrt{2^2 + 3^2} = 3.61$. The Manhattan distance between the two is 2 + 3 = 5.
Minkowski distance is a generalization of both Euclidean distance and Manhattan distance. It is defined as

$$ d(i, j) = \big(|x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p\big)^{1/p}, \qquad (7.7) $$

where p is a positive integer. Such a distance is also called the L_p norm in some literature. It represents the Manhattan distance when p = 1 (i.e., the L_1 norm) and the Euclidean distance when p = 2 (i.e., the L_2 norm).
If each variable is assigned a weight according to its perceived importance, the weighted Euclidean distance can be computed as

$$ d(i, j) = \sqrt{w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2}. \qquad (7.8) $$

Weighting can also be applied to the Manhattan and Minkowski distances.
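These distance measures fold naturally into one small function; the sketch below is mine rather than the book's code, and the function name is assumed. Plugging in the two points of Example 7.1 reproduces the values 3.61 and 5.

```python
import numpy as np

def minkowski(x, y, p=2, weights=None):
    """Minkowski distance (Eq. 7.7); p=1 gives Manhattan (Eq. 7.6),
    p=2 gives Euclidean (Eq. 7.5). Optional weights as in Eq. 7.8."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.ones_like(x) if weights is None else np.asarray(weights, float)
    return (w * np.abs(x - y) ** p).sum() ** (1.0 / p)

x1, x2 = (1, 2), (3, 5)
print(round(minkowski(x1, x2, p=2), 2))  # 3.61  (Euclidean)
print(minkowski(x1, x2, p=1))            # 5.0   (Manhattan)
```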
7.2.2 Binary Variables
A binary variable has only two states: 0 or 1, where 0 means that the variable is absent, and 1 means that it is present. Given the variable smoker describing a patient, for instance, 1 indicates that the patient smokes, while 0 indicates that the patient does not. Treating binary variables as if they are interval-scaled can lead to misleading clustering results. Therefore, methods specific to binary data are necessary for computing dissimilarities.
"So, how can we compute the dissimilarity between two binary variables?" One approach involves computing a dissimilarity matrix from the given binary data. If all binary variables are thought of as having the same weight, we have the 2-by-2 contingency table of Table 7.1, where q is the number of variables that equal 1 for both objects i and j, r is the number of variables that equal 1 for object i but 0 for object j, s is the number of variables that equal 0 for object i but 1 for object j, and t is the number of variables that equal 0 for both objects i and j. The total number of variables is p, where p = q + r + s + t.

Table 7.1 A contingency table for binary variables.

                        object j
                        1        0        sum
object i       1        q        r        q + r
               0        s        t        s + t
             sum      q + s    r + t        p
"What is the difference between symmetric and asymmetric binary variables?" A binary variable is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1. One such example could be the attribute gender, having the states male and female. Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity. Its dissimilarity (or distance) measure, defined in Equation (7.9), can be used to assess the dissimilarity between objects i and j:

$$ d(i, j) = \frac{r + s}{q + r + s + t}. \qquad (7.9) $$
A binary variable is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a disease test. By convention, we shall code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative). Given two asymmetric binary variables, the agreement of two 1s (a positive match) is then considered more significant than that of two 0s (a negative match). Therefore, such binary variables are often considered "monary" (as if having one state). The dissimilarity based on such variables is called asymmetric binary dissimilarity, where the number of negative matches, t, is considered unimportant and thus is ignored in the computation, as shown in Equation (7.10):

$$ d(i, j) = \frac{r + s}{q + r + s}. \qquad (7.10) $$
Complementarily, we can measure the distance between two binary variables based on the notion of similarity instead of dissimilarity. For example, the asymmetric binary similarity between the objects i and j, or sim(i, j), can be computed as

$$ sim(i, j) = \frac{q}{q + r + s} = 1 - d(i, j). \qquad (7.11) $$

The coefficient sim(i, j) is called the Jaccard coefficient, which is widely used in the literature.
Example 7.2 Dissimilarity between binary variables. Suppose that a patient record table (Table 7.2) contains the attributes name, gender, fever, cough, test-1, test-2, test-3, and test-4, where name is an object identifier, gender is a symmetric attribute, and the remaining attributes are asymmetric binary.
Table 7.2 A relational table where patients are described by binary attributes.

name   gender   fever   cough   test-1   test-2   test-3   test-4
Jack     M        Y       N       P        N        N        N
Mary     F        Y       N       P        N        P        N
Jim      M        Y       P       N        N        N        N

For asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0. Suppose that the distance between objects (patients) is computed based only on the asymmetric variables. According to Equation (7.10), the distance between each pair of the three patients, Jack, Mary, and Jim, is

d(Jack, Mary) = (0 + 1)/(2 + 0 + 1) = 0.33,
d(Jack, Jim) = (1 + 1)/(1 + 1 + 1) = 0.67,
d(Jim, Mary) = (1 + 2)/(1 + 1 + 2) = 0.75.
These measurements suggest that Mary and Jim are unlikely to have a similar disease because they have the highest dissimilarity value among the three pairs. Of the three patients, Jack and Mary are the most likely to have a similar disease.
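A minimal sketch (mine, not the book's) of the asymmetric binary dissimilarity of Equation (7.10), applied to the three patients with Y and P coded as 1 and N as 0.

```python
def asymmetric_binary_dissimilarity(x, y):
    """Asymmetric binary dissimilarity (Eq. 7.10): negative matches (0,0)
    are ignored; only the counts q (1,1), r (1,0), and s (0,1) are used."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# fever, cough, test-1, test-2, test-3, test-4 coded as 1/0
jack, mary, jim = [1, 0, 1, 0, 0, 0], [1, 0, 1, 0, 1, 0], [1, 1, 0, 0, 0, 0]
print(round(asymmetric_binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))   # 0.75
```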
7.2.3 Categorical, Ordinal, and Ratio-Scaled Variables
“How can we compute the dissimilarity between objects described by categorical, ordinal, and ratio-scaled variables?”
Categorical Variables
A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map color is a categorical variable that may have, say, five states: red, yellow, green, pink, and blue.

Let the number of states of a categorical variable be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, ..., M. Notice that such integers are used just for data handling and do not represent any specific ordering.
"How is dissimilarity computed between objects described by categorical variables?" The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

$$ d(i, j) = \frac{p - m}{p}, \qquad (7.12) $$

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables. Weights can be assigned to increase the effect of m or to assign greater weight to the matches in variables having a larger number of states.
Example 7.3 Dissimilarity between categorical variables. Suppose that we have the sample data of Table 7.3, except that only the object-identifier and the variable (or attribute) test-1 are available, where test-1 is categorical. (We will use test-2 and test-3 in later examples.) Let's compute the dissimilarity matrix (7.2), that is,

$$
\begin{bmatrix}
0 & & & \\
d(2,1) & 0 & & \\
d(3,1) & d(3,2) & 0 & \\
d(4,1) & d(4,2) & d(4,3) & 0
\end{bmatrix}
$$

Since here we have one categorical variable, test-1, we set p = 1 in Equation (7.12) so that d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ. Thus we get

$$
\begin{bmatrix}
0 & & & \\
1 & 0 & & \\
1 & 1 & 0 & \\
0 & 1 & 1 & 0
\end{bmatrix}
$$

Table 7.3 A sample data table containing variables of mixed type.

object        test-1          test-2       test-3
identifier    (categorical)   (ordinal)    (ratio-scaled)
1             code-A          excellent    445
2             code-B          fair         22
3             code-C          good         164
4             code-A          excellent    1,210
Categorical variables can be encoded by asymmetric binary variables by creating a new binary variable for each of the M states. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0. For example, to encode the categorical variable map color, a binary variable can be created for each of the five colors listed above. For an object having the color yellow, the yellow variable is set to 1, while the remaining four variables are set to 0. The dissimilarity coefficient for this form of encoding can be calculated using the methods discussed in Section 7.2.2.
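The following sketch, with function names of my own choosing, shows both the simple-matching dissimilarity of Equation (7.12) and the asymmetric binary encoding of a categorical value described above.

```python
def categorical_dissimilarity(obj_i, obj_j):
    """Simple matching dissimilarity for categorical variables (Eq. 7.12):
    (p - m) / p, where m is the number of matching variables."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

def one_hot(value, states):
    """Encode one categorical value as asymmetric binary variables,
    one per state (e.g., the five map colors)."""
    return [1 if value == s else 0 for s in states]

print(categorical_dissimilarity(["yellow"], ["green"]))               # 1.0
print(one_hot("yellow", ["red", "yellow", "green", "pink", "blue"]))  # [0, 1, 0, 0, 0]
```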
Ordinal Variables
A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value are ordered in a meaningful sequence. Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively. For example, professional ranks are often enumerated in a sequential order, such as assistant, associate, and full for professors. A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential but their actual magnitude is not. For example, the relative ranking in a particular sport (e.g., gold, silver, bronze) is often more essential than the actual values of a particular measure. Ordinal variables may also be obtained from the discretization of interval-scaled quantities by splitting the value range into a finite number of classes. The values of an ordinal variable can be mapped to ranks. For example, suppose that an ordinal variable f has M_f states. These ordered states define the ranking 1, ..., M_f.
"How are ordinal variables handled?" The treatment of ordinal variables is quite similar to that of interval-scaled variables when computing the dissimilarity between objects. Suppose that f is a variable from a set of ordinal variables describing n objects. The dissimilarity computation with respect to f involves the following steps:

1. The value of f for the ith object is x_{if}, and f has M_f ordered states, representing the ranking 1, ..., M_f. Replace each x_{if} by its corresponding rank, r_{if} ∈ {1, ..., M_f}.

2. Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0, 1.0] so that each variable has equal weight. This can be achieved by replacing the rank r_{if} of the ith object in the f th variable by

$$ z_{if} = \frac{r_{if} - 1}{M_f - 1}. \qquad (7.13) $$

3. Dissimilarity can then be computed using any of the distance measures described in Section 7.2.1 for interval-scaled variables, using z_{if} to represent the f value for the ith object.
Example 7.4 Dissimilarity between ordinal variables. Suppose that we have the sample data of Table 7.3, except that this time only the object-identifier and the continuous ordinal variable, test-2, are available. There are three states for test-2, namely fair, good, and excellent; that is, M_f = 3. For step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively. Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0. For step 3, we can use, say, the Euclidean distance (Equation (7.5)), which results in the following dissimilarity matrix:

$$
\begin{bmatrix}
0 & & & \\
1.0 & 0 & & \\
0.5 & 0.5 & 0 & \\
0 & 1.0 & 0.5 & 0
\end{bmatrix}
$$
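A short sketch (not from the text) of the rank-and-normalize treatment of an ordinal variable, applied to the test-2 values of this example; the function name is assumed.

```python
def ordinal_to_interval(values, ordered_states):
    """Map ordinal values to ranks and normalize onto [0.0, 1.0]
    via z_if = (r_if - 1) / (M_f - 1) (Eq. 7.13)."""
    m_f = len(ordered_states)
    rank = {state: i + 1 for i, state in enumerate(ordered_states)}
    return [(rank[v] - 1) / (m_f - 1) for v in values]

test_2 = ["excellent", "fair", "good", "excellent"]
z = ordinal_to_interval(test_2, ["fair", "good", "excellent"])
print(z)                 # [1.0, 0.0, 0.5, 1.0]
print(abs(z[0] - z[1]))  # 1.0 -- the Euclidean distance in one dimension
```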
Ratio-Scaled Variables

A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula

$$ A e^{Bt} \quad \text{or} \quad A e^{-Bt}, \qquad (7.14) $$

where A and B are positive constants, and t typically represents time. Common examples include the growth of a bacteria population or the decay of a radioactive element.
"How can I compute the dissimilarity between objects described by ratio-scaled variables?" There are three methods to handle ratio-scaled variables for computing the dissimilarity between objects.
Treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a good choice since it is likely that the scale may be distorted.

Apply logarithmic transformation to a ratio-scaled variable f having value x_{if} for object i by using the formula y_{if} = log(x_{if}). The y_{if} values can be treated as interval-valued, as described in Section 7.2.1. Notice that for some ratio-scaled variables, log-log or other transformations may be applied, depending on the variable's definition and the application.

Treat x_{if} as continuous ordinal data and treat their ranks as interval-valued.

The latter two methods are the most effective, although the choice of method used may depend on the given application.
Example 7.5 Dissimilarity between ratio-scaled variables. This time, we have the sample data of Table 7.3, except that only the object-identifier and the ratio-scaled variable, test-3, are available. Let's try a logarithmic transformation. Taking the log of test-3 results in the values 2.65, 1.34, 2.21, and 3.08 for the objects 1 to 4, respectively. Using the Euclidean distance (Equation (7.5)) on the transformed values, we obtain the following dissimilarity matrix:

$$
\begin{bmatrix}
0 & & & \\
1.31 & 0 & & \\
0.44 & 0.87 & 0 & \\
0.43 & 1.74 & 0.87 & 0
\end{bmatrix}
$$
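A sketch of the logarithmic-transformation approach; the raw test-3 values below are chosen so that their base-10 logarithms match the 2.65, 1.34, 2.21, and 3.08 used in the example, and the function name is my own.

```python
import numpy as np

def log_transform_dissimilarity(values):
    """Log-transform ratio-scaled values (y_if = log10(x_if)) and build the
    pairwise Euclidean dissimilarity matrix on the transformed values."""
    y = np.log10(np.asarray(values, dtype=float))
    return np.abs(y[:, None] - y[None, :])   # single variable: |y_i - y_j|

test_3 = [445, 22, 164, 1210]   # logs are about 2.65, 1.34, 2.21, 3.08
print(np.round(log_transform_dissimilarity(test_3), 2))
```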
7.2.4 Variables of Mixed Types
Sections 7.2.1 to 7.2.3 discussed how to compute the dissimilarity between objects described by variables of the same type, where these types may be either interval-scaled, symmetric binary, asymmetric binary, categorical, ordinal, or ratio-scaled. However, in many real databases, objects are described by a mixture of variable types. In general, a database can contain all of the six variable types listed above.
"So, how can we compute the dissimilarity between objects of mixed variable types?" One approach is to group each kind of variable together, performing a separate cluster analysis for each variable type. This is feasible if these analyses derive compatible results. However, in real applications, it is unlikely that a separate cluster analysis per variable type will generate compatible results.

A more preferable approach is to process all variable types together, performing a single cluster analysis. One such technique combines the different variables into a single dissimilarity matrix, bringing all of the meaningful variables onto a common scale of the interval [0.0, 1.0].
Suppose that the data set contains p variables of mixed type. The dissimilarity d(i, j) between objects i and j is defined as

$$ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}, \qquad (7.15) $$

where the indicator δ_{ij}^{(f)} = 0 if either (1) x_{if} or x_{jf} is missing (i.e., there is no measurement of variable f for object i or object j), or (2) x_{if} = x_{jf} = 0 and variable f is asymmetric binary; otherwise, δ_{ij}^{(f)} = 1. The contribution of variable f to the dissimilarity between i and j, that is, d_{ij}^{(f)}, is computed dependent on its type:

If f is interval-based: d_{ij}^{(f)} = |x_{if} - x_{jf}| / (max_h x_{hf} - min_h x_{hf}), where h runs over all nonmissing objects for variable f.

If f is binary or categorical: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}; otherwise d_{ij}^{(f)} = 1.

If f is ordinal: compute the ranks r_{if} and z_{if} = (r_{if} - 1)/(M_f - 1), and treat z_{if} as interval-scaled.

If f is ratio-scaled: either perform logarithmic transformation and treat the transformed data as interval-scaled; or treat f as continuous ordinal data, compute r_{if} and z_{if}, and then treat z_{if} as interval-scaled.

The above steps are identical to what we have already seen for each of the individual variable types. The only difference is for interval-based variables, where here we normalize so that the values map to the interval [0.0, 1.0]. Thus, the dissimilarity between objects can be computed even when the variables describing the objects are of different types.
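A possible implementation sketch of Equation (7.15); the function signature, the dictionary parameters for ranges and numbers of states, and the state labels in the usage example are assumptions of mine rather than the book's notation.

```python
def mixed_dissimilarity(obj_i, obj_j, var_types, ranges=None, m_states=None):
    """Mixed-type dissimilarity (Eq. 7.15). Each applicable variable (delta = 1)
    contributes a value in [0, 1]; contributions are averaged. var_types[f] is
    one of 'interval', 'categorical', 'ordinal' (values given as ranks), or
    'asym_binary'."""
    num, den = 0.0, 0.0
    for f, t in enumerate(var_types):
        x, y = obj_i[f], obj_j[f]
        if x is None or y is None:                     # missing measurement
            continue
        if t == "asym_binary" and x == 0 and y == 0:   # negative match ignored
            continue
        if t == "interval":
            lo, hi = ranges[f]
            d = abs(x - y) / (hi - lo)
        elif t == "ordinal":
            m_f = m_states[f]
            d = abs((x - 1) / (m_f - 1) - (y - 1) / (m_f - 1))
        else:                                          # categorical / binary
            d = 0.0 if x == y else 1.0
        num += d
        den += 1
    return num / den if den else 0.0

# Objects 1 and 2 of Table 7.3: test-1 categorical, test-2 as ordinal ranks
# (M_f = 3), test-3 log-transformed and treated as interval on [1.34, 3.08]
types = ["categorical", "ordinal", "interval"]
obj1 = ["code-A", 3, 2.65]
obj2 = ["code-B", 1, 1.34]
print(round(mixed_dissimilarity(obj1, obj2, types,
                                ranges={2: (1.34, 3.08)},
                                m_states={1: 3}), 2))  # 0.92
```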
Example 7.6 Dissimilarity between variables of mixed type. Let's compute a dissimilarity matrix for the objects of Table 7.3. Now we will consider all of the variables, which are of different types. In Examples 7.3 to 7.5, we worked out the dissimilarity matrices for each of the individual variables. The procedures we followed for test-1 (which is categorical) and test-2 (which is ordinal) are the same as outlined above for processing variables of mixed types. Therefore, we can use the dissimilarity matrices obtained for test-1 and test-2 later when we compute Equation (7.15). First, however, we need to complete some work for test-3 (which is ratio-scaled). We have already applied a logarithmic transformation to its values. Based on the transformed values of 2.65, 1.34, 2.21, and 3.08 obtained for the objects 1 to 4, respectively, we let max_h x_h = 3.08 and min_h x_h = 1.34. We then normalize the values in the dissimilarity matrix obtained in Example 7.5 by dividing each one by (3.08 − 1.34) = 1.74. This results in the following dissimilarity matrix for test-3:

$$
\begin{bmatrix}
0 & & & \\
0.75 & 0 & & \\
0.25 & 0.50 & 0 & \\
0.25 & 1.00 & 0.50 & 0
\end{bmatrix}
$$
We can now use the dissimilarity matrices for the three variables in our computation of Equation (7.15). For example, we get d(2, 1) = (1(1) + 1(1) + 1(0.75))/3 = 0.92. The resulting dissimilarity matrix obtained for the data described by the three variables of mixed types is:

$$
\begin{bmatrix}
0 & & & \\
0.92 & 0 & & \\
0.58 & 0.67 & 0 & \\
0.08 & 1.00 & 0.67 & 0
\end{bmatrix}
$$

If we go back and look at Table 7.3, we can intuitively guess that objects 1 and 4 are the most similar, based on their values for test-1 and test-2. This is confirmed by the dissimilarity matrix, where d(4, 1) is the lowest value for any pair of different objects. Similarly, the matrix indicates that objects 2 and 4 are the least similar.
7.2.5 Vector Objects
In some applications, such as information retrieval, text document clustering, and biological taxonomy, we need to compare and cluster complex objects (such as documents) containing a large number of symbolic entities (such as keywords and phrases). To measure the distance between complex objects, it is often desirable to abandon traditional metric distance computation and introduce a nonmetric similarity function.

There are several ways to define such a similarity function, s(x, y), to compare two vectors x and y. One popular way is to define the similarity function as a cosine measure:

$$ s(x, y) = \frac{x^{t} \cdot y}{\|x\|\,\|y\|}, \qquad (7.16) $$

where x^t is a transposition of vector x, and ||x|| is the Euclidean norm of vector x, which conceptually is the length of the vector.
When variables are binary-valued (0 or 1), the above similarity function can be interpreted in terms of shared features and attributes. Suppose an object x possesses the ith attribute if x_i = 1. Then x^t · y is the number of attributes possessed by both x and y, and ||x|| ||y|| is the geometric mean of the number of attributes possessed by x and the number possessed by y. Thus s(x, y) is a measure of relative possession of common attributes.
Example 7.7 Nonmetric similarity between two objects using cosine. Suppose we are given two vectors, x = (1, 1, 0, 0) and y = (0, 1, 1, 0). By Equation (7.16), the similarity between x and y is

$$ s(x, y) = \frac{0 + 1 + 0 + 0}{\sqrt{2}\,\sqrt{2}} = 0.5. $$

A simple variation of the above measure is

$$ s(x, y) = \frac{x^{t} \cdot y}{x^{t} \cdot x + y^{t} \cdot y - x^{t} \cdot y}, \qquad (7.17) $$

which is known as the Tanimoto coefficient, or Tanimoto distance, and is frequently used in information retrieval and biology taxonomy.
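Both measures are easy to compute directly; the following sketch (function names mine) reproduces the 0.5 of Example 7.7 and also evaluates the Tanimoto coefficient for the same vectors.

```python
import math

def cosine_similarity(x, y):
    """Cosine measure (Eq. 7.16) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def tanimoto(x, y):
    """Tanimoto coefficient: dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

x, y = (1, 1, 0, 0), (0, 1, 1, 0)
print(cosine_similarity(x, y))   # 0.5, as in Example 7.7
print(round(tanimoto(x, y), 3))  # 0.333
```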
Notice that there are many ways to select a particular similarity (or distance) function or normalize the data for cluster analysis. There is no universal standard to guide such selection. The appropriate selection of such measures will heavily depend on the given application. One should bear this in mind and refine the selection of such measures to ensure that the clusters generated are meaningful and useful for the application at hand.
7.3 A Categorization of Major Clustering Methods

Many clustering algorithms exist in the literature. It is difficult to provide a crisp categorization of clustering methods because these categories may overlap, so that a method may have features from several categories. Nevertheless, it is useful to present a relatively organized picture of the different clustering methods.

In general, the major clustering methods can be classified into the following categories.
Partitioning methods: Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Notice that the second requirement can be relaxed in some fuzzy partitioning techniques. References to such techniques are given in the bibliographic notes.

Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects of different clusters are "far apart" or very different. There are various kinds of other criteria for judging the quality of partitions.

To achieve global optimality in partitioning-based clustering would require the exhaustive enumeration of all of the possible partitions. Instead, most applications adopt one of a few popular heuristic methods, such as (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster. These heuristic clustering methods work well for finding spherical-shaped clusters in small to medium-sized databases (a minimal k-means sketch is given after this list of methods). To find clusters with complex shapes and for clustering very large data sets, partitioning-based methods need to be extended. Partitioning-based clustering methods are studied in depth in Section 7.4.
Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.

Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices. However, such techniques cannot correct erroneous decisions. There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object "linkages" at each hierarchical partitioning, such as in Chameleon, or (2) integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation, as in BIRCH. Hierarchical clustering methods are studied in Section 7.5.
Density-based methods: Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the "neighborhood" exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.

DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions. Density-based clustering methods are studied in Section 7.6.
Grid-based methods: Grid-based methods quantize the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure (i.e., on the quantized space). The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space.

STING is a typical example of a grid-based method. WaveCluster applies wavelet transformation for clustering analysis and is both grid-based and density-based. Grid-based clustering methods are studied in Section 7.7.
Model-based methods: Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. It also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods.

EM is an algorithm that performs expectation-maximization analysis based on statistical modeling. COBWEB is a conceptual learning algorithm that performs probability analysis and takes concepts as a model for clusters. SOM (or self-organizing feature map) is a neural network-based algorithm that clusters by mapping high-dimensional data into a 2-D or 3-D feature map, which is also useful for data visualization. Model-based clustering methods are studied in Section 7.8.
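As referenced in the discussion of partitioning methods above, here is a bare-bones sketch of the k-means idea (Lloyd's algorithm). It is an illustration of mine rather than the algorithms studied later in the chapter, and the function name and sample points are made up.

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """A bare-bones k-means: each cluster is represented by the mean of its
    members; points are repeatedly reassigned to the nearest mean and the
    means are recomputed (iterative relocation)."""
    random.seed(seed)
    centers = random.sample(points, k)                 # initial partitioning
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                               # relocation step
            idx = min(range(k), key=lambda c: sum((a - b) ** 2
                      for a, b in zip(p, centers[c])))
            clusters[idx].append(p)
        for c, members in enumerate(clusters):         # recompute the means
            if members:
                centers[c] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return centers, clusters

data = [(1, 1), (1.5, 2), (8, 8), (9, 9), (0.5, 1.2)]
centers, clusters = k_means(data, k=2)
print(centers)
```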
high-The choice of clustering algorithm depends both on the type of data available and onthe particular purpose of the application If cluster analysis is used as a descriptive orexploratory tool, it is possible to try several algorithms on the same data to see what thedata may disclose
Some clustering algorithms integrate the ideas of several clustering methods, so that
it is sometimes difficult to classify a given algorithm as uniquely belonging to only oneclustering method category Furthermore, some applications may have clustering criteriathat require the integration of several clustering techniques
Aside from the above categories of clustering methods, there are two classes of clustering tasks that require special attention. One is clustering high-dimensional data, and the other is constraint-based clustering.
Clustering high-dimensional data is a particularly important task in cluster analysis
because many applications require the analysis of objects containing a large