The cost associated with a false negative (such as incorrectly predicting that a cancerous patient is not cancerous) is far greater than that of a false positive (incorrectly yet conservatively labeling a noncancerous patient as cancerous). In such cases, we can outweigh one type of error over another by assigning a different cost to each. These costs may consider the danger to the patient, financial costs of resulting therapies, and other hospital costs. Similarly, the benefits associated with a true positive decision may be different than those of a true negative. Up to now, to compute classifier accuracy, we have assumed equal costs and essentially divided the sum of true positives and true negatives by the total number of test tuples. Alternatively, we can incorporate costs and benefits by instead computing the average cost (or benefit) per decision. Other applications involving cost-benefit analysis include loan application decisions and target marketing mailouts. For example, the cost of loaning to a defaulter greatly exceeds that of the lost business incurred by denying a loan to a nondefaulter. Similarly, in an application that tries to identify households that are likely to respond to mailouts of certain promotional material, the cost of mailouts to numerous households that do not respond may outweigh the cost of lost business from not mailing to households that would have responded. Other costs to consider in the overall analysis include the costs to collect the data and to develop the classification tool.
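To make the cost-benefit idea concrete, here is a minimal sketch (not part of the original text) of computing the average cost per decision from a confusion matrix; the counts and the cost values below are purely hypothetical.

```python
# Confusion matrix counts for a hypothetical cancer-screening test set.
TP, FN = 90, 10        # true positives, false negatives (missed cancer cases)
FP, TN = 140, 9760     # false positives, true negatives
total = TP + FN + FP + TN

# Assumed costs per decision; correct decisions are given cost 0 here.
COST_FN = 100.0        # a missed cancerous patient is assumed far more costly ...
COST_FP = 1.0          # ... than a conservative false alarm

accuracy = (TP + TN) / total
avg_cost = (FN * COST_FN + FP * COST_FP) / total
print(f"accuracy = {accuracy:.4f}, average cost per decision = {avg_cost:.4f}")
```

Two classifiers with similar accuracy can thus differ widely in average cost once unequal error costs are taken into account.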
“Are there other cases where accuracy may not be appropriate?” In classification problems, it is commonly assumed that all tuples are uniquely classifiable, that is, that each training tuple can belong to only one class. Yet, owing to the wide diversity of data in large databases, it is not always reasonable to assume that all tuples are uniquely classifiable. Rather, it is more probable to assume that each tuple may belong to more than one class. How then can the accuracy of classifiers on large databases be measured? The accuracy measure is not appropriate, because it does not take into account the possibility of tuples belonging to more than one class.
Rather than returning a class label, it is useful to return a probability class distribution. Accuracy measures may then use a second guess heuristic, whereby a class prediction is judged as correct if it agrees with the first or second most probable class. Although this does take into consideration, to some degree, the nonunique classification of tuples, it is not a complete solution.
6.12.2 Predictor Error Measures
“How can we measure predictor accuracy?” Let D_T be a test set of the form (X_1, y_1), (X_2, y_2), ..., (X_d, y_d), where the X_i are the n-dimensional test tuples with associated known values, y_i, for a response variable, y, and d is the number of tuples in D_T. Since predictors return a continuous value rather than a categorical label, it is difficult to say exactly whether the predicted value, y'_i, for X_i is correct. Instead of focusing on whether y'_i is an “exact” match with y_i, we instead look at how far off the predicted value is from the actual known value. Loss functions measure the error between y_i and the predicted value, y'_i. The most common loss functions are

Absolute error: |y_i − y'_i|

Squared error: (y_i − y'_i)^2
Based on the above, the test error (rate), or generalization error, is the average loss over the test set. Thus, we get the following error rates.

Mean absolute error: (1/d) Σ_{i=1}^{d} |y_i − y'_i|

Mean squared error: (1/d) Σ_{i=1}^{d} (y_i − y'_i)^2

The mean squared error exaggerates the presence of outliers, while the mean absolute error does not. If we take the square root of the mean squared error, the resulting error measure is called the root mean squared error. This is useful in that it allows the error measured to be of the same magnitude as the quantity being predicted.

Sometimes, we may want the error to be relative to what it would have been if we had just predicted ȳ, the mean value for y from the training data, D. That is, we can normalize the total loss by dividing by the total loss incurred from always predicting the mean. Relative measures of error include:

Relative absolute error: Σ_{i=1}^{d} |y_i − y'_i| / Σ_{i=1}^{d} |y_i − ȳ|

Relative squared error: Σ_{i=1}^{d} (y_i − y'_i)^2 / Σ_{i=1}^{d} (y_i − ȳ)^2

where ȳ is the mean value of the y_i's of the training data. We can take the root of the relative squared error to obtain the root relative squared error so that the resulting error is of the same magnitude as the quantity predicted.

In practice, the choice of error measure does not greatly affect prediction model selection.
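As an illustration, the error measures above can be computed directly from a list of known values and predictions. The Python sketch below is an assumption of this edit (the text itself prescribes no implementation); y_train_mean stands for ȳ, the mean of y over the training data.

```python
from math import sqrt

def error_measures(y_true, y_pred, y_train_mean):
    """Return (MAE, RMSE, relative absolute error, root relative squared error)."""
    d = len(y_true)
    abs_err = [abs(y - yp) for y, yp in zip(y_true, y_pred)]
    sq_err = [(y - yp) ** 2 for y, yp in zip(y_true, y_pred)]
    mae = sum(abs_err) / d                               # mean absolute error
    rmse = sqrt(sum(sq_err) / d)                         # root mean squared error
    # Normalize by the loss incurred from always predicting the training mean.
    rae = sum(abs_err) / sum(abs(y - y_train_mean) for y in y_true)
    rrse = sqrt(sum(sq_err) / sum((y - y_train_mean) ** 2 for y in y_true))
    return mae, rmse, rae, rrse

# Small, made-up example.
print(error_measures([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0], y_train_mean=4.5))
```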
6.13 Evaluating the Accuracy of a Classifier or Predictor

How can we use the above measures to obtain a reliable estimate of classifier accuracy (or predictor accuracy in terms of error)? Holdout, random subsampling, cross-validation, and the bootstrap are common techniques for assessing accuracy based on randomly sampled partitions of the given data.
Figure 6.29 Estimating accuracy with the holdout method: the given data are randomly partitioned into a training set, used to derive the model, and a test set, used to estimate the model's accuracy.
The use of such techniques to estimate accuracy increases the overall computation time, yet is useful for model selection.
6.13.1 Holdout Method and Random Subsampling

The holdout method is what we have alluded to so far in our discussions about accuracy. In this method, the given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set. The training set is used to derive the model, whose accuracy is estimated with the test set (Figure 6.29). The estimate is pessimistic because only a portion of the initial data is used to derive the model.
Random subsampling is a variation of the holdout method in which the holdout method is repeated k times. The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration. (For prediction, we can take the average of the predictor error rates.)
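A minimal sketch of the holdout method and random subsampling follows, assuming a classifier object with scikit-learn style fit()/predict() methods; the function names and the 2/3 split are illustrative, not prescribed by the text.

```python
import random

def holdout_accuracy(model, X, y, train_frac=2/3, seed=None):
    """Estimate accuracy with a single random holdout split."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    model.fit([X[i] for i in train], [y[i] for i in train])
    pred = model.predict([X[i] for i in test])
    return sum(p == y[i] for p, i in zip(pred, test)) / len(test)

def random_subsampling_accuracy(model, X, y, k=10):
    """Repeat the holdout method k times and average the accuracies."""
    return sum(holdout_accuracy(model, X, y, seed=i) for i in range(k)) / k
```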
6.13.2 Cross-validation
In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or “folds,” D_1, D_2, ..., D_k, each of approximately equal size. Training and testing is performed k times. In iteration i, partition D_i is reserved as the test set, and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D_2, ..., D_k collectively serve as the training set in order to obtain a first model, which is tested on D_1; the second iteration is trained on subsets D_1, D_3, ..., D_k and tested on D_2; and so on. Unlike the holdout and random subsampling methods above, here each sample is used the same number of times for training and once for testing. For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data. For prediction, the error estimate can be computed as the total loss from the k iterations, divided by the total number of initial tuples.
Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial tuples. That is, only one sample is “left out” at a time for the test set. In stratified cross-validation, the folds are stratified so that the class distribution of the tuples in each fold is approximately the same as that in the initial data.
In general, stratified 10-fold cross-validation is recommended for estimating accuracy (even if computation power allows using more folds) due to its relatively low bias and variance.
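For example, stratified 10-fold cross-validation can be run with scikit-learn, an illustrative library choice rather than one used by the text; averaging the per-fold accuracies approximates the overall estimate described above when the folds are of roughly equal size.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)  # stratified folds
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=skf)
print("per-fold accuracy:", scores)
print("overall estimate :", scores.mean())
```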
6.13.3 Bootstrap
Unlike the accuracy estimation methods mentioned above, the bootstrap method samples the given training tuples uniformly with replacement. That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set. For instance, imagine a machine that randomly selects tuples for our training set. In sampling with replacement, the machine is allowed to select the same tuple more than once.
There are several bootstrap methods. A commonly used one is the .632 bootstrap, which works as follows. Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a bootstrap sample or training set of d samples. It is very likely that some of the original data tuples will occur more than once in this sample. The data tuples that did not make it into the training set end up forming the test set. Suppose we were to try this out several times. As it turns out, on average, 63.2% of the original data tuples will end up in the bootstrap, and the remaining 36.8% will form the test set (hence the name, .632 bootstrap).
“Where does the figure, 63.2%, come from?” Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is (1 − 1/d). We have to select d times, so the probability that a tuple will not be chosen during this whole time is (1 − 1/d)^d. If d is large, the probability approaches e^(−1) = 0.368. Thus, 36.8% of tuples will not be selected for training and thereby end up in the test set, and the remaining 63.2% will form the training set.
We can repeat the sampling procedure k times, where in each iteration, we use the current test set to obtain an accuracy estimate of the model obtained from the current bootstrap sample. The overall accuracy of the model is then estimated as
Acc(M) = Σ_{i=1}^{k} (0.632 × Acc(M_i)_{test set} + 0.368 × Acc(M_i)_{train set}),    (6.65)
where Acc(M_i)_{test set} is the accuracy of the model obtained with bootstrap sample i when it is applied to test set i. Acc(M_i)_{train set} is the accuracy of the model obtained with bootstrap sample i when it is applied to the original set of data tuples. The bootstrap method works well with small data sets.
Figure 6.30 Increasing model accuracy: bagging and boosting each generate a set of classification or prediction models, M_1, M_2, ..., M_k. Voting strategies are used to combine the predictions for a given new data sample (unknown tuple).
6.14 Ensemble Methods—Increasing the Accuracy

In Section 6.3.3, we saw how pruning can be applied to decision tree induction to help improve the accuracy of the resulting decision trees. Are there general strategies for improving classifier and predictor accuracy?
The answer is yes. Bagging and boosting are two such techniques (Figure 6.30). They are examples of ensemble methods, or methods that use a combination of models. Each combines a series of k learned models (classifiers or predictors), M_1, M_2, ..., M_k, with the aim of creating an improved composite model, M∗. Both bagging and boosting can be used for classification as well as prediction.
6.14.1 Bagging
We first take an intuitive look at how bagging works as a method of increasing accuracy. For ease of explanation, we will assume at first that our model is a classifier. Suppose that you are a patient and would like to have a diagnosis made based on your symptoms. Instead of asking one doctor, you may choose to ask several. If a certain diagnosis occurs more than any of the others, you may choose this as the final or best diagnosis. That is, the final diagnosis is made based on a majority vote, where each doctor gets an equal vote. Now replace each doctor by a classifier, and you have the basic idea behind bagging. Intuitively, a majority vote made by a large group of doctors may be more reliable than a majority vote made by a small group.
Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set, D_i, of d tuples is sampled with replacement from the original set of tuples, D. (Note that the term bagging stands for bootstrap aggregation.) Each training set is a bootstrap sample, as described in Section 6.13.3.
Algorithm: Bagging. The bagging algorithm—create an ensemble of models (classifiers or predictors) for a learning scheme where each model gives an equally weighted prediction.

Input:
D, a set of d training tuples;
k, the number of models in the ensemble;
a learning scheme (e.g., decision tree algorithm, backpropagation, etc.)

Output: A composite model, M∗.

Method:
(1) for i = 1 to k do // create k models:
(2)   create bootstrap sample, D_i, by sampling D with replacement;
(3)   use D_i to derive a model, M_i;
(4) endfor

To use the composite model on a tuple, X:
(1) if classification then
(2)   let each of the k models classify X and return the majority vote;
(3) if prediction then
(4)   let each of the k models predict a value for X and return the average predicted value;

Figure 6.31 Bagging.
Because sampling with replacement is used, some of the original tuples of D may not be included in D_i, whereas others may occur more than once. A classifier model, M_i, is learned for each training set, D_i. To classify an unknown tuple, X, each classifier, M_i, returns its class prediction, which counts as one vote. The bagged classifier, M∗, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple. The algorithm is summarized in Figure 6.31.

The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data. It will not be considerably worse and is more robust to the effects of noisy data. The increased accuracy occurs because the composite model reduces the variance of the individual classifiers. For prediction, it was theoretically proven that a bagged predictor will always have improved accuracy over a single predictor derived from D.
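A minimal sketch of bagging with majority voting is shown below; the use of scikit-learn decision trees as the base learner and the value k = 25 are illustrative choices, not prescribed by the text.

```python
import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    """Learn k classifiers, each from a bootstrap sample D_i of the training data."""
    rng = random.Random(seed)
    d, models = len(X), []
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]   # bootstrap sample D_i
        model = DecisionTreeClassifier()
        model.fit([X[i] for i in idx], [y[i] for i in idx])
        models.append(model)
    return models

def bagging_predict(models, x):
    """Each model gets one equal vote; return the class with the most votes."""
    votes = [m.predict([x])[0] for m in models]
    return Counter(votes).most_common(1)[0][0]
```

For prediction of continuous values, the majority vote in bagging_predict would simply be replaced by the average of the individual predictions.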
6.14.2 Boosting
We now look at the ensemble method of boosting. As in the previous section, suppose that as a patient, you have certain symptoms. Instead of consulting one doctor, you choose to consult several. Suppose you assign weights to the value or worth of each doctor's diagnosis, based on the accuracies of previous diagnoses they have made. The final diagnosis is then a combination of the weighted diagnoses. This is the essence behind boosting.
In boosting, weights are assigned to each training tuple. A series of k classifiers is iteratively learned. After a classifier M_i is learned, the weights are updated to allow the subsequent classifier, M_{i+1}, to “pay more attention” to the training tuples that were misclassified by M_i. The final boosted classifier, M∗, combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy. The boosting algorithm can be extended for the prediction of continuous values.
Adaboost is a popular boosting algorithm. Suppose we would like to boost the accuracy of some learning method. We are given D, a data set of d class-labeled tuples, (X_1, y_1), (X_2, y_2), ..., (X_d, y_d), where y_i is the class label of tuple X_i. Initially, Adaboost assigns each training tuple an equal weight of 1/d. Generating k classifiers for the ensemble requires k rounds through the rest of the algorithm. In round i, the tuples from D are sampled to form a training set, D_i, of size d. Sampling with replacement is used—the same tuple may be selected more than once. Each tuple's chance of being selected is based on its weight. A classifier model, M_i, is derived from the training tuples of D_i. Its error is then calculated using D_i as a test set. The weights of the training tuples are then adjusted according to how they were classified. If a tuple was incorrectly classified, its weight is increased. If a tuple was correctly classified, its weight is decreased. A tuple's weight reflects how hard it is to classify—the higher the weight, the more often it has been misclassified. These weights will be used to generate the training samples for the classifier of the next round. The basic idea is that when we build a classifier, we want it to focus more on the misclassified tuples of the previous round. Some classifiers may be better at classifying some “hard” tuples than others. In this way, we build a series of classifiers that complement each other. The algorithm is summarized in Figure 6.32.
Now, let's look at some of the math that's involved in the algorithm. To compute the error rate of model M_i, we sum the weights of each of the tuples in D_i that M_i misclassified. That is,

error(M_i) = Σ_{j=1}^{d} w_j × err(X_j),    (6.66)

where w_j is the weight of tuple X_j and err(X_j) is the misclassification error of X_j: if the tuple was misclassified, then err(X_j) is 1; otherwise, it is 0. If the performance of classifier M_i is so poor that its error exceeds 0.5, then we abandon it. Instead, we try again by generating a new D_i training set, from which we derive a new M_i.

The error rate of M_i affects how the weights of the training tuples are updated. If a tuple in round i was correctly classified, its weight is multiplied by error(M_i)/(1 − error(M_i)). Once the weights of all of the correctly classified tuples are updated, the weights for all tuples (including the misclassified ones) are normalized so that their sum remains the same as it was before. To normalize a weight, we multiply it by the sum of the old weights, divided by the sum of the new weights. As a result, the weights of misclassified tuples are increased and the weights of correctly classified tuples are decreased, as described above.
“Once boosting is complete, how is the ensemble of classifiers used to predict the class label of a tuple, X?”
Algorithm: Adaboost. A boosting algorithm—create an ensemble of classifiers. Each one gives a weighted vote.

Input:
D, a set of d class-labeled training tuples;
k, the number of rounds (one classifier is generated per round);
a classification learning scheme.

Output: A composite model.

Method:
(1) initialize the weight of each tuple in D to 1/d;
(2) for i = 1 to k do // for each round:
(3)   sample D with replacement according to the tuple weights to obtain D_i;
(4)   use training set D_i to derive a model, M_i;
(5)   compute error(M_i), the error rate of M_i (Equation 6.66);
(6)   if error(M_i) > 0.5 then
(7)     reinitialize the weights to 1/d;
(8)     go back to step 3 and try again;
(9)   endif
(10)  for each tuple in D_i that was correctly classified do
(11)    multiply the weight of the tuple by error(M_i)/(1 − error(M_i)); // update weights
(12)  normalize the weight of each tuple;
(13) endfor

To use the composite model to classify tuple, X:
(1) initialize the weight of each class to 0;
(2) for i = 1 to k do // for each classifier:
(3)   w_i = log((1 − error(M_i))/error(M_i)); // weight of the classifier's vote
(4)   c = class prediction for X from M_i;
(5)   add w_i to the weight for class c;
(6) endfor
(7) return the class with the largest weight;

Figure 6.32 Adaboost, a boosting algorithm.
Unlike bagging, where each classifier was assigned an equal vote, boosting assigns a weight to each classifier's vote, based on how well the classifier performed. The lower a classifier's error rate, the more accurate it is, and therefore, the higher its weight for voting should be. The weight of classifier M_i's vote is

log((1 − error(M_i)) / error(M_i)).

For each class, c, we sum the weights of each classifier that assigned class c to X. The class with the highest sum is the “winner” and is returned as the class prediction for tuple X.
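The sketch below follows the Adaboost procedure described above, with one simplification: instead of resampling D according to the tuple weights (steps 3 and 4 of Figure 6.32), it passes the weights directly to a base learner that accepts sample_weight, a common practical shortcut. The choice of base learner and all names are illustrative.

```python
import math
from collections import defaultdict
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=10):
    d = len(X)
    w = [1.0 / d] * d                                  # step (1): equal initial weights
    ensemble = []                                      # list of (model, vote weight) pairs
    for _ in range(k):
        model = DecisionTreeClassifier(max_depth=1)    # a weak base learner ("stump")
        model.fit(X, y, sample_weight=w)               # weighted fit instead of resampling D_i
        pred = model.predict(X)
        error = sum(wj for wj, p, yj in zip(w, pred, y) if p != yj)  # Equation (6.66)
        if error >= 0.5 or error == 0.0:               # stop (the text re-samples and retries)
            break
        for j in range(d):                             # shrink weights of correct tuples
            if pred[j] == y[j]:
                w[j] *= error / (1.0 - error)
        s = sum(w)
        w = [wj / s for wj in w]                       # renormalize the weights
        ensemble.append((model, math.log((1.0 - error) / error)))  # classifier's vote weight
    return ensemble

def adaboost_predict(ensemble, x):
    class_weight = defaultdict(float)
    for model, vote in ensemble:
        class_weight[model.predict([x])[0]] += vote    # sum the vote weights per class
    return max(class_weight, key=class_weight.get)     # class with the largest total weight
```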
“How does boosting compare with bagging?” Because of the way boosting focuses on the misclassified tuples, it risks overfitting the resulting composite model to such data. Therefore, sometimes the resulting “boosted” model may be less accurate than a single model derived from the same data. Bagging is less susceptible to model overfitting. While both can significantly improve accuracy in comparison to a single model, boosting tends to achieve greater accuracy.
6.15 Model Selection

Suppose that we have generated two models, M_1 and M_2 (for either classification or prediction), from our data. We have performed 10-fold cross-validation to obtain a mean error rate for each. How can we determine which model is best? It may seem intuitive to select the model with the lowest error rate; however, the mean error rates are just estimates of error on the true population of future data cases. There can be considerable variance between error rates within any given 10-fold cross-validation experiment. Although the mean error rates obtained for M_1 and M_2 may appear different, that difference may not be statistically significant. What if any difference between the two may just be attributed to chance? This section addresses these questions.
6.15.1 Estimating Confidence Intervals
To determine if there is any “real” difference in the mean error rates of two models,
we need to employ a test of statistical significance. In addition, we would like to obtain
some confidence limits for our mean error rates so that we can make statements like
“any observed mean will not vary by +/− two standard errors 95% of the time for future samples” or “one model is better than the other by a margin of error of +/− 4%.”
What do we need in order to perform the statistical test? Suppose that for each model, we did 10-fold cross-validation, say, 10 times, each time using a different 10-fold partitioning of the data. Each partitioning is independently drawn. We can average the 10 error rates obtained each for M_1 and M_2, respectively, to obtain the mean error rate for each model. For a given model, the individual error rates calculated in the cross-validations may be considered as different, independent samples from a probability distribution. In general, they follow a t distribution with k − 1 degrees of freedom where, here, k = 10. (This distribution looks very similar to a normal, or Gaussian, distribution even though the functions defining the two are quite different. Both are unimodal, symmetric, and bell-shaped.) This allows us to do hypothesis testing, where the significance test used is the t-test, or Student's t-test. Our hypothesis is that the two models are the same, or in other words, that the difference in mean error rate between the two is zero. If we can reject this hypothesis (referred to as the null hypothesis), then we can conclude that the difference between the two models is statistically significant, in which case we can select the model with the lower error rate.
In data mining practice, we may often employ a single test set, that is, the same test set can be used for both M_1 and M_2. In such cases, we do a pairwise comparison of the two models for each 10-fold cross-validation round. That is, for the ith round of 10-fold cross-validation, the same cross-validation partitioning is used to obtain an error rate for M_1 and an error rate for M_2. Let err(M_1)_i (or err(M_2)_i) be the error rate of model M_1 (or M_2) on round i. The error rates for M_1 are averaged to obtain a mean error rate for M_1, denoted err(M_1). Similarly, we can obtain err(M_2). The variance of the difference between the two models is denoted var(M_1 − M_2). The t-test computes the t-statistic with k − 1 degrees of freedom for k samples. In our example we have k = 10 since, here, the k samples are our error rates obtained from ten 10-fold cross-validations for each model. The t-statistic for pairwise comparison is computed as follows:

t = (err(M_1) − err(M_2)) / sqrt(var(M_1 − M_2) / k),

where

var(M_1 − M_2) = (1/k) Σ_{i=1}^{k} [err(M_1)_i − err(M_2)_i − (err(M_1) − err(M_2))]^2.    (6.69)
To determine whether M_1 and M_2 are significantly different, we compute t and select a significance level, sig. In practice, a significance level of 5% or 1% is typically used. We then consult a table for the t distribution, available in standard textbooks on statistics. This table is usually shown arranged by degrees of freedom as rows and significance levels as columns. Suppose we want to ascertain whether the difference between M_1 and M_2 is significantly different for 95% of the population, that is, sig = 5% or 0.05. We need to find the t distribution value corresponding to k − 1 degrees of freedom (or 9 degrees of freedom for our example) from the table. However, because the t distribution is symmetric, typically only the upper percentage points of the distribution are shown. Therefore, we look up the table value for z = sig/2, which in this case is 0.025, where z is also referred to as a confidence limit. If t > z or t < −z, then our value of t lies in the rejection region, within the tails of the distribution. This means that we can reject the null hypothesis that the means of M_1 and M_2 are the same and conclude that there is a statistically significant difference between the two models. Otherwise, if we cannot reject the null hypothesis, we then conclude that any difference between M_1 and M_2 can be attributed to chance.
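As a worked example, the paired t-statistic can be computed directly from per-round error rates. The sketch below uses the error rates listed in Exercise 6.18; the resulting |t| would then be compared against a t-distribution table value for 9 degrees of freedom at the chosen significance level.

```python
from math import sqrt

def paired_t_statistic(err1, err2):
    """t-statistic for a pairwise comparison of two models over k rounds."""
    k = len(err1)
    mean1, mean2 = sum(err1) / k, sum(err2) / k
    mean_diff = mean1 - mean2
    diffs = [e1 - e2 for e1, e2 in zip(err1, err2)]
    var = sum((d - mean_diff) ** 2 for d in diffs) / k   # var(M_1 - M_2), as defined above
    return mean_diff / sqrt(var / k)

t = paired_t_statistic([30.5, 32.2, 20.7, 20.6, 31.0, 41.0, 27.7, 26.0, 21.5, 26.0],
                       [22.4, 14.5, 22.4, 19.6, 20.7, 20.4, 22.1, 19.4, 16.2, 35.0])
print(f"t = {t:.3f}")
```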
If two test sets are available instead of a single test set, then a nonpaired version of the t-test is used, where the variance between the means of the two models is estimated as

var(M_1 − M_2) = sqrt(var(M_1)/k_1 + var(M_2)/k_2),

and k_1 and k_2 are the number of cross-validation samples (in our case, 10-fold cross-validation rounds) used for M_1 and M_2, respectively. When consulting the table of t distribution, the number of degrees of freedom used is taken as the minimum number of degrees of freedom of the two models.
6.15.2 ROC Curves
ROC curves are a useful visual tool for comparing two classification models. The name ROC stands for Receiver Operating Characteristic. ROC curves come from signal detection theory that was developed during World War II for the analysis of radar images. An ROC curve shows the trade-off between the true positive rate or sensitivity (proportion of positive tuples that are correctly identified) and the false-positive rate (proportion of negative tuples that are incorrectly identified as positive) for a given model. That is, given a two-class problem, it allows us to visualize the trade-off between the rate at which the model can accurately recognize ‘yes’ cases versus the rate at which it mistakenly identifies ‘no’ cases as ‘yes’ for different “portions” of the test set. Any increase in the true positive rate occurs at the cost of an increase in the false-positive rate. The area under the ROC curve is a measure of the accuracy of the model.
In order to plot an ROC curve for a given classification model, M, the model must be able to return a probability or ranking for the predicted class of each test tuple. That is, we need to rank the test tuples in decreasing order, where the one the classifier thinks is most likely to belong to the positive or ‘yes’ class appears at the top of the list. Naive Bayesian and backpropagation classifiers are appropriate, whereas others, such as decision tree classifiers, can easily be modified so as to return a class probability distribution for each prediction. The vertical axis of an ROC curve represents the true positive rate. The horizontal axis represents the false-positive rate. An ROC curve for M is plotted as follows. Starting at the bottom left-hand corner (where the true positive rate and false-positive rate are both 0), we check the actual class label of the tuple at the top of the list. If we have a true positive (that is, a positive tuple that was correctly classified), then on the ROC curve, we move up and plot a point. If, instead, the tuple really belongs to the ‘no’ class, we have a false positive. On the ROC curve, we move right and plot a point. This process is repeated for each of the test tuples, each time moving up on the curve for a true positive or toward the right for a false positive.

Figure 6.33 shows the ROC curves of two classification models. The plot also shows a diagonal line where for every true positive of such a model, we are just as likely to encounter a false positive. Thus, the closer the ROC curve of a model is to the diagonal line, the less accurate the model. If the model is really good, initially we are more likely to encounter true positives as we move down the ranked list. Thus, the curve would move steeply up from zero. Later, as we start to encounter fewer and fewer true positives, and more and more false positives, the curve eases off and becomes more horizontal.
To assess the accuracy of a model, we can measure the area under the curve. Several software packages are able to perform such a calculation. The closer the area is to 0.5, the less accurate the corresponding model is. A model with perfect accuracy will have an area of 1.0.
Figure 6.33 The ROC curves of two classification models, plotting the true positive rate against the false positive rate.
6.16 Summary

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. While classification predicts categorical labels (classes), prediction models continuous-valued functions.
Preprocessing of the data in preparation for classification and prediction can involve data cleaning to reduce noise or handle missing values, relevance analysis to remove irrelevant or redundant attributes, and data transformation, such as generalizing the data to higher-level concepts or normalizing the data.
Predictive accuracy, computational speed, robustness, scalability, and interpretability
are five criteria for the evaluation of classification and prediction methods.
ID3, C4.5, and CART are greedy algorithms for the induction of decision trees. Each algorithm uses an attribute selection measure to select the attribute tested for each nonleaf node in the tree. Pruning algorithms attempt to improve accuracy by removing tree branches reflecting noise in the data. Early decision tree algorithms typically assume that the data are memory resident—a limitation to data mining on large databases. Several scalable algorithms, such as SLIQ, SPRINT, and RainForest, have been proposed to address this issue.
Naïve Bayesian classification and Bayesian belief networks are based on Bayes' theorem of posterior probability. Unlike naïve Bayesian classification (which assumes class conditional independence), Bayesian belief networks allow class conditional independencies to be defined between subsets of variables.

A rule-based classifier uses a set of IF-THEN rules for classification. Rules can be extracted from a decision tree. Rules may also be generated directly from training data using sequential covering algorithms and associative classification algorithms.
Backpropagation is a neural network algorithm for classification that employs a method of gradient descent. It searches for a set of weights that can model the data so as to minimize the mean squared distance between the network's class prediction and the actual class label of data tuples. Rules may be extracted from trained neural networks in order to help improve the interpretability of the learned network.
A Support Vector Machine (SVM) is an algorithm for the classification of both linear and nonlinear data. It transforms the original data into a higher dimension, from where it can find a hyperplane for separation of the data using essential training tuples called support vectors.
Associative classification uses association mining techniques that search for frequently occurring patterns in large databases. The patterns may generate rules, which can be analyzed for use in classification.
Decision tree classifiers, Bayesian classifiers, classification by backpropagation, support vector machines, and classification based on association are all examples of eager learners in that they use training tuples to construct a generalization model and in this way are ready for classifying new tuples. This contrasts with lazy learners or instance-based methods of classification, such as nearest-neighbor classifiers and case-based reasoning classifiers, which store all of the training tuples in pattern space and wait until presented with a test tuple before performing generalization. Hence, lazy learners require efficient indexing techniques.
In genetic algorithms, populations of rules “evolve” via operations of crossover and mutation until all rules within a population satisfy a specified threshold. Rough set theory can be used to approximately define classes that are not distinguishable based on the available attributes. Fuzzy set approaches replace “brittle” threshold cutoffs for continuous-valued attributes with degree of membership functions.
Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Unlike decision trees, regression trees and model trees are used for prediction. In regression trees, each leaf stores a continuous-valued prediction. In model trees, each leaf holds a regression model.
Stratified k-fold cross-validation is a recommended method for accuracy estimation.
Bagging and boosting methods can be used to increase overall accuracy by learning and combining a series of individual models. For classifiers, sensitivity, specificity, and precision are useful alternatives to the accuracy measure, particularly when the main class of interest is in the minority. There are many measures of predictor error, such as the mean squared error, the mean absolute error, the relative squared error, and the relative absolute error.

Methods such as neural networks tend to be more computationally intensive than most decision tree methods.
Exercises
6.1 Briefly outline the major steps of decision tree classification.
6.2 Why is tree pruning useful in decision tree induction? What is a drawback of using a
separate set of tuples to evaluate pruning?
6.3 Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision tree and then converting the pruned tree to rules. What advantage does (a) have over (b)?

6.4 It is important to calculate the worst-case computational complexity of the decision tree algorithm. Given data set D, the number of attributes n, and the number of training tuples |D|, show that the computational cost of growing a tree is at most
n × |D| × log(|D|).
6.5 Why is naïve Bayesian classification called “naïve”? Briefly outline the major ideas of naïve Bayesian classification.
6.6 Given a 5 GB data set with 50 attributes (each containing 100 distinct values) and 512 MB of main memory in your laptop, outline an efficient method that constructs decision trees in such large data sets. Justify your answer by rough calculation of your main memory usage.

6.7 RainForest is an interesting scalable algorithm for decision tree induction. Develop a scalable naive Bayesian classification algorithm that requires just a single scan of the entire data set for most databases. Discuss whether such an algorithm can be refined to incorporate boosting to further enhance its classification accuracy.

6.8 Compare the advantages and disadvantages of eager classification (e.g., decision tree, Bayesian, neural network) versus lazy classification (e.g., k-nearest neighbor, case-based reasoning).
6.9 Design an efficient method that performs effective naïve Bayesian classification over an infinite data stream (i.e., you can scan the data stream only once). If we wanted to discover the evolution of such classification schemes (e.g., comparing the classification scheme at this moment with earlier schemes, such as one from a week ago), what modified design would you suggest?
6.10 What is associative classification? Why is associative classification able to achieve higher classification accuracy than a classical decision tree method? Explain how associative classification can be used for text document classification.
6.11 The following table consists of training data from an employee database. The data have been generalized. For example, “31...35” for age represents the age range of 31 to 35. For a given row entry, count represents the number of data tuples having the values for department, status, age, and salary given in that row.

department   status   age       salary      count
sales        senior   31...35   46K...50K   30
sales        junior   26...30   26K...30K   40
sales        junior   31...35   31K...35K   40
systems      junior   21...25   46K...50K   20
systems      senior   31...35   66K...70K   5
systems      junior   26...30   46K...50K   3
systems      senior   41...45   66K...70K   3
marketing    senior   36...40   46K...50K   10
marketing    junior   31...35   41K...45K   4
secretary    senior   46...50   36K...40K   4
secretary    junior   26...30   26K...30K   6
Let status be the class label attribute.
(a) How would you modify the basic decision tree algorithm to take into consideration the count of each generalized data tuple (i.e., of each row entry)?

(b) Use your algorithm to construct a decision tree from the given data.

(c) Given a data tuple having the values “systems,” “26...30,” and “46K...50K” for the attributes department, age, and salary, respectively, what would a naive Bayesian classification of the status for the tuple be?

(d) Design a multilayer feed-forward neural network for the given data. Label the nodes in the input and output layers.

(e) Using the multilayer feed-forward neural network obtained above, show the weight values after one iteration of the backpropagation algorithm, given the training instance “(sales, senior, 31...35, 46K...50K).” Indicate your initial weight values and biases, and the learning rate used.
6.12 The support vector machine (SVM) is a highly accurate classification method. However, SVM classifiers suffer from slow processing when training with a large set of data tuples. Discuss how to overcome this difficulty and develop a scalable SVM algorithm for efficient SVM classification in large data sets.
6.13 Write an algorithm for k-nearest-neighbor classification given k and n, the number of attributes describing each tuple.
6.14 The following table shows the midterm and final exam grades obtained for students in a database course.

(a) Plot the data. Do x and y seem to have a linear relationship?

(b) Use the method of least squares to find an equation for the prediction of a student's final exam grade based on the student's midterm grade in the course.

(c) Predict the final exam grade of a student who received an 86 on the midterm exam.
6.15 Some nonlinear regression models can be converted to linear models by applying transformations to the predictor variables. Show how the nonlinear regression equation y = αX^β can be converted to a linear regression equation solvable by the method of least squares.
6.16 What is boosting? State why it may improve the accuracy of decision tree induction.

6.17 Show that accuracy is a function of sensitivity and specificity, that is, prove Equation (6.58).

6.18 Suppose that we would like to select between two prediction models, M_1 and M_2. We have performed 10 rounds of 10-fold cross-validation on each model, where the same data partitioning in round i is used for both M_1 and M_2. The error rates obtained for M_1 are 30.5, 32.2, 20.7, 20.6, 31.0, 41.0, 27.7, 26.0, 21.5, 26.0. The error rates for M_2 are 22.4, 14.5, 22.4, 19.6, 20.7, 20.4, 22.1, 19.4, 16.2, 35.0. Comment on whether one model is significantly better than the other considering a significance level of 1%.
6.19 It is difficult to assess classification accuracy when individual data objects may belong to more than one class at a time. In such cases, comment on what criteria you would use to compare different classifiers modeled after the same data.
Bibliographic Notes
Classification from machine learning, statistics, and pattern recognition perspectives has been described in many books, such as Weiss and Kulikowski [WK91], Michie, Spiegelhalter, and Taylor [MST94], Russell and Norvig [RN95], Langley [Lan96], Mitchell [Mit97], Hastie, Tibshirani, and Friedman [HTF01], Duda, Hart, and Stork [DHS01], Alpaydin [Alp04], Tan, Steinbach, and Kumar [TSK05], and Witten and Frank [WF05]. Many of these books describe each of the basic methods of classification discussed in this chapter, as well as practical techniques for the evaluation of classifier performance. Edited collections containing seminal articles on machine learning can be found in Michalski, Carbonell, and Mitchell [MCM83, MCM86], Kodratoff and Michalski [KM90], Shavlik and Dietterich [SD90], and Michalski and Tecuci [MT94]. For a presentation of machine learning with respect to data mining applications, see Michalski, Bratko, and Kubat [MBK98].
The C4.5 algorithm is described in a book by Quinlan [Qui93]. The CART system is detailed in Classification and Regression Trees by Breiman, Friedman, Olshen, and Stone [BFOS84]. Both books give an excellent presentation of many of the issues regarding decision tree induction. C4.5 has a commercial successor, known as C5.0, which can be found at www.rulequest.com. ID3, a predecessor of C4.5, is detailed in Quinlan [Qui86]. It expands on pioneering work on concept learning systems, described by Hunt, Marin, and Stone [HMS66]. Other algorithms for decision tree induction include FACT (Loh and Vanichsetakul [LV88]), QUEST (Loh and Shih [LS97]), PUBLIC (Rastogi and Shim [RS98]), and CHAID (Kass [Kas80] and Magidson [Mag94]). INFERULE (Uthurusamy, Fayyad, and Spangler [UFS91]) learns decision trees from inconclusive data, where probabilistic rather than categorical classification rules are obtained. KATE (Manago and Kodratoff [MK91]) learns decision trees from complex structured data. Incremental versions of ID3 include ID4 (Schlimmer and Fisher [SF86a]) and ID5 (Utgoff [Utg88]), the latter of which is extended in Utgoff, Berkman, and Clouse [UBC97]. An incremental version of CART is described in Crawford [Cra89]. BOAT (Gehrke, Ganti, Ramakrishnan, and Loh [GGRL99]), a decision tree algorithm that addresses the scalability issue in data mining, is also incremental. Other decision tree algorithms that address scalability include SLIQ (Mehta, Agrawal, and Rissanen [MAR96]), SPRINT (Shafer, Agrawal, and Mehta [SAM96]), RainForest (Gehrke, Ramakrishnan, and Ganti [GRG98]), and earlier approaches, such as Catlet [Cat91], and Chan and Stolfo [CS93a, CS93b]. The integration of attribute-oriented induction with decision tree induction is proposed in Kamber, Winstone, Gong, et al. [KWG+97]. For a comprehensive survey of many salient issues relating to decision tree induction, such as attribute selection and pruning, see Murthy [Mur98].
For a detailed discussion on attribute selection measures, see Kononenko and Hong [KH97]. Information gain was proposed by Quinlan [Qui86] and is based on pioneering work on information theory by Shannon and Weaver [SW49]. The gain ratio, proposed as an extension to information gain, is described as part of C4.5 [Qui93]. The Gini index was proposed for CART [BFOS84]. The G-statistic, based on information theory, is given in Sokal and Rohlf [SR81]. Comparisons of attribute selection measures include Buntine and Niblett [BN92], Fayyad and Irani [FI92], Kononenko [Kon95], Loh and Shih [LS97], and Shih [Shi99]. Fayyad and Irani [FI92] show limitations of impurity-based measures such as information gain and Gini index. They propose a class of attribute selection measures called C-SEP (Class SEParation), which outperform impurity-based measures in certain cases. Kononenko [Kon95] notes that attribute selection measures based on the minimum description length principle have the least bias toward multivalued attributes. Martin and Hirschberg [MH95] proved that the time complexity of decision tree induction increases exponentially with respect to tree height in the worst case, and under fairly general conditions in the average case. Fayyad and Irani [FI90] found that shallow decision trees tend to have many leaves and higher error rates for a large variety of domains. Attribute (or feature) construction is described in Liu and Motoda [LM98, Le98]. Examples of systems with attribute construction include BACON by Langley, Simon, Bradshaw, and Zytkow [LSBZ87], Stagger by Schlimmer [Sch86], FRINGE by Pagallo [Pag89], and AQ17-DCI by Bloedorn and Michalski [BM98].

There are numerous algorithms for decision tree pruning, including cost complexity pruning (Breiman, Friedman, Olshen, and Stone [BFOS84]), reduced error pruning (Quinlan [Qui87]), and pessimistic pruning (Quinlan [Qui86]). PUBLIC (Rastogi and Shim [RS98]) integrates decision tree construction with tree pruning. MDL-based pruning methods can be found in Quinlan and Rivest [QR89], Mehta, Agrawal, and Rissanen [MRA95], and Rastogi and Shim [RS98]. Other methods include Niblett and Bratko [NB86], and Hosking, Pednault, and Sudan [HPS97]. For an empirical comparison of pruning methods, see Mingers [Min89] and Malerba, Floriana, and Semeraro [MFS95]. For a survey on simplifying decision trees, see Breslow and Aha [BA97].

There are several examples of rule-based classifiers. These include AQ15 (Hong, Mozetic, and Michalski [HMM86]), CN2 (Clark and Niblett [CN89]), ITRULE (Smyth and Goodman [SG92]), RISE (Domingos [Dom94]), IREP (Furnkranz and Widmer [FW94]), RIPPER (Cohen [Coh95]), FOIL (Quinlan and Cameron-Jones [Qui90, QCJ93]), and Swap-1 (Weiss and Indurkhya [WI98]). For the extraction of rules from decision trees, see Quinlan [Qui87, Qui93]. Rule refinement strategies that identify the most interesting rules among a given rule set can be found in Major and Mangano [MM95].

Thorough presentations of Bayesian classification can be found in Duda, Hart, and Stork [DHS01], Weiss and Kulikowski [WK91], and Mitchell [Mit97]. For an analysis of the predictive power of naïve Bayesian classifiers when the class conditional independence assumption is violated, see Domingos and Pazzani [DP96]. Experiments with kernel density estimation for continuous-valued attributes, rather than Gaussian estimation, have been reported for naïve Bayesian classifiers in John [Joh97]. For an introduction to Bayesian belief networks, see Heckerman [Hec96]. For a thorough
presentation of probabilistic networks, see Pearl [Pea88]. Solutions for learning the belief network structure from training data given observable variables are proposed in Cooper and Herskovits [CH92], Buntine [Bun94], and Heckerman, Geiger, and Chickering [HGC95]. Algorithms for inference on belief networks can be found in Russell and Norvig [RN95] and Jensen [Jen96]. The method of gradient descent, described in Section 6.4.4 for training Bayesian belief networks, is given in Russell, Binder, Koller, and Kanazawa [RBKK95]. The example given in Figure 6.11 is adapted from Russell et al. [RBKK95]. Alternative strategies for learning belief networks with hidden variables include application of Dempster, Laird, and Rubin's [DLR77] EM (Expectation Maximization) algorithm (Lauritzen [Lau95]) and methods based on the minimum description length principle (Lam [Lam98]). Cooper [Coo90] showed that the general problem of inference in unconstrained belief networks is NP-hard. Limitations of belief networks, such as their large computational complexity (Laskey and Mahoney [LM97]), have prompted the exploration of hierarchical and composable Bayesian models (Pfeffer, Koller, Milch, and Takusagawa [PKMT99] and Xiang, Olesen, and Jensen [XOJ00]). These follow an object-oriented approach to knowledge representation.

The perceptron is a simple neural network, proposed in 1958 by Rosenblatt [Ros58], which became a landmark in early machine learning history. Its input units are randomly connected to a single layer of output linear threshold units. In 1969, Minsky and Papert [MP69] showed that perceptrons are incapable of learning concepts that are linearly inseparable. This limitation, as well as limitations on hardware at the time, dampened enthusiasm for research in computational neuronal modeling for nearly 20 years. Renewed interest was sparked following presentation of the backpropagation algorithm in 1986 by Rumelhart, Hinton, and Williams [RHW86], as this algorithm can learn concepts that are linearly inseparable. Since then, many variations for backpropagation have been proposed, involving, for example, alternative error functions (Hanson and Burr [HB88]), dynamic adjustment of the network topology (Mézard and Nadal [MN89], Fahlman and Lebiere [FL90], Le Cun, Denker, and Solla [LDS90], and Harp, Samad, and Guha [HSG90]), and dynamic adjustment of the learning rate and momentum parameters (Jacobs [Jac88]). Other variations are discussed in Chauvin and Rumelhart [CR95]. Books on neural networks include Rumelhart and McClelland [RM86], Hecht-Nielsen [HN90], Hertz, Krogh, and Palmer [HKP91], Bishop [Bis95], Ripley [Rip96], and Haykin [Hay99]. Many books on machine learning, such as [Mit97, RN95], also contain good explanations of the backpropagation algorithm. There are several techniques for extracting rules from neural networks, such as [SN88, Gal93, TS93, Avn95, LSL95, CS96b, LGT97]. The method of rule extraction described in Section 6.6.4 is based on Lu, Setiono, and Liu [LSL95]. Critiques of techniques for rule extraction from neural networks can be found in Craven and Shavlik [CS97]. Roy [Roy00] proposes that the theoretical foundations of neural networks are flawed with respect to assumptions made regarding how connectionist learning models the brain. An extensive survey of applications of neural networks in industry, business, and science is provided in Widrow, Rumelhart, and Lehr [WRL94].
Support Vector Machines (SVMs) grew out of early work by Vapnik and Chervonenkis on statistical learning theory [VC71]. The first paper on SVMs was presented by Boser,
Guyon, and Vapnik [BGV92]. More detailed accounts can be found in books by Vapnik [Vap95, Vap98]. Good starting points include the tutorial on SVMs by Burges [Bur98] and textbook coverage by Kecman [Kec01]. For methods for solving optimization problems, see Fletcher [Fle87] and Nocedal and Wright [NW99]. These references give additional details alluded to as “fancy math tricks” in our text, such as transformation of the problem to a Lagrangian formulation and subsequent solving using Karush-Kuhn-Tucker (KKT) conditions. For the application of SVMs to regression, see Schölkopf, Bartlett, Smola, and Williamson [SBSW99], and Drucker, Burges, Kaufman, Smola, and Vapnik [DBK+97]. Approaches to SVM for large data include the sequential minimal optimization algorithm by Platt [Pla98], decomposition approaches such as in Osuna, Freund, and Girosi [OFG97], and CB-SVM, a microclustering-based SVM algorithm for large data sets, by Yu, Yang, and Han [YYH03].
Many algorithms have been proposed that adapt association rule mining to the task of classification. The CBA algorithm for associative classification was proposed by Liu, Hsu, and Ma [LHM98]. A classifier using emerging patterns was proposed by Dong and Li [DL99] and Li, Dong, and Ramamohanarao [LDR00]. CMAR (Classification based on Multiple Association Rules) was presented in Li, Han, and Pei [LHP01]. CPAR (Classification based on Predictive Association Rules) was proposed in Yin and Han [YH03b]. Cong, Tan, Tung, and Xu proposed a method for mining top-k covering rule groups for classifying gene expression data with high accuracy [CTTX05]. Lent, Swami, and Widom [LSW97] proposed the ARCS system, which was described in Section 5.3 on mining multidimensional association rules. It combines ideas from association rule mining, clustering, and image processing, and applies them to classification. Meretakis and Wüthrich [MW99] proposed to construct a naïve Bayesian classifier by mining long itemsets.
Nearest-neighbor classifiers were introduced in 1951 by Fix and Hodges [FH51]. A comprehensive collection of articles on nearest-neighbor classification can be found in Dasarathy [Das91]. Additional references can be found in many texts on classification, such as Duda et al. [DHS01] and James [Jam85], as well as articles by Cover and Hart [CH67] and Fukunaga and Hummels [FH87]. Their integration with attribute-weighting and the pruning of noisy instances is described in Aha [Aha92]. The use of search trees to improve nearest-neighbor classification time is detailed in Friedman, Bentley, and Finkel [FBF77]. The partial distance method was proposed by researchers in vector quantization and compression. It is outlined in Gersho and Gray [GG92]. The editing method for removing “useless” training tuples was first proposed by Hart [Har68]. The computational complexity of nearest-neighbor classifiers is described in Preparata and Shamos [PS85]. References on case-based reasoning (CBR) include the texts Riesbeck and Schank [RS89] and Kolodner [Kol93], as well as Leake [Lea96] and Aamodt and Plazas [AP94]. For a list of business applications, see Allen [All94]. Examples in medicine include CASEY by Koton [Kot88] and PROTOS by Bareiss, Porter, and Weir [BPW88], while Rissland and Ashley [RA87] is an example of CBR for law. CBR is available in several commercial software products. For texts on genetic algorithms, see Goldberg [Gol89], Michalewicz [Mic92], and Mitchell [Mit96]. Rough sets were introduced in Pawlak [Paw91]. Concise summaries of rough set theory in data mining include Ziarko [Zia91], and Cios, Pedrycz, and Swiniarski [CPS98]. Rough sets have been used for feature reduction and expert system design in many applications, including Ziarko [Zia91], Lenarcik and Piasta [LP97], and Swiniarski [Swi98]. Algorithms to reduce the computation intensity in finding reducts have been proposed in Skowron and Rauszer [SR92]. Fuzzy set theory was proposed by Zadeh in [Zad65, Zad83]. Additional descriptions can be found in [YZ94, Kec01].
Many good textbooks cover the techniques of regression. Examples include James [Jam85], Dobson [Dob01], Johnson and Wichern [JW02], Devore [Dev95], Hogg and Craig [HC95], Neter, Kutner, Nachtsheim, and Wasserman [NKNW96], and Agresti [Agr96]. The book by Press, Teukolsky, Vetterling, and Flannery [PTVF96] and accompanying source code contain many statistical procedures, such as the method of least squares for both linear and multiple regression. Recent nonlinear regression models include projection pursuit and MARS (Friedman [Fri91]). Log-linear models are also known in the computer science literature as multiplicative models. For log-linear models from a computer science perspective, see Pearl [Pea88]. Regression trees (Breiman, Friedman, Olshen, and Stone [BFOS84]) are often comparable in performance with other regression methods, particularly when there exist many higher-order dependencies among the predictor variables. For model trees, see Quinlan [Qui92].

Methods for data cleaning and data transformation are discussed in Kennedy, Lee, Van Roy, et al. [KLV+98], Weiss and Indurkhya [WI98], Pyle [Pyl99], and Chapter 2 of this book. Issues involved in estimating classifier accuracy are described in Weiss and Kulikowski [WK91] and Witten and Frank [WF05]. The use of stratified 10-fold cross-validation for estimating classifier accuracy is recommended over the holdout, cross-validation, leave-one-out (Stone [Sto74]), and bootstrapping (Efron and Tibshirani [ET93]) methods, based on a theoretical and empirical study by Kohavi [Koh95]. Bagging is proposed in Breiman [Bre96]. The boosting technique of Freund and Schapire [FS97] has been applied to several different classifiers, including decision tree induction (Quinlan [Qui96]) and naive Bayesian classification (Elkan [Elk97]). Sensitivity, specificity, and precision are discussed in Frakes and Baeza-Yates [FBY92]. For ROC analysis, see Egan [Ega75] and Swets [Swe88].
The University of California at Irvine (UCI) maintains a Machine Learning Repository of data sets for the development and testing of classification algorithms. It also maintains a Knowledge Discovery in Databases (KDD) Archive, an online repository of large data sets that encompasses a wide variety of data types, analysis tasks, and application areas. For information on these two repositories, see www.ics.uci.edu/~mlearn/MLRepository.html and http://kdd.ics.uci.edu.

No classification method is superior over all others for all data types and domains. Empirical comparisons of classification methods include [Qui88, SMT91, BCP93, CM94, MST94, BU95], and [LLS00].
7 Cluster Analysis
Imagine that you are given a set of data objects for analysis where, unlike in classification, the class label of each object is not known. This is quite common in large databases, because assigning class labels to a large number of objects can be a very costly process. Clustering is the process of grouping the data into classes or clusters, so that objects within a cluster have high similarity in comparison to one another but are very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects. Often, distance measures are used. Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning.

In this chapter, we study the requirements of clustering methods for large amounts of data. We explain how to compute dissimilarities between objects represented by various attribute or variable types. We examine several clustering techniques, organized into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (such as frequent pattern–based methods), and constraint-based clustering. Clustering can also be used for outlier detection, which forms the final topic of this chapter.
7.1 What Is Cluster Analysis?

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: First partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.
Cluster analysis is an important human activity. Early in childhood, we learn how to distinguish between cats and dogs, or between animals and plants, by continuously improving subconscious clustering schemes. By automated clustering, we can identify dense and sparse regions in object space and, therefore, discover overall distribution patterns and interesting correlations among data attributes. Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations. Clustering may also help in the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value, and geographic location, as well as the identification of groups of automobile insurance policy holders with a high average claim cost. It can also be used to help classify documents on the Web for information discovery.

Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce. For example, exceptional cases in credit card transactions, such as very expensive and frequent purchases, may be of interest as possible fraudulent activity. As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features.
Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research.

As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, clustering is an example of unsupervised learning. Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples. In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research focus on the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases.
Clustering is a challenging field of research in which its potential applications pose their own special requirements. The following are typical requirements of clustering in data mining:
Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.

Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.
Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data. That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.
High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.
Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.
With these requirements in mind, our study of cluster analysis proceeds as follows. First, we study different types of data and how they can influence clustering methods. Second, we present a general categorization of clustering methods. We then study each clustering method in detail, including partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. We also examine clustering in high-dimensional space, constraint-based clustering, and outlier analysis.
7.2 Types of Data in Cluster Analysis

In this section, we study the types of data that often occur in cluster analysis and how to preprocess them for such an analysis. Suppose that a data set to be clustered contains n objects, which may represent persons, houses, documents, countries, and so on. Main memory-based clustering algorithms typically operate on either of the following two data structures.
Data matrix (or object-by-variable structure): This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects × p variables):

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix} \qquad (7.1)
$$
Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:

$$
\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix} \qquad (7.2)
$$
where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes larger the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, we have the matrix in (7.2). Measures of dissimilarity are discussed throughout this section.
The rows and columns of the data matrix represent different entities, while those of the dissimilarity matrix represent the same entity. Thus, the data matrix is often called a two-mode matrix, whereas the dissimilarity matrix is called a one-mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If the data are presented in the form of a data matrix, it can first be transformed into a dissimilarity matrix before applying such clustering algorithms.
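As a concrete illustration of this transformation, the following is a minimal sketch (not from the book) that builds a dissimilarity matrix from a data matrix using Euclidean distance; the function name and the NumPy dependency are my own choices.

```python
import numpy as np

def euclidean_dissimilarity_matrix(data):
    """Build an n-by-n dissimilarity matrix from an n-by-p data matrix,
    here using Euclidean distance as the (one of many possible) measure."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            d[i, j] = d[j, i] = np.sqrt(np.sum((data[i] - data[j]) ** 2))
    return d

# Example: 4 objects described by 2 interval-scaled variables
X = [[1.0, 2.0], [3.0, 5.0], [2.0, 2.0], [8.0, 1.0]]
print(euclidean_dissimilarity_matrix(X))
```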
In this section, we discuss how object dissimilarity can be computed for objects described by interval-scaled variables; by binary variables; by categorical, ordinal, and ratio-scaled variables; or combinations of these variable types. Nonmetric similarity between complex objects (such as documents) is also described. The dissimilarity data can later be used to compute clusters of objects.
7.2.1 Interval-Scaled Variables
This section discusses interval-scaled variables and their standardization. It then describes distance measures that are commonly used for computing the dissimilarity of objects described by such variables. These measures include the Euclidean, Manhattan, and Minkowski distances.

"What are interval-scaled variables?" Interval-scaled variables are continuous measurements of a roughly linear scale. Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature. The measurement unit used can affect the clustering analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to a very different clustering structure. In general, expressing a variable in smaller units will lead to a larger range for that variable, and thus a larger effect on the resulting clustering structure. To help avoid dependence on the choice of measurement units, the data should be standardized. Standardizing measurements attempts to give all variables an equal weight. This is particularly useful when given no prior knowledge of the data. However, in some applications, users may intentionally want to give more weight to a certain set of variables than to others. For example, when clustering basketball player candidates, we may prefer to give more weight to the variable height.
"How can the data for a variable be standardized?" To standardize measurements, one choice is to convert the original measurements to unitless variables. Given measurements for a variable f, this can be performed as follows.

1. Calculate the mean absolute deviation, s_f:

$$ s_f = \frac{1}{n}\big(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\big), \qquad (7.3) $$

where x_{1f}, ..., x_{nf} are n measurements of f, and m_f is the mean value of f, that is, m_f = (x_{1f} + x_{2f} + ... + x_{nf})/n.

2. Calculate the standardized measurement, or z-score:

$$ z_{if} = \frac{x_{if} - m_f}{s_f}. \qquad (7.4) $$

The mean absolute deviation, s_f, is more robust to outliers than the standard deviation. When computing the mean absolute deviation, the deviations from the mean (i.e., |x_{if} - m_f|) are not squared; hence, the effect of outliers is somewhat reduced. There are more robust measures of dispersion, such as the median absolute deviation. However, the advantage of using the mean absolute deviation is that the z-scores of outliers do not become too small; hence, the outliers remain detectable.

Standardization may or may not be useful in a particular application. Thus the choice of whether and how to perform standardization should be left to the user. Methods of standardization are also discussed in Chapter 2 under normalization techniques for data preprocessing.
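A small sketch of this standardization procedure, assuming NumPy; the function name and the example height values are illustrative choices of mine, not from the text.

```python
import numpy as np

def z_scores_mean_absolute_deviation(values):
    """Standardize one variable's measurements using the mean absolute
    deviation s_f (Eq. 7.3) rather than the standard deviation, then
    return the z-scores z_if = (x_if - m_f) / s_f (Eq. 7.4)."""
    x = np.asarray(values, dtype=float)
    m_f = x.mean()                   # mean value of the variable
    s_f = np.abs(x - m_f).mean()     # mean absolute deviation
    return (x - m_f) / s_f

# Hypothetical height measurements (in cm); the outlier stays detectable
print(z_scores_mean_absolute_deviation([170, 180, 165, 210]))
```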
After standardization, or without standardization in certain applications, the dissimilarity (or similarity) between the objects described by interval-scaled variables is typically computed based on the distance between each pair of objects. The most popular distance measure is Euclidean distance, which is defined as

$$ d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2}, \qquad (7.5) $$
where i = (x_{i1}, x_{i2}, ..., x_{in}) and j = (x_{j1}, x_{j2}, ..., x_{jn}) are two n-dimensional data objects.
Another well-known metric is Manhattan (or city block) distance, defined as

$$ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|. \qquad (7.6) $$
Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function:
1. d(i, j) ≥ 0: Distance is a nonnegative number.
2. d(i, i) = 0: The distance of an object to itself is 0.
3. d(i, j) = d(j, i): Distance is a symmetric function.
4. d(i, j) ≤ d(i, h) + d(h, j): Going directly from object i to object j in space is no more than making a detour over any other object h (triangular inequality).
Example 7.1 Euclidean distance and Manhattan distance. Let x1 = (1, 2) and x2 = (3, 5) represent two objects as in Figure 7.1. The Euclidean distance between the two is $\sqrt{2^2 + 3^2} = 3.61$. The Manhattan distance between the two is 2 + 3 = 5.
Minkowski distance is a generalization of both Euclidean distance and Manhattan distance. It is defined as

$$ d(i, j) = \big(|x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p\big)^{1/p}, \qquad (7.7) $$

where p is a positive integer. Such a distance is also called the L_p norm in some literature. It represents the Manhattan distance when p = 1 (i.e., the L_1 norm) and the Euclidean distance when p = 2 (i.e., the L_2 norm).
If each variable is assigned a weight according to its perceived importance, the weighted Euclidean distance can be computed as

$$ d(i, j) = \sqrt{w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2}. \qquad (7.8) $$

Weighting can also be applied to the Manhattan and Minkowski distances.
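These distance measures fold naturally into one small function; the sketch below is mine rather than the book's code, and the function name is assumed. Plugging in the two points of Example 7.1 reproduces the values 3.61 and 5.

```python
import numpy as np

def minkowski(x, y, p=2, weights=None):
    """Minkowski distance (Eq. 7.7); p=1 gives Manhattan (Eq. 7.6),
    p=2 gives Euclidean (Eq. 7.5). Optional weights as in Eq. 7.8."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.ones_like(x) if weights is None else np.asarray(weights, float)
    return (w * np.abs(x - y) ** p).sum() ** (1.0 / p)

x1, x2 = (1, 2), (3, 5)
print(round(minkowski(x1, x2, p=2), 2))  # 3.61  (Euclidean)
print(minkowski(x1, x2, p=1))            # 5.0   (Manhattan)
```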
7.2.2 Binary Variables
A binary variable has only two states: 0 or 1, where 0 means that the variable is absent, and 1 means that it is present. Given the variable smoker describing a patient, for instance, 1 indicates that the patient smokes, while 0 indicates that the patient does not. Treating binary variables as if they are interval-scaled can lead to misleading clustering results. Therefore, methods specific to binary data are necessary for computing dissimilarities.
"So, how can we compute the dissimilarity between two binary variables?" One approach involves computing a dissimilarity matrix from the given binary data. If all binary variables are thought of as having the same weight, we have the 2-by-2 contingency table of Table 7.1, where q is the number of variables that equal 1 for both objects i and j, r is the number of variables that equal 1 for object i but 0 for object j, s is the number of variables that equal 0 for object i but 1 for object j, and t is the number of variables that equal 0 for both objects i and j. The total number of variables is p, where p = q + r + s + t.

Table 7.1 A contingency table for binary variables.

                        object j
                        1        0        sum
object i       1        q        r        q + r
               0        s        t        s + t
             sum      q + s    r + t        p
"What is the difference between symmetric and asymmetric binary variables?" A binary variable is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1. One such example could be the attribute gender, having the states male and female. Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity. Its dissimilarity (or distance) measure, defined in Equation (7.9), can be used to assess the dissimilarity between objects i and j:

$$ d(i, j) = \frac{r + s}{q + r + s + t}. \qquad (7.9) $$
A binary variable is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a disease test. By convention, we shall code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative). Given two asymmetric binary variables, the agreement of two 1s (a positive match) is then considered more significant than that of two 0s (a negative match). Therefore, such binary variables are often considered "monary" (as if having one state). The dissimilarity based on such variables is called asymmetric binary dissimilarity, where the number of negative matches, t, is considered unimportant and thus is ignored in the computation, as shown in Equation (7.10):

$$ d(i, j) = \frac{r + s}{q + r + s}. \qquad (7.10) $$
Complementarily, we can measure the distance between two binary variables based on the notion of similarity instead of dissimilarity. For example, the asymmetric binary similarity between the objects i and j, or sim(i, j), can be computed as

$$ sim(i, j) = \frac{q}{q + r + s} = 1 - d(i, j). \qquad (7.11) $$

The coefficient sim(i, j) is called the Jaccard coefficient, which is widely used in the literature.
Example 7.2 Dissimilarity between binary variables. Suppose that a patient record table (Table 7.2) contains the attributes name, gender, fever, cough, test-1, test-2, test-3, and test-4, where name is an object identifier, gender is a symmetric attribute, and the remaining attributes are asymmetric binary.
Table 7.2 A relational table where patients are described by binary attributes.

name   gender   fever   cough   test-1   test-2   test-3   test-4
Jack     M        Y       N       P        N        N        N
Mary     F        Y       N       P        N        P        N
Jim      M        Y       P       N        N        N        N

For asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0. Suppose that the distance between objects (patients) is computed based only on the asymmetric variables. According to Equation (7.10), the distance between each pair of the three patients, Jack, Mary, and Jim, is

d(Jack, Mary) = (0 + 1)/(2 + 0 + 1) = 0.33,
d(Jack, Jim) = (1 + 1)/(1 + 1 + 1) = 0.67,
d(Jim, Mary) = (1 + 2)/(1 + 1 + 2) = 0.75.
These measurements suggest that Mary and Jim are unlikely to have a similar disease because they have the highest dissimilarity value among the three pairs. Of the three patients, Jack and Mary are the most likely to have a similar disease.
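A minimal sketch (mine, not the book's) of the asymmetric binary dissimilarity of Equation (7.10), applied to the three patients with Y and P coded as 1 and N as 0.

```python
def asymmetric_binary_dissimilarity(x, y):
    """Asymmetric binary dissimilarity (Eq. 7.10): negative matches (0,0)
    are ignored; only the counts q (1,1), r (1,0), and s (0,1) are used."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# fever, cough, test-1, test-2, test-3, test-4 coded as 1/0
jack, mary, jim = [1, 0, 1, 0, 0, 0], [1, 0, 1, 0, 1, 0], [1, 1, 0, 0, 0, 0]
print(round(asymmetric_binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))   # 0.75
```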
7.2.3 Categorical, Ordinal, and Ratio-Scaled Variables
“How can we compute the dissimilarity between objects described by categorical, ordinal, and ratio-scaled variables?”
Categorical Variables
A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map color is a categorical variable that may have, say, five states: red, yellow, green, pink, and blue.

Let the number of states of a categorical variable be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, ..., M. Notice that such integers are used just for data handling and do not represent any specific ordering.
"How is dissimilarity computed between objects described by categorical variables?" The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

$$ d(i, j) = \frac{p - m}{p}, \qquad (7.12) $$

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables. Weights can be assigned to increase the effect of m or to assign greater weight to the matches in variables having a larger number of states.
Example 7.3 Dissimilarity between categorical variables. Suppose that we have the sample data of Table 7.3, except that only the object-identifier and the variable (or attribute) test-1 are available, where test-1 is categorical. (We will use test-2 and test-3 in later examples.) Let's compute the dissimilarity matrix (7.2), that is,

$$
\begin{bmatrix}
0 & & & \\
d(2,1) & 0 & & \\
d(3,1) & d(3,2) & 0 & \\
d(4,1) & d(4,2) & d(4,3) & 0
\end{bmatrix}
$$

Since here we have one categorical variable, test-1, we set p = 1 in Equation (7.12) so that d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ. Thus we get

$$
\begin{bmatrix}
0 & & & \\
1 & 0 & & \\
1 & 1 & 0 & \\
0 & 1 & 1 & 0
\end{bmatrix}
$$

Table 7.3 A sample data table containing variables of mixed type.

object        test-1          test-2       test-3
identifier    (categorical)   (ordinal)    (ratio-scaled)
1             code-A          excellent    445
2             code-B          fair         22
3             code-C          good         164
4             code-A          excellent    1,210
Categorical variables can be encoded by asymmetric binary variables by creating a new binary variable for each of the M states. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0. For example, to encode the categorical variable map color, a binary variable can be created for each of the five colors listed above. For an object having the color yellow, the yellow variable is set to 1, while the remaining four variables are set to 0. The dissimilarity coefficient for this form of encoding can be calculated using the methods discussed in Section 7.2.2.
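The following sketch, with function names of my own choosing, shows both the simple-matching dissimilarity of Equation (7.12) and the asymmetric binary encoding of a categorical value described above.

```python
def categorical_dissimilarity(obj_i, obj_j):
    """Simple matching dissimilarity for categorical variables (Eq. 7.12):
    (p - m) / p, where m is the number of matching variables."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

def one_hot(value, states):
    """Encode one categorical value as asymmetric binary variables,
    one per state (e.g., the five map colors)."""
    return [1 if value == s else 0 for s in states]

print(categorical_dissimilarity(["yellow"], ["green"]))               # 1.0
print(one_hot("yellow", ["red", "yellow", "green", "pink", "blue"]))  # [0, 1, 0, 0, 0]
```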
Ordinal Variables
A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value are ordered in a meaningful sequence. Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively. For example, professional ranks are often enumerated in a sequential order, such as assistant, associate, and full for professors. A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential but their actual magnitude is not. For example, the relative ranking in a particular sport (e.g., gold, silver, bronze) is often more essential than the actual values of a particular measure. Ordinal variables may also be obtained from the discretization of interval-scaled quantities by splitting the value range into a finite number of classes. The values of an ordinal variable can be mapped to ranks. For example, suppose that an ordinal variable f has M_f states. These ordered states define the ranking 1, ..., M_f.
"How are ordinal variables handled?" The treatment of ordinal variables is quite similar to that of interval-scaled variables when computing the dissimilarity between objects. Suppose that f is a variable from a set of ordinal variables describing n objects. The dissimilarity computation with respect to f involves the following steps:

1. The value of f for the ith object is x_{if}, and f has M_f ordered states, representing the ranking 1, ..., M_f. Replace each x_{if} by its corresponding rank, r_{if} ∈ {1, ..., M_f}.

2. Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0, 1.0] so that each variable has equal weight. This can be achieved by replacing the rank r_{if} of the ith object in the f th variable by

$$ z_{if} = \frac{r_{if} - 1}{M_f - 1}. \qquad (7.13) $$

3. Dissimilarity can then be computed using any of the distance measures described in Section 7.2.1 for interval-scaled variables, using z_{if} to represent the f value for the ith object.
Example 7.4 Dissimilarity between ordinal variables. Suppose that we have the sample data of Table 7.3, except that this time only the object-identifier and the continuous ordinal variable, test-2, are available. There are three states for test-2, namely fair, good, and excellent; that is, M_f = 3. For step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively. Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0. For step 3, we can use, say, the Euclidean distance (Equation (7.5)), which results in the following dissimilarity matrix:

$$
\begin{bmatrix}
0 & & & \\
1.0 & 0 & & \\
0.5 & 0.5 & 0 & \\
0 & 1.0 & 0.5 & 0
\end{bmatrix}
$$
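A short sketch (not from the text) of the rank-and-normalize treatment of an ordinal variable, applied to the test-2 values of this example; the function name is assumed.

```python
def ordinal_to_interval(values, ordered_states):
    """Map ordinal values to ranks and normalize onto [0.0, 1.0]
    via z_if = (r_if - 1) / (M_f - 1) (Eq. 7.13)."""
    m_f = len(ordered_states)
    rank = {state: i + 1 for i, state in enumerate(ordered_states)}
    return [(rank[v] - 1) / (m_f - 1) for v in values]

test_2 = ["excellent", "fair", "good", "excellent"]
z = ordinal_to_interval(test_2, ["fair", "good", "excellent"])
print(z)                 # [1.0, 0.0, 0.5, 1.0]
print(abs(z[0] - z[1]))  # 1.0 -- the Euclidean distance in one dimension
```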
Ratio-Scaled Variables

A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula

$$ A e^{Bt} \quad \text{or} \quad A e^{-Bt}, \qquad (7.14) $$

where A and B are positive constants, and t typically represents time. Common examples include the growth of a bacteria population or the decay of a radioactive element.
"How can I compute the dissimilarity between objects described by ratio-scaled variables?" There are three methods to handle ratio-scaled variables for computing the dissimilarity between objects.
Treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a good choice since it is likely that the scale may be distorted.

Apply logarithmic transformation to a ratio-scaled variable f having value x_{if} for object i by using the formula y_{if} = log(x_{if}). The y_{if} values can be treated as interval-valued, as described in Section 7.2.1. Notice that for some ratio-scaled variables, log-log or other transformations may be applied, depending on the variable's definition and the application.

Treat x_{if} as continuous ordinal data and treat their ranks as interval-valued.

The latter two methods are the most effective, although the choice of method used may depend on the given application.
Example 7.5 Dissimilarity between ratio-scaled variables. This time, we have the sample data of Table 7.3, except that only the object-identifier and the ratio-scaled variable, test-3, are available. Let's try a logarithmic transformation. Taking the log of test-3 results in the values 2.65, 1.34, 2.21, and 3.08 for the objects 1 to 4, respectively. Using the Euclidean distance (Equation (7.5)) on the transformed values, we obtain the following dissimilarity matrix:

$$
\begin{bmatrix}
0 & & & \\
1.31 & 0 & & \\
0.44 & 0.87 & 0 & \\
0.43 & 1.74 & 0.87 & 0
\end{bmatrix}
$$
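A sketch of the logarithmic-transformation approach; the raw test-3 values below are chosen so that their base-10 logarithms match the 2.65, 1.34, 2.21, and 3.08 used in the example, and the function name is my own.

```python
import numpy as np

def log_transform_dissimilarity(values):
    """Log-transform ratio-scaled values (y_if = log10(x_if)) and build the
    pairwise Euclidean dissimilarity matrix on the transformed values."""
    y = np.log10(np.asarray(values, dtype=float))
    return np.abs(y[:, None] - y[None, :])   # single variable: |y_i - y_j|

test_3 = [445, 22, 164, 1210]   # logs are about 2.65, 1.34, 2.21, 3.08
print(np.round(log_transform_dissimilarity(test_3), 2))
```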
7.2.4 Variables of Mixed Types
Sections 7.2.1 to 7.2.3 discussed how to compute the dissimilarity between objects described by variables of the same type, where these types may be either interval-scaled, symmetric binary, asymmetric binary, categorical, ordinal, or ratio-scaled. However, in many real databases, objects are described by a mixture of variable types. In general, a database can contain all of the six variable types listed above.
"So, how can we compute the dissimilarity between objects of mixed variable types?" One approach is to group each kind of variable together, performing a separate cluster analysis for each variable type. This is feasible if these analyses derive compatible results. However, in real applications, it is unlikely that a separate cluster analysis per variable type will generate compatible results.

A more preferable approach is to process all variable types together, performing a single cluster analysis. One such technique combines the different variables into a single dissimilarity matrix, bringing all of the meaningful variables onto a common scale of the interval [0.0, 1.0].
Suppose that the data set contains p variables of mixed type. The dissimilarity d(i, j) between objects i and j is defined as

$$ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}, \qquad (7.15) $$

where the indicator δ_{ij}^{(f)} = 0 if either (1) x_{if} or x_{jf} is missing (i.e., there is no measurement of variable f for object i or object j), or (2) x_{if} = x_{jf} = 0 and variable f is asymmetric binary; otherwise, δ_{ij}^{(f)} = 1. The contribution of variable f to the dissimilarity between i and j, that is, d_{ij}^{(f)}, is computed dependent on its type:

If f is interval-based: d_{ij}^{(f)} = |x_{if} - x_{jf}| / (max_h x_{hf} - min_h x_{hf}), where h runs over all nonmissing objects for variable f.

If f is binary or categorical: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}; otherwise d_{ij}^{(f)} = 1.

If f is ordinal: compute the ranks r_{if} and z_{if} = (r_{if} - 1)/(M_f - 1), and treat z_{if} as interval-scaled.

If f is ratio-scaled: either perform logarithmic transformation and treat the transformed data as interval-scaled; or treat f as continuous ordinal data, compute r_{if} and z_{if}, and then treat z_{if} as interval-scaled.

The above steps are identical to what we have already seen for each of the individual variable types. The only difference is for interval-based variables, where here we normalize so that the values map to the interval [0.0, 1.0]. Thus, the dissimilarity between objects can be computed even when the variables describing the objects are of different types.
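A possible implementation sketch of Equation (7.15); the function signature, the dictionary parameters for ranges and numbers of states, and the state labels in the usage example are assumptions of mine rather than the book's notation.

```python
def mixed_dissimilarity(obj_i, obj_j, var_types, ranges=None, m_states=None):
    """Mixed-type dissimilarity (Eq. 7.15). Each applicable variable (delta = 1)
    contributes a value in [0, 1]; contributions are averaged. var_types[f] is
    one of 'interval', 'categorical', 'ordinal' (values given as ranks), or
    'asym_binary'."""
    num, den = 0.0, 0.0
    for f, t in enumerate(var_types):
        x, y = obj_i[f], obj_j[f]
        if x is None or y is None:                     # missing measurement
            continue
        if t == "asym_binary" and x == 0 and y == 0:   # negative match ignored
            continue
        if t == "interval":
            lo, hi = ranges[f]
            d = abs(x - y) / (hi - lo)
        elif t == "ordinal":
            m_f = m_states[f]
            d = abs((x - 1) / (m_f - 1) - (y - 1) / (m_f - 1))
        else:                                          # categorical / binary
            d = 0.0 if x == y else 1.0
        num += d
        den += 1
    return num / den if den else 0.0

# Objects 1 and 2 of Table 7.3: test-1 categorical, test-2 as ordinal ranks
# (M_f = 3), test-3 log-transformed and treated as interval on [1.34, 3.08]
types = ["categorical", "ordinal", "interval"]
obj1 = ["code-A", 3, 2.65]
obj2 = ["code-B", 1, 1.34]
print(round(mixed_dissimilarity(obj1, obj2, types,
                                ranges={2: (1.34, 3.08)},
                                m_states={1: 3}), 2))  # 0.92
```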
Example 7.6 Dissimilarity between variables of mixed type. Let's compute a dissimilarity matrix for the objects of Table 7.3. Now we will consider all of the variables, which are of different types. In Examples 7.3 to 7.5, we worked out the dissimilarity matrices for each of the individual variables. The procedures we followed for test-1 (which is categorical) and test-2 (which is ordinal) are the same as outlined above for processing variables of mixed types. Therefore, we can use the dissimilarity matrices obtained for test-1 and test-2 later when we compute Equation (7.15). First, however, we need to complete some work for test-3 (which is ratio-scaled). We have already applied a logarithmic transformation to its values. Based on the transformed values of 2.65, 1.34, 2.21, and 3.08 obtained for the objects 1 to 4, respectively, we let max_h x_h = 3.08 and min_h x_h = 1.34. We then normalize the values in the dissimilarity matrix obtained in Example 7.5 by dividing each one by (3.08 − 1.34) = 1.74. This results in the following dissimilarity matrix for test-3:

$$
\begin{bmatrix}
0 & & & \\
0.75 & 0 & & \\
0.25 & 0.50 & 0 & \\
0.25 & 1.00 & 0.50 & 0
\end{bmatrix}
$$
We can now use the dissimilarity matrices for the three variables in our computation of Equation (7.15). For example, we get d(2, 1) = (1(1) + 1(1) + 1(0.75))/3 = 0.92. The resulting dissimilarity matrix obtained for the data described by the three variables of mixed types is:

$$
\begin{bmatrix}
0 & & & \\
0.92 & 0 & & \\
0.58 & 0.67 & 0 & \\
0.08 & 1.00 & 0.67 & 0
\end{bmatrix}
$$

If we go back and look at Table 7.3, we can intuitively guess that objects 1 and 4 are the most similar, based on their values for test-1 and test-2. This is confirmed by the dissimilarity matrix, where d(4, 1) is the lowest value for any pair of different objects. Similarly, the matrix indicates that objects 2 and 4 are the least similar.
7.2.5 Vector Objects
In some applications, such as information retrieval, text document clustering, and biological taxonomy, we need to compare and cluster complex objects (such as documents) containing a large number of symbolic entities (such as keywords and phrases). To measure the distance between complex objects, it is often desirable to abandon traditional metric distance computation and introduce a nonmetric similarity function.

There are several ways to define such a similarity function, s(x, y), to compare two vectors x and y. One popular way is to define the similarity function as a cosine measure:

$$ s(x, y) = \frac{x^{t} \cdot y}{\|x\|\,\|y\|}, \qquad (7.16) $$

where x^t is a transposition of vector x, and ||x|| is the Euclidean norm of vector x, which conceptually is the length of the vector.
When variables are binary-valued (0 or 1), the above similarity function can be interpreted in terms of shared features and attributes. Suppose an object x possesses the ith attribute if x_i = 1. Then x^t · y is the number of attributes possessed by both x and y, and ||x|| ||y|| is the geometric mean of the number of attributes possessed by x and the number possessed by y. Thus s(x, y) is a measure of relative possession of common attributes.
Example 7.7 Nonmetric similarity between two objects using cosine. Suppose we are given two vectors, x = (1, 1, 0, 0) and y = (0, 1, 1, 0). By Equation (7.16), the similarity between x and y is

$$ s(x, y) = \frac{0 + 1 + 0 + 0}{\sqrt{2}\,\sqrt{2}} = 0.5. $$

A simple variation of the above measure is

$$ s(x, y) = \frac{x^{t} \cdot y}{x^{t} \cdot x + y^{t} \cdot y - x^{t} \cdot y}, \qquad (7.17) $$

which is known as the Tanimoto coefficient, or Tanimoto distance, and is frequently used in information retrieval and biology taxonomy.
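Both measures are easy to compute directly; the following sketch (function names mine) reproduces the 0.5 of Example 7.7 and also evaluates the Tanimoto coefficient for the same vectors.

```python
import math

def cosine_similarity(x, y):
    """Cosine measure (Eq. 7.16) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def tanimoto(x, y):
    """Tanimoto coefficient: dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

x, y = (1, 1, 0, 0), (0, 1, 1, 0)
print(cosine_similarity(x, y))   # 0.5, as in Example 7.7
print(round(tanimoto(x, y), 3))  # 0.333
```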
Notice that there are many ways to select a particular similarity (or distance) function or normalize the data for cluster analysis. There is no universal standard to guide such selection. The appropriate selection of such measures will heavily depend on the given application. One should bear this in mind and refine the selection of such measures to ensure that the clusters generated are meaningful and useful for the application at hand.
7.3 A Categorization of Major Clustering Methods

Many clustering algorithms exist in the literature. It is difficult to provide a crisp categorization of clustering methods because these categories may overlap, so that a method may have features from several categories. Nevertheless, it is useful to present a relatively organized picture of the different clustering methods.

In general, the major clustering methods can be classified into the following categories.
Partitioning methods: Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Notice that the second requirement can be relaxed in some fuzzy partitioning techniques. References to such techniques are given in the bibliographic notes.

Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects of different clusters are "far apart" or very different. There are various kinds of other criteria for judging the quality of partitions.

To achieve global optimality in partitioning-based clustering would require the exhaustive enumeration of all of the possible partitions. Instead, most applications adopt one of a few popular heuristic methods, such as (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster. These heuristic clustering methods work well for finding spherical-shaped clusters in small to medium-sized databases (a minimal k-means sketch is given after this list of methods). To find clusters with complex shapes and for clustering very large data sets, partitioning-based methods need to be extended. Partitioning-based clustering methods are studied in depth in Section 7.4.
Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.

Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices. However, such techniques cannot correct erroneous decisions. There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object "linkages" at each hierarchical partitioning, such as in Chameleon, or (2) integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation, as in BIRCH. Hierarchical clustering methods are studied in Section 7.5.
Density-based methods: Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the "neighborhood" exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.

DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions. Density-based clustering methods are studied in Section 7.6.
Grid-based methods: Grid-based methods quantize the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure (i.e., on the quantized space). The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space.

STING is a typical example of a grid-based method. WaveCluster applies wavelet transformation for clustering analysis and is both grid-based and density-based. Grid-based clustering methods are studied in Section 7.7.
Model-based methods: Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. It also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods.

EM is an algorithm that performs expectation-maximization analysis based on statistical modeling. COBWEB is a conceptual learning algorithm that performs probability analysis and takes concepts as a model for clusters. SOM (or self-organizing feature map) is a neural network-based algorithm that clusters by mapping high-dimensional data into a 2-D or 3-D feature map, which is also useful for data visualization. Model-based clustering methods are studied in Section 7.8.
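As referenced in the discussion of partitioning methods above, here is a bare-bones sketch of the k-means idea (Lloyd's algorithm). It is an illustration of mine rather than the algorithms studied later in the chapter, and the function name and sample points are made up.

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """A bare-bones k-means: each cluster is represented by the mean of its
    members; points are repeatedly reassigned to the nearest mean and the
    means are recomputed (iterative relocation)."""
    random.seed(seed)
    centers = random.sample(points, k)                 # initial partitioning
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                               # relocation step
            idx = min(range(k), key=lambda c: sum((a - b) ** 2
                      for a, b in zip(p, centers[c])))
            clusters[idx].append(p)
        for c, members in enumerate(clusters):         # recompute the means
            if members:
                centers[c] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return centers, clusters

data = [(1, 1), (1.5, 2), (8, 8), (9, 9), (0.5, 1.2)]
centers, clusters = k_means(data, k=2)
print(centers)
```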
high-The choice of clustering algorithm depends both on the type of data available and onthe particular purpose of the application If cluster analysis is used as a descriptive orexploratory tool, it is possible to try several algorithms on the same data to see what thedata may disclose
Some clustering algorithms integrate the ideas of several clustering methods, so that
it is sometimes difficult to classify a given algorithm as uniquely belonging to only oneclustering method category Furthermore, some applications may have clustering criteriathat require the integration of several clustering techniques
Aside from the above categories of clustering methods, there are two classes of clustering tasks that require special attention. One is clustering high-dimensional data, and the other is constraint-based clustering.
Clustering high-dimensional data is a particularly important task in cluster analysis
because many applications require the analysis of objects containing a large