cancerous patient is not cancerous) is far greater than that of a false positive (incorrectly yet conservatively labeling a noncancerous patient as cancerous). In such cases, we can outweigh one type of error over another by assigning a different cost to each. These costs may consider the danger to the patient, financial costs of resulting therapies, and other hospital costs. Similarly, the benefits associated with a true positive decision may be different from those of a true negative. Up to now, to compute classifier accuracy, we have assumed equal costs and essentially divided the sum of true positives and true negatives by the total number of test tuples. Alternatively, we can incorporate costs and benefits by instead computing the average cost (or benefit) per decision.

Other applications involving cost-benefit analysis include loan application decisions and target marketing mailouts. For example, the cost of loaning to a defaulter greatly exceeds that of the lost business incurred by denying a loan to a nondefaulter. Similarly, in an application that tries to identify households that are likely to respond to mailouts of certain promotional material, the cost of mailouts to numerous households that do not respond may outweigh the cost of lost business from not mailing to households that would have responded. Other costs to consider in the overall analysis include the costs to collect the data and to develop the classification tool.
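To make the cost-sensitive evaluation above concrete, here is a minimal sketch (not from the text) of computing the average cost per decision from a confusion matrix and a user-supplied cost matrix. The function name, the cost values, and the confusion counts are illustrative assumptions, not part of the original material.

```python
# Hypothetical illustration: average cost per decision for a binary classifier.
# Rows of the confusion matrix are actual classes, columns are predicted classes.

def average_cost(confusion, costs):
    """confusion[i][j]: number of tuples of actual class i predicted as class j.
    costs[i][j]: cost (a negative value would be a benefit) of that decision."""
    total_cost = sum(confusion[i][j] * costs[i][j]
                     for i in range(len(confusion))
                     for j in range(len(confusion[i])))
    total_decisions = sum(sum(row) for row in confusion)
    return total_cost / total_decisions

# Assumed numbers: class 0 = noncancerous, class 1 = cancerous.
confusion = [[880, 20],   # actual noncancerous: 880 true negatives, 20 false positives
             [ 10, 90]]   # actual cancerous:    10 false negatives, 90 true positives
costs = [[0.0,  1.0],     # false positive: conservative but incurs follow-up costs
         [50.0, 0.0]]     # false negative: far more costly (a missed cancer)

print(average_cost(confusion, costs))  # lower is better under this cost model
```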
"Are there other cases where accuracy may not be appropriate?" In classification problems, it is commonly assumed that all tuples are uniquely classifiable, that is, that each training tuple can belong to only one class. Yet, owing to the wide diversity of data in large databases, it is not always reasonable to assume that all tuples are uniquely classifiable. Rather, it is more probable to assume that each tuple may belong to more than one class. How then can the accuracy of classifiers on large databases be measured?

The accuracy measure is not appropriate, because it does not take into account the possibility of tuples belonging to more than one class. Rather than returning a class label, it is useful to return a probability class distribution. Accuracy measures may then use a second guess heuristic, whereby a class prediction is judged as correct if it agrees with the first or second most probable class. Although this does take into consideration, to some degree, the nonunique classification of tuples, it is not a complete solution.

6.12.2 Predictor Error Measures

"How can we measure predictor accuracy?" Let $D_T$ be a test set of the form $(X_1, y_1), (X_2, y_2), \ldots, (X_d, y_d)$, where the $X_i$ are the n-dimensional test tuples with associated known values, $y_i$, for a response variable, y, and d is the number of tuples in $D_T$. Since predictors return a continuous value rather than a categorical label, it is difficult to say exactly whether the predicted value, $\hat{y}_i$, for $X_i$ is correct. Instead of focusing on whether $\hat{y}_i$ is an "exact" match with the actual known value $y_i$, we instead look at how far off the predicted value is from the actual known value. Loss functions measure the error between $y_i$ and the predicted value $\hat{y}_i$. The most common loss functions are:

Absolute error: $|y_i - \hat{y}_i|$   (6.59)

Squared error: $(y_i - \hat{y}_i)^2$   (6.60)

Based on the above, the test error (rate), or generalization error, is the average loss over the test set. Thus, we get the following error rates:

Mean absolute error: $\frac{1}{d}\sum_{i=1}^{d} |y_i - \hat{y}_i|$   (6.61)

Mean squared error: $\frac{1}{d}\sum_{i=1}^{d} (y_i - \hat{y}_i)^2$   (6.62)

The mean squared error exaggerates the presence of outliers, while the mean absolute error does not. If we take the square root of the mean squared error, the resulting error measure is called the root mean squared error. This is useful in that it allows the error measured to be of the same magnitude as the quantity being predicted.

Sometimes, we may want the error to be relative to what it would have been if we had just predicted $\bar{y}$, the mean value for y from the training data, D. That is, we can normalize the total loss by dividing by the total loss incurred from always predicting the mean. Relative measures of error include:

Relative absolute error: $\frac{\sum_{i=1}^{d} |y_i - \hat{y}_i|}{\sum_{i=1}^{d} |y_i - \bar{y}|}$   (6.63)

Relative squared error: $\frac{\sum_{i=1}^{d} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}$   (6.64)

where $\bar{y}$ is the mean value of the $y_i$'s of the training data, that is, $\bar{y} = \frac{1}{t}\sum_{i=1}^{t} y_i$, with t the number of training tuples. We can take the root of the relative squared error to obtain the root relative squared error so that the resulting error is of the same magnitude as the quantity predicted. In practice, the choice of error measure does not greatly affect prediction model selection.

6.13 Evaluating the Accuracy of a Classifier or Predictor

How can we use the above measures to obtain a reliable estimate of classifier accuracy (or predictor accuracy in terms of error)?
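As a quick illustration of Equations (6.59)–(6.64), the following sketch (not from the text) computes the mean, root mean, and relative error measures for a list of actual and predicted values; the function name and the sample numbers are made up for the example.

```python
import math

def predictor_errors(actual, predicted, train_mean):
    """Error measures from Section 6.12.2; train_mean is the mean y of the training data."""
    d = len(actual)
    abs_err = [abs(y - yhat) for y, yhat in zip(actual, predicted)]
    sq_err = [(y - yhat) ** 2 for y, yhat in zip(actual, predicted)]
    mae = sum(abs_err) / d                                            # Eq. (6.61)
    mse = sum(sq_err) / d                                             # Eq. (6.62)
    rmse = math.sqrt(mse)                                             # root mean squared error
    rae = sum(abs_err) / sum(abs(y - train_mean) for y in actual)     # Eq. (6.63)
    rse = sum(sq_err) / sum((y - train_mean) ** 2 for y in actual)    # Eq. (6.64)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "RAE": rae, "RSE": rse}

# Illustrative values only.
actual = [3.0, 5.5, 4.2, 6.1]
predicted = [2.8, 5.0, 4.6, 6.5]
print(predictor_errors(actual, predicted, train_mean=4.5))
```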
Holdout, random subsampling, cross-validation, and the bootstrap are common techniques for assessing accuracy based on randomly sampled partitions of the given data. The use of such techniques to estimate accuracy increases the overall computation time, yet is useful for model selection.

[Figure 6.29: Estimating accuracy with the holdout method — the data are partitioned into a training set (used to derive the model) and a test set (used to estimate accuracy).]

6.13.1 Holdout Method and Random Subsampling

The holdout method is what we have alluded to so far in our discussions about accuracy. In this method, the given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set. The training set is used to derive the model, whose accuracy is estimated with the test set (Figure 6.29). The estimate is pessimistic because only a portion of the initial data is used to derive the model.

Random subsampling is a variation of the holdout method in which the holdout method is repeated k times. The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration. (For prediction, we can take the average of the predictor error rates.)

6.13.2 Cross-validation

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds," $D_1, D_2, \ldots, D_k$, each of approximately equal size. Training and testing is performed k times. In iteration i, partition $D_i$ is reserved as the test set, and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets $D_2, \ldots, D_k$ collectively serve as the training set in order to obtain a first model, which is tested on $D_1$; the second iteration is trained on subsets $D_1, D_3, \ldots, D_k$ and tested on $D_2$; and so on. Unlike the holdout and random subsampling methods above, here each sample is used the same number of times for training and once for testing. For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data. For prediction, the error estimate can be computed as the total loss from the k iterations, divided by the total number of initial tuples.

Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial tuples. That is, only one sample is "left out" at a time for the test set. In stratified cross-validation, the folds are stratified so that the class distribution of the tuples in each fold is approximately the same as that in the initial data.

In general, stratified 10-fold cross-validation is recommended for estimating accuracy (even if computation power allows using more folds) due to its relatively low bias and variance.
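Here is a minimal sketch (not from the text) of estimating classification accuracy with plain k-fold cross-validation as just described; the model interface (a fit/predict pair supplied via make_model) is an assumption for illustration, and stratification is omitted for brevity.

```python
import random

def k_fold_accuracy(tuples, labels, make_model, k=10, seed=42):
    """Accuracy estimate: total correct classifications over all k folds / total tuples."""
    indices = list(range(len(tuples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k roughly equal-sized folds
    correct = 0
    for i in range(k):
        test_idx = set(folds[i])
        train_idx = [j for j in indices if j not in test_idx]
        model = make_model()                           # assumed learner factory
        model.fit([tuples[j] for j in train_idx], [labels[j] for j in train_idx])
        for j in test_idx:
            if model.predict(tuples[j]) == labels[j]:
                correct += 1
    return correct / len(tuples)
```

Any learner exposing fit and predict (for example, a decision tree inducer) could be plugged in as make_model.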
6.13.3 Bootstrap

Unlike the accuracy estimation methods mentioned above, the bootstrap method samples the given training tuples uniformly with replacement. That is, each time a tuple is selected, it is equally likely to be selected again and readded to the training set. For instance, imagine a machine that randomly selects tuples for our training set. In sampling with replacement, the machine is allowed to select the same tuple more than once.

There are several bootstrap methods. A commonly used one is the .632 bootstrap, which works as follows. Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a bootstrap sample or training set of d samples. It is very likely that some of the original data tuples will occur more than once in this sample. The data tuples that did not make it into the training set end up forming the test set. Suppose we were to try this out several times. As it turns out, on average, 63.2% of the original data tuples will end up in the bootstrap, and the remaining 36.8% will form the test set (hence the name, .632 bootstrap).

"Where does the figure, 63.2%, come from?" Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is (1 − 1/d). We have to select d times, so the probability that a tuple will not be chosen during this whole time is $(1 - 1/d)^d$. If d is large, the probability approaches $e^{-1} = 0.368$.¹⁴ Thus, 36.8% of tuples will not be selected for training and thereby end up in the test set, and the remaining 63.2% will form the training set.

We can repeat the sampling procedure k times, where in each iteration, we use the current test set to obtain an accuracy estimate of the model obtained from the current bootstrap sample. The overall accuracy of the model is then estimated as

$$Acc(M) = \sum_{i=1}^{k} \big(0.632 \times Acc(M_i)_{test\_set} + 0.368 \times Acc(M_i)_{train\_set}\big), \qquad (6.65)$$

where $Acc(M_i)_{test\_set}$ is the accuracy of the model obtained with bootstrap sample i when it is applied to test set i, and $Acc(M_i)_{train\_set}$ is the accuracy of the model obtained with bootstrap sample i when it is applied to the original set of data tuples. The bootstrap method works well with small data sets.

¹⁴ e is the base of natural logarithms, that is, e = 2.718...

[Figure 6.30: Increasing model accuracy. Bagging and boosting each generate a set of classification or prediction models, $M_1, M_2, \ldots, M_k$. Voting strategies are used to combine the predictions for a given unknown tuple.]

6.14 Ensemble Methods—Increasing the Accuracy

In Section 6.3.3, we saw how pruning can be applied to decision tree induction to help improve the accuracy of the resulting decision trees. Are there general strategies for improving classifier and predictor accuracy?
The answer is yes Bagging and boosting are two such techniques (Figure 6.30) They are examples of ensemble methods, or methods that use a combination of models Each combines a series of k learned models (classifiers or predictors), M1 , M2 , , Mk , with the aim of creating an improved composite model, M∗ Both bagging and boosting can be used for classification as well as prediction 6.14.1 Bagging We first take an intuitive look at how bagging works as a method of increasing accuracy For ease of explanation, we will assume at first that our model is a classifier Suppose that you are a patient and would like to have a diagnosis made based on your symptoms Instead of asking one doctor, you may choose to ask several If a certain diagnosis occurs more than any of the others, you may choose this as the final or best diagnosis That is, the final diagnosis is made based on a majority vote, where each doctor gets an equal vote Now replace each doctor by a classifier, and you have the basic idea behind bagging Intuitively, a majority vote made by a large group of doctors may be more reliable than a majority vote made by a small group Given a set, D, of d tuples, bagging works as follows For iteration i (i = 1, 2, , k), a training set, Di , of d tuples is sampled with replacement from the original set of tuples, D Note that the term bagging stands for bootstrap aggregation Each training set is a bootstrap sample, as described in Section 6.13.3 Because sampling with replacement is used, some 6.14 Ensemble Methods—Increasing the Accuracy 367 Algorithm: Bagging The bagging algorithm—create an ensemble of models (classifiers or predictors) for a learning scheme where each model gives an equally-weighted prediction Input: D, a set of d training tuples; k, the number of models in the ensemble; a learning scheme (e.g., decision tree algorithm, backpropagation, etc.) 
Output: A composite model, M∗.

Method:
(1) for i = 1 to k do // create k models:
(2)     create bootstrap sample, Di, by sampling D with replacement;
(3)     use Di to derive a model, Mi;
(4) endfor

To use the composite model on a tuple, X:
(1) if classification then
(2)     let each of the k models classify X and return the majority vote;
(3) if prediction then
(4)     let each of the k models predict a value for X and return the average predicted value;

Figure 6.31 Bagging.

of the original tuples of D may not be included in Di, whereas others may occur more than once. A classifier model, Mi, is learned for each training set, Di. To classify an unknown tuple, X, each classifier, Mi, returns its class prediction, which counts as one vote. The bagged classifier, M∗, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple. The algorithm is summarized in Figure 6.31.

The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data. It will not be considerably worse and is more robust to the effects of noisy data. The increased accuracy occurs because the composite model reduces the variance of the individual classifiers. For prediction, it was theoretically proven that a bagged predictor will always have improved accuracy over a single predictor derived from D.
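The following sketch (not from the text) mirrors the bagging algorithm of Figure 6.31 for both classification and prediction; the base learner interface and the Counter-based majority vote are illustrative assumptions.

```python
import random
from collections import Counter

def bagging(D, labels, make_model, k=10, seed=1):
    """Train k models, each on a bootstrap sample of (D, labels)."""
    rng = random.Random(seed)
    d = len(D)
    models = []
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]     # sample d tuples with replacement
        model = make_model()                           # assumed learner factory
        model.fit([D[j] for j in idx], [labels[j] for j in idx])
        models.append(model)
    return models

def bagged_classify(models, x):
    """Classification: majority vote of the k models for tuple x."""
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]

def bagged_predict(models, x):
    """Prediction of a continuous value: average of the k predictions."""
    preds = [m.predict(x) for m in models]
    return sum(preds) / len(preds)
```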
6.14.2 Boosting

We now look at the ensemble method of boosting. As in the previous section, suppose that as a patient, you have certain symptoms. Instead of consulting one doctor, you choose to consult several. Suppose you assign weights to the value or worth of each doctor's diagnosis, based on the accuracies of previous diagnoses they have made. The final diagnosis is then a combination of the weighted diagnoses. This is the essence behind boosting.

In boosting, weights are assigned to each training tuple. A series of k classifiers is iteratively learned. After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to "pay more attention" to the training tuples that were misclassified by Mi. The final boosted classifier, M∗, combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy. The boosting algorithm can be extended for the prediction of continuous values.

Adaboost is a popular boosting algorithm. Suppose we would like to boost the accuracy of some learning method. We are given D, a data set of d class-labeled tuples, $(X_1, y_1), (X_2, y_2), \ldots, (X_d, y_d)$, where $y_i$ is the class label of tuple $X_i$. Initially, Adaboost assigns each training tuple an equal weight of 1/d. Generating k classifiers for the ensemble requires k rounds through the rest of the algorithm. In round i, the tuples from D are sampled to form a training set, Di, of size d. Sampling with replacement is used—the same tuple may be selected more than once. Each tuple's chance of being selected is based on its weight. A classifier model, Mi, is derived from the training tuples of Di. Its error is then calculated using Di as a test set. The weights of the training tuples are then adjusted according to how they were classified. If a tuple was incorrectly classified, its weight is increased. If a tuple was correctly classified, its weight is decreased. A tuple's weight reflects how hard it is to classify—the higher the weight, the more often it has been misclassified. These weights will be used to generate the training samples for the classifier of the next round. The basic idea is that when we build a classifier, we want it to focus more on the misclassified tuples of the previous round. Some classifiers may be better at classifying some "hard" tuples than others. In this way, we build a series of classifiers that complement each other. The algorithm is summarized in Figure 6.32.

Algorithm: Adaboost. A boosting algorithm—create an ensemble of classifiers. Each one gives a weighted vote.

Input: D, a set of d class-labeled training tuples; k, the number of rounds (one classifier is generated per round); a classification learning scheme.

Output: A composite model.

Method:
(1) initialize the weight of each tuple in D to 1/d;
(2) for i = 1 to k do // for each round:
(3)     sample D with replacement according to the tuple weights to obtain Di;
(4)     use training set Di to derive a model, Mi;
(5)     compute error(Mi), the error rate of Mi (Equation 6.66);
(6)     if error(Mi) > 0.5 then
(7)         reinitialize the weights to 1/d;
(8)         go back to step (3) and try again;
(9)     endif
(10)    for each tuple in Di that was correctly classified do
(11)        multiply the weight of the tuple by error(Mi)/(1 − error(Mi)); // update weights
(12)    normalize the weight of each tuple;
(13) endfor

To use the composite model to classify tuple X:
(1) initialize the weight of each class to 0;
(2) for i = 1 to k do // for each classifier:
(3)     wi = log((1 − error(Mi))/error(Mi)); // weight of the classifier's vote
(4)     c = Mi(X); // get class prediction for X from Mi
(5)     add wi to the weight for class c;
(6) endfor
(7) return the class with the largest weight;

Figure 6.32 Adaboost, a boosting algorithm.

Now, let's look at some of the math that's involved in the algorithm. To compute the error rate of model Mi, we sum the weights of each of the tuples in Di that Mi misclassified. That is,

$$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j), \qquad (6.66)$$

where $err(X_j)$ is the misclassification error of tuple $X_j$: if the tuple was misclassified, then $err(X_j)$ is 1; otherwise, it is 0. If the performance of classifier Mi is so poor that its error exceeds 0.5, then we abandon it. Instead, we try again by generating a new Di training set, from which we derive a new Mi.

The error rate of Mi affects how the weights of the training tuples are updated. If a tuple in round i was correctly classified, its weight is multiplied by error(Mi)/(1 − error(Mi)). Once the weights of all of the correctly classified tuples are updated, the weights for all tuples (including the misclassified ones) are normalized so that their sum remains the same as it was before. To normalize a weight, we multiply it by the sum of the old weights, divided by the sum of the new weights. As a result, the weights of misclassified tuples are increased and the weights of correctly classified tuples are decreased, as described above.

"Once boosting is complete, how is the ensemble of classifiers used to predict the class label of a tuple, X?" Unlike bagging, where each classifier was assigned an equal vote, boosting assigns a weight to each classifier's vote, based on how well the classifier performed. The lower a classifier's error rate, the more accurate it is, and therefore, the higher its weight for voting should be. The weight of classifier Mi's vote is

$$\log\frac{1 - error(M_i)}{error(M_i)}. \qquad (6.67)$$

For each class, c, we sum the weights of each classifier that assigned class c to X. The class with the highest sum is the "winner" and is returned
as the class prediction for tuple X “How does boosting compare with bagging?” Because of the way boosting focuses on the misclassified tuples, it risks overfitting the resulting composite model to such data Therefore, sometimes the resulting “boosted” model may be less accurate than a single model derived from the same data Bagging is less susceptible to model overfitting While both can significantly improve accuracy in comparison to a single model, boosting tends to achieve greater accuracy 6.15 Model Selection Suppose that we have generated two models, M1 and M2 (for either classification or prediction), from our data We have performed 10-fold cross-validation to obtain a mean error rate for each How can we determine which model is best? It may seem intuitive to select the model with the lowest error rate, however, the mean error rates are just estimates of error on the true population of future data cases There can be considerable variance between error rates within any given 10-fold cross-validation experiment Although the mean error rates obtained for M1 and M2 may appear different, that difference may not be statistically significant What if any difference between the two may just be attributed to chance? This section addresses these questions 6.15.1 Estimating Confidence Intervals To determine if there is any “real” difference in the mean error rates of two models, we need to employ a test of statistical significance In addition, we would like to obtain some confidence limits for our mean error rates so that we can make statements like “any observed mean will not vary by +/− two standard errors 95% of the time for future samples” or “one model is better than the other by a margin of error of +/− 4%.” What we need in order to perform the statistical test? Suppose that for each model, we did 10-fold cross-validation, say, 10 times, each time using a different 10-fold partitioning of the data Each partitioning is independently drawn We can average the 10 error rates obtained each for M1 and M2 , respectively, to obtain the mean error rate for each model For a given model, the individual error rates calculated in the cross-validations may be considered as different, independent samples from a probability distribution In general, they follow a t distribution with k-1 degrees of freedom where, here, k = 10 (This distribution looks very similar to a normal, or Gaussian, distribution even though the functions defining the two are quite different Both are unimodal, symmetric, and bell-shaped.) 
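Anticipating the t statistic defined in Equations (6.68) and (6.69) just below, here is a minimal sketch (not from the text) of the whole comparison procedure for two models evaluated over the same cross-validation rounds; the sample error rates and the use of scipy.stats for the critical-value lookup are illustrative assumptions.

```python
import math
from scipy import stats  # used only as a stand-in for the t-distribution table

def paired_t_test(err1, err2, sig=0.05):
    """err1, err2: error rates of models M1 and M2 over the same k cross-validation rounds."""
    k = len(err1)
    diffs = [a - b for a, b in zip(err1, err2)]
    mean_diff = sum(diffs) / k
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / k          # Eq. (6.69)
    t = mean_diff / math.sqrt(var_diff / k)                          # Eq. (6.68)
    z = stats.t.ppf(1 - sig / 2, df=k - 1)   # upper sig/2 point, k - 1 degrees of freedom
    reject = abs(t) > z
    return t, z, reject   # reject the null hypothesis => difference is statistically significant

# Illustrative error rates from ten cross-validation rounds.
err_m1 = [0.14, 0.16, 0.15, 0.17, 0.13, 0.15, 0.16, 0.14, 0.15, 0.16]
err_m2 = [0.18, 0.17, 0.19, 0.18, 0.17, 0.19, 0.18, 0.20, 0.17, 0.18]
print(paired_t_test(err_m1, err_m2))
```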
This allows us to perform hypothesis testing, where the significance test used is the t-test, or Student's t-test. Our hypothesis is that the two models are the same, or in other words, that the difference in mean error rate between the two is zero. If we can reject this hypothesis (referred to as the null hypothesis), then we can conclude that the difference between the two models is statistically significant, in which case we can select the model with the lower error rate.

In data mining practice, we may often employ a single test set, that is, the same test set can be used for both M1 and M2. In such cases, we do a pairwise comparison of the two models for each 10-fold cross-validation round. That is, for the ith round of 10-fold cross-validation, the same cross-validation partitioning is used to obtain an error rate for M1 and an error rate for M2. Let $err(M_1)_i$ (or $err(M_2)_i$) be the error rate of model M1 (or M2) on round i. The error rates for M1 are averaged to obtain a mean error rate for M1, denoted $\overline{err}(M_1)$. Similarly, we can obtain $\overline{err}(M_2)$. The variance of the difference between the two models is denoted $var(M_1 - M_2)$. The t-test computes the t-statistic with k − 1 degrees of freedom for k samples. In our example we have k = 10 since, here, the k samples are our error rates obtained from ten 10-fold cross-validations for each model. The t-statistic for pairwise comparison is computed as follows:

$$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}, \qquad (6.68)$$

where

$$var(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^{k}\Big[err(M_1)_i - err(M_2)_i - \big(\overline{err}(M_1) - \overline{err}(M_2)\big)\Big]^2. \qquad (6.69)$$

To determine whether M1 and M2 are significantly different, we compute t and select a significance level, sig. In practice, a significance level of 5% or 1% is typically used. We then consult a table for the t distribution, available in standard textbooks on statistics. This table is usually shown arranged by degrees of freedom as rows and significance levels as columns. Suppose we want to ascertain whether the difference between M1 and M2 is significantly different for 95% of the population, that is, sig = 5% or 0.05. We need to find the t distribution value corresponding to k − 1 degrees of freedom (or 9 degrees of freedom for our example) from the table. However, because the t distribution is symmetric, typically only the upper percentage points of the distribution are shown. Therefore, we look up the table value for z = sig/2, which in this case is 0.025, where z is also referred to as a confidence limit. If t > z or t < −z, then our value of t lies in the rejection region, within the tails of the distribution. This means that we can reject the null hypothesis that the means of M1 and M2 are the same and conclude that there is a statistically significant difference between the two models. Otherwise, if we cannot reject the null hypothesis, we then conclude that any difference between M1 and M2 can be attributed to chance.

If two test sets are available instead of a single test set, then a nonpaired version of the t-test is used, where the variance between the means of the two models is estimated as

$$var(M_1 - M_2) = \frac{var(M_1)}{k_1} + \frac{var(M_2)}{k_2}, \qquad (6.70)$$

and $k_1$ and $k_2$ are the number of cross-validation samples (in our case, 10-fold cross-validation rounds) used for M1 and M2, respectively. When consulting the table of t distribution, the number of degrees of freedom used is taken as the minimum number of degrees of freedom of the two models.

7.7 Grid-Based Methods

[Figure 7.14: Examples of center-defined clusters (top row) and arbitrary-shape clusters (bottom row).]

7.7.1 STING:
STatistical INformation Grid STING is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level Statistical information regarding the attributes in each grid cell (such as the mean, maximum, and minimum values) is precomputed and stored These statistical parameters are useful for query processing, as described below Figure 7.15 shows a hierarchical structure for STING clustering Statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level cells These parameters include the following: the attribute-independent parameter, count; the attribute-dependent parameters, mean, stdev (standard deviation), (minimum), max (maximum); and the type of distribution that the attribute value in the cell follows, such as normal, uniform, exponential, or none (if the distribution is unknown) When the data are loaded into the database, the parameters count, mean, stdev, min, and max of the bottom-level cells are calculated directly from the data The value of distribution may either be assigned by the user if the distribution type is known beforehand or obtained by hypothesis tests such as the χ2 test The type of distribution of a higher-level cell can be computed based on the majority of distribution types of its corresponding lower-level cells in conjunction with a threshold filtering process If the distributions of the lowerlevel cells disagree with each other and fail the threshold test, the distribution type of the high-level cell is set to none 426 Chapter Cluster Analysis 1st layer (i-1)-st layer ith layer Figure 7.15 A hierarchical structure for STING clustering “How is this statistical information useful for query answering?” The statistical parameters can be used in a top-down, grid-based method as follows First, a layer within the hierarchical structure is determined from which the query-answering process is to start This layer typically contains a small number of cells For each cell in the current layer, we compute the confidence interval (or estimated range of probability) reflecting the cell’s relevancy to the given query The irrelevant cells are removed from further consideration Processing of the next lower level examines only the remaining relevant cells This process is repeated until the bottom layer is reached At this time, if the query specification is met, the regions of relevant cells that satisfy the query are returned Otherwise, the data that fall into the relevant cells are retrieved and further processed until they meet the requirements of the query “What advantages does STING offer over other clustering methods?” STING offers several advantages: (1) the grid-based computation is query-independent, because the statistical information stored in each cell represents the summary information of the data in the grid cell, independent of the query; (2) the grid structure facilitates parallel processing and incremental updating; and (3) the method’s efficiency is a major advantage: STING goes through the database once to compute the statistical parameters of the cells, and hence the time complexity of generating clusters is O(n), where n is the total number of objects After generating the hierarchical structure, the query processing time is O(g), where g 
is the total number of grid cells at the lowest level, which is usually much smaller than n 7.7 Grid-Based Methods 427 Because STING uses a multiresolution approach to cluster analysis, the quality of STING clustering depends on the granularity of the lowest level of the grid structure If the granularity is very fine, the cost of processing will increase substantially; however, if the bottom level of the grid structure is too coarse, it may reduce the quality of cluster analysis Moreover, STING does not consider the spatial relationship between the children and their neighboring cells for construction of a parent cell As a result, the shapes of the resulting clusters are isothetic; that is, all of the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected This may lower the quality and accuracy of the clusters despite the fast processing time of the technique 7.7.2 WaveCluster: Clustering Using Wavelet Transformation WaveCluster is a multiresolution clustering algorithm that first summarizes the data by imposing a multidimensional grid structure onto the data space It then uses a wavelet transformation to transform the original feature space, finding dense regions in the transformed space In this approach, each grid cell summarizes the information of a group of points that map into the cell This summary information typically fits into main memory for use by the multiresolution wavelet transform and the subsequent cluster analysis A wavelet transform is a signal processing technique that decomposes a signal into different frequency subbands The wavelet model can be applied to d-dimensional signals by applying a one-dimensional wavelet transform d times In applying a wavelet transform, data are transformed so as to preserve the relative distance between objects at different levels of resolution This allows the natural clusters in the data to become more distinguishable Clusters can then be identified by searching for dense regions in the new domain Wavelet transforms are also discussed in Chapter 2, where they are used for data reduction by compression Additional references to the technique are given in the bibliographic notes “Why is wavelet transformation useful for clustering?” It offers the following advantages: It provides unsupervised clustering It uses hat-shaped filters that emphasize regions where the points cluster, while suppressing weaker information outside of the cluster boundaries Thus, dense regions in the original feature space act as attractors for nearby points and as inhibitors for points that are further away This means that the clusters in the data automatically stand out and “clear” the regions around them Thus, another advantage is that wavelet transformation can automatically result in the removal of outliers The multiresolution property of wavelet transformations can help detect clusters at varying levels of accuracy For example, Figure 7.16 shows a sample of twodimensional feature space, where each point in the image represents the attribute or feature values of one object in the spatial data set Figure 7.17 shows the resulting wavelet transformation at different resolutions, from a fine scale (scale 1) to a coarse scale (scale 3) At each level, the four subbands into which the original 428 Chapter Cluster Analysis data are decomposed are shown The subband shown in the upper-left quadrant emphasizes the average neighborhood around each data point The subband in the upper-right quadrant emphasizes the horizontal edges of the data 
The subband in the lower-left quadrant emphasizes the vertical edges, while the subband in the lower-right quadrant emphasizes the corners Wavelet-based clustering is very fast, with a computational complexity of O(n), where n is the number of objects in the database The algorithm implementation can be made parallel WaveCluster is a grid-based and density-based algorithm It conforms with many of the requirements of a good clustering algorithm: It handles large data sets efficiently, discovers clusters with arbitrary shape, successfully handles outliers, is insensitive to the order of input, and does not require the specification of input parameters such as the Figure 7.16 A sample of two-dimensional feature space From [SCZ98] (a) (b) (c) Figure 7.17 Multiresolution of the feature space in Figure 7.16 at (a) scale (high resolution); (b) scale (medium resolution); and (c) scale (low resolution) From [SCZ98] 7.8 Model-Based Clustering Methods 429 number of clusters or a neighborhood radius In experimental studies, WaveCluster was found to outperform BIRCH, CLARANS, and DBSCAN in terms of both efficiency and clustering quality The study also found WaveCluster capable of handling data with up to 20 dimensions 7.8 Model-Based Clustering Methods Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model Such methods are often based on the assumption that the data are generated by a mixture of underlying probability distributions In this section, we describe three examples of model-based clustering Section 7.8.1 presents an extension of the k-means partitioning algorithm, called Expectation-Maximization Conceptual clustering is discussed in Section 7.8.2 A neural network approach to clustering is given in Section 7.8.3 7.8.1 Expectation-Maximization In practice, each cluster can be represented mathematically by a parametric probability distribution The entire data is a mixture of these distributions, where each individual distribution is typically referred to as a component distribution We can therefore cluster the data using a finite mixture density model of k probability distributions, where each distribution represents a cluster The problem is to estimate the parameters of the probability distributions so as to best fit the data Figure 7.18 is an example of a simple finite mixture density model There are two clusters Each follows a normal or Gaussian distribution with its own mean and standard deviation The EM (Expectation-Maximization) algorithm is a popular iterative refinement algorithm that can be used for finding the parameter estimates It can be viewed as an extension of the k-means paradigm, which assigns an object to the cluster with which it is most similar, based on the cluster mean (Section 7.4.1) Instead of assigning each object to a dedicated cluster, EM assigns each object to a cluster according to a weight representing the probability of membership In other words, there are no strict boundaries between clusters Therefore, new means are computed based on weighted measures EM starts with an initial estimate or “guess” of the parameters of the mixture model (collectively referred to as the parameter vector) It iteratively rescores the objects against the mixture density produced by the parameter vector The rescored objects are then used to update the parameter estimates Each object is assigned a probability that it would possess a certain set of attribute values given that it was a member of a given cluster The algorithm is 
described as follows:

1. Make an initial guess of the parameter vector: This involves randomly selecting k objects to represent the cluster means or centers (as in k-means partitioning), as well as making guesses for the additional parameters.

2. Iteratively refine the parameters (or clusters) based on the following two steps:

(a) Expectation Step: Assign each object $x_i$ to cluster $C_k$ with the probability

$$P(x_i \in C_k) = p(C_k \mid x_i) = \frac{p(C_k)\, p(x_i \mid C_k)}{p(x_i)}, \qquad (7.36)$$

where $p(x_i \mid C_k) = N(m_k, E_k(x_i))$ follows the normal (i.e., Gaussian) distribution around mean, $m_k$, with expectation, $E_k$. In other words, this step calculates the probability of cluster membership of object $x_i$, for each of the clusters. These probabilities are the "expected" cluster memberships for object $x_i$.

(b) Maximization Step: Use the probability estimates from above to re-estimate (or refine) the model parameters. For example,

$$m_k = \frac{1}{n}\sum_{i=1}^{n} \frac{x_i\, P(x_i \in C_k)}{\sum_j P(x_i \in C_j)}. \qquad (7.37)$$

This step is the "maximization" of the likelihood of the distributions given the data.

[Figure 7.18: Each cluster can be represented by a probability distribution, centered at a mean, and with a standard deviation. Here, we have two clusters, corresponding to the Gaussian distributions $g(m_1, \sigma_1)$ and $g(m_2, \sigma_2)$, respectively, where the dashed circles represent the first standard deviation of the distributions.]

The EM algorithm is simple and easy to implement. In practice, it converges fast but may not reach the global optima. Convergence is guaranteed for certain forms of optimization functions. The computational complexity is linear in d (the number of input features), n (the number of objects), and t (the number of iterations).
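Here is a minimal sketch (not from the text) of the EM iteration just described for a one-dimensional, two-component Gaussian mixture. The E-step computes the membership probabilities in the spirit of Equation (7.36); the M-step uses the standard weighted-mean updates rather than the exact expression of Equation (7.37), and the data values are made up for illustration.

```python
import math
import random

def gaussian(x, mean, std):
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def em_1d(data, k=2, iters=50, seed=0):
    """EM for a 1-D Gaussian mixture: E-step (cluster memberships), M-step (parameter updates)."""
    rng = random.Random(seed)
    means = rng.sample(data, k)            # initial guess: k random objects as cluster means
    stds = [1.0] * k
    priors = [1.0 / k] * k                 # p(Ck)
    for _ in range(iters):
        # Expectation step: membership probability of each object for each cluster.
        resp = []
        for x in data:
            w = [priors[j] * gaussian(x, means[j], stds[j]) for j in range(k)]
            s = sum(w) or 1e-12
            resp.append([wj / s for wj in w])
        # Maximization step: re-estimate means, standard deviations, and priors.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            stds[j] = math.sqrt(sum(r[j] * (x - means[j]) ** 2
                                    for r, x in zip(resp, data)) / nj) or 1e-3
            priors[j] = nj / len(data)
    return means, stds, priors

# Illustrative data drawn around two rough centers.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
print(em_1d(data))
```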
Bayesian clustering methods focus on the computation of class-conditional probability density. They are commonly used in the statistics community. In industry, AutoClass is a popular Bayesian clustering method that uses a variant of the EM algorithm. The best clustering maximizes the ability to predict the attributes of an object given the correct cluster of the object. AutoClass can also estimate the number of clusters. It has been applied to several domains and was able to discover a new class of stars based on infrared astronomy data. Further references are provided in the bibliographic notes.

7.8.2 Conceptual Clustering

Conceptual clustering is a form of clustering in machine learning that, given a set of unlabeled objects, produces a classification scheme over the objects. Unlike conventional clustering, which primarily identifies groups of like objects, conceptual clustering goes one step further by also finding characteristic descriptions for each group, where each group represents a concept or class. Hence, conceptual clustering is a two-step process: clustering is performed first, followed by characterization. Here, clustering quality is not solely a function of the individual objects. Rather, it incorporates factors such as the generality and simplicity of the derived concept descriptions.

Most methods of conceptual clustering adopt a statistical approach that uses probability measurements in determining the concepts or clusters. Probabilistic descriptions are typically used to represent each derived concept.

COBWEB is a popular and simple method of incremental conceptual clustering. Its input objects are described by categorical attribute-value pairs. COBWEB creates a hierarchical clustering in the form of a classification tree. "But what is a classification tree? Is it the same as a decision tree?" Figure 7.19 shows a classification tree for a set of animal data. A classification tree differs from a decision tree. Each node in a classification tree refers to a concept and contains a probabilistic description of that concept, which summarizes the objects classified under the node. The probabilistic description includes the probability of the concept and conditional probabilities of the form $P(A_i = v_{ij} \mid C_k)$, where $A_i = v_{ij}$ is an attribute-value pair (that is, the ith attribute takes its jth possible value) and $C_k$ is the concept class. (Counts are accumulated and stored at each node for computation of the probabilities.) This is unlike decision trees, which label branches rather than nodes and use logical rather than probabilistic descriptors.³ The sibling nodes at a given level of a classification tree are said to form a partition. To classify an object using a classification tree, a partial matching function is employed to descend the tree along a path of "best" matching nodes.

³ Decision trees are described in Chapter 6.

[Figure 7.19: A classification tree for a set of animal data (based on [Fis87]). The root concept Animal (P(C0) = 1.0, P(scales|C0) = 0.25) has children Fish (P(C1) = 0.25, P(scales|C1) = 1.0), Amphibian (P(C2) = 0.25, P(moist|C2) = 1.0), and Mammal/bird (P(C3) = 0.5, P(hair|C3) = 0.5); the latter has children Mammal (P(C4) = 0.5, P(hair|C4) = 1.0) and Bird (P(C5) = 0.5, P(feathers|C5) = 1.0).]

COBWEB uses a heuristic evaluation measure called category utility to guide construction of the tree. Category utility (CU) is defined as

$$CU = \frac{\sum_{k=1}^{n} P(C_k)\left[\sum_i \sum_j P(A_i = v_{ij} \mid C_k)^2 - \sum_i \sum_j P(A_i = v_{ij})^2\right]}{n}, \qquad (7.38)$$

where n is the number of nodes, concepts, or "categories" forming a partition, $\{C_1, C_2, \ldots, C_n\}$, at the given level of the tree. In other words, category utility is the increase in the expected number of attribute values that can be correctly guessed given a partition (where this expected number corresponds to the term $P(C_k)\sum_i \sum_j P(A_i = v_{ij} \mid C_k)^2$) over the expected number of correct guesses with no such knowledge (corresponding to the term $\sum_i \sum_j P(A_i = v_{ij})^2$). Although we do not have room to show the derivation, category utility rewards intraclass similarity and interclass dissimilarity, where:

Intraclass similarity is the probability $P(A_i = v_{ij} \mid C_k)$. The larger this value is, the greater the proportion of class members that share this attribute-value pair and the more predictable the pair is of class members.

Interclass dissimilarity is the probability $P(C_k \mid A_i = v_{ij})$. The larger this value is, the fewer the objects in contrasting classes that share this attribute-value pair and the more predictive the pair is of the class.

Let's look at how COBWEB works. COBWEB incrementally incorporates objects into a classification tree. "Given a new object, how does COBWEB decide where to incorporate it into the classification tree?" COBWEB descends the tree along an appropriate path, updating counts along the way, in search of the "best host" or node at which to classify the object. This decision is based on temporarily placing the object in each node and computing the category utility of the resulting partition. The placement that results in the highest category utility should be a good host for the object.

"What if the object does not really belong to any of the concepts represented in the tree so far?"
What if it is better to create a new node for the given object?” That is a good point In fact, COBWEB also computes the category utility of the partition that would result if a new node were to be created for the object This is compared to the above computation based on the existing nodes The object is then placed in an existing class, or a new class is created for it, based on the partition with the highest category utility value Notice that COBWEB has the ability to automatically adjust the number of classes in a partition It does not need to rely on the user to provide such an input parameter The two operators mentioned above are highly sensitive to the input order of the object COBWEB has two additional operators that help make it less sensitive to input order These are merging and splitting When an object is incorporated, the two best hosts are considered for merging into a single class Furthermore, COBWEB considers splitting the children of the best host among the existing categories These decisions are based on category utility The merging and splitting operators allow COBWEB to perform a bidirectional search—for example, a merge can undo a previous split COBWEB has a number of limitations First, it is based on the assumption that probability distributions on separate attributes are statistically independent of one another This assumption is, however, not always true because correlation between attributes often exists Moreover, the probability distribution representation of clusters makes it quite expensive to update and store the clusters This is especially so when the attributes have a large number of values because the time and space complexities depend not only on the number of attributes, but also on the number of values for each attribute Furthermore, the classification tree is not height-balanced for skewed input data, which may cause the time and space complexity to degrade dramatically CLASSIT is an extension of COBWEB for incremental clustering of continuous (or real-valued) data It stores a continuous normal distribution (i.e., mean and standard deviation) for each individual attribute in each node and uses a modified category utility measure that is an integral over continuous attributes instead of a sum over discrete attributes as in COBWEB However, it suffers similar problems as COBWEB and thus is not suitable for clustering large database data Conceptual clustering is popular in the machine learning community However, the method does not scale well for large data sets 7.8.3 Neural Network Approach The neural network approach is motivated by biological neural networks.4 Roughly speaking, a neural network is a set of connected input/output units, where each connection has a weight associated with it Neural networks have several properties that make them popular for clustering First, neural networks are inherently parallel and distributed processing architectures Second, neural networks learn by adjusting their interconnection weights so as to best fit the data This allows them to “normalize” or “prototype” Neural networks were also introduced in Chapter on classification and prediction 434 Chapter Cluster Analysis the patterns and act as feature (or attribute) extractors for the various clusters Third, neural networks process numerical vectors and require object patterns to be represented by quantitative features only Many clustering tasks handle only numerical data or can transform their data into quantitative features if needed The neural network approach to 
clustering tends to represent each cluster as an exemplar An exemplar acts as a “prototype” of the cluster and does not necessarily have to correspond to a particular data example or object New objects can be distributed to the cluster whose exemplar is the most similar, based on some distance measure The attributes of an object assigned to a cluster can be predicted from the attributes of the cluster’s exemplar Self-organizing feature maps (SOMs) are one of the most popular neural network methods for cluster analysis They are sometimes referred to as Kohonen self-organizing feature maps, after their creator, Teuvo Kohonon, or as topologically ordered maps SOMs’ goal is to represent all points in a high-dimensional source space by points in a low-dimensional (usually 2-D or 3-D) target space, such that the distance and proximity relationships (hence the topology) are preserved as much as possible The method is particularly useful when a nonlinear mapping is inherent in the problem itself SOMs can also be viewed as a constrained version of k-means clustering, in which the cluster centers tend to lie in a low-dimensional manifold in the feature or attribute space With SOMs, clustering is performed by having several units competing for the current object The unit whose weight vector is closest to the current object becomes the winning or active unit So as to move even closer to the input object, the weights of the winning unit are adjusted, as well as those of its nearest neighbors SOMs assume that there is some topology or ordering among the input objects and that the units will eventually take on this structure in space The organization of units is said to form a feature map SOMs are believed to resemble processing that can occur in the brain and are useful for visualizing high-dimensional data in 2-D or 3-D space The SOM approach has been used successfully for Web document clustering The left graph of Figure 7.20 shows the result of clustering 12,088 Web articles from the usenet newsgroup comp.ai.neural-nets using the SOM approach, while the right graph of the figure shows the result of drilling down on the keyword: “mining.” The neural network approach to clustering has strong theoretical links with actual brain processing Further research is required to make it more effective and scalable in large databases due to long processing times and the intricacies of complex data 7.9 Clustering High-Dimensional Data Most clustering methods are designed for clustering low-dimensional data and encounter challenges when the dimensionality of the data grows really high (say, over 10 dimensions, or even over thousands of dimensions for some tasks) This is because when the dimensionality increases, usually only a small number of dimensions are relevant to 7.9 Clustering High-Dimensional Data universities interpreted warren costa neuron phoneme extrapolation weightless paradigm brain neurotransmitters noise decay hidden personnel neurofuzzy sigmoid cture annealing papers tools alizing annealing snns mining bayes tdl unsupervised benchmark validation saturation analysts toolbox levenberg-narquardt scheduling mining mining fortran robot backpropagator’s encoding conjugate trading rbf variable missing exploration judgement principle packages interactions neurocomputing alamos consciousness atree x pdp packa bootstrap curves intelligence signals genesis programmer elman aisb signal engines java ga workshop postdoctoral connect 435 popular rate Figure 7.20 The result of SOM clustering of 12,088 Web articles 
on comp.ai.neural-nets (left), and of drilling down on the keyword: “mining” (right) Based on http://websom.hut.fi/websom/comp.ai.neural-nets-new/html/root.html certain clusters, but data in the irrelevant dimensions may produce much noise and mask the real clusters to be discovered Moreover, when dimensionality increases, data usually become increasingly sparse because the data points are likely located in different dimensional subspaces When the data become really sparse, data points located at different dimensions can be considered as all equally distanced, and the distance measure, which is essential for cluster analysis, becomes meaningless To overcome this difficulty, we may consider using feature (or attribute) transformation and feature (or attribute) selection techniques Feature transformation methods, such as principal component analysis5 and singular value decomposition,6 transform the data onto a smaller space while generally preserving Principal component analysis was introduced in Chapter as a method of dimensionality reduction Singular value decomposition is discussed in Chapter 436 Chapter Cluster Analysis the original relative distance between objects They summarize data by creating linear combinations of the attributes, and may discover hidden structures in the data However, such techniques not actually remove any of the original attributes from analysis This is problematic when there are a large number of irrelevant attributes The irrelevant information may mask the real clusters, even after transformation Moreover, the transformed features (attributes) are often difficult to interpret, making the clustering results less useful Thus, feature transformation is only suited to data sets where most of the dimensions are relevant to the clustering task Unfortunately, real-world data sets tend to have many highly correlated, or redundant, dimensions Another way of tackling the curse of dimensionality is to try to remove some of the dimensions Attribute subset selection (or feature subset selection7 ) is commonly used for data reduction by removing irrelevant or redundant dimensions (or attributes) Given a set of attributes, attribute subset selection finds the subset of attributes that are most relevant to the data mining task Attribute subset selection involves searching through various attribute subsets and evaluating these subsets using certain criteria It is most commonly performed by supervised learning—the most relevant set of attributes are found with respect to the given class labels It can also be performed by an unsupervised process, such as entropy analysis, which is based on the property that entropy tends to be low for data that contain tight clusters Other evaluation functions, such as category utility, may also be used Subspace clustering is an extension to attribute subset selection that has shown its strength at high-dimensional clustering It is based on the observation that different subspaces may contain different, meaningful clusters Subspace clustering searches for groups of clusters within different subspaces of the same data set The problem becomes how to find such subspace clusters effectively and efficiently In this section, we introduce three approaches for effective clustering of high-dimensional data: dimension-growth subspace clustering, represented by CLIQUE, dimension-reduction projected clustering, represented by PROCLUS, and frequent patternbased clustering, represented by pCluster 7.9.1 CLIQUE: A Dimension-Growth Subspace Clustering Method 
CLIQUE (CLustering In QUEst) was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space In dimension-growth subspace clustering, the clustering process starts at single-dimensional subspaces and grows upward to higher-dimensional ones Because CLIQUE partitions each dimension like a grid structure and determines whether a cell is dense based on the number of points it contains, it can also be viewed as an integration of density-based and grid-based clustering methods However, its overall approach is typical of subspace clustering for high-dimensional space, and so it is introduced in this section Attribute subset selection is known in the machine learning literature as feature subset selection It was discussed in Chapter 7.9 Clustering High-Dimensional Data 437 The ideas of the CLIQUE clustering algorithm are outlined as follows Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points CLIQUE’s clustering identifies the sparse and the “crowded” areas in space (or units), thereby discovering the overall distribution patterns of the data set A unit is dense if the fraction of total data points contained in it exceeds an input model parameter In CLIQUE, a cluster is defined as a maximal set of connected dense units “How does CLIQUE work?” CLIQUE performs multidimensional clustering in two steps In the first step, CLIQUE partitions the d-dimensional data space into nonoverlapping rectangular units, identifying the dense units among these This is done (in 1-D) for each dimension For example, Figure 7.21 shows dense rectangular units found with respect to age for the dimensions salary and (number of weeks of) vacation The subspaces representing these dense units are intersected to form a candidate search space in which dense units of higher dimensionality may exist “Why does CLIQUE confine its search for dense units of higher dimensionality to the intersection of the dense units in the subspaces?” The identification of the candidate search space is based on the Apriori property used in association rule mining.8 In general, the property employs prior knowledge of items in the search space so that portions of the space can be pruned The property, adapted for CLIQUE, states the following: If a k-dimensional unit is dense, then so are its projections in (k−1)-dimensional space That is, given a k-dimensional candidate dense unit, if we check its (k −1)-th projection units and find any that are not dense, then we know that the kth dimensional unit cannot be dense either Therefore, we can generate potential or candidate dense units in k-dimensional space from the dense units found in (k − 1)-dimensional space In general, the resulting space searched is much smaller than the original space The dense units are then examined in order to determine the clusters In the second step, CLIQUE generates a minimal description for each cluster as follows For each cluster, it determines the maximal region that covers the cluster of connected dense units It then determines a minimal cover (logic description) for each cluster “How effective is CLIQUE?” CLIQUE automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces It is insensitive to the order of input objects and does not presume any canonical data distribution It scales linearly with the size of input and has good scalability as the number of dimensions in the data is increased However, obtaining 
meaningful clustering results is dependent on Association rule mining is described in detail in Chapter In particular, the Apriori property is described in Section 5.2.1 The Apriori property can also be used for cube computation, as described in Chapter Chapter Cluster Analysis salary (10,000) 20 30 40 50 60 age 20 30 40 50 60 age vacation (week) vacation 50 age lar y 30 sa 438 Figure 7.21 Dense units found with respect to age for the dimensions salary and vacation are intersected in order to provide a candidate search space for dense units of higher dimensionality 7.9 Clustering High-Dimensional Data 439 proper tuning of the grid size (which is a stable structure here) and the density threshold This is particularly difficult because the grid size and density threshold are used across all combinations of dimensions in the data set Thus, the accuracy of the clustering results may be degraded at the expense of the simplicity of the method Moreover, for a given dense region, all projections of the region onto lower-dimensionality subspaces will also be dense This can result in a large overlap among the reported dense regions Furthermore, it is difficult to find clusters of rather different density within different dimensional subspaces Several extensions to this approach follow a similar philosophy For example, let’s think of a grid as a set of fixed bins Instead of using fixed bins for each of the dimensions, we can use an adaptive, data-driven strategy to dynamically determine the bins for each dimension based on data distribution statistics Alternatively, instead of using a density threshold, we would use entropy (Chapter 6) as a measure of the quality of subspace clusters 7.9.2 PROCLUS: A Dimension-Reduction Subspace Clustering Method PROCLUS (PROjected CLUStering) is a typical dimension-reduction subspace clustering method That is, instead of starting from single-dimensional spaces, it starts by finding an initial approximation of the clusters in the high-dimensional attribute space Each dimension is then assigned a weight for each cluster, and the updated weights are used in the next iteration to regenerate the clusters This leads to the exploration of dense regions in all subspaces of some desired dimensionality and avoids the generation of a large number of overlapped clusters in projected dimensions of lower dimensionality PROCLUS finds the best set of medoids by a hill-climbing process similar to that used in CLARANS, but generalized to deal with projected clustering It adopts a distance measure called Manhattan segmental distance, which is the Manhattan distance on a set of relevant dimensions The PROCLUS algorithm consists of three phases: initialization, iteration, and cluster refinement In the initialization phase, it uses a greedy algorithm to select a set of initial medoids that are far apart from each other so as to ensure that each cluster is represented by at least one object in the selected set More concretely, it first chooses a random sample of data points proportional to the number of clusters we wish to generate, and then applies the greedy algorithm to obtain an even smaller final subset for the next phase The iteration phase selects a random set of k medoids from this reduced set (of medoids), and replaces “bad” medoids with randomly chosen new medoids if the clustering is improved For each medoid, a set of dimensions is chosen whose average distances are small compared to statistical expectation The total number of dimensions associated to medoids must be k × l, 
where l is an input parameter that selects the average dimensionality of cluster subspaces The refinement phase computes new dimensions for each medoid based on the clusters found, reassigns points to medoids, and removes outliers ... senior senior junior 31 35 26 30 31 35 21 25 31 35 26 30 41 45 36 40 31 35 46K 50K 26K 30K 31K 35K 46K 50K 66 K 70K 46K 50K 66 K 70K 46K 50K 41K 45K 30 40 40 20... 1.0 6. 16 Summary 373 1.0 true positive rate 0.8 0 .6 0.4 0.2 0.0 0.0 0.2 0.4 0 .6 0.8 1.0 false positive rate Figure 6. 33 The ROC curves of two classification models 6. 16 Summary Classification and. .. Carbonell, and Mitchell [MCM83,MCM 86] , Kodratoff and Michalski [KM90], Shavlik and Dietterich [SD90], and Michalski and Tecuci [MT94] For a presentation of machine learning with respect to data mining
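To illustrate the Manhattan segmental distance that PROCLUS uses (described above), here is a minimal sketch (not from the text); averaging over the relevant dimensions follows the usual PROCLUS definition and should be treated as an assumption, as should the example points and relevant-dimension set.

```python
def manhattan_segmental_distance(x, y, dims):
    """Manhattan distance restricted to the relevant dimensions 'dims' of a cluster,
    averaged over those dimensions (the averaging is an assumption here)."""
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)

# Illustrative 5-dimensional points; dimensions 0, 2, and 4 assumed relevant to the cluster.
p = [1.0, 9.0, 2.0, 7.0, 3.0]
q = [1.5, 0.0, 2.5, 4.0, 2.0]
print(manhattan_segmental_distance(p, q, dims=[0, 2, 4]))  # (0.5 + 0.5 + 1.0) / 3
```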