as radial basis function kernels), this centering is equivalent to centering a distance matrix in feature space. Williams (2001) further points out that for these kernels, classical MDS in feature space is equivalent to a form of metric MDS in input space. Although ostensibly kernel PCA gives a function that can be applied to test points, while MDS does not, kernel PCA does so by using the Nyström approximation (see Section 4.2.1), and exactly the same can be done with MDS.

The subject of feature extraction and dimensional reduction is vast. In this review I have limited the discussion to mostly geometric methods, and even with that restriction it is far from complete, so I would like to alert the reader to three other interesting leads. The first is the method of principal curves, where the idea is to find the smooth curve that passes through the data in such a way that the sum of shortest distances from each point to the curve is minimized, thus providing a nonlinear, one-dimensional summary of the data (Hastie and Stuetzle, 1989); the idea has since been extended by applying various regularization schemes (including kernel-based ones), and to manifolds of higher dimension (Schölkopf and Smola, 2002). Second, competitions have been held at recent NIPS workshops on feature extraction, and the reader can find a wealth of information there (Guyon, 2003). Finally, recent work on object detection has shown that boosting, where each weak learner uses a single feature, can be a very effective method for finding a small set of good (and mutually complementary) features from a large pool of possible features (Viola and Jones, 2001).

Acknowledgments

I thank John Platt for valuable discussions. Thanks also to Lawrence Saul, Bernhard Schölkopf, Jay Stokes and Mike Tipping for commenting on the manuscript.

References

M.A. Aizerman, E.M. Braverman, and L.I. Rozoner. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
P.F. Baldi and K. Hornik. Learning in linear neural networks: A survey. IEEE Transactions on Neural Networks, 6(4):837–858, July 1995.
A. Basilevsky. Statistical Factor Analysis and Related Methods. Wiley, New York, 1994.
M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
Y. Bengio, J. Paiement, and P. Vincent. Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps and spectral clustering. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.
C. Berg, J.P.R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, 1984.
C.M. Bishop. Bayesian PCA. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 382–388, Cambridge, MA, 1999. The MIT Press.
I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, 1997.
B.E. Boser, I.M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, 1992. ACM.
C.J.C. Burges. Some notes on applied mathematics for machine learning. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures on Machine Learning, pages 21–40. Springer Lecture Notes in Artificial Intelligence, 2004.
C.J.C. Burges, J.C. Platt, and S. Jana. Extracting noise-robust features from audio. In Proc. IEEE Conference on Acoustics, Speech and Signal Processing, pages 1021–1024. IEEE Signal Processing Society, 2002.
C.J.C. Burges, J.C. Platt, and S. Jana. Distortion discriminant analysis for audio fingerprinting. IEEE Transactions on Speech and Audio Processing, 11(3):165–174, 2003.
F.R.K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
T.F. Cox and M.A.A. Cox. Multidimensional Scaling. Chapman and Hall, 2001.
R.B. Darlington. Factor analysis. Technical report, Cornell University, http://comp9.psych.cornell.edu/Darlington/factor.htm.
V. de Silva and J.B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 705–712. MIT Press, 2002.
P. Diaconis and D. Freedman. Asymptotics of graphical projection pursuit. Annals of Statistics, 12:793–815, 1984.
K.I. Diamantaras and S.Y. Kung. Principal Component Neural Networks. John Wiley, 1996.
R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. John Wiley, 1973.
C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nyström method. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(2), 2004.
J.H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817–823, 1981.
J.H. Friedman, W. Stuetzle, and A. Schroeder. Projection pursuit density estimation. J. Amer. Statistical Assoc., 79:599–608, 1984.
J.H. Friedman and J.W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23(9):881–890, 1974.
G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins, third edition, 1996.
M. Gondran and M. Minoux. Graphs and Algorithms. John Wiley and Sons, 1984.
I. Guyon. NIPS 2003 workshop on feature extraction: http://clopinet.com/isabelle/Projects/NIPS2003/.
J. Ham, D.D. Lee, S. Mika, and B. Schölkopf. A kernel view of dimensionality reduction of manifolds. In Proceedings of the International Conference on Machine Learning, 2004.
T.J. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502–516, 1989.
R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
P.J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, 1985.
A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, 2001.
Y. LeCun and Y. Bengio. Convolutional networks for images, speech and time-series. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.
M. Meila and J. Shi. Learning segmentation by random walks. In Advances in Neural Information Processing Systems, pages 873–879, 2000.
S. Mika, B. Schölkopf, A.J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems 11. MIT Press, 1999.
A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
J. Platt. Private communication.
J. Platt. FastMap, MetricMap, and Landmark MDS are all Nyström algorithms. In Z. Ghahramani and R. Cowell, editors, Proc. 10th International Conference on Artificial Intelligence and Statistics, 2005.
W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1992.
S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(22):2323–2326, 2000.
I.J. Schoenberg. Remarks to Maurice Fréchet's article "Sur la définition axiomatique d'une classe d'espace distanciés vectoriellement applicable sur l'espace de Hilbert". Annals of Mathematics, 36:724–732, 1935.
B. Schölkopf. The kernel trick for distances. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 301–307. MIT Press, 2001.
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
C.E. Spearman. 'General intelligence' objectively determined and measured. American Journal of Psychology, 5:201–293, 1904.
C.J. Stone. Optimal global rates of convergence for nonparametric regression. Annals of Statistics, 10(4):1040–1053, 1982.
J.B. Tenenbaum. Mapping a manifold of perceptual observations. In M.I. Jordan, M.J. Kearns, and S.A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.
M.E. Tipping and C.M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, 61(3):611, 1999a.
M.E. Tipping and C.M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443–482, 1999b.
P. Viola and M. Jones. Robust real-time object detection. In Second International Workshop on Statistical and Computational Theories of Vision: Modeling, Learning, Computing, and Sampling, 2001.
S. Wilks. Mathematical Statistics. John Wiley, 1962.
C.K.I. Williams. On a connection between kernel PCA and metric multidimensional scaling. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 675–681. MIT Press, 2001.
C.K.I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Leen, Dietterich, and Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.

5 Dimension Reduction and Feature Selection

Barak Chizi and Oded Maimon
Tel-Aviv University

Summary. Data Mining algorithms search for meaningful patterns in raw data sets. The Data Mining process requires high computational cost when dealing with large data sets. Reducing dimensionality (the number of attributes or the number of records) can effectively cut this cost. This chapter focuses on a pre-processing step which removes dimensions from a given data set before it is fed to a data mining algorithm. This work explains how it is often possible to reduce dimensionality with minimal loss of information. A clear taxonomy of dimension reduction is described, and techniques for dimension reduction are presented theoretically.

Key words: Dimension Reduction, Preprocessing

5.1 Introduction

Data Mining algorithms are used to search for meaningful patterns in raw data sets. Dimensionality (i.e., the number of data set attributes or groups of attributes) constitutes a serious obstacle to the efficiency of most Data Mining algorithms.
This obstacle is sometimes known as the "curse of dimensionality" (Elder and Pregibon, 1996). Techniques that are quite efficient in low dimensions (e.g., nearest neighbors) cannot provide meaningful results when the number of attributes goes beyond a 'modest' size of 10. Data Mining algorithms are also computationally intensive.

Figure 5.1 describes the typical trade-off between the error rate of a Data Mining model and the cost of obtaining the model (in particular, the model may be a classification model). The cost is a function of the theoretical complexity of the Data Mining algorithm that derives the model, and is correlated with the time required for the algorithm to run and with the size of the data set. When discussing dimension reduction, given a set of records, the size of the data set is defined as the number of attributes, and is often used as an estimator of the mining cost.

Fig. 5.1. Typical cost-error relation in classification models.

Theoretically, knowing the exact functional relation between the cost and the error may point out the ideal classifier, i.e., a classifier that produces the minimal error rate $\varepsilon^*$ at a cost $h^*$ to derive. On some occasions, one might prefer to use an inferior classifier that uses only part of the data ($h \leq h^*$) and produces an increased error rate. In practice, the exact trade-off curve of Figure 5.1 is seldom known, and generating it might be computationally prohibitive. The objective of dimension reduction in Data Mining domains is to identify the smallest cost at which a Data Mining algorithm can keep the error rate below $\varepsilon_f$ (this error rate is sometimes referred to as the efficiency frontier).

Feature selection is a problem closely related to dimension reduction. The objective of feature selection is to identify features in the data set as important, and to discard any other feature as irrelevant and redundant information. Since feature selection reduces the dimensionality of the data, it holds out the possibility of more effective and rapid operation of Data Mining algorithms (i.e., Data Mining algorithms can be run faster and more effectively by using feature selection). In some cases, as a result of feature selection, accuracy on future classification can be improved; in other instances, the result is a more compact, easily interpreted representation of the target concept (Hall, 1999). On the other hand, feature selection is a costly process, and it contradicts the initial assumption that all of the information (i.e., all attributes) is required in order to achieve maximum accuracy, namely that while some attributes are less important than others, no attribute is truly irrelevant or redundant. As described later in this work, the feature selection problem is a sub-problem of dimension reduction.

Figure 5.2 is a taxonomy of the reasons for dimension reduction. It can be seen that there are four major reasons for performing dimension reduction. Each reason can be referred to as a distinctive sub-problem:
1. Decreasing the learning (model) cost;
2. Increasing the learning (model) performance;
3. Reducing irrelevant dimensions;
4. Reducing redundant dimensions.

Fig. 5.2. Taxonomy of the dimension reduction problem.

Reduction of redundant dimensions and reduction of irrelevant dimensions can be further divided into two sub-problems:

Feature selection
The objective of feature selection is to identify some features in the data set as important, and to discard any other feature as irrelevant and redundant information. The process of feature selection reduces the dimensionality of the data and enables learning algorithms to operate faster and more effectively. In some cases, the accuracy of future classifications can be improved; in others, the result is a more compact, easily interpreted model (Hall, 1999).

Record selection
Just as some attributes are more useful than others, some records (examples) may better aid the learning process than others (Blum and Langley, 1997).

The other two sub-problems of dimension reduction, as described in Figure 5.2, are increasing learning performance and decreasing learning cost. Each of these two sub-problems can in turn be divided into two further sub-problems: record reduction and attribute reduction. Record reduction is sometimes referred to as sample (or tuple) decomposition. Attribute reduction can be further divided into two sub-problems: attribute decomposition and function decomposition. These decomposition problems embody an extensive methodology called decomposition methodology, discussed in Chapter 50.7 of this volume.

A sub-problem of attribute decomposition, as seen in Figure 5.2, is variable selection. The solution to this problem is a pre-processing step which removes attributes from a given data set before feeding it to a Data Mining algorithm. The rationale for this step is the reduction of the time required for running the Data Mining algorithm, since the running time depends both on the number of records and on the number of attributes in each record (the dimension). Variable selection may sacrifice some accuracy but saves time in the learning process.

This chapter provides a survey of feature selection techniques and variable selection techniques.

5.2 Feature Selection Techniques

5.2.1 Feature Filters

The earliest approaches to feature selection within machine learning were filter methods. All filter methods use heuristics based on general characteristics of the data, rather than a learning algorithm, to evaluate the merit of feature subsets. As a consequence, filter methods are generally much faster than wrapper methods and, as such, are more practical for use on data of high dimensionality.

FOCUS

Almuallim and Dietterich (1992) describe an algorithm originally designed for Boolean domains called FOCUS. FOCUS exhaustively searches the space of feature subsets until it finds the minimum combination of features that divides the training data into pure classes (that is, where every combination of feature values is associated with a single class). This is referred to as the "min-features bias".
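To make the min-features bias concrete, here is a minimal sketch of such an exhaustive consistency search in Python. The function names, the representation of instances as tuples of feature values, and the simple hash-based consistency test are illustrative assumptions, not the authors' implementation:

```python
from itertools import combinations

def is_consistent(X, y, subset):
    """A subset is consistent if no two instances that agree on every
    feature in the subset carry different class labels."""
    seen = {}
    for row, label in zip(X, y):
        key = tuple(row[f] for f in subset)
        if key in seen:
            if seen[key] != label:
                return False
        else:
            seen[key] = label
    return True

def focus(X, y):
    """Exhaustive min-features search: try subsets in order of increasing
    size and return the first one that separates the classes."""
    n_features = len(X[0])
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            if is_consistent(X, y, subset):
                return subset
    return tuple(range(n_features))
```

The outer loop makes the cost of the min-features bias explicit: in the worst case every subset size up to the full feature count is examined, which is the computational difficulty discussed next.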
Following feature selection, the final feature subset is passed to ID3 (Quinlan, 1986), which constructs a decision tree.

There are two main difficulties with FOCUS, as pointed out by Caruana and Freitag (1994). Firstly, since FOCUS is driven to attain consistency on the training data, an exhaustive search may be difficult if many features are needed to attain consistency. Secondly, a strong bias towards consistency can be statistically unwarranted and may lead to over-fitting the training data: the algorithm will continue to add features to repair a single inconsistency.

The authors address the first of these problems in their paper (Almuallim and Dietterich, 1992). Three algorithms, each consisting of a forward selection search coupled with a heuristic to approximate the min-features bias, are presented as methods to make FOCUS computationally feasible on domains with many features. The first algorithm evaluates features using the following information theoretic formula:

$$\mathrm{Entropy}(Q) = -\sum_{i=0}^{2^{|Q|}-1} \frac{p_i + n_i}{|\mathit{Sample}|}\left(\frac{p_i}{p_i+n_i}\log_2\frac{p_i}{p_i+n_i} + \frac{n_i}{p_i+n_i}\log_2\frac{n_i}{p_i+n_i}\right) \qquad (5.1)$$

For a given feature subset $Q$, there are $2^{|Q|}$ possible truth value assignments to the features. A given feature set divides the training data into groups of instances with the same truth value assignments to the features in $Q$. Equation 5.1 measures the overall entropy of the class values in these groups; $p_i$ and $n_i$ denote the number of positive and negative examples in the $i$-th group, respectively. At each stage, the feature which minimizes Equation 5.1 is added to the current feature subset.

The second algorithm chooses the most discriminating feature to add to the current subset at each stage of the search. For a given pair of positive and negative examples, a feature is discriminating if its value differs between the two. At each stage, the feature is chosen which discriminates the greatest number of positive-negative example pairs that have not yet been discriminated by any existing feature in the subset.

The third algorithm is like the second, except that each positive-negative example pair contributes a weighted increment to the score of each feature that discriminates it. The increment depends on the total number of features that discriminate the pair.

LVF

Liu and Setiono (1996) describe an algorithm similar to FOCUS called LVF. Like FOCUS, LVF is consistency driven; unlike FOCUS, it can handle noisy domains if the approximate noise level is known a priori.

LVF generates a random subset S from the feature subset space during each round of execution. If S contains fewer features than the current best subset, the inconsistency rate of the dimensionally reduced data described by S is compared with the inconsistency rate of the best subset. If S is at least as consistent as the best subset, S replaces the best subset.

The inconsistency rate of the training data prescribed by a given feature subset is defined over all groups of matching instances. Within a group of matching instances, the inconsistency count is the number of instances in the group minus the number of instances in the group with the most frequent class value. The overall inconsistency rate is the sum of the inconsistency counts of all groups of matching instances, divided by the total number of instances.

Liu and Setiono report good results for LVF when applied to some artificial domains, and mixed results when applied to commonly used natural domains.
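The two ingredients above, the inconsistency rate and the random generation of candidate subsets, can be sketched as follows. This is a rough Python illustration under stated assumptions: the iteration budget and helper names are invented for the example, and the original LVF additionally accepts an allowed inconsistency threshold for noisy data.

```python
import random
from collections import Counter, defaultdict

def inconsistency_rate(X, y, subset):
    """Group instances that match on the chosen features; each group contributes
    its size minus the count of its most frequent class; divide by the data size."""
    groups = defaultdict(Counter)
    for row, label in zip(X, y):
        groups[tuple(row[f] for f in subset)][label] += 1
    inconsistent = sum(sum(g.values()) - max(g.values()) for g in groups.values())
    return inconsistent / len(X)

def lvf(X, y, max_rounds=1000):
    """Las Vegas Filter sketch: repeatedly draw a random feature subset and keep
    the smallest one found so far that is at least as consistent as the current best."""
    n = len(X[0])
    best = list(range(n))
    best_rate = inconsistency_rate(X, y, best)
    for _ in range(max_rounds):
        candidate = random.sample(range(n), random.randint(1, n))
        if len(candidate) < len(best):
            rate = inconsistency_rate(X, y, candidate)
            if rate <= best_rate:
                best, best_rate = candidate, rate
    return best
```

Because candidate subsets are drawn at random, longer runs can only improve the best subset found, which is consistent with the behaviour Liu and Setiono describe.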
They also applied LVF to two "large" data sets: the first having 65,000 instances described by 59 attributes, and the second having 5,909 instances described by 81 attributes. They report that LVF was able to reduce the number of attributes on both data sets by more than half. They also note that, due to the random nature of LVF, the longer it is allowed to execute, the better the results (as measured by the inconsistency criterion).

Filtering Features Through Discretization

Setiono and Liu (1996) note that discretization has the potential to perform feature selection among numeric features. If a numeric feature can justifiably be discretized to a single value, then it can safely be removed from the data.

The combined discretization and feature selection algorithm Chi2 uses a chi-square ($\chi^2$) statistic to perform discretization. Numeric attributes are initially sorted by placing each observed value into its own interval. Each numeric attribute is then repeatedly discretized by using the $\chi^2$ test to determine when adjacent intervals should be merged.

The extent of the merging process is controlled by the use of an automatically set $\chi^2$ threshold. The threshold is determined by attempting to maintain the original fidelity of the data: inconsistency (measured the same way as in the LVF algorithm described above) controls the process. The authors report results on three natural domains containing a mixture of numeric and nominal features, using C4.5 (Quinlan, 1986; Quinlan, 1993) before and after discretization. They conclude that Chi2 is effective at improving C4.5's performance and at eliminating some features. However, it is not clear whether C4.5's improvement is due entirely to some features having been removed, or whether discretization plays a role as well.

Using One Learning Algorithm as a Filter for Another

Several researchers have explored the possibility of using a particular learning algorithm as a pre-processor to discover useful feature subsets for a primary learning algorithm. Cardie (1995) describes the application of decision tree algorithms to the task of selecting feature subsets for use by instance-based learners. C4.5 was applied to three natural language data sets; only the features that appeared in the final decision trees were used with a k-nearest neighbor classifier. The use of this hybrid system resulted in significantly better performance than either C4.5 or the k-nearest neighbor algorithm when used alone.

In a similar approach, Singh and Provan (1996) use a greedy oblivious decision tree algorithm to select features from which to construct a Bayesian network. Oblivious decision trees differ from those constructed by algorithms such as C4.5 in that all nodes at the same level of an oblivious decision tree test the same attribute. Feature subsets selected by three oblivious decision tree algorithms, each employing a different information-theoretic splitting criterion, were evaluated with a Bayesian network classifier on several machine learning datasets. Results showed that Bayesian networks using features selected by the oblivious decision tree algorithms outperformed Bayesian networks without feature selection.

Holmes and Nevill-Manning (1995) use Holte's 1R system (Holte, 1993) to estimate the predictive accuracy of individual features. 1R builds rules based on single features (called predictive 1-rules; 1-rules can be thought of as single-level decision trees).
If the data is split into training and test sets, it is possible to calculate a classification accuracy for each rule and hence for each feature. From the classification scores, a ranked list of features is obtained. Experiments with choosing a select number of the highest ranked features and using them with common machine learning algorithms showed that, on average, the top three or more features are as accurate as using the original set. This approach is unusual in that no search is conducted. Instead, it relies on the user to decide how many features from the ranked list to include in the final subset.

Pfahringer (1995) uses a program for inducing decision table majority classifiers to select features. DTM (Decision Table Majority) classifiers are a simple type of nearest neighbor classifier where the similarity function is restricted to returning stored instances that are exact matches with the instance to be classified. If no instances are returned, the most prevalent class in the training data is used as the predicted class; otherwise, the majority class of all matching instances is used. DTM works best when all features are nominal. Induction of a DTM is achieved by greedily searching the space of possible decision tables. Since a decision table is defined by the features it includes, induction is simply feature selection.

In Pfahringer's approach, the minimum description length (MDL) principle (Rissanen, 1978) guides the search by estimating the cost of encoding a decision table and the training examples it misclassifies with respect to a given feature subset. The features appearing in the final decision table are then used with other learning algorithms. Experiments on a small selection of machine learning datasets showed that feature selection by DTM induction can improve the accuracy of C4.5 in some cases. DTM classifiers induced using MDL were also compared with those induced using cross-validation (a wrapper approach) to estimate the accuracy of tables (and hence of feature sets). The MDL approach was shown to be more efficient than, and to perform as well as, cross-validation.

An Information Theoretic Feature Filter

Koller and Sahami (1996) introduced a feature selection algorithm based on ideas from information theory and probabilistic reasoning. The rationale behind their approach is that, since the goal of an induction algorithm is to estimate the probability distributions over the class values given the original feature set, feature subset selection should attempt to remain as close to these original distributions as possible.

More formally, let $C$ be a set of classes, $V$ a set of features, $X$ a subset of $V$, $v$ an assignment of values $(v_1, \ldots, v_n)$ to the features in $V$, and $v_X$ the projection of the values in $v$ onto the variables in $X$. The goal of the feature selector is to choose $X$ so that $P(C \mid X = v_X)$ is as close as possible to $P(C \mid V = v)$.

To achieve this goal, the algorithm begins with all the original features and employs a backward elimination search to remove, at each stage, the feature that causes the least change between the two distributions. Because it is not reliable to estimate high-order probability distributions from limited data, an approximate algorithm is given that uses pairwise combinations of features. Cross entropy is used to measure the difference between two distributions, and the user must specify how many features are to be removed by the algorithm.
The cross entropy of the class distribution given a pair of features is the quantity that drives this backward elimination.
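One common way to write it is as an expected KL divergence between the class distribution given both features and the class distribution given the retained feature alone; the symbol $\delta$ and the notation below are an assumed standard form rather than the chapter's own equation:

$$\delta(V_i \mid V_j) \;=\; \sum_{v_i,\, v_j} P(V_i = v_i, V_j = v_j) \sum_{c \in C} P(c \mid v_i, v_j)\, \log \frac{P(c \mid v_i, v_j)}{P(c \mid v_j)}$$

At each elimination step the feature $V_i$ for which some other feature $V_j$ makes this quantity smallest is removed, since its information about the class is then largely carried by $V_j$.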