…often compute various MLM as fast as O(N) or O(N log N). We will highlight a few of the basic concepts behind such methods.

2.3 Seven Types of Computational Problem

There are a large number of statistical/machine learning methods described in this book. Making them run fast boils down to a number of different types of computational problems, including the following (a short illustrative code sketch for each type appears after the list):

Basic problems: These include simple statistics, like means, variances, and covariance matrices. We also put basic one-dimensional sorts and range searches in this category. These are all typically simple to compute, in the sense that they are O(N), or O(N log N) at worst. We will discuss some key basic problems in §2.5.1.

Generalized N-body problems: These include virtually any problem involving distances or other similarities between (all or many) pairs (or higher-order n-tuples) of points, such as nearest-neighbor searches, correlation functions, or kernel density estimates. Such problems are typically O(N²) or O(N³) if computed straightforwardly, but more sophisticated algorithms are available (WSAS, [12]). We will discuss some such problems in §2.5.2.

Linear algebraic problems: These include all the standard problems of computational linear algebra, including linear systems, eigenvalue problems, and inverses. Assuming typical cases with N ≫ D, these can be O(N), but in some cases the matrix of interest is N × N, making the computation O(N³). Some common examples where parameter fitting ends up being conveniently phrased in terms of linear algebra problems appear in dimensionality reduction (chapter 7) and linear regression (chapter 8).

Optimization problems: Optimization is the process of finding the minimum or maximum of a function. This class includes all the standard subclasses of optimization problems, from unconstrained to constrained, and convex to nonconvex. Unconstrained optimizations can be fast (though somewhat indeterminate, as they generally only lead to local optima), being O(N) for each of a number of iterations. Constrained optimizations, such as the quadratic programs required by nonlinear support vector machines (discussed in chapter 9), are O(N³) in the worst case. Some optimization approaches beyond the widely used unconstrained methods, such as gradient descent or conjugate gradient, are discussed in §4.4.3 on the expectation maximization algorithm for mixtures of Gaussians.

Integration problems: Integration arises heavily in the estimation of Bayesian models, and typically involves high-dimensional functions. Performing integration with high accuracy via quadrature has a computational complexity which is exponential in the dimensionality D. In §5.8 we describe the Markov chain Monte Carlo (MCMC) algorithm, which can be used for efficient high-dimensional integration and related computations.

Graph-theoretic problems: These problems involve traversals of graphs, as in probabilistic graphical models or nearest-neighbor graphs for manifold learning. The most difficult computations here are those involving discrete variables, in which the computational cost may be O(N) but is exponential in the number of interacting discrete variables among the D dimensions. Exponential computations are by far the most time consuming and generally must be avoided at all costs.
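For the basic problems, here is a minimal NumPy sketch (the data are synthetic and the sizes arbitrary): the simple statistics are single O(N) passes over the data, and a one-time O(N log N) sort enables O(log N) range searches afterward.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((100000, 3))  # N = 100000 points in D = 3 dimensions

    # Simple statistics: each is a single O(N) pass over the data.
    mean = x.mean(axis=0)          # per-dimension means
    var = x.var(axis=0)            # per-dimension variances
    cov = np.cov(x, rowvar=False)  # D x D covariance matrix, O(N D^2)

    # One-dimensional sort and range search: O(N log N) once,
    # then O(log N) per query via binary search.
    xs = np.sort(x[:, 0])
    lo, hi = np.searchsorted(xs, [-1.0, 1.0])  # indices bounding the range [-1, 1]
    print(mean, var, hi - lo, "points in [-1, 1]")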
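For the generalized N-body category, the following sketch contrasts the naive O(N²) all-pairs approach with a space-partitioning tree (here SciPy's cKDTree) on the all-nearest-neighbors problem; the data are again synthetic.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    x = rng.random((10000, 3))  # N points in D = 3

    # Brute force would compute all pairwise distances: O(N^2) in time
    # (and memory, if done as one array operation):
    # d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

    # Tree-based: build in O(N log N), then query each point's neighbor
    # in O(log N), for roughly O(N log N) overall.
    tree = cKDTree(x)
    dist, idx = tree.query(x, k=2)  # k=2: each point's nearest match is itself,
    nn_dist = dist[:, 1]            # so column 1 holds the true nearest neighbor
    print(nn_dist.mean())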
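For the linear algebraic category, a sketch of linear regression (cf. chapter 8) solved via the normal equations on a synthetic N × D problem: forming XᵀX costs O(ND²), which is linear in N, plus an O(D³) solve of the small D × D system. Working with an N × N matrix instead (e.g., a kernel matrix) would incur the O(N³) cost mentioned above.

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 100000, 5
    X = rng.standard_normal((N, D))  # N x D design matrix, N >> D
    theta_true = np.arange(1.0, D + 1.0)
    y = X @ theta_true + 0.1 * rng.standard_normal(N)

    # Normal equations: O(N D^2) to form X^T X, O(D^3) to solve.
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    print(theta)  # close to theta_true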
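For the optimization category, a sketch of unconstrained conjugate-gradient minimization using scipy.optimize.minimize; the objective here is a hypothetical Gaussian negative log-likelihood for a mean parameter, chosen so that each function and gradient evaluation is an O(N) pass over the data, i.e., O(N) per iteration.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10000) + 3.0  # synthetic data with true mean 3

    # Negative log-likelihood of a Gaussian mean: O(N) per evaluation.
    def nll(mu):
        return 0.5 * np.sum((x - mu) ** 2)

    def grad(mu):
        return np.sum(mu - x, keepdims=True)

    # Unconstrained conjugate-gradient optimization.
    res = minimize(nll, x0=np.array([0.0]), jac=grad, method="CG")
    print(res.x)  # ~3.0, the sample mean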
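For the integration category, a bare-bones Metropolis sampler (a minimal stand-in for the MCMC algorithms of §5.8), applied to a hypothetical two-dimensional unnormalized log-posterior; unlike grid quadrature, the cost scales with the number of samples rather than exponentially with the dimension D.

    import numpy as np

    rng = np.random.default_rng(0)

    # Unnormalized log-posterior: a correlated 2D Gaussian stands in
    # for a Bayesian model (a hypothetical example).
    def log_post(theta):
        x, y = theta
        return -0.5 * (x ** 2 + y ** 2 + 1.8 * x * y) / (1 - 0.81)

    # Minimal Metropolis sampler: accept a random-walk proposal with
    # probability min(1, posterior ratio).
    theta = np.zeros(2)
    samples = np.empty((20000, 2))
    for i in range(len(samples)):
        prop = theta + 0.5 * rng.standard_normal(2)
        if np.log(rng.random()) < log_post(prop) - log_post(theta):
            theta = prop
        samples[i] = theta

    print(samples[5000:].mean(axis=0))  # posterior mean estimate, ~(0, 0)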
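Finally, for the graph-theoretic category, a sketch of the nearest-neighbor graph underlying manifold learning (cf. chapter 7), built with scikit-learn and traversed with a sparse shortest-path (Dijkstra) computation. This illustrates only the graph-traversal side of the category, not the harder discrete-variable case described above.

    import numpy as np
    from sklearn.neighbors import kneighbors_graph
    from scipy.sparse.csgraph import shortest_path

    rng = np.random.default_rng(0)
    x = rng.random((500, 3))  # synthetic points

    # Sparse k-nearest-neighbor graph: the structure underlying
    # manifold learning methods such as Isomap.
    G = kneighbors_graph(x, n_neighbors=5, mode="distance")

    # Graph traversal: geodesic (shortest-path) distances along the graph.
    D_geo = shortest_path(G, method="D", directed=False)
    print(D_geo.shape)  # (500, 500) matrix of graph distances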