the only option. For an example of state-of-the-art parallel multidimensional tree algorithms that can be used on very large data sets, see [11] and [25]. For an example of the effective use of GPUs in astronomy, see [26].

Problem transformation: The final trick is, when all else fails, to change the problem itself. This can take a number of forms. One is to change the problem type into one which admits faster algorithms, for example reformulating an exponential-cost discrete graph problem as an approximation which can be cast in terms of continuous optimization, the central idea of variational methods for probability evaluations in graphical models [17]. Another is to reformulate a mathematical program (optimization problem) to create one which is easier to solve, for example via the Lagrangian dual in the optimization of support vector machines (§9.6), or lesser-known transformations for difference-of-convex functions or semidefinite relaxations in more challenging formulations of machine learning methods [14, 44]. A final option is to create a new machine learning method which maintains the statistical properties of interest while being inherently easier to compute. An example is the formulation of a decision-tree-like method for density estimation [33], which is nonparametric like KDE but inherits the O(N log N) construction time and O(log N) querying time of a tree.

2.5 Case Studies: Speedup Strategies in Practice

Examples of the above problem types and speedup strategies abound in the literature, and the state of the art is always evolving. For a good discussion of practical approaches, see the references listed above. In the remainder of this chapter we offer a brief practical discussion of the first speedup strategy: tree-based approaches for searching, sorting, and multidimensional neighbors-based statistics, and their implementation in Python. We begin with unidimensional searching and sorting, and move to a specific case of N-point problems, the nearest-neighbor search. This discussion will serve to contextualize the problems and principles discussed above, and also act as a starting point for thinking about efficient algorithmic approaches to difficult problems.

2.5.1 Unidimensional Searches and Other Problems

Searching and sorting algorithms are fundamental to many applications in machine learning and data mining, and are covered in detail in many texts on computer algorithms (e.g., NumRec). Because of this, most data processing and statistics packages include some implementation of efficient sorts and searches, including various tools available in Python.

Searching

The SQL language of relational databases is good at expressing queries which are composed of unidimensional searches. An example is finding all objects whose brightness is between two numbers and whose size is less than a certain number. Each of the unidimensional searches within such a query is often called a range search.

The typical approach for accelerating such range searches is a one-dimensional tree data structure called a B-tree. The idea of binary search is to check whether the value or range of interest is less than some pivot value, which determines which branch of the tree to follow. If the tree is balanced, the search time becomes O(log N).
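To make the range-search idea concrete, the sketch below performs a one-dimensional range query with Python's built-in bisect module, using a sorted array as a stand-in for a balanced one-dimensional tree. The array name and query limits are illustrative choices, not taken from any particular catalog; this is a minimal sketch, not a database implementation.

import bisect
import numpy as np

np.random.seed(0)
brightness = np.sort(np.random.random(1000))  # sorted one-dimensional "catalog"

lo, hi = 0.25, 0.75  # limits of the range query

# each bisect call is a binary search: O(log N) comparisons
i_lo = bisect.bisect_left(brightness, lo)   # first index with value >= lo
i_hi = bisect.bisect_right(brightness, hi)  # first index with value > hi

selected = brightness[i_lo:i_hi]  # all objects with lo <= brightness <= hi
print(len(selected))

The one-time cost of sorting is O(N log N); each subsequent range query then costs only O(log N) plus the size of the output. NumPy's searchsorted, introduced below, plays the same role for array-based workflows.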
A hash table data structure is an array, with input data mapped to array indices using a hash function. A simple example of hashing is the sorting of a data set into a pixelized grid. If one of the dimensions is x, and the pixel grid ranges from x_min to x_max with pixel width Δ, then the (zero-based) pixel index can be computed as int[(x − x_min)/Δ], where the function int returns the integer part of a real number (a short sketch of this pixel-grid hash appears at the end of this subsection). The cost of computing a hash function must be small enough to make a hashing-based solution more efficient than alternative approaches, such as trees. The strength of hashing is that its lookup time can be independent of N, that is, O(1), or constant in time. Hash tables can thus be more efficient than search trees in principle, but this is difficult to achieve in general, and thus search trees are most often used in real systems.

The Python package NumPy implements efficient array-based searching and hashing. Efficient searching can be accomplished via the function numpy.searchsorted, and scales as O(N log N) (see figure 2.1, and the example code below). Basic hashing can be performed using numpy.histogram, and the multidimensional counterparts numpy.histogram2d and numpy.histogramdd. These functions are used throughout this text in the context of data visualization (see, e.g., figure 1.10).

Sorting

For sorting, the built-in Python function sorted and the more efficient array sorting function numpy.sort are useful tools. The most important thing to understand is the scaling of these functions with the size of the array. By default, numpy.sort uses a quicksort algorithm which scales as O(N log N) (see figure 2.2). Quicksort is a good multipurpose sorting algorithm, and will be sufficient for most data analysis situations.

Figure 2.2. The scaling of the quicksort algorithm (numpy.sort, compared with the built-in list sort). Plotted for comparison are lines showing O(N) and O(N log N) scaling. The quicksort algorithm falls along the O(N log N) line, as expected.

Below are some examples of searching and sorting using the tools available in Python. Note that we are using the IPython interpreter in order to have access to the commands %time and %timeit, which are examples of the "magic functions" made available by IPython (see appendix A).

In [1]: import numpy as np
In [2]: np.random.seed(0)
In [3]: x = np.random.rand(1E7)
In [4]: %time x.sort()  # time a single run
CPU times: user ... s, sys: ... s, total: ... s
Wall time: ... s
In [5]: print x
[ ...e-08  ...e-08  ...e-08 ...,  ...e-01  ...e-01  ...e-01]

This sorts the array in place, and accomplishes the task very quickly. The numpy package also has an efficient means of searching this sorted list for a desired value:

In [6]: np.searchsorted(x, 0.5)
Out[6]: ...

As expected, 0.5 falls very near the midpoint in the list of values. We can see the speed of this algorithm using IPython's %timeit functionality:

In [7]: %timeit np.searchsorted(x, 0.5)
... loops, best of 3: ... us per loop

If you have an array of values and would like to sort by a particular column, the argsort function is the best bet:

In [8]: X = np.random.random((5, 3))
In [9]: np.set_printoptions(precision=2)
In [10]: print X
[[ 0.96  0.92  ... ]
 [ 0.71  0.22  0.63]
 [ 0.34  0.82  0.97]
 [ 0.33  0.98  0.44]
 [ 0.95  0.33  0.73]]
In [11]: i_sort = np.argsort(X[:, 0])
In [12]: print X[i_sort]
[[ 0.33  0.98  0.44]
 [ 0.34  0.82  0.97]
 [ 0.71  0.22  0.63]
 [ 0.95  0.33  0.73]
 [ 0.96  0.92  ... ]]

Here we have sorted the data by the first column, and the values in the second and third columns were rearranged in the same order. To sort each column independently, the sort function can be given an axis argument:

In [13]: X.sort(0)
In [14]: print X
[[ 0.33  0.22  0.44]
 [ 0.34  0.33  0.63]
 [ 0.71  0.82  0.73]
 [ 0.95  0.92  0.97]
 [ 0.96  0.98  ... ]]

Now every column of X has been individually sorted, and the values in each row are no longer associated. Beware the difference between these last two operations! In the case where X represents an array of observations, sorting along every column like this will lead to a meaningless data set.
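Before moving on to multidimensional problems, here is a minimal sketch of the pixel-grid hash described at the start of this subsection. The grid limits, number of pixels, and data are arbitrary choices for illustration; np.histogram is shown only to confirm that it performs the equivalent binning internally.

import numpy as np

np.random.seed(0)
x = np.random.random(1000)

xmin, xmax = 0.0, 1.0  # grid limits (illustrative)
nbins = 10
delta = (xmax - xmin) / nbins  # pixel width

# the hash function: an O(1) arithmetic operation per value
indices = ((x - xmin) / delta).astype(int)
indices = np.clip(indices, 0, nbins - 1)  # guard the x == xmax edge case

counts = np.bincount(indices, minlength=nbins)

# np.histogram does the equivalent bookkeeping
hist, edges = np.histogram(x, bins=nbins, range=(xmin, xmax))
print(np.all(counts == hist))  # should print True for this data

Once the indices are computed, any value can be placed into (or retrieved from) its pixel without any searching, which is the sense in which a hash lookup is O(1).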
2.5.2 Multidimensional Searches and Other Operations

Just about any computational problem involving distances or other similarities between points falls into the class of generalized N-body problems mentioned in §2.3. The overall most efficient way to perform such computations exactly (or to controlled accuracy) is via algorithms based on multidimensional trees. Such algorithms begin by building an indexing structure analogous to the B-trees which accelerate SQL queries. This indexing structure is built once for the lifetime of that particular data set (assuming the data set does not change), and thereafter can be used by fast algorithms that traverse it to perform a wide variety of computations. This fact should be emphasized: a single tree for a data set can be used to increase the efficiency of many different machine learning methods. We begin with one of the simplest generalized N-body problems, that of nearest-neighbor search.

Nearest-neighbor searches

The basic nearest-neighbor problem can be stated relatively easily. We are given an N × D matrix X representing N points (vectors) in D dimensions. The ith point in X is specified as the vector x_i, with i = 1, ..., N, and each x_i has D components, x_{i,d}, d = 1, ..., D. Given a query point x, we want to find the closest point in X under a given distance metric. For simplicity, we use the well-known Euclidean metric,

    D(x, x_i) = \sqrt{\sum_{d=1}^{D} (x_d - x_{i,d})^2}.        (2.1)

The goal of our computation is to find

    x^* = \arg\min_i D(x, x_i).        (2.2)

It is common to have to do this search for more than one query object at a time; in general there will be a set of query objects. This case is called the all-nearest-neighbor search. The special but common case where the query set is the same as the reference set (the set over which we are searching) is called the monochromatic case, as opposed to the more general bichromatic case where the two sets are different. An example of a bichromatic case is the cross-matching of objects from two astronomical catalogs. We consider here the monochromatic case for simplicity: for each point x_i in X, find its nearest neighbor (or more generally, k nearest neighbors) in X, other than itself:

    \forall i, \quad x_i^* = \arg\min_{j \neq i} D(x_i, x_j).        (2.3)

At first glance, this seems relatively straightforward: all we need is to compute the distances between every pair of points, and then choose the closest. This can be quickly written in Python:

# file: easy_nearest_neighbor.py
import numpy as np

def easy_nn(X):
    N, D = X.shape
    neighbors = np.zeros(N, dtype=int)
    for i in range(N):
        # initialize closest distance to infinity
        j_closest = i
        d_closest = np.inf
        for j in range(N):
            # skip the distance between a point and itself
            if i == j:
                continue
            d = np.sqrt(np.sum((X[i] - X[j]) ** 2))
            if d < d_closest:
                j_closest = j
                d_closest = d
        neighbors[i] = j_closest
    return neighbors
# IPython
In [1]: import numpy as np
In [2]: np.random.seed(0)
In [3]: from easy_nearest_neighbor import easy_nn
In [4]: X = np.random.random((10, 3))  # 10 points in 3 dimensions
In [5]: easy_nn(X)
Out[5]: array([...])
In [6]: X = np.random.random((1000, 3))  # 1000 points in 3 dimensions
In [7]: %timeit easy_nn(X)
1 loops, best of 3: ... s per loop

This naive algorithm is simple to code, but leads to very long computation times for large numbers of points. For N points, the computation time is O(N^2): if we increase the sample size by a factor of 10, the computation time will increase by approximately a factor of 100. In astronomical contexts, where the number of objects can number in the billions, this quickly leads to problems.

Those familiar with Python and NumPy (and those who have read appendix A) will notice a glaring problem here: this uses loops instead of a vectorized implementation to compute the distances. We can vectorize this operation by observing the following identity:

    \sum_k (X_{ik} - X_{jk})^2 = \sum_k \left( X_{ik}^2 - 2 X_{ik} X_{jk} + X_{jk}^2 \right)        (2.4)
                               = \sum_k X_{ik}^2 - 2 \sum_k X_{ik} X_{jk} + \sum_k X_{jk}^2        (2.5)
                               = [X X^T]_{ii} + [X X^T]_{jj} - 2 [X X^T]_{ij},        (2.6)

where in the final line we have written the sums in terms of matrix products. Now the entire operation can be reexpressed in terms of fast vectorized math:

# file: vectorized_nearest_neighbor.py
import numpy as np

def vectorized_nn(X):
    XXT = np.dot(X, X.T)
    Xii = XXT.diagonal()
    D = Xii - 2 * XXT + Xii[:, np.newaxis]
    # numpy argsort returns sorted indices along a
    # given axis.  We'll take the second column
    # (index 1) because the first column corresponds
    # to the distance between each point and itself.
    return np.argsort(D, axis=1)[:, 1]

# IPython:
In [1]: import numpy as np
In [2]: np.random.seed(0)
In [3]: from vectorized_nearest_neighbor import vectorized_nn
In [4]: X = np.random.random((10, 3))
In [5]: vectorized_nn(X)
Out[5]: array([...])
In [6]: X = np.random.random((1000, 3))
In [7]: %timeit vectorized_nn(X)  # timeit is a special feature of IPython
10 loops, best of 3: 139 ms per loop

Through vectorization, we have sped up our calculation by a factor of over 100. We have to be careful here, though: our clever speed improvement does not come without cost. First, the vectorization requires a large amount of memory. For N points, we allocate not one but two N × N matrices, so as N grows larger, the amount of memory used by this algorithm increases in proportion to N^2. Also note that, although we have increased the computational efficiency through vectorization, the algorithm still performs O(N^2) distance computations.

There is another disadvantage here as well. Because we are splitting the computation into separate parts, the machine floating-point precision can lead to unexpected results. Consider the following example:

In [1]: import numpy as np
In [2]: x = 1.0
In [3]: y = 0.0
In [4]: np.sqrt((x - y) ** 2)  # how we computed non-vectorized distances
Out[4]: 1.0
In [5]: np.sqrt(x ** 2 + y ** 2 - 2 * x * y)  # vectorized distances
Out[5]: 1.0
In [6]: x += 100000000
In [7]: y += 100000000
In [8]: np.sqrt((x - y) ** 2)  # non-vectorized distances
Out[8]: 1.0
In [9]: np.sqrt(x ** 2 + y ** 2 - 2 * x * y)  # vectorized distances
Out[9]: 0.0

The distance calculations in lines 4 and 8 correspond to the method used in the slow example above. The distance calculations in lines 5 and 9 correspond to our fast vectorized example, and line 9 leads to the wrong result. The reason for this is the floating-point precision of the computer. Because we are taking one very large number (x**2 + y**2) and subtracting another large number (2*x*y) which differs from it by only one part in 10^16, we suffer from roundoff error in line 9.
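In practice, a common way to tame this roundoff problem is to clamp the expanded expression at zero before taking the square root (scikit-learn's euclidean_distances routine, for example, includes a similar guard). The sketch below is a minimal variant of vectorized_nn along these lines; the function name is illustrative. Note that clamping only prevents impossible negative squared distances: very small distances between large-magnitude points can still lose relative precision.

import numpy as np

def safer_vectorized_nn(X):
    # vectorized nearest-neighbor search with a guard against
    # negative squared distances caused by roundoff
    XXT = np.dot(X, X.T)
    Xii = XXT.diagonal()
    D2 = Xii - 2 * XXT + Xii[:, np.newaxis]  # squared distances
    np.maximum(D2, 0, out=D2)                # clamp small negative values to zero
    np.fill_diagonal(D2, np.inf)             # exclude each point from its own search
    return np.argmin(D2, axis=1)

np.random.seed(0)
X = np.random.random((1000, 3))
neighbors = safer_vectorized_nn(X)

Masking the diagonal and using np.argmin also avoids the full sort performed by np.argsort, which is all that is needed when only the single nearest neighbor is required.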
Thus our efficient method, though faster than the initial implementation, suffers from three distinct disadvantages:

• Like the nonvectorized version, the computational efficiency still scales as O(N^2), which will be too slow for data sets of interest.
• Unlike the nonvectorized version, the memory use also scales as O(N^2), which may cause problems for large data sets.
• Roundoff error due to the vectorization tricks can cause incorrect results in some circumstances.

This sort of method is often called a brute-force, exhaustive, or naive search. It takes no shortcuts in its approach to the data: it simply evaluates and compares every possible option. The resulting algorithm is very easy to code and understand, but can lead to very slow computation as the data set grows larger. Fortunately, there are a variety of tree-based algorithms available which can improve on this.

Trees for increasing the efficiency of a search

There are a number of common types of multidimensional tree structures which can be used for a nearest-neighbor search.

Quad-trees and oct-trees. The earliest multidimensional tree structures were quad-trees and oct-trees, which work in two and three dimensions, respectively. Oct-trees, in particular, have long been used in astrophysics within the context of N-body and smoothed particle hydrodynamics (SPH) simulations (see, e.g., [42, 45]).

A quad-tree is a simple data structure used to arrange two-dimensional data, in which each tree node has exactly four children, representing its four quadrants. Each node is defined by four numbers: its left, right, top, and bottom extents. Figure 2.3 shows a visualization of a quad-tree for some generated structured data. Notice how quickly the quad-tree narrows in on the structured parts of the data; whole regions of the parameter space can be eliminated from a search in this way. This idea can be generalized to higher dimensions using an oct-tree, so named because each node has up to eight children (representing the eight octants of a three-dimensional space).

Figure 2.3. Example of a quad-tree.

By grouping the points in this manner, one can progressively constrain the distances between a test point and groups of points in a tree, using the bounding box of each group of points to provide a lower bound on the distance between the query point and any point in the group. If the lower bound is not better than the best-candidate nearest-neighbor distance the algorithm knows about so far, the group can be pruned completely from the search, saving very large amounts of work (a minimal sketch of this bound appears below). The result is that under certain conditions the cost of the nearest-neighbor search reduces to O(log N) for a single query point: a significant improvement over brute force when N is large. The tree itself, the result of a one-time build operation, takes O(N log N) time to construct. O(N log N) is just a bit worse than O(N), and the constant in front of the build time is in practice very small, so tree construction is fast. All the multidimensional trees we will look at yield O(N log N) build time and O(log N) single-query search time under certain conditions, though they will still display different constants depending on the data. The dependence on the number of dimensions D also differs, as we discuss next.
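Before turning to those higher-dimensional trees, here is a minimal sketch of the bounding-box pruning rule just described, assuming axis-aligned boxes and Euclidean distances. The function name and the example box are illustrative; real tree implementations fold this test into their traversal code.

import numpy as np

def min_dist_to_box(x, box_min, box_max):
    # lower bound on the distance from point x to any point
    # inside the axis-aligned box [box_min, box_max]
    d = np.maximum(box_min - x, 0) + np.maximum(x - box_max, 0)
    return np.sqrt(np.sum(d ** 2))

# pruning test used during a tree traversal (sketch):
#   if min_dist_to_box(query, node_min, node_max) >= best_distance_so_far:
#       skip this node and every point it contains

node_min = np.array([0.0, 0.0])
node_max = np.array([1.0, 1.0])
query = np.array([2.0, 0.5])
print(min_dist_to_box(query, node_min, node_max))  # 1.0: no point in the box is closer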
kd-trees. The quad-tree and oct-tree ideas above suggest a straightforward generalization to higher dimensions. In two dimensions, we build a tree with four children per node; in three dimensions, a tree with eight children per node. Perhaps in D dimensions we should simply build a tree with 2^D children per node, and create a search algorithm similar to that of a quad- or oct-tree? A bit of calculation shows that this is infeasible: for even modest-sized values of D, the size of the tree quickly blows up. For example, if D = 10, each node would require 2^10 = 1024 children. This means that to go two levels down (i.e., to divide each dimension into four units) would already require over 10^6 nodes! The problem quickly gets out of hand as the dimension continues to increase. If we push the dimension up to D = 100, and make the dubious assumption that each node requires only 1 byte in memory, even a single level of the tree would require 10^15 petabytes of storage. To put this in perspective, this is about ten billion times the estimated total volume of worldwide internet traffic in the year 2010. Evidently, this strategy is not going to work. This immense growth in the number of subdivisions of a space as the dimensionality of that space grows is one manifestation of the curse of dimensionality (see §7.1 for more details).

A solution to this problem is the kd-tree [2], so named as it is a k-dimensional generalization of the quad-tree and oct-tree. To get around the dimensionality issues discussed above, kd-trees are generally implemented as binary trees: that is, each node has two children. The top node of a kd-tree is a D-dimensional hyperrectangle which contains the entire data set. To create the subnodes, the volume is split into two regions along a single dimension, and this procedure is repeated recursively until the lowest nodes contain a specified number of points.

Figure 2.4. Example of a kd-tree.

Figure 2.4 shows the kd-tree partition of a data set in two dimensions. Notice that, like the quad-tree in figure 2.3, the kd-tree partitions the space into rectilinear regions. Unlike the quad-tree, the kd-tree adapts the split points in order to better represent the data. Because of the binary nature of the kd-tree, it is suitable for higher-dimensional data. A fast kd-tree implementation is available in SciPy, and can be used as follows:

In [1]: import numpy as np
In [2]: from scipy.spatial import cKDTree
In [3]: np.random.seed(0)
In [4]: X = np.random.random((1000, 3))
In [5]: kdt = cKDTree(X)  # build the KDTree
In [6]: %timeit kdt.query(X, k=2)  # query for two neighbors
... loops, best of 3: ... ms per loop

The nearest neighbor of each point in a set of 1000 is found in just a few milliseconds: a factor of 20 improvement over the vectorized brute-force method for 1000 points. In general, as the number of points grows larger, the computation time will increase as O(log N) for each of the N query points, for a total of O(N log N).

In [7]: X = np.random.random((100000, 3))
In [8]: kdt = cKDTree(X)
In [9]: %timeit kdt.query(X, k=2)  # query for two neighbors
... loops, best of 3: ... ms per loop

A factor of 100 more points leads to a factor of 120 increase in computational cost, which is consistent with our prediction of O(N log N).
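Note that in these monochromatic queries each point's nearest neighbor is the point itself, which is why we ask for k=2 and keep only the second neighbor. The short sketch below, using the same cKDTree API as above, shows how the query results are unpacked; the variable names are illustrative.

import numpy as np
from scipy.spatial import cKDTree

np.random.seed(0)
X = np.random.random((1000, 3))
kdt = cKDTree(X)

# query returns two (N, k) arrays, distances and indices,
# each row sorted by increasing distance
dist, ind = kdt.query(X, k=2)

# column 0 is each point itself (distance zero, assuming no duplicate points);
# column 1 is the true nearest neighbor
neighbor_index = ind[:, 1]
neighbor_distance = dist[:, 1]

print(np.all(ind[:, 0] == np.arange(len(X))))  # should print True

Scikit-learn's BallTree, used below, returns its query results in the same (distances, indices) form.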
How does this compare to the brute-force vectorized method? Well, if brute-force searches truly scale as O(N^2), then we would expect the computation to take 100^2 × 139 ms, which comes to around 23 minutes, compared to a few seconds for a tree-based method!

Even as kd-trees solve the scaling and dimensionality issues discussed above, they are still subject to a fundamental weakness, at least in principle. Because the kd-tree relies on rectilinear splitting of the data space, it also falls subject to the curse of dimensionality. To see why, imagine building a kd-tree on points in a D-dimensional space. Because the kd-tree splits along a single dimension at each level, one must go D levels deep before each dimension has been split. For D relatively small, this does not pose a problem. But for, say, D = 100, this means that we must create 2^100 ≈ 10^30 nodes in order to split each dimension once! This is a clear limitation of kd-trees in high dimensions. One would expect that, for N points in D dimensions, a kd-tree will lose efficiency when D ≳ log_2 N. As a result, other types of trees can sometimes do better than kd-trees, such as the ones we describe below.

Ball-trees. Ball-trees [29, 30] make use of an intuitive fact: if x_1 is far from x_2 and x_2 is near x_3, then x_1 is also far from x_3. This intuition is a reflection of the triangle inequality,

    D(x_1, x_3) \leq D(x_1, x_2) + D(x_2, x_3),        (2.7)

which can be proven relatively easily for Euclidean distances, and which also applies to a large number of other distance metrics that may be useful in certain applications.

Ball-trees represent a way to address the flaw of kd-trees applied to high-dimensional structured data. Rather than building rectilinear nodes in D dimensions, ball-tree construction builds hyperspherical nodes. Each node is defined by a centroid c_i and a radius r_i, such that the distance D(y, c_i) ≤ r_i for every point y contained in the node. With this construction, given a point x outside the node, it is straightforward to show from the triangle inequality (eq. 2.7) that

    D(x, c_i) - r_i \leq D(x, y) \leq D(x, c_i) + r_i        (2.8)

for any point y in the node. Using this fact, a neighbor search can proceed quickly by eliminating large parts of the data set from a query through a single distance computation.

Figure 2.5. Example of a ball-tree.

Figure 2.5 shows an example of a ball-tree in two dimensions. Comparing to the kd-tree example in figure 2.4 (which uses the same data set), one can see that the ball-tree nodes converge more quickly on the nonlinear structure of the data set. This allows the ball-tree to be much more efficient than the kd-tree in high dimensions in some cases. There is a fast ball-tree algorithm included in the package Scikit-learn, which can be used as follows (compare to the kd-tree example above):

In [1]: import numpy as np
In [2]: from sklearn.neighbors import BallTree
In [3]: np.random.seed(0)
In [4]: X = np.random.random((1000, 3))
In [5]: bt = BallTree(X)  # build the Ball Tree
In [6]: %timeit bt.query(X, k=2)  # query for two neighbors
... loops, best of 3: ... ms per loop
In [7]: X = np.random.random((100000, 3))
In [8]: bt = BallTree(X)
In [9]: %timeit bt.query(X, k=2)  # query for two neighbors
... loops, best of 3: ... s per loop

We see that in low dimensions, the ball-tree and kd-tree have comparable computational complexity. As the number of dimensions increases, the ball-tree can outperform the kd-tree, but the actual performance depends highly on the internal structure or intrinsic dimensionality of the data (see below).
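The tradeoff between the two structures is easy to explore directly. The sketch below uses scikit-learn's KDTree and BallTree classes together with the standard-library timeit module to compare query times on data that have many measured dimensions but only a few underlying degrees of freedom. The data construction, sizes, and seed are arbitrary illustrative choices, and the outcome will vary with the structure of the data, the machine, and the library version.

import timeit
import numpy as np
from sklearn.neighbors import KDTree, BallTree

np.random.seed(0)

# 2000 points with 3 underlying degrees of freedom, embedded in 20 dimensions
N, d_true, D = 2000, 3, 20
latent = np.random.random((N, d_true))
projection = np.random.random((d_true, D))
X = np.dot(latent, projection) + 0.01 * np.random.random((N, D))

for TreeClass in (KDTree, BallTree):
    tree = TreeClass(X)
    t = timeit.timeit(lambda: tree.query(X, k=2), number=5)
    print("%s: %.3f s per query pass" % (TreeClass.__name__, t / 5))

On data like this, the ball-tree's spherical nodes can hug the low-dimensional structure more tightly than axis-aligned boxes can, but the only reliable way to choose between the two structures is to benchmark them on your own data.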
Other trees. Many other fast methods for tree-based nearest-neighbor searches have been developed, too numerous to cover here. Cover trees [3] are an interesting nonbinary kind of ball-tree that, by construction, allows a theoretical proof of the usual desired O(log N) single-query search time under mild assumptions (unlike kd-trees and ball-trees, for which it is difficult to provide such hard runtime guarantees beyond simple cases such as uniformly distributed random data). The idea of deriving orthogonal directions from the data (see §7.3) for building trees is an old one; some modern twists are shown in [24, 27]. Maximum margin trees have been shown to perform better than the well-known tree structures in the setting of time-sensitive searches, that is, where the goal is to return the best nearest-neighbor answer within a bounded time budget [35]. In astronomy, for two-dimensional data on a sphere, a structure based on hexagons has been shown to be effective [21]. Trees have also been developed for other kinds of similarities between points beyond distances, for example cosine trees [16] and cone trees [34] for dot products. There are in fact hundreds or possibly thousands of proposed data structures for nearest-neighbor searches; a number of useful references can be found in [4, 10, 39].

Intrinsic dimensionality. Thankfully, despite the curse of dimensionality, there is a nice property that real data sets almost always have. In reality the dimensions are generally correlated, meaning that the "true" or "intrinsic" dimensionality of the data (in some sense the true number of degrees of freedom) is often much ...