Statistics, Data Mining, and Machine Learning in Astronomy
smaller than the simple number of columns, or “extrinsic” dimension. The real performance of multidimensional trees is subject to this intrinsic dimensionality, which unfortunately is not known in advance, and itself requires an expensive (typically O(N )) algorithm to estimate. Krauthgamer and Lee [19] describe one notion of intrinsic dimensionality, the expansion constant, and examples of how the runtime of algorithms of interest in this book depends on it are shown in [3, 36]. The result is that kd-trees, while seemingly not a compelling idea in high dimensions, can often work well in up to hundreds or even thousands of dimensions, depending on the structure of the data and the number of data points, and sometimes much better than ball-trees. So the story is not as clear as one might hope or think.

So which tree should you use? Unfortunately, at this stage of research in computational geometry there is little theory to help answer the question without just trying a number of different structures and seeing which works best.

Approximate neighbor methods

The degradation in the speedup provided by tree-based algorithms in high dimensions has motivated the use of sampling methods in conjunction with trees. The locality-sensitive hashing scheme [1], based on the simple idea of random projections, is one recent approach. It returns points whose distance to the query point lies within a fixed factor of the true nearest-neighbor distance. However, Hammersley in 1950 [15] showed that distances numerically occupy a narrowing range as the dimension increases: this makes it clear that such an approximation notion in high dimensions will eventually accept any point as a candidate for the true nearest neighbor. A newer approximation notion [37] guarantees that the returned neighbor is within a certain distance in rank order, for example, in the closest 1% of the points, without actually computing all the distances. This retains meaning even in the face of arbitrary dimension.

In general we caution that the design of practical fast algorithms can be subtle, in particular requiring that substantial thought be given to any mathematical notion of approximation that is employed. Another example of an important subtlety in this regard occurs in the computation of approximate kernel summations, in which it is recommended to bound the relative error (the error as a percentage of the true sum) rather than the more straightforward and typical practice of bounding the absolute error (see [23]).
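To see why the choice between relative and absolute error matters for kernel summations, consider the following minimal sketch. It is not the algorithm of [23]: it assumes a one-dimensional Gaussian kernel and a crude approximation that simply drops points beyond a cutoff radius, and every name in it (`gaussian_kernel_sum`, `errors`, the bandwidth and cutoff values) is a hypothetical choice made for illustration. It compares a query in the dense core of the data with a query far out in the tail.

```python
import math
import random

def gaussian_kernel_sum(points, query, bandwidth, cutoff=None):
    """Sum of Gaussian kernel values at `query`.

    When `cutoff` is given, points farther than `cutoff` from the query
    are skipped (an illustrative approximation, assumed for this sketch).
    """
    total = 0.0
    for x in points:
        r = abs(x - query)
        if cutoff is not None and r > cutoff:
            continue  # contribution dropped by the approximation
        total += math.exp(-0.5 * (r / bandwidth) ** 2)
    return total

def errors(points, query, bandwidth, cutoff):
    """Return (absolute error, relative error) of the truncated sum."""
    exact = gaussian_kernel_sum(points, query, bandwidth)
    approx = gaussian_kernel_sum(points, query, bandwidth, cutoff)
    abs_err = abs(exact - approx)
    return abs_err, abs_err / exact

rng = random.Random(42)
points = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # synthetic data
bandwidth = 0.5
cutoff = 3.0 * bandwidth

abs_dense, rel_dense = errors(points, 0.0, bandwidth, cutoff)  # dense core
abs_tail, rel_tail = errors(points, 8.0, bandwidth, cutoff)    # far tail

print(f"dense query: abs error {abs_dense:.3e}, rel error {rel_dense:.3e}")
print(f"tail query:  abs error {abs_tail:.3e}, rel error {rel_tail:.3e}")
```

In this setup the tail query has the smaller absolute error, so any absolute tolerance met at the dense query is also met in the tail, yet the truncated sum there misses essentially the entire true value. A relative-error bound would flag exactly that case.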
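The contrast between the distance-factor and rank-order notions of approximate neighbors can also be checked numerically. The sketch below makes several assumptions: points are drawn uniformly from the unit cube, the factor is 1.5, and the rank-order part uses plain random sampling with replacement rather than the tree-based method of [37]. The function names are hypothetical. The first part measures what fraction of all points a "within a fixed factor of the nearest distance" rule would accept as the dimension grows. The second computes how many random draws are needed so that, with probability at least 1 - delta, at least one draw falls in the closest tau fraction of the points, which follows from requiring (1 - tau)^n <= delta.

```python
import math
import random

def distances_to_query(points, query):
    """Euclidean distance from the query to every point."""
    return [math.dist(p, query) for p in points]

def fraction_within_factor(dists, eps):
    """Fraction of points within (1 + eps) times the nearest distance."""
    d_min = min(dists)
    return sum(d <= (1.0 + eps) * d_min for d in dists) / len(dists)

def sample_size_for_rank(tau, delta):
    """Smallest number n of uniform draws (with replacement) such that
    (1 - tau)**n <= delta, i.e. some draw lands in the closest tau
    fraction of the points with probability at least 1 - delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - tau))

rng = random.Random(0)
n_points = 2000
results = {}
for dim in (2, 20, 200):
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = distances_to_query(points, query)
    results[dim] = fraction_within_factor(dists, eps=0.5)
    print(f"dim={dim:4d}  max/min distance={max(dists) / min(dists):8.2f}  "
          f"fraction accepted={results[dim]:.3f}")

# Rank-order guarantee: closest 1% of points, failure probability 5%.
n_draws = sample_size_for_rank(tau=0.01, delta=0.05)
print("draws needed for the rank-order guarantee:", n_draws)
```

The number of draws depends only on tau and delta, not on the dimension or on the number of points, which is one way to see why a rank-based criterion keeps its meaning where a distance-factor criterion does not.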