smaller than the simple number of columns, or "extrinsic" dimension. The real performance of multidimensional trees is governed by this intrinsic dimensionality, which unfortunately is not known in advance and itself requires an expensive (typically O(N)) algorithm to estimate. Krauthgamer and Lee [19] describe one notion of intrinsic dimensionality, the expansion constant, and examples of how the runtimes of algorithms of interest in this book depend on it are shown in [3, 36]. The result is that kd-trees, while seemingly not a compelling idea in high dimensions, can often work well in up to hundreds or even thousands of dimensions, depending on the structure of the data and the number of data points, and sometimes much better than ball-trees. So the story is not as clear as one might hope or think. So which tree should you use? Unfortunately, at this stage of research in computational geometry there is little theory to help answer the question without simply trying a number of different structures and seeing which works best.

Approximate neighbor methods
The degradation in the speedup provided by tree-based algorithms in high dimensions has motivated the use of sampling methods in conjunction with trees. The locality-based hashing scheme [1], based on the simple idea of random projections, is one recent approach. It works by returning points that lie within a fixed factor of the true distance to the query point. However, Hammersley in 1950 [15] showed that distances numerically occupy a narrowing range as the dimension increases: this makes it clear that such an approximation notion in high dimensions will eventually accept any point as a candidate for the true nearest neighbor. A newer approximation notion [37] guarantees that the returned neighbor is within a certain distance in rank order, for example in the closest 1% of the points, without actually computing all the distances. This retains meaning even in the face of arbitrary dimension.

In general we caution that the design of practical fast algorithms can be subtle, in particular requiring that substantial thought be given to any mathematical notion of approximation that is employed. Another example of an important subtlety in this regard occurs in the computation of approximate kernel summations, in which it is recommended to bound the relative error (the error as a percentage of the true sum) rather than the more straightforward and typical practice of bounding the absolute error (see [23]).
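The advice above, to simply try several structures, is easy to act on. The following sketch (a minimal illustration, assuming NumPy and scikit-learn are available; the synthetic Gaussian data, sample size, dimension, and leaf size are arbitrary choices, not values from the text) times k-nearest-neighbor queries against a kd-tree and a ball-tree built on the same points:

```python
# Hedged sketch: compare kd-tree and ball-tree query times on your own data.
import time
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.default_rng(0)
N, D, k = 100_000, 30, 5               # illustrative sizes only
X = rng.normal(size=(N, D))            # replace with your feature matrix
queries = X[:1000]                     # a subset of points used as queries

for Tree in (KDTree, BallTree):
    tree = Tree(X, leaf_size=40)       # build the space-partitioning tree
    t0 = time.perf_counter()
    dist, ind = tree.query(queries, k=k)
    dt = time.perf_counter() - t0
    print(f"{Tree.__name__}: {dt:.3f} s for {len(queries)} queries")
```

Which structure wins depends on the intrinsic structure of the data, so the timing comparison should be repeated on a representative subset of the actual data set rather than on synthetic points.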
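The narrowing range of distances noted by Hammersley [15] is easy to observe numerically. The NumPy-only sketch below (uniform data in a hypercube and the sample sizes are illustrative assumptions) measures, for a single random query, how the relative spread of distances shrinks as the dimension grows, which is why a "within a fixed factor of the nearest distance" criterion eventually admits almost every point:

```python
# Hedged sketch: concentration of distances with increasing dimension.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))          # points in the unit hypercube
    q = rng.uniform(size=d)               # a single query point
    dist = np.linalg.norm(X - q, axis=1)  # Euclidean distances to the query
    spread = dist.std() / dist.mean()     # relative spread of the distances
    contrast = dist.max() / dist.min()    # far-to-near distance ratio
    print(f"d={d:5d}  relative spread={spread:.3f}  max/min={contrast:.2f}")
```

A rank-order guarantee of the kind used in [37] sidesteps this effect, since "among the closest 1% of points" remains meaningful no matter how compressed the numerical range of distances becomes.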
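The recommendation to bound relative rather than absolute error in kernel summations can also be motivated with a few lines of NumPy. This sketch (the Gaussian bandwidth, data sizes, and query placement are assumptions for illustration, not values from [23]) computes exact Gaussian kernel sums for queries progressively farther from the data; the sums span many orders of magnitude, so a single absolute tolerance cannot be meaningful for all queries:

```python
# Hedged sketch: why relative-error bounds are preferred for kernel sums.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 3))   # reference points
h = 0.5                          # Gaussian kernel bandwidth (assumed)

def kernel_sum(q):
    """Exact Gaussian kernel sum: sum_i exp(-||q - x_i||^2 / (2 h^2))."""
    d2 = np.sum((X - q) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * h * h)).sum()

for shift in (0.0, 2.0, 5.0):
    q = np.full(3, shift)        # queries move away from the data cloud
    print(f"query at {shift}: exact sum = {kernel_sum(q):.3e}")

# An absolute tolerance tight enough for the largest sum can exceed the
# smallest sum entirely; a bound expressed as a fraction of the true sum
# remains meaningful for every query.
```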
References

[1] Andoni, A. and P. Indyk (2008). Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM 51(1), 117–122.
[2] Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM 18, 509–517.
[3] Beygelzimer, A., S. Kakade, and J. Langford (2006). Cover trees for nearest neighbor. In ICML, pp. 97–104.
[4] Bhatia, N. and Vandana (2010). Survey of nearest neighbor techniques. Journal of Computer Science 8(2).
[5] Brown, P. G. (2010). Overview of SciDB: Large scale array storage, processing and analysis. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, New York, NY, USA, pp. 963–968. ACM.
[6] Budavári, T. and A. S. Szalay (2008). Probabilistic cross-identification of astronomical sources. ApJ 679, 301–309.
[7] Budavári, T., V. Wild, A. S. Szalay, L. Dobos, and C.-W. Yip (2009). Reliable eigenspectra for new generation surveys. MNRAS 394, 1496–1502.
[8] Cormen, T. H., C. E. Leiserson, and R. L. Rivest (2001). Introduction to Algorithms. MIT Press.
[9] Dean, J. and S. Ghemawat (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM 51(1), 107–113.
[10] Dhanabal, S. and D. S. Chandramathi (2011). A review of various k-nearest neighbor query processing techniques. International Journal of Computer Applications 31(7), 14–22. Foundation of Computer Science, New York, USA.
[11] Gardner, J. P., A. Connolly, and C. McBride (2007). Enabling rapid development of parallel tree search applications. In Proceedings of the 5th IEEE Workshop on Challenges of Large Applications in Distributed Environments, CLADE '07, New York, NY, USA, pp. 1–10. ACM.
[12] Gray, A. and A. Moore (2000). 'N-Body' problems in statistical learning. In Advances in Neural Information Processing Systems 13, pp. 521–527. MIT Press.
[13] Greengard, L. and V. Rokhlin (1987). A fast algorithm for particle simulations. Journal of Computational Physics 73.
[14] Guan, W. and A. G. Gray (2013). Sparse fractional-norm support vector machine via DC programming. Computational Statistics and Data Analysis (CSDA), in press.
[15] Hammersley, J. M. (1950). The distribution of distance in a hypersphere. Annals of Mathematical Statistics 21(3), 447–452.
[16] Holmes, M., A. G. Gray, and C. Isbell (2009). QUIC-SVD: Fast SVD using cosine trees. In Advances in Neural Information Processing Systems (NIPS) (Dec 2008). MIT Press.
[17] Jordan, M. I., Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999). An introduction to variational methods for graphical models. Machine Learning 37(2), 183–233.
[18] Juditsky, A., G. Lan, A. Nemirovski, and A. Shapiro (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19(4), 1574–1609.
[19] Krauthgamer, R. and J. R. Lee (2004). Navigating nets: Simple algorithms for proximity search. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '04, pp. 798–807. Society for Industrial and Applied Mathematics.
[20] Kubica, J., L. Denneau, Jr., A. Moore, R. Jedicke, and A. Connolly (2007). Efficient algorithms for large-scale asteroid discovery. In R. A. Shaw, F. Hill, and D. J. Bell (Eds.), Astronomical Data Analysis Software and Systems XVI, Volume 376 of Astronomical Society of the Pacific Conference Series, pp. 395–404.
[21] Kunszt, P., A. S. Szalay, and A. Thakar (2000). The hierarchical triangular mesh. In Mining the Sky: Proc. of the MPA/ESO/MPE Workshop, pp. 631–637.
[22] Lang, D., D. W. Hogg, K. Mierle, M. Blanton, and S. Roweis (2010). Astrometry.net: Blind astrometric calibration of arbitrary astronomical images. Astronomical Journal 139, 1782–1800.
[23] Lee, D. and A. G. Gray (2006). Faster Gaussian summation: Theory and experiment. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence.
[24] Lee, D. and A. G. Gray (2009). Fast high-dimensional kernel summations using the Monte Carlo multipole method. In Advances in Neural Information Processing Systems (NIPS) (Dec 2008). MIT Press.
[25] Lee, D., R. Vuduc, and A. G. Gray (2012). A distributed kernel summation framework for general-dimension machine learning. In SIAM International Conference on Data Mining (SDM).
[26] Lee, M. and T. Budavári (2012). Fast cross-matching of astronomical catalogs on GPUs. In GPU Technology Conference (GTC).
[27] Liaw, Y.-C., M.-L. Leou, and C.-M. Wu (2010). Fast exact k nearest neighbors search using an orthogonal search tree. Pattern Recognition 43(6), 2351–2358.
[28] Malon, D., P. van Gemmeren, and J. Weinstein (2012). An exploration of SciDB in the context of emerging technologies for data stores in particle physics and cosmology. Journal of Physics: Conference Series 368(1), 012021.
[29] Moore, A. (2000). The Anchors Hierarchy: Using the triangle inequality to survive high dimensional data. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 397–405. Morgan Kaufmann.
[30] Omohundro, S. M. (1989). Five balltree construction algorithms. Technical Report TR-89-063, International Computer Science Institute.
[31] Ouyang, H. and A. G. Gray (2012a). NASA: Achieving lower regrets and faster rates via adaptive stepsizes. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
[32] Ouyang, H. and A. G. Gray (2012b). Stochastic smoothing for nonsmooth minimizations: Accelerating SGD by exploiting structure. In International Conference on Machine Learning (ICML).
[33] Ram, P. and A. G. Gray (2011). Density estimation trees. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
[34] Ram, P. and A. G. Gray (2012a). Maximum inner-product search using cone trees. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
[35] Ram, P. and A. G. Gray (2012b). Nearest-neighbor search on a time budget via max-margin trees. In SIAM International Conference on Data Mining (SDM).
[36] Ram, P., D. Lee, W. March, and A. G. Gray (2010). Linear-time algorithms for pairwise statistical problems. In Advances in Neural Information Processing Systems (NIPS) (Dec 2009). MIT Press.
[37] Ram, P., D. Lee, H. Ouyang, and A. G. Gray (2010). Rank-approximate nearest neighbor search: Retaining meaning and speed in high dimensions. In Advances in Neural Information Processing Systems (NIPS) (Dec 2009). MIT Press.
[38] Richards, J. W., D. L. Starr, H. Brink, A. A. Miller, J. S. Bloom, and others (2012). Active learning to overcome sample selection bias: Application to photometric variable star classification. ApJ 744, 192.
[39] Samet, H. (1990). The Design and Analysis of Spatial Data Structures. Addison-Wesley Longman.
[40] Sastry, R. and A. G. Gray (2012). UPAL: Unbiased pool based active learning. In Conference on Artificial Intelligence and Statistics (AISTATS).
[41] Scargle, J. D., J. P. Norris, B. Jackson, and J. Chiang (2012). Studies in astronomical time series analysis. VI. Bayesian block representations. arXiv:1207.5578.
[42] Springel, V., N. Yoshida, and S. D. M. White (2001). GADGET: A code for collisionless and gasdynamical cosmological simulations. New Astronomy 6, 79–117.
[43] Stonebraker, M., P. Brown, A. Poliakov, and S. Raman (2011). The architecture of SciDB. In Proceedings of the 23rd International Conference on Scientific and Statistical Database Management, SSDBM '11, Berlin, Heidelberg, pp. 1–16. Springer.
[44] Vasiloglou, N., A. G. Gray, and D. Anderson (2008). Scalable semidefinite manifold learning. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP).
[45] Wadsley, J. W., J. Stadel, and T. Quinn (2004). Gasoline: A flexible, parallel implementation of TreeSPH. New Astronomy 9, 137–158.