Statistics, data mining, and machine learning in astronomy





…dependent on the dimension. The noise model leads to an analytic treatment of the deconvolved kernel. We will assume that the noise is distributed according to the multivariate exponential

    g(\mathbf{x}) = \frac{1}{(\sqrt{2})^D\,\sigma_1\sigma_2\cdots\sigma_D} \exp\left[-\sqrt{2}\left(\frac{|x_1|}{\sigma_1} + \frac{|x_2|}{\sigma_2} + \cdots + \frac{|x_D|}{\sigma_D}\right)\right],    (6.11)

where the \sigma_i represent the standard deviations in each dimension. We will assume that (\sigma_1, \ldots, \sigma_D) are known for each data point. For the case of a Gaussian kernel function, the deconvolution kernel is then

    K_{h,\sigma}(\mathbf{x}) = \frac{\exp(-|\mathbf{x}|^2/2)}{(2\pi)^{D/2}} \prod_{i=1}^{D}\left[1 - \frac{\sigma_i^2\,(x_i^2 - 1)}{2h^2}\right].    (6.12)

This deconvolution kernel can then be used in place of the kernels discussed in §6.1.1 above, noting the additional dependence on the error \sigma_i of each point.

6.1.3 Extensions and Related Methods

The idea of kernel density estimation can be extended to other tasks, including classification (kernel discriminant analysis, §9.3.5), regression (kernel regression, §8.5), and conditional density estimation (kernel conditional density estimation, §3.1.3). Some of the ideas that have been developed to make kernel regression highly accurate, discussed in §8.5, can be brought back to density estimation, including the idea of using variable bandwidths, in which each data point can have its own kernel width.

6.2 Nearest-Neighbor Density Estimation

Another often used and simple density estimation technique is based on the distribution of nearest neighbors. For each point (e.g., a pixel location on the two-dimensional grid) we can find the distance to the Kth-nearest neighbor, d_K. In this method, originally proposed in an astronomical context by Dressler et al. [11], the implied point density at an arbitrary position \mathbf{x} is estimated as

    f_K(\mathbf{x}) = \frac{K}{V_D(d_K)},    (6.13)

where the volume V_D is evaluated according to the problem dimensionality, D (e.g., for D = 2, V_2 = \pi d^2; for D = 3, V_3 = 4\pi d^3/3; for higher dimensions, see eq. 7.3). The simplicity of this estimator is a consequence of the assumption that the underlying density field is locally constant. In practice, the method is even simpler because one can compute

    f_K(\mathbf{x}) = \frac{C}{d_K^D},    (6.14)

and evaluate the scaling factor C at the end by requiring that the sum of the product of f_K(\mathbf{x}) and pixel volume is equal to the total number of data points. The error in f_K(\mathbf{x}) is \sigma_f = K^{1/2}/V_D(d_K), and the fractional (relative) error is \sigma_f/f = 1/K^{1/2}. Therefore, the fractional accuracy increases with K at the expense of the spatial resolution (the effective resolution scales with K^{1/D}). In practice, K should be at least 5 because the estimator is biased and has a large variance for smaller K; see [7].

This general method can be improved (the error in f can be decreased without a degradation in the spatial resolution, or alternatively the resolution can be increased without increasing the error in f) by considering distances to all K nearest neighbors instead of only the distance to the Kth-nearest neighbor; see [18]. Given distances to all K neighbors, d_i, i = 1, \ldots, K,

    f_K(\mathbf{x}) = \frac{C}{\sum_{i=1}^{K} d_i^D}.    (6.15)

Derivation of eq. 6.15 is based on Bayesian analysis, as described in [18]. The proper normalization when computing local density without regard to overall mean density is

    C = \frac{K(K+1)}{2\,V_D(1)}.    (6.16)

When searching for local overdensities in the case of sparse data, eq. 6.15 is superior to eq. 6.14; in the constant density case, both methods have similar statistical power. For an application in an astronomical setting, see [36].

AstroML implements nearest-neighbor density estimation using a fast ball-tree algorithm. This can be accomplished as follows:

    import numpy as np
    from astroML.density_estimation import KNeighborsDensity

    X = np.random.normal(size=(1000, 2))   # 1000 points in 2 dims
    knd = KNeighborsDensity("bayesian", 5) # Bayesian method, 5 neighbors
    knd.fit(X)                             # fit the model to the data
    dens = knd.eval(X)                     # evaluate the model at the data

The method can be either "simple" to use eq. 6.14, or "bayesian" to use eq. 6.15. See the AstroML documentation or the code associated with figure 6.4 for further examples.

Figure 6.4 compares density estimation using the Gaussian kernel with a bandwidth of 5 Mpc and using the nearest-neighbor method (eq. 6.15) with K = 5 and K = 40 for the same sample of galaxies as shown in figure 6.3. For small K the …
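The deconvolved kernel of eq. 6.12 is simple to evaluate directly. The sketch below is a minimal NumPy implementation (the function name `deconvolved_gaussian_kernel` is illustrative, not part of AstroML); note that with zero noise the correction factor reduces to 1 and the ordinary Gaussian kernel is recovered.

```python
import numpy as np

def deconvolved_gaussian_kernel(x, sigma, h):
    """Deconvolved Gaussian kernel of eq. 6.12.

    x     : array of shape (D,), position in units of the bandwidth h
    sigma : array of shape (D,), per-dimension noise standard deviations
    h     : scalar kernel bandwidth
    """
    x = np.asarray(x, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    D = x.shape[0]
    # ordinary D-dimensional Gaussian kernel, exp(-|x|^2/2) / (2 pi)^(D/2)
    gauss = np.exp(-0.5 * np.dot(x, x)) / (2 * np.pi) ** (D / 2)
    # per-dimension deconvolution correction, prod_i [1 - sigma_i^2 (x_i^2 - 1) / (2 h^2)]
    correction = np.prod(1.0 - sigma ** 2 * (x ** 2 - 1.0) / (2.0 * h ** 2))
    return gauss * correction

x = np.array([0.5, -0.3])
k_noisy = deconvolved_gaussian_kernel(x, sigma=[0.1, 0.2], h=1.0)
k_clean = deconvolved_gaussian_kernel(x, sigma=[0.0, 0.0], h=1.0)
```

Inside the kernel support (|x_i| < 1) the correction inflates the kernel to compensate for the smoothing already introduced by the measurement noise.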
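The estimators of eqs. 6.14 and 6.15 are also easy to prototype from scratch. The sketch below uses SciPy's `cKDTree` for the neighbor search rather than AstroML's ball tree, and drops the constant C, returning densities known only up to an overall scale; the name `knn_density` and the toy data are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_density(X, K, method="bayesian"):
    """Unnormalized nearest-neighbor density at each point of X.

    method="simple"   : f proportional to 1 / d_K^D         (eq. 6.14)
    method="bayesian" : f proportional to 1 / sum_i d_i^D   (eq. 6.15)
    The constant C is omitted; normalize afterwards if needed.
    """
    X = np.atleast_2d(X)
    N, D = X.shape
    tree = cKDTree(X)
    # query K+1 neighbors: the first is the point itself at distance 0
    d, _ = tree.query(X, k=K + 1)
    d = d[:, 1:]  # drop the self-distance column
    if method == "simple":
        return 1.0 / d[:, -1] ** D
    return 1.0 / np.sum(d ** D, axis=1)

rng = np.random.default_rng(0)
# toy data: a dense cluster plus a sparse uniform background
X = np.vstack([rng.normal(0.0, 0.1, size=(100, 2)),
               rng.uniform(-3.0, 3.0, size=(100, 2))])
dens = knn_density(X, K=5)
```

As expected, the estimated density is much higher for the clustered points (the first 100 rows) than for the uniform background.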

Posted: 20/11/2022, 11:18