Statistics, data mining, and machine learning in astronomy

6 Searching for Structure in Point Data

“Through space the universe encompasses and swallows me up like an atom; through thought I comprehend the world.” (Blaise Pascal)

We begin the third part of this book by addressing methods for exploring and quantifying structure in a multivariate distribution of points. One name for this kind of activity is exploratory data analysis (EDA). Given a sample of N points in D-dimensional space, there are three classes of problems that are frequently encountered in practice: density estimation, cluster finding, and statistical description of the observed structure. The space populated by points in the sample can be real physical space, or a space spanned by the measured quantities (attributes). For example, we can consider the distribution of sources in a multidimensional color space, or in a six-dimensional space spanned by three-dimensional positions and three-dimensional velocities.

Inferring the pdf from a sample of data is known as density estimation. The same methodology is often called data smoothing. We have already encountered density estimation in the one-dimensional case when discussing histograms in §4.8 and §5.7.2, and in this chapter we extend it to multidimensional cases. Density estimation is one of the most critical components of extracting knowledge from data. For example, given a pdf estimated from point data, we can generate simulated distributions of data and compare them against observations. If we can identify regions of low probability within the pdf, we have a mechanism for the detection of unusual or anomalous sources. If our point data can be separated into subsamples using provided class labels, we can estimate the pdf for each subsample and use the resulting set of pdfs to classify new points: the probability that a new point belongs to each subsample/class is proportional to the pdf of each class evaluated at the position of the point (see §9.3.5). Density estimation relates directly to regression, discussed elsewhere in the book (where we simplify the problem to the prediction of a single variable from the pdf), and is at the heart of many of the classification procedures described elsewhere in the book. We discuss nonparametric and parametric methods for density estimation in §6.1–6.3.

Given a point data set, we can further ask whether it displays any structure (as opposed to a random distribution of points). Finding concentrations of multivariate points (or groups of sources) is known in astronomy as “clustering” (when a density estimate is available, clusters correspond to “overdensities”). Clusters can be defined ...
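
The three uses of a density estimate described above (generating simulated data, flagging low-probability points, and classifying with per-class pdfs) can be illustrated with a minimal sketch. It is not the book's own code: it assumes an isotropic Gaussian kernel with a hand-picked bandwidth `h`, equal class priors, and synthetic two-dimensional data, and the names `kde_pdf`, `kde_draw`, and `class_probabilities` are hypothetical helpers introduced only for this illustration. It uses only the Python standard library, so it is far slower than a dedicated library implementation would be.

```python
import math
import random


def kde_pdf(x, sample, h):
    """Isotropic Gaussian kernel density estimate at point x (bandwidth h)."""
    d = len(x)
    norm = len(sample) * (2.0 * math.pi * h * h) ** (d / 2.0)
    total = 0.0
    for xi in sample:
        r2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        total += math.exp(-0.5 * r2 / (h * h))
    return total / norm


def kde_draw(sample, h, rng):
    """Draw one simulated point from the estimate:
    pick a data point at random, then add Gaussian scatter of width h."""
    centre = rng.choice(sample)
    return tuple(c + rng.gauss(0.0, h) for c in centre)


def class_probabilities(x, samples_by_class, h):
    """Class membership probabilities at x, proportional to each class pdf.
    Assumption: equal class priors."""
    pdfs = {label: kde_pdf(x, s, h) for label, s in samples_by_class.items()}
    total = sum(pdfs.values())
    if total == 0.0:
        raise ValueError("all class pdfs underflow to zero at this point")
    return {label: p / total for label, p in pdfs.items()}


rng = random.Random(0)

# Synthetic stand-ins for two labeled subsamples in a 2-D attribute space.
class_a = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
class_b = [(rng.gauss(5.0, 1.0), rng.gauss(5.0, 1.0)) for _ in range(200)]
h = 0.5  # bandwidth chosen by hand for this toy example

# 1. Simulated data drawn from the estimated pdf of class_a.
simulated = [kde_draw(class_a, h, rng) for _ in range(5)]

# 2. Low-probability regions: a point far from the data has a tiny pdf value.
p_near = kde_pdf((0.0, 0.0), class_a, h)
p_far = kde_pdf((10.0, 10.0), class_a, h)

# 3. Classification of a new point using the per-class pdfs.
probs = class_probabilities((4.5, 5.2), {"a": class_a, "b": class_b}, h)

print("pdf near data:", p_near, " pdf far from data:", p_far)
print("class probabilities:", probs)
```

In this sketch, a point would be flagged as anomalous by comparing its pdf value against a threshold; the choice of that threshold, like the choice of bandwidth, is left open here.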

Posted: 20/11/2022, 11:19