Statistics, Data Mining, and Machine Learning in Astronomy
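The hypersphere estimate quoted in the passage below (one part in 1.4 × 10^5 for 357 million sources and 30 attributes) can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' calculation: it assumes the sources are distributed uniformly within the hypercube [−1, 1]^30, and the function name `unit_ball_fraction` is a hypothetical helper introduced here for illustration. It uses the standard volume of a unit D-dimensional ball, π^(D/2) / Γ(D/2 + 1), divided by the hypercube volume 2^D.

```python
import math


def unit_ball_fraction(dim: int) -> float:
    """Fraction of the hypercube [-1, 1]^dim that lies inside the unit hypersphere.

    Assumes points are spread uniformly over the hypercube (an illustrative
    assumption, not something stated for the real catalog).
    """
    # Work in logarithms so that large dimensions do not underflow prematurely.
    # Volume of the unit ball: pi^(dim/2) / Gamma(dim/2 + 1)
    log_ball = 0.5 * dim * math.log(math.pi) - math.lgamma(0.5 * dim + 1.0)
    # Volume of the enclosing hypercube with side length 2: 2^dim
    log_cube = dim * math.log(2.0)
    return math.exp(log_ball - log_cube)


n_sources = 357e6   # number of sources quoted in the text
dim = 30            # number of selected attributes quoted in the text

fraction = unit_ball_fraction(dim)
# Expected number of sources inside the unit hypersphere; because this is
# much smaller than 1, it is also approximately the probability that at
# least one source falls inside.
expected_inside = n_sources * fraction

print(f"volume fraction for D={dim}: {fraction:.3e}")
print(f"expected sources inside:    {expected_inside:.3e}")
print(f"i.e. about one part in      {1.0 / expected_inside:.3g}")
```

Under the uniform assumption the result is consistent with the figure of roughly one part in 1.4 × 10^5 quoted in the text. Calling `unit_ball_fraction` with other values of `dim` shows how quickly the occupied fraction collapses as dimensions are added.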
Figure 7.1 A sample of 15 galaxy spectra selected from the SDSS spectroscopic data set (see §1.5.5). These spectra span a range of galaxy types, from star-forming to passive galaxies. Each spectrum has been shifted to its rest frame and covers the wavelength interval 3000–8000 Å. The specific fluxes, $F_\lambda(\lambda)$, on the ordinate axes have an arbitrary scaling; the abscissa is wavelength (Å).

a sample of 357 million sources. Each source has 448 measured attributes (e.g., measures of flux, size, shape, and position). If we used our physical intuition to select just 30 of those attributes from the database (e.g., a subset of the magnitude, size, and ellipticity measures) and normalized the data such that each dimension spanned the range −1 to 1, the probability of having one of the 357 million sources reside within the unit hypersphere would be only one part in $1.4 \times 10^{5}$.

Given the dimensionality of current astronomical data sets, how can we ever hope to find or characterize any structure that might be present? The underlying assumption behind our earlier discussions has been that all dimensions or attributes are created equal. We know, however, that this is not true. There exist projections within the data that capture the principal physical and statistical correlations between measured quantities (this idea lies behind the intrinsic dimensionality discussed in §2.5.2). Finding these dimensions or axes efficiently, and thereby reducing the dimensionality of the underlying data, is the subject of this chapter.

7.2 The Data Sets Used in This Chapter

Throughout this chapter we use the SDSS galaxy spectra described in §1.5.4 and §1.5.5 as a proxy for high-dimensional data. Figure 7.1 shows a representative sample