Statistics, Data Mining, and Machine Learning in Astronomy



Chapter 8. Regression and Model Fitting

The line is described by a unit vector n = (−sin α, cos α)^T orthogonal to it, where θ1 = tan α and α is the angle between the line and the x-axis. The covariance matrix Σi of each point projects onto this space as

$$ S_i^2 = \mathbf{n}^T \Sigma_i \, \mathbf{n}, \qquad (8.62) $$

and the distance between a point and the line is given by (see [8])

$$ \Delta_i = \mathbf{n}^T \mathbf{z}_i - \theta_0 \cos\alpha, \qquad (8.63) $$

where zi represents the data point (xi, yi). The log-likelihood is then

$$ \ln L = -\sum_i \frac{\Delta_i^2}{2 S_i^2}. \qquad (8.64) $$

Maximizing this likelihood for the regression parameters θ0 and θ1 is shown in figure 8.6, where we use the data from [8] with correlated uncertainties on the x and y components, and recover the underlying linear relation. For a single-parameter search (θ1) the regression can be undertaken in a brute-force manner. As we increase the complexity of the model or the dimensionality of the data, the computational cost grows, and techniques such as MCMC must be employed (see [4]).

Figure 8.6. A linear fit to data with correlated errors in x and y. In the literature this is often referred to as total least squares or errors-in-variables fitting. The left panel shows the lines of best fit; the right panel shows the likelihood contours in slope/intercept space. The points are the same set used for the examples in [8].
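The log-likelihood of eqs. 8.62–8.64 is simple to evaluate numerically. The sketch below is not the book's own figure script: the data arrays x, y and the per-point covariance matrices Sigma are placeholders, and the optimizer choice is an assumption. It maximizes ln L over the intercept θ0 and the angle α, after which the slope follows as θ1 = tan α.

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder data: x, y values and a (N, 2, 2) array of per-point
# covariance matrices (variances in x and y plus their covariance).
x = np.array([50., 100., 150., 200., 250.])
y = np.array([130., 250., 360., 470., 580.])
Sigma = np.tile(np.array([[25., 10.], [10., 100.]]), (len(x), 1, 1))

z = np.vstack([x, y]).T  # data points z_i = (x_i, y_i)


def neg_log_likelihood(params):
    """-ln L of eq. 8.64, parametrized by (theta0, alpha)."""
    theta0, alpha = params
    n = np.array([-np.sin(alpha), np.cos(alpha)])   # unit normal to the line
    S2 = np.einsum('i,nij,j->n', n, Sigma, n)       # eq. 8.62: n^T Sigma_i n
    delta = z @ n - theta0 * np.cos(alpha)          # eq. 8.63
    return np.sum(delta ** 2 / (2 * S2))            # negative of eq. 8.64


res = minimize(neg_log_likelihood, x0=[0.0, np.arctan(2.0)])
theta0_fit, alpha_fit = res.x
print("intercept:", theta0_fit, "slope:", np.tan(alpha_fit))
```

For the brute-force single-parameter search mentioned above, one could instead evaluate neg_log_likelihood on a grid of α values and take the minimum.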
8.9 Regression That Is Robust to Outliers

A fact of experimental life is that if you can measure an attribute you can also measure it incorrectly. Despite the increase in fidelity of survey data sets, any regression or model fitting must be able to account for outliers from the fit. For standard least-squares regression, the use of an L2 norm means that outliers have substantial leverage in any fit (contributing as the square of the systematic deviation). If we knew e(yi|y) for all of the points in our sample (e.g., they are described by an exponential distribution, for which we would use the L1 norm to define the error), then we would simply include that error distribution when defining the likelihood. When we do not have a priori knowledge of e(yi|y), things become more difficult. We can either model e(yi|y) as a mixture model (see § 5.6.7) or assume a form for e(yi|y) that is less sensitive to outliers. An example of the latter is the adoption of the L1 norm, ∑i |yi − wi xi|, which we introduced in § 8.3 and which is less sensitive to outliers than the L2 norm (it was, in fact, proposed by Rudjer Bošković prior to the development of least-squares regression by Legendre, Gauss, and others [2]). Minimizing the L1 norm is essentially finding the median. The drawback of this least absolute value regression is that there is no closed-form solution, and we must minimize the likelihood space using an iterative approach.

Other approaches to robust regression seek to reject outliers outright. In the astronomical community this is usually referred to as "sigma clipping" and is undertaken in an iterative manner by progressively pruning data points that are not well represented by the model. Least-trimmed squares formalizes this somewhat ad hoc approach by searching for the subset of K points that minimizes ∑i (yi − θ1 xi)², with the sum taken over the K retained points. For large N the number of combinations makes this search expensive.

Complementary to outlier rejection are the Theil–Sen estimator [20] (or Kendall robust line-fit method) and associated techniques. In these cases the slope θ1 is determined from the median of the slopes calculated from all pairs of points within the data set. Given the slope, the offset or zero point, θ0, can be defined from the median of yi − θ1 xi. Each of these techniques is simple to estimate and scales to large numbers of points.

M estimators (M stands for "maximum-likelihood-type") approach the problem of outliers by modifying the underlying likelihood estimator to be less sensitive than the classic L2 norm. M estimators are a class of estimators that include many maximum-likelihood approaches (including least squares). They replace the standard least-squares criterion, which minimizes the sum of the squares of the residuals between a data value and the model, with a different function. Ideally the M estimator has the property that it increases less rapidly than the square of the residual and has a unique minimum at zero.

Huber loss function. An example of an M estimator that is common in robust regression is the Huber loss (or cost) function [9]. The Huber estimator minimizes

$$ \sum_{i=1}^{N} e(y_i|y), \qquad (8.65) $$

where e(yi|y) is modeled as

$$ \phi(t) = \begin{cases} \tfrac{1}{2} t^2 & \text{if } |t| \le c, \\ c\,|t| - \tfrac{1}{2} c^2 & \text{if } |t| \ge c, \end{cases} \qquad (8.66) $$

with t = yi − y and a constant c that must be chosen. Thus φ(t) acts like t² for |t| ≤ c and like |t| for |t| > c, and is continuous and differentiable (see figure 8.7). The transition in the Huber function is equivalent to assuming a Gaussian error distribution for small excursions from the true value of the function and an exponential distribution for large excursions (its behavior is a compromise between the mean and the median).

Figure 8.7. The Huber loss function φ(t) for various values of c (the curves shown correspond to c = 1, 2, 3, 5, and ∞).

Figure 8.8 shows an application of the Huber loss function to data with outliers. The outliers have a much smaller effect on the Huber loss fit than they do on the standard squared-loss (least-squares) fit.

Figure 8.8. An example of fitting a simple linear model to data which includes outliers (data from [8]). A comparison of linear regression using the squared-loss function (equivalent to ordinary least-squares regression, giving y = 1.08x + 213.3) and the Huber loss function with c = 1 (i.e., beyond 1 standard deviation the loss becomes linear, giving y = 1.96x + 70.0).
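A minimal sketch of a straight-line fit under the Huber loss of eq. 8.66 follows. The arrays x, y, and sigma_y are placeholder data, and scaling the residuals by the measurement errors (so that c is expressed in standard deviations, consistent with the caption of figure 8.8) is an assumption of this sketch rather than something spelled out in the text above.

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder data with one obvious outlier; sigma_y are measurement errors.
x = np.array([50., 100., 150., 200., 250., 300.])
y = np.array([170., 280., 390., 500., 610., 200.])
sigma_y = np.full_like(y, 25.0)


def huber_phi(t, c):
    """Huber loss of eq. 8.66: quadratic for |t| <= c, linear beyond."""
    return np.where(np.abs(t) <= c,
                    0.5 * t ** 2,
                    c * np.abs(t) - 0.5 * c ** 2)


def total_loss(theta, c=1.0):
    """Sum of Huber losses of the error-scaled residuals of a line fit."""
    theta0, theta1 = theta
    t = (y - theta1 * x - theta0) / sigma_y
    return np.sum(huber_phi(t, c))


res = minimize(total_loss, x0=[0.0, 1.0], method='Nelder-Mead')
print("Huber fit: intercept = %.1f, slope = %.2f" % tuple(res.x))
```

scikit-learn's HuberRegressor provides a ready-made estimator built on the same loss; its epsilon parameter plays the role of c here.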
8.9.1 Bayesian Outlier Methods

From a Bayesian perspective, one can use the techniques developed in chapter 5 within the context of a regression model in order to account for, and even to individually identify, outliers (recall § 5.6.7). Figure 8.9 again shows the data set used in figure 8.8, which contains three clear outliers. In a standard straight-line fit to the data, the result is strongly affected by these points. Though this standard linear regression problem is solvable in closed form (as it is in figure 8.8), here we compute the best-fit slope and intercept using MCMC sampling (and show the resulting contours in the upper-right panel).

The remaining two panels show two different Bayesian strategies for accounting for outliers. The main idea is to enhance the model such that it can naturally explain the presence of outliers. In the first model, we account for the outliers through the use of a mixture model, adding a background Gaussian component to our data. This is the regression analog of the model explored in § 5.6.5, with the difference that here we are modeling the background as a wide Gaussian rather than a uniform distribution. The mixture model includes three additional parameters: µb and Vb, the mean and variance of the background, and pb, the probability that any point is an outlier. With this model, the likelihood becomes (cf. eq. 5.83; see also [8])

$$ p(\{y_i\}\,|\,\{x_i\},\{\sigma_i\},\theta_0,\theta_1,\mu_b,V_b,p_b) \propto \prod_{i=1}^{N}\left[ \frac{1-p_b}{\sqrt{2\pi\sigma_i^2}} \exp\!\left(-\frac{(y_i-\theta_1 x_i-\theta_0)^2}{2\sigma_i^2}\right) + \frac{p_b}{\sqrt{2\pi(V_b+\sigma_i^2)}} \exp\!\left(-\frac{(y_i-\mu_b)^2}{2(V_b+\sigma_i^2)}\right)\right]. \qquad (8.67) $$

Using MCMC sampling and marginalizing over the background parameters yields the dashed-line fit in figure 8.9; the marginalized posterior for this model is shown in the lower-left panel. This fit is much less affected by the outliers than is the simple regression model used above.

Figure 8.9. Bayesian outlier detection for the same data as shown in figure 8.8. The top-left panel shows the data, with the fits from each model. The top-right panel shows the 1σ and 2σ contours for the slope and intercept with no outlier correction: the resulting fit (shown by the dotted line) is clearly highly affected by the presence of outliers. The bottom-left panel shows the marginalized 1σ and 2σ contours for the mixture model (eq. 8.67; dashed fit). The bottom-right panel shows the marginalized 1σ and 2σ contours for a model in which points are identified individually as "good" or "bad" (eq. 8.68; solid fit). The points identified by this method as bad with a probability greater than 68% are circled in the first panel.

Finally, we can go further and perform an analysis analogous to that of § 5.6.7, in which we attempt to identify bad points individually. In analogy with eq. 5.94 we can fit for nuisance parameters gi, such that if gi = 1 the point is a "good" point, and if gi = 0 the point is a "bad" point. With this addition our model becomes

$$ p(\{y_i\}\,|\,\{x_i\},\{\sigma_i\},\{g_i\},\theta_0,\theta_1,\mu_b,V_b) \propto \prod_{i=1}^{N}\left[ \frac{g_i}{\sqrt{2\pi\sigma_i^2}} \exp\!\left(-\frac{(y_i-\theta_1 x_i-\theta_0)^2}{2\sigma_i^2}\right) + \frac{1-g_i}{\sqrt{2\pi(V_b+\sigma_i^2)}} \exp\!\left(-\frac{(y_i-\mu_b)^2}{2(V_b+\sigma_i^2)}\right)\right]. \qquad (8.68) $$

This model is very powerful: by marginalizing over all parameters but a particular gi, we obtain a posterior estimate of whether point i is an outlier. Using this procedure, ...
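The mixture likelihood of eq. 8.67 can be sampled with any MCMC package. The sketch below uses emcee, which is an assumption (the book's figures were produced with different code); the data arrays, the flat priors, and the parameter bounds are placeholders chosen only for illustration.

```python
import numpy as np
import emcee

# Placeholder data: a straight-line relation plus two outlying points.
x = np.array([50., 100., 150., 200., 250., 300., 120., 180.])
y = np.array([170., 280., 390., 500., 610., 720., 650., 100.])
sigma_y = np.full_like(y, 25.0)


def log_likelihood(params):
    """Mixture likelihood of eq. 8.67: line + wide Gaussian background."""
    theta0, theta1, mu_b, log_Vb, p_b = params
    Vb = np.exp(log_Vb)
    fg = (1 - p_b) / np.sqrt(2 * np.pi * sigma_y ** 2) * \
        np.exp(-(y - theta1 * x - theta0) ** 2 / (2 * sigma_y ** 2))
    bg = p_b / np.sqrt(2 * np.pi * (Vb + sigma_y ** 2)) * \
        np.exp(-(y - mu_b) ** 2 / (2 * (Vb + sigma_y ** 2)))
    return np.sum(np.log(fg + bg))


def log_posterior(params):
    """Flat priors (an assumption): 0 < p_b < 1 and a bounded log Vb."""
    if not (0 < params[4] < 1) or not (0 < params[3] < 20):
        return -np.inf
    return log_likelihood(params)


ndim, nwalkers = 5, 32
p0 = np.array([50., 2., 400., 10., 0.1]) + 1e-3 * np.random.randn(nwalkers, ndim)
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)
sampler.run_mcmc(p0, 2000)
samples = sampler.get_chain(discard=500, flat=True)
print("intercept, slope:", samples[:, 0].mean(), samples[:, 1].mean())
```

Marginalizing over the background parameters amounts to looking only at the theta0 and theta1 columns of samples; the other dimensions are integrated out automatically.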

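For per-point identification in the spirit of eq. 8.68, one shortcut (a sketch, not the explicit gi sampling described above) is to compute, for each posterior draw of the eq. 8.67 model, the probability that point i belongs to the background component, and then average over the chain. It reuses x, y, sigma_y, and samples from the previous block.

```python
import numpy as np

def outlier_probability(samples, x, y, sigma_y):
    """Posterior probability that each point was drawn from the background
    component, averaged over the MCMC draws of (theta0, theta1, mu_b, log_Vb, p_b)."""
    sub = samples[::10]                  # thin the chain for speed
    probs = np.zeros(len(x))
    for theta0, theta1, mu_b, log_Vb, p_b in sub:
        Vb = np.exp(log_Vb)
        fg = (1 - p_b) / np.sqrt(2 * np.pi * sigma_y ** 2) * \
            np.exp(-(y - theta1 * x - theta0) ** 2 / (2 * sigma_y ** 2))
        bg = p_b / np.sqrt(2 * np.pi * (Vb + sigma_y ** 2)) * \
            np.exp(-(y - mu_b) ** 2 / (2 * (Vb + sigma_y ** 2)))
        probs += bg / (fg + bg)
    return probs / len(sub)


p_bad = outlier_probability(samples, x, y, sigma_y)
print("points with outlier probability > 68%:", np.nonzero(p_bad > 0.68)[0])
```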
