Class Notes in Statistics and Econometrics, Part 28


CHAPTER 55

Numerical Minimization

Literature: [Thi88, pp. 199–219] or [KG80, pp. 425–475]. Regarding numerical methods, these books are classics, and they are available on-line for free at lib-www.lanl.gov/numerical/

Assume $\theta \mapsto f(\theta)$ is a scalar function of a vector argument, with continuous first and second derivatives, which has a global minimum, i.e., there is an argument $\hat\theta$ with $f(\hat\theta) \le f(\theta)$ for all $\theta$. The numerical methods to find this minimum argument are usually recursive: the computer is given a starting value $\theta_0$, uses it to compute $\theta_1$, then uses $\theta_1$ to compute $\theta_2$, and so on, constructing a sequence $\theta_1, \theta_2, \ldots$ that converges towards a minimum argument. If convergence occurs, this minimum is usually a local minimum, and often one is not sure whether there is not another, better local minimum somewhere else.

At every step, the computer makes two decisions, which can be symbolized as

(55.0.10)  $\theta_{i+1} = \theta_i + \alpha_i d_i$.

Here $d_i$, a vector, is the step direction, and $\alpha_i$, a scalar, is the step size. The choice of the step direction is the main characteristic of the program. Most programs (a notable exception is simulated annealing) choose at every step a direction along which the objective function slopes downward, so that one gets lower values of the objective function for small increments in that direction. The step size is then chosen such that the objective function actually decreases. In elaborate cases, the step size is chosen to be that traveling distance in the step direction which gives the best improvement in the objective function, but it is not always efficient to spend this much time on the step size.

Let us take a closer look at how to determine the step direction. If $g_i^\top = (g(\theta_i))^\top$ is the Jacobian of $f$ at $\theta_i$, i.e., the row vector consisting of the partial derivatives of $f$, then the objective function slopes down along direction $d_i$ if the scalar product $g_i^\top d_i$ is negative. In determining the step direction, the following fact is useful: all vectors $d_i$ for which $g_i^\top d_i < 0$ can be obtained by premultiplying the transpose of the negative Jacobian, i.e., the negative gradient vector $-g_i$, by an appropriate positive definite matrix $R_i$.

Problem 492. 4 points. Here is a proof for those who are interested in this issue: Prove that $g^\top d < 0$ if and only if $d = -Rg$ for some positive definite symmetric matrix $R$. Hint: to prove the “only if” part use $R = I - gg^\top/(g^\top g) - dd^\top/(d^\top g)$. This formula is from [Bar74, p. 86]. To prove that $R$ is positive definite, note that $R = Q + S$ with both $Q = I - gg^\top/(g^\top g)$ and $S = -dd^\top/(d^\top g)$ nonnegative definite. It is therefore sufficient to show that any $x \ne o$ for which $x^\top Qx = 0$ satisfies $x^\top Sx > 0$.

Answer. If $R$ is positive definite, then $d = -Rg$ clearly satisfies $d^\top g < 0$. Conversely, for any $d$ satisfying $d^\top g < 0$, define $R = I - gg^\top/(g^\top g) - dd^\top/(d^\top g)$. One immediately checks that $d = -Rg$. To prove that $R$ is positive definite, note that $R$ is the sum of two nonnegative definite matrices $Q = I - gg^\top/(g^\top g)$ and $S = -dd^\top/(d^\top g)$. It is therefore sufficient to show that any $x \ne o$ for which $x^\top Qx = 0$ satisfies $x^\top Sx > 0$. Indeed, if $x^\top Qx = 0$, then already $Qx = o$, which means $x = gg^\top x/(g^\top g)$. Therefore

(55.0.11)  $x^\top Sx = -\dfrac{x^\top g g^\top}{g^\top g}\,\dfrac{d d^\top}{d^\top g}\,\dfrac{g g^\top x}{g^\top g} = -\left(\dfrac{g^\top x}{g^\top g}\right)^{2} d^\top g > 0.$
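As a quick numerical illustration, not part of the original notes: a minimal sketch in Python with NumPy, with arbitrarily chosen vectors g and d playing the roles of the gradient and of a descent direction, checking that the matrix R from the hint in Problem 492 is symmetric positive definite and that premultiplying the negative gradient by it recovers d.

import numpy as np

# Arbitrary example vectors: g plays the role of the gradient, d of a step
# direction with g'd < 0 (here g'd = -3.5).
g = np.array([1.0, 2.0, -1.0, 0.5])
d = np.array([-2.0, -1.0, 0.0, 1.0])

# R = I - g g'/(g'g) - d d'/(d'g), the matrix from the hint in Problem 492
R = np.eye(4) - np.outer(g, g) / (g @ g) - np.outer(d, d) / (d @ g)

print(np.allclose(-R @ g, d))              # True: premultiplying -g by R recovers d
print(np.all(np.linalg.eigvalsh(R) > 0))   # True: R is symmetric positive definite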
Many important numerical methods, the so-called gradient methods [KG80, p. 430], use exactly this principle: they find the step direction $d_i$ by premultiplying $-g_i$ by some positive definite $R_i$, i.e., they use the recursion equation

(55.0.12)  $\theta_{i+1} = \theta_i - \alpha_i R_i g_i$.

The most important ingredient here is the choice of $R_i$. We will discuss two “natural” choices.

The choice which immediately comes to mind is to set $R_i = I$, i.e., $d_i = -g_i$. Since the gradient vector points in the direction in which the slope is steepest, this is called the method of steepest descent. However, this choice is not as natural as one might first think. There is no benefit to finding the steepest direction, since one can easily increase the step length. It is much more important to find a direction which allows one to go down for a long time, and for this one should also consider how the gradient is changing. The fact that the direction of steepest descent changes if one changes the scaling of the variables is another indication that steepest descent is not a natural criterion.

The most “natural” choice for $R_i$ is the inverse of the Hessian matrix $G(\theta_i)$, which is the matrix of second partial derivatives of $f$, evaluated at $\theta_i$. This is called the Newton-Raphson method. If the inverse Hessian is positive definite, the Newton-Raphson method amounts to making a Taylor development of $f$ around the so-far best point $\theta_i$, breaking this Taylor development off after the quadratic term (so that one gets a quadratic function which at the point $\theta_i$ has the same first and second derivatives as the given objective function), and choosing $\theta_{i+1}$ to be the minimum point of this quadratic approximation to the objective function.

Here is a proof that one accomplishes all this if $R_i$ is the inverse Hessian. The quadratic approximation (second-order Taylor development) of $f$ around $\theta_i$ is

(55.0.13)  $f(\theta) \approx f(\theta_i) + (g(\theta_i))^\top(\theta - \theta_i) + \tfrac{1}{2}(\theta - \theta_i)^\top G(\theta_i)(\theta - \theta_i)$.

By Theorem 55.0.1, the minimum argument of this quadratic approximation is

(55.0.14)  $\theta_{i+1} = \theta_i - (G(\theta_i))^{-1} g(\theta_i)$,

which is the above procedure with step size 1 and $R_i = (G(\theta_i))^{-1}$.

Theorem 55.0.1. Let $G$ be an $n \times n$ positive definite matrix, and $g$ an $n$-vector. Then the minimum argument of the function

(55.0.15)  $q \colon z \mapsto g^\top z + \tfrac{1}{2} z^\top G z$

is $x = -G^{-1} g$.

Proof: Since $Gx = -g$, it follows for any $z$ that

(55.0.16)  $z^\top g + \tfrac{1}{2} z^\top Gz = -z^\top Gx + \tfrac{1}{2} z^\top Gz$
(55.0.17)  $\phantom{z^\top g + \tfrac{1}{2} z^\top Gz} = \tfrac{1}{2} x^\top Gx - z^\top Gx + \tfrac{1}{2} z^\top Gz - \tfrac{1}{2} x^\top Gx$
(55.0.18)  $\phantom{z^\top g + \tfrac{1}{2} z^\top Gz} = \tfrac{1}{2}(x - z)^\top G(x - z) - \tfrac{1}{2} x^\top Gx.$

This is minimized by $z = x$.

The Newton-Raphson method requires the Hessian matrix. [KG80] recommend establishing mathematical formulas for the derivatives, which are then evaluated at $\theta_i$, since it is tricky and imprecise to compute derivatives and the Hessian numerically. The analytical derivatives, on the other hand, are time consuming to derive, and their computation may be subject to human error. However, there are computer programs which automatically compute such derivatives. Splus, for instance, has the deriv function, which automatically constructs functions that are the derivatives or gradients of given functions.

The main drawback of the Newton-Raphson method is that $G(\theta_i)$ is only positive definite if the function is strictly convex. This will be the case when $\theta_i$ is close to a minimum, but if one starts too far away from a minimum, the Newton-Raphson method may not converge.
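To make this concrete, here is a minimal sketch of the Newton-Raphson recursion, not part of the original notes: Python with NumPy, using the Rosenbrock test function as an arbitrary example whose analytical gradient and Hessian are easy to write down. It iterates (55.0.12) with $\alpha_i = 1$ and $R_i = (G(\theta_i))^{-1}$.

import numpy as np

def f(theta):       # Rosenbrock test function (an arbitrary example, not from the notes)
    x, y = theta
    return (1 - x)**2 + 100*(y - x**2)**2

def grad(theta):    # analytical gradient g(theta)
    x, y = theta
    return np.array([-2*(1 - x) - 400*x*(y - x**2), 200*(y - x**2)])

def hess(theta):    # analytical Hessian G(theta)
    x, y = theta
    return np.array([[2 - 400*(y - x**2) + 800*x**2, -400*x],
                     [-400*x,                          200.0]])

theta = np.array([0.0, 0.0])     # starting value theta_0
for i in range(100):             # Newton-Raphson: step size 1, R_i = inverse Hessian
    g = grad(theta)
    if np.linalg.norm(g) < 1e-10:
        break
    theta = theta - np.linalg.solve(hess(theta), g)

print(i, theta, f(theta))        # reaches the minimum at (1, 1) in a handful of steps

Replacing hess(theta) by the identity matrix turns this into steepest descent with unit step size, which diverges on this example unless the step size is made much smaller or chosen by a line search; and, as noted above, even Newton-Raphson needs safeguards (a step-size search or a check that the Hessian is positive definite) when started far from the minimum.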
There are many modifications of the Newton-Raphson method which get around computing the Hessian and inverting it at every step, and which at the same time ensure that the matrix $R_i$ is always positive definite, by using an updating formula for $R_i$ which turns $R_i$, after sufficiently many steps, into the inverse Hessian. These are probably the most often used methods. A popular one, used by the GAUSS software, is the Davidon-Fletcher-Powell algorithm.

One drawback of all these methods using matrices is the fact that the size of the matrix $R_i$ increases with the square of the number of variables. For problems with large numbers of variables, memory limitations in the computer make it necessary to use methods which do without such a matrix. A method to do this is the “conjugate gradient method.” If it is too difficult to compute the gradient vector, the “conjugate direction method” may also compare favorably with computing the gradient numerically.

CHAPTER 56

Nonlinear Least Squares

This chapter ties immediately into chapter 55 about Numerical Minimization. The notation is slightly different; what we called $f$ is now called $SSE$, and what we called $\theta$ is now called $\beta$. A much more detailed discussion of all this is given in [DM93, Chapter 6], which uses the notation $x(\beta)$ instead of our $\eta(\beta)$. [Gre97, Chapter 10] defines the vector function $\eta(\beta)$ by $\eta_t(\beta) = h(x_t, \beta)$, i.e., all elements of the vector function $\eta$ have the same functional form $h$ but differ by the values of the additional arguments $x_t$. [JHG+88, Chapter 12.2.2] set it up in the same way as [Gre97], but they call the function $f$ instead of $h$.

An additional important “natural” choice for $R_i$ is available if the objective function has the nonlinear least squares form

(56.0.19)  $SSE(\beta) = (y - \eta(\beta))^\top (y - \eta(\beta))$,

where $y$ is a given vector of observations and $\eta(\beta)$ is a vector function of a vector argument, i.e., it consists of $n$ scalar functions of $k$ scalar arguments each:

(56.0.20)  $\eta(\beta) = \begin{bmatrix} \eta_1(\beta_1, \beta_2, \ldots, \beta_k) \\ \eta_2(\beta_1, \beta_2, \ldots, \beta_k) \\ \vdots \\ \eta_n(\beta_1, \beta_2, \ldots, \beta_k) \end{bmatrix}$

Minimization of this objective function is an obvious and powerful estimation method whenever the following nonlinear least squares model specification holds:

(56.0.21)  $y = \eta(\beta) + \varepsilon, \qquad \varepsilon \sim (o, \sigma^2 I)$

If the errors are normally distributed, then nonlinear least squares is equal to the maximum likelihood estimator. (But this is only true as long as the covariance matrix is spherical, as assumed here.)
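In standard treatments such as [DM93, Chapter 6], the “natural” choice of $R_i$ for this objective function leads to the Gauss-Newton step: regress the current residuals $y - \eta(\beta_i)$ on the pseudoregressors $X(\beta_i) = \partial\eta/\partial\beta^\top$ evaluated at $\beta_i$, and add the fitted coefficients to $\beta_i$. Here is a minimal sketch, not part of the original notes: Python with NumPy, on made-up data standing in for the consumption data of Problem 496 below, with the functional form and starting values taken from the SAS code quoted there.

import numpy as np

# Made-up data standing in for the consumption data (the real consum.txt file is
# not reproduced here); assumed data-generating process: c = a + b*y**g + noise.
rng = np.random.default_rng(1)
y = np.linspace(100.0, 900.0, 40)
c = 11.0 + 0.9*y**1.0 + rng.normal(scale=5.0, size=y.size)

def eta(beta):                  # eta(beta), the vector of fitted values
    a, b, g = beta
    return a + b*y**g

def pseudoregressors(beta):     # X(beta) = d eta / d beta', matching the der. statements of PROC NLIN
    a, b, g = beta
    return np.column_stack([np.ones_like(y), y**g, b*(y**g)*np.log(y)])

beta = np.array([11.1458, 0.89853, 1.0])    # starting values as in the SAS code below
for _ in range(200):
    resid = c - eta(beta)
    X = pseudoregressors(beta)
    step = np.linalg.lstsq(X, resid, rcond=None)[0]   # regress residuals on pseudoregressors
    beta = beta + step
    if np.max(np.abs(step)) < 1e-8:
        break

print(beta)    # Gauss-Newton estimates of (a, b, g)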
[...] $\alpha$, $\beta$, and $\gamma$ together. For instance, in the linear case

(56.1.4)  $y = (1 - \alpha)X_0\beta + \alpha X_1\gamma + \varepsilon$

every change in $\alpha$ can be undone by counteracting changes in $\beta$ and $\gamma$. Therefore the idea is to estimate $\gamma$ from model 1, call this estimate $\hat\gamma$, and get the predicted value of $y$ from model 1, $\hat y_1 = \eta_1(\hat\gamma)$, and plug this into this model, i.e., one estimates $\alpha$ and $\beta$ in the [...]

[...] attains its asymptotic properties more slowly than NLS.

Problem 497. This is [DM93, pp. 243 and 284]. The model is

(56.2.2)  $y_t^\gamma = \alpha + \beta x_t + \varepsilon_t$

with $\varepsilon \sim N(o, \sigma^2 I)$ and $y_t > 0$.

• a. 1 point. Why can this model not be estimated with nonlinear least squares?

Answer. If all the $y$'s are greater than unity, then the $SSE$ can be made arbitrarily small by letting $\gamma$ tend to $-\infty$ and setting $\alpha$ and $\beta$ to zero. [...] $\gamma = 1$, $\alpha = 1$, and [...]

[...] function, then plot this concentrated objective function and make a grid search for the best $\beta$. The concentrated objective function can also be obtained by running the regression for every $\beta$ and getting the $SSE$ from the regression. After you have the point estimates $\hat\alpha$ and $\hat\beta$, write $y_t = \eta_t + \varepsilon_t$ and construct the pseudoregressors $\partial\eta_t/\partial\alpha = x_t^{\hat\beta}$ and $\partial\eta_t/\partial\beta = \hat\alpha(\log x_t)x_t^{\hat\beta}$. If you regress the residuals [...]

[...] model, and use the linearized version of $\eta_0$ around $\hat\beta$, i.e., replace $\eta_0$ in the above regression by

(56.1.6)  $\eta_0(\hat\beta) + X_0(\hat\beta)(\beta - \hat\beta)$.

If one does this, one gets the linear regression

(56.1.7)  $y - \hat y_0 = X_0(\hat\beta)\,\delta + \alpha(\hat y_1 - \hat y_0)$

where $\hat y_0 = \eta_0(\hat\beta)$ and $\delta = (1 - \alpha)(\beta - \hat\beta)$, and one simply has to test for $\alpha = 0$.

Problem 496. Computer Assignment: The data in Table 10.1 are in /home/econ/ehrbar/ec781/consum.txt and they will also be sent to you by e-mail. Here are the commands to enter them into SAS:

libname ec781 '/home/econ/ehrbar/ec781/sasdata';
filename consum '/home/econ/ehrbar/ec781/consum.txt';
data ec781.consum;
infile consum;
input year y c;
run;

Use them to re-do the estimation of the consumption function in Table 10.2. In SAS this can be done with the procedure NLIN, described in the SAS User's Guide: Statistics [SAS85]. Make a scatter plot of the data and plot the fitted curve into this same plot.

libname ec781 '/home/econ/ehrbar/ec781/sasdata';
proc nlin data=ec781.consum maxiter=200;
parms a1=11.1458 b1=0.89853 g1=1;
model c=a1+b1*exp(g1 * log(y));
der.a1=1;
der.b1=exp(g1 * log(y));
der.g1=b1*(exp(g1*log(y)))*log(y);
run;

56.2. Nonlinear instrumental variables estimation

If instrumental variables are necessary, one minimizes, instead of $(y - \eta(\beta))^\top(y - \eta(\beta))$, the following objective function:

(56.2.1)  $(y - \eta(\beta))^\top W(W^\top W)^{-1}W^\top(y - \eta(\beta))$

(As before, $\eta$ contains $X$ although we are not making this explicit.) Example: If one uses instrumental variables on the consumption function, one gets this time quite different estimates than from the nonlinear least squares [...]

[...] write it in the form (56.0.47), which in the present model happens to be even a little simpler (because the original regression is almost linear) and gives the true regression coefficient:

(56.0.58)  $y_t + \hat\gamma\hat\delta\log(z_t)\,z_t^{\hat\delta} = a + b x_t + c z_t^{\hat\delta} + d\,\hat\gamma\log(z_t)\,z_t^{\hat\delta} + \text{error term}$

• b. How would you obtain the starting value for the Newton-Raphson algorithm?

Answer. One possible set of starting values [...]

[...] $\varepsilon_t = y_t^\gamma - \alpha - \beta x_t$; therefore $\partial\varepsilon_t/\partial y_t = \gamma y_t^{\gamma-1}$ and $\partial\varepsilon_t/\partial y_s = 0$ for $s \ne t$. The Jacobian has this in the diagonal and 0 in the off-diagonal, therefore the determinant is $J = \gamma^n \bigl(\prod_t y_t\bigr)^{\gamma-1}$ and $|J| = |\gamma|^n \bigl(\prod_t y_t\bigr)^{\gamma-1}$. This gives the above formula, which I assume is right; it is from [DM93], but somewhere [DM93] has a typo.

• c. 2 points. Concentrate the log likelihood function [...]
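Returning to the instrumental-variables objective (56.2.1): as a purely illustrative sketch of the mechanics, not part of the original notes, the following Python code (NumPy and SciPy) builds the projector $W(W^\top W)^{-1}W^\top$ from made-up instruments, uses a consumption-type $\eta$, and hands the criterion to a BFGS minimizer, a quasi-Newton method in the same family as the Davidon-Fletcher-Powell algorithm of chapter 55. Data, instruments, and starting values are all invented for the example.

import numpy as np
from scipy.optimize import minimize

# Made-up data and instruments, only to show the mechanics of (56.2.1).
rng = np.random.default_rng(2)
n = 100
z = rng.uniform(2.0, 10.0, n)                  # instrument
x = z + rng.normal(scale=0.3, size=n)          # explanatory variable, correlated with z
y = 2.0 + 1.5*x**0.8 + rng.normal(scale=0.3, size=n)

W = np.column_stack([np.ones(n), z, z**2])     # instrument matrix (three columns for three parameters)
P = W @ np.linalg.solve(W.T @ W, W.T)          # projector W (W'W)^{-1} W'

def eta(beta):                                 # consumption-type functional form
    a, b, c = beta
    return a + b*x**c

def iv_criterion(beta):                        # (y - eta(beta))' P (y - eta(beta)), i.e. (56.2.1)
    u = y - eta(beta)
    return u @ P @ u

# In practice one would start from the NLS estimates; here a rough guess is used.
res = minimize(iv_criterion, x0=np.array([2.0, 1.0, 1.0]), method='BFGS')
print(res.x)                                   # instrumental-variables estimates of (a, b, c)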
