CHAPTER 19

Digression about Correlation Coefficients

19.1. A Unified Definition of Correlation Coefficients

Correlation coefficients measure linear association. The usual definition of the simple correlation coefficient ρ_xy between two variables (sometimes we also use the notation corr[x, y]) is their standardized covariance

(19.1.1)    \rho_{xy} = \frac{\operatorname{cov}[x,y]}{\sqrt{\operatorname{var}[x]\,\operatorname{var}[y]}}.

Because of the Cauchy–Schwarz inequality, its value lies between −1 and 1.

Problem 254. Given constant scalars a ≠ 0 and c ≠ 0, and arbitrary b and d. Show that corr[x, y] = ± corr[ax + b, cy + d], with the + sign being valid if a and c have the same sign, and the − sign otherwise.

Answer. Start with cov[ax + b, cy + d] = ac cov[x, y] and go from there.

Besides the simple correlation coefficient ρ_xy between two scalar variables y and x, one can also define the squared multiple correlation coefficient ρ²_y(x) between one scalar variable y and a whole vector of variables x, and the partial correlation coefficient ρ_12.x between two scalar variables y_1 and y_2, with a vector of other variables x "partialled out." The multiple correlation coefficient measures the strength of a linear association between y and all components of x together, and the partial correlation coefficient measures the strength of that part of the linear association between y_1 and y_2 which cannot be attributed to their joint association with x. One can also define partial multiple correlation coefficients. If one wants to measure the linear association between two vectors, then one number is no longer enough; one needs several numbers, the "canonical correlations."

The multiple or partial correlation coefficients are usually defined as simple correlation coefficients involving the best linear predictor or its residual. But all these correlation coefficients share the property that they indicate a proportionate reduction in the MSE. See e.g. [Rao73, pp. 268–70]. Problem 255 makes this point for the simple correlation coefficient:

Problem 255. 4 points. Show that the proportionate reduction in the MSE of the best predictor of y, if one goes from predictors of the form y* = a to predictors of the form y* = a + bx, is equal to the squared correlation coefficient between y and x. You are allowed to use the results of Problems 229 and 240. To set notation, call the minimum MSE in the first prediction (Problem 229) MSE[constant term; y], and the minimum MSE in the second prediction (Problem 240) MSE[constant term and x; y]. Show that

(19.1.2)    (MSE[constant term; y] − MSE[constant term and x; y]) / MSE[constant term; y] = (cov[y, x])² / (var[y] var[x]) = ρ²_yx.

Answer. The minimum MSE with only a constant is var[y], and (18.2.32) says that MSE[constant term and x; y] = var[y] − (cov[x, y])²/var[x]. Therefore the difference in MSEs is (cov[x, y])²/var[x], and if one divides by var[y] to get the relative difference, one gets exactly the squared correlation coefficient.
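The identity in Problem 255 is easy to check numerically. The following sketch is not part of the original text; it assumes NumPy and simulated data, and compares the sample analogue of ρ²_yx with the proportionate reduction in MSE when moving from the best constant predictor to the best linear predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=3.0, size=n)   # y linearly related to x plus noise

# Squared simple correlation coefficient: cov(y,x)^2 / (var(y) var(x))
rho2 = np.cov(y, x, bias=True)[0, 1] ** 2 / (np.var(y) * np.var(x))

# MSE of the best constant predictor y* = a (with a = mean of y) is var(y)
mse_const = np.mean((y - y.mean()) ** 2)

# MSE of the best linear predictor y* = a + b x, with b = cov(y,x)/var(x)
b = np.cov(y, x, bias=True)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()
mse_lin = np.mean((y - (a + b * x)) ** 2)

# Proportionate reduction in MSE equals rho_yx^2, as in (19.1.2)
print(rho2, (mse_const - mse_lin) / mse_const)
```

The two printed numbers agree up to floating-point rounding, since (19.1.2) is an algebraic identity in the sample moments.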
Multiple Correlation Coefficients. Now assume x is a vector while y remains a scalar. Their joint mean vector and dispersion matrix are

(19.1.3)    \begin{bmatrix} x \\ y \end{bmatrix} \sim \begin{bmatrix} \mu \\ \nu \end{bmatrix},\quad \sigma^2 \begin{bmatrix} \Omega_{xx} & \omega_{xy} \\ \omega_{xy}^\top & \omega_{yy} \end{bmatrix}.

By theorem ??, the best linear predictor of y based on x has the formula

(19.1.4)    y* = ν + ω_xy⊤ Ω_xx⁻ (x − μ).

y* has the following additional extremal value property: no linear combination b⊤x has a higher squared correlation with y than y*. This maximal value of the squared correlation is called the squared multiple correlation coefficient

(19.1.5)    ρ²_y(x) = (ω_xy⊤ Ω_xx⁻ ω_xy) / ω_yy.

The multiple correlation coefficient itself is the positive square root, i.e., it is always nonnegative, while some other correlation coefficients may take on negative values.

The squared multiple correlation coefficient can also be defined in terms of proportionate reduction in MSE. It is equal to the proportionate reduction in the MSE of the best predictor of y if one goes from predictors of the form y* = a to predictors of the form y* = a + b⊤x, i.e.,

(19.1.6)    ρ²_y(x) = (MSE[constant term; y] − MSE[constant term and x; y]) / MSE[constant term; y].

There are therefore two natural definitions of the multiple correlation coefficient. These two definitions correspond to the two formulas for R² in (18.3.6).

Partial Correlation Coefficients. Now assume y = [y_1, y_2]⊤ is a vector with two elements and write

(19.1.7)    \begin{bmatrix} x \\ y_1 \\ y_2 \end{bmatrix} \sim \begin{bmatrix} \mu \\ \nu_1 \\ \nu_2 \end{bmatrix},\quad \sigma^2 \begin{bmatrix} \Omega_{xx} & \omega_{y1} & \omega_{y2} \\ \omega_{y1}^\top & \omega_{11} & \omega_{12} \\ \omega_{y2}^\top & \omega_{21} & \omega_{22} \end{bmatrix}.

Let y* be the best linear predictor of y based on x. The partial correlation coefficient ρ_12.x is defined to be the simple correlation between the residuals, corr[(y_1 − y*_1), (y_2 − y*_2)]. This measures the correlation between y_1 and y_2 which is "local," i.e., which does not follow from their association with x. Assume for instance that both y_1 and y_2 are highly correlated with x. Then they will also have a high correlation with each other. Subtracting y*_i from y_i eliminates this dependency on x; therefore any remaining correlation is "local." Compare [Krz88, p. 475].

The partial correlation coefficient can also be defined as the relative reduction in the MSE if one adds y_1 to x as a predictor of y_2:

(19.1.8)    ρ²_12.x = (MSE[constant term and x; y_2] − MSE[constant term, x, and y_1; y_2]) / MSE[constant term and x; y_2].

Problem 256. Using the definitions in terms of MSEs, show that the following relationship holds between the squares of the multiple and partial correlation coefficients:

(19.1.9)    1 − ρ²_2(x,1) = (1 − ρ²_21.x)(1 − ρ²_2(x)).

Answer. In terms of the MSE, (19.1.9) reads

(19.1.10)    MSE[constant term, x, and y_1; y_2] / MSE[constant term; y_2] = (MSE[constant term, x, and y_1; y_2] / MSE[constant term and x; y_2]) · (MSE[constant term and x; y_2] / MSE[constant term; y_2]).

From (19.1.9) follows the following weighted average formula:

(19.1.11)    ρ²_2(x,1) = ρ²_2(x) + (1 − ρ²_2(x)) ρ²_21.x.

An alternative proof of (19.1.11) is given in [Gra76, pp. 116/17].

Mixed cases: One can also form multiple correlation coefficients with some of the variables partialled out. The dot notation used here is due to Yule, [Yul07]. The notation, definition, and formula for the squared correlation coefficient is

(19.1.12)    ρ²_y(x).z = (MSE[constant term and z; y] − MSE[constant term, z, and x; y]) / MSE[constant term and z; y]
(19.1.13)              = (ω_xy.z⊤ Ω_xx.z⁻ ω_xy.z) / ω_yy.z.
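These relationships can be verified numerically from any joint dispersion matrix. The sketch below is not part of the original text; it uses NumPy, a made-up positive definite dispersion matrix laid out as in (19.1.7), and hypothetical helper names (sq_multiple_corr, sq_partial_corr) to check the decomposition (19.1.9):

```python
import numpy as np

# Made-up positive definite dispersion matrix of (x1, x2, y1, y2), block layout as in (19.1.7)
S = np.array([[4.0, 1.0, 1.5, 1.0],
              [1.0, 3.0, 0.5, 1.2],
              [1.5, 0.5, 3.0, 0.8],
              [1.0, 1.2, 0.8, 3.5]])

def sq_multiple_corr(S, ypos, xpos):
    # rho^2_y(x) = omega_xy' Omega_xx^- omega_xy / omega_yy, cf. (19.1.5)
    Oxx = S[np.ix_(xpos, xpos)]
    oxy = S[np.ix_(xpos, [ypos])]
    return (oxy.T @ np.linalg.solve(Oxx, oxy))[0, 0] / S[ypos, ypos]

def sq_partial_corr(S, i, j, xpos):
    # Squared partial correlation rho^2_ij.x from the dispersion matrix of the
    # residuals of (y_i, y_j) after partialling out x
    ypos = [i, j]
    Syy = S[np.ix_(ypos, ypos)]
    Syx = S[np.ix_(ypos, xpos)]
    Sxx = S[np.ix_(xpos, xpos)]
    C = Syy - Syx @ np.linalg.solve(Sxx, Syx.T)
    return C[0, 1] ** 2 / (C[0, 0] * C[1, 1])

r2_2x  = sq_multiple_corr(S, ypos=3, xpos=[0, 1])      # rho^2_2(x)
r2_2x1 = sq_multiple_corr(S, ypos=3, xpos=[0, 1, 2])   # rho^2_2(x,1)
r2_21x = sq_partial_corr(S, 3, 2, xpos=[0, 1])         # rho^2_21.x

# Check (19.1.9): 1 - rho^2_2(x,1) = (1 - rho^2_21.x)(1 - rho^2_2(x))
print(1 - r2_2x1, (1 - r2_21x) * (1 - r2_2x))
```

Both printed quantities coincide, since (19.1.9) is an identity between the conditional variances ω_22.x1, ω_22.x, and ω_22.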
19.2. Correlation Coefficients and the Associated Least Squares Problem

One can define the correlation coefficients also as proportionate reductions in the objective functions of the associated GLS problems. However, one must reverse predictor and predictand, i.e., one must look at predictions of a vector x by linear functions of a scalar y.

Here it is done for multiple correlation coefficients: The value of the GLS objective function if one predicts x by the best linear predictor x*, which is the minimum attainable when the scalar observation y is given and the vector x can be chosen freely, as long as it satisfies the constraint x = μ + Ω_xx q for some q, is

(19.2.1)    SSE[y; best x] = \min_{x = \mu + \Omega_{xx} q} \begin{bmatrix} (x-\mu)^\top & (y-\nu) \end{bmatrix} \begin{bmatrix} \Omega_{xx} & \omega_{xy} \\ \omega_{xy}^\top & \omega_{yy} \end{bmatrix}^{-} \begin{bmatrix} x-\mu \\ y-\nu \end{bmatrix} = (y-\nu)\, \omega_{yy}^{-}\, (y-\nu).

On the other hand, the value of the GLS objective function when one predicts x by the best constant x = μ is

(19.2.2)    SSE[y; x = μ] = \begin{bmatrix} o^\top & (y-\nu) \end{bmatrix} \begin{bmatrix} \Omega_{xx}^{-} + \Omega_{xx}^{-}\omega_{xy}\,\omega_{yy.x}^{-}\,\omega_{xy}^\top\Omega_{xx}^{-} & -\Omega_{xx}^{-}\omega_{xy}\,\omega_{yy.x}^{-} \\ -\omega_{yy.x}^{-}\,\omega_{xy}^\top\Omega_{xx}^{-} & \omega_{yy.x}^{-} \end{bmatrix} \begin{bmatrix} o \\ y-\nu \end{bmatrix}
(19.2.3)              = (y − ν) ω_yy.x⁻ (y − ν).

The proportionate reduction in the objective function is

(19.2.4)    (SSE[y; x = μ] − SSE[y; best x]) / SSE[y; x = μ] = ((y − ν)²/ω_yy.x − (y − ν)²/ω_yy) / ((y − ν)²/ω_yy.x)
(19.2.5)              = (ω_yy − ω_yy.x) / ω_yy = 1 − ω_yy.x/ω_yy = ρ²_y(x).

19.3. Canonical Correlations

Now what happens with the correlation coefficients if both predictor and predictand are vectors? In this case one has more than one correlation coefficient. One first finds those two linear combinations of the two vectors which have highest correlation, then those which are uncorrelated with the first and have second highest correlation, and so on. Here is the mathematical construction needed:

Let x and y be two column vectors consisting of p and q scalar random variables, respectively, and let

(19.3.1)    \mathcal{V}\begin{bmatrix} x \\ y \end{bmatrix} = \sigma^2 \begin{bmatrix} \Omega_{xx} & \Omega_{xy} \\ \Omega_{yx} & \Omega_{yy} \end{bmatrix},

where Ω_xx and Ω_yy are nonsingular, and let r be the rank of Ω_xy. Then there exist two separate transformations

(19.3.2)    u = Lx,    v = My

such that

(19.3.3)    \mathcal{V}\begin{bmatrix} u \\ v \end{bmatrix} = \sigma^2 \begin{bmatrix} I_p & \Lambda \\ \Lambda^\top & I_q \end{bmatrix},

where Λ is a (usually rectangular) diagonal matrix with only r diagonal elements positive, and the others zero, and where these diagonal elements are sorted in descending order.

Proof: One obtains the matrix Λ by a singular value decomposition of Ω_xx^{−1/2} Ω_xy Ω_yy^{−1/2} = A, say. Let A = P⊤ΛQ be its singular value decomposition with fully orthogonal matrices, as in equation (A.9.8). Define L = P Ω_xx^{−1/2} and M = Q Ω_yy^{−1/2}. Therefore LΩ_xx L⊤ = I, MΩ_yy M⊤ = I, and LΩ_xy M⊤ = P Ω_xx^{−1/2} Ω_xy Ω_yy^{−1/2} Q⊤ = P A Q⊤ = Λ.

The next problems show how one gets from this the maximization property of the canonical correlation coefficients:

Problem 257. Show that for every p-vector l and q-vector m,

(19.3.4)    corr(l⊤x, m⊤y) ≤ λ_1,

where λ_1 is the first (and therefore biggest) diagonal element of Λ. Equality in (19.3.4) holds if l = l_1, the first row in L, and m = m_1, the first row in M.

Answer: If l or m is the null vector, then there is nothing to prove. If neither of them is a null vector, then one can, without loss of generality, multiply them with appropriate scalars so that p = (L⁻¹)⊤ l and q = (M⁻¹)⊤ m satisfy p⊤p = 1 and [...]
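The construction in the proof above translates directly into a few lines of linear algebra. The following sketch is not from the original text; it uses NumPy, made-up dispersion blocks chosen so the joint dispersion matrix is positive definite, and a symmetric inverse square root. In the text's notation, P corresponds to U⊤ and Q to Vh from NumPy's SVD:

```python
import numpy as np

def inv_sqrt(A):
    # Symmetric inverse square root of a positive definite matrix
    w, V = np.linalg.eigh(A)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

# Made-up dispersion blocks for x (p = 2) and y (q = 2)
Oxx = np.array([[2.0, 0.5], [0.5, 1.5]])
Oyy = np.array([[1.5, 0.3], [0.3, 2.0]])
Oxy = np.array([[0.8, 0.2], [0.1, 0.6]])

A = inv_sqrt(Oxx) @ Oxy @ inv_sqrt(Oyy)   # the matrix A from the proof
U, lam, Vh = np.linalg.svd(A)             # A = U diag(lam) Vh
L = U.T @ inv_sqrt(Oxx)                   # u = L x
M = Vh @ inv_sqrt(Oyy)                    # v = M y

print(np.round(L @ Oxx @ L.T, 8))         # identity I_p
print(np.round(M @ Oyy @ M.T, 8))         # identity I_q
print(np.round(L @ Oxy @ M.T, 8))         # diagonal matrix Lambda
print(lam)                                # the canonical correlations, sorted descending
```

The diagonal of L Ω_xy M⊤ reproduces the singular values of A, which are the canonical correlations λ_1 ≥ λ_2 ≥ ….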
[...]

Problem 259. Show that for every p-vector l and q-vector m such that l⊤x is uncorrelated with l_1⊤x, and m⊤y is uncorrelated with m_1⊤y,

(19.3.6)    corr(l⊤x, m⊤y) ≤ λ_2,

where λ_2 is the second diagonal element of Λ. Equality in (19.3.6) holds if l = l_2, the second row in L, and m = m_2, the second row in M.

Answer. If l or m is the null vector, then there is nothing to prove. If neither of them is a null vector, then [...]

[...] assumption that Ω_xx and Ω_yy are nonsingular.

19.4. Some Remarks about the Sample Partial Correlation Coefficients

The definition of the partial sample correlation coefficients is analogous to that of the partial population correlation coefficients: Given two data vectors y and z, and the matrix X (which includes a constant term), let M = I − X(X⊤X)⁻¹X⊤ be the matrix which forms the least squares residuals of a regression on X. Then the squared partial sample correlation is the squared simple correlation between the least squares residuals:

(19.4.1)    r²_zy.X = (z⊤My)² / ((z⊤Mz)(y⊤My)).

Alternatively, one can define it as the proportionate reduction in the SSE. Although X is assumed to incorporate a constant term, I am giving it here separately, in order to show the analogy with (19.1.8):

(19.4.2)    r²_zy.X = (SSE[constant term and X; y] − SSE[constant term, X, and z; y]) / SSE[constant term and X; y].

[Gre97, p. 248] considers it unintuitive that this can be computed using t-statistics. Our approach explains why this is so. First of all, note that the square of the t-statistic is the F-statistic. Secondly, the formula for the F-statistic for the inclusion of z into the regression is

(19.4.3)    SSE[constant term and X; y] − SSE[constant term, X, and z; y] [...]
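The equivalence of (19.4.1) and (19.4.2) can be checked on simulated data. This sketch is not part of the original text; it assumes NumPy and made-up regression data, with X containing a constant term as in the definition above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # includes a constant term
z = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)
y = X @ np.array([2.0, -1.0, 0.8]) + 0.6 * z + rng.normal(size=n)

# Residual-maker M = I - X (X'X)^{-1} X'
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# (19.4.1): squared simple correlation between the residuals M y and M z
r2_resid = (z @ M @ y) ** 2 / ((z @ M @ z) * (y @ M @ y))

# (19.4.2): proportionate reduction in SSE when z is added to the regression of y on X
def sse(y, X):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    return e @ e

r2_sse = (sse(y, X) - sse(y, np.column_stack([X, z]))) / sse(y, X)

print(r2_resid, r2_sse)   # the two definitions agree
```

The agreement is exact up to rounding, since SSE[constant term, X, and z; y] = y⊤My − (z⊤My)²/(z⊤Mz).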
[...]

CHAPTER 20

Numerical Methods for computing OLS Estimates

20.1. QR Decomposition

One precise and fairly efficient method to compute the Least Squares estimates is the QR decomposition. It amounts to going over to an orthonormal basis in R[X]. It uses the following mathematical fact: Every matrix X which has full column rank can be decomposed into the product of two matrices QR, where Q has the same number of rows and columns as X, and is [...]

[...] normalize it, to get

q_2 = \begin{bmatrix} -1/2 \\ 1/2 \\ -1/2 \\ 1/2 \end{bmatrix}  and  r_{22} = 4.

The rest remains a homework problem. But I am not sure if my numbers are right.

Problem 263. 2 points. Compute trace and determinant of \begin{bmatrix} 1 & 3 & -1 \\ 0 & 2 & -2 \\ 0 & 0 & 1 \end{bmatrix}. Is this matrix symmetric and, if so, is it nonnegative definite? Are its column vectors linearly dependent? Compute the matrix product [...]

20.2. The LINPACK Implementation of the QR Decomposition

This is all we need, but numerically it is possible to construct, without much additional computing time, all the information which adds the missing orthogonal columns to Q. In this way Q is square and R is conformable with X. This is sometimes called the "complete" QR-decomposition. In terms of the decomposition [...]

[...] = x⊤x + σ x_11 √(x⊤x); therefore 2v⊤x/(v⊤v) = 1, and

(20.2.5)    \Bigl(I - \tfrac{2}{v^\top v}\, v v^\top\Bigr) x = x - v = \begin{bmatrix} -\sigma\sqrt{x^\top x} \\ 0 \\ \vdots \\ 0 \end{bmatrix}.

Premultiplication of X by I − (2/(v⊤v)) vv⊤ therefore gets the first column into the desired shape. By the same principle one can construct a second vector w, which has a zero in the first place and which annihilates all elements below the second element in the second column of X, etc. These successive transformations convert X into a matrix which has zeros below the diagonal; their product is therefore Q⊤. The LINPACK implementation of this starts with X and modifies its elements in place. For each column it generates the corresponding v vector and premultiplies the matrix by I − (2/(v⊤v)) vv⊤. This generates zeros below the diagonal. Instead of writing the zeros into [...]

[...] decomposition in Splus has two main components: qr is a matrix like a, and qraux is a vector of length ncols(a). LINPACK does not use or store exactly the same v as given here, but uses u = v/(σ√(x⊤x)) instead. The normalization does not affect the resulting orthogonal transformation; its advantage is that the leading element of each vector, that which is stored in qraux, is at the same time equal to u⊤u/2. In other [...]
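Not part of the original text: a minimal NumPy sketch of the Householder construction just described. It keeps the unnormalized v vectors and accumulates Q explicitly, whereas LINPACK would store u = v/(σ√(x⊤x)) and overwrite X in place. It assumes X has full column rank:

```python
import numpy as np

def householder_qr(X):
    """QR by successive Householder reflections, in the spirit of Section 20.2.
    Returns Q (same shape as X, orthonormal columns) and upper-triangular R."""
    n, k = X.shape
    R = X.astype(float).copy()
    Q = np.eye(n)
    for j in range(k):
        x = R[j:, j].copy()
        sigma = 1.0 if x[0] >= 0 else -1.0        # sign of the leading element
        v = x.copy()
        v[0] += sigma * np.sqrt(x @ x)            # v = x + sigma*sqrt(x'x)*e1, cf. (20.2.5)
        H = np.eye(n - j) - 2.0 * np.outer(v, v) / (v @ v)
        R[j:, :] = H @ R[j:, :]                   # creates zeros below the diagonal in column j
        Q[:, j:] = Q[:, j:] @ H                   # accumulate the product of the reflections
    return Q[:, :k], R[:k, :]

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
Q, R = householder_qr(X)
print(np.round(Q.T @ Q, 10))    # orthonormal columns
print(np.round(Q @ R - X, 10))  # Q R reproduces X
```

The same decomposition is available directly as np.linalg.qr(X); the hand-rolled version is only meant to make the reflection-by-reflection mechanics visible.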