CHAPTER 63

Independent Observations from the Same Multivariate Population

This chapter discusses a model that is a special case of the model in Section 62.2, but it goes into more depth towards the end.

63.1. Notation and Basic Statistics

Notational conventions are not uniform among the different books about multivariate statistics. Johnson and Wichern arrange the data in an $r \times n$ matrix $X$. Each column is a separate independent observation of an $r$-vector with mean $\mu$ and dispersion matrix $\Sigma$. There are $n$ observations.

We will choose an alternative notation, which is also found in the literature, and write the matrix as an $n \times r$ matrix $Y$. As before, each column represents a variable, and each row a (usually independent) observation. Decompose $Y$ into its row vectors as follows:

(63.1.1)  $Y = \begin{bmatrix} y_1^\top \\ \vdots \\ y_n^\top \end{bmatrix}.$

Each row (written as a column vector) $y_i$ has mean $\mu$ and dispersion matrix $\Sigma$, and different rows are independent of each other. In other words, $\mathcal{E}[Y] = \iota\mu^\top$. $\mathcal{V}[Y]$ is an array of rank 4, not a matrix. In terms of Kronecker products one can write $\mathcal{V}[\operatorname{vec} Y] = \Sigma \otimes I$.

One can form the following descriptive statistics: $\bar{y} = \frac{1}{n}\sum_i y_i$ is the vector of sample means, $W = \sum_i (y_i - \bar{y})(y_i - \bar{y})^\top$ is the matrix of (corrected) squares and cross products, the sample covariance matrix is $S^{(n)} = \frac{1}{n} W$ with divisor $n$, and $R$ is the matrix of sample correlation coefficients. Notation: the $i$th sample variance is called $s_{ii}$ (not $s_i^2$, as one might perhaps expect). The sample means indicate location, the sample standard deviations dispersion, and the sample correlation coefficients linear relationship.

How does one get these descriptive statistics from the data $Y$ through a matrix manipulation? $\bar{y}^\top = \frac{1}{n}\iota^\top Y$; now $Y - \iota\bar{y}^\top = \bigl(I - \frac{\iota\iota^\top}{n}\bigr)Y$ is the matrix of observations with the appropriate sample mean taken out of each element, therefore

(63.1.2)  $W = \begin{bmatrix} y_1 - \bar{y} & \cdots & y_n - \bar{y} \end{bmatrix} \begin{bmatrix} (y_1 - \bar{y})^\top \\ \vdots \\ (y_n - \bar{y})^\top \end{bmatrix} = Y^\top \Bigl(I - \frac{\iota\iota^\top}{n}\Bigr)^\top \Bigl(I - \frac{\iota\iota^\top}{n}\Bigr) Y = Y^\top \Bigl(I - \frac{\iota\iota^\top}{n}\Bigr) Y.$

Then $S^{(n)} = \frac{1}{n} W$, and in order to get the sample correlation matrix $R$, use

(63.1.3)  $D^{(n)} = \operatorname{diag}(S^{(n)}) = \begin{bmatrix} s_{11} & 0 & \cdots & 0 \\ 0 & s_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_{rr} \end{bmatrix}$

and then $R = (D^{(n)})^{-1/2} S^{(n)} (D^{(n)})^{-1/2}$. In analogy to the formulas for variances and covariances of linear transformations of a vector, one has the following formula for the sample variances and covariances of linear combinations $Ya$ and $Yb$: $\operatorname{est.cov}[Ya, Yb] = a^\top S^{(n)} b$.

Problem 517. Show that $\mathcal{E}[\bar{y}] = \mu$ and $\mathcal{V}[\bar{y}] = \frac{1}{n}\Sigma$. (The latter identity can be shown in two ways: once using the Kronecker product of matrices, and once by partitioning $Y$ into its rows.)

Answer. $\mathcal{E}[\bar{y}] = \mathcal{E}[\frac{1}{n}Y^\top\iota] = \frac{1}{n}(\mathcal{E}[Y])^\top\iota = \frac{1}{n}\mu\iota^\top\iota = \mu$. Using Kronecker products, one obtains from $\bar{y}^\top = \frac{1}{n}\iota^\top Y$ that

(63.1.4)  $\bar{y} = \operatorname{vec}(\bar{y}^\top) = \frac{1}{n}(I \otimes \iota^\top)\operatorname{vec} Y;$

therefore

(63.1.5)  $\mathcal{V}[\bar{y}] = \frac{1}{n^2}(I \otimes \iota^\top)(\Sigma \otimes I)(I \otimes \iota) = \frac{1}{n^2}(\Sigma \otimes \iota^\top\iota) = \frac{1}{n}\Sigma.$

The alternative way to do it is

(63.1.6)  $\mathcal{V}[\bar{y}] = \mathcal{E}[(\bar{y}-\mu)(\bar{y}-\mu)^\top]$
(63.1.7)  $\phantom{\mathcal{V}[\bar{y}]} = \mathcal{E}\Bigl[\frac{1}{n}\sum_i (y_i-\mu)\,\frac{1}{n}\sum_j (y_j-\mu)^\top\Bigr]$
(63.1.8)  $\phantom{\mathcal{V}[\bar{y}]} = \frac{1}{n^2}\sum_{i,j}\mathcal{E}[(y_i-\mu)(y_j-\mu)^\top]$
(63.1.9)  $\phantom{\mathcal{V}[\bar{y}]} = \frac{1}{n^2}\sum_i \mathcal{E}[(y_i-\mu)(y_i-\mu)^\top]$
(63.1.10)  $\phantom{\mathcal{V}[\bar{y}]} = \frac{n}{n^2}\mathcal{E}[(y_i-\mu)(y_i-\mu)^\top] = \frac{1}{n}\Sigma.$

(In the step from (63.1.8) to (63.1.9) the cross terms vanish because different rows are independent.)

Problem 518. Show that $\mathcal{E}[S^{(n)}] = \frac{n-1}{n}\Sigma$; therefore the unbiased $S = \frac{1}{n-1}\sum_i (y_i-\bar{y})(y_i-\bar{y})^\top$ has $\Sigma$ as its expected value.
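As an illustration of the matrix manipulations in (63.1.2) and (63.1.3), here is a minimal NumPy sketch; the data matrix and all variable names are invented for the example and are not from the text.

```python
import numpy as np

# Illustrative n x r data matrix Y: n = 4 observations (rows), r = 3 variables (columns).
Y = np.array([[4.0, 0.0, 3.0],
              [7.0, 2.0, 6.0],
              [5.0, 1.0, 2.0],
              [8.0, 1.0, 5.0]])
n, r = Y.shape
iota = np.ones((n, 1))                        # vector of ones

y_bar = (Y.T @ iota / n).ravel()              # vector of sample means, (1/n) Y' iota
M = np.eye(n) - iota @ iota.T / n             # centering matrix I - iota iota'/n
W = Y.T @ M @ Y                               # corrected sums of squares and cross products, (63.1.2)
S_n = W / n                                   # sample covariance matrix with divisor n
D_n = np.diag(np.diag(S_n))                   # diag(S^(n)) as in (63.1.3)
R = np.linalg.inv(np.sqrt(D_n)) @ S_n @ np.linalg.inv(np.sqrt(D_n))  # sample correlation matrix

# est.cov[Ya, Yb] = a' S^(n) b for linear combinations Ya and Yb:
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
print(y_bar, R[0, 1], a @ S_n @ b)            # a' S^(n) b reproduces the (1,2) entry of S^(n)
```

With $a$ and $b$ chosen as unit vectors, $a^\top S^{(n)} b$ simply picks out an entry of $S^{(n)}$; with general $a$ and $b$ it gives the estimated covariance of the linear combinations $Ya$ and $Yb$.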
63.2. Two Geometries

One can distinguish two geometries, according to whether one takes the rows or the columns of $Y$ as the points. Rows as points gives $n$ points in $r$-dimensional space, the "scatter plot geometry." If $r = 2$, this is the scatter plot of the two variables against each other. In this geometry, the sample mean is the center of balance or center of gravity. The dispersion of the observations around their mean defines a distance measure in this geometry. The book introduces this distance by suggesting with its illustrations that the data are clustered in hyperellipsoids. The right way to introduce this distance would be to say: we are not only interested in the $r$ coordinates separately but also in any linear combinations; then use our treatment of the Mahalanobis distance for a given population, and then transfer it to the empirical distribution given by the sample.

In the other geometry, all observations of a given random variable form one point, here called "vector." I.e., the basic entities are the columns of $Y$. In this so-called "vector geometry," $\bar{x}\iota$ is the projection of an observation vector on the diagonal vector $\iota$, and the correlation coefficient is the cosine of the angle between the deviation vectors.

The generalized sample variance is defined as the determinant of $S$. Its geometric intuition: in the scatter plot geometry it is proportional to the square of the volume of the hyperellipsoids (see J&W, p. 103), and in the geometry in which the observations of each variable form a vector it is

(63.2.1)  $\det S = (n-1)^{-r}(\text{volume})^2$

where the volume is that spanned by the deviation vectors.

63.3. Assumption of Normality

A more general version of this section is 62.2.3. Assume that the $y_i$, the row vectors of $Y$, are independent, and each is $\sim N(\mu, \Sigma)$ with $\Sigma$ positive definite. Then the density function of $Y$ is

(63.3.1)  $f_Y(Y) = \prod_{j=1}^n (2\pi)^{-r/2}(\det\Sigma)^{-1/2}\exp\bigl(-\tfrac{1}{2}(y_j-\mu)^\top\Sigma^{-1}(y_j-\mu)\bigr)$
(63.3.2)  $\phantom{f_Y(Y)} = (2\pi)^{-nr/2}(\det\Sigma)^{-n/2}\exp\bigl(-\tfrac{1}{2}\textstyle\sum_j (y_j-\mu)^\top\Sigma^{-1}(y_j-\mu)\bigr).$

The quadratic form in the exponent can be rewritten as follows (the cross terms vanish because $\sum_j (y_j - \bar{y}) = o$):

(63.3.3)  $\sum_{j=1}^n (y_j-\mu)^\top\Sigma^{-1}(y_j-\mu) = \sum_{j=1}^n (y_j-\bar{y}+\bar{y}-\mu)^\top\Sigma^{-1}(y_j-\bar{y}+\bar{y}-\mu) = \sum_{j=1}^n (y_j-\bar{y})^\top\Sigma^{-1}(y_j-\bar{y}) + n(\bar{y}-\mu)^\top\Sigma^{-1}(\bar{y}-\mu).$

The first term can be simplified as follows:

$\sum_j (y_j-\bar{y})^\top\Sigma^{-1}(y_j-\bar{y}) = \sum_j \operatorname{tr}\bigl((y_j-\bar{y})^\top\Sigma^{-1}(y_j-\bar{y})\bigr) = \sum_j \operatorname{tr}\bigl(\Sigma^{-1}(y_j-\bar{y})(y_j-\bar{y})^\top\bigr) = \operatorname{tr}\Bigl(\Sigma^{-1}\sum_j (y_j-\bar{y})(y_j-\bar{y})^\top\Bigr) = n\operatorname{tr}\bigl(\Sigma^{-1}S^{(n)}\bigr).$

Using this one can write the density function as

(63.3.4)  $f_Y(Y) = (2\pi)^{-nr/2}(\det\Sigma)^{-n/2}\exp\bigl(-\tfrac{n}{2}\operatorname{tr}(\Sigma^{-1}S^{(n)})\bigr)\exp\bigl(-\tfrac{n}{2}(\bar{y}-\mu)^\top\Sigma^{-1}(\bar{y}-\mu)\bigr).$

One sees, therefore, that the density function depends on the observations only through $\bar{y}$ and $S^{(n)}$, which means that $\bar{y}$ and $S^{(n)}$ are sufficient statistics.

Now we compute the maximum likelihood estimators: maximizing over $\mu$ simply gives $\hat{\mu} = \bar{y}$. This leaves the concentrated likelihood function

(63.3.5)  $\max_\mu f_Y(Y) = (2\pi)^{-nr/2}(\det\Sigma)^{-n/2}\exp\bigl(-\tfrac{n}{2}\operatorname{tr}(\Sigma^{-1}S^{(n)})\bigr).$

To obtain the maximum likelihood estimate of $\Sigma$ one needs equation (A.8.21) in Theorem A.8.3 in the Appendix and (62.2.15). If one sets $A = S^{(n)1/2}\Sigma^{-1}S^{(n)1/2}$ in (62.2.15), then $\operatorname{tr} A = \operatorname{tr}(\Sigma^{-1}S^{(n)})$ and $\det A = (\det\Sigma)^{-1}\det S^{(n)}$; therefore the concentrated likelihood function satisfies

(63.3.6)  $(2\pi)^{-nr/2}(\det\Sigma)^{-n/2}\exp\bigl(-\tfrac{n}{2}\operatorname{tr}(\Sigma^{-1}S^{(n)})\bigr) \le (2\pi e)^{-nr/2}(\det S^{(n)})^{-n/2}$

with equality holding if $\hat{\Sigma} = S^{(n)}$. Note that the maximum value depends on the data only through the estimated generalized variance $\det S^{(n)}$.
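As a quick numerical illustration of (63.3.4)--(63.3.6), the following sketch (the simulated data and all names are my own, not from the text) evaluates the log of the density at the maximum likelihood estimates $\hat\mu = \bar y$, $\hat\Sigma = S^{(n)}$ and checks that it equals the log of the bound $(2\pi e)^{-nr/2}(\det S^{(n)})^{-n/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 200, 3
Y = rng.multivariate_normal([1.0, -2.0, 0.5],
                            [[2.0, 0.3, 0.1],
                             [0.3, 1.0, 0.2],
                             [0.1, 0.2, 1.5]], size=n)        # n x r data matrix

y_bar = Y.mean(axis=0)                                         # MLE of mu
S_n = (Y - y_bar).T @ (Y - y_bar) / n                          # MLE of Sigma (divisor n)

def log_density(Y, mu, Sigma):
    """Logarithm of (63.3.4): depends on the data only through y_bar and S_n."""
    n, r = Y.shape
    yb = Y.mean(axis=0)
    Sn = (Y - yb).T @ (Y - yb) / n
    Sinv = np.linalg.inv(Sigma)
    return (-n * r / 2 * np.log(2 * np.pi)
            - n / 2 * np.log(np.linalg.det(Sigma))
            - n / 2 * np.trace(Sinv @ Sn)
            - n / 2 * (yb - mu) @ Sinv @ (yb - mu))

# At the maximum the value equals the bound in (63.3.6).
lhs = log_density(Y, y_bar, S_n)
rhs = -n * r / 2 * np.log(2 * np.pi * np.e) - n / 2 * np.log(np.linalg.det(S_n))
print(np.isclose(lhs, rhs))                                    # True, up to floating-point error
```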
63.4. EM-Algorithm for Missing Observations

The maximization of the likelihood function is far more difficult if some observations are missing. (Here assume they are missing randomly, i.e., the fact that they are missing is not related to the values of these entries. Otherwise one has sample selection bias!) In this case, a good iterative procedure to obtain the maximum likelihood estimate is the EM-algorithm (expectation-maximization algorithm). It iterates between a prediction step and an estimation step.

Let's follow Johnson and Wichern's example on their p. 199. The matrix is

(63.4.1)  $Y = \begin{bmatrix} - & 0 & 3 \\ 7 & 2 & 6 \\ 5 & 1 & 2 \\ - & - & 5 \end{bmatrix}$

where the dashes mark the missing entries. It is not so important how one gets the initial estimates of $\mu$ and $\Sigma$: say $\tilde{\mu}^\top = \begin{bmatrix} 6 & 1 & 4 \end{bmatrix}$, and to get $\tilde{\Sigma}$ take deviations from the mean, putting zeros in for the missing values (which will of course underestimate the variances), and divide by the number of observations. (Since we are talking maximum likelihood, there is no adjustment for degrees of freedom.)

(63.4.2)  $\tilde{\Sigma} = \frac{1}{4} Y_0^\top Y_0$ where the matrix of zero-filled deviations is $Y_0 = \begin{bmatrix} 0 & -1 & -1 \\ 1 & 1 & 2 \\ -1 & 0 & -2 \\ 0 & 0 & 1 \end{bmatrix}$, i.e., $\tilde{\Sigma} = \begin{bmatrix} 1/2 & 1/4 & 1 \\ 1/4 & 1/2 & 3/4 \\ 1 & 3/4 & 5/2 \end{bmatrix}.$

Given these estimates, the prediction step is next. The likelihood function depends on the sample mean and sample dispersion matrix only. These, in turn, are simple functions of the vector of column sums $Y^\top\iota$ and the matrix of (uncentered) sums of squares and cross products $Y^\top Y$, which are complete sufficient statistics. To predict those we need predictions of the missing elements of $Y$, of their squares, and of their products with each other and with the observed elements of $Y$. Our method of predicting is to take conditional expectations, assuming $\tilde{\mu}$ and $\tilde{\Sigma}$ are the true mean and dispersion matrix. For the prediction of the upper lefthand corner element of $Y$, only the first row of $Y$ is relevant. Partitioning [...]

[...] of $Y^\top\iota$ and $Y^\top Y$ into the likelihood function and get the maximum likelihood estimates of $\mu$ and $\Sigma$; in other words, set mean and dispersion matrix equal to the sample mean vector and sample [...]
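The preview gives only fragments of the rest of this section; the following is only a sketch of the iteration just described (conditional expectations of the missing entries, their squares, and their cross products, followed by re-estimation of $\mu$ and $\Sigma$), written from scratch rather than taken from the text. The function name and the stopping rule are my own; the starting values reproduce $\tilde\mu$ and $\tilde\Sigma$ of (63.4.2).

```python
import numpy as np

def em_mvn(Y, n_iter=50):
    """EM estimation of the mean and dispersion matrix of a multivariate normal
    with randomly missing entries (np.nan). Prediction step: conditional
    expectations of the missing values, their squares, and their cross products;
    estimation step: set mu and Sigma to the implied sample mean and dispersion."""
    Y = np.asarray(Y, dtype=float)
    n, r = Y.shape
    obs = ~np.isnan(Y)

    # Initial estimates as in the text: column means of the observed values, and
    # zero-filled deviations divided by n (no degrees-of-freedom adjustment).
    mu = np.array([Y[obs[:, j], j].mean() for j in range(r)])
    Y0 = np.where(obs, Y, mu) - mu
    Sigma = Y0.T @ Y0 / n

    for _ in range(n_iter):
        T1 = np.zeros(r)                      # predicted column sums, Y' iota
        T2 = np.zeros((r, r))                 # predicted sums of squares and cross products, Y'Y
        for i in range(n):
            o, m = obs[i], ~obs[i]
            yhat = Y[i].copy()
            C = np.zeros((r, r))
            if m.any():
                Soo_inv = np.linalg.inv(Sigma[np.ix_(o, o)])
                yhat[m] = mu[m] + Sigma[np.ix_(m, o)] @ Soo_inv @ (Y[i, o] - mu[o])
                C[np.ix_(m, m)] = Sigma[np.ix_(m, m)] - Sigma[np.ix_(m, o)] @ Soo_inv @ Sigma[np.ix_(o, m)]
            T1 += yhat
            T2 += np.outer(yhat, yhat) + C    # E[y y'] = E[y] E[y]' + conditional covariance
        mu = T1 / n                           # estimation step: mean ...
        Sigma = T2 / n - np.outer(mu, mu)     # ... and dispersion matrix from the predicted statistics
    return mu, Sigma

# Johnson and Wichern's example (63.4.1); np.nan marks the missing entries.
Y = [[np.nan, 0, 3], [7, 2, 6], [5, 1, 2], [np.nan, np.nan, 5]]
mu_hat, Sigma_hat = em_mvn(Y)
print(mu_hat)
print(Sigma_hat)
```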
[...] are similar, and if one is not interested in those particular firms in the sample, but in all firms.

Problem 524. 3 points. Enumerate as many commonalities and differences as you can between the dummy variable model for pooling cross sectional and time series data, and the seemingly unrelated regression model.

Answer. Both models involve different cross-sectional units in overlapping time intervals. In the SUR [...]

[...] numerically because they involve smaller matrices, by exploiting the structure of the overall design matrix. First estimate the slope parameters by sweeping out the means, then the intercepts.

• c. 3 points. Set up an F-test testing whether the individual intercept parameters are indeed different, by running two separate regressions on the restricted and the unrestricted model and using the generic formula [...]

[...] identically distributed independent error terms with zero mean, and $\alpha$ is an $m$-vector and $\beta$ a $k$-vector of unknown nonrandom parameters.

• a. 3 points. Describe in words the characteristics of this model and how it can come about.

Answer. Each of the $m$ units has a different intercept; the slope is the same. Equal marginal costs but different fixed costs.

• b. 4 points. Describe the issues in estimating this model and how it should [...]

[...] proofs in multivariate statistics. Conditionally on $v$, the matrix $P$ is of course constant, and therefore, by theorem 10.4.2, conditionally on $v$ the vector $w = Pu$ is standard normal with same variance $\sigma_{uu}$, and $q = u^\top u - w^\top w$ is an independent $\sigma_{uu}\chi^2_{n-2}$. In other words, conditionally on $v$, the following three variables are mutually independent and have the following [...]

[...] estimates by running OLS on (64.1.4), i.e., regressing $\operatorname{vec} Y$ on $Z$ with an intercept.

64.2. The Between-Estimator

By premultiplying (64.1.3) by $\frac{1}{t}\iota^\top$ one obtains the so-called "between"-regression. Defining $\bar{y}^\top = \frac{1}{t}\iota^\top Y$, i.e., $\bar{y}^\top$ is the row vector consisting of the column means, and in the same way $\bar{x}_i^\top = \frac{1}{t}\iota^\top X_i$ and $\bar{\varepsilon}^\top = \frac{1}{t}\iota^\top E$, one obtains (64.2.1) [...]

[...] $\begin{bmatrix} \sigma_\alpha^2 + \sigma_\varepsilon^2 & \sigma_\alpha^2 & \cdots & \sigma_\alpha^2 \\ \sigma_\alpha^2 & \sigma_\alpha^2 + \sigma_\varepsilon^2 & \cdots & \sigma_\alpha^2 \\ \vdots & & \ddots & \vdots \\ \sigma_\alpha^2 & \sigma_\alpha^2 & \cdots & \sigma_\alpha^2 + \sigma_\varepsilon^2 \end{bmatrix}$

Problem 525. 3 points. Using Problems 612 and 613 show that the covariance matrix of the error term in the random coefficients model (after the random part of the intercept has been added to it) is $\mathcal{V}[\operatorname{vec}(\iota\delta^\top + E)] = I_m \otimes V$, where $V$ is defined in (64.5.4).

Answer. $\mathcal{V}[\operatorname{vec}(\iota\delta^\top + E)] = \mathcal{V}[\operatorname{vec}(\iota\delta^\top) + \operatorname{vec}(E)]$ [...]

[...] take the inverses block by block, and gets the above.

• b. 1 point. Show that this is a matrix-weighted average of the BLUE's in the individual time series regressions, with the inverses of the covariance matrices of these BLUE's as the weighting matrices.

Answer. Simple because

(64.5.9)  $\hat{\hat{\beta}} = \Bigl(\sum_i X_i^\top\Sigma^{-1}X_i\Bigr)^{-1}\sum_i \bigl(X_i^\top\Sigma^{-1}X_i\bigr)\bigl(X_i^\top\Sigma^{-1}X_i\bigr)^{-1}X_i^\top\Sigma^{-1}y_i.$

Since the columns of $\iota\delta^\top + E$ are independent and have [...]

[...] the disturbances only, while in the dummy variable model, no relationship at all is going through the disturbances; all the errors are independent! But in the dummy variable model, the equations are strongly related since all slope coefficients are equal in the different equations; only the intercepts may differ. In the SUR model, there is no relationship between the parameters in the different equations [...]

[...] are random too; they are elements of the vector $\alpha \sim (\iota\mu, \sigma_\alpha^2 I)$ which is uncorrelated with $E$. Besides $\beta$, the two main parameters to be estimated are $\mu$ and $\sigma_\alpha^2$, but sometimes one may also want to predict $\alpha$. In our example of firms, this specification would be appropriate if we are not interested in the fixed costs associated with the specific firms in [...]
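The fragment containing (64.5.9) above asserts that the pooled GLS estimator is a matrix-weighted average of the per-unit GLS estimators, with weights $X_i^\top\Sigma^{-1}X_i$. The following is a small numerical check of that algebraic identity; the data, the error covariance matrix, and all names are invented for the illustration (any positive definite covariance works, e.g. the random-effects form $\sigma_\varepsilon^2 I + \sigma_\alpha^2\iota\iota^\top$ shown in the matrix above).

```python
import numpy as np

rng = np.random.default_rng(1)
m, t, k = 5, 12, 3                        # m units, t periods per unit, k regressors
V = 0.25 * np.eye(t) + np.ones((t, t))    # positive definite t x t error covariance (compound symmetry)
Vinv = np.linalg.inv(V)

X = [rng.normal(size=(t, k)) for _ in range(m)]
y = [rng.normal(size=t) for _ in range(m)]

# Per-unit GLS estimator ("BLUE" in the individual time-series regression) and its weight matrix.
W = [X[i].T @ Vinv @ X[i] for i in range(m)]
b = [np.linalg.solve(W[i], X[i].T @ Vinv @ y[i]) for i in range(m)]

# Pooled GLS over all units.
b_pooled = np.linalg.solve(sum(W), sum(X[i].T @ Vinv @ y[i] for i in range(m)))

# Matrix-weighted average of the per-unit estimators with weights W_i = X_i' V^{-1} X_i.
b_avg = np.linalg.solve(sum(W), sum(W[i] @ b[i] for i in range(m)))
print(np.allclose(b_pooled, b_avg))       # True: the pooled estimator is the weighted average
```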