Class Notes in Statistics and Econometrics, Part 17

CHAPTER 33

Regression Graphics

The “regression” referred to in the title of this chapter is not necessarily linear regression. The population regression can be defined as follows: the random scalar y and the random vector x have a joint distribution, and we want to know how the conditional distribution of y|x = x depends on the value x. The distributions themselves are not known, but we have datasets, and we use graphical means to estimate the distributions from these datasets.

Problem 384. Someone said on an email list about statistics: if you cannot see an effect in the data, then there is no use trying to estimate it. Right or wrong?

Answer. One counterargument is the curse of dimensionality. Also, higher moments of the distribution, kurtosis etc., cannot be seen very clearly with the naked eye.

33.1. Scatterplot Matrices

One common graphical method to explore a dataset is to plot each data series against each of the others and to arrange these scatter plots in a matrix. In R, the pairs function does this. Scatterplot matrices should be produced in the preliminary stages of the investigation, but the researcher should not think he or she is done after having looked at them. In the construction of scatterplot matrices, it is good practice to change the signs of some of the variables so that all correlations are positive, if this is possible.

[BT99, pp. 17–20] gives a good example of the kinds of things that can be seen in scatterplot matrices. The data for this book are available at http://biometrics.ag.uq.edu.au/software.htm

Problem 385. 5 points Which inferences about the datasets can you draw from looking at the scatterplot matrix in [BT99, Exhibit 3.2, p. 14]?

Answer. The discussion on [BT99, p. 19?] distinguishes three categories.
First the univariate phenomena:
• yield is more concentrated for local genotypes (•) than for imports (◦);
• the converse is true for protein % but not as pronounced;
• oil % and seed size are lower for local genotypes (•); regarding seed size, the heaviest • is lighter than the lightest ◦;
• height and lodging are greater for local genotypes.

Bivariate phenomena are either within-group or between-group phenomena or both:
• negative relationship of protein % and oil % (both within • and ◦);
• positive relationship of oil % and seed size (both within • and ◦ and also between these groups);
• negative relationship, between groups, of seed size and height;
• positive relationship of height and lodging (within ◦ and between groups);
• negative relationship of oil % and lodging (between groups and possibly within •);
• negative relationship of seed size and lodging (between groups);
• positive relationship of height and lodging (between groups).

The between-group phenomena are, of course, not due to an interaction between the groups; they are the consequence of univariate phenomena.

As a third category, the authors point out unusual individual points:
• 1 high ◦ for yield;
• 1 high • (still lower than all the ◦s) for seed size;
• 1 low ◦ for lodging;
• 1 low • for protein % and oil % in combination.

[Coo98, Figure 2.8 on p. 29] shows a scatterplot matrix of the “horse mussel” data, originally from [Cam89]. This graph is also available at www.stat.umn.edu/RegGraph/graphics/Figure 2.8.gif. Horse mussels (Atrinia) were sampled from the Marlborough Sounds. The five variables are L = shell length in mm, W = shell width in mm, H = shell height in mm, S = shell mass in g, and M = muscle mass in g. M is the part of the mussel that is edible.

Problem 386. 3 points In the mussel data set, M is the “response” (according to [Coo98]).
Is it justified to call this variable the “response” and the other variables the explanatory variables, and if so, how would you argue for it?

Answer. This is one of the issues which is not sufficiently discussed in the literature. It would be justified if the dimensions and weight of the shell were exogenous to the weight of the edible part of the mussel, i.e., if the mussel first grows the shell and then fills this shell with muscle, so that the dimensions of the shell affect how big the muscle can grow, but the muscle itself has no influence on the dimensions of the shell. If this is the case, then it makes sense to look at the distribution of M conditionally on the other variables, i.e., to ask the question: given certain weights and dimensions of the shell, what is the nature of the mechanism by which the muscle grows inside this shell? But if muscle and shell grow together, both affected by the same variables (temperature, nutrition, daylight, etc.), then the conditional distribution is not informative. In this case, the joint distribution is of interest.

In order to get this dataset into R, you simply say data(mussels), after having said library(ecmet). Then you need the command pairs(mussels) to get the scatterplot matrix. Also interesting is pairs(log(mussels)), especially since the log transformation is appropriate if one explains volume and weight by length, height, and width.

The scatter plot of M versus H shows a clear curvature, but one should not jump to the conclusion that the regression is not linear. Cook gives another example with constructed data, in which the regression function is clearly linear, without error term, and in which the scatter plot of the response versus one of the predictors nevertheless shows a curvature similar to that in the mussel data.

Problem 387. Cook’s constructed dataset is available as dataset reggra29 in the ecmet package.
Make a scatterplot matrix of the dataset, then load it into XGobi and convince yourself that y depends linearly on $x_1$ and $x_2$.

Answer. You need the commands data(reggra29) and then pairs(reggra29) to get the scatterplot matrix. Before you can access xgobi from R, you must give the command library(xgobi). Then xgobi(reggra29). The dependency is $y = 3 + x_1 + x_2/2$.

Problem 388. 2 points Why can the scatter plot of the dependent variable against one of the independent variables be so misleading?

Answer. Because the included independent variable becomes a proxy for the excluded variable. The effect of the excluded variable is mistaken to come from the included variable. Now if the included and the excluded variable are independent of each other, then the omission of the excluded variable increases the noise, but does not have a systematic effect. But if there is an empirical relationship between the included and the excluded variable, then this translates into a spurious relationship between the included and the dependent variable. The mathematics of this is discussed in Problem 328.

Problem 389. Would it be possible in the scatter plot in [Coo98, p. 217] to reverse the signs of some of the variables in such a way that all correlations are positive?

Answer. Yes, you have to reverse the signs of 6Below and AFDC. Here are the instructions for making the scatter plots: in arc, go to the load menu. (Ignore the close and the menu boxes; they don’t seem to work.) Then type the path /usr/share/ecmet/xlispstat into the long box and press return. This gives me only one option, Minneapolis-schools.lsp. I have to press 3 times on this until it jumps to the big box; then I can press enter on the big box to load the data. This gives me a bigger menu. Go to the MPLSchools menu, and to the add variable option. You have to type in 6BelNeg = (- 6Below), then enter, then AFDCNeg = (- AFDC), and finally BthPtsNeg = (- BthPts).
Then go to the Graph&Fit menu, and select scatterplot matrix. Then you have to be careful about the order: first select AFDCNeg in the left box and double click so that it jumps over to the right box. Then select HS, then BthPtsNeg, then 6BelNeg, then 6Above. Now the scatterplot matrix will be oriented all in one direction.

33.2. Conditional Plots

In order to account for the effect of excluded variables in a scatter plot, the function coplot makes scatter plots in which the excluded variable is conditioned upon. The graphics demo has such a conditioning plot; here is the code (from the file /usr/lib/R/demos/graphics/graphics.R):

    data(quakes)
    coplot(long ~ lat | depth, data=quakes, pch=21)

33.3. Spinning

An obvious method to explore a more than two-dimensional structure graphically is to look at plots of y against various linear combinations of x. Many statistical software packages have the ability to do so, but one of the most powerful ones is XGobi. Documentation about xgobi that is more detailed than help(xgobi) in R/Splus can be obtained by typing man xgobi in Unix. A nice brief documentation is [Rip96]. The official manuals are [SCB91] and [BCS96].

XGobi can be used as a stand-alone program or it can be invoked from inside R or Splus. In R, you must give the command library(xgobi) in order to make the function xgobi accessible.

The search for “interesting” projections of the data into one-, two-, or 3-dimensional spaces has been automated in projection pursuit regression programs. The basic reference is [FS81], but there is also the much older [FT74].

The most obvious graphical regression method consists in slicing or binning the data, and taking the mean of the data in each bin. But if you have too many explanatory variables, this local averaging becomes infeasible, because of the “curse of dimensionality.” Consider a dataset with 1000 observations and 10 variables, all between 0 and 1.
In order to see whether the data are uniformly distributed or whether they have some structure, you may consider splitting up the 10-dimensional unit cube into smaller cubes and counting the number of data points in each of these subcubes. The problem here is: if one makes those subcubes large enough that they contain more than 0 or 1 observations, then their side lengths are not much smaller than that of the unit hypercube itself. Even with a side length of 1/2, which would be the largest reasonable side length, one needs $2^{10} = 1024$ subcubes to fill the hypercube; therefore the average number of data points per subcube is a little less than 1. By projecting instead of taking subspaces, projection pursuit regression does not have this problem of data scarcity.

Projection pursuit regression searches for an interesting and informative projection of the data by maximizing a criterion function. A logical candidate would for instance be the variance ratio as defined in (8.6.7), but there are many others.

About grand tours, projection pursuit guided tours, and manual tours see [CBCH97] and [CB97].

Problem 390. If you run XGobi from the menu in Debian GNU/Linux, it uses prim7, which is a 7-dimensional particle physics data set used as an example in [FT74]. The following is from the help page for this dataset:

There are 500 observations taken from a high energy particle physics scattering experiment which yields four particles. The reaction can be described completely by 7 independent measurements. The important features of the data are short-lived intermediate reaction stages which appear as protuberant “arms” in the point cloud. The projection pursuit guided tour is the tool to use to understand this data set.
Using all 7 variables turn on projection pursuit and optimize with the Holes index until a view is found that has a triangle and two arms crossing each other off one edge (this is very clear once you see it but the Holes index has a tendency to get stuck in another local maximum which doesn’t have much structure). Brush the arms with separate colours and glyphs. Change to the Central Mass index and optimize. As new arms are revealed brush them and continue. When you have either run out of colours or time turn off projection pursuit and watch the data touring. Then it becomes clear that the underlying structure is a triangle with 5 or 6 arms (some appear to be 1-dimensional, some 2-dimensional) extending from the vertices.

33.4. Sufficient Plots

A different approach, which is less ad hoc than projection pursuit, starts with the theory of conditional independence, see Section 2.10.3. A theoretical exposition of this approach is [Coo98], with web site www.stat.umn.edu/RegGraph. This web site has the data and several short courses based on the book. In particular, the file www.stat.umn.edu/RegGraph/papers/interface.pdf is a nice brief introduction. A more practically oriented book, which teaches the software especially developed for this approach, is [CW99].

For any graphical procedure exploring linear combinations of the explanatory variables, the structural dimension d of a regression is relevant: it is the smallest number of distinct linear combinations of the predictors required to characterize the conditional distribution of y|x.

If the data follow a linear regression, then their structural dimension is 1. But even if the regression is nonlinear but can be written in the form

(33.4.1)   $y|x \sim g(\beta^\top x) + \sigma(\beta^\top x)\,\varepsilon$

with $\varepsilon$ independent of x, this is also a population with structural dimension 1. If t is a monotonic transformation, then $t(y)|x \sim g(\beta^\top x) + \sigma(\beta^\top x)\,\varepsilon$ is an even more general model with structural dimension 1.
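To see what structural dimension 1 means in practice, a small simulation may help. The following R sketch is not part of the notes: the functions g and sigma, the direction beta, and the sample size are assumptions chosen only for illustration. It generates data of the form (33.4.1) and plots y against the single linear combination of the predictors, next to the scatterplot matrix of the raw variables.

```r
# Minimal sketch, assuming a hypothetical model of the form (33.4.1).
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 3), nrow = n, ncol = 3)   # three simulated predictors
colnames(x) <- c("x1", "x2", "x3")
beta <- c(1, -1, 0.5)                            # hypothetical direction

index <- drop(x %*% beta)                        # the one linear combination beta'x
g     <- function(u) exp(u / 2)                  # assumed nonlinear mean function
sigma <- function(u) 0.1 + 0.1 * abs(u)          # assumed scale function

y <- g(index) + sigma(index) * rnorm(n)          # error term independent of x

# One plot against beta'x shows the whole dependence of y on x ...
plot(index, y, xlab = "beta'x", ylab = "y")
# ... whereas the marginal plots in the scatterplot matrix each show only part of it.
pairs(cbind(y = y, x))
```

Spinning such data in XGobi should show the point cloud collapsing onto a curve in exactly one viewing direction, which is what the definition of structural dimension 1 leads one to expect.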
If

(33.4.2)   $y|x = \begin{cases} \beta_1^\top x + \varepsilon & \text{if } x_1 \ge 0 \\ \beta_2^\top x + \varepsilon & \text{if } x_1 < 0 \end{cases}$

[...]

... and stretching. Since one needs a different view to get the best fit of the points in the upper-right corner than for the points in the lower-left corner, the structural dimension must be 2. If one looks at the scatter plots of y against all linear combinations of components of x, and none of them show a relationship (either linear or nonlinear), then the structural dimension is zero. Here are the instruction ...

... with $\beta_1$ and $\beta_2$ linearly independent, then the structural dimension is 2, since one needs 2 different linear combinations of x to characterize the distribution of y. If

(33.4.3)   $y|x = \|x\|^2 + \varepsilon$

then this is a very simple relationship between x and y, but the structural dimension is k, the number of dimensions of x, since the relationship is not intrinsically linear.

Problem 391. [FS81, p ...

... $x_1$ and $x_2$. Look at the data using XGobi or some other spin program. What is the structural dimension of the data set? The rubber data are from [Woo72], and they are also discussed in [Ric, p. 506]. mnr is modulus of natural rubber, temp the temperature in degrees Celsius, and dp Dicumyl Peroxide in percent.

Answer. The data are a plane that has been distorted by twisting and ...

... lim replaced by plim holds if $x_i$ and $\varepsilon_i$ are an i.i.d. sequence of random variables.

Problem 394. 2 points In the regression model with random regressors $y = X\beta + \varepsilon$, you only know that $\operatorname{plim} \frac{1}{n} X^\top X = Q$ is a nonsingular matrix, and $\operatorname{plim} \frac{1}{n} X^\top \varepsilon = o$. Using these two conditions, show that the OLS estimate is consistent.

Answer. $\hat\beta = (X^\top X)^{-1} X^\top y = \beta + (X^\top X)^{-1} X^\top \varepsilon$ due to (24.0.7), and $\operatorname{plim} (X^\top X)^{-1} X^\top \varepsilon = \operatorname{plim} ( X^\top X$ ...
... by $X^\top X/n$ and $s^2$ by $\sigma^2$ to get

$\dfrac{(R\hat\beta_n - R\beta)^\top \bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1} (R\hat\beta_n - R\beta)}{s^2} \to \chi^2_i$

in distribution. All this is not a proof; the point is that in the denominator, the distribution is divided by the increasingly bigger number n − k, while in the numerator, it is divided by the constant i; therefore asymptotically the denominator can be considered 1. The central limit theorems only say that for n → ∞ these ...

... 275] To write down the Grenander conditions, remember that presently X depends on n (in that we only look at the first n elements of y and first n rows of X), therefore the column vectors $x_j$ also depend on n (although we are not indicating this here). Therefore $x_j^\top x_j$ depends on n as well, and we will make this dependency explicit by writing $x_j^\top x_j = d_{nj}^2$. Then the first Grenander condition is $\lim_{n\to\infty}$ ...

... regression function consisting of the interaction term between $x_1$ and $x_2$ only, $\varphi(x) = x_1 x_2$, has structural dimension 2, i.e., it can be written in the form $\varphi(x) = \sum_{m=1}^{2} s_m(\alpha_m^\top x)$ where the $s_m$ are smooth functions of one variable.

Answer.

(33.4.4)   $\alpha_1 = \dfrac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \\ o \end{bmatrix}, \quad \alpha_2 = \dfrac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \\ o \end{bmatrix}, \quad s_1(z) = \dfrac{z^2}{2}, \quad s_2(z) = -\dfrac{z^2}{2}$

Problem 392. [Coo98, p. 62] In the rubber data, mnr is the dependent variable y, and temp and dp form the two ...

... Therefore its limiting covariance matrix is $Q^{-1}\sigma^2 Q Q^{-1} = \sigma^2 Q^{-1}$. Therefore $\sqrt{n}(\hat\beta_n - \beta) \to N(o, \sigma^2 Q^{-1})$ in distribution. One can also say: the asymptotic distribution of $\hat\beta$ is $N(\beta, \sigma^2 (X^\top X)^{-1})$.

From this follows $\sqrt{n}(R\hat\beta_n - R\beta) \to N(o, \sigma^2 R Q^{-1} R^\top)$, and therefore

(34.2.1)   $n (R\hat\beta_n - R\beta)^\top \bigl(R Q^{-1} R^\top\bigr)^{-1} (R\hat\beta_n - R\beta) \to \sigma^2 \chi^2_i$

Divide by $s^2$ and replace in the limiting case Q by $X^\top X/n$ and $s^2$ by $\sigma^2$ to ...
... press the OK button, there will be an error message; apparently the starting values were not good enough. Try again, using marginal Box-Cox Starting Values. This will succeed, and the LR test for all transformations logs has a p-value of .14. Therefore choose the log transform for all the predictors. (If we include all 5 variables, the LR test for all transformations to be log transformations ...

... the very linear relationship between the predictors, and you see that all the scatter plots with the response are very similar. This is a sign that the structural dimension is 1 according to [CW99, pp. 435/6]. If that is the case, then a plot of the actual against the fitted values is a sufficient summary plot. For this, run the Fit Linear LS menu option, and then plot the dependent variable against the fitted ...
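The summary plot of actual against fitted values can also be sketched in R instead of arc. The sketch below does not use the mussel data, since those require the ecmet package; it uses simulated data, and the variable names, coefficients, and the exponential mean function are assumptions made only for illustration. The two predictors are constructed to be nearly linearly related, so the fitted values of a linear least squares fit should be close to a rescaled version of the one relevant linear combination.

```r
# Minimal sketch with simulated data (hypothetical stand-in for the mussel data).
set.seed(2)
n  <- 150
z  <- rnorm(n)
x1 <- z + rnorm(n, sd = 0.1)          # predictors that are nearly
x2 <- 2 * z + rnorm(n, sd = 0.1)      # linearly related to each other

# Assumed 1-dimensional structure: y depends on x only through x1 + x2.
y <- exp(0.3 * (x1 + x2)) + rnorm(n, sd = 0.2)
dat <- data.frame(y, x1, x2)

# Linear least squares fit; only the fitted values are used below.
fit <- lm(y ~ x1 + x2, data = dat)

# Candidate summary plot: response against fitted values.
plot(fitted(fit), dat$y, xlab = "fitted values", ylab = "y")
```

A curved but tight band in this plot suggests that the linear fit is not the right mean function, while the fitted values still work as the horizontal axis of a one-dimensional summary.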
