Class Notes in Statistics and Econometrics, Part 9


CHAPTER 17

Causality and Inference

This chapter establishes the connection between critical realism and Holland and Rubin's modelling of causality in statistics as explained in [Hol86] and [WM83, pp. 3–25] (and the related paper [LN81], which comes from a Bayesian point of view). A different approach to causality and inference, [Roy97], is discussed in chapter/section 2.8. Regarding critical realism and econometrics, [Dow99] should also be mentioned: it is written by a Post Keynesian econometrician working in an explicitly realist framework.

Everyone knows that correlation does not mean causality. Nevertheless, experience shows that statisticians can on occasion make valid inferences about causality. It is therefore legitimate to ask: how and under which conditions can causal conclusions be drawn from a statistical experiment or a statistical investigation of nonexperimental data?

Holland starts his discussion with a description of the “logic of association” (= a flat empirical realism) as opposed to causality (= depth realism). His model for the “logic of association” is essentially the conventional mathematical model of probability built on a set U of “all possible outcomes,” which we described and criticized on p. 12 above.

After this, Rubin describes his own model (developed together with Holland). Rubin introduces “counterfactual” (or, as Bhaskar would say, “transfactual”) elements, since he is not only talking about the value a variable takes for a given individual, but also about the value this variable would have taken for the same individual if the causing variables (which Rubin also calls “treatments”) had been different. For simplicity, Holland assumes here that the treatment variable has only two levels: either the individual receives the treatment, or he/she does not (in which case he/she belongs to the “control” group). The correlational view would simply measure the average response of those individuals who receive the treatment and of those who don't. Rubin recognizes in his model that the same individual may or may not be subject to the treatment; therefore the response variable has two values, one being the individual's response if he or she receives the treatment, the other the response if he or she does not.

A third variable indicates who receives the treatment. I.e., he has the “causal indicator” s, which can take two values, t (treatment) and c (control), and two variables y_t and y_c, which, evaluated at individual ω, indicate the responses this individual would give in case he was subject to the treatment and in case he was not.

Rubin defines y_t − y_c to be the causal effect of treatment t versus the control c. But this causal effect cannot be observed. We cannot observe how those individuals who received the treatment would have responded if they had not received the treatment, despite the fact that this non-actualized response is just as real as the response which they indeed gave. This is what Holland calls the Fundamental Problem of Causal Inference.

Problem 225. Rubin excludes race as a cause because the individual cannot do anything about his or her race. Is this argument justified?

Does this Fundamental Problem mean that causal inference is impossible? Here are several scenarios in which causal inference is possible after all:

• Temporal stability of the response, and transience of the causal effect.
• Unit homogeneity.
• Constant effect, i.e., y_t(ω) − y_c(ω) is the same for all ω.
• Independence of the response with respect to the selection process regarding who gets the treatment.

For an example of this last case, see Problem 226.
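The Fundamental Problem, and the role of the last scenario in the list above, can be made concrete with a small simulation (a hypothetical sketch, not part of Holland's or Rubin's exposition; all variable names and parameter values are invented). Each individual carries a pair of potential outcomes, but only one of them is ever observed; under random assignment the difference of observed group means still recovers the average causal effect, while selection that depends on the potential outcomes does not.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Potential outcomes for every individual omega: response under control and
    # under treatment.  Both are "real", but only one of them is ever observed.
    y_c = rng.normal(loc=10.0, scale=2.0, size=n)        # response without treatment
    y_t = y_c + rng.normal(loc=1.5, scale=1.0, size=n)   # response with treatment

    true_ace = np.mean(y_t - y_c)   # average causal effect (needs both outcomes, so unobservable)

    # Random assignment: the causal indicator s is independent of (y_t, y_c).
    s = rng.random(n) < 0.5
    y_obs = np.where(s, y_t, y_c)                        # only one potential outcome is revealed
    diff_random = y_obs[s].mean() - y_obs[~s].mean()

    # Selection that depends on the potential outcomes (those who would benefit
    # most are the ones who get treated) breaks the correspondence.
    s_sel = (y_t - y_c) > np.median(y_t - y_c)
    y_obs_sel = np.where(s_sel, y_t, y_c)
    diff_selected = y_obs_sel[s_sel].mean() - y_obs_sel[~s_sel].mean()

    print(f"average causal effect          : {true_ace:.3f}")
    print(f"group difference, randomized   : {diff_random:.3f}")    # close to the ACE
    print(f"group difference, self-selected: {diff_selected:.3f}")  # noticeably larger here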
Problem 226. Our universal set U consists of patients who have a certain disease. We will explore the causal effect of a given treatment with the help of three events, T, C, and S, the first two of which are counterfactual; compare [Hol86]. These events are defined as follows: T consists of all patients who would recover if given treatment; C consists of all patients who would recover if not given treatment (i.e., if included in the control group). The event S consists of all patients actually receiving treatment. The average causal effect of the treatment is defined as Pr[T] − Pr[C].

• a. 2 points Show that

(17.0.6)   Pr[T] = Pr[T | S] Pr[S] + Pr[T | S'] (1 − Pr[S])

and that

(17.0.7)   Pr[C] = Pr[C | S] Pr[S] + Pr[C | S'] (1 − Pr[S])

Which of these probabilities can be estimated as the frequencies of observable outcomes and which cannot?

Answer. This is a direct application of (2.7.9). The problem here is that for all ω ∉ S, i.e., for those patients who do not receive treatment, we do not know whether they would have recovered if given treatment, and for all ω ∈ S, i.e., for those patients who do receive treatment, we do not know whether they would have recovered if not given treatment. In other words, neither Pr[T | S'] nor Pr[C | S] can be estimated as the frequencies of observable outcomes.

• b. 2 points Assume now that S is independent of T and C, because the subjects are assigned randomly to treatment or control. How can this be used to estimate those elements in the equations (17.0.6) and (17.0.7) which could not be estimated before?

Answer. In this case, Pr[T | S] = Pr[T | S'] and Pr[C | S'] = Pr[C | S]. Therefore, the average causal effect can be simplified as follows:

Pr[T] − Pr[C] = Pr[T | S] Pr[S] + Pr[T | S'] (1 − Pr[S]) − (Pr[C | S] Pr[S] + Pr[C | S'] (1 − Pr[S]))
             = Pr[T | S] Pr[S] + Pr[T | S] (1 − Pr[S]) − (Pr[C | S'] Pr[S] + Pr[C | S'] (1 − Pr[S]))
(17.0.8)     = Pr[T | S] − Pr[C | S']

• c. 2 points Why were all these calculations necessary? Could one not have defined from the beginning that the causal effect of the treatment is Pr[T | S] − Pr[C | S']?

Answer. Pr[T | S] − Pr[C | S'] is only the empirical difference in recovery frequencies between those who receive treatment and those who do not. It is always possible to measure these differences, but these differences are not necessarily due to the treatment; they may be due to other reasons.

The main message of the paper is therefore: before drawing causal conclusions one should ascertain whether one of the conditions that make causal conclusions possible applies.
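A numerical sketch of Problem 226 (hypothetical recovery probabilities, chosen only for illustration): with a large simulated patient population one can check the decomposition (17.0.6) directly and see that the observable difference Pr[T | S] − Pr[C | S'] matches Pr[T] − Pr[C] when S is assigned independently of T and C, but not when treatment is given preferentially to patients in T.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000_000

    # Counterfactual events: T = "would recover if treated", C = "would recover if untreated".
    t = rng.random(n) < 0.70          # Pr[T] = 0.7
    c = rng.random(n) < 0.40          # Pr[C] = 0.4, drawn independently of T for simplicity

    ace = t.mean() - c.mean()         # Pr[T] - Pr[C]

    def observable_difference(s):
        """Pr[T|S] - Pr[C|S'], the recovery-rate difference one can actually observe."""
        return t[s].mean() - c[~s].mean()

    s_random = rng.random(n) < 0.5                          # randomized: S independent of (T, C)
    s_confounded = rng.random(n) < np.where(t, 0.8, 0.2)    # treatment favours patients in T

    # Check (17.0.6): Pr[T] = Pr[T|S] Pr[S] + Pr[T|S'] (1 - Pr[S]) holds as an identity of frequencies.
    s = s_random
    print(np.isclose(t.mean(), t[s].mean() * s.mean() + t[~s].mean() * (1 - s.mean())))

    print(f"Pr[T] - Pr[C]                 : {ace:.3f}")
    print(f"Pr[T|S] - Pr[C|S'], randomized: {observable_difference(s_random):.3f}")     # matches the ACE
    print(f"Pr[T|S] - Pr[C|S'], confounded: {observable_difference(s_confounded):.3f}")  # does not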
In the rest of the paper, Holland compares his approach with other approaches. Suppes's definitions of causality are interesting:

• If r < s denote two time values, event C_r is a prima facie cause of E_s iff Pr[E_s | C_r] > Pr[E_s].
• C_r is a spurious cause of E_s iff it is a prima facie cause of E_s and for some q < r < s there is an event D_q so that Pr[E_s | C_r, D_q] = Pr[E_s | D_q] and Pr[E_s | C_r, D_q] ≥ Pr[E_s | C_r].
• Event C_r is a genuine cause of E_s iff it is a prima facie but not a spurious cause.

This is quite different from Rubin's analysis. Suppes concentrates on the causes of a given effect, not the effects of a given cause. Suppes has a Popperian falsificationist view: a hypothesis is good if one cannot falsify it, while Holland has the depth-realist view which says that the empirical is only a small part of reality, and which looks at the underlying mechanisms.

Problem 227. Construct an example of a probability field with a spurious cause.

Granger causality (see chapter/section 67.2.1) is based on the idea: knowing a cause ought to improve our ability to predict. It is more appropriate to speak here of “noncausality” instead of causality: a variable does not cause another if knowing that variable does not improve our ability to predict the other variable. Granger formulates his theory in terms of a specific predictor, the BLUP, while Holland extends it to all predictors. Granger works in a time series framework, while Holland gives a more general formulation. Holland's formulation strips off the unnecessary detail in order to get at the essence of things. Holland defines: x is not a Granger cause of y relative to the information in z (which in the time series context contains the past values of y) if and only if x and y are conditionally independent given z. Problem 40 explains why this can be tested by testing predictive power.

CHAPTER 18

Mean-Variance Analysis in the Linear Model

In the present chapter, the only distributional assumptions are that means and variances exist. (From this it follows that the covariances exist as well.)

18.1. Three Versions of the Linear Model

As background reading please read [CD97, Chapter 1].

Following [JHG+88, Chapter 5], we will start with three different linear statistical models. Model 1 is the simplest estimation problem, already familiar from chapter 12, with n independent observations from the same distribution, call them y_1, ..., y_n. The only thing known about the distribution is that mean and variance exist, call them µ and σ². In order to write this as a special case of the “linear model,” define ε_i = y_i − µ, and define the vectors y = [y_1 y_2 ⋯ y_n]⊤, ε = [ε_1 ε_2 ⋯ ε_n]⊤, and ι = [1 1 ⋯ 1]⊤. Then one can write the model in the form

(18.1.1)   y = ιµ + ε,   ε ∼ (o, σ²I)

The notation ε ∼ (o, σ²I) is shorthand for E[ε] = o (the null vector) and V[ε] = σ²I (σ² times the identity matrix, which has 1's in the diagonal and 0's elsewhere). µ is the deterministic part of all the y_i, and ε_i is the random part.

Model 2 is “simple regression,” in which the deterministic part µ is not constant but is a function of the nonrandom variable x. The assumption here is that this function is differentiable and can, in the range of the variation of the data, be approximated by a linear function [Tin51, pp. 19–20]. I.e., each element of y is a constant α plus a constant multiple of the corresponding element of the nonrandom vector x plus a random error term: y_t = α + x_t β + ε_t, t = 1, ..., n. This can be written as

(18.1.2)   [y_1 ⋯ y_n]⊤ = [1 ⋯ 1]⊤ α + [x_1 ⋯ x_n]⊤ β + [ε_1 ⋯ ε_n]⊤ = [ι  x] [α  β]⊤ + [ε_1 ⋯ ε_n]⊤,

where [ι  x] denotes the n×2 matrix whose t-th row is (1, x_t), or

(18.1.3)   y = Xβ + ε,   ε ∼ (o, σ²I)

[...]
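Before the excerpt continues, here is a minimal sketch (simulated data; the parameter values are invented) of how Model 1 and Model 2 are both special cases of y = Xβ + ε: for Model 1 the design matrix is the column of ones ι, for Model 2 it is the n×2 matrix [ι  x] of (18.1.2).

    import numpy as np

    rng = np.random.default_rng(2)
    n, sigma = 50, 1.5

    # Model 1: y = iota*mu + eps; the "design matrix" is just the column of ones.
    mu = 3.0
    iota = np.ones((n, 1))
    y1 = (iota * mu).ravel() + rng.normal(0.0, sigma, size=n)

    # Model 2: y_t = alpha + x_t*beta + eps_t; design matrix X = [iota  x] as in (18.1.2).
    alpha, beta = 1.0, 0.5
    x = np.linspace(0.0, 10.0, n)
    X = np.column_stack([np.ones(n), x])
    y2 = X @ np.array([alpha, beta]) + rng.normal(0.0, sigma, size=n)

    print(X[:3])             # first three rows: [1, x_1], [1, x_2], [1, x_3]
    print(y1[:3], y2[:3])

Because the error draws are independent with a common variance, E[ε] = o and V[ε] = σ²I hold by construction, which is exactly the shorthand ε ∼ (o, σ²I).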
[...] which is “explained” by the regression, and a part SSE which remains “unexplained,” and R² measures that fraction of SST which can be “explained” by the regression. [Gre97, pp. 250–253] and also [JHG+88, pp. 211/212] try to make this notion plausible. Instead of using the vague notions “explained” and “unexplained,” I prefer the following reading, which is based on the third expression for R² in (18.3.16): [...]

Problem 231. 2 points Show the following: if the columns of X are linearly independent, then X⊤X has an inverse. (X itself is not necessarily square.) In your proof you may use the following criteria: the columns of X are linearly independent (this is also called: X has full column rank) if and only if Xa = o implies a = o. And a square matrix has an inverse if and only if its columns are linearly independent. [...]

[...] numerator cancel out, and what remains can be shown to be equal to (18.2.16).

Problem 239. 3 points Show that in the simple regression model, the fitted regression line can be written in the form

(18.2.26)   ŷ_t = ȳ + β̂(x_t − x̄)

From this follows in particular that the fitted regression line always goes through the point (x̄, ȳ).

Answer. Follows [...]

[...] “explained” by the regression. In order to understand SSR better, we will show next the famous “Analysis of Variance” identity SST = SSR + SSE.

Problem 249. In the reggeom visualization, again with x_1 representing the vector of ones, show that SST = SSR + SSE, and show that R² = cos²α where α is the angle between two lines in this visualization. Which lines? [...]

[...] Problem 189.

Problem 244. In the reggeom visualization, see Problem 350, in which x_1 is the vector of ones, which are the vectors Dx_2 and Dy?

Answer. Dx_2 is og, the dark blue line starting at the origin, and Dy is cy, the red line starting on x_1 and going up to the peak.

As an additional mathematical tool we will need the Cauchy-Schwarz inequality for the vector [...]

[...] one. In the regression model, the x_i are nonrandom, only the y_i are random; in the other model both x and y are random. In the regression model, the expected values of the y_i are not fully known; in the other model the expected values of both x and y are fully known. Both models have in common that the second moments are known only up to an unknown factor. Both models have in common that only first and [...]

[...] the decomposition (18.3.8) in the reggeom-visualization.

Answer. From y take the green line down to b, then the light blue line to c, then the red line to the origin.

This orthogonality can also be explained in terms of sequential projections: instead of projecting y on x_1 directly I can first project it on the plane spanned by x_1 and x_2, and then project this projection [...]

[...] these vectors and their projections. [...] 2 points Assume we have a dependent variable y and two regressors x_1 and x_2, each with 15 observations. Then one can visualize the data either as 15 points in 3-dimensional space (a 3-dimensional scatter plot), or as 3 points in 15-dimensional space. In the first case, each point corresponds to an observation; in the second case, each point corresponds to a variable. In this latter [...]
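The identities quoted in these excerpts are easy to verify numerically. The following sketch (simulated data; variable names and numbers are my own, not from the text) fits a simple regression with an intercept, confirms that the fitted line passes through (x̄, ȳ) as in (18.2.26), that SST = SSR + SSE, and that R² = cos²α when α is taken as the angle between the centered observations and the centered fitted values.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 15
    x = rng.normal(size=n)
    y = 2.0 + 0.8 * x + rng.normal(scale=0.5, size=n)

    X = np.column_stack([np.ones(n), x])                       # [iota  x]
    alpha_hat, beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # OLS coefficients
    y_hat = X @ np.array([alpha_hat, beta_hat])                # fitted values

    # (18.2.26): the fitted line goes through the point (x_bar, y_bar).
    print(np.isclose(alpha_hat + beta_hat * x.mean(), y.mean()))

    sst = np.sum((y - y.mean()) ** 2)          # total sum of squares
    ssr = np.sum((y_hat - y.mean()) ** 2)      # "explained" sum of squares
    sse = np.sum((y - y_hat) ** 2)             # residual ("unexplained") sum of squares
    print(np.isclose(sst, ssr + sse))          # Analysis of Variance identity

    # R^2 = cos^2(alpha), alpha being the angle between centered y and centered fitted values.
    r2 = ssr / sst
    yc, yhc = y - y.mean(), y_hat - y.mean()
    cos_alpha = (yc @ yhc) / (np.linalg.norm(yc) * np.linalg.norm(yhc))
    print(np.isclose(r2, cos_alpha ** 2))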
[...] replacing β by β̂, and divide by 2, to get the “normal equation”

(18.2.3)   X⊤y = X⊤X β̂

Due to our assumption that all columns of X are linearly independent, X⊤X has an inverse and one can premultiply both sides of (18.2.3) by (X⊤X)⁻¹:

(18.2.4)   β̂ = (X⊤X)⁻¹X⊤y

If the columns of X are not linearly independent, then (18.2.3) has more than one solution, and [...]

[...] known, and that they restrict themselves to linear estimators, and that the criterion function is the MSE (the regression model minimaxes it, but the other model minimizes it, since there is no unknown parameter whose value one has to minimax over). But this I cannot say right now; for this we need the Gauss-Markov theorem. Also the Gauss-Markov theorem is valid in [...]
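As a quick illustration of (18.2.3) and (18.2.4) (a sketch with simulated data; the design and names are made up), β̂ computed by solving the normal equation coincides with the solution returned by a generic least-squares routine whenever X has full column rank.

    import numpy as np

    rng = np.random.default_rng(4)
    n, k = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # full column rank
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + rng.normal(scale=0.3, size=n)

    # Normal equation (18.2.3): X'y = X'X beta_hat, solved as in (18.2.4).
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Cross-check against a generic least-squares routine.
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)
    print(np.allclose(beta_hat, beta_lstsq))

If the columns of X were linearly dependent, X⊤X would be singular and np.linalg.solve would generally raise an error, mirroring the remark that (18.2.3) then has more than one solution.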
