Risk estimation in retrospective studies

WEI XING (B.Sc.(Hons), NUS)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF STATISTICS AND APPLIED PROBABILITY
NATIONAL UNIVERSITY OF SINGAPORE
2010

Contents

Abstract
List of Tables
List of Figures
1 Introduction
  1.1 Case-control study
  1.2 Background
  1.3 Odds ratio
  1.4 Relative risk
  1.5 Motivation & organization of thesis
2 Methods
  2.1 Point estimate
  2.2 Expectation
  2.3 Special case
  2.4 Testing equality of risks
  2.5 Cochran-Armitage test
3 Application
  3.1 Data description
  3.2 Bias
  3.3 Variance
  3.4 Risk estimation
  3.5 Tests of association
4 Discussion
Bibliography
Appendix

Abstract

The analysis of data arising from retrospective studies has traditionally relied on methods involving estimation of relative risks and/or odds ratios. Little attention has been given to estimation of risks, presumably owing to the constraints of such study designs, as well as the ease of modelling the odds ratio by logistic regression.
Here we present some results for the estimation of risks in a general 2 × k contingency table setting, for a dichotomous outcome variable and under a reasonable assumption about the disease prevalence. We also examine the properties of the proposed estimators, and apply them to data from a large-scale genome-wide association (GWA) study to demonstrate the relevance of the methods.

List of Tables

1.1 A random sample classified as a 2 × k contingency table
1.2 Association of factor X and disease in a population cross section
3.1 Mean and SD of estimated variance and covariance
3.2 Five number summary of distribution of risks
3.3 32 SNPs with strong association

List of Figures

1.1 Relationship between odds ratio and risks
3.1 Bias of risk estimates

Chapter 1 Introduction

1.1 Case-control study

The case-control study is a primary tool for the study of factors related to disease incidence and is widely used in clinical and epidemiological research. Such studies often utilize a retrospective design, in which the investigator looks backwards and examines exposures to suspected risk or protective factors in relation to an outcome that is established at the start of the study. Compared to a prospective or cohort study, which almost always involves following up the subjects over an extended period of time, a case-control study has the advantage of being able to yield results from presently collectible data, using a relatively small amount of resources. It also reduces the sample size required to capture a reasonable number of cases, especially when the disease under investigation is rare in the general population.
In case-control studies, direct estimation of (absolute) risk is usually not possible, as the numbers of cases and controls are determined without knowledge of how many cases and controls actually exist in the population of interest. On the other hand, one can estimate the odds ratio, a particularly useful measure of association due to its invariance under retrospective and prospective sampling schemes. An odds ratio of 1 is indicative of statistical independence between exposure and disease outcome. Moreover, it is well known that when disease incidence is low, the odds ratio closely approximates the relative risk.

1.2 Background

Consider the use of a baseline categorical variable $X$ with $k$ levels to predict a binary outcome $Y$. As an example, $X$ can be exposure to different levels of a risk factor and $Y$ the incidence of disease. A case-control study with a retrospective sampling scheme then collects data from samples of $n_{1\cdot}$ controls and $n_{2\cdot}$ cases drawn from the two underlying populations. Classification of these $n_{1\cdot} + n_{2\cdot} = n_{\cdot\cdot}$ subjects according to factor $X$ gives rise to a 2 × k contingency table, as shown below, with $n_{i\cdot}$, $n_{\cdot j}$ and $n_{\cdot\cdot}$ denoting the row, column and overall totals, respectively (Table 1.1).

$$
\begin{array}{l|cccc|c}
\text{Control } (Y = 0) & n_{11} & n_{12} & \cdots & n_{1k} & n_{1\cdot} \\
\text{Case } (Y = 1) & n_{21} & n_{22} & \cdots & n_{2k} & n_{2\cdot} \\
\hline
 & n_{\cdot 1} & n_{\cdot 2} & \cdots & n_{\cdot k} & n_{\cdot\cdot}
\end{array}
$$

Table 1.1: A random sample classified as a 2 × k contingency table

If the sample size $n_{\cdot\cdot}$ is small compared to the population, and if one further assumes that the sample is a simple random sample, the $k$ cell counts for row 1 and row 2 of Table 1.1 may be modelled as realizations of two independent multinomial random variables with total counts $n_{1\cdot}$ and $n_{2\cdot}$, and unknown cell probabilities $\pi_{ij}$, $i = 1, 2$ and $j = 1, \ldots, k$. Each multinomial probability $\pi_{ij}$ is in fact the proportion of subjects in subclass $j$ within the diseased ($i = 2$) or non-diseased ($i = 1$) populations.
In other words, if we could cross classify the total population from which the cases and controls of Table 1.1 were selected (Table 1.2), then $\pi_{ij} = N_{ij}/N_{i\cdot}$. In addition, the cell probabilities of each row must sum to 1, i.e., $\sum_j \pi_{ij} = 1$.

$$
\begin{array}{l|cccc|c}
\text{Control } (Y = 0) & N_{11} & N_{12} & \cdots & N_{1k} & N_{1\cdot} \\
\text{Case } (Y = 1) & N_{21} & N_{22} & \cdots & N_{2k} & N_{2\cdot} \\
\hline
 & N_{\cdot 1} & N_{\cdot 2} & \cdots & N_{\cdot k} & N_{\cdot\cdot}
\end{array}
$$

Table 1.2: Association of factor X and disease in a population cross section

1.3 Odds ratio

For any event $A$ with probability $p$ of occurring, the odds of its occurrence is

$$\text{odds}(A) = \frac{P(A)}{1 - P(A)} = \frac{p}{1 - p}.$$

Now suppose there are only two subclasses in a population, with factor $X = 1, 2$ respectively. Invariance of the odds ratio implies that

$$\text{odds ratio} = \theta = \frac{\pi_{21}/\pi_{22}}{\pi_{11}/\pi_{12}} = \frac{r_1/(1 - r_1)}{r_2/(1 - r_2)}, \tag{1.1}$$

where $r_i$ denotes the proportion of cases among subclass $i$ of the population, a quantity that we will later define as the risk for that subclass. The left hand side of the equation is the ratio of the odds of being in subclass 1 for cases over that for controls, while the right hand side is the ratio of the odds of being a case for subjects in subclass 1 over that in subclass 2. It is on this basis that we are able to estimate from retrospective studies the multinomial parameters $\pi_{ij}$ and hence the odds ratio. In this case, the sample odds ratio consistent with (1.1) is

$$\hat\theta = \frac{n_{21}/n_{22}}{n_{11}/n_{12}} = \frac{n_{21}\, n_{12}}{n_{11}\, n_{22}},$$

whose logarithm has an asymptotic normal distribution with mean $\log\theta$ and variance

$$\text{Var}(\log\hat\theta) \approx \sum_{(i,j)} \frac{1}{n_{ij}}.$$

1.4 Relative risk

As discussed in the previous section, it is natural to define the risk of developing a disease for an individual belonging to any subclass as the number of cases in that subclass over the total number of subjects in it. The relative risk for one subclass over another is then estimated by the ratio of the estimated absolute risks.
More specifically, if we denote by $P$ the proportion of the population falling into the diseased group (that is, the prevalence), by $\pi_{2j}$ the proportion of the diseased group falling into a subclass $j$, and by $\pi_{1j}$ the proportion of the non-diseased group falling into that subclass, then the risk of disease for members of that subclass is

$$\frac{\pi_{2j} P}{\pi_{2j} P + \pi_{1j}(1 - P)}, \tag{1.2}$$

and for non-members of the subclass

$$\frac{(1 - \pi_{2j}) P}{(1 - \pi_{2j}) P + (1 - \pi_{1j})(1 - P)}.$$

The relative risk is therefore

$$\frac{\pi_{2j}}{1 - \pi_{2j}} \cdot \frac{(1 - \pi_{2j}) P + (1 - \pi_{1j})(1 - P)}{\pi_{2j} P + \pi_{1j}(1 - P)}. \tag{1.3}$$

When $P$ is small and $1 - P$ is close to 1, (1.3) can be written as

$$\frac{\pi_{2j}}{1 - \pi_{2j}} \cdot \frac{1 - \pi_{1j}}{\pi_{1j}}. \tag{1.4}$$

This is exactly the ratio of the odds of being in the subclass for a case over that for a control. Hence the odds ratio can be used as an approximation to the relative risk when the disease is rare. Alternatively, this can be shown from (1.1) since

$$\text{odds ratio} = \text{relative risk} \times \left(\frac{1 - r_2}{1 - r_1}\right),$$

so these two quantities are similar when $(1 - r_2)/(1 - r_1) \approx 1$. Finally, in a retrospective study as described in Table 1.1, one can estimate the odds ratio (1.4) for subclass $X = j$ with the statistic

$$\frac{n_{2j}}{n_{2\cdot} - n_{2j}} \cdot \frac{n_{1\cdot} - n_{1j}}{n_{1j}}.$$

1.5 Motivation & organization of thesis

The relative ease of modelling, particularly through logistic regression [1], and the approximation to the relative risk have led to the widespread reporting of odds ratios in the medical literature. However, many researchers have reasoned that the use of the odds ratio in case-control studies is technically correct, but often misleading [2]. Indeed, using (1.1), one may plot the risk parameters $(r_1, r_2)$ that give rise to an odds ratio of 2, 5, 10 and 20; all pairs with the same odds ratio lie on the same curve (Figure 1.1). This shows that the same odds ratio can arise from very different risk parameters. In this thesis we argue that estimating risks has the advantage of better interpretability. In addition, we would also like to test the hypothesis that risks are not different among the sub-populations.
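The relationships (1.2) to (1.4) can be checked numerically. The following is a minimal Python sketch and not part of the original analysis: the function names and the illustrative values $\pi_{1j} = 0.2$, $\pi_{2j} = 0.3$ are assumptions chosen only for this example.

```python
def subclass_risk(pi1, pi2, prevalence):
    """Risk of disease for members of a subclass, equation (1.2).

    pi1, pi2: proportion of controls / cases falling in the subclass.
    """
    cases = pi2 * prevalence
    controls = pi1 * (1.0 - prevalence)
    return cases / (cases + controls)


def relative_risk(pi1, pi2, prevalence):
    """Risk for members over risk for non-members, equation (1.3)."""
    inside = subclass_risk(pi1, pi2, prevalence)
    # Non-members: replace each subclass proportion by its complement.
    outside = subclass_risk(1.0 - pi1, 1.0 - pi2, prevalence)
    return inside / outside


def exposure_odds_ratio(pi1, pi2):
    """Odds of subclass membership for a case over a control, equation (1.4)."""
    return (pi2 / (1.0 - pi2)) * ((1.0 - pi1) / pi1)


if __name__ == "__main__":
    pi1, pi2 = 0.2, 0.3  # hypothetical subclass proportions
    print("odds ratio:", exposure_odds_ratio(pi1, pi2))
    for prevalence in (0.001, 0.039, 0.3):
        print(prevalence, relative_risk(pi1, pi2, prevalence))
```

Under these assumed inputs the odds ratio does not depend on the prevalence, whereas the relative risk moves away from it as $P$ grows, which is the point made by (1.3) and (1.4).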
In doing so we need to make some assumptions, provide the likely size of error, and discuss the bias in these estimates.

[Figure 1.1: Relationship between odds ratio and risks. Curves of $r_2$ against $r_1$ for OR = 2, 5, 10 and 20.]

In Chapter 2 we present some theory to estimate disease risk in a general 2 × k contingency table setting. In Chapter 3 we apply the method to the Wellcome Trust Case Control Consortium data and evaluate the performance of the method. In Chapter 4 we provide a discussion and raise some other issues that may warrant future investigation.

Chapter 2 Methods

2.1 Point estimate

From Section 1.4 we have the risk for subclass $j$ defined as the ratio

$$r_j = \frac{N_{2j}}{N_{\cdot j}},$$

i.e. the ratio of the number of cases in the subclass over the total number of subjects in it (Table 1.2). If the disease prevalence $P$ is assumed to be known, and $N_{\cdot\cdot}$ denotes the population size, then the total number of cases in subclass $j$ of the population can be estimated by $P N_{\cdot\cdot}\, n_{2j}/n_{2\cdot}$, and the total number of controls by $(1 - P) N_{\cdot\cdot}\, n_{1j}/n_{1\cdot}$. Hence an estimator of risk for subclass $j$ is

$$\hat r_j = \frac{P\, n_{2j}/n_{2\cdot}}{P\, n_{2j}/n_{2\cdot} + (1 - P)\, n_{1j}/n_{1\cdot}} \tag{2.1}$$

except when $n_{2j} = n_{1j} = 0$. This is also an obvious estimate from (1.2), since $n_{1j}/n_{1\cdot}$ and $n_{2j}/n_{2\cdot}$ are unbiased estimators of $\pi_{1j}$ and $\pi_{2j}$. Indeed (2.1) is merely the sample analogue of Bayes' theorem, which relates the prospective disease probability (or risk) to the retrospective exposure probability:

$$r_j = P(Y = 1 \mid X = j) = \frac{P \cdot P(X = j \mid Y = 1)}{P \cdot P(X = j \mid Y = 1) + (1 - P) \cdot P(X = j \mid Y = 0)},$$

where we used $Y = 1, 0$ to denote a subject being a case and a control, respectively.

2.2 Expectation

Since $n_{1\cdot}$ and $n_{2\cdot}$ are fixed, and $P$ is a constant for a given population at the time of study, each $n_{ij}$ has a marginal binomial distribution with parameters $n_{i\cdot}$ and $\pi_{ij}$. We can then write the expectation of $\hat r_j$ as the weighted sum

$$E(\hat r_j) = \sum_{k=0}^{n_{1\cdot}} \sum_{l=0}^{n_{2\cdot}} \frac{P\, l/n_{2\cdot}}{P\, l/n_{2\cdot} + (1 - P)\, k/n_{1\cdot}} \binom{n_{1\cdot}}{k} \pi_{1j}^{k} (1 - \pi_{1j})^{n_{1\cdot} - k} \binom{n_{2\cdot}}{l} \pi_{2j}^{l} (1 - \pi_{2j})^{n_{2\cdot} - l}, \tag{2.2}$$

while the true value of the parameter $r_j$ is given by (1.2). The bias can be obtained by taking the difference between (2.2) and (1.2) and plotting it for all $\pi_{1j}$ and $\pi_{2j}$.

When both $n_{1j}$ and $n_{2j}$ are 0, the risk $r_j$ is not estimable. Hence in computing (2.2) we have to "ignore" the case of $k = l = 0$. In addition, when $n_{1j}$ or $n_{2j}$ is 0, the estimator (2.1) gives 1 or 0, respectively, and inclusion of these outcomes might substantially influence the bias of the estimator. Since the estimator will not be used in these situations, it might be better to consider instead the conditional expectation $E(\hat r_j \mid n_{1j}, n_{2j} > 0)$. It is linked to $E(\hat r_j)$ via the relationship

$$E(\hat r_j \mid n_{1j}, n_{2j} > 0) = \frac{E(\hat r_j) - S_1}{1 - S_1 - S_2},$$

where

$$S_1 = P(n_{1j} = 0,\, n_{2j} > 0) = (1 - \pi_{1j})^{n_{1\cdot}} \sum_{l=1}^{n_{2\cdot}} \binom{n_{2\cdot}}{l} \pi_{2j}^{l} (1 - \pi_{2j})^{n_{2\cdot} - l},$$

$$S_2 = P(n_{2j} = 0,\, n_{1j} > 0) = (1 - \pi_{2j})^{n_{2\cdot}} \sum_{k=1}^{n_{1\cdot}} \binom{n_{1\cdot}}{k} \pi_{1j}^{k} (1 - \pi_{1j})^{n_{1\cdot} - k}.$$

In practice, however, we find that the difference between $E(\hat r_j \mid n_{1j}, n_{2j} > 0)$ and $E(\hat r_j)$ is negligible. Thus only the unconditional expectation is used in the subsequent applications.

2.3 Special case

Wiggins and Slater [3] previously discussed a similar problem where it was assumed that all cases of the disease in the population were captured. A random sample was then selected from the control population. They proposed a numerical approximation to the expectation and variance of the risk estimates. More specifically, if the total number of cases in the population is known, i.e. $n_{2\cdot} = N_{2\cdot}$, then $N_{\cdot\cdot} = n_{2\cdot}/P$ and each $n_{2j}$ is no longer random. The estimator (2.1) can be written as

$$\hat r_j = \frac{n_{2j}}{n_{2j} + (1 - P)\, n_{1j}\, n_{2\cdot}/(P\, n_{1\cdot})} = \frac{P\, n_{1\cdot}\, n_{2j}}{P\, n_{1\cdot}\, n_{2j} + (1 - P)\, n_{2\cdot}\, n_{1j}}. \tag{2.3}$$

Notice that in the above expression $n_{1j}$ is the only random component. The expectation of $\hat r_j$ is in this case

$$E(\hat r_j) = \sum_{k=0}^{n_{1\cdot}} \frac{P\, n_{1\cdot}\, n_{2j}}{P\, n_{1\cdot}\, n_{2j} + (1 - P)\, n_{2\cdot}\, k} \binom{n_{1\cdot}}{k} \pi_{1j}^{k} (1 - \pi_{1j})^{n_{1\cdot} - k}.$$

The expectation can be approximated using a Taylor expansion argument. Letting $\xi = P/(1 - P)$, we find

$$E(\hat r_j) = (1 - \pi_{1j})^{n_{1\cdot}} + (1 - \pi_{1j})^{n_{1\cdot}}\, \frac{n_{1\cdot}}{n_{2\cdot}}\, n_{2j}\, \xi \sum_{k=1}^{n_{1\cdot}} \left(1 + \frac{n_{1\cdot}\, n_{2j}\, \xi}{n_{2\cdot}\, k}\right)^{-1} \frac{1}{k} \binom{n_{1\cdot}}{k} \left(\frac{\pi_{1j}}{1 - \pi_{1j}}\right)^{k} \tag{2.4}$$

(see the Appendix for a detailed derivation). It is also shown in the Appendix that

$$Y_1 = \sum_{k=1}^{n_{1\cdot}} \frac{1}{k} \binom{n_{1\cdot}}{k} \left(\frac{\pi_{1j}}{1 - \pi_{1j}}\right)^{k} = \sum_{k=1}^{n_{1\cdot}} \frac{1}{k (1 - \pi_{1j})^{k}} - \sum_{k=1}^{n_{1\cdot}} \frac{1}{k} \tag{2.5}$$

and

$$Y_2 = \sum_{k=1}^{n_{1\cdot}} \frac{1}{k^2} \binom{n_{1\cdot}}{k} \left(\frac{\pi_{1j}}{1 - \pi_{1j}}\right)^{k} = \sum_{k=1}^{n_{1\cdot}} \sum_{l=k}^{n_{1\cdot}} \frac{1}{k\, l\, (1 - \pi_{1j})^{k}} - \frac{1}{2} \left\{ \left( \sum_{n=1}^{n_{1\cdot}} \frac{1}{n} \right)^{2} + \sum_{n=1}^{n_{1\cdot}} \frac{1}{n^2} \right\}. \tag{2.6}$$

Using $(1 - x)^{-1} = \sum_{k=0}^{\infty} x^k$, one can expand $E(\hat r_j)$ in a power series and ignore powers of $\xi^3$ and higher to find

$$E(\hat r_j) \approx (1 - \pi_{1j})^{n_{1\cdot}} \left\{ 1 + \frac{n_{1\cdot}}{n_{2\cdot}}\, n_{2j}\, \xi \left( Y_1 - \frac{n_{1\cdot}}{n_{2\cdot}}\, n_{2j}\, \xi\, Y_2 \right) \right\}.$$

Similarly, one can expand $E(\hat r_j^2)$ in a power series and ignore powers of $\xi^3$ and higher to get

$$E(\hat r_j^2) \approx (1 - \pi_{1j})^{n_{1\cdot}} \left( 1 + \frac{n_{1\cdot}^2}{n_{2\cdot}^2}\, n_{2j}^2\, \xi^2\, Y_2 \right).$$

The variance can then be found by

$$\text{Var}(\hat r_j) = E(\hat r_j^2) - (E(\hat r_j))^2.$$

In the above derivation, the model is built upon the assumption that all cases in the population are captured, and only the number of controls in the sample is treated as random. While this assumption might hold for certain severe, rapid-onset diseases that are under extensive surveillance, it certainly does not apply to most common diseases that are of public health interest. Consequently it is necessary to allow for uncertainty in both cases and controls.

2.4 Testing equality of risks

It is often of interest to study whether a risk factor $X$ is associated with the disease outcome. The classic test of homogeneity between the two sets of multinomial probabilities in Table 1.1 is Pearson's $\chi^2$ test, with the test statistic

$$X^2 = \sum_{(i,j)} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$

where $O_{ij} = n_{ij}$ are the observed counts and $E_{ij} = n_{i\cdot}\, n_{\cdot j}/n_{\cdot\cdot}$ are the expected counts in the $(i, j)$ cell, under the null hypothesis that the multinomial probabilities of row 1 and row 2 are the same, i.e. $\pi_{1j} = \pi_{2j}$ for $j = 1, \ldots, k$, which is equivalent to $H_0: r_1 = \cdots = r_k$. The alternative hypothesis is $H_1: \pi_{1j} \neq \pi_{2j}$ for at least one $j \in \{1, \ldots, k\}$.
For large sample sizes, the approximate null distribution of this statistic is $\chi^2$ with $(2 - 1) \times (k - 1) = k - 1$ degrees of freedom [4].

An alternative is the likelihood ratio test, which is asymptotically equivalent to the $\chi^2$ test. The cell counts of the two rows in Table 1.1 follow approximately multinomial distributions with joint frequency functions

$$f(n_{11}, \ldots, n_{1k} \mid \pi_{11}, \ldots, \pi_{1k}) = \frac{n_{1\cdot}!}{\prod_{j=1}^{k} n_{1j}!} \prod_{j=1}^{k} \pi_{1j}^{n_{1j}}$$

and

$$f(n_{21}, \ldots, n_{2k} \mid \pi_{21}, \ldots, \pi_{2k}) = \frac{n_{2\cdot}!}{\prod_{j=1}^{k} n_{2j}!} \prod_{j=1}^{k} \pi_{2j}^{n_{2j}}.$$

The two multinomial distributions are independent, and the log likelihood function is therefore

$$l(\pi_{11}, \ldots, \pi_{2k}) = \sum_{j=1}^{k} n_{1j} \log(\pi_{1j}) + \sum_{j=1}^{k} n_{2j} \log(\pi_{2j}) + C,$$

subject to the constraint that $\sum_j \pi_{ij} = 1$ for $i = 1, 2$. Alternatively, one can express $l(\pi_{11}, \ldots, \pi_{2k})$ in terms of the 'free' parameters, i.e.

$$l(\pi_{11}, \ldots, \pi_{1(k-1)}, \pi_{21}, \ldots, \pi_{2(k-1)}) = \sum_{j=1}^{k-1} n_{1j} \log(\pi_{1j}) + n_{1k} \log\Big(1 - \sum_{j=1}^{k-1} \pi_{1j}\Big) + \sum_{j=1}^{k-1} n_{2j} \log(\pi_{2j}) + n_{2k} \log\Big(1 - \sum_{j=1}^{k-1} \pi_{2j}\Big) + C. \tag{2.7}$$

The null hypothesis $H_0: r_1 = \cdots = r_k$ is equivalent to the multinomial probabilities of row 1 and row 2 of Table 1.1 being equal. Under $H_0$, (2.7) is maximized at

$$\hat\pi_{1j} = \hat\pi_{2j} = \frac{n_{1j} + n_{2j}}{n_{\cdot\cdot}} = \frac{n_{\cdot j}}{n_{\cdot\cdot}},$$

or equivalently, $\hat r_1 = \cdots = \hat r_k = P$. Under the alternative hypothesis, there is no restriction on the $\pi_{ij}$'s, so (2.7) is maximized at the MLEs $\hat\pi_{ij} = n_{ij}/n_{i\cdot}$. The test statistic is therefore

$$\Delta = -2(l_0 - l_1),$$

where $l_0$ and $l_1$ denote the maximized log likelihood under the parameter space specified by the null and alternative hypotheses, respectively. The degrees of freedom are calculated as follows: under $H_1$, there are $2k - 2$ free parameters due to the two constraints of summing to 1. Under $H_0$, the two rows have the same multinomial probabilities, so there are $k - 1$ free parameters. The test statistic therefore follows an asymptotic $\chi^2$ distribution with $(2k - 2) - (k - 1) = k - 1$ degrees of freedom.
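The two homogeneity tests can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch under stated assumptions, not the code used in the thesis: the function names are hypothetical, every column is assumed to contain at least one subject, and the tail probability is computed with a textbook incomplete-gamma recurrence that is valid for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(chi-square with df degrees of freedom > x), integer df >= 1.

    Uses Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1),
    starting from df = 1 (erfc) or df = 2 (exponential).
    """
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    while 2 * a < df:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def pearson_and_lrt(controls, cases):
    """Return (X2, Delta, df) for a 2 x k table given as two lists of counts."""
    k = len(controls)
    n1, n2 = sum(controls), sum(cases)
    n = n1 + n2
    x2 = 0.0
    delta = 0.0
    for row, row_total in ((controls, n1), (cases, n2)):
        for j in range(k):
            expected = row_total * (controls[j] + cases[j]) / n  # E_ij
            observed = row[j]                                    # O_ij
            x2 += (observed - expected) ** 2 / expected
            if observed > 0:  # 0 * log(0) is taken as 0
                delta += 2.0 * observed * math.log(observed / expected)
    return x2, delta, k - 1


if __name__ == "__main__":
    # Hypothetical 2 x 3 table: controls first, then cases.
    x2, delta, df = pearson_and_lrt([10, 20, 30], [20, 20, 20])
    print(x2, chi2_sf(x2, df))
    print(delta, chi2_sf(delta, df))
```

The likelihood ratio statistic is written here as $2 \sum O_{ij} \log(O_{ij}/E_{ij})$, which is what $-2(l_0 - l_1)$ reduces to once the maximizers under the two hypotheses are substituted into the log likelihood.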
It should be made clear that although the prevalence $P$ is required for estimation of risks, it is not needed in the tests. Reparametrization from $\pi$ to $r$ does not change the maximized likelihood or the likelihood ratio test statistic.

2.5 Cochran-Armitage test

We also consider a third test, the Cochran-Armitage test [5, 6], which utilizes ordered categories in a contingency table and tests for a linear trend based on a score $x_i$ assigned to each column of Table 1.1. It uses a linear probability model

$$p_i = \alpha + \beta x_i,$$

where $p_i$ is the expected proportion of cases in the $i$th subclass in the sample. Note that, as a result of retrospective sampling, the odds $p_i/(1 - p_i)$ differ from the odds $r_i/(1 - r_i)$ by a factor that is constant across all the subclasses. Hence the null hypothesis of independence $H_0: \beta = 0$ is equivalent to $H_0: r_1 = \cdots = r_k$.

The test may be rationalized with an analysis of variance approach. More specifically, Pearson's $\chi^2$ test statistic (written here for $k = 3$) can be rewritten as a weighted sum of squares, i.e.

$$X^2 = \sum_{i=1}^{3} \frac{(n_{\cdot i}\, \hat p_i - n_{\cdot i}\, \hat p)^2}{n_{\cdot i}\, \hat p} + \sum_{i=1}^{3} \frac{(n_{\cdot i}(1 - \hat p_i) - n_{\cdot i}(1 - \hat p))^2}{n_{\cdot i}(1 - \hat p)} = \frac{\sum_{i=1}^{3} n_{\cdot i}\, (\hat p_i - \hat p)^2}{\hat p (1 - \hat p)},$$

with the weights being $n_{\cdot i}/\{\hat p(1 - \hat p)\}$. Here $\hat p_i$ is the observed proportion of cases in the $i$th subclass and $\hat p$ is the overall proportion of cases in the sample. The regression parameter $\beta$ can then be obtained by the standard formula for weighted regression

$$\hat\beta = \frac{\sum_i n_{\cdot i}\, (\hat p_i - \hat p)(x_i - \bar x)}{\sum_i n_{\cdot i}\, (x_i - \bar x)^2},$$

where $\bar x$ is the weighted average of the $x_i$, i.e.

$$\bar x = \frac{\sum_i x_i\, n_{\cdot i}}{\sum_i n_{\cdot i}}.$$

It is then possible to subdivide this weighted sum of squares, by the rules of analysis of variance, into the sum of squares explained by the regression,

$$SS_R = \frac{\hat\beta^2 \sum_i n_{\cdot i}\, (x_i - \bar x)^2}{\hat p (1 - \hat p)} = \frac{\left[\sum_i (x_i - \bar x)\, n_{2i}\right]^2}{\hat p (1 - \hat p) \sum_i n_{\cdot i}\, (x_i - \bar x)^2},$$

and the "residual" sum of squares

$$SS_L = \frac{\sum_i n_{\cdot i}\, (\hat{\hat p}_i - \hat p_i)^2}{\hat p (1 - \hat p)},$$

where $\hat{\hat p}_i = \hat\alpha + \hat\beta x_i$ is the fitted value for $p_i$. The Cochran-Armitage test statistic is simply $SS_R$, and it tests $H_0: \beta = 0$.
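As an illustration of the trend statistic $SS_R$, a minimal Python sketch follows. It is an assumption-laden example rather than the thesis's own code: the function name is hypothetical, the counts in the demonstration are made up, and the P-value uses the fact that the upper tail of a $\chi^2$ variable with 1 degree of freedom equals `erfc(sqrt(x / 2))`.

```python
import math


def cochran_armitage(controls, cases, scores):
    """Return (SS_R, P-value) for a 2 x k table and one score per column."""
    col_totals = [a + b for a, b in zip(controls, cases)]  # n_.i
    n = sum(col_totals)
    p_hat = sum(cases) / n                                  # overall case proportion
    x_bar = sum(x * m for x, m in zip(scores, col_totals)) / n
    # Numerator: [sum_i (x_i - x_bar) * n_2i]^2
    numerator = sum((x - x_bar) * c for x, c in zip(scores, cases)) ** 2
    # Denominator: p_hat (1 - p_hat) * sum_i n_.i (x_i - x_bar)^2
    spread = sum(m * (x - x_bar) ** 2 for x, m in zip(scores, col_totals))
    ss_r = numerator / (p_hat * (1.0 - p_hat) * spread)
    p_value = math.erfc(math.sqrt(ss_r / 2.0))  # chi-square tail, 1 df
    return ss_r, p_value


if __name__ == "__main__":
    # Hypothetical genotype counts (AA, AB, BB) with additive scores 0, 1, 2.
    print(cochran_armitage([10, 20, 30], [20, 20, 20], [0, 1, 2]))
```

Because $SS_R$ is one component of the decomposition of $X^2$, it can never exceed Pearson's statistic computed on the same table.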
When the linear probability model holds, $SS_R$ is asymptotically $\chi^2$ distributed with 1 degree of freedom.

Chapter 3 Application

We now apply the methods in Chapter 2 to data from the main experiment of the Wellcome Trust Case Control Consortium (WTCCC): a genome-wide association study involving 2,000 DNA samples from each of seven diseases (type 1 diabetes, type 2 diabetes, coronary heart disease, hypertension, bipolar disorder, rheumatoid arthritis and Crohn's disease) [7]. As is typical for GWA studies, a dense set of single-nucleotide polymorphisms (SNPs) across the genome is genotyped, and statistical analysis is performed to survey the most common genetic variation that is associated with risk factors for disease. An SNP is a DNA sequence variation occurring when a single nucleotide (A, T, C, or G) in the genome or other shared sequence differs between members of a species. The associated SNP markers are then considered as pointers to the region of the human genome where the disease-causing gene is likely to reside.

3.1 Data description

We choose to analyze the data from type 2 diabetes, partly because of the availability of some reasonably reliable prevalence data [8]. In this study, a total of 2938 controls and 1924 cases are genotyped, after excluding samples with contamination, false identity, non-Caucasian ancestry and relatedness. The case samples are ascertained from sites widely distributed across Great Britain, and the controls come from two sources: about 1500 are representative samples from the 1958 British Birth Cohort and another 1500 are blood donors recruited by the three national UK Blood Services.

Summary genotype statistics were retrieved from the European Genotype Archive (http://www.ebi.ac.uk/ega/page.php) and they contained data on over five hundred thousand SNPs distributed over 23 chromosome pairs of the approximately 5000 participants in the study. All SNPs are biallelic, i.e. there are only 2 alleles that usually differ by a single nucleotide, giving rise to 3 possible genotypes. The controls and cases are classified according to their genotypes at the SNP site, making this an application of the theory with $k = 3$.

3.2 Bias

We first evaluate the bias of risk estimates corresponding to different parameter values $\pi_{1j}$ and $\pi_{2j}$. Since $r_j$ is undefined when $\pi_{1j} = \pi_{2j} = 0$, we assume hereafter that at least one of $\pi_{1j}$ and $\pi_{2j}$ is not 0. Comparing the expressions (2.2) and (1.2), it is easy to see that the bias of $\hat r_j$ is 0 when either $\pi_{1j}$ or $\pi_{2j}$ is 0, and when both $\pi_{1j}$ and $\pi_{2j}$ are 1. Taking $n_{1\cdot} = 3000$, $n_{2\cdot} = 2000$, and prevalence $P = 0.039$ for type 2 diabetes, the (unconditional) biases, i.e. the difference between (2.2) and (1.2), for selected values of $\pi_{1j}$ and $\pi_{2j}$ are plotted in Figure 3.1. In general the bias decreases with $\pi_{1j}$, and increases with $\pi_{2j}$. Nonetheless, in most cases of the WTCCC study, the biases are small and negligible.

[Figure 3.1: Bias of risk estimates. Six panels: bias against $\pi_{2j}$ for fixed $\pi_{1j}$ = 0.1, 0.5 and 0.9, and bias against $\pi_{1j}$ for fixed $\pi_{2j}$ = 0.1, 0.5 and 0.9.]

3.3 Variance

Since the exact form of the standard errors (SE) of the risk estimates is difficult to compute, they are estimated by simulation: first fix some selected multinomial probabilities $\pi_{1j}$ and $\pi_{2j}$, $j = 1, \ldots, 3$, then generate 5000 2 × 3 tables using these parameters. Monte Carlo simulation is then performed with $n = 10^6$ repeats, using the multinomial parameters estimated from each of these 5000 tables. The variance-covariance matrix can then be obtained from the $10^6$ risk estimates.
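The estimator (2.1) and a simulated variance can be sketched as follows. This is a simplified, single-level illustration and not the two-stage procedure described above: it draws tables directly from assumed true parameters, uses far fewer repeats than $10^6$, and all function names, the seed and the example parameters are hypothetical.

```python
import random
from collections import Counter


def risk_estimates(controls, cases, prevalence):
    """Estimator (2.1) for every column of a 2 x k table of counts."""
    n1, n2 = sum(controls), sum(cases)
    risks = []
    for n1j, n2j in zip(controls, cases):
        num = prevalence * n2j / n2
        den = num + (1.0 - prevalence) * n1j / n1
        # The risk is not estimable when both counts are zero.
        risks.append(num / den if den > 0 else float("nan"))
    return risks


def draw_counts(rng, total, probs):
    """One multinomial draw of `total` subjects over len(probs) categories."""
    draws = rng.choices(range(len(probs)), weights=probs, k=total)
    tally = Counter(draws)
    return [tally.get(j, 0) for j in range(len(probs))]


def simulate_first_risk(pi1, pi2, n1, n2, prevalence, reps, seed=0):
    """Monte Carlo mean and variance of the estimated risk for column 1."""
    rng = random.Random(seed)
    samples = []
    for _ in range(reps):
        controls = draw_counts(rng, n1, pi1)
        cases = draw_counts(rng, n2, pi2)
        samples.append(risk_estimates(controls, cases, prevalence)[0])
    mean = sum(samples) / reps
    var = sum((s - mean) ** 2 for s in samples) / (reps - 1)
    return mean, var


if __name__ == "__main__":
    probs = (0.3, 0.3, 0.4)  # same genotype frequencies in cases and controls
    print(simulate_first_risk(probs, probs, 3000, 2000, 0.039, reps=200))
```

When cases and controls share the same genotype frequencies, every estimated risk should scatter around the assumed prevalence, which gives a quick sanity check on the simulation.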
Table 3.1 gives the mean and standard deviation of a diagonal element ($\text{Var}(\hat r_1)$) as well as an off-diagonal element ($\text{Cov}(\hat r_1, \hat r_3)$) of the 5000 variance-covariance matrices. It is clear that while the Monte Carlo approach works well in most cases, the estimated SE of the risk estimates can be very poor at extreme parameter values.

Parameter             Quantity            Mean            SD             MC
(0.3, 0.3, 0.4)       Var(r̂1)             2.75 × 10^-6    2.18 × 10^-7    2.74 × 10^-6
                      Cov(r̂1, r̂3)        −1.17 × 10^-6    5.48 × 10^-8   −1.17 × 10^-6
(0.1, 0.1, 0.8)       Var(r̂1)             1.08 × 10^-5    1.66 × 10^-6    1.06 × 10^-5
                      Cov(r̂1, r̂3)        −1.18 × 10^-6    9.09 × 10^-8   −1.17 × 10^-6
(0.01, 0.01, 0.98)    Var(r̂1)             1.55 × 10^-4    9.66 × 10^-5    1.32 × 10^-4
                      Cov(r̂1, r̂3)        −1.25 × 10^-6    3.47 × 10^-7   −1.22 × 10^-6
(0.005, 0.005, 0.99)  Var(r̂1)             5.12 × 10^-4    1.03 × 10^-3    3.11 × 10^-4
                      Cov(r̂1, r̂3)        −1.35 × 10^-6    5.59 × 10^-7   −1.28 × 10^-6

Table 3.1: Mean and SD of 5000 estimated variances $\text{Var}(\hat r_1)$ and covariances $\text{Cov}(\hat r_1, \hat r_3)$ generated using estimated parameter values. MC is the simulated variance and covariance using the actual parameters.

3.4 Risk estimation

Data from a total of 420,172 SNPs distributed over 22 chromosome pairs are analyzed. For each SNP, we denote by $r_{max}$ and $r_{min}$ the maximum and minimum risks among the three genotypes, respectively. Table 3.2 provides the five number summary as well as the mean of the distribution of $r_{max}$ and $r_{min}$ over all the SNPs.

        Min       1st Quartile   Median   Mean     3rd Quartile   Max
rmax    0.0390    0.0399         0.0408   0.0436   0.0428         0.526
rmin    0.00237   0.0356         0.0373   0.360    0.0382         0.0390

Table 3.2: Five number summary of distribution of risks

3.5 Tests of association

We applied the three tests described in Sections 2.4 and 2.5, i.e. the likelihood ratio, $\chi^2$ and Cochran-Armitage tests, the last with scores {0, 1, 2}, to the type 2 diabetes data. The rationale for choosing these scores is the assumption that any effect of the disease-causing allele, if present, will be proportional to the number of such alleles an individual carries.
A total of 32 SNPs with at least one of the P-values below $5 \times 10^{-7}$ were detected. Following the WTCCC study, this is the threshold below which a strong association between a SNP and the underlying disease is declared. Table 3.3 shows the P-values as well as the estimated risks for these SNPs. Of these 32 SNPs, 16 were found to have poor clustering and were removed from the analysis. The remaining 16 SNPs form 4 clusters. Three of them, located on chromosomes 6, 10 and 16, coincide with the WTCCC study results [7], but a fourth SNP on chromosome 3 (rs2314349) has not been reported. The reason for the discrepancy is not clear. It is possible that the SNP did not pass the quality control filters in the WTCCC study.

Of the 3 SNP clusters that exhibit a strong association signal, the one located on chromosome 10 and represented by rs4506565 has the lowest P-values. This SNP is within the previously reported transcription factor 7-like 2 gene (TCF7L2) [9]. The association is possibly explained by its action through regulation of pro-glucagon gene expression in enteroendocrine cells via the Wnt signaling pathway [10].

The second signal is from the FTO, or fat-mass and obesity associated, gene on chromosome 16 (rs9939609). It was previously hypothesized that the effect of this variant on type 2 diabetes risk is mediated entirely by its effect on adiposity [11].

The third association signal, on chromosome 6, features a cluster of highly associated SNPs (including rs9465871) that map to intron 5 of the CDK5 regulatory subunit associated protein 1-like 1 (CDKAL1) gene. The exact function of CDKAL1 is not clearly understood, but it shares homology at the protein domain level with CDK5 regulatory subunit associated protein 1 (CDK5RAP1). CDK5RAP1 is known to inhibit the activation of CDK5, a cyclin-dependent kinase which is involved in cell proliferation and maintenance of normal beta-cell function [12].
24 Chr SNP Trend P -value χ2 P -value rÂA rÂB rˆBB Cluster 3 rs2314349 0.003501860 4.669716 × 10−7 0.036 0.049 0.022 1 6 rs9465871 1.019692 × 10−6 3.341986 × 10−7 −8 0.075 0.042 0.036 1 −7 10 rs7901275 3.085252 × 10 2.166514 × 10 0.032 0.039 0.049 1 10 rs4074720 5.216261 × 10−11 2.794187 × 10−10 0.052 0.038 0.031 1 10 rs7901695 −13 −12 0.058 0.042 0.032 1 10 rs4506565 5.706546 × 10−13 5.048329 × 10−12 0.058 0.042 0.031 1 10 rs4132670 1.883049 × 10−12 1.644901 × 10−11 0.057 0.042 0.031 1 10 rs10787472 2.009426 × 10−11 1.226463 × 10−10 0.052 0.038 0.030 1 10 rs12243326 1.791796 × 10−10 1.017059 × 10−9 0.033 0.042 0.059 1 10 rs7077039 2.846168 × 10−12 1.331970 × 10−11 0.053 0.038 0.030 1 10 rs11196205 7.015932 × 10−11 2.652307 × 10−10 0.052 0.038 0.031 1 −10 0.031 0.038 0.052 1 6.742384 × 10 −11 5.616127 × 10 10 rs10885409 8.47945 × 10 3.482646 × 10 10 rs11196208 5.532075 × 10−11 3.058625 × 10−10 0.031 0.038 0.052 1 −8 −8 0.031 0.042 0.048 1 0.048 0.042 0.031 1 0.031 0.042 0.048 1 0.0067 0.042 0.039 0 3.568200 × 10 0.18 0.033 0.040 0 2.322810 × 10−35 0.042 0.018 0.047 0 −22 0.070 0.030 0.039 0 16 rs7193144 1.446072 × 10 4.775441 × 10 16 rs8050136 2.007310 × 10−8 7.038900 × 10−8 16 rs9939609 1 rs13373826 −8 5.261009 × 10 0.1316872 1.282534 × 10−7 −8 2 rs4080478 0.9508418 4 rs16837871 < 2.2 × 10−16 4 rs13126272 5 rs4270702 −5 1.432675 × 10 0.0026 0.045 0.038 0 −17 0.040 0.031 0.083 0 1.837621 × 10−9 0.045 0.030 0.037 0 −23 0.044 0.019 0.030 0 1.616481 × 10−6 1.694865 × 10−7 0.035 0.051 0.035 0 −6 −10 0.038 0.032 0.048 0 3.406096 × 10 rs2048646 0.04385808 6 rs1324132 6.775095 × 10−7 rs10499044 7 rs1525791 3.499818 × 10 9.790488 × 10−8 0.8052082 5 6 1.907746 × 10 −7 −16 < 2.2 × 10 5.720925 × 10 9 rs488101 1.158367 × 10 1.039887 × 10 14 rs1957779 2.057396 × 10−5 1.777778 × 10−8 0.028 0.044 0.041 0 14 rs1362719 −5 3.089134 × 10 −10 0.038 0.032 0.050 0 15 rs597414 0.01221369 2.115368 × 10−8 0.044 0.030 0.047 0 −20 0.050 0.027 0.046 0 2.270588 × 10−36 0.040 0.027 0.075 0 −93 
0.032 0.0024 0.049 0 16 rs9889057 0.0001041203 21 rs226261 4.280764 × 10−7 22 rs11705626 −16 < 2.2 × 10 6.708994 × 10 2.369359 × 10 1.399167 × 10

Table 3.3: 32 SNPs with strong association. Chr: Chromosome. Cluster: a binary variable indicating whether there was good (1) or poor (0) clustering.

Chapter 4 Discussion

Following the great triumph of the case-control study that demonstrated the link between tobacco smoking and lung cancer [13], this study design has gained wide recognition. However, its retrospective, non-randomized nature limits the conclusions that can be drawn from it. In this thesis we outlined the general methodology for obtaining risk estimates. We also discussed their statistical properties, such as bias and standard error. We concluded that when the cell counts are reasonably large, the bias of the estimates is small and the standard error can be estimated.

Compared to the conventional tests of association and trend, which only indicate a significant departure from the null hypothesis, being able to estimate the risk allows one to decide the direction of association. We consider this an added advantage and argue that it is better to deal with risks, because they are more interpretable. Ease of modelling should not be a general justification for using odds, although it can be in some cases.

The assumption about prevalence is key to the estimation of risks. Deviations of the true prevalence from the assumed value will not affect the statistical significance of the tests, but they will influence the estimates and thus the (perceived) effect size, i.e. the practical significance. This judgment of practical importance is beyond statistical analysis, but we are well-advised to be aware of its place in scientific research. Our methods apply to general case-control studies, and to GWA studies in particular.
We are aware that GWA studies, or genetic linkage and association studies in general, involve more than investigating individual SNPs in isolation, and that the GWA approach has intrinsic problems such as multiple testing that can result in an unprecedented potential for false-positive results [14]. Nonetheless, we think that incorporation of risk assessment into genetic analysis might present an alternative way of dealing with data generated from such studies.

Finally, it is worth noting that while prospective studies can estimate the relative risk with respect to becoming sick (incidence), retrospective studies can estimate only the relative risk with respect to being sick (prevalence). The practical importance of this distinction may be enormous. For a disease that is usually associated with increased mortality risk, such as myocardial infarction, retrospective studies that compare existing cases with controls on one or more risk factors cannot in principle relate to all cases occurring during the past n years, because some of these incident cases may no longer be alive. Hence in such situations the (relative) risks derived from retrospective studies may be quite different from those obtained in prospective studies.

Bibliography

[1] R. L. Prentice and R. Pyke. Logistic disease incidence models and case-control studies. Biometrika, 66:403–411, 1979.
[2] N. Pearce. What does the odds ratio estimate in a case-control study? International Journal of Epidemiology, 22:1189–1192, 1993.
[3] R. D. Wiggins and M. Slater. Estimating probabilities from retrospective data with an application to cot death in Lambeth. Biometrics, 37:377–382, 1981.
[4] K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50:157–175, 1900.
[5] W. G. Cochran. Some methods for strengthening the common χ2 tests.
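Biometrics, 10:417–451, 1954.
[6] P. Armitage. Tests for linear trends in proportions and frequencies. Biometrics, 11:375–386, 1955.
[7] The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661–678, 2007.
[8] E. L. Massó González, S. Johansson, M.-A. Wallander, and L. A. García Rodríguez. Trends in the prevalence and incidence of diabetes in the UK: 1996–2005. Journal of Epidemiology and Community Health, 63:332–336, 2009.
[9] Struan F. A. Grant et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nature Genetics, 38:320–323, 2006.
[10] Fenghua Yi, Patricia L. Brubaker, and Tianru Jin. TCF-4 mediates cell type-specific regulation of proglucagon gene expression by β-catenin and glycogen synthase kinase-3β. Journal of Biological Chemistry, 280:1457–1464, 2005.
[11] T. M. Frayling et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science, 316:889–894, 2007.
[12] M. Ubeda, J. M. Rukstalis, and J. F. Habener. Inhibition of cyclin-dependent kinase 5 activity protects pancreatic beta cells from glucotoxicity. Journal of Biological Chemistry, 281(39):28858–28864, 2006.
[13] R. Doll and A. Hill. Smoking and carcinoma of the lung; preliminary report. British Medical Journal, 2:739–748, 1950.
[14] T. A. Pearson and T. A. Manolio. How to interpret a genome-wide association study. JAMA, 299:1335–1344, 2008.

Appendix

Proof of (2.4). From (2.3) and $\xi = P/(1-P)$ we have that

\[
\begin{aligned}
\hat{r}_j &= \frac{P\, n_{1.} n_{2j}}{P\, n_{1.} n_{2j} + (1-P)\, n_{2.} n_{1j}} \\
&= \frac{\xi\, n_{1.} n_{2j}}{\xi\, n_{1.} n_{2j} + n_{2.} n_{1j}} \\
&= \frac{n_{1.} n_{2j}\,\xi}{n_{2.}} \cdot \frac{n_{2.} n_{1j}}{n_{2.} n_{1j} + n_{1.} n_{2j}\,\xi} \cdot \frac{1}{n_{1j}} \\
&= \frac{n_{1.} n_{2j}\,\xi}{n_{2.}} \left( \frac{n_{2.} n_{1j} + n_{1.} n_{2j}\,\xi}{n_{2.} n_{1j}} \right)^{-1} \frac{1}{n_{1j}} .
\end{aligned}
\]

Thus

\[
\begin{aligned}
E(\hat{r}_j) &= (1-\pi_{1j})^{n_{1.}} + \sum_{k=1}^{n_{1.}} \frac{P\, n_{1.} n_{2j}}{P\, n_{1.} n_{2j} + (1-P)\, n_{2.} k} \binom{n_{1.}}{k} \pi_{1j}^{k} (1-\pi_{1j})^{n_{1.}-k} \\
&= (1-\pi_{1j})^{n_{1.}} + (1-\pi_{1j})^{n_{1.}}\, \frac{n_{1.}}{n_{2.}}\, n_{2j}\,\xi \sum_{k=1}^{n_{1.}} \left( 1 + \frac{n_{1.} n_{2j}}{n_{2.} k}\,\xi \right)^{-1} \frac{1}{k} \binom{n_{1.}}{k} \left( \frac{\pi_{1j}}{1-\pi_{1j}} \right)^{k} .
\end{aligned}
\]

Proof of (2.5). Dropping the subscript of $\pi_{1j}$ and writing $n_{1.} = N$, we can prove (2.5) by induction. When $N = 1$, (2.5) clearly holds. Suppose it holds when $N = n-1$, that is,

\[
\sum_{k=1}^{n-1} \frac{1}{k} \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k} = \sum_{k=1}^{n-1} \frac{1}{k(1-\pi)^{k}} - \sum_{k=1}^{n-1} \frac{1}{k} .
\]

Noting that $\binom{n}{k} = \frac{n}{n-k}\binom{n-1}{k}$, we have when $N = n$,

\[
\begin{aligned}
\sum_{k=1}^{n} \frac{1}{k} \binom{n}{k} \left( \frac{\pi}{1-\pi} \right)^{k}
&= \sum_{k=1}^{n-1} \frac{1}{k} \binom{n}{k} \left( \frac{\pi}{1-\pi} \right)^{k} + \frac{1}{n} \left( \frac{\pi}{1-\pi} \right)^{n} \\
&= \sum_{k=1}^{n-1} \frac{n}{k(n-k)} \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k} + \frac{1}{n} \left( \frac{\pi}{1-\pi} \right)^{n} \\
&= \sum_{k=1}^{n-1} \left( \frac{1}{k} + \frac{1}{n-k} \right) \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k} + \frac{1}{n} \left( \frac{\pi}{1-\pi} \right)^{n} \\
&= \sum_{k=1}^{n-1} \frac{1}{k(1-\pi)^{k}} - \sum_{k=1}^{n-1} \frac{1}{k} + \sum_{k=1}^{n-1} \frac{1}{n-k} \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k} + \frac{1}{n} \left( \frac{\pi}{1-\pi} \right)^{n} .
\end{aligned}
\tag{4.1}
\]

At the same time,

\[
\begin{aligned}
\left( 1 + \frac{\pi}{1-\pi} \right)^{n}
&= \sum_{k=0}^{n} \binom{n}{k} \left( \frac{\pi}{1-\pi} \right)^{k} \\
&= 1 + \sum_{k=1}^{n} \binom{n}{k} \left( \frac{\pi}{1-\pi} \right)^{k} \\
&= 1 + \sum_{k=1}^{n-1} \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k} \frac{n}{n-k} + \left( \frac{\pi}{1-\pi} \right)^{n} \\
&= \frac{1}{(1-\pi)^{n}} .
\end{aligned}
\]

Hence

\[
\sum_{k=1}^{n-1} \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k} \frac{1}{n-k} = \frac{1-\pi^{n}}{n(1-\pi)^{n}} - \frac{1}{n} .
\]

Substituting this into (4.1) gives the desired result.

Proof of (2.6). The proof for $Y_2$ is similar. First observe that the equality holds when $N = 1$. Next suppose it holds when $N = n-1$; then when $N = n$,

\[
\begin{aligned}
\sum_{k=1}^{n} \frac{1}{k^{2}} \binom{n}{k} \left( \frac{\pi}{1-\pi} \right)^{k}
&= \sum_{k=1}^{n-1} \frac{n}{(n-k)k^{2}} \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k} + \frac{\pi^{n}}{n^{2}(1-\pi)^{n}} \\
&= \sum_{k=1}^{n-1} \left[ \frac{1}{k^{2}} + \frac{1}{k(n-k)} \right] \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k} + \frac{\pi^{n}}{n^{2}(1-\pi)^{n}} \\
&= \sum_{k=1}^{n-1} \sum_{l=k}^{n-1} \frac{1}{kl(1-\pi)^{k}} - \frac{1}{2} \left\{ \left( \sum_{k=1}^{n-1} \frac{1}{k} \right)^{2} + \sum_{k=1}^{n-1} \frac{1}{k^{2}} \right\} \\
&\quad + \sum_{k=1}^{n-1} \frac{1}{k(n-k)} \binom{n-1}{k} \frac{\pi^{k}}{(1-\pi)^{k}} + \frac{\pi^{n}}{n^{2}(1-\pi)^{n}} .
\end{aligned}
\]

The last two terms can be written as

\[
\begin{aligned}
\sum_{k=1}^{n-1} \frac{1}{k(n-k)} \binom{n-1}{k} \frac{\pi^{k}}{(1-\pi)^{k}} + \frac{\pi^{n}}{n^{2}(1-\pi)^{n}}
&= \sum_{k=1}^{n} \frac{1}{nk} \binom{n}{k} \frac{\pi^{k}}{(1-\pi)^{k}} \\
&= \frac{1}{n} \left( \sum_{k=1}^{n} \frac{1}{k(1-\pi)^{k}} - \sum_{k=1}^{n} \frac{1}{k} \right),
\end{aligned}
\]

where we have made use of the result for $Y_1$. Substituting this into the above yields the desired equality.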
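As a sanity check on the two induction arguments, the identities can be verified numerically for small N. The following is a minimal Python sketch using exact rational arithmetic. It assumes that (2.5) and (2.6) have the closed forms used in the proofs above, namely sum_{k=1}^{N} (1/k) C(N,k) (π/(1−π))^k = sum_{k=1}^{N} 1/(k(1−π)^k) − sum_{k=1}^{N} 1/k, and the analogous expression with 1/k² whose right-hand side is a double sum minus half of the squared harmonic sum plus the sum of 1/k². The function names are illustrative and do not come from the thesis.

```python
from fractions import Fraction
from math import comb


def binomial_sum(N, pi, power):
    """Left-hand side: sum over k of C(N, k) / k**power * (pi / (1 - pi))**k."""
    x = pi / (1 - pi)
    return sum(Fraction(comb(N, k), k**power) * x**k for k in range(1, N + 1))


def closed_form_25(N, pi):
    """Right-hand side of (2.5), as read from the appendix."""
    return sum(Fraction(1, k) * (1 / (1 - pi) ** k - 1) for k in range(1, N + 1))


def closed_form_26(N, pi):
    """Right-hand side of (2.6), as read from the appendix."""
    double_sum = sum(
        Fraction(1, k * l) / (1 - pi) ** k
        for k in range(1, N + 1)
        for l in range(k, N + 1)
    )
    h1 = sum(Fraction(1, k) for k in range(1, N + 1))      # harmonic sum
    h2 = sum(Fraction(1, k * k) for k in range(1, N + 1))  # sum of 1/k^2
    return double_sum - (h1**2 + h2) / 2


pi = Fraction(3, 10)  # arbitrary test value in (0, 1)
for N in range(1, 11):
    assert binomial_sum(N, pi, 1) == closed_form_25(N, pi)
    assert binomial_sum(N, pi, 2) == closed_form_26(N, pi)
print("identities hold for N = 1, ..., 10")
```

Working with `Fraction` rather than floating point makes the comparison exact, so a failed assertion would indicate a real discrepancy in the formula rather than rounding error.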