RISK ESTIMATION IN RETROSPECTIVE STUDIES
WEI XING
(B.Sc.(Hons), NUS)
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF STATISTICS AND APPLIED PROBABILITY
NATIONAL UNIVERSITY OF SINGAPORE
2010
Contents
Abstract
List of Tables
List of Figures
1 Introduction
    1.1 Case-control study
    1.2 Background
    1.3 Odds ratio
    1.4 Relative risk
    1.5 Motivation & organization of thesis
2 Methods
    2.1 Point estimate
    2.2 Expectation
    2.3 Special case
    2.4 Testing equality of risks
    2.5 Cochran-Armitage test
3 Application
    3.1 Data description
    3.2 Bias
    3.3 Variance
    3.4 Risk estimation
    3.5 Tests of association
4 Discussion
Bibliography
Appendix
Abstract
The analysis of data arising from retrospective studies traditionally followed methods
involving estimation of relative risks and/or odds ratios. Little attention was given to
estimation of risks, presumably due to constraint of such experimental designs, as well
as the ease of modelling of odds ratio by logistic regression. Here we present some
results for the estimation of risks in a general 2 by k contingency table setting, for a
dichotomous outcome variable and under some reasonable assumption of prevalence.
We also examine the properties of the proposed estimators, and apply them to data from a large-scale genome-wide association (GWA) study to demonstrate the relevance of the
methods.
List of Tables
1.1 A random sample classified as a 2 × k contingency table
1.2 Association of factor X and disease in a population cross section
3.1 Mean and SD of estimated variance and covariance
3.2 Five number summary of distribution of risks
3.3 32 SNPs with strong association
List of Figures
1.1 Relationship between odds ratio and risks
3.1 Bias of risk estimates
Chapter 1
Introduction
1.1 Case-control study
The case-control study is a primary tool for the study of factors related to disease incidence and is widely used in clinical and epidemiological research. Such studies often
utilize a retrospective design, in which the investigator looks backwards and examines
exposures to suspected risk or protection factors in relation to an outcome that is established at the start of the study. Compared to a prospective or cohort study that almost
always involves following up the subjects over an extended period of time, a case-control
study has the advantage of being able to yield results from presently collectible data,
using a relatively small amount of resources. It also reduces the sample size required to
capture a reasonable number of cases, especially when the disease under investigation is
rare in the general population.
In case-control studies, direct estimation of (absolute) risk is usually not possible, as
the numbers of cases and controls are fixed by the design, without knowledge of how many cases
and controls actually exist in the population of interest. On the other hand, one can
estimate the odds ratio, a particularly useful measure of association due to its invariance
under retrospective and prospective sampling schemes. An odds ratio of 1 is
indicative of statistical independence between exposure and disease outcome. Moreover,
it is well known that when disease incidence is low, the odds ratio closely approximates
the relative risk.
1.2 Background
Consider the use of a baseline categorical variable X with k levels to predict a binary
outcome Y . As an example X can be exposure to different levels of a risk factor and Y
the incidence of disease. A case-control study with a retrospective sampling scheme then
collects data on a sample of n1. controls and n2. cases from the two underlying
populations. Classification of these n1. + n2. = n.. subjects according to factor X gives
rise to a 2 × k contingency table, as shown below, with ni. , n.j and n.. denoting the row,
column and overall totals, respectively (Table 1.1).
Control (Y = 0)   n11   n12   ···   n1k   n1.
Case (Y = 1)      n21   n22   ···   n2k   n2.
                  n.1   n.2   ···   n.k   n..
Table 1.1: A random sample classified as a 2 × k contingency table
If the sample size n.. is small compared to the population, and if one further assumes
that the sample is a simple random sample, the k cell counts for row 1 and row 2 of
Table 1.1 may be modelled as a realization of two independent multinomial random
variables with total counts n1. and n2. , and unknown cell probabilities πij , i = 1, 2 and
j = 1, . . . , k. Each multinomial probability πij is in fact the proportion of subjects in
subclass j within the diseased (i = 2) or non-diseased (i = 1) populations. In other
words, if we could cross classify the total population from which the cases and controls of
Table 1.1 were selected (Table 1.2), then πij = Nij /Ni. . In addition, the cell probabilities
of each row must sum to 1, i.e., ∑_j πij = 1.
Control (Y = 0)   N11   N12   ···   N1k   N1.
Case (Y = 1)      N21   N22   ···   N2k   N2.
                  N.1   N.2   ···   N.k   N..
Table 1.2: Association of factor X and disease in a population cross section
1.3 Odds ratio
For any event A with probability p of occurring, the odds of its occurrence is
\[ \mathrm{odds}(A) = \frac{\Pr(A)}{1 - \Pr(A)} = \frac{p}{1 - p}. \]
Now suppose there are only two subclasses in a population, with factor X = 1, 2 respectively. Invariance of the odds ratio implies that
\[ \text{odds ratio} = \theta = \frac{\pi_{21}/\pi_{22}}{\pi_{11}/\pi_{12}} = \frac{r_1/(1 - r_1)}{r_2/(1 - r_2)}, \tag{1.1} \]
where ri denotes the proportion of cases among subclass i of the population, a quantity
that we will later define as the risk for that subclass. The left hand side of the equation
is the ratio of odds of being in subclass 1 for cases over that for controls, while the right
hand side refers to the ratio of odds of being a case for subjects in subclass 1 over that
in subclass 2. It is on this basis that we are able to estimate from retrospective studies
the multinomial parameters πij and hence the odds ratio. In this case, the sample odds
ratio is
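\[ \hat{\theta} = \frac{n_{21}/n_{22}}{n_{11}/n_{12}} = \frac{n_{21}\, n_{12}}{n_{11}\, n_{22}}, \]
whose logarithm has an asymptotic normal distribution with mean log θ and variance
\[ \mathrm{Var}(\log \hat{\theta}) \approx \sum_{(i,j)} \frac{1}{n_{ij}}. \]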
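As an illustration, the sample odds ratio and an approximate interval for it can be sketched in a few lines of Python. The counts below are hypothetical, the function names are ours, and the interval is the usual one built on the log scale from the sum of reciprocal cell counts; the orientation follows the left-hand side of (1.1), i.e. cases over controls.

```python
import math

def sample_odds_ratio(n11, n12, n21, n22):
    """Sample odds ratio for a 2 x 2 table, oriented as in (1.1):
    odds of subclass 1 among cases (row 2) over that among controls (row 1)."""
    return (n21 / n22) / (n11 / n12)

def log_scale_interval(n11, n12, n21, n22, z=1.96):
    """Approximate 95% interval, computed on the log scale and mapped back."""
    log_or = math.log(sample_odds_ratio(n11, n12, n21, n22))
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical counts: row 1 = controls, row 2 = cases.
theta_hat = sample_odds_ratio(n11=300, n12=700, n21=450, n22=550)
low, high = log_scale_interval(300, 700, 450, 550)
print(theta_hat, low, high)
```

The interval is asymmetric around the point estimate because it is symmetric only on the log scale.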
1.4 Relative risk
As discussed in the previous section, it is natural to define the risk of developing
a disease for an individual belonging to any subclass as the number of cases in that
subclass over the total number of subjects in it. The relative risk for one subclass over
another is then estimated by the ratio of the estimated absolute risks. More specifically,
if we denote by P the proportion of population falling into the diseased group (that is,
the prevalence), by π2j the proportion of diseased group falling into a subclass j, and
by π1j the proportion of the non-diseased group falling into that subclass, then the risk
of disease for members of that subclass is
\[ \frac{\pi_{2j} P}{\pi_{2j} P + \pi_{1j}(1 - P)}, \tag{1.2} \]
and for non-members of the subclass
\[ \frac{(1 - \pi_{2j}) P}{(1 - \pi_{2j}) P + (1 - \pi_{1j})(1 - P)}. \]
The relative risk is therefore
\[ \frac{\pi_{2j}}{1 - \pi_{2j}} \cdot \frac{(1 - \pi_{2j}) P + (1 - \pi_{1j})(1 - P)}{\pi_{2j} P + \pi_{1j}(1 - P)}. \tag{1.3} \]
When P is small and 1 − P is close to 1, (1.3) can be written as
\[ \frac{\pi_{2j}}{1 - \pi_{2j}} \cdot \frac{1 - \pi_{1j}}{\pi_{1j}}. \tag{1.4} \]
This is exactly the ratio of the odds of being in the subclass for a case over that for a
control. Hence the odds ratio can be used as an approximation to the relative risk when the disease
is rare. Alternatively, this can be shown from (1.1) since
\[ \text{odds ratio} = \text{relative risk} \times \left( \frac{1 - r_2}{1 - r_1} \right), \]
so these two quantities are similar when (1 − r2)/(1 − r1) ≈ 1.
Finally, in a retrospective study as described in Table 1.1, one can estimate the odds
ratio (1.4) for subclass X = j with the statistic
\[ \frac{n_{2j}}{n_{2.} - n_{2j}} \cdot \frac{n_{1.} - n_{1j}}{n_{1j}}. \]
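The rare-disease approximation can be checked numerically. The sketch below evaluates the relative risk (1.3) for several prevalences and compares it with the limiting odds ratio (1.4); the values of π1j, π2j and P are hypothetical and the function names are ours.

```python
def subclass_risk(pi1j, pi2j, P):
    """Risk of disease for members of subclass j, equation (1.2)."""
    return pi2j * P / (pi2j * P + pi1j * (1 - P))

def complement_risk(pi1j, pi2j, P):
    """Risk of disease for non-members of subclass j."""
    return (1 - pi2j) * P / ((1 - pi2j) * P + (1 - pi1j) * (1 - P))

def relative_risk(pi1j, pi2j, P):
    """Ratio of the two risks, equation (1.3)."""
    return subclass_risk(pi1j, pi2j, P) / complement_risk(pi1j, pi2j, P)

def odds_ratio(pi1j, pi2j):
    """Rare-disease limit of the relative risk, equation (1.4)."""
    return (pi2j / (1 - pi2j)) * ((1 - pi1j) / pi1j)

# Hypothetical exposure probabilities among controls (0.30) and cases (0.45).
for P in (0.001, 0.039, 0.3):
    print(P, relative_risk(0.30, 0.45, P), odds_ratio(0.30, 0.45))
```

For an exposure that raises risk, the relative risk moves away from the odds ratio towards 1 as the prevalence grows.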
1.5 Motivation & organization of thesis
The relative ease of modelling, particularly through logistic regression [1], and the approximation to relative risk have led to the widespread reporting of odds ratios in the medical
literature. However, many researchers have argued that the use of the odds ratio in case-control
studies is technically correct but often misleading [2]. Indeed, using (1.1), one
may plot, for each odds ratio of 2, 5, 10 and 20, the curve of risk parameters (r1 , r2 ) that
give rise to it (Figure 1.1). This shows that the same odds ratio can arise from different
risk parameters.
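To make the point concrete, one can solve (1.1) for r1 given r2 and a fixed odds ratio. The short sketch below does this for an odds ratio of 5 and a few hypothetical values of r2; the function name is ours.

```python
def risk_r1_given_or(r2, theta):
    """Solve r1 / (1 - r1) = theta * r2 / (1 - r2) for r1, following (1.1)."""
    odds1 = theta * r2 / (1 - r2)
    return odds1 / (1 + odds1)

# The same odds ratio corresponds to quite different pairs of risks.
for r2 in (0.01, 0.1, 0.3, 0.5):
    r1 = risk_r1_given_or(r2, theta=5)
    print(f"r2={r2:.2f}  r1={r1:.3f}  relative risk={r1 / r2:.2f}")
```

The relative risk implied by a fixed odds ratio shrinks as the baseline risk r2 grows, which is why the odds ratio alone says little about absolute risk.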
In this thesis we argue that estimating risks has the advantage of better interpretability. In addition, we would also like to test the hypothesis that risks are not different
among the sub-populations. In doing so we need to make some assumptions, provide
the likely size of error, and discuss the bias in these estimates.
[Figure 1.1: Relationship between odds ratio and risks. Curves for OR = 2, 5, 10 and 20; horizontal axis r1, vertical axis r2, both from 0 to 1.]
In Chapter 2 we present some theory to estimate disease risk in a general 2 × k
contingency table setting. In Chapter 3 we apply the method to the Wellcome Trust
Case Control Consortium data and evaluate the performance of the method. In Chapter
4 we provide a discussion and some other issues that may warrant future investigation.
Chapter 2
Methods
2.1 Point estimate
From Section 1.4 we have the risk for subclass j defined as the ratio
\[ r_j = \frac{N_{2j}}{N_{.j}}, \]
i.e. the ratio of the number of cases in the subclass over the total number of subjects in
it (Table 1.2). If the disease prevalence P is assumed to be known and N.. denotes the
population size, then the total number of cases in subclass j of the population can be
estimated by P N.. n2j /n2. , and the total number of controls by (1 − P )N.. n1j /n1. . Hence
an estimator of risk for subclass j is
\[ \hat{r}_j = \frac{P\, n_{2j}/n_{2.}}{P\, n_{2j}/n_{2.} + (1 - P)\, n_{1j}/n_{1.}} \tag{2.1} \]
except when n2j = n1j = 0. This is also an obvious estimate from (1.2) since n1j /n1. and
n2j /n2. are unbiased estimators of π1j and π2j . Indeed (2.1) is merely Bayes' theorem,
which relates the prospective disease probability (or risk) to the retrospective exposure
probability:
\[ r_j = \Pr(Y = 1 \mid X = j) = \frac{P \Pr(X = j \mid Y = 1)}{P \Pr(X = j \mid Y = 1) + (1 - P) \Pr(X = j \mid Y = 0)}, \]
where we use Y = 1, 0 to denote a subject being a case and a control, respectively.
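A minimal Python sketch of the estimator (2.1) follows. The genotype counts are hypothetical, the prevalence value is only an example, and the function name is ours; a subclass with no cases and no controls is returned as None because the risk is not estimable there.

```python
def estimate_risks(controls, cases, prevalence):
    """Risk estimates (2.1) for each subclass of a 2 x k table.

    controls, cases: lists of counts n1j and n2j (same length k).
    Returns None for a subclass with n1j = n2j = 0.
    """
    n1, n2 = sum(controls), sum(cases)
    risks = []
    for n1j, n2j in zip(controls, cases):
        if n1j == 0 and n2j == 0:
            risks.append(None)
            continue
        num = prevalence * n2j / n2
        risks.append(num / (num + (1 - prevalence) * n1j / n1))
    return risks

# Hypothetical counts for three subclasses.
controls = [1500, 1200, 300]
cases = [800, 900, 300]
risks = estimate_risks(controls, cases, prevalence=0.039)
print(risks)
```

When the case and control rows have identical proportions, every estimate equals the assumed prevalence, which is a quick sanity check on any implementation.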
2.2 Expectation
Since n1. and n2. are fixed, and P is a constant for a given population at the time of
study, each nij has a marginal binomial distribution with parameters ni. and πij . We
can then write the expectation of r̂j as the weighted sum
\[ \mathrm{E}(\hat{r}_j) = \sum_{k=0}^{n_{1.}} \sum_{l=0}^{n_{2.}} \frac{P\, l/n_{2.}}{P\, l/n_{2.} + (1 - P)\, k/n_{1.}} \binom{n_{1.}}{k} \pi_{1j}^{k} (1 - \pi_{1j})^{n_{1.} - k} \binom{n_{2.}}{l} \pi_{2j}^{l} (1 - \pi_{2j})^{n_{2.} - l} \tag{2.2} \]
while the true value of parameter rj is given by (1.2). The bias can be obtained by
taking the difference between (2.2) and (1.2) and plotting graphically for all π1j and π2j .
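The double sum (2.2) is straightforward to evaluate numerically. The sketch below is one possible implementation, with the binomial probabilities computed on the log scale to avoid overflow for large totals; the sample sizes and parameter values in the example are hypothetical and deliberately small so that it runs quickly, and the k = l = 0 term is skipped.

```python
import math

def binom_pmf(n, p):
    """Binomial(n, p) probabilities for 0..n, computed on the log scale."""
    if p == 0.0:
        return [1.0] + [0.0] * n
    if p == 1.0:
        return [0.0] * n + [1.0]
    log_p, log_q = math.log(p), math.log1p(-p)
    return [
        math.exp(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                 + k * log_p + (n - k) * log_q)
        for k in range(n + 1)
    ]

def expected_risk_estimate(n1, n2, pi1j, pi2j, P):
    """Evaluate the double sum (2.2), skipping the k = l = 0 term."""
    pmf_controls = binom_pmf(n1, pi1j)
    pmf_cases = binom_pmf(n2, pi2j)
    total = 0.0
    for k, pk in enumerate(pmf_controls):
        for l, pl in enumerate(pmf_cases):
            if k == 0 and l == 0:
                continue  # risk not estimable when both counts are zero
            num = P * l / n2
            total += num / (num + (1 - P) * k / n1) * pk * pl
    return total

def true_risk(pi1j, pi2j, P):
    """True risk, equation (1.2)."""
    return pi2j * P / (pi2j * P + pi1j * (1 - P))

# Hypothetical small study: 60 controls, 40 cases, prevalence 0.039.
bias = expected_risk_estimate(60, 40, 0.3, 0.4, 0.039) - true_risk(0.3, 0.4, 0.039)
print(bias)
```

The cost grows with the product of the two row totals, so for totals in the thousands a vectorized implementation would be preferable to this plain loop.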
When both n1j and n2j are 0, the risk rj is not estimable. Hence in computing (2.2)
we have to "ignore" the case of k = l = 0. In addition, when n1j or n2j is 0, the estimator
(2.1) gives 1 or 0, and inclusion of these outcomes might substantially influence
the bias of the estimator. Since the estimator will not be used in these situations, it
might be better to consider instead the conditional expectation E(r̂j | n1j , n2j > 0). It is
linked to E(r̂j ) via the relationship
\[ \mathrm{E}(\hat{r}_j \mid n_{1j}, n_{2j} > 0) = \frac{\mathrm{E}(\hat{r}_j) - S_1}{1 - S_1 - S_2}, \]
where
\[ S_1 = \Pr(n_{1j} = 0,\, n_{2j} > 0) = (1 - \pi_{1j})^{n_{1.}} \sum_{l=1}^{n_{2.}} \binom{n_{2.}}{l} \pi_{2j}^{l} (1 - \pi_{2j})^{n_{2.} - l}, \]
\[ S_2 = \Pr(n_{2j} = 0,\, n_{1j} > 0) = (1 - \pi_{2j})^{n_{2.}} \sum_{k=1}^{n_{1.}} \binom{n_{1.}}{k} \pi_{1j}^{k} (1 - \pi_{1j})^{n_{1.} - k}. \]
In practice, however, we find that the difference between E(r̂j | n1j , n2j > 0) and E(r̂j ) is
negligible. Thus only the unconditional expectation is used in the subsequent applications.
2.3 Special case
Wiggins and Slater [3] previously discussed a similar problem where it was assumed that
all cases of the disease in the population were captured. A random sample was then
selected from the control population. They proposed a numerical approximation to the
expectation and variance of the risk estimates. More specifically, if the total number of
cases in the population is known, i.e. n2. = N2. , then N.. = n2. /P and each n2j is no
longer random. The estimator (2.1) can be written as
\[ \hat{r}_j = \frac{n_{2j}}{n_{2j} + (1 - P)\, n_{1j}\, n_{2.} / (P\, n_{1.})} = \frac{P\, n_{1.}\, n_{2j}}{P\, n_{1.}\, n_{2j} + (1 - P)\, n_{2.}\, n_{1j}}. \tag{2.3} \]
Notice that in the above expression n1j is the only random component. The expectation
of r̂j is in this case
\[ \mathrm{E}(\hat{r}_j) = \sum_{k=0}^{n_{1.}} \frac{P\, n_{1.}\, n_{2j}}{P\, n_{1.}\, n_{2j} + (1 - P)\, n_{2.}\, k} \binom{n_{1.}}{k} \pi_{1j}^{k} (1 - \pi_{1j})^{n_{1.} - k}. \]
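The expectation can be approximated using a Taylor expansion argument. Letting ξ =
P/(1 − P ), we find
\[ \mathrm{E}(\hat{r}_j) = (1 - \pi_{1j})^{n_{1.}} + (1 - \pi_{1j})^{n_{1.}}\, \frac{n_{1.}\, n_{2j}\, \xi}{n_{2.}} \sum_{k=1}^{n_{1.}} \left( 1 + \frac{n_{1.}\, n_{2j}\, \xi}{n_{2.}\, k} \right)^{-1} \frac{1}{k} \binom{n_{1.}}{k} \left( \frac{\pi_{1j}}{1 - \pi_{1j}} \right)^{k} \tag{2.4} \]
(see Appendix for a detailed derivation). It is also shown in the Appendix that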
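\[ Y_1 = \sum_{k=1}^{n_{1.}} \frac{1}{k} \binom{n_{1.}}{k} \left( \frac{\pi_{1j}}{1 - \pi_{1j}} \right)^{k} = \sum_{k=1}^{n_{1.}} \frac{1}{k (1 - \pi_{1j})^{k}} - \sum_{k=1}^{n_{1.}} \frac{1}{k} \tag{2.5} \]
and
\[ Y_2 = \sum_{k=1}^{n_{1.}} \frac{1}{k^2} \binom{n_{1.}}{k} \left( \frac{\pi_{1j}}{1 - \pi_{1j}} \right)^{k} = \sum_{k=1}^{n_{1.}} \sum_{l=k}^{n_{1.}} \frac{1}{k\, l\, (1 - \pi_{1j})^{k}} - \frac{1}{2} \left\{ \left( \sum_{k=1}^{n_{1.}} \frac{1}{k} \right)^{2} + \sum_{k=1}^{n_{1.}} \frac{1}{k^2} \right\}. \tag{2.6} \]
Using (1 − x)^{−1} = ∑_{k=0}^{∞} x^k , one can expand E(r̂j ) in a power series and ignore terms
in ξ^3 and higher powers to find
\[ \mathrm{E}(\hat{r}_j) \approx (1 - \pi_{1j})^{n_{1.}} \left\{ 1 + \frac{n_{1.}}{n_{2.}}\, n_{2j}\, \xi \left( Y_1 - \frac{n_{1.}}{n_{2.}}\, n_{2j}\, \xi\, Y_2 \right) \right\}. \]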
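Similarly, one can expand E(r̂j^2 ) in a power series and ignore terms in ξ^3 and
higher powers to get
\[ \mathrm{E}(\hat{r}_j^2) \approx (1 - \pi_{1j})^{n_{1.}} \left( 1 + \frac{n_{1.}^2}{n_{2.}^2}\, n_{2j}^2\, \xi^2\, Y_2 \right). \]
The variance can then be found by
\[ \mathrm{Var}(\hat{r}_j) = \mathrm{E}(\hat{r}_j^2) - (\mathrm{E}(\hat{r}_j))^2. \]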
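The quality of the second-order approximation can be examined numerically. The sketch below compares the exact special-case expectation with the approximation built from Y1 and Y2, and also checks the closed form (2.5) against the defining sum. All numerical inputs are hypothetical, the function names are ours, and n2j is assumed to be positive; totals are kept small so that the binomial coefficients stay well within floating-point range.

```python
import math

def exact_expectation(n1, n2, n2j, pi1j, P):
    """Exact E(r_hat_j) when n2j is fixed and n1j ~ Binomial(n1, pi1j)."""
    total = 0.0
    for k in range(n1 + 1):
        pmf = math.comb(n1, k) * pi1j**k * (1 - pi1j)**(n1 - k)
        total += P * n1 * n2j / (P * n1 * n2j + (1 - P) * n2 * k) * pmf
    return total

def y_sums(n1, pi1j):
    """Y1 and Y2 computed directly from their defining sums."""
    x = pi1j / (1 - pi1j)
    y1 = sum(math.comb(n1, k) * x**k / k for k in range(1, n1 + 1))
    y2 = sum(math.comb(n1, k) * x**k / k**2 for k in range(1, n1 + 1))
    return y1, y2

def approx_expectation(n1, n2, n2j, pi1j, P):
    """Second-order approximation in xi = P / (1 - P)."""
    xi = P / (1 - P)
    a = n1 * n2j * xi / n2
    y1, y2 = y_sums(n1, pi1j)
    return (1 - pi1j)**n1 * (1 + a * (y1 - a * y2))

def y1_closed_form(n1, pi1j):
    """Right-hand side of (2.5)."""
    return (sum(1 / (k * (1 - pi1j)**k) for k in range(1, n1 + 1))
            - sum(1 / k for k in range(1, n1 + 1)))

# Hypothetical inputs: 50 controls sampled, 40 cases in total, 5 of them in subclass j.
exact = exact_expectation(50, 40, 5, 0.2, 0.01)
approx = approx_expectation(50, 40, 5, 0.2, 0.01)
print(exact, approx)
```

The approximation is a truncated series in n1. n2j ξ / (n2. k), so it should only be trusted when that ratio is well below 1 for the values of k that carry most of the probability.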
In the above derivation, the model is built upon the assumption that all cases in the
population are captured, and only the number of controls in the sample is treated as
random. While this assumption might hold for certain severe, rapid-onset diseases that
are under extensive surveillance, it certainly does not apply to most common diseases
that are of public health interest. Consequently it is necessary to allow uncertainty in
both cases and controls.
2.4 Testing equality of risks
It is often of interest to study whether a risk factor X is associated with the disease
outcome. The classic test of homogeneity between two multinomial probabilities in
Table 1.1 is Pearson's χ2 test, with the test statistic
\[ X^2 = \sum_{(i,j)} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \]
where Oij = nij are the observed counts and Eij = ni. n.j /n.. are the expected counts in
the (i, j) cell, under the null hypothesis that the multinomial probabilities of row 1 and
row 2 are the same, i.e.
\[ \pi_{1j} = \pi_{2j} \quad \text{for } j = 1, \ldots, k, \]
which is equivalent to H0 : r1 = · · · = rk . The alternative hypothesis is H1 : π1j ≠
π2j for at least one j in 1, . . . , k. For large sample sizes, the approximate null distribution
of this statistic is χ2 with (2 − 1) × (k − 1) = k − 1 degrees of freedom [4].
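A short Python sketch of the Pearson statistic for a 2 × k table is given below, using hypothetical counts. To keep the example free of external libraries, the p-value is computed only for the case k = 3, where the reference distribution has 2 degrees of freedom and its upper tail is exactly exp(−x/2); for general k a chi-square routine from a statistics library would be needed.

```python
import math

def pearson_chi2(controls, cases):
    """Pearson X^2 for a 2 x k table with rows = controls, cases."""
    rows = [controls, cases]
    row_totals = [sum(r) for r in rows]
    col_totals = [a + b for a, b in zip(controls, cases)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def chi2_upper_tail_2df(x):
    """Upper tail probability of the chi-square distribution with 2 df."""
    return math.exp(-x / 2)

# Hypothetical counts for k = 3 subclasses.
x2 = pearson_chi2([1500, 1200, 300], [800, 900, 300])
print(x2, chi2_upper_tail_2df(x2))
```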
An alternative is the likelihood ratio test, which is asymptotically equivalent to the
χ2 test. The cell counts of the two rows in Table 1.1 follow approximately multinomial
distributions with joint frequency functions
\[ f(n_{11}, \ldots, n_{1k} \mid \pi_{11}, \ldots, \pi_{1k}) = \frac{n_{1.}!}{\prod_{j=1}^{k} n_{1j}!} \prod_{j=1}^{k} \pi_{1j}^{n_{1j}} \]
and
\[ f(n_{21}, \ldots, n_{2k} \mid \pi_{21}, \ldots, \pi_{2k}) = \frac{n_{2.}!}{\prod_{j=1}^{k} n_{2j}!} \prod_{j=1}^{k} \pi_{2j}^{n_{2j}}. \]
The two multinomial distributions are independent, and the log likelihood function is
therefore
\[ l(\pi_{11}, \ldots, \pi_{2k}) = \sum_{j=1}^{k} n_{1j} \log(\pi_{1j}) + \sum_{j=1}^{k} n_{2j} \log(\pi_{2j}) + C, \]
subject to the constraint that ∑_j πij = 1 for i = 1, 2. Alternatively, one can express
l(π11 , . . . , π2k ) in terms of the 'free' parameters, i.e.
\[ l(\pi_{11}, \ldots, \pi_{1(k-1)}, \pi_{21}, \ldots, \pi_{2(k-1)}) = \sum_{j=1}^{k-1} n_{1j} \log(\pi_{1j}) + n_{1k} \log\Big(1 - \sum_{j=1}^{k-1} \pi_{1j}\Big) + \sum_{j=1}^{k-1} n_{2j} \log(\pi_{2j}) + n_{2k} \log\Big(1 - \sum_{j=1}^{k-1} \pi_{2j}\Big) + C. \tag{2.7} \]
The null hypothesis H0 : r1 = · · · = rk is equivalent to the multinomial probabilities of
row 1 and row 2 of Table 1.1 being equal. Under H0 , (2.7) is maximized at
\[ \hat{\pi}_{1j} = \hat{\pi}_{2j} = \frac{n_{1j} + n_{2j}}{n_{..}} = \frac{n_{.j}}{n_{..}}, \]
or equivalently, r̂1 = · · · = r̂k = P . Under the alternative hypothesis, there is no
restriction on the πij 's, so (2.7) is maximized at the MLEs π̂ij = nij /ni. . The test statistic
is therefore
\[ \Delta = -2 (l_0 - l_1), \]
where l0 and l1 denote the maximized log likelihood under the parameter space specified
by the null and alternative hypotheses, respectively. The degrees of freedom are calculated
as follows: under H1 , there are 2k − 2 free parameters due to the two constraints of
summing to 1. Under H0 , the two rows have the same multinomial probabilities, so
there are k − 1 free parameters. The test statistic therefore follows an asymptotic χ2
distribution with (2k − 2) − (k − 1) = k − 1 degrees of freedom.
It should be noted that although incorporation of the prevalence P is required
for estimation of risks, it is not needed in the tests. Reparametrization from π to r does
not change the maximum likelihood or the likelihood ratio test statistic.
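Substituting the two sets of maximizers into the log likelihood gives ∆ = 2 ∑ nij log(nij n.. /(ni. n.j )), which the sketch below evaluates for hypothetical counts. The function name is ours, and cells with a zero count are taken to contribute nothing to the sum. Note that the prevalence P does not appear anywhere in the computation.

```python
import math

def likelihood_ratio_statistic(controls, cases):
    """Delta = -2 (l0 - l1) for homogeneity of two multinomial rows."""
    rows = [controls, cases]
    row_totals = [sum(r) for r in rows]
    col_totals = [a + b for a, b in zip(controls, cases)]
    n = sum(row_totals)
    delta = 0.0
    for i, row in enumerate(rows):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # a zero cell contributes 0 to the sum
                delta += 2 * n_ij * math.log(n_ij * n / (row_totals[i] * col_totals[j]))
    return delta

# Hypothetical counts for k = 3 subclasses.
delta = likelihood_ratio_statistic([1500, 1200, 300], [800, 900, 300])
print(delta)
```

For counts of this size the statistic is close to the Pearson X^2 computed from the same table, in line with the asymptotic equivalence of the two tests.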
2.5 Cochran-Armitage test
We also consider a third test, the Cochran-Armitage test [5, 6], which utilizes ordered
categories in a contingency table and tests for a linear trend based on a score xi assigned
to each column of Table 1.1. It uses a linear probability model
\[ p_i = \alpha + \beta x_i, \]
where pi is the expected proportion of cases in the ith subclass in the sample. Note
that, as a result of retrospective sampling, the odds pi /(1 − pi ) differ from the odds
ri /(1 − ri ) by a factor that is constant across all the subclasses. Hence the null hypothesis of independence H0 : β = 0 is
equivalent to H0 : r1 = · · · = rk .
The test may be rationalized with an analysis of variance approach. More specifically, Pearson's χ2 test statistic can be rewritten as a weighted sum of squares, i.e.
\[ X^2 = \sum_{i=1}^{3} \frac{(n_{.i} \hat{p}_i - n_{.i} \hat{p})^2}{n_{.i} \hat{p}} + \sum_{i=1}^{3} \frac{(n_{.i} (1 - \hat{p}_i) - n_{.i} (1 - \hat{p}))^2}{n_{.i} (1 - \hat{p})} = \frac{\sum_{i=1}^{3} n_{.i} (\hat{p}_i - \hat{p})^2}{\hat{p} (1 - \hat{p})}, \]
with the weights being n.i /(p̂(1 − p̂)). Here p̂i is the observed proportion of cases in the ith
subclass and p̂ is the overall proportion of cases in the sample. The regression parameter
β can then be obtained by the standard formula for weighted regression
\[ \hat{\beta} = \frac{\sum_i n_{.i} (\hat{p}_i - \hat{p})(x_i - \bar{x})}{\sum_i n_{.i} (x_i - \bar{x})^2}, \]
where x̄ is the weighted average of xi , i.e.
\[ \bar{x} = \frac{\sum_i x_i n_{.i}}{\sum_i n_{.i}}. \]
It is then possible to subdivide this weighted sum of squares by the rules of analysis of
variance into the sum of squares explained by the regression formula
\[ \mathrm{SSR} = \frac{\hat{\beta}^2 \sum_i n_{.i} (x_i - \bar{x})^2}{\hat{p} (1 - \hat{p})} = \frac{\left[ \sum_i (x_i - \bar{x})\, n_{2i} \right]^2}{\hat{p} (1 - \hat{p}) \sum_i n_{.i} (x_i - \bar{x})^2} \]
and the "residual" sum of squares
\[ \mathrm{SSL} = \frac{\sum_i n_{.i} (\hat{\hat{p}}_i - \hat{p}_i)^2}{\hat{p} (1 - \hat{p})}, \]
where \hat{\hat{p}}_i = α̂ + β̂ xi is the fitted value for pi .
The Cochran-Armitage test statistic is simply SSR and it tests H0 : β = 0. When
the linear probability model holds, SSR is asymptotically χ2 distributed with 1 degree
of freedom.
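The trend statistic SSR can be computed directly from the second form given above. The following sketch uses hypothetical counts and the scores {0, 1, 2}; the function name is ours, and the p-value uses the fact that the upper tail of a χ2 variable with 1 degree of freedom at x equals erfc(√(x/2)).

```python
import math

def cochran_armitage(controls, cases, scores):
    """Trend statistic SSR and its 1-df chi-square p-value."""
    col_totals = [a + b for a, b in zip(controls, cases)]
    n = sum(col_totals)
    p_hat = sum(cases) / n                      # overall proportion of cases
    x_bar = sum(x * m for x, m in zip(scores, col_totals)) / n
    numerator = sum((x - x_bar) * c for x, c in zip(scores, cases)) ** 2
    denominator = p_hat * (1 - p_hat) * sum(
        m * (x - x_bar) ** 2 for x, m in zip(scores, col_totals)
    )
    ssr = numerator / denominator
    p_value = math.erfc(math.sqrt(ssr / 2))     # chi-square(1) upper tail
    return ssr, p_value

# Hypothetical counts for three ordered subclasses.
ssr, p_value = cochran_armitage([1500, 1200, 300], [800, 900, 300], scores=[0, 1, 2])
print(ssr, p_value)
```

Because SSR is one component of the decomposition of X^2, it can never exceed the Pearson statistic computed from the same table.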
Chapter 3
Application
We now apply the methods in Chapter 2 to data from the main experiment of the
Wellcome Trust Case Control Consortium: a genome-wide association study involving
2,000 DNA samples from each of seven diseases (type 1 diabetes, type 2 diabetes, coronary heart disease, hypertension, bipolar disorder, rheumatoid arthritis and Crohn’s
disease) [7]. As is typical for GWA studies, a dense set of single-nucleotide polymorphisms (SNPs) across the genome is genotyped, and statistical analysis is performed
to survey the most common genetic variation that is associated with risk factors for
disease. An SNP is a DNA sequence variation occurring when a single nucleotide (A, T,
C, or G) in the genome or other shared sequence differs between members of a species.
The associated SNP markers are then considered as pointers to the region of the human
genome where the disease-causing gene is likely to reside.
3.1 Data description
We choose to analyze the data from type 2 diabetes, partly because of the availability
of some reasonably reliable prevalence data [8]. In this study, a total of 2938
controls and 1924 cases are genotyped, after excluding samples with contamination, false
identity, non-Caucasian ancestry and relatedness. The case samples are ascertained from
sites widely distributed across Great Britain, and the controls come from two sources:
about 1500 are representative samples from the 1958 British Birth Cohort and another
1500 are blood donors recruited by the three national UK Blood Services.
Summary genotype statistics were retrieved from the European Genotype Archive
(http://www.ebi.ac.uk/ega/page.php) and they contained data on over five hundred
thousand SNPs distributed over 23 chromosome pairs of the approximately 5000 participants in the study. All SNPs are biallelic, i.e. there are only 2 alleles that usually
differ by a single nucleotide, giving rise to 3 possible genotypes. The controls and cases
are classified according to their genotypes at the SNP site, making this an application
of the theory with k = 3.
3.2 Bias
We first evaluate the bias of risk estimates corresponding to different parameter values
π1j and π2j . Since rj is undefined when π1j = π2j = 0, we assume hereafter that at least
one of π1j and π2j is not 0. Comparing the expressions (2.2) and (1.2) it is easy to see
that the bias of r̂j is 0 when either π1j or π2j is 0, and when both π1j and π2j are 1.
Taking n1. = 3000, n2. = 2000 and prevalence P = 0.039 for type 2 diabetes, the
(unconditional) biases obtained from (2.2) for selected values of π1j and π2j are plotted in Figure 3.1.
In general the bias decreases with π1j and increases with π2j . Nonetheless, in most
cases of the WTCCC study, the biases are small and negligible.
[Figure 3.1: Bias of risk estimates. Panels: π1j = 0.1, 0.5 and 0.9 (bias against π2j ), and π2j = 0.1, 0.5 and 0.9 (bias against π1j ).]
3.3 Variance
Since the exact form of the standard errors (SE) of the risk estimates is difficult to compute,
they are estimated by simulation: first fix some selected multinomial probabilities π1j
and π2j , j = 1, 2, 3, then generate 5000 2 × 3 tables using these parameters. Monte
Carlo simulation is then performed with n = 10^6 repeats, using the multinomial
parameters estimated from each of these 5000 tables. The variance-covariance matrix can then be
obtained from the 10^6 risk estimates. Table 3.1 gives the mean and standard deviation of
a diagonal element (Var(r̂1 )) as well as an off-diagonal element (Cov(r̂1 , r̂3 )) of the 5000 variance-covariance matrices. It is clear that while the Monte Carlo approach works well in most cases,
the estimated SE of risk estimates can be very poor at extreme parameter values.
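A much reduced version of such a simulation can be sketched with the Python standard library. The sketch below performs only the single-level step (simulating tables from fixed parameters and taking the empirical variance and covariance of the risk estimates), not the two-level procedure described above. It assumes, purely for illustration, that both rows share the probabilities (0.3, 0.3, 0.4), with n1. = 3000, n2. = 2000 and P = 0.039, and uses a few hundred repeats rather than 10^6; all function names are ours.

```python
import random
from collections import Counter

def draw_multinomial(n, probs, rng):
    """One multinomial draw of size n, returned as a list of cell counts."""
    counts = Counter(rng.choices(range(len(probs)), weights=probs, k=n))
    return [counts.get(j, 0) for j in range(len(probs))]

def risk_estimates(controls, cases, P):
    """Estimator (2.1) for every subclass; assumes no subclass is empty in both rows."""
    n1, n2 = sum(controls), sum(cases)
    out = []
    for n1j, n2j in zip(controls, cases):
        num = P * n2j / n2
        out.append(num / (num + (1 - P) * n1j / n1))
    return out

def simulate_var_cov(pi1, pi2, n1, n2, P, repeats, seed=0):
    """Empirical Var(r_hat_1) and Cov(r_hat_1, r_hat_3) over simulated tables."""
    rng = random.Random(seed)
    r1, r3 = [], []
    for _ in range(repeats):
        est = risk_estimates(draw_multinomial(n1, pi1, rng),
                             draw_multinomial(n2, pi2, rng), P)
        r1.append(est[0])
        r3.append(est[2])
    m1, m3 = sum(r1) / repeats, sum(r3) / repeats
    var1 = sum((a - m1) ** 2 for a in r1) / (repeats - 1)
    cov13 = sum((a - m1) * (b - m3) for a, b in zip(r1, r3)) / (repeats - 1)
    return var1, cov13

probs = (0.3, 0.3, 0.4)  # hypothetical choice, used for both rows
var1, cov13 = simulate_var_cov(probs, probs, 3000, 2000, 0.039, repeats=300)
print(var1, cov13)
```

With so few repeats the estimates are noisy; the number of repeats would need to be far larger to reproduce values to the precision reported in a table.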
Parameter                             Mean             SD              MC
(0.3, 0.3, 0.4)       Var(r̂1 )        2.75 × 10^−6     2.18 × 10^−7    2.74 × 10^−6
                      Cov(r̂1 , r̂3 )   −1.17 × 10^−6    5.48 × 10^−8    −1.17 × 10^−6
(0.1, 0.1, 0.8)       Var(r̂1 )        1.08 × 10^−5     1.66 × 10^−6    1.06 × 10^−5
                      Cov(r̂1 , r̂3 )   −1.18 × 10^−6    9.09 × 10^−8    −1.17 × 10^−6
(0.01, 0.01, 0.98)    Var(r̂1 )        1.55 × 10^−4     9.66 × 10^−5    1.32 × 10^−4
                      Cov(r̂1 , r̂3 )   −1.25 × 10^−6    3.47 × 10^−7    −1.22 × 10^−6
(0.005, 0.005, 0.99)  Var(r̂1 )        5.12 × 10^−4     1.03 × 10^−3    3.11 × 10^−4
                      Cov(r̂1 , r̂3 )   −1.35 × 10^−6    5.59 × 10^−7    −1.28 × 10^−6
Table 3.1: Mean and SD of 5000 estimated variances Var(r̂1 ) and covariances
Cov(r̂1 , r̂3 ) generated using estimated parameter values. MC is the simulated variance
and covariance using the actual parameters.
3.4 Risk estimation
Data from a total of 420,172 SNPs distributed over 22 chromosome pairs are
analyzed. For each SNP, we denote by rmax and rmin the maximum and minimum risks
among the three genotypes, respectively. Table 3.2 provides the five number summary
as well as the mean of the distributions of rmax and rmin over all the SNPs.
        Min       1st Quartile   Median   Mean     3rd Quartile   Max
rmax    0.0390    0.0399         0.0408   0.0436   0.0428         0.526
rmin    0.00237   0.0356         0.0373   0.360    0.0382         0.0390
Table 3.2: Five number summary of distribution of risks
3.5 Tests of association
We applied the three tests described in Sections 2.4 and 2.5, i.e. the likelihood ratio, χ2 and
Cochran-Armitage tests, the last with scores {0, 1, 2}, to the type 2 diabetes data. The rationale
for choosing these scores is the assumption that any effect of the disease-causing
allele, if present, will be proportional to the number of such alleles an individual carries. A
total of 32 SNPs with at least one of the P -values below 5 × 10^−7 were detected.
Based on the WTCCC study, this is the threshold below which a strong association
between SNP and underlying disease is declared. Table 3.3 shows the P -values as well
as the estimated risks for these SNPs.
Of these 32 SNPs, 16 are found with poor clustering and removed from the analysis.
The remaining 16 SNPs form 4 clusters. Three of them, located on chromosomes 6, 10
and 16 coincide with the WTCCC study results [7], but a fourth SNP on chromosome
3 (rs2314349) has not been reported. The reason for the discrepancy is not clear. It is
possible that the SNP did not pass through the quality control filters in the WTCCC
study.
Of the 3 SNP clusters that exhibit a strong association signal, one located on chromosome 10 and represented by rs4506565 has the lowest P -values. This SNP is within the
previously reported transcription factor 7-like 2 gene (TCF7L2) [9]. The association is
possibly explained by its action through regulation of pro-glucagon gene expression in
enteroendocrine cells via the Wnt signaling pathway [10].
The second signal is from the FTO, or fat-mass and obesity associated, gene on
chromosome 16 (rs9939609). It was previously hypothesized that the effect of this variant
on type 2 diabetes risk is mediated entirely by its effect on adiposity [11].
The third association signal on chromosome 6 features a cluster of highly associated
SNPs (including rs9465871) that map to intron 5 of the CDK5 regulatory subunit associated protein 1-like 1 (CDKAL1 ) gene. The exact function of CDKAL1 is not clearly
understood, but it shares homology at the protein domain level with CDK5 regulatory
subunit associated protein 1 (CDK5RAP1). CDK5RAP1 is known to inhibit the activation of CDK5, a cyclin-dependent kinase which is involved in cell proliferation and
maintenance of normal beta-cell function [12].
Chr
SNP
Trend P -value
χ2 P -value
rˆAA
rˆAB
rˆBB
Cluster
3
rs2314349
0.003501860
4.669716 × 10−7
0.036
0.049
0.022
1
6
rs9465871
1.019692 × 10−6
3.341986 × 10−7
−8
0.075
0.042
0.036
1
−7
10
rs7901275
3.085252 × 10
2.166514 × 10
0.032
0.039
0.049
1
10
rs4074720
5.216261 × 10−11
2.794187 × 10−10
0.052
0.038
0.031
1
10
rs7901695
−13
−12
0.058
0.042
0.032
1
10
rs4506565
5.706546 × 10−13
5.048329 × 10−12
0.058
0.042
0.031
1
10
rs4132670
1.883049 × 10−12
1.644901 × 10−11
0.057
0.042
0.031
1
10
rs10787472
2.009426 × 10−11
1.226463 × 10−10
0.052
0.038
0.030
1
10
rs12243326
1.791796 × 10−10
1.017059 × 10−9
0.033
0.042
0.059
1
10
rs7077039
2.846168 × 10−12
1.331970 × 10−11
0.053
0.038
0.030
1
10
rs11196205
7.015932 × 10−11
2.652307 × 10−10
0.052
0.038
0.031
1
−10
0.031
0.038
0.052
1
6.742384 × 10
−11
5.616127 × 10
10
rs10885409
8.47945 × 10
3.482646 × 10
10
rs11196208
5.532075 × 10−11
3.058625 × 10−10
0.031
0.038
0.052
1
−8
−8
0.031
0.042
0.048
1
0.048
0.042
0.031
1
0.031
0.042
0.048
1
0.0067
0.042
0.039
0
3.568200 × 10
0.18
0.033
0.040
0
2.322810 × 10−35
0.042
0.018
0.047
0
−22
0.070
0.030
0.039
0
16
rs7193144
1.446072 × 10
4.775441 × 10
16
rs8050136
2.007310 × 10−8
7.038900 × 10−8
16
rs9939609
1
rs13373826
−8
5.261009 × 10
0.1316872
1.282534 × 10−7
−8
2
rs4080478
0.9508418
4
rs16837871
< 2.2 × 10−16
4
rs13126272
5
rs4270702
−5
1.432675 × 10
0.0026
0.045
0.038
0
−17
0.040
0.031
0.083
0
1.837621 × 10−9
0.045
0.030
0.037
0
−23
0.044
0.019
0.030
0
1.616481 × 10−6
1.694865 × 10−7
0.035
0.051
0.035
0
−6
−10
0.038
0.032
0.048
0
3.406096 × 10
rs2048646
0.04385808
6
rs1324132
6.775095 × 10−7
rs10499044
7
rs1525791
3.499818 × 10
9.790488 × 10−8
0.8052082
5
6
1.907746 × 10
−7
−16
< 2.2 × 10
5.720925 × 10
9
rs488101
1.158367 × 10
1.039887 × 10
14
rs1957779
2.057396 × 10−5
1.777778 × 10−8
0.028
0.044
0.041
0
14
rs1362719
−5
3.089134 × 10
−10
0.038
0.032
0.050
0
15
rs597414
0.01221369
2.115368 × 10−8
0.044
0.030
0.047
0
−20
0.050
0.027
0.046
0
2.270588 × 10−36
0.040
0.027
0.075
0
−93
0.032
0.0024
0.049
0
16
rs9889057
0.0001041203
21
rs226261
4.280764 × 10−7
22
rs11705626
−16
< 2.2 × 10
6.708994 × 10
2.369359 × 10
1.399167 × 10
Table 3.3: 32 SNPs with strong association. Chr: Chromosome. Cluster: a binary
variable indicating whether there was good (1) or poor (0) clustering.
Chapter 4
Discussion
Following the great triumph of the case-control study that demonstrated the link
between tobacco smoking and lung cancer [13], this study design has gained wide recognition. However, its retrospective, non-randomized nature limits the conclusions that
can be drawn from it.
In this thesis we outlined the general methodology of obtaining risk estimates. We
also discussed their statistical properties such as bias and standard error. We concluded
that when one has some reasonable cell counts, the bias of the estimates is small and
the standard error can be estimated. Compared to the conventional tests of association
and trend, which only indicate a significant departure from the null hypothesis, being able
to estimate the risk allows one to decide the direction of association. We consider this
an added advantage and argue that it is better to deal with risks, because they are more
interpretable. Ease of modelling should not be a general justification for using odds,
although it can be in some cases.
The assumption of prevalence is key to the estimation of risks. Deviations of the true
prevalence from the assumed value will not affect the statistical significance of the tests,
but they will influence the estimates and thus the (perceived) effect size, i.e. the practical
significance. This judgment of practical importance is beyond statistical analysis, but we
are well advised to be aware of its place in scientific research.
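The sensitivity of the risk estimates to the assumed prevalence is easy to illustrate. In the sketch below, with hypothetical counts and three assumed values of P, the estimated risks from (2.1) change with P while the odds ratio between two subclasses stays the same, which is why the tests are unaffected. The function name is ours.

```python
def risk_estimate(n1j, n1, n2j, n2, P):
    """Estimator (2.1) for a single subclass."""
    num = P * n2j / n2
    return num / (num + (1 - P) * n1j / n1)

controls = (1500, 1200, 300)   # hypothetical counts
cases = (800, 900, 300)
n1, n2 = sum(controls), sum(cases)

results = {}
for P in (0.02, 0.039, 0.06):
    risks = [risk_estimate(c, n1, d, n2, P) for c, d in zip(controls, cases)]
    # Odds ratio of subclass 3 against subclass 1, computed from the estimated risks.
    odds_ratio = (risks[2] / (1 - risks[2])) / (risks[0] / (1 - risks[0]))
    results[P] = (risks, odds_ratio)
    print(P, [round(r, 4) for r in risks], round(odds_ratio, 4))
```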
Our methods apply to general case-control studies, and to GWA studies in particular. We are aware that GWA studies, or genetic linkage and association studies
in general, involve more than investigating individual SNPs in isolation, and that the GWA
approach has intrinsic problems such as multiple testing that can result in an unprecedented potential for false-positive results [14]. Nonetheless, we think that incorporation
of risk assessment into genetic analysis might present an alternative way of dealing with
data generated from such studies.
Finally, it is worth noting that while prospective studies can estimate the relative
risk with respect to becoming sick (incidence), retrospective studies can estimate only
the relative risk with respect to being sick (prevalence). The practical importance of
this distinction may be enormous. For a disease that is usually associated with increased
mortality risk, such as myocardial infarction, retrospective studies that compare existing
cases with controls on one or more risk factors cannot in principle relate to all cases
occurring during the past n years, because some of these incident cases may no longer
be alive. Hence in such situations the (relative) risks derived from retrospective studies
may be quite different from those obtained in prospective studies.
Bibliography
[1] R. L. Prentice and R. Pyke. Logistic disease incidence models and case-control studies. Biometrika, 66:403–411, 1979.
[2] N. Pearce. What does the odds ratio estimate in a case-control study? International Journal of Epidemiology, 22:1189–1192, 1993.
[3] R. D. Wiggins and M. Slater. Estimating probabilities from retrospective data with an application to cot death in Lambeth. Biometrics, 37:377–382, 1981.
[4] K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50:157–175, 1900.
[5] W. G. Cochran. Some methods for strengthening the common χ² tests. Biometrics, 10:417–451, 1954.
[6] P. Armitage. Tests for linear trends in proportions and frequencies. Biometrics, 11:375–386, 1955.
[7] The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661–678, 2007.
[8] E. L. Massó González, S. Johansson, M.-A. Wallander, and L. A. García Rodríguez. Trends in the prevalence and incidence of diabetes in the UK: 1996–2005. Journal of Epidemiology and Community Health, 63:332–336, 2009.
[9] S. F. A. Grant et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nature Genetics, 38:320–323, 2006.
[10] F. Yi, P. L. Brubaker, and T. Jin. TCF-4 mediates cell type-specific regulation of proglucagon gene expression by β-catenin and glycogen synthase kinase-3β. Journal of Biological Chemistry, 280:1457–1464, 2005.
[11] T. M. Frayling et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science, 316:889–894, 2007.
[12] M. Ubeda, J. M. Rukstalis, and J. F. Habener. Inhibition of cyclin-dependent kinase 5 activity protects pancreatic beta cells from glucotoxicity. Journal of Biological Chemistry, 281(39):28858–28864, 2006.
[13] R. Doll and A. Hill. Smoking and carcinoma of the lung; preliminary report. British Medical Journal, 2:739–748, 1950.
[14] T. A. Pearson and T. A. Manolio. How to interpret a genome-wide association study. JAMA, 299:1335–1344, 2008.
Appendix
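Proof of (2.4). From (2.3) and ξ = P/(1 − P) we have that
\begin{align*}
\hat{r}_j &= \frac{P\, n_{1\cdot}\, n_{2j}}{P\, n_{1\cdot}\, n_{2j} + (1-P)\, n_{2\cdot}\, n_{1j}} \\
&= \frac{\xi\, n_{1\cdot}\, n_{2j}}{\xi\, n_{1\cdot}\, n_{2j} + n_{2\cdot}\, n_{1j}} \\
&= \frac{n_{1\cdot}\, n_{2j}\, \xi}{n_{2\cdot}} \cdot \frac{n_{2\cdot}\, n_{1j}}{n_{2\cdot}\, n_{1j} + n_{1\cdot}\, n_{2j}\, \xi} \cdot \frac{1}{n_{1j}} \\
&= \frac{n_{1\cdot}\, n_{2j}\, \xi}{n_{2\cdot}} \left( \frac{n_{2\cdot}\, n_{1j} + n_{1\cdot}\, n_{2j}\, \xi}{n_{2\cdot}\, n_{1j}} \right)^{-1} \frac{1}{n_{1j}}.
\end{align*}
Thus
\begin{align*}
E(\hat{r}_j) &= (1-\pi_{1j})^{n_{1\cdot}} + \sum_{k=1}^{n_{1\cdot}} \frac{P\, n_{1\cdot}\, n_{2j}}{P\, n_{1\cdot}\, n_{2j} + (1-P)\, n_{2\cdot}\, k} \binom{n_{1\cdot}}{k} \pi_{1j}^{k} (1-\pi_{1j})^{n_{1\cdot}-k} \\
&= (1-\pi_{1j})^{n_{1\cdot}} + (1-\pi_{1j})^{n_{1\cdot}}\, \frac{n_{1\cdot}}{n_{2\cdot}}\, n_{2j}\, \xi \sum_{k=1}^{n_{1\cdot}} \left( 1 + \frac{n_{1\cdot}\, n_{2j}}{n_{2\cdot}\, k}\, \xi \right)^{-1} \frac{1}{k} \binom{n_{1\cdot}}{k} \left( \frac{\pi_{1j}}{1-\pi_{1j}} \right)^{k}.
\end{align*}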
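As a numerical sanity check on the algebra above, the sketch below evaluates the expectation both as the binomial-weighted sum of the estimator and in the factored form involving ξ, and compares the two. It assumes, as the displayed expression suggests, that n2j is held fixed while n1j is binomial with parameters n1. and π1j. The function names and the input values are hypothetical.

```python
from math import comb


def expected_risk_direct(P, pi1j, n1_total, n2_total, n2j):
    """k = 0 term plus the binomial-weighted sum of the estimator over k >= 1."""
    total = (1 - pi1j) ** n1_total  # the estimator equals 1 when n1j = 0
    for k in range(1, n1_total + 1):
        r_k = P * n1_total * n2j / (P * n1_total * n2j + (1 - P) * n2_total * k)
        total += r_k * comb(n1_total, k) * pi1j**k * (1 - pi1j) ** (n1_total - k)
    return total


def expected_risk_factored(P, pi1j, n1_total, n2_total, n2j):
    """Same quantity written with xi = P/(1-P) and the odds pi1j/(1-pi1j)."""
    xi = P / (1 - P)
    odds = pi1j / (1 - pi1j)
    c = n1_total * n2j * xi / n2_total
    s = sum(
        comb(n1_total, k) * odds**k / (k * (1 + c / k))
        for k in range(1, n1_total + 1)
    )
    return (1 - pi1j) ** n1_total * (1 + c * s)


# Hypothetical inputs, for illustration only.
args = dict(P=0.05, pi1j=0.2, n1_total=20, n2_total=30, n2j=5)
difference = expected_risk_direct(**args) - expected_risk_factored(**args)
print(abs(difference) < 1e-12)
```

Agreement up to floating-point error only confirms that the two displayed forms are algebraically equivalent; it says nothing about the bias of the estimator itself.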
Proof of (2.5). Dropping the subscript for π1j and writing n1. = N , we can prove (2.5)
by induction. When N = 1, (2.5) clearly holds. Suppose it holds when N = n − 1, that
is,
\[
\sum_{k=1}^{n-1} \frac{1}{k} \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k}
= \sum_{k=1}^{n-1} \frac{1}{k(1-\pi)^{k}} - \sum_{k=1}^{n-1} \frac{1}{k}.
\]
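Noting that $\binom{n}{k} = \frac{n}{n-k} \binom{n-1}{k}$, we have when N = n,
\begin{align*}
\sum_{k=1}^{n} \frac{1}{k} \binom{n}{k} \left( \frac{\pi}{1-\pi} \right)^{k}
&= \sum_{k=1}^{n-1} \frac{1}{k} \binom{n}{k} \left( \frac{\pi}{1-\pi} \right)^{k} + \frac{1}{n} \left( \frac{\pi}{1-\pi} \right)^{n} \\
&= \sum_{k=1}^{n-1} \frac{n}{k(n-k)} \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k} + \frac{1}{n} \left( \frac{\pi}{1-\pi} \right)^{n} \\
&= \sum_{k=1}^{n-1} \left( \frac{1}{k} + \frac{1}{n-k} \right) \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k} + \frac{1}{n} \left( \frac{\pi}{1-\pi} \right)^{n} \\
&= \sum_{k=1}^{n-1} \frac{1}{k(1-\pi)^{k}} - \sum_{k=1}^{n-1} \frac{1}{k} + \sum_{k=1}^{n-1} \frac{1}{n-k} \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k} + \frac{1}{n} \left( \frac{\pi}{1-\pi} \right)^{n}. \tag{4.1}
\end{align*}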
At the same time,
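\begin{align*}
\left( 1 + \frac{\pi}{1-\pi} \right)^{n}
&= \sum_{k=0}^{n} \binom{n}{k} \left( \frac{\pi}{1-\pi} \right)^{k} \\
&= 1 + \sum_{k=1}^{n} \binom{n}{k} \left( \frac{\pi}{1-\pi} \right)^{k} \\
&= 1 + \sum_{k=1}^{n-1} \frac{n}{n-k} \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k} + \left( \frac{\pi}{1-\pi} \right)^{n} \\
&= \frac{1}{(1-\pi)^{n}}.
\end{align*}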
Hence
\[
\sum_{k=1}^{n-1} \frac{1}{n-k} \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k}
= \frac{1-\pi^{n}}{n(1-\pi)^{n}} - \frac{1}{n}.
\]
Substituting this into (4.1) gives the desired result.
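The identity proved by this induction can also be checked in exact arithmetic for small N. The sketch below is only an illustration under the reading of the identity given by the induction hypothesis above, with N in place of n − 1; the function names `y1_lhs` and `y1_rhs` are hypothetical.

```python
from fractions import Fraction
from math import comb


def y1_lhs(N, pi):
    """sum_{k=1}^{N} (1/k) * C(N, k) * (pi / (1 - pi))^k, computed exactly."""
    odds = pi / (1 - pi)
    return sum(Fraction(comb(N, k), k) * odds**k for k in range(1, N + 1))


def y1_rhs(N, pi):
    """sum_{k=1}^{N} 1 / (k (1 - pi)^k)  minus  sum_{k=1}^{N} 1/k."""
    return sum(
        Fraction(1, k) / (1 - pi) ** k - Fraction(1, k) for k in range(1, N + 1)
    )


pi = Fraction(1, 3)  # any rational value strictly between 0 and 1 works here
print(all(y1_lhs(N, pi) == y1_rhs(N, pi) for N in range(1, 11)))
```

Using `Fraction` rather than floating point means the two sides are compared exactly, so a mismatch would indicate an error in the identity and not rounding.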
Proof of (2.6). The proof for Y2 is similar. First observe that the equality holds when
N = 1. Next suppose it holds when N = n − 1; then when N = n,
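\begin{align*}
\sum_{k=1}^{n} \frac{1}{k^{2}} \binom{n}{k} \left( \frac{\pi}{1-\pi} \right)^{k}
&= \sum_{k=1}^{n-1} \frac{n}{(n-k)k^{2}} \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k} + \frac{\pi^{n}}{n^{2}(1-\pi)^{n}} \\
&= \sum_{k=1}^{n-1} \left[ \frac{1}{k^{2}} + \frac{1}{k(n-k)} \right] \binom{n-1}{k} \left( \frac{\pi}{1-\pi} \right)^{k} + \frac{\pi^{n}}{n^{2}(1-\pi)^{n}} \\
&= \sum_{k=1}^{n-1} \sum_{l=k}^{n-1} \frac{1}{kl(1-\pi)^{k}} - \frac{1}{2} \left[ \left( \sum_{k=1}^{n-1} \frac{1}{k} \right)^{2} + \sum_{k=1}^{n-1} \frac{1}{k^{2}} \right] \\
&\quad + \sum_{k=1}^{n-1} \frac{1}{k(n-k)} \binom{n-1}{k} \frac{\pi^{k}}{(1-\pi)^{k}} + \frac{\pi^{n}}{n^{2}(1-\pi)^{n}}.
\end{align*}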
The last two terms can be written as
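\begin{align*}
&\sum_{k=1}^{n-1} \frac{1}{k(n-k)} \binom{n-1}{k} \frac{\pi^{k}}{(1-\pi)^{k}} + \frac{\pi^{n}}{n^{2}(1-\pi)^{n}} \\
&\qquad = \sum_{k=1}^{n} \frac{1}{nk} \binom{n}{k} \frac{\pi^{k}}{(1-\pi)^{k}} \\
&\qquad = \frac{1}{n} \left( \sum_{k=1}^{n} \frac{1}{k(1-\pi)^{k}} - \sum_{k=1}^{n} \frac{1}{k} \right),
\end{align*}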
where we have made use of the results from Y1 . Substituting this into the above yields
the desired equality.
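The induction step for Y2 can be checked in the same way. The sketch below assumes that Y2 denotes the sum of (1/k²) C(N, k) (π/(1 − π))^k over k = 1, ..., N, as in the first line of the derivation above, and verifies two things exactly for small N: that moving from N = n − 1 to N = n adds 1/n times the closed form obtained for Y1, and that accumulating these increments gives a double sum over k ≤ l. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def y2(N, pi):
    """sum_{k=1}^{N} (1/k^2) * C(N, k) * (pi / (1 - pi))^k, computed exactly."""
    odds = pi / (1 - pi)
    return sum(Fraction(comb(N, k), k * k) * odds**k for k in range(1, N + 1))


def y1_closed(N, pi):
    """Closed form for Y1: sum_{k=1}^{N} (1/k) * (1/(1 - pi)^k - 1)."""
    return sum(Fraction(1, k) * (1 / (1 - pi) ** k - 1) for k in range(1, N + 1))


def y2_double_sum(N, pi):
    """sum over 1 <= k <= l <= N of (1/(k*l)) * (1/(1 - pi)^k - 1)."""
    return sum(
        Fraction(1, k * l) * (1 / (1 - pi) ** k - 1)
        for k in range(1, N + 1)
        for l in range(k, N + 1)
    )


pi = Fraction(2, 7)
step_ok = all(y2(n, pi) - y2(n - 1, pi) == y1_closed(n, pi) / n for n in range(2, 10))
closed_ok = all(y2(N, pi) == y2_double_sum(N, pi) for N in range(1, 10))
print(step_ok, closed_ok)
```

The double sum here is one way of writing the accumulated increments; its π-free part, the sum of 1/(kl) over k ≤ l, equals half of the squared harmonic sum plus half of the sum of 1/k².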