5.3 INFERENCE ABOUT PARAMETERS OF LOGISTIC REGRESSION MODELS


The mechanics of ML estimation and model fitting for logistic regression are special cases of the GLM fitting results of Sections 4.1 and 4.5. From (4.10), the likelihood equations for a GLM are

\[
\sum_{i=1}^{N} \frac{(y_i - \mu_i)\,x_{ij}}{\text{var}(y_i)}\,\frac{\partial \mu_i}{\partial \eta_i} = 0, \quad j = 1, 2, \ldots, p.
\]

For a GLM for binary data, $n_i y_i \sim \text{bin}(n_i, \pi_i)$ with $\pi_i = \mu_i = F\left(\sum_j \beta_j x_{ij}\right) = F(\eta_i)$ for some standard cdf $F$. Thus, $\partial \mu_i / \partial \eta_i = f(\eta_i)$, where $f$ is the pdf corresponding to $F$.

Since the binomial proportion $y_i$ has $\text{var}(y_i) = \pi_i(1 - \pi_i)/n_i$, the likelihood equations are

\[
\sum_{i=1}^{N} \frac{n_i (y_i - \pi_i)\,x_{ij}}{\pi_i(1 - \pi_i)}\,f(\eta_i) = 0, \quad j = 1, 2, \ldots, p.
\]

That is, in terms of $\boldsymbol{\beta}$,

\[
\sum_{i=1}^{N} \frac{n_i \left[ y_i - F\left(\sum_j \beta_j x_{ij}\right) \right] x_{ij}\, f\left(\sum_j \beta_j x_{ij}\right)}{F\left(\sum_j \beta_j x_{ij}\right)\left[ 1 - F\left(\sum_j \beta_j x_{ij}\right) \right]} = 0, \quad j = 1, 2, \ldots, p. \tag{5.4}
\]
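As a concrete illustration of (5.4), the following sketch evaluates the score vector numerically for one standard choice of $F$, the standard normal cdf (a probit model). The model matrix, sample sizes, proportions, and trial value of $\boldsymbol{\beta}$ are illustrative assumptions, not data from the text.

```python
# Sketch: evaluating the binary-GLM score equations (5.4) numerically.
# The probit choice of F and the toy data are illustrative assumptions.
import numpy as np
from scipy.stats import norm

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # N x p model matrix
n = np.array([10, 12, 9, 11])          # binomial sample sizes n_i
y = np.array([0.2, 0.25, 0.5, 0.8])    # observed proportions y_i
beta = np.array([-1.0, 0.6])           # trial value of the parameter vector

eta = X @ beta        # linear predictors eta_i = sum_j beta_j x_ij
pi = norm.cdf(eta)    # pi_i = F(eta_i); here F is the standard normal cdf
f = norm.pdf(eta)     # f(eta_i), the pdf corresponding to F

# Score for each j: sum_i n_i (y_i - pi_i) x_ij f(eta_i) / [pi_i (1 - pi_i)]
score = X.T @ (n * (y - pi) * f / (pi * (1 - pi)))
print(score)          # equals the zero vector only at the ML estimate
```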

5.3.1 Logistic Regression Likelihood Equations

For logistic regression models for binary data,

\[
F(z) = \frac{e^z}{1 + e^z}, \qquad f(z) = \frac{e^z}{(1 + e^z)^2} = F(z)[1 - F(z)].
\]

The likelihood equations then simplify to

\[
\sum_{i=1}^{N} n_i (y_i - \pi_i)\,x_{ij} = 0, \quad j = 1, \ldots, p. \tag{5.5}
\]

Let $\mathbf{X}$ denote the $N \times p$ model matrix of values of $\{x_{ij}\}$. Let $\mathbf{s}$ denote the binomial vector of "success" totals with elements $s_i = n_i y_i$. The likelihood equations (5.5) have the form

\[
\mathbf{X}^{\mathsf{T}} \mathbf{s} = \mathbf{X}^{\mathsf{T}} E(\mathbf{s}).
\]

This equation illustrates the fundamental result for GLMs with canonical link function, shown in Equation (4.27), that the likelihood equations equate the sufficient statistics to their expected values.
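This identity can be checked numerically once the model is fitted. Below is a minimal Fisher-scoring sketch for grouped logistic regression; the toy counts are illustrative assumptions, and in practice one would use an off-the-shelf GLM fitter. The loop simply makes the likelihood equations (5.5) visible.

```python
# Sketch: Fisher scoring for logistic regression (grouped data), then a
# check of the canonical-link identity X^T s = X^T E_hat(s).
# The toy data are illustrative assumptions.
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
n = np.array([10, 12, 9, 11])
s = np.array([2, 3, 5, 9])            # success totals s_i = n_i * y_i

beta = np.zeros(X.shape[1])
for _ in range(25):
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))       # pi_i = F(eta_i), logistic cdf
    W = np.diag(n * pi * (1.0 - pi))             # w_i = n_i pi_i (1 - pi_i)
    score = X.T @ (s - n * pi)                   # likelihood equations (5.5)
    beta += np.linalg.solve(X.T @ W @ X, score)  # Fisher-scoring update

pi_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
print(X.T @ s, X.T @ (n * pi_hat))   # the two sides agree at the MLE
```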

5.3.2 Covariance Matrix of Logistic Parameter Estimators

The ML estimator $\hat{\boldsymbol{\beta}}$ has a large-sample normal distribution around $\boldsymbol{\beta}$ with covariance matrix equal to the inverse of the information matrix. From (4.13), the information matrix for a GLM has the form $\mathbf{X}^{\mathsf{T}} \mathbf{W} \mathbf{X}$, where $\mathbf{W}$ is the diagonal matrix with elements

\[
w_i = (\partial \mu_i / \partial \eta_i)^2 / \text{var}(y_i).
\]

For binomial observations, $\mu_i = \pi_i$ and $\text{var}(y_i) = \pi_i(1 - \pi_i)/n_i$. For the logistic regression model, $\eta_i = \log[\pi_i/(1 - \pi_i)]$, so that $\partial \eta_i / \partial \pi_i = 1/[\pi_i(1 - \pi_i)]$. Thus, $w_i = n_i \pi_i(1 - \pi_i)$, and for large samples, the estimated covariance matrix of $\hat{\boldsymbol{\beta}}$ is

\[
\widehat{\text{var}}(\hat{\boldsymbol{\beta}}) = \{\mathbf{X}^{\mathsf{T}} \hat{\mathbf{W}} \mathbf{X}\}^{-1} = \{\mathbf{X}^{\mathsf{T}} \text{Diag}[n_i \hat{\pi}_i(1 - \hat{\pi}_i)] \mathbf{X}\}^{-1}, \tag{5.6}
\]

where $\hat{\mathbf{W}} = \text{Diag}[n_i \hat{\pi}_i(1 - \hat{\pi}_i)]$ denotes the $N \times N$ diagonal matrix having $\{n_i \hat{\pi}_i(1 - \hat{\pi}_i)\}$ on the main diagonal. "Large samples" here means a large number of Bernoulli trials, that is, large $N$ for ungrouped data and large $n = \sum_i n_i$ for grouped data, in each case with $p$ fixed. The square roots of the main diagonal elements of Equation (5.6) are estimated standard errors of $\hat{\boldsymbol{\beta}}$.
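To make (5.6) concrete, this sketch computes the estimated covariance matrix and standard errors for the same illustrative toy data, refitting with the Fisher-scoring loop from the previous sketch.

```python
# Sketch: estimated covariance matrix (5.6) and standard errors for a
# fitted logistic model. The toy data are illustrative assumptions.
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
n = np.array([10, 12, 9, 11])
s = np.array([2, 3, 5, 9])

# Fit by Fisher scoring (as in the previous sketch)
beta = np.zeros(X.shape[1])
for _ in range(25):
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    W = np.diag(n * pi * (1.0 - pi))
    beta += np.linalg.solve(X.T @ W @ X, X.T @ (s - n * pi))

pi_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
W_hat = np.diag(n * pi_hat * (1.0 - pi_hat))   # Diag[n_i pi_hat_i (1 - pi_hat_i)]
cov = np.linalg.inv(X.T @ W_hat @ X)           # {X^T W_hat X}^{-1}, Eq. (5.6)
se = np.sqrt(np.diag(cov))                     # estimated standard errors
print(beta, se)
```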

5.3.3 Statistical Inference: Wald Method is Suboptimal

For statistical inference for logistic regression models, we can use the Wald, likelihood-ratio, or score methods introduced in Section 4.3. For example, to test $H_0$: $\beta_j = 0$, the Wald chi-squared statistic ($df = 1$) uses $(\hat{\beta}_j/\text{SE}_j)^2$, whereas the likelihood-ratio chi-squared statistic uses the difference between the deviances for the simpler model with $\beta_j = 0$ and the full model.

These methods usually give similar results for large sample sizes. However, the Wald method has two disadvantages. First, its results depend on the scale of parameterization. To illustrate, for the null model, $\text{logit}(\pi) = \beta_0$, consider testing $H_0$: $\beta_0 = 0$ (i.e., $\pi = 0.50$) when $ny$ has a $\text{bin}(n, \pi)$ distribution. From the delta method, the asymptotic variance of $\hat{\beta}_0 = \text{logit}(y)$ is $[n\pi(1 - \pi)]^{-1}$. The Wald chi-squared test statistic, which uses the ML estimate of the asymptotic variance, is $(\hat{\beta}_0/\text{SE})^2 = [\text{logit}(y)]^2[ny(1 - y)]$. On the proportion scale, the Wald test statistic is $(y - 0.50)^2/[y(1 - y)/n]$. These are not the same. Evaluations reveal that the logit-scale statistic is too conservative¹ and the proportion-scale statistic is too liberal.
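A short numerical sketch of the scale dependence: for one assumed sample ($n = 25$, $y = 20/25$), the Wald statistic for $H_0$: $\pi = 0.50$ differs on the two scales.

```python
# Sketch: the Wald statistic for H0: pi = 0.50 computed on two scales,
# illustrating that the result depends on the parameterization.
# The values of n and y are illustrative choices.
import numpy as np

n, y = 25, 20 / 25

logit = np.log(y / (1 - y))
wald_logit = logit**2 * (n * y * (1 - y))        # [logit(y)]^2 [n y (1 - y)]
wald_prop = (y - 0.50)**2 / (y * (1 - y) / n)    # proportion-scale version
print(wald_logit, wald_prop)                     # different values, same H0
```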

A second disadvantage is that when a true effect in a binary regression model is very large, the Wald test is less powerful than the other methods and can show aberrant behavior. For this single-binomial example, suppose $n = 25$. Then, $y = 24/25$ is stronger evidence against $H_0$: $\pi = 0.50$ than $y = 23/25$, yet the logit Wald statistic equals 9.7 when $y = 24/25$ and 11.0 when $y = 23/25$. For comparison, the likelihood-ratio statistics are 26.3 and 20.7. As the true effect in a binary regression model increases, for a given sample size the information decreases so quickly that the standard error grows faster than the effect.² The Wald method fails completely when $\hat{\beta}_j = \pm\infty$, a case we discuss in Section 5.4.2.
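This aberrant behavior is easy to reproduce. The sketch below computes the logit-scale Wald and likelihood-ratio statistics for $y = 23/25$ and $y = 24/25$, recovering the values quoted above.

```python
# Sketch reproducing the n = 25 example: the logit-scale Wald statistic
# *decreases* from y = 23/25 to y = 24/25, while the likelihood-ratio
# statistic increases as it should.
import numpy as np

n = 25
for successes in (23, 24):
    y = successes / n
    wald = np.log(y / (1 - y))**2 * (n * y * (1 - y))
    # LR statistic: 2 [ L(pi_hat) - L(0.5) ] for the binomial log-likelihood
    lr = 2 * (successes * np.log(y / 0.5) + (n - successes) * np.log((1 - y) / 0.5))
    print(successes, round(wald, 1), round(lr, 1))
# prints: 23 11.0 20.7  and  24 9.7 26.3
```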

¹ When $H_0$ is true, the probability that a test of nominal size $\alpha$ rejects $H_0$ is less than $\alpha$.
² See Davison (2003, p. 489), Hauck and Donner (1977), and Exercise 5.7.

5.3.4 Conditional Logistic Regression to Eliminate Nuisance Parameters

The total number of binary observations is $n = \sum_{i=1}^{N} n_i$ for grouped data and $n = N$ for ungrouped data. ML estimators of the $p$ parameters of the logistic regression model and standard methods of inference perform well when $n$ is large compared with $p$. Sometimes $n$ is small. Sometimes $p$ grows as $n$ grows, as in highly stratified data in which each stratum has its own model parameter. In either case, improved inference results from using conditional maximum likelihood. This method reduces the parameter space, eliminating nuisance parameters from the likelihood function by conditioning on their sufficient statistics. Inference based on the conditional likelihood can use large-sample asymptotics or small-sample distributions.

We illustrate with a simple case: logistic regression with a single binary explanatory variable $x$ and small $n$. For subject $i$ in an ungrouped data file,

\[
\text{logit}[P(y_i = 1)] = \beta_0 + \beta_1 x_i, \quad i = 1, \ldots, N, \tag{5.7}
\]

where $x_i = 1$ or $x_i = 0$. Usually the log odds ratio $\beta_1$ is the parameter of interest, and $\beta_0$ is a nuisance parameter. For the exponential dispersion family (4.1) with $a(\phi) = 1$, the kernel of the log-likelihood function is $\sum_i y_i \theta_i$. For the logistic model, this is

\[
\sum_{i=1}^{N} y_i \theta_i = \sum_{i=1}^{N} y_i(\beta_0 + \beta_1 x_i) = \beta_0 \sum_{i=1}^{N} y_i + \beta_1 \sum_{i=1}^{N} x_i y_i.
\]

The sufficient statistics are $\sum_i y_i$ for $\beta_0$ and $\sum_i x_i y_i$ for $\beta_1$. The grouped form of the data is summarized with a $2 \times 2$ contingency table. Denote the two independent binomial "success" totals in the table by $s_1$ and $s_2$, having $\text{bin}(n_1, \pi_1)$ and $\text{bin}(n_2, \pi_2)$ distributions, as Table 5.2 shows. To conduct conditional inference about $\beta_1$ while eliminating $\beta_0$, we use the distribution of $\sum_i x_i y_i = s_1$, conditional on $\sum_i y_i = s_1 + s_2$.

Consider testing $H_0$: $\beta_1 = 0$, which corresponds to $H_0$: $\pi_1 = \pi_2$. Under $H_0$, let $\pi = e^{\beta_0}/(1 + e^{\beta_0})$ denote the common value. We eliminate $\beta_0$ by finding $P(s_1 = t \mid s_1 + s_2 = v)$. By the independence of the binomial variates and the fact that their sum is also binomial, under $H_0$

\[
P(s_1 = t, s_2 = u) = \binom{n_1}{t} \pi^t (1 - \pi)^{n_1 - t} \binom{n_2}{u} \pi^u (1 - \pi)^{n_2 - u}, \quad t = 0, \ldots, n_1, \; u = 0, \ldots, n_2,
\]

\[
P(s_1 + s_2 = v) = \binom{n_1 + n_2}{v} \pi^v (1 - \pi)^{n_1 + n_2 - v}, \quad v = 0, 1, \ldots, n_1 + n_2.
\]

Table 5.2  A 2×2 Table for Binary Response and Explanatory Variables

|         | $y = 1$ | $y = 0$     | Total |
|---------|---------|-------------|-------|
| $x = 1$ | $s_1$   | $n_1 - s_1$ | $n_1$ |
| $x = 0$ | $s_2$   | $n_2 - s_2$ | $n_2$ |

So the conditional probability is

\[
P(s_1 = t \mid s_1 + s_2 = v) = \frac{\binom{n_1}{t} \pi^t (1 - \pi)^{n_1 - t} \binom{n_2}{v - t} \pi^{v - t} (1 - \pi)^{n_2 - (v - t)}}{\binom{n_1 + n_2}{v} \pi^v (1 - \pi)^{n_1 + n_2 - v}}
= \frac{\binom{n_1}{t} \binom{n_2}{v - t}}{\binom{n_1 + n_2}{v}}, \quad \max(0, v - n_2) \le t \le \min(n_1, v).
\]

This is the hypergeometric distribution. To test $H_0$: $\beta_1 = 0$ against $H_1$: $\beta_1 > 0$, the P-value is $P(s_1 \ge t \mid s_1 + s_2 = v)$, for observed value $t$ of $s_1$. This probability does not depend on $\beta_0$. We can find it exactly rather than rely on a large-sample approximation. This test was proposed by R. A. Fisher (1935) and is called Fisher's exact test (see Exercise 5.31).
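A minimal sketch of the test: the one-sided P-value follows directly from the hypergeometric distribution just derived. The 2×2 counts here are illustrative assumptions; SciPy's `hypergeom` uses the parameterization (population size $n_1 + n_2$, success states $n_1$, draws $v$).

```python
# Sketch: Fisher's exact test one-sided P-value via the hypergeometric
# distribution derived above. The 2x2 counts are illustrative assumptions.
from scipy.stats import hypergeom

n1, n2 = 10, 12        # group sizes for x = 1 and x = 0
s1, s2 = 8, 3          # observed success totals
v = s1 + s2            # conditioning statistic s1 + s2

# s1 | (s1 + s2 = v) is hypergeometric; in scipy's parameterization
# hypergeom(M=n1+n2, n=n1, N=v) has pmf C(n1,t) C(n2,v-t) / C(n1+n2,v)
dist = hypergeom(M=n1 + n2, n=n1, N=v)
p_value = dist.sf(s1 - 1)       # P(s1 >= observed t), one-sided P-value
print(p_value)
```

For comparison, `scipy.stats.fisher_exact([[s1, n1 - s1], [s2, n2 - s2]], alternative='greater')` returns the same one-sided P-value.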

The conditional approach has the limitation of requiring sufficient statistics for the nuisance parameters. Reduced sufficient statistics exist only with GLMs that use the canonical link. Thus, the conditional approach works for the logistic model but not for binary GLMs that use other link functions. Another limitation is that when some explanatory variables are continuous, the $\{y_i\}$ values may be completely determined by the given sufficient statistics, making the conditional distribution degenerate.
