\[
L(\boldsymbol{\beta}) = \sum_{i=1}^n L_i = \sum_{i=1}^n \log f(y_i;\theta_i,\phi)
= \sum_{i=1}^n \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + \sum_{i=1}^n c(y_i,\phi). \qquad (4.7)
\]
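For instance, a Poisson response has probability mass function $f(y;\theta) = \exp[y\theta - e^{\theta} - \log y!]$, an exponential dispersion family member with natural parameter $\theta = \log\mu$, $b(\theta) = e^{\theta}$, $a(\phi) = 1$, and $c(y,\phi) = -\log y!$; its contribution to (4.7) is $L_i = y_i\theta_i - e^{\theta_i} - \log y_i!$.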
The notation $L(\boldsymbol{\beta})$ reflects the dependence of $\boldsymbol{\theta}$ on the model parameters $\boldsymbol{\beta}$. For the canonical link function, $\theta_i = \sum_j \beta_j x_{ij}$, so when $a(\phi)$ is a fixed constant, the part of the log likelihood involving both the data and the model parameters is
\[
\sum_{i=1}^n y_i\left(\sum_{j=1}^p \beta_j x_{ij}\right) = \sum_{j=1}^p \beta_j\left(\sum_{i=1}^n y_i x_{ij}\right).
\]
Then the sufficient statistics for $\{\beta_j\}$ are $\{\sum_{i=1}^n y_i x_{ij},\ j = 1, \ldots, p\}$.
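As a quick numerical illustration (a minimal numpy sketch with simulated data; the variable names are hypothetical), the identity above means the $\boldsymbol{\beta}$-dependent part of the log likelihood touches the data only through $X^T y$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # model matrix
y = rng.poisson(lam=2.0, size=n).astype(float)                  # any response values work here
beta = rng.normal(size=p)                                       # arbitrary parameter values

lhs = np.sum(y * (X @ beta))   # sum_i y_i (sum_j beta_j x_ij)
suff = X.T @ y                 # sufficient statistics: sum_i y_i x_ij for each j
rhs = beta @ suff              # sum_j beta_j (sum_i y_i x_ij)
print(np.isclose(lhs, rhs))    # True: the data enter only through X^T y
```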
4.2.1 Likelihood Equations for a GLM

For a GLM $\eta_i = \sum_j \beta_j x_{ij} = g(\mu_i)$ with link function $g$, the likelihood equations are
\[
\partial L(\boldsymbol{\beta})/\partial\beta_j = \sum_{i=1}^n \partial L_i/\partial\beta_j = 0, \quad \text{for all } j.
\]
To differentiate the log likelihood (4.7), we use the chain rule,
\[
\frac{\partial L_i}{\partial\beta_j} = \frac{\partial L_i}{\partial\theta_i}\,\frac{\partial\theta_i}{\partial\mu_i}\,\frac{\partial\mu_i}{\partial\eta_i}\,\frac{\partial\eta_i}{\partial\beta_j}. \qquad (4.8)
\]
Since $\partial L_i/\partial\theta_i = [y_i - b'(\theta_i)]/a(\phi)$, and since $\mu_i = b'(\theta_i)$ and $\mathrm{var}(y_i) = b''(\theta_i)a(\phi)$ from (4.3) and (4.4),
\[
\partial L_i/\partial\theta_i = (y_i - \mu_i)/a(\phi), \qquad \partial\mu_i/\partial\theta_i = b''(\theta_i) = \mathrm{var}(y_i)/a(\phi).
\]
Also, since $\eta_i = \sum_{j=1}^p \beta_j x_{ij}$, $\partial\eta_i/\partial\beta_j = x_{ij}$. Finally, since $\eta_i = g(\mu_i)$, $\partial\mu_i/\partial\eta_i$ depends on the link function for the model. In summary, substituting into (4.8) gives us
\[
\frac{\partial L_i}{\partial\beta_j} = \frac{\partial L_i}{\partial\theta_i}\,\frac{\partial\theta_i}{\partial\mu_i}\,\frac{\partial\mu_i}{\partial\eta_i}\,\frac{\partial\eta_i}{\partial\beta_j}
= \frac{(y_i - \mu_i)}{a(\phi)}\,\frac{a(\phi)}{\mathrm{var}(y_i)}\,\frac{\partial\mu_i}{\partial\eta_i}\,x_{ij}
= \frac{(y_i - \mu_i)x_{ij}}{\mathrm{var}(y_i)}\,\frac{\partial\mu_i}{\partial\eta_i}. \qquad (4.9)
\]
Summing over the $n$ observations yields the likelihood equations.
Likelihood equations for a GLM:
\[
\frac{\partial L(\boldsymbol{\beta})}{\partial\beta_j} = \sum_{i=1}^n \frac{(y_i - \mu_i)x_{ij}}{\mathrm{var}(y_i)}\,\frac{\partial\mu_i}{\partial\eta_i} = 0, \quad j = 1, 2, \ldots, p, \qquad (4.10)
\]
where $\eta_i = \sum_{j=1}^p \beta_j x_{ij} = g(\mu_i)$ for link function $g$.
Let $V$ denote the diagonal matrix of variances of the observations, and let $D$ denote the diagonal matrix with elements $\partial\mu_i/\partial\eta_i$. For the GLM expression $\boldsymbol{\eta} = X\boldsymbol{\beta}$ with a model matrix $X$, these likelihood equations have the form
\[
X^T D V^{-1}(y - \boldsymbol{\mu}) = 0. \qquad (4.11)
\]
Although $\boldsymbol{\beta}$ does not appear in these equations, it is there implicitly through $\boldsymbol{\mu}$, since $\mu_i = g^{-1}\left(\sum_{j=1}^p \beta_j x_{ij}\right)$. Different link functions yield different sets of equations. The likelihood equations are nonlinear functions of $\boldsymbol{\beta}$ that must be solved iteratively. We defer details to Section 4.5.
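As an illustration (a hedged numpy/statsmodels sketch with simulated data, not the book's code; sm.GLM and sm.families.Binomial are the statsmodels calls assumed here), we can evaluate the score vector in (4.11) for a logistic GLM at the fitted ML estimate, where it should vanish up to convergence error:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))                 # model matrix with intercept
p_true = 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0, 0.8]))))
y = rng.binomial(1, p_true).astype(float)

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
mu = fit.fittedvalues                                        # mu_i = g^{-1}(eta_i)
V = np.diag(mu * (1 - mu))                                   # var(y_i) for binary y_i
D = np.diag(mu * (1 - mu))                                   # dmu/deta for the logit link
score = X.T @ D @ np.linalg.inv(V) @ (y - mu)                # left side of (4.11)
print(np.allclose(score, 0, atol=1e-6))                      # ~0 at the MLE
```

Because the logit link is canonical here, $D = V$ and the score reduces to $X^T(y - \boldsymbol{\mu})$, foreshadowing the simplification in the next subsection.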
4.2.2 Likelihood Equations for Poisson Loglinear Model

For count data, one possible GLM assumes a Poisson random component and uses the log-link function. The Poisson loglinear model is $\log(\mu_i) = \sum_{j=1}^p \beta_j x_{ij}$. For the log link, $\eta_i = \log\mu_i$, so $\mu_i = \exp(\eta_i)$ and $\partial\mu_i/\partial\eta_i = \exp(\eta_i) = \mu_i$. Since $\mathrm{var}(y_i) = \mu_i$, the likelihood equations (4.10) simplify to
\[
\sum_{i=1}^n (y_i - \mu_i)x_{ij} = 0, \quad j = 1, 2, \ldots, p. \qquad (4.12)
\]
These equate the sufficient statistics $\{\sum_i y_i x_{ij}\}$ for $\boldsymbol{\beta}$ to their expected values. Section 4.5.5 shows that these equations occur for GLMs that use the canonical link function.
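In matrix form, (4.12) says $X^T y = X^T\hat{\boldsymbol{\mu}}$ at the ML fit. A short check with simulated data (again a statsmodels-based sketch under the same assumptions as above):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = rng.poisson(np.exp(X @ np.array([0.3, 0.5, -0.4])))      # Poisson counts

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(np.allclose(X.T @ y, X.T @ fit.fittedvalues))          # True: X^T y = X^T mu_hat
```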
4.2.3 The Key Role of the Mean–Variance Relation

Interestingly, the likelihood equations (4.10) depend on the distribution of $y_i$ only through $\mu_i$ and $\mathrm{var}(y_i)$. The variance itself depends on the mean through a functional form²
\[
\mathrm{var}(y_i) = v(\mu_i),
\]
for some function $v$. For example, $v(\mu_i) = \mu_i$ for the Poisson, $v(\mu_i) = \mu_i(1 - \mu_i)/n_i$ for the binomial proportion, and $v(\mu_i) = \sigma^2$ (i.e., constant) for the normal.

When the distribution of $y_i$ is in the exponential dispersion family, the relation between the mean and the variance characterizes³ the distribution. For instance, if $y_i$ has distribution in the exponential dispersion family and if $v(\mu_i) = \mu_i$, then necessarily $y_i$ has the Poisson distribution.
4.2.4 Large-Sample Normal Distribution of Model Parameter Estimators

From a fundamental property of maximum likelihood, under standard regularity conditions⁴, for large $n$ the ML estimator $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$ for a GLM is efficient and has an approximate normal distribution.
² We express the variance of $y$ as $v(\mu)$ to emphasize that it is a function of the mean.
³ See Jørgensen (1987), Tweedie (1947), and Wedderburn (1974).
⁴ See Cox and Hinkley (1974, p. 281). Mainly, $\boldsymbol{\beta}$ falls in the interior of the parameter space and $p$ is fixed as $n$ increases.
We next use the log-likelihood function for a GLM to find the covariance matrix of that distribution. The covariance matrix is the inverse of the information matrix, which has elements $E[-\partial^2 L(\boldsymbol{\beta})/\partial\beta_h\partial\beta_j]$. The estimator $\hat{\boldsymbol{\beta}}$ is more precise when the log-likelihood function has greater curvature at $\boldsymbol{\beta}$. To find the covariance matrix, for the contribution $L_i$ to the log likelihood we use the helpful result
\[
E\left(-\frac{\partial^2 L_i}{\partial\beta_h\partial\beta_j}\right) = E\left[\left(\frac{\partial L_i}{\partial\beta_h}\right)\left(\frac{\partial L_i}{\partial\beta_j}\right)\right],
\]
which holds for distributions in the exponential dispersion family. Thus, using (4.9),
\[
E\left(-\frac{\partial^2 L_i}{\partial\beta_h\partial\beta_j}\right)
= E\left[\frac{(y_i - \mu_i)x_{ih}}{\mathrm{var}(y_i)}\frac{\partial\mu_i}{\partial\eta_i}\,\frac{(y_i - \mu_i)x_{ij}}{\mathrm{var}(y_i)}\frac{\partial\mu_i}{\partial\eta_i}\right]
= \frac{x_{ih}x_{ij}}{\mathrm{var}(y_i)}\left(\frac{\partial\mu_i}{\partial\eta_i}\right)^2.
\]
Since $L(\boldsymbol{\beta}) = \sum_{i=1}^n L_i$,
\[
E\left(-\frac{\partial^2 L(\boldsymbol{\beta})}{\partial\beta_h\partial\beta_j}\right) = \sum_{i=1}^n \frac{x_{ih}x_{ij}}{\mathrm{var}(y_i)}\left(\frac{\partial\mu_i}{\partial\eta_i}\right)^2.
\]
Let $W$ be the diagonal matrix with main-diagonal elements
\[
w_i = \frac{(\partial\mu_i/\partial\eta_i)^2}{\mathrm{var}(y_i)}.
\]
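For instance, for a binomial proportion with the logit link, $\partial\mu_i/\partial\eta_i = \mu_i(1-\mu_i)$ and $\mathrm{var}(y_i) = \mu_i(1-\mu_i)/n_i$, so $w_i = n_i\mu_i(1-\mu_i)$.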
Then, generalizing from the typical element of the information matrix $\mathcal{I}$ to the entire matrix, with the model matrix $X$,
\[
\mathcal{I} = X^T W X. \qquad (4.13)
\]
The form of $W$, and hence of $\mathcal{I}$, depends on the link function $g$, since $\partial\eta_i/\partial\mu_i = g'(\mu_i)$.
In summary,

Asymptotic distribution of $\hat{\boldsymbol{\beta}}$ for GLM $\boldsymbol{\eta} = X\boldsymbol{\beta}$:
\[
\hat{\boldsymbol{\beta}} \text{ has an approximate } N[\boldsymbol{\beta},\,(X^T W X)^{-1}] \text{ distribution}, \qquad (4.14)
\]
where $W$ is the diagonal matrix with elements $w_i = (\partial\mu_i/\partial\eta_i)^2/\mathrm{var}(y_i)$.

The asymptotic covariance matrix is estimated by $\widehat{\mathrm{var}}(\hat{\boldsymbol{\beta}}) = (X^T\hat{W}X)^{-1}$, where $\hat{W}$ is $W$ evaluated at $\hat{\boldsymbol{\beta}}$.
For example, the Poisson loglinear model has the GLM form $\log\boldsymbol{\mu} = X\boldsymbol{\beta}$. For this case, $\eta_i = \log(\mu_i)$, so $\partial\eta_i/\partial\mu_i = 1/\mu_i$. Thus, $w_i = (\partial\mu_i/\partial\eta_i)^2/\mathrm{var}(y_i) = \mu_i$, and in the asymptotic covariance matrix (4.14) of $\hat{\boldsymbol{\beta}}$, $W$ is the diagonal matrix with the elements of $\boldsymbol{\mu}$ on the main diagonal.
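A numerical check (a statsmodels-based sketch with simulated counts; it assumes the fitted object's cov_params() reports $(X^T\hat{W}X)^{-1}$, which holds for statsmodels' default GLM settings):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = rng.poisson(np.exp(X @ np.array([0.2, 0.4, -0.3])))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
W_hat = np.diag(fit.fittedvalues)            # w_i = mu_i-hat for the log link
cov = np.linalg.inv(X.T @ W_hat @ X)         # (X^T W-hat X)^{-1}, per (4.14)
print(np.allclose(cov, fit.cov_params()))    # True, up to convergence error
```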
For some GLMs, the parameter vector partitions into the parameters $\boldsymbol{\beta}$ for the linear predictor and other parameters $\boldsymbol{\phi}$ (such as a dispersion parameter) needed to specify the model completely. Sometimes⁵, $E(\partial^2 L/\partial\beta_j\partial\phi_k) = 0$ for each $j$ and $k$. Then the inverse of the expected information matrix also has 0 elements connecting each $\beta_j$ with each $\phi_k$. Because this inverse is the asymptotic covariance matrix, $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\phi}}$ are then asymptotically independent. The parameters $\boldsymbol{\beta}$ and $\boldsymbol{\phi}$ are said to be orthogonal. This is the generalization to GLMs of the notion of orthogonal parameters for linear models (Cox and Reid 1987). For the exponential dispersion family (4.1), $\theta$ and $\phi$ are orthogonal parameters.
4.2.5 Delta Method Yields Covariance Matrix for Fitted Values
The estimated linear predictor relates to $\hat{\boldsymbol{\beta}}$ by $\hat{\boldsymbol{\eta}} = X\hat{\boldsymbol{\beta}}$. Thus, for large samples, its estimated covariance matrix is
\[
\widehat{\mathrm{var}}(\hat{\boldsymbol{\eta}}) = X\,\widehat{\mathrm{var}}(\hat{\boldsymbol{\beta}})\,X^T \approx X(X^T\hat{W}X)^{-1}X^T.
\]
We can obtain the asymptotic $\widehat{\mathrm{var}}(\hat{\boldsymbol{\mu}})$ from $\widehat{\mathrm{var}}(\hat{\boldsymbol{\eta}})$ by the delta method, which gives approximate variances using linearizations from a Taylor-series expansion. For example, in the univariate case with a smooth function $h$, the linearization $h(y) - h(\mu) \approx (y - \mu)h'(\mu)$, which holds for $y$ near $\mu$, implies that $\mathrm{var}[h(y)] \approx [h'(\mu)]^2\mathrm{var}(y)$ when $\mathrm{var}(y)$ is small. For a vector $y$ with covariance matrix $V$ and a vector $h(y) = (h_1(y), \ldots, h_n(y))^T$, let $(\partial h/\partial\boldsymbol{\mu})$ denote the Jacobian matrix with entry in row $i$ and column $j$ equal to $\partial h_i(y)/\partial y_j$ evaluated at $y = \boldsymbol{\mu}$. Then the delta method yields $\mathrm{var}[h(y)] \approx (\partial h/\partial\boldsymbol{\mu})V(\partial h/\partial\boldsymbol{\mu})^T$. So, by the delta method, using the diagonal matrix $D$ with elements $\partial\mu_i/\partial\eta_i$, for large samples the covariance matrix of the fitted values is
\[
\widehat{\mathrm{var}}(\hat{\boldsymbol{\mu}}) \approx D\,\widehat{\mathrm{var}}(\hat{\boldsymbol{\eta}})\,D \approx DX(X^T W X)^{-1}X^T D.
\]
However, to obtain a confidence interval for $\mu_i$ when $g$ is not the identity link, it is preferable to construct one for $\eta_i$ and then apply the response function $g^{-1}$ to the endpoints, thus avoiding the further delta-method approximation.
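Continuing the Poisson sketch (hypothetical simulated data; $z = 1.96$ for an approximate 95% interval), this computes delta-method variances for $\hat{\mu}_i$ and the preferred link-scale interval:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = rng.poisson(np.exp(X @ np.array([0.2, 0.4, -0.3])))
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

cov_beta = fit.cov_params()                           # (X^T W-hat X)^{-1}
var_eta = np.einsum('ij,jk,ik->i', X, cov_beta, X)    # diagonal of X cov(beta-hat) X^T
mu_hat = fit.fittedvalues
var_mu = mu_hat**2 * var_eta                          # delta method: D var(eta-hat) D, D = diag(mu)

# Preferred interval: build it on the eta scale, then apply g^{-1} = exp
eta_hat = X @ fit.params
lo = np.exp(eta_hat - 1.96 * np.sqrt(var_eta))
hi = np.exp(eta_hat + 1.96 * np.sqrt(var_eta))
```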
⁵ An example is the negative binomial GLM for counts in Section 7.3.3.
These results for $\hat{\boldsymbol{\eta}}$ and $\hat{\boldsymbol{\mu}}$ are based on those for $\hat{\boldsymbol{\beta}}$, for which the asymptotics refer to $n \to \infty$. However, $\hat{\boldsymbol{\eta}}$ and $\hat{\boldsymbol{\mu}}$ have length $n$. Asymptotics make more sense for them when $n$ is fixed and each component is based on an increasing number of subunits, such that the observations themselves become approximately normal. One such example is a fixed number of binomial observations, in which the asymptotics refer to each binomial sample size $n_i \to \infty$. In another example, each observation is a Poisson cell count in a contingency table with fixed dimensions, and the asymptotics refer to each expected cell count growing. Such cases can be expressed as exponential dispersion families in which the dispersion parameter $a(\phi) = \phi/\omega_i$ has weight $\omega_i$ growing. This component-specific large-sample theory is called small-dispersion asymptotics (Jørgensen 1987). The covariance matrix formulas are also used in an approximate sense in the more standard asymptotic cases with large $n$.
4.2.6 Model Misspecification: Robustness of GLMs with Correct Mean

Like other ML estimators of a fixed-length parameter vector, $\hat{\boldsymbol{\beta}}$ is consistent (i.e., $\hat{\boldsymbol{\beta}} \xrightarrow{p} \boldsymbol{\beta}$ as $n \to \infty$). As $n$ increases, $X$ has more rows, the diagonal elements of the asymptotic covariance matrix $(X^T W X)^{-1}$ of $\hat{\boldsymbol{\beta}}$ tend to be smaller, and $\hat{\boldsymbol{\beta}}$ tends to fall closer to $\boldsymbol{\beta}$.
But what if we have misspecified the probability distribution for $y$? Models, such as GLMs, that assume a response distribution from an exponential family have a certain robustness property. If the model for the mean is correct, that is, if we have specified the link function and linear predictor correctly, then $\hat{\boldsymbol{\beta}}$ is still consistent⁶ for $\boldsymbol{\beta}$. However, if the assumed variance function is incorrect (which is likely when the assumed distribution for $y$ is incorrect), then so is the formula for $\mathrm{var}(\hat{\boldsymbol{\beta}})$. Moreover, not knowing the actual distribution for $y$, we would not know the correct expression for $\mathrm{var}(\hat{\boldsymbol{\beta}})$. Section 8.3 discusses model misspecification issues and ways of dealing with them, including using the sample variability to help obtain a consistent estimator of the appropriate covariance matrix.
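As a preview of that remedy (a hedged sketch: it assumes statsmodels' GLM fit accepts cov_type='HC0' for a sandwich covariance estimator, and simulates overdispersed counts so the Poisson variance function is wrong while the loglinear mean is right):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(300, 2)))
mu = np.exp(X @ np.array([0.2, 0.5, -0.4])))
# Negative binomial counts with mean mu: overdispersed relative to Poisson
y = rng.negative_binomial(n=2, p=2 / (2 + mu))

naive = sm.GLM(y, X, family=sm.families.Poisson()).fit()
robust = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type='HC0')
print(np.sqrt(np.diag(naive.cov_params())))    # model-based SEs: typically too small here
print(np.sqrt(np.diag(robust.cov_params())))   # sandwich SEs: allow for the misspecified variance
```

Both fits give the same consistent $\hat{\boldsymbol{\beta}}$; only the estimated covariance matrix differs.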