Logistic regression model formulas:
\[
\pi_i = \frac{\exp\bigl(\sum_{j=1}^{p} \beta_j x_{ij}\bigr)}{1 + \exp\bigl(\sum_{j=1}^{p} \beta_j x_{ij}\bigr)}
\quad\text{or}\quad
\mathrm{logit}(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \sum_{j=1}^{p} \beta_j x_{ij}. \tag{5.2}
\]
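To make (5.2) concrete, here is a minimal numerical sketch (with hypothetical coefficients and covariate values, not taken from the text) that computes $\pi_i$ from the linear predictor and checks that the logit recovers it:

```python
import numpy as np

# Hypothetical coefficients and one observation's explanatory values
# (beta_1 plays the role of an intercept via x_{i1} = 1).
beta = np.array([-1.0, 0.5, 0.8])
x_i = np.array([1.0, 2.0, -0.3])

eta_i = x_i @ beta                          # linear predictor sum_j beta_j x_{ij}
pi_i = np.exp(eta_i) / (1 + np.exp(eta_i))  # first formula in (5.2)
logit_i = np.log(pi_i / (1 - pi_i))         # log odds; equals eta_i

print(pi_i, logit_i)
```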
5.2.1 Interpreting $\boldsymbol{\beta}$: Effects on Probabilities and on Odds
For a single quantitative $x$ with $\beta > 0$, the curve for $\pi_i$ has the shape of the cdf of a logistic distribution. Since the logistic density is symmetric, as $x_i$ changes, $\pi_i$ approaches 1 at the same rate that it approaches 0. With multiple explanatory variables, since $1 - \pi_i = [1 + \exp(\sum_j \beta_j x_{ij})]^{-1}$, $\pi_i$ is monotone in each explanatory variable according to the sign of its coefficient. The rate of climb or descent increases as $|\beta_j|$ increases. When $\beta_j = 0$, $y$ is conditionally independent of $x_j$, given the other explanatory variables.
How do we interpret the magnitude of $\beta_j$? For a quantitative explanatory variable, a straight line drawn tangent to the curve at any particular value describes the instantaneous rate of change in $\pi_i$ at that point. Specifically,
\[
\frac{\partial \pi_i}{\partial x_{ij}} = \beta_j \frac{\exp\bigl(\sum_j \beta_j x_{ij}\bigr)}{\bigl[1 + \exp\bigl(\sum_j \beta_j x_{ij}\bigr)\bigr]^2} = \beta_j \pi_i (1 - \pi_i).
\]
The slope is steepest (and equals $\beta_j/4$) at a value of $x_{ij}$ for which $\pi_i = 1/2$, and the slope decreases toward 0 as $\pi_i$ moves toward 0 or 1.
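The following sketch (with hypothetical intercept and slope values) checks the slope formula $\beta_j \pi_i(1-\pi_i)$ against a finite-difference approximation, evaluated at a point where $\pi_i = 1/2$ so that the slope equals $\beta_1/4$:

```python
import numpy as np

def inv_logit(eta):
    return 1 / (1 + np.exp(-eta))

beta0, beta1 = -2.0, 0.4   # hypothetical intercept and slope
x = 5.0                    # point at which to evaluate the slope; here pi = 1/2

pi = inv_logit(beta0 + beta1 * x)
analytic = beta1 * pi * (1 - pi)                                # beta_j * pi_i * (1 - pi_i)
numeric = (inv_logit(beta0 + beta1 * (x + 1e-6)) - pi) / 1e-6   # finite difference

print(analytic, numeric)   # both ~ beta1 / 4 = 0.1, the maximum slope
```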
How do we interpret $\beta_j$ for a qualitative explanatory variable? Consider first a single binary indicator $x$. The model, $\mathrm{logit}(\pi_i) = \beta_0 + \beta_1 x_i$, then describes a $2\times 2$ contingency table. For it,
\[
\mathrm{logit}[P(y=1 \mid x=1)] - \mathrm{logit}[P(y=1 \mid x=0)] = [\beta_0 + \beta_1(1)] - [\beta_0 + \beta_1(0)] = \beta_1.
\]
It follows that $e^{\beta_1}$ is the odds ratio (Yule 1900, 1912),
\[
e^{\beta_1} = \frac{P(y=1 \mid x=1)/[1 - P(y=1 \mid x=1)]}{P(y=1 \mid x=0)/[1 - P(y=1 \mid x=0)]}.
\]
With multiple explanatory variables, exponentiating both sides of the equation for the logit shows that the odds $\pi_i/(1-\pi_i)$ are an exponential function of $x_j$. The odds multiply by $e^{\beta_j}$ per unit increase in $x_j$, adjusting for the other explanatory variables in the model. For example, $e^{\beta_j}$ is a conditional odds ratio: the odds at $x_j = u + 1$ divided by the odds at $x_j = u$, adjusting for the other $\{x_k\}$.
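As an illustration of the odds-ratio interpretation, the sketch below fits the binary-indicator model to a hypothetical $2\times 2$ table (using statsmodels, an assumed tool rather than anything specified in the text) and confirms that $e^{\hat{\beta}_1}$ equals the sample odds ratio:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical 2x2 table of counts: rows are x = 0, 1; columns are y = 0, 1
counts = np.array([[60, 20],
                   [30, 40]])

# Expand to ungrouped (x, y) data and fit logit(pi_i) = beta_0 + beta_1 * x_i
x = np.repeat([0, 0, 1, 1], counts.ravel())
y = np.repeat([0, 1, 0, 1], counts.ravel())
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

sample_or = (counts[1, 1] / counts[1, 0]) / (counts[0, 1] / counts[0, 0])
print(np.exp(fit.params[1]), sample_or)   # exp(beta_1-hat) equals the sample odds ratio
```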
It is simpler to understand effects when they are presented on a probability scale than as odds ratios. To summarize the effect of a quantitative explanatory variable, we could compare $P(y=1)$ at extreme values of that variable, with the other explanatory variables set at their means. This type of summary is sensible when the distribution of the data indicates that such extreme values can occur at mean values of the other explanatory variables. With a continuous variable, however, this summary can be sensitive to an outlier. So the comparison could instead use its quartiles, thus showing the change in $P(y=1)$ over the middle half of the explanatory variable's range of observations. The data can more commonly support such a comparison.
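A sketch of such a quartile-based summary, using simulated data and statsmodels (both are assumptions for illustration only, not part of the text):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)                     # quantitative variable of interest
x2 = rng.normal(size=n)                     # another explanatory variable
pi = 1 / (1 + np.exp(-(-0.5 + 1.2 * x1 + 0.4 * x2)))
y = rng.binomial(1, pi)

fit = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)

q1, q3 = np.quantile(x1, [0.25, 0.75])      # quartiles of x1
p_lo = fit.predict([[1, q1, x2.mean()]])    # P(y=1) at lower quartile, x2 at its mean
p_hi = fit.predict([[1, q3, x2.mean()]])    # P(y=1) at upper quartile, x2 at its mean
print((p_hi - p_lo)[0])                     # change over the middle half of x1
```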
5.2.2 Logistic Regression with Case-Control Studies
In case-control studies, $y$ is known, and researchers look into the past to observe $x$ as the random variable. For example, for cases of a particular type of cancer ($y=1$) and disease-free controls ($y=0$), a study might observe $x$ = whether the person has been a significant smoker. For $2\times 2$ tables, we just observed that $e^{\beta}$ is the odds ratio with $y$ as the response. But, from Bayes' theorem,
\[
e^{\beta} = \frac{P(y=1 \mid x=1)/P(y=0 \mid x=1)}{P(y=1 \mid x=0)/P(y=0 \mid x=0)}
          = \frac{P(x=1 \mid y=1)/P(x=0 \mid y=1)}{P(x=1 \mid y=0)/P(x=0 \mid y=0)}.
\]
So it is possible to estimate the odds ratio in retrospective studies that sample $x$, for given $y$. More generally, with logistic regression we can estimate effects in studies whose research design reverses the roles of $x$ and $y$ as response and explanatory variables, and the effect parameters still have interpretations as log odds ratios.
Here is a formal justification: let $z$ indicate whether a subject is sampled (1 = yes, 0 = no). Even though the conditional distribution of $y$ given $x$ is not sampled, we need a model for $P(y=1 \mid z=1, x)$, assuming that $P(y=1 \mid x)$ follows the logistic model. By Bayes' theorem,
\[
P(y=1 \mid z=1, x) = \frac{P(z=1 \mid y=1, x)\,P(y=1 \mid x)}{\sum_{j=0}^{1} P(z=1 \mid y=j, x)\,P(y=j \mid x)}. \tag{5.3}
\]
Now, suppose that $P(z=1 \mid y, x) = P(z=1 \mid y)$ for $y = 0$ and 1; that is, for each $y$, the sampling probabilities do not depend on $x$. For instance, for cases and for controls, the probability of being sampled is the same for smokers and nonsmokers. Under this assumption, substituting $\rho_1 = P(z=1 \mid y=1)$ and $\rho_0 = P(z=1 \mid y=0)$ in Equation (5.3) and dividing the numerator and denominator by $P(y=0 \mid x)$,
\[
P(y=1 \mid z=1, x) = \frac{\rho_1 \exp\bigl(\sum_j \beta_j x_j\bigr)}{\rho_0 + \rho_1 \exp\bigl(\sum_j \beta_j x_j\bigr)}.
\]
Then, letting $\beta_0^* = \beta_0 + \log(\rho_1/\rho_0)$,
\[
\mathrm{logit}[P(y=1 \mid z=1, x)] = \beta_0^* + \beta_1 x_1 + \cdots.
\]
The logistic regression model holds with the same effect parameters as in the model for $P(y=1 \mid x)$. With a case-control study we can estimate those effects, but we cannot estimate the intercept term, because the data do not supply information about the relative numbers of $y=1$ and $y=0$ observations.
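The following simulation sketch (hypothetical parameter values; statsmodels assumed for fitting) illustrates this: case-control sampling recovers the slope $\beta_1$, while the fitted intercept estimates $\beta_0 + \log(\rho_1/\rho_0)$ rather than $\beta_0$:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Population following logit P(y=1|x) = beta_0 + beta_1 * x (hypothetical values)
beta0, beta1 = -3.0, 1.0
x = rng.normal(size=200_000)
y = rng.binomial(1, 1 / (1 + np.exp(-(beta0 + beta1 * x))))

# Case-control sampling: keep every case, sample an equal number of controls
cases = np.flatnonzero(y == 1)
controls = rng.choice(np.flatnonzero(y == 0), size=cases.size, replace=False)
idx = np.concatenate([cases, controls])
fit = sm.Logit(y[idx], sm.add_constant(x[idx])).fit(disp=0)

rho1, rho0 = 1.0, cases.size / (y == 0).sum()      # sampling probabilities by outcome
print(fit.params[1], beta1)                        # slope is recovered
print(fit.params[0], beta0 + np.log(rho1 / rho0))  # intercept shifts by log(rho1/rho0)
```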
5.2.3 Logistic Regression is Implied by Normal Explanatory Variables

Regardless of the sampling design, suppose the explanatory variables are continuous and have a normal distribution, for each response outcome. Specifically, given $y$, suppose $\mathbf{x}$ has an $N(\boldsymbol{\mu}_y, \mathbf{V})$ distribution, $y = 0, 1$. Then, by Bayes' theorem, $P(y=1 \mid \mathbf{x})$ satisfies the logistic regression model with $\boldsymbol{\beta} = \mathbf{V}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$ (Warner 1963).
For example, in a health study of senior citizens, suppose $y$ = whether a person has ever had a heart attack and $x$ = cholesterol level. Suppose those who have had a heart attack have an approximately normal distribution on $x$, and those who have not had one also have an approximately normal distribution on $x$, with similar variance. Then, the logistic regression function approximates well the curve for $P(y=1 \mid x)$. The effect is greater when the groups' mean cholesterol levels are farther apart. If the distributions are normal but with different variances, the logistic model applies, but with a quadratic term in $x$ (Exercise 5.1).
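A small simulation sketch of this result (hypothetical $\boldsymbol{\mu}_0$, $\boldsymbol{\mu}_1$, and $\mathbf{V}$; statsmodels assumed for the fit), comparing the fitted slopes with $\mathbf{V}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100_000

# Hypothetical group means and common covariance matrix V for x given y
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 0.5])
V = np.array([[1.0, 0.3],
              [0.3, 2.0]])

y = rng.binomial(1, 0.4, size=n)
x = np.where(y[:, None] == 1, mu1, mu0) + rng.multivariate_normal(np.zeros(2), V, size=n)

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(fit.params[1:])                    # fitted slopes
print(np.linalg.solve(V, mu1 - mu0))     # theoretical beta = V^{-1} (mu_1 - mu_0)
```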
5.2.4 Summarizing Predictive Power: Classification Tables and ROC Curves

A classification table cross-classifies the binary response $y$ with a prediction $\hat{y}$ of whether $y = 0$ or 1 (see Table 5.1). For a model fit to ungrouped data, the prediction for observation $i$ is $\hat{y}_i = 1$ when $\hat{\pi}_i > \pi_0$ and $\hat{y}_i = 0$ when $\hat{\pi}_i \le \pi_0$, for a selected cutoff $\pi_0$. Common cutoffs are (1) $\pi_0 = 0.50$ and (2) the sample proportion of $y = 1$ outcomes, which each $\hat{\pi}_i$ equals for the model containing only an intercept term. Rather than using $\hat{\pi}_i$ from the model fitted to the dataset that includes $y_i$, it is better to make the prediction with the "leave-one-out" cross-validation approach, which bases $\hat{\pi}_i$ on the model fitted to the other $n - 1$ observations.

Table 5.1  A Classification Table

                Prediction $\hat{y}$
   $y$           0           1
    0
    1

Cell counts in such tables yield estimates of sensitivity $= P(\hat{y}=1 \mid y=1)$ and specificity $= P(\hat{y}=0 \mid y=0)$.

For a particular cutoff, summaries of the predictive power from the classification table are estimates of
\[
\text{sensitivity} = P(\hat{y}=1 \mid y=1) \quad\text{and}\quad \text{specificity} = P(\hat{y}=0 \mid y=0).
\]
A disadvantage of a classification table is that its cell entries depend strongly on the cutoff $\pi_0$ for predictions. A more informative approach considers the estimated sensitivity and specificity for all the possible $\pi_0$. The sensitivity is the true positive rate (tpr), and $P(\hat{y}=1 \mid y=0) = (1 - \text{specificity})$ is the false positive rate (fpr). A plot of the true positive rate as a function of the false positive rate as $\pi_0$ decreases from 1 to 0 is called a receiver operating characteristic (ROC) curve. When $\pi_0$ is near 1, almost all predictions are $\hat{y}_i = 0$; then, the point (fpr, tpr) $\approx$ (0, 0). When $\pi_0$ is near 0, almost all predictions are $\hat{y}_i = 1$; then, (fpr, tpr) $\approx$ (1, 1). For a given specificity, better predictive power corresponds to higher sensitivity. So, the better the predictive power, the higher the ROC curve and the greater the area under it. A ROC curve usually has a concave shape connecting the points (0, 0) and (1, 1), as illustrated by Figure 5.2.
[Figure 5.2  ROC curves for a binary GLM having good predictive power and for a binary GLM having poor predictive power. The horizontal axis is the false positive rate $P(\hat{y}=1 \mid y=0)$ and the vertical axis is the true positive rate $P(\hat{y}=1 \mid y=1)$; the two curves are labeled "Good" and "Poor".]
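A sketch of how a classification table and the points of a ROC curve can be computed for a fitted model (simulated data; statsmodels assumed; for brevity, the fitted $\hat{\pi}_i$ are used directly rather than leave-one-out cross-validated values):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=300)
y = rng.binomial(1, 1 / (1 + np.exp(-(1.5 * x - 0.5))))

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
pi_hat = fit.predict()                      # estimated probabilities

# Classification table at the cutoff pi_0 = 0.50
y_hat = (pi_hat > 0.50).astype(int)
table = np.array([[np.sum((y == a) & (y_hat == b)) for b in (0, 1)] for a in (0, 1)])
sensitivity = table[1, 1] / table[1].sum()  # estimate of P(y-hat = 1 | y = 1)
specificity = table[0, 0] / table[0].sum()  # estimate of P(y-hat = 0 | y = 0)

# ROC curve: (fpr, tpr) traced as the cutoff pi_0 decreases from 1 to 0
cutoffs = np.linspace(1, 0, 101)
tpr = np.array([(pi_hat[y == 1] > c).mean() for c in cutoffs])
fpr = np.array([(pi_hat[y == 0] > c).mean() for c in cutoffs])
print(table, sensitivity, specificity)
```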
The area under a ROC curve equals a measure of predictive power called the concordance index (Hanley and McNeil 1982). Consider all pairs of observations $(i, j)$ for which $y_i = 1$ and $y_j = 0$. The concordance index $c$ is the proportion of the pairwise predictions that are concordant with the outcomes, having $\hat{\pi}_i > \hat{\pi}_j$. A pair having $\hat{\pi}_i = \hat{\pi}_j$ contributes $1/2$ to the count of such pairs. The "no effect" value of $c = 0.50$ occurs when the ROC curve is a straight line connecting the points (0, 0) and (1, 1).
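A minimal sketch of the concordance index as the proportion of concordant $(y_i = 1, y_j = 0)$ pairs, with ties counting $1/2$ (toy probabilities, purely illustrative):

```python
import numpy as np

def concordance_index(y, pi_hat):
    """Proportion of (y=1, y=0) pairs whose pi_hat is higher for the y=1 member;
    tied pairs count 1/2. Equals the area under the ROC curve."""
    p1 = pi_hat[y == 1]
    p0 = pi_hat[y == 0]
    greater = (p1[:, None] > p0[None, :]).sum()
    ties = (p1[:, None] == p0[None, :]).sum()
    return (greater + 0.5 * ties) / (p1.size * p0.size)

# Toy illustration with hypothetical fitted probabilities
y = np.array([1, 1, 0, 0, 0])
pi_hat = np.array([0.9, 0.4, 0.4, 0.2, 0.1])
print(concordance_index(y, pi_hat))   # c = 0.50 would indicate no predictive power
```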
5.2.5 Summarizing Predictive Power: Correlation Measures
An alternative measure of predictive power is the correlation between the observed responses $\{y_i\}$ and the model's fitted values $\{\hat{\mu}_i\}$. This generalization of the multiple correlation for linear models is applicable for any GLM (Section 4.6.4). In logistic regression with ungrouped data, $\mathrm{corr}(y, \hat{\mu})$ is the correlation between the $N$ binary $\{y_i\}$ observations (1 or 0 for each) and the estimated probabilities. The highly discrete nature of $y$ constrains the range of possible correlation values. A related measure estimates $\mathrm{corr}(y^*, \hat{y}^*)$ for the underlying latent variable model, where $y^*$ is the latent continuous response. The square of this measure is an $R^2$ analog (McKelvey and Zavoina 1975) that divides the estimated variance of $\hat{y}^*$ by the estimated variance of $y^*$, where $\hat{y}^*_i = \sum_j \hat{\beta}_j x_{ij}$ is the same as the estimated linear predictor. The estimated variance of $y^*$ equals the estimated variance of $\hat{y}^*$ plus the variance of $\epsilon$ in the latent variable model. For the probit latent model with standard normal error, $\mathrm{var}(\epsilon) = 1$. For the corresponding logistic model, $\mathrm{var}(\epsilon) = \pi^2/3 = 3.29$, the variance of the standard logistic distribution.
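A sketch of both measures for a fitted logistic model (simulated data; statsmodels assumed), computing $\mathrm{corr}(y, \hat{\mu})$ and the McKelvey-Zavoina $R^2$ analog with $\mathrm{var}(\epsilon) = \pi^2/3$:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=(400, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-(x @ np.array([1.0, -0.5]) - 0.2))))

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
mu_hat = fit.predict()                        # estimated probabilities mu-hat_i
eta_hat = sm.add_constant(x) @ fit.params     # estimated linear predictor y*-hat_i

r = np.corrcoef(y, mu_hat)[0, 1]              # corr(y, mu-hat)
r2_mz = eta_hat.var() / (eta_hat.var() + np.pi**2 / 3)   # McKelvey-Zavoina R^2 (logit)

print(r, r2_mz)
```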
Such correlation measures are useful for comparing fits of different models for the same data. They can distinguish between models when the concordance index does not. For instance, with a single explanatory variable,ctakes the same value for every link function that gives a monotone relationship of the same sign betweenx and ̂𝜋.