5.4.7 DIAGNOSIS OF THE FINAL MODEL
We have already mentioned several times that a model we arrived at, whether based on prior theoretical knowledge of the studied process, on analysis of empirical data, or on a combination of both, does not have to "fit" the given data set reasonably well. Consequently, the inference based on it does not have to be correct. That is why we should attempt to verify it as thoroughly as possible. Since there are many ways in which a model can be bad, there are also many techniques for examining these possibilities. It is impossible to verify all aspects of a model, but we should examine as many of the important ones as possible. This is true even though it creates more work for us: detected problems force us to develop a new model and to continue the analysis. Nevertheless, we can justify and defend a model only after it has been verified.
Otherwise, the results would represent a mere summary (which might even be unnecessarily complicated) and might lead to completely false conclusions (both false positives and false negatives).
We will explore the final model m9 using several diagnostic tools that are quite standard for linear models and easily available in R. This does not mean that the diagnosis of your model should be limited to these tools only; rather, they show some basic kinds of tools we can start with. We can obtain the basic diagnostic plots with the plot command, one after another. To obtain all four plots, which we discussed in Chapter 4.6, we will use the which argument to select the first four plots and place them into four panels of the graphical window:
> par(mfrow=c(2,2))
> plot(m9,which=1:4)
In the first plot (Fig. 5-3A), the residuals are arranged in three columns, because each column represents one level of diet2 from model m9. The red line connecting the mean residual values for the different levels of diet2 stays very close to the abscissa, so there is no indication of fundamental problems with the model under- or overestimating the data. The spread of the residuals, however, gets larger as the fitted value increases, indicating that the residual variance is not homogeneous: it increases with the mean value. This impression is also confirmed by the red line in the third plot (Fig. 5-3C), which is increasing.
Fig. 5-3 A. Relationship between residuals and fitted values. B. Q-Q normal plot of standardised residuals. C. Relationship between the square roots of absolute values of standardised residuals and fitted values. D. Plot of Cook's distance for every measurement.
For those of you who prefer a test to "just viewing a picture", the homogeneity of variances can be tested with the Bartlett test. The argument (formula) of this test will be the model formula from the latest model:
> bartlett.test(weight~diet2)
        Bartlett test of homogeneity of variances

data:  weight by diet2
Bartlett's K-squared = 24.2178, df = 2, p-value = 5.51e-06
The nominally significant result confirms the heterogeneous variance. The Bartlett test does not, in general, say the same thing as the plot does. On one hand, it is a formal test with many advantages; on the other hand, it is known to be often too sensitive to various nuisance influences, e.g. deviations from normality. A significant result can then suggest a non-normal distribution of the residuals rather than their heterogeneous variance.
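If you want a formal test that is less sensitive to departures from normality, one commonly used alternative (not part of the analysis above, so take this only as a sketch) is Levene's test, available for example as the leveneTest function in the car package:
> library(car)
> leveneTest(weight~diet2)   # by default centred on group medians, which adds robustness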
We can see from the second plot (Fig. 5-3B) that the distribution of the residuals is skewed to the right; the most extreme residuals are in the top right. The potential outliers are labelled with the numbers under which they appear in the original data frame. We can verify the deviation from normality with a test, for example the Shapiro-Wilk normality test. The argument (x) of this test is the vector of residuals of model m9.
> shapiro.test(resid(m9))
        Shapiro-Wilk normality test

data:  resid(m9)
W = 0.9685, p-value = 0.0356
The significant result confirms the departure from normality. As with the Bartlett test, the normality test is not always better than a "mere" graphical representation. This is not only because a non-significant result does not guarantee normality, but also because the test does not tell us how normality is violated (though this is visible, to some extent, in the plot).
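To see the shape of the violation more directly, you can complement the Q-Q plot with a simple histogram of the residuals; this is just an informal check, not part of the original analysis:
> hist(resid(m9), xlab="Residuals of m9", main="")   # right skew shows as a long right tail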
Finally, the last plot (Fig. 5-3D) highlights the three observations with the highest Cook's distance values (19, 22 and 32). Nevertheless, the Cook's distance of all of them is very low, less than 0.1, which means that these observations are not extremely influential.
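If you prefer the actual numbers to reading them off the plot, the cooks.distance function returns the Cook's distance of every observation; the sketch below simply lists the three largest values (any cut-off used to flag problems is a matter of convention, not something taken from the text above):
> cd <- cooks.distance(m9)
> sort(cd, decreasing=TRUE)[1:3]   # the three most influential observations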
Based on the previous diagnostic explorations, we conclude that model m9 does not pass the evaluation. The non-constant variance and the quite pronounced departure from normality represent serious violations of the assumptions. The non-constant variance in particular affects the standard errors of the estimates, and this, together with the departure from normality, can seriously invalidate the inferences based on model m9. We should therefore try to modify the model using a logarithmic transformation, a suitable weighting, or a distribution that allows the variance to increase with the mean value, and then restart the analysis from the very beginning. We will show you how to do that in Chapter 9.5.
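As a rough preview of such modifications (these are only illustrative sketches, not the analysis carried out in Chapter 9.5, and they assume that all weight values are positive), a log-transformed linear model or a Gamma GLM with a log link both allow the variance to grow with the mean:
> m9.log <- lm(log(weight)~diet2)                          # log-transformed response
> m9.gam <- glm(weight~diet2, family=Gamma(link="log"))    # variance increasing with the mean
> par(mfrow=c(2,2)); plot(m9.log, which=1:4)               # re-check the diagnostics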
We will now plot the estimates of the expected values, obtained as empirical means, in a line plot with whiskers. The whiskers will show the standard errors of the means (SE) as the appropriate assessment of the precision of the means. We can do all of this simply by calling the lineplot.CI function from the sciplot package (Fig. 5-4). We will not show any model fits or predictions in the plot yet, since we found model m9 to be inappropriate.
> par(mfrow=c(1,1))
> library(sciplot)
> lineplot.CI(diet2,weight,ylab="Weight", xlab="Diet")
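If the sciplot package is not available, or if you simply want the plotted numbers, the group means and their standard errors can be computed directly; the following sketch uses the usual SE of the mean, sd/sqrt(n):
> tapply(weight, diet2, mean)                                 # group means
> tapply(weight, diet2, function(x) sd(x)/sqrt(length(x)))    # standard errors of the means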