Syntax of the systematic part

Entering the systematic part in R is governed by several simple syntactic rules, which are intuitive and natural. In both the glm and lm functions, the systematic part is entered using the formula argument, which is the first argument of either function. In lm, the model formula determines the model completely (both the link and the random part are set automatically). The glm function is more flexible, so we have to state more explicitly what we want: apart from the linear predictor, we have to specify the distribution type (random part) and the link transformation.

We have to be careful, though: if we specify neither of them, the Gaussian (normal) distribution with the identity link is used by default. This corresponds to what lm does, and it might not be what we intend. If we specify a random part but no link, glm uses the so-called canonical link of the given distributional family. Again, that might or might not be what we want for a particular analysis.
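The defaults described above can be checked directly. The following is a minimal sketch on simulated data; the variable names x and y and the simulated values are illustrative assumptions, not an example from this book.

```r
# Simulated data; the names x and y are illustrative only.
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)

fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x)    # no family given: Gaussian with identity link

family(fit_glm)$family   # "gaussian"
family(fit_glm)$link     # "identity"

# Both functions estimate the same coefficients (up to numerical tolerance).
all.equal(coef(fit_lm), coef(fit_glm))

# Family given, link omitted: the canonical link of that family is used.
poisson()$link           # "log"
binomial()$link          # "logit"
Gamma()$link             # "inverse"
```

Inspecting the family object in this way is a quick safeguard against fitting a Gaussian model by accident.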

Different distributions imply different assumptions about how the variance depends on the mean (different variance functions). The variance (or scedastic) function is therefore selected by specifying the random part. Additionally, one might also specify a weights argument, but we will do so only on rare occasions.
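The variance function attached to each distribution can be inspected through the family objects in R. This is a small sketch; the vector mu holds arbitrary illustrative mean values, and the functions return the variance up to the dispersion parameter.

```r
# Arbitrary mean values, chosen for illustration only.
mu <- c(0.2, 0.5, 0.8)

gaussian()$variance(mu)   # constant in the mean: 1 1 1
poisson()$variance(mu)    # equal to the mean: mu
binomial()$variance(mu)   # mu * (1 - mu)
Gamma()$variance(mu)      # mu^2
```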

In all examples used in this book, we will specify only linear combinations of parameters. As explained earlier (Chapter 4.2), the explanatory variables that enter a model can be transformations of the original explanatory variables. The model formula can generally be expressed as follows:

response variable ~ linear predictor.

Tilde (~) reads as “it is modelled by”. There is always one response variable on the left side of the model formula. On the right side, we specify a linear predictor as a sum of explanatory variables, separated by the plus (+) sign (or by other special characters, the meaning of which is explained further in the text). It is a stylised and abbreviated notation. We need to realise that the meaning (and practical interpretation) of this formula changes considerably when we move between particular sub-classes of GLM. For example, if we select the normal distribution (and identity link), then the linear predictor directly models the mean value of the response variable. That is why the interpretation of coefficients, their SEs, etc. is so simple. For other distributions and links, this can change quite dramatically (see Chapters 9–12). For instance, in a Poisson GLM the linear predictor might model the logarithm of the intensity (of the mean), and in binomial regression it might model the logit (logistic) transformation of the mean. We have to keep all of this in mind when conducting GLM modelling, making sure we are always clear about what we do and how we actually specify and interpret the model. At the same time, we have to take into account the specification of the systematic part, the distribution specification, the link specification and their interaction (i.e. the fact that the meaning of the specification of the systematic part is generally different for different distributions and different link functions).
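How the meaning of the same formula depends on the link can be illustrated with a Poisson model. The sketch below uses simulated counts; the data, the variable names and the true coefficient values are assumptions made for the illustration only.

```r
# Simulated counts with a log-linear mean; all values are illustrative.
set.seed(2)
x <- runif(100, 0, 2)
y <- rpois(100, lambda = exp(0.3 + 0.8 * x))

fit_pois <- glm(y ~ x, family = poisson)      # canonical log link

eta <- predict(fit_pois, type = "link")       # linear predictor: alpha + beta * x
mu  <- predict(fit_pois, type = "response")   # fitted mean of the response

# The linear predictor models log(mean), not the mean itself.
all.equal(log(mu), eta)

# exp(beta) is therefore a multiplicative change in the mean per unit of x.
exp(coef(fit_pois)["x"])
```

The formula y ~ x is the same as it would be in lm, yet the coefficient of x now acts on the logarithmic scale of the mean.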

While on the left side of the model formula the operators +, -, / have their usual mathematical interpretation, on the right side of the model formula (i.e. in the linear predictor) these operators have a different meaning! Specifically, + adds a term, - removes a term, : denotes an interaction, and * stands for all terms (the main effects and the interactions of the variables connected by *). If we need to use these operators in their usual mathematical sense within the linear predictor, it is necessary to wrap the expression in the function I(), which makes R evaluate its content arithmetically. The number 1 has a special meaning in the linear predictor: it stands for the intercept. Thus, for example, -1 does not mean that we want to subtract 1 but that we are fitting a model without the intercept!

Table 6-1 Overview of basic models, their mathematical formula, syntax and description.

| Formula | Syntax in R | Description |
|---|---|---|
| $f(\mu_i) = \alpha$ | y ~ 1 | Model containing only the intercept, i.e. the null model. |
| $f(\mu_i) = \alpha + \beta x_i$ | y ~ x | Linear model with a continuous explanatory variable x. |
| $\log(\mu_i) = \beta x_i$ | log(y) ~ x-1 | Linear model with a continuous explanatory variable x, without the intercept, with logarithmically transformed expected values of the response variable (y). |
| $f(\mu_i) = \alpha + \beta x_i + \gamma x_i^2$ | y ~ x+I(x^2) or y ~ poly(x,2) | A quadratic model of the continuous variable x. |
| $f(\mu_i) = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i}$ | y ~ x1+x2 | Model with two continuous variables x1 and x2. It is linear in each of these variables. This is a particular case of multiple regression. |
| $f(\mu_{ijk}) = \alpha + A_i + B_j + C_k + A{:}B_{ij} + A{:}C_{ik} + B{:}C_{jk} + A{:}B{:}C_{ijk}$ | y ~ A+B+C+A:B+A:C+B:C+A:B:C or, in short, y ~ A*B*C | Model with three factors A, B and C, including three main effects, three two-way interactions and one three-way interaction. This is a general form of a three-way ANOVA/ANODEV model with interactions. |
| $f(\mu_{ijk}) = \alpha + A_i + B_j + C_k + A{:}B_{ij} + A{:}C_{ik} + B{:}C_{jk}$ | y ~ (A+B+C)^2 | Model with three factors (A, B, C), including three main effects and three two-way interactions only. |
| $f(\mu_{ij}) = \alpha + A_i + \beta x_j + \delta_i x_j$ | y ~ x*A | Model with a continuous variable x and a factor A, including two main effects (regression on the continuous variable and effect of the factor) and their interaction (change of slope with the levels of the factor A). |

Table 6-1 gives several examples that illustrate the translation between the standard mathematical expression and the R notation of the systematic part. Recall that we will generally use lowercase letters (x) for continuous explanatory variables and uppercase letters (A, B, C) for categorical explanatory variables. This is just for simplicity; clearly, you can use whatever (legal) names you like when writing your own models. The left side of the mathematical formulae generally contains the mean value of the response variable transformed by the link function, f(μ).
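The way R expands the formula operators can be verified without any data by inspecting the terms of a formula. This is a minimal sketch; the helper term_labels is a hypothetical convenience function defined here, not part of R.

```r
# Hypothetical helper: term labels that R derives from a model formula.
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(y ~ A * B * C)
# "A" "B" "C" "A:B" "A:C" "B:C" "A:B:C"

term_labels(y ~ (A + B + C)^2)
# "A" "B" "C" "A:B" "A:C" "B:C"

term_labels(y ~ A * B * C - A:B:C)    # '-' removes the three-way interaction

attr(terms(y ~ x), "intercept")       # 1: intercept included
attr(terms(y ~ x - 1), "intercept")   # 0: intercept removed

term_labels(y ~ x + I(x^2))           # "x" "I(x^2)"
term_labels(y ~ x + x^2)              # "x": without I(), ^ is a formula operator
```

The last two lines show why I() is needed for the quadratic model: without it, x^2 is read as a formula operation and collapses to x.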

7 RANDOM PART

By specifying the systematic part, we have not yet described the entire model that we have in mind, at least not with regard to the function glm. In the glm case, we need to specify the distribution type and the link function, which determine how the expected value of the response variable will be transformed and modelled by the previously specified linear predictor. In contrast, lm always assumes a normal distribution of errors; it is thus more convenient to use in simple cases, but also much more restrictive. The glm function has default settings for the distribution type and the link, which are invoked automatically when we do not specify anything else. This means that if we specify only the systematic part, glm will not warn us and will perform an analysis with the Gaussian distribution and identity link. Fortunately, glm reports basic information in its output, which allows us to check what the function actually did.
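A short sketch of specifying the random part and the link explicitly, and of checking what glm actually used. The count data are simulated and all names are illustrative assumptions.

```r
# Simulated count data; values and names are illustrative only.
set.seed(3)
x <- runif(80, 0, 2)
y <- rpois(80, lambda = exp(0.5 + 0.6 * x))

fit_default  <- glm(y ~ x)                                  # silently Gaussian, identity link
fit_poisson  <- glm(y ~ x, family = poisson)                # link omitted: canonical log link
fit_explicit <- glm(y ~ x, family = poisson(link = "log"))  # same model, link spelled out

family(fit_default)    # reports the gaussian family with the identity link
family(fit_poisson)    # reports the poisson family with the log link

# Omitting the link and giving the canonical link explicitly yield the same fit.
all.equal(coef(fit_poisson), coef(fit_explicit))
```

Note that the first call runs without any warning even though the response consists of counts, so checking the reported family is up to the analyst.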

The specification of the random part (and also of the default link function) is mainly based on the type of distribution from which the response variable comes. The following key should help you to decide.

If the response variable is composed of

• Continuous measurements . . . 7.1

• Counts and frequencies . . . 7.2

• Relative frequencies . . . 7.3

This key is far from strict. In order to make the right decision, you need to become familiar with the basic properties of the distribution types that we will be working with in this book. Obviously, the number of existing distributions is infinite. Nevertheless, here we will select only those that are available in the lm and glm functions in the R environment. When making our decisions about the distributions within the GLM family, we will be guided either by theoretical considerations (for example, ecological or physical), or by previous experience (our own or that of other authors) with analyses of similar empirical data. However, should there be no theory or experience regarding a particular case, which is often the case in biology, we need to start searching, experimenting and thinking critically about the results. If it seems that, for some reason, your data (or the mechanism that generates them) do not comply with any standard assumptions, you need to consult a professional statistician. He or she might help you to develop a customised model for the problem if the departure from the available GLM models is really substantial.

Comparison of levels using contrasts

Contrasts and the model parameterization