CHAPTER 3

Models for Data

3.1 Statistical Models

Many statistical analyses are based on a specific model for a set of data, where this consists of one or more equations that describe the observations in terms of parameters of distributions and random variables. For example, a simple model for the measurement X made by an instrument might be

    X = θ + ε,

where θ is the true value of what is being measured, and ε is a measurement error that is equally likely to be anywhere in the range from -0.05 to +0.05.

In situations where a model is used, an important task for the data analyst is to select a plausible model and to check, as far as possible, that the data are in agreement with this model. This includes examining both the form of the equation assumed and the distribution or distributions assumed for the random variables. To aid in this type of modelling process there are many standard distributions available, the most important of which are considered in the following two sections of this chapter. In addition, there are some standard types of model that are useful for many sets of data. These are considered in the later sections of this chapter.

3.2 Discrete Statistical Distributions

A discrete distribution is one for which the random variable being considered can only take on certain specific values, rather than any value within some range (Appendix Section A2). By far the most common situation in this respect is where the random variable is a count and the possible values are 0, 1, 2, 3, and so on.

It is conventional to denote a random variable by a capital X and a particular observed value by a lower case x. A discrete distribution is then defined by a list of the possible values x₁, x₂, x₃, ..., for X, and the probabilities P(x₁), P(x₂), P(x₃), ... for these values. Of necessity,

    P(x₁) + P(x₂) + P(x₃) + ... = 1,

i.e., the probabilities must add to 1. Also of necessity, P(xᵢ) ≥ 0 for all i, with P(xᵢ) = 0 meaning that the value xᵢ can never occur. Often there is a specific equation for the probabilities defined by a probability function

    P(x) = Prob(X = x),

where P(x) is some function of x.

The mean of a random variable is sometimes called the expected value, and is usually denoted either by µ or E(X). It is the sample mean that would be obtained for a very large sample from the distribution, and it is possible to show that this is equal to

    E(X) = Σ xᵢP(xᵢ) = x₁P(x₁) + x₂P(x₂) + x₃P(x₃) + ...   (3.1)

The variance of a discrete distribution is equal to the sample variance that would be obtained for a very large sample from the distribution. It is often denoted by σ², and it is possible to show that this is equal to

    σ² = Σ (xᵢ - µ)²P(xᵢ) = (x₁ - µ)²P(x₁) + (x₂ - µ)²P(x₂) + (x₃ - µ)²P(x₃) + ...   (3.2)

The square root of the variance, σ, is the standard deviation of the distribution.

The following discrete distributions are the ones that occur most often in environmental and other applications of statistics. Johnson and Kotz (1969) provide comprehensive details on these and many other discrete distributions.
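As a numerical check on equations (3.1) and (3.2), the mean and variance of any discrete distribution can be computed directly from its lists of values and probabilities. The following is a minimal sketch in Python (the helper name discrete_mean_var and the fair-die example are illustrative choices, not part of the original text):

    import numpy as np

    def discrete_mean_var(x, p):
        """Mean and variance of a discrete distribution using
        mu = sum of x_i P(x_i), equation (3.1), and
        var = sum of (x_i - mu)^2 P(x_i), equation (3.2)."""
        x = np.asarray(x, dtype=float)
        p = np.asarray(p, dtype=float)
        if not np.isclose(p.sum(), 1.0):
            raise ValueError("probabilities must add to 1")
        mu = np.sum(x * p)                # equation (3.1)
        var = np.sum((x - mu) ** 2 * p)   # equation (3.2)
        return mu, var

    # Example: the score on a fair six-sided die
    mu, var = discrete_mean_var(range(1, 7), [1/6] * 6)
    print(mu, var)   # 3.5 and about 2.917, so sigma is about 1.71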
The Hypergeometric Distribution

The hypergeometric distribution arises when a random sample of size n is taken from a population of N units. If the population contains R units with a certain characteristic, then the probability that the sample will contain exactly x units with the characteristic is

    P(x) = C(R, x) C(N - R, n - x) / C(N, n), for x = 0, 1, ..., min(n, R),   (3.3)

where C(a, b) denotes the number of combinations of a objects taken b at a time. The proof of this result will be found in many elementary statistics texts. A random variable with the probabilities of different values given by equation (3.3) is said to have a hypergeometric distribution. The mean and variance are

    µ = nR/N,   (3.4)

and

    σ² = nR(N - R)(N - n)/{N²(N - 1)}.   (3.5)

As an example of a situation where this distribution applies, suppose that a grid is set up over a study area, and the intersections of the horizontal and vertical grid lines define N possible sample locations. Let R of these locations have values in excess of a constant C. If a simple random sample of n of the N locations is taken, then equation (3.3) gives the probability that exactly x of the n sampled locations will have a value exceeding C. Figure 3.1(a) shows examples of probabilities calculated for some particular hypergeometric distributions.

The Binomial Distribution

Suppose that it is possible to carry out a certain type of trial, and that when this is done the probability of observing a positive result is always p for each trial, irrespective of the outcome of any other trial. Then if n trials are carried out, the probability of observing exactly x positive results is given by the binomial distribution

    P(x) = C(n, x) p^x (1 - p)^(n - x), for x = 0, 1, 2, ..., n,   (3.6)

a result that is also provided in Section A2 of Appendix A. The mean and variance of this distribution are

    µ = np,   (3.7)

and

    σ² = np(1 - p),   (3.8)

respectively.

Figure 3.1 Examples of hypergeometric (a), binomial (b), and Poisson (c) discrete probability distributions.

An example of this distribution occurs with the situation described in Example 1.3, which was concerned with the use of mark-recapture methods to estimate survival rates of salmon in the Snake and Columbia Rivers in the Pacific Northwest of the United States. In that setting, if n fish are tagged and released into a river, and each fish has a probability p of being recorded while passing a detection station downstream, then the probability of recording a total of exactly x fish downstream is given by equation (3.6). Figure 3.1(b) shows some examples of probabilities calculated for some particular binomial distributions.
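Probabilities such as those given by equations (3.3) and (3.6) are tedious to evaluate by hand, but standard software provides them directly. A minimal sketch using Python's scipy.stats follows (the numerical settings are arbitrary illustrations, not data from the text; note that scipy parameterises the hypergeometric distribution as population size, number with the characteristic, and sample size):

    from scipy.stats import binom, hypergeom

    # Hypergeometric: N = 100 grid locations, R = 20 exceeding C, sample of n = 10
    N, R, n = 100, 20, 10
    hg = hypergeom(N, R, n)
    print(hg.pmf(3))             # P(exactly 3 sampled locations exceed C)
    print(hg.mean(), n * R / N)  # both 2.0, as in equation (3.4)
    print(hg.var(), n * R * (N - R) * (N - n) / (N**2 * (N - 1)))  # equation (3.5)

    # Binomial: n = 50 tagged fish, detection probability p = 0.3
    bi = binom(50, 0.3)
    print(bi.pmf(15))            # P(exactly 15 fish recorded downstream)
    print(bi.mean(), bi.var())   # 15.0 and 10.5, equations (3.7) and (3.8)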
The Poisson Distribution

One derivation of the Poisson distribution is as the limiting form of the binomial distribution as n tends to infinity and p tends to zero, with the mean µ = np remaining constant. More generally, however, it can be derived as the distribution of the number of events in a given interval of time or a given area of space when the events occur at random, independently of each other, at a constant mean rate. The probability function is

    P(x) = exp(-µ) µ^x / x!, for x = 0, 1, 2, ...   (3.9)

The mean and variance are both equal to µ.

In terms of events occurring in time, the type of situation where a Poisson distribution might occur is for counts of the number of occurrences of minor oil leakages in a region per month, or the number of cases per year of a rare disease in the same region. For events occurring in space, a Poisson distribution might occur for the number of rare plants found in randomly selected square-metre quadrats taken from a large area. In reality, though, counts of these types often display more variation than is expected for the Poisson distribution because of some clustering of the events. Indeed, the ratio of the variance of sample counts to the mean of the same counts, which should be close to one for a Poisson distribution, is sometimes used as an index of the extent to which events do not occur independently of each other.

Figure 3.1(c) shows some examples of probabilities calculated for some particular Poisson distributions.

3.3 Continuous Statistical Distributions

Continuous distributions are often defined in terms of a probability density function, f(x), which is a function such that the area under the plotted curve between two limits a and b gives the probability of an observation within this range, as shown in Figure 3.2. This area is also the integral between a and b, so that in the usual notation of calculus

    Prob(a < X < b) = ∫ₐᵇ f(x) dx.   (3.10)

The total area under the curve must be exactly one, and f(x) must be greater than or equal to zero over the range of possible values of x for the distribution to make sense.

The mean and variance of a continuous distribution are the sample mean and variance that would be obtained for a very large random sample from the distribution. In calculus notation, the mean is

    µ = ∫ x f(x) dx,

where the range of integration is over the possible values of x. This is also sometimes called the expected value of the random variable X, and denoted E(X). Similarly, the variance is

    σ² = ∫ (x - µ)² f(x) dx,   (3.11)

where again the integration is over the possible values of x.

Figure 3.2 The probability density function f(x) for a continuous distribution. The probability of a value between a and b is the area under the curve between these values, i.e., the area between the two vertical lines at x = a and x = b.

The continuous distributions that are described here are ones that often occur in environmental and other applications of statistics. See Johnson and Kotz (1970a, 1970b) for details about many more continuous distributions.

The Exponential Distribution

The probability density function for the exponential distribution with mean µ is

    f(x) = (1/µ) exp(-x/µ), for x ≥ 0,   (3.12)

which has the form shown in Figure 3.3. For this distribution the standard deviation is always equal to the mean µ. The main application is as a model for the time until a certain event occurs, such as the failure time of an item being tested, the time between the reporting of cases of a rare disease, and so on.

Figure 3.3 Examples of probability density functions for exponential distributions.
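Both the variance-to-mean index for Poisson counts and the equality of the mean and standard deviation for exponential data are easy to examine by simulation. A minimal sketch, assuming numpy and arbitrary illustrative parameter values:

    import numpy as np

    rng = np.random.default_rng(42)

    # Counts of events occurring at random: variance/mean should be close to 1
    counts = rng.poisson(lam=3.0, size=1000)
    print(counts.var() / counts.mean())   # near 1; clustered events push this above 1

    # Times between events with mean mu = 2.5: standard deviation equals the mean
    times = rng.exponential(scale=2.5, size=1000)
    print(times.mean(), times.std())      # both near 2.5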
The Normal or Gaussian Distribution

The normal or Gaussian distribution with a mean of µ and a standard deviation of σ has the probability density function

    f(x) = {1/√(2πσ²)} exp{-(x - µ)²/(2σ²)}, for -∞ < x < +∞.   (3.13)

This distribution is discussed in Section A2 of Appendix A, and the form of the probability density function is illustrated in Figure A1. The normal distribution is the 'default' that is often assumed for a distribution that is known to have a symmetric bell-shaped form, at least roughly. It is often observed for biological measurements such as the height of humans, and it can be shown theoretically (through the central limit theorem) that a normal distribution will tend to result whenever the variable being considered consists of a sum of contributions from a number of other distributions. In particular, mean values, totals, and proportions from simple random samples will often be approximately normally distributed, which is the basis for the approximate confidence intervals for population parameters that have been described in Chapter 2.

The Lognormal Distribution

It is a characteristic of the distributions of many environmental variables that they are not symmetric like the normal distribution. Instead, there are many fairly small values and occasional extremely large values. This can be seen, for example, in the measurements of PCB concentrations that are shown in Table 2.3. With many such measurements only positive values can occur, and it often turns out that the logarithm of the measurements has a normal distribution, at least approximately. In that case the distribution of the original measurements can be assumed to be a lognormal distribution, with probability density function

    f(x) = [1/{x√(2πσ²)}] exp[-{logₑ(x) - µ}²/(2σ²)], for x > 0.   (3.14)

Here µ and σ are the mean and standard deviation of the natural logarithm of the original measurement. The mean and variance of the original measurement itself are

    E(X) = exp(µ + ½σ²)   (3.15)

and

    Var(X) = exp(2µ + σ²){exp(σ²) - 1}.   (3.16)

Figure 3.4 shows the probability density functions for three lognormal distributions.

Figure 3.4 Examples of lognormal distributions with a mean of 1.0. The standard deviations are 0.5, 1.0 and 2.0.
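Equations (3.15) and (3.16) can be verified by simulation, since the sample moments of a large lognormal sample should approach the theoretical values. A minimal sketch, assuming numpy, with µ and σ chosen arbitrarily for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma = 0.0, 1.0   # mean and standard deviation of log_e(X)
    x = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

    # Sample mean and variance against equations (3.15) and (3.16)
    print(x.mean(), np.exp(mu + 0.5 * sigma**2))                      # about 1.649
    print(x.var(), np.exp(2*mu + sigma**2) * (np.exp(sigma**2) - 1))  # about 4.671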
3.4 The Linear Regression Model

Linear regression is one of the most frequently used statistical tools. Its purpose is to relate the values of a single variable Y to one or more other variables X₁, X₂, ..., Xₚ, in an attempt to account for the variation in Y in terms of the variation in the other variables. With only one other variable this is often referred to as simple linear regression.

The usual situation is that the data available consist of n observations y₁, y₂, ..., yₙ for the dependent variable Y, with corresponding values for the X variables. The model assumed is

    y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε,   (3.17)

where ε is a random error with a mean of zero and a constant standard deviation σ. The model is estimated by finding the coefficients of the X values that make the error sum of squares as small as possible. In other words, if the estimated equation is

    ŷ = b₀ + b₁x₁ + b₂x₂ + ... + bₚxₚ,   (3.18)

then the b values are chosen so as to minimise

    SSE = Σ(yᵢ - ŷᵢ)²,   (3.19)

where ŷᵢ is the value given by the fitted equation that corresponds to the data value yᵢ, and the sum is over the n data values. Statistical packages and spreadsheets are readily available to do these calculations.

There are various ways in which the usefulness of a fitted regression equation can be assessed. One involves partitioning the variation observed in the Y values into a part that can be accounted for by the X values and a part (SSE, above) that cannot. To this end, the total variation in the Y values is measured by the total sum of squares

    SST = Σ(yᵢ - ȳ)².   (3.20)

This is partitioned into the sum of squares for error (SSE) and the sum of squares accounted for by the regression (SSR), so that SST = SSR + SSE. The proportion of the variation in Y accounted for by the regression equation is then the coefficient of multiple determination,

    R² = SSR/SST = 1 - SSE/SST,   (3.21)

which is a good indication of the effectiveness of the regression.

There are a variety of inference procedures that can be applied in the multiple regression situation when the regression errors ε are assumed to be independent random variables from a normal distribution with a mean of zero and constant variance σ². A test for whether the fitted equation accounts for a significant proportion of the total variation in Y can be based on Table 3.1, which is a version of what is called an 'analysis of variance table' because it compares the observed variation in Y accounted for by the fitted equation with the variation due to random errors. From this table, the F-ratio

    F = MSR/MSE = [SSR/p]/[SSE/(n - p - 1)]   (3.22)

can be tested against the F-distribution with p and n - p - 1 degrees of freedom to see if it is significantly large. If it is, then there is evidence that Y is related to at least one of the X variables.
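The least squares estimates, the R² of equation (3.21), and the F-ratio of equation (3.22) can all be computed in a few lines. A minimal sketch, assuming numpy and scipy (fit_regression is an invented helper name, and the simulated data are purely illustrative):

    import numpy as np
    from scipy import stats

    def fit_regression(X, y):
        """Least squares fit of y = b0 + b1*x1 + ... + bp*xp, returning
        the coefficients, R^2, the F-ratio, and the F-test p-value."""
        n, p = X.shape
        A = np.column_stack([np.ones(n), X])        # add a column for b0
        b, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimises SSE, equation (3.19)
        sse = np.sum((y - A @ b) ** 2)
        sst = np.sum((y - y.mean()) ** 2)           # equation (3.20)
        ssr = sst - sse
        r2 = ssr / sst                              # equation (3.21)
        f = (ssr / p) / (sse / (n - p - 1))         # equation (3.22)
        return b, r2, f, stats.f.sf(f, p, n - p - 1)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 2))
    y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=30)
    b, r2, f, p_value = fit_regression(X, y)
    print(b, r2, f, p_value)   # a small p-value indicates a real relationship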
[...]

The model

    CH = β₀ + β₁PH + β₂NT + ε   (3.24)

was fitted to the data in Table 3.3, where CH denotes chlorophyll-a, PH denotes phosphorus, and NT denotes nitrogen. This gave

    CH = -9.386 + 0.333PH + 1.200NT,   (3.25)

with an [...]

Table 3.3 Chlorophyll-a, phosphorus and nitrogen concentrations for 25 lakes

    Case   Chlorophyll-a   Phosphorus   Nitrogen
      1        95.0           329.0         8
      2        39.0           211.0         6
      3        27.0           108.0        11
      4        12.9            20.7        16
      5        34.8            60.2         9
      6        14.9            26.3        17
      7       157.0           596.0         4
      8         5.1            39.0        13
      9        10.6            42.0        11
     10        96.0            99.0        16
     11         7.2            13.1        25
     12       130.0           267.0        17
     13         4.7            14.9        18
     14       138.0           217.0        11
     15        24.8            49.3        12
     16        50.0           138.0        10
     17        12.7            21.1       ...
     18         7.4            25.0       ...
     19         8.6            42.0       ...
     20        94.0           207.0       ...
     21         3.9            10.5       ...
     22         5.0            25.0       ...
     23       129.0           373.0       ...
     24        86.0           220.0       ...
     25        64.0            67.0       ...

[...] the plots on the right-hand side of Figure 3.5 give little cause for concern.

Figure 3.5 (a) Standardized residuals for chlorophyll-a plotted against the fitted value predicted from the regression equation (3.25) and against the phosphorus and nitrogen concentrations for the lakes, and (b) standardized residuals for log(chlorophyll-a) plotted against the fitted value, log(phosphorus), and log(nitrogen) for the [...]

[...] requires the assumption that the random components εᵢⱼ in the model (3.30) have a normal distribution.

Table 3.5 Form of the analysis of variance table for a one factor model, with I levels of the factor and n observations in total

    Source of variation   Sum of squares(1)     Degrees of freedom   Mean square(2)      F(3)
    Factor                SSF                   I - 1                MSF = SSF/(I - 1)   MSF/MSE
    Error                 SSE                   n - I                MSE = SSE/(n - I)
    Total                 SST = ΣΣ(xᵢⱼ - x̄)²    n - 1

[...]

Table 3.6 Form of the analysis of variance table for a two factor model

    Source of variation   Sum of squares(1)      Degrees of freedom   Mean square                    F(2)
    Factor A              SSA                    I - 1                MSA = SSA/(I - 1)              MSA/MSE
    Factor B              SSB                    J - 1                MSB = SSB/(J - 1)              MSB/MSE
    AB Interaction        SSAB                   (I - 1)(J - 1)       MSAB = SSAB/{(I - 1)(J - 1)}   MSAB/MSE
    Error                 SSE                    IJ(m - 1)            MSE = SSE/{IJ(m - 1)}
    Total                 SST = ΣΣΣ(xᵢⱼₖ - x̄)²   n - 1

    (1) The sum for SST is over all levels for i, j and k, i.e., over all n observations.
    (2) The F-ratios for the factors are for fixed effects only.

Three Factor [...]

Table 3.7 Form of the analysis of variance table for a three factor model

    Source of variation   Sum of squares(1)       Degrees of freedom      Mean square                             F(2)
    Factor A              SSA                     I - 1                   MSA = SSA/(I - 1)                       MSA/MSE
    Factor B              SSB                     J - 1                   MSB = SSB/(J - 1)                       MSB/MSE
    Factor C              SSC                     K - 1                   MSC = SSC/(K - 1)                       MSC/MSE
    AB Interaction        SSAB                    (I - 1)(J - 1)          MSAB = SSAB/{(I - 1)(J - 1)}            MSAB/MSE
    AC Interaction        SSAC                    (I - 1)(K - 1)          MSAC = SSAC/{(I - 1)(K - 1)}            MSAC/MSE
    BC Interaction        SSBC                    (J - 1)(K - 1)          MSBC = SSBC/{(J - 1)(K - 1)}            MSBC/MSE
    ABC Interaction       SSABC                   (I - 1)(J - 1)(K - 1)   MSABC = SSABC/{(I - 1)(J - 1)(K - 1)}   MSABC/MSE
    Error                 SSE                     IJK(m - 1)              MSE = SSE/{IJK(m - 1)}
    Total                 SST = ΣΣΣΣ(xᵢⱼₖₘ - x̄)²  n - 1

    (1) The sum for SST is over all levels for i, j, k and m, i.e., over all n observations.
    (2) The F-ratios for the factors and two factor interactions are for fixed effects only.

[...]

Table 3.8 Results from Marr et al.'s (1995) challenge [...] examined for three types of fish. The tabulated values are survival times in hours.

               Hatchery Brown Trout   Hatchery Rainbow Trout   Clark Fork Brown Trout
               Control    Treated     Control    Treated       Control    Treated
                  8          10          24         54            30         36
                 18          60          24         48            30         30
                 24          60          24         48            30         30
                 24          60          24         54            36         30
                 24          54          24         54            30         36
                 24          72          24         36            36         30
                 18          54          24         30            36         42
                 18          30          24         18            24         54
                 24          36          24         48            36         30
                 18          48          24         36            36         48
                 10          48          24         24            36         24
                 [...]
    n            30          30          30         29            30         32
    Mean      21.53       41.27       28.20      39.10         28.93      69.00
    Std Dev   [...]

[...]

α(fᵢ) = α(1) for observations in 1989/90, α(fᵢ) = α(2) for observations in 1990/91, and so on, up to α(fᵢ) = α(6) for observations in 1994/95; Xᵢ₁ is 0 for North Taranaki and 1 for South Taranaki; Xᵢ₂ is 0 for bottom trawls and 1 for mid-water trawls; and Xᵢ₃ is 0 for day and 1 for night. The fishing year is then being treated as a factor at six levels, while the three X variables indicate the absence and presence [...]

Table 3.13 Estimates from fitting a log-linear model to the dolphin bycatch data

                                   Estimate   Standard Error
    α(1), year effect 1989/90       -7.328        0.590
    α(2), year effect 1990/91      -17.520       21.380
    α(3), year effect 1991/92       -5.509        0.537
    α(4), year effect 1992/93       -7.254        0.612
    α(5), year effect 1993/94       -8.260        0.636
    α(6), year effect 1994/95       -7.463        0.551
    Area effect (south v north)      1.822        0.411
    Gear effect (mid-water [...]
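The log-linear model for the bycatch data was fitted with specialised software, but a Poisson log-linear model of this general form can be fitted by Fisher scoring in a few lines. A minimal sketch, assuming numpy; the design matrix and counts below are invented placeholders rather than the Taranaki bycatch records:

    import numpy as np

    def poisson_loglinear(X, y, iterations=25):
        """Fit log(E[Y]) = Xb for Poisson counts by Fisher scoring,
        returning the estimates and their standard errors."""
        b = np.zeros(X.shape[1])
        for _ in range(iterations):
            mu = np.exp(X @ b)               # fitted mean counts
            info = X.T @ (X * mu[:, None])   # Fisher information X'WX, with W = diag(mu)
            b = b + np.linalg.solve(info, X.T @ (y - mu))
        se = np.sqrt(np.diag(np.linalg.inv(info)))
        return b, se

    # Invented illustration: an intercept plus three 0/1 indicator variables,
    # loosely patterned on the area, gear, and day/night effects above
    rng = np.random.default_rng(7)
    X = np.column_stack([np.ones(200), rng.integers(0, 2, size=(200, 3))])
    y = rng.poisson(np.exp(X @ np.array([-1.0, 1.8, 0.9, 0.6])))
    b, se = poisson_loglinear(X, y)
    print(np.round(b, 3), np.round(se, 3))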