the data have a long tail to the right and could be from a distribution with a positively skewed density function such as in Figure 8.2b. From the discussion in Chapter 6, we may try to fit a lognormal or gamma distribution.

The advantages of graphical methods can be summarized as follows:

1. They are fast and simple to use, in contrast with numerical methods, which may be computationally tedious and require considerable analytical sophistication. The additional accuracy of numerical methods is usually not great enough in practice to warrant the effort involved.
2. Probability and hazard plots provide approximate estimates of the parameters of the distribution by simple graphical means.
3. They allow one to assess whether a particular theoretical distribution provides an adequate fit to the data.
4. Peculiar appearance of a plot or points in a plot can provide insight into the data when the reasons for the peculiarities are determined.
5. A graph provides a visual representation of the data that is easy to grasp. This is useful not only for oneself but also in presenting data to others, since a plot allows one to assess conclusions drawn from the data by graphical or numerical means.

8.2 PROBABILITY PLOTTING

The basic ideas in probability plotting are illustrated by the following example.

Example 8.1 Consider the white blood cell counts (WBCs) of 23 pediatric leukemia patients given in Table 8.1, ranging from 8000 to 120,000. A sample cumulative distribution is constructed by ordering the data from smallest to largest, as shown in Table 8.1. A sample cumulative distribution curve can then be made by plotting each WBC value versus the percentage of the sample equal to or less than that value. That is, the $i$th ordered data value in a sample of $n$ values is plotted against the percentage $100i/n$. Note that for tied observations, we compute and plot the sample distribution only for the one with the largest $i$ value.
This gives a conservative estimate of the survivorship function. For example, the third value of WBC, 10, is plotted against a percentage of $100 \times 3/23 = 13\%$. A plot of the cumulative distribution function for most large populations contains many closely spaced values and can be well approximated by a smooth curve drawn through the points. In contrast, a sample cumulative distribution function has a relatively small number of points and thus a somewhat ragged appearance. To approximate the population cumulative distribution function, one draws a smooth curve through the data points, obtaining a best fit by eye. Such a curve from the WBC data is given in Figure 8.3. It is an estimate of the cumulative distribution function of the population and is used to obtain estimates and other information about the population.

Table 8.1 Ordered WBC Data and Sample Cumulative Distribution for Example 8.1

                           Sample distribution
WBC (x10^3)   Order, i   i/23     (i - 0.5)/23   $\Phi^{-1}(\hat{F})$ (a)
      8           1
      8           2      0.087       0.065         -1.512
     10           3      0.130       0.109         -1.233
     15           4      0.174       0.152         -1.027
     20           5      0.217       0.196         -0.857
     30           6      0.261       0.239         -0.709
     50           7
     50           8
     50           9
     50          10
     50          11      0.478       0.457         -0.109
     60          12
     60          13      0.565       0.543          0.109
     75          14
     75          15      0.652       0.630          0.333
     80          16
     80          17      0.739       0.717          0.575
     90          18
     90          19
     90          20      0.870       0.848          1.027
    100          21      0.913       0.891          1.233
    110          22      0.957       0.935          1.512
    120          23      1.000       0.978          2.019

(a) $\Phi^{-1}(\cdot)$ denotes the inverse of the standard normal distribution function.

An estimate of the population median (50th percentile) is obtained by entering the plot on the percentage scale at 50%, going horizontally to the fitted line, and then going vertically down to the data scale to read the estimate of the median. For the WBC data, an estimate of the population median is 65,000. The median is a representative, or nominal, value for the population since half of the population values are above it and half below.
An estimate of any other percentile can be obtained similarly by entering the plot at the appropriate point on the percentage scale, going horizontally to the fitted line, and then going vertically down to the data scale, where the estimate is read. For example, an estimate for the 25th percentile is 40,000.

Figure 8.3 Sample cumulative distribution curve of the WBC data.

One can obtain an estimate of the proportion of the population that has a WBC below a specific value in a similar way. For example, to find the proportion of the population with a WBC of 10,000 or less, you enter the plot on the horizontal axis at the given value, 10, go vertically up to the line fitted to the data, and then horizontally to the probability scale, where the estimate of the population proportion is read, 8%. An estimate of the proportion of a population between two given values is obtained by first getting an estimate of the proportion below each value and then taking the difference. For example, the estimate of the population proportion with WBC between 10,000 and 65,000 is $50 - 8 = 42\%$.

As mentioned above, a smooth curve can be fitted by eye to a sample cumulative distribution function to obtain an estimate of the population distribution function. Also, one can fit data with a theoretical cumulative distribution function by using a probability plot and then use this plot to estimate the parameters in the theoretical cumulative distribution function. The distribution may be the normal, lognormal, exponential, Weibull, gamma, or log-logistic.

To make a probability plot, one generally uses $(i - 0.5)/n$ or $i/(n + 1)$ to estimate the sample cumulative distribution function at the $i$th ordered value of the $n$ observations in the sample. The $(i - 0.5)/n$ for the WBC data are given in Table 8.1.
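As a rough check on the plotting positions in Table 8.1, the following Python sketch recomputes $i/n$, $(i - 0.5)/n$, and $\Phi^{-1}$ of the latter for the 23 WBC values (in thousands), keeping only the largest $i$ among tied values as described above. The function name `plotting_positions` is an illustrative choice and does not come from the text; the standard-library `statistics.NormalDist` is assumed here as a stand-in for a normal table.

```python
from statistics import NormalDist

# WBC counts (in thousands) of the 23 patients in Example 8.1.
wbc = [8, 8, 10, 15, 20, 30, 50, 50, 50, 50, 50, 60, 60,
       75, 75, 80, 80, 90, 90, 90, 100, 110, 120]


def plotting_positions(values):
    """Return rows (value, i, i/n, (i - 0.5)/n, z) for each distinct value.

    For tied values only the largest rank i is kept.
    """
    data = sorted(values)
    n = len(data)
    last_rank = {}
    for i, v in enumerate(data, start=1):
        last_rank[v] = i  # a later tie overwrites the earlier rank
    rows = []
    for v, i in sorted(last_rank.items()):
        f_mid = (i - 0.5) / n
        z = NormalDist().inv_cdf(f_mid)  # inverse standard normal cdf
        rows.append((v, i, i / n, f_mid, z))
    return rows


rows = plotting_positions(wbc)
for v, i, f, f_mid, z in rows:
    print(f"{v:4d} {i:3d} {f:6.3f} {f_mid:6.3f} {z:7.3f}")
```

The printed rows can be compared line by line with the filled-in rows of Table 8.1; small differences in the last digit would only reflect rounding.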
The probability plot is so constructed that if the theoretical distribution is adequate for the data, the graph of a function of $t$ (used as the y-axis) versus a function of the sample cumulative distribution function (used as the x-axis) will be close to a straight line. The parameters of the theoretical distribution can then be estimated from a fitted line. This is carried out as follows.

Step 1. A theoretical distribution for the survival time $t$ has to be selected.

Step 2. The sample cumulative distribution function is estimated by using $(i - 0.5)/n$ or $i/(n + 1)$, $i = 1, 2, \ldots, n$, for the $i$th ordered $t$ value. For tied observations (observations that have the same value), the sample cumulative distribution function is plotted against only the $t$ with the largest $i$ value.

Step 3. Plot $t$ or a function of it versus the estimated sample cumulative distribution or a function of it.

Step 4. Fit a straight line through the points by eye. The position of the straight line should be chosen to provide a fit to the bulk of the data and may ignore outliers or data points of doubtful validity.

Figure 8.4 Normal probability plot of the WBC data in Example 8.1.

Figure 8.4 gives a normal probability plot of the WBC versus $\Phi^{-1}(F)$, where $\Phi^{-1}(\cdot)$ is the inverse of the standard normal distribution function. The values of $\Phi^{-1}(\hat{F}(\mathrm{WBC}_{(i)}))$ are shown in Table 8.1. The plot is reasonably linear.

The straight line fitted by eye in a probability plot can be used to estimate percentiles and proportions within given limits in the same manner as for the sample cumulative distribution curve. In addition, a probability plot provides estimates of the parameters of the theoretical distribution chosen. The mean (or median) WBC estimated from the normal probability plot in Figure 8.4 is 56,000 [at $\Phi^{-1}(F) = 0$, $F = 0.5$ and WBC = 56,000]. At $\Phi^{-1}(F) = 1$, WBC = 91,000, which corresponds to the mean plus 1 standard deviation. Thus, the standard deviation is estimated as $91{,}000 - 56{,}000 = 35{,}000$.
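The four steps can be sketched in Python for the WBC data. This is only an illustration under stated assumptions: an ordinary least-squares line is swapped in for the line fitted by eye, so its intercept and slope need not agree exactly with the graphical readings of 56,000 and 35,000. The helper names `normal_plot_points` and `least_squares_line` are hypothetical.

```python
from statistics import NormalDist

# Step 1: the normal distribution is the candidate model.
# WBC counts in thousands (Example 8.1).
wbc = [8, 8, 10, 15, 20, 30, 50, 50, 50, 50, 50, 60, 60,
       75, 75, 80, 80, 90, 90, 90, 100, 110, 120]


def normal_plot_points(values):
    """Steps 2 and 3: pairs (z, value) with z = inverse normal cdf of (i - 0.5)/n.

    Tied values are plotted once, at their largest rank i.
    """
    data = sorted(values)
    n = len(data)
    last_rank = {}
    for i, v in enumerate(data, start=1):
        last_rank[v] = i
    return [(NormalDist().inv_cdf((i - 0.5) / n), v)
            for v, i in sorted(last_rank.items())]


def least_squares_line(xs, ys):
    """Step 4 (assumption: least squares replaces the fit by eye)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


points = normal_plot_points(wbc)
z_scores = [p[0] for p in points]
counts = [p[1] for p in points]

# The intercept (value at z = 0) estimates the mean;
# the slope (change per unit z) estimates the standard deviation.
sd_hat, mean_hat = least_squares_line(z_scores, counts)
print(f"mean ~ {mean_hat:.1f} thousand, sd ~ {sd_hat:.1f} thousand")
```

A least-squares line weights every point equally, whereas a fit by eye may deliberately discount outliers, so the two approaches can differ when the tails of the plot bend away from the line.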
We now discuss probability plots of the exponential, Weibull, lognormal, and log-logistic distributions.

Exponential Distribution

The exponential cumulative distribution function is

$$F(t) = 1 - \exp(-\lambda t), \qquad t \ge 0 \tag{8.2.1}$$

The probability plot for the exponential distribution is based on the relationship between $t$ and $F(t)$, from (8.2.1),

$$t = \frac{1}{\lambda} \log \frac{1}{1 - F(t)} \tag{8.2.2}$$

This relationship is linear between $t$ and the function $\log[1/(1 - F(t))]$. Thus, an exponential probability plot is made by plotting the $i$th ordered observed survival time $t_{(i)}$ versus $\log[1/(1 - \hat{F}(t_{(i)}))]$, where $\hat{F}(t_{(i)})$ is an estimate of $F(t_{(i)})$, for example, $(i - 0.5)/n$, for $i = 1, \ldots, n$. From (8.2.2), at $\log\{1/[1 - F(t)]\} = 1$, $t = 1/\lambda$. This fact can be used to estimate $1/\lambda$ and thus $\lambda$ from the fitted straight line. That is, the value $t$ corresponding to $\log\{1/[1 - F(t)]\} = 1$ is an estimate of the mean $1/\lambda$ and its reciprocal is an estimate of the hazard rate $\lambda$.

Example 8.2 Suppose that 21 patients with acute leukemia have the following remission times in months: 1, 1, 2, 2, 3, 4, 4, 5, 5, 6, 8, 8, 9, 10, 10, 12, 14, 16, 20, 24, and 34. We would like to know if the remission time follows the exponential distribution. The ordered remission times $t_{(i)}$ and the $\log\{1/[1 - \hat{F}(t)]\}$ are given in Table 8.2. The exponential probability plot is shown in Figure 8.5. A straight line is fitted to the points by eye, and the plot indicates that the exponential distribution fits the data very well.

Table 8.2 Probability Plotting for Example 8.2

$t_{(i)}$   Order, i   $\hat{F}$ = (i - 0.5)/21   $\log[1/(1 - \hat{F})]$
    1          1
    1          2          0.071                     0.074
    2          3
    2          4          0.167                     0.182
    3          5          0.214                     0.241
    4          6
    4          7          0.310                     0.370
    5          8
    5          9          0.405                     0.519
    6         10          0.452                     0.602
    8         11
    8         12          0.548                     0.793
    9         13          0.595                     0.904
   10         14
   10         15          0.690                     1.173
   12         16          0.738                     1.340
   14         17          0.786                     1.540
   16         18          0.833                     1.792
   20         19          0.881                     2.128
   24         20          0.929                     2.639
   34         21          0.976                     3.738

Figure 8.5 Exponential probability plot of the data in Example 8.2.
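The last column of Table 8.2 can be recomputed with a few lines of Python. The sketch below also fits a line through the origin, since (8.2.2) has no intercept term; that least-squares slope is an assumption standing in for the line drawn by eye, and the names `exponential_plot_points` and `slope_through_origin` are hypothetical.

```python
import math

# Remission times in months (Example 8.2).
remission = [1, 1, 2, 2, 3, 4, 4, 5, 5, 6, 8, 8, 9, 10, 10,
             12, 14, 16, 20, 24, 34]


def exponential_plot_points(values):
    """Pairs (log[1/(1 - F)], t) with F = (i - 0.5)/n.

    Tied times are plotted once, at their largest rank i.
    """
    data = sorted(values)
    n = len(data)
    last_rank = {}
    for i, t in enumerate(data, start=1):
        last_rank[t] = i
    points = []
    for t, i in sorted(last_rank.items()):
        f_hat = (i - 0.5) / n
        points.append((math.log(1.0 / (1.0 - f_hat)), t))
    return points


def slope_through_origin(points):
    """Least-squares slope of t on x with no intercept (assumed stand-in for an eye fit)."""
    sxy = sum(x * t for x, t in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / sxx


points = exponential_plot_points(remission)
mean_hat = slope_through_origin(points)   # value of t at x = 1, estimates 1/lambda
rate_hat = 1.0 / mean_hat                 # estimates the hazard rate lambda
print(f"1/lambda ~ {mean_hat:.2f} months, lambda ~ {rate_hat:.3f} per month")
```

Because the slope is the value of $t$ at $\log\{1/[1 - F(t)]\} = 1$, it plays the same role as the reading taken from the fitted line in Figure 8.5.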
At the point $\log[1/(1 - F(t))] = 1.0$, the corresponding $t$, approximately 9.0 months, is an estimate of the mean $1/\lambda$ and thus an estimate of the hazard rate is $\hat{\lambda} = 1/9 = 0.111$ per month. An alternative is to use (7.2.5) to estimate $\lambda$, $\hat{\lambda} = 21/198 = 0.106$, which is very close to the graphical estimate.

Weibull Distribution

The Weibull cumulative distribution function is

$$F(t) = 1 - \exp[-(\lambda t)^{\gamma}], \qquad t \ge 0,\ \gamma > 0,\ \lambda > 0 \tag{8.2.3}$$

The probability plot for the Weibull distribution is based on the relationship

$$\log t = \log \frac{1}{\lambda} + \frac{1}{\gamma} \log\left\{\log \frac{1}{1 - F(t)}\right\} \tag{8.2.4}$$

between $t$ and the cumulative distribution function $F$ of $t$ obtained from (8.2.3). This relationship is linear between $\log t$ and the function $\log(\log\{1/[1 - F(t)]\})$. Thus, a Weibull probability plot is a graph of $\log(t_{(i)})$ and $\log(\log\{1/[1 - \hat{F}(t_{(i)})]\})$, where $\hat{F}(t_{(i)})$ is an estimate of $F(t_{(i)})$, for example, $(i - 0.5)/n$, for $i = 1, \ldots, n$. The shape parameter $\gamma$ is estimated graphically as the reciprocal of the slope of the straight line fitted to the graph. If the fitted line is appropriate, then at $\log(\log\{1/[1 - F(t)]\}) = 0$, the corresponding $\log(t)$ is an estimate of $\log(1/\lambda)$ from (8.2.4). This fact can be used to estimate $1/\lambda$ and thus $\lambda$ graphically from a Weibull probability plot. At $\log(\log\{1/[1 - F(t)]\}) = 0.5$, (8.2.4) reduces to $\log t = \log(1/\lambda) + 0.5/\gamma$. This equation can be used to estimate $\gamma$. Estimates of the parameters can also be obtained from the method described in Chapter 7 if the Weibull distribution appears to be a good fit graphically.

The following hypothetical example illustrates the use of the Weibull probability plot. The small number of observations used in the example is only for illustrative purposes. In practice, many more observations are needed to identify an appropriate theoretical model for the data.

Example 8.3 Six mice with brain tumors have survival times, in months, of 3, 4, 5, 6, 8, and 10. $\log(t_{(i)})$ plotted against $\log(\log\{1/[1 - (i - 0.5)/6]\})$ for $i = 1, \ldots, 6$ is shown in Figure 8.6.
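The coordinates behind a plot such as Figure 8.6 can be reproduced with a short Python sketch. It is a sketch under assumptions, not the book's procedure: a least-squares line replaces the fit by eye, and the names `weibull_plot_points` and `least_squares_line` are hypothetical. Following (8.2.4), the reciprocal of the slope estimates $\gamma$ and the intercept estimates $\log(1/\lambda)$.

```python
import math

# Survival times in months of the six mice (Example 8.3).
times = [3, 4, 5, 6, 8, 10]


def weibull_plot_points(values):
    """Pairs (log(log(1/(1 - F))), log t) with F = (i - 0.5)/n.

    Tied times, if any, are plotted once at their largest rank i.
    """
    data = sorted(values)
    n = len(data)
    last_rank = {}
    for i, t in enumerate(data, start=1):
        last_rank[t] = i
    points = []
    for t, i in sorted(last_rank.items()):
        f_hat = (i - 0.5) / n
        x = math.log(math.log(1.0 / (1.0 - f_hat)))
        points.append((x, math.log(t)))
    return points


def least_squares_line(points):
    """Slope and intercept of y on x (assumed stand-in for the eye fit)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


points = weibull_plot_points(times)
slope, intercept = least_squares_line(points)

gamma_hat = 1.0 / slope            # shape: reciprocal of the slope
lambda_hat = math.exp(-intercept)  # intercept estimates log(1/lambda)
print(f"gamma ~ {gamma_hat:.2f}, lambda ~ {lambda_hat:.3f}")
```

With only six observations the fitted values are sensitive to how the line is drawn, which is one reason the text stresses that many more observations are needed in practice.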
A straight line is fitted to the data points by eye. From the fitted line, at $\log(\log\{1/[1 - F(t)]\}) = 0$, the corresponding $\log(t) = 1.9$, and thus an estimate of $1/\lambda$ is approximately 6.69 [$= \exp(1.9)$] months and an estimate of $\lambda$ is 0.150. At $\log(\log\{1/[1 - F(t)]\}) = 0.5$, the corresponding $\log(t) = 2.09$, and thus an estimate of $\gamma$ is $0.5/(2.09 - 1.9) = 2.63$. The maximum likelihood estimates of $\gamma$ and $\lambda$ obtained from the SAS procedure LIFEREG are 2.75 and 0.148, respectively. The graphical estimates of $\gamma$ and $\lambda$ are close to the MLE.

Figure 8.6 Weibull probability plot of the data in Example 8.3.

Lognormal Distribution

If the survival time $t$ follows a lognormal distribution with parameters $\mu$ and $\sigma^2$, $\log t$ follows the normal distribution with mean $\mu$ and variance $\sigma^2$. Consequently, $(\log t - \mu)/\sigma$ has the standard normal distribution. Thus, the lognormal distribution function can be written as

$$F(t) = \Phi\left(\frac{\log t - \mu}{\sigma}\right), \qquad t > 0 \tag{8.2.5}$$

where $\Phi(\cdot)$ is the standard normal distribution function and $\mu$ and $\sigma$ are, respectively, the mean and standard deviation of $\log t$. A probability plot for the lognormal distribution is based on the following relationship obtained from (8.2.5):

$$\log t = \mu + \sigma\, \Phi^{-1}(F(t)) \tag{8.2.6}$$

The function $\Phi^{-1}(\cdot)$ is the inverse of the standard normal distribution function or its 100F percentile. This relationship is linear between the value $\log t$ and the function $\Phi^{-1}(F(t))$. Thus, a lognormal probability plot is a graph of $\log(t_{(i)})$ versus $\Phi^{-1}(\hat{F}(t_{(i)}))$, where $\hat{F}(t_{(i)})$ is an estimate of $F(t_{(i)})$. From (8.2.6), at $\Phi^{-1}(F(t)) = 0$, $\log t = \mu$; and at $\Phi^{-1}(F(t)) = 1$, $\sigma = \log t - \mu$. These facts can be used to estimate $\mu$ and $\sigma$ from a straight line fitted to the graph.

Example 8.4 In a study of a new insecticide, 20 insects are exposed. Survival times in seconds are 3, 5, 6, 7, 8, 9, 10, 10, 12, 15, 15, 18, 19, 20, 22, 25, 28, 30, 40, and 60. Suppose that prior experience indicates that the survival time follows a lognormal distribution; that is, some insects might react to the insecticide very slowly and not die for a long time.
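For the insect data of Example 8.4, the lognormal plotting coordinates from (8.2.6) can be sketched in Python as follows. The least-squares line is an assumption standing in for a line fitted by eye, `statistics.NormalDist().inv_cdf` plays the role of $\Phi^{-1}$, and the helper names are hypothetical. The intercept estimates $\mu$ and the slope estimates $\sigma$.

```python
import math
from statistics import NormalDist

# Survival times in seconds of the 20 insects (Example 8.4).
times = [3, 5, 6, 7, 8, 9, 10, 10, 12, 15, 15, 18, 19, 20,
         22, 25, 28, 30, 40, 60]


def lognormal_plot_points(values):
    """Pairs (inverse normal cdf of (i - 0.5)/n, log t).

    Tied times are plotted once, at their largest rank i.
    """
    data = sorted(values)
    n = len(data)
    last_rank = {}
    for i, t in enumerate(data, start=1):
        last_rank[t] = i
    return [(NormalDist().inv_cdf((i - 0.5) / n), math.log(t))
            for t, i in sorted(last_rank.items())]


def least_squares_line(points):
    """Slope and intercept of y on x (assumed stand-in for the eye fit)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


points = lognormal_plot_points(times)
sigma_hat, mu_hat = least_squares_line(points)  # slope -> sigma, intercept -> mu
print(f"mu ~ {mu_hat:.2f}, sigma ~ {sigma_hat:.2f}")
```

The same coordinates could be produced in a spreadsheet; the code simply makes the tie handling and the choice of $(i - 0.5)/n$ explicit.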
The $\log(t_{(i)})$ versus $\Phi^{-1}[(i - 0.5)/20]$, $i = 1, \ldots, 20$, are plotted in Figure 8.7. The plot shows a reasonably straight line. From the fitted line, at $\Phi^{-1}(F(t)) = 0$, $\log t$ is an estimate of $\mu$, which is equal to 2.64, and at $\Phi^{-1}(F(t)) = 1$, $\log t = 3.4$ and thus $\hat{\sigma} = 3.4 - 2.64 = 0.76$. $\Phi^{-1}(F(t))$ can be obtained by applying the Microsoft Excel function NORMSINV.

Figure 8.7 Lognormal probability plot of the data in Example 8.4.

Log-Logistic Distribution

The log-logistic distribution function is

$$F(t) = \frac{\alpha t^{\gamma}}{1 + \alpha t^{\gamma}}, \qquad t \ge 0,\ \alpha > 0,\ \gamma > 0 \tag{8.2.7}$$

A probability plot for the log-logistic distribution is based on the following relationship obtained from (8.2.7):

$$\log t = \frac{1}{\gamma} \log\left[\frac{1}{1 - F(t)} - 1\right] - \frac{1}{\gamma} \log \alpha \tag{8.2.8}$$

Thus, a log-logistic probability plot is a graph of $\log(t_{(i)})$ versus $\log(\{1/[1 - \hat{F}(t_{(i)})]\} - 1)$, where $\hat{F}(t_{(i)})$ is an estimate of $F(t_{(i)})$, for example, $(i - 0.5)/n$, for $i = 1, \ldots, n$. From (8.2.8), at $\log\{[1/(1 - F)] - 1\} = 0$, $\log t = -(1/\gamma) \log \alpha$; and at $\log\{[1/(1 - F)] - 1\} = 1$, $\log t = (1/\gamma)(1 - \log \alpha)$. These facts can be used to estimate $\alpha$ and $\gamma$. The following example illustrates the log-logistic probability plot.

Example 8.5 Consider the following survival times of 10 experimental rats in days: 8, 15, 25, 30, 50, 90, 95, 100, 150, and 300. Figure 8.8 plots $\log(t_{(i)})$ against $\log(\{1/[1 - (i - 0.5)/10]\} - 1)$ for $i = 1, \ldots, 10$. To estimate $\alpha$ and $\gamma$, from the fitted line, at $\log(\{1/[1 - F(t)]\} - 1) = 0$, $\log t = 4.0$; and at $\log(\{1/[1 - F(t)]\} - 1) = 1$, $\log t = 4.6$. Thus, we have two equations:

$$4.0 = -\frac{1}{\gamma} \log \alpha \qquad \text{and} \qquad 4.6 = \frac{1}{\gamma}(1 - \log \alpha)$$

From these two equations, $\hat{\gamma} = 1.667$ and $\hat{\alpha} = 0.0013$.

Figure 8.8 Log-logistic probability plot of the data in Example 8.5.

8.3 HAZARD PLOTTING

Hazard plotting (Nelson 1972, 1982) is analogous to probability plotting, the principal difference being that the survival time (or a function of it) is plotted against the cumulative hazard function (or a function of it) rather than the distribution function. Hazard plotting is designed to handle censored data.
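Returning briefly to the two equations of Example 8.5, the arithmetic that leads from the two readings off the fitted line to the log-logistic parameter estimates can be checked with a small Python sketch. Subtracting the first equation from the second gives $1/\gamma$, and substituting back gives $\log \alpha$. The function name `loglogistic_from_plot` and its argument names are hypothetical.

```python
import math


def loglogistic_from_plot(log_t_at_0, log_t_at_1):
    """Solve the two readings from (8.2.8) for gamma and alpha.

    log_t_at_0: log t where log{[1/(1 - F)] - 1} = 0, equal to -(1/gamma) log(alpha)
    log_t_at_1: log t where log{[1/(1 - F)] - 1} = 1, equal to (1/gamma)(1 - log(alpha))
    """
    # The difference of the two readings is 1/gamma.
    gamma = 1.0 / (log_t_at_1 - log_t_at_0)
    # From the first reading, log(alpha) = -gamma * log_t_at_0.
    alpha = math.exp(-gamma * log_t_at_0)
    return gamma, alpha


gamma_hat, alpha_hat = loglogistic_from_plot(4.0, 4.6)
print(f"gamma ~ {gamma_hat:.3f}, alpha ~ {alpha_hat:.4f}")
```

The estimates depend entirely on the two values read from the plot, so a slightly different line drawn by eye would give somewhat different parameter values.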
Similar to probability plotting, estimates of parameters in the distribution can be determined from the hazard plot with little computational effort. To determine whether a set of survival times with censored observations is from a given theoretical distribution, we construct a hazard plot by plotting the survival time (or a function of it) versus an estimated cumulative hazard (or [...]