Medical biostatistics

Seminar notes
Medical Biostatistics 2

Georg Heinze
Section of Clinical Biometrics
Core Unit for Medical Statistics and Informatics
Medical University of Vienna
Spitalgasse 23, A-1090 Vienna, Austria
e-mail: georg.heinze@meduniwien.ac.at

Version 2008-04

PART 1: STATISTICAL MODELS
  Class 1: General issues on statistical modeling
    Statistical tests and statistical models
    What is a statistical test?
    What is a statistical model?
    Response or outcome variable
    Independent variable
    Representing a statistical test by a statistical model
    Uncertainty of a model
    Types of responses – types of models
    Univariate and multivariable models
    Multivariate models
    Purposes of multivariable models
    Confounding
    Effect modification
    Assumptions of various models

PART 2: ANALYSIS OF BINARY OUTCOMES
  Class 2: Diagnostic studies
    Assessment of diagnostic tests
    Receiver-operating-characteristic (ROC) curves
  Class 3: Risk measures
    Absolute measures to compare the risk in two groups
    Relative measures to compare the risk between two groups
    Summary: calculation of risk measures and 95% confidence intervals
  Class 4: Logistic regression
    Simple logistic regression
    Examples
    Multiple logistic regression
  SPSS Lab 1: Analysis of binary outcomes
    Define a cut-point
    2x2 tables: computing row and column percentages
    ROC curves
    Logistic regression
  References

PART 3: ANALYSIS OF SURVIVAL OUTCOMES
  Class 5: Survival outcomes
    Definition of survival data
    Kaplan-Meier estimates of survival functions
    Simple tests
  Class 6: Cox regression
    Basics
    Assumptions
    Estimates derived from the model
    Relationship Cox regression – log rank test
  Class 7: Multivariable Cox regression
    Stratification: another way to address confounding
  Class 8: Assessing model assumptions
    Proportional hazards assumption
    Graphical checks of the PH assumption
    Testing violations of the PH assumption
    What to do if PH assumption is violated?
    Influential observations
  SPSS Lab 2: Analysis of survival outcomes
    Kaplan-Meier analysis
    Cumulative hazards plots
    Cox regression
    Stratified Cox model
    Partial residuals and DfBeta plots
    Testing the slope of partial residuals
    Defining time-dependent effects
    Dividing the time axis
  References

PART 4: ANALYSIS OF REPEATED MEASUREMENTS
  Class 9: Pretest-posttest data
    Pretest-posttest data
    Change scores
    The regression to the mean effect
    Analysis of covariance
  Class 10: Visualizing repeated measurements
    Introduction
    Individual curves
    Grouped curves
    Drop-outs
    Correlation of two repeatedly measured variables
  Class 11: Summary measures
    Example: slope of reciprocal creatinine values
    Example: area under the curve
    Example: Cmax vs. Tmax
    Example: aspirin absorption
  Class 12: ANOVA for repeated measurements
    Extension of one-way ANOVA
    Between-subject and within-subject effects
    Specification of a RM-ANOVA
  SPSS Lab 3: Analysis of repeated measurements
    Pretest-posttest data
    Individual curves
    Grouped curves
    Computing summary measures
    ANOVA for repeated measurements
    Restructuring a longitudinal data set
  References

Part 1: Statistical models

Class 1: General issues on statistical modeling

Statistical tests and statistical models

In the basic course on Medical Biostatistics, several statistical tests were introduced. The course closed by presenting a statistical model, the linear regression model. Here, we start with a review of statistical tests and show how they can be represented as statistical models. Then we extend the idea of statistical models and discuss application, presentation of results, and other issues related to statistical modeling.

What is a statistical test?

In its simplest setting, a statistical test compares the values of a variable between two groups.
Often we want to infer whether two groups of patients actually belong to the same population. We specify a null hypothesis and reject it if the observed data are sufficiently incompatible with it. For simplification we restrict the hypothesis to the comparison of means, as the mean is the most important and most obvious feature of any distribution. If our patient groups belong to the same population, they should exhibit the same mean. Thus, our null hypothesis states “the means in the two groups are equal”.

To perform the statistical test, we need two pieces of information for each patient: his/her group membership, and his/her value of the variable to be compared. (So far, it is of no importance whether the variable we want to compare is a scale or a nominal variable.)

In short, a statistical test assesses hypotheses about the study population.

As an example, consider the rat diet example of the basic lecture. We tested the equality of weight gains between the high protein diet group and the low protein diet group.

What is a statistical model?

A statistical model establishes a relationship between variables, e.g., a rule for predicting a patient’s cholesterol level from his age and body mass index. By estimating the model parameters, we can quantify this relationship and (hopefully) predict cholesterol levels:

Coefficients (dependent variable: Cholesterol level)

Model 1            B         Std. Error   Beta    t        Sig.
(Constant)         153.115   8.745                17.509   .000
Body-mass-index    1.179     .326         .283    3.620    .001
Age                .756      .091         .648    8.293    .000

B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient.

In this model, we would estimate a patient’s cholesterol level from age and body-mass-index as

Cholesterol = 153.1 + 1.179*BMI + 0.756*Age

The regression coefficients (parameters) are 1.179 for BMI and 0.756 for age.
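As a minimal sketch, the fitted equation can be written as a small prediction function. The coefficients are taken from column B of the output above; the function name and the two patients are hypothetical and serve only as an illustration.

```python
# Coefficients copied from column B of the SPSS output in the notes.
INTERCEPT = 153.115
B_BMI = 1.179
B_AGE = 0.756


def predict_cholesterol(bmi: float, age: float) -> float:
    """Return the model's estimated mean cholesterol level for a given BMI and age."""
    return INTERCEPT + B_BMI * bmi + B_AGE * age


# Two hypothetical patients of the same age whose BMI differs by 1 kg/m^2.
slimmer = predict_cholesterol(bmi=24.0, age=50.0)
heavier = predict_cholesterol(bmi=25.0, age=50.0)

# The difference between the two predictions is the BMI coefficient
# (up to floating-point rounding), whatever the common age is.
difference = heavier - slimmer
print(f"slimmer: {slimmer:.2f}, heavier: {heavier:.2f}, difference: {difference:.3f}")
```

Changing the common age shifts both predictions by the same amount, so the difference between them stays at the BMI coefficient.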
The coefficients have the following interpretation:

Comparing two patients of the same age who differ in their BMI by 1 kg/m², the heavier person’s cholesterol level is on average 1.179 units higher than that of the slimmer person.

Comparing two patients with the same BMI who differ in their age by one year, the older person will on average have a cholesterol level 0.756 units higher than the younger person.

The column labeled “Sig.” contains the p-values for testing the hypothesis that the corresponding regression coefficient is zero. If the coefficients were actually zero, these variables would have no effect on cholesterol, as can be demonstrated easily:

Cholesterol = 180 + 0*BMI + 0*Age

In the above equation, the cholesterol level is completely independent of BMI and age. No matter which values we insert for BMI or Age, the cholesterol level will not change from 180.

Summarizing, we can get more out of a statistical model than out of a statistical test: not only do we test the hypothesis of ‘no relationship’, we also obtain an estimate of the magnitude of the relationship, and even a prediction rule for cholesterol.

Response or outcome variable

Statistical models, in their simplest form, and statistical tests are related to each other. We can express any statistical test as a statistical model, in which the p-value obtained by statistical testing is delivered as a ‘by-product’.

In our example of a statistical model, the cholesterol level is our outcome or response variable. Generally, any variable we want to compare between groups is an outcome or response variable. In the rat diet example, the response variable is the weight gain.

Independent variable

The statistical model provides an equation to estimate values of the response variable from one or several independent variables.
The term ‘independent’ points to their role in the model: they play an active part, namely to explain differences in the response, and are not themselves explained. In our example, these independent variables were BMI and age. In the rat diet example, we consider the diet group (high or low protein) as the independent variable.

The interpretability of estimated regression coefficients is of special importance. Models whose coefficients have no clear interpretation are seldom used in medicine; models which allow a clear interpretation of their results are generally preferred.

Representing a statistical test by a statistical model

Recall the rat diet example. We can represent the t-test which was applied to the data as a linear regression of weight gain on diet group:

Weight gain = b0 + b1*D

where D=1 for the high protein group, and D=0 for the low protein group. Now the regression coefficients b0 and b1 have a clear interpretation:

b0 is the mean weight gain in the low protein group (because for D=0, we have Weight gain = b0 + b1*0).

b1 is the excess average weight gain in the high protein group, compared to the low protein group, or, put another way, the difference in mean weight gain between the two groups.

Clearly, if b1 is significantly different from zero, then the type of diet influences weight gain. Let us check this by applying linear regression to the rat diet data:

Coefficients (dependent variable: Weight gain, day 28 to 84)

Model 1          B         Std. Error   Beta    t        Sig.
(Constant)       139.000   14.575               9.537    .000
Dietary group    -19.000   10.045       -.417   -1.891   .076

B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient.

For comparison, consider the results of the t-test:

Independent Samples Test: Weight gain (g)

Levene's Test for Equality of Variances: F = .015, Sig. = .905

t-test for Equality of Means
                                                                                    95% Confidence Interval
                                                                                    of the Difference
                               t       df       Sig.         Mean         Std. Error
                                                (2-tailed)   Difference   Difference   Lower     Upper
Equal variances assumed        1.891   17       .076         19.0000      10.0453      -2.1937   40.1937
Equal variances not assumed    1.911   13.082   .078         19.0000      9.9440       -2.4691   40.4691

To interpret the coefficient corresponding to ‘Dietary group’, we must know how this variable was coded. In the data set, 1 was the code for the high protein group, and 2 for the low protein group. Inserting the codes into the regression model we obtain

Weight gain = 139 - 19*1 = 120

for the high protein group and

Weight gain = 139 - 19*2 = 101

for the low protein group, which exactly reproduces the means of weight gain in the two groups. The p-value associated with Dietary group (0.076) is identical to that of the two-sample t-test with equal variances assumed. The t statistics differ only in sign, because the regression coefficient measures the change from code 1 (high protein) to code 2 (low protein), whereas the t-test reports the difference high minus low.

Similar relationships exist for other statistical tests, e.g., the chi-square test has its analogue in logistic regression, and the log-rank test for comparing survival data can be expressed as a simple Cox regression model. Both will be demonstrated in later sessions.

Uncertainty of a model

Since a model is estimated from a sample of limited size, we cannot be sure that the estimated values exactly equal those of the underlying population. Therefore, it is important that when reporting results we also state how precise our estimates are. This is usually done by supplying confidence intervals in addition to point estimates.

Even in the hypothetical case where we actually know the population values of the regression coefficients, the structure of the equation may be insufficient to predict a patient’s outcome with 100% certainty. Therefore, we should give an estimate of the predictive accuracy of a model. In linear regression, such a measure is routinely computed by any statistical software; it is called R-squared. This measure (sometimes called the coefficient of determination) describes the proportion of variance of the outcome variable that can be explained by variation in the independent variables. Usually, we do not know or do not consider all the causes of variation in the outcome variable.
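The link between the two-sample t-test and the regression on diet group, and the meaning of R-squared, can be checked by hand on a small data set. The sketch below uses invented weight gains: only the group means (120 and 101) and the 1/2 coding are taken from the notes, so the t statistic and R-squared it produces are not those of the rat diet data. All variable names are illustrative.

```python
from math import sqrt
from statistics import mean

# Invented weight gains (g); only the group means (120 and 101) match the notes.
high = [125.0, 118.0, 130.0, 107.0]       # high protein, coded 1
low = [95.0, 104.0, 110.0, 99.0, 97.0]    # low protein, coded 2

x = [1.0] * len(high) + [2.0] * len(low)  # dietary group code
y = high + low                            # weight gain
n = len(y)

# Ordinary least squares for: weight gain = b0 + b1 * code
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# R-squared: share of the total variation explained by the group code.
ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1.0 - ss_res / ss_tot

# t statistic of the slope, as reported in a regression coefficient table.
se_b1 = sqrt(ss_res / (n - 2) / sxx)
t_regression = b1 / se_b1

# Pooled two-sample t statistic (equal variances assumed), high minus low.
ss_within = (sum((v - mean(high)) ** 2 for v in high)
             + sum((v - mean(low)) ** 2 for v in low))
pooled_var = ss_within / (n - 2)
t_test = (mean(high) - mean(low)) / sqrt(pooled_var * (1 / len(high) + 1 / len(low)))

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"t (regression) = {t_regression:.3f}, t (t-test) = {t_test:.3f}")
print(f"R-squared = {r_squared:.3f}")
```

With these values the fit gives b0 = 139 and b1 = -19, so b0 + b1 and b0 + 2*b1 return the two group means, and the two t statistics agree except for their sign. Both tests use n - 2 degrees of freedom, so they also give the same p-value. The diet code is the only independent variable here, so every other cause of variation in weight gain ends up in the residual sum of squares `ss_res`.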
Therefore, R-squared seldom approaches 100%.

In logistic or Cox regression models, there is no unique definition of R-squared. However, some suggestions have been made, and some of them are implemented in SPSS. In these kinds of models, R-squared is typically lower than in linear regression models. For logistic regression models, this is a consequence of the discreteness of the outcome variable. Usually we can only estimate the percentage of patients who will experience the event of interest. This means that we know how many patients on average will have the event, but we cannot predict exactly which of them will or will not. In survival (Cox) models, it is the longitudinal nature of the outcome that prevents its precise prediction.

Summarizing, there are two sources of uncertainty related to statistical models: one is due to limited sample sizes, and the other is due to the limited ability of a model’s structure to predict the outcome.

Types of responses – types of models

The type of response defines the type of model to use. For scale variables as responses, we will most often use the linear regression model. For binary (nominal) outcomes, the logistic regression model is the model of choice. (There are other models for binary data, but their results are less easily interpreted.) For survival outcomes (time to event data), the Cox regression model is useful. For repeated measurements on scale outcomes, the analysis of variance for repeated measurements can be applied.

Univariate and multivariable models

A univariate model is the translation of a simple statistical test into a statistical model: there is one independent variable and one response variable. The independent variable may be nominal, ordinal or scale.

A multivariable model uses more than one independent variable to explain the outcome variable. Multivariable models can be used for various purposes, some of which are listed in the subsection after next.
Often, univariate (crude) and multivariable (adjusted) models are contrasted in one table, as the following example (from a Cox regression analysis) shows [1]:

Univariate and multivariable models may yield different results. These differences are caused by correlation between the independent variables: some of the variation in variable X1 may be reflected by variation in X2.

In the above table, we see substantial differences in the estimated effects for KLF5 expression, nodal status and tumor size, but not for differentiation grade. It was shown that KLF5 expression is correlated with nodal status and tumor size, but not with differentiation grade. Therefore, the univariate effect of differentiation grade is not changed at all by including KLF5 expression in the model. On the other hand, the effect of KLF5 is reduced by about 40% when nodal status and tumor size are considered simultaneously.

In other examples, the reverse may occur: an effect may be non-significant in a univariate model and only be confirmable statistically if another effect is considered simultaneously.

As an example, consider the relationship of sex and cholesterol level:

Cholesterol level

Sex       Mean     Std. Deviation
male      212.50   15.62
female    215.30   15.85

Posted: 12/09/2014, 17:47
