Statistical Modeling for Biomedical Researchers

Statistical Modeling for Biomedical Researchers
A Simple Introduction to the Analysis of Complex Data

This text will enable biomedical researchers to use a number of advanced statistical methods that have proven valuable in medical research. It is intended for people who have had an introductory course in biostatistics. A statistical software package (Stata) is used to avoid mathematics beyond the high school level. The emphasis is on understanding the assumptions underlying each method, using exploratory techniques to determine the most appropriate method, and presenting results in a way that will be readily understood by clinical colleagues. Numerous real examples from the medical literature are used to illustrate these techniques. Graphical methods are used extensively. Topics covered include linear regression, logistic regression, Poisson regression, survival analysis, fixed-effects analysis of variance, and repeated-measures analysis of variance. Each method is introduced in its simplest form and is then extended to cover situations in which multiple explanatory variables are collected on each study subject.

Educated at McGill University and the Johns Hopkins University, Bill Dupont is currently Professor and Director of the Division of Biostatistics at Vanderbilt University School of Medicine. He is best known for his work on the epidemiology of breast cancer, but has also published papers on power calculations, the estimation of animal abundance, the foundations of statistical inference, and other topics.

William D. Dupont

Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
Cambridge University Press
The Edinburgh Building, Cambridge, United Kingdom
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521820615

© Cambridge University Press 2002

This book is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published in print format 2002

ISBN-13 978-0-511-06174-5 eBook (NetLibrary)
ISBN-10 0-511-06174-9 eBook (NetLibrary)
ISBN-13 978-0-521-82061-5 hardback
ISBN-10 0-521-82061-8 hardback
ISBN-13 978-0-521-65578-1 paperback
ISBN-10 0-521-65578-1 paperback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this book, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

Contents

Preface

1 Introduction
1.1 Algebraic Notation
1.2 Descriptive Statistics
1.2.1 Dot Plot
1.2.2 Sample Mean
1.2.3 Residual
1.2.4 Sample Variance
1.2.5 Sample Standard Deviation
1.2.6 Percentile and Median
1.2.7 Box Plot
1.2.8 Histogram
1.2.9 Scatter Plot
1.3 The Stata Statistical Software Package
1.3.1 Downloading Data from My Web Site
1.3.2 Creating Dot Plots with Stata
1.3.3 Stata Command Syntax
1.3.4 Obtaining Interactive Help from Stata
1.3.5 Stata Log Files
1.3.6 Displaying Other Descriptive Statistics with Stata
1.4 Inferential Statistics
1.4.1 Probability Density Function
1.4.2 Mean, Variance and Standard Deviation
1.4.3 Normal Distribution
1.4.4 Expected Value
1.4.5 Standard Error
1.4.6 Null Hypothesis, Alternative Hypothesis and P Value
1.4.7 95% Confidence Interval
1.4.8 Statistical Power
1.4.9 The z and Student's t Distributions
1.4.10 Paired t Test
1.4.11 Performing Paired t Tests with Stata
1.4.12 Independent t Test Using a Pooled Standard Error Estimate
1.4.13 Independent t Test Using Separate Standard Error Estimates
1.4.14 Independent t Tests Using Stata
1.4.15 The Chi-Squared Distribution
1.5 Additional Reading
1.6 Exercises

2 Simple Linear Regression
2.1 Sample Covariance
2.2 Sample Correlation Coefficient
2.3 Population Covariance and Correlation Coefficient
2.4 Conditional Expectation
2.5 Simple Linear Regression Model
2.6 Fitting the Linear Regression Model
2.7 Historical Trivia: Origin of the Term Regression
2.8 Determining the Accuracy of Linear Regression Estimates
2.9 Ethylene Glycol Poisoning Example
2.10 95% Confidence Interval for y[x] = α + βx Evaluated at x
2.11 95% Prediction Interval for the Response of a New Patient
2.12 Simple Linear Regression with Stata
2.13 Lowess Regression
2.14 Plotting a Lowess Regression Curve in Stata
2.15 Residual Analyses
2.16 Studentized Residual Analysis Using Stata
2.17 Transforming the x and y Variables
2.17.1 Stabilizing the Variance
2.17.2 Correcting for Non-linearity
2.17.3 Example: Research Funding and Morbidity for 29 Diseases
2.18 Analyzing Transformed Data with Stata
2.19 Testing the Equality of Regression Slopes
2.19.1 Example: The Framingham Heart Study
2.20 Comparing Slope Estimates with Stata
2.21 Additional Reading
2.22 Exercises

3 Multiple Linear Regression
3.1 The Model
3.2 Confounding Variables
3.3 Estimating the Parameters for a Multiple Linear Regression Model
3.4 R² Statistic for Multiple Regression Models
3.5 Expected Response in the Multiple Regression Model
3.6 The Accuracy of Multiple Regression Parameter Estimates
3.7 Leverage
3.8 95% Confidence Interval for ŷᵢ
3.9 95% Prediction Intervals
3.10 Example: The Framingham Heart Study
3.10.1 Preliminary Univariate Analyses
3.11 Scatterplot Matrix Graphs
3.11.1 Producing Scatterplot Matrix Graphs with Stata
3.12 Modeling Interaction in Multiple Linear Regression
3.12.1 The Framingham Example
3.13 Multiple Regression Modeling of the Framingham Data
3.14 Intuitive Understanding of a Multiple Regression Model
3.14.1 The Framingham Example
3.15 Calculating 95% Confidence and Prediction Intervals
3.16 Multiple Linear Regression with Stata
3.17 Automatic Methods of Model Selection
3.17.1 Forward Selection Using Stata
3.17.2 Backward Selection
3.17.3 Forward Stepwise Selection
3.17.4 Backward Stepwise Selection
3.17.5 Pros and Cons of Automated Model Selection
3.18 Collinearity
3.19 Residual Analyses
3.20 Influence
3.20.1 Δβ̂ Influence Statistic
3.20.2 Cook's Distance
3.20.3 The Framingham Example
3.21 Residual and Influence Analyses Using Stata
3.22 Additional Reading
3.23 Exercises

4 Simple Logistic Regression
4.1 Example: APACHE Score and Mortality in Patients with Sepsis
4.2 Sigmoidal Family of Logistic Regression Curves
4.3 The Log Odds of Death Given a Logistic Probability Function
4.4 The Binomial Distribution
4.5 Simple Logistic Regression Model
4.6 Generalized Linear Model
4.7 Contrast Between Logistic and Linear Regression
4.8 Maximum Likelihood Estimation
4.8.1 Variance of Maximum Likelihood Parameter Estimates
4.9 Statistical Tests and Confidence Intervals
4.9.1 Likelihood Ratio Tests
4.9.2 Quadratic Approximations to the Log Likelihood Ratio Function
4.9.3 Score Tests
4.9.4 Wald Tests and Confidence Intervals
4.9.5 Which Test Should You Use?
4.10 Sepsis Example
4.11 Logistic Regression with Stata
4.12 Odds Ratios and the Logistic Regression Model
4.13 95% Confidence Interval for the Odds Ratio Associated with a Unit Increase in x
4.13.1 Calculating this Odds Ratio with Stata
4.14 Logistic Regression with Grouped Response Data
4.15 95% Confidence Interval for π[x]
4.16 95% Confidence Intervals for Proportions
4.17 Example: The Ibuprofen in Sepsis Trial
4.18 Logistic Regression with Grouped Data Using Stata
4.19 Simple 2 × 2 Case-Control Studies
4.19.1 Example: The Ille-et-Vilaine Study of Esophageal Cancer and Alcohol
4.19.2 Review of Classical Case-Control Theory
4.19.3 95% Confidence Interval for the Odds Ratio: Woolf's Method
4.19.4 Test of the Null Hypothesis that the Odds Ratio Equals One
4.19.5 Test of the Null Hypothesis that Two Proportions are Equal
4.20 Logistic Regression Models for 2 × 2 Contingency Tables
4.20.1 Nuisance Parameters
4.20.2 95% Confidence Interval for the Odds Ratio: Logistic Regression
4.21 Creating a Stata Data File
4.22 Analyzing Case-Control Data with Stata
4.23 Regressing Disease Against Exposure
4.24 Additional Reading
4.25 Exercises

5 Multiple Logistic Regression
5.1 Mantel–Haenszel Estimate of an Age-Adjusted Odds Ratio
5.2 Mantel–Haenszel χ² Statistic for Multiple 2 × 2 Tables
5.3 95% Confidence Interval for the Age-Adjusted Odds Ratio
5.4 Breslow and Day's Test for Homogeneity
5.5 Calculating the Mantel–Haenszel Odds Ratio Using Stata
5.6 Multiple Logistic Regression Model
5.7 95% Confidence Interval for an Adjusted Odds Ratio
5.8 Logistic Regression for Multiple 2 × 2 Contingency Tables
5.9 Analyzing Multiple 2 × 2 Tables with Stata

Appendix: Summary of Stata commands used in this text

Analysis Commands (cont.)
Command | Function | Section
graph var, bin(#) | Draw a histogram of var with # bars; y-axis is proportion of subjects. | 1.3.6
graph var, bin(#) freq | Draw a histogram of var with # bars; y-axis is number of subjects. | 4.18
graph var, box by(groupvar) | Draw boxplots of var for each distinct value of groupvar. | 1.3.6
graph var, box oneway by(groupvar) | Draw horizontal boxplots and one-dimensional scatterplots of var for each distinct value of groupvar. | 10.7
graph var1 var2, bar by(varstrat) | Grouped bar graph of var1 and var2 stratified by varstrat. | 9.3
graph y1 y2 x, connect(.l) symbol(oi) | Draw scatter plot of y1 vs. x. Graph y2 vs. x, connect points, no symbol. | 2.12
graph y x, connect(l[–#]) | Graph y vs. x. Connect points with a dashed line. | 2.20
graph y x, connect(L) | Graph y vs. x. Connect consecutive points with a straight line if the values of x are increasing. | 11.2
graph y1 y2 x, connect(II) symbol(ii) | Draw error bars connecting y1 to y2 as a function of x. | 4.18
graph y x, connect(J) symbol(i) | Plot a step function of y against x. | 6.9
graph y x, symbol(O) | Scatter plot of y vs. x using large circles. | 2.12
graph varlist, matrix label symbol(o) connect(s) band(#) | Plot matrix scatterplot of variables in varlist. Estimate regression lines with # median bands and cubic splines. | 3.11.1
iri #a #b #Na #Nb | Calculate relative risk from incidence data; #a and #b are the number of exposed and unexposed cases observed during #Na and #Nb person-years of follow-up. | 8.2
ksm yvar xvar, lowess bwidth(#) gen(newvar) | Plot lowess curve of yvar vs. xvar with bandwidth #. Save lowess regression line as newvar. | 2.14
kwallis var, by(groupvar) | Perform a Kruskal–Wallis test of var by groupvar. | 10.7
list varlist | List values of variables in varlist. | 1.3.2
list varlist, nodisplay | List values of variables in varlist with tabular format. | 5.29
logistic depvar varlist | Logistic regression: regress depvar against variables in varlist. | 4.13.1, 5.9
oneway responsevar factorvar | One-way analysis of variance of responsevar in groups defined by factorvar. | 10.7
ranksum var, by(groupvar) | Perform a Wilcoxon–Mann–Whitney rank sum test of var by groupvar. | 10.7
regress depvar varlist, level(#) | Regress depvar against variables in varlist. | 2.12, 3.16
stcox varlist | Proportional hazard regression analysis with independent variables given by varlist. A stset statement defines failure. Exponentiated model coefficients are given. | 6.16
stcox varlist1, strata(varlist2) | Stratified proportional hazard regression analysis with strata defined by the values of variables in varlist2. | 7.8
stcox varlist, mgale(newvar) | Cox hazard regression analysis. Define newvar to be the martingale residual for each patient. | 7.7
stset timevar, failure(failvar) | Declare timevar and failvar to be time and failure variables, respectively. failvar ≠ 0 denotes failure. | 6.9
stset exittime, id(idvar) enter(time entrytime) failure(failvar) | Declare entrytime, exittime and failvar to be the entry time, exit time and failure variables, respectively. id(idvar) is a patient identification variable needed for time-dependent hazard regression analysis. | 7.9.4, 7.11
sts generate var = survfcn | Define var to equal one of several functions related to survival analyses. | 6.9
sts graph, by(varlist) | Kaplan–Meier survival plots. Plot a separate curve for each combination of distinct values of the variables in varlist. Must be preceded by a stset statement. | 6.9
sts graph, gwood lost | Kaplan–Meier survival plots showing number of patients censored with 95% confidence bands. | 6.9
sts graph, by(varlist) failure | Kaplan–Meier cumulative mortality plot. | 7.7
sts list, by(varlist) | List estimated survival function by patient groups defined by unique combinations of values of varlist. | 6.9
sts test varlist | Perform logrank test on groups defined by the values of varlist. | 6.9
summarize varlist, detail | Summarize variables in varlist. | 1.3.6
sw regress depvar varlist, forward pe(#) | Automatic linear regression: forward covariate selection. | 3.17.1
sw regress depvar varlist, pr(#) | Automatic linear regression: backward covariate selection. | 3.17.2
sw regress depvar varlist, forward pe(#) pr(#) | Automatic linear regression: stepwise forward covariate selection. | 3.17.3
sw regress depvar varlist, pe(#) pr(#) | Automatic linear regression: stepwise backward covariate selection. | 3.17.4
table rowvar colvar | Two-way frequency tables of values of rowvar by colvar. | 5.5
table rowvar colvar, row col | Two-way frequency tables with row and column totals. | 5.20
table rowvar colvar, by(varlist) | Two-way frequency tables of values of rowvar by colvar for each unique combination of values of varlist. | 5.5
table rowvar colvar, contents(sum varname) | Create a table of sums of varname cross-tabulated by rowvar and colvar. | 8.9
tabulate varname | Frequency table of varname with percentages and cumulative percentages. | 3.21
tabulate varname1 varname2, column row | Two-way frequency tables with row and column percentages. | 5.11.1
ttest var1 = var2 | Paired t-test of var1 vs. var2. | 1.4.11
ttest var, by(groupvar) | Independent t-test of var in groups defined by groupvar. | 1.4.14
ttest var, by(groupvar) unequal | Independent t-test of var in groups defined by groupvar. Variances assumed unequal. | 1.4.14
xi: glm depvar varlist i.catvar, family(dist) link(linkfcn) | Glm with dichotomous indicator variables replacing a categorical variable catvar. | 8.12
xi: glm depvar varlist i.var1*i.var2, family(dist) link(linkfcn) | Glm with dichotomous indicator variables replacing categorical variables var1, var2. All two-way interaction terms are also generated. | 9.3
xi: logistic depvar varlist i.catvar | Logistic regression with dichotomous indicator variables replacing a categorical variable catvar. | 5.10
xi: logistic depvar varlist i.var1*i.var2 | Logistic regression with dichotomous indicator variables replacing categorical variables var1 and var2. All two-way interaction terms are also generated. | 5.23
xi: stcox varlist i.varname | Proportional hazards regression with dichotomous indicator variables replacing categorical variable varname. | 7.7
xi: stcox varlist i.var1*i.var2 | Proportional hazards regression with dichotomous indicator variables replacing categorical variables var1 and var2. All two-way interaction terms are also generated. | 7.7
xtgee depvar varlist, family(family) link(link) corr(correlation) i(idname) t(tname) | Perform a generalized estimating equation analysis in regressing depvar against the variables in varlist. | 11.11

Post Estimation Commands (affected by preceding regression command)

Command | Function | Section
lincom expression | Calculate expression and a 95% CI associated with expression. | 5.20, 10.7
lincom expression, or | Calculate exp[expression] with associated 95% CI. The hr and irr options perform the same calculations. | 5.20, 7.7, 8.7
predict newvar, cooksd | Set newvar = Cook's D. | 3.21
predict newvar, csnell | Set newvar = Cox–Snell residual. | 7.7
predict newvar, ccsnell | Set newvar = Cox–Snell residual in the last record for each patient; used with multiple records per patient. | 7.10.1
predict newvar, dfbeta(varname) | Set newvar = delta beta statistic for the varname covariate in the linear regression model. | 3.21
predict newvar, h | Set newvar = leverage. | 3.16
predict newvar, rstudent | Set newvar = studentized residual. | 2.16, 3.21
predict newvar, standardized deviance | Set newvar = standardized deviance residual. | 9.5
predict newvar, stdp | Set newvar = standard error of the linear predictor. | 2.12, 3.16
predict newvar, stdf | Set newvar = standard error of a forecasted value. | 2.12, 3.16
predict newvar, xb | Set newvar = linear predictor. | 2.12, 3.16
vce | Display variance–covariance matrix of last model. | 4.18

Command Prefixes

Syntax | Function | Section
by varlist: | Repeat following command for each unique value of varlist. | 1.3.6
sw | Fit a model with either the forward, backward or stepwise algorithm. N.B. sw is not followed by a colon. | 3.17.1
xi: | Execute the following estimation command with categorical variables like i.catvar and i.catvar1*i.catvar2. | 5.10, 5.23

Logical and Relational Operators and System Variables (See Stata User's Manual)

Meaning:
missing value
true
false
greater than
less than
greater than or equal to
less than or equal to
equal to
not equal to
and
or
not
Record number of current observation. When used with the by id: prefix, _n is reset to 1 whenever the value of id changes and equals k at the kth record with the same value of id.
Total number of observations in the data set. When used with the by id: prefix, _N is the number of records with the current value of id.
The value of variable varname in observation expression.

Section: 5.32.2 7.7 7.7 1.4.11 1.4.11 1.4.11 1.4.11 1.4.11 1.4.11 1.4.11 1.4.11 1.4.11 7.11, 11.5

Operator or Variable: > < >=
