INTRODUCTORY BIOSTATISTICS
Second Edition

CHAP T. LE
Distinguished Professor of Biostatistics
Director of Biostatistics and Bioinformatics
Masonic Cancer Center
University of Minnesota

LYNN E. EBERLY
Associate Professor of Biostatistics
School of Public Health
University of Minnesota

Copyright © 2016 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental,
consequential, or other damages.

For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data
Names: Le, Chap T., 1948– | Eberly, Lynn E.
Title: Introductory biostatistics
Description: Second edition / Chap T. Le, Lynn E. Eberly | Hoboken, New Jersey : John Wiley & Sons, Inc., 2016 | Includes bibliographical references and index
Identifiers: LCCN 2015043758 (print) | LCCN 2015045759 (ebook) | ISBN 9780470905401 (cloth) | ISBN 9781118595985 (Adobe PDF) | ISBN 9781118596074 (ePub)
Subjects: LCSH: Biometry | Medical sciences–Statistical methods
Classification: LCC QH323.5 L373 2016 (print) | LCC QH323.5 (ebook) | DDC 570.1/5195–dc23
LC record available at http://lccn.loc.gov/2015043758

Set in 10/12pt Times by SPi Global, Pondicherry, India
Printed in the United States of America

To my wife, Minhha, and my daughters, Mina and Jenna, with love
C.T.L.

To my husband, Andy, and my sons, Evan, Jason, and Colin, with love; you bring joy to my life
L.E.E.

Contents

Preface to the Second Edition
Preface to the First Edition
About the Companion Website

1 Descriptive Methods for Categorical Data
1.1 Proportions
1.1.1 Comparative Studies
1.1.2 Screening Tests
1.1.3 Displaying Proportions
1.2 Rates
1.2.1 Changes
1.2.2 Measures of Morbidity and Mortality
1.2.3 Standardization of Rates
1.3 Ratios
1.3.1 Relative Risk
1.3.2 Odds and Odds Ratio
1.3.3 Generalized Odds for Ordered 2 × k Tables
1.3.4 Mantel–Haenszel Method
1.3.5 Standardized Mortality Ratio
1.4 Notes on Computations
Exercises

2 Descriptive Methods for
Continuous Data
2.1 Tabular and Graphical Methods
2.1.1 One-Way Scatter Plots
2.1.2 Frequency Distribution
2.1.3 Histogram and Frequency Polygon
2.1.4 Cumulative Frequency Graph and Percentiles
2.1.5 Stem and Leaf Diagrams
2.2 Numerical Methods
2.2.1 Mean
2.2.2 Other Measures of Location
2.2.3 Measures of Dispersion
2.2.4 Box Plots
2.3 Special Case of Binary Data
2.4 Coefficients of Correlation
2.4.1 Pearson's Correlation Coefficient
2.4.2 Nonparametric Correlation Coefficients
2.5 Notes on Computations
Exercises

3 Probability and Probability Models
3.1 Probability
3.1.1 Certainty of Uncertainty
3.1.2 Probability
3.1.3 Statistical Relationship
3.1.4 Using Screening Tests
3.1.5 Measuring Agreement
3.2 Normal Distribution
3.2.1 Shape of the Normal Curve
3.2.2 Areas Under the Standard Normal Curve
3.2.3 Normal Distribution as a Probability Model
3.3 Probability Models for Continuous Data
3.4 Probability Models for Discrete Data
3.4.1 Binomial Distribution
3.4.2 Poisson Distribution
3.5 Brief Notes on the Fundamentals
3.5.1 Mean and Variance
3.5.2 Pair-Matched Case–Control Study
3.6 Notes on Computations
Exercises

4 Estimation of Parameters
4.1 Basic Concepts
4.1.1 Statistics as Variables
4.1.2 Sampling Distributions
4.1.3 Introduction to Confidence Estimation
4.2 Estimation of Means
4.2.1 Confidence Intervals for a Mean
4.2.2 Uses of Small Samples
4.2.3 Evaluation of Interventions
4.3 Estimation of Proportions

ANSWERS TO SELECTED EXERCISES

–417.63. Smaller AIC and BIC indicate the better model, so we proceed with the random intercept model. (i) Compare the random intercept model with all interactions to the model with only trt*weeks, smoker, and log(CFU0) using a likelihood ratio test, since the two models are nested; p value = 0.91, so we proceed with the smaller model. Smoker is not
significant, but should be included since randomization was blocked by smoking status. (j) Model assumptions are approximately satisfied: normality is approximately satisfied (however, the tails are heavy) and residuals show slightly increasing variability with increasing predicted value of the CFU ratio. "Independence" here refers to independence between persons, which seems likely to be satisfied.

12.5 (a) Proportion with positive culture at baseline was 0.20 in treatment A and 0.32 in treatment B, significantly different by Fisher's exact test (p = 0.01).
(d) Assume the model is parameterized so that group B is the reference group for treatment. Regression coefficient for treatment represents the treatment effect at month 0: –0.586 (lower culture rate in group A compared to B, odds ratio = exp(–0.586) = 0.557).
Treatment effect at month 3: –0.586 + 0.016*3 = –0.538
Treatment effect at month 6: –0.586 + 0.016*6 = –0.490
Treatment effect at month 9: –0.586 + 0.016*9 = –0.442
Treatment effect at month 12: –0.586 + 0.016*12 = –0.394
Regression coefficient for months represents the slope in the log odds of a positive culture in group B per month of follow-up (0.00099, so log odds of a positive culture is essentially flat). Regression coefficient for months plus regression coefficient for months by treatment interaction represents the slope in the log odds in group A per month of follow-up (0.00099 + 0.01636 = 0.0174, so log odds of a positive culture is going up slowly in group A).
(e) Again assume the model is parameterized so that group B is the reference group for the treatment effect. Regression coefficient for treatment represents the treatment effect at month 0: –0.636 [lower culture rate in group A compared to B, odds ratio = exp(–0.636) = 0.529]. Treatment effect at each follow-up visit: –0.636 + 0.187 = –0.449. Regression coefficient for post (–0.044) represents the follow-up versus baseline effect in group B: in group B, the log odds of a positive culture is lower by 0.044 at
the follow-up visits compared to baseline. Regression coefficient for post plus regression coefficient for post by treatment interaction (–0.044 + 0.189 = 0.145) represents the follow-up versus baseline effect in group A: in group A, the log odds of a positive culture is higher by 0.145 at follow-up visits compared to baseline.
(f) There is no consistent trend across months 3, 6, 9, and 12 in the log odds of positive cultures for either treatment group, so the model with post versus baseline is preferred.
Group A log odds across months: –1.40, –1.33, –1.22, –1.40, –1.12
Group B log odds across months: –0.74, –0.86, –0.77, –0.71, –0.80
(g) The treatment by post interaction can be removed, as can oral candidiasis at baseline and vaginal candidiasis at baseline. Post is retained as important to the design of the trial (it indicates visits post-randomization).

12.6 (a) and (b) Sample variability and skewness in CFU are not quite constant across groups; the Placebo group has consistently slightly lower variability and the No Drug group has consistently slightly higher skewness. Within group, the variability and skewness are approximately constant across weeks. The Placebo group has consistently higher means and medians, although at baseline especially they are not much different from the two other groups.
(g) For sex, z = 2.27 and p = 0.13, so we drop it from the model. For DMFT teeth, z = –0.02 and p = 0.89, so we drop it from the model.
(h) Both terms in the treatment by smoker by weeks interaction are strongly statistically nonsignificant (z = –0.02 and p = 0.89 for whether smoker by weeks differs for drug vs no drug, z = –0.59 and p = 0.44 for whether smoker by weeks differs for drug vs placebo), so we drop the three-way interaction from the model. The weeks by smoker interaction is also strongly nonsignificant (z = 0.11 and p = 0.74), so we drop it as well. These model comparisons could instead be done using QIC; QIC is given automatically in SAS
using PROC GENMOD and can be called in R using the QIC function in the MESS package.
(i) For a final model that includes main effects for treatment, weeks, and smoker, and interactions for treatment by weeks and treatment by smoker, examining treatment group differences overall (averaged over weeks and over smoking blocks) may be misleading. At week in the non-smoker block, using Tukey adjustment for multiple comparisons, drug was no different from no drug (difference = –0.06, z = –0.29, p = 0.96) but strongly significantly different from placebo (difference = –0.55, z = –3.14, p = 0.0048), while no drug vs placebo had a similar effect size (difference = –0.48) but a larger significance level (z = –2.01, p = 0.11). At week in the smoker block, some effect sizes were smaller; using Tukey adjustment for multiple comparisons, drug was no different from no drug (difference = 0.29, z = 0.85, p = 0.67) and no different from placebo (difference = 0.18, z = 0.59, p = 0.83), while no drug was no different from placebo (difference = –0.11, z = –0.36, p = 0.93).

Chapter 13

13.8 Log-rank test: p = 0.0896; generalized Wilcoxon test: p = 0.1590.
13.10 95% confidence interval for odds ratio: (1.997; 13.542); McNemar's chi-square: $\chi^2 = 14.226$; p value = 0.00016.
13.11 McNemar's chi-square: $\chi^2 = 0.077$; p value = 0.78140.
13.12 95% confidence interval for odds ratio: (1.126; 5.309); McNemar's chi-square: $\chi^2 = 5.452$; p value = 0.02122.
13.13 For men: McNemar's chi-square, $\chi^2 = 13.394$; p value = 0.00025. For women: McNemar's chi-square, $\chi^2 = 0.439$; p value = 0.50761.

Chapter 14

14.2 $[0.9^3 + 3(0.1)(0.9)^2(0.9)^3]\{0.2^3 + 3(0.2)^2(0.8) + 3(0.2)(0.8)^2[1 - (0.8)^3]\} = 0.264$
14.3 At each new dose level, enroll three patients; if no patient has DLT, the trial continues with a new cohort at the next higher dose; if two or three experience DLT, the trial is stopped. If one experiences DLT, a new cohort of two patients is enrolled at the same dose, escalating to the next-higher dose only if no DLT is observed. The
new design helps to escalate a little more easily; the resulting MTD would have a slightly higher expected toxicity rate.
14.4 $[0.6^3 + 3(0.4)(0.6)^2(0.6)^3]\{0.5^3 + 3(0.5)^3 + 3(0.5)^3[1 - (0.5)^3]\} = 0.256$
14.5 $z_{1-\beta} = 0.364$, corresponding to a power of 64%.
14.6 $z_{1-\beta} = 0.927$, corresponding to a power of 82%.
14.7 $z_{1-\beta} = 0.690$, corresponding to a power of 75%.
14.8 $z_{1-\beta} = 0.551$, corresponding to a power of 71%.
14.9 d = 42 events and we need $N = \frac{2(42)}{2 - 0.5 - 0.794} \approx 120$ subjects, or 60 subjects in each group.
14.10 $n = \frac{(1.96)^2(0.9)(0.1)}{(0.05)^2} \approx 139$ subjects. If we do not use the 90% figure, we would need $n_{\max} = \frac{(1.96)^2(0.25)}{(0.05)^2} \approx 385$ subjects.
14.11 $n_{\max}$ = 99 subjects.
14.12 $n_{\max} = \frac{(1.96)^2(0.25)}{(0.15)^2} \approx 43$ subjects.
14.13 (a) With 95% confidence, we need $n_{\max} = \frac{(1.96)^2(0.25)}{(0.01)^2} = 9604$ subjects. With 99% confidence, we need $n_{\max} = \frac{(2.58)^2(0.25)}{(0.01)^2} = 16{,}641$ subjects.
(b) With 95% confidence, we need $n_{\max} = \frac{(1.96)^2(0.08)(0.92)}{(0.01)^2} \approx 2827$ subjects. With 99% confidence, we need $n_{\max} = \frac{(2.58)^2(0.08)(0.92)}{(0.01)^2} \approx 4900$ subjects.
14.14 n = 16 subjects.
14.15 N = 62, or 31 per group.
14.16 n = 8, n = 2, n = 11, n = 20, n = 29, and so on.
14.17 N = 50, or 25 per group.
14.18 N = 220, or 110 per group.
14.19 N = 496, or 248 per group.
14.20 N = 1654, or 827 per group.
14.21 N = 630, or 315 per group.
14.22 N = 70, or 35 per group.
14.23 N = 42, or 21 per group.
14.24 d = 0.196, almost 20%.
14.25 $\frac{\ln 0.6}{\ln 0.7} = 1.432$; $d = (1.96 + 0.84)^2\left(\frac{1 + 1.432}{1 - 1.432}\right)^2 \approx 249$ events; $N = \frac{2(249)}{2 - 0.6 - 0.7} \approx 710$ subjects, or 355 per group.
14.26 $\frac{\ln 0.4}{\ln 0.5} = 1.322$; $d = (1.96 + 0.84)^2\left(\frac{1 + 1.322}{1 - 1.322}\right)^2 \approx 408$ events; $N = \frac{2(408)}{2 - 0.4 - 0.5} \approx 742$ subjects, or 371 per group.
14.27 $\frac{\ln 0.4}{\ln 0.6} = 1.794$; $d = (1.96 + 0.84)^2\left(\frac{1 + 1.794}{1 - 1.794}\right)^2 \approx 98$ events; N = 98 subjects, or 49 per group.
14.28 $\pi_1 = 0.18$: (a) N = 590, 245 cases and 245 controls; (b) N = 960, 192
cases and 768 controls; (c) m = 66 discordant pairs and M = 271 case–control pairs.
14.29 $\pi_1 = 0.57$: (a) N = 364, 182 cases and 182 controls; (b) N = 480, 120 cases and 350 controls; (c) m = 81 discordant pairs and M = 158 case–control pairs.
14.30 $N = \frac{4(1.96 + 0.84)^2}{(\ln 1.5)^2} \approx 192$; 96 cases and 96 controls.

Companion website: www.wiley.com/go/Le/Biostatistics
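Several of the Chapter 14 answers above can be checked numerically. The Python sketch below is not from the book. It assumes the usual large-sample formulas: $n = z^2 p(1-p)/d^2$ for estimating a proportion with margin of error $d$, power $= \Phi(z_{1-\beta})$, and $(z_\alpha + z_\beta)^2[(1+\theta)/(1-\theta)]^2$ events with $\theta = \ln s_1/\ln s_2$ for comparing two survival rates. The function names are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def power_from_z(z_power: float) -> float:
    """Power implied by the standard normal quantile z_{1-beta}."""
    return NormalDist().cdf(z_power)


def n_for_proportion(z: float, p: float, margin: float) -> int:
    """Sample size z^2 * p * (1 - p) / margin^2, rounded up to a whole subject."""
    raw = z ** 2 * p * (1 - p) / margin ** 2
    # Round first so floating-point noise does not add a spurious subject.
    return math.ceil(round(raw, 6))


def events_for_survival(z_alpha: float, z_beta: float, s1: float, s2: float) -> int:
    """Events needed to compare survival rates s1 != s2 (assumed formula, see lead-in)."""
    theta = math.log(s1) / math.log(s2)
    raw = (z_alpha + z_beta) ** 2 * ((1 + theta) / (1 - theta)) ** 2
    return math.ceil(round(raw, 6))


if __name__ == "__main__":
    print(round(power_from_z(0.364), 2))               # compare with answer 14.5
    print(n_for_proportion(1.96, 0.9, 0.05))           # compare with answer 14.10
    print(n_for_proportion(1.96, 0.5, 0.05))           # 14.10, worst case p = 0.5
    print(events_for_survival(1.96, 0.84, 0.6, 0.7))   # compare with answer 14.25
```

Rounding up reflects the convention that a fractional subject or event cannot be enrolled; some of the book's answers round further, to an even total, so that the two groups are of equal size.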