Regression Methods for Medical Research

Bee Choo Tai, Saw Swee Hock School of Public Health, National University of Singapore, and National University Health System; and Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore

David Machin, Medical Statistics Unit, School of Health and Related Sciences, University of Sheffield, Sheffield; and Cancer Studies, Faculty of Medicine, University of Leicester, Leicester, UK

Regression Methods for Medical Research provides medical researchers with the skills they need to critically read and interpret research using more advanced statistical methods. The statistical requirements of interpreting and publishing in medical journals, together with rapid changes in science and technology, increasingly demand an understanding of more complex and sophisticated analytic procedures.

Regression Methods for Medical Research is especially designed for clinicians, public health and environmental health professionals, para-medical research professionals, scientists, laboratory-based researchers and students.

The text explains the application of statistical models to a wide variety of practical medical investigative studies and clinical trials. Regression methods are used to appropriately answer the key design questions posed and in so doing take due account of any effects of potentially influencing co-variables. It begins with a revision of basic statistical concepts, followed by a gentle introduction to the principles of statistical modelling. The various methods of modelling are covered in a non-technical manner so that the principles can be more easily applied in everyday practice. A chapter contrasting regression modelling with a regression tree approach is included. The emphasis is on the understanding and the application of concepts and methods. Data drawn from published studies are used to exemplify statistical concepts throughout.

ISBN 978-1-4443-3144-8

This edition first published 2014 © 2014 by Bee Choo Tai and David Machin

Published 2014 by John Wiley & Sons, Ltd

Registered Office: John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Offices:
9600 Garsington Road, Oxford, OX4 2DQ, UK
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
111 River Street, Hoboken, NJ 07030-5774, USA

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com/wiley-blackwell

The right of the author to be identified as the author of this work has been asserted in accordance with the UK Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The
publisher is not associated with any product or vendor mentioned in this book. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

The contents of this work are intended to further general scientific research, understanding, and discussion only and are not intended and should not be relied upon as recommending or promoting a specific method, diagnosis, or treatment by health science practitioners for any particular patient. The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of fitness for a particular purpose. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of medicines, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each medicine, equipment, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. Readers should consult with a specialist where appropriate. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.

Library of
Congress Cataloging-in-Publication Data

Tai, Bee Choo, author.
Regression methods for medical research / Bee Choo Tai, David Machin.
p. ; cm.
Includes bibliographical references and index.
ISBN 978-1-4443-3144-8 (pbk. : alk. paper) – ISBN 978-1-118-72198-8 – ISBN 978-1-118-72197-1 (Mobi) – ISBN 978-1-118-72196-4 – ISBN 978-1-118-72195-7
I. Machin, David, 1939– author. II. Title.
[DNLM: 1. Regression Analysis. 2. Biomedical Research. 3. Models, Statistical. WA 950]
R853.S7 610.72′4–dc23 2013018953

A catalogue record for this book is available from the British Library.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Cover image: Stethoscope - iStock file #13368468 © LevKing; DNA - iStock file #1643638 © Andrey Prokhorov
Cover design by Meaden Creative
Set in 10/12pt Times by SPi Publisher Services, Pondicherry, India

To Isaac Xu-En Koh and Kheng-Chuan Koh and Lorna Christine Machin

Contents

Preface                                  viii
 1 Introduction                             1
 2 Linear Regression: Practical Issues     25
 3 Multiple Linear Regression              43
 4 Logistic Regression                     64
 5 Poisson Regression                      98
 6 Time-to-Event Regression               120
 7 Model Building                         146
 8 Repeated Measures                      176
 9 Regression Trees                       204
10 Further Time-to-Event Models           236
11 Further Topics                         269
Statistical Tables                        285
References                                294
Index                                     298

Preface

In the course of planning a new clinical study, key questions that require answering have to be determined, and once this is done the purpose of the study will be to answer the questions posed. Once posed, the next stage of the process is to design the study in detail, and this will entail more formally stating the hypotheses of concern and considering how these may be tested. These considerations lead to establishing the statistical models underpinning the research process. Models, once established, will ultimately be fitted to the experimental data collated and the associated statistical techniques will help to establish whether or not the
research questions have been answered with the desired reliability. Thus, the chosen statistical models encapsulate the design structure and form the basis for the subsequent analysis, reporting and interpretation. In general terms, such models are termed regression models, of which there are several major types, and the fitting of these to experimental data forms the basis of this text.

Our aim is not to describe regression methods in all their technical detail but more to illustrate the situations in which each is suitable and hence to guide medical researchers of all disciplines to use the methods appropriately. Fortunately, several user-friendly statistical computer packages are available to assist in the model fitting processes. We have used Stata statistical software in the majority of our calculations, and to illustrate the types of commands that may be needed, but this is only one example of packages that can be used for this purpose. Statistical software is continually evolving so that, for example, several improved versions of Stata have appeared during the time span in which this book has been written. We strongly advise use of the most up-to-date software available and, as we mention within the text itself, one that has excellent graphical facilities.

We caution that, although we use real data extensively, our analyses are selective and are for illustration only. They should not be used to draw conclusions from the studies concerned.

We would like to give a general thank you to colleagues and students of the Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, and a specific one for the permission to use the data from the Singapore Cardiovascular Cohort Study. Thanks are also due to colleagues at the Skaraborg Institute, Skövde, Sweden. In addition, we would like to thank the following for allowing us to use their studies for illustration: Tin Aung, Singapore Eye Research Institute; Michael J Campbell,
University of Sheffield, UK; Boon-Hock Chia, Chia Clinic, Singapore; Siow-Ann Chong, Institute of Mental Health, Singapore; Richard G Grundy, University of Nottingham, UK; James H-P Hui, National University Health System, Singapore; Ronald C-H Lee, National University of Singapore; Daniel P-K Ng, National University of Singapore; R Paul Symonds, University of Leicester, UK; Veronique Viardot-Foucault, KK Women’s and Children’s Hospital, Singapore; Joseph T-S Wee, National Cancer Centre, Singapore; Chinnaiya Anandakumar, Camden Medical Centre, Singapore; and Annapoorna Venkat, National University Health System, Singapore. Finally, we thank Haleh G Maralani for her help with some of the statistical programming.

George EP Box (1979): ‘All models are wrong, but some are useful.’

Bee Choo Tai
David Machin

Statistical Tables

Table T4 Student’s t-distribution

The value tabulated is tα/2, such that if X is distributed as Student’s t-distribution with df degrees of freedom, then α is the probability that X ≤ −tα/2 or X ≥ tα/2.

[Figure: Student’s t density with tail areas of α/2 below −tα/2 and above tα/2]

Example: If, following a Student’s t-test with df = 10, a value of the test statistic X = 2.764 is obtained, then the probability that X ≤ −2.764 or X ≥ 2.764 is α = 0.02.

Columns are values of α.

 df    0.20    0.10    0.05    0.04    0.03    0.02    0.01   0.001
  1   3.078   6.314  12.706  15.895  21.205  31.821  63.657   636.6
  2   1.886   2.920   4.303   4.849   5.643   6.965   9.925   31.60
  3   1.634   2.353   3.182   3.482   3.896   4.541   5.842   12.92
  4   1.530   2.132   2.776   2.999   3.298   3.747   4.604   8.610
  5   1.474   2.015   2.571   2.757   3.003   3.365   4.032   6.869
  6   1.439   1.943   2.447   2.612   2.829   3.143   3.707   5.959
  7   1.414   1.895   2.365   2.517   2.715   2.998   3.499   5.408
  8   1.397   1.860   2.306   2.449   2.634   2.896   3.355   5.041
  9   1.383   1.833   2.262   2.398   2.574   2.821   3.250   4.781
 10   1.372   1.812   2.228   2.359   2.528   2.764   3.169   4.587
 11   1.363   1.796   2.201   2.328   2.491   2.718   3.106   4.437
 12   1.356   1.782   2.179   2.303   2.461   2.681   3.055   4.318
 13   1.350   1.771   2.160   2.282   2.436   2.650   3.012   4.221
 14   1.345   1.761   2.145   2.264   2.415   2.624   2.977   4.140
 15   1.340   1.753   2.131   2.249   2.397   2.602   2.947   4.073
 16   1.337   1.746   2.120   2.235   2.382   2.583   2.921   4.015
 17   1.333   1.740   2.110   2.224   2.368   2.567   2.898   3.965
 18   1.330   1.734   2.101   2.214   2.356   2.552   2.878   3.922
 19   1.328   1.729   2.093   2.205   2.346   2.539   2.861   3.883
 20   1.325   1.725   2.086   2.196   2.336   2.528   2.845   3.850
 21   1.323   1.721   2.079   2.189   2.327   2.517   2.830   3.819
 22   1.321   1.717   2.074   2.183   2.320   2.508   2.818   3.790
 23   1.319   1.714   2.069   2.178   2.313   2.499   2.806   3.763
 24   1.318   1.711   2.064   2.172   2.307   2.492   2.797   3.744
 25   1.316   1.708   2.059   2.166   2.301   2.485   2.787   3.722
 26   1.315   1.706   2.056   2.162   2.296   2.479   2.779   3.706
 27   1.314   1.703   2.052   2.158   2.291   2.472   2.770   3.687
 28   1.313   1.701   2.048   2.154   2.286   2.467   2.763   3.673
 29   1.311   1.699   2.045   2.150   2.282   2.462   2.756   3.657
 30   1.310   1.697   2.042   2.147   2.278   2.457   2.750   3.646
  ∞   1.282   1.645   1.960   2.054   2.170   2.326   2.576   3.291

Table T5 The χ² distribution

The value tabulated is χ²(α), such that if X is distributed as χ² with degrees of freedom, df, then α is the probability that X ≥ χ²(α).

[Figure: χ² density with upper-tail area α above χ²(α)]

Columns are values of α.

 df     0.2     0.1    0.05    0.04    0.03    0.02    0.01   0.001
  1    1.64    2.71    3.84    4.22    4.71    5.41    6.63   10.83
  2    3.22    4.61    5.99    6.44    7.01    7.82    9.21   13.82
  3    4.64    6.25    7.81    8.31    8.95    9.84   11.34   16.27
  4    5.99    7.78    9.49   10.03   10.71   11.67   13.28   18.47
  5    7.29    9.24   11.07   11.64   12.37   13.39   15.09   20.51
  6    8.56   10.64   12.59   13.20   13.97   15.03   16.81   22.46
  7    9.80   12.02   14.07   14.70   15.51   16.62   18.48   24.32
  8   11.03   13.36   15.51   16.17   17.01   18.17   20.09   26.12
  9   12.24   14.68   16.92   17.61   18.48   19.68   21.67   27.88
 10   13.44   15.99   18.31   19.02   19.92   21.16   23.21   29.59
 11   14.63   17.28   19.68   20.41   21.34   22.62   24.73   31.26
 12   15.81   18.55   21.03   21.79   22.74   24.05   26.22   32.91
 13   16.98   19.81   22.36   23.14   24.12   25.47   27.69   34.53
 14   18.15   21.06   23.68   24.49   25.49   26.87   29.14   36.12
 15   19.31   22.31   25.00   25.82   26.85   28.26   30.58   37.70
 16   20.47   23.54   26.30   27.14   28.19   29.63   32.00   39.25
 17   21.61   24.77   27.59   28.44   29.52   31.00   33.41   40.79
 18   22.76   25.99   28.87   29.75   30.84   32.35   34.81   42.31
 19   23.90   27.20   30.14   31.04   32.16   33.69   36.19   43.82
 20   25.04   28.41   31.41   32.32   33.46   35.02   37.57   45.31
 21   26.17   29.62   32.67   33.60   34.76   36.34   38.93   46.80
 22   27.30   30.81   33.92   34.87   36.05   37.66   40.29   48.27
 23   28.43   32.01   35.17   36.13   37.33   38.97   41.64   49.73
 24   29.55   33.20   36.42   37.39   38.61   40.27   42.98   51.18
 25   30.68   34.38   37.65   38.64   39.88   41.57   44.31   52.62
 26   31.79   35.56   38.89   39.89   41.15   42.86   45.64   54.05
 27   32.91   36.74   40.11   41.13   42.41   44.14   46.96   55.48
 28   34.03   37.92   41.34   42.37   43.66   45.42   48.28   56.89
 29   35.14   39.09   42.56   43.60   44.91   46.69   49.59   58.30
 30   36.25   40.26   43.77   44.83   46.16   47.96   50.89   59.70

Example: For an observed test statistic of X = 9.7 with 3 degrees of freedom, the tabular entries for α equal to 0.03 and 0.02 are 8.95 and 9.84, respectively. Hence X = 9.7 suggests a p-value between 0.03 and 0.02, or more precisely close to 0.021.

Table T6 The F-distribution

The value tabulated is F(α,df1,df2), such that if X has an F-distribution with df1 and df2 degrees of freedom, then α is the probability that X ≥ F(α,df1,df2).

Columns are values of df1.

df2     α       1       2       3       4       5       6       7       8       9      10      20       ∞
  1  0.10   39.86   49.50   53.59   55.83   57.24   58.20   58.91   59.44   59.86   60.19   61.74   63.30
  1  0.05  161.45  199.50  215.71  224.58  230.16  233.99  236.77  238.88  240.54  241.88  248.02  254.19
  1  0.01 4052.18 4999.34 5403.53 5624.26 5763.96 5858.95 5928.33 5980.95 6022.40 6055.93 6208.66 6362.80
  2  0.10    8.53    9.00    9.16    9.24    9.29    9.33    9.35    9.37    9.38    9.39    9.44    9.49
  2  0.05   18.51   19.00   19.16   19.25   19.30   19.33   19.35   19.37   19.38   19.40   19.45   19.49
  2  0.01   98.50   99.00   99.16   99.25   99.30   99.33   99.36   99.38   99.39   99.40   99.45   99.50
  3  0.10    5.54    5.46    5.39    5.34    5.31    5.28    5.27    5.25    5.24    5.23    5.18    5.13
  3  0.05   10.13    9.55    9.28    9.12    9.01    8.94    8.89    8.85    8.81    8.79    8.66    8.53
  3  0.01   34.12   30.82   29.46   28.71   28.24   27.91   27.67   27.49   27.34   27.23   26.69   26.14
  4  0.10    4.54    4.32    4.19    4.11    4.05    4.01    3.98    3.95    3.94    3.92    3.84    3.76
  4  0.05    7.71    6.94    6.59    6.39    6.26    6.16    6.09    6.04    6.00    5.96    5.80    5.63
  4  0.01   21.20   18.00   16.69   15.98   15.52   15.21   14.98   14.80   14.66   14.55   14.02   13.47
  5  0.10    4.06    3.78    3.62    3.52    3.45    3.40    3.37    3.34    3.32    3.30    3.21    3.11
  5  0.05    6.61    5.79    5.41    5.19    5.05    4.95    4.88    4.82    4.77    4.74    4.56    4.37
  5  0.01   16.26   13.27   12.06   11.39   10.97   10.67   10.46   10.29   10.16   10.05    9.55    9.03
  6  0.10    3.78    3.46    3.29    3.18    3.11    3.05    3.01    2.98    2.96    2.94    2.84    2.72
  6  0.05    5.99    5.14    4.76    4.53    4.39    4.28    4.21    4.15    4.10    4.06    3.87    3.67
  6  0.01   13.75   10.92    9.78    9.15    8.75    8.47    8.26    8.10    7.98    7.87    7.40    6.89
  7  0.10    3.59    3.26    3.07    2.96    2.88    2.83    2.78    2.75    2.72    2.70    2.59    2.47
  7  0.05    5.59    4.74    4.35    4.12    3.97    3.87    3.79    3.73    3.68    3.64    3.44    3.23
  7  0.01   12.25    9.55    8.45    7.85    7.46    7.19    6.99    6.84    6.72    6.62    6.16    5.66
  8  0.10    3.46    3.11    2.92    2.81    2.73    2.67    2.62    2.59    2.56    2.54    2.42    2.30
  8  0.05    5.32    4.46    4.07    3.84    3.69    3.58    3.50    3.44    3.39    3.35    3.15    2.93
  8  0.01   11.26    8.65    7.59    7.01    6.63    6.37    6.18    6.03    5.91    5.81    5.36    4.87
  9  0.10    3.36    3.01    2.81    2.69    2.61    2.55    2.51    2.47    2.44    2.42    2.30    2.16
  9  0.05    5.12    4.26    3.86    3.63    3.48    3.37    3.29    3.23    3.18    3.14    2.94    2.71
  9  0.01   10.56    8.02    6.99    6.42    6.06    5.80    5.61    5.47    5.35    5.26    4.81    4.32
 10  0.10    3.29    2.92    2.73    2.61    2.52    2.46    2.41    2.38    2.35    2.32    2.20    2.06
 10  0.05    4.96    4.10    3.71    3.48    3.33    3.22    3.14    3.07    3.02    2.98    2.77    2.54
 10  0.01   10.04    7.56    6.55    5.99    5.64    5.39    5.20    5.06    4.94    4.85    4.41    3.92
 20  0.10    2.97    2.59    2.38    2.25    2.16    2.09    2.04    2.00    1.96    1.94    1.79    1.61
 20  0.05    4.35    3.49    3.10    2.87    2.71    2.60    2.51    2.45    2.39    2.35    2.12    1.85
 20  0.01    8.10    5.85    4.94    4.43    4.10    3.87    3.70    3.56    3.46    3.37    2.94    2.43
 30  0.10    2.88    2.49    2.28    2.14    2.05    1.98    1.93    1.88    1.85    1.82    1.67    1.46
 30  0.05    4.17    3.32    2.92    2.69    2.53    2.42    2.33    2.27    2.21    2.16    1.93    1.63
 30  0.01    7.56    5.39    4.51    4.02    3.70    3.47    3.30    3.17    3.07    2.98    2.55    2.02
 40  0.10    2.84    2.44    2.23    2.09    2.00    1.93    1.87    1.83    1.79    1.76    1.61    1.38
 40  0.05    4.08    3.23    2.84    2.61    2.45    2.34    2.25    2.18    2.12    2.08    1.84    1.52
 40  0.01    7.31    5.18    4.31    3.83    3.51    3.29    3.12    2.99    2.89    2.80    2.37    1.82
 50  0.10    2.81    2.41    2.20    2.06    1.97    1.90    1.84    1.80    1.76    1.73    1.57    1.33
 50  0.05    4.03    3.18    2.79    2.56    2.40    2.29    2.20    2.13    2.07    2.03    1.78    1.45
 50  0.01    7.17    5.06    4.20    3.72    3.41    3.19    3.02    2.89    2.78    2.70    2.27    1.70
100  0.10    2.76    2.36    2.14    2.00    1.91    1.83    1.78    1.73    1.69    1.66    1.49    1.22
100  0.05    3.94    3.09    2.70    2.46    2.31    2.19    2.10    2.03    1.97    1.93    1.68    1.30
100  0.01    6.90    4.82    3.98    3.51    3.21    2.99    2.82    2.69    2.59    2.50    2.07    1.45
  ∞  0.10    2.71    2.31    2.09    1.95    1.85    1.78    1.72    1.68    1.64    1.61    1.43    1.08
  ∞  0.05    3.85    3.00    2.61    2.38    2.22    2.11    2.02    1.95    1.89    1.84    1.58    1.11
  ∞  0.01    6.66    4.63    3.80    3.34    3.04    2.82    2.66    2.53    2.43    2.34    1.90    1.16

Example: For an observed test statistic of X = 5.1 with df1 = 3 and df2 = 4 degrees of freedom, the tabular entries for α equal to 0.10, 0.05 and 0.01 are 4.19, 6.59 and 16.69, respectively. Hence X = 5.1 suggests a p-value between 0.10 and 0.05, or approximately 0.08.

Table T7 Number of subjects required for a range of the Cohen (1988) standardized effect size, Δ, for a continuous endpoint assuming a two-group comparison with equal numbers in each group (α = 0.05)

   Δ   β = 0.2   β = 0.1
 0.2       788      1054
 0.3       352       470
 0.4       200       266
 0.5       128       172
 0.6        90       120
 0.7        66        88
 0.8        52        68
 0.9        42        54
 1.0        34        44
1.35        20        26

References

Numbers in square brackets after each entry are the chapters in which these are cited.
*Books indicated for further reading in Chapter 1.

*Altman DG (1991) Practical Statistics for Medical Research. London, Chapman and Hall. [1]
*Armitage P, Berry G and Matthews JNS (2002) Statistical Methods in Medical Research (4th edn). Blackwell Science, Oxford. [1]
Altman DG, Lausen B, Sauerbrei W and Schumacher M (1994) Dangers of using “optimal” cutpoints in the evaluation of prognostic factors. Journal of the National Cancer Institute, 86, 829–835 and 1798–1799. [7]
Altman DG, Machin D, Bryant TN and Gardner MJ (eds) (2000) Statistics with Confidence (2nd edn). British Medical Journal, London. [4]
Altman DG and Royston P (2000) What we mean by validating a prognostic model?
Statistics in Medicine, 11, 453–473. [7]
Bellary S, O’Hare JP, Raymond NT, Gumber A, Mughal S, Szczepura A, Kumar S and Barnett AH (2008) Enhanced diabetes care to patients of south Asian ethnic origin (the United Kingdom Asian Diabetes Study): a cluster randomised trial. Lancet, 371, 1769–1776. [11]
Beyersmann J, Dettenkofer M, Bertz H and Schumacher M (2007) A competing risks analysis of bloodstream infection after stem-cell transplantation using subdistribution hazards and cause-specific hazards. Statistics in Medicine, 26, 5360–5369. [10]
*Bland M (2000) An Introduction to Medical Statistics (3rd edn). Oxford University Press, Oxford. [1]
Böhning D, Dietz E, Schlattman P, Mendoca L and Kirchner U (1999) The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. Journal of the Royal Statistical Society, (A), 162, 195–209. [5]
Boos D and Stefanski L (2010) Efron’s bootstrap. Significance, 7, 186–188. [5]
Breiman L, Friedman JH, Olshen RA and Stone CJ (1984) Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books and Software, Monterey, California. [9]
Busse WW, Lemanske Jr RF and Gern JE (2010) Role of viral infections in asthma exacerbations. Lancet, 376, 826–834. [1, 2, 7, 11]
*Campbell MJ (2006) Statistics at Square Two: Understanding Modern Statistical Applications in Medicine (2nd edn). Blackwell BMJ Books, Oxford. [1]
*Campbell MJ, Machin D and Walters SJ (2007) Medical Statistics: A Commonsense Approach: A Text Book for the Health Sciences (4th edn). Wiley, Chichester. [1, 5]
Cassimally KA (2011) We come from one. Significance, 8, 19–21. [3]
Chia B-H, Chia A, Ng W-Y and Tai B-C (2010) Suicide trends in Singapore: 1955–2004. Archives of Suicide Research, 14, 276–283. [5]
Chinnaiya A, Venkat A, Chia D, Chee WY, Choo KB, Gole LA and Meng CT (1998) Intrahepatic vein fetal sampling: Current role in prenatal diagnosis. Journal of Obstetrics and Gynaecology, 24, 239–246. [1, 4, 7]
Chong S-A, Tay JAM, Subramaniam M, Pek E
and Machin D (2009) Mortality rates among patients with schizophrenia and tardive dyskinesia. Journal of Clinical Psychopharmacology, 29, 5–8. [1, 5, 6, 7, 9, 11]
*Clayton D and Hills M (1993) Statistical Models in Epidemiology. Oxford University Press, Oxford. [1]

Regression Methods for Medical Research, First Edition. Bee Choo Tai and David Machin. © 2014 Bee Choo Tai and David Machin. Published 2014 by John Wiley & Sons, Ltd.

Cnattingius S, Hultman CM, Dahl M and Sparén P (1999) Very preterm birth, birth trauma, and the risk of anorexia nervosa among girls. Archives of General Psychiatry, 56, 634–638. [4]
Cohen J (1988) Statistical Power Analysis for the Behavioral Sciences (2nd edn). Lawrence Erlbaum, New Jersey. [7]
Cohn SL, Pearson ADJ, London WB, Monclair T, Ambros PF, Brodeur GM, Faldum A, Hero B, Iehara T, Machin D, Mosseri V, Simon T, Garaventa A, Castel V and Matthay KK (2009) The International Neuroblastoma Risk Group (INRG) Classification System: An INRG Task Force Report. Journal of Clinical Oncology, 27, 289–297. [1, 9]
*Collett D (2002) Modelling Binary Data (2nd edn). Chapman and Hall/CRC, London. [1]
*Collett D (2003) Modelling Survival Data in Medical Research (2nd edn). Chapman and Hall/CRC, London. [1]
Contoli M, Message SD, Laza-Stanca V, Edwards MR, Wark PA, Bartlett NW, Kebadze T, Malia P, Stanciu LA, Parker HL, Slater L, Lewis-Antes A, Kon OM, Holgate ST, Davies DE, Kotenko SV, Papi A and Johnston SL (2006) Role of deficient type III interferon-lambda production in asthma exacerbations. Nature Medicine, 12, 1023–1026. [1]
Cox DR (1972) Regression models and life tables (with discussion). Journal of the Royal Statistical Society, B34, 187–220. [6]
Diggle PJ (1990) Time Series: A Biostatistical Introduction. Oxford University Press, Oxford. [8]
*Diggle PJ, Liang K-Y and Zeger SL (1994) Analysis of Longitudinal Data. Oxford Science Publications, Clarendon Press, Oxford. [1, 8]
*Dobson AJ and Barnett AG (2008) Introduction to Generalized Linear Models
(3rd edn). Chapman and Hall/CRC, London. [1]
Everitt BS (2003) Modern Medical Statistics: A Practical Guide. Arnold Publishers, London. [9]
*Everitt BS and Rabe-Hesketh S (2006) A Handbook of Statistical Analysis using Stata (4th edn). Chapman and Hall/CRC, London. [1]
Feinstein AR (1996) Multivariable Analysis: An Introduction. Yale University Press, New Haven and London. [9]
Fine JP and Gray RJ (1999) A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association, 94, 496–509. [10]
*Freeman JV, Walters SJ and Campbell MJ (2008) How to Display Data. BMJ Books, Blackwell Publishing, Oxford. [1]
Frison L and Pocock SJ (1992) Repeated measures in clinical trials: analysing mean summary statistics and its implications for design. Statistics in Medicine, 11, 1685–1704. [11]
Grundy RG, Wilne SH, Robinson KJ, Ironside JW, Cox T, Chong WK, Michalski A, Campbell RHA, Bailey CC, Thorpe N, Pizer B, Punt J, Walker DA, Ellison DW and Machin D (2010) Primary postoperative chemotherapy without radiotherapy for treatment of brain tumours other than ependymoma in children under years: Results of the first UKCCSG/SIOP CNS 9204 trial. European Journal of Cancer, 46, 120–133. [10]
Grundy RG, Wilne SH, Weston CL, Robinson K, Lashford LS, Ironside J, Cox T, Chong WK, Campbell RHA, Bailey CC, Gattamaneni R, Picton S, Thorpe N, Mallucci C, English MW, Punt JAG, Walker DA, Ellison DW and Machin D (2007) Primary postoperative chemotherapy without radiotherapy for intracranial ependymoma in children: the UKCCSG/SIOP prospective study. Lancet Oncology, 8, 696–705. [10]
Husain R, Liang S, Foster PJ, Gazzard G, Bunce C, Chew PTK, Oen FTS, Khaw PT, Seah SKL and Aung T (2012) Cataract surgery after trabeculectomy: The effect of trabeculectomy function. Archives of Ophthalmology, 130, 165–170. [10]
ICH E9 Expert Working Group (1999) Statistical principles for clinical trials: ICH Harmonised Tripartite Guideline. Statistics in Medicine, 18, 1905–1942. [7]
Ince D
(2011) The Duke University scandal – what can be done? Significance, 8, 113–115. [7]
Jackson DJ, Gangnon RE, Evans MD, Roberg KA, Anderson EL, Pappas TE, Printz MC, Lee W-M, Shult PA, Reisdorf E, Carlson-Dakes KT, Salazar LP, DaSilva DF, Tisler CJ, Gern JE and Lemanske RF (2008) Wheezing rhinovirus illnesses in early life predict asthma development in high-risk children. American Journal of Respiratory and Critical Care Medicine, 178, 667–672. [4, 7]
*Kleinbaum G, Kupper LL, Muller KE and Nizam E (2007) Applied Regression Analysis and Other Multivariable Methods (4th edn). Duxbury Press, Florence, Kentucky. [1]
Korenman S, Goldman N and Fu H (1997) Misclassification bias in estimates of bereavement effects. American Journal of Epidemiology, 145, 995–1002. [10]
LeBlanc M and Crowley J (1993) Survival trees by goodness of split. Journal of the American Statistical Association, 88, 457–467. [9]
Lee CH, Tai BC, Soon CY, Low AF, Poh KK, Yeo TC, Lim GH, Yip J, Omar AR, Teo SG and Tan HC (2010) A new set of intravascular ultrasound-derived anatomical criteria for defining functionally significant stenoses in small coronary arteries: results from Intravascular Ultrasound Diagnostic Evaluation of Atherosclerosis in Singapore (IDEAS) study. American Journal of Cardiology, 105, 1378–1384. [7, 9]
Levitan EB, Yang AZ, Wolk W and Mittleman MA (2009) Adiposity and incidence of heart failure hospitalization and mortality: a population-based prospective study. Circulation: Heart Failure, 2, 203–208. [7]
Levy HL, Milanowski A, Chakrapani A, Cleary M, Lee P, Trefz FK, Whitley CB, Feillet F, Feigenbaum AS, Bebchuk JD, Christ-Schmidt H and Dorenbaum A (2007) Efficacy of sapropterin dihydrochloride (tetrahydrobiopterin, 6R-BH4) for reduction of phenylalanine concentration in patients with phenylketonuria: a phase III randomised placebo-controlled study. Lancet, 370, 504–510. [11]
Little RJ, Cohen ML, Dickersin K, Emerson SS, Farrar JT, Neaton JD, Shih W, Siegel JP and Stern H (2012) The
design and conduct of clinical trials to limit missing data. Statistics in Medicine, 31, 3433–3443. [10]
*Machin D and Campbell MJ (2005) Design of Studies for Medical Research. Wiley, Chichester. [1]
*Machin D, Campbell MJ, Tan SB and Tan SH (2009) Sample Size Tables for Clinical Studies (3rd edn). Wiley-Blackwell, Chichester. [1, 7]
*Machin D, Cheung Y-B and Parmar MKB (2006) Survival Analysis: A Practical Approach (2nd edn). Wiley, Chichester. [1]
Maheswaran R, Pearson T, Hoysal N and Campbell MJ (2010) Evaluation of the impact of a health forecast alert service on admissions for chronic obstructive pulmonary disease in Bradford and Airedale. Journal of Public Health, 32, 97–102. [1, 5]
Marshall RJ (2001) The use of classification and regression trees in clinical epidemiology. Journal of Clinical Epidemiology, 54, 603–609. [9]
Marubini E and Valsecchi MG (1995) Analysing Survival Data from Clinical Trials and Observational Studies. Wiley, Chichester. [6]
*Mitchell MN (2012) Interpreting and Visualizing Regression Models Using Stata. Stata Press, College Station, TX. [1]
Nejadnik H, Hui JH, Choong EP-F, Tai B-C and Lee E-H (2010) Autologous bone marrow-derived mesenchymal cells versus autologous chondrocyte implantation: An observational cohort study. American Journal of Sports Medicine, 38, 1110–1116. [8, 11]
Ng DPK, Fukushima M, Tai B-C, Koh D, Leong H, Imura H and Lim XL (2008) Reduced GFR and albuminuria in Chinese type diabetes mellitus patients are both independently associated with activation of the TNF-α system. Diabetologia, 51, 2318–2324. [1, 7]
Pearson ADJ, Pinkerton CR, Lewis IJ, Ellershaw C and Machin D (2008) High-dose rapid and standard induction chemotherapy for patients aged over year with stage neuroblastoma: a randomized trial. Lancet Oncology, 9, 247–256. [6, 7, 10]
Poole-Wilson PA, Uretsky BF, Thygesen K, Cleland JGF, Massie BM and Rydén L (2003) Mode of death in heart failure: findings from the ATLAS trial. Heart, 89, 42–48. [10]
Poon C-Y, Goh B-T, Kim M-J,
Rajaseharan A, Ahmed S, Thongprasom K, Chaimusik M, Suresh S, Machin D, Wong H-B and Seldrup J (2006) A randomised controlled trial to compare steroid with cyclosporine for the topical treatment of oral lichen planus. Oral Surgery, Oral Pathology, Oral Radiology and Endodontology, 102, 47–55. [1, 8, 11]
Pregibon D (1981) Logistic regression diagnostics. Annals of Statistics, 9, 705–724. [4]
*Rabe-Hesketh S and Skrondal A (2008) Multilevel and Longitudinal Modeling Using Stata (2nd edn). Stata Press, College Station, TX. [1, 8]
Richards SH, Bankhead C, Peters TJ, Austoker J, Hobbs FDR, Brown J, Tydeman C, Roberts L, Formby J, Redman V, Wilson S and Sharp DJ (2001) Cluster randomized controlled trial comparing the effectiveness and cost-effectiveness of two primary care interventions aimed at improving attendance for breast screening. Journal of Medical Screening, 8, 91–98. [11]
Rothman KJ (2002) Epidemiology: An Introduction. Oxford University Press, New York, page 194. [7]
Royston P, Reitz M and Atzpodien J (2006) An approach to estimating prognosis using fractional polynomials in metastatic renal carcinoma. British Journal of Cancer, 94, 1785–1788. [11]
Sackley CM, van den Berg ME, Lett K, Patel S, Hollands K, Wright CC and Hoppitt TJ (2009) Effects of a physiotherapy and occupational therapy intervention on mobility and activity in care home residents: a cluster randomized controlled trial. British Medical Journal, 2009 Sep 1, 339:b3123. doi: 10.1136/bmj.b3123. [11]
SAS Institute (2012) SAS Enterprise Miner (EM): Version 12.1. SAS Institute, Cary, NC. [9]
Schoenfeld D (1982) Partial residuals for the proportional hazards regression model. Biometrika, 69, 239–241. [6]
Sridhar T, Gore A, Boiangiu I, Machin D and Symonds RP (2009) Concomitant (without adjuvant) temozolomide and radiation to treat glioblastoma: A retrospective study. Clinical Oncology, 21, 19–22. [6, 7, 10]
StataCorp (2007a) Stata Base Reference Manual, Volume 2: A-H, Release 10. College Station, TX. [11]
StataCorp (2007b) Stata Base Reference Manual, Volume 2: I-P, Release 10. College Station, TX. [4]
StataCorp (2007c) Stata Base Reference Manual, Volume 2: Q-Z, Release 10. College Station, TX. [7]
*Swinscow TV and Campbell MJ (2002) Statistics at Square One (10th edn). Blackwell, BMJ Books, Oxford. [1]
Tai BC, Grundy R and Machin D (2011) On the importance of accounting for competing risks in paediatric cancer trials designed to delay or avoid radiotherapy II. Adjustment for covariates and sample size issues. International Journal of Radiation Oncology, Biology and Physics, 79, 1139–1146. [10]
Tai B-C, Peregoudov A and Machin D (2001) A competing risk approach to the analysis of trials of alternative intrauterine devices (IUDs) for fertility regulation. Statistics in Medicine, 20, 3589–3600. [10]
Tan C-K, Law N-M, Ng H-S and Machin D (2003) Simple clinical prognostic model for hepatocellular carcinoma in developing countries and its validation. Journal of Clinical Oncology, 21, 2294–2298. [7, 9]
Therneau T and Atkinson E (2011) An introduction to recursive partitioning using the RPART routines. Technical report, Mayo Foundation, Rochester. http://CRAN.R-project.org/package=rpart [9]
Viardot-Foucault V, Prasath EB, Tai B-C, Chan JKY and Loh SF (2011) Predictive factors of success for Intra-Uterine Insemination: a single centre retrospective study. Human Reproduction, 26 (Suppl 1), 1232–1233. [5, 8]
Vuong Q (1989) Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57, 307–334. [5]
Wee J, Tan E-H, Tai B-C, Wong H-B, Leong S-S, Tan T, Chua E-T, Yang E, Lee K-M, Fong K-W, Tan HSK, Lee K-S, Loong S, Sethi V, Chua E-J and Machin D (2005) Randomized trial of radiotherapy versus concurrent chemoradiotherapy followed by adjuvant chemotherapy in patients with American Joint Committee on Cancer/International Union against cancer stage III and IV nasopharyngeal cancer of the endemic variety. Journal of Clinical Oncology, 23, 6730–6738. [1, 7, 10]
Whitehead J
(1993) Sample size calculations for ordered categorical data Statistics in Medicine, 12, 2257–2272 [4] Wight J, Jakubovic M, Walters S, Maheswaran R, White P and Lennon V (2004) Variation in cadaveric organ donor rates in the UK Nephrology Dialysis Transplantation; 19, 963–968 [5] Young SS and Karr A (2011) Deming, data and observational studies: A process out of control and needing fixing Significance, 8, 116–119 [7] Zhang H, Crowley J, Sox HC and Olshen RA (1999) Tree-structured statistical methods In Armitage P and Colton T (eds) Encyclopedia of Biostatistics, 6, 4561–4573 [9] Zhang H, Holford T and Bracken MB (1996) A tree-based method of analysis for prospective studies Statistics in Medicine, 15, 37–49 [9] Index adjacent values 27 adjusted analysis 172 Akaike Information Criterion (AIC) 57, 58, 63, 110, 149, 170 alternative hypothesis 31, 96, 166 Analysis of Variance (ANOVA) 6, 20–1, 42, 45, 47, 48, 52, 54, 62, 116, 177 anorexia 89–92 association 9, 38, 45, 160, 162, 170, 183, 269, 272 linear 22, 59 asthma 8–9, 36, 64–8, 78–81, 94, 95, 148–9, 165 auto-correlation 22, 177, 178, 183–93, 203, 269, 271–5, 277 auto-regressive 184 correlation structure 184, 190, 203 exchangeable 183–4, 188, 190–93 independent 183 uniform 183–4 unstructured 184–5, 194, 197, 202 Auto-regressive 184 Autologous bone marrow-derived mesenchymal cells (BMSC) 182, 190, 192 Autologous chondrocyte (ACC) 182, 190, 192 backward selection 160 bandwidth 203 Barthel index 277 baseline adjustment 270–1 baseline characteristic 277 baseline hazard 126, 127, 251 before-and-after design 270 between subject variation 193 bias 168, 174–5 binary covariates 28, 79, 80, 86, 127, 136, 146, 151–4, 162, 169, 171, 214, 234 binary outcome 65–6, 83, 93, 205, 207–14 Binomial distribution 95, 116 Binomial models 99–101 body mass index (BMI) 151, 254 Bonferroni correction 169–1 bootstrap 190 box-whisker plot 27, 32 brain tumour 237 branches 15, 204, 206, 224–7, 229 base-control studies matched 64, 87, 89–91 
unmatched 89–91 cataract surgery 256–8 categorical covariates 69–75, 214, 244 ordered 8, 27–8, 32–3, 70–2, 75, 77, 130, 132, 135, 147, 198, 212, 214, 230 unordered 25, 28–31, 55, 69–70, 75, 86, 130, 132, 198 CART 205 cause-specific rates (CSR) 237–40 censored observation 120–5, 138, 141–3, 167, 237, 239–41, 248, 256, 268 censored survival time 121 chemotherapy 12, 237, 239, 258, 260, 266 change in estimates 156–7 chi-squared (χ²) distribution 96 chronic obstructive pulmonary disease (COPD) 11–12, 98 Classification And Regression Tree 205 classification trees 230–4 clinical significance 40–1 clinical trial 8, 22, 38–40, 120–2, 124, 148, 164, 179, 236–7, 239, 260, 267, 269 cluster 36, 235, 257, 269, 277–81 cluster design 12, 276–81 coefficient of determination (R²) 21 coefficient of variation 275–6 cohort 254 collinearity 59–60 competing risks (CR) 236–7 adjusting for covariates 243–7 CuMulative Incidence Rates (CMIR) 240–3 complementary log-log plot 136, 143 complexity parameter (CP) 229 conditional logistic regression 64, 87–92, 105 conditional probability 142 confidence interval 4, 12, 17, 20, 40–1, 61, 68, 74 confounding 157 constant hazard 253 continuous covariate 25, 33–8, 72–7 continuous explanatory variable 146 continuous outcome 65, 202, 214–22, 235 continuous time-varying covariate 258–63 continuous variable 8, 17, 34, 73, 74, 82, 130, 165, 234, 279 correlation 21–2 correlation coefficient 22 correlation matrix 183–4 correlation structure 184 cost complexity 229 covariance 195–7, 199–201, 279–81 covariate pattern 64, 83, 86, 87 covariates 2, 6, 9–10, 17, 21, 38, 43 binary 28, 79, 80, 86, 127, 136, 146, 151–3, 162, 169, 171, 214, 234 continuous 25, 33–8, 72–7 continuous time-varying 258–63 design 10, 93, 148–9, 157, 265 discrete time-varying 256–8, 261 independent variables 25–31 knowingly
influential 148, 149 linear regression 43–5 query 149, 165, 170 time-dependent 141 see also categorical covariates; time-varying covariates Cox regression model 12–13, 120, 127–36, 205, 237, 243, 253, 254 non-proportional hazards 141, 171, 262–7 proportional hazards 167, 205, 243, 262–7 stratified Cox model 138–41, 171 time-varying covariates 254–66 cross-validation 164, 171, 232–3 cumulative death rate (CDR) 128, 129, 133, 134, 221–3, 225, 232, 237 cumulative exposure 109–12 CuMulative Incidence Rates (CMIR) 240–3 cumulative survival probability 141 cut-points 132, 147, 164, 205–9, 211–18, 230 cyclosporine 13, 14 database format 178–82 long format 181, 197, 202, 256, 261 wide format 181, 202, 256 daughter node 205–7, 209, 212–15, 218, 219, 221–7, 230, 235 decision rules 204–5 degrees of freedom 16–18, 20–1, 39, 41, 45, 47, 50, 52, 63, 96, 153 dependent variable design covariates 10, 93, 148–9, 157, 265 deviances 96 diabetes 9–10, 147, 237 discrete time-varying covariate 256–9, 261 distant metastasis 236–9, 242–7, 267 dummy variables 29, 31 effect size 166, 167 endpoint 72, 99, 120–2, 147–51, 153–4, 161, 167, 168, 174, 181, 183, 208, 213, 237, 254, 269–70, 274, 276, 278, 281 binary 92, 147, 162, 196–202, 207, 217 continuous 165, 197, 207, 216, 230 entropy criterion 208 event 10, 12, 48, 65, 87, 95, 98–9, 101, 116, 120, 121, 165, 167, 236–41, 243 event-free-survival (EFS) 15, 172, 204, 237 Events Per candidate coVariate (EPV) 165 Exponential constant (e) 22–4 Exponential distribution 248, 251, 253, 254 Exponential survival function 248 F-distribution 21, 31, 62 F-statistic 20 F-test 31, 41–2, 62 fetal sampling 10, 94 fixed-effects models 185–93, 202 forced expiratory volume (FEV) 8, 22, 148–9, 165, 273–4 forced selection 155–7 forward selection 160 fractional flow reserve (FFR) 162, 205, 206, 230 fractional polynomials 282–4 Generalized Estimating Equation (GEE) 187, 202 geometric mean 23–4 Gini information index 235 glioblastoma 123–9, 142–5, 149,
249–53 global test 31, 41–2, 138, 139, 264 goodness-of-fit 87 hazard function 142 hazard rate 125–6 hazard ratio (HR) 12, 127, 171 hepatocellular carcinoma (HCC) 162, 171, 205 hierarchical models 193 hierarchical selection 157–9 hierarchical backward 157 hierarchical forward 158–9 high density lipoprotein (HDL) 2–8, 17–20, 22, 25–38, 40, 41, 43, 254, 282–4 histogram 2 hypothesis 31, 74 null 2–3, 12, 16, 18, 20–2, 27, 31, 33, 39, 41, 50–5, 62–3, 65, 68, 73, 74, 80, 96, 104, 108, 127, 133, 138, 148, 152, 157, 166, 169, 202, 244, 247, 251, 258, 271, 276 human rhinovirus (HRV) 8–9, 64–7, 79, 81, 148 impurity 208 incidence-rate ratio 100 independent variables 25–31, 44 see also covariates indicator variables 29, 31 influential observations 36, 38 information criterion 208 interactions 148, 169, 205, 234–5, 264, 282 logistic regression 80–2 multiple linear regression 53–9, 61 intercept 43 interferon-λ 8, 37, 149, 165, 276 intermediate node 206, 212–14, 224–6, 229 intra-class correlation (ICC) 277–8 ischaemic heart disease (IHD) 69–77, 81–7, 148, 151–3, 155–8 intra-uterine device (IUD) 237–9, 242–3 intra-uterine insemination (IUI) 99–101 jackknife 105, 106 jittering 27 Kaplan-Meier estimates 12, 240 Kaplan-Meier (K-M) survival curves 123–5, 237 knowingly influential covariate 148, 149 lesion length 162, 205–6, 210, 211, 213, 214 leverage 83–4, 86 likelihood ratio 68, 96 linear regression 4–7, 17–18, 25–42 assumptions 32–8 independent variables 25–31 ordered Normal scores 42 residuals 33–9, 41, 42 link function 99 local maxima 211, 232 local recurrence 236–47, 267 log hazard ratio 138 log likelihood 96 log link function 99 log transformation 23–4 logarithms 22–4 logistic regression 64–97 categorical covariates 69–72 conditional 64, 87–92, 105 continuous covariates 72–7 goodness-of-fit 87 interactions 80–1 lack of important covariate 82 logit transformation 64–8 model checking 81–7 ordered 64, 92–4, 150 logit models 119 logit transformation 64–8 logrank test 220 long format
181, 197, 202, 256, 261 longitudinal studies 176–82, 191, 260 auto-correlation 177, 178 before-and-after design 270 database format 178–82 time series 177–8 repeated measures 13, 176–82 Locally Weighted Scatterplot Smoothing (Lowess) curve 187, 203 Lowess 187, 203 matched case-control studies 64, 89, 90, 91 maximum likelihood estimate (MLE) 96, 118–19 mean 1, 20, 29–32, 47, 51, 65, 87, 95, 103–5, 116, 175, 203, 216–19, 271, 275–6 comparing two 2–5, 17, 18 geometric 23–4 median 15, 23–4, 27, 32, 150, 151, 185, 190, 275 minimum lumen area 162, 205–6, 210, 211, 213, 217, 218 misclassification costs 226–8 missing covariate values 233–4 missing data 168–9 misspecified model 104 mixed-effects models 193–202 modeling a difference 271–3 modeling a ratio 273–4 multi-level models 278–81 multiple comparisons 169 multiple linear regression 43–63 adequacy of fit 45–8 assumptions 61 collinearity 59–60 interactions 53–8 nested models 51, 53, 57–9, 61–3 non-nested models 58 parsimonious models 61 multiple logistic regression 78–81 multiple modeling 170 multiple tests 169–70, 173 multivariable model 16, 47, 148, 167, 170 nasopharyngeal cancer 12–13, 171–2, 237, 244, 246, 262, 264, 265, 267 natural logarithm 116 nested models 51, 53, 57–9, 61–3 neuroblastoma 14, 15, 140, 173, 204, 266 non-informative censoring 124 non-linearity 33, 43, 49, 51, 61, 148, 176, 282 non-nested models 58 non-parametric 250 non-proportional hazards 141, 172, 267–9 Normal distribution 2, 3, null hypothesis 2–3 null model 41, 58, 96, 102, 104, 105, 108, 148, 152, 157, 160 numerically discrete covariates 28 odds 10, 64, 65, 73, 94, 202 odds ratio (OR) 64–5, 92, 94–5 offset 108 optimal split 220 optimal subtree 229 optimum cut-point 208, 217 oral lichen planus (OLP) 13–14, 22, 34, 38, 176, 177, 179, 282 ordered categorical covariates 8, 27–8, 32–3, 70–2, 75, 77, 130, 132, 135, 148, 202, 212, 214, 230 ordered logistic regression 64, 92–4, 150 ordinary least-squares 17, 178, 187, 278 outlier 27, 36–8, 41,
86, 168, 276 over-dispersion 98, 100, 103–5 overall survival 12, 204, 249, 250, 260, 264–6 p-value 104, 114, 127, 133, 138, 141, 149, 152, 155, 157, 160, 164, 168, 169, 173 parallel lines model 146 parametric models 247–54 Exponential distribution 248–51, 253, 254 Weibull distribution 248–51, 253, 254 parsimonious models 61 change in estimates 156–7 covariates 147–51 data checking 167–8 forced selection 155–6 grouping terms 159–60 hierarchical selection 157–9 missing data 168–9 multiple modeling 170 multiple tests 169–70 prognostic index (PI) 161–4 selection and stepwise 160 stratification 171–2 subgroup analysis 172–4 Pearson statistic 86, 90, 104 Pearson’s residuals 84 percentile 27 plaque burden 162, 205–7, 209–10, 212, 213, 217, 220 Poisson distribution 98, 99, 103–5, 112, 116–18 Poisson regression 98–119 cumulative exposure 109–12 over-dispersion 103–5 population size at risk 101–3, 105–9 residuals 114–15 robust estimates 98, 105, 190, 203 zero-inflated models 112–14 polynomial model 51, 282, 283 population-averaged models 187, 188, 192, 193, 202–3 population parameters 17, 185 population size at risk known 105–9 unknown 101–3 Prais-Winsten method 177 predicted value 33, 34, 136, 284 pregnancy 99–102, 104, 196–203, 237, 238, 242 Pregibon leverage 83 primary splits 215, 219, 221, 231, 234 probability 3, 42, 65, 66, 83–7, 95, 99, 118, 141, 142, 248 prognostic index (PI) 161–4, 234–5 prognostic model 164, 171, 205 proportional hazards (PH) 127, 136–41 proportional odds 94 Q-Q plot 35–6 quadratic models 48–51, 148 quantile 35, 36, 62, 114, 115 query covariates 149, 165, 170 R routine 219 radiotherapy 12, 237 random intercept 195–7, 199 random-effects model 176, 194–6 random-effects slope 195–7, 199 random variation rank 27, 42, 121, 122, 141, 144–5, 197, 206–9, 213, 214, 216, 219, 240 recursive partitioning 206 regression coefficient 1, 5, 10, 12, 17, 30–2, 38, 40, 45, 51–5, 57, 59, 61–3, 66–74, 77, 80, 82–3, 88, 89, 99, 105, 110, 113, 126–9, 132,
133, 135–8, 143, 149, 152, 156, 163–5, 168, 170, 177, 183, 185, 190–1, 202, 203, 254, 265, 271, 279 regression models 234–5 regression tree 204–35 relative risk (RR) 94–5 repeated measures 13, 176–203 residual variation residuals 143–5 linear regression 33–9, 41, 42 Pearson’s 83 Q-Q plot 35–6 Schoenfeld 138, 143–4 standardized 64, 115, 116 Rivermead mobility index 277 robust estimates 90 Root Mean Square Error (MSE) 177 of ANOVA table 51 root node 204, 206, 208, 209, 212, 215, 216, 219, 226, 229 rpart 213, 235 sampling with replacement 105 sandwich estimate 105 SAS 235 scale parameter 248 scatter plot 5–7, 19, 27, 28, 32–9, 41, 49, 59, 60, 62, 72, 83, 86, 139, 145, 165, 177, 185–7, 203, 207, 209–14, 216–20, 272, 274 schizophrenia 128–39, 155, 233, 237 Schoenfeld residuals 138, 143–4 serial-correlation 183–5 see also auto-correlation SF-36 quality-of-life 179 shape parameter 248 significance level 157, 160, 169 significance test 2, 169, 170 skewed distribution 23, 118, 147 slope 8, 18, 22, 23, 41, 43, 48, 53–4, 59, 61, 165, 190, 194–200 smoothed hazard function 253 ‘split-sample’ analysis 171 standard deviation 1–3, 6, 16, 17, 20, 29, 33, 38, 39, 42, 51, 83, 87, 95, 105, 116, 166, 167, 177, 183, 185, 194, 196, 272, 273, 277, 278 standard error 39 standard Normal distribution standardized difference 166 standardised residuals 64, 115, 116 Stata 15, 27, 31, 66, 83, 105, 154, 187, 213, 235, 282 statistical significance 2, 3, 17, 18, 40–1, 45, 146, 157, 169, 251, 258 stepwise methods 160 steroid 13, 14, 176 stratification 171–2 stratified Cox model 138–41, 171 study size 165–7 sub-distribution 243, 247 sub-distribution hazard ratio 243–6 sub-trees 225–30, 233 subgroup analysis 172–4 subject-specific models 202–3 suicide 101–9, 112–14 surrogate split 233, 234 t-distribution 22 t-test 3, 16–17, 39, 45 tardive dyskinesia 109–11, 128–32, 134, 135, 137, 139, 148, 155, 220, 223, 224, 232, 278, 280 terminal node 15, 204, 205, 206, 212, 219, 221–32 test statistic 45 
time-dependent covariates 141 time-series 177 time-to-event data 120–2 time-to-event outcome 220–3 time-varying covariates 141, 254–67 continuous 258–62 discrete 255–8 TNF-α system 9–10, 147 total sum of squares (SS) 17, 46, 57, 87, 215 total variation 20, 45, 47, 50, 55, 57 trabeculectomy 255 transformations 149–51 see also log transformation; logit transformation tree building 206–24 binary covariates 214 binary outcome 207–14 categorical covariates 214 continuous covariates 207–9 continuous outcome 214–19 cross-validation 232–3 cut-off points 230 missing covariate values 233–4 time-to-event outcome 220–4 tree pruning 224–30 branches 204, 206, 224–7, 229 cost complexity 229 misclassification costs 227 sub-trees 225–6 tree homogeneity tree quality 226–9 tree homogeneity 228 tree quality 226–7 Type I error 166 Type II error 166 unadjusted analysis 172 unexplained variation 43, 51, 281 univariate model 78, 80, 89, 96, 114 non-nested model 58 regression models 59, 60 unmatched case-control studies 89 unordered categorical covariates 25, 28–31, 55, 69–70, 75, 86, 130, 132, 199 validation 233 cross-validation 164, 171, 232–3 variance 98, 103–5, 116–17, 272, 274 Visual Analog Scale (VAS) 14, 38, 176 Vuong test 113–14 Weibull distribution 248–51, 253, 254 wide format 181, 202, 255 within-cluster standard deviation 277 within-subject correlation 257 z-distribution 47 zero count 105, 112–13, 162 Zero-inflated Poisson (ZiP) 112–14