Biostatistics
A Methodology for the Health Sciences
Second Edition

GERALD VAN BELLE
LLOYD D. FISHER
PATRICK J. HEAGERTY
THOMAS LUMLEY

Department of Biostatistics and Department of Environmental and Occupational Health Sciences
University of Washington
Seattle, Washington

A JOHN WILEY & SONS, INC., PUBLICATION

Copyright © 2004 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax 978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic format.

Library of Congress Cataloging-in-Publication Data:
Biostatistics: a methodology for the health sciences / Gerald van Belle [et al.] – 2nd ed.
p. cm. – (Wiley series in probability and statistics)
First ed. published in 1993, entered under Fisher, Lloyd.
Includes bibliographical references and index.
ISBN 0-471-03185-2 (cloth)
1. Biometry. I. Van Belle, Gerald. II. Fisher, Lloyd, 1939– Biostatistics. III. Series.
QH323.5.B562 2004
610′.1′5195–dc22
2004040491

Printed in the United States of America.
10 9 8 7 6 5 4 3 2 1

Ad majorem Dei gloriam

Contents

Preface to the First Edition
Preface to the Second Edition
1. Introduction to Biostatistics
2. Biostatistical Design of Medical Studies
3. Descriptive Statistics
4. Statistical Inference: Populations and Samples
5. One- and Two-Sample Inference
6. Counting Data
7. Categorical Data: Contingency Tables
8. Nonparametric, Distribution-Free, and Permutation Models: Robust Procedures
9. Association and Prediction: Linear Models with One Predictor Variable
10. Analysis of Variance
11. Association and Prediction: Multiple Regression Analysis and Linear Models with Multiple Predictor Variables
12. Multiple Comparisons
13. Discrimination and Classification
14. Principal Component Analysis and Factor Analysis
15. Rates and Proportions
16. Analysis of the Time to an Event: Survival Analysis
17. Sample Sizes for Observational Studies
18. Longitudinal Data Analysis
19. Randomized Clinical Trials
20. Personal Postscript
Appendix
Author Index
Subject Index
Symbol Index

Preface to the First Edition

The purpose of this book is for readers to learn how to apply statistical methods to the biomedical sciences. The book is written so that those with no prior training in statistics and a mathematical knowledge through algebra can follow the text, although the more mathematical training one has, the easier the learning. The book is written for people in a wide variety of biomedical fields, including (alphabetically) biologists, biostatisticians, dentists, epidemiologists, health services researchers, health administrators, nurses, and physicians.

The text appears to have a daunting amount of material. Indeed, there is a great deal of material, but most students will not cover it all. Also, over 30% of the text is devoted to notes, problems, and references, so that there is not as much material as there seems to be at first sight. In addition to not covering entire chapters, the following are optional materials: asterisks (*) preceding a section number or problem denote more advanced material that the instructor may want to skip; the notes at the end of each chapter contain material for extending and enriching the primary material of the chapter, but this may be skipped.

Although the order of authorship may appear alphabetical, in fact it is random (we tossed a fair coin to determine the sequence) and the book is an equal collaborative effort of the authors.

We have many people to thank. Our families have been helpful and long-suffering during the writing of the book: for LF, Ginny, Brad, and Laura; for GvB, Johanna, Loeske, William John, Gerard, Christine, Louis, and Bud and Stacy. The many students who were taught with various versions of portions of this material were very helpful. We are also grateful to the many collaborating investigators, who taught us much about science as well as the joys of collaborative research.
Among those deserving thanks are for LF: Ed Alderman, Christer Allgulander, Fred Applebaum, Michele Battie, Tom Bigger, Stan Bigos, Jeff Borer, Martial Bourassa, Raleigh Bowden, Bob Bruce, Bernie Chaitman, Reg Clift, Rollie Dickson, Kris Doney, Eric Foster, Bob Frye, Bernard Gersh, Karl Hammermeister, Dave Holmes, Mel Judkins, George Kaiser, Ward Kennedy, Tom Killip, Ray Lipicky, Paul Martin, George McDonald, Joel Meyers, Bill Myers, Michael Mock, Gene Passamani, Don Peterson, Bill Rogers, Tom Ryan, Jean Sanders, Lester Sauvage, Rainer Storb, Keith Sullivan, Bob Temple, Don Thomas, Don Weiner, Bob Witherspoon, and a large number of others. For GvB: Ralph Bradley, Richard Cornell, Polly Feigl, Pat Friel, Al Heyman, Myles Hollander, Jim Hughes, Dave Kalman, Jane Koenig, Tom Koepsell, Bud Kukull, Eric Larson, Will Longstreth, Dave Luthy, Lorene Nelson, Don Martin, Duane Meeter, Gil Omenn, Don Peterson, Gordon Pledger, Richard Savage, Kirk Shy, Nancy Temkin, and many others. In addition, GvB acknowledges the secretarial and moral support of Sue Goleeke.

There were many excellent and able typists over the years; special thanks to Myrna Kramer, Pat Coley, and Jan Alcorn. We owe special thanks to Amy Plummer for superb work in tracking down authors and publishers for permission to cite their work. We thank Robert Fisher for help with numerous figures. Rob Christ did an excellent job of using LaTeX for the final version of the text. Finally, several people assisted with running particular examples and creating the tables; we thank Barry Storer, Margie Jones, and Gary Schoch.

Our initial contact with Wiley was the indefatigable Beatrice Shube. Her enthusiasm for our effort carried over to her successor, Kate Roach. The associate managing editor, Rose Ann Campise, was of great help during the final preparation of this manuscript.

With a work this size there are bound to be some errors, inaccuracies, and ambiguous statements.
We would appreciate receiving your comments. We have set up a special electronic-mail account for your feedback: http://www.biostat-text.info

Lloyd D. Fisher
Gerald van Belle

Preface to the Second Edition

Biostatistics did not spring fully formed from the brow of R. A. Fisher, but evolved over many years. This process is continuing, although it may not be obvious from the outside. It has been ten years since the first edition of this book appeared (and rather longer since it was begun). Over this time, new areas of biostatistics have been developed and emphases and interpretations have changed.

The original authors, faced with the daunting task of updating a 1000-page text, decided to invite two colleagues to take the lead in this task. These colleagues, experts in longitudinal data analysis, survival analysis, computing, and all things modern and statistical, have given a twenty-first-century thrust to the book.

The author sequence for the first edition was determined by the toss of a coin (see the Preface to the First Edition). For the second edition it was decided to switch the sequence of the first two authors and add the new authors in alphabetical sequence.

This second edition adds a chapter on randomized trials and another on longitudinal data analysis. Substantial changes have been made in discussing robust statistics, model building, survival analysis, and discrimination. Notes have been added throughout, and many graphs redrawn. We have tried to eliminate errata found in the first edition, and while more have undoubtedly been added, we hope there has been a net improvement. When you find mistakes we would appreciate hearing about them at http://www.vanbelle.org/biostatistics/.

Another major change over the past decade or so has been technological.
Statistical software and the computers to run it have become much more widely available (many of the graphs and new analyses in this book were produced on a laptop that weighs only slightly more than a copy of the first edition), and the Internet provides ready access to information that used to be available only in university libraries. In order to accommodate the new sections and to attempt to keep up with future changes, we have shifted some material to a set of Web appendices. These may be found at http://www.biostat-text.info. The Web appendices include notes, data sets and sample analyses, links to other online resources, all but a bare minimum of the statistical tables from the first edition, and other material for which ink on paper is a less suitable medium.

These advances in technology have not solved the problem of deadlines, and we would particularly like to thank Steve Quigley at Wiley for his equanimity in the face of schedule slippage.

Gerald van Belle
Lloyd Fisher
Patrick Heagerty
Thomas Lumley
Seattle, June 15, 2003

CHAPTER 1
Introduction to Biostatistics

1.1 INTRODUCTION

We welcome the reader who wishes to learn biostatistics. In this chapter we introduce you to the subject. We define statistics and biostatistics. Then examples are given where biostatistical techniques are useful. These examples show that biostatistics is an important tool in advancing our biological knowledge; biostatistics helps evaluate many life-and-death issues in medicine.

We urge you to read the examples carefully. Ask yourself, “what can be inferred from the information presented?” How would you design a study or experiment to investigate the problem at hand? What would you do with the data after they are collected? We want you to realize that biostatistics is a tool that can be used to benefit you and society.

The chapter closes with a description of what you may accomplish through use of this book. To paraphrase Pythagoras, there is no royal road to biostatistics.
You need to be involved. You need to work hard. You need to think. You need to analyze actual data. The end result will be a tool that has immediate practical uses. As you thoughtfully consider the material presented here, you will develop thought patterns that are useful in evaluating information in all areas of your life.

1.2 WHAT IS THE FIELD OF STATISTICS?

Much of the joy and grief in life arises in situations that involve considerable uncertainty. Here are a few such situations:

1. Parents of a child with a genetic defect consider whether or not they should have another child. They will base their decision on the chance that the next child will have the same defect.
2. To choose the best therapy, a physician must compare the prognosis, or future course, of a patient under several therapies. A therapy may be a success, a failure, or somewhere in between; the evaluation of the chance of each occurrence necessarily enters into the decision.
3. In an experiment to investigate whether a food additive is carcinogenic (i.e., causes or at least enhances the possibility of having cancer), the U.S. Food and Drug Administration has animals treated with and without the additive. Often, cancer will develop in both the treated and untreated groups of animals. In both groups there will be animals that do not develop cancer. There is a need for some method of determining whether the group treated with the additive has “too much” cancer.
4. It is well known that “smoking causes cancer.” Smoking does not cause cancer in the same manner that striking a billiard ball with another causes the second billiard ball to move. Many people smoke heavily for long periods of time and do not develop cancer. The formation of cancer subsequent to smoking is not an invariable consequence but occurs only a fraction of the time. Data collected to examine the association between smoking and cancer must be analyzed with recognition of an uncertain and variable outcome.
5. In designing and planning medical care facilities, planners take into account differing needs for medical care. Needs change because there are new modes of therapy, as well as demographic shifts, that may increase or decrease the need for facilities. All of the uncertainty associated with the future health of a population and its future geographic and demographic patterns should be taken into account.

Inherent in all of these examples is the idea of uncertainty. Similar situations do not always result in the same outcome. Statistics deals with this variability. This somewhat vague formulation will become clearer in this book. Many definitions of statistics explicitly bring in the idea of variability. Some definitions of statistics are given in the Notes at the end of the chapter.

1.3 WHY BIOSTATISTICS?

Biostatistics is the study of statistics as applied to biological areas. Biological laboratory experiments, medical research (including clinical research), and health services research all use statistical methods. Many other biological disciplines rely on statistical methodology.

Why should one study biostatistics rather than statistics, since the methods have wide applicability? There are three reasons for focusing on biostatistics:

1. Some statistical methods are used more heavily in biostatistics than in other fields. For example, a general statistical textbook would not discuss the life-table method of analyzing survival data, a method of importance in many biostatistical applications. The topics in this book are tailored to the applications in mind.
2. Examples are drawn from the biological, medical, and health care areas; this helps you maintain motivation. It also helps you understand how to apply statistical methods.
3. A third reason for a biostatistical text is to teach the material to an audience of health professionals. In this case, the interaction between students and teacher, but especially among the students themselves, is of great value in learning and applying the subject matter.

1.4 GOALS OF THIS BOOK

Suppose that we wanted to learn something about drugs; we can think of four different levels of knowledge. At the first level, a person may merely know that drugs act chemically when introduced into the body and produce many different effects. A second, higher level of knowledge is to know that a specific drug is given in certain situations, but we have no idea why the particular drug works. We do not know whether a drug might be useful in a situation that we have not yet seen. At the next, third level, we have a good idea why things work and also know how to administer drugs. At this level we do not have complete knowledge of all the biochemical principles involved, but we do have considerable knowledge about the activity and workings of the drug. Finally, at the fourth and highest level, we have detailed knowledge of all of the interactions of the drug; we know the current research. This level is appropriate for researchers: those seeking [...]

[Table fragment; the column headers did not survive extraction. Each row below is one column of the original table, in the order it appears in the extract.]
... 14 13 16 18 19 17 21
1 2 11 3 8 7 5 9 6 4 17 21 14 13 12 10 16 19 18 20 15
2 1 5 12 9 3 4 10 8 7 11 16 6 13 18 17 15 14 20 21 19
1 2 3 7 8 5 11 6 4 14 9 10 17 13 15 16 12 19 20 21 18
7 13 28 31 37 39 39 39 42 49 55 60 65 67 71 75 77 84 87 94 94
Rank: 1 2 3 4 5 7 7 7 9 10 11 12 13 14 15 16 17 18 19 20.5 20.5
Rank of Costs of Procedures Ordered (footnote b): 10 5 7 8 16 9 13 18 12 1 20 19 21 14 17 11 4 15 3 2...
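As a hedged illustration of the row labeled Rank in the table fragment above: that row appears consistent with average ranking of ties, since the three values of 39 share rank 7 (the mean of positions 6, 7, and 8) and the two values of 94 share rank 20.5. The Python sketch below reproduces the row under that assumption. The function name average_ranks and the variable name values are illustrative and do not come from the book.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of the ranks they span."""
    # Indices of the observations, sorted by their value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # The block occupies 1-based positions start+1 .. end+1; use their mean.
        mean_rank = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


# The 21 values that precede the "Rank" row in the table fragment.
values = [7, 13, 28, 31, 37, 39, 39, 39, 42, 49, 55,
          60, 65, 67, 71, 75, 77, 84, 87, 94, 94]
ranks = average_ranks(values)
print(ranks)
```

Averaging within a tie keeps the total of the ranks equal to what it would be without ties, which is one common reason for this convention; whether the book uses exactly this rule is an inference from the fragment, not something the extract states.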
3.3 A qualitative variable has values that are intrinsically nonnumerical (categorical). As suggested earlier, the values of a qualitative variable can always be put into numerical form. The simplest numerical form is consecutive labeling of the values of the variable. The values of a qualitative variable are also referred to as outcomes or states. Note that examples 3 and 4 above are ambiguous. In example...

... in a family (counts; 0, 1, 2, 3, ...).

Definition 3.4 A quantitative variable has values that are intrinsically numerical. As illustrated by the examples above, we must specify two aspects of a variable: the scale of measurement and the values the variable can take on. Some quantitative variables have numerical values that are integers, or discrete. Such variables are referred to as discrete variables. The...

... about clinical trials are Meinert [1986], Friedman et al. [1981], Tanur et al. [1989], and Fleiss [1986].

NOTES

1.1 Some Definitions of Statistics

• The science of statistics is essentially a branch of Applied Mathematics, and may be regarded as mathematics applied to observational data.
• Statistics may be regarded (i) as the study of populations, (ii) as the study of variation, (iii) as...

... example 3, what shall we do with Canadian citizens living outside Canada? We could arbitrarily add another “province” with the label “Outside Canada.” Example 4 is ambiguous because there may be more than one cause of death. Both of these examples show that it is not always easy to anticipate all the values of a variable. Either the list of values must be changed or the variable must be redefined. The arithmetic...

... Clinical evaluation of bilateral internal mammary artery ligation as treatment of coronary heart disease. American Journal of Cardiology, 4: 180–183.
Cobb, L. A., Thomas, G. I., Dillard, D. H., Merendino, K. A., and Bruce, R. A. [1959]. An evaluation of internal-mammary-artery ligation by a double blind technique. New England Journal of Medicine, 260: 1115–1118.
Dimond, E. G., Kittle, C. F., and Crockett, J. E. [1960]...

... permanence from one year to the next. A second example of a tabulation involves keypunching errors made by a data-entry operator. To be entered were 156 lines of data, each line containing data on the number of crib deaths for a particular month in King County, Washington, for the years 1965–1977. Other data on

Table 3.2 Rearrangement of Graunt’s Data (Table 3.1) by the 10 Most Common Causes of Death...

... with a written response by all participating investigators.
2. Decide on the statistical analyses beforehand. Check that specific analyses involving specific variables can be run. Often, the analysis is changed during processing of the data or in the course of “interactive” data analysis. This preliminary step is still necessary to ensure that data are available to answer the primary questions.
3. Look at other...

... space or population is the set of all possible values of a variable. The definition or listing of the sample space is not a trivial task. In the examples of qualitative variables, we already discussed some ambiguities associated with the definitions of a variable and the sample space associated with the variable. Your definition must be reasonably precise without being “picky.” Consider again the variable...

... of health care in the United States by means of a personal health questionnaire that is to be passed out at an American Medical Association convention. In this case, the AMA respondents constitute a biased sample of the overall population.
2. A famous historical example involves a telephone poll made during the Dewey–Truman presidential contest. At that time (and to some extent today) a large section of the ...
[Table residue: rows labeled A through U, each with several rank columns, followed by a "Source:" note. The digits ran together in extraction, so the individual values are not recoverable.]