CATEGORICAL DATA ANALYSIS BY EXAMPLE

GRAHAM J. G. UPTON

Copyright © 2017 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data

Names: Upton, Graham J. G., author.
Title: Categorical data analysis by example / Graham J.G. Upton.
Description: Hoboken, New Jersey : John Wiley & Sons, 2016. | Includes index.
Identifiers: LCCN 2016031847 (print) | LCCN 2016045176 (ebook) | ISBN 9781119307860 (cloth) | ISBN 9781119307914 (pdf) | ISBN 9781119307938 (epub)
Subjects: LCSH: Multivariate analysis. | Log-linear models.
Classification: LCC QA278 .U68 2016 (print) | LCC QA278 (ebook) | DDC 519.5/35–dc23
LC record available at https://lccn.loc.gov/2016031847

Printed
in the United States of America.

CONTENTS

PREFACE
ACKNOWLEDGMENTS

1 INTRODUCTION
  1.1 What are Categorical Data?
  1.2 A Typical Data Set
  1.3 Visualization and Cross-Tabulation
  1.4 Samples, Populations, and Random Variation
  1.5 Proportion, Probability, and Conditional Probability
  1.6 Probability Distributions
    1.6.1 The Binomial Distribution
    1.6.2 The Multinomial Distribution
    1.6.3 The Poisson Distribution
    1.6.4 The Normal Distribution
    1.6.5 The Chi-Squared (χ²) Distribution
  1.7 *The Likelihood

2 ESTIMATION AND INFERENCE FOR CATEGORICAL DATA
  2.1 Goodness of Fit
    2.1.1 Pearson’s X² Goodness-of-Fit Statistic
    2.1.2 *The Link Between X² and the Poisson and χ²-Distributions
    2.1.3 The Likelihood-Ratio Goodness-of-Fit Statistic, G²
    2.1.4 *Why the G² and X² Statistics Usually Have Similar Values
  2.2 Hypothesis Tests for a Binomial Proportion (Large Sample)
    2.2.1 The Normal Score Test
    2.2.2 *Link to Pearson’s X² Goodness-of-Fit Test
    2.2.3 G² for a Binomial Proportion
  2.3 Hypothesis Tests for a Binomial Proportion (Small Sample)
    2.3.1 One-Tailed Hypothesis Test
    2.3.2 Two-Tailed Hypothesis Tests
  2.4 Interval Estimates for a Binomial Proportion
    2.4.1 Laplace’s Method
    2.4.2 Wilson’s Method
    2.4.3 The Agresti–Coull Method
    2.4.4 Small Samples and Exact Calculations
  References

3 THE 2 × 2 CONTINGENCY TABLE
  3.1 Introduction
  3.2 Fisher’s Exact Test (For Independence)
    3.2.1 *Derivation of the Exact Test Formula
  3.3 Testing Independence with Large Cell Frequencies
    3.3.1 Using Pearson’s Goodness-of-Fit Test
    3.3.2 The Yates Correction
  3.4 The 2 × 2 Table in a Medical Context
  3.5 Measuring Lack of Independence (Comparing Proportions)
    3.5.1 Difference of Proportions
    3.5.2 Relative Risk
    3.5.3 Odds-Ratio
  References

4 THE I × J CONTINGENCY TABLE
  4.1 Notation
  4.2 Independence in the I × J Contingency Table
    4.2.1 Estimation and Degrees of Freedom
    4.2.2 Odds-Ratios and Independence
    4.2.3 Goodness of Fit and Lack of Fit of the Independence Model
  4.3 Partitioning
    4.3.1 *Additivity of G²
    4.3.2 Rules for Partitioning
  4.4 Graphical Displays
    4.4.1 Mosaic Plots
    4.4.2 Cobweb Diagrams
  4.5 Testing Independence with Ordinal Variables
  References

5 THE EXPONENTIAL FAMILY
  5.1 Introduction
  5.2 The Exponential Family
    5.2.1 The Exponential Dispersion Family
  5.3 Components of a General Linear Model
  5.4 Estimation
  References

6 A MODEL TAXONOMY
  6.1 Underlying Questions
    6.1.1 Which Variables are of Interest?
    6.1.2 What Categories Should be Used?
    6.1.3 What is the Type of Each Variable?
    6.1.4 What is the Nature of Each Variable?
  6.2 Identifying the Type of Model

7 THE 2 × J CONTINGENCY TABLE
  7.1 A Problem with X² (and G²)
  7.2 Using the Logit
    7.2.1 Estimation of the Logit
    7.2.2 The Null Model
  7.3 Individual Data and Grouped Data
  7.4 Precision, Confidence Intervals, and Prediction Intervals
    7.4.1 Prediction Intervals
  7.5 Logistic Regression with a Categorical Explanatory Variable
    7.5.1 Parameter Estimates with Categorical Variables (J > 2)
    7.5.2 The Dummy Variable Representation of a Categorical Variable
  References

8 LOGISTIC REGRESSION WITH SEVERAL EXPLANATORY VARIABLES
  8.1 Degrees of Freedom when there are no Interactions
  8.2 Getting a Feel for the Data
  8.3 Models with Two-Variable Interactions
    8.3.1 Link to the Testing of Independence between Two Variables

9 MODEL SELECTION AND DIAGNOSTICS
  9.1 Introduction
    9.1.1 Ockham’s Razor
  9.2 Notation for Interactions and for Models
  9.3 Stepwise Methods for Model Selection Using G²
    9.3.1 Forward Selection
    9.3.2 Backward Elimination
    9.3.3 Complete Stepwise
  9.4 AIC and Related Measures
  9.5 The Problem Caused by Rare Combinations of Events
    9.5.1 Tackling the Problem
  9.6 Simplicity versus Accuracy
  9.7 DFBETAS
  References

APPENDIX
R CODE FOR COBWEB FUNCTION

Readers are welcome to adapt the following code as they please. The arguments of the function are: df (the data frame containing the cross-tabulation), scale (a number, usually 1, that controls the width of the cobweb lines), and outfile (the file containing the
resulting cobweb diagram).

cobweb