Cody’s Data Cleaning Techniques Using SAS®, Second Edition
Ron Cody

The correct bibliographic citation for this manual is as follows: Cody, Ron. 2008. Cody’s Data Cleaning Techniques Using SAS®, Second Edition. Cary, NC: SAS Institute Inc.

Cody’s Data Cleaning Techniques Using SAS®, Second Edition
Copyright © 2008, SAS Institute Inc., Cary, NC, USA
ISBN 978-1-59994-659-7
All rights reserved. Produced in the United States of America.

For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

For a Web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.

U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19, Commercial Computer Software-Restricted Rights (June 1987).

SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513

1st printing, April 2008

SAS® Publishing provides a complete selection of books and electronic products to help customers use SAS software to its fullest potential. For more information about our e-books, e-learning products, CDs, and hard-copy books, visit the SAS Publishing Web site at support.sas.com/publishing or call 1-800-727-3228.

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are registered trademarks or trademarks of their respective companies.

Table of Contents

List of Programs
Preface
Acknowledgments

1 Checking Values of Character Variables
    Introduction
    Using PROC FREQ to List Values
    Description of the Raw Data File PATIENTS.TXT
    Using a DATA Step to Check for Invalid Values
    Describing the VERIFY, TRIM, MISSING, and NOTDIGIT Functions
    Using PROC PRINT with a WHERE Statement to List Invalid Values
    Using Formats to Check for Invalid Values
    Using Informats to Remove Invalid Values

2 Checking Values of Numeric Variables
    Introduction
    Using PROC MEANS, PROC TABULATE, and PROC UNIVARIATE to Look for Outliers
    Using an ODS SELECT Statement to List Extreme Values
    Using PROC UNIVARIATE Options to List More Extreme Observations
    Using PROC UNIVARIATE to Look for Highest and Lowest Values by Percentage
    Using PROC RANK to Look for Highest and Lowest Values by Percentage
    Presenting a Program to List the Highest and Lowest Ten Values
    Presenting a Macro to List the Highest and Lowest "n" Values
    Using PROC PRINT with a WHERE Statement to List Invalid Data Values
    Using a DATA Step to Check for Out-of-Range Values
    Identifying Invalid Values versus Missing Values
    Listing Invalid (Character) Values in the Error Report
    Creating a Macro for Range Checking
    Checking Ranges for Several Variables
    Using Formats to Check for Invalid Values
    Using Informats to Filter Invalid Values
    Checking a Range Using an Algorithm Based on Standard Deviation
    Detecting Outliers Based on a Trimmed Mean and Standard Deviation
    Presenting a Macro Based on Trimmed Statistics
    Using the TRIM Option of PROC UNIVARIATE and ODS to Compute Trimmed Statistics
    Checking a Range Based on the Interquartile Range

3 Checking for Missing Values
    Introduction
    Inspecting the SAS Log
    Using PROC MEANS and PROC FREQ to Count Missing Values
    Using DATA Step Approaches to Identify and Count Missing Values
    Searching for a Specific Numeric Value
    Creating a Macro to Search for Specific Numeric Values

4 Working with Dates
    Introduction
    Checking Ranges for Dates (Using a DATA Step)
    Checking Ranges for Dates (Using PROC PRINT)
    Checking for Invalid Dates
    Working with Dates in Nonstandard Form
    Creating a SAS Date When the Day of the Month Is Missing
    Suspending Error Checking for Known Invalid Dates

5 Looking for Duplicates and "n" Observations per Subject
    Introduction
    Eliminating Duplicates by Using PROC SORT
    Detecting Duplicates by Using DATA Step Approaches
    Using PROC FREQ to Detect Duplicate ID's
    Selecting Patients with Duplicate Observations by Using a Macro List and SQL
    Identifying Subjects with "n" Observations Each (DATA Step Approach)
    Identifying Subjects with "n" Observations Each (Using PROC FREQ)

6 Working with Multiple Files
    Introduction
    Checking for an ID in Each of Two Files
    Checking for an ID in Each of "n" Files
    A Macro for ID Checking
    More Complicated Multi-File Rules
    Checking That the Dates Are in the Proper Order

7 Double Entry and Verification (PROC COMPARE)
    Introduction
    Conducting a Simple Comparison of Two Data Sets
    Using PROC COMPARE with Two Data Sets That Have an Unequal Number of Observations
    Comparing Two Data Sets When Some Variables Are Not in Both Data Sets

8 Some PROC SQL Solutions to Data Cleaning
    Introduction
    A Quick Review of PROC SQL
    Checking for Invalid Character Values
    Checking for Outliers
    Checking a Range Using an Algorithm Based on the Standard Deviation
    Checking for Missing Values
    Range Checking for Dates
    Checking for Duplicates
    Identifying Subjects with "n" Observations Each
    Checking for an ID in Each of Two Files
    More Complicated Multi-File Rules

9 Correcting Errors
    Introduction
    Hardcoding Corrections
    Describing Named Input
    Reviewing the UPDATE Statement

10 Creating Integrity Constraints and Audit Trails
    Introducing SAS Integrity Constraints
    Demonstrating General Integrity Constraints
    Deleting an Integrity Constraint Using PROC DATASETS
    Creating an Audit Trail Data Set
    Demonstrating an Integrity Constraint Involving More than One Variable
    Demonstrating a Referential Constraint
    Attempting to Delete a Primary Key When a Foreign Key Still Exists
    Attempting to Add a Name to the Child Data Set
    Demonstrating the Cascade Feature of a Referential Constraint
    Demonstrating the SET NULL Feature of a Referential Constraint
    Demonstrating How to Delete a Referential Constraint

11 DataFlux and dfPower Studio
    Introduction
    Examples

Appendix Listing of Raw Data Files and SAS Programs
    Programs and Raw Data Files Used in This Book
    Description of the Raw Data File PATIENTS.TXT
    Layout for the Data File PATIENTS.TXT
    Listing of Raw Data File PATIENTS.TXT
    Program to Create the SAS Data Set PATIENTS
    Listing of Raw Data File PATIENTS2.TXT
    Program to Create the SAS Data Set PATIENTS2
    Program to Create the SAS Data Set AE (Adverse Events)
    Program to Create the SAS Data Set LAB_TEST
    Listings of the Data Cleaning Macros Used in This Book

Index

List of Programs

1 Checking Values of Character Variables
    Program 1-1 Writing a Program to Create the Data Set PATIENTS
    Program 1-2 Using PROC FREQ to List All the Unique Values for Character Variables
    Program 1-3 Using the Keyword _CHARACTER_ in the TABLES Statement
    Program 1-4 Using a DATA _NULL_ Step to Detect Invalid Character Data
    Program 1-5 Using PROC PRINT to List Invalid Character Values
    Program 1-6 Using PROC PRINT to List Invalid Character Data for Several Variables
    Program 1-7 Using a User-Defined Format and PROC FREQ to List Invalid Data Values
    Program 1-8 Using a User-Defined Format and a DATA Step to List Invalid Data Values
    Program 1-9 Using a User-Defined Informat to Set Invalid Data Values to Missing
    Program 1-10 Using a User-Defined Informat with the INPUT Function

2 Checking Values of Numeric Variables
    Program 2-1
    Program 2-2