Data Quality and Record Linkage Techniques

DOCUMENT INFORMATION

Basic information

Format
Number of pages	225
File size	1.82 MB

Content

Data Quality and Record Linkage Techniques

Thomas N. Herzog
Fritz J. Scheuren
William E. Winkler

Thomas N. Herzog
Office of Evaluation
Federal Housing Administration
U.S. Department of Housing and Urban Development
451 7th Street, SW
Washington, DC 20140

Fritz J. Scheuren
National Opinion Research Center
University of Chicago
1402 Ruffner Road
Alexandria, VA 22302

William E. Winkler
Statistical Research Division
U.S. Census Bureau
4700 Silver Hill Road
Washington, DC 20233

Library of Congress Control Number: 2007921194
ISBN-13: 978-0-387-69502-0
e-ISBN-13: 978-0-387-69505-1

Printed on acid-free paper.

© 2007 Springer Science+Business Media, LLC

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

There may be no basis for a claim to copyright with respect to a contribution prepared by an officer or employee of the United States Government as part of that person's official duties.

Printed in the United States of America.

9 8 7 6 5 4 3 2 1

springer.com

Preface

Readers will find this book a mixture of practical advice, mathematical rigor, management insight, and philosophy. Our intended audience is the working analyst. Our approach is to work by real life examples. Most illustrations come out of our successful practice. A few are contrived to make a point. Sometimes they come out of failed
experience, ours and others.

We have written this book to help the reader gain a deeper understanding, at an applied level, of the issues involved in improving data quality through editing, imputation, and record linkage. We hope that the bulk of the material is easily accessible to most readers, although some of it does require a background in statistics equivalent to a 1-year course in mathematical statistics. Readers who are less comfortable with statistical methods might want to omit Section 8.5, Chapter 9, and Section 18.6 on first reading. In addition, Chapter 7 may be primarily of interest to those whose professional focus is on sample surveys. We provide a long list of references at the end of the book so that those wishing to delve more deeply into the subjects discussed here can do so.

Basic editing techniques are discussed in Chapter 5, with more advanced editing and imputation techniques being the topic of Chapter 7. Chapter 14 illustrates some of the basic techniques. Chapter 8 is the essence of our material on record linkage. In Chapter 9, we describe computational techniques for implementing the models of Chapter 8. Chapters 9–13 contain techniques that may enhance the record linkage process. In Chapters 15–17, we describe a wide variety of applications of record linkage. Chapter 18 is our chapter on data confidentiality, while Chapter 19 is concerned with record linkage software. Chapter 20 is our summary chapter.

Three recent books on data quality – Redman [1996], English [1999], and Loshin [2001] – are particularly useful in effectively dealing with many management issues associated with the use of data and provide an instructive overview of the costs of some of the errors that occur in representative databases. Using as their starting point the work of quality pioneers such as Deming, Ishikawa, and Juran, whose original focus was on manufacturing processes, the recent books cover two important topics not discussed by those seminal authors: (1) errors that affect data
quality even when the underlying processes are operating properly and (2) processes that are controlled by others (e.g., other organizational units within one's company or other companies). Dasu and Johnson [2003] provide an overview of some statistical summaries and other conditions that must exist for a database to be usable for specific statistical purposes. They also summarize some methods from the database literature that can be used to preserve the integrity and quality of a database. Two other interesting books on data quality – Huang, Wang and Lee [1999] and Wang, Ziad, and Lee [2001] – supplement our discussion. Readers will find further useful references in the International Monetary Fund's (IMF) Data Quality Reference Site on the Internet at http://dsbb.imf.org/Applications/web/dqrs/dqrshome/.

We realize that organizations attempting to improve the quality of the data within their key databases do best when the top management of the organization is leading the way and is totally committed to such efforts. This is discussed in many books on management. See, for example, Deming [1986], Juran and Godfrey [1999], or Redman [1996]. Nevertheless, even in organizations not committed to making major advances, analysts can still use the tools described here to make substantial quality improvement.

A working title of this book – Playing with Matches – was meant to warn readers of the danger of data handling techniques such as editing, imputation, and record linkage unless they are tightly controlled, measurable, and as transparent as possible. Over-editing typically occurs unless there is a way to measure the costs and benefits of additional editing; imputation always adds uncertainty; and errors resulting from the record linkage process, however small, need to be taken into account during future uses of the data.

We would like to thank the following people for their support and encouragement in writing this text: Martha Aliaga, Patrick Ball, Max Brandstetter,
Linda Del Bene, William Dollarhide, Mary Goulet, Barry I. Graubard, Nancy J. Kirkendall, Susan Lehmann, Sam Phillips, Stephanie A. Smith, Steven Sullivan, and Gerald I. Webber.

We would especially like to thank the following people for their support and encouragement as well as for writing various parts of the text: Patrick Baier, Charles D. Day, William J. Eilerman, Bertram M. Kestenbaum, Michael D. Larsen, Kevin J. Pledge, Scott Schumacher, and Felicity Skidmore.

Contents

Preface
About the Authors

1 Introduction
1.1 Audience and Objective
1.2 Scope
1.3 Structure

PART 1 DATA QUALITY: WHAT IT IS, WHY IT IS IMPORTANT, AND HOW TO ACHIEVE IT

2 What Is Data Quality and Why Should We Care?
2.1 When Are Data of High Quality?
2.2 Why Care About Data Quality?
2.3 How Do You Obtain High-Quality Data?
2.4 Practical Tips
2.5 Where Are We Now?

3 Examples of Entities Using Data to their Advantage/Disadvantage
3.1 Data Quality as a Competitive Advantage
3.2 Data Quality Problems and their Consequences
3.3 How Many People Really Live to 100 and Beyond? Views from the United States, Canada, and the United Kingdom
3.4 Disabled Airplane Pilots – A Successful Application of Record Linkage
3.5 Completeness and Accuracy of a Billing Database: Why It Is Important to the Bottom Line
3.6 Where Are We Now?

4 Properties of Data Quality and Metrics for Measuring It
4.1 Desirable Properties of Databases/Lists
4.2 Examples of Merging Two or More Lists and the Issues that May Arise
4.3 Metrics Used when Merging Lists
4.4 Where Are We Now?

5 Basic Data Quality Tools
5.1 Data Elements
5.2 Requirements Document
5.3 A Dictionary of Tests
5.4 Deterministic Tests
5.5 Probabilistic Tests
5.6 Exploratory Data Analysis Techniques
5.7 Minimizing Processing Errors
5.8 Practical Tips
5.9 Where Are We Now?

PART 2 SPECIALIZED TOOLS FOR DATABASE IMPROVEMENT

6 Mathematical Preliminaries for Specialized Data Quality Techniques
6.1 Conditional Independence
6.2 Statistical Paradigms
6.3 Capture–Recapture Procedures and Applications

7 Automatic Editing and Imputation of Sample Survey Data
7.1 Introduction
7.2 Early Editing Efforts
7.3 Fellegi–Holt Model for Editing
7.4 Practical Tips
7.5 Imputation
7.6 Constructing a Unified Edit/Imputation Model
7.7 Implicit Edits – A Key Construct of Editing Software
7.8 Editing Software
7.9 Is Automatic Editing Taking Up Too Much Time and Money?
7.10 Selective Editing
7.11 Tips on Automatic Editing and Imputation
7.12 Where Are We Now?

8 Record Linkage – Methodology
8.1 Introduction
8.2 Why Did Analysts Begin Linking Records?
8.3 Deterministic Record Linkage
8.4 Probabilistic Record Linkage – A Frequentist Perspective
8.5 Probabilistic Record Linkage – A Bayesian Perspective
8.6 Where Are We Now?

9 Estimating the Parameters of the Fellegi–Sunter Record Linkage Model
9.1 Basic Estimation of Parameters Under Simple Agreement/Disagreement Patterns
9.2 Parameter Estimates Obtained via Frequency-Based Matching
9.3 Parameter Estimates Obtained Using Data from Current Files
9.4 Parameter Estimates Obtained via the EM Algorithm
9.5 Advantages and Disadvantages of Using the EM Algorithm to Estimate m- and u-probabilities
9.6 General Parameter Estimation Using the EM Algorithm
9.7 Where Are We Now?

10 Standardization and Parsing
10.1 Obtaining and Understanding Computer Files
10.2 Standardization of Terms
10.3 Parsing of Fields
10.4 Where Are We Now?

11 Phonetic Coding Systems for Names
11.1 Soundex System of Names
11.2 NYSIIS Phonetic Decoder
11.3 Where Are We Now?

12 Blocking
12.1 Independence of Blocking Strategies
12.2 Blocking Variables
12.3 Using Blocking Strategies to Identify Duplicate List Entries
12.4 Using Blocking Strategies to Match Records Between Two Sample Surveys
12.5 Estimating the Number of Matches Missed
12.6 Where Are We Now?

13 String Comparator Metrics for Typographical Error
13.1 Jaro String Comparator Metric for Typographical Error
13.2 Adjusting the Matching Weight for the Jaro String Comparator
13.3 Winkler String Comparator Metric for Typographical Error
13.4 Adjusting the Weights for the Winkler Comparator Metric
13.5 Where Are We Now?

PART 3 RECORD LINKAGE CASE STUDIES

14 Duplicate FHA Single-Family Mortgage Records: A Case Study of Data Problems, Consequences, and Corrective Steps
14.1 Introduction
14.2 FHA Case Numbers on Single-Family Mortgages
14.3 Duplicate Mortgage Records
14.4 Mortgage Records with an Incorrect Termination Status
14.5 Estimating the Number of Duplicate Mortgage Records

15 Record Linkage Case Studies in the Medical, Biomedical, and Highway Safety Areas
15.1 Biomedical and Genetic Research Studies
15.2 Who Goes to a Chiropractor?
15.3 National Master Patient Index
15.4 Provider Access to Immunization Register Securely (PAiRS) System
15.5 Studies Required by the Intermodal Surface Transportation Efficiency Act of 1991
15.6 Crash Outcome Data Evaluation System

16 Constructing List Frames and Administrative Lists
16.1 National Address Register of Residences in Canada
16.2 USDA List Frame of Farms in the United States
16.3 List Frame Development for the US Census of Agriculture
16.4 Post-enumeration Studies of US Decennial Census

17 Social Security and Related Topics
17.1 Hidden Multiple Issuance of Social Security Numbers
17.2 How Social Security Stops Benefit Payments after Death
17.3 CPS–IRS–SSA Exact Match File
17.4 Record Linkage and Terrorism

PART 4 OTHER TOPICS

18 Confidentiality: Maximizing Access to Micro-data while Protecting Privacy
18.1 Importance of High Quality of Data in the Original File
18.2 Documenting Public-use Files
18.3 Checking Re-identifiability
18.4 Elementary Masking Methods and Statistical Agencies
18.5 Protecting Confidentiality of Medical Data
18.6 More-advanced Masking Methods – Synthetic Datasets
18.7 Where Are We Now?

... successes and failures – in data and database use. In Chapter 4, we describe metrics that quantify the quality of databases and data lists. In Chapter 5 we revisit a number of data quality control and editing ... and improve data quality. After identifying when data are of high quality, we give reasons why we should care about data quality and discuss how one can obtain high-quality data. Experts on quality

Date posted: 05/03/2019, 08:26