Survey Weights: A Step-by-Step Guide to Calculation

RICHARD VALLIANT
Universities of Michigan & Maryland

JILL A. DEVER
RTI International (Washington, DC)

A Stata Press Publication
StataCorp LLC
College Station, Texas

Copyright © 2018 StataCorp LLC
All rights reserved. First edition 2018

Published by Stata Press, 4905 Lakeway Drive, College Station, Texas 77845
Typeset in LaTeX
Printed in the United States of America

Print ISBN-10: 1-59718-260-5
Print ISBN-13: 978-1-59718-260-7
ePub ISBN-10: 1-59718-261-3
ePub ISBN-13: 978-1-59718-261-4
Mobi ISBN-10: 1-59718-262-1
Mobi ISBN-13: 978-1-59718-262-1

Library of Congress Control Number: 2017960405

No part of this book may be reproduced, stored in a retrieval system, or transcribed, in any form or by any means (electronic, mechanical, photocopy, recording, or otherwise) without the prior written permission of StataCorp LLC.

Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LLC. Stata and Stata Press are registered trademarks with the World Intellectual Property Organization of the United Nations. NetCourseNow is a trademark of StataCorp LLC. LaTeX is a trademark of the American Mathematical Society.

Acknowledgments

We are indebted to several people who have answered questions and encouraged us in the writing of this book. Jeff Pitblado of StataCorp programmed svycal, which is a new Stata procedure that can handle raking, poststratification, general regression, and more general calibration estimation. He also answered many specific Stata questions. This book would not have been possible without him. Matthias Schonlau at the University of Waterloo provided valuable assistance on how to use his boost plug-in and how to tune parameters in boosting. Nicholas Winter helped us several times with questions about his survwgt package, which seems to get far less publicity than it deserves. Stas Kolenikov advised us on Stata’s general capabilities and on his ipfraking raking procedure, which is also a useful tool for computing survey weights. We thank Frauke Kreuter for many things. Her boundless energy and endless fount of ideas have pushed us along for years. Finally, we thank our spouses, Carla Maffeo and Vince Iannacchione, for their support throughout this several-year project.

Richard Valliant
Jill A. Dever
November 2017

Contents

Acknowledgments
Figures
Preface
Glossary of acronyms
1 Overview of weighting
  1.1 Reasons for weighting
  1.2 Probability sampling versus nonprobability sampling
  1.3 Theories of population inference
  1.4 Techniques used in probability sampling
  1.5 Weighting versus imputation
  1.6 Disposition codes
  1.7 Flowchart of the weighting steps
2 Initial steps in weighting probability samples
  2.1 Base weights
  2.2 Adjustments for unknown eligibility
3 Adjustments for nonresponse
  3.1 Weighting class adjustments
  3.2 Propensity score adjustments
  3.3 Tree-based algorithms
    3.3.1 Classification and regression trees
    3.3.2 Random forests
    3.3.3 Boosting
  3.4 Nonresponse in multistage designs
4 Calibration and other uses of auxiliary data in weighting
  4.1 Poststratified estimators
  4.2 Raking estimators
  4.3 More general calibration estimation
  4.4 Calibration to sample estimates
  4.5 Weight variability
5 Use of weights in variance estimation
  5.1 Exact formulas
  5.2 The with-replacement workaround
  5.3 Linearization variances
  5.4 Replication variances
    5.4.1 Jackknife
    5.4.2 Balanced repeated replication
    5.4.3 Bootstrap
    5.4.4 Grouping PSUs to form replicates
  5.5 Effects of multiple weight adjustments
6 Nonprobability samples
  6.1 Volunteer web surveys
  6.2 Weighting nonprobability samples
  6.3 Variance estimation for nonprobability surveys
  6.4 Bayesian approaches
  6.5 Some general comments
7 Weighting for some special cases
  7.1 Normalized weights
  7.2 Multiple weights
  7.3 Two-phase sampling
  7.4 Composite weights
  7.5 Masked strata and PSU IDs
  7.6 Use of weights in fitting models
    7.6.1 Comparing weighted and unweighted model fits
    7.6.2 Testing whether to use weights
8 Quality of survey weights
  8.1 Design and planning stage
  8.2 Base weights
  8.3 Data editing and file preparation
  8.4 Models for nonresponse and calibration
  8.5 Calibration totals
  8.6 Weighting checks
  8.7 Analytic checks
  8.8 Analysis file and documentation
References
Author index
Subject index

Figures

1.1 Illustration of sampling frame with over- and undercoverage of target population
1.2 Flowchart of steps used in weighting probability samples; KN = known eligibility, UNK = unknown eligibility, ER = eligible respondent, ENR = eligible nonrespondent, IN = ineligible
3.1 Logistic versus boost predictions; reference line is drawn at
3.2 Boxplots of logistic and boost predictions
4.1 Negative GREG weights corrected by weight bounding
5.1 Histogram of 1,000 bootstrap estimates of birthweight from the NMIHS sample
6.1 Illustration of potential and actual coverage of a target population
7.1 Comparison of predictions from weighted and unweighted logistic regression for delaying medical care due to cost. Reference line at

Preface

Many data analysts use survey data and understand the general purpose of survey weights. However, they may not have studied the details of how weights are computed, nor do they understand the purpose of different steps used in weighting. Survey Weights: A Step-by-Step Guide to Calculation is intended to fill these gaps in understanding. Throughout the book, we explain the theoretical rationale for why steps are done. Plus, we include many examples that give analysts tools for actually computing weights themselves in Stata.

We assume that the reader is familiar with Stata. If not, Kohler and Kreuter (2012) provide a good introduction. Finally, we also assume that the reader has some applied sampling experience and knowledge of “lite” theory. Concepts of with-replacement versus without-replacement sampling and single- versus multistage designs should be familiar. Sources for sampling theory and associated applications abound, including Valliant, Dever, and Kreuter (2013), Lohr (2010), and Särndal,
Swensson, and Wretman (1992), to name just a few.

Structure of the book

When faced with a new dataset, it is good practice to ask yourself a few questions before analyzing the data. For example,

Am I dealing with a sample, or does the dataset contain a whole population?
If it is a sample, how was it selected?
What is my goal for the analysis? Am I trying to draw inference to the population?
Do I need to weight my sample to project it to the population?

Williams, S. R., 2.1
Williams, W. W., 6.1
Winter, N., 4.2, 5.4, 5.4.1, 5.4.2, 5.5
Witten, D., 3.2, 3.3
Wolter, K. M., , 5.4.1, 7.4
Wretman, J., 1.2, 1.4, 2.1, 4.3,
Wu, C. F. J., 5.4, 5.4.3

Z
Zahs, D., , 6.1
Zapert, K., 6.2
Zeileis, A., 3.3.2
Zhou, H., 6.4
Zielinski, M. W.,

Subject index

A
AAPOR, see American Association of Public Opinion Research
American Association of Public Opinion Research
  margins of error in nonprobability samples, 6.5
  reports on nonprobability samples, 1.2
attrition in opt-in panels, 6.1

B
bagging (bootstrap aggregation), 3.3.2
balanced repeated replication, 5.4.2
  example using survwgt, 5.4.2
  example using svygen, 5.4.2
  Fay method, 5.4.2
  grouping strata and PSUs, 5.4.2
  Hadamard matrix, 5.4.2
  replicate weights, 5.4.2
BART, see Bayesian additive regression trees
base weights, 1.7, 2.1
baseball all-star voting, 6.1
Bayesian additive regression trees, 3.3.3
Bayesian approach
  multilevel regression and poststratification, 6.4
  nonprobability samples, 6.4
boost, Stata plugin, 3.3.3
  example using, 3.3.3
  parameters for, 3.3.3
boosting, 3.3.3
  explanation of, 3.3.3
bootstrap, 5.4.3
  National Maternal and Infant Health Study (NMIHS) example, 5.4.3
  number of replications, 5.4.3
  use by Statistics Canada, 5.4.3
  weights, 5.4.3
BRR, see balanced repeated replication
bsweights, 5.4.3

C
calibration, 1.7,
  choice of covariates, 8.5
  collect consistent with external sources, 8.5
  contribution to variance of using sample estimates, 4.4
  population totals needed,
  poststratification, 4.1
  superpopulation models underlying,
  to sample estimates, 4.4
California Health Interview Survey, 5.4.1
CDC, see Centers for Disease Control
Centers for Disease Control, 6.1
chain referral sampling,
CHIS, see California Health Interview Survey
classification and regression trees, 3.3.1
  advantages of, 3.3.1
clustering, uses of, 1.4
collapsing adjustment cells, 5.5
control totals, recovering from public-use files, 4.1
convenience samples,
convergence of iterative procedures
  problems in replicates, 5.5
coverage, 1.1
  and selection bias, 6.1
  overcoverage, 1.1
  undercoverage, 1.1
Cramér’s V, 3.2
current population survey, 1.1

D
degrees of freedom, 5.2
descriptive statistics, 1.1
design and planning stage
  quality control at, 8.1
design effect because of weighting, 4.5
design unbiased, 1.3
deterministic model for nonresponse,
diagram of universe and frame, 1.1
disposition codes, 1.6
  categories used in weighting, 1.6
  examples of in military personnel survey, 1.6
documentation of survey procedures, 8.8

E
election polls, failures of,
eligible units, 1.1
estimated controls, 4.4
estimating equations, 7.6

F
Fay method of balanced repeated replication, 5.4.2
filenaming conventions, 1.7
finite population, 1.1
  correction factors, 5.1
flowchart of weighting steps, 1.7
fpc
  ad hoc in ultimate cluster variance estimator, 5.2
  specifying in svyset, 5.2
frame
  sampling, 1.1

G
general regression (GREG) estimator, 4.3
  bounded weight changes, 4.3
  example of, 4.3
  example of bounded weight changes, 4.3
  incorrect specification in svyset, 4.3
  using svycal, 4.3
  weights, 4.3
grouping strata and PSUs, 5.4.1, 5.4.4
  balanced repeated replication, 5.4.2
  in replication variance estimates, 5.4.1
  jackknife example, 5.4.1
  specifying in svyset, 5.4.4
  weight adjustments in jackknife, 5.4.1

H
Hadamard matrix, 5.4.2
hierarchical regression model in nonprobability samples, 6.4

I
ignorability of sample design in model fitting, 7.6
imputation, 1.5
inference
  Bayesian, 1.3
  design-based, 1.3
  model-assisted, 1.3
  model-based, 1.3
Internet coverage
  in European Union, 6.1
  in United States, 6.1
interpretation of model parameter estimates, 1.1
ipfraking command, 4.2

J
jackknife
  deleting PSUs, 5.4.1
  example of grouping strata and PSUs, 5.4.1
  grouping strata and PSUs, 5.4.1
  JK2 version, 5.4.1
  replicate weights, 5.4.1
  stratified, 5.4.1
JK1 jackknife, 5.4.1
JK2 jackknife, 5.4.1
JKn jackknife, 5.4.1

L
linear substitute in variance estimation, 5.3
linearization variance estimators, 5.3

M
MART, see multiple additive regression trees
measurement error in opt-in panels, 6.1
methods of sampling
  Bernoulli, 1.4
  clustering, 1.4
  equal probability, 1.4
  multistage, 1.4
  Poisson, 1.4
  probability proportionate to size, 1.4
  single-stage, 1.4
  stratification, 1.4
model fit
  boxplots of predictions, 3.3.3
  Cramér’s V, 3.2, 3.3.3
  logistic versus boost, 3.3.3
  majority vote method, 3.2, 3.3.3
  skewness of predictions, 3.3.3
model fitting
  census model, 7.6
  estimating equations, 7.6
  ignorability of sample design, 7.6
  setting weights to 1, accounting for strata and clusters, 7.6.2
  testing whether to use weights, 7.6.2
    linear regression example, 7.6.2
    logistic regression example, 7.6.2
  using weights in, 7.6
  weighted versus unweighted models, 7.6.1
    example of, 7.6.1
model parameters, estimate of, 1.1
model unbiased, 1.3
model-based weighting, 6.2
  example using svycal, 6.2
  formulas for weights, 6.2
  variance estimation, 6.3
    leverage adjustments, 6.3
models for estimators
  poststratification, 4.1
  raking, 4.2
MRP, see multilevel regression and poststratification
multilevel regression and poststratification in nonprobability samples, 6.4
multiple additive regression trees, 3.3.3
multiple weighting steps
  effects on variances, 5.5
  example using survwgt, 5.5
multiple weights, 7.2
multistage sampling, 2.1
  uses of, 1.4

N
nearest-neighbor method for sample matching, 6.2
network sampling
  chain referral,
  respondent driven,
  snowball,
nonlinear estimator, 5.3
nonprobability samples, 1.2,
  Bayesian approach, 6.4
  convenience samples,
  covariate requirements, 6.2
  hierarchical regression model, 6.4
  incorrect election polls,
  model-based weighting, 6.2
    variance estimation, 6.3
  network sampling,
  observational studies,
  opt-in web samples,
  prediction approach, 6.2
  presidential election polls,
  pseudo-inclusion probabilities, 6.2
  quasirandomization approach to weighting, 6.2
  reference samples, 6.2
  river samples,
  sample matching, , 6.2
  superpopulation modeling approach to weighting, 6.2
  types of,
  variance estimation, 6.3
    quasirandomization example, 6.3
    superpopulation model example, 6.3
  volunteer samples,
  weighting, 6.2
  Xbox sample,
nonresponse
  bias in deterministic model,
  bias in stochastic model,
  in opt-in panels, 6.1
  MCAR, MAR, NMAR, NINR,
  models for,
nonresponse adjustment, 1.7
  boosting, 3.3.3
  classification and regression trees, 3.3.1
  example of, 3.1
  example using propensity scores, 3.2
  multistage designs, 3.4
  propensity scores, 3.2
  random forests, 3.3.2
  ratio of weight sums, 3.1
  survwgt, use for, 5.5
  tree-based algorithms, 3.3
  weighting class adjustments, 3.1
  weighting classes, 3.1
nonresponse follow-up, 7.3
normalized weights, 7.1
  objections to, 7.1
  use in hierarchical linear models, 7.1

O
observational studies,
opt-in panels
  attrition, 6.1
  coverage and selection bias, 6.1
  measurement error, 6.1
  nonresponse, 6.1
  problems with, 6.1
opt-in web samples,
overcoverage, 1.1,

P
population
  finite, 1.1
  target, 1.1
population inference, theories of, 1.3
poststratification, 4.1
  misspecifying svyset, 4.1
  model for, 4.1
  saving weights with svycal, 4.1
  specifying in svyset, 4.1
  superpopulation model underlying, 4.1
  using svycal, 4.1
  using svyset, 4.1
  weights, 4.1
prediction approach, 6.2
  weights, 6.2
prediction approach to weighting, 6.2
presidential election polls,
probability proportional to size sampling, 2.1
  example of, 2.1
  measures of size, 2.1
probability sampling, 1.2
  base weights, 1.2
  definition of, 1.2
propensity scores, 3.2
  complementary log-log model, 3.2
  forming classes based on quintiles, 3.2
  logistic model, 3.2
  probit model, 3.2
public-use files
  National Health and Nutrition Examination Survey (NHANES), 5.2
  National Maternal and Infant Health Study (NMIHS), 5.4.3
  National Maternal and Infant Health Survey (NMIHS), 7.6.2
  replication variance estimation in, 5.4
  Residential Energy Consumption Survey (RECS), 5.4.2, 7.6.2
  Statistics Canada, 5.4.3
pwr estimate, 2.1

Q
quality control,
  analytic checks, 8.7
  base weights, 8.2
  data editing and file preparation, 8.3
  design and planning stage, 8.1
  disposition codes, 8.3
  documentation, 8.8
  master database, 8.2
quasirandomization approach to weighting in nonprobability samples, 6.2
quasirandomization to weighting, 6.2
quintiles for forming propensity classes, 3.2

R
raking, 4.1
  bounds on weight adjustments, 4.2
  example of, 4.2
  ipfraking, 4.2
  jackknife example using survwgt, 5.4.1
  model for, 4.2
  sreweight, 4.2
random forests, 3.3.2
random-number seed, 2.1
recovering control totals from public-use files, 4.1
  example of, 4.1
reference samples, 6.2
  combining with nonprobability sample, 6.2
replicate sample releases, 2.1
replication variance estimation, 5.4
  balanced repeated replication, 5.4.2
  bootstrap, 5.4.3
    replicate weights, 5.4.3
  convergence problems in replicates, 5.5
  grouping PSUs, 5.4.4
  in public-use files, 5.4
  JK1 jackknife, 5.4.1
  JK2 jackknife, 5.4.1
  JKn jackknife, 5.4.1, 5.5
  number of bootstrap samples, 5.4.3
  using survwgt, 5.4.2
respondent driven sampling,
response rates, decline of,
river samples,

S
sample estimates used for calibration, 4.4
sample matching,
  imputing missing values, 6.2
  in nonprobability samples, 6.2
  nearest-neighbor method, 6.2
sample releases, 2.1
sampling
  frame, 1.1
  methods of, 1.4
sandwich variance estimator, 5.3
score variable in variance estimation, 5.3
seed, random-number, 2.1
simple random sampling, 2.1
skewness of predicted response propensities, 3.3.3
snowball sampling,
sreweight command, 4.2
srswor
  example, 2.1
  fixed rate, 2.1
  fixed size, 2.1
srswr example, 2.1
Stata commands
  ipfraking, 4.2
  sample, 2.1
  samplepps, 2.1
  sreweight, 4.2
  survwgt, 4.2, 5.4, 5.5
Stata package, svr, 4.2, 5.4, 5.4.2, 5.5
Statistics Canada, 5.4.3
stochastic model for nonresponse,
stratification, uses of, 1.4
stratified simple random sampling, 2.1
  example, 2.1
superpopulation modeling approach to weighting in nonprobability samples, 6.2
survey weights, 1.1
survwgt command, 4.2
  nonresponse adjustment, 5.5
  raking example, 5.5
svyset command, 5.4.3
systematic sampling, 2.1

T
target population, 1.1
totals, estimator of, 1.5
tree-based algorithms, 3.3

U
UCLA Institute for Digital Research and Education,
ultimate cluster variance estimator
  multistage sampling, 5.2
  single-stage sampling, 5.2
undercoverage, 1.1,
unequal weighting effect, 4.5
  use in quality checks, 8.6
universe, diagram of, 1.1
unknown eligibility adjustment, 1.7, 2.2
UWE, see unequal weighting effect

V
variance estimation, 5.1
  ad hoc fpc, 5.2
  balanced repeated replication, 5.4.2
  bootstrap, 5.4.3
    Rao–Wu method, 5.4.3
    weights, 5.4.3
  degrees of freedom, 5.2
  effects of multiple weighting steps, 5.4, 5.5
    example using survwgt, 5.5
  exact formulas, 5.1
    finite population correction factors, 5.1
    simple random sampling without replacement, 5.1
  grouping PSUs, 5.4.4
    to form replicates, 5.4.1
  jackknife, 5.4.1
    example using survwgt, 5.4.1
    grouping PSUs to form replicates, 5.4.1
    replicate weights, 5.4.1
    stratified, 5.4.1
    weight adjustments, 5.4.1
  JK1 jackknife, 5.4.1
  JK2 jackknife, 5.4.1
  JKn jackknife, 5.4.1, 5.5
  linearization estimators, 5.3
    linear substitute, 5.3
    score variable, 5.3
  multistage example, 5.2
  number of bootstrap samples, 5.4.3
  one PSU per stratum design, 5.4.4
  replication estimators, 5.4
  sandwich estimator, 5.3
  simple random sampling without replacement
    proportions, 5.1
  stratified simple random sampling example, 5.2
  stratified simple random sampling without replacement, 5.1
  survwgt procedure, 5.4
  svr package, 5.4
  svyset syntax, 5.2
  ultimate cluster estimator, 5.2
  with replacement workaround, 5.2
volunteer samples,
  pseudo-inclusion probabilities, 6.2
volunteer web surveys, 6.1

W
web panel, 1.2
webographics, 6.2
weight variability, 4.5
weighting
  flowchart of steps, 1.7
  in nonprobability samples, 6.2
weighting class adjustments, 3.1
weighting versus imputation, 1.5
weights
  analytic checks, 8.7
  descriptive statistics, 1.1
  GREG using svycal, 4.3
  in model fitting, 7.6
  outlier checks, 8.6
  quality checks, 8.6
  raked weights with ipfraking, 4.2
  raked weights with sreweight, 4.2
  raked weights with svycal, 4.2
  reasons for using
    nonresponse,
    probabilities of selection,
    reduce standard errors,
    unknown eligibility,
  unequal weighting effect, 8.6
  use of, 1.5
with-replacement sampling, 2.1

X
Xbox nonprobability sample,