Big Data Meets Survey Science

A Collection of Innovative Methods

Edited by
Craig A. Hill
Paul P. Biemer
Trent D. Buskirk
Lilli Japec
Antje Kirchner
Stas Kolenikov
Lars E. Lyberg

This edition first published 2021
© 2021 John Wiley & Sons

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Craig A. Hill, Paul P. Biemer, Trent D. Buskirk, Lilli Japec, Antje Kirchner, Stas Kolenikov, and Lars E. Lyberg to be identified as the authors of the editorial material in this work has been asserted in accordance with law.

Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office
111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and
authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Hill, Craig A., 1956- editor. | Biemer, Paul P., editor. | Buskirk, Trent D., 1971- editor. | Japec, Lilli, 1966- editor. | Kirchner, Antje, 1982- editor. | Kolenikov, Stanislav, editor. | Lyberg, Lars, editor. | John Wiley & Sons, publisher.
Title: Big data meets survey science : a collection of innovative methods / edited by Craig A. Hill, Paul P. Biemer, Trent D. Buskirk, Lilli Japec, Antje Kirchner, Stas Kolenikov, Lars Erik Lyberg.
Description: Hoboken, NJ : John Wiley & Sons, Inc., 2021. | Includes index.
Identifiers: LCCN 2020022652 (print) | LCCN 2020022653 (ebook) | ISBN 9781118976326 (cloth) | ISBN 9781118976333 (adobe pdf) | ISBN 9781118976340 (epub)
Subjects: LCSH: Big data. | Social surveys–Methodology.
Classification: LCC QA76.9.B45 B5563 2021 (print) | LCC QA76.9.B45 (ebook) | DDC 001.4/33028557–dc23
LC record available at https://lccn.loc.gov/2020022652
LC ebook record available at https://lccn.loc.gov/2020022653

Cover Design: Wiley
Cover Image: Courtesy of RTI International

Set in 9.5/12.5pt STIXTwoText by SPi Global, Chennai, India

Printed in the United States of America

Contents

List of Contributors xxiii

Introduction
Craig A. Hill, Paul P. Biemer, Trent D. Buskirk, Lilli Japec, Antje Kirchner, Stas Kolenikov, and Lars E. Lyberg
Acknowledgments
References

Section 1 The New Survey Landscape

1 Why Machines Matter for Survey and Social Science Researchers: Exploring Applications of Machine Learning Methods for Design, Data Collection, and Analysis 11
Trent D. Buskirk and Antje Kirchner
1.1 Introduction 11
1.2 Overview of Machine Learning Methods and Their Evaluation 13
1.3 Creating Sample Designs and Constructing Sampling Frames Using Machine Learning Methods 16
1.3.1 Sample Design Creation 16
1.3.2 Sample Frame Construction 18
1.3.3 Considerations and Implications for Applying Machine Learning Methods for Creating Sampling Frames and Designs 20
1.3.3.1 Considerations About Algorithmic Optimization 20
1.3.3.2 Implications About Machine Learning Model Error 21
1.3.3.3 Data Type Considerations and Implications About Data Errors 22
1.4 Questionnaire Design and Evaluation Using Machine Learning Methods 23
1.4.1 Question Wording 24
1.4.2 Evaluation and Testing 26
1.4.3 Instrumentation and Interviewer Training 27
1.4.4 Alternative Data Sources 28
1.5 Survey Recruitment and Data Collection Using Machine Learning Methods 28
1.5.1 Monitoring and Interviewer Falsification 29
1.5.2 Responsive and Adaptive Designs 29
1.6 Survey Data Coding and Processing Using Machine Learning Methods 33
1.6.1 Coding Unstructured Text 33
1.6.2 Data Validation and Editing 35
1.6.3 Imputation 35
1.6.4 Record Linkage and Duplicate Detection 36
1.7 Sample Weighting and Survey Adjustments Using Machine Learning Methods 37
1.7.1 Propensity Score Estimation 37
1.7.2 Sample Matching 41
1.8 Survey Data Analysis and Estimation Using Machine Learning Methods 43
1.8.1 Gaining Insights Among Survey Variables 44
1.8.2 Adapting Machine Learning Methods to the Survey Setting 45
1.8.3 Leveraging Machine Learning Algorithms for Finite Population Inference 46
1.9 Discussion and Conclusions 47
References 48
Further Reading 60

2 The Future Is Now: How Surveys Can Harness Social Media to Address Twenty-first Century Challenges 63
Amelia Burke-Garcia, Brad Edwards, and Ting Yan
2.1 Introduction 63
2.2 New Ways of Thinking About Survey Research 67
2.3 The Challenge with … Sampling People 67
2.3.1 The Social Media Opportunities 68
2.3.1.1 Venue-Based, Time-Space Sampling 68
2.3.1.2 Respondent-Driven Sampling 70
2.3.2 Outstanding Challenges 71
2.4 The Challenge with … Identifying People 72
2.4.1 The Social Media Opportunity 73
2.4.2 Outstanding Challenges 73
2.5 The Challenge with … Reaching People 74
2.5.1 The Social Media Opportunities 75
2.5.1.1 Tracing 75
2.5.1.2 Paid Social Media Advertising 76
2.5.2 Outstanding Challenges 77
2.6 The Challenge with … Persuading People to Participate 77
2.6.1 The Social Media Opportunities 78
2.6.1.1 Paid Social Media Advertising 78
2.6.1.2 Online Influencers 79
2.6.2 Outstanding Challenges 80
2.7 The Challenge with … Interviewing People 81
2.7.1 Social Media Opportunities 82
2.7.1.1 Passive Social Media Data Mining 82
2.7.1.2 Active Data Collection 83
2.7.2 Outstanding Challenges 84
2.8 Conclusion 87
References 89

3 Linking Survey Data with Commercial or Administrative Data for Data Quality Assessment 99
A. Rupa Datta, Gabriel Ugarte, and Dean Resnick
3.1 Introduction 99
3.2 Thinking About Quality Features of Analytic Data Sources 101
3.2.1 What Is the Purpose of the Data Linkage? 101
3.2.2 What Kind of Data Linkage for What Analytic Purpose? 102
3.3 Data Used in This Chapter 104
3.3.1 NSECE Household Survey 104
3.3.2 Proprietary Research Files from Zillow 105
3.3.3 Linking the NSECE Household Survey with Zillow Proprietary Datafiles 107
3.3.3.1 Nonuniqueness of Matches 107
3.3.3.2 Misalignment of Units of Observation 110
3.3.3.3 Ability to Identify Matches 110
3.3.3.4 Identifying Matches 112
3.3.3.5 Implications of the Linking Process for Intended Analyses 114
3.4 Assessment of Data Quality Using the Linked File 116
3.4.1 What Variables in the Zillow Datafile Are Most Appropriate for Use in Substantive Analyses Linked to Survey Data? 116
3.4.2 How Did Different Steps in the Survey Administration Process Contribute to Representativeness of the NSECE Survey Data? 119
3.4.3 How Well Does the Linked Datafile Represent the Overall NSECE Dataset (Including Unlinked Records)? 123
3.5 Conclusion 125
References 127
Further Reading 129

Section 2 Total Error and Data Quality 131

4 Total Error Frameworks for Found Data 133
Paul P. Biemer and Ashley Amaya
4.1 Introduction 133
4.2 Data Integration and Estimation 134
4.2.1 Source Datasets 135
4.2.2 The Integration Process 137
4.2.3 Unified Dataset 137
4.3 Errors in Datasets 138
4.4 Errors in Hybrid Estimates 141
4.4.1 Error-Generating Processes 141
4.4.2 Components of Bias, Variance, and Mean Squared Error 145
4.4.3 Illustrations 148
4.4.4 Error Mitigation 153
4.4.4.1 Sample Recruitment Error 153
4.4.4.2 Data Encoding Error 156
4.5 Other Error Frameworks 156
4.6 Summary and Conclusions 158
References 160

5 Measuring the Strength of Attitudes in Social Media Data 163
Ashley Amaya, Ruben Bach, Frauke Kreuter, and Florian Keusch
5.1 Introduction 163
5.2 Methods 165
5.2.1 Data 165
5.2.1.1 European Social Survey Data 166
5.2.1.2 Reddit 2016 Data 167
5.2.1.3 Reddit Survey 169
5.2.1.4 Reddit 2018 Data 169
5.2.2 Analysis 170
5.2.2.1 Missingness 171
5.2.2.2 Measurement 173
5.2.2.3 Coding 173
5.3 Results 174
5.3.1 Overall Comparisons 174
5.3.2 Missingness 175
5.3.3 Measurement 177
5.3.4 Coding 178
5.4 Summary 180
5.A 2016 German ESS Questions Used in Analysis 184
5.B Search Terms Used to Identify Topics in Reddit Posts (2016 and 2018) 186
5.B.1 Political Ideology 186
5.B.2 Interest in Politics 186
5.B.3 Gay Rights 186
5.B.4 EU 187
5.B.5 Immigration 187
5.B.6 Climate 187
5.C Example of Coding Steps Used to Identify Topics and Assign Sentiment in Reddit Submissions (2016 and 2018) 188
References 189

6 Attention to Campaign Events: Do Twitter and Self-Report Metrics Tell the Same Story? 193
Josh Pasek, Lisa O. Singh, Yifang Wei, Stuart N. Soroka, Jonathan M. Ladd, Michael W. Traugott, Ceren Budak, Leticia Bode, and Frank Newport
6.1 What Can Social Media Tell Us About Social Phenomena? 193
6.2 The Empirical Evidence to Date 195
6.3 Tweets as Public Attention 196
6.4 Data Sources 197
6.5 Event Detection 198
6.6 Did Events Peak at the Same Time Across Data Streams? 204
6.7 Were Event Words Equally Prominent Across Data Streams? 205
6.8 Were Event Terms Similarly Associated with Particular Candidates? 206
6.9 Were Event Trends Similar Across Data Streams? 207
6.10 Unpacking Differences Between Samples 211
6.11 Conclusion 212
References 213

7 Improving Quality of Administrative Data: A Case Study with FBI’s National Incident-Based Reporting System Data 217
Dan Liao, Marcus E. Berzofsky, G. Lance Couzens, Ian Thomas, and Alexia Cooper
7.1 Introduction 217
7.2 The NIBRS Database 220
7.2.1 Administrative Crime Statistics and the History of NIBRS Data 220
7.2.2 Construction of the NIBRS Dataset 221
7.3 Data Quality Improvement Based on the Total Error Framework 222
7.3.1 Data Quality Assessment for Using Row–Column–Cell Framework 224
7.3.1.1 Phase I: Evaluating Each Data Table 224
7.3.1.2 Row Errors 225
7.3.1.3 Column Errors 226
7.3.1.4 Cell Errors 226
7.3.1.5 Row–Column–Cell Errors Impacting NIBRS 227
7.3.1.6 Phase II: Evaluating the Integrated Data 227
7.3.1.7 Errors in Data Integration Process 227
7.3.1.8 Coverage Errors Due to Nonreporting Agencies 228
7.3.1.9 Nonresponse Errors in the Incident Data Table Due to Unreported Incident Reports 229
7.3.1.10 Invalid, Unknown, and Missing Values Within the Incident Reports 230
7.3.2 Improving Data Quality via Sampling, Weighting, and Imputation 231
7.3.2.1 Sample-Based Method to Improve Data Representativeness at the Agency Level 231
7.3.2.2 Statistical Weighting to Adjust for Coverage Errors at the Agency Level 232
7.3.2.3 Imputation to Compensate for Unreported Incidents and Missing Values in the Incident Reports 233
7.4 Utilizing External Data Sources in Improving Data Quality of the Administrative Data 234
7.4.1 Understanding the External Data Sources 234
7.4.1.1 Data Quality Assessment of External Data Sources 234
7.4.1.2 Producing Population Counts at the Agency Level Through Auxiliary Data 235
7.4.2 Administrative vs. Survey Data for Crime Statistics 236
7.4.3 A Pilot Study on Crime in the Bakken Region 238
7.5 Summary and Future Work 239
References 241

Index

Energy Information Agency (EIA) 148
Enhanced Frame
  address-level prediction, of boat ownership
    data collection results 530
    limitation 533
    oversampling 532
    propensities by state 527, 529
    registered boat households flagged by boat registry vendor (InfoLink) 525–530
    screening and eligibility rate 530, 531
    spatial boat density model 525–526
  CDS component of 521
  composite registry frame 522–523
  data obstacles 523–524
  and target population 522
errors
  administrative records 134
  cell errors 226–227
  column errors 226
  coverage errors 228–229
  data encoding process 146
  data-generating mechanism 134
  data generation process 157
  in data integration process 227–228
  in datasets
    cell error 140–141
    column error 140
    harmonization process 138
    row/column/cell model 138
    row error 139
  error-generating processes
    data encoding error (DEE) 142
    data encoding process 144
    data processing error 145
    dataset error framework 145
    decompositions 143
    gross domestic product (GDP) 142
    invoice value 142
    measurement error 145
    noncoverage error 144
    nonprobability datasets 144
    populations and sample recruitment systems 144
    population unit 143
    recruitment process 143
    sample recruitment error (SRE) 142
    sample recruitment process 143
    specification error 145
    statistical adjustment errors 145
    statistical value 142
    statistical value vs. invoice value 145
    survey noncontacts 144
  error mitigation
    data encoding error 156
    sample recruitment error 153–156
  frame deficiencies 134
  identification and treatment
  illustration
    CAPI response rate 148
    DEE bias 149
    interviewer measurements 149
    RECS square footage data 149–150
    RelMSE of survey and Zillow data 151, 152
    square footage of housing units 149
    Zillow data 149
  integrated-data movement 134
  mean squared error (MSE) 145, 150, 233
  measurement errors 173, 344–345
  nonresponse errors 229–230
  nontraditional data sources 134
  observation and nonobservation error 157
  probability samples 146
  row/column/cell framework 157
  row errors 225
  sample recruitment mechanism 146
  simple random sampling 146
  specification, measurement, and data processing error 157
eScience 713
  challenges for 720
EU General Data Privacy Regulation 401
European Statistical System’s ESSnet on Big Data
event-level files 581–582

f

Facebook 65
FAIR Data Principles 723
fast-track MEPS imputation strategy
  applied to 2014 MEPS data 587–590
  attribute selection 582–584
  CART models 579
  decision trees 579
  effectiveness 585, 586
  health-care costs
    algorithmic prediction 577
    CARTs and random forests 575
    machine learning approach 578–579
    modeling challenges 577
  high-level research workflow 580, 581
  inter-variable correlation 584
  multi-output random forest 584–585
  person-level medical expenditures, estimates of 587
  probability distribution 579
  raw data extraction 581–582
  testing of 580
  WSHD method 579
financial loss, threat of 706
“Fit-for-purpose” paradigm
FixMyStreet 488
floating car data (FCD) 307
focus groups 689
  goal of 685
four paradigms of science 713, 714
fourth paradigm science, power of 720
Framingham Health Study 100

g

Gallup Daily Tracking Survey 684, 685
  data sharing in federal government
    administrative data/record sharing 701–704
    community harm 707–708
    government vs. private company 704, 705
    individual harm 706–707
Geographic information system (GIS)
  grid frames
Geo-Referenced Infrastructure and Demographic Data for Development (GRID3) 604–605
geosampling 609–610
German ESS Questions 184–185
German Panel Study Labour Market and Social Security (PASS) 393
German voucher flyer 408, 409
Gini coefficient 585
Google Street View (GSV) images 436
  in Baltimore
    actual vs. fitted affect balance maps 480
    actual vs. fitted health maps 477, 478
    actual vs. fitted life evaluation scores 453, 479
    Bonferroni outlier test 454–455
    cross-validation results 457
    estimated model fit 452
    generalized linear models via penalized maximum likelihood with k-fold cross-validation 472–474
    heat maps 453
    Moran’s I calculation 456
    Pearson correlation 451, 467–468
    pictures and maps 461–463
    predictor variables, boxplots for 464–466
    stepwise AIC OLS regression models 469–470
  data sources
    BRFSS surveys 438–439
    GDT surveys 438
    health-related quality of life, components of 438–439
    survey data study outcomes 438–440
  model development, testing, and evaluation 450–451
  predictors from built environment data
    image labeling process 441–443, 445
    image sampling 441–444
    quality control 445, 447–449
  predictors from Geospatial Imagery
    land use 448
    TGI 447–448
    tract size 447
  sampling units and frames 437–438
  in San Francisco
    actual vs. fitted affect balance 484
    actual vs. fitted health maps 481, 482
    actual vs. fitted life evaluation 483
    estimated model fit 455
    generalized linear models via penalized maximum likelihood with k-fold cross-validation 475–476
    Moran’s I calculation 456
    stepwise AIC OLS regression models 471–472
government statistical programs 73
Gray, Jim 713
gridded population estimates 598, 599
  areal weighting and basic dasymetric methods 601
  challenges 605–606
  data sources 600–601
  GRID3 604–605
  LandScan Global 601–602
  LandScan HD 603–604
  in Nigeria 613–616
  population sampling
    from km × km grid cells 609–610
    from 100 m × 100 m grid cells 611–613
  pros and cons of 606, 607
  in surveys
    implementation of 613
    population sampling 609–613
    standard sampling strategies 608–609
  WorldPop data 602–603
gridEA algorithm 612
GridSample2.0 611–613
GridSample R package 611
gross domestic product (GDP) 142
Groves, Robert 720

h

harmful decision-making based on data 707
harmonization process 138
hashtag 65
hierarchical regression modeling 493
high-throughput computing 721
home detection algorithms (HDA)
  call detailed records data 250
  correlation with ground truth data 256–258
  counts and population counts 260
  data and setup 255
  data packages 250
  distinct days criterion (DD) 253
  ego-network of contacts 251
  French mobile phone dataset 251–252
  ground truth data 256
  maximum amount of activities criterion (MA) 253
  network operators 251
  observation periods 253–255
  performance and sensitivities 256
  potential records 250
  ratio and spatial patterns 258
  sensitivity to criteria choice 266–267
  sensitivity to duration of observation 266
  temporality and sensitivity 258
  temporality of correlations 260, 262–265
  time constraints criterion (TC) 253
  user counts and ground truth 258–260
Horvitz–Thompson estimator 494

i

incentives
  smartphone data collection 390–392
Instagram 65
installation brochure 394
Institut für Arbeitsmarkt- und Berufsforschung (IAB)-SMART study
  experimental design 397
  incentives effect of 397–398
  invitation and data request 394–396
  sampling frame and restrictions 393–394
  selection process 394
interaction history records 394
Interactive Graduate Education and Research Traineeship 728
internet of things 11, 64
intricate variables 73
Inverse Probability Bootstrapping 493
invoice value 142

j

Jewish Community Survey of Metropolitan Detroit (JCSMD) 541, 547, 549, 551, 555–556
just-in-time adaptive interventions (JITAIs) 30

l

LandScan Global 601–602
LandScan HD 603–604
large-scale survey estimation see Medical Expenditure Panel Survey (MEPS)
law enforcement agencies (LEAs) 219
Least Absolute Shrinkage and Selection Operator (LASSO) 14, 31, 45, 417, 419, 420, 422–427, 452, 455, 493
leverage-saliency theory 390
linear regression model 347
list frames 519
location information 396
logistic regression 120
long tail and data volume 719
lot-level analyses 122
low- and middle-income countries (LMICs) 597
  gridded population estimates 598, 599
    areal weighting and basic dasymetric methods 601
    challenges 605–606
    data sources 600–601
    GRID3 604–605
    LandScan Global 601–602
    LandScan HD 603–604
    in Nigeria 613–616
    pros and cons of 606, 607
    in surveys 608–613
    WorldPop data 602–603

m

machine learning methods (MLMs)
  algorithmic optimization
    clustering algorithms 21
    data collection cost constraints 21
    final segmentation 21
  alternative data sources 28
  confusion matrices 16
  cross-validation 15
  data error
    Big Data sources 23
    categorical and continuous variables 22
    density-based spatial clustering of applications with noise (DBSCAN) 23
    k-nearest neighbors algorithm 22
    k-prototypes clustering algorithm 22
    sampling designs 22, 23
    scaling binary or nominal variables 22
  explanation vs. prediction 15
  explanatory models 15
  hierarchical cluster analysis (HCA) 14
  high-dimensional data 14
  instrumentation and interviewer training 27–28
  k-means clustering 14
  longitudinal settings 31
  model error
    data collection costs 22
    false positive misclassification error 21
    misclassification errors 21
  ordinary least squares regression 14
  prediction or classification models 15
  predictive accuracy 16
  question wording
    Cox survival model 25
    evaluation and testing 26–27
    Markov chains 25
    recurrent neural networks (RNN) 25
    Survey Quality Predictor (SQP) 24
  reimagining traditional survey research 12
  sample design development
    ANOVA 17
    cross-validation information 18
    Gaussian mixture models (GMMs) 16
    k-means clustering 16, 17
    random digit-dial (RDD) sample 17
    unsupervised learning methods 16
  sample frame construction
    convolutional neural networks (CNNs) 19
    gridded samples 18
    human coders 19
    neural networks model 20
    object recognition tasks 19
    primary grid units (PGUs) 18
    residential units 19
    sample partitions/subsets 18
    sampling frame 19, 20
    sampling units 20
    secondary grid units (SGUs) 18
    two-category classification 18
    uniform record locator (URL) 18
  sample weighting and survey adjustments
    propensity score estimation 37–41
    sample matching 41–43
  specification of hyperparameters 14
  supervised and unsupervised machine learning technique 14
  survey data analysis and estimation
    continuous and categorical auxiliary data 45
    finite population inference 46–47
    fuzzy forests method 44
    LASSO and adaptive LASSO approaches 45
    model-assisted approaches 45
    poststratification adjustment cells 45
    probability-based RDD survey 45
  survey data coding and processing
    coding unstructured text 32–35
    data validation and editing 35
    imputation 35–36
    record linkage and duplicate detection 36
  survey recruitment and data collection
    monitoring and interviewer falsification 29
    responsive and adaptive designs 29–32
  validation sample 15
Markov chains (MC) 25
mean absolute bias (MAB) 553–554
mean squared error (MSE) 145, 150, 233
measurement error 145, 173, 344–345
Medical Expenditure Panel Survey (MEPS)
  description 561–562
  expenditures, defined 582
  imputation processes (see also fast-track MEPS imputation strategy)
    Agency for Healthcare Research and Quality 563
    class variables 571–572
    cost and coverage detail 564
    data files and variables 566–567
    evaluation phases 565
    means and standard errors, of medical expenditures 574, 576
    medical payments, predictors identification of 567–571
    MEPS-HC 564–565
    MEPS-IC 564
    MEPS-MPC 565
    quality assessment 572, 573
    results 573–575
    WSHD procedure 565–566, 572, 579
  objectives/goals 563
Medical Expenditure Panel Survey-Household Component (MEPS-HC) 564–565
Medical Expenditure Panel Survey-Insurance Component (MEPS-IC) 564
Medical Expenditure Panel Survey-Medical Provider Component (MEPS-MPC) 565
missingness 171–173
missing-not-at-random (MNAR) 346
mobile data collection
  active 658, 670–671
  passive 658, 670–672
  privacy concerns 660
  research concerns 659–661
  second-level digital divide 661
  web surveys
    and age influence 670
    analysis plan 664–665
    Austrian statistical office 663
    average marginal effects 667–668
    confidence intervals 667–668
    demographic control variables 668, 670
    descriptive statistics for the responses 664–665
    German Internet Panel 662
    German nonprobability online panel 662
    multiple logistic regression estimates and standard errors 675–678
    outcome variables 663
    questions on concern 663–664
    respondents reporting concern, percentage of 666–667, 669
    results from logistic regression models 667–668
  willingness rate for research tasks 659, 671
mobile phone data
  home detection algorithms (HDA)
    call detailed records data 250
    correlation with ground truth data 256–258
    counts and population counts 260
    data and setup 255
    data packages 250
    distinct days criterion (DD) 253
    ego-network of contacts 251
    French mobile phone dataset 251–252
    ground truth data 256
    maximum amount of activities criterion (MA) 253
    network operators 251
    observation periods 253–255
    performance and sensitivities 256
    potential records 250
    ratio and spatial patterns 258
    sensitivity to criteria choice 266–267
    sensitivity to duration of observation 266
    temporality and sensitivity 258
    temporality of correlations 260, 262–265
    time constraints criterion (TC) 253
    user counts and ground truth 258–260
  home detection problem
    CDR data 247
    census data 248
    consequence 249
    customer-related information 248
    decent validation data 248
    high-level validation 248
    in-depth investigations 248
    individual user’s level 248
    mobile phone indicators and socioeconomic indicators 248
    validity and sensitivities 249
  official statistics
    analytical methods 246
    Big Data applications 247
    call detailed record (CDR) data 245
    home detection 247
    home detection algorithms (HDAs) 247
    socioeconomic indicators 246
    socioeconomic information 247
    user traces 245
Monthly Retail Trade Survey (MRTS) 362
  descriptive statistics 373
  high-burden retailers 370
  national-level data 371–372
  nonrespondents 370
  reporting history 370
MTurk 458
  labeling, inter-rater agreement and reliability 445, 446
multi-armed bandits (MABs) 30
multilevel regression and poststratification (MRP) method 493
multi-output random forest 584–585
multiple frame surveys 523
multitask learning (MTL) problem 561

n

Naïve Bayesian classifiers 28
National Incident-Based Reporting System (NIBRS) database 219
National Institutes of Health (NIH)
  data science challenges for 638
national-level data 370
National Recreational Boating Safety Survey (NRBSS) 519
  registry frame 520, 521
  RTI’s Enhanced Frame (see Enhanced Frame)
  target population 519–520, 522
natural language processing 7, 27, 28, 32–34
NCES CCD Local Education Agency (District) Universe Survey 634
NEEDs2 project, Big Data
  administrative data 634, 643–644
  behavioral assessment practices 643, 645–646
  creating master databases for 635
  data harmonization 637
  dataset creation and maintenance, perspective on 636
  merging data sources 638
  quality dimensions 648–649
  research questions, for regular school and component districts 638, 639
  social, emotional, and behavioral health screening 633
  socioeconomic status variables/proxies 636, 637
  survey data 633–634
  syntax/code and data dictionaries 639
  transparency 641–642
network quality 396
noncoverage error 144
nonparametric bootstrap technique 488
  crowdsourced data
    area-level EBLUP 496
    bootstrap algorithm steps 495
    Horvitz–Thompson estimator 494
    pseudo-sampling weights 494–495
    SSRSWR 494
nonprobability datasets 144
nonprobability sample
  missing-at-random assumptions 340
  statistical inference, binary variables
    breakdown of sample size 342
    social media sources 341
    target population 342, 343
nonprobability surveys 71
nonresponse errors 229–230
nonresponse model 349
nonsurvey data
  augmenting traditional survey data 12
“norm of reciprocity” 390
North American Industry Classification System (NAICS) 361
NPD Group, Inc. (NPD) 368–369, 372–373
NSECE Household Survey
  CoreLogic data 126
  eligibility 105
  government program participation records 105
  large-scale US surveys 105
  overall weighted response rate 105
  parental consents 105
  policy and practice data 105
  two-stage probability design 105
  Zillow proprietary datafiles
    identify matches 110–114
    intended analyses 114–116
    misalignment of units of observation 110
    nonuniqueness of matches 107–109

o

official statistics 133, 359
  applications 350–353
  data integration 340
  empirical studies 347–350
  integrating data source
    measurement errors 344–345
    undercoverage bias 343–344
  integrating probability sample 345–347
  limitation 353
  nonprobability sample (see nonprobability sample)
  probability sample 339
  public and private decisions 339
online influencers 79–80
open-ended questions 27, 199
overfitting 721

p

paid social media advertising 76–77
Panel Study Labour Market and Social Security (PASS) 393
Paperwork Reduction Act 362
paradata 65, 635
participatory mapping 489
PASS see Panel Study Labour Market and Social Security (PASS)
passive mobile data collection 658, 670–672 see also mobile data collection; smartphones
Pearson’s correlation coefficients 207
personal safety, threat of 707
person-level files 581
petabytes of data
𝜋-shaped researcher 728
point-of-sale data 359, 382
political campaigns 210
population density tables (PDT) 604
primary sampling units (PSUs) 38
privacy (data) 684, 695–697
  community harm 707–708
  individual harm 706–707
Privacy Act Cognitive Test 690
privacy loss, threat of 707
privacy-utility trade-off and data collection 659
private online community 86, 87
probability sample 146, 339
probability surveys 71
product data 377–381
product-level data 370
propensity score adjustment 493
propensity score estimation
  Bayesian Adaptive Regression Trees 40
  chi-square automatic interaction detector (CHAID) models 38
  direct and stratification approaches 39
  Laplace correction 41
  occupational employment statistics survey 38
  primary sampling units (PSUs) 38
  survey researchers 38
pseudo-sampling weights 493
public agenda 196
public attention assumption 197
public opinion 193
public responses 194

q

qualitative data 684, 685
qualitative survey activities 66
quantitative data 684
QuickCharmStats 640–641

r

random-digit-dial (RDD) list-assisted landline interviewing 685
random-digit-dial (RDD) wireless phone sampling 685
random forest (RF) 561, 575
random forest models 538, 543, 544, 548, 555
recurrent neural networks (RNN) 25
recursive partitioning algorithms 538
Reddit data 167–170, 186–188
Reddit survey 169, 181
registration-based sample (RBS)
religious flags, for survey research 537
  challenge in combining large datasets 541
  data match rates 542
  data source 540–541
  downsampling sample sizes 544
  JCSMD data 541, 547, 549, 551, 555–556
  L2 dataset 545, 546, 550, 555–556
  modeling decisions 543–544
  research agenda 539–540
  results 545–552
  sensitivity 539, 540, 547–549
  specificity 539, 540, 547
  SSRS omnibus 540–541
  systematic matching rates 552–554
regularization networks 27
Re-Identification Survey Cognitive Test 690
  data sharing in federal government 706–708
reliability of administrative data 632
reproducibility 630–632, 722–723
  CharmStatsPro 641
  QuickCharmStats platform 640–641
  R markdown 640
  syntax/code and data dictionaries 639
  variable notes (Stata users) 640
Residential Energy Consumption Survey (RECS) 148
ResNet18 models 436
respondent burden
  business’s decision 365
  business surveys 363
  data collection and data sharing 364
  Economic Census 364, 365
  low response rates 365
  MRTS 364
  point-of-sale data
    brick and mortar retail stores and online 366
    confidentiality agreements 366
    credit card data/payment processor data 366
    data lakes 367
    IBM 366
    monthly datasets 367
    scanner data 366
    statistical agencies 367
respondent-driven sampling (RDS) 70–71
  and advertising 67
Respondent Messaging 689–690
restricted maximum likelihood (REML) method 496
retailer datasets 368–369
retention, app 402–403
reuse of scholarly data 722
R markdown file 640
row error 139, 225, 227, 236, 237

s

sample matching 493
  curse of dimensionality 41
  demographic and political variables 42
  nonprobability samples 43
  proximity measures 41–43
  random forest models 42
  supervised vs. unsupervised applications of random forest models 42–43
  synthetic sample 42
  variables 42
sample recruitment error 142, 153–156
sample recruitment mechanism 146
scanner data 305–306
second-level digital divide 661
selection bias, in web-surveys 493
Shapiro–Wilk test 507
simple random sampling 146
small area estimation techniques 512
smart devices 11
smart meter data 289–290
smartphone data collection
  activated data-sharing functions 400–401
  app installation 398–400
  costs analysis 403–405
  deactivating functions 401–402
  future research 407–408
  influence of incentives 390–392
  Institut für Arbeitsmarkt- und Berufsforschung (IAB)
    experimental design 397
    incentives effect of 397–398
    invitation and data request 394–396
    sampling frame and restrictions 393–394
  limitations 407–408
  retention 402–403
smartphones 657
  activities 674
  for data collection (see also mobile data collection)
    active 658, 670–671
    passive 658, 670–672
    and second-level digital divide 661
  and education levels 661
  frequency of usage 673
  and privacy 658, 660, 674
  skills, rating of 673
smartphone sensor data 389
smartphone usage 396
social expressions 194
social media
  administrative procedures 65
  communication 63
  hashtag 65
  identifying people
    government statistical programs 73
    intricate variables 73
    outstanding challenges 73–74
    publicly key demographic variables 73
    screening process 73
  internet of things 64
  interviewing people
    active data collection 83–84
    passive social media data mining 82–83
    private online community 86, 87
    Twitter’s API 84–85
  locating people
    ad blockers 77
    paid social media advertising 76–77
    tracing 75–76
  mass migration and population displacement 63
  paradata 65
persuasion aid social media advertising 78–79 online influencers 79–80 paid social media campaign 80 practical and ethical considerations 66 qualitative survey activities 66 respondent-driven sampling (RDS) and advertising 67 sample people outstanding challenges 71–72 respondent-driven sampling 70–71 venue-based, time-space sampling (VBTS) 68–70 slang/code terms 65 social networking sites, active users 64 survey research typology 88 survey responses 65 unstructured and messy 65 user knowledge of consent 66 social media data attitude distributions 180 coding 173–174, 179–180 coding error 165 community of social media users 194 contemporary news event 196 data-generating process 194, 195 data sources 197–198 data streams event-related terms 207 event-related words 206 event-style terms 205 open-ended survey responses 209 Pearson’s correlation coefficients 207 political campaigns 210 Trump vs Clinton event–related words 208–209 Twitter and survey maximums 205 Twitter data stream 205 decompose error 164 dichotomous indicators 164 European Social Survey Data 166–167 event detection distinct events 199 events and attention metrics 200–203 open-ended questions 199 restrictive parameters 199 text preprocessing 199 forecast election outcomes 195 key research decisions 195 measurement error 173, 177–179 misestimated flu prevalence 163 missingness 171–173, 175–177 official statistics 164 public attitudes and behaviors 194 public responses 194 Reddit data 167–170, 186–187 Reddit survey 169, 181 research objective 171 sentiment models 182 social expressions 194 specification error and traditional measurement error 165 survey questions 195 survey respondents and Twitter users data-generating processes 212 event-related words 212 Total Survey Error (TSE) framework 165 t-tests 171 tweets measure of attention 196, 197 public agenda 196 public attention assumption 197 public concern 196 tweets-as-attention model 196 Index Twitter data 163 social media sentiment index (SMI) 319 
social network function 394 social science research social sciences 715, 723 socioeconomic indicators 246 spatial boat density model 525–526 spatial outliers 493 Spearman correlation 584 specification error 145 SSRS omnibus 540–541 Stanford Education Data Archive (SEDA) 634 Stata, for data management 640 statistical agencies statistical analysis 726 statistical disclosure control (SDC) 325 variety 326–327 velocity 326 volume 326 statistical estimation store-level data 370, 375–377 stratified simple random samples with replacement (SSRSWR) 494 subjective well-being (SWB) 435 assessment (see Google Street View (GSV) images) Gallup computed model-based small-area estimation 439–440 and health outcomes 436 supervisory unions (SUs) 638, 639 survey data 12, 304, 319, 628 administrative data linkage 102 analysis and estimation continuous and categorical auxiliary data 45 finite population inference 46–47 fuzzy forests method 44 LASSO and adaptive LASSO approaches 45 model-assisted approaches 45 poststratification adjustment cells 45 probability-based RDD survey 45 coding and processing coding unstructured text 32–35 data validation and editing 35 imputation 35–36 record linkage and duplicate detection 36 costs for survey data collection NEEDs2 project, Big Data 633–634 study outcomes, GSV images 438–440 traditional data collection survey landscape Big Data centers 282–283 European NSIs 282 experimental statistics 283 forming partnerships 281–282 IT infrastructure, tools, and methods 284 organizing hackathons 283–284 training staff 281 survey questions 195 survey research 537 agenda 539–540 challenge in combining large datasets 541 data match rates 542 data source 540–541 downsampling sample sizes 544 JCSMD 541, 547, 549, 551, 555–556 L2 dataset 555–556 L2 religion variable, incidence and coverage 545, 546 variables 545, 550 modeling decisions 543–544 results 545–552 sensitivity 539, 540, 547–549 specificity 539, 540, 547 751 752 Index survey research (contd.) 
SSRS omnibus 540–541 systematic matching rates 552–554 survey research techniques 715 survey responses 65 Sysomos firehose access tool 198 system-to-system data collection 361 t Taylor, John 713 technology-related issues test-time feature acquisition 27 text preprocessing 199 theoretical science 715 The Third Wave i, 717 time-series methods 317 Toffler, Alvin 717 topic modelling 34–35 tracing 75–76 traditional survey data collection transparency, NEEDs2 project 641–642 Triangular Greenness Index (TGI) 447 t-tests 123, 171 tweets-as-attention model 196 Twitter 65 u undercoverage bias 343–344 UNECE Big Data project unemployment insurance (UI) records 102 The United Nations Economic Commission for Europe (UNECE) 217 United States Federal Statistical System (USFSS) 683 unsupervised techniques 36 user traces 245 US Federal Statistical System 278 US National Science Foundation 728 v validity of administrative data 632, 644 variable notes (Stata users) 640 venue-based, time-space sampling (VBTS) advertisements 70 asynchronous communication and interaction 69 Facebook-recruited men 70 Facebook sampling 70 venue-date-time units 70 web-based approaches 70 Volunteered Geographical Information (VGI) 489, 490 w web-based system 350 web scraping 361 web surveys, in mobile data collection and age influence 670 analysis plan 664–665 Austrian statistical office 663 average marginal effects 667–668 confidence intervals 667–668 demographic control variables 668, 670 descriptive statistics for the responses 664–665 German Internet Panel 662 German nonprobability online panel 662 multiple logistic regression estimates and standard errors 675–678 Index outcome variables 663 questions on concern 663–664 respondents reporting concern, percentage of 666–667, 669 results from logistic regression models 667–668 WeChat 65 weighted sequential hot deck (WSHD) imputation procedure 565–566, 572, 579 WorldPop data 602–603 WSHD procedure see weighted sequential hot deck (WSHD) imputation procedure 
... Survey Data with Commercial or Administrative Data for Data Quality Assessment 99 A Rupa Datta, Gabriel Ugarte, and Dean Resnick Introduction 99 Thinking About Quality Features of Analytic Data. .. that survey data are considered Big Data Integrating administrative data, paradata, and, for example, sensor-derived (and other similar peripherals) data to survey respondents’ records does make... Zeelenberg, and Sofie De Broe Introduction 303 Big Data and Official Statistics 304 Examples of Big Data in Official Statistics 305 Scanner Data 305 Traffic-Loop Data 306 Social Media Messages 307