STATISTICS for SOCIAL UNDERSTANDING
With Stata and SPSS

NANCY WHITTIER, Smith College
TINA WILDHAGEN, Smith College
HOWARD J. GOLD, Smith College

Lanham • Boulder • New York • London

Executive Editor: Nancy Roberts
Assistant Editor: Megan Manzano
Senior Marketing Manager: Amy Whitaker
Interior Designer: Integra Software Services Pvt. Ltd.

Credits and acknowledgments for material borrowed from other sources, and reproduced with permission, appear on the appropriate page within the text.

Published by Rowman & Littlefield
An imprint of The Rowman & Littlefield Publishing Group, Inc.
4501 Forbes Boulevard, Suite 200, Lanham, Maryland 20706
www.rowman.com
Tinworth Street, London SE11 5AL, United Kingdom

Copyright © 2020 by The Rowman & Littlefield Publishing Group, Inc.

All rights reserved. No part of this book may be reproduced in any form or by any electronic or mechanical means, including information storage and retrieval systems, without written permission from the publisher, except by a reviewer who may quote passages in a review.

British Library Cataloguing in Publication Information Available

Library of Congress Cataloging-in-Publication Data
Names: Whittier, Nancy, 1966– author. | Wildhagen, Tina, 1980– author. | Gold, Howard J., 1958– author.
Title: Statistics for social understanding: with Stata and SPSS / Nancy Whittier (Smith College), Tina Wildhagen (Smith College), Howard J. Gold (Smith College).
Description: Lanham: Rowman & Littlefield, [2020] | Includes bibliographical references and index.
Identifiers: LCCN 2018043885 (print) | LCCN 2018049835 (ebook) | ISBN 9781538109847 (electronic) | ISBN 9781538109823 (cloth: alk. paper) | ISBN 9781538109830 (pbk.: alk. paper)
Subjects: LCSH: Statistics. | Social sciences—Statistical methods. | Stata.
Classification: LCC QA276.12 (ebook) | LCC QA276.12 .W5375 2020 (print) | DDC 519.5—dc23
LC record available at https://lccn.loc.gov/2018043885

The paper used in this publication meets the minimum requirements of American National
Standard for Information Sciences—Permanence of Paper for Printed Library Materials, ANSI/NISO Z39.48-1992.

Printed in the United States of America

Brief Contents

Preface viii
About the Authors xvi
CHAPTER 1 Introduction 1
CHAPTER 2 Getting to Know Your Data 54
CHAPTER 3 Examining Relationships between Two Variables 121
CHAPTER 4 Typical Values in a Group 161
CHAPTER 5 The Diversity of Values in a Group 203
CHAPTER 6 Probability and the Normal Distribution 241
CHAPTER 7 From Sample to Population 280
CHAPTER 8 Estimating Population Parameters 314
CHAPTER 9 Differences between Samples and Populations 356
CHAPTER 10 Comparing Groups 399
CHAPTER 11 Testing Mean Differences among Multiple Groups 435
CHAPTER 12 Testing the Statistical Significance of Relationships in Cross-Tabulations 463
CHAPTER 13 Ruling Out Competing Explanations for Relationships between Variables 501
CHAPTER 14 Describing Linear Relationships between Variables 542
SOLUTIONS TO ODD-NUMBERED PRACTICE PROBLEMS 599
GLOSSARY 649
APPENDIX A Normal Table 656
APPENDIX B Table of t-Values 658
APPENDIX C F-Table, for Alpha = .05 660
APPENDIX D Chi-Square Table 662
APPENDIX E Selected List of Formulas 664
APPENDIX F Choosing Tests for Bivariate Relationships 666
INDEX 667

Contents

Preface viii
About the Authors xvi

CHAPTER 1 Introduction 1
Why Study Statistics?
Research Questions and the Research Process 3
Pinning Things Down: Variables and Measurement 4
Units of Analysis
Measurement Error: Validity and Reliability
Levels of Measurement
Causation: Independent and Dependent Variables 11
Getting the Data: Sampling and Generalizing 12
Sampling Methods 13
Sources of Secondary Data: Existing Data Sets, Reports, and "Big Data" 15
Big Data 17
Growth Mindset and Math Anxiety 18
Using This Book 20
Statistical Software 21
Chapter Summary 23
Using Stata 25
Using SPSS 33
Practice Problems 45
Notes 52

CHAPTER 2 Getting to Know Your Data 54
Frequency Distributions 55
Percentages and Proportions 57
Cumulative Percentage and Percentile 60
Percent Change 62
Rates and Ratios 63
Rates 63
Ratios 65
Working with Frequency Distribution Tables 65
Missing Values 65
Simplifying Tables by Collapsing Categories 67
Graphical Displays of a Single Variable: Bar Graphs, Pie Charts, Histograms, Stem-and-Leaf Plots, and Frequency Polygons 69
Bar Graphs and Pie Charts 69
Histograms 72
Stem-and-Leaf Plots 73
Frequency Polygons 75
Time Series Charts 76
Comparing Two Groups on the Same Variable Using Tables, Graphs, and Charts 77
Chapter Summary 84
Using Stata 85
Using SPSS 95
Practice Problems 109
Notes 120

CHAPTER 3 Examining Relationships between Two Variables 121
Cross-Tabulations and Relationships between Variables 122
Independent and Dependent Variables 123
Column, Row, and Total Percentages 127
Interpreting the Strength of Relationships 134
Interpreting the Direction of Relationships 136
Graphical Representations of Bivariate Relationships 140
Chapter Summary 142
Using Stata 143
Using SPSS 147
Practice Problems 152
Notes 160

CHAPTER 4 Typical Values in a Group 161
What Does It Mean to Describe What Is Typical? 162
Mean 163
Median 167
Mode 171
Finding the Mode, Median, and Mean in Frequency Distributions 173
Choosing the Appropriate Measure of Central Tendency 175
Median Versus Mean Income 179
Chapter Summary 181
Using Stata 182
Using SPSS 187
Practice Problems 193
Notes 202

CHAPTER 5 The Diversity of Values in a Group 203
Range 205
Interquartile Range 205
Standard Deviation 210
Using the Standard Deviation to Compare Distributions 212
Comparing Apples and Oranges 214
Skewed Versus Symmetric Distributions 218
Chapter Summary 220
Using Stata 221
Using SPSS 225
Practice Problems 231
Notes 240

CHAPTER 6 Probability and the Normal Distribution 241
The Rules of Probability 242
The Addition Rule 245
The Complement Rule 246
The Multiplication Rule with Independence 248
The Multiplication Rule without Independence 249
Applying the Multiplication Rule with Independence to the "Linda" and "Birth-Order" Probability Problems 251
Probability Distributions 253
The Normal Distribution 254
Standardizing Variables and Calculating z-Scores 258
Chapter Summary 266
Using Stata 267
Using SPSS 270
Practice Problems 272
Notes 279

CHAPTER 7 From Sample to Population 280
Repeated Sampling, Sample Statistics, and the Population Parameter 281
Sampling Distributions 284
Finding the Probability of Obtaining a Specific Sample Statistic 287
Estimating the Standard Error from a Known Population Standard Deviation 288
Finding and Interpreting the z-Score for Sample Means 289
Finding and Interpreting the z-Score for Sample Proportions 292
The Impact of Sample Size on the Standard Error 293
Chapter Summary 295
Using Stata 295
Using SPSS 300
Practice Problems 306
Notes 313

CHAPTER 8 Estimating Population Parameters 314
Inferential Statistics and the Estimation of Population Parameters 315
Confidence Intervals Manage Uncertainty through Margins of Error 317
Certainty and Precision of Confidence Intervals 317
Confidence Intervals for Proportions 318
Constructing a Confidence Interval for Proportions: Examples 322
Confidence Intervals for Means 326
The t-Distribution 326
Calculating Confidence Intervals for Means: Examples 329
The Relationship between Sample Size and Confidence Interval Range 333
The Relationship between Confidence Level and Confidence Interval Range 335
Interpreting Confidence Intervals 337
How Big a Sample? 338
Assumptions for Confidence Intervals 341
Chapter Summary 342
Using Stata 344
Using SPSS 346
Practice Problems 349
Notes 354

CHAPTER 9 Differences between Samples and Populations 356
The Logic of Hypothesis Testing 357
Null Hypotheses (H0) and Alternative Hypotheses (Ha) 358
One-Tailed and Two-Tailed Tests 359
Hypothesis Tests for Proportions 359
The Steps of the Hypothesis Test 364
One-Tailed and Two-Tailed Tests 365
Hypothesis Tests for Means 367
Example: Testing a Claim about a Population Mean 373
Error and Limitations: How Do We Know We Are Correct? 375
Type I and Type II Errors 376
What Does Statistical Significance Really Tell Us? Statistical and Practical Significance 379
Chapter Summary 381
Using Stata 382
Using SPSS 386
Practice Problems 392
Notes 398

CHAPTER 10 Comparing Groups 399
Two-Sample Hypothesis Tests 401
The Logic of the Null and Alternative Hypotheses in Two-Sample Tests 401
Notation for Two-Sample Tests 402
The Sampling Distribution for Two-Sample Tests 403
Hypothesis Tests for Differences between Means 404
Confidence Intervals for Differences between Means 411
Hypothesis Tests for Differences between Proportions 412
Confidence Intervals for Differences between Proportions 416
Statistical and Practical Significance in Two-Sample Tests 418
Chapter Summary 419
Using Stata 420
Using SPSS 424
Practice Problems 429
Notes 434

CHAPTER 11 Testing Mean Differences among Multiple Groups 435
Comparing Variation within and between Groups 436
Hypothesis Testing Using ANOVA 438
Analysis of Variance Assumptions 439
The Steps of an ANOVA Test 440
Determining Which Means Are Different: Post-Hoc Tests 446
ANOVA Compared to Repeated t-Tests 447
Chapter Summary 448
Using Stata 448
Using SPSS 450
Practice Problems 453
Notes 461

CHAPTER 12 Testing the Statistical Significance of Relationships in Cross-Tabulations 463
The Logic of Hypothesis Testing with Chi-Square 466
The Steps of a Chi-Square Test 469
Size and Direction of Effects: Analysis of Residuals 475
Example: Gender and Perceptions of Health 477
Assumptions of Chi-Square 481
Statistical Significance and Sample Size 481
Chapter Summary 486
Using Stata 487
Using SPSS 489
Practice Problems 492
Notes 500

CHAPTER 13 Ruling Out Competing Explanations for Relationships between Variables 501
Criteria for Causal Relationships 506
Modeling Spurious Relationships 508
Modeling Non-Spurious Relationships 513
Chapter Summary 520
Using Stata 521
Using SPSS 526
Practice Problems 532
Notes 541

CHAPTER 14 Describing Linear Relationships between Variables 542
Correlation Coefficients 544
Calculating Correlation Coefficients 545
Scatterplots: Visualizing Correlations 546
Regression: Fitting a Line to a Scatterplot 550
The "Best-Fitting" Line 552
Slope and Intercept 553
Calculating the Slope and Intercept 556
Goodness-of-Fit Measures 557
R-Squared (r2) 557
Standard Error of the Estimate 558
Dichotomous ("Dummy") Independent Variables 559
Multiple Regression 563
Statistical Inference for Regression 565
The F-Statistic 566
Standard Error of the Slope 568
Assumptions of Regression 571
Chapter Summary 573
Using Stata 575
Using SPSS 581
Practice Problems 588
Notes 598

SOLUTIONS TO ODD-NUMBERED PRACTICE PROBLEMS 599
GLOSSARY 649
APPENDIX A Normal Table 656
APPENDIX B Table of t-Values 658
APPENDIX C F-Table, for Alpha = .05 660
APPENDIX D Chi-Square Table 662
APPENDIX E Selected List of Formulas 664
APPENDIX F Choosing Tests for Bivariate Relationships 666
INDEX 667

Preface

The idea for Statistics for Social Understanding: With Stata and SPSS began with our desire to offer a different kind of book to our statistics students. We wanted a book
that would introduce students to the way statistics are actually used in the social sciences: as a tool for advancing understanding of the social world. We wanted thorough coverage of statistical topics, with a balanced approach to calculation and the use of statistical software, and we wanted the textbook to cover the use of software as a way to explore data and answer exciting questions. We also wanted a textbook that incorporated Stata, which is widely used in graduate programs and is increasingly used in undergraduate classes, as well as SPSS, which remains widespread. We wanted a book designed for introductory students in the social sciences, including those with little quantitative background, but one that did not talk down to students and that covered the conceptual aspects of statistics in detail even when the mathematical details were minimized. We wanted a clearly written, engaging book, with plenty of practice problems of every type and easily available data sets for classroom use.

We are excited to introduce this book to students and instructors. We are three experienced instructors of statistics, two sociologists and a political scientist, with more than sixty combined years of teaching experience in this area. We drew on our teaching experience and research on the teaching and learning of statistics to write what we think will be a more effective textbook for fostering student learning. In addition, we are excited to share our experiences teaching statistics to social science students by authoring the book's ancillary materials, which include not only practice problems, test banks, and data sets but also suggested class exercises, PowerPoint slides, assignments, and lecture notes.

Statistics for Social Understanding is distinguished by several features: (1) It is the only major introductory statistics book to integrate Stata and SPSS, giving instructors a choice of which software package to use. (2) It teaches statistics the way they are used in the social sciences. This includes beginning every chapter with examples from real research and taking students through research questions as we cover statistical techniques or software applications. It also includes extensive discussion of relationships between variables, through the earlier placement of the chapter on cross-tabulation, the addition of a dedicated chapter on causality, and comparative examples throughout every chapter of the book. (3) It is informed by research on the teaching and learning of quantitative material and uses principles of universal design to optimize its contents for a variety of learning styles.

Distinguishing Features

1) Integrates Stata and SPSS

While most existing textbooks use only SPSS or assume that students will purchase an additional, costly, supplemental text for Stata, this book can be used with either Stata or SPSS. We include parallel sections for both SPSS and Stata at the end of every chapter. These sections are written to ensure that students understand that software is a tool to be used to improve their own statistical reasoning, not a replacement for it.1 The book walks students through how to use Stata and SPSS to analyze interesting and relevant research questions. We not only provide students with the syntax or menu selections that they will use to carry out these commands but also carefully explain the statistical procedures that the commands are telling Stata or SPSS to perform. In this way, we encourage students to engage in statistical reasoning as they use software, not to think of Stata or SPSS as doing the statistical reasoning for them. For Stata, we teach students the basic underlying structure of Stata syntax. This approach facilitates a more intuitive understanding of how the program works, promoting greater confidence and competence among students. For SPSS, we teach students to navigate the menus fluently.

2) Draws on teaching and learning research

Our approach is informed by research on
teaching and learning in math and statistics and takes a universal design approach to accommodate multiple learning styles. We take the following research-based approaches:

• Research on teaching math shows that students learn better when teachers use multiple examples and explanations of topics.2 The book explains topics in multiple ways, using both alternative verbal explanations and visual representations. As experienced instructors, we know the topics that students frequently stumble over and give special attention to explaining these areas in multiple ways. This approach also accommodates differences in learning styles across students.

• Some chapter examples and practice problems lead students through the process of addressing a problem by acknowledging commonly held misconceptions before presenting the proper solution. This approach is based on research showing that simply presenting students with information that corrects their statistical misconceptions is not enough to change these "strong and resilient" misconceptions.3 Students need to be able to examine the differences in the reasoning underlying incorrect and correct strategies of statistical work.

• Each chapter provides numerous, carefully proofread practice problems, with additional practice problems on the text's website. Students learn best by practicing.

Appendix E Selected List of Formulas: Inference for Means and Proportions (pp. 664–665)

Confidence interval for a mean:
One sample: CI = ȳ ± t(SEȳ), where SEȳ = s/√N and DF = N − 1
Two samples: CI = (ȳ1 − ȳ2) ± t(SEȳ1−ȳ2), where SEȳ1−ȳ2 = √(SEȳ1² + SEȳ2²) and DF = N1 + N2 − 2

Hypothesis test for a mean:
One sample: t = (ȳ − μ)/SEȳ, DF = N − 1
Two samples: t = ((ȳ1 − ȳ2) − 0)/SEȳ1−ȳ2, DF = N1 + N2 − 2

Confidence interval for a proportion:
One sample: CI = p ± z(SEp), where SEp = √(p(1 − p)/N)
Two samples: CI = (p1 − p2) ± z(SEp1−p2), where SEp1−p2 = √(SEp1² + SEp2²)

Hypothesis test for a proportion:
One sample: z = (p − π)/SEp, where SEp = √(π(1 − π)/N)
Two samples: z = ((p1 − p2) − 0)/SEp1−p2, where SEp1−p2 = √(π(1 − π)/N1 + π(1 − π)/N2)

Appendix F Choosing Tests for Bivariate Relationships, by Level of Measurement of Independent and Dependent Variables (p. 666)

Dependent variable interval-ratio:
• Independent variable nominal or ordinal with 2 categories: compare means (inference test: t test; estimation: CI for the difference) or compare proportions (inference test: z test; estimation: CI for the difference)
• Independent variable nominal or ordinal with more than 2 categories: ANOVA (inference test: F test)
• Independent variable interval-ratio: correlation and ordinary least squares regression (inference tests: t test for slopes, F test for the global model)

Dependent variable nominal or ordinal with 2 or more categories:
• Independent variable nominal or ordinal with 2 or more categories: cross-tabulation (inference test: chi-square)
• Independent variable interval-ratio: logistic, probit, or multinomial regression (not covered in this book)
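The inference formulas collected in Appendix E lend themselves to a quick arithmetic check. The following Python sketch is ours, not the book's: the function names are illustrative, and the t critical value must still be looked up in a t-table (such as Appendix B) for the appropriate degrees of freedom.

```python
from math import sqrt

def se_mean(s, n):
    # Standard error of a sample mean: SE = s / sqrt(N).
    return s / sqrt(n)

def ci_mean(ybar, s, n, t_crit):
    # One-sample confidence interval for a mean: ybar +/- t * SE,
    # where t_crit is the t-table value for df = N - 1.
    se = se_mean(s, n)
    return (ybar - t_crit * se, ybar + t_crit * se)

def z_test_proportion(p, pi, n):
    # One-sample z test for a proportion: z = (p - pi) / SE,
    # with SE computed under the null hypothesis value pi.
    se = sqrt(pi * (1 - pi) / n)
    return (p - pi) / se

# Example: ybar = 50, s = 10, N = 100, t = 1.984 (df = 99, 95% confidence)
lower, upper = ci_mean(50.0, 10.0, 100, t_crit=1.984)   # SE = 1, so (48.016, 51.984)

# Example: sample proportion 0.55 against a null value of 0.50, N = 100
z = z_test_proportion(0.55, 0.50, 100)                   # SE = 0.05, so z = 1.0
```

The same standard-error building blocks combine for the two-sample formulas: the two-sample SE is the square root of the sum of the squared one-sample SEs.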
Index

Note: Page numbers in italics indicate figures and those in bold indicate tables.

Addition Rule, 245–46, 266
Adelman, Robert, 542–43
aggregate data, 55
aggregate level, 6, 24
alpha-level (α), 364, 381; in ANOVA, 441, 445, 447, 448; in chi-square test, 466, 470, 473, 474, 484, 486; in one-sample hypothesis tests, 364–70, 373, 378, 380, 381, 382, 384, 385, 387; in statistical inference for regression, 543, 567–70; in two-sample hypothesis tests, 400, 405–6, 408, 412, 413, 415, 417, 419, 421, 422, 427
alternative hypothesis: in chi-square test for goodness-of-fit, 483–85, 484, 485; in chi-square test of independence, 469–70, 478, 480–81, 483, 486–88, 489; in hypothesis testing using ANOVA, 438–39, 439; in one-sample hypothesis tests, 359; in two-sample hypothesis tests, 402
American National Election Study (ANES), 15, 16, 65, 71, 73, 124, 126, 208, 212, 234, 235, 239–40, 253, 260–61, 521, 523, 526, 529, 595, 596–97
analysis, units of, 6, 55
analysis of variance (ANOVA), 435–62; alpha level in, 441, 445, 447, 448; assumptions, 439–40; decision f in, 439, 441–46, 448; decision t in, 442; defined/overview of, 435–36; degrees of freedom and, 439, 439, 441–46, 448; F-curves, 439,
439; F-statistic, 438, 439, 442, 445–46, 448, 449, 452, 454, 461; F-values, 438, 441, 445; hypothesis testing using, 438–39; null hypothesis in, 441, 444, 446, 448; as omnibus test, 436; personal control, research on, 435–36; post-hoc tests, 436, 446–48, 447, 450, 453, 458, 461; repeated t-tests compared to, 447; research, 435–36; SPSS for, 450–53; Stata for, 448–50; steps of, 440–46, 448; Tukey's test, 446–47, 447, 448, 458; variation between groups sum of squares, 442, 443, 443–44; variation within and between groups, 436–38, 437; variation within groups sums of squares, 442, 442
antecedent variable, 509, 509, 513, 520
"apples and oranges" case, 214–17
arithmetic operators in Stata, 31, 31
assumptions: chi-square test, 481; confidence intervals, 341; hypothesis test, 372; linear regression, 579–80, 585–87; regression, 571–72
average, calculating, 11
axis: compressed, 83; horizontal, 83, 255; vertical, 69, 73, 82–83, 82–83, 105; x, 76, 106, 108, 393, 546, 548, 551, 561, 575, 581, 588; y, 76, 108, 299, 546, 548, 551, 554, 555, 556, 561, 575, 581, 588, 592
bar graphs, 69–72, 70, 78, 79, 85; clustered, 140, 140–41; comparing two groups on same variable using, 78, 79; defined, 85; of single variable, 69–72, 70; stacked, 140–41, 141; Stata to generate, 94
Base Reference Manual for Stata, 33
best-fitting line, 552–53
bias, nonresponse, 14
big data, 17–18
bivariate relationships, 122, 140–41, 140–41
box plot: generating, in SPSS, 229–30, 230; generating, in Stata, 223, 223; interquartile range displayed with, 208–10, 209
boyd, danah, 17–18
Brame, Robert, 314–15, 317
categories: in cross-tabulations, collapsing, 134; data in variable categories, 55; simplifying frequency distribution tables by collapsing, 67, 67–69, 68, 69
causal relationships, 11–12, 501–41; association between variables in, 506–7; chi-square statistic in controlled cross-tabulation, interpreting, 517–19; control variables in, 503–17, 520–21, 523–24, 525–27;
correlation differentiated from, 550; experimental control, 11–12; independent and dependent variables in, 11; independent effect of variables in, 513, 520; intervening variables in, 509, 510, 513, 520; mediating variables in, 509, 513, 520; moderating variables in, 515, 520; non-spurious relationships in, modeling, 513, 513–17; spurious relationships in, modeling, 503, 503–5, 506, 508–13; statistical control, 12; statistical interaction between variables in, 515–17, 516, 520; Tufte's research on, 501–3, 550
Central Limit Theorem, 284–85, 287, 293, 295, 319, 341, 362, 403–4, 439
central tendency, measures of. See measures of central tendency
certainty, of confidence intervals, 317–18, 342
chi-square test, 463–500; alpha-level (α) in, 466, 470, 473, 474, 484, 486; alternative hypothesis in, 469–70, 478, 480–81, 483, 486–88, 489; assumptions of, 481; in controlled cross-tabulation, interpreting, 517–19; degrees of freedom in, 470, 473, 473–75, 474, 475, 477, 479, 484–85, 486, 489, 492; example, gender and perceptions of health, 477–78; expected frequencies in, 470–71, 473, 476, 481, 483, 484, 486, 487–88, 490; for goodness-of-fit, 483–85, 484, 485; of independence, 467–69, 471, 478–79, 479, 481, 483, 486; to investigate relationship between class and political ideology, 478, 478–79, 479; logic of, 466–69; null hypothesis in, 464–75, 477–78, 480, 482–85, 486–88, 489, 492; SPSS for, 489–92; standardized residuals in, 475–76, 476; Stata for, 487–89; statistical significance and sample size, 481–83; steps of, 469–75; two-sample z-tests for 2 × 2 cross-tabulations and, relationship between, 480, 480–81
chi-square test for goodness-of-fit, 483–85, 484, 485
Clinton, Hillary, 121–22, 208–10, 209, 209, 270–71
closed-ended survey items, 5–6, 24
clustered bar graph, 140, 140
cluster sampling, 14
codebooks, 16, 16–17
code logic in Stata, 28
Command Box in Stata, 26, 26, 27
Command Window in Stata, 26, 26
competing explanations for relationships between variables, ruling out:
antecedent variable, 509, 509, 513, 520; control variable, 501–41; independent effect, 513, 520; intervening variable, 509, 510, 513, 520; mediating variable, 509, 513, 520; moderating variable, 515, 520; non-spurious relationships, 513–17; spurious relationship, 508–13; statistical interaction, 515–17, 516, 520
Complement Rule, 246–48, 266
compressed axis, 83
Compute Variable dialog box in SPSS, 37–38, 38
Compute Variable in SPSS, 36, 37, 41
concepts, 4–5, 20, 23
confidence interval range: confidence level in relation to, 335–37; sample size in relation to, 333–35
confidence intervals, 314–54; assumptions for, 341; Brame's research on, 314–15, 317; certainty of, 317–18, 342; differences between proportions and, 416–18, 419; election polling and, 325–26; inferential statistics and, 316, 316, 323, 342; interpreting, 337–38; lower bound of, 321, 324, 331, 334, 342; margin of error in, 280, 315, 317, 318, 320–21, 323, 324, 325–26, 328, 330, 331, 332, 335, 336, 337, 339–40, 343; of means, 326–33, 343, 344–45, 347–48; point estimates and, 316, 317, 324, 325, 329, 330, 332, 335, 337; precision of, 317–18, 335, 336, 339, 342; of proportions, 318–26, 342–43, 345–46, 348–49; range and (See confidence interval range); sample size and, 333–35, 338–41, 343; SPSS to calculate, 346–49; Stata to calculate, 344–46; two-tailed hypothesis tests and, equivalence of, 367; upper bound of, 321, 323, 324, 324, 329–32, 331, 334, 335, 336, 338, 342, 343
confidence level: in calculating confidence intervals for means, 330, 331; defined, 317, 342; degrees of freedom and, 330–32, 331, 335, 341, 343; in interpreting confidence intervals, 337; in relation to confidence interval range, 335–37; sample size in confidence intervals and, 339–41; z-score associated with, 318–22, 320, 321, 324, 342
continuous variables, 9–10
control variables, 501–41; in causal relationships, 503, 503–17, 520–21, 523–24, 525–27; in cross-tabulations, 503, 503, 520–23, 526–29; defined, 503, 520; in modeling
non-spurious relationships, 513–17; Tufte's research on, 501–3, 550
correlation, 542–98; causation differentiated from, 550; direction of, 544, 546, 547; negative, 544, 546, 547; nonsensical, 550; positive, 546, 547, 551; r-squared (r2), 557–58; scatterplots for visualizing, 546–57, 547, 548, 549, 551; standard error of the estimate, 558–59; strength of, 544, 544, 546, 547
correlation coefficients, 544–46; assumptions of regression, 571–72; calculating, 545–46; defined, 544; dummy variables, 559–61; formula for, 545–46; F-statistic, 566–68; intercept, 553–57, 555, 556; inverse correlation, 544, 546, 547; linear regression, 550, 554, 561–62, 579–80, 584; linear relationship, 545, 546, 549, 552–53, 555; logistic regression, 561–62, 575; positive correlation, 546, 547, 551, 573; regression line, 550–57; r-squared (r2), 557–58; size of, 544; slope (regression coefficient), 553–57; standard error of the slope, 568–71
covariance of two variables, formula for, 545, 546
Crawford, Kate, 17–18
cross-tabulations: collapsing categories in, 134; controlling for third variable in, 503, 503, 520–23, 526–29; interpreting the chi-square statistic in, 517–19; statistical significance as it applies to, 463–500 (See also chi-square test)
cumulative percentage, 60–62, 61, 62, 84
cumulative probability, 263, 263, 264–65, 267, 269
curvilinear relationship, 549, 549
cutoff points, 207–8, 413
data, 54–120. See also graphical representations of data; aggregate, 55; analysis before digital era, 22; cases in variable categories, 55; comparing two groups on same variable, 77–81; cumulative percentage, 60–62, 61, 62, 84; frequency distributions, 55–57, 56, 57; frequency distribution tables, 65–69, 66, 67, 68, 69; Furstenberg's and Kennedy's research on, 54–55; getting to know, 109–20; independence of, in assumptions of regression, 571; percentages, 55, 57–60, 58, 59, 84; percent change, 62–63, 84; percentile, 62, 84; proportions, 57–60; rate, 63–64, 84; ratio, 65, 84; raw frequency, 55; secondary (See secondary data); SPSS to generate statistics and graphs, 95–109; Stata to generate statistics and graphs, 85–94; unit of analysis, 55; univariate statistics, 55
data, gathering, 12–15; descriptive statistics, 12–13; inferential statistics, 13; non-probability sample, 13; population, 12–13; probability sample, 13; sample, 12–13; sampling methods, 13–15
Data Editor in SPSS, 34, 34–36, 36, 38, 38, 41, 42, 42
Data Editor in Stata: icon, 26, 26; opening, 26, 26; reviewing data in, 26–27, 27
data set display in Stata, 26–27, 27
Data View in SPSS, 35, 38–39, 39
decision f: in ANOVA, 439, 441–46, 448; in statistical inference for regression, 567–68, 574
decision t: in ANOVA, 442; in one-sample hypothesis tests, 369–71, 371, 374–75, 381; in statistical inference for regression, 569–71, 575; in two-sample hypothesis tests, 405, 405–9, 406, 409, 419, 423
degrees of freedom (DF): in ANOVA, 438–39, 439, 441–46, 448; in chi-square test, 470, 473, 473–75, 474, 475, 477, 479, 484–85, 486, 489, 492; confidence level and, 328, 328, 330–32, 331, 335, 341, 343; in one-sample hypothesis test, 369–70, 373; t-distributions and, 327, 327–28, 328; in two-sample hypothesis test, 404–7, 406, 412, 419, 423
dependent variables, 11–12
descriptive statistics: cumulative percentage, 60–62, 61, 62, 84; frequency, 55, 57–60, 58, 59, 84; in gathering data, 12–13; percentages, 55, 57–60, 58, 59, 84; percent change, 62–63, 84; percentile, 62, 84; purpose of, 55; rate, 63–64, 84; ratio, 65, 84; sampling, 12–13
discrete variables, 9–10
distributions: frequency, 55–57, 56, 57, 85; frequency distribution tables, 65–69, 66, 67, 68, 69; normal, 254–58, 255, 256, 257, 272–79; probability, 253–58; sampling, 280–313, 403, 403–4, 438–39, 439; skewed, 218–19; standard deviation to compare, 212–14, 213–14; symmetric, 218–19
do-file in Stata, 31–33, 32, 94
drop-down menus: SPSS, 33, 35, 95; Stata, 25
dummy variables, 559–61
Dweck, Carol, 19
ecological fallacy, 6, 24
election polling, confidence intervals and, 325–26
Empirical Rule, 256, 256–58
equivalence of two-tailed hypothesis and confidence intervals, 367
error messages in Stata, 30–31
expected frequencies, 470–71, 473, 476, 481, 483, 484, 486, 487–88, 490
experimental control, 11–12
explanatory variables in social sciences, 513
F-curves, 438–39, 439
feeling thermometer: for big business, Tea Party, and Black Lives Matter, 212–14, 213–14; Clinton, 270–71; for conservatives, 234, 234–35; of different groups, 212; of gay men and lesbians by gender and region, 514, 514–15, 515; of illegal immigrants, 208–10, 209, 209; for liberals, 235, 236; Obama, 74, 71, 71–75, 72, 73, 74; recoding, 71, 71–72, 72; for Trump voters, 272, 272
frequencies, 55, 57–60, 58, 59; defined, 84; relative size assessed by, 59
frequency distributions, 55–57, 56, 57, 85; aggregate data in, 55; defined, 85; measures of central tendency in, finding, 173, 173–75, 175; raw frequency in, 56; SPSS to generate, 95–98, 95–98; Stata to generate, 86, 86–88, 87–88, 94; unit of analysis in, 55; univariate statistics in, 55
frequency distribution tables, 65–69, 66, 67, 68, 69; missing values, 65–67, 66; recoding, 67–69; simplifying by collapsing categories, 67, 67–69, 68, 69
frequency polygons, 75, 75, 79, 81; comparing two groups on same variable using, 79, 81; defined, 85; graphical displays of single variable, 75, 75
F-statistic, 566–68; ANOVA, 438, 439, 442, 445–46, 448, 449, 452, 454, 461; in statistical inference for regression, 566–68
Furstenberg, Frank, 54–55
F-values, 438, 441, 445
generalizing, 12–15
General Social Survey (GSS), 15, 48, 48, 56, 56–57, 57, 69, 85, 95, 132, 134–35, 137, 139, 143–44, 147, 149, 173, 177–78, 182, 186–87, 192, 205–6, 217, 252, 279, 322–23, 433, 434, 444–46, 448–49, 450–51, 487, 489, 499, 500, 521, 526, 540, 560
Goldberg, Amir, 17
goodness-of-fit, 557–59; chi-square
test for, 483–85, 484, 485
goodness-of-fit measures, 557–59; r-squared (r2), 557–58; standard error of the estimate, 558–59
Gould, Stephen Jay, 203
graphical representations of data: bar graphs, 69–72, 70, 78, 79, 85; of bivariate relationships, 122, 140–41, 140–41; comparing two groups on same variable, 77–81; compressed axis, 83; displays of single variable, 69–75; frequency distributions, 55–57, 56, 57, 85, 86–88; frequency distribution tables, 65–69, 66, 67, 68, 69; frequency polygons, 75, 75, 79, 81, 85; histograms, 72–73, 73, 79, 80–81, 85, 94; horizontal axis, 83, 255; misleading, 82–83, 82–83; pie charts, 69–72, 70, 78, 80, 85; SPSS to generate, 95–109; Stata to generate, 85–94; stem-and-leaf plots, 73–75, 74, 85; time series charts, 76, 76–77, 85; Venn diagrams, 563, 563, 594, 594; vertical axis, 69, 73, 82–83, 82–83, 105; x axis, 76, 106, 108, 393, 546, 548, 551, 561, 575, 581, 588; y axis, 76, 108, 299, 546, 548, 551, 554, 555, 556, 561, 575, 581, 588, 592
graphical user interface (GUI): SPSS, 33, 34, 34–36, 36; Stata, 25–26, 26
groups: comparing, 399–434; differences between, in social sciences, 399; diversity of values in, 231–40; Houle's and Warner's research on, 161–62, 204; Rivera's and Tilcsik's research on, 399–400, 418; sums of squares, variation between, 442, 443, 443–44; sums of squares, variation within, 442, 442; testing mean differences among multiple, 453–61; two, on same variable, 77–81; typical values in, 161–202, 193–202; variation within and between, 436–38, 437
growth mindset, 18–19
histograms, 72–73, 73, 79, 80–81; comparing two groups on same variable using, 79, 80–81; defined, 85; graphical displays of single variable, 72–73, 73; probability distributions found with, 253, 253; showing percentages, Stata to generate, 94
Hollerith, Herman, 22
homoscedasticity of residuals, 571
horizontal axis, 83, 255
Houle, Jason, 161–62, 204
hypothesis, defined, 3, 23
hypothesis testing: one-sample, 356–98; two-sample,
399–434; using ANOVA, 438–39
and, 254–58, 257; right-tail probability for, 259, 259–60; sampling distribution and, 288–90, 289, 290; standard deviations and, 258–65; z-scores and, 258–65
Kahneman, Daniel, 241–43, 251–52
Kennedy, Sheela, 54–55
launching: SPSS, 33–34; Stata, 25
left-tail probability, 264, 264–65, 383, 384, 387, 402
level of measurement, 9–11
linear regression, 550, 554, 561–62, 579–80, 584 See also regression line; check assumptions of, SPSS for, 585–87; check assumptions of, Stata for, 579–80; defined, 550, 573; intercept in, 554–57; models, 561–62; scatterplots to visualize, 550–51; slope in, 554–57
linear relationships between variables, 542–98; assumptions of regression, 571–72, 578–80, 585–87; best-fitting line, 552–53; correlation coefficients, 544–46; direction of, 544; dummy variables, 559–61; goodness-of-fit measures, 557–59; multiple regression, 563–65; scatterplots, 546–51; slope and intercept, 553–57; SPSS for, 581–88; Stata for, 575–81; statistical inference for regression, 565–71; straight line for representing, 548–49, 573; strength of, 544, 544
logical operators in Stata, 31, 31
logistic regression, 561–62, 575
Long, J. Scott, 562
lower bounds, 321, 324, 331, 342
Luker, Kristen,
margin of error, 280, 315, 317–18, 320–21, 323, 324, 325–26, 328, 330, 331, 332, 335–37, 339–40, 343
Marked (Pager), 2–3
math anxiety, 19–20
means, 163–67; “apples and oranges” case, 214–17; claim about, testing, 373–75; confidence intervals of, 326–33, 343, 344–45, 347–48; defined, 163; differences among multiple groups, testing, 453–61; differences between, confidence intervals for, 411–12, 419; differences between, hypothesis tests for, 404–12; differences between, post-hoc tests to determine, 446–47, 447; equation for calculating, 163; express individual observation as distance from, 258; in frequency distributions,
ideological identification variable, 66, 67–68, 124, 124–26, 125, 134
“if” in a command
in Stata, rule for, 29
income, median versus mean, 179–80, 180
independence, chi-square test of, 467–69, 471, 478–79, 479, 481, 483, 486 See also chi-square test
independent effect, 513, 520
independent variables, 11–12
in depth feature: assessing relative size using percentages and frequencies, 59; assumptions of hypothesis tests, 372; collapsing categories in cross-tabulations, 134; election polling and confidence intervals, 325–26; equivalence of two-tailed hypothesis and confidence intervals, 367; GSS tests Americans’ knowledge of probability, 252; misleading graphs, 82–83; nonexistent values for mean and median in the population, 171; publication bias toward statistically significant results, 365; punched cards and data analysis before digital era, 22; redistributive property of the mean, 164–65; sampling from skewed population, 285; statistical notation for samples and populations, 281; why the mean has no meaning for nominal-level variables, 165
inferential statistics, 2, 280; confidence intervals and, 316, 316, 323, 342; defined, 13, 25, 342; normal curve and, 242, 264, 264–65, 281; population parameters and, estimating, 315–17; regression and, 543, 565–66, 571, 579; standard error and, 287; visual representation of, 316
Input Variable box in SPSS, 40
intercept, slope and See slope and intercept
interface: SPSS, 34, 34–36, 36; Stata, 25–26, 26
interquartile range (IQR), 205–10; box plot for displaying, 208–10, 209; cutoff points, 207–8; defined, 205, 220; interval-ratio variables and, 205–6; ordinal variables and, 205; size of, calculating, 206; SPSS for finding, 228–29; Stata for finding, 222–23
interval-ratio variables, 60, 71, 85, 481, 523, 529; effect of independent variable on means of, using SPSS, 529–31; effect of independent variable on means of, using Stata, 523–24; interquartile range and, 205–6
intervening variables, 509, 510, 513, 520
inverse correlation, 544, 546, 547
IQ scores: cumulative probability for, 263, 263; for Mensa membership, 265;
normal distribution
means (continued): finding, 173–75, 175; having no meaning for nominal-level variables, 165; income, versus median, 179–80, 180; of interval-ratio variables, effect of independent variable on, using SPSS, 529–31; nonexistent values for, in the population, 171; one-sample hypothesis test, 367–75, 385–86, 390–91; redistributive property of, 164–65; SPSS for calculating confidence intervals of, 347–48; SPSS for finding, 188–92; SPSS for one-sample hypothesis test, 390–91; SPSS for two-sample hypothesis test, 427–28; standard deviations, comparing individual score to, 217; standard deviations, in normal distribution, 256–57; Stata for calculating confidence intervals of, 344–45; Stata for finding, 183–84; Stata for one-sample hypothesis test, 385–86; Stata for two-sample hypothesis test, 422–23; t-distribution and, 326–29, 327
measurement, 5, 23; error, 6–9; goodness-of-fit, 557–59; of key concept, 4–6, 5; key terms involving, 23–24; level of, 9–11; scales, 10, 24
measures of central tendency, 161–202 See also means; median; mode; choosing, 175–79, 181–82; finding, in frequency distributions, 173, 173–75, 175; income, median versus mean, 179–80, 180; mean, 163–67; median, 167–71; mode, 171–73; SPSS for finding, 187–93; Stata for finding, 182–86
measures of variability, 203–40 See also interquartile range (IQR); range; standard deviation (SD); variance; Gould’s research on, 203; Houle’s and Warner’s research on, 161–62, 204; interquartile range, 205–10; range, 205; SPSS for finding, 225–31; standard deviation, 210–19; Stata for finding, 221–25; variance, 211
median, 167–71; defined, 167; finding, 167–71; in frequency distributions, finding, 173–75; income, versus mean, 179–80, 180; nonexistent values for, in the population, 171; SPSS for finding, 192; Stata for finding, 184
“Median Isn’t the Message, The” (Gould), 203
mediating variables, 509, 513, 520
misleading graphics, 82–83, 82–83
mixed methods, 4, 23
mode, 171–73; defined, 171;
in frequency distributions, finding, 173–75; SPSS for finding, 192; Stata for finding, 184–86
moderating variables, 515, 520
Moving to Opportunity (MTO) project, 356–57
multiple regression, 563–65
Multiplication Rule: with independence, 248–49, 251–52, 266; without independence, 249–51, 266
multistage cluster sampling, 14
National Center for Education Statistics, 238
National Longitudinal Survey of Youth (NLSY), 15, 314, 344, 346, 347
negative correlation, 544, 546, 547
nominal-level variables, 10–11, 165
nominal variables, 10–11
nonlinear relationships, 549, 561–62
non-probability sample, 13
nonresponse bias, 14
non-spurious relationships, 513–17
normal curve: for analyzing distributions and finding probabilities, 257, 261; inferential statistics and, 242, 264, 264–65, 281; terminology for, 264–65
normal distribution, 254–58, 255, 256, 257, 272–79; characteristics of, 246, 266; defined, 254, 266; Empirical Rule, 256, 256–58; features of, 255–56; importance of, 254–55; probability and, 272–79; standard deviations of the mean, 256–57; z-scores in, calculating, 258–65
normal tables, 259, 260–67, 271, 291, 319–22, 320, 328, 362–63, 365, 260, 415, 481
notation See statistical notation
null hypothesis: in ANOVA steps, 441, 444, 446, 448; in chi-square test, 464–75, 477–78, 480, 482–85, 486–88, 489, 492; in chi-square test for goodness-of-fit, 483–85, 484, 485; in hypothesis testing using ANOVA, 438–39, 439; in one-sample hypothesis tests, 358–59; in two-sample hypothesis tests, 401–2
Numeric Expression box in SPSS, 38, 38
Obama, Barack, 71–75, 121, 508; Feeling Thermometer Ratings, 74, 71, 71–75, 72, 73, 74
Obama Feeling Thermometer Ratings, 74, 71, 72, 73, 74
observed frequencies, 469–72, 475–76, 478, 481, 484, 486–88, 490, 492
Old and New Values box in SPSS, 40, 41
omnibus test, 436 See also analysis of variance (ANOVA)
one-sample hypothesis tests, 356–98; alpha-level (α), 364–70, 373, 378, 380, 381, 382, 384, 385, 387; alternative hypothesis, 359; ANOVA, 438–39;
Pager, Devah, 2–3
percentages, 55, 57–60, 58, 59, 84; calculating, 58; cumulative, 60–62, 61, 62, 84; defined, 84; relative size assessed by, 59
percent change, 62–63, 84
percentiles, 62, 84; SPSS for finding, 227–28; Stata for finding, 222
personal control, in ANOVA research, 436
pie charts, 69–72, 70, 78, 80; comparing two groups on same variable using, 78, 80; defined, 85; of single variable, 69–72, 70
point estimates, 316, 317, 324, 325, 329, 330, 332, 335, 337
Police Public Contact Survey (PPCS), 15, 16, 16, 382–90, 420–21, 423, 424, 425
political party identification variables, 67, 68, 69, 82–83, 124, 124–26, 125, 134, 411, 482, 482–83, 507–8, 508, 510, 511, 511–13, 512, 518, 518–19, 523–24, 529–31
population parameters See also confidence intervals: defined, 281; estimating, 314–55; inferential statistics and, 315–17, 316; in sampling distributions, 284–87
populations: defined, 12; in gathering data, 12–13; nonexistent values for the mean and median in, 171; parameters in, estimating, 349–54; regression equation for, 566; to sample from, 12, 306–13; samples and, differences between, 392–98; skewed, sampling from, 285; statistical notation for, 281
positive correlation, 546, 547, 551, 573
post-hoc tests, 436, 446–48, 447, 450, 453, 458, 461
power of hypothesis test, 378, 382
practical significance: in one-sample hypothesis tests, 379–80, 381; in two-sample hypothesis tests, 418
practice problems See also SPSS practice problems; Stata practice problems: comparing groups, 429–34; describing linear relationships between variables, 588–97; differences between samples and populations, 392–98; diversity of values in a group, 231–40; estimating population parameters, 349–54; examining relationships between two variables, 152–60; getting to know your data, 109–20; introduction, 45–52; probability and the normal distribution, 272–79; from sample to population, 306–13; testing mean differences among multiple groups, 453–61; testing statistical significance of
relationships in cross-tabulations, 492–500; typical values in a group, 193–202; variables, ruling out competing explanations for relationships between, 532–41
precision, of confidence intervals, 317–18, 335, 336, 339, 342
probability, 241–79 See also normal distribution; Americans’ knowledge of, 252; cumulative, 263, 263, 264–65, 267, 269; as fraction, 243; importance of, to statistics, 242; reasons for using, 242–52; sample, 13; SPSS for, 270–71; standardizing variables, 258–65; Stata for, 267–70; Tversky’s and Kahneman’s scenario, 241–43, 251–52; z-scores, 258–65
probability, rules of, 242–52; Addition Rule, 245–46, 266; Complement Rule, 246–48, 266; Multiplication Rule, with independence, 248–49, 251–52, 266; Multiplication Rule, without independence, 249–51, 266
probability distributions, 253–58 See also normal distribution; of continuous variables, 253–54, 254; histogram to find, 253, 253; normal distribution, 254–58, 255, 256, 257, 272–79
assumptions of, 372; decision t in, 368, 369–71, 371, 374–75, 381; degrees of freedom, 369–70, 373; error and limitations, 375–79, 381–82; logic of, 357–59; for means, 367–75, 385–91; Moving to Opportunity project, 356–57; null hypothesis, 358–59; one-tailed test, 359, 365–67, 368, 381; power of, 378; practical significance in, 379–80, 381; for proportions, 359–64, 382–85, 387–90; SPSS for, 386–91; Stata for, 382–86; statistical significance in, 379–80, 381; steps of, 364–65, 381; t-value in, 327–28, 328, 330–32, 331, 335–36, 341–43, 345, 351, 367–72, 371, 374, 397–98; two-tailed test, 359, 365–67, 366, 373–74, 374; Type I and Type II error, 376–79, 381–82; z-scores in, 367–68, 370, 375, 394
one-tailed test, 359, 365–67, 368
open-ended survey items, 6, 24
operationalization, 4–5, 23 See also measurement
operators in Stata, 31, 31
ordinal variables, 10–11; interquartile range and, 205
Organization for Economic Cooperation and Development (OECD), 64
Output Variable: Name box in SPSS, 40
Output window in SPSS, 34, 34,
35, 36, 44, 44
proportions, 57–60; confidence intervals of, 318–26, 342–43, 345–46, 348–49; differences between, confidence intervals for, 416–18, 419; differences between, hypothesis tests for, 412–16; one-sample hypothesis test, 359–64, 382–85, 387–90; SPSS for calculating confidence levels of, 348–49; SPSS for one-sample hypothesis test, 387–90; SPSS for two-sample hypothesis test, 424–27; Stata for calculating confidence levels of, 345–46; Stata for one-sample hypothesis test, 382–85; Stata for two-sample hypothesis test, 420–22
publication bias toward statistically significant results, 365
publicly available secondary data sets, 15–16
punched cards and data analysis before digital era, 22
quantitative analysis, 4, 23
quantitative methods, 4, 23
random sampling: error, 287; importance of, 280; repeated, 281–84, 295, 297, 300, 316; sampling distributions, 284–87; simple, 13, 24; stratified, 14, 24
range, 205; defined, 205, 220; formula for, 205; SPSS for finding, 225–27; Stata for finding, 221–22
rank ordering, 9–10, 24, 57, 60, 69, 88, 98, 137, 165, 168, 169, 207
rate, 63–64, 84
ratio, 65, 84
ratio-level variables, 9, 23, 24
raw frequency, 56
Recode into Different Variables dialog box in SPSS, 39–40, 40, 52, 100, 100, 120
Recode into Same Variables dialog box in SPSS, 39
recode/recoding, 67–69; bar graphs and pie charts, 71–72; feeling thermometer variable, 71, 71–72; frequency distribution tables, 67–69; SPSS for, 38–42, 39, 40, 41, 98–102; Stata for, 88, 88–90, 90, 94, 119, 158; value labels, SPSS for assigning, 101–2, 102; value labels, Stata for assigning, 90–91, 91, 94
redistributive property of the mean, 164–65
regression analysis, 542–98; Adelman’s research on, 542–43; assumptions of, 571–72; inferential statistics and, 543, 565–66, 571, 579; linear, 550, 554, 561–62, 579–80, 584; logistic, 561–62, 575; multiple, 563–65; statistical inference for, 565–71
regression equation: dummy variables in,
559–61; F-statistic for, 566–68; multiple, 564–65; in null and alternative hypotheses, 567; for a population, 566; r-squared (r2) in, 557–58; slope and intercept in, 554–56; standard error of the slope in, 569
regression line, 550–57 See also linear regression; alternative terms and notation for, 554; in assumptions of regression, 571; best-fitting, 552–53; defined, 550, 573; dummy variables and, 561; equation for, 553, 554; goodness-of-fit measures for, 553–59; in scatterplots, 550–51; slope and intercept of, 553–57, 573; standard error of the estimate, 558–59, 574
regression models: in Adelman’s research, 542–43; goodness-of-fit measures for, 557; for nonlinear relationships, 561–62
Regression Models for Categorical and Limited Dependent Variables (Long), 562
relational operators in Stata, 31, 31
relationships: bivariate, 122, 140–41, 140–41; competing explanations between variables, ruling out, 532–41; in cross-tabulations, testing statistical significance of, 492–500; 2016 presidential election and, 121–22; between two variables, 152–60
relative frequencies, 55, 57–60, 58, 59
relative size, percentages and frequencies for assessing, 59
reliability, 6–9, 8
repeated t-Tests, 447
research process, 3–4, 23; concepts, 4–5, 20, 23; measurement or operationalization, 5, 23; research question, 3–4, 23; sampling, 12–15
research question, 3–4, 23
residuals: in assumptions of regression, 571–72, 572; defined, 553; distribution of, in population, 571–72; homoscedasticity of, 571; independence of, 571; size of, calculating, 553, 558; standardized, in chi-square test, 475–76, 476; sum of squared, 558, 566, 568–71
Results Window in Stata, 26, 26, 27–28
right-tail probability, 259, 260, 264, 264–65, 269; for alternative hypotheses, 383, 387, 402; confidence levels and, 321, 321, 322, 342; defined, 260, 264–65; for IQ scores, 259, 259–60; z-scores and, 269, 319–22, 320, 321, 342
Rivera, Lauren, 399–400, 418
sample: in gathering data, 12–13; non-probability, 13; to
population from, 306–13; populations and, differences between, 392–98; probability, 13; in sampling, 12; statistical notation for, 281
sample size: confidence intervals and, 333–35, 338–41, 343; in relation to confidence interval range, 333–35
sampling, 12–15; aggregate level, 6, 24; cluster, 14; descriptive statistics, 12–13; ecological fallacy, 6, 24; frame, 13; inferential statistics, 13; multistage cluster, 14; non-probability, 13; nonresponse bias, 14; population, 12; probability, 13; sample in, 12; simple random sample, 13; from skewed population, 285; stratified random sample, 14; unit of analysis, 6, 16, 24, 55, 221
sampling distributions, 280–313; in hypothesis testing using ANOVA, 438–39, 439; population parameters in, 281, 284–87; for two-sample hypothesis tests, 403, 403–4
sampling frame, 13
sampling methods, 13–15; cluster sampling, 14; in gathering data, 13–15; multistage cluster sampling, 14; nonresponse bias, 14; sampling frame, 13; simple random sample, 13; steps in, 13; stratified random sampling, 14
saving your work in SPSS, 44
scales, 10, 24
scatterplots, 546–57, 547, 548, 551; best-fitting lines in, 552, 552–53; of curvilinear relationship, 549, 549; defined, 546; points in, 548–49; regression lines in, 550–51, 551; slope and intercept in, 553–57, 555, 556
secondary data, 15–18; big data, 17–18; codebooks, 16, 16–17; defined, 15; publicly available secondary data sets, 15–16
simple random sampling, 13
skewed distributions, 218–19
skewed population, sampling from, 285
slope and intercept, 553–57; calculating, 556–57; defined, 554; statistical software to calculate, 554
social sciences: analysis of variance in, 438; big data in, 17–18; confidence intervals in, 315; differences between groups, 399; explanatory variables in, 513; frequencies in, 55; hypothesis tests in, 375; to measure attributes of people, 171; measurement error in, 6; research question in, 3; in study of statistics, 1; to study relationships among variables, 11; unit of
analysis in,
sources of data not collected directly by researcher: big data, 17–18; secondary data, 15–18
SPSS, 33–44; ANOVA, 450–53; box plot, generating, 229–30, 230; chi-square test, 489–92; commands, 36–44; Compute Variable dialog box, 36–38, 37, 38, 41, 187, 188; confidence intervals, calculating, 346–49; Data Editor, 34, 34–36, 36, 38, 38, 41, 42, 42, 51–52, 101, 120, 270–71, 301; Data View, 35, 38–39, 39, 51–52, 120, 271, 279, 301; Descriptives dialog box, 43–44, 44, 52, 188, 225–26, 226, 230–31, 231, 240, 270–71, 271, 302, 304, 312; drop-down menus, 33, 35, 95; Explore command, 229; Frequencies dialog box, 95–96, 95–96, 190, 190–93, 191, 227, 227–28, 228; frequency distributions, generating, 95–98, 95–98; graphical user interface, 33, 34, 34–36, 36; Input Variable box, 40; interquartile range, finding, 228–29; launching, 33–34; linear regression, checking assumptions of, 585–87; linear relationships between variables, 581–88; means, finding, 188–92; means, two-sample test for difference between, 427–28; measures of central tendency, finding, 187–93; measures of variability, finding, 225–31; median, finding, 192; mode, finding, 192; Numeric Expression box, 38, 38; Old and New Values box, 40, 41; one-sample hypothesis tests, 386–91; Output Variable: Name box, 40; Output window, 34, 34, 35, 36, 44, 44, 97, 97, 103, 106, 188, 270; overview, 33; percentiles, finding, 227–28; probability, 270–71; proportions, two-sample test for difference between, 424–27; range, finding, 225–27; Recode into Different Variables dialog box, 39–40, 40, 52, 100, 100, 109, 120; Recode into Different Variables: Old and New Values, 40, 100, 100; Recode into Same Variables dialog box, 39; saving your work, 44; Split File command, 530–31, 531; standard deviation, finding, 230; statistics and graphs, generating, 95–109; System-missing, 41, 41; Target Variable box, 38, 38; Transform drop-down menu, 36, 37; two-sample hypothesis tests, 424–28; value labels, assigning to recoded
variables, 101–2, 102; Value Labels dialog box, 42, 43, 101; values as “missing” for any variable, 42; variables,
Romney, Mitt, 121
Root Mean Square Error (RMSE), 558–59
r-squared (r2), 557–58
SPSS (continued): analyze existing, 43, 43–44; variables, calculating new, 187–88; variables, controlling for third variable in cross-tabulations, 526–29; variables, creating frequency distributions for, 95, 95–98, 96, 97, 98; variables, creating new, 37–38; variables, effect of independent variable on means of interval-ratio variable, 529–31; variables, names of, 42, 270; variables, recoding, 38–42, 39, 40, 41, 98–101, 99, 100, 100, 101; variables, save standardized values as, 270–71, 271; variables, transform existing, 38–43; variables, values as “missing” for any, 42; Variable View, 34, 35, 36, 38, 39, 41, 42, 42; variance, finding, 230
SPSS practice problems: comparing groups, 434; describing linear relationships between variables, 596–97; differences between samples and populations, 397–98; diversity of values in a group, 239–40; estimating population parameters, 354; examining relationships between two variables, 159–60; getting to know your data, 119–20; introduction, 51–52; probability and the normal distribution, 279; ruling out competing explanations for relationships between variables, 540–41; from sample to population, 312–13; testing mean differences among multiple groups, 461; testing statistical significance of relationships in cross-tabulations, 500; typical values in a group, 202
SPSS procedures, review of: comparing groups, 428; describing linear relationships between variables, 588; differences between samples and populations, 391; diversity of values in a group, 231; estimating population parameters, 349; examining relationships between two variables, 152; getting to know your data, 109; probability and the normal distribution, 271; ruling out competing explanations for relationships between variables, 531; from sample to
population, 305; testing mean differences among multiple groups, 453; testing statistical significance of relationships in cross-tabulations, 492; typical values in a group, 193
spurious relationships, 508–13
stacked bar graph, 140–41, 141
standard deviation (SD), 210–19; “apples and oranges” case, 214–17; to compare individual score to the mean, 217; defined, 210, 220; distributions compared with, 212–14, 213–14; formula for, 211; interval-ratio variables compared with, 214–15; skewed versus symmetric distributions, 218–19; SPSS for finding, 230; squared deviations, 210, 211; Stata for finding, 223–24; summarizing, 210; variance, 211
standard deviations of mean in normal distribution, 256–57
standard error: of the estimate, 558–59; inferential statistics and, 287; of the slope, 568–71
Stata, 25–33 See also tabulate command in Stata; ANOVA, 448–50; bar graphs, generating, 94; Base Reference Manual, 33; bootstrap command, 297–98, 298, 312; box plot, generating, 223, 223; by sort command, 523–24, 524; chi-square test, 487–89; ci command, 344, 345, 353, 344–45; code, basic logic of, 28; coding mistakes, 29; Command Box, 26, 26, 27; commands, 28–30; Command Window, 26, 26; confidence intervals, calculating, 344–46; corr command, 577, 577, 595; Data Editor, 26–27, 27; data set display, 26–27, 27; display command, 268, 268, 269, 269, 279; display invnormal command, 269, 269; do-file, 31–33, 32, 94; drop-down menus, 25; egen command, 182, 185; error messages, 30–31; frequency distributions, generating, 86, 86–88, 87–88, 94; generate command, 28, 29, 51, 89, 119; graph box command, 223, 225, 239; graph command, 91, 223; graphical user interface, 25–26, 26; help sources, 33; histogram command, 92–93, 201, 299, 312, 596; histograms showing percentages, generating, 94; “if” in a command, rule for, 29; interquartile range, finding, 222–23; keep command, 295–96; label define command, 90, 94, 119; label values command, 90, 94, 119; launching, 25; linear regression, checking
assumptions of, 579–80; mean, finding, 183–84; measures of central tendency, finding, 182–86; measures of variability, finding, 221–25; median, finding, 184; mode, finding, 184–86; one-sample hypothesis tests, 382–86; oneway command, 449, 449, 460; operators in, 31, 31; overview, 25; percentiles, finding, 222; predict command, 579, 596; probability, 267–70; proportions, two-sample test for difference between, 420–22; prtest command, 383, 383, 421, 421; pwmean command, 461; range, finding, 221–22; recode command, 89, 119, 158; regress command, 579, 596; replace command, 29, 51; Results Window, 26, 26, 27–28; sample command, 296, 297, 576; saving your work, 33, 94; scatter command, 575–76, 595, 596; standard deviation, finding, 223–24; statistics, generating,
variables, 11–12; experimental control, 11–12; generalizing, 12–15; growth mindset, 18–19; independent variables, 11–12; level of measurement, 9–11; math anxiety, 19–20; measurement error, 6–9; measurements, 4–6, 5, 23–24; overview of book, 20–21; Pager’s study and, 2–3; quantitative analysis, 4, 23; quantitative methods, 4, 23; reliability, 6–9, 8; research process, 3–4, 23; research questions, 3–4; sampling, 12–15; SPSS to generate, 95–109; Stata to generate, 85–94; statistical control, 12; statistical software programs, 21–22; studying, reasons for, 1–3; units of analysis, 6; validity, 6–9, 8; variables, 4–6, 23–24
stem-and-leaf plots, 73–75, 74, 85
stratified random sampling, 14
studying statistics, reasons for, 1–3
summarize command in Stata, 30
sum of squared residual, 558, 566, 568–71
sums of squares (SS): variation between groups, 442, 443, 443–44; variation within groups, 442, 442
Survey of Income and Program Participation (SIPP), 254, 285
symmetric distributions, 218–19
System-missing in SPSS, 41, 41
tabulate command in Stata: to conduct ANOVA, 449, 449–50; to draw random samples from larger sample, 296, 296; to find mode, 185, 185–86, 186, 201; to generate cross-tabulation,
143–46, 144, 145, 146, 158, 499, 521, 521–23, 522, 540; to generate frequency distributions, 86, 86–88, 87–88, 94, 119, 185, 185–86, 186, 296, 296, 300; for recoding variables, 88, 88; to run chi-square test, 487–89, 488
Target Variable box in SPSS, 38, 38
t-distributions: confidence interval for means and, 326–29, 327; degrees of freedom and, 327, 327–28, 328
testing statistical significance of relationships in cross-tabulations: chi-square test for goodness-of-fit, 483–85, 484, 485; chi-square test of independence, 467–69, 471, 478–79, 479, 481, 483, 486; expected frequencies, 470–71, 473, 476, 481, 483, 484, 486, 487–88, 490; observed frequencies, 469–72, 475–76, 478, 481, 484, 486–88, 490, 492
third variables See control variables
Tilcsik, András, 399–400, 418
time series charts, 76, 76–77, 85
Transform drop-down menu in SPSS, 36, 37
85–94; statistics and graphs, generating, 85–94; summarize command, 30, 30, 51, 183, 183, 201, 224, 224, 296, 296, 298, 298, 312; tabstat command, 183–84, 184, 185, 186, 201, 221, 222, 222, 223, 224, 224, 239; Tools menu, 33; ttest command, 397, 422–23, 423; two-sample hypothesis tests, 420–24; User’s Guide, 33; variables, analyze existing, 30; variables, controlling for third variable in cross-tabulations, 520–23; variables, create new, 28–29, 87, 94, 182–83; variables, creating and attaching labels to categories of, 94; variables, effect of independent variable on means of interval-ratio variable, 523–24; variables, linear relationships between, 575–81; variables, names of, 28–29; variables, transform existing, 29–30; Variables Window, 26, 26; variance, finding, 223–24; z-scores, finding, 267–70
Stata practice problems: comparing groups, 433; describing linear relationships between variables, 595–96; differences between samples and populations, 397; diversity of values in a group, 239; estimating population parameters, 353–54; examining relationships between two variables, 158–59; getting to know your data, 119;
introduction, 51; probability and the normal distribution, 279; ruling out competing explanations for relationships between variables, 540; from sample to population, 312; testing mean differences among multiple groups, 460–61; testing statistical significance of relationships in cross-tabulations, 499–500; typical values in a group, 201
statistical control, 12
statistical inference for regression, 565–71; alpha-level (α) in, 543, 567–70; decision f in, 567–68, 574; decision t in, 569–71, 575; F-statistic in, 566–68
statistical interaction between variables, 515–17, 516, 520
statistical notation: for regression line, 554; for sample, 281; for samples and populations, 281; for two-sample hypothesis tests, 402–3
statistical significance: defined, 358, 381; in one-sample hypothesis test, 358, 379–80, 381; publication bias toward, 365; of relationships in cross-tabulations, testing, 492–500; in two-sample hypothesis tests, 418
statistical software programs, 21–22
statistics: causal relationships, 11–12; data, gathering, 12–15; data, secondary, 15–18; dependent
Trump, Donald, 121–22, 208–10, 209, 209, 272, 272
t-Table: in chi-square test, 474; in linear relations, 569–70; in one-sample hypothesis test, 328, 328, 332, 336, 369–70, 373, 375, 381; in two-sample hypothesis test, 406, 407, 412, 419
t-Tests, 447
Tufte, Edward R., 501–3, 550
Tukey’s test, 446–47, 447, 448, 458
t-value: in ANOVA, 446–47, 453; in linear relations, 543, 569, 571, 575; in one-sample hypothesis test, 327–28, 328, 330–32, 331, 335–36, 341–43, 345, 351, 367–72, 371, 374, 397–98; in two-sample hypothesis test, 406, 412, 419, 423, 434
Tversky, Amos, 241–43, 251–52
two-sample hypothesis tests, 399–434; alpha-level (α) in, 400, 405–6, 408, 412, 413, 415, 417, 419, 421, 422, 427; alternative hypothesis in, 402; ANOVA in, 438–39; decision t in, 405, 405–9, 406, 409, 419, 423; defined, 401, 419; degrees of freedom and, 404, 405–7, 406, 412, 419, 423; for means, differences between, 404–12,
419, 422–23, 427–28; null hypothesis in, 401–2; practical significance in, 418; for proportions, differences between, 409–11, 412–17, 420–22, 424–27; sampling distribution for, 403, 403–4; SPSS for, 424–28; Stata for, 420–24; statistical notation for, 402–3; statistical significance in, 418; steps for conducting, 419; t-value in, 406, 412, 419, 423, 434; two-tailed hypothesis test in, 402, 406, 414; z-scores in, 413–15, 414, 417, 419, 429, 431, 433
two-tailed hypothesis test: confidence intervals and, equivalence of, 367; in one-sample hypothesis test, 359, 365–67, 366; in two-sample hypothesis test, 402, 406, 414
Type I and Type II error, 376–79, 381–82
unit of analysis, 6, 16, 24, 55, 221
univariate statistics, 55
upper bounds, 321, 323, 324, 324, 329–32, 331, 334, 335, 336, 338, 342, 343
User’s Guide for Stata, 33
validity, 6–9, 8
Value Labels dialog box in SPSS, 42, 43
value labels for recoding variables: SPSS for assigning, 101–2, 102; Stata for assigning, 90–91, 91, 94
values: in groups, diversity of, 231–40; treating as “missing” for any variable in SPSS, 42; typical, in groups, 193–202
variability, 167
variability, measures of See measures of variability
variable names: SPSS, 270; Stata, 28–29
variables, 4–6, 23–24 See also control variables; linear relationships between variables; analyze existing, in SPSS, 43, 43–44; analyze existing, in Stata, 30; antecedent, 509, 509, 513, 520; association between, 506–7; categories of, cases in, 55; in causal relationships, independent and dependent, 11; closed-ended survey items, 5–6, 24; comparing two groups on same, 77–81; competing explanations for relationships between, ruling out, 532–41; continuous, 9–10; covariance of two, formula for, 545, 546; creating new, SPSS for, 37–38; creating new, Stata for, 28–29, 87, 94; dependent, 11–12; describing linear relationships between, 588–97; discrete, 9–10; displays of single, 69–75; dummy, 559–61; examining relationships between two, 152–60;
explanatory, in social sciences, 513; formula for covariance of two, 545; frequency distributions for, Stata for creating, 86, 86–88, 87–88; ideological identification, 66, 67–68, 124, 124–26, 125, 134; independent, 11–12; independent effect of, in causal relationships, 513, 520; interval-ratio-level, 60, 71, 85, 481, 523, 529; intervening, 509, 510, 513, 520; mediating, 509, 513, 520; moderating, 515, 520; name of, adding full label to in SPSS, 42; nominal-level, 10–11, 165; non-spurious relationships in, modeling, 513, 513–17; open-ended survey items, 6, 24; ordinal-level, 10–11; ratio-level, 9, 23, 24; recoding, Stata for, 88, 88–90, 90; ruling out competing explanations for relationships between, 532–41; social sciences to study relationships among, 11; spurious relationships in, modeling, 503, 503–13, 506, 508–13; statistical interaction between, 515–17, 516, 520; transform existing, in SPSS, 38–43; transform existing, in Stata, 29–30; validity, 6–9, 8; variability, 167
Variables Window in Stata, 26, 26
Variable View in SPSS, 34, 35, 36, 38, 39, 41, 42, 42
variance: defined, 211, 220; finding, 211; SPSS for finding, 230; Stata for finding, 223–24
Venn diagrams, 563, 563, 594, 594
vertical axis, 69, 73, 82–83, 82–83, 105
visual representations of data See graphical representations of data
x axis, 76, 106, 108, 393, 546, 548, 551, 561, 575, 581, 588
y axis, 76, 108, 299, 546, 548, 551, 554, 555, 556, 561, 575, 581, 588, 592
Yerkes-Dodson Law, 20
z-scores: calculating, probability and, 258–65; in chi-square test, 481; confidence levels and, 321, 321, 322, 342; defined, 258, 266; in normal tables, 259, 260–67, 271, 291, 319–22, 320, 328, 362–63, 365, 260, 415, 481; in one-sample hypothesis test, 367–68, 370, 375, 394; right-tail probability and, 269, 319–22, 320, 321, 342; Stata for finding, 267–70; in two-sample hypothesis test, 413–15, 414, 417, 419, 429, 431, 433
Warner, Cody, 161–62, 204
World Values Survey (WVS), 15, 16, 128, 143, 147, 158, 159,
201, 202, 221, 225, 244, 247, 281–82, 289–91, 290, 295, 297, 300, 312–13, 353–54, 397