Harnessing the Power of Statistics

It is the things that vary that interest us. Things that do not vary are inherently boring. Winter weather in Miami, Florida, may be more pleasant than winter weather in Clay Center, Kansas, but it is not as much fun to talk about. Clay Center, with its variations in wind, precipitation, and temperature, has a lot more going on in its atmosphere.

Or take an extreme case of low variation. You would not get much readership for a story about the number of heads on the typical human being. Since we are all one-headed and there is no variance to ponder or explain or analyze, the quantitative analysis of number of heads per human gets dull rather quickly. Only if someone were to notice an unexpected number of two-headed persons in the population would it be interesting. Number of heads would then become a variable.

On the other hand, consider human intelligence as measured by, say, the Stanford-Binet IQ test. It varies a lot, and the sources of the variation are of endless fascination. News writers and policy makers alike are always wondering how much of the variation is caused by heredity and how much by environment, whether it can be changed, and whether it correlates with such things as athletic ability, ethnic category, birth order, and other interesting variables.

Variance, then, makes news. And in any statistical analysis, the first thing we generally want to know is whether the phenomenon we are studying is a variable, and, if so, how much and in what way it varies. Once we have that figured out, we are usually interested in finding the sources of the variance. Ideally, we would hope to find what causes the variance. But causation is difficult to prove, and we often must settle for discovering what correlates or covaries with the variable in which we are interested. Because causation is so tricky to establish, statisticians use some weasel words that mean almost, but not quite, the same thing. If two interesting phenomena covary (meaning
that they vary together), they say that one depends on the other or that one explains the other. These are concepts that come close to the idea of causation but stop short of it, and rightly so. For example, how well you perform in college may depend on your entrance test scores. But the test scores are not the cause of that performance. They merely help explain it by indicating the level of underlying ability that is the cause of both test scores and college performance.

Statistical applications in both journalism and science are aimed at finding causes, but so much caution is required in making claims of causation that the more modest concepts are used much more freely. Modesty is becoming, so think of statistics as a quest for the unexplained variance. It is a concept that you will become more comfortable with, and, in time, it may even seem romantic.

Measuring variance

There are two ways to use statistics. You can cookbook your way through, applying formulas without fully understanding why or how they work. Or you can develop an intuitive sense for what is going on. The cookbook route can be easy and fast, but to really improve your understanding, you will have to get some concepts at the intuitive level.

Because the concept of variance is so basic to statistics, it is worth spending some time to get it at the intuitive level. If you see the difference between low variance (number of human heads) and high variance (human intelligence), your intuitive understanding is well started.

Now let's think of some ways to measure variance. A measure has to start with a baseline. (Remember the comedian who is asked, “How is your wife?” His reply: “Compared to what?”) In measuring variance, the logical “compared to what” is the central tendency, and the convenient measure of central tendency is the arithmetic average or mean. Or you could think in terms of probabilities, like a poker player, and use the expected value.

Start with the simplest possible variable, one that varies across only
two conditions: zero or one, white or black, present or absent, dead or alive, boy or girl. Such variables are encountered often enough in real life that statisticians have a term for them. They are called dichotomous variables. Another descriptive word for them is binary. Everything in the population being considered is either one or the other. There are two possibilities, no more.

An interesting dichotomous variable in present-day American society is minority status. Policies aimed at improving the status of minorities require that each citizen be first classified as either a minority or a nonminority. (We'll skip for now the possible complications of doing that.) Now picture two towns, one in the rural Midwest and one in the rural South. The former is 2 percent minority and the latter is 40 percent minority. Which population has the greater variance?

With just a little bit of reflection, you will see that the midwestern town does not have much variance in its racial makeup. It is 98 percent nonminority. The southern town has a lot more variety, and so it is relatively high in racial variance.

Here is another way to think about the difference. If you knew the racial distribution in the midwestern town and had to guess the category of a random person, you would guess that the person is a nonminority, and you would have a 98 percent chance of being right. In the southern town, you would make the same guess, but would be much less certain of being right. Variance, then, is related to the concept of uncertainty. This will prove to be important later on when we consider the arithmetic of sampling.

For now, what you need to know is that:

  Variance is interesting.
  Variance is different for different variables and in different populations.
  The amount of variance is easily quantified. (We'll soon see how.)
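The two-town comparison can be put in numbers. What follows is a minimal sketch, assuming the standard result (not yet stated in the text) that a variable coded 0 or 1, with a proportion p of ones, has variance p(1 - p). The function name is illustrative, and the midwestern figure of 2 percent minority is inferred from the 98 percent nonminority share given above.

```python
def binary_variance(p):
    """Variance of a 0/1 variable in which a proportion p of cases equal 1."""
    return p * (1 - p)

# Assumption: 2 percent minority, inferred from "98 percent nonminority".
midwest = binary_variance(0.02)
# 40 percent minority, as stated in the text.
south = binary_variance(0.40)

print(f"Midwestern town variance: {midwest:.4f}")
print(f"Southern town variance:   {south:.4f}")
```

Under this formula the variance is largest, at .25, when the split is 50-50, which is also the point at which a guess about a random person is most uncertain.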
A continuous variable

Now to leap beyond the dichotomous case. Let's make it a big leap and consider a variable that can have an unlimited number of divisions. Instead of just 0 or 1, it can go from 0 to infinity. Or from 0 to some finite number but with an infinite number of divisions within the finite range. Making this stuff up is too hard, so let's use real data: the frequency of misspelling “minuscule” as “miniscule” in nine large and prestigious news organizations archived in the VU/TEXT and NEXIS computer databases for the first half of calendar 1989.

  Miami Herald             2.5%
  Los Angeles Times        2.9
  Philadelphia Inquirer    4.0
  Washington Post          4.5
  Boston Globe             4.8
  New York Times          11.0
  Chicago Tribune         19.6
  Newsday                 25.0
  Detroit Free Press      30.0

Just by eyeballing the list, you can see a lot of variance there. The worst-spelling paper on the list has more than ten times the rate of misspelling as the best-spelling paper. And that method of measuring variance, taking the ratio of the extremes, is an intuitively satisfying one. But it is a rough measure because it does not use all of the information in the list. So let's measure variance the way statisticians do. First they find a reference point (a compared-to-what) by calculating the mean, which is the sum of the values divided by the number of cases. The mean for these nine cases is 11.6. In other words, the average newspaper on this list gets “minuscule” wrong 11.6 percent of the time. When we talk about variance we are really talking about variance around (or variance from) the mean. Next, do the following:

  1. Take the value of each case and subtract the mean to get the difference.
  2. Square that difference for each case.
  3. Add to get the sum of all those squared differences.
  4. Divide the result by the number of cases.

That is quite a long and detailed list. If this were a statistics text, you would get an equation instead. You would like the equation even less than the above list. Trust me.

So do all of the above, and the result is the variance in this case. It works
out to about 100, give or take a point. (Approximations are appropriate because the values in the table have been rounded.) But 100 what? How do we give this number some intuitive usefulness? Well, the first thing to remember is that variance is an absolute, not a relative concept. For it to make intuitive sense, you need to be able to relate it to something, and we are getting close to a way to do that. If we take the square root of the variance (reasonable enough, because it is derived from a listing of squared differences), we get a wonderfully useful statistic called the standard deviation around the mean. Or just standard deviation for short. And the number you compare it to is the mean.

In this case, the mean is 11.6 and the standard deviation is 10, which means that there is a lot of variation around that mean. In a large population whose values follow the classic bell-shaped normal distribution, about two-thirds of all the cases will fall within one standard deviation of the mean. So if the standard deviation is a small value relative to the value of the mean, it means that variance is small, i.e., most of the cases are clumped tightly around the mean. If the standard deviation is a large value relative to the mean, then the variance is relatively large.

In the case at hand, variation in the rate of misspelling of “minuscule,” the variance is quite large, with only one case anywhere close to the mean. The cases on either side of it are at roughly half the mean and double the mean. Now that's variance!
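The four steps, followed by the square root, can be sketched in a few lines. This is only an illustration using the nine rounded rates from the table, with the sum of squares divided by the number of cases as the text describes. Many statistics texts divide by one less than the number of cases when the data are treated as a sample, so other tools may report slightly larger figures.

```python
from math import sqrt

# Misspelling rates (percent) for the nine newspapers, as rounded in the table.
rates = [2.5, 2.9, 4.0, 4.5, 4.8, 11.0, 19.6, 25.0, 30.0]

# Reference point: the mean.
mean = sum(rates) / len(rates)

# Steps 1 and 2: subtract the mean from each case, then square the difference.
squared_diffs = [(r - mean) ** 2 for r in rates]

# Steps 3 and 4: add the squared differences and divide by the number of cases.
variance = sum(squared_diffs) / len(rates)

# The standard deviation is the square root of the variance.
std_dev = sqrt(variance)

print(f"mean = {mean:.1f}")
print(f"variance = {variance:.1f}")
print(f"standard deviation = {std_dev:.1f}")
```

With the rounded inputs this gives a variance just under 100 and a standard deviation of about 10, in line with the approximations in the text.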
For contrast, let us consider the circulation size of each of these same newspapers.[1]

  Miami Herald               416,196
  Los Angeles Times        1,116,334
  Philadelphia Inquirer      502,756
  Washington Post            769,318
  Boston Globe               509,060
  New York Times           1,038,829
  Chicago Tribune            715,618
  Newsday                    680,926
  Detroit Free Press         629,065

The mean circulation for this group of nine is 708,678 and the standard deviation around that mean is 238,174. So here we have relatively less variance. In a large number of normally distributed cases like these, two-thirds would lie fairly close to the mean, within a third of the mean's value.

One way to get a good picture of the shape of a distribution, including the amount of variance, is with a graph called a histogram. Let's start with a mental picture. Intelligence, as measured with standard IQ tests, has a mean of 100 and a standard deviation of 16. So imagine a Kansas wheat field with the stubble burned off, ready for plowing, on which thousands of IQ-tested Kansans have assembled. Each of these Kansans knows his or her IQ score, and there is a straight line on the field marked with numbers at one-meter intervals from 0 to 200. At the sounding of a trumpet, each Kansan obligingly lines up facing the marker indicating his or her IQ. Look at Figure 3A. A living histogram!
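The living histogram can be approximated on screen. The sketch below assumes the idealized normal curve with the mean of 100 and standard deviation of 16 given above; the marker spacing and the scale of one `#` per 0.1 percent of the crowd are arbitrary choices made for illustration. It uses `NormalDist` from Python's standard `statistics` module.

```python
from statistics import NormalDist

# Idealized IQ distribution: mean 100, standard deviation 16 (from the text).
iq = NormalDist(mu=100, sigma=16)

def share_at(marker, width=1.0):
    """Expected share of people whose IQ falls within width/2 of the marker."""
    return iq.cdf(marker + width / 2) - iq.cdf(marker - width / 2)

# One row per selected marker; each '#' stands for about 0.1 percent of the crowd.
for marker in range(52, 149, 8):
    line = "#" * round(share_at(marker) * 1000)
    print(f"{marker:3d} | {line}")

# Share of the crowd standing within one standard deviation of the mean.
within_one_sd = iq.cdf(116) - iq.cdf(84)
print(f"Share between 84 and 116: {within_one_sd:.3f}")
```

The longest row is the one at 100, the rows shorten symmetrically on either side, and the share between 84 and 116 comes out at roughly two-thirds.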
Because IQ is normally distributed, the longest line will be at the 100 marker, and the length of the lines will taper gradually toward the extremes. Some of the lines have been left out to make the histogram easier to draw. If you were to fly over that field in a blimp at high altitude, you might not notice the lines at all. You would just see a curved shape as in Figure 3B. This curve is defined by a series of distinct lines, but statisticians prefer to think of it as a smooth curve, which is okay with us. We don't notice the little steps from one line of people to the next, just as we don't notice the dots in a halftone engraving.

But now you see the logic of the standard deviation. By measuring outward in both directions from the mean with the standard deviation as your unit of measurement, you can define a specific area of the space under the curve. Just draw two perpendiculars from the baseline to the curve. If those perpendiculars are each one standard deviation (16 IQ points) from the mean, you will have counted off two-thirds of the people in the wheat field. Two-thirds of the population has an IQ between 84 and 116.

For that matter, you could go out about two standard deviations (1.96 if you want to be precise) and know that you had included 95 percent of the people, for 95 percent of the population has an IQ between 68 and 132.

Figures 3C and 3D are histograms based on real data. When you are investigating a body of data for the first time, the first thing you are going to want is a general picture in your head of its distribution. Does it look like the normal curve? Or does it have two bumps instead of one, meaning that it is bimodal? Is the bump about in the center, or does it lean in one direction with a long tail running off in the other direction?
The tail indicates skewness and suggests that using the mean to summarize that particular set of data carries the risk of being overly influenced by those extreme cases in the tail. A statistical innovator named John Tukey has invented a way of sizing up a data set by hand.[2] You can do it on the back of an old envelope in one of the dusty attics where interesting records are sometimes kept. Let's try it out on the spelling data cited above, but this time with 38 newspapers.

Spelling Error Rates: Newspapers Sorted by Frequency of Misspelling “Minuscule”

  Paper                        Error Rate
  Akron Beacon Journal           .00000
  Gary Post Tribune              .00000
  Lexington Herald Leader        .00000
  Sacramento Bee                 .00000
  San Jose Mercury News          .00000
  Arizona Republic               .01961
  Miami Herald                   .02500
  Los Angeles Times              .02857
  St Paul Pioneer Press          .03333
  Philadelphia Inquirer          .04000
  Charlotte Observer             .04167
  Washington Post                .04545
  Boston Globe                   .04762
  St Louis Post Dispatch         .05128
  Journal of Commerce            .08696
  Allentown Morning Call         .09091
  Wichita Eagle                  .10526
  Atlanta Constitution           .10714
  New York Times                 .11000
  Fresno Bee                     .13793
  Orlando Sentinel               .13793
  Palm Beach Post                .15385
  Seattle Post-Intelligencer     .15789
  Chicago Tribune                .19643
  Los Angeles Daily News         .22222
  Newsday                        .25000
  Newark Star-Ledger             .25000
  Ft Lauderdale News             .26667
  Columbus Dispatch              .28571
  Philadelphia Daily News        .29412
  Detroit Free Press             .30000
  Richmond News Leader           .31579
  Anchorage Daily News           .33333
  Houston Post                   .34615
  Rocky Mountain News            .36364
  Albany Times Union             .45455
  Columbia State                 .55556
  Annapolis Capital              .85714

Tukey calls his organizing scheme a stem-and-leaf chart. The stem shows, in shorthand form, the data categories arranged along a vertical line. An appropriate stem for these data would set the categories at 0 to 9, representing, in groups of 10 percentage points, the misspell rate for “minuscule.” The result looks like this:

  0 | 0, 0, 0, 0, 0, 2, 2, 3, 3, 4, 4, 5, 5, 5, 9, 9
  1 | 1, 1, 1, 4, 4, 5, 6
  2 | 0, 2, 5, 5, 7, 9, 9
  3 | 0, 2, 3, 5, 6
  4 | 5
  5 | 6
  6 |
  7 |
  8 | 6
  9 |

The first line holds values from 0 to 9, the second from
11 to 16, etc. The stem-and-leaf chart is really a histogram that preserves the original values, rounded here to the nearest full percentage point. It tells us something that was not obvious from eyeballing the list. Most papers are pretty good at spelling. The distribution is not normal, and it is skewed by a few extremely poor spellers. Both the interested scientist and the interested journalist would quickly want to investigate the extreme cases and find what made them that way. The paper that misspelled “minuscule” 86 percent of the time, the Annapolis Capital, had no spell-checker in its computer editing system at the time these data were collected (although one was on order).

Here is another example. The following numbers represent the circulation figures of the same newspapers in thousands:

  221, 76, 119, 244, 272, 315, 416, 1116, 193, 503,
  231, 769, 509, 372, 24, 136, 120, 275, 1039, 145,
  255, 156, 237, 716, 171, 681, 462, 190, 254, 235,
  629, 140, 56, 318, 345, 106, 136, 42

See the pattern there?
Not likely. But put them into a stem-and-leaf chart and you see that what you have is a distribution skewed to the high side. Here's how to read it. The numbers on the leaf part (right of the vertical line) have been rounded to the second significant figure of the circulation number, or tens of thousands in this case. The number on the stem is the first figure. Thus the circulation figures in the first row are 20,000, 40,000, 60,000 and 80,000. In the second row, we have 120,000, 190,000, 140,000 and so on. Toward the bottom of the stem, we run into the millions, and so a 1 has been added to the left of the stem to signify that the digit is added here. These represent rounded circulation figures of 1,040,000 (The New York Times) and 1,120,000 (the Los Angeles Times) respectively.

   0 | 8, 2, 6, 4
   1 | 2, 9, 4, 2, 5, 6, 7, 9, 4, 1, 4
   2 | 2, 4, 7, 3, 8, 6, 4, 5, 4
   3 | 2, 7, 2, 5
   4 | 2, 6
   5 | 0, 1
   6 | 8, 3
   7 | 7, 2
   8 |
   9 |
  10 | 4
  11 | 2

Notice that in our first example, the misspelling rate for “minuscule,” we started with a list that had already been sorted, and so the values on each leaf were in ascending order. In the second case, we were dealing with a random assortment of numbers more like the arrays you will encounter in real life. The stem-and-leaf puts them in enough order so that you can very quickly calculate the median if you want. Just pencil in another column of numbers that accumulates the cases row by row.

   0 | 8, 2, 6, 4                          4
   1 | 2, 9, 4, 2, 5, 6, 7, 9, 4, 1, 4    15
   2 | 2, 4, 7, 3, 8, 6, 4, 5, 4          24
   3 | 2, 7, 2, 5                         28
   4 | 2, 6                               30
   5 | 0, 1                               32
   6 | 8, 3                               34
   7 | 7, 2                               36
   8 |
   9 |
  10 | 4                                  37
  11 | 2                                  38

Because there are 38 observations, the median will lie between the 19th and 20th. The 19th case would be the fourth highest in the row representing the 200,000 range. [...]

By you are projecting to past years and maybe even to future years. You might even think of your one-year data set as a sample of an infinite universe of all possible years and all possible Division I schools.

The bottom line for journalistic applications: whenever you have a situation
where someone is likely to challenge your results by claiming coincidence, use chi-square or a related test to find out how big a coincidence it takes to explain what you have.

Chi-square belongs to a large family of statistical tests called significance tests. All yield a significance level, which is just the probability of getting, by chance alone, a difference of the magnitude you found. Therefore, the lower the probability, the greater the significance. If p = .05, it means the distribution is the sort that chance could produce in five cases out of 100. If you are planning to base a lead on your hypothesis and want to find significance, then the smaller the probability number the better. (A big coincidence is an event with a low probability of happening.)

In addition to chi-square, there is one other significance test you are likely to need sooner or later. It is a test for comparing the differences between two means. It is called Student's t, or the t-test for short. There are two basic forms: one for comparing the means of two groups (independent samples) and one for comparing the means of two variables in the same group (paired samples). This test is not as easy to calculate by hand as chi-square. If you want to learn how, consult a statistics text. All the good statistical packages for computers have t-tests as standard offerings.

One final point about significance tests: low probability (i.e., high significance) is not always the same thing as important. Low probability events are, paradoxically, quite commonplace, especially if you define them after the fact. Here is a thought experiment. Make a list of the first five people you passed on the street or the campus or the most recent public place where you walked. Now think back to where you were one year ago today. Projecting ahead a year, what would have been the probability that all the random events in the lives of those five people would have brought them into your line of vision in that particular order on this
particular day? Quite remote, of course. But it doesn't mean anything, because there was nothing to predict it.

Now suppose you had met a psychic with a crystal ball, and she had written the names of those five people on a piece of paper, sealed it in an envelope, and given you the envelope to open one year later. If you did and her prediction proved to be true, that would have led you to search for explanations other than coincidence. That's what statistical significance does for you.

When unusual events happen it is not their unusualness alone that makes them important. It is how they fit into a larger picture as part of a theoretical model that gives them importance. Remember Rick (played by Humphrey Bogart) in the film Casablanca when he pounds the table? “Of all the gin joints in all the towns in all the world, she walks into mine,” laments Rick. The coincidence is important only because he and the woman who walked in had a history with unresolved conflict. Her appearance fit into a larger pattern. Most improbable events are meaningless because they don't fit into a larger pattern. One way to test for the fit of an unusual event in a larger pattern is by using it to test a theory's predictive power. In science and in journalism, one looks for the fit.

Continuous variables

You have noticed by now that we have been dealing with two different ways of measuring variables. In the Detroit riot table, we measured by classifying people into discrete categories: northerner or southerner, rioter or non-rioter. But when we measured the error rate for “minuscule” at 38 different newspapers, the measure was a continuum, ranging from zero (the Akron Beacon Journal) to 86 percent (the Annapolis Capital).

Most statistics textbooks suggest four or five kinds of measurement, but the basic distinction is between categorical and continuous. There is one kind that is a hybrid of the two. It is called ordinal measurement. If you can put the things you are measuring in some kind of rank order without
knowing the exact value of the continuous variable on which you are ordering them, you have something that gives more information than a categorical measure but less than a continuous one. In fact, you can order the ways of measuring things by the amount of information they involve. From lowest to highest, they are:

  Categorical (also called nominal)
  Ordinal (ranking)
  Continuous (also called interval unless it has a zero point to anchor it, in which case it is called ratio)

Categorical measures are the most convenient for journalism because they are easiest to explain. But the others are often useful because of the additional information about relative magnitude that they contain. When collecting data, it is often a good idea to try for the most information that you can reasonably get. You can always downgrade it in the analysis.

In the Detroit case, we used categorical measures to show how two conditions, northernness and rioting, occur together more often than can readily be explained by chance. If the rioters in Detroit had been measured by how many hours and minutes they spent rioting, a nice continuous measure of intensity would have resulted. And that measure could easily have been converted to an ordinal or categorical measure just by setting cutting points for classification purposes. The Detroit data collection did not do that, however, and there is no way to move in the other direction and convert a categorical measure to a continuous one, because that would require additional information that the categorical measure does not pick up.

Continuous measures are very good for doing what we set out to do in this section, and that is see how two variables vary together. When you have continuous measurement, you can make more powerful comparisons by finding out whether one thing varies in a given direction and (here's the good part) to a given degree when the other thing varies.

Time for an example. When USA Today was getting ready to cover the release of 1990 census data, the
special projects team acquired a computer file of 1980 census data for Wyoming. This state was chosen because it is small, easy for the census bureau to enumerate, and usually first out of the chute when census data are released. So USA Today used Wyoming to shake down its analysis procedures.

Because the census uses geographic areas as its basic units of analysis, virtually all of its data are continuous. A county has so many blacks, so many farmers, so many persons of Irish ancestry under the age of [...], and on and on. Here are two continuous measures of counties in Wyoming: one is the percent of single-person household members who are female; the other is the percent of persons living in the same house they lived in five years earlier.

       Wyoming Counties    Single-Female Rate    Stability Rate
   1   Albany                      47                  28
   2   Big Horn                    56                  46
   3   Campbell                    34                  20
   4   Carbon                      43                  33
   5   Converse                    46                  24
   6   Crook                       46                  41
   7   Fremont                     47                  39
   8   Goshen                      62                  48
   9   Hot Springs                 58                  37
  10   Johnson                     56                  38
  11   Laramie                     53                  37
  12   Lincoln                     46                  43
  13   Natrona                     47                  34
  14   Niobrara                    63                  46
  15   Park                        56                  40
  16   Platte                      44                  30
  17   Sheridan                    58                  38
  18   Sublette                    40                  42
  19   Sweetwater                  36                  32
  20   Teton                       45                  22
  21   Uinta                       42                  36
  22   Washakie                    51                  32
  23   Weston                      51                  44

These variables are treated in the database as characteristics of counties, not of people, so let's give them names that will help us remember that:

  Dependent variable: Single-Female Rate. Defined as the number out of every 100 people living alone who are female.
  Independent variable: Stability Rate. Defined as the number out of every 100 persons who lived at the same address five years earlier.

If single females are less mobile than single males, then these two variables might be related. In other words, we might expect that counties where the single people are mostly female would be more stable than counties where the single people are more likely to be males. We shall now explore some different ways of checking that possibility. Is there an association between these two variables?
One way to find out is to take a piece of graph paper and plot each county's position so that female ratio is on the vertical axis and percent in same house is on the horizontal axis. Then you can see if the counties arrayed by these variables form some kind of a pattern. And they do! The plot is in Figure 3E. They tend to cluster around an imaginary diagonal line running upward from left to right. Just by inspection, you can see enough of a relationship to justify saying something like this: in general, among Wyoming counties, the greater the proportion of females in single-person households, the higher the proportion of people who have lived in the same house for five years.

The fact that the points on the plot cluster around a straight line shows that the general linear model can be applied here. The general linear model (GLM) is a way of describing a great many kinds of interconnectedness in data. Its basic statistic is the correlation coefficient. Read a statistics text if you want to know how to do one by hand. A good pocket calculator (with a statistics function) or a computer is easier.[6] Here is what you need to know about the correlation coefficient:

  Its range is from -1 to 1. The farther it gets from zero, the greater the correlation, i.e., the closer the points on the plot are to a straight line.
  In a negative correlation the line slants down from left to right: the X variable (horizontal axis) gets smaller when the Y variable (vertical) gets bigger.
  In a positive correlation the line slants up: the two variables get bigger together.
  Correlation is a rather precise expression of covariance or the ability of one variable to predict or explain another. (These are the weasel words we use to keep from claiming causation, remember?)
  The square of the correlation coefficient gives you the amount of variance in one variable that is statistically accounted for or “explained” by variance in the other.

Let's make this less abstract. Look at the plot in Figure 3E again. The correlation between the two variables is .613. (At 1.0 the dots would form a perfect straight line.) And .613 times .613 is .38. Thus, the variance explained (the correlation coefficient squared) is 38 percent. So 38 percent of the variation in home stability is explained by the variation in the rate of female single-person households.

What about the rest of the variance? Much of it might be explained by other things that could be measured. You will never get all of it because some of the variance is due to measurement error. Explaining 38 percent of the variance may not sound like much, but in social science, anybody who can explain as much as 10 percent (a correlation of about .32) usually feels pretty good about it.

This concept of variance explained is so important that you need to have an intuitive sense of its meaning. Here is another way to think about it. The correlation coefficient comes with an equation that describes the straight line. The general form of the equation is Y = C + BX. The particular equation in this case is Y = 27 + .62X. It means that for every 1 percent increase in the percent who have lived in the same house for five years there is, on average, a .62 percent increase in the rate of single-person female households. (The 27 is the regression constant. It is what Y is worth when X = 0. In other words, it is where the graph starts on the Y axis.)
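The correlation and the regression line can be recomputed from the county table. The sketch below is illustrative only: it uses the rounded whole-number rates as printed, so its results land close to the figures quoted in the text (a correlation of about .61 instead of .613) without matching them exactly. The variable names are my own.

```python
from math import sqrt

# (single-female rate, stability rate) for the 23 Wyoming counties, in table order.
counties = [
    (47, 28), (56, 46), (34, 20), (43, 33), (46, 24), (46, 41), (47, 39),
    (62, 48), (58, 37), (56, 38), (53, 37), (46, 43), (47, 34), (63, 46),
    (56, 40), (44, 30), (58, 38), (40, 42), (36, 32), (45, 22), (42, 36),
    (51, 32), (51, 44),
]

y = [single_female for single_female, _ in counties]  # dependent variable
x = [stability for _, stability in counties]          # independent variable
n = len(counties)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of cross-products and of squared deviations around the means.
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

r = sxy / sqrt(sxx * syy)            # correlation coefficient
slope = sxy / sxx                    # B in Y = C + BX
intercept = mean_y - slope * mean_x  # C, the regression constant

print(f"r = {r:.3f}, variance explained = {r * r:.2f}")
print(f"Y = {intercept:.1f} + {slope:.2f}X")
```

The slope of about .62 and the constant of about 27 agree with the equation given above once rounding is taken into account.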
You see statements like this in newspapers all the time, usually when economists are being quoted. For every Y increase in the unemployment rate, they say, retail sales decrease by X percent. There is a correlation and a regression equation behind that statement. Such statements have all kinds of uses because they enable a prediction to be made about one variable when the value of the other variable is known. When we talk about “variance explained,” we're talking about making a good prediction.

To get a better grip on this idea, let's look at our Wyoming data again. If you had to guess the value of either variable for a random county in Wyoming about which you knew nothing, it would help if you knew something about patterns in the state as a whole. Knowing the mean value for the variable would help, for example. You could minimize your error just by guessing the mean, because a randomly chosen county has a good chance of being pretty close to the mean. If you also had the regression equation and the value of the other variable, you could improve the accuracy of your guess even more. How much more? Thirty-eight percent more. Where does the 38 come from? The square of the correlation coefficient is the variance explained. Quite literally, you remove 38 percent of the error that you would make if you tried to guess all the values on the basis of the mean. And if the correlation were 1.00, you would remove 100 percent of the error and be right every time. (The square of 1 is 1.)

A word or two about substance. Why should the proportion of females in single-person households predict the rate of staying put for five years? Does one cause the other?
Probably not directly. Women live longer than men. Older people are less mobile. So counties with older people have more single women because the men have died off, and the surviving females are in a stage of life where they don't move much. You could check on this by collecting data on the age of the population in each county and then holding the age factor constant with a partial correlation, something a computer can easily do but which you are better off learning from a statistics text.

There is one other nice thing about the correlation coefficient. It comes with its own significance test. And it is a more sensitive test than chi-square because it uses more data. It looks at the closeness of fit in the plot to the straight-line model and asks, “Out of all the ways of distributing the dots on the plot, how many would be that close or closer to a straight line?” Durn few, it turns out in the Wyoming case. The significance level of the correlation coefficient is .002, meaning that if the association between the variables is accidental, it is a one-in-five-hundred accident, or a pretty improbable one.

This case illustrates the value of having interval data, because when we cut back to categorical data and run a chi-square test, the significance goes away. However, the categorical comparison is often easier to explain to newspaper readers. How do we change it to categorical data?
Find the approximate midpoint for each variable and divide the cases into two categories, high and low.

                                  Counties with a    Counties with a
                                  Low Rate of        High Rate of
                                  Single-Female      Single-Female
                                  Households         Households        Total
  Counties with Low Stability          7                  5              12
  Counties with High Stability         4                  7              11
  Total                               11                 12              23

What we have done here is classify each of the 23 Wyoming counties into one of four categories: high on stability and single-female rate, low on both, high on the first and low on the second, and the reverse. In other words, we have reduced continuous variables to categorical variables and cross-classified the 23 cases according to these categories. Does it show anything interesting? What it shows becomes more apparent if we convert the number of cases in each cell to a percent, using the column total as the base. Here's what that looks like:

                                  Counties with a    Counties with a
                                  Low Rate of        High Rate of
                                  Single-Female      Single-Female
                                  Households         Households
  Counties with Low Stability         64%                42%
  Counties with High Stability        36                 58
  Total                              100                100

This table shows a big difference. The counties with a low rate of single-female households are much more likely to experience low residential stability than those with a high rate of single-female households, by 64 to 42 percent. That's a difference of 22 percentage points. While possibly important, that difference is not statistically significant. The number of cases in the cells is only 7, 4, 5, and 7. The chi-square value is only 1.1, far less than needed for statistical significance. Worse yet, chi-square starts to give freaky results when cell sizes dip below 5.

How can the relationship be significant by one test and not by another?
Because we threw away the information that made it significant when we went from a continuous measure to a categorical one. Moral: when you want to prove a point with a small sample, it helps to have continuous measurement. Even when you end up reporting only the categorical comparison, you may, for your own peace of mind, want to look at the interval-level significance test to be sure that you have something worth reporting.

Sampling

Everybody samples. Your editor looks out the window, sees a lot of women in miniskirts, and commissions the style section to do a piece on the return of the miniskirt. You buy a Toyota and suddenly you notice when you drive down the street that every other car you pass is a Toyota. Their ubiquity had escaped your notice before, and you hadn't realized what a conformist you were turning out to be. All of us extrapolate from what we see to what is unseen. Such sampling might be termed accidental sampling. If the results are generalizable, it is an accident.

Scientific method needs something better. Unfortunately, there is no known way to produce a sample with certainty that the sample is just like the real world. But there is a way to sample with a known risk of error of a given magnitude. It is based on probability theory, and it is called probability sampling.

Try an experiment. It requires ten pennies. You can do it as a thought experiment or you can actually get ten pennies, find a cup to shake them in, and toss them onto a flat surface so that each penny has an even chance of landing with the head facing up. That is a sample. Of what? It is a sample of all of the possible coin flips in the universe through all of recorded and unrecorded time, both past and future. In that theoretical universe of theoretical flips of unbiased coins, what is the ratio of heads to tails?
Of course: 50-50. When you flip just ten coins you are testing to see how much and how often a sample of ten will deviate from that true ratio of 50-50. The “right” answer is five heads and five tails. (That's redundant. For the rest of this discussion, we'll refer only to the number of heads since the number of tails has to be, by the definition of the experiment, equal to ten minus the number of heads.)

So go ahead, try it. Are you going to get exactly five heads on the first throw? Probably not. While that outcome is more probable than any other definite number of heads, it is not more probable than all the other possibilities put together.

Probability theory can tell us what to expect. There are exactly 1,024 ways to flip ten coins. (To understand why, you'll have to find a basic statistics text. But here is a hint: the first coin has two possibilities, heads and tails. For each of those, the second coin creates two more possible patterns. And so it goes until you have multiplied two times itself ten times. Two to the tenth power is 1,024.) Of those finite possibilities or permutations, only one contains ten heads and only one contains zero heads. So those two probabilities are each 1/1024 or, in decimals, .00098. The other outcomes are more probable because there are more ways to get them. A total of one head can happen in ten different ways (first toss, second toss, etc.). A total of two can happen in 45 different ways. Here is a chart to show the expected outcome of 1,024 flips of ten coins (Figure 3F provides a histogram to help you visualize it):

Heads:        0    1    2    3    4    5    6    7    8    9   10
Frequency:    1   10   45  120  210  252  210  120   45   10    1

If you think of each toss of ten coins as a sample, you can see how sampling works. The chances of your being badly misled by a sample of only ten are not too great. But the best part is that the risk is knowable. Figure this out: what is the risk that your sample of ten would be more than 20 percentage points off the “true” value?
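One way to figure it out is to let a computer do the counting. The sketch below, in Python, rebuilds the chart of frequencies from the rule for counting combinations and then adds up the outcomes that miss by more than a chosen amount. It is an illustration only, and the names `ways` and `share_outside` are invented for it.

```python
from math import comb

N_COINS = 10

# Number of distinct head/tail sequences that contain exactly k heads.
ways = {k: comb(N_COINS, k) for k in range(N_COINS + 1)}
total = sum(ways.values())  # two to the tenth power


def share_outside(low, high):
    """Fraction of all sequences whose head count is below low or above high."""
    outside = sum(count for k, count in ways.items() if k < low or k > high)
    return outside / total


for k, count in ways.items():
    print(f"{k:2d} heads: {count:4d} ways")

# Outcomes more than 20 points below the true value: 0, 1, or 2 heads.
lower_tail = sum(ways[k] for k in range(0, 3))
print("ways in the lower tail:", lower_tail)
print("risk of a miss greater than 20 points:", round(share_outside(3, 7), 3))
```

The lower tail (zero, one, or two heads) holds 1 + 10 + 45 = 56 of the 1,024 possibilities, and the upper tail is its mirror image.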
The true value in our imaginary universe of all coin flips is 50 percent heads. Allowing for a 20-point deviation in either direction gives us a range of 30 to 70. And if you add up the expected outcomes in the 1,024 possible, you find that only 112 of them (56 in each tail of the distribution) are outside the 30-to-70 range. So you can be about 90 percent certain that your first toss, or any given toss, will yield from 3 to 7 heads. In other words, it will be within 20 percentage points of being exactly representative of the total universe.

That is a pretty important concept, and to let it soak in, you might want to flip ten coins a few times and try it. Or if you are using this book in a class, get the whole class to do it and track a hundred or so tries on the blackboard. The distribution will gradually start to look like the histogram in Figure 3F, and it will help you convince yourself that there is some reality to these hypothetical probabilities.

Now consider what we can do with it. Two important tools have just been handed to you:

1. When you sample, you can deal with a known error margin.
2. You can know the probability that your sample will fall within that error margin.

The first is called sampling error. The second is called confidence level. Here's the good part: you can choose whatever sampling error you want to work with and calculate its confidence level. We did that with the coin flips: we set the sampling error at 20 percentage points and found out by looking at the sampling distribution that the confidence level was about 90 percent.

Alternatively, and this happens more often in everyday life, you can set the confidence level you are comfortable with and then calculate an error margin to fit it. To do that, you have to have an equation. Here is an example. This is the equation for calculating the error margin at the 67 percent level of confidence:

E = sqrt(.25/n)

The n in the formula is the sample size. That .25 in the parenthesis represents the variance in the coin-flipping case or,
for that matter, in any case where the real-world distribution is 50-50: a close election with two candidates, for example. The shortcut formula for variance in any situation where there are just two possible outcomes (heads or tails, Republican or Democrat, boy or girl) is

p * q

where p is the probability of getting one outcome and q is the probability of the other. The sum of p and q has to be 1, so q is defined as 1 - p. The formula for sampling error uses .25 to be conservative. That's the maximum variance in the two-outcome situation. If the split were 60-40 instead of 50-50, the variance would be .24. If it were 90-10, the variance would be .09.

To see that the formula makes intuitive sense, try it out for a sample of one. Sound crazy? Sure. If you tried to generalize to the universe of all possible coin flips from just one trial, you couldn't possibly get it right. And the formula lets you know that. Work it out. It gives you a sampling error of .5, or plus or minus 50 percentage points, which pretty much covers the ball park. Now try it for a sample of 100. Sampling error is now plus or minus five percentage points, which is a lot better.

In most sampling situations, we are not content with a confidence level of 67 percent. The formula gives the sampling error for that confidence level because it covers one standard error around the true value. Standard error is like the concept of standard deviation around the mean in a population. When dealing with a sample, it makes sense to call it standard error because the reference point is an exact (although often unknown) real-world value rather than the foggier concept of central tendency. Remember the example of the Kansans in the wheat field? And how one standard deviation in each direction from the mean of a population covers two-thirds of the cases in a normal distribution?
In a sample distribution, something similar happens. One standard error in each direction covers two-thirds of the expected samples. If you flipped coins in groups of 100, two-thirds of the groups would yield an error of no more than 5 percentage points: that is, they would turn up between 45 and 55 heads. In real life, one usually deals with one sample at a time, and so it is easier to think in terms of probabilities. In a sample of 100, the probability is 67 percent that the error is within plus or minus 5 percentage points.

Suppose 67 percent isn't enough confidence? If you kept that as your lifelong standard, you would be embarrassed one time out of three. If you did a dozen polls a year, four of them would turn out wrong. In both journalistic and social science applications, most practitioners prefer a higher level of confidence. How do you get it? By covering more of the space under the sampling distribution curve. Covering two standard errors, for example, includes slightly more than 95 percent of the possibilities. Of course the error margin goes up when you do that, because those added possibilities all involve greater error than the 67 percent that falls within the one standard error range. Life is a tradeoff.

Because of a fondness for round numbers, most people who work with samples set the 95 percent confidence level as their standard. That means being right 19 times out of 20, which is pretty good over the course of a career. The exact number of standard errors it takes to attain that is 1.96 in either direction. And cranking it into the formula is simple enough:

E = 1.96 * sqrt(.25/n)

And you can modify the formula to change the confidence level whenever you want. The standard textbook designation for the term we just added to the formula for sampling error is z. When z = 1, the confidence level is 67 percent, and when z = 1.96, the confidence level is 95 percent. Here are some other confidence levels for different values of z:

z        confidence
 .95       65.0%
1.04       70.0
1.17       75.0
1.28       80.0
1.44       85.0
1.65       90.0
1.96       95.0
2.58       99.0
3.29       99.9

Remember that you can have a high confidence level or you can have a small margin for sampling error, but you usually can't have both unless your sample is very large. To get a feel for the tradeoffs involved, try this exercise. Take the general formula for sampling error:

E = z * sqrt(.25/n)

and recast it to solve for z:

z = 2 * E * sqrt(n)

and to solve for sample size:

n = .25 * (z^2 / E^2)

Now try out various combinations of sample size, error, and confidence level on your pocket calculator to see how they change. Better yet, put these formulas into a spreadsheet program where you can vary the error margin, the z for different confidence levels, and the sample size to see how they interact with one another.

What you will find is some good news and some bad news. First, the bad news: increasing the sample size a lot decreases the sampling error only a little. The good news is the converse proposition: decreasing the sample size doesn't increase sampling error as much as you might think.

Here is a number to keep in your head as a reference point: 384. That is the sample size you need for a 5 percent error margin at the 95 percent level of confidence. Double it to 768, and sampling error is still 3.5 percentage points. Cut it in half to 192, and sampling error is still only 7 percentage points.

The question of how much error you can tolerate and what it is worth to trim that error will be considered in the chapter on surveys. We will also look at some cost-effective ways to improve accuracy in surveys. But for now, relax. The next chapter is about using computers to make these things easy.

Editor & Publisher International Year Book 198, New York. The figures are for the period ending September 30, 1988.

John W. Tukey, Exploratory Data Analysis (Boston: Addison-Wesley, 1972), pp. 7-26.

You may have noticed a potential complication here. Not all of the newspapers are the same size. By averaging across the error rates of each newspaper, we are treating the
small papers with the same weight as the large ones. A weighted average would be the same as the total error rate for all uses of “minuscule” in all of the sample newspapers. The difference is not always important, but you need to watch your language and be aware of what you are describing. The unweighted mean entitles us to talk about the error rate at the average newspaper. The weighted average yields the overall error rate.

Electronic mail communication from Barbara Pearson, USA Today, August 24, 1989.

Victor Cohn, News and Numbers: A Guide to Reporting Statistical Claims and Controversies in Health and Related Fields (Ames: Iowa State University Press, 1989).

For a more detailed but still unintimidating explanation of correlation and regression, see my The Newspaper Survival Book (Bloomington: Indiana University Press, 1985), pp. 47-62.
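For readers who would rather use a few lines of code than a pocket calculator or a spreadsheet, the exercise with the three forms of the sampling-error formula can be tried as follows. This is only a sketch in Python under the chapter's own assumptions (two possible outcomes, variance p * q). The function names are invented for illustration.

```python
from math import sqrt


def variance(p):
    """Variance p * q for a two-outcome situation, where q = 1 - p."""
    return p * (1 - p)


def sampling_error(n, z=1.96, p=0.5):
    """Error margin E as a proportion: E = z * sqrt(p*q / n)."""
    return z * sqrt(variance(p) / n)


def z_for(error, n):
    """Standard errors covered at a 50-50 split: z = 2 * E * sqrt(n)."""
    return 2 * error * sqrt(n)


def sample_size(error, z=1.96):
    """Sample size needed at a 50-50 split: n = .25 * z**2 / E**2."""
    return variance(0.5) * z ** 2 / error ** 2


# Variance shrinks as the split moves away from 50-50.
for p in (0.5, 0.6, 0.9):
    print(f"split {p:.0%}: variance {variance(p):.2f}")

# Halving or doubling the reference sample of 384 at 95 percent confidence.
for n in (192, 384, 768):
    print(f"n = {n}: plus or minus {100 * sampling_error(n):.1f} points")

print("n for a 5-point margin at 95 percent:", round(sample_size(0.05)))
print("z for a 5-point margin with n = 384:", round(z_for(0.05, 384), 2))
```

Changing `z` to 1 gives the one-standard-error margins worked out earlier: 50 points for a sample of one and 5 points for a sample of 100.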
