STATISTICS THROUGH RESAMPLING METHODS AND MICROSOFT OFFICE EXCEL ®
CHAPTER 1 VARIATION (OR WHAT STATISTICS IS ALL ABOUT)

FIGURE 1.7 Selecting charts and plots from the DDXL menu.
FIGURE 1.8 Selecting the type of graph desired.

out their tape measures a second time and rule off the distance from the fingertips of the left hand to the fingertips of the right while the student they were measuring stood with arms outstretched like a big bird. After the assistant principal had come and gone (something about how the class was a little noisy, and though we were obviously having a good time, could we just be a little quieter), they recorded their results in the form of a two-dimensional scatter plot.

They had to reenter their height data (it had been sorted, remember) and then enter their arm span data:

Height = 141, 156.5, 162, 159, 157, 143.5, 154, 158, 140, 142, 150, 148.5, 138.5, 161, 153, 145, 147, 158.5, 160.5, 167.5, 155, 137

Arm span = 141, 156.5, 162, 159, 158, 143.5, 155.5, 160, 140, 142.5, 148, 148.5, 139, 160, 152.5, 142, 146.5, 159.5, 160.5, 164, 157, 137.5

This is trickier than it looks, because unless the data are entered in exactly the same order by student in each data set, the results are meaningless. (We told you that 90% of the problems are in collecting the data and entering it in the computer for analysis. In another text of mine, A Manager’s Guide to The Design and Conduct of Clinical Trials, I recommend eliminating paper forms completely and entering all data directly into the computer.)

FIGURE 1.9 Dotplot of the classroom height data.

Once the two data sets have been read in, creating a scatterplot is easy. Well, almost easy. The first chart, Fig. 1.10, I created with the Excel Chart menu, next to the question mark, selecting XY(Scatter) and repeatedly pressing Next. To create Fig. 1.11 from the first scatterplot, I had to complete several steps.
Placing my cursor on the chart, and depressing the right mouse button, yielded the menu shown in Fig. 1.12. Clicking on Chart Options allowed me to enter a title, “Sixth Grade Data,” and labels for the X and Y axes, “Height” and “Arm Span.” Escaping from this menu, I put my cursor on the X-axis and clicked to bring up the menu shown in Fig. 1.13. I changed only one item, setting the Minor tick mark type to “outside.” Then I clicked on the “Scale” tab, removed all the check marks under “Auto,” and put in the values I wanted as shown in Fig. 1.14. I clicked OK to obtain Fig. 1.11.

FIGURE 1.10 Scatterplot using Excel’s default settings.
FIGURE 1.11 Scatterplot using Excel’s full capabilities. (Chart title: Sixth Grade Data; X-axis: Height; Y-axis: Arm Span.)
FIGURE 1.12 Chart format menu.

Exercise 1.3. Is performance on the LSAT used for law school admission related to one’s grade point average? Prepare a scatterplot of the following data drawn from a population of 82 law schools. We’ll look at this data again later in this chapter as well as in Chapters 3 and 4.

LSAT = 576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594

GPA = 3.39, 3.3, 2.81, 3.03, 3.44, 3.07, 3, 3.43, 3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 2.96

1.4.3. Percentiles of the Distribution

The values one reads from a box plot like Fig. 1.4 are approximations. To obtain exact values for the minimum and maximum, you can sort the data as shown in Fig. 1.5. To obtain the values of the median and other percentiles, we would go to Excel’s formula bar, choose “Statistical” as our Function category if we have not already done so, and then select “Percentile.” The result will be a display similar to Fig. 1.15.
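For readers who want to cross-check Excel’s percentile outside the spreadsheet, here is a minimal Python sketch. It is not the book’s method: the helper name `percentile_inclusive` is made up for illustration, and it assumes the rule of interpolating linearly between the two nearest sorted observations, with ranks counted from 0 to n − 1. Under that assumption the class heights give a 25th percentile of 143.875.

```python
# Hypothetical helper (not from the book): percentile by linear
# interpolation between the two nearest sorted observations.
heights = [141, 156.5, 162, 159, 157, 143.5, 154, 158, 140, 142, 150,
           148.5, 138.5, 161, 153, 145, 147, 158.5, 160.5, 167.5, 155, 137]


def percentile_inclusive(data, p):
    """Return the p-th quantile (0 <= p <= 1) of data."""
    ordered = sorted(data)
    position = (len(ordered) - 1) * p      # fractional rank, counted from 0
    lower = int(position)                  # observation at or below that rank
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


p25 = percentile_inclusive(heights, 0.25)
median = percentile_inclusive(heights, 0.50)
print(p25, median)      # raw values, with more digits than the data justify
print(round(p25))       # rounded to the nearest centimeter for reporting
```

The rounding in the last line reflects the precision of the measurements rather than the precision of the arithmetic.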
One word of caution: Excel (like most statistics software) yields an excessive number of digits. Because we only measured heights to the nearest centimeter, reporting the 25th percentile as 143.875 would suggest far more precision in our measurements than actually exists. Report the value 144 centimeters instead.

FIGURE 1.13 Format axis menu.
FIGURE 1.14 Setting up the X-axis for Fig. 1.11.
FIGURE 1.15 Computing the percentiles of a sample.

PERCENTILES
The 25th percentile of a sample is such that 25% of the observations are smaller in value and 75% are greater. The median or 50th percentile of a sample is such that 50% of the observations are smaller in value and 50% are greater, and so forth. The socially conscious are concerned as much with what the 10th percentile of a population is earning as with what the median income is.

Still another way to display your data is via the cumulative distribution function. Begin by sorting the data and then typing the numbers 1, 2, and 3 in Column B opposite the data values as shown in Fig. 1.16. Place your cursor in the first entry in this column (the “1” in B3), hold down your mouse button, and pull the cursor straight down the column, until the numbers 1, 2, and 3 are all highlighted. Release the mouse button. Move your cursor to the lower right corner of B5, until a plus sign appears. Holding down the mouse button, again pull straight down Column B and watch as Excel fills in the numbers 4, 5, ..., up to 22 (the number of observations) automatically as you pull. Enter =B3/22 in cell C3, then copy the entry in C3 all the way down the column to C24. The result should look like Fig. 1.17. Note that the entries in Column C are the cumulative frequencies of the observations, that is, 0.045 of the observations are 137 or less, 0.09 are 138.5 or less, and so forth.
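The Column B and Column C steps above amount to ranking the sorted observations and dividing each rank by the number of observations. A minimal Python sketch of the same construction follows; the variable names are illustrative only, and the height data are those of the class.

```python
# Sketch of the Column B / Column C construction: rank each sorted
# height, then divide the rank by the number of observations.
heights = [141, 156.5, 162, 159, 157, 143.5, 154, 158, 140, 142, 150,
           148.5, 138.5, 161, 153, 145, 147, 158.5, 160.5, 167.5, 155, 137]

ordered = sorted(heights)                    # Column A: the sorted data
n = len(ordered)                             # 22 observations
ranks = range(1, n + 1)                      # Column B: 1, 2, 3, ..., 22
cumulative = [rank / n for rank in ranks]    # Column C: =B3/22 copied down

for value, fraction in zip(ordered, cumulative):
    print(f"{value:6.1f}  {fraction:.3f}")
```

Each printed pair is one row of the worksheet: a height and the fraction of the class at or below that height.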
FIGURE 1.16 The sorted data.
FIGURE 1.17 Cumulative frequencies.

The next step in preparing a graph of these cumulative frequencies is to insert an extra row and a column label as shown in Fig. 1.18. Afterward, highlight the entire region between A2 and C25, select “Charts and Plots” from the DDXL menu, and complete the resulting Charts and Plots dialog as shown in Fig. 1.19 to obtain the plot of Fig. 1.20.

FIGURE 1.18 Preparing to graph the cumulative frequencies.
FIGURE 1.19 Plotting the empirical cumulative distribution function.
FIGURE 1.20 Cumulative distribution of heights of Dr. Good’s sixth-grade class.

Note that the X-axis of the cumulative distribution function extends from the minimum to the maximum value of the class data. The Y-axis corresponding to the cumulative frequency reveals that the probability that a data value is less than the minimum is 0 (you knew that) and the probability that a data value is less than or equal to the maximum is 1. Using a ruler, see what X value or values correspond to 0.5 on the Y-scale.

Exercise 1.4. What do we call this value(s)?

Exercise 1.5. Construct cumulative distribution functions for the data you’ve collected.

1.5. TYPES OF DATA

Statistics such as the minimum, maximum, median, and percentiles make sense only if the data is ordinal, that is, if it can be ordered from smallest to largest. Clearly height, weight, number of voters, and blood pressure are ordinal. So are the answers to survey questions such as “How do you feel about President Bush?”

Ordinal data can be subdivided into metric and nonmetric data. Metric data like heights and weights can be added and subtracted. We can compute the mean as well as the median of metric data.
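Because heights are metric, both summaries are defined for them. A short Python sketch, using the standard library’s `statistics` module rather than anything from the book, computes the two side by side for the class heights:

```python
import statistics

# Class heights in centimeters (metric data: sums and differences are meaningful).
heights = [141, 156.5, 162, 159, 157, 143.5, 154, 158, 140, 142, 150,
           148.5, 138.5, 161, 153, 145, 147, 158.5, 160.5, 167.5, 155, 137]

mean_height = statistics.mean(heights)       # sum of the values divided by their number
median_height = statistics.median(heights)   # middle of the sorted values

print(f"mean = {mean_height:.1f} cm, median = {median_height:.1f} cm")
```

Only the median would remain meaningful if the values were replaced by ordered but nonmetric labels.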
(We can further subdivide metric data into observations like time that can be measured on a continuous scale and counts such as “buses per hour” that are discrete.) But what is the average of “He’s destroying our country” and “He’s no worse than any other politician”? Such preference data is ordinal, in that it may be ordered, but it is not metric. Many times, in order to analyze ordinal data, statisticians will impose a metric on it, assigning, for example, weight 1 to “Bush is destroying our country” and weight 5 to “Bush is no worse than any other politician.” Such analyses are suspect, for another observer using a different set of weights might get quite a different answer.

The answers to other survey questions are not so readily ordered. For example, “What is your favorite color?” Oops, bad example, because we can associate a metric wavelength with each color. Consider instead the answers to “What is your favorite breed of dog?” or “What country do your grandparents come from?” The answers to these questions fall into nonordered categories. Pie charts and bar charts are used to display such categorical data, and contingency tables are used to analyze them. A scatterplot of categorical data would not make sense.

Exercise 1.6. For each of the following, state whether the data are metric and ordinal, only ordinal, categorical, or you can’t tell:

[...]

... command =rand(). We copied this command all the way down the column, using Windows’ standard cut and paste commands ctrl-C and ctrl-V.

Files Included in Initial Audit
Name                 Start Date   rand()
Reed, Agnes          23-Jan-03    0.0055
Hason, Arnold        13-Aug-03    0.0104
Wolfe, Carissa       25-Jun-03    0.0173
Sartre, Jean-Paul    17-Oct-03    0.0222
Brown, James         29-Oct-03    ...
FIGURE 1.21 Region of origin of classmates.
FIGURE 1.22 Classdata by sex of student.
FIGURE 1.23 Boxplot of class heights by sex.

list is a boy, the next seven are girls, then another boy, six girls, and finally seven boys. To create the side-by-side boxplots shown in Fig. 1.23, we selected “Boxplot by Groups” from the DDXL Charts and Plots...

recorded on it. We return the card to the hat and repeat the procedure for a total of 22 times until I have a second sample, the same size as...

5 Of course, there is a point at which each additional observation will cost more than it yields in information. The bootstrap described here will also help us to find the “optimal” sample size.

is pronounced “sigma.”

Is adding a set of numbers and then dividing by the number in the set too much work? To find the mean height of the students in my classroom, we would use Excel’s average function. A playground seesaw (or teeter-totter) is symmetric in the absence of kids. Its midpoint or median corresponds to its center of gravity...

FIGURE 1.25 Using XLStat to create a histogram from the class heights.

To construct this histogram, I downloaded a trial version of XLStat from http://www.xlstat.com/index.html and installed this program after selecting “Add-ins” from Excel’s Tools menu. As you can see from Fig. 1.25, I selected Describing Data and the Histograms from XLStat’s menu...
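The card-and-hat procedure excerpted above (draw a card, record the height, return the card, and repeat 22 times) is sampling with replacement. The following Python sketch shows one way it might be simulated. It is an illustration under stated assumptions, not the Resampling Stats add-in: the function name `bootstrap_sample`, the fixed seed, the choice of 100 resamples, and the use of the 25th percentile as the statistic are all choices made here for the example.

```python
import random
import statistics

heights = [141, 156.5, 162, 159, 157, 143.5, 154, 158, 140, 142, 150,
           148.5, 138.5, 161, 153, 145, 147, 158.5, 160.5, 167.5, 155, 137]

rng = random.Random(2024)   # fixed seed (an assumption) so the sketch is repeatable


def bootstrap_sample(data, rng):
    """Draw len(data) values with replacement: pick a card, record it, return it to the hat."""
    return rng.choices(data, k=len(data))


p25_values = []
for _ in range(100):                         # 100 bootstrap samples (assumed count)
    resample = bootstrap_sample(heights, rng)
    # First quartile of the resample, interpolating between sorted observations.
    p25 = statistics.quantiles(resample, n=4, method="inclusive")[0]
    p25_values.append(p25)

p25_values.sort()
print(p25_values[-8:])      # the eight largest bootstrap 25th percentiles
```

The spread of the sorted values gives a rough sense of how much the 25th percentile would vary from one class of 22 students to another.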
you’ll need to download and install a trial version of the Resampling Stats in Excel add-in from http://www.resample.com/content/software/excel/download.shtml. Before you add it in, make sure that the “Analysis Toolpak” and “Analysis Toolpak VBA” options are checked in Excel’s Tools/Add-ins menu. Clicking on the R on the newly appeared Resampling Stats in Excel menu yields the display of Fig. 1.26. Pressing...

precise) estimate: mean or median? To answer this question, at least for the data on heights I collected in my classroom, apply the bootstrap, then construct side-by-side boxplots for the results.

FIGURE 1.27 First step in getting a confidence interval for P25.
FIGURE 1.28 The eight largest values of the 25th percentile for 100 bootstrap...

data
2. Estimating population parameters
3. Aids to decision making
Our choice of one statistic rather than another depends on the use(s) to which it is to be put.

THE CENTER OF A POPULATION
Median: the value in the middle; the halfway point; that value which has equal numbers of larger and smaller elements around it.
Arithmetic mean...

as well represented as boisterous people and that a small group of activists couldn’t bias the results.⁶

6 To see how surveys could be biased deliberately, you might enjoy reading Grisham’s The Chamber.

One sample we would all insist be representative is the jury.⁷ The Federal Jury Selection and Service Act of 1968 as revised⁸ states...

Name                 Start Date   rand()
...                  17-Oct-03    0.0222
Brown, James         29-Oct-03    0.0226
Rooney, Kevin        9-Jul-03     0.0332
Mills, Louise        4-Sep-03     0.0412
Smith, Thomas        2-Oct-03     0.0497
Dudley, Morris       8-Aug-03     0.0540

A series of numbers was displayed down the column. To lock these in place, we went to the Tools menu, clicked on “Options” and then on the Calculation tab. We made sure that Calculation was set to manual and there was no check mark opposite...

The primary value of charts and ... serve a second purpose: to help establish the number of subpopulations.

[Figure: Histogram of Class Data]
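The audit excerpt above pairs each file with a value from =rand() and lists the files in increasing order of that value. A minimal Python sketch of the same idea follows. The names are taken from the excerpt’s table purely as sample input, the sample size of five and the seed are hypothetical, and storing the draws in a dictionary plays the role of “locking” the numbers so that they are not regenerated.

```python
import random

# Sample input only: names as they appear in the excerpted audit table.
files = ["Reed, Agnes", "Hason, Arnold", "Wolfe, Carissa", "Sartre, Jean-Paul",
         "Brown, James", "Rooney, Kevin", "Mills, Louise", "Smith, Thomas",
         "Dudley, Morris"]

rng = random.Random(7)      # hypothetical seed, fixed so the order is reproducible

# One uniform random number per file, like =rand() copied down a column.
# Keeping them in a dict stores the draws so they are not recomputed.
draws = {name: rng.random() for name in files}

# Sort the files by their random number; the first few form the audit sample.
audit_order = sorted(files, key=draws.get)
initial_audit = audit_order[:5]             # hypothetical sample size

for name in initial_audit:
    print(f"{name:20s} {draws[name]:.4f}")
```

Because every file receives its number by the same mechanism, each has the same chance of landing in the initial audit.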