true population mean. The equation for the standard error may be seen in Equation 4.3:

$$\mathrm{SEM} = \frac{\text{standard deviation}}{\sqrt{\text{number in sample}}} \qquad \text{(Equation 4.3)}$$

If we wanted to check that the value of the standard error calculated by the Descriptive Statistics function was correct, we would insert the following formula into a cell on the spreadsheet, using the data from Group 1 as an example:

    =3.7/SQRT(9)

where 3.7 is the standard deviation of the sample, for which there were nine observations. More generally, it could be calculated by:

    =STDEV(range of values in sample)/SQRT(number in sample)

DESCRIPTIVE STATISTICS

Figure 4.1 Descriptive Statistics functions in Excel

When presenting graphs showing mean values it is usually expected that error bars are included, using either the standard deviation to demonstrate the variability in the sample, or the standard error to demonstrate the deviation of the sample mean from the true population mean.

Kurtosis and skewness

Values for kurtosis and skewness are also produced by the Descriptive Statistics function. These are used to characterize the data relative to a normal distribution.

Skewness is a measure of symmetry. Where data are symmetrical about the mean, the skewness would be expected to have a value of around 0. If data are skewed to the left or right then the centre of the data is not around the mean, and so a negative or positive value for skewness would be obtained. Skewed distributions are further discussed in section 4.2.

Kurtosis compares the shape of the data to a normal distribution and is a measure of whether the data tend to be peaked or flat. Where a high value for kurtosis is observed, data show a distinct peak about the mean and then decline rapidly. For lower kurtosis values, data are more spread out, giving a flat top to the shape of the distribution rather than a peak. A value of around 3 represents a normal distribution when kurtosis is reported in its conventional form; Excel reports excess kurtosis, for which a normal distribution gives a value of around 0.
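As a cross-check outside Excel, the standard error calculation in Equation 4.3 can be sketched in Python. This is a minimal illustration rather than part of the original exercise: the function name `standard_error` and the second data set are assumptions introduced here, and only the values 3.7 and 9 come from the text.

```python
import math
import statistics


def standard_error(values):
    """Return the standard error of the mean: sample SD / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    # statistics.stdev uses the n - 1 denominator, like Excel's STDEV.
    return statistics.stdev(values) / math.sqrt(n)


# Check quoted in the text: standard deviation 3.7, nine observations.
sem_from_text = 3.7 / math.sqrt(9)
print(f"SEM from the quoted values: {sem_from_text:.3f}")

# Hypothetical sample (not from the source), to show the general formula.
hypothetical_sample = [21, 25, 19, 30, 24, 27, 22, 26, 23]
print(f"SEM of the hypothetical sample: {standard_error(hypothetical_sample):.3f}")
```

The first calculation mirrors `=3.7/SQRT(9)`; the function mirrors `=STDEV(range)/SQRT(number in sample)`.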
4 PRELIMINARY DATA ANALYSIS

Figure 4.2 Descriptive Statistics for the television viewing data

Coefficient of variation

This function also does not appear in Excel but is a very useful parameter to calculate. The coefficient of variation represents the standard deviation as a percentage of the mean value; it is particularly useful when comparing the reproducibility of results. In quantitative analytical methods, the coefficient of variation is used as a measure of precision in quality control determinations. The coefficient of variation is calculated as shown in Equation 4.4:

$$\text{coefficient of variation} = \frac{\text{standard deviation}}{\text{mean}} \times 100\% \qquad \text{(Equation 4.4)}$$

The coefficient of variation is usually given as a percentage and expresses the variability (from the standard deviation) of the sample compared to the mean value. It is a useful parameter when comparing two or more samples with different means to see if the variability is the same in each sample.

Exercise 4.1

Take as an example a laboratory analysis conducted by two students. Each performed an assay to determine the protein concentration of a sample containing 125 mg·ml⁻¹ of protein. Each repeated the analysis 10 times and the results are shown in Table 4.3. Enter the data on a spreadsheet in Excel and perform the descriptive statistics on the data. Using the mean and standard deviation for each sample, enter the following equation into one of the cells on the worksheet, inserting the appropriate values for the mean and standard deviation in each case:

    =(value for standard deviation/value for the mean)*100

When comparing the means you should find that both students have a mean value of 125 mg·ml⁻¹ from their protein determinations, but student 2 has a more precise technique, as the coefficient of variation is 2.3 per cent for their analysis compared with 7.3 per cent for student 1.
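The calculation in Equation 4.4 can also be sketched in Python as a cross-check on the spreadsheet. The function name `coefficient_of_variation` is an assumption introduced for illustration. The values are copied from Table 4.3 as it is reproduced in this excerpt, which lists nine results per student, so the printed percentages need not match the figures quoted in the exercise.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Values as they appear in Table 4.3 of this excerpt (nine per student).
student_1 = [125, 120, 122, 130, 115, 140, 130, 121, 125]
student_2 = [121, 124, 127, 122, 125, 126, 128, 126, 126]

cv_1 = coefficient_of_variation(student_1)
cv_2 = coefficient_of_variation(student_2)

print(f"Student 1: mean {statistics.mean(student_1):.1f}, CV {cv_1:.1f}%")
print(f"Student 2: mean {statistics.mean(student_2):.1f}, CV {cv_2:.1f}%")
```

Whatever the exact percentages, the comparison the exercise draws still holds for these values: the smaller coefficient of variation belongs to student 2, indicating the more precise technique.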
4.2 Frequency distributions

When we conduct scientific investigations, we collect data by taking samples from much larger populations. In order to learn something about the population we use descriptive statistics, but we also need to examine the characteristics of the distribution in order to determine the best way to summarize and analyse data.

In Section 3 we learnt about presenting data in the form of bar charts. We can draw bar charts of data in which we measure frequency (the number of times a particular occurrence takes place, for example the number of individuals in a population with blue eyes); if we draw a line joining the midpoints of the tops of the bars then we obtain a frequency polygon. Increasing the number of bars in the plot, providing there is sufficient data to do so, will eventually produce a smooth curve, the shape of which will tell us something about the characteristics of the population. Figure 4.3 shows how a frequency polygon may be produced from a bar chart, using data showing the height of a sample of adults from a population. This type of bar chart is known as a histogram.

Table 4.3 Protein determinations performed by two students with a sample of 125 mg·ml⁻¹

    Student 1   125  120  122  130  115  140  130  121  125
    Student 2   121  124  127  122  125  126  128  126  126

Figure 4.3 Normal distribution of heights of subjects

Figure 4.4 Skewed and bimodal distributions

Where the resulting frequency polygon is symmetrical about its centre, the curve is said to be 'bell-shaped'. At each end, or tail, of the curve, there is a small number of extremely small or extremely large values, but the majority of the observations fall in the middle part of the curve, i.e. they are centred around the value for the mode. If we were to calculate the mean and the median for these data we would find that the values would be virtually identical.
A curve is said to follow a normal distribution where this occurs: because the mean reflects the central tendency of the distribution, it should also coincide with the midpoint of the distribution, represented by the median.

It is useful when considering the shape of a population to look at the tails of the curve that is produced. In Figure 4.4 we can see two distributions that cannot be normal as they do not follow a bell shape; these are known as skewed distributions, of which there are two types, positive and negative (see also the subsection 'Descriptive statistics in Excel' in section 4.1). A distribution with a positive skew will contain more extremely large values than extremely small ones and therefore resembles Chart A. Clearly the mean calculated for these data would not represent the central location of the distribution. Similarly, if we consider Chart B there are clearly more extremely small values than extremely large ones, in which case the data are negatively skewed. For each of these curves, the best measure of the central tendency of the data would be the median value and not the mean.

Sometimes the shape of a distribution appears as if two normal (bell-shaped) distributions have been combined, as shown in Chart C in Figure 4.4. This would suggest that there is a mixed population, which might arise where a population contains two species.

In plotting these curves we have split the data into groups, or intervals, that are equally spaced. The more intervals we are able to divide the data into, the more well-defined the curve becomes. We will see how, by using raw data for heights of individuals, we are able to produce a frequency distribution and how the Excel Paste Function may be applied to aid this process.

Exercise 4.2

The data in Table 4.4 have been collected from a sample of 40 individuals from a population. Enter the data in one column in a new workbook in Excel.
The height of each subject was recorded to the nearest centimetre, so in terms of the absolute accuracy of the results, a person whose height is between 153.5 and 154.4 cm would still be recorded as 154 cm (by rounding up or down). Height would therefore be described as being a continuous variable, but because we are taking recorded measurements correct to the nearest centimetre, we are sampling discrete values.

The data on the worksheet make little sense as they stand and need to be organized. The first, most obvious step is to place them in order. Using the Data|Sort command (as described in Section 3), organize the data into ascending order. Look down the column of data to see the results. We can now see that the smallest (minimum) value for height is 147 cm whereas the largest (maximum) is 188 cm, so the heights of the individuals range from 147 to 188 cm.

Even after sorting, the data are still difficult to interpret as each value has to be examined in relation to all the others (and what if we had thousands of measurements?). The next stage is clearly to group the data; this is done by dividing it into classes, with evenly spaced intervals between groups.

Rule: When data are divided into intervals it should usually be into no more than 10 intervals and no fewer than five intervals. Each interval should be of an equal width.

To determine how many groups to divide the data into, count the number of observations. In this case n = 40. Take the square root of the total and round to the nearest whole number (√40 = 6.325), i.e. 6.

Excel is able to automatically group frequency data but needs to be given the parameters by which to do this.
You will first of all have to make some decisions about your data.

Table 4.4 Height (cm) of forty individuals from a university tutorial group

    147  154  157  163  163  165  168  171  173  177
    151  155  152  161  161  169  169  172  175  177
    158  155  159  161  164  167  165  182  175  172
    154  156  165  162  160  188  176  173  170  167

Firstly, look at the range of the data (147–188 cm). In order to group the data we need to work out how to have evenly spaced intervals. Clearly, if we group the data into six classes then the interval between them should be:

$$\text{interval} = \frac{\text{highest number} - \text{lowest number}}{\text{number of classes}} \qquad \text{(Equation 4.5)}$$

    =(188-147)/6

which gives us an answer of 6.83, so the interval between the classes should be 7 cm. In Table 4.5 we can see how the data need to be grouped. The number in the class column is the lower value for the class and moves upwards in steps of 7 cm. The first class (147–153) will contain the discrete values:

    147  148  149  150  151  152  153

where 147 is the lower class boundary and 153 is the upper class boundary.

In Excel, data are divided into bins (classes) in which you define the upper class boundary. Using these bins, frequency data can be produced from a list of observations, so you will need to enter onto your data sheet the classes (bins) in which you want to categorize your data. On the worksheet, type in the upper class boundaries for the data (so from Table 4.5 the upper class boundaries will be 153, 160, 167, 174, 181 and 188; enter the data in one column).

Table 4.5 Classes for the student height data

    Height (cm)
    147–153
    154–160
    161–167
    168–174
    175–181
    182–188

Using the histogram function

From the Tools menu select Data Analysis and from the list provided choose Histogram. A dialogue box should appear as shown in Figure 4.5. Enter the input range of the data and then the range of cells containing your bins.
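The grouping steps described in this exercise (choosing the number of classes, working out the class interval, listing the upper class boundaries and counting the observations that fall in each bin) can be sketched in Python as a cross-check on the spreadsheet. This is a minimal illustration, not Excel's own implementation: the variable names are assumptions, and the counting rule (each value goes into the first bin whose upper boundary is greater than or equal to it, with anything larger counted as "More") is assumed to match how the bins are described in the text.

```python
import bisect
import math

# Heights (cm) of the forty individuals listed in the exercise.
heights = [
    147, 154, 157, 163, 163, 165, 168, 171, 173, 177,
    151, 155, 152, 161, 161, 169, 169, 172, 175, 177,
    158, 155, 159, 161, 164, 167, 165, 182, 175, 172,
    154, 156, 165, 162, 160, 188, 176, 173, 170, 167,
]

n = len(heights)
num_classes = round(math.sqrt(n))            # square-root rule from the text
lowest, highest = min(heights), max(heights)

# Equation 4.5, rounded up to a whole centimetre so the classes
# cover the full range of whole-number measurements.
class_width = math.ceil((highest - lowest) / num_classes)

# Upper class boundary of each bin for discrete whole-centimetre data.
upper_bounds = [lowest + class_width * (i + 1) - 1 for i in range(num_classes)]

# One extra slot at the end collects anything above the last boundary.
counts = [0] * (num_classes + 1)
for h in heights:
    counts[bisect.bisect_left(upper_bounds, h)] += 1

for bound, count in zip(upper_bounds + ["More"], counts):
    print(f"{bound!s:>5} {count:>3}")
```

Rounding the interval up rather than to the nearest whole number is a design choice made here so that the last class always reaches the maximum value; for these data both approaches give 7 cm.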
Click on the Chart Output box so that a histogram of the data is plotted on the worksheet and confirm your selections. A table should now appear on the worksheet in which the data have been placed into the six classes provided. The data should be presented as in Table 4.6. We now have what is known as a frequency distribution of our data. The data are also presented in a histogram as in Figure 4.6.

Figure 4.5 Using the Histogram function in Excel

Table 4.6 Output table from Excel showing grouping of data into bins

    Bin     Frequency
    153     3
    160     9
    167     12
    174     9
    181     5
    188     2
    More    0

Figure 4.6 Frequency histogram for heights of university students

We can see that this appears to approximate to a normal distribution, but it is difficult to be certain with a limited number in the data set. If the sample were larger we could increase the number of bars in the frequency histogram by setting classes (bins) closer together; the histogram would appear more as a smooth curve. The shape of the distribution is represented by the shape of this curve.

When considering the statistical testing of data, it is important to establish in conducting an experiment:

(a) whether a sample is sufficiently large to represent the population as a whole;

(b) what the characteristics of the population are (i.e. normal, skewed, bimodal), in order to choose the correct test to be applied to the data and the most appropriate summary statistics to describe it.

4.3 Correlation and linear regression

Sometimes we conduct an investigation to determine whether there is an association between two variables of interest. The starting point of finding out [...]...
0.425 0.509 0.614 0.729 0.822

To perform the regression analysis select Tools|Data Analysis and highlight Regression from the list. A pop-up box appears in which to enter the range of the data and select some options for the analysis as shown in Figure 4.10. Input the range of the Y (absorbance) data and then the range of the X (concentration) data. Include data labels in this selection and tick Labels...

... is to place the dates in chronological order. Enter the data into an Excel worksheet and then, using the Sort command from the Data menu, arrange the dates into ascending order (making sure that you select all of the data for sorting). Using Chart Wizard, plot the data and choose the XY Scatter format. Add a suitable title and labels for the x- and y-axes.

Scattergraphs

In Chart Wizard select the Scattergraph...

... experimental technique, and the analysis should be used to identify any outlier values, so all the replicates must be included.

Table 4.8 Protein determination using the Lowry Assay

    Concentration (mg/ml)      20     40     60     80     100    120    140    150
    Absorbance, Replicate 1    0.106  0.204  0.311  0.417  0.508  0.612  0.722  0.809
    Absorbance, Replicate 2    0.108  0.202  0.310  0.419  0.510  0.616  0.734  0.819
    Absorbance, Replicate 3    ...

... relationship between x and y, so this would indicate a perfect correlation between the two variables. A value of 0 would indicate no possible relationship between x and y, so there would be no correlation whatsoever. In practice these values represent two extremes and most correlation coefficients...

Figure 4.7 Scattergraphs showing positive, negative and questionable correlations
... between the two variables: select Tools|Data Analysis and click on CORREL from the menu. The CORREL function calculates the product–moment correlation coefficient for the data. Input the range of cells you want analysed, giving the reference for the dates on the gravestones as the first array and the cell references for lichen size in the second array. Confirm the selection...

... used to confirm whether there is a significant relationship between x and y. The P value from the table (shown under the heading Significance F) shows there is a highly significant relationship between absorbance and concentration as P = 8.19 × 10⁻²⁹, and this value is well below 0.05, the level of significance adopted...

Figure 4.10 Inserting cell ranges for regression analysis

... plot produced for the data shows individual data points and (usually in pink) the values of Y (absorbance) that are calculated as part of the analysis. You will also find these listed in a table at the bottom of the worksheet. The predicted Y values on the graph would be more appropriately substituted by a line of best fit through the observations. Highlight one of the predicted Y values and right click the...

Under the Output options, click on the New Worksheet Ply to enter the results of the regression analysis on a new worksheet. Select both Line Fit Plots and Residuals then confirm your selections by clicking on OK. Excel analyses the relationship between independent and dependent variables and produces a report and charts on a new page in your workbook. You may need to move some of the statistics around on...
... of the analysis are shown in Figure 4.11. The most important statistic from the analysis is the R square (R²) value. This indicates how strong a relationship exists between the dependent and independent variables. As the value is 0.997 there is clearly a very strong relationship between concentration and absorbance. The results also show an ANOVA table (see section 5.3 for further explanation of analysis...

With polynomial and moving average trendlines you may need to adapt the fit of the line by increasing the Order (default value 2).

Figure 4.8 Inserting trendlines

Various features of the plot may be formatted. It is usually necessary to edit the thickness of the trendline so that points are not obscured. To format, click on the trendline then change the style and weight of...
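The statistics discussed in this section (the product–moment correlation coefficient, the R² value and the line of best fit) can be sketched in Python as a cross-check on the spreadsheet output. This is a minimal illustration using ordinary least squares, not a reproduction of Excel's Regression report. The function name `linear_fit` is an assumption, and the data are only the Replicate 1 absorbances from Table 4.8 as reproduced in this excerpt, where the table is truncated, so the results are not expected to match the all-replicates analysis described in the text exactly.

```python
import math


def linear_fit(x, y):
    """Least-squares line y = slope * x + intercept, plus r and r squared."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)   # product-moment correlation coefficient
    return slope, intercept, r, r * r


# Replicate 1 only, as listed in Table 4.8 of this excerpt.
concentration = [20, 40, 60, 80, 100, 120, 140, 150]
absorbance = [0.106, 0.204, 0.311, 0.417, 0.508, 0.612, 0.722, 0.809]

slope, intercept, r, r_squared = linear_fit(concentration, absorbance)
print(f"slope = {slope:.5f}, intercept = {intercept:.4f}")
print(f"r = {r:.4f}, R squared = {r_squared:.4f}")
```

An R² close to 1 indicates, as in the text, a very strong linear relationship between concentration and absorbance; a significance test like the one in Excel's ANOVA table is not included in this sketch.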