Mean
The mean, often referred to as the "arithmetic average," represents the central value of a set of scores. When my daughter, then in fifth grade, expressed her confusion about calculating averages, I realized the importance of explaining this concept clearly.
Jennifer asked for help with a math problem, so I explained that she needed to add all the scores together and then divide by the total number of scores. However, she responded with a serious expression and insisted, "Dad, this is serious!" indicating that she believed I was joking.
“See these numbers in your book; add them up. What is the answer?” (She did that.)
“Now, how many numbers do you have?” (She answered that question.)
“Then, take the number you got when you added up the numbers, and divide that number by the number of numbers that you have.”
© Springer International Publishing Switzerland 2016
T.J. Quirk et al., Excel 2016 for Environmental Sciences Statistics
By applying the same reasoning, you will easily find the correct answer, as Excel will automate all the necessary steps for you.
We will call this average of the scores the “mean,” which we will symbolize as $\bar{X}$ and pronounce as “Xbar.”
The formula for finding the mean with your calculator looks like this:
$$\bar{X} = \frac{\sum X}{n}$$
The Greek letter sigma (Σ) represents the concept of "sum," instructing you to total all the values denoted by the letter X and then divide that total by n, which signifies the count of numbers involved.
Suppose that you had these six environmental science test scores on a 7-item true-false quiz:
To find the mean of these scores, you add them up, and then divide by the number of scores. So, the mean is: 25/6 = 4.17
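The arithmetic can be checked with a minimal Python sketch. The six scores themselves are not reproduced in this text, so the list below is an assumption: hypothetical values chosen only because they add up to the total of 25 used in the example.

```python
# Hypothetical quiz scores (an assumption: the original list is not shown in this text).
# They were chosen so that their total is 25, matching the worked example.
scores = [6, 4, 5, 3, 2, 5]

total = sum(scores)   # the "sigma X" part of the formula
n = len(scores)       # the number of scores
mean = total / n      # X-bar

print(f"sum = {total}, n = {n}, mean = {mean:.2f}")
```

Any other six scores with the same total would give the same mean, since the formula depends only on the sum and the count.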
Standard Deviation
The standard deviation (STDEV), represented by the letter S, measures how close the scores are to the mean. A small standard deviation indicates that the scores are closely grouped around the mean, while a large standard deviation signifies that the scores are more widely dispersed. Understanding standard deviation is essential for analyzing data variability.
The formula for the standard deviation is:
$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
The formula looks complicated, but what it asks you to do is this:
1. Subtract the mean from each score ($X - \bar{X}$).
2. Then, square the resulting number to make it a positive number.
3. Then, add up these squared numbers to get a total score.
4. Then, take this total score and divide it by n − 1 (where n stands for the number of numbers that you have).
5. The final step is to take the square root of the number you found in step 4.
This book focuses on calculating the standard deviation using Excel rather than a calculator; the calculator method is detailed in basic statistics resources like Schuenemeyer and Drew (2011). By applying Excel to the example set of six numbers mentioned previously, the standard deviation (STDEV) is determined to be 1.47.
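The five steps listed above can be traced one by one in a short Python sketch. The six scores are again an assumption (the original list is not shown in this text); these hypothetical values were picked because they reproduce the reported total of 25 and the reported STDEV of 1.47.

```python
import math
import statistics

# Hypothetical scores (assumption): chosen to match the reported sum (25) and STDEV (1.47).
scores = [6, 4, 5, 3, 2, 5]

n = len(scores)
mean = sum(scores) / n

deviations = [x - mean for x in scores]   # step 1: subtract the mean from each score
squared = [d ** 2 for d in deviations]    # step 2: square each result
total_squared = sum(squared)              # step 3: add up the squared numbers
variance = total_squared / (n - 1)        # step 4: divide by n - 1
stdev = math.sqrt(variance)               # step 5: take the square root

print(f"STDEV = {stdev:.2f}")

# The standard library's sample standard deviation uses the same n - 1 divisor.
print(f"statistics.stdev = {statistics.stdev(scores):.2f}")
```

Dividing by n − 1 rather than n is what makes this the sample standard deviation, which is also what Excel's STDEV function computes.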
Standard Error of the Mean
The formula for the standard error of the mean (s.e., which we will symbolize as $S_{\bar{X}}$) is:
$$\text{s.e.} = S_{\bar{X}} = \frac{S}{\sqrt{n}}$$
To calculate the standard error (s.e.), divide the standard deviation (STDEV) by the square root of n, where n represents the total number of values in your data set. For instance, in the example provided, the standard error is 0.60, which can be verified using a calculator.
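As a rough check of the 0.60 figure, the division can be sketched in Python. The scores are an assumption (hypothetical values consistent with the reported sum of 25 and STDEV of 1.47), since the original list is not shown in this text.

```python
import math
import statistics

# Hypothetical scores (assumption): consistent with the reported sum and STDEV.
scores = [6, 4, 5, 3, 2, 5]

stdev = statistics.stdev(scores)   # sample standard deviation (n - 1 divisor)
n = len(scores)
se = stdev / math.sqrt(n)          # standard error of the mean

print(f"STDEV = {stdev:.2f}, n = {n}, s.e. = {se:.2f}")
```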
To learn more about the concepts of standard deviation and standard error of the mean, refer to McKillup and Dyar (2010) and Schuenemeyer and Drew (2011).
This chapter will demonstrate how to use Excel to calculate the sample size, mean, standard deviation, and standard error of the mean by analyzing the level of sulfur dioxide in rainfall, measured in milligrams (mg) per liter (L). One milligram (mg) is one-thousandth of a gram, and one liter is a metric unit representing the volume of one kilogram of pure water under standard conditions. For this analysis, we will consider data from eight samples of rainfall, as illustrated in Fig. 1.1.
Sample Size, Mean, Standard Deviation, and Standard Error of the Mean
Using the Fill/Series/Columns Commands
Objective: To add the sample numbers 2–8 in a column underneath Sample #1
Home (top left of screen)
Fig 1.1 Worksheet Data for Sulphur Dioxide Levels
Important note: The “Paste” command should be on the top of your screen on the far left of the screen.
Important note: Notice the Excel commands at the top of your computer screen:
File Home Insert Page Layout Formulas etc.
If these commands ever “disappear” when you are using Excel, you need to click on “Home” at the top left of your screen to make them reappear!
Fill (top right of screen: click on the down arrow; see Fig.1.2)
Fig 1.2 Home/Fill/Series commands
Fig. 1.3 Example of Dialogue Box for Fill/Series/Columns/Step Value/Stop Value commands
The sample numbers should be identified as 1–8, with 8 in cell B11.
Enter the milligrams per liter values in cells C4 to C11, ensuring that you double-check your figures for accuracy to obtain the correct results.
Since your computer screen shows the information in a format that does not look professional, you need to learn how to “widen the column width” and how to “center the information” in a group of cells. Here is how you can do those two steps:
Changing the Width of a Column
Objective: To make a column width wider so that all of the information fits inside that column
To ensure all information fits properly, you need to widen Column C on your computer screen.
Click on the letter, C, at the top of your computer screen
Place your mouse pointer on your computer at the far right corner of C until you create a “cross sign” on that corner
Left-click on your mouse, hold it down, and move this corner to the right until it is “wide enough to fit all of the data”
Take your finger off your mouse to set the new column width (see Fig.1.4)
Fig 1.4 Example of How to Widen the Column Width
Then, click on any empty cell (i.e., any blank cell) to “deselect” column C so that it is no longer a darker color on your screen.
When you widen a column, you will make all of the cells in all of the rows of this column that same width.
Now, let’s go through the steps to center the information in both Column B and Column C.
Centering Information in a Range of Cells
Objective: To center the information in a group of cells
In order to make the information in the cells look “more professional,” you can center the information using the following steps:
Left-click your mouse pointer on B3 and drag it to the right and down to highlight cells B3:C11 so that these cells appear in a darker color
At the top of your computer screen, you will notice a series of lines that are uniformly centered under the "Alignment" option, which is the second icon located at the bottom left of the Alignment box (refer to Fig 1.5).
Fig 1.5 Example of How to Center Information Within Cells
Click on this icon to center the information in the selected cells (see Fig.1.6)
To simplify referencing milligrams per liter in your formulas, it's beneficial to assign a name to your data range instead of recalling specific cell addresses like C4:C11. For instance, you can name this group of cells "Weight," or choose any other name that suits your preference.
Naming a Range of Cells
Objective: To name the range of data for the milligrams per liter with the name: Weight
Highlight cells C4:C11 by left-clicking your mouse pointer on C4 and dragging it down to C11
Formulas (top left of your screen)
Define Name (top center of your screen)
Weight (type this name in the top box; see Fig.1.7)
Centering Information in the Cells
Then, click on any cell of your spreadsheet that does not have any information in it (i.e., it is an “empty cell”) to deselect cells C4:C11
Now, add the following terms to your spreadsheet:
Fig 1.7 Dialogue box for “naming a range of cells” with the name: Weight
Fig. 1.8 Example of Entering the Sample Size, Mean, STDEV, and s.e. Labels
When using a formula in Excel, it is essential to begin the formula with an equal sign (=) to ensure that Excel recognizes it as a formula.
Finding the Sample Size Using the =COUNT Function
Objective: To find the sample size (n) for these data using the =COUNT function
This command should insert the number 8 into cell F6 since there are eight samples of rainfall in your sample.
Finding the Mean Score Using the =AVERAGE Function
Objective: To find the mean weight figure using the =AVERAGE function
This command should insert the number 0.8125 into cell F9.
Finding the Standard Deviation Using the =STDEV Function
Objective: To find the standard deviation (STDEV) using the =STDEV function
This command should insert the number 0.352288 into cell F12.
Finding the Standard Error of the Mean
Objective: To find the standard error of the mean using a formula for these eight data points
This command should insert the number 0.124553 into cell F15 (see Fig.1.9).
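The reported standard error can be cross-checked against the other reported values with a short Python sketch. It uses only the figures stated above (n = 8 in F6 and STDEV = 0.352288 in F12); the eight rainfall measurements themselves appear only in Fig. 1.1 and are not reproduced here. The exact cell entries are also not reproduced in this text; presumably they take the form =COUNT(Weight), =AVERAGE(Weight), =STDEV(Weight), and the standard deviation divided by the square root of 8, but that is an assumption.

```python
import math

n = 8              # sample size reported in cell F6
stdev = 0.352288   # standard deviation reported in cell F12

# Standard error of the mean: STDEV divided by the square root of n
se = stdev / math.sqrt(n)

print(f"s.e. = {se:.6f}")
```

The result should agree with the 0.124553 shown in cell F15 to six decimal places, which is a quick way to confirm that the formula in F15 points at the right cells.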
It is crucial to verify all figures in your spreadsheet throughout this book to ensure they are placed in the correct cells; otherwise, the formulas may not function properly.
1.4.8.1 Formatting Numbers in Number Format (Two Decimal Places)
Objective: To convert the mean, STDEV, and s.e. to two decimal places
Home (top left of screen)
To decrease the decimal places in your document, move your mouse pointer to the bottom right corner of the number displayed at the top center of your screen, specifically at the 00.0 position, until the option "Decrease Decimal" appears.
Fig 1.9 Example of Using Excel Formulas for Sample Size, Mean, STDEV, and s.e.
Click on this icon twice and notice that the cells F9:F15 are now all in just two decimal places (see Fig. 1.11)
Now, click on any “empty cell” on your spreadsheet to deselect cells F9:F15.
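For comparison, the same two-decimal display can be sketched in Python using the three values reported for cells F9, F12, and F15. This is only an illustration of display formatting, not a description of how Excel implements it.

```python
# Values reported in cells F9, F12, and F15 of the worksheet
results = {"Mean": 0.8125, "STDEV": 0.352288, "s.e.": 0.124553}

# Format each value for display with two decimal places; the stored values are untouched
shown = {label: f"{value:.2f}" for label, value in results.items()}

for label, text in shown.items():
    print(f"{label}: {text}")
```

As in the spreadsheet, formatting changes only what is displayed: the full-precision numbers remain available for later calculations.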
Saving a Spreadsheet
Objective: To save this spreadsheet with the name: sulphur3
To ensure you can access your spreadsheet later, the first step is to choose the appropriate location for saving it. This decision is crucial for future retrieval.
Fig 1.10 Using the “Decrease Decimal Icon” to convert Numbers to Fewer Decimal Places
Fig 1.11 Example of Converting Numbers to Two Decimal Places
You have multiple options for saving your data. If you're using your personal computer, consider saving the file directly to your hard drive, but be sure to seek guidance if you're unsure how to do this. Alternatively, you can store your data on a CD or a flash drive for easy access and portability.
To save a file, scroll through the left sidebar to select your desired location, such as "This PC" or "My Documents," and click on it to complete the saving process.
File name: sulphur3 (enter this name to the right of File name; see Fig.1.12)
Important note: Be very careful to save your Excel file spreadsheet every few minutes so that you do not lose your information!
Printing a Spreadsheet
Objective: To print the spreadsheet
Use the following procedure when printing any spreadsheet.
Print Active Sheets (see Fig.1.13)
Fig 1.12 Dialogue Box of Saving an Excel Workbook File as “sulphur3” in My Documents location
Fig 1.13 Example of How to Print an Excel Worksheet Using the File/Print/Print Active Sheets Commands
Print (top of your screen)
The final spreadsheet is given in Fig.1.14
Before concluding this chapter, let's explore how to format figures in a spreadsheet through two practical examples: first, applying two decimal places for dollar amounts, and second, utilizing three decimal places for numerical figures.
To save your final spreadsheet, navigate to File and select Save, then close the spreadsheet by choosing File and Close. To start a new project, open a blank Excel spreadsheet by clicking on File, then New, and selecting Blank Workbook from the options available in the top left corner of your screen.
Formatting Numbers in Currency Format (Two Decimal Places)
Objective: To change the format of figures to dollar format with two decimal places
Fig 1.14 Final Result of Printing an Excel Spreadsheet
Highlight cells A4:A6 by left-clicking your mouse on A4 and dragging it down so that these three cells are highlighted in a darker color
Number (top center of screen: click on the down arrow on the right; see Fig.1.15)
Decimal places: 2 (then see Fig.1.16)
Fig 1.15 Dialogue Box for Number Format Choices
The three cells should have a dollar sign in them and be in two decimal places.
Next, let’s practice formatting figures in number format, three decimal places.
Formatting Numbers in Number Format (Three Decimal Places)
Objective: To format figures in number format, three decimal places
Highlight cells A4:A6 on your computer screen
Number (click on the down arrow on the right)
At the right of the box, change two decimal places to three decimal places by clicking on the “up arrow” once
Fig 1.16 Dialogue Box for Currency (two decimal places) Format for Numbers
Ensure the three figures are formatted as numbers with three decimal places. Next, click on a blank cell to deselect the range A4:A6. Finally, close the file by navigating to File > Close and select "Don’t Save," as saving is unnecessary for this practice exercise.
You can use these same commands to format a range of cells in percentage format (and many other formats) to whatever number of decimal places you want to specify.
End-of-Chapter Practice Problems
1. Suppose that you want to determine the mean, standard deviation, and standard error of the mean for the number of Potentilla seeds in a quarter-ounce sample of Phleum pratense grass seeds. The hypothetical data are presented in Fig. 1.17.
Fig 1.17 Worksheet Data for Chap 1: Practice
(a) Use Excel to the right of the table to calculate the sample size, mean, standard deviation, and standard error of the mean. Label all results clearly, and round the mean, standard deviation, and standard error of the mean to two decimal places using number format.
(b) Print the result on a separate page.
(c) Save the file as: seed3
2. As a research assistant, your task is to calculate the average lead concentration in air samples, measured in micrograms per cubic meter (μg/m³), from locations near Route 101 in San Francisco. The analysis focuses on weekday afternoons between 4 p.m. and 7 p.m., using the hypothetical data provided in Fig. 1.18.
(a) Use Excel to create a table of these data. To the right of the table, calculate the sample size, mean, standard deviation, and standard error of the mean. Label each result clearly, and round the mean, standard deviation, and standard error of the mean to two decimal places using number format.
(b) Print the result on a separate page.
(c) Save the file as: air3
3. Measurements of environmental variables can fluctuate with each attempt. For instance, if you aim to establish baseline measurements for tetrachlorobenzene (TcCB) at an uncontaminated site, these readings can serve as a reference for assessing potential contamination in future locations. The hypothetical data collected from this site, presented in parts per billion (ppb), are illustrated in Fig. 1.19.
Fig 1.18 Worksheet Data for Chap 1: Practice
(a) Use Excel to create a table of these data. To the right of the table, calculate the sample size, mean, standard deviation, and standard error of the mean. Label each result clearly, and round the mean, standard deviation, and standard error of the mean to three decimal places using number format.
(b) Print the result on a separate page.
(c) Save the file as: SITE3
McKillup S, Dyar M. Geostatistics Explained: An Introductory Guide for Earth Scientists. Cambridge: Cambridge University Press; 2010.
Schuenemeyer J, Drew L. Statistics for Earth and Environmental Scientists. Hoboken: John Wiley & Sons; 2011.
Fig 1.19 Worksheet Data for Chap 1: Practice
Salt marshes are vital coastal wetlands located along the protected shorelines of the eastern United States, where freshwater and seawater converge. These unique ecosystems experience tidal flooding, requiring the resident plants to adapt to the saline conditions.
Salinity, or the salt content of water, is influenced by the proximity of a marsh to the ocean. A biogeographer researching the impact of salinity on vegetation in a Maine salt marsh has conducted a mapping study of the area.
To conduct a study on salinity levels in a salt marsh, you need to randomly select 5 out of 32 distinct geographic areas. This process requires you to establish a "sampling frame" using your Excel skills, which will help in accurately measuring the salinity percentage in each chosen area.
A sampling frame is essential for selecting a random sample, and in this case, it consists of 32 distinct areas within a salt marsh. Each area is assigned a unique ID number, starting with 1 for the first area and continuing sequentially up to 32 for the last area. This structured approach ensures that the sampling frame is clearly defined, allowing for effective data collection and analysis.
We will first create the frame numbers as follows in a new Excel worksheet:
Creating Frame Numbers for Generating Random Numbers
Objective: To create the frame numbers for generating random numbers
To create frame numbers in column A, use the Home/Fill commands as outlined in Section 1.4.1 of this book. Fill the cells with the numbers 1 to 32 so that the number 32 ends up in cell A35, using the following steps:
Click on cell A4 to select this cell
Fill (then click on the “down arrow” next to this command and select)
Then, save this file as: Random29
You should obtain the result in Fig. 2.3.
Fig 2.1 Dialogue Box for Fill/Series Commands
Fig 2.2 Dialogue Box for Fill/Series/Columns/Step value/Stop value Commands
Now, create a column next to these frame numbers in this manner:
Use the Home/Fill commands to fill cells B4 to B35 with the duplicate frame numbers. Widen columns A and B so that all of the data fits, and center the content within both columns. This will result in a layout similar to Fig. 2.4.
Fig 2.3 Frame Numbers from 1 to 32
Save this file as: Random30
The frame numbers are duplicated in Column A and Column B so that you can verify that you still have exactly 32 of them once the duplicates have been sorted into a random sequence.
Now, let’s add a random number to each of the duplicate frame numbers as follows:
Creating Random Numbers in an Excel Worksheet
C3: RANDOM NO (then widen columns A, B, C so that their labels fit inside the columns; then center the information in A3:C35)
Next, hit the Enter key to add a random number to cell C4.
To generate a random number in Excel, use the =RAND() function, typed with both an open and a closed parenthesis. RAND() takes no arguments and returns a new random number between 0 and 1. Because each random number sits in the same row as the duplicate frame number to its left, every frame number is paired with its own random number.
To add a random number to the ID frame numbers in Excel, position your mouse pointer over cell C4 and move it to the bottom right corner of the cell until a “plus sign” appears Then, click and drag the pointer down to cell C35 to apply the random number to all 32 ID frame numbers.
Then, click on any empty cell to deselect C4:C35 to remove the dark color highlighting these cells.
Random Numbers Assigned to the Duplicate Frame
Save this file as: Random31
Now, let’s sort these duplicate frame numbers into a random sequence:
Sorting Frame Numbers into a Random Sequence
Objective: To sort the duplicate frame numbers into a random sequence
Highlight cells B3:C35 (include the labels at the top of columns B and C)
Data (top of screen)
Sort (click on this word at the top center of your screen; see Fig.2.6)
Sort by: RANDOM NO (click on the down arrow)
Smallest to Largest (see Fig.2.7)
Fig 2.6 Dialogue Box for Data/Sort Commands
Click on any empty cell to deselect B3:C35.
Save this file as: Random32
These steps will produce Fig. 2.8 with the DUPLICATE FRAME NUMBERS sorted into a random order:
Important note: Because Excel randomly assigns these random numbers, your Excel commands will produce a different sequence of random numbers from everyone else who reads this book!
Fig 2.7 Dialogue Box for Data/Sort/RANDOM NO./Smallest to Largest Commands
Because your objective at the beginning of this chapter was to select randomly 5 of the 32 areas of the salt marsh, you now can do that by selecting the first five ID numbers in the DUPLICATE FRAME NO. column after the sort.
While your initial set of five random numbers will differ from our chosen random selection in this chapter, we will identify these five area IDs using Fig 2.9.
Save this file as: Random33
When using the RAND() function in Excel, it's important to note that the five ID numbers generated from your random selection will differ from those shown in Fig 2.9, as Excel produces a new random number with each execution of the command.
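The whole procedure (frame numbers, one random number per frame number, sort, take the first five) can be sketched in Python. This is an illustrative analogue of the worksheet steps, not the Excel implementation; the variable names are made up for the example, and `random.random()` plays the role of =RAND().

```python
import random

# Frame numbers 1..32, one per area of the salt marsh
frame_numbers = list(range(1, 33))

# Pair every frame number with its own random number in [0, 1),
# like typing =RAND() in the cell beside each duplicate frame number
pairs = [(random.random(), frame) for frame in frame_numbers]

# Sort by the random number, smallest to largest
pairs.sort(key=lambda pair: pair[0])

# The first five frame numbers after the sort form the random sample
selected = [frame for _, frame in pairs[:5]]
print(selected)
```

As with the worksheet, every run produces a different selection. Python's `random.sample(frame_numbers, 5)` would draw the same kind of sample in a single call; the longer version is shown because it mirrors the sort-by-random-number steps in this chapter.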
Before concluding this chapter, it's essential to understand how to print a file so that all its information is contained on a single page, avoiding any overflow onto additional pages.
Printing an Excel File So That All of the Information Fits onto One Page
Objective: To print a file so that all of the information fits onto one page
Note that the three practice problems at the end of this chapter require you to sort random numbers when the files contain 42 water samples, 86 field mice, and 75 toxic waste sites. These files are likely to be too large to fit on a single printed page unless you format them to do so.
Let’s create a situation where the file does not fit onto one printed page unless you format it first to do that.
Go back to the file you just created, Random33, and enter the name: Jennifer into cell: A50.
Printing this file will result in the name "Jennifer" appearing on a second page due to its overflow beyond the current page boundaries.
To ensure that all information, including the name Jennifer, fits onto a single page when printing, you will need to adjust the page format by following specific steps.
Page Layout (top left of the computer screen)
(Notice the “Scale to Fit” section in the center of your screen; see Fig.2.10)
Hit the down arrow to the right of 100 % once to reduce the size of the page to 95 %
In Fig 2.11, the name Jennifer appears on the second page of your screen, positioned below the horizontal dotted line, which indicates the outline dimensions of the file as it would appear when printed.
Fig 2.10 Dialogue Box for Page Layout/Scale to Fit Commands
To resize the worksheet to 90 % of its original size, press the down arrow on the right once more. This adjustment will ensure that all content, including Jennifer’s name, is formatted to fit on a single printed page, as indicated by the dotted lines on your screen in Fig. 2.12.
Save the file as: Random34
Print the file Does it all fit onto one page? It should (see Fig.2.13).
Fig 2.12 Example of Scale Reduced to 90 % with “Jennifer” to be printed on the first page (note the dotted line below Jennifer on your screen)
Fig. 2.11 Example of Scale Reduced to 95 % with “Jennifer” to be Printed on a Second Page
Spreadsheet of 90 % Scale to Fit
End-of-Chapter Practice Problems
1. In Jefferson County, Colorado, you have been tasked with testing fluoride levels in drinking water, focusing on a total of 42 collection sites. Due to budget limitations, a random sample of 12 sites must be selected for testing to ensure a representative analysis of fluoride levels in the area's drinking water.
(a) Set up a spreadsheet of frame numbers for these water samples with the heading: FRAME NUMBERS
(b) Then, create a separate column to the right of these frame numbers that duplicates them, with the title: Duplicate Frame Numbers
(c) Then, create a separate column to the right of these duplicate frame numbers called RAND NO. and use the =RAND() function to assign a random number to each duplicate frame number. Change the format of this column so that each random number is displayed with three decimal places.
(d) Sort the duplicate frame numbers and random numbers into a random order.
(e) Print the result so that the spreadsheet fits onto one page.
(f) Circle on your printout the I.D. number of the first 12 water sample locations that you would use in your test.
(g) Save the file as: RAND13
It is important to note that each time the RAND() function is used in Excel, it generates a unique random order of water sample site ID numbers. Consequently, the sequence of random numbers provided in this Excel Guide will differ from the one you generate, which is completely normal and expected.
2. Suppose that a biology field researcher wants to take a random sample of 25 of 86 wild field mice that have been collected from the prairie grass that grows above the bluffs along the Mississippi River in Elsah, Illinois, for a field research study.
(a) Set up a spreadsheet of frame numbers for these mice with the heading: FRAME NUMBERS.
(b) Then, create a separate column to the right of these frame numbers that duplicates them, with the title: Duplicate Frame Numbers
(c) Then, create a separate column to the right of these duplicate frame numbers entitled "Random Number" and use the =RAND() function to assign a random number to each duplicate frame number. Change the format of this column so that each random number is displayed with three decimal places.
(d) Sort the duplicate frame numbers and random numbers into a random order
(e) Print the result so that the spreadsheet fits onto one page
(f) Circle on your printout the I.D. number of the first 25 mice that the field biologist should select for her study.
(g) Save the file as: RAND14
3. Suppose that a chemical field researcher wants to take a random sample of 20 of 75 toxic waste sites surrounding a closed and abandoned commercial house paint plant. The researcher aims to test the soil in this area for lead contamination.
(a) Set up a spreadsheet of frame numbers for these sites with the heading: FRAME NUMBERS.
(b) Then, create a separate column to the right of these frame numbers that duplicates them, with the title: Duplicate Frame Numbers
(c) Then, create a separate column to the right of these duplicate frame numbers entitled "Random Number" and use the =RAND() function to assign a random number to each duplicate frame number. Change the format of this column so that each random number is displayed with three decimal places.
(d) Sort the duplicate frame numbers and random numbers into a random order
(e) Print the result so that the spreadsheet fits onto one page
(f) Circle on your printout the I.D. number of the first 20 sites that the field chemist should select for her study.
(g) Save the file as: RAND5
Confidence Interval About the Mean Using the TINV Function and Hypothesis Testing
This chapter focuses on two ideas: (1) finding the 95 % confidence interval about the mean, and (2) hypothesis testing.
Let’s talk about the confidence interval first.
Confidence Interval About the Mean
How to Estimate the Population Mean
Objective: To estimate the population mean, μ
The population mean represents the average of a specific target group, such as adults aged 25–44. For instance, determining this group's preference for a new Ben & Jerry's ice cream flavor would be impractical if we attempted to survey all individuals in that age range across the U.S. Conducting such a comprehensive study would be time-consuming and financially unfeasible.
Instead of testing the entire population, we can take a sample to estimate the population mean, which is more efficient in terms of time and cost. This approach is known as "inferential statistics," as it allows us to infer the overall population mean from the sample mean.
When we study a sample of people in science research, we know the size of our sample (n), the mean of our sample ($\bar{X}$), and the standard deviation of our sample (STDEV). We use these figures to estimate the population mean with a test called the “confidence interval about the mean.”
Estimating the Lower Limit and the Upper Limit of the 95 % Confidence Interval About the Mean
The theoretical background of this test is beyond the scope of this book, and you can learn more about this test from studying any good statistics textbook (e.g., Levine (2011) or Bremer and Doerge (2010)), but the basic ideas are as follows.
We assume that the population mean is somewhere in an interval that has a “lower limit” and an “upper limit.” We want to be 95 % confident that the population mean lies inside this interval, and we can state this confidence like this:
“We are 95 % confident that the population mean in miles per gallon (mpg) for the Chevy Impala automobile is between 26.92 miles per gallon and 29.42 miles per gallon.”
To promote the Chevy Impala's environmental benefits, we could highlight a fuel efficiency of 28 miles per gallon (mpg) on a billboard. This figure falls within the 95 % confidence interval established in our research, which ranges from 26.92 mpg to 29.42 mpg. While the exact population mean remains unknown, 28 mpg is consistent with our research on the Impala's performance.
But we are only 95 % confident that the population mean is inside this interval, and 5 % of the time we will be wrong in assuming that the population mean is inside this interval.
In scientific research, a 95% confidence level is commonly adopted as the standard for accuracy, although this percentage is arbitrary and can be adjusted to 80%, 90%, or 99% depending on the desired level of confidence For the purpose of this book, a 95% confidence level will be consistently assumed to provide a clear guideline for problem-solving This approach eliminates the need for guessing the desired confidence level, ensuring consistency throughout the problems presented in the book.
So how do we find the 95 % confidence interval about the mean for our data?
In words, we will find this interval this way:
To find the upper limit, take the sample mean and add to it 1.96 times the standard error of the mean (s.e.). To find the lower limit, take the sample mean and subtract from it 1.96 times the standard error of the mean.
36 3 Confidence Interval About the Mean Using the TINV Function and Hypothesis
The standard error of the mean (s.e.) is calculated by dividing the sample's standard deviation (STDEV) by the square root of the sample size (n).
In mathematical terms, the formula for the 95 % confidence interval about the mean is:
$$\bar{X} \pm 1.96\ \text{s.e.}$$
The “±” sign stands for “plus or minus.” The upper limit is the mean plus 1.96 times the s.e., and the lower limit is the mean minus 1.96 times the s.e.
Note: We will explain shortly where the number 1.96 came from.
Let’s try a simple example to illustrate this formula.
Estimating the Confidence Interval for the Chevy
Impala in Miles Per Gallon
In a study of the carbon footprint of Chevy Impala drivers, 49 owners recorded their mileage and fuel usage for two tanks of gas. The results showed a mean of 27.83 miles per gallon (mpg) with a standard deviation of 3.01 mpg. The standard error of the mean (the standard deviation divided by the square root of the sample size) was 3.01/√49 = 3.01/7 = 0.43.
The 95 % confidence interval for these data would be:
$$27.83 \pm 1.96\,(0.43)$$
The upper limit of this confidence interval uses the plus sign of the ± sign in the formula. Therefore, the upper limit would be:
$$27.83 + 1.96\,(0.43) = 27.83 + 0.84 = 28.67$$
Similarly, the lower limit of this confidence interval uses the minus sign of the ± sign in the formula. Therefore, the lower limit would be:
$$27.83 - 1.96\,(0.43) = 27.83 - 0.84 = 26.99$$
3.1 Confidence Interval About the Mean 37
The result of our part of the ongoing research study would, therefore, be the following:
“We are 95 % confident that the population mean for the Chevy Impala is somewhere between 26.99 mpg and 28.67 mpg.”
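The arithmetic in this example can be checked outside Excel. The short Python script below is only a sketch for that purpose; the function name `confidence_interval_95` and its parameter names are our own illustrative choices, not something defined in this book.

```python
import math

def confidence_interval_95(mean, stdev, n, multiplier=1.96):
    """Return (lower limit, upper limit, s.e.) for mean +/- multiplier * s.e."""
    se = stdev / math.sqrt(n)      # standard error of the mean
    margin = multiplier * se       # half-width of the interval
    return mean - margin, mean + margin, se

# Chevy Impala example from the text: mean 27.83, STDEV 3.01, n = 49
lower, upper, se = confidence_interval_95(mean=27.83, stdev=3.01, n=49)
print(f"s.e. = {se:.2f}")                          # s.e. = 0.43
print(f"95% CI: {lower:.2f} to {upper:.2f} mpg")   # 95% CI: 26.99 to 28.67 mpg
```

The default multiplier of 1.96 is the value used in the text for samples of more than 40.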
We could therefore advertise the Chevy Impala's 28 mpg on a billboard. Our data support this claim because 28 mpg lies inside the 95 % confidence interval for the population mean.
You are probably asking yourself: “Where did that 1.96 in the formula come from?”
Where Did the Number “1.96” Come From?
A detailed mathematical answer to that question is beyond the scope of this book, but here is the basic idea.
We assume that the population data follows a "normal distribution," meaning that if we tested everyone in the population, the results would resemble a "normal curve." This curve is symmetric, similar to the outline of the Liberty Bell in Philadelphia, allowing for perfect alignment when folded in half.
Using integral calculus, statisticians find the lower and upper limits that enclose 95 % of the area under a normal curve. For studies involving more than 40 participants, these limits are taken to be plus or minus 1.96 times the standard error of the mean (s.e.). For more detail, see a comprehensive statistics text such as Schuenemeyer and Drew (2011).
The number 1.96 would change if we wanted to be confident of our results at a different level from 95 % as long as we have more than 40 people in our research study.
1. If we wanted to be 80 % confident of our results, this number would be 1.282.
2. If we wanted to be 90 % confident of our results, this number would be 1.645.
3. If we wanted to be 99 % confident of our results, this number would be 2.576.
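These multipliers come from the standard normal curve, and they can be reproduced with Python's standard library. The sketch below is for illustration only; the helper name `z_multiplier` is an assumption of ours.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Two-sided critical value of the standard normal curve."""
    tail = (1 - confidence) / 2             # area left over in each tail
    return NormalDist().inv_cdf(1 - tail)   # point with that much area above it

for level in (0.80, 0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: {z_multiplier(level):.3f}")
# 80% confidence: 1.282
# 90% confidence: 1.645
# 95% confidence: 1.960
# 99% confidence: 2.576
```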
In this book, we prioritize achieving 95% confidence in our results, which is why we consistently use a z-score of 1.96 for research studies involving more than 40 participants.
You might be wondering if the number 1.96 is always used in the confidence interval for the mean. The answer is no, and we will explain why shortly.
Finding the Value for t in the Confidence Interval Formula
Objective: To find the value for t in the confidence interval formula
The correct formula for the confidence interval about the mean for different sample sizes is the following:
$$\bar{X} \pm t\ \text{s.e.}$$
To calculate a 95 % confidence interval, start with the sample mean ($\bar{X}$). The upper limit of the interval is the sample mean plus the product of the t-value and the standard error (s.e.). The lower limit is the sample mean minus that same product. To find the appropriate t-value, refer to the table in Appendix E of this book.
Objective: To find the value of t in the t-table in Appendix E
Before we get into an explanation of what is meant by “the value of t,” let’s give you practice in finding the value of t by using the t-table in Appendix E.
Keep your finger on Appendix E as we explain how you need to “read” that table.
In this chapter, we will utilize the “confidence interval about the mean test” to determine the critical value of t for your research study, referencing the first column labeled “sample size n” in Appendix E.
To determine the value of t for your research study, locate your sample size in the first column of the t-table, then move to the right to find the corresponding value of t in the “critical t column,” which is used for a 95 % confidence interval about the mean. For instance, with a sample size of 14 participants, the value of t is 2.160.
If you have 26 people in your research study, the value of t is 2.060.
In research studies with more than 40 participants, the value of t is always 1.96 when you want to be 95 % confident of your results. The “critical t column” in Appendix E gives the value of t that a result must reach to be significant. Because this book assumes a 95 % confidence level for its statistical tests, you will use the value of t from the t-table in Appendix E whenever you calculate the 95 % confidence interval about the mean.
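Reading the t-table amounts to a simple lookup by sample size. The Python sketch below mimics that lookup using only the values quoted in the text; the dictionary `CRITICAL_T_95` and the function `critical_t` are hypothetical names, and the table is deliberately partial (Appendix E lists many more sample sizes).

```python
# 95 % critical t values quoted in the text, keyed by sample size n.
# This is a partial table for illustration only; see Appendix E for the rest.
CRITICAL_T_95 = {14: 2.160, 26: 2.060}

def critical_t(n):
    """Look up the 95 % critical t for a sample of size n."""
    if n > 40:
        return 1.96                  # the book's rule for more than 40 participants
    if n in CRITICAL_T_95:
        return CRITICAL_T_95[n]
    raise KeyError(f"n = {n} is not in this partial table; see Appendix E")

print(critical_t(14))   # sample of 14 participants
print(critical_t(26))   # sample of 26 participants
print(critical_t(49))   # more than 40 participants
```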
To calculate the confidence interval about the mean using Excel, first determine the value of t from your sample size and desired confidence level. Then, use Excel functions to compute the mean and standard deviation of your data set. Finally, apply the confidence interval formula, which combines the value of t, the mean, and the standard error.
Using Excel’s TINV Function to Find the Confidence Interval About the Mean
Objective: To use the TINV function in Excel to find the confidence interval about the mean
When you use Excel, the formulas for finding the confidence interval are:
Lower limit: =X̄-TINV(1-0.95,n-1)*s.e. (no spaces between these symbols) (3.3)
Upper limit: =X̄+TINV(1-0.95,n-1)*s.e. (no spaces between these symbols) (3.4)
In Excel formulas, the asterisk (*) represents multiplication, the “times” operation used in math. As mentioned in Chapter 1, n denotes the sample size, and so n-1 is the sample size minus one.
In Chapter 1, we learned that the standard error of the mean (s.e.) is calculated by dividing the standard deviation (STDEV) by the square root of the sample size (n). To illustrate, we will solve a sample problem using Excel to find the 95 % confidence interval about the mean.
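Excel's TINV(probability, degrees of freedom) returns the two-tailed critical value of t. For readers curious about what that number is, the Python sketch below approximates it using only the standard library: it integrates the t curve numerically and searches for the cutoff. The function names (`t_pdf`, `t_cdf`, `tinv`), the Simpson's-rule step count, and the search bounds are all assumptions of this sketch; it is not Excel's own algorithm.

```python
import math

def t_pdf(x, df):
    """Height of Student's t curve with df degrees of freedom at x."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """Area under the t curve to the left of x (x >= 0), by Simpson's rule."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def tinv(probability, df):
    """Two-tailed critical t: the value with `probability` split over both tails."""
    target = 1 - probability / 2
    lo, hi = 0.0, 100.0               # search bounds (an assumption; fine for df >= 2)
    for _ in range(60):               # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(f"{tinv(1 - 0.95, 13):.3f}")   # 2.160 (sample size 14)
print(f"{tinv(1 - 0.95, 25):.3f}")   # 2.060 (sample size 26)
```

The two printed values agree with the t-table entries quoted earlier for sample sizes of 14 and 26.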
Let’s suppose that General Motors wanted to claim that its Chevy Impala achieves 28 miles per gallon (mpg) Let’s call 28 mpg the “reference value” for this car.
Suppose that you work for Ford Motor Co. and want to check this claim. You would need to collect some data, and you could then use a two-sided 95 % confidence interval about the mean to test whether the claim holds up.
Using Excel to Find the 95 % Confidence Interval
Objective: To analyze the data using a two-sided 95 % confidence interval about the mean
In a study of new car owners, participants were asked to track their mileage over two tanks of gas and to record the average miles per gallon they achieved. The results are shown in Fig. 3.1.
To analyze the data effectively, create a spreadsheet in Excel to calculate the sample size (n), mean, standard deviation (STDEV), and standard error of the mean (s.e.) using the specified cell references.
Enter the other mpg data in cells A7:A30
To make your table look more professional, first select cells A6:A30 and format them as numbers with one decimal place, centered in Column A. Next, widen Columns A and B to double their original width, and widen Column C to three times the original width of Column A.
Fig 3.1 Worksheet Data for Chevy Impala (Practical Example)
B26: Draw a picture below this confidence interval
B29: lower (right-align this word)
B30: limit (right-align this word)
To format cell C28 correctly, begin the entry with a single quotation mark (‘) to tell Excel that it is a label rather than a number. This ensures that the entry is displayed exactly as typed. For an example of how to lay out the confidence interval labels for the Chevy Impala, see Fig. 3.2.
D28: ‘ -– (note the single quotation mark)
E28: ‘29.42 (note the single quotation mark)
Now, align the labels underneath the picture of the confidence interval so that they look like Fig.3.3.
Next, name the range of data from A6:A30 as: miles
D7: Use Excel to find the sample size
D10: Use Excel to find the mean
Fig 3.3 Example of Drawing a Picture of a Confidence Interval About the Mean Result
D13: Use Excel to find the STDEV
D16: Use Excel to find the s.e.
Now, you need to find the lower limit and the upper limit of the 95 % confidence interval for this study.
We will use Excel’s TINV function to do this We will assume that you want to be 95 % confident of your results.
F21: =D10-TINV(1-.95,24)*D16 (no spaces between symbols)
Note that this TINV formula uses 24 since 24 is one less than the sample size of 25.
F23: =D10+TINV(1-.95,24)*D16
In these formulas, D10 holds the mean and D16 holds the standard error of the mean. Cell F21 gives the lower limit of the confidence interval, and cell F23 gives the upper limit.
The lower limit of the confidence interval is 26.92, and the upper limit is 29.42. Format the mean, standard deviation, standard error of the mean, and both limits of the confidence interval to two decimal places. If you printed the spreadsheet now, the lower and upper limits would spill onto a second page because the information does not fit on a single page in its current format.
To fix this, use the “Scale to Fit” commands in the Page Layout section to reduce the spreadsheet to 95 % of its current size. After you apply this setting, the dotted line next to the values 26.92 and 29.42 indicates that these values will now fit on a single printed page.
Note that you have drawn a picture of the 95 % confidence interval beneath cell B26, including the lower limit, the upper limit, the mean, and the reference value of
28 mpg given in the claim that the company wants to make about the car’s miles per gallon performance.
Now, let’s write the conclusion to your research study on your spreadsheet:
C33: Since the reference value of 28 is inside
C34: the confidence interval, we accept that
C35: the Chevy Impala does get 28 mpg.
You may wonder why the conclusion is typed on three separate lines of the spreadsheet rather than on a single long line. If you wrote it on one line, two things would happen: (1) if you printed the spreadsheet by reducing the size of the page layout so that the entire spreadsheet fit onto one page, the print font size for the entire spreadsheet would be so small that you could not read it without a magnifying glass, and (2) if you printed the spreadsheet without reducing the page layout size, part of the conclusion would “dribble over” onto a separate page all by itself, and your spreadsheet would not look professional.
Your research study accepted the claim that the Chevy Impala gets 28 miles per gallon; the sample mean in the study was 28.17 miles per gallon (see Fig. 3.5). Save your resulting spreadsheet as: CHEVY7
Fig 3.5 Final Spreadsheet for the Chevy Impala Confidence Interval About the Mean
Hypothesis Testing
Hypotheses Always Refer to the Population of People, Plants, or Animals That You Are Studying
The first step is to understand that our hypotheses always refer to the population of people, plants, or animals under study.
To study a species of noxious weed found along southern South Dakota highways, we would select several sections of highway and estimate the weed population in those sections. We would then use this sample to generalize our findings to all highways in southern South Dakota.
In this study, all of the highways in southern South Dakota are the population we want to learn about. The specific sections of highway that we actually examine are the sample drawn from this population.
Our analysis focuses on a subset of highways, and we aim to understand how the findings from this sample can be generalized to the broader population of highways that we are ultimately interested in.
That is why our hypotheses always refer to the population, and never to the sample of people, plants, animals, or events in our study.
You will recall from Chap. 1 that we used the symbol X̄ to refer to the mean of the sample we use in our research study (see Sect. 1.1).
We will use the symbol μ (the Greek letter “mu”) to refer to the population mean.
In testing our hypotheses, we are trying to decide which one of two competing hypotheses about the population mean we should accept given our data set.
The Null Hypothesis and the Research (Alternative) Hypothesis
These two hypotheses are called the null hypothesis and the research hypothesis. Statistics textbooks typically refer to the null hypothesis with the notation: H0.
The research hypothesis is typically referred to with the notation: H1, and it is sometimes called the alternative hypothesis.
Let’s explain first what is meant by the null hypothesis and the research hypothesis:
(1) The null hypothesis is what we accept as true unless we have compelling evidence that it is not true.
(2) The research hypothesis is what we accept as true whenever we reject the null hypothesis as true.
In the American legal system, individuals are presumed innocent until proven guilty by a jury. This principle mirrors the scientific approach: the null hypothesis is that the defendant is innocent, and the research hypothesis is that the defendant is guilty.
In Missouri, the state slogan “Show me” reflects the residents' skepticism and demand for proof before believing claims. It means that Missourians want tangible evidence, not just words, before they accept an assertion.
Hypothesis testing involves determining which of the two competing statements—null hypothesis or research hypothesis—will be accepted as true, as both cannot coexist This process utilizes statistical formulas to make an informed decision on which hypothesis to reject (Schuenemeyer and Drew, 2011).
In scientific research, rating scales are frequently used to measure people's attitudes toward a company, its products, or their intention to buy. Commonly used scales include 5-point, 7-point, and 10-point formats, although other formats are also used.
3.2.2.1 Determining the Null Hypothesis and the Research Hypothesis When Rating Scales Are Used
Here is a typical example of a 7-point scale in science education for parents of 8th grade pupils at the end of a school year (see Fig.3.6):
So, how do we decide what to use as the null hypothesis and the research hypothesis whenever rating scales are used?
Objective: To decide on the null hypothesis and the research hypothesis whenever rating scales are used.
In order to make this determination, we will use a simple rule:
Rule: Whenever rating scales are used, we will use the “middle” of the scale as the null hypothesis and the research hypothesis.
In the above example, since 4 is the number in the middle of the scale (i.e., three numbers are below it, and three numbers are above it), our hypotheses become:
H0: μ = 4
H1: μ ≠ 4
In our statistical analysis of the attitude scale item, a population mean close to 4 suggests that we accept the null hypothesis, indicating that parents of 8th-grade students are neutral regarding their satisfaction with the quality of the science program provided by their child's school.
If our statistical test shows that the population mean is significantly different from 4, we reject the null hypothesis and accept the research hypothesis by stating one of two conclusions.
Either: Parents of 8th grade pupils were significantly satisfied with the quality of the science program offered by their child's school (this is so whenever the sample mean is significantly greater than the expected population mean of 4).
Or: Parents of 8th grade pupils were significantly dissatisfied with the quality of the science program offered by their child's school (this is so whenever the sample mean is significantly less than the expected population mean of 4).
Both of these conclusions cannot be true. We accept one of the hypotheses as “true” based on the data set in our research study, and the other one as “not true” based on our data set.
A research scientist's primary responsibility is to determine which hypothesis—either the null hypothesis or the research hypothesis—should be accepted as valid based on the data collected in their study.
Let’s try some examples of rating scales so that you can practice figuring out what the null hypothesis and the research hypothesis are for each rating scale.
In the spaces in Fig.3.7, write in the null hypothesis and the research hypothesis for the rating scales:
Fig 3.7 Examples of Rating Scales for Determining the Null Hypothesis and the Research Hypothesis
Here are the answers to these three questions:
1. The null hypothesis is μ = 3, and the research hypothesis is μ ≠ 3 on this 5-point scale (i.e., the “middle” of the scale is 3).
2. The null hypothesis is μ = 4, and the research hypothesis is μ ≠ 4 on this 7-point scale (i.e., the “middle” of the scale is 4).
3. The null hypothesis is μ = 5.5, and the research hypothesis is μ ≠ 5.5 on this 10-point scale (i.e., the “middle” of the scale is 5.5 since there are 5 numbers below 5.5 and 5 numbers above 5.5).
Texas Parks and Wildlife employs a 4-point scale in its post-hunting satisfaction survey, which informs the number of wildlife management licenses issued for the subsequent hunting season based on the survey results.
On this scale, the null hypothesis is: μ = 2.5 and the research hypothesis is: μ ≠ 2.5, because there are two numbers below 2.5, and two numbers above 2.5 on that rating scale.
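The “middle of the scale” rule is easy to express as a calculation. The Python sketch below is an illustration only; it assumes that each scale is numbered in consecutive whole numbers starting at 1, and the function name `scale_midpoint` is our own.

```python
def scale_midpoint(points, lowest=1):
    """Middle of a rating scale with `points` consecutive values starting at `lowest`."""
    highest = lowest + points - 1
    return (lowest + highest) / 2

for points in (5, 7, 10, 4):
    mid = scale_midpoint(points)
    print(f"{points}-point scale: H0: mu = {mid:g}, H1: mu != {mid:g}")
# 5-point scale: H0: mu = 3, H1: mu != 3
# 7-point scale: H0: mu = 4, H1: mu != 4
# 10-point scale: H0: mu = 5.5, H1: mu != 5.5
# 4-point scale: H0: mu = 2.5, H1: mu != 2.5
```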
Now, let’s discuss the 7 STEPS of hypothesis testing for using the confidence interval about the mean.
The 7 Steps for Hypothesis-Testing Using the Confidence Interval About the Mean
Objective: To learn the 7 steps of hypothesis-testing using the confidence interval about the mean
There are seven basic steps of hypothesis-testing for this statistical test.
3.2.3.1 STEP 1: State the Null Hypothesis and the Research Hypothesis
When rating scales are used, the hypotheses center on the middle of the scale. For example, on a 7-point scale running from 1 (poor) to 7 (excellent), the hypotheses would be H0: μ = 4 and H1: μ ≠ 4.
3.2.3.2 STEP 2: Select the Appropriate Statistical Test
In this chapter we are studying the confidence interval about the mean, and so we will select that test.
3.2.3.3 STEP 3: Calculate the Formula for the Statistical Test
You will recall (see Sect. 3.1.5) that the formula for the confidence interval about the mean is:
$$\bar{X} \pm t\ \text{s.e.}$$
Earlier in this chapter, we explained the procedure for computing this confidence interval using Excel. The steps are:
1. Use Excel’s =COUNT function to find the sample size.
2. Use Excel’s =AVERAGE function to find the sample mean, X̄.
3. Use Excel’s =STDEV function to find the standard deviation, STDEV.
4. Find the standard error of the mean (s.e.) by dividing the standard deviation (STDEV) by the square root of the sample size, n.
5. Use Excel’s TINV function to find the lower limit of the confidence interval.
6. Use Excel’s TINV function to find the upper limit of the confidence interval.
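The six steps above can also be sketched in Python to show how the pieces fit together. This is an illustration, not the book's Excel procedure: the ratings in `scores` are made-up data for a hypothetical 7-point item, the function name is our own, and the value of t (2.160 for a sample of 14) is taken from the t-table reading described earlier.

```python
import math
import statistics

def confidence_interval_steps(scores, t_value):
    """Follow the six steps and return every intermediate figure."""
    n = len(scores)                     # step 1: like =COUNT
    mean = statistics.mean(scores)      # step 2: like =AVERAGE
    stdev = statistics.stdev(scores)    # step 3: like =STDEV (sample standard deviation)
    se = stdev / math.sqrt(n)           # step 4: standard error of the mean
    lower = mean - t_value * se         # step 5: lower limit
    upper = mean + t_value * se         # step 6: upper limit
    return {"n": n, "mean": mean, "stdev": stdev,
            "se": se, "lower": lower, "upper": upper}

# Hypothetical ratings from 14 respondents (illustrative data only)
scores = [5, 6, 4, 7, 5, 6, 3, 5, 6, 4, 5, 7, 6, 5]
result = confidence_interval_steps(scores, t_value=2.160)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```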
3.2.3.4 STEP 4: Draw a Picture of the Confidence Interval About the Mean, Including the Mean, the Lower Limit of the Interval, the Upper Limit of the Interval, and the Reference Value Given in the Null Hypothesis, H 0 (We Will Explain Step 4 Later in the Chapter.)
3.2.3.5 STEP 5: Decide on a Decision Rule
(a) If the reference value is inside the confidence interval, accept the null hypothesis, H0.
(b) If the reference value is outside the confidence interval, reject the null hypothesis, H0, and accept the research hypothesis, H1.
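This decision rule is a simple comparison, as the Python sketch below shows. The function name `decide` is hypothetical, and treating a reference value that falls exactly on a limit as “inside” is an assumption of the sketch; the text does not address that case.

```python
def decide(lower, upper, reference):
    """Apply the decision rule to a confidence interval and a reference value."""
    if lower <= reference <= upper:      # reference value is inside the interval
        return "accept H0"
    return "reject H0 and accept H1"     # reference value is outside the interval

# Chevy Impala worksheet: interval 26.92 to 29.42, reference value 28
print(decide(26.92, 29.42, 28))   # accept H0
# An interval from 30 to 32 with a reference value of 34
print(decide(30, 32, 34))         # reject H0 and accept H1
```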
3.2.3.6 STEP 6: State the Result of Your Statistical Test
When utilizing the confidence interval for the mean, there are two potential outcomes, but only one can be deemed “true.” Therefore, your findings will fall into one of these two categories.
Either: Since the reference value is inside the confidence interval, we accept the null hypothesis, H0.
Or: Since the reference value is outside the confidence interval, we reject the null hypothesis, H0, and accept the research hypothesis, H1.
3.2.3.7 STEP 7: State the Conclusion of Your Statistical Test in Plain English!
Summarizing the results of your statistical test in clear and simple language can be challenging, especially when aiming for accuracy and brevity The goal is to convey the conclusion of your confidence interval about the mean in a way that is easily understandable, even for those without a background in statistics, such as a supervisor Throughout this book, we will provide ample opportunities to practice this crucial skill.
Let’s set some basic rules for stating the conclusion of a hypothesis test.
Rule #1: Whenever you reject H0 and accept H1, you must use the word “significantly” in the conclusion to alert the reader that this test found an important result.
Rule #2: Create an outline in words of the “key terms” you want to include in your conclusion so that you do not forget to include some of them.
Rule #3: Write the conclusion in plain English so that the reader can understand it even if that reader has never taken a statistics course.
Let’s practice these rules using the Chevy Impala Excel spreadsheet that you created earlier in this chapter, but first we need to state the hypotheses for that car.
If General Motors wants to claim that the Chevy Impala gets 28 miles per gallon on a billboard ad, the hypotheses would be:
H0: μ = 28 mpg
H1: μ ≠ 28 mpg
The reference value of 28 mpg falls within the 95% confidence interval for the data, leading us to accept the null hypothesis (H0) for the Chevy Impala, confirming that the vehicle achieves an average fuel efficiency of 28 mpg.
Objective: To state the result when you accept H 0
Result: Since the reference value of 28 mpg is inside the confidence interval, we accept the null hypothesis, H 0
Let’s try our three rules now:
Objective: To write the conclusion when you accept H 0
Rule #1: Since the reference value was inside the confidence interval, we cannot use the word “significantly” in the conclusion. This is a basic rule we are using in this chapter for every problem.
Rule #2: The key terms in the conclusion would be:
Rule #3: The Chevy Impala did get 28 mpg.
Writing a conclusion when you accept the null hypothesis (H0) is straightforward, because you simply restate the null hypothesis. Writing a conclusion when you reject H0 and accept the research hypothesis (H1) is more difficult. So, let's practice writing such conclusions with three case examples.
Objective: To write the result and conclusion when you reject H 0
CASE #1: Suppose that an ad in The Wall Street Journal claimed that the Honda Accord Sedan gets 34 miles per gallon. The hypotheses would be:
H0: μ = 34 mpg
H1: μ ≠ 34 mpg
Suppose that your research yields the following confidence interval:
30        31        32        34
lower     Mean      upper     Ref.
limit               limit     Value
Result: Since the reference value is outside the confidence interval, we reject the null hypothesis and accept the research hypothesis.
The three rules for stating the conclusion would be:
Rule #1: We must include the word “significantly” since the reference value of 34 is outside the confidence interval.
Rule #2: The key terms would be:
– either “more than” or “less than”
Rule #3: The Honda Accord Sedan got significantly less than 34 mpg, and it was probably closer to 31 mpg.
Note that this conclusion says that the mpg was less than 34 mpg because the sample mean was only 31 mpg. Note, also, that when you find a significant result by rejecting the null hypothesis, it is not sufficient to say only: “significantly less than 34 mpg,” because that does not tell the reader how much less than 34 mpg the sample mean was. To make the conclusion clear, you need to add: “probably closer to 31 mpg” since the sample mean was only 31 mpg.
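To see how the three rules combine, here is a small Python sketch that assembles a plain-English conclusion from the interval, the mean, and the reference value. The function name and its wording template are our own assumptions, modeled on the two car examples in this chapter; real conclusions usually need wording tailored to the study.

```python
def plain_english_conclusion(subject, unit, lower, mean, upper, reference):
    """Build a conclusion sentence following Rules #1 to #3."""
    if lower <= reference <= upper:
        # Reference value inside the interval: no "significantly" (Rule #1)
        return f"The {subject} did get {reference:g} {unit}."
    # Reference value outside the interval: say "significantly", give the
    # direction, and tell the reader roughly where the mean was
    direction = "less" if mean < reference else "more"
    return (f"The {subject} got significantly {direction} than "
            f"{reference:g} {unit}, and it was probably closer to {mean:g} {unit}.")

print(plain_english_conclusion("Honda Accord Sedan", "mpg", 30, 31, 32, 34))
print(plain_english_conclusion("Chevy Impala", "mpg", 26.92, 28.17, 29.42, 28))
```

The first call reproduces the Honda Accord conclusion given under Rule #3 above, and the second reproduces the Chevy Impala conclusion from earlier in the chapter.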
CASE #2: The National Association of Environmental Professionals (NAEP) is committed to safeguarding both the natural and human environment, and it organizes an annual five-day conference for its members. To assess this conference, suppose that the NAEP plans to conduct an Internet survey of the participants. Suppose, further, that you have been asked to analyze the data from the surveys, and that you want to start with the hypothetical Item #15 presented in Fig. 3.8.
The hypotheses for this one item would be:
H0: μ = 4
H1: μ ≠ 4
Fig 3.8 Example of Item #15 of the NAEP Survey
The null hypothesis posits that a mean score of 4 indicates that attendees are neutral about recommending next year's annual conference to colleagues, suggesting no significant difference from this score on the rating scale.
Suppose that your analysis produced the following confidence interval for this item on the survey.
1.8       2.8       3.8       4
lower     Mean      upper     Ref.
limit               limit     Value
Result: Since the reference value is outside the confidence interval, we reject the null hypothesis and accept the research hypothesis.
Rule #1: You must include the word “significantly” since the reference value is outside the confidence interval
Rule #2: The key terms would be:
– NAEP annual meeting this year
– either likely or unlikely (since the result is significant)
– attend next year’s annual meeting of the NAEP
Rule #3: Attendees at this year’s annual meeting of the NAEP were significantly unlikely to recommend to colleagues that they attend next year’s annual meeting of the NAEP.
Note that you need to use the word “unlikely” since the sample mean of 2.8 was on the unlikely side of the middle of the rating scale.
CASE #3: The National Association of Environmental Professionals (NAEP) publishes the peer-reviewed journal Environmental Practice through Cambridge University Press; the journal focuses on Ecology and the Environment. To assess the quality of articles in this journal, suppose that the NAEP has sent an Internet survey to its members. Suppose, further, that you have been asked to analyze the data from the returned surveys, starting with the hypothetical Item #8.
This item would have the following hypotheses:
Suppose that your research produced the following confidence interval for this item on the survey:
Result: Since the reference value is outside the confidence interval, we reject the null hypothesis and accept the research hypothesis.
The three rules for stating the conclusion would be:
Rule #1: You must include the word “significantly” since the reference value is outside the confidence interval
Rule #2: The key terms would be:
– rated the quality of articles
– either “positive” or “negative” (we will explain this)
Rule #3: Members of the NAEP rated the quality of articles in Environmental Practice dealing with Ecology/Environment in an Internet survey as significantly positive.
Note that we do not say “significantly excellent”: in English, something is either excellent or it is not. Since the mean rating of 5.8 for articles dealing with Ecology/Environment was significantly greater than 5.5, on the positive side of the scale, we would say “significantly positive” to indicate this fact.
If you want a more detailed explanation of the confidence interval about the mean, see Hoshmand (1998).
This chapter concludes with three practice problems designed to sharpen your skill in stating the conclusions of your results. This book also gives you many examples to help you write clear and accurate conclusions for your research findings.
Alternative Ways to Summarize the Result
Different Ways to Accept the Null Hypothesis
The following quotes are typical of the language used in statistics and research books when the null hypothesis is accepted:
“The null hypothesis is not rejected.” (Black 2010, p 310)
“The null hypothesis cannot be rejected.” (McDaniel and Gates 2010, p 545)
“The null hypothesis claims that there is no difference between groups.” (Salkind 2010, p. 193)
“The difference is not statistically significant.” (McDaniel and Gates 2010, p. 545)
“… the obtained value is not extreme enough for us to say that the difference between Groups 1 and 2 occurred by anything other than chance.” (Salkind 2010, p. 225)
“If we do not reject the null hypothesis, we conclude that there is not enough statistical evidence to infer that the alternative (hypothesis) is true.” (Keller 2009, p. 358)
“The research hypothesis is not supported.” (Zikmund and Babin 2010, p. 552)
Different Ways to Reject the Null Hypothesis
The following quotes are typical of the language used in statistics and research books when the null hypothesis is rejected:
“The null hypothesis is rejected.” (McDaniel and Gates 2010, p. 546)
“If we reject the null hypothesis, we conclude that there is enough statistical evidence to infer that the alternative hypothesis is true.” (Keller 2009, p. 358)
“If the test statistic’s value is inconsistent with the null hypothesis, we reject the null hypothesis and infer that the alternative hypothesis is true.” (Keller 2009, p. 348)
“Because the observed value is greater than the critical value, the decision is to reject the null hypothesis.” (Black 2010, p. 359)
“If the obtained value is more extreme than the critical value, the null hypothesis cannot be accepted.” (Salkind 2010, p. 243)
“The critical t-value must be surpassed by the observed t-value if the hypothesis test is to be statistically significant.” (Zikmund and Babin 2010, p. 567)
“The calculated test statistic exceeds the upper boundary and falls into this rejection region. The null hypothesis is rejected.” (Weiers 2011, p. 330)
Statisticians and professors use a variety of such phrases when interpreting the outcome of a hypothesis test. It is therefore not uncommon for someone to ask you to summarize the result of a statistical test in language that differs from the language used in this book.
End-of-Chapter Practice Problems
Michigan is renowned for its exceptional fishing spots, particularly in its inland lakes. A research project conducted five years ago identified 230 of these lakes and recorded an average sulfate level of 4.65 mg/L in a sample taken at that time. To assess whether sulfate levels have changed since then, a new random sample of these lakes has been collected, as shown in Fig. 3.10.
Using Excel, find the sample size, mean, standard deviation, and standard error of the mean for these figures, and label each answer clearly. Use number format (two decimal places) for the mean, standard deviation, and standard error of the mean.
Enter the null hypothesis and the research hypothesis onto your spreadsheet. Then use Excel's TINV function to find the 95% confidence interval about the mean for these figures. Label your answers, and use number format (two decimal places).
(e) Enter your conclusion in plain English onto your spreadsheet.
Print the final spreadsheet so that it fits onto a single page. If you need help with this, see the objectives at the end of Chapter 2, Section 2.4. On the printout, draw by hand a diagram of the 95% confidence interval. Finally, save the file as: LAKES3
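For readers who want to cross-check their spreadsheet outside Excel, the workflow of this problem might be sketched in Python as follows. The sulfate readings below are invented for illustration and are not the data in Fig. 3.10; the function name `confidence_interval` is also an assumption. The critical t of 2.365 is the two-tailed 95% value for 7 degrees of freedom, the number Excel's `=TINV(0.05,7)` returns to three decimal places.

```python
import math
import statistics


def confidence_interval(values, critical_t):
    """Summary statistics and confidence limits for one sample."""
    n = len(values)
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)   # sample standard deviation, like Excel's STDEV
    std_error = stdev / math.sqrt(n)   # standard error of the mean
    return {
        "n": n,
        "mean": mean,
        "stdev": stdev,
        "se": std_error,
        "lower": mean - critical_t * std_error,
        "upper": mean + critical_t * std_error,
    }


# Hypothetical sulfate readings in mg/L (NOT the values in Fig. 3.10).
sample = [4.2, 4.8, 5.1, 4.6, 4.9, 5.3, 4.4, 4.7]
result = confidence_interval(sample, critical_t=2.365)

for label, value in result.items():
    print(f"{label}: {value:.2f}")

# The reference value from five years ago is 4.65 mg/L.
inside = result["lower"] <= 4.65 <= result["upper"]
print("Reference value inside the interval:", inside)
```

If the reference value falls inside the interval, the null hypothesis is accepted; if it falls outside, the word "significantly" belongs in the conclusion.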
A fish hatchery in Colorado has asked you to assess the average weight of the trout it releases into local streams and lakes. When the fish are released at sizes deemed too small, licensed fishermen complain.
Fig 3.10 Worksheet Data for Chap 3: Practice
Fishermen complain when the fish they catch are undersized, which points to a possible problem in how the fishery is managed. Conversely, if the fish grow too large because they are fed too much (fish size correlates positively with the amount of feed), the state incurs costs that exceed the hatchery's budget allocation.
The hatchery aims to release trout with an average weight of 308 grams (about 11 ounces) into streams and lakes. To check this goal, a random sample of trout weights from the past week has been collected. By examining these data, you can determine whether the hatchery is meeting its target weight for released trout.
Create an Excel spreadsheet with these data.
Use Excel to the right of the table to find the sample size, mean, standard deviation, and standard error of the mean. Label each result clearly, and round the mean, standard deviation, and standard error of the mean to two decimal places.
(b) Enter the null hypothesis and the research hypothesis for these data on your spreadsheet.
Use Excel's TINV function to find the 95% confidence interval about the mean for these data. Label your answers clearly, and round both the lower limit and the upper limit of the confidence interval to two decimal places. Then enter the result of the test on your spreadsheet.
(e) Enter the conclusion of the test in plain English on your spreadsheet.
Fig 3.11 Worksheet Data for Chap 3: Practice Problem #2
Print your final spreadsheet so that it fits onto a single page. If you need help with this, see the objectives in Chapter 2, Section 2.4.
(g) Draw a picture of the confidence interval, including the reference value, onto your spreadsheet.
(h) Save the final spreadsheet as: TROUT10
Three years ago, an analysis of SO2 concentration data from various sites in Texas recorded an average SO2 level of 120 parts per billion (ppb). In response to growing environmental concerns, the state implemented a comprehensive air quality improvement program aimed at improving the air that residents breathe. The latest data will be analyzed to determine whether these efforts have changed SO2 levels for the communities involved.
Suppose that you want to know whether the air at these sites today differs significantly from the air three years ago. To test your Excel skills, analyze the small sample of hypothetical data shown in Fig. 3.12.
Fig 3.12 Worksheet Data for Chap 3: Practice Problem #3
Create an Excel spreadsheet with these data.
Use Excel to find the sample size, mean, standard deviation, and standard error of the mean. Label your results clearly, and use two decimal places.
(b) Enter the null hypothesis and the research hypothesis for these data onto your spreadsheet.
Use Excel's TINV function to find the 95% confidence interval about the mean for these data. Label your results clearly, showing both the lower limit and the upper limit of the confidence interval, and round these values to two decimal places.
(d) Enter the result of the test on your spreadsheet.
Enter the conclusion of the test in plain English on your spreadsheet. Print the final spreadsheet so that it fits onto a single page; if you need help with this, see the objectives in Chapter 2, Section 2.4.
(g) Draw a picture of the confidence interval, including the reference value, onto your spreadsheet.
(h) Save the final spreadsheet as: PARTS3
Black, K. Business Statistics: For Contemporary Decision Making (6th ed.). Hoboken, NJ: John Wiley & Sons, Inc., 2010.
Bremer, M. and Doerge, R.W. Statistics at the Bench: A Step-by-Step Handbook for Biologists. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press, 2010.
Hoshmand, A.R. Statistical Methods for Environmental and Agricultural Sciences (2nd ed.). Boca Raton, FL: CRC Press, 1998.
Keller, G. Statistics for Management and Economics (8th ed.). Mason, OH: South-Western Cengage Learning, 2009.
Levine, D.M. Statistics for Managers Using Microsoft Excel (6th ed.). Boston, MA: Prentice Hall/Pearson, 2011.
McDaniel, C. and Gates, R. Marketing Research (8th ed.). Hoboken, NJ: John Wiley & Sons, Inc., 2010.
Salkind, N.J. Statistics for People Who (Think They) Hate Statistics (2nd Excel 2007 ed.). Los Angeles, CA: Sage Publications, 2010.
Schuenemeyer, J. and Drew, L. Statistics for Earth and Environmental Scientists. Hoboken, NJ: John Wiley & Sons, 2011.
Webster, R. and Oliver, M. Geostatistics for Environmental Scientists (2nd ed.). Hoboken, NJ: John Wiley & Sons, 2007.
Weiers, R.M. Introduction to Business Statistics (7th ed.). Mason, OH: South-Western Cengage Learning, 2011.
Zikmund, W.G. and Babin, B.J. Exploring Marketing Research (10th ed.). Mason, OH: South-Western Cengage Learning, 2010.
One-Group t-Test for the Mean
In this chapter, you will learn how to use one of the most popular and most helpful statistical tests in science research: the one-group t-test for the mean.
The formula for the one-group t-test is as follows:
$$t = \frac{\bar{X} - \mu}{S_{\bar{X}}}, \qquad \text{where } S_{\bar{X}} = \frac{S}{\sqrt{n}}$$
The 7 STEPS for Hypothesis-Testing Using the One-Group t-Test
STEP 1: State the Null Hypothesis and the Research Hypothesis
When you use numerical rating scales in surveys, the hypotheses always refer to the middle of the scale. For instance, on a 7-point scale where 1 represents "poor" and 7 signifies "excellent," the hypotheses refer to the middle value of the scale.
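For the 7-point scale just described, the middle value is 4, so the two hypotheses could be written as shown below. This is a sketch that assumes a two-sided research hypothesis, which is what a decision rule based on the absolute value of t implies.

```latex
H_0 : \mu = 4
\qquad
H_1 : \mu \neq 4
```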
As a second example, suppose that you worked for Honda Motor Company and that you wanted to place a magazine ad that claimed that the new Honda Fit got 35 miles per gallon (mpg). The hypotheses for testing this claim on actual data would be: H0: μ = 35 mpg and H1: μ ≠ 35 mpg.
STEP 2: Select the Appropriate Statistical Test
In this chapter we will be studying the one-group t-test, and so we will select that test.
STEP 3: Decide on a Decision Rule for the One-Group t-Test
(a) If the absolute value of t is less than the critical value of t, accept the null hypothesis.
(b) If the absolute value of t is greater than the critical value of t, reject the null hypothesis and accept the research hypothesis.
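The decision rule in (a) and (b) can be expressed as a short Python function. The function name `decide` and the sample t-values are assumptions made for illustration. The rule as stated says nothing about the case where the absolute value of t exactly equals the critical value, so the sketch flags that case instead of guessing.

```python
def decide(t_value: float, critical_t: float) -> str:
    """Apply the one-group t-test decision rule."""
    if abs(t_value) < critical_t:
        return "accept the null hypothesis"
    if abs(t_value) > critical_t:
        return "reject the null hypothesis and accept the research hypothesis"
    # |t| == critical t is not covered by rules (a) and (b).
    raise ValueError("rule does not cover |t| equal to the critical value")


# A negative t is handled through its absolute value.
print(decide(-2.35, 1.96))
print(decide(1.20, 1.96))
```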
You are probably saying to yourself: “That sounds fine, but how do I find the absolute value of t?”
4.1.3.1 Finding the Absolute Value of a Number
To do that, we need another objective:
Objective: To find the absolute value of a number
In high school algebra, you likely learned about "absolute value," which refers to the non-negative value of a number, regardless of its original sign.
For example, the absolute value of 2.35 is +2.35.
And the absolute value of minus 2.35 (i.e., −2.35) is also +2.35.
You will also need the t-table in Appendix E to conduct a one-group t-test. In Step 5 of the t-test process, we will explain how to find the critical value of t using this appendix.
STEP 4: Calculate the Formula
Objective: To learn how to use the formula for the one-group t-test
The formula for the one-group t-test is as follows:
$$t = \frac{\bar{X} - \mu}{S_{\bar{X}}}, \qquad \text{where } S_{\bar{X}} = \frac{S}{\sqrt{n}}$$
This formula makes the following assumptions about the data (Foster, Stine, and Waterman 1998):
(1) the data are independent of each other (i.e., each individual contributes only one score),
(2) the population from which the data are drawn is normally distributed, and
(3) the data have a constant variance (note that the standard deviation is the square root of the variance).
To use this formula, you need to follow these steps:
1. Take the sample mean in your research study and subtract the population mean μ from it (remember that the population mean for a study involving numerical rating scales is the “middle” number in the scale).
2. Then take your answer from the above step, and divide it by the standard error of the mean for your research study (you will remember that you learned how to find the standard error of the mean in Chap. 1; to find the standard error of the mean, just take the standard deviation of your research study and divide it by the square root of n, where n is the number of people, plants, or animals used in your research study).
3. The number you get after you complete the above step is the value for t that results when you use the formula stated above.
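The three steps above can be sketched in Python using only the standard library. The ratings below are invented for illustration (a hypothetical 7-point item, so the population mean in the null hypothesis is the midpoint, 4), and the function name `one_group_t` is an assumption.

```python
import math
import statistics


def one_group_t(scores, population_mean):
    """Compute t = (sample mean - population mean) / standard error."""
    n = len(scores)
    sample_mean = statistics.mean(scores)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    standard_error = statistics.stdev(scores) / math.sqrt(n)
    # Step 1: subtract the population mean; Step 2: divide by the standard error.
    return (sample_mean - population_mean) / standard_error


# Hypothetical ratings on a 7-point scale (midpoint = 4).
ratings = [5, 6, 4, 7, 5, 6, 6, 5, 7, 4]
t_value = one_group_t(ratings, population_mean=4)
print(f"t = {t_value:.2f}")
```

`statistics.stdev` divides by n − 1, which is the same sample standard deviation that Excel's STDEV function computes.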
4.1.5 STEP 5: Find the Critical Value of t in the t-Table in Appendix E
Objective: To find the critical value of t in the t-table in Appendix E
Before explaining the concept of "the critical value of t," let's practice locating this value using the t-table found in Appendix E.
Keep your finger on Appendix E as we explain how you need to “read” that table.
In this chapter, the test referred to is the “one-group t-test,” and to determine the critical value of t for your research study, you should consult the first column on the left in Appendix E, which is labeled “sample size n.”
To determine the critical value of t, locate the sample size in the first column of the table, then move to the right to find the corresponding critical t value in the designated column. This critical t value applies both to the one-group t-test and to constructing a 95% confidence interval for the mean.
For example, if you have 27 people in your research study, the critical value of t is 2.056.
If you have 38 people in your research study, the critical value of t is 2.026.
If you have more than 40 people in your research study, the critical value of t is always 1.96.
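A lookup of this kind might be sketched in Python as below. The dictionary holds only the two critical values quoted in the text; it is not a copy of Appendix E, and the names `QUOTED_CRITICAL_T` and `critical_t` are assumptions made for illustration.

```python
# Only the values quoted in the text above; Appendix E lists many more.
QUOTED_CRITICAL_T = {27: 2.056, 38: 2.026}


def critical_t(sample_size: int) -> float:
    """Return the 95% critical value of t for a one-group t-test."""
    if sample_size > 40:
        # The book's convention for samples larger than 40.
        return 1.96
    if sample_size in QUOTED_CRITICAL_T:
        return QUOTED_CRITICAL_T[sample_size]
    raise KeyError(f"look up n = {sample_size} in Appendix E")


print(critical_t(27))
print(critical_t(124))
```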
The "critical t column" in Appendix E gives the value of t that you need to obtain in order to be 95% confident of obtaining a “significant result” in your statistical test. This critical t value serves as the benchmark against which you judge your own result.
The t-table in Appendix E showcases a collection of bell-shaped normal curves, named for their resemblance to the outline of the Liberty Bell located outside Independence Hall in Philadelphia.
In statistical analysis, the center of each normal curve is treated as the zero point on the x-axis. If you want a more detailed explanation of this concept, consult a good statistics book (e.g., Zikmund and Babin 2010).
Values of t to the right of this zero point are positive and are written with a plus sign, while values of t to the left of this zero point are negative and are written with a minus sign. Thus, some values of t are positive, and some are negative.
Statistics books typically present only the positive side of the t-table, because the negative side mirrors the positive side: it contains identical numbers with a minus sign. To use the t-table in Appendix E, you therefore need to take the absolute value of the t-value obtained from the t-test formula, since Appendix E lists only positive t-values.
This book assumes that you want to be 95% confident in the results of your statistical tests. Consequently, the value of t in the t-table in Appendix E tells you whether the t-value calculated from the one-group t-test formula falls within the 95% interval of the t-distribution.
If the t-value calculated from the one-group t-test falls within the 95% confidence interval, the results are considered not significant, which effectively means accepting the null hypothesis.
If the t-value from your one-group t-test falls outside the 95% interval, the result is significant, meaning that it would be expected to occur by chance less than 5% of the time. This outcome allows you to reject the null hypothesis and accept the research hypothesis.
STEP 6: State the Result of Your Statistical Test
There are two possible results when you use the one-group t-test, and only one of them can be accepted as “true.”
If the absolute value of t calculated from the t-test formula is less than the critical value of t in Appendix E, the null hypothesis is accepted. Conversely, if the absolute value of t is greater than the critical value, the null hypothesis is rejected and the research hypothesis is accepted.
STEP 7: State the Conclusion of Your Statistical Test
Summarizing the conclusion of your statistical test in clear and concise language can be challenging, especially when you need to communicate with someone who has no background in statistics, such as your boss. This book gives you practice in this crucial skill as we work through the seven steps for hypothesis testing using the one-group t-test.
If you have read this far, you are ready to sit down at your computer and perform the one-group t-test using Excel on some hypothetical data.
One-Group t-Test for the Mean
A local park has introduced informative displays along a nature trail to raise awareness about the significance of riparian areas in sustaining healthy aquatic ecosystems. To assess the effectiveness of these educational messages, the managing organization conducted a survey among visitors.
The survey contains a number of items, but suppose a hypothetical Item #7 is the one in Fig. 4.1:
Suppose further, that you have decided to analyze the data from visitors using the one-group t-test.
Important note: You would need to use this test for each of the survey items separately.
In a hypothetical analysis of Item #7 from the Riparian Survey, a sample of 124 visitors reported an average score of 6.58, with a standard deviation of 2.44, indicating a varied perception among respondents.
Objective: To analyze the data for each question separately using the one-group t-test for each survey item.
Fig 4.1 Sample Survey Item for Item #7 of the Riparian Survey (Practical Example)
Create an Excel spreadsheet with the following information:
Note: Remember that when you are using a rating scale item, both the null hypothesis and the research hypothesis refer to the “middle of the scale.”
On a 10-point scale, the midpoint is 5.5, with five values below it (1-5) and five values above it (6-10). This midpoint is the value that both hypotheses refer to.
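As a quick check on the "middle of the scale" idea, the midpoint of any rating scale whose points are equally spaced is simply the average of its lowest and highest values. A minimal sketch, with `scale_midpoint` as an assumed helper name:

```python
def scale_midpoint(lowest: int, highest: int) -> float:
    """Middle of an equally spaced rating scale."""
    return (lowest + highest) / 2


print(scale_midpoint(1, 10))  # 10-point scale
print(scale_midpoint(1, 7))   # 7-point scale
```

For the 10-point scale this gives 5.5, and for a 7-point scale it gives 4, which is the actual middle rating on that scale.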
D23: enter the STDEV (see Fig. 4.2)
D26: compute the standard error using the formula in Chap. 1
D29: find the critical value of t in the t-table in Appendix E
Now, enter the following formula in cell D32 to find the t-test result:
=(D20-5.5)/D26 (no spaces between symbols)
Fig. 4.2 Table for Item #7 of the Riparian Survey
To calculate the t-test result, subtract the hypothesized population mean of 5.5 from the sample mean in cell D20, and divide the result by the standard error of the mean in cell D26. Be sure to enclose the subtraction in parentheses so that Excel performs it before the division. The subtraction gives 1.08, which is then divided by the standard error of 0.22 (0.2191 before rounding). This yields a t-test result of 4.93. Use two decimal places for both the standard error and the t-test result.
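The worksheet arithmetic can be verified outside Excel. The sketch below uses the three summary figures given for Item #7 (n = 124, mean = 6.58, STDEV = 2.44); the variable names are assumptions, and the cell references appear only in comments to show which worksheet cell each line mirrors.

```python
import math

n = 124        # sample size
mean = 6.58    # sample mean (cell D20)
stdev = 2.44   # standard deviation (cell D23)

# Standard error of the mean (cell D26): STDEV / sqrt(n)
standard_error = stdev / math.sqrt(n)

# t-test result (cell D32): =(D20-5.5)/D26
t_value = (mean - 5.5) / standard_error

print(f"standard error = {standard_error:.2f}")  # 0.22
print(f"t = {t_value:.2f}")                      # 4.93
```

Note that dividing 1.08 by the rounded value 0.22 would give about 4.91; Excel, like this sketch, divides by the unrounded standard error, which is why the result is 4.93.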
Fig 4.3 t-test Formula Result for Item #7 of the Riparian Survey
Now, write the following sentence in D36–D39 to summarize the result of the t-test:
D36: Since the absolute value of t of 4.93 is
D37: greater than the critical t of 1.96, we
D38: reject the null hypothesis and accept
D39: the research hypothesis.
Lastly, write the following sentence in D41–D43 to summarize the conclusion of the result for Item #7 of the Riparian Survey:
D41: Visitors rated the quality of the new riparian
D42: educational information provided in the displays
D43: along the nature trail as significantly positive.
Save your file as: Riparian4
It is important to note that we have chosen the term "significantly positive" to describe a mean rating of 6.58, which falls on the positive side of the rating scale. We intentionally avoided the phrase "significantly excellent," as it is not commonly used in English; something is either excellent or it is not. Thus, "significantly positive" is the more appropriate wording for this type of rating scale.
When you create a spreadsheet, enter the result and the conclusion in separate cells. If you combine them into one long cell, you will have trouble printing: either the font becomes too small to read when the spreadsheet is shrunk to fit one page, or the text spills onto a second page. Keeping the result and the conclusion in separate cells keeps your final spreadsheet professional and easy to read.
Print the final spreadsheet so that it fits onto one page as given in Fig. 4.4. Enter the null hypothesis and the research hypothesis by hand on your spreadsheet.
Important note: It is important for you to understand that “technically” the above conclusion in statistical terms should state:
Visitors positively rated the quality of the new riparian educational information displayed along the nature trail, indicating that this outcome is likely significant and not coincidental.
Fig 4.4 Final Spreadsheet for Item #7 of the Riparian Survey
In this book, we use the term "significantly" in the conclusions of statistical tests to indicate that the results are probably not due to chance. This shorthand lets readers grasp the conclusions in plain English rather than in complex statistical terminology.
Can You Use Either the 95% Confidence Interval About the Mean OR the One-Group t-Test When Testing Hypotheses?
You are probably asking yourself:
“It sounds like you could use either the 95% confidence interval about the mean or the one-group t-test to analyze the results of the types of problems described so far in this book. Is this statement accurate?”
The answer is a resounding: “Yes!”
In scientific research, both the confidence interval about the mean and the one-group t-test are frequently used to analyze data for the types of problems discussed in this book. These two statistical methods yield the same result and the same conclusion from the same data set.
This book explains both methods: the confidence interval about the mean and the one-group t-test. Some researchers prefer one over the other, and some use both to make their findings clearer. We have explained both methods in detail so that you can confidently apply either one to your own data.
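The agreement between the two methods can be illustrated with a short Python sketch. It reuses the summary figures from the Riparian Survey example (mean 6.58, STDEV 2.44, n = 124, critical t = 1.96); the function names and the extra reference values tried in the loop are assumptions added for illustration.

```python
import math


def ci_rejects_null(mean, std_error, critical_t, reference):
    """Reject when the reference value lies outside the confidence interval."""
    lower = mean - critical_t * std_error
    upper = mean + critical_t * std_error
    return reference < lower or reference > upper


def t_test_rejects_null(mean, std_error, critical_t, reference):
    """Reject when the absolute value of t exceeds the critical t."""
    t_value = (mean - reference) / std_error
    return abs(t_value) > critical_t


std_error = 2.44 / math.sqrt(124)

# 5.5 is the midpoint used in the example; the others are hypothetical.
for reference in (5.5, 6.0, 6.5, 7.5):
    by_ci = ci_rejects_null(6.58, std_error, 1.96, reference)
    by_t = t_test_rejects_null(6.58, std_error, 1.96, reference)
    print(reference, by_ci, by_t)
```

The two columns of answers match for every reference value, because "the reference value lies more than critical t standard errors from the mean" is the same condition stated two ways.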
Now, let’s try your Excel skills on the one-group t-test on these three problems at the end of this chapter.
End-of-Chapter Practice Problems
The U.S. Environmental Protection Agency (EPA) has established a maximum total phosphorus concentration of 0.015 mg/L for wastewater effluent from chemical plants. Over a 90-day period, a random sample of wastewater effluent from a specific chemical plant was collected and analyzed for phosphorus concentration. Test your Excel skills on the hypothetical data provided in Fig. 4.5.
To begin the analysis, state the null hypothesis and the research hypothesis on your spreadsheet. Next, use Excel to find the sample size, mean, standard deviation, and standard error of the mean, and place these values to the right of the data set. Use four decimal places for the mean, standard deviation, and standard error of the mean.
(c) Enter the critical t from the t-table in Appendix E onto your spreadsheet, and label it.
(d) Use Excel to compute the t-value for these data (use two decimal places) and label it on your spreadsheet.
(e) Type the result on your spreadsheet, and then type the conclusion in plain English on your spreadsheet.
(f) Save the file as: WASTE31
Suppose that you want to investigate changes in the average mass of rainbow trout (Oncorhynchus mykiss) in a southern Colorado river. Five years ago, the average mass was 112 grams. A recent sample of rainbow trout has been collected to determine whether the average mass has changed significantly since that time. This small sample gives you a practice run in Excel before you carry out a more extensive data analysis.
Fig 4.5 Worksheet Data for Chap 4: Practice Problem #1
(a) On your Excel spreadsheet, write the null hypothesis and the research hypothesis for these data.
To analyze the data effectively, utilize Excel to calculate the sample size, mean, standard deviation, and standard error of the mean, ensuring that the results are presented with two decimal places for accuracy.
(c) Use Excel to perform a one-group t-test on these data (two decimal places).
(d) On your printout, type the critical value of t given in your t-table in Appendix E.
(e) On your spreadsheet, type the result of the t-test.
(f) On your spreadsheet, type the conclusion of your study in plain English.
(g) Save the file as: TROUT33
Maine, located on the northeastern seaboard of the United States, is renowned for its abundant lakes, boasting over 2,000 named lakes and more than 4,000 unnamed lakes larger than one acre. A key indicator of lake water quality is the level of dissolved oxygen (DO), which plays a crucial role in maintaining a healthy aquatic ecosystem.
Dissolved oxygen (DO) levels in lakes decline as waste accumulates, and the recommended level is 5 milligrams (mg) per liter (L), according to Burt et al. (2009). Suppose that you have gathered data from a selection of lakes in Maine and want to use Excel to analyze a small sample, as shown in Fig. 4.7, before tackling a larger data set.
First, state the null hypothesis and the research hypothesis on your spreadsheet. Next, use Excel to find the sample size, mean, standard deviation, and standard error of the mean, placing these results to the right of your data set. Use two decimal places for the mean, standard deviation, and standard error of the mean.
(c) Enter the critical t from the t-table in Appendix E onto your spreadsheet, and label it.
Fig 4.7 Worksheet Data for Chap 4: Practice problem #3
(d) Use Excel to compute the t-value for these data (use two decimal places) and label it on your spreadsheet.
(e) Type the result on your spreadsheet, and then type the conclusion in plain English on your spreadsheet.
(f) Save the file as: MElakes3
Burt, J., Barber, G., and Rigby, D. Elementary Statistics for Geographers. New York, NY: The Guilford Press, 2009.
Foster, D.P., Stine, R.A., and Waterman, R.P. Basic Business Statistics: A Casebook. New York, NY: Springer-Verlag, 1998.
Hoshmand, A.R. Statistical Methods for Environmental and Agricultural Sciences (2nd ed.). Boca Raton, FL: CRC Press, 1998.
Townend, J. Practical Statistics for Environmental and Biological Scientists. Hoboken, NJ: John Wiley & Sons, 2002.
Zikmund, W.G. and Babin, B.J. Exploring Marketing Research (10th ed.). Mason, OH: South-Western Cengage Learning, 2010.
Two-Group t-Test of the Difference of the Means for Independent Groups
In this section of the book, we shift our focus from studying a single group of people, plants, or animals with one measurement to examining two distinct groups. This shift allows you to compare the two sets of subjects.
The two-group t-test for independent groups is used when two distinct groups are measured on a single variable, so that there is one numerical value for each person, plant, or animal. The groups are considered "independent of one another" because no individual belongs to both groups.
The two-group t-test is based on two key assumptions: first, that both groups are drawn from a normally distributed population, and second, that the variances of these populations are roughly equal (Zikmund and Babin 2010). Note that the standard deviation is simply the square root of the variance. There are different formulas for situations in which the measurements are taken from the same individuals (known as "dependent" data), but this book deals only with independent groups, in which no individual belongs to both data sets.
When testing for differences between the means of two groups, it's crucial to use the appropriate formula based on the sample sizes of each group.
(1) Use Formula #1 in this chapter when both of the groups have a sample size greater than 30, and
(2) Use Formula #2 in this chapter when either one group, or both groups, have a sample size less than 30.
We will illustrate both of these situations in this chapter.
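The selection rule in (1) and (2) can be written as a small Python function. The function name `choose_formula` is an assumption. As stated, the rule does not say what to do when a sample size is exactly 30, so the sketch raises an error in that case instead of guessing.

```python
def choose_formula(n1: int, n2: int) -> str:
    """Pick the two-group t-test formula from the two sample sizes."""
    if n1 > 30 and n2 > 30:
        return "Formula #1"
    if n1 < 30 or n2 < 30:
        return "Formula #2"
    # Reached only when the smaller sample size is exactly 30.
    raise ValueError("the rule as stated does not cover a sample size of exactly 30")


print(choose_formula(52, 57))  # both groups larger than 30
print(choose_formula(12, 57))  # one group smaller than 30
```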
Before exploring the formulas for hypothesis testing involving two groups, it's essential to understand the necessary steps in the process.
The 9 STEPS for Hypothesis-Testing Using the Two-Group t-Test
STEP 1: Name One Group, Group 1, and the Other Group, Group 2
In this chapter, we will use the numbers 1 and 2 to distinguish the two groups. By naming one group Group 1 and the other Group 2, you can simplify your calculations by using these numbers instead of repeatedly writing out the group names.
For example, in a study of college freshmen males and females, you could label the males "Group 1" and the females "Group 2." This makes it easier to refer to the groups throughout your writing than repeatedly using the full terms "College Freshmen Males" and "College Freshmen Females."
Similarly, when studying the flower preferences of hummingbirds, you might compare two vibrant red flowers, fuchsias and mandevillas. Instead of repeatedly naming each flower, you can refer to them as Group 1 and Group 2, which saves time and keeps your analysis clear.
Note that the labels assigned to the two groups are arbitrary: the results and conclusions obtained from the formulas will be the same regardless of which group you call Group 1 and which you call Group 2.
STEP 2: Create a Table That Summarizes the Sample Size, Mean Score, and Standard Deviation of Each Group
To get the correct answer from the two-group t-test, you must put the correct numbers into the formulas, because mixing them up leads to serious errors. For instance, suppose that you tested college freshman males' approval of fracking using two different commercials, one featuring scientists and the other featuring families. Writing "Scientist Message" and "Family Message" throughout your formulas would be cumbersome. Naming the Scientist Message group Group 1 and the Family Message group Group 2 simplifies the process, saves time, and reduces the chance of mistakes in your analysis.
Suppose that the college freshman males were randomly assigned to view either the Scientist Message or the Family Message. After viewing the commercial, each participant rated his approval of fracking on a 100-point scale.
The 52 males in the Scientist Message group had a mean approval rating of 55 with a standard deviation of 7, while the 57 males in the Family Message group had a mean approval rating of 64 with a standard deviation of 13. To test whether the approval ratings of the two groups differ significantly, you need six numbers: the sample size, mean, and standard deviation of each group. Each of these numbers must go into the correct place in the formulas.
If you create a table to summarize these data, a good example of the table, using both Step 1 and Step 2, would be the data presented in Fig. 5.1:
Fig 5.1 Basic Table Format for the Two-group t-test
5.1 The 9 STEPS for Hypothesis-Testing Using the Two-Group t-Test 83
In your research study, you can categorize the data by labeling Group 1 as the Scientist Message group and Group 2 as the Family Message group, effectively organizing the six numbers into the designated categories as illustrated in Fig 5.2.
You can now use the formulas for the two-group t-test with more confidence that the six numbers will be placed in the proper place in the formulas.
You could just as easily have called the Family Message group Group 1 and the Scientist Message group Group 2; the labels are entirely at your discretion and do not affect the results.
STEP 3: State the Null Hypothesis and the Research Hypothesis for the Two-Group t-Test
In a two-group t-test, if you have completed the previous step, stating the hypotheses is straightforward. The null hypothesis asserts that the population means (μ) of the two groups are equal, while the research hypothesis asserts that these means are not equal:
H0: μ1 = μ2
H1: μ1 ≠ μ2
You can now see that this notation is much simpler than having to write out the names of the two groups in all of your formulas.
STEP 4: Select the Appropriate Statistical Test
This chapter focuses on scenarios involving two groups, each with a single measurement for every individual, plant, or animal within those groups. To analyze such data, we will use the two-group t-test throughout the discussion.
Fig 5.2 Results of Entering the Data Needed for the Two-group t-test
STEP 5: Decide on a Decision Rule for the Two-Group t-Test
The decision rule is exactly what it was in the previous chapter (see Sect. 4.1.3) when we dealt with the one-group t-test.
(a) If the absolute value of t is less than the critical value of t, accept the null hypothesis.
(b) If the absolute value of t is greater than the critical value of t, reject the null hypothesis and accept the research hypothesis.
Since you learned how to find the absolute value of t in the previous chapter (see Sect. 4.1.3.1), you can use that knowledge in this chapter.
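The decision rule can also be written out as a short Python sketch, which may help when you want to check a result outside Excel. The function name `decide` and the two illustrative t values are assumptions made for this example. The rule as stated does not say what to do when the absolute value of t exactly equals the critical value, so the sketch treats that tie as a rejection, which is also an assumption.

```python
def decide(t_value: float, critical_t: float) -> str:
    """Apply the decision rule: compare |t| with the critical value of t."""
    if abs(t_value) < critical_t:
        return "accept the null hypothesis"
    # Assumption: an exact tie is treated the same way as |t| > critical t.
    return "reject the null hypothesis and accept the research hypothesis"


# Illustrative values only: one t below the critical value, one above it.
print(decide(0.44, 1.96))
print(decide(-4.55, 1.96))
```

Note that the sign of t does not matter here, because the rule uses the absolute value.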
STEP 6: Calculate the Formula for the Two-Group t-Test
In this chapter, we will discuss two distinct formulas for the two-group t-test; which one you use depends on the sample sizes of the groups. Detailed guidance on using these formulas is provided later in the chapter.
STEP 7: Find the Critical Value of t in the t-Table in Appendix E
In the previous chapter, we explored the one-group t-test, where you determined the critical value of t by locating the sample size in the first column of the t-table and reading the corresponding value in the "critical t column." This process was straightforward with practice. However, when conducting a two-group t-test, finding the critical value of t becomes more complex because there are two distinct groups, which often have different sample sizes.
To use Appendix E correctly in this chapter, you need to learn how to find the "degrees of freedom" for your study. We will discuss that process now.
5.1.7.1 Finding the Degrees of Freedom (df) for the Two-Group t-Test
Objective: To find the degrees of freedom for the two-group t-test and to use it to find the critical value of t in the t-table in Appendix E
The concept of "degrees of freedom" is essential in statistics, and while a detailed mathematical explanation is not provided here, it can be explored in any reputable statistics book, such as Keller (2009). For practical purposes, calculating the degrees of freedom and using it to find the critical value of t in Appendix E is straightforward. The formula for calculating degrees of freedom (df) is df = n1 + n2 - 2.
In other words, you add the sample size for Group 1 to the sample size for Group 2 and then subtract 2 from this total to get the number of degrees of freedom to use in Appendix E.
To determine the critical value of t for a two-group t-test, it is essential to utilize the second column of the table, which represents the degrees of freedom (df), rather than relying on the first column that corresponds to the sample size (n) of a single group, as done in the one-group t-test.
To calculate the degrees of freedom for two groups, add the number of participants in each group and subtract 2; for instance, with 13 individuals in Group 1 and 17 in Group 2, the degrees of freedom would be 13 + 17 - 2 = 28. To find the critical t value, refer to the t-table, locate the row corresponding to 28 degrees of freedom, and read off the critical value of t, which is 2.048.
In a scenario with 52 individuals in Group 1 and 57 in Group 2, the degrees of freedom would be 52 + 57 - 2 = 107. According to Appendix E, once the degrees of freedom exceed 39, the critical t-value remains at 1.96, which is the value that applies to this example.
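The degrees-of-freedom arithmetic above can be sketched in a few lines of Python. The function names are hypothetical, and the second function encodes only the rule stated in this chapter for Appendix E (a critical t of 1.96 once df exceeds 39); for smaller df it deliberately refuses to guess, because those values must be read from the table itself.

```python
def degrees_of_freedom(n1: int, n2: int) -> int:
    """df for the two-group t-test: n1 + n2 - 2."""
    return n1 + n2 - 2


def critical_t_for_large_df(df: int) -> float:
    """Critical t under the chapter's Appendix E rule for df above 39."""
    if df <= 39:
        # Smaller df values have their own rows in the t-table.
        raise ValueError("df is 39 or less: look up the critical t in Appendix E")
    return 1.96


print(degrees_of_freedom(13, 17))   # the 13-and-17 example
print(degrees_of_freedom(52, 57))   # the 52-and-57 example
print(critical_t_for_large_df(degrees_of_freedom(52, 57)))
```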
STEP 8: State the Result of Your Statistical Test
The result follows the exact same result format that you found for the one-group t-test in the previous chapter (see Sect. 4.1.6):
Either: Since the absolute value of t that you found in the t-test formula is less than the critical value of t in Appendix E, you accept the null hypothesis.
Or: Since the absolute value of t that you found in the t-test formula is greater than the critical value of t in Appendix E, you reject the null hypothesis and accept the research hypothesis.
STEP 9: State the Conclusion of Your
Writing conclusions for a two-group t-test is more challenging than for a one-group t-test, as it requires analyzing and interpreting the differences between the two groups being compared.
When you accept the null hypothesis, the conclusion is simple to write: “There is no difference between the two groups in the variable that was measured.”
But when you reject the null hypothesis and accept the research hypothesis, you need to be careful about writing the conclusion so that it is both accurate and concise.
Let’s give you some practice in writing the conclusion of a two-group t-test.
5.1.9.1 Writing the Conclusion of the Two-Group t-Test When You Accept the Null Hypothesis
Objective: To write the conclusion of the two-group t-test when you have accepted the null hypothesis.
After participating in a one-day environmental education program focused on the significance of hunting in wildlife management, college students were surveyed to assess their understanding and perceptions. The survey included Item #10, which aimed to gauge the impact of the program on their views regarding hunting's role in conservation efforts. The results indicate that the program effectively enhanced students' awareness of hunting as a crucial tool for maintaining ecological balance and supporting wildlife populations.
In your analysis of the data, you have chosen to compare Males and Females using the two-group t-test, designating Males as Group 1 and Females as Group 2.
Important note: You would need to use this test for each of the survey items separately.
Fig 5.3 Environmental Education Survey Item #10
Suppose that the hypothetical data for Item #10 were based on a sample of 124 males who had a mean score of 6.58 with a standard deviation of 2.44, and 86 females who had a mean score of 6.45 with a standard deviation of 1.86.
Later in this chapter, we will detail how to calculate the result of a two-group t-test using its formulas. For now, the key findings are as follows: the degrees of freedom are 124 + 86 - 2 = 208, the critical t value is 1.96 (from Appendix E), and the t-test formula yields a result of 0.44.
Result: Since the absolute value of 0.44 is less than the critical t of 1.96, we accept the null hypothesis.
Conclusion: There was no difference between male and female students in their satisfaction with their understanding of hunting as an important part of wildlife management.
Now, let’s see what happens when you reject the null hypothesis (H0) and accept the research hypothesis (H1).
5.1.9.2 Writing the Conclusion of the Two-Group t-Test When You Reject the Null Hypothesis and Accept the Research Hypothesis
Objective: To write the conclusion of the two-group t-test when you have rejected the null hypothesis and accepted the research hypothesis
Let’s continue with this same example, but with the result that we reject the null hypothesis and accept the research hypothesis.
In a study involving 85 males, the average score on the question was 7.26, with a standard deviation of 2.35. In contrast, data from 48 females revealed a mean score of 4.37 and a standard deviation of 3.26, indicating a notable difference in responses between the genders.
Fig 5.4 Worksheet Data for Males vs Females for Item #10 for Accepting the Null Hypothesis
Without going into the details of the formulas for the two-group t-test, these data would produce the following result and conclusion based on Fig. 5.5:
Research Hypothesis: μ1 ≠ μ2
degrees of freedom: 131
critical t: 1.96 (in Appendix E)
t-test formula: 5.40 (when you use your calculator!)
Result: Since the absolute value of 5.40 is greater than the critical t of 1.96, we reject the null hypothesis and accept the research hypothesis.
To determine which group had a more positive rating of their environmental education experience, it is essential to compare the ratings of males and females. Analyzing the data will reveal insights into the perceptions of each gender regarding their educational experiences in this field. Ultimately, the findings will highlight whether males or females rated their environmental education more favorably.
In summarizing the conclusion of a two-group t-test, focus on comparing the means of the two groups. If the null hypothesis is rejected and the research hypothesis is accepted, it is essential to use the term "significantly" in your conclusion to highlight the statistical difference observed.
To write the conclusion of a two-group t-test that uses a rating scale, it helps to visualize the mean scores of both groups. You can do this by drawing a picture of the scale and marking the two mean scores on it, which allows a clear comparison. For instance, in our environmental education example, you would draw the comparison as shown in Fig. 5.6.
Fig 5.5 Worksheet Data for Item #10 for Obtaining a Significant Difference between Males and Females
The visual representation indicates that males gave a significantly more positive rating (7.26) than females (4.37). Because we rejected the null hypothesis and accepted the research hypothesis, we know that there is a significant difference between the mean scores of the two groups.
So, our conclusion needs to contain the following key words:
– important part of wildlife management
We can use these key words to write either of two conclusions, which are logically identical:
Either: Males were significantly more satisfied with their understanding of hunting as an important part of wildlife management than Females (7.26 vs 4.37).
Or: Females were significantly less satisfied with their understanding of hunting as an important part of wildlife management than Males (4.37 vs 7.26).
Both of these conclusions are accurate, so you can decide which one you want to write. It is your choice.
It is essential to ensure that the mean scores given in parentheses at the end of your conclusion appear in the same order as the groups you mention. For instance, if the statement reads, "Males were significantly more satisfied than Females," the accompanying scores must follow this sequence: (7.26 vs 4.37), since you mentioned Males first and Females second.
Alternately, if you wrote that: “Females were significantly less satisfied than Males,” the end of this conclusion should be: (4.37 vs 7.26) since you mentioned Females first, and Males second.
Including the mean scores at the conclusion of your research report allows readers to easily access this information without needing to refer back to the table, enhancing the clarity of your findings regarding the differences between the mean scores.
Fig 5.6 Example of Drawing a “Picture” of the Means of the Two Groups on the Rating Scale
Now, let’s discuss FORMULA #1 that deals with the situation in which both groups have a sample size greater than 30.
Objective: To use FORMULA #1 for the two-group t-test when both groups have a sample size greater than 30
Formula #1: Both Groups Have a Sample Size Greater Than 30
In this section, we will explore the formula for the two-group t-test that applies when each of the two groups has a sample size greater than 30 and there is a single measurement for each participant in both groups. The formula is:
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{S_{\bar{X}_1 - \bar{X}_2}} \]
where
\[ S_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \qquad (5.3) \]
and where degrees of freedom = df = n1 + n2 - 2 (5.1)
This formula looks daunting when you first see it, but let’s explain some of the parts of this formula:
In this chapter, we previously discussed the concept of "degrees of freedom," which you can now use to determine the necessary degrees of freedom for the formula to find the critical value of t in Appendix E.
In the previous chapter, the formula for the one-group t-test was the following:
\[ t = \frac{\bar{X} - \mu}{S_{\bar{X}}} \]
where the denominator is the standard error of the mean (s.e.).
In a one-group t-test, you calculate the mean score and subtract the population mean, then divide the difference by the standard error of the mean (s.e.) to obtain the t-test result. This result is then compared to the critical value of t to determine whether to accept the null hypothesis or reject it in favor of the research hypothesis.
The two-group t-test uses a different formula because there are two groups, each with its own mean score on the variable. This test evaluates the null hypothesis that the population means of the two groups are equal. When both groups have a sample size greater than 30, Formula #1 is the formula to use.
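As a cross-check on Formula #1, here is a minimal Python sketch that divides the difference of the two means by the square root of S1 squared over n1 plus S2 squared over n2. The function names are hypothetical; the inputs are the two sets of Item #10 figures given earlier in this chapter, so the rounded outputs should agree with the t values of 0.44 and 5.40 reported there.

```python
import math


def standard_error_formula1(s1: float, n1: int, s2: float, n2: int) -> float:
    """Standard error of the difference of the means for Formula #1."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def t_formula1(mean1: float, s1: float, n1: int,
               mean2: float, s2: float, n2: int) -> float:
    """Two-group t: difference of the means divided by its standard error."""
    return (mean1 - mean2) / standard_error_formula1(s1, n1, s2, n2)


# Item #10, first data set: 124 males vs 86 females.
t_accept = t_formula1(6.58, 2.44, 124, 6.45, 1.86, 86)
# Item #10, second data set: 85 males vs 48 females.
t_reject = t_formula1(7.26, 2.35, 85, 4.37, 3.26, 48)

print(f"{t_accept:.2f}")
print(f"{t_reject:.2f}")
```

Writing the standard error as its own function mirrors the chapter's advice to break the formula into pieces.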
An Example of Formula #1 for the Two-Group t-Test
Now, let’s use Formula #1 in a situation in which both groups have a sample size greater than 30.
Suppose that a large university offered several sections of Introductory Biology 101 last semester, and that you want to analyze the results of the student evaluation form to see whether male and female students rated the course differently. Specifically, the focus is on Item #12 of the evaluation form, as illustrated in Fig. 5.7.
Suppose you collect these ratings and determine (using your new Excel skills) that the 52 Males in this course had a mean rating of 55 with a standard deviation of 7, while the 57 Females in this course had a mean rating of 64 with a standard deviation of 13.
The two-group t-test is robust, meaning it does not require equal sample sizes for both groups. This flexibility allows for effective comparison between groups, making the t-test a valuable tool in statistical analysis.
Your data then produce the following table in Fig. 5.8:
Create an Excel spreadsheet, and enter the following information:
Now, widen column B so that it is twice as wide as column A, and center the six numbers and their labels in your table (see Fig. 5.9).
Fig 5.7 Example of a Rating Scale for Item #12 for Introductory Biology 101 (Practical Example)
Fig 5.8 Worksheet Data for Item #12 for Introductory Biology 101
5.2 Formula #1: Both Groups Have a Sample Size Greater Than 30 93
Since both groups have a sample size greater than 30, you need to use Formula #1 for the t-test for the difference of the means of the two groups.
Let’s “break this formula down into pieces” to reduce the chance of making a mistake.
B13: STDEV1 squared/n1 (note that you square the standard deviation of Group 1, and then divide the result by the sample size of Group 1)
You now need to compute the values of the above formulas in the following cells:
Fig 5.9 Results of Widening Column B and Centering the Numbers in the Cells
D13: the result of the formula needed to compute cell B13 (use two decimals)
D16: the result of the formula needed to compute cell B16 (use two decimals)
D19: the result of the formula needed to compute cell B19 (use two decimals)
D22: =SQRT(D19) (use two decimals)
This formula should give you a standard error (s.e.) of 1.98.
(Since df = n1 + n2 - 2, this gives df = 109 - 2 = 107, and the critical t is, therefore, 1.96 in Appendix E.)
D28: =(D4-D5)/D22 (use two decimals) (no spaces between symbols)
This formula should give you a value for the t-test of: -4.55.
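The same piece-by-piece calculation can be followed in Python as a check on the spreadsheet. The variable names below borrow the cell addresses; the contents assumed for D16 (STDEV2 squared/n2) and D19 (the sum of D13 and D16) are inferred from Formula #1, because the labels for those cells are not reproduced in the text above. Like Excel, the sketch keeps full precision internally and rounds only for display.

```python
import math

# The six numbers for Item #12: Group 1 = Males, Group 2 = Females.
n1, mean1, stdev1 = 52, 55, 7
n2, mean2, stdev2 = 57, 64, 13

d13 = stdev1 ** 2 / n1       # STDEV1 squared / n1
d16 = stdev2 ** 2 / n2       # assumed: STDEV2 squared / n2
d19 = d13 + d16              # assumed: D13 + D16
d22 = math.sqrt(d19)         # standard error of the difference of the means
d28 = (mean1 - mean2) / d22  # t-test value

for name, value in [("D13", d13), ("D16", d16), ("D19", d19),
                    ("D22", d22), ("D28", d28)]:
    print(f"{name}: {value:.2f}")
```

The t value is negative because the Group 1 mean (55) is smaller than the Group 2 mean (64); the decision rule uses its absolute value.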
Fig 5.10 Formula Labels for the Two-group t-test
Next, check to see if you have rounded off all figures in D13:D28 to two decimal places (see Fig. 5.11).
Now, write the following sentence in D31 to D34 to summarize the result of the study:
D31: Since the absolute value of -4.55
D32: is greater than the critical t of
D33: 1.96, we reject the null hypothesis
D34: and accept the research hypothesis.
Fig 5.11 Results of the t-test Formula for Item #12 for Introductory Biology 101
Finally, write the following sentence in D36 to D38 to summarize the conclusion of the study in plain English:
D36: Overall, females rated the quality of Biology 101
D37: this past semester as significantly better than
D38: males (64 vs. 55).
Save your file as: BIOL12
It's essential to separate the result and conclusion into different cells in your spreadsheet to maintain a professional appearance. Combining them into one cell can lead to issues when printing: either the text will shrink to an unreadable font size to fit on a single page, or it will overflow onto a second page, disrupting the layout. To ensure clarity and professionalism in your final document, always keep these elements distinct.
Print this file so that it fits onto one page, and write by hand the null hypothesis and the research hypothesis on your printout.
The final spreadsheet appears in Fig. 5.12.
Fig 5.12 Final Worksheet for Item #12 for Biology 101
If you would like more information about the two-group t-test, see Wheater and Cook (2000).
Now, let’s use the second formula for the two-group t-test which we use whenever either one group, or both groups, have a sample size less than 30.
Objective: To use Formula #2 for the two-group t-test when one or both groups have a sample size less than 30
Now, let’s look at the case when one or both groups have a sample size less than 30.
Formula #2: One or Both Groups Have a Sample Size Less Than 30
To investigate geographic variations in body length (in mm) of queen honeybees from REGION A and REGION B, a two-group t-test for independent samples was conducted. Using your new Excel skills, you analyze a small sample of queen honeybees from each region, based on the hypothetical data presented in Fig. 5.13. The results will show whether the body lengths differ between the two regions.
Fig 5.13 Worksheet Data for Body Length of Queen Honeybees (Practical Example)
5.3 Formula #2: One or Both Groups Have a Sample Size Less Than 30 99
Let’s call REGION A Group 1, and REGION B Group 2.
Note: Since both groups have a sample size less than 30, you need to use Formula #2.
Create an Excel spreadsheet, and enter the following information:
A3: BODY LENGTH (in mm) OF QUEEN HONEYBEES IN TWO REGIONS
B5: REGION A
Ensure that you accurately input all figures into the table, as even a single incorrect entry can lead to an incorrect solution to the problem. Double-checking your data is essential for achieving the correct results.
Now, widen columns B and C so that all of the information fits inside the cells.
To adjust the width of columns B and C in your spreadsheet, click on the headers of both columns to highlight them. Next, position your mouse at the right edge of column B until a cross sign appears. Click and drag this cross to the right until the text is fully visible. Release the mouse button, and both columns will now have the same width.
Then, center all information in the table except the top title by using the following steps:
To center the content in cells B5:C17, simply left-click and highlight these cells. Next, navigate to the "Alignment" section located at the top-center of the Home tab and click on the second icon from the left on the bottom line. This action will center all the information within each selected cell.
Your spreadsheet should now look like Fig. 5.14.
Use your Excel skills from Chapter 1 to fill in the sample sizes (n), Means, and Standard Deviations (STDEV) in the Table in cells F12:H13. It's crucial to double-check your entries, as even a single incorrect figure can lead to an incorrect solution for this problem.
Since both groups have a sample size less than 30, you need to use Formula #2 for the t-test for the difference of the means of two independent samples.
Formula #2 for the two-group t-test is the following:
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{S_{\bar{X}_1 - \bar{X}_2}} \]
where
\[ S_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \qquad (5.5) \]
and where degrees of freedom = df = n1 + n2 - 2 (5.6)
To minimize errors when writing this complex formula, it is advisable to break it down into smaller components rather than attempting to create the entire formula in a single cell entry.
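To illustrate the idea of breaking Formula #2 into smaller components, here is a minimal Python sketch that computes each piece separately before combining them, in the same spirit as the spreadsheet cells. The function names are hypothetical, and the input values are made-up round numbers chosen so the arithmetic is easy to follow by hand; they are not the honeybee data, which appear only in the figure.

```python
import math


def standard_error_formula2(s1: float, n1: int, s2: float, n2: int) -> float:
    """Standard error of the difference of the means for Formula #2."""
    weighted_1 = (n1 - 1) * s1 ** 2   # (n1 - 1) times STDEV1 squared
    weighted_2 = (n2 - 1) * s2 ** 2   # (n2 - 1) times STDEV2 squared
    df = n1 + n2 - 2                  # degrees of freedom
    size_term = 1 / n1 + 1 / n2       # 1/n1 + 1/n2
    return math.sqrt(((weighted_1 + weighted_2) / df) * size_term)


def t_formula2(mean1: float, s1: float, n1: int,
               mean2: float, s2: float, n2: int) -> float:
    """Two-group t for small samples: mean difference over its standard error."""
    return (mean1 - mean2) / standard_error_formula2(s1, n1, s2, n2)


# Hypothetical inputs (not from the chapter): two groups of 10, both STDEV = 1.
se_example = standard_error_formula2(s1=1.0, n1=10, s2=1.0, n2=10)
t_example = t_formula2(12.0, 1.0, 10, 11.0, 1.0, 10)
print(f"{se_example:.2f}")
print(f"{t_example:.2f}")
```

With these inputs the pooled term equals 1, so the standard error reduces to the square root of 0.2, which makes the result easy to verify with a calculator.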
Now, enter these words on your spreadsheet:
Fig 5.14 Queen Honeybees Body Length Worksheet Data for Hypothesis Testing
Fig 5.15 Queen Honeybees Body Length Formula Labels for the Two-group t-test
You now need to use your Excel skills to compute the values of the above formulas in the following cells:
H16: the result of the formula needed to compute cell E16 (use two decimals)
H19: the result of the formula needed to compute cell E19 (use two decimals)
H22: the result of the formula needed to compute cell E22 (use two decimals)
H25: the result of the formula needed to compute cell E25 (use two decimals)
H28: =SQRT(((H16+H19)/H22)*H25) (no spaces between symbols)
For this formula to work, you need three opening parentheses after SQRT and three closing parentheses at the end. Without this arrangement of parentheses, the formula will not operate as intended.
The above formula gives a standard error of the difference of the means equal to 0.51 (two decimals) in cell H28.
H31: Enter the critical t value from the t-table in Appendix E in this cell using df = n1 + n2 - 2 to find the critical t value
To calculate the t-test value, make sure you place an open parenthesis before G12 and a closed parenthesis after G13. That way the difference of the means, 1.94, is computed first and is then divided by the standard error of the difference of the means, 0.51, yielding a t-test value of 3.77. Excel uses the unrounded standard error in this division, which is why the result is 3.77 rather than the 3.80 you would get from the rounded figures. Remember to present the t-test result using two decimal places (refer to Fig. 5.16).
Now write the following sentence in C37 to C38 to summarize the result of the study:
C37: Since the absolute value of 3.77 is greater than 2.086, we reject the null
C38: hypothesis and accept the research hypothesis.
Finally, write the following sentence in C40 to C41 to summarize the conclusion of the study:
Fig 5.16 Queen Honeybees Body Length Two-group t-test Formula Results
C40: The body length of queen honeybees in Region A was significantly longer than
C41: the body length of queen honeybees in Region B (20.84 vs 18.90).
Save your file as: Honey14
Print the final spreadsheet so that it fits onto one page.
Write the null hypothesis and the research hypothesis by hand on your printout.
The final spreadsheet appears in Fig. 5.17.
Fig 5.17 Queen Honeybees Body Length Final Spreadsheet
If you would like more information about the two-group t-test, see Hoshmand (1998).
End-of-Chapter Practice Problems
1. Suppose that you wanted to compare the wing length (in mm) of a species of adult mosquitoes in the northeast region and the southeast region of the United States. You have obtained the cooperation of other biologists in Vermont and New Hampshire in the northeast region, and Kentucky and South Carolina in the southeast region, who have shared their data with you from a previous study. In the North, 124 mosquitoes had wings with a mean length of 3.2 mm and a standard deviation of 1.2 mm. In the South, 135 mosquitoes had wings with a mean length of 3.4 mm with a standard deviation of 1.3 mm.
(a) State the null hypothesis and the research hypothesis on an Excel spreadsheet.
(b) Find the standard error of the difference between the means using Excel.
(c) Find the critical t value using Appendix E, and enter it on your spreadsheet.
(d) Perform a t-test on these data using Excel. What is the value of t that you obtain?
Use three decimal places for all figures in the formula section of your spreadsheet.
(e) State your result on your spreadsheet.
(f) State your conclusion in plain English on your spreadsheet.
(g) Save the file as: Mosquito3
2. Wheater and Cook (2000) discussed an interesting study comparing the amount of sediment in rivers when construction sites are near the river (Urban) vs. when agricultural land is near the river (Rural). They measured the amount of suspended sediment loads in a section of the river in these two types of sites in milligrams per liter (mg/L). Each river was only used once in the study (i.e., the data are independent samples). The sections of the rivers were similar in size, flow rate, and altitude.
Suppose that you have been hired as a research assistant in a similar study and that you have been asked to analyze the hypothetical data given in Fig. 5.18:
(a) On your Excel spreadsheet, write the null hypothesis and the research hypothesis.
(b) Create a table that summarizes these data on your spreadsheet, and use Excel to find the sample sizes, the means, and the standard deviations of the two groups (use two decimal places for the means and standard deviations).
(c) Use Excel to find the standard error of the difference of the means (two decimal places).
(d) Use Excel to perform a two-group t-test. What is the value of t that you obtain (use two decimal places)?
(e) On your spreadsheet, type the critical value of t using the t-table in Appendix E.
(f) Type your result on the test on your spreadsheet.
(g) Type your conclusion in plain English on your spreadsheet.
(h) Save the file as: SEDIMENT3
3. Polychlorinated biphenyls (PCBs) are yellow, oily liquids that do not smell and that are stored in the fat of people and animals. They can be carried long distances in rivers, lakes, and oceans, and fish can have levels of PCBs in their fatty tissues that are much higher than in the surrounding water. In 1977, the U.S. Environmental Protection Agency (EPA) banned the use of PCBs in man-made materials (Wisconsin Department of Health Services 2014).
Fig 5.18 Worksheet Data for Chap 5: Practice Problem #2
5.4 End-of-Chapter Practice Problems 107
Ofungwu (2014) raises a compelling question regarding the accumulation of PCBs in contaminated rivers, specifically examining whether the flow of a river over a dam influences PCB concentrations differently above and below the dam.
Suppose that you are a research assistant who has been asked to analyze PCB load measurements taken over a 22-day period, both upstream and downstream from a dam. Your objective is to test whether the difference in PCB loads, expressed in kilograms per day, is statistically significant. Using Excel, you have compiled the data into the table shown in Fig. 5.19.
Fig 5.19 Worksheet Data for Chap 5: Practice Problem #3
(a) State the null hypothesis and the research hypothesis on an Excel spreadsheet.
(b) Find the standard error of the difference between the means using Excel (two decimals).
(c) Find the critical t value using Appendix E, and enter it on your spreadsheet.
(d) Perform a t-test on these data using Excel. What is the value of t that you obtain (two decimals)?
(e) State your result on your spreadsheet.
(f) State your conclusion in plain English on your spreadsheet.
(g) Save the file as: STREAM3
Keller, G. Statistics for Management and Economics (8th ed.). Mason, OH: South-Western Cengage Learning, 2009.
Hoshmand, A.R. Statistical Methods for Environmental and Agricultural Sciences (2nd ed.). Boca Raton, FL: CRC Press, 1998.
Ofungwu, J. Statistical Applications for Environmental Analysis and Risk Assessment. Hoboken, NJ: John Wiley & Sons, 2014.
Wheater, C., Cook, P. Using Statistics to Understand the Environment. New York, NY: Routledge, 2000.
Wisconsin Department of Health Services. http://www.dhs.wisconsin.gov/eh/ChemFS/fs/PCB.htm (October 29, 2014).
Zikmund, W.G. and Babin, B.J. Exploring Marketing Research (10th ed.). Mason, OH: South-Western Cengage Learning, 2010.
Correlation and Simple Linear Regression
There are many different types of “correlation coefficients,” but the one we will use in this book is the Pearson product-moment correlation, which we will call: r.
What Is a “Correlation?”
Basically, a correlation is a number between -1 and +1 that summarizes the relationship between two variables, which we will call X and Y.
A correlation can be classified as either positive or negative. A positive correlation indicates that as X increases, Y increases, while a negative correlation indicates that as X increases, Y decreases. This aspect of the relationship is referred to as the direction of the correlation in statistical literature.
The correlation also tells us the magnitude of the relationship between X and Y. As the correlation approaches closer to +1, we say that the relationship is strong and positive.
As the correlation approaches closer to -1, we say that the relationship is strong and negative.
A zero correlation means that there is no relationship between X and Y This means that neither X nor Y can be used as a predictor of the other.
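To make the direction and magnitude of r concrete, here is a small Python sketch of the Pearson product-moment correlation. The chapter itself has not yet given a formula for r, so the sketch uses the standard textbook definition (the sum of the cross-products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations). The function name and the tiny data sets are invented for illustration.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson product-moment correlation between paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative data: Y rises in step with X, then Y falls in step with X.
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_positive, r_negative)
```

The first pair of lists lies on a rising straight line and the second on a falling one, so they show the two extremes of the scale.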
To understand correlation, visualize a scatterplot where variable X is plotted on the x-axis and variable Y on the y-axis. For instance, in Figure 6.1, a perfect positive correlation of r = +1.0 is illustrated, indicating that each y-value can be perfectly predicted from its corresponding x-value. The data points in this scatterplot align perfectly along a straight line that moves upward and to the right, demonstrating a clear relationship between the two variables.
T.J Quirk et al., Excel 2016 for Environmental Sciences Statistics,
The scatterplot in Figure 6.2 illustrates a moderate, positive correlation of r = +0.54. This means that the x-values are moderately good predictors of the y-values, as shown by the upward trend of the data points. The points do not fall on a straight line, but they can be enclosed by a "football" shape that tilts upward to the right.
The scatterplot in Figure 6.3 illustrates a low, positive correlation of r = +0.09. This means that the x-values are poor predictors of the y-values, and the arrangement of the data points is close to a circle.
Fig 6.1 Example of a Scatterplot for a Perfect, Positive Correlation (r = +1.0)
Fig 6.2 Example of a Scatterplot for a Moderate, Positive Correlation (r = +.54)
A zero correlation (r = 0.00) indicates that there is no relationship between X and Y. In a scatterplot, the data points would form a circular cloud, which means that X cannot be used to predict Y.
Figure 6.4 shows a scatterplot with a low, negative correlation of r = -0.09, which means that X is a poor predictor of Y in an inverse relationship. As X increases, Y tends to decrease slightly, as shown by the slight downward tilt of the "football" shape formed by the data points.
Figure 6.5 shows a scatterplot with a moderate, negative correlation of r = -0.54, which means that X is a moderately good predictor of Y. In this inverse relationship, Y tends to decrease as X increases, as shown by the downward tilt of the "football" shape surrounding the data points.
Fig 6.3 Example of a Scatterplot for a Low, Positive Correlation (r = +.09)
Fig 6.4 Example of a Scatterplot for a Low, Negative Correlation (r = -.09)
Figure 6.6 shows a perfect, negative correlation of r = -1.0, which means that X is an exact predictor of Y in an inverse relationship. As X increases, Y decreases, and the data points fall exactly on a straight line that slopes downward to the right.
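To make the ideas of direction and magnitude concrete, here is a minimal Python sketch (Python is not used elsewhere in this book). The function name `pearson_r` and all of the data are invented for illustration; the three y-lists mimic a line rising to the right, a line falling to the right, and a pattern with no straight-line trend.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5]
y_up = [2 * v + 1 for v in x]     # points on a line rising to the right
y_down = [10 - 3 * v for v in x]  # points on a line falling to the right
y_flat = [2, 1, 0, 1, 2]          # symmetric pattern with no straight-line trend

for label, ys in [("rising", y_up), ("falling", y_down), ("no trend", y_flat)]:
    print(label, round(pearson_r(x, ys), 2))
```

The rising line gives a correlation at the top of the range, the falling line gives one at the bottom, and the symmetric pattern gives a correlation of essentially zero even though the points are not scattered at random.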
Let’s explain the formula for computing the correlation r so that you can understand where the number summarizing the correlation comes from.
To help you understand where the correlation number that ranges from -1.0 to +1.0 comes from, we will walk you through the steps involved in using the formula as if you were using a pocket calculator.
Fig 6.5 Example of a Scatterplot for a Moderate, Negative Correlation (r = -.54)
Fig 6.6 Example of a Scatterplot for a Perfect, Negative Correlation (r = -1.0)
This is the only place in this book where we ask you to work with a pocket calculator. Working through the computation step by step will help you understand how the formula for the correlation is applied.
To do that, let’s create a situation in which you need to find the correlation between two variables.
Suppose that a hydroponic laboratory conducted a study to examine the effect of nitrogen on plant growth. The researchers prepared a nitrogen solution and gave different volumes of it, measured in milliliters (ml), to plants in a controlled greenhouse environment. The independent variable (the x-variable) was the amount of nitrogen solution applied, and the dependent variable (the y-variable) was plant growth in centimeters (cm). A random sample of eight plants was selected, and the hypothetical data were recorded to one decimal place to simplify the calculations.
We used Excel to find the sample size for both variables, X and Y, along with the MEAN and STDEV of each. You can practice your Excel skills by reproducing these results in your own spreadsheet from the data provided.
Now, let’s use the above table to compute the correlation r between nitrogen solution and plant growth using your pocket calculator.
Fig 6.7 Worksheet Data for Nitrogen Solution and
6.1.1 Understanding the Formula for Computing a Correlation
Objective: To understand the formula for computing the correlation r
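The formula for computing the correlation r is as follows:

$$ r = \frac{\frac{1}{n-1}\sum (X - \bar{X})(Y - \bar{Y})}{S_x S_y} $$

where $S_x$ and $S_y$ are the standard deviations (STDEV) of X and Y.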
This formula looks daunting at first glance, but let’s “break it down into its steps” to understand how to compute the correlation r.
6.1.2 Understanding the Nine Steps for Computing a Correlation, r
Objective: To understand the nine steps of computing a correlation r
The nine steps are as follows:
1. Find the sample size n by noting the number of plants: n = 8
2. Divide the number 1 by the sample size minus 1 (i.e., 1/7): 0.14286
3. For each plant, take its nitrogen solution and subtract the mean nitrogen solution for the 8 plants, and call this X - X̄ (For example, for Plant #6, this would be: 2.6 - 2.86 = -0.26)
Note: With your calculator, this difference is -0.26, but because Excel uses 16 decimal places for every computation, this result could be slightly different for each plant
4. For each plant, take its growth and subtract the mean plant growth for the 8 plants, and call this Y - Ȳ (For example, for Plant #6, this would be: 2.3 - 2.74 = -0.44)
5. Then, for each plant, multiply (X - X̄) times (Y - Ȳ) (For example, for Plant #6 this would be: (-0.26)(-0.44) = +0.11)
6. Add the results of (X - X̄) times (Y - Ȳ) for the 8 plants: +1.09
Steps 1–6 would produce the Excel table given in Fig.6.8.
Notice that when a negative number is multiplied by a negative number, the result is a positive number, as for Plant #7: (-0.46)(-0.64) = +0.29. When a negative number is multiplied by a positive number, the result is a negative number, as for Plant #1: (-0.06)(+0.16) = -0.01.
Note: Excel computes all computations to 16 decimal places. So, when you check your work with a calculator, you frequently get a slightly different answer than Excel’s answer.
For example, when you compute (X - X̄)(Y - Ȳ) for Plant #2 above, your calculator gives a slightly different result from Excel’s answer of 0.02, which Excel computes using 16 decimal places even though it displays only two decimal places.
When you do Step 6, first add all of the positive numbers to get +1.10, and then add all of the negative numbers to get -0.03. Combining these two results gives +1.07 on your calculator. When you use Excel for these calculations, however, the total is +1.09, because Excel carries every computation to 16 decimal places, which is much more accurate than your calculator.
7. Multiply the answer for step 2 above by the answer for step 6: 0.1557
8. Multiply the STDEV of X times the STDEV of Y: 0.1872
9. Finally, divide the answer from step 7 by the answer from step 8: +0.83
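The nine steps above can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `correlation_nine_steps` is our own, and the eight data pairs are invented (the actual worksheet values appear only in Fig. 6.7). The last lines repeat steps 7 to 9 using the rounded results quoted in the steps.

```python
import math

def correlation_nine_steps(xs, ys):
    """Follow the nine calculator steps and return the correlation r."""
    n = len(xs)                                   # Step 1: sample size
    one_over = 1 / (n - 1)                        # Step 2: 1 / (n - 1)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]              # Step 3: X minus the mean of X
    dev_y = [y - mean_y for y in ys]              # Step 4: Y minus the mean of Y
    products = [dx * dy for dx, dy in zip(dev_x, dev_y)]  # Step 5
    total = sum(products)                         # Step 6: add the products
    step7 = one_over * total                      # Step 7
    # Sample standard deviations (n - 1 in the denominator)
    sd_x = math.sqrt(sum(d * d for d in dev_x) / (n - 1))
    sd_y = math.sqrt(sum(d * d for d in dev_y) / (n - 1))
    step8 = sd_x * sd_y                           # Step 8
    return step7 / step8                          # Step 9

# Hypothetical data for eight plants (NOT the values in the worksheet)
nitrogen_ml = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
growth_cm = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8, 7.2, 7.9]
r_hypothetical = correlation_nine_steps(nitrogen_ml, growth_cm)
print(round(r_hypothetical, 2))

# Steps 7-9 repeated with the rounded results quoted in the steps above
r_from_steps = ((1 / 7) * 1.09) / 0.1872
print(round(r_from_steps, 2))  # 0.83
```

Because the function keeps full precision at every step, its behavior is closer to Excel's than to a calculator that rounds each intermediate result to two decimal places.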
Fig 6.8 Worksheet for Computing the Correlation, r
The correlation of +0.83 between nitrogen solution (X) and plant growth (Y) indicates a strong, positive relationship: among the eight plants studied, the plants that received more nitrogen solution tended to grow more. For a more detailed discussion of correlation, see McCleery et al. (2007).
You could also use the results of the above table in the formula for computing the correlation r in the following way:

$$ r = \frac{\frac{1}{n-1}\sum (X - \bar{X})(Y - \bar{Y})}{\mathrm{STDEV}_x \cdot \mathrm{STDEV}_y} $$

$$ r = \frac{(1/7)(1.09)}{(.48)(.39)} $$

$$ r = 0.83 $$
Using Excel to Compute a Correlation Between Two Variables
Objective: To use Excel to find the correlation between two variables
Suppose that a study of 4-door sedans examined the relationship between the weight of a car and the amount of fuel it used over a distance of 150 miles. The findings indicate that heavier sedans tend to use more fuel. This kind of result is useful to car manufacturers who want to improve the fuel economy of their sedan models.
Twelve brand-new sedan models were tested by hired drivers on a designated route from Forest Park in St. Louis, Missouri, to Kansas City, Missouri. The trip covered 150 miles, and all of the drivers were of similar weight and followed the same speed limits throughout the trip.
To test your Excel skills, you have created a table showing the weight of each car in thousands of pounds alongside the number of gallons of gasoline it used on the drive. The hypothetical data are shown in Fig. 6.9.
Important note: Note that the weight of the cars is recorded in thousands of pounds, so that a car that weighed 3,500 pounds would be recorded as 3.5 in this table.
To analyze the relationship between car weight and fuel consumption, we can use correlation. Here, car weight is the predictor variable (X), and the number of gallons used is the criterion variable (Y). The correlation will tell us how closely differences in car weight go together with differences in the number of gallons used.
Create an Excel spreadsheet with the following information:
A3: WEIGHT OF 4-DOOR SEDANS VS NO OF GALLONS USED TO DRIVE
B5: Is there a relationship between the weight of a 4-door sedan
B6: and the number of gallons used to drive 150 miles?
Next, change the width of Columns B and C so that the information fits inside the cells.
To finish the table, enter the remaining data so that B20 is 4.1 and C20 is 6.9. Double-check all of your figures before proceeding. Then, center the data in all of the cells.
Fig 6.9 Worksheet Data for Weight and Number of Gallons Used (Practical Example)
Next, give the name: weight to the range of data from B9:B20
We discussed earlier in this book (see Sect. 1.4.4) how to “name a range of data,” but here is a reminder of how to do that:
To give a “name” to a range of data:
Click on the top number in the range of data and drag the mouse down to the bottom number of the range.
For example, to give the name: weight to the cells B9:B20, click on B9 and drag the pointer down to B20 so that the cells B9:B20 are highlighted. Then, click on:
Formulas (tab at the top of your screen)
Define Name (top center of your screen)
weight (enter this in the Name box; see Fig. 6.10)
Fig 6.10 Dialogue Box for Naming a Range of Data as: “weight”
Now, repeat these steps to give the name: gallons to C9:C20
Finally, click on any blank cell on your spreadsheet to “deselect” cells C9:C20 on your computer screen.
Next, complete the sample sizes, means, and standard deviations for both variables. As a check, cell B23 should display 3.08 and cell C24 should display 0.75. Format the means and standard deviations to two decimal places, as shown in Fig. 6.11.
Objective: Find the correlation between weight and gallons used
C26: =correl(weight,gallons) (see Fig. 6.12)
Fig 6.11 Example of Using Excel to Find the Sample Size, Mean, and STDEV
Hit the Enter key to compute the correlation
C26: format this cell to two decimals
Note that the equal sign in =correl(weight,gallons) in C26 tells Excel that you are going to use a formula in this cell.
The result is a strong, positive correlation of +0.91 between weight (X) and the number of gallons used (Y). This means that as the weight of the car increases, the number of gallons needed to drive 150 miles also tends to increase.
Save this file as: GALLONS3
The final spreadsheet appears in Fig.6.13.
Fig 6.12 Example of Using Excel’s =correl Function to Compute the Correlation Coefficient
Creating a Chart and Drawing the Regression Line onto the Chart
This section introduces linear regression, and specifically simple linear regression, which uses a single predictor variable (X) to predict the criterion variable (Y). For this statistical model to be applied properly, the data should satisfy four assumptions.
1 The underlying relationship between the two variables under study (X and Y) is linear in the sense that a straight line, and not a curved line, can fit among the data points on the chart.
2 The errors of measurement are independent of each other (this assumption is violated when, e.g., the errors from a specific time period are correlated with the errors in a previous time period).
3 The errors fit a normal distribution of Y-values at each of the X-values.
4 The variance of the errors is the same for all X-values (i.e., the variability of the Y-values is the same for both low and high values of X).
Fig 6.13 Final Result of Using the =correl Function to Compute the Correlation Coefficient
A detailed explanation of these assumptions is beyond the scope of this book, but the interested reader can find a detailed discussion of these assumptions in Levine et al. (2011, pp. 529–530).
Now, let’s create a chart summarizing these data.
When you create charts in Excel, it is important to put the predictor variable (X) on the left and the criterion variable (Y) on the right in your spreadsheet. This makes it clear which variable is the predictor and which is the criterion, and it helps you avoid errors in correlation and simple linear regression analyses.
You need to understand that in any chart that has one predictor and one criterion, there are really TWO LINES that can be drawn between the data points:
(1) One line uses X as the predictor, and Y as the criterion.
(2) A second line uses Y as the predictor, and X as the criterion.
This means that you have to be very careful to note which cells in your input data hold X (the predictor) and which hold Y (the criterion). If you mix up these cells, you will fit the wrong one of the two lines, and your predictions will be incorrect.
In this book, we strongly recommend that you put the predictor variable (X data) on the left side of your table and the criterion variable (Y data) on the right side. This practice helps you avoid confusing the two variables.
The correlation, r, is the same no matter which variable you call the predictor and which you call the criterion. It summarizes the relationship between the two variables regardless of their roles in the analysis.
In our example, we use the weight of a car as the predictor to estimate the number of gallons needed to drive 150 miles. Because the correlation between car weight and gallons used is a strong, positive +0.91, weight is a good predictor of the number of gallons needed for this distance.
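The point about the "two lines" can be checked with a small Python sketch. The data, the function names `pearson_r` and `fit_line`, and the variable names are all invented for illustration and are not the values in Fig. 6.9. The sketch computes r both ways and fits both least-squares lines.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(predictor, criterion):
    """Least-squares intercept and slope for predicting criterion from predictor."""
    mp, mc = mean(predictor), mean(criterion)
    slope = (sum((p - mp) * (c - mc) for p, c in zip(predictor, criterion))
             / sum((p - mp) ** 2 for p in predictor))
    intercept = mc - slope * mp
    return intercept, slope

# Hypothetical data (weight in thousands of pounds, gallons used)
weight = [2.0, 2.5, 3.0, 3.5, 4.0]
gallons = [4.9, 5.2, 6.1, 6.3, 7.2]

r_xy = pearson_r(weight, gallons)   # X = weight, Y = gallons
r_yx = pearson_r(gallons, weight)   # roles swapped
a1, b1 = fit_line(weight, gallons)  # line (1): weight predicts gallons
a2, b2 = fit_line(gallons, weight)  # line (2): gallons predicts weight
slope_line2_on_chart = 1 / b2       # line (2) redrawn with weight on the x-axis

print(round(r_xy, 4) == round(r_yx, 4))            # same correlation either way
print(round(b1, 3), round(slope_line2_on_chart, 3))  # two different slopes
```

The correlation is identical in both directions, but the two lines have different slopes when drawn on the same chart unless the correlation is exactly +1 or -1, which is why keeping X on the left and Y on the right matters.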
1. Open the file that you saved earlier in this chapter: GALLONS3
6.3.1 Using Excel to Create a Chart and the Regression Line
Objective: To create a chart and the regression line summarizing the relationship between weight and gallons used
2. Click and drag the mouse to highlight both columns of numbers (B9:C20), but do not highlight the labels above the data points.
Insert (top left of screen)
Highlight: Scatter chart icon (immediately above the word: “Charts” at the top center of your screen)
Click on the down arrow on the right of the chart icon
Highlight the top left scatter chart icon (see Fig.6.14)
Click on the top left chart to select it
Click on the “+ icon” to the right of the chart (CHART ELEMENTS)
Click on the check mark next to “Chart Title” and also next to “Gridlines” to remove these check marks (see Fig.6.15)
Fig 6.14 Example of Selecting a Scatter Chart
Click on the box next to “Chart Title” and then click on the arrow to its right. Then, click on: “Above Chart”.
Note that the words: “Chart Title” are now in a box at the top of the chart (see Fig. 6.16).
Enter this Chart Title to the right of “fx” at the top of your screen: RELATIONSHIP BETWEEN WEIGHT AND NO OF GALLONS USED (see Fig. 6.17)
Fig 6.15 Example of Chart Elements Selected
Fig 6.16 Example of Chart Title Selected
Hit the Enter key to place this title above the chart
Click on any white space outside of the top title but inside the chart to “deselect” this chart title (see Fig. 6.18).
Fig 6.18 Example of a Chart Title Inserted Into the Chart
Fig 6.17 Example of Creating a Chart Title
Click on the “+ box” to the right of the chart
Add a check mark to the left of “Axis Titles” (This will create an “Axis Title” box on the y-axis of the chart).
Click on the right arrow for: “Axis titles” and then click on: “Primary Horizontal” to remove the check mark in its box (this will create the y-axis title).
Enter the following y-axis title to the right of fx at the top of your screen:
Then, hit the Enter Key to add this y-axis title to the chart
Click inside the chart at the top right corner of the chart to “deselect” the box around the y-axis title (see Fig. 6.19)
Click on the “+ box” to the right of the chart
Highlight: “Axis Titles” and click on its right arrow
Click on the words: “Primary Horizontal” to add a check mark to its box (this creates an “Axis Title” box on the x-axis of the chart).
Enter the following x-axis title to the right of fx at the top of your screen:
Then, hit the Enter Key to add this x-axis title to the chart.
Click inside the chart at the top right corner of the chart to “deselect” the box around the x-axis title (see Fig. 6.20).
Fig 6.19 Example of Adding a y-axis Title to the Chart
Next, we will add the least-squares regression line to the chart. This is the straight line that fits the data points best, and it gives a clear picture of the relationship between the two variables.
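"Best fit" here means that the line makes the sum of the squared vertical distances between the data points and the line as small as possible. The Python sketch below illustrates that idea with invented data; the function names `fit_line` and `sse` are our own, and the code is not how Excel draws its trendline internally.

```python
def fit_line(xs, ys):
    """Return (a, b): intercept and slope of the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

def sse(a, b, xs, ys):
    """Sum of squared vertical distances from the points to the line Y = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data (weight in thousands of pounds, gallons used)
weight = [2.1, 2.6, 3.0, 3.4, 3.9, 4.1]
gallons = [5.0, 5.6, 5.9, 6.6, 6.8, 7.3]

a, b = fit_line(weight, gallons)
best = sse(a, b, weight, gallons)

# Nudging the intercept or the slope in any direction makes the fit worse
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    print(sse(a + da, b + db, weight, gallons) > best)
```

Each comparison shows that a slightly shifted or tilted line has a larger sum of squared distances than the least-squares line.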
6.3.1.1 Drawing the Regression Line Through the Data Points on the Chart
Objective: To draw the regression line through the data points on the chart
Right-click on any one of the data points inside the chart
Highlight: Add Trendline (see Fig. 6.21)
Fig 6.20 Example of a Chart Title, an x-axis Title, and a y-axis Title
Linear (be sure the “Linear” button near the top is selected on the “Format Trendline” dialog box; see Fig. 6.22)
Fig 6.21 Dialogue Box for Adding a Trendline to the Chart
Fig 6.22 Dialogue Box for a Linear Trendline
Click on the X at the top right of the “Format Trendline” dialog box to close this dialog box.
Now, click on any blank cell outside the chart to “deselect” the chart
Save this file as: GALLONS4
Your spreadsheet should look like the spreadsheet in Fig.6.23.
6.3.1.2 Moving the Chart Below the Table in the Spreadsheet
Objective: To move the chart below the table
To reposition the chart, left-click on any white space to the right of the top title, hold the click, and drag the chart down and to the left until the top left corner aligns with cell A29, then release the mouse button.
Fig 6.23 Final Chart with the Trendline Fitted Through the Data Points of the Scatterplot
6.3.1.3 Making the Chart “Longer” So That It Is “Taller”
Objective: To make the chart “longer” so that it is taller
To make the chart taller, left-click on the bottom center of the chart to create an "up-and-down arrow" sign. While holding down the left mouse button, drag the bottom of the chart down to row 48, and then release the mouse button.
Objective: To make the chart “wider”
To widen the chart, position the pointer at the center of the right border, click and hold the left mouse button, then drag the border towards the middle of Column H.
Now, click on any blank cell outside the chart to “deselect” the chart (see Fig. 6.25).
Fig 6.24 Example of Moving the Chart Below the Table
Save this file as: GALLONS4A
To print this spreadsheet on a single page, you need to reduce its scale below 100 percent, because at its current size it would spill over onto four pages. Follow the steps below to print the table and the chart on one page.
Printing a Spreadsheet So That the Table and Chart Fit onto One Page
Objective: To print the spreadsheet so that the table and the chart fit onto one page
Fig 6.25 Example of a Chart that is Enlarged to Fit the Cells: A29:H48
Page Layout (top of screen)
To ensure that the table and chart fit on a single page for printing, adjust the scale by clicking the down-arrow next to the “Scale to Fit” icon at the top center of the screen and selecting “80%.”
Fig 6.26 Example of the Page Layout for Reducing the Scale of the Chart to 80 % of Normal Size
Save your file as: GALLONS5
Finding the Regression Equation
The main reason for computing the correlation between weight (X) and the number of gallons used (Y) is to find out whether a strong relationship exists. When it does, we can use a regression equation to predict the value of Y from a given value of X.
The strong, positive correlation of +0.91 between car weight and gallons used tells us that, based on the data collected so far, weight is a good predictor of the number of gallons used.
We now need to find that regression equation, which is the equation of the “best-fitting straight line” through the data points.
Fig 6.27 Final Spreadsheet of a Table and a Chart (80 % Scale to Fit Size)
Objective: To find the regression equation summarizing the relationship between X and Y.
In order to find this equation, we need to check to see if your version of Excel contains the “Data Analysis ToolPak” necessary to run a regression analysis.
6.5.1 Installing the Data Analysis ToolPak into Excel
Objective: To install the Data Analysis ToolPak into Excel
Since there are currently four versions of Excel in the marketplace (2007, 2010,
2013, 2016), we will give a brief explanation of how to install the Data Analysis ToolPak into each of these versions of Excel.
6.5.1.1 Installing the Data Analysis ToolPak into Excel 2016
Click on: Data (at the top of your screen)
To check if the Data Analysis ToolPak for Excel 2016 is properly installed, look at the top right corner of your monitor screen for the words "Data Analysis." If you see them, the ToolPak was successfully installed with Office 2016, and you can proceed to Section 6.5.2.
If the words: “Data Analysis” are not at the top right of your monitor screen, then the ToolPak component of Excel 2016 was not installed when you installed Office 2016 onto your computer. If this happens, you need to follow these steps:
File
Options (bottom left of screen)
Note: This creates a dialog box with “Excel Options” at the top left of the box
Add-Ins (on left of screen)
Manage: Excel Add-Ins (at the bottom of the dialog box)
Go (at bottom center of dialog box)
Highlight: Analysis ToolPak (in the Add-Ins dialog box)
Put a check mark to the left of Analysis ToolPak
OK (at the right of this dialog box)
You now should have the words: “Data Analysis” at the top right of your screen to show that this feature has been installed correctly
Note: If these steps do not work, you should try these steps instead:
File/Options (bottom left)/Add-ins/Analysis ToolPak/Go/ click to the left of Analysis ToolPak to add a check mark/OK
If you need help doing this, ask your favorite “computer techie” for help. You are now ready to skip ahead to Sect.6.5.2
6.5.1.2 Installing the Data Analysis ToolPak into Excel 2013
Click on: Data (at the top of your screen)
If you see "Data Analysis" at the far right of your monitor screen, it indicates that the Data Analysis ToolPak for Excel 2013 was successfully installed with Office 2013; you can proceed to Sect 6.5.2.
If the words: “Data Analysis” are not at the top right of your monitor screen, then the ToolPak component of Excel 2013 was not installed when you installed Office 2013 onto your computer. If this happens, you need to follow these steps:
File
Options (bottom left of screen)
Note: This creates a dialog box with “Excel Options” at the top left of the box
Add-Ins (on left of screen)
Manage: Excel Add-Ins (at the bottom of the dialog box)
Go (at bottom center of dialog box)
Highlight: Analysis ToolPak (in the Add-Ins dialog box)
Put a check mark to the left of Analysis ToolPak
OK (at the right of this dialog box)
You now should have the words: “Data Analysis” at the top right of your screen to show that this feature has been installed correctly
If you get a prompt asking you for the “installation CD,” put this CD in the CD drive and click on: OK
Note: If these steps do not work, you should try these steps instead:
File/Options (bottom left)/Add-ins/Analysis ToolPak/Go/ click to the left of Analysis ToolPak to add a check mark/OK
If you need help doing this, ask your favorite “computer techie” for help. You are now ready to skip ahead to Sect.6.5.2
6.5.1.3 Installing the Data Analysis ToolPak into Excel 2010
Click on: Data (at the top of your screen)
To check if the Data Analysis ToolPak for Excel 2010 is properly installed, look at the top right corner of your monitor screen for the words "Data Analysis." If you see them, the installation was successful, and you can proceed to Section 6.5.2.
If the words: “Data Analysis” are not at the top right of your monitor screen, then the ToolPak component of Excel 2010 was not installed when you installed Office 2010 onto your computer. If this happens, you need to follow these steps:
Excel options (creates a dialog box)
Manage: Excel Add-Ins (at the bottom of the dialog box)
Highlight: Analysis ToolPak (in the Add-Ins dialog box)
Data (You now should have the words: “Data Analysis” at the top right of your screen)
If you get a prompt asking you for the “installation CD,” put this CD in the CD drive and click on: OK
Note: If these steps do not work, you should try these steps instead:
File/Options (bottom left)/Add-ins/Analysis ToolPak/Go/ click to the left of Analysis ToolPak to add a check mark/OK
If you need help doing this, ask your favorite “computer techie” for help. You are now ready to skip ahead to Sect.6.5.2.
6.5.1.4 Installing the Data Analysis ToolPak into Excel 2007
Click on: Data (at the top of your screen)
If the words “Data Analysis” do not appear at the top right of your screen, you need to install the Data Analysis ToolPak using the following steps:
Microsoft Office button (top left of your screen)
Excel options (bottom of dialog box)
Add-ins (far left of dialog box)
Go (to create a dialog box for Add-Ins)
OK (If Excel asks you for permission to proceed, click on: Yes)
Data (You should now have the words: “Data Analysis” at the top right of your screen)
If you need help doing this, ask your favorite “computer techie” for help.
You are now ready to skip ahead to Sect.6.5.2.
6.5.2 Using Excel to Find the SUMMARY OUTPUT of Regression
You have now installed the ToolPak, and you are ready to find the regression equation for the “best-fitting straight line” through the data points by using the following steps:
Open the Excel file: GALLONS5 (if it is not already open on your screen)
Note: If the file is already open and the chart is selected, click on any empty cell outside the chart to deselect it.
Remember that you gave the name: weight to the X data (the predictor), and the name: gallons to the Y data (the criterion) in a previous section of this chapter (see Sect. 6.2)
Data analysis (far right at top of screen; see Fig.6.28)
Scroll down the dialog box using the down arrow and click on: Regression (see Fig. 6.29)
Fig 6.28 Example of Using the Data/Data Analysis Function of Excel
Fig 6.29 Dialog Box for Creating the Regression Function in Excel
Input Y Range: gallons
Input X Range: weight
Click on the “button” to the left of Output Range to select this, and enter A50 in the box as the place on your spreadsheet to insert the SUMMARY OUTPUT
The SUMMARY OUTPUT should now be in cells: A50:I67
To make the output easier to read, widen the columns in the SUMMARY OUTPUT section of your spreadsheet. Also, format the data in the specified two cells as numbers with two decimal places.
Next, change this cell to four decimal places: B67
Then, change the remaining decimal numbers to number format with three decimal places, and center them in their cells. Under "Page Layout," change the scale to 60% so that everything fits onto a single page. Your final spreadsheet should resemble the one shown in Fig. 6.30.
Save the resulting file as: GALLONS6
Note the following point about the summary output.
Excel gives the name: “Multiple R” to cell A53.
With a single predictor, this value is the size (absolute value) of the correlation between X and Y, so you can read cell A53 as: “correlation r,” which is the notation that we are using in this book. Because “Multiple R” is never negative, check the sign of the slope to decide whether r is positive or negative.
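As a rough check on how the “Multiple R” entry relates to the correlation r when there is only one predictor, the Python sketch below computes both for a small set of invented data with a negative relationship. The function names are our own, and `multiple_r` is computed here as the square root of R Square from a least-squares line, which is our assumption about what the entry represents rather than Excel's actual code.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def multiple_r(xs, ys):
    """Square root of R Square for a one-predictor least-squares line."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    ss_total = sum((y - my) ** 2 for y in ys)
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(1 - ss_resid / ss_total)

# Hypothetical data with a negative relationship
x = [1, 2, 3, 4, 5, 6]
y = [9.1, 8.2, 6.8, 6.1, 4.9, 4.2]

r = pearson_r(x, y)
big_r = multiple_r(x, y)
print(round(r, 3), round(big_r, 3))
```

The two printed values have the same size but opposite signs, which is why the sign of the correlation has to be read from the slope of the line.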
You can now use your printout of the regression analysis to find the regression equation that is the best-fitting straight line through the data points.
But first, let’s review some basic terms.
Fig 6.30 Final Spreadsheet of Correlation and Simple Linear Regression including the SUMMARY OUTPUT for the Data
6.5.2.1 Finding the y-Intercept, a, of the Regression Line
The y-intercept, represented by the letter "a," is the point where the regression line intersects the y-axis when extended In Fig 6.30, the y-intercept is noted as 2.75, located in cell B66 This indicates that if the regression line were to be extended downward, it would cross the y-axis at the value of 2.75, which is the reason for its designation as the "y-intercept."
6.5.2.2 Finding the Slope, b, of the Regression Line
The slope of the regression line, which you can think of as its "tilt," tells you how much the line rises or falls as X increases. When the correlation between X and Y is zero, the regression line is perfectly horizontal (parallel to the X-axis), so its slope is zero.
In a positive correlation between X and Y, the regression line rises upward to the right above the X-axis As illustrated in Fig 6.30, the slope of the regression line is +1.0762, as indicated in cell B67.
We will use the notation “b” to stand for the slope of the regression line (Note that Excel calls the slope of the line: “X Variable 1” in the Excel printout.)
The analysis reveals a strong positive correlation of +0.91 between weight and gallons used, indicating that as weight increases, the gallons used also tend to rise This relationship is visually represented by an upward-sloping regression line, as shown in the data summary output in Fig 6.30, where the correlation coefficient, r, is recorded as +0.91 in cell B53.
If the correlation between X and Y were negative, the regression line would “slope down to the right” above the X-axis. This would happen whenever the correlation between X and Y is a negative correlation that is between zero and minus one (0 and -1).
6.5.3 Finding the Equation for the Regression Line
To predict the number of gallons used based on a car's weight, we can derive the regression equation using two key values from the SUMMARY OUTPUT in Fig 6.30: B66 and B67.
The format for the regression line is:
$Y = a + bX$ (6.3)
where a = the y-intercept (2.75 in our example in cell B66) and b = the slope of the line (+1.0762 in our example in cell B67).
Therefore, the equation for the best-fitting regression line for our example is:
$Y = 2.75 + 1.0762X$
Remember that Y is the number of gallons used that we are trying to predict, using the weight of the car as the predictor, X.
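For readers who want to see where the y-intercept and the slope come from, the following is a minimal Python sketch of the standard least-squares formulas. It is not the book's Excel procedure; the function name `fit_line` and the five data points are hypothetical and are used only for illustration (they are not the GALLONS data).

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (a, b, r): y-intercept, slope, and correlation for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx               # slope of the least-squares line
    a = mean_y - b * mean_x     # y-intercept
    r = sxy / sqrt(sxx * syy)   # Pearson correlation between X and Y
    return a, b, r


# Hypothetical data: weight in thousands of pounds, gallons used
weights = [2.0, 2.5, 3.0, 3.5, 4.0]
gallons = [5.0, 5.4, 6.1, 6.4, 7.1]

a, b, r = fit_line(weights, gallons)
print(f"Y = {a:.4f} + {b:.4f}X, r = {r:+.2f}")
```

With real data, the values of a and b should agree with the Intercept and "X Variable 1" coefficients in Excel's SUMMARY OUTPUT.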
Let’s try an example using this formula to predict the number of gallons used for a car.
6.5.4 Using the Regression Line to Predict the y-Value for a Given x-Value
Objective: To find the number of gallons predicted for a car that weighed 3,000 pounds (Note: 3,000 pounds, when measured in thousands of pounds, is recorded as 3.0.)
Important note: Remember that the weight of the car is in thousands of pounds.
Since the weight is 3,000 pounds (i.e., X = 3.0 in thousands of pounds), substituting this number into our regression equation gives:
$Y = 2.75 + 1.0762(3.0) = 2.75 + 3.2286 \approx 5.98$
Y = 5.98 gallons of gas needed to drive 150 miles
Important note: If you look at your chart and go directly upward from a weight of 3.0 until you hit the regression line, and then draw a horizontal line to the left to the y-axis, you hit the y-axis just below 6 (actually, at 5.98), which is the result above for predicting the number of gallons needed for a car weighing 3,000 pounds.
To predict the number of gallons required for a weight of 3,500 pounds, we first convert the weight into thousands of pounds, which gives X = 3.5. Substituting this number into our regression equation gives:
$Y = 2.75 + 1.0762(3.5) = 2.75 + 3.7667 \approx 6.52$
Y = 6.52 gallons of gas needed to drive 150 miles
Important note: If you look at your chart and go directly upward from a weight of 3.5 until you hit the regression line, you see that you hit this line between 6 and 7 on the y-axis (actually, at 6.52), which is the result above.
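The two predictions in this section can be checked with a short Python sketch. The coefficients are the ones reported in cells B66 and B67 of Fig. 6.30; the function name `predict_gallons` is a hypothetical helper, and the conversion from pounds to thousands of pounds is done inside it.

```python
# Coefficients taken from the SUMMARY OUTPUT in Fig. 6.30
A = 2.75     # y-intercept (cell B66)
B = 1.0762   # slope (cell B67)


def predict_gallons(weight_in_pounds):
    """Predict gallons needed to drive 150 miles from a car's weight in pounds."""
    x = weight_in_pounds / 1000.0  # the regression uses thousands of pounds
    return A + B * x


for weight in (3000, 3500):
    print(f"{weight} pounds: {predict_gallons(weight):.2f} gallons")
```

Forgetting the conversion to thousands of pounds is the most likely mistake here, which is why the sketch takes the weight in pounds and converts it in one place.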
6.6 Adding the Regression Equation to the Chart
Objective: To Add the Regression Equation to the Chart
If you want to include the regression equation within the chart next to the regression line, you can do that, but a word of caution first.
Throughout this book, we are using the regression equation for one predictor and one criterion to be the following:
$Y = a + bX$ (6.3)
where a = y-intercept and b = slope of the line
See, for example, the regression equation in Sect. 6.5.3, where the y-intercept was a = 2.75 and the slope of the line was b = +1.0762, to generate the following regression equation:
$Y = 2.75 + 1.0762X$
However, Excel 2016 uses a slightly different regression equation (which is logically identical to the one used in this book) when you add a regression equation to a chart:
$Y = bX + a$ (6.4)
where a = y-intercept and b = slope of the line
Note that this equation is identical to the one we are using in this book with the terms arranged in a different sequence.
For the example we used in Sect. 6.5.3, Excel 2016 would write the regression equation on the chart as:
$y = 1.0762x + 2.75$
This is the format that will result when you add the regression equation to the chart using Excel 2016 using the following steps:
Open the file: GALLONS6 (that you saved in Sect. 6.5.2)
Click just inside the outer border of the chart in the top right corner to add the “border” around the chart in order to “select the chart” for changes you are about to make
Right-click on any of the data-points in the chart
Highlight: Add Trendline, and click on it to select this command
To display the equation on the chart, first select the "Linear" button located near the top left of the dialog box. Then, scroll down and click on the option labeled "Display Equation on chart," which can be found near the bottom of the dialog box.
Click on the X at the top right of the Format Trendline dialogue box to remove this box.
Click on any empty cell outside of the chart to deselect the chart.
Note that the regression equation on the chart is in the following form next to the regression line on the chart (see Fig. 6.32).
Fig 6.31 Dialogue Box for Adding the Regression Equation to the Chart Next to the Regression Line on the Chart
Click on any empty cell outside of the chart to deselect the chart.
Now, save this file as: GALLONS7, and print it out so that it fits onto one page.
Fig 6.32 Example of a Chart with the Regression Equation Displayed Next to the Regression Line
6.7 How to Recognize Negative Correlations in the SUMMARY OUTPUT Table
Important note: Since Excel does not recognize negative correlations in the SUMMARY OUTPUT results, but treats all correlations as if they were positive, you need to be careful: there may be a negative correlation between X and Y even when the printout shows the correlation as positive.
You will know that the correlation between X and Y is a negative correlation when these two things occur:
(1) THE SLOPE, b, IS A NEGATIVE NUMBER. This can only occur when there is a negative correlation.
(2) THE CHART CLEARLY SHOWS A DOWNWARD SLOPE IN THE REGRESSION LINE, which can only occur when the correlation between X and Y is negative.
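The first rule above can be written as a small Python sketch: take the unsigned correlation from the SUMMARY OUTPUT and give it the sign of the slope. The function name `signed_correlation` is hypothetical, and the sketch assumes a regression with only one predictor, as in this chapter.

```python
def signed_correlation(multiple_r, slope):
    """Attach the sign of the slope, b, to the unsigned correlation from Excel.

    Assumes a simple linear regression (one predictor), where the correlation
    and the slope always share the same sign.
    """
    magnitude = abs(multiple_r)
    return -magnitude if slope < 0 else magnitude


# Values from Fig. 6.30: the slope is positive, so r stays positive
print(signed_correlation(0.91, 1.0762))
# A hypothetical printout with the same correlation but a negative slope
print(signed_correlation(0.91, -1.0762))
```

The second check in the text, looking at the direction of the regression line on the chart, remains a useful safeguard against misreading the printout.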
6.8 Printing Only Part of a Spreadsheet Instead of the Entire Spreadsheet
Objective: To print part of a spreadsheet separately instead of printing the entire spreadsheet
When working with extensive spreadsheets containing numerous data cells and charts, it is often necessary to print only a specific section to ensure readability. Printing a selected portion of your spreadsheet keeps the printout legible and avoids tiny, unreadable text.
Let’s see how to print only part of a spreadsheet on a separate page by using three examples with the GALLONS7 file that you created in Sect. 6.6: (1) printing only the table and the chart on a separate page, (2) printing only the chart on a separate page, and (3) printing only the SUMMARY OUTPUT of the regression analysis on a separate page.
Note: If the file: GALLONS7 is not open on your screen, you need to open it now.
If the “border” is around the outside of the chart, click on any white space outside of the chart to deselect the chart.
Let’s describe how to do these three goals with three separate objectives:
6.8.1 Printing Only the Table and the Chart on a Separate Page
Objective: To print only the table and the chart on a separate page
1 Left-click your mouse starting at the top left of the table in cell A3 and drag the mouse down and to the right so that all of the table and all of the chart are highlighted in light blue on your computer screen from cell A3 to cell H48 (the light blue cells are called the “selection” cells).
Print Active Sheet (hit the down arrow on the right)
The resulting printout should contain only the table of the data and the chart resulting from the data.
Then, click on any empty cell in your spreadsheet to deselect the table and chart.
6.8.2 Printing Only the Chart on a Separate Page
Objective: To print only the chart on a separate page
1 Click on any “white space” just inside the outside border of the chart in the top right corner of the chart to create the border around all of the borders of the chart in order to “select” the chart.
The resulting printout should contain only the chart resulting from the data.
Important note: After printing the chart on a separate page, click on any white space outside of the chart to remove the border from the chart. When the border is around the chart, the chart is selected, which tells Excel that you want to print only the chart by itself.
6.8.3 Printing Only the SUMMARY OUTPUT of the Regression Analysis on a Separate Page
Objective: To print only the SUMMARY OUTPUT of the regression analysis on a separate page
1 Left-click your mouse at the cell just above SUMMARY OUTPUT in cell A50 on the left of your spreadsheet and drag the mouse down and to the right until all of the regression output is highlighted in dark blue on your screen from A50 to I67.
Change the “Scale to Fit” to 60% so that the SUMMARY OUTPUT will fit onto one page when you print it out.
Print active sheets (hit the down arrow on the right)
The resulting printout should contain only the summary output of the regression analysis on a separate page.
Finally, click on any empty cell on the spreadsheet to “deselect” the regression table.
6.9 End-of-Chapter Practice Problems
1 What is the relationship between the weight of the car (measured in thousands of pounds) and its city miles per gallon (mpg) in 4-door passenger sedans? Suppose that you wanted to study this question using different models of cars. Analyze the hypothetical data that are given in Fig. 6.33.
Create an Excel spreadsheet, and enter the data.
(a) Create an XY scatterplot of these two sets of data such that:
• top title: RELATIONSHIP BETWEEN WEIGHT AND CITY mpg IN 4-DOOR SEDANS
• y-axis title: CITY MILES PER GALLON (mpg)
• move the chart below the table
• re-size the chart so that it is 7 columns wide and 25 rows long
(b) Create the least-squares regression line for these data on the scatterplot.
(c) Use Excel to run the regression statistics to find the equation for the least-squares regression line for these data, and display the results below the chart on your spreadsheet. Add the regression equation to the chart. Use number format (three decimal places) for the correlation and for the coefficients. Print only the input data and the chart so that this information fits onto one page in portrait format.
Then, print just the regression output table on a separate page so that it fits onto that separate page in portrait format.
(d) Circle and label the value of the y-intercept and the slope of the regression line on your printout.
(e) Write the regression equation by hand on your printout for these data (use three decimal places for the y-intercept and the slope).
(f) Circle and label the correlation between the two sets of scores in the regression analysis summary output table on your printout.
Fig 6.33 Worksheet Data for Chap 6: Practice Problem #1
(g) Underneath the regression equation you wrote by hand on your printout, use the regression equation to predict the average city mpg of a 4-door sedan that weighed 2,500 pounds.
(h) Read from the graph, the average city mpg you would predict for a 4-door sedan that weighed 3,600 pounds, and write your answer in the space immediately below:
save the file as: sedan3
2 Permafrost is soil, sediment, or rock that is frozen based on its temperature. The ground must remain at or below zero degrees centigrade for 2 years or more to be called permafrost. It is found at high altitudes, including the Rocky Mountains in the state of Colorado. Permafrost is measured by down-hole depth created by a drill hole in a formation that is used as part of geophysical studies. Suppose that you wanted to study the relationship between down-hole depth (X) and temperature. Suppose that down-hole depth was measured in meters (m) while temperature was measured in degrees centigrade (°C).
To test your Excel skills, use the small sample of hypothetical drill-hole data given in Fig. 6.34.
Create an Excel spreadsheet and enter the data using DEPTH (meters) as the independent variable (predictor) and TEMPERATURE (degrees centigrade) as the dependent variable (criterion).
Fig 6.34 Worksheet Data for Chap 6: Practice Problem #2
(a) Create an XY scatterplot of these two sets of data such that:
• top title: RELATIONSHIP BETWEEN DOWN-HOLE DEPTH AND TEMPERATURE
• y-axis title: TEMPERATURE (degrees centigrade)
• re-size the chart so that it is 7 columns wide and 25 rows long
• move the chart below the table
(b) Create the least-squares regression line for these data on the scatterplot.
(c) Use Excel to run the regression statistics to find the equation for the least-squares regression line for these data, and display the results below the chart on your spreadsheet. Use number format (two decimal places) for the correlation, r, and for both the y-intercept and the slope of the line. Change all other decimal figures to four decimal places.
Print the input data and the chart so that they fit onto a single page. Then, print the regression output table on a separate page.
(1a) Circle and label the value of the y-intercept and the slope of the regression line on that separate page.
(2b) Read from the graph the temperature you would predict for a depth of three meters and write your answer in the space immediately below: _
(f) save the file as: DEPTH3
Answer the following questions using your Excel printout:
3 What is the slope of the line?
4 What is the regression equation for these data (use two decimal places for the y-intercept and the slope)?
5 Use that regression equation to predict the temperature you would expect for a down-hole depth of two meters.
(Note that this correlation is not the multiple correlation as the Excel table indicates, but is merely the correlation r instead.)
Note that you found a positive correlation of +.94 between depth and temperature. You know that the correlation is a positive correlation for two reasons:
(1) the regression line slopes upward and to the right on the chart, signaling a positive correlation, and (2) the slope is +0.53 which also tells you that the correlation is a positive correlation.
But how does Excel treat negative correlations?
Important note: Since Excel does not recognize negative correlations in the SUMMARY OUTPUT, but treats all correlations as if they were positive, you need to be careful to notice when there is a negative correlation between the two variables under study, and to report it as negative.
You know that the correlation is negative when:
(1) The slope, b, is a negative number which can only occur when there is a negative correlation.
(2) The chart clearly shows a downward slope in the regression line, which can only happen when the correlation is negative.
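The reason the sign of the slope reveals the sign of the correlation can be seen from a standard identity for simple linear regression. The symbols $s_X$ and $s_Y$ (the standard deviations of X and Y) are introduced here only for this illustration:

```latex
b = r \cdot \frac{s_Y}{s_X}, \qquad s_X > 0,\; s_Y > 0
\quad\Longrightarrow\quad \operatorname{sign}(b) = \operatorname{sign}(r)
```

Because both standard deviations are positive, the slope b and the correlation r must have the same sign. With one predictor, the unsigned value that Excel prints is $|r|$, so a negative slope means the correlation should be reported as $-|r|$.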
3 In a greenhouse setting, how does temperature affect overall plant height? Suppose that you wanted to study this question using the height of vegetable plants. The plants were germinated from seeds collected from 15 random commercial agricultural sites around the United States. The plants were reared in a greenhouse to control for the effect of temperature on plant height. The hypothetical data are given in Fig. 6.35.
Fig 6.35 Worksheet Data for Chap 6: Practice Problem #3
Create an Excel spreadsheet, and enter the data.
(a) Create an XY scatterplot of these two sets of data such that:
• top title: RELATIONSHIP BETWEEN TEMPERATURE AND HEIGHT
• x-axis title: TEMPERATURE (degrees Centigrade)
• move the chart below the table
• re-size the chart so that it is 7 columns wide and 25 rows long
(b) Create the least-squares regression line for these data on the scatterplot.
(c) Use Excel to run the regression statistics to find the equation for the least-squares regression line for these data, and display the results below the chart on your spreadsheet. Add the regression equation to the chart. Use number format (three decimal places) for the correlation and for the coefficients. Print only the input data and the chart so that this information fits onto one page in portrait format.
Then, print just the regression output table on a separate page so that it fits onto that separate page in portrait format.
(d) Circle and label the value of the y-intercept and the slope of the regression line on your printout.
(e) Write the regression equation by hand on your printout for these data (use three decimal places for the y-intercept and the slope).
(f) Circle and label the correlation between the two sets of scores in the regression analysis summary output table on your printout.
(g) Underneath the regression equation you wrote by hand on your printout, use the regression equation to predict the average height of vegetable plants you would predict for a temperature of 20 °C.
(h) Read from the graph the average height of vegetable plants you would predict for a temperature of 15 °C, and write your answer in the space immediately below:
Save the file as: Vegetable3
Black K. Business statistics: for contemporary decision making. 6th ed. Hoboken: John Wiley & Sons, Inc.; 2010.
Levine D, Stephan D, Krehbiel T, Berenson M. Statistics for managers using Microsoft Excel. 6th ed. Boston: Prentice Hall Pearson; 2011.
McCleery R, Watt T, Hart T. Introduction to statistics for biology. 3rd ed. Boca Raton: Chapman & Hall/CRC; 2007.
McKillup S, Dyar M. Geostatistics explained: an introductory guide for earth scientists. Cambridge: Cambridge University Press; 2010.
In scientific research, predicting a criterion, Y, often involves exploring whether a combination of several predictors (e.g., X1, X2, X3) can yield a more accurate prediction than a single predictor, X. This approach is known as "multiple correlation" because it uses two or more predictors together to predict Y, rather than depending on a single predictor.
Each predictor is “weighted” differently based on its separate correlation with Y and its correlation with the other predictors. The job of multiple correlation is to produce a regression equation that will weight each predictor differently and in such a way that the combination of predictors does a better job of predicting Y than any single predictor by itself. We will call the multiple correlation: Rxy.
You will recall (see Sect. 6.5.3) that the regression equation that predicts Y when only one predictor, X, is used is:
$Y = a + bX$ (7.1)
Multiple Regression Equation
The multiple regression equation follows a similar format and is:
$Y = a + b_1X_1 + b_2X_2 + b_3X_3 + \text{etc.}$, depending on the number of predictors used (7.2)
The “weight” given to each predictor in the equation is represented by the letter “b” with a subscript that corresponds to that predictor (b1 for X1, b2 for X2, and so on).
The multiple correlation, Rxy, ranges from 0 to +1 and can never be negative, whereas the correlation between two variables, r, ranges from −1 to +1.
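To make the structure of the multiple regression equation concrete, here is a minimal Python sketch that evaluates Y = a + b1X1 + b2X2 + ... for any number of predictors. The function name `predict` and the numbers in the example are hypothetical; they are not the coefficients from the fruit example in this chapter.

```python
def predict(intercept, coefficients, predictors):
    """Evaluate Y = a + b1*X1 + b2*X2 + ... for one set of predictor scores."""
    if len(coefficients) != len(predictors):
        raise ValueError("Each predictor needs exactly one coefficient.")
    # Multiply each weight b_i by its predictor X_i and add the y-intercept a
    return intercept + sum(b * x for b, x in zip(coefficients, predictors))


# Hypothetical weights and scores, chosen only to show the arithmetic
a = 1.0
b_weights = [0.5, 0.25, 2.0]   # b1, b2, b3
x_scores = [2.0, 4.0, 1.0]     # X1, X2, X3

print(predict(a, b_weights, x_scores))
```

With one coefficient and one predictor, the same function reduces to the simple regression equation Y = a + bX from Chap. 6.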
Important note: In order to do multiple regression, you need to have installed the “Data Analysis ToolPak” that was described in Chap. 6 (see Sect. 6.5.1). If you did not install this, you need to do so now.
In a mid-size fruit farm in Colorado, data analysis is being conducted to explore the relationship between water, light, and fertilizer and the amount of fruit produced. The farm uses a laboratory greenhouse to control environmental factors, measuring water in milliliters, light in minutes, and fertilizer in microliters. The objective is to determine how well these variables predict the mass of fruit produced in grams. The farm manager has also requested an examination of fruit yield in relation to varying growing conditions. To achieve this, the three measured growing conditions (X1, X2, and X3) will serve as predictors, while the mass of fruit produced (Y) will be the criterion. A sample of 11 plants has been randomly selected, and their growing conditions recorded, to test these relationships.
Let’s use the following notation:
Y = FRUIT PRODUCED (g)
X1 = WATER (ml)
X2 = LIGHT RECEIVED (min)
X3 = FERTILIZER (μl)
Suppose, further, that you have collected the following hypothetical data summarizing these scores (see Fig. 7.1):
Create an Excel spreadsheet for these data using the following cell reference:
158 7 Multiple Correlation and Multiple Regression
A2: FRUIT PRODUCED IN RELATION TO GROWING CONDITIONS
A4: Is there a relationship between fruit produced and growing conditions?
A6: FRUIT PRODUCED (g)
Next, change the column width to match the above table, and change all FRUIT PRODUCED figures to number format (two decimal places).
Now, fill in the additional data in the chart such that:
Then, center all numbers in your table.
Ensure that all numerical data in your table is accurate to avoid errors in your spreadsheets. Save this document under the name: FRUIT3.
Fig 7.1 Worksheet Data for Growing Conditions versus Fruit Produced (Practical Example)
Before we do the multiple regression analysis, we need to try to make one important point very clear:
When predicting a criterion variable Y using a single predictor variable X, it is essential to position the X variable on the left side of your table and the Y variable on the right. This arrangement helps prevent any confusion between the two variables.
However, in multiple regression, you need to follow this rule which is exactly the opposite:
In multiple regression analysis, it is crucial to position the criterion variable, Y, on the far left of your table, with all predictor variables arranged to the right. This clear organization makes it easy to identify the criterion and the predictors, which prevents confusion and saves time in your analysis.
In the table, the criterion Y (FRUIT PRODUCED) is positioned on the far left, while the three predictors (WATER, LIGHT RECEIVED, and FERTILIZER) are aligned to the right. Adhering to this arrangement can significantly reduce the likelihood of errors in your analysis.
7.2 Finding the Multiple Correlation and the Multiple Regression Equation
Objective: To find the multiple correlation and multiple regression equation using Excel.
You do this by the following commands:
Click on: Data Analysis (far right top of screen)
Regression (scroll down to this in the box; see Fig.7.2)
Note that both the Input Y Range and the Input X Range above include the label at the top of the columns.
Click on the Labels box to add a check mark to it (because you have included the column labels in row 6)
Output Range (click on the button to its left, and enter): A20 (see Fig.7.3)
Excel automatically adds a dollar sign ($) before each column letter and row number, ensuring that data ranges remain constant during regression analysis.
Fig 7.2 Dialogue Box for Regression Function
OK (see Fig. 7.4 to see the resulting SUMMARY OUTPUT)
Fig 7.3 Dialogue Box for Growing Conditions vs Fruit Produced Data
Next, format cell B23 in number format (two decimal places)
Next, format the following four cells in Number format (four decimal places):
Change all other decimal figures to two decimal places, and center all figures within their cells.
Save the file as: FRUIT4
Now, print the file so that it fits onto one page by changing the scale to 55% size. The resulting regression analysis is given in Fig. 7.5.
Fig. 7.4 Regression SUMMARY OUTPUT of Growing Conditions vs Fruit Produced Data
After obtaining the SUMMARY OUTPUT, you can identify the multiple correlation and find the regression equation that best fits the data. This analysis uses WATER, LIGHT RECEIVED, and FERTILIZER as the three predictor variables, while FRUIT PRODUCED serves as the criterion.
The term "Multiple R" in the SUMMARY OUTPUT is the multiple correlation, Rxy, of +0.80. It indicates a strong positive relationship between the combination of the three predictors (WATER, LIGHT RECEIVED, and FERTILIZER) and the criterion, FRUIT PRODUCED.
To find the regression equation, notice the coefficients at the bottom of the SUMMARY OUTPUT:
Intercept: a (this is the y-intercept) 1.5363
Fig 7.5 Final Spreadsheet for Growing Conditions vs Fruit Produced Regression Analysis
Since the general form of the multiple regression equation is:
$Y = a + b_1X_1 + b_2X_2 + b_3X_3$ (7.2)
we can now write the multiple regression equation for these data:
7.3 Using the Regression Equation to Predict FRUIT PRODUCED
Objective: To find the predicted FRUIT PRODUCED using a WATER Score of 600, a LIGHT RECEIVED Score of 500, and a FERTILIZER Score of 550
Plugging these three numbers into our regression equation gives us:
Y = 3.20 grams of fruit produced
If you want to learn more about the theory behind multiple regression, see Keller
7.4 Using Excel to Create a Correlation Matrix in Multiple Regression
The final step in multiple regression is to find the correlation between all of the variables that appear in the regression equation.
In our example, this means that we need to find the correlation between each of the six pairs of variables:
To do this, we need to use Excel to create a “correlation matrix.” This matrix summarizes the correlations between all of the variables in the problem.
Objective: To use Excel to create a correlation matrix between the four variables in this example.
To use Excel to do this, use these steps:
Data (top of screen under “Home” at the top left of screen)
Correlation (scroll up to highlight this formula; see Fig. 7.6)
Input range: highlight the four column labels (FRUIT PRODUCED, WATER, LIGHT RECEIVED, and FERTILIZER) and all of the figures beneath them
Put a check in the box for:
Labels in the First Row (since you included the labels at the top of the columns in your input range of data above)
Output range (click on the button to its left, and enter): A42 (see Fig. 7.7)
Fig. 7.6 Dialogue Box for Growing Conditions vs Fruit Produced Correlations
The resulting correlation matrix appears in A42:E46 (see Fig. 7.8).
To enhance the readability of the correlation matrix, format all decimal numbers to two decimal places Additionally, adjust the column widths to ensure that all labels fit comfortably within their cells, and center the correlations within each cell for a polished appearance.
Save this Excel file as: FRUIT5
The final spreadsheet for these scores appears in Fig.7.9.
Fig 7.7 Dialogue Box for Input/Output Range for Correlation Matrix
Fig. 7.8 Resulting Correlation Matrix for Growing Conditions vs Fruit Produced Data
In the correlation matrix, the number "1" along the diagonal indicates the perfect positive correlation of 1.0 of each variable with itself. Note that the correlations are presented with two decimal places. You are now prepared to read the correlations between the six pairs of variables.
The strongest correlation is between FERTILIZER and FRUIT PRODUCED (+.77). WATER correlates +.51 with FRUIT PRODUCED, and LIGHT RECEIVED correlates +.45 with FRUIT PRODUCED. Among the predictors themselves, the correlations are +.47 between LIGHT RECEIVED and WATER, +.44 between FERTILIZER and WATER, and +.43 between FERTILIZER and LIGHT RECEIVED.
The best single predictor of fruit produced is FERTILIZER, with a correlation of +.77. Adding WATER and LIGHT RECEIVED as predictors raised the multiple correlation only slightly, to +.80. Fertilizer is therefore an excellent predictor of fruit produced all by itself.
If you want to learn more about the correlation matrix, see Levine et al (2011).
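A correlation matrix like the one Excel produces can be sketched in a few lines of Python. The helper names `pearson` and `correlation_matrix` and the five rows of data below are hypothetical; they are not the values in Fig. 7.1, so the printed correlations will not match Fig. 7.8.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Return {(row_name, col_name): r} for every pair of named columns."""
    names = list(columns)
    return {(row, col): pearson(columns[row], columns[col])
            for row in names for col in names}


# Hypothetical data, with the criterion first and the predictors after it
data = {
    "FRUIT PRODUCED": [2.1, 2.8, 3.0, 3.6, 4.2],
    "WATER": [400, 450, 500, 480, 600],
    "LIGHT RECEIVED": [300, 320, 310, 400, 390],
    "FERTILIZER": [200, 260, 300, 350, 420],
}

matrix = correlation_matrix(data)
names = list(data)
# Print the lower triangle, which is the layout Excel uses
for i, row in enumerate(names):
    cells = [f"{matrix[(row, col)]:+.2f}" for col in names[: i + 1]]
    print(f"{row:<16}" + "  ".join(cells))
```

With four variables there are 4 × 3 / 2 = 6 distinct pairs, which is why the text above lists six correlations.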
Fig 7.9 Final Spreadsheet for Growing Conditions vs Fruit Produced Regression and the Correlation Matrix
7.5 End-of-Chapter Practice Problems
1 Agriculture around the world depends on viable seed production. Crops that originated in various parts of the world are now grown in a wide range of climates. For example, corn (maize) originated in present day Mexico and is now grown throughout the world. The amount of seeds produced in different growing conditions, especially in marginal climates, affects how successful a plant will be for agricultural purposes. The main conditions that can affect plant growth are water, light, and fertilizer. Additionally, “plant crowding” in terms of how close the plants are to each other generally causes negative changes in productivity (i.e., the closer the plants are to one another, the fewer the seeds produced). As a result, agricultural companies have started studying the impact that space between plants has on overall plant production.
In analyzing the experimental data from the greenhouse, we aim to determine how various growing conditions, including water (ml), light exposure (min), and fertilizer (μl), influence seed production of a new agricultural plant Additionally, we will evaluate the impact of plant spacing (cm) as a potential predictor of seed yield The seed supplier seeks guidance on whether incorporating space between plants is beneficial for optimizing seed production alongside the main growing variables.
In this study, multiple correlation and multiple regression analysis were employed to evaluate data collected from a random sample of seed pods from 12 crop plants grown under controlled conditions over a full growing season The analysis aims to test Excel skills using the hypothetical data presented in Fig 7.10.
Fig 7.10 Worksheet Data for Chap 7: Practice Problem #1
(a) Create an Excel spreadsheet using AVERAGE SEEDS PER POD as the criterion (Y), and WATER (X1), LIGHT RECEIVED (X2), SPACE (X3), and FERTILIZER (X4) as the predictors.
(b) Use Excel’s multiple regression function to find the relationship between these five variables and place it below the table.
(c) Use number format (two decimal places) for the multiple correlation on the SUMMARY OUTPUT, and use four decimal places for the coefficients in the SUMMARY OUTPUT.
(d) Print the table and regression results below the table so that they fit onto one page.
(e) Save this file as: seed14
Answer the following questions using your Excel printout:
1 What is the multiple correlation Rxy?
3 What is the coefficient for WATER b1?
4 What is the coefficient for LIGHT RECEIVED b2?
5 What is the coefficient for SPACE b3?
6 What is the coefficient for FERTILIZER b4?
7 What is the multiple regression equation?
8 Predict the AVERAGE SEEDS PER POD you would expect for a WATER score of 610, a LIGHT RECEIVED score of 550, a SPACE score of 3, and a FERTILIZER score of 610.
(f) Now, go back to your Excel file and create a correlation matrix for these five variables, and place it underneath the SUMMARY OUTPUT.
(g) Save this file as: seed15
(h) Now, print out just this correlation matrix on a separate sheet of paper.
Answer the following questions using your Excel printout Be sure to include the plus or minus sign for each correlation:
9 What is the correlation between WATER and AVERAGE SEEDS PER POD?
10 What is the correlation between LIGHT RECEIVED and AVERAGE SEEDS PER POD?
11 What is the correlation between SPACE and AVERAGE SEEDS PER POD?
12 What is the correlation between FERTILIZER and AVERAGE SEEDS PER POD?
13 What is the correlation between SPACE and WATER?
14 What is the correlation between LIGHT RECEIVED and FERTILIZER?
15 Discuss which of the four predictors is the best predictor of AVERAGE SEEDS PER POD.
170 7 Multiple Correlation and Multiple Regression
16. Explain in words how much better the four predictor variables together predict AVERAGE SEEDS PER POD than the best single predictor by itself.
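Each entry in a correlation matrix is a Pearson correlation between two columns of the worksheet. The sketch below shows that calculation in Python for two short lists of made-up numbers (they are not the data in Fig. 7.10). The function name pearson_r and the variable names are assumptions used only for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations from each mean.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up illustration data, not the worksheet values.
water = [1, 2, 3, 4, 5]
seeds = [2, 4, 5, 4, 5]

r = pearson_r(water, seeds)
print(f"{r:+.2f}")   # the format keeps the plus or minus sign
```

Printing with an explicit sign mirrors the instruction to report the plus or minus sign of every correlation.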
2. In order to grow properly, most plants need water, warmth, carbon dioxide gas, light, and minerals. Suppose you wanted to study the relationship between temperature and precipitation and their effect on plant productivity. In ecological terms, “productivity” refers to the amount of plant growth and it is measured in grams per meter squared per year (g/m²/year) at the site. Precipitation (rainfall) is measured as the annual mean precipitation (cm/year) at the site. Temperature is measured as the average annual temperature in degrees Centigrade (°C) at the site. Let productivity be the dependent variable (criterion), and precipitation and temperature be the independent variables (predictors) at each site. Hypothetical data for 14 sites are presented in Fig. 7.11.
(a) Create an Excel spreadsheet using PRODUCTIVITY as the criterion (Y), and the other variables as the two predictors of this criterion.
(b) Use Excel’s multiple regression function to find the relationship between these variables and place it below the table.
(c) Use number format (two decimal places) for the multiple correlation on the Summary Output, and use number format (three decimal places) for the coefficients in the Summary Output.
(d) Print the table and regression results below the table so that they fit onto one page.
(e) By hand on this printout, circle and label:
(2b) coefficients for the y-intercept, precipitation, and temperature.
(f) Save this file as: PLANT3A
Fig 7.11 Worksheet Data for Chap 7: Practice Problem #2
Return to your Excel file and create a correlation matrix for these three variables, placing it below the Summary Output. Round each correlation to two decimal places. Save the updated file as "PLANT3A." Finally, print the correlation matrix in portrait mode on a separate sheet of paper.
Answer the following questions using your Excel printout:
1. What is the multiple correlation Rxy?
3. What is the coefficient for precipitation b1?
4. What is the coefficient for temperature b2?
5. What is the multiple regression equation?
6. Underneath this regression equation by hand, predict the productivity you would expect for an annual precipitation of 300 cm/year and an annual temperature of +2 °C.
Answer the following questions using your Excel printout. Be sure to include the plus or minus sign for each correlation:
7. What is the correlation between precipitation and productivity?
8. What is the correlation between temperature and productivity?
9. What is the correlation between temperature and precipitation?
10. Discuss which of the two predictors is the better predictor of productivity.
11. Explain in words how much better the two predictor variables combined predict productivity than the better single predictor by itself.
3. Suppose that you have been hired by the United States Department of Agriculture (USDA) to analyze corn yields from Iowa farms over one year (a single growing season). Suppose, further, that these data will represent a pilot study that will be included in a larger ongoing analysis of corn yield in the Midwest. You want to determine if you can predict the amount of corn produced in bushels per acre (bu/acre) based on three predictors: (1) water measured in inches of rainfall per year (in./year), (2) fertilizer measured in the amount of nitrogen applied to the soil in pounds per acre (lbs/acre), and (3) average temperature during the growing season measured in degrees Fahrenheit (°F).
To check your skills in Excel, you have selected a random sample of corn from each of eleven randomly selected farms and recorded the hypothetical data given in Fig. 7.12.
(a) Create an Excel spreadsheet using YIELD as the criterion (Y), and the other variables as the three predictors of this criterion (X1 = RAINFALL, X2 = NITROGEN, and X3 = TEMPERATURE).
(b) Use Excel's multiple regression function to find the relationship among these four variables, and place the SUMMARY OUTPUT below the table.
(c) Use number format (two decimal places) for the multiple correlation on the SUMMARY OUTPUT, and use two decimal places for the coefficients in the SUMMARY OUTPUT.
(d) Save the file as: yield15
(e) Print the table and regression results below the table so that they fit onto one page.
Answer the following questions using your Excel printout:
1. What is the multiple correlation Rxy?
3. What is the coefficient for RAINFALL b1?
4. What is the coefficient for NITROGEN b2?
5. What is the coefficient for TEMPERATURE b3?
6. What is the multiple regression equation?
7. Predict the corn yield you would expect for rainfall of 28 inches per year, nitrogen at 205 pounds/acre, and temperature of 83 degrees Fahrenheit.
(f) Now, go back to your Excel file and create a correlation matrix for these four variables, and place it underneath the SUMMARY OUTPUT.
Fig 7.12 Worksheet Data for Chap 7: Practice Problem #3
(g) Re-save this file as: yield15
(h) Now, print out just this correlation matrix on a separate sheet of paper.
Answer the following questions using your Excel printout (be sure to include the plus or minus sign for each correlation):
8. What is the correlation between RAINFALL and YIELD?
9. What is the correlation between NITROGEN and YIELD?
10. What is the correlation between TEMPERATURE and YIELD?
11. What is the correlation between NITROGEN and RAINFALL?
12. What is the correlation between TEMPERATURE and RAINFALL?
13. What is the correlation between TEMPERATURE and NITROGEN?
14. Discuss which of the three predictors is the best predictor of corn yield.
15. Explain in words how much better the three predictor variables combined predict corn yield than the best single predictor by itself.
Hoshmand, A.R. Statistical Methods for Environmental and Agricultural Sciences (2nd ed.). Boca Raton, FL: CRC Press, 1998.
Keller, G. Statistics for Management and Economics (8th ed.). Mason, OH: South-Western Cengage Learning, 2009.
Levine, D.M., Stephan, D.F., Krehbiel, T.C., and Berenson, M.L. Statistics for Managers Using Microsoft Excel (6th ed.). Boston, MA: Prentice Hall/Pearson, 2011.
One-Way Analysis of Variance (ANOVA)
In this 2016 Excel Guide, you have discovered how to utilize a one-group t-test to compare a sample mean to a population mean, as well as a two-group t-test to assess the differences between two sample means. However, when faced with more than two groups, it is essential to employ statistical methods that can determine if significant differences exist among the means of these multiple groups.
The answer to this question is: Analysis of Variance (ANOVA).
The ANOVA test allows you to test for the difference between the means when you have three or more groups in your research study.
To perform a One-way Analysis of Variance (ANOVA), it is essential to have the "Data Analysis Toolpak" installed, as outlined in Chapter 6, Section 6.5.1. If you have not yet installed this tool, please do so before proceeding with the analysis.
As a research scientist, you may conduct a study to compare the highway miles per gallon (mpg) efficiency of five vehicle categories: subcompacts, compacts, mid-size, large, and SUVs. This research aims to provide valuable insights into fuel efficiency across different vehicle types, helping consumers make informed choices based on mpg performance. By analyzing the data collected from these vehicle categories, the study will highlight the differences in fuel economy, contributing to the ongoing discussion about sustainable transportation options.
This study investigates the relationship between vehicle size and gasoline usage by analyzing highway mileage data from car owners. Participants agreed to track their mileage over a specified route while using three tanks of gasoline. The hypothetical data collected for this research can be found in Fig. 8.1.
ANOVA can be applied to data sets with varying numbers of vehicles across different car types, highlighting its robustness as a statistical test. Statisticians often emphasize this feature, affectionately calling ANOVA a "very robust test."
Create an Excel spreadsheet for these data in this way:
A4: HIGHWAY MILES PER GALLON (mpg) COMPARISON OF FIVE TYPES OF CARS
After completing the data entry in your spreadsheet, ensure that cell B15 displays the value 35.0 and cell F15 shows 19.1. Center-align the numbers in each column and apply a number format with one decimal place for all entries.
It's crucial to verify the accuracy of all figures in the table to ensure you arrive at the correct solution for this problem.
Save this file as: CARS2
Fig 8.1 Worksheet Data for Highway mpg Test
176 8 One-Way Analysis of Variance (ANOVA)
8.1 Using Excel to Perform a One-Way Analysis of Variance (ANOVA)
Objective: To use Excel to perform a one-way ANOVA test.
You are now ready to perform an ANOVA test on these data using the following steps:
Data (at top of screen)
Data Analysis (far right at top of screen)
ANOVA: Single Factor (scroll up to this formula and highlight it; see Fig. 8.2)
Input range: B7:F17 (note that you have included in this range the column titles that are in row 7)
When comparing groups with varying sample sizes, ensure that the INPUT RANGE begins at the column title of the first group on the left and extends to the last column on the right, down to the lowest row containing data. In other words, highlight the entire data matrix so that the INPUT RANGE has the “shape” of a rectangle when you highlight it. Since LARGE has 21.3 in cell E17, your “rectangle” must include row 17!
Put a check mark in: Labels in First Row
Output range (click on the button to its left): A19 (see Fig. 8.3)
Center all of the numbers in the ANOVA table, and round off all numbers that are decimals to two decimal places.
Save this file as: CARS2A
You should have generated the table given in Fig. 8.4.
Fig 8.3 Dialog Box for ANOVA: Single Factor Input/Output Range
To ensure all information is displayed on a single page, print both the data table and the ANOVA summary table by adjusting the Page Layout settings to fit to 75% scale.
As a check on your analysis, you should have the following in these cells:
Now, let’s discuss how you should interpret this table:
Fig 8.4 ANOVA Results for Highway mpg Test
How to Interpret the ANOVA Table Correctly
Objective: To interpret the ANOVA table correctly
ANOVA, or Analysis of Variance, is a statistical method used to determine if there are significant differences between the means of three or more groups of data. The ANOVA test utilizes the F-test statistic, commonly represented by the letter F, to assess these differences effectively.
The formula for the F-test is this:
F = Mean Square between groups (MSb) divided by Mean Square within groups (MSw)
This Excel Guide focuses on teaching users how to use Excel effectively, rather than on the statistical theory underlying the ANOVA formulas. For a comprehensive treatment of ANOVA, refer to the works of Hibbert and Gooding (2006) and Hoshmand (1998).
In Excel, dividing the value in cell D32 (MSb = 179.56) by the value in cell D33 (MSw = 4.01) yields an F-test result of 44.80, which is displayed in cell E32. Dividing the two rounded values on a calculator gives 44.78; Excel's result differs slightly because Excel computes F from the unrounded mean squares.
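For readers who want to see where the two mean squares come from, the following Python sketch computes MSb, MSw, and F from raw scores for any number of groups, including groups of unequal size. It illustrates the standard one-way ANOVA arithmetic and is not a description of Excel's internal code. The function name one_way_anova_f and the three tiny groups of scores are made up for the example; they are not the mpg data in Fig. 8.1.

```python
def one_way_anova_f(groups):
    """Return (MS_between, MS_within, F) for a list of groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-groups sum of squares: each group mean vs. the grand mean,
    # weighted by the number of scores in that group.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: each score vs. its own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)        # df between = k - 1
    ms_within = ss_within / (n_total - k)    # df within = n_TOTAL - k
    return ms_between, ms_within, ms_between / ms_within

# Made-up scores for three groups (illustration only).
ms_b, ms_w, f_value = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(ms_b, 2), round(ms_w, 2), round(f_value, 2))
```

Because the sketch divides unrounded mean squares, it behaves like the worksheet rather than like a hand calculation on rounded table entries.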
To assess if the F value of 44.80 signifies a significant difference between the means of the five car groups, it is essential to formulate both the null hypothesis and the research hypothesis.
In our analysis of highway miles per gallon (mpg), we compare five groups to evaluate their population means. The null hypothesis states that these means are equal, while the research hypothesis states that they are not all equal, indicating a significant difference among the groups. Which hypothesis to accept depends on whether the ANOVA results show a significant difference among the population means.
Using the Decision Rule for the ANOVA F-Test
To state the hypotheses, let’s call SUBCOMPACTS Group 1, COMPACTS Group 2, MID-SIZE Group 3, LARGE Group 4, and SUVs Group 5. The hypotheses would then be:
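Using the group numbers just assigned, the two hypotheses can be written as shown here. This is a reconstruction from the verbal description of the null and research hypotheses; writing the research hypothesis as "not all means are equal" is a notational assumption.

```latex
H_0:\ \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5
\qquad
H_1:\ \text{not all of } \mu_1, \mu_2, \mu_3, \mu_4, \mu_5 \text{ are equal}
```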
The decision-making process for this question mirrors the rule applied in both the one-group t-test and the two-group t-test discussed in this book.
If the absolute value of t is less than the critical t, you accept the null hypothesis.
or
If the absolute value of t is greater than the critical t, you reject the null hypothesis, and accept the research hypothesis.
Now, here is the decision rule for ANOVA:
Objective: To learn the decision rule for the ANOVA F-test
The decision rule for the ANOVA F-test is the following:
If the value of F is less than the critical F-value, accept the null hypothesis.
or
If the value of F is greater than the critical F-value, reject the null hypothesis, and accept the research hypothesis.
Note that Excel tells you the critical F-value in cell G32: 2.63
Therefore, our decision rule for the types of cars ANOVA test is this:
Since the value of F of 44.80 is greater than the critical F-value of 2.63, we reject the null hypothesis and accept the research hypothesis.
Therefore, our conclusion, in plain English, is:
There is a significant difference in highway mpg among the five types of cars.
Because the F-value is a ratio of two mean squares, it can never be negative, so there is no need to take its absolute value.
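The decision rule for the F-test is a single comparison, which the following Python sketch spells out using the two values read from the ANOVA table. The function name anova_f_decision is an assumption for illustration. The rule as stated does not say what to do when F exactly equals the critical value, so treating that case as "accept" is also an assumption of this sketch.

```python
def anova_f_decision(f_value, critical_f):
    """Apply the decision rule for the ANOVA F-test."""
    if f_value > critical_f:
        return "reject the null hypothesis and accept the research hypothesis"
    # Assumption: an F exactly equal to the critical value is treated as "accept".
    return "accept the null hypothesis"

# Values read from the ANOVA table: F in cell E32, critical F in cell G32.
decision = anova_f_decision(44.80, 2.63)
print(decision)
```

No absolute value is taken because F cannot be negative.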
ANOVA indicates a significant difference among the population means of various groups; however, it does not specify which specific pairs of groups differ significantly from one another.
8.4 Testing the Difference Between Two Groups Using the ANOVA t-Test
To answer that question, we need to do a different test called the ANOVA t-test.
Objective: To test the difference between the means of two groups using an ANOVA t-test when the ANOVA F-test results indicate a significant difference between the population means.
To analyze the differences among the five groups of cars, we must conduct ten separate ANOVA t-tests, one for each of the ten possible pairs of groups. These tests identify which pairs of groups differ significantly.
Here we demonstrate the ANOVA t-test by comparing two types of cars: COMPACTS and LARGE. The same procedure applies to each of the remaining nine pairs of groups.
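The count of ten follows from choosing two groups out of five without regard to order. A short Python sketch that lists the pairs is given here; the list name car_types simply reuses the five group labels from this chapter.

```python
from itertools import combinations

car_types = ["SUBCOMPACTS", "COMPACTS", "MID-SIZE", "LARGE", "SUVs"]

# Every unordered pair of groups gets its own ANOVA t-test.
pairs = list(combinations(car_types, 2))
for first, second in pairs:
    print(f"{first} vs. {second}")
print(len(pairs))
```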
8.4.1 Comparing COMPACTS vs LARGE in Highway mpg
Objective: To compare COMPACTS vs. LARGE in highway mpg using the ANOVA t-test
The first step is to write the null hypothesis and the research hypothesis for these two types of cars.
In the context of the ANOVA t-test, the null hypothesis posits that the population means of the two groups, COMPACTS (Group 2) and LARGE (Group 4), are equal. Conversely, the research hypothesis asserts that there is a significant difference between these two population means, indicating that they are not equal.
For Group 2 vs Group 4, the formula for the ANOVA t-test is:
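$$\text{ANOVA } t = \frac{\bar{X}_1 - \bar{X}_2}{s.e._{\text{ANOVA}}} \qquad (8.2)$$

where

$$s.e._{\text{ANOVA}} = \sqrt{MS_w \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$$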
The steps involved in computing this ANOVA t-test are:
1. Find the difference of the sample means for the two groups (29.41 - 23.73 = 5.68).
2. Find 1/n2 + 1/n4 (since the two groups have a different number of cars in them, this becomes: 1/9 + 1/10 = 0.11 + 0.10 = 0.21).
3. Multiply MSw times the answer for step 2 (4.01 × 0.21 = 0.84).
4. Take the square root of step 3 (SQRT(0.84) = 0.92).
5. Divide step 1 by step 4 to find ANOVA t (5.68/0.92 = 6.17).
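The five steps above can be collected into one small function. The Python sketch below uses the two-decimal values quoted in the steps (means of 29.41 and 23.73, group sizes of 9 and 10, and MSw of 4.01); the function name anova_t is an assumption for illustration. Because these inputs are already rounded, the last digit of the result can differ from what a worksheet holding the unrounded cell values would show.

```python
from math import sqrt

def anova_t(mean_a, mean_b, n_a, n_b, ms_within):
    """ANOVA t statistic for two groups, using MS within from the ANOVA table."""
    # Steps 2 to 4: standard error of the difference between the two means.
    standard_error = sqrt(ms_within * (1 / n_a + 1 / n_b))
    # Steps 1 and 5: difference of the means divided by the standard error.
    return (mean_a - mean_b) / standard_error

# COMPACTS: mean 29.41, n = 9; LARGE: mean 23.73, n = 10; MSw = 4.01.
t_value = anova_t(29.41, 23.73, 9, 10, 4.01)
print(round(t_value, 2))
```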
Excel works from the unrounded cell values and does not round intermediate results the way the hand calculation above does; it carries each step to about 15 significant digits. When you perform these computations in Excel, your answer will therefore be 6.18 to two decimal places rather than 6.17, and Excel's answer is the more accurate one.
To interpret the ANOVA t-test result of 6.18, you need the critical value of t, which in turn requires the degrees of freedom for the ANOVA t-test.
8.4.1.1 Finding the Degrees of Freedom for the ANOVA t-Test
Objective: To find the degrees of freedom for the ANOVA t-test.
The degrees of freedom (df) for the ANOVA t-test is calculated by taking the total sample size from all groups and subtracting the number of groups in the study, represented as df = n_TOTAL - k, where k denotes the number of groups.
The total sample size across the five groups is 42, with Group 1 containing 8 cars, Group 2 having 9 cars, Group 3 consisting of 7 cars, Group 4 with 10 cars, and Group 5 including 8 cars. Consequently, the degrees of freedom for the ANOVA t-test is 42 - 5 = 37.
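The same arithmetic can be checked in a few lines of Python. The dictionary name group_sizes is an assumption; the counts are the five group sizes just listed.

```python
group_sizes = {"SUBCOMPACTS": 8, "COMPACTS": 9, "MID-SIZE": 7, "LARGE": 10, "SUVs": 8}

n_total = sum(group_sizes.values())   # total number of cars in the study
k = len(group_sizes)                  # number of groups
df = n_total - k                      # degrees of freedom for the ANOVA t-test
print(n_total, k, df)
```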
In the t-table found in Appendix E, the critical t-value for df = 37 is 2.026, which can be located in the degrees of freedom column on the left side of the table.
When conducting the ANOVA t-test to compare two groups, it is crucial to reference the degrees of freedom (df) column in Appendix E to determine the critical t value. This step ensures accurate statistical analysis and interpretation of the results.