© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Business Analytics: Data Analysis and Decision Making
Chapter 2: Describing the Distribution of a Single Variable

Introduction
The goal is to present data in a form that makes sense to people.
Tools that are used to do this include:
- Graphs: bar charts, pie charts, histograms, scatterplots, time series graphs
- Numerical summary measures: counts, percentages, averages, measures of variability
- Tables of summary measures: totals, averages, counts, grouped by categories
It is a challenge to summarize data so that the important information stands out clearly.

Introduction (continued)
There are four steps in data analysis:
1. Recognize a problem that needs to be solved
2. Gather data to help understand and then solve the problem
3. Analyze the data
4. Act on this analysis
It is up to you to ask good questions, and then take advantage of the most appropriate tools to answer them.

Populations and Samples
A population includes all of the entities of interest in a study (people, households, machines, etc.).
Examples:
- All potential voters in a presidential election
- All subscribers to cable television
- All invoices submitted for Medicare reimbursement by nursing homes
A sample is a subset of the population, often randomly chosen and preferably representative of the population as a whole.
Examples: Gallup, Harris, other polls today

Data Sets, Variables, and Observations
- A data set is usually a rectangular array of data, with variables in columns and observations in rows.
- A variable (or field or attribute) is a characteristic of members of a population, such as height, gender, or salary.
- An observation (or case or record) is a list of all variable values for a single member of a population.

Example 2.1: Questionnaire Data.xlsx
Objective: To illustrate variables and observations in a typical data set
Solution:
- Data set includes observations on 30 people who responded to a questionnaire on the president’s environmental policies
- Variables include: age, gender, state, children, salary, opinion
- Include a row that lists variable names
- Include a column that shows an index of the observation

Types of Data
- A variable is numerical if meaningful arithmetic can be performed on it. Otherwise, the variable is categorical.
- There is also a third data type, a date variable. Excel® stores dates as numbers, but dates are treated differently from typical numbers.
- A categorical variable is ordinal if there is a natural ordering of its possible values. If there is no natural ordering, it is nominal.
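As a rough illustration of the definitions above, the following Python sketch stores a data set as rows of observations and separates numerical from categorical variables. The variable names follow Example 2.1, but the three rows, the opinion labels, and the ordering in OPINION_ORDER are made-up assumptions, not values from Questionnaire Data.xlsx.

```python
# Hypothetical rows; the real file has 30 observations on these variables.
data = [
    {"age": 35, "gender": "Male", "state": "Minnesota", "children": 1,
     "salary": 65400, "opinion": "Agree"},
    {"age": 61, "gender": "Female", "state": "Texas", "children": 2,
     "salary": 62000, "opinion": "Strongly disagree"},
    {"age": 54, "gender": "Male", "state": "Ohio", "children": 0,
     "salary": 51000, "opinion": "Neutral"},
]

# Assumed ordinal scale for "opinion", from lowest to highest agreement.
OPINION_ORDER = ["Strongly disagree", "Disagree", "Neutral",
                 "Agree", "Strongly agree"]


def is_numerical(column):
    """Treat a variable as numerical if every value is a number."""
    return all(isinstance(v, (int, float)) and not isinstance(v, bool)
               for v in column)


variables = list(data[0])  # variable names, one per column
columns = {name: [row[name] for row in data] for name in variables}
numerical = [name for name in variables if is_numerical(columns[name])]
categorical = [name for name in variables if name not in numerical]

# An ordinal variable can be sorted by its natural ordering.
by_opinion = sorted(data, key=lambda row: OPINION_ORDER.index(row["opinion"]))

print("numerical:", numerical)
print("categorical:", categorical)
print("opinions in order:", [row["opinion"] for row in by_opinion])
```

The type check is only a proxy for "meaningful arithmetic": a categorical variable that has been coded numerically would pass it, so the classification still needs human judgment.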
Types of Data (continued)
- Categorical variables can be coded numerically or left uncoded.
- A dummy variable is a 0–1 coded variable for a specific category. It is coded as 1 for all observations in that category and 0 for all observations not in that category.
- Categorizing a numerical variable by putting the data into discrete categories (called bins) is called binning or discretizing. A variable that has been categorized in this way is called a binned or discretized variable.

[Figure: Environmental Data Using a Different Coding]

Types of Data (continued)
- A numerical variable is discrete if it results from a count, such as the number of children.
- A continuous variable is the result of an essentially continuous measurement, such as weight or height.
- Cross-sectional data are data on a cross section of a population at a distinct point in time.
- Time series data are data collected over time.

Example 2.3 (Continued): Baseball Salaries 2011.xlsx
Objective: To illustrate the features of a box plot, particularly how it indicates skewness
Solution: In StatTools, select Box-Whisker Plot from the Summary Graphs dropdown list and fill in the dialog box.

Time Series Data
Our main interest in time series variables is how they change over time, and this information is lost in traditional summary measures
and in histograms or box plots.
For time series data, a time series graph is used. This is a graph of the values of one or more time series, using time on the horizontal axis. This is always the place to start a time series analysis.

Example 2.5: Crime in US.xlsx
Objective: To see how time series graphs help to detect trends in crime data
Solution:
- Data set contains annual data on violent and property crimes for the years 1960 to 2010
- In StatTools, designate a StatTools data set
- Then select Time Series Graph from the Time Series and Forecasting dropdown list and fill in the resulting dialog box

[Figures: Total Violent and Property Crimes; Population Totals; Violent and Property Crime Rates]

Example 2.6: DJIA Monthly Close.xlsx
Objective: To find useful ways to summarize the monthly Dow data
Solution:
- Data set contains monthly values of the Dow from 1950 through 2011
- Create summary measures and time series graphs for monthly values and percentage changes of the Dow
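Example 2.6 works with percentage changes as well as raw monthly values. A minimal Python sketch of that calculation follows; the closing values are invented for illustration and are not taken from DJIA Monthly Close.xlsx, and pct_changes is a hypothetical helper name.

```python
from statistics import mean, stdev

# Hypothetical monthly closing values, in time order.
closes = [100.0, 110.0, 99.0, 108.9]


def pct_changes(values):
    """Return period-to-period percentage changes (one fewer than the inputs)."""
    return [(curr - prev) / prev * 100
            for prev, curr in zip(values, values[1:])]


changes = pct_changes(closes)

# Print the series in time order, which is what a time series graph would plot.
for month, change in enumerate(changes, start=2):
    print(f"month {month}: {change:+.1f}%")

# Summary measures of the percentage changes.
print(f"mean change: {mean(changes):.2f}%, std dev: {stdev(changes):.2f}%")
```

Summarizing the changes rather than the levels is a common choice for a series that trends upward over decades, because the changes are more comparable across time than the raw values.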
Outliers
- An outlier is a value or an entire observation (row) that lies well outside of the norm.
- Some statisticians define an outlier as any value more than three standard deviations from the mean, but this is only a rule of thumb.
- Even if values are not unusual by themselves, there still might be unusual combinations of values.
- When dealing with outliers, it is best to run the analyses two ways: with the outliers and without them.

Missing Values
Most real data sets have gaps in the data. There are two issues: how to detect these missing values and what to do about them.
The more important issue is what to do about them:
- One option is to simply ignore them. Then you will have to be aware of how the software deals with missing values.
- Another option is to fill in missing values with the average of nonmissing values, but this isn’t usually a very good option.
- A third option is to examine the nonmissing values in the row of a missing value; these values might provide clues on what the missing value should be.

Excel Tables for Filtering, Sorting, and Summarizing
Tables are a tool introduced in Excel 2007. You now have the ability to designate a rectangular data set as a table and then employ a number of powerful tools for analyzing tables. These tools include:
- Filtering
- Sorting
- Summarizing

Example 2.7: Catalog Marketing.xlsx
Objective: To illustrate Excel tables for analyzing the HyTex data
Solution:
- Data set contains data on 1000 customers of HyTex, a fictional direct marketing company
- Designate the
data set as a table by selecting any cell in the data set and clicking the Table button on the Insert ribbon
- Use the dropdown arrows next to the variable names to filter in many different ways

Filtering
- Finding records that match particular criteria is called filtering.
- One way to filter is to create an Excel table, which automatically provides dropdown arrows next to the field names that allow you to filter.
- There are also three ways to filter on any rectangular data set with variable names:
  - Use the Filter button from the Sort & Filter dropdown list on the Home ribbon
  - Use the Filter button from the Sort & Filter group on the Data ribbon
  - Right-click any cell in the data set and select Filter. You get several options, the most popular of which is Filter by Selected Cell’s Value.

Example 2.7 (Continued): Catalog Marketing.xlsx
Objective: To investigate the types of filters that can be applied to the HyTex data
Solution: There is almost no limit to the filters you can apply, but here are a few possibilities:
- Filter on one or more values in a field
- Filter on more than one field
- Filter on a continuous numerical field
- Top 10 and Above/Below Average filters
- Filter on a text field
- Filter on a date field
- Filter on color or icon
- Use a custom filter

Results from a
Typical Filter

... empirical rules should be applied with caution, especially when the data are clearly skewed, as illustrated by the calculations for baseball salaries below

Example 2.3: Baseball Salaries 2011.xlsx
Objective: To learn how salaries are distributed across all
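The advice in the Outliers and Missing Values slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the measurements are invented, None stands in for a missing cell, and flag_outliers is a hypothetical helper that applies the three-standard-deviation rule of thumb using the sample standard deviation.

```python
from statistics import mean, stdev

# Hypothetical measurements; None marks a missing value.
raw = [48, 52, 50, None, 49, 51, 47, 53, 50, 49, 51, 50,
       48, 52, 50, None, 49, 51, 50, 47, 53, 50, 500]

# Missing values: detect them first, then decide what to do about them.
missing_rows = [i for i, v in enumerate(raw) if v is None]
present = [v for v in raw if v is not None]  # option 1: ignore the gaps


def flag_outliers(values, k=3.0):
    """Return values more than k sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]


outliers = flag_outliers(present)
trimmed = [v for v in present if v not in outliers]

# Run the summary both ways: with the outliers and without them.
print("missing at positions:", missing_rows)
print("outliers:", outliers)
print(f"mean with outliers: {mean(present):.2f}")
print(f"mean without outliers: {mean(trimmed):.2f}")
```

Because an extreme value inflates the standard deviation it is measured against, the rule can fail to flag anything in very small data sets, which is one more reason to treat it as a rule of thumb.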