Descriptive Statistics
Lecture: Descriptive Statistics

Content
• Measures of location and variability
• Exploratory Data Analysis
• Association Between Two Variables

Measures of Location
• Mean
• Median
• Mode
• Percentiles
• Quartiles
If the measures are computed for data from a sample, they are called sample statistics.
If the measures are computed for data from a population, they are called population parameters.
A sample statistic is referred to as the point estimator of the corresponding population parameter.

Mean
Perhaps the most important measure of location is the mean.
The mean provides a measure of central location.
• The mean of a data set is the average of all the data values.
The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$.

Sample Mean
$\bar{x} = \frac{\sum x_i}{n}$
where $\sum x_i$ is the sum of the values of the n observations and n is the number of observations in the sample.

Population Mean
$\mu = \frac{\sum x_i}{N}$
where $\sum x_i$ is the sum of the values of the N observations and N is the number of observations in the population.

Median
The median of a data set is the value in the middle when the data items are arranged in ascending order.
Whenever a data set has extreme values, the median is the preferred measure of central location.
The median is the measure of location most often reported for annual income and property value data.
A few extremely large incomes or property values can inflate the mean.

Median
For an odd number of observations:
7 observations: 26 18 27 12 14 27 19
In ascending order: 12 14 18 19 26 27 27
The median is the middle value.
Median = 19

Mode
The mode of a data set is the value that occurs with greatest frequency.
The greatest frequency can occur at two or more different values.
If the data have exactly two modes, the data are bimodal.
If the data have more than two modes, the data are multimodal.
Caution: If the data are bimodal or multimodal, Excel's MODE function will incorrectly identify a single mode.

Mode
Example: Apartment Rents
450 occurred most frequently (7 times).
Mode = 450
Note: Data is in ascending order.
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Distribution Shape: Skewness
• Skewness is negative
• Mean will usually be less than the median
[Figure: relative frequency histogram of a negatively skewed distribution, Skewness = -.31]

z-Scores
The z-score is often called the standardized value.
It denotes the number of standard deviations a data value $x_i$ is from the mean.
$z_i = \frac{x_i - \bar{x}}{s}$
Excel's STANDARDIZE function can be used to compute the z-score.

z-Scores
An observation's z-score is a measure of the relative location of the observation in a data set.
A data value less than the sample mean will have a z-score less than zero.
A data value greater than the sample mean will have a z-score greater than zero.
A data value equal to the sample mean will have a z-score of zero.

Exploratory Data Analysis
Exploratory data analysis procedures enable us to use simple arithmetic and easy-to-draw pictures to summarize data.
We simply sort the data values into ascending order, identify the five-number summary, and then construct a box plot.

Five-Number Summary
• Smallest Value
• First Quartile
• Median
• Third Quartile
• Largest Value

Box Plot
A box plot is a graphical summary of data that is based on a five-number summary.
A key to the development of a box plot is the computation of the median and the quartiles Q1 and Q3.
Box plots provide another way to identify outliers.

Box Plot
Example: Apartment Rents
• A box is drawn with its ends located at the first and third quartiles.
• A vertical line is drawn in the box at the location of the median (second quartile).
Q1 = 445, Q2 = 475, Q3 = 525
[Figure: box plot of the apartment rents]

Box Plot
Limits are located (not drawn) using the interquartile range (IQR).
Data outside these limits are considered outliers.
The location of each outlier is shown with the symbol *.

Box Plot
An excellent graphical technique for making comparisons among two or more groups.

Measures of Association Between Two Variables
Thus far we have examined numerical methods used to summarize the data for one variable at a time.
Often a manager or decision maker is interested in the relationship between two variables.
Two descriptive measures of the relationship between two variables are the covariance and the correlation coefficient.

Covariance
The covariance is a measure of the linear association between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.

Covariance
The covariance is computed as follows:
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ for samples
$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$ for populations

Correlation Coefficient
Correlation is a measure of linear association and not necessarily causation.
Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.

Correlation Coefficient
The correlation coefficient is computed as follows:
$r_{xy} = \frac{s_{xy}}{s_x s_y}$ for samples
$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$ for populations

Correlation Coefficient
The coefficient can take on values between -1 and +1.
Values near -1 indicate a strong negative linear relationship.
Values near +1 indicate a strong positive linear relationship.
The closer the correlation is to zero, the weaker the relationship.
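Sketch: measures of location
The mean, median, and mode defined in this lecture can be checked with a few lines of Python. This is a minimal sketch using the standard-library `statistics` module on the seven observations from the median example. The second list, `bimodal_data`, is made up for illustration and is not from the lecture; it shows how a tool that reports every mode behaves on bimodal data.

```python
import statistics

# The seven observations from the lecture's median example.
observations = [26, 18, 27, 12, 14, 27, 19]

# Sample mean: sum of the values divided by the number of observations.
sample_mean = sum(observations) / len(observations)

# Median: middle value after sorting (odd number of observations here).
sample_median = statistics.median(observations)

# multimode returns every value tied for the greatest frequency.
modes = statistics.multimode(observations)

# Hypothetical bimodal data (an assumption, not lecture data).
bimodal_data = [1, 1, 2, 2, 3]
bimodal_modes = statistics.multimode(bimodal_data)

print(f"mean   = {sample_mean:.2f}")
print(f"median = {sample_median}")
print(f"modes  = {modes}")
print(f"modes of bimodal data = {bimodal_modes}")
```

Reporting all tied values, rather than a single one, avoids the problem noted in the caution about Excel's MODE function.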
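Sketch: z-scores
The z-score formula can be applied directly once the sample mean and the sample standard deviation are known. The sketch below reuses the seven observations from the median example, which is a choice made here for illustration; the lecture does not compute z-scores for these values. It assumes `s` is the sample standard deviation (divisor n - 1), which is what `statistics.stdev` computes.

```python
import statistics

# The seven observations from the lecture's median example.
observations = [26, 18, 27, 12, 14, 27, 19]

x_bar = statistics.mean(observations)   # sample mean
s = statistics.stdev(observations)      # sample standard deviation (n - 1)

# z_i = (x_i - x_bar) / s for every observation.
z_scores = [(x - x_bar) / s for x in observations]

for x, z in zip(observations, z_scores):
    side = "above" if z > 0 else "below" if z < 0 else "at"
    print(f"x = {x:2d}  z = {z:+.2f}  ({side} the mean)")
```

Values below the mean come out negative and values above it come out positive, matching the three statements about the sign of the z-score.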
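Sketch: five-number summary and box plot limits
The five-number summary for the apartment rents can be reproduced as follows. Two assumptions are made that the lecture text does not state: the quartiles use the default "exclusive" method of `statistics.quantiles`, which happens to give the lecture's Q1, Q2, and Q3 for this data set, and the limits are placed 1.5 IQR beyond the quartiles, a common convention.

```python
import statistics

# Apartment rents from the lecture example, in ascending order (70 values).
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

# Quartiles; n=4 splits the data into four groups (three cut points).
q1, q2, q3 = statistics.quantiles(rents, n=4)
five_number_summary = (min(rents), q1, q2, q3, max(rents))

# Limits based on the interquartile range (1.5 * IQR is an assumed convention).
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Any value outside the limits would be flagged as an outlier.
outliers = [r for r in rents if r < lower_limit or r > upper_limit]

print("five-number summary:", five_number_summary)
print("limits:", lower_limit, upper_limit)
print("outliers:", outliers)
```

Other quartile conventions (for example `method="inclusive"`) can give slightly different values for Q1 and Q3, so the method should be stated whenever quartiles are reported.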
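Sketch: covariance and correlation coefficient
The sample formulas for covariance and correlation translate almost term by term into code. The lecture gives no paired data, so the values of `x` and `y` below are hypothetical and serve only to exercise the formulas; the function names are also illustrative.

```python
import math

def sample_covariance(x, y):
    """s_xy = sum((x_i - x_bar) * (y_i - y_bar)) / (n - 1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def sample_std(values):
    """Sample standard deviation: square root of the covariance with itself."""
    return math.sqrt(sample_covariance(values, values))

def sample_correlation(x, y):
    """r_xy = s_xy / (s_x * s_y)."""
    return sample_covariance(x, y) / (sample_std(x) * sample_std(y))

# Hypothetical paired observations (an assumption, not lecture data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = sample_covariance(x, y)
r_xy = sample_correlation(x, y)

print(f"s_xy = {s_xy:.4f}")
print(f"r_xy = {r_xy:.4f}")
```

The covariance depends on the units of x and y, while the correlation coefficient is unit-free and always falls between -1 and +1, which makes it easier to interpret as a measure of strength.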