Describing a Quantitative Variable


MEASURES OF CENTRAL TENDENCY

1 Arithmetic Mean

• The arithmetic mean is the sum of the values divided by the number of values. It is the best measure of central tendency for symmetric or normal distributions.

• Formula: $\mu = \dfrac{\sum X}{N}$, where the sum runs over all N values X.

2 Geometric Mean

• The geometric mean is the nth root of the product of n terms. For example, the geometric mean of 1, 2, 3, and 10 is the fourth root of 1 x 2 x 3 x 10, that is, the fourth root of 60, which is about 2.78.

• The geometric mean can also be computed as follows: a) take the natural logarithm (ln) of each term X; b) compute the arithmetic mean of the ln values; c) raise e to the power of that mean. The result is the geometric mean.

• The geometric mean is less affected by extreme values than the arithmetic mean and is often used for positively skewed distributions. It is an appropriate measure for computing an average rate.

• The geometric mean is a useful measure of position for data involving ratios.

• As the logarithm step shows, the geometric mean is undefined for a set of values that contains zeros or negative values.
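
The two ways of computing the geometric mean described above can be checked against each other with a short Python sketch. The function names are illustrative, not part of the original notes, and both functions reject zeros and negative values because the logarithm is not defined for them.

```python
import math

def geometric_mean_root(values):
    """n-th root of the product of n strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    return math.prod(values) ** (1 / len(values))

def geometric_mean_logs(values):
    """Same quantity via logs: exp of the arithmetic mean of ln(X)."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)

data = [1, 2, 3, 10]  # the example from the text
print(round(geometric_mean_root(data), 2))  # 2.78
print(round(geometric_mean_logs(data), 2))  # 2.78
```

The logarithm route is usually the safer one for long lists, since the running product can overflow or underflow while the sum of logs stays small.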

3 Harmonic Mean

• The harmonic mean is used to average sample sizes. If there are k samples with sizes $n_1, n_2, \dots, n_k$, the harmonic mean is defined as:

$n_h = \dfrac{k}{\frac{1}{n_1} + \frac{1}{n_2} + \dots + \frac{1}{n_k}}$

• When a data set contains values that represent rates of change, the harmonic mean is a useful measure of central tendency.
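
As a minimal sketch of the definition (k divided by the sum of the reciprocals of the k sample sizes), the following Python function computes the harmonic mean. The sample sizes are hypothetical and chosen only to show that the harmonic mean is pulled toward the smaller samples.

```python
def harmonic_mean(sizes):
    """k divided by the sum of reciprocals of the k (positive) sizes."""
    if any(n <= 0 for n in sizes):
        raise ValueError("sizes must be strictly positive")
    return len(sizes) / sum(1 / n for n in sizes)

sample_sizes = [10, 20, 40]  # hypothetical sample sizes
print(round(harmonic_mean(sample_sizes), 2))           # 17.14
print(round(sum(sample_sizes) / len(sample_sizes), 2)) # arithmetic mean: 23.33
```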

4 Median

• The median is the midpoint of a distribution. It describes the center better than the mean does in highly skewed distributions.

• How do you find the median when N is odd?

a) Arrange all N values from smallest to largest.

b) The median is the value at the center of the list.

c) Find it by counting (N + 1)/2 observations up from the bottom.

• How do you find the median when N is even?

a) Arrange all N values from smallest to largest.

b) The median is the average of the two center values.

c) The location (N + 1)/2 falls halfway between two observations, so count up to observations N/2 and N/2 + 1 and average them.
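
The odd and even cases can be written as one small Python function. This is a sketch of the counting procedure, with hypothetical example values; Python's standard library offers the same result through `statistics.median`.

```python
def median(values):
    """Median by sorting and counting up from the bottom."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd N: 1-based location (N + 1)/2 is 0-based index N // 2.
        return ordered[mid]
    # Even N: average observations N/2 and N/2 + 1 (1-based).
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # 4.0
```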

5 Mode

• The mode is the most frequently occurring score in a distribution. It is the only measure of central tendency that can be used with nominal data.

• The mean, median, and mode are equal in symmetric, unimodal distributions. The mean is typically higher than the median in positively skewed distributions and lower than the median in negatively skewed distributions, although this may not be the case in bimodal distributions.

6 Trimean

• The trimean is computed by adding the 25th percentile, twice the 50th percentile, and the 75th percentile, and dividing the sum by four.

• The trimean is almost as resistant to extreme scores as the median and is less subject to sampling fluctuations than the arithmetic mean in extremely skewed distributions. It is less efficient than the mean for normal distributions. The trimean is a good measure of central tendency and is probably not used as much as it should be.
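
A minimal Python sketch of the trimean follows. It assumes the "inclusive" percentile rule of the standard library's `statistics.quantiles`; other percentile conventions give slightly different quartiles, and the notes do not say which one was intended. The data are hypothetical, with one extreme score.

```python
import statistics

def trimean(values):
    """(P25 + 2 * P50 + P75) / 4, using the 'inclusive' quantile rule."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4

scores = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical data with one outlier
print(trimean(scores))                    # 5.0
print(round(statistics.mean(scores), 2))  # 15.11
```

The outlier drags the arithmetic mean far from the bulk of the scores, while the trimean stays at the median.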

7 Trimmed Mean

• A trimmed mean is calculated by discarding a certain percentage of the lowest and the highest scores and then computing the mean of the remaining scores. For example, a mean trimmed 50% is computed by discarding the lowest 25% and the highest 25% of the scores and taking the mean of the remaining scores.

• The median is the mean trimmed 100% and the arithmetic mean is the mean trimmed 0%.

• A trimmed mean is less susceptible to the effects of extreme scores than the arithmetic mean. It is therefore less susceptible to sampling fluctuation than the mean for extremely skewed distributions. It is less efficient than the mean for normal distributions.

• Trimmed means are often used in Olympic scoring to minimize the effects of extreme ratings possibly caused by biased judges.

• It is highly recommended that at least one of these two measures (the trimean or a trimmed mean) be computed in addition to the mean.
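
The trimming rule can be sketched in Python as below. The function name, the rounding-down of the number of discarded scores, and the judge ratings are assumptions made for illustration; the limiting case of 100% (the median) is excluded because it would discard every score.

```python
def trimmed_mean(values, percent):
    """Mean after discarding percent/2 percent of the scores from each end."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in [0, 100)")
    ordered = sorted(values)
    n = len(ordered)
    cut = int(n * percent / 200)  # scores dropped from each tail (rounded down)
    kept = ordered[cut:n - cut]
    return sum(kept) / len(kept)

ratings = [2, 4, 5, 6, 7, 8, 9, 40]  # hypothetical judge ratings
print(trimmed_mean(ratings, 0))   # 10.125 (the arithmetic mean)
print(trimmed_mean(ratings, 50))  # 6.5
```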

MEASURES OF SPREAD

1 Range

• The range is equal to the difference between the largest and the smallest values.

• It is very sensitive to extreme scores since it is based on only two values.

• The range can be informative if used as a supplement to other measures of spread such as the standard deviation or the semi-interquartile range (half the interquartile range).

2 Interquartile Range

• The interquartile range (IQR) is the range of the middle 50% of the distribution.

• Quartiles split the distribution into "quarters" ("fourths"):

Q1 = value at the 25th percentile

Q2 = value at the 50th percentile (median)

Q3 = value at the 75th percentile

Q4 = value at the 100th percentile

• The IQR is calculated as Q3 – Q1.

• To calculate:

(a) Arrange all scores from lowest to highest.

(b) Find the median value location.

(c) Find the quartile (Q1 and Q3) locations and the quartiles.

(d) Subtract Q1 from Q3.
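
Steps (a) to (d) can be sketched in Python as follows. The notes do not state how the quartile location is obtained, so this sketch assumes the common hinge rule: median location = (N + 1)/2 and quartile location = (whole part of the median location + 1)/2, counted up from the bottom for Q1 and down from the top for Q3. The data are hypothetical.

```python
import math

def value_at(ordered, location):
    """Value at a 1-based location; a .5 location averages its two neighbours."""
    lower = ordered[math.floor(location) - 1]
    upper = ordered[math.ceil(location) - 1]
    return (lower + upper) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the assumed hinge rule."""
    ordered = sorted(values)                              # step (a)
    n = len(ordered)
    median_loc = (n + 1) / 2                              # step (b)
    quartile_loc = (math.floor(median_loc) + 1) / 2       # step (c), assumed rule
    q1 = value_at(ordered, quartile_loc)
    q3 = value_at(ordered, n + 1 - quartile_loc)
    return q1, value_at(ordered, median_loc), q3

scores = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12]  # hypothetical scores
q1, q2, q3 = quartiles(scores)
print(q1, q2, q3)  # 4.0 6.5 9.0
print(q3 - q1)     # step (d): IQR = 5.0
```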

3 Standard Deviation and Variance

• The variance is computed as the average squared deviation of each number from its mean.

• The standard deviation is the square root of the variance. It is the most commonly used measure of spread.

• In a normal distribution, about 68% of the scores are within one standard deviation of the mean and about 95% of the scores are within two standard deviations of the mean.

• Although less sensitive to extreme scores than the range, the standard deviation is more sensitive than the semi-interquartile range. Thus, the standard deviation should be supplemented by the semi-interquartile range when the possibility of extreme scores is present.
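
The definitions translate directly into Python. This sketch divides by N, matching the "average squared deviation" wording above; when the variance of a population is estimated from a sample, the usual estimate divides by N - 1 instead. The data are hypothetical.

```python
import math

def variance(values):
    """Average squared deviation from the mean (divides by N)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def standard_deviation(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores, mean 5
print(variance(scores))            # 4.0
print(standard_deviation(scores))  # 2.0
```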

MEASURES OF DISTRIBUTION SHAPE

1 Skew

• A distribution is skewed if one of its tails is longer than the other. A distribution has a positive skew if the longer tail is in the positive direction, and a negative skew if the longer tail is in the negative direction. A symmetric distribution has two tails of about the same length.

• Distributions with a positive skew are more common than distributions with a negative skew.

• Skew is computed with the formula

$S = \dfrac{\sum (X - \mu)^3}{N\sigma^3}$

where μ is the mean and σ is the standard deviation.

• A symmetric distribution has a skew of 0.

• If S is positive (mode < mean), the distribution is skewed toward the right side.

• If S is negative (mode > mean), the distribution is skewed toward the left side.
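
A minimal Python sketch of the skew statistic, assuming the third-moment form S = Σ(X - μ)³ / (Nσ³) with σ computed by dividing by N. The three small data sets are hypothetical and only illustrate the sign of S.

```python
import math

def skew(values):
    """Sum of cubed deviations divided by N * sigma**3 (sigma divides by N)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    if sigma == 0:
        raise ValueError("skew is undefined when all values are equal")
    return sum((v - mu) ** 3 for v in values) / (n * sigma ** 3)

symmetric = [1, 2, 3, 4, 5]        # hypothetical
right_tail = [1, 1, 2, 2, 3, 10]   # long tail toward large values
left_tail = [-v for v in right_tail]

print(skew(symmetric))       # 0.0
print(skew(right_tail) > 0)  # True
print(skew(left_tail) < 0)   # True
```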

2 Kurtosis

• Kurtosis is based on the size of a distribution's tails. Distributions with relatively large tails are called "leptokurtic"; those with small tails are called "platykurtic." A distribution with the same kurtosis as the normal distribution is called "mesokurtic."


• The following formula can be used to calculate kurtosis:

$\text{kurtosis} = \dfrac{\sum (X - \mu)^4}{N\sigma^4} - 3$

where μ is the mean and σ is the standard deviation. The kurtosis of a normal distribution is 0.

• The following two distributions have the same variance and approximately the same skew, but differ markedly in kurtosis.
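
As a sketch, kurtosis can be computed in Python in the fourth-moment form Σ(X - μ)⁴ / (Nσ⁴) - 3, which is assumed here because it gives 0 for a normal distribution, as stated above. The two data sets are hypothetical: one flat with short tails, one with most scores at the center and two far-out values.

```python
def kurtosis(values):
    """Fourth moment over squared variance, minus 3 (normal distribution -> 0)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n  # variance, i.e. sigma**2
    m4 = sum((v - mu) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2 - 3                      # m2**2 equals sigma**4

flat = [1, 2, 3, 4, 5]                       # hypothetical, small tails
heavy = [-10, 0, 0, 0, 0, 0, 0, 0, 10]       # hypothetical, large tails

print(round(kurtosis(flat), 2))   # -1.3 (platykurtic)
print(round(kurtosis(heavy), 2))  # 1.5 (leptokurtic)
```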

APPLICATIONS OF DESCRIPTIVE STATISTICS

GRAPHICAL DESCRIPTION

1 Frequency Polygon

• A frequency polygon is a graphical display of a frequency table. The intervals are shown on the X-axis and the number of scores in each interval is represented by the height of a point located above the middle of the interval. The points are connected so that together with the X-axis they form a polygon.

• Frequency polygons are useful for comparing distributions. This is achieved by overlaying the frequency polygons drawn for different data sets.

• Frequency polygons can be based on the actual frequencies or the relative frequencies. When based on relative frequencies, the percentage of scores instead of the number of scores in each category is plotted.

• In a cumulative frequency polygon, the number of scores (or the percentage of scores) up to and including the category in question is plotted. A cumulative frequency polygon is shown below.
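
The points plotted in the three kinds of frequency polygon can be derived from a frequency table with a few lines of Python. The table below (interval lower limits, interval width, counts) is hypothetical, as is the function name; the sketch only computes the coordinates and leaves the drawing to any plotting tool.

```python
from itertools import accumulate

def polygon_points(lower_limits, width, counts):
    """Return interval midpoints, relative frequencies (%), cumulative counts."""
    midpoints = [low + width / 2 for low in lower_limits]  # X positions
    total = sum(counts)
    relative = [100 * c / total for c in counts]           # percentage per interval
    cumulative = list(accumulate(counts))                  # up to and including each interval
    return midpoints, relative, cumulative

# Hypothetical frequency table: intervals 50-59, 60-69, ..., 90-99.
lower_limits = [50, 60, 70, 80, 90]
counts = [2, 3, 3, 1, 1]

midpoints, relative, cumulative = polygon_points(lower_limits, 10, counts)
print(midpoints)   # [55.0, 65.0, 75.0, 85.0, 95.0]
print(relative)    # [20.0, 30.0, 30.0, 10.0, 10.0]
print(cumulative)  # [2, 5, 8, 9, 10]
```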


2 Histogram

• A histogram is constructed from a frequency table. The intervals are shown on the X-axis and the number of scores in each interval is represented by the height of a rectangle located above the interval.

• The shapes of histograms will vary depending on the choice of the size of the intervals.
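
The effect of the interval size can be seen with a small text histogram in Python. The scores are hypothetical integers, all at or above the chosen starting point, and the function names are illustrative.

```python
def interval_counts(scores, start, width):
    """Count scores in intervals [start + i*width, start + (i+1)*width)."""
    n_intervals = int((max(scores) - start) // width) + 1
    counts = [0] * n_intervals
    for score in scores:
        counts[int((score - start) // width)] += 1
    return counts

def text_histogram(scores, start, width):
    """One row per interval; the row of '#' marks stands in for the rectangle."""
    rows = []
    for i, count in enumerate(interval_counts(scores, start, width)):
        low = start + i * width
        rows.append(f"{low:>3}-{low + width - 1:<3} {'#' * count}")
    return "\n".join(rows)

scores = [52, 55, 61, 64, 67, 70, 73, 78, 85, 91]  # hypothetical integer scores

print(text_histogram(scores, start=50, width=10))  # counts 2, 3, 3, 1, 1
print()
print(text_histogram(scores, start=50, width=20))  # counts 5, 4, 1
```

With intervals of width 10 the distribution shows a plateau in the 60s and 70s; with width 20 the same scores appear as a steadily decreasing staircase.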

3 Stem and Leaf Plots

• A stem and leaf display is particularly useful when the data are not too numerous.

• A stem and leaf plot is similar to a histogram except that it portrays a little more precision.

• A stem and leaf plot of the tournament players from the dataset "chess", as well as the data themselves, are shown below:

85.3 80.3 75.9 74.9 71.1 58.1 56.4 51.2 45.6 40.1

• The largest value, 85.3, is approximated as 10 x 8 + 5.

• This is represented in the plot as a stem of 8 and a leaf of 5. It is shown as the "5" in the first line of the plot. Similarly, 80.3 is approximated as 10 x 8 + 0; it has a stem of 8 and a leaf of 0. It is shown as the "0" in the first line of the plot.

• Depending on the data, each stem is displayed 1, 2, or 5 times. When a stem is displayed only once (as on the plot shown above), the leaves can take on the values 0-9.

• When a stem is displayed twice (as in the example shown below), one stem is associated with the leaves 5-9 and the other stem is associated with the leaves 0-4.

• Finally, when a stem is displayed five times, the first has the leaves 8-9, the second 6-7, the third 4-5, and so on.

• If positive and negative numbers are both present, +0 and -0 are used as stems, as they are in the plot. A stem of -0 and a leaf of 7 represents a value of (-0 x 1) + (-0.1 x 7) = -0.7.


• Stem and leaf plots can also be used to compare two distributions: the two distributions are placed back to back along a common column of stems. The figure below shows such a graph. The stems are in the middle, the leaves to the left are for one distribution, and the leaves to the right are for the other. For example, the second-to-last row shows that the distribution on the left contains the values 11, 12, and 13, whereas the distribution on the right contains two 12's and three 14's.
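
A simple stem and leaf display for the "chess" values listed above can be sketched in Python. This sketch assumes each stem is displayed once, that values are positive, and that the decimal part is dropped (85.3 becomes 10 x 8 + 5); the text does not say whether the original plot truncated or rounded, so the leaf for a value such as 75.9 is an assumption.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Map each stem to its sorted leaves, stems listed from high to low."""
    leaves = defaultdict(list)
    for value in values:
        whole = int(value)                      # assumption: drop the decimals
        stem, leaf = divmod(whole, stem_unit)   # 85 -> stem 8, leaf 5
        leaves[stem].append(leaf)
    high, low = max(leaves), min(leaves)
    # Include empty stems so gaps in the data stay visible.
    return {stem: sorted(leaves.get(stem, [])) for stem in range(high, low - 1, -1)}

chess = [85.3, 80.3, 75.9, 74.9, 71.1, 58.1, 56.4, 51.2, 45.6, 40.1]

for stem, stem_leaves in stem_and_leaf(chess).items():
    print(stem, "|", "".join(str(leaf) for leaf in stem_leaves))
# 8 | 05
# 7 | 145
# 6 |
# 5 | 168
# 4 | 05
```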

4 Box Plot

• The box stretches from the lower hinge (defined as the 25th percentile) to the upper hinge (the 75th percentile) and therefore contains the middle half of the scores in the distribution.

• The median is shown as a line across the box. Therefore 1/4 of the distribution is between this line and the top of the box and 1/4 of the distribution is between this line and the bottom of the box.

• The "H-spread" is defined as the difference between the hinges and a "step" is defined as 1.5 times the H-spread.

• Inner fences are 1 step beyond the hinges. Outer fences are 2 steps beyond the hinges.

• There are two adjacent values: the largest value below the upper inner fence and the smallest value above the lower inner fence. For the data plotted in the figure, the minimum value is above the lower inner fence and is therefore the lower adjacent value. The maximum value lies beyond the upper inner fence, so it is not the upper adjacent value.

• As shown in the figure, a line is drawn from the upper hinge to the upper adjacent value and from the lower hinge to the lower adjacent value.

• Every score between the inner and outer fences is indicated by an "o" whereas a score beyond the outer fences is indicated by a "*".

• It is often useful to compare data from two or more groups by viewing box plots from the groups side by side. Plotted are data from Example 2a and Example 2b. The data from 2b are higher, more spread out, and have a positive skew. That the skew is positive can be determined by the fact that the mean is higher than the median and the upper whisker is longer than the lower whisker.


• How to construct a box plot:

1. Find the median location and median value.

2. Find the quartile location.

3. Find the lower (Q1) and upper (Q3) quartiles (also called hinges).

4. Find the IQR (also called H-spread): Q3 – Q1.

5. Multiply the IQR (H-spread) by 1.5.

6. Find the maximum lower whisker: Q1 – (1.5)(IQR).

7. Find the maximum upper whisker: Q3 + (1.5)(IQR).

8. Find the end lower and end upper whisker values: the values closest to but not exceeding the whisker lengths.

9. Draw a vertical line and label it with values of the X variable.

10. Draw a horizontal line representing the median.

11. Draw a box around the median line, with the top of the box at Q3 and the bottom of the box at Q1.

12. Draw the upper whisker, a vertical line extending up from the top of the box to the END upper whisker value.

13. Draw the lower whisker, a vertical line extending down from the bottom of the box to the END lower whisker value.

14. Place * at any location beyond the whiskers where there are data points.

• Example:

From previous calculations, we know:

Median location: 5

Median value: 6

Quartile location: 3

Lower quartile (Q1): 5

Upper quartile (Q3): 8

IQR (H-spread) = 3

Thus:

IQR x 1.5 = 4.5

Maximum lower whisker: Q1 – 4.5 = 5 – 4.5 = 0.5

Maximum upper whisker: Q3 + 4.5 = 8 + 4.5 = 12.5

End lower whisker value = smallest value ≥ 0.5 = 1

End upper whisker value = largest value ≤ 12.5 = 9
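
The whisker arithmetic of this example can be reproduced with a short Python sketch. The quartiles Q1 = 5 and Q3 = 8 come from the example; the data list is hypothetical, since the original data set is not reproduced in these notes, and was chosen only so that its smallest and largest values are 1 and 9.

```python
def whisker_limits(q1, q3):
    """Maximum lower and upper whisker: 1.5 x IQR beyond each quartile."""
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step

def whisker_ends(values, lower_limit, upper_limit):
    """Smallest and largest data values that stay within the limits."""
    inside = [v for v in values if lower_limit <= v <= upper_limit]
    return min(inside), max(inside)

lower_limit, upper_limit = whisker_limits(5, 8)
print(lower_limit, upper_limit)  # 0.5 12.5

data = [1, 3, 5, 5, 6, 7, 8, 8, 9]  # hypothetical data, not the original set
print(whisker_ends(data, lower_limit, upper_limit))  # (1, 9)

# Values outside the limits would be marked individually (step 14).
print([v for v in data if v < lower_limit or v > upper_limit])  # []
```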

DESCRIBING TWO VARIABLES

A scatterplot shows the scores on one variable plotted against scores on a second variable. Below is a plot showing the relationship between grip strength and arm strength for 147 people working at physically-demanding jobs. The data are from a case study in the Rice Virtual Lab in Statistics. The plot shows a very strong but certainly not a perfect relationship between these two variables.

Scatterplots should almost always be constructed when the relationship between two variables is of interest. Statistical summaries are no substitute for a full plot of the data.

Pearson's Correlation

The correlation between two variables reflects the degree to which the variables are related. The most common measure of correlation is the Pearson Product Moment Correlation (called Pearson's correlation for short). When measured in a population, the Pearson Product Moment correlation is designated by the Greek letter rho (ρ). When computed in a sample, it is designated by the letter "r" and is sometimes called "Pearson's r." Pearson's correlation reflects the degree of linear relationship between two variables. It ranges from +1 to -1. A correlation of +1 means that there is a perfect positive linear relationship between variables. The scatterplot shown on this page depicts such a relationship. It is a positive relationship because high scores on the X-axis are associated with high scores on the Y-axis.

A correlation of -1 means that there is a perfect negative linear relationship between variables. The scatterplot shown below depicts a negative relationship. It is a negative relationship because high scores on the X-axis are associated with low scores on the Y-axis.

A correlation of 0 means there is no linear relationship between the two variables. The second graph shows a Pearson correlation of 0.

Correlations are rarely if ever 0, 1, or -1. Some real data showing a moderately high correlation are shown on the next page.

The scatterplot below shows arm strength as a function of grip strength for 147 people working in physically-demanding jobs. The plot reveals a strong positive relationship. The value of Pearson's correlation is 0.63.


Computing Pearson's Correlation Coefficient

The formula for Pearson's correlation takes on many forms. A commonly used formula is shown below. The formula looks a bit complicated, but taken step by step as shown in the numerical example, it is really quite simple.

$r = \dfrac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{N}\right)}}$

A simpler looking formula can be used if the numbers are converted into z scores:

$r = \dfrac{\sum z_x z_y}{N}$

where $z_x$ is the variable X converted into z scores and $z_y$ is the variable Y converted into z scores.
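
Both routes to Pearson's r can be sketched in Python and checked against each other. The first function uses the sums-of-products computational form, r = (ΣXY - ΣXΣY/N) / sqrt((ΣX² - (ΣX)²/N)(ΣY² - (ΣY)²/N)); the second averages the products of z scores, assuming the z scores are formed with standard deviations that divide by N. The paired scores are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from raw sums (computational form)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

def pearson_r_from_z(xs, ys):
    """Pearson's r as the mean product of z scores (SDs divide by N)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
               for x, y in zip(xs, ys)) / n

x = [1, 2, 3, 4, 5]  # hypothetical paired scores
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 3))         # 0.775
print(round(pearson_r_from_z(x, y), 3))  # 0.775
```

Neither function guards against a variable with no variation, for which the correlation is undefined (the denominator is 0).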

Example Correlations
