CHAPTER 1: DATA CLEANSING
Checking for missing data in a variable in Excel
We can use the COUNTBLANK function to count the number of missing values in a variable in Excel.
We see that the variable Miles has 1 missing value.
To find where the missing value is, we sort the Miles column from lowest to highest. Missing values always end up at the bottom of the table.
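For readers who want to cross-check the spreadsheet result outside Excel, the same idea can be sketched in Python. The Miles values below are hypothetical, and `None` is assumed to stand for a blank cell; this is only an illustration of counting and locating missing entries, not part of the Excel workflow.

```python
# Hypothetical Miles column; None marks a blank (missing) cell.
miles = [52000, 31500, None, 47800, 12900]

# Analogue of COUNTBLANK: count the missing entries.
missing_count = sum(1 for value in miles if value is None)

# Instead of sorting, list the 1-based positions of the missing entries.
missing_rows = [row for row, value in enumerate(miles, start=1) if value is None]

print(missing_count, missing_rows)
```

Listing the positions directly avoids reordering the data, which is the main difference from the sort-based approach.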
Identifying outliers and erroneous values in the data set
We can identify outliers in the data by using the mean and standard deviation of the quantitative variables.
In the example data shown in the figure above, we enter the following formula in cell H3:
=AVERAGE(C2:C457)
Then we enter the following formula in cell H4:
=STDEV.S(C2:C457)
We do the same for the remaining cells.
The results show that the means and standard deviations of these variables look reasonable; no outlier anomalies are detected.
We can also compute the minimum and maximum values for each variable.
Enter the following formula in cell H5:
=MIN(C2:C457)
Enter the following formula in cell H6:
=MAX(C2:C457)
We see that the minimum and maximum values of the variable Life of Tire (Months) are 1.8 months and 601.0 months. This maximum (about 50 years) is not plausible for the life of a tire. To identify which automobile has this outlier, we sort the entire data set by the variable Life of Tire (Months) and scroll to the last rows of the data.
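The summary statistics and the sort step can be mirrored in a short Python sketch. Only the value 601.0, the minimum 1.8, and the ID Number 8696859 come from the text; the other rows and IDs are made up for illustration.

```python
import statistics

# (ID Number, Life of Tire in months). Only 8696859, 601.0 and 1.8 come from
# the text; the remaining rows are hypothetical.
rows = [(1001, 24.5), (1002, 1.8), (8696859, 601.0), (1003, 36.0), (1004, 18.2)]
life = [months for _, months in rows]

mean_life = statistics.mean(life)          # analogue of =AVERAGE(...)
sd_life = statistics.stdev(life)           # analogue of =STDEV.S(...), sample std
min_life, max_life = min(life), max(life)  # analogues of =MIN(...) and =MAX(...)

# Sorting by the variable pushes the extreme values to the ends of the list.
rows_sorted = sorted(rows, key=lambda row: row[1])
suspect_id, suspect_value = rows_sorted[-1]

print(mean_life, sd_life, min_life, max_life, suspect_id, suspect_value)
```

With so few rows the single extreme value dominates both the mean and the standard deviation, which is one reason the minimum and maximum are worth checking as well.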
We see in Figure 2.34 that the observation with a Life of Tire (Months) value of 601.0 is the left rear tire from the automobile with ID Number 8696859. Also note that
rear tire value is also misplaced. Both of these erroneous entries can now be corrected.
By repeating this process for the remaining variables in the data (Tread Depth and Miles) in columns I and J, we determine that the minimum and maximum values
are in error and if so, what might be the correct value.
Not all erroneous values in a data set are extreme; these erroneous values are much more difficult to find. However, if the variable with suspected erroneous values has a relatively strong relationship with another variable in the data, we can use this knowledge to look for erroneous values. Here we will consider the variables Tread Depth and Miles; because more miles driven should lead to less tread depth on an automobile tire, we expect these two variables to have a negative relationship. A scatter chart will show whether any of the tires in the data set have values for Tread Depth and Miles that are counter to this expectation.
The red ellipse in Figure 2.35 shows the region in which the points representing Tread Depth and Miles would generally be expected to lie on this scatter plot. The points
Closer examination of outliers and potential erroneous values may reveal an error
may supply similar information. Such variables can be aggregated or removed to allow more parsimonious model development.
A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider. The treatment of categorical variables is particularly important. Typically, it is best to encode categorical variables with 0–1 dummy variables. Consider a data set that contains the variable Language to track the language preference of callers to a call center. The variable Language with the possible values of English, German, and Spanish would be replaced with three binary variables called English, German, and Spanish. An entry of German would be captured using a 0 for the English dummy variable, a 1 for the German dummy variable, and a 0 for the Spanish dummy variable. Using 0–1 dummy variables to encode categorical variables with many different categories results in a large number of variables. In these cases, the use of PivotTables is helpful in identifying categories that are similar and can possibly be
therefore the number of categories may be reduced by combining categories.
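The 0–1 dummy encoding described for the Language variable can be sketched as follows. The function name `encode_language` is a hypothetical helper introduced only for this illustration.

```python
# Categories named in the text.
CATEGORIES = ["English", "German", "Spanish"]

def encode_language(value):
    """Return the 0-1 dummy variables for one Language entry (hypothetical helper)."""
    return {category: int(value == category) for category in CATEGORIES}

# An entry of German gets a 1 for the German dummy and 0 for the other two.
print(encode_language("German"))
```

Each additional category adds one more dummy variable, which is why variables with many categories quickly inflate the number of columns.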
Often data sets contain variables that, considered separately, are not particularly insightful but that, when appropriately combined, result in a new variable that reveals an important relationship. Financial data supplying information on stock
may be as useful as the derived variable representing the price/earnings (PE) ratio. A variable tabulating the dollars spent by a household on groceries may not be interesting because this value may depend on the size of the household. Instead, considering the proportion of total household spending on groceries may be more informative.
Building a frequency distribution for a categorical variable in Excel
Suppose we have the following data table.
To build a frequency distribution for each type of drink, we use the COUNTIF function as follows.
The result is as follows.
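As a rough cross-check of the COUNTIF approach, the same tabulation can be sketched in Python. The drink names and purchases below are hypothetical, since the original data table is not reproduced here.

```python
from collections import Counter

# Hypothetical drink purchases standing in for the worksheet column.
drinks = ["Coke", "Pepsi", "Coke", "Sprite", "Coke", "Pepsi"]

# Counter plays the role of one COUNTIF per category.
frequency = Counter(drinks)

# Relative frequency: count divided by the total number of observations.
relative = {name: count / len(drinks) for name, count in frequency.items()}

print(frequency, relative)
```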
Building a frequency distribution for a quantitative variable in Excel
To build a frequency distribution for a quantitative variable, we use the FREQUENCY function.
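Excel's FREQUENCY function counts, for each bin upper limit, the values that are greater than the previous limit and less than or equal to the current one, and it returns one extra count for values above the last limit. A minimal sketch of that rule, using hypothetical data and bin limits, is:

```python
from bisect import bisect_left

def frequency(data, bin_upper_limits):
    """Count values per bin; the last slot holds values above the final limit."""
    counts = [0] * (len(bin_upper_limits) + 1)
    for value in data:
        # bisect_left finds the first bin whose upper limit is >= value.
        counts[bisect_left(bin_upper_limits, value)] += 1
    return counts

# Hypothetical data (20 values) and bin upper limits (5 bins).
data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]
bins = [14, 19, 24, 29, 34]

print(frequency(data, bins))
```

The bin limits must be sorted in ascending order for the binary search to be valid.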
Drawing a histogram for a quantitative variable
To draw a histogram, we need the Data Analysis ToolPak add-in.
Step 1 Click the Data tab in the Ribbon
Step 2 Click Data Analysis in the Analyze group
Step 3 When the Data Analysis dialog box opens, choose Histogram from the list of Analysis Tools, and click OK
In the Input Range: box, enter the data range; in this example we enter A2:A21
In the Bin Range: box, enter the upper limit of each bin; here we enter C2:C6
Under Output Options:, select New Worksheet Ply:
Select the check box for Chart Output (see Figure 2.13)
Click OK
The result is a new worksheet containing the histogram.
Drawing a box plot for a quantitative variable
The step-by-step directions below illustrate how to create boxplots in Excel for a single variable and for multiple variables. First we will create a boxplot for a single variable.
Step 1 Select cells B1:B13
Step 2 Click the Insert tab on the Ribbon
Click the Insert Statistic Chart button in the Charts group
Choose the Box and Whisker chart from the drop-down menu
The resulting boxplot created in Excel is shown in Figure 2.24. Comparing this figure to Figure 2.22, we see that all the important elements of a boxplot are generated here. Excel orients the boxplot vertically, and by default it also includes a marker for the mean.
Next we will use the HomeSalesComparison file to create boxplots in Excel for multiple variables similar to what is shown in Figure 2.26.
Step 1 Select cells B1:F11
Step 2 Click the Insert tab on the Ribbon
Click the Insert Statistic Chart button in the Charts group
Choose the Box and Whisker chart from the drop-down menu
The boxplot created in Excel is shown in Figure 2.25. Excel again orients the boxplot vertically. The different selling locations are shown in the Legend at the top of the figure, and different colors are used for each boxplot.
CHAPTER 2: DESCRIPTIVE STATISTICS
Computing measures of central tendency for a quantitative variable
We can compute the mean, median, and mode of a quantitative variable using the AVERAGE, MEDIAN, and MODE functions.
In the table above, our data series has two modes, 138 and 25400000, so we must use the MODE.MULTI function (MULTI stands for multimodal). If the data series has only one mode, we use the MODE.SNGL function.
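The distinction between a single mode and several modes can be illustrated with Python's standard `statistics` module. The data below are hypothetical and simply contain two values that tie for most frequent.

```python
import statistics

# Hypothetical data in which 138 and 120 both occur twice.
values = [138, 120, 138, 150, 120, 145]

mean_value = statistics.mean(values)      # analogue of AVERAGE
median_value = statistics.median(values)  # analogue of MEDIAN
modes = statistics.multimode(values)      # analogue of MODE.MULTI: all modes

print(mean_value, median_value, modes)
```

`statistics.multimode` returns every most-frequent value, which is the behavior needed when the data are multimodal.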
We can also compute the geometric mean with Excel.
The geometric mean is often used in analyzing growth rates in financial data. In these types of situations, the arithmetic mean or average value will provide misleading results.
To illustrate the use of the geometric mean, consider Table 2.10, which shows the percentage annual returns, or growth rates, for a mutual fund over the past 10
$77.90 + 0.287($77.90) = $77.90(1 + 0.287) = $77.90(1.287) = $100.26
Note that 1.287 is the growth factor for year 2. By substituting $100(0.779) for
investment times the
What was the mean percentage annual return or mean rate of growth for this investment over the 10-year period? The geometric mean of the 10 growth factors
answer this question. Because the product of the 10 growth factors is 1.335, the geometric
The geometric mean tells us that annual returns grew at an average annual rate of (1.029 − 1)100, or 2.9%. In other words, with an average annual growth rate of 2.9%, a $100 investment in the fund at the beginning of year 1 would grow to $100(1.029)^10 = $133.09 at the end of 10 years. We can use Excel to calculate the geometric mean for the data in Table 2.10 by using the function GEOMEAN. In Figure 2.17, the value for the geometric mean in cell C13 is found using the formula
=GEOMEAN(C2:C11)
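The arithmetic behind these figures can be checked with a few lines of Python. The sketch uses only the product of the growth factors stated in the text (1.335) and the 10-year horizon; the individual yearly returns from Table 2.10 are not reproduced here.

```python
# Product of the 10 growth factors, as stated in the text.
product_of_factors = 1.335
n_years = 10

# Geometric mean: the n-th root of the product of the n growth factors.
geometric_mean = product_of_factors ** (1 / n_years)

# Mean annual growth rate in percent.
mean_growth_rate_pct = (geometric_mean - 1) * 100

# Value of a $100 investment after 10 years at the rounded mean growth factor.
ending_value = 100 * round(geometric_mean, 3) ** n_years

print(round(geometric_mean, 3), round(mean_growth_rate_pct, 1), round(ending_value, 2))
```

Rounding the growth factor to 1.029 before compounding reproduces the $133.09 quoted in the text; without rounding the result is $133.50, that is, $100 times the product 1.335.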
It is important to understand that the arithmetic mean of the percentage annual returns does not provide the mean annual growth rate for this investment. The sum of the 10 percentage annual returns in Table 2.10 is 50.4. Thus, the arithmetic mean
mean is appropriate only for an additive process. For a multiplicative process, such as applications involving growth rates, the geometric mean is the appropriate
applications include changes in the populations of species, crop yields, pollution
death rates. The geometric mean can also be applied to changes that occur over any number of successive periods of any length. In addition to annual changes, the
often applied to find the mean rate of change over quarters, months, weeks, and even days.
Computing measures of dispersion for a quantitative variable
We can use the following formulas to compute measures of dispersion for a quantitative variable.
Note that to compute the standard deviation and the variance we use the S versions of the functions (STDEV.S, VAR.S), which treat the data as a sample rather than as the entire population.
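The difference between the sample (S) and population (P) versions is only the divisor: n − 1 for a sample and n for a population. A small sketch with hypothetical data shows the effect:

```python
import statistics

# Hypothetical sample of 8 observations.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

sample_variance = statistics.variance(sample)       # divides by n - 1, like VAR.S
population_variance = statistics.pvariance(sample)  # divides by n, like VAR.P
sample_sd = statistics.stdev(sample)                # like STDEV.S
population_sd = statistics.pstdev(sample)           # like STDEV.P

print(sample_variance, population_variance, sample_sd, population_sd)
```

The sample versions are always at least as large as the population versions, and the gap shrinks as n grows.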
Variance is a measure of variability in the values of a random variable. It is a weighted average of the squared deviations of a random variable from its mean where the weights are the probabilities. Below we define the formula for calculating the variance of a discrete
referred to as the variance. The notations Var(x) and σ² are both used to denote the variance of a random variable.
The calculation of the variance of the number of payments made per year by a mortgage customer is summarized in Table 4.12. We see that the variance is 42.360. The standard deviation, σ, is defined as the positive square root of the variance. Thus, the standard deviation for the number of payments made per year by a mortgage
The Excel function SUMPRODUCT can be used to easily calculate equation
a custom discrete random variable. We illustrate the use of the SUMPRODUCT
We can also use Excel to find the variance directly from the data when the values occur with relative frequencies that correspond to the probability distribution of the random variable. Cell F305 in Figure 4.12 shows that we use the Excel formula =VAR.P(F2:F301) to calculate the variance from the complete data. This formula
which is the same as that calculated in Table 4.12 and Figure 4.13. Similarly, we
formula =STDEV.P(F2:F301) to calculate the standard deviation of 6.508.
As with the AVERAGE function and expected value, we cannot use the Excel functions VAR.P and STDEV.P directly on the x values to calculate the variance and standard deviation of a custom discrete random variable if the x values are not
Instead we must either use the formula from equation (4.14) or use the Excel
the entire data set as shown in Figure 4.12.
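The probability-weighted calculation that SUMPRODUCT performs can be sketched in Python. The distribution below is hypothetical and is not the mortgage-payment distribution from Table 4.12.

```python
import math

# Hypothetical probability distribution of a discrete random variable x.
x_values = [0, 1, 2, 3]
probabilities = [0.1, 0.3, 0.4, 0.2]

# Expected value: sum of x * f(x), the analogue of SUMPRODUCT(x, f).
expected_value = sum(x * p for x, p in zip(x_values, probabilities))

# Variance: probability-weighted average of squared deviations from the mean.
variance = sum((x - expected_value) ** 2 * p
               for x, p in zip(x_values, probabilities))

# Standard deviation: positive square root of the variance.
std_dev = math.sqrt(variance)

print(expected_value, variance, std_dev)
```

Applying an unweighted variance function to the four x values alone would ignore the probabilities, which is the pitfall the text warns about.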
variables, each with different standard deviations and different means.
Computing distribution measures for a quantitative variable
A z-score allows us to measure the relative location of a value in the data set. More specifically, a z-score helps us determine how far a particular value is from the mean relative to the data set's standard deviation. Suppose we have a sample of n values denoted by x1, x2, ..., xn. In addition, assume that the sample mean, x̄, and the sample standard deviation, s, are already computed. Associated with each
value called its z-score. Equation (2.8) shows how the z-score is computed for each xi:
zi = (xi − x̄) / s    (2.8)
The z-score is often called the standardized value. The z-score, zi, can be
that the value of the observation is equal to the mean.
The z-scores for the class size data are computed in Table 2.13. Recall the
deviations below the mean.
The z-score can be calculated in Excel using the function STANDARDIZE. Figure
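A z-score calculation of the kind STANDARDIZE performs can be sketched as follows. The five class sizes are illustrative values assumed for this example; Table 2.13 itself is not reproduced here.

```python
import statistics

# Illustrative class sizes (assumed values, not taken from Table 2.13).
class_sizes = [46, 54, 42, 46, 32]

mean_size = statistics.mean(class_sizes)  # sample mean
sd_size = statistics.stdev(class_sizes)   # sample standard deviation

# z-score: distance from the mean measured in standard deviations.
z_scores = [(x - mean_size) / sd_size for x in class_sizes]

print(mean_size, sd_size, z_scores)
```

A negative z-score marks a value below the mean, a positive one a value above it, and a z-score of zero a value equal to the mean.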
Computing the covariance between two quantitative variables
Covariance is a descriptive measure of the linear association between two
that for our calculations, x̄ = 84.6 and ȳ = 26.3.
The covariance calculated in Table 2.15 is sxy = 12.8.
than 0, it indicates a positive relationship between the high temperature and sales
water. This verifies the relationship we saw in the scatter chart in Figure 2.26 that
high temperature for a day increases, sales of bottled water generally increase.
The sample covariance can also be calculated in Excel using the COVARIANCE.S function. Figure 2.27 shows the data from Table 2.14 entered into an Excel
For the bottled water, the covariance is positive, indicating that higher
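The sample covariance that COVARIANCE.S reports can be written out directly. The temperature and sales figures below are hypothetical and are not the values from Table 2.14.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical daily high temperatures and bottled water sales.
high_temp = [70, 75, 80, 85, 90]
sales = [20, 24, 25, 24, 27]

s_xy = sample_covariance(high_temp, sales)
print(s_xy)
```

A positive result means the two variables tend to be above (or below) their means together; the magnitude depends on the units of both variables.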
Computing the correlation coefficient between two quantitative variables
The correlation coefficient measures the relationship between two variables, and, unlike covariance, the relationship between two variables is not affected by the units of measurement for x and y. For sample data, the correlation coefficient is defined as
rxy = sxy / (sx sy)    (2.10)
scales the correlation coefficient so that it will always take values between −1 and +1.
Let us now compute the sample correlation coefficient for bottled water sales at Queensland Amusement Park. Recall that we calculated sxy = 12.8 using equation (2.9). Using data in Table 2.14, we can compute sample standard deviations for x and y
The sample correlation coefficient is computed from equation (2.10) as follows: rxy = 12.8 / (sx sy)
The correlation coefficient can take only values between −1 and +1. Correlation coefficient values near 0 indicate no linear relationship between the x and y variables. Correlation coefficients greater than 0 indicate a positive linear
variables. The closer the correlation coefficient is to +1, the closer the x and y
to forming a straight line that trends upward to the right (positive slope). Correlation coefficients less than 0 indicate a negative linear relationship between
The closer the correlation coefficient is to −1, the closer the x and y values are to
we can see in Figure 2.26, one could draw a straight line with a positive slope that would
close to all of the data points in the scatter chart.
Because the correlation coefficient defined here measures only the strength of the linear relationship between two quantitative variables, it is possible for the correlation coefficient to be near zero, suggesting no linear relationship, when the relationship between
strong visual evidence of a nonlinear relationship. That is, we can see that as the daily high outside temperature increases, the money spent on environmental control first
less heating is required and then increases as greater cooling is required.
We can compute correlation coefficients using the Excel function CORREL. The correlation coefficient in Figure 2.27 is computed in cell B18 for the sales of
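The claim that the correlation coefficient does not depend on the units of measurement can be checked numerically. The sketch below uses hypothetical temperature and sales data (not Table 2.14) and a hypothetical helper `sample_correlation`; it computes the coefficient once with temperatures in Fahrenheit and once in Celsius.

```python
import statistics

def sample_correlation(x, y):
    """Sample correlation: covariance divided by the product of sample std devs."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return s_xy / (statistics.stdev(x) * statistics.stdev(y))

# Hypothetical data: daily high temperature (Fahrenheit) and bottled water sales.
temp_f = [70, 75, 80, 85, 90]
sales = [20, 24, 25, 24, 27]

# Same temperatures expressed in Celsius.
temp_c = [(f - 32) * 5 / 9 for f in temp_f]

r_f = sample_correlation(temp_f, sales)
r_c = sample_correlation(temp_c, sales)
print(r_f, r_c)
```

Changing the unit rescales the covariance and the standard deviation by the same factor, so the ratio is unchanged.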
CHAPTER 3: INFERENTIAL STATISTICS
Sampling from a Finite Population
Statisticians recommend selecting a probability sample when sampling from a finite population because a probability sample allows you to make valid statistical inferences about the population. The simplest type of probability sample is one in which each sample of size n has the same probability of being selected. It is called a simple random sample. A simple random sample of size n from a finite population of size N is defined as follows.
Procedures used to select a simple random sample from a finite population are based on the use of random numbers. We can use Excel's RAND function to generate a random number between 0 and 1 by entering the formula =RAND() into any cell in a worksheet. The number generated is called a random number because the mathematical procedure used by the RAND function guarantees that every number between 0 and 1 has the same probability of being selected. Let us see how these random numbers can be used to select a simple random sample.
Our procedure for selecting a simple random sample of size n from a population of size N involves two steps.
Step 1 Assign a random number to each element of the population.
Step 2 Select the n elements corresponding to the n smallest random numbers.
Because each set of n elements in the population has the same probability of being assigned the n smallest random numbers, each set of n elements has the same probability of being selected for the sample. If we select the sample using this two-step procedure, every sample of size n has the same probability of being selected; thus, the sample selected satisfies the definition of a simple random sample.
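The two-step procedure translates almost line for line into code. In the sketch below the function name, the `seed` parameter, and the employee IDs 1 to 2500 are assumptions made for illustration only.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Select n elements by tagging each with a random number and sorting."""
    rng = random.Random(seed)
    # Step 1: assign a random number in [0, 1) to each element.
    tagged = [(rng.random(), element) for element in population]
    # Step 2: keep the n elements with the n smallest random numbers.
    tagged.sort(key=lambda pair: pair[0])
    return [element for _, element in tagged[:n]]

# Hypothetical IDs for a population of 2500 employees; draw a sample of 30.
employees = list(range(1, 2501))
sample = simple_random_sample(employees, 30, seed=1)
print(sample)
```

Fixing the seed makes the draw repeatable, which plays the same role as pasting the RAND results as values before sorting in Excel.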
Let us consider the process of selecting a simple random sample of 30 EAI
Step 1 In cell D1, enter the text Random Numbers
Step 2 In cells D2:D2501, enter the formula =RAND()
Step 3 Select the cell range D2:D2501
Step 4 In the Home tab in the Ribbon:
Click Copy in the Clipboard group
Click the arrow below Paste in the Clipboard group. When the Paste window appears, click Values in the Paste Values area
Press the Esc key
Step 5 Select cells A1:D2501
Step 6 In the Data tab on the Ribbon, click Sort in the Sort & Filter group
Step 7 When the Sort dialog box appears:
Select the check box for My data has headers
In the first Sort by dropdown menu, select Random Numbers
order, and that the employees are not in their original order. For instance,
Sampling from an Infinite Population
Sometimes we want to select a sample from a population, but the population is infinitely large or the elements of the population are being generated by an ongoing process for which there is no limit on the number of elements that can be generated. Thus, it is not possible to develop a list of all the elements in the population. This is considered the infinite population case. With an infinite population, we cannot select a simple random sample because we cannot construct a frame consisting of all the elements. In the infinite population case, statisticians recommend selecting what is called a random sample.
Care and judgment must be exercised in implementing the selection process for obtaining a random sample from an infinite population. Each case may require a different selection procedure. Let us consider two examples to see what we mean by the conditions: (1) Each element selected comes from the same population, and (2) each element is
population, is satisfied. To ensure that this condition is satisfied, the boxes must be selected at approximately the same point in time. This way the inspector avoids the possibility of selecting some boxes when the process is operating properly and other boxes when the process is not operating properly and is underfilling or overfilling the boxes. With a production process such as this, the second condition, each element is selected independently, is satisfied by designing the production process so that each box of cereal is filled independently. With this assumption, the quality-control inspector need only worry about satisfying the same population condition.
As another example of selecting a random sample from an infinite population, consider the population of customers arriving at a fast-food restaurant. Suppose an employee is asked to select and interview a sample of customers in order to develop a profile of customers who