
Phan Tich Kinh Te Bang Excel.docx


CHAPTER 1: DATA CLEANSING

Checking a variable for missing data in Excel

We can use the COUNTBLANK function to count the number of missing values of a variable in Excel.

We see that the variable Miles has 1 missing value.

To find where the missing value is, we sort the Miles column from lowest to highest. Missing values always end up at the bottom of the table.
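
The same check can be sketched outside Excel. The snippet below is only an illustration: the Miles values are made up, and `None` is assumed to stand for an empty cell. It counts the missing entries and reports their positions directly, instead of sorting the column.

```python
# Hypothetical Miles column; None stands for an empty (blank) cell.
miles = [32150, 41876, None, 28904, 50312]

def count_blank(values):
    """Count missing entries, similar in spirit to Excel's COUNTBLANK."""
    return sum(1 for v in values if v is None or v == "")

def blank_positions(values):
    """Return the 0-based positions of the missing entries."""
    return [i for i, v in enumerate(values) if v is None or v == ""]

missing_count = count_blank(miles)
missing_rows = blank_positions(miles)
print(missing_count, missing_rows)
```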

Identifying outliers and erroneous values in the data set

We can identify outliers in the data by using the mean and standard deviation of the quantitative variables.

In the example data shown in the figure above, we enter the following formula in cell H3:

=AVERAGE(C2:C457)

Then we enter the following formula in cell H4:

=STDEV.S(C2:C457)

We do the same for the remaining cells.

The results show that the means and standard deviations of these variables are stable; they reveal no abnormal outliers.

We can also compute the minimum and maximum values of each variable.

Enter the following formula in cell H5:

=MIN(C2:C457)

Enter the following formula in cell H6:

=MAX(C2:C457)

We see that the minimum and maximum values of the variable Life of Tire (Months) are 1.8 and 601.0 months. This maximum (about 50 years) is not plausible for the life of a tire. To determine which automobile has this outlier, we sort the entire data set by Life of Tire (Months) and scroll to the last rows of the data.
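
As a rough sketch of the same screening in Python, the block below computes the four summary values and then sorts the records to find which one carries the extreme value. The records and ID numbers are hypothetical; only the values 1.8 and 601.0 echo the text.

```python
import statistics

# Hypothetical records: (ID number, life of tire in months).
tires = [
    (1001, 20.5),
    (1002, 18.3),
    (1003, 601.0),   # deliberately implausible entry
    (1004, 1.8),
    (1005, 24.9),
]

life = [months for _, months in tires]

summary = {
    "mean": statistics.mean(life),     # like =AVERAGE(...)
    "stdev": statistics.stdev(life),   # sample standard deviation, like =STDEV.S(...)
    "min": min(life),                  # like =MIN(...)
    "max": max(life),                  # like =MAX(...)
}

# Sorting by the variable puts the extreme records at the two ends.
sorted_tires = sorted(tires, key=lambda rec: rec[1])
smallest, largest = sorted_tires[0], sorted_tires[-1]
print(summary)
print("largest:", largest)
```

With so few rows a single extreme value dominates both the mean and the standard deviation, which is why the minimum and maximum are worth checking as well.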

We see in Figure 2.34 that the observation with a Life of Tire (Months) value of 601.0 is the left rear tire from the automobile with ID Number 8696859. Also note that [...] rear tire value is also misplaced. Both of these erroneous entries can now be corrected.

By repeating this process for the remaining variables in the data (Tread Depth and Miles) in columns I and J, we determine that the minimum and maximum values [...] are in error and if so, what might be the correct value.

Not all erroneous values in a data set are extreme; these erroneous values are much more difficult to find. However, if the variable with suspected erroneous values has a relatively strong relationship with another variable in the data, we can use this knowledge to look for erroneous values. Here we will consider the variables Tread Depth and Miles; because more miles driven should lead to less tread depth on an automobile tire, we expect these two variables to have a negative relationship. A scatter chart will [...] whether any of the tires in the data set have values for Tread Depth and Miles that are counter to this expectation.

The red ellipse in Figure 2.35 shows the region in which the points representing Tread Depth and Miles would generally be expected to lie on this scatter plot. The points [...]

Closer examination of outliers and potential erroneous values may reveal an error [...]

[...] may supply similar information. Such variables can be aggregated or removed to allow more parsimonious model development.

A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider. The treatment of categorical variables is particularly important. Typically, it is best to encode categorical variables with 0–1 dummy variables. Consider a data set that contains the variable Language to track the language preference of callers to a call center. The variable Language with the possible values of English, German, and Spanish would be replaced with three binary variables called English, German, and Spanish. An entry of German would be captured using a 0 for the English dummy variable, a 1 for the German dummy variable, and a 0 for the Spanish dummy variable. Using 0–1 dummy variables to encode categorical variables with many different categories results in a large number of variables. In these cases, the use of PivotTables is helpful in identifying categories that are similar and can possibly be [...] therefore the number of categories may be reduced by combining categories.
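
The Language example can be sketched in a few lines of Python. The helper name `dummy_encode` and the list of caller entries are assumptions made for illustration; only the three category names come from the text.

```python
def dummy_encode(values, categories):
    """Return one dict of 0-1 indicators per value, one key per category."""
    return [{c: int(v == c) for c in categories} for v in values]

categories = ["English", "German", "Spanish"]
languages = ["German", "English", "Spanish", "German"]  # hypothetical caller data

encoded = dummy_encode(languages, categories)
print(encoded[0])  # {'English': 0, 'German': 1, 'Spanish': 0}
```

Each row has exactly one indicator set to 1, so the number of columns grows with the number of categories, which is the cost the paragraph above points out.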

Often data sets contain variables that, considered separately, are not particularly insightful but that, when appropriately combined, result in a new variable that reveals an important relationship. Financial data supplying information on stock [...] may be as useful as the derived variable representing the price/earnings (PE) ratio. A variable tabulating the dollars spent by a household on groceries may not be interesting because this value may depend on the size of the household. Instead, considering the proportion of total household spending on groceries may be more informative.

Building a frequency distribution for a categorical variable in Excel

Suppose we have the following data table.

We want to build a frequency distribution for each type of drink, so we use the COUNTIF function as follows.

The result is as follows.
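
Since the data table and the COUNTIF formula are not reproduced here, the following Python sketch shows the same idea on made-up drink purchases: one count per category, plus the relative frequency of each.

```python
from collections import Counter

# Hypothetical drink purchases (the original data table is not reproduced here).
drinks = ["Coca-Cola", "Pepsi", "Sprite", "Coca-Cola", "Pepsi", "Coca-Cola"]

frequency = Counter(drinks)                # one count per category, like COUNTIF per drink
total = sum(frequency.values())
relative = {d: n / total for d, n in frequency.items()}

for drink, n in frequency.most_common():
    print(f"{drink:10s} {n:3d} {relative[drink]:.2f}")
```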

Building a frequency distribution for a quantitative variable in Excel

To build a frequency distribution for a quantitative variable, we use the FREQUENCY function.
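
A minimal sketch of the binning rule follows, assuming the usual convention that each bin collects the values up to and including its upper limit and that one extra count holds everything above the last limit. The data and bin limits are hypothetical.

```python
from bisect import bisect_left

def frequency(data, bin_limits):
    """Count values per bin; each bin holds values <= its upper limit and
    greater than the previous limit. The final count is for values above
    the last limit."""
    limits = sorted(bin_limits)
    counts = [0] * (len(limits) + 1)
    for v in data:
        counts[bisect_left(limits, v)] += 1
    return counts

data = [12, 14, 15, 15, 18, 19, 21, 22, 27, 33]   # hypothetical values
bins = [14, 19, 24, 29]                            # hypothetical upper limits
counts = frequency(data, bins)
print(counts)
```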

Drawing a histogram for a quantitative variable

To draw a histogram, we need the Data Analysis ToolPak.

Step 1 Click the Data tab in the Ribbon
Step 2 Click Data Analysis in the Analyze group
Step 3 When the Data Analysis dialog box opens, choose Histogram from the list of Analysis Tools, and click OK
    In the Input Range: box, enter the data range. In this example we enter A2:A21
    In the Bin Range: box, enter the upper limit of each bin. Here we enter C2:C6
    Under Output Options:, select New Worksheet Ply:
    Select the check box for Chart Output (see Figure 2.13). Click OK

The result is a new worksheet containing the histogram.

Drawing a box plot for a quantitative variable

The step-by-step directions below illustrate how to create boxplots in Excel for a single variable and for multiple variables. First we will create a boxplot for a single variable.

Step 1 Select cells B1:B13
Step 2 Click the Insert tab on the Ribbon
    Click the Insert Statistic Chart button in the Charts group
    Choose the Box and Whisker chart from the drop-down menu

The resulting boxplot created in Excel is shown in Figure 2.24. Comparing this figure to Figure 2.22, we see that all the important elements of a boxplot are generated here. Excel orients the boxplot vertically, and by default it also includes a marker for the mean.

Next we will use the HomeSalesComparison file to create boxplots in Excel for multiple variables similar to what is shown in Figure 2.26.

Step 1 Select cells B1:F11
Step 2 Click the Insert tab on the Ribbon
    Click the Insert Statistic Chart button in the Charts group
    Choose the Box and Whisker chart from the drop-down menu

The boxplot created in Excel is shown in Figure 2.25. Excel again orients the boxplot vertically. The different selling locations are shown in the Legend at the top of the figure, and different colors are used for each boxplot.
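
The numbers a boxplot draws can be sketched as a five-number summary. The block below is an illustration on made-up values; it uses the "exclusive" quartile method of Python's `statistics.quantiles`, which is an assumption and may differ slightly from the quartiles Excel draws, depending on the chart's options.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # default method is "exclusive"
    return min(values), q1, q2, q3, max(values)

prices = [108, 138, 138, 142, 186, 199, 208, 254, 254, 257, 298, 456]  # hypothetical
summary = five_number_summary(prices)
print(summary)
```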

CHAPTER 2: DESCRIPTIVE STATISTICS

Computing measures of central tendency for a quantitative variable

We can compute the mean, median, and mode of a quantitative variable by using the functions AVERAGE, MEDIAN, and MODE.

In the table above, our data series has two modes, 138 and 25400000, so we must use the MODE.MULTI function (MULTI stands for multimodal). If the data series has only one mode, we use the MODE.SNGL function.
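
The three measures have direct counterparts in Python's `statistics` module. The sketch below uses a small hypothetical series with two modes to show the difference between asking for every mode and asking for a single one.

```python
import statistics

values = [138, 138, 142, 186, 199, 254, 254]   # hypothetical data with two modes

mean = statistics.mean(values)          # like =AVERAGE(...)
median = statistics.median(values)      # like =MEDIAN(...)
modes = statistics.multimode(values)    # every most-frequent value, like =MODE.MULTI(...)
first_mode = statistics.mode(values)    # a single mode only (Python 3.8+ returns the first one found)
print(mean, median, modes, first_mode)
```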

We can also compute the geometric mean with Excel.

The geometric mean is often used in analyzing growth rates in financial data. In these types of situations, the arithmetic mean or average value will provide misleading results.

To illustrate the use of the geometric mean, consider Table 2.10, which shows the percentage annual returns, or growth rates, for a mutual fund over the past 10 [...]

$77.90 + 0.287($77.90) = $77.90(1 + 0.287) = $77.90(1.287) = $100.26

Note that 1.287 is the growth factor for year 2. By substituting $100(0.779) for [...] investment times the [...]

What was the mean percentage annual return or mean rate of growth for this investment over the 10-year period? The geometric mean of the 10 growth factors [...] answer this question. Because the product of the 10 growth factors is 1.335, the geometric [...]

The geometric mean tells us that annual returns grew at an average annual rate of (1.029 - 1)100, or 2.9%. In other words, with an average annual growth rate of 2.9%, a $100 investment in the fund at the beginning of year 1 would grow to $100(1.029)^10 = $133.09 at the end of 10 years. We can use Excel to calculate the geometric mean for the data in Table 2.10 by using the function GEOMEAN. In Figure 2.17, the value for the geometric mean in cell C13 is found using the formula

=GEOMEAN(C2:C11)

It is important to understand that the arithmetic mean of the percentage annual returns does not provide the mean annual growth rate for this investment. The sum of the 10 percentage annual returns in Table 2.10 is 50.4. Thus, the arithmetic mean [...]

[...] mean is appropriate only for an additive process. For a multiplicative process, such as applications involving growth rates, the geometric mean is the appropriate [...]

[...] applications include changes in the populations of species, crop yields, pollution [...] death rates. The geometric mean can also be applied to changes that occur over any number of successive periods of any length. In addition to annual changes, the [...] often applied to find the mean rate of change over quarters, months, weeks, and even days.
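
A short sketch of why the geometric mean is the right average for growth factors. Only the first two returns (-22.1% and 28.7%, matching the growth factors 0.779 and 1.287 quoted above) come from the text; the remaining values are invented, since Table 2.10 is not reproduced here.

```python
import math
import statistics

# Annual returns in percent. Only the first two match growth factors quoted
# in the text (0.779 and 1.287); the rest are hypothetical.
returns_pct = [-22.1, 28.7, 10.9, 4.9, 15.8]

growth_factors = [1 + r / 100 for r in returns_pct]

geo_mean = statistics.geometric_mean(growth_factors)   # like =GEOMEAN(...)
mean_growth_rate_pct = (geo_mean - 1) * 100

# Compounding at the geometric mean reproduces the actual ending balance.
ending_actual = 100 * math.prod(growth_factors)
ending_from_geo = 100 * geo_mean ** len(growth_factors)
print(round(mean_growth_rate_pct, 2), round(ending_actual, 2), round(ending_from_geo, 2))
```

Compounding the arithmetic mean of the returns instead would generally overstate the ending balance, which is the point the passage makes.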

Computing measures of dispersion for a quantitative variable

We can use the following formulas to compute measures of dispersion for a quantitative variable.

Note that to compute the standard deviation and the variance we use the S versions of the functions (STDEV.S and VAR.S), which compute the statistic for a sample rather than for the entire population.

Variance is a measure of variability in the values of a random variable. It is a weighted average of the squared deviations of a random variable from its mean where the weights are the probabilities. Below we define the formula for calculating the variance of a discrete [...] referred to as the variance. The notations Var(x) and σ² are both used to denote the variance of a random variable.

The calculation of the variance of the number of payments made per year by a mortgage customer is summarized in Table 4.12. We see that the variance is 42.360. The standard deviation, σ, is defined as the positive square root of the variance. Thus, the standard deviation for the number of payments made per year by a mortgage [...]

The Excel function SUMPRODUCT can be used to easily calculate equation [...] a custom discrete random variable. We illustrate the use of the SUMPRODUCT [...]

We can also use Excel to find the variance directly from the data when the values occur with relative frequencies that correspond to the probability distribution of the random variable. Cell F305 in Figure 4.12 shows that we use the Excel formula =VAR.P(F2:F301) to calculate the variance from the complete data. This formula [...] which is the same as that calculated in Table 4.12 and Figure 4.13. Similarly, we [...] formula =STDEV.P(F2:F301) to calculate the standard deviation of 6.508.

As with the AVERAGE function and expected value, we cannot use the Excel functions VAR.P and STDEV.P directly on the x values to calculate the variance and standard deviation of a custom discrete random variable if the x values are not [...] Instead we must either use the formula from equation (4.14) or use the Excel [...] the entire data set as shown in Figure 4.12.

[...] variables, each with different standard deviations and different means.
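
The weighted-average definition of variance quoted above can be sketched with a small SUMPRODUCT-style helper. The distribution below is hypothetical (it is not the mortgage-payment data of Table 4.12), and the helper name `sumproduct` is chosen only to mirror the Excel function.

```python
import math

def sumproduct(xs, ys):
    """Sum of pairwise products, like Excel's SUMPRODUCT on two ranges."""
    return sum(x * y for x, y in zip(xs, ys))

# Hypothetical discrete distribution: values x and their probabilities f(x).
x = [0, 1, 2, 3]
f = [0.1, 0.4, 0.3, 0.2]

expected = sumproduct(x, f)                       # E(x) = sum of x * f(x)
sq_dev = [(xi - expected) ** 2 for xi in x]
variance = sumproduct(sq_dev, f)                  # Var(x) = sum of (x - mu)^2 * f(x)
std_dev = math.sqrt(variance)
print(expected, variance, std_dev)
```

Applying a plain population-variance function to the four x values alone would weight them equally, which is why the probabilities have to enter the calculation explicitly.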

Computing distribution measures for a quantitative variable

A z-score allows us to measure the relative location of a value in the data set. More specifically, a z-score helps us determine how far a particular value is from the mean relative to the data set's standard deviation. Suppose we have a sample of n values denoted by x1, x2, ..., xn. In addition, assume that the sample mean, x̄, and the sample standard deviation, s, are already computed. Associated with each [...] value called its z-score. Equation (2.8) shows how the z-score is computed for each xi:

zi = (xi - x̄) / s    (2.8)

The z-score is often called the standardized value. The z-score, zi, can be [...] that the value of the observation is equal to the mean.

The z-scores for the class size data are computed in Table 2.13. Recall the [...] deviations below the mean.

The z-score can be calculated in Excel using the function STANDARDIZE. Figure [...]
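
A minimal sketch of the z-score calculation, using the sample standard deviation as in the definition above. The five class sizes are illustrative values, not taken from Table 2.13, which is not reproduced here.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / s for each value, using the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]

class_sizes = [46, 54, 42, 46, 32]   # illustrative sample
z = z_scores(class_sizes)
print([round(v, 2) for v in z])      # [0.25, 1.25, -0.25, 0.25, -1.5]
```

A z-score of -1.5 means the value lies 1.5 standard deviations below the mean, and the z-scores of a sample always sum to zero.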

Computing the covariance between two quantitative variables

Covariance is a descriptive measure of the linear association between two [...] that for our calculations, x̄ = 84.6 and ȳ = 26.3.

The covariance calculated in Table 2.15 is sxy = 12.8 [...] than 0, it indicates a positive relationship between the high temperature and sales [...] water. This verifies the relationship we saw in the scatter chart in Figure 2.26 that [...] high temperature for a day increases, sales of bottled water generally increase. The sample covariance can also be calculated in Excel using the COVARIANCE.S function. Figure 2.27 shows the data from Table 2.14 entered into an Excel [...]

For the bottled water, the covariance is positive, indicating that higher [...]
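
The sample covariance can be sketched as follows. The temperature and sales figures are hypothetical (Table 2.14 is not reproduced here), so the result differs from the 12.8 quoted in the text.

```python
def sample_covariance(xs, ys):
    """Sum of (x - x_bar)(y - y_bar) divided by n - 1, like =COVARIANCE.S(...)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical daily high temperatures and bottled water sales (cases).
temperature = [78, 79, 80, 85, 88]
sales = [23, 22, 24, 27, 29]

s_xy = sample_covariance(temperature, sales)
print(s_xy)  # 12.25
```

The positive sign indicates that the two variables tend to move together; the magnitude depends on the units of measurement.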

Computing the correlation coefficient between two quantitative variables

The correlation coefficient measures the relationship between two variables, and, unlike covariance, the relationship between two variables is not affected by the units of measurement for x and y. For sample data, the correlation coefficient is defined as

rxy = sxy / (sx sy)    (2.10)

[...] scales the correlation coefficient so that it will always take values between -1 and +1.

Let us now compute the sample correlation coefficient for bottled water sales at Queensland Amusement Park. Recall that we calculated sxy = 12.8 using equation (2.9). Using data in Table 2.14, we can compute sample standard deviations for x and y [...] The sample correlation coefficient is computed from equation (2.10) as follows: [...]

The correlation coefficient can take only values between -1 and +1. Correlation coefficient values near 0 indicate no linear relationship between the x and y variables. Correlation coefficients greater than 0 indicate a positive linear [...] variables. The closer the correlation coefficient is to +1, the closer the x and y [...] to forming a straight line that trends upward to the right (positive slope). Correlation coefficients less than 0 indicate a negative linear relationship between [...] The closer the correlation coefficient is to -1, the closer the x and y values are to [...]

[...] we can see in Figure 2.26, one could draw a straight line with a positive slope that would [...] close to all of the data points in the scatter chart. Because the correlation coefficient defined here measures only the strength of the linear relationship between two quantitative variables, it is possible for the correlation coefficient to be near zero, suggesting no linear relationship, when the relationship between [...] strong visual evidence of a nonlinear relationship. That is, we can see that as the daily high outside temperature increases, the money spent on environmental control first [...] less heating is required and then increases as greater cooling is required.

We can compute correlation coefficients using the Excel function CORREL. The correlation coefficient in Figure 2.27 is computed in cell B18 for the sales of [...]
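
The sketch below computes the sample correlation coefficient as the covariance divided by the product of the standard deviations, and uses two made-up series to illustrate the caveat about nonlinear relationships: a perfectly linear series gives a coefficient of 1, while a U-shaped series gives a coefficient of 0 even though y is completely determined by x.

```python
import math

def sample_correlation(xs, ys):
    """Pearson correlation: sample covariance divided by the product of the
    sample standard deviations, like =CORREL(...)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]     # perfectly linear in x
curved = [v ** 2 for v in x]        # strongly related to x, but not linearly

r_linear = sample_correlation(x, linear)
r_curved = sample_correlation(x, curved)
print(r_linear, r_curved)
```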

CHAPTER 3: INFERENTIAL STATISTICS

Sampling from a Finite Population

Statisticians recommend selecting a probability sample when sampling from a finite population because a probability sample allows you to make valid statistical inferences about the population. The simplest type of probability sample is one in which each sample of size n has the same probability of being selected. It is called a simple random sample. A simple random sample of size n from a finite population of size N is defined as follows. [...]

Procedures used to select a simple random sample from a finite population are based on the use of random numbers. We can use Excel's RAND function to generate a random number between 0 and 1 by entering the formula =RAND() into any cell in a worksheet. The number generated is called a random number because the mathematical procedure used by the RAND function guarantees that every number between 0 and 1 has the same probability of being selected. Let us see how these random numbers can be used to select a simple random sample.

Our procedure for selecting a simple random sample of size n from a population of size N involves two steps.

Step 1 Assign a random number to each element of the population.

Step 2 Select the n elements corresponding to the n smallest random numbers.

Because each set of n elements in the population has the same probability of being assigned the n smallest random numbers, each set of n elements has the same probability of being selected for the sample. If we select the sample using this two-step procedure, every sample of size n has the same probability of being selected; thus, the sample selected satisfies the definition of a simple random sample.

Let us consider the process of selecting a simple random sample of 30 EAI [...]

Step 1 In cell D1, enter the text Random Numbers
Step 2 In cells D2:D2501, enter the formula =RAND()
Step 3 Select the cell range D2:D2501
Step 4 In the Home tab in the Ribbon:
    Click Copy in the Clipboard group
    Click the arrow below Paste in the Clipboard group. When the Paste window appears, click Values in the Paste Values area
    Press the Esc key
Step 5 Select cells A1:D2501
Step 6 In the Data tab on the Ribbon, click Sort in the Sort & Filter group
Step 7 When the Sort dialog box appears:
    Select the check box for My data has headers
    In the first Sort by dropdown menu, select Random Numbers
    [...]

[...] order, and that the employees are not in their original order. For instance, [...]
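
The two-step procedure (assign a random number to every element, then keep the n elements with the smallest numbers) can be sketched as below. The employee IDs are hypothetical stand-ins for the 2,500 rows implied by the range D2:D2501, and the `seed` parameter is an addition that makes the draw repeatable.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Assign a random number to every element, sort by it, and keep the
    n elements with the smallest random numbers."""
    rng = random.Random(seed)
    keyed = [(rng.random(), element) for element in population]  # step 1
    keyed.sort(key=lambda pair: pair[0])                         # order by random number
    return [element for _, element in keyed[:n]]                 # step 2

employees = list(range(1, 2501))        # hypothetical IDs for 2,500 employees
sample = simple_random_sample(employees, 30, seed=42)
print(sample[:5])
```

Fixing the random numbers before sorting plays the same role as the Paste Values step in Excel: otherwise the numbers would be regenerated and the ordering would not be stable.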

Sampling from an Infinite Population

Sometimes we want to select a sample from a population, but the population is infinitely large or the elements of the population are being generated by an ongoing process for which there is no limit on the number of elements that can be generated. Thus, it is not possible to develop a list of all the elements in the population. This is considered the infinite population case. With an infinite population, we cannot select a simple random sample because we cannot construct a frame consisting of all the elements. In the infinite population case, statisticians recommend selecting what is called a random sample.

Care and judgment must be exercised in implementing the selection process for obtaining a random sample from an infinite population. Each case may require a different selection procedure. Let us consider two examples to see what we mean by the conditions: (1) Each element selected comes from the same population, and (2) each element is selected independently. [...]

[...] population, is satisfied. To ensure that this condition is satisfied, the boxes must be selected at approximately the same point in time. This way the inspector avoids the possibility of selecting some boxes when the process is operating properly and other boxes when the process is not operating properly and is underfilling or overfilling the boxes. With a production process such as this, the second condition, each element is selected independently, is satisfied by designing the production process so that each box of cereal is filled independently. With this assumption, the quality-control inspector need only worry about satisfying the same population condition.

As another example of selecting a random sample from an infinite population, consider the population of customers arriving at a fast-food restaurant. Suppose an employee is asked to select and interview a sample of customers in order to develop a profile of customers who [...]

Date posted: 03/05/2024, 08:35
