Cleaning the Data

I now have the complete data set for modeling. The next step is to examine the data for errors, outliers, and missing values. This is the most time-consuming, least exciting, and most important step in the data preparation process. Luckily, there are some effective techniques for managing it. First, I describe some techniques for cleaning and repairing data for continuous variables. Then I repeat the process for categorical variables.

Continuous Variables

To perform data hygiene on continuous variables, PROC UNIVARIATE is a useful procedure. It provides a great deal of information about the distribution of a variable, including measures of central tendency, measures of spread, and the skewness, or degree of imbalance, of the data. For example, the following code produces the output for examining the variable estimated income (inc_est):

   proc univariate data=acqmod.model plot;
   weight smp_wgt;
   var inc_est;
   run;

The results from PROC UNIVARIATE for estimated income (inc_est) are shown in Figure 3.4. The values are in thousands of dollars. There is a lot of information in this univariate analysis. I just look for a few key things. Notice the measures in bold. In the moments section, the mean seems reasonable at $61.39224. But looking a little further, I detect some data issues. Notice that the highest value in the extreme values is 660.

In Figure 3.5, the histogram and box plot provide a good visual analysis of the overall distribution and the extreme value. I get another view of this one value. In the histogram, the bulk of the observations are near the bottom of the graph with the single high value near the top. The box plot also shows the limited range for the bulk of the data. The box area represents the central 50% of the data. The distance to the extreme value is very apparent. This point may be considered an outlier.
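For readers working outside SAS, the same first-pass checks (mean, extremes, skewness) can be sketched in Python with pandas. The sample values below are hypothetical stand-ins for inc_est, including one suspicious extreme, not the case-study data:

```python
import pandas as pd

# Hypothetical stand-in for inc_est (in thousands), with one suspicious
# extreme value (660) and one missing entry, mirroring the issues above.
inc_est = pd.Series([55, 62, 58, 66, 61, 59, 660, None, 63, 57], name="inc_est")

summary = inc_est.describe()    # n, mean, std, quartiles, min, max
skewness = inc_est.skew()       # degree of imbalance in the distribution
extremes = inc_est.nlargest(3)  # the highest values, where outliers surface

print(summary["max"], round(skewness, 2))
```

A large positive skew together with a maximum far above the 75th percentile is the same signal the univariate output and box plot give here.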
Outliers and Data Errors

An outlier is a single or low-frequency occurrence of a value of a variable that is far from the mean as well as from the majority of the other values for that variable. Determining whether a value is an outlier or a data error is an art as well as a science. Having an intimate knowledge of your data is your best strength.

Figure 3.4 Initial univariate analysis of estimated income.

Figure 3.5 Histogram and box plot of estimated income.

Common sense and good logic will lead you to most of the problems. In our example, the one value that seems questionable is the maximum value (660). It could have an extra zero. One way to see if it is a data error is to look at some other values in the record. The variable estimated income group serves as a check for the value. The following code prints the record:

   proc print data=acqmod.model(where=(inc_est=660));
   var inc_grp;
   run;

The following output shows the estimated income group is "K":

   OBS    INC_GRP
   85206  K

Based on the information provided with the data, I know the range of incomes in group K to be between $65,000 and $69,000. This leads us to believe that the value 660 should be 66. I can verify this by running a PROC MEANS for the remaining records in group K:

   proc means data=acqmod.model maxdec=2;
   where inc_grp = 'K' and inc_est ^= 660;
   var inc_est;
   run;

The following SAS output validates our suspicion. All the other prospects with estimated income group K have estimated income values between 65 and 69.

   Analysis Variable : INC_EST (K)

   N     Mean   Std Dev  Minimum  Maximum
   4948  66.98  1.41     65.00    69.00

Here I replace the value 660 with 66. When substituting a new value for a questionable one, it is always a good idea to create a new variable name. This maintains the integrity of the original data.
   data acqmod.model;
   set acqmod.model;
   if inc_est = 660 then inc_est2 = 66;
   else inc_est2 = inc_est;
   run;

In Figure 3.6, we see the change in the histogram and the box plot as a result of correcting the outlier. The distribution is still centered near a lower range of values, but the skewness is greatly decreased.

Figure 3.6 Histogram and box plot of estimated income with corrections.

If you have hundreds of variables, you may not want to spend a lot of time on each variable with missing or incorrect values. Time-consuming correction techniques should be used sparingly. If you find an error and the fix is not obvious, you can treat it as a missing value.

Outliers are common in numeric data, especially when dealing with monetary variables. Another method for dealing with outliers is to develop a capping rule. This can be accomplished easily using some features in PROC UNIVARIATE. The following code produces an output data set with the standard deviation (incstd) and the 99th percentile value (inc99) for estimated income (inc_est):

   proc univariate data=acqmod.model noprint;
   weight smp_wgt;
   var inc_est;
   output out=incdata std=incstd pctlpts=99 pctlpre=inc;
   run;

   data acqmod.model;
   set acqmod.model;
   if (_n_ eq 1) then set incdata(keep=incstd inc99);
   if incstd > 2*inc99 then inc_est2 = min(inc_est,(4*inc99));
   else inc_est2 = inc_est;
   run;

The code in bold is just one example of a rule for capping the values of a variable. It looks at the spread by seeing if the standard deviation is greater than twice the value at the 99th percentile. If it is, it caps the value at four times the 99th percentile. This still allows for a generous spread without letting in obvious outliers. This particular rule works only for variables with positive values. Depending on your data, you can vary the rules to suit your goals.

Missing Values

As information is gathered and combined, missing values are present in almost every data set.
Many software packages ignore records with missing values, which makes them a nuisance. The fact that a value is missing, however, can itself be predictive. It is important to capture that information.

Consider the direct mail company that had its customer file appended with data from an outside list. Almost a third of its customers didn't match to the outside list. At first this was perceived as negative. But it turned out that these customers were much more responsive to offers for additional products. After further analysis, it was discovered that these customers were not on many outside lists. This made them more responsive because they were not receiving many direct mail offers from other companies. Capturing the fact that they had missing values improved the targeting model.

In our case study, we saw in the univariate analysis that we have 84 missing values for income. The first step is to create an indicator variable to capture the fact that the value is missing for certain records. The following code creates a variable to capture the information:

   data acqmod.model;
   set acqmod.model;
   if inc_est2 = . then inc_miss = 1;
   else inc_miss = 0;
   run;

The goal for replacing missing values is twofold: to fill the space with the most likely value and to maintain the overall distribution of the variable.

Single Value Substitution

Single value substitution is the simplest method for replacing missing values. There are three common choices: the mean, the median, and the mode. The mean value is based on the statistical least-square-error calculation. This introduces the least variance into the distribution. If the distribution is highly skewed, the median may be a better choice. The following code substitutes the mean value for estimated income (inc_est2):

   data acqmod.model;
   set acqmod.model;
   if inc_est2 = .
   then inc_est3 = 61;
   else inc_est3 = inc_est2;
   run;

Class Mean Substitution

Class mean substitution uses the mean values within subgroups of other variables or combinations of variables. This method maintains more of the original distribution. The first step is to select one or two variables that may be highly correlated with income. Two variables that should be highly correlated with income are home equity (hom_equ) and inferred age (infd_ag). The goal is to get the average estimated income for cross-sections of home equity ranges and age ranges for observations where estimated income is not missing. Because both variables are continuous, a data step is used to create the group variables, age_grp and homeq_r. PROC TABULATE is used to derive and display the values.

   data acqmod.model;
   set acqmod.model;
   if 25 <= infd_ag <= 34 then age_grp = '25-34';
   else if 35 <= infd_ag <= 44 then age_grp = '35-44';
   else if 45 <= infd_ag <= 54 then age_grp = '45-54';
   else if 55 <= infd_ag <= 65 then age_grp = '55-65';
   if 0 <= hom_equ <= 100000 then homeq_r = '$0-$100K';
   else if 100000 < hom_equ <= 200000 then homeq_r = '$100-$200K';
   else if 200000 < hom_equ <= 300000 then homeq_r = '$200-$300K';
   else if 300000 < hom_equ <= 400000 then homeq_r = '$300-$400K';
   else if 400000 < hom_equ <= 500000 then homeq_r = '$400-$500K';
   else if 500000 < hom_equ <= 600000 then homeq_r = '$500-$600K';
   else if 600000 < hom_equ <= 700000 then homeq_r = '$600-$700K';
   else if 700000 < hom_equ then homeq_r = '$700K+';
   run;

   proc tabulate data=acqmod.model;
   where inc_est2 ^= .;
   weight smp_wgt;
   class homeq_r age_grp;
   var inc_est2;
   table homeq_r='Home Equity', age_grp='Age Group'*inc_est2=' '*mean=' '*f=dollar6. /rts=13;
   run;

The output in Figure 3.7 shows a strong variation in average income among the different combinations of home equity and age group. Using these values for missing value substitution will help to maintain the distribution of the data.

Figure 3.7 Values for class mean substitution.
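As a cross-check outside SAS, the same cell means can be derived and applied with pandas. This is a minimal sketch on hypothetical miniature data, not the case-study file:

```python
import pandas as pd

# Miniature hypothetical version of the case-study data: income is missing
# for one record in each (age group, home equity range) cell.
df = pd.DataFrame({
    "age_grp":  ["25-34", "25-34", "25-34", "35-44", "35-44", "35-44"],
    "homeq_r":  ["$0-$100K"] * 6,
    "inc_est2": [45.0, 49.0, None, 53.0, 57.0, None],
})

# Mean income per cell, computed from the non-missing records, then used
# to fill the missing values (class mean substitution).
cell_mean = df.groupby(["age_grp", "homeq_r"])["inc_est2"].transform("mean")
df["inc_est3"] = df["inc_est2"].fillna(cell_mean)
```

Each missing record receives the average of the records that share its age group and home-equity range, which is exactly what the tabulated values are used for below.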
The final step is to develop an algorithm that will create a new estimated income variable (inc_est3) that has no missing values.

   data acqmod.model;
   set acqmod.model;
   if inc_est2 = . then do;
   if 25 <= infd_ag <= 34 then do;
   if 0 <= hom_equ <= 100000 then inc_est3 = 47;
   else if 100000 < hom_equ <= 200000 then inc_est3 = 70;
   else if 200000 < hom_equ <= 300000 then inc_est3 = 66;
   else if 300000 < hom_equ <= 400000 then inc_est3 = 70;
   else if 400000 < hom_equ <= 500000 then inc_est3 = 89;
   else if 500000 < hom_equ <= 600000 then inc_est3 = 98;
   else if 600000 < hom_equ <= 700000 then inc_est3 = 91;
   else if 700000 < hom_equ then inc_est3 = 71;
   end;
   else if 35 <= infd_ag <= 44 then do;
   if 0 <= hom_equ <= 100000 then inc_est3 = 55;
   else if 100000 < hom_equ <= 200000 then inc_est3 = 73;
   else " " " " " " " "
   if 700000 < hom_equ then inc_est3 = 101;
   end;
   else if 45 <= infd_ag <= 54 then do;
   if 0 <= hom_equ <= 100000 then inc_est3 = 57;
   else if 100000 < hom_equ <= 200000 then inc_est3 = 72;
   else " " " " " " " "
   if 700000 < hom_equ then inc_est3 = 110;
   end;
   else if 55 <= infd_ag <= 65 then do;
   if 0 <= hom_equ <= 100000 then inc_est3 = 55;
   else if 100000 < hom_equ <= 200000 then inc_est3 = 68;
   else " " " " " " " "
   if 700000 < hom_equ then inc_est3 = 107;
   end;
   end;
   run;

Regression Substitution

Similar to class mean substitution, regression substitution uses the means within subgroups of other variables. The advantage of regression is the ability to use continuous variables, as well as to look at many variables, for a more precise measurement. The resulting regression score is used to impute the replacement value. In our case study, I derive values for estimated income (inc_est2) using the continuous form of age (infd_ag), the mean for each category of home equity (hom_equ), total line of credit (credlin), and total credit balances (tot_bal). The following code performs a regression analysis and creates an output data set (reg_out) with the predictive coefficients.
   proc reg data=acqmod.model outest=reg_out;
   weight smp_wgt;
   inc_reg: model inc_est2 = infd_ag hom_equ credlin tot_bal /
   selection = backward;
   run;

Figure 3.8 shows a portion of the regression output. The parameter estimates are saved in the data set (reg_out) and used in PROC SCORE. The following code is used to score the data to create inc_reg, the substitute value for income. A new data set, acqmod.model2, is created to serve as a backup.

   proc score data=acqmod.model score=reg_out out=acqmod.model2 type=parms predict;
   var infd_ag hom_equ credlin tot_bal;
   run;

The following code creates inc_est3 using the regression value:

   data acqmod.model2;
   set acqmod.model2;
   if inc_est2 = . then inc_est3 = inc_reg;
   else inc_est3 = inc_est2;
   run;

Figure 3.8 Output for regression substitution.

One of the benefits of regression substitution is its ability to sustain the overall distribution of the data. To measure the effect on the spread of the data, I look at a PROC MEANS for the variable before (inc_est2) and after (inc_est3) the substitution:

   proc means data=acqmod.model2 n nmiss mean std min max;
   weight smp_wgt;
   var inc_est2 inc_est3;
   run;

I see in Figure 3.9 that the distribution is almost identical for both variables. The 84 values from the regression (inc_reg) that replaced the missing values in inc_est2 are a good match for the distribution. These actions can be performed on all of the continuous variables. See Appendix A for univariate analysis of the remaining continuous variables. Now I must examine the quality of the categorical variables.

[...]
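The logic of regression substitution, fitting on the complete records and scoring the incomplete ones, can be sketched in Python with ordinary least squares. The data here is synthetic, and the two predictors simply stand in for variables like infd_ag and hom_equ:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.uniform(0.0, 1.0, size=(n, 2))                      # stand-in predictors
y = 40 + 30 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 1, n)  # synthetic "income"

missing = np.zeros(n, dtype=bool)
missing[:10] = True                        # pretend 10 incomes are missing

# Fit ordinary least squares on the complete records only.
A = np.column_stack([np.ones(n), X])       # add intercept column
coef, *_ = np.linalg.lstsq(A[~missing], y[~missing], rcond=None)

# Score the incomplete records with the fitted coefficients.
y_imputed = y.copy()
y_imputed[missing] = A[missing] @ coef
```

Because the imputed values come from the fitted surface rather than a single constant, the overall distribution of the variable is largely preserved, which is the property checked with PROC MEANS above.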
... validity of the data. Now that I have the ingredients, that is, the data, I am ready to start preparing it for modeling. In chapter 4, I use some interesting techniques to select the final candidate variables. I also find the form or forms of each variable that maximize the predictive power of the model.

Chapter 4 — Selecting and Transforming the Variables

At this point in the process, the data has been [...] refined for analysis. The next step is to define the goal in technical terms. For our case study, the objective is to build a net present value (NPV) model for a direct mail life insurance campaign. In this chapter, I will describe the components of NPV and detail the plan for developing the model. Once the goal has been defined, the next step is to find a group of candidate variables that show potential for [...]

Marketing Expense

The marketing expense for this product is $.78. This is a combination of the cost of the mail piece, $.45, postage of $.23 per piece, and $.10 for processing.

Deriving Variables

Once the data is deemed correct and missing values have been handled, the next step is to look for opportunities to derive new variables. This is a situation where knowledge of the data and the customer is critical [...] with the data and the industry is so valuable. Once you've extracted, formatted, and created all eligible variables, it's time to narrow the field to a few strong contenders.

Continuous Variables

If you have fewer than 50 variables to start, you may not need to reduce the number of variables for final eligibility in the model. As the amount of data being collected continues to grow, the need for variable [...]
Value = P(Activation) × Risk Index × Product Profitability − Marketing Expense

For our case study, I have specific methods and values for these measures.

Probability of Activation

To calculate the probability of activation, I have two options: build one model that predicts activation, or build two models, one for response and one for activation given response (a model built on just the responders, to target actives). [...]

[...] frequency, so I can see the number of missing values. The data dictionary states that the correct values for pop_den are A, B, and C. I presume that the value P is an error. I have a couple of choices to remedy the situation: I can delete it or replace it. For the purposes of this case study, I give it the value of the mode, which is C.

Missing Values

When modeling with nonnumeric (categorical) variables, the [...]

[...] huge amounts of data are generated. Some common methods include addition, subtraction, and averaging. Consider the amount of data in an active credit card transaction file. Daily processing includes purchases, returns, fees, and interchange income. Interchange income is the revenue that credit card banks collect from retailers for processing payments through their system. To make use of this information, it [...]

[...] crl_rat:

   data acqmod.model2;
   set acqmod.model2;
   if age_fil2 > 0 then crl_rat = credlin2/age_fil2;
   else crl_rat = 0;
   run;

Dates

Dates are found in almost every data set. They can be very predictive as time measures or used in combination with other dates or different types of variables. In order to use them, it is necessary to get them into a format that supports "date math." Date math is the ability to perform [...]
an additional category. This will be covered in greater detail in chapter 4. See Appendix B for simple frequencies of the remaining categorical variables.

Figure 3.10 Frequency of population density.

Summary

In this chapter I demonstrated the process of getting data from its raw form to useful pieces of information. The process uses some techniques ranging from simple graphs to complex univariate [...]

[...] The p-value represents the probability that the event occurred by chance. The chi-square statistic is the underlying test for many modeling procedures, including logistic regression and certain classification trees. [...]

Method 2: Two Models

   title2 "Modeling Response";
   proc logistic data=acqmod.model2 descending;
   weight smp_wgt;
   model respond = inc_est3 inc_miss infd_ag2 hom_equ2 tot_acc2 actopl62 tot_bal2 [...]

[...] methodology for discounting in chapter 12.

Table 4.1 Risk Matrix

   AGE     MALE                                   FEMALE
           MARRIED  SINGLE  DIVORCED  WIDOWED    MARRIED  SINGLE  DIVORCED
   < 40    1.09     1.06    1.04      1.01       1.14     1.10    1.07
   40-49   1.01     1.02    0.96      0.95       1.04     1.07    1.01
   50-59   0.89     0.83    0.81      0.78       0.97     0.99    0.95
   60+     0.75     0.65    0.72      0.70       0.94     0.89    0.84

[...] change:

   data ccbank.dailyact;
   set ccbank.dailyact;
   janpurch = sum(of pur0101-pur0131);   /* Summarize daily purchases */
   febpurch = sum(of pur0201-pur0228);
   ...
   decpurch = sum(of pur1201-pur1231);
   run;
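Pulling these components together, a quick worked example of the value equation can be sketched in Python. The marketing expense ($.78) and the risk matrix entry (1.09 for a married male under 40) come from this chapter; the activation probability and product profitability figures are purely hypothetical illustrations:

```python
# Value = P(Activation) x Risk Index x Product Profitability - Marketing Expense
p_activation = 0.02            # hypothetical modeled probability of activation
risk_index = 1.09              # Table 4.1: married male, under 40
product_profitability = 85.0   # hypothetical NPV (in dollars) if activated
marketing_expense = 0.78       # $.45 mail piece + $.23 postage + $.10 processing

value = p_activation * risk_index * product_profitability - marketing_expense
print(round(value, 4))
```

A prospect is worth mailing when this expected value is positive; with these illustrative inputs the expected profit per piece is about $1.07.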