Cookbook Modeling Data for Marketing_3 pptx

DOCUMENT INFORMATION

Basic information

Format
Pages	29
Size	775.46 KB

Content

The following code uses PROC TABULATE to create the decile analysis, a table that calculates the number of observations (records) in each decile, the average predicted probability per decile, the percent active of responders (target of Method 2, model 2), the response rate (target of Method 2, model 1), and the active rate (target of Method 1).

    title1 "Decile Analysis - Activation Model - One Step";
    title2 "Model Data - Score Selection";
    proc tabulate data=acqmod.mod_dec;
      weight smp_wgt;
      class mod_dec;
      var respond active pred records activ_r;
      table mod_dec='Decile' all='Total',
            records='Prospects'*sum=' '*f=comma10.
            pred='Predicted Probability'*(mean=' '*f=11.5)
            activ_r='Percent Active of Responders'*(mean=' '*f=11.5)
            respond='Percent Respond'*(mean=' '*f=11.5)
            active='Percent Active'*(mean=' '*f=11.5)
            /rts = 9 row=float;
    run;

In Figure 5.5, the decile analysis shows the model's ability to rank order the prospects by their active behavior. To clarify, each prospect's probability of becoming active is considered its rank. The goal of the model is to rank order the prospects so as to bring the true actives to the lowest-numbered decile (decile 0). At first glance we can see that the best decile (0) has 17.5 times as many actives as the worst decile (9). And as we go from decile 0 to decile 9, the percent active value is monotonically decreasing, with a strong decrease in the first three deciles. The only exception is the two deciles in the middle with the same percent active rate. This is not unusual since the model is most powerful in deciles 0 and 9, where it gets the best separation. Overall, the model score does a good job of targeting active accounts. Because the model was built on the data used in Figure 5.5, a better test will be on the validation data set.

Figure 5.5 Decile analysis using model data.

Another consideration is how closely the "Percent Active" matches the "Predicted Probability." The values in these columns for each decile are not as close as they could be.
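The two checks described above (whether the percent active falls steadily from decile 0 to decile 9, and how the best decile compares with the worst) can be sketched in a few lines. This is a Python illustration rather than the SAS used in the text, and the rates below are hypothetical values chosen only to reproduce a 17.5 ratio and one tie in the middle; they are not the numbers in Figure 5.5.

```python
# Hypothetical per-decile "Percent Active" rates, decile 0 first.
# These are illustration values, NOT the figures from Figure 5.5.
percent_active = [0.0350, 0.0210, 0.0150, 0.0120, 0.0100,
                  0.0100, 0.0080, 0.0060, 0.0040, 0.0020]


def is_non_increasing(values):
    """True when no decile has a higher rate than the decile before it."""
    return all(later <= earlier for earlier, later in zip(values, values[1:]))


def best_to_worst_ratio(values):
    """Rate of the first (best) decile divided by the rate of the last (worst)."""
    return values[0] / values[-1]


def tied_neighbors(values):
    """Pairs of adjacent decile numbers whose rates are exactly equal."""
    return [(i, i + 1) for i in range(len(values) - 1)
            if values[i] == values[i + 1]]


monotone = is_non_increasing(percent_active)
ratio = best_to_worst_ratio(percent_active)
ties = tied_neighbors(percent_active)
print(monotone, round(ratio, 1), ties)
```

A tie between neighboring deciles, as in deciles 4 and 5 of this toy data, still passes the non-increasing check; only a rise from one decile to the next would fail it.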
If my sample had been larger, the "Percent Active" and "Predicted Probability" values would probably be closer. I will look for similar behavior in the decile analysis for the validation data set.

Preliminary Evaluation

Because I carried the validation data through the model using the missing weights, each time the model is processed, the validation data set is scored along with the model data. By creating a decile analysis on the validation data set, we can evaluate how well the model will transfer the results to similar data. As mentioned earlier, a model that works well on alternate data is said to be robust. In chapter 6, I will discuss additional methods for validation that go beyond simple decile analysis.

The next code listing creates the same table for the validation data set. This provides our first analysis of the ability of the model to rank order data other than the model development data. It is a good test of the robustness of the model or its ability to perform on other prospect data. The code is the same as for the model data decile analysis except for the (where=( splitwgt = .)) option. This accesses the "hold out" sample or validation data set.
    proc univariate data=acqmod.out_act1 (where=( splitwgt = .)) noprint;
      weight smp_wgt;
      var pred active;
      output out=preddata sumwgt=sumwgt;
    run;

    data acqmod.val_dec;
      set acqmod.out_act1(where=( splitwgt = .));
      if (_n_ eq 1) then set preddata;
      retain sumwgt;
      number+smp_wgt;
      if number < .1*sumwgt then val_dec = 0; else
      if number < .2*sumwgt then val_dec = 1; else
      if number < .3*sumwgt then val_dec = 2; else
      if number < .4*sumwgt then val_dec = 3; else
      if number < .5*sumwgt then val_dec = 4; else
      if number < .6*sumwgt then val_dec = 5; else
      if number < .7*sumwgt then val_dec = 6; else
      if number < .8*sumwgt then val_dec = 7; else
      if number < .9*sumwgt then val_dec = 8; else
      val_dec = 9;
      activ_r = (activate = '1');
    run;

    title1 "Decile Analysis - Activation Model - One Step";
    title2 "Validation Data - Score Selection";
    proc tabulate data=acqmod.val_dec;
      weight smp_wgt;
      class val_dec;
      var respond active pred records activ_r;
      table val_dec='Decile' all='Total',
            records='Prospects'*sum=' '*f=comma10.
            pred='Predicted Probability'*(mean=' '*f=11.5)
            activ_r='Percent Active of Responders'*(mean=' '*f=11.5)
            respond='Percent Respond'*(mean=' '*f=11.5)
            active='Percent Active'*(mean=' '*f=11.5)
            /rts = 9 row=float;
    run;

The validation decile analysis seen in Figure 5.6 shows slight degradation from the original model. This is to be expected. But the rank ordering is still strong, with the best decile attracting almost seven times as many actives as the worst decile. We see the same degree of difference between the "Predicted Probability" and the actual "Percent Active" as we saw in the decile analysis of the model data in Figure 5.5. Decile 0 shows the most dramatic difference, but the other deciles follow a similar pattern to the model data. There is also a little flip-flop going on in deciles 5 and 6, but the degree is minor and probably reflects nuances in the data.

Figure 5.6 Decile analysis using validation data.
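The decile assignment in the data step above works on a running total of sample weights: records are taken in descending score order, and each record's decile depends on which tenth of the total weight the running total has reached. The following is a minimal Python sketch of that idea, not a translation of the text's SAS. It assumes the records must be sorted by descending pred first, and the toy records and the functions assign_deciles and decile_summary are hypothetical.

```python
from collections import defaultdict


def assign_deciles(records):
    """Return the records sorted by descending 'pred' with a 'decile' added.

    Mirrors the running-total rule: a record falls in decile k-1 for the
    first k (1..9) where running weight < k/10 of the total, else decile 9.
    """
    ordered = sorted(records, key=lambda r: r["pred"], reverse=True)
    total_weight = sum(r["smp_wgt"] for r in ordered)
    running = 0.0
    scored = []
    for rec in ordered:
        running += rec["smp_wgt"]
        decile = 9
        for k in range(1, 10):
            if running < k / 10 * total_weight:
                decile = k - 1
                break
        scored.append({**rec, "decile": decile})
    return scored


def decile_summary(scored):
    """Per decile: (sum of weights, weighted mean pred, weighted mean active)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    for rec in scored:
        acc = sums[rec["decile"]]
        acc[0] += rec["smp_wgt"]
        acc[1] += rec["smp_wgt"] * rec["pred"]
        acc[2] += rec["smp_wgt"] * rec["active"]
    return {d: (w, wp / w, wa / w) for d, (w, wp, wa) in sorted(sums.items())}


# Hypothetical toy data: 20 prospects, the first with a smaller sample weight.
records = [
    {"pros_id": i, "pred": (20 - i) / 100, "active": 1 if i < 3 else 0,
     "smp_wgt": 0.5 if i == 0 else 1.0}
    for i in range(20)
]
scored = assign_deciles(records)
summary = decile_summary(scored)
for decile, (weight, mean_pred, mean_active) in summary.items():
    print(decile, weight, round(mean_pred, 4), round(mean_active, 4))
```

Because the comparison is strict, a record whose running weight lands exactly on a boundary goes to the next decile, so very small files can produce uneven decile sizes.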
In chapter 6, I will perform some more general types of validation, which will determine whether the flip-flop in deciles 5 and 6 is a real problem.

Method 2: Two Models - Response

The process for two models is similar to the process for the single model. The only difference is that the response and activation models are processed separately through the stepwise, backward, and score selection methods. The code differences are highlighted here:

    proc logistic data=acqmod.model2 (keep=variables) descending;
      weight splitwgt;
      model respond = variables
            /selection = stepwise sle=.3 sls=.3;
    run;

    proc logistic data=acqmod.model2 (keep=variables) descending;
      weight splitwgt;
      model respond = variables
            /selection = backward sls=.3;
    run;

    proc logistic data=acqmod.model2 (keep=variables) descending;
      weight splitwgt;
      model respond = variables
            /selection = score best=2;
    run;

The output from the Method 2 response models is similar to the Method 1 approach. Figure 5.7 shows the decile analysis for the response model calculated on the validation data set. It shows strong rank ordering for response. The rank ordering for activation is a little weaker, which is to be expected, because there are different drivers for response and activation. Because activation is strongly driven by response, however, the ranking for activation is still strong.

Figure 5.7 Method 2 response model decile analysis.

Method 2: Two Models - Activation

As I process the model for predicting activation given response (active|response), recall that I can use the value activate because it has a value of missing for nonresponders. This means that the nonresponders will be eliminated from the model processing. The following code processes the model:
    proc logistic data=acqmod.model2 (keep=variables) descending;
      weight splitwgt;
      model activate = variables
            /selection = stepwise sle=.3 sls=.3;
    run;

    proc logistic data=acqmod.model2 (keep=variables) descending;
      weight splitwgt;
      model activate = variables
            /selection = backward sls=.3;
    run;

    proc logistic data=acqmod.model2 (keep=variables) descending;
      weight splitwgt;
      model activate = variables
            /selection = score best=2;
    run;

The output from the Method 2 activation models is similar to the Method 1 approach. Figure 5.8 shows the decile analysis for the activation model calculated on the validation data set. It shows strong rank ordering for activation given response. As expected, it is weak when predicting activation for the entire file. Our next step is to compare the results of the two methods.

Figure 5.8 Method 2 activation model decile analysis.

Comparing Method 1 and Method 2

At this point, I have several options for the final model. I have a single model that predicts the probability of an active account that was created using Method 1, the single-model approach. And I have two models from Method 2, one that predicts the probability of response and the other that predicts the probability of an active account, given response (active|response).

To compare the performance between the two methods, I must combine the models developed in Method 2. To do this, I use a simplified form of Bayes' Theorem, the multiplication rule for conditional probabilities. Let's say:

    P(R) = the probability of response (model 1 in Method 2)
    P(A|R) = the probability of becoming active given response (model 2 in Method 2)
    P(A and R) = the probability of responding and becoming active

Then:

    P(A and R) = P(R) * P(A|R)

Therefore, to get the probability of responding and becoming active, I multiply the probabilities created in model 1 and model 2.
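As a small worked illustration of the multiplication above, the following Python sketch joins two sets of scores by prospect identifier and multiplies them. It is not the text's SAS code; the function combine_scores, the identifiers, and the probabilities are all hypothetical, and it assumes a prospect is kept only when both models scored it.

```python
def combine_scores(p_response, p_active_given_response):
    """Return P(A and R) = P(R) * P(A|R) for ids present in both inputs."""
    combined = {}
    for pros_id, p_r in p_response.items():
        if pros_id in p_active_given_response:
            combined[pros_id] = p_r * p_active_given_response[pros_id]
    return combined


# Hypothetical scores keyed by a prospect identifier.
p_response = {101: 0.10, 102: 0.04, 103: 0.20, 104: 0.15}
p_active_given_response = {101: 0.50, 102: 0.25, 103: 0.10}

p_active_and_response = combine_scores(p_response, p_active_given_response)
for pros_id, prob in sorted(p_active_and_response.items()):
    print(pros_id, round(prob, 4))
```

Note that the prospect most likely to respond (103 in this toy data) is not the one most likely to respond and become active, which is why the combined score can rank the file differently from either model alone.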
Following the processing of the score selection for each of the two models in Method 2, I reran the models with the final variables and created two output data sets that contained the predicted scores, acqmod.out_rsp2 and acqmod.out_act2. The following code takes the output data sets from the Method 2 models built using the score option. The where=( splitwgt = .) option specifies that both probabilities are taken from the validation data set. Because the same sample was used to build both models in Method 2, the records should match up exactly when merged together by pros_id. The rename=(pred=predrsp) option gives the predicted probability from each model a different name.

    proc sort data=acqmod.out_rsp2 out=acqmod.validrsp
         (where=( splitwgt = .) rename=(pred=predrsp));
      by pros_id;
    run;

    proc sort data=acqmod.out_act2 out=acqmod.validact
         (where=( splitwgt = .) rename=(pred=predact));
      by pros_id;
    run;

    data acqmod.blend;
      merge acqmod.validrsp acqmod.validact;
      by pros_id;
    run;

    data acqmod.blend;
      set acqmod.blend;
      predact2 = predrsp*predact;
    run;

To compare the models, I create a decile analysis for the probability of becoming active derived using Method 2 (predact2) with the following code:

    proc sort data=acqmod.blend;
      by descending predact2;
    run;

    proc univariate data=acqmod.blend noprint;
      weight smp_wgt;
      var predact2;
      output out=preddata sumwgt=sumwgt;
    run;

    data acqmod.blend;
      set acqmod.blend;
      if (_n_ eq 1) then set preddata;
      retain sumwgt;
      number+smp_wgt;
      if number < .1*sumwgt then act2dec = 0; else
      if number < .2*sumwgt then act2dec = 1; else
      if number < .3*sumwgt then act2dec = 2; else
      if number < .4*sumwgt then act2dec = 3; else
      if number < .5*sumwgt then act2dec = 4; else
      if number < .6*sumwgt then act2dec = 5; else
      if number < .7*sumwgt then act2dec = 6; else
      if number < .8*sumwgt then act2dec = 7; else
      if number < .9*sumwgt then act2dec = 8; else
      act2dec = 9;
    run;

    title1 "Decile Analysis - Model Comparison";
    title2 "Validation Data - Two Step Model";
    proc tabulate data=acqmod.blend;
      weight smp_wgt;
      class act2dec;
      var active predact2 records;
      table act2dec='Decile' all='Total',
            records='Prospects'*sum=' '*f=comma10.
            predact2='Predicted Probability'*(mean=' '*f=11.5)
            active='Percent Active'*(mean=' '*f=11.5)
            /rts = 9 row=float;
    run;

In Figure 5.9, the decile analysis of the combined scores on the validation data for the two-model approach shows a slightly better performance than the one-model approach in Figure 5.6. This provides confidence in our results. At first glance, it's difficult to pick the winner.

Figure 5.9 Combined model decile analysis.

Summary

This chapter allowed us to enjoy the fruits of our labor. I built several models with strong power to rank order the prospects by their propensity to become active. We saw that many of the segmented and transformed variables dominated the models. And we explored several methods for finding the best-fitting model using two distinct methodologies. In the next chapter, I will measure the robustness of our models and select a winner.

Chapter 6: Validating the Model

The masterpiece is out of the oven! Now we want to ensure that it was cooked to perfection. It's time for the taste test!

Validating the model is a critical step in the process. It allows us to determine if we've successfully performed all the prior steps. If a model does not validate well, it can be due to data problems, poorly fitting variables, or problematic techniques. There are several methods for validating models. In this chapter, I begin with the basic tools for validating the model, gains tables and gains charts. Marketers and managers love them because they take the modeling results right to the bottom line. Next, I test the results of the model algorithm on an alternate data set. A major section of the chapter focuses on the steps for creating confidence intervals around the model estimates using resampling.
This is gaining popularity as an excellent method for determining the robustness of a model. In the final section I discuss ways to validate the model by measuring its effect on key market drivers.

Gains Tables and Charts

A gains table is an excellent tool for evaluating the performance of a model. It contains actionable information that can be easily understood and used by non- [...]

... cumulative percent of file for validation portion of the prospect data set. Column C, as in Figure 5.6, is the average Probability of Activation for each decile as defined by the model. Column D, as in Figure 5.6, is the average Percent Actives in the validation data set for each decile. This represents the number of true actives in the validation data set divided by the total prospects for each decile. Column ...

... for creating deciles. In this case, this process is repeated 100 times to create 100 decile values. The value &prcnt increments by 1 during each iteration.

    proc sort data=acqmod.outk&prcnt;
      by descending pred;
    run;

    proc univariate data=acqmod.outk&prcnt noprint;
      weight smp_wgt;
      var pred;
      output out=preddata sumwgt=sumwgt;
    run;

    data acqmod.outk&prcnt;
      set acqmod.outk&prcnt;
      if (_n_ eq 1) then set preddata;
      ...

... (acqmod.fullmean) using the decile value. It then calculates the mean and standard deviation for each estimate: active rate, predicted probability, and lift. Following the formula for bootstrap coefficients, it concludes with calculations for the bootstrap estimates and confidence intervals for the three estimates.

    data acqmod.bs_sum(keep = liftf bsest_p prdmnf lci_p uci_p bsest_a actmnf lci_a uci_a bsest_l ...
... of Colorado. I will concentrate on the Method 1 approach for this validation because the mechanics and results should be similar. The same processing will be performed on the Method 2 combined model for comparison. The first step is to read in the data from the Colorado campaign. The input statement is modified to read in only the necessary variables.

    data acqmod.colorado;
      infile 'F:\insur\acquisit\camp3.txt' ... ;
    run;

The next step is to create the variable transformations needed for the model scoring. The code is a reduced version of the code used in chapter 4 to create the original variables, so it won't be repeated here. I rerun the logistic model on the original (New York) data to create a data set with one observation using outest=acqmod.nyscore. This data set contains the coefficients of the model. The ...

... techniques

Jackknifing

In its purest form, jackknifing is a resampling technique based on the "leave-one-out" principle. So, if N is the total number of observations in the data set, jackknifing calculates the estimates on N different samples, each having N - 1 observations. This works well for small data sets. In model development, though, we are dealing with large data sets that can be cumbersome to ...

... of active, the actual active rate, and the lift for each decile using one hundred 99% samples. I will show the code for this process for the Method 1 model. The process will be repeated for the Method 2 model, and the results will be compared. The program begins with the logistic regression to create an output file (acqmod.resamp) that contains only the validation data and a few key variables. Each record is scored ...
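The large-file variant of jackknifing mentioned above, one hundred samples that each keep 99% of the records, can be sketched as follows. This is a Python illustration under stated assumptions, not the text's SAS program: it assumes each record is randomly assigned to one of 100 equal groups and that each sample omits exactly one group. The function names and the toy data are hypothetical.

```python
import random


def leave_group_out_samples(records, n_groups, seed=0):
    """Build n_groups samples, each omitting one randomly assigned group.

    With n_groups=100 every sample keeps about 99% of the records and
    every record is left out of exactly one sample.
    """
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    return [
        [rec for i, rec in enumerate(shuffled) if i % n_groups != group]
        for group in range(n_groups)
    ]


def weighted_active_rate(sample):
    """Weighted share of active prospects in one sample."""
    total_weight = sum(rec["smp_wgt"] for rec in sample)
    return sum(rec["smp_wgt"] * rec["active"] for rec in sample) / total_weight


# Hypothetical validation file: 1,000 prospects, 10% of them active.
records = [{"active": 1 if i % 10 == 0 else 0, "smp_wgt": 1.0}
           for i in range(1000)]

samples = leave_group_out_samples(records, n_groups=100)
rates = [weighted_active_rate(sample) for sample in samples]
print(len(samples), len(samples[0]), round(min(rates), 4), round(max(rates), 4))
```

The spread of the 100 rates is what the later summary step turns into a standard deviation; in the text the same idea is applied per decile rather than to the whole file.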
... validation data

Figure 6.4 Gains chart comparing models from Method 1 and Method 2.

Scoring Alternate Data Sets

In business, the typical purpose of developing a targeting model is to predict behavior on a data set other than that on which the model was developed. In our case study, I have our "hold out" sample that I used for validation. But because it was randomly sampled from the same data set ...

... values for active rate (actmn&samp) and predicted probability (prdmn&samp) are calculated for each decile. This is repeated and incremented 100 times; 100 output data sets (jkmns&prcnt) are created with the average values.

    proc summary data=acqmod.outk&prcnt;
      var active pred;
      class val_dec;
      weight smp_wgt;
      output out=acqmod.jkmns&prcnt mean=actmn&prcnt prdmn&prcnt;
    run;

To calculate the lift for each ...

    ... 1.96*lftsdjk; confidence interval on lift */

Proc format and proc tabulate create the gains table for validating the model. Figure 6.8 shows the jackknife estimates, along with the confidence intervals for the predicted probability of active, the actual active rate, and the lift.

    proc format;
      picture perc low-high = '009.999%' (mult=1000000);

    proc tabulate data=acqmod.jk_sum;
      var prdmjk lci_p uci_p actmjk ...

    ... proc sort data=acqmod.validco;
      by descending estimate;
    run;

    proc univariate data=acqmod.validco noprint;
      weight smp_wgt;
      var estimate;
      output out=preddata sumwgt=sumwgt;
    run;

    data ...
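The jackknife fragments above mention a confidence interval on lift built from 1.96 times a standard deviation (lftsdjk). A minimal Python sketch of that kind of interval follows. It is an assumption-laden illustration, not the text's program: it assumes the point estimate is the mean of the resampled values and that the interval is the estimate plus or minus 1.96 sample standard deviations. The function jackknife_interval and the lift values are hypothetical.

```python
import statistics


def jackknife_interval(estimates, z=1.96):
    """Return (estimate, lower, upper) from a list of resampled estimates.

    Assumes: estimate = mean of the resampled values, and the interval is
    estimate +/- z * sample standard deviation of those values.
    """
    center = statistics.mean(estimates)
    spread = statistics.stdev(estimates)
    return center, center - z * spread, center + z * spread


# Hypothetical lift values for one decile from five resamples
# (the text uses 100 resamples; five keeps the arithmetic easy to follow).
lifts = [310.0, 305.0, 315.0, 308.0, 312.0]

estimate, lower, upper = jackknife_interval(lifts)
print(round(estimate, 2), round(lower, 2), round(upper, 2))
```

A narrow interval for a decile suggests that its lift is stable across resamples; wide or overlapping intervals between neighboring deciles point to weaker separation.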

Date posted: 21/06/2014, 21:20
