Figure 9.9 Validation decile analysis.
Figure 9.10 Validation gains table with lift.
Figure 9.11 Validation gains chart.

The following data step appends the overall mean values to every record:

data ch09.bs_all;
  set ch09.bs_all;
  if (_n_ eq 1) then set preddata;   /* read the overall means from preddata once */
  retain sumwgt rspmean salmean;     /* keep them on every record */
run;

PROC SUMMARY creates mean values of respond (rspmnf) and 12-month sales (salmnf) for each decile (val_dec):

proc summary data=ch09.bs_all;
  var respond sale12mo;
  class val_dec;                     /* means by decile plus an overall row (val_dec=.) */
  output out=ch09.fullmean mean= rspmnf salmnf;
run;

The next data step uses the output from PROC SUMMARY to create a separate data set (salfmean) with the two overall mean values renamed. The overall mean values are stored in the observation where val_dec has a missing value (val_dec = .). These will be used in the final bootstrap calculation:

data salfmean(rename=(salmnf=salomn_g rspmnf=rspomn_g) drop=val_dec);
  set ch09.fullmean(where=(val_dec=.) keep=salmnf rspmnf val_dec);
  smp_wgt=1;
run;

In the next data step, the overall means are appended to every record of the data set ch09.fullmean. This will be accessed in the final calculations following the macro.

data ch09.fullmean;
  set ch09.fullmean;
  if (_n_ eq 1) then set salfmean;
  retain salomn_g rspomn_g;
run;

The bootstrapping program is identical to the one in chapter 6 up to the point where the estimates are calculated. The following data step merges all the bootstrap samples and calculates the bootstrap estimates:

data ch09.bs_sum(keep=liftf bsest_r rspmnf lci_r uci_r bsest_s salmnf lci_s
                 uci_s bsest_l lftmbs lci_l uci_l val_dec salomn_g);
  merge ch09.bsmns1 ch09.bsmns2 ch09.bsmns3 ch09.bsmns4 ch09.bsmns5
        ch09.bsmns6 ch09.bsmns7 ch09.bsmns8 ch09.bsmns9 ch09.bsmns10
        ch09.bsmns11 ch09.bsmns12 ch09.bsmns13 ch09.bsmns14 ch09.bsmns15
        ch09.bsmns16 ch09.bsmns17 ch09.bsmns18 ch09.bsmns19 ch09.bsmns20
        ch09.bsmns21 ch09.bsmns22 ch09.bsmns23 ch09.bsmns24 ch09.bsmns25
        ch09.fullmean;
  by val_dec;
  rspmbs  = mean(of rspmn1-rspmn25);   /* mean of response        */
  rspsdbs = std(of rspmn1-rspmn25);    /* st dev of response      */
  salmbs  = mean(of salmn1-salmn25);   /* mean of sales           */
  salsdbs = std(of salmn1-salmn25);    /* st dev of sales         */
  lftmbs  = mean(of liftd1-liftd25);   /* mean of lift            */
  lftsdbs = std(of liftd1-liftd25);    /* st dev of lift          */
  liftf   = 100*salmnf/salomn_g;       /* overall lift for sales  */
  bsest_r = 2*rspmnf - rspmbs;         /* bootstrap est - response */
  lci_r   = bsest_r - 1.96*rspsdbs;    /* lower conf interval     */
  uci_r   = bsest_r + 1.96*rspsdbs;    /* upper conf interval     */
  bsest_s = 2*salmnf - salmbs;         /* bootstrap est - sales   */
  lci_s   = bsest_s - 1.96*salsdbs;    /* lower conf interval     */
  uci_s   = bsest_s + 1.96*salsdbs;    /* upper conf interval     */
  bsest_l = 2*liftf - lftmbs;          /* bootstrap est - lift    */
  lci_l   = bsest_l - 1.96*lftsdbs;    /* lower conf interval     */
  uci_l   = bsest_l + 1.96*lftsdbs;    /* upper conf interval     */
run;
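Restated as formulas, the calculation in this data step is the standard bias-corrected bootstrap estimate with a normal-approximation confidence interval. For a given decile and measure, let $\hat{\theta}$ be the actual validation value (e.g., rspmnf) and $\theta^{*}_{1},\ldots,\theta^{*}_{25}$ the 25 bootstrap-sample means:

$$\bar{\theta}^{*} = \frac{1}{25}\sum_{i=1}^{25}\theta^{*}_{i}, \qquad s^{*} = \operatorname{std}\!\left(\theta^{*}_{1},\ldots,\theta^{*}_{25}\right)$$

$$\hat{\theta}_{BS} = 2\hat{\theta} - \bar{\theta}^{*}, \qquad \text{95\% CI} = \hat{\theta}_{BS} \pm 1.96\, s^{*}$$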
Finally, I use PROC TABULATE to display the bootstrap and confidence interval values by decile:

proc tabulate data=ch09.bs_sum;
  var liftf bsest_r rspmnf lci_r uci_r bsest_s salmnf lci_s uci_s
      bsest_l lftmbs lci_l uci_l;
  class val_dec;
  table (val_dec='Decile' all='Total'),
        (rspmnf ='Actual Resp'*mean=' '*f=percent6.
         bsest_r='BS Est Resp'*mean=' '*f=percent6.
         lci_r  ='BS Lower CI Resp'*mean=' '*f=percent6.
         uci_r  ='BS Upper CI Resp'*mean=' '*f=percent6.
         salmnf ='12-Month Sales'*mean=' '*f=dollar8.
         bsest_s='BS Est Sales'*mean=' '*f=dollar8.
         lci_s  ='BS Lower CI Sales'*mean=' '*f=dollar8.
         uci_s  ='BS Upper CI Sales'*mean=' '*f=dollar8.
         liftf  ='Sales Lift'*mean=' '*f=6.
         bsest_l='BS Est Lift'*mean=' '*f=6.
         lci_l  ='BS Lower CI Lift'*mean=' '*f=6.
         uci_l  ='BS Upper CI Lift'*mean=' '*f=6.)
        /rts=10 row=float;
run;

Figure 9.12 Bootstrap analysis.

The results of the bootstrap analysis give me confidence that the model is stable. Notice how the confidence intervals are fairly tight, even in the best decile. And the bootstrap estimates are very close to the actual values, which provides additional security. Keep in mind that these estimates are not based on actual behavior but rather on a propensity toward a type of behavior. They will, however, provide a substantial improvement over random selection.

Implementing the Model

In this case, the same file containing the score will be used for marketing. The marketing manager at Downing Office Products now has a robust model that can be used to solicit businesses that have the highest propensity to buy the company's products. The ability to rank the entire business list also creates other opportunities for Downing. It is now prepared to prioritize sales efforts to maximize its marketing dollar. The top-scoring businesses (deciles 7-9) are targeted to receive a personal sales call. The middle group (deciles 4-6) is targeted to receive several telemarketing solicitations. And the lowest group (deciles 0-3) will receive a postcard directing potential customers to the company's Web site. This is expected to provide a substantial improvement in yearly sales.
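As a minimal sketch of how this decile-based treatment assignment might be coded against the scored file, the logic is a simple set of cutoffs. The data set name downing_scored and the variable channel are hypothetical; val_dec is assumed to be the decile variable created earlier, numbered so that deciles 7-9 are the top-scoring group, as described above:

* Sketch only: channel groupings mirror the plan described above;
* downing_scored and channel are assumed names, not from the original code;
data downing_contact;
  set downing_scored;                                   /* scored business list with val_dec */
  length channel $14;
  if val_dec >= 7 then channel = 'SALES CALL';          /* deciles 7-9 */
  else if val_dec >= 4 then channel = 'TELEMARKETING';  /* deciles 4-6 */
  else channel = 'POSTCARD';                            /* deciles 0-3 */
run;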
Summary

Isn't it amazing how the creative use of weights can cause those high spenders to rise to the top? This case study is an excellent example of how well this weighting technique works. You just have to remember that the estimated probabilities are not accurate predictors. But the ability of the model to rank the file from most profitable to least profitable prospects is superior to modeling without weights. In addition, the mechanics of working with business data are identical to those of working with individual and household data.

Response models are the most widely used and work for almost any industry. From banks and insurance companies selling their products to phone companies and resorts selling their services, the simplest response model can improve targeting and cut costs. Whether you're targeting individuals, families, or businesses, the rules are the same: clear objective, proper data preparation, linear predictors, rigorous processing, and thorough validation. In our next chapter, we try another recipe. We're going to predict which prospects are more likely to be financially risky.

Chapter 10
Avoiding High-Risk Customers: Modeling Risk

Most businesses are interested in knowing who will respond, activate, purchase, or use their services. As we saw in our case study in part 2, many companies need to manage another major component of the profitability equation, one that does not involve purchasing or using products or services. These businesses are concerned with the amount of risk they are taking by accepting someone as a customer. Our case study in part 2 incorporated the effect of risk on overall profitability for life insurance. Banks assume risk through loans and credit cards, but other businesses, such as utilities and telcos, also assume risk by providing products and services on credit. Virtually any company delivering a product or service with the promise of future payment takes a financial risk. In this chapter, I start off with a description of credit scoring, its origin, and how it has evolved into risk modeling. Then I begin the case study, in which I build a model that predicts risk by targeting failure to pay on a credit-based purchase for the telecommunications or telco industry. (This is also known as an approval model.) As in chapter 9, I define the objective, prepare the variables, and process and validate the model. You will see some similarities in the processes, but there are also some notable differences due to the nature of the data. Finally, I wrap up the chapter with a brief discussion of fraud modeling and how it's being used to reduce losses in many industries.

Credit Scoring and Risk Modeling

If you've ever applied for a loan, I'm sure you're familiar with questions like "Do you own or rent?" "How long have you lived at your current address?" and "How many years have you been with your current employer?" The answers to these questions, and more, are used to calculate your credit score. Based on your answers (each of which is assigned a value), your score is summed and evaluated. Historically, this method has been very effective in helping companies determine credit worthiness.
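Conceptually, a scorecard of this kind just adds up points assigned to each answer and compares the total with a cutoff. The sketch below is purely illustrative: the point values, variable names (own_home, yrs_at_addr, yrs_employed), and input data set (applicants) are hypothetical and are not the actual factors or weights used by any bureau or vendor.

* Illustrative scorecard only -- hypothetical point values and variables;
data scored_apps;
  set applicants;                                          /* assumed input: one record per applicant */
  credscore = 0;
  if own_home = 1 then credscore = credscore + 25;         /* owns vs. rents */
  else credscore = credscore + 10;
  if yrs_at_addr >= 5 then credscore = credscore + 20;     /* residence stability */
  else if yrs_at_addr >= 2 then credscore = credscore + 10;
  if yrs_employed >= 5 then credscore = credscore + 20;    /* employment stability */
  else if yrs_employed >= 2 then credscore = credscore + 10;
run;

An applicant's total would then be compared with a cutoff to approve or decline credit.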
Credit scoring began in the early sixties, when Fair, Isaac and Company developed the first simple scoring algorithm based on a few key factors. Until that time, decisions to grant credit were based primarily on judgment. Some companies were resistant to embracing a score to determine credit worthiness. As the scores proved to be predictive, more and more companies began to use them. As a result of increased computer power, more available data, and advances in technology, tools for predicting credit risk have become much more sophisticated. This has led to complex credit-scoring algorithms that can consider and utilize many different factors. Through these advances, risk scoring has evolved from a simple algorithm based on a few factors to the sophisticated scoring algorithms we see today. Over the years, Fair, Isaac scores have become a standard in the industry. While its methodology has been closely guarded, the company recently published the components of its credit-scoring algorithm. Its score is based on the following elements:

Past payment history
• Account payment information on specific types of accounts (e.g., credit cards, retail accounts, installment loans, finance company accounts, mortgage)
• Presence of adverse public records (e.g., bankruptcy, judgments, suits, liens, wage attachments), collection items, and/or delinquency (past due items)
• Severity of delinquency (how long past due)
• Amount past due on delinquent accounts or collection items
• Time since (recency of) past due items (delinquency), adverse public records (if any), or collection items (if any)
• Number of past due items on file
• Number of accounts paid as agreed

Amount of credit owing
• Amount owing on accounts
• Amount owing on specific types of accounts
• Lack of a specific type of balance, in some cases
• Number of accounts with balances
• Proportion of credit lines used (proportion of balances to total credit limits on certain types of revolving accounts)
• Proportion of installment loan amounts still owing (proportion of balance to original loan amount on certain types of installment loans)

Length of time credit established
• Time since accounts opened
• Time since accounts opened, by specific type of account
• Time since account activity

Search for and acquisition of new credit
• Number of recently opened accounts, and proportion of accounts that are recently opened, by type of account
• Number of recent credit inquiries
• Time since recent account opening(s), by type of account
• Time since credit inquiry(s)
• Reestablishment of positive credit history following past payment problems

Types of credit established
• Number of (presence, prevalence, and recent information on) various types of accounts (credit cards, retail accounts, installment loans, mortgage, consumer finance accounts, etc.)

Over the past decade, numerous companies have begun developing their own risk scores to sell or for personal use. In this case study, I will develop a risk score that is very similar to those available on the market. I will test the final scoring algorithm against a generic risk score that I obtained from the credit bureau.

Defining the Objective

Eastern Telecom has just formed an alliance with First Reserve Bank to sell products and services. Initially, Eastern wishes to offer cellular phones and phone services to First Reserve's customer base. Eastern plans to use statement inserts to promote its products and services, so marketing costs are relatively small. Its main concern at this point is managing risk. Since payment behavior for a loan product is highly correlated with payment behavior for a product or service, Eastern plans to use the bank's data to predict financial risk over a three-year period.

To determine the level of risk for each customer, Eastern Telecom has decided to develop a model that predicts the probability of a customer becoming 90+ days past due or defaulting on a loan within a three-year period. To develop a modeling data set, Eastern took a sample of First Reserve's loan customers. From the customers that were current 36 months ago, Eastern selected all the customers now considered high risk or in default and a sample of those customers who were still current and considered low risk. A high-risk customer was defined as any customer who was 90 days or more behind on a loan with First Reserve Bank. This included all bankruptcies and charge-offs.
Eastern created three data fields to define a high-risk customer: bkruptcy to denote whether the customer went bankrupt, chargoff to denote whether the customer was charged off, and dayspdue, a numeric field detailing the days past due. A file containing name, address, social security number, and a match key (idnum) was sent to the credit bureau for a data overlay. Eastern requested that the bureau pull 300+ variables from an archive of 36 months ago and append the information to the customer file. It also purchased a generic risk score that was developed by an outside source. The file was returned and matched to the original extract to combine the 300+ predictive variables with the three data fields.

The following code takes the combined file and creates the modeling data set. The first step defines the dependent variable, highrisk. The second step samples and defines the weight, smp_wgt; it creates two temporary data sets, hr and lr, that are brought together in the final step to create the data set ch10.telco:

data ch10.creddata;
  set ch10.creddata;
  if bkruptcy = 1 or chargoff = 1 or dayspdue >= 90
    then highrisk = 1;
  else highrisk = 0;
run;

data hr lr(where=(ranuni(5555) < .14));   /* keep roughly 1 in 7 of the low-risk group */
  set ch10.creddata;
  if highrisk = 1 then do;
    smp_wgt=1;
    output hr;
  end;
  else do;
    smp_wgt=7;
    output lr;
  end;
run;

data ch10.telco;
  set hr lr;
run;

Table 10.1 displays the original population size and percents, the sample size, and the weights.

Table 10.1 Population and Sample Frequencies and Weights

GROUP        POPULATION    POPULATION PERCENT    SAMPLE    WEIGHT
High Risk        10,875                 3.48%    10,875         1
Low Risk        301,665                96.52%    43,095         7
TOTAL           312,540               100.00%    53,970

The overall rate of high-risk customers is 3.48%. This rate is kept intact by the use of weights in the modeling process.
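As a quick arithmetic check, applying the weights to the sample reproduces the original population counts and the 3.48% high-risk rate from Table 10.1:

$$\text{weighted high-risk rate} = \frac{10{,}875 \times 1}{10{,}875 \times 1 + 43{,}095 \times 7} = \frac{10{,}875}{312{,}540} \approx 3.48\%$$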
The next step is to prepare the predictive variables.

In the following data step, I create an array called riskvars. This represents the 61 variables that I've selected as preliminary candidates. I also create an array called rvar, which represents the group of renamed variables, rvar1-rvar61. The do loop following the array statements takes each variable and renames it to rvar1-rvar61:

data riskmean;
  set ch10.telco;
  array riskvars (61) COLLS LOCINQS INQAGE . . . TADB25 TUTRADES TLTRADES;
  array rvar (61) rvar1-rvar61;
  do count = 1 to 61;
    rvar(count) = riskvars(count);   /* copy each candidate into its renamed slot */
  end;
run;

NOTE: An array is active only during the data step and must be declared by name for each new data step.

The next step calculates the means for each of the 61 variables and creates an output data set called outmns with the mean values mrvar1-mrvar61:

proc summary data=riskmean;
  weight smp_wgt;
  var rvar1-rvar61;
  output ...
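Based on the description above, a completed version of this step would presumably look like the following sketch. The OUTPUT statement shown here is an assumption inferred from the prose (an output data set named outmns holding means named mrvar1-mrvar61), not taken from the original code:

* Sketch only: the output statement is inferred from the surrounding text;
proc summary data=riskmean;
  weight smp_wgt;                            /* weighted means reflect the full population */
  var rvar1-rvar61;
  output out=outmns mean=mrvar1-mrvar61;     /* one mean per renamed candidate variable */
run;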