VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF INFORMATION SYSTEMS
ADVANCED PROGRAM IN INFORMATION SYSTEMS
TRAN HOANG LONG - 17521305
THESIS TITLE: USE OF MACHINE LEARNING TO CREATE A CREDIT SCORING MODEL
HO CHI MINH CITY, 2021
ASSESSMENT COMMITTEE
1 Associate Prof Dr Nguyễn Dinh Thuan — Chairman.
2 Associate Prof Dr Đỗ Phuc — Secretary.
3 Dr Nguyễn Thanh Binh — Member.
First of all, I would like to thank all the lecturers of the University of Information Technology, especially the members of the Faculty of Information Systems, who provided me with helpful and valuable knowledge. I could not have accomplished my studies at this university without their wholehearted lectures.
In particular, the completion of this study would not have been possible without the expertise of Dr Cao Thi Nhan, my beloved thesis advisor. Her kindness and enthusiasm in assisting me during the work on this thesis are extremely precious to me and to my studies. I was so lucky to have such a wise and devoted advisor.
In writing this thesis, I tried my best to apply the banking domain knowledge that I had the opportunity to learn, along with research into new technologies, to make this thesis come true. Given that I am still an undergraduate, shortcomings in the implementation of this thesis are unavoidable. Therefore, I look forward to comments and suggestions from experts to make this paper even better and to gain more valuable experience.
Sincerely,
Tran Hoang Long
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
THESIS PROPOSAL
Advanced Education Program
THESIS TITLE: USE OF MACHINE LEARNING TO CREATE A CREDIT SCORING MODEL
Advisor: Dr Cao Thi Nhan
Duration: January 11th, 2021 to June 26th, 2021
Student: Tran Hoàng Long - 17521305
Contents:
1 Descriptions
More and more financial institutions are joining the lending market in Vietnam. FE Credit in particular, the leading financial institution in lending, disbursed over 79 billion VND in February 2021 alone (data collected from Trusting Social). Credit growth in Vietnam is the highest in the region, reaching 18.7 per cent in 2016, 18.17 per cent in 2017 and 14 per cent in 2018, owing to a more consumer-oriented economy and a low interest rate environment. In 2019, credit growth in Vietnam reached 12.1 per cent, which was the lowest growth rate in the previous five years. In 2020, credit growth is expected to bounce back to around 14 per cent (after adjustment to account for the COVID-19 situation).
Therefore, this project studies a machine learning model for credit scoring that can keep pace with the credit growth rate in Vietnam.
2 Scope
- Dataset about accepted and rejected loans
- Credit Scoring
- Logistic Model
- Classification and Regression
3 Objectives
- Study the lending operations of financial institutions and how they decide whether or not to approve a loan
- Understand the core concepts and algorithms used in credit scoring and credit risk models
- Learn how to train and test predictive models using available data
- Build a credit scoring model to analyze data and display results
4 Methodologies
- Data analysis: perform an exploratory analysis of the data and provide summary statistics about the variables
- Feature Engineering and Selection [2]: involves data manipulation processes such as transformation of categorical features, missing value treatment, infinite value handling, outlier detection and data leakage avoidance
- Machine Learning Models [3]: assign a score to a lead using
  - Logistic regression (LR): provides binary classifications using linear relationships
  - Decision tree (DT): is constructed to assess the potential improvement from using a nonlinear model
  - Random Forest (RF): is deployed by averaging over a collection of decision trees
5 Expected results.
- Understand the lending operations of financial institutions
- Understand the fundamental algorithms and methodologies used in credit scoring and credit risk models
- Successfully build the credit scoring model
Timeline:
Phase 1 (11/01/2021 - 15/03/2021): Study lending operations in financial institutions and their statuses.
Achieved by joining Trusting Social, one of the biggest credit scoring partners of most financial institutions and banks in Vietnam.
Phase 2 (16/03/2021 - 13/04/2021): Study Machine Learning models.
Study Logistic regression, Decision tree and Random Forest, which will be used in the scope of this project.
Phase 3 (14/04/2021 - 23/05/2021): Apply Machine Learning models to credit scoring.
Apply Machine Learning to assign credit ratings through genetic algorithms.
Phase 4 (24/05/2021 - 26/06/2021): Build a credit scoring model.
Train and test a usable scoring model with available data and display the result.
[2] A. R. Provenzano, D. Trifirò, A. Datteo, L. Giada, N. Jean, A. Riciputi, G. Le Pera, M. Spadaccino, L. Massaron and C. Nordio. Machine Learning approach for Credit Scoring, August 5, 2020.
[3] Bernard Dushimimana, Yvonne Wambui, Timothy Lubega and Patrick E. McSharry. Use of Machine Learning Techniques to Create a Credit Score Model for Airtime Loans, August 13, 2020.
Approved by the advisor Ho Chi Minh City, 18th Mar 2021.
Signature of advisor Signature of Student
Table of Contents
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS AND ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.2 CREDIT SCORING SYSTEMS
1.2.1 Credit Information Center (CIC)
1.2.2 Trusting Social
CHAPTER 2 MACHINE LEARNING MODEL FOR CREDIT SCORING
2.1 CREDIT SCORING METHODS
2.1.1 Expert judgements-based method
2.1.2 Model method
2.1.3 Why is credit scoring important?
2.2 CREDIT SCORECARD MODEL
2.2.1 LightGBM
2.2.2 Logistic regression algorithm
2.2.3 Weight of Evidence (WoE)
2.2.4 k-fold cross validation
2.3 CHAPTER SUMMARY
CHAPTER 3 IMPLEMENTATION
3.1 RUNTIME ENVIRONMENT
3.2 PRELIMINARY DATA EXPLORATION & SPLITTING
3.2.1 Identify Target Variable
3.2.2 Data Split
CHAPTER 4 CONCLUSION
REFERENCES
LIST OF TABLES
Table 2-1 Advantages and disadvantages of expert judgements-based method
Table 2-2 Example of a scorecard
Table 2-3 IV values interpretation
Table 3-1 Dataset info
Table 3-2 Reasons to drop features after preliminary data exploration
Table 3-3 Train and Test data after preprocessing
Table 3-4 Original confusion matrix
Table 3-5 List of reference categories
Table 3-6 Final feature scorecard
Table 3-7 Dummy variables table of the first 5 customers in the dataset
Table 3-8 Score calculation for the first 5 customers in the dataset
Table 3-9 Score quality after scoring
Table 3-10 Confusion matrix after applying the best threshold
Table 3-11 Loan status results table
LIST OF FIGURES
Figure 1.1 CIC Homepage
Figure 1.2 Trust Scores website
Figure 1.3 General Credit Decision Process Diagram
Figure 1.4 Operation flow chart of a P2P lender [5]
Figure 3.1 Runtime environment technical specification
Figure 3.2 List of 18 features with more than 80% missing values
Figure 3.3 Proportion of loan_status values
Figure 3.4 train_test_split configuration code snippet
Figure 3.5 Calculated p-values of features
Figure 3.7 Heat-map shows that out_prncp_inv and total_pymnt_inv are highly correlated
Figure 3.8 Remaining features after feature selection
Figure 3.9 Calculated WoE and IV of grade
Figure 3.10 Plot of WoE by grade
Figure 3.11 k-fold validation accuracy table
Figure 3.12 Parameters tuning for LightGBM
Figure 3.13 LightGBM training Fold 1's result
Figure 3.14 LightGBM training Fold 2's result
Figure 3.15 LightGBM training Fold 3's result
Figure 3.16 LightGBM training Fold 4's result
Figure 3.17 LightGBM training Fold 5's result
Figure 3.18 Mean AUC of Logistic regression on training set
The score range of FICO Scores [12]
Scorecard after features scores calculation
Sample score values
Score distribution on total data
Score, approval and rejection rates at the best threshold
LIST OF ACRONYMS AND ABBREVIATIONS
No Acronyms Meaning
1 CIC Credit Information Center
2 WoE Weight of Evidence
3 IV Information Value
4 ANOVA Analysis of Variance
5 ROC Receiver Operating Characteristic
6 AUC, AUROC Area under the ROC Curve
7 PR AUC Precision-Recall AUC
8 TPR True Positive Rate
9 FNR False Negative Rate
10 FPR False Positive Rate
11 TNR True Negative Rate
12 EDA Exploratory Data Analysis
13 PD Probability of Default
14 GBDT Gradient Boosting Decision Tree
15 GOSS Gradient-based One-Side Sampling
16 EFB Exclusive Feature Bundling
Chapter 1 Introduction
In 2020, credit growth in Vietnam reached 12.1 per cent, which was the lowest growth rate in the previous five years. In 2021, credit growth is expected to bounce back to around 14 per cent [1]. Given that credit growth in Vietnam is the highest in the region, credit risk can no longer be effectively controlled by credit officers, at least not in the traditional way. Therefore, many credit scoring systems have been created to shorten this process.
1.2 Credit scoring systems
Credit rating is an important part of the consumer lending process. It is seen as one of the most popular fields of application for both data mining and operations research techniques. However, credit scoring systems are still somewhat unfamiliar to general customers. The following institutions and businesses have built trustworthy credit scoring systems.
1.2.1 Credit Information Center (CIC)
[Screenshot of the CIC homepage]
Figure 1.1 CIC Homepage
CIC is an institution of the State Bank of Vietnam (as shown in Figure 1.1). It collects, stores, analyzes and forecasts personal credit information in support of the operations of banks and financial institutions [2]. CIC gathers profiles from commercial banks in Viet Nam and performs credit scoring on those datasets. Individuals and businesses can access its database and obtain credit information for a fee.
As a government institution, CIC is an extremely reliable source of credit information. Therefore, its database is used by many banks and financial institutions throughout Viet Nam.
1.2.2 Trusting Social
[Screenshot of the Trust Scores website]
Figure 1.2 Trust Scores website
Trusting Social is a fintech company that acts as a bridge between underbanked consumers and credit institutions and helps shorten the credit decision-making process. Its scoring system, Trust Scores (as shown in Figure 1.2), is based on telco data, mostly from Viettel, which is an important partner of Trusting Social. On average, one in every three consumer loans has been offered using the credit scoring system of Trusting Social [3].
1.3 Types of loans
There are many types of loans available on the market. However, this thesis only covers those for which credit scoring can be useful.
• Personal (unsecured) loans (VP Bank, OCB, Shinhan Bank ...)
A relatively small loan amount offered by commercial banks. Borrowers do not have to use an asset as collateral, so they probably need a high credit score to get a good interest rate. The term “unsecured” distinguishes this type of loan from a mortgage loan, which is defined as “secured”. Normally, a credit decision for a personal loan is made through a process as displayed in Figure 1.3.
[Diagram: General Credit Decision Process. Steps: search and meet clients; receive and check the client's profile; assess the client's info and demands; make the application for credit facilities; credit appraisal; credit approval; announce the credit approval result. Conductors: Customer Service Department, Operations Support Department, Loan Officer, Credit Underwriting Officer, Credit Authorizer.]
Figure 1.3 General Credit Decision Process Diagram
As shown in the diagram, the loan application goes through a series of steps carried out by many conductors and departments. Since it takes quite a while for banks to disburse each personal loan due
• Payday loans (FE Credit, Mcredit, Cash24, SHB Finance, Tienngay.vn ...)
Payday loans are high-cost, short-term loans, often for small amounts. This type of loan is attractive because borrowers do not have to wait long for approval. A notable name in this field in Viet Nam is FE Credit, which has become a leading revenue driver for its parent bank, VP Bank [4].
• Peer-to-peer (P2P) loans (VNVON, LendingClub ...)
Also known as “social lending” or “crowd lending”, P2P lending normally runs on a platform that connects borrowers directly to investors. Its underwriting process is shown in Figure 1.4. A borrower applies for a loan; if the borrower meets all the basic requirements, the platform sets a rate and term for the application based on the credit scoring model. After that, the P2P lender gives investors access to the loan, with information about the loan and the borrower (including the credit score), and investors decide whether or not to invest money in it. The lower the score, the higher the interest rate, which means investors can take more risk for possibly higher returns.
[Flow chart: private individuals and institutional investors lend through a P2P lending platform to private individuals (natural persons) and businesses (legal entities); repayments and interest flow back to the investors.]
Figure 1.4 Operation flow chart of a P2P lender [5]
• Credit card (VIB ...)
A credit card is a card issued by a commercial bank that enables the cardholder to borrow funds from that bank. Cardholders agree to pay the money back with interest, and since they can instantly borrow funds for each payment, they must have a good credit score.
1.4 Chapter summary
This chapter introduces the research area and outlines the background for the present study. It briefly reviews the context of Viet Nam's credit market and outlines some available credit scoring systems. The chapter subsequently describes the types of loans whose approval process the present study aims to help shorten. The next chapter covers the rationale of this thesis and the model that is used for credit scoring.
Chapter 2 Machine learning model for credit scoring
2.1 Credit scoring methods
2.1.1 Expert judgements-based method
As the name suggests, the expert judgements-based method relies on experts' appraisals of credit risk. Risk is predicted using basic information:
• Character: Appraise the reputation and trustworthiness of the borrower.
• Capital: Appraise the difference between the assets and the source of capital of the borrower. Assets are all values that the bank can claim as soon as the borrower's loan is charged off. Source of capital covers all kinds of costs that the borrower is paying, such as family expenses, house rental, etc. After deducting all expenses, the experts can calculate the borrower's savings and whether that amount can cover the loan interest.
• Collateral: There are two types of loans according to collateral: mortgage loans (with collateral) and unsecured loans (without collateral). This thesis only covers unsecured loans, since this is the type of loan that is more likely to require credit information, especially a credit score.
• Capacity: All information that is directly relevant to the borrower's financial capability, such as employment, income, marital status, number of dependents, etc.
• Condition: Briefly appraise the borrower according to market conditions, financial context, competitive pressure, loan purpose, etc.
There are both advantages and disadvantages to consider when using the expert judgement technique, as shown in Table 2-1:
Table 2-1 Advantages and disadvantages of expert judgements-based method
Advantages | Disadvantages
Different perspectives | Time consuming
Valuable use of prior knowledge and experience | Costly if external experts are hired
Helps find creative solutions | With different projects, process activities may have different durations
Avoids re-inventing the wheel | Experts are human, so they need to rest, which affects their workload
2.1.2 Model method
The model method is based on a score computed by machine learning models. This method has several advantages over the traditional expert judgements-based method:
• Models return the result immediately, which leads to a shorter appraisal duration and makes the method more suitable for online lending platforms.
• The appraisal throughput of a model is much higher than that of experts, since a model can handle the workload of a hundred experts.
• It helps reduce direct labor costs, as banks or financial institutions no longer have to pay for appraisal experts.
• Profile appraisal results are consistent because a single credit scoring model is applied, while experts can give different results based on their perspective of risk.
2.1.3 Why is credit scoring important?
A credit score is a tool that helps consumer lenders assess the creditworthiness of customers before deciding whether or not to lend to them. It is also a scale for measuring a customer's ability to borrow money, and it determines the maximum loan limit that the bank can disburse when a customer needs a loan. In addition, the credit score affects the customer's subsequent loans if it is lower than the minimum score allowed by a bank. In [6], the authors used a credit bureau data set and commercial bank customer transactions to build a forecasting model that identifies credit cardholders' defaults. Their results indicate cost savings of 6% to 25% of total losses when machine learning forecasting techniques are employed to estimate delinquency rates. This subject is therefore unavoidable, as businesses need it. However, there is not much room for innovation due to the lack of data and confidentiality issues, which is also a difficulty for this study.
2.2 Credit scorecard model
Financial institutions and commercial banks have complex credit models that use the information contained in data warehouses, such as salary, credit commitments and historical loan data, to determine the credit score of an application or an existing customer. The model generates a score that represents the probability that the lender will receive repayment on time if they give a person a loan or credit card.
A credit scorecard is one of those credit models. It is one of the most common because it is relatively easy to interpret for customers and has existed for the last few decades, so its development process is standard and widely understood. A credit scorecard translates different characteristics of a customer (age, residency status, time at current address, time at current job, etc.) into points, and the total number of points is converted into the credit score. In other words, a credit scorecard is a lookup table that maps a borrower's specific characteristics to points.
For example, a scorecard can give points to individual borrowers for their age and income according to Table 2-2.
Table 2-2 Example of a scorecard
Features Values Points
20 mil to 50 mil 28
Using the scorecard in this example, a customer who is 31 years old and has an income of 30 million a year falls in the second age class (26-40) and gets 25 points for age. Similarly, the customer gets 28 points for income, so the total for these two features is 53. Of course, many other features enter the scoring process. To shorten this calculation, it makes sense to build the scorecard model with machine learning algorithms.
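The lookup in this example can be sketched in a few lines of Python. Only the 26-40 age class (25 points) and the 20 to 50 million income class (28 points) come from the example; every other bin and point value below is a hypothetical placeholder, and the names `SCORECARD`, `feature_points` and `total_score` are illustrative.

```python
# Each feature maps to a list of (lower bound, upper bound, points) bins.
# Only the bins marked "from the example" appear in the text; the others are
# hypothetical values added so that the lookup table is complete.
SCORECARD = {
    "age": [
        (18, 25, 10),                 # hypothetical
        (26, 40, 25),                 # from the example
        (41, 120, 35),                # hypothetical
    ],
    "income_mil": [
        (0, 19.99, 12),               # hypothetical
        (20, 50, 28),                 # from the example
        (50.01, float("inf"), 40),    # hypothetical
    ],
}


def feature_points(feature, value):
    """Return the points of the bin that contains `value`."""
    for low, high, points in SCORECARD[feature]:
        if low <= value <= high:
            return points
    raise ValueError(f"{value} is outside every bin of {feature}")


def total_score(applicant):
    """Sum the points of all scorecard features for one applicant."""
    return sum(feature_points(name, value) for name, value in applicant.items())


customer = {"age": 31, "income_mil": 30}
print(total_score(customer))  # 25 + 28 = 53
```

A real scorecard would hold one such bin list per feature, with the points derived from the fitted model rather than chosen by hand.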
Many algorithms can be applied to build a credit scorecard model. In this thesis, I will try Logistic regression and LightGBM, and the final choice is based on ease of interpretation.
2.2.1 LightGBM
LightGBM originated from Gradient Boosting Decision Tree (GBDT), which is an ensemble learning approach that uses the decision tree as its base classifier. GBDT can enhance a weak classifier into a strong one by iterative training. In each iteration, GBDT learns a decision tree by fitting the negative gradients (also known as residual errors). The main cost in GBDT lies in learning the decision trees, and the most time-consuming part of learning a decision tree is finding the best split points. Traditional techniques can be inefficient in both training speed and memory consumption. Therefore, LightGBM introduces two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). Details of the LightGBM theory can be found in [7].
In summary, LightGBM is a good candidate for building the credit scorecard model for the following reasons:
• LightGBM can handle both classification and regression problems.
• GBDT is an ensemble method, and its performance is significantly better than that of most conventional machine learning methods. As one type of GBDT, LightGBM has been shown to have good stability and accuracy. It has a relatively small computational cost but provides a good training effect.
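To make "fitting the negative gradients" concrete, the following is a minimal, dependency-free sketch of gradient boosting with one-split decision trees (stumps) on a single feature. It assumes a squared-error loss, for which the negative gradient is exactly the residual. It is not LightGBM's implementation: GOSS and EFB are not shown, and the toy data, the learning rate and the function names are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Find the split threshold on xs that minimizes the squared error of residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Gradient boosting for squared-error loss with stumps as weak learners."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - F(x).
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    """Mean squared error of a fitted model on (xs, ys)."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
weak = boost(xs, ys, n_rounds=1)
strong = boost(xs, ys, n_rounds=20)
print(mse(weak, xs, ys), mse(strong, xs, ys))
```

The training error after 20 rounds is lower than after a single round, which is the sense in which iterative training turns a weak learner into a stronger one.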
2.2.2 Logistic regression algorithm
Logistic regression is perhaps the most widely used algorithm in the consumer credit rating industry. A regression model generates a continuous response variable using linear combinations of predictor variables. Because credit rating is a binary problem, we want to reduce this result to 0 or 1. Logistic regression achieves this by applying a logistic transformation that maps an output in (-∞, +∞) to a probability between 0 and 1. In credit scoring, where there are only two groups of results (good and bad), binary logistic regression is used. The binary dependent variable takes the value 1 when the customer is a good loan and 0 otherwise.
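$$P(Y_i = 1) = \frac{1}{1 + e^{-Z_i}} \tag{2.1}$$
where:
$$Z_i = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \varepsilon_i \tag{2.2}$$
with:
$\beta_0, \dots, \beta_k$: the parameters of the model
$X_1, \dots, X_k$: the variables representing the explanatory factors of the probability that each user is a good loan
$\varepsilon_i$: the error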
The regression receives input features that are preprocessed by the Weight of Evidence (WoE) method. The output of the model is the default probability of a loan application: the higher the probability, the higher the risk. Through scaling, the calculated probability is transformed into a credit score that represents the consumer's reliability. This score equals the sum of the scores of each of the consumer's features created by WoE.
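Equations (2.1) and (2.2) can be sketched as two small functions. The coefficients and feature values below are hypothetical, the error term is omitted, and the function names are illustrative.

```python
import math


def linear_score(intercept, coefficients, features):
    """Z = b0 + b1*x1 + ... + bk*xk (the error term is omitted)."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))


def probability_good(intercept, coefficients, features):
    """Logistic transformation: maps any real Z to a value between 0 and 1."""
    z = linear_score(intercept, coefficients, features)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and feature values, for illustration only.
p = probability_good(intercept=-0.5, coefficients=[0.8, -1.2], features=[1.0, 0.25])
print(round(p, 4))
```

In this example Z works out to about 0, so the resulting probability sits at the midpoint of the logistic curve; larger positive values of Z push it towards 1 and larger negative values towards 0.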
2.2.3 Weight of Evidence (WoE)
WoE is one of the most common feature engineering and feature selection techniques. It measures the "strength" of a grouping in differentiating between good and bad risk and attempts to find a monotonic relationship between the independent variables and the target variable. The criterion for ranking is the information value (IV), which ranks features based on their relative importance.
The formula to calculate WoE is as follows:
$$\text{WoE} = \ln\left(\frac{\%\ \text{of good customers}}{\%\ \text{of bad customers}}\right) \tag{2.3}$$
A positive WoE means that the proportion of good customers is higher than that of bad customers, and vice versa for a negative WoE value.
IV is calculated as follows:
$$\text{IV} = \sum (\%\ \text{of good customers} - \%\ \text{of bad customers}) \times \text{WoE} \tag{2.4}$$
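As a worked illustration of the WoE and IV formulas, the sketch below computes both for the bins of one categorical feature. The good and bad counts per grade are hypothetical, and `woe_iv` is an illustrative name.

```python
import math


def woe_iv(bins):
    """Compute the WoE of each bin and the total IV of the feature.

    `bins` maps a bin label to a (good_count, bad_count) pair. Every bin must
    contain at least one good and one bad loan, otherwise the ratio is undefined.
    """
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe = {}
    iv = 0.0
    for label, (good, bad) in bins.items():
        pct_good = good / total_good
        pct_bad = bad / total_bad
        # WoE is positive when the bin holds relatively more good loans.
        woe[label] = math.log(pct_good / pct_bad)
        # Each bin contributes (% good - % bad) * WoE to the IV.
        iv += (pct_good - pct_bad) * woe[label]
    return woe, iv


# Hypothetical counts of good and bad loans per grade.
grade_bins = {"A": (900, 30), "B": (800, 70), "C": (500, 100)}
woe, iv = woe_iv(grade_bins)
for label, value in woe.items():
    print(label, round(value, 3))
print("IV:", round(iv, 3))
```

With these counts the WoE falls from grade A to grade C, which is the kind of monotonic relationship the method looks for.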
According to Siddiqi [8], by convention, the values of IV in credit scoring are interpreted as follows:
Table 2-3 IV values interpretation
Information Value | Interpretation
Less than 0.02 | Not useful for prediction
0.02 to 0.1 | Weak predictive power
0.1 to 0.3 | Medium predictive power
0.3 to 0.5 | Strong predictive power
> 0.5 | Suspicious predictive power
Steps for WoE feature engineering:
1. Calculate WoE for each unique value (bin) of a categorical variable, e.g., for each of grade:A, grade:B, grade:C, etc.
2. Bin a continuous variable into discrete bins based on its distribution and number of unique observations (called fine classing).
3. Calculate WoE for each derived bin of the continuous variable.
4. Once WoE has been calculated for each bin of both categorical and numerical features, combine bins as per the following rules (called coarse classing).
Rules related to combining WoE bins:
1. Each bin should have at least 5% of the observations.
2. Each bin should be non-zero for both good and bad loans.
3. The WoE should be distinct for each category. Similar groups should be aggregated or binned together, because bins with similar WoE have almost the same proportion of good or bad loans, implying the same predictive power.
4. The WoE should be monotonic, i.e., either growing or decreasing with the bins.
5. Missing values are binned separately.
The above rules are generally accepted and well documented in academic literature [9].
2.2.4 k-fold cross validation
In this thesis, the machine learning model is evaluated using k-fold cross-validation. The data set is divided into 5 non-overlapping folds. Each fold is used once as a held-out test set, while all the other folds are collectively used as the training set. The general model
The k-fold cross-validation procedure is implemented in this thesis using Repeated Stratified K-Fold from the scikit-learn machine learning library.
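The idea behind stratified k-fold splitting can be sketched without any library. The function below is an illustration of the concept, not scikit-learn's implementation: it deals the shuffled indices of each class round-robin into k folds so that every fold keeps roughly the class proportions of the full data. The labels and the seed are hypothetical.

```python
import random


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) for k non-overlapping folds.

    Indices of each class are shuffled and dealt round-robin to the folds,
    so every fold keeps roughly the class proportions of the full data.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(i for f, fold in enumerate(folds) if f != held_out for i in fold)
        yield train, test


# Hypothetical labels: 1 = good loan (90%), 0 = bad loan (10%).
labels = [1] * 90 + [0] * 10
for train, test in stratified_k_fold(labels, k=5):
    bad_in_test = sum(1 for i in test if labels[i] == 0)
    print(len(train), len(test), bad_in_test)
```

Repeating the procedure with different seeds and averaging the scores gives the "repeated" variant, which reduces the dependence of the estimate on one particular shuffle.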
2.3 Chapter summary
This chapter first introduces and compares two typical credit scoring methods: the expert judgements-based method and the model method. It then explains the credit scorecard model used in this thesis. The chapter subsequently covers the logistic regression algorithm, LightGBM, the Weight of Evidence method and the model evaluation method implemented in this thesis, k-fold cross-validation. The next chapter looks at the dataset, builds the model and eventually evaluates it.
Chapter 3 Implementation
3.1 Runtime environment
The model is built, trained and tested on a MacBook Pro (13-inch, 2016, Two Thunderbolt 3 ports) with the technical specifications [10] shown in Figure 3.1:
[Screenshot: macOS Big Sur version 11.4, MacBook Pro (13-inch, 2016, Two Thunderbolt 3 ports), processor 2 GHz dual-core Intel Core i5, memory 8 GB 1867 MHz LPDDR3, graphics Intel Iris Graphics 540 1536 MB]
Figure 3.1 Runtime environment technical specification
• Processor: 2.0GHz dual-core Intel Core i5, Turbo Boost up to 3.1GHz, with 4MB shared L3 cache
• Storage: 256GB PCIe-based onboard SSD
• Memory: 8GB of 1866MHz LPDDR3 onboard memory
3.2 Preliminary Data Exploration & Splitting
We will use a dataset available on Kaggle that relates to consumer loans provided by Lending Club [11]. The raw data includes information on more than 450,000 consumer loans granted between 2007 and 2014, with almost 75 characteristics, including the current loan status and various
Table 3-1 Dataset info
Total Dataset
Number of observations 466,285
Number of features 74
Features with missing values 18
Initial data research reveals the following:
• 18 features have more than 80% missing values; they are listed in Figure 3.2. Given the high proportion of missing values, any technique to impute them is likely to produce inaccurate results.
• Certain static features are not related to credit risk, e.g., id, member_id, url, title.
• Other features are expected to be populated only after the borrower has defaulted, for example, bailouts, collection_recovery_ Because our goal here is to predict the future probability of a default, having such features in our model would be counterintuitive, as they will not be observed until the default event occurs.
All the above features will be dropped for the reasons listed in Table 3-2.
Table 3-2 Reasons to drop features after preliminary data exploration.
Features | Reason to drop
18 features in Figure 3.2 | Values missing as explained above
id, member_id, title, emp_title, url, |
sub_grade | Same information is captured in grade column
 | Supposed to have future dates, therefore, it will not make sense for the model
[Figure: list of features with their proportion of missing values]
Figure 3.2 List of 18 features with more than 80% missing values
3.2.1 Identify Target Variable
Based on the data exploration, our target variable appears to be loan_status. A quick look at the unique loan_status values and their proportions, shown in Figure 3.3, confirms this.
Current 0.480878
Fully Paid 0.396193
Charged Off 0.091092
Late (31-120 days) 0.014798
In Grace Period 0.006747
Figure 3.3 Proportion of loan_status values
Based on domain knowledge, we will classify loans with the following loan_status values as being in default (or 0):
• Charged Off
• Default
• Late (31-120 days)
• Does not meet the credit policy. Status:Charged Off
All the other values will be classified as good (or 1).
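This classification rule can be sketched as a simple lookup. The status strings below are copied from the list above; their exact spelling in the raw dataset is an assumption and should be checked against the data. `DEFAULT_STATUSES` and `good_bad` are illustrative names.

```python
# loan_status values treated as default (0); every other value is good (1).
# The exact spelling of each string is assumed and must match the dataset.
DEFAULT_STATUSES = {
    "Charged Off",
    "Default",
    "Late (31-120 days)",
    "Does not meet the credit policy. Status:Charged Off",
}


def good_bad(loan_status):
    """Return 0 for a defaulted loan and 1 for a good loan."""
    return 0 if loan_status in DEFAULT_STATUSES else 1


statuses = ["Current", "Fully Paid", "Charged Off", "Late (31-120 days)", "In Grace Period"]
print([good_bad(s) for s in statuses])  # [1, 1, 0, 0, 1]
```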
3.2.2 Data Split
We now divide our data into two sets: training (80%) and testing (20%). We will perform Repeated Stratified K-Fold cross-validation on the training set to pre-evaluate our model, while the test set will remain untouched until the final evaluation of the model. This approach follows the best evaluation of the
Figure 3.3 above shows that our data, as expected, is heavily skewed towards good loans. Therefore, in addition to shuffled random sampling, we will also stratify the train/test split so that the distribution of good and bad loans in the test set is the same as in the pre-split data. This is achieved through the stratify parameter of the train_test_split function, as shown in Figure 3.4.
Figure 3.4 train_test_split configuration code snippet
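The snippet in Figure 3.4 is not reproduced here. As a hedged illustration of what stratification does, the following dependency-free sketch splits each class separately so that both sets keep the class proportions. It is not scikit-learn's implementation; the labels, the seed and the function name are hypothetical.

```python
import random


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) with class proportions preserved."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        # Take the same share of every class for the test set.
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical labels: 890 good loans (1) and 110 bad loans (0).
labels = [1] * 890 + [0] * 110
train, test = stratified_split(labels)
bad_share_train = sum(1 for i in train if labels[i] == 0) / len(train)
bad_share_test = sum(1 for i in test if labels[i] == 0) / len(test)
print(len(train), len(test), bad_share_train, bad_share_test)
```

With scikit-learn, the corresponding call would be along the lines of `train_test_split(X, y, test_size=0.2, stratify=y)`, where `X` and `y` stand for the feature table and the target column.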
3.3 Data Cleaning
The data cleaning process includes the following tasks:
• Remove text from the emp_length column (e.g., years) and convert it to numeric.
• For all columns with dates: convert them to Python's datetime format, create a new column as the difference between the model development date and the respective date feature, and then drop the original feature.
• Remove text from the term column and convert it to numeric.
We will define a helper function for each of the above tasks and apply them to the training dataset. Those functions are emp_length_converter, date_columns and loan_term_converter.
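The bodies of these helpers are not shown in the text, so the following is only a sketch of what they might look like when applied to single values. The raw formats ('10+ years', '< 1 year', ' 36 months', 'Dec-14'), the treatment of missing values, the reference date and the helper `months_since` (a per-value stand-in for the date handling) are all assumptions.

```python
import re
from datetime import datetime


def emp_length_converter(value):
    """'10+ years' -> 10, '< 1 year' -> 0, '3 years' -> 3, missing -> 0 (assumed)."""
    if not value:
        return 0
    if value.strip().startswith("<"):
        return 0
    match = re.search(r"\d+", value)
    return int(match.group()) if match else 0


def loan_term_converter(value):
    """' 36 months' -> 36."""
    return int(re.search(r"\d+", value).group())


def months_since(date_text, reference=datetime(2020, 8, 1), fmt="%b-%y"):
    """Whole months between a date such as 'Dec-14' and the reference date.

    The reference date stands in for the model development date and is a
    hypothetical value; the '%b-%y' format is an assumption about the raw data.
    """
    date = datetime.strptime(date_text, fmt)
    return (reference.year - date.year) * 12 + (reference.month - date.month)


print(emp_length_converter("10+ years"), emp_length_converter("< 1 year"))
print(loan_term_converter(" 36 months"))
print(months_since("Dec-14"))
```

On a full table, each helper would be applied column by column, and the original date columns would be dropped once the month differences have been computed.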
3.4 Feature Selection
We will perform feature selection to identify the most appropriate features for our binary classification problem, using the Chi-squared test for categorical features and the ANOVA F-statistic for numerical features. First, here is a brief definition of these methods.
The Chi-squared test is used to determine the extent of the relationship or dependence between two categorical variables: in our case, one categorical input feature and the categorical target variable.
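As an illustration, the Pearson Chi-squared statistic of a contingency table (feature categories against the good/bad target) can be computed as below. The counts are hypothetical and the function name is illustrative; turning the statistic into a p-value additionally requires the Chi-squared distribution, which a statistics library provides.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if feature and target were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are grades A, B, C; columns are (good, bad) loans.
grade_vs_target = [[900, 30], [800, 70], [500, 100]]
print(round(chi_square_statistic(grade_vs_target), 2))

# A feature unrelated to the target: identical good/bad ratio in every row.
unrelated = [[90, 10], [180, 20], [270, 30]]
print(round(chi_square_statistic(unrelated), 2))  # 0.0
```

A larger statistic means a stronger dependence between the feature and the target and therefore a smaller p-value.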
The Analysis of Variance (ANOVA) F-statistic calculates the ratio of the variance between the means of two or more samples of data to the variance within the samples. The higher this ratio between a numerical input feature and a categorical target feature, the lower the independence between the two and the more likely the feature is to be useful for model training.
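A minimal sketch of the one-way ANOVA F-statistic follows, with a numerical feature split into one group per target class. The interest rates are hypothetical and the function name is illustrative.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic: between-group over within-group variance."""
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical interest rates (%) of good loans and of bad loans.
int_rate_good = [8.0, 9.5, 10.0, 11.0, 9.0]
int_rate_bad = [15.0, 17.5, 16.0, 18.0, 14.5]
print(round(anova_f([int_rate_good, int_rate_bad]), 2))
```

When the group means are far apart relative to the spread inside each group, as here, the F-statistic is large; identical groups give an F-statistic of 0.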
The p-values, in ascending order, from our Chi-squared test on the categorical features are shown in Figure 3.5. For simplicity, we only keep the top four features and remove the rest.
Feature p-value
grade 0.000000
home_ownership 0.000000
verification_status 0.000000
purpose 0.000000
addr_state 0.000000
initial_list_status 0.000000
pymnt_plan 0.000923
application_type 1.000000
Figure 3.5 Calculated p-values of features
The ANOVA F-statistic for 34 numerical features shows a wide range of F-values, as in Figure 3.6, from 23.513 to 0.39. We will keep the top 20 features.
Next, we will calculate the pair-wise correlations of the selected top 20 numerical features to detect any potentially multicollinear variables. A heat-map of these pair-wise correlations, shown in Figure 3.7, identifies two features (out_prncp_inv and total_pymnt_inv) as highly correlated. Therefore, we will also drop them from the model.
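The pair-wise correlation check can be sketched as follows. The column names exist in the dataset, but the values are hypothetical, and the 0.9 threshold and the function names are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) > threshold:
                pairs.append((first, second, round(r, 3)))
    return pairs


# Hypothetical values; only the column names come from the dataset.
columns = {
    "total_pymnt": [1000.0, 2500.0, 4000.0, 5200.0, 7000.0],
    "total_pymnt_inv": [990.0, 2480.0, 3990.0, 5150.0, 6980.0],
    "int_rate": [13.5, 7.9, 18.2, 10.1, 12.4],
}
print(highly_correlated_pairs(columns))
```

For each flagged pair, one of the two features carries almost no additional information, so dropping it reduces multicollinearity in the logistic regression.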
[Figure 3.6: table of ANOVA F-scores and p-values per numerical feature (columns Numerical_Feature and p values); the legible rows start with mths_since_last_pymnt_d, total_pymnt_inv, total_pymnt, int_rate and last_pymnt_amnt]
[Heat-map of pair-wise correlations between the selected numerical features]
Figure 3.7 Heat-map shows that out_prncp_inv and total_pymnt_inv are highly correlated