
Graduation thesis: Use of machine learning to create a credit scoring model


DOCUMENT INFORMATION

Title: Use of Machine Learning to Create a Credit Scoring Model
Author: Tran Hoang Long
Advisor: Cao Thi Nhan, PhD
University: University of Information Technology
Major: Information Systems
Type: Graduation Thesis
Year of publication: 2021
City: Ho Chi Minh City
Number of pages: 76


UNIVERSITY OF INFORMATION TECHNOLOGY

FACULTY OF INFORMATION SYSTEMS

TRAN HOANG LONG - 17521305

HO CHI MINH CITY, 2021

VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY


ASSESSMENT COMMITTEE

1. Associate Prof. Dr. Nguyễn Dinh Thuan, Chairman

2. Associate Prof. Dr. Đỗ Phuc, Secretary

3. Dr. Nguyễn Thanh Binh, Member

First of all, I would like to thank all the lecturers of the University of Information Technology, especially the members of the Faculty of Information Systems, who provided me with helpful and valuable knowledge. I could not have completed my studies at this university without their wholehearted lectures.

In particular, the completion of this study would not have been possible without the expertise of Dr. Cao Thi Nhan, my beloved thesis advisor. Her kindness and enthusiasm in assisting me during the work on this thesis are extremely precious to me and to my studies. I was lucky to have such a wise and devoted advisor.

In writing this thesis, I tried my best to apply the banking domain knowledge that I had the opportunity to learn, along with research into new technologies, to make this thesis come true. Given that I am still an undergraduate, shortcomings in the implementation of this thesis are unavoidable. Therefore, I look forward to comments and suggestions from experts to make this paper better and to gain more valuable experience.

Sincerely,

Tran Hoang Long

UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS

THESIS PROPOSAL

Advanced Education Program

THESIS TITLE: USE OF MACHINE LEARNING TO CREATE A CREDIT SCORING MODEL

Advisor: Dr. Cao Thi Nhan

Duration: January 11, 2021 to June 26, 2021

Student: Tran Hoang Long - 17521305

Contents:

1. Descriptions

More and more financial institutions are joining lending operations in Vietnam. FE Credit in particular, the leading financial institution in lending, disbursed over 79 billion VND in February 2021 alone (data collected from Trusting Social). Credit growth in Vietnam is the highest in the region, reaching 18.7 per cent in 2016, 18.17 per cent in 2017 and 14 per cent in 2018, owing to a more consumer-oriented economy and a low interest rate environment. In 2019, credit growth in Vietnam reached 12.1 per cent, which was the lowest growth rate in the previous five years. In 2020, credit growth is expected to bounce back to around 14 per cent (after adjustment to account for the COVID-19 situation).

Therefore, this project studies a suitable machine learning model for credit scoring that can cope with the credit growth rate in Vietnam.

2. Scope

- Dataset about accepted and rejected loans

- Credit Scoring

- Logistic Model

- Classification and Regression

3. Objectives

- Study the lending operations in financial institutions and how they decide whether to approve a loan or not

- Understand the core concepts and algorithms used in credit scoring and credit risk models

- Learn how to train and test predictive models by using available data

- Build a credit scoring model to analyze data and display results

4. Methodologies

- Data analysis: perform an exploratory analysis of the data and provide summary statistics about the variables.

- Feature Engineering and Selection [2]: involves data manipulation processes such as transformation of categorical features, missing value treatment, infinite value handling, outlier detection and data leakage avoidance.

- Machine Learning Models [3]: assign a score to a lead using:

  - Logistic regression (LR): provides binary classifications using linear relationships.

  - Decision tree (DT): is constructed to assess the potential improvement from using a nonlinear model.

  - Random Forest (RF): is deployed by averaging over a collection of decision trees.

5. Expected results

- Understand the lending operations in financial institutions

- Understand the fundamental algorithms and methodologies used in credit scoring and credit risk models

- Successfully build the credit scoring model

Timeline:

Phase 1 (11/01/2021 - 15/03/2021): Study lending operations in financial institutions and their statuses. Achieved by joining Trusting Social, one of the biggest credit scoring partners of most financial institutions and banks in Vietnam.

Phase 2 (16/03/2021 - 13/04/2021): Study Machine Learning models. Study Logistic regression, Decision tree and Random Forest, which will be used in the scope of this project.

Phase 3 (14/04/2021 - 23/05/2021): Apply Machine Learning models to credit scoring. Apply Machine Learning to assign credit ratings through genetic algorithms.

Phase 4 (24/05/2021 - 26/06/2021): Build a credit scoring model. Train and test a usable scoring model with available data and display the result.

[2] A. R. Provenzano, D. Trifirò, A. Datteo, L. Giada, N. Jean, A. Riciputi, G. Le Pera, M. Spadaccino, L. Massaron and C. Nordio. Machine Learning approach for Credit Scoring, August 5, 2020.

[3] Bernard Dushimimana, Yvonne Wambui, Timothy Lubega and Patrick E. McSharry. Use of Machine Learning Techniques to Create a Credit Score Model for Airtime Loans, 13 August 2020.

Approved by the advisor. Ho Chi Minh City, 18 Mar 2021.

Signature of advisor / Signature of student

Table of Contents

LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS AND ABBREVIATIONS

CHAPTER 1 INTRODUCTION
1.2 CREDIT SCORING SYSTEMS
1.2.1 Credit Information Center (CIC)
1.2.2 Trusting Social

CHAPTER 2 MACHINE LEARNING MODEL FOR CREDIT SCORING
2.1 CREDIT SCORING METHODS
2.1.1 Expert judgements-based method
2.1.2 Model method
2.1.3 Why is credit scoring important?
2.2 CREDIT SCORECARD MODEL
2.2.1 LightGBM
2.2.2 Logistic regression algorithm
2.2.3 Weight of Evidence (WoE)
2.2.4 k-fold cross validation
2.3 CHAPTER SUMMARY

CHAPTER 3 IMPLEMENTATION
3.1 RUNTIME ENVIRONMENT
3.2 PRELIMINARY DATA EXPLORATION & SPLITTING
3.2.1 Identify Target Variable
3.2.2 Data Split

CHAPTER 4 CONCLUSION

REFERENCES

LIST OF TABLES

Table 2-1 Advantages and disadvantages of expert judgements-based method
Table 2-2 Example of a scorecard
Table 2-3 IV values interpretation
Table 3-1 Dataset info
Table 3-2 Reasons to drop features after preliminary data exploration
Table 3-3 Train and Test data after preprocessing
Table 3-4 Original confusion matrix
Table 3-5 List of reference categories
Table 3-6 Final feature scorecard
Table 3-7 Dummy variables table of the first 5 customers in the dataset
Table 3-8 Score calculation for the first 5 customers in the dataset
Table 3-9 Score quality after scoring
Table 3-10 Confusion matrix after applying the best threshold
Table 3-11 Loan status results table

LIST OF FIGURES

Figure 1.1 CIC Homepage
Figure 1.2 Trust Scores website
Figure 1.3 General Credit Decision Process Diagram
Figure 1.4 Operation flow chart of a P2P lender [5]
Figure 3.1 Runtime environment technical specification
Figure 3.2 List of 18 features with more than 80% missing values
Figure 3.3 Proportion of loan_status values
Figure 3.4 train_test_split configuration code snippet
Calculated p-values of features
Figure 3.8 Remaining features after feature selection
Figure 3.9 Calculated WoE and IV of grade
Figure 3.10 Plot of WoE by grade
Figure 3.11 k-fold validation accuracy table
Figure 3.12 Parameters tuning for LightGBM
Figure 3.13 LightGBM training Fold 1's result
Figure 3.14 LightGBM training Fold 2's result
Figure 3.15 LightGBM training Fold 3's result
Figure 3.16 LightGBM training Fold 4's result
Figure 3.17 LightGBM training Fold 5's result
Figure 3.18 Mean AUC of Logistic regression on training set
The score range of FICO Scores [12]
Scorecard after features scores calculation
Sample score values
Score distribution on total data
Score, approval and rejection rates at the best threshold

LIST OF ACRONYMS AND ABBREVIATIONS

No. | Acronym | Meaning
1 | CIC | Credit Information Center
2 | WoE | Weight of Evidence
3 | IV | Information Value
4 | ANOVA | Analysis of Variance
5 | ROC | Receiver Operating Characteristic
6 | AUC, AUROC | Area under the ROC Curve
7 | PR AUC | Precision-Recall AUC
8 | TPR | True Positive Rate
9 | FNR | False Negative Rate
10 | FPR | False Positive Rate
11 | TNR | True Negative Rate
12 | EDA | Exploratory Data Analysis
13 | PD | Probability of Default
14 | GBDT | Gradient Boosting Decision Tree
15 | GOSS | Gradient-based One-Side Sampling
16 | EFB | Exclusive Feature Bundling

Chapter 1 Introduction

In 2020, credit growth in Vietnam reached 12.1 per cent, which was the lowest growth rate in the previous five years. In 2021, credit growth is expected to bounce back to around 14 per cent [1]. Given that credit growth in Vietnam is the highest in the region, credit risk can no longer be controlled effectively by credit officers, at least not in the traditional way. Therefore, many credit scoring systems have been created to shorten this process.

1.2 Credit scoring systems

Credit rating is an important part of the consumer lending process. It is seen as one of the most popular fields of application for both data mining and operations research techniques. However, credit scoring systems are still unfamiliar to general customers. Below are some notable institutions and businesses that have built trustworthy credit scoring systems.

1.2.1 Credit Information Center (CIC)

Figure 1.1 CIC Homepage

CIC is an institution of the State Bank of Vietnam (as shown in Figure 1.1). Its functions are collecting, storing, analyzing and forecasting personal credit information in support of the operations of banks and financial institutions [2]. CIC gathers profiles from commercial banks in Viet Nam and performs credit scoring on those datasets. Individuals and businesses can access its database and obtain credit information for a fee.

As a government institution, CIC is an extremely reliable source of credit information; therefore, its database is used by many banks and financial institutions throughout Viet Nam.

1.2.2 Trusting Social

Figure 1.2 Trust Scores website

Trusting Social is a fintech company that acts as a bridge between underbanked consumers and credit institutions and helps shorten the credit decision-making process. Its scoring system, Trust Scores (as shown in Figure 1.2), is based on telco data, mostly from Viettel, which is an important partner of Trusting Social. On average, one in every three consumer loans has been offered using the credit scoring system of Trusting Social [3].

1.3 Types of loans

There are many types of loans available on the market; however, this thesis only mentions those for which credit scoring can be useful.

• Personal (unsecured) loans (VP Bank, OCB, Shinhan Bank, ...)

A relatively small loan amount offered by commercial banks. Borrowers do not have to use an asset as collateral, so they probably need a high credit score to get a good interest rate. The term "unsecured" distinguishes this type of loan from a mortgage loan, which is defined as "secured". Normally, a credit decision is made for a personal loan through a process as displayed in Figure 1.3.

[Figure: General Credit Decision Process. Conductors: Customer Service Department, Operations Support Department, Loan Officer, Credit Underwriting Officer, Credit Authorizer. Steps: search and meet clients; receive and check client's profile; assess client's info and demands; make the application for credit facilities; credit appraisal; credit approval; announce credit approval result.]

Figure 1.3 General Credit Decision Process Diagram

As shown in the diagram, the loan application goes through a set of steps, which are carried out by many conductors and departments. Since it takes quite a while for banks to disburse each personal loan due

• Payday loans (FE Credit, Mcredit, Cash24, SHB Finance, Tienngay.vn, ...)

Payday loans are high-cost, short-term loans, often for small amounts. This type of loan is attractive because borrowers do not have to wait long for approval. A remarkable recent name in this field in Viet Nam is FE Credit, which has become a leading revenue driver for its parent bank, VP Bank [4].

• Peer-to-peer (P2P) loans (VNVON, LendingClub, ...)

Also known as "social lending" or "crowd lending", a P2P lender is normally a platform that connects borrowers directly to investors. Its underwriting process is shown in Figure 1.4. A borrower applies for the loan; if he meets all the basic requirements, the platform sets a rate and term for that application based on the credit scoring model. After that, the P2P lender gives investors access to the loan, with information about the loan and the borrower (including the credit score), and investors decide whether or not to invest money in this loan. The lower the score, the higher the interest rate, which means investors can take more risk for possibly higher returns.

[Figure: P2P Lending Platform. Parties shown: private individuals, institutional investors, private individual (natural person), business (legal entity). Flows shown include repayments and interest.]

Figure 1.4 Operation flow chart of a P2P lender [5]

• Credit card (VIB, ...)

A credit card is a card issued by a commercial bank that enables the cardholder to borrow funds from that bank. Cardholders agree to pay the money back with interest, and since they can instantly borrow funds for each payment, they must have a good credit score.

1.4 Chapter summary

This chapter introduces the research area and outlines the background for the present study. It briefly reviews the context of Viet Nam's credit market and outlines some available credit scoring systems. The chapter then describes the types of loans for which the present study can help shorten the lending process. The next chapter presents the rationale of this thesis and the model used for credit scoring.

Chapter 2 Machine learning model for credit scoring

2.1 Credit scoring methods

2.1.1 Expert judgements-based method

As the name suggests, the expert judgements-based method relies on experts' appraisals of a credit risk. Risk is predicted using basic information:

• Character: Appraise the reputation and trustworthiness of the borrower.

• Capital: Appraise the difference between the assets and the source of capital of the borrower. Assets are all values that the bank can claim as soon as the borrower's loan is charged off. Source of capital can be all kinds of costs that the borrower is paying, such as family expenses, house rental, etc. After deducting all expenses, the experts can calculate the borrower's savings and whether that amount can cover the loan interest.

• Collateral: There are two types of loans according to collateral: mortgage loans (with collateral) and unsecured loans (without collateral). This thesis only discusses unsecured loans, since this type of loan is more likely to require credit information, especially a credit score.

• Capacity: All information that is directly relevant to the borrower's financial capability, such as employment, income, marital status, number of dependents, etc.

• Condition: Briefly appraise the borrower according to market conditions, financial context, competitive pressure, loan purpose, etc.

There are both advantages and disadvantages to consider when using the expert judgement technique, as shown in Table 2-1:

Table 2-1 Advantages and disadvantages of expert judgements-based method

Advantages | Disadvantages
Different perspectives | Time consuming
Valuable use of prior knowledge and experience | Costly if external experts are hired
Helps find creative solutions | With different projects, process activities may have different durations
Avoids re-inventing the wheel | Experts are human, so they need to rest, which affects their workload

2.1.2 Model method

The model method is based on a score quantified by machine learning models. This method has several advantages over the traditional expert judgements-based method:

• Models return the result immediately, which shortens the appraisal and makes the method more suitable for online lending platforms.

• The appraisal throughput of a model is much higher than that of experts, since a model can handle the workload of a hundred experts.

• It helps reduce direct labor costs, as banks or financial institutions no longer have to pay for appraisal experts.

• Profile appraisal results are consistent because a single credit scoring model is used, while experts can give different results based on their own perspective of risk.

2.1.3 Why is credit scoring important?

A credit score acts as a tool to help consumer lenders assess the creditworthiness of customers before deciding whether to lend to them. A credit score is also considered a scale to measure a customer's ability to borrow money, and it determines the maximum loan limit that the bank can disburse when a customer needs a loan. In addition, the credit score affects the customer's subsequent loans if it is lower than the minimum score allowed by a bank. In [6], the authors used a credit office data set and commercial bank customer transactions to build a forecasting model that identifies credit cardholders' defaults. Their results indicate cost savings of 6% to 25% of total losses when machine learning forecasting techniques are employed to estimate delinquency rates. This subject is therefore highly relevant to businesses. However, there is not much room for innovation due to the lack of data and confidentiality issues, which is also a difficulty of this study.

2.2 Credit scorecard model

Financial institutions and commercial banks have complex credit models that use the information contained in a data warehouse, such as salary, credit commitments and historical loan data, to determine the credit score of an applicant or an existing customer. The model generates a score that represents the probability that the lender will receive repayment on time if they give a person a loan or credit card.

A credit scorecard is one of those credit models. It is one of the most common because it is relatively easy for customers to interpret and it has existed for the last few decades, so its development process is standard and widely understood. Credit scorecards represent different characteristics of a customer (age, residency status, time at current address, time at current job, etc.) translated into points, and the total number of points is converted into the credit score. A credit scorecard is therefore a lookup table that maps a borrower's specific characteristics to points.

For example, a credit scorecard can give points to individual borrowers for their age and income according to Table 2-2.

Table 2-2 Example of a scorecard

Features | Values | Points
Income | 20 mil to 50 mil | 28

Using the scorecard in this example, a customer who is 31 years old and has an income of 30 million a year falls in the second age class (26-40) and gets 25 points for his age. Similarly, he gets 28 points for his income, so the total for these two features is 53. Many other features enter the scoring process, so building the scorecard model with machine learning algorithms is a sensible way to shorten this calculation.
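The worked example can be reproduced with a small lookup table. The sketch below is only illustrative: it contains just the two bins quoted in the text (age 26-40 and income 20 to 50 million), and the names `SCORECARD`, `feature_points` and `total_points` are hypothetical, not taken from the thesis code.

```python
# Partial scorecard: only the two bins quoted in the text are known here.
# Each entry is (lower bound inclusive, upper bound inclusive, points).
SCORECARD = {
    "age": [(26, 40, 25)],
    "income_mil": [(20, 50, 28)],
}


def feature_points(feature, value):
    """Return the points of the bin that contains `value`."""
    for low, high, points in SCORECARD[feature]:
        if low <= value <= high:
            return points
    raise ValueError(f"no bin defined for {feature}={value}")


def total_points(applicant):
    """Sum the points of every scored feature of an applicant."""
    return sum(feature_points(name, value) for name, value in applicant.items())


customer = {"age": 31, "income_mil": 30}
print(total_points(customer))  # 25 + 28 = 53
```

A full scorecard would simply add more bins per feature and more features to the same table.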

Many algorithms can be applied to build a credit scorecard model. In this thesis, I try Logistic regression and LightGBM, and the final choice is determined by the ease of interpretation.

2.2.1 LightGBM

LightGBM originated from the Gradient Boosting Decision Tree (GBDT), an ensemble learning approach that uses the decision tree as the base classifier. GBDT can enhance a weak classifier into a strong one by iterative training. In each iteration, GBDT learns a decision tree by fitting the negative gradients (also known as residual errors). The main cost in GBDT lies in learning the decision trees, and the most time-consuming part of learning a decision tree is finding the best split points. Traditional techniques for this step can be inefficient in both training speed and memory consumption. LightGBM therefore introduces two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). Details of the LightGBM theory can be found in [7].
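To make the "fit the negative gradients" idea concrete, here is a minimal pure-Python sketch of gradient boosting with one-split trees (stumps) on a one-dimensional toy problem. It is not LightGBM and does not implement GOSS or EFB; the data, the squared-error loss, the learning rate and all function names are assumptions chosen for illustration.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost stumps; each round fits the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data (assumed): a noisy step function.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

mean_y = sum(ys) / len(ys)
baseline_error = mse(lambda x: mean_y, xs, ys)
boosted_error = mse(gradient_boost(xs, ys), xs, ys)
print(baseline_error, boosted_error)
```

The inner loop of `fit_stump` scans every candidate threshold, which is the split-finding cost the paragraph above describes; LightGBM's contribution is to make that step cheaper on large data.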

In summary, LightGBM can be a good candidate for building the credit scorecard model for the following reasons:

• LightGBM can handle both classification and regression problems.

• GBDT is an ensemble method, and its performance is significantly better than that of most conventional machine learning methods. As one type of GBDT, LightGBM has been shown to have good stability and accuracy. It has a relatively small computational cost but provides a good training effect.

2.2.2 Logistic regression algorithm

Logistic regression is perhaps the most widely used algorithm within the consumer credit rating industry. A regression model generates a continuous response variable using linear combinations of predictor variables. Because credit rating is a binary problem, we want to reduce this result to 0 or 1. A logistic regression achieves this by applying a logistic transformation that maps an output in (-∞, +∞) to a probability between 0 and 1. In credit scoring, where there are only two groups of results (good and bad), binary logistic regression is used. The binary dependent variable takes the value 1 when the customer is a good loan and 0 otherwise.

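$$P(Y_i = 1) = \frac{1}{1 + e^{-Z_i}} \qquad (2.1)$$

where:

$$Z_i = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \varepsilon_i \qquad (2.2)$$

with:

$\beta_k$: the parameters of the model

$X_k$: the variables representing the explanatory factors of the probability of each user being a good loan

$\varepsilon_i$: the error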
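As a small illustration of equations (2.1) and (2.2), the sketch below computes the linear score and passes it through the logistic function. The intercept, coefficients and feature values are made-up numbers, the error term is omitted, and the function names are hypothetical.

```python
import math


def linear_score(intercept, coefficients, features):
    """Z = beta_0 + beta_1 * X_1 + ... + beta_k * X_k (error term omitted)."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))


def logistic(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Assumed parameters and inputs, for illustration only.
z = linear_score(-1.0, [0.8, -0.5], [2.0, 1.0])  # -1.0 + 1.6 - 0.5 = 0.1
p_good = logistic(z)
print(z, p_good)
```

A score of exactly zero maps to a probability of 0.5, and larger scores push the probability towards 1.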

The regression receives input features that are preprocessed by the Weight of Evidence (WoE) method. The output of the model is the default probability of a loan application: the higher the probability, the higher the risk. The calculated probability is then transformed, through scaling, into a credit score that represents the consumer's reliability. This score equals the sum of the scores of each of the consumer's features created by WoE.

2.2.3 Weight of Evidence (WoE)

WoE is one of the most common feature engineering and feature

measures the "strength" of the pool to differentiate between good and bad risk and attempts to find a monotonic relationship between the independent variables and the target variable. The criterion for ranking is the information value (IV), which helps rank features based on their relative importance.

The formula to calculate WoE is as follows:

$$\text{WoE} = \ln\left(\frac{\%\ \text{of good customers}}{\%\ \text{of bad customers}}\right) \qquad (2.3)$$

A positive WoE means that the proportion of good customers is higher than that of bad customers, and vice versa for a negative WoE value.

IV is calculated as follows:

$$\text{IV} = \sum (\%\ \text{of good customers} - \%\ \text{of bad customers}) \times \text{WoE} \qquad (2.4)$$

According to Siddiqi [8], by convention, the values of IV in credit scoring are interpreted as follows:

Table 2-3 IV values interpretation

Information Value | Interpretation
Less than 0.02 | Not useful for prediction
0.02 to 0.1 | Weak predictive power
0.1 to 0.3 | Medium predictive power
0.3 to 0.5 | Strong predictive power
> 0.5 | Suspicious predictive power
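The WoE and IV definitions and the interpretation bands of Table 2-3 can be sketched in a few lines. The good/bad counts per grade below are invented for illustration, and `woe_iv` and `interpret_iv` are hypothetical helper names rather than functions from the thesis.

```python
import math


def woe_iv(bins):
    """bins maps a bin name to (number of good loans, number of bad loans)."""
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe = {}
    iv = 0.0
    for name, (good, bad) in bins.items():
        pct_good = good / total_good
        pct_bad = bad / total_bad
        woe[name] = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * woe[name]
    return woe, iv


def interpret_iv(iv):
    """Interpretation bands following Table 2-3."""
    if iv < 0.02:
        return "not useful for prediction"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv <= 0.5:
        return "strong"
    return "suspicious"


# Assumed counts of (good, bad) loans per grade.
grade_bins = {"A": (400, 20), "B": (350, 30), "C": (250, 50)}
woe, iv = woe_iv(grade_bins)
print(woe)
print(round(iv, 4), interpret_iv(iv))
```

With these counts, grade A has a positive WoE (relatively more good loans) and grade C a negative one, which matches the sign convention stated above.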

Steps for WoE feature engineering:

1. Calculate WoE for each unique value (bin) of a categorical variable, e.g., for each of grade:A, grade:B, grade:C, etc.

2. Bin a continuous variable into discrete bins based on its distribution and number of unique observations (called fine classing).

3. Calculate WoE for each derived bin of the continuous variable.

4. Once WoE has been calculated for each bin of both categorical and numerical features, combine bins as per the following rules (called coarse classing).

Rules related to combining WoE bins:

1. Each bin should have at least 5% of the observations.

2. Each bin should be non-zero for both good and bad loans.

3. The WoE should be distinct for each category. Similar groups should be aggregated or binned together, because bins with similar WoE have almost the same proportion of good or bad loans, implying the same predictive power.

4. The WoE should be monotonic, i.e., either growing or decreasing with the bins.

5. Missing values are binned separately.

The above rules are generally accepted and well documented in academic literature [9].
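Some of these rules are easy to check mechanically. The sketch below tests the 5% rule, the non-zero rule and the monotonicity rule on an ordered list of bins; the bin data and the function name `check_bins` are assumptions for illustration, and the remaining rules (distinct WoE, separate bin for missing values) are left out.

```python
def check_bins(bins, min_share=0.05):
    """bins: ordered list of (name, good_count, bad_count, woe).

    Returns a list of human-readable rule violations (empty if none).
    """
    total = sum(good + bad for _, good, bad, _ in bins)
    problems = []
    for name, good, bad, _ in bins:
        if (good + bad) / total < min_share:
            problems.append(f"{name}: fewer than {min_share:.0%} of observations")
        if good == 0 or bad == 0:
            problems.append(f"{name}: no good or no bad loans")
    woes = [b[3] for b in bins]
    increasing = all(a <= b for a, b in zip(woes, woes[1:]))
    decreasing = all(a >= b for a, b in zip(woes, woes[1:]))
    if not (increasing or decreasing):
        problems.append("WoE is not monotonic across bins")
    return problems


# Assumed example bins: (name, good, bad, WoE).
acceptable = [("A", 400, 20, 0.69), ("B", 350, 30, 0.15), ("C", 250, 50, -0.69)]
# The WoE of bin B is a placeholder, since WoE is undefined with zero bad loans.
problematic = [("A", 500, 20, 0.5), ("B", 10, 0, 0.9), ("C", 400, 70, -0.3)]

print(check_bins(acceptable))
print(check_bins(problematic))
```

Bins that fail a check would be merged with a neighbouring bin during coarse classing and the WoE recalculated.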

2.2.4 k-fold cross validation

In this thesis, the machine learning model is evaluated using k-fold cross-validation. The data set is divided into 5 non-overlapping folds. Each fold is used once as a held-out test set, while all the other folds together are used as the training set. The general model

The k-fold cross-validation procedure is implemented in this thesis using RepeatedStratifiedKFold from the scikit-learn machine learning library.
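To show what a stratified 5-fold split does, here is a minimal pure-Python sketch. It is not the scikit-learn implementation: `stratified_k_fold` is a hypothetical helper, the labels are synthetic, and the repetition with different shuffles that RepeatedStratifiedKFold adds is left out.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) for k non-overlapping folds.

    Indices of each class are shuffled and dealt round-robin into the folds,
    so every fold keeps roughly the same good/bad proportion.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i
                       for idx in fold)
        yield train, test


# Synthetic labels (assumed): 90 good loans (1) and 10 bad loans (0).
labels = [1] * 90 + [0] * 10
splits = list(stratified_k_fold(labels, k=5))
for train, test in splits:
    bad_in_test = sum(1 for i in test if labels[i] == 0)
    print(len(train), len(test), bad_in_test)
```

Each fold serves as the test set exactly once, and with these labels every test fold contains 18 good and 2 bad loans.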

2.3 Chapter summary

This chapter first introduces and compares two typical credit scoring methods: the expert judgements-based method and the model method. It explains the credit scorecard model used in this thesis. The chapter then covers the logistic regression algorithm, LightGBM, the Weight of Evidence method and the model evaluation method implemented in this thesis, k-fold cross validation. In the next chapter, we will look at the dataset, start building the model and eventually evaluate it.

Chapter 3 Implementation

3.1 Runtime environment

The model is built, trained and tested on a MacBook Pro (13-inch, 2016, Two Thunderbolt 3 ports) running macOS Big Sur 11.4, with the technical specifications [10] shown in Figure 3.1:

Figure 3.1 Runtime environment technical specification

• Processor: 2.0GHz dual-core Intel Core i5, Turbo Boost up to 3.1GHz, with 4MB shared L3 cache

• Storage: 256GB PCIe-based onboard SSD

• Memory: 8GB of 1866MHz LPDDR3 onboard memory

3.2 Preliminary Data Exploration & Splitting

We will use a dataset available on Kaggle that relates to consumer loans provided by Lending Club [11]. The raw data includes information on more than 450,000 consumer loans granted between 2007 and 2014, with almost 75 features, including the current loan status and various

Table 3-1 Dataset info

Item | Total Dataset
Number of observations | 466,285
Number of features | 74
Features with missing values | 18

Initial data exploration reveals the following:

• As shown in Figure 3.2, 18 features have more than 80% missing values. Given the high proportion of missing values, any technique to impute them is likely to produce inaccurate results.

• Certain static features are not related to credit risk, e.g., id, member_id, url, title.

• Other forward-looking features are expected to be populated only after the borrower has defaulted, for example recoveries and collection_recovery_. Because our goal is to predict the future probability of default, including such features in the model would be counterproductive, as they are not observed until the default event occurs.

All the above features will be dropped for the reasons listed in Table 3-2.

Table 3-2 Reasons to drop features after preliminary data exploration.

Features | Reason to drop
18 features in Figure 3.2 | Values missing as explained above
id, member_id, title, emp_title, url |
sub_grade | Same information is captured in the grade column
 | Supposed to have future dates; therefore, they will not make sense for the model

Figure 3.2 List of 18 features with more than 80% missing values

3.2.1 Identify Target Variable

Based on the data exploration, our target variable appears to be loan_status. A quick look at the unique values of loan_status and their proportions, shown in Figure 3.3, confirms this.

Current: 0.480878
Fully Paid: 0.396193
Charged Off: 0.091092
Late (31-120 days): 0.014798
In Grace Period: 0.006747

Based on domain knowledge, we will classify loans with the followingloan_status values as being in default (or 0):

e Charged Off

e Default

e Late (31-120 days)

¢ Does not meet the credit policy Status:Charged Off

All the other values will be classified as good (or 1).
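The labelling rule above can be sketched as follows. The target column name good_bad, the helper add_target and the toy data are assumptions made for illustration; the status strings must match the values in the dataset exactly.

```python
import numpy as np
import pandas as pd

# loan_status values treated as default (0), as listed above.
DEFAULT_STATUSES = [
    "Charged Off",
    "Default",
    "Late (31-120 days)",
    "Does not meet the credit policy. Status:Charged Off",
]


def add_target(df: pd.DataFrame, target: str = "good_bad") -> pd.DataFrame:
    """Return a copy with a binary target: 0 for default, 1 for good."""
    out = df.copy()
    out[target] = np.where(out["loan_status"].isin(DEFAULT_STATUSES), 0, 1)
    # Assumption: loan_status is dropped once it is encoded in the target.
    return out.drop(columns=["loan_status"])


# Toy data (hypothetical) with a few of the status values.
loans = pd.DataFrame(
    {"loan_status": ["Current", "Fully Paid", "Charged Off", "Default", "In Grace Period"]}
)
labelled = add_target(loans)
print(labelled["good_bad"].tolist())
```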

3.2.2 Data Split

We now divide our data into two sets: training (80%) and testing (20%). We will perform Repeated Stratified K-Fold cross-validation on the training set to pre-evaluate our model, while the test set will remain untouched until the final evaluation of the model. This approach follows the best evaluation of the


Figure 3.3 above shows that our data, as expected, is heavily skewed towards good loans. Therefore, in addition to random shuffled sampling, we will also stratify the train/test split so that the distribution of good and bad loans in the test set is the same as in the data before the split. This is achieved through the stratify parameter of the train_test_split function, as shown in the code snippet in Figure 3.4.

Figure 3.4 train_test_split configuration code snippet
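As an illustration of the configuration described above, a stratified 80/20 split followed by a Repeated Stratified K-Fold splitter could look like the sketch below. It is not a reproduction of Figure 3.4: the synthetic data, the target name good_bad, the random_state values, and the choice of 5 folds with 3 repeats are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import RepeatedStratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in for the loan data (roughly 90% good loans).
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {
        "int_rate": rng.uniform(5, 25, size=1000),
        "good_bad": (rng.random(1000) > 0.1).astype(int),
    }
)
X = data.drop(columns="good_bad")
y = data["good_bad"]

# stratify=y keeps the good/bad ratio identical in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Cross-validation splitter for the training set only; the test set stays untouched.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)

print(len(X_train), len(X_test))
print(round(y.mean(), 3), round(y_test.mean(), 3))
```

The cv object can later be passed to a function such as cross_val_score together with X_train and y_train.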

3.3 Data Cleaning

The data cleaning process includes the following tasks:

• Remove text from the emp_length column (e.g., years) and convert it to numeric.

• For all date columns: convert them to Python's datetime format, create a new column holding the difference between the model development date and the respective date feature, and then drop the original feature.

• Remove text from the term column and convert it to numeric.

We will define a helper function for each of the above tasks and apply them to the training dataset. These functions are emp_length_converter, date_columns, and loan_term_converter.
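A minimal sketch of the three helpers is given below. Only the function names come from the text; the bodies are assumptions. In particular, the sketch assumes that dates are stored as strings such as 'Dec-11', that the term looks like ' 36 months', that a missing employment length counts as 0 years, and that the reference (model development) date is the hypothetical 2020-08-01.

```python
import pandas as pd


def emp_length_converter(df: pd.DataFrame, column: str) -> None:
    """Turn strings such as '10+ years' or '< 1 year' into integers, in place."""
    text = df[column].str.replace("< 1 year", "0 years", regex=False)
    years = pd.to_numeric(text.str.extract(r"(\d+)", expand=False))
    # Assumption: a missing employment length is treated as 0 years.
    df[column] = years.fillna(0).astype(int)


def date_columns(df: pd.DataFrame, column: str, reference_date: str = "2020-08-01") -> None:
    """Replace a 'Mon-YY' date column with the months elapsed up to reference_date."""
    reference = pd.Timestamp(reference_date)
    parsed = pd.to_datetime(df[column], format="%b-%y")
    months = (reference.year - parsed.dt.year) * 12 + (reference.month - parsed.dt.month)
    df["mths_since_" + column] = months
    df.drop(columns=[column], inplace=True)


def loan_term_converter(df: pd.DataFrame, column: str) -> None:
    """Turn strings such as ' 36 months' into integers, in place."""
    df[column] = pd.to_numeric(df[column].str.replace("months", "", regex=False).str.strip())


# Toy data (hypothetical values) in the assumed raw formats.
sample = pd.DataFrame(
    {
        "emp_length": ["10+ years", "< 1 year", "3 years", None],
        "term": [" 36 months", " 60 months", " 36 months", " 60 months"],
        "issue_d": ["Dec-11", "Jan-14", "Dec-11", "Jan-14"],
    }
)
emp_length_converter(sample, "emp_length")
loan_term_converter(sample, "term")
date_columns(sample, "issue_d")
print(sample)
```

Two-digit years are ambiguous for very old dates, so the parsed values should be checked for dates that land in the future before the month differences are used.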

3.4 Feature Selection

We will perform feature selection to identify the most suitable features for our binary classification problem, using the Chi-squared test for categorical features and the ANOVA F-statistic for numerical features. First, a brief definition of these methods.


The Chi-squared test is used to determine the extent of the relationship or dependence between two categorical variables: in our case, a categorical input feature and a categorical target variable.
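One possible way to run this test for every categorical feature is sketched below, using scipy.stats.chi2_contingency on a contingency table of the target against each feature. The helper name chi2_p_values and the toy data are assumptions; the thesis code may compute the test differently.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_p_values(X_cat: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Chi-squared test of independence between each categorical feature and y."""
    rows = []
    for column in X_cat.columns:
        # Contingency table: target classes in rows, feature levels in columns.
        table = pd.crosstab(y, X_cat[column])
        _, p_value, _, _ = chi2_contingency(table)
        rows.append((column, p_value))
    result = pd.DataFrame(rows, columns=["Feature", "p-value"])
    # Ascending order: the smallest p-values (strongest dependence) come first.
    return result.sort_values("p-value", ignore_index=True)


# Toy data (hypothetical): 'grade' follows the target, 'noise' is independent of it.
y_toy = pd.Series([0] * 50 + [1] * 50)
X_cat_toy = pd.DataFrame(
    {
        "noise": ["x", "y"] * 50,
        "grade": ["A"] * 50 + ["B"] * 50,
    }
)
chi2_result = chi2_p_values(X_cat_toy, y_toy)
print(chi2_result)
```

A small p-value means the hypothesis of independence is rejected, so the feature is a candidate to keep.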

The Analysis of Variance (ANOVA) F-statistic is the ratio of the variance between the means of two or more samples of data to the variance within those samples. The higher this ratio between a numerical input feature and a categorical target feature, the lower the independence between the two and the more likely the feature is to be useful for model training.
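The F-statistic for each numerical feature can be obtained, for example, with scikit-learn's f_classif, as in the sketch below. The helper name anova_f_table, the F-Score column name and the synthetic data are assumptions. Note that f_classif does not accept missing values, so they must be handled beforehand.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif


def anova_f_table(X_num: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """ANOVA F-statistic and p-value of each numerical feature against y."""
    f_values, p_values = f_classif(X_num, y)
    table = pd.DataFrame(
        {"Numerical_Feature": X_num.columns, "F-Score": f_values, "p values": p_values}
    )
    # Descending order: the most discriminative features come first.
    return table.sort_values("F-Score", ascending=False, ignore_index=True)


# Synthetic data (hypothetical): 'informative' shifts with the class, 'noise' does not.
rng = np.random.default_rng(0)
y_toy = pd.Series(np.repeat([0, 1], 100))
X_num_toy = pd.DataFrame(
    {
        "noise": rng.normal(size=200),
        "informative": y_toy * 2.0 + rng.normal(size=200),
    }
)
anova = anova_f_table(X_num_toy, y_toy)
print(anova)
```

Keeping the top k features then reduces to taking the first k rows of the sorted table.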

The p-values from our Chi-squared test on the categorical features, in ascending order, are shown in Figure 3.5. For simplicity, we keep only the top four features and remove the rest.

Feature                p-value
grade                  0.000000
home_ownership         0.000000
verification_status    0.000000
purpose                0.000000
addr_state             0.000000
initial_list_status    0.000000
pymnt_plan             0.000923
application_type       1.000000

Figure 3.5 Calculated p-values of features

The ANOVA F-statistic for the 34 numerical features shows a wide range of F-values, as in Figure 3.6, from 23.513 to 0.39. We will keep the top 20


Next, we will calculate the pair-wise correlations of the selected top 20 numerical features to detect any potentially multicollinear variables. A heat-map of these pair-wise correlations, shown in Figure 3.7, identifies two features (out_prncp_inv and total_pymnt_inv) as highly correlated. Therefore, we will also drop them from the model.
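The correlation check can also be done numerically instead of reading it off the heat-map. The sketch below lists feature pairs whose absolute Pearson correlation exceeds a threshold. The helper name highly_correlated_pairs, the 0.9 threshold and the synthetic columns are assumptions.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(X_num: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return (feature, feature) pairs with absolute correlation above threshold."""
    corr = X_num.corr().abs()
    # Keep only the upper triangle so that each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [
        (row, col)
        for row in upper.index
        for col in upper.columns
        if upper.loc[row, col] > threshold  # masked cells are NaN and never pass
    ]


# Synthetic data (hypothetical): 'b' is almost a multiple of 'a', 'c' is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X_corr_toy = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + rng.normal(scale=0.01, size=200),
        "c": rng.normal(size=200),
    }
)
pairs = highly_correlated_pairs(X_corr_toy)
print(pairs)
```

For each reported pair, at least one of the two features is a candidate for removal.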

[Figure 3.6: table of ANOVA F-statistics with columns Numerical_Feature and p values; legible feature names include mths_since_last_pymnt_d, total_pymnt_inv, total_pymnt, int_rate and last_pymnt_amnt]

Figure 3.7 Heat-map shows that out_prncp_inv and total_pymnt_inv are highly correlated

Date posted: 02/10/2024, 05:14