Imbalanced data in classification: A case study of credit scoring
MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF ECONOMICS HO CHI MINH CITY

Ho Chi Minh City - 2024
STATEMENT OF AUTHENTICATION

I certify that the Ph.D. dissertation, "Imbalanced data in classification: A case study of credit scoring", is solely my own research.

This dissertation has been used only for the Ph.D. degree at the University of Economics Ho Chi Minh City (UEH), and no part of it has been submitted to any other university or organization to obtain any other degree. Any studies of other authors used in this dissertation are properly cited.

Ho Chi Minh City, April 2, 2024
ACKNOWLEDGMENT

First of all, I would like to express my deepest gratitude to my supervisors, Assoc. Prof. Dr. Le Xuan Truong and Dr. Ta Quoc Bao, for their scientific direction and dedicated guidance throughout the process of conducting this Ph.D. dissertation.

I sincerely thank the teachers of the UEH doctoral training program for imparting valuable knowledge, and the teachers at the Department of Mathematics and Statistics, UEH, for their sincere comments on my dissertation.

I sincerely thank Dr. Le Thi Thanh An for her moral and academic support so that I could complete the research. Besides, I really appreciate the interest and help of my colleagues at Ho Chi Minh City University of Banking.

Finally, I am grateful for the unconditional support that my mother and my family have given to me on my educational path.

Ho Chi Minh City, April 2, 2024
TABLE OF CONTENTS

1 INTRODUCTION
1.1 Overview of imbalanced data in classification
1.2 Motivations
1.3 Research gap identifications
1.3.1 Gaps in credit scoring
1.3.2 Gaps in the approaches to solving imbalanced data
1.3.3 Gaps in Logistic regression with imbalanced data
1.4 Research objectives, research subjects, and research scopes
1.4.1 Research objectives
1.4.2 Research subjects
1.4.3 Research scopes
1.5 Research data and research methods
1.5.1 Research data
1.5.2 Research methods
1.6 Contributions of the dissertation
1.7 Dissertation outline
2 LITERATURE REVIEW OF IMBALANCED DATA
2.1 Imbalanced data in classification
2.1.1 Description of imbalanced data
2.1.2 Obstacles in imbalanced classification
2.1.3 Categories of imbalanced data
2.2 Performance measures for imbalanced data
2.2.1 Performance measures for labeled outputs
2.2.1.1 Single metrics
2.2.1.2 Complex metrics
2.2.2 Performance measures for scored outputs
2.2.2.1 Area under the Receiver Operating Characteristics Curve
2.2.2.2 Kolmogorov-Smirnov statistic
2.2.2.3 H-measure
2.2.3 Conclusion of performance measures in imbalanced classification
2.3 Approaches to imbalanced classification
2.3.1 Algorithm-level approach
2.3.1.1 Modifying the current classifier algorithms
2.3.1.2 Cost-sensitive learning
2.3.1.3 Comments on algorithm-level approach
2.3.2 Data-level approach
2.3.2.1 Under-sampling method
2.3.2.2 Over-sampling method
2.3.2.3 Hybrid method
2.3.2.4 Comments on data-level approach
2.3.3 Ensemble-based approach
2.3.3.1 Integration of algorithm-level method and ensemble classifier algorithm
2.3.3.2 Integration of data-level method and ensemble classifier algorithm
2.3.3.3 Comments on ensemble-based approach
2.3.4 Conclusions of approaches to imbalanced data
2.4 Credit scoring
2.4.1 Meaning of credit scoring
2.4.2 Inputs for credit scoring models
2.4.3 Interpretability of credit scoring models
2.4.4 Approaches to imbalanced data in credit scoring
2.4.5 Recent credit scoring ensemble models
2.5 Chapter summary
3 IMBALANCED DATA IN CREDIT SCORING
3.1 Classifiers for credit scoring
3.1.1 Single classifiers
3.1.1.1 Discriminant analysis
3.1.1.2 K-nearest neighbors
3.1.1.3 Logistic regression
3.1.1.4 Lasso-Logistic regression
3.1.1.5 Decision tree
3.1.1.6 Support vector machine
3.1.1.7 Artificial neural network
3.1.2 Ensemble classifiers
3.1.2.1 Heterogeneous ensemble classifiers
3.1.2.2 Homogeneous ensemble classifiers
3.1.3 Conclusions of statistical models for credit scoring
3.2 The proposed credit scoring ensemble model based on Decision tree
3.2.1 The proposed algorithms
3.2.1.1 Algorithm for balancing data - OUS(B) algorithm
3.2.1.2 Algorithm for constructing ensemble classifier - DTE(B) algorithm
3.2.2 Empirical data sets
3.2.3 Computation process
3.2.4 Empirical results
3.2.4.1 The optimal Decision tree ensemble classifier
3.2.4.2 Performance of the proposed model on the Vietnamese data sets
3.2.4.3 Performance of the proposed model on the public data sets
3.2.4.4 Evaluations
3.2.5 Conclusions of the proposed credit scoring ensemble model based on Decision tree
3.3 The proposed algorithm for imbalanced and overlapping data
3.3.1 The proposed algorithms
3.3.1.1 Algorithm for dealing with noise, overlapping, and imbalanced data
3.3.1.2 Algorithm for constructing ensemble model
3.3.2 Empirical data sets
3.3.3 Computation process
3.3.3.1 Computation protocol of the Lasso Logistic ensemble
3.3.3.2 Computation protocol of the Decision tree ensemble
3.3.4 Empirical results
3.3.4.1 The optimal ensemble classifier
3.3.4.2 Performance of LLE(B)
3.3.4.3 Performance of DTE(B)
3.3.5 Conclusions of the proposed technique
3.4 Chapter summary
4 A MODIFICATION OF LOGISTIC REGRESSION WITH IMBALANCED DATA
4.1 Introduction
4.2 Related works
4.2.1 Prior correction
4.2.2 Weighted likelihood estimation (WLE)
4.2.3 Penalized likelihood regression (PLR)
4.3 The proposed works
4.3.1 The modification of the cross-validation procedure
4.3.2 The modification of Logistic regression
4.4 Empirical study
4.4.1 Empirical data sets
4.4.2 Performance measures
4.4.3 Computation process
4.4.4 Empirical results
4.4.5 Statistical test
4.4.6 Important variables for output
4.4.6.1 Important variables for F-LLR fitted model
4.4.6.2 Important variables of the Vietnamese data set
4.5 Discussions and Conclusions
4.5.1 Discussions
4.5.2 Conclusions
4.6 Chapter summary
5 CONCLUSIONS
5.1 Summary of contributions
5.1.1 The interpretable credit scoring ensemble classifier
5.1.2 The technique for imbalanced data, noise, and overlapping samples
5.1.3 The modification of Logistic regression
5.2 Implications
5.3 Limitations and suggestions for further research
C.1 German credit data set (GER)
C.2 Vietnamese 1 data set (VN1)
C.3 Vietnamese 2 data set (VN2)
C.4 Taiwanese credit data set (TAI)
C.5 Bank personal loan data set (BANK)
C.6 Hepatitis C patients data set (HEPA)
C.7 The Loan schema data from lending club (US)
C.8 Vietnamese 3 data set (VN3)
C.9 Australian credit data set (AUS)
C.10 Credit risk data set (Credit 1)
C.11 Credit card data set (Credit 2)
C.12 Credit default data set (Credit 3)
C.13 Vietnamese 4 data set (VN4)
LIST OF ABBREVIATIONS

FN, FNR    False negative, False negative rate
LIST OF FIGURES

2.1 Examples of circumstances of imbalanced data
2.2 Illustration of ROCs
2.3 Illustration of KS metric
2.4 Illustration of RUS technique
2.5 Illustration of CNN rule
2.6 Illustration of Tomek-links
2.7 Illustration of ROS technique
2.8 Illustration of SMOTE technique
2.9 Approaches to imbalanced data in classification
3.1 Illustration of a Decision tree
3.2 Illustration of a decision boundary of SVM
3.3 Illustration of a two-hidden-layer ANN
3.4 Importance level of features of the Vietnamese data sets
3.5 Computation protocol of the proposed ensemble classifier
4.1 Illustration of F-CV
4.2 Illustration of F-LLR
LIST OF TABLES

1.1 General implementation protocol in the dissertation
2.1 Confusion matrix
2.2 Representatives employing the algorithm-level approach to ID
2.3 Cost matrix in Cost-sensitive learning
2.4 Summary of SMOTE algorithm
2.5 Representatives employing the data-level approach to ID
2.6 Representatives employing the ensemble-based approach to ID
3.1 Representatives of classifiers in credit scoring
3.2 OUS(B) algorithm
3.3 DTE(B) algorithm
3.4 Description of empirical data sets
3.5 Computation protocol of empirical study on DTE
3.6 Performance measures of DTE(B) on the Vietnamese data sets
3.7 Performance of ensemble classifiers on the Vietnamese data sets
3.8 Performance of ensemble classifiers on the German data set
3.9 Performance of ensemble classifiers on the Taiwanese data set
3.10 TOUS(B) algorithm
3.11 TOUS-F(B) algorithm
3.12 Description of empirical data sets
3.13 Average testing AUC of the proposed ensembles
3.14 Average testing AUC of the models based on LLR
3.15 Average testing AUC of the tree-based ensemble classifiers
4.1 Cross-validation procedure for Lasso Logistic regression
4.2 F-measure-oriented Cross-Validation Procedure
4.3 Algorithm for F-LLR classifier
4.4 Description of empirical data sets
4.5 Implementation protocol of empirical study
4.6 Average testing performance measures of classifiers
4.7 Average testing performance measures of classifiers (cont.)
4.8 The number of wins of F-LLR on empirical data sets
4.9 Important features of the Vietnamese data set
4.10 Important features of the Vietnamese data set (cont.)
B.1 Algorithm of Bagging classifier
B.2 Algorithm of Random Forest
B.3 Algorithm of AdaBoost
C.1 Summary of the German credit data set
C.2 Summary of the Vietnamese 1 data set
C.3 Summary of the Vietnamese 2 data set
C.4 Summary of the Taiwanese credit data set (a)
C.5 Summary of the Taiwanese credit data set (b)
C.6 Summary of the Bank personal loan data set
C.7 Summary of the Hepatitis C patients data set
C.8 Summary of the Loan schema data from lending club (a)
C.9 Summary of the Loan schema data from lending club (b)
C.10 Summary of the Loan schema data from lending club (c)
C.11 Summary of the Vietnamese 3 data set
C.12 Summary of the Australian credit data set
C.13 Summary of the Credit 1 data set
C.14 Summary of the Credit 2 data set
C.15 Summary of the Credit 3 data set
C.16 Summary of the Vietnamese 4 data set
ABSTRACT

In classification, imbalanced data occurs when there is a great difference in the quantities of the classes of the training data set. This problem frequently arises in various fields, for example, credit scoring and medical diagnosis. With imbalanced data, predictive modeling for real-world applications poses a challenge because most machine learning algorithms are designed for balanced data sets. Therefore, addressing imbalanced data has attracted much attention from researchers and practitioners.

In this dissertation, we propose solutions for imbalanced classification. Furthermore, these solutions are applied to a credit scoring case study. The solutions are derived from three papers published in scientific journals:

• The first paper presents an interpretable Decision tree ensemble model for imbalanced credit scoring data sets.

• The second paper introduces a novel technique for addressing imbalanced data, particularly in the cases of overlapping and noisy samples.

• The final paper proposes a modification of Logistic regression focusing on optimizing the F-measure, a popular metric in imbalanced classification.

These classifiers have been trained on a range of public and private data sets with highly imbalanced status and overlapping classes. The primary results demonstrate that the proposed works outperform both traditional and some recent models.
TÓM TẮT

With imbalanced data, predictive modeling for real-world applications has posed a great challenge because most machine learning algorithms are designed for balanced data. Therefore, handling imbalanced data in classification has been attracting much attention from researchers and practitioners.

In this dissertation, we propose several solutions for the classification problem with imbalanced data. These solutions are applied to a case study of credit scoring. The new results of the dissertation are drawn from three papers published in scientific journals:

• The first paper proposes an interpretable model. It is an ensemble of Decision tree models, applied to credit scoring.

• The second paper introduces a new technique for imbalanced data, especially in the case of data with class overlapping and noise.

• The third paper proposes a modification of the Logistic regression model. This modification focuses on maximizing the F-measure, a popular performance measure in imbalanced classification problems.

These classification models are tested on public and private data sets with imbalanced and class-overlapping characteristics. The results demonstrate that our models outperform traditional models and recently proposed ones.
Chapter 1
INTRODUCTION
1.1 Overview of imbalanced data in classification
Nowadays, classification plays a crucial role in several fields, for example, medicine (cancer diagnosis), finance (fraud detection), business administration (customer churn prediction), information retrieval (oil spill tracking, telecommunication fraud), image identification (face recognition), and so on. Classification is the problem of predicting a class label for a given sample. On training data sets that comprise samples with different label types, classification algorithms learn samples' features to recognize the labels' patterns. After that, these patterns, now presented as a fitted classification model, are used to predict the labels of new samples.

Classification is categorized into two types, binary and multi-classification. Binary classification, the basic type, focuses on two-class label problems, whereas multi-classification solves tasks with several class labels. Multi-classification is sometimes reduced to binary classification with two classes: one class corresponding to the label of concern, and the other representing the remaining labels. In binary classification, data sets are partitioned into positive and negative classes. The positive class is the class of interest, which has to be identified in the classification task. In this dissertation, we focus on binary classification. For convenience, we define some concepts as follows.
Definition 1.1.1. Let S = X × Y be the set of samples, where X ⊂ R^k is the domain of samples' features and Y = {0, 1} is the set of labels. The subset of samples labeled 1 is called the positive class, denoted S+. The remaining subset is called the negative class, denoted S−. A sample s ∈ S+ is called a positive sample; otherwise, it is called a negative sample.
Definition 1.1.2. A binary classifier f is a function mapping the domain of features X to the set of labels {0, 1}.

For a given sample s0 = (x0, y0) ∈ S, there are four possibilities:

• If f(x0) = y0 = 1, s0 is called a true positive sample.
• If f(x0) = y0 = 0, s0 is called a true negative sample.
• If f(x0) = 1 and y0 = 0, s0 is called a false positive sample.
• If f(x0) = 0 and y0 = 1, s0 is called a false negative sample.

Definition 1.1.3. The numbers of true positive, true negative, false positive, and false negative samples are denoted TP, TN, FP, and FN, respectively.
Some popular criteria used to evaluate the performance of a classifier are accuracy, true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR):

    Accuracy = (TP + TN) / N,
    TPR = TP / (TP + FN),    FNR = FN / (TP + FN),
    TNR = TN / (TN + FP),    FPR = FP / (TN + FP),

where N = TP + TN + FP + FN is the total number of samples.
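To make these rates concrete, here is a minimal base-R sketch; the function name and the example counts are illustrative, not taken from the dissertation.

    # Performance rates from the four counts TP, TN, FP, FN.
    rates <- function(TP, TN, FP, FN) {
      N <- TP + TN + FP + FN
      c(accuracy = (TP + TN) / N,
        TPR = TP / (TP + FN),   # sensitivity: share of positives found
        TNR = TN / (TN + FP),   # specificity: share of negatives found
        FPR = FP / (TN + FP),
        FNR = FN / (TP + FN))
    }
    rates(TP = 10, TN = 900, FP = 50, FN = 40)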
For example, in fraud detection, the customers are divided into "bad" and "good" classes. Since the credit regulations are made public and the customers have preliminarily been screened before applying for a loan, a credit data set often includes a majority class of good customers and a minority class of bad ones. The loss of misclassifying the "bad" as "good" is often far greater than the loss of misclassifying the "good" as "bad". Hence, identifying the bad customers is often considered more crucial than the other task. Consider a list of credit customers consisting of 95% good and 5% bad. If pursuing high accuracy, we can choose a trivial classifier mapping all customers to the good label. Then the accuracy of this classifier is 95%, but its TPR is 0%. In other words, this classifier is unable to identify any bad customer. Instead, another classifier with a lower accuracy but a greater TPR should be considered to replace this trivial classifier.
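The arithmetic of this example can be checked directly; the counts below correspond to the 95/5 portfolio above with 1,000 customers, all predicted "good".

    # Trivial classifier on 950 good (negative) and 50 bad (positive) customers.
    TP <- 0; FN <- 50; TN <- 950; FP <- 0
    (TP + TN) / (TP + TN + FP + FN)   # accuracy = 0.95
    TP / (TP + FN)                    # TPR = 0: no bad customer is identified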
Another example of rare-class classification is cancer diagnosis. In this case, the data set has two classes, "malignant" and "benign". The number of malignant patients is always much smaller than the number of benign ones. However, malignancy is the first target of any cancer diagnosis process because of the heavy consequences of missing cancer patients. Therefore, it is unreasonable to rely on the accuracy metric to evaluate the performance of cancer diagnosis classifiers.
The phenomenon of skewed distribution in training data sets for classification is known as imbalanced data.

Definition 1.1.4. Let S be a training data set with S+ and S− its positive and negative classes, respectively. If the quantity of S+ is far less than that of S−, S is called an imbalanced data set. Besides, the imbalanced ratio (IR) of S is defined as the ratio of the quantities of the negative and positive classes:

    IR = |S−| / |S+|.
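As a small illustration, the IR of a labeled data set can be computed in R from a 0/1 label vector; the vector here is artificial.

    y  <- c(rep(0, 950), rep(1, 50))   # 950 negative, 50 positive samples
    IR <- sum(y == 0) / sum(y == 1)    # |S-| / |S+| = 19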
Standard classifiers are affected by the errors of type I and type II (Shen, Zhao, Li, Li, & Meng, 2019). Therefore, the classification results are often biased toward the majority class (the negative class) (Galar, Fernandez, Barrenechea, Bustince, & Herrera, 2011; Haixiang et al., 2017). In the case of a rather high imbalanced ratio, the minority class (the positive class) is usually ignored since the common classifiers often treat it as noise or outliers. Hence, the goal of recognizing the patterns of the positive class fails, although identifying the positive samples is often the crucial task of imbalanced classification. Therefore, imbalanced data is a challenge in classification.
Besides, empirical studies showed that as the imbalanced ratio increased, the overall model performance decreased (Brown & Mues, 2012). Furthermore, some authors stated that imbalanced data was not the only main reason for poor performance: noise and overlapping samples also degraded the performance of learning methods (Batista, Prati, & Monard, 2004; Haixiang et al., 2017). Thus, researchers and practitioners should deeply understand the nature of data sets to handle them correctly.
A typical case study of imbalanced classification is credit scoring. This issue is reflected in the bad debt ratio of commercial banks. For example, in Vietnam, the on-balance sheet bad debt ratio was 1.9% in 2021 and 1.7% in 2020. Besides, the gross bad debt ratio (including on-balance sheet bad debt, unresolved bad debt sold to VAMC, and potential bad debt from restructuring) was 7.3% in 2021 and 5.1% in 2020.¹ Although bad customers account for a very small part of the credit customers, the consequences of bad debt for a bank are extremely heavy. In countries where most economic activities rely on the banking system, an increase in the bad debt ratio may not only threaten the operation of the banking system but also push the economy into a series of collapses. Therefore, it is important to identify the bad customers in credit scoring.
In Vietnam, the credit market is tightly controlled by regulations of the State Bank. Commercial banks now consciously manage credit risk by strictly applying credit appraisal processes before funding. In the field of academic research, credit scoring has attracted many authors (Bình & Anh, 2021; Hưng & Trang, 2018; Quỳnh, Anh, & Linh, 2018; Thắng, 2022). However, few works have addressed the imbalanced issue (Mỹ, 2021).
¹ https://sbv.gov.vn/webcenter/portal/vi/links/cm255?dDocName=SBV489213
These facts prompted us to study imbalanced classification deeply. The dissertation, titled "Imbalanced data in classification: A case study of credit scoring", aims to find suitable solutions for imbalanced data and related issues, with a case study of credit scoring in Vietnam.

1.3 Research gap identifications
1.3 Research gap identifications
1.3.1 Gaps in credit scoring
In the dissertation, we choose credit scoring as a case study of imbalanced classification.
Credit scoring is an arithmetical representation based on the analysis of the creditworthiness of customers (Louzada, Ara, & Fernandes, 2016). Credit scoring provides valuable information to banks and finance institutions in order not only to hedge credit risk but also to standardize regulations on credit management. Therefore, credit-scoring classifiers have to meet two significant requirements:

i) the ability to accurately classify the bad customers;

ii) the ability to easily explain the predicted results of the classifiers.

Over the two recent decades, the first requirement has been addressed through the development of methods to improve the performance of credit scoring models. They are traditional statistical methods (K-nearest neighbors, Discriminant analysis, and Logistic regression) and popular machine learning models (Decision tree, Artificial neural network, and Support vector machine) (Baesens et al., 2003; Brown & Mues, 2012; Louzada et al., 2016). Those are called single classifiers. The effectiveness of a single classifier is not similar across data sets. For example, some studies showed that Logistic regression outperformed Decision tree (Marqués, García, & Sánchez, 2012; Wang, Ma, Huang, & Xu, 2012), but another result concluded that Logistic regression worked worse than Decision tree (Bensic, Sarlija, & Zekic-Susac, 2005). Besides, although Baesens et al. (2003) found that Support vector machine was better than Logistic regression, Li et al. (2019) and Van Gestel et al. (2006) indicated that there was an insignificant difference among Support vector machine, Logistic regression, and Linear discriminant analysis. In summary, empirical credit scoring studies lead to the important conclusion that there is no best single classifier for all data sets.
Under the development of computational software and programming languages, there has been a shift from single classifiers to ensemble ones. The term "ensemble classifier" or "ensemble model" refers to a collection of multiple classifier algorithms. Ensemble models work by leveraging the collective power of multiple sub-classifiers for decision-making. In the literature on credit scoring, empirical studies concluded that ensemble models had superior performance to single ones (Brown & Mues, 2012; Dastile, Celik, & Potsane, 2020; Lessmann, Baesens, Seow, & Thomas, 2015; Marqués et al., 2012). However, ensemble algorithms do not directly handle the imbalanced data issue.
While the second requirement of a credit scoring model often attracts less attention than the first, its role is equally important. It provides the reasons for the classification results, which is the framework for assessing, managing, and hedging credit risk. For example, nowadays, customers' features are collected into empirical data sets more and more diversely, but not all of them are useful for credit scoring. Administrators need to know which information in the classification model influences the likelihood of default in order to set transparent credit standards. There is usually a trade-off between the effectiveness and transparency of classifiers (Brown & Mues, 2012). As performance measures increase, explaining the predicted result becomes more difficult. For example, single classifiers such as Discriminant analysis, Logistic regression, and Decision tree are interpretable, but they usually work far less effectively than Support vector machine and Artificial neural network, which are representatives of "black box" classifiers. Another case is ensemble classifiers. Most of them operate in an incomprehensible process although they have outstanding performance. Even with popular ensemble classifiers such as Bagging tree, Random forest, or AdaBoost, which do not have very complicated structures, their interpretability is not discussed. According to Dastile et al. (2020), in the credit scoring application, only 8% of studies proposed new models with a discussion of interpretability.
Therefore, building a credit-scoring ensemble classifier that satisfies both requirements is an essential task.
In Vietnam, credit data sets usually suffer from imbalance, noise, and overlapping issues. Although the economy is under the influence of the digital transformation process and credit scoring models have developed rapidly, Vietnamese commercial banks still apply traditional methods such as Logistic regression and Discriminant analysis. Some studies used machine learning methods such as Artificial neural network (Kiều, Diệp, Nga, & Nam, 2017; Nguyen & Nguyen, 2016; Thịnh & Toàn, 2016), Support vector machine (Nhâm, 2021), Random forest (Ha, Nguyen, & Nguyen, 2016), and ensemble models (Luu & Hung, 2021). The aim of these studies is to support the application of advanced methods in credit scoring, but they are not concerned with the imbalanced issue and interpretability. Very few studies dealt with the imbalance issue (Mỹ, 2021; Toàn, Lịch, Hương, & Thọ, 2017). However, these works only solved imbalanced data and ignored the noise and overlapping samples.
In summary, it is necessary to build a credit-scoring ensemble classifier that can tackle imbalanced data and other related issues, such as noise and overlapping samples, to raise the performance measures, especially on Vietnamese data sets. Furthermore, the proposed model should point out the important features for predicting the credit risk status.
1.3.2 Gaps in the approaches to solving imbalanced data
There are three popular approaches to imbalanced classification in the literature: algorithm-level, data-level, and ensemble-based approaches (Galar et al., 2011).
The algorithm-level approach solves imbalanced data by modifying the classifier algorithms to reduce the bias toward the majority class. This approach needs deep knowledge about the intrinsic classifiers, which users usually lack. In addition, designing specific corrections or modifications for the given classifier algorithms makes this approach not versatile. A representative of the algorithm-level approach is the Cost-sensitive learning method, which imposes or corrects the costs of loss upon misclassifications and requires the minimal total loss of the classification process (Xiao, Xie, He, & Jiang, 2012; Xiao et al., 2020). However, the values of the costs of losses are usually assigned by the researchers' intention. In short, the algorithm-level approach is inflexible and unwieldy.
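To illustrate the cost-sensitive idea with a standard decision-theoretic rule (a generic sketch, not the exact formulation of the works cited above): with misclassification costs c_FP and c_FN, predicting the positive class minimizes the expected cost whenever p · c_FN > (1 − p) · c_FP, so the decision threshold on the conditional probability p becomes c_FP / (c_FP + c_FN).

    # Cost-sensitive threshold on the conditional probability p.
    cost_threshold <- function(c_FP, c_FN) c_FP / (c_FP + c_FN)
    cost_threshold(c_FP = 1, c_FN = 10)   # 1/11, far below the default 0.5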
The data-level approach balances training data sets by applying re-sampling techniques, which belong to three main groups: over-sampling, under-sampling, and the hybrid of over- and under-sampling. Over-sampling techniques increase the quantity of the minority class, while under-sampling techniques decrease that of the majority class. This approach is easy to implement and performs independently of the classifier algorithms. However, re-sampling techniques change the distribution of the training data set, which may lead to a poor classification model. For instance, random over-sampling techniques increase the computation time and may repeat the noise and overlapping samples, thus probably leading to an over-fitting classification model. Some hierarchical methods of over-sampling can cause other problems. For example, the Synthetic Minority Over-sampling Technique (SMOTE) can exacerbate the overlapping issue. In contrast, under-sampling techniques may miss useful information about the majority class, especially on severely imbalanced data (Baesens et al., 2003; Sun, Lang, Fujita, & Li, 2018).
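For concreteness, the two random re-sampling techniques can be sketched in a few lines of base R; this sketch assumes a data frame with a 0/1 label column y and is not the implementation of any cited package.

    # Random over-sampling: duplicate minority samples until the classes match.
    ros <- function(data) {
      pos <- data[data$y == 1, ]; neg <- data[data$y == 0, ]
      extra <- pos[sample(nrow(pos), nrow(neg) - nrow(pos), replace = TRUE), ]
      rbind(data, extra)
    }
    # Random under-sampling: keep a random subset of the majority class.
    rus <- function(data) {
      pos <- data[data$y == 1, ]; neg <- data[data$y == 0, ]
      rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])
    }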
Over-The third is the ensemble-based approach which integrates ensembleclassi- fier algorithms with algorithm-level or data-level approaches Thisapproach exploits the advantage of ensemble classifiers to improve theperformance cri- teria The ensemble-based approach seems to be the trend
in dealing with imbalanced data (Abdoli, Akbari, & Shahrabi, 2023; Shen,Zhao, Kou, & Al- saadi, 2021; Yang, Qiao, Huang, Wang, & Wang, 2021;Zhang, Yang, & Zhang, 2021) However, the ensemble-based approach oftenfaces complex models that are too difficult to interpret the results This is aconcern that must be realized fully
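A minimal sketch of one common ensemble-based scheme, under-bagging with Decision trees (a balanced bootstrap sample per tree, averaged probabilities), is given below. It is illustrative only; in particular, it is not the OUS(B) or DTE(B) algorithm proposed later in the dissertation.

    # B trees, each grown on a balanced sample: all positives plus an
    # equally sized random draw from the negatives.
    under_bag <- function(data, B = 25) {
      data$y <- factor(data$y)
      pos <- data[data$y == "1", ]; neg <- data[data$y == "0", ]
      lapply(seq_len(B), function(b) {
        bal <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])
        rpart::rpart(y ~ ., data = bal)
      })
    }
    # Average the trees' probabilities of the positive class.
    predict_under_bag <- function(trees, newdata) {
      rowMeans(sapply(trees, function(t) predict(t, newdata)[, "1"]))
    }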
In summary, although there are many methods for imbalanced classification, each of them has some drawbacks. Some hybrid methods are complex and inaccessible. Moreover, there are very few studies dealing simultaneously with imbalance, noise, and overlapping samples. With the available studies, on some data sets, the methods do not raise the performance measures as high as expected. Hence arises the idea of a new algorithm that can deal with imbalance, noise, and overlapping to increase the performance measure on the positive class.
1.3.3 Gaps in Logistic regression with imbalanced data
Logistic regression (LR) is one of the most popular single classifiers, especially in credit scoring (Onay & Öztürk, 2018). LR provides an understandable output, namely the conditional probability of belonging to the positive class. This probability is the reference for predicting the sample's label by comparing it with a given threshold: the sample is classified into the positive class if and only if its conditional probability is greater than this threshold. This characteristic of LR can be extended to multi-classification. Besides, the computation process of LR, which employs the maximum likelihood estimator, is quite simple. It does not take much time since there are several available packages in statistical software and programming languages. Furthermore, LR can show the impact of predictors on the output by evaluating the statistically significant level of the parameters corresponding to the predictors. In other words, LR provides an interpretable and affordable model.
However, LR is ineffective on imbalanced data sets (Firth, 1993; King & Zeng, 2001); specifically, the conditional probability of positive samples is underestimated. Therefore, the positive samples are likely to be misclassified. Besides, the statistically significant level of predictors is usually based on the parameter testing procedure, which uses the p-value criterion as a framework. Meanwhile, the p-value has recently been criticized in the statistical community because of its widespread misunderstanding (Goodman, 2008). These issues limit the application fields of LR although it has several advantages.
Trang 31There are multiple methods to deal with imbalanced data for LR such asprior correction (Cramer, 2003; King & Zeng, 2001), weighted likelihoodesti- mation (WLE) (Maalouf & Trafalis, 2011; Manski & Lerman, 1977;Ramalho & Ramalho, 2007) and penalized likelihood regression (PLR)(Firth, 1993; Green- land & Mansournia, 2015; Puhr, Heinze, Nold, Lusa, &Geroldinger, 2017) All of them are related to the algorithm-level approach,which requires much effort from the users For example, prior correctionand WLE need the ratio of the positive class in the population which isusually unavailable in real-world ap- plications Besides, some methods ofPLR are too sensitive for initial values in the computation process of themaximum likelihood estimation Furthermore, some methods of PLR werejust for the biased parameter estimates, not for the biased conditionalprobability (Firth, 1993) A hybrid of these methods and re-samplingtechniques has not been considered in the literature on LR with imbalanceddata The hybrid methods can exploit the advantages of each individual anddirectly solve imbalanced data for LR.
In summary, LR for imbalanced data needs a modified computation process combining the data-level and algorithm-level approaches. The modification should deal with imbalanced data while retaining the ability to show the impacts of the predictors on the response without the p-value criterion.
1.4 Research objectives, research subjects, and research scopes
1.4.1 Research objectives
In this dissertation, we aim to achieve the following objectives.
The first objective is to propose a new ensemble classifier that satisfies the two key requirements of a credit-scoring model. This ensemble classifier is expected to outperform traditional classification models and popular balancing methods such as the Bagging tree, Random forest, and AdaBoost combined with random over-sampling (ROS), random under-sampling (RUS), SMOTE, and Adaptive synthetic sampling (ADASYN). Furthermore, the proposed model can identify the significance of input features in predicting the credit risk status.
The second objective is to propose a novel technique to address the challenges of imbalanced data, noise, and overlapping samples. This technique can leverage the strengths of re-sampling methods and ensemble models to tackle these critical issues in classification. Subsequently, this technique can be applied to credit scoring and other imbalanced classification applications, for example, medical diagnosis.
The final objective is to modify the computation process of Logistic regression to address imbalanced data and mitigate the issue of overlapping samples. This modification directly targets the F-measure, which is commonly used to evaluate the performance of classifiers in imbalanced classification. The proposed work can compete with popular balancing methods for Logistic regression such as weighted likelihood estimation, penalized likelihood regression, and re-sampling techniques, including ROS, RUS, and SMOTE.
1.4.2 Research subjects
This dissertation investigates the phenomenon of imbalanced data and other related issues, such as noise and overlapping samples, in classification. We examine various balancing methods, encompassing algorithm-level, data-level, and ensemble-based approaches, in a case study of credit scoring. Within these approaches, data-level and ensemble-based methods are paid more attention than algorithm-level ones. Additionally, Lasso-Logistic regression, a penalized version of Logistic regression, is studied in two application contexts: as the base learner of an ensemble classifier and as an individual classifier.
1.4.3 Research scopes
The dissertation focuses on binary classification problems for imbalanced data sets and their application in credit scoring. Interpretable classifiers, including Logistic regression, Lasso-Logistic regression, and Decision trees, are considered. To deal with imbalanced data, the dissertation concentrates on the data-level approach and the integration of data-level methods and ensemble classifier algorithms. Some popular re-sampling techniques, such as ROS, RUS, SMOTE, ADASYN, Tomek-link, and the Neighborhood Cleaning Rule, are investigated in this study. In addition, popular performance criteria suitable for imbalanced classification, such as AUC (Area Under the Receiver Operating Characteristics Curve), KS (Kolmogorov-Smirnov statistic), F-measure, G-mean, and H-measure, are used to evaluate the effectiveness of the considered classifiers.
1.5 Research data and research methods
1.5.1 Research data
The case study of credit scoring uses six secondary data sets. Three of them are from the UCI machine learning repository: the German, Taiwanese, and Bank personal loan data sets. These data sets are very popular in credit scoring studies and are used as benchmarks in the literature. Besides, three private data sets were collected from commercial banks in Vietnam. All Vietnamese data sets are highly imbalanced, at different levels. Furthermore, to justify the ability of the proposed works to improve the performance measures, the empirical study used one data set belonging to the medical field, the Hepatitis data set, which is available on the UCI machine learning repository.

The case study of Logistic regression employs nine data sets. Four of them, the German, Taiwanese, Bank personal loan, and Hepatitis data sets, are also used in the case study of credit scoring. The others are easy to access through the Kaggle website and the UCI machine learning repository.
1.5.2 Research methods
The dissertation applies the quantitative research method to clarify the effectiveness of the proposed works, namely the credit scoring ensemble classifier, the algorithm for balancing data and removing overlapping samples, and the modification of Logistic regression.

The general implementation protocol of the proposed works follows the steps in Table 1.1. This implementation protocol is applied in all computation processes in the dissertation; however, in each case, the content of Step 2 may vary in some ways. The computation processes are conducted in the programming language R, which has been widely used in the machine learning community.
Table 1.1: General implementation protocol in the dissertation

    Steps   Contents
    …       …
    2       Constructing the new model with different hyper-parameters to find the optimal model on the training data
    …       … and classifier algorithms on the same training data
    …       … testing data, then calculating their performance measures
1.6 Contributions of the dissertation
The dissertation contributes three methods to the literature on credit scoring and imbalanced classification. The proposed methods were published in three articles:
(1) An interpretable decision tree ensemble model for imbalanced credit scoring datasets, Journal of Intelligent and Fuzzy Systems, Vol. 45, No. 6, 10853–10864, 2023.
(2) TOUS: A new technique for imbalanced data classification, Studies in Systems, Decision, and Control, Vol. 429, 595–612, 2022, Springer.
(3) A modification of Logistic regression with imbalanced data: F-measure-oriented Lasso-logistic regression, ScienceAsia, 49S, 68–77, 2023.
Regarding the literature on credit scoring, the dissertation suggests an interpretable ensemble classifier that can address imbalanced data. The proposed model, which uses the Decision tree as its base learner, has specific advantages over the popular approaches, such as higher performance measures and interpretability. This model corresponds to the first article.
Trang 35Regarding the literature on imbalanced data, the dissertation proposes amethod for balancing, de-noise, and free-overlapping samples thanks to theensemble-based approach This method outperforms the integration of there- sampling techniques (ROS, RUS, and SMOTE, Tomek-link, andNeighborhood Cleaning Rule) and popular ensemble classifier algorithms(Bagging tree, Ran- dom forest, and AdaBoost) This work corresponds tothe second article.
Regarding the literature on Logistic regression, the dissertation provides a modification of its computation process. The proposed work makes Logistic regression more effective than the existing methods for Logistic regression with imbalanced data and retains the ability to show the importance level of input features without using the p-value. This modification is in the third article.
1.7 Dissertation outline
The dissertation "Imbalanced data in classification: A case study of credit scoring" has five chapters.
• Chapter 1: Introduction

• Chapter 2: Literature review of imbalanced data

• Chapter 3: Imbalanced data in credit scoring

• Chapter 4: A modification of Logistic regression with imbalanced data

• Chapter 5: Conclusions
Chapter 1 is the introduction, which briefly introduces the contents of the dissertation. This chapter presents an overview of imbalanced data in classification. The other contents are the motivations, research gap identifications, objectives, subjects, scopes, data, methods, contributions, and the dissertation outline.
Chapter 2 is the literature review on imbalanced data in classification. This chapter provides the definition, obstacles, and related issues of imbalanced data, for example, the overlapping classes. Besides, this chapter presents in depth the performance measures for imbalanced data. The most important section is the review of approaches to imbalanced data, including the algorithm-level, data-level, and ensemble-based approaches. Chapter 2 also examines the basic background and recently proposed works of credit scoring. The detailed discussion of previous studies clarifies the pros and cons of existing balancing methods. That is the framework for developing the new balancing methods in the dissertation.
Chapter 3 is the case study of imbalanced classification: credit scoring. This chapter is based on the main contents of the first and second articles referred to in Section 1.6. We propose an ensemble classifier that can address imbalanced data and provide the importance level of predictors. Furthermore, we innovate the algorithm of this credit-scoring ensemble classifier to handle overlapping and noise before dealing with imbalanced data. Empirical studies are conducted to verify the effectiveness of the proposed algorithms.
Chapter 4 is another study on imbalanced data, related to Logistic regression. This chapter proposes a modification of the inner and outer parts of the computation process of Logistic regression. The inner part is a change in the performance criterion used to estimate the score, and the outer part is a selective application of re-sampling techniques to re-balance the training data. The experiments study nine data sets to verify the performance of the modification. Chapter 4 corresponds to the third article referred to in Section 1.6.
Chapter 5 is the conclusion, which summarizes the dissertation, presents the implications of the proposed works, and suggests some further studies.
Chapter 2
LITERATURE REVIEW OF IMBALANCED DATA
2.1 Imbalanced data in classification
2.1.1 Description of imbalanced data
According to Definition 1.1.4, any data set with a skewed quantity of samples in the two classes is technically imbalanced data (ID). In other words, any two-class data set with an imbalanced ratio (IR) greater than one is considered ID. There is no conventional definition of the IR threshold at which a data set is concluded to be imbalanced. Most authors simply define ID as a data set in which one class has a much greater (or lower) number of samples than the other (Brown & Mues, 2012; Haixiang et al., 2017). Other authors assess a data set as imbalanced if the interest class has significantly fewer samples than the other and ordinary classifier algorithms encounter difficulty in distinguishing the two classes (Galar et al., 2011; López, Fernández, García, Palade, & Herrera, 2013; Sun, Wong, & Kamel, 2009). Therefore, a data set is considered ID when its IR is greater than one and most samples of the minority class cannot be identified by standard classifiers.
2.1.2 Obstacles in imbalanced classification
In ID, the minority class is usually misclassified since there is too little information about its patterns. Besides, standard classifier algorithms often operate according to the rule of maximizing the accuracy metric. Hence, the classification results are usually biased toward the majority class to get the highest global accuracy, at the expense of very low accuracy for the minority class. On the other hand, the patterns of the minority class are often specific, especially in extreme ID, which leads to the ignorance of minority samples (they may be treated as noise) in favor of the more general patterns of the majority class. As a consequence, the minority class, which is the object of interest in the classification process, is usually misclassified in ID.
The above analyses are also supported by empirical studies. Brown and Mues (2012) concluded that the higher the IR, the lower the performance of classifiers. Furthermore, Prati, Batista, and Silva (2015) found that the expected performance loss, which was the proportion of the performance difference between ID and the balanced data, became significant when the IR was 90/10 or greater. Prati et al. also pointed out that the performance loss tended to increase quickly for higher values of IR.

In short, the IR is a factor that reduces the effectiveness of standard classifiers.
2.1.3 Categories of imbalanced data
In real applications, combinations of ID and other phenomena make classification processes more difficult. Some authors even claim that ID is not the only main reason for poor performance: overlapping, small sample size, small disjuncts, and borderline, rare, and outlier samples are also causes of the low effectiveness of popular classifier algorithms (Batista et al., 2004; Fernández et al., 2018; Napierala & Stefanowski, 2016; Sun et al., 2009).
• Overlapping or class separability (Fig. 2.1b) is the phenomenon of an unclear decision boundary between two classes. It also means that some samples of the two classes are blended. On data sets with overlapping, standard classifier algorithms such as Decision tree, Support vector machine, or K-nearest neighbors become harder to perform well. Batista et al. (2004) stated that the IR was less important than the degree of overlap between classes. Similarly, Fernández et al. (2018) believed that any simple classifier algorithm could perform classification independently of the IR in case of no overlapping.
• Small sample size: Learning algorithms need a sufficient number of samples to generalize the rule to discriminate classes. Without large training sets, a classifier cannot generalize the characteristics of the data, and it can also produce an over-fitting model (Cui, Davis, Cheng, & Bai, 2004; Wasikowski & Chen, 2009). On imbalanced and small data sets, the lack of information about the positive class becomes more serious. Krawczyk and Woźniak (2015) stated that, when fixing the IR, the more samples of the minority class, the lower the error rate of classifiers.

Figure 2.1: Examples of circumstances of imbalanced data. Source: Galar et al. (2011)
• Small disjuncts (Fig. 2.1c): This problem occurs when the minority class consists of several sub-spaces in the feature space. Therefore, small disjuncts provide classifiers with a smaller number of positive samples than large disjuncts. In other words, small disjuncts cover rare samples that are too hard to find in the data sets, and learning algorithms often ignore rare samples in order to set general classification rules. This leads to a higher error rate on small disjuncts (Prati, Batista, & Monard, 2004; Weiss, 2009).
• The characteristics of positive samples, such as borderline, rare, and outlier samples, affect the performance of standard classifiers. The fact is that borderline samples are always too difficult to recognize, and rare and outlier samples are extremely hard to identify. According to Napierala and Stefanowski (2016) and Van Hulse and Khoshgoftaar (2009), an imbalanced data set with many borderline, rare, or outlier samples makes standard classifiers less efficient.
In summary, studies of ID should pay attention to related issues such as overlapping, small sample size, small disjuncts, and the characteristics of the positive samples.
2.2 Performance measures for imbalanced data

The quality of a classifier is evaluated by inspecting how effectively it performs on testing data. This means the outputs of the classifier are compared with the true labels of the testing data, which are hidden during the process of constructing the classifier. There are two types of outputs: labeled and scored. Depending on the type, different metrics are used to analyze the performance of classifiers. In ID, there are some notes on the choice of performance measures.
2.2.1 Performance measures for labeled outputs
Most learning algorithms produce labeled outputs, for example, K-nearest neighbors, Decision tree, ensemble classifiers based on Decision tree, and so on. A convenient way to present the performance of labeled-output classifiers is a cross-tabulation between actual and predicted labels, known as the confusion matrix.
Table 2.1: Confusion matrix

                        Predicted positive   Predicted negative   Total
    Actual positive     TP                   FN                   POS
    Actual negative     FP                   TN                   NEG
    Total               PPOS                 PNEG                 N

In Table 2.1, TP, FP, FN, and TN follow Definition 1.1.3. Besides, POS and NEG are the numbers of actual positive and negative samples in the training data, respectively; PPOS and PNEG are the numbers of predicted positive and negative samples, respectively; and N is the total number of samples.
From the confusion matrix, several metrics are built to provide a framework for analyzing many aspects of a classifier. These metrics can be divided into two types: single and complex metrics.
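In R, Table 2.1 and its margin quantities can be obtained as a cross-tabulation; actual and predicted are assumed 0/1 vectors from a fitted classifier.

    cm <- table(Actual = actual, Predicted = predicted)
    TP <- cm["1", "1"]; FN <- cm["1", "0"]
    FP <- cm["0", "1"]; TN <- cm["0", "0"]
    POS  <- TP + FN; NEG  <- FP + TN    # actual class sizes
    PPOS <- TP + FP; PNEG <- FN + TN    # predicted class sizes
    N <- sum(cm)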
2.2.1.1 Single metrics
The most popular single metric is accuracy or its complement, the error rate. Accuracy is the proportion of correct outputs, and the error rate is the proportion of incorrect ones. Therefore, the higher the accuracy (or the lower the error rate), the better the classifier.