Imbalanced Data in classification: A case study of credit scoring


DOCUMENT INFORMATION

Basic information

Title: Imbalanced Data in Classification: A Case Study of Credit Scoring
Author: Bui Thi Thien My
Supervisors: Assoc. Prof. Dr. Le Xuan Truong, Dr. Ta Quoc Bao
Institution: University of Economics Ho Chi Minh City
Major: Statistics
Document type: Doctoral Dissertation
Year: 2024
City: Ho Chi Minh City
Format
Number of pages: 173
File size: 1.2 MB

Structure

  • 1.1 Overview of imbalanced data in classification (19)
  • 1.2 Motivations (21)
  • 1.3 Research gap identifications (23)
    • 1.3.1 Gaps in credit scoring (23)
    • 1.3.2 Gaps in the approaches to solving imbalanced data (25)
    • 1.3.3 Gaps in Logistic regression with imbalanced data (27)
  • 1.4 Research objectives, research subjects, and research scopes (28)
    • 1.4.1 Research objectives (28)
    • 1.4.2 Research subjects (29)
    • 1.4.3 Research scopes (29)
  • 1.5 Research data and research methods (30)
    • 1.5.1 Research data (30)
    • 1.5.2 Research methods (30)
  • 1.6 Contributions of the dissertation (31)
  • 1.7 Dissertation outline (32)
  • 2.1 Imbalanced data in classification (34)
    • 2.1.1 Description of imbalanced data (34)
    • 2.1.2 Obstacles in imbalanced classification (34)
    • 2.1.3 Categories of imbalanced data (35)
  • 2.2 Performance measures for imbalanced data (37)
    • 2.2.1 Performance measures for labeled outputs (37)
      • 2.2.1.1 Single metrics (37)
      • 2.2.1.2 Complex metrics (39)
    • 2.2.2 Performance measures for scored outputs (40)
      • 2.2.2.1 Area under the Receiver Operating Characteristics Curve (40)
      • 2.2.2.2 Kolmogorov-Smirnov statistic (42)
      • 2.2.2.3 H-measure (43)
    • 2.2.3 Conclusion of performance measures in imbalanced classification (43)
  • 2.3 Approaches to imbalanced classification (44)
    • 2.3.1 Algorithm-level approach (44)
      • 2.3.1.1 Modifying the current classifier algorithms (44)
      • 2.3.1.2 Cost-sensitive learning (46)
      • 2.3.1.3 Comments on algorithm-level approach (48)
    • 2.3.2 Data-level approach (48)
      • 2.3.2.1 Under-sampling method (48)
      • 2.3.2.2 Over-sampling method (52)
      • 2.3.2.3 Hybrid method (56)
      • 2.3.2.4 Comments on data-level approach (57)
    • 2.3.3 Ensemble-based approach (59)
      • 2.3.3.1 Integration of algorithm-level method and en- (60)
      • 2.3.3.3 Comments on ensemble-based approach (63)
    • 2.3.4 Conclusions of approaches to imbalanced data (64)
  • 2.4 Credit scoring (66)
    • 2.4.1 Meaning of credit scoring (66)
    • 2.4.2 Inputs for credit scoring models (67)
    • 2.4.3 Interpretability of credit scoring models (69)
    • 2.4.4 Approaches to imbalanced data in credit scoring (70)
    • 2.4.5 Recent credit scoring ensemble models (71)
  • 2.5 Chapter summary (73)
  • 3.1 Classifiers for credit scoring (74)
    • 3.1.1 Single classifiers (74)
      • 3.1.1.1 Discriminant analysis (74)
      • 3.1.1.2 K-nearest neighbors (75)
      • 3.1.1.3 Logistic regression (76)
      • 3.1.1.4 Lasso-Logistic regression (78)
      • 3.1.1.5 Decision tree (79)
      • 3.1.1.6 Support vector machine (80)
      • 3.1.1.7 Artificial neural network (82)
    • 3.1.2 Ensemble classifiers (84)
      • 3.1.2.1 Heterogeneous ensemble classifiers (84)
      • 3.1.2.2 Homogeneous ensemble classifiers (85)
    • 3.1.3 Conclusions of statistical models for credit scoring (87)
  • 3.2 The proposed credit scoring ensemble model based on Decision tree (89)
    • 3.2.1 The proposed algorithms (89)
      • 3.2.1.1 Algorithm for balancing data - OUS(B) algorithm (89)
      • 3.2.1.2 Algorithm for constructing ensemble classifier - DTE(B) algorithm (90)
    • 3.2.2 Empirical data sets (91)
    • 3.2.3 Computation process (92)
    • 3.2.4 Empirical results (94)
      • 3.2.4.1 The optimal Decision tree ensemble classifier (94)
      • 3.2.4.2 Performance of the proposed model on the Vietnamese data sets (95)
      • 3.2.4.3 Performance of the proposed model on the public data sets (97)
      • 3.2.4.4 Evaluations (99)
    • 3.2.5 Conclusions of the proposed credit scoring ensemble model (100)
  • 3.3 The proposed algorithm for imbalanced and overlapping data (101)
    • 3.3.1 The proposed algorithms (102)
      • 3.3.1.1 Algorithm for dealing with noise, overlapping, and imbalanced data (102)
      • 3.3.1.2 Algorithm for constructing ensemble model (102)
    • 3.3.2 Empirical data sets (103)
    • 3.3.3 Computation process (104)
      • 3.3.3.1 Computation protocol of the Lasso Logistic ensemble (105)
      • 3.3.3.2 Computation protocol of the Decision tree ensemble (106)
    • 3.3.4 Empirical results (106)
      • 3.3.4.1 The optimal ensemble classifier (106)
      • 3.3.4.2 Performance of LLE(B) (107)
      • 3.3.4.3 Performance of DTE(B) (108)
    • 3.3.5 Conclusions of the proposed technique (109)
  • 3.4 Chapter summary (110)
  • 4.1 Introduction (111)
  • 4.2 Related works (113)
    • 4.2.1 Prior correction (113)
    • 4.2.2 Weighted likelihood estimation (WLE) (114)
    • 4.2.3 Penalized likelihood regression (PLR) (115)
  • 4.3 The proposed works (116)
    • 4.3.1 The modification of the cross-validation procedure (117)
    • 4.3.2 The modification of Logistic regression (119)
  • 4.4 Empirical study (121)
    • 4.4.1 Empirical data sets (121)
    • 4.4.2 Performance measures (122)
    • 4.4.3 Computation process (123)
    • 4.4.4 Empirical results (125)
    • 4.4.5 Statistical test (128)
    • 4.4.6 Important variables for output (129)
      • 4.4.6.1 Important variables for F-LLR fitted model (129)
      • 4.4.6.2 Important variables of the Vietnamese data set (130)
  • 4.5 Discussions and Conclusions (133)
    • 4.5.1 Discussions (133)
    • 4.5.2 Conclusions (134)
  • 4.6 Chapter summary (134)
  • 5.1 Summary of contributions (136)
    • 5.1.1 The interpretable credit scoring ensemble classifier (136)
    • 5.1.2 The technique for imbalanced data, noise, and overlapping samples (137)
    • 5.1.3 The modification of Logistic regression (138)
  • 5.2 Implications (139)
  • 5.3 Limitations and suggestions for further research (140)
  • C.1 German credit data set (GER) (158)
  • C.2 Vietnamese 1 data set (VN1) (159)
  • C.3 Vietnamese 2 data set (VN2) (160)
  • C.4 Taiwanese credit data set (TAI) (161)
  • C.5 Bank personal loan data set (BANK) (163)
  • C.6 Hepatitis C patients data set (HEPA) (164)
  • C.7 The Loan schema data from lending club (US) (165)
  • C.8 Vietnamese 3 data set (VN3) (168)
  • C.9 Australian credit data set (AUS) (169)
  • C.10 Credit risk data set (Credit 1) (170)
  • C.11 Credit card data set (Credit 2) (171)
  • C.12 Credit default data set (Credit 3) (172)
  • C.13 Vietnamese 4 data set (VN4) (173)
    • 2.1 Examples of circumstances of imbalanced data
    • 2.2 Illustration of ROCs
    • 2.3 Illustration of KS metric
    • 2.4 Illustration of RUS technique
    • 2.5 Illustration of CNN rule
    • 2.6 Illustration of tomek-links
    • 2.7 Illustration of ROS technique
    • 2.8 Illustration of SMOTE technique
    • 2.9 Approaches to imbalanced data in classification
    • 3.1 Illustration of a Decision tree
    • 3.2 Illustration of a decision boundary of SVM
    • 3.3 Illustration of a two-hidden-layer ANN
    • 3.4 Importance level of features of the Vietnamese data sets
    • 3.5 Computation protocol of the proposed ensemble classifier
    • 4.1 Illustration of F-CV
    • 4.2 Illustration of F-LLR
    • 1.1 General implementation protocol in the dissertation
    • 2.1 Confusion matrix
    • 2.2 Representatives employing the algorithm-level approach to ID (27)
    • 2.3 Cost matrix in Cost-sensitive learning
    • 2.4 Summary of SMOTE algorithm
    • 2.5 Representatives employing the data-level approach to ID
    • 2.6 Representatives employing the ensemble-based approach to ID (45)
    • 3.1 Representatives of classifiers in credit scoring
    • 3.2 OUS(B) algorithm
    • 3.3 DTE(B) algorithm
    • 3.4 Description of empirical data sets
    • 3.5 Computation protocol of empirical study on DTE
    • 3.6 Performance measures of DTE(B) on the Vietnamese data sets (76)
    • 3.7 Performance of ensemble classifiers on the Vietnamese data sets (78)
    • 3.8 Performance of ensemble classifiers on the German data set
    • 3.9 Performance of ensemble classifiers on the Taiwanese data set (81)
    • 3.10 TOUS(B) algorithm
    • 3.11 TOUS-F(B) algorithm
    • 3.12 Description of empirical data sets
    • 3.13 Average testing AUC of the proposed ensembles
    • 3.14 Average testing AUC of the models based on LLR
    • 3.15 Average testing AUC of the ensemble classifiers based on tree
    • 4.1 Cross-validation procedure for Lasso Logistic regression
    • 4.2 F-measure-oriented Cross-Validation Procedure
    • 4.3 Algorithm for F-LLR classifier
    • 4.4 Description of empirical data sets
    • 4.5 Implementation protocol of empirical study
    • 4.6 Average testing performance measures of classifiers
    • 4.7 Average testing performance measures of classifiers (cont.)
    • 4.8 The number of wins of F-LLR on empirical data sets
    • 4.9 Important features of the Vietnamese data set
    • 4.10 Important features of the Vietnamese data set (cont.)
  • B.1 Algorithm of Bagging classifier
  • B.2 Algorithm of Random Forest
  • B.3 Algorithm of AdaBoost
  • C.1 Summary of the German credit data set
  • C.2 Summary of the Vietnamese 1 data set
  • C.3 Summary of Vietnamese 2 data set
  • C.4 Summary of the Taiwanese credit data set (a)
  • C.5 Summary of the Taiwanese credit data set (b)
  • C.6 Summary of the Bank personal loan data set
  • C.7 Summary of the Hepatitis C patients data set
  • C.8 Summary of the Loan schema data from lending club (a)
  • C.9 Summary of the Loan schema data from lending club (b)
  • C.10 Summary of the Loan schema data from lending club (c)
  • C.11 Summary of the Vietnamese 3 data set
  • C.12 Summary of the Australian credit data set
  • C.13 Summary of the Credit 1 data set
  • C.14 Summary of the Credit 2 data set
  • C.15 Summary of the Credit 3 data set
  • C.16 Summary of the Vietnamese 4 data set

Content

Imbalanced Data in classification: A case study of credit scoring

Overview of imbalanced data in classification

Nowadays, classification plays a crucial role in several fields, for example, medicine (cancer diagnosis), finance (fraud detection), business administration (customer churn prediction), information retrieval (oil spill tracking, telecommunication fraud), image identification (face recognition), and so on. Classification is the problem of predicting a class label for a given sample. On training data sets that comprise samples with different label types, classification algorithms learn the samples' features to recognize the labels' patterns. After that, these patterns, now represented as a fitted classification model, are used to predict the labels of new samples.

Classification is categorized into two types: binary and multi-class classification. Binary classification, the basic type, focuses on two-class label problems, whereas multi-class classification handles tasks with several class labels. A multi-class problem is sometimes reduced to a binary one with two classes: one class corresponding to the label of interest and the other representing the remaining labels. In binary classification, data sets are partitioned into positive and negative classes. The positive class is the class of interest, which has to be identified in the classification task. In this dissertation, we focus on binary classification. For convenience, we define some concepts as follows.

Definition 1.1.1. A data set with k input features for binary classification is the set of samples S = X × Y, where X ⊂ R^k is the domain of samples' features and Y = {0, 1} is the set of labels.

The subset of samples labeled 1 is called the positive class, denoted S^+. The remaining subset is called the negative class, denoted S^-. A sample s ∈ S^+ is called a positive sample; otherwise, it is called a negative sample.

Definition 1.1.2. A binary classifier is a function mapping the domain of features X to the set of labels {0, 1}.

Definition 1.1.3. Consider a data set S and a classifier f : X → {0, 1}. For a given sample s_0 = (x_0, y_0) ∈ S, there are four possibilities, as follows:

• If f(s_0) = y_0 = 1, then s_0 is called a true positive sample.

• If f(s_0) = y_0 = 0, then s_0 is called a true negative sample.

• If f(s_0) = 1 and y_0 = 0, then s_0 is called a false positive sample.

• If f(s_0) = 0 and y_0 = 1, then s_0 is called a false negative sample.

The numbers of true positive, true negative, false positive, and false negative samples are denoted TP, TN, FP, and FN, respectively.

Some popular criteria used to evaluate the performance of a classifier are accuracy, true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR).

In many application domains where the positive and negative classes are balanced, accuracy is the first target of classifiers. However, the class of interest (the positive class) sometimes consists of unusual or rare events, and the number of samples in the positive class is too small for classifiers to recognize the positive patterns. In such situations, mistakes on the positive class are very costly. Therefore, accuracy is no longer the most important performance criterion; metrics related to TP, such as the TPR, matter more.

For example, in fraud detection, the customers are divided into "bad" and "good" classes. Since the credit regulations are made public and the customers have preliminarily been screened before applying for a loan, a credit data set often includes a majority class of good customers and a minority class of bad ones. The loss from misclassifying the "bad" as "good" is often far greater than the loss from misclassifying the "good" as "bad"; hence, identifying the bad customers is considered the more crucial task. Consider a list of credit customers consisting of 95% good and 5% bad. If pursuing a high accuracy, we can choose a trivial classifier that maps all customers to the good label. The accuracy of this classifier is 95%, but its TPR is 0%; in other words, it is unable to identify any bad customer. Instead, another classifier with a lower accuracy but a greater TPR should be preferred over this trivial classifier. Another example of rare-event classification is cancer diagnosis. In this case, the data set has two classes, "malignant" and "benign". The number of malignant patients is always much smaller than the number of benign ones, yet malignancy is the first target of any cancer diagnosis process because of the heavy consequences of missing cancer patients. Therefore, it is unreasonable to rely on the accuracy metric to evaluate the performance of cancer diagnosis classifiers.

The phenomenon of a skewed class distribution in training data sets for classification is known as imbalanced data.

Definition 1.1.4. Let S = S^+ ∪ S^- be a data set, where S^+ and S^- are the positive and negative classes, respectively. If the quantity of S^+ is far less than that of S^-, then S is called an imbalanced data set. The imbalanced ratio (IR) of S is defined as the ratio of the quantities of the negative and positive classes:

IR = |S^-| / |S^+|.
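To make the definition and the earlier credit example concrete, here is a minimal R sketch (R being the language used for the dissertation's computations); the data set, the 95%/5% split, and all variable names are hypothetical.

```r
# Hypothetical labels: 1 = "bad" customer (positive class), 0 = "good" (negative class)
y <- c(rep(1, 50), rep(0, 950))            # 5% positive, 95% negative

# Imbalanced ratio IR = |S^-| / |S^+|
IR <- sum(y == 0) / sum(y == 1)            # 19

# Trivial classifier: label every customer as "good"
pred <- rep(0, length(y))

accuracy <- mean(pred == y)                           # 0.95
TPR      <- sum(pred == 1 & y == 1) / sum(y == 1)     # 0: no bad customer detected
c(IR = IR, accuracy = accuracy, TPR = TPR)
```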

Motivations

When a training data set is imbalanced, simple classifiers usually have a very high accuracy but a low TPR. These classifiers aim to maximize the accuracy (sometimes called global accuracy), thus equating the losses caused by type I and type II errors (Shen, Zhao, Li, Li, & Meng, 2019). Therefore, the classification results are often biased toward the majority class (the negative class) (Galar, Fernandez, Barrenechea, Bustince, & Herrera, 2011; Haixiang et al., 2017). In the case of a rather high imbalanced ratio, the minority class (the positive class) is usually ignored since common classifiers often treat it as noise or outliers. Hence, the goal of recognizing the patterns of the positive class fails, although identifying the positive samples is often the crucial task of imbalanced classification. Therefore, imbalanced data is a challenge in classification.

Besides, experimental studies showed that as the imbalanced ratio increases, the overall model performance decreases (Brown & Mues, 2012). Furthermore, some authors stated that imbalanced data is not the only reason for poor performance: noise and overlapping samples also degrade the performance of learning methods (Batista, Prati, & Monard, 2004; Haixiang et al., 2017). Thus, researchers and practitioners should deeply understand the nature of their data sets to handle them correctly.

A typical case study of imbalanced classification is credit scoring. The issue is reflected in the bad debt ratio of commercial banks. For example, in Vietnam, the on-balance-sheet bad debt ratio was 1.9% in 2021 and 1.7% in 2020. Besides, the gross bad debt ratio (including on-balance-sheet bad debt, unresolved bad debt sold to VAMC, and potential bad debt from restructuring) was 7.3% in 2021 and 5.1% in 2020.¹ Although bad customers account for a very small part of credit customers, the consequences of bad debt for a bank are extremely heavy. In countries where most economic activities rely on the banking system, an increase in the bad debt ratio may not only threaten the operation of the banking system but also push the economy toward a series of collapses. Therefore, it is important to identify the bad customers in credit scoring.

In Vietnam, the credit market is tightly controlled by regulations of the State Bank. Commercial banks now consciously manage credit risk by strictly applying credit appraisal processes before funding. In academic research, credit scoring has attracted many authors (Bình & Anh, 2021; Hưng & Trang, 2018; Quỳnh, Anh, & Linh, 2018; Thắng, 2022). However, few works have addressed the imbalanced data issue (Mỹ, 2021).

1 https://sbv.gov.vn/webcenter/portal/vi/links/cm255?dDocName=SBV489213

These facts prompted us to study imbalanced classification deeply. The dissertation, titled "Imbalanced data in classification: A case study of credit scoring", aims to find suitable solutions for imbalanced data and related issues, with a case study of credit scoring in Vietnam.

Research gap identifications

Gaps in credit scoring

In the dissertation, we choose credit scoring as a case study of imbalanced classification.

Credit scoring is an arithmetical representation based on the analysis of customers' creditworthiness (Louzada, Ara, & Fernandes, 2016). Credit scoring provides valuable information to banks and financial institutions, not only to hedge credit risk but also to standardize regulations on credit management. Therefore, credit-scoring classifiers have to meet two significant requirements: i) the ability to accurately classify bad customers; ii) the ability to easily explain the predicted results of the classifiers.

Over the two recent decades, the first requirement has been addressed through the development of methods that improve the performance of credit scoring models. They include traditional statistical methods (K-nearest neighbors, Discriminant analysis, and Logistic regression) and popular machine learning models (Decision tree, Artificial neural network, and Support vector machine) (Baesens et al., 2003; Brown & Mues, 2012; Louzada et al., 2016). These are called single classifiers. The effectiveness of a single classifier is not consistent across data sets. For example, some studies showed that Logistic regression outperformed Decision tree (Marqués, García, & Sánchez, 2012; Wang, Ma, Huang, & Xu, 2012), but another study concluded that Logistic regression worked worse than Decision tree (Bensic, Sarlija, & Zekic-Susac, 2005). Besides, according to Baesens et al. (2003), Support vector machine was better than Logistic regression, while Li et al. (2019) and Van Gestel et al. (2006) indicated that there was an insignificant difference among Support vector machine, Logistic regression, and Linear discriminant analysis. In summary, empirical credit scoring studies lead to the important conclusion that there is no single best classifier for all data sets.

With the development of computational software and programming languages, there has been a shift from single classifiers to ensemble ones. The term "ensemble classifier" or "ensemble model" refers to a collection of multiple classifier algorithms; ensemble models work by leveraging the collective power of multiple sub-classifiers for decision-making. In the credit scoring literature, empirical studies concluded that ensemble models had superior performance to single ones (Brown & Mues, 2012; Dastile, Celik, & Potsane, 2020; Lessmann, Baesens, Seow, & Thomas, 2015; Marqués et al., 2012). However, ensemble algorithms do not directly handle the imbalanced data issue.

While the second requirement of a credit scoring model often attracts less attention than the first, its role is equally important. It provides the reasons behind the classification results, which is the framework for assessing, managing, and hedging credit risk. For example, customers' features are nowadays collected into empirical data sets more and more diversely, but not all of them are useful for credit scoring. Administrators need to know which variables in the classification model influence the likelihood of default in order to set transparent credit standards. There is usually a trade-off between the effectiveness and transparency of classifiers (Brown & Mues, 2012): as performance measures increase, explaining the predicted results becomes more difficult. For example, single classifiers such as Discriminant analysis, Logistic regression, and Decision tree are interpretable, but they usually work far less effectively than Support vector machine and Artificial neural network, which are representatives of "black box" classifiers. Another case is ensemble classifiers: most of them operate through an opaque process although they have outstanding performance. Even for popular ensemble classifiers such as Bagging tree, Random forest, or AdaBoost, which do not have very complicated structures, their interpretability is rarely discussed. According to Dastile et al. (2020), only 8% of credit scoring studies proposed new models with a discussion of interpretability.

Therefore, building a credit-scoring ensemble classifier that satisfies both requirements is an essential task.

In Vietnam, credit data sets usually suffer from imbalance, noise, and overlapping issues. Although the economy is influenced by the digital transformation process and credit scoring models have developed rapidly, Vietnamese commercial banks still apply traditional methods such as Logistic regression and Discriminant analysis. Some studies used machine learning methods such as Artificial neural network (Kiều, Diệp, Nga, & Nam, 2017; Nguyen & Nguyen, 2016; Thịnh & Toàn, 2016), Support vector machine (Nhâm, 2021), Random forest (Ha, Nguyen, & Nguyen, 2016), and ensemble models (Luu & Hung, 2021). The aim of these studies is to support the application of advanced methods in credit scoring, but they are not concerned with the imbalanced data issue and interpretability. Very few studies dealt with the imbalance issue (Mỹ, 2021; Toàn, Lịch, Hương, & Thọ, 2017), and these works only addressed imbalanced data while ignoring noise and overlapping samples.

In summary, it is necessary to build a credit-scoring ensemble classifier that can tackle imbalanced data and related issues such as noise and overlapping samples in order to raise the performance measures, especially on Vietnamese data sets. Furthermore, the proposed model should be able to point out the important features for predicting credit risk status.

Gaps in the approaches to solving imbalanced data

There are three popular approaches to imbalanced classification in the literature: the algorithm-level, data-level, and ensemble-based approaches (Galar et al., 2011).

The algorithm-level approach solves imbalanced data by modifying the classifier algorithms to reduce the bias toward the majority class. This approach needs deep knowledge about the intrinsic classifiers, which users usually lack. In addition, designing specific corrections or modifications for given classifier algorithms makes this approach less versatile. A representative of the algorithm-level approach is Cost-sensitive learning, which imposes or corrects the costs of loss upon misclassifications and seeks to minimize the total loss of the classification process (Xiao, Xie, He, & Jiang, 2012; Xiao et al., 2020). However, the values of the costs of loss are usually assigned at the researchers' discretion. In short, the algorithm-level approach is inflexible and unwieldy.

The data-level approach re-balances training data sets by applying re-sampling techniques, which fall into three main groups: over-sampling, under-sampling, and hybrids of over- and under-sampling. Over-sampling techniques increase the size of the minority class, while under-sampling techniques decrease that of the majority class. This approach is easy to implement and operates independently of the classifier algorithms. However, re-sampling techniques change the distribution of the training data set, which may lead to a poor classification model. For instance, random over-sampling increases the computation time and may replicate noise and overlapping samples, probably leading to an over-fitted classification model. Some hierarchical over-sampling methods can cause other problems; for example, the Synthetic Minority Over-sampling Technique (SMOTE) can exacerbate the overlapping issue. In contrast, under-sampling techniques may discard useful information about the majority class, especially on severely imbalanced data (Baesens et al., 2003; Sun, Lang, Fujita, & Li, 2018).
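For orientation only, the sketch below shows the two basic re-sampling ideas (random over-sampling and random under-sampling) in base R on a hypothetical data frame; the column names, the simulated data, and the 1:1 balancing target are assumptions, and SMOTE or the cleaning rules mentioned above are not implemented here.

```r
set.seed(1)
# Hypothetical imbalanced training data: 900 negatives, 100 positives
train <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000),
                    class = factor(c(rep(0, 900), rep(1, 100))))

pos <- train[train$class == 1, ]
neg <- train[train$class == 0, ]

# Random over-sampling (ROS): duplicate minority rows with replacement up to |S^-|
ros <- rbind(neg, pos[sample(nrow(pos), nrow(neg), replace = TRUE), ])

# Random under-sampling (RUS): keep only |S^+| randomly chosen majority rows
rus <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])

table(ros$class)   # 900 / 900
table(rus$class)   # 100 / 100
```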

The third is the ensemble-based approach, which integrates ensemble classifier algorithms with algorithm-level or data-level methods. This approach exploits the advantages of ensemble classifiers to improve the performance criteria and seems to be the trend in dealing with imbalanced data (Abdoli, Akbari, & Shahrabi, 2023; Shen, Zhao, Kou, & Alsaadi, 2021; Yang, Qiao, Huang, Wang, & Wang, 2021; Zhang, Yang, & Zhang, 2021). However, the ensemble-based approach often produces complex models whose results are difficult to interpret. This is a concern that must be fully recognized.

In summary, although there are many methods for imbalanced classification, each of them has some drawbacks. Some hybrid methods are complex and inaccessible. Moreover, very few studies deal simultaneously with imbalance, noise, and overlapping samples, and on some data sets the available methods do not raise the performance measures as much as expected. Hence, the idea arises of a new algorithm that can deal with imbalance, noise, and overlapping together to increase the performance measure on the positive class.

Gaps in Logistic regression with imbalanced data

Logistic regression (LR) is one of the most popular single classifiers, especially in credit scoring (Onay & Öztürk, 2018). LR provides an understandable output, namely the conditional probability of belonging to the positive class. This probability is the reference for predicting the sample's label by comparing it with a given threshold: the sample is classified into the positive class if and only if its conditional probability is greater than the threshold. This mechanism of LR can also be extended to multi-class classification. Besides, the computation process of LR, which employs maximum likelihood estimation, is quite simple and fast since several software packages and programming languages implement it. Furthermore, LR can show the impact of the predictors on the output by evaluating the statistical significance of the parameters corresponding to the predictors. In other words, LR provides an interpretable and affordable model.
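As a reminder of how LR's scored output is used, the following sketch fits a logistic regression with base R's glm() on simulated data and converts the predicted conditional probabilities into labels with a threshold; the data, feature names, and the 0.5 threshold are placeholders, not the dissertation's setting.

```r
set.seed(2)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-3 + 1.5 * x1 + 0.8 * x2))   # imbalanced binary response

fit  <- glm(y ~ x1 + x2, family = binomial)    # maximum likelihood estimation
prob <- predict(fit, type = "response")        # scored output: P(y = 1 | x)

pred <- as.integer(prob > 0.5)                 # labeled output via a threshold
table(actual = y, predicted = pred)            # confusion matrix
summary(fit)$coefficients                      # estimated impact of each predictor
```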

However, LR is ineffective on imbalanced data sets (Firth, 1993; King & Zeng, 2001); specifically, the conditional probability of positive samples is underestimated, so positive samples are likely to be misclassified. Besides, the statistical significance of predictors is usually assessed by the parameter testing procedure, which uses the p-value criterion as its framework, and the p-value has recently been criticized in the statistical community because of its frequent misinterpretation (Goodman, 2008). These issues limit the fields of application of LR despite its several advantages.

There are multiple methods to deal with imbalanced data for LR, such as prior correction (Cramer, 2003; King & Zeng, 2001), weighted likelihood estimation (WLE) (Maalouf & Trafalis, 2011; Manski & Lerman, 1977; Ramalho & Ramalho, 2007), and penalized likelihood regression (PLR) (Firth, 1993; Greenland & Mansournia, 2015; Puhr, Heinze, Nold, Lusa, & Geroldinger, 2017). All of them belong to the algorithm-level approach, which requires much effort from the users. For example, prior correction and WLE need the ratio of the positive class in the population, which is usually unavailable in real-world applications. Besides, some PLR methods are too sensitive to the initial values of the maximum likelihood computation. Furthermore, some PLR methods only correct the biased parameter estimates, not the biased conditional probability (Firth, 1993). A hybrid of these methods and re-sampling techniques has not been considered in the literature on LR with imbalanced data, although such a hybrid could exploit the advantages of each component and directly address imbalanced data for LR.
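For intuition about the weighting idea behind WLE, the sketch below re-weights the log-likelihood through glm()'s weights argument so that each class contributes equally; this generic illustration is an assumption of mine and is not the specific estimators of Manski and Lerman (1977) or Maalouf and Trafalis (2011), which weight by population class proportions.

```r
set.seed(3)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-3 + 1.5 * x1 + 0.8 * x2))   # imbalanced response

# Class weights inversely proportional to the class frequencies
w <- ifelse(y == 1, 0.5 / mean(y), 0.5 / mean(1 - y))

fit_mle <- glm(y ~ x1 + x2, family = binomial)               # ordinary MLE
fit_wle <- glm(y ~ x1 + x2, family = binomial, weights = w)  # weighted fit
# (glm warns about non-integer weights; the result is still the weighted estimate)

# The weighted fit lifts the predicted probabilities of the positive samples
mean(predict(fit_mle, type = "response")[y == 1])
mean(predict(fit_wle, type = "response")[y == 1])
```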

In summary, LR for imbalanced data needs a modified computation process that combines the data-level and algorithm-level approaches. The modification should deal with imbalanced data while retaining the ability to quantify the impacts of the predictors on the response without relying on the p-value criterion.

Research objectives, research subjects, and research scopes

Research objectives

In this dissertation, we aim to achieve the following objectives.

The first objective is to propose a new ensemble classifier that satisfies the two key requirements of a credit-scoring model. This ensemble classifier is expected to outperform traditional classification models and popular balancing methods such as Bagging tree, Random forest, and AdaBoost combined with random over-sampling (ROS), random under-sampling (RUS), SMOTE, and Adaptive synthetic sampling (ADASYN). Furthermore, the proposed model can identify the significance of input features in predicting credit risk status.

The second objective is to propose a novel technique to address the challenges of imbalanced data, noise, and overlapping samples. This technique can leverage the strengths of re-sampling methods and ensemble models to tackle these critical issues in classification. Subsequently, this technique can be applied to credit scoring and other imbalanced classification applications, for example, medical diagnosis.

The final objective is to modify the computation process of Logistic regression to address imbalanced data and mitigate the issue of overlapping samples. This modification directly targets the F-measure, which is commonly used to evaluate the performance of classifiers in imbalanced classification. The proposed work can compete with popular balancing methods for Logistic regression such as weighted likelihood estimation, penalized likelihood regression, and re-sampling techniques, including ROS, RUS, and SMOTE.

Research subjects

This dissertation investigates the phenomenon of imbalanced data and other related issues, such as noise and overlapping samples, in classification. We examine various balancing methods, encompassing algorithm-level, data-level, and ensemble-based approaches, in a case study of credit scoring. Within these approaches, the data-level and ensemble-based ones receive more attention than the algorithm-level one. Additionally, Lasso-Logistic regression, a penalized version of Logistic regression, is studied in two application contexts: as a base learner of an ensemble classifier and as an individual classifier.

Research scopes

The dissertation focuses on binary classification problems for imbalanced data sets and their application in credit scoring. Interpretable classifiers, including Logistic regression, Lasso-Logistic regression, and Decision tree, are considered. To deal with imbalanced data, the dissertation concentrates on the data-level approach and the integration of data-level methods with ensemble classifier algorithms. Some popular re-sampling techniques, such as ROS, RUS, SMOTE, ADASYN, Tomek-link, and the Neighborhood Cleaning Rule, are investigated. In addition, popular performance criteria suitable for imbalanced classification, such as AUC (Area Under the Receiver Operating Characteristic Curve), KS (Kolmogorov-Smirnov statistic), F-measure, G-mean, and H-measure, are used to evaluate the effectiveness of the considered classifiers.

Research data and research methods

Research data

The case study of credit scoring uses six secondary data sets. Three of them come from the UCI machine learning repository: the German, Taiwanese, and Bank personal loan data sets. These data sets are very popular in credit scoring research and are used as benchmarks in the literature. Besides, three private data sets were collected from commercial banks in Vietnam; all Vietnamese data sets are highly imbalanced, with different levels of imbalance. Furthermore, to justify the ability of the proposed works to improve the performance measures, the empirical study also uses one data set from the medical field, the Hepatitis C data set, which is available in the UCI machine learning repository. The case study of Logistic regression employs nine data sets. Four of them, the German, Taiwanese, Bank personal loan, and Hepatitis data sets, are also used in the case study of credit scoring; the others are easily accessible through the Kaggle website and the UCI machine learning repository.

Research methods

The dissertation applies quantitative research methods to clarify the effectiveness of the proposed works: the credit scoring ensemble classifier, the algorithm for balancing and overlap-free data, and the modification of Logistic regression.

The general implementation protocol of the proposed works follows the steps in Table 1.1. This implementation protocol is applied in all computation processes in the dissertation; however, in each case, the content of Step 2 may vary. The computation processes are conducted in the programming language R, which has been widely used in the machine learning community.

Table 1.1: General implementation protocol in the dissertation

1. Proposing the new algorithm or new procedure.

2. Constructing the new model with different hyper-parameters to find the optimal model on the training data.

3. Constructing other popular models with existing balancing methods and classifier algorithms on the same training data.

4. Applying the optimal model and the other popular models to the same testing data, then calculating their performance measures.

5. Comparing the testing performance measures of the considered models.
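A schematic R skeleton of this protocol might look as follows; the simulated data set, the 70/30 split, the class-weighted logistic regression standing in for a "proposed" model, and the rank-based AUC helper are all placeholders of mine, not the dissertation's actual algorithms or settings.

```r
set.seed(4)
# Step 0: hypothetical imbalanced data set standing in for an empirical one
n   <- 2000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-3 + 1.2 * dat$x1 + 0.7 * dat$x2))

idx   <- sample(n, 0.7 * n)                       # 70/30 train/test split
train <- dat[idx, ]; test <- dat[-idx, ]

# Rank-based (Mann-Whitney) estimate of the AUC, used as the common test metric
auc <- function(score, y) {
  r <- rank(score); np <- sum(y == 1); nn <- sum(y == 0)
  (sum(r[y == 1]) - np * (np + 1) / 2) / (np * nn)
}

# Steps 1-2: the "proposed" model, here a class-weighted logistic regression
w <- with(train, ifelse(y == 1, 0.5 / mean(y), 0.5 / mean(1 - y)))
m_new <- glm(y ~ x1 + x2, family = binomial, data = train, weights = w)

# Step 3: a popular benchmark model fitted on the same training data
m_ref <- glm(y ~ x1 + x2, family = binomial, data = train)

# Steps 4-5: apply both models to the same testing data and compare the metric
c(proposed  = auc(predict(m_new, test, type = "response"), test$y),
  benchmark = auc(predict(m_ref, test, type = "response"), test$y))
```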

Contributions of the dissertation

The dissertation contributes three methods to the literature on credit scoring and imbalanced classification. The proposed methods were published in three articles:

(1) An interpretable decision tree ensemble model for imbalanced credit scoring datasets, Journal of Intelligent and Fuzzy Systems, Vol. 45, No. 6, 10853–10864.

(2) TOUS: A new technique for imbalanced data classification, Studies in Systems, Decision and Control, Vol. 429, 595–612, 2022, Springer.

(3) A modification of Logistic regression with imbalanced data: F-measure-oriented Lasso-logistic regression, ScienceAsia, 49S, 68–77, 2023.

Regarding the literature on credit scoring, the dissertation suggests an interpretable ensemble classifier that can address imbalanced data. The proposed model, which uses Decision tree as the base learner, offers specific advantages over the popular approaches, namely higher performance measures and interpretability. This model corresponds to the first article.

Regarding the literature on imbalanced data, the dissertation proposes a method for balancing data and removing noise and overlapping samples thanks to the ensemble-based approach. This method outperforms the integration of re-sampling techniques (ROS, RUS, SMOTE, Tomek-link, and the Neighborhood Cleaning Rule) with popular ensemble classifier algorithms (Bagging tree, Random forest, and AdaBoost). This work corresponds to the second article. Regarding the literature on Logistic regression, the dissertation provides a modification of its computation process. The proposed work makes Logistic regression more effective than the existing methods for Logistic regression with imbalanced data while retaining the ability to show the importance level of input features without using the p-value. This modification is presented in the third article.

Dissertation outline

The dissertation “Imbalanced data in classification: A case study of credit scoring” has five chapters.

• Chapter 2 Literature review of imbalanced data

• Chapter 3 Imbalanced data in credit scoring

• Chapter 4 A modification of Logistic regression with imbalanced data

Chapter 1 is the introduction, which briefly introduces the contents of the dissertation. This chapter presents the overview of imbalanced data in classification. Besides, other contents are the motivations, research gap identifications, objectives, subjects, scopes, data, methods, contributions, and the dissertation outline.

Chapter 2 is the literature review on imbalanced data in classification. This chapter provides the definition, obstacles, and related issues of imbalanced data, for example, overlapping classes. Besides, this chapter presents the performance measures for imbalanced data in depth. The most important section is the review of approaches to imbalanced data, including the algorithm-level, data-level, and ensemble-based approaches. Chapter 2 also examines the basic background and recently proposed works in credit scoring. The detailed discussion of previous studies clarifies the pros and cons of existing balancing methods, which is the framework for developing the new balancing methods in the dissertation.

Chapter 3 is the case study of imbalanced classification in credit scoring. This chapter is based on the main contents of the first and second articles referred to in Section 1.6. We propose an ensemble classifier that can address imbalanced data and provide the importance level of predictors. Furthermore, we extend the algorithm of this credit-scoring ensemble classifier to handle overlapping and noise before dealing with imbalanced data. Empirical studies are conducted to verify the effectiveness of the proposed algorithms.

Chapter 4 is another study on imbalanced data, related to Logistic regression. This chapter proposes a modification of the inner and outer parts of the computation process of Logistic regression. The inner part is a change in the performance criterion used to estimate the score, and the outer part is a selective application of re-sampling techniques to re-balance the training data. Experimental studies on nine data sets verify the performance of the modification. Chapter 4 corresponds to the third article referred to in Section 1.6.

Chapter 5 is the conclusion, which summarizes the dissertation, discusses the implications of the proposed works, and suggests some further studies.

Chapter 2
LITERATURE REVIEW OF IMBALANCED DATA

Imbalanced data in classification

Description of imbalanced data

According to Definition 1.1.4, any data set with a skewed quantity of samples in the two classes is technically imbalanced data (ID). In other words, any two-class data set with an imbalanced ratio (IR) greater than one is considered ID. There is no conventional IR threshold for concluding that a data set is imbalanced. Most authors simply define ID as a data set in which one class has a much greater (or smaller) number of samples than the other (Brown & Mues, 2012; Haixiang et al., 2017). Other authors assess a data set as imbalanced if the class of interest has significantly fewer samples than the other and ordinary classifier algorithms encounter difficulty in distinguishing the two classes (Galar et al., 2011; López, Fernández, García, Palade, & Herrera, 2013; Sun, Wong, & Kamel, 2009). Therefore, a data set is considered ID when its IR is greater than one and most samples of the minority class cannot be identified by standard classifiers.

Obstacles in imbalanced classification

In ID, the minority class is usually misclassified since there is too little information about its patterns. Besides, standard classifier algorithms often operate according to the rule of maximizing the accuracy metric. Hence, the classification results are usually biased toward the majority class to obtain the highest global accuracy, with very low accuracy for the minority class. On the other hand, the patterns of the minority class are often specific, especially in extreme ID, which leads to minority samples being ignored (they may be treated as noise) in favor of the more general patterns of the majority class. As a consequence, the minority class, which is the object of interest in the classification process, is usually misclassified in ID.

The above analyses are also supported by empirical studies. Brown and Mues (2012) concluded that the higher the IR, the lower the performance of classifiers. Furthermore, Prati, Batista, and Silva (2015) found that the expected performance loss, defined as the proportional performance difference between ID and balanced data, became significant when the IR reached 90/10 or greater. Prati et al. also pointed out that the performance loss tended to increase quickly for higher values of IR.

In short, a high IR is a factor that reduces the effectiveness of standard classifiers.

Categories of imbalanced data

In real applications, combinations of ID and other phenomena make classification processes more difficult. Some authors even claim that ID is not the only main reason for poor performance: overlapping, small sample size, small disjuncts, and borderline, rare, and outlier samples are also causes of the low effectiveness of popular classifier algorithms (Batista et al., 2004; Fernández et al., 2018; Napierala & Stefanowski, 2016; Sun et al., 2009).

• Overlapping or class separability (Fig. 2.1b) is the phenomenon of an unclear decision boundary between the two classes, meaning that some samples of the two classes are blended. On data sets with overlapping, standard classifier algorithms such as Decision tree, Support vector machine, or K-nearest neighbors become harder to apply effectively. Batista et al. (2004) stated that the IR was less important than the degree of overlap between classes. Similarly, Fernández et al. (2018) believed that any simple classifier algorithm could perform classification independently of the IR in the case of no overlapping.

• Small sample size: Learning algorithms need a sufficient number of samples to generalize a rule for discriminating the classes. Without large training sets, a classifier not only fails to generalize the characteristics of the data but may also produce an over-fitted model (Cui, Davis, Cheng, & Bai, 2004; Wasikowski & Chen, 2009). On imbalanced and small data sets, the lack of information about the positive class becomes more serious. Krawczyk and Woźniak (2015) stated that, for a fixed IR, the more samples in the minority class, the lower the error rate of classifiers.

Figure 2.1: Examples of circumstances of imbalanced data (Source: Galar et al., 2011)

• Small disjuncts (Fig. 2.1c): This problem occurs when the minority class consists of several sub-spaces in the feature space. Therefore, small disjuncts provide classifiers with a smaller number of positive samples than large disjuncts. In other words, small disjuncts cover rare samples that are too hard to find in the data sets, and learning algorithms often ignore rare samples when setting the general classification rules. This leads to a higher error rate on small disjuncts (Prati, Batista, & Monard, 2004; Weiss, 2009).

• The characteristics of positive samples, such as borderline, rare, and outlier samples, affect the performance of standard classifiers. Borderline samples are always difficult to recognize, and rare and outlier samples are extremely hard to identify. According to Napierala and Stefanowski (2016) and Van Hulse and Khoshgoftaar (2009), an imbalanced data set with many borderline, rare, or outlier samples makes standard classifiers less efficient.

In summary, studying ID should pay attention to the related issues such as the overlapping, small sample size, small disjuncts, and the characteristics of the positive samples.

Performance measures for imbalanced data

Performance measures for labeled outputs

Most learning algorithms produce labeled outputs, for example, K-nearest neighbors, Decision tree, ensemble classifiers based on Decision tree, and so on. A convenient way to present the performance of labeled-output classifiers is a cross-tabulation between actual and predicted labels, known as the confusion matrix (Table 2.1).

Table 2.1: Confusion matrix

                   Predicted positive   Predicted negative   Total
Actual positive    TP                   FN                   POS
Actual negative    FP                   TN                   NEG
Total              PPOS                 PNEG                 N

In Table 2.1, TP, FP, FN, and TN follow Definition 1.1.3. Besides, POS and NEG are the numbers of actual positive and negative samples in the training data, respectively; PPOS and PNEG are the numbers of predicted positive and negative samples, respectively; and N is the total number of samples. From the confusion matrix, several metrics are built to provide a framework for analyzing many aspects of a classifier. These metrics can be divided into two types: single and complex metrics.

The most popular single metric is accuracy or its complement, error rate.

Accuracy is the proportion of correct outputs, and the error rate is the proportion of incorrect ones. Therefore, the higher the accuracy (equivalently, the lower the error rate), the better the classifier.

Although accuracy and error rate are easy to calculate and interpret, they may mislead the performance evaluation of a classifier in the case of ID. Firstly, on an imbalanced data set with a very high IR, standard classifiers often obtain a very high accuracy and a low error rate even when the number of correctly classified positive samples is small, despite their crucial role in the classification task. Secondly, the error rate treats the cost of misclassifying the positive class and the negative class equally, whereas in ID the misclassification of a positive sample is often more costly than that of a negative one. Therefore, imbalanced classification studies use single metrics that focus on a specific class, such as TPR (or recall), FPR, TNR, FNR, and precision.

TPR is the proportion of the positive samples classified correctly. Other names for TPR are recall and sensitivity.

FPR is the proportion of the negative samples classified incorrectly.

TNR (or specificity) and FNR are the complements of FPR and TPR, respectively.

Precision is the proportion of the actual positive samples among the predicted positive class.

Among these metrics, accuracy, TPR, TNR, and precision are expected to be as high as possible, while FPR and FNR are expected to be as low as possible. In many applications, some specific metrics are prioritized. For instance, in imbalanced classification, TPR rather than accuracy is the most favored metric because of the importance of the positive class. However, in credit scoring and cancer diagnosis, if one focuses only on the TPR and ignores the FPR, a trivial classifier could simply assign all samples the positive label; such a classifier cannot identify any negative samples, which causes a non-negligible loss. Hence, high values of both precision and recall are preferred in these circumstances. In summary, each single performance metric has its own meaning, and the choice of metrics depends on the application field.
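All of the single metrics above can be read directly off the confusion matrix in Table 2.1. A short R sketch with hypothetical counts (the numbers are illustrative only):

```r
# Hypothetical confusion-matrix counts in the notation of Table 2.1
TP <- 40; FN <- 10          # actual positives:  POS = TP + FN
FP <- 60; TN <- 890         # actual negatives:  NEG = FP + TN
N  <- TP + FN + FP + TN

accuracy  <- (TP + TN) / N
TPR       <- TP / (TP + FN)     # recall, sensitivity
FPR       <- FP / (FP + TN)
TNR       <- TN / (FP + TN)     # specificity
FNR       <- FN / (TP + FN)
precision <- TP / (TP + FP)

round(c(accuracy = accuracy, TPR = TPR, FPR = FPR,
        TNR = TNR, FNR = FNR, precision = precision), 3)
```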

The single metrics alone may not provide enough information to evaluate the performance of a classifier, especially in ID, which leads to combinations of the above single metrics. F-measure is one of the most popular complex metrics. F-measure expresses the precision-recall trade-off through a weighted harmonic mean, following the formula:

F_β = (1 + β²) TP / [(1 + β²) TP + FP + β² FN]   (2.8)

where β is a positive parameter controlling the relative importance of FP and FN. The parameter β is set greater than 1 if and only if FN is of more concern than FP. F1 is the special case of F_β in which precision and recall are equally important; equivalently, FP and FN play the same role in F1. Unless otherwise noted, F-measure refers to F1:

F1 = 2 TP / (2 TP + FP + FN)   (2.9)

The maximum value of F1 is 1. According to formula (2.9), the value of F1 is high if and only if both precision and recall are high. In applications, F1 is usually chosen in cancer diagnosis or credit scoring (Abdoli et al., 2023; Akay, 2009; Chen, Li, Xu, Meng, & Cao, 2020).

Another complex metric is G-mean, which uses the geometric mean of TPR and TNR:

G-mean = √(TPR × TNR)   (2.10)

G-mean collects information about both the positive and negative classes, not only the positive class as F-measure does. G-mean is high if and only if both TPR and TNR are high, and its ideal value is 1.
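Formulas (2.8)-(2.10) translate directly into R; the counts below are the same hypothetical ones used earlier and are illustrative only.

```r
# Hypothetical confusion-matrix counts
TP <- 40; FN <- 10; FP <- 60; TN <- 890

# Formula (2.8): F-beta as a function of the weight beta
f_beta <- function(beta) (1 + beta^2) * TP / ((1 + beta^2) * TP + FP + beta^2 * FN)

F1     <- f_beta(1)                                   # formula (2.9)
F2     <- f_beta(2)                                   # weights FN more heavily than FP
G_mean <- sqrt((TP / (TP + FN)) * (TN / (FP + TN)))   # formula (2.10)

round(c(F1 = F1, F2 = F2, G_mean = G_mean), 3)
```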

Performance measures for scored outputs

Besides labeled-output classifiers, several classifiers produce scored outputs that express the likelihood of belonging to each class, for instance, Logistic regression. Usually, samples with high scores are predicted to be positive. Generally, the scored outputs are converted into labeled ones by comparison with a given threshold. If the target of the classification is to restrict prediction errors on the positive class, a low threshold is chosen, which yields a high TPR but also a high FPR. Conversely, a high threshold reduces the FPR but raises the FNR. In short, the choice of threshold for a scored-output classifier depends on which performance metrics are to be optimized.

When scored outputs are transformed into labeled ones, samples with the same label are treated equally although their likelihoods of belonging to the positive class may be very different. Therefore, the Receiver Operating Characteristic curve (ROC), the Area under the ROC curve (AUC), the Kolmogorov-Smirnov statistic (KS), and the H-measure are popular threshold-free measures for evaluating the performance of scored classifiers without changing the type of outputs. These metrics, considered overall (general) performance metrics, are also widely used in imbalanced classification studies.

2.2.2.1 Area under the Receiver Operating Characteristics Curve

The Receiver Operating Characteristic curve (ROC) is a graph showing the relationship between FPR and TPR over all possible thresholds. The ROC is plotted in the two-dimensional plane with the x-axis and y-axis representing FPR and TPR, respectively. The ROC is expected to hug the top left corner, where the classifier yields high TPRs and low FPRs. In the unit square, the ROC of a useful classifier must lie above the diagonal, which corresponds to the ROC of a random classifier.

Figure 2.2 illustrates the ROCs of three classifiers and a random one. In this figure, all classifiers have better overall performance than the random one, since all three curves are above the red diagonal. Besides, the first and second curves are above the third, which means the third classifier shows the poorest performance: at the same FPR, it always offers a lower TPR than the first and the second. However, we cannot compare the overall performance of the first with the second in this way. A natural approach is to compare the areas under the ROC curves (AUCROC), bounded by the ROC curve and the two axes. The greater the AUCROC, the better the classifier. Conveniently, AUCROC is shortened to AUC.

AUC is the expected TPR averaged over all FPRs with all possible thresholds (Ferri, Hernández-Orallo, & Flach, 2011). The AUC of a random classifier is 0.5, so the AUC is expected to be greater than 0.5; besides, the AUC of the ideal classifier is 1. Hence, the AUC usually falls in the range [0.5, 1]. With a discrete series of thresholds {α_i}, i = 1, ..., n, the AUC is estimated by the trapezoidal formula:

AUC = (1/2) Σ_{i=2}^{n} |FPR(α_i) − FPR(α_{i−1})| (TPR(α_i) + TPR(α_{i−1}))   (2.11)

where TPR(α) and FPR(α) are the TPR and FPR corresponding to the threshold α.
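Formula (2.11) can be evaluated directly by sweeping a threshold grid over the scores; the simulated scores below are illustrative only.

```r
set.seed(5)
# Simulated scores: positive samples tend to receive higher scores
y     <- c(rep(1, 100), rep(0, 900))
score <- c(rnorm(100, mean = 1), rnorm(900, mean = 0))

alpha <- sort(unique(c(-Inf, score, Inf)))                  # threshold grid
TPR   <- sapply(alpha, function(a) mean(score[y == 1] > a))
FPR   <- sapply(alpha, function(a) mean(score[y == 0] > a))

# Formula (2.11): trapezoidal sum over consecutive thresholds
n   <- length(alpha)
AUC <- 0.5 * sum(abs(FPR[-1] - FPR[-n]) * (TPR[-1] + TPR[-n]))
AUC
```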

In the ID literature, AUC is the most popular performance metric for determining the optimal classifiers and comparing learning algorithms (Batista et al., 2004; Brown & Mues, 2012; Huang & Ling, 2005). However, AUC has some weaknesses. Firstly, AUC may provide incorrect evaluations when ROCs cross each other. For example, a ROC may be higher only in a neighborhood of a specific threshold and lower than the other ROCs at all remaining thresholds; this curve may still correspond to a greater AUC than the others, even though the others show a higher TPR at most thresholds. In this case, AUC may be an irrational measure. Secondly, according to Hand (2009), AUC is an incoherent performance measure: "AUC is equivalent to averaging the misclassification loss over a cost ratio distribution which depends on the score distributions" of the classifier itself; thus, AUC evaluates different classifiers by different metrics. However, Ferri et al. (2011) argue that Hand's argument is not "a natural interpretation". Besides, Ferri et al. (2011) confirm the AUC's coherent meaning as a general classification performance measure and its independence of the classifier itself.

Figure 2.3: Illustration of KS metric

2.2.2.2 Kolmogorov-Smirnov statistic

The Kolmogorov-Smirnov statistic (KS) is another popular metric measuring the predictive power of classifiers (He, Zhang, & Zhang, 2018; Shen et al., 2021; Yang et al., 2021). KS expresses the degree of separation between the predicted positive and predicted negative classes. Figure 2.3 is an illustration of the KS metric, which is defined by formula (2.12).

Although a high KS implies an effective classifier, KS only reflects good performance in the locality of the point determining KS (Řezáč & Řezáč, 2011). In Figure 2.3, KS is attained at threshold 0.55, so the analysis of effectiveness is only meaningful in the neighborhood of this value.
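As a hedged illustration of how KS is obtained in practice, the sketch below (not from the dissertation; the toy data are arbitrary) computes the largest gap between TPR and FPR over all thresholds, which is the usual operational reading of the KS metric illustrated in Figure 2.3.

```python
import numpy as np

def ks_statistic(scores, labels):
    """KS = max over thresholds of (TPR - FPR), i.e. the largest gap between
    the score distributions of the positive and negative classes."""
    thresholds = np.sort(np.unique(scores))
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == 0)
    best_ks, best_t = 0.0, None
    for t in thresholds:
        tpr = np.sum((scores >= t) & (labels == 1)) / n_pos
        fpr = np.sum((scores >= t) & (labels == 0)) / n_neg
        if tpr - fpr > best_ks:
            best_ks, best_t = tpr - fpr, t
    return best_ks, best_t

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0, 0, 0])
print(ks_statistic(scores, labels))   # KS and the threshold where it is attained
```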

2.2.2.3 H-measure

Hand (2009) strongly criticizes AUC and proposes the H-measure as a substitution. The H-measure is the fractional improvement in the expected minimum loss compared with a random classifier. The formula of the H-measure is:

$$H = 1 - \frac{L}{L_{ref}} \quad (2.13)$$

where L is the overall expected minimum misclassification loss and L_ref is the expected minimum misclassification loss corresponding to a random classifier. The H-measure can overcome the AUC's limitation by fixing a classifier-independent distribution of the relative misclassification cost. The expected loss in the definition of the H-measure can be taken over any loss distribution. Most applications follow the popular proposal of Hand and Anagnostopoulos (2014) for the beta distribution Beta(π_1 + 1, π_0 + 1) (π_0 and π_1 are the proportions of the negative and positive class in the population, respectively). Although the H-measure appeared recently, it has become popular in recent classification studies, for example, Ala'raj and Abbod (2016); Garrido, Verbeke, and Bravo (2018); He et al. (2018).

Conclusion of performance measures in imbalanced classification

There are two types of performance metrics in the literature on imbalanced classification: one for labeled outputs and one for scored outputs. Regarding labeled outputs, accuracy is the universal performance metric, but it may mislead the evaluation of classifiers' effectiveness in ID since pursuing the highest accuracy can leave positive samples incorrectly classified. In several application fields such as credit scoring or cancer diagnosis, F-measure and G-mean are the popular metrics instead of accuracy. Regarding scored outputs, AUC, KS, and the H-measure are favored. However, it should be noted that there is no perfect performance measure suitable for all data sets. Every metric has its own meaning and drawbacks. Hence, it is necessary to utilize both the overall and threshold-based metrics to get an adequate analysis of a classifier's performance.

Approaches to imbalanced classification

Algorithm-level approach

The algorithm-level approach, which focuses on the intrinsic classifiers, modifies the underlying algorithms to restrict the negative impact of ID. The target of the algorithm-level approach is usually to raise a specific performance metric or to constrain a consequence of ID.

Let’s review some typical types of the algorithm-level approach in ID.

2.3.1.1 Modifying the current classifier algorithms

The algorithm-level approach limits the bias toward the majority class of imbalanced data by modifying or correcting the underlying mechanism of a selected classifier, for example, Support vector machine, Decision tree, or Logistic regression.

Modifications of the Support vector machine usually focus on the decision boundary, while those of the Decision tree pay attention to the feature-splitting criteria, and those of Logistic regression are related to the log-likelihood function or the maximum likelihood estimation process.

Table 2.2 shows some representatives of this approach.

Table 2.2: Representatives employing the algorithm-level approach to ID

• Applying specific kernel modifications to rebuild the decision boundary in order to reduce the bias toward the majority class: Wu and Chang (2004); Xu (2016); Yang, Yang, and Wang (2009).

• Setting a weight on the samples in the training set based on their importance (the positive samples are usually assigned a higher weight): Lee, Jun, and Lee (2017); Lee et al. (2017); Yang, Song, and Wang (2007).

• Applying the active learning paradigm, especially in the situation where the samples of the training set are not fully labeled: Hoi, Jin, Zhu, and Lyu (2009); Sun, Xu, and Zhou (2016); Žliobaitė, Bifet, Pfahringer, and Holmes (2013).

• Proposing a new distance for creating splits: Cieslak, Hoens, Chawla, and …; Boonchuay, Sinapiromsaran, and Lursinsap (2017); Lenca, Lallich, Do, and Pham (2008); Liu, Chawla, Cieslak, and Chawla (2010).

• Re-computing the maximum likelihood estimate for the intercept and the conditional probability of belonging to the positive class.

• Weighted maximum likelihood estimation: Maalouf and Siddiqi (2014); Maalouf and Trafalis (2011); Manski and Lerman (1977).

• Penalized maximum likelihood estimation: Firth (1993); Fu, Xu, Zhang, and Yi (2017); Li et al. (2015).

2.3.1.2 Cost-sensitive learning

The basic idea of cost-sensitive learning (CSL) is that every misclassification causes a loss. Denote by C(1, 0) and C(0, 1) the losses when predicting a positive sample as negative and a negative sample as positive, respectively. The simplest form of CSL is the independent misclassification cost, which sets C(1, 0) and C(0, 1) as constants. Under these notations, the total cost function is:

$$C(1,0) \times FN + C(0,1) \times FP \quad (2.14)$$

The target of the independent cost form is to find the optimal threshold α* corresponding to the minimum value of the total cost function:

$$\alpha^{*} = \arg\min_{\alpha \in (0,1)} \left[ C(1,0) \times FN(\alpha) + C(0,1) \times FP(\alpha) \right] \quad (2.15)$$

where FN(α) and FP(α) are the numbers of false negative and false positive samples corresponding to the threshold α, respectively. Table 2.3 shows the independent misclassification cost matrix for a prediction result.

Table 2.3: Cost matrix in Cost-sensitive learning

                      Predicted positive    Predicted negative
Actual positive       0                     C(1, 0)
Actual negative       C(0, 1)               0

In ID, CSL sets C(1, 0) higher than C(0, 1) to compensate for the bias toward the negative class. This assumption is also rational in real-world classification applications because misclassifying a positive sample usually causes more serious problems than misclassifying a negative one.

Many authors assign C(0, 1) a unit value and C(1, 0) a constant C greater than one. Some studies proposed formulas or procedures to find the optimal threshold based on C(0, 1) and C(1, 0), such as Elkan (2001); Moepya, Akhoury, and Nelwamondo (2014); Sheng and Ling (2006). Besides the independent cost, other authors pursued the dependent misclassification cost, which puts an individual cost on each observation (Bahnsen, Aouada, & Ottersten, 2014, 2015; Petrides, Moldovan, Coenen, Guns, & Verbeke, 2022).
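The following minimal sketch, assuming illustrative cost values C(1, 0) = 5 and C(0, 1) = 1 (they are not taken from the cited works), shows how the optimal threshold in (2.15) can be searched on a simple grid.

```python
import numpy as np

def optimal_cost_threshold(scores, labels, c_fn, c_fp):
    """Grid-search the threshold minimising the total cost in (2.15):
    c_fn * FN(alpha) + c_fp * FP(alpha)."""
    best_alpha, best_cost = None, np.inf
    for alpha in np.linspace(0.01, 0.99, 99):
        pred = (scores >= alpha).astype(int)
        fn = np.sum((pred == 0) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        cost = c_fn * fn + c_fp * fp
        if cost < best_cost:
            best_alpha, best_cost = alpha, cost
    return best_alpha, best_cost

# illustrative costs: misclassifying a positive (e.g. a bad customer) is 5x worse
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0, 0, 0])
print(optimal_cost_threshold(scores, labels, c_fn=5.0, c_fp=1.0))
```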

Among methods of algorithm-level approach, CSL is the most popular (Fer- nández et al., 2018; Haixiang et al., 2017) since CSL can be embedded into other classifier algorithms such as:

• Support vector machine (SVM): Datta and Das (2015); Iranmehr, Masnadi- Shirazi, and Vasconcelos (2019); Ma, Zhao, Wang, and Tian (2020).

• Decision tree (DT): Drummond, Holte, et al (2003); Jabeur, Sadaaoui, Sghaier, and Aloui (2020); Qiu, Jiang, and Li (2017).

• Logistic regression (LR): Shen, Wang, and Shen (2020); Sushma S J and Assegie (2022); Zhang, Ray, Priestley, and Tan (2020).

The effectiveness of CSL strongly depends on the assumption of the cost matrix. If the difference between C(1, 0) and C(0, 1) is too high, the positive class is over-favored in the classification process, which pushes up the FPR. Otherwise, if this difference is too low, the classifier does not provide enough adjustment to rebalance the bias toward the negative class. Therefore, constructing the cost matrix is the major concern in CSL. There are two popular scenarios for the cost matrix:

• The cost matrix is built on an expert's opinion. For example, in credit scoring, Moepya et al. (2014) assigned C(1, 0) the average loss when accepting a bad customer, based on an expert's experience. This scenario often depends on prior information, which is the subjective opinion of researchers without transparent evidence.

• The cost matrix is inferred from the data set. Some authors assigned IR to the cost C(1, 0) and 1 to C(0, 1) since they implied that the higher the IR, the poorer the classification performance (Castro & Braga, 2013; López, Del Río, Benítez, & Herrera, 2015). However, IR is not the only factor reducing the performance of classifiers (see Subsection 2.1.3). If IR is used as the cost C(1, 0), any data sets with the same IR will be treated similarly despite belonging to different application fields.

In summary, the cost of loss in CSL is usually a disputable issue.

2.3.1.3 Comments on algorithm-level approach

The algorithm-level approach focuses on the intrinsic nature of classifiers.

It requires a deep understanding of the classifier algorithms to directly deal with the consequences of ID. Hence, algorithm-level methods are usually designed for specific classifier algorithms. Therefore, this approach seems less flexible than the data-level approach.

CSL is the most popular method of algorithm-level approach However, the cost matrix is usually a controversial issue.

In the future, combinations of the algorithm-level and data-level approaches should be considered to create more effective and versatile balancing methods.

Data-level approach

The data-level approach involves re-sampling techniques to re-balance or alleviate the skewed distribution of the original data set. These techniques are easy to apply and do not depend on the learning algorithms training the classification model after the data pre-processing stage. Therefore, the data-level approach is a natural strategy for solving ID. In the imbalanced classification literature, many empirical studies agreed that re-sampling techniques improved the performance measures of most classifiers, such as Batista et al. (2004); Brown and Mues (2012); Prati et al. (2004). This approach forms three main groups of methods, including under-sampling, over-sampling, and the hybrid of under- and over-sampling techniques.

The under-sampling method removes negative samples, which are in the majority class, to re-balance or alleviate the imbalance status of the original data set.

The most common under-sampling technique is random under-sampling (RUS). RUS creates a balanced subset of the training set by randomly eliminating negative samples. RUS is non-heuristic, easy to employ, and shortens computation time. However, if the data is highly imbalanced, RUS may waste useful information from the majority class because of removing too many negative samples. Figure 2.4 depicts the operation of RUS.

Figure 2.4: Illustration of RUS technique
Source: Author's design
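A minimal sketch of RUS, assuming NumPy arrays X and y with the positive class coded as 1, could look as follows; it is an illustration added in this rewrite rather than a procedure used later in the dissertation.

```python
import numpy as np

def random_under_sample(X, y, random_state=0):
    """Randomly drop negative (majority, y == 0) samples until both classes
    have the size of the minority class."""
    rng = np.random.default_rng(random_state)
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    kept_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, kept_neg])
    return X[idx], y[idx]

X = np.random.default_rng(1).normal(size=(100, 4))
y = np.array([1] * 10 + [0] * 90)          # imbalanced ratio 9:1
X_bal, y_bal = random_under_sample(X, y)
print(np.bincount(y_bal))                   # 10 positives, 10 negatives
```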

To overcome the limitation of RUS, authors have developed heuristic methods to remove selected samples. Some representatives are the Condensed Nearest Neighbor Rule (Hart, 1968), Tomek-Link (Tomek et al., 1976), One-side Selection (Kubat, Matwin, et al., 1997), and the Neighborhood Cleaning Rule (Laurikkala, 2001). These methods can be used for balancing and cleaning data.

The Condensed Nearest Neighbor Rule (CNN) (Hart, 1968) finds a consistent subset E of the original data set S, which correctly classifies all samples of S by the 1-nearest neighbor classifier. Then, S is replaced with the store, which consists of the minority class and the subset of the majority class not belonging to E.


Figure 2.5: Illustration of CNN rule

CNN removes the negative samples of E, which are often far from the borderline between the classes. These samples are considered less relevant to the learning process. However, CNN does not determine the maximum consistent subset. Besides, CNN removes samples randomly, particularly in the initial stage; hence it often retains internal samples rather than boundary ones. In some cases, for instance, Figure 2.5, CNN is not a balancing method because it removes too many negative samples. Furthermore, samples in the store are at too close distances, which makes the characteristics of the two classes not distinctly different. This leads to difficulties in the operation of the following classifiers.

Tomek-Link (Tomek et al., 1976), which is an innovation of CNN, finds all pairs of samples (e_i, e_j) satisfying the conditions: i) e_i and e_j belong to different classes; ii) for any other sample e_k, d(e_i, e_j) < d(e_i, e_k) and d(e_i, e_j) < d(e_j, e_k), where d(e_i, e_j) is the distance between e_i and e_j.

Figure 2.6: Illustration of tomek-links

The pair (e_i, e_j) is called a tomek-link. If two samples form a tomek-link, then one of them may be noise or both of them are at the borderline. The reason is that only noise and boundary samples have a nearest neighbor belonging to the opposite class. Figure 2.6 gives examples of tomek-links. In this figure, tomek-links are the pairs of samples marked by the green ovals.

Tomek-Link can be applied as a cleaning or a balancing method. For the cleaning purpose, both samples in the tomek-link are removed. For the balancing purpose, only the negative sample of the tomek-link is eliminated. However, Tomek-Link cannot provide a fully balanced training data set. Besides, although Tomek-Link can detect noise and boundary samples, it cannot distinguish exactly which sample of the tomek-link is the noisy or boundary one. It may happen that, when removing the negative sample in a tomek-link, the remaining sample is a noise; using Tomek-Link is then not meaningful. In short, Tomek-Link should be combined with another re-sampling technique to obtain balanced training data without noise and overlapping classes.
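As an illustration added in this rewrite, the sketch below detects tomek-links as mutual nearest neighbours of opposite classes (Euclidean distance is assumed); removing only the negative member of each link corresponds to the balancing use described above.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) of opposite-class samples that are mutual
    nearest neighbours, i.e. tomek-links."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        if y[i] != y[j] and nearest[j] == i and i < j:
            links.append((i, j))
    return links

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(1.5, 1, (5, 2))])
y = np.array([0] * 20 + [1] * 5)
pairs = tomek_links(X, y)
# balancing use: remove only the negative member of each link
drop = [i if y[i] == 0 else j for i, j in pairs]
print(pairs, drop)
```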

One-side Selection (OSS) (Kubat et al., 1997) is an under-sampling technique that combines Tomek-Link and CNN. It is a natural idea because Tomek-Link takes care of removing noise and boundary samples, while CNN discards the redundant samples from the majority class. The remaining part of the training data set is considered "safe" for the learning process. However, OSS only reduces the imbalanced status of the original data; in some cases, OSS does not balance the training data.

The Neighborhood Cleaning Rule (NCL) (Laurikkala, 2001) performs in the following way. In the training set, each sample e_k is classified by the 3-nearest neighbors rule (3-NN). If e_k belongs to the majority class but the 3-NN predicts it to be from the minority class, then e_k is eliminated. Otherwise, if e_k belongs to the minority class and the 3-NN misclassifies it, then its nearest neighbors from the majority class are removed.

NCL not only reduces the quantity of the majority class but also handles the overlapping status. However, NCL is not really a balancing method. NCL should be combined with other under-sampling techniques, or even with over-sampling ones, to exploit its full advantages.

The clustering-based method, with the first version of Yen and Lee (2006), can be summarized as follows. In the beginning, the data set is clustered into K clusters. Then, in each cluster, some negative samples are randomly selected. Finally, all selected samples and the positive class are combined to form a new balanced training data set. The clustering-based method is expected to limit the issue of information loss, which is a drawback of RUS. Many innovations of the clustering method have been introduced and have improved the performance measures of classifiers in comparison with RUS, such as Nugraha, Maulana, and Sasongko (2020); Prathilothamai and Viswanathan (2022); Rekha and Tyagi (2021); Yen and Lee (2009). However, the optimal value of K, the number of clusters, has not been discussed deeply. Besides, the random choice of negative samples in each cluster does not handle noise or borderline samples. Furthermore, the computation time of clustering-based methods is often longer than that of RUS and other "nearest-neighbor approach" techniques such as CNN, Tomek-Link, OSS, and NCL (Yen & Lee, 2009).
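A simplified sketch of the clustering-based idea, assuming scikit-learn's KMeans and an even draw of negatives from each of the K clusters, is given below; the original procedure of Yen and Lee (2006) may differ in details.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_under_sample(X, y, k=5, random_state=0):
    """Cluster the majority class into k groups and draw negatives evenly from
    each cluster so the kept negatives roughly match the minority-class size."""
    rng = np.random.default_rng(random_state)
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    clusters = KMeans(n_clusters=k, n_init=10,
                      random_state=random_state).fit_predict(X[neg_idx])
    per_cluster = max(1, len(pos_idx) // k)
    kept = []
    for c in range(k):
        members = neg_idx[clusters == c]
        take = min(per_cluster, len(members))
        kept.extend(rng.choice(members, size=take, replace=False))
    idx = np.concatenate([pos_idx, np.array(kept)])
    return X[idx], y[idx]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.array([1] * 20 + [0] * 180)
X_bal, y_bal = cluster_under_sample(X, y, k=5)
print(np.bincount(y_bal))
```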

Notes. The operation of the nearest-neighbor and clustering approaches is based on "distance", an important concept in machine learning. Depending on the characteristics of the features, which are nominal or numeric, several types of distance measures are used. For example, distance measures for numeric samples are usually the Euclidean, Manhattan, and Minkowski distances. The distances for samples with nominal and numeric features can be calculated by the HEOM or HVDM metric. Details of these distance types can be found in Santos, Abreu, Wilk, and Santos (2020); Weinberger and Saul (2009); Wilson and Martinez (1997). A summary of distance types is in Appendix A.

In contrast with the under-sampling method, over-sampling increases the positive samples to clear or alleviate the imbalanced status of the original data. The most common technique of the over-sampling method is random over-sampling (ROS). ROS randomly duplicates positive samples to raise the quantity of the minority class. Figure 2.7 illustrates an example of ROS. ROS is non-heuristic and easy to apply. However, ROS lengthens the computation time. Furthermore, ROS may duplicate noise and borderline samples, which can lead to an overfitting classification model (Batista et al., 2004; Fernández et al., 2018).

Figure 2.7: Illustration of ROS technique
Source: Author's design

Therefore, heuristic techniques were proposed to overcome the limitations of ROS. The most popular is the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla, Bowyer, Hall, & Kegelmeyer, 2002).
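The sketch below, added in this rewrite, illustrates the core SMOTE idea in a simplified form (interpolating between a positive sample and one of its k nearest positive neighbours); it omits refinements of the original algorithm.

```python
import numpy as np

def smote(X_pos, n_new, k=5, random_state=0):
    """Generate n_new synthetic positive samples: for a random positive sample,
    pick one of its k nearest positive neighbours and interpolate between them."""
    rng = np.random.default_rng(random_state)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_pos))
        dist = np.linalg.norm(X_pos - X_pos[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]          # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append(X_pos[i] + gap * (X_pos[j] - X_pos[i]))
    return np.array(synthetic)

rng = np.random.default_rng(2)
X_pos = rng.normal(loc=1.0, size=(15, 4))               # minority class only
X_syn = smote(X_pos, n_new=30)                          # oversample to 45 positives
print(X_syn.shape)
```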

Ensemble-based approach

The ensemble-based approach integrates methods of the algorithm-level or data-level approach with an ensemble classifier algorithm to solve ID.

The term "ensemble model" refers to a collection of quite similar classifiers. The idea of ensemble classifiers is to leverage the collective power for decision-making across multiple sub-classifiers. Therefore, the effectiveness and diversity of sub-classifiers are the main concerns for the performance of an ensemble classifier (Fernández et al., 2018). In comparison with single classifiers, ensemble classifiers generally achieve better performance measures (Galar et al., 2011). Details of ensemble classifiers are discussed in Subsection 3.1.2.2.

2.3.3.1 Integration of algorithm-level method and ensemble classifier algorithm

The most popular version of this approach is the cost-sensitive ensemble. This type combines an ensemble learning algorithm with the costs of loss for each type of misclassification. There are two typical ways to deal with ID, including cost-sensitive Boosting and ensembles with cost-sensitive learning.

Cost-sensitive Boosting keeps the general framework of Boosting, for example, AdaBoost, and introduces the costs into the step of updating the weights. Some works belonging to this approach are Sun, Kamel, Wong, and Wang (2007); Tong et al. (2022); Zelenkov (2019). These authors explained their motivation and claimed their superiority. However, Nikolaou, Edakunni, Kull, Flach, and Brown (2016) found that the original Boosting and cost-sensitive Boosting algorithms performed equivalently if the cost-sensitive Boosting was not adjusted. Therefore, Nikolaou et al. (2016) suggested applying the original AdaBoost algorithm due to its simplicity, flexibility, and effectiveness.

Ensembles with cost-sensitive learning also retain the original structure of ensemble algorithms and apply cost-sensitive learning to assign the costs to each type of misclassification. In comparison with cost-sensitive Boosting, this approach is less flexible since the methods are oriented to a specific classifier algorithm. Some representatives of this approach are Krawczyk, Woźniak, and Schaefer (2014); Tao et al. (2019); Xiao et al. (2020).

The integration of ensemble classifiers and cost-sensitive learning may outperform that of ensembles and the data-level approach in some cases. However, the cost-sensitive approach usually faces arguments about the cost of loss. Therefore, this approach is not a popular choice in practical applications.

2.3.3.2 Integration of data-level method and ensemble classifier algorithm

Regarding the integration of the data-level approach, the training data for each sub-classifier of the ensemble is re-balanced by one or more re-sampling techniques. After that, the base learner is applied to this balanced data. Some typical works are listed below.

Boosting-based. AdaBoost or variants of Boosting algorithms are considered to build the classification model. The re-sampling techniques are applied at the beginning or the end of each iteration of the Boosting algorithm.

• SMOTEBoost (Chawla, Lazarevic, Hall, & Bowyer, 2003) combines SMOTE and the Boosting procedure. The standard Boosting sets equal weights for all misclassified samples, but SMOTEBoost does not do that. After every iteration, SMOTE creates synthetic samples from the minority class, so the updated weights of samples are changed. This process not only balances but also increases the diversity of the training data, which usually brings benefits to the learning process.

• RUSBoost (Seiffert, Khoshgoftaar, Van Hulse, & Napolitano, 2010) operates similarly to SMOTEBoost, but it randomly eliminates samples from the majority class at the beginning of each iteration.

• BalancedBoost (Wei, Sun, & Jing, 2014) combines over-sampling and under-sampling in each iteration. Furthermore, the re-sampling process is carried out according to the AdaBoost.M2 algorithm.

Bagging-based. The main idea of this type is to apply re-sampling techniques to change the distribution of the training data in each bootstrap step. Many works proposed ways to balance and diversify the training data across bags. Bagging-based methods are simpler than Boosting-based ones since there are no updates of weights or changes in the standard Bagging algorithm. Some typical works are as follows:

• OverBagging (Wang & Yao, 2009) uses ROS to balance the data for each sub-classifier instead of applying ROS to the whole original data at the beginning of the training process. In this method, there are two ways to create balanced data in each bag: (i) including the whole negative class and applying ROS to raise the quantity of the positive samples; (ii) including a bootstrap version of the negative class and then applying ROS to the positive class. In OverBagging, each sample will be present in at least one bag.

• SMOTEBagging (Wang & Yao, 2009) has some differences from OverBagging. In each iteration, the negative class is bootstrapped and the positive class is resampled with replacement to reach a proportion of the original positive class. This proportion varies from 10% in the first bag to 100% in the last bag. Then, the SMOTE algorithm is employed to balance the data.

Similar to the Boosting-based approach, SMOTEBagging and OverBagging, which belong to the over-sampling approach, lead to ensemble classifiers with long computation times. Furthermore, Bagging-based methods probably suffer from the overlapping issue (with SMOTEBagging) and overfitting (with OverBagging).

• UnderBagging (Barandela, Valdovinos, & Sánchez, 2003) trains the ensemble classifier on N balanced data sets which are subsets of the original data set, where N is approximately the IR. Firstly, the majority class is divided randomly into N subsets with the same quantity as the minority class. Secondly, the N balanced data sets are the unions of the minority class and each subset of the majority class. Finally, the base learner is trained on the N balanced data sets in parallel. The authors believed that since every sample of the majority class is included in the computation process, the loss of potential information from this class would be reduced.

• ClusteringBagging (Wang, Xu, & Zhou, 2015) operates similarly to UnderBagging. However, the clustering method is applied to the majority class to get K clusters with different quantities. Then, each cluster and a bootstrapped resample of the minority class form a balanced data set for each sub-classifier.

UnderBagging and ClusteringBagging belong to the under-sampling approach. They can reduce the computation time in comparison with the over-sampling one. Besides, ensemble classifiers based on under-bagging can diversify the input. However, each sub-classifier does not employ the whole information from the original data since each iteration uses only a part of it. Thus, each sub-classifier may be ineffective, which leads to an ensemble classifier with poor performance.
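A minimal sketch of the UnderBagging idea, assuming decision trees as the base learner and a majority vote for aggregation, is given below; the parameter choices are illustrative and not taken from Barandela et al. (2003).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def under_bagging_fit(X, y, n_bags, random_state=0):
    """Train one tree per balanced subset: each bag keeps all positives and a
    distinct random slice of the negatives (UnderBagging idea)."""
    rng = np.random.default_rng(random_state)
    pos_idx = np.where(y == 1)[0]
    neg_idx = rng.permutation(np.where(y == 0)[0])
    models = []
    for part in np.array_split(neg_idx, n_bags):
        idx = np.concatenate([pos_idx, part])
        models.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))
    return models

def under_bagging_predict(models, X):
    """Majority vote across the sub-classifiers."""
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = np.array([1] * 30 + [0] * 270)
models = under_bagging_fit(X, y, n_bags=9)   # n_bags close to the IR of 9
print(under_bagging_predict(models, X[:5]))
```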

2.3.3.3 Comments on ensemble-based approach

Conclusions of approaches to imbalanced data

Addressing ID is a common challenge in classification. There are three popular approaches to imbalanced issues, including algorithm-level, data-level, and ensemble-based, which are summarized in Figure 2.9. Each of them has strengths and weaknesses.

The algorithm-level approach modifies the classification algorithm by adjusting specific details, changing the decision threshold, or using the CSL method. This approach is effective in certain cases, but it has some drawbacks. First of all, it is often limited to specific algorithms. In other words, this approach cannot be applied to all types of data sets and may require significant customization to be effective. Furthermore, several algorithm-level methods are too complicated to understand how the algorithm produces predictions and to identify any biases or errors in the model. Finally, some algorithm-level methods can require too long a computation time and significant resources to implement. All of these issues can make the algorithm-level approach less favored in real-world applications.

The data-level approach modifies the data by either under-sampling the majority class or over-sampling the minority class to balance or alleviate the imbalanced issue. This approach is more flexible and simpler than the algorithm-level one.


Figure 2.9: Approaches to imbalanced data in classification

However, it still has limitations. Firstly, it can cause a loss of valuable information from the majority class when applying under-sampling, or it can lead to an overfitting model when using over-sampling. Secondly, the effectiveness of the data-level approach depends on the sampling technique used, and there is no universal solution that is the best for all data sets. Choosing inappropriate sampling methods can lead to a model with poor performance. Therefore, it is necessary to consider these drawbacks when deciding on an appropriate approach to address ID.

The ensemble-based approach combines an ensemble algorithm with one of the approaches above. This approach is favored due to its outstanding effectiveness. However, it can lead to an over-fitting model if the sub-classifiers are not diverse enough. Besides, it can lengthen the computation time, especially with a Boosting-based ensemble. Moreover, it may be too difficult to interpret the effects of the inputs on the output of the ensemble classifier since the final prediction is a combination of multiple sub-classifiers.

Note that although these approaches can improve the performance of a classifier on ID, there is no one-size-fits-all solution. The choice of technique depends on the specific problem, the size of the data set, the imbalanced ratio, and the desired performance metrics. In conclusion, handling imbalanced data in classification is an ongoing research topic, and selecting the most appropriate approach is critical for building an accurate and robust model.

Credit scoring

Meaning of credit scoring

There are several definitions of "credit scoring" in the literature. Authors explain the meaning of each component of this term. "Credit" refers to an amount of money lent by a financial institution to a customer, which must be repaid in installments with interest. "Scoring" uses several numerical tools to rank loans based on actual or perceived quality. Scores can be expressed in the form of "letters" or "labels" to represent the credit risk status of the customer (Anderson et al., 2007; Hand & Henley, 1997). Meanwhile, Thomas, Crook, and Edelman (2017) define credit scoring as a set of decision models and basic techniques to assist lenders in granting credit. Besides, Louzada et al. (2016) state that credit scoring is an arithmetical representation based on the analysis of the customer's creditworthiness, a useful tool for assessing and preventing default risk.

In this dissertation, credit scoring is the discrimination of customers into "bad" or "good" labels based on their features and loan characteristics. The "bad" label is assigned to those with a high probability of default and, vice versa, the "good" label is for those with a low probability of default.

Credit scoring is necessary for both banks and customers. For banks, credit scoring provides valuable information to make appropriate credit-granting decisions. A misclassification of customers, for example, assigning the good label to customers with high credit risk, can lead to huge losses for the bank (Abdou & Pointon, 2011). On the side of customers, knowing their credit risk status helps them to improve their rating or score, thereby accessing loans with reasonable interest rates and terms. Therefore, credit scoring contributes to preventing bank losses and ensuring proper cash flow in the economy.

Since the Basel Committee on Banking Supervision released the Basel Accords, especially the third accord in 2013, credit scoring has attracted more consideration. One of the main contents of Basel III is to enhance risk management and supervision. Specifically, Basel III includes requirements for governance practices such as more robust risk measurement and management processes, improvements in evaluating risk-weighted assets, and deeper supervisory review.

Real-world applications require credit scoring models to be both effective and explicit. The requirement for effectiveness is understandable. The explicitness of credit scoring models, which provides clear explanations for the predicted results, is a framework for composing regulations on credit risk management.

In summary, a useful credit scoring model should meet two requirements: i) accurately classifying bad customers; and ii) transparently interpreting the classification results.

Inputs for credit scoring models

Credit scoring has been utilized since the 1950s (Thomas et al., 2017). In this early stage, credit scoring was carried out according to the expert method known as the 5C rule. This rule included the factors considered important in the process of credit risk assessment, including:

• Character: The reputation and perceived trustworthiness of customers.

• Capital: The amount of money invested by the customers.

• Collateral: The assets used to guarantee or secure a loan.

• Capacity: The ability to repay the loan.

• Condition: The conditions of customers’ business, for example, the state of the economy, industry trends, and so on.

This approach was only concerned with the current loan. It did not consider background information about customers such as their payment history, employment history, consumption habits, and so on.

Following the 5C rule, Fair Isaac Corporation (FICO), a data analytics company in the United States, provides a credit score framework based on five main categories, namely payment history, amount of debt, length of credit history, new credit, and credit mix, to calculate a score expressing the credit risk status. 1

• Payment history accounts for about 35% of the total score. It consists of factors reflected in customers' payment timelines, including bills paid on time, late payments, collections, bankruptcies or foreclosures, and the severity of any delinquencies.

• Amount of debt accounts for 30% of the total score. It includes information about the account types of customers (credit cards, loans, etc.), the utilization rate (the amount of available credit used), and the balance on revolving credit accounts.

• Length of credit history makes up 15% of the FICO score. It considers the length of time credit accounts have been open, the age of the oldest account, and the average age of all accounts.

• New credit contributes 10% of the FICO score. It looks at information about the new accounts of customers, such as the number of recently opened accounts, the number of recent credit inquiries, and the time since a new account was opened. Having multiple new accounts in a short period or having too many credit inquiries may negatively affect credit scores.

• Credit mix accounts for 10% of the total score. It concerns the variety of types of credit accounts, such as credit cards, retail accounts, installment loans, and mortgages. Having a diverse mix of credit types can have a positive impact on the credit score.

1 https://www.myfico.com/credit-education/whats-in-your-credit-score

The FICO score has become one of the most popular credit scoring approaches. However, the FICO score does not provide a reasonable explanation for the weights of the components in the credit scoring formula. Furthermore, other financial factors such as income, savings, and assets are not considered. Even though the factors mentioned in the 5C rule and FICO potentially lead to an incomplete assessment of the overall financial status of a customer, they are widely used by most banks or institutions when constructing credit scoring models.

With the tendency to deploy as many input features as possible, information about the individuals and families of customers, such as gender, income, consumption, marital status, education level, number of family members, local reputation, and so on, is included in credit scoring models. Another trend is to use time series data as the input of credit scoring models. In addition, credit scoring models have included macroeconomic variables (Zhang, Chi, & Zhang, 2018) and lags of factors such as income and loan amount (Zhang, Xu, Hao, & Zhu, 2020). The authors explain that credit risk is influenced by the economic environment and follows a chronological order.

In summary, the inputs of a credit scoring model are not limited to a rigid theoretical framework.

Interpretability of credit scoring models

Interpretability is "the degree to which a human can understand the cause of a decision" (Miller, 2019). The higher the interpretability of a model, the easier the understanding of its outputs (Molnar, 2018). As a result, the term "interpretability" is sometimes replaced by "explainability".

In some real-world applications, users are concerned with both the classification model's accuracy and its interpretability. This point of view comes from incompleteness in problem formalization: the prediction result is only a partial solution, and the reasons for the final result can bring valuable information to understand or explain the result (Doshi-Velez & Kim, 2017). For example, a credit scoring model pointing out the importance levels of features can help the bank control the regulations and manage the credit risk.

In credit scoring applications, interpretability can be measured from two perspectives: i) the size of the set of decision rules, which is usually used to evaluate tree-based models (Dumitrescu, Hué, Hurlin, et al., 2021); ii) the marginal effects of the predictors, such as the important features or the explicit scorable outputs (Wang et al., 2015).

Thus, according to the views of Dumitrescu et al. (2021) and Molnar (2018), Discriminant analysis, Logistic regression, and Decision tree are explainable models for the following reasons. The output of Discriminant analysis and Logistic regression is the conditional probability of belonging to the class of interest, which is a reasonable reference to classify samples. In addition, regarding Logistic regression, the statistical significance of the parameters corresponding to the predictors shows the predictors' ability to affect the response. Regarding the Decision tree, the importance level of a predictor is proportional to the number of its segments: the more segments a predictor has, the more essential it is. Conversely, Support vector machine and Artificial neural network are representatives of "black box" models since it is too difficult to interpret the reason for the final result or to point out the impact of predictors on the response.

Interpretability and effectiveness are competing aspects of a credit scoring model (Brown & Mues, 2012). Interpretability involves a simple and transparent structure, while effectiveness is related to a complicated and opaque one. For example, most credit scoring ensemble classifiers (homogeneous and heterogeneous) aim to improve their effectiveness but are not interested in their interpretability. According to Dastile et al. (2020), only eight percent of primary studies have investigated new credit scoring models with transparent structures.

Approaches to imbalanced data in credit scoring

Credit scoring is a typical case of imbalanced classification, where the bad customers are the objects of concern. The number of bad customers is always far less than that of good ones since there are multiple regulations to screen potentially bad customers. This leads to the fact that most credit scoring models prioritize dealing with ID to raise their effectiveness.

Although all balancing approaches can be applied to credit scoring, the most popular methods are cost-sensitive learning, re-sampling techniques, and ensemble-based methods with typical works such as:

• Cost-sensitive learning: Moepya et al (2014); Petrides et al (2022); Xiao et al (2020); W Zhang et al (2020).

• Re-sampling techniques: Batista et al (2004); Brown and Mues (2012); Marqués et al (2013); Shen et al (2019).

• Ensemble-based methods: Abdoli et al (2023); Fiore, De Santis, Perla, Zanetti, and Palmieri (2019); He et al (2018); Shen et al (2021); Wang et al (2015); Yotsawat, Wattuya, and Srivihok (2021); Zhang et al (2021).

It can be seen that the ensemble-based approach is the current trend in addressing ID in credit scoring. This approach exploits both the effectiveness of ensemble classifiers and the handling of ID to increase the performance measures of credit scoring models. However, the more effective the models, the more complicated their structure, and thus the less interpretable their prediction results. Therefore, a model solving both ID and interpretability is an expectation for the credit scoring application.

Recent credit scoring ensemble models

Most of the recent credit scoring models are hybrid and ensemble classifiers built with complex structures. Some recent representatives are listed below.

Abdoli et al. (2023) proposed a bagging supervised auto-encoder classifier (BSAC) that leveraged the performance of the supervised auto-encoder. BSAC tackled ID by a bagging process based on under-sampling of the majority class. The performance measures of BSAC, especially F-measure and G-mean, were higher than those of some previous studies. However, BSAC did not address the interpretability aspect.

Zhang et al. (2021) proposed a hybrid credit scoring model with voting-based outlier detection, balanced sampling, and a stacking-based method to optimize the learners' parameters. This model addressed ID similarly to the work of Abdoli et al. (2023), that is, by integrating the under-sampling technique and the bagging strategy. However, the computation process was a burden because of the optimization of the hyper-parameters of many considered learners such as XGBoost, Gradient Boosting DT, AdaBoost, RF, LR, Bagging tree, and ExtraTree. Furthermore, Zhang et al. did not discuss interpretability.

Other hybrid models based on deep learning were GAN (Fiore et al., 2019), LSTM (Shen et al., 2021), ACSS (Yang et al., 2021), and CS-NNE (Yotsawat et al., 2021). GAN addressed imbalanced data by adding mimicked positive samples thanks to two feed-forward neural networks with competitive targets. LSTM and ACSS applied improved versions of SMOTE, while CS-NNE used the CSL method and did not explain the weights of the positive and negative samples. Another hybrid model named EBCA (He et al., 2018), which was not based on deep learning, used extended balance cascade, a technique of the data-level approach, to deal with ID. The common point of these models was a complex and unexplainable computation process.

On the contrary, other models paid attention to interpretability, such as GSCI (X. Chen et al., 2020) and PLTR (Dumitrescu et al., 2021). GSCI discussed the important features but did not show the calculation formula. Meanwhile, PLTR considered the size and the maximal number of predicates of the decision rules. However, these works did not address ID.

In summary, the most recent credit scoring models are hybrid and ensemble classifiers. Authors increased the performance measures thanks to innovations in the structure of the models and solutions for ID. However, most of them did not solve both ID and interpretability.

Chapter summary

This chapter refers to ID and three main related issues, including performance measures, balanced methods, and credit scoring.

ID is a situation where the number of samples in the positive class is significantly lower than that in the negative class. ID poses challenges for standard classifiers because they tend to bias their predictions toward the majority class. Hence, the standard classifiers operate poorly on the minority class even though this is often the crucial class in classification. Furthermore, combinations of ID and other issues, such as overlapping classes, small sample sizes, and small disjuncts, can make classifiers perform even worse.

When the majority class is over-supported by most classifiers, accuracy is not a rational performance metric. Imbalanced classification studies and practitioners should use other metrics rather than accuracy. F-measure, G-mean, AUC, KS, and the H-measure can be considered at the same time to provide a complete understanding of the performance of classifiers in both classes.

Regarding approaches to ID, algorithm-level, data-level, and ensemble-based methods are effective in most cases if used appropriately. A note on balanced-data methods is that a best solution for all data sets does not exist. Therefore, a new balanced-data technique or algorithm that is more effective than the ones in the literature is always a target of researchers. Combinations of approaches, such as data-level methods with ensemble classifier algorithms or algorithm-level with data-level methods, can be promising solutions for ID.

Regarding credit scoring, an example of imbalanced classification, all balancing approaches can be utilized. Besides, interpretability is an important requirement for a credit scoring model. However, recently proposed credit scoring models have not addressed both ID and interpretability. Therefore, solving ID to increase performance measures and interpreting the classification results should both be kept in mind when constructing a new credit scoring model.

Chapter 3 IMBALANCED DATA IN CREDIT SCORING

This chapter studies imbalanced data in a specific application, credit scoring, where bad customers are usually of more concern than good ones. Unluckily, the bad customers always form the minority class in this classification task. Most credit scoring models do not pay much attention to addressing imbalanced data and interpreting the final result. Therefore, we propose a credit scoring ensemble model for imbalanced data sets. The proposed model can rank the importance of the features on the final predictive result. In addition, this idea of dealing with imbalanced credit scoring derives a solution for imbalanced data, overlapping classes, and noise based on the ensemble approach.

Classifiers for credit scoring

Single classifiers

Discriminant analysis (DA) has two common methods, including linear discriminant analysis (LDA) (Altman, Marco, & Varetto, 1994; Baesens et al., 2003; Desai, Crook, & Overstreet Jr, 1996; West, 2000; Yobas, Crook, & Ross, 2000) and quadratic discriminant analysis (QDA) (Baesens et al., 2003).

Consider a training data set consisting of n independent samples {(Y_i, X_i)}_{i=1}^{n}, where Y_i ∈ {1, ..., k} is the label and X_i ∈ R^p is the vector expressing the features of the i-th sample. DA classifies a sample with a hidden label into one of the classes {1, ..., k} based on its features. LDA assumes that: i) for all i, X_i = (X_{1i}, ..., X_{pi}) ∈ R^p follows a multivariate normal distribution; ii) the samples in each j-th class (j = 1, ..., k) follow a multivariate normal distribution, and the covariance matrices of the classes are equal.

Based on the conditional probabilities P(Y = j | X = x) (j = 1, ..., k), the sample x is classified into the class corresponding to the greatest probability.

The idea of QDA is similar to LDA, but assumption ii) is replaced by: ii') the samples in each j-th class (j = 1, ..., k) follow different multivariate normal distributions, and the covariance matrices of the classes are different.

LDA is a special case of QDA. It is less flexible than QDA since condition ii) is often impractical. However, assumption ii') raises the number of parameters in QDA. If samples have p features, there are kp parameters relative to the conditional probabilities P(Y = j | x) (j = 1, ..., k) when using LDA. Meanwhile, estimating k class-specific covariance matrices adds kp(p + 1)/2 parameters, so there are kp + kp(p + 1)/2 parameters in total when using QDA.

In comparison with other classifiers, DA offers a simple computation process. The output of DA provides natural evidence for classifying samples. Therefore, DA is a representative of interpretable classifiers. However, its assumptions are too idealized; hence LDA and QDA are usually dominated by other classifiers, such as Logistic regression (Desai et al., 1996; Wiginton, 1980), Support vector machine, and Random forest (Brown & Mues, 2012).
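For illustration, the sketch below (added in this rewrite) fits LDA and QDA with scikit-learn on synthetic Gaussian data and reads the conditional probabilities P(Y = j | X = x) used for classification; the data are artificial.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# two Gaussian classes; QDA allows class-specific covariance matrices
X = np.vstack([rng.normal(0.0, 1.0, (80, 3)), rng.normal(1.0, 2.0, (20, 3))])
y = np.array([0] * 80 + [1] * 20)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

# both expose P(Y = j | X = x); the sample goes to the class with the largest one
print(lda.predict_proba(X[:3]))
print(qda.predict_proba(X[:3]))
```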

The k-nearest neighbors (KNN) method is a non-parametric classifier introduced by E. Fix and J. Hodges in 1951. This is a simple and intuitive method that classifies a new sample based on the majority vote of its k nearest neighbors in the training data.

The hyper-parameter k directly impacts the decision boundary, the generalization ability of the method, and the bias-variance trade-off. In the case of small k (e.g., k = 1), KNN reflects the local patterns clearly but is too sensitive to noise, boundary samples, and fluctuations in the data. It also means that small k gives KNN lower bias but higher variance. In contrast, if k is very high (e.g., k = 20 or 30), KNN considers a greater number of neighbors, which leads to a more generalized decision boundary. Then, KNN provides a robust and smoother decision boundary (lower variance). However, higher values of k cause higher bias because the large decision region may imperfectly capture the local patterns (Chomboon, Chujai, Teerarassamee, Kerdprasop, & Kerdprasop, 2015; Taunk, De, Verma, & Swetapadma, 2019). The optimal value of k depends on the specific data set. Selecting the appropriate k is often done through experimentation and validation. Therefore, it requires balancing the trade-off between bias and variance, considering the characteristics of the data and the desired generalization ability of the method.

The distance metric is also a factor impacting the effectiveness of KNN. In practice, there are several types of distance metrics such as Euclidean, Manhattan, Minkowski, Overlap, HEOM, HVDM, and so on (Weinberger & Saul, 2009; Wilson & Martinez, 1997).
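As a small illustration of the role of k and the distance metric, the sketch below cross-validates KNN with several values of k on a synthetic imbalanced data set; the settings are assumptions of this rewrite, not results from the cited studies.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# imbalanced toy data (about 10% positives); k and the metric are the main levers
X, y = make_classification(n_samples=500, n_features=6, weights=[0.9, 0.1],
                           random_state=0)
for k in (1, 5, 20):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    score = cross_val_score(knn, X, y, cv=5, scoring="roc_auc").mean()
    print(f"k={k}: mean AUC={score:.3f}")
```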

In credit scoring, KNN usually performs less effectively than the popular classifiers such as Logistic regression, Decision tree, Support vector machine, and Artificial neural network (Brown & Mues, 2012; Li, Wang, & Wang, 2009; West, 2000). Furthermore, on imbalanced data sets, the performance of KNN decreases as the imbalanced ratio increases (Brown & Mues, 2012).

Logistic regression (LR) is one of the most popular classifiers in classification. The content of LR is summarized as follows.

Let Y ∈ {0, 1} be the variable for labels and X = (X_1, ..., X_p) ∈ R^p be the predictor variables. LR supposes that the conditional probability of belonging to the positive class is:

$$\pi(x) = P(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \beta x^{T}}}{1 + e^{\beta_0 + \beta x^{T}}} \quad (3.1)$$

where β_0 is called the intercept parameter and β = (β_1, ..., β_p) are the parameters showing the effects of the predictors on the conditional probability π(x), which is also called the score of x ∈ R^p.

Consider a data set of n independent samples:

(x_i, y_i) ∈ R^{p+1}, i = 1, ..., n, where x_i ∈ R^p is the vector expressing the p features and y_i ∈ {0, 1} is the label of the i-th sample. Then, the parameters in (3.1) can be estimated by maximizing the log-likelihood function:

$$l(Y \mid X, \beta) := \log L\left(P(Y \mid X, \beta)\right) = \sum_{i=1}^{n}\left[ y_i \log \pi(x_i) + (1 - y_i)\log\left(1 - \pi(x_i)\right)\right] \quad (3.2)$$

The solution of (3.2) can be computed by an iterative algorithm, for instance, the Newton-Raphson method. A new sample x* is classified into the positive class if and only if its score is greater than a given threshold. The details of LR can be found in James, Witten, Hastie, and Tibshirani (2013).

To conclude on the effects of the predictors on the score, statistical testing procedures compare the parameters β_j (j = 1, ..., p) with zero. If a parameter is statistically significant at the level α, the corresponding predictor is implied to impact the score. These tests are easy to employ by comparing α with the probability values of the test statistics (p-values), which are quickly calculated by most statistical software. Therefore, LR is a simple, transparent, and employable model.
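A brief sketch of fitting (3.1)-(3.2) and reading the p-values of the β_j, using statsmodels on simulated data, could look as follows; the simulated coefficients are arbitrary assumptions of this rewrite.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
# imbalanced response generated from a logistic model with a negative intercept
logit = -2.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)   # maximises (3.2)
print(model.params)                                    # beta_0, beta_1, ..., beta_p
print(model.pvalues)                                   # tests of beta_j = 0
score = model.predict(sm.add_constant(X))              # pi(x), the score in (3.1)
print((score >= 0.5).astype(int)[:10])                 # labels for a 0.5 threshold
```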

LR is even more popular than ensemble classifiers in credit scoring (Onay & Öztürk, 2018). Empirical credit scoring studies using LR include Baesens et al. (2003); Bensic et al. (2005); Chen, Yadav, Khan, and Zhu (2020); Desai et al. (1996); West (2000); Wiginton (1980). Among those, Desai et al. (1996) and Wiginton (1980) showed that LR offered a higher accuracy than DA. On the contrary, several authors concluded that LR was less effective than Decision tree, Support vector machine (Bensic et al., 2005), and Artificial neural network (Etheridge & Sriram, 1997). Furthermore, on ID, the parameter estimation of LR from (3.2) can be biased and the scores can be under-estimated (Firth, 1993). Therefore, LR usually misclassifies the positive samples.

A modification of LR is Lasso-Logistic regression (LLR), in which the main problem is finding β̂_0, β̂_1, ..., β̂_p satisfying:

$$\max_{\beta}\; l(Y \mid X, \beta) \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \leq t \quad (3.3)$$

where l(Y | X, β) is the log-likelihood function referred to in (3.2) and t > 0 is a tuning parameter.

If t is sufficiently large, the constraint imposed on the parameters is not strict, and the solution of (3.3), β̂_j (j = 1, ..., p), is the same as that of (3.2). On the contrary, if t is very small, the magnitudes of β̂_j (j = 1, ..., p) are shrunk. Then, due to the property of the absolute-value function, some of the β̂_j are exactly zero. Therefore, the constraint on β_j (j = 1, ..., p) in (3.3) plays the role of a feature selection method: only the predictors relevant to the response, which correspond to non-zero β̂_j, are retained in the fitted model.

Based on the theory of convex optimization, problem (3.3) is equivalent to:

$$\min_{\beta}\; \left\{ -\, l(Y \mid X, \beta) + \lambda \sum_{j=1}^{p}|\beta_j| \right\} \quad (3.4)$$

where λ is a penalty level, corresponding one-to-one to the tuning parameter t in (3.3). If λ is zero, the solution of LLR is exactly equal to LR's solution in (3.2). Otherwise, if λ is sufficiently large, the solution of LLR is zero. For values of λ between the two extremes, LLR gives a solution with some of the β̂_j equal to zero, so some predictors are excluded from the model. The values of λ are surveyed on a grid search to select the best one based on criteria such as AIC, BIC, or a cross-validation procedure. With a given λ, problem (3.4) is solved by the coordinate descent algorithm and proximal Newton iterations (see Gareth, Daniela, Trevor, and Robert (2013); Hastie, Tibshirani, and Wainwright (2015) for more details). Besides being a feature selection method, the predictive power of LLR has been better than that of LR in empirical studies (Li et al., 2019; Wang et al., 2015).
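As a hedged illustration, the sketch below fits an L1-penalised logistic regression with scikit-learn, where the grid over the regularisation strength C = 1/λ plays the role of the grid search over λ described for (3.4); it is not the coordinate-descent implementation referenced above, and the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=800, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# L1-penalised logistic regression; C = 1/lambda is chosen by cross-validation
llr = LogisticRegressionCV(penalty="l1", solver="saga", Cs=20, cv=5,
                           scoring="roc_auc", max_iter=5000).fit(X, y)
kept = np.flatnonzero(llr.coef_[0])
print("selected penalty C:", llr.C_[0])
print("predictors retained (non-zero coefficients):", kept)
```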

A decision tree (DT) consists of rules to classify samples. The set of rules splits the feature space of the samples into sub-spaces that possess similar specific attributes. Constructing a DT on a training set means determining the order of the predictor variables and the conditions for branching on them. The process iterates a recursion over each split sub-space. Splitting stops when it is no longer possible to split or when all samples in a sub-space have the same output. Therefore, the division process of the feature space follows a series of condition-result ("if - then - else") rules on the attributes. A hidden-label sample belonging to the sub-space S_k is predicted to be in class j (j = 0, 1) if most samples in S_k have the label j.
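To make the "if - then - else" structure concrete, the toy sketch below encodes a small rule set with hypothetical predictors and thresholds; it is generic and does not reproduce the exact tree of Figure 3.1.

```python
def classify(x1, x2):
    """A toy decision tree as nested if-then-else rules.
    The thresholds 0.3, 0.6, 0.5, 0.8 are placeholders, not from the dissertation."""
    if x2 < 0.3 or x1 > 0.8:
        return "Red"
    if x2 < 0.6:
        return "Green"
    return "Green" if x1 < 0.5 else "Red"

print(classify(0.2, 0.7), classify(0.9, 0.5), classify(0.4, 0.8))
```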

Figure 3.1: Illustration of a Decision tree

Figure 3.1 is an illustration of a DT built on a training set with two classes: Green (G) and Red (R). The features of the samples are represented by two predictor variables, x_1 and x_2. If a sample has x_2 smaller than h_1 or x_1 greater than t_2, it belongs to class R. Otherwise, if x_2 is smaller than h_2, it belongs to class G. On this condition, if x_1 is less than t_1, it is a member of G; otherwise, it is a red sample. The outputs of this DT are the terminal nodes "Red" and "Green", which are also called leaves.

Ensemble classifiers

The term "ensemble model" refers to the combination of several classification models, also named sub-classifiers, to leverage their collective power for decision-making (Roncalli, 2020). Ensemble classifiers can be divided into two types: heterogeneous and homogeneous ensembles.

Heterogeneous ensemble classifiers (sometimes called hybrid models) combine different techniques or algorithms, which are called base learners, to leverage their strengths and compensate for their weaknesses. Hybrid classifiers often involve the fusion of different classifiers such as DT, LR, SVM, ANN, or KNN.

In credit scoring, hybrid models have brought promising results compared to individual classifiers (Dumitrescu et al., 2021; Shen et al., 2021; Yang et al., 2021; Zhang et al., 2021). However, hybrid classifiers do not attract as much consideration as homogeneous ensembles because of some limitations. Firstly, hybrid models often require more extensive training and tuning. The combination of multiple algorithms introduces additional hyper-parameters and configuration options that need to be optimized. This process makes model development require more effort and expertise. Secondly, hybrid models require substantial computational resources such as memory and processing power. For example, hybrid models consisting of SVM or ANN usually take too long to compute, which can limit their scalability, particularly when dealing with large data sets or real-time applications. Finally, hybrid models often make classification more complex. Hence, it is too difficult to interpret or understand the decision-making process, especially when the hybrid models include multiple classifiers or complicated techniques.

Homogeneous ensemble classifiers (also called ensemble models) combine similar base learners to make predictions collectively. The term "ensemble classifiers" referred to in Subsection 2.3.3 means homogeneous ensemble classifiers. From this point, unless otherwise stated, ensemble classifiers are interpreted as homogeneous ones.

Ensemble classifiers usually employ a base learner many times, on different subsets of the data set or with various hyper-parameters. The individual classifiers (sub-classifiers) are integrated with specific strategies to get the final prediction.

The main idea of ensembles follows natural human behavior when making a decision. Instead of seeking an expert at a high cost, a set of several normal workers at a cheap cost is an alternative. This idea implies that single errors can be suppressed by multiple results capturing many aspects of the training data. In other words, several sub-classifiers can predict a more accurate overall result than an individual one. Therefore, the effectiveness and diversity of sub-classifiers are the concerns of an ensemble classifier (Fernández et al., 2018). Another concept related to the diversity of an ensemble is the bias-variance decomposition. The bias can be characterized as the ability to generalize the prediction results to a test set. On the contrary, the variance can be depicted as the sensitivity of the classifier to the training set. Hence, the performance improvement of ensembles often comes from the reduction of variance (bagging ensembles) or bias (boosting ensembles) (Fernández et al., 2018).

Ensemble classifiers can work in parallel or sequential ways. The parallel type consists of many independent sub-classifiers. The final classification results are combined according to the majority rule, where the results of each sub-classifier may have the same or different weights. The Bootstrap aggregating (Bagging) classifier and Random forest are typical cases of parallel ensembles. On the other hand, the sequential type includes innovative versions of sub-classifiers. Generally, the following sub-classifier is grown by a modification of the previous one. The overall classification result is often the final sub-classifier's result.

The boosting-based algorithm, for example, Adaptive Boosting (AdaBoost), is typical of this type.

Bagging classifier (Breiman, 1996) consists of several iterations of the same learner trained on bootstrap versions of the original data set. Bagging diversifies sub-classifiers by changing the training data set in each iteration. An unknown-label sample is classified into the class suggested by the majority of sub-classifiers (with or without weights). The advantages of Bagging are its simple execution and its ability to reduce variance.

The pseudo-code of Bagging is shown in Table B.1, Appendix B.
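As a minimal illustration of this idea (and not the pseudo-code of Table B.1), a Bagging-style ensemble can be sketched in Python; the function names and the use of scikit-learn trees are illustrative assumptions rather than the dissertation's implementation.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=30, random_state=0):
    """Train one decision tree per bootstrap sample of (X, y)."""
    rng = np.random.RandomState(random_state)
    n = len(y)
    trees = []
    for _ in range(n_estimators):
        idx = rng.choice(n, size=n, replace=True)    # bootstrap resample
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Majority vote (equal weights) over the sub-classifiers."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (B, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)   # binary labels 0/1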

Random forest (RF) (Breiman, 2001), which utilizes DT as the base learner, also has iterations similar to Bagging but randomly picks some features instead of the whole feature space. Therefore, the level of diversity of RF is higher than that of Bagging. The effectiveness of RF depends on the power and the correlation of the sub-classifiers. Analogous to Bagging, the number of iterations can be chosen quite large without over-fitting.

The algorithm of RF is shown in Table B.2, Appendix B.

Adaptive boosting (AdaBoost) (Freund, Schapire, et al., 1996), which is the first-introduced version of the Boosting family, uses DT as the base classifier algorithm. AdaBoost follows the idea that the next classifier corrects the mistakes of the previous one and contributes to the overall predicted result according to its performance.

AdaBoost uses two types of weight: D_t(i) for every sample x_i at the t-th iteration, and α_t for every t-th sub-classifier. After the t-th iteration, AdaBoost modifies the weights D_t(i): the weights of misclassified samples are increased, and those of correctly classified samples are decreased in the next iteration. As regards the weight α_t, if the error rate of the t-th sub-classifier is greater than 0.5, α_t is assigned zero; thus, this sub-classifier does not contribute to the overall result due to its poor performance. With a new sample, each sub-classifier t gives a predicted class accompanied by a weighted vote α_t, and then the final result is determined by the majority.
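For reference, the textbook AdaBoost updates can be written as follows (the variant described above additionally sets α_t to zero when the error rate exceeds 0.5, so these expressions are the standard form rather than the exact rule used here):
\[
\varepsilon_t=\sum_{i:\,h_t(x_i)\neq y_i} D_t(i), \qquad
\alpha_t=\frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}, \qquad
D_{t+1}(i)=\frac{D_t(i)\exp\bigl(-\alpha_t\, y_i\, h_t(x_i)\bigr)}{Z_t},
\]
with labels coded as $y_i, h_t(x_i)\in\{-1,+1\}$ and $Z_t$ a normalizing constant, so that misclassified samples receive larger weights in iteration $t+1$.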

Through the AdaBoost algorithm, Freund et al. (1996) prove that weak classifiers, for example DT, can become stronger in the sense of the probably approximately correct learning framework.

AdaBoost can reduce the bias, instead of the variance as Bagging does. Besides, because of its sequential operation, a Boosting ensemble takes a longer computation time than Bagging and RF with the same number of iterations. Unlike Bagging and RF, when the number of iterations becomes large, AdaBoost may over-fit. Furthermore, the effectiveness of AdaBoost is as good as that of RF but sometimes less (Breiman, 2001).

Table B.3, Appendix B summarizes the operation of AdaBoost.

In the credit scoring literature, empirical studies agreed that ensembles had superior performance compared with single classifiers (Brown & Mues, 2012; Dastile et al., 2020; Lessmann et al., 2015; Marqués et al., 2012). Finlay (2011) and Kim, Kang, and Kim (2015) concluded that AdaBoost was the best solution, even in imbalanced circumstances, while the Bagging tree was supported by Finlay (2011) and Luo (2022). Besides, RF was the most effective according to the study of Brown and Mues (2012).

Conclusions of statistical models for credit scoring

Statistical and machine learning models have been utilized in various ways in credit scoring. Table 3.1 presents some typical works on credit scoring, clustered by characteristics of the classifiers, such as single or ensemble and transparent or black-box structure. Each type of classifier has its advantages and disadvantages. Regarding effectiveness, homogeneous and heterogeneous ensemble classifiers usually dominate single classifiers. However, regarding interpretability, ensemble classifier algorithms often build black-box credit scoring models. Therefore, constructing an interpretable ensemble model is an urgent requirement for credit scoring.

Table 3.1: Representatives of classifiers in credit scoring

Single classifiers (Transparent):
  DA: Altman et al. (1994); Baesens et al. (2003); Desai et al. (1996); West (2000); Yobas et al. (2000).
  KNN: Brown and Mues (2012); Li et al. (2009).
  LR: Baesens et al. (2003); Bensic et al. (2005); K. Chen et al. (2020); Desai et al. (1996); Steenackers and Goovaerts (1989); West (2000); Wiginton (1980).
  DT: Brown and Mues (2012); Galindo and Tamayo (2000); Pandya and Pandya (2015); Pang and Gong (2009); Zhang et al. (2016).

Single classifiers (Black-box):
  SVM: Huang et al. (2007); Li et al. (2019); Schebesch and Stecking (2005); Van Gestel et al. (2006).
  ANN: Shen et al. (2019); West (2000); Yobas et al. (2000).

Ensemble classifiers:
  Heterogeneous: He et al. (2018); Shen et al. (2021); Yang et al. (2021); Yotsawat et al. (2021); Zhang et al. (2021).
  Boosting: Brown and Mues (2012); Cao, He, Wang, Zhu, and Demazeau (2021); Finlay (2011); Marqués et al. (2012).
  Bagging: Abdoli et al. (2023); Finlay (2011); Luo (2022).
  RF: Brown and Mues (2012); Cao et al. (2021); Ha et al. (2016); Marqués et al. (2012).

The proposed credit scoring ensemble model based on Decision tree

The proposed algorithms

3.2.1.1 Algorithm for balancing data - OUS(B) algorithm

Consider a training data set T with the majority class M_A (also the negative class) and the minority one M_I (the positive one): T = M_A ∪ M_I. The positive and negative labels of samples are denoted “1” and “0”, respectively.

Define D as the difference in the quantities of M_A and M_I. With B given and for every i (i = 1, ..., B), apply ROS to get a new positive class, denoted M_I^i, by randomly duplicating D × i / B positive samples. Then, RUS creates a new negative class M_A^i, which has the same quantity as M_I^i. The union of M_I^i and M_A^i is a balanced data set T_i: T_i = M_A^i ∪ M_I^i.

When i varies from 1 to B, each set T_i is balanced and differs in size from the others. That is the premise for the diversity of the sub-classifiers of DTE(B). In addition, the combination of ROS and RUS aims to take advantage of these techniques and compensate for their drawbacks. It is noted that when i equals B, T_B is created by ROS only; thus, there is no loss of negative-class information. The OUS(B) algorithm is described in Table 3.2.

Table 3.2 (OUS(B) algorithm). Input: T, the training data set; M_I and M_A, the positive and negative classes of T, respectively; B, the number of new balanced data sets. Output: a family of balanced data sets {T_i}_{i=1}^B.
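A minimal Python sketch of the OUS(B) idea under the definitions above; all function and variable names are illustrative rather than taken from the dissertation's implementation.

import numpy as np

def ous(X_pos, X_neg, B, random_state=0):
    """OUS(B): build B balanced sets by mixing ROS on the positive class
    and RUS on the negative class; the i-th set duplicates D*i/B positives."""
    rng = np.random.RandomState(random_state)
    D = len(X_neg) - len(X_pos)                      # class-size difference
    balanced_sets = []
    for i in range(1, B + 1):
        n_dup = int(round(D * i / B))                # positives to duplicate (ROS)
        dup_idx = rng.choice(len(X_pos), size=n_dup, replace=True)
        pos_i = np.vstack([X_pos, X_pos[dup_idx]])   # new positive class M_I^i
        neg_idx = rng.choice(len(X_neg), size=len(pos_i), replace=False)
        neg_i = X_neg[neg_idx]                       # RUS negative class M_A^i
        balanced_sets.append((pos_i, neg_i))         # when i = B, all negatives are kept
    return balanced_sets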

3.2.1.2 Algorithm for constructing ensemble classifier - DTE(B) algorithm

On each balanced data set in the output of the OUS(B) algorithm, the Recursive Partitioning and Regression Tree algorithm (RPART) (Therneau, Atkinson, et al., 1997) is applied to build the sub-classifiers of DTE(B). Finally, the predicted label of a sample is the majority vote of the B sub-classifiers of DTE(B). The algorithm for DTE(B) is in Table 3.3.

In each sub-classifier, the parameters are set as follows. The minimum number of observations in any terminal node is 10. The pruning process of each tree is determined by 5-fold cross-validation with the complexity parameter 0.001.

In each sub-classifier, when a feature is used for a split, the reduction in the loss function (e.g., classification error) is used to measure the importance of this feature. A feature can be used several times in a tree; the more splits it produces, the more essential it is. Therefore, the total reduction in the loss function across all splits of a feature is the index of its importance. With DTE(B), the overall importance level of a feature is the average of the B importance levels from the B sub-classifiers. In this study, the overall values are standardized so that the most important feature is scored 100 and the remaining features are scored relative to it.

In the notation of Table 3.3, the input is {T_i}_{i=1}^B, the family of balanced training data sets with the same number of features, and p, the number of features in each data set T_i. The importance vector of the i-th sub-classifier is FI_i = (FI_ij)_{j=1}^p, where FI_ij is the degree of importance of the j-th feature (and equals 0 if the feature is not used); the overall importance level vector FI averages these vectors over the B sub-classifiers.
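A condensed Python sketch of the DTE(B) idea: one tree per balanced set, majority voting, and averaged importance standardized to 100. The scikit-learn tree and its ccp_alpha pruning stand in only loosely for RPART with its complexity parameter; all names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def dte_fit(balanced_sets):
    """Fit one tree per balanced set; collect impurity-based importances."""
    trees, importances = [], []
    for X_pos, X_neg in balanced_sets:
        X = np.vstack([X_pos, X_neg])
        y = np.r_[np.ones(len(X_pos)), np.zeros(len(X_neg))]
        tree = DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=0.001).fit(X, y)
        trees.append(tree)
        importances.append(tree.feature_importances_)
    fi = np.mean(importances, axis=0)
    fi = 100 * fi / fi.max()                   # most important feature scored 100
    return trees, fi

def dte_predict(trees, X):
    """Majority vote of the B sub-classifiers."""
    votes = np.stack([t.predict(X) for t in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)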

Empirical data sets

Four data sets, which are German (GER), Taiwanese (TAI), Vietnamese 1 (VN1), and Vietnamese 2 (VN2), are used in the empirical study. The summary of the data sets is shown in Table 3.4. More details can be found in Appendix C.1 – C.4.

GER¹ and TAI² are public on the UCI machine learning repository. On the contrary, VN1 and VN2 are private data sets; due to security concerns, we cannot access detailed information about credit customers at Vietnamese banks. All features of VN1 and VN2 are in nominal form. They are the interest rate, terms, duration, loan amount, customer gender, loan purpose, base balance, current balance, type of customers, type of products, credit history of customers, and branches of the bank. Besides, the imbalanced ratios of VN1 and VN2 are notably high, especially that of VN2. These characteristics make the Vietnamese data sets different from GER and TAI.

1 http://archive.ics.uci.edu/dataset/144/statlog+german+credit+data
2 https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients

Table 3.4: Description of empirical data sets

Data sets   Sample size   # positive class (a)   Imbalanced ratio   # features (b)
VN2         16,407        1,340                  11.24              12
a: The number of positive samples; b: The number of total features.

VN1 and VN2 are used to determine the optimal DTE(B*), while GER and TAI are the validation data sets used to compare the optimal DTE(B*) with popular ensemble classifiers based on DT.

Computation process

The computation processes of credit scoring by DTE(B) and other popular ensemble classifiers based on DT follow the steps in Table 3.5.

Instead of finding the optimal B* corresponding to each data set, a general evaluation on the two Vietnamese data sets is conducted to determine the most suitable B* for both data sets. This phase corresponds to Steps 1 to 7 in Table 3.5.

Subsequently, DTE(B*) is applied to the public data sets, which are GER and TAI, to compare its performance measures with those of popular ensemble classifiers based on DT, such as Bagging tree, RF, and AdaBoost, with and without the popular re-sampling techniques ROS, RUS, SMOTE, and ADASYN. The comparison phase covers Steps 8 to 13 in Table 3.5. In this phase, the performance metrics, including AUC, KS, F-measure, G-mean, and H-measure, are used to provide an overview of the effectiveness of the proposed ensemble.

Table 3.5: Computation protocol of empirical study on DTE

1. On VN1 and VN2, randomly divide the data sets into the training (70%) and testing data (30%).

2. On the training data, with a given number B, apply the OUS(B) and DTE(B) algorithms to get the DTE(B) classifier.

3. On the testing data, compute the AUC, KS, and F-measure of DTE(B).

4. Repeat Steps 2 to 3 with other values of B.

5. Repeat Steps 1 to 4 fifty times.

6. Average the AUC, KS, and F-measure across the fifty runs.

7. The optimal DTE(B*) is the one corresponding to the highest averaged measures.

Compare DTE(B*) with other ensemble classifiers based on DT

8. On the empirical data set, randomly divide it into the training (70%) and testing data (30%).

9. Construct DTE(B*), Bagging tree, RF, and AdaBoost.

10. Construct Bagging, RF, and AdaBoost integrated with one of the techniques RUS, ROS, SMOTE, or ADASYN.

11. On the testing data, calculate the AUC, KS, F-measure, G-mean, and H-measure of all considered ensembles.

12. Repeat Steps 8 to 11 fifty times.

13. Average the performance metrics over the fifty runs.

To get robust evaluations, the computation process of all considered classifiers is carried out 50 times on each data set. Then, the comparisons are based on the average values of the performance measures.

Empirical results

3.2.4.1 The optimal Decision tree ensemble classifier

Table 3.6: Performance measures of DTE(B) on the Vietnamese data sets

* denotes the optimal value of B; bold values are the highest in each row.

With Bagging and RF, the number of sub-classifiers can be arbitrarily high. However, with DTE(B), the number of sub-classifiers, which is B, is bounded by D, the difference in the quantities of the negative and positive classes. As B gets closer to this upper bound, each balanced set T_i differs insignificantly from the others; therefore, the sub-classifiers in DTE(B) are not diverse. Thus, the survey for the optimal B* does not focus on extremely high values of B. Table 3.6 presents the mean testing AUC, KS, and F-measure of the DTE(B)s, averaged over 50 repetitions.

It is not easy to determine the trend of AUC. On the VN1 data set, the maximum value of AUC corresponds to DTE(3), while on the VN2 data set, AUC reaches its maximum at a higher B. However, AUC gradually stabilizes when B becomes large enough. Besides, the variations of KS and F-measure follow an inverse U-shape: increasing along with B, reaching the maximum, and then decreasing. Considering the computation time and the performance measures, the optimal value of B for the two Vietnamese data sets is 39.

3.2.4.2 Performance of the proposed model on the Vietnamese data sets

It is a fact that the ensemble-based approach with popular re-sampling techniques does not perform effectively on the empirical Vietnamese data sets, which are highly imbalanced. Table 3.7 shows the performance of DTE(39) against Bagging, RF, and AdaBoost with and without re-sampling techniques.

On the VN1 data set, DTE(39) outperforms the other classifiers on at least three evaluation criteria (AUC, KS, and H-measure), while on the VN2 data set, it surpasses the others on all five criteria. In short, DTE(39) is more effective than the ensemble-based approach to handling ID in the Vietnamese data sets. Another output of the DTE(B) algorithm is the vector FI representing the importance level of the features. Figure 3.4 describes the features' importance levels in the Vietnamese data sets. In the VN1 data set, “Asset” is the most important feature, followed by features such as “Purpose”, “Duration”, and “History”. Analogously, in the VN2 data set, the most significant features are “Interest”, “Duration”, “Types of Product”, and “Branches”, in descending order. These features of customers provide more information for predicting the likelihood of default than the others. This is a valuable framework for Vietnamese administrators to introduce regulations for screening potential default cases.

Figure 3.4: Importance level of features of the Vietnamese data sets (features shown: Interest, Duration, Product_types, Branches, Terms, Purposes, Current-balance, Base_balance, Amount, Loan_types, Sex, Customer_types)

In summary, on the Vietnamese data sets, DTE(39) fulfills the two requirements for a credit scoring model: effectiveness and interpretability.

Table 3.7: Performance of ensemble classifiers on the Vietnamese data sets

Data sets Classifiers AUC KS F-measure G-mean H-measure

Bold values are the highest of each criterion on each data set.

3.2.4.3 Performance of the proposed model on the public data sets

On the public data sets, which are the German and Taiwanese credit scoring data sets, DTE(39) is compared with the popular ensemble classifiers based on DT, without and with re-sampling techniques such as RUS, ROS, SMOTE, and ADASYN.

In the Bagging tree, RF, and AdaBoost, an increase in the number of trees usually leads to a decrease in the error rate (Breiman, 1996; Freund et al., 1996). Regarding Bagging and RF, a large number of trees in the ensemble does not cause over-fitting; however, the improvement in error rate is insignificant when the number of trees exceeds 20 for Bagging and 100 for RF (Breiman, 1996, 2001). Regarding AdaBoost, if there are too many trees overall, the computation time becomes very long and over-fitting becomes possible. For all these reasons, the parameters of the ensemble classifiers are assigned as follows (an illustrative configuration sketch is given after the list):

• Bagging and AdaBoost: The number of nodes along the longest path from the root node to the farthest terminal node is 10. The number of trees is 30.

• Random forest: The number of trees is 300. The number of features for each tree is the square root of the total number of features of each data set.
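A possible rendering of these settings with recent scikit-learn (version 1.2 or later); the parameter names quoted in the dissertation (maxdepth, mfinal) suggest R packages such as rpart and adabag, so the Python configuration below is only an analogous sketch, not the original setup.

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Depth-10 trees and 30 estimators for Bagging and AdaBoost
base = DecisionTreeClassifier(max_depth=10)
bagging = BaggingClassifier(estimator=base, n_estimators=30)
adaboost = AdaBoostClassifier(estimator=base, n_estimators=30)

# 300 trees; sqrt of the feature count considered at each split
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt")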

Steps 8-13 of the computation protocol in Table 3.5 are applied to the German and Taiwanese data sets. The testing performance measures are shown in Tables 3.8 and 3.9. On the German data set, DTE(39) achieves the highest values of AUC and H-measure. Besides, in comparison with each individual classifier, DTE(39) always wins by at least three out of five performance criteria. Similarly, on the Taiwanese data set, DTE(39) is the most effective among the considered classifiers since DTE(39) beats the others in AUC, KS, and H-measure.

In addition, the performance measures of DTE(39) are compared with those of some recent empirical studies, which are also presented in Tables 3.8 and 3.9. DTE(39) is still almost dominant in the AUC criterion. Furthermore, DTE(39) shows higher performance than GSCI (X. Chen et al., 2020) and EBCA (He et al., 2018) on the German data set, and it demonstrates superior performance compared to BSAC (Abdoli et al., 2023), LSTM (Shen et al., 2021), and the proposed model of Zhang et al. (2021) on the Taiwanese data set. In other cases, no recent ensemble completely outperforms DTE(39).

Table 3.8: Performance of ensemble classifiers on the German data set

Classifiers                       AUC      KS       F-measure   G-mean   H-measure
CS-NNE (Yotsawat et al., 2021)    0.8011   ——       ——          0.7363   ——
GSCI (X. Chen et al., 2020)       0.7042   ——       0.5822      ——       ——
EBCA (He et al., 2018)            0.8002   0.4932   0.8444      0.6203   0.3372

In summary, DTE(39) exhibits exceptional performance compared to both common methods and recent complex ensemble and hybrid models.

Table 3.9: Performance of ensemble classifiers on the Taiwanese data set

Classifiers                       AUC      KS       F-measure   G-mean   H-measure
BSAC (Abdoli et al., 2023)        ——       ——       0.5316      0.6807   ——
PLTR (Dumitrescu et al., 2021)    0.7780   0.4257   ——          ——       ——

DTE(39) offers superior performance in terms of AUC and H-measure across the four data sets when compared with tree-based ensemble classifiers such as Bagging, RF, and AdaBoost. On the German and Vietnamese 1 data sets, the AUCs of DTE(39) are significantly greater than those of the others; this means DTE(39) shows a higher expected TPR over all FPR values at all possible thresholds. Besides, attaining the highest H-measure implies that DTE(39) outperforms all other models in the expected minimum loss when the misclassification costs are taken into account. The AUC and H-measure are complementary metrics for evaluating the general performance of a classifier, and the outstanding AUC and H-measure show the robust effectiveness of DTE(39).

On the Vietnamese 1 and 2 data sets, which suffer from a highly imbalanced status, DTE(39) is the optimal choice. On the Vietnamese 2 data set, DTE(39) completely outperforms all considered ensembles integrated with the popular balancing methods. Thus, DTE(B) is a promising solution for seriously imbalanced credit scoring data sets.

Furthermore, interpretability makes DTE(39) the most reasonable credit scoring classifier. DTE(39) can point out the important features of customers, which are useful for hedging credit risk. Although many of the recently proposed ensembles show good performance, their primary focus is not on interpretability. In contrast, some models that do address interpretability, such as GSCI and PLTR, work less effectively than DTE(39) (see Tables 3.8 and 3.9).

Some further results are drawn from the empirical study. Firstly, on the four real data sets, none of ROS, RUS, SMOTE, and ADASYN stands out as the best re-sampling technique for addressing ID. Secondly, some balancing methods do not always work as expected; for example, on the German data set, the Bagging classifier without any re-sampling technique offers higher performance measures than the others (see Table 3.8). Therefore, users should carefully check several re-sampling techniques when applying the data-level approach.

Conclusions of the proposed credit scoring ensemble model

Credit scoring is always one of the most important tasks of financial institutions. Even a small improvement in the effectiveness of credit scoring models can limit significant losses for the banking system and the economy. Therefore, the evolution of credit scoring models continues with the enhancement of new classification algorithms and the innovation of balancing methods. In addition, interpretability is a crucial aspect of a credit scoring model, but it has not received sufficient attention from researchers.

This section contributes two algorithms to the credit scoring literature: OUS(B) for solving imbalanced data and DTE(B) for building a DT-based ensemble classifier. The product of the two proposed algorithms is the ensemble classifier DTE(39), which is more effective than Bagging, RF, and AdaBoost even when they are combined with common re-sampling techniques such as ROS, RUS, SMOTE, and ADASYN. DTE(39) also competes with other recent credit scoring models, especially in AUC and H-measure. Furthermore, DTE(39) identifies the important features for predicting credit risk status. Thus, DTE(39) fulfills the two requirements of typical credit scoring models: improving the performance measures and presenting the importance level of input features. These attributes position DTE(39) as the most reasonable option for addressing imbalanced credit scoring.

However, DTE(B) should be tested on more data sets to reach detailed conclusions about the optimal value of B. Besides, the study only considers the imbalanced ratio as the parameter affecting the performance of classifiers on ID.

In fact, overlapping is also a common issue in imbalanced classification. The OUS(B) algorithm should be examined more deeply on data sets suffering from both imbalance and overlapping to improve its effectiveness.

The proposed algorithm for imbalanced and overlapping data

The proposed algorithms

3.3.1.1 Algorithm for dealing with noise, overlapping, and imbalanced data

The pseudo-code for TOUS(B) is presented in Table 3.10.

Firstly, the Tomek-link method is applied to remove all the pairs {(e_i^+, e_i^-)}_{i=1}^m, which may be noise, borderline, or overlapping samples (Steps 1-2). Then, the imbalanced issue of the remaining data set, which is called T^0, is addressed by the OUS(B) algorithm (Steps 3-9). The output of TOUS(B) is a family {T_i^0}_{i=1}^B of data sets that are balanced, overlap-free, and de-noised.

Table 3.10 (TOUS(B) algorithm). Input: T, the training data set; B, the number of new balanced data sets. M_I^0 and M_A^0 denote the new positive and negative classes of T^0. Output: a family of balanced data sets {T_i^0}_{i=1}^B.
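A compact Python sketch of the TOUS(B) idea; the Tomek-link step below is a simplified mutual-nearest-neighbour version, the ous() call refers to the sketch given in Section 3.2, and all names are illustrative rather than the dissertation's implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_tomek_links(X_pos, X_neg):
    """Drop every positive/negative pair that are mutual nearest neighbours."""
    X = np.vstack([X_pos, X_neg])
    y = np.r_[np.ones(len(X_pos)), np.zeros(len(X_neg))]
    nearest = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(
        X, return_distance=False)[:, 1]                   # skip the point itself
    drop = set()
    for i, j in enumerate(nearest):
        if y[i] != y[j] and nearest[j] == i:              # mutual and opposite class
            drop.update((i, j))
    keep = np.array([k for k in range(len(X)) if k not in drop])
    X_clean, y_clean = X[keep], y[keep]
    return X_clean[y_clean == 1], X_clean[y_clean == 0]

def tous(X_pos, X_neg, B, random_state=0):
    """TOUS(B): Tomek-link cleaning followed by OUS(B) balancing."""
    pos0, neg0 = remove_tomek_links(X_pos, X_neg)
    return ous(pos0, neg0, B, random_state=random_state)  # ous() as sketched in Section 3.2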

3.3.1.2 Algorithm for constructing ensemble model

From the output of the TOUS(B) algorithm, an ensemble model is constructed by applying a base learner to every data set of the family {T_i^0}_{i=1}^B. This process follows the steps of the TOUS-F(B) algorithm shown in Table 3.11. TOUS-F(B) borrows the idea of the DTE(B) algorithm, which is presented in Table 3.3; however, DT can be replaced by another base learner such as LR or LLR.

Table 3.11 (TOUS-F(B) algorithm). Input: {T_i}_{i=1}^B, the family of B balanced sets with the same number of features; F, the base classifier.

Empirical data sets

The empirical study is conducted on six data sets: Bank personal loan (BANK)³, German credit (GER), Hepatitis C (HEPA)⁴, Loan schema data (US)⁵, Vietnamese 1 (VN1), and Vietnamese 3 (VN3) credit data. These data sets are chosen because of their diversity in sample size, imbalanced ratio, number of attributes, types of attributes, and the presence of overlapping samples (which will be found by the Tomek-link and NCL methods). Table 3.12 summarizes the characteristics of the empirical data sets. The details of the data sets BANK, HEPA, US, and VN3 can be found in Appendix C.5 – C.8, respectively.

Table 3.12: Description of empirical data sets

Data sets   Size     Positive size   Imbalanced ratio   # feat (a)   # num feat (b)
VN3         11,124   837             12.29              12           0
a: The number of total features; b: The number of numeric features.

3 https://www.kaggle.com/datasets/teertha/personal-loan-modeling

4 https://archive.ics.uci.edu/dataset/571/hcv+data

5 https://www.openintro.org/data/index.php?data=loans_full_schema

The first four data sets are obtained from public sources, while Vietnamese 1 and 3, which come from two Vietnamese commercial banks, are private data sets. The German and Vietnamese 1 data sets were used in the empirical study for DTE (Section 3.2). In addition, the Hepatitis C data set comes from the medical field, which usually suffers from ID.

Some changes to the original data sets are made.

• Regarding HEPA, observations with missing values are removed; the levels of the variable Category are grouped into two labels, denoted “0” for “Blood donor” and “1” for the remaining ones.

• Regarding US, the original data set consists of 10,000 samples, which are individuals and companies. Besides, there are some empty values in the data set. We remove samples with missing values or that are not individual customers; the remaining 8,505 samples are used for the empirical study.

Computation process

Figure 3.5: Computation protocol of the proposed ensemble classifier

Each data set is randomly split into the training and testing sets at a proportion of 70% – 30%. For every value of B in the set {3, 5, 7, 9}, the TOUS(B) and then TOUS-F(B) algorithms are applied to the training set to build ensemble classifiers. This section employs LLR and DT as the base learners of the ensembles, called Lasso Logistic Ensemble (LLE(B)) and Decision Tree Ensemble (DTE(B)), respectively. Experiments are conducted only on small values of B due to the computational burden. Besides, AUC is the sole performance metric in the evaluation.

Figure 3.5 illustrates the computation protocol of the proposed ensemble classifiers. On each data set, this process is repeated 50 times for every value of B. The optimal proposed ensemble classifier on each data set is the one corresponding to the highest average testing AUC.

3.3.3.1 Computation protocol of the Lasso Logistic ensemble

This subsection builds an ensemble classifier in which LLR plays the role of the base learner. The proposed ensemble classifier, which consists of B sub-classifiers, is denoted LLE(B). The computation protocol of LLE(B) follows the steps shown in Figure 3.5 with classifier F replaced by LLR. Furthermore, the study trains single models based on LLR and popular re-sampling techniques, such as ROS, SMOTE, RUS, Tomek-link, and NCL, on the same training sets as the optimal LLE(B*). The performance comparisons are based on the average testing AUC of 50 runs.

Some transformations are applied to the data sets after conducting re-sampling. Firstly, for each nominal attribute, binary variables (dummy variables) are created to express all of its levels. Secondly, the numerical variables are scaled according to the formula:

X_scale = (X − X̄) / dev(X),

where X̄ and dev(X) are the mean and standard deviation of variable X.

Finally, when training models by LLR, we design a grid of 500 values of the penalty level λ on each data set. The coordinate descent algorithm and a 5-fold cross-validation procedure are applied to choose the best λ for LLR.
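A brief Python sketch of this preprocessing and λ search; scikit-learn's saga solver and GridSearchCV stand in for the coordinate-descent (glmnet-style) routine, the C = 1/λ mapping ignores scaling conventions, and all names are illustrative.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def fit_llr(df, target="label"):
    """Dummy-encode nominal features, scale numeric ones, grid-search the penalty."""
    df = df.copy()
    num_cols = df.drop(columns=[target]).select_dtypes(include=[np.number]).columns
    df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()   # X_scale
    X = pd.get_dummies(df.drop(columns=[target]), dtype=float)                 # one dummy per level
    y = df[target].values

    lambdas = np.linspace(1e-4, 5e-3, 500)          # grid of 500 penalty levels
    grid = GridSearchCV(
        LogisticRegression(penalty="l1", solver="saga", max_iter=5000),
        param_grid={"C": 1.0 / lambdas},            # C acts as an inverse penalty
        cv=5,
    )
    return grid.fit(X.values, y)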

3.3.3.2 Computation protocol of the Decision tree ensemble

This subsection builds another ensemble classifier, in which DT is the base learner. Similarly to Section 3.2, the proposed ensemble classifier, which has B sub-classifiers, is denoted DTE(B).

All steps to construct the optimal DTE(B*) are similar to those of LLE(B). The performance of DTE(B*) is compared to that of the popular ensemble classifiers based on DT (Bagging, Random forest, and AdaBoost), integrated with one of the common re-sampling techniques (RUS, ROS, SMOTE, Tomek-link, and NCL). The RPART algorithm, an implementation of the CART approach, is applied to construct the DT classifier (Therneau et al., 1997). The parameters of RPART are assigned as follows. The minimum number of samples in any terminal node is 10. The pruning process of each tree is determined by 5-fold cross-validation with the complexity parameter 0.001. As regards Bagging and AdaBoost, the number of nodes along the longest path from the root node to the farthest terminal node is 10 (maxdepth = 10), and the number of trees in the ensemble classifier is 30 (mfinal = 30). As regards RF, the number of trees is 300, and the number of predictors of each sub-classifier is the square root of the total number of predictors of the data set. The performance of DTE(B) and the popular ensembles is evaluated based on the average testing AUC over 50 runs.

Empirical results

Table 3.13 reports the average testing AUC of LLE(B) and DTE(B) on the six data sets, where B belongs to the set {3, 5, 7, 9}.

For each data set, the optimal ensemble classifier is the model with the greatest AUC. If two ensembles have the same greatest AUC, the better one is that with the smaller B. In Table 3.13, the bold values of AUC correspond to the optimal ensemble classifiers of each data set.

According to Table 3.13, it can be concluded that on every data set, for any B and C, DTE(B) usually has a higher AUC than LLE(C).

Table 3.13: Average testing AUC of the proposed ensembles

Table 3.14 shows the average testing AUC of the optimal LLE(B*) and of the models based on LLR after applying the popular re-sampling techniques ROS, SMOTE, RUS, Tomek-link, and NCL. The LLR without any re-sampling method is also compared with LLE(B*); its AUC is shown in the “No re-samp” column. LLE(B*) completely outperforms the other popular models by the AUC criterion. Thus, it can be concluded that the TOUS algorithm improves the performance of LLR even over LLR combined with popular re-sampling techniques.

Moreover, some further comments emerge from this experiment. On the GER, HEPA, and US data sets, the AUC values of LLR without any re-sampling technique and of Tomek-link-LLR or NCL-LLR are the same; thus, these data sets do not have noise or overlapping samples. On the remaining data sets, Tomek-link and NCL raise the AUC. It can be concluded that excluding noise and overlapping samples improves the performance of LLR.

Table 3.14: Average testing AUC of the models based on LLR

Data sets   No re-samp   ROS   SMOTE   RUS   Tomek-link   NCL   LLE(B*)

Table 3.15 shows the average testing AUC of the optimal ensemble DTE(B*) and of the popular ensemble classifiers based on DT, such as Bagging, RF, and AdaBoost, with and without one of the re-sampling techniques ROS, SMOTE, RUS, Tomek-link, and NCL. Similar to the case of LLE(B*), DTE(B*) shows a great improvement in AUC compared to all methods considered.

On the BANK and HEPA data sets, the popular classifiers perform well even without addressing ID by re-sampling techniques; however, DTE can push the AUC higher. Besides, on the US, VN1, and VN3 data sets, DTE raises the AUC significantly. These data sets have some special characteristics: US suffers from ID seriously (the imbalanced ratio is 49.93), while VN1 and VN3 possess all the issues of noise, overlapping, and ID. These results imply that the ensemble-based approach built on the TOUS algorithm is a suitable option for dealing with noise, overlapping, and ID.

Besides, some minor results are drawn. Firstly, re-sampling techniques are not always efficient; for example, on GER and HEPA, re-sampling techniques decrease the AUC of the popular ensemble algorithms. Secondly, Bagging shows notably lower performance than the other ensemble classifiers. Finally, RUS-AdaBoost does not perform as effectively as concluded by Galar et al. (2011).

In this experiment, RF and DTE(B*) work better than RUS-AdaBoost.

Table 3.15: Average testing AUC of the tree-based ensemble classifiers

Data sets   Ensemble algorithms   Re-sampling technique
                                  None   ROS   SMOTE   RUS   Tomek-link   NCL   DTE(B*)

Conclusions of the proposed technique

This section proposes the TOUS algorithm for dealing with noise, overlapping samples, and ID in classification. The TOUS algorithm combines the Tomek-link, RUS, and ROS techniques to create a list of noise-free, overlap-free, and balanced data sets, which become the training sets of the sub-classifiers of an ensemble classifier. To verify the effectiveness of TOUS, the LLR and DT algorithms are applied to construct the ensemble classifiers LLE and DTE, respectively. The empirical study indicates some important results. TOUS yields a significant improvement in AUC compared with the popular re-sampling techniques; that means the hybrid of several re-sampling techniques is more effective than the individual ones. Besides, the data-cleaning methods can increase the performance measures although the data set is still imbalanced. This fact re-confirms that noise and overlapping samples are also reasons for the reduced effectiveness of standard classifiers. The results suggest further experiments that study other classifiers as base learners and consider more performance metrics to evaluate the potential of the proposed method for solving ID and related issues.

Chapter summary

This chapter studies credit scoring as a case study of ID. There are two proposed works in this chapter.

• The credit scoring ensemble classifier based on DT in Section 3.2.

• The TOUS algorithm for de-noising, removing overlap, and balancing data, and the algorithm for constructing an ensemble classifier based on the output of TOUS, in Section 3.3.

The credit scoring ensemble classifier DTE addresses ID by the ensemble-based approach. The empirical results show that DTE outperforms standard classifiers, even when the latter are combined with popular re-sampling techniques. In addition, DTE can point out the important features behind the final predicted results. Therefore, DTE satisfies the two vital requirements for credit scoring models: effectiveness and interpretability.

The TOUS algorithm, which derives from the OUS algorithm (Section 3.2), can tackle noise, overlapping samples, and ID. TOUS combines the Tomek-link, random over-sampling, and random under-sampling techniques. The proposed technique provides a substantial improvement in the AUC metric.

All the proposed works show impressive results. However, parameter optimization should be considered more deeply, and experiments should be conducted on more empirical data sets to reach robust conclusions.

A MODIFICATION OF LOGISTIC REGRESSION WITH IMBALANCED DATA

Logistic regression, a traditional model, is very popular in classification. Similar to common classifiers, Logistic regression performs ineffectively on imbalanced data sets. Although there are some approaches to imbalanced data for Logistic regression, including re-sampling techniques and modifications of the log-likelihood function, their effectiveness is generally not robust. In this chapter, we suggest a solution that combines the algorithm-level and data-level approaches for Logistic regression on imbalanced data sets.

Introduction

Recently, although machine learning and data-mining algorithms have penetrated several real applications of classification, Logistic regression (LR), a traditional model, is still favored by several authors (Bektas, Ibrikci, & Ozcan, 2017; Khemais, Nesrine, & Mohamed, 2016; Li et al., 2015; Muchlinski, Siroky, He, & Kocher, 2016). There are two prominent reasons for that. Firstly, the output of LR is the sample's conditional probability of belonging to the class of interest, which is a reasonable reference for classifying the sample. Secondly, LR provides a transparent model for interpretation, while most machine learning and data-mining models operate as a “black box”. However, LR has some problems. The interpretive power of LR is based on the statistical significance of its parameters, which is closely tied to the p-value. Nevertheless, the p-value has recently been criticized since its meaning is usually misunderstood (Goodman, 2008). Furthermore, in imbalanced classification, the parameter estimates of LR can be biased and the conditional probability of belonging to the minority class can be under-estimated (Firth, 1993; King & Zeng, 2001).

As a consequence, LR usually misclassifies the class of interest on ID.

In the literature on LR with ID, there are some groups of methods linked to the algorithm-level approach: prior correction, weighted likelihood estimation (WLE) (Maalouf & Trafalis, 2011; Manski & Lerman, 1977; Ramalho & Ramalho, 2007), and penalized likelihood regression (PLR) (Firth, 1993; Greenland & Mansournia, 2015; Park & Hastie, 2008; Puhr et al., 2017). Most of them were designed to reduce the biases in parameter estimation and predicted probabilities, especially in small samples. However, prior correction and WLE need prior information on the two classes in the population, which is usually unavailable. Besides, some PLR methods, such as FIR (Firth, 1993), FLIC, and FLAC (Puhr et al., 2017), are quite sensitive to initial values in the computation process of maximum likelihood estimation. Therefore, solving LR with ID should consider both the data-level and algorithm-level approaches without making the computation process complex.

This chapter proposes a binary classifier named F-measure-oriented Lasso-Logistic regression (F-LLR) to exploit the interpretability of LR and address the imbalanced issue. F-LLR utilizes Lasso Logistic regression (LLR) as a base learner and integrates the algorithm-level and data-level approaches to handling ID. Lasso is a penalized shrinkage estimator and a feature selection method without a p-value. In Lasso, the hyper-parameter λ is set by a new procedure called F-CV, which is an adjustment of the ordinary cross-validation procedure (CV). F-CV finds the optimal λ by maximizing the cross-validation F-measure instead of the cross-validation accuracy as in CV. The proposed classifier F-LLR has two computation stages. In the first stage, LLR based on F-CV is applied to get the scores of all samples. In the second stage, according to the scores, under-sampling and SMOTE are respectively used to re-balance the data set; next, LLR based on F-CV is applied again on the balanced data set to get the final result. The proposed classifier F-LLR is tested on nine real imbalanced data sets, and its performance measures (KS and F-measure) are higher than those of traditional approaches to ID for LR.

This chapter is organized as follows. The related works section reviews the general background of LR and ID. The next section describes the proposed classifier. The empirical study section introduces the empirical data sets, the implementation protocol, and the results. The conclusion section comes last.

Related works

Prior correction

Prior correction re-computes the maximum likelihood estimate (MLE) of the intercept of the standard LR. It is unnecessary to correct the MLE of the parameter β because it is statistically consistent (Cramer, 2003; King & Zeng, 2001). The correction for β_0 follows the formula:

β̃_0 = β̂_0 − ln(δ_1 / δ_0),  with  δ_1 = ȳ / τ  and  δ_0 = (1 − ȳ) / (1 − τ),   (4.1)

where β̂_0 is the MLE of β_0, and τ and ȳ are the proportions of the positive class in the population and in the sample, respectively.

The corrected predicted probability (score) of a sample x is then

π̃(x) = 1 / (1 + exp(−β̃_0 − x β̂)),   (4.2)

where β̂ is the MLE of β.

The biggest advantage of prior correction is its ease of use. However, the value of τ is usually unavailable. Besides, if the model is misspecified, the estimates of β_0 and β are slightly less robust than the WLE (Xie & Manski, 1989).

King and Zeng (2001) argued that the score in formula (4.2) was still underestimated. They proposed a correction for this score:

π̃_KZ(x) = π̃(x) + C(x),  with  C(x) = [0.5 − π̃(x)] · π̃(x)[1 − π̃(x)] · x V(β̂) x^T,   (4.3)

where V(β̂) is the variance matrix of β̂.

King and Zeng (2001) stated that the score estimate in formula (4.3) could reduce both the bias and the variance. However, this method was applied after the complete estimation; thus, it was a correction, not a prevention (Wang & Wang, 2001). Besides, according to Puhr et al. (2017), the score in (4.3) was still biased.

Weighted likelihood estimation (WLE)

Instead of solving the optimization in (3.2), WLE (Manski & Lerman, 1977) considers the weighted log-likelihood function:

log L_W(P(Y|X, β)) = Σ_{i=1}^{n} w_i [y_i log π(x_i) + (1 − y_i) log(1 − π(x_i))],   (4.4)

where w_i is the weight of the i-th observation in the sample, and τ and ȳ are the proportions of the positive class in the population and in the sample, respectively.
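Under the Manski and Lerman (1977) weighting scheme, the weights are typically defined from these proportions as
\[
w_i=\begin{cases}\tau/\bar{y}, & y_i=1,\\ (1-\tau)/(1-\bar{y}), & y_i=0,\end{cases}
\]
so that the sample is re-weighted toward the population class proportions.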

WLE outperforms prior correction in both cases of large sample data and a misspecified model (Xie & Manski, 1989). In a small sample, WLE may be asymptotically less efficient than prior correction, though the differences are insignificant (Scott & Wild, 1986). In addition, misspecification is a common issue in social science studies. Therefore, WLE should be preferred to prior correction (King & Zeng, 2001; Xie & Manski, 1989).

There were some studies following the weighting method for solving LR with ID. Maalouf and Trafalis (2011) combined weighting, regularization, kernelization, and numerical methods. Maalouf and Siddiqi (2014) applied the truncated Newton method to WLE. These works studied problems in which the value of τ was available; meanwhile, in general cases, the information about the population proportion τ is unknown. There was only one study dealing with a part of this gap in the literature: Ramalho and Ramalho (2007) provided a generalized method of moments estimator applying moment conditions for endogenously stratified samples. However, the investigation of the effectiveness of the proposed method was based on a simulation study following Cosslett's design, which does not seem to cover enough cases to evaluate the performance of the proposed method.

Penalized likelihood regression (PLR)

PLR has the general form:

log L*(P(Y|X, β)) = log L(P(Y|X, β)) + A(β).   (4.5)

In (4.5), the term A(β) could be:

• Firth-type (FIR): A(β) = (1/2) log(det(I(β))), where I(β) is the Fisher information matrix (Firth, 1993).

• Normal prior (Ridge): A(β) = −λ Σ_{j=1}^{p} β_j², where λ > 0 (Maalouf & Trafalis, 2011; Park & Hastie, 2008).

• Double exponential (Lasso): A(β) = −λ Σ_{j=1}^{p} |β_j|, where λ > 0 (Fu et al., 2017; Li et al., 2015).

Firth-type (FIR) penalization can reduce the small-sample bias of the MLE of the parameters. However, FIR introduces a bias in the scores, which are pulled toward the value 0.5; this bias is significant in the case of high ID. To overcome this drawback, Puhr et al. (2017) suggested two modifications of FIR: the intercept correction (FLIC) and the adjustment for an artificial covariate (added covariate approach, FLAC). Although FLIC and FLAC perform better than FIR, they cannot beat Ridge on most empirical and simulation data sets (Puhr et al., 2017). Besides, FIR, FLIC, and FLAC are quite sensitive to initial values in the computational process of maximum likelihood estimation.

Ridge possesses an idea similar to Lasso, which is discussed in Subsection 3.1.1.4. In Ridge and Lasso, the penalty parameter λ controls the magnitude of the estimates of β_j (j ≠ 0) (denoted β̂_j), which can be found by the coordinate descent algorithm (Friedman, Hastie, & Tibshirani, 2010). The optimal λ is usually determined by the cross-validation procedure (CV), which is based on the default threshold of 0.5 and on minimizing the cross-validation error rate (or maximizing the cross-validation accuracy). Ridge can compete with FLIC and FLAC. However, Ridge usually leads to a dense estimate of β, with very few zero values of β̂, due to the property of the normal prior. Thus, on high-dimensional data, Ridge takes a long computation time.

Analogous to Ridge, Lasso is a penalized shrinkage estimator. Besides, Lasso is a feature selection method without a p-value: it retains only the predictors closely relevant to the response. On high-dimensional data, Lasso does not spend as much time as Ridge because of the exclusion of predictors. However, Lasso does not directly deal with ID. Some studies applied SMOTE to re-balance data before performing Lasso (Kitali et al., 2019; Shrivastava et al., 2020). Despite its popularity, SMOTE can cause the overlapping problem, which decreases the performance measures of classifiers.

The proposed works

The modification of the cross-validation procedure

F-LLR utilizes LLR as the base algorithm. Instead of using CV to find the optimal λ, a modification of CV called the F-measure-oriented cross-validation procedure (F-CV) is proposed. In F-CV, the criterion for evaluating the optimal λ is the F-measure, a more suitable metric than accuracy on ID. The details of CV and F-CV are described in Tables 4.1 and 4.2.

Table 4.1: Cross-validation procedure for Lasso Logistic regression

Input: a training data set T, a series of penalties {λ_i}_{i=1}^h, an integer K (K > 1).

1. Randomly divide T into K equal-sized subsets: T_1, ..., T_K.

4. On T \ T_k, apply LLR with λ_i to get the fitted model LLR(λ_i).

5. On T_k, apply LLR(λ_i) to get the scores of the samples of T_k.

6. Compare the scores with the threshold 0.5 to get the labels.

7. Calculate the accuracy, denoted ACC_ik.

Output: the classifier LLR(λ_{i_0}) with the optimal penalty λ_{i_0}, where λ_{i_0} maximizes the cross-validation accuracies {ACC_i}.

Under the notation in Table 4.2, for every threshold α_j, the cross-validation F-measure F_ij is an estimate of the testing F-measure of the fitted model LLR(λ_i). When the penalty parameter λ and the threshold α take all values in the series {λ_i}_{i=1}^h and {α_j}_{j=1}^l, respectively, F_{i_0 j_0} determined at Step 10 is an estimate of the highest testing F-measure of LLR(λ) on the data set T. Therefore, F-CV indicates not only the optimal penalty parameter λ_{i_0} but also the optimal threshold α_{j_0} corresponding to F_{i_0 j_0}. The computation process of F-CV is illustrated in Figure 4.1.

Table 4.2: F-measure-oriented Cross-Validation Procedure

Input: a training data set T, a series of penalties {λ_i}_{i=1}^h, a series of thresholds {α_j}_{j=1}^l, an integer K (K > 1).

1. Randomly divide T into K equal-sized subsets: T_1, ..., T_K.

5. On T \ T_k, construct the fitted model LLR(λ_i).

6. On T_k, apply LLR(λ_i) to get the scores of all samples.

7. Compare the scores with α_j to get the labels of T_k.

10. F_{i_0 j_0} = max_{i,j} {F_ij}. Output: the classifier LLR(λ_{i_0}), the optimal penalty λ_{i_0}, and the optimal threshold α_{j_0}.

There are three differences between CV and F-CV. Firstly, CV fixes a threshold of 0.5 to distinguish the positive and negative samples, while F-CV considers a series of thresholds {α_j}_{j=1}^l. Secondly, CV determines the optimal λ based on the cross-validation accuracy (denoted ACC_i in Step 9, Table 4.1), which is the mean value of the accuracy metrics on the subsets T_k (k = 1, ..., K); in contrast, F-CV utilizes the F-measure instead of accuracy. Finally, F-CV can point out the optimal threshold for the classification process, while CV cannot.
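A compact Python sketch of the F-CV idea, selecting the pair (λ, α) that maximizes the cross-validated F-measure; scikit-learn objects stand in for the LLR routine, X and y are assumed to be numpy arrays, and all names are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def f_cv(X, y, lambdas, thresholds, K=5):
    """Pick (lambda, threshold) maximizing the cross-validated F-measure."""
    folds = list(StratifiedKFold(n_splits=K, shuffle=True, random_state=0).split(X, y))
    f_table = np.zeros((len(lambdas), len(thresholds)))
    for i, lam in enumerate(lambdas):
        for train, test in folds:
            model = LogisticRegression(penalty="l1", solver="saga", C=1.0 / lam,
                                       max_iter=5000).fit(X[train], y[train])
            scores = model.predict_proba(X[test])[:, 1]
            for j, alpha in enumerate(thresholds):
                f_table[i, j] += f1_score(y[test], (scores >= alpha).astype(int)) / K
    i0, j0 = np.unravel_index(f_table.argmax(), f_table.shape)
    return lambdas[i0], thresholds[j0], f_table[i0, j0]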

The modification of Logistic regression

Table 4.3: Algorithm for F-LLR classifier

Input: training data set T_0; the positive and negative classes S_0^+ and S_0^-; a series of penalties {λ_i}_{i=1}^h; a series of thresholds {α_j}_{j=1}^l; an integer K; r_U, r_S: the rates for under-sampling and SMOTE, with (1 − r_U)|S_0^-| > |S_0^+|.

Stage 1

1. Apply F-CV to T_0 to get the classifier LLR(λ_0).

2. Apply LLR(λ_0) to score all samples of T_0.

3. Order the samples of S_0^+ and S_0^- by their scores from the highest to the lowest.

4. On S_0^-, remove the (r_U × |S_0^-|) highest-scored samples to get S_1^-.

5. Determine the subset of S_0^+ consisting of the (r_S × |S_0^+|) highest-scored samples, called S_0^++.

7. Apply SMOTE to S_0^++ to create (m − 1) r_S × |S_0^+| synthetic samples.

9. Apply F-CV to the balanced training set T_1 = S_1^+ ∪ S_1^-.

Output: the classifier LLR(λ_1) and the optimal threshold α_1.

Here |A| denotes the number of samples in data set A.

The proposed classifier F-LLR has the two computation stages shown in Table 4.3. In the beginning, all samples of the training data are scored by F-CV. Then, according to the samples' scores, under-sampling and SMOTE are respectively applied to balance the training data set. Finally, on the balanced data set, LLR based on F-CV builds the classifier F-LLR.

The combination of under-sampling and SMOTE aims to remove the useless samples and increase the useful ones. In fact, the higher a negative sample's score, the greater its chance of being misclassified; such samples may be noise, borderline, or overlapping samples, which decrease the performance measures of classifiers. Thus, instead of applying random under-sampling, only a proportion of the negative class containing the highest-scored samples is eliminated. Next, instead of utilizing the whole minority class, SMOTE is performed only on the subset consisting of the positive samples with the highest scores. This idea contrasts with the application of under-sampling: the high-scored positive samples are usually identified correctly across thresholds, and this practice emphasizes these samples, which show prominent characteristics of the positive class. Creating more neighbors of these samples provides more useful information for the identification of the positive class. Furthermore, these high-scored positive samples are often in the safe region far from the borderline, so applying SMOTE here can prevent overlapping issues. Figure 4.2 illustrates the meaning behind the steps of the F-LLR classifier.
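The score-guided re-sampling of Stage 2 can be sketched as follows; SMOTE from the imbalanced-learn package stands in for the SMOTE step, the multiplier m and the neighbour settings are illustrative assumptions, and the function names are hypothetical.

import numpy as np
from imblearn.over_sampling import SMOTE

def score_guided_resample(X_pos, X_neg, pos_scores, neg_scores, r_U, r_S, m=2):
    """Drop the top-scored negatives, then apply SMOTE only to the top-scored positives."""
    # 1) Under-sampling: remove the r_U share of negatives with the highest scores.
    keep_neg = np.argsort(neg_scores)[: int((1 - r_U) * len(X_neg))]
    X_neg_new = X_neg[keep_neg]

    # 2) SMOTE seeded only by the r_S share of positives with the highest scores.
    top_pos = np.argsort(pos_scores)[::-1][: int(r_S * len(X_pos))]
    X_seed = X_pos[top_pos]
    X_tmp = np.vstack([X_seed, X_neg_new])
    y_tmp = np.r_[np.ones(len(X_seed)), np.zeros(len(X_neg_new))].astype(int)
    n_target = len(X_seed) * m                      # yields (m-1)*r_S*|S0+| synthetic samples
    sm = SMOTE(sampling_strategy={1: n_target},
               k_neighbors=min(5, max(1, len(X_seed) - 1)))
    X_res, _ = sm.fit_resample(X_tmp, y_tmp)
    synth = X_res[len(X_tmp):]                      # synthetic positives are appended last

    X_pos_new = np.vstack([X_pos, synth])           # original positives plus synthetics
    return X_pos_new, X_neg_new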

Empirical study

Empirical data sets

Credit scoring is a typical example of imbalanced classification since the number of bad customers is always far less than the number of good ones. Eight credit scoring data sets are used in the experimental study. They are Australian data (AUS)¹, German data (GER), Taiwanese data (TAI), Credit risk data (Credit 1)², Credit card data (Credit 2)³, Credit default data (Credit 3)⁴, Bank personal loan data (BANK)⁵, and Vietnamese data (VN4). Moreover, a data set of hepatitis patients (HEPA), which is not only imbalanced but also has a very small positive class, is also investigated. The data sets BANK, GER, TAI, and HEPA were used in the empirical studies in Chapter 3.

The nine empirical data sets suffer imbalanced status at different levels, evaluated by the imbalanced ratio (IR). Some characteristics of the data sets are presented in Table 4.4 in increasing order of IR. The first group of data sets, including AUS, GER, TAI, and Credit 1, are imbalanced at a low level (IR ≤ 5). The AUS, GER, and TAI data sets, publicized on the UCI machine learning repository, are familiar in credit scoring studies. Credit 1 is a subset randomly drawn from the original data set at a rate of 20% to save computation time; Credit 1 still maintains the same IR as the original data on the Kaggle website. The second group, consisting of Credit 2, Credit 3, BANK, and HEPA, suffers an average imbalanced status (5 < IR ≤ 10). Credit 3 is formed in a similar way to Credit 1 but at a rate of 10%. The last data set is VN4, collected from a commercial bank in Vietnam in the period 2019 - 2020; this is the most severely imbalanced data set among the empirical data sets. Except for the last data set, the other eight are public on the UCI library and the Kaggle website with transparent sources. Details of the empirical data sets AUS, Credit 1, Credit 2, Credit 3, and VN4 can be found in Appendix C.9, C.10, C.11, C.12, and C.13, respectively.

1 http://archive.ics.uci.edu/dataset/143/statlog+australian+credit+approval

2 https://www.kaggle.com/datasets/laotse/credit-risk-dataset

3 https://www.kaggle.com/datasets/samuelcortinhas/credit-card-classification-clean-data

4 https://www.kaggle.com/datasets/gargvg/univai-dataset

5 https://www.kaggle.com/datasets/teertha/personal-loan-modeling

Table 4.4: Description of empirical data sets

Data sets   Size     Positive size (a)   Imbalanced ratio   # feat (b)   # num feat (c)
VN4         10,889   602                 17.09              11           0
a: The size of the positive class; b: The number of total features; c: The number of numeric features.

All observations with missing feature values are omitted. Moreover, all numeric features of the data sets are standardized to have zero mean and unit deviation.

Performance measures

AUC, KS, F-measure, and G-mean are utilized to evaluate the performance of the considered classifiers. Among them, AUC and KS are threshold-free measures that judge the general effectiveness of classifiers. Meanwhile, F-measure and G-mean depend on the threshold, which is a reference value for distinguishing positive and negative labels. Details of AUC, KS, F-measure, and G-mean can be found in Section 2.2, Chapter 2.

Regarding the F-measure, it is the harmonic mean of the precision and recall metrics. Precision is the ratio of true positive samples among the predicted positives, and recall is the proportion of positive samples that are correctly predicted as positive. The F-measure is high if and only if both precision and recall are high. On ID, LR and LLR usually give a high precision and a low recall, meaning that few of the positive samples are classified correctly. On the contrary, boosting the recall while ignoring the precision on an imbalanced data set leads to an extreme classifier that cannot identify the negative class. A bias toward precision or recall can cause unnecessary losses, especially in the credit scoring or medical diagnosis fields. Therefore, in the procedure to find the optimal λ of LLR, the highest F-measure is a more reasonable target than the highest accuracy.
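In symbols, with TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives:
\[
\text{precision}=\frac{TP}{TP+FP},\qquad
\text{recall}=\frac{TP}{TP+FN},\qquad
\text{F-measure}=\frac{2\cdot\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}.
\]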

In this chapter, the F-measure and G-mean corresponding to threshold α are denoted by F-measure(α) and G-mean(α).

Computation process

The performance of F-LLR is compared with some versions of PLR such as LLR and Ridge. The comparison also considers RUS-LLR, ROS-LLR, and SMOTE-LLR, which are LLR models built after employing the re-sampling techniques RUS, ROS, and SMOTE, respectively. Besides, on VN4, F-LLR is compared with WLE because the value of τ is available (τ = 1.7%), the bad debt ratio in the Vietnamese banking system in the period 2019-2020⁶. Ridge and WLE, the representatives of LR with the algorithm-level approach, are chosen for comparison with F-LLR since these methods worked better than others according to previous studies (King & Zeng, 2001; Puhr et al., 2017). The optimal λs of the models, including LLR, RUS-LLR, ROS-LLR, SMOTE-LLR, and Ridge, are determined by the original version of CV (see Table 4.1).

The general implementation protocol is described in Table 4.5. The hyper-parameters are set up as follows:

• The series of lambdas {λ_i}_{i=1}^{100} consists of 100 equally spaced values from 0.0001 to 0.005.

• The series of thresholds {α_j}_{j=1}^{50} consists of 50 equally spaced values from 0.01 to 0.7. We choose 0.7 as the upper bound of the series of thresholds because if the threshold for distinguishing the two classes is too high, many positive samples are misclassified.

• The series of rates for under-sampling {r_U}_{u=1}^{20} includes 20 values from 0.05 to 0.5 × (IR − 1)/IR, satisfying (1 − r_U)|S_0^-| > |S_0^+|. If RUS is applied, a number of negative samples accounting for a proportion (IR − 1)/IR of the negative class will be removed randomly. However, in this implementation, we do not eliminate that many negative samples, in order to restrict the loss of valuable information from the majority class.

• The series of rates for SMOTE {r_S}_{s=1}^{20} comprises 20 values from 0.05 to 0.75. Original SMOTE balances data by using 100% of the minority-class samples to generate synthetic samples in their neighborhoods. In our approach, the upper bound of the SMOTE rate is 0.75 since we focus only on the top-scored positive samples to restrict the overlapping issue, a typical drawback of SMOTE.

6 https://sbv.gov.vn/webcenter/portal/vi/links/cm255?dDocName=SBV489213

Table 4.5: Implementation protocol of empirical study

1. Set up the series: {λ_i}_{i=1}^h, {α_j}_{j=1}^l, {r_U^u}_{u=1}^{20}, and {r_S^s}_{s=1}^{20}.

2. Split the data set randomly into the training and testing sets (70% - 30%).

{λ_i}_{i=1}^h, {α_j}_{j=1}^l, r_U^u, and r_S^s to build the classifiers F-LLR.

6. Determine the optimal threshold α_j* of F-LLR.

7. Determine the optimal F-LLR based on the highest F-measure across the values of r_U and r_S.

8. Build LLR, RUS-LLR, ROS-LLR, SMOTE-LLR, and Ridge-LLR.

9. For each classifier, determine the threshold α* corresponding to the highest F-measure.

10. On VN4, run the above classifiers and WLE.

12. Calculate the AUC, KS, F-measure(α*), and G-mean(α*) of all considered classifiers.

13. Repeat Steps 2 to 12 twenty times.

14. Average the twenty values of AUC, KS, F-measure, and G-mean.

Note that on each data set, the considered classifiers are run 20 times, and the performance measures of the 20 runs are averaged to reduce the bias of the results.

Empirical results

A minor experiment on some values of r_U (the rate for under-sampling) and r_S (the rate for SMOTE) suggests that the optimal value of r_U lies in the range [0.05; 0.25], while that of r_S lies in [0.20; 0.75]. The HEPA data set, which has a very small positive class and suffers an average imbalanced level, performs best at the rates r_U = 0.07 and r_S = 0.75. It can be inferred that SMOTE is prioritized over under-sampling in the F-LLR protocol in this experiment.

The average performance measures of the considered classifiers are recorded in Table 4.6 and Table 4.7. In comparison with the other classifiers, F-LLR shows better performance on the empirical data sets.

On TAI, Credit 2, Credit 3, and HEPA, F-LLR is the most prominent classifier since it outperforms the others in at least three performance metrics.

On other data sets (except BANK), F-LLR wins other classifiers by two metrics.

On VN4, the most imbalanced data set, F-LLR is the most noticeable classifier with the highest KS and F-measure. Although ROS-LLR beats F-LLR by G-mean and WLE beats it by AUC, the G-mean difference between F-LLR and ROS-LLR is not significant; the same holds for the AUC difference between F-LLR and WLE. In addition, Ridge performs worse than all the considered classifiers.

In general, F-LLR shows the highest KS and F-measure across the empirical data sets. In contrast, the data-level approach cannot deal with ID on GER, Credit 2, and HEPA; the techniques ROS, RUS, and SMOTE even decrease the performance measures of LLR on these data sets. Besides, Ridge is a competitor of F-LLR in some cases, such as the AUS, GER, and BANK data sets. Regarding the optimal thresholds of LLR, they are considerably higher than those of the other classifiers.

Table 4.6: Average testing performance measures of classifiers
Measures | LLR | RUS+LLR | ROS+LLR | SMOTE+LLR | Ridge | WLE | F-LLR
∗: The optimal threshold corresponding to the highest trained F-measure.
Bold values are the highest in each row.

Table 4.7: Average testing performance measures of classifiers (cont.)
Datasets | Measures | LLR | RUS+LLR | ROS+LLR | SMOTE+LLR | Ridge | WLE | F-LLR
∗: The optimal threshold corresponding to the highest trained F-measure.
Bold values are the highest in each row.

Statistical test

To assess the effectiveness of F-LLR, the Sign test is utilized. This test makes no assumption about the distribution of the performance measures. The Sign test counts the number of data sets on which the classifier of interest beats the others. Details of the Sign test can be found in Sheskin (2003). When comparing multiple classifiers, pairwise comparisons can be performed and the results recorded in a matrix. When comparing the classifier of interest with another one, there are two possibilities: win or not. Thus, the number of wins follows the binomial distribution Binorm(N; p). Under the null hypothesis (the two classifiers are equivalent), the parameters of this distribution are:

• N : the number of empirical data sets.

• p = 0.5 : the probability of winning under the null hypothesis.

The critical number of wins can be calculated from the distribution Binorm(N; 0.5). For example, with N = 9, the critical value at the significance level α = 5% (or 10%) is w_α = 8 (or w_α = 7) (Demšar, 2006). It means the classifier of interest is significantly better than another one if it performs better on at least w_α data sets.
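As a sanity check, the critical numbers of wins quoted here (and the presence thresholds used later for N = 20 runs) can be reproduced directly from the binomial distribution; the short computation below is purely illustrative and not part of the original protocol.

```python
from scipy.stats import binom

def critical_wins(n, alpha, p=0.5):
    """Smallest w such that P(X >= w) <= alpha when X ~ Binom(n, p)."""
    for w in range(n + 1):
        if binom.sf(w - 1, n, p) <= alpha:   # sf(w - 1) = P(X >= w)
            return w
    return None

print(critical_wins(9, 0.05), critical_wins(9, 0.10))    # 8 7  (sign test over 9 data sets)
print(critical_wins(20, 0.05), critical_wins(20, 0.10))  # 15 14 (variable presence over 20 runs)
```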

Table 4.8: The number of wins of F-LLR on empirical data sets
LLR | RUS-LLR | ROS-LLR | SMOTE-LLR | Ridge
With N = 9, the critical values are w_0.05 = 8 and w_0.1 = 7.
∗∗ and ∗: statistically significant at the 5% and 10% levels, respectively.

According to the results in Table 4.6 and Table 4.7, we organize two tests, an overall comparison and pairwise comparisons, to assess the effectiveness of F-LLR. However, the performance measures of F-LLR and WLE cannot be compared in this way since there is only one observation of that comparison. The number of wins of F-LLR is shown in Table 4.8. F-LLR beats the others seven times by KS and by F-measure in the overall comparison. That implies the KS and F-measure of F-LLR are significantly higher than those of the others at the 10% significance level in the overall comparison.

In pairwise comparisons, there are some notes:

• By AUC: F-LLR wins LLR and RUS-LLR on all nine data sets while it wins ROS-LLR on five, SMOTE-LLR on six, and Ridge on six data sets. Therefore, F-LLR is only significantly better than LLR and RUS-LLR.

• By G-mean: F-LLR beats LLR on eight data sets. Besides, F-LLR beats RUS-LLR, SMOTE-LLR, and Ridge on seven data sets but beats ROS-LLR on only six. Thus, except for ROS-LLR, F-LLR significantly outperforms the others.

In summary, F-LLR outperforms the others by KS and F-measure. In pairwise comparisons, F-LLR completely beats LLR and RUS-LLR by four performance criteria and beats SMOTE-LLR and Ridge by three.

Important variables for output

4.4.6.1 Important variables for F-LLR fitted model

Lasso is a well-known feature selection method thanks to its ability to retain the important variables in the fitted model without using the “p-value” criterion. Hence, F-LLR inherits this ability.

Regarding a numeric feature, one quantitative variable is used to express it. The value of the estimated coefficient β̂_j corresponding to this variable shows its importance level. The greater the absolute value of β̂_j, the greater the effect of this feature on the score.

Regarding a nominal feature with m categorical levels, m − 1 binary variables are used to describe the information of this feature. The categorical level corresponding to the case where all m − 1 binary variables are zero is called the basic one. The value of β̂_jk (k ∈ {1, ..., m − 1}) shows the impact of category k in comparison with the basic category. That means, given the same levels of the remaining features, β̂_jk is positive if and only if an observation belonging to category k has a greater score than one with the basic category. Therefore, the value of β̂_jk can be used to evaluate the importance level of category k of the given nominal feature.
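The following toy sketch illustrates this dummy coding and the reading of the coefficients. The data frame, the feature name, and the use of scikit-learn's L1-penalized logistic regression are purely illustrative assumptions, not the dissertation's data or code.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical nominal feature with m = 3 levels; "A" becomes the basic (reference) category.
df = pd.DataFrame({"duration": ["A", "B", "C", "A", "B", "C", "A", "C"],
                   "default":  [0,   0,   1,   0,   1,   1,   0,   1]})
X = pd.get_dummies(df["duration"], drop_first=True).astype(float)  # m - 1 = 2 dummy columns: B, C
y = df["default"].values

model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
# Each coefficient compares its category with the basic category "A":
# a positive value means a higher score (higher risk) than "A", other features held equal.
for name, coef in zip(X.columns, model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```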

In our training, F-LLR is re-applied 20 times on each data set to reduce the bias in evaluation. Thus, the number of times a variable is present in the fitted models follows the binomial distribution Binorm(N; p). The variable impacts the score if and only if the number of its presences is statistically significant. Under the null hypothesis (the variable does not impact the score), the parameters of this distribution are:

• N = 20 : the number of runs per data set.

• p = 0.5 : the probability of occurring in the fitted model F-LLR.

According to the distribution Binorm(20; 0.5), the critical numbers of presences are calculated. With a significance level of α = 5% (or 10%), the critical value is w_α = 15 (or w_α = 14) (Demšar, 2006). It means that on each data set, a variable impacts the score if the number of its presences is at least w_α.

4.4.6.2 Important variables of the Vietnamese data set

Consider the VN4 data set, which consists entirely of nominal features. Tables 4.9 and 4.10 show the features, the categories of each feature, the number of presences in the fitted models, and the importance level. Note that the importance level of a feature is the average value of its estimated coefficients over the 20 runs. The importance level of a variable is only calculated if this variable impacts the score. According to Tables 4.9 and 4.10, there are some comments on the features of the VN4 data set.

• The four features “Customer types”, “Loan types”, “Sex”, and “Terms” are unrelated to the probability of default. The remaining features affect the score statistically.

• About “Duration”: Customers with a duration longer than 36 months show significantly high probabilities of default. Furthermore, the highest credit risk falls in the category of durations over 42 months (the variable expressing this category appears 20 times). Besides, the scores of customers with other durations are not statistically different.

Table 4.9: Important features of the Vietnamese data set
Features | Categories | Number of presences | Importance level
∗∗ and ∗: statistically significant at the 5% and 10% levels, respectively.

Table 4.10: Important features of the Vietnamese data set (cont.)
Features | Categories | Number of presences | Importance level
∗∗ and ∗: statistically significant at the 5% and 10% levels, respectively.

• About “Interest rate”: Customers with an interest rate in the range of 10%–14% are the most likely to default, with the highest risk related to interest rates from 12% to 14%. There is no difference in scores between customers with other interest rates.

• About “Product types”: Only the periodic-interest type shows a smaller probability of default than the others. Besides, there is no difference in scores between the other product types.

• About “Purposes”: Customers with purpose P2 are significantly risky. Besides, customers with purpose P5 have statistically lower scores than the others. There is no difference in the scores of customers with purposes P1, P3, and P4.

• About “Current balance”: The group of customers with current balances in the range of 250 to 350 (million VND) shows a decreased risk of default. There is no statistical evidence of different scores across the other ranges of current balance.

• About “Base balance”: Customers with base balances under 4 (million VND) are the least risky compared with the others. The remaining levels of base balance show the same credit risk status.

• About “Branches”: The branches B4 and B5 carry the highest credit risk, followed by B2. The branches B1 and B3 are not significantly different in their probabilities of default.

In conclusion, the features of the VN4 data set, including “Branches”, “Interest rate”, “Duration”, “Product types”, “Purposes”, “Current balance”, and “Base balance”, are factors of credit risk.

Discussions and Conclusions

Discussions

According to the empirical study and statistical test, the proposed classifier F-LLR, which combines under-sampling and SMOTE under the control of the samples’ scores, completely beats LLR and RUS-LLR (by four performance measures) and wins against SMOTE-LLR and Ridge-LLR (by three). Furthermore, F-LLR outperforms the considered classifiers in KS and F-measure. That means F-LLR separates the true positive distribution and the false positive distribution better than the others, and it also shows the best trade-off between precision and recall.

In the credit scoring application, recall is more important than precision since bad customers are the crucial objects to identify. However, if classifiers stress recall and ignore precision, a large number of good customers are rejected. This is also an unpleasant scenario for financial organizations that operate for profit. Thus, F-LLR, with its ability to boost the F-measure, is a good option for credit scoring. Besides its effectiveness, the feature selection ability of F-LLR meets the interpretability requirement of a credit scoring model. Therefore, F-LLR is a worthy classifier for credit scoring.
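For reference, the measures discussed throughout this chapter can be computed from scored outputs as in the short sketch below; the use of scikit-learn and SciPy, and the treatment of KS as the two-sample Kolmogorov-Smirnov statistic between the score distributions of the two classes, are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score, confusion_matrix
from scipy.stats import ks_2samp

def evaluate(y_true, scores, alpha):
    """Threshold the scores at alpha and report F-measure, G-mean, and KS."""
    y_pred = (scores >= alpha).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    recall = tp / (tp + fn)              # sensitivity on the bad (positive) class
    specificity = tn / (tn + fp)         # performance on the good (negative) class
    return {
        "F-measure": f1_score(y_true, y_pred),
        "G-mean": np.sqrt(recall * specificity),
        "KS": ks_2samp(scores[y_true == 1], scores[y_true == 0]).statistic,
    }
```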

Conclusions

LR is a popular traditional classifier although many modern models have become available recently. Like most common classifiers, LR works ineffectively on imbalanced data sets. The algorithm-level and data-level approaches cannot increase the performance measures of LR in many cases. Consequently, the application of LR can be narrowed despite its strengths.

Taking advantage of LLR and of the algorithm-level and data-level approaches, the proposed classifier F-LLR prepares a balanced training set by removing unnecessary negative samples and increasing essential positive samples through the selective application of under-sampling and SMOTE. Besides, the optimal penalty parameter λ and the optimal threshold of LLR are determined by a procedure called F-CV, which is an adjustment of the ordinary CV. The modifications to the intrinsic algorithm of LLR and the re-sampling techniques make F-LLR more effective in KS and F-measure than the traditional versions such as LLR, RUS-LLR, ROS-LLR, SMOTE-LLR, Ridge, and WLE. This innovation opens up the possibility of applying LLR to fields with severely imbalanced data and a need to identify the input features significantly affecting the classification results, for example, credit fraud detection or cancer diagnosis. However, the best values of the hyper-parameters r_U and r_S of F-LLR should be investigated further by experiments. Besides, F-LLR should be applied to more data sets from other application fields to obtain a robust conclusion about its effectiveness.

Chapter summary

This chapter proposes a modification of Logistic regression, called F-measure-oriented Lasso-Logistic regression. Inheriting the penalized version of Logistic regression of the Lasso type, which imposes a prior on the magnitude of the parameters through a hyper-parameter λ, F-LLR shows two modified points. Firstly, the optimal λ of LLR is determined by an adjustment of the cross-validation procedure which aims for the highest F-measure instead of the highest accuracy. Secondly, F-LLR deals with ID by combining under-sampling and SMOTE selectively based on the samples’ scores in the training data set. The empirical study proves that the F-LLR classifier can increase the F-measure and KS as compared with LLR and the popular traditional balanced-data methods such as the re-sampling techniques (RUS, ROS, and SMOTE) and the modifications to the log-likelihood function (Ridge and WLE).

Summary of contributions

The interpretable credit scoring ensemble classifier

Credit scoring is a typical case of imbalanced classification. In credit scoring, bad customers are the crucial objects because of the heavy losses incurred when they go unrecognized. However, the number of bad customers is always far smaller than that of good customers, which makes traditional and machine learning models operate ineffectively. Therefore, most credit scoring models focus on solving imbalanced data to increase performance metrics. However, very few studies are concerned with the need for interpretable models to respond to the risk management requirements of financial institutions.

In the credit scoring literature, studies on the improvement or innovation of classifier algorithms account for a fairly high percentage of the overall literature. The fact is that the more effective a credit scoring model is, the more complicated its structure is, which leads to difficulties in interpreting the predicted result. Therefore, effectiveness and interpretability are competing aspects of a credit scoring model. To date, very few studies have addressed both issues when building credit scoring models.

The dissertation proposes a solution to the “effectiveness - interpretability” trade-off in credit scoring, namely the Decision tree ensemble classifier (DTE). DTE utilizes the ensemble-based approach to handle imbalanced data. Furthermore, DTE can show the importance level of the input features for the final response.

DTE is built by applying the DT classifier to several balanced training data sets which are generated by combinations of the random over-sampling (ROS) and random under-sampling (RUS) techniques on the original training data set. The purpose of combining ROS and RUS is to exploit their advantages and compensate for their drawbacks. Moreover, the combinations make the sizes of the balanced training data sets different, which prepares the diversity of the sub-classifiers of DTE. Besides, DTE can exhibit the importance level of a feature, which is the average of the importance levels of this feature over all sub-classifiers of DTE. The idea of constructing DTE is analogous to Bagging trees, but Bagging trees neither solve imbalanced data nor address interpretability. A minimal sketch of this construction is given below.
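The sketch assumes scikit-learn decision trees and a simple ROS/RUS combination parameterized by an over-sampling rate; the function names, the rate grid, and the tree depth are illustrative choices, not the exact settings used in the dissertation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def make_balanced_set(X, y, over_rate, rng):
    """One ROS + RUS combination: over-sample the positives (with replacement)
    up to over_rate * |negatives|, then under-sample the negatives to the same size."""
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    pos_idx = rng.choice(pos, size=max(int(over_rate * len(neg)), len(pos)), replace=True)
    neg_idx = rng.choice(neg, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, neg_idx])
    return X[idx], y[idx]

def fit_dte(X, y, over_rates=(0.4, 0.6, 0.8, 1.0), seed=0):
    """Fit one decision tree per balanced set; feature importance is averaged over trees."""
    rng = np.random.default_rng(seed)
    trees = []
    for r in over_rates:
        Xb, yb = make_balanced_set(X, y, r, rng)
        trees.append(DecisionTreeClassifier(max_depth=5, random_state=seed).fit(Xb, yb))
    importance = np.mean([t.feature_importances_ for t in trees], axis=0)
    return trees, importance

def predict_dte(trees, X):
    """Aggregate the sub-classifiers by majority vote."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)
```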

The empirical study shows that DTE outperforms the popular DT-based ensemble classifiers such as Bagging, Random forest, and AdaBoost in AUC, KS, and H-measure. In addition, in comparison with recently proposed ensemble classifiers, DTE is the most suitable for credit scoring.

In summary, DTE satisfies the two requirements for a credit scoring model: effectiveness and interpretability.

The technique for imbalanced data, noise, and overlapping samples

Several studies state that imbalanced data is not the only reason for the poor performance of classifiers; noise and overlapping samples are also major factors. The dissertation improves the imbalanced-data technique of DTE so that it can handle the combined issues of imbalance, noise, and overlapping in a data set. The products of this improvement are the TOUS and TOUS-F algorithms.

The TOUS algorithm employs the Tomek-link method to remove all pairs of samples, called tomek-links, that are each other's nearest neighbors and belong to different classes. The samples in a tomek-link may be noise, borderline, or overlapping samples. Thus, removing these samples de-noises the training data and frees it from overlapping. After that, combinations of ROS and RUS are applied to the remaining data set to generate B different balanced data sets. The TOUS-F algorithm utilizes the output of the TOUS algorithm to construct B sub-classifiers by applying the classifier F to the B balanced data sets, where F can be Lasso-Logistic regression or Decision tree. The ensemble classifiers created by TOUS and TOUS-DT are considered improvements of DTE. In addition, the empirical study shows that the TOUS algorithm improves the AUC values compared with other popular re-sampling techniques such as ROS, SMOTE, RUS, Tomek-link, and NCL.
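A minimal sketch of the Tomek-link cleaning step, following the description above in which both samples of each link are dropped, is given below; the use of scikit-learn's NearestNeighbors and the function name are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_tomek_links(X, y):
    """Drop both members of every Tomek link: pairs of samples that are each
    other's nearest neighbor but carry different class labels."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]  # column 0 is the point itself
    drop = set()
    for i, j in enumerate(nearest):
        if y[i] != y[j] and nearest[j] == i:   # mutual nearest neighbors, opposite classes
            drop.update((i, j))
    keep = np.array([i for i in range(len(y)) if i not in drop])
    return X[keep], y[keep]
```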

The modification of Logistic regression

In classification, LR is one of the most popular classifiers. However, LR performs ineffectively on imbalanced data sets. There are two approaches to imbalanced data for LR: re-sampling techniques, and modifications to the log-likelihood function or corrections to the maximum likelihood parameter estimation. These approaches can improve the performance measures of LR in some cases, but their effectiveness is not generally robust.

The dissertation proposes a new classifier called F-measure-oriented Lasso-Logistic regression (F-LLR). The base learner of F-LLR is Lasso-Logistic regression (LLR), which imposes a prior on the magnitude of the parameters through a hyper-parameter λ. The optimal λ is determined by an adjustment of the cross-validation procedure in which the F-measure is the criterion for performance evaluation instead of accuracy. Besides, F-LLR addresses imbalanced data by selectively combining under-sampling and SMOTE. Specifically, under-sampling removes negative samples with upper-high scores, which are too difficult to classify due to their high probability of being misclassified. Furthermore, SMOTE generates synthetic samples in the neighborhood of the upper-high-scored positive samples; these positive samples are useful for classification since they are usually correctly classified across thresholds. Removing the high-scored negative samples and synthesizing new samples in the high-scored positive subclass also restricts the overlapping issue of the training data set. The empirical study shows that F-LLR can increase the F-measure and KS as compared with the standard LLR and with LLR combined with the traditional balanced-data methods such as RUS-LLR, ROS-LLR, SMOTE-LLR, Ridge, and Weighted likelihood estimation. A rough sketch of the score-guided re-sampling step follows.
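The sketch uses an L1-penalized logistic model to produce the preliminary scores and simple linear interpolation for the SMOTE-like step; the parameter names (r_u, r_s), the scoring model, and the interpolation details are illustrative assumptions rather than the exact algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def score_guided_resample(X, y, r_u=0.1, r_s=0.5, seed=0):
    """Score samples, drop the hardest negatives, and synthesize positives near
    the top-scored positives (a SMOTE-like interpolation)."""
    rng = np.random.default_rng(seed)
    scores = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y).predict_proba(X)[:, 1]
    neg, pos = np.where(y == 0)[0], np.where(y == 1)[0]

    # Under-sampling: remove the r_u fraction of negatives with the highest scores.
    keep_neg = neg[np.argsort(scores[neg])][: len(neg) - int(r_u * len(neg))]

    # SMOTE-like step: interpolate between pairs of top-scored positives.
    top_pos = pos[np.argsort(scores[pos])][-max(int(r_s * len(pos)), 2):]
    synthetic = []
    for i in top_pos:
        j = rng.choice(top_pos[top_pos != i])
        synthetic.append(X[i] + rng.uniform() * (X[j] - X[i]))

    X_new = np.vstack([X[keep_neg], X[pos], np.array(synthetic)])
    y_new = np.concatenate([np.zeros(len(keep_neg)), np.ones(len(pos) + len(synthetic))])
    return X_new, y_new
```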

Implications

There are some notes for solving imbalanced data in credit scoring and general classification cases.

Firstly, DTE is a suitable classifier for credit scoring because of its effectiveness and interpretability. Although DTE was built for credit scoring, it is also a good classifier for any imbalanced classification task where interpretability matters.

Secondly, accompanying issues such as overlapping classes and noise should be investigated in addition to imbalanced data. Some re-sampling methods can detect overlapping and noise samples, for example, Tomek-link and the Neighborhood Cleaning Rule (NCL). If a data set suffers from all these issues, the ensemble classifier constructed by the TOUS and TOUS-F algorithms is the rational choice. If the data set is only imbalanced, DTE is the reasonable option. Thirdly, F-LLR is an appropriate option for the application of Logistic regression. F-LLR can show the probability of belonging to the positive class, which is an advantage of F-LLR that DTE does not have. Furthermore, F-LLR optimizes the F-measure, which means that it can boost both the recall and precision criteria. F-LLR can also point out the essential predictors of the response. Therefore, F-LLR can be applied to real-world applications concerned with false negative and false positive samples, such as credit scoring and medical diagnostics.

Limitations and suggestions for further research

Besides its contributions, the dissertation still has some limitations which need further research to enhance the results.

Firstly, the proposed ensemble classifier DTE should be tested on more real data sets to find the relationship between the optimal B (the number of sub-classifiers of the proposed ensemble model) and the characteristics of the data set, such as the imbalance ratio, the number of predictors, the numerical predictors, and so on.

Secondly, the study only uses AUC, an overall metric, in the performance evaluation of TOUS and TOUS-F. This may not provide enough evidence of the advantages of the proposed works. Therefore, other performance measures, including KS, F-measure, G-mean, or H-measure, should be used to give a complete evaluation of their effectiveness. In addition, the TOUS and TOUS-F algorithms should be examined with other base learners such as KNN, SVM, or ANN. Thirdly, all the proposed works in the dissertation solve imbalanced binary classification. However, multi-class classification is present in several practical applications; for example, credit scoring models can classify customers into many levels corresponding to their credit risk levels. Therefore, the proposed algorithms can be extended to imbalanced multi-class classification.

Finally, the dissertation only focuses on the case study of credit scoring on cross-sectional data. Meanwhile, time series data related to macroeconomic variables (inflation rate, foreign exchange rate, unemployment rate, and so on) deeply affect the prediction of the credit risk status of customers. Therefore, credit scoring should be further studied on imbalanced issues of time series data. Besides, the age of Big Data leads to an extension of data from social networks, digital footprints, or intelligent applications for credit scoring models. In summary, credit scoring models should be evaluated on more types of data sets.

List of publications

1. TOUS: A new technique for imbalanced data classification (2022). Studies in Systems, Decision and Control, Vol. 429, 595–612. Springer.

2. An interpretable decision tree ensemble model for imbalanced credit scoring datasets (2023). Journal of Intelligent & Fuzzy Systems, Vol. 45, No. 6.

3. A modification of logistic regression with imbalanced data: F-measure-oriented Lasso-logistic regression (2023). ScienceAsia, 49S, 68–77.

References

Abdiansah, A., & Wardoyo, R. (2015). Time complexity analysis of support vector machines (SVM) in LibSVM. International Journal Computer and Application, 128(3), 28–34.

Abdoli, M., Akbari, M., & Shahrabi, J (2023) Bagging supervised autoencoder classifier for credit scoring Expert Systems with Applications, 213, 118991.

Abdou, H A., & Pointon, J (2011) Credit scoring, statistical techniques and evaluation criteria: a review of the literature Intelligent Systems in Accounting, Finance and Management, 18(2-3),

Agustianto, K., & Destarianto, P (2019) Imbalance data handling using neighborhood cleaning rule (ncl) sampling method for precision student modeling In 2019 international conference on computer science, information technology, and electrical engineering (icomitee) (p 86-89) doi: 10.1109/ICOMITEE.2019.8921159

Akay, M F (2009) Support vector machines combined with feature selection for breast cancer diagnosis Expert systems with applications, 36(2), 3240–3247.

Ala’raj, M., & Abbod, M F (2016) Classifiers consensus system approach for credit scoring.

Altman, E I., Marco, G., & Varetto, F (1994) Corporate distress diagnosis: Comparisons using linear discriminant analysis and neural networks (the italian experience) Journal of banking & finance, 18(3), 505–529.

Alves, G E D A P., Silva, D F., Prati, R C., et al (2012) An experimental design to evaluate class imbalance treatment methods In 2012 11th international conference on machine learning and applications (Vol 2, pp 95–101).

Anderson, R., et al (2007) The credit scoring toolkit: Theory and practice for retail credit risk management and decision automation OUP Catalogue.

Angiulli, F. (2005). Fast condensed nearest neighbor rule. In Proceedings of the 22nd international conference on machine learning (pp. 25–32). New York, NY, USA: Association for Computing Machinery. doi: 10.1145/1102351.1102355

Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society, 54(6), 627–635.

Bahnsen, A C., Aouada, D., & Ottersten, B (2014) Example-dependent cost-sensitive logistic regression for credit scoring In 2014 13th international conference on machine learning and applications (pp 263–269).

Bahnsen, A C., Aouada, D., & Ottersten, B (2015) Example-dependent cost-sensitive decision trees Expert Systems with Applications, 42(19), 6609–6619.

Barandela, R., Valdovinos, R M., & Sánchez, J S (2003) New applications of ensembles of classifiers.

Batista, G., Carvalho, A., & Monard, M C (2000) Applying one-sided selection to unbalanced datasets In O Cairó, L E Sucar, & F J Cantu (Eds.), Micai 2000: Advances in artificial intelligence (pp 315–325) Berlin, Heidelberg: Springer Berlin Heidelberg.

Batista, G., Prati, R C., & Monard, M C (2004) A study of the behavior of several methods for balancing machine learning training data ACM SIGKDD Explorations Newsletter, 6(1),

Bektas, J., Ibrikci, T., & Ozcan, I T (2017) Classification of real imbalanced cardiovascular data using feature selection and sampling methods: a case study with neural networks and logistic regression International Journal on Artificial Intelligence Tools, 26(06), 1750019.

Bellinger, C., Drummond, C., & Japkowicz, N (2016) Beyond the boundaries of smote In Machine learning and knowledge discovery in databases (pp 248–263) Springer International Publishing. Bensic, M., Sarlija, N., & Zekic-Susac, M (2005) Modelling small-business credit scoring by us- ing logistic regression, neural networks and decision trees Intelligent Systems in Accounting, Finance & Management: International Journal , 13(3), 133–150.

Bishop, C M., et al (1995) Neural networks for pattern recognition Oxford university press.

Błaszczyński, J., Deckert, M., Stefanowski, J., & Wilk, S (2010) Integrating selective pre-processing of imbalanced data with ivotes ensemble In Rough sets and current trends in computing: 7th international conference, rsctc 2010, warsaw, poland, june 28-30, 2010 proceedings 7 (pp. 148–157).

Boonchuay, K., Sinapiromsaran, K., & Lursinsap, C (2017) Decision tree induction based on minority entropy for the class imbalance problem Pattern Analysis and Applications, 20, 769–782.

Boser, B E., Guyon, I M., & Vapnik, V N (1992) A training algorithm for optimal margin classifiers In Proceedings of the fifth annual workshop on computational learning theory (pp. 144–152).

Breiman, L (1996) Bagging predictors Machine learning, 24(2), 123–140.

Breiman, L (2001) Random forests Machine learning, 45 (1), 5–32.

Breiman, L., Friedman, J H., Olshen, R A., & Stone, C J (2017) Classification and regression trees Routledge.

Brown, I., & Mues, C (2012) An experimental comparison of classification algorithms for imbalanced credit scoring data sets Expert Systems with Applications, 39(3), 3446–3453.

Bình, Đ T T., & Anh, C H V (2021) Mô hình cảnh báo sớm rủi ro tín dụng cho các ngân hàng thương mại Tạp Chí Tài Chính, kỳ 1 tháng 5.

Cao, W., He, Y., Wang, W., Zhu, W., & Demazeau, Y (2021) Ensemble methods for credit scoring of chinese peer-to-peer loans Journal of Credit Risk, 17 (3).

Castro, C L., & Braga, A P (2013) Novel cost-sensitive approach to improve the multilayer perceptron performance on imbalanced data IEEE transactions on neural networks and learning systems, 24(6), 888–899.

Chawla, N V., Bowyer, K W., Hall, L O., & Kegelmeyer, W P (2002) Smote: synthetic minority over-sampling technique Journal of artificial intelligence research, 16, 321–357.

Chawla, N V., Lazarevic, A., Hall, L O., & Bowyer, K W (2003) Smoteboost: Improving prediction of the minority class in boosting In European conference on principles of data mining and knowledge discovery (pp 107–119).

Chen, K., Yadav, A., Khan, A., & Zhu, K (2020) Credit fraud detection based on hybrid credit scoring model Procedia Computer Science, 167, 2–8.

Chen, X., Li, S., Xu, X., Meng, F., & Cao, W (2020) A novel gsci-based ensemble approach for credit scoring IEEE Access, 8, 222449–222465.

Chomboon, K., Chujai, P., Teerarassamee, P., Kerdprasop, K., & Kerdprasop, N (2015) An empirical study of distance metrics for k-nearest neighbor algorithm In Proceedings of the 3rd international conference on industrial application engineering (Vol 2).

Chou, C.-H., Kuo, B.-H., & Chang, F (2006) The generalized condensed nearest neighbor rule as a data reduction method In 18th international conference on pattern recognition (icpr’06)

Cieslak, D A., Hoens, T R., Chawla, N V., & Kegelmeyer, W P (2012) Hellinger distance decision trees are robust and skew-insensitive Data Mining and Knowledge Discovery, 24, 136–158.

Cortes, C., & Vapnik, V (1995) Support-vector networks Machine learning, 20, 273–297.

Cramer, J S (2003) Logit models from economics and other fields Cambridge University Press.

Cui, Y.-J., Davis, S., Cheng, C.-K., & Bai, X. (2004). A study of sample size with neural network. In Proceedings of 2004 international conference on machine learning and cybernetics (IEEE Cat. No. 04EX826) (Vol. 6, pp. 3444–3448).

Dastile, X., Celik, T., & Potsane, M (2020) Statistical and machine learning models in credit scoring: A systematic literature survey Applied Soft Computing, 91, 106263.

Datta, S., & Das, S (2015) Near-bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs Neural Networks, 70, 39–52.

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, 1–30.

Desai, V S., Crook, J N., & Overstreet Jr, G A (1996) A comparison of neural networks and linear scoring models in the credit union environment European journal of operational research, 95(1), 24–37.

Devi, D., Biswas, S., & Purkayastha, B. (2017). Redundancy-driven modified tomek-link based undersampling: A solution to class imbalance. Pattern Recognition Letters, 93, 3–12. Retrieved from https://www.sciencedirect.com/science/article/pii/S0167865516302719 doi: https://doi.org/10.1016/j.patrec.2016.10.006
Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Drummond, C., Holte, R C., et al (2003) C4 5, class imbalance, and cost sensitivity: why under- sampling beats over-sampling In Workshop on learning from imbalanced datasets ii (Vol 11, pp 1–8).

Dumitrescu, E.-I., Hué, S., Hurlin, C., et al (2021) Machine learning or econometrics for credit scoring: Let’s get the best of both worlds.

D’Addabbo, A., & Maglietta, R. (2015). Parallel selective sampling method for imbalanced and large data classification. Pattern Recognition Letters, 62, 61–67. Retrieved from https://www.sciencedirect.com/science/article/pii/S0167865515001531 doi: https://doi.org/10.1016/j.patrec.2015.05.008

Ebenuwa, S H., Sharif, M S., Alazab, M., & Al-Nemrat, A (2019) Variance ranking attributes selection techniques for binary classification problem in imbalance data IEEE Access, 7, 24649-

Effendy, V., Baizal, Z A., et al (2014) Handling imbalanced data in customer churn prediction using combined sampling and weighted random forest In 2014 2nd international conference on information and communication technology (icoict) (pp 325–330).

Elahi, E., Ayub, A., & Hussain, I (2021) Two staged data preprocessing ensemble model for software fault prediction In 2021 international bhurban conference on applied sciences and technologies (ibcast) (pp 506–511).

Elhassan, T., & Aljurf, M (2017) Classification of imbalance data using tomek link (t-link) combined with random under-sampling (rus) as a data reduction method Global J Technol Optim S, 1.

Elkan, C (2001) The foundations of cost-sensitive learning In International joint conference on artificial intelligence (Vol 17, pp 973–978).

Etheridge, H L., & Sriram, R S (1997) A comparison of the relative costs of financial distress models: artificial neural networks, logit and multivariate discriminant analysis Intelligent Systems in Accounting, Finance & Management, 6(3), 235–248.

Faris, H (2014) Neighborhood cleaning rules and particle swarm optimization for predicting customer churn behavior in telecom industry International Journal of Advanced Science and Technology,

Fernández, A., García, S., Galar, M., Prati, R C., Krawczyk, B., & Herrera, F (2018) Learning from imbalanced data sets (Vol 10) Springer.

Ferri, C., Hernández-Orallo, J., & Flach, P A (2011) A coherent interpretation of auc as a measure of aggregated classification performance In Proceedings of the 28th international conference on machine learning (icml-11) (pp 657–664) Madison, WI, USA: Omnipress.

Finlay, S. (2011). Multiple classifier architectures and their application to credit risk assessment. European Journal of Operational Research, 210(2), 368–378.

Fiore, U., De Santis, A., Perla, F., Zanetti, P., & Palmieri, F (2019) Using generative adversarial networks for improving classification effectiveness in credit card fraud detection Information Sciences, 479, 448–455.

Firth, D (1993) Bias reduction of maximum likelihood estimates Biometrika, 80(1), 27–38.

Fotouhi, S., Asadi, S., & Kattan, M. W. (2019). A comprehensive data level analysis for cancer diagnosis on imbalanced data. Journal of Biomedical Informatics, 90, 103089. Retrieved from https://www.sciencedirect.com/science/article/pii/S1532046418302302 doi: https://doi.org/10.1016/j.jbi.2018.12.003

Freund, Y., Schapire, R E., et al (1996) Experiments with a new boosting algorithm In icml

Friedman, J., Hastie, T., & Tibshirani, R (2010) Regularization paths for generalized linear models via coordinate descent Journal of Statistical Software, 33(1), 1–22.

Fu, G.-H., Xu, F., Zhang, B.-Y., & Yi, L.-Z. (2017). Stable variable selection of class-imbalanced data with precision-recall criterion. Chemometrics and Intelligent Laboratory Systems, 171, 241–250. doi: https://doi.org/10.1016/j.chemolab.2017.10.015

Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F (2011) A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4),

Galindo, J., & Tamayo, P (2000) Credit risk assessment using statistical and machine learning: basic methodology and risk modeling applications Computational economics, 15, 107–143.

Gareth, J., Daniela, W., Trevor, H., & Robert, T. (2013). An introduction to statistical learning: with applications in R. Springer.

Garrido, F., Verbeke, W., & Bravo, C (2018) A robust profit measure for binary classification model evaluation Expert Systems with Applications, 92, 154–160.

Goodman, S (2008) A dirty dozen: twelve p-value misconceptions Seminars in hematology, 45(3),

Gosain, A., & Sardana, S. (2017). Handling class imbalance problem using oversampling techniques: A review. In 2017 international conference on advances in computing, communications and informatics (ICACCI) (pp. 79–85).

Greenland, S., & Mansournia, M A (2015) Penalization, bias reduction, and default priors in logistic and related categorical and survival regressions Statistics in Medicine, 34(23), 3133–3143.

Ha, V S., Nguyen, H N., & Nguyen, D N (2016) A novel credit scoring prediction model based on feature selection approach and parallel random forest Indian Journal of Science and Technology, 9(20), 1–6.

Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G (2017) Learning from class-imbalanced data: Review of methods and applications Expert Systems with Applications,

Han, H., Wang, W.-Y., & Mao, B.-H (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning In Advances in intelligent computing: International conference on intelligent computing, icic 2005, hefei, china, august 23-26, 2005, proceedings, part i 1 (pp. 878–887).

Hand, D J (2009) Measuring classifier performance: a coherent alternative to the area under the roc curve Machine learning, 77 (1), 103–123.

Hand, D J., & Anagnostopoulos, C (2014) A better beta for the h measure of classification performance Pattern Recognition Letters, 40 , 41–46.

Hand, D J., & Henley, W E (1997) Statistical classification methods in consumer credit scoring: a review Journal of the Royal Statistical Society: Series A (Statistics in Society), 160(3),

Hart, P (1968) The condensed nearest neighbor rule (corresp.) IEEE transactions on information theory, 14(3), 515–516.

Hastie, T., Tibshirani, R., & Wainwright, M (2015) Statistical learning with sparsity: the lasso and generalizations Chapman and Hall/CRC.

He, H., Bai, Y., Garcia, E A., & Li, S (2008) Adasyn: Adaptive synthetic sampling approach for imbalanced learning In 2008 ieee international joint conference on neural networks (ieee world congress on computational intelligence) (pp 1322–1328).

He, H., Zhang, W., & Zhang, S (2018) A novel ensemble method for credit scoring: Adaption of different imbalance ratios Expert Systems with Applications, 98, 105–117.

Hoi, S C., Jin, R., Zhu, J., & Lyu, M R (2009) Semisupervised svm batch mode active learning with applications to image retrieval ACM Transactions on Information Systems (TOIS), 27 (3), 1–29.

Huang, C.-L., Chen, M.-C., & Wang, C.-J (2007) Credit scoring with a data mining approach based on support vector machines Expert systems with applications, 33(4), 847–856.

Huang, J., & Ling, C X (2005) Using auc and accuracy in evaluating learning algorithms IEEE Transactions on knowledge and Data Engineering, 17 (3), 299–310.

Huang, Y.-M., Hung, C.-M., & Jiau, H C (2006) Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem Nonlinear Analysis: Real World Applications, 7 (4), 720–747.

Huang, Z., Chen, H., Hsu, C.-J., Chen, W.-H., & Wu, S (2004) Credit rating analysis with support vector machines and neural networks: a market comparative study Decision support systems,

Hưng, N T., & Trang, L T H (2018) Mô hình chấm điểm tín dụng dựa trên sự kết hợp giữa mô hình cây quyết định, logit, k láng giềng gần nhất và mạng thần kinh nhân tạo Tạp Chí Khoa Học và Đào Tạo Ngân Hàng, tháng 6 , 46–54.

Iranmehr, A., Masnadi-Shirazi, H., & Vasconcelos, N (2019) Cost-sensitive support vector machines.

Jabeur, S B., Sadaaoui, A., Sghaier, A., & Aloui, R (2020) Machine learning models and cost- sensitive decision trees for bond rating prediction Journal of the Operational Research Society, 71(8), 1161–1179.

James, G., Witten, D., Hastie, T., & Tibshirani, R (2013) An introduction to statistical learning.

Jiang, C., Lv, W., & Li, J (2023) Protein-protein interaction sites prediction using batch normal- ization based cnns and oversampling method borderline-smote IEEE/ACM Transactions on Computational Biology and Bioinformatics.

Junsomboon, N., & Phienthrakul, T (2017) Combining over-sampling and under-sampling tech- niques for imbalance dataset In Proceedings of the 9th international conference on machine learning and computing (pp 243–247).

Kaur, P., & Gosain, A (2018) Comparing the behavior of oversampling and undersampling approach of class imbalance learning by combining class imbalance problem with noise In Ict based innovations (pp 23–30) Springer.

Khemais, Z., Nesrine, D., & Mohamed, M (2016) Credit scoring and default risk prediction: A comparative study between discriminant analysis & logistic regression International Journal of Economics and Finance, 8(4), 39–53.

Kim, M.-J., Kang, D.-K., & Kim, H B (2015) Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction Expert Systems with Applications, 42 (3), 1074–1082.

King, G., & Zeng, L (2001) Logistic regression in rare events data Political analysis, 9(2), 137–163.

Kitali, A E., Alluri, P., Sando, T., & Wu, W (2019) Identification of secondary crash risk factors using penalized logistic regression model Transportation Research Record, 2673(11), 901–914.

Kiều, N M., Diệp, N T N., Nga, N T H., & Nam, N K (2017) Ứng dụng mô hình mạng thần kinh nhân tạo để ước lượng rủi ro tín dụng ở các ngân hàng thương mại việt nam Tạp chí Ngân hàng, 11.

Krawczyk, B., & Woźniak, M (2015) One-class classifiers with incremental learning and forgetting for data streams with concept drift Soft Computing, 19(12), 3387–3400.

Krawczyk, B., Woźniak, M., & Schaefer, G (2014) Cost-sensitive decision tree ensembles for effective imbalanced classification Applied Soft Computing, 14, 554–562.

Kubat, M., Matwin, S., et al (1997) Addressing the curse of imbalanced training sets: one-sided selection In Icml (Vol 97, pp 179–186).

Laurikkala, J. (2001). Improving identification of difficult small classes by balancing class distribution. In Conference on artificial intelligence in medicine in Europe (pp. 63–66).

Lee, W., Jun, C.-H., & Lee, J.-S (2017) Instance categorization by support vector machines to adjust weights in adaboost for imbalanced data classification Information Sciences, 381, 92–103.

German credit data set (GER)

GER has 1,000 samples and 20 features. Details of GER can be found in the source indicated.

Table C.1: Summary of the German credit data set

Features Roles Types Descriptions Units

A1 Feature Categorical Status of existing checking account

A6 Feature Categorical Savings account/bonds

A7 Feature Categorical Present employment since

A8 Feature Integer Installment rate in percentage of disposable income

A9 Feature Categorical Personal status and sex

A10 Feature Categorical Other debtors / guarantors

A11 Feature Integer Present residence since

A14 Feature Categorical Other installment plans

A16 Feature Integer Number of existing credits at this bank

A18 Feature Integer Number of people being liable to provide mainte- nance for

Status Target Binary 0 = Good, 1 = Bad

Vietnamese 1 data set (VN1)

VN1 has 3,232 samples, 10 features, and no missing values.

Table C.2: Summary of the Vietnamese 1 data set

Features Roles Types Description Distribution (# of samples)

Duration Feature Categorical Types of duration of the loans

Draw down Feature Categorical Drawdown amount of money

> 400,000,000 VND: 938. Interest Feature Categorical The average interest rate of the loan

Asset Feature Categorical Total income of the borrower in a year

Purpose Feature Categorical Purpose of the loan

Type 1: 1,202; Type 2: 757; Type 3: 599; Type 4: 378; Type 5: 99; Type 6: 94;

Borrowers Feature Categorical Borrowers types DCO: 479; DIN: 2753.

Types Feature Categorical Loan types DLS: 36; DMS: 1,787; DSS: 1,409.

History Feature Categorical Credit history all paid back duly: 103; critical account: 723; delay in the past: 215; existing credits paid back duly till now: 1,676; no credits taken/ all credits paid back duly: 515.

Sex Feature Categorical Gender and married status

DCO: 479; female: 916; male - divorced/separated: 116; male - married/widowed: 250; male - single: 1,471.

Collateral Feature Categorical Liquidity of collateral Yes: 124; No: 3108

Status Target Binary 0 = Good, 1 = Bad Good: 2,778; Bad: 454.

Vietnamese 2 data set (VN2)

VN2 has 16,407 samples, 12 features, and no missing values.

Table C.3: Summary of Vietnamese 2 data set

Features Roles Types Description Distribution (# of samples)

Duration Feature Categorical Types of duration of the loans

Amount Feature Categorical Loan amount

Interest rate Feature Categorical The average interest rate of the loan

Product types Feature Categorical Product types

Periodic principal & interest: 3,447; Periodic interest: 8,404;

Purpose Feature Categorical Purpose of the loan P1: 1,055 ; P2: 1,222; P3: 1,827;

Customers types Feature Categorical Customers types Firm: 266; Individual: 16,141. Loan types Feature Categorical Loan types CML: 2,833; CNS: 7,973;

Current- balance Feature Categorical Current outstanding balance from other loans

Base- balance Feature Categorical The average base balance per month

> 50: 8,094 Sex Feature Categorical Customer’s gender Firm: 266; F: 5,845; M: 10,296.

Branches Feature Categorical The branch provides the loan B1: 2,547; B2: 2,017;

B3: 1,500; B4: 4,969; B5: 5,374.
Status Target Binary 0 = Good, 1 = Bad Good: 15,067; Bad: 1,340.

Taiwanese credit data set (TAI)

TAI has 30,000 samples, 23 features, and no missing values.

Table C.4: Summary of the Taiwanese credit data set (a)

Features Roles Types Descriptive statistics

Min: 10,000; Max:1,000,000; Median: 140,000 X2 Feature Binary “1”: 11,888; “2”: 18,112.

X5 Feature Integer Mean: 35.485; S.D: 9.218; Min: 21; Max:79; Median: 34 X6 Feature Categorical “-2”: 2,759; “-1”: 5,686; “0”: 14,737;

Table C.5: Summary of the Taiwanese credit data set (b)

Features Roles Types Descriptive statistics

Min: 0; Max: 1,684,259; Median: 2,009. X20 Feature Numeric Mean: 5,226; S.D: 17,606.961;

Min: 0; Max: 426,529.0; Median: 1,500.0. X23 Feature Numeric Mean: 5,215.503 ; S.D: 17,777.466;

Min: 0; Max: 528,666.0; Median: 1,500.0.Status Target Binary Good: 23,364, Bad: 6,636.

Bank personal loan data set (BANK)

BANK has 5,000 samples, 11 features, and no missing values.

Table C.6: Summary of the Bank personal loan data set

Feature Roles Types Description Descriptive statistics

Age Feature Numeric Ages in completed years Mean: 45.34; S.D: 11.463;

Experi- ence Feature Numeric Number of years of professional experience

Income Feature Numeric Annual income of the customer ($1000) Mean: 73.774; S.D: 46.034;

Min: 8; Max: 224; Median: 64. Family Feature Numeric Family size of customers Mean: 2.396; S.D: 1.148;

CCAvg Feature Numeric Avg spending on credit cards per month ($1000)

Educa- tion Feature Categori- cal Education Level Advanced :1,501;

Mort- gage Feature Numeric Value of house mortgage if any ($1000)

Se Ac- count Feature Binary Have a securities account with the bank “1”: 522; “0”: 4,480,

CD Ac- count Feature Binary Have a certificate of deposit

Online Feature Binary Does the customer use inter- net banking facilities? “0”: 2,016; “1”: 2,984.

Credit card Feature Binary Use a credit card issued by this bank? “0”: 3,530; “1”: 1,470.

Personal.Loan Target Binary Accepted the personal loan offered in the last campaign? Yes: 4520; No: 480

Hepatitis C patients data set (HEPA)

HEPA has 615 samples and 12 features, with some missing values. After removing the samples corresponding to missing values, HEPA consists of 589 observations. Besides, the variable “Category” is converted into a binary variable.
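A minimal sketch of this preprocessing, assuming the column names and category labels of the publicly available HCV data release (the file path and the donor/patient encoding below are assumptions):

```python
import pandas as pd

df = pd.read_csv("hepatitis_c.csv")   # illustrative path to the HCV data
df = df.dropna()                      # 615 samples -> 589 observations without missing values
# Collapse the multi-level "Category" into a binary target:
# 0 = blood donor (negative class), 1 = hepatitis-related category (positive class)
df["Status"] = (~df["Category"].str.contains("Donor")).astype(int)
print(df["Status"].value_counts())
```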

Table C.7: Summary of the Hepatitis C patients data set

Features Roles Types Descriptive statistics

Age Feature Integer Mean: 47.418; S.D: 9.931; Min: 23; Max: 77; Median: 47. Sex Feature Categorical Female: 226; Male: 363;

ALB Feature Numeric Mean: 41.624; S.D: 5.762; Min: 14.900; Max: 82.200;

ALP Feature Numeric Mean: 68.123; S.D: 25.921; Min: 11.30; Max: 416.600;

ALT Feature Numeric Mean: 26.575; S.D: 20.863; Min: 0.90; Max: 325.300;

AST Feature Numeric Mean: 33.773; S.D: 32.867; Min: 10.600; Max: 324.000;

BIL Feature Numeric Mean: 11.018; S.D: 17.407; Min: 0.80; Max: 209.000;

CHE Feature Numeric Mean: 8.204; S.D: 2.191; Min: 1.420; Max: 16.410; Me- dian: 8.260.

CHOL Feature Numeric Mean: 5.391; S.D: 1.129; Min: 1.430; Max: 9.670; Median:

CREA Feature Numeric Mean: 81.67; S.D: 81.669; Min: 8.000; Max: 1079.100;

GGT Feature Numeric Mean: 38.198; S.D: 54.302; Min: 4.500; Max: 650.900;

PROT Feature Numeric Mean: 71.890; S.D: 5.349; Min: 44.800; Max: 86.500;

Status Target Binary Yes: 63; No: 526.

The Loan schema data from lending club (US)

US consists of 10,000 samples, including individuals (8,505 samples) and joint applications (1,495 samples). The number of variables is 55. However, seven variables related to joint applications that suffer from missing values were removed. For convenience, we also removed seven other variables: empty-title, loan-status, months-since-90d-late, num-accounts-30d-past-due, num-collections-last-12m, current-accounts-delinq, and tax-liens. In short, the data set for the experiment has 8,505 individuals with 40 features and the target variable.

Table C.8: Summary of the Loan schema data from lending club (a)

Features Roles Types Description Descriptive statistics emp- length Feature Categorical Number of years in the job, rounded down

>6: 3,659. homeown- ership Feature Categorical The ownership status of the applicant’s residence MORTGAGE: 3,839;

OWN: 1,170; RENT: 3,496. annual- income Feature Numeric Annual income of the customer

Mean: 82,322; S.D: 67,064.765; Min: 5,235; Max: 2,300,000; Med: 68,200. verified- income Feature Categorical Type of verification of the applicant’s income Not Ver: 3,195;

Source Ver: 3,502; Ver: 1,808. debt-to- income Feature Numeric Debt-to-income ratio Mean: 17.341; S.D: 8.757;

2y Feature Categorical Delinquencies on lines of credit in the last 2 years

1: 872. mon- since-last- delinq Feature Categorical Months since the last delinquency “-1”: 4,770; 0: 1,222. total- collection- ever Feature Categorical The total amount that the applicant has had against them in collections

“0”: 7,314; >0: 1,191. current- inst-acc Feature Numeric Number of installment accounts with a fixed payment amount and period

Mean: 2.635; S.D: 2.933; Min: 0; Max: 35; Med: 2. acc- opened-

24m Feature Numeric Number of new lines of credit opened in the last 24 months

Mean: 4.434; S.D: 3.184; Min: 0; Max: 29; Med: 4. mon-last- credit-inq Feature Categorical Number of months since the last credit inquiry on this applicant.

4: 2,109y. num- satis-acc Feature Numeric Number of satisfactory accounts Mean: 11.425; S.D: 5.905;

Min: 0; Max: 51; Med: 10. num-acc-

120d-p-d Feature Categorical Number of current ac- counts 120 days past due “-1”: 318; “0”: 8,187. num-act- deb-acc Feature Categorical Number of currently active bank cards 1: 7,131. total- deb-lim Feature Numeric Total of all bank card limits Mean: 27,590; S.D: 26,818;

Min: 0; Max: 386,700; Med: 19,600. num- total-cc- acc Feature Categorical Total number of credit card accounts in the applicant’s history

9 : 890. num- mort-acc Feature Categorical Number of mortgage accounts

Table C.10: Summary of the Loan schema data from lending club (c)

Features Roles Types Description Descriptive statistics acc- never- deli-per Feature Numeric Percent of all lines of credit where the applicant was never delinquent.

Mean: 94.551; S.D: 9.235; Min: 20; Max: 100; Med: 100. pub-rec- bkrupt Feature Binary Number of bankruptcies listed in the public record 0: 7,457; 1: 1,048; loan- amount Feature Numeric The amount of the loan the applicant received

Mean: 15,749; S.D: 10,092; Min: 1,000; Max: 40,000; Med: 13,000. term Feature Categorical The duration (months) 36: 2,398. interest- rate Feature Numeric Interest rate of the loan Mean: 12.299; S.D: 4.923;

Min: 5.31; Max: 30.94; Med: 11.98. install- ment Feature Numeric Monthly payment for the loan the applicant received

Mean: 461.193; S.D: 289.345; Min: 30.750; Max: 1503.890; Med: 379.450. grade Feature Categorical Grade with the loan A: 2,142; B: 2,611;

C: 2,252; D: 1,190; E: 310. in-list- status Feature Categorical Initial listing status of the loan fractional: 1,544; whole: 6,961. disbur- meth Feature Categorical Dispersement method of the loan

Cash : 7,829; DirectPay: 676. balance Feature Numeric Current balance on the loan Mean: 13,865; S.D: 9,731;

Min: 0; Max: 40,000; Med: 11,427. paid- total Feature Numeric Total has been paid on the loan by the applicant

Min: 0; Max: 41,630.400; Med: 1,518.600. paid- interest Feature Numeric The amount of interest paid so far by the applicant

Mean: 570.583; S.D: 496.986; Min: 0; Max: 4206.200; Med: 419.400.
paid-late-fees Feature Categorical Late fees paid by the applicant “0”: 8,465; “>0”: 40
purpose Feature Categorical The category for the purpose of the loan car: 244; credit card: 1,971; house: 419; debt consolidation: 4,289; moving: 223; house improvement: 566; others: 793.
status Target Binary Good = 0, Bad = 1 0: 8,338; 1: 167

Vietnamese 3 data set (VN3)

VN3 has 11,124 samples, 11 features, and no missing values.

Table C.11: Summary of the Vietnamese 3 data set

Features Roles Types Description Distribution

Duration Feature Categorical Types of duration of the loans

Interest rate Feature Categorical The average interest rate of the loan

Product types Feature Categorical Product types

Periodic principal & interest: 2,483; Periodic interest: 5,762;

Purpose Feature Categorical Purpose of the loan P1: 768; P2: 828; P3: 1,290;

Customers types Feature Categorical Customers types Firm: 172; Individual: 10,952. Loan types Feature Categorical Loan types CML: 1,943; CNS: 5,285;

Current- balance Feature Categorical Current outstanding balance from other loans

Base- balance Feature Categorical The average base balance per month

Sex Feature Categorical Customer’s gender Firm: 172; F: 3,913; M: 7,039.

Branches Feature Categorical The branch provides the loan

B3: 1,099; B4: 3,400; B5: 3,502.
Status Target Binary 0 = Good, 1 = Bad Good: 10,287; Bad: 837.

Australian credit data set (AUS)

AUS consists of 690 samples and 14 features. All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. AUS does not have missing values.

Table C.12: Summary of the Australian credit data set

Features Roles Types Descriptive statistics

A5 Feature Numeric Mean: 7.372; S.D: 3.683; Min: 1; Max: 14; Median: 8. A6 Feature Numeric Mean: 4.693; S.D: 1.992; Min: 1; Max: 9; Median: 4. A7 Feature Numeric Mean: 2.223; S.D: 3.347; Min: 1; Max: 28.5; Median: 1. A8 Feature Categorical “0”: 329; “1”: 361.

A10 Feature Numeric Mean: 2.400; S.D: 4.863; Min: 0; Max: 67; Median: 0. A11 Feature Categorical “0”: 374; “1”: 316.

Credit risk data set (Credit 1)

Credit 1 consists of 5,740 samples and 11 features and does not have missing values.

Table C.13: Summary of the Credit 1 data set

Features Roles Types Description Distribution

Person-age Feature Numeric Age Mean: 27.760; S.D: 6.168;

Person- income Feature Numeric Annual income Mean: 66,169; S.D: 55,079; Min: 4,080;

Home- ownership Feature Categorical Home personal ownership mortgage: 2,419; own: 424; rent: 2,882; others: 15.

Emp- length Feature Numeric Employment length

Loan- intent Feature Categorical Loan intent debtconsolidation: 926; education: 1,132; home improvement: 630; medical: 1,086 personal: 940; venture: 1026.

Grade Feature Categorical Loan grade A: 1,876; B: 1,817; C: 1,163;

Loan- amnt Feature Numeric Loan amount Mean: 9,708; S.D: 6,434;

Interest- rate Feature Numeric Interest rate Mean: 11.072; S.D: 3.265;

Percent- income Feature Numeric Percent income Mean: 0.171; S.D: 0.108;

Min: 0.000; Max: 0.760; Med: 0.150. default-on- file Feature Categorical Historical default Yes: 1,028; No: 4,712. cred- hist- length Feature Integer Credit history length Mean: 5.824; S.D: 4.061;

Status Target Binary 0 = Good, 1 = Bad Good: 4,503; Bad: 1,237.

Credit card data set (Credit 2)

Credit 2 has 9,709 samples and 18 features and does not have missing values.

Table C.14: Summary of the Credit 2 data set

Features Roles Types Description Distribution

Gender Feature Categorical Gender Male: 6,323; Female: 3,386.

Own car Feature Categorical Car ownership Yes: 3,570; No: 6,139.

Property Feature Categorical Own property Yes: 6,520 ; No: 3,189.

W-phone Feature Categorical Own a work phone? Yes: 2,111; No: 7,598.

Phone Feature Categorical Own a phone? Yes: 2,793; No: 6,916.

Email Feature Categorical Have an email? Yes: 850; No: 8,859.

Unempl Feature Categorical unemployed? Yes: 1,696; No: 8,013.

Num- children Feature Numeric Number of children Mean: 0.423; S.D: 0.767;

Far-size Feature Numeric Number of members Mean: 2.183; S.D: 0.933;

Acc- length Feature Numeric Months credit card has been owned Mean: 27.270; S.D: 16.648;

Total- income Feature Numeric Total income Mean: 181.228; S.D: 99.277;

Age Feature Numeric Age in years Mean: 43.784; S.D: 11.626;

Years- employed Feature Numeric Number of years employed Mean: 5.665; S.D: 6.342;

Min: 0.000; Max: 43.021; Med: 3.762. Income- type Feature Categorical Income type Com ass: 2,312; Pensioner: 1,712;

State servant: 725; Working: 4,960. Educa- tion Feature Categorical Education type High: 2,463; Incomplete high: 371;

Lower secondary: 114; Others: 6,761. Fam- status Feature Categorical Family status Civil marr: 836; Marr: 6,530; Single:

1,359; Separated: 574; Widow: 410. House- type Feature Categorical Housing type House / apa: 8,684; Municipal apa:

Occup- type Feature Categorical Occupation type Laborers: 1,724; Sales: 959; Core:

877; Managers: 782; Drivers: 623; Others 1: 2,994; Others 2: 1,750.
Status Target Binary 0 = Good, 1 = Bad Good: 8,426; Bad: 1,283.

Credit default data set (Credit 3)

Credit 3 consists of 12,600 samples and 11 features. Credit 3 does not have missing values.

Table C.15: Summary of the Credit 3 data set

Experience Feature Numeric Mean: 10.046; S.D: 6.036; Min: 0; Max: 20; Med: 10. Married Feature Categorical Married: 1,285; Single :11,315.

House- ownership Feature Categorical Rented: 11,599; Owned: 636; Other: 365.

Car- ownership Feature Categorical No : 8,780; Yes: 3,820.

Current- job-year Feature Numeric Mean: 6.291; S.D: 3.652;

Current- house-year Feature Numeric Mean: 12; S.D: 1.391;

Risk-flag Target Binary Good: 11,075; Bad: 1,525.

Vietnamese 4 data set (VN4)
