Imbalanced Data in Classification: A Case Study of Credit Scoring
Overview of imbalanced data in classification
Classification is essential across various fields, including medicine for cancer diagnosis, finance for fraud detection, and business administration for predicting customer churn. It involves predicting a class label for a sample based on features learned from training data sets. Classification algorithms identify patterns in these features to create a fitted model that can predict labels for new samples. There are two main types of classification: binary and multi-class classification. Binary classification addresses two-class problems, while multi-class classification involves multiple labels and is sometimes viewed as a binary problem with one class of interest against the remaining labels. In this context, we focus on binary classification and define the key concepts for clarity.
Definition 1.1.1 A data set with k input features for binary classification is the set of samples S = X × Y, where X ⊂ R^k is the domain of the samples' features and Y = {0, 1} is the set of labels.
In data classification, the samples in the positive class are referred to as S+, while those in the negative class are denoted S−. A sample belonging to S+ is called a positive sample, whereas a sample belonging to S− is called a negative sample.
Definition 1.1.2 A binary classifier is a function mapping the domain of features X to the set of labels {0, 1}.
Definition 1.1.3 Consider a data set S and a classifier f : X → {0, 1}. For a given sample s_0 = (x_0, y_0) ∈ S, there are four possibilities as follows:
• If f(x_0) = y_0 = 1, s_0 is called a true positive sample.
• If f(x_0) = y_0 = 0, s_0 is called a true negative sample.
• If f(x_0) = 1 and y_0 = 0, s_0 is called a false positive sample.
• If f(x_0) = 0 and y_0 = 1, s_0 is called a false negative sample.
The numbers of true positive, true negative, false positive, and false negative samples are denoted TP, TN, FP, and FN, respectively.
Some popular criteria used to evaluate the performance of a classifier are accuracy, true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR).
In application domains with balanced positive and negative classes, classifiers often prioritize accuracy. However, when the positive class comprises rare or unusual events, the limited number of samples can hinder classifiers from effectively identifying positive patterns. Consequently, errors in classifying the positive class can result in significant losses. As a result, accuracy may not be the primary performance metric; instead, metrics related to the positive class, such as the true positive rate (TPR), become more crucial for evaluating classifier performance.
In credit scoring, customers are categorized into "good" and "bad" classes, and credit data sets typically show a predominance of good customers due to prior screening. Misclassifying a bad customer as good results in a significantly higher loss than misclassifying a good customer as bad, so identifying bad customers is the critical task. For instance, in a data set with 95% good and 5% bad customers, a trivial classifier that labels all customers as good achieves 95% accuracy but fails to identify any bad customers, resulting in a true positive rate (TPR) of 0%. Therefore, a classifier with lower overall accuracy but a higher TPR is preferred in this setting.
Cancer diagnosis poses a similar classification challenge with two primary categories: "malignant" and "benign." Typically, the number of benign cases far exceeds that of malignant ones, yet identifying malignancy is crucial because of the severe implications of overlooking cancer patients. Consequently, relying solely on accuracy to assess cancer-diagnosis classifiers is inadequate.
The phenomenon of skew distribution in training data sets for classification is known as imbalanced data.
An imbalanced data set, denoted S = S+ ∪ S−, consists of a positive class S+ and a negative class S−. When the size of the positive class S+ is significantly smaller than that of the negative class S−, the data set is considered imbalanced. The imbalance ratio (IR) is defined as the ratio of the size of the negative class to that of the positive class, IR = |S−|/|S+|.
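As a minimal illustration (not taken from the dissertation), the imbalance ratio can be computed directly from the class counts. The data frame `credit`, its columns, and the 5% positive rate below are hypothetical.

```r
# Minimal base-R sketch: computing the imbalance ratio (IR) of a binary data set.
set.seed(1)
n <- 1000
credit <- data.frame(
  income = rnorm(n),
  debt   = rnorm(n),
  bad    = rbinom(n, 1, 0.05)   # positive (minority) class: "bad" customers
)

n_pos <- sum(credit$bad == 1)   # size of S+
n_neg <- sum(credit$bad == 0)   # size of S-
IR    <- n_neg / n_pos          # imbalance ratio as defined above
IR
```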
Motivations
Imbalanced training data sets can lead to classifiers achieving high overall accuracy while exhibiting low true positive rates (TPR). These classifiers prioritize maximizing accuracy, which results in an equal weighting of type I and type II errors, ultimately causing a bias towards the majority class, often the negative class. This bias is particularly pronounced when the imbalance ratio is large, adversely affecting the representation of the minority class.
In imbalanced classification, the positive class is frequently overlooked by common classifiers, which tend to treat it as noise or outliers. This oversight hampers the ability to recognize positive patterns, even though identifying positive samples is essential in such scenarios. Consequently, dealing with imbalanced data presents a significant challenge in the classification process.
Research indicates that an increased imbalance ratio negatively impacts model performance (Brown & Mues, 2012). Additionally, various authors have highlighted that imbalanced data is a significant factor contributing to poor performance, with noise and overlapping samples further diminishing the effectiveness of learning methods (Batista, Prati, & Monard, 2004; Haixiang et al., 2017). Thus, researchers and practitioners should deeply understand the nature of their data sets to handle them correctly.
A prominent example of imbalanced classification is credit scoring, particularly evident in the bad debt ratios of commercial banks. In Vietnam, the on-balance-sheet bad debt ratio was 1.9% in 2021, up from 1.7% in 2020. Furthermore, the gross bad debt ratio, which includes unresolved debts and potential bad debts from restructuring, rose to 7.3% in 2021 from 5.1% in 2020. Despite the small percentage of bad customers, their impact on banks is significant, as a rise in bad debt ratios can threaten the stability of the banking system and potentially lead to economic collapse in countries heavily reliant on banking. Thus, accurately identifying bad customers in credit scoring is crucial for financial health.
In Vietnam, the credit market operates under strict regulations set by the State Bank, compelling commercial banks to rigorously manage credit risk through thorough credit appraisal processes prior to funding. Academic research on credit scoring has garnered significant attention from various authors (Bình & Anh, 2021; Hưng & Trang, 2018; Quỳnh, Anh, & Linh, 2018; Thắng, 2022), yet there remains a scarcity of studies addressing the issue of imbalanced credit scoring (Mỹ, 2021).
1 https://sbv.gov.vn/webcenter/portal/vi/links/cm255?dDocName=SBV489213
The dissertation titled “Imbalanced Data in Classification: A Case Study of Credit Scoring” explores the challenges of imbalanced classification, focusing on credit scoring in Vietnam. This study seeks to identify effective solutions for addressing imbalanced data and its associated issues.
Research gap identifications
Gaps in credit scoring
In the dissertation, we choose credit scoring as a case study of imbalanced classification.
Credit scoring is a numerical assessment of the creditworthiness of individuals, offering essential insights for banks and financial institutions to mitigate credit risk and standardize credit management practices. To be effective, credit-scoring classifiers must fulfill two critical criteria: they must accurately identify high-risk customers, and they must provide clear explanations for their predictive outcomes.
In the past two decades, significant advancements have been made in enhancing credit scoring models through various methodologies. These range from traditional statistical techniques such as K-nearest neighbors, discriminant analysis, and logistic regression to widely used machine learning models such as decision trees, artificial neural networks, and support vector machines.
Single classifiers, such as Logistic Regression and Decision Trees, exhibit varying effectiveness across different data sets. For instance, while some studies indicate that Logistic Regression outperforms Decision Trees (Marqués, García, & Sánchez, 2012; Wang, Ma, Huang, & Xu, 2012), other research suggests the opposite (Bensic, Sarlija, & Zekic-Susac, 2005). This inconsistency highlights the importance of evaluating classifier performance in the context of specific data sets.
Some research indicates that Support Vector Machines (SVM) outperform Logistic Regression in credit scoring (Li et al., 2019). However, Van Gestel et al. (2006) found no significant differences among SVM, Logistic Regression, and Linear Discriminant Analysis. Overall, empirical studies in credit scoring suggest that no single classifier is universally superior for all data sets.
The advancement of computational software and programming languages has led to a transition from single classifiers to ensemble classifiers, which combine multiple classifier algorithms. These ensemble models enhance decision-making by harnessing the combined strengths of their sub-classifiers. Research in credit scoring has demonstrated that ensemble models outperform single classifiers (Brown & Mues, 2012; Dastile, Celik, & Potsane, 2020; Lessmann, Baesens, Seow, & Thomas, 2015; Marqués et al., 2012). However, ensemble algorithms alone do not address the issue of imbalanced data.
The second requirement of a credit scoring model, while often overlooked, is crucial because it explains the classification results needed for assessing, managing, and hedging credit risk. With an increasing diversity of customer features in empirical data sets, not all information is relevant for credit scoring. Administrators must extract from the classification model the key factors that influence default likelihood in order to establish clear credit standards. However, there is typically a trade-off between classifier effectiveness and transparency: as performance measures improve, the ability to explain predicted outcomes diminishes. For instance, interpretable classifiers such as Discriminant Analysis, Logistic Regression, and Decision Trees may be less effective than more complex models such as Support Vector Machines and Artificial Neural Networks.
Black-box classifiers, including ensemble methods such as Bagging Tree, Random Forest, and AdaBoost, often deliver impressive performance despite their lack of interpretability. Dastile et al. (2020) highlight that, in the context of credit scoring, merely 8% of studies have introduced new models that address the issue of interpretability.
Therefore, building a credit-scoring ensemble classifier that satisfies both requirements is an essential task.
In Vietnam, credit data sets face challenges such as imbalance, noise, and overlapping issues. Despite the rapid development of credit scoring models amid digital transformation, Vietnamese commercial banks continue to rely on traditional methods such as Logistic regression and Discriminant analysis. Some studies have explored machine learning techniques, including Artificial Neural Networks, Support Vector Machines, and Random Forest, to enhance credit scoring accuracy.
Recent studies have focused on advanced methods for credit scoring, including ensemble models, but have largely overlooked the challenges of imbalanced data and interpretability. While a few researchers have addressed data imbalance, their approaches often neglect the presence of noise and overlapping samples, which can compromise the effectiveness of credit scoring applications.
To enhance performance metrics on Vietnamese data sets, it is essential to develop a credit-scoring ensemble classifier that effectively addresses imbalanced data, noise, and overlapping samples. Additionally, the proposed model will identify the key features crucial for predicting credit risk status.
Gaps in the approaches to solving imbalanced data
There are three popular approaches to imbalanced classification in the literature: algorithm-level, data-level, and ensemble-based approaches (Galar et al., 2011).
The algorithm-level approach addresses imbalanced data by adjusting classifier algorithms to lessen the bias towards the majority class, which requires in-depth knowledge of the intrinsic classifiers that users often lack. This approach is not versatile, as it necessitates specific modifications for each classifier. A notable example is cost-sensitive learning, which adjusts the costs associated with misclassifications to minimize the total loss during classification (Xiao et al., 2012; Xiao et al., 2020). However, the cost values are typically determined by the researchers' intentions, making the algorithm-level approach inflexible and cumbersome.
The data-level approach re-balances training data sets with re-sampling techniques, which fall into three main groups: over-sampling, under-sampling, and hybrids of both. Over-sampling increases the size of the minority class, while under-sampling reduces that of the majority class. This approach is easy to implement and functions independently of the classifier algorithms. However, re-sampling alters the distribution of the training data, potentially resulting in a subpar classification model. For example, random over-sampling may increase computation time and introduce noise, leading to an overfitting model. Heuristic over-sampling methods, such as the Synthetic Minority Over-sampling Technique (SMOTE), can further exacerbate overlapping issues. Conversely, under-sampling may discard valuable information from the majority class, particularly in cases of severe imbalance.
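As a hedged illustration of how SMOTE-style over-sampling works, the base-R sketch below interpolates synthetic points between minority samples and their nearest minority neighbours. The function name `smote_sketch` and its defaults are invented here; this is not the reference SMOTE implementation.

```r
# Hedged sketch of SMOTE-style over-sampling (base R): each synthetic sample is
# an interpolation between a minority sample and one of its k nearest minority
# neighbours. X_min is a numeric matrix/data frame of minority-class features.
smote_sketch <- function(X_min, k = 5, n_new = nrow(X_min)) {
  X_min <- as.matrix(X_min)
  D <- as.matrix(dist(X_min))              # pairwise distances within the minority class
  diag(D) <- Inf
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X_min))
  for (j in seq_len(n_new)) {
    i   <- sample(nrow(X_min), 1)                            # a random minority sample
    nn  <- order(D[i, ])[seq_len(min(k, nrow(X_min) - 1))]   # its k nearest neighbours
    nb  <- X_min[nn[sample.int(length(nn), 1)], ]            # one neighbour at random
    gap <- runif(1)
    synth[j, ] <- X_min[i, ] + gap * (nb - X_min[i, ])       # interpolate between the two
  }
  colnames(synth) <- colnames(X_min)
  synth
}
```

Because every synthetic point lies on a segment between two minority samples, points generated near the class boundary can deepen the overlapping problem mentioned above.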
The ensemble-based approach combines ensemble classifier algorithms with algorithm-level or data-level techniques to enhance performance on imbalanced data (Abdoli et al., 2023; Shen et al., 2021; Yang et al., 2021; Zhang et al., 2021). Despite its effectiveness, this approach can lead to complex models that are challenging to interpret, a significant concern that needs to be addressed.
In conclusion, while numerous methods exist for addressing imbalanced classification, each presents certain limitations, and some hybrid approaches are overly complex and difficult to implement. Additionally, there is a scarcity of research jointly addressing imbalance, noise, and overlapping samples. Existing studies often fail to achieve the anticipated performance improvements across data sets. Therefore, a novel algorithm that effectively tackles imbalance, noise, and overlapping issues is essential to enhance performance metrics for the positive class.
Gaps in Logistic regression with imbalanced data
Logistic regression (LR) is a widely used classifier, particularly in credit scoring, because it delivers interpretable outputs in the form of conditional probabilities of class membership. By comparing these probabilities against a specified threshold, LR classifies samples into the positive or negative class, and it extends naturally to multi-class tasks. Its computation is efficient, relying on the maximum likelihood estimator, and is supported by various software packages and programming languages. Additionally, LR allows the influence of predictors on the outcome to be assessed through the statistical significance of the corresponding parameters, resulting in a model that is both interpretable and cost-effective.
However, LR struggles with imbalanced data sets, often underestimating the conditional probability of positive samples, which increases the risk of misclassifying them. Additionally, the reliance on p-values for determining the statistical significance of predictors has faced criticism due to common misunderstandings, limiting LR's applicability despite its advantages.
To address imbalanced data in LR, various methods have been proposed, including prior correction, weighted likelihood estimation (WLE), and penalized likelihood regression (PLR). However, these algorithm-level approaches often require significant user input and rely on information that may not be readily available, such as the positive class ratio in the population. Additionally, certain PLR methods can be overly sensitive to initial values during maximum likelihood estimation, and some only correct biased parameter estimates rather than biased conditional probabilities. Notably, the literature lacks hybrid methods that combine these techniques with re-sampling strategies, which could leverage the strengths of each approach to tackle imbalanced data in LR more effectively.
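One simple way to realise the weighted-likelihood idea mentioned above is to pass class-dependent prior weights to `glm()`. This is a hedged base-R sketch, not the dissertation's method; the helper name `fit_weighted_lr` and the inverse-frequency weighting scheme are assumptions.

```r
# Hedged sketch: class-weighted logistic regression via prior weights in glm(),
# one way to realise weighted likelihood estimation (WLE).
fit_weighted_lr <- function(data, formula, target) {
  y <- data[[target]]
  n_pos <- sum(y == 1); n_neg <- sum(y == 0); n <- length(y)
  # inverse-frequency weights so both classes contribute equally to the likelihood;
  # the weight vector is stored in `data` so glm() can find it reliably
  data$.w <- ifelse(y == 1, n / (2 * n_pos), n / (2 * n_neg))
  # glm() may warn about non-integer weights; the fit is still the weighted MLE
  glm(formula, data = data, family = binomial(), weights = .w)
}

# Example use on a data frame like the hypothetical `credit` one above:
# m <- fit_weighted_lr(credit, bad ~ income + debt, target = "bad")
# summary(m)
```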
Therefore, to handle imbalanced data in LR effectively, modifications are needed at both the data and algorithm levels. These adjustments should enable the analysis of imbalanced data sets while preserving the ability to assess the influence of predictors on the response variable without relying on the traditional p-value criterion.
Research objectives, research subjects, and research scopes
Research objectives
In this dissertation, we aim to achieve the following objectives.
The primary goal is to introduce a new ensemble classifier designed to meet the two essential criteria for credit-scoring models. The new classifier aims to surpass traditional classification methods and well-known balancing techniques, including Bagging trees, Random Forest, and AdaBoost combined with random over-sampling (ROS), random under-sampling (RUS), SMOTE, and Adaptive Synthetic Sampling (ADASYN). Additionally, the proposed model is able to determine the importance of input features in assessing credit risk status.
The second objective is to introduce a new technique to overcome the challenges of imbalanced data, noise, and overlapping samples in classification tasks. By integrating the advantages of re-sampling methods and ensemble models, this approach addresses these issues jointly. The technique is applicable to various fields in which imbalanced classification is a significant concern, including credit scoring and medical diagnosis.
The third objective is to modify the computation process of Logistic Regression so that it handles imbalanced data sets effectively and reduces the impact of overlapping samples. The modification targets the F-measure, a key metric for assessing classifier performance in imbalanced classification. The proposed approach aims to rival established balancing methods for Logistic Regression, including weighted likelihood estimation, penalized likelihood regression, and re-sampling techniques such as Random Over-Sampling (ROS), Random Under-Sampling (RUS), and SMOTE.
Research subjects
This dissertation explores the challenges of imbalanced data, noise, and overlapping samples in classification, with credit scoring as a case study. It evaluates various balancing methods, with particular emphasis on data-level and ensemble-based approaches rather than algorithm-level techniques. Additionally, the study investigates Lasso-Logistic regression, a penalized version of Logistic regression, in two roles: as a base learner within an ensemble classifier and as an individual classifier.
Research scopes
This dissertation examines binary classification with imbalanced data sets, specifically in the context of credit scoring. It evaluates interpretable classifiers such as Logistic Regression, Lasso-Logistic Regression, and Decision Trees. To address imbalanced data, the study emphasizes data-level strategies and their integration with ensemble classifier algorithms. Key re-sampling techniques, including Random Over-Sampling (ROS), Random Under-Sampling (RUS), SMOTE, ADASYN, Tomek-link, and the Neighborhood Cleaning Rule, are investigated. Additionally, the research employs performance metrics suitable for imbalanced classification, including the Area Under the Receiver Operating Characteristic Curve (AUC), the Kolmogorov-Smirnov statistic (KS), F-measure, G-mean, and H-measure, to assess the effectiveness of the classifiers analyzed.
Research data and research methods
Research data
The credit scoring case study uses six secondary data sets, including three well-known data sets from the UCI Machine Learning Repository—the German, Taiwan, and Bank personal loan data sets—which serve as benchmarks in credit scoring research. Additionally, three private data sets were sourced from commercial banks in Vietnam and exhibit varying degrees of imbalance. To validate the effectiveness of the proposed methods, the empirical study also incorporates the Hepatitis data set from the medical field, available in the UCI repository.
The logistic regression case study uses nine data sets, including the German, Taiwan, Bank personal loan, and Hepatitis data sets, which also feature in the credit scoring case study. The remaining data sets are publicly available from the Kaggle website and the UCI Machine Learning Repository.
Research methods
The dissertation follows a quantitative research approach to evaluate the effectiveness of the proposed methods, including a credit scoring ensemble classifier, an algorithm for balancing data and removing noisy and overlapping samples, and modifications of Logistic regression.
The implementation protocol outlined in Table 1.1 serves as the general guide for all computational processes in this dissertation. While the protocol remains consistent, Step 2 may vary with the specific requirements of each case. All computations are implemented in the programming language R, which is widely used in the machine learning community.
Table 1.1: General implementation protocol in the dissertation
1 Proposing the new algorithm or new procedure.
2 Constructing the new model with different hyper-parameters to find the optimal model on the training data.
3 Constructing other popular models with existing balanced methods and classifier algorithms on the same training data.
4 Applying the optimal model and other popular models to the same testing data, then calculating their performance measures.
5 Comparing the testing performance measures of the considered models.
Contributions of the dissertation
The dissertation contributes three methods to the literature on credit scoring and imbalanced classification. The proposed methods were published in the following three articles:
(1) An interpretable decision tree ensemble model for imbalanced credit scoring datasets, Journal of Intelligent and Fuzzy Systems, Vol 45, No 6, 10853–10864, 2023.
(2) TOUS: A new technique for imbalanced data classification, Studies in Systems, Decision, and Control, Vol 429, 595–612, 2022, Springer.
(3) A modification of Logistic regression with imbalanced data: F-measure-oriented Lasso-logistic regression, ScienceAsia, 49S, 68–77, 2023.
First, the dissertation introduces an interpretable ensemble classifier designed to tackle imbalanced data in the credit scoring literature. This model, using a Decision Tree as the base learner, offers distinct advantages over traditional methods, including enhanced performance metrics and improved interpretability. This contribution corresponds to the first article.
Second, the dissertation presents an ensemble-based method for addressing imbalanced data by balancing, de-noising, and eliminating overlapping samples. This approach demonstrates superior performance compared with traditional re-sampling techniques such as ROS, RUS, SMOTE, Tomek-link, and the Neighborhood Cleaning Rule, as well as popular ensemble classifiers such as Bagging Trees, Random Forest, and AdaBoost. This contribution corresponds to the second article.
Third, the dissertation proposes a modification of the computation process of Logistic Regression that enhances its effectiveness on imbalanced data. The proposed approach not only improves performance compared with existing methods but also retains the ability to indicate the significance of input features without relying on p-values. This contribution corresponds to the third article.
Dissertation outline
The dissertation “Imbalanced data in classification: A case study of credit scoring” has five chapters.
• Chapter 1 Introduction
• Chapter 2 Literature review of imbalanced data
• Chapter 3 Imbalanced data in credit scoring
• Chapter 4 A modification of Logistic regression with imbalanced data
• Chapter 5 Conclusion
Chapter 1 introduces the dissertation, offering a concise overview of the study of imbalanced data in classification. It outlines the motivations behind the research, identifies gaps in the existing literature, and states the objectives and scope of the study. The chapter also describes the research subjects, data, and methods, highlights the contributions of the research, and closes with the outline of the dissertation.
Chapter 2 reviews the literature on imbalanced data in classification, defining the concept and addressing related challenges such as overlapping classes. It discusses performance measures relevant to imbalanced data and reviews the main approaches, including algorithm-level, data-level, and ensemble-based methods. Additionally, the chapter covers foundational knowledge and recent advancements in credit scoring, analyzing previous studies to highlight the advantages and disadvantages of existing balancing methods. This examination serves as the framework for developing the new balancing methods in the dissertation.
Chapter 3 presents the case study of imbalanced classification in credit scoring, building on the first two articles mentioned in Section 1.6. We introduce an ensemble classifier designed to handle imbalanced data while assessing the significance of predictors. We then extend the algorithm of this credit-scoring ensemble classifier to address overlapping and noise before tackling imbalanced data. Empirical studies are conducted to validate the effectiveness of the proposed algorithms.
Chapter 4 addresses imbalanced data in the context of Logistic regression by proposing modifications to its computation process. The inner modification alters the performance criterion used to estimate the scores, while the outer modification applies selective re-sampling techniques to re-balance the training data. To validate the effectiveness of these modifications, experiments are conducted on nine data sets. This chapter is linked to the third article mentioned in Section 1.6.
Chapter 5 is the conclusion, which summarizes the dissertation, implies the applications of the proposed works, and refers to some further studies.
Chapter 2
LITERATURE REVIEW OF IMBALANCED DATA
Imbalanced data in classification
Description of imbalanced data
Imbalanced data (ID) refers to any data set with a skewed distribution of samples across the two classes, specifically when the imbalance ratio (IR) exceeds one. While there is no universally accepted threshold for defining ID, it is generally recognized that one class must have a significantly higher or lower number of samples than the other. Many researchers agree that a data set is classified as imbalanced if the minority class has substantially fewer samples, making it challenging for standard classifiers to effectively differentiate between the two classes. Thus, a data set is deemed ID when its IR is greater than one and conventional classification algorithms have difficulty identifying most samples of the minority class.
Obstacles in imbalanced classification
In ID, the minority class is frequently misclassified because there is insufficient information about its patterns. Standard classifier algorithms pursue maximum accuracy, which biases the outcome towards the majority class and leads to low accuracy on the minority class. Additionally, the unique patterns of the minority class, particularly under extreme ID, are often treated as noise in favor of the more prevalent majority patterns. Consequently, the minority class, which is the focus of the classification process, is often misclassified.
Empirical studies support this analysis: a higher imbalance ratio (IR) negatively impacts classifier performance. Brown and Mues (2012) found that classifier performance decreases as IR increases. Additionally, Prati, Batista, and Silva (2015) observed that significant performance loss occurs when the IR reaches 90/10 or higher, with losses accelerating rapidly at elevated IR levels.
In short, IR is the factor that reduces the effectiveness of standard classifiers.
Categories of imbalanced data
In real-world applications, the complexity of classification is heightened by the interplay of ID and various other factors. Some researchers argue that ID is a primary contributor to the poor performance of classifier algorithms, while additional issues such as overlapping data, small sample sizes, small disjuncts, borderline cases, rare instances, and outliers further diminish the effectiveness of these popular algorithms (Batista et al., 2004; Fernández et al., 2018; Napierala & Stefanowski, 2016; Sun et al., 2009).
Overlapping, or poor class separability, refers to an unclear decision boundary between the two classes, where samples from both classes are blended. This phenomenon degrades the performance of standard classifier algorithms such as Decision Trees, Support Vector Machines, and K-Nearest Neighbors. Batista et al. (2004) indicated that the degree of overlap between classes is more critical than the imbalance ratio (IR), while Fernández et al. (2018) suggested that simple classifier algorithms can classify data effectively regardless of the IR when there is no overlap.
Learning algorithms require a sufficiently large sample size to generalize effectively and distinguish between classes. Insufficient training data can lead to poor generalization and an overfitting model, particularly when the data are both imbalanced and limited.
Figure 2.1: Examples of circumstances of imbalanced data.
According to Galar et al (2011), the issue of insufficient information regarding the positive class intensifies in imbalanced datasets Krawczyk and Woźniak (2015) emphasized that increasing the number of samples in the minority class can significantly reduce the error rate of classifiers when addressing this imbalance.
Small disjuncts arise when the minority class is fragmented into multiple sub-spaces of the feature space, so that each small disjunct contains fewer positive samples than a large disjunct. Small disjuncts represent rare samples that are difficult to detect in data sets, and learning algorithms often overlook them when establishing general classification rules. Consequently, the error rate on small disjuncts increases (Prati, Batista, & Monard, 2004; Weiss, 2009).
The performance of standard classifiers is also significantly affected by positive samples that are borderline, rare, or outliers. Borderline samples are challenging to recognize, while rare and outlier samples are even more difficult to identify. Napierala and Stefanowski (2016) and Van Hulse and Khoshgoftaar (2009) showed that imbalanced data sets containing many borderline, rare, or outlier samples reduce the efficiency of standard classifiers.
In summary, studies of ID should pay attention to related issues such as overlapping, small sample size, small disjuncts, and the characteristics of the positive samples.
Performance measures for imbalanced data
Performance measures for labeled outputs
Most learning algorithms, such as K-nearest neighbors, decision trees, and ensemble classifiers based on decision trees, produce labeled outputs. A useful tool for evaluating the performance of these labeled-output classifiers is the confusion matrix, which cross-tabulates actual versus predicted labels.
Table 2.1: Confusion matrix

                   Predicted positive   Predicted negative   Total
 Actual positive   TP                   FN                   POS
 Actual negative   FP                   TN                   NEG
 Total             PPOS                 PNEG                 N
In Table 2.1, TP, FP, FN, and TN are the numbers of true positive, false positive, false negative, and true negative samples, as in Definition 1.1.3. POS and NEG denote the numbers of actual positive and negative samples in the training data, PPOS and PNEG denote the numbers of predicted positive and negative samples, and N is the total number of samples.
From the confusion matrix, several metrics are built to provide a framework for analyzing many aspects of a classifier These metrics can be divided into two types, single and complex metrics.
Accuracy is a widely used metric that measures the proportion of correct outputs over the data set, while its counterpart, the error rate, is the proportion of incorrect outputs:

Accuracy = (TP + TN)/N,    Error rate = (FP + FN)/N = 1 − Accuracy.

The higher the accuracy (equivalently, the lower the error rate) is, the better the classifier is.
However, accuracy and the error rate can be misleading when evaluating classifier performance on ID. When the imbalance ratio is high, standard classifiers may achieve high accuracy and a low error rate while misclassifying many positive samples, which are the samples of interest. Additionally, these metrics treat the misclassification of the positive and negative classes equally, even though misclassifying a positive sample typically incurs a greater cost. Consequently, studies on imbalanced classification often use metrics that focus on individual classes, such as the true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), false negative rate (FNR), and precision.
TPR is the proportion of positive samples classified correctly, TPR = TP/(TP + FN) = TP/POS. Other names for TPR are recall and sensitivity.
FPR is the proportion of negative samples classified incorrectly, FPR = FP/(FP + TN) = FP/NEG.
TNR (or specificity) and FNR are the complements of FPR and TPR, respectively: TNR = TN/NEG = 1 − FPR and FNR = FN/POS = 1 − TPR.
Precision is the proportion of actual positive samples among the predicted positive class, Precision = TP/(TP + FP) = TP/PPOS.
An effective classifier should attain high accuracy, TPR, TNR, and precision, together with low FPR and FNR. In imbalanced classification, TPR is usually the preferred metric because of the importance of the positive class. However, in applications such as credit scoring and cancer diagnosis, focusing solely on TPR without considering FPR can be misleading, since a trivial classifier that labels all samples as positive attains a perfect TPR while causing substantial losses. Therefore, high precision and recall are both required in these scenarios. Ultimately, the choice of performance metrics should be tailored to the specific application.
Single metrics often fall short in assessing classifier performance on ID, which motivates the use of combined metrics. Among these, the F-measure is a widely used complex metric that captures the balance between precision and recall through their weighted harmonic mean:
F_β = (1 + β²) · (Precision · Recall)/(β² · Precision + Recall) = (1 + β²)TP / ((1 + β²)TP + FP + β²FN),    (2.8)

where β is a positive parameter controlling the relative importance of FP and FN; β is greater than 1 when false negatives (FN) are prioritized over false positives (FP). The F1 score is the special case of F_β with β = 1, in which precision and recall are weighted equally, so FP and FN have the same significance. The term F-measure is often used interchangeably with F1 unless otherwise specified:

F1 = 2 · (Precision · Recall)/(Precision + Recall) = 2TP/(2TP + FP + FN).    (2.9)
The F1 score has a maximum value of 1 and is high only when both precision and recall are high. This metric is commonly used in fields such as cancer diagnosis and credit scoring (Abdoli et al., 2023; Akay, 2009; Chen, Li, Xu, Meng, & Cao, 2020).
The G-mean is the geometric mean of the True Positive Rate (TPR) and the True Negative Rate (TNR),

G-mean = √(TPR × TNR).    (2.10)

Unlike the F-measure, which focuses solely on the positive class, the G-mean provides a comprehensive assessment by incorporating both the positive and negative classes. G-mean is high if and only if TPR and TNR are both high, and its ideal value is 1.
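For concreteness, the base-R sketch below computes the single and complex metrics just defined from the confusion-matrix counts of Table 2.1. The function name `imbalance_metrics` and the example counts are illustrative.

```r
# Hedged base-R sketch: metrics derived from the confusion-matrix counts TP, FP, FN, TN.
imbalance_metrics <- function(TP, FP, FN, TN, beta = 1) {
  POS <- TP + FN; NEG <- FP + TN; N <- POS + NEG
  TPR <- TP / POS                     # recall / sensitivity
  TNR <- TN / NEG                     # specificity
  FPR <- FP / NEG
  FNR <- FN / POS
  precision <- TP / (TP + FP)
  accuracy  <- (TP + TN) / N
  f_beta <- (1 + beta^2) * TP / ((1 + beta^2) * TP + FP + beta^2 * FN)   # formula (2.8)
  g_mean <- sqrt(TPR * TNR)                                              # formula (2.10)
  c(accuracy = accuracy, TPR = TPR, TNR = TNR, FPR = FPR, FNR = FNR,
    precision = precision, F = f_beta, G_mean = g_mean)
}

# Example: a classifier on a data set of 1000 samples with 50 positives
imbalance_metrics(TP = 30, FP = 40, FN = 20, TN = 910)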
Performance measures for scored outputs
In addition to labeled-output classifiers, many classifiers, such as Logistic Regression, provide scored outputs that reflect the probability of belonging to each class, with higher scores indicating the positive label. Scored outputs are converted to labeled outputs by comparing them against a specified threshold. If the goal is to minimize errors on the positive class, a lower threshold is used, which raises the True Positive Rate (TPR) but also the False Positive Rate (FPR). Conversely, a higher threshold decreases the FPR but may increase the False Negative Rate (FNR). The selection of a threshold for scored-output classifiers therefore depends on which performance metrics need to be optimized.
When converting to labeled outputs, samples sharing the same label are treated uniformly, even though their probabilities of belonging to the positive class differ. Consequently, threshold-free metrics such as the Receiver Operating Characteristic curve (ROC), the Area Under the Curve (AUC), the Kolmogorov-Smirnov statistic (KS), and the H-measure are popular for assessing scored classifiers while preserving the output type. These metrics are recognized as general performance indicators and are used extensively in studies of imbalanced classification.
2.2.2.1 Area under the Receiver Operating Characteristics Curve
The Receiver Operating Characteristic curve (ROC) is a graphical representation of the relationship between the False Positive Rate (FPR) and the True Positive Rate (TPR) across thresholds. In the two-dimensional plane, the x-axis denotes FPR and the y-axis TPR. An ideal ROC curve approaches the top left corner of the plot, indicating that the classifier achieves high TPR at low FPR. Within the unit square, a useful classifier's ROC should lie above the diagonal line, which represents the performance of a random classifier.
Figure 2.2 presents the ROC curves of three classifiers alongside a random classifier. All three classifiers outperform the random one, as their curves lie above the red diagonal. The first and second classifiers are superior to the third, which yields a lower TPR at every FPR. However, the overall performance of the first and second classifiers cannot be compared directly without examining the areas under their curves.
The Area Under the Receiver Operating Characteristic Curve (AUCROC) is a crucial metric for evaluating the performance of a classifier, with values closer to 1 indicating superior classification ability. AUC is the convenient abbreviation for AUCROC.
The AUC can be interpreted as the average TPR over all FPRs generated by varying the threshold (Ferri, Hernández-Orallo, & Flach, 2011). A random classifier yields an AUC of 0.5, while an ideal classifier achieves an AUC of 1, so the AUC of a useful classifier typically ranges from 0.5 to 1. With a discrete set of thresholds {α_i}, i = 1, ..., n, the AUC is estimated as

AUC = 0.5 Σ_{i=2}^{n} |FPR(α_i) − FPR(α_{i−1})| (TPR(α_i) + TPR(α_{i−1})),    (2.11)

where TPR(α) and FPR(α) are the TPR and FPR corresponding to the threshold α.
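A hedged base-R sketch of formula (2.11): the threshold is swept over the observed scores and the trapezoidal sum is accumulated. The function name `auc_trapezoid` and the simulated example are illustrative.

```r
# Hedged base-R sketch: AUC estimated with the trapezoidal sum of formula (2.11).
auc_trapezoid <- function(score, y) {
  thresholds <- sort(unique(c(-Inf, score, Inf)), decreasing = TRUE)
  TPR <- sapply(thresholds, function(a) mean(score[y == 1] >= a))   # TPR(alpha)
  FPR <- sapply(thresholds, function(a) mean(score[y == 0] >= a))   # FPR(alpha)
  sum(0.5 * abs(diff(FPR)) * (head(TPR, -1) + tail(TPR, -1)))
}

# Example with random scores (the AUC should be close to 0.5):
set.seed(1)
auc_trapezoid(score = runif(200), y = rbinom(200, 1, 0.1))
```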
In the ID literature, the AUC is widely regarded as a key performance metric for selecting optimal classifiers and comparing learning algorithms (Batista et al., 2004; Brown & Mues, 2012; Huang & Ling, 2005). However, the AUC has notable limitations. One issue arises when ROC curves intersect: a curve may have a higher AUC despite yielding lower TPRs at most thresholds, which can make the AUC an unreliable measure. Additionally, Hand (2009) criticizes the AUC as an incoherent performance metric, arguing that it averages misclassification loss over cost-ratio distributions that depend on the classifier's score distribution, leading to inconsistent evaluations across classifiers. Conversely, Ferri et al. (2011) contend that Hand's interpretation is not the natural one and confirm the AUC's coherent meaning as a general classification performance measure that is independent of the classifier itself.
Figure 2.3: Illustration of KS metric
The Kolmogorov-Smirnov statistic (KS) is a widely used metric for assessing the discriminatory power of classifiers (He, Zhang, & Zhang, 2018; Shen et al., 2021; Yang et al., 2021). It quantifies the degree of separation between the score distributions of the predicted positive and negative classes. An illustration of the KS metric is given in Figure 2.3, and its formal definition is presented in formula (2.12).
Although a high KS implies an effective classifier, KS only reflects good performance in the neighborhood of the threshold at which it is attained (Řezáč & Řezáč, 2011).
In Figure 2.3, KS is realized at threshold 0.55, so effective analysis is only meaningful in the neighborhood of this value.
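For scored outputs, the KS statistic is commonly computed as the largest gap between TPR and FPR over all thresholds, max_α |TPR(α) − FPR(α)|, which is the quantity highlighted in Figure 2.3. The base-R sketch below follows that convention; the function name `ks_stat` is illustrative.

```r
# Hedged base-R sketch: KS as the maximum gap between TPR and FPR over all thresholds.
ks_stat <- function(score, y) {
  thresholds <- sort(unique(score))
  TPR <- sapply(thresholds, function(a) mean(score[y == 1] >= a))
  FPR <- sapply(thresholds, function(a) mean(score[y == 0] >= a))
  max(abs(TPR - FPR))
}
```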
In his 2009 critique of the AUC, Hand advocates the H-measure as a superior alternative. The H-measure quantifies the fractional improvement in expected minimum misclassification loss relative to a random classifier.
Formally, the H-measure is H = 1 − L/L_ref, where L is the overall expected minimum misclassification loss of the classifier and L_ref is that of a random classifier. The H-measure addresses the limitation of the AUC by fixing a classifier-independent distribution of the relative misclassification costs. In principle, the expected loss can be taken over any cost distribution, but most applications follow the recommendation of Hand and Anagnostopoulos (2014) and use the beta distribution Beta(π₁ + 1, π₀ + 1), where π₀ and π₁ are the proportions of the negative and positive classes in the population. Although relatively recent, the H-measure has gained significant traction in classification studies (Ala'raj & Abbod, 2016; Garrido, Verbeke, & Bravo, 2018; He et al., 2018).
Conclusion of performance measures in imbalanced classification
There are two types of performance metrics in the literature on imbalanced classification, including one for labeled outputs and one for scored outputs.
Accuracy is the most common metric for labeled outputs, but it can be misleading on ID; in applications such as credit scoring and cancer diagnosis, the F-measure and G-mean are preferred. For scored outputs, the AUC, KS, and H-measure are favored. However, no single performance measure is suitable for all data sets: each metric has its own advantages and limitations, so a combination of overall and threshold-based metrics should be employed for a comprehensive assessment of a classifier.
Approaches to imbalanced classification
Algorithm-level approach
The algorithm-level approach enhances specific performance metrics by modifying intrinsic classifier algorithms to mitigate the adverse effects of imbalanced data.
Let’s review some typical types of the algorithm-level approach in ID.
2.3.1.1 Modifying the current classifier algorithms
An algorithm-level approach addresses bias in imbalanced data by adjusting the core mechanisms of classifiers such as Support Vector Machines, Decision Trees, or Logistic Regression.
Support vector machine modifications primarily target the decision boundary, while adjustments to decision trees emphasize the criteria for splitting features. In contrast, enhancements to logistic regression concern the log-likelihood function and the process of maximum likelihood estimation.
Table 2.2 shows some representatives of this approach.
Table 2.2: Representatives employing the algorithm-level approach to ID

• SVM – Applying specific kernel modifications to rebuild the decision boundary in order to reduce the bias toward the majority class: Wu and Chang (2004); Xu (2016); Yang, Yang, and Wang (2009).
• SVM – Setting a weight on the samples in the training set based on their importance (the positive samples are usually assigned a higher weight): Lee, Jun, and Lee (2017); Lee et al. (2017); Yang, Song, and Wang (2007).
• SVM – Applying the active learning paradigm, especially when the samples of the training set are not fully labeled: Hoi, Jin, Zhu, and Lyu (2009); Sun, Xu, and Zhou (2016); Žliobaitė, Bifet, Pfahringer, and Holmes (2013).
• DT – Proposing a new distance for creating splits: Cieslak, Hoens, Chawla, and Kegelmeyer (2012); Boonchuay, Sinapiromsaran, and Lursinsap (2017); Lenca, Lallich, Do, and Pham (2008); Liu, Chawla, Cieslak, and Chawla (2010).
• LR – Re-computing the maximum likelihood estimates of the intercept and the conditional probability of belonging to the positive class: Maalouf and Siddiqi (2014); Maalouf and Trafalis (2011); Manski and Lerman (1977); Firth (1993); Fu, Xu, Zhang, and Yi (2017); Li et al. (2015).
Cost-sensitive learning (CSL) is based on the idea that each misclassification incurs a specific loss. Let C(1, 0) denote the loss of predicting a positive sample as negative and C(0, 1) the loss of predicting a negative sample as positive. The simplest implementation of CSL uses independent misclassification costs, in which C(1, 0) and C(0, 1) are constants. Under these notations, the total cost function is

Cost(α) = C(1, 0) × FN(α) + C(0, 1) × FP(α),

and the independent-cost form seeks the optimal threshold

α* = arg min_{α∈(0,1)} [C(1, 0) × FN(α) + C(0, 1) × FP(α)],

where FN(α) and FP(α) denote the numbers of false negatives and false positives at the threshold α. Table 2.3 presents the independent misclassification cost matrix associated with the prediction outcomes.
Table 2.3: Cost matrix in Cost-sensitive learning

                   Predicted positive   Predicted negative
 Actual positive   0                    C(1, 0)
 Actual negative   C(0, 1)              0

In ID, CSL assigns a higher cost to misclassifying a positive sample, C(1, 0), than to misclassifying a negative one, C(0, 1), in order to counteract the inherent bias toward the negative class. This choice is justified in practical classification scenarios, since the consequences of misclassifying a positive sample are usually more severe than those of misclassifying a negative sample.
Many authors assign C(0, 1) a unit cost and C(1, 0) a constant C. Several studies have proposed methods for determining the optimal threshold from the misclassification costs C(0, 1) and C(1, 0), including Elkan (2001), Moepya, Akhoury, and Nelwamondo (2014), and Sheng and Ling (2006). Other researchers have explored dependent misclassification costs, which assign an individual cost to each observation (Bahnsen, Aouada, & Ottersten, 2014, 2015; Petrides et al., 2022).
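A hedged base-R sketch of the independent-cost form of CSL: given validation scores, it searches for the threshold that minimises the total cost defined above, with the constants C(1, 0) and C(0, 1) supplied by the analyst. The function name and the example cost values are assumptions.

```r
# Hedged base-R sketch: threshold selection minimising the total misclassification cost.
optimal_cost_threshold <- function(score, y, C10, C01 = 1) {
  alphas <- sort(unique(score))
  cost <- sapply(alphas, function(a) {
    pred <- as.integer(score >= a)
    FN <- sum(pred == 0 & y == 1)        # positives predicted negative
    FP <- sum(pred == 1 & y == 0)        # negatives predicted positive
    C10 * FN + C01 * FP                  # total cost at threshold a
  })
  alphas[which.min(cost)]
}

# Example: penalise missing a bad customer ten times more than a false alarm
# alpha_star <- optimal_cost_threshold(score, y, C10 = 10, C01 = 1)
```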
Among the methods of the algorithm-level approach, CSL is the most popular (Fernández et al., 2018; Haixiang et al., 2017), since CSL can be embedded into other classifier algorithms such as:
• Support vector machine (SVM): Datta and Das (2015); Iranmehr, Masnadi-Shirazi, and Vasconcelos (2019); Ma, Zhao, Wang, and Tian (2020).
• Decision tree (DT): Drummond, Holte, et al (2003); Jabeur, Sadaaoui, Sghaier, and Aloui (2020); Qiu, Jiang, and Li (2017).
• Logistic regression (LR): Shen, Wang, and Shen (2020); Sushma S J and Assegie (2022); Zhang, Ray, Priestley, and Tan (2020).
The effectiveness of Cost-Sensitive Learning (CSL) hinges on the design of the cost matrix. A large disparity between C(1, 0) and C(0, 1) can over-emphasize the positive class and inflate the false positive rate (FPR), whereas a minimal difference may fail to correct the bias towards the negative class. Consequently, the construction of the cost matrix is crucial in CSL, and two scenarios are prevalent.
• The cost matrix is established from expert opinion. For example, in credit scoring, Moepya et al. (2014) set C(1, 0) to the average loss incurred by accepting a high-risk customer. This approach relies on prior information that is often shaped by the researchers' subjective views and may lack transparent evidence.
• The cost matrix is inferred from the data set. Some authors assign IR to the cost C(1, 0) and 1 to C(0, 1), reasoning that the higher the IR, the poorer the classification performance (Castro & Braga, 2013; López, Del Río, Benítez, & Herrera, 2015). However, IR is not the only factor reducing the performance of classifiers (see Subsection 2.1.3). If IR is used as the cost C(1, 0), any data sets with the same IR will be treated identically even though they belong to different application fields.
In summary, the choice of misclassification costs in CSL is usually a disputable issue.
2.3.1.3 Comments on algorithm-level approach
The algorithm-level approach works on the fundamental characteristics of classifiers and therefore requires a thorough understanding of the classifier algorithms to address the effects of ID. Consequently, methods at this level are typically tailored to particular classifier algorithms, which makes the approach less adaptable than data-level methods.
CSL is the most popular method of the algorithm-level approach. However, the cost matrix is usually a controversial issue.
In the future, combinations of the algorithm-level and data-level approaches should be considered to create more effective and versatile balancing methods.
Data-level approach
The data-level approach uses re-sampling techniques to adjust the skewed distribution of the original data set, making it a straightforward strategy for tackling ID. Re-sampling techniques are easy to implement and are independent of the classification models trained after the data pre-processing stage. Numerous empirical studies, including those by Batista et al. (2004), Brown and Mues (2012), and Prati et al. (2004), have demonstrated that re-sampling improves the performance of various classifiers. The approach comprises three primary groups: under-sampling, over-sampling, and hybrids of the two.
The under-sampling method removes negative samples, which belong to the majority class, to re-balance the original data set or reduce its degree of imbalance.
Random under-sampling (RUS) is the most prevalent under-sampling technique. It randomly removes negative samples to create a balanced training subset, making it a straightforward and efficient approach that reduces computation time. However, when the data are severely imbalanced, RUS may discard valuable information from the majority class because too many negative samples are removed. Figure 2.4 depicts the operation of RUS.

Figure 2.4: Illustration of RUS technique (Source: Author's design)
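A minimal base-R sketch of RUS, assuming a data frame whose target column is coded 0/1 with 1 as the minority class; the function name `rus` and the data frame `credit` are hypothetical.

```r
# Hedged base-R sketch of random under-sampling (RUS): keep every positive sample
# and draw, without replacement, as many negative samples as there are positives.
rus <- function(data, target) {
  pos <- which(data[[target]] == 1)
  neg <- which(data[[target]] == 0)
  keep_neg <- sample(neg, length(pos))   # the remaining negatives are discarded
  data[c(pos, keep_neg), ]
}

# balanced <- rus(credit, target = "bad")
```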
To overcome the limitation of RUS, authors have developed heuristic methods that remove only selected samples. Some representatives are the Condensed Nearest Neighbor Rule (Hart, 1968), Tomek-Link (Tomek et al., 1976), One-side Selection (Kubat, Matwin, et al., 1997), and the Neighborhood Cleaning Rule (Laurikkala, 2001). These methods can be used for both balancing and cleaning data.
The Condensed Nearest Neighbor rule (CNN), introduced by Hart (1968), finds a consistent subset E of the original data set S, that is, a subset that correctly classifies every sample of S with the 1-nearest-neighbor classifier. S is then replaced by E, which contains the whole minority class and only the retained portion of the majority class.
Figure 2.5: Illustration of the CNN rule (MA: majority class, MI: minority class)
In this way, CNN eliminates the negative samples that lie far from the class boundary, considering them less relevant for learning. However, CNN does not find the smallest consistent subset and tends to discard samples somewhat randomly, particularly in the early iterations, so internal samples may be retained instead of boundary ones. Moreover, in some cases, such as the one depicted in Figure 2.5, too many negative samples are removed. In addition, because the retained samples of the two classes lie close together, their characteristics are not sufficiently distinct, which complicates the work of subsequent classifiers.
Tomek-Link, introduced by Tomek et al. in 1976, enhances the capabilities of CNN by identifying pairs of samples (e_i, e_j) that satisfy two criteria: the samples belong to different classes, and the distance d(e_i, e_j) is smaller than both d(e_i, e_k) and d(e_j, e_k) for any other sample e_k. This method helps refine classification by focusing on the relationships between samples.
Figure 2.6: Illustration of tomek-links
A tomek-link is a pair of samples (e_i, e_j) in which one or both samples may be noise or lie on the boundary between classes. This is because only noise and boundary samples have a nearest neighbor belonging to the opposite class. Examples of tomek-links are illustrated in Figure 2.6.
In this figure, tomek-links are the pairs of samples that are marked by the green oval.
Tomek-Link serves as a method for both cleaning and balancing datasets. When used for cleaning, both samples of a tomek-link are removed; when used for balancing, only the negative sample is eliminated. However, Tomek-Link cannot create a balanced training dataset on its own. Although it identifies noise and boundary samples, it cannot distinguish which samples are noise and which lie on the boundary, so removing only the negative sample of a link may leave noise behind. Relying solely on Tomek-Link is therefore ineffective in such cases. To obtain a balanced training dataset free from noise and class overlap, Tomek-Link should be combined with other re-sampling techniques.
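A minimal sketch of tomek-link detection and of the balancing variant that removes only the negative member of each link (my own illustration; label 0 is assumed to be the majority class and duplicate points are assumed absent):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours and belong
    to opposite classes, i.e. tomek-links."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # Column 0 is the query sample itself, column 1 its nearest other sample.
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]
    return [(i, int(j)) for i, j in enumerate(nearest)
            if y[i] != y[j] and nearest[j] == i and i < j]

def remove_negative_link_members(X, y):
    """Balancing variant: drop only the majority-class member of each link."""
    drop = {i if y[i] == 0 else j for i, j in tomek_links(X, y)}
    keep = np.array([k for k in range(len(y)) if k not in drop])
    return X[keep], y[keep]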
One-side Selection (OSS), introduced by Kubat et al. in 1997, is an under-sampling technique that integrates the Tomek-Link and CNN methods. The combination is advantageous because Tomek-Link eliminates noise and boundary samples, while CNN removes redundant samples from the majority class, resulting in a cleaner dataset. The refined training data are deemed “safe” for the learning process. However, OSS only reduces the imbalance of the original dataset and, in certain situations, may not yield a fully balanced training set.
The Neighborhood Cleaning Rule (NCL), introduced by Laurikkala in 2001, classifies each sample in the training set using the 3-nearest-neighbors (3-NN) rule. If a sample e_k belongs to the majority class but is predicted by 3-NN to belong to the minority class, it is removed from the dataset. Conversely, if e_k belongs to the minority class and is misclassified by the 3-NN rule, its nearest neighbors from the majority class are eliminated to enhance classification accuracy.
NCL effectively reduces the prevalence of the majority class and addresses overlapping data, yet it does not achieve complete balance on its own. To maximize its benefits, NCL should be combined with other under-sampling and over-sampling techniques.
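The cleaning rule described above can be sketched as follows (a simplified illustration, not Laurikkala's exact implementation; label 0 is assumed to be the majority class):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_cleaning(X, y):
    """Judge every sample by the majority vote of its 3 nearest neighbours;
    drop misclassified majority samples, and drop the majority neighbours of
    misclassified minority samples."""
    nn = NearestNeighbors(n_neighbors=4).fit(X)              # the sample itself + 3 neighbours
    neighbours = nn.kneighbors(X, return_distance=False)[:, 1:]
    to_remove = set()
    for i, nbrs in enumerate(neighbours):
        predicted = int(y[nbrs].sum() >= 2)                  # 3-NN majority vote
        if predicted == y[i]:
            continue
        if y[i] == 0:                                        # misclassified majority sample
            to_remove.add(i)
        else:                                                # misclassified minority sample
            to_remove.update(int(n) for n in nbrs if y[n] == 0)
    keep = np.array([k for k in range(len(y)) if k not in to_remove])
    return X[keep], y[keep]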
The clustering-based method introduced by Yen and Lee in 2006 begins by dividing the dataset into K clusters. Negative samples are then randomly selected from each cluster. Finally, these selected negative samples are combined with the positive class to create a new, balanced training dataset.
Clustering-based methods are expected to mitigate the information loss associated with RUS, and several clustering variants have demonstrated better classifier performance than RUS, as highlighted by Nugraha et al. (2020), Prathilothamai and Viswanathan (2022), Rekha and Tyagi (2021), and Yen and Lee (2009). However, the choice of K, the number of clusters, is rarely discussed in depth, and randomly selecting negative samples within each cluster does not adequately address noise or borderline cases. Additionally, clustering-based methods typically require longer computation times than RUS and other nearest-neighbor techniques, including CNN, Tomek-link, OSS, and NCL, as noted by Yen and Lee (2009).
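A minimal sketch of cluster-based under-sampling in the spirit of Yen and Lee (my own simplification; the equal per-cluster quota is an assumption, since the original method sizes the draw from each cluster differently):

import numpy as np
from sklearn.cluster import KMeans

def cluster_under_sample(X, y, n_clusters=5, random_state=0):
    """Partition the majority class (label 0) into K clusters and draw a small
    random share of negative samples from each cluster."""
    rng = np.random.default_rng(random_state)
    maj_idx = np.flatnonzero(y == 0)
    min_idx = np.flatnonzero(y == 1)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X[maj_idx])
    per_cluster = max(1, min_idx.size // n_clusters)         # equal quota per cluster (assumption)
    kept_majority = []
    for c in range(n_clusters):
        members = maj_idx[labels == c]
        take = min(per_cluster, members.size)
        if take:
            kept_majority.extend(rng.choice(members, size=take, replace=False))
    kept = np.concatenate([min_idx, np.array(kept_majority, dtype=int)])
    rng.shuffle(kept)
    return X[kept], y[kept]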
Notes. The operation of the nearest-neighbor and clustering approaches is based on distance, a crucial concept in machine learning. Various distance measures are employed depending on feature characteristics, which may be nominal or numeric. Numeric samples typically use distance measures such as the Euclidean, Manhattan, and Minkowski distances. For samples with both nominal and numeric attributes, the HEOM or HVDM metrics are applied. For more detail on these distance types, refer to Santos et al. (2020), Weinberger and Saul (2009), and Wilson and Martinez (1997). A summary of the different distance types can be found in Appendix A.
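As a quick numeric illustration of the distances mentioned above, SciPy provides ready-made implementations (a minimal example with made-up vectors):

from scipy.spatial.distance import cityblock, euclidean, minkowski

a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(euclidean(a, b))         # sqrt(3**2 + 2**2 + 0**2) ≈ 3.606
print(cityblock(a, b))         # |1-4| + |2-0| + |3-3| = 5 (Manhattan distance)
print(minkowski(a, b, p=3))    # Minkowski distance of order 3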
Over-sampling addresses class imbalance by increasing the number of positive samples. The most prevalent method is Random Over-Sampling (ROS), which randomly duplicates positive samples to enhance the representation of the minority class. While ROS is straightforward to implement and does not rely on heuristics, it extends computation time and may replicate noise and borderline samples, potentially leading to overfitting of classification models (Fernández et al., 2018). A minimal ROS sketch is given below.
Figure 2.7: Illustration of ROS technique
Source: Author's design
Therefore, heuristic techniques were proposed to overcome the limitation of ROS. The most popular is the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla, Bowyer, Hall, & Kegelmeyer, 2002).
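Before turning to SMOTE, the ROS baseline it improves upon can be sketched in a few lines (my own illustration, assuming label 1 is the minority class):

import numpy as np

def random_over_sample(X, y, random_state=0):
    """Duplicate randomly chosen minority samples until both classes have the same size."""
    rng = np.random.default_rng(random_state)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    n_extra = max(0, maj_idx.size - min_idx.size)
    extra = rng.choice(min_idx, size=n_extra, replace=True)   # sampling with replacement
    kept = np.concatenate([np.arange(len(y)), extra])
    rng.shuffle(kept)
    return X[kept], y[kept]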
SMOTE operates within the feature space rather than the data space, which distinguishes it from ROS, which merely duplicates positive samples. Instead of creating copies, SMOTE generates synthetic samples in the vicinity of the original positive samples. A detailed overview of the SMOTE algorithm for numeric samples is given in Table 2.4; a minimal code sketch follows the table.
Figure 2.8: Illustration of SMOTE technique
Figure 2.8 is an illustration of SMOTE with k = 5 and N = 6. In this figure, the positive sample x_0 has five nearest neighbors x_0i, i ∈ {1, ..., 5}. On the segments joining x_0 and the selected neighbors, synthetic samples are generated by interpolation.
Table 2.4: Summary of SMOTE algorithm
Inputs: T, MA, and MI: training set, majority class, and minority class;
N: amount of over-sampling; k: the number of nearest neighbors.
1. for i from 1 to |MI| do
2. In T, find the k nearest neighbors of x_i ∈ MI, indexed x_ij, j ∈ {1, ..., k}.
3. Randomly choose N neighbors among the x_ij, j ∈ {1, ..., k}.
4. For each chosen neighbor x_ih, generate the synthetic sample s_ih = x_i + u·(x_ih − x_i), where u is drawn uniformly from (0, 1).
Outputs: N·|MI| synthetic positive samples {s_ih}_{i,h}.
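The procedure in Table 2.4 can be sketched as follows for numeric features (a minimal illustration using the standard SMOTE interpolation; the function and parameter names are mine, and the minority class is assumed to contain more than k samples):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_minority, N=2, k=5, random_state=0):
    """For each minority sample, pick N of its k nearest minority neighbours and
    interpolate one synthetic sample on each connecting segment."""
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)   # +1: the sample is its own neighbour
    neighbours = nn.kneighbors(X_minority, return_distance=False)[:, 1:]
    synthetic = []
    for i, x in enumerate(X_minority):
        chosen = rng.choice(neighbours[i], size=N, replace=True)
        for j in chosen:
            u = rng.random()                                   # u ~ U(0, 1)
            synthetic.append(x + u * (X_minority[j] - x))
    return np.array(synthetic)                                 # N * |MI| synthetic samples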
Ensemble-based approach
The ensemble-based approach integrates methods of the algorithm-level or data-level approach with an ensemble classifier algorithm to solve ID.
Ensemble models consist of a group of similar classifiers that work together to enhance decision-making by utilizing the combined strength of multiple sub-classifiers. The performance of an ensemble classifier hinges on the effectiveness and diversity of its sub-classifiers (Fernández et al., 2018). Research shows that ensemble classifiers generally outperform single classifiers in various performance metrics (Galar et al., 2011). For a deeper understanding of ensemble classifiers, refer to Subsection 3.1.2.2.
2.3.3.1 Integration of algorithm-level method and ensemble classifier algorithm
The cost-sensitive ensemble is a widely recognized method that integrates an ensemble learning algorithm with the costs of misclassifying each class. Two common strategies are cost-sensitive Boosting and ensemble techniques that incorporate cost-sensitive learning.
Cost-sensitive Boosting keeps the foundational principles of traditional Boosting methods, such as AdaBoost, while integrating cost considerations into the weight-updating process. Notable contributions include Sun et al. (2007), Tong et al. (2022), and Zelenkov (2019), who articulated the motivations and claimed advantages of their methods. However, Nikolaou et al. (2016) showed that, without proper adjustments, the performance of cost-sensitive Boosting is comparable to that of the original Boosting algorithms.
Hence, Nikolaou et al. (2016) suggested applying the original AdaBoost algorithm due to its simplicity, flexibility, and effectiveness.
Ensemble methods incorporating cost-sensitive learning maintain the foundational structure of ensemble algorithms while assigning costs to different types of misclassification. Unlike cost-sensitive Boosting, this method is less adaptable because it is tailored to specific classifier algorithms. Notable contributors to this approach include Krawczyk, Woźniak, and Schaefer (2014), as well as Tao et al. (2019) and Xiao et al. (2020).
Integrating ensemble classifiers with cost-sensitive learning can sometimes yield better results than traditional ensemble and data-level methods. However, the cost-sensitive approach often encounters criticism regarding loss costs, which limits its popularity in practical applications.
2.3.3.2 Integration of data-level method and ensemble classifier algorithm
In the context of integrating a data-level approach, the training data for each sub-classifier in the ensemble are re-balanced using various re-sampling techniques. A base learner is then trained on each balanced dataset with the aim of improving performance.
Boosting algorithms, particularly AdaBoost and its variants, are effective for constructing classification models. Within this integration, re-sampling techniques are applied either at the start or at the end of each Boosting iteration; a simplified sketch of this idea is given after the list below.
• SMOTEBoost, developed by Chawla et al. in 2003, integrates the SMOTE technique with the Boosting procedure. Unlike standard Boosting, which assigns equal weights to all misclassified samples, SMOTEBoost generates synthetic samples from the minority class after each iteration, thereby adjusting the sample weights. This approach not only balances the dataset but also enriches the diversity of the training data, improving the learning process.
• RUSBoost (Seiffert, Khoshgoftaar, Van Hulse, & Napolitano, 2010) operates similarly to SMOTEBoost, but it randomly eliminates samples from the majority class at the beginning of each iteration.
• BalancedBoost (Wei, Sun, & Jing, 2014) combines over-sampling and under-sampling in each iteration. Furthermore, the re-sampling process is carried out according to the AdaBoost.M2 algorithm.
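To illustrate how re-sampling can be embedded in a Boosting loop, the sketch below combines RUS with AdaBoost-style weight updates (a rough simplification in the spirit of RUSBoost, not the authors' exact algorithms; label 1 is assumed to be the minority class):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rus_boost(X, y, n_rounds=10, random_state=0):
    """Train decision stumps on RUS subsets while updating AdaBoost sample weights."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    w = np.full(n, 1.0 / n)                          # AdaBoost sample weights
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    learners, alphas = [], []
    for _ in range(n_rounds):
        # RUS step: all minority samples plus an equal-size random majority subset.
        sub = np.concatenate([min_idx,
                              rng.choice(maj_idx, size=min_idx.size, replace=False)])
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X[sub], y[sub], sample_weight=w[sub])
        pred = stump.predict(X)
        err = np.clip(np.sum(w[pred != y]) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # Increase the weights of misclassified samples, decrease the others.
        w *= np.exp(alpha * np.where(pred != y, 1.0, -1.0))
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    def predict(X_new):
        votes = sum(a * (2 * m.predict(X_new) - 1) for a, m in zip(alphas, learners))
        return (votes > 0).astype(int)
    return predict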
Bagging-based methods utilize re-sampling techniques to modify the training-data distribution at each bootstrap step, aiming to balance and diversify the training data across the bags. This approach is generally simpler than Boosting-based methods, as it does not involve weight updates or alterations to the standard Bagging algorithm. Notable studies in this area have explored various strategies to enhance data balance and diversity within the bags.
OverBagging, introduced by Wang and Yao (2009), balances the data of each individual sub-classifier by applying Random Over-Sampling (ROS) per bag rather than to the entire dataset at the outset of training. Balanced data are generated in each bag in one of two ways: (i) by including the whole negative class and using ROS to increase the number of positive samples, or (ii) by including a bootstrap version of the negative class while applying ROS solely to the positive class. Notably, in OverBagging, every sample is guaranteed to appear in at least one bag.
SMOTEBagging, introduced by Wang and Yao in 2009, differs from OverBagging in its sampling strategy. In each iteration, the negative class is bootstrapped while the positive class is re-sampled with replacement at a specified proportion of its original size, starting at 10% and reaching 100% in the final bag. Subsequently, the SMOTE algorithm is applied to obtain a balanced dataset.
SMOTEBagging and OverBagging are over-sampling techniques that, like the Boosting-based approach, build ensemble classifiers but require longer computation times. Additionally, these Bagging-based methods may suffer from class overlap in the case of SMOTEBagging and from overfitting in the case of OverBagging.
UnderBagging, introduced by Barandela, Valdovinos, and Sánchez (2003), trains an ensemble on N balanced datasets derived from the original dataset, where N approximates the imbalance ratio (IR). The process begins by randomly dividing the majority class into N subsets, each matching the size of the minority class. Each balanced dataset is then formed by combining the minority class with one subset of the majority class. This parallel scheme allows the base learners to collectively use all samples from the majority class, thereby minimizing the loss of valuable information.
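A minimal sketch of the UnderBagging scheme described above (my own illustration; label 0 is assumed to be the majority class and decision trees serve as base learners):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def under_bagging(X, y, random_state=0):
    """Split the majority class into roughly IR subsets of minority size, pair
    each subset with the full minority class, and train one sub-classifier per pair."""
    rng = np.random.default_rng(random_state)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = rng.permutation(np.flatnonzero(y == 0))
    n_bags = max(1, maj_idx.size // min_idx.size)      # roughly the imbalance ratio
    models = []
    for chunk in np.array_split(maj_idx, n_bags):
        bag = np.concatenate([min_idx, chunk])
        models.append(DecisionTreeClassifier(random_state=0).fit(X[bag], y[bag]))
    def predict(X_new):
        # Majority vote over the sub-classifiers.
        votes = np.mean([m.predict(X_new) for m in models], axis=0)
        return (votes >= 0.5).astype(int)
    return predict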
ClusteringBagging, introduced by Wang, Xu, and Zhou in 2015, functions like UnderBagging but applies a clustering technique to the majority class to create K clusters of varying sizes. Each of these clusters is then combined with a bootstrapped resample of the minority class, resulting in a balanced dataset for each sub-classifier.
UnderBagging and ClusteringBagging are under-sampling techniques that are more computationally efficient than the over-sampling variants. They provide the ensemble with diverse inputs, but each sub-classifier uses only a portion of the original dataset. Consequently, this limited data usage may produce weak sub-classifiers and ultimately reduce the overall performance of the ensemble classifier.
2.3.3.3 Comments on ensemble-based approach
Conclusions of approaches to imbalanced data
Addressing imbalanced data is a significant challenge in classification tasks. Three prominent approaches tackle this issue: algorithm-level methods, data-level techniques, and ensemble-based strategies, each with its own strengths and weaknesses. These approaches are summarized in Figure 2.9.
The algorithm-level approach enhances classification algorithms by adjusting specific parameters, altering decision thresholds, or employing the CSL method. While this approach can be effective in certain scenarios, it has notable limitations. Primarily, it is often confined to specific algorithms, making it less applicable to diverse datasets and necessitating substantial customization for optimal performance. Additionally, many algorithm-level techniques are complex, obscuring how predictions are generated and complicating the identification of biases or errors. Moreover, some methods require extensive computation time and significant resources, which can hinder their practicality in real-world applications.
A data-level approach addresses class imbalance by under-sampling the majority class or over-sampling the minority class, effectively balancing the dataset. This approach is considered more flexible and straightforward than algorithm-level techniques.
Figure 2.9: Approaches to imbalanced data in classification
While data-level approaches can be effective, they have notable limitations. Under-sampling may lose critical information from the majority class, whereas over-sampling can lead to model overfitting. Additionally, the success of these methods depends on the chosen sampling technique, and there is no one-size-fits-all solution applicable to every dataset. Selecting an inappropriate sampling method can adversely affect model performance, so these drawbacks must be weighed when determining the best strategy for tackling imbalanced data.
The ensemble-based approach integrates an ensemble algorithm with other methodologies and is often highly effective. However, it risks overfitting if the sub-classifiers lack diversity and may increase computation time, particularly with Boosting-based ensembles. Additionally, interpreting the influence of inputs on the ensemble classifier's output can be challenging, as the final prediction results from the combination of several sub-classifiers.
While various techniques can enhance classifier performance on imbalanced data, there is no universal solution. The selection of an appropriate method depends on factors such as the specific problem, dataset size, imbalance ratio, and targeted performance metrics. Ultimately, effectively handling imbalanced data in classification remains an active research area, and choosing the right approach is essential for developing an accurate and robust model.