
Imbalanced Data in classification: A case study of credit scoring


DOCUMENT INFORMATION

Basic information

Title: Imbalanced Data in Classification: A Case Study of Credit Scoring
Author: Bui Thi Thien My
Supervisors: Assoc. Prof. Dr. Le Xuan Truong, Dr. Ta Quoc Bao
University: University of Economics Ho Chi Minh City
Major: Statistics
Type: Doctoral Dissertation
Year: 2024
City: Ho Chi Minh City
Number of pages: 178
File size: 0.91 MB

Structure

  • 1.1 Overview of imbalanced data in classification
  • 1.2 Motivations
  • 1.3 Research gap identifications
    • 1.3.1 Gaps in credit scoring
    • 1.3.2 Gaps in the approaches to solving imbalanced data
    • 1.3.3 Gaps in Logistic regression with imbalanced data
  • 1.4 Research objectives, research subjects, and research scopes
    • 1.4.1 Research objectives
    • 1.4.2 Research subjects
    • 1.4.3 Research scopes
  • 1.5 Research data and research methods
    • 1.5.1 Research data
    • 1.5.2 Research methods
  • 1.6 Contributions of the dissertation
  • 1.7 Dissertation outline
  • 2.1 Imbalanced data in classification
    • 2.1.1 Description of imbalanced data
    • 2.1.2 Obstacles in imbalanced classification
    • 2.1.3 Categories of imbalanced data
  • 2.2 Performance measures for imbalanced data
    • 2.2.1 Performance measures for labeled outputs
      • 2.2.1.1 Single metrics
      • 2.2.1.2 Complex metrics
    • 2.2.2 Performance measures for scored outputs
      • 2.2.2.1 Area under the Receiver Operating Characteristics Curve
      • 2.2.2.2 Kolmogorov-Smirnov statistic
      • 2.2.2.3 H-measure
    • 2.2.3 Conclusion of performance measures in imbalanced classification
  • 2.3 Approaches to imbalanced classification
    • 2.3.1 Algorithm-level approach
      • 2.3.1.1 Modifying the current classifier algorithms
      • 2.3.1.2 Cost-sensitive learning
      • 2.3.1.3 Comments on algorithm-level approach
    • 2.3.2 Data-level approach
      • 2.3.2.1 Under-sampling method
      • 2.3.2.2 Over-sampling method
      • 2.3.2.3 Hybrid method
      • 2.3.2.4 Comments on data-level approach
    • 2.3.3 Ensemble-based approach
      • 2.3.3.1 Integration of algorithm-level method and ensemble classifier algorithm
      • 2.3.3.3 Comments on ensemble-based approach
    • 2.3.4 Conclusions of approaches to imbalanced data
  • 2.4 Credit scoring
    • 2.4.1 Meaning of credit scoring
    • 2.4.2 Inputs for credit scoring models
    • 2.4.3 Interpretability of credit scoring models
    • 2.4.4 Approaches to imbalanced data in credit scoring
    • 2.4.5 Recent credit scoring ensemble models
  • 2.5 Chapter summary
  • 3.1 Classifiers for credit scoring
    • 3.1.1 Single classifiers
      • 3.1.1.1 Discriminant analysis
      • 3.1.1.2 K-nearest neighbors
      • 3.1.1.3 Logistic regression
      • 3.1.1.4 Lasso-Logistic regression
      • 3.1.1.5 Decision tree
      • 3.1.1.6 Support vector machine
      • 3.1.1.7 Artificial neural network
    • 3.1.2 Ensemble classifiers
      • 3.1.2.1 Heterogeneous ensemble classifiers
      • 3.1.2.2 Homogeneous ensemble classifiers
    • 3.1.3 Conclusions of statistical models for credit scoring
  • 3.2 The proposed credit scoring ensemble model based on Decision tree
    • 3.2.1 The proposed algorithms
      • 3.2.1.1 Algorithm for balancing data - OUS(B) algorithm
      • 3.2.1.2 Algorithm for constructing ensemble classifier - DTE(B) algorithm
    • 3.2.2 Empirical data sets
    • 3.2.3 Computation process
    • 3.2.4 Empirical results
      • 3.2.4.1 The optimal Decision tree ensemble classifier
      • 3.2.4.2 Performance of the proposed model on the Vietnamese data sets
      • 3.2.4.3 Performance of the proposed model on the public data sets
      • 3.2.4.4 Evaluations
    • 3.2.5 Conclusions of the proposed credit scoring ensemble model based on Decision tree
  • 3.3 The proposed algorithm for imbalanced and overlapping data
    • 3.3.1 The proposed algorithms
      • 3.3.1.1 Algorithm for dealing with noise, overlapping, and imbalanced data
      • 3.3.1.2 Algorithm for constructing ensemble model
    • 3.3.2 Empirical data sets
    • 3.3.3 Computation process
      • 3.3.3.1 Computation protocol of the Lasso Logistic ensemble
      • 3.3.3.2 Computation protocol of the Decision tree ensemble
    • 3.3.4 Empirical results
      • 3.3.4.1 The optimal ensemble classifier
      • 3.3.4.2 Performance of LLE(B)
      • 3.3.4.3 Performance of DTE(B)
    • 3.3.5 Conclusions of the proposed technique
  • 3.4 Chapter summary
  • A modification of Logistic regression with imbalanced data
    • 4.1 Introduction
    • 4.2 Related works
      • 4.2.1 Prior correction
      • 4.2.2 Weighted likelihood estimation (WLE)
      • 4.2.3 Penalized likelihood regression (PLR)
    • 4.3 The proposed works
      • 4.3.1 The modification of the cross-validation procedure
      • 4.3.2 The modification of Logistic regression
    • 4.4 Empirical study
      • 4.4.1 Empirical data sets
      • 4.4.2 Performance measures
      • 4.4.3 Computation process
      • 4.4.4 Empirical results
      • 4.4.5 Statistical test
      • 4.4.6 Important variables for output
        • 4.4.6.1 Important variables for F-LLR fitted model
        • 4.4.6.2 Important variables of the Vietnamese data set
    • 4.5 Discussions and Conclusions
      • 4.5.1 Discussions
      • 4.5.2 Conclusions
    • 4.6 Chapter summary
    • 5.1 Summary of contributions
      • 5.1.1 The interpretable credit scoring ensemble classifier
      • 5.1.2 The technique for imbalanced data, noise, and overlapping samples
      • 5.1.3 The modification of Logistic regression
    • 5.2 Implications
    • 5.3 Limitations and suggestions for further research
    • C.1 German credit data set (GER)
    • C.2 Vietnamese 1 data set (VN1)
    • C.3 Vietnamese 2 data set (VN2)
    • C.4 Taiwanese credit data set (TAI)
    • C.5 Bank personal loan data set (BANK)
    • C.6 Hepatitis C patients data set (HEPA)
    • C.7 The Loan schema data from lending club (US)
    • C.8 Vietnamese 3 data set (VN3)
    • C.9 Australian credit data set (AUS)
    • C.10 Credit risk data set (Credit1)
    • C.11 Credit card data set (Credit2)
    • C.12 Credit default data set (Credit3)
    • C.13 Vietnamese 4 data set (VN4)
      • 2.1 Examples of circumstances of imbalanced data
      • 2.2 Illustration of ROCs
      • 2.3 Illustration of KS metric
      • 2.4 Illustration of RUS technique
      • 2.5 Illustration of CNN rule
      • 2.6 Illustration of Tomek-links
      • 2.7 Illustration of ROS technique
      • 2.8 Illustration of SMOTE technique
      • 2.9 Approaches to imbalanced data in classification
      • 3.1 Illustration of a Decision tree
      • 3.2 Illustration of a decision boundary of SVM
      • 3.3 Illustration of a two-hidden-layer ANN
      • 3.4 Importance level of features of the Vietnamese data sets
      • 3.5 Computation protocol of the proposed ensemble classifier
      • 4.1 Illustration of F-CV
      • 4.2 Illustration of F-LLR
      • 1.1 General implementation protocol in the dissertation
      • 2.1 Confusion matrix
      • 2.2 Representatives employing the algorithm-level approach to ID
      • 2.3 Cost matrix in Cost-sensitive learning
      • 2.4 Summary of SMOTE algorithm
      • 2.5 Representatives employing the data-level approach to ID
      • 2.6 Representatives employing the ensemble-based approach to ID
      • 3.1 Representatives of classifiers in credit scoring
      • 3.2 OUS(B) algorithm
      • 3.3 DTE(B) algorithm
      • 3.4 Description of empirical data sets
      • 3.5 Computation protocol of empirical study on DTE
      • 3.6 Performance measures of DTE(B) on the Vietnamese data sets
      • 3.7 Performance of ensemble classifiers on the Vietnamese data sets
      • 3.8 Performance of ensemble classifiers on the German data set
      • 3.9 Performance of ensemble classifiers on the Taiwanese data
      • 3.11 TOUS-F(B) algorithm
      • 3.12 Description of empirical data sets
      • 3.13 Average testing AUC of the proposed ensembles
      • 3.14 Average testing AUC of the models based on LLR
      • 3.15 Average testing AUC of the ensemble classifiers based on tree
      • 4.1 Cross-validation procedure for Lasso Logistic regression
      • 4.2 F-measure-oriented Cross-Validation Procedure
      • 4.3 Algorithm for F-LLR classifier
      • 4.4 Description of empirical data sets
      • 4.5 Implementation protocol of empirical study
      • 4.6 Average testing performance measures of classifiers
      • 4.7 Average testing performance measures of classifiers (cont.)
      • 4.8 The number of wins of F-LLR on empirical data sets
      • 4.9 Important features of the Vietnamese data set
      • 4.10 Important features of the Vietnamese data set (cont.)
    • B.1 Algorithm of Bagging classifier
    • B.2 Algorithm of Random Forest
    • B.3 Algorithm of AdaBoost
    • C.1 Summary of the German credit data set
    • C.2 Summary of the Vietnamese 1 data set
    • C.3 Summary of the Vietnamese 2 data set
    • C.4 Summary of the Taiwanese credit data set (a)
    • C.5 Summary of the Taiwanese credit data set (b)
    • C.6 Summary of the Bank personal loan data set
    • C.7 Summary of the Hepatitis C patients data set
    • C.8 Summary of the Loan schema data from lending club (a)
    • C.9 Summary of the Loan schema data from lending club (b)
    • C.10 Summary of the Loan schema data from lending club (c)
    • C.11 Summary of the Vietnamese 3 data set
    • C.12 Summary of the Australian credit data set
    • C.13 Summary of the Credit 1 data set
    • C.14 Summary of the Credit 2 data set
    • C.15 Summary of the Credit 3 data set
    • C.16 Summary of the Vietnamese 4 data set

Content

Imbalanced Data in classification: A case study of credit scoring

Overview of imbalanced data in classification

Nowadays, classification plays a crucial role in several fields, for example, medicine (cancer diagnosis), finance (fraud detection), business administration (customer churn prediction), information retrieval (oil spill tracking, telecommunication fraud), image identification (face recognition), and so on. Classification is the problem of predicting a class label for a given sample. On training data sets that comprise samples with different label types, classification algorithms learn samples' features to recognize the labels' patterns. These patterns, now represented as a fitted classification model, are then used to predict the labels of new samples.

Classification is categorized into two types: binary and multi-classification. Binary classification, the basic type, focuses on two-class label problems. In contrast, multi-classification solves tasks with several class labels. Multi-classification is sometimes reduced to binary classification with two classes: one class corresponding to the label of concern, and the other representing the remaining labels. In binary classification, data sets are partitioned into positive and negative classes. The positive class is the class of interest, which has to be identified in the classification task. In this dissertation, we focus on binary classification. For convenience, we define some concepts as follows.

Definition 1.1.1. A data set with $k$ input features for binary classification is the set of samples $S = X \times Y$, where $X \subset \mathbb{R}^k$ is the domain of samples' features and $Y = \{0, 1\}$ is the set of labels.

The subset of samples labeled 1 is called the positive class, denoted $S^+$. The remaining subset is called the negative class, denoted $S^-$. A sample $s \in S^+$ is called a positive sample; otherwise, it is called a negative sample.

Definition 1.1.2. A binary classifier is a function mapping the domain of features $X$ to the set of labels $\{0, 1\}$.

Definition 1.1.3. Consider a data set $S$ and a classifier $f: X \to \{0, 1\}$. For a given sample $s_0 = (x_0, y_0) \in S$, there are four possibilities:

• If $f(s_0) = y_0 = 1$, $s_0$ is called a true positive sample.

• If $f(s_0) = y_0 = 0$, $s_0$ is called a true negative sample.

• If $f(s_0) = 1$ and $y_0 = 0$, $s_0$ is called a false positive sample.

• If $f(s_0) = 0$ and $y_0 = 1$, $s_0$ is called a false negative sample.

The numbers of true positive, true negative, false positive, and false negative samples are denoted TP, TN, FP, and FN, respectively.

Some popular criteria used to evaluate the performance of a classifier are accuracy, true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR).

$$\text{Accuracy} = \frac{TP+TN}{TP+FP+TN+FN}; \quad TPR = \frac{TP}{TP+FN}; \quad TNR = \frac{TN}{TN+FP};$$

$$FPR = \frac{FP}{FP+TN}; \quad FNR = \frac{FN}{FN+TP}.$$
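As a minimal illustration of these single metrics (not from the dissertation; it assumes scikit-learn and a pair of hypothetical label vectors y_true, y_pred), all five can be read off the confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive (class of interest), 0 = negative
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 1, 1, 0, 0])

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels ordered (0, 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + fp + tn + fn)
tpr = tp / (tp + fn)   # recall / sensitivity
tnr = tn / (tn + fp)   # specificity
fpr = fp / (fp + tn)
fnr = fn / (fn + tp)
print(accuracy, tpr, tnr, fpr, fnr)
```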

In many application domains where the positive and negative classes are balanced, accuracy is the first target of classifiers. However, the class of interest (the positive class) sometimes consists of unusual or rare events. The number of samples in the positive class is then too small for classifiers to recognize the positive patterns. In such situations, if classifiers make mistakes in the positive class, the cost of loss will be very heavy. Therefore, accuracy is no longer the most important performance criterion; instead, criteria related to TP, such as the TPR, matter more.

For example, in fraud detection, the customers are divided into “bad” and “good” classes. Since the credit regulations are made public and the customers have preliminarily been screened before applying for a loan, a credit data set often includes a majority class of good customers and a minority class of the bad. The loss of misclassifying the “bad” into “good” is often far greater than the loss of misclassifying the “good” into “bad”. Hence, identifying the bad customers is often considered the more crucial task. Consider a list of credit customers consisting of 95% good and 5% bad. If pursuing a high accuracy, we can choose a trivial classifier mapping all customers to the good label. The accuracy of this classifier is 95%, but its TPR is 0%; in other words, this classifier is unable to identify any bad customers. Instead, another classifier with a lower accuracy but greater TPR should be considered to replace this trivial classifier.

Another example of rare-class classification is cancer diagnosis. In this case, the data set has two classes, “malignant” and “benign”. The number of malignant patients is always much smaller than that of benign ones. However, malignancy is the first target of any cancer diagnosis process because of the heavy consequences of missing cancer patients. Therefore, it is unreasonable to rely on the accuracy metric to evaluate the performance of cancer diagnosis classifiers.

Definition 1.1.4. Let $S = S^+ \cup S^-$ be the data set, where $S^+$ and $S^-$ are the positive and negative classes, respectively. If the quantity of $S^+$ is far less than that of $S^-$, $S$ is called an imbalanced data set. Besides, the imbalanced ratio (IR) of $S$ is defined as the ratio of the quantities of the negative and positive classes:

$$IR = \frac{|S^-|}{|S^+|}.$$
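As a small illustration of Definition 1.1.4 (not from the dissertation; the label vector is hypothetical), the IR can be computed directly from the labels:

```python
import numpy as np

y = np.array([0] * 95 + [1] * 5)  # 95 negative (good), 5 positive (bad) samples
ir = np.sum(y == 0) / np.sum(y == 1)
print(ir)  # 19.0: a highly imbalanced data set
```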

Motivations

When a training data set is imbalanced, simple classifiers usually have a very high accuracy but low TPR. These classifiers aim to maximize the accuracy (sometimes called global accuracy), thus equating the losses caused by type I and type II errors (Shen, Zhao, Li, Li, & Meng, 2019). Therefore, the classification results are often biased toward the majority class (the negative class) (Galar, Fernandez, Barrenechea, Bustince, & Herrera, 2011; Haixiang et al., 2017). In the case of a rather high imbalanced ratio, the minority class (the positive class) is usually ignored since common classifiers often treat it as noise or outliers. Hence, the target of recognizing the patterns of the positive class fails, although identifying the positive samples is often the crucial task of imbalanced classification. Therefore, imbalanced data is a challenge in classification.

Besides, experimental studies showed that as the imbalanced ratio increased, the overall model performance decreased (Brown & Mues, 2012). Furthermore, some authors stated that imbalanced data was not the only reason for poor performance: noise and overlapping samples also degraded the performance of learning methods (Batista, Prati, & Monard, 2004; Haixiang et al., 2017). Thus, researchers and practitioners should deeply understand the nature of data sets to handle them correctly.

A typical case study of imbalanced classification is credit scoring. This issue is reflected in the bad debt ratio of commercial banks. For example, in Vietnam, the on-balance-sheet bad debt ratio was 1.9% in 2021 and 1.7% in 2020. Besides, the gross bad debt ratio (including on-balance-sheet bad debt, unresolved bad debt sold to VAMC, and potential bad debt from restructuring) was 7.3% in 2021 and 5.1% in 2020.¹ Although bad customers account for a very small part of the credit customers, the consequences of bad debt for the banks are extremely heavy. In countries where most economic activities rely on the banking system, an increase in the bad debt ratio may not only threaten the execution of the banking system but also push the economy into a series of collapses. Therefore, it is important to identify the bad customers in credit scoring.

In Vietnam, the credit market is tightly controlled by regulations of the State Bank. Commercial banks now consciously manage credit risk by strictly applying credit appraisal processes before funding. In academic research, credit scoring has attracted many authors (Bình & Anh, 2021; Hưng & Trang, 2018; Quỳnh, Anh, & Linh, 2018; Thắng, 2022). However, few works have addressed the imbalanced issue (Mỹ, 2021).

¹ https://sbv.gov.vn/webcenter/portal/vi/links/cm255?dDocName=SBV489213

These facts prompted us to study imbalanced classification deeply. The dissertation, titled “Imbalanced data in classification: A case study of credit scoring”, aims to find suitable solutions for imbalanced data and related issues, with a case study of credit scoring in Vietnam.

Research gap identifications

Gaps in credit scoring

Credit scoring is an arithmetical representation based on the analysis of the creditworthiness of customers (Louzada, Ara, & Fernandes, 2016). Credit scoring provides valuable information to banks and financial institutions, not only to hedge credit risk but also to standardize regulations on credit management. Therefore, credit-scoring classifiers have to meet two significant requirements:

i) The ability to accurately classify the bad customers;

ii) The ability to easily explain the predicted results of the classifiers.

Over the two recent decades, the first requirement has been addressed through the development of methods to improve the performance of credit scoring models. They are traditional statistical methods (K-nearest neighbors, Discriminant analysis, and Logistic regression) and popular machine learning models (Decision tree, Artificial neural network, and Support vector machine) (Baesens et al., 2003; Brown & Mues, 2012; Louzada et al., 2016). Those are called single classifiers. The effectiveness of a single classifier is not similar across data sets. For example, some studies showed that Logistic regression outperformed Decision tree (Marqués, García, & Sánchez, 2012; Wang, Ma, Huang, & Xu, 2012), but another result concluded that Logistic regression worked worse than Decision tree (Bensic, Sarlija, & Zekic-Susac, 2005). Besides, while Baesens et al. (2003) found that Support vector machine was better than Logistic regression, Li et al. (2019) and Van Gestel et al. (2006) indicated that there was an insignificant difference among Support vector machine, Logistic regression, and Linear discriminant analysis. In summary, empirical credit scoring studies lead to the important conclusion that there is no best single classifier for all data sets.

With the development of computational software and programming languages, there has been a shift from single classifiers to ensemble ones. The term “ensemble classifier” or “ensemble model” refers to a collection of multiple classifier algorithms. Ensemble models work by leveraging the collective power of multiple sub-classifiers for decision-making. In the literature on credit scoring, empirical studies concluded that ensemble models had superior performance to single ones (Brown & Mues, 2012; Dastile et al., 2020; Lessmann, Baesens, Seow, & Thomas, 2015; Marqués et al., 2012). However, ensemble algorithms do not directly handle the imbalanced data issue.

While the second requirement of a credit scoring model often attracts less attention than the first, its role is equally important. It provides the reasons for the classification results, which is the framework for assessing, managing, and hedging credit risk. For example, nowadays, customers' features are collected into empirical data sets more and more diversely, but not all of them are useful for credit scoring. Administrators need to know which information from the classification model influences the likelihood of default in order to set transparent credit standards. There is usually a trade-off between the effectiveness and transparency of classifiers (Brown & Mues, 2012): as performance measures increase, explaining the predicted results becomes more difficult. For example, single classifiers such as Discriminant analysis, Logistic regression, and Decision tree are interpretable, but they usually work far less effectively than Support vector machine and Artificial neural network, which are representatives of “black box” classifiers. Another case is ensemble classifiers: most of them operate in an incomprehensible process although they have outstanding performance. Even for popular ensemble classifiers such as Bagging tree, Random forest, or AdaBoost, which do not have very complicated structures, their interpretability is not discussed. According to Dastile et al. (2020), in the credit scoring application, only 8% of studies proposed new models with a discussion of interpretability. Therefore, building a credit-scoring ensemble classifier that satisfies both requirements is an essential task.

In Vietnam, credit data sets usually suffer from imbalance, noise, and overlapping issues. Although the economy is under the influence of the digital transformation process and credit scoring models have developed rapidly, Vietnamese commercial banks have still applied traditional methods such as Logistic regression and Discriminant analysis. Some studies used machine learning methods such as Artificial neural network (Kiều, Diệp, Nga, & Nam, 2017; Nguyen & Nguyen, 2016; Thịnh & Nhâm, 2021), Random forest (Ha, Nguyen, & Nguyen, 2016), and ensemble models (Luu & Hung, 2021). The idea of these studies is to support the application of advanced methods in credit scoring, but they are not concerned with the imbalanced issue and interpretability. Very few studies dealt with the imbalance issue (Mỹ, 2021; Toàn, Lịch, Hương, & Thọ, 2017). However, these works only solved imbalanced data and ignored the noise and overlapping samples.

In summary, it is necessary to build a credit-scoring ensemble classifier that can tackle imbalanced data and other related issues, such as noise and overlapping samples, to raise the performance measures, especially on Vietnamese data sets. Furthermore, the proposed model should point out the important features for predicting the credit risk status.

Gaps in the approaches to solving imbalanced data

There are three popular approaches to imbalanced classification in the literature: algorithm-level, data-level, and ensemble-based approaches (Galar et al., 2011).

The algorithm-level approach solves imbalanced data by modifying the classifier algorithms to reduce the bias toward the majority class. This approach needs deep knowledge about the intrinsic classifiers, which users usually lack. In addition, designing specific corrections or modifications for the given classifier algorithms makes this approach not versatile. A representative of the algorithm-level approach is the Cost-sensitive learning method, which imposes or corrects the costs of loss upon misclassifications and requires the minimal total loss of the classification process (Xiao, Xie, He, & Jiang, 2012; Xiao et al., 2020). However, the values of the costs of losses are usually assigned by the researchers' intention. In short, the algorithm-level approach is inflexible and unwieldy.

The data-level approach re-balances training data sets by applying re-sampling techniques, which belong to three main groups: over-sampling, under-sampling, and the hybrid of over- and under-sampling. Over-sampling techniques increase the quantity of the minority class, while under-sampling techniques decrease that of the majority class. This approach is easy to implement and performs independently of the classifier algorithms. However, re-sampling techniques change the distribution of the training data set, which may lead to a poor classification model. For instance, random over-sampling techniques increase the computation time and may repeat the noise and overlapping samples, thus probably leading to an over-fitting classification model. Some hierarchical methods of over-sampling can cause other problems; for example, the Synthetic Minority Over-sampling Technique (SMOTE) can exacerbate the overlapping issue. In contrast, under-sampling techniques may miss useful information about the majority class, especially on severely imbalanced data (Baesens et al., 2003; Sun et al., 2009).
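As an illustration of the data-level approach (a minimal sketch, not from the dissertation; it assumes the imbalanced-learn and scikit-learn packages and synthetic data), the two basic re-sampling directions can be applied as follows:

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced data: roughly 95% negative, 5% positive
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
print(np.bincount(y))  # e.g. [1900, 100]

# Over-sampling: SMOTE synthesizes new minority samples by interpolation
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_over))  # balanced classes

# Under-sampling: randomly drop majority samples
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(np.bincount(y_under))  # balanced, but majority information is discarded
```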

The third is the ensemble-based approach, which integrates ensemble classifier algorithms with algorithm-level or data-level approaches. This approach exploits the advantage of ensemble classifiers to improve the performance criteria. The ensemble-based approach seems to be the trend in dealing with imbalanced data (Abdoli, Akbari, & Shahrabi, 2023; Shen, Zhao, Kou, & Alsaadi, 2021; Yang, Qiao, Huang, Wang, & Wang, 2021; Zhang, Yang, & Zhang, 2021). However, the ensemble-based approach often yields complex models whose results are too difficult to interpret. This is a concern that must be fully realized.

In summary, although there are many methods for imbalanced classification, each of them has some drawbacks. Some hybrid methods are complex and inaccessible. Moreover, very few studies deal with imbalance, noise, and overlapping samples together. With the available studies, on some data sets, the methods do not raise the performance measures as high as expected. Hence, the idea arises of a new algorithm that can deal with imbalance, noise, and overlapping to increase the performance measures on the positive class.

Gaps in Logistic regression with imbalanced data

Logistic regression (LR) is one of the most popular single classifiers, especially in credit scoring (Onay & Öztürk, 2018). LR can provide an understandable output, namely the conditional probability of belonging to the positive class. This probability is the reference for predicting the sample's label by comparing it with a given threshold: the sample is classified into the positive class if and only if its conditional probability is greater than this threshold.

This characteristic of LR can be extended to multi-classification. Besides, the computation process of LR, which employs the maximum likelihood estimator, is quite simple. It does not take much time since there are several available packages in software and programming languages. Furthermore, LR can show the impact of predictors on the output by evaluating the statistical significance level of the parameters corresponding to the predictors. In other words, LR provides an interpretable and affordable model.

However, LR is ineffective on imbalanced data sets (Firth, 1993; King & Zeng, 2001); specifically, the conditional probability of positive samples is underestimated. Therefore, the positive samples are likely to be misclassified. Besides, the statistical significance level of predictors is usually based on the parameter testing procedure, which uses the p-value criterion as a framework. Meanwhile, the p-value has recently been criticized in the statistical community because of its misunderstanding (Goodman, 2008). These issues limit the application fields of LR although it has several advantages.

There are multiple methods to deal with imbalanced data for LR, such as prior correction (Cramer, 2003; King & Zeng, 2001), weighted likelihood estimation (WLE) (Maalouf & Trafalis, 2011; Manski & Lerman, 1977; Ramalho & Ramalho, 2007), and penalized likelihood regression (PLR) (Firth, 1993; Greenland & Mansournia, 2015; Puhr, Heinze, Nold, Lusa, & Geroldinger, 2017). All of them are related to the algorithm-level approach, which requires much effort from the users. For example, prior correction and WLE need the ratio of the positive class in the population, which is usually unavailable in real-world applications. Besides, some methods of PLR are too sensitive to initial values in the computation process of the maximum likelihood estimation. Furthermore, some methods of PLR correct only the biased parameter estimates, not the biased conditional probability (Firth, 1993). A hybrid of these methods and re-sampling techniques has not been considered in the literature on LR with imbalanced data. Such hybrid methods could exploit the advantages of each individual method and directly solve imbalanced data for LR.
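For intuition, a weighting scheme in the spirit of WLE can be approximated with off-the-shelf tools (a minimal sketch, not the dissertation's method; it assumes scikit-learn, whose class_weight option re-weights the log-likelihood terms, and synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Plain LR: the minority (positive) class tends to be under-predicted
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Weighted LR: 'balanced' sets each class weight inversely proportional
# to its frequency, compensating for the bias toward the majority class
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

print("TPR plain:   ", recall_score(y_te, plain.predict(X_te)))
print("TPR weighted:", recall_score(y_te, weighted.predict(X_te)))
```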

In summary, LR for imbalanced data needs to be modified in its computation process by a combination of data-level and algorithm-level approaches. Such a modification can deal with imbalanced data and still retain the ability to provide the impacts of the predictors on the response without the p-value criterion.

Research objectives, research subjects, and research scopes

Research objectives

In this dissertation, we aim to achieve the following objectives.

The first objective is to propose a new ensemble classifier that satisfies the two key requirements of a credit-scoring model. This ensemble classifier is expected to outperform the traditional classification models and popular balancing methods, such as the Bagging tree, Random forest, and AdaBoost combined with random over-sampling (ROS), random under-sampling (RUS), SMOTE, and Adaptive synthetic sampling (ADASYN). Furthermore, the proposed model can identify the significance of input features in predicting the credit risk status.

The second objective is to propose a novel technique to address the challenges of imbalanced data, noise, and overlapping samples. This technique can leverage the strengths of re-sampling methods and ensemble models to tackle these critical issues in classification. Subsequently, this technique can be applied to credit scoring and other imbalanced classification applications, for example, medical diagnosis.

The third objective is to propose a modification of Logistic regression for imbalanced data. This modification directly targets the F-measure, which is commonly used to evaluate the performance of classifiers in imbalanced classification. The proposed work can compete with popular balancing methods for Logistic regression such as weighted likelihood estimation, penalized likelihood regression, and re-sampling techniques, including ROS, RUS, and SMOTE.

Research subjects

This dissertation investigates the phenomenon of imbalanced data and other related issues, such as noise and overlapping samples, in classification. We examine various balancing methods, encompassing algorithm-level, data-level, and ensemble-based approaches, in a case study of credit scoring. Within these approaches, data-level and ensemble-based methods receive more attention than algorithm-level ones. Additionally, Lasso-Logistic regression, which is a penalized version of Logistic regression, is studied in two application contexts: as the base learner of an ensemble classifier and as an individual classifier.

Research scopes

The dissertation focuses on binary classification problems for imbalanced data sets and their application in credit scoring. Interpretable classifiers, including Logistic regression, Lasso-Logistic regression, and Decision trees, are considered. To deal with imbalanced data, the dissertation concentrates on the data-level approach and the integration of data-level methods and ensemble classifier algorithms. Some popular re-sampling techniques, such as ROS, RUS, SMOTE, ADASYN, Tomek-link, and Neighborhood Cleaning Rule, are investigated in this study. In addition, popular performance criteria suitable for imbalanced classification, such as AUC (Area Under the Receiver Operating Characteristics Curve), KS (Kolmogorov-Smirnov statistic), F-measure, G-mean, and H-measure, are used to evaluate the effectiveness of the considered classifiers.

Research data and research methods

Research data

The case study of credit scoring uses six secondary data sets. Three of them are from the UCI machine learning repository: the German, Taiwanese, and Bank personal loan data sets. These data sets are very popular in studying credit scoring and are used as benchmarks in the literature. Besides, three private data sets are collected from commercial banks in Vietnam. All Vietnamese data sets are highly imbalanced at different levels. Furthermore, to justify the ability of the proposed works to improve the performance measures, the empirical study used one data set belonging to the medical field, the Hepatitis data. This data set is available on the UCI machine learning repository.

The case study of Logistic regression employs nine data sets. Four of them, the German, Taiwanese, Bank personal loan, and Hepatitis data sets, are also used in the case study of credit scoring. The others are easy to access through the Kaggle website and the UCI machine learning repository.

Research methods

The dissertation applies the quantitative research method to clarify the effectiveness of the proposed works: the credit scoring ensemble classifier, the algorithm for balanced and overlap-free data, and the modification of Logistic regression.

The general implementation protocol of the proposed works follows the steps in Table 1.1. This implementation protocol is applied in all computation processes in the dissertation; however, in each case, the content of Step 2 may vary in some ways. The computation processes are conducted in the programming language R, which has been widely used in the machine learning community.

Table 1.1: General implementation protocol in the dissertation

1. Proposing the new algorithm or new procedure.

2. Constructing the new model with different hyper-parameters to find the optimal model on the training data.

3. Constructing other popular models with existing balancing methods and classifier algorithms on the same training data.

4. Applying the optimal model and the other popular models to the same testing data, then calculating their performance measures.

5. Comparing the testing performance measures of the considered models.
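A schematic of Steps 2–5 (a minimal sketch, not the dissertation's R code; it assumes scikit-learn and uses a tuned Random forest as a stand-in for a proposed model) might look like:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Step 2: tune the candidate model's hyper-parameters on the training data
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [100, 300], "max_depth": [3, 6, None]},
                    scoring="roc_auc", cv=5).fit(X_tr, y_tr)

# Step 3: fit a popular baseline on the same training data
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Steps 4-5: evaluate both on the same testing data and compare
for name, model in [("tuned RF", grid.best_estimator_), ("LR baseline", baseline)]:
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```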

Contributions of the dissertation

The dissertation contributes three methods to the literature on credit scoring and imbalanced classification. The proposed methods were published in three articles:

(1) An interpretable decision tree ensemble model for imbalanced credit scoring datasets, Journal of Intelligent and Fuzzy Systems, Vol. 45, No. 6, 10853–.

(2) …, Studies in Systems, Decision and Control, Vol. 429, 595–612, 2022, Springer.

(3) A modification of Logistic regression with imbalanced data: F-measure-oriented Lasso-logistic regression, ScienceAsia, 49S, 68–77, 2023.

Regarding the literature on credit scoring, the dissertation suggests an interpretable ensemble classifier which can address imbalanced data. The proposed model, which uses Decision tree as the base learner, has more specific advantages than the popular approaches, such as higher performance measures and interpretability. The proposed model corresponds to the first article.

Regarding the literature on imbalanced data, the dissertation proposes a method for balancing, de-noising, and removing overlapping samples based on the ensemble-based approach. This method outperforms the integration of the re-sampling techniques (ROS, RUS, SMOTE, Tomek-link, and Neighborhood Cleaning Rule) with popular ensemble classifier algorithms (Bagging tree, Random forest, and AdaBoost). This work corresponds to the second article.

Regarding the literature on Logistic regression, the dissertation provides a modification to its computation process. The proposed work makes Logistic regression more effective than the existing methods for Logistic regression with imbalanced data and retains the ability to show the importance level of input features without using the p-value. This modification is in the third article.

Dissertation outline

The dissertation “Imbalanced data in classification: A case study of credit scoring” has five chapters.

• Chapter 1. Introduction.

• Chapter 2. Literature review of imbalanced data.

• Chapter 3. Imbalanced data in credit scoring.

• Chapter 4. A modification of Logistic regression with imbalanced data.

• Chapter 5. Conclusions.

Chapter 1 is the introduction, which briefly introduces the contents of the dissertation. This chapter presents the overview of imbalanced data in classification. Besides, the other contents are the motivations, research gap identifications, objectives, subjects, scopes, data, methods, contributions, and the dissertation outline.

Chapter 2 is the literature review on imbalanced data in classification. This chapter provides the definition, obstacles, and related issues of imbalanced data, for example, the overlapping classes. Besides, this chapter presents in depth the performance measures for imbalanced data. The most important section is the review of approaches to imbalanced data, including algorithm-level, data-level, and ensemble-based approaches. Chapter 2 also examines the basic background and recently proposed works of credit scoring. The detailed discussion of previous studies clarifies the pros and cons of existing balancing methods, which is the framework for developing the new balancing methods in the dissertation.

Chapter 3 is the case study of imbalanced classification in credit scoring. This chapter is based on the main contents of the first and second articles referred to in Section 1.6. We propose an ensemble classifier that can address imbalanced data and provide the importance level of predictors. Furthermore, we extend the algorithm of this credit-scoring ensemble classifier to handle overlapping and noise before dealing with imbalanced data. Empirical studies are conducted to verify the effectiveness of the proposed algorithms.

Chapter 4 is another study on imbalanced data, which is related to Logistic regression. This chapter proposes a modification of the inner and outer parts of the computation process of Logistic regression. The inner part is a change in the performance criterion used to estimate the score, and the outer part is a selective application of re-sampling techniques to re-balance the training data. Experiments are conducted on nine data sets to verify the performance of the modification. Chapter 4 corresponds to the third article referred to in Section 1.6.

Chapter 5 is the conclusion, which summarizes the dissertation, discusses the implications of the proposed works, and refers to some further studies.

Imbalanced data in classification

Description of imbalanced data

According to Definition 1.1.4, any data set with a skewed quantity of samples in two classes is technically imbalanced data (ID). In other words, any two-class data set with an imbalanced ratio (IR) greater than one is considered ID. There is no conventional definition of the IR threshold at which a data set is concluded to be imbalanced. Most authors simply define ID as a data set in which one class has a much greater (or lower) number of samples than the other (Brown & Mues, 2012; Haixiang et al., 2017). Other authors assess a data set as imbalanced if the class of interest has significantly fewer samples than the other and ordinary classifier algorithms encounter difficulty in distinguishing the two classes (Galar et al., 2011; López, Fernández, García, Palade, & Herrera, 2013; Sun, Wong, & Kamel, 2009). Therefore, a data set is considered ID when its IR is greater than one and most samples of the minority class cannot be identified by standard classifiers.

Obstacles in imbalanced classification

In ID, the minority class is usually misclassified since there is too little information about its patterns. Besides, standard classifier algorithms often operate according to the rule of maximizing the accuracy metric. Hence, the classification results are usually biased toward the majority class to get the highest global accuracy, with very low accuracy for the minority class. On the other hand, the patterns of the minority class are often specific, especially in extreme ID, which leads to the ignorance of minority samples (they may be treated as noise) in favor of the more general patterns of the majority class. As a consequence, the minority class, which is the object of interest in the classification process, is usually misclassified in ID.

The above analyses are also supported by empirical studies. Brown and Mues (2012) concluded that the higher the IR, the lower the performance of classifiers. Furthermore, Prati, Batista, and Silva (2015) found that the expected performance loss, which was the proportion of the performance difference between ID and the balanced data, became significant when the IR was 90/10 or greater. Prati et al. also pointed out that the performance loss tended to increase quickly for higher values of IR.

In short, IR is the factor that reduces the effectiveness of standard classifiers.

Categories of imbalanced data

In real applications, combinations of ID and other phenomena make classification processes more difficult. Some authors even claim that ID is not the only main reason for poor performance: overlapping, small sample size, small disjuncts, and borderline, rare, and outlier samples are also causes of the low effectiveness of popular classifier algorithms (Batista et al., 2004; Fernández et al., 2018; Napierala & Stefanowski, 2016; Sun et al., 2009).

• Overlapping or class separability (Fig. 2.1b) is the phenomenon of an unclear decision boundary between two classes. It also means that some samples of the two classes are blended. On data sets with overlapping, standard classifier algorithms such as Decision tree, Support vector machine, or K-nearest neighbors become harder to perform. Batista et al. (2004) stated that the IR was less important than the degree of overlap between classes; Fernández et al. (2018) reached a similar conclusion.

• Small sample size: Learning algorithms need a sufficient number of samples to generalize the rule to discriminate classes. Without large training sets, a classifier not only fails to generalize the characteristics of the data but can also produce an over-fitting model (Cui, Davis, Cheng, & Bai, 2004; Wasikowski & Chen, 2009). On imbalanced and small data sets, the lack of information about the positive class becomes more serious. Krawczyk and Woźniak (2015) stated that, with the IR fixed, the more samples of the minority class, the lower the error rate of classifiers.

(Figure 2.1: Examples of circumstances of imbalanced data. Source: Galar et al., 2011.)

• Small disjuncts (Fig. 2.1c): This problem occurs when the minority class consists of several sub-spaces in the feature space. Therefore, small disjuncts provide classifiers with a smaller number of positive samples than large disjuncts. In other words, small disjuncts cover rare samples that are too hard to find in the data sets, and learning algorithms often ignore rare samples when setting the general classification rules. This leads to a higher error rate on small disjuncts (Prati, Batista, & Monard, 2004; Weiss, 2009).

• The characteristics of positive samples, such as borderline, rare, and outlier samples, affect the performance of standard classifiers. The fact is that borderline samples are always too difficult to recognize. In addition, rare and outlier samples are extremely hard to identify. According to Napierala and Stefanowski (2016) and Van Hulse and Khoshgoftaar (2009), an imbalanced data set with many borderline, rare, or outlier samples makes standard classifiers less efficient.

In summary, studying ID should pay attention to the related issues such as overlapping, small sample size, small disjuncts, and the characteristics of the positive samples.

Performance measures for imbalanced data

Performance measures for labeled outputs

Most learning algorithms produce labeled outputs, for example, K-nearest neighbors, Decision tree, ensemble classifiers based on Decision tree, and so on. A convenient way to present the performance of labeled-output classifiers is a cross-tabulation between actual and predicted labels, known as the confusion matrix.

Table 2.1: Confusion matrix

|                 | Predicted positive | Predicted negative | Total |
|-----------------|--------------------|--------------------|-------|
| Actual positive | TP                 | FN                 | POS   |
| Actual negative | FP                 | TN                 | NEG   |
| Total           | PPOS               | PNEG               | N     |

In Table 2.1, TP, FP, FN, and TN follow Definition 1.1.3. Besides, POS and NEG are the numbers of actual positive and negative samples in the training data, respectively; PPOS and PNEG are the numbers of predicted positive and negative samples, respectively; N is the total number of samples.

From the confusion matrix, several metrics are built to provide a framework for analyzing many aspects of a classifier. These metrics can be divided into two types: single and complex metrics.

The most popular single metric is accuracy or its complement, error rate. Accuracy is the proportion of correct outputs, and error rate is the proportion of incorrect ones. Therefore, the higher the accuracy (or the lower the error rate), the better the classifier.

Although accuracy and error rate are easy to calculate and to interpret, they may mislead the performance evaluation of a classifier in the case of ID. Firstly, on an imbalanced data set with a very high IR, standard classifiers often get a very high accuracy and a low error rate even when the number of positive samples classified correctly is small, despite their crucial role in the classification task. Secondly, the error rate weighs the cost of misclassifying the positive class and the negative class equally, whereas in ID, the misclassification of a positive sample is often more costly than that of a negative one. Therefore, imbalanced classification studies use single metrics that focus on a specific class, such as TPR (or recall), FPR, TNR, FNR, and precision.

TPR is the proportion of the positive samples classified correctly. Other names for TPR are recall and sensitivity.

FPR is the proportion of the negative samples classified incorrectly.

TNR (or specificity) and FNR are the complements of FPR and TPR, respectively.

Precision is the proportion of actual positive samples among the predicted positive class, that is, $\text{Precision} = \frac{TP}{TP+FP}$.

Among these metrics, accuracy, TPR, TNR, and precision are expected to be as high as possible, while FPR and FNR are the opposite. In many applications, some specific metrics may be prioritized. For instance, in imbalanced classification, instead of accuracy, TPR is the most favored metric because of the importance of the positive class. However, in credit scoring and cancer diagnosis, if one only focuses on the TPR and ignores the FPR, a trivial classifier can assign all samples the positive label; in other words, such a classifier cannot identify any negative samples, which causes a loss that is not small. Hence, high values of both precision and recall are preferred in these circumstances. In summary, each single performance metric has its meaning, and the choice of metrics depends on the application field.

The single metrics seem not to provide enough information to evaluate the performance of a classifier, especially in ID. This leads to combinations of the above single metrics. F-measure is one of the most popular complex metrics. F-measure expresses the precision and recall trade-off by their weighted harmonic mean, following the formula:

$$F_\beta = \frac{(1+\beta^2)\,\text{Precision} \cdot \text{Recall}}{\beta^2\,\text{Precision} + \text{Recall}} = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + FP + \beta^2\,FN} \qquad (2.8)$$

where $\beta$ is the positive parameter controlling the relative significance of FP and FN. The parameter $\beta$ is set greater than 1 if and only if FN is of more concern than FP. $F_1$ is the special case of $F_\beta$ where the importance of the precision and recall metrics is equal; equivalently, the roles of FP and FN are the same in $F_1$. Sometimes, F-measure is the name for $F_1$ unless there are specific comments.

$$F_1 = \frac{2\,\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN} \qquad (2.9)$$

The maximum value of $F_1$ is 1. According to formula (2.9), the value of $F_1$ is high if and only if both precision and recall are high. In applications, $F_1$ is usually chosen in cancer diagnosis or credit scoring (Abdoli et al., 2023; Akay, 2009; Chen, Li, Xu, Meng, & Cao, 2020).

Another metric is G-mean, which uses the geometric mean of TPR and TNR. The formula for G-mean is shown in (2.10). G-mean collects information about both the positive and negative classes, not only the positive class as F-measure does.

$$\text{G-mean} = \sqrt{TPR \times TNR} \qquad (2.10)$$

G-mean is high if and only if TPR and TNR are high. The most ideal value of G-mean is 1.
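A small sketch computing these complex metrics from predicted labels (not from the dissertation; it assumes scikit-learn and the same hypothetical label vectors as above):

```python
import numpy as np
from sklearn.metrics import fbeta_score, f1_score, confusion_matrix

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 1, 1, 0, 0])

# F1 and an F-beta that penalizes false negatives more (beta = 2)
print("F1:", f1_score(y_true, y_pred))
print("F2:", fbeta_score(y_true, y_pred, beta=2))

# G-mean = sqrt(TPR * TNR), computed from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
g_mean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
print("G-mean:", g_mean)
```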

Performance measures for scored outputs

Besides labeled-output classifiers, several classifiers produce scored outputs that express the likelihood of belonging to each class, for instance, Logistic regression. Usually, high-scored samples are predicted to have positive labels. Generally, the scored outputs are transformed into labeled ones by being compared with a given threshold. If the target of the classification is to restrict the error of predicting the positive class, a low threshold will be assigned; that will introduce a high TPR together with a high FPR. Otherwise, high thresholds will reduce the FPR but raise the FNR. In short, choosing a threshold for a scored-output classifier depends on which performance metrics one aims to optimize.

When transformed to labeled outputs, samples with the same labels are treated equally although their likelihoods of belonging to the positive class may be very different. The Receiver Operating Characteristics Curve (ROC), the Area under the Receiver Operating Characteristics Curve (AUC), the Kolmogorov-Smirnov statistic (KS), and the H-measure are the popular threshold-free measures for evaluating the performance of scored classifiers without changing the type of outputs. These metrics, which are considered overall (general) performance metrics, are also widely used in imbalanced classification studies.

The Receiver Operating Characteristics Curve (ROC) is a graph showing the relationship between FPR and TPR over all possible thresholds. ROC is plotted on the two-dimensional plane with the x-axis and y-axis representing FPR and TPR, respectively. A ROC is expected to hug the top left corner, since there the classifier introduces high TPRs and low FPRs. In the unit square, the ROC of a classifier must be above the diagonal, which corresponds to the ROC of a random classifier.

Figure 2.2 illustrates the ROCs of three classifiers and a random one. In this figure, all classifiers have better overall performance than the random one since all three curves are above the red diagonal. Besides, the first and second curves are above the third. This means the third shows the worst performance since, at the same FPR, it always offers a lower TPR than the first and the second. However, we cannot compare the overall performance of the first with the second in this way. A natural way is to compare the area under the ROC curves (AUCROC), which is bounded by the ROC curve and the two axes. The greater the AUCROC, the better the classifier. Conveniently, AUCROC is shortened to AUC.

AUC is the expected TPR averaged over all FPRs with all possible thresholds (Ferri, Hernández-Orallo, & Flach, 2011). The AUC of a random classifier is 0.5, so the AUC is expected to be greater than 0.5. Besides, the AUC of the ideal classifier is 1. Hence, the AUC usually falls in the range [0.5, 1]. With a discrete series of thresholds $\{\alpha_i\}_{i=1}^{n}$, AUC is estimated by the formula:

$$AUC = 0.5 \sum_{i=2}^{n} \left| FPR(\alpha_i) - FPR(\alpha_{i-1}) \right| \left( TPR(\alpha_i) + TPR(\alpha_{i-1}) \right) \qquad (2.11)$$

where $TPR(\alpha)$ and $FPR(\alpha)$ are the TPR and FPR corresponding to the threshold $\alpha$.
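A sketch of formula (2.11) on synthetic scores (not from the dissertation; scikit-learn is assumed only for the ROC points and the reference value):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
scores = np.where(y_true == 1, rng.normal(0.6, 0.2, 500), rng.normal(0.4, 0.2, 500))

# ROC points over all thresholds, then the trapezoidal sum of (2.11)
fpr, tpr, _ = roc_curve(y_true, scores)
auc_trap = 0.5 * np.sum(np.abs(np.diff(fpr)) * (tpr[1:] + tpr[:-1]))

print(auc_trap, roc_auc_score(y_true, scores))  # the two values agree
```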

In the ID literature, AUC is the most popular performance metric for determining the optimal classifiers and comparing learning algorithms (Batista et al., 2004; Brown & Mues, 2012; Huang & Ling, 2005). However, AUC has some weaknesses. Firstly, AUC may provide incorrect evaluations when ROCs cross each other. For example, a ROC may be higher only in a neighborhood of a specific threshold but lower than other ROCs at all remaining thresholds. This curve may correspond to a greater AUC than the others, yet the others show higher TPR at most thresholds. In this case, AUC may be an irrational measure. Secondly, according to Hand (2009), AUC is an incoherent performance measure: “AUC is equivalent to averaging the misclassification loss over a cost ratio distribution which depends on the score distributions” of the classifier itself; thus, the AUC evaluates different classifiers by different metrics. However, Ferri et al. (2011) argue that Hand's argument is not “a natural interpretation”. Besides, Ferri et al. (2011) confirm the AUC's coherent meaning as a general classification performance measure and its independence of the classifier itself.

Figure 2.3: Illustration of KS metric (Source: Author's design)

The Kolmogorov-Smirnov statistic (KS) is another popular metric measuring the predictive power of classifiers (He, Zhang, & Zhang, 2018; Shen et al., 2021; Yang et al., 2021). KS expresses the degree of separation between the predicted positive and predicted negative classes. Figure 2.3 is an illustration of the KS metric, which is defined by formula (2.12):

$$KS = \max_{\alpha} \left| TPR(\alpha) - FPR(\alpha) \right| \qquad (2.12)$$

Although a high KS implies an effective classifier, KS only reflects good performance in the locality of the point determining KS (Řezáč & Řezáč, 2011).

In Figure 2.3, KS is attained at threshold 0.55, so the effectiveness analysis is only meaningful in the neighborhood of this value.
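A quick sketch of (2.12) (not from the dissertation; it reuses the roc_curve points, since TPR and FPR there are computed over all observed thresholds):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
scores = np.where(y_true == 1, rng.normal(0.6, 0.2, 500), rng.normal(0.4, 0.2, 500))

fpr, tpr, thresholds = roc_curve(y_true, scores)
ks_values = np.abs(tpr - fpr)
print("KS =", ks_values.max(), "at threshold", thresholds[ks_values.argmax()])
```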

Hand (2009) strongly criticizes AUC and proposes the H-measure as a substitute. The H-measure is the fractional improvement in the expected minimum loss compared with a random classifier. The formula of the H-measure is:

$$H = 1 - \frac{L}{L_{ref}} \qquad (2.13)$$

where $L$ is the overall expected minimum misclassification loss and $L_{ref}$ is the expected minimum misclassification loss corresponding to a random classifier.

The H-measure can overcome the AUC's limitation by fixing a classifier-independent distribution of the relative misclassification cost. The expected loss in the definition of the H-measure can be taken over any loss distribution; most applications follow the popular proposal of Hand and Anagnostopoulos (2014) of the beta distribution $Beta(\pi_1 + 1, \pi_0 + 1)$, where $\pi_0$ and $\pi_1$ are the proportions of the negative and positive classes in the population, respectively. Although the H-measure appeared recently, it has become popular in classification studies, for example, Ala'raj and Abbod (2016); Garrido, Verbeke, and Bravo (2018); He et al. (2018).

Conclusion of performance measures in imbalanced classification

Regarding labeled outputs, accuracy is the universal performance metric, but it may mislead the evaluation of classifiers' effectiveness in ID, since pursuing the highest accuracy makes positive samples not be classified correctly. In several application fields, such as credit scoring or cancer diagnosis, F-measure and G-mean are popular metrics instead of accuracy. Regarding scored outputs, AUC, KS, and H-measure are favored. However, it should be remembered that there is no perfect performance measure suitable for all data sets. Every metric has its meanings and drawbacks. Hence, it is necessary to utilize both overall and threshold-based metrics to get an adequate analysis of a classifier's performance.

Approaches to imbalanced classification

Algorithm-level approach

The algorithm-level approach, which focuses on the intrinsic classifiers, modifies the underlying algorithms to restrict the negative impact of ID. The target of the algorithm-level approach is usually to raise a specific performance metric or to constrain a consequence of ID.

Let’s review some typical types of the algorithm-level approach in ID.

The algorithm-level approach limits the bias toward the majority class of imbalanced data by modifying or correcting the underlying mechanism of a selected classifier, for example, Support vector machine, Decision tree, or Logistic regression.

Modifications of Support vector machine usually focus on the decision boundary, while those of Decision tree pay attention to the feature-splitting criteria, and those of Logistic regression are related to the log-likelihood function or the maximum likelihood estimation process.

Table 2.2 shows some representatives of this approach.

Typical modifications include:

• Applying specific kernel modifications to rebuild the decision boundary in order to reduce the bias toward the majority class.

• Setting a weight on the samples of the training set based on their importance (the positive samples are usually assigned a higher weight).

• Applying the active learning paradigm, especially in situations where the samples of the training set are not fully labeled.

• Re-computing the maximum likelihood estimate for the intercept and the conditional probability of belonging to the positive class.

Representative studies of these modifications include Lee, Jun, and Lee (2017); Yang, Song, and Wang (2007); Hoi, Jin, Zhu, and Lyu (2009); Sun, Xu, and Zhou (2016); Žliobaitė, Bifet, et al. (2017); Lenca, Lallich, Do, and Pham (2008); Liu, Chawla, Cieslak, and Chawla (2010); Maalouf and Siddiqi (2014); Maalouf and Trafalis (2011); Manski and Lerman (1977); Firth (1993); Fu, Xu, Zhang, and Yi (2017); and Li et al. (2015).

The basic idea of cost-sensitive learning (CSL) is that every misclassification causes a loss.

Denote C(1,0) and C(0,1) the losses when predicting a positive sample to be negative and a negative one to be positive, respectively. The simplest form of CSL is the independent misclassification cost, in which C(1,0) and C(0,1) are constants.

The target of the independent cost form is to find the optimal threshold α* corresponding to the minimum value of the total cost function:

α* = arg min_{α∈(0,1)} [C(1,0) × FN(α) + C(0,1) × FP(α)]   (2.15)

where FN(α) and FP(α) are the numbers of false negative and false positive samples corresponding to the threshold α, respectively. Table 2.3 shows the independent misclassification cost matrix for a prediction result.

Table 2.3: Cost matrix in Cost-sensitive learning

                     Predicted positive    Predicted negative
Actual positive      0                     C(1,0)
Actual negative      C(0,1)                0

In ID, CSL sets C(1,0) higher than C(0,1) to compensate for the bias toward the negative class. This assumption is also rational in real-world classification applications because misclassifying a positive sample usually causes more serious problems than misclassifying a negative one.

Many authors assigned C(0,1) a unit and C(1,0) a constant number C (greater than the unit). Some studies proposed formulas or procedures to find the optimal threshold based on C(0,1) and C(1,0), such as Elkan (2001); Moepya, Akhoury, and Nelwamondo (2014); Sheng and Ling (2006). Besides the independent one, authors pursued the dependent misclassification cost, which puts an individual cost per observation (Bahnsen, Aouada, & Ottersten, 2014, 2015; Petrides, Moldovan, Coenen, Guns, & Verbeke, 2022).
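To illustrate the independent cost form, the sketch below (Python; the cost values and the simulated scores are hypothetical) grid-searches the threshold α* that minimizes the total cost in (2.15):

import numpy as np

def optimal_threshold(y_true, score, c_fn, c_fp, grid=np.linspace(0.01, 0.99, 99)):
    # total cost C(1,0)*FN(alpha) + C(0,1)*FP(alpha) over a grid of thresholds
    cost = [c_fn * np.sum((y_true == 1) & (score < a)) +
            c_fp * np.sum((y_true == 0) & (score >= a)) for a in grid]
    return grid[int(np.argmin(cost))]

rng = np.random.RandomState(0)
y = np.r_[np.zeros(90, int), np.ones(10, int)]
s = np.r_[rng.beta(2, 5, 90), rng.beta(5, 2, 10)]    # negatives score low, positives high
print("alpha* =", optimal_threshold(y, s, c_fn=5.0, c_fp=1.0))  # C(1,0)=5, C(0,1)=1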

Among the methods of the algorithm-level approach, CSL is the most popular (Fernández et al., 2018; Haixiang et al., 2017) since CSL can be embedded into other classifier algorithms such as:

• Support vector machine (SVM): Datta and Das (2015); Iranmehr, Masnadi-Shirazi, and Vasconcelos (2019); Ma, Zhao, Wang, and Tian (2020).

• Decision tree (DT): Drummond, Holte, et al. (2003); Jabeur, Sadaaoui, Sghaier, and Aloui (2020); Qiu, Jiang, and Li (2017).

• Logistic regression (LR): Shen, Wang, and Shen (2020); Sushma SJ and Assegie (2022); Zhang, Ray, Priestley, and Tan (2020).

The effectiveness of CSL strongly depends on the assumption of the cost matrix. If the difference between C(1,0) and C(0,1) is too high, the positive class is over-favored in the classification process, which pushes up the FPR. Otherwise, if this difference is too low, the classifier does not provide enough adjustment to rebalance the bias toward the negative class. Therefore, constructing the cost matrix is the major concern in CSL. There are two popular scenarios for the cost matrix:

• The cost matrix is built on an expert's opinion. For example, in credit scoring, Moepya et al. (2014) assigned C(1,0) the average loss when accepting a bad customer based on an expert's experience. This scenario often depends on prior information, which is the subjective opinion of researchers without transparent evidence.

• The cost matrix is inferred from the data set. Some authors assigned IR to the cost C(1,0) and 1 to C(0,1) since they implied that the higher the IR, the poorer the classification performance (Castro & Braga, 2013; López, Del Río, Benítez, & Herrera, 2015). However, IR is not the only factor reducing the performance of classifiers (see Subsection 2.1.3). If IR is the cost C(1,0), any data sets with the same IR will be similarly solved despite belonging to different application fields.

In summary, the cost of loss in CSL is usually a disputable issue.

The algorithm-level approach focuses on the intrinsic nature of classifiers. It requires a deep understanding of the classifier algorithms to directly deal with the consequences of ID. Hence, algorithm-level methods are usually designed based on specific classifier algorithms. Therefore, this approach seems less flexible than the data-level approach.

CSL is the most popular method of the algorithm-level approach. However, the cost matrix is usually a controversial issue.

Data-level approach

The data-level approach involves re-sampling techniques to re-balance or alleviate the skewed distribution of the original data set. These techniques are easy to apply and do not depend on the learning algorithms training the classification model after the data pre-processing stage. Therefore, the data-level approach is a natural strategy for solving ID. In the imbalanced classification literature, many empirical studies agreed that re-sampling techniques improved the performance measures of most classifiers, such as Batista et al. (2004); Brown and Mues (2012); Prati et al. (2004). This approach forms three main groups of methods, including under-sampling, over-sampling, and the hybrid of under- and over-sampling techniques.

The under-sampling method removes negative samples, which are in the majority class, to re-balance or reduce the imbalance status of the original data set.

The most common under-sampling technique is random under-sampling (RUS). RUS creates a balanced subset of the training set by randomly eliminating negative samples. RUS is non-heuristic, easy to employ, and shortens computation time. However, if the data is highly imbalanced, RUS may waste useful information from the majority class because of removing too many negative samples. Figure 2.4 depicts the operation of RUS.

Figure 2.4: Illustration of RUS technique. Source: Author's design

To overcome the limitation of RUS, authors have developed heuristic methods to remove the concerned samples. Some representatives are the Condensed Nearest Neighbor Rule (Hart, 1968), Tomek-Link (Tomek et al., 1976), One-side Selection (Kubat, Matwin, et al., 1997), and the Neighborhood Cleaning Rule (Laurikkala, 2001). These methods can be used for balancing and cleaning data.

The Condensed Nearest Neighbor Rule (CNN) (Hart, 1968) finds the consistent subset E of the original data set S, which correctly classifies all samples of S by the 1-nearest neighbor classifier. Then, S is replaced with the store, which consists of the minority class and the subset of the majority class not belonging to E.

Figure 2.5: Illustration of CNN rule. Source: Author's design

CNN removes the negative samples of E, which are often far from the borderline between the classes. These samples are considered less relevant to the learning process. However, CNN does not determine the maximum consistent subset. Besides, CNN removes samples randomly, particularly in the initial stage; hence it often retains internal samples rather than boundary ones. In some cases, for instance Figure 2.5, CNN is not a balancing method due to removing too many negative samples. Furthermore, the samples in the store are at too-close distances. That makes the characteristics of the two classes not distinctly different, which leads to difficulties in the operation of the following classifiers.

Tomek-Link (Tomek et al., 1976), which is an innovation of CNN, finds all pairs of samples (e_i, e_j) satisfying the conditions:

i) Belonging to different classes;

ii) With any sample e_k, d(e_i; e_j) < d(e_i; e_k) and d(e_i; e_j) < d(e_j; e_k),

where d(·;·) denotes the distance between two samples.
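Both cleaning methods are available in the imbalanced-learn package. The following sketch (on an illustrative synthetic data set) removes the majority-class members of Tomek links and then shows the analogous call for CNN:

from collections import Counter
from imblearn.under_sampling import TomekLinks, CondensedNearestNeighbour
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           class_sep=0.8, random_state=0)   # some class overlap

tl = TomekLinks(sampling_strategy="majority")   # drop the negatives of each Tomek link
X_tl, y_tl = tl.fit_resample(X, y)
print(Counter(y), "->", Counter(y_tl))

cnn = CondensedNearestNeighbour(random_state=0) # CNN: keep a 1-NN-consistent store
X_cnn, y_cnn = cnn.fit_resample(X, y)
print(Counter(y), "->", Counter(y_cnn))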

In Lasso-Logistic regression (LLR), the log-likelihood l(Y|X, β) of LR in (3.2) is maximized under a constraint on the magnitudes of the coefficients:

max_β l(Y|X, β)   subject to   Σ_{j=1}^p |β_j| ≤ t   (3.3)

where t > 0 is a tuning parameter.

If t is sufficiently large, the constraint imposed on the parameters is not strict, and the solutions of (3.3), β_j (j ∈ 1, p), are the same as those of (3.2). On the contrary, if t is very small, the magnitudes of the β_j (j ∈ 1, p) are shrunk. Then, due to the property of the absolute function, some of the β_j are zero. Therefore, the constraint on the β_j (j ∈ 1, p) in (3.3) plays the role of a feature selection method: only the predictors relevant to the response, which correspond to non-zero β_j, are retained in the fitted model.

Based on the theory of convex optimization, problem (3.3) is equivalent to:

min_β [ −l(Y|X, β) + λ Σ_{j=1}^p |β_j| ]   (3.4)

where λ is a penalty level, corresponding one-to-one to the tuning parameter t in (3.3). If λ is zero, the solution of LLR is exactly equal to LR's solution in (3.2). Otherwise, if λ is sufficiently large, the solution of LLR is zero. For values of λ between the two extremes, LLR gives a solution with some of the β_j zero; thus some predictors are excluded from the model. The values of λ are surveyed on a grid to select the best one based on the criteria AIC, BIC, or a cross-validation procedure. With a given λ, problem (3.4) is solved by the coordinate descent algorithm and proximal-Newton iterations (see Gareth, Daniela, Trevor, and Robert (2013); Hastie, Tibshirani, and Wainwright (2015) for more details).

Besides being a feature selection method, LLR has shown better predictive power than LR in empirical studies (Li et al., 2019; Wang et al., 2015).
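As a minimal illustration (not the implementation used later in this dissertation), LLR can be fitted with an L1 penalty in scikit-learn, with the penalty level chosen by cross-validation on a grid; the synthetic data are for demonstration only:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Cs is a grid of inverse penalty levels (a large C corresponds to a small lambda)
llr = LogisticRegressionCV(Cs=20, penalty="l1", solver="liblinear", cv=5)
llr.fit(X, y)

kept = np.flatnonzero(llr.coef_.ravel())    # Lasso zeroes out irrelevant predictors
print("selected features:", kept)
print("scores of three samples:", llr.predict_proba(X[:3])[:, 1].round(3))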

A decision tree (DT) consists of rules to classify samples. The set of rules splits the feature space of samples into sub-spaces that possess similar specific attributes. Constructing a DT on a training set is to determine the order of the predictor variables and the conditions for branching them. The process iterates a recursion over each split sub-space. Splitting stops when it is no longer possible to split or all samples in a sub-space have the same output.

The rules are "if - then - else" results of the attributes. A hidden-label sample belonging to the sub-space S_k is predicted to be in class j (j = 0, 1) if most samples in S_k have the label j.

For example, consider a tree classifying samples into two classes, Green (G) and Red (R). The features of the samples are represented by two predictor variables x_1 and x_2. If a sample has x_2 smaller than h_1 or x_1 greater than t_2, it belongs to class R. Otherwise, if x_2 is smaller than h_2, it belongs to class G. On this condition, if x_1 is less than t_1, it is a member of G; otherwise, it is a red sample. The outputs of this DT are the terminal nodes "Red" and "Green", which are also called leaves.

DT has many outstanding advantages. The first convenience is the ability to explicitly explain the role of predictors in the final output. Moreover, it can determine the order of features that is important for the classification results. Besides, DT does not require the probability distribution of the predictors. DT can also perform on both qualitative and quantitative predictors.

However, DT often does not give high accuracy compared to Support vector machine (SVM) and Artificial neural network (ANN) when the order of predictors for branching is unsuitable. Besides, DT tends to over-fit the training data, especially when the tree grows deeply and complexly. Furthermore, DT offers highly variant models since it is sensitive to small changes in the training data. The lack of robustness makes DT less reliable than other algorithms (Gareth et al., 2013).

In credit scoring, there are some usual algorithms for constructing a DT model: the chi-square automatic interaction detector (CHAID), the classification and regression tree (CART), and C5, which are distinguished from one another by the criterion of splitting input attributes. CHAID uses the chi-square test (Yap, Ong, & Husain, 2011), while CART applies the Gini criterion (Breiman, Friedman, Olshen, & Stone, 2017), and C5 utilizes the entropy information gain (Pandya & Pandya, 2015; Pang & Gong, 2009; Zhang, Jia, Diao, Hai, & Li, 2016). Besides, other particular algorithms for building a DT are C4.5 (Quinlan, 1996, 2014) and the RPART algorithm (Breiman et al., 2017). The effectiveness of DT is not robust in the credit scoring literature. For instance, Galindo and Tamayo (2000); Zhang et al. (2016) found that DT was better than ANN and LR. Meanwhile, according to Brown and Mues (2012); Marqués et al. (2012), DT was worse than LR, SVM, and ANN.

Assume that the two classes of the training data are linearly separable. The idea of SVM is to construct a linear boundary between the two classes such that the two classes are equidistant from that boundary and the width of the margin is maximized. The margin is defined by the smallest distance from a sample of a class to the boundary.

Source: https://machinelearningcoban.com/2017/04/09/smv/

Figure 3.2 shows an example of linear boundaries between the two classes. In Figure 3.2a, the bold black line does not satisfy the equal-distance condition, while in Figure 3.2b, the green dotted line does not attain the maximum distance from each class to the boundary. The bold black line in Figure 3.2b is the boundary determined by SVM.

The mathematical model of SVM is as follows.

Suppose a training data set {X_i, Y_i}_{i=1}^n, where X_i ∈ R^p, Y_i ∈ {−1, 1}.

The objective of SVM is to find a hyper-plane b^T X + b_0 = 0 (b_0 ∈ R, b ∈ R^p) satisfying:

max_{b_0∈R, b∈R^p}  min_{i∈1,n}  [ Y_i (b^T X_i + b_0) / sqrt(b_1^2 + ⋯ + b_p^2) ]   (3.5)

The solution of the optimization in (3.5) is determined uniquely. However, SVM only succeeds when the two classes are linearly separable. Unluckily, that rarely happens in most real-world data sets. Then, the original SVM was improved to soft-margin SVM (Cortes & Vapnik, 1995) and kernel SVM (Boser, Guyon, & Vapnik, 1992) to be able to solve the nonlinear separation problem. In credit scoring, the performance of SVM and its extended versions is not robust. They are quite better than LR according to Baesens et al. (2019), while Van Gestel et al. (2006) conclude that there is an insignificant difference between SVM, LR, and LDA. Besides, Huang, Chen, and Wang (2007); Huang, Chen, Hsu, Chen, and Wu (2004) imply that SVM does not perform more accurately than ANN. On imbalanced data, according to Alves, Silva, Prati, et al. (2012); Yijing, Haixiang, Xiao, Yanan, and Jinling (2016), the performance of SVM is not affected when increasing the imbalanced ratio and is quite robust and precise compared to the other classifiers.
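For illustration, soft-margin and kernel SVM can be sketched in scikit-learn as below; the class_weight option, added here as one simple adjustment for ID, re-weights the two classes, and the data are synthetic:

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

for kernel in ("linear", "rbf"):            # soft-margin (C=1.0) linear and kernel SVM
    svm = SVC(kernel=kernel, C=1.0, class_weight="balanced", probability=True)
    svm.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, svm.predict_proba(X_te)[:, 1])
    print(kernel, "AUC:", round(auc, 3))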

Ensemble classifiers

The term "ensemble model" refers to combinations of several classification models, which are also named sub-classifiers, to leverage the collective power for decision-making across them (Roncalli, 2020). Ensemble classifiers can be divided into two types: heterogeneous and homogeneous ensembles.

Heterogeneous ensemble classifiers (sometimes called hybrid models) combine different techniques or algorithms, which are called base learners, to leverage their strengths and compensate for their weaknesses. Hybrid classifiers often involve the fusion of different classifiers such as DT, LR, SVM, ANN, or KNN. In credit scoring, hybrid models have brought promising results compared to individual classifiers (Dumitrescu et al., 2021; Shen et al., 2021; Yang et al., 2021; Zhang et al., 2021). However, hybrid classifiers do not attract as much consideration as homogeneous ensembles because of some limitations. Firstly, hybrid models often require more extensive training and tuning. The combination of multiple algorithms introduces additional hyper-parameters and configuration options that need to be optimized. This process makes model development require more effort and expertise. Secondly, hybrid models require great computational resources such as memory and processing power. For example, hybrid models consisting of SVM or ANN usually spend too long a computation time, which can limit the scalability of hybrid models, particularly when dealing with large data sets or real-time applications. Finally, hybrid models often make classification more complex. Hence, it is too difficult to interpret or understand the decision-making process, especially when the hybrid models include multiple classifiers or complicated techniques.

Homogeneous ensemble classifiers (also called ensemble models) combine similar base learners to make predictions collectively. The term "ensemble classifiers" referred to in Subsection 2.3.3 means homogeneous ensemble classifiers. From this point, unless otherwise stated, ensemble classifiers are interpreted as homogeneous ones.

Ensemble classifiers usually employ a base learner many times, on different subsets of the data set, or use various hyper-parameters. The individual classifiers (sub-classifiers) are integrated with specific strategies to get the final prediction.

The main idea of an ensemble follows natural human behavior when making a decision. Instead of seeking an expert at a high cost, a set of several normal workers at a cheap cost is an alternative. This idea implies that single errors can be suppressed by multiple results catching many aspects of the training data. In other words, several sub-classifiers can predict a more accurate overall result than an individual one. Therefore, the effectiveness and diversity of the sub-classifiers are the concerns of an ensemble classifier (Fernández et al., 2018).

Another concept related to the diversity of an ensemble is the bias-variance decomposition. The bias can be characterized as the ability to generalize the prediction results to a test set. On the contrary, the variance can be depicted as the sensitivity of the classifier to the training set. Hence, the performance improvement of ensembles often comes from the reduction of variance (bagging ensembles) or bias (boosting ensembles) (Fernández et al., 2018).

Ensemble classifiers can work in parallel or sequential ways. The parallel type consists of many independent sub-classifiers. The final classification results are combined according to the majority rule, where the results of each sub-classifier may have the same or different weights. The Bootstrap aggregating (Bagging) classifier and Random forest are typical cases of parallel ensembles. On the other hand, the sequential type includes innovative versions of sub-classifiers. Generally, the following sub-classifier is grown by a modification of the previous one. The overall classification result is often the final sub-classifier's result.

The boosting-based algorithm, for example, Adaptive Boosting (AdaBoost), is typical of this type.

The Bagging classifier (Breiman, 1996) consists of several iterations of the same learner trained on bootstrap versions of the original data set. Bagging diversifies the sub-classifiers by changing the training data set in each iteration. An unknown-label sample is classified into the class suggested by the majority of the sub-classifiers (with or without weights). The advantages of Bagging are its simple execution and its ability to reduce variance.

The pseudo-code of Bagging is shown in Table B.1, Appendix B.

Random forest (RF) (Breiman, 2001), which utilizes DT as the base learner, has iterations similar to Bagging but randomly picks some features instead of the whole feature space. Therefore, the level of diversity of RF is higher than that of Bagging. The effectiveness of RF depends on the power and the correlation of the sub-classifiers. Analogously to Bagging, the number of iterations can be chosen quite large without over-fitting.

The algorithm of RF is shown in Table B.2, Appendix B.

Adaptive boosting (AdaBoost) (Freund, Schapire, et al., 1996), which is the first-introduced version of the Boosting family, uses DT as the base classifier algorithm. AdaBoost follows the idea that the next classifier corrects the mistakes of the previous one and contributes to the overall predicted result according to its performance.

AdaBoost uses two types of weight: D_t(i) of every sample x_i at the t-th iteration, and α_t of every t-th sub-classifier. After the t-th iteration, AdaBoost modifies the weights D_t(i): increasing the weights of misclassified samples and decreasing those of the correctly classified ones in the next iteration. As regards the weight α_t, if the error rate of the t-th sub-classifier is greater than 0.5, α_t is assigned zero. Thus, this sub-classifier does not contribute to the overall result due to its poor performance. With a new sample, each sub-classifier t gives a predictive class accompanied by a weighted vote α_t, and then the final result is determined by the majority.

By the AdaBoost algorithm, Freund et al. (1996) prove that weak classifiers, for example DT, can become stronger in the sense of the probably approximately correct learning framework.

AdaBoost can reduce the bias, instead of the variance as Bagging does. Besides, because of its sequential operation, a Boosting ensemble takes a longer computation time than Bagging and RF with the same number of iterations. Unlike Bagging and RF, when the number of iterations becomes large, AdaBoost may over-fit. Furthermore, the effectiveness of AdaBoost is as good as that of RF but sometimes less (Breiman, 2001).

Table B.3, Appendix B summarizes the operation of AdaBoost.
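The three ensembles can be sketched as follows (scikit-learn; the parameter values are illustrative, not the settings used later in this chapter):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)

models = {
    # parallel: bootstrap re-sampling of the training set, majority vote
    "Bagging ": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # parallel: bootstrap samples plus random feature subsets at each split
    "RF      ": RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0),
    # sequential: each tree re-weights the samples the previous trees misclassified
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean().round(3))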

In the credit scoring literature, empirical studies agreed that ensembles had a superior performance in comparison to single classifiers (Brown & Mues, 2012; Dastile et al., 2020; Lessmann et al., 2015; Marqués et al., 2012). Finlay (2011); Kim, Kang, and Kim (2015) concluded that AdaBoost was the best solution, even in imbalanced circumstances, while the Bagging tree was supported by Finlay (2011) and Luo (2022). Besides, RF was the most effective according to the study of Brown and Mues (2012).

Conclusions of statistical models for credit scoring

Statistical and machine learning models have been utilized variously in credit scoring. Table 3.1 presents some typical works of credit scoring, which are clustered by characteristics of classifiers such as single or ensemble, and transparent or black-box structure. Each type of classifier has its advantages and disadvantages. Regarding effectiveness, homogeneous and heterogeneous ensemble classifiers usually dominate the single ones. However, regarding interpretability, ensemble classifier algorithms often build black-box credit scoring models. Therefore, constructing an interpretable ensemble model is an urgent requirement for credit scoring.

Table 3.1: Representatives of classifiers in credit scoring

Single classifiers
  Transparent
    DA: Altman et al. (1994); Baesens et al. (2003); Desai et al. (1996); West (2000); Yobas et al. (2000).
    KNN: Brown and Mues (2012); Li et al.
    LR: Baesens et al. (2003); Bensic et al. (2005); K. Chen et al. (2020); Desai et al. (1996); Steenackers and Goovaerts (1989); West (2000); Wiginton (1980).
    DT: Galindo and Tamayo (2000); Pandya and Pandya (2015); Pang and Gong (2009); Zhang et al. (2016).
  Black-box
    SVM: Schebesch and Stecking (2005); Van Gestel et al. (2006).
    ANN: Shen et al. (2019); West (2000); Yobas et al.
Ensemble classifiers
  Heterogeneous: He et al. (2018); Shen et al. (2021); Yang et al. (2021); Yotsawat et al. (2021); Zhang et al. (2021).
  Homogeneous
    Boosting: Brown and Mues (2012); Cao, He, Wang, Zhu, and Demazeau (2021); Finlay (2011); Marqués et al. (2012).
    Bagging: Abdoli et al. (2023); Finlay (2011).
    RF: Brown and Mues (2012); Cao et al. (2021); Ha et al. (2016); Marqués et al. (2012).

The proposed credit scoring ensemble model based on Decision tree

The proposed algorithms

Consider a training data set T with the majority class MA (also the negative class) and the minority one MI (the positive one): T = MA ∪ MI. The positive and negative labels of samples are denoted "1" and "0", respectively.

Define D as the difference in the quantities of MA and MI. With B given and for every i (i ∈ 1, B), ROS is applied to get a new positive class, denoted MI_i, by randomly duplicating D × i/B positive samples. Then, RUS creates a new negative class MA_i, which has the same quantity as MI_i. The union of MI_i and MA_i is a balanced data set T_i: T_i = MA_i ∪ MI_i.

When i varies from 1 to B, each set T_i is balanced and has a different quantity from the others. That is the premise for the diversity of the sub-classifiers of DTE(B). In addition, the combination of ROS and RUS aims to take advantage of these techniques and compensate for their drawbacks. It is noted that when i equals B, T_B is created by only ROS; thus, there is not any loss of negative class information. The OUS(B) algorithm is described in Table 3.2.

Table 3.2: The OUS(B) algorithm

Input: T: the training data set; MI and MA: the positive and negative class of T, respectively; B: the number of new balanced data sets.
1. D ← |MA| − |MI|
2. for i = 1 to B do
3.   MI_i ← RandomOversampling(MI, D × i/B)
4.   MA_i ← RandomUndersampling(MA, |MA| − |MI_i|)
5.   T_i ← MI_i ∪ MA_i
6. end for
Output: A family of balanced data sets {T_i}_{i=1}^B.
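A minimal Python sketch of OUS(B) follows, under the reading above that iteration i duplicates D × i/B positive samples; row indices stand in for the samples:

import numpy as np

def ous(X, y, B, seed=0):
    """Return B balanced data sets; label 1 denotes the minority (positive) class."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    D, family = len(neg) - len(pos), []
    for i in range(1, B + 1):
        extra = rng.choice(pos, size=int(np.ceil(D * i / B)), replace=True)  # ROS
        mi = np.r_[pos, extra]                              # new positive class MI_i
        ma = rng.choice(neg, size=len(mi), replace=False)   # RUS: new negative class MA_i
        idx = np.r_[mi, ma]
        family.append((X[idx], y[idx]))                     # balanced set T_i
    return family

X = np.random.default_rng(1).normal(size=(110, 4))
y = np.r_[np.ones(10, int), np.zeros(100, int)]
for Xi, yi in ous(X, y, B=3):
    print(len(yi), int(yi.sum()))   # each T_i is balanced; at i=B all negatives are kept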

3.2.1.2 Algorithm for constructing the ensemble classifier - the DTE(B) algorithm

On each balanced data set of the output of the OUS(B) algorithm, the Recursive Partitioning and Regression Tree algorithm (RPART) (Therneau, Atkinson, et al., 1997) is applied to build the sub-classifiers of DTE(B). Finally, the predicted label of a sample is the majority voted by the B sub-classifiers of DTE(B). The algorithm for DTE(B) is in Table 3.3. In each sub-classifier, the parameters are set as follows. The minimum number of observations in any terminal node is 10. The pruning process of each tree is determined by 5-fold cross-validation with the complexity parameter 0.001.

The reduction in the loss function (e.g., classification errors) at the splits on a feature is used to measure the importance of this feature. A feature can be used several times in a tree. The more segments this feature has, the more essential it is. Therefore, the total reduction in the loss function across all segments of a feature is the index to measure its importance. With DTE(B), the overall importance level of a feature is the average of the B importance levels from the B sub-classifiers. In this study, the overall values are standardized so that the most important feature is scored 100 and the remaining features are scored based on their level relative to the most important one.

Table 3.3: The DTE(B) algorithm

Input: {T_i}_{i=1}^B: the family of balanced training data sets with the same number of features; p: the number of features in each data set T_i; DT: the Decision tree classifier.
1. for i = 1 to B do
2.   Train the sub-classifier DT_i on T_i by the RPART algorithm.
3.   FI_i = (FI_ij)_{j=1}^p, where FI_ij is the degree of the j-th feature's importance.
4. end for
5. For a new sample, each DT_i gives a predicted label ŷ_i; the final label is 1 if more than half of the B sub-classifiers vote 1, and 0 otherwise.
Output: The ensemble classifier DTE(B); the importance level vector FI.
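The following self-contained sketch mimics the DTE(B) idea: B trees on OUS-style balanced index sets, a majority vote over their labels, and averaged, standardized feature importances (the data and parameters are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(110, 4)); y = np.r_[np.ones(10, int), np.zeros(100, int)]
X[y == 1] += 1.0                                   # give the positives some signal

def balanced_sets(B):                              # stand-in for OUS(B), index version
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    out = []
    for i in range(1, B + 1):
        n = len(pos) + (len(neg) - len(pos)) * i // B
        out.append(np.r_[rng.choice(pos, n, replace=True),
                         rng.choice(neg, n, replace=False)])
    return out

trees = [DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=0.001).fit(X[idx], y[idx])
         for idx in balanced_sets(B=9)]

votes = np.mean([t.predict(X[:5]) for t in trees], axis=0)      # majority vote
fi = np.mean([t.feature_importances_ for t in trees], axis=0)   # averaged importances
print((votes > 0.5).astype(int), (100 * fi / fi.max()).round(1))  # most important = 100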

Empirical data sets

Four data sets, which are German (GER), Taiwanese (TAI), Vietnamese 1 (VN1), and Vietnamese 2 (VN2), are used in the empirical study. The summary of the data sets is shown in Table 3.4. More details can be found in Appendix C.1 – C.4.

GER 1 and TAI 2 are public on the UCI machine learning repositories. On the contrary, VN1 and VN2 are private data sets. Due to security concerns, we cannot access detailed information about credit customers at Vietnamese banks. All features of VN1 and VN2 are in nominal forms. They are the interest rate, terms, duration, loan amount, customer gender, loan purpose, base balance, current balance, type of customers, type of products, credit history of customers, and branches of the bank. Besides, the imbalanced ratios of VN1 and VN2 are notably high, especially that of VN2. These characteristics make the Vietnamese data sets different from GER and TAI.

1 http://archive.ics.uci.edu/dataset/144/statlog+german+credit+data
2 https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients

Table 3.4: Description of empirical data sets

Data sets    Sample size    #positive class a    Imbalanced ratio    #features b
VN2          16,407         1,340                11.24               12
a: The number of positive samples; b: The number of total features.

VN1 and VN2 are used to determine the optimal DTE(B*), while GER and TAI are the validation data sets to compare the optimal DTE(B*) with popular ensemble classifiers based on DT.

Computation process

The computation processes of credit scoring by DTE(B) and the other popular ensemble classifiers based on DT follow the steps in Table 3.5.

Instead of finding the optimal B* corresponding to each data set, a general evaluation on the two Vietnamese data sets is conducted to determine the most suitable B* for both data sets. This phase corresponds to steps 1 to 7 in Table 3.5.

Subsequently, DTE(B*) is applied to the public data sets, which are GER and TAI, to compare its performance measures with popular ensemble classifiers based on DT such as the Bagging tree, RF, and AdaBoost, with and without the popular re-sampling techniques such as ROS, RUS, SMOTE, and ADASYN. The comparison phase covers steps 8 to 13 in Table 3.5. In this phase, the performance metrics, including AUC, KS, F-measure, G-mean, and H-measure, are used to provide an overview of the effectiveness of the proposed ensemble.

Table 3.5: The computation protocol

Determine the optimal DTE(B*)
1. On VN1 and VN2, divide randomly the data sets into the training (70%) and testing data (30%).
2. On the training data, with a given number B, the OUS(B) and DTE(B) algorithms are applied to get the DTE(B) classifier.
3. On the testing data, find the AUC, KS, and F-measure of DTE(B).
4. Repeat steps 1–3 fifty times.
5. Repeat steps 1–4 for every surveyed value of B.
6. Average the AUC, KS, and F-measure across the fifty times.
7. Choose the optimal B* corresponding to the best average performance measures.

Compare DTE(B*) with the other ensemble classifiers based on DT
8. On the empirical data set, divide it randomly into the training (70%) and testing data (30%).
9. Construct DTE(B*), the Bagging tree, RF, and AdaBoost.
10. Construct Bagging, RF, and AdaBoost integrated with one of the techniques RUS, ROS, SMOTE, and ADASYN.
11. On the testing data, calculate the AUC, KS, F-measure, G-mean, and H-measure of all considered ensembles.
12. Repeat steps 8–11 fifty times.
13. Average the performance metrics after the fifty times.

To get robust evaluations, the computation process of all considered classifiers is carried out 50 times on each data set. Then, the comparisons are based on the average values of the performance measures.

Empirical results

3.2.4.1 The optimal Decision tree ensemble classifier

Table 3.6: Performance measures of DTE(B) on the Vietnamese data sets

* denotes the optimal value of B; bold values are the highest in each row.

With Bagging and RF, the number of sub-classifiers can be arbitrarily high. However, with DTE(B), the number of sub-classifiers, which is B, is bounded by D, the difference in the quantities of the negative and positive classes. As B gets closer to this upper bound, each balanced set T_i is insignificantly different from the others. Therefore, the sub-classifiers in DTE(B) are not diverse. Thus, the survey for the optimal B* does not focus on extremely high values of B.

Table 3.6 presents the mean testing AUC, KS, and F-measure of the DTE(B)s, which are averaged after 50 repetitions.

It is not easy to determine the trend of AUC. On the VN1 data set, the maximum value of AUC corresponds to DTE(3), while on the VN2 data set, AUC reaches its maximum value at a higher B. However, AUC gradually stabilizes when B becomes large enough. Besides, the variations of KS and F-measure follow the inverse U-shape: increasing along with B, reaching the maximum, and then decreasing. Considering the computation time and the performance measures, the optimal value of B for the two Vietnamese data sets is 39.

Figure 3.4: The importance levels of features on the VN1 and VN2 data sets

It is a fact that the ensemble-based approach with popular re-sampling techniques does not perform effectively on the empirical Vietnamese data sets, which are highly imbalanced. Table 3.7 shows the performance of DTE(39) against Bagging, RF, and AdaBoost with and without re-sampling techniques. On the VN1 data set, DTE(39) outperforms the other classifiers on at least three evaluation criteria (AUC, KS, and H-measure), while on the VN2 data set, it surpasses the others on all five criteria. In short, DTE(39) is more effective than the ensemble-based approach to handling ID on the Vietnamese data sets.

Another output of the DTE(B) algorithm is the vector FI representing the importance level of the features. Figure 3.4 describes the features' importance levels on the Vietnamese data sets. In the VN1 data set, "Asset" is the most important feature, followed by other features such as "Purpose", "Duration", and "History". Analogously, in the VN2 data set, the most significant features are "Interest", "Duration", "Types of Product", and "Branches", in descending order.

These features of customers provide more information to predict the likelihood of default than the others. This is a valuable framework for Vietnamese administrators to introduce regulations for screening potential default cases.

In summary, on the Vietnamese data sets, DTE(39) fulfills the two requirements for a credit scoring model: effectiveness and interpretability.


Table 3.7: Performance of ensemble classifiers on the Vietnamese data sets

Data sets Classifiers AUC KS F-measure G-mean H-measure

Bold values are the highest of each criterion on each data set.

On the public data sets, which are the German and Taiwanese credit scoring data sets, DTE(39) is compared with the popular ensemble classifiers based on DT, without and with re-sampling techniques such as RUS, ROS, SMOTE, and ADASYN.

In the Bagging tree, RF, and AdaBoost, an increase in the number of trees usually leads to a decrease in the error rate (Breiman, 1996; Freund et al., 1996). Regarding Bagging and RF, a large number of trees in the ensemble does not cause over-fitting models. However, the improvement of the error rate is insignificant when the number of trees is greater than 20 for Bagging and 100 for RF (Breiman, 1996, 2001). Regarding AdaBoost, if there are too many trees on the whole, the computation time will be very long with a possibility of over-fitting. For all these reasons, the parameters of the ensemble classifiers are assigned as follows:

• Random forest: The number of trees is 300. The number of features for each tree is the square root of the total features of each data set.

Steps 8–13 of the computation protocol in Table 3.5 are applied to the German and Taiwanese data sets. The testing performance measures are shown in Tables 3.8 and 3.9. On the German data set, DTE(39) achieves the highest values of AUC and H-measure. Besides, in comparison with each classifier, DTE(39) always wins by at least three out of five performance criteria. Similarly, on the Taiwanese data set, DTE(39) is the most effective among the considered classifiers since DTE(39) beats the others by AUC, KS, and H-measure.

In addition, the performance measures of DTE(39) are compared with some recent empirical studies, which are also presented in Tables 3.8 and 3.9. DTE(39) is still almost dominant in the AUC criterion.

Classifiers AUC KS F-measure G-mean H-measure

CS-NNE (Yotsawat et al., 2021)    .8011    ——    ——    .7363    ——

Furthermore, DTE(39) shows higher performance than GSCI (X. Chen et al., 2020), EBCA (He et al., 2018), BSAC (Abdoli et al., 2023), LSTM (Shen et al., 2021), and the proposed model of Zhang et al. (2021) on the Taiwanese data set. In other cases, no recent ensemble completely outperforms DTE(39).

In summary, DTE(39) exhibits exceptional performance compared to both common methods and recent complex ensemble and hybrid models.

Classifiers AUC KS F-measure G-mean H-measure

BSAC (Abdoli et al., 2023)         ——       ——      .5316    .6807    ——
PLTR (Dumitrescu et al., 2021)     .7780    .4257   ——       ——       ——

DTE(39) offers superior performance in terms of AUC and H-measure across the four data sets when compared with ensemble classifiers based on trees such as Bagging, RF, and AdaBoost. On the German and Vietnamese 1 data sets, the AUCs of DTE(39) are significantly greater than those of the others. It means DTE(39) shows a higher expected TPR over all FPR with all possible thresholds. Besides, introducing the highest H-measure implies that DTE(39) outperforms all other models by the expected minimum loss improvement when considering the misclassification costs. The AUC and H-measure are complementary metrics to evaluate the general performance of a classifier. The outstanding AUC and H-measure show the robust effectiveness of DTE(39).

On the Vietnamese 1 and 2 data sets, which suffer from a highly imbalanced status, DTE(39) is the optimal choice. On the Vietnamese 2 data set, DTE(39) completely outperforms all considered ensembles integrated with the popular balancing methods. Thus, DTE(B) is a promising solution for seriously imbalanced credit scoring data sets.

Furthermore, interpretability makes DTE(39) the most reasonable credit scoring classifier. DTE(39) can point out the important features of customers, which are useful for hedging credit risk. Although many of the recently proposed ensembles show good performance, their primary focus is not on interpretability. In contrast, some models with discussions on interpretability, such as GSCI and PLTR, work less effectively than DTE(39) (see Tables 3.8 and 3.9).

Some further results are drawn from the empirical study. Firstly, on the four real data sets, none of ROS, RUS, SMOTE, and ADASYN is an outstanding re-sampling technique for addressing ID. Secondly, some balancing methods do not always work as expected. For example, on the German data set, the Bagging classifier without any re-sampling technique offers higher performance measures than the others (see Table 3.8). Therefore, users should carefully check several re-sampling techniques when applying the data-level approach.

Conclusions of the proposed credit scoring ensemble model based on Decision tree

Credit scoring is always one of the most important tasks of financial institutions. A little improvement in the effectiveness of credit scoring models can limit significant losses of the banking system and the economy. Therefore, the evolution of credit scoring models continues with the enhancement of new classification algorithms and the innovation of balancing methods. In addition, interpretability is a crucial aspect of a credit scoring model, but it has not received sufficient attention from researchers. This section contributes two algorithms to the credit scoring literature: OUS(B) for solving imbalanced data and DTE(B) for building an ensemble classifier based on DT. The product of the two proposed algorithms is the ensemble classifier DTE(39), which is more effective than Bagging, RF, and AdaBoost even when they are combined with common re-sampling techniques such as ROS, RUS, SMOTE, and ADASYN. DTE(39) also competes with other recent credit scoring models, especially in AUC and H-measure. Furthermore, DTE(39) introduces the important features for predicting credit risk status.

Thus, DTE(39) fulfills two requirements of typical credit scoring models: improving the performance measures and presenting the importance level of the input features. These attributes position DTE(39) as a most reasonable option for addressing imbalanced credit scoring.

However, DTE(B) should be practiced on more data sets to get detailed conclusions about the optimal value of B. Besides, the study only considers the imbalanced ratio as the parameter affecting the performance of classifiers on ID. In fact, overlapping is also a common issue in imbalanced classification. The OUS(B) algorithm should be examined deeply on data sets suffering from imbalance and overlapping to improve its effectiveness.

The proposed algorithm for imbalanced and overlapping data

The proposed algorithms

3.3.1.1 Algorithm for dealing with noise, overlapping, and imbalanced data

The pseudo-code for TOUS(B) is represented in Table 3.10.

Firstly, the Tomek-link method is applied to remove all the pairs {(e+, e−)}_m, which may be noise, borderline, or overlapping samples (Steps 1-2). Then, the imbalanced issue of the remaining data set, which is called T_J, is addressed by the OUS(B) algorithm (Steps 3-9); in Table 3.10, MI_J and MA_J denote the new positive and negative classes of T_J. The output of TOUS(B) is a family of balanced data sets.

From the output of the TOUS(B) algorithm, an ensemble model is constructed by applying a base learner on every data set of the family. This process follows the steps of the TOUS-F(B) algorithm shown in Table 3.11. TOUS-F(B) is analogous to the DTE(B) algorithm in Table 3.3; however, DT can be replaced by another base learner such as LR or LLR.

Table 3.11: The TOUS-F(B) algorithm

Input: {T_i}_{i=1}^B: the family of B balanced sets with the same number of features; F: a classifier.
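A minimal sketch of the TOUS idea follows (Python with imbalanced-learn assumed; the data are synthetic): Tomek-link pairs are removed from both classes first, and the OUS-style family of balanced sets is then drawn from the cleaned set T_J:

import numpy as np
from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           class_sep=0.7, random_state=0)

# Steps 1-2: remove whole Tomek-link pairs (possible noise/borderline/overlap)
Xj, yj = TomekLinks(sampling_strategy="all").fit_resample(X, y)

# Steps 3-9: OUS(B) on the cleaned set
rng = np.random.default_rng(0)
pos, neg = np.flatnonzero(yj == 1), np.flatnonzero(yj == 0)
B, D, family = 5, len(neg) - len(pos), []
for i in range(1, B + 1):
    mi = np.r_[pos, rng.choice(pos, D * i // B, replace=True)]
    ma = rng.choice(neg, len(mi), replace=False)
    idx = np.r_[mi, ma]
    family.append((Xj[idx], yj[idx]))
print([len(t[1]) for t in family])    # sizes of the balanced sets T_1, ..., T_B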

Empirical data sets

The empirical study is conducted on six data sets, which are Bank personal loan (BANK) 3, German credit (GER), Hepatitis C (HEPA) 4, Loan schema data (US) 5, Vietnamese 1 (VN1), and Vietnamese 3 (VN3) credit. These data sets are chosen because of their diversity in sample size, imbalanced ratio, number of attributes, types of attributes, and the presence of overlapping samples (which will be found by the Tomek-Link and NCL methods). Table 3.12 summarizes the characteristics of the empirical data sets. The details of the data sets BANK, HEPA, US, and VN3 can be found in Appendix C.5 – C.8, respectively.

Table 3.12: Description of empirical data sets

Data sets    Size      Positive size    Imbalanced ratio    #feat a    #numfeat b
VN3          11,124    837              12.29               12         0
a: The number of total features; b: The number of numeric features.

3 https://www.kaggle.com/datasets/teertha/personal-loan-modeling
4 https://archive.ics.uci.edu/dataset/571/hcv+data
5 https://www.openintro.org/data/index.php?data=loans_full_schema

VN1 and VN3, which are from two Vietnamese commercial banks, are private data sets. The German and Vietnamese 1 data sets were used in the empirical study for DTE (Section 3.2). In addition, the Hepatitis C data set involves the medical field, which usually suffers from ID.

Some changes to the original data sets are made.

• Regarding HEPA, observations with missing values are removed; the levels of the variable Category are grouped into two labels, denoted "0" for "Blood donor" and "1" for the remaining ones.

• Regarding US, the original data set consists of 10,000 samples, which are individuals and companies. Besides, there are some empty values in the data set. We remove samples that have missing values or are not individual customers. The rest consists of 8,505 samples for the empirical study.

Computation process

Each data set is randomly split into the training and testing sets at the proportion of 70% – 30%. For every value of B in the set {3, 5, 7, 9}, the TOUS(B) and then the TOUS-F(B) algorithms are applied to the training set to build ensemble classifiers. This section employs LLR and DT as the base learners of the ensembles, called the Lasso Logistic Ensemble (LLE(B)) and the Decision Tree Ensemble (DTE(B)), respectively. Experiments are conducted only on small values of B due to the burden of the computation process. Besides, AUC is the unique performance metric in the evaluation.

Figure 3.5 illustrates the computation protocol of the proposed ensemble classifiers. On each data set, this process is repeated 50 times for every value of B. The optimal proposed classifier ensembles on each data set are the ones corresponding to the highest average testing AUC.

This subsection builds an ensemble classifier in which LLR plays the role of the base learner. The proposed ensemble classifier, which consists of B sub-classifiers, is denoted LLE(B). The computation protocol of LLE(B) follows the steps shown in Figure 3.5 with the replacement of the classifier F by LLR. Furthermore, the study trains single models based on LLR and the popular re-sampling techniques, such as ROS, SMOTE, RUS, Tomek-link, and NCL.

There are some transformations on the data sets after conducting re-sampling. Firstly, for each nominal attribute, binary variables (dummy variables) are created to express all the levels. Secondly, the numerical variables are scaled according to the formula:

X_scale = (X − X̄) / dev(X)

where X̄ and dev(X) are the mean and the deviation of the variable X.

Finally, when training models by LLR, we design a grid of 500 values of the penalty level λ on each data set. The coordinate descent algorithm and the 5-fold cross-validation procedure are applied to choose the best λ for LLR.

This subsection builds another ensemble classifier, in which DT is the base learner. Similarly to Section 3.2, the proposed ensemble classifier, which has B sub-classifiers, is denoted DTE(B).

All steps to construct the optimal DTE(B*) are similar to those of LLE(B). The performance of DTE(B*) is compared to the popular ensemble classifiers based on DT (Bagging, Random forest, and AdaBoost) integrated with one of the common re-sampling techniques (RUS, ROS, SMOTE, Tomek-link, and NCL). The RPART algorithm, which is complementary to the CART algorithm, is applied to construct the DT classifier (Therneau et al., 1997). The parameters of RPART are assigned as follows. The minimum number of samples in any terminal node is 10. The pruning process of each tree is determined by 5-fold cross-validation with the complexity parameter 0.001. As regards Bagging and AdaBoost, the number of nodes along the longest path from the root node to the farthest terminal node is 10 (maxdepth = 10), and the number of trees in the ensemble classifier is 30 (mfinal = 30). As regards RF, the number of trees is assigned 300, and the number of predictors of each sub-classifier is the square root of the total predictors of the data set. The performance of DTE(B) and the popular ensembles is evaluated based on the average testing AUC of 50 running times.

Empirical results

Table 3.13 introduces the average testing AUC of LLE(B) and DTE(B) on the six data sets, where B belongs to the set {3, 5, 7, 9}.

For each data set, the optimal classifier ensemble is the model with the greatest AUC. If two ensembles have the same greatest AUC, the better is the one with the smaller B. In Table 3.13, the bold values of AUC correspond to the optimal ensemble classifiers of each data set.

According to Table 3.13, it can be concluded that on every data set, with any B and C, DTE(B) usually has a higher AUC than LLE(C).

Table 3.13: Average testing AUC of the proposed ensembles

Table 3.14 shows the average testing AUC of the optimal LLE(B*) and the models based on LLR post-applied with the popular re-sampling techniques such as ROS, SMOTE, RUS, Tomek-Link, and NCL. The LLR without any re-sampling balancing method is also compared with LLE(B*); its AUC is shown in the "No re-samp" column. LLE(B*) completely outperforms the other popular models by the AUC criterion. Thus, it can be concluded that the TOUS algorithm improves the performance of LLR, even when LLR is combined with popular re-sampling techniques.

Moreover, some comments are implied from this experiment. On the GER, HEPA, and US data sets, the values of AUC of LLR without any re-sampling technique and of Tomek-link-LLR or NCL-LLR are the same. Thus, these data sets do not have any noise and overlapping samples. On the remaining data sets, Tomek-link and NCL detect such samples.

Table 3.14: Average testing AUC of the models based on LLR

Table 3.15 shows the average testing AUC of the optimal ensemble DTE(B*) and the popular ensemble classifiers based on DT such as Bagging, RF, and AdaBoost, with and without one of the re-sampling techniques ROS, SMOTE, RUS, Tomek-Link, and NCL. Similar to the case of LLE(B*), DTE(B*) shows a great improvement in AUC compared to all the methods considered.

On the data sets BANK and HEPA, the popular classifiers perform well even when not addressing ID by re-sampling techniques. However, DTE can push the AUC higher. Besides, on the data sets US, VN1, and VN3, DTE raises the AUC significantly. These data sets have some special characteristics. The US suffers from ID seriously (the imbalanced ratio is 49.93), while VN1 and VN3 possess all issues such as noise, overlapping, and ID. These results imply that the ensemble-based approach, thanks to the TOUS algorithm, is a suitable option to deal with noise, overlapping, and ID.

Besides, some minor results are drawn. Firstly, re-sampling techniques are not always efficient. For example, on GER and HEPA, re-sampling techniques decrease the AUC of the popular ensemble algorithms. Secondly, Bagging shows notably lower performance than the other ensemble classifiers. Finally, RUS-…


Table 3.15: Average testing AUC of the ensemble classifiers based on trees

None    ROS    SMOTE    RUS    Tomek-link

Conclusions of the proposed technique

The TOUS algorithm combines the Tomek-Link, RUS, and ROS techniques to create a list of noise-free, overlapping-free, and balanced data sets, which will be the training sets of the sub-classifiers of an ensemble classifier. To verify the effectiveness of TOUS, the LLR and DT algorithms are applied to construct the ensemble classifiers LLE and DTE, respectively. The empirical study indicates some important results. TOUS provides a significant innovation in AUC compared with the popular re-sampling techniques. That means the hybrid of many re-sampling techniques is more effective than the individual ones. Besides, the data-cleaning methods can increase the performance measures although the data set is still imbalanced. This fact re-confirms that noise and overlapping samples are also reasons for reducing the effectiveness of standard classifiers. The results suggest experiments to study other classifiers as base learners and to consider more performance metrics to evaluate the potential of the proposed method for solving ID and related issues.

Chapter summary

This chapter studies credit scoring as a case study of ID. There are two proposed works in this chapter.

• The algorithm OUS for balancing data and the algorithm DTE for building the credit scoring ensemble classifier based on DT in Section 3.2.

• The algorithm TOUS for de-noising, removing overlapping, and balancing data, and the algorithm for constructing an ensemble classifier based on the output of TOUS in Section 3.3.

The credit-scoring ensemble classifier DTE addresses ID by the ensemble-based approach. The empirical results show that DTE outperforms standard classifiers, even when they are combined with popular re-sampling techniques. In addition, DTE can point out the important features for the final predicted results.

The TOUS algorithm, which derives from the OUS algorithm (Section 3.2), can tackle noise, overlapping samples, and ID. TOUS combines the Tomek-link, random over-sampling, and random under-sampling techniques. The proposed technique provides a substantial improvement in the AUC metric. All proposed works show impressive results. However, they should be considered deeply in parameter optimization and conducted on more empirical data sets to reach robust conclusions.

Introduction

Recently, although machine learning and data-mining algorithms are penetrating several real applications of classification, Logistic regression (LR), a traditional model, is still in favor by several authors (Bektas, Ibrikci, & Ozcan, 2017; Khemais, Nesrine, & Mohamed, 2016; Li et al., 2015; Muchlinski, Siroky, He, & Kocher, 2016). There are two prominent reasons for that. Firstly, the output of LR is the sample's conditional probability of belonging to the interest class, which is a reasonable reference to classify the sample. Secondly, LR shows a transparent model for interpretation, while most machine learning and data-mining models operate as a "black box" process. However, LR has some problems. The interpretive power of LR is based on the statistically significant level of parameters, which is closely relevant to the p-value. Nevertheless, the p-value has been recently criticized since its meaning is usually misunderstood (Goodman, 2008). Furthermore, in imbalanced classification, the parameter estimation of LR can be biased and the conditional probability of belonging to the minority class can be under-estimated (Firth, 1993; King & Zeng, 2001).

As a consequence, LR usually misclassifies the interest class on ID.

In the literature on LR with ID, there were some groups of methods, which were linked to the algorithm-level approach. They were prior correction, weighted likelihood estimation (WLE) (Maalouf & Trafalis, 2011; Manski & Lerman, 1977; Ramalho & Ramalho, 2007), and penalized likelihood regression (PLR) (Firth, 1993; Greenland & Mansournia, 2015; Park & Hastie, 2008; Puhr et al., 2017). Most of them were designed to reduce the parameter estimation and predicted probability biases, especially in small samples. However, prior correction and WLE need the previous information of the two classes in the population, which is usually unavailable. Besides, some methods of PLR, such as FIR (Firth, 1993), FLIC, and FLAC (Puhr et al., 2017), are quite sensitive to initial values in the computation process of the maximum likelihood estimation. Therefore, solving LR with ID should consider both the data-level and algorithm-level approaches and not make the computation process complex.

This chapter proposes a binary classifier named F-measure-oriented Lasso-Logistic regression (F-LLR) to deploy the interpretation ability of LR and address the imbalanced issue. F-LLR utilizes Lasso Logistic regression (LLR) as a base learner and integrates the algorithm-level and data-level approaches to handling ID. Lasso is a penalized shrinkage estimator and a feature selection method without a p-value. In Lasso, the hyper-parameter λ is set by a new procedure called F-CV, which is an adjustment of the ordinary cross-validation procedure (CV). F-CV finds the optimal λ by maximizing the cross-validation F-measure instead of the cross-validation accuracy as in CV. The proposed classifier F-LLR has two computation stages. In the first stage, LLR based on F-CV is applied to get the scores of all samples. In the second stage, according to the scores, under-sampling and SMOTE are respectively used to re-balance the data set. Next, LLR-based F-CV is applied again on the balanced data set to get the final result. The proposed classifier F-LLR is experimented on nine real imbalanced data sets, and its performance measures (KS and F-measure) are higher than those of the traditional approaches to ID of LR.

This chapter is organized as follows. The related works section reviews the general background involved with LR and ID. The next section describes the proposed classifier. The empirical study section introduces the empirical data sets, the implementation protocol, and the results. The conclusion section is final.

Related works

The details of LR are discussed in Subsection 3.1.1.3. Although LR has several advantages, it is ineffective in ID: the score is underestimated (King & Zeng, 2001). In the LR literature, there are some groups of methods that focus on the intrinsic computation process of LR to reduce the bias in parameter and score estimation. They are prior correction, weighted likelihood estimation (WLE), and penalized likelihood regression (PLR).

Prior correction re-computes the maximum likelihood estimate (MLE) for the intercept of the standard LR. It is unnecessary to correct the MLE for the parameter β because it is statistically consistent (Cramer, 2003; King & Zeng, 2001).

The correction for β_0 follows the formula:

β̃_0 = β̂_0 − ln[((1 − τ)/τ) × (ȳ/(1 − ȳ))]   (4.1)

where β̂_0 is the MLE for β_0; τ and ȳ are the proportions of the positive class in the population and in the sample, respectively.

Subsequently, the score is:

π̃(x) = P(Y = 1|X = x) = 1 / (1 + exp(−β̃_0 − x^T β̂))   (4.2)

where β̂ is the MLE for β.

The biggest advantage of prior correction is its ease of use. However, the value of τ is usually unavailable. Besides, if the model is misspecified, the estimates of β_0 and β are slightly less robust than those of the WLE (Xie & Manski, 1989).

King and Zeng (2001) argued that the score in the formula (4.2) was still underestimated. They proposed a correction for the score in (4.2):

π̃_KZ(x) = π̃(x) + C(x)   (4.3)

where C(x) = (0.5 − π̃(x)) × π̃(x)(1 − π̃(x)) × x^T V(β̂) x, and V(β̂) is the variance matrix of β̂.

King and Zeng (2001) stated that the score estimate in the formula (4.3) could reduce the bias and variance. However, this method was applied after the complete estimation. Thus, it was a correction, not a prevention (Wang & Wang, 2001). Besides, according to Puhr et al. (2017), …

Instead of solving the optimization in (3.2), WLE (Manski & Lerman, 1977) considers the weighted log-likelihood function:

log L_W(P(Y|X, β)) = Σ_{i=1}^n w_i [y_i log(π(x_i)) + (1 − y_i) log(1 − π(x_i))]   (4.4)

with w_i = (τ/ȳ) y_i + ((1 − τ)/(1 − ȳ))(1 − y_i).

In (4.4), w_i is the weight of the i-th observation in the sample data, where τ and ȳ are the proportions of the positive class in the population and in the sample, respectively. WLE outperforms prior correction in both cases of large sample data and a misspecified model (Xie & Manski, 1989). In a small sample set, WLE may be asymptotically less efficient than prior correction, though the differences are insignificant (Scott & Wild, 1986). In addition, misspecification is a common issue in social science studies. Therefore, WLE should be preferred to prior correction (King & Zeng, 2001; Xie & Manski, 1989).
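A minimal WLE sketch follows: the weights w_i of (4.4) are passed as sample weights to an (essentially) unpenalized logistic fit in scikit-learn; the population proportion τ is assumed known here, which, as noted above, is rarely the case in practice:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
tau, ybar = 0.10, y.mean()        # hypothetical population rate vs the sample rate

# w_i = tau/ybar for positives and (1 - tau)/(1 - ybar) for negatives, as in (4.4)
w = np.where(y == 1, tau / ybar, (1 - tau) / (1 - ybar))

wle = LogisticRegression(C=1e6, max_iter=1000)    # very large C: almost no penalty
wle.fit(X, y, sample_weight=w)
print("intercept:", wle.intercept_[0].round(3))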

There were some studies following the weighting method for solving LR with ID. Maalouf and Trafalis (2011) combined weighting, regularization, kernelization, and numerical methods. Maalouf and Siddiqi (2014) applied the truncated Newton method on WLE. These works studied problems where the value of τ was available. Meanwhile, in general cases, the information about the population proportion τ is unknown. There was only one study dealing with a part of this gap in the literature. Ramalho and Ramalho (2007) provided a generalized method of moments estimator applying moment conditions for endogenously stratified samples. However, the investigation of the effectiveness of the proposed method was based on a simulation study according to Cosslett's design. That seems not enough cases to evaluate the performance of the proposed method.

PLR has the general form:

log L*(P(Y|X, β)) = log L(P(Y|X, β)) + A(β)   (4.5)

In (4.5), the term A(β) could be:

• Jeffreys prior (Firth-type): A(β) = (1/2) log(det(I(β))), where I(β) is the Fisher information matrix (Firth, 1993).

• Normal prior (Ridge): A(β) = −λ Σ_{j=1}^p β_j^2, where λ > 0 (Maalouf & Trafalis, 2011; Park & Hastie, 2008).

• Laplace prior (Lasso): A(β) = −λ Σ_{j=1}^p |β_j|, where λ > 0 (Fu et al., 2017; Li et al., 2015).

Firth-type penalization (FIR) can reduce the small-sample bias of the MLE of the parameters. However, FIR introduces a bias in the scores, which are pulled toward the value of 0.5. The bias is significant in the case of high ID. To overcome this drawback, Puhr et al. (2017) suggested two modifications of FIR, which were the intercept correction (FLIC) and the adjustment for an artificial covariate (the added covariate approach, FLAC). Although FLIC and FLAC perform better than FIR, they cannot beat Ridge on most empirical and simulation data sets (Puhr et al., 2017). Besides, FIR, FLIC, and FLAC are quite sensitive to initial values in the computational process of the maximum likelihood estimation.

Ridge possesses a similar idea to Lasso, which is discussed in Subsection 3.1.1.4. In Ridge and Lasso, the penalty parameter λ controls the magnitudes of the estimates of β_j (j ≠ 0) (denoted β̂_j), which can be found by the coordinate descent algorithm (Friedman, Hastie, & Tibshirani, 2010). The optimal λ is usually determined by the cross-validation procedure (CV), which is based on the default threshold of 0.5 and minimizes the cross-validation error rate (or maximizes the cross-validation accuracy). Ridge can compete with FLIC and FLAC. However, Ridge usually leads to a dense estimation of β, which consists of very few zero values of β̂_j, due to the property of the normal prior. Thus, on high-dimension data, Ridge takes a long interval of computation time.

Analogously to Ridge, Lasso is a penalized shrinkage estimator. Besides, Lasso is a feature selection method without a p-value. Lasso retains only the predictors closely relevant to the response. On high-dimension data, Lasso does not spend as much time as Ridge because of the exclusion of predictors. However, Lasso does not directly deal with ID. Some studies applied SMOTE to re-balance data before performing Lasso (Kitali et al., 2019; Shrivastava et al., 2020). Despite its popularity, SMOTE can cause the overlapping problem, which decreases the performance measures of classifiers.

Theproposedworks

Reviewing the literature on LR with ID leads to some conclusions LRcan stillemploytheabilityofinterpretationwiththepenalizedversionofLasso.TodealwithID,itshould beconsideredthehybridofbothintrinsic(algorithm- level)andextrinsic(data- level)algorithmsofLassoLogisticregression(LLR).Moreover,thedata- levelapproachshouldbeexaminedtoboosttheadvantagesandrestrictthedisadvantagesofre- samplingtechniques.Forexample,SMOTEshouldbeonlyappliedtothesafesubsetofthemi norityclasswhichconsists oftypicalsamplesofthepositiveclass.Besides,thealgorithm- levelapproachcanbeused to modify the computation process of LR and can supporttheapplication of re-samplingt e c h n i q u e s

Inspired by the idea of the hybrid of the data and algorithm-level approaches

1 to addressing ID for LR, this chapter proposes a modification of LR named

F-measure-oriented Lasso-Logistic regression(F-LLR).

4.3.1 The modification of the cross-validation procedure

F-LLR utilizes LLR as a base algorithm. Instead of using CV to find the optimal λ, a modification of CV, called the F-measure-oriented cross-validation procedure (F-CV), is proposed. In F-CV, the criterion to evaluate the optimal λ is the F-measure, a more suitable metric than accuracy on ID. The details of CV and F-CV are described in Tables 4.1 and 4.2.

Table 4.1: Cross-validation procedure for Lasso Logistic regression

Input: A training data set T, a series of {λ_i}_{i=1}^h, an integer K (K > 1).
1. Divide T randomly into K equal subsets {T_k}_{k=1}^K.
2. for i = 1 to h do
3.   for k = 1 to K do
4.     On T\T_k, apply LLR with λ_i to get the fitted model LLR(λ_i).
5.     Use LLR(λ_i) to score all samples of T_k.
6.     Compare the scores with the threshold 0.5 to get the labels.
7.     Compute the accuracy ACC_ik on T_k.
8.   end for
9.   ACC_i: the mean of {ACC_ik}_{k=1}^K.
10. end for; choose λ_{i0} corresponding to the maximum ACC_i.
Output: The classifier LLR(λ_{i0}) and the optimal penalty λ_{i0}.

Under the notations in Table 4.2, with every threshold α_j, the cross-validation F-measure, F_ij, is an estimate of the testing F-measure of the fitted model LLR(λ_i). When the penalty parameter λ and the threshold α take all values in the series {λ_i}_h and {α_j}_l, respectively, the F_ij determined at Step 10 is an estimate of the highest testing F-measure of LLR(λ) on the data set T. Therefore, F-CV indicates not only the optimal penalty parameter λ_{i0} but also the optimal threshold α_{j0} corresponding to F_{i0 j0}. The computation process of F-CV is illustrated in Figure 4.1.

Figure 4.1: The computation process of F-CV

Table 4.2: F-measure-oriented cross-validation procedure (F-CV)

Input: A training data set T, a series of penalties {λ_i}, i = 1, ..., h, a series of thresholds {α_j}, j = 1, ..., l, an integer K (K > 1).
1. Randomly partition T into K equal-sized subsets T_1, ..., T_K.
2. For i = 1, ..., h:
3.   For j = 1, ..., l:
4.     For k = 1, ..., K:
5.       On T\T_k, construct the fitted model LLR(λ_i).
6.       Use LLR(λ_i) to score the samples of T_k.
7.       Compare the scores with α_j to get the labels of T_k.
8.       Compute the F-measure F_ijk of the predicted labels on T_k.
9.     End for k; compute the cross-validation F-measure F_ij as the mean of F_ijk over the K subsets.
10. End for j and i; determine (i_0, j_0) such that F_i0j0 = max_{i,j} F_ij.
Output: The classifier LLR(λ_i0), the optimal penalty λ_i0, and the optimal threshold α_j0.

There are three differences between CV and F-CV. Firstly, CV fixes a threshold of 0.5 to distinguish the positive and negative samples, while F-CV considers a series of thresholds {α_j}. Secondly, CV determines the optimal λ based on the cross-validation accuracy (denoted ACC_i in Step 9 of Table 4.1), which is the mean value of the accuracy metrics over the subsets T_k (k = 1, ..., K); in contrast, F-CV utilizes the F-measure instead of accuracy. Finally, F-CV points out the optimal threshold for the classification process, while CV cannot.
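As an illustration, the following Python sketch implements the F-CV loop of Table 4.2 with scikit-learn's L1-penalized LogisticRegression standing in for LLR (C = 1/λ); the function name f_cv and the toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def f_cv(X, y, lambdas, thresholds, K=5, seed=0):
    """Pick the (lambda, alpha) pair maximizing the mean CV F-measure."""
    skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=seed)
    mean_f = np.zeros((len(lambdas), len(thresholds)))
    for i, lam in enumerate(lambdas):
        # Lasso-Logistic regression: L1 penalty, C is the inverse of lambda.
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / lam)
        fold_f = np.zeros((K, len(thresholds)))
        for k, (tr, te) in enumerate(skf.split(X, y)):
            scores = clf.fit(X[tr], y[tr]).predict_proba(X[te])[:, 1]
            for j, alpha in enumerate(thresholds):
                # Compare the scores with alpha to get the labels of T_k.
                fold_f[k, j] = f1_score(y[te], (scores >= alpha).astype(int))
        mean_f[i] = fold_f.mean(axis=0)  # cross-validation F-measure F_ij
    i0, j0 = np.unravel_index(mean_f.argmax(), mean_f.shape)
    return lambdas[i0], thresholds[j0]

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
lam0, a0 = f_cv(X, y, lambdas=[0.001, 0.01, 0.1], thresholds=[0.2, 0.35, 0.5])
print("optimal lambda:", lam0, "optimal threshold:", a0)
```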

Table 4.3: Algorithm for the F-LLR classifier

Input: A training data set T_0; the positive and negative classes S_0^+ and S_0^-; a series of penalties {λ_i}, i = 1, ..., h; a series of thresholds {α_j}, j = 1, ..., l; an integer K.
Stage 1:
• Apply F-CV to T_0 to obtain the fitted model LLR(λ_0).
• Apply LLR(λ_0) to score all samples of T_0.
Stage 2:
• Order the samples of S_0^+ and S_0^- by their scores from the highest to the lowest.
• Remove a proportion of the upper high-scored samples of S_0^- to obtain the reduced negative class S_1^-.
• Determine the subset of S_0^+ consisting of the (r_S × |S_0^+|) upper high-scored samples, called S_0^++.
• Apply SMOTE to S_0^++ to create (m − 1) r_S × |S_0^+| synthetic samples; adding them to S_0^+ gives the enlarged positive class S_1^+.
Stage 3:
• Apply F-CV to the balanced training set T_1 = S_1^+ ∪ S_1^-.
Output: The classifier F-LLR, that is, LLR with the optimal penalty and optimal threshold returned by F-CV on T_1.

Here |A| denotes the number of samples of the data set A.

In summary, F-LLR proceeds in three stages. First, an initial LLR model is fitted on the original training set by F-CV and used to score all samples. Then, according to the samples' scores, under-sampling and SMOTE are respectively applied to balance the training data set. Finally, on the balanced data set, LLR-based F-CV builds the classifier F-LLR.

The combination of under-sampling and SMOTE aims to remove the useless samples and increase the useful ones. The higher a negative sample's score, the greater its chance of being misclassified; such samples may be noise, borderline, or overlapping samples, which decrease the performance measures of classifiers. Thus, instead of applying random under-sampling, only the proportion of the negative class containing the upper high-scored samples is eliminated.

Next, instead of utilizing the whole minority class, SMOTE is performed only on the subset consisting of the positive samples with upper high scores. This idea contrasts with the application of under-sampling. The high-scored positive samples are usually identified correctly across thresholds, and emphasizing them highlights the prominent characteristics of the positive class. Creating more neighbors of these samples therefore provides more useful information for identifying the positive class. Furthermore, these high-scored positive samples often lie in the safe region far from the borderline, so applying SMOTE there can prevent overlapping issues. Figure 4.2 illustrates the rationale behind the steps of the F-LLR classifier, and a code sketch of this score-guided re-balancing follows below.
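The score-guided re-balancing of Stage 2 can be sketched as follows with imbalanced-learn's SMOTE. The function rebalance_by_scores and the parameter name drop_rate are illustrative assumptions (the dissertation does not name the under-sampling proportion here); r_s and m follow the notation of Table 4.3:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def rebalance_by_scores(X, y, scores, drop_rate=0.2, r_s=0.3, m=3, seed=0):
    """Stage 2 of F-LLR (sketch): y == 1 positives, y == 0 negatives."""
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]

    # Under-sampling: remove the upper high-scored negatives, which are
    # likely noise, borderline, or overlapping samples.
    neg_sorted = neg[np.argsort(-scores[neg])]
    neg_keep = neg_sorted[int(drop_rate * len(neg)):]

    # S++: the r_s * |S+| upper high-scored positives (the safe subset).
    pos_sorted = pos[np.argsort(-scores[pos])]
    s_pp = pos_sorted[: max(2, int(r_s * len(pos)))]
    n_syn = int((m - 1) * r_s * len(pos))  # (m-1) * r_s * |S+| synthetics

    # Run SMOTE on S++ only, against the kept negatives as the other class.
    X_sub = np.vstack([X[neg_keep], X[s_pp]])
    y_sub = np.concatenate([np.zeros(len(neg_keep), int),
                            np.ones(len(s_pp), int)])
    sm = SMOTE(sampling_strategy={1: len(s_pp) + n_syn},
               k_neighbors=min(5, len(s_pp) - 1), random_state=seed)
    X_res, _ = sm.fit_resample(X_sub, y_sub)
    X_syn = X_res[len(X_sub):]  # synthetic positives are appended at the end

    # Balanced training set T1: kept negatives, all positives, synthetics.
    X_bal = np.vstack([X[neg_keep], X[pos], X_syn])
    y_bal = np.concatenate([np.zeros(len(neg_keep), int),
                            np.ones(len(pos) + len(X_syn), int)])
    return X_bal, y_bal
```

Feeding the returned balanced set into an F-CV routine such as the f_cv sketch above would complete the F-LLR pipeline.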

4.4 Empirical study

Credit scoring is a typical example of imbalanced classification since the number of bad customers is always far less than the number of good ones. Eight credit scoring data sets are used in the experimental study: the Australian data (AUS), German data (GER), Taiwanese data (TAI), Credit risk data (Credit 1), Credit card data (Credit 2), Credit default data (Credit 3), Bank personal loan data (BANK), and Vietnamese data (VN4). Moreover, a data set of hepatitis patients (HEPA), which is not only imbalanced but also has a small positive class, is also investigated. The data sets BANK, GER, TAI, and HEPA were used in the empirical study in Chapter 3.

The nine empirical data sets suffer an imbalanced status at different levels, evaluated by the imbalanced ratio (IR). Some characteristics of the data sets are presented in Table 4.4 in increasing order of IR. The first group of data sets, including AUS, GER, TAI, and Credit 1, is imbalanced at a low level (IR ≤ 5). The AUS, GER, and TAI data sets, publicized on the UCI machine learning repository, are familiar in credit scoring studies. Credit 1 is a subset randomly drawn from the original data set at the rate of 20% to save computation time; it maintains the same IR as the original data on the Kaggle website. The second group, consisting of Credit 2, Credit 3, BANK, and HEPA, suffers an average imbalanced status (5 < IR ≤ …).
