MINISTRY OF EDUCATION & TRAINING
STATE BANK OF VIETNAM
HO CHI MINH CITY UNIVERSITY OF BANKING

NGUYEN THI NGOC ANH

APPLICATION OF MACHINE LEARNING FOR PREDICTING PROBABILITY OF DEFAULT OF SMALL AND MEDIUM ENTERPRISES

GRADUAT[.]

2023

The urgency of the research

After the economic downturn caused by the Covid-19 pandemic in 2020, the global economy is on a recovery path, with economies gradually reopening and vaccination campaigns continuing. The pandemic, however, is not yet over. Several new waves of infection have been triggered by more transmissible coronavirus variants, in particular the Delta variant. According to Smid and Iulian Ciobica (2021), global insolvencies are forecast to increase dramatically by 33% in 2022, compared to 2019. In 2021, insolvencies continued to decrease in North America and the Asia-Pacific due to significant US financial support and a strong economic recovery. Insolvencies, however, are predicted to climb in all three regions, with the highest increases expected in Asia-Pacific and slight increases expected in Europe and North America.

Source: Atradius, 2022

Looking at Figure 1.1, we can see that bankruptcies will be greatest in Italy (+34%), the United Kingdom (+33%), and Australia (+33%) as compared to pre-pandemic levels. Moreover, bankruptcy levels in the Netherlands are likewise quite high (+26%) in 2022 compared to pre-pandemic levels. Other major economies, including Spain (+26%), France (+23%), and the United States (+6%), are also forecast to see a higher rate of insolvency in 2022. Up to 2022, several nations' bankruptcy trends are rather constant, such as Germany (+2%) and, to a lesser extent, Sweden (+3%) and Japan (+4%).

Furthermore, the Covid-19 outbreak has also wreaked havoc on Vietnamese enterprises since 2019. According to the report "The impact of Covid-19 on businesses in Vietnam: a quick survey on businesses and Covid-19" by the World Bank (2020), about 50% of small businesses and more than 40% of medium businesses have had to close temporarily or permanently due to the impact of the Covid-19 pandemic. In 2021, the number of enterprises suspending business for a definite term was nearly 55,000, an increase of 18% compared to the previous year; 48.1 thousand enterprises stopped operating and waited for dissolution procedures, an increase of 27.8%; and 16.7 thousand enterprises completed dissolution procedures, a decrease of 4.1%. Of the latter, 14.8 thousand enterprises had capital of less than 10 billion VND, a decrease of 4%, and 211 enterprises had capital of over 100 billion VND, a decrease of 20.7%. On average, nearly 10,000 businesses withdraw from the market each month because they cannot withstand the brutality of the Covid-19 pandemic.

Having said that, while the Covid-19 pandemic has been severely affecting the global economy and driving up enterprise insolvencies, the internal credit rating system at commercial banks remains vital in assessing the credit risk of clients and assisting the bank in making credit decisions and managing risk. Commercial banks in Vietnam are increasingly realizing the importance of this system in their credit operations and risk management, particularly as they strive to satisfy the

From that perspective, the following specific aspects demonstrate the urgency of this study topic:

First, existing Probability of Default (PD) models have certain drawbacks, which makes it more challenging to choose the appropriate model for predicting the probability of a company's default (Huseyin & Bora, 2009). Additionally, research by Aysegul Iscanoglu (2005), as well as Hayden and Daniel (2010), illustrates the pros and cons of many related research models in predicting a company's PD, such as the Discriminant Analysis model, Logistic Regression model, Decision Tree model, and Artificial Neural Networks model (ANN model). In fact, many in-depth analyses of the above models have been conducted. Platt (1991) utilized the Logistic model to test and choose financial variables and argued that using the industry's average financial variables rather than the financial variables from a single firm's business bankruptcy report is preferable. Lawrence (1992) utilized the Logistic model to forecast collateralized loan default probabilities. Altman (1968) utilized Discriminant Analysis models to identify a linear function of financial and market variables that best distinguished between bankrupt and solvent firms.

Second, determining which financial indicators have a great influence on the choice of model to predict the default probability of small and medium enterprises is always the main objective. Altman (1968) evaluates 22 potentially useful financial parameters, including working capital/total assets, retained earnings/total assets, earnings before interest and tax (EBIT)/total assets, market value of equity/book value of total debt, and sales/total assets, and picks five that offer the greatest overall forecast of business insolvency when combined. The study uses a matched sample of 66 manufacturing enterprises (33 failed and 33 non-failed), in which the failed firms filed for bankruptcy between 1946 and 1965. In the next period, Regression Models are built with 8 variables, namely Exposure/Assets, Turnover/Assets, Owner > 10 years, Leverage Ratio, Liquidity Ratio, Current Ratio, Trade Debtors/Assets, and Profitability Ratio, to present a granular regression-based analysis of the predictors of default in small and medium-sized enterprises (McCann & McIndoe-Calder, 2012). Gupta and his co-authors used 30 variables (Current Ratio, Quick Ratio, Earnings before interest, taxes, depreciation and amortization/interest expense, Working capital, etc.) in 8 groups (Leverage, Liquidity, Financing, Profitability, Activity, Growth, Other, and Control) in predicting SME failure. In 2022, Gallucci et al. chose 23 variables, including Debt/EBITDA, Current ratio, Cash flow/turnover, and EBITDA/cash flow, in 3 groups (Profitability, Leverage, and Liquidity).

Third, default forecasting approaches based on statistical techniques (a reduced-form regression approach) have grown widely in the Banking sector since the Basel II Accord. The models are standardized using Statistical approaches; however, the selection of important predictors is based on assumptions about the structural elements that affect a corporation's financial health (Resti and Sironi, 2007). Recently, because of the availability of massive datasets and unstructured data, a growing body of research indicates that models based on Machine Learning Techniques are also a viable option for modeling default risk. Machine Learning refers to a class of models that can handle challenging forecasting tasks when the link between predictors and outcomes is complicated or uncertain. As a result, as demonstrated in several studies (Brown & Mues, 2012; Barboza et al., 2017), Machine Learning techniques may deliver very accurate out-of-sample forecasts without imposing strict constraints on the structure of the data generation process.

Finally, the Government has developed a legal framework for the credit rating sector to improve the transparency of information, assist banks in controlling credit risks, promote capital mobilization through the stock market, support the stock and bond markets, and protect investors' rights and interests. Research on, and the selection of, appropriate rating models will contribute substantially to the expansion of credit rating operations in Vietnam. In particular, credit rating services and the operating conditions of credit rating enterprises in Vietnam are prescribed in Decree No. 88/2014/ND-CP dated September 26, 2014.

The selection of a model for predicting the PD of a corporate based on Machine Learning approaches is therefore one of the credit risk control measures used to classify customers at the outset and manage the bank's default risk, as recommended by the Basel Committee (Basel II, 2004).

Therefore, the thesis focuses on the issue of “Application of Machine Learning for predicting the probability of default of Small and Medium Enterprises (SMEs)” to provide a theoretical basis and empirical evidence related to the selection of an appropriate model for predicting enterprises' insolvency, in order to contribute to improving the credit risk control of commercial banks.

Research Objectives

The major objective of the current research is to use Machine Learning techniques, compared with Statistical approaches, to predict the Probability of Default of SMEs.

The specific objectives are as follows. The first is to investigate which financial indicators have a great impact on choosing the PD prediction model, to assist the Vietnamese Commercial Banks System and Credit Rating Agencies in identifying their potential SME customers. The second is to investigate which model is the most appropriate for predicting the probability of default of SMEs in the Vietnamese Commercial Banks System by Machine Learning approaches. The last is to assess the influence of Machine Learning approaches on predicting SMEs' PD.

Research Questions

To achieve the research objectives, the thesis focuses on answering the major research questions:
i. Which financial indicators have a great influence on the choice of model to predict the default probability of small and medium enterprises?
ii. How do Machine Learning approaches influence the prediction of the PD of SMEs, and which Machine Learning model gives the best results in predicting the PD of SMEs?

Research Subject and Scope

Research Subject

The subject of this study is the Probability of Default of SMEs in Vietnam that meet one of the two conditions listed below:
i. Total capital is less than 100 billion VND.
ii. Total income for the previous year is less than 500 billion VND.

Research Scope

Data were collected from the financial indicators in the financial statements of 400 SMEs in Vietnam in the period 2019-2021.

Research Contributions

The thesis findings are scientifically and practically significant in the following aspects:
i. Systematically analyzing the basic and background theories related to PD prediction models and the criteria for selecting the appropriate model.
ii. Comprehensively reviewing published research to identify gaps in previous studies related to the selection of the most appropriate model to predict the PD of SMEs at Vietnamese Commercial Banks by applying Machine Learning to analyze financial indicators. This provides a solid foundation for researchers to carry out more relevant studies.
iii. Proposing a suitable credit rating model capable of forecasting the PD of SMEs based on financial indicators, to contribute to enhancing the efficiency of credit risk control at Vietnamese Commercial Banks in the future.

Research Methodology

Qualitative methods: Discussing perspectives and investigating the elements that influence the prediction of the PD of SMEs by Machine Learning application at Vietnamese Commercial Banks. On that basis, scales have been constructed in order to conduct Quantitative research.

Quantitative methods: Using a variety of methodologies, including Descriptive Statistics to construct data based on related characteristics; the Comparative method between model and practice to draw conclusions; and the Analytic-Synthetic method to synthesize and analyze pertinent data in the research process.

Furthermore, the thesis used a Statistical Approach (the Logistic Regression model) compared with Machine Learning Techniques (Random Forest, Decision Tree, and Ensemble Learning) to determine how the independent variables affect the PD prediction of SMEs. Finally, the Confusion matrix and F1-Score are used to re-evaluate and choose the appropriate PD prediction models.

The Structure of Research

; research results and discussion; recommendation, and conclusion. Specific details are as follows:

The urgency of the research, research issue, research objectives, research questions, research subject and scope, research methodology, the topic's contributions, and the thesis's structure are all presented in this chapter to provide readers with an overall view of the complete study.

Illustrating basic and background theories related to the Probability of Default, an overview of the models used to predict the PD of SMEs (the Structural models and the Non-Structural models), as well as the assessment results of previously published studies, to clarify the urgency of the topic and provide a foundation for proposing research models and analyzing the research results presented in the following chapters.

Based on the theoretical basis of Chapter 2, Chapter 3 presents the methodological model framework, a full description of the collected data, the selection of input variables, and the suggested PD prediction models to show the confidence level of the study results.

Applying Machine Learning Techniques, in particular the Decision Tree model, Random Forest model, and Ensemble Learning model, compared with the Logistic Model, to re-evaluate the PD prediction ability of each model and conclude which model is more appropriate for predicting the PD of SMEs.

Summarizing the thesis's findings; proposing solutions to assist Vietnamese Commercial Banks and Credit Rating Agencies in improving their ability to predict the PD of SMEs; promptly orienting policy and adjusting Commercial Banks' credit-granting to achieve greater performance, reduce credit risks, and ensure capital safety; providing certain results for corporate governance regulations in order to reduce the risk of bankruptcy; and discussing the limitations and current issues, as well as the research directions that should be pursued in the future.

This chapter illustrates basic and background theories related to the PD of enterprises, Machine Learning approaches, and related measuring and forecasting methods, as well as the assessment results of previously published studies, to clarify the urgency of the topic and provide a foundation for proposing research models and analyzing the research results presented in the following chapters.

Probability of Default (PD)

The PD is one of three main components for determining credit risk. This study focuses on the PD: the likelihood of a default over a certain period (the assessment period). It estimates the risk that a borrower will be unable to satisfy his or her debt commitments. PD is employed in a wide range of credit evaluation and risk management systems.

The updated Basel framework requires banks to estimate the PD of all their counterparties using the Internal Ratings-Based approaches. Basel II employs the following definition for the probability of default: “The PD, stated as a percentage, is the likelihood that a borrower would fulfill the default criterion within one year. A default is regarded to have occurred concerning a certain obligor when one or both of the following conditions occur: the obligor is 90 days past due on any substantial credit obligation, and the obligor is unlikely to satisfy its credit obligations (Basel Committee on Banking Supervision, 2006)”.

The Office of the Comptroller of the Currency defines the PD as follows: “PD is the risk that the borrower will be unable or unwilling to repay its debt in full or on time. The risk of default is derived by analyzing the obligor's capacity to repay the debt following contractual terms. PD is generally associated with financial characteristics such as inadequate cash flow to service debt, declining revenues or operating margins, high leverage, declining or marginal liquidity, and the inability to successfully implement a business plan. In addition to these quantifiable factors, the borrower's willingness to repay also must be evaluated.”

The PD is an important risk management metric that may be utilized for loan requests, rating estimates, credit derivative pricing, and many other essential financial domains. Incorrect estimation of the PD leads to unjustified ratings and the wrong pricing of financial instruments, and was, therefore, one of the causes of the global financial crisis.

Overview of the models used to predict the Probability of Default of SMEs

The Structural Models

1 A recent line of study has proposed approaches to improve the interpretability of Machine Learning models (see Guidotti et al., 2019). Furthermore, the application of ML models for inference is discussed in several papers (see Chernozhukov et al., 2018).

Regression Analysis Models
Overview
• Regression analysis is a collection of statistical techniques used to estimate the associations between a dependent variable and one or more independent variables.
• Orgler (1970) discovered a linear relationship between a customer's characteristics and their default status.

Discriminant Analysis Models
Overview
• Discriminant Analysis was introduced by Fisher (1936) to distinguish solvent and insolvent borrowers as accurately as possible using a linear discriminant function, which is the approach predominantly used in practice.
• Evaluates the dependency of only linear financial indicators (from the annual financial statements of SMEs).
• The enterprise's financial state, internal or external environment of activities, and development trends are not thoroughly assessed.

Logistic Models
Overview
• Because the standardized coefficients do not reveal the relative importance of distinct variables, some writers, beginning with Ohlson (1980), utilize a logit model, which does not need such restrictive assumptions.
• Following Ohlson's (1980) article, which examined a dataset of US enterprises from 1970 to 1976 and constructed a logit model using nine financial variables as regressors, most of the academic literature employed logit models to forecast default, such as Mossman et al. (1998) and Becchetti and Sierra (2003).
• The Logistic Model investigates the relationship between an ordinal or binary response probability and explanatory variables to estimate the Probability of Default of a business, in particular, how high the bankruptcy risk is.

The Non-Structural Models

Decision Tree Model (DT)
Overview
• In Classification and Regression Trees (CART), the decision tree model, which is a non-parametric model, is used widely in the classification and prediction of enterprises (Breiman & Ihaka, 1984).
• Hunt and his co-workers (Hunt, Marin, and Stone, 1966) initially developed the decision tree model idea. Later, Tsai and Wu (2008) built a bank's credit rating model using the C5.0 decision tree algorithm.
• The DT algorithm is a methodology that solves classification or regression problems by organizing decision rules into a tree structure (James et al., 2013).

Random Forest Model
Overview
• The random forest (RF) method is a machine learning approach that employs multiple decision trees (DTs). When creating a new DT, the RF algorithm randomly selects a certain number of explanatory variables.
• RF is a supervised learning method. It is used for classification and regression, just like decision trees.

Artificial Neural Network Model (ANNs)
Overview
• An ANN is a machine learning system designed to mimic the way real human brains work.
• (1999) analyze several methodologies and indicate that Fisher discriminant analysis and probabilistic neural networks outperform them in terms of prediction performance.
• With a sample of over 7,000 Italian SMEs, Ciampi and Gordini (2013) investigate ANNs. They demonstrate that ANNs may contribute more to SME credit risk appraisal and that ANN prediction accuracy is particularly high for the smallest enterprises.
• Devoting a significant amount of time to off-line training.

Ensemble learning is a method for attempting to improve the performance of base algorithms by combining many algorithms to produce new, higher-performing classifiers that are more suited to skewed class situations.

Ensemble Learning
Overview
• Ensemble classifiers are known to increase the accuracy of single classifiers by combining several of them and have been successfully applied to imbalanced datasets (Fan et al., 2011).
• To improve the classification performance of unbalanced data, ensemble learning approaches are more successful than data sampling strategies (Khoshgoftaar et al., 2015).

Source: Statistics from the author

In 2018, Shigeyuki Hamori et al. used three ensemble learning approaches (bagging, random forest, and boosting) as well as eight neural network methods with varying activation functions. The ability of each approach to forecast the default risk of enterprises is measured using many indicators (accuracy, rate of prediction, outcomes, receiver operating characteristic (ROC) curve, area under the curve (AUC), and F-score).

Previous Related Research

In general, the strengths of these models are twofold: first, their capacity to calculate the certainty (probability) of the outcomes; and second, their ability to assess the impact of each element individually. In spite of their widespread use in both research and industry, these kinds of models have proven error-prone, implying the need for improvements in the modeling of default and risk (Begley, Ming, & Watts, 1996). Furthermore, they have a limited prediction horizon, usually not more than a year (Altman, 2014; Altman et al., 2017). In addition, they cannot be automatically applied to huge time-series data and must rely on the usual mean-value theory; nonetheless, for the most part, extreme occurrences are the determining variables, and so extreme-value theory may give superior insight (Baldi, Manerba, Perboli, & Tadei, 2019; Perboli, Tadei, & Gobbato, 2014). To overcome the constraints of Statistical models, research showing how Machine Learning models outperform traditional classification approaches has been actively produced (Barboza, Kimura, and Altman, 2017). Several writers have contributed to the extensive literature on predicting the PD of SMEs by applying Machine Learning in recent decades.

First, the study of Guido Perboli and Ehsan Arabnezhad (2021) applied models such as Random Forest, Neural Network, Logistic Regression, and Gradient Boosting, calibrated using 15 independent variables of financial statement data from over 160,000 Italian SMEs that were operating at the end of 2018, paired with data from around 3,000 enterprises that went bankrupt from 2001 to 2018. Their results are accurate at over 80% not only in the short term (12 months) but also in the medium term (36 months) and long term (up to 60 months).

Second, Liou (2008) evaluated three data mining methods for identifying fraudulent financial statements and predicting company failure: Logistic Regression, ANN, and the Decision Tree model. Using stepwise Logistic Regression, 19 of 52 financial ratios from prior research were shown to have substantial predictive value in predicting company failure. However, according to the study's findings, a Decision Tree obtains a larger proportion of success in company failure prediction.

Third, Ravi and Pramodh (2008) developed an ANN model to forecast default for Turkish and Spanish banks, taking into account 9 financial indicators for Turkish banks and twelve financial aspects for Spanish banks. The accuracy level of the model is 96.6% for the Spanish banks' dataset and 100% for the Turkish banks' dataset. However, due to the model's complexity, its structure is difficult to understand and is not within the modeler's control, so its usefulness is limited.

Fourth, the recent empirical literature is also gaining traction in understanding small businesses' credit risk behavior. Altman and Sabato (2007) investigated a panel of 2,010 U.S. SMEs that included 120 defaults between 1994 and 2002. They selected five accounting ratio categories that describe the key aspects of a company's financial profile: liquidity, profitability, leverage, coverage, and activity. For each of these categories, they developed a number of financial ratios identified in the literature as the most successful in predicting firm bankruptcy. Finally, five variables (one from each category) with the highest predictive power for SME default were chosen, and a distress prediction model for SMEs was created using the logistic regression technique. They do, however, recognize the importance of using qualitative data to improve the predictive performance of their model.

Finally, David Gun-Fie Yong and Wai-Ching Poon (2010) used a Multiple Discriminant Analysis (MDA) model to predict the PD for Malaysian corporates. A sample of 64 businesses using 16 financial indicator variables was examined. The research findings revealed that 7 financial index factors have a great influence, with high accuracy rates ranging from 88% to 94% for each organization before going into bankruptcy. However, this technique has limitations linked to essential assumptions in the model (assumptions of normal distribution and equal variance) that might be violated, decreasing the model's trustworthiness and applicability.

Step 1: Separation of the explanatory and predictor variables

Step 2: Split the data into training and test sets

Step 3: Setting up the applying model

Step 5: Making a prediction on the test data set

Step 6: Evaluating the effectiveness of the model

Based on the limitations of the existing studies, this chapter provides the specific methodology related to the models reviewed in Chapter 2 and sets out the appropriate PD prediction models, with a detailed description of each model and of the collected data, in order to examine the research assumptions set out in Chapter 1. In this section, three proposed models are used: Logistic models, Random Forest, and Decision Tree. The thorough and precise information in the following section illustrates the reliability of the empirical results.

Methodological Model Framework

This study focuses on the PD prediction of SMEs from different sectors that satisfy one of the two conditions: (i) Total capital is less than 100 billion VND; (ii) Total income for the previous year is less than 500 billion VND.

First, import the collected data with 14 variables. The 6 steps to build the models are as follows:

The proposed schematic process indicates how to build the PD prediction models of SMEs to achieve the research objectives. First of all, the raw data was collected from the Annual Audited Financial Statements (AAFS) of approximately 400 SMEs in 3 main industries in Vietnam from 2019 to 2021 and processed. At the following stage, based on the existing studies in Chapter 2, 13 independent input variables, which are financial indicators calculated from the AAFS of SMEs, were chosen. These independent variables are divided into four types of financial indicators: liquidity index, debt usage index, profitability index, and company performance indicators. After the input variables are chosen, the selected models, including Logistic models, Decision Tree, and Random Forest, are fitted to determine how these indicators affect the PD of SMEs. The process concludes with using the Confusion matrix and F1-Score to evaluate and choose the appropriate models.
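The evaluation step at the end of this process can be illustrated with a minimal sketch in plain Python. It assumes binary labels with 1 = default and 0 = non-default; the function names and the sample labels are hypothetical and are not taken from the thesis data.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 = default."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1          # default correctly flagged
        elif predicted == 1 and actual == 0:
            fp += 1          # healthy firm wrongly flagged
        elif predicted == 0 and actual == 1:
            fn += 1          # default that was missed
        else:
            tn += 1          # healthy firm correctly passed
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the default class."""
    tp, fp, fn, _ = confusion_matrix(y_true, y_pred)
    if tp == 0:
        return 0.0           # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for eight firms (illustrative only).
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]
tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
score = f1_score(y_true, y_pred)
```

Because defaults are usually the minority class, the F1-Score is often preferred over plain accuracy: a model that predicts "no default" for every firm can still score a high accuracy, but its F1-Score for the default class is zero.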

Data collection

To provide predictions for the PD of Vietnamese SMEs, a database is built from the AAFS of Vietnamese small and medium-sized firms for the three-year period from 2019 to 2021. These financial statements have been audited to guarantee that the data source is reliable. Based on the two criteria for choosing SMEs, the author considers 400 enterprises in 3 main industries: 97 enterprises in the Construction industry, 274 enterprises in the Service industry, and 29 Agriculture businesses.

The reason why this study was carried out over the last three years is that the state of the economy and corporate performance have remained mostly unchanged. According to Crouhy, Galai & Mark (2001), the major weakness of a risk estimation model is a sample based on previous financial data collected under situations that may or may not apply in the future, or data utilized in seldom-updated models. Moreover, they argued that credit risk analysis is based on several borrower characteristics, including finance, management, income, cash flow, asset quality, and liquidity. Therefore, this assumption can be built as follows:

These businesses are categorized into two groups: those that have filed for bankruptcy are assigned 1, and those that have not filed for bankruptcy are assigned 0. The Equity from the Balance Sheet, the Net Profit after Taxes from the Profit and Loss Statement (P&L), and the Operating Cash Flow from the Cash Flow Statement are used to separate these groups. If a company falls into one of the following categories, it is considered bankrupt (marked as 1):
i. Equity is negative.
ii. The Net Profit after Taxes and the Operating Cash Flow have been negative for the past two years.
iii. The company declares insolvency.
iv. The company's debt falls into the bad debt category (groups 3, 4, and 5).
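The labeling rule above can be sketched as a small Python function. This is an illustration under stated assumptions, not the author's implementation: the function and argument names are hypothetical, and the second criterion is read here as requiring both Net Profit after Taxes and Operating Cash Flow to be negative in each of the two most recent years.

```python
def is_default(equity, net_profit_history, operating_cf_history,
               declared_insolvent, debt_group):
    """Return 1 if any of the four criteria holds, otherwise 0.

    The history arguments are lists ordered from oldest to newest year.
    debt_group is the loan classification group (1 to 5).
    """
    negative_equity = equity < 0

    # Assumed reading of criterion ii: both series negative in both of
    # the last two years.
    two_year_losses = (
        len(net_profit_history) >= 2
        and len(operating_cf_history) >= 2
        and all(v < 0 for v in net_profit_history[-2:])
        and all(v < 0 for v in operating_cf_history[-2:])
    )

    bad_debt = debt_group in (3, 4, 5)

    return int(negative_equity or two_year_losses
               or bool(declared_insolvent) or bad_debt)


# Hypothetical firms (figures in billion VND, illustrative only).
healthy = is_default(100, [5, -1], [2, 3], False, 1)
loss_making = is_default(10, [-1, -2], [-3, -4], False, 1)
```

Keeping each criterion in its own named variable makes it easy to audit which rule triggered a label, which matters when the default flag is later used as the target variable.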

Missing, erroneous, or unimportant values in the data set shall be adjusted according to the following principles:

First, replace the missing data with the average of the borrowers' group. For example, a missing value for a bankrupt business will be replaced by the bankruptcy group's average.

Finally, the 5% or 95% quantile of the financial indicators will be used to replace abnormally volatile values.
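The two adjustment principles stated above can be sketched in plain Python as follows. The helper names are hypothetical, missing values are assumed to be encoded as `None`, and the quantile uses linear interpolation, which is one common convention; the thesis does not state which quantile definition was used.

```python
from statistics import mean


def impute_by_group_mean(values, labels):
    """Replace None with the mean of observed values in the same group."""
    group_means = {}
    for group in set(labels):
        observed = [v for v, g in zip(values, labels)
                    if g == group and v is not None]
        # Assumes every group has at least one observed value.
        group_means[group] = mean(observed)
    return [group_means[g] if v is None else v
            for v, g in zip(values, labels)]


def quantile(sorted_values, q):
    """Quantile of an already sorted list, with linear interpolation."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def winsorize(values, lower=0.05, upper=0.95):
    """Clip values below the 5% quantile or above the 95% quantile."""
    ordered = sorted(values)
    lo_cut = quantile(ordered, lower)
    hi_cut = quantile(ordered, upper)
    return [min(max(v, lo_cut), hi_cut) for v in values]


# Hypothetical indicator with two missing entries; label 1 = default group.
filled = impute_by_group_mean([1.0, None, 3.0, 10.0, None], [0, 0, 0, 1, 1])
clipped = winsorize(list(range(1, 22)))
```

Imputing within the default or non-default group, rather than with the overall mean, avoids pulling the two groups toward each other, although it does use the target label during preprocessing.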

Input Variables Selection

Financial indicators have been in widespread use and have played an essential role, in particular through commitment conditions, in predicting the PD of SMEs in the existing literature. Some scholarly articles suggest the use of financial indicators as sources of information for credit risk assessment (Demerjian, 2007). Alternatively, the customer's breach of the terms of the commitment would send indications regarding their capacity to repay their bank loans (Smith and Warner, 1979). Moreover, Dichev and Skinner (2002) discovered that commitment requirements on financial ratios would also influence the substance of credit contracts. The commitment constraints on financial ratios are extremely beneficial because "if the firm begins to show indications of distress, the bank can reclaim the loan or dispose of the assets before the company loses its solvency." (Lundholm and Sloan, 2004). In addition, Beaver (1966) gives empirical evidence that some financial indicators provide excellent statistical signals against real company failures. This study, however, focuses mostly on debt and liquidity ratios, which may not give enough financial information to provide reliable estimates.

Classical variables were used in the study, as reported by Altman and Sabato (2007), Shumway (2001), Bekhet and Eletter (2014), Addo et al. (2018), and Arora and Kaur (2020):
i. Profitability (Log EBITDA, EBITDA/Net Worth)
ii. Liquidity (Log Cash and Equivalents, Current Ratio, Quick Ratio)
iii. Performance (Log Net Worth)
iv. Leverage (Debt to Net Worth, Debt to EBITDA)
v. Coverage (EBITDA to IE)
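As a rough illustration of how the indicators listed above might be derived from statement items, the sketch below uses standard textbook definitions. The dictionary keys, the use of the natural logarithm, the reading of IE as interest expense, and the sample figures are all assumptions made here; the thesis does not spell out its exact formulas.

```python
import math


def financial_ratios(fs):
    """Compute the listed indicators from a dict of statement items.

    Log-transformed items must be strictly positive, otherwise math.log
    raises ValueError; how such firms are treated is not covered here.
    """
    return {
        # Profitability
        "log_ebitda": math.log(fs["ebitda"]),
        "ebitda_to_net_worth": fs["ebitda"] / fs["net_worth"],
        # Liquidity
        "log_cash": math.log(fs["cash_and_equivalents"]),
        "current_ratio": fs["current_assets"] / fs["current_liabilities"],
        "quick_ratio": (fs["current_assets"] - fs["inventory"])
                       / fs["current_liabilities"],
        # Performance
        "log_net_worth": math.log(fs["net_worth"]),
        # Leverage
        "debt_to_net_worth": fs["total_debt"] / fs["net_worth"],
        "debt_to_ebitda": fs["total_debt"] / fs["ebitda"],
        # Coverage
        "ebitda_to_interest_expense": fs["ebitda"] / fs["interest_expense"],
    }


# Hypothetical statement items (billion VND, illustrative only).
sample = {
    "current_assets": 200.0, "inventory": 50.0, "current_liabilities": 100.0,
    "cash_and_equivalents": 40.0, "total_debt": 300.0, "net_worth": 150.0,
    "ebitda": 60.0, "interest_expense": 20.0,
}
ratios = financial_ratios(sample)
```

Grouping the outputs by category mirrors the list above and makes it straightforward to check that each category contributes at least one variable to the model.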

Moody's and Standard and Poor's disclosed various essential financial measures in their rating process, including the Total-Debt-to-Total-Assets Ratio; Earnings Before Interest and Taxes (EBIT); the prospect of good business (a rise in cash flows or a rise in asset returns); dividends and other payments; business risk (cash flow or asset value variations); and liquidity of assets.

In a 2001 paper, Moody's developed RiskCalc v3.1, a quantitative credit rating methodology for evaluating enterprises in Japan's high-end sector. This model employs seven variables: Profitability, Financial Leverage, Liquidity, Principal Repayment, Interest Repayment, Size, and Operating Ratio (Appendix - Table 1).

Other, and Control, which are 8 groups along with 30 variables, are developed by Gupta et al. (2018) (Appendix - Table 2).

This collection of indices represents the enterprise's capacity to transform assets intocasht o p a y o f f s h o r t - t e r m d e b t s o r d e m o n s t r a t e t h e e n t e r p r i s e ' s a b i l i t y t o p a y s h o r t - term solvency The

Current Ratio, Quick Ratio, Log Working Capital, and Log Cashand Equivalents are typically included in this category of metrics The lower thesecoefficients,thegreaterthebusiness'schanceofinsolvency.

This category of metrics assesses how much capital is in the form of debt (loans) orevaluates a company's ability to meet its financial commitments such as Debt to

NetWorth, Debt to EBITDA The larger the Debt Ratio and the lower the Solvency, thegreaterthe chance ofacorporate default.

This category of metrics assesses an enterprise's profitability using indicators includingLog EBITDA – EBITDA is frequently used to assess a company's financial position,accordingtoPompeandBilderbeek(2005);EBITDA/NetWorth.Itisconsideredt hata corporation's profitability is a strong predictor of whether or not the firm wouldthereafter be able to pay off its debt commitments, a loss-making company that willeventuallydeplete itsequitysourcesandisunabletorepaythedebt.

The performance indicator demonstrates how well the company has utilized its assets. The lower the index, the greater the business's credit risk, because its assets, particularly Log Net Worth, are not being used efficiently.

Based on the findings of the preceding investigations, the author chose 13 financial indicators as independent variables (10 numeric variables and 3 categorical variables) for the credit rating models used in this study. The two tables below illustrate how the 13 independent variables are determined, as well as their sign expectation in the default prediction model.

Log EBITDA (Earnings beforeinterest,taxes,depreciation, andamortization)

The Probability of Default prediction models

Logistic Regression Model

In statistics, the Logit Model is a regression model in which the dependent variable (Y) is a dummy (binary) variable with just two possible values, 0 and 1, indicating two separate events (default and non-default). The independent variables Xi may be discrete or continuous, and the probability P is the object of the research.

The following function depicts the connection between the dependent variable (Y) and the independent variables:

The estimated value of Y generated by regressing Y on the independent variables Xi is referred to as Ŷ.

The following is the formula for determining the chance of a business's insolvency using the logit model:
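For reference, the textbook form of the logit model is sketched below. The coefficient symbols β0, …, βk are notation assumed here rather than taken from the thesis; Y, Ŷ, Xi and P follow the definitions above.

```latex
% Link between the log-odds of default and the independent variables
Y = \ln\!\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k

% Fitted value obtained from the estimated coefficients
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_k X_k

% Estimated probability of default
P = \frac{e^{\hat{Y}}}{1 + e^{\hat{Y}}} = \frac{1}{1 + e^{-\hat{Y}}}
```

Under this form, a negative coefficient lowers the estimated default probability as the corresponding indicator rises, and a positive coefficient raises it.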

Decision Tree Model (DT)

The Decision Tree (DT) is a model for addressing regression and classification problems using a tree structure to chart decision rules. A Root Node, Internal Nodes, and Leaf Nodes make up this model. Each internal node in the DT represents a variable, and the path that connects it to its offspring indicates a particular value for that variable (this is the condition or rule for branching at each node). Each leaf node represents the predicted value of the target variable, given the values of the variables along the route from the root node to that leaf node.

Source: Abdou, H. and Pointon, J. (2011).

Decision trees are constructed by splitting the values of an attribute at each node according to an input characteristic. The classification procedure employs separable attributes and runs repeatedly until it reaches the leaf nodes (target value). The decision rule that returns the risk level corresponding to a customer is determined by the set of path rules from the Root Node to the Leaf Nodes. With the CART (Classification and Regression Tree) method, the node split is decided using the Gini index:
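As an illustration, the Gini index of a node is commonly defined as 1 minus the sum of the squared class shares in that node, and CART prefers the split with the lowest weighted Gini over the two child nodes. The following is a minimal sketch of that calculation; the function names and the toy labels are assumptions made for this example and are not taken from the thesis.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(left, right):
    """Size-weighted Gini impurity of the two children of a binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Toy labels (hypothetical): 1 = default, 0 = non-default.
parent = [1, 1, 0, 0, 0, 0]
left, right = [1, 1], [0, 0, 0, 0]

parent_gini = gini_impurity(parent)   # impurity before the split
child_gini = split_gini(left, right)  # impurity after a perfectly separating split
```

A split that separates the two classes completely drives the weighted Gini to zero, which is why such a split would be chosen first.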

The Decision Tree Model is a simple-to-understand classification system that is quite successful. The classification effectiveness of the decision tree, on the other hand, is highly dependent on the training data. As a result, building a viable decision tree model necessitates the use of a large dataset on client loan history.

Random Forest Model (RF)

The Random Forest (RF) is a simple and versatile machine learning technique. High-accuracy outcomes have been reported even without hyperparameter adjustment. It works by constructing a large collection of decision trees in order to generate predictions about the target feature. Each decision tree is built at random by resampling the data (bootstrap, random sampling with replacement) and using only a limited, randomly chosen subset of all the data variables. The random forest model normally works quite precisely in its final form, but the downside of the method is that, owing to its complicated structure, we cannot easily grasp the mechanism of action within the model. Random forests are commonly used as "black box" models in businesses since they produce excellent predictions over a wide variety of data with minimal modification in the scikit-learn package.
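The two sources of randomness described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names are hypothetical, the subset size of 3 features is chosen arbitrarily for the example, and the thesis itself relies on the scikit-learn implementation rather than on code like this.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features for one tree to consider."""
    return rng.sample(feature_names, k)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
features = [f"X{i}" for i in range(1, 11)]  # the ten numeric indicators X1..X10

rows = bootstrap_sample(300, rng)                 # rows used to grow one tree
subset = random_feature_subset(features, 3, rng)  # features that tree may split on
```

Repeating these two draws for every tree gives each tree a slightly different view of the data, and the forest's prediction is then the majority vote of the trees.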

Confusion Matrix

The Confusion matrix is a method for evaluating the outcomes of classification problems by taking into account the prediction accuracy and coverage for each given class. For each class, a Confusion matrix comprises the four indicators listed below: True Positive (TP), the number of correct positive predictions; True Negative (TN), the number of correct negative predictions; False Positive (FP), the number of incorrect positive predictions; and False Negative (FN), the number of incorrect negative predictions. The Confusion matrix is used to summarize the results of forecasting the default of firms, as illustrated in the table below:

Actual \ Predicted    Non-default            Default
Non-default           True Negative (TN)     False Positive (FP)
Default               False Negative (FN)    True Positive (TP)

True Positive (TP): The number of correct default forecasts. This happens when the model correctly forecasts a company's failure.

True Negative (TN): The number of correct non-default forecasts. This happens when the model correctly predicts that a company will not go bankrupt.

False Positive (FP): The number of incorrect default forecasts. The model predicts that a firm will go bankrupt, but the company is actually in good shape. This is classified as a type 1 error.

False Negative (FN): The number of incorrect non-default forecasts. The model predicts that a company will not go bankrupt, yet the firm does go bankrupt. This is classified as a type 2 error.

The purpose of the default prediction model, according to Hayden and Daniel's research, is to study, evaluate, and rate borrowers in order to separate good clients from poor ones. When defaults do occur, however, statistics reveal that a False Negative (type 2 error) leads the bank to incur greater losses than a False Positive (type 1 error), because funds lent to the misclassified clients may not be recovered.

Researchers build groups of ratios on the four indicators listed above to assess the dependability of a model that forecasts the likelihood of a business's default.

Accuracy ratio: The model's accuracy is its capacity to discriminate between bankrupt and non-bankrupt enterprises. It is calculated as: Accuracy = (TP + TN) / (TP + TN + FP + FN).

Sensitivity ratio: Sensitivity is the model's capacity to correctly detect bankruptcies; it is the proportion of bankrupt firms that the model correctly forecasts as bankrupt. It is calculated as: Sensitivity = TP / (TP + FN).

Specificity ratio: This ratio evaluates the model's ability to reliably identify enterprises that are not bankrupt. It is the fraction of non-bankrupt firms that the model correctly forecasts as non-bankrupt: Specificity = TN / (TN + FP).

Precision ratio: This ratio compares the number of firms that are correctly predicted to fail to the total number of companies predicted to fail: Precision = TP / (TP + FP).

F1-Score

In the problem of classifying companies as bankrupt or non-bankrupt, the data set used to build and test the model often has an uneven distribution; in this case, data about non-bankrupt companies accounts for a much larger share than data about bankrupt companies, which can skew the sensitivity and precision ratios when each is read on its own. The F1 score is therefore used: it is an index that evaluates sensitivity and precision simultaneously and so gives a more dependable assessment of the efficacy of the rating models.

The following formula determines the F1 score or F-measure: F1 = 2 × Precision × Sensitivity / (Precision + Sensitivity).

Descriptive statistics results

Source: Statistics from the author

Table 4.1 presents the values of the 10 independent variables in terms of Mean, Median Absolute Deviation (MAD), Minimum (Min), and Maximum (Max) to assist us in understanding the data set utilized to develop the models to forecast the default rate. In terms of Mean, EBITDA to IE (X6) has the highest value (163.529) and Debt to Net Worth (X7) has the lowest (-0.599). Meanwhile, the variable with the largest MAD is Debt to EBITDA (X8) (3.6), and EBITDA/Net Worth (X3) has the smallest, at 0.1. In addition, Debt to Net Worth (X7) has the lowest minimum value at -1634.47, and the variable with the highest minimum value (1.22) is Log Cash and Equivalents (X5). Finally, EBITDA to IE (Interest Expense) (X6) has the largest maximum value (46822.25) and Log EBITDA (X1) has the smallest maximum value (6.42).
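The four summary statistics named above can be computed with the Python standard library alone. The sketch below assumes that MAD means the median of the absolute deviations from the median; the `describe` helper and the sample values are hypothetical and are not figures from Table 4.1.

```python
import statistics


def describe(values):
    """Return mean, median absolute deviation, min and max of a numeric list."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return {
        "mean": statistics.mean(values),
        "mad": mad,
        "min": min(values),
        "max": max(values),
    }


# Hypothetical ratio values with one extreme observation.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
summary = describe(sample)
```

In this toy sample the single outlier pulls the mean far above the median while the MAD stays small. The same pattern may explain why a ratio such as EBITDA to IE shows a mean of 163.529 next to a maximum of 46822.25.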

Correlations

Figure 4.1 below shows the Pearson correlation coefficient among all pairs of variables. Pearson's correlation coefficient (r) is a measure of two variables' linear correlation. Its value ranges from -1 to +1, with -1 representing total negative linear correlation, 0 representing no linear correlation, and +1 representing total positive linear correlation. Furthermore, r is unaffected by changes in the location and scale of the two variables, meaning that the angle to the x-axis has no effect on r for a linear function. To calculate r for two variables X and Y, divide their covariance by the product of their standard deviations.
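The definition in the last sentence translates directly into code. The following is a minimal sketch with hypothetical input values; `pearson_r` is a name chosen for this example. The common factor 1/n in the covariance and in the standard deviations cancels, so plain sums are used.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


x = [1.0, 2.0, 3.0, 4.0]
y_up = [2.0, 4.0, 6.0, 8.0]    # rises with x
y_down = [8.0, 6.0, 4.0, 2.0]  # falls as x rises

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
# Shifting and (positively) rescaling x leaves r unchanged.
r_shifted = pearson_r([10.0 * a + 5.0 for a in x], y_up)
```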

There are pairs of independent variables with high correlation in the above correlation matrix. In particular, EBITDA/Net Worth (X3) is highly correlated with Log EBITDA (X1); Debt to Net Worth (X7) is highly correlated with EBITDA/Net Worth (X3); and Log EBITDA (X1) is highly correlated with Log Net Worth (X2). However, because these correlation values are all less than 0.8, multicollinearity between the variables is unlikely to have a significant influence on the model's regression findings.

Regression results of a parametric model

Logistic Regression Result

Pandas Profiling with Google Colab is used for the Logistic Regression analysis. Learning the parameters of a prediction function and then evaluating it on the same data is a methodological error: a model that just repeats the labels of the samples it has just seen will get a perfect score but will fail to predict anything valuable on yet-unseen data. This is referred to as overfitting. To avoid this, when running a (supervised) Machine Learning experiment, it is usual practice to hold out part of the available data as a test set (X_test, y_test) and to fit the model only on the remaining training set (X_train, y_train). (Details of the six steps with the code used can be seen in Figures 1, Appendix.)
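The hold-out idea can be sketched without any external library. The example below assumes the 300/100 split of the 400 companies reported later in this chapter; the function name and the seed are hypothetical, and the thesis's own code (Appendix) is not reproduced here.

```python
import random


def train_test_split_indices(n_rows, test_size, seed=0):
    """Shuffle row indices and hold out test_size of them for evaluation."""
    rng = random.Random(seed)  # seed fixed only for repeatability
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return indices[test_size:], indices[:test_size]


# 400 companies in total, 100 held out for testing, 300 left for training.
train_idx, test_idx = train_test_split_indices(400, 100)
```

The model is then fitted on the rows in `train_idx` only and scored on the rows in `test_idx`, so the reported ratios reflect companies the model has not seen.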

Source: Statistics from the author

The Logistic Regression result shows that the independent variables Log EBITDA (X1), Log Net Worth (X2), EBITDA/Net Worth (X3), Log Working Capital (X4), EBITDA to IE (X6), Debt to Net Worth (X7), and Debt to EBITDA (X8) move in the opposite direction to the dependent variable. This corresponds to the sign expectation of the independent variables reported in Table 3.3. Among these variables, however, X8 has no discernible impact, with a logistic model coefficient of only -0.01. By contrast, the variables X1, X2, and X4 have a significant effect on the outcomes of projecting the default of firms, with large measured coefficients of -1.32, -0.79, and -1.33, respectively.

The greater a firm's earnings before interest, taxes, depreciation, and amortization (EBITDA) and its ratio of EBITDA to net worth, the better the ability of the business to generate profits, and the lower the chance of business bankruptcy. In addition, the capacity to pay interest is assessed by the ratio of EBITDA to interest expenses: the greater this ratio, the lower the probability of the firm going bankrupt. Moreover, the higher the Debt to Net Worth and the Debt to EBITDA, and the lower the Solvency, the greater the probability of a corporate default. Additionally, the greater the value of working capital, which means the firm has enough cash, accounts receivable, and other liquid assets to cover its short-term obligations such as accounts payable and short-term debt, the lower the likelihood of business bankruptcy.

On the other hand, the independent variables Log Cash and Equivalents (X5), Current Ratio (X9), and Quick Ratio (X10) move in the same direction as the dependent variable. This directional variation is consistent with the independent variable sign expectations suggested in prior investigations. Of these, variable X9 has the greater impact on the results of the default prediction model, with a Logistic Model coefficient of 0.092, whereas variable X10 has a weaker influence, with a coefficient of 0.071.

More specifically, the higher the current ratio, the greater the ratio of short-term assets to short-term debt, and the more cash a firm holds in the company, the less likely the corporation is to go bankrupt. The quick ratio, which is a liquidity indicator, evaluates a company's capacity to utilize its quick assets (current assets minus inventory) to satisfy its current liabilities. Firms with better liquidity ratios, as predicted, have lower default probability. The healthier the amount of cash and cash equivalents of a company, and the more positive the reflection on its ability to meet its short-term debt obligations, the lower the possibility of bankruptcy.

The default probability estimated by the Logistic Regression model can be shown as follows:

Confusion matrix of the parametric model

Source: Statistics from the author

Table 4.3 shows the Confusion matrix of the Logistic Regression Model. 100 companies out of 400 were randomly selected to check how many bankrupt and non-bankrupt companies were correctly predicted, and from there Accuracy, Sensitivity, Specificity, Precision, and F1-Score were determined. Out of 100 companies, the model correctly predicted 82 non-defaulted companies. Moreover, the model correctly anticipated the bankruptcy of 17 of the remaining 18 businesses. The Logistic Regression model's accuracy ratio is 0.99, indicating that the Logistic model correctly discriminates between bankrupt and non-bankrupt companies in 99% of cases. The sensitivity ratio shows that the Logistic model's ability to accurately identify bankrupt SMEs is quite high, at 94.44%. In terms of the precision ratio, 100% of the companies predicted to go bankrupt actually did. The effectiveness of the Logistic model is evaluated through the F1-score, with a value of 97.14%, which means that the evaluation of this model gives reliable results.
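The reported ratios can be reproduced from the counts in the text. The sketch below assumes TP = 17, FN = 1, TN = 82 and FP = 0. The last value is inferred from the 82 correctly predicted non-defaulters among 100 companies of which 18 went bankrupt, and is not stated explicitly. The function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, precision and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Counts for the Logistic Regression test set of 100 companies (FP = 0 is inferred).
logit = classification_metrics(tp=17, tn=82, fp=0, fn=1)
```

With these counts the accuracy is 99/100, the sensitivity is 17/18 (about 94.44%), the precision is 17/17, and the F1 score is 34/35 (about 97.14%), which agrees with the values reported above.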

Regression results of non-parametric models

Decision Tree

Source: Statistics from the author

The training set (X_train) of 300 companies is used to find the 2 variables that have the most influence on the decision tree model. As a result, X3 and X2 have a great impact on identifying defaulted and non-defaulted SMEs (the code used to draw the Decision Tree model is in Figure 2, Appendix):

Step 1: Using the independent variable X3, divide the population of 47 bankruptcies and 253 non-bankruptcies into two groups. If X3 ≤ -0.35, the model correctly predicts 28 bankrupt companies. Otherwise, proceed to the next step.

Step 2: If X3 > -0.35 and X2 ≤ 0.003, the model correctly predicts 19 insolvent enterprises from a total of 272 firms. Otherwise (X2 > 0.003), proceed to the next step.

Step 3: If X3 > -0.35 and X2 > 0.003, the remaining 253 enterprises are classified as non-bankrupt.
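The three steps above amount to two nested threshold rules. The following is a minimal sketch of those rules as a plain function; the function name and the example inputs are hypothetical, while the thresholds -0.35 and 0.003 are the ones quoted in the steps.

```python
def classify_sme(x3, x2):
    """Apply the two-rule tree: x3 = EBITDA/Net Worth, x2 = Log Net Worth."""
    if x3 <= -0.35:   # Step 1: very low EBITDA/Net Worth
        return "default"
    if x2 <= 0.003:   # Step 2: very low Log Net Worth
        return "default"
    return "non-default"  # Step 3: both thresholds passed


# Hypothetical firms, one for each leaf of the tree.
firm_a = classify_sme(x3=-0.50, x2=1.20)
firm_b = classify_sme(x3=0.10, x2=0.00)
firm_c = classify_sme(x3=0.10, x2=2.00)
```

Writing the tree this way shows why the model is easy to explain: each prediction can be traced to at most two comparisons.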

Table 4.4: Confusion matrix of Decision Tree model (DT)

Source: Statistics from the author

Table 4.4 shows the Confusion matrix of the Decision Tree Model (the code used can be seen in Figure 3, Appendix). 100 companies out of 400 were randomly selected to check how many bankrupt and non-bankrupt companies were correctly predicted, and from there Accuracy, Sensitivity, Specificity, Precision, and F1-Score were determined. Out of 100 companies, the model correctly predicted 82 non-defaulted companies. Moreover, the model correctly anticipated the bankruptcy of the remaining 18. The Decision Tree model's accuracy ratio is 1.0, higher than that of the Logistic model, indicating that the Decision Tree model correctly discriminates between bankrupt and non-bankrupt companies in 100% of cases. The sensitivity ratio shows that the Decision Tree model's ability to accurately identify bankrupt SMEs is 100%, higher than that of the Logistic model. In terms of the precision ratio, 100% of the companies predicted to go bankrupt actually did. The effectiveness of the Decision Tree is evaluated through the F1-score, with a value of 100%, higher than that of the Logistic model, which means that the evaluation of this model gives a more reliable result than the Logistic model.

Random Forest

Table 4.5: Confusion matrix of Random Forest model (RF)

Source: Statistics from the author

Table 4.5 shows the Confusion matrix of the Random Forest Model (the code used to run the Confusion Matrix of the Random Forest model can be seen in Figure 4, Appendix). 100 companies out of 400 were randomly selected to check how many bankrupt and non-bankrupt companies were correctly predicted, and from there Accuracy, Sensitivity, Specificity, Precision, and F1-Score were determined. Generally, the Confusion matrix of the Random Forest Model gives the same result as that of the Decision Tree. Out of 100 companies, the model correctly predicted 82 non-defaulted companies. Moreover, the model correctly anticipated the bankruptcy of the remaining 18. The Random Forest model's accuracy ratio is 1.0, higher than that of the Logistic model, indicating that the Random Forest model correctly discriminates between bankrupt and non-bankrupt companies in 100% of cases. The sensitivity ratio shows that the Random Forest model's ability to accurately identify bankrupt SMEs is 100%, higher than that of the Logistic model. In terms of the precision ratio, 100% of the companies predicted to go bankrupt actually did. The effectiveness of the Random Forest is evaluated through the F1-score, with a value of 100%, higher than that of the Logistic model, which means that the evaluation of this model gives a more reliable result than the Logistic model.

Regression result of Ensemble Learning

In this research, for better classification and generalization, ensemble learning is a combination of the Logistic model, Decision Tree model, and Random

Source: Statistics from the author

Although Ensemble Learning gives the same result as Random Forest and Decision Tree, the ratios of Accuracy, Sensitivity, Specificity, Precision, and F1-Score may change when larger data sets are collected (in size and time) (the code used to run the Confusion matrix of Ensemble Learning is in Figure 5, Appendix).
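The thesis does not state how the individual models are combined. One common option is hard (majority) voting, sketched below purely as an assumption; the function name and the toy predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per sample by majority vote."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Hypothetical predictions for four firms (1 = default, 0 = non-default).
logit_pred = [0, 1, 0, 1]
tree_pred = [0, 1, 1, 1]
forest_pred = [0, 0, 1, 1]

ensemble_pred = majority_vote([logit_pred, tree_pred, forest_pred])
```

With three binary classifiers there is always a strict majority, so no tie-breaking rule is needed in this setting.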

The Confusion Matrix tables of the two non-parametric default probability models, the Decision Tree model and the Random Forest model, show that the Random Forest model is as efficient as the Decision Tree model, with identical Accuracy, Sensitivity, Specificity, Precision, and F1-Score. However, based on the advantages and disadvantages of Decision Tree and Random Forest mentioned in section 2.2 of chapter 2, the Random Forest model is more understandable and accurate than other machine learning algorithms (Olson et al., 2012). Furthermore, overfitting is not a problem with the Random Forest model: its testing performance does not decrease (due to overfitting) as the number of trees rises. As a result, after a given number of trees, the performance tends to remain constant. Therefore, Random Forest model outperforms

In chapter 5, the author makes some recommendations for organizations that might utilize the model to estimate the risk of default in Vietnam, while also highlighting the limits in the implementation process and suggesting future research directions.

All of the topic's research findings were discussed in chapter 4 of the thesis, where the author found answers to the two research questions raised in chapter 1.

To address the first question, the author employed 13 independent variables (10 numeric variables and 3 categorical variables of the organization) to identify a company default. According to the findings, two variables play a crucial role in determining default likelihood in the non-parametric model (DT): Log Net Worth and EBITDA/Net Worth.

PD of SMEs. In particular, the author employed specific criteria such as the Confusion Matrix and the F1-Score to identify the optimal model for evaluating the likelihood of default of small and medium-sized businesses at commercial banks in Vietnam. Regression results in Chapter 4 show that the non-parametric models, in particular Random Forest and Decision Tree, give the best results, with an accuracy of up to 100%, and are more reliable than the Logistic model, which is a statistical approach. However, the Random Forest model outperforms the Decision Tree model because: (i) it is more stable than DT; (ii) it does not overfit; (iii) its predictions are more accurate and reliable than DT's. On the other hand, RF is harder to explain than the DT model. If the accuracy ratio of DT does not differ much from that of RF, DT should be chosen because it is easier to explain.

Applying the model to forecast the likelihood of default for SME customers at Vietnamese Commercial banks

Tools that aid in the identification of groups of prospecti

The thesis developed a model to estimate the solvency (default probability) of SME customers at Vietnamese commercial banks. This approach can help to stabilize credit quality and reduce the emergence of bad loans. Customers with a qualifying credit rating (A or better) whom the model also evaluates as having strong repayment capability will have a low likelihood of accruing bad debt, implying that credit risk for this set of customers is low.

The model may be considered a tool to assist commercial banks in credit granting, credit quality assurance, and promoting efficient, safe, and sustainable expansion and growth. It may then assist banks in selecting and maintaining a healthy customer structure, promoting marketing methods to low-risk customers, and developing a network of trustworthy customers to ensure debt recovery.

The model results serve as the foundation for credit policy orientation

risk clients and effective credit growth to high-performing customers (low probability of bankruptcy). Concurrently, developing a credit policy that is appropriate for each type of customer in terms of credit periods, interest rates, fees, security requirements, etc. to guarantee operational safety.

The group with a low probability of default: credit granted with various preferential conditions such as preferential interest rates, credit extended with no property security or only partially secured by assets, commitment conditions on financial indices that are not monitored, etc.

The group with an average probability of default: giving credit in accordance with the bank's general requirements, contemplating lowering interest rates when the customer's mortgage collateral falls below the prescribed rate for customer groups and credit products.

The group with a high chance of default: no new credit, gradually limiting the credit balance previously granted, and, during the duration of credit, applying high interest rates and strict financial index criteria or other harsh restrictions to minimize the risk of customers not paying their obligations.

Furthermore, it is vital to focus on providing credit to customers in highly efficient business lines with low default risk; conversely, policies and orientations that tighten and reinforce control are required for customers in industries with poor profitability and significant bankruptcy risks.

Additionally, the regression findings of the Random Forest model in Chapter 4 demonstrate that 2 financial index variables have a substantial impact on SMEs' insolvency: Log Net Worth and EBITDA/Net Worth. These variables move inversely with the risk of the business's bankruptcy. That is, the better the firm's potential to create revenue as well as profit, the lower the likelihood of corporate bankruptcy. In contrast, the ratios of Debt to Net Worth and Debt to EBITDA move in the same direction as the firm's default rate. This demonstrates that the more debt a company utilizes, the greater the financial strain, and the greater the probability of bankruptcy. Commercial banks might use these study findings to analyze and choose customers in order to reduce the chance that customers will default on their loans, giving credit preference to firms with strong EBITDA, Net Worth, and EBITDA/Net Worth, and carefully researching and evaluating before issuing credit to enterprises with high debt usage and poor self-reliance in financing. On the other hand, the information used to measure solvency and the outcomes of the model also reflect numerous challenges connected to the company's performance and to its production and business sector. As a result, the model serves as a resource for future credit policy study, evaluation, forecasting, and administration.

Applying the model results to improve credit risk manageme

Predicting customers' capacity to repay debts is a strategy that helps banks identify potential customers and improves quality in monitoring and re-rating clients once the credit is granted. Based on the model results, commercial banks may swiftly recognize and take action to resolve credit issues with SME customers who have a high chance of default, therefore reducing the bank's risk of capital loss.

According to Decision No. 493/2005/QD-NHNN dated April 22, 2005, of the State Bank of Vietnam promulgating the Regulation on the classification of debts, appropriation, setting up and use of reserves for handling credit risks in banking activities of credit institutions, most commercial banks in Vietnam continue to apply provisioning based on the findings of SME customer debt categorization, from which they calculate the appropriate ratio. As a result, if commercial banks can accurately predict their SME customers' repayment capabilities (default likelihood), making provisions becomes easier, and developing a credit risk reserve fund becomes more effective.

Applying the model to anticipate the likelihood of default for Cre

A credit rating agency (CRA) is a for-profit corporation that gathers debt information from individuals and companies and provides a numerical value called a credit score that represents the borrower's creditworthiness. Many countries' debt (bond) markets rely heavily on CRAs. Creditors and lenders, such as credit card firms and banks, report their clients' borrowing activities and history to credit agencies. Individuals and corporations can request copies of the information reported about them by contacting the credit bureau or a linked third-party entity and paying a small charge.

Decree No. 88/2014/ND-CP dated September 26, 2014, of the Government regulates credit rating services. The legal capital of a credit rating agency is VND 15 billion. Simultaneously, the Prime Minister approved the Credit Rating Service Development Plan for 2020 and a Vision for 2030. As a result, business eligibility certificates are projected to be issued to a maximum of five enterprises by 2030 and a strategy for

One of the most trusted credit rating agencies in Vietnam is FiinRatings, which is a FiinGroup trademark. It has been approved by the Ministry of Finance to operate as a Credit Rating Agency ("CRA") in Vietnam. FiinGroup was granted the license on March 20, 2020, in accordance with Decree 88/2014/ND-CP dated September 26, 2014, which governs Credit Rating Agency services. Moreover, Saigon Ratings is the first domestic CRA to provide credit rating services in the Vietnamese market. It has been a pioneer in the financial sector and was licensed by the Ministry of Finance for credit rating activities on July 21, 2017, with seven ministries and branches such as Finance, Public Security, Justice, Planning & Investment, State Bank, State Securities Commission and

CRAs can gather credit information every month from clients borrowing money at all credit institutions across the country, then use the model to anticipate the likelihood of customer default in order to identify debt groups and synthesize credit information for each customer. Ultimately, when financial institutions request it and pay a charge for it, CRAs resell the borrower's aggregated credit information.

According to the findings of this thesis research, CRAs should concentrate on financial index factors that are statistically significant and have a major impact on the bankruptcy of SMEs, including EBITDA, Net Worth and EBITDA/Net Worth. Attention to and in-depth research of these variables can assist CRAs in producing more accurate credit rating reports, thereby providing appropriate comments and suggestions to SMEs. It also helps banks manage credit risk more easily.

This thesis gives a reasonably complete review of published studies to identify the gaps in prior work on selecting the most suitable model to forecast the default likelihood of small and medium firms in Vietnam based on financial indicators. Based on the empirical results in Chapter 4, this thesis recommends that CRAs use the Random Forest to forecast default probability because the results show that the Random Forest model produces the best results, with a forecast accuracy of up to 100%, which is an important basis for CRAs to choose an appropriate credit rating model.

Competition, in principle, ensures innovation and serves as a healthy check on product quality. The rating business, on the other hand, is wholly built on reputation. There is some tension between competition and reputation. The leading CRAs have huge reputational capital. Investors have faith in CRAs' judgment, and they request a risk premium depending on the issuer's rating. Even after the subprime catastrophe, rating judgments remain regularly debated in the financial press, emphasizing their ongoing significance. As a result, selecting accurate client information and data is critical for CRAs to anticipate the likelihood of default. The data utilized in this study can also help with the process of gathering data to anticipate the default rates of CRA clients.

Topic limitation and potential research directions

Topic limitation

Aside from the thesis outcomes, there are several limitations and difficulties. The limited data set is the most obvious shortcoming of this investigation. Due to time restrictions, only 400 firms were gathered, and they were restricted to three industries. The number of input variables is 14 (13 independent variables and 1 dependent variable), which is suitable for the number of observations. However, because this sample was considered tiny, there were no significant findings.

Furthermore, the quality of the input data is low. Although the gathered financial statements have been audited to assure the quality of information sources, the auditing of financial statements in Vietnam is not as transparent, clear, and effective as in developed nations. In Vietnam, a company can create three or four financial statements for a variety of purposes, including taxation, banking, auditing, and internal control. As a result, managing and analyzing the model's input quality is critical for obtaining the most accurate results.

Additionally, the probability of default models presented in the thesis are based solely on financial data, with no regard for non-financial elements, unlike the internal credit rating models used by commercial banks in Vietnam today. In reality, because customers' financial statements do not always precisely and completely reflect their business outcomes and financial status, banks must depend on non-financial information to screen and classify customers.

Potential research directions

Expanding the data set and time intervals: To obtain more trustworthy findings, the number of firms gathered could be increased to 1,000 or even higher. Furthermore, instead of collecting yearly financial statements, quarterly corporate financial statements might be gathered for greater accuracy.

Furthermore, if the acquired data is large enough and can be disaggregated for each set of customers in different business sectors, it will produce reliable findings that are tailored to each organization's business operations, enhancing the model's applicability.

Breiman, L., & Ihaka, R. (1984). Nonlinear discriminant analysis via scaling and ACE. One Shields Avenue, Davis, CA, USA: Department of Statistics, University of California.

Khoshgoftaar, T.M.; Fazelpour, A.; Dittman, D.J.; Napolitano, A. Ensemble vs. Data Sampling: Which Option Is Best Suited to Improve Classification Performance of Imbalanced Bioinformatics Data? In Proceedings of the IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI), Vietri sul Mare, Italy, 9–11 November 2015; pp. 705–712.

Fan, X.N.; Tang, K.; Weise, T. Margin-Based Over-Sampling Method for Learning from Imbalanced Datasets. In Advances in Knowledge Discovery and Data Mining; Springer: Berlin/

Hamori, S., Kawai, M., Kume, T., Murakami, Y., & Watanabe, C. (2018). Ensemble learning or deep learning? Application to default risk analysis. Journal of Risk and Financial Management, 11(1), 12.

Gupta, J., Gregoriou, A., & Ebrahimi, T. (2018). Empirical comparison of hazard models in predicting SMEs failure. Quantitative Finance, 18(3), 437-466.

McCann, F., & McIndoe-Calder, T. (2012). Determinants of SME loan default: the importance of borrower-level heterogeneity (Vol. 6). Central Bank and

Mestre, D.; Fonseca, J.M.; Mora, A. Monitoring of in-vitro plant cultures using digital image processing and random forests. In Proceedings of the 8th International Conference of Pattern Recognition Systems (ICPRS 2017), Madrid,

Pompe, P., Bilderbeek, J., 2005. The prediction of bankruptcy of small- and medium-sized industrial firms. Journal of Business Venturing, 20(6), 847–868.

Abdou, H., & Pointon, J. (2011). Credit Scoring, Statistical Techniques, and Evaluation Criteria: A Review of the Literature. Intelligent Systems in Accounting,

Dichev, I. D., & Skinner, D. J. (2002). Large-sample evidence on the debt covenant hypothesis. Journal of Accounting Research, 40(4), 1091-1123.

Smith, C. W., & Warner, J. B. (1979). Bankruptcy, secured debt, and optimal capital structure: Comment. The Journal of Finance, 34(1), 247-251.

Demerjian, P. R. (2007). Financial ratios and credit risk: The selection of financial ratio covenants in debt contracts. AAA.

Crouhy, M., Galai, D., & Mark, R. (2001). Prototype risk rating system. Journal of Banking & Finance, 25(1), 47-95.

Yap, B. C. F., Yong, D. G. F., & Poon, W. C. (2010). How well do financial ratios and multiple discriminant analysis predict company failures in Malaysia?

Ravi, V., & Pramodh, C. (2008). Threshold accepting trained principal component neural network and feature subset selection: Application to bankruptcy prediction in banks. Applied Soft Computing, 8(4), 1539-1548.

F. M. Liou, Fraudulent financial reporting detection and business failure prediction models: a comparison, Manag. Audit. J. 23 (2008), 650–662.

Perboli, G., & Arabnezhad, E. (2021). A Machine Learning-based DSS for mid and long-term company crisis prediction. Expert Systems with Applications, 174, 114758.

Perboli, G., Tadei, R., & Gobbato, L. (2014). The multi-handler Knapsack problem under uncertainty. European Journal of Operational Research, 236, 1000–1007.

Baldi, M. M., Manerba, D., Perboli, G., & Tadei, R. (2019). A Generalized Bin Packing Problem for parcel delivery in last-mile logistics. European Journal of

Altman, E. I. (2014). Predicting financial distress of companies: Revisiting the Z-Score and ZETA models. In A. R. Bell, C. Brooks & M. Prokopczuk (Eds.), Handbook of research methods and applications in empirical finance (pp. 428–456). Edward Elgar Pub.

Begley, J., Ming, J., & Watts, S. (1996). Bankruptcy classification errors in the 1980s: An empirical analysis of Altman's and Ohlson's models. Review of

Schalck, C., & Yankol-Schalck, M. (2021). Predicting French SME failures: new evidence from machine learning techniques. Applied Economics, 53(51), 5948-5963.

Ciampi, F., and Gordini, N. (2013). Small Enterprise Default Prediction Modeling through Artificial Neural Networks: An Empirical Analysis of Italian Small Enterprises. Journal of Small Business Management, 51(1), 23-45.

James, G., Witten, D., Hastie, T., and Tibshirani, R. An Introduction to Statistical Learning, 112; Springer: New York, NY, USA, 2013.

Barboza, F., H. Kimura, and E. Altman, Machine learning models and bankruptcy prediction, Expert Systems with Applications: An International

Brown, I., and C. Mues, "An experimental comparison of classification algorithms for imbalanced credit scoring data sets", Expert Systems with Applications: An International Journal, 39, 2012, 3446-3453.

Resti, A., and A. Sironi, Risk Management and Shareholders' Value in Banking:

Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and

Guidotti, R., A. Monreale, S. Ruggieri, F. Turini, D. Pedreschi, and F. Giannotti, "A Survey of Methods for Explaining Black Box Models", ACM Computing Surveys (CSUR), 51(5), 2019, 93.

INE (2014). Empresas em Portugal – 2012. Lisboa: Instituto Nacional de Estatística.

Psillaki, M., Tsolas, L. E., & Margaritis, D. (2010). Evaluation of credit risk based on firm performance. European Journal of Operational Research, 201(3), 873-881.

Back, B., Laitinen, T., Sere, K., & Van Wezel, M. (1996). Choosing bankruptcy predictors using discriminant analysis, logit analysis, and genetic algorithms. Turku Centre for Computer Science. Technical Report, 40.

Lo, A. (1986). Logit versus discriminant analysis: A specification test and application to corporate bankruptcies. Journal of Econometrics, 31(2), 151-178.

Global Economic Outlook - January 2022 (2022, January 25). Retrieved from the group Atradius: https://group.atradius.com/publications/economic-research/global-economic-outlook-january-2022.html.

Insolvency increases expected as support ends (2021, October 07). Retrieved from Atradius Collections: https://atradiuscollections.com/global/reports/economic-research-insolvency-increases-expected-as-support-ends.html.

Report on COVID-19's impact on Vietnamese businesses released. (2021, March 12). Retrieved from Nhan Dan: https://en.nhandan.vn/business/item/9663802-report-on-covid-19%E2%80%99s-impact-on-vietnamese-businesses-released.html.

Ince, Huseyin, and Bora Aktan. 2009. "A Comparison of Data Mining Techniques for Credit Scoring in Banking: A Managerial Perspective". Journal of Business Economics and Management 10(3): 233-40. https://doi.org/10.3846/1611-1699.2009.10.233-240.

Platt, H. D. (1991). Predicting corporate financial distress: Reflections on choice-based sample bias. Journal of Economics and Finance 2002.

CP dated September 26, 2014, of the Government on credit rating services (2014, November 15). Retrieved from LuatVietnam: https://english.luatvietnam.vn/decree-no-88-2014-nd-cp-dated-september-26-2014-of-the-government-on-credit-rating-services-89671-Doc1.html

Basel Committee on Banking Supervision. 2006. International convergence of capital measurement and capital standards. www.bis.org.

Journal of Accounting Research, Vol. 18 No. 1, pp. 109-31.

Lennox, C. (1999), "Identifying failing companies: a re-evaluation of the logit, probit, and DA approaches", Journal of Economics and Business, Vol. 51, pp. 347-64.

C. Tsai and J. Wu, "Using neural network ensembles for bankruptcy prediction and credit scoring," Expert Systems with Applications, vol. 34, no. 4, pp. 2639–2649, 2008.

Olson, D.L.; Delen, D.; Meng, Y. Comparative analysis of data mining methods for bankruptcy prediction. Decis. Support Syst. 2012, 52, 464–473.

Du Jardin, P. Failure pattern-based ensembles applied to bankruptcy forecasting. Decis.

(1966). Financial ratios as predictors of failure. Journal of Accounting Research, 4, 71-111.

Altman, E., Haldeman, R. G., & Narayan, P. (1977). Zeta-analysis: A new model to identify bankruptcy on corporations. Journal of Banking and Finance, 1(1), 29-

Gombola, M., Haskins, M., Ketz, J., & Williams, D. (1987). Cash flow in bankruptcy prediction. Financial Management, 16(4), 55-65.

Mossman, Ch. E., Bell, G., Swartz, L., & Turtle, H. (1998). An empirical comparison of bankruptcy models. The Financial Review, 33(2), 35-54.

Le, N. S. V. (2013), Investment decisions and bankruptcy risks of companies listed on the Vietnamese stock market, Master's thesis in economics, University of Economics, Ho Chi Minh City.

Nguyen, T. T. L. (2019). Factors affecting bankruptcy risk of listed companies in the construction industry in Vietnam. Journal of Banking Science & Training, No. 205.

Than dinh Tin dung Blog (2018). Overview of Credit Rating Agency (CRA) in the world. Available from

Vo, H. D., and Nguyen, D. T. (2013a). Credit rating for listed companies in Vietnam using fuzzy theory. Journal of Economic Development, No. 269.

Source: Risk, F. C. (2004). Moody's KMV RiskCalc™ v3.1 model

Table 2: The input variables selected by Gupta, Jairaj, Andros Gregoriou, and Tahera Ebrahimi (2018)

• Quick Ratio: (current assets - stocks - prepayments) / current liabilities.

• Cash Ratio: (cash + bank + marketable securities) / current liabilities.

Source: Gupta, Jairaj, Andros Gregoriou, and Tahera Ebrahimi (2018)
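The two liquidity ratios listed in Table 2 can be computed directly from balance-sheet items. The following is a minimal sketch in Python, not code from the thesis: the class name, the field names, and the sample figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CurrentPosition:
    """Balance-sheet items used by both ratios (hypothetical field names)."""
    current_assets: float
    stocks: float
    prepayments: float
    cash: float
    bank: float
    marketable_securities: float
    current_liabilities: float


def _require_positive_liabilities(p: CurrentPosition) -> None:
    # Both ratios divide by current liabilities, so reject zero or negative values.
    if p.current_liabilities <= 0:
        raise ValueError("current liabilities must be positive")


def quick_ratio(p: CurrentPosition) -> float:
    # (current assets - stocks - prepayments) / current liabilities
    _require_positive_liabilities(p)
    return (p.current_assets - p.stocks - p.prepayments) / p.current_liabilities


def cash_ratio(p: CurrentPosition) -> float:
    # (cash + bank + marketable securities) / current liabilities
    _require_positive_liabilities(p)
    return (p.cash + p.bank + p.marketable_securities) / p.current_liabilities


# Illustrative figures only; they are not taken from the thesis data set.
example = CurrentPosition(
    current_assets=500.0,
    stocks=150.0,
    prepayments=50.0,
    cash=40.0,
    bank=60.0,
    marketable_securities=20.0,
    current_liabilities=200.0,
)

print(quick_ratio(example))
print(cash_ratio(example))
```

With these illustrative figures, the quick ratio is (500 - 150 - 50) / 200 and the cash ratio is (40 + 60 + 20) / 200. The guard on current liabilities is a design choice of this sketch; a real data pipeline might instead flag such firms as missing values.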

Figure 3: The code used to run the Confusion Matrix of the Decision Tree model
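The code shown in the original figure is not reproduced in this text. As a rough illustration of what a confusion matrix for a binary default classifier contains, the following Python sketch counts the four cells from actual and predicted labels. It is not the author's code: the function names, the label coding (1 = default, 0 = non-default), and the sample labels are assumptions, and the predictions would in practice come from the fitted Decision Tree model.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Count the four confusion-matrix cells for a binary classifier."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # default correctly predicted
            else:
                fp += 1  # non-default wrongly flagged as default
        else:
            if actual == positive:
                fn += 1  # default missed by the model
            else:
                tn += 1  # non-default correctly predicted
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def summary_metrics(cm):
    """Derive accuracy, precision, and recall from the four cells."""
    total = cm["tp"] + cm["fp"] + cm["tn"] + cm["fn"]
    predicted_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total if total else 0.0,
        "precision": cm["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": cm["tp"] / actual_pos if actual_pos else 0.0,
    }


# Hypothetical labels: 1 = default, 0 = non-default (assumed coding).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
metrics = summary_metrics(cm)
print(cm)
print(metrics)
```

For credit risk, the false-negative cell (defaults the model missed) is usually the costliest error, which is why recall on the default class is often reported alongside accuracy.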

Title	Application Of Machine Learning For Predicting Probability Of Default Of Small And Medium Enterprises
Author	Nguyen Thi Ngoc Anh
Supervisor	Ph.D. Nguyen Minh Nhat
University	Ho Chi Minh City University of Banking
Major	Finance – Banking
Type	Graduation thesis
Year of publication	2022
City	Ho Chi Minh City
