VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF ECONOMICS AND LAW
FINAL PROJECT
MACHINE LEARNING METHODS FOR
PREDICTING THE BANKRUPTCY RISK OF NON- MANUFACTURING ENTERPRISES IN VIETNAM
Course: Machine Learning and Artificial Intelligence in Finance Course Code: 232TC6701
Supervisor: Master Phan Huy Tam
Student: Tran Tat Anh — K214142056
Ho Chi Minh City, May 2024
3 The data and research methodology
3.2.1 Logistic Regression (LR)
3.2.2 K-Nearest Neighbor (KNN)
3.2.3 Support Vector Machine (SVM)
3.2.5 Random Forest (RF)
3.2.6 Extreme Gradient Boosting (XGB)
3.3 Evaluation methodology
3.3.1 Confusion Matrix
3.3.4 Recall
3.3.6 Area Under the Curve (AUC)
4 Results and discussion
5 Novel contributions in the research
References
Abstract
Predicting bankruptcy risk is crucial for providing early warnings to businesses. Traditional statistical methods and machine learning models are both commonly used in bankruptcy risk assessment. This study forecasts the bankruptcy risk of non-manufacturing enterprises in Vietnam using logistic regression and several machine learning models. It compares the effectiveness of the machine learning models with the traditional statistical method and evaluates the machine learning models against one another. The findings show that the Random Forest and K-Nearest Neighbor models outperform logistic regression and the other methods.
Keywords: Bankruptcy, Z-Score, Logistic, K-Nearest Neighbor, Naive Bayes, Support Vector Machine, Random Forest, Extreme Gradient Boosting
1 Introduction
Businesses are facing extraordinary hurdles in the aftermath of COVID-19. Significant swings in the global economy directly affect financial capacity and corporate operations. High inflation, along with political disputes and geopolitical problems, has produced an unpredictable and dangerous economic climate. Against this backdrop, the risk of insolvency for firms is more evident than ever. The risk is especially acute for non-manufacturing businesses, which operate in industries such as services, commerce, information technology, and telecommunications. These firms rely heavily on customer demand, market trends, and the stability of the business environment. When inflation rises and the economy contracts, consumer spending power falls, reducing firms' revenue and earnings. Furthermore, political upheavals can disrupt global supply chains, causing product shortages and higher costs that directly affect the operations of non-manufacturing enterprises. The risk of bankruptcy affects not just individual enterprises but also the economy and society as a whole. When a company goes bankrupt, other firms in the supply chain may suffer, resulting in job losses and lower wages for employees. This can trigger a downward spiral that makes economic conditions even more difficult. Forecasting the likelihood of corporate bankruptcy is therefore of paramount importance. Accurate predictions not only help firms detect financial risks early but also help managers and investors take timely and effective action to limit unfavorable consequences. In this study, we compare the prediction capacities of conventional and new
methodologies for measuring the bankruptcy risk of Vietnam's non-manufacturing firms. The objective is to determine the best forecasting approach, thereby assisting businesses in managing risk and stabilizing their operations in the present turbulent economic environment.
Altman (1968) and Beaver (1966) established the framework for traditional bankruptcy risk forecasting. Beaver (1966) pioneered the use of financial ratios, including financial leverage, return on assets, and liquidity, to predict a company's bankruptcy risk. His seminal study showed that these financial indicators could successfully distinguish between healthy and insolvent enterprises. Following Beaver's initial study, later work attempted to improve predictive power by introducing nonlinear models. For example, Jones and Hensher (2004) investigated several nonlinear techniques for improving the accuracy of bankruptcy forecasts. Their findings underscored the limits of linear models and the potential benefits of more advanced, nonlinear approaches. Kolari and colleagues (2002) advanced the field by creating an early warning system for American banks that combined logit and characteristic recognition algorithms. This hybrid method enabled a more sophisticated understanding of the causes of bank failures and produced a more reliable prediction tool. Similarly, Lam and Moy (2002) enhanced the accuracy of bankruptcy classification by combining discriminant models and running simulations to fine-tune the combined model. Their findings emphasized the value of integrating several approaches to improve forecast precision.
The logistic regression model has withstood the test of time and remains an effective tool for describing the factors that influence financial risk. Barboza and colleagues (2017) demonstrated the model's effectiveness in modern settings, confirming its robustness across a variety of financial circumstances. Because logistic regression handles binary outcomes, it is well suited to bankruptcy prediction, where the main goal is to discriminate between solvent and insolvent organizations.
The growth of technology, along with the power to compute complicated algorithms, has led to the creation of intelligent computational models in bankruptcy risk forecasting (Goldstein & colleagues, 2019). Machine learning models have shown greater performance (Florez-Lopez, 2007) by efficiently handling nonlinear relationships and complex problems without requiring large amounts of data. Machine learning models include both single and ensemble models, where an ensemble combines many models to produce a higher-performing one. Advanced ensemble models use bagging and boosting strategies.
Random Forest, a strong classification approach in the bagging group, delivers excellent accuracy while also identifying the importance of variables. Extreme Gradient Boosting (XGBoost), a boosting model, has gained popularity in recent years and demonstrated considerable benefits (Barboza & colleagues, 2017). Alongside ensemble models, K-Nearest Neighbor and Naive Bayes are popular methods for classification tasks.
The emergence of numerous model families with varied methodologies raises the question of how efficiently each forecasts corporate bankruptcy risk (Duénez-Guzman & Vose, 2013). This matters because the choice of forecasting model depends on the characteristics of enterprises in each country, particularly the availability of data for prediction. This study adds to the theory and practice of corporate bankruptcy risk forecasting by comparing the predictive ability of machine learning models with classical methods using data from Vietnamese enterprises. It compares the performance of logistic regression, K-Nearest Neighbor, Support Vector Machine, Naive Bayes, Random Forest, and Extreme Gradient Boosting. The findings are consistent with prior studies on the advantages of machine learning over traditional approaches, with Random Forest and K-Nearest Neighbor demonstrating strong performance.
2 Overview of bankruptcy risk forecasting research
2.1 Studies using traditional models
Altman (1968) and Beaver (1966) pioneered traditional research by using financial variables to anticipate firm bankruptcy risk. Lin (2009) investigated the predictive capacity of several analytic models, including discriminant analysis, logit, and probit, for Taiwanese enterprises during the 2009 financial crisis, with positive findings for classical approaches. Serrano-Cinca and Gutiérrez-Nieto (2013) used partial least squares discriminant analysis to predict the 2008 financial crisis in American banks, producing forecasts comparable to those of machine learning models. Liang and colleagues (2015) employed discriminant analysis and logistic regression to select financial distress variables for machine learning models. The key benefit of traditional approaches is that the relationship between predictors and bankruptcy risk is interpretable, but they impose strict data requirements.
2.2 Studies using intelligent models
Intelligent models are relatively new, with neural network models appearing in the 1990s (Serrano-Cinca, 1996). Technological breakthroughs have made it possible to run complicated algorithms in a short time, allowing the construction of machine learning models that improve themselves and handle highly complex problems efficiently without requiring large amounts of data. Breiman (2001) proposed the random forest, an ensemble of decision trees built using bootstrapping that produces results by majority vote. Random Forest has effectively identified credit fraud (Whitrow & colleagues, 2009) and forecast bank customer attrition (Xie & colleagues, 2009). Zhao and colleagues (2009) found that machine learning models outperformed conventional approaches. Barboza et al. (2017) found that Random Forest, bagging, and boosting outperformed SVM, logistic regression, and discriminant analysis. Combining traditional and intelligent models provides a holistic approach to bankruptcy risk forecasting, using the benefits of both techniques while addressing their shortcomings. This comprehensive strategy improves forecast accuracy and dependability, allowing companies and investors to make more informed decisions to avoid financial risks.
2.3 Bankruptcy risk forecasting research in Vietnam
In Vietnam, forecasting the likelihood of corporate bankruptcy has attracted substantial interest. Bui Phuc Trung (2012) used the standard Z-score technique to evaluate bankruptcy risk in listed firms. Nguyen Thi Canh and Pham Chi Khoa (2014) used the KMV-Merton technique to predict the probability of bankruptcy among Vietcombank's commercial clients. Huynh Thi Cam Ha & colleagues (2017) applied a decision tree model to predict financial distress among Vietnamese companies, achieving accuracy rates above 90%. Additionally, Hoang Thi Hong Van (2020) applied Altman's Z-score to 60 Vietnamese firms and obtained correct forecasts in 76.67% of cases, using variables such as average assets, ROA, and ROE.
However, studies on bankruptcy risk in Vietnamese enterprises primarily use traditional models, and machine learning models remain underutilized. This study therefore compares the predictive performance of traditional and modern machine learning methods for bankruptcy risk among Vietnamese businesses.
Through this comparison, we intend to give useful insights into the effectiveness of different forecasting methodologies in the Vietnamese context. The research not only contributes to the current literature on bankruptcy risk forecasting but also has practical implications for firms, investors, and policymakers in Vietnam. The mix of classic and new approaches offers a fuller evaluation of bankruptcy risk, helping stakeholders make informed decisions and adopt suitable risk management measures.
3 The data and research methodology
Picture 1: Descriptive Statistics
Source: Author's calculations
Application of Z-Score model
The author then applied Professor Edward Altman's Z-Score model, first introduced in the Journal of Finance in 1968. The Z-Score formula is as follows:
$$Z = 1.2X_1 + 1.4X_2 + 3.3X_3 + 0.6X_4 + 0.999X_5$$
Here, the variables $X_1$ to $X_5$ represent the following financial ratios:
- $X_1$: Working Capital/Total Assets
- $X_2$: Retained Earnings/Total Assets
- $X_3$: Earnings Before Interest and Taxes (EBIT)/Total Assets
- $X_4$: Market Value of Total Equity/Book Value of Total Liabilities
- $X_5$: Sales/Total Assets
The Z-Score model categorizes enterprises into three zones based on their Z-Score values:
- Safe Zone: Z > 2.99 - indicates no bankruptcy risk
- Caution Zone: 1.81 < Z < 2.99 - suggests potential bankruptcy risk
- Distress Zone: Z < 1.81 - indicates high bankruptcy risk
However, applying the original Z-Score formula to all stocks regardless of industry is a common mistake. Altman later developed adjusted formulas for different industry sectors. The original Z-Score applies to manufacturing firms listed on the stock exchange, while non-manufacturing entities, such as real estate, financial services, and education, use the Z"-Score. This adjustment reflects the differing operating methods, financial structures, and risks across industries.
The formula for Z"-Score is:
$$Z'' = 6.56X_1 + 3.26X_2 + 6.72X_3 + 1.05X_4$$
Enterprises are classified based on Z"-Score as follows:
- Safe Zone: Z" > 2.6 - indicates no bankruptcy risk
- Caution Zone: 1.1 < Z" < 2.6 - suggests potential bankruptcy risk
- Distress Zone: Z" < 1.1 - indicates high bankruptcy risk
Thus, the author applied the Z"-Score to classify bankruptcy risk for 219 non-manufacturing enterprises in Vietnam (e.g., tourism, telecommunications, and real estate). A value of Z" < 2.6 indicates bankruptcy risk (label 1), while Z" > 2.6 indicates no bankruptcy risk (label 0). The classification labeled 34.59% as 1 and 65.41% as 0. The author considered this distribution balanced enough to apply machine learning models without additional data balancing techniques.
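The labeling rule described above can be sketched in a few lines of Python. This is only an illustration: the function names and the three sets of ratios are hypothetical and are not taken from the study's data, and assigning label 0 when Z" equals exactly 2.6 is an assumption, since the text does not cover that case.

```python
def z2_score(x1, x2, x3, x4):
    """Z''-Score for non-manufacturing firms, using the coefficients given in the text."""
    return 6.56 * x1 + 3.26 * x2 + 6.72 * x3 + 1.05 * x4


def bankruptcy_label(z2, threshold=2.6):
    """Return 1 (bankruptcy risk) if Z'' < threshold, else 0.

    The text leaves Z'' == 2.6 unspecified; mapping it to 0 is an assumption.
    """
    return 1 if z2 < threshold else 0


# Hypothetical (X1, X2, X3, X4) ratios for three firms; not from the study's data.
firms = {
    "firm_a": (0.30, 0.20, 0.10, 1.00),
    "firm_b": (-0.10, 0.05, 0.02, 0.40),
    "firm_c": (0.10, 0.10, 0.05, 0.50),
}

scores = {name: z2_score(*ratios) for name, ratios in firms.items()}
labels = {name: bankruptcy_label(z) for name, z in scores.items()}

# Share of firms labeled 1, analogous to the class ratio reported in the text.
share_at_risk = sum(labels.values()) / len(labels)

for name in firms:
    print(f"{name}: Z''={scores[name]:.3f}, label={labels[name]}")
print(f"share labeled 1: {share_at_risk:.2%}")
```

Computing the share of label 1 right after labeling is a quick way to check class balance before training any classifier.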
Picture 2: The ratio of label 1 to label 0
Source: Author's calculations
3.2 Research methodology
In this section, the author outlines the algorithms and methodologies used in the comparative study of traditional and machine learning models for predicting bankruptcy risk among Vietnamese businesses. The models included in this study are Logistic Regression, K-Nearest Neighbor, Naive Bayes, Support Vector Machine, Random Forest, and Extreme Gradient Boosting. Each algorithm is described below.
3.2.1 Logistic Regression (LR)
Logistic Regression is a statistical model used to predict the probability of a binary outcome from one or more predictor variables. It estimates the relationship between the binary dependent variable and the independent variables using a logistic function. The formula for logistic regression is:
$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}$$
where $P(Y = 1 \mid X)$ is the probability of the event occurring, $\beta_0$ is the intercept, and $\beta_1, \beta_2, \dots, \beta_n$ are the coefficients of the predictor variables $X_1, X_2, \dots, X_n$. The model uses maximum likelihood estimation to find the coefficients that maximize the likelihood of observing the given data.
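To make the logistic function and the likelihood criterion concrete, the following minimal sketch evaluates both using only the Python standard library. The function names, the toy one-variable data set, and the two candidate coefficient values are assumptions for illustration; a real fit would search for the maximizing coefficients numerically rather than compare two fixed guesses.

```python
import math


def predict_proba(x, intercept, coefs):
    """P(Y = 1 | X) from the logistic function for one observation x."""
    linear = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-linear))


def log_likelihood(X, y, intercept, coefs):
    """Log-likelihood of binary labels y under the given coefficients."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(xi, intercept, coefs)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total


# Toy data (hypothetical): one predictor, label 1 for positive values.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]

ll_null = log_likelihood(X, y, intercept=0.0, coefs=[0.0])  # predictor ignored
ll_fit = log_likelihood(X, y, intercept=0.0, coefs=[1.0])   # positive slope

print(f"log-likelihood with slope 0: {ll_null:.4f}")
print(f"log-likelihood with slope 1: {ll_fit:.4f}")
```

The coefficients with the higher log-likelihood explain the observed labels better, and maximum likelihood estimation selects coefficients on exactly this basis.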
Logistic Regression is particularly useful for its simplicity and interpretability, which make it possible to understand the influence of each predictor variable on the probability of the outcome.
3.2.2 K-Nearest Neighbor (KNN)
The K-Nearest Neighbor algorithm is a non-parametric method used for classification and regression. For classification tasks, the algorithm classifies a data point based on how its neighbors are classified. The process involves several steps:
First, the author selects the number of nearest neighbors (K) to consider. This is a crucial parameter that affects the algorithm's performance. Then, the author computes the distance between the test data point and all the training data points using a chosen distance metric, typically the Euclidean distance. Next, the author identifies the K training data points closest to the test data point. Finally, the author classifies the test data point by assigning it the class label that is most frequent among the K nearest neighbors.
The K-Nearest Neighbor algorithm is simple to implement and can capture complex decision boundaries, making it effective for various classification problems.
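The four steps above map almost one-to-one onto code. The sketch below is a minimal from-scratch version using the Python standard library. The function name `knn_predict`, the two-feature toy data, and the choice of K = 3 are illustrative assumptions and do not reproduce the study's configuration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the query to every training point.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Step 3: keep the labels of the k closest points.
    nearest_labels = [label for _, label in distances[:k]]
    # Step 4: return the most frequent label among them.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature training set: label 1 = at risk, 0 = not at risk.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
train_y = [1, 1, 0, 0, 0]

# Step 1: choose K (an odd K avoids ties in a two-class problem).
pred_low = knn_predict(train_X, train_y, query=(0.15, 0.15), k=3)
pred_high = knn_predict(train_X, train_y, query=(0.8, 0.8), k=3)
print(pred_low, pred_high)
```

Because every prediction scans all training points, this approach is practical for a data set of a few hundred firms but becomes slow on much larger ones.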
3.2.3 Support Vector Machine (SVM)
Support Vector Machine is a supervised learning algorithm used for classification and regression tasks. The objective of SVM is to find the hyperplane that best separates the data points of different classes. The best hyperplane is the one that maximizes the margin between the classes, defined as the distance between the hyperplane and the nearest data point of each class.
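The margin described above can be computed directly for any candidate hyperplane, which makes the notion of the "best" hyperplane easier to see. The sketch below compares two hand-picked hyperplanes on a toy data set. The function names, the data, and the encoding of class labels as +1 and -1 are assumptions for illustration, and no optimization is performed here.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def geometric_margin(w, b, X, y):
    """Smallest label-adjusted distance; positive only if every point is on its correct side."""
    return min(yi * signed_distance(w, b, xi) for xi, yi in zip(X, y))


# Hypothetical linearly separable data with labels encoded as +1 / -1.
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]

margin_diagonal = geometric_margin((1.0, 1.0), 0.0, X, y)  # hyperplane x1 + x2 = 0
margin_vertical = geometric_margin((1.0, 0.0), 0.0, X, y)  # hyperplane x1 = 0

print(f"margin of x1 + x2 = 0: {margin_diagonal:.4f}")
print(f"margin of x1 = 0:      {margin_vertical:.4f}")
```

Both hyperplanes separate the two classes, but the diagonal one leaves more room to the nearest points, so it is the one an SVM would prefer between the two.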
To find this hyperplane, solve the following optimization problem:
$$\min_{w, b} \frac{1}{2} \lVert w \rVert^2$$
subject to the constraints:
$$y_i (w \cdot x_i + b) \ge 1$$