Form 03/ĐT-KLTN/BM

VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL

GRADUATION PROJECT

Investigating some classification models and applying in bankruptcy prediction

SUPERVISOR: DR. TRAN DUC QUYNH
STUDENT: TRAN THI LAN PHUONG
STUDENT ID: 16071302
COHORT: MIS2016A
MAJOR: MANAGEMENT OF INFORMATION SYSTEM

Hanoi - 2020

LETTER OF DECLARATION

I hereby declare that the graduation project "Investigating some classification models and applying in bankruptcy prediction" is the result of my own research and has never been published in any work of others. During the implementation of this project, I have seriously observed research ethics; all findings of this project are the result of my own research and survey; all references in this project are clearly cited according to regulations. I take full responsibility for the fidelity of the numbers, data and other contents of my graduation project.

Hanoi, June 5th, 2020
Student
(Signature and full name)

ACKNOWLEDGEMENT

I have put all of my effort into this project. However, it would not have been possible without the kind support and help of many individuals and organizations, and first of all I would like to extend my sincere thanks to all of them. I am highly indebted to Dr. Tran Duc Quynh of Vietnam National University, Hanoi - International School, Department of Natural Science and Technology, for his guidance and encouragement, as well as for his timely support in completing this graduation thesis. I would also like to express my gratitude to the members of my MIS2016A class, whose timely encouragement motivated me to complete this project. Sincere thanks!
Hanoi, June 5th, 2020
Student

TABLE OF CONTENTS

LETTER OF DECLARATION
ACKNOWLEDGEMENT
ABSTRACT 10
CHAPTER I: INTRODUCTION 11
1.1 Overview of machine learning 11
1.2 The problem and the motivation to solve the problem 12
1.3 Problem definition 14
1.4 Related works 15
CHAPTER II: MACHINE LEARNING MODELS 16
2.1 Decision tree 16
2.1.1 Model introduction 16
2.1.2 Characteristics 18
2.2 Random forest 19
2.2.1 Model introduction 19
2.2.2 Characteristics 21
2.2.3 Out of bag 21
2.3 Bagging 22
2.3.1 Model introduction 22
2.3.2 Characteristics 23
2.4 Gradient Boosting 24
2.4.1 Model introduction 24
2.4.2 Characteristics 25
CHAPTER III: APPLYING CLASSIFICATION MODELS IN BANKRUPTCY PREDICTION 26
3.1 Data Analysis 26
3.1.1 Data 26
3.1.2 Dataset Quality Assessment 29
a. Missing data 29
b. Imbalanced dataset 32
3.2 Data Preparation 33
3.2.1 Dealing with duplicate and missing data 33
3.2.2 Dealing with imbalanced data 34
3.3 Experimental setup 35
3.3.1 K-fold cross-validation 35
3.3.2 Model evaluation method 35
3.4 Experimental results and comments 37
3.4.1 Results on the original dataset 37
3.4.2 Experimental results on the preprocessed dataset 39
3.4.3 Comparing the proposed model's results with other research 41
3.4.4 Result evaluation 42
Experience and benefits learned from the study 43
Conclusion 43
REFERENCES 45

TABLE OF NOTATIONS AND ABBREVIATIONS

SMOTE: Synthetic Minority Oversampling Technique
ID3: Iterative Dichotomiser 3
CART: Classification And Regression Tree
XGBoost: Extreme Gradient Boosting
AUC: Area Under the Curve
AI: Artificial Intelligence

LIST OF TABLES

Table 1: Dataset summary 27
Table 2: Dataset attribute description 28
Table 3: Missing data statistics 30
Table 4: Imbalanced dataset statistics 33
Table 5: Confusion matrix 36
Table 6: Results of the dataset imputed only by the mean method 38
Table 7: Experimental results on the preprocessed dataset 40
Table 8: AUC score of "Ensemble Boosted Trees with Synthetic Features Generation" 41

LIST OF CHARTS AND FIGURES

Figure 1: Decision tree architecture 17
Figure 2: Random forest architecture 20
Figure 3: Bagging architecture 23
Figure 4: Percentage of missing data by class 30
Figure 5: Missing data by attributes in 1st year 31
Figure 6: Missing data by attributes in 2nd year 31
Figure 7: Missing data by attributes in 3rd year 31
Figure 8: Missing data by attributes in 4th year 32
Figure 9: Missing data by attributes in 5th year 32
Figure 10: K-fold cross-validation 35
Figure 11: Experimental results on the original dataset 37
Figure 12: Experimental results on the preprocessed dataset 39

ABSTRACT

Recent years have seen much discussion of machine intelligence and what it means for human health, productivity and wellbeing. In such discussion, machine learning has demonstrated an increasingly important role with regard to fundamental human needs in the present, as well as its power to predict future events. Meanwhile, bankruptcy remains a problem of concern because of its negative effects on the economy and on wellbeing, and it is difficult to control. Research on bankruptcy prediction using machine learning is therefore necessary and practical at the moment.

The purpose of this research is to study some classification models and then identify the best predictive model for the task of bankruptcy prediction. The models studied in this document are decision tree, random forest, bagging and gradient boosting; the idea, architecture, operation and characteristics of each model are explored. Furthermore, the Polish companies' bankruptcy dataset has been chosen to support the project. The work begins by analyzing and assessing the dataset quality. Next, the dataset is preprocessed: a random forest algorithm is used to impute missing values, and the Synthetic Minority Oversampling Technique (SMOTE) is used to balance the two target labels. Then the models are applied to the processed dataset to find the best-performing model. Last but not least, the K-fold cross-validation
method is also applied to evaluate model performance. The project uses Python as the programming language, Spyder as a cross-platform integrated development environment, and Tableau and Microsoft Excel as visualization tools.

Keywords: Machine learning, Random Forest, Bagging, Gradient Boosting, SMOTE

Figure 8: Missing data by attributes in 4th year
Figure 9: Missing data by attributes in 5th year

The figures above plot the nullity of the five datasets across the 64 attributes. The more white space in a column, the more missing values that column contains. Figures 5 to 9 make clear that all columns have missing values, but attribute 37 contains the most null values (43.73%).

b. Imbalanced dataset

Besides the huge number of missing values, the Polish companies' bankruptcy dataset is also highly imbalanced. The table below summarizes the issue:

Table 4: Imbalanced dataset statistics

Dataset | Total observations | Bankrupt observations | Non-bankrupt observations | Bankrupt observations (% of total)
Year 1  |  7027 | 271 |  6756 | 3.85%
Year 2  | 10173 | 400 |  9773 | 3.93%
Year 3  | 10503 | 495 | 10008 | 4.71%
Year 4  |  9792 | 515 |  9277 | 5.25%
Year 5  |  5910 | 410 |  5500 | 6.93%

The table summarizes the population of each class label (bankrupt or non-bankrupt) by dataset. The last column shows that the dataset is highly imbalanced: even the highest rate of the bankrupt label accounts for only 6.93% of observations. This is a warning that, if the imbalance is not handled, the model will not have enough data to be trained on the minority class label (the bankrupt class) and will consequently perform poorly. The missing-data and imbalance analyses above show that the data quality is not good, and that methods are required to improve it.

3.2 Data Preparation

3.2.1 Dealing with duplicate and missing data

As analyzed in
section 3.1, the number of observations in the bankruptcy class is very low and each yearly dataset is small. Therefore, all five datasets are combined into one large dataset, and a period column is inserted to indicate the year in which the information was collected. The combined dataset then has 65 attributes and 43,405 observations. Of these, 401 observations are duplicates and are deleted. Moreover, section 3.1.2.a showed that attribute 37 has the highest number of missing observations, 18,984 out of the 43,405 total (43.73%); this column is therefore considered less informative and is deleted from the dataset.

For the remaining missing data, imputation by random forest is considered the best option for this dataset at the moment. The method is implemented in two steps. In the first step, missing values are imputed by the mean of each column. There are then two datasets: the dataset still having missing values (the 1st dataset) and the mean-imputed dataset without missing values (the 2nd dataset). The next step uses a random forest algorithm to predict the missing values, column by column. Take a column of the 1st dataset that has missing values as the target column. Observations with no missing value in this column form the training set; observations with missing values form the testing set; in both sets, the remaining attributes are taken from the corresponding rows of the 2nd dataset. A random forest regression algorithm is then applied to these training and testing sets to predict each missing value. These steps are repeated for all attributes containing missing values, and finally a new dataset imputed by the random forest algorithm is obtained.

3.2.2 Dealing with imbalanced data

The dataset is highly imbalanced: only 4.82% of observations are assigned to the bankruptcy class. Therefore, the Synthetic Minority Oversampling Technique (SMOTE) is used to create observations of the minority class so as to balance the target classes. The method works by selecting minority observations that are close to each other in the feature space, drawing a line between them, and generating a new sample at a point on that line. This procedure can create as many synthetic minority examples as needed, for instance until the minority class equals the majority class in size. The minority observations are chosen, and the number of nearest neighbours can be fixed by the user or left at its default. The approach is effective because the new synthetic minority observations are plausible: they are relatively close in the feature space to existing observations of the minority class.

3.3 Experimental setup

3.3.1 K-fold cross-validation

K-fold cross-validation is a technique used to assess how the results of a statistical analysis generalize to an independent dataset. It is largely used in settings where the target is prediction and it is necessary to estimate the accuracy of a predictive model [8]. The method divides the dataset into k folds (k is the number of sub-datasets formed from the original dataset) and takes n folds (n
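The SMOTE mechanism described in section 3.2.2 - picking a minority observation, finding its nearest minority neighbours, and generating a new sample on the line to one of them - can be sketched in a few lines of NumPy. This is a simplified illustration only, not the implementation used in the project (which relies on a library); the function name and parameters are chosen for this sketch.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples (simplified SMOTE sketch).

    X_min : array of shape (n_minority, n_features), minority-class rows only.
    k     : number of nearest minority neighbours to interpolate towards.
    """
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        # 1. pick a random minority observation
        i = rng.integers(len(X_min))
        # 2. find its k nearest minority neighbours (excluding itself)
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]
        j = rng.choice(neighbours)
        # 3. place a new sample at a random point on the line between them
        gap = rng.random()
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

In practice one would call this with `n_new` equal to the difference between the majority and minority class counts, so that the two target labels end up balanced, and a library implementation (e.g. imbalanced-learn's SMOTE) would normally be preferred over a hand-rolled version.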