
INTERNATIONAL CONFERENCE FOR YOUNG RESEARCHERS IN ECONOMICS & BUSINESS 2020 (ICYREB 2020)

THE EFFECTIVENESS OF METHODS IN DEALING WITH IMBALANCED DATA IN CREDIT SCORING: THE CASE OF VIETNAM COMMERCIAL BANKS

HIỆU QUẢ CỦA CÁC PHƯƠNG PHÁP XỬ LÝ DỮ LIỆU MẤT CÂN BẰNG TRONG CHẤM ĐIỂM TÍN DỤNG: TRƯỜNG HỢP TẠI CÁC NGÂN HÀNG THƯƠNG MẠI VIỆT NAM

Nguyen Thi Lien, MS; Nguyen Thi Thu Trang, MS; Nguyen Thi Dung
National Economics University
lientkt@neu.edu.vn

Abstract

This article investigates the effectiveness of imbalanced data processing methods in the problem of customer classification at commercial banks. This is a common issue in customer classification, where observations of one class outnumber those of the remaining class. We apply methods widely used around the world, including the undersampling, oversampling, and bothsampling techniques and SMOTE (Synthetic Minority Oversampling Technique), to deal with the imbalance. The logit model is applied to the datasets processed by these methods to classify customers. Using 7,501 transaction records from individual customers, the classification results on data treated with these techniques all improve significantly compared to the results on untreated data. The results also show that the most efficient method is the SMOTE technique combined with the logit model using variables transformed by Weight of Evidence (WOE).

Keywords: bothsampling, credit scoring, oversampling, SMOTE, undersampling, WOE

Introduction

In classification problems, we aim to predict the class of observations with the highest possible accuracy. However, the accuracy of classification algorithms can be impaired by imbalanced data (López et al., 2013), where observations of one class outnumber the other class(es) by at least 5:1 in the binary case (He & Garcia, 2009). In order to improve prediction accuracy and lower computation expense, prior identification and quantification of the most relevant input variables of the model is always highly advised. Imbalanced data also appears in other fields, such as fault detection (Gong & Qiao, 2012; Silva et al., 2006; Wei et al., 2013), toxin detection (Harley et al., 2020), medical diagnosis (Ertekin et al., 2007), and customer churn prediction (Zhu et al., 2017). In the medical area, if the group of cancer patients is taken as the positive class and the remaining persons as the negative class, the two groups differ greatly in their numbers of observations in the datasets. Other examples of rare events include software defects (Rodriguez et al., 2014),
cancer gene expressions (Wu et al., 2012), fraudulent credit card transactions (Kundu et al., 2009), fraud detection in telecommunications (Augustin et al., 2012), and natural disaster events (Kim et al., 2016). The most prevalent class is called the majority class, while the rarest is called the minority class (Huang et al., 2016).

Several techniques for processing imbalanced data have been developed at the data level, while others concentrate on the algorithmic level. Approaches to the imbalance problem fall into four categories: preprocessing strategies, cost-sensitive learning methods, adaptations of machine learning techniques, and combinations of the previous three. Preprocessing strategies include resampling techniques and/or variable importance analysis. Resampling algorithms are represented by oversampling and undersampling methods. Resampling techniques rebalance the sample space of an imbalanced dataset to alleviate the effect of the skewed class distribution on the learning process; they are more versatile because they are independent of the selected classifier (López et al., 2013). Oversampling methods create synthetic minority samples by randomly duplicating the existing minority samples. Some research (Chawla et al., 2002; Estabrooks et al., 2004; Tahir et al., 2009) introduced random undersampling to manage imbalanced datasets: the undersampling method drops non-default observations to handle the data imbalance. However, these methods have limitations such as overfitting or losing essential information. The SMOTE (Synthetic Minority Oversampling Technique) method increases the number of minority-class instances based on the KNN (K Nearest Neighbor) algorithm, as studied by Batista et al. (2004). The added observations have properties close to those of the original observations, reducing the level of imbalance in the dataset. Based on the original SMOTE method, several newly developed algorithms include SMOTE-N (Synthetic Minority Oversampling Technique Nominal), SMOTE-NC (Synthetic Minority Oversampling Technique Nominal Continuous), and MWMOTE (Majority Weighted Minority Oversampling Technique; Barua et al., 2012). Han et al. (2005) introduced two methods, Borderline-SMOTE 1 and Borderline-SMOTE 2, which only alter the observations at the border of the sample to reduce the misclassification rate. The SMOTEBoost method (Chawla et al., 2003) improved the SMOTE method by incorporating the AdaBoost.M2 algorithm. The EasyEnsemble method (Liu et al., 2009) increased the efficiency of the undersampling algorithm by adding useful information to the clustering method to reduce the noise level of the generated sample. The BalanceCascade method (Han et al., 2009) guides the deletion of observations rather than deleting them at random. Besides, bothsampling is a combination of the above methods (Dubey et al., 2014).

The traditional classification methods for predicting the probability of default of a loan are logistic regression (Hosmer et al., 2000; Maalouf and Trafalis, 2011; Shu et al., 2014; Maratea et al., 2014) and neural networks and decision trees (Quinlan, 1998; Sarkar et al., 2019). Other methods applied include gradient boosting and least squares support vector machines (Jin et al., 2012). These models follow either a statistical or an artificial intelligence (AI) orientation.
Input variable analysis for these algorithms has gained attention in many practical applications (Ferretti et al., 2016) due to the complexity of interactions among variables in large datasets. Variable analysis is a crucial task to improve model interpretability, reduce the computational cost, optimize data storage, and provide a smaller number of relevant input variables. Several approaches have been considered that group input variables into new categories or values without losing explanatory or predictive power. Principal Component Analysis (PCA) is only suitable for features that are normally distributed, near-standard, or linearly related (Jolliffe and Cadima, 2016). Other analyses construct a model based on the Weight of Evidence (WOE) and Information Value (IV) in clustering algorithms (Polykretis and Chalkias, 2018). In Vietnam, an undersampling algorithm that randomly removes elements on the data boundary has been introduced for medical data (Phuong et al., 2016), and research on SMOTE techniques in machine learning for credit card fraud detection (Lien et al., 2018) shows that the algorithm is suitable for controlling credit card fraud in Vietnam.

Currently, commercial banks in Vietnam face many difficulties in building internal risk management models according to the Basel II capital standards (issued by the Basel Committee on Banking Supervision; BIS, 2004) and Circular No. 41/2016/TT-NHNN (issued by the State Bank of Vietnam). Specific difficulties occur in data collection, data processing, and the selection and evaluation of the effectiveness of internal risk management models. Moreover, according to VietstockFinance (2020), the bad debt ratios of Vietnamese commercial banks were all below 3.42% in 2019, which means that the number of non-performing loans in the datasets is much smaller than that of performing loans. As a result, imbalanced datasets affect the efficiency of good-bad classification algorithms at Vietnamese commercial banks, since such datasets can bias predictions in favor of the majority group (Ganganwar, 2012). Practice therefore shows the need to find imbalanced data processing algorithms and debt classification models appropriate to the Vietnamese context. Earlier studies have not yet investigated the effectiveness of variable analysis based on WOE grouping after the imbalanced datasets have been processed. This article handles an imbalanced credit dataset with the oversampling, undersampling, bothsampling, and SMOTE algorithms in the Vietnamese banking context. We then feed the rebalanced data into logistic regression with input variables transformed by WOE in order to evaluate the importance of these algorithms and select an efficient credit scoring method for commercial banks in Vietnam.

Credit scoring method at commercial banks

The credit scoring process at commercial banks has a number of steps. The first step is data processing. One of the challenging problems when applying regression models in practice is the quality of the data, and the data processing step consumes a great deal of time, accounting for nearly 80% of the total. Data processing includes segmenting, sampling, data partitioning, and handling missing data and outliers.

The next step is the scorecard construction process. After data cleaning, the variables are grouped by binning, and the new value for each group is its WOE (Weight of Evidence). WOE presents the predictive power of an independent variable in relation to the dependent variable. The distribution of good is the percentage of good customers in a particular group, and the distribution of bad is the percentage of bad customers in that group; for bin i:

$\mathrm{WOE}_i = \ln\left(\frac{\text{Distribution of good}_i}{\text{Distribution of bad}_i}\right)$

The WOE approach is suitable for the logistic regression model thanks to the convenience of scoring and no need to deal with missing data. Binning reduces the number of attributes, because if the original variables were used in the regression model, a great many dummy variables would be created. Binning is also useful for variables whose relationship to the dependent variable is nonlinear; the technique helps deal with nonlinear effects in a linear model.

The next step is to remove some variables before running the model, namely the variables with poor ability to differentiate between good and bad accounts. Eliminating them before running the model improves the model quality and shortens processing time afterwards. The Information Value (IV) index is used to remove weak variables (Siddiqi, 2012). The IV is calculated using the following formula:

$\mathrm{IV} = \sum_i \left(\text{Distribution of good}_i - \text{Distribution of bad}_i\right) \times \mathrm{WOE}_i$

Criteria for evaluating variable predictiveness according to IV:

Table 1. The assessment of IV

IV            Assessment
< 0.02        Not useful for prediction
0.02 - 0.1    Weak predictive power
0.1 - 0.3     Medium predictive power
0.3 - 0.5     Strong predictive power
> 0.5         Suspicious predictive power

(Source: Siddiqi, 2012)

Based on the IV, variables that are not useful for prediction or that have weak predictive power are immediately removed from the model. For the remaining variables, it is necessary to combine the IV with expert opinion to achieve the goal of dropping unnecessary variables without losing good factors.
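As a concrete illustration of the binning arithmetic above, the sketch below computes the per-bin WOE and IV contributions of a variable. It is a minimal sketch, not the authors' implementation: it assumes a pandas DataFrame with an already-binned feature column and a binary target where 1 marks a bad customer, and all column names are illustrative.

```python
# Minimal WOE/IV sketch; assumes a DataFrame with an already-binned feature
# column and a binary target (1 = bad, 0 = good). Names are illustrative.
import numpy as np
import pandas as pd

def woe_iv(df: pd.DataFrame, feature: str, target: str) -> pd.DataFrame:
    """Return per-bin WOE and the IV contribution of each bin of `feature`."""
    g = df.groupby(feature)[target].agg(bad="sum", total="count")
    g["good"] = g["total"] - g["bad"]
    dist_good = g["good"] / g["good"].sum()      # distribution of good per bin
    dist_bad = g["bad"] / g["bad"].sum()         # distribution of bad per bin
    g["woe"] = np.log(dist_good / dist_bad)      # WOE_i = ln(dist_good_i / dist_bad_i)
    g["iv"] = (dist_good - dist_bad) * g["woe"]  # per-bin IV contribution
    return g

# Per Table 1, a variable would be dropped if its summed IV is below 0.02:
# iv_total = woe_iv(df, "age_bin", "default")["iv"].sum()
```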
Logit regression method

The dependent variable is the default status of the customer, symbolized by D (D = 0 if the customer pays on time, D = 1 if the customer pays late). The relationship between customer characteristics X_1, ..., X_k and repayment ability is modelled through the logit function as follows:

$P(D = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}}$

The maximum likelihood method is applied to estimate the coefficients β, thereby calculating the default probability for each specific customer. The criteria for selecting variables for the logit model are statistical significance and expert judgement. At the same time, operational efficiency evaluation includes considering the resources mobilized for data collection and whether the use of the variables is consistent with the provisions of the law.

Algorithms for handling imbalanced datasets

Oversampling

The common method of increasing the number of instances in the group of bad accounts is randomly repeating the data in this group: take a set of randomly selected examples from the minority group, then augment the size of this group by replicating the selected examples and adding them to it. As a result, the total number of examples in the bad group increases and the class distribution becomes more balanced accordingly. This random method simply replicates a portion of the minority class in order to increase the weights of those examples. Because the replication process is totally random, the method recreates some existing examples of the original minority class; its main drawback is therefore that overfitting can occur. This method is the most fundamental of the oversampling techniques, and many other common oversampling algorithms used in the real world are developed from it (Huang, 2015).

Undersampling

The undersampling method randomly reduces the number of observations in the good group. The downside of this technique is that it may eliminate useful observations from the majority group. The common technique applied in undersampling is weighted sampling, described as follows. Calculate the weighted Euclidean distance of each negative instance from each of the positive instances, with all features weighted by their Fisher's discriminant score (F1 measure), which measures the overlap per attribute. After that, for each positive sample, sort the negative instances in ascending order of distance from the positive sample. Finally, for each positive instance, select a user-defined number of negative instances; this number indicates the desired ratio of negative to positive samples. At this stage, special care is taken to avoid repetitive selection of negative samples: if a particular negative sample has already been selected, the next available negative sample is selected (Fernández et al., 2018).
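A minimal NumPy sketch of the two random baselines described above (random duplication and random removal, not the weighted-distance variant) might look as follows; X and y are assumed arrays with 1 marking the bad (minority) class.

```python
# Random over-/undersampling sketch; X is a 2-D feature array and y a 0/1
# label array with 1 = bad (minority), 0 = good (majority).
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y):
    """Duplicate random minority rows until both classes are equal in size."""
    minority, majority = np.where(y == 1)[0], np.where(y == 0)[0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_undersample(X, y):
    """Keep a random majority subset equal in size to the minority class."""
    minority, majority = np.where(y == 1)[0], np.where(y == 0)[0]
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([kept, minority])
    return X[idx], y[idx]
```

In practice the same effect can be obtained with the RandomOverSampler and RandomUnderSampler classes of the imbalanced-learn package.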
Bothsampling

Because oversampling produces too many duplicated bad observations and undersampling can cause loss of the original data, a combination of these two methods can be chosen in the hope of fixing both problems. However, it may not yield good results either, because the method simultaneously multiplies bad observations and randomly removes good ones.

SMOTE (Synthetic Minority Over-sampling Technique)

The SMOTE algorithm carries out an oversampling approach, but the key difference is that it introduces synthetic examples rather than replicating existing instances. To create these new data points, the KNN algorithm is used within the minority group, and new observations are created between two or more observations of the same group. For this reason, the procedure is said to be focused on the "feature space" rather than on the "data space". For example, a positive instance x_i is selected as the basis for creating new synthetic data points. Based on a distance metric, several nearest neighbors of the same class (points x_i1 to x_i4) are chosen from the training set. Finally, a randomized interpolation is carried out in order to obtain new instances r_1 to r_4 (Fernández et al., 2018).

SMOTE treats nominal and continuous attributes in different ways: in the nearest-neighbor calculations, it uses the Euclidean distance for continuous attributes and the Value Difference Metric for nominal attributes.

For continuous attributes:
- Take the difference between the feature vector (a minority-class sample) and one of its k closest neighbors (another minority-class sample).
- Multiply this difference by a random number between 0 and 1.
- Add this difference to the original feature vector to constitute a new feature vector.

For nominal attributes:
- Take the majority value among the feature vector under consideration and its k nearest neighbors; in the case of a tie, choose at random.
- Assign that value to the additional new case for the minority class.

Using SMOTE creates additional regions for the minority class, thereby allowing the classification methods to predict more minority cases.
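The continuous-attribute interpolation just described can be sketched as below. This is a simplified SMOTE for purely numeric data under stated assumptions, not the full published algorithm (nominal handling is omitted); X_min, n_new, and k are illustrative names, and in practice one would use the SMOTE class of imbalanced-learn.

```python
# Simplified SMOTE sketch for continuous attributes only; X_min holds the
# minority-class rows, n_new is the number of synthetic rows to create.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_continuous(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # k + 1 neighbors, because each point is its own nearest neighbor.
    _, neigh = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    new_rows = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(len(X_min))              # pick a minority sample x_i
        nb = X_min[rng.choice(neigh[i][1:])]      # one of its k minority neighbors
        gap = rng.random()                        # random number in [0, 1)
        new_rows[t] = X_min[i] + gap * (nb - X_min[i])  # interpolate on the segment
    return new_rows
```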
SMOTEBoost algorithm

The SMOTEBoost algorithm is the integration of the SMOTE algorithm into the standard boosting process; thus, it gains the benefits of both boosting and SMOTE. The algorithm applies to data with medium and high levels of imbalance, improves the prediction results for the minority class, and improves the F-value. Boosting helps improve the predictive power of classifiers by recalculating the weights of misclassified observations, whereas SMOTE only improves the classification of minority cases. By attaching SMOTE to boosting, the combined procedure makes boosting focus more on minority cases than on the majority. SMOTEBoost implicitly increases the weight of false negatives in the distribution because the SMOTE algorithm increases the number of minority observations; in the next iteration, SMOTEBoost can therefore create a wider decision region for the minority class. SMOTEBoost combines SMOTE's Recall-improving power with boosting's Precision-improving power; altogether, it improves the F-value.

Validation

In the classification problem, two methods are used to evaluate the performance of the model: the confusion matrix and the ROC curve.

Confusion matrix

In these problems, it is common to define the more critical data class as the positive class (P) and the other one as the negative class (N). In a good-bad classification problem, bad means positive and good means negative. From there, we define True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN) to create the non-normalized confusion matrix as follows:

Table 2. Confusion matrix

                    Actual "bad"            Actual "good"
Predicted "bad"     True Positives (TP)     False Positives (FP)
Predicted "good"    False Negatives (FN)    True Negatives (TN)

The evaluation criteria derived from the confusion matrix are the following:

Table 3. Evaluation criteria

                    Actual "bad"            Actual "good"
Predicted "bad"     TPR = TP/(TP+FN)        FPR = FP/(FP+TN)
Predicted "good"    FNR = FN/(TP+FN)        TNR = TN/(FP+TN)

FPR (False Positive Rate) is also known as the false forecast rate; FNR (False Negative Rate) is also known as the omission rate.

For classification problems where the sizes of the classes are hugely different from each other, a commonly used pair of efficiency measures is Precision-Recall:

Precision = TP/(TP+FP)
Recall = TP/(TP+FN) = TPR

High Precision means that the accuracy of the bad observations found is high. Precision = 1 (i.e., FP = 0) means that all observations predicted to be bad are truly bad; however, this does not guarantee finding all bad observations. A high Recall means a high True Positive Rate, that is, a low rate of omitting bad observations. Recall = 1 (i.e., FN = 0) means finding all bad observations; however, it does not guarantee that all bad predictions are correct. Therefore, a good classification model is one that has both Precision and Recall as high as possible, as close to 1 as possible. To measure the quality of the classifier based on both Precision and Recall, we use the F score (the F1 score when β = 1):

$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \times \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$

Choosing β higher than 1 means valuing Recall over Precision, and vice versa, β less than 1 means that Precision is more important than Recall. Two commonly used values of β are 2 and 0.5.
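The criteria above translate directly into code. The following is a small sketch with illustrative names, assuming 0/1 label arrays with bad encoded as 1 (degenerate cases such as zero denominators are not handled).

```python
# Confusion-matrix criteria sketch; y_true and y_pred are 0/1 arrays, 1 = bad.
import numpy as np

def evaluation_criteria(y_true, y_pred, beta=1.0):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # = TPR
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return {"TPR": recall, "FPR": fp / (fp + tn),
            "precision": precision, "recall": recall, "F": f_beta}
```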
AUROC curve

The values of FPR and TPR change when the good-bad threshold changes. When the threshold decreases, both FPR and TPR increase, which means more false forecasts than omissions. Conversely, when the threshold increases, both FPR and TPR decrease, which means more omissions than false forecasts.

Figure 1. Receiver Operating Characteristic curve, or ROC curve (Source: Narkhede, 2018)

For each threshold, there is one pair of values (FPR, TPR). Plotting the points (FPR, TPR) as the threshold changes from 0 to 1 yields a line called the Receiver Operating Characteristic curve, or ROC curve. The graph is plotted with FPR on the horizontal axis (in some graphs denoted 1 - Specificity) and TPR on the vertical axis (in some graphs denoted Sensitivity). Based on the ROC curve, one can tell whether a model is useful or not. An efficient model has low FPR and high TPR, so there exists a point on its ROC curve close to the point with coordinates (0, 1) on the graph (the upper left corner); the closer the curve gets to this point, the more efficient the model. Another parameter used to evaluate a model is the Area Under the Curve (AUROC), the area under the ROC curve. This value is a positive number less than or equal to 1; the larger the value, the better the model. The meaning of the AUROC criterion is as follows:

Table 4. Assessment of model by AUROC

AUROC         Assessment of model
> 0.90        Excellent
0.80 - 0.90   Good
0.70 - 0.80   Fair
0.60 - 0.70   Poor
0.50 - 0.60   Fail

(Source: D'Agostino et al., 2013)

Gini coefficient

The Gini coefficient is calculated according to the formula (Schechtman and Schechtman, 2016):

Gini = 2 × AUROC - 1

This coefficient is also used to evaluate the relevance or significance level of models. Its values lie between 0 and 1, where a score of 1 means that the model is 100% accurate in predicting the outcome; the closer the Gini is to 1, the better the model. On the other hand, a Gini score equal to 0 means the model is entirely inaccurate.
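A short sketch of these two criteria using scikit-learn is shown below; p stands for the predicted default probabilities on the test set and y_true for the actual 0/1 labels (1 = bad), both illustrative names.

```python
# ROC/AUROC/Gini sketch; roc_curve sweeps the classification threshold and
# returns one (FPR, TPR) pair per threshold value.
from sklearn.metrics import roc_auc_score, roc_curve

def roc_summary(y_true, p):
    fpr, tpr, thresholds = roc_curve(y_true, p)
    auroc = roc_auc_score(y_true, p)
    gini = 2 * auroc - 1        # Gini = 2 * AUROC - 1
    return fpr, tpr, auroc, gini
```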
Results of experimental research

The article uses a credit dataset from a commercial bank in Vietnam, consisting of 7,052 observations, of which 456 are bad observations. With a bad rate of 7.13%, the surveyed data is imbalanced between the group of good observations and the group of bad ones. The customer information in the data is encrypted. After outlier processing, the imbalanced data is handled by the oversampling, undersampling, bothsampling, and SMOTE methods. The processed datasets are then divided into a training set (70%) and a testing set (30%). For each generated dataset, the logistic model is run with the original variables and with the WOE grouping variables for credit scoring. The model evaluation results are as follows:

Table 5. Results of logistic models

Model   Logistic with            Method          AUROC    Gini     Precision   Recall   F_score
[1]     Original variables       Original        0.5521   0.1042   0.008       0.999    0.023
[2]     Original variables       Oversampling    0.6271   0.254    0.581       0.499    0.308
[3]     Original variables       Undersampling   0.6089   0.3166   0.628       0.784    0.448
[4]     Original variables       Bothsampling    0.6296   0.2646   0.595       0.627    0.372
[5]     Original variables       SMOTE           0.6105   0.221    0.628       0.580    0.353
[6]     WOE grouping variables   Oversampling    0.7051   0.4196   0.656       0.639    0.385
[7]     WOE grouping variables   Undersampling   0.6228   0.244    0.599       0.634    0.376
[8]     WOE grouping variables   Bothsampling    0.6978   0.3424   0.636       0.618    0.373
[9]     WOE grouping variables   SMOTE           0.8208   0.642    0.757       0.742    0.447

(Source: authors' calculation)

Using the original imbalanced dataset, the logistic model [1] has an AUROC of 0.5521 and a Gini coefficient of 0.1042. This result shows that the model is barely able to distinguish between good and bad observations. The low Precision and high Recall (0.008 and 0.999, respectively) mean that the accuracy of the bad observations found is low; as a result, the predictions are biased towards the majority group. After processing with the oversampling, undersampling, bothsampling, and SMOTE methods, the validation results of the logit regression models on the rebalanced datasets (models [2]-[5]) improve compared to model [1], which shows that imbalance handling increases the efficiency of the classification methods. The AUROC of these logistic models rises above 0.5, showing that the models are capable of differentiation. Among them, the undersampling method shows better results than the other three methods because its Gini, Precision, Recall, and F_score coefficients are all higher. However, the Gini coefficients remain in the range 0.2 - 0.3, showing that the classification ability of these models is still quite weak.

To improve the differentiation of the model, we combine logistic regression with variable analysis. The validation results of the models on rebalanced datasets with WOE grouping variables (models [6]-[9]) show a marked improvement in the evaluation criteria of classification ability; the AUROC results show that using binned WOE variables improves the ability to classify good and bad customers. In particular, model [9], the logistic model on SMOTE data combined with WOE grouping variables, improves significantly, with a high AUROC (0.8208) and Gini coefficient (0.642). This model also has the highest Precision (0.757) and Recall (0.742) compared to models [6], [7], and [8], and the same conclusion holds for the F_score.
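For readers who want to reproduce this kind of experiment, the following is a hedged sketch of one arm of the comparison (SMOTE plus the logit model) using scikit-learn and imbalanced-learn. The bank dataset is confidential, so X and y are placeholders, the WOE transformation of X is assumed to have been done beforehand, and the sketch follows the common practice of resampling only the training split (the study describes resampling before splitting).

```python
# One arm of the comparison (SMOTE + logistic regression); X, y stand in
# for the (confidential) bank dataset after WOE transformation.
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def smote_logit_auroc(X, y, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)   # 70/30 split
    X_bal, y_bal = SMOTE(random_state=seed).fit_resample(X_tr, y_tr)
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    auroc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return auroc, 2 * auroc - 1                               # AUROC and Gini
```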
Conclusion

The imbalanced data problem occurs in many different fields and reduces the efficiency of classification regressions. Several algorithms have been introduced to handle this problem in the data preprocessing step, and various methods are presented in this study. Each method has both advantages and drawbacks, and a bank must consider rationality, stability, strength, and complexity when working with each of them. The oversampling algorithm is easy to implement; however, although the data size increases, the observations are repeated from the original ones, so in some cases the oversampling method shows ineffective classification. The undersampling technique, which randomly reduces the data in the good observation group, is also easy to perform, but in turn it can eliminate useful observations from the majority group. Overcoming the disadvantages of oversampling, the SMOTE method supplements the minority class by creating completely new artificial observations, showing more optimal results in most classification cases.

We have tested the effectiveness of the oversampling, undersampling, bothsampling, and SMOTE algorithms on specific data in Vietnam. Based on the AUROC, Gini, Recall, Precision, and F-score coefficients, the results show that imbalanced datasets have a negative effect on the ability to classify good and bad customers. Furthermore, the undersampling technique is more efficient when used in the logit model with the original variables. When variable analysis grouped by WOE is combined to improve the performance of the classification regression, the SMOTE algorithm shows outstanding efficiency compared to the other methods mentioned.

Appendix: Figures representing the AUROC

Figure 2. AUROC of the logistic model with original variables
Figure 3. AUROC of the logistic model with oversampling data
Figure 4. AUROC of the logistic model with undersampling data
Figure 5. AUROC of the logistic model with bothsampling data
Figure 6. AUROC of the logistic model with SMOTE data
Figure 7. AUROC of the logistic model clustering variables by WOE with oversampling data
Figure 8. AUROC of the logistic model clustering variables by WOE with undersampling data
Figure 9. AUROC of the logistic model clustering variables by WOE with bothsampling data
Figure 10. AUROC of the logistic model clustering variables by WOE with SMOTE data

REFERENCES

1. Augustin, S., Gaißer, C., Knauer, J., Massoth, M., Piejko, K., Rihm, D., & Wiens, T. (2012). Telephony fraud detection in next generation networks. Proceedings of the AICT, 203-207.
2. Barua, S., Islam, M. M., Yao, X., & Murase, K. (2012). MWMOTE - majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2), 405-425.
3. Basel Committee on Banking Supervision (2004). Basel II: Revised international capital framework. bis.org, 2004-06-10.
4. Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1), 20-29.
5. Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W. (2003). SMOTEBoost: Improving prediction of the minority class in boosting. In European Conference on Principles of Data Mining and Knowledge Discovery (pp. 107-119). Springer, Berlin, Heidelberg.
6. Dubey, R., Zhou, J., Wang, Y., Thompson, P. M., Ye, J., & Alzheimer's Disease Neuroimaging Initiative (2014). Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study. NeuroImage, 87, 220-241.
7. Ertekin, S., Huang, J., Bottou, L., & Giles, L. (2007). Learning on the border: Active learning in imbalanced data classification. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (pp. 127-136).
8. Fernández, A., Garcia, S., Herrera, F., & Chawla, N. V. (2018). SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research, 61, 863-905.
9. Ferretti, F., Saltelli, A., & Tarantola, S. (2016). Trends in sensitivity analysis practice in the last decade. Science of the Total Environment, 568, 666-670.
10. Ganganwar, V. (2012). An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering, 2(4), 42-47.
11. Gong, X., & Qiao, W. (2012). Imbalance fault detection of direct-drive wind turbines using generator current signals. IEEE Transactions on Energy Conversion, 27(2), 468-476.
12. Han, H., Wang, W. Y., & Mao, B. H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing (pp. 878-887). Springer, Berlin, Heidelberg.
13. Han, S., Yuan, B., & Liu, W. (2009). Rare class mining: Progress and prospect. In 2009 Chinese Conference on Pattern Recognition (pp. 1-5). IEEE.
14. Harley, J. R., Lanphier, K., Kennedy, E., Whitehead, C., & Bidlack, A. (2020). Random forest classification to determine environmental drivers and forecast paralytic shellfish toxins in Southeast Alaska with high temporal resolution. Harmful Algae, 99, 101918.
15. He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.
16. Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2000). Introduction to the logistic regression model. Applied Logistic Regression, 15, 1-30.
17. VietstockFinance (2020). https://vietstock.vn/2020/02/buc-tranh-no-xau-ngan-hang-nam-2019-757-730262.htm
18. Huang, C., Li, Y., Loy, C. C., & Tang, X. (2016). Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5375-5384).
19. Huang, P. J. (2015). Classification of Imbalanced Data Using Synthetic Over-Sampling Techniques (Doctoral dissertation, UCLA).
20. Jin, Y., Yang, K., Wu, Y. J., Liu, X. S., & Chen, Y. (2012). Application of particle swarm optimization based least square support vector machine in quantitative analysis of extraction solution of safflower using near-infrared spectroscopy. Fenxi Huaxue, 40(6), 925-931.
21. Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 20150202.
22. Kim, S., Kim, H., & Namkoong, Y. (2016). Ordinal classification of imbalanced data with application in emergency and disaster information services. IEEE Intelligent Systems, 31(5), 50-56.
23. Kundu, A., Panigrahi, S., Sural, S., & Majumdar, A. K. (2009). BLAST-SSAHA hybridization for credit card fraud detection. IEEE Transactions on Dependable and Secure Computing, 6(4), 309-315.
24. Liu, T. Y. (2009). EasyEnsemble and feature selection for imbalance data sets. In 2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing (pp. 517-520). IEEE.
25. López, V., Fernández, A., García, S., Palade, V., & Herrera, F. (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113-141.
26. Maalouf, M., & Trafalis, T. B. (2011). Robust weighted kernel logistic regression in imbalanced and rare events data. Computational Statistics & Data Analysis, 55(1), 168-183.
27. Maratea, A., Petrosino, A., & Manzo, M. (2014). Adjusted F-measure and kernel scaling for imbalanced data learning. Information Sciences, 257, 331-341.
28. Siddiqi, N. (2012). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. DOI: 10.1002/9781119201731.
29. Narkhede, S. (2018). Understanding AUC-ROC curve. Towards Data Science, 26.
30. Nguyen Thi Lien, Nguyen Thi Thu Trang, & Nguyen Chien Thang (2018). Machine learning techniques to detect credit card fraud. Journal of Economics and Development.
31. Phương, N. M., Tuyết, T. T. Á., Hồng, N. T., & Thọ, Đ. X. (2016). Random border undersampling. Science and Technology.
32. Polykretis, C., & Chalkias, C. (2018). Comparison and evaluation of landslide susceptibility maps obtained from weight of evidence, logistic regression, and artificial neural network models. Natural Hazards, 93(1), 249-274.
33. Quinlan, J. R. (1998). Miniboosting decision trees. Journal of Artificial Intelligence Research, 1-15.
34. Rodriguez, D., Herraiz, I., Harrison, R., Dolado, J., & Riquelme, J. C. (2014). Preliminary comparison of techniques for dealing with imbalance in software defect prediction. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (pp. 1-10).
35. Sarkar, S., Raj, R., Vinay, S., Maiti, J., & Pratihar, D. K. (2019). An optimization-based decision tree approach for predicting slip-trip-fall accidents at work. Safety Science, 118, 57-69.
36. Schechtman, E., & Schechtman, G. (2016). The relationship between Gini methodology and the ROC curve. Available at SSRN 2739245.
37. Shu, B., Zhang, H., Li, Y., Qu, Y., & Chen, L. (2014). Spatiotemporal variation analysis of driving forces of urban land spatial expansion using logistic regression: A case study of port towns in Taicang City, China. Habitat International, 43, 181-190.
38. Silva, K. M., Souza, B. A., & Brito, N. S. (2006). Fault detection and classification in transmission lines based on wavelet transform and ANN. IEEE Transactions on Power Delivery, 21(4), 2058-2063.
39. Tahir, M. A., Kittler, J., Mikolajczyk, K., & Yan, F. (2009). A multiple expert approach to the class imbalance problem using inverse random under sampling. In International Workshop on Multiple Classifier Systems (pp. 82-91). Springer, Berlin, Heidelberg.
40. The State Bank of Vietnam (2016). Circular No. 41/2016/TT-NHNN dated 30/12/2016 regulating the capital adequacy ratio for foreign banks and branches in Vietnam.
41. Wei, W., Li, J., Cao, L., Ou, Y., & Chen, J. (2013). Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web, 16(4), 449-475.
42. Wu, H., Liu, X., You, L., Zhang, L., Zhou, D., Feng, J., & Yu, J. (2012). Effects of salinity on metabolic profiles, gene expressions, and antioxidant enzymes in halophyte Suaeda salsa. Journal of Plant Growth Regulation, 31(3), 332-341.
43. Zhu, B., Baesens, B., & vanden Broucke, S. K. (2017). An empirical comparison of techniques for the class imbalance problem in churn prediction. Information Sciences, 408, 84-99.
