HNUE JOURNAL OF SCIENCE
Natural Sciences, 2020, Volume 65, Issue 4A, pp. 33-41
This paper is available online at http://stdb.hnue.edu.vn

A NEW METHOD BASED ON CLUSTERING IMPROVES THE EFFICIENCY OF IMBALANCED DATA CLASSIFICATION

Nguyen Thi Hong and Dang Xuan Tho
Faculty of Information Technology, Hanoi National University of Education

Abstract. Imbalanced data classification is an important problem in practice and is attracting many researchers. In medical diagnosis in particular, ill people account for only a very small percentage of the population, so failing to detect them has serious consequences and can even cost lives. Imbalanced classification therefore demands high accuracy, and preprocessing the data is a common solution that gives good results. This paper introduces some approaches to imbalanced data classification and proposes a new method based on clustering the data. We implemented this method and ran experiments on the UCI international datasets Blood, Glass, Haberman, Heart, Pima, and Yeast. For example, when classifying the Yeast data, the G-mean of the original data is 21.41%, but applying the new method raises it to 83.06%. The experimental results show that the new method increases classification efficiency significantly.

Keywords: imbalanced data classification, data mining, clustering-based undersampling.

Received March 20, 2020. Revised May 5, 2020. Accepted May 12, 2020.
Contact Nguyen Thi Hong, e-mail address: nguyenhong@hnue.edu.vn

1. Introduction

Many classification algorithms have been published, such as k-nearest neighbors, decision trees, Naïve Bayes, and support vector machines. These standard algorithms apply to balanced classification problems and have been validated experimentally. However, applying them to data with a large disparity in the number of samples between classes is not effective [1-3], so new approaches are needed for the imbalanced case.

Data imbalance occurs when the classes differ significantly in their numbers of samples: the class with many samples is called the majority class, and the class with few samples is called the minority class. Because the majority class samples are overwhelming in number, classification efficiency is significantly reduced. For example, the Mammography dataset consists of 11,183 samples, of which 10,923 are labeled "negative" (no cancer) and 260 are labeled "positive" (cancer). A model with only 10% accuracy on the minority class misclassifies 234 of the 260 cancer samples into the majority class, so 234 people with cancer are diagnosed as cancer-free [4, 5]. Clearly, misclassifying such patients has more serious consequences than misclassifying non-cancer patients as having cancer. The problem of imbalanced data classification therefore has important practical applications and interests many scientists in the field of data mining.

In this paper, in order to increase the accuracy of prediction models on imbalanced data, we propose a new cluster-based sampling method. In tests on a number of datasets, we achieved important improvements compared to using no data balancing strategy and to a previous method.
2. Content

2.1. Related works

The problem of imbalanced class distribution concerns many researchers, and many scientific papers have been published on it. In general, there are two main approaches, at the algorithm level and at the data level. At the algorithm level, authors propose new algorithms or improve existing classification methods to handle imbalanced class distributions, for example by using different decision thresholds or one-class methods. Specific examples include cost-sensitive learning algorithms that add weight to the minority class [6], adjusting the predictive probabilities of the leaves in decision trees [7], and adding different penalty constants for each class or adjusting the class boundary in support vector machines [8].

Data-level methods, on the other hand, preprocess the data to rebalance the training set, with a strategy of over-sampling or under-sampling. Over-sampling approaches increase the number of minority class samples, while under-sampling approaches reduce the number of majority class samples. Common methods include SPY [9], SMOTE [10], Borderline-SMOTE [11], Safe-level-SMOTE [12], Safe-SMOTE [13], and BAM [14].

In 2017, Wei-Chao Lin et al. [15] proposed a clustering-based undersampling method that reduces the number of majority class samples by clustering. As shown in Figure 1, the number of clusters is first set equal to the number of minority class samples. The k-means clustering algorithm is then run on the majority class samples, and the resulting cluster centers replace all majority class samples, so the majority and minority classes end up with the same number of samples. Empirical results on real data show that this method is highly effective; however, in some cases we found that it not only fails to improve classification efficiency but even reduces accuracy. In the next section, we analyze a major drawback of this method and propose an improvement.

Figure 1. Clustering-based undersampling procedure: the majority class (M) of the imbalanced training set is reduced to k cluster centers by k-means and combined with the minority class (N) to form a balanced training set; the trained classifier is then evaluated on the testing set.
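The procedure can be summarized in a few lines of R, the language used for the experiments in Section 2.4. This is a minimal sketch of the idea, not Lin et al.'s implementation; the function and variable names are illustrative, and both inputs are assumed to be data frames with identical, purely numeric feature columns.

```r
# Clustering-based undersampling: replace the majority class with the
# centers of k clusters, where k equals the minority class size.
cluster_undersample <- function(majority, minority) {
  k  <- nrow(minority)                  # one cluster per minority sample
  km <- kmeans(majority, centers = k)   # cluster the majority class only
  centers <- as.data.frame(km$centers)  # centers stand in for the majority class
  centers$class  <- "majority"
  minority$class <- "minority"
  rbind(centers, minority)              # balanced 1:1 training set
}
```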
2.2. Main drawback of clustering-based undersampling

To illustrate this approach, Figure 2a shows the majority class samples clustered by the k-means algorithm, with the number of clusters set to the number of minority class samples (i.e. k = N). Figure 2b shows the k cluster centers that replace the entire majority class. With this method, all majority samples of a cluster are represented by the cluster center, which is what the classifier later learns from. It is therefore important to select cluster centers that represent the samples in their clusters well. The k-means algorithm updates each cluster center as the average of all samples in the cluster:

o_i = \frac{\sum_{x_j \in C_i} x_j}{|C_i|}    (1)

where x_j is a data sample, C_i is a cluster (a set of data samples), and o_i is the center of cluster C_i. However, some clusters contain noise samples that distort the choice of cluster center, as shown in Figure 2c. To address this drawback and improve the classification accuracy of the clustering-based undersampling method, we focus on choosing a better cluster center to represent the samples in the cluster. This idea is illustrated in Figure 3; the new method, which we term Priority-Center (PC), is presented in detail in the next section.

Figure 2. Main drawback of clustering-based undersampling: (a) the majority class clustered by k-means; (b) the cluster centers replacing the majority class; (c) noise samples pulling a cluster center o_i away from the dense region.

2.3. The Priority-Center method

To overcome the drawback of clustering-based undersampling described above, we focus on choosing a good cluster center that represents the samples in the cluster. The ideal cluster center should not deviate from the densely populated locations of the cluster: samples located in high-density areas should carry a higher weight than the other samples. Based on this idea, we propose the new Priority-Center method.

Figure 3. The idea of Priority-Center: (a) a cluster with samples A, B, C, D and its center o_i; (b) the improved center obtained by weighting the dense samples.

First, for each sample in the cluster we count its number of nearest neighbors n_j within a radius r. Then, since the cluster center should represent the important samples in the cluster, we compute it from the key samples x'_j only, not from all samples in the cluster. The priority order of the key samples in the cluster is determined by the value n_j of each sample. Finally, the new cluster center is updated by the following formula:

o_i = \frac{\sum_{x'_j \in C_i} n_j \times x'_j}{\sum_{x'_j \in C_i} n_j}    (2)

where x'_j are the key samples, n_j is the number of neighbors of x'_j within the radius r, and o_i is the new center of cluster C_i. By applying this formula we obtain a better cluster center, as shown in Figure 3b.
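A sketch of Eq. (2) in R follows. It reflects one reading of the method, in which every sample with at least one neighbor within the radius r counts as a key sample; the names are illustrative, and `cluster` is assumed to be a numeric matrix of the samples in one cluster.

```r
# Priority-Center: density-weighted cluster center (Eq. 2).
priority_center <- function(cluster, r) {
  d <- as.matrix(dist(cluster))   # pairwise Euclidean distances
  n <- rowSums(d <= r) - 1        # n_j: neighbors within radius r, excluding self
  if (sum(n) == 0)                # no sample has neighbors within r:
    return(colMeans(cluster))     #   fall back to the plain mean of Eq. (1)
  keys <- n > 0                   # key samples are the ones carrying weight
  colSums(cluster[keys, , drop = FALSE] * n[keys]) / sum(n[keys])
}
```

Down-weighting isolated samples in this way keeps a single outlier from dragging the center away from the dense region, which is exactly the failure mode of Figure 2c.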
2.4. Experiments

2.4.1. Evaluation metrics

Table 1. Confusion matrix

                    Predicted Positive   Predicted Negative
Reality Positive    TP                   FN
Reality Negative    FP                   TN

The confusion matrix in Table 1 is used to evaluate the performance of binary classification methods: a labeled dataset is classified and the values TP, FN, FP, and TN are counted. The effectiveness of classification on balanced data is assessed by measures such as Recall, Precision, and Accuracy, given by the following formulas:

TP_{rate} = Recall = TP / (TP + FN)    (3)
TN_{rate} = TN / (TN + FP)    (4)
FP_{rate} = FP / (TN + FP)    (5)
FN_{rate} = FN / (TP + FN)    (6)
Precision = TP / (TP + FP)    (7)
Accuracy = (TP + TN) / (TP + FN + FP + TN)    (8)

For datasets with a high imbalance rate, accuracy can remain high even when most minority class elements are misclassified, as long as most majority class elements are classified correctly. Evaluating classification efficiency with these measures is therefore no longer suitable for imbalanced datasets. The G-mean measure was proposed to balance the accuracy between majority class elements and minority class elements [7, 16, 17]; its value is high only if both TP_{rate} and TN_{rate} are high:

G\text{-mean} = \sqrt{TP_{rate} \times TN_{rate}}    (9)

In this paper, we use the G-mean value to evaluate the classification efficiency on the datasets after adjusting the imbalance ratio.
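Computing G-mean from predictions takes only a few lines of R; this helper (illustrative names, binary factor labels assumed) is reused in the evaluation sketch at the end of this section.

```r
# G-mean (Eq. 9) from predicted and true labels of a binary problem.
g_mean <- function(pred, truth, pos = "minority", neg = "majority") {
  tp <- sum(pred == pos & truth == pos)
  fn <- sum(pred == neg & truth == pos)
  tn <- sum(pred == neg & truth == neg)
  fp <- sum(pred == pos & truth == neg)
  sqrt((tp / (tp + fn)) * (tn / (tn + fp)))  # sqrt(TP_rate * TN_rate)
}
```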
2.4.2. Experimental method and datasets

To evaluate the effectiveness of the proposed method, we used imbalanced datasets from the UCI international standard repository: Blood, Glass, Haberman, Heart, Pima, and Yeast. Among them, Haberman, Heart, and Pima concern breast cancer, heart disease, and diabetes, respectively. Blood records whether or not Taiwanese donors donated blood in March 2007. Yeast describes protein localization sites in yeast, and Glass records whether or not a glass sample is a "float" type. All of these datasets are imbalanced between the majority and minority classes; Glass (1:6) and Yeast (1:28) have the highest imbalance rates. Classifying these datasets with standard algorithms such as SVM, KNN, and Naïve Bayes is inefficient, so we adjust them with the proposed method to reduce the imbalance. Information about the datasets is given in Table 2.

Table 2. UCI imbalanced datasets

Dataset     Elements   Minority elements   Majority elements   Minority percentage
Blood       748        178                 570                 23.8%
Glass       214        29                  185                 13.5%
Haberman    306        81                  255                 24.1%
Heart       270        120                 150                 44.4%
Pima        768        268                 500                 34.9%
Yeast       1484       51                  1433                3.43%

We conducted experiments using 20 runs of 10-fold cross-validation. In each run, the dataset was divided into 10 sections; each section in turn served as testing data while the remaining sections served as training data. The confusion matrix values were computed from the classification results on the testing data, and the G-mean averaged over the 20 runs of 10-fold cross-validation was used to evaluate performance. We adjusted the original datasets with the clustering-based undersampling (CU) and Priority-Center (PC) methods, then classified the data with the SVM, Random Forest, Naïve Bayes, and k-nearest neighbor (KNN) algorithms using packages supported by R: kernlab, randomForest, e1071, and class. A sketch of one fold of this protocol is given at the end of this section.

2.4.3. Results

The G-mean values in Figure 4 show the classification efficiency of the original datasets and of the datasets adjusted by clustering-based undersampling and Priority-Center. The proposed method is better on most datasets when classified with the SVM, Random Forest, and KNN algorithms, but not with Naïve Bayes. On the Haberman dataset in particular, the G-mean of the PC method increases significantly, by 5-8% compared with the CU method; the other datasets show slighter increases of 1-4%. After undersampling, the number of elements in the datasets decreases sharply, so the efficiency of the Naïve Bayes algorithm improves only insignificantly. The numbers of elements in the datasets before and after undersampling are given in Tables 2 and 3.

Table 3. UCI imbalanced datasets after adjusting

Dataset     Minority elements   Majority elements   Elements after adjusting
Blood       178                 178                 356
Glass       29                  29                  58
Haberman    81                  81                  162
Heart       120                 120                 240
Pima        268                 268                 536
Yeast       51                  51                  102

Figure 4. Comparison of the G-mean values obtained when classifying the original datasets and the datasets adjusted by CU and PC.

Figure 5 shows 3D illustrations of the Haberman, Heart, and Yeast datasets before and after adjustment by the clustering-based undersampling and Priority-Center algorithms. The Heart dataset changes little after adjustment because its majority and minority class sizes do not differ much (120:150), so classification efficiency is hardly improved. The Yeast dataset has a high imbalance ratio (51:1433); after adjustment by PC only 102 elements remain (51:51), so the efficiency of the Naïve Bayes algorithm on this dataset decreases significantly. However, the classification efficiency on this dataset improves significantly with both Random Forest and SVM because the adjusted dataset has a less overlapping distribution.

Figure 5. 3D illustration of the Haberman, Heart, and Yeast datasets before and after adjustment by CU and PC.

Table 4 shows the p-values calculated with the t.test function in R when comparing the G-mean values in Figure 4. Most of the p-values are less than 5%, so the comparisons of the G-mean values between the methods are statistically significant.

Table 4. P-values comparing the G-mean values of PC against the original data and against CU for each dataset (Blood, Glass, Haberman, Heart, Pima, Yeast) under SVM, Naïve Bayes, Random Forest, and KNN. (The table body did not survive extraction; the remaining fragments include "< 2.2e-16" and "0.0301".)
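The following sketch shows one cross-validation fold of the protocol above, combining the earlier `cluster_undersample` and `g_mean` sketches with an SVM from the e1071 package (one of the R packages the paper names). The column and label names are illustrative assumptions, not the paper's code.

```r
library(e1071)  # svm(); randomForest, kernlab and class are used analogously

# One fold: rebalance the training part, fit a classifier, score with G-mean.
# `train` and `test` are data frames whose factor column `class` takes the
# values "minority" and "majority"; all other columns are numeric features.
evaluate_fold <- function(train, test) {
  feats    <- setdiff(names(train), "class")
  minority <- train[train$class == "minority", feats]
  majority <- train[train$class == "majority", feats]

  balanced <- cluster_undersample(majority, minority)  # or the PC variant
  balanced$class <- factor(balanced$class)

  model <- svm(class ~ ., data = balanced)
  pred  <- predict(model, test[, feats])
  g_mean(pred, test$class)                             # Eq. (9)
}
```

Averaging this value over the 10 folds and the 20 repetitions gives the mean G-mean used for evaluation in Section 2.4.2.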
