HNUE JOURNAL OF SCIENCE
Natural Sciences, 2020, Volume 65, Issue 4A, pp. 33-41
This paper is available online at
A NEW METHOD BASED ON CLUSTERING IMPROVES THE EFFICIENCY
OF IMBALANCED DATA CLASSIFICATION
Nguyen Thi Hong and Dang Xuan Tho
Faculty of Information Technology, Hanoi National University of Education
Abstract. Imbalanced data classification is an important practical problem and has attracted many researchers. In medical diagnosis, for example, ill people account for only a very small percentage of the population, so failing to detect them can cause serious consequences, even endangering human life. Imbalanced classification therefore requires high accuracy, and data preprocessing is a common solution with good results. This paper reviews several approaches to imbalanced data classification and proposes a new method based on clustering. We implemented this method and experimented on the UCI datasets Blood, Glass, Haberman, Heart, Pima, and Yeast. For example, on the Yeast dataset the G-mean of the original data is 21.41%, but with the new method it increases to 83.06%. The experimental results show that the new method improves classification efficiency significantly.
Keywords: imbalanced data classification, data mining, clustering-based undersampling.
1. Introduction
Many classification algorithms have been published, such as k-nearest neighbors, decision trees, Naïve Bayes, and support vector machines. These standard algorithms have been tested experimentally and are applied to balanced classification problems. However, applying them to data with a large disparity in the number of samples per class is not effective [1-3]. Therefore, new approaches are needed for the case of imbalanced data.
An imbalanced dataset is one in which the classes differ significantly in their numbers of samples. The class with many samples is called the majority class; the class with few samples is the minority class. Because the majority class overwhelms the minority, the efficiency of the classification process is significantly reduced. For example, the Mammography dataset consists of 11,183 samples, of which 10,923 are labeled "negative" (no cancer) and 260 are labeled "positive" (cancer). A model that detects only 10% of the minority class misclassifies 234 minority samples into the majority class, so 234 people with cancer are diagnosed as cancer-free [4, 5]. Clearly, misclassifying such patients has far more serious consequences than misclassifying a healthy person as having cancer. Therefore, imbalanced data classification has important practical applications and attracts the interest of many scientists in the field of data mining.
Received March 20, 2020. Revised May 5, 2020. Accepted May 12, 2020
Contact Nguyen Thi Hong, e-mail address: nguyenhong@hnue.edu.vn
In this paper, in order to increase the accuracy of the prediction model in the imbalanced data classification problem, we propose a new cluster-based sampling method. In tests on a number of datasets, we achieved important improvements compared with using no data balancing strategy and with the previous method.
2. Content
2.1. Related works
The problem of imbalanced class distribution concerns many researchers, and many scientific papers have been published on it. In general, there are two main approaches: the algorithm level and the data level. At the algorithm level, authors propose new algorithms or improve existing classification methods to handle imbalanced class distributions, for example by using different decision thresholds or one-class methods. Specific examples include cost-sensitive learning algorithms that add weight to the minority class [6], adjusting the predicted probability at the leaves of decision trees [7], and adding different penalty constants for each class or adjusting the class boundary in support vector machine algorithms [8].
On the other hand, data-level methods preprocess the data to rebalance the training set, using an over-sampling or under-sampling strategy. Over-sampling approaches increase the number of minority class samples, while under-sampling approaches reduce the number of majority class samples. Several common methods are used, such as SPY [9], SMOTE [10], Borderline-SMOTE [11], Safe-Level-SMOTE [12], Safe-SMOTE [13], and BAM [14]. In 2017, Wei-Chao Lin et al. [15] proposed a clustering-based undersampling method that reduces the number of majority class samples by clustering. As shown in Figure 1, the number of clusters is first set equal to the number of minority class samples. Then, cluster centers are computed by the k-means clustering algorithm on the majority class samples, and these centers replace all majority class samples, so that the majority and minority classes contain the same number of samples. Empirical results on real data have shown that this method is highly effective; however, in some cases we found that it not only fails to improve classification efficiency but even reduces accuracy. In the next section, we analyze a major drawback of this method and thereby propose an improvement.
Figure 1. Clustering-based undersampling procedure: the imbalanced dataset is split into training and testing sets; k-means reduces the majority class (M) to k cluster centers (k equal to the size of the minority class N), producing a balanced training set for the classifier.
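Lin et al.'s procedure can be sketched in pure Python as follows. This is a minimal illustration, not the authors' implementation: the paper does not specify the k-means initialization, so seeding with `random.sample` and the fixed iteration count are assumptions made here.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style k-means; returns k cluster centers as tuples."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest current center
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old center if a cluster becomes empty
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers

def cluster_undersample(majority, minority):
    """Replace the majority class by k = |minority| cluster centers,
    giving a balanced training set (the centers keep the majority label)."""
    return kmeans(majority, k=len(minority)), minority
```

Calling `cluster_undersample` with the majority and minority samples of a training fold yields the balanced training set shown in Figure 1.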
2.2. Main drawback of Clustering-based undersampling
To illustrate this approach, Figure 2(a) shows the majority class samples clustered by the k-means algorithm, with the number of clusters set to the number of minority class samples (i.e. k = N). Figure 2(b) shows how these k cluster centers replace the entire majority class dataset. With this method, all majority samples of a cluster are represented by the cluster center, which is then used for learning, so it is important to select cluster centers that represent the samples in their cluster well. As we know, k-means updates the cluster center as the average of all samples in the cluster:
o_i = ( ∑_{x_j ∈ C_i} x_j ) / |C_i|        (1)
where x_j is a data sample, C_i is a cluster (a set of data samples), and o_i is the center of cluster C_i. However, some clusters contain noise samples that distort the selection of the cluster center, as shown in Figure 2(c). Therefore, to address this drawback and improve the classification accuracy of the clustering-based undersampling method, we focus on choosing a better cluster center that represents the samples in the cluster. This idea is illustrated in Figure 3; the new method, which we term Priority-Center (PC), is presented in more detail in the next section.
Figure 2. Main drawback of Clustering-based undersampling
2.3. The Priority-Center method
To overcome the drawback of Clustering-based undersampling described above, we focus on choosing a good cluster center that represents the samples in the cluster. The ideal cluster center should lie close to the densely populated region of the cluster, so samples located in high-density areas should receive a higher weight than the other samples. Based on this idea, we propose the new Priority-Center method.
Figure 3. The idea of Priority-Center
First, for each sample in the cluster, we count the number of its neighbors (n_j) within radius r. Then, since the cluster center should represent the important samples of the cluster, we calculate it from the key samples (x'_j) only, not from all samples in the cluster. The priority order of the key samples in a cluster is determined by the n_j value of each sample. Finally, the new cluster center is updated by the following formula:
o_i = ( ∑_{x'_j ∈ S_i} x'_j × n_j ) / ( ∑_{x'_j ∈ S_i} n_j )        (2)
where S_i is the set of key samples of cluster C_i, x'_j are the key samples, n_j is the number of neighbors of x'_j within radius r, and o_i is the new center of cluster C_i. By applying this new formula, we obtain a better cluster center, as shown in Figure 3(b).
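The Priority-Center update for a single cluster can be sketched as follows. It is a sketch under stated assumptions: the paper does not specify how many key samples are kept, so the `num_key` default (the denser half of the cluster) and the fallback when no sample has neighbors are our choices for illustration.

```python
import math

def priority_center(cluster, r, num_key=None):
    """Sketch of the Priority-Center update (formula (2)) for one cluster.

    cluster : list of points (tuples) belonging to one k-means cluster
    r       : neighborhood radius used to count n_j
    num_key : number of highest-density "key" samples kept (assumed
              default: half the cluster, since the paper does not fix it)
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # n_j: number of neighbors of each sample within radius r
    counts = [sum(1 for q in cluster if q is not p and dist(p, q) <= r)
              for p in cluster]

    # key samples x'_j: the samples with the highest neighbor counts
    if num_key is None:
        num_key = max(1, len(cluster) // 2)
    ranked = sorted(zip(cluster, counts), key=lambda t: -t[1])[:num_key]

    dim = len(cluster[0])
    total = sum(n for _, n in ranked)
    if total == 0:  # no sample has neighbors: fall back to the plain mean
        return tuple(sum(p[d] for p in cluster) / len(cluster)
                     for d in range(dim))
    # weighted mean of the key samples, weighted by n_j (formula (2))
    return tuple(sum(p[d] * n for p, n in ranked) / total for d in range(dim))
```

With a noisy outlier in the cluster, the returned center stays near the dense region instead of being pulled toward the outlier, which is exactly the behavior Figure 3(b) illustrates.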
2.4. Experiments
2.4.1. Evaluation metrics
Table 1. Confusion matrix

                    Predicted Positive    Predicted Negative
Reality Positive    TP                    FN
Reality Negative    FP                    TN

The confusion matrix in Table 1 is used to evaluate the performance of binary classification methods. A labeled dataset is classified to obtain the values TP, FN, FP, and TN. On balanced data, classification effectiveness is assessed by measures such as Recall, Precision, and Accuracy, given by the following formulas:
TPrate = Recall = TP / (TP + FN)        (3)
TNrate = TN / (TN + FP)        (4)
FPrate = FP / (TN + FP)        (5)
FNrate = FN / (TP + FN)        (6)
PPvalue = Precision = TP / (TP + FP)        (7)
Accuracy = (TP + TN) / (TP + FP + TN + FN)        (8)
For datasets with a high imbalance rate, accuracy may still be high even when most minority class elements are misclassified, as long as most majority class elements are classified correctly. Evaluating classification efficiency by the measures above is therefore no longer suitable for imbalanced datasets. The G-mean measure was proposed to balance the accuracy between the majority and minority classes [7, 16, 17]; its value is high only if both TPrate and TNrate are high:

G-mean = √(TPrate × TNrate)        (9)

In this paper, we use the G-mean value to evaluate classification efficiency on the datasets after adjusting the imbalance ratio.
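As a concrete reading of formulas (3), (4) and (9), G-mean can be computed directly from the confusion-matrix counts (a minimal sketch):

```python
import math

def g_mean(tp, fn, fp, tn):
    """G-mean = sqrt(TPrate * TNrate); returns 0 if a class is empty."""
    if tp + fn == 0 or tn + fp == 0:
        return 0.0
    tp_rate = tp / (tp + fn)  # recall on the minority (positive) class
    tn_rate = tn / (tn + fp)  # specificity on the majority (negative) class
    return math.sqrt(tp_rate * tn_rate)
```

For the Mammography example in the introduction (26 of 260 positives detected, all negatives correct), `g_mean(26, 234, 0, 10923)` is about 0.316 even though accuracy is about 97.9%, which is exactly the mismatch G-mean is designed to expose.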
2.4.2. Experimental method and datasets
To evaluate the effectiveness of the proposed method, we used six imbalanced datasets from the UCI machine learning repository: Blood, Glass, Haberman, Heart, Pima, and Yeast. Among them, Haberman, Heart, and Pima are datasets about breast cancer, heart disease, and diabetes, respectively. Blood contains information on whether or not Taiwanese donors donated blood in March 2007. Yeast concerns protein localization sites in yeast. Glass records whether or not a glass sample is a "float" type. All of these datasets are imbalanced between the numbers of majority and minority elements; Glass (1:6) and Yeast (1:28) have the highest imbalance rates. Classifying these datasets with standard algorithms such as SVM, KNN, and Naïve Bayes is inefficient, so we adjust them with the proposed method to reduce the imbalance. Information about the datasets is given in Table 2.
Table 2. UCI imbalanced datasets

Dataset    Total elements   Minority elements   Majority elements   Minority percentage
Blood      748              178                 570                 23.8%
Glass      214              29                  185                 13.5%
Haberman   306              81                  255                 24.1%
Heart      270              120                 150                 44.4%
Pima       768              268                 500                 34.9%
Yeast      1484             51                  1433                3.43%
We conducted experiments using 20 runs of 10-fold cross-validation. In each run, the dataset was divided into 10 parts; each part in turn was used as testing data while the remaining 9 served as training data. The values of the confusion matrix were calculated from the testing-data classification results, and the G-mean averaged over the 20 runs of 10-fold cross-validation was used to evaluate classification performance.
In this experiment, we adjusted the original datasets with the Clustering-based undersampling (CU) and Priority-Center (PC) methods, then classified the data with the SVM, Random Forest, Naïve Bayes, and k-nearest neighbor (KNN) algorithms, using the R packages kernlab, randomForest, e1071, and class.
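The 20 × 10-fold protocol can be sketched as follows. Here `evaluate_fold` is a hypothetical placeholder for training and testing one of the four classifiers on a given index split; the shuffling scheme is an assumption, since the paper does not specify it.

```python
import random
from statistics import mean

def repeated_cv(evaluate_fold, n, repeats=20, folds=10, seed=0):
    """Average a per-fold metric over `repeats` runs of `folds`-fold CV.

    evaluate_fold(train_idx, test_idx) should train on train_idx, test on
    test_idx, and return the metric of interest (e.g. G-mean).
    """
    scores = []
    for rep in range(repeats):
        idx = list(range(n))
        random.Random(seed + rep).shuffle(idx)       # new shuffle per run
        parts = [idx[i::folds] for i in range(folds)]  # disjoint parts
        for i, test in enumerate(parts):
            train = [j for k, part in enumerate(parts) if k != i
                       for j in part]
            scores.append(evaluate_fold(train, test))
    return mean(scores)
```

In the actual experiments, `evaluate_fold` would apply CU or PC to the training indices before fitting the classifier, so the testing fold is never resampled.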
2.4.3. Results
The G-mean values in Figure 4 show the classification efficiency of the original datasets and of the datasets adjusted by Clustering-based undersampling and Priority-Center. The proposed method is better on most datasets when classified by the SVM, Random Forest, and KNN algorithms, but not with Naïve Bayes. In particular, on the Haberman dataset the G-mean of the PC method increases by 5-8% compared with the CU method; the other datasets show slight increases of 1-4%. After undersampling, the number of elements in the datasets decreases sharply, so the Naïve Bayes classification efficiency improves only insignificantly. The numbers of elements before and after undersampling are given in Table 3.
Table 3. UCI imbalanced datasets after adjusting

Dataset    Minority elements   Majority elements   Total elements
Blood      178                 178                 356
Glass      29                  29                  58
Haberman   81                  81                  162
Heart      120                 120                 240
Pima       268                 268                 536
Yeast      51                  51                  102
Figure 4. Comparison of the G-mean values obtained when classifying the original datasets and the datasets adjusted by CU and PC
Figure 5 shows 3D illustrations of the Haberman, Heart, and Yeast datasets before and after adjustment by the Clustering-based undersampling and Priority-Center algorithms. The Heart dataset changes little after adjustment because the numbers of minority and majority class elements do not differ much (120:150), so the classification efficiency is hardly improved. The Yeast dataset, by contrast, has a high imbalance ratio (51:1433); after adjustment by PC, only 102 elements remain (51:51). Consequently, the Naïve Bayes classification efficiency on this dataset decreases significantly. However, with both Random Forest and SVM the classification efficiency improves significantly, because the adjusted dataset has less overlapping distribution.
Figure 5. 3D illustration of Haberman, Heart and Yeast datasets before
and after being adjusted by CU and PC
Table 4 shows the P-values, calculated with the t.test function in R, for the comparisons of the G-mean values in Figure 4. Most of the P-values are less than 5%, so the comparisons of the G-mean values between the methods are statistically significant.
Table 4. P-values comparing G-mean values (PC vs. the original data and PC vs. CU)

Dataset     SVM                     Naïve Bayes            Random Forest          KNN
            Original    CU          Original    CU         Original    CU         Original    CU
Blood       <2.2e-16    0.0301      <2.2e-16    0.0329     <2.2e-16    0.0007     <2.2e-16    0.0647
Glass       <2.2e-16    2.70e-09    1.12e-05    0.3913     3.40e-08    0.00017    6.16e-11    0.000767
Haberman    <2.2e-16    3.92e-14    <2.2e-16    0.07081    <2.2e-16    4.15e-11   <2.2e-16    2.08e-08
Heart       0.05397     0.156       0.202       0.4161     7.23e-04    0.176      0.003911    0.465
Pima        <2.2e-16    4.57e-03    3.76e-16    0.02276    <2.2e-16    0.422      <2.2e-16    0.490
Yeast       <2.2e-16    4.33e-04    3.76e-16    0.02276    <2.2e-16    1.34e-06   <2.2e-16    0.490
3. Conclusions
In this study, we presented a new clustering-based algorithm for improving data classification efficiency. By replacing the majority class samples with the center of each cluster, the method makes it possible to handle large databases, improves computational efficiency, reduces the level of data imbalance, and thereby increases the accuracy of the classification results.
Based on this research and the results achieved, several issues need further study. Specifically, we will investigate combining noise removal with other methods such as Safe-Level-SMOTE, Borderline-SMOTE, and Add-Border-SMOTE, or further develop the Priority-Center algorithm to achieve higher efficiency in solving class imbalance problems.
REFERENCES
[1] M. Zareapoor, 2015. Application of Credit Card Fraud Detection: Based on Bagging
Ensemble Classifier. Int. Conf. Intell. Comput. Commun. Converg., Vol. 48, No. 12,
pp. 679-686.
[2] W. Xiao, J. Zhang, Y. Li, S. Zhang, and W. Yang, Oct. 2017. Class-specific cost regulation
extreme learning machine for imbalanced classification. Neurocomputing, Vol. 261, pp. 70-82.
[3] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, 2017. Learning from
class-imbalanced data: Review of methods and applications. Expert Syst. Appl., Vol. 73,
pp. 220-239.
[4] X. Yuan, L. Xie, and M. Abouelenien, 2018. A regularized ensemble framework of deep
learning for cancer detection from multi-class, imbalanced training data. Pattern Recognit.,
Vol. 77, pp. 160-172.
[5] M. Arafat, A. Qusef, and G. Sammour, 2019. Detection of Wangiri Telecommunication Fraud
Using Ensemble Learning. IEEE Jordan International Joint Conference on Electrical
Engineering and Information Technology (JEEIT), pp. 330-335.
[6] Z. Sun, Q. Song, X. Zhu, H. Sun, B. Xu, and Y. Zhou, 2015. A novel ensemble method for
classifying imbalanced data. Pattern Recognit., Vol. 48, No. 5, pp. 1623-1637.
[7] C. X. Ling, Q. Yang, J. Wang, and S. Zhang, 2004. Decision Trees with Minimal Costs. Proc.
21st Int. Conf. Mach. Learn., p. 69.
[8] Y. Zhang, P. Fu, W. Liu, and G. Chen, 2014. Imbalanced data classification based on scaling
kernel-based support vector machine. Neural Comput. Appl., Vol. 25, No. 3-4, pp. 927-935.
[9] X. T. Dang, D. H. Tran, O. Hirose, and K. Satou, 2019. SPY: A Novel Resampling Method for
Improving Classification Performance in Imbalanced Data. Seventh International Conference
on Knowledge and Systems Engineering (KSE), pp. 280-285.
[10] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, 2002. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res., Vol. 16, pp. 321-357.
[11] H. Han, W. Wang, and B. Mao, 2005. Borderline-SMOTE: A New Over-Sampling Method in
Imbalanced Data Sets Learning. Lect. Notes Comput. Sci., Vol. 3644, pp. 878-887.
[12] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, 2009. Safe-Level-SMOTE: Safe-
Level-Synthetic Minority Over-Sampling Technique. Lect. Notes Comput. Sci., Vol. 5476,
pp. 475-482.
[13] X. T. Dang et al., 2013. A Novel Over-Sampling Method and its Application to Cancer
Classification from Gene Expression Data. Chem-Bio Informatics J., Vol. 13, pp. 19-29.
[14] H. Nguyen Thi and T. Dang Xuan, 2019. BAM: border adjustment method improve the efficiency of imbalanced biological data classification. HNUE J. Sci., Vol. 64, No. 6, pp. 173-182.
[15] W. C. Lin, C. F. Tsai, Y. H. Hu, and J. S. Jhang, 2017. Clustering-based undersampling in
class-imbalanced data. Inf. Sci. (Ny)., Vol. 409–410, pp. 17-26.
[16] C.-M. Vong, J. Du, C.-M. Wong, and J.-W. Cao, 2018. Pos