HNUE JOURNAL OF SCIENCE
Natural Sciences, 2020, Volume 65, Issue 4A, pp. 33-41
This paper is available online at
A NEW METHOD BASED ON CLUSTERING IMPROVES THE EFFICIENCY
OF IMBALANCED DATA CLASSIFICATION
Nguyen Thi Hong and Dang Xuan Tho
Faculty of Information Technology, Hanoi National University of Education
Abstract. Imbalanced data classification is an important practical problem and has attracted many researchers. In medical diagnosis, for example, ill people account for only a very small percentage of the population, so failing to detect them can cause serious consequences, even endangering human life. Imbalanced classification therefore requires high accuracy, and data preprocessing is a common solution with good results. This paper reviews several approaches to imbalanced data classification and proposes a new method based on clustering. We implemented this method and experimented on the UCI datasets Blood, Glass, Haberman, Heart, Pima, and Yeast. For example, on the Yeast dataset the G-mean of the original data is 21.41%, but with the new method it increases to 83.06%. The experimental results show that the new method improves classification efficiency significantly.
Keywords: imbalanced data classification, data mining, clustering-based undersampling.
1. Introduction
Many classification algorithms have been published, such as k-nearest neighbors, decision trees, Naïve Bayes, and support vector machines. These standard algorithms have been tested experimentally and are applied to balanced classification problems. However, applying them to data with a large disparity in the number of samples per class is not effective [1-3]. Therefore, new approaches are needed for the case of imbalanced data.
An imbalanced dataset is one in which the classes differ significantly in their numbers of samples. The class with many samples is called the majority class; the class with few samples is the minority class. Because the majority class overwhelms the minority, the efficiency of the classification process is significantly reduced. For example, the Mammography dataset consists of 11,183 samples, of which 10,923 are labeled "negative" (no cancer) and 260 are labeled "positive" (cancer). A model that detects only 10% of the minority class misclassifies 234 minority samples into the majority class, so 234 people with cancer are diagnosed as cancer-free [4, 5]. Clearly, misclassifying such patients has far more serious consequences than misclassifying a healthy person as having cancer. Therefore, imbalanced data classification has important practical applications and attracts the interest of many scientists in the field of data mining.
Received March 20, 2020. Revised May 5, 2020. Accepted May 12, 2020
Contact Nguyen Thi Hong, e-mail address: nguyenhong@hnue.edu.vn
In this paper, in order to increase the accuracy of the prediction model in the imbalanced data classification problem, we propose a new cluster-based sampling method. In tests on a number of datasets, we achieved important improvements compared with using no data balancing strategy and with the previous method.
2. Content
2.1. Related works
The problem of imbalanced class distribution concerns many researchers, and many scientific papers have been published on it. In general, there are two main approaches: the algorithm level and the data level. At the algorithm level, authors propose new algorithms or improve existing classification methods to handle imbalanced class distributions, for example by using different decision thresholds or one-class methods. Specific examples include cost-sensitive learning algorithms that add weight to the minority class [6], adjusting the predicted probability at the leaves of decision trees [7], and adding different penalty constants for each class or adjusting the class boundary in support vector machine algorithms [8].
On the other hand, data-level methods preprocess the data to rebalance the training set, using an over-sampling or under-sampling strategy. Over-sampling approaches increase the number of minority class samples, while under-sampling approaches reduce the number of majority class samples. Several common methods are used, such as SPY [9], SMOTE [10], Borderline-SMOTE [11], Safe-Level-SMOTE [12], Safe-SMOTE [13], and BAM [14]. In 2017, Wei-Chao Lin et al. [15] proposed a clustering-based undersampling method that reduces the number of majority class samples by clustering. As shown in Figure 1, the number of clusters is first set equal to the number of minority class samples. Then, cluster centers are computed by the k-means clustering algorithm on the majority class samples, and these centers replace all majority class samples, so that the majority and minority classes contain the same number of samples. Empirical results on real data have shown that this method is highly effective; however, in some cases we found that it not only fails to improve classification efficiency but even reduces accuracy. In the next section, we analyze a major drawback of this method and thereby propose an improvement.
Figure 1. Clustering-based undersampling procedure: the imbalanced dataset is split into training and testing sets; k-means reduces the majority class (M) to k cluster centers (k equal to the size of the minority class N), producing a balanced training set for the classifier.
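Lin et al.'s procedure can be sketched in pure Python as follows. This is a minimal illustration, not the authors' implementation: the paper does not specify the k-means initialization, so seeding with `random.sample` and the fixed iteration count are assumptions made here.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style k-means; returns k cluster centers as tuples."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest current center
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old center if a cluster becomes empty
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers

def cluster_undersample(majority, minority):
    """Replace the majority class by k = |minority| cluster centers,
    giving a balanced training set (the centers keep the majority label)."""
    return kmeans(majority, k=len(minority)), minority
```

Calling `cluster_undersample` with the majority and minority samples of a training fold yields the balanced training set shown in Figure 1.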
2.2. Main drawback of Clustering-based undersampling
To illustrate this approach, Figure 2(a) shows the majority class samples clustered by the k-means algorithm, with the number of clusters set to the number of minority class samples (i.e. k = N). Figure 2(b) shows how these k cluster centers replace the entire majority class dataset. With this method, all majority samples of a cluster are represented by the cluster center, which is then used for learning, so it is important to select cluster centers that represent the samples in their cluster well. As we know, k-means updates the cluster center as the average of all samples in the cluster:
o_i = ( ∑_{x_j ∈ C_i} x_j ) / |C_i|        (1)
where x_j is a data sample, C_i is a cluster (a set of data samples), and o_i is the center of cluster C_i. However, some clusters contain noise samples that distort the selection of the cluster center, as shown in Figure 2(c). Therefore, to address this drawback and improve the classification accuracy of the clustering-based undersampling method, we focus on choosing a better cluster center that represents the samples in the cluster. This idea is illustrated in Figure 3; the new method, which we term Priority-Center (PC), is presented in more detail in the next section.
Figure 2. Main drawback of Clustering-based undersampling
2.3. The Priority-Center method
To overcome the drawback of Clustering-based undersampling described above, we focus on choosing a good cluster center that represents the samples in the cluster. The ideal cluster center should lie close to the densely populated region of the cluster, so samples located in high-density areas should receive a higher weight than the other samples. Based on this idea, we propose the new Priority-Center method.
Figure 3. The idea of Priority-Center
First, for each sample in the cluster, we count the number of its neighbors (n_j) within radius r. Then, since the cluster center should represent the important samples of the cluster, we calculate it from the key samples (x'_j) only, not from all samples in the cluster. The priority order of the key samples in a cluster is determined by the n_j value of each sample. Finally, the new cluster center is updated by the following formula:
o_i = ( ∑_{x'_j ∈ S_i} x'_j × n_j ) / ( ∑_{x'_j ∈ S_i} n_j )        (2)
where S_i is the set of key samples of cluster C_i, x'_j are the key samples, n_j is the number of neighbors of x'_j within radius r, and o_i is the new center of cluster C_i. By applying this new formula, we obtain a better cluster center, as shown in Figure 3(b).
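The Priority-Center update for a single cluster can be sketched as follows. It is a sketch under stated assumptions: the paper does not specify how many key samples are kept, so the `num_key` default (the denser half of the cluster) and the fallback when no sample has neighbors are our choices for illustration.

```python
import math

def priority_center(cluster, r, num_key=None):
    """Sketch of the Priority-Center update (formula (2)) for one cluster.

    cluster : list of points (tuples) belonging to one k-means cluster
    r       : neighborhood radius used to count n_j
    num_key : number of highest-density "key" samples kept (assumed
              default: half the cluster, since the paper does not fix it)
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # n_j: number of neighbors of each sample within radius r
    counts = [sum(1 for q in cluster if q is not p and dist(p, q) <= r)
              for p in cluster]

    # key samples x'_j: the samples with the highest neighbor counts
    if num_key is None:
        num_key = max(1, len(cluster) // 2)
    ranked = sorted(zip(cluster, counts), key=lambda t: -t[1])[:num_key]

    dim = len(cluster[0])
    total = sum(n for _, n in ranked)
    if total == 0:  # no sample has neighbors: fall back to the plain mean
        return tuple(sum(p[d] for p in cluster) / len(cluster)
                     for d in range(dim))
    # weighted mean of the key samples, weighted by n_j (formula (2))
    return tuple(sum(p[d] * n for p, n in ranked) / total for d in range(dim))
```

With a noisy outlier in the cluster, the returned center stays near the dense region instead of being pulled toward the outlier, which is exactly the behavior Figure 3(b) illustrates.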
2.4. Experiments
2.4.1. Evaluation metrics
Table 1. Confusion matrix

                    Predicted Positive    Predicted Negative
Reality Positive    TP                    FN
Reality Negative    FP                    TN

The confusion matrix in Table 1 is used to evaluate the performance of binary classification methods. A labeled dataset is classified to obtain the values TP, FN, FP, and TN. On balanced data, classification effectiveness is assessed by measures such as Recall, Precision, and Accuracy, given by the following formulas:
TPrate = Recall = TP / (TP + FN)        (3)
TNrate = TN / (TN + FP)        (4)
FPrate = FP / (TN + FP)        (5)
FNrate = FN / (TP + FN)        (6)
PPvalue = Precision = TP / (TP + FP)        (7)
Accuracy = (TP + TN) / (TP + FP + TN + FN)        (8)
For datasets with a high imbalance rate, accuracy may still be high even when most minority class elements are misclassified, as long as most majority class elements are classified correctly. Evaluating classification efficiency by the measures above is therefore no longer suitable for imbalanced datasets. The G-mean measure was proposed to balance the accuracy between the majority and minority classes [7, 16, 17]; its value is high only if both TPrate and TNrate are high:

G-mean = √(TPrate × TNrate)        (9)

In this paper, we use the G-mean value to evaluate classification efficiency on the datasets after adjusting the imbalance ratio.
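As a concrete reading of formulas (3), (4) and (9), G-mean can be computed directly from the confusion-matrix counts (a minimal sketch):

```python
import math

def g_mean(tp, fn, fp, tn):
    """G-mean = sqrt(TPrate * TNrate); returns 0 if a class is empty."""
    if tp + fn == 0 or tn + fp == 0:
        return 0.0
    tp_rate = tp / (tp + fn)  # recall on the minority (positive) class
    tn_rate = tn / (tn + fp)  # specificity on the majority (negative) class
    return math.sqrt(tp_rate * tn_rate)
```

For the Mammography example in the introduction (26 of 260 positives detected, all negatives correct), `g_mean(26, 234, 0, 10923)` is about 0.316 even though accuracy is about 97.9%, which is exactly the mismatch G-mean is designed to expose.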
2.4.2. Experimental method and datasets
To evaluate the effectiveness of the proposed method, we used six imbalanced datasets from the UCI machine learning repository: Blood, Glass, Haberman, Heart, Pima, and Yeast. Among them, Haberman, Heart, and Pima are datasets about breast cancer, heart disease, and diabetes, respectively. Blood contains information on whether or not Taiwanese donors donated blood in March 2007. Yeast concerns protein localization sites in yeast. Glass records whether or not a glass sample is a "float" type. All of these datasets are imbalanced between the numbers of majority and minority elements; Glass (1:6) and Yeast (1:28) have the highest imbalance rates. Classifying these datasets with standard algorithms such as SVM, KNN, and Naïve Bayes is inefficient, so we adjust them with the proposed method to reduce the imbalance. Information about the datasets is given in Table 2.
Table 2. UCI imbalanced datasets

Dataset    Total elements   Minority elements   Majority elements   Minority percentage
Blood      748              178                 570                 23.8%
Glass      214              29                  185                 13.5%
Haberman   306              81                  255                 24.1%
Heart      270              120                 150                 44.4%
Pima       768              268                 500                 34.9%
Yeast      1484             51                  1433                3.43%
We conducted experiments using 20 runs of 10-fold cross-validation. In each run, the dataset was divided into 10 parts; each part in turn was used as testing data while the remaining 9 served as training data. The values of the confusion matrix were calculated from the testing-data classification results, and the G-mean averaged over the 20 runs of 10-fold cross-validation was used to evaluate classification performance.
In this experiment, we adjusted the original datasets with the Clustering-based undersampling (CU) and Priority-Center (PC) methods, then classified the data with the SVM, Random Forest, Naïve Bayes, and k-nearest neighbor (KNN) algorithms, using the R packages kernlab, randomForest, e1071, and class.
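The 20 × 10-fold protocol can be sketched as follows. Here `evaluate_fold` is a hypothetical placeholder for training and testing one of the four classifiers on a given index split; the shuffling scheme is an assumption, since the paper does not specify it.

```python
import random
from statistics import mean

def repeated_cv(evaluate_fold, n, repeats=20, folds=10, seed=0):
    """Average a per-fold metric over `repeats` runs of `folds`-fold CV.

    evaluate_fold(train_idx, test_idx) should train on train_idx, test on
    test_idx, and return the metric of interest (e.g. G-mean).
    """
    scores = []
    for rep in range(repeats):
        idx = list(range(n))
        random.Random(seed + rep).shuffle(idx)       # new shuffle per run
        parts = [idx[i::folds] for i in range(folds)]  # disjoint parts
        for i, test in enumerate(parts):
            train = [j for k, part in enumerate(parts) if k != i
                       for j in part]
            scores.append(evaluate_fold(train, test))
    return mean(scores)
```

In the actual experiments, `evaluate_fold` would apply CU or PC to the training indices before fitting the classifier, so the testing fold is never resampled.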
2.4.3. Results
The G-mean values in Figure 4 show the classification efficiency of the original datasets and of the datasets adjusted by Clustering-based undersampling and Priority-Center. The proposed method is better on most datasets when classified by the SVM, Random Forest, and KNN algorithms, but not with Naïve Bayes. In particular, on the Haberman dataset the G-mean of the PC method increases by 5-8% compared with the CU method; the other datasets show slight increases of 1-4%. After undersampling, the number of elements in the datasets decreases sharply, so the Naïve Bayes classification efficiency improves only insignificantly. The numbers of elements before and after undersampling are given in Table 3.
Table 3. UCI imbalanced datasets after adjusting

Dataset    Minority elements   Majority elements   Total elements
Blood      178                 178                 356
Glass      29                  29                  58
Haberman   81                  81                  162
Heart      120                 120                 240
Pima       268                 268                 536
Yeast      51                  51                  102
Figure 4. Comparison of the G-mean values obtained when classifying the original datasets and the datasets adjusted by CU and PC
Figure 5 shows 3D illustrations of the Haberman, Heart, and Yeast datasets before and after adjustment by the Clustering-based undersampling and Priority-Center algorithms. The Heart dataset changes little after adjustment because the numbers of minority and majority class elements do not differ much (120:150), so the classification efficiency is hardly improved. The Yeast dataset, by contrast, has a high imbalance ratio (51:1433); after adjustment by PC, only 102 elements remain (51:51). Consequently, the Naïve Bayes classification efficiency on this dataset decreases significantly. However, with both Random Forest and SVM the classification efficiency improves significantly, because the adjusted dataset has less overlapping distribution.
Figure 5. 3D illustration of Haberman, Heart and Yeast datasets before
and after being adjusted by CU and PC
Table 4 shows the P-values, calculated with the t.test function in R, for the comparisons of the G-mean values in Figure 4. Most of the P-values are less than 5%, so the comparisons of the G-mean values between the methods are statistically significant.
Table 4. P-values comparing G-mean values (PC vs. the original data and PC vs. CU)

Dataset     SVM                     Naïve Bayes            Random Forest          KNN
            Original    CU          Original    CU         Original    CU         Original    CU
Blood       <2.2e-16    0.0301      <2.2e-16    0.0329     <2.2e-16    0.0007     <2.2e-16    0.0647
Glass       <2.2e-16    2.70e-09    1.12e-05    0.3913     3.40e-08    0.00017    6.16e-11    0.000767
Haberman    <2.2e-16    3.92e-14    <2.2e-16    0.07081    <2.2e-16    4.15e-11   <2.2e-16    2.08e-08
Heart       0.05397     0.156       0.202       0.4161     7.23e-04    0.176      0.003911    0.465
Pima        <2.2e-16    4.57e-03    3.76e-16    0.02276    <2.2e-16    0.422      <2.2e-16    0.490
Yeast       <2.2e-16    4.33e-04    3.76e-16    0.02276    <2.2e-16    1.34e-06   <2.2e-16    0.490
3. Conclusions
In this study, we presented a new clustering-based algorithm for improving data classification efficiency. By replacing the majority class samples with the center of each cluster, the method makes it possible to handle large databases, improves computational efficiency, reduces the level of data imbalance, and thereby increases the accuracy of the classification results.
Based on this research and the results achieved, several issues need further study. Specifically, we will investigate combining noise removal with other methods such as Safe-Level-SMOTE, Borderline-SMOTE, and Add-Border-SMOTE, or further develop the Priority-Center algorithm to achieve higher efficiency in solving class imbalance problems.
REFERENCES
[1] M. Zareapoor, 2015. Application of Credit Card Fraud Detection: Based on Bagging
Ensemble Classifier. Int. Conf. Intell. Comput. Commun. Converg., Vol. 48, No. 12,
pp. 679-686.
[2] W. Xiao, J. Zhang, Y. Li, S. Zhang, and W. Yang, Oct. 2017. Class-specific cost regulation
extreme learning machine for imbalanced classification. Neurocomputing, Vol. 261, pp. 70-82.
[3] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, 2017. Learning from
class-imbalanced data: Review of methods and applications. Expert Syst. Appl., Vol. 73,
pp. 220-239.
[4] X. Yuan, L. Xie, and M. Abouelenien, 2018. A regularized ensemble framework of deep
learning for cancer detection from multi-class, imbalanced training data. Pattern Recognit.,
Vol. 77, pp. 160-172.
[5] M. Arafat, A. Qusef, and G. Sammour, 2019. Detection of Wangiri Telecommunication Fraud
Using Ensemble Learning. IEEE Jordan International Joint Conference on Electrical
Engineering and Information Technology (JEEIT), pp. 330-335.
[6] Z. Sun, Q. Song, X. Zhu, H. Sun, B. Xu, and Y. Zhou, 2015. A novel ensemble method for
classifying imbalanced data. Pattern Recognit., Vol. 48, No. 5, pp. 1623-1637.
[7] C. X. Ling, Q. Yang, J. Wang, and S. Zhang, 2004. Decision Trees with Minimal Costs. Proc.
21st Int. Conf. Mach. Learn., p. 69.
[8] Y. Zhang, P. Fu, W. Liu, and G. Chen, 2014. Imbalanced data classification based on scaling
kernel-based support vector machine. Neural Comput. Appl., Vol. 25, No. 3-4, pp. 927-935.
[9] X. T. Dang, D. H. Tran, O. Hirose, and K. Satou, 2019. SPY: A Novel Resampling Method for
Improving Classification Performance in Imbalanced Data. Seventh International Conference
on Knowledge and Systems Engineering (KSE), pp. 280-285.
[10] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, 2002. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res., Vol. 16, pp. 321-357.
[11] H. Han, W. Wang, and B. Mao, 2005. Borderline-SMOTE: A New Over-Sampling Method in
Imbalanced Data Sets Learning. Lect. Notes Comput. Sci., Vol. 3644, pp. 878-887.
[12] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, 2009. Safe-Level-SMOTE: Safe-
Level-Synthetic Minority Over-Sampling Technique. Lect. Notes Comput. Sci., Vol. 5476,
pp. 475-482.
[13] X. T. Dang et al., 2013. A Novel Over-Sampling Method and its Application to Cancer
Classification from Gene Expression Data. Chem-Bio Informatics J., Vol. 13, pp. 19-29.
[14] H. Nguyen Thi and T. Dang Xuan, 2019. BAM: border adjustment method improve the efficiency of imbalanced biological data classification. HNUE J. Sci., Vol. 64, No. 6, pp. 173-182.
[15] W. C. Lin, C. F. Tsai, Y. H. Hu, and J. S. Jhang, 2017. Clustering-based undersampling in
class-imbalanced data. Inf. Sci. (Ny)., Vol. 409–410, pp. 17-26.
[16] C.-M. Vong, J. Du, C.-M. Wong, and J.-W. Cao, 2018. Pos