HNUE JOURNAL OF SCIENCE 
Natural Sciences, 2020, Volume 65, Issue 4A, pp. 33-41 
This paper is available online at  
A NEW METHOD BASED ON CLUSTERING IMPROVES THE EFFICIENCY 
OF IMBALANCED DATA CLASSIFICATION 
Nguyen Thi Hong and Dang Xuan Tho 
Faculty of Information Technology, Hanoi National University of Education 
Abstract. Imbalanced data classification is an important practical problem and has become a
new focus for many researchers. In medical diagnosis, for example, ill people account for only a
very small percentage of the population, so detecting them is difficult, and misclassifying them
can cause serious consequences, even affecting human life. Therefore, imbalanced classification
requires high accuracy, and data preprocessing is a common solution that gives good results.
This paper introduces several approaches to imbalanced data classification and proposes a new
method based on clustering the data. We implemented this method and experimented on UCI
international datasets: Blood, Glass, Haberman, Heart, Pima, and Yeast. For example, when
classifying the Yeast dataset, the G-mean of the original data is 21.41%, but after applying the
new method it increases to 83.06%. The experimental results show that the new method
significantly increases classification efficiency.
Keywords: imbalanced data classification, data mining, clustering-based undersampling.
1. Introduction 
Many classification algorithms have been published, such as k-nearest neighbors, decision
trees, Naïve Bayes, and support vector machines. These standard algorithms are designed for
balanced classification problems and have been validated experimentally. However, they are
not effective when applied to data with a large disparity in the number of samples between
classes [1-3]. Therefore, new approaches are needed for the case of imbalanced data.
Data imbalance occurs when there is a significant difference in the number of samples
between classes. The class with many samples is called the majority class; the class with few
samples is called the minority class. Because the majority class overwhelms the minority class,
the efficiency of the classification process is significantly reduced. For example, the
Mammography dataset consists of 11,183 samples, of which 10,923 are labeled "negative" (no
cancer) and 260 are labeled "positive" (cancer). A model that correctly classifies only 10% of
the minority class misclassifies 234 minority samples into the majority class, meaning 234
people with cancer are diagnosed as cancer-free [4, 5]. Clearly, such misclassification has more
serious
consequences than misclassifying a non-cancer patient as having cancer. Therefore, imbalanced
data classification is a problem with important practical applications that has attracted the
interest of many scientists in the field of data mining.
Received March 20, 2020. Revised May 5, 2020. Accepted May 12, 2020.
Contact Nguyen Thi Hong, e-mail address: [email protected]
In this paper, in order to increase the accuracy of prediction models on imbalanced data,
we propose a new cluster-based sampling method. Tests on a number of datasets show
substantial improvements compared with using no data-balancing strategy and with the
previous method.
2. Content 
2.1. Related works 
The problem of imbalanced class distribution has attracted many researchers, and many
scientific papers have been published. In general, there are two main approaches: at the
algorithm level and at the data level. At the algorithm level, authors propose new algorithms or
improve existing classification methods to handle data with an imbalanced class distribution,
for example by using different decision thresholds or one-class methods. Specifically,
cost-sensitive learning algorithms add weights to the minority class [6], adjust the predicted
probabilities at the leaves of decision trees [7], or add different penalty constants for each class
and adjust the class boundary in support vector machine algorithms [8].
On the other hand, data-level methods preprocess the data to rebalance the training set
with over-sampling or under-sampling strategies. Over-sampling approaches increase the
number of minority class samples, while under-sampling approaches reduce the number of
majority class samples. Several common methods are used, such as SPY [9], SMOTE [10],
Borderline-SMOTE [11], Safe-level-SMOTE [12], Safe-SMOTE [13], and BAM [14].
Specifically, in 2017, Wei-Chao Lin et al. [15] proposed a clustering-based undersampling
method to reduce the number of majority class samples. As shown in Figure 1, the number of
clusters is first set equal to the number of minority class samples. Then, the cluster centers are
found by running the k-means clustering algorithm on the majority class samples. These cluster
centers replace all majority class samples, so the majority and minority classes end up with the
same number of samples. Empirical results on real data have shown that this method is highly
effective; however, in some cases we found that it not only fails to improve classification
efficiency but even reduces accuracy. In the next section, we analyze a major drawback of this
method and thereby propose improvements.
Figure 1. Clustering-based undersampling procedure
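For concreteness, the following is a minimal sketch of the procedure in Figure 1, assuming numeric NumPy feature arrays and scikit-learn's KMeans as the clustering routine; the function and variable names are ours, not from [15].

import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_maj, X_min, random_state=0):
    # Number of clusters k = number of minority class samples, as in [15]
    k = len(X_min)
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_maj)
    # The k cluster centers replace the entire majority class
    X_maj_reduced = km.cluster_centers_
    # Balanced training set: k centers (label 0) plus all minority samples (label 1)
    X_bal = np.vstack([X_maj_reduced, X_min])
    y_bal = np.hstack([np.zeros(k, dtype=int), np.ones(len(X_min), dtype=int)])
    return X_bal, y_bal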
2.2. Main drawback of clustering-based undersampling
To illustrate this approach, Figure 2(a) shows the majority class samples clustered by the
k-means algorithm, with the number of clusters set to the number of minority class samples
(i.e., k = N). Figure 2(b) shows how these k cluster centers replace the entire majority class
dataset: all majority samples of a cluster are represented by their cluster center, which is then
used for learning. Therefore, it is important to select cluster centers that represent the samples
in their clusters well. As we know, the k-means algorithm updates each cluster center as the
average of all samples in the cluster:
o_i = \frac{\sum_{x_j \in C_i} x_j}{|C_i|}    (1)
where x_j is a data sample, C_i is a cluster (a set of data samples), and o_i is the center of
cluster C_i. However, some clusters contain noise samples that distort the cluster center, as
shown in Figure 2(c). Therefore, in order to address this drawback and improve the
classification accuracy of the clustering-based undersampling method, we focus on choosing a
better cluster center to represent the samples in the cluster. This idea is illustrated in Figure 3;
the new method, which we term Priority-Center (PC), is presented in more detail in the next
section.
Figure 2. Main drawback of clustering-based undersampling
2.3. The Priority-Center method
In order to overcome the drawback of clustering-based undersampling described above,
we focus on choosing a good cluster center that represents the samples in the cluster. The ideal
cluster center should be pulled toward the densely populated regions of the cluster; samples
located in high-density areas therefore receive a higher weight than the other samples. Based on
this idea, we propose a new method, the Priority-Center method.
Figure 3. The idea of Priority-Center
First, for each sample in the cluster, we count its number of neighbors n_j within radius r.
Then, because the cluster center should represent the important samples of the cluster, we
compute it from the key samples x'_j only, not from all samples in the cluster. The priority
order of the key samples in the cluster is determined by the value of n_j of each sample.
Finally, the new cluster center is computed by the following formula:
o_i = \frac{\sum_{x'_j \in S_i} x'_j \times n_j}{\sum_{x'_j \in S_i} n_j}    (2)
where x'_j are the key samples of the cluster (the set S_i), n_j is the number of neighbors of
x'_j within radius r, and o_i is the new center of cluster C_i. By applying this new formula, we
obtain a better cluster center, as shown in Figure 3(b).
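As an illustration, the following sketch computes the Priority-Center of a single cluster under stated assumptions: neighbors are counted with a Euclidean radius query, and the key-sample set S_i is taken to be the samples whose neighbor count n_j is at least the cluster median. The text above fixes only the radius r and the priority order, so this cut-off is an illustrative choice of ours, not a prescription of the method.

import numpy as np

def priority_center(cluster, r):
    # n_j: number of other samples within Euclidean distance r of x_j
    dists = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
    n = (dists <= r).sum(axis=1) - 1  # subtract 1 to exclude the sample itself
    # Key samples (assumed rule): at least as dense as the cluster median
    key = n >= np.median(n)
    if n[key].sum() == 0:
        return cluster.mean(axis=0)  # degenerate cluster: fall back to formula (1)
    # Formula (2): weighted mean of the key samples, with weights n_j
    return (cluster[key] * n[key, None]).sum(axis=0) / n[key].sum()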
2.4. Experiments 
2.4.1. Evaluation metrics
Table 1. Confusion matrix

                      Predicted Positive    Predicted Negative
Reality Positive              TP                    FN
Reality Negative              FP                    TN
The confusion matrix in Table 1 is used to evaluate the performance of binary
classification methods. The labeled test data are classified to compute the values TP, FN, FP,
and TN. The effectiveness of classification on balanced data is assessed by measures such as
Recall, Precision, and Accuracy, given by the following formulas:
TPrate = Recall = \frac{TP}{TP + FN}    (3)
TNrate = \frac{TN}{TN + FP}    (4)
FPrate = \frac{FP}{TN + FP}    (5)
FNrate = \frac{FN}{TP + FN}    (6)
PPvalue = Precision = \frac{TP}{TP + FP}    (7)
Accuracy = \frac{TP + TN}{TP + FP + TN + FN}    (8)
For datasets with a high imbalance rate, if most of the minority class elements are
misclassified while most of the majority class elements are correctly classified, the accuracy
may still be high. Therefore, these measures are no longer suitable for evaluating classification
efficiency on imbalanced datasets. The G-mean measure was proposed to balance the accuracy
on the majority and minority classes [7, 16, 17]; its value is high only if both TPrate and TNrate
are high.
G-mean = \sqrt{TPrate \times TNrate}    (9)
In this paper, we use the G-mean value to evaluate the classification efficiency on the
datasets after adjusting the imbalance ratio.
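As a small worked example, the following hypothetical helper computes TPrate, TNrate, and G-mean directly from true and predicted labels, with the counts matching Table 1 when the minority class is taken as "positive".

import numpy as np

def g_mean(y_true, y_pred, positive=1):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    tp_rate = tp / (tp + fn)  # formula (3): recall on the minority class
    tn_rate = tn / (tn + fp)  # formula (4): specificity on the majority class
    return np.sqrt(tp_rate * tn_rate)  # formula (9)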
2.4.2. Experimental method and datasets 
To evaluate the effectiveness of the proposed method, we used 6 imbalanced datasets from
the UCI machine learning repository: Blood, Glass, Haberman, Heart, Pima, and Yeast. Among
them, Haberman, Heart, and Pima concern breast cancer, heart disease, and diabetes,
respectively. Blood contains information about whether or not Taiwanese donors donated blood
in March 2007. Yeast concerns protein localization sites in yeast. Glass concerns whether or not
a glass sample is a "float" type. All of these datasets have an imbalance between the number of
elements in the majority and minority classes. In particular, Glass (1:6) and Yeast (1:28) are the
two datasets with the highest imbalance ratios. Classifying these datasets with standard
algorithms such as SVM, KNN, and Naïve Bayes is inefficient. Therefore, we adjusted these
datasets with the proposed method to reduce the imbalance. Information about the datasets is
given in Table 2.
Table 2. UCI imbalanced datasets 
Dataset      Total elements   Minority elements   Majority elements   Minority percentage
Blood             748               178                 570                 23.8%
Glass             214                29                 185                 13.5%
Haberman          306                81                 255                 24.1%
Heart             270               120                 150                 44.4%
Pima              768               268                 500                 34.9%
Yeast            1484                51                1433                  3.43%
We conducted experiments using 20 runs of 10-fold cross-validation. In each run, a dataset
was divided into 10 parts; each part in turn was used as testing data and the remaining 9 as
training data. The values of the confusion matrix were calculated from the classification results
on the testing data. The G-mean averaged over the 20 runs of 10-fold cross-validation is used to
evaluate classification performance.
In this experiment, we adjusted the original datasets with the Clustering-based
undersampling (CU) and Priority-Center (PC) methods, and then classified the data using the
SVM, Random Forest, Naïve Bayes, and k-nearest neighbors (KNN) algorithms, implemented
by the R packages kernlab, randomForest, e1071, and class.
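Although our experiments used the R packages above, the 20 times 10-fold protocol itself is easy to restate; the sketch below uses scikit-learn stand-ins (RepeatedStratifiedKFold and SVC are our substitutions for the R implementations, and resample and g_mean refer to the hypothetical helpers sketched earlier). Resampling is applied to the training folds only, so the test folds keep the original imbalance.

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC

def evaluate(X, y, resample=None, n_repeats=20, seed=0):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=n_repeats, random_state=seed)
    scores = []
    for train, test in cv.split(X, y):
        X_tr, y_tr = X[train], y[train]
        if resample is not None:
            X_tr, y_tr = resample(X_tr, y_tr)  # e.g. CU or PC undersampling
        clf = SVC().fit(X_tr, y_tr)
        scores.append(g_mean(y[test], clf.predict(X[test])))
    return np.mean(scores)  # average G-mean over the 20 x 10 folds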
2.4.3. Results 
The G-mean values in Figure 4 show the classification efficiency on the original datasets
and on the datasets adjusted by Clustering-based undersampling and Priority-Center. The
proposed method is better on most datasets when classifying with the SVM, Random Forest,
and KNN algorithms, except for Naïve Bayes. In particular, on the Haberman dataset the
G-mean values of the PC method increase significantly, by 5-8%, compared with the CU
method; on the other datasets there are slight increases of 1-4%. After applying the
undersampling methods, the number of elements in the datasets decreases sharply, so the
efficiency of classification by the Naïve Bayes algorithm improves insignificantly. The number
of elements of each dataset before and after undersampling is given in Table 3.
Table 3. UCI imbalanced datasets after adjustment

Dataset      Minority elements    Majority elements    Total elements
             after adjusting      after adjusting      after adjusting
Blood              178                  178                  356
Glass               29                   29                   58
Haberman            81                   81                  162
Heart              120                  120                  240
Pima               268                  268                  536
Yeast               51                   51                  102
Figure 4. Comparison of the G-mean values obtained when classifying the original
datasets and the datasets adjusted by CU and PC
Figure 5 shows 3D illustrations of the Haberman, Heart, and Yeast datasets before and
after being adjusted by the Clustering-based undersampling and Priority-Center algorithms. It
can be seen that the Heart dataset does not change much after adjustment, because the numbers
of majority and minority class elements of this dataset are not very different (120:150), so the
classification efficiency is hardly improved. The Yeast dataset, in contrast, has a high
imbalance ratio (51:1433); after being adjusted by PC, only 102 elements remain (51:51).
Therefore, the classification efficiency of the Naïve Bayes algorithm on this dataset decreases
significantly. However, with both the Random Forest and SVM algorithms the classification
efficiency improves significantly, because the adjusted dataset has a less overlapping
distribution.
Figure 5. 3D illustration of Haberman, Heart and Yeast datasets before 
and after being adjusted by CU and PC 
Table 4 shows the p-values calculated with the t.test function in R when comparing the
G-mean values in Figure 4. Most of the p-values are less than 5%, so the differences between
the G-mean values of the methods are statistically significant.
Table 4. P-values comparing the G-mean of PC with those of the original data and of CU

              SVM                    Naïve Bayes            Random Forest          KNN
Dataset       Original    CU         Original    CU         Original    CU         Original    CU
Blood         <2.2e-16    0.0301     <2.2e-16    0.0329     <2.2e-16    0.0007     <2.2e-16    0.0647
Glass         <2.2e-16    2.70e-09   1.12e-05    0.3913     3.40e-08    0.00017    6.16e-11    0.000767
Haberman      <2.2e-16    3.92e-14   <2.2e-16    0.07081    <2.2e-16    4.15e-11   <2.2e-16    2.08e-08
Heart         0.05397     0.156      0.202       0.4161     7.23e-04    0.176      0.003911    0.465
Pima          <2.2e-16    4.57e-03   3.76e-16    0.02276    <2.2e-16    0.422      <2.2e-16    0.490
Yeast         <2.2e-16    4.33e-04   3.76e-16    0.02276    <2.2e-16    1.34e-06   <2.2e-16    0.490
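For reference, a comparison of this kind can be reproduced with SciPy's t-test as a stand-in for R's t.test (R defaults to the Welch variant, hence equal_var=False); the two input arrays are the per-run G-mean values of the methods being compared.

from scipy import stats

def compare_gmeans(gmeans_a, gmeans_b):
    # Welch two-sample t-test, matching the default behavior of R's t.test
    t_stat, p_value = stats.ttest_ind(gmeans_a, gmeans_b, equal_var=False)
    return p_value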
3. Conclusions 
In this study, we presented a new clustering-based algorithm to improve data classification
efficiency. By replacing the majority class samples with the center of each cluster, the method
makes it possible to handle large databases, improves computational efficiency, and reduces the
level of data imbalance, thereby increasing the accuracy of the classification results.
Based on this research and the results achieved, we see many issues that need further
study. Specifically, we will study combining noise removal with other methods such as
Safe-level-SMOTE, Borderline-SMOTE, and Add-Border-SMOTE, and further develop the
Priority-Center algorithm to achieve higher efficiency in solving class imbalance problems.
REFERENCES 
[1] M. Zareapoor, 2015. Application of Credit Card Fraud Detection: Based on Bagging 
Ensemble Classifier. Int. Conf. Intell. Comput. Commun. Converg., Vol. 48, No. 12, 
pp. 679-686. 
[2] W. Xiao, J. Zhang, Y. Li, S. Zhang, and W. Yang, 2017. Class-specific cost regulation
extreme learning machine for imbalanced classification. Neurocomputing, Vol. 261, pp. 70-82.
[3] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, 2017. Learning from 
class-imbalanced data: Review of methods and applications. Expert Syst. Appl., Vol. 73, 
pp. 220-239. 
[4] X. Yuan, L. Xie, and M. Abouelenien, 2018. A regularized ensemble framework of deep 
learning for cancer detection from multi-class, imbalanced training data. Pattern Recognit., 
Vol. 77, pp. 160-172. 
[5] M. Arafat, A. Qusef, and G. Sammour, 2019. Detection of Wangiri Telecommunication Fraud 
Using Ensemble Learning. IEEE Jordan International Joint Conference on Electrical 
Engineering and Information Technology (JEEIT), pp. 330-335. 
[6] Z. Sun, Q. Song, X. Zhu, H. Sun, B. Xu, and Y. Zhou, 2015. A novel ensemble method for 
classifying imbalanced data. Pattern Recognit., Vol. 48, No. 5, pp. 1623-1637. 
[7] C. X. Ling, Q. Yang, J. Wang, and S. Zhang, 2004. Decision Trees with Minimal Costs. Proc. 
21st Int. Conf. Mach. Learn., p. 69. 
[8] Y. Zhang, P. Fu, W. Liu, and G. Chen, 2014. Imbalanced data classification based on scaling 
kernel-based support vector machine. Neural Comput. Appl., Vol. 25, No. 3-4, pp. 927-935. 
[9] X. T. Dang, D. H. Tran, O. Hirose, and K. Satou, 2019. SPY: A Novel Resampling Method for 
Improving Classification Performance in Imbalanced Data. Seventh International Conference 
on Knowledge and Systems Engineering (KSE), pp. 280-285. 
[10] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, 2002. SMOTE: Synthetic
Minority Over-sampling Technique. J. Artif. Intell. Res., Vol. 16, pp. 321-357.
[11] H. Han, W. Wang, and B. Mao, 2005. Borderline-SMOTE: A New Over-Sampling Method in 
Imbalanced Data Sets Learning. Lect. Notes Comput. Sci., Vol. 3644, pp. 878-887. 
[12] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, 2009. Safe-Level-SMOTE: Safe-
Level-Synthetic Minority Over-Sampling Technique. Lect. Notes Comput. Sci., Vol. 5476, 
pp. 475-482. 
[13] X. T. Dang et al., 2013. A Novel Over-Sampling Method and its Application to Cancer 
Classification from Gene Expression Data. Chem-Bio Informatics J., Vol. 13, pp. 19-29. 
[14] H. Nguyen Thi and T. Dang Xuan, 2019. BAM: border adjustment method improve the
efficiency of imbalanced biological data classification. HNUE J. Sci., Vol. 64, No. 6,
pp. 173-182.
[15] W. C. Lin, C. F. Tsai, Y. H. Hu, and J. S. Jhang, 2017. Clustering-based undersampling in
class-imbalanced data. Inf. Sci., Vol. 409-410, pp. 17-26.
[16] C.-M. Vong, J. Du, C.-M. Wong, and J.-W. Cao, 2018. Postboosting Using Extended
G-Mean for Online Sequential Multiclass Imbalance Learning. IEEE Trans. Neural Netw.
Learn. Syst.