HNUE JOURNAL OF SCIENCE
Natural Sciences, 2020, Volume 65, Issue 4A, pp. 42-50
This paper is available online at
A NEW HYBRID METHOD TO IMPROVE THE EFFECTIVENESS
OF CANCER DATA CLASSIFICATION
Nguyen Thi Chinh1, Dao Thi Minh2, Le Xuan Ly3 and Dang Xuan Tho4
1Gifted High School, Hanoi National University of Education
2Natural Science Department, Thai Binh Teacher Training College
3School of Applied Mathematics and Informatics, Hanoi University of Science and Technology
4Faculty of Information Technology, Hanoi National University of Education
Abstract. Imbalanced data classification is one of the most difficult issues in the machine learning and data mining community. The problem becomes even harder for data sets with a large number of features, since many redundant features reduce the efficiency of the classification process. In particular, many biomedical data sets, such as those used for diagnosing cancer, are both highly imbalanced and contain thousands of features. Finding a solution that overcomes these difficulties is therefore extremely important and meaningful. In this paper, we present an overview of imbalanced data classification and the difficulties encountered by current approaches, and from this we propose a new method, SMOTE-PLS. To evaluate the effectiveness of the new method, we conducted experiments on standard cancer data sets from the UCI repository, including breast-p, coil2000, leukemia, colon-cancer, and yeast. Empirical results show that the number of correctly classified minority samples is significantly improved, which demonstrates that the new method is more effective than previous ones in dealing with imbalanced data with a large number of features.
Keywords: data mining, SMOTE, imbalanced data classification, PLS.
1. Introduction
Data classification is widely applied in practice; however, many problems involve imbalanced data, meaning there is a huge difference between the numbers of samples of the two labels. Imbalanced data classification is one of the most difficult issues in the machine learning and data mining community. The class imbalance problem usually occurs in binary classification, where the samples of the class of interest occupy a very small proportion compared to the other class. Class imbalance greatly affects the efficiency of a classification model. Practical applications encounter this problem more and more, for example in fraud detection, network intrusion detection, oil spill detection from satellite radar images, risk management, and medical diagnostics [1-3]. In medical databases especially, the number of people with cancer accounts for a very small proportion of the population, yet diagnosing these people, the minority, plays a very important role: misdiagnosing ill people as healthy can seriously affect human health and life.
Received March 24, 2020. Revised May 4, 2020. Accepted May 11, 2020
Contact Nguyen Thi Chinh, e-mail address: chinh.nguyenthi@gmail.com
Many scientists have taken an interest in the data imbalance problem, and there are many studies on it. Currently, there are two main approaches to solving the imbalanced data classification problem: data-level and algorithm-level approaches. In the algorithm-level approach, standard classification algorithms are adjusted so that they remain highly effective when applied to an imbalanced data set. For example, for the SVM algorithm, it has been suggested to use different penalty constants for different classes or to adjust the class boundaries based on the idea of kernel alignment. The data-level approach adjusts the data distribution of the labels to reduce or eliminate the imbalance so that standard classification algorithms can then be applied [4, 5]. There are many different ways to adjust the data, such as generating synthetic samples for the minority class (randomly selecting samples to replicate, selecting samples by some criterion, or creating new synthetic samples), removing samples from the majority class (randomly, or according to some criterion), or combining these methods. Among these approaches, creating new synthetic samples for the minority class has attracted many scientists and has produced promising results. Several data adjustment algorithms based on this idea have been proposed, such as SMOTE [6], SPY [7], Random Border Oversampling and Random Safe Undersampling [8], and Tomek-link [9].
In addition to class imbalance, many data sets in biomedicine also have a very large number of features, often thousands. Among these features, many are redundant and not useful for predicting minority class samples. In many cases they also reduce classification efficiency and slow down experimental processing.
To solve both the imbalance problem and the huge number of features, we propose a new method, SMOTE-PLS. The new method combines a reduction of the number of features with the generation of synthetic samples in the minority class. It generates synthetic samples to balance the labels while reducing redundant features, thereby improving the efficiency of cancer data classification.
2. Content
2.1. Framework
The general workflow of the method in this paper consists of four steps: (1) divide the data set into training data and test data (k-fold cross-validation); (2) apply synthetic sample generation with SMOTE; (3) apply data dimension reduction with PLS; (4) train supervised machine learning models to predict cancer. The workflow is illustrated in Figure 1.
Figure 1. A new hybrid method procedure: the dataset is split into a training set and a testing set (Step 1), SMOTE produces a balanced training set (Step 2), PLS reduces the dimensionality of the balanced training set (Step 3), and a classifier is trained and evaluated on the testing set (Step 4).
2.2. Synthetic Minority Over-sampling Technique (SMOTE) method
The Synthetic Minority Over-sampling Technique (SMOTE) [6] was proposed by N. V. Chawla et al. The minority class is over-sampled by taking each minority sample and finding its k nearest neighbors. Depending on the amount of over-sampling required, neighbors are randomly selected from these k nearest neighbors. A synthetic sample is created in the following way: take the difference between the feature vector (sample) under consideration and one of its nearest neighbors; multiply this difference by a random number between 0 and 1; and add the result to the feature vector under consideration. This generates a random point along the line segment between the two samples.
The SMOTE algorithm has three input parameters: the number of samples of the minority class (T), the ratio of synthetic samples to be generated (N%), and the number of nearest neighbors (k). For each set of input values, the SMOTE algorithm generates synthetic samples to be added to the training data set. Specifically, the SMOTE algorithm is as follows [6]:
Algorithm SMOTE (T, N, k)
Input: Number of samples of the minority class: T;
Ratio of synthetic samples to be generated: N%;
Number of nearest neighbors: k;
Output: (N / 100) * T synthetic minority samples
1. If N < 100 then T = (N / 100) * T; N = 100
2. N = (int)(N / 100) (∗ the amount of over-sampling is assumed to be in integral multiples of 100 ∗)
3. for i = 1 to T
- Compute the k nearest minority neighbors of sample i and store them in the array nnarray
- Call the function Populate (N, i, nnarray)
4. function Populate (N, i, nnarray) (∗ Function creates the synthetic samples. ∗)
- While (N ≠ 0)
+ Pick a random number nn between 1 and k
+ Generate a new synthetic sample along the line segment joining sample i and its neighbor nnarray[nn]
+ N = N - 1
- End while
5. Return.
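To make the procedure concrete, the following is a minimal sketch of the algorithm in R (the language used for our experiments in Section 2.6). It assumes X is a numeric matrix containing only the minority-class samples and that N is a multiple of 100; the function and variable names are illustrative, not part of the original algorithm.

# A minimal R sketch of SMOTE. Assumptions: X is a numeric matrix of
# minority-class samples (one row per sample); N is a multiple of 100.
smote <- function(X, N = 100, k = 5) {
  n_min <- nrow(X)                        # T: number of minority samples
  n_new <- N / 100                        # synthetic samples per minority sample
  D <- as.matrix(dist(X))                 # pairwise Euclidean distances
  synthetic <- matrix(0, nrow = n_new * n_min, ncol = ncol(X))
  idx <- 1
  for (i in 1:n_min) {
    nnarray <- order(D[i, ])[2:(k + 1)]   # k nearest neighbors, excluding i itself
    for (s in 1:n_new) {
      nn  <- nnarray[sample.int(k, 1)]    # pick a random neighbor (Populate step)
      gap <- runif(1)                     # random number between 0 and 1
      synthetic[idx, ] <- X[i, ] + gap * (X[nn, ] - X[i, ])
      idx <- idx + 1
    }
  }
  synthetic                               # the (N / 100) * T synthetic samples
}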
The SMOTE algorithm differs from previously known methods in that it does not modify or remove the existing samples of the minority and majority classes; it only affects the minority label by producing synthetic samples. Each synthetic sample is randomly generated and labeled as a minority sample, thereby balancing the data between the two labels. SMOTE generates additional synthetic samples according to the algorithm presented above. The number of generated samples changes with the value of N: the larger N is, the more synthetic samples are created.
On the other hand, if N is too large, the data become imbalanced in the opposite direction (the majority becomes the minority, and the minority, due to the added synthetic samples, becomes the majority class), so the imbalance is still not resolved. Therefore, for each specific data set, the SMOTE algorithm works well only for certain values of the parameter N.
2.3. The Partial Least Squares (PLS) method
Currently, there are many methods for reducing the number of data dimensions, such as PCA and PLS. The idea of this group of methods is to create a new feature set representing the old one. The new feature set retains the information of the old features but is many times smaller than the original number of features. This dimension reduction minimizes information loss and, in particular, does not change the nature of the data.
Principal Component Analysis (PCA) [10] belongs to the group of unsupervised learning methods in machine learning and artificial intelligence. The idea of PCA is to build a subspace of smaller dimension from the original multi-dimensional space by constructing new variables, each a linear combination of the original variables. The axes of the new space are thus combinations of the coordinate axes of the old space and are called the "principal components" of the new space. The data points of the original space are mapped to the subspace so that the mean squared distance between the data points of the old space and their projections onto the new space is minimal. In other words, PCA builds a new space with fewer dimensions whose coordinate axes represent the data better than the old space, ensuring that along each axis the variance of the data is greatest.
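As an illustration, PCA-based reduction can be sketched in base R as follows; this is a sketch only, and the number of retained components (4 here) is an assumption for illustration.

# A minimal PCA sketch in base R. Assumption: X is the feature matrix;
# the number of retained components is illustrative.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
X_reduced <- pca$x[, 1:4]    # project the data onto the first 4 principal components
summary(pca)                 # proportion of variance explained by each component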
The Partial Least Squares (PLS) method [11] is a group of techniques for building a relationship model between two multidimensional variables (a learning data set and a set of labels), i.e., constructing a regression function between dependent and independent variables in the regression problem; for classification, a discrete function is constructed to determine the class value taken by the variable. PLS is a supervised learning method, which means that when reducing the number of data dimensions, PLS relies on both the feature data set (X) and the information in the class label data set (Y). This ensures an "orientation" according to the available information gained from practical experience or through experiments.
The idea of PLS is to represent the class variable Y and the features X through the values of intermediate (hidden) variables. Hidden variables are determined by linear combinations of the initial variables that are related to each other. As a result, the number of variables decreases considerably compared to the initial number. This eliminates subjective errors when selecting the variables participating in the problem. The choice of the number of hidden variables depends on how many dimensions the user wants to represent the observed objects with. Therefore, PLS is mainly used to reduce the number of data dimensions of a feature set.
In other words, PLS constructs a new space many times smaller than the original space, whose coordinate system is orthogonal. PLS finds the basis vectors of the new space by maximizing the covariance between the variables; that is, the problem reduces to an eigenvalue problem from which the eigenvectors are determined (presented in detail in the NIPALS algorithm). The number of eigenvectors retained is the number of dimensions to use, chosen according to the magnitudes of the corresponding eigenvalues.
PLS is particularly effective at reducing the number of data dimensions compared to traditional methods such as PCA, so it is widely applied to biomedical data with a large number of features. The data reduced by PLS are more reliable than those reduced by PCA because PLS uses both the feature information and the class labels. This makes the adjusted data "oriented" in accordance with the actual values collected.
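A minimal sketch of PLS dimension reduction with the pls package in R follows; plsr, RMSEP, and scores are functions of that package, while the variable names and the number of components chosen are illustrative assumptions.

# A minimal PLS dimension-reduction sketch with the 'pls' package.
# Assumptions: X is the feature matrix; y is a numeric 0/1 class label vector.
library(pls)

d   <- data.frame(y = y, X = I(X))                 # I() keeps X as a matrix column
fit <- plsr(y ~ X, ncomp = 10, data = d, validation = "CV")
plot(RMSEP(fit))                                   # choose where the RMSEP curve levels off
n_comp <- 4                                        # e.g. 4 hidden variables (illustrative)
X_reduced <- scores(fit)[, 1:n_comp]               # hidden variables (latent components)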
2.4. A new hybrid method
SMOTE is one of the typical methods for increasing classification efficiency by generating additional synthetic samples of the minority class. However, this also increases the size of the data set by the number of synthetic samples added. Moreover, data in real applications currently often have a very large number of features. This makes the classification process time-consuming, and the classification of such data sets may no longer be accurate, or its accuracy will not be high.
To counteract the growth in data set size and to reduce the number of features, we propose combining two algorithms: generating more minority samples with SMOTE and reducing the number of data dimensions with PLS. The benefit of this combination is that the data set size does not grow much beyond that of the original data set, the run time of the classifier is greatly reduced, and, more importantly, the two steps complement each other: generating more samples together with reducing the number of data dimensions yields a higher and more efficient classification result than either method alone.
The combination of SMOTE and PLS thus exploits the advantages of both algorithms: increasing the samples of the minority label while reducing the number of data dimensions improves the accuracy of imbalanced data classification and shortens the run time on the data sets.
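The combination can be sketched in R as follows. This is a sketch only: smote is the function outlined in Section 2.2, the label "pos" marks the minority class, and all names and parameter values are illustrative assumptions.

# A minimal sketch of the SMOTE-PLS pipeline. Assumptions: 'smote' is the
# function sketched in Section 2.2; "pos" labels the minority class.
library(pls)

smote_pls <- function(X_train, y_train, X_test, N = 100, n_comp = 4) {
  # Step 2: balance the training set by adding synthetic minority samples
  minority  <- X_train[y_train == "pos", , drop = FALSE]
  synthetic <- smote(minority, N = N, k = 5)
  X_bal <- rbind(X_train, synthetic)
  y_bal <- c(as.character(y_train), rep("pos", nrow(synthetic)))
  # Step 3: fit PLS on the balanced training set and project both sets
  d   <- data.frame(y = as.numeric(y_bal == "pos"), X = I(X_bal))
  fit <- plsr(y ~ X, ncomp = n_comp, data = d)
  list(train_x = scores(fit)[, 1:n_comp], train_y = y_bal,
       test_x  = predict(fit, newdata = data.frame(X = I(X_test)),
                         type = "scores")[, 1:n_comp])
}
# Step 4: any standard classifier (e.g. SVM or k-NN) is then trained on
# train_x/train_y and evaluated on test_x.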
2.5. Evaluation criteria
In the two-class case, the class with very few training samples but greater importance is called the positive class; the other, more common class, with less significance and importance, is called the negative class. During classification, samples fall into four groups, denoted in the following confusion matrix:
Table 1. Confusion matrix

                    Predicted Positive    Predicted Negative
Reality Positive    TP                    FN
Reality Negative    FP                    TN
Some evaluation criteria based on the confusion matrix are:
TPrate = TP / (TP + FN) (1)
TNrate = TN / (TN + FP) (2)
G-mean = √(TPrate × TNrate) (3)
G-mean is a measure used to evaluate the efficiency of data classification between two labels
[4, 5, 7].
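These criteria can be computed directly from predictions; a minimal R sketch follows (the label names "pos" and "neg" are illustrative assumptions).

# A minimal R sketch computing TPrate, TNrate and G-mean (formulas (1)-(3)).
# Assumption: 'truth' and 'pred' are vectors with values "pos" and "neg".
g_mean <- function(truth, pred) {
  TP <- sum(truth == "pos" & pred == "pos")
  FN <- sum(truth == "pos" & pred == "neg")
  TN <- sum(truth == "neg" & pred == "neg")
  FP <- sum(truth == "neg" & pred == "pos")
  TPrate <- TP / (TP + FN)    # formula (1)
  TNrate <- TN / (TN + FP)    # formula (2)
  sqrt(TPrate * TNrate)       # formula (3)
}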
2.6. Experiments
2.6.1. Datasets
To evaluate the effectiveness of the combined method, we implemented the programs in the R and Perl languages and tested them on five imbalanced cancer data sets from the UCI (University of California, Irvine) repository [12]: Breast-p, Coil2000, Leukemia, Colon-cancer, and Yeast.
Table 2. Cancer imbalanced data sets

Dataset        Number of samples    Number of features    Imbalance rate
Breast-p       198                  32                    1:4
Coil2000       5822                 86                    1:16
Leukemia       72                   7128                  1:2
Colon-cancer   62                   2000                  1:2
Yeast          1484                 4                     1:29
2.6.2. Results
We first ran 10-fold cross-validation, dividing each of the above data sets into 10 roughly equal random parts. Of those 10 parts, one is used as test data and the others as training data. We then compared seven different configurations in terms of G-mean values and p-values: original data, SMOTE, PCA, PLS, RSO-RBO [8], SMOTE-PCA, and SMOTE-PLS. The final comparison result is the average G-mean over 20 repetitions of 10-fold cross-validation.
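One such run can be sketched in R as follows (fold assignment only; n denotes the number of samples, and the loop body is elided).

# A minimal sketch of one run of 10-fold cross-validation.
# Assumption: n is the number of samples in the data set.
folds <- sample(rep(1:10, length.out = n))    # random assignment of samples to 10 folds
gmeans <- numeric(10)
for (f in 1:10) {
  test_idx  <- which(folds == f)              # one fold as test data
  train_idx <- which(folds != f)              # the other nine folds as training data
  # ... apply SMOTE-PLS to the training part, classify the test part,
  # and store the resulting G-mean in gmeans[f]
}
mean(gmeans)                                  # result of one of the 20 repetitions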
For each data set, there are different input parameters for the PLS, PCA, and SMOTE methods. For the PLS method, the input parameter (the number of hidden variables) is chosen based on the root mean squared error of prediction (RMSEP): if the RMSEP curve levels off at, for example, 4 hidden variables, then 4 hidden variables are sufficient for the PLS model; the choice is made similarly for PCA. For the SMOTE method, the input parameter is N; each value of N yields a different number of synthetic samples. For example, if N = 100 and the minority class has 50 samples, then SMOTE creates 50 new synthetic samples in the minority class. In this experiment, the parameter N was chosen so as to balance the ratio between the minority and the majority.
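For instance, the example above corresponds to the following call to the SMOTE sketch from Section 2.2 (the data here are random and purely illustrative).

# Example: with 50 minority samples, N = 100 generates 50 synthetic samples,
# doubling the minority class. Random data, for illustration only.
minority  <- matrix(rnorm(50 * 10), nrow = 50)   # 50 samples, 10 features
synthetic <- smote(minority, N = 100, k = 5)
nrow(synthetic)                                  # 50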
To see the effectiveness of the proposed method (SMOTE-PLS), we compared the results based on G-mean values. Each figure below shows the G-mean value of one data set for the seven methods: original, SMOTE, PCA, PLS, RSO-RBO [8], SMOTE-PCA, and SMOTE-PLS.
With the breast-p data set, the G-mean value of the SMOTE-PLS method was 68.45%, higher than SMOTE-PCA (62.88%) and much higher than the original data (32.99%).
Figure 2. The graph shows the G-mean value of breast-p dataset
With the coil2000 data set, the G-mean value of the SMOTE-PLS method is 29.72%, much higher than SMOTE-PCA (4.12%), the original data (0%), and SMOTE (11.86%).
Figure 3. The graph shows the G-mean value of coil2000 dataset
With the leukemia data set, the G-mean value of the SMOTE-PLS method was 94.76%, higher than PLS (92.04%) and SMOTE-PCA (90.08%), and much higher than the original data (75.07%).
Figure 4. The graph shows the G-mean value of leukemia dataset
Similarly, with the colon-cancer data set, the G-mean value of the SMOTE-PLS method was 87.12%, higher than the original data (86.46%) and higher than SMOTE-PCA and RSO-RBO.
Figure 5. The graph shows the G-mean value of colon-cancer dataset
In addition, with the yeast data set, the G-mean value of the SMOTE-PLS method was 76.39%, much higher than the original data (18.85%) and higher than SMOTE-PCA and RSO-RBO.
Figure 6. The graph shows the G-mean value of yeast dataset
From the above results, we can see that our combined method (SMOTE-PLS) is more effective than the other methods. To evaluate whether the improvement of the combined method is statistically significant, we applied a t-test: if the p-value of this test is less than or equal to 0.05, we conclude that the two mean values differ in a statistically significant way. In this paper, we used the t.test function in the stats package of R to calculate the p-values.
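A minimal sketch of this comparison in R follows; t.test is the stats function mentioned above, while the two vector names are illustrative assumptions.

# A minimal sketch of the significance test. Assumption: the two vectors hold
# the G-mean values of the 20 runs for the two methods being compared.
p <- t.test(gmean_smote_pls, gmean_smote)$p.value
p <= 0.05    # TRUE means the difference in mean G-mean is statistically significant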
Table 3. P-values comparing G-mean values

Dataset                     Original    SMOTE        PLS          SMOTE-PLS
Breast-p      Original      x           <2.20E-16    5.02E-05     <2.20E-16
              SMOTE                     x            3.48E-13     2.06E-04
              PLS                                    x            6.68E-16
Coil2000      Original      x           1.05E-12     0.1649       8.32E-15
              SMOTE                     x            1.18E-13     3.92E-12
              PLS                                    x            3.18E-15
Leukemia      Original      x           2.09E-12     <2.20E-16    <2.20E-16
              SMOTE                     x            3.96E-08     1.38E-11
              PLS                                    x            4.77E-03
Colon-cancer  Original      x           0.98         0.357        0.295
              SMOTE                     x            0.0928       0.0904
              PLS                                    x            0.4261
Yeast         Original      x           <2.20E-16    2.04E-06     <2.20E-16
              SMOTE                     x            0.31654      2.58E-08
              PLS                                    x            1.63E-09
Based on the results of the G-mean calculation, we evaluated the statistical significance of the differences using the t-test.