HNUE JOURNAL OF SCIENCE
Natural Sciences, 2020, Volume 65, Issue 4A, pp. 42-50
This paper is available online at
A NEW HYBRID METHOD TO IMPROVE THE EFFECTIVENESS
OF CANCER DATA CLASSIFICATION
Nguyen Thi Chinh1, Dao Thi Minh2, Le Xuan Ly3 and Dang Xuan Tho4
1Gifted High School, Hanoi National University of Education
2Natural Science Department, Thai Binh Teacher Training College
3School of Applied Mathematics and Informatics, Hanoi University of Science and Technology
4Faculty of Information Technology, Hanoi National University of Education
Abstract. Imbalanced data classification is one of the most difficult issues in the machine learning and data mining community. The problem becomes even harder for data sets with a large number of features, since many redundant features reduce the efficiency of the classification process. In particular, many biomedical data sets, such as those used for diagnosing cancer, are both highly imbalanced and contain thousands of features. Finding a solution that overcomes these difficulties is therefore extremely important and meaningful. In this paper, we present an overview of imbalanced data classification and the difficulties encountered by current approaches, and from this we propose a new method, SMOTE-PLS. To evaluate the effectiveness of the new method, we conducted experiments on standard cancer data sets from the UCI repository, including breast-p, coil2000, leukemia, colon-cancer, and yeast. Empirical results show that the number of correctly classified minority samples is significantly improved, which demonstrates that the new method is more effective than previous ones in dealing with imbalanced data with a large number of features.
Keywords: data mining, SMOTE, imbalanced data classification, PLS.
1. Introduction
Data classification is widely applied in practice; however, many problems involve imbalanced data, meaning there is a huge difference between the numbers of samples of the two labels. Imbalanced data classification is one of the most difficult issues in the machine learning and data mining community. The class imbalance problem usually occurs in binary classification, where the samples of the class of interest occupy a very small proportion compared to the other class. Class imbalance greatly affects the efficiency of a classification model. Practical applications encounter this problem more and more, for example in fraud detection, network intrusion detection, oil spill detection from satellite radar images, risk management, and medical diagnostics [1-3]. In medical databases especially, the number of people with cancer accounts for a very small proportion of the population, yet diagnosing these people, the minority, plays a very important role: misdiagnosing ill people as healthy can seriously affect human health and life.
Received March 24, 2020. Revised May 4, 2020. Accepted May 11, 2020
Contact Nguyen Thi Chinh, e-mail address: chinh.nguyenthi@gmail.com
Many scientists have taken an interest in the data imbalance problem, and there are many studies on it. Currently, there are two main approaches to solving the imbalanced data classification problem: data-level and algorithm-level approaches. In the algorithm-level approach, standard classification algorithms are adjusted so that they remain highly effective when applied to an imbalanced data set. For example, for the SVM algorithm, it has been suggested to use different penalty constants for different classes or to adjust the class boundaries based on the idea of kernel alignment. The data-level approach adjusts the data distribution of the labels to reduce or eliminate the imbalance so that standard classification algorithms can then be applied [4, 5]. There are many different ways to adjust the data, such as generating synthetic samples for the minority class (randomly selecting samples to replicate, selecting samples by some criterion, or creating new synthetic samples), removing samples from the majority class (randomly, or according to some criterion), or combining these methods. Among these approaches, creating new synthetic samples for the minority class has attracted many scientists and has produced promising results. Several data adjustment algorithms based on this idea have been proposed, such as SMOTE [6], SPY [7], Random Border Oversampling and Random Safe Undersampling [8], and Tomek-link [9].
In addition to class imbalance, many data sets in biomedicine also have a very large number of features, often thousands. Among these features, many are redundant and not useful for predicting minority class samples. In many cases they also reduce classification efficiency and slow down experimental processing.
To solve both the imbalance problem and the huge number of features, we propose a new method, SMOTE-PLS. The new method combines a reduction of the number of features with the generation of synthetic samples in the minority class. It generates synthetic samples to balance the labels while reducing redundant features, thereby improving the efficiency of cancer data classification.
2. Content
2.1. Framework
The general workflow of the method in this paper consists of four steps: (1) divide the data set into training data and test data (k-fold cross-validation); (2) apply synthetic sample generation with SMOTE; (3) apply data dimension reduction with PLS; (4) train supervised machine learning models to predict cancer. The workflow is illustrated in Figure 1.
Figure 1. A new hybrid method procedure: the dataset is split into a training set and a testing set (Step 1), SMOTE produces a balanced training set (Step 2), PLS reduces the dimensionality of the balanced training set (Step 3), and a classifier is trained and evaluated on the testing set (Step 4).
2.2. Synthetic Minority Over-sampling Technique (SMOTE) method
The Synthetic Minority Over-sampling Technique (SMOTE) [6] was proposed by N. V. Chawla et al. The minority class is over-sampled by taking each minority sample and finding its k nearest neighbors. Depending on the amount of over-sampling required, neighbors are randomly selected from these k nearest neighbors. A synthetic sample is created in the following way: take the difference between the feature vector (sample) under consideration and one of its nearest neighbors; multiply this difference by a random number between 0 and 1; and add the result to the feature vector under consideration. This generates a random point along the line segment between the two samples.
The SMOTE algorithm has three input parameters: the number of samples of the minority class (T), the ratio of synthetic samples to be generated (N%), and the number of nearest neighbors (k). For each set of input values, the SMOTE algorithm generates synthetic samples to be added to the training data set. Specifically, the SMOTE algorithm is as follows [6]:
Algorithm SMOTE (T, N, k)
Input: Number of samples of the minority class: T;
Ratio of synthetic samples to be generated: N%;
Number of nearest neighbors: k;
Output: (N / 100) * T synthetic minority samples
1. If N < 100 then T = (N / 100) * T; N = 100
2. N = (int)(N / 100) (∗ the amount of over-sampling is assumed to be in integral multiples of 100 ∗)
3. for i = 1 to T
- Compute the k nearest minority neighbors of sample i and store them in the array nnarray
- Call the function Populate (N, i, nnarray)
4. function Populate (N, i, nnarray) (∗ Function creates the synthetic samples. ∗)
- While (N ≠ 0)
+ Pick a random number nn between 1 and k
+ Generate a new synthetic sample along the line segment joining sample i and its neighbor nnarray[nn]
+ N = N - 1
- End while
5. Return.
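To make the procedure concrete, the following is a minimal sketch of the algorithm in R (the language used for our experiments in Section 2.6). It assumes X is a numeric matrix containing only the minority-class samples and that N is a multiple of 100; the function and variable names are illustrative, not part of the original algorithm.

# A minimal R sketch of SMOTE. Assumptions: X is a numeric matrix of
# minority-class samples (one row per sample); N is a multiple of 100.
smote <- function(X, N = 100, k = 5) {
  n_min <- nrow(X)                        # T: number of minority samples
  n_new <- N / 100                        # synthetic samples per minority sample
  D <- as.matrix(dist(X))                 # pairwise Euclidean distances
  synthetic <- matrix(0, nrow = n_new * n_min, ncol = ncol(X))
  idx <- 1
  for (i in 1:n_min) {
    nnarray <- order(D[i, ])[2:(k + 1)]   # k nearest neighbors, excluding i itself
    for (s in 1:n_new) {
      nn  <- nnarray[sample.int(k, 1)]    # pick a random neighbor (Populate step)
      gap <- runif(1)                     # random number between 0 and 1
      synthetic[idx, ] <- X[i, ] + gap * (X[nn, ] - X[i, ])
      idx <- idx + 1
    }
  }
  synthetic                               # the (N / 100) * T synthetic samples
}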
The SMOTE algorithm differs from previously known methods in that it does not modify or remove the existing samples of the minority and majority classes; it only affects the minority label by producing synthetic samples. Each synthetic sample is randomly generated and labeled as a minority sample, thereby balancing the data between the two labels. SMOTE generates additional synthetic samples according to the algorithm presented above. The number of generated samples changes with the value of N: the larger N is, the more synthetic samples are created.
On the other hand, if N is too large, the data become imbalanced in the opposite direction (the majority becomes the minority, and the minority, due to the added synthetic samples, becomes the majority class), so the imbalance is still not resolved. Therefore, for each specific data set, the SMOTE algorithm works well only for certain values of the parameter N.
2.3. The Partial Least Squares (PLS) method
Currently, there are many methods for reducing the number of data dimensions, such as PCA and PLS. The idea of this group of methods is to create a new feature set representing the old one. The new feature set retains the information of the old features but is many times smaller than the original number of features. This dimension reduction minimizes information loss and, in particular, does not change the nature of the data.
Principal Component Analysis (PCA) [10] belongs to the group of unsupervised learning methods in machine learning and artificial intelligence. The idea of PCA is to build a subspace of smaller dimension from the original multi-dimensional space by constructing new variables, each a linear combination of the original variables. The axes of the new space are thus combinations of the coordinate axes of the old space and are called the "principal components" of the new space. The data points of the original space are mapped to the subspace so that the mean squared distance between the data points of the old space and their projections onto the new space is minimal. In other words, PCA builds a new space with fewer dimensions whose coordinate axes represent the data better than the old space, ensuring that along each axis the variance of the data is greatest.
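As an illustration, PCA-based reduction can be sketched in base R as follows; this is a sketch only, and the number of retained components (4 here) is an assumption for illustration.

# A minimal PCA sketch in base R. Assumption: X is the feature matrix;
# the number of retained components is illustrative.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
X_reduced <- pca$x[, 1:4]    # project the data onto the first 4 principal components
summary(pca)                 # proportion of variance explained by each component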
The Partial Least Squares (PLS) method [11] is a group of techniques for building a relationship model between two multidimensional variables (a learning data set and a set of labels), i.e., constructing a regression function between dependent and independent variables in the regression problem; for classification, a discrete function is constructed to determine the class value taken by the variable. PLS is a supervised learning method, which means that when reducing the number of data dimensions, PLS relies on both the feature data set (X) and the information in the class label data set (Y). This ensures an "orientation" according to the available information gained from practical experience or through experiments.
The idea of PLS is to represent the class variable Y and the features X through the values of intermediate (hidden) variables. Hidden variables are determined by linear combinations of the initial variables that are related to each other. As a result, the number of variables decreases considerably compared to the initial number. This eliminates subjective errors when selecting the variables participating in the problem. The choice of the number of hidden variables depends on how many dimensions the user wants to represent the observed objects with. Therefore, PLS is mainly used to reduce the number of data dimensions of a feature set.
In other words, PLS constructs a new space many times smaller than the original space, whose coordinate system is orthogonal. PLS finds the basis vectors of the new space by maximizing the covariance between the variables; that is, the problem reduces to an eigenvalue problem from which the eigenvectors are determined (presented in detail in the NIPALS algorithm). The number of eigenvectors retained is the number of dimensions to use, chosen according to the magnitudes of the corresponding eigenvalues.
PLS is particularly effective at reducing the number of data dimensions compared to traditional methods such as PCA, so it is widely applied to biomedical data with a large number of features. The data reduced by PLS are more reliable than those reduced by PCA because PLS uses both the feature information and the class labels. This makes the adjusted data "oriented" in accordance with the actual values collected.
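A minimal sketch of PLS dimension reduction with the pls package in R follows; plsr, RMSEP, and scores are functions of that package, while the variable names and the number of components chosen are illustrative assumptions.

# A minimal PLS dimension-reduction sketch with the 'pls' package.
# Assumptions: X is the feature matrix; y is a numeric 0/1 class label vector.
library(pls)

d   <- data.frame(y = y, X = I(X))                 # I() keeps X as a matrix column
fit <- plsr(y ~ X, ncomp = 10, data = d, validation = "CV")
plot(RMSEP(fit))                                   # choose where the RMSEP curve levels off
n_comp <- 4                                        # e.g. 4 hidden variables (illustrative)
X_reduced <- scores(fit)[, 1:n_comp]               # hidden variables (latent components)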
2.4. A new hybrid method
SMOTE is one of the typical methods for increasing classification efficiency by generating additional synthetic samples of the minority class. However, this also increases the size of the data set by the number of synthetic samples added. Moreover, data in real applications currently often have a very large number of features. This makes the classification process time-consuming, and the classification of such data sets may no longer be accurate, or its accuracy will not be high.
To counteract the growth in data set size and to reduce the number of features, we propose combining two algorithms: generating more minority samples with SMOTE and reducing the number of data dimensions with PLS. The benefit of this combination is that the data set size does not grow much beyond that of the original data set, the run time of the classifier is greatly reduced, and, more importantly, the two steps complement each other: generating more samples together with reducing the number of data dimensions yields a higher and more efficient classification result than either method alone.
The combination of SMOTE and PLS thus exploits the advantages of both algorithms: increasing the samples of the minority label while reducing the number of data dimensions improves the accuracy of imbalanced data classification and shortens the run time on the data sets.
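The combination can be sketched in R as follows. This is a sketch only: smote is the function outlined in Section 2.2, the label "pos" marks the minority class, and all names and parameter values are illustrative assumptions.

# A minimal sketch of the SMOTE-PLS pipeline. Assumptions: 'smote' is the
# function sketched in Section 2.2; "pos" labels the minority class.
library(pls)

smote_pls <- function(X_train, y_train, X_test, N = 100, n_comp = 4) {
  # Step 2: balance the training set by adding synthetic minority samples
  minority  <- X_train[y_train == "pos", , drop = FALSE]
  synthetic <- smote(minority, N = N, k = 5)
  X_bal <- rbind(X_train, synthetic)
  y_bal <- c(as.character(y_train), rep("pos", nrow(synthetic)))
  # Step 3: fit PLS on the balanced training set and project both sets
  d   <- data.frame(y = as.numeric(y_bal == "pos"), X = I(X_bal))
  fit <- plsr(y ~ X, ncomp = n_comp, data = d)
  list(train_x = scores(fit)[, 1:n_comp], train_y = y_bal,
       test_x  = predict(fit, newdata = data.frame(X = I(X_test)),
                         type = "scores")[, 1:n_comp])
}
# Step 4: any standard classifier (e.g. SVM or k-NN) is then trained on
# train_x/train_y and evaluated on test_x.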
2.5. Evaluation criteria
In the two-class case, the class with very few training samples but greater importance is called the positive class; the other, more common class, with less significance and importance, is called the negative class. During classification, samples fall into four groups, denoted in the following confusion matrix:
Table 1. Confusion matrix

                    Predicted Positive    Predicted Negative
Reality Positive    TP                    FN
Reality Negative    FP                    TN
Some evaluation criteria based on the confusion matrix are:
TPrate = TP / (TP + FN) (1)
TNrate = TN / (TN + FP) (2)
G-mean = √(TPrate × TNrate) (3)
G-mean is a measure used to evaluate the efficiency of data classification between two labels
[4, 5, 7].
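These criteria can be computed directly from predictions; a minimal R sketch follows (the label names "pos" and "neg" are illustrative assumptions).

# A minimal R sketch computing TPrate, TNrate and G-mean (formulas (1)-(3)).
# Assumption: 'truth' and 'pred' are vectors with values "pos" and "neg".
g_mean <- function(truth, pred) {
  TP <- sum(truth == "pos" & pred == "pos")
  FN <- sum(truth == "pos" & pred == "neg")
  TN <- sum(truth == "neg" & pred == "neg")
  FP <- sum(truth == "neg" & pred == "pos")
  TPrate <- TP / (TP + FN)    # formula (1)
  TNrate <- TN / (TN + FP)    # formula (2)
  sqrt(TPrate * TNrate)       # formula (3)
}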
2.6. Experiments
2.6.1. Datasets
To evaluate the effectiveness of the combined method, we implemented the programs in the R and Perl languages and tested them on five imbalanced cancer data sets from the UCI (University of California, Irvine) repository [12]: Breast-p, Coil2000, Leukemia, Colon-cancer, and Yeast.
Table 2. Cancer imbalanced data sets

Dataset        Number of samples    Number of features    Imbalance rate
Breast-p       198                  32                    1:4
Coil2000       5822                 86                    1:16
Leukemia       72                   7128                  1:2
Colon-cancer   62                   2000                  1:2
Yeast          1484                 4                     1:29
2.6.2. Results
We first ran 10-fold cross-validation, dividing each of the above data sets into 10 roughly equal random parts. Of those 10 parts, one is used as test data and the others as training data. We then compared seven different configurations in terms of G-mean values and p-values: original data, SMOTE, PCA, PLS, RSO-RBO [8], SMOTE-PCA, and SMOTE-PLS. The final comparison result is the average G-mean over 20 repetitions of 10-fold cross-validation.
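One such run can be sketched in R as follows (fold assignment only; n denotes the number of samples, and the loop body is elided).

# A minimal sketch of one run of 10-fold cross-validation.
# Assumption: n is the number of samples in the data set.
folds <- sample(rep(1:10, length.out = n))    # random assignment of samples to 10 folds
gmeans <- numeric(10)
for (f in 1:10) {
  test_idx  <- which(folds == f)              # one fold as test data
  train_idx <- which(folds != f)              # the other nine folds as training data
  # ... apply SMOTE-PLS to the training part, classify the test part,
  # and store the resulting G-mean in gmeans[f]
}
mean(gmeans)                                  # result of one of the 20 repetitions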
For each data set, there are different input parameters for the PLS, PCA, and SMOTE methods. For the PLS method, the input parameter (the number of hidden variables) is chosen based on the root mean squared error of prediction (RMSEP): if the RMSEP curve levels off at, for example, 4 hidden variables, then 4 hidden variables are sufficient for the PLS model; the choice is made similarly for PCA. For the SMOTE method, the input parameter is N; each value of N yields a different number of synthetic samples. For example, if N = 100 and the minority class has 50 samples, then SMOTE creates 50 new synthetic samples in the minority class. In this experiment, the parameter N was chosen so as to balance the ratio between the minority and the majority.
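For instance, the example above corresponds to the following call to the SMOTE sketch from Section 2.2 (the data here are random and purely illustrative).

# Example: with 50 minority samples, N = 100 generates 50 synthetic samples,
# doubling the minority class. Random data, for illustration only.
minority  <- matrix(rnorm(50 * 10), nrow = 50)   # 50 samples, 10 features
synthetic <- smote(minority, N = 100, k = 5)
nrow(synthetic)                                  # 50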
To see the effectiveness of the proposed method (SMOTE-PLS), we compared the results based on G-mean values. Each figure below shows the G-mean value of one data set for the seven methods: original, SMOTE, PCA, PLS, RSO-RBO [8], SMOTE-PCA, and SMOTE-PLS.
With the breast-p data set, the G-mean value of the SMOTE-PLS method was 68.45%, higher than SMOTE-PCA (62.88%) and much higher than the original data (32.99%).
Figure 2. The graph shows the G-mean value of breast-p dataset
With the coil2000 data set, the G-mean value of the SMOTE-PLS method is 29.72%, much higher than SMOTE-PCA (4.12%), the original data (0%), and SMOTE (11.86%).
Figure 3. The graph shows the G-mean value of coil2000 dataset
With the leukemia data set, the G-mean value of the SMOTE-PLS method was 94.76%, higher than PLS (92.04%) and SMOTE-PCA (90.08%), and much higher than the original data (75.07%).
Figure 4. The graph shows the G-mean value of leukemia dataset
Similarly, with the colon-cancer data set, the G-mean value of the SMOTE-PLS method was 87.12%, higher than the original data (86.46%) and higher than SMOTE-PCA and RSO-RBO.
Figure 5. The graph shows the G-mean value of colon-cancer dataset
In addition, with the yeast data set, the G-mean value of the SMOTE-PLS method was 76.39%, much higher than the original data (18.85%) and higher than SMOTE-PCA and RSO-RBO.
Figure 6. The graph shows the G-mean value of yeast dataset
From the above results, we can see that our combined method (SMOTE-PLS) is more effective than the other methods. To evaluate whether the improvement of the combined method is statistically significant, we applied a t-test: if the p-value of this test is less than or equal to 0.05, we conclude that the two mean values differ in a statistically significant way. In this paper, we used the t.test function in the stats package of R to calculate the p-values.
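A minimal sketch of this comparison in R follows; t.test is the stats function mentioned above, while the two vector names are illustrative assumptions.

# A minimal sketch of the significance test. Assumption: the two vectors hold
# the G-mean values of the 20 runs for the two methods being compared.
p <- t.test(gmean_smote_pls, gmean_smote)$p.value
p <= 0.05    # TRUE means the difference in mean G-mean is statistically significant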
Table 3. P-values comparing G-mean values

Dataset                     Original    SMOTE        PLS          SMOTE-PLS
Breast-p      Original      x           <2.20E-16    5.02E-05     <2.20E-16
              SMOTE                     x            3.48E-13     2.06E-04
              PLS                                    x            6.68E-16
Coil2000      Original      x           1.05E-12     0.1649       8.32E-15
              SMOTE                     x            1.18E-13     3.92E-12
              PLS                                    x            3.18E-15
Leukemia      Original      x           2.09E-12     <2.20E-16    <2.20E-16
              SMOTE                     x            3.96E-08     1.38E-11
              PLS                                    x            4.77E-03
Colon-cancer  Original      x           0.98         0.357        0.295
              SMOTE                     x            0.0928       0.0904
              PLS                                    x            0.4261
Yeast         Original      x           <2.20E-16    2.04E-06     <2.20E-16
              SMOTE                     x            0.31654      2.58E-08
              PLS                                    x            1.63E-09
Based on the results of the G-mean calculation, we evaluated the statistical significance of the differences using the t-test.