JOURNAL OF SCIENCE OF HNUE
Interdisciplinary Science, 2013, Vol. 58, No. 5, pp. 39-46
This paper is available online at
BUILDING MODELS FOR DETECTING SYSTEM ATTACKS
BASED ON DATA MINING
Pham Duy Trung (1), Luong The Dung (1) and Nguyen Duy Hai (2)
(1) Academy of Cryptography Techniques
(2) Centre of Information Technology, Hanoi National University of Education
Abstract. With the development of the Internet, network security has become an
indispensable part of computer technology. Intrusion Detection Systems (IDS)
play an important role in network security. One factor that affects the accuracy
and performance of an IDS is its classifier. This paper proposes a new approach which
combines different classifiers in order to make the best use of each one. To
build the new model, we evaluate the accuracy and performance (training and
testing time) of three classification algorithms: ID3, Naïve Bayes and SVM. Our
experimental results on the KDDCup’99 IDS dataset, based on 10-fold cross
validation, show that for any one particular type of attack, one of the
classifiers performs best. The purpose of this study is to enhance the accuracy
and performance of IDS against particular types of attacks.
Keywords: Network security, data mining, network computer.
Received May 25, 2013. Accepted June 30, 2013.
Contact Nguyen Duy Hai, e-mail address: haind@hnue.edu.vn
1. Introduction
The Internet pervades almost every aspect of life and business and, because of the
exponential growth of this trend, there is a critical need to secure these
systems from unauthorized disclosure, transfer, modification or destruction. An Intrusion
Detection System (IDS) inspects the activities in a system for suspicious behavior or
patterns that may indicate an ongoing system attack or misuse. Recently, as networks have
become faster, a need has emerged for security analysis techniques that can
keep up with the increased network throughput [1]. Because of the large volumes of security
audit data and the complex and dynamic properties of intrusion behaviors, optimizing
the performance of IDS has become an important open problem that is receiving more
attention from the research community [2].
Besides expert systems, state transition analysis and statistical analysis, data mining
has become a popular technique for detecting intrusions [3]. The main reason for using
data mining techniques in IDS is that they can handle the enormous volume
of existing and newly appearing network data that requires processing. One of the most
important data mining techniques for intrusion detection is classification. Classification
models can be built using a wide variety of algorithms, which can be grouped into
three types: extensions to linear discrimination (e.g., multilayer perceptron and logistic
discrimination), decision tree and rule-based methods (e.g., C4.5 or J48, AQ and CART)
and density estimators (Naïve Bayes, k-nearest neighbor and LVQ) [4]. A search of the
literature shows that a 3-level classification model with the C4.5 algorithm provides a DoS
detection rate of almost 100% [5]. Rung-Ching Chen et al. [6] proposed an intrusion
detection method using SVM based on rough set theory (RST). They show that an accuracy
of 86.79% could be achieved using 41 features, while using a rough set increased the
accuracy to 89.13%.
No single data mining algorithm for intrusion detection has been identified as the
best. Furthermore, it should be noted that once IDS are more widely used, new properties
will have to be taken into consideration, such as large volumes of security audit data and
complex and dynamic properties of intrusion behavior. One difficulty encountered in such
a study is the lack of published objective comparisons between classifiers. Ideally,
classifiers should be tested within the same context, i.e., with the same dataset and
the same feature extraction method. Currently, this is a crucial problem for IDS research
based on data mining.
In this paper, we evaluate three data mining algorithms for intrusion detection,
Naïve Bayes, J48 and Support Vector Machine (SVM), within a data mining framework
for IDS. In addition, we propose a new approach which combines different classifiers in
order to make the best use of each one. The purpose of our research is to enhance the
accuracy and performance of IDS against particular types of attacks.
2. Content
2.1. The data mining model for IDS
In recent years, there has been an increase in the use of data mining-based
approaches to build intrusion detection models. Our intrusion detection models are
built in five steps (Figure 1). The process starts with an initial set of network audit
data. The data are preprocessed, and the optimal set of features is then obtained through
the feature extraction and feature selection stages before classification.
Systems that construct classifiers are commonly used tools in data mining. Such
systems take as input a collection of cases, each belonging to one of a small number of
classes and described by a fixed set of attributes, and output a classifier that can
accurately predict the class to which a new case belongs.
Network Audit
↓
Data Preprocess
↓
Feature Extraction
↓
Feature Selection
↓
Classification
Figure 1. Intrusion detection model based on data mining
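The five stages of Figure 1 can be read as a chain of functions, each consuming the output of the previous one. The sketch below is only an illustration of that structure: the function names, the comma-separated record layout and the threshold rule standing in for a trained classifier are all assumptions, not the models evaluated in this paper.

```python
from typing import List, Optional, Sequence


def preprocess(lines: Sequence[str]) -> List[List[str]]:
    """Data preprocessing: drop blank lines and split comma-separated records."""
    return [ln.strip().split(",") for ln in lines if ln.strip()]


def _to_float(value: str) -> Optional[float]:
    """Return the value as a float, or None for symbolic fields such as 'tcp'."""
    try:
        return float(value)
    except ValueError:
        return None


def extract_features(rows: List[List[str]]) -> List[List[float]]:
    """Feature extraction: keep numeric fields only (toy simplification)."""
    vectors = []
    for row in rows:
        numbers = [_to_float(v) for v in row]
        vectors.append([x for x in numbers if x is not None])
    return vectors


def select_features(vectors: List[List[float]], keep: Sequence[int]) -> List[List[float]]:
    """Feature selection: keep only the listed column positions."""
    return [[vec[i] for i in keep] for vec in vectors]


def classify(vectors: List[List[float]], threshold: float = 1000.0) -> List[str]:
    """Classification: hypothetical threshold rule in place of a trained model."""
    return ["attack" if vec[0] > threshold else "normal" for vec in vectors]


def run_pipeline(audit_lines: Sequence[str]) -> List[str]:
    rows = preprocess(audit_lines)                   # network audit -> clean records
    vectors = extract_features(rows)                 # records -> numeric vectors
    selected = select_features(vectors, keep=[1])    # keep one column (assumed)
    return classify(selected)


# Made-up audit lines: duration, protocol, source bytes, destination bytes.
audit = ["0,tcp,181,5450", "", "0,icmp,1032,0"]
labels = run_pipeline(audit)
```

Keeping each stage as a separate function makes it possible to swap the classifier without touching the earlier stages, which is what the comparison in Section 2.3 relies on.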
2.2. Experiment
2.2.1. Dataset
The KDD Cup 1999 dataset [7] was derived from the 1998 DARPA Intrusion
Detection Evaluation Program, prepared and managed by the MIT Lincoln Laboratory.
The dataset is a collection of simulated raw TCP dump data collected over a period
of nine weeks. The simulated attacks were classified according to the actions and goals
of the attacker. The dataset consists of one type of normal data and 22 different attack
types categorized into 4 classes: Denial of Service (DoS), Probe, User-to-Root (U2R)
and Remote-to-Local (R2L).
Denial of Service (DoS) attacks aim to limit or deny services
provided to the user, computer or network. A common tactic is to severely overload the
targeted system. Probing or surveillance attacks aim to gain knowledge of
the existence or configuration of a computer system or network. Port scans or sweeps
of a given IP address range typically fall into this category.
User-to-Root (U2R) attacks aim to gain root or super-user access
on a particular computer or system on which the attacker previously had user-level
access. These are attempts by a non-privileged user to gain administrative privileges. A
Remote-to-Local (R2L) attack is an attack in which a user sends packets to a machine
to which the user does not have access, in order to expose the machine’s vulnerabilities
and exploit privileges which a local user would have on the computer.
The details of attacks of labeled records are given in Table 1.
Table 1. Attack classification
Category of attack    Attack name
DoS                   Neptune, Smurf, Pod, Teardrop, Land, back, mailbomb,
                      processtable, udpstorm
Probe                 portsweep, IPsweep, nmap, mscan
U2R                   buffer_overflow, loadmodule, perl, rootkit, httptunnel, ps,
                      sqlattack, xterm
R2L                   guess_password, ftp_write, imap, multihop, named, phf,
                      sendmail, snmpgetattack, snmpguess, spy, warezclient,
                      warezmaster, worm, xlock, xsnoop
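The grouping in Table 1 amounts to a lookup from attack name to attack class, which is the step that turns the many raw labels into a four-class (plus normal) problem. A minimal sketch follows, assuming the names are compared in lowercase and that raw labels may carry a trailing period (as in `smurf.`); the function and variable names are hypothetical.

```python
from typing import Dict, List

# Attack names per category, following Table 1 (written in lowercase here).
ATTACK_CATEGORIES: Dict[str, List[str]] = {
    "DoS": ["neptune", "smurf", "pod", "teardrop", "land", "back",
            "mailbomb", "processtable", "udpstorm"],
    "Probe": ["portsweep", "ipsweep", "nmap", "mscan"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit",
            "httptunnel", "ps", "sqlattack", "xterm"],
    "R2L": ["guess_password", "ftp_write", "imap", "multihop", "named",
            "phf", "sendmail", "snmpgetattack", "snmpguess", "spy",
            "warezclient", "warezmaster", "worm", "xlock", "xsnoop"],
}

# Inverted index: attack name -> category.
ATTACK_TO_CATEGORY: Dict[str, str] = {
    name: category
    for category, names in ATTACK_CATEGORIES.items()
    for name in names
}


def categorize(label: str) -> str:
    """Map a raw record label to Normal, DoS, Probe, U2R, R2L or Unknown."""
    name = label.strip().rstrip(".").lower()  # tolerate 'smurf.' and 'Smurf'
    if name == "normal":
        return "Normal"
    return ATTACK_TO_CATEGORY.get(name, "Unknown")
```

For example, `categorize("smurf.")` gives `"DoS"`, and a name not listed in Table 1 falls through to `"Unknown"` rather than being silently assigned to a class.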
We use the 10% subset of the labeled KDD Cup 1999 dataset, which contains 494,020 records
with 41 features each. The distribution of connection types is given in Table 2.
Table 2. Distribution of connection types in the KDD Cup’99 training dataset
Class     Number of instances    Percentage of occurrence
Normal    97,277                 19.69%
DoS       391,458                79.24%
Probe     4,107                  0.83%
U2R       52                     0.01%
R2L       1,126                  0.23%
Total     494,020                100%
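The percentages in Table 2 follow directly from the instance counts, and recomputing them makes the class imbalance explicit: DoS alone accounts for almost four fifths of the records, while U2R is about one record in ten thousand. A short check, using only the counts from the table (variable names are illustrative):

```python
# Instance counts per class, taken from Table 2.
counts = {"Normal": 97277, "DoS": 391458, "Probe": 4107, "U2R": 52, "R2L": 1126}

total = sum(counts.values())

# Share of each class in percent, rounded to two decimals as in the table.
shares = {cls: round(100.0 * n / total, 2) for cls, n in counts.items()}

# Ratio between the largest and the smallest class.
imbalance = max(counts.values()) / min(counts.values())
```

With so few U2R and R2L records, overall accuracy is dominated by the DoS and Normal classes, which is one reason for reporting accuracy per attack class.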
Because the dataset is large, duplicate instances were removed, and a random sample of
10% of the normal records and 10% of the Neptune attack records (DoS class) was
drawn. All other records were retained.
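The reduction step can be sketched as follows. This is one possible reading of the procedure, not the exact script used for the experiments: it assumes each record is a tuple whose last element is the label, that duplicates are exact matches, and that the sample size is rounded to the nearest integer. The function name and the fixed seed are illustrative.

```python
import random
from typing import List, Sequence, Tuple

Record = Tuple[str, ...]  # feature values, with the label in the last position


def reduce_dataset(records: Sequence[Record],
                   fraction: float = 0.10,
                   sampled_labels: Sequence[str] = ("normal", "neptune"),
                   seed: int = 0) -> List[Record]:
    """Remove duplicates, then down-sample only the listed labels."""
    rng = random.Random(seed)
    # dict.fromkeys drops exact duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(records))
    # Records of all other labels are kept in full.
    kept = [r for r in unique if r[-1] not in sampled_labels]
    for label in sampled_labels:
        group = [r for r in unique if r[-1] == label]
        k = round(len(group) * fraction)
        kept.extend(rng.sample(group, k))  # sampling without replacement
    return kept


# Toy data: 20 normal records (each duplicated), 10 neptune and 3 smurf records.
records = ([(f"n{i}", "normal") for i in range(20)] * 2
           + [(f"x{i}", "neptune") for i in range(10)]
           + [(f"s{i}", "smurf") for i in range(3)])
reduced = reduce_dataset(records)
```

On the toy data this leaves 2 normal, 1 neptune and all 3 smurf records, so the rare classes are untouched while the two dominant groups shrink.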
2.2.2. Feature selection
The feature set includes the basic features of an individual TCP connection, such
as duration, protocol type, number of bytes transferred and the flag indicating the normal
or error status of the connection. Other features of an individual connection were obtained
using domain knowledge, and include the number of file creation operations and the
number of failed login attempts. In total, there are 41 features, most of them taking
continuous values. They are listed in Table 3.
Table 3. KDD Cup’99 features
No.  Name of the attribute    No.  Name of the attribute
1    duration                 22   is_guest_login
2    protocol_type            23   count
3    service                  24   srv_count
4    flag                     25   serror_rate
5    src_bytes                26   srv_serror_rate
6    dst_bytes                27   rerror_rate
7    land                     28   srv_rerror_rate
8    wrong_fragment           29   same_srv_rate
9    urgent                   30   diff_srv_rate
10   hot                      31   srv_diff_host_rate
11   num_failed_logins        32   dst_host_count
12   logged_in                33   dst_host_srv_count
13   num_compromised          34   dst_host_same_srv_rate
14   root_shell               35   dst_host_diff_srv_rate
15   su_attempted             36   dst_host_same_src_port_rate
16   num_root                 37   dst_host_srv_diff_host_rate
17   num_file_creations       38   dst_host_serror_rate
18   num_shells               39   dst_host_srv_serror_rate
19   num_access_files         40   dst_host_rerror_rate
20   num_outbound_cmds        41   dst_host_srv_rerror_rate
21   is_host_login
2.3. Results and discussions
The implementations of the three techniques used to build the intrusion detection
models, SVM, Naïve Bayes and J48, were obtained from WEKA [8]. The radial kernel and
neural kernel were selected for the SVM technique. We chose these settings to obtain the
highest performance for each technique. In our experiments, 10-fold cross validation
was used to estimate the intrusion detection rates of the three techniques.
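In 10-fold cross validation the data are split into ten parts; each part serves once as the test set while the other nine are used for training, and the ten accuracies are averaged. The sketch below shows the idea in plain Python. It is not WEKA's implementation: the shuffling, the fold assignment and the majority-class "classifier" used in the example are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs covering every sample once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    splits = []
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, folds[i]))
    return splits


def cross_val_accuracy(train_fn: Callable, X: Sequence, y: Sequence,
                       k: int = 10) -> float:
    """Average test accuracy over k folds; train_fn returns a predict function."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(model(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


def train_majority(X_train: Sequence, y_train: Sequence) -> Callable:
    """Hypothetical baseline: always predict the most frequent training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return lambda record: majority


X = list(range(100))
y = ["dos"] * 80 + ["normal"] * 20
accuracy = cross_val_accuracy(train_majority, X, y)
```

The baseline reaches 80% accuracy on this toy data simply because 80% of the labels are `dos`, a reminder that overall accuracy on an imbalanced dataset needs to be read alongside per-class results.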
A comparison of the multi-class classifier and the two-class classifier used with ID3
and Naïve Bayes shows that the two-class classifier achieves better accuracy.
Figure 2 indicates that the decision tree
produces better accuracy for Probe, R2L and U2R than SVM and Naïve Bayes.
For DoS with a small dataset, its accuracy is lower than that of SVM but higher than that
of Naïve Bayes. This suggests that SVM is not well suited to such a small dataset. This
finding is consistent with the study of Mohammadreza Ektefa et al. [9], which showed that
the C4.5 algorithm performed better than SVM in detecting network intrusions and in its
false alarm rate.
Figure 2. Comparing the accuracy of the three algorithms
Figure 3. Comparing the model building time of the three algorithms
In Figure 3, Naïve Bayes has the shortest training time, while the training time of SVM
is much longer than that of the others. Figure 4 shows that the testing time of decision
trees is much shorter than that of the others; thus, the use of decision tree classifiers
for intrusion detection will enhance system performance significantly.
Figure 4. Comparing the model testing time of the three algorithms
2.4. Attack classification method based on combined classifiers
From the experimental results, we can provide an integrated model that selects an
efficient algorithm for each specific type of attack. The charts above show
that one classification model can give better results than the other models for a certain
type of attack, so the best algorithm should be selected for each specific type of attack.
Therefore, we assume that the IDS integrates several different classifiers
and can run them in parallel on n processors, with each processor running
one classification algorithm (classifier). The attack class of each new access to the system
(new record) is then selected by a voting algorithm over the classifiers. The algorithm is
presented in Figure 5.
Input:  - New record: r
        - n classification algorithms: CF1, ..., CFn
        - Processors: P0, P1, ..., Pn
Output: C (class of the new record)
Begin
  For i = 1 to n, each Pi do in parallel
  Begin
    C[i] := CFi(r); Send(C[i], P0);
  End
  If (tid == 0) then P0 do
  Begin
    k := 1;
    Class[1] := C[1];
    Count[1] := 1;
    For i = 2 to n
    Begin
      j := 1;
      While (j <= k and C[i] <> Class[j]) j := j + 1;
      If (j <= k) Count[j] := Count[j] + 1;
      Else
      Begin
        k := k + 1;
        Class[k] := C[i];
        Count[k] := 1;
      End
    End
    maxd := 0;
    For i = 1 to k
      If (maxd < Count[i])
      Begin
        maxd := Count[i];
        C := Class[i];
      End
  End
  Output C;
End
Figure 5. Attack classification model based on combined classifiers
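The voting scheme of Figure 5 can be sketched in Python as below. This is an illustrative rendering, not the implementation used in the paper: each classifier is assumed to be a callable that maps a record to a class label, a thread pool stands in for the n processors (a process pool would be closer to true parallel execution in CPython), and ties are resolved in favor of the class that was predicted first.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Hashable, Sequence

Classifier = Callable[[Any], Hashable]


def vote(record: Any, classifiers: Sequence[Classifier]) -> Hashable:
    """Run every classifier on the record and return the most frequent class."""
    # One worker per classifier, mirroring "each processor runs one classifier".
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        predictions = list(pool.map(lambda clf: clf(record), classifiers))
    # Counter keeps classes in first-seen order; max() returns the first class
    # with the highest count, like the strict '<' comparison in Figure 5.
    counts = Counter(predictions)
    return max(counts, key=counts.get)


# Hypothetical classifiers that always return a fixed label.
classifiers = [lambda r: "dos", lambda r: "dos", lambda r: "normal"]
decision = vote({"src_bytes": 1032}, classifiers)
```

In the example two of the three classifiers answer `dos`, so `decision` is `"dos"`. With an even number of classifiers ties become possible, so an odd number, or a tie-breaking rule based on the per-class accuracies of Section 2.3, may be preferable.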
3. Conclusion
This paper proposed a new approach which combines different classifiers in order to
make the best use of each one. To build the new model, we evaluated the accuracy and
performance (training and testing time) of three classification algorithms: ID3, Naïve
Bayes and SVM. Our experimental results on the KDDCup’99 IDS dataset, based on
10-fold cross validation, show that for each particular type of attack, one of the
classifiers performs best.
REFERENCES
[1] Christopher Kruegel, Fredrik Valeur, Giovanni Vigna and Richard A. Kemmerer,
2002. Stateful Intrusion Detection for High-Speed Networks. In IEEE Symposium
on Security and Privacy, IEEE Computer Society Press, USA.
[2] Nguyen, H. & Choi, D., 2008. Application of data mining to network intrusion
detection: classifier selection model. Springer-Verlag Berlin Heidelberg, pp. 399-408.
[3] Lu, C.-T., Boedihardjo, A.P., Manalwar, P., 2005. Exploiting efficient data
mining techniques to enhance intrusion detection systems. IEEE International
Conference on Information Reuse and Integration (IRI-2005), pp. 512-517.
[4] Henery R. J., 1994. Classification. Machine Learning Neural and Statistical
Classification.
[5] C. Xiang, M.Y. Chong, H.L. Zhu, 2004. Design of multiple-level tree classifiers for
intrusion detection system. Cybernetics and Intelligent Systems, IEEE Conference,
Vol. 2, pp. 873-878.
[6] Rung-Ching Chen, Kai-Fan Cheng, Ying-Hao Chen, Chia-Fen Hsieh, 2009.
Using Rough Set and Support Vector Machine for Network Intrusion Detection
System. Intelligent Information and Database Systems. ACIIDS 2009. First Asian
Conference, pp. 465-470.
[7] KDD99: percent.gz.
[8] WEKA:
[9] Mohammadreza Ektefa, Sara Memar, Fatimah Sidi, Lilly Suriani Affendey, 2010.
Intrusion Detection Using Data Mining Techniques. Proc. of IEEE Intl. Conference
on Information Retrieval & Knowledge Management, pp. 200-203.