JOURNAL OF SCIENCE OF HNUE
Interdisciplinary Science, 2013, Vol. 58, No. 5, pp. 39-46
This paper is available online at http://stdb.hnue.edu.vn

BUILDING MODELS FOR DETECTING SYSTEM ATTACKS BASED ON DATA MINING

Pham Duy Trung¹, Luong The Dung¹ and Nguyen Duy Hai²
¹Academy of Cryptography Techniques
²Centre of Information Technology, Hanoi National University of Education

Received May 25, 2013. Accepted June 30, 2013.
Contact: Nguyen Duy Hai, e-mail address: haind@hnue.edu.vn

Abstract. With the development of the Internet, network security has become an indispensable part of computer technology. Intrusion Detection Systems (IDS) play an important role in network security. One factor that affects the accuracy and performance of an IDS is its classifier. This paper proposes a new approach which combines different classifiers in order to make the best use of each one. To build the new model, we evaluate the accuracy and performance (training and testing time) of three classification algorithms: ID3, Naïve Bayes and SVM. Our experimental results on the KDD Cup'99 IDS dataset, based on 10-fold cross validation, show that for each particular type of attack one of the classifiers performs best. The purpose of this study is to enhance the accuracy and performance of IDS against particular types of attacks.

Keywords: Network security, data mining, network computer.

1. Introduction

The Internet pervades almost every aspect of life and business and, because of the exponential growth of this trend, there is a critical need to secure these systems from unauthorized disclosure, transfer, modification or destruction. An Intrusion Detection System (IDS) inspects the activities in a system for suspicious behavior or patterns that may indicate an ongoing attack or misuse. Recently, as networks have become faster, a need has emerged for security analysis techniques that can keep up with the increased network throughput [1]. Due to large volumes of security audit data as well as the complex and dynamic properties of intrusion behaviors, optimizing the performance of IDS has become an important open problem that is receiving more attention from the research community [2].

Besides expert systems, state transition analysis and statistical analysis, data mining has become a popular technique for detecting intrusion [3]. The main reason for using data mining techniques for IDS is that they are capable of handling the enormous volume of existing and newly appearing network data that require processing.

One of the most important data mining techniques for intrusion detection is classification. Classification models can be built using a wide variety of algorithms, which can be grouped into three types: extensions to linear discrimination (e.g., multilayer perceptron and logistic discrimination), decision tree and rule-based methods (e.g., C4.5 or J48, AQ and CART) and density estimators (Naïve Bayes, k-nearest neighbor and LVQ) [4].

A search of the literature shows that a 3-level classification model with the C4.5 algorithm provides a DoS detection rate of almost 100% [5]. Rung-Ching Chen et al. [6] proposed an intrusion detection method using SVM based on Rough Set Theory (RST). They show that an accuracy of 86.79% could be achieved using 41 features, while using a rough set increased the accuracy to 89.13%.

No single data mining algorithm has been identified as the best for intrusion detection. Furthermore, it should be noted that once IDS are more widely used, new properties will have to be taken into consideration, such as large volumes of security audit data and the complex and dynamic properties of intrusion behavior. One difficulty encountered in such a study is the lack of published objective comparisons between classifiers. Ideally, classifiers should be tested within the same context, i.e., with the same dataset and the same feature extraction method. Currently, this is a crucial
problem for IDS research based on data mining.

In this paper, we evaluate three data mining algorithms for intrusion detection, Naïve Bayes, J48 and Support Vector Machine (SVM), within a data mining structure for IDS. In addition, we propose a new approach which combines different classifiers in order to make the best use of each one. The purpose of our research is to enhance the accuracy and performance of IDS against particular types of attacks.

2. Content

2.1 The data mining model for IDS

In recent years, there has been an increase in the use of data mining-based approaches to build intrusion detection models. Our intrusion detection models are built in five steps (Figure 1). The process starts with an initial set of network audit data. The data are preprocessed, and then the optimal set of features is obtained by the feature extraction and feature selection stages before classification.

Systems that construct classifiers are commonly used tools in data mining. Such systems take as input a collection of cases, each belonging to one of a small number of classes and described by a fixed set of attributes, and output a classifier that can accurately predict the class to which a new case belongs.

        Network Audit
              ↓
       Data Preprocess
              ↓
      Feature Extraction
              ↓
      Feature Selection
              ↓
        Classification

Figure 1. Intrusion detection model based on data mining

2.2 Experiment

2.2.1 Dataset

The KDD Cup 1999 dataset [7] was derived from the 1998 DARPA Intrusion Detection Evaluation program prepared and managed by the MIT Lincoln Laboratory. The dataset is a collection of simulated raw TCP dump data collected over a period of nine weeks. The simulated attacks were classified according to the actions and goals of the attacker. The dataset consists of one type of normal data and 22 different attack types categorized into four classes: Denial of Service (DoS), Probe, User-to-Root (U2R) and Remote-to-Local (R2L).

Denial of Service (DoS) attacks have the goal of
limiting or denying services provided to the user, computer or network. A common tactic is to severely overload the targeted system.

Probing or surveillance attacks have the goal of gaining knowledge of the existence or configuration of a computer system or network. Port scans or sweeps of a given IP address range typically fall into this category.

User-to-Root (U2R) attacks have the goal of gaining root or super-user access on a particular computer or system on which the attacker previously had user-level access. These are attempts by a non-privileged user to gain administrative privileges.

A Remote-to-Local (R2L) attack is an attack in which a user sends packets to a machine to which the user does not have access, in order to expose the machine's vulnerabilities and exploit privileges which a local user would have on the computer.

The attacks in the labeled records are listed in Table 1.

Table 1. Attack classification

Category of attack   Attack name
DoS                  Neptune, Smurf, Pod, Teardrop, Land, back, mailbomb, processtable, udpstorm
Probe                portsweep, IPsweep, nmap, mscan
U2R                  buffer_overflow, loadmodule, perl, rootkit, httptunnel, ps, sqlattack, xterm
R2L                  guess_password, ftp_write, Imap, multihop, named, phf, sendmail, snmpgetattack, snmpguess, spy, warezclient, warezmaster, worm, xlock, xsnoop

We use the 10% subset of the overall KDD Cup 1999 labeled dataset, which contains 494,020 records with 41 features. The distribution of connection types is given in Table 2.

Table 2. Distribution of connection types in the KDD Cup'99 training dataset

Class    Number of instances   Percentage of occurrence
Normal   97,277                19.69%
DoS      391,458               79.24%
Probe    4,107                 0.83%
U2R      52                    0.01%
R2L      1,126                 0.23%
Total    494,020               100%

Because the dataset is large, duplicate instances were removed, and a random sample of 10% of the normal records and 10% of the Neptune attack records in the DoS class was taken; all other records were retained.

2.2.2 Feature selection

The selected features include the basic features of an individual TCP connection, such as duration, protocol type, number of bytes transferred and the flag indicating the normal or error status of the connection. Other features of an individual connection were obtained using domain knowledge, and include the number of file creation operations and the number of failed login attempts. In total, there are 41 features, most of them taking on continuous values, as listed in Table 3.

Table 3. KDD Cup'99 features

No  Name of the attribute    No  Name of the attribute
1   duration                 22  is_guest_login
2   protocol_type            23  count
3   service                  24  srv_count
4   flag                     25  serror_rate
5   src_bytes                26  srv_serror_rate
6   dst_bytes                27  rerror_rate
7   land                     28  srv_rerror_rate
8   wrong_fragment           29  same_srv_rate
9   urgent                   30  diff_srv_rate
10  hot                      31  srv_diff_host_rate
11  num_failed_logins        32  dst_host_count
12  logged_in                33  dst_host_srv_count
13  num_compromised          34  dst_host_same_srv_rate
14  root_shell               35  dst_host_diff_srv_rate
15  su_attempted             36  dst_host_same_src_port_rate
16  num_root                 37  dst_host_srv_diff_host_rate
17  num_file_creations       38  dst_host_serror_rate
18  num_shells               39  dst_host_srv_serror_rate
19  num_access_files         40  dst_host_rerror_rate
20  num_outbound_cmds        41  dst_host_srv_rerror_rate
21  is_host_login

2.3 Results and discussions

The implementations of the three techniques used to build the intrusion detection models, SVM, Naïve Bayes and J48, were obtained from WEKA [8]. The Radial Kernel and Neural Kernel were selected for the SVM technique. We chose these settings to obtain the highest performance for each technique. In our experiments, 10-fold cross validation was used to obtain the intrusion detection rates of the three techniques.

Comparing the accuracy of the multi-class classifier and the two-class classifier used with ID3 and Naïve Bayes, it can be seen that the two-class classifier gives better results on the accuracy criterion. Figure 2 indicates that the
decision tree produces better accuracy for Probe, R2L and U2R than SVM and Naïve Bayes. For DoS on this small dataset, its accuracy is lower than that of SVM but higher than that of Naïve Bayes. Therefore, SVM is not suitable for such a small dataset. This finding is consistent with the study of Mohammadreza Ektefa et al. [9], which showed that the C4.5 algorithm performed better than SVM in detecting network intrusions and in terms of false alarms.

Figure 2. Comparing the accuracy of the three algorithms

Figure 3. Comparing the model building time of the three algorithms

In Figure 3, Naïve Bayes has the best training time, while the training time of SVM is much higher than that of the others. Figure 4 shows that the testing time of decision trees is much better than that of the others; thus, the use of decision tree classifiers for intrusion detection will enhance system performance significantly.

Figure 4. Comparing the model testing time of the three algorithms

2.4 Attack classification method based on combined classifiers

From the experimental results, we can provide an integrated model that selects an efficient algorithm for each specific type of attack. The charts and tables show that, for a given type of attack, one classification model can give better results than the others, so the best algorithm should be selected for each specific type of attack. We therefore assume that the IDS integrates several different classifiers and can run in parallel on n processors at the same time, with each processor running one classification algorithm (classifier). The attack class of each new access to the system (new record) is then selected by a voting algorithm over the classifiers. The algorithm is presented in Figure 5.

Input:
  - New record: r
  - n classification algorithms: CF1, ..., CFn
  - Processors: P0, ..., Pn
Output: C (class of the new record)
Begin
  For i = 1 to n, each Pi
  Begin
    C[i] := CFi(r);
    Send(C[i], P0);
  End
  If (tid == 0) then   /* P0 counts the votes */
  Begin
    k := 0;
    For i = 1 to n
    Begin
      j := 1;
      While (j <= k and Class[j] <> C[i]) j := j + 1;
      If (j <= k) Count[j] := Count[j] + 1;
      Else Begin k := k + 1; Class[k] := C[i]; Count[k] := 1; End
    End
    maxd := 0;
    For i = 1 to k
      If (maxd < Count[i])
      Begin maxd := Count[i]; C := Class[i]; End
  End
  Output C;
End

Figure 5. Attack classification model based on combined classifiers

3. Conclusion

This paper proposed a new approach which combines different classifiers in order to make the best use of each one. To build the new model, we evaluated the accuracy and performance (training and testing time) of three classification algorithms: ID3, Naïve Bayes and SVM. Our experimental results on the KDD Cup'99 IDS dataset, based on 10-fold cross validation, show that for each particular type of attack one of the classifiers performs best.

REFERENCES

[1] Christopher Kruegel, Fredrik Valeur, Giovanni Vigna and Richard A. Kemmerer, 2002. Stateful Intrusion Detection for High-Speed Networks. In IEEE Symposium on Security and Privacy, IEEE Computer Society Press, USA.
[2] Nguyen, H. and Choi, D., 2008. Application of data mining to network intrusion detection: classifier selection model. Springer-Verlag Berlin Heidelberg, pp. 399-408.
[3] Lu, C.-T., Boedihardjo, A.P. and Manalwar, P., 2005. Exploiting efficient data mining techniques to enhance intrusion detection systems. Information Reuse and Integration, IRI-2005 IEEE International Conference, pp. 512-517.
[4] Henery, R.J., 1994. Classification. Machine Learning, Neural and Statistical Classification.
[5] C. Xiang, M.Y. Chong and H.L. Zhu, 2004. Design of multiple-level tree classifiers for intrusion detection system. Cybernetics and Intelligent Systems, IEEE Conference, Vol. 2, pp. 873-878.
[6] Rung-Ching Chen, Kai-Fan Cheng, Ying-Hao Chen and Chia-Fen Hsieh, 2009. Using Rough Set and Support Vector Machine for Network Intrusion Detection System. Intelligent Information and Database Systems, ACIIDS 2009 First Asian Conference, pp. 465-470.
[7] KDD99: http://kdd.ics.uci.edu/databases/kddcup99/10 percent.gz
[8] WEKA: http://sourceforge.net/projects/weka/
[9] Mohammadreza Ektefa, Sara Memar, Fatimah Sidi and Lilly Suriani Affendey, 2010. Intrusion Detection Using Data Mining Techniques. Proc. of IEEE Intl. Conference on Information Retrieval & Knowledge Management, pp. 200-203.
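
As an illustration of the voting step described in Section 2.4, the following Python sketch shows one way a record could be labeled by several classifiers running concurrently and the majority class returned. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function name `vote`, the thread pool standing in for the processors, and the three threshold rules `by_count`, `by_serror` and `by_bytes` are all hypothetical and serve only to make the example runnable. Ties are resolved in favor of the class seen first.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Hashable, Sequence

Record = Dict[str, float]
Classifier = Callable[[Record], Hashable]


def vote(record: Record, classifiers: Sequence[Classifier]) -> Hashable:
    """Return the class predicted by the largest number of classifiers."""
    if not classifiers:
        raise ValueError("at least one classifier is required")
    # Each classifier labels the record independently; one worker thread
    # per classifier stands in for the processors P1..Pn (an assumption).
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        predictions = list(pool.map(lambda cf: cf(record), classifiers))
    # The collecting step (P0) tallies the votes. With equal counts,
    # the class that was encountered first is returned.
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical stand-ins for trained models (not the paper's classifiers).
def by_count(r: Record) -> str:
    return "DoS" if r["count"] > 100 else "Normal"


def by_serror(r: Record) -> str:
    return "DoS" if r["serror_rate"] > 0.5 else "Normal"


def by_bytes(r: Record) -> str:
    return "Probe" if r["src_bytes"] == 0 else "Normal"


record = {"count": 250.0, "serror_rate": 0.9, "src_bytes": 0.0}
print(vote(record, [by_count, by_serror, by_bytes]))  # prints "DoS"
```

Plain majority voting treats every classifier equally. If the per-attack accuracies reported in Section 2.3 were to be exploited more directly, the votes could instead be weighted by each classifier's accuracy for the class it predicts; that variant is not shown here.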