Proceedings of 2nd International Conference on Intelligent Knowledge Systems (IKS-2005), 06-08 July 2005

CLASSIFICATION MODELS FOR INTRUSION DETECTION SYSTEMS

Srinivas Mukkamala (srinivas@cs.nmt.edu), Andrew H. Sung (sung@cs.nmt.edu), Rajeev Veeraghattam (rajeev@nmt.edu)
Department of Computer Science, New Mexico Tech, Socorro, NM 87801, USA
Institute of Complex Additive Systems Analysis, New Mexico Tech, Socorro, NM 87801, USA

Key words: Machine learning, intrusion detection systems, CART, MARS, TreeNet

ABSTRACT

This paper describes results concerning the classification capability of supervised machine learning techniques for detecting intrusions using network audit trails. We investigate three well-known machine learning techniques: classification and regression trees (CART), multivariate adaptive regression splines (MARS), and TreeNet. The best model is chosen based on classification accuracy (ROC curve analysis). The results show that high classification accuracies can be achieved in a fraction of the time required by the well-known support vector machines and artificial neural networks. TreeNet performs best for the normal, probe, and denial-of-service (DoS) classes; CART performs best for the user-to-superuser (U2Su) and remote-to-local (R2L) classes.

I. INTRODUCTION

Since the ability of an intrusion detection system (IDS) to identify a large variety of intrusions in real time with high accuracy is of primary concern, we consider in this paper the performance of machine learning-based IDSs with respect to classification accuracy and false alarm rates. AI techniques have been used to automate the intrusion detection process; they include neural networks, fuzzy inference systems, evolutionary computation, machine learning, support vector machines, etc. [1-6]. Model selection using SVMs and other popular machine learning methods often requires extensive resources and long execution times [7,8]. In this paper, we present machine learning methods (MARS, CART, TreeNet) that can perform model selection with higher or comparable accuracies in a fraction of the time required by SVMs.

MARS is a nonparametric regression procedure based on the "divide and conquer" strategy, which partitions the input space into regions, each with its own regression equation [9]. CART is a tree-building algorithm that determines a set of if-then logical (split) conditions that permit accurate prediction or classification of classes [10]. TreeNet is a tree-building algorithm that uses stochastic gradient boosting to combine trees via a weighted voting scheme, to achieve accuracy without the drawback of a tendency to be misled by bad data [11,12].

We performed experiments using MARS, CART, and TreeNet for classifying each of the five classes (normal, probe, denial of service, user to superuser, and remote to local) of network traffic patterns in the DARPA data. A brief introduction to MARS and model selection is given in Section II. CART, and a tree generated for classifying normal vs. intrusions in the DARPA data, is explained in Section III. TreeNet is briefly described in Section IV. The intrusion detection data used for the experiments is described in Section V. In Section VI, we analyze the classification accuracies of MARS, CART, and TreeNet using ROC curves. Conclusions of our work are given in Section VII.

II. MARS

Multivariate Adaptive Regression Splines (MARS) is a nonparametric regression procedure that makes no assumption about the underlying functional relationship between the dependent and independent variables. Instead, MARS constructs this relation from a set of coefficients and basis functions that are entirely "driven" from the data. The method is based on the "divide and conquer" strategy, which partitions the input space into regions, each with its own regression equation. This makes MARS particularly suitable for problems with higher input dimensions, where the curse of dimensionality would likely create problems for other techniques.

Basis functions: MARS uses two-sided truncated functions as basis functions for linear or nonlinear expansion, which approximates the relationships between the response and predictor variables. A simple example is the pair of basis functions $(t-x)_+$ and $(x-t)_+$ [9,11]. The parameter $t$ is the knot of the basis functions (defining the "pieces" of the piecewise linear regression); these knots are also determined from the data. The "+" signs next to the terms $(t-x)$ and $(x-t)$ denote that only positive results of the respective expressions are considered; otherwise the respective functions evaluate to zero.
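As a concrete illustration of such basis functions (not part of the original study; the knot, inputs, and coefficients below are arbitrary placeholder values), a short Python sketch:

```python
import numpy as np

def hinge_pos(x, t):
    """Basis function (x - t)+ : the positive part of (x - t), zero otherwise."""
    return np.maximum(x - t, 0.0)

def hinge_neg(x, t):
    """Mirrored basis function (t - x)+ : the positive part of (t - x), zero otherwise."""
    return np.maximum(t - x, 0.0)

# Illustrative knot and inputs (placeholder values, not from the DARPA experiments)
t = 0.5
x = np.linspace(0.0, 1.0, 5)

print(hinge_pos(x, t))  # [0.   0.   0.   0.25 0.5 ]
print(hinge_neg(x, t))  # [0.5  0.25 0.   0.   0.  ]

# A MARS-style prediction is an intercept plus a weighted sum of such basis functions,
# e.g. y_hat = b0 + b1*(x - t)+ + b2*(t - x)+ for one predictor with a single knot.
b0, b1, b2 = 1.0, 2.0, -1.5   # hypothetical coefficients for illustration only
y_hat = b0 + b1 * hinge_pos(x, t) + b2 * hinge_neg(x, t)
```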
The MARS Model

The basis functions, together with the model parameters (estimated via least squares estimation), are combined to produce the predictions given the inputs. The general MARS model is

$$ y = f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X) $$

where the summation is over the $M$ nonconstant terms in the model and $y$ is predicted as a function of the predictor variables $X$ (and their interactions); this function consists of an intercept parameter $\beta_0$ and the sum of one or more basis functions $h_m(X)$, each weighted by $\beta_m$.

Model Selection

After implementing the forward stepwise selection of basis functions, a backward procedure is applied in which the model is pruned by removing those basis functions that are associated with the smallest increase in the (least squares) goodness-of-fit. A least squares error function (the inverse of goodness-of-fit) is computed. The so-called Generalized Cross-Validation (GCV) error is a measure of the goodness of fit that takes into account not only the residual error but also the model complexity. It is given by

$$ GCV = \frac{\frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^2}{\bigl(1 - C(M)/N\bigr)^2}, \qquad C(M) = 1 + c\,d $$

where $N$ is the number of cases in the data set, $d$ is the effective degrees of freedom, which is equal to the number of independent basis functions, and $c$ is the penalty for adding a basis function. Experiments reported in [9] suggest that the best value for the penalty lies in a small range (roughly between 2 and 4).
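A minimal numeric sketch of this criterion, written directly from the definition above (the response values, fitted values, penalty c, and degrees of freedom d are illustrative placeholders, not quantities from the paper's experiments):

```python
import numpy as np

def gcv(y, y_hat, c, d):
    """Generalized Cross-Validation: mean squared residual inflated by model complexity.

    C(M) = 1 + c*d, where d is the effective degrees of freedom (number of
    independent basis functions) and c is the penalty for adding a basis function.
    """
    n = len(y)
    mse = np.mean((y - y_hat) ** 2)
    complexity = 1.0 + c * d
    return mse / (1.0 - complexity / n) ** 2

# Placeholder responses and fitted values for illustration only
y     = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0])
y_hat = np.array([0.9, 0.2, 0.8, 0.7, 0.1, 0.9, 0.3, 0.1])
print(gcv(y, y_hat, c=3.0, d=2))  # larger c or d penalizes more complex MARS models
```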
III. CART

CART builds classification and regression trees for predicting categorical dependent variables (classification) and continuous dependent variables (regression) [10,11]. The decision tree begins with a root node $t$ derived from whichever variable in the feature space minimizes a measure of the impurity of the two sibling nodes. The measure of impurity (entropy) at node $t$, denoted $i(t)$, is given by [11]

$$ i(t) = -\sum_{j=1}^{k} p(w_j \mid t)\,\log p(w_j \mid t) $$

where $p(w_j \mid t)$ is the proportion of patterns $x_i$ allocated to class $w_j$ at node $t$. Each non-terminal node is then divided into two further nodes, $t_L$ and $t_R$, such that $p_L$ and $p_R$ are the proportions of entities passed to the new nodes $t_L$ and $t_R$, respectively. The best division is the one that maximizes the difference [11]

$$ \Delta i(s, t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R) $$

The decision tree grows by means of successive subdivisions until a stage is reached at which there is no significant decrease in the measure of impurity when a further division $s$ is implemented. When this stage is reached, the node $t$ is not subdivided further and automatically becomes a terminal node. The class $w_j$ associated with the terminal node $t$ is the one that maximizes the conditional probability $p(w_j \mid t)$. The number of nodes generated and the terminal node values for each class of the DARPA data set described in Section V are presented in Table 1.

Table 1. Summary of tree splitters for all five classes
  Class                 Normal   Probe   DoS     U2Su    R2L
  No. of nodes          23       22      16      10      -
  Terminal node value   0.016    0.019   0.004   0.113   0.025

CART analysis consists of four basic steps [12]:
• The first step consists of tree building, during which a tree is built using recursive splitting of nodes. Each resulting node is assigned a predicted class, based on the distribution of classes in the learning dataset that would occur in that node and on the decision cost matrix.
• The second step consists of stopping the tree-building process. At this point a "maximal" tree has been produced, which probably greatly overfits the information contained within the learning dataset.
• The third step consists of tree "pruning," which results in the creation of a sequence of simpler and simpler trees through the cutting off of increasingly important nodes.
• The fourth step consists of optimal tree selection, during which the tree that fits the information in the learning dataset, but does not overfit it, is selected from among the sequence of pruned trees.

(Reference [12] was accidentally omitted during the editing process of the original manuscript. The complete reference is: R. J. Lewis, An Introduction to Classification and Regression Tree (CART) Analysis, Annual Meeting of the Society for Academic Emergency Medicine, 2000.)

Figure 1. Tree for classifying normal vs. intrusions.

Figure 1 presents a classification tree generated from the DARPA data described in Section V for classifying normal activity vs. intrusive activity. Each terminal node carries a predicted value; each record is classified into one of the terminal nodes through the decisions made at the non-terminal nodes that lead from the root to that leaf.
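The impurity measure and split criterion above can be sketched directly in a few lines; the class proportions used here are illustrative and are not taken from the DARPA tree of Figure 1:

```python
import numpy as np

def impurity(class_probs):
    """Entropy impurity i(t) = -sum_j p(w_j|t) * log p(w_j|t) at a node t."""
    p = np.asarray(class_probs, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return -np.sum(p * np.log(p))

def split_gain(parent_probs, left_probs, right_probs, p_left):
    """Decrease in impurity for a candidate split s of node t:
    delta_i(s, t) = i(t) - pL * i(tL) - pR * i(tR)."""
    p_right = 1.0 - p_left
    return (impurity(parent_probs)
            - p_left * impurity(left_probs)
            - p_right * impurity(right_probs))

# Illustrative two-class node: 60% normal / 40% intrusive before the split
gain = split_gain(parent_probs=[0.6, 0.4],
                  left_probs=[0.9, 0.1],    # left child is mostly normal
                  right_probs=[0.1, 0.9],   # right child is mostly intrusive
                  p_left=0.625)
print(gain)  # CART keeps the candidate split that maximizes this decrease
```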
IV. TREENET

In a TreeNet model, classification and regression models are built up gradually through a potentially large collection of small trees. Models typically consist of a few dozen to several hundred trees, each normally no larger than two to eight terminal nodes. The model is similar to a long series expansion (such as a Fourier or Taylor series): a sum of terms that becomes progressively more accurate as the expansion continues. The expansion can be written as [11,13]

$$ F(X) = F_0 + \beta_1 T_1(X) + \beta_2 T_2(X) + \cdots + \beta_M T_M(X) $$

where each $T_i$ is a small tree. Each tree improves on its predecessors through an error-correcting strategy. Individual trees may be as small as one split, but the final models can be accurate and are resistant to overfitting.
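TreeNet is Salford Systems' commercial implementation of stochastic gradient boosting [11,13]. As a rough open-source analogue (not the software or the settings used in this paper), a boosted ensemble of small trees can be sketched with scikit-learn; the feature matrix, labels, and parameter values below are placeholders:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder data standing in for 41-feature connection records (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 41))
y = rng.integers(0, 2, size=500)   # 0 = normal, 1 = intrusive (hypothetical labels)

# Many small trees combined by stochastic gradient boosting:
# each tree has only a few terminal nodes and corrects the errors of its predecessors.
model = GradientBoostingClassifier(
    n_estimators=200,      # "a few dozen to several hundred" trees
    max_leaf_nodes=8,      # each tree no larger than roughly two to eight terminal nodes
    learning_rate=0.1,     # weight given to each tree's contribution
    subsample=0.5,         # the stochastic part: each tree sees a random half of the data
)
model.fit(X, y)
print(model.predict(X[:5]))
```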
V. DATA USED FOR ANALYSIS

A subset of the DARPA intrusion detection data set is used for offline analysis. In the DARPA intrusion detection evaluation program, an environment was set up to acquire raw TCP/IP dump data for a network by simulating a typical U.S. Air Force LAN. The LAN was operated like a real environment, but was blasted with multiple attacks [14,15]. For each TCP/IP connection, 41 quantitative and qualitative features were extracted for intrusion analysis [16].

The 41 extracted features fall into three categories: "intrinsic" features, which describe the individual TCP/IP connections and can be obtained from network audit trails; "content-based" features, which describe the payload of the network packet and can be obtained from the data portion of the packet; and "traffic-based" features, which are computed using a specific window (connection time or number of connections). DoS and probe attacks involve several connections in a short time frame, whereas R2L and U2Su attacks are embedded in the data portions of the connection and often involve just a single connection; "traffic-based" features therefore play an important role in deciding whether a particular network activity is engaged in probing or not.

Attacks fall into four main categories:
• Denial of Service (DoS) attacks: a class of attacks in which an attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine. Examples are Apache2, Back, Land, Mail bomb, SYN flood, Ping of death, Process table, Smurf, Syslogd, Teardrop, and Udpstorm.
• User to Superuser or Root (U2Su) attacks: a class of attacks in which an attacker starts out with access to a normal user account on the system and is able to exploit a vulnerability to gain root access to the system. Examples are Eject, Ffbconfig, Fdformat, Loadmodule, Perl, Ps, and Xterm.
• Remote to User (R2L) attacks: a class of attacks in which an attacker who does not have an account on a machine sends packets to that machine over a network and exploits some vulnerability to gain local access as a user of that machine. Examples are Dictionary, Ftp_write, Guest, Imap, Named, Phf, Sendmail, Xlock, and Xsnoop.
• Probing (Probe): a class of attacks in which an attacker scans a network of computers to gather information or find known vulnerabilities. An attacker with a map of the machines and services available on a network can use this information to look for exploits. Examples are Ipsweep, Mscan, Nmap, Saint, and Satan.

In our experiments, we perform five-class classification. The (training and testing) data set contains 11,982 points randomly generated from the DARPA data set and representing the five classes, with the number of points from each class proportional to its size, except that the smallest class is completely included. The data are divided into a training set of 5,092 points and a testing set of 6,890 points covering the five classes: normal, probe, denial of service, user to superuser, and remote to local. The attack data are a collection of 22 different attack types belonging to the four categories described above; the remainder is normal data. Two randomly generated separate data sets of sizes 5,092 and 6,890 are used for training and testing MARS, CART, and TreeNet, respectively. Section VI summarizes the classifier accuracies.

VI. ROC CURVES

Detection rates and false alarm rates are evaluated for the five-class patterns in the DARPA data set, and the results obtained are used to form the ROC curves. The point (0,1) is the perfect classifier, since it classifies all positive and negative cases correctly. An ideal system will thus begin by identifying all the positive examples, so the curve rises to (0,1) immediately with a zero rate of false positives, and then continues along to (1,1).

Figures 2 to 6 show the ROC curves of the detection models by attack category as well as on all intrusions. In each of these ROC plots, the x-axis is the false positive rate, calculated as the percentage of normal connections considered as intrusions; the y-axis is the detection rate, calculated as the percentage of intrusions detected. A data point in the upper left corner corresponds to optimal performance, i.e., a high detection rate with a low false alarm rate. The areas of the ROC curves and the numbers of false positives and false negatives are presented in Tables 2 to 6.

Table 2. Summary of classification accuracy for normal
  Curve     ROC area   False positives   False negatives
  MARS      0.99       56                -
  CART      0.991      75                -
  TreeNet   0.99       18                -

Figure 2. Classification accuracy for normal.

Table 3. Summary of classification accuracy for probe
  Curve     ROC area   False positives   False negatives
  MARS      0.77       64                14
  CART      0.99       305               -
  TreeNet   0.99       24                -

Figure 3. Classification accuracy for probe.

Table 4. Summary of classification accuracy for DoS
  Curve     ROC area   False positives   False negatives
  MARS      0.94       185               -
  CART      0.99       169               -
  TreeNet   0.99       16                -

Figure 4. Classification accuracy for DoS.

Table 5. Summary of classification accuracy for U2Su
  Curve     ROC area   False positives   False negatives
  MARS      0.70       15                -
  CART      0.72       14                -
  TreeNet   0.69       16                -

Figure 5. Classification accuracy for U2Su.

Table 6. Summary of classification accuracy for R2L
  Curve     ROC area   False positives   False negatives
  MARS      0.99       17                -
  CART      0.99       15                -
  TreeNet   0.99       19                -

Figure 6. Classification accuracy for R2L.
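As an illustration of how such ROC curves and areas can be computed from a detector's output scores (the labels and scores below are placeholders rather than results from these experiments), a minimal scikit-learn sketch:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Placeholder detector output: 1 = intrusion, 0 = normal, with a score per connection
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.05, 0.7, 0.6, 0.3])

# x-axis: false positive rate (normal connections flagged as intrusions)
# y-axis: detection rate (intrusions correctly flagged)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("ROC area:", auc(fpr, tpr))   # 1.0 corresponds to the perfect classifier at (0, 1)
```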
VII. CONCLUSIONS

A number of observations and conclusions are drawn from the results reported in this paper:
• TreeNet easily achieves high detection accuracy (higher than 99%) for each of the classes of DARPA data. TreeNet performed the best for the normal class (18 false positives), as well as for the probe and denial-of-service (DoS) classes.
• CART performed the best for the user-to-superuser (U2Su) and remote-to-local (R2L) classes, with 15 false positives for R2L.

We demonstrate that using these fast-execution machine learning methods we can achieve high classification accuracies in a fraction of the time required by the well-known support vector machines and artificial neural networks. We note, however, that the differences in accuracy figures tend to be small and may not be statistically significant, especially in view of the fact that the classes of patterns differ tremendously in their sizes. More definitive conclusions perhaps can only be drawn after analyzing more comprehensive sets of network data.

ACKNOWLEDGEMENTS

Partial support for this research received from ICASA (Institute for Complex Additive Systems Analysis, a division of New Mexico Tech), a DoD IASP grant, and an NSF SFS Capacity Building grant is gratefully acknowledged.

REFERENCES

1. S. Mukkamala, G. Janowski, A. H. Sung, Intrusion Detection Using Neural Networks and Support Vector Machines. Proceedings of the IEEE International Joint Conference on Neural Networks, IEEE Press, pp. 1702-1707, 2002.
2. M. Fugate, J. R. Gattiker, Computer Intrusion Detection with Classification and Anomaly Detection, Using SVMs. International Journal of Pattern Recognition and Artificial Intelligence, Vol. 17(3), pp. 441-458, 2003.
3. W. Hu, Y. Liao, V. R. Vemuri, Robust Support Vector Machines for Anomaly Detection in Computer Security. International Conference on Machine Learning, pp. 168-174, 2003.
4. K. A. Heller, K. M. Svore, A. D. Keromytis, S. J. Stolfo, One Class Support Vector Machines for Detecting Anomalous Windows Registry Accesses. Proceedings of the IEEE Data Mining Workshop on Data Mining for Computer Security, 2003.
5. A. Lazarevic, L. Ertoz, A. Ozgur, J. Srivastava, V. Kumar, A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection. Proceedings of the Third SIAM Conference on Data Mining, 2003.
6. S. Mukkamala, A. H. Sung, Feature Selection for Intrusion Detection Using Neural Networks and Support Vector Machines. Journal of the Transportation Research Board of the National Academies, Transportation Research Record No. 1822, pp. 33-39, 2003.
7. S. J. Stolfo, F. Wei, W. Lee, A. Prodromidis, P. K. Chan, Cost-based Modeling and Evaluation for Data Mining with Application to Fraud and Intrusion Detection: Results from the JAM Project, 1999.
8. S. Mukkamala, B. Ribeiro, A. H. Sung, Model Selection for Kernel Based Intrusion Detection Systems. Proceedings of the International Conference on Adaptive and Natural Computing Algorithms (ICANNGA), Springer-Verlag, pp. 458-461, 2005.
9. T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
10. L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees. Wadsworth and Brooks/Cole Advanced Books and Software, 1986.
11. Salford Systems, TreeNet, CART, and MARS Manuals.
12. R. J. Lewis, An Introduction to Classification and Regression Tree (CART) Analysis. Annual Meeting of the Society for Academic Emergency Medicine, 2000.
13. J. H. Friedman, Stochastic Gradient Boosting. Computational Statistics and Data Analysis, Elsevier Science, Vol. 38, pp. 367-378, 2002.
14. K. Kendall, A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems. Master's Thesis, Massachusetts Institute of Technology (MIT), 1998.
15. S. E. Webster, The Development and Analysis of Intrusion Detection Algorithms. Master's Thesis, MIT, 1998.
16. W. Lee, S. J. Stolfo, A Framework for Constructing Features and Models for Intrusion Detection Systems. ACM Transactions on Information and System Security, Vol. 3, pp. 227-261, 2000.