2021 8th NAFOSTED Conference on Information and Computer Science (NICS)

An Ensemble Feature Selection Algorithm for Machine Learning based Intrusion Detection System

Phuoc-Cuong Nguyen, Quoc-Trung Nguyen, Kim-Hung Le
University of Information Technology, Vietnam National University Ho Chi Minh City, Ho Chi Minh, Vietnam
Email: 18520545@gm.uit.edu.vn, 18521553@gm.uit.edu.vn, hunglk@uit.edu.vn

Abstract—In recent years, we have witnessed significant growth of the Internet along with emerging security threats. A machine learning-based Intrusion Detection System (IDS) is widely employed to detect cyberattacks by continuously monitoring network traffic. However, the diversity of network features considerably affects the accuracy and training time of the IDS model. In this paper, a lightweight and effective feature selection algorithm for IDS is proposed. This algorithm combines the advantages of both the Random Forest and AdaBoost algorithms. Evaluation results on popular datasets (NSL-KDD, UNSW-NB15, and CICIDS-2017) show that our proposal outperforms existing feature selection algorithms in terms of detection accuracy and the number of selected features.

Index Terms—Intrusion Detection System, feature selection algorithm, machine learning, random forest, adaboost

I. INTRODUCTION

The Internet has grown exponentially and transformed every aspect of our daily lives. According to Internet World Stats (IWS), from June 2017 to March 2021, the number of Internet users increased from 3,885 million to 5,168 million, meaning that 65.6% of the world's population now uses the Internet [1]. For most people these numbers are unremarkable, but for cybercriminals they are a gold mine to exploit. A specific example is the COVID-19 pandemic: the number of Internet users skyrocketed as many jobs moved online, and attackers took advantage of this to exploit and attack Internet users [2]. According to McAfee's statistics on the global damage caused by cyberattacks, cybercrime caused $500 billion in damage in 2018 [3]; by 2020, this number had nearly doubled to about $945 billion. The demand for IDS systems is therefore increasing day by day.

An IDS is a device or software running on Internet gateways to detect malicious activity or policy violations. Cyberattacks usually target our digital assets: when an attack occurs, the attacker tries to obtain confidential data, modify it, or make the system stop providing service. IDS therefore plays an essential role in detecting and preventing a large number of security threats while maintaining confidentiality, integrity, and availability [4]. However, most existing IDSs cannot handle complex and continuously varying attacks. Classical defenses (e.g., firewalls, access control mechanisms) still have limitations in comprehensively protecting the network and system from complex attacks such as DDoS [5]. Applying machine learning is a potential solution to this problem, because it can increase detection strength, reduce the false alarm rate, and adapt better to evolving attacks.

An IDS has to deal with huge volumes of data that include false positives and incompatible or redundant features. Such features not only slow down the detection process but also consume a significant amount of resources. Feature selection is therefore a promising direction: it can increase accuracy and training and testing speed [6], and it helps to solve some of the common problems in IDS by removing confounding features, lowering operating costs and storage space.

The filter method is a popular approach for selecting IDS features. It selects the best feature subset using a technique called variable ranking: all features are scored by a suitable ranking criterion, and any feature whose score falls below a threshold is removed [7]. However, the performance of the filter method depends heavily on the threshold, which is data-specific, so choosing the correct threshold is very challenging.
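To make the variable-ranking idea concrete, below is a minimal sketch, assuming scikit-learn's mutual information score as the ranking criterion (the text does not fix a particular criterion); the threshold value is illustrative and, as noted above, data-specific.

```python
# A hypothetical filter-method selector: score every feature, keep those
# above a threshold. mutual_info_classif is one possible ranking criterion.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def filter_select(X, y, threshold=0.01):
    scores = mutual_info_classif(X, y, random_state=0)  # rank all features
    keep = np.flatnonzero(scores >= threshold)          # drop low scorers
    return keep, scores
```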
To solve this issue, in this paper we introduce a novel feature selection algorithm that combines two ensemble algorithms: Random Forest and AdaBoost. First, we select a feature subset S0, then apply the ensemble algorithms and evaluate them with our criterion. We repeat this process until most cases are covered, and then use a clustering algorithm to pick out the feature subsets that tend to provide the best performance.

The rest of this paper is organized as follows. Related works are presented in Section II. We introduce our proposal in Section III. Section IV describes the evaluated datasets and experimental results. In Section V, we conclude our work.

II. RELATED WORKS

Paulo M. Mafra et al. proposed an IDS based on Genetic and SVM algorithms that can improve the SVM parameters [8]. In other words, the model uses a Genetic algorithm as the optimizer to maximize the performance of the SVM. The average detection accuracy recorded by this model is 80.14%. Similarly, the authors in [9] proposed a comparable IDS model that uses a Genetic algorithm both to optimize the parameters of the SVM and as a feature selector; the fitness function developed for the Genetic algorithm evaluates chromosomes for maximum accuracy and a minimum number of features.

The authors of [10] proposed a wrapper-based feature selection method formulated as a multi-objective optimization algorithm and used an unsupervised clustering method based on Growing Hierarchical Self-Organizing Maps (GHSOMs); the SOM is one of the most widely used unsupervised artificial neural network models. The study selects 25 features and shows an accuracy of 99.12 ± 0.61% and an FP rate of 2.24 ± 0.41%, an improvement over IDS models with filter-based feature selection and over IDS without feature selection.

The authors in [11] developed an IDS that uses an ensemble classifier for feature selection. This framework combines the bat algorithm (BA) and correlation-based feature selection (CFS), and the ensemble classifier is built using Random Forest and Forest by Penalizing Attributes. The tests use the CIC-IDS2017 dataset and various performance metrics, including false-positive rate, true-positive rate, detection rate, accuracy, false alarm rate, and the Matthews correlation coefficient. The results indicate that the combined CFS-BA approach achieves a high accuracy of 96.76%, a detection rate of 94.04%, and a low false alarm rate of 2.38%.

Rikhtegar et al. developed a feature selection model based on an associative learning mechanism [12] that combines IDS feature selection with a clustering mechanism. They employed the support vector machine (SVM) and the K-Medoids clustering algorithm in turn. The approach also uses a Naive Bayes classifier and is evaluated on the KDD CUP99 dataset using three essential performance metrics: accuracy, detection rate, and false alarm rate. These metrics are computed from TNs (true negatives), FPs (false positives), and FNs (false negatives). Compared with three other feature selection methods, the proposed hybrid normalization method produces better accuracy (91.5%), detection rate (90.1%), and false alarm rate (6.36%).

Shadi Aljawarneh et al. developed a hybrid model used to estimate the intrusion scope threshold degree based on the optimal features of the network transaction data made available for training [7]. The experiments were conducted on the NSL-KDD dataset with significant results: the accuracy of the proposed model was 99.81% for the binary class and 98.56% for the multiclass case. However, this method has issues with high false-positive and false-negative rates, which are addressed using a hybrid approach: a Vote algorithm with Information Gain and a hybrid classifier consisting of J48, Meta Pagging, RandomTree, REPTree, AdaBoostM1, DecisionStump, and Naive Bayes. The final result is excellent, with better accuracy and lower false-negative and false-positive rates.

The authors of [13] proposed an IDS based on feature selection and a clustering algorithm using filter and wrapper methods, named the feature grouping based on linear correlation coefficient (FGLCC) algorithm and the cuttlefish algorithm (CFA), respectively. A decision tree is used as the classifier. For performance verification, the method was applied to the large KDD Cup 99 dataset. The results are promising, with a high accuracy of 95.03% and a detection rate of 95.23% at a low false-positive rate of 1.65%.
III. ENSEMBLE FEATURE SELECTION

The primary idea of our proposal is to combine the advantages of both Random Forest and AdaBoost to maximize the quality of feature selection. For a large dataset with many noisy samples, AdaBoost can be selected; Random Forest is more suitable for a smaller, less noisy dataset. Compared with deep-learning approaches, our proposal achieves similar accuracy while being significantly faster. The algorithm is described below:

Proposed model's algorithm
  Input: dataset D
  Output: selected feature set Sbest
  Initialize: rounds R, Gbest[R], criterion C = (accuracy, running time)
  for n = 1 to R:
      randomize feature subset F = (F0, F1, ..., Fn-1)
      RF = RandomForest(D, F); AB = AdaBoost(D, F)
      Gn = compare(RF, AB)
      Gbest.append(Gn)
  C = cluster(Gbest)
  Sbest = MostFrequent(C[high accuracy, low running time])
  return Sbest
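The following Python sketch illustrates one possible reading of this algorithm. It is not the authors' implementation: scikit-learn's RandomForestClassifier and AdaBoostClassifier stand in for the two learners, KMeans stands in for the unnamed clustering step, and the values of rounds, subset_size, and the 70/30 split are illustrative assumptions.

```python
# A minimal sketch of the proposed selection loop under the assumptions
# stated above; X and y are assumed to be preprocessed NumPy arrays.
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

def ensemble_feature_selection(X, y, rounds=30, subset_size=14, seed=0):
    rng = np.random.default_rng(seed)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    candidates = []  # one (feature subset, accuracy, running time) per round
    for _ in range(rounds):
        feats = rng.choice(X.shape[1], size=subset_size, replace=False)
        best = None  # keep whichever of RF / AdaBoost scores higher
        for model in (RandomForestClassifier(n_estimators=50, random_state=seed),
                      AdaBoostClassifier(n_estimators=50, random_state=seed)):
            start = time.perf_counter()
            model.fit(X_tr[:, feats], y_tr)
            acc = model.score(X_te[:, feats], y_te)
            elapsed = time.perf_counter() - start
            if best is None or acc > best[0]:
                best = (acc, elapsed)
        candidates.append((feats, best[0], best[1]))
    # Cluster the (accuracy, running time) pairs of all rounds.
    scores = np.array([[acc, t] for _, acc, t in candidates])
    labels = KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(scores)
    # Keep the cluster with the highest mean accuracy (lower time breaks ties).
    best_label = max(set(labels),
                     key=lambda l: (scores[labels == l, 0].mean(),
                                    -scores[labels == l, 1].mean()))
    # Sbest: the features occurring most often in the winning cluster.
    pooled = np.concatenate([f for (f, _, _), l
                             in zip(candidates, labels) if l == best_label])
    votes = np.bincount(pooled, minlength=X.shape[1])
    return np.argsort(votes)[::-1][:subset_size]
```

Comparing the two learners per round mirrors the compare(RF, AB) step, and the frequency vote at the end mirrors MostFrequent; the paper may weight these steps differently.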
Random Forest is a tree-based algorithm that combines many decision trees into an ensemble. Its primary purpose is to achieve much better accuracy than a single decision tree alone. Each tree is built from a different feature subset with a different sub-dataset; in this way, the trees protect each other from their individual errors. In detail, the algorithm can be described as follows:

• Step 1: Randomly select n features from the original feature set.
• Step 2: Build a decision tree based on the n selected features, then return to Step 1.
• Step 3: Repeat until a large number of decision trees have been built.
• Step 4: Randomly select m decision trees from the tree system. For each test sample, each decision tree predicts which class the sample belongs to; the final result is the class that receives the most votes.

(Figure: training samples feed n decision trees; voting over the trees' predictions yields the final result.)

Random Forest is suitable because our datasets have many missing values. The downside of this algorithm is its performance cost, since it deals with a large number of decision trees.

AdaBoost is an ensemble algorithm designed to improve accuracy by adjusting the weights associated with each weak learner (such as a small decision tree) on the same dataset. The prediction uses a weighted majority vote over all weak learners. The overall algorithm can be described as follows:

• Step 1: Select and build the weak learners (small decision trees); initialize the weight of each of the N training samples to 1/N.
• Step 2: Select a training subset and start training.
• Step 3: Decrease the weights of the samples that were predicted correctly in the last round.
• Step 4: Increase the weights of the samples that were predicted wrongly.
• Step 5: Return to Step 2 with the new weights.

By repeating Steps 3 and 4, AdaBoost forces the weak learners to concentrate on the samples that were predicted wrongly, improving overall accuracy. AdaBoost takes less time to train; the trade-off is that its accuracy is not as good as Random Forest's, but we can use it in some cases to maximize the performance of our algorithm. The boosting procedure is given by the following pseudo-code:

The boosting algorithm AdaBoost
  Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in \{-1, +1\}$.
  Initialize: $D_1(i) = 1/m$ for $i = 1, \ldots, m$.
  For $t = 1, \ldots, T$:
    - Train a weak learner using distribution $D_t$.
    - Get a weak hypothesis $h_t : X \to \{-1, +1\}$.
    - Aim: select $h_t$ with low weighted error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$.
    - Choose $\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$.
    - Update, for $i = 1, \ldots, m$: $D_{t+1}(i) = \frac{D_t(i)\,\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ is a distribution).
  Output the final hypothesis: $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.

The process can be explained as follows. First, we are given a training set containing m samples, where each input x is an element of the instance space X and each output y takes one of only two values, -1 (the negative class) and +1 (the positive class). Next, the weight of every sample is initialized to 1 divided by the number of training samples. Then, for each round t = 1 to T, we fit a classifier to the training data (each prediction is -1 or +1) and choose the classifier with the lowest weighted classification error.
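As a concrete illustration of the update rule above, here is a minimal NumPy sketch of AdaBoost with one-node decision stumps as the weak learners; it assumes labels in {-1, +1} and is written for clarity rather than speed.

```python
# A minimal AdaBoost with decision stumps; assumes y contains -1/+1 labels.
import numpy as np

def adaboost_fit(X, y, T=50):
    m, n = X.shape
    D = np.full(m, 1.0 / m)            # D1(i) = 1/m
    stumps = []
    for _ in range(T):
        best = None
        # Weak learner: exhaustively pick the (feature, threshold, sign)
        # stump with the lowest weighted error under distribution D.
        for j in range(n):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(X[:, j] <= thr, sign, -sign)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign, pred)
        err, j, thr, sign, pred = best
        err = max(err, 1e-10)                       # guard against division by zero
        alpha = 0.5 * np.log((1.0 - err) / err)     # alpha_t
        D *= np.exp(-alpha * y * pred)              # reweight samples
        D /= D.sum()                                # normalize (Z_t)
        stumps.append((alpha, j, thr, sign))
    return stumps

def adaboost_predict(stumps, X):
    # H(x) = sign(sum_t alpha_t * h_t(x))
    agg = sum(a * np.where(X[:, j] <= thr, s, -s) for a, j, thr, s in stumps)
    return np.sign(agg)
```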
IV. RESULTS AND DISCUSSION

A. Dataset

1) DARPA KDDCUP99 dataset: The DARPA dataset was originally developed in 1998 to improve research in intrusion detection. KDDCUP99 is an upgraded version of the DARPA 1999 dataset, used to develop intrusion detection systems that distinguish between bad and good connections. The dataset is mainly designed to detect network intrusions through simulation in a military environment. KDDCUP99 was built using the DARPA98 IDS evaluation and simulates four different types of attacks, which can be classified into four main categories:

• Denial of Service (DoS) attacks: the attacker tries to limit network usage by disrupting service availability to intended users.
• User to Root (U2R) attacks: the attacker gains access to a normal user account and tries to obtain root access through system vulnerabilities.
• Remote to Local (R2L) attacks: the attacker does not have an account on the local system but tries to gain access as a local user by sending network packets that exploit vulnerabilities.
• Probing attacks: the attacker scans a network to gather information about the system in order to use it to evade the system's security controls.

TABLE I: KDDCUP99 TRAINING DATASET DISTRIBUTION

| Class  | # of instances | Percentage % |
| Normal | 97,277         | 19.69%       |
| DoS    | 391,458        | 79.24%       |
| Probe  | 4,107          | 0.83%        |
| R2L    | 1,126          | 0.23%        |
| U2R    | 52             | 0.01%        |
| Total  | 494,019        | 100%         |

2) NSL-KDD dataset: The NSL-KDD dataset is an improved version of KDDCUP99, proposed to solve some of KDDCUP99's problems. NSL-KDD has a reasonable number of records in its training and testing sets and the same features as the original KDDCUP99, and it is an effective standard by which researchers compare their proposed IDSs. Some of the improvements of this dataset are:

• There are no redundant records in the training set, so the classifier will not produce misleading results.
• There are no duplicate records in the testing set, which results in a better reduction ratio.
• The number of records selected from each difficulty-level group is inversely proportional to the percentage of records in the original KDD dataset.

3) UNSW-NB15 dataset: The raw network packets of the UNSW-NB15 dataset were created by the IXIA PerfectStorm tool in the Cyber Range Lab of UNSW Canberra to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviors. The network traffic combines real modern normal activity with recently synthesized attacks, including fuzzers, analysis, backdoors, DoS, exploits, generic, reconnaissance, shellcode, and worms. The dataset has 49 features, developed using the Argus and Bro-IDS tools plus twelve algorithms that cover the characteristics of network packets. In contrast, existing benchmark datasets such as KDD98, KDDCUP99, and NSL-KDD cover a limited number of attacks and contain outdated packet information. The tcpdump tool was used to capture 100 GB of network traffic in the form of packets; the simulation periods were 16 hours on Jan 22, 2015 and 15 hours on Feb 17, 2015, and each pcap file was divided into 1000 MB chunks using tcpdump. To create reliable features from the pcap files, the Argus and Bro-IDS tools were used, and twelve algorithms were developed in C# to analyze the connection-packet flows in depth.

B. Results

In this section, we show the results of our proposed model on three datasets: NSL-KDD, UNSW-NB15, and CICIDS-2017.

TABLE II: COMPARISON RESULTS OF THE TEST MODEL RUNNING WITH ALL FEATURES AND WITH SELECTED FEATURES ON THE CICIDS-2017 DATASET

| # of features | Loss  | Accuracy | Time/step | Time/epoch |
| 83            | 0.202 | 0.9943   | 30ms      | 1600s      |
| 20            | 0.023 | 0.9926   | 10ms      | 535s       |

TABLE III: COMPARISON RESULTS OF THE TEST MODEL RUNNING WITH ALL FEATURES AND WITH SELECTED FEATURES ON THE UNSW-NB15 DATASET

| # of features | Loss   | Accuracy | Time/step | Time/epoch |
| 43            | 1.1482 | 0.6924   | 8ms       | 47s        |
| 14            | 0.5152 | 0.7774   | 8ms       | 47s        |

TABLE IV: COMPARISON RESULTS OF THE TEST MODEL RUNNING WITH ALL FEATURES AND WITH SELECTED FEATURES ON THE NSL-KDD DATASET

| # of features | Loss   | Accuracy | Time/step | Time/epoch |
| 40            | 0.0835 | 0.9772   | 42ms      | 157s       |
| 14            | 0.0479 | 0.988    | 34ms      | 127s       |

As shown in Tables II, III, and IV, our proposal improves the model performance (accuracy, loss, and training time) on all datasets. For example, on the CICIDS-2017 dataset, the accuracy is stable while the training time is significantly reduced from 1600 seconds to 535 seconds. In contrast, on the UNSW-NB15 dataset, although the training time did not decrease, the algorithm's accuracy improved from 0.6924 to 0.7774. On the NSL-KDD dataset, the result improves in both accuracy and training time.

To demonstrate the effectiveness of our proposal over state-of-the-art methods, we compare it with Sigmoid PIO and Cosine PIO [14], two of the newest feature selection methods. As shown in Tables V and VI, our method is more effective than the PIO variants on the KDDCUP99 dataset: only six features are selected while the accuracy is still guaranteed. The result on the NSL-KDD dataset is similar: the number of selected features is larger than for Cosine PIO but smaller than for Sigmoid PIO, while the accuracy is much higher than that of both methods. These comparison results again demonstrate the effectiveness of our proposal.

TABLE V: COMPARISON RESULTS WITH THE DECISION TREE MODEL ON THE KDDCUP99 DATASET

| Technique   | # of features | Accuracy | Selected features |
| Sigmoid PIO | 10 | 0.869 | [3, 4, 6, 11, 13, 18, 23, 36, 37, 39] |
| Cosine PIO  | 7  | 0.883 | [3, 4, 6, 13, 23, 29, 34] |
| Our method  | 6  | 0.992 | [3, 8, 10, 13, 30, 32] |

TABLE VI: COMPARISON RESULTS WITH THE DECISION TREE MODEL ON THE NSL-KDD DATASET

| Technique   | # of features | Accuracy | Selected features |
| Sigmoid PIO | 18 | 0.869 | [1, 3, 4, 5, 6, 8, 10, 11, 12, 13, 14, 15, 17, 18, 27, 32, 36, 39, 41] |
| Cosine PIO  | 5  | 0.883 | [2, 6, 10, 22, 27] |
| Our method  | 14 | 0.980 | [2, 3, 6, 8, 10, 11, 13, 23, 27, 30, 32, 35, 36, 39] |
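The evaluation behind Tables V and VI can be sketched as follows: train a Decision Tree on only the selected feature columns and report the test accuracy. This is an assumption-laden sketch, not the authors' code; X and y are assumed to be preprocessed NumPy arrays, and the split ratio and random seed are illustrative.

```python
# Minimal sketch: score a feature subset with a Decision Tree classifier.
# `selected` holds the column indices of the chosen features.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def evaluate_subset(X, y, selected):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:, selected], y, test_size=0.3, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# e.g., the KDDCUP99 subset our method selects in Table V:
# accuracy = evaluate_subset(X, y, [3, 8, 10, 13, 30, 32])
```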
V. CONCLUSION

In this paper, we introduced an ensemble feature selection algorithm for IDS. This algorithm aims to boost the detection accuracy of IDS while reducing the number of network features required to build the IDS model. Evaluation on the NSL-KDD, UNSW-NB15, and CICIDS-2017 datasets shows that our proposal reduces the number of features from 40 to 14, from 43 to 14, and from 83 to 20, respectively; as a result, the model training time is significantly reduced. Furthermore, the proposed algorithm is more effective than related works, achieving better accuracy with fewer selected features.

REFERENCES

[1] R. Morrar, H. Arman, and S. Mousa, "The fourth industrial revolution (industry 4.0): A social innovation perspective," Technology Innovation Management Review, vol. 7, no. 11, pp. 12–20, 2017.
[2] R. P. Singh, M. Javaid, A. Haleem, and R. Suman, "Internet of things (IoT) applications to fight against COVID-19 pandemic," Diabetes & Metabolic Syndrome: Clinical Research & Reviews, vol. 14, no. 4, pp. 521–524, 2020.
[3] H. S. Brar and G. Kumar, "Cybercrimes: A proposed taxonomy and challenges," Journal of Computer Networks and Communications, vol. 2018, 2018.
[4] K. Khan, A. Mehmood, S. Khan, M. A. Khan, Z. Iqbal, and W. K. Mashwani, "A survey on intrusion detection and prevention in wireless ad-hoc networks," Journal of Systems Architecture, vol. 105, p. 101701, 2020.
[5] A. Aldweesh, A. Derhab, and A. Z. Emam, "Deep learning approaches for anomaly-based intrusion detection systems: A survey, taxonomy, and open issues," Knowledge-Based Systems, vol. 189, p. 105124, 2020.
[6] S. Maza and M. Touahria, "Feature selection algorithms in intrusion detection system: A survey," KSII Transactions on Internet and Information Systems (TIIS), vol. 12, no. 10, pp. 5079–5099, 2018.
[7] S. Aljawarneh, M. Aldwairi, and M. B. Yassein, "Anomaly-based intrusion detection system through feature selection analysis and building hybrid efficient model," Journal of Computational Science, vol. 25, pp. 152–160, 2018.
[8] P. M. Mafra, V. Moll, J. da Silva Fraga, and A. O. Santin, "Octopus-IIDS: An anomaly based intelligent intrusion detection system," in The IEEE Symposium on Computers and Communications. IEEE, 2010, pp. 405–410.
[9] L. Zhuo, J. Zheng, X. Li, F. Wang, B. Ai, and J. Qian, "A genetic algorithm based wrapper feature selection method for classification of hyperspectral images using support vector machine," in Geoinformatics 2008 and Joint Conference on GIS and Built Environment: Classification of Remote Sensing Images, vol. 7147. International Society for Optics and Photonics, 2008, p. 71471J.
[10] E. De la Hoz, E. De La Hoz, A. Ortiz, J. Ortega, and A. Martínez-Álvarez, "Feature selection by multi-objective optimisation: Application to network anomaly detection by hierarchical self-organising maps," Knowledge-Based Systems, vol. 71, pp. 322–338, 2014.
[11] Y. Zhou, G. Cheng, S. Jiang, and M. Dai, "Building an efficient intrusion detection system based on feature selection and ensemble classifier," Computer Networks, vol. 174, p. 107247, 2020.
[12] L. Khalvati, M. Keshtgary, and N. Rikhtegar, "Intrusion detection based on a novel hybrid learning approach," Journal of AI and Data Mining, vol. 6, no. 1, pp. 157–162, 2018.
[13] S. Mohammadi, H. Mirvaziri, M. Ghazizadeh-Ahsaee, and H. Karimipour, "Cyber intrusion detection by combined feature selection algorithm," Journal of Information Security and Applications, vol. 44, pp. 80–88, 2019.
[14] H. Alazzam, A. Sharieh, and K. E. Sabri, "A feature selection algorithm for intrusion detection system based on pigeon inspired optimizer," Expert Systems with Applications, vol. 148, p. 113249, 2020.