Machine learning approaches to cyber security

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
COMPUTER SCIENCE AND ENGINEERING FACULTY

GRADUATION THESIS
Machine Learning Approaches to Cyber Security

Department: Computer Science
Committee: Computer Science
Advisor: Prof. Nguyen Duc Thai
Reviewer: Prof. Nguyen Le Duy Lai
Students: Huynh Kien Van - 1552423
          Nguyen Duc Kien - 1552181

HO CHI MINH CITY, 12/2021

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
Faculty: Computer Science and Engineering
Department: Computer Systems and Networks

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

GRADUATION THESIS ASSIGNMENT
Note: Students must attach this sheet to the thesis report.

Full name: Huỳnh Kiến Văn, Student ID: 1552423
Full name: Nguyễn Đức Kiên, Student ID: 1552181
Major: Computer Science
Class:

Thesis title: Machine Learning Approaches for Cyber Security

Tasks (required content and initial data):
- Do research on Machine Learning and its applications
- Do research on topics related to the application of Machine Learning to cyber security
- Propose a way to create an IDS (Intrusion Detection System) using Machine Learning
- Design the desired system as mentioned above
- Implement the system using any programming language(s) and technologies, and show that they are suitable for the solution
- Demonstrate the system to verify that it runs properly and correctly

Date of assignment: 30/08/2021
Date of completion: 31/12/2021
Advisor: Dr. Nguyễn Đức Thái
The content and requirements of the thesis were approved by the Department on 23 August 2021.
Head of Department (signature and full name): Dr. Nguyễn Đức Thái
Main advisor (signature and full name): Dr. Nguyễn Đức Thái

FOR THE FACULTY AND DEPARTMENT:
Reviewer (preliminary grading):
Unit:
Date of defense:
Final score:
Thesis archive location:

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
28 December 2021

THESIS DEFENSE GRADING SHEET (for the advisor)
Students: Huỳnh Kiến Văn, Student ID: 1552423; Nguyễn Đức Kiên, Student ID: 1552181
Major (specialization): Computer Science
Thesis title: Machine Learning Approaches for Cyber Security
Advisor: Nguyễn Đức Thái
Overview of the report: Number of pages: ; Number of chapters: ; Number of data tables: ; Number of figures: ; Number of references: ; Computation software: ; Artifacts (products):
Overview of the drawings: Number of drawings: ; A1 size: ; A2 size: ; Other sizes: ; Hand-drawn: ; Computer-drawn:
Main strengths of the thesis:
- The students completed the desired features of the thesis and demonstrated them.
- The students applied machine learning algorithms to analyze network traffic and cybersecurity data.
Main shortcomings of the thesis:
- Many parts of the report are short and lack justification.
- The students provided an evaluation of the results; however, the evaluation was too short and no discussion was presented.
Recommendation: Approved for defense [x]; Needs additions before defense [ ]; Not approved for defense [ ]
Questions the students must answer before the committee:
10. Overall assessment (in words: excellent, good, average):
Score (Huỳnh Kiến Văn, Nguyễn Đức Kiên): 8.2/10, 7/10
Signature (with full name): Nguyễn Đức Thái

FACULTY OF COMPUTER SCIENCE AND ENGINEERING
27 December 2021
Students: Huynh Kien Van - 1552423, Nguyen Duc Kien - 1552181
Major (specialization): Computer Science
Thesis title: MACHINE LEARNING APPROACHES TO CYBER SECURITY
Reviewer: Nguyen Le Duy Lai
Overview of the report (field labels illegible in the source): 40; 10; 14
Main strengths of the thesis:
In this dissertation, the topic is how to apply machine learning approaches to analyze live network traffic. The implementation of a Traffic validator is expected to classify incoming traffic into benign and malicious classes. The network traffic has already been filtered through a rule-based IDS such as Snort, and the model is an add-on to IDSs that aims to eliminate the false negatives of rule-based IDSs. The detection task is usually expected to complete within milliseconds, as IDSs must respond quickly and without affecting the user experience.
Main shortcomings of the thesis:
However, the presented topic is still limited, including the inability to support a complex prediction model. The title of this thesis seems very broad, and the authors need to give a concise scope for the problems and the
approach to solutions. The network fundamentals in Section 2.1 remain very elementary and may not be necessary in this context. There are some limitations to this approach; for example, encrypted packets are not processed by most intrusion detection devices.
Questions:
a. Attackers can use several techniques to evade an IDS, such as fragmentation, avoiding defaults, coordinated low-bandwidth attacks, address spoofing/proxying, and pattern-change evasion. Can your IDS with the integrated ML module be immune to these evasion techniques?
Overall assessment (in words: excellent, good, average):
Score: /10
Signature (with full name): Nguyen Le Duy Lai

VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING

BACHELOR OF ENGINEERING THESIS
MACHINE LEARNING APPROACHES TO CYBER SECURITY

COMPUTER SCIENCE COMMITTEE
Advisor: Prof. Nguyen Duc Thai
Examiner: Prof. Nguyen Le Duy Lai
Students: Huynh Kien Van - 1552423
          Nguyen Duc Kien - 1552181

Ho Chi Minh, 2021

Commitment
We guarantee that the work in this dissertation was completed in accordance with the University’s regulations and that it has not been submitted to any other academic institution. The work is our own, unless otherwise stated in the text by a specific reference.

Acknowledgement
First and foremost, we would like to express our special gratitude to our supervisor, Professor Nguyen Duc Thai, for his never-ending grace. His guidance and valuable knowledge helped us throughout the writing of this thesis. Professor Nguyen Duc Thai also supported us with both expertise and encouragement while we worked on the thesis. We also extend our gratitude to our families and friends, who have always been beside us in hard moments and encouraged us during this thesis and our university life.

Abstract
In this thesis, we propose a machine learning-based approach to detecting malicious activity in live network traffic. To increase accuracy and reduce false-negative cases, we apply deep learning. We build RNN models, LSTM and GRU, to classify network traffic as malicious or
normal. Technically, we build an RNN model that runs in parallel with the IDS, combine the two results, and decide which action to take by following a decision table. The dataset used in this thesis mainly comes from MTA-KDD’19, which was created by the project of the same name. To enrich our data, we also use the ISCX2012 [1] and USTC-TFC2016 [2] datasets, preprocessed following the strategy of the MTA-KDD’19 work. Overall, the LSTM model performed better than the GRU model. For accuracy, the LSTM model scores even higher than the MTA-KDD’19 work, which used a traditional neural network: 99.8% compared to 99.74%. For precision, the LSTM reaches 98.3% and the GRU reaches 99.5%. Our goal is to eliminate false negatives, and the recall scores of the two models are 99.75% (GRU) and 99.8% (LSTM), respectively.

Contents
List of Figures
List of Tables
List of Abbreviations
1 INTRODUCTION
  1.1 Overview
  1.2 Objective and scope
  1.3 Thesis structure
2 BACKGROUND KNOWLEDGE
  2.1 Fundamentals of networks
    2.1.1 Networking concept
    2.1.2 Reference models
  2.2 Intrusion Detection System
  2.3 Word Embedding
  2.4 Deep Neural Network
    2.4.1 Recurrent Neural Network
    2.4.2 Long Short Term Memory
3 LITERATURE REVIEW
  3.1 Deep learning-based approaches to improving IDS signatures
  3.2 Deep learning-based approaches for classifying network traffic
4 PROPOSED APPROACHES
  4.1 Problem statements
  4.2 Proposed approach
  4.3 Design
5 DATASET
6 IMPLEMENTATION
  6.1 Data pre-processing
    6.1.1 Explaining features of MTA-KDD’19 dataset
    6.1.2 Data processing
  6.2 Prediction module
7 EXPERIMENTS
  7.1 Evaluation methods
    7.1.1 Data preparation
    7.1.2 Confusion matrix
    7.1.3 Accuracy
    7.1.4 Precision
  7.2 Model evaluation
8 CONCLUSION
Appendices
References

4.1 Problem statements
As mentioned in Section 1.1, IDSs are mainly used for cyber defense, but they tend to have a high false-negative rate. However, by using
the signature-based approach, an IDS ensures quick and effective detection of known anomalies with a low risk of raising false alarms. Time efficiency is also required, so machine learning models must respond within milliseconds. Because the IDS cannot wait for the machine learning detection to finish, our module has to run in parallel with the IDS. As a module running in parallel with the IDS, our model takes raw network traffic at the packet level as input and outputs whether the network traffic is benign.

4.2 Proposed approach
From Sections 3.1 and 3.2, we summarise two main groups of approaches as follows:
• Improvement of IDS rules: These works ([3] and [4]) suggest extracting signatures and training a machine learning model on them. Roland et al. [3] work with different clustering models to find the best one. This approach cannot handle an extensive number of features because of its time consumption. Christopher et al. [4] classify rules with a decision tree to minimize the number of comparisons. This approach could be promising, but the major limitation of IDSs, the inability to identify novel attacks such as zero-day exploits, is not eliminated.
• Classifying network traffic: These works ([5] - [8]) collect network traffic datasets that merge malware and legitimate traffic, and train a deep neural network to separate malicious from normal traffic. All of these works try to replace the IDS. In particular, Minghui Gao et al. [8] combine a neural network model with association analysis to reach a conclusion. These approaches achieve outstanding results in detecting anomalous network traffic. However, the detected abnormalities cannot be extracted in a human-readable form to produce signatures, and these approaches are not time-efficient. They are therefore not effective.

From the above, we come to a proposal: a machine learning-assisted approach. Our proposal can be explained as follows: "If the IDS fails to raise an alert for an incoming attack, our model can help it detect the threat. Conversely, if our model fails to detect an attack and the IDS
succeeds in preventing the attack, these network flows can be used as new data to update our model."

Following this proposal, we build an RNN model that detects the pattern of incoming traffic data and decides whether it is benign. Its output is then combined with the classification of the IDS to decide whether to block the request, and the request is logged to the database for updating the model in the future.

Figure 4.1: Traffic validator architecture

4.3 Design
Figure 4.1 presents the design of the module. For the decision model, when the two results agree, the result is straightforward. When the two results differ, the outcome follows Table 4.1, which lists all four possible cases:

IDS       Prediction module   Result
Valid     Benign              Valid
Invalid   Benign              Invalid
Valid     Malicious           Invalid
Invalid   Malicious           Invalid

Table 4.1: Decision table for IDS and Prediction model

The Prediction model is retrained periodically, collecting more data from the Internet or from the correct outputs of this system in order to update the model. This module runs in parallel with the IDS. Packets captured by the IDS or by a tool such as Wireshark (see footnote 1) are classified by the module. There are only two labels for this model: one for malicious traffic and one for benign traffic.
The Prediction model is the main task of our project. It includes two parts: packet pre-processing and the RNN. First, we need to pre-process the data for training. Inspired by [10], we compute the features listed in the DATASET chapter.

Footnote 1: pcap is an application programming interface (API) for capturing network traffic.

5 DATASET
In recent years, researchers working at the intersection of deep learning and security have been advised to train and test on well-known public traffic datasets such as KDD CUP1999 and NSL-KDD. However, these datasets do not provide information at the raw traffic level, so packet information is missing from the data. Among the datasets that match our requirements, USTC-TFC2016 [2] is one of the most prominent. This dataset is summarized in Table 5.1 below:

Benign
Name         Size
Bittorrent   7.33MB
FTP          60.2MB
Facetime     2.4MB
Gmail        9.6MB
MySql        22.4MB
Outlook      11.2MB

Malware
Name         Size
Cridex       2.55MB
Zeus         13.4MB
Shifu        57.9MB
Neris        90.1MB
Nsisey       28.1MB
Geodo        28.8MB

Table 5.1: Summary of benign and malware traffic in USTC-TFC2016 dataset

Another prominent dataset is ISCX2012 [1], which we use to enrich our data. Like USTC-TFC2016, ISCX2012 contains both malicious and benign traffic; it consists of packets collected over seven days. Packets collected on the first and sixth days are normal traffic. On the second and third days, both normal packets and attack packets were collected. On the fourth, fifth, and seventh days, besides the normal traffic, HTTP DoS, DDoS and IRC Botnet, and Brute Force SSH packets were collected, respectively.
Finally, inspired by [10], we initialise our data with the MTA-KDD’19 dataset, which was published in 2019. This dataset has been processed into 33 features and labelled. The features and the formulas for calculating them are shown in the figures below:

Figure 5.1: Features of the MTA-KDD’19 dataset and how they are calculated

Figure 5.2: Functions and sets used in the feature definition formulas

6 IMPLEMENTATION

6.1 Data pre-processing
6.1.1 Explaining features of MTA-KDD’19 dataset
The relevance of some of the features shown in the DATASET chapter is briefly explained in the following:
• Features Ack, Syn, Fin, Psh, Urg, RstFlagDist: chosen because it has been empirically shown that the presence of many packets with certain TCP flags set may indicate malware traffic [11], [12].
• Features TCP, UDP, DNSOverIP: chosen because many attacks exploit specific characteristics of these protocols. For example, trojans and other remote-access malware issue a large number of DNS requests to locate their command and control server, so a high DNSOverIP ratio may indicate malicious traffic [13].
• Features MaxLen, MinLen,
AvgLen, StdDevLen, MaxIAT, MinIAT, AvgIAT, AvgDeltaTime, MaxLenRx, MinLenRx, AvgLenRx, StdDevLenRx, MaxIATRx, MinIATRx, AvgIATRx, StartFlow, EndFlow, DeltaTime, FlowLen, FlowLenRx: chosen because packet number, size and inter-arrival times are useful for detecting flooding-style attacks [14].
• The rest of the features can be found in the paper by Letteri et al. [10].

6.1.2 Data processing
Before computing the features according to the formulas in the DATASET chapter, we need to process the raw packet data from datasets such as ISCX2012 or USTC-TFC2016, as shown in Figure 6.1.

Figure 6.1: Data processing

For the PCAP segmentation task, we use Python to read the pcap file and parse its packets. The input of this process is a pcap file captured by Wireshark or by the IDS. In the training steps, the pcap input comes from the ISCX2012 and USTC-TFC2016 datasets.
Then, we continue with feature extraction and selection following the MTA-KDD’19 work. Because the traffic in ISCX2012 and USTC-TFC2016 is categorized as benign or malware, we have to label it before merging the datasets together. Finally, we feed the data into the RNN models for training and evaluation.

6.2 Prediction module
Choosing algorithms: This stage is the final step. We first considered two algorithms, LSTM and GRU. We experimented with both by training them on the dataset processed in Section 6.1 and evaluating them by accuracy score.
Splitting Dataset and Feature Scaling: We split our data into two sets: a training set and a testing set. Next, LSTM/GRU layers expect the data in a specific format, usually a 3D array. We start by creating sequences of 60 timesteps and converting them into an array using NumPy. Next, we reshape the data into a 3D array.
Building the LSTM:
• In order to build the LSTM, we need to import a couple of modules from Keras: Sequential for initializing the neural network, Dense for adding a densely connected neural network layer, LSTM for adding the Long Short-Term Memory layer, GRU for adding the GRU layer, and Dropout for adding dropout layers that prevent
overfitting.
• We add the LSTM/GRU layer with the following arguments: 50 units; return_sequences = True, so that the layer returns the full output sequence rather than only the last output; and input_shape, the shape of our training set. We specify 0.2 for the Dropout layers, meaning that 20% of the units are randomly dropped during training.
• Next, we compile our model using the popular adam optimizer and set the loss to mean_squared_error.
• We fit the model for 100 epochs with a batch size of 32.
Evaluating model: Before settling on the final model, we evaluate it on the testing set.

Figure 6.2: The full network model

7 EXPERIMENTS

7.1 Evaluation methods
7.1.1 Data preparation
Splitting is a method for assessing the performance of a machine learning algorithm. We use the standard method, the train-test split. The procedure takes a dataset and separates it into two subsets. The first subset is used to fit the model and is called the training dataset. The second subset is not used for training; instead, its input elements are provided to the model, and the resulting predictions are compared with the expected values. This second subset is referred to as the testing dataset.
• Training dataset: used to fit the machine learning model.
• Testing dataset: used to evaluate the fitted machine learning model.

7.1.2 Confusion matrix
All possible results can be divided into the following four cases:
• True Positive (TP): actual attacks are classified as attacks.
• True Negative (TN): actual normal records are classified as normal.
• False Positive (FP): actual normal records are classified as attacks. This condition is also regarded as a false alarm.
• False Negative (FN): actual attacks are classified as normal records.
From these cases we obtain a confusion matrix. The confusion matrix is a crucial concept in assessing classification performance. The instances in a predicted
class are represented by each row of the confusion matrix, whereas the instances in an actual class are represented by each column. The confusion matrix is usually normalized to obtain the rates (Table 7.1).

                         Actual class TRUE      Actual class FALSE
Predicted class TRUE     TPR = TP/(TP + FN)     FPR = FP/(FP + TN)
Predicted class FALSE    FNR = FN/(TP + FN)     TNR = TN/(FP + TN)

Table 7.1: Normalized confusion matrix

7.1.3 Accuracy
The performance of the proposed model is evaluated using different evaluation indicators. Accuracy is the simplest one: it measures the proportion of correctly classified traffic samples among all traffic samples:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7.1} \]

7.1.4 Precision
Our dataset’s class distribution is imbalanced (there are far more malware samples than benign ones). Therefore, accuracy may not be a good indicator of model quality, and we need to look at class-specific performance metrics. Precision is one such metric, defined in Equation 7.2:

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{7.2} \]

7.2 Model evaluation
As mentioned in Chapter 6, we experimented with two algorithms before deciding between LSTM and GRU. The experimental results are shown in Table 7.2:

Model   Accuracy (%)   Precision
GRU     99.6           0.995
LSTM    99.8           0.983

Table 7.2: Model evaluation with only MTA-KDD’19 dataset

The MTA-KDD’19 authors used an ANN to evaluate their work; we compare our models with theirs in Table 7.3:

Model            Accuracy (%)   Precision
GRU              99.6           0.995
LSTM             99.8           0.983
MTA-KDD’19       99.74          0.999
Minghui et al.   97.22          96.25 (%)

Table 7.3: Comparing models

8 CONCLUSION
In this thesis, we successfully built a prediction model for network traffic that covers the false-negative cases of the IDS. However, we could not build a complete IDS for the whole system because of a lack of time and knowledge, so this is the best we could achieve. The disappointing aspect of this project is that it depends too much on the database, so it could not be developed toward the purpose we wanted. Because of that, we want to reduce the impact of the database on this project or find a
new method to handle the database. On the other hand, we bring a new approach to this field: training an RNN on modern, recent data instead of the traditional datasets.

Appendices

References
[1] Ali Shiravi et al. “Toward developing a systematic approach to generate benchmark datasets for intrusion detection”. In: Computers & Security 31.3 (2012), pp. 357–374.
[2] Wei Wang et al. “Malware traffic classification using convolutional neural network for representation learning”. In: 2017 International Conference on Information Networking (ICOIN). IEEE, 2017, pp. 712–717.
[3] Roland Verbruggen and Tom Heskes. “Creating firewall rules with machine learning techniques”. In: Nijmegen, Netherlands: Kerckhoffs Institute Nijmegen (2014), pp. 9–12.
[4] Christopher Kruegel and Thomas Toth. “Using decision trees to improve signature-based intrusion detection”. In: International Workshop on Recent Advances in Intrusion Detection. Springer, 2003, pp. 173–191.
[5] Xiaoyang Liu and Jiamiao Liu. “Malicious traffic detection combined deep neural network with hierarchical attention mechanism”. In: Scientific Reports 11.1 (2021), pp. 1–15.
[6] Benjamin J. Radford et al. “Network traffic anomaly detection using recurrent neural networks”. In: arXiv preprint arXiv:1803.10769 (2018).
[7] Ren-Hung Hwang et al. “An LSTM-based deep learning approach for classifying malicious traffic at the packet level”. In: Applied Sciences 9.16 (2019), p. 3414.
[8] Minghui Gao et al. “Malicious network traffic detection based on deep neural networks and association analysis”. In: Sensors 20.5 (2020), p. 1452.
[9] Dhruba Kumar Bhattacharyya and Jugal Kumar Kalita. Network anomaly detection: A machine learning perspective. Chapman and Hall/CRC, 2019.
[10] Ivan Letteri et al. “MTA-KDD’19: A Dataset for Malware Traffic Detection”. In: ITASEC 2020, pp. 153–165.
[11] Raihana Syahirah Abdullah et al. “Recognizing P2P botnets characteristic through TCP distinctive behaviour”. In: International Journal of Computer Science and Information Security 9.12 (2011),
p.
[12] G. Kirubavathi Venkatesh and R. Anitha Nadarajan. “HTTP botnet detection using adaptive learning rate multilayer feed-forward neural network”. In: IFIP International Workshop on Information Security Theory and Practice. Springer, 2012, pp. 38–48.
[13] Asaf Nadler, Avi Aminov, and Asaf Shabtai. “Detection of malicious and low throughput data exfiltration over the DNS protocol”. In: Computers & Security 80 (2019), pp. 36–53.
[14] Madhav Kale and D. M. Choudhari. “DDOS attack detection based on an ensemble of neural classifier”. In: International Journal of Computer Science and Network Security (IJCSNS) 14.7 (2014), p. 122.
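As a supplementary illustration of the decision table in Section 4.3 (Table 4.1), the following minimal Python sketch combines the IDS verdict with the output of the prediction module. The names IdsVerdict, Prediction, decide and is_retraining_candidate are hypothetical and are not taken from the thesis implementation. The retraining rule assumes, following the proposal in Section 4.2, that traffic stopped by the IDS but predicted as benign by the model is what gets logged for a later model update.

```python
from enum import Enum


class IdsVerdict(Enum):
    """Verdict of the rule-based IDS (hypothetical type for this sketch)."""
    VALID = "valid"
    INVALID = "invalid"


class Prediction(Enum):
    """Output label of the prediction module (hypothetical type)."""
    BENIGN = "benign"
    MALICIOUS = "malicious"


def decide(ids_verdict: IdsVerdict, prediction: Prediction) -> IdsVerdict:
    """Combine both results as in Table 4.1.

    Traffic is accepted only when the IDS reports it as valid AND the
    model predicts benign; every other combination is rejected.
    """
    if ids_verdict is IdsVerdict.VALID and prediction is Prediction.BENIGN:
        return IdsVerdict.VALID
    return IdsVerdict.INVALID


def is_retraining_candidate(ids_verdict: IdsVerdict, prediction: Prediction) -> bool:
    """Assumed rule: the IDS caught the traffic but the model missed it,
    so the flow is worth logging as new training data."""
    return ids_verdict is IdsVerdict.INVALID and prediction is Prediction.BENIGN


if __name__ == "__main__":
    # Print the full decision table.
    for ids_verdict in IdsVerdict:
        for prediction in Prediction:
            result = decide(ids_verdict, prediction)
            print(ids_verdict.value, prediction.value, "->", result.value)
```

Keeping the decision rule in one small pure function makes it easy to check all four rows of the table before wiring it to a live IDS.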

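The evaluation measures of Section 7.1 can be computed directly from the four confusion counts TP, TN, FP and FN. The sketch below is a minimal illustration and not the evaluation code of the thesis: the function names and the example counts are hypothetical, and all denominators are assumed to be non-zero.

```python
def confusion_rates(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Normalized confusion matrix entries, as laid out in Table 7.1."""
    return {
        "TPR": tp / (tp + fn),  # attacks correctly flagged
        "FNR": fn / (tp + fn),  # attacks missed
        "FPR": fp / (fp + tn),  # normal traffic flagged (false alarms)
        "TNR": tn / (fp + tn),  # normal traffic correctly passed
    }


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Equation 7.1: correctly classified samples over all samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp: int, fp: int) -> float:
    """Equation 7.2: true attacks among everything flagged as an attack."""
    return tp / (tp + fp)


if __name__ == "__main__":
    # Made-up counts, for illustration only (not thesis results).
    tp, tn, fp, fn = 90, 5, 3, 2
    print(confusion_rates(tp, tn, fp, fn))
    print("accuracy:", accuracy(tp, tn, fp, fn))
    print("precision:", precision(tp, fp))
```

The TPR entry is the same quantity as recall, TP/(TP + FN), which is the score the abstract reports when discussing false negatives.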
Title	Machine Learning Approaches to Cyber Security
Authors	Huynh Kien Van, Nguyen Duc Kien
Advisor	Dr. Nguyễn Đức Thái
University	Vietnam National University - Ho Chi Minh City
Major	Computer Science
Document type	graduation thesis
Year of publication	2021
City	Ho Chi Minh City

Format
Pages	45
File size	1.09 MB