A deep learning model that detects the domain generated by the algorithm in the botnet


Domain Generation Algorithm (DGA) is the group of algorithms that generate domain names for attack activities in botnets. In this paper, we present a Bi-LSTM deep learning model based on Attention mechanism to detect DGA-generated domains.
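To make the idea of a DGA concrete, the following is a minimal toy sketch, assuming a date-seeded scheme in which a bot and its C&C server share a seed and derive the same names from the current date. The function name `toy_dga`, the seed string, the TLD list and the hash-based construction are all hypothetical choices for illustration; this is not the algorithm of any malware family discussed in the paper.

```python
import hashlib
from datetime import date

# Hypothetical TLD list, chosen only for illustration.
TLDS = [".com", ".org", ".net"]


def toy_dga(day: date, seed: str = "example-seed", count: int = 5, length: int = 12) -> list[str]:
    """Derive `count` pseudo-random domain names from a date and a shared seed."""
    domains = []
    for i in range(count):
        # The same (seed, date, index) triple always yields the same digest,
        # so two parties holding the seed compute identical lists.
        material = f"{seed}|{day.isoformat()}|{i}".encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map each hex digit (0-15) onto the letters a-p.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:length])
        domains.append(label + TLDS[i % len(TLDS)])
    return domains


if __name__ == "__main__":
    for name in toy_dga(date(2022, 7, 26)):
        print(name)
```

Because the output depends only on the seed and the date, a detector cannot rely on a fixed blocklist, which is the motivation for classifying the character patterns of the names themselves.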

HANOI METROPOLITAN UNIVERSITY
SCIENTIFIC JOURNAL OF HANOI METROPOLITAN UNIVERSITY − VOL.62/2022

A DEEP LEARNING MODEL THAT DETECTS THE DOMAIN GENERATED BY THE ALGORITHM IN THE BOTNET

Nguyen Trung Hieu(*), Cao Chinh Nghia
Faculty of Mathematics - Informatics and application of science and technology in crime prevention, The People's Police Academy

Abstract: Domain Generation Algorithm (DGA) is a group of algorithms that generate domain names for attack activities in botnets. In this paper, we present a Bi-LSTM deep learning model based on the Attention mechanism to detect DGA-generated domains. In our experiments, the model gave good results in detecting DGA-generated domains belonging to the Post and Monerodownloader families. Overall, the F1 measure of the model in the multi-class classification problem reaches 90%. The macro average (macro avg) efficiency is 86% and the weighted average (weighted avg) efficiency is 91%.

Keywords: Bi-LSTM deep learning network; deep learning; malicious URL detection; Attention mechanism in deep learning

Received June 2022. Revised and accepted for publication 26 July 2022.
(*) Email: hieunt.dcn@gmail.com

INTRODUCTION

Botnet Attacks

The development of the Internet has brought many benefits to users, but it has also created an environment in which cybercriminals operate. Botnet attacks are among the most common attacks. Each member of a botnet is called a bot. A bot is malicious software created by attackers to control infected computers remotely through a command and control server (C&C server). A bot has a high degree of autonomy and can use communication channels to receive commands and update its malicious code from the control system. Botnets are commonly used to distribute malware, send spam, steal sensitive information, carry out phishing, or launch large-scale cyberattacks such as distributed denial of service (DDoS) attacks [1].

The wide distribution of bots and the connection between bots and control servers often
require the Internet. The bots need to know the IP address of the control server to connect to it and receive commands. To avoid detection, command and control servers do not register static domain names; instead, they continuously change addresses and domains at different intervals. Attackers use a Domain Generation Algorithm (DGA) to generate different domain names for attacks [2], with the aim of masking these command and control servers. Identifying the malicious domain behind an attack can help determine the purpose of the attack and the tools and malware used, and makes it possible to take preventive measures that greatly reduce the damage caused by the attack.

Domain Generation Algorithm

A Domain Generation Algorithm (DGA) can use operators in combination with ever-changing variables to generate random domain names. The variables can be day, month and year values, hours, minutes, seconds, or other keywords. These pseudo-random strings are concatenated with a top-level domain (.com, .vn, .net, ...) to generate the domain names. The algorithm of the Chinad malware, written in Python [3], shows that the input seed includes the letters a-z and the digits 0-9, combined with the day, month and year values. The results are combined with the TLDs ('.com', '.org', '.net', '.biz', '.info', '.ru', '.cn') to form the complete domain name.

Table: Some DGA samples
DGA family      Sample domains
Conficker       gfedo.info
                ydqtkptuwsa.org
                bnnkqwzmy.biz
Cryptolocker    nvjwoofansjbh.ru
                qgrkvevybtvckik.org
                eqmbcmgemghxbcj.co.uk
Bigviktor       support.showremote-conclusion.fans
                turntruebreakfast.futbol
                speakoriginalworld.one
Bamital         cd8f66549913a78c5a8004c82bcf6b01.info
                a024603b0defd57ebfef34befde16370.org
                5e6efdd674c134ddb2a7a2e3c603cc14.org
Chinad          qowhi81jvoid4j0m.biz
                29cqdf6obnq462yv.com
                5qip6brukxyf9lhk.ru

A DGA can generate a large number of domains in a short time, and bots can select a small portion of them to connect to the C&C server. The table of DGA samples shows some examples of domain names generated by DGAs [4]. The
Chinad malware can generate 1000 domain names per day using the letters a-z and the digits 0-9. Bigviktor combines several different words from predefined lists (dictionaries) and can generate 1000 domains per month. The following figure depicts the connection process between the C&C server and the DGA domains [5].

Figure: DGA-based botnet communication mechanism

The attacker uses the same DGA and initial seeds for the C&C server and the bots, so that both generate the same set of domains. The attacker only needs to select one domain name from the generated list and register it for the C&C server one hour before performing the attack. The bots on the victim's machine then send domain name resolution requests for the domains in the generated list, one after another, to the Domain Name System (DNS). The DNS returns the IP address of the corresponding C&C server, and the bots then begin communicating with the server to receive commands. If the C&C server is not found at the previous domain, the bots query the next set of domains generated by the DGA until an active domain name is found [6].

MAIN CONTRIBUTION OF THE ARTICLE

The main contributions of the paper include:
- Introducing a deep learning approach that uses a Bidirectional Long Short-Term Memory (Bi-LSTM) model based on the Attention mechanism to detect domains created by DGAs. Our model has previously worked well on the problem of detecting malicious URLs [7].
- Presenting experimental results on open data sets that show a significant improvement over previous techniques.

The remainder of the paper is organized as follows. Related studies are presented first, followed by our deep learning network architecture and solution. We then present the experimental process, including the selection of the dataset and the results obtained. Finally, the conclusion comments on the results achieved and on future directions.

2.1 Related studies

In recent years, much
research on botnet detection has been published. Nguyen Van Can and colleagues [8] proposed a model to classify benign domains and DGA domains based on Neutrosophic Sets. Testing on data sets from Alexa, Bambenek Consulting [9] and 360lab [4] showed that the model achieves an accuracy of 81.09%.

R. Vinayakumar et al. [10] proposed a DGA detection method based on analyzing the statistical features of DNS queries. Feature vectors are extracted from domain names by a text representation method, and optimal features are computed from the numerical vectors using the DBD deep learning architecture shown in the table below. The results show that the model reaches a high accuracy of 97.8%.

Table: DBD deep architecture [10]
Layers          Output Shape
Embedding       (None, 91, 128)
Conv1D          (None, 8, 764)
MaxPooling1D    (None, 2, 164)
LSTM            (None, 70)
Dense           (None, 1)
Activation      (None, 1)

Yanchen Qiao et al. [2] proposed a method to detect DGA domain names based on an LSTM with the Attention mechanism. Their model was evaluated on the data set from Bambenek Consulting [9], with an accuracy of 95.14%, overall precision of 95.05%, recall of 95.14% and F1 score of 95.48%.

Duc Tran [11] built the LSTM.MI model, which combines a binary classifier and a multiclass classifier to handle an imbalanced dataset. In this model, a cost-sensitive adaptation mechanism is applied to the original LSTM. Cost items are included in the backpropagation learning process to account for the importance of distinguishing between classes. The authors demonstrate that LSTM.MI provides at least a 7% improvement in accuracy and macro-averaged recall over the original LSTM and other modern cost-sensitive methods. It also maintains high accuracy on non-DGA labels (F1 score of 0.9849).

2.2 Proposed model

Our proposed model includes an input layer, an embedding layer, two Bi-LSTM layers, one Attention layer and an output layer. The architecture of the model is shown in the Bi-LSTM network architecture figure [7].

The detection module takes as input a data set of $T$ domain addresses with a structure of the
form $\{(u_1, y_1), \dots, (u_T, y_T)\}$, where each sample $x_t$ is a pair $(u_t, y_t)$, $u_t$ (with $t = 1, \dots, T$) is a domain in the training list and $y_t$ is its associated label. Before training, each raw domain is processed in two steps to form the input vector:
- Step 1: Cut off the TLD part of the domain name, then tokenize the raw data, that is, convert the remaining string of characters into integer-encoded data using Keras's Tokenizer library;
- Step 2: Normalize the encoded data from Step 1 to the same length.

Figure: Bi-LSTM network architecture

In this way we convert the original domain string into the input vector $V = \{v_1, v_2, v_3, \dots, v_T\}$. Each vector has a fixed length; any vector that is too short is padded until it reaches the required length.

Next, we use a bidirectional LSTM network (Bi-LSTM) to model URL sequences based on a word vector representation. The Bi-LSTM architecture has two hidden layers from two separate LSTMs, which capture long-range dependencies in two different directions. Since the output vector of the embedding layer is $V = \{v_1, v_2, v_3, \dots, v_T\}$, the forward LSTM reads the input from $v_1$ to $v_T$ and the backward LSTM reads the input from $v_T$ to $v_1$. Meanwhile, a pair of hidden states $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ is initialized. We obtain the output of the Bi-LSTM layer by combining the two hidden states according to the formula:

$$h_i = [\overrightarrow{h_i}, \overleftarrow{h_i}]^T \qquad (1)$$

Because the model uses two Bi-LSTM layers and the experimental data set is quite large, a Batch Normalization layer is used to normalize the data within each batch. This stabilizes the learning process and greatly reduces the number of epochs needed to train the network, thereby speeding up training.

As described in this paper, the hidden states at all positions are considered with different Attention weights. We apply the Attention mechanism to capture the relationship between $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$. This information is aggregated with respect to the
feature from the output of the second Bi-LSTM network. This helps the model focus on the important features instead of confounding or less valuable information.

Initially, the weights $u_t$ are calculated based on the correlation between the input and output according to the following formula:

$$u_t = \tanh(W h_t + b) \qquad (2)$$

These weights are then normalized into the Attention weight vector $\alpha_t$ using the softmax function:

$$\alpha_t = \frac{\exp(u_t^T u)}{\sum_t \exp(u_t^T u)} \qquad (3)$$

Then the vector $c_t$ is calculated from the Attention weights and the hidden states $h_1, \dots, h_T$ as follows:

$$c_t = \sum_t \alpha_t h_t \qquad (4)$$

The larger the value $c_t$, the more important the role the feature $x_t$ plays in detecting the DGA domain.

Finally, to predict a domain, the result is passed through a Dense layer with one neuron and a sigmoid activation function, which returns a value between 0 and 1. The resulting $y$ determines whether a domain is benign or DGA-generated.

Thus, the input domain name is normalized into vector form, and this vector is passed through the Embedding, Bi-LSTM, Batch Normalization, Bi-LSTM and Attention layers before producing the output. In addition, the model uses the Adam optimization algorithm with the default parameters in Keras. To prevent the model from overfitting, we additionally apply the Dropout technique to the Bi-LSTM layers. With Dropout, each time the weights are updated during training, a random subset of the nodes in a layer is removed, so that the model cannot depend on any single node of the previous layer and its weights instead tend to be spread more evenly.

2.3 Experiment

In this paper, we conduct two experiments:
1- Check the accuracy of the model in 2-class classification: domains generated by a DGA versus normal domains.
2- Check the accuracy of the model in multi-class classification: detect different
DGA algorithms in a given data set.

                       Count   Mean        Std        Min   25%   50%   75%   Max
DGA                    30000   14.245103   4.337851   …     12    13    16    25
Regular domain name    30000   9.623797    3.300294   …     …     …     11    25

2.4 Evaluation

Dataset

In this paper, we use a dataset consisting of DGA domains collected from Bambenek Consulting [9] and normal domains obtained from Alexa. The two experiments use two different data sets.

Table: Summary of the collected dataset
Domain Type         Sample    Domain Type    Sample
tinba               64313     Chinad         1484
ramnit              62227     P2P            985
necurs              30789     Volatile       966
murofet             24562     proslikefan    750
Post                21881     Sphinx         733
qakbot              19258     Pitou          749
shiotob             14451     Dircrypt       699
monerodownloader    14422     Fobber         572
ranbyus             13417     padcrypt       551
kraken              5529      Zloader        555
Cryptolocker        5780      Geodo          557
locky               3869      MyDoom         333
vawtrak             3022      Beebone        291
qadars              2309      tempedreve     242
ramdo               1932      Vidro          188
Domain Type: unknownjs beautiful baby pandabanker cryptowall an Unknowndroppr sisron kingminer gozi dromedan madmax g01 mirai
Sample: 172 161 94 91 67 59 59 29 24 2 1

Dataset for test 1: Consists of 30000 DGA domains and 30000 normal domains, each group carrying its own label. The dataset is randomly shuffled and then divided into a training dataset and a test dataset. It contains 46 different types of DGA domain names, with the counts given in the summary table above. The distribution parameters of the character length of each type of domain name are given in the length statistics table above: the smallest sample length is 6, the maximum length is 25, the average length of DGA domain names is 14.2, and the average length of normal domain names is 9.6.

Table: Label assignment for domain types
Domain type               Label
Post                      0
Kraken                    1
Legit                     2
Monerodownloader          3
Murofet                   4
Necurs                    5
Qakbot                    6
Ramnit                    7
Ranbyus                   8
Shiotob/urlzone/bebloh    9
Tinba                     10

Dataset for test 2: To test multi-class classification, the DGA domains used come from the families Post, Kraken, Monerodownloader, Murofet, Necurs,
Shiotob/urlzone/bebloh, Qakbot, Ramnit, Ranbyus and Tinba, labeled according to the label assignment table. The test data includes 25,000 normal domain names and 25,000 domain names belonging to DGA families.

PERFORMANCE METRIC

The performance of the algorithms is evaluated using the confusion matrix, where:
• True negatives (TN) are benign sites that are predicted to be benign
• True positives (TP) are malicious sites that are predicted to be malicious
• False negatives (FN) are malicious sites that are predicted to be benign
• False positives (FP) are benign sites that are predicted to be malicious

From these we obtain the following measures. Accuracy:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5)$$

The article also uses the F-measure, precision and recall, given by the following formulas:

$$Precision = \frac{TP}{TP + FP} \qquad (6)$$

$$Recall = \frac{TP}{TP + FN} \qquad (7)$$

$$F_1 = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision} \qquad (8)$$

A high Precision value means that most of the points predicted as positive are truly positive. A high Recall means a high TP rate, that is, few truly positive points are missed. The higher the F1, the better the classifier.

In addition, in experiment no. 1 we also use the binary cross entropy (BCE) loss function to calculate the difference between two quantities: $\hat{y}$, the predicted label of a URL, and $y$, the correct label of that URL. The loss function acts as a way to force the model to pay a penalty for each wrong prediction, where the size of the penalty is proportional to the severity of the error. The smaller the loss value, the better the model's predictions; conversely, the more the predictions differ from reality, the larger the loss value.

$$BCE = -(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})) \qquad (9)$$

Table: Parameters of the model
Layers                    Output Shape
embedding                 (None, 38, 128)
bidirectional             (None, 38, 128)
batch_normalization       (None, 38, 128)
bidirectional_1           (None, 38, 128)
attention_with_context    (None, 38, 128)
addition                  (None, 128)
dense                     (None, 1)
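The measures in equations (5) to (9) can be checked numerically with a short sketch. The function names and the confusion-matrix counts below are hypothetical and chosen only for illustration; they are not results from the paper. The clipping constant `eps` is likewise an assumption added to avoid taking the logarithm of zero.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def binary_cross_entropy(y: int, y_hat: float, eps: float = 1e-12) -> float:
    """BCE for a single sample with true label y (0 or 1) and predicted probability y_hat."""
    # Clip the prediction so that log() never receives exactly 0.
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(classification_metrics(tp=90, tn=80, fp=20, fn=10))
    # A confident correct prediction is penalized less than a confident wrong one.
    print(binary_cross_entropy(1, 0.9), binary_cross_entropy(1, 0.1))
```

With these example counts, precision is lower than recall because there are more false positives than false negatives, which is the same pattern the paper later reports for some DGA families.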
3.1 Experimental results

3.1.1 Experiment number 1

The model is built on the basic configuration of the Kaggle platform with a Keras kernel and TensorFlow backend. It uses ModelCheckpoint to save the training progress and EarlyStopping to stop training once the monitored value stops improving. The parameters of the model in the first experiment are shown in the model parameter table.

Table: Experimental results no. 1
Loss          ACC       Precision   Recall    F1
3.2705e-04    0.9999    0.9999      0.9998    0.9999

For the binary classification problem between DGA domains and normal domains, the model gives the results in the table above, with an accuracy above 99%. We suspect that part of this result comes from the difference in the length distribution of the two kinds of domains. We therefore run another test to further check the stability of the model.

3.2 Experiment number 2

Table: Parameters of the model in experiment no. 2
Layers                    Output Shape
embedding                 (None, 33, 128)
bidirectional             (None, 33, 128)
batch_normalization       (None, 33, 128)
bidirectional_1           (None, 33, 128)
attention_with_context    (None, 33, 128)
addition                  (None, 128)
dense                     (None, 11)

Table: Results of experiment 2
Class           Precision   Recall   F1      Support
0               1.00        0.99     1.00    3986
1               0.78        0.76     0.77    1114
2               0.98        1.00     0.99    50086
3               1.00        1.00     1.00    2665
4               0.85        0.59     0.69    4383
5               0.81        0.85     0.83    5746
6               0.52        0.82     0.64    3572
7               0.76        0.95     0.84    11535
8               0.89        0.88     0.89    2432
9               0.97        0.90     0.94    2602
10              0.91        0.59     0.71    11879
accuracy                             0.90    100000
macro avg       0.86        0.85     0.85    100000
weighted avg    0.91        0.90     0.90    100000

In this experiment, we test the multi-class detection ability of the model with three measures: precision, recall and F1. The parameters used in the model are presented in the parameter table above. Because this is a multi-class classification, the output layer has size 11, corresponding to the 11 labels to be classified. The experimental results are presented
in the results table. For the normal domain (label 2), the Precision is 98% and F1 is 99%. Our model gives the best results when classifying DGA domains belonging to the Post family (label 0) and Monerodownloader (label 3). In contrast, the model gives the worst results on the Qakbot family (label 6), with a Precision of only 52%, which reflects a high rate of other domains being classified into this family. For the Murofet family (label 4) and Tinba (label 10), the model misclassifies many DGA domain names, for example as benign domains, with a Recall of 59%. In general, the F1 measure of the model in the multi-class classification problem reaches 90%. The macro average (macro avg) efficiency is 86% and the weighted average (weighted avg) efficiency is 91%.

COMPARISON WITH OTHER DGA DETECTION METHODS

The evaluation was performed on a dataset from the same source [9] as the studies being compared. The comparison with the study of Chanwoong Hwang and colleagues is shown in the Comparison table, which indicates that our model has a higher detection capacity.

Table: Comparison
             Chanwoong Hwang    Proposed model
Accuracy     88.77%             90%
Precision    89.01%             91%
Recall       88.77%             90%
F1-score     88.695%            90%

Table 10 compares the ability to detect DGA domains labeled 4, 5, 6, 7, 8, 9 and 10 with the studies of Yanchen Qiao and Duc Tran. Yanchen Qiao [2] uses an LSTM with the Attention mechanism. Duc Tran's model [11] is a cost-sensitive version of the original LSTM, in which the cost items are class dependent and take into account the importance of distinguishing between classes. Our model exhibits good detectability across four DGA families (Necurs, Qakbot, Ramnit, Ranbyus) and weaker detectability on the Shiotob and Tinba families.

Table 10: Results compared with Yanchen Qiao and Duc Tran
                     Yanchen Qiao                     Duc Tran                         Our model
Label   Family       Precision  Recall  F1-score     Precision  Recall  F1-score     Precision  Recall  F1-score
4       Murofet      0.7641     0.7207  0.7418       0.5330     0.7423  0.6205       0.85       0.59    0.69
5       Necurs       0.6651     0.1722  0.2735       0.5248     0.1104  0.1824       0.81       0.85    0.83
6       Qakbot       0.7862     0.5013  0.6122       0.7716     0.4350  0.5564       0.52       0.82    0.64
7       Ramnit       0.4688     0.7525  0.5777       0.6068     0.8062  0.6925       0.76       0.95    0.84
8       Ranbyus      0.4672     0.8455  0.6018       0.3617     0.7073  0.4787       0.89       0.88    0.89
9       Shiotob      0.9751     0.9251  0.9494       0.9741     0.9004  0.9358       0.97       0.90    0.94
10      Tinba        0.9259     0.9920  0.9578       0.8951     0.9961  0.9429       0.91       0.59    0.71

CONCLUSION

In this paper, we have presented an approach that uses a Bi-LSTM deep learning network based on the Attention mechanism [7] to solve the problem of detecting algorithmically generated domains in botnets. The model shows a strong ability to detect DGA domains: with its Bi-LSTM layers combined with Attention, it detects DGA domains with 90% accuracy. In the future, we will continue to improve the model and evaluate it on larger, more complex datasets to verify its accuracy. The results of this research direction can be integrated into DNS domain name filtering systems to automatically discover the domains of botnets.

REFERENCES

A. Soleymani and F. Arabgol (2021), "A Novel Approach for Detecting DGA-Based Botnets in DNS Queries Using Machine Learning Techniques," Journal of Computer Networks and Communications.
Y. Qiao, B. Zhang, W. Zhang, A. K. Sangaiah and H. Wu (2019), "DGA Domain Name Classification Method Based on Long Short-Term Memory with Attention Mechanism," Applied Sciences, vol. 9, no. 20.
A. Qi, J. Jiang, Z. Shi, R. Mao and Q. Wang (2018), "BotCensor: Detecting DGA-Based Botnet Using Two-Stage Anomaly Detection," in 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications / 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE).
A. K. Sood and S. Zeadally (2016), "A Taxonomy of
Domain-Generation Algorithms," IEEE Security & Privacy, vol. 14, no. 4, pp. 46-53, 05 August 2016.
N. T. Hiếu and T. N. Ngọc (2020), "Malicious URL detection using a Bi-LSTM deep learning network based on the Attention mechanism," in 23rd National Conference: Selected Issues of Information and Communication Technology, Quảng Ninh (in Vietnamese).
Nguyen Van Can et al. (2020), "A new method to classify malicious domain name using Neutrosophic sets in DGA botnet detection," Journal of Intelligent & Fuzzy Systems, vol. 38, pp. 4223-4236.
R. Vinayakumar et al. (2019), "DBD: Deep Learning DGA-Based Botnet Detection," in Deep Learning Applications for Cyber Security, Advanced Sciences and Technologies for Security Applications, M. Alazab and M. Tang, Eds., Switzerland, Springer, Cham, pp. 127-149.
Duc Tran et al. (2018), "A LSTM based framework for handling multiclass imbalance in DGA botnet detection," Neurocomputing, pp. 2401-2413.

A DEEP LEARNING MODEL THAT DETECTS THE DOMAIN GENERATED BY THE ALGORITHM IN THE BOTNET

Abstract: Domain Generation Algorithm (DGA) is a group of algorithms that generate domain names for attack activities in botnets. In this paper, we present a Bi-LSTM deep learning model based on the Attention mechanism to detect DGA domains. In our experiments, the algorithm gave good results in detecting DGA domains belonging to the Post and Monerodownloader families. Overall, the F1 measure of the model in the multi-class classification problem reaches 90%. The macro average (macro avg) efficiency is 86% and the weighted average (weighted avg) efficiency is 91%.

Keywords: Bi-LSTM deep learning network, DGA domains, malicious URL detection, Attention mechanism in deep learning
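The two preprocessing steps of Section 2.2 (cut off the TLD and tokenize the characters, then bring every sequence to the same length) can be sketched in plain Python. This is a stand-in for illustration, not the authors' code: the paper uses Keras's Tokenizer, whereas the helper names `strip_tld`, `build_vocab` and `encode`, the left-padding with 0, and the handling of unseen characters are all assumptions made here.

```python
def strip_tld(domain: str) -> str:
    """Drop the last dot-separated label.

    Simplification: multi-part suffixes such as .co.uk would need a suffix list.
    """
    return domain.rsplit(".", 1)[0] if "." in domain else domain


def build_vocab(names: list[str]) -> dict[str, int]:
    """Map each character to a positive integer; 0 is reserved for padding."""
    chars = sorted({ch for name in names for ch in name})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(name: str, vocab: dict[str, int], max_len: int) -> list[int]:
    """Integer-encode a name, truncate it to max_len and left-pad it with 0."""
    # Assumption: characters missing from the vocabulary are mapped to 0.
    ids = [vocab.get(ch, 0) for ch in name][:max_len]
    return [0] * (max_len - len(ids)) + ids


if __name__ == "__main__":
    # Sample domains taken from the paper's table of DGA samples.
    raw = ["gfedo.info", "ydqtkptuwsa.org"]
    names = [strip_tld(d) for d in raw]
    vocab = build_vocab(names)
    for n in names:
        print(n, encode(n, vocab, max_len=12))
```

Reserving 0 for padding keeps padded positions distinguishable from real characters, which is why the vocabulary starts at 1.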
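The Attention computation in equations (2) to (4) of Section 2.2 can be illustrated with a small pure-Python sketch. Several points are assumptions rather than details given in the paper: $u$ is treated as a learned context vector (the paper does not define it), the weights `W`, `b` and `u` are fixed toy values instead of trained parameters, and the function name `attention_pool` is hypothetical.

```python
import math


def attention_pool(hidden, W, b, u):
    """Attention-weighted sum of hidden states.

    hidden: list of T hidden-state vectors, each of length d
    W:      k x d weight matrix (list of rows), b: bias of length k
    u:      context vector of length k (assumed to be learned during training)
    Returns (alphas, context): T attention weights and the pooled vector of length d.
    """
    scores = []
    for h in hidden:
        # Equation (2): u_t = tanh(W h_t + b)
        u_t = [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
               for row, bias in zip(W, b)]
        # Inner product u_t^T u used inside the softmax of equation (3)
        scores.append(sum(a * c for a, c in zip(u_t, u)))
    # Equation (3): softmax over time steps; subtracting the maximum does not
    # change the result and only improves numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Equation (4): weighted sum of the hidden states
    dim = len(hidden[0])
    context = [sum(alphas[t] * hidden[t][j] for t in range(len(hidden)))
               for j in range(dim)]
    return alphas, context


if __name__ == "__main__":
    # Toy values, for illustration only.
    hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    W = [[1.0, 0.0], [0.0, 1.0]]
    b = [0.0, 0.0]
    u = [1.0, 1.0]
    alphas, context = attention_pool(hidden, W, b, u)
    print(alphas, context)
```

In this toy example the third hidden state aligns best with the context vector, so it receives the largest weight and dominates the pooled vector.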

Date posted: 10/02/2023, 17:11