RESEARCHING AND PROPOSING PSI GRAPH AS A FEATURE FOR BOTNET DETECTION ON IOT DEVICES

TABLE OF CONTENTS
INTRODUCTION
The urgency of this thesis
Research aim
Research object and area
Research outlines and methodology
Thesis layout
CHAPTER 1: THEORETICAL BASIS
1.1 Definition and characteristics of IoT devices
1.2 Definition of IoT botnet
1.3 The evolution of IoT botnet
1.4 Comparison between traditional botnet and IoT botnet
CHAPTER 2: IOT BOTNET MALWARE DETECTION METHOD
2.1 Comparison of static and dynamic analysis
2.2 Evaluation of IoT botnet detection methods based on static analysis
2.2.1 Constructing the experimental dataset
2.2.2 Experimental results and discussions
CHAPTER 3: PSI GRAPH FEATURE FOR DETECTION OF IOT BOTNET
3.1 Statement of the problem
3.2 Explanation of the problem
3.3 Proposed method
3.4 Function call graph in IoT botnet malware detection
3.5 PSI Graph construction
3.6 Experimental evaluation
3.6.1 Experimental environment
3.6.2 Evaluation model
3.6.3 Experimental results and discussion
CHAPTER 4: PSI-ROOTED SUBGRAPH FEATURE IN DETECTING IOT BOTNET
4.1 Statement of the problem
4.2 Building PSI-rooted subgraph feature
4.3 Experiment and evaluate the results
4.3.1 Experimental environment
4.3.2 Evaluation model
4.3.3 Experimental results and discussion
CONCLUSIONS

INTRODUCTION
The urgency of this thesis
The revolution of Industry 4.0, also known as the Internet of Things or the Industrial Internet, has a great impact on the industry of every nation. Despite its several alternative names, the most significant characteristic of Industry 4.0 is the replacement of traditional production machines with fully automated machines built on top of IoT devices. By applying the cutting-edge technology of Industry 4.0, humans are able to take major leaps in almost every field, such as medicine, education and economics. Although Industry 4.0 is providing
undeniable benefits, it has posed plenty of cyber security threats which may directly harm national security and regional stability. A recent survey of articles published by Elsevier, IEEE, Hindawi and Springer [6] suggested that authentication had been the most common solution for securing IoT devices, while research on trust management, lightweight cryptography and secure communication between IoT devices had been gaining popularity. Furthermore, botnets had been one of the most dangerous threats to IoT devices. Therefore, to meet the urgent real-world demand for securing IoT devices, this thesis focuses on researching and proposing the PSI graph, which can be leveraged as a feature for botnet detection on IoT devices.

Research aim
Based on the emerging needs described above, the research target of this thesis is to propose a feature with a novel yet efficient, low-complexity graph structure for detecting multi-arch IoT botnets with high accuracy.

Research object and area
- Research object: multi-arch binary executables on IoT devices that operate on Linux kernel 2.6 or 3.2.
- Research area: this thesis formulates malware detection as a binary classification problem under the following constraint: only static analysis methods are studied, for IoT botnet detection on resource-constrained IoT devices (SOHO devices), that is, devices with low power consumption, small memory or limited computing capability.

Research outlines and methodology
*) Research outlines: the thesis focuses on analyzing and evaluating the following contents:
- Researching the development, evolution and characteristics of IoT botnets and IoT botnet detection methods
- Surveying, analyzing and evaluating existing IoT botnet detection methods based on static analysis, on the same dataset and environment
- Researching and proposing a
new graph-based feature that can be applied in the IoT botnet detection process
- Evaluating the accuracy and complexity of the proposed feature in IoT botnet detection using reliable datasets, and comparing the experimental results with other proposals that follow the same approach
*) Research methodology: combining theoretical research with practical research
- Theoretical research: researching, surveying, summarizing and evaluating related works at national and international scope to identify the remaining problems that the proposed method can solve. Published articles were collected from reputable sources such as Google Scholar, Science Direct, ACM Digital Library, IEEE Xplore, and industrial conferences such as Blackhat, USENIX, DEF CON, ... In particular, the theoretical research focuses on the behavioral characteristics and infection life cycle of botnet malware, and on decompiled code fragments of samples executed on IoT devices.
- Practical research: based on a dataset of more than 10,000 samples, including botnet malware and benign samples on IoT devices, divided into training and testing sets at a ratio of 70:30 and using cross-validation, the thesis conducted experiments to construct a feature for IoT botnet detection on a real-world dataset, to evaluate the effectiveness of the proposed PSI graph feature with deep learning, and to evaluate the effectiveness of the improved feature, known as the PSI-rooted subgraph, with machine learning algorithms.

Thesis layout
This thesis consists of an introduction, four chapters, and a conclusion with recommendations. The appendix had 126 pages of illustration with 17 tables, 59 pictures and graphs, and 123 references.
Introduction: practical urgency and structure of this thesis
Chapter 1: Theoretical basis
Chapter 2: IoT botnet detection methods
Chapter 3: PSI graph as a feature in IoT botnet detection
Chapter 4: PSI-rooted subgraph
as a feature in IoT botnet detection
Conclusions and recommendations
Appendix

CHAPTER 1: THEORETICAL BASIS
1.1 Definition and characteristics of IoT devices
The term IoT (Internet of Things) was first defined by Kevin Ashton, the founding scientist of Auto-ID at MIT. Since then, various definitions of IoT have been given, without a unified one. However, all existing definitions focus on the connection between things (devices) via the Internet. Therefore, this thesis summarizes the definition of IoT as "a platform consisting of physical and logical things that can be integrated with applications, humans and environments, and that have the ability to connect, transmit and process data for different purposes".
A recent Statista survey suggested that the number of IoT devices will increase dramatically to 75 billion in 2025, 2.4 times as many as in 2020. Furthermore, IoT devices are present everywhere and in every field, such as medical systems, production management systems and energy management systems. Within the current research area, this thesis defines IoT devices as "both physical and logical multi-arch devices that have restricted computing resources and capabilities but have the ability to connect, transmit and process data for a specified purpose".
In general, most existing IoT devices run various distros of the UNIX operating system. The popularity of UNIX distros comes from their useful set of utilities. Therefore, this thesis only focuses on Linux executables in the common ELF (Executable and Linkable Format) format.
Compared with devices based on traditional communication technology, IoT devices have several unique characteristics:
- Unsupervised operating environment: IoT devices have their own mobility and self-control
- Non-unified: IoT devices are built on top of various processor architectures such as MIPS, ARM, PowerPC, MIPSEL, ...
- Constrained resource: IoT devices
often have limited storage and small memory
- Dynamic status: the status of IoT devices depends on their operating environment
- Connectivity: IoT devices can effortlessly connect to each other and interact with the information and communication infrastructure at a global scope

1.2 Definition of IoT botnet
Botnet is a type of malware whose name originates from "robot", referring to its automated operation. A botnet is an application that can automatically interact with other services on the network. It is often designed to infect specific devices such as personal computers, mobile devices or IoT devices, and then turn the infected devices into members of a larger network controlled by the attacker, known as the bot-master. A botnet only executes its malicious activities after receiving commands from the C&C server. This is the main difference between botnets and other types of malware. Therefore, this thesis defines the IoT botnet as "a botnet that has the ability to automatically infect IoT devices and is controlled by attackers".
Figure 1.1 Relationship between some IoT botnet malware

1.3 The evolution of IoT botnet
Based on the analysis and evaluation of recent research on IoT malware, as well as experience in detecting real malware samples, this thesis summarizes the evolution of the IoT malware used for massive DDoS attacks in a graph. However, the list of IoT malware cannot be complete, since attackers constantly modify and update their malware to create novel instances every day.

1.4 Comparison between traditional botnet and IoT botnet
The comparison between traditional and IoT botnets is listed in Table 1.1.
Table 1.1 Comparison of botnet malware on traditional computers and IoT devices
| Criteria | Traditional botnet on PC | IoT botnet |
|---|---|---|
| Attack types | Various attack types such as data encryption, data theft, DoS attack, ... | Leverage a huge number of IoT devices at a global scope to perform massive DDoS attacks |
| Architecture | Mainly focused on x86_64 | Multi-arch, based on the variety of IoT devices: ARM, MIPS, PowerPC, ... |
| Variety | High variety yet complex structure | Low variety, mostly based on modification of traditional botnets |
| Obfuscation techniques | Leverage the computing power to perform complex obfuscation | Simple obfuscation due to the limit of computational resources |
| Detectable footprints | Easy to detect the footprints by analyzing the behavior of computers | Harder to detect the footprints due to the operation characteristics of IoT devices |
| Executable capabilities | Easier to analyze in a sandbox | Harder to get an IoT botnet sample to operate in a sandbox due to the multi-arch constraints and activation conditions |
| Infection capabilities | Able to persist on the storage of computers | Often delete persistence and only operate in volatile memory |
| Competition | Not really competitive due to the large amount of computational resources on PC | Very competitive: due to the limited computational resources, an IoT botnet often deactivates or removes other malware after successfully infecting an IoT device |

Conclusion of Chapter 1: This chapter presented an introduction to the IoT botnet, including the definitions of IoT devices and IoT botnets as well as the evolution and life cycle of IoT botnets. Furthermore, this chapter compared traditional botnets with IoT botnets and summarized a list of key differences between them. These insights provide solid arguments for determining a suitable IoT botnet detection method.

CHAPTER 2: IOT BOTNET MALWARE DETECTION METHOD
2.1 Comparison of static and dynamic analysis
Both static and dynamic analysis have certain advantages and limitations. Table 2.1 summarizes the advantages and disadvantages of each method.
Table 2.1 Comparison of both methods in IoT botnet malware detection
| | Dynamic analysis | Static analysis |
|---|---|---|
| Advantages | Observe the execution of a program to determine its behavior more specifically; more effective against obfuscated malware | Analyze programs in detail and give an overview of all their activation possibilities; no need to execute malware, so not affected by multi-architecture issues when building an execution environment |
| Disadvantages | Only single-threaded execution can be monitored; disclose the process of detecting and analyzing malware; may cause a threat to the network and the system; difficult to fully emulate IoT devices (multi-architecture) | Depends heavily on decompilation techniques; difficulty handling malware that uses obfuscation |

To fit the research content, the thesis finds that, with multi-architecture executable files as input, it is necessary to choose a method capable of handling this problem effectively and efficiently. The thesis therefore selects static analysis for its approach, exploiting the strengths of static analysis and limiting its weaknesses. The next part of the thesis analyzes and evaluates current studies based on static analysis for IoT botnet malware detection.

2.2 Evaluation of IoT botnet detection methods based on static analysis
Studies based on static analysis in malware detection often use common features such as file headers, system calls, API (Application Programming Interface) calls, PSI (Printable Strings Information), FLF (Function Length Frequency), linked libraries, OpCode (extracted from assembly code), ... Decompilation is a common approach to extracting these features from an executable file. The way these features are extracted and processed greatly affects the accuracy and complexity of IoT malware detection methods, which can be divided into two groups: graph-based methods and non graph-based methods, as illustrated in Figure 2.1.
Figure 2.1 Classification of static features in IoT botnet detection
Malware detection methods that use non graph-based features build detection models containing binary file structure attributes to classify a binary as
malicious or benign. These methods are based on extracting features, including opcodes, strings, or file structure, that distinguish malicious patterns. These features can be divided into two groups: high-level features and low-level features. In particular, low-level features can be gathered directly from within the file structure, whereas high-level features require disassembler tools such as IDA Pro or Radare2. Studies that represent executable files with non graph-based features depend heavily on the values of the features (e.g. the function call inet_toa) and cannot describe complex semantic relations between features (for example, data dependencies in the life cycle of IoT malware capable of distributed denial of service attacks, referred to as IoT botnets). Besides, studies using non graph-based features usually cannot handle malware obfuscation techniques such as encryption and junk data insertion.
The comparison of IoT botnet malware detection methods by static feature representation, summarized below, shows that state-of-the-art studies using static features for IoT botnet detection have limitations:
- Studies using opcode-based data representation, such as Hamed HaddadPajouh [14], Ensieh Modiri Dovom [57], Darabian [52] and Amin Azmoodeh et al. [36], use key mechanisms such as identifying malicious code through opcode sequences, applying fuzzy pattern trees to detect malicious code patterns, and detecting malicious code based on opcode frequency. These studies have limitations such as using only ARM-based samples and datasets that are not large enough.
- The research of Mohannad Alhanahnah [4] represents data as strings, from which signatures are generated to classify malicious code. However, the study was limited by its computational complexity and used only four types of malware.
- Research by F. Shahzad et al. [96] represents data as an ELF header to extract features from the
binary file's sections to detect malicious code. However, the study was limited because the structure of the binary file is easily edited.
- Research by Jiawei Su et al. [25] uses a grayscale image representation, which represents binaries as grayscale images for malicious code detection. However, the study was limited by a lack of precision when the samples used obfuscation or encryption techniques.
- Research by Hisham Alasmary et al. [32] uses a CFG representation and computes 23 graph-theoretic properties of the CFG to distinguish between malicious and benign samples. However, the study has high computational complexity and inaccurate properties.
Based on the evaluation of current studies on IoT botnet malware detection, all studies have advantages and disadvantages. However, each method was tested on a different dataset and in a different environment. On that basis, the thesis conducted an objective assessment of current studies in the same testing environment and on the same dataset. The next part of the thesis presents the dataset in detail; it is used not only for the evaluation in this chapter but also for the experiments in the following chapters.

2.2.1 Constructing the experimental dataset
To serve the experimental studies of the thesis reliably and properly, constructing a dataset of malware and benign executable files for IoT devices is of great importance.
Table 2.2 Dataset description
| Family name | Variants | Sample number | ARM | MIPS |
|---|---|---|---|---|
| Mirai | | 1,765 | 331 | 301 |
| Bashlite | | 3,720 | 762 | 646 |
| Other botnet | | 680 | 152 | 103 |
| Benign | - | 3,845 | 561 | 533 |
| Total | | 10,010 | 1,806 | 1,583 |
The dataset contains 10,010 samples, including 6,165 IoT botnet malware samples and 3,845 IoT benign samples. It also covers many architectures such as ARM, MIPS, PowerPC, Sparc, SuperH, ...

As can be seen in Figure 3.2, the number of vertices in the PSI graph is concentrated mainly in the range [1, 300] for both malicious and benign
files. Although there is a slight difference in distribution, this difference is not obvious enough to establish a threshold value to distinguish between benign and malicious IoT samples.
Figure 3.2 Number of edges and vertices between sample patterns
To visualize the results of the PSI graph generation algorithm, Figure 3.3 shows the function call graph and the PSI graph of a Linux.Bashlite sample; it can be clearly seen that the PSI graph is much simpler than the function call graph. On average, a PSI graph contains only about 16 vertices and 60 edges, compared to the 156 vertices and 360 edges of the function call graph.
Figure 3.3 Function call graph (left) and PSI graph (right) of Linux.Bashlite malware sample
In summary, the PSI graph feature obtained by the thesis has the following characteristics:
- It is built with a static method;
- It can reflect "life cycle behavior", in other words it simulates the infection process of IoT botnet malware;
- It only considers the structure of printable string information (PSI), not the values of the strings;
- It is built on the function call graph.

3.6 Experimental evaluation
3.6.1 Experimental environment
Using the experimental dataset presented in Section 2.2.1 of this thesis summary, the thesis divides the dataset into two subsets: a training set and a testing set. The training set contains 2,690 samples for each of the malicious and benign classes. The test set contains 4,630 samples. The experiment is built with Python and the PyTorch framework on Ubuntu 16.04, using an Intel Core i5-8500 3.0 GHz CPU, an NVIDIA GeForce GTX 1080 Ti graphics card and 32 GB of RAM.
3.6.2 Evaluation model
To evaluate the effectiveness of PSI graph features in the IoT botnet malware detection problem, the thesis feeds PSI graph features into the evaluation model shown in Figure 3.4. The thesis takes an approach based on the analysis and representation of the
entire structure of the PSI graph as fixed-length numerical vectors, so the thesis uses graph2vec [39] in the data preprocessing step.
Figure 3.4 Evaluation model of detecting IoT botnet malware using PSI Graph
Graph2vec is an unsupervised learning technique for converting a graph into a numerical vector. Graph2vec is based on the idea of the doc2vec approach [82] using the skip-gram network. Graph2vec learns graph representations by treating an entire graph as a text and its subgraphs as the words that make up that text.

Algorithm 3.3: Graph2vec ($\mathcal{G}, D, \delta, \mathfrak{e}, \alpha$)
Input:
  $\mathcal{G} = \{G_1, G_2, \ldots, G_n\}$: set of graphs, each graph $G_i = (V_i, E_i, \lambda_i)$, for which embeddings have to be learnt
  $D$: maximum degree of rooted subgraphs to be considered for learning embeddings. This will produce a vocabulary of subgraphs, $SG_{vocab} = \{sg_1, sg_2, \ldots\}$, from all the graphs in $\mathcal{G}$
  $\delta$: number of dimensions (embedding size)
  $\mathfrak{e}$: number of epochs
  $\alpha$: learning rate
Output: matrix of vector representations of graphs $\Phi \in \mathbb{R}^{|\mathcal{G}| \times \delta}$
Initialization: sample $\Phi$ from $\mathbb{R}^{|\mathcal{G}| \times \delta}$
for $e = 1$ to $\mathfrak{e}$
  $\omega = \mathrm{Shuffle}(\mathcal{G})$
  for each $G_i \in \omega$
    for each $v \in V_i$
      for $d = 0$ to $D$
        $sg_v^{(d)} := \mathrm{GetWLSubgraph}(v, G_i, d)$
        $J(\Phi) = -\log \Pr(sg_v^{(d)} \mid \Phi(\mathcal{G}))$
        $\Phi = \Phi - \alpha \frac{\partial J}{\partial \Phi}$
return $\Phi$

The working principle of graph2vec is as follows: the entire graph is treated as a document, and the subgraphs in the graph are treated as sentences, where each vertex in the graph is processed as a word. The document is built by traversing the graph. Once the document has been built, the skip-gram technique is used to learn a representation of the graph. Because the model is trained to predict subgraphs, graphs with similar subgraphs and similar structures obtain similar embeddings. The result of this step is a set of numerical vectors, of a length chosen in advance, representing the set of graphs. In the proposed study, the thesis represents PSI graphs as numerical vectors of length 1024, which are used for later classification. The data collected after the PSI
graph preprocessing step are used to decide whether a file is malicious, using a deep neural network classifier. To build the convolutional neural network, the thesis adopts the network model proposed by Kim [75]. The first layer of the neural network is the input layer; the next layer performs convolution operations using multiple filter sizes. The output of this layer is passed to a nonlinear function, the ReLU activation, defined as $f(x) = \max(0, x)$, because ReLU is simpler to compute than the sigmoid activation function (which requires evaluating an exponential) [100]. Next, a max-pooling layer is used to reduce the dimension of the data from the convolutional layer, so that the complexity and computational resources of the processing are reduced and the model scales to more data. Finally, the fully connected layer classifies the outputs generated by the convolution and pooling layers.

3.6.3 Experimental results and discussion
To evaluate the effectiveness of the PSI graph feature in detecting IoT botnet malware, the thesis ran experiments comparing two features, the PSI graph and the function call graph (FCG), using the metrics accuracy, FNR, FPR and processing time.
Table 3.2 The results of detecting IoT botnet malware by PSI graph and function call graph
| Features | Accuracy (%) | FNR (%) | FPR (%) | Time (min) |
|---|---|---|---|---|
| PSI-graphs | 98.7 | 1.83 | 0.78 | 88 |
| FCGs | 95.3 | 5.81 | 4.13 | 545 |
From the results in Table 3.2, it can be seen that the proposed method using PSI graph features performs better than the function call graph. The proposed method achieved 3.4 percentage points higher accuracy than the function call graph (98.7% versus 95.3%), and its execution time was 457 minutes shorter. Besides, the false negative rate of the proposed method is 1.83%, while that of the FCG method is 5.81%. Meanwhile, with malware detection problems, the lower the false negative rate, the lower the
chance of the classifier misdetecting malicious code as benign files. Besides, the proposed method still has a very small error rate of wrongly labeling benign files as malicious code. This occurs for some benign files whose PSI graph structure is similar to that of some Linux.Bashlite malware samples. Manual analysis of those samples found that, although the executables, their FCGs and the resulting assembly code were different, they still had the same PSI graph structure. However, this false detection rate is only 0.78%, a very small percentage.
Table 3.3 Comparison between the IoT botnet detection methods
| Methods | Algorithms | Dataset | Accuracy (%) |
|---|---|---|---|
| Su et al. [25] | Deep neural network (CNN) | Dataset described in Section 2.2.1, 6,943 samples (of which 3,098 botnet samples from IoTPOT) | 95.13 |
| HaddadPajouh et al. [14] | Recurrent neural network (RNN) | Dataset described in Section 2.2.1, 6,943 samples (of which 3,098 botnet samples from IoTPOT) | 97.88 |
| PSI-Graph | Deep neural network (CNN) | Dataset described in Section 2.2.1, 6,943 samples (of which 3,098 botnet samples from IoTPOT) | 98.7 |
From Table 3.3, it can be seen that the methods of Su et al. [25] and HaddadPajouh et al. [14] both showed promising results. Although the results of these studies are promising, the lack of their test datasets and of the source code of their models makes retesting and evaluating them quite difficult. This thesis tries to rebuild those methods from their published articles and materials. The results showed that the proposed method achieved better accuracy than those of Su and HaddadPajouh, by 3.57 and 0.82 percentage points respectively.
Table 3.4 Evaluation of over-fitting
| Methods | Algorithms | Dataset | Accuracy (%) |
|---|---|---|---|
| PSI-Graph | Deep neural network (CNN) | Dataset described in Section 2.2.1, 10,010 samples (of which 6,165 botnet samples from IoTPOT and VirusShare) | 97.8 |
Finally, over-fitting problems often occur with deep learning algorithms. This happens when the model fits the training dataset too closely but does not perform well on extended datasets. To evaluate over-fitting in the proposed model,
the thesis added 3,067 malicious code samples collected from VirusShare to the test set and recalculated the accuracy. As shown in Table 3.4, when the VirusShare samples were added, the detection accuracy decreased slightly (down 0.9 percentage points). Thus, from the experimental results, the thesis finds that the proposed method achieves good results in detecting IoT malware while keeping over-fitting within an acceptable range.

Conclusion of Chapter 3
Based on the analysis and evaluation of the characteristics of IoT botnet malware, and in order to overcome the limitations of previous studies that detect IoT botnet malware with graph-structured features, the thesis proposed a lightweight approach based on a high-level feature, called the PSI graph, to detect IoT botnet malware. The proposed method mines the life cycle of IoT botnet malware to generate the PSI graph feature and applies the advantages of deep learning to achieve an accuracy of up to 98.7%, with over-fitting within an acceptable range for the problem of detecting IoT botnet malware. However, the proposed method only exploits the overall structure of the PSI graph, and still has a rather large time cost.

Contributions of Chapter 3
Proposing a new graph-structured feature, called the PSI graph, that is effective in detecting multi-architecture botnet malware on IoT devices. The research results have been published and presented in the proceedings of conferences and in prestigious domestic and international journals ([B1], [B6], [B7] in the author's list of works).

CHAPTER 4: PSI-ROOTED SUBGRAPH FEATURE IN DETECTING IOT BOTNET
4.1 Statement of the problem
The method of detecting IoT botnet malware based on PSI graph features has shown high feasibility and efficiency. However, this method focuses on exploiting the overall structure of the PSI graph and does not exploit the paths in
the PSI graph; in other words, the method treats the PSI graph as a single whole. As botnet malware executables on IoT devices become more and more complex, the structure of the PSI graph will also become more complex. Meanwhile, the malicious behaviors that often appear in the life cycle of IoT botnet malware can be paths in the PSI graph, as illustrated in Figure 4.1: they can be the green or red paths, while the other paths are redundant data. Based on that, the research problem of this chapter is stated as follows: build a new feature based on the PSI graph, but focused on exploring paths in PSI graphs, thereby constructing a new graph feature, called the PSI-rooted subgraph, that represents the malicious behavior of IoT botnet malware and improves the efficiency of detecting IoT botnet malware with simple machine learning algorithms.
Figure 4.1 Illustration of the problem idea using a PSI-rooted subgraph

4.2 Building PSI-rooted subgraph feature
Definition 4.1 (PSI-rooted subgraph): Let $G_{sg} = (V, E, \theta, d)$ represent an acyclic directed PSI-rooted subgraph generated from $G_{PSI}$ and rooted at vertex $\theta$, where $V$ is the set of vertices $V_i$ of $G_{PSI}$ whose path length from $\theta$ satisfies $0 \le \mathrm{len}(\theta, V_i) \le d$, and $E$ is the set of directed edges between vertices in $V$.
After building the PSI graph and identifying its vertices, the thesis traverses the PSI graph with each vertex as the root; the procedure is shown in Algorithm 4.1.

Algorithm 4.1: ExtractRootedSubgraph ($\mathcal{G}, D$)
Input:
  $\mathcal{G} = \{G_1, G_2, \ldots, G_n\}$: set of PSI graphs $G_i = (V_i, E_i)$ representing ELF files
  $D$: maximum degree of the PSI-rooted subgraph
Output: $\mathcal{SG} = \{SG_1, SG_2, \ldots, SG_n\}$: set of PSI-rooted subgraphs $SG_i = (V_i', E_i', v, D)$ extracted from $\mathcal{G}$
1: Initialization: $\mathcal{SG} = \emptyset$
2: for each $G_i \in \mathcal{G}$ do
3:   for each $v \in V_i$ do
4:     for $d = 0$ to $D$ do
5:       $SG_i := \mathrm{GetWLSubGraph}(v, G_i, d)$
6:       $\mathcal{SG} := \mathcal{SG} \cup SG_i$
7: return $\mathcal{SG}$
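The loops of Algorithm 4.1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the thesis's implementation: the names `rooted_levels`, `extract_rooted_subgraphs` and the toy graph are hypothetical, a PSI graph is assumed to be a plain adjacency dictionary, and a simple level-by-level expansion is swapped in for the GetWLSubGraph routine.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Graph = Dict[Hashable, List[Hashable]]      # assumed format: vertex -> successors
Levels = Tuple[Tuple[Hashable, ...], ...]   # one tuple of vertices per depth


def rooted_levels(adj: Graph, root: Hashable, depth: int) -> Levels:
    """Hypothetical stand-in for GetWLSubGraph.

    Lists the vertices reached from `root` level by level, up to `depth`
    hops. Repeated vertices are kept, not deduplicated.
    """
    levels = [(root,)]
    for _ in range(depth):
        # Expand every vertex of the previous level through its out-edges.
        levels.append(tuple(w for u in levels[-1] for w in adj.get(u, ())))
    return tuple(levels)


def extract_rooted_subgraphs(graphs: Iterable[Graph], max_depth: int) -> Set[Levels]:
    """Mirror the three nested loops of Algorithm 4.1."""
    subgraphs: Set[Levels] = set()
    for adj in graphs:
        # Include vertices that only appear as edge targets.
        vertices = set(adj) | {w for ws in adj.values() for w in ws}
        for v in vertices:
            for d in range(max_depth + 1):
                subgraphs.add(rooted_levels(adj, v, d))
    return subgraphs


# Toy directed graph, purely illustrative (not a real PSI graph).
toy = {"a": ["b", "c"], "b": ["c"], "c": []}
print(rooted_levels(toy, "a", 2))
print(len(extract_rooted_subgraphs([toy], 1)))
```

Storing each subgraph as a tuple of levels keeps it hashable, so identical subgraphs found in different graphs collapse into one vocabulary entry, which is what the set union in the algorithm requires.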
Algorithm 4.1 chooses all neighbors of a vertex to extract the subgraph. Extracting the PSI-rooted subgraph with breadth-first search (BFS) is more efficient than with depth-first search (DFS). The main reason is that BFS starts at the root vertex and explores all neighboring vertices at the same depth before moving to the next depth level, while DFS explores as deep as possible along each branch before backtracking. With a fixed depth (or degree) of the extracted subgraph, BFS is clearly more suitable. To choose the appropriate subgraph depth, the thesis experimented with several values of $D$, including $D = 2$, and selected the depth that balances accuracy and complexity to process PSI-rooted subgraphs. In Algorithm 4.1, the GetWLSubGraph function called at line 5 takes the root vertex $v$, the graph $G_i$ and the degree $d$ of the subgraph as input and returns the subgraph $SG_i$. The GetWLSubGraph function is given in Algorithm 4.2, which the thesis adopts from the study of Annamalai Narayanan et al. [89].

Algorithm 4.2: GetWLSubGraph ($v, G, d$)
Input:
  $v$: node which is the root of the PSI subgraph
  $G = (V, E)$: PSI graph from which the subgraph has to be extracted
  $d$: degree of neighbours to be considered for extracting the PSI-rooted subgraph
Output: $sg_v^{(d)}$: rooted subgraph of degree $d$ around node $v$
$sg_v^{(d)} = \emptyset$ // initialize the rooted subgraph as empty
if $d = 0$ then
  $sg_v^{(d)} := (v)$
else
  $N_v := \{v' \mid (v, v') \in E\}$
  $M_v^{(d)} := \{\mathrm{GetWLSubGraph}(v', G, d-1) \mid v' \in N_v\}$
  $sg_v^{(d)} := sg_v^{(d)} \cup \mathrm{GetWLSubGraph}(v, G, d-1) \oplus \mathrm{sort}(M_v^{(d)})$
return $sg_v^{(d)}$

To illustrate the process of constructing the PSI-rooted subgraph, the thesis traverses the PSI graph (in Figure 4.1) to find the subgraph rooted at vertex 11 with depth $d$ equal to 2; the results are shown in Table 4.1.
Table 4.1 An example of generating a PSI-rooted subgraph with a depth
of 2

Degree   Vertices
d = 0    11
d = 1    0, 8, 10, 7, 9
d = 2    18, 0, 0, 7, 0, 5, 6, 15, 16

The traversal proceeds as follows: at d = 0 there is only vertex 11; at d = 1, each vertex with d = 0 is traversed, so the result contains the vertices linked to vertex 11, namely {0, 8, 10, 7, 9}; similarly, at d = 2 each vertex with d = 1 is traversed: vertex {0} is linked to vertex {18}, vertex {8} is linked to vertex {0}, and so on. Once all the vertices with d = 1 have been traversed, the result is the list of vertices with d = 2. The resulting PSI-rooted subgraph with root 11 is the list {11, 0, 8, 10, 7, 9, 18, 0, 0, 7, 0, 5, 6, 15, 16}.

Repeating this traversal over the whole PSI graph with every other vertex as root produces a list of PSI-rooted subgraphs, so the data resembles a forest of many trees (because the extraction removes cycles, each subgraph has a tree-like structure). It is then necessary to identify the PSI-rooted subgraphs that contain the behavior in the life cycle of the IoT botnet malware.

4.3 Experiment and evaluate the results

4.3.1 Experimental environment

Using the data set and the experimental environment presented in Section 1.2 of this thesis, the thesis divides the dataset into two subsets for the experiments: a training set with 70% of the data and a testing set with the remaining 30%. To reduce the risk of over-fitting, the thesis uses k-fold cross-validation with k = 5: the training set is divided into five parts, of which four are used for training and one for evaluation, in order to find the most suitable parameters for the model.

4.3.2 Evaluation model

To evaluate the effectiveness of PSI-rooted subgraph features in the IoT botnet malware detection problem, the thesis feeds PSI-rooted subgraph features into the
evaluation model shown in Figure 4.2.

Figure 4.2 The evaluation model of the PSI-rooted subgraph feature applied to IoT botnet malware detection

The input data are the PSI-rooted subgraphs obtained by processing the PSI graph. Before this data enters the subsequent steps, the thesis processes the PSI-rooted subgraphs with a word embedding technique. In line with the thesis's approach, the treatment is based on frequency of occurrence: each PSI graph is considered a document and each PSI-rooted subgraph a word in that document. The occurrences of words in each document are counted, and the frequency of occurrence of each rooted subgraph serves as a feature. The representative vector of an executable is thus the frequency of occurrence of the PSI-rooted subgraphs in the PSI graph of that executable. This vector is treated as a multivariate sample, and the data can be represented as a matrix whose rows are graphs and whose columns are the rooted subgraphs found in the graph data set.

The thesis found that the features in this matrix have many different value ranges, so normalization is needed to safeguard the classification results. Normalization follows formula (4.1), where the numerator $x$ is the vector representing a sample in the PSI-rooted subgraph data set and the denominator is the length of that vector (a real number), computed as its Euclidean norm:

$$x_{normalized} = \frac{x}{\|x\|_2} \qquad (4.1)$$

Next, the thesis applies feature selection following the Wrapper method, in which candidate features are assessed with a specific machine learning algorithm to find the optimal ones. The algorithm used is a linear SVM, for three reasons: first, a linear SVM is good at constructing a boundary between classes, which helps select influential features; second, a linear SVM can compute feature importance; and third, a linear SVM is quite fast. With the
obtained features, the thesis chose not to use complex machine learning algorithms. Instead it chose some algorithms that are popular in the malware detection problem [99], such as SVM, Decision Tree and Random Forest, along with simple, rarely used algorithms such as Bagging and kNN, to demonstrate the robustness and effectiveness of the PSI-rooted subgraph feature in the detection of IoT botnet malware.

4.3.3 Experimental results and discussion

To evaluate the effectiveness of the PSI-rooted subgraph feature in detecting IoT botnet malware, the thesis ran experiments whose results are shown in Tables 4.2, 4.3 and 4.4. The results in Table 4.2 were obtained on the entire data set, while those in Tables 4.3 and 4.4 cover only files of the ARM and MIPS architectures, respectively.

Table 4.2 Results of classifiers with the proposed feature

Classifier   TPR (%)   FPR     Accuracy (%)   AUC (%)   F1-score (%)
DT           97        0.043   96.3           96.4      97
RF           98        0.03    97.2           97.1      98
SVM          98        0.041   97             96.8      98
Bagging      98        0.04    97.3           97.1      98
kNN          97        0.044   96.8           96.7      98

Figure 4.3 ROC curves for Bagging, RF, DT, kNN and SVM on the dataset

Table 4.3 Results of evaluation of malware detection on the ARM-based dataset

Classifier   TPR (%)   FPR     Accuracy (%)   AUC (%)   F1-score (%)
DT           99        0.019   98.3           98.3      98
RF           99        0.01    98.8           98.8      99
SVM          100       0.01    99.3           99.3      99
Bagging      99        0.01    98.8           98.8      99
kNN          98        0.019   97.8           97.8      98

Figure 4.4 ROC curves of Bagging, RF, DT, kNN and SVM on the ARM-based dataset

Figure 4.5 ROC curves of Bagging, RF, DT, kNN and SVM on the MIPS-based dataset

From the above results, it can be seen that the proposed method achieves a high detection rate with every classifier on the combined dataset of multiple-architecture files, as shown in Table 4.2. Random Forest performs better than the other classifiers, with a TPR of 98%, the lowest FPR (0.03) and quite satisfactory results on the other metrics. Furthermore, the AUC values of the classifiers used in the above experiments are all
greater than 96%. These AUC values mean that the IoT botnet malware detection system gives good results, with the Random Forest classifier performing best, as shown in Figures 4.3, 4.4 and 4.5.

Table 4.4 The evaluation results of malware detection on the MIPS-based dataset

Classifier   TPR (%)   FPR     Accuracy (%)   AUC (%)   F1-score (%)
DT           98        0.007   99             98.7      98
RF           99        0.005   99.3           99.1      98
SVM          100       0.007   99.4           99.6      99
Bagging      96        0.011   98.3           97.6      96
kNN          99        0.004   99.4           99.2      99

In addition, the method is also evaluated on data sets containing only ARM or only MIPS architecture files, as shown in Tables 4.3 and 4.4. Because each of these data sets contains files of a single architecture, the SVM classifier performs better than the other classifiers there, achieving a TPR of 100% on both datasets. As mentioned earlier, precision is the ratio of correctly identified malicious samples to all samples predicted as malicious; in other words, it indicates how reliably a classifier predicts malicious code instances. The F1-score is computed from precision and recall, so the F1-scores of at least 98% achieved by the RF and SVM classifiers mean that these classifiers predict malicious code instances well.

The processing times in Table 4.5 show a large difference between running with and without feature selection. With all 530,155 features, the RF classifier takes 9305.21 seconds; with feature selection, its processing time falls to 69.18 seconds. The other classifiers also show reduced processing times with feature selection. The processing time of the classifiers therefore grows with the number of features.

Table 4.5 Comparison of processing times (seconds)

Classifier   With feature selection   Without feature selection
DT           1.84                     18.49
RF           69.18                    9305.21
Bagging      144.64                   5225.02
kNN          12.83                    19.60
SVM          237.78                   1705.33
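The processing times in Table 4.5 can be turned into per-classifier speed-up factors. The short Python sketch below is only an illustration of that arithmetic: the dictionary names and the `speedup` helper are assumptions introduced here and do not come from the thesis, while the timings are copied from Table 4.5.

```python
# Processing times in seconds, copied from Table 4.5.
# All names below are illustrative and are not taken from the thesis.
TIME_WITH_SELECTION = {
    "DT": 1.84, "RF": 69.18, "Bagging": 144.64, "kNN": 12.83, "SVM": 237.78,
}
TIME_WITHOUT_SELECTION = {
    "DT": 18.49, "RF": 9305.21, "Bagging": 5225.02, "kNN": 19.60, "SVM": 1705.33,
}


def speedup(classifier: str) -> float:
    """Return how many times faster a classifier runs with feature selection."""
    return TIME_WITHOUT_SELECTION[classifier] / TIME_WITH_SELECTION[classifier]


if __name__ == "__main__":
    # Print one speed-up factor per classifier listed in the table.
    for name in TIME_WITH_SELECTION:
        print(f"{name:8s} {speedup(name):7.1f}x faster with feature selection")
```

With these numbers the gain ranges from roughly 1.5x for kNN to roughly 134x for RF, so every classifier benefits from feature selection, although by very different amounts.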
Besides, this thesis also compares the proposed method with that of Hamed HaddadPajouh et al. [14], which uses opcode sequences as the feature. There are two main reasons for choosing this study for comparison: first, it uses a typical static approach with IoT executables; second, it uses machine learning for the classification phase.

Table 4.6 Comparison between traditional machine learning classifiers in IoT botnet detection (accuracy, %)

Classifier      Proposed method   Hamed et al. [14]
Random Forest   98.8              92.37
SVM             99.3              82.21
kNN             97.8              94
Decision Tree   97.8              92.36

Hamed et al. experimented on a dataset consisting only of IoT executables of the ARM architecture. Therefore, this thesis uses the experimental results on the ARM-based dataset, as shown in Table 4.3. The results show that the proposed method of this thesis performs better. The PSI-rooted subgraph feature is therefore effective in detecting IoT botnet malware when using machine learning.

Finally, an evaluation of complexity against the method using the PSI graph feature finds that the approach based on the PSI-rooted subgraph feature has lower complexity. Consider first the method based on PSI graph features. The processing of PSI graphs is based on Graph2vec, in which skipgram is the main deep learning technique. Consider the skipgram model in data processing with the embedding technique, as shown in Figure 4.6. The complexity of skipgram depends on the number of epochs, the number of iterations (the number of backward passes taken to update the weights), and the complexity of one iteration. From the workflow of Graph2vec (see Algorithm in this thesis) and the detailed processing that the thesis inherits from the study of Annamalai Narayanan et al. [40], it can be seen that:

- In one iteration, the complexity depends on the number of calculations. From the network model in Figure 4.6, it can be seen that the complexity depends on computing the hidden
layers and $y_{pred}$, and on the phase of updating the weights $W_{input}$ and $W_{output}$:
  + Computing the hidden layer and updating $W_{input}$ depend only on a single row (of the $V$ rows) of the weight matrix $W_{input}$, so the complexity is $O(N)$.
  + Updating $W_{output}^{T}$ also touches only $K + 1$ columns of $W_{output}^{T}$, so the complexity is $O(N \cdot (K + 1))$.
  + Computing $y_{pred}$ depends on the matrix multiplication with $W_{output}^{T}$ (whose complexity is $O(N \cdot (K + 1))$, because the negative sampling technique only updates $K + 1$ columns of $W_{output}^{T}$) and on computing the softmax (whose complexity is $O(V)$). Thus, the complexity is equal to $\max(N \cdot (K + 1), V)$, i.e. $O(N \cdot (K + 1))$.
  Therefore, the complexity of one iteration depends on $\max(N, N \cdot (K + 1))$, i.e. $O(N \cdot (K + 1))$.

- The number of iterations made by Graph2vec using stochastic gradient descent (SGD) depends on the number of samples to be trained, $S = \{(graph\_id, sampled\_word)\}$, where sampled_word is taken randomly from a window in that graph. In this study, the thesis has a rather large subgraph vocabulary (about 500,000), so the size $|S|$ can reach millions of steps or more. This process is repeated many times to update the weights. Each run has complexity $O(N \cdot (K + 1) \cdot |S|)$.

- The number of epochs $e$ is a hyperparameter set in advance.

Thus, the complexity of the thesis's proposed method based on PSI graph features is $e \cdot |G| \cdot |V| \cdot \max(k^{D}, D \cdot |S| \cdot N \cdot (K + 1))$, i.e. $O(e \cdot |G| \cdot |V| \cdot D \cdot |S| \cdot N \cdot (K + 1))$.

Figure 4.6 The Skipgram model structure with center word case "passes" [114]

Meanwhile, consider the complexity of generating the PSI-rooted subgraph feature in Algorithms 4.1 and 4.2. In the worst case, the extraction has to traverse all the neighbor vertices of a vertex in the PSI graph. Specifically, at the 4th line of Algorithm 4.1, the complexity is $O(k^{D})$, where $D$ is the degree of the PSI subgraph and $k$ is the maximum number of neighbors of the root in the PSI subgraph (because this is a brute-force algorithm, it will consider
all vertices adjacent to the root vertex until a tree of depth $D$ is reached). In addition, the complexity of processing the PSI-rooted subgraph data at the vectorization step must be considered. The output of vectorization is a sparse matrix, so the complexity depends only on the input size, namely $|G| \cdot R$, where $G$ is the set of PSI graphs and $R$ is the largest number of PSI-rooted subgraphs in the entire graph set. Thus, the complexity of the method using the PSI-rooted subgraph feature is $\max(|G| \cdot |V| \cdot D \cdot k^{D}, |G| \cdot R)$, i.e. $O(|G| \cdot |V| \cdot D \cdot k^{D})$. Compared with the complexity derived above, the method based on the PSI-rooted subgraph feature has lower complexity than the method based on the PSI graph feature.

Conclusion of Chapter 4: This thesis has presented a new method based on the PSI-rooted subgraph for IoT botnet malware detection. The method extracts new features from the PSI graph of ELF files. These features are fed to machine learning classifiers that act as malicious code detectors with over 97% accuracy, and the Random Forest classifier has been shown to outperform the other classifiers. In addition, a comparison with existing methods shows experimentally that the proposed method of the thesis is more effective.

Contributions of Chapter 4: Based on the PSI graph, the thesis has proposed a method of exploring the PSI graph to extract new features that are effective in detecting IoT botnet malware, called PSI-rooted subgraph features. The research results have been published and presented in conference proceedings and in domestic and international journals ([B2], [B8] in the author's list of publications).

CONCLUSIONS

1) The main results of this thesis:

The thesis has focused on researching methods of detecting IoT botnet malware. Through the process of learning, researching and implementing the thesis, the following main results were achieved:

Contribution 1: Experimenting, analyzing and evaluating current
IoT malware detection methods on the same large database of IoT executable files (both malicious and benign) with real-world malware samples, and on the same system configuration. The result is an overview of the current IoT malware detection methods, so that researchers can choose the appropriate approach for the IoT malware detection problem in general and for IoT botnets in particular.

Contribution 2: This thesis proposes a new feature, called the PSI (Printable String Information) graph, that simulates the infection process of IoT botnet malware. The proposed method has low complexity but still ensures high accuracy in detecting IoT botnet malware.

Contribution 3: This thesis improves the PSI graph based method of detecting IoT botnet malware with a new feature, called the PSI-rooted subgraph, which has proven its effectiveness in detecting IoT botnet malware.

2) Future development of the thesis:

- The proposed method has so far been tested only on IoT botnet malware, while there are other types of IoT malware such as Trojans and Worms. In the future, it is necessary to continue testing the proposed method on many other types of IoT malware.
- The process of extracting dynamic features is complex and time-consuming but has the potential to overcome the limitations of static analysis. Future research will therefore combine static analysis and dynamic analysis to improve PSI graph feature mining in IoT malware detection.
- The process of traversing the PSI-rooted subgraph is still complicated. Future research can apply Reinforcement Learning to improve the ability to identify malicious behaviors of IoT botnet malware, so that traversing the PSI-rooted subgraph has low complexity. This approach has been researched, tested and published in [B9].
- The thesis used a data set with a large number of samples to conduct experiments and
evaluation, but future experiments can be done with larger data sets. Results on larger data sets will increase the reliability of the thesis's proposed method.
- Combining with non-structural features of the graph: the thesis uses vector features converted from PSI graph features, so they can easily be combined with other vector features.

Practical application model of the IoT botnet detection method using the PSI graph feature

LIST OF AUTHOR'S PUBLISHED WORKS

Journal

[B1] Huy-Trung Nguyen, Quoc-Dung Ngo, and Van-Hoang Le, "A novel graph-based approach for IoT botnet detection", International Journal of Information Security, Vol. 19, pp. 567-577, 2020 (SCIE index, Q2). ISSN: 1615-5262 (Print), 1615-5270 (Online). DOI: 10.1007/s10207-019-00475-6
[B2] Huy-Trung Nguyen, Quoc-Dung Ngo, Doan-Hieu Nguyen, and Van-Hoang Le, "PSI-rooted subgraph: A novel feature for IoT botnet detection using classifier algorithms", ICT Express Journal, 6(2), pp. 128-138, 2020 (ESCI/SCOPUS index, Q1). ISSN: 2405-9595. DOI: 10.1016/j.icte.2019.12.001
[B3] Quoc-Dung Ngo, Huy-Trung Nguyen, Van-Hoang Le, Doan-Hieu Nguyen, "A survey of IoT malware and detection methods based on static features", ICT Express Journal, In press, 2020 (ESCI/SCOPUS index, Q1). ISSN: 2405-9595. DOI: 10.1016/j.icte.2020.04.005

Conference Proceedings

[B4] Nguyen Huy Trung, Ngo Quoc Dung, Nguyen Anh Quynh, Tran Nghi Phu, Nguyen Ngoc Toan, Nguyen Manh Son, "Developing a hybrid method for detecting botnet on routers", The 20th National Symposium of Selected ICT Problems, Quy Nhon, 23-24/11/2017
[B5] Su Ngoc Anh, Le Hai Viet, Nguyen Huy Trung, Ngo Quoc Dung, "Building a model for network attack detection collection using IoT devices", The 2nd National Symposium of Selected Information Security Problems, 2017
[B6] Su Ngoc Anh, Nguyen Huy Trung, Nguyen Anh Quynh, Pham Van Huan, "Detecting IoT botnet malware", The 3rd National Symposium of Selected Information Security Problems, Da Nang, 12/2018 (Conference proceedings are published in
Journal on Information and Communications, ISSN 1859-3550, pp 8994, 2018)
[B7] Huy-Trung Nguyen, Quoc-Dung Ngo, and Van-Hoang Le, "IoT Botnet Detection Approach Based on PSI graph and DGCNN classifier", In IEEE International Conference on Information Communication and Signal Processing (ICICSP), pp. 118-122, 2018 (SCOPUS index). DOI: 10.1109/ICICSP.2018.8549713
[B8] Huy-Trung Nguyen, Doan-Hieu Nguyen, Quoc-Dung Ngo, Vu-Hai Tran, and Van-Hoang Le, "Towards a rooted subgraph classifier for IoT botnet detection", In Proceedings of the 7th International Conference on Computer and Communications Management, pp. 247-251, 2019 (SCOPUS index). DOI: 10.1145/3348445.3348474
[B9] Quoc-Dung Ngo, Huy-Trung Nguyen, Hoang-Long Pham, Hoang HanhNhan Ngo, Doan-Hieu Nguyen, Cong-Minh Dinh, Xuan-Hanh Vu, "A graph-based approach for IoT botnet detection using Reinforcement Learning", In: 12th International Conference on Computational Collective Intelligence (ICCCI), DaNang, Vietnam, Lecture Notes in Artificial Intelligence, Springer Cham, pp 114, 2020 [Accepted]