Android Malware Classification Using Deep Learning
Background Information
Android Platform
The Android platform is a software stack that consists of many components, as follows (Fig. 1.1):
Android platform provides users tools and APIs to create applications (apps) for mobile phones, televisions, smartwatches, etc.
Figure 1.1: Architecture of Android OS system [37]
The Android operating system is built upon Linux kernel version 2.6. All operations that need to be executed are carried out at this level, including memory management, hardware communication (driver models), security tasks, and process management.
Although Android was built upon the Linux kernel, the kernel has been heavily modified. These modifications are tailor-made for the characteristics of handheld devices, such as the limited CPU, memory, storage, and screen size and, most importantly, the continuous need for wireless connections.
This level contains the following components:
– Display Driver: controls the screen’s display and captures user interactions (e.g., touch, gestures).
– Camera Driver: manages the camera’s operation and receives data streams from the camera.
– Bluetooth Driver: controls the transmission and reception of Bluetooth sig- nals.
– USB Driver: manages the functionality of USB communication ports.
– Keypad Driver: controls the keypad input.
– Wi-Fi Driver: responsible for sending and receiving Wi-Fi signals.
– Audio Driver: controls audio input and output devices, decoding audio signals to sound and vice versa.
– Binder IPC Driver: provides Android’s inter-process communication (IPC) mechanism, allowing application and system processes to exchange data and invoke each other’s services.
– M-System Driver: manages reading and writing operations on memory devices like SD cards and flash drives.
– Power Management: monitors power consumption.
The hardware abstraction layer (HAL) provides standard interfaces that expose device hardware capabilities to the higher-level Java API framework. The HAL consists of multiple library modules, each implementing an interface for a specific hardware component, such as the camera or Bluetooth module. When a framework API makes a call to access device hardware, the Android system loads the library module for that hardware component.
The Android Runtime provides the libraries that any Java program needs to function correctly. It has two main components, much like the Java equivalent on personal computers. The first component is the Core Library, which contains classes such as Java IO, Collections, and File Access. The second component is the Dalvik Virtual Machine, an environment for running Android applications.
This section comprises numerous libraries written in C/C++ to be utilized by software applications. These libraries are grouped into the following categories:
– System C Libraries: these libraries are based on the C standard and are used exclusively by the operating system.
– OpenGL ES: Android supports high-performance 2D and 3D graphics with the Open Graphics Library (OpenGL), specifically the OpenGL ES API. OpenGL is a cross-platform graphics API that specifies a standard software interface for 3D graphics processing hardware.
– Media Libraries: this collection contains various code segments to support the playback and recording of standard audio, image, and video formats.
– Web Library (LibWebCore): this component enables content viewing on the web and is used to build the web browser software (Android Browser) and for embedding into other applications. It is highly robust, supporting powerful technologies such as HTML5, JavaScript, CSS, DOM, AJAX, etc.
– SQLite Library: this is a database system that applications can utilize.
• Java API Framework: the entire feature set of the Android OS is available through APIs written in the Java language. These APIs form the building blocks needed to create Android apps by simplifying the reuse of core, modular system components and services, which include the following components:
– Activity Manager: this manages the lifecycle of applications and provides tools to control Activities, overseeing various aspects of the application’s lifecycle and Activity Stack.
– Telephony Manager: provides tools for communication functions such as mak- ing phone calls.
– XMPP Service: facilitates real-time communication.
– Location Manager: this class provides access to the system location services.
– Window Manager: manages the construction and display of user interfaces and the organization and management of interfaces between applications.
– Resource Manager: handles static resources of applications, including image files, audio, layouts, and strings. It enables access to embedded resources (not code) such as strings, color settings, and UI layouts.
– Notification Manager: allows applications to display notifications to users.
– Content Providers: enables applications to publish and share data with other applications.
– View System: a collection of views used to create the application user inter- face.
System Apps are apps that communicate with the users. Some of these apps include:
– The basic apps that come with the OS, such as Phone, Contacts, Browser, SMS, Calendar, Email, Maps, Camera, etc.
– The user-installed apps, like games, dictionaries, etc.
These applications share these characteristics:
– Written in Java or Kotlin and packaged as an APK file.
– When an app is run, a virtual machine is initialized for that runtime. The app can be an active program with a user interface, a background app, or a service.
– Android is a multitasking operating system, meaning users can run multiple programs and tasks simultaneously. However, for each app, there exists only one instance. This prevents the abuse of resources and generally helps the system run more efficiently.
– Applications in Android are assigned user-specific ID numbers to differentiate their privileges when accessing resources, hardware configurations, and the system.
– Android is an open-source operating system, distinguishing it from many other mobile operating systems. It allows third-party applications to run in the background. However, these background apps have a minor restriction: they are limited to using only 5-10% of the CPU capacity, which prevents the monopolization of CPU resources. Background apps do not have a fixed entry point or a primary method to start execution.
Overview of Android Malware
According to NIST [38], malware is defined as:
“Malware, also known as malicious code, refers to a program that is covertly inserted into another program intending to destroy data, run destructive or intrusive programs, or otherwise compromise the confidentiality, integrity, or availability of the victim’s data, applications, or operating system. Malware is the most common external threat to most hosts, causing widespread damage and disruption and necessitating extensive recovery efforts within most organizations.”
From the above definition, it can be seen that malware is harmful to users and systems. Understanding malware and how to prevent it helps protect users in today’s connected environment.
The rise of malware comes with the development of the internet, especially now that all activities, including social and financial ones, can be performed online, exposing them to anonymous attacks with malicious intent. Malware can be classified into seven types, as shown in Table 1.1 below [38, 39]:
Viruses self-replicate by inserting copies of themselves into host programs or data files. Viruses are often triggered through user interaction, such as opening a file or running a program. Viruses can be divided into the following two subcategories:
– Compiled Viruses: a compiled virus is executed by an operating system. Types of compiled viruses include file infector viruses, which attach themselves to executable programs; boot sector viruses, which infect the master boot records of hard drives or the boot sectors of removable media; and multipartite viruses, which combine the characteristics of file infector and boot sector viruses.
– Interpreted Viruses: interpreted viruses are executed by an application. Within this subcategory, macro viruses take advantage of the capabilities of applications’ macro programming languages to infect application documents and document templates, while scripting viruses infect scripts that are understood by scripting languages processed by services on the OS.
Example: ILOVEYOU, CryptoLocker, Tinba, Welchia, Shlayer.
Worms: a worm is a self-replicating, self-contained program that usually executes itself without user intervention. Worms are divided into two categories:
– Network Service Worms: a network service worm takes advantage of a vulnerability in a network service to propagate itself and infect other systems.
– Mass Mailing Worms: a mass mailing worm is similar to an e-mail-borne virus but is self-contained rather than infecting an existing file.
Trojan Horses: a Trojan horse is a self-contained, non-replicating program that, while appearing benign, actually has a hidden malicious purpose. Trojan horses either replace existing files with malicious versions or add new ones to systems. They often deliver other attacker tools to systems.
Spyware is malware that runs secretly on the system without notifying users. Rather than disrupting system processes, spyware aims to collect private information and grant remote access to bad actors. Spyware is often used to steal financial information or private user information.
Example: DarkHotel, Olympic Vision, Keylogger
Adware is the malware most commonly used to collect user data on the system and serve ads to users without permission. Even though adware is not always dangerous, in some situations it can cause system crashes. It can redirect browsers to unsafe websites carrying Trojans and spyware. In addition, adware is a common cause of system lag.
Ransomware is a kind of malware that gains access to private system information; it encrypts data to prevent user access, after which the attackers can take advantage of the situation and blackmail users. Ransomware is usually delivered through phishing. The attacker can encrypt information so that it can only be decrypted with the attacker’s key.
Example: RYUK, Robbinhood, Clop, DarkSide
Fileless malware lives inside memory. This software is executed from the victim system’s memory (not from files on the hard disk), making it harder to detect than classic malware. It also complicates forensic analysis because fileless malware disappears when the system restarts.
Android consistently holds a high share of the mobile operating system market. According to the statistics of [1], in June 2023 Android dominated 70.79% of the mobile market. Thus, Android’s vulnerabilities are attractive to hackers, as social and financial activities can now be performed on mobile devices. According to AV-Test [2], new types of malware are still being created annually, along with the development of an open-source OS like Android. The malware increase from 2013 to March 2022 is shown in Fig. 1.2.
Malware is a growing threat to every connected individual in the age of mobile phones and the internet. Because of the financial incentives, the number and complexity of Android malware are growing, making it more difficult to detect. Android malware is almost identical to the varieties of malware users might be familiar with on their desktops, but it targets only Android phones and tablets. Android malware primarily steals private information, which can be as common as the user’s phone number, emails, or contacts, or as critical as financial credentials. With that data, scammers have many unlawful options that can earn them substantial money. Several signs indicate that a mobile device is infected by malware: (1) users often see sudden pop-up advertisements on their devices; (2) mobile batteries drain faster than usual; (3) users notice applications that they did not intentionally install; and (4) some apps do not appear on the screen after installation. Android malware appears in many forms, such as trojans, adware, ransomware, spyware, viruses, phishing apps, or worms. Kaspersky investigated widespread malware in 2020 and 2021 and categorized it (Fig. 1.3) [40]. Malware often infiltrates via traditional sources, such as harmful downloads in emails, browsing dubious websites, or following links from unknown senders.

Figure 1.2: The increase of malware on Android OS
Figure 1.3: Types of malware on Android OS
Common sources of Android malware:
– Infected applications: attackers can collect popular programs, repackage them with malware, and redistribute them through download links. This method is so effective that many fraudsters design or advertise new apps; naive users may follow customized download links and accidentally install or download malware to their devices.
Android Malware Classification Methods
Signature-based Method
In this method, the signature of sample malware is stored in a list of known threats and their indicators of compromise (IOCs). The signature can be extracted by static or dynamic analysis. The method compares the sample’s signature with all the signatures stored in the database to decide whether a sample is malware.
One of the attributes of the signature-based method is high accuracy. To achieve that, indicators stored in the database must be accurate, have comprehensive coverage, and be updated regularly, as new malware is born rapidly. On the other hand, using a signature-based method is time-consuming: the larger the number of files or apps that need to be checked, the longer the testing time required, because the system must sequentially decompile each app, extract features, and then compare each feature with the patterns defined in the database. A program can combine static and dynamic signatures, e.g., data extracted from the decompiled code and behavioral data collected while the app runs. The combination provides more comprehensive coverage, but the examination time increases considerably.
Permissions, API calls, class names, intents, services, and opcode patterns are often used to spot malware. In [16], Enck et al. proposed a security service for the Android operating system called Kirin. Kirin authenticates an app at installation time using a set of protection rules designed to match the properties configured in the app. The Kirin system also evaluates configurations extracted from the installer’s manifest files and compares them with the rules set up and saved in the system.
Batyuk et al. [17] applied static analysis to 1,865 top free Android apps retrieved from the Android Market. The experiments showed that at least 167 of the analyzed apps access private information such as IMEI, IMSI, and phone numbers. One hundred fourteen apps read sensitive data and immediately write them to a stream, which indicates a significant privacy concern.
Dynamic analysis is highly efficient when dealing with obfuscation techniques such as polymorphism, binary packaging systems, and encryption. However, app operation (even in a virtual environment) also costs dynamic analysis more time than static analysis. Chen et al. [15] proposed an approach to flag dangerous samples on Android devices using static features and dynamic patterns. The static features were acquired by decompiling APK files, from which connections between the app’s classes, attributes, methods, and variables are extracted. The program also analyzes function calls and the relationships between data threads when the Android app runs. All that information can be used to deduce threat patterns and check whether the app accesses private data or conducts any illegal operation, e.g., sending messages without permission or stealing confidential information. The experiments in the report show that the rate of malware found in 252 samples using the dynamic signature-based method is 91.6%.
Figure 1.4: Anomaly-Based Detection Technique
Despite the advantages mentioned above, there are two drawbacks to the signature-based detection method: (i) it cannot detect zero-day malware, and (ii) it can easily be bypassed by code obfuscation.
Anomaly-based Method
An anomaly-based method uses a different approach and can resolve these problems. It relies on heuristics and empirically observed running processes to detect abnormal activities. The anomaly-based detection technique consists of a training stage and a detection stage, as presented in Fig. 1.4. This technique observes the normal behaviors of the app over a period and uses the attributes of standard models as vectors against which abnormal behaviors, if any occur, are detected. A set of standard behavior attributes is developed in the training stage. In the detection stage, when any abnormal “vectors” arise between the model and the running app, that app is flagged as an anomalous program. This technique allows for recognizing even unknown malware and zero-day attacks.
In an anomaly-based approach, application behaviors can be extracted in three ways: static analysis, dynamic analysis, or hybrid analysis. Static analysis investigates the app’s source code before installation. Dynamic analysis performs the test and collects all the app data during execution (for example, API calls, events, etc.), while hybrid methods use both.
However, the abnormal and expected behaviors of the samples are not easily separated because of the large number of behaviors extracted. There is no fixed basis for determining which behavior is normal and which is not, and it is not feasible to divide these behaviors based solely on the analyst’s experience. Machine learning models are applied during training to minimize time and increase efficiency. When applying machine learning, the number of behaviors fed into the training model can be enormous, as all behaviors must be collected as features. Nowadays, many machine learning models have been applied to malware detection, such as SVM (Support Vector Machine), KNN (K-Nearest Neighbors), RF (Random Forest), etc., as well as modern deep learning models: DNN (Deep Neural Network), DBN (Deep Belief Network), CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), GAN (Generative Adversarial Network), etc. Those models will be discussed in a later section of the dissertation.
Schmidt et al. [41] analyzed Linux ELF (Executable and Linking Format) object files in an Android environment using the readelf command. The function calls read from the executables are compared with the malware database for classification using the Decision Tree learner (DT), the Nearest Neighbor (NN) algorithm, and the Rule Inducer (RI). This technique shows 96% accuracy in the detection phase with 10% false positives. Schmidt et al. extended their function-call-based technique to Symbian OS [42]. They extracted function calls from binaries and applied their centroid machine, based on a lightweight clustering algorithm, to identify benign and malware executables. The technique provides 70-90% detection accuracy and 0-20% false positives. Schmidt et al. [43] proposed a framework to monitor smartphones running Symbian OS and Windows Mobile OS to extract system features for detecting anomalous apps. The proposed framework is based on tracking clients that run on mobile devices, collecting data describing the system state, such as the amount of free RAM, the number of running processes, CPU usage, and the number of SMS messages in the sent directory, and sending it to the Remote Anomaly Detection System (RADS). The remote server contains a database to store the received features; the detection units access the database and run machine learning algorithms, e.g., AIS or SOM, to distinguish between normal and abnormal behaviors. A meta-detection unit weighs the detection results of the different algorithms. The algorithms were executed on four feature sets of different sizes, reducing the set of features from 70 to 14, thus saving 80% of disk space and significantly reducing computation and communication costs. Consequently, the approach positively influences battery life and has a small impact on true positive detection.
Only the machine learning methods applied to malware detection on the Android system will be discussed in this dissertation. The next chapter will detail the analysis used to obtain behaviors or features by static, dynamic, and hybrid methods.
Android Malware Classification Evaluation Metrics
In recognition and classification problems, some commonly used measures are Accuracy (Acc), Precision, Recall, F1-score, the confusion matrix, the ROC curve, the Area Under the Curve (AUC), etc. For classification problems with multiple outputs, there are slight differences in how these measures are used.
1.2.3.1 Metrics for the Binary Classification Problem
In the detection problem, the output has only two labels, commonly called Positive and Negative, where Positive indicates an app is malware and Negative indicates the opposite. Hence, four definitions are provided:
• TP (True Positive): apps correctly classified as malware.
• FP (False Positive): apps mistakenly classified as malware.
• TN (True Negative): apps correctly classified as benign.
• FN (False Negative): apps mistakenly classified as benign.
While evaluating, the ratio (rate – R) of these four measures is considered:
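The rate formulas themselves did not survive in this copy; assuming the standard definitions, they are:

$TPR = \frac{TP}{TP + FN}$, $FNR = \frac{FN}{TP + FN}$, $FPR = \frac{FP}{FP + TN}$, $TNR = \frac{TN}{FP + TN}$.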
Of the four rates above, the FNR is crucial: the higher this ratio, the less trustworthy the model, because more malware apps are mistakenly recognized as benign. The FPR, or false alarm rate, means benign apps are mistaken for malware; it is undesirable but not as critical as the FNR.
The most popular and simplest measure is Accuracy (Acc), given in Equation 1.1:

$Acc = \frac{TP + TN}{TP + TN + FP + FN}$ (1.1)
Acc is often used for problems where the numbers of positive and negative samples are equal. For problems with a large imbalance between positive and negative samples, the Precision, Recall, and F1-score measures are often used.
• Precision is defined as the ratio of TP among the samples classified as positive (TP + FP). The formula for calculating Precision is shown in Equation 1.2:

$Precision = \frac{TP}{TP + FP}$ (1.2)
• Recall is defined as the ratio of TP among the samples that are actually positive (TP + FN). The formula for calculating Recall is shown in Equation 1.3:

$Recall = \frac{TP}{TP + FN}$ (1.3)
• F1-score is the harmonic mean of Precision and Recall. The formula for the F1-score is shown in Equation 1.4:

$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$ (1.4)
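As a minimal illustration of Equations 1.1-1.4, a small sketch with hypothetical confusion-matrix counts (not results from the dissertation):

```python
# Minimal sketch of Equations 1.1-1.4; the counts below are hypothetical.
tp, fp, tn, fn = 420, 30, 510, 40

acc = (tp + tn) / (tp + tn + fp + fn)               # Equation 1.1
precision = tp / (tp + fp)                          # Equation 1.2
recall = tp / (tp + fn)                             # Equation 1.3
f1 = 2 * precision * recall / (precision + recall)  # Equation 1.4

print(f"Acc={acc:.4f}, Precision={precision:.4f}, Recall={recall:.4f}, F1={f1:.4f}")
```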
1.2.3.2 Metrics for Multi-labelled Classification Problem
When there are multiple labels as output in the classification problem, it can be reduced to a detection problem for each class, considering the data belonging to the class under consideration as positive and all the remaining data as negative. Thus, there is a pair of precision and recall values for each class. The concepts of micro-average and macro-average are used to evaluate the classification problem.
Micro-average precision and micro-average recall are calculated as in Equation 1.5:

$Precision_{micro} = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}, \quad Recall_{micro} = \frac{\sum_c TP_c}{\sum_c (TP_c + FN_c)}$ (1.5)

where $TP_c$, $FP_c$, and $FN_c$ are, respectively, the TP, FP, and FN of class c.
Macro-average precision is the average of the per-class precisions; macro-average recall, the average of the per-class recalls over the malware families and the benign class, is defined analogously, as given in Equation 1.6:

$Precision_{macro} = \frac{1}{C} \sum_{c=1}^{C} Precision_c, \quad Recall_{macro} = \frac{1}{C} \sum_{c=1}^{C} Recall_c$ (1.6)
Among the measures above, the Acc and Recall measures are used for the classification experiments in this dissertation.
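A short sketch of the micro/macro distinction in Equations 1.5-1.6, assuming scikit-learn is available; the family labels are hypothetical:

```python
# Hedged sketch: micro- vs macro-averaged precision/recall (Equations 1.5-1.6).
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 1, 1, 2, 2, 2, 3]   # hypothetical malware-family labels
y_pred = [0, 1, 1, 1, 2, 0, 2, 3]

for avg in ("micro", "macro"):
    p = precision_score(y_true, y_pred, average=avg, zero_division=0)
    r = recall_score(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.3f}, recall={r:.3f}")
```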
Android Malware Dataset
Many datasets have been published for the research community as follows:
• Contagio Mobile: released in 2010 and last updated in 2010. It consists of only 189 malware samples without benign ones. This dataset is publicly available.
• Malgenome: samples were collected from 2010 to 2011 and published in 2012. The dataset contains 1,260 malware samples. However, it was decommissioned in 2021.
• Virusshare: this is a repository of malware that has been publicly available since 2011. It only includes malware files, without labels.
• Drebin: samples were collected from 2010 to 2012 and published in 2014. The dataset consists of 5,560 malware samples divided into 178 malware families. The number of files per family is not balanced: some families have only one or a few (fewer than 10) malware files, while others have more than 1,000 files. Furthermore, Drebin also provides 123,453 benign samples in the form of extracted features.
• PRAGuard: released in 2015, PRAGuard consists of 10,479 malware samples without malware family labels. PRAGuard was created by obfuscating MalGenome and Contagio Minidump data with seven different techniques. In April 2021, this dataset was decommissioned.
• Androzoo: Androzoo was created in 2016 and is still being updated. It provides both malware and benign apps in large quantities. However, Androzoo only provides the apps themselves, which have not been classified into families. So far, the number of files offered is over 20 million, in the form of APKs.
• AAGM: made public in 2017, it consists of 3 categories: Adware with 250 apps, General Malware with 150 apps, and Benign with 1500 apps from Google Play.
• AMD: malware was collected from 2010 to 2016 and made public in 2017, including 25,553 files from 71 families. As of 2021, this dataset is decommissioned.
• CICMalDroid 2020: samples collected in 2018 and published in 2020 with a size of 13,077 files in 5 categories (Adware, Banking Malware, SMS Malware, Mobile Riskware, Benign).
• InvesAndMal (CIC MalDroid2019): samples collected in 2017 and published in 2019, with 5,491 files. This dataset is divided into four categories (Adware, Ransomware, Scareware, SMS malware) and consists of 42 families within those categories. The benign set accounts for 5,000 samples. It is currently still public.
• MalNet2020: the dataset was published in December 2020 with 1,262,024 samples. This dataset is essentially downloaded from Androzoo but provides features extracted from FCGs (Function Call Graphs) and images. The dataset is divided into 696 malware families and 47 malware types. APK files cannot be downloaded directly from MalNet’s homepage (https://mal-net.org/); the authors only provide SHA256 hashes for downloading from Androzoo.
In the experiments in the doctoral dissertation (including experiments in journal articles and conferences), the following datasets were used:
• Virusshare: in the conference paper FAIR [Pub.4], a small number of samples, 500 in total (250 malware and 250 benign), were used. Since the numbers of malware and benign programs are balanced, the only measure applied is accuracy.
• Drebin: this is a well-known dataset used in many papers by local and foreign authors. Throughout this research work, the Drebin dataset was used repeatedly, for example:
– [Pub.1]: this research experimented on the entire Drebin dataset (including both the benign and malware samples provided). The article showed that using a CNN model had advantages over the original Drebin SVM model. Because the Drebin dataset has a significant imbalance of samples between families, additional measures were also applied to obtain a better evaluation.
– [Pub.2, Pub.6]: these publications utilized the entire Drebin malware set combined with 7,140 benign samples from a different source. Multiple measurements were performed to evaluate the feature selection in the paper.
• AMD: similar to Drebin, this dataset is widely used by researchers due to the large quantity and variety of samples.
– In [Pub.10], the 65 families with the most samples were appropriate for the research. The [Pub.11] study used the AMD dataset with families having at least 20 samples (35 families in total).
– In [Pub.3], Drebin and AMD were employed as malware data.
The datasets are summarized in Table 1.2 as follows:
Table 1.2: Summary of Android malware datasets
ID | Dataset | Description | Samples | Benign | Families | Published
To evaluate the quality of a dataset, the dissertation uses several criteria: the number of samples, the number of labels, the distribution of samples among classes, and how up to date the dataset is. These criteria help ensure that the dataset is comprehensive, well labeled, balanced, and current, which increases the reliability and generalization of the research results.
The quality of classification depends on the dataset:
Based on the above datasets, some datasets are suitable for malware detection tasks (providing only malware, not divided into families) and others for classification tasks (in which the malware is divided into many families). They also need to be combined with a separate benign set (which can be downloaded from sources such as Androzoo, Google Play, etc.). With the same machine learning or deep learning algorithm, adapting to each dataset gives different results, because the features extracted from each dataset (each set of samples) are different. Even assuming that all datasets are of good quality, there is still a clear difference between datasets due to their different years of publication: each year, Google releases new Android versions with many changes, so the features extracted from each set differ. Some datasets have specific characteristics, such as containing C++ code instead of just Java code, containing scrambled code that is not readable like regular code, being encrypted, or having code rearranged into different positions. From the above, it can be seen that the quality of each dataset significantly affects the classification quality.
Modifying and advancing the dataset:
The investigation conducted in the dissertation indicates that the labeled datasets exhibit a discrepancy among distinct malware families. The Virusshare and Androzoo datasets, which furnish APK files, exhibit a partiality towards specific labels when subjected to labeling software, despite their lack of inherent labeling. Consequently, this research has incorporated multiple supplementary evaluation metrics, including but not limited to precision, recall, and F1-score, to furnish a more all-encompassing appraisal of the correlation among diverse families with varying quantities.
Machine Learning-based Method for Android Malware Classification
The problem of malware classification on the Android platform is described in Fig. 1.5. In general, there are four steps involved in Android malware classification.
An APK file is a compressed file containing other files, such as AndroidManifest.xml (hereafter, the XML file) and classes.dex (hereafter, the DEX file). Features extracted from APK files form a dataset and serve as input to training models. Features are critical to a model and are the key components enabling the model to make true or false decisions. Arrays of features can be collected via static analysis, dynamic analysis, or hybrid techniques and then tailored into a feature set. For example, index classes can be transformed into image features, feature groups such as permissions, API calls, and intents can be collected, or file code can be transformed into a “smali” file. The set of extracted features can be defined as a “raw feature dataset.”
Features of the original dataset (the original feature dataset) are transformed into binary form, and these binary values can then be specified in different ways:
• Images: features are transformed from text to binary or hex code, from which image points are collected.
• Frequency: the occurrence frequency of attributes (permissions, API calls, etc.) in an APK file can be transformed into features.
• Binary encoding: if the defined behavior takes place, pass “1”; otherwise, pass “0” (see the sketch after this list).
• Relationship weight: apply a mathematical model to retrieve the relationships between features and assign weights to the newly acquired relationships (for example, the relationship between APIs).

Figure 1.5: Overview of the problem of detecting malware on the Android platform
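A minimal sketch of the binary encoding above; the feature vocabulary uses real Android permission names, but the apps and their feature sets are hypothetical:

```python
# Minimal sketch of binary (one-hot) encoding over a fixed feature vocabulary.
FEATURE_VOCAB = ["SEND_SMS", "READ_CONTACTS", "ACCESS_FINE_LOCATION", "CAMERA"]

def encode(app_features):
    """Return a 0/1 vector: 1 if the feature appears in the app, else 0."""
    return [1 if f in app_features else 0 for f in FEATURE_VOCAB]

print(encode({"SEND_SMS", "CAMERA"}))   # -> [1, 0, 0, 1]
print(encode({"READ_CONTACTS"}))        # -> [0, 1, 0, 0]
```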
Mathematical formulas such as the TF-IDF (Term Frequency - Inverse Document Frequency) algorithm, IG (Information Gain), PSO (Particle Swarm Optimization), GA (Genetic Algorithm), etc., can be used to assign a weight to each feature. These algorithms can also be used to evaluate the importance of each feature in the dataset. Many studies did not assign new weights to the features but used the original dataset directly.
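As one concrete possibility, a hedged sketch of TF-IDF weighting with scikit-learn, treating each app as a “document” of feature tokens; the token strings are illustrative:

```python
# Hedged sketch: TF-IDF weighting of extracted feature tokens per app.
from sklearn.feature_extraction.text import TfidfVectorizer

apps = [
    "SEND_SMS READ_SMS sendTextMessage",                 # hypothetical app 1
    "CAMERA ACCESS_FINE_LOCATION getLastKnownLocation",  # hypothetical app 2
    "SEND_SMS CAMERA",                                   # hypothetical app 3
]
vec = TfidfVectorizer(lowercase=False, token_pattern=r"\S+")
X = vec.fit_transform(apps)          # sparse matrix: (n_apps, n_features)
print(vec.get_feature_names_out())
print(X.toarray().round(3))
```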
The dataset can contain hundreds or tens of thousands of features. Of course, the more features put into the training model, the longer the training time. On the other hand, many features are not necessarily suitable for the classification problem. Therefore, feature selection is also a problem studied in classification in general and in malware classification in particular. Which features to keep and which to remove depends on the criteria chosen by each researcher; for example, one can set a threshold on accuracy, recall, etc., at which to stop removing features, or rely on the weights of each feature in the original dataset and remove features below a chosen threshold.
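A hedged sketch of one such selection criterion, keeping the top-k features ranked by mutual information (a common estimate of information gain); the data is synthetic:

```python
# Hedged sketch: keep the k features with the highest mutual information.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))   # 200 apps x 50 binary features (synthetic)
y = rng.integers(0, 2, size=200)         # 0 = benign, 1 = malware (synthetic)

selector = SelectKBest(mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)    # (200, 50) -> (200, 10)
```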
3 Feature augmentation from the available feature set
The features used in classifying Android malware are primarily discrete, yet they are related to each other within an application. Features often come in groups (for example, calling ACCESS_FINE_LOCATION usually accompanies the getFromLocation() system call). Finding the relationships between features in each such application is complex. Therefore, data engineers have developed many methods to enhance the input dataset. The augmentation approach can generate more features based on their correlation or hybridization, or reduce the number of features used as input data. Feature generation methods include Apriori, K-means, FP-growth, etc. On the other hand, commonly used dimensionality reduction techniques are low-variance filtering and generalized discriminant analysis.
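A brief sketch of low-variance filtering, one of the reduction techniques named above, on synthetic binary features:

```python
# Hedged sketch: drop near-constant features with a variance threshold.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(100, 30)).astype(float)
X[:, 5] = 1.0                             # make one feature constant across apps

filt = VarianceThreshold(threshold=0.01)  # keep features with variance > 0.01
X_filtered = filt.fit_transform(X)
print(X.shape, "->", X_filtered.shape)    # the constant column is removed
```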
4 Deep learning and machine learning models
Applying new models in training always piques interest in data classification problems. Researchers have applied many machine learning models to improve classification quality, especially in image classification. New methods and models are developed based on existing studies, and deep learning has emerged as an evolution of traditional machine learning. In deep learning, the typical model used is the CNN, along with its many variants, such as VGG-16, VGG-19, ResNet, etc. It can be seen that developing or applying a new model to a classifier is of great significance. Many research groups have applied these models to other problems, including the Android malware detection problem.
Related Works
Related Works on Feature Extraction
The overview model of feature extraction is described in Fig 1.6.
Current research follows these feature extraction methods:
1 Static feature extraction: analyzing the source code (obtained via reverse engineering) to extract features as strings from the file.
2 Dynamic feature extraction: take each APK file and run it in an isolated environment (e.g., a sandbox independent of the operating system environment), just like installing the app and running each module in it. The desired features can be extracted during execution.

Figure 1.6: General model of feature extraction methods
3 Hybrid feature extraction: combining the static method (1) and the dynamic method (2).
4 Image conversion: usually, an APK file, DEX file, or XML file can be transformed into a sequence of bytes, and then an image can be created from the binary sources. The image can be a grayscale image or a color RGB image. This image conversion method can also be considered a static method; however, the features are not necessarily analyzed thoroughly to produce internal strings as in the static method, so in this study, image conversion analysis is categorized separately.

a) Static Extraction Method
The static analysis method decompiles the APK package and analyzes its internal characteristics, thereby collecting suspicious characteristics in the decompiled code files; those “suspicious attributes,” in the form of strings, are called features. The static method has many advantages, such as:
• Can detect malware whose behavior is not directly visible to the outside.
• Several tools are available for reverse engineering.
The number of “strings” extracted from the decompiled code of an APK package is enormous. They can be divided into many groups: permissions, API calls, main activity, package name, opcode, intents, hardware components, strings, and system commands. Each group plays a vital role in malware detection, but three groups are widely used in static analysis:
• Permission: these are the permissions declared in the XML file. There are two types: permissions provided by the Android operating system and permissions declared by the programmer.
• API call: API calls describe the working process of an app. An API call combines the class name, method name, and descriptor.
• Opcode: opcode describes the instruction script for data operations. The Dalvik register set, instruction set, and instruction set architecture differ from those of the JVM but are similar to x86 assembly instructions. Opcodes include many types of instructions, such as data definition, object operation, data calculation, field operation, method call, array operation, comparison, jump, data conversion, and synchronization.
Permission is always an important feature. Many researchers use only permissions to characterize malware, as in [26, 28, 29, 61, 62, 63, 64, 65, 66, 67, 68]. Each paper used a different dataset, so the number of features used differs. There are many methods to standardize data from strings to numbers, such as one-hot encoding, label encoding, ASCII encoding, Unicode encoding, and word embedding. Most studies using permission features use one-hot encoding. This normalization technique builds a vector of features and converts each value into a binary feature containing only 1 (the feature appears) or 0 (the feature does not appear) for each application. D. Sahin et al. [28] used 76 permission features extracted from XML files with a sample dataset named MoDroid, which includes 200 malware apps and 200 benign apps.
In addition, the authors also used 102 permissions as the feature set for 1,000 malware apps from the AMD dataset and 1,000 benign apps from APKPure. The features were normalized with one-hot encoding. Experimenting with many machine learning methods, the highest detection accuracy achieved was 95.6%, using linear regression.
In addition, many groups have used permission features extracted from datasets such as CICMaldroid [64, 69], Virusshare, Androzoo [63], Genome [61], etc., or mixed samples of several datasets, and acquired positive results, with the accuracy of detection models reaching over 90%.
Sensitive permissions are presented in Table 1.3. When used, those permissions may indicate an app prone to being malware.
Table 1.3: Sensitive permission groups

Group | Permissions | Description
CALENDAR | READ_CALENDAR, WRITE_CALENDAR | Used for runtime permissions related to the user’s calendar.
CAMERA | CAMERA | Used for permissions associated with accessing the camera or capturing images/video from the device.
CONTACTS | READ_CONTACTS, WRITE_CONTACTS, GET_ACCOUNTS | Used for runtime permissions related to contacts and profiles on this device.
LOCATION | ACCESS_FINE_LOCATION, ACCESS_COARSE_LOCATION | Used for permissions that allow accessing the device location.
MICROPHONE | RECORD_AUDIO | Used for permissions associated with accessing microphone audio from the device.
PHONE | READ_PHONE_STATE, CALL_PHONE, READ_CALL_LOG, WRITE_CALL_LOG, ADD_VOICEMAIL, USE_SIP | Used for permissions associated with the device’s telephony features.
SENSORS | BODY_SENSORS | Used for permissions associated with accessing body or environmental sensors.
SMS | SEND_SMS, RECEIVE_SMS, READ_SMS, RECEIVE_WAP_PUSH | Used for runtime permissions related to the user’s SMS messages.
STORAGE | READ_EXTERNAL_STORAGE, WRITE_EXTERNAL_STORAGE | Used for runtime permissions related to the shared external storage.
API call features, like permissions, are used in much research. Many studies have used API call features exclusively to demonstrate their effectiveness in detection and classification [4, 5, 7, 8, 9, 10, 11, 12, 13, 70, 71]. Usually, other feature groups treat each feature as a separate item unrelated to the others; hence each feature is converted to a vector (numeric form) before being fed into training models. In the case of API calls, each feature is inherently related to the others and forms an internal chain of API calls, or the API calls combine with the app to form a chain of API calls and apps. Such a chain between APIs, or between APIs and apps, is called a Function Call Graph (FCG). The FCG-based method has only recently been used for Android malware detection, typically in the studies of [5, 11, 13, 72]. J. Kim et al. [5] extracted APIs from 10,654 malware apps from Virusshare to build an API call graph; the detection results using a CNN achieved an accuracy of 91.27%. H. Gao et al. [11] proposed a system named GDroid that implements a Graph Convolutional Network (GCN), into which the FCG is fed for classification. The paper used many malware datasets, such as AMGP, DB, AMD, and Drebin, and attained a high accuracy of 98.99%. Q. Li et al. [13] used the Genome, Drebin, and Faldroid datasets with API call graph features and a GCN model, resulting in 93.8% accuracy in malware detection. D. Zou et al. [71] combined the conventional graph-based method with the high scalability of the social-network-analysis-based method for malware classification. APIs were extracted from a dataset of more than 8,000 malware and benign apps to create the call graph, and the results achieved an F-measure of 97.1% with a false positive rate of less than 1%. Besides, many papers still convert API calls to vectors [9, 14, 70, 73]. Transforming API calls into vectors as input to the model also produces good results. S. K. Sasidharan et al. [70] trained a model using the Profile Hidden Markov Model (PHMM). API calls and methods from malware in the Drebin dataset were transformed into an encoded list and trained with a split of 70% for training and 30% for testing. The resulting accuracy reached 94.5% with a 7% false positive rate; precision and recall were 0.93 and 0.95, respectively.
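To make the FCG idea concrete, a hedged sketch that builds a small call graph from caller-callee pairs with networkx; the edge list is hypothetical:

```python
# Hedged sketch: a tiny function-call graph (FCG) from caller -> callee pairs.
import networkx as nx

call_edges = [  # hypothetical edges extracted from decompiled code
    ("com.app.Main.onCreate", "android.telephony.TelephonyManager.getDeviceId"),
    ("com.app.Main.onCreate", "com.app.Net.send"),
    ("com.app.Net.send", "java.net.URL.openConnection"),
]
fcg = nx.DiGraph(call_edges)
print(fcg.number_of_nodes(), fcg.number_of_edges())   # 4 3
# Simple structural statistics that could feed a classifier:
print(max(fcg.out_degree(), key=lambda kv: kv[1]))
```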
Although not used as much as permissions and API calls, many studies have used opcodes exclusively in malware detection problems, such as [32, 52, 53, 74, 75, 76, 77, 78, 79, 80, 81]. The extracted opcodes were converted to grayscale images and put into a deep learning model, resulting in a detection accuracy of 95.55% and a classification accuracy of 89.96%. Besides, V. Sihag et al. [53] used opcodes to address the problem of code obfuscation. The detection result achieved 98.8% accuracy when using the Random Forest algorithm on the Drebin and PRAGuard code-obfuscation datasets, with 10,479 malware apps used. In [79], the authors proposed an effective opcode extraction method and applied a Convolutional Neural Network for classification; the k-max pooling method was used in the pooling phase to achieve an accuracy of more than 99%. On the other hand, M. Amin et al. [80] vectorized the extracted opcodes through encoding and applied deep neural networks to train the model, e.g., bidirectional long short-term memory (BiLSTM) networks. With a dataset of more than 1.8 million apps, the paper achieved 99.9% accuracy.
Other feature groups are usually combined with permissions, API calls, or opcodes; because these groups often have few features and are not available in all apps, it is not easy to use them independently. From 2019 until now, according to dblp statistics, only two papers [82, 83] use the intent feature independently. The results show that accuracy reaches 95.1% [82] and the F1-score reaches 97% [83]; however, the datasets are self-collected, and the number of usable files in each dataset is small. Some common API packages in Android malware detection datasets are described in Table 1.4.
Feature combination is commonly used, in which permissions and API calls appear often, as they play a crucial part in malware detection [14, 25, 33, 44, 84, 85, 86, 87, 88, 89, 90]. In many research papers, using combined feature groups has shown high effectiveness in the evaluation results.
Some studies have converted extracted features into images, such as N. H. Khoa et al. [58], who extracted features and then transformed them into images. The features are permissions, opcodes, API calls, system commands, activities, services, receivers, and package names. Applying several CNN models to the feature set extracted from the CICAndMal2020 database, mobilenet_v2 achieved detection results of 98% accuracy, and 99% when optimization methods were used together.

Table 1.4: Common API packages

API package | API package
java.lang.StringBuilder.toString | android.content.Context.getSystemService
java.lang.System | android.content.Context.startActivity
java.lang.Integer | java.lang.Thread
java.lang.String.substring | java.lang.Boolean
java.io | android.app
android.content.SharedPreferences | java.lang.String
android.content.Context.getPackageName | java.lang.StringBuilder
java.util | java.lang.Long
java.lang.String.length | android.view
android.content.Intent.putExtra | java.lang.Exception

Table 1.5: Common suspicious API calls

Privacy | getSimSerialNumber(), getSubscriberId(), getImei(), getDeviceId(), getLine1Number(), getNetworkOperator(), newOutgoingCalls()
SMS | sendTextMessage(), sendBroadcast(), sendDataMessage(), telephonySMS-
Network | HttpClient.execute(), getOutputStream(), getInputStream(), getNetworkInfo(), httpUrlConnectionConn(), execHttpRequest(), SendRequestMethod(), setDataAndTypes()
Location | getLongitude(), getLatitude(), getCellLocation(), requestLocationUpdates(), getFromLocation(), getLastKnownLocation()
Application | getInstalledPackages(), installPackages()
File | URL(), getAssets(), openFileOutput(), Browser.BOOKMARKS_URL()
Obfuscation | DexClassLoader(), Cipher.getInstance()

b) Dynamic Extraction Method
Dynamic analysis is performed while executing and running all of an app’s functions. During execution, the running process is saved to a log file, and necessary strings from the log file are extracted and denoted as features. Running the APK package can be done in two ways: (1) directly on an actual device, or (2) in a sandbox isolated from the rest of the system. Running in a virtual environment is common in dynamic analysis because it is isolated and does not negatively affect the whole system. However, execution generally takes longer than decompiled-code analysis, and setting up the execution environment for dynamic analysis is difficult. Dynamic analysis also has advantages, such as the ability to extract features that only a running system can reveal, e.g., system productivity, CPU usage rate [91], RAM usage rate [92], battery, processes [76], tcpdump, strace [54], network traffic [50, 57, 93, 94, 95], etc. Standard features such as permissions and API calls [34, 45, 51, 59, 72, 96, 97, 98] also differ from static analysis, because the real APIs are called within the running functions. Therefore, although dynamic analysis takes more time, in many cases it is still necessary for understanding the impact of malware in depth.
S. Khalid et al. [76] used many features, such as memory, network, battery, logs, processes, and APIs, for malware detection. Dynamic analysis can also deal with code obfuscation, as the executed functions are not altered and their impacts on hardware such as memory, network, and battery are preserved. H. Long et al. [54] used two major feature groups, strace and tcpdump, for malware detection and acquired an accuracy of 59.3%. This is a low result compared to other feature groups but higher than other antivirus tools; at the same time, the dataset used in the study is small (917 malware apps taken from the Androzoo dataset combined with 1,293 benign apps). Using network traffic, M. Abuthawabeh et al. [95] applied feature selection algorithms like FRF, LightGBM, and RF to remove trivial features; the accuracy obtained using extra trees was 87.75%, but when applying the model to multi-label classification, the recall was only 41.09%. S. Wang et al. [50] used URLs in network traffic to detect malware and achieved an accuracy of 98.35%, obtained using a neural network and a dataset of 40,751 malware samples extracted from Virusshare combined with 10,000 collected benign apps. I. J. Sanz et al. [57] used only 14 network traffic features after feature selection and applied the Random Forest model with 359 files and malware taken from the AMD dataset; the accuracy reached 90.24%, and up to 91.46% without feature selection. F. Wu et al. [94] utilized network traffic and a Bayesian machine learning model and achieved an AUC of 96%; however, the study used a small dataset of 307 malware apps and 250 benign apps (malware collected from the Genome and Virusshare datasets).
Related Works on Machine Learning-based Methods
In recent years, most research groups have used machine learning and deep learning models for the Android malware recognition problem. Fig. 1.7 presents the number of related articles during this period, based on dblp statistics.
Many machine learning and deep learning algorithms adopted in Android malware detection studies have achieved high accuracy. In this section, the typical machine learning models used in the problem are summarized.
RF is an ensemble classifier that aggregates the predictions of decision trees [121]. Each decision tree is trained on a sub-dataset, a part of the training dataset, and each tree outputs a class prediction.
The instruction to create a Random Forest is as follows:
1 Select k random features from the set of m features, with k ≪ m.
2 From the k features, calculate the node d that provides the best split for classification.
3 Split the node into child nodes using the best node found above.
4 Repeat Steps 1-3 until k nodes are added to the tree.
5 Repeat steps 1-4 until “n” trees are created.
The instruction to use a Random Forest structure for predictive tasks is as follows:
1 Take the test features and use each decision tree to make a prediction, then save the predictions to a list.
2 Calculate the number of votes for each prediction across the entire Random Forest.
3 Consider the prediction with the highest number of votes as the final model prediction.
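A hedged sketch of this train-and-vote workflow with scikit-learn on synthetic binary features; max_features="sqrt" plays the role of choosing k ≪ m features per split, and the labeling rule is invented:

```python
# Hedged sketch: Random Forest on a synthetic 0/1 feature matrix.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(1000, 100))         # 1000 apps x 100 binary features
y = ((X[:, 0] & X[:, 3]) | X[:, 7]).astype(int)  # synthetic "malware" labeling rule

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)                               # each tree sees a bootstrap sample
print(accuracy_score(y_te, rf.predict(X_te)))    # majority vote across the trees
```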
Many studies used RF models [8, 14, 22, 30, 53, 63, 64, 66, 67, 85, 86, 115, 122] and achieved good results. M. M. Alani [14] chose RF as one of the classifiers to create a lightweight approach to the feature dataset, in which the RF model achieved the highest accuracy rate of 98.65%. The authors used static features such as permissions, API calls, intents, and command signatures. M. L. Anupama et al. [22] extracted static and dynamic features from the dataset and applied different algorithms to the selected dataset, producing a detection rate of 95.64% for dynamic features. In [67], the authors reduced the list of permissions using multi-stage extraction and fed it to several classifier models, where RF performed the best, with an accuracy of 96.95% and an F-measure of 0.96. V. Sihag et al. [53] presented a novel obfuscation-countering method based on opcode features; four algorithms were applied with 10-fold cross-validation, in which RF achieved the highest malware detection accuracy of 98.2% on two datasets. M. Cai et al. [86] achieved the highest result using RF in the recognition process, with 99.87% on the Acc measure, using permissions and API calls. M. Dhalaria et al. [30], using nearly 2,000 samples with 13 malware families, obtained the highest result with RF of 88.6% on the Acc measure; in the same paper, the authors combined nearly 2,000 benign samples to test with two labels, giving a highest result of 90.1% on the Acc measure. H. Rathore et al. [66] used 20 malware families in the Drebin dataset; when classifying with the RF algorithm, the highest result was 93.81% on the Acc measure. In general, studies using RF report detection results with high accuracy, mostly over 90%, whether using static, dynamic, hybrid, or image conversion analysis.
Although the results of the RF model for the Android malware detection problem are impressive (>96%), they apply only to the detection problem; for classification problems, the RF model gives results of about 90%.
SVM [123] is a discriminative classifier formally defined by a separating hyperplane of n−1 dimensions in an n-dimensional data space, chosen so that the hyperplane separates the classes optimally.
SVM was first developed to classify data with two labels and was later extended to classify data with n labels. Given m elements x_1, x_2, ..., x_m in the n-dimensional space, with the corresponding labels y_1, y_2, ..., y_m taking the value 1 (positive class) or -1 (negative class), SVM finds the hyperplane with the largest margin (the optimal hyperplane). The process of finding the optimal hyperplane is expressed as the dual problem in Equation 1.7:

$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j K(x_i,x_j)-\sum_{i=1}^{m}\alpha_i \quad \text{s.t.}\ \sum_{i=1}^{m}\alpha_i y_i=0,\ 0\le\alpha_i\le C \qquad (1.7)$$

where C is a positive constant used to balance the magnitude of the margin against the total error distance, and K is a linear kernel with K(x_i, x_j) = x_i · x_j. Solving Equation 1.7 yields the elements x_i whose coefficients α_i are non-zero, called support vectors (SV). Using the support vectors, the separating hyperplane can be reconstructed, and SVM classifies new elements by Equation 1.8:

$$\mathrm{predict}(x)=\mathrm{sign}\Big(\sum_{i\in SV}\alpha_i y_i K(x_i,x)+b\Big) \qquad (1.8)$$
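As a hedged illustration of Equations 1.7 and 1.8, the sketch below trains scikit-learn's SVC with a linear kernel on synthetic binary feature vectors; the data sizes and labels are invented for the example, and the solver internally finds the α_i and support vectors described above.

```python
# Sketch: linear SVM on binary static-feature vectors (illustrative data).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 598)).astype(float)  # hypothetical feature matrix
y = rng.choice([-1, 1], size=400)                      # labels in {-1, +1}

clf = SVC(kernel="linear", C=1.0)   # C trades margin size against total error
clf.fit(X, y)                       # solves the dual problem for the alpha_i

# Only the support vectors (alpha_i > 0) define the separating hyperplane;
# the sign of the decision function is the prediction of Equation 1.8.
print("support vectors per class:", clf.n_support_)
print("predictions:", clf.predict(X[:5]))
```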
Many studies have used SVM models to detect and classify Android malware [6, 14, 21, 22, 23, 30, 31, 66, 124, 125, 126]. The use of the SVM model yields high results, and many experiments show results above 90%. M. M. Alani et al. [14] used a static feature set with the CICDroid dataset and reduced the number of features from 215 to 35, achieving 99.33% accuracy when applying SVM classifiers. Shatnawi et al. [23] implemented a static classification method using popular algorithms such as SVM, KNN, and NB on the CICInvesAndMal2019 dataset; using permissions and the API call graph, the detection results reached 94.36%. K. Shao et al. [126] reported an accuracy of 98.4% with several groups of static features and feature selection.
M. Dhalaria et al. [30] experimented with detecting malware (nearly 2,000 malware samples and 2,000 benign samples), reaching an Acc measure of 87.06%; evaluation on the malware classification problem (classifying 13 families with nearly 2,000 malware samples) resulted in an Acc measure of 86.85%. H. Rathore et al. [66] used 20 malware families in the Drebin dataset; when classified using the SVM algorithm, the highest result was 85.42% with the Acc measure.
The results when using the SVM model to detect and classify malware on Android are often lower than those of the RF model.
Thomas Cover proposed KNN, a method suggested for classification and regression problems [127]. The "k" closest training samples in the feature space are used as input vectors. The model's output varies depending on whether KNN is used for classification or regression. Test samples are classified according to their closest neighbors and assigned to the class that hosts the most of those neighbors. If the k value is 1, the test sample is assigned the class of its single closest neighbor. In regression, the output is the average of the k nearest neighbors' feature values. KNN is a type of sample-based learning algorithm and is described in detail by the following steps. Require: training sample set T, sample to be classified x, number of neighbors k. Ensure: sample label y.
1. First, initialize the distance as the maximum distance;
2. Calculate the distance dist between the unknown sample and each training sample;
3. Obtain the maximum distance max_dist among the currently collected k samples;
4. If dist is less than max_dist, take the training sample as a k-nearest-neighbor sample;
5. Repeat steps 2, 3, and 4 until the distance between the unknown sample and all training samples has been calculated;
6. Count the occurrence times of each category in the k nearest neighbor samples;
7. Select the category with the highest occurrence frequency as the category of the unknown sample.
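The steps above amount to a brute-force nearest-neighbour search followed by majority voting. A minimal sketch with scikit-learn's KNeighborsClassifier follows; the feature sizes and the five hypothetical family labels are illustrative only.

```python
# Sketch: k-NN classification mirroring steps 2-7 above (illustrative data).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(300, 50)).astype(float)  # training set T
y_train = rng.integers(0, 5, size=300)                      # 5 hypothetical families

# algorithm="brute" computes the distance to every training sample (steps 2-5).
knn = KNeighborsClassifier(n_neighbors=5, algorithm="brute", metric="euclidean")
knn.fit(X_train, y_train)   # lazy learner: simply stores the training samples

x = rng.integers(0, 2, size=(1, 50)).astype(float)          # sample to classify
# Steps 6-7: count labels among the 5 nearest neighbours and take the majority.
print("predicted family:", knn.predict(x))
```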
The KNN algorithm is less widely used for detection and classification than the RF or SVM algorithms. However, applying the KNN model also gives good results in detecting and classifying malware on Android (usually lower than the RF model) [25, 26, 27, 30]. D. T. Dehkordy et al. [27] engineered a balanced dataset and applied KNN, SVM, and the Iterative Dichotomiser 3 classifier. The results indicated that KNN produced the highest accuracy, precision, and F-measure with the processed dataset: 98.69%, 97.89%, and 98.69%, respectively.
M. Dhalaria et al. [30] experimented with detecting malware (nearly 2,000 malware samples and 2,000 benign samples), reaching an Acc measure of 85.4%; evaluation on the malware classification problem (classifying 13 families with almost 2,000 malware samples) resulted in an Acc measure of 83.91%.
DBN is a widely used deep learning framework [128]. The deep belief network is divided into two parts. The bottom part is formed by stacking multiple restricted Boltzmann machines (RBMs), each trained by the contrastive divergence (CD) algorithm. The upper part is a supervised backpropagation neural network used to fine-tune the whole network.
This model has attracted research groups and produced positive results on malware detection [50, 129, 130, 131, 132]. J. Wang et al. [50], using the DBN model, obtained accuracy results as high as 98.3%; in that case, the malware and benign samples were balanced at 8,000 samples each. In [131], the authors used a combination of features by applying image conversion methods to samples derived from the Drebin dataset, reaching a detection accuracy of 95.43%. In [132], the outcome was 98.71% accuracy while using 5,154 features from static feature sets.
CNN is a well-known neural network type, often used to learn abstract features from primary data sources of different kinds. This process extracts hidden layers from the input, which can be a vector or a matrix such as an image. Features are extracted using filters, called kernels, of small size that slide over the entire input to create new hidden layers. These hidden layers can pass through one or several pooling layers with small matrices, for example a 2x2 pooling matrix, which reduces the dimension of the hidden layers once more. Finally, backpropagation connects them to the dense layer and output classes. The CNN structure is shown in Fig. 1.8.
Proposed Methodology
The Android malware detection process includes main tasks such as pre-processing, feature techniques, malware classification, and malware removal. The feature selection and classification phases contribute most to the accuracy of detection results. Feature techniques include feature extraction, feature selection, and feature enhancement. For learning models that have achieved stability and high accuracy, any further improvement of classification results is attributed to feature quality, in classification problems in general and Android malware classification in particular. Thus, feature enhancement is critical in Android malware classification using stable learning models. However, most research today, as will be presented in Chapter 2, focuses on feature extraction, such as permission extraction from the manifest file, API extraction from DEX, and image conversion. At the time of research, there was little work on feature improvement and feature enhancement in malware classification; thus, it remains a challenging problem that needs addressing. This dissertation has studied and proposed three enhancement methods: feature improvement based on the co-occurrence matrix, feature augmentation based on the Apriori algorithm, and feature selection based on popularity and contrast value in a multi-objective approach. These three methods have been published as follows:
1. Feature selection: in addition to applying the mentioned algorithms, such as IG and TF-IDF, a new feature selection algorithm was proposed based on the following factors: popularity and contrast value (the contrast between malware and benign, and the contrast between malware families) [Pub.10]. After applying these additional methods to Android malware detection, the results achieved good metrics despite removing many features.
2. Feature augmentation based on the co-occurrence matrix: associating co-occurrence matrix features of each pair in the feature group [Pub.2]. The co-occurrence matrix is established based on a list of raw features extracted from APK files. The proposed feature can take advantage of CNN while retaining important features of the Android malware.
3. Feature augmentation based on Apriori: implement the Apriori algorithm to generate enhancement features. The Apriori algorithm was applied to each feature group [Pub.6]. The method learns association rules from the initial feature set to devise highly correlated and informative features, which are then added to the initial set.
Besides the three methods above of feature augmentation and selection applied to stable machine learning models, improving the learning models themselves is also needed in Android malware detection. With a giant and diverse sample set, Android malware detection is still a new challenge, as the OS can be installed on various machines, including phones, automotive systems, vending machines, clocks, and televisions. Also, the number of malware samples is increasing monthly, in virtual as well as native environments, which leads to a diversity of features. Conventional machine learning models such as RF, SVM, KNN, and DT are not suitable, as these models lack feature generalization ability: they cannot produce generalized features or reduce dimensionality with large sample sets, and their detection accuracy decreases sharply when the number of samples and features increases. According to the review in Chapter 2, studies using these models mainly worked with readily available feature sets, with small numbers of features and samples, and proceeded without a feature augmentation method. Therefore, there is a demand for research on learning models appropriate for Android malware detection with broad samples and features. To contribute to this problem, after conducting research, two improvements for learning models were developed:
1. WDCNN model [Pub.3]: this is an improved version of the CNN model. In the WDCNN model, more information is put into the model (the model requires two inputs, the wide component and the deep component). The results of the WDCNN model are better than those of conventional deep learning and machine learning models.
2. Federated learning method [Pub.11]: the federated learning model was used to conduct training and detection on many machines. Although the accuracy results of the federated learning model are lower, the difference is not significant, and its high processing speed allows models and real applications to be tested and deployed quickly.
In the scope of the study, the work was limited to static feature extraction, and the Drebin and AMD datasets were used for Android malware families.
Chapter Summary
Chapter 1 of the dissertation provides an overview of foundational knowledge of the Android operating system architecture and the challenge posed by the development of malicious code on Android. It then proposes solutions for classifying malicious code on the Android platform.
The dissertation is directed toward employing machine learning and deep learning models to classify Android malware. Section 1.3 emphasizes that improvements can be made by advancing features and refining the model to enhance the capability of classifying Android malware. To clarify further, Section 1.4 surveys related research on feature extraction and on machine learning and deep learning models. This exploration reveals that, despite numerous studies addressing the issue, there is still potential for improving the accuracy of Android malware classification.
Section 1.5 explicitly outlines the dissertation's contributions. The following chapters will delve into the detailed contributions regarding feature enhancement and improvements in deep learning models.
Chapter 2 PROPOSED METHODS FOR FEATURE EXTRACTION
This chapter focuses on feature set augmentation. In general, feature extraction, selection, and development techniques are very important in detection and classification problems: if the features obtained after selection and development are good, the model will give good results. There are two approaches, as follows:
• Feature set development: additional features will be generated from the initial dataset to obtain a novel set of features.
• Feature selection: eliminating features with low weights (according to the applied algorithm). The result is a feature set smaller than the original but considered the "marrow", i.e., essential for classification.
Feature Augmentation based on Co-occurrence matrix
Proposed Idea
A co-occurrence matrix is typically a symmetrical square matrix, with each row and column representing the corresponding feature vector. In classifying Android malware, features are commonly extracted statically from the APK file. These features are permissions, API calls, intents, services, and others. The extracted features are discrete entities, and the question is how to connect the characteristics present in a cluster. In the realm of Android apps, it is common for features to exhibit a degree of interdependence: a messaging app is expected to declare permissions such as SEND_SMS, RECEIVE_SMS, and READ_SMS together. Consequently, characteristics of the same category exhibit interdependence rather than autonomy. Meanwhile, Convolutional Neural Network (CNN) models are prevalent in the classification of Android malware. The CNN model receives its input from an image-like matrix, wherein neighboring pixels exhibit comparable values. The proposed approach therefore utilizes a co-occurrence matrix in the context of Android malware detection to establish a correlation between the features within each group.
The overall model applying the co-occurrence matrix to improve the feature set for the Android malware classification task is shown in Fig. 2.1. To prove the effectiveness of the co-occurrence matrix feature, two scenarios are set up: with and without the co-occurrence matrix feature computation module. The process is as follows:
Figure 2.1: Evaluation model for Android malware classification using co-occurrence matrix
1. From APK files, the raw feature extraction module extracts features, including API call strings and permission requests.
2. For the baseline architecture, the raw features go to the Normal Matrix Formation module. The module converts the raw features in string format into a vector using a dictionary of API calls and permissions. Each element in the vector has a value of 1 or 0, depending on whether the API or permission can be found in the current APK file. The vector is then reshaped into a matrix, later treated as CNN input. For the proposed architecture, the raw features go to the co-occurrence matrix feature computation module, which forms a matrix based on the concurrent presence of two APIs or permissions in the APK file.
3. Next, the CNN module is applied to learn the features and classify the APK files into benign or specific malware families.
Raw Feature Extraction
Algorithm 1: Convert string features to numbers
Input : dictionaryFeatures: a dictionary of all APIs and permissions; listFeatureFile: list of API and permission strings of a file;
Output: vectorOutput: feature vector;
1 length ← length(dictionaryFeatures);
2 vectorOutput ← new Vector(length);
3 for i ← 1 to length do
4   aFeature ← dictionaryFeatures[i];
5   if aFeature ∈ listFeatureFile then
6     vectorOutput[i] ← 1;
7   else
8     vectorOutput[i] ← 0;
9 return vectorOutput;
To extract features from APK files, decompilation tools like APKtool, Dex2jar, Baksmali, Androguard, Jadx, Jd-gui, or Androidpytool can be used. In this part, Androidpytool was used to extract features. All the features are static and extracted from two files: the XML file and the DEX file.
From the raw feature sets, the top 200 most common APIs that appear in all APK files are employed, together with 385 permission features declared and used in XML files.
In the form of strings, these features are the input of the next module in the process chain, as shown in Fig. 2.1. Algorithm 1 illustrates the implementation process to convert string features into number vectors.
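A direct Python rendering of Algorithm 1 is sketched below; the function and variable names, and the tiny dictionary, are illustrative rather than the dissertation's actual implementation.

```python
# Sketch of Algorithm 1: mark each dictionary feature as 1 if the APK
# declares/uses it, otherwise 0.
def features_to_vector(dictionary_features, list_feature_file):
    file_set = set(list_feature_file)        # O(1) membership tests
    return [1 if f in file_set else 0 for f in dictionary_features]

# Hypothetical usage with a tiny dictionary of permissions and APIs:
dictionary = ["SEND_SMS", "INTERNET",
              "android.telephony.SmsManager.sendTextMessage"]
apk_features = ["INTERNET", "SEND_SMS"]
print(features_to_vector(dictionary, apk_features))   # [1, 1, 0]
```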
Co-occurrence Matrix Feature Computation
The co-occurrence matrix was first mentioned in 1957, when linguist J.R. Firth referred to the relationship between words in a sentence: a word is represented semantically by the words around it, so the placement of words affects the sentence's meaning.
In that context, the co-occurrence matrix captures how words co-occur within a paragraph. Here, the same concept is applied to the Android malware features. The implementation of the co-occurrence matrix computation is described in Algorithm 2.
Algorithm 2: Co-occurrence matrix computation
Input : vecFeature: binary feature vector;
Output: coMatrix: co-occurrence matrix;
1 length ← length(vecFeature);
2 coMatrix ← new Matrix(length, length);
3 for i ← 1 to length do
4   for j ← 1 to length do
5     if vecFeature[i] = 1 and vecFeature[j] = 1 then
6       coMatrix[i][j] ← 1;
7     else
8       coMatrix[i][j] ← 0;
9 return coMatrix;
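For a binary feature vector, Algorithm 2 is equivalent to an outer product of the vector with itself, which the short NumPy sketch below demonstrates on a hypothetical four-feature vector.

```python
# Sketch of Algorithm 2: for a binary vector v, the co-occurrence matrix is the
# outer product v v^T, so cell (i, j) is 1 exactly when features i and j co-occur.
import numpy as np

vec_feature = np.array([1, 0, 1, 1])            # hypothetical 4-feature vector
co_matrix = np.outer(vec_feature, vec_feature)
print(co_matrix)
# [[1 0 1 1]
#  [0 0 0 0]
#  [1 0 1 1]
#  [1 0 1 1]]
```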
After converting raw features in string form to a vector of numbers, the next step is to reshape this vector into a matrix, which can later be used as input to the CNN. This step may have a huge impact on the final classification results, because the order of features might change considerably when reshaping the vector to different matrix sizes. Fig. 2.2 illustrates an example of forming an output matrix with different sizes. Because harmful malware tends to call an API together with another API or a permission request (e.g., the API CreateFile might be called together with the INTERNET_ACCESS permission in malware), CNN can learn the relationship between these two elements if they are located close to each other in the output matrix, i.e., when forming a matrix of size k by k. In contrast, CNN may lose this information if the output matrix is formed with a different size, i.e., (k+1) by (k+1), as shown in Fig. 2.2. Hence, when using CNN, the order of elements in the input vector also affects the final classification rate.

Figure 2.2: Output matrices with different sizes

The proposed co-occurrence matrix can potentially address this issue of input element reordering, because the co-occurrence matrix emphasizes the co-occurrence relationship between two elements instead of a singular cell.
Experimental Results
The present study employs the Drebin dataset to assess the proposed methodology. The dataset includes 5,438 malware files from 179 families in the Drebin dataset [146] and 6,732 benign files, including apps and games [147]. The top 10 malware families with the most samples are shown in Fig. 2.3. In the context of feature extraction, various internal feature categories exist, such as permissions, APIs, services, URLs, and intents. The present study solely concentrates on permissions and API functionalities, encompassing system and function calls within the program.
The study utilized the 398 highest-level permissions and the 200 most frequently employed API function calls across all files. Hence, each APK file comprises 598 raw features. The co-occurrence matrix is computed for each permission and API group, resulting in 158,404 permission features and 40,000 API features. The features are stored in a Comma-Separated Values (CSV) file, the input for machine learning and deep learning algorithms.
Figure 2.3: Top 10 malware families in the Drebin dataset
With the dataset described in Section 2.1.4.1, two experimental scenarios are utilized as follows:
• Scenario 1: 598 features based on permissions and APIs.
• Scenario 2: 198,404 features after applying the co-occurrence matrix to the 598 features in Scenario 1 (158,404 permission features and 40,000 API features).
Figure 2.4: CNN with multiple convolutional branches
Table 2.1: Details of parameters set in the CNN model
Layer                       Output size
Pooling_1(I)                199x199x32
Pooling_1(II)               100x100x32
Conv_2(I) + Pooling_2(I)    100x100x64
Conv_2(II) + Pooling_2(II)  50x50x64
Conv_3(I) + Pooling_3(I)    50x50x64
Conv_3(II) + Pooling_3(II)  25x25x64
Flatten (III)               50x50x64 + 25x25x64 = 200,000
2.1.4.3 Malware Classification based on CNN Model
The structure of the CNN model is shown in Fig. 2.4. The parameters of the CNN model used for the co-occurrence matrix combination feature are shown in Table 2.1.
Two datasets are used: the raw data and the data transformed with the co-occurrence matrix (described as the two scenarios). For each set, the dataset is divided into groups using k-fold cross-validation sampling with k = 10, dividing the data into ten equal, stratified parts containing both benign and malware samples, with 80% for training, 10% for validation, and 10% for testing. The cross-validation process was performed ten times, and the average of the classification results was calculated. The results of the CNN model under the mentioned conditions are shown in detail in Table 2.2. In addition, measures such as PR, RC, F1-score, and FPR are used to evaluate the results, as shown in Table 2.3.
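The split just described can be sketched with scikit-learn's StratifiedKFold; the nested inner split that carves the validation fold out of the remaining nine folds is one plausible reading of the 80/10/10 protocol, and the data here is synthetic.

```python
# Sketch of stratified 10-fold splitting with train/validation/test parts.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1200, 598))
y = rng.integers(0, 11, size=1200)     # hypothetical: 10 families + benign

outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_val_idx, test_idx in outer.split(X, y):       # 10% held out for test
    inner = StratifiedKFold(n_splits=9, shuffle=True, random_state=0)
    tr, va = next(inner.split(X[train_val_idx], y[train_val_idx]))
    train_idx, val_idx = train_val_idx[tr], train_val_idx[va]  # 80% / 10%
    # ... train the CNN on train_idx, tune on val_idx, evaluate on test_idx ...
    break   # one fold shown; results are averaged over all ten folds
```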
Table 2.2: Classification with CNN model using accuracy measure (%)
Set    CNN model with raw features    CNN model with co-occurrence matrix features
In Table 2.2, it can be seen that using the co-occurrence matrix increased the average Acc by 0.58%, and the classification difference among the 10-fold runs also decreased, from 5.5% (raw feature set) to 3.98% (co-occurrence matrix). This proves that the links between features did affect the classification results.

Table 2.3: Classification results with other measures (%)
MEASURE   CNN     CNN with co-occurrence matrix
Acc       95.78   96.23
Results using some other measures are shown in Table 2.3. It can be seen that the PR metric, when using the co-occurrence matrix feature, increased by 0.3% compared with that of the raw feature set. The F1-score metric is also better (an increase of 0.58% when using co-occurrence features). Overall, using co-occurrence feature augmentation increases the classification accuracy compared with using raw feature sets. Even though classification results are better when using the co-occurrence matrix, the overall efficiency (training and test time) was reduced due to the increase of the input from n features to n × n features. The co-occurrence matrix produced is [n × n], but half of its entries are duplicated (they do not affect the classification process).
Feature Augmentation based on Apriori Algorithm
Proposed Idea
The Apriori algorithm is a commonly employed technique in data mining. Its primary purpose is to discover association rules between various objects. In Android malware detection, features are extracted from APK files, the two significant categories of attributes being permissions and API calls. However, as extracted, these features are discrete and devoid of any interconnection. Thus, the Apriori algorithm can be utilized to learn the association rules in this particular problem.
To apply the Apriori algorithm to advance the feature set and adapt it to the malware classification problem on Android, it is processed following the procedure in Fig. 2.5, which consists of five steps as follows:
• Step 1. Extract features: based on the raw dataset of APK files, decompile the APK files and perform text preprocessing to produce the raw features.
• Step 2. Associative rule mining by the Apriori algorithm: apply the Apriori algorithm to identify patterns and associations among the extracted features.
• Step 3. Create a feature augmentation set:
Figure 2.5: The process of research and experiment using Apriori
– Utilize the results from Step 2 (Apriori algorithm) and apply them to the raw dataset obtained in Step 1.
– Combine the new associations with the existing features to create an enhanced feature set called the "feature augmentation set."
• Step 4. Model evaluation: the datasets are fed into machine learning and deep learning models to evaluate whether the new, Apriori-transformed feature set is better than the raw features. Three models, namely SVM, RF, and CNN, were utilized for the assessment.
• Step 5. Comparison and evaluation: different metrics are used to evaluate and compare the Apriori-transformed feature set and the raw dataset.
Apriori Algorithm
The Apriori algorithm was first proposed by Rakesh Agrawal, Tomasz Imielinski, and Arun Swami in 1993.
The problem is stated as follows: find all rules with support s satisfying s ≥ s_0 and confidence c ≥ c_0 (s_0 and c_0 are two values specified by users, with s_0 = minsup and c_0 = minconf). The symbol L_k denotes the set of frequent k-itemsets, and C_k denotes the set of candidate k-itemsets.
1. Find all frequent itemsets with a certain minsup.
2. Use the frequent itemsets to generate association rules with a certain minconf confidence.
The values minsup and minconf are the thresholds to be defined before generating the association rules. An itemset whose appearance frequency is at least minsup is called a frequent itemset.
The idea of the Apriori algorithm:
• Find all the frequent itemsets: use the k-itemsets (itemsets containing k items) to find the (k+1)-itemsets.
• Find all the association rules from the frequent itemsets (satisfying both minsup and minconf).
Phase 1: first, find the frequent 1-itemsets (denoted F_1). F_1 is used to find F_2 (2-itemsets), F_2 is used to find F_3 (3-itemsets), and the process continues until no frequent k-itemset is found, as shown in Algorithm 3.
Algorithm 3: Candidate generation
1 C_k ← ∅ // initialize the set of candidates
2 for all f_1, f_2 ∈ F_{k-1} // find all pairs of frequent itemsets
3   with f_1 = {i_1, ..., i_{k-2}, i_{k-1}} // that differ only in the last item
4   and f_2 = {i_1, ..., i_{k-2}, i'_{k-1}}, i_{k-1} < i'_{k-1} // according to the lexicographic order
5   c ← {i_1, ..., i_{k-1}, i'_{k-1}} // join the two itemsets f_1 and f_2
6   C_k ← C_k ∪ {c} // add the new itemset c to the candidates
7   for each (k-1)-subset s of c do
8     if s ∉ F_{k-1} then
9       delete c from C_k // prune candidates with an infrequent subset
10 return C_k // return the generated candidates
Phase 2: use the frequent itemsets acquired in Phase 1 to generate association rules that satisfy confidence ≥ minconf, as shown in Algorithm 4.
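A compact, self-contained sketch of Phase 1 is given below; it joins frequent k-itemsets into (k+1)-candidates and keeps those above min_sup, omitting the pruning step of Algorithm 3 for brevity. The toy permission transactions are invented for illustration.

```python
# Sketch of Phase 1: frequent-itemset mining over permission "transactions".
transactions = [
    {"SEND_SMS", "RECEIVE_SMS", "INTERNET"},
    {"SEND_SMS", "RECEIVE_SMS"},
    {"INTERNET", "READ_SMS"},
    {"SEND_SMS", "RECEIVE_SMS", "READ_SMS"},
]
min_sup = 0.5
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

items = {i for t in transactions for i in t}
# F[0] holds the frequent 1-itemsets; each round joins the previous level.
F = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}]
while F[-1]:
    k = len(F)
    candidates = {a | b for a in F[-1] for b in F[-1] if len(a | b) == k + 1}
    F.append({c for c in candidates if support(c) >= min_sup})

for level, itemsets in enumerate(F[:-1], start=1):
    print(f"frequent {level}-itemsets:", [sorted(s) for s in itemsets])
```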
Feature Set Creation
Definition 2.1 (The initial feature set). The initial Android feature set is defined as the features extracted from the APK samples, including benign and malware files. This feature set is denoted as F_A and represented as in Formula 2.1:

$$F_A = \{f_1, f_2, \dots, f_N\} \qquad (2.1)$$

where:
• N is the number of features.
• Each feature f_i could be a number or a string.

Algorithm 4: Association rule generation
1 for each frequent itemset f_k ∈ F, k ≥ 2 do // F is the set of all frequent itemsets
2   output the 1-item consequent rules of f_k with
3     confidence ≥ minconf
4     and support ← f_k.count / n // n is the total number of transactions in T
5   H_1 ← { consequents of all 1-item consequent rules derived from f_k above } // H_m is the set of m-item consequents
6   recursively generate rules with larger consequents from H_1
Two feature sets were extracted as follows:
• Set 1 contains permission features.
• Set 2 contains miscellaneous features such as permissions, API calls, file size, native libc usage, number of services, and existing features.
Definition 2.2 (The feature association rule). The association rule defines the correlation between two associated groups in the initial feature set. Each feature group is a subset of features. For two feature subsets X and Y, the association rule is defined as in Formula 2.2 and is examined through its support and confidence, calculated as in Formula 2.3 and Formula 2.4:

$$X \rightarrow Y, \ \text{with}\ X \subset F_A,\ Y \subset F_A\ \text{and}\ X \cap Y = \emptyset \qquad (2.2)$$

$$support = \frac{(X \cup Y).count}{n} \qquad (2.3)$$

$$confidence = \frac{(X \cup Y).count}{X.count} \qquad (2.4)$$

where:
• n is the total number of transactions.
• (X ∪ Y).count is the number of transactions that contain X ∪ Y.
• X.count is the number of transactions that contain X.
Definition 2.3 (Associated features). Associated features are created based on the association rules and satisfy the support and confidence thresholds. Based on the association rule described in Formula 2.2, the associated feature f_m is calculated as in Formula 2.5:

$$f_m = \sum_{x \in X} x + \sum_{y \in Y} y + support + confidence \qquad (2.5)$$
Definition 2.4 (The feature augmentation set). The feature augmentation set, denoted as F_C, is the union of the initial feature set and the associated features. F_C is constructed as in Formula 2.6:

$$F_C = F_A \cup F_M \qquad (2.6)$$

where:
• F_A is the initial feature set.
• F_M is the associated feature set.
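The sketch below shows how an associated feature value could be computed for one sample from a mined rule X → Y per Formula 2.5; the rule, its support/confidence values, and the sample are invented for illustration.

```python
# Sketch of Formula 2.5: value of the associated feature f_m for one sample.
def associated_feature(sample, X, Y, support, confidence):
    return (sum(sample[f] for f in X) + sum(sample[f] for f in Y)
            + support + confidence)

sample = {"SEND_SMS": 1, "RECEIVE_SMS": 1, "READ_SMS": 0}  # hypothetical sample
rule = ({"SEND_SMS"}, {"RECEIVE_SMS"}, 0.75, 1.0)          # X, Y, support, confidence
print(associated_feature(sample, *rule))                    # 1 + 1 + 0.75 + 1.0 = 3.75
# Appending this value as a new column for every mined rule yields F_C (Formula 2.6).
```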
Inputs for malware detection and classification using machine learning and deep learning methods are often numerical; therefore, normalization is performed on the extracted feature set. From the raw feature sets, the top 200 most common API calls that appear in all APK files are used, together with 385 permission features declared and used in XML files. Algorithm 1 illustrates the implementation process to convert string features into a numeric vector.
Corresponding to each data group in the dataset described above, the Apriori algorithm was implemented to expose the correlation between features in each group. For group 1 (permissions), which corresponds to the permission feature set, the permissions have a tight correlation, meaning that a permission typically comes together with one permission or a group of other permissions in an APK file. The min_sup used in this work is 0.4. After passing the first group through the Apriori algorithm, set i was acquired.
For group 2 of API calls, services, and activities that have been ranked, it is observed that the correlation between those features is not as tight as in the case of permissions; therefore, the min_sup value was set to 0.2. After passing the second group through the algorithm, set ii was acquired.
The Apriori algorithm is applied to each permission and API feature set as described in Fig. 2.6.
Figure 2.6: Apply the Apriori algorithm to the feature set
Experimental Results
The dataset from Drebin [146] comprises 5,560 malware files with 179 labels, plus 7,140 benign files, which are apps and games downloaded from [147]. The features of the files are saved in a CSV file with the number of rows equal to the number of files under analysis; the number of columns corresponds to the number of extracted features.
The top 10 malware families with the most samples are shown in Fig. 2.3.
Corresponding to the feature extraction in Section 2.2.3, the initial dataset was divided into four scenarios:
• Scenario 1: permission features, covering Android system permissions and user-defined permissions: 398 features.
• Scenario 2: features from Scenario 1 plus associated features generated with the Apriori algorithm.
• Scenario 3: a collection of static analysis features. The proposed features include the permissions from Scenario 1 (398), APIs (200), file sizes, user-defined permissions, use of native libs, number of services, and existing features, for a total of 603 features.
• Scenario 4: the collection of static analysis features plus the associated features of each set generated with the Apriori algorithm, in total: 603 (Scenario 3) + 2,132 (Apriori for permissions) + 3,085 (Apriori for APIs) = 5,820 features.
For each case, the data is divided into 10 folds. Each dataset is split 80% for training and 20% for testing (in the case of CNN, 10% for testing and 10% for validation).
2.2.4.2 Experiment based on the CNN Model
The structure of the CNN model is shown in Fig 2.7.
Figure 2.7: Architecture of CNN model used in the experiment with Apriori
The parameters of the CNN model used in this experiment are shown in Table 2.4.
Table 2.4: Details of parameters set in the CNN model
Input: n × m × 1
Layer 1: CONV1: 3 × 3 size, 32 filters, ReLU; Max Pool: 2 × 2 size
Layer 2: CONV2: 3 × 3 size, 32 filters, ReLU; Max Pool: 2 × 2 size
Layer 3: CONV3: 3 × 3 size, 64 filters, ReLU; Max Pool: 2 × 2 size
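A minimal Keras sketch of the network in Table 2.4 follows; the input size, the number of classes, and the softmax classification head are assumptions, since the table lists only the convolutional blocks.

```python
# Sketch of the CNN in Table 2.4 (head and sizes are illustrative assumptions).
from tensorflow.keras import layers, models

n, m = 64, 64          # hypothetical input matrix size
num_classes = 11       # hypothetical: 10 malware families + benign

model = models.Sequential([
    layers.Input(shape=(n, m, 1)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```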
The experimental model is detailed in Fig 2.5.
The four datasets were fed to the CNN model to evaluate the Apriori algorithm as described in Section 2.2.4.1. The measures used in the experiment are Acc, Precision, Recall, and F1-score. The results are shown in Table 2.5 and Table 2.6, which describe the effect of using the same CNN algorithm on the four datasets with and without applying the Apriori algorithm.
Table 2.5: Classification results by CNN
Data set Scenario 1 Scenario 2 Scenario 3 Scenario 4
Table 2.6: Results of using CNN with measurements (%)
             Precision   Recall   F1-score   Acc
Scenario 1   97.82       97.2     97.51      96.9
To compare more clearly with machine learning models, Fig. 2.8 shows the results in the form of a chart.
Looking at Fig. 2.8, regardless of the model, using datasets preprocessed with the Apriori algorithm results in higher performance. While the accuracy is not significantly higher, it is a prerequisite for applying the Apriori algorithm to future classification tasks. The Apriori algorithm depends on two factors:
• The value of min_sup, the threshold used when incorporating features.
• The number of iterations of the Apriori algorithm (k-itemset). When combining features, the number k increases. The larger the value of k, the more features are incorporated per itemset, which leads to fewer new features being generated (as the incorporated initial features must pass the min_sup threshold).
In the experiment, min_sup and k-itemset were not optimized. Currently, the min_sup used was 0.4 for the permission feature set and 0.2 for the API call feature set, and the k-itemset used was 2. Nevertheless, the experiment indicated that Apriori produced better results than raw features. In following studies, an optimization procedure should be implemented to determine the optimal min_sup and k-itemset values for each dataset.

Figure 2.8: Learning method implementation results
Feature Selection Based on Popularity and Contrast Value in a Multi-Objective Approach
Proposed idea
The main idea of this proposed method is to use the Pareto multi-objective optimization method to build a selection function (global function) based on three component measures: popularity, the contrast between benign and malware files, and the contrast between classes of malware.
The overall model of the method is depicted in Fig 2.9.
• First, build the component measures: popularity (M1), the contrast between benign and malware files (M2), and the contrast between classes of malware (M3).
• Second, for each feature in the raw feature set, calculate the value of the selection function based on the component measures. The selection function is the global optimal function, built on these three measures in a balanced approach between the component measures.
• Third, only features whose selection function value is greater than or equal to the threshold are selected. With the selected feature set, the data is fed into the deep learning model to evaluate the efficiency.

Figure 2.9: Proposed feature selection model
Popularity and Contrast Computation
In this section, component measures based on the value of each feature in the dataset are built. Each component measure represents a quality characteristic of the feature, and the component measures are used to construct the selection function, i.e., the global feature evaluation.
Definition 2.5 (Popularity). The popularity of each feature is a measure built on the frequency of the feature. Popularity is denoted M1 and calculated according to Equation 2.7.
Definition 2.6 (Contrast with benign). The contrast with benign is a measure that evaluates the contrast of feature values between benign and malware samples. The larger this measure, the higher the value contrast between malware and benign, and thus the better for classification. Contrast with benign is denoted M2 and calculated according to Equation 2.8.
Definition 2.7 (Contrast between classes of malware). The contrast between classes of malware measures the contrast of feature values across malware classes. It is denoted M3 and calculated by Equation 2.9:
$$M3 = \frac{\operatorname{avg}_{j=1}^{j \le |V|} \left| \sum_{s \in V_j} f_i - \sum_{s \in V,\, s \notin V_j} f_i \right|}{\max_{j=1}^{j \le |V|} \left| \sum_{s \in V_j} f_i - \sum_{s \in V,\, s \notin V_j} f_i \right|} \qquad (2.9)$$

In Equations (2.7)-(2.9):
• V_j is the set of samples with label j.
Pareto Multi-objective Optimization Method
Pareto optimization is a key method in multi-objective optimization. Calling X* the solution to be found, X* must have the following properties:
• X* must lie among the feasible solutions of the problem, satisfying the constraints: X* ∈ D.
• Any feasible alternative X ∈ D that is better on one objective (f_i(X) ≥ f_i(X*)) must also be worse on at least one other objective (f_j(X) < f_j(X*)), with i ≠ j.
X* is also called an efficient solution. That is, efficient solutions are those satisfying: there is no feasible X ∈ D such that f_i(X) ≥ f_i(X*) for every i while f_j(X) > f_j(X*) for some j. On the whole, no single X can outperform X*.
Selection Function and Implementation
The multi-objective optimization method aims not to optimize one specific component but to balance the optimization goals against each other. In this approach, each measure is a component objective function representing a particular aspect of quality, called a component objective. However, simultaneously optimizing multiple component objectives is impossible; improving one component's goal may even interfere with another.
The constructed selection function is the global objective function. The aim is to optimize globally while balancing the component goals. The selection function is denoted F and is built from the component measures and their respective weights as in Equation 2.10:

$$F = w_1 \cdot M1 + w_2 \cdot M2 + w_3 \cdot M3 \qquad (2.10)$$

where:
• w_1, w_2, w_3 are the weights corresponding to each measure. Depending on the problem and the optimization goal, the importance of each measure is judged and the weights are set accordingly.
• F is a selection function; the larger the value, the better.
The selection function is used to select suitable features. It aims to achieve the best fit and balance between the component measures; that is, it targets the global objective and the overall quality of the features. Depending on the problem and the number of features to be selected, an appropriate threshold value M_0 is set, and features with value F ≥ M_0 are chosen. Feature selection is performed according to Algorithm 5.
Algorithm 5: Feature selection based on the selection function
Input : raw feature set; weights w_1, w_2, w_3; M_0: threshold for the selection function;
Output: selected feature set;
1 for each feature f in the raw feature set do
2   compute M1, M2, and M3 for f;
3   F(f) ← w_1·M1 + w_2·M2 + w_3·M3;
4   if F(f) ≥ M_0 then add f to the selected feature set;
5 return the selected feature set;
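A small sketch of the selection step follows, assuming the weighted-sum form of F from Equation 2.10; the measure values, weights, and threshold are invented for illustration.

```python
# Sketch of Algorithm 5: keep features whose selection function F >= M0.
import numpy as np

def select_features(M1, M2, M3, w=(1/3, 1/3, 1/3), M0=0.5):
    """M1, M2, M3: per-feature measure arrays; returns indices of kept features."""
    F = w[0] * np.asarray(M1) + w[1] * np.asarray(M2) + w[2] * np.asarray(M3)
    return np.where(F >= M0)[0]

# Hypothetical measure values for five features:
M1 = [0.84, 0.20, 0.77, 0.10, 0.72]
M2 = [0.61, 0.15, 0.68, 0.30, 0.69]
M3 = [0.94, 0.40, 0.85, 0.20, 0.84]
print(select_features(M1, M2, M3))   # -> [0 2 4], the features that clear M0
```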
Figure 2.10: Top 20 malware families with the most samples in the AMD dataset
Experimental Results
In this work, the AMD malware dataset [148] was used to provide the malware part of the dataset; it contains 24,553 samples categorized into 135 varieties among 71 malware families, collected from 2010 to 2016. 19,943 samples from 65 malware families were used; some samples were left out because their malware families had too few samples. 6,771 samples were downloaded from [147] for the benign class. Thus, the classification includes 26,714 samples from 66 families. Fig. 2.10 depicts the top 20 malware families with the most samples in the AMD dataset.
The data was divided into ten equal sections of malware and benign families. Eight parts were used to train the CNN model, one part was used for validation, and the last one was used for separate testing.
Feature processing in the raw dataset
Permissions and API calls are the two main features used in this research:
877 permission features, including permissions provided by Android and declared by the user, and the top 1000 most used API call features across all samples. If a feature (permission or API call) is used in a sample, it is numbered 1; otherwise it is numbered 0. Algorithm 1 illustrates the implementation process to convert string features into numeric vectors. In this work, the dataset is called dataset_raw. The permission group is sorted in ascending order of occurrences, and the API group is sorted in descending order of occurrences in the dataset.
To evaluate and compare the proposed method with other feature selection methods, experiments were conducted on three datasets: feature set D1.1, which only includes APIs; feature set D1.2, which only includes permissions; and feature set D1.3, which includes both, as depicted in Fig. 2.11. Applying the proposed selection method to these three sets yielded the feature sets D2.1, D2.2, and D2.3. At the same time, applying the Information Gain (IG) method [118, 119] to D1.1, D1.2, and D1.3 yielded three feature sets, respectively D3.1, D3.2, and D3.3. The same CNN model was applied to all six obtained feature sets. The parameters of the CNN model are shown in Table 2.7.
Table 2.7: Details of parameters set in the CNN model for selection feature
Input: n × 1
Layer 1: CONV1: 3 × 3 size, 32 filters, ReLU; Max Pool: 2 × 2 size
Layer 2: CONV2: 3 × 3 size, 32 filters, ReLU; Max Pool: 2 × 2 size
Layer 3: CONV3: 3 × 3 size, 64 filters, ReLU; Max Pool: 2 × 2 size
FC: 7040 hidden neurons
FC: 1024 hidden neurons
FC: 66 output classes
Accordingly, the research experimented as follows: within each feature set, every feature is weighted according to the proposed algorithm and to the IG algorithm, and features below the threshold are removed. The retained features are put into the CNN model, and the Acc and Recall measures are used to evaluate the selected set of features.
The scenarios for feature selection are as follows:
• Scenario 1: the dataset is put into the proposed feature selection algorithm.
– Scenario 1.1: with the API call feature set, calculate measures M1, M2, and M3; calculate the weight for each feature using the function F.
– Scenario 1.2: with the permission feature set, calculate measures M1, M2, and M3; calculate the weight for each feature using the function F.
– Scenario 1.3: with the permission and API call feature set, calculate measures M1, M2, and M3; calculate the weight for each feature using the function F.
For Scenario 1, the function F follows Equation 2.10. For each of Scenarios 1.1, 1.2, and 1.3, three different weight settings (w_1, w_2, w_3) are applied.
• Scenario 2: the dataset is put into the IG algorithm.
– Scenario 2.1: with the API call feature set, use the IG algorithm to calculate the weight for each feature.
Figure 2.11: Experimental model when applying feature selection algorithm
– Scenario 2.2: with the permission feature set, use the IG algorithm to calculate the weight for each feature.
– Scenario 2.3: with the permission and API call feature set, use the IG algorithm to calculate the weight for each feature.
According to the proposed method, to select good features, the measures M1, M2, and M3 are calculated according to Equations (2.7), (2.8), and (2.9). The selection function F can then be calculated from the values of the respective measures and weights, where each weight represents the importance of its measure. Following the Pareto multi-objective method, these weights are selected based on experience and experiment. Feature selection results are illustrated in Table 2.8, which lists the ten features with the highest function value F. Feature selection is performed on three datasets: API, permission, and API combined with permission. Table 2.9 summarizes the results of removing features with small F values, based on the value of F under the three weight settings.
Table 2.8: Summary of feature evaluation measures and selection function values (top 10) – with the API set

ID  Feature                                   M1      M2      M3      F1      F2      F3
1   java.lang.StringBuilder.toString          0.8451  0.6114  0.9421  0.7847  0.7944  0.7710
4   java.lang.String.substring                0.7708  0.6790  0.8952  0.7557  0.7681  0.7589
7   android.content.Context.getPackageName    0.7322  0.6698  0.8458  0.7248  0.7362  0.7299
9   java.lang.String.length                   0.7228  0.6867  0.8434  0.7241  0.7361  0.7325
10  android.content.Intent.putExtra           0.7199  0.5544  0.7741  0.6757  0.6811  0.6646
Table 2.9: Summary of results with datasets and feature sets

                         Removed 300       Removed 400       Removed 500       Removed 600
Feature set              Acc     Recall    Acc     Recall    Acc     Recall    Acc     Recall

Dataset 1 - original (no removal): Acc: 96.61; Recall: 87.67
Feature with F1 score    96.38   88.48     96.42   88.71     95.44   85.37     95.59   83.1
Feature with F2 score    96.1    87        95.6    85.9      95.89   89.73     94.91   83
Feature with F3 score    95.44   88.21     96.11   88.9      95.81   87.82     94.42   85.44
Feature with IG score    97.06   81.78     96.91   81.99     96.53   80.19     96.49   83.27

Dataset 2 - original (no removal): Acc: 92.15; Recall: 84.85
Feature with F1 score    89.51   78.07     88.99   81.46     88.87   80.79     88.83   79.73
Feature with F2 score    89.74   78.26     89.51   79.91     88.95   80.89     89.14   77.71
Feature with F3 score    91.55   82.71     91.59   84.09     91.47   83.65     91.93   85.61
Feature with IG score    91.97   84.79     92.38   83.88     92.76   83.59     92.3    85

Dataset 3 - original (no removal): Acc: 98.19; Recall: 93.2
Feature with F1 score    95.44   88.21     96.11   88.9      98      92.3      97.2    92
Feature with F3 score    97.74   93.38     96.64   88.47     97.77   93.28     97.02   92.78
Feature with IG score    97.21   79.82     96.42   79.73     96.91   83.25     97.13   81.55

For comparison, in Fig. 2.12, the performances of the ML/DL algorithm were calculated after altering the dataset by deleting 300, 400, 500, and 600 features for each group. For each group, the two measures Acc (first four columns) and Recall (last four columns) of the proposed feature selection algorithm and the IG algorithm are shown. Each group consists of eight columns, respectively, as follows:
• The first four columns of each group are the Acc measure values: the first three columns correspond to the weighting functions F1, F2, and F3, and the fourth column is weighted according to the IG algorithm.
• The last four columns of each group are the values of the Recall measure: the 5th, 6th, and 7th columns correspond to the weighting functions F1, F2, and F3, respectively, and the 8th column is weighted according to the IG algorithm.
Figure 2.12: Experimental results when applying feature selection algorithm
To evaluate the proposed method more accurately, the feature selection program based on the IG algorithm from [118, 119] was also implemented. Table 2.9 and Fig. 2.12 show that:
• The Recall of the IG algorithm is lower than that of the proposed algorithm across all scenarios.
• Fig. 2.12 depicts the results for the feature sets of API calls and permissions, with the weightings F1, F2, F3, and IG and varying numbers of removed features. Between 300 and 850 features were systematically removed for each specific set (permission set, API set). The experiment showed that the proposed weightings gave stable results on the Accuracy and Recall metrics. Feature selection with IG gives relatively high Acc results but lower Recall results than F1, F2, and F3.
• Table 2.9 shows that, compared with the results obtained using dataset_raw, the Acc of the weightings F1, F2, and F3 is not significantly lower, while the Recall is even higher than with dataset_raw. Thus, the proposed method gives positive results, especially on the Recall measure.
Chapter Summary
Feature engineering is crucial in addressing the challenges of classifying, identifying, and detecting malware. This essential process encompasses various techniques, including preprocessing, feature extraction, evaluation, feature selection, and enhancement of the feature set. By applying different extraction techniques to the raw dataset, diverse feature sets can be generated. Additionally, new features can be derived from the relationships among the raw features to complement and refine the existing feature set. The augmentation of the feature set, through development and feature selection, aligns with the specific objectives of the problem, such as increasing accuracy and improving overall system performance.
According to the content presented in this chapter, the dissertation has proposed, built, and tested three methods of improving the feature set. The summary results of these proposed methods are shown in Table 2.10.
Table 2.10: Summary of results of proposed feature augmentation methods

ID  Proposed feature improvement method                      Dataset  Accuracy improvement rate  Proposed study
1   Apriori-based improvement method                         Drebin   0.8% with Acc score        [Pub.6]
2   Co-occurrence matrix-based improvement method            Drebin   0.58% with Acc score       [Pub.2]
3   Prevalence and value contrast-based improvement method   AMD      5.17% with Recall score    [Pub.10]
Evaluation of co-occurrence matrix algorithm:
In the study of [Pub.2], the co-occurrence matrix was used to improve the features input to the CNN model, and the outcomes produced are better than using raw features in terms of PR, RC, F1-score, and Acc (shown in Table 2.2 and Table 2.3).
While the feature generation algorithms mentioned above have demonstrated favorable results in the detection and classification processes, they are not without limitations. One such limitation is the creation of too many additional features: when implementing a co-occurrence matrix, the number of features grows from n to n × n. As a result, there is a significant increase in feature training and processing time.
Evaluation of method based on Apriori:
During the dissertation research, a remarkable observation emerged regarding the basic features used in the feature extraction process: these features exhibit a discrete nature and a lack of interconnectedness, while the connectivity of permissions and APIs plays a prominent role. Usually, permissions are acquired in clusters; for example, the ability to read messages is often combined with the ability to compose messages and access memory. In addition, Application Programming Interfaces (APIs) often exhibit interdependencies by calling other APIs and establishing cross-referencing relationships. This emphasizes the importance of establishing connections between features.
This chapter therefore applied feature-generation algorithms. The Apriori and co-occurrence matrix algorithms are examples of algorithms designed to facilitate the linkage of features. In addition, alternative algorithms, namely FP-Growth and K-means, have been employed to identify malware on Internet of Things (IoT) devices. The outcomes of implementing these feature-generation algorithms exhibit great potential.
In the study of [Pub.6], the Apriori algorithm produced better results than the individual features alone; the outcome is described in Table 2.5 and Table 2.6. Moreover, the results of using CNN models are better than those of other machine learning algorithms, such as SVM and RF.
Evaluation based on prevalence and contrast:
Data preprocessing is paramount in detection and classification tasks involving learning models, such as machine learning and deep learning models. Nonetheless, not all features used in these problems prove to be optimal. Although deep learning models streamline feature selection through internal convolutional layers that preprocess data, incorporating too many features can result in time-consuming analysis, training, and testing, mainly when dealing with the vast datasets prevalent today.
Furthermore, deep learning models primarily aggregate features that are near each other, as dictated by the filter matrix (e.g., nearby pixels in an image). Here, however, the features are discrete (unlike the continuous nature often encountered in image processing tasks).
This shows that learning to select good features helps solve the problem faster, while the accuracy in detection and classification can remain equivalent to that of the raw features. In the dissertation, a method was studied and proposed to select features based on multiple criteria [Pub.10]. In this study, the experiment found that it is possible to remove up to 75% of the raw features without changing the results. The results are shown in Table 2.9 and Fig. 2.12.
Apart from its advantages and disadvantages, feature generation is generally a promising research direction. Feature generation can be combined with feature selection methods such as PSO, IG, etc., to extract meaningful features and exploit the associations between features.
In the direction of feature development, the number of features is increased by combining the original features. The advantage of this method is that the accuracy of the system is improved. The disadvantage is that the training and detection times of the system become longer, because the system spends more time analyzing the application, training, and detecting (as the number of features increases).
In the direction of feature selection, the number of original features is reduced, so training and detection times are improved. However, the disadvantage of this method is that the classification results often decrease (although the decrease is small). Many data mining algorithms can be applied to augment features; in the dissertation, some typical and widely used data mining algorithms were used, such as Apriori and the co-occurrence matrix.
Chapter 3 DEEP LEARNING-BASED ANDROID MALWARE DETECTION
This chapter elucidates the use of various deep learning models to detect Android malware. The augmentation of the models and the advancement of the system are thoroughly examined. The content of this chapter is as follows:
• Application of deep learning models such as DBN and CNN in Android malware classification task.
• Presentation of the Wide&Deep CNN (WDCNN) model, an improved, more accurate version of CNN.
• Construction of a federated learning system using the CNN model, which helps reduce training and detection time and eases the deployment of applications on real systems.
Applying DBN Model
DBN Model
DBN was the first deep learning network, introduced by Professor Hinton in 2006 [128]. This model uses unsupervised training and stacks multiple unsupervised networks, such as restricted Boltzmann machines or autoencoders.
DBN has been successfully applied to image classification tasks, and many research groups in other fields have also applied it. In line with this development trend, applying deep learning models to the malware detection problem has interested many research groups. In 2014, Z. Yuan et al. [149] implemented a deep learning method (using DBN) to detect Android malware. Nevertheless, deep learning is still rarely used for detecting malware on Android.
Applying the DBN model to various Android malware datasets and setting up dif- ferent numbers of hidden layers will contribute to a more accurate evaluation of how suitable the DBN model is for detecting malware on Android.
The system structure follows the procedure in Fig. 3.1 and includes five stages:
• Stage 1: Evaluation, selection, and combination of features. During this phase, the APK file is analyzed, and significant components are identified and selected to serve as features. The present investigation used the permissions in the XML file, API calls, and the header of the DEX file as distinctive features.
• Stage 2: Training. Utilizing the DBN model explained previously, a program was composed to extract characteristics from the dataset containing malware samples and their corresponding labels. The model is fed with labeled data tailored to the specific task during training. The detection model is obtained at the end of the training process.
• Stage 3: Detection. The model trained in the previous stage is utilized to process real files for verification purposes. The model infers the labels of unlabeled input APK files.
• Stage 4: Construction of cross-validation test data (k-fold). To be unbiased, k-fold divides the data and performs cross-validation. This avoids unevenly dividing the data, where many files of a particular label would be concentrated in one part of the split data. Typically, 10-fold, 5-fold, or 2-fold cross-validation is utilized.
• Stage 5: Testing and evaluation. The model is subjected to k-fold cross-validation, which involves splitting the dataset into the k folds of train and test data built in Stage 4.
Figure 3.1: System development and evaluation process using the DBN
Boltzmann Machine and Deep Belief Network
A restricted Boltzmann machine (RBM) is an artificial neural network based on a probabilistic energy model. It consists of a set of n_v random binary variables collectively known as the visible vector v and a hidden layer of n_h random binary variables, denoted h. The connections between the layers form a bipartite graph, meaning there are no connections within the same layer. The joint probability distribution is represented through the Gibbs distribution p(v, h | θ) with energy function E(v, h), described in Equation 3.1:

$$p(v, h \mid \theta) = \frac{1}{Z} e^{-E(v, h)}, \qquad Z = \sum_{v, h} e^{-E(v, h)} \qquad (3.1)$$
Assume a training vector x of one-dimensional size d, x = (x_1, x_2, x_3, ..., x_d), forming the visible layer of the Boltzmann machine. In that case, the energy function of the RBM can be represented as in Equation 3.2:

$$E(x, h) = -\sum_{i} b_i x_i - \sum_{j} b'_j h_j - \sum_{i}\sum_{j} x_i w_{ij} h_j \qquad (3.2)$$

The hidden layer h consists of |H| hidden units, h = (h_1, h_2, ..., h_{|H|}), and the parameter mapping θ is created from the weight set w, the bias vector b, and the bias vector b'.
When x is input to the first layer, the RBM activates the hidden units based on the conditional probability. Here, the sigmoid function is used to calculate the conditional probabilities P(h_j | x) and P(x_i | h), as shown in Equation 3.3:

$$P(h_j = 1 \mid x) = \sigma\Big(b'_j + \sum_i w_{ij} x_i\Big), \qquad P(x_i = 1 \mid h) = \sigma\Big(b_i + \sum_j w_{ij} h_j\Big) \qquad (3.3)$$
A deep belief network is constructed by stacking layers of RBMs on top of each other, with the activation layer of one RBM serving as the input for the next RBM.
It uses a layer-wise approach to train the network: the network's initial values are set through unsupervised learning, and the parameters are then adjusted with an optimization algorithm so that the probability of the output given the corresponding input values is maximized. Fig. 3.2 describes the structure of DBN.
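As a hedged illustration of how one RBM layer is trained, the NumPy sketch below performs a single contrastive-divergence (CD-1) update using the conditional probabilities of Equation 3.3; the sizes, learning rate, and data are illustrative only.

```python
# Sketch: one CD-1 update for a binary RBM (illustrative sizes and data).
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_v, n_h, lr = 6, 4, 0.1
W = rng.normal(0, 0.01, size=(n_v, n_h))        # weights w_ij
b = np.zeros(n_v)                               # visible biases
c = np.zeros(n_h)                               # hidden biases

x = rng.integers(0, 2, size=n_v).astype(float)  # one binary training vector

ph = sigmoid(c + x @ W)                         # positive phase: P(h | x)
h = (rng.random(n_h) < ph).astype(float)        # sample hidden states
pv = sigmoid(b + h @ W.T)                       # negative phase: P(x' | h)
v = (rng.random(n_v) < pv).astype(float)        # reconstructed visible sample
ph2 = sigmoid(c + v @ W)                        # P(h | x')

# CD-1 gradient approximation and parameter update.
W += lr * (np.outer(x, ph) - np.outer(v, ph2))
b += lr * (x - v)
c += lr * (ph - ph2)
```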
Experimental Results
In the experiment, the following two datasets are used:
Figure 3.2: Architectural diagram of DBN application in Android malware detection
• Dataset 1: 500 APK files from VirusShare [150], with 250 benign samples and 250 malware samples. This dataset is used for malware detection (binary classification of whether an app is benign or malware). Features used in Dataset 1:
– DEX file: 21 most used header features and the top 100 most used API calls.
– XML file: 63 permissions, including 24 dangerous and 39 safe permissions.
• Dataset 2: 5,405 samples from 179 malware families in the Drebin dataset [146], with 6,730 benign samples from [147]. Features used in Dataset 2:
– DEX file: top 1000 most used API calls.
– XML file: 877 permissions, including Android OS and user-defined permissions.
Based on two datasets and the DBN model, three following experimental scenarios are suggested:
• Scenario 1: using 121 features extracted from DEX files (100 API features and 21 header features) from dataset 1.
• Scenario 2: using all features from dataset 1, i.e., all 184 features: the 121 features from scenario 1 plus an additional 63 permission features from the XML files.
• Scenario 3: using all features from dataset 2, i.e., all 1,877 features extracted from the DEX and XML files, with the top (1000) API calls used in the DEX file and the 877 permissions declared in the XML files.
Table 3.1 describes the results in scenario 1, and Table 3.2 describes the results in scenario 2. In scenarios 1 and 2, the number of hidden layers and the number of epochs are adjusted to evaluate how these hyperparameters affect malware detection results.
Table 3.1: Result with Acc measure (%) in scenario 1
Parameters | Acc | Acc malware | Acc benign; Layer: 3, …
Table 3.2: Result with Acc measure (%) in scenario 2
Parameters | Acc | Acc malware | Acc benign; Layer: 3, …
Table 3.3 describes the results in scenario 3. For scenario 3, the Acc, Precision, Recall, and F1-score measures are applied to evaluate the malware classification problem.
Table 3.3: Results with measures in scenario 3 (%)
Parameters | Acc | Precision | Recall | F1-score; Layer: 3, …
• Regarding dataset 1: the malware class detection accuracy (94% in scenario 1 and 94.5% in scenario 2) is significantly higher than that of the benign class (74.5% in scenario 1 and 65.5% in scenario 2). The overall average accuracy is decent, hovering around 80%. However, the test dataset might not be extensive enough, and the feature set used is relatively small, with 121 features in scenario 1 and 184 features in scenario 2; these factors could have influenced the results to some extent.
• Regarding dataset 2: the malware classification shows an impressive accuracy of 95%, indicating many correctly classified samples. However, the other metrics only reach around 80-89%, with the recall rate particularly low at 82%. This suggests that classification quality is uneven across families, with several families having a low detection rate. The large gap between the high accuracy and the lower values of the other metrics also indicates an imbalanced distribution of samples across the various families.
• The results obtained from the two independent datasets show accuracies ranging from 80% to 95% under the Acc metric. Moreover, these datasets, with varying feature sets, demonstrate that the DBN model is suitable for detecting and classifying malware.
Applying CNN Model
CNN Model
In 2012, Krizhevsky et al. entered the ILSVRC challenge and achieved a top-5 error rate of 16% [151]. The model used by the authors was a deep convolutional neural network called AlexNet. From that point on, deep learning models saw sharp improvement: prize-winning challenge entries were adapted to deep learning methods, and many big companies were attracted to the field, producing many technologies based on deep learning.
Since 2017, several research groups have applied CNN models to the problem of detecting malware on Android. The results showed that the CNN model could raise detection and classification results to a new level, which has led to widespread application, testing, and customization of CNNs with different datasets and feature selection methods.
In this section, the CNN model is applied to classify Android malware; static feature extraction is used, and the model is evaluated with the Acc metric.
Based on convolutional neural network theory, this model is used to solve Android malware problems. The entire model is described in Fig 3.3. In the training stage, benign or malware files are extracted according to the feature set and then converted to a numeric matrix, which is used as the model's input; the feature-pooling process can occur many times, once for each (convolution, pooling) pair. After this process, a dense neural layer is created, fully connected with the neural outputs, and labels are correspondingly mapped to those outputs. In the detection stage, APK files are likewise extracted, converted to a numeric matrix, and fed to the network; according to the weight table created in the training stage, one of the neural outputs is chosen, and the corresponding benign or malware class is the label assigned to the input file.
Figure 3.3: The overall model of the training and classification of malware using the CNN model
Experimental Results
The Drebin [146] dataset includes a total of 129,013 samples across 180 families (benign and malware). The details of the dataset are as follows:
• 179 malware families with 5,560 malware samples.
A 10-fold cross-validation test was employed on the model. An 80-10-10 split of the dataset is used for the training, testing, and validation phases.
The process of assigning labels to classes is executed in the following manner:
• The 179 malware families will be labeled from 1 to 179, respectively.
• The benign class is labeled 0.
Each APK file is extracted into a vector of 9,604 components, equivalent to the 9,604 features, the largest number of features in the dataset. Files with missing values have them filled in with 0. The features are arranged as follows:
All features are assembled into one row of a CSV file, so the feature file has 9,605 columns (9,604 feature columns and a label column); the data is thus tabular and organized in CSV files.
The raw feature set contains four types of features extracted from the manifest, which are:
• Hardware components (1): indicate the required hardware. Malware files can collect and send information, such as location, which requires GPS or network hardware.
• Requested permissions (2): represent the permissions required before installing the app.
• App components (3): represent the four component types: activities, services, content providers, and broadcast receivers.
• Filtered intents (4): internal communications between app intents facilitate sharing events and information among different app components. Malware files can exploit this mechanism to gather such sensitive information.
Android apps are written in Java and compiled into bytecode contained in a DEX file, which directly establishes the app's behavior. The following information is chosen as features:
• Restricted API calls (5): a request to use limited APIs is a suspicious action that needs monitoring, because the Android permission system restricts some critical APIs.
• Used permissions (6): required permissions that the app needs to function properly.
• Suspicious API calls (7): calls to APIs that give access to essential databases or resources.
• Network addresses (8): malware files usually require a network connection to collect data from victim devices. Some network addresses may be hackers' servers, botnets, etc.
Information is extracted from an APK file into files containing strings, which are then converted to a binary vector and stored in a CSV file. Each vector component corresponds to a feature with a value of 1 or 0: "1" represents the presence of a specific feature, while "0" indicates its absence. The first column is the label of the file. All missing values are assigned the value 0.
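A minimal sketch of this binarization follows; all_features and the sample feature strings are illustrative placeholders, not the exact feature vocabulary used in the experiments.

import csv

def to_binary_row(label, present_features, all_features):
    """First column is the label; each remaining column is 1 if the feature
    was extracted from the APK, 0 otherwise (missing values default to 0)."""
    present = set(present_features)
    return [label] + [1 if f in present else 0 for f in all_features]

all_features = ["android.permission.INTERNET",
                "android.permission.SEND_SMS",
                "Landroid/telephony/SmsManager;->sendTextMessage"]
row = to_binary_row(label=1,
                    present_features={"android.permission.INTERNET"},
                    all_features=all_features)
with open("features.csv", "a", newline="") as f:
    csv.writer(f).writerow(row)  # writes [1, 1, 0, 0]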
3.2.2.3 Malware Classification using CNN Model
The architecture model used in this experiment is shown in Fig 3.3:
The feature matrix, of shape 98x98, goes through the first convolutional layer, which has 32 filters of size 3x3; the output is a matrix of shape 98x98x32. A max pooling layer of size 2x2 with stride 2 is applied to the first layer's output, reducing the feature matrix to 49x49. Similarly, the max pooling layer's output is the input of the second convolutional layer, with 64 filters of size 3x3, which is then reduced to 25x25x64 by a second max pooling layer. The output of the final convolution and pooling layer is a feature matrix of shape 13x13x64.
A flatten layer changes the feature matrix into a vector of size 10816x1, which is then fed through a fully connected layer with 1,024 neurons. Finally, the result leaves through the output layer, whose number of neurons depends on the number of malware classes introduced in the training stage.
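The architecture just described can be sketched as follows, assuming TensorFlow/Keras; the padding choices are inferred from the stated shapes (98x98 → 49x49 → 25x25 → 13x13), and num_classes and the training settings are illustrative placeholders.

import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 180  # 179 malware families + 1 benign class (assumption)

model = models.Sequential([
    layers.Input(shape=(98, 98, 1)),                               # 98x98 feature matrix
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),  # -> 98x98x32
    layers.MaxPooling2D((2, 2), strides=2, padding="same"),        # -> 49x49x32
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),  # -> 49x49x64
    layers.MaxPooling2D((2, 2), strides=2, padding="same"),        # -> 25x25x64
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),  # -> 25x25x64
    layers.MaxPooling2D((2, 2), strides=2, padding="same"),        # -> 13x13x64
    layers.Flatten(),                                              # -> 10816 units
    layers.Dense(1024, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer labels 0..179
              metrics=["accuracy"])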
The average classification results for 10-fold are shown in Table 3.4 and visually in Fig 3.4.
The present study reveals that the outcomes obtained were superior to previous research that employed the Support Vector Machine (SVM) algorithm, which recorded an accuracy rate of 94%. This demonstrates the feasibility of using deep learning models to yield outcomes that typically surpass the efficacy of other machine learning models.
Table 3.4: Experimental results using CNN model
Set | Number of samples | Train | Test | Validation | Test accuracy rate (%)
Proposed Method using WDCNN Model for Android Malware Classification
Proposed Idea
The Wide and Deep (W&D) model has been applied successfully to flower classification and capacity prediction. The model is highly compatible with datasets aggregated from diverse sources; its applicability is shown here by using it to detect Android malware. The W&D model comprises two components: a deep component and a wide component. The deep component is tasked with extracting features from the raw feature set. The wide component retains selected features of APK files, for example, the list of used APIs and required permissions.
Fig 3.5 describes the WDCNN model operation diagram. First, the sample dataset, a set of APK files, is processed to produce a raw feature set consisting of API calls, permissions, and grey-image pixel features; each grey image is generated from the bytecode extracted from an APK file. According to Algorithm 6, the raw feature set is divided into two subsets, F_w and F_d: F_w includes general API call and permission features, while F_d consists of grey-image pixel features. F_w is fed into the wide component of the model; F_d is fed into the deep component, which uses a CNN. The model's two components, wide and deep, work as follows:
Figure 3.4: Test rate according to the 10-fold

The deep learning component can help derive new features (a highly generalizable deep learning model) based on an internal structure consisting of convolutional and pooling layers. The raw "image" features (the features in F_d) are used as the input to the DeepCNN model. It has an input matrix of 128x128 and four convolutional and pooling layers, which generalize the features. In the first layer, the convolution has 32 filters, creating 32 matrices; the max pooling size is 2x2, meaning the size of each output matrix is reduced by a factor of four, resulting in 64x64 matrices. In the second layer, using 32 filters and 2x2 max pooling, there are again 32 matrices, each reduced fourfold to 32x32. In the third layer, with 64 filters and 2x2 max pooling, 64 matrices of size 16x16 are created. The number of filters and the max pooling size in the fourth layer are the same as in layer 3, so the output is 64 matrices of size 8x8. Finally, in the flatten layer, the outputs of the fourth layer are converted to a vector of 4,096 neural units. This vector is the output of the deep component. The detailed implementation steps are shown on the left of Fig 3.5.
The wide component is a generalized linear model used for large-scale regression and classification problems [152]. This component is responsible for memorizing feature interactions. In this work, the wide component's input is the vector of API call and permission features. Since there are too many API calls in the raw dataset, only the top (1000) most popular ones in the dataset are chosen.
The API features are the top (1000) features from the raw dataset, and all permission features in the raw dataset are used as well; together they make up the wide component.

Figure 3.5: WDCNN model operation diagram

The neurons of the DeepCNN model and the wide component were combined as the input to a dense layer of 1,024 neurons, which produced the output layer as a set of labels. The detailed implementation steps are shown in Fig 3.6.
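A minimal sketch of this wiring, assuming TensorFlow/Keras: the layer sizes follow the text (128x128 image input, four convolution/pooling blocks flattening to 4,096 units, and a 1,877-dimensional wide vector of 1,000 API and 877 permission features), while the names, num_classes, and training settings are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers, Model

num_classes = 229  # 228 malware families + benign (assumption)

# Deep component: grey-image features
img_in = layers.Input(shape=(128, 128, 1))
x = img_in
for filters in (32, 32, 64, 64):            # 64x64 -> 32x32 -> 16x16 -> 8x8
    x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
deep = layers.Flatten()(x)                   # 8 * 8 * 64 = 4096 units

# Wide component: API-call and permission features
wide_in = layers.Input(shape=(1877,))        # 1000 APIs + 877 permissions

# Combine both components into a dense layer of 1,024 neurons
merged = layers.Concatenate()([deep, wide_in])
out = layers.Dense(1024, activation="relu")(merged)
out = layers.Dense(num_classes, activation="softmax")(out)

model = Model(inputs=[img_in, wide_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])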
Building Components in the WDCNN Model
The model's objective is to integrate the rapid classification capability of wide learning with the generalization capacity of deep learning. The input feature set is partitioned into two corresponding subsets. Convolutional neural networks (CNNs) implement the deep learning component; each CNN can be configured to use a combination of convolutional and pooling layers, or to rely solely on convolutional layers encompassing convolution and filtering techniques. This facilitates generalizing characteristics and decreasing the number of dimensions. The feature set must be partitioned into the wide and deep subsets accordingly. To construct the mathematical model, the following definitions are provided first:
Definition 3.1 (Initial feature set) The initial feature set, denoted by F_0, contains all features in the W&D learning model.
Definition 3.2 (Wide feature set) The wide feature set, denoted by F_w, is a subset of F_0 used for the wide learning component in the W&D learning model.
Definition 3.3 (Deep feature set) The deep feature set, denoted by F_d, is a subset of F_0 used to generalize features in the deep learning component of the W&D learning model.
Figure 3.6: Structure and parameters of the WDCNN model
Drawing from the definitions above, a comprehensive mathematical framework was constructed for the problem at hand, as illustrated in Equation 3.4, written out after the following symbol definitions:
• L is a set of labels, including benign labels and malware labels.
• f_p is a partition function that divides the set F_0 into F_d and F_w.
• ϵ_1 is the mapping from the wide feature set to vector v_1.
• ϵ_2 is the mapping from the deep feature set to vector v_2.
• v is the composite vector, the input for the W&D model.
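Taken together, a plausible written-out form of Equation 3.4, reconstructed from these definitions (the concatenation operator ⊕ and the final classification mapping are assumptions rather than the dissertation's exact notation), is:

(F_w, F_d) = f_p(F_0), v_1 = ϵ_1(F_w), v_2 = ϵ_2(F_d), v = v_1 ⊕ v_2, W&D(v) ∈ L (3.4)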
To evaluate and partition the raw feature set into a deep feature set and a wide feature set, the following definitions are proposed:
Definition 3.4 (Raw feature) A raw feature, denoted by r, is a feature that does not represent, or does not entirely convey, a behavior, operation, or attribute of malware; for example, one byte in the DEX file, i.e., one pixel in the converted "image" of the DEX file.
Definition 3.5 (General feature) A general feature, denoted by α, is a feature that represents a behavior, operation, or property of malware; for example, a permission or an API.
Definition 3.6 (Group-level general feature) A group-level general feature, denoted by g, is a feature that represents a group of malware behaviors, operations, or attributes; for example, the memory-access permission group and the file-manipulation API group can be understood as group-level general features.
The proposed solution must also tackle the problem of set division: the set F_0 needs to be divided into F_d and F_w. Since a raw feature often does not carry a complete meaning on its own, it must be transformed into a general or group-level general feature to reduce the number of dimensions; such raw features are therefore put into the set F_d. Group-level general features are usually assigned to F_w because they carry the meaning and generality of the malware. However, when the system has a large set of group-level general features, they can still be included in F_d to reduce the number of dimensions. Depending on the problem context and the level of generality, the general features and the group-level features can be included in F_d or F_w.
The partition of the feature set
A partition is a division of the initial feature set into a wide feature set and a deep feature set suitable for the problem context and the properties of the feature set. Algorithm 6 is devised to partition the feature set.
This dissertation focuses on an initial feature set composed of three subsets: the permission set, the API set, and the image converted from the bytecode in the DEX file. As previously stated, the set of pixels in the image file constitutes the raw feature set, as pixels do not fully describe a malware behavior, operation, or attribute. Permissions and APIs are two broad categories of general features, each representing a malware behavior or operation. The deep component will receive the raw features, while the wide component will receive the general features.
Algorithm 6: Partition of the feature set.
Input: F_0, the initial feature set; A, the set of behaviors/operations/attributes; G, the set of behavior/operation/attribute groups; R, the set of rules for division.
Output: F_w and F_d.
For each feature F_i in F_0, the algorithm tests whether F_i is satisfied by the set G, by the set A, or by the rules in R, and assigns F_i to F_w or F_d accordingly; finally, it returns F_w and F_d.

To implement Algorithm 6, F_d is chosen as the raw feature set, and F_w is selected as the set of all general features, including permissions and API calls.
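A hedged Python sketch of this partition follows; the membership tests and the rule table R are simplified placeholders for the checks that Algorithm 6 performs.

def partition(F0, A, G, R):
    """Split the initial feature set F0 into the wide set F_w and the deep set F_d.
    A: general features; G: group-level general features; R: rules for division."""
    F_w, F_d = set(), set()
    for f in F0:
        if f in G or f in A:
            F_w.add(f)        # general / group-level features go to the wide set
        elif f in R:
            (F_w if R[f] == "wide" else F_d).add(f)  # a rule decides the destination
        else:
            F_d.add(f)        # raw features (e.g., image pixels) go to the deep set
    return F_w, F_d

# Example: one permission, one API call, and one raw pixel feature
F_w, F_d = partition(
    F0={"android.permission.INTERNET", "sendTextMessage", "pixel_0"},
    A={"sendTextMessage"},
    G={"android.permission.INTERNET"},
    R={},
)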
Experimental Results
Drebin [146] and AMD [148] are two widely used datasets for Android malware classification. These two datasets do not contain benign samples; therefore, additional benign samples were collected from archive.org [147], a large and free application database. The dataset's composition is as follows:
• The Drebin dataset contains 5,560 samples of 179 malware families, of which 5,438 are used.
• The AMD dataset contains 24,553 samples of 71 malware families, of which 19,299 samples of 65 malware families are used.
• The benign dataset consists of 6,730 samples.
There are 16 overlapping malware families between the AMD and Drebin datasets, resulting in 228 distinct malware families used in the experiments. The total number of samples used in the experiments is therefore 31,467, across 229 families (228 malware families and one benign family).
The AMD and Drebin datasets used have the following characteristics:
• The malware files are not equally distributed across families; some families are far more dominant than others. The total number of files in the top (10) malware families is 16,684, while the next ten families contain 1,612 files (the numbers of malware files from the AMD and Drebin suites are 17,742 and 3,551, respectively).
• Some malware families have few samples, fewer than ten each. The AMD dataset alone has nine such malware families, while the combined AMD and Drebin datasets have 127.
Fig 3.7 describes the top (20) malware families of the AMD and Drebin datasets. Because of the uneven distribution of samples among the malware families, empirical analysis was performed on the top (10) and top (20) most populous families. The statistical summary is shown in Table 3.5.
Figure 3.7: Top 20 malware families in the AMD and Drebin datasets

Table 3.5: The datasets used for the experiment (columns: Dataset, Total samples, Benign, Malware, Malware source, Description)

Two distinct types of features are extracted from the dataset, namely "image" features and "string" features. For the image features, the DEX files were transformed into images with dimensions 128x128, so the dimensionality of an image feature is 16,384. The conversion process interprets each set of three bytes in the DEX file as a color pixel in the resultant image; subsequently, the color image is transformed into a monochromatic 128x128 image.
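A minimal sketch of this conversion, assuming NumPy and Pillow; the function name and the truncate-or-pad handling of the DEX bytes are illustrative assumptions.

import numpy as np
from PIL import Image

def dex_to_image(dex_path, size=128):
    data = np.fromfile(dex_path, dtype=np.uint8)
    n = size * size * 3                                  # bytes for one RGB image
    data = np.pad(data[:n], (0, max(0, n - len(data))))  # truncate or zero-pad
    rgb = data.reshape(size, size, 3)                    # 3 bytes -> 1 color pixel
    return Image.fromarray(rgb, "RGB").convert("L")      # to 128x128 grayscale

img = dex_to_image("classes.dex")
features = np.asarray(img).flatten()                     # 16,384 pixel features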
Permissions and APIs are the two most commonly utilized features in malware classification when presented as strings. A program classified as malware may need to transfer sensitive data from the targeted device to an external location, specifically the perpetrator's server. To execute this task, the program must request network-related permissions, including but not limited to INTERNET and ACCESS_WIFI_STATE. The malware may request additional permissions, such as ACCESS_FINE_LOCATION and ACCESS_COARSE_LOCATION, to access the user's location. A correlation exists between permission requests and API invocations within malware. Consequently, the present study extracts "string" characteristics, namely permissions and APIs:
• To obtain the desired permissions from APK files, an analysis of all permissions listed in the XML file was conducted.
• The tool APKtool is utilized to extract the APIs by reading DEX files [153]. All application programming interfaces (APIs) used within the dataset were extracted, followed by a statistical analysis of the frequency of API occurrence in each file in the dataset. The 1,000 most frequently used APIs are selected: the top-ranked API appears in 22,082 files, a figure that falls to 7,110 files for the API ranked one-thousandth. The findings indicate that the APIs ranked within the top 1000 exhibit favorable characteristics, being observed across numerous files within the dataset.
As described in the previous section, these extracted features are used as input to the WDCNN operation model. The entire feature set of the dataset is contained in a CSV file. The permission and API call characteristics are arranged into columns (the first column is the label), and each row corresponds to an APK file. A cell is filled with "1" if the feature is extracted from the file and "0" if the feature is absent.
To evaluate the WDCNN model, experiments are conducted as shown in Fig. 3.8, following these testing scenarios:
• Scenario 1: evaluating the components of the WDCNN model.
– Scenario 1.1: using image features in the Deep model.
– Scenario 1.2: using permission and API call features in the Wide component.
– Scenario 1.3: combining image features in the Deep model and incorporating permission and API call features in the Wide component (running the WDCNN model).
For each sub-scenario 1.1, 1.2, and 1.3, evaluation is performed by using various datasets:
– AMD + benign datasets (full dataset, top 20 dataset, and top 10 dataset) - referred to as the "Simple dataset" in the experiment.
– AMD + Drebin + benign datasets (full dataset, top 20 dataset, and top 10 dataset) - referred to as the "Complex dataset" in the experiment.
• Scenario 2: verifying the performance of the WDCNN model against alternative machine learning models such as KNN, RF, Logistic, DNN, and RNN. The following experiments were done to ensure a fair comparison.
– Scenario 2.1: comparing the performance of WDCNN against common machine learning and deep learning algorithms (RNN, DNN, RF, KNN, Logistic).
– Scenario 2.2: using an independent feature extraction scheme, proposed in [110], applied to the two malware datasets: 256 image features (256 pixels) are extracted by converting each APK file to binary to form a grayscale image, and a histogram is then used to obtain the 256 pixel values. The performance of the proposed WDCNN model is compared to the best model in [110] using this new feature set.
• Scenario 3: due to the large DEX file size in the AMD dataset, only a maximum of 48 KB of data from each DEX file can be converted into image format, so the data towards the end of larger files cannot be utilized. To address this, the Drebin dataset is employed for evaluation in this scenario. Additionally, since API calls are extracted from the DEX file, experiments were conducted combining permission features in the Wide component with image features in the Deep component (without utilizing API call features). The detailed experiments are as follows:
– Scenario 3.1: using image features in the Deep model.
– Scenario 3.2: employing image features in the Deep model and incorporating permission features into the Wide component (running the WDCNN model without using API call features).
– Scenario 3.3: utilizing image features in the Deep model and incorporating both the permission and API call features into the Wide component (running the WDCNN model with all features).
In scenario 3, evaluation used both the Drebin + benign dataset and the top (10) malware families from the Drebin + benign dataset.
The K-fold cross-validation method with k = 10 was used in this experiment. The dataset is divided into 10 parts: 8 parts for training and 2 parts for testing, of which one part is used for validation and one part for the final test. The experimental process was conducted according to the described scenarios.
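A small scikit-learn sketch of this 10-part protocol (8 parts train, 1 validation, 1 test per round); X and y are placeholders for the feature matrix and labels.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)   # placeholder feature matrix
y = np.zeros(100)                   # placeholder labels

kf = KFold(n_splits=10, shuffle=True, random_state=0)
folds = [test_idx for _, test_idx in kf.split(X)]   # 10 disjoint parts
for k in range(10):
    test_idx = folds[k]                             # 1 part for the final test
    val_idx = folds[(k + 1) % 10]                   # 1 part for validation
    train_idx = np.concatenate(                     # remaining 8 parts for training
        [folds[j] for j in range(10) if j not in (k, (k + 1) % 10)])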
Table 3.6 and Table 3.7 present the results from scenario 1, which consists of three sub-scenarios: 1.1, 1.2, and 1.3. Acc and Recall were employed as performance metrics to evaluate the model on each dataset. Each experiment underwent 10-fold cross-validation to ensure robustness.
In Fig 3.9, the outcomes of the Simple dataset are visually depicted on a chart for convenient evaluation and comparison.
Figure 3.9: Classification of malware depending on the number of labels
Table 3.6: Experimental results of the Simple dataset (%). Each row lists the ten fold results followed by their average.

Full — Accuracy: 67.85, 68.28, 65.40, 64.07, 66.45, 67.39, 66.33, 68.18, 66.44, 65.61 (average 66.60)
Full — Recall: 54.57, 42.96, 55.90, 44.41, 52.44, 45.00, 46.88, 47.56, 50.45, 46.32 (average 48.65)
Top — Accuracy: 68.37, 69.29, 65.72, 64.87, 66.92, 68.17, 66.96, 68.85, 67.85, 66.32 (average 67.33)

(b) Input: Permission + API; Model: Wide

Accuracy: 96.94, 99.61, 98.64, 98.69, 98.56, 99.30, 99.08, 98.21, 98.66, 99.43 (average 98.71)
Recall: 97.52, 99.66, 98.01, 98.63, 98.61, 99.26, 98.87, 98.52, 97.82, 99.40 (average 98.63)

(c) WDCNN model with input: Image into the DeepCNN model; Permission + API into the WideCNN model
Applying Federated Learning Model
Federated Learning Model
The primary objective of this research is to introduce a novel approach to weight synthesis in a federated learning model, one that considers the accuracy, the sample-set size, and the weight set of each client. Upon completing the training process, each workstation transmits its weight values, accuracy metric, and sample-set size to the central server. The server performs calculations based on the accuracy corresponding to each set of weights and the sample-set size on each client. Accuracy reflects the quality of the component sample set, while the sample-set size determines the impact of that sample set on the synthesis process.
System model using federated learning:
Figure 3.10: Distribution of DEX file sizes in the Drebin dataset
The proposed system model is depicted in Fig 3.11 Each client and member server uses the same CNN model.
Implement Federated Learning Model
To build a mathematical model as a basis for assessing the importance of and synthesizing the weight sets, the following definitions are introduced:
Definition 3.7 (The set of composite weights) The set of composite weights contains the weights calculated on the server from the component weight sets; it is sent back to all clients in the system for use.
Definition 3.8 (The component weight set) The component weight set is the weight set trained on each client with its individual dataset using the CNN model. This weight set is sent to the server for synthesis.
Definition 3.9 (The component dataset) The component dataset is the individual dataset used to train each client. This dataset is updated and trained with the transfer learning model to improve the set of weights.
Definition 3.10 (The importance of the component weight set) The importance of a component weight set, denoted a, is a value that evaluates the influence of this set on the composite weights.
In deep learning, the larger the training dataset, the more the network is trained and the more valuable the weights are. Therefore, the significance of the component weights is defined in terms of its dependence on the dataset size and the reported accuracy. The importance is calculated according to Equation 3.5:

a_i = k_1 · D_i / Σ_{j=1..N} D_j + k_2 · Acc_i / Σ_{j=1..N} Acc_j (3.5)

Figure 3.11: Overall model using federated learning

where:
• N is the number of clients;
• k_1 is the influence coefficient of the sample-set size;
• k_2 is the influence coefficient of the accuracy;
• D_i is the size of the individual dataset on the i-th client;
• Acc_i is the accuracy reported by the i-th client;
• a_i is the importance of the weight set W_i.
In this proposed federated learning model, each client needs to send its set of component weights along with the size of its dataset and its measured accuracy.
Based on the component weight sets and the importance values computed from the sizes and accuracies sent by the clients, the composite weight set is calculated according to Equation 3.6:

W = Σ_{i=1}^{N} a_i W_i (3.6)

where:
• N is the number of clients
• W_i is the weight set of the i-th client
• a_i is the importance of the weight set W_i
3.4.2.2 The Process of Synthesizing Weight Set
The process of aggregating weights is as follows:
• Training at clients: each client operates independently; users use the device to check for malware. At testing time, the client stores the tested file along with the features extracted from the test samples. At specified time intervals, the sample features are fed into the model on the workstation for additional training. The updated set of weights, accuracy values, and test results of the samples are then saved.
• Sending data to the server: based on the preset time in the system, each client sends its set of weights and results to the server. If a client has no data available at that time, it sends nothing. The server aggregates data based on the received information.
• Aggregating weights on the server: based on the results sent by each client, the server assigns the corresponding importance to each sending client's weight set according to Equation 3.5 and combines the sets according to Equation 3.6, as sketched below.
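The following is a hedged Python sketch of this aggregation under Equations 3.5 and 3.6; the function names and the default coefficients k1 = 0.6, k2 = 0.4 (the best pair found in the experiments below) are illustrative, not the dissertation's exact implementation.

import numpy as np

def importance(sizes, accs, k1=0.6, k2=0.4):
    """Importance a_i of each client per Equation 3.5 (k1 + k2 = 1)."""
    sizes = np.asarray(sizes, dtype=float)
    accs = np.asarray(accs, dtype=float)
    return k1 * sizes / sizes.sum() + k2 * accs / accs.sum()

def aggregate(client_weights, sizes, accs, k1=0.6, k2=0.4):
    """Composite weights W = sum_i a_i * W_i over all clients (Equation 3.6).
    client_weights: per-client list of per-layer weight arrays."""
    a = importance(sizes, accs, k1, k2)
    return [
        sum(a_i * w_layer for a_i, w_layer in zip(a, layer_group))
        for layer_group in zip(*client_weights)
    ]

# Example: three clients, each with one 2x2 weight tensor
W = aggregate(
    client_weights=[[np.ones((2, 2))], [2 * np.ones((2, 2))], [3 * np.ones((2, 2))]],
    sizes=[1000, 2000, 3000],
    accs=[0.95, 0.96, 0.97],
)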
Experimental Results
The AMD malware dataset [148] was used. This dataset contains 24,553 samples across 71 malware families, collected from 2010 to 2016. However, many malware families in this dataset contain a minimal number of files (fewer than ten per family), and some samples fail during analysis and extraction. Therefore, only the 37 malware families with at least 20 files each are kept; the total number of malware files used is 18,956. Combined with the malware, 6,771 benign files taken from [147] are utilized. Thus, the experimental data comprise a total of 38 classes and 25,707 files.
AndroPyTool is used to extract APK files into JSON format. From there, a Python program was developed to obtain the following features:
• Permissions: the permissions declared and used in the program's source code. There are two types: permissions provided by the Android operating system and permissions declared by the programmer. The total number of permissions is 877.
• API calls: the APIs declared and used. The number of APIs in the dataset is huge; therefore, the top (1000) most used APIs are taken as a feature subset.
The features are converted to numeric values based on the two groups of statically extracted features: permissions and API calls. For each APK file, a binary representation is adopted: a feature that is used is assigned the value "1", and a feature that is not used is assigned "0". The conversion from strings to numbers for the deep learning model follows Algorithm 3.
The experiment conducted in the present study mirrors that of [Pub.8]. In this experiment, instead of transferring only the number of files from the clients to the server, the transmission is improved to include the Acc measure together with the weights. The objective is to compare the Acc of both the clients and the server and to build a collection of aggregate weights derived from each set of weights transmitted to the server.
As in [Pub.8], the dataset is divided into ten parts (the files of each label are divided equally) and used in the experiments according to the following steps:
• Step 1: the training and testing process begins individually on multiple computers (train, train1, train2, train3) The data is distributed to the server (S) and three clients (CL1, CL2, CL3).
• Step 2: the server calculates the average weights from the component computers and returns the results to the clients.
• Step 3: data train4 is assigned to CL1, and train5 is assigned to CL2 for training and updating the set of weights.
• Step 4: using the updated weights from Step 3, the server updates the set of weights and sends the updated information back to the clients.
• Step 5: repeat Step 3 and Step 4 with new data assignments (train6 to CL1 and train7 to CL2).
• Step 6: After completing the previous steps, all data is combined and trained on a single computer for final testing.
Table 3.11: Average set of weights (accuracy - %)
Three scenarios were tested to compare the methods of weight synthesis, as outlined below:
Scenario 1: average weights. In this case, the clients only send the weights to the server, and the server averages them by dividing by the number of clients that sent weights. The final training and test results are denoted W1.
Scenario 2: aggregate weights depending on the number of samples. This is the direction suggested in [Pub.8]: the clients send the set of weights and the number of samples to the server, and the server calculates the aggregate weights based on the number of files each client sends. The final training and testing results are denoted W2.
Scenario 3: aggregate weights using both the samples and the Acc. The client sends the set of weights, the number of samples, and the tested Acc to the server. The server relies on the Acc value and the number of samples each client sends to assign importance to the weights according to Equation 3.5. The final training and testing results are denoted W3. This experiment is evaluated according to the influence coefficients k_1 and k_2: (k_1, k_2) is increased from (0, 1) to (1, 0) in steps of 0.1, subject to k_1 + k_2 = 1.
The training on a single machine (Step 6) is performed in each scenario; therefore, the final result W_all is the average of the three independent runs across the three scenarios.
The experimental results for scenarios 1, 2, and 3 correspond to Table 3.11, Table 3.12, and Table 3.13, respectively, and are shown in Fig 3.12. The results in Table 3.13 and the W3 value in Fig 3.12 are reported with the influence coefficients k_1 = 0.6 and k_2 = 0.4, which give the highest result when varying the coefficients (k_1, k_2). Fig 3.13 shows the results when changing the influence coefficients (k_1, k_2); this illustrates the relationship between the number of trained files and the classification results. The above results show that the proposed weighted aggregation method achieves 97.08%, the highest among the three weighting methods.
Table 3.12: Set of Weights according to the number of samples (accuracy - %)
Table 3.13: Our proposed set of weights (accuracy - %)
The CNN model was used in the experiments in Chapter 2 of the dissertation. However, to synthesize deep learning models in Chapter 3, the dissertation also covers the CNN model in Section 3.2; moreover, the CNN model is the basis for the WDCNN model proposed in Section 3.3.
In this chapter, the dissertation proposes, applies, tests, and improves several deep learning models for malware classification on Android. There are many deep learning models; the dissertation experiments with and evaluates some typical ones, such as DBN, DNN, CNN, and RNN, and also proposes the WDCNN model to improve accuracy when classifying malicious code. The proposed methods have been tested for verification and evaluation on typical datasets, such as the Drebin and AMD datasets. The results are summarized in Table 3.14 as follows.
Figure 3.12: Compare the results of the weighted aggregation methods
Table 3.14: Summary of results of the proposed machine learning and deep learning models and comparison

No. | Deep learning and machine learning model | Type | Experimental dataset | Accuracy | Source
6 | RF | Compared in dissertation | Drebin + AMD | 84.80% | [Pub.3]
7 | RNN | Compared in dissertation | Drebin + AMD | 97.26% | [Pub.3]
8 | RF | Compared in dissertation | Drebin | 92.90% | [Pub.6]
9 | SVM | Compared in dissertation | Drebin | 92.40% | [Pub.6]
Conformity of the model with the features of the dataset:
Based on the summary results in Table 3.14 and the contents presented in this chapter, the dissertation provides a general evaluation of how well each machine learning/deep learning model fits the features of the datasets. The DBN model is a feedforward neural network with many hidden layers, but it is not truly a deep learning model because it lacks generalizing layers; it is therefore ineffective when the number of features is large and the features are shallow. According to the experimental results and model evaluation, DBN is less suitable for the Android malware detection task than the proposed models. The proposed CNN model has high accuracy and is suitable for datasets with many shallow features. The WDCNN model is ideal for diverse datasets, including shallow features and generalized features extracted from shallow data; it combines a traditional deep learning component on the shallow features with traditional classification methods as a wide component. The Federated CNN model is suitable for decentralized datasets.

Figure 3.13: Classification results with the influence factor
CNN is particularly suitable for datasets with shallow features, a large number of features, a large number of samples, and many labels.
To effectively train a model on many features, it is necessary to utilize both machine learning and deep learning models. This dissertation primarily employs deep learning models, exploring the convolutional neural network (CNN) model and its variants.
The efficacy of deep learning models has been demonstrated in experiments using the same dataset and feature count as other machine learning models. In [Pub.1], the CNN model with a 10-fold cross-validation approach yielded an accuracy rate of 96.23%, surpassing the 94% accuracy achieved by SVM. The results are shown in Table 3.4.
In [Pub.3], the WDCNN model, an improvement of the CNN model, was used; it combines multiple feature sets, namely images, API calls, and permissions.
In the experiments, the WDCNN model was compared with other deep learning and machine learning models; in addition, an experiment was conducted following the settings of [110]. All experiments show that the WDCNN model outperforms the other models (by the Acc and Recall measures). The detailed results are shown in Table 3.8 and Table 3.9.
Utilizing machine learning and deep learning models has proven highly productive in detecting malware on Android. However, the prevalence of the Android OS across numerous devices poses a challenge for a server-based detection system: transmitting APK files from client devices, including mobile phones and TVs, to the server for analysis and returning the results is time-consuming. Conversely, a single server may be inadequate to meet the demand from numerous client devices, while using multiple servers raises costs. Consequently, federated learning models have been employed to identify Android malware.
As shown in [Pub.11], training on many machines has several advantages:
• The detection and classification times are short because the feature extraction and detection are done directly on the client side.