Android Malware Classification Using Deep Learning
Background Information
Android Platform
The Android platform is a software stack consisting of many components, as shown in Fig. 1.1. The platform provides users with tools and APIs to create applications (apps) for mobile phones, televisions, smartwatches, etc.
Figure 1.1: Architecture of the Android OS [37]
The Android operating system is built upon the Linux kernel (originally version 2.6). All low-level operations are carried out at this layer, including memory management, hardware communication (driver models), security tasks, and process management.
Although Android was built upon the Linux kernel, the kernel has been heavily modified. These modifications are tailored to the characteristics of handheld devices, such as limited CPU, memory, and storage, small screen sizes, and, most importantly, the continuous need for wireless connectivity.
This level contains the following components:
– Display Driver: controls the screen's display and captures user interactions (e.g., touch, gestures).
– Camera Driver: manages the camera's operation and receives data streams from the camera.
– Bluetooth Driver: controls the transmission and reception of Bluetooth signals.
– Keypad Driver: controls keypad input.
– Audio Driver: controls audio input and output devices, decoding audio signals to sound and vice versa.
– Binder IPC Driver: provides Android's inter-process communication mechanism, which applications and system services use to coordinate functionality such as connections to wireless networks (CDMA, GSM, 3G, 4G).
– M-System Driver: manages reading and writing operations on memory devices like SD cards and flash drives.
The hardware abstraction layer (HAL) provides standard interfaces that expose device hardware capabilities to the higher-level Java API framework. The HAL consists of multiple library modules, each implementing an interface for a specific hardware component, such as the camera or Bluetooth module. When a framework API calls to access device hardware, the Android system loads the library module for that hardware component.
The Android Runtime provides the libraries that any Java program needs to function correctly. It has two main components, much like the Java equivalent on personal computers. The first component is the Core Library, which contains classes such as Java IO, Collections, and File Access. The second is the Dalvik Virtual Machine, an environment for running Android applications.
This section comprises numerous libraries written in C/C++ to be utilized by software applications. These libraries are grouped into the following categories:
– System C Libraries: these libraries are based on the C standard library and are used exclusively by the operating system.
– OpenGL ES: Android supports high-performance 2D and 3D graphics with the Open Graphics Library (OpenGL), specifically the OpenGL ES API. OpenGL is a cross-platform graphics API that specifies a standard software interface for 3D graphics processing hardware.
– Media Libraries: this collection contains various code segments to support the playback and recording of standard audio, image, and video formats.
– Web Library (LibWebCore): this component enables content viewing on the web and is used to build the web browser software (Android Browser) and for embedding into other applications. It is highly robust, supporting powerful technologies such as HTML5, JavaScript, CSS, DOM, AJAX, etc.
• Java API Framework: the entire feature set of the Android OS is available through APIs written in the Java language. These APIs form the building blocks needed to create Android apps by simplifying the reuse of core, modular system components and services, which include the following:
– Activity Manager: manages the lifecycle of applications and provides tools to control Activities, overseeing various aspects of the application lifecycle and the Activity Stack.
– XMPP Service: facilitates real-time communication.
– Window Manager: manages the construction and display of user interfaces and the organization and management of interfaces between applications.
– Resource Manager: handles static application resources, including image files, audio, layouts, and strings. It enables access to embedded (non-code) resources such as strings, color settings, and UI layouts.
– Content Providers: enable applications to publish and share data with other applications.
– View System: a collection of views used to create application user interfaces.
System Apps are the apps that communicate with users. Some of these include:
– The basic apps that come with the OS, such as Phone, Contacts, Browser, SMS, Calendar, Email, Maps, Camera, etc.
– User-installed apps, like games, dictionaries, etc.
– When an app is run, a virtual machine is initialized for that runtime. The app can be an active program with a user interface, a background app, or a service.
– Android is a multitasking operating system, meaning users can run multiple programs and tasks simultaneously. However, only one instance exists for each app. This prevents the abuse of resources and generally helps the system run more efficiently.
– Applications in Android are assigned user-specific ID numbers to differentiate their privileges when accessing resources, hardware configurations, and the system.
– Android is an open-source operating system, distinguishing it from many other mobile operating systems. It allows third-party applications to run in the background. However, these background apps have a minor restriction: they are limited to using only 5-10% of CPU capacity, to prevent monopolization of CPU resources.
Overview of Android Malware
According to NIST [38], malware is defined as:
"Malware, also known as malicious code, refers to a program that is covertly inserted into another program with the intent to destroy data, run destructive or intrusive programs, or otherwise compromise the confidentiality, integrity, or availability of the victim's data, applications, or operating system. Malware is the most common external threat to most hosts, causing widespread damage and disruption and necessitating extensive recovery efforts within most organizations."
From the above definition, it can be seen that malware is harmful to users and systems. Understanding malware and how to prevent it helps protect users in today's connected environment.
The rise of malware comes with the development of the internet, especially now that all activities, including social and financial ones, can be performed online, making them subject to anonymous attacks with malicious intentions. Malware can be classified into seven types, as shown in Table 1.1 below [38, 39]:
Viruses: self-replicate by inserting copies of themselves into host programs or data files. Viruses are often triggered through user interaction, such as opening a file or running a program. They can be divided into the following two subcategories:
– Compiled Viruses: a compiled virus is executed by the operating system. Types of compiled viruses include file infector viruses, which attach themselves to executable programs; boot sector viruses, which infect the master boot records of hard drives or the boot sectors of removable media; and multipartite viruses, which combine the characteristics of file infector and boot sector viruses.
– Interpreted Viruses: interpreted viruses are executed by an application. Within this subcategory, macro viruses take advantage of an application's macro programming language to infect documents and document templates, while scripting viruses infect scripts understood by scripting languages processed by services on the OS.
Example: ILOVEYOU, CryptoLocker, Tinba, Welchia, Shlayer.
Worms: a worm is a self-replicating, self-contained program that usually executes itself without user intervention. Worms are divided into two categories:
– Network Service Worms: a network service worm takes advantage of a vulnerability in a network service to propagate itself and infect other systems.
– Mass Mailing Worms: a mass mailing worm is similar to an e-mail-borne virus but is self-contained rather than infecting an existing file.
Trojan Horses: a Trojan horse is a self-contained, non-replicating program that, while appearing benign, actually has a hidden malicious purpose. Trojan horses either replace existing files with malicious versions or add new ones to systems. They often deliver other attacker tools to systems.
Spyware: malware that can run secretly on the system without notifying users. Spyware aims to collect private information and grant remote access to bad actors, disrupting system processes. It is often used to steal financial information or private user information.
Example: DarkHotel, Olympic Vision, Keylogger
Adware: the malware most commonly used to collect user data on the system and serve ads to users without permission. Even though adware is not always dangerous, in some situations it can cause system crashes. It can redirect browsers to unsafe websites carrying Trojans and spyware. In addition, adware is one of the causes of system lag.
Ransomware: a kind of malware that gains access to private system information; it encrypts data to prevent user access, after which the attackers can exploit the situation and blackmail users. Ransomware is usually part of phishing campaigns. The attacker encrypts information that can only be decrypted with his key.
Example: RYUK, Robbinhood, Clop, DarkSide
Fileless malware: lives inside memory. This software runs from the victim system's memory (not from files on the hard disk), making it harder to detect than other classic malware. It also complicates forensic analysis, because fileless malware disappears when the system restarts.
Android OS consistently holds a high share of the mobile operating system market. According to the statistics of [1] for June 2023, Android accounted for 70.79% of the mobile market. Thus, Android OS vulnerabilities are attractive to hackers, as social and financial activities can now be performed on mobile devices. According to AV-Test [2], new types of malware are still being created annually, along with the development of an open-source OS like Android. The malware increase from 2013 to March 2022 is shown in Fig. 1.2.
Malware is a growing threat to every connected individual in the age of mobile phones and the internet. Because of the financial incentives, the number and complexity of Android malware are growing, making it more difficult to detect. Android malware is almost identical to the varieties of malware users might be familiar with on their desktops, but it targets Android phones and tablets. Android malware primarily steals private information, which can be as common as the user's phone number, emails, or contacts, or as critical as financial credentials. With that data, scammers have many unlawful options that can earn them substantial money. Some signs indicate that a mobile device is infected by malware: (1) users often see sudden pop-up advertisements on their devices; (2) batteries drain faster than usual; (3) users notice applications that they did not intentionally install; and (4) some apps do not appear on the screen after installation. Android malware appears in many forms,
Figure 1.2: The increase of malware on Android OS
such as trojans, adware, ransomware, spyware, viruses, phishing apps, or worms. Kaspersky investigated widespread malware in 2020 and 2021 and categorized it (Fig. 1.3) [40]. Malware often infiltrates via various traditional sources, such as harmful downloads in emails, browsing dubious websites, or following links from unknown senders.
Figure 1.3: Types of malware on Android OS
Common sources of Android malware:
– Infected applications: attackers can take popular programs, repackage them with malware, and re-distribute them through download links. This method is so effective that many fraudsters design or advertise new apps; naive users may follow customized download links and accidentally install or download malware to their devices.
– Malvertisements: malvertising is malware embedded in advertising and distributed through advertisements. A virus will be downloaded to the user's device if the user clicks one of these pop-ups. Blocking ads on an Android device is an effective way to prevent this kind of malware.
– Scams: phishing assaults and other standard email- or SMS-based frauds are examples of online scams. The email or message contains a link to malware, which is installed on the phone when the user clicks the link. It is one of the most common ways to infect Android phones.
– Direct download to the device: this is the most trivial way to infect a device with malware. The attacker need only connect a gadget or USB drive directly to the phone and install the malicious programs. However, this is difficult in practice, because the attacker rarely gains direct access to the victim's device.
A signature-based approach is often employed in commercial antivirus products, as its detection results attain high accuracy and precision. Malware behaviors or features are retained in a database of samples or characteristics. A malware detection system (a detector) analyzes and recognizes malware based on one or several characteristics that match pre-defined patterns. Malware signatures can be static, such as known byte sequences, or behavioral characteristics, such as network behavior. However, this method is useless for detecting unknown or zero-day malware, as their unique traits do not exist in the program database.
On the other hand, the anomaly-based method can detect unknown suspicious behavior. This method is usually based on machine learning techniques: the difference between normal and abnormal behavior can be modeled during training. Since 2017, machine learning, and deep learning in particular, have been extensively applied to malware detection on mobile devices.
Android Malware Classification Methods
Signature-based Method
In this method, the signature of sample malware is stored in a list of known threats and their indicators of compromise (IOCs). The signature can be extracted by static or dynamic analysis. The method compares the sample's signature with all the signatures stored in the database to decide whether the sample is malware.
One attribute of the signature-based method is high accuracy. To achieve that, indicators stored in the database must be accurate, have comprehensive coverage, and be updated regularly, as new malware is born rapidly. On the other hand, the signature-based method is time-consuming. The larger the number of files or apps to be checked, the longer the testing time required, because the system needs to sequentially decompile each app, extract features, and then compare each feature with the patterns defined in the database. A program can combine static and dynamic signatures, e.g., data extracted from decompiled code and behavioral data gathered while the app runs. The combination provides more comprehensive coverage, but the examination time increases considerably.
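The core of the signature-matching step described above can be sketched as a simple digest lookup. This is a minimal illustration, not any specific product's implementation; the payload strings and database contents are invented, and real detectors also match byte patterns and behavioral IOCs rather than whole-file hashes alone.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known malicious samples.
SIGNATURE_DB = {
    hashlib.sha256(b"malicious-payload-v1").hexdigest(),
    hashlib.sha256(b"malicious-payload-v2").hexdigest(),
}

def is_known_malware(sample_bytes: bytes) -> bool:
    """Return True if the sample's digest matches a stored signature."""
    return hashlib.sha256(sample_bytes).hexdigest() in SIGNATURE_DB

print(is_known_malware(b"malicious-payload-v1"))  # True: matches a stored signature
print(is_known_malware(b"benign-app"))            # False: unknown, e.g., a zero-day
```

The sketch also makes the method's main weakness visible: any sample absent from the database, however malicious, passes the check.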
Permissions, API calls, class names, intents, services, or opcode patterns are often used to spot malware. In [16], Enck et al. proposed a security service for the Android operating system called Kirin. Kirin authenticates an app at installation time using a set of protection rules designed to match the properties configured in the app. The Kirin system also evaluates configurations extracted from the installer's manifest files and compares them with the rules set up and saved in the system.
Batyuk et al. [17] applied static analysis to 1,865 top free Android apps retrieved from the Android Market. The experiments showed that at least 167 of the analyzed apps access private information such as the IMEI, IMSI, and phone number. One hundred fourteen apps read sensitive data and immediately write it to a stream, which indicates a significant privacy concern.
Dynamic analysis is highly efficient when dealing with obfuscation techniques such as polymorphism, binary packaging, and encryption. However, running the app (even in a virtual environment) also costs dynamic analysis more time than static analysis. Chen et al. [15] proposed an approach to flag dangerous samples on Android devices using static features and dynamic patterns. The static features were acquired by decompiling APK files and extracting the connections between an app's classes, attributes, methods, and variables. The program also analyzes function calls and the relationships between data threads while the Android app runs. All that information can be used to deduce threat patterns and check whether the app accesses private data or conducts any illegal operation, e.g., sending messages without permission or stealing confidential information. The experiments in the report show that the rate of malware found in 252 samples using the dynamic signature-based method is 91.6%.
Figure 1.4: Anomaly-Based Detection Technique
Despite the advantages mentioned above, there are two drawbacks to the signature-based detection method: (i) it cannot detect zero-day malware, and (ii) it can easily be bypassed by code obfuscation.
Anomaly-based Method
An anomaly-based method uses a different approach and can resolve these problems. It relies on heuristics and empirical running processes to detect abnormal activities. The anomaly-based detection technique consists of training and detection stages, as presented in Fig. 1.4. This technique observes the normal behaviors of an app over a period and uses the attributes of standard models as vectors to compare against and detect abnormal behaviors if any occur. A set of standard behavior attributes is developed in the training stage. In the detection stage, when any abnormal "vectors" arise between the model and the running app, that app is flagged as an anomalous program. This technique allows recognizing even unknown malware and zero-day attacks.
In an anomaly-based approach, application behaviors can be extracted in three ways: static analysis, dynamic analysis, or hybrid analysis. Static analysis investigates the app's source code before installation. Dynamic analysis performs the test and collects all the app data during execution, for example, API calls, events, etc. Hybrid methods use both.
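The two stages of anomaly-based detection can be sketched with a deliberately tiny model: the "standard model" is a centroid of benign behavior vectors learned in the training stage, and the detection stage flags runs that lie too far from it. All feature values, the feature choice, and the threshold below are invented for illustration; real systems learn the model and threshold from large behavior corpora.

```python
from statistics import mean

# Toy behavior vectors: counts of (API calls, SMS sent, network connections)
# observed for known-benign runs during the training stage (invented values).
normal_runs = [(10, 0, 3), (12, 1, 4), (9, 0, 2), (11, 0, 3)]

# Training stage: build the "standard model" as the per-feature mean.
centroid = tuple(mean(col) for col in zip(*normal_runs))

def anomaly_score(run):
    """Squared Euclidean distance from the learned normal-behavior centroid."""
    return sum((a - b) ** 2 for a, b in zip(run, centroid))

# Detection stage: flag runs whose score exceeds a threshold chosen on
# training data (set by hand here for illustration).
THRESHOLD = 50.0
suspicious_run = (8, 40, 30)   # e.g., mass SMS sending by malware
print(anomaly_score(suspicious_run) > THRESHOLD)  # True -> anomaly
print(anomaly_score((10, 0, 3)) > THRESHOLD)      # False -> normal
```

Because the model encodes only what normal behavior looks like, a previously unseen (zero-day) sample can still be flagged as long as its behavior deviates from the learned profile.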
However, the abnormal and expected behaviors of samples are not easily separated because of the large number of behaviors extracted. There is no firm basis for determining which behavior is normal and which is not, and it is not feasible to divide these behaviors based solely on the analyst's experience. Machine learning models are applied during training to minimize time and increase efficiency. When applying machine learning, the number of behaviors fed into the training model can be enormous, as all behaviors must be collected as features. Nowadays, many machine learning models have been applied to malware detection, such as SVM (Support Vector Machine), KNN (K-Nearest Neighbors), RF (Random Forest), etc., as well as modern deep learning models: DNN (Deep Neural Network), DBN (Deep Belief Network), CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), GAN (Generative Adversarial Network), etc. Those models will be discussed in a later section of the dissertation.
Schmidt et al. [41] analyzed Linux ELF (Executable and Linking Format) object files in an Android environment using the readelf command. The function calls read from the executables are compared with the malware database for classification using the Decision Tree learner (DT), Nearest Neighbor (NN) algorithm, and Rule Inducer (RI). This technique shows 96% accuracy in the detection phase with 10% false positives. Schmidt et al. extended their function-call-based technique to Symbian OS [42]. They extracted function calls from binaries and applied their centroid machine, based on a lightweight clustering algorithm, to identify benign and malicious executables. The technique provides 70-90% detection accuracy and 0-20% false positives.
Schmidt et al. [43] proposed a framework to monitor smartphones running Symbian OS and Windows Mobile OS, extracting system features to detect anomalous apps. The proposed framework is based on tracking clients that run on mobile devices, collecting data describing the system state, such as the amount of free RAM, the number of running processes, CPU usage, and the number of SMS messages in the sent directory, and sending it to the Remote Anomaly Detection System (RADS). The remote server contains a database to store the received features; the detection units access the database and run machine learning algorithms, e.g., AIS or SOM, to distinguish between normal and abnormal behaviors. A meta-detection unit weighs the detection results of the different algorithms. The algorithms were executed on four feature sets of different sizes, reducing the set of features from 70 to 14, thus saving 80% of disk space and significantly reducing computation and communication costs. Consequently, the approach positively influences battery life and has only a small impact on true positive detection.
Only the machine learning methods applied to malware detection on the Android system are discussed in this dissertation. The next chapter will detail the analysis used to obtain behaviors or features by static, dynamic, and hybrid methods.
Android Malware Classification Evaluation Metrics
In recognition and classification problems, some commonly used measures are Accuracy (Acc), Precision, Recall, F1-score, the confusion matrix, the ROC curve, the Area Under the Curve (AUC), etc. For classification problems with multiple outputs, there are slight differences in the use of these measures.
In the detection problem, the output has only two labels, commonly called Positive and Negative, where Positive indicates an app is malware and Negative alludes to the opposite. Hence, four definitions are provided:
• TP (True Positive): apps correctly classified as malware.
• FP (False Positive): apps mistakenly classified as malware.
• TN (True Negative): apps correctly classified as benign.
• FN (False Negative): apps mistakenly classified as benign.
While evaluating, the ratio (rate – R) of these four measures is considered:
• TPR = TP / (TP + FN): True Positive Rate.
• FNR = FN / (TP + FN): False Negative Rate.
• FPR = FP / (FP + TN): False Positive Rate.
• TNR = TN / (FP + TN): True Negative Rate.
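The four rates above can be computed directly from raw confusion-matrix counts. The counts in the example (90 malware caught, 10 missed, 5 benign flagged, 95 benign passed) are invented for illustration.

```python
def rates(tp, fp, tn, fn):
    """Compute TPR, FNR, FPR, TNR from raw confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # malware correctly detected
        "FNR": fn / (tp + fn),  # malware missed (critical for detectors)
        "FPR": fp / (fp + tn),  # benign apps falsely flagged
        "TNR": tn / (fp + tn),  # benign apps correctly passed
    }

# Hypothetical detector output on 100 malware and 100 benign apps.
r = rates(tp=90, fp=5, tn=95, fn=10)
print(r)  # TPR=0.90, FNR=0.10, FPR=0.05, TNR=0.95
```

Note that TPR + FNR = 1 and FPR + TNR = 1, so in practice only two of the four rates are independent.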
Of the four measures above, FNR is crucial: the higher this ratio, the less trustworthy the model is, because more malware apps are mistakenly recognized as benign. The false alarm rate (FPR) means benign apps are mistaken for malware, but it is not as important as the FNR.
The most popular and simplest measure is Accuracy (Acc), given in Equation 1.1:
Acc = (TP + TN) / (TP + TN + FP + FN)    (1.1)
Acc is often used in problems where the numbers of positive and negative samples are equal. For problems with a large deviation between the numbers of positive and negative samples, the Precision, Recall, and F1-score measures are often used.
• Precision is defined as the ratio of TP samples among those classified as positive (TP + FP). The formula for calculating Precision is shown in Equation 1.2:
Precision = TP / (TP + FP)    (1.2)
• Recall is defined as the ratio of TP points to those that are actually positive (TP + FN). The formula for calculating Recall is shown in Equation 1.3:
Recall = TP / (TP + FN)    (1.3)
• F1-score is the harmonic mean of Precision and Recall. The formula for the F1-score is shown in Equation 1.4:
F1 = 2 · Precision · Recall / (Precision + Recall)    (1.4)
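Equations 1.1-1.4 can be checked with a few lines of code. The confusion counts are the same invented example as before (90/5/95/10).

```python
def metrics(tp, fp, tn, fn):
    """Compute Acc, Precision, Recall, and F1 (Equations 1.1-1.4)."""
    acc = (tp + tn) / (tp + tn + fp + fn)               # Equation 1.1
    precision = tp / (tp + fp)                          # Equation 1.2
    recall = tp / (tp + fn)                             # Equation 1.3
    f1 = 2 * precision * recall / (precision + recall)  # Equation 1.4
    return acc, precision, recall, f1

acc, p, r, f1 = metrics(tp=90, fp=5, tn=95, fn=10)
print(round(acc, 3), round(p, 3), round(r, 3), round(f1, 3))
```

With these counts, Acc is 0.925, while F1 (about 0.923) balances the slightly higher Precision against the lower Recall; on imbalanced data the two can diverge far more sharply, which is why F1 is preferred there.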
When there are multiple labels as output in a classification problem, it can be reduced to a detection problem for each class, considering the data belonging to the class under consideration as positive and all the remaining data as negative. Thus, there is a pair of precision and recall values for each class. The concepts of micro-average and macro-average are used to evaluate the classification problem.
Micro-average precision and micro-average recall are calculated as in Equation 1.5:
Micro-precision = Σ_c TP_c / Σ_c (TP_c + FP_c),  Micro-recall = Σ_c TP_c / Σ_c (TP_c + FN_c)    (1.5)
where TP_c, FP_c, and FN_c respectively are the TP, FP, and FN of class c.
Macro-average precision is the average of the per-class precisions; macro-average recall (called Recall: the average true classification rate of each class of malware and benign) is defined analogously over per-class recalls, as given in Equation 1.6:
Macro-precision = (1/C) Σ_c Precision_c,  Macro-recall = (1/C) Σ_c Recall_c    (1.6)
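The difference between the two averaging schemes in Equations 1.5 and 1.6 is easiest to see in code: micro-averaging pools the raw counts over all classes, while macro-averaging computes each class's precision first and then averages, so small families count as much as large ones. The three-class counts below are invented.

```python
# Per-class confusion counts for a toy 3-class problem (invented numbers):
# each entry is (TP_c, FP_c, FN_c).
per_class = {
    "trojan": (50, 10, 5),
    "adware": (30, 5, 10),
    "benign": (100, 5, 5),
}

def micro_precision(counts):
    """Pool counts over classes, then compute one ratio (Equation 1.5)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    return tp / (tp + fp)

def macro_precision(counts):
    """Average the per-class precisions (Equation 1.6)."""
    return sum(c[0] / (c[0] + c[1]) for c in counts.values()) / len(counts)

print(round(micro_precision(per_class), 3))  # 0.9
print(round(macro_precision(per_class), 3))  # 0.881
```

The gap between the two values (0.9 vs. 0.881) grows when class sizes are imbalanced, which is exactly the situation in family-level malware datasets such as Drebin.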
Of the measures mentioned above, Acc and Recall are used for this classification problem in the experiments.
Android Malware Datasets
Many datasets have been published for the research community as follows:
• Contagio Mobile: released in 2010 and last updated in 2010. It consists of only 189 malware samples, without benign ones. This dataset is public.
• Malgenome: samples were collected from 2010 to 2011 and published in 2012. The dataset contains 1,260 malware samples. However, the dataset was decommissioned in 2021.
• VirusShare: a repository of malware, provided publicly to users since 2011. This dataset includes only malware files, without labels.
• Drebin: samples were collected from 2010 to 2012 and published in 2014. The dataset consists of 5,560 malware samples divided into 178 malware families. The number of files in each family is not balanced: some families have only one or a few (fewer than 10) malware files, while others may have more than 1,000 files. Furthermore, Drebin also provides 123,453 benign samples in the form of extracted features.
• PRAGuard: released in 2015, PRAGuard consists of 10,479 malware samples without malware family labels. PRAGuard was created by transforming MalGenome and Contagio Minidump data with seven different obfuscation techniques. In April 2021, this dataset was decommissioned.
• AndroZoo: created in 2016 and still being updated. It provides both malware and benign apps in large quantities. However, AndroZoo only provides the apps themselves, which have not been classified into families. So far, the number of files offered is over 20 million, in the form of APKs.
• AAGM: made public in 2017, it consists of 3 categories: Adware with 250 apps, General Malware with 150 apps, and Benign with 1,500 apps from Google Play.
• AMD: malware was collected from 2010 to 2016 and made public in 2017.
• CICMalDroid 2020: samples collected in 2018 and published in 2020, with 13,077 files in 5 categories (Adware, Banking Malware, SMS Malware, Mobile Riskware, Benign).
• InvesAndMal (CIC InvesAndMal 2019): samples collected in 2017 and published in 2019, with 5,491 files. This dataset is divided into four malware categories (Adware, Ransomware, Scareware, SMS Malware) comprising 42 families, and the benign portion accounts for 5,000 samples. It is currently still public.
• MalNet 2020: the dataset was published in December 2020 with 1,262,024 samples. It is essentially downloaded from AndroZoo but provides features extracted from the FCG (Function Call Graph) and images. The dataset is divided into 696 malware families and 47 malware types. APK files cannot be directly downloaded from MalNet's homepage (https://mal-net.org/); the author team only provides SHA256 hashes for downloading from AndroZoo.
In the experiments in the doctoral dissertation (including experiments in journal articles and conference papers), the following datasets were used:
• VirusShare: in the conference paper FAIR [Pub.4], a small number of samples, 500 in total (250 malware and 250 benign), were used. Since the numbers of malware and benign programs are balanced, the only measure applied is accuracy.
• Drebin: this is a well-known dataset used in many papers by local and foreign authors. During this research work, the Drebin dataset was used repeatedly, for example:
– [Pub.1]: this research experimented on the entire Drebin dataset (including both the benign and malware samples provided). The article showed that using a CNN model had advantages over the original Drebin SVM model. Because the Drebin dataset has a significant imbalance of samples between families, additional measures were also applied to obtain a better evaluation.
– [Pub.2, Pub.6]: these journal papers utilized the entire Drebin malware set combined with 7,140 benign samples from a different source. Multiple measurements were performed to evaluate the feature selection in the papers.
• AMD: similar to Drebin, this dataset is widely used by researchers due to the large quantity and variety of samples.
– In [Pub.10], the 65 families with the most samples were appropriate for the research. The [Pub.11] study used the AMD dataset with families having at least 20 samples (35 families in total).
– In [Pub.3], Drebin and AMD were employed as malware data.
Table 1.2: Summary of Android malware datasets (columns: ID, Dataset, Description, Samples, Benign, Families, Published)
To evaluate the quality of a dataset, the dissertation uses several criteria: the number of samples, the number of labels, the distribution of samples among classes, and the level of updating of the datasets. These criteria help ensure that a dataset is comprehensive, well-labeled, balanced, and up-to-date, which increases the reliability and generalization of the research results.
The quality of classification depends on the dataset:
Based on the above datasets, some are suitable for malware detection tasks (providing only malware, not divided into many families) and others for classification tasks (in which many malware families are labeled within the malware). They also need to be combined with a separate benign set (which can be downloaded from sources such as AndroZoo, Google Play, etc.). With the same machine learning or deep learning algorithm, adaptation to each dataset gives different results, because the features extracted from each dataset (a set of samples) are different. Even assuming all datasets have good quality, there is still a clear difference between datasets due to their different years of publication: each year, Google provides new Android versions with many changes, so the features extracted from each set differ. Some datasets have specific characteristics, such as datasets containing C++ code instead of just Java code, datasets containing scrambled code that is not simply readable like regular code, encrypted datasets, or datasets with code rearranged into different positions, etc. From the above, it can be seen that the quality of each dataset significantly affects the classification quality.
Modification and advancement of the dataset:
The investigation conducted in the dissertation indicates that the labeled datasets exhibit a discrepancy among distinct malware families. The Virusshare and Androzoo datasets, which furnish APK files, exhibit a partiality towards specific labels when subjected to labeling software, despite their lack of inherent labeling. Consequently, this research has incorporated multiple supplementary evaluation metrics to furnish a more all-encompassing appraisal of the correlation among diverse families with varying quantities, including but not limited to precision, recall, and F1-score.
Machine Learning-based Method for Android Malware Classification
The problem of malware classification on the Android platform is described in Fig. 1.5. In general, there are four steps involved in Android malware classification.
An APK file is a compressed file containing other files such as AndroidManifest.xml (hereafter called the XML file) and classes.dex (hereafter called the DEX file), etc. Features extracted from APK files form a dataset and serve as input to training models. Features are critical to a model and are the key components for the model to make true or false decisions. Arrays of features can be collected via static analysis, dynamic analysis, or hybrid techniques and then tailored to make a feature set. For example, index classes can be transformed into image features, groups of dynamic features such as permissions, API calls, intents, etc. can be collected, or the file code can be transformed into a "smali" file. A set of extracted features can then be defined as a feature set.
Features of the original dataset (the original feature dataset) are transformed into binary form, and these binary values can then be specified differently as:
• Images: transformed from text to binary or hex code for image point collection.
• Frequency: the occurrence frequency of attributes (permissions, API calls, etc.) in each application.
• Binary encoding: if the defined behavior takes place, pass "1"; else pass "0".
Figure 1.5: Overview of the problem of detecting malware on the Android platform
• Relationship weight: apply a mathematical model to retrieve the relationship between features and assign weights to newly acquired relationships (for example, the relationship between APIs).
Mathematical formulas such as the TF-IDF algorithm (Term Frequency–Inverse Document Frequency), IG (Information Gain), PSO (Particle Swarm Optimization), GA (Genetic Algorithm), etc. can be used to assign weights to each feature. Besides, the above algorithms can be used to evaluate the importance of each feature in the dataset. Many studies did not assign a new weight to each feature but used the original dataset directly.
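As an illustration of such weighting, TF-IDF can be computed over a binary feature matrix. This is a minimal sketch: the three-feature matrix is invented for the example (it is not taken from any dataset discussed here), and the smoothed IDF form shown is one common variant.

```python
import math

# Hypothetical binary matrix: rows = apps, columns = features
# (1 = feature present in the app, 0 = absent).
apps = [
    [1, 1, 0],   # app 0 declares features 0 and 1
    [1, 0, 1],   # app 1 declares features 0 and 2
    [1, 1, 1],   # app 2 declares all three
]

n_apps = len(apps)
n_feats = len(apps[0])

# Document frequency: in how many apps does each feature occur?
df = [sum(row[j] for row in apps) for j in range(n_feats)]

# Smoothed IDF variant: idf = ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_apps) / (1 + df[j])) + 1 for j in range(n_feats)]

# TF-IDF weight of each cell: tf (here 0/1) times idf.
tfidf = [[row[j] * idf[j] for j in range(n_feats)] for row in apps]

# Feature 0 appears in every app, so it receives the smallest weight.
print(idf)
```

A feature present in every app (like feature 0 above) carries little discriminative information, which is exactly why its IDF weight is the lowest.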
The dataset can contain hundreds or tens of thousands of features. Of course, the more features put into the training model, the longer the training time. On the other hand, many features are not necessarily suitable for the classification problem. Therefore, feature selection is also a problem studied in classification in general and malware classification in particular. Which features to choose and which to remove depend on the criteria given by each researcher; for example, it is possible to set a threshold of accuracy, recall, etc. to stop the feature removal, or to rely on the weights of each feature in the original dataset and remove the features below a chosen threshold.
The features used in classifying Android malware are primarily discrete. They are related to each other within the application. Features often come in groups (for example, calling ACCESS_FINE_LOCATION will include the getFromLocation() system call). Finding the relationship between features in each such application is complex. Therefore, data engineers have developed many methods to enhance the input dataset. The augmentation approach can generate more features based on their correlation or hybridization, or reduce the number of features used as input data. Some feature generation methods, such as Apriori, K-means, FP-growth, etc., can be mentioned. On the other hand, commonly used dimensionality reduction techniques are low-variance filtering and generalized discriminant analysis.
Applying new models in training always piques interest in data classification problems. Researchers have applied many machine learning models to improve classification quality, especially in image classification. New methods and models are developed based on existing studies, and deep learning emerges as an evolution of traditional machine learning. In deep learning, the typical model used is CNN. Along with CNN, there are many variations, such as VGG-16, VGG-19, ResNet, etc. It can be seen that developing or applying a new model to a classifier is of great significance. Many research groups have applied these models to other problems, including the Android malware detection problem.
Related Works
Related Works on Feature Extraction
The overview model of feature extraction is described in Fig. 1.6. The current research follows these feature extraction methods:
1. Static feature extraction: analyzing source code (via reverse engineering) to get the features as strings (string type) from the file.
2. Dynamic feature extraction: take each APK file and run it in an isolated environment (e.g., a sandbox independent of the operating system environment), just like installing an app and running each module in that app. The desired features can be extracted during execution.

Figure 1.6: General model of feature extraction methods
3. Hybrid feature extraction: combining the static method (1) and the dynamic method (2).
4. Image conversion: usually, an APK file, DEX file, or XML file can be transformed into a sequence of bytes, and then an image can be created from the binary sources. The image here can be a grayscale image or a color RGB image. This image conversion method can also be considered a static method; however, the features are not necessarily analyzed thoroughly to produce internal strings as in the static method, so in this study, image conversion analysis will be categorized separately.

a) Static Extraction Method
The static analysis method decompiles the APK package. It analyzes the internal characteristics, thereby collecting suspicious characteristics in the decompiled code files, and those "suspicious attributes" in the form of strings are called features. The static method has many advantages, such as:
The number of "strings" extracted from the decompiled code of the APK package is enormous. They can be divided into many groups: permission, API call, main activity, package name, opcode, intent, hardware component, strings, and system commands. Each group plays a vital role in malware detection, but three groups are widely used in static analysis:
• Permission: these are the permissions declared in the XML file. There are two types of permission: permission provided by the Android operating system and permission declared by the programmer.
• API call: an API call describes the working process of an app. An API call combines the class name, method name, and descriptor.
• Opcode: opcode describes the instruction script for the data operation. The Dalvik register set, instruction set, and instruction set architecture differ from those in the JVM but are similar to the assembly instructions in x86. In opcode, there are many types of instructions, such as data definition, object operation, data calculation, field operation, method call, data operation, array operation, comparison, jump, data conversion, and synchronization.
Permission is always an important feature. Many researchers only use permissions to characterize malware, as in [26,28,29,61,62,63,64,65,66,67,68]. Each paper used a different dataset, so the number of features used differs. There are many methods to standardize data from strings to numbers, such as one-hot encoding, label encoding, ASCII encoding, Unicode encoding, and word embedding. Most of the research using permission features uses standardized one-hot encoding. This normalization technique builds a vector containing features and converts each value into binary features, containing only 1 (feature appears) or 0 (feature does not appear) in each application. D. Sahin et al. [28] used 76 permission features extracted from an XML file with the sample dataset named MoDroid, which includes 200 malware apps and 200 benign apps. In addition, the authors also used 102 permissions as the feature set from 1,000 malware apps of the AMD dataset and 1,000 benign apps from APKPure. The features have been normalized according to one-hot encoding. When experimenting with many machine learning methods, the highest accuracy detection result was 95.6% using linear regression.
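The one-hot normalization described above can be sketched as follows; the permission vocabulary and the two app manifests are invented for the example, not drawn from any of the cited datasets.

```python
# Build the feature vocabulary from all observed permissions.
apps = {
    "app_a": {"SEND_SMS", "INTERNET"},
    "app_b": {"INTERNET", "CAMERA"},
}
vocab = sorted(set().union(*apps.values()))   # ['CAMERA', 'INTERNET', 'SEND_SMS']

# One-hot encode: 1 if the app declares the permission, else 0.
encoded = {name: [1 if p in perms else 0 for p in vocab]
           for name, perms in apps.items()}

print(vocab)
print(encoded["app_a"])   # [0, 1, 1]: CAMERA absent, INTERNET and SEND_SMS present
```

Every app is thus mapped onto a fixed-length binary vector over the same vocabulary, which is the form expected by the training models discussed below.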
In addition, many groups also use datasets extracted from permission features, such as CICMaldroid [64,69], Virusshare, Androzoo [63], Genome [61], etc., or mix samples of several datasets, and have acquired positive results, with the accuracy of detection models reaching over 90%.
Sensitive permissions are represented in Table 1.3. Those permissions, when used, may indicate an app prone to being malware.
Permission group | Permission | Function
CALENDAR | READ_CALENDAR, WRITE_CALENDAR | Used for runtime permissions related to the user's calendar.
CAMERA | CAMERA | Used for permissions that are associated with accessing the camera or capturing images/video from the device.
CONTACTS | READ_CONTACTS, WRITE_CONTACTS, GET_ACCOUNTS | Used for runtime permissions related to contacts and profiles on this device.
LOCATION | ACCESS_FINE_LOCATION, ACCESS_COARSE_LOCATION | Used for permissions that allow accessing the device location.
MICROPHONE | RECORD_AUDIO | Used for permissions that are associated with accessing microphone audio from the device.
PHONE | READ_CALL_LOG, WRITE_CALL_LOG, ADD_VOICEMAIL, USE_SIP | Used for permissions that are associated with telephony features.
SENSORS | BODY_SENSORS | Used for permissions that are associated with accessing body or environmental sensors.
SMS | SEND_SMS, RECEIVE_SMS, READ_SMS, RECEIVE_WAP_PUSH | Used for runtime permissions related to the user's SMS messages.
STORAGE | READ_EXTERNAL_STORAGE, WRITE_EXTERNAL_STORAGE | Used for runtime permissions related to the shared external storage.
API call features, like permissions, are used in much research. Many studies have used API call features exclusively to demonstrate the effectiveness of these features in detection and classification [4,5,7,8,9,10,11,12,13,70,71]. Usually, other feature groups treat each feature as a separate one that is unrelated to others. Therefore, it is common to convert each feature to a vector (numeric form) to include in training models. In the case of API calls, each feature is inherently related to the others and forms an internal chain of API calls, or API calls combine with the app to form a chain of API calls and apps. Such a chain between APIs, or between APIs and apps, is called the Function Call Graph (FCG). The FCG-based method has only been used recently for Android malware detection, typically in the studies [5,11,13,72]. J. Kim et al. [5] extracted APIs from 10,654 malware apps from Virusshare to build an API call graph. The detection results when using CNN achieved an accuracy of 91.27%. H. Gao et al. [11] proposed a system named GDroid that implements the Graph Convolutional Network (GCN); the FCG is then fed into the training model for classification. The paper used many malware datasets, such as AMGP, DB, AMD, and Drebin, and attained a high accuracy of 98.99%. Q. Li et al. [13] used datasets named Genome, Drebin, and Faldroid with the API call graph feature and the GCN model, resulting in a 93.8% accuracy in malware detection. D. Zou et al. [71] combined the conventional graph-based method with the high scalability of the social network analysis-based method for malware classification. APIs were extracted from a dataset of more than 8,000 malware and benign apps to create the call graph, and the results achieved an F-measure of 97.1% with a false positive rate of less than 1%.
Besides, many papers still convert API calls to vectors [9,14,70,73]. Transforming API calls into vectors as input to the model also produces good results. S. K. Sasidharan et al. [70] trained a model using the Profile Hidden Markov Model (PHMM); API calls and methods from malware in the Drebin dataset were transformed into an encoded list and trained with a proportion of 70% for training and 30% for testing. The result's accuracy reached 94.5% with a 7% false positive rate. Precision and recall acquired 0.93 and 0.95, respectively.
Although not used as much as permissions and API calls, many studies have used opcodes exclusively in malware detection problems, such as [32,52,53,74,75,76,77,78,79,80,81]. The extracted opcodes were converted to grey images and put into a deep-learning model, resulting in a detection accuracy of 95.55% and a classification accuracy of 89.96%. Besides, V. Sihag et al. [53] used opcodes to solve the problem of code obfuscation. The detection result achieved 98.8% accuracy when using the Random Forest algorithm on the Drebin and PRAGuard datasets of code obfuscation, with 10,479 malware apps used. In [79], the authors proposed an effective opcode extraction method and applied a Convolutional Neural Network for classification. The k-max pooling method was used in the pooling phase to achieve an accuracy of more than 99%. On the other hand, M. Amin et al. [80] vectorized the extracted opcodes through encoding and applied deep neural networks to train the model, e.g., Bidirectional Long Short-Term Memory (BiLSTM). With a dataset of more than 1.8 million apps, the paper acquired a 99.9% accuracy level.
Other feature groups are usually combined with permissions, API calls, or opcodes. Because these groups often have few features and are unavailable in all apps, it is not easy to use them independently. From 2019 until now, according to the statistics in dblp, only two papers [82,83] use the Intent feature independently. The results show that accuracy reaches 95.1% [82] and F1-score reaches 97% [83]; however, the datasets are self-collected, and the number of usable files in the datasets is small.
Some common API packages in the Android malware detection problem datasets are described in Table 1.4.
Feature combination is commonly used, in which permissions and API calls appear a lot, as they play a crucial part in malware detection [14,25,33,44,84,85,86,87,88,89,90]. In many research papers, using feature groups has shown high effectiveness through evaluation results.
Table 1.4: Common API packages and sensitive API calls

API packages: java.lang.StringBuilder.toString, android.content.Context.getSystemService, java.lang.System, android.content.Context.startActivity, java.lang.Integer, java.lang.Thread, java.lang.String.substring, java.lang.Boolean, java.io, android.app, android.content.SharedPreferences, java.lang.String, android.content.Context.getPackageName, java.lang.StringBuilder, java.util, java.lang.Long, java.lang.String.length, android.view, android.content.Intent.putExtra, java.lang.Exception

Privacy: getSimSerialNumber(), getSubscriberId(), getImei(), getDeviceId(), getLineNumber(), getNetworkOperator(), newOutgoingCalls()
SMS: sendTextMessages(), sendBroadcast(), sendDataMessage(), telephonySMSReceived(), content:sms()
Network: Httpclient.execute(), getOutputStream(), getInputStream(), getNetworkInfo(), httpUrlConnectionConn(), execHttpRequest(), SendRequestMethod(), setDataAndTypes()
Location: getLongitude(), getLatitude(), getCellLocation(), requestLocationUpdates(), getFromLocation(), getLastKnownLocation()
File: URL(), getAssets(), OpenFileOutPut(), Browser:Bookmarks_URL()
Function: Runtime.exec()
Obfuscation: DexClassLoader(), Cipher.getInstance()

Some studies have converted extracted features into images, such as N. H. Khoa et al. [58], who extracted the features and then transformed those features into images. The features are permissions, opcodes, API calls, system commands, activities, services, receivers, and package names. Applying several CNN models to the feature set extracted from the CICAndMal2020 database, mobilenet_v2 achieved detection results of 98% accuracy, and 99% when optimization methods were used together.

b) Dynamic Extraction Method
Dynamic analysis is analysis performed while executing and running all the app's functions. During execution, the running process is saved in a log file. Necessary strings from the log file are extracted and denoted as features. Running the APK package can be done in two ways: (1) directly running on an actual device and (2) running in a sandbox isolated from the other parts of the system. Usually, running in a virtual environment is common in dynamic analysis because it is isolated and does not negatively affect the whole system. However, in general, the execution always takes longer than decompiled-code analysis. On the other hand, setting up the execution environment for dynamic analysis is difficult. Dynamic analysis also has some advantages, such as the ability to extract features that only a running system can reveal, e.g., the productivity of the system, the rate of CPU usage [91], the rate of RAM usage [92], battery, process [76], tcpdump, strace [54], traffic networks [50,57,93,94,95], etc. The standard features such as permissions and API calls [34,45,51,59,72,96,97,98] are also different compared to static analysis, because real APIs are called in the functions. Therefore, although dynamic analysis takes more time, in many cases it is still necessary to understand the impact of malware in depth.
Related Works on Machine Learning-based Methods
In recent years, most research groups have used machine learning and deep learning models in the Android malware recognition problem. Fig. 1.7 presents the number of related articles based on dblp statistics during this period.
Many machine learning and deep learning algorithms adopted in Android malware detection studies achieved high accuracy. In this section, typical machine learning models used in the problem will be summarized.
RF is a classifier algorithm that aggregates the predictions of decision trees [121]. Each decision tree is trained on a sub-dataset, part of the training dataset. Each tree outputs a class prediction.
The instructions to create a Random Forest are as follows:
1. Randomly select "k" features from the full feature set.
2. From the "k" features, calculate node "d", which is the best node for classification.
The instructions to use a Random Forest structure for predictive tasks are as follows:
1. Take the test features and use each decision tree to predict an outcome, then save the predictions to a list.
2. Count the votes for each predicted class.
3. Consider the prediction with the highest number of votes as the final model prediction.
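The voting procedure above can be sketched in a dependency-free way; the three "trees" here are illustrative stand-in functions, not trained models.

```python
from collections import Counter

# Stub "decision trees": each maps a feature vector to a class label
# (illustrative stand-ins for trees trained on different sub-datasets).
trees = [
    lambda x: "malware" if x[0] > 0.5 else "benign",
    lambda x: "malware" if x[1] > 0.5 else "benign",
    lambda x: "malware" if x[0] + x[1] > 0.8 else "benign",
]

def forest_predict(x):
    # Step 1: collect each tree's prediction in a list.
    votes = [tree(x) for tree in trees]
    # Steps 2-3: count the votes and return the majority class.
    return Counter(votes).most_common(1)[0][0]

print(forest_predict([0.9, 0.1]))   # two of three trees vote "malware"
```

The same majority-vote logic is what library implementations such as scikit-learn's RandomForestClassifier perform internally for classification.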
Many studies used RF models [8,14,22,30,53,63,64,66,67,85,86,115,122] and achieved good results. M. M. Alani [14] chose RF as one of the classifiers to create a lightweight approach to the feature dataset, in which the RF model achieved the highest accuracy rate of 98.65%. The authors used static features, such as permissions, API calls, intents, and command signatures. M. L. Anupama et al. [22] extracted static and dynamic features from the dataset and applied different algorithms to the selected dataset, bringing out a detection rate of 95.64% for dynamic features. In [67], the authors reduced the list of permissions using multi-stage extraction and fed it into several classifier models, where RF performed best with an accuracy of 96.95% and an F-measure of 0.96. V. Sihag et al. [53] presented a novel obfuscation-countering method based on opcode features. Four algorithms were applied with 10-fold cross-validation, in which RF achieved the highest malware detection accuracy of 98.2% on two datasets. M. Cai et al. [86] achieved the highest result using RF in the recognition process, with 99.87% on the Acc measurement, using permissions and API calls. M. Dhalaria et al. [30], using nearly 2,000 samples with 13 malware families, gave the highest result using RF of 88.6% with the Acc measure. In the paper, the authors also combined nearly 2,000 benign samples to test with two labels, giving the highest result of 90.1% with the Acc measure. H. Rathore et al. [66] used 20 malware families in the Drebin dataset; when classified using the RF algorithm, the highest result was 93.81% with the Acc measure. In general, studies using RF present detection results with high accuracy, most of which are over 90%, whether using static, dynamic, hybrid, or image conversion analysis.
Although the results of the RF model for the Android malware detection problem are awe-inspiring (>96%), these results are just for the detection problem. For research in classification problems, the RF model gives results of about 90%.
SVM finds a hyperplane of n-1 dimensions in the n-dimensional data space, such that the hyperplane can most optimally separate the classes.
SVM was first developed to classify data with two labels, then improved to classify data with n labels. Given m elements x_1, x_2, ..., x_m in the n-dimensional space, with corresponding labels y_1, y_2, ..., y_m of value 1 (positive class) or -1 (negative class), SVM finds the furthest hyperplane (optimal hyperplane). The process to find the optimal hyperplane is shown in Equation 1.7:

\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j K\langle x_i, x_j\rangle \;-\; \sum_{i=1}^{m}\alpha_i \qquad (1.7)

\text{s.t.}\quad \sum_{i=1}^{m} y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C

where C is a positive constant used to balance the magnitude of the margin and the total error distance, and K is a linear kernel where K\langle x_i, x_j\rangle = x_i \cdot x_j. Solving Equation 1.7 yields the elements x_i whose corresponding \alpha_i are nonzero, called support vectors (SV). Using the SV support vectors, the classifying hyperplane can be reconstructed.
SVM implements the classification of new elements by Equation 1.8:

f(x) = \operatorname{sign}\Big(\sum_{i \in SV} y_i \alpha_i K\langle x_i, x\rangle + b\Big) \qquad (1.8)
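Equation 1.8's decision rule can be illustrated directly from the dual variables. This is a hand-built 2-D toy sketch with a linear kernel; the support vectors, labels, and alpha values are invented (chosen so that the constraint sum of alpha_i * y_i = 0 holds), not the output of an actual solver.

```python
# Toy support vectors, labels, and dual coefficients (invented values
# satisfying sum(alpha_i * y_i) = 0, for illustration only).
sv     = [(1.0, 1.0), (-1.0, -1.0)]   # support vectors x_i
labels = [1, -1]                      # y_i
alphas = [0.5, 0.5]                   # alpha_i
b      = 0.0                          # bias term

def kernel(a, x):
    """Linear kernel K<a, x> = a . x."""
    return sum(ai * xi for ai, xi in zip(a, x))

def svm_classify(x):
    """Equation 1.8: sign(sum_i y_i * alpha_i * K<x_i, x> + b)."""
    s = sum(y * a * kernel(xi, x) for xi, y, a in zip(sv, labels, alphas)) + b
    return 1 if s >= 0 else -1

print(svm_classify((2.0, 0.5)))   # falls on the positive side -> 1
```

In practice the alphas and b come from solving the dual problem (Equation 1.7), e.g., via a library such as scikit-learn's SVC; only the prediction step is reproduced here.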
Many studies have used SVM models to detect and classify Android malware [6,14,21,22,23,30,31,66,124,125,126]. The use of the SVM model yields high results, and many experiments show results above 90%. M. M. Alani et al. [14] used a static feature set with the CICDroid dataset and reduced the number of features from 215 to 35, achieving 99.33% accuracy when applying SVM classifiers. Shatnawi et al. [23] implemented a static classification method using popular algorithms such as SVM, KNN, and NB on the CICInvesAndMal2019 dataset. Using permissions and the API call graph, the detection results achieved 94.36%. K. Shao et al. [126] showed an accuracy outcome of 98.4% with several groups of static features and feature selection.
M. Dhalaria et al. [30] experimented with detecting malware (nearly 2,000 malware samples and 2,000 benign samples) with an Acc measurement of 87.06%; evaluation in the problem of classifying malware (classifying 13 families with nearly 2,000 malware samples) resulted in an Acc measure of 86.85%. H. Rathore et al. [66] used 20 malware families in the Drebin dataset; when classified using the SVM algorithm, the highest result was 85.42% with the Acc measure.
The results when using the SVM model to detect and classify malware on Android are often lower than those of the RF model.
Thomas Cover proposed KNN, a method suggested for classification and regression problems [127]. The "k" closest training samples in the feature space are used as input vectors. The model's output will vary depending on whether the KNN approach is used for classification or regression. Test samples are classified according to their closest neighbors and assigned to the class that hosts the closest neighbors. If the k value is 1, the class of its closest neighbor will be assigned to that sample. In regression, the output will be the average of the k nearest neighbors' feature values. KNN is a type of sample-based learning algorithm, described in detail by the following steps:
Require: training sample set T, sample to be classified x, number of neighbors k.
Ensure: sample label y.
1. Take the first k training samples as the initial set of nearest neighbors;
2. Calculate the distance dist between the unknown sample and each remaining training sample;
3. Obtain max_dist, the largest distance within the current nearest-neighbor set;
4. If dist is less than max_dist, the training sample replaces the farthest sample in the k-nearest-neighbor set;
5. Repeat steps 2, 3, and 4 until the distance between the unknown sample and all training samples is calculated;
6. Count the occurrence frequency of each category among the k nearest neighbors;
7. Select the category with the highest occurrence frequency as the category of the unknown sample.
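The steps above can be sketched compactly as follows; for brevity, a single sort replaces the running max_dist bookkeeping, which yields the same k nearest neighbors. The training set and query points are invented for illustration.

```python
from collections import Counter

def knn_classify(train, x, k):
    """train: list of (feature_vector, label) pairs; x: unknown sample."""
    # Steps 2-5: compute every distance and keep the k smallest
    # (sorting stands in for the incremental max_dist replacement).
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    neighbors = sorted(train, key=lambda s: dist(s[0], x))[:k]
    # Steps 6-7: vote among the k nearest neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "benign"), ((0, 1), "benign"),
         ((5, 5), "malware"), ((6, 5), "malware"), ((5, 6), "malware")]
print(knn_classify(train, (4, 4), k=3))   # nearest three are all malware
```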
The KNN algorithm is less widely used for detection and classification than the RF or SVM algorithms. However, applying the KNN model also gives good results in detecting and classifying malware on Android (usually lower than the RF model) [25,26,27,30]. D. T. Dehkordy et al. [27] engineered a balanced dataset and applied KNN, SVM, and Iterative Dichotomiser 3 classifiers. The results indicated that KNN produced the highest accuracy, precision, and F-measure with the processed dataset: 98.69%, 97.89%, and 98.69%, respectively.
M. Dhalaria et al. [30] experimented with detecting malware (nearly 2,000 malware samples and 2,000 benign samples) with an Acc measurement of 85.4%; evaluation in the problem of classifying malware (classifying 13 families with almost 2,000 malware samples) resulted in an Acc measure of 83.91%.
DBN is a widely used deep learning framework [128]. The deep belief network is divided into two parts. The bottom part is formed by stacking multiple Restricted Boltzmann Machines (RBMs); each layer's RBM is trained by the contrastive divergence (CD) algorithm. The upper part is a supervised backpropagation neural network used to fine-tune the whole network. This model interests research groups and has produced positive results [50,129,130,131,132] in malware detection. J. Wang et al. [50], using the DBN model, obtained accuracy results as high as 98.3%. In that case, the malware and benign samples were balanced at 8,000 samples each. In [131], the authors used a combination of features by applying image conversion methods to samples derived from the Drebin dataset, reaching a detection accuracy of 95.43%. In [132], the outcome was 98.71% accuracy while using 5,154 features from static feature sets.
CNN is a well-known neural network type, often used to learn abstract features from primary data sources of different kinds. This process involves extracting hidden classes from the input. The input can be a vector or a matrix such as an image. Features are introduced using filters called kernels, with small sizes, that run through the entire input to create new hidden layers. These hidden layers can go through one or several pooling layers of small matrices, for example, a 2x2 pooling matrix, so the dimension of the hidden layers is reduced one more time. Finally, backpropagation connects them to the dense layer and output classes. The CNN structure is shown in Fig. 1.8.
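The two core operations just described, a kernel sliding over the input and 2x2 max pooling, can be sketched in NumPy. This is a didactic sketch of the operations only, not a trainable model; the 6x6 input and the tiny edge kernel are invented.

```python
import numpy as np

def conv2d(x, kernel):
    """Valid 2-D convolution: slide the kernel over the input matrix."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2x2(x):
    """2x2 max pooling: halve each spatial dimension."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)   # stand-in for a feature image
edge = np.array([[1.0, -1.0]])                   # tiny horizontal-difference kernel
hidden = conv2d(img, edge)                       # hidden layer, shape (6, 5)
print(max_pool2x2(hidden).shape)
```

A full CNN stacks many such kernels per layer, applies nonlinearities between them, and learns the kernel values by backpropagation; frameworks such as TensorFlow or PyTorch provide these operations as optimized layers.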
With the advantages of the CNN model, it is naturally utilized by many research groups for malware detection and classification [32,34,35,36,134,135,136,137,138,139]. I. M. Almomani et al. [36] transformed features to RGB and grey images and acquired an outcome of 98.08% accuracy. In [32], with exclusively used opcodes, the classification result is 98% accuracy. H. Rathore et al. [34] transformed static features such as intents, opcodes, and permissions into image form, then applied a CNN model to achieve 99.56% detection accuracy. S. Millar et al. [138] used static features such as opcodes, permissions, and API calls with the Drebin and AMD datasets and achieved results of 99.28% and 99.63% respectively, while the F1-score reached 91% and 81%.
Figure 1.8: Architecture of the CNN model [133]
A static malware sample detection method based on deep neural networks has been proposed. The method obtains grey images directly from executable samples and uses a grey histogram to collect a group of features from each image to build multiple classifiers. Experiments were carried out on 50,000 actual samples (24,553 malware in 71 families and 25,447 benign samples) to detect whether the analyzed samples were malware, detect their families and related variants, and achieve multi-classification of malware families with an accuracy of 92.9% [35]. Using the code's grey image converted from the binary bytecode of the malware DEX file, an Android malware family classification method based on deep learning has also been proposed. The deep learning classifier is constructed by reusing the feature extraction layer of the convolutional neural network Google Inception V3, which has been successfully trained on large datasets for traditional image classification tasks. It can automatically learn and distinguish features from malware images and achieves 97.7% accuracy on a dataset of 4,892 malware samples and 30 malware families [139].
In [134], the result was 98.1% accuracy when using syntax tree features for the classification problem. However, the recall measure only gave a result of 75.1%. This shows that the distribution of the number of samples across malware families is uneven. In [137], using API calls as dynamic features for image transformation, the classification accuracy reached 99.84%. This study eliminated malware families with small sample counts in the AMD and Drebin datasets. On the other hand, the dataset used did not have any combination with a benign dataset.
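As context for these image-based approaches, the byte-to-image conversion that underlies them can be sketched as follows. This is a minimal sketch: the payload stands in for real DEX file contents, and the row width of 32 is an arbitrary illustrative choice.

```python
import numpy as np

def bytes_to_grayscale(data: bytes, width: int = 256) -> np.ndarray:
    """Turn a byte sequence into a 2-D grayscale matrix.

    Each byte (0-255) becomes one pixel intensity; the stream is padded
    with zeros so it fills complete rows of the chosen width.
    """
    arr = np.frombuffer(data, dtype=np.uint8)
    pad = (-len(arr)) % width            # zeros needed to complete the last row
    arr = np.pad(arr, (0, pad))
    return arr.reshape(-1, width)

# Illustrative payload; a real pipeline would read classes.dex instead.
img = bytes_to_grayscale(b"\x00\x01\x02" * 300, width=32)
print(img.shape)   # 900 bytes padded to 928 -> (29, 32)
```

The resulting matrix can be saved as a PNG or fed directly to a CNN; RGB variants typically pack three consecutive bytes per pixel instead of one.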
Recently, some advanced models such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and Generative Adversarial Network (GAN) have been applied to the problem of detecting malware on Android.
Proposed Methodology
The Android malware detection process includes main tasks such as pre-processing, feature techniques, malware classification, and malware removal. The feature selection and classification phases contribute most to the accuracy of detection results. Feature techniques include feature extraction, feature selection, and feature enhancement. For learning models that have achieved stability and high accuracy, the enhancement of classification results is attributed to feature quality, in classification problems in general and Android malware classification in particular. Thus, feature enhancement is critical in Android malware classification using stable learning models. However, most research today, as will be presented in Chapter 2, focuses on feature extraction, such as permission extraction from the manifest file, API extraction from the DEX file, and image conversion. At the time of research, there was little research on feature improvement and feature enhancement in malware classification; thus, it remains a challenging problem that needs addressing. This dissertation has studied and proposed three enhancement methods: feature improvement based on the co-occurrence matrix, feature augmentation based on the Apriori algorithm, and feature selection based on popularity and contrast value in a multi-object approach. These three methods have been published as follows:
1. Feature selection: in addition to applying the mentioned algorithms, such as IG and TF-IDF, a new feature selection algorithm was proposed based on the following factors: popularity and contrast value (the contrast between malware and benign, and the contrast between malware families) [Pub.10]. After applying the additional methods to Android malware detection, the results achieved good metrics despite removing many features.
2. Feature augmentation based on the co-occurrence matrix: associating co-occurrence matrix features of each pair in the feature group [Pub.2]. The co-occurrence matrix is established based on a list of raw features extracted from APK files. The proposed features can take advantage of CNN while retaining the important features of the Android malware.
3. Feature augmentation based on Apriori: implement the Apriori algorithm to generate enhancement features. The Apriori algorithm was applied to each feature group [Pub.6]. The method learns association rules from the initial feature set to devise highly correlated and informative features, which are added to the initial set.
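The idea of method 3 can be sketched with a single Apriori level, mining frequent feature pairs and appending each as a new binary feature. The feature sets and the support threshold are invented for the example, and the published method [Pub.6] may differ in detail (e.g., deeper itemsets and rule confidence).

```python
from itertools import combinations

# Each app is a set of declared features (illustrative values).
apps = [
    {"SEND_SMS", "READ_SMS", "INTERNET"},
    {"SEND_SMS", "READ_SMS"},
    {"INTERNET", "CAMERA"},
    {"SEND_SMS", "READ_SMS", "CAMERA"},
]
min_support = 0.5   # a pair must occur in at least half of the apps

# Count the support of every candidate feature pair (a single Apriori
# level; full Apriori would first prune by frequent 1-itemsets).
all_feats = sorted(set().union(*apps))
pair_count = {p: sum(set(p) <= a for a in apps)
              for p in combinations(all_feats, 2)}
frequent_pairs = [p for p, c in pair_count.items()
                  if c / len(apps) >= min_support]

# Augment: each frequent pair becomes a new binary feature per app.
augmented = [{f"{p[0]}&{p[1]}": int(set(p) <= a) for p in frequent_pairs}
             for a in apps]
print(frequent_pairs)
```

Here only the SMS pair survives the threshold, so a single combined feature is appended to every app's vector.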
Besides the three methods above of feature augmentation and selection applied to stable machine learning models, improving the learning models themselves is also demanded in Android malware detection. With a giant and diverse sample set, Android malware detection is still a new challenge, as the OS can be installed on various machines, including phones, automotive systems, vending machines, clocks, and televisions. Also, the number of malware samples is increasing monthly, in virtual as well as native environments, which leads to a diversity of features. Conventional machine learning models such as RF, SVM, KNN, and DT are not suitable, as these models do not have feature generalization ability: they cannot produce generalized features and decrease dimensionality with large sample sets. Also, the detection accuracy decreases sharply when the number of samples and features increases. According to the review in Chapter 2, studies using these models mainly worked with available feature sets, with small numbers of features and samples, and proceeded without a feature augmentation method. Therefore, there is a demand for research on learning models appropriate for Android malware detection, with broad samples and features. To contribute to dealing with this problem, after conducting research, two improvements for learning models are developed:
1. WDCNN model [Pub.3]: this is an improved version of the CNN model. In the WDCNN model, more information is put into the model (the model requires two inputs, the wide component and the deep component). The results of the WDCNN model are better than those of conventional deep learning and machine learning models.
2. Federated learning method [Pub.11]: the federated learning model was used to conduct training and detection on many machines. Although the accuracy results of the federated learning model are lower, the difference is not significant, and its high processing speed allows models and real applications to be tested and deployed quickly.
In the scope of the study, the work was limited to static feature extraction, and the Drebin and AMD datasets were used for the Android malware families.
Chapter Summary
This chapter presented the Android operating system architecture and the challenge posed by malicious code on Android, and then proposed solutions for classifying malicious code on the Android platform. The dissertation is directed toward employing machine learning and deep learning models to classify Android malware. Section 1.3 emphasizes that improvements can be made by advancing the features and refining the model to enhance the capability of classifying Android malware. To clarify further, Section 1.4 surveys related research on feature extraction and on deep learning and machine learning models. This exploration reveals that, despite numerous studies addressing the issue, there is still potential for improving the accuracy of Android malware classification.
Section 1.5 explicitly outlines the dissertation's contributions. The following chapters will delve into the detailed contributions regarding feature enhancement and improvements in deep learning models.
Chapter 2 PROPOSED METHODS FOR FEATURE EXTRACTION
This chapter focuses on feature set augmentation. In general, feature extraction, selection, and development techniques are very important in detection and classification problems: if the features obtained after selection and development are good, the model will give good results. There are two approaches, as follows:
• Feature set development: additional features are generated from the initial dataset to obtain a novel set of features.
• Feature selection: eliminating features with low weights (according to the applied algorithm). The resulting feature set is smaller than the original but is considered the "marrow", i.e., essential for classification.
Feature Augmentation Based on Co-occurrence Matrix
Proposed Idea
A co-occurrence matrix is typically a symmetric square matrix, with each row and column representing the corresponding feature vector. In classifying Android malware, features are commonly extracted statically from the APK file; these features are permissions, API calls, intents, services, and others. The extracted features are discrete entities, and the question is how to connect the characteristics that appear together in a cluster. In the realm of Android apps, it is common for features to exhibit a degree of interdependence: a messaging app can be expected to declare permissions such as SEND_SMS, RECEIVE_SMS, and READ_SMS together. Consequently, characteristics of the same category exhibit interdependence rather than autonomy. Meanwhile, Convolutional Neural Network (CNN) models are widely used in Android malware classification. A CNN receives its input as an image matrix, wherein neighboring pixels exhibit comparable color values. The proposed approach uses a co-occurrence matrix in Android malware detection to establish a correlation between the features within each group.
The overall model applying the co-occurrence matrix to improve the feature set for the Android malware classification task is shown in Fig. 2.1. To prove the effectiveness of the co-occurrence matrix feature, two scenarios are set up: with and without the co-occurrence matrix feature computation module. The process is as follows:
for i ← 1 to length do
    aFeature ← dictionaryFeatures[i];
    if aFeature ∈ listFeatureFile then
        vectorOutput[i] ← 1;
    else
        vectorOutput[i] ← 0;
    end
end
Figure 2.1: Evaluation model for Android malware classification using co-occurrence matrix
1. From APK files, the raw feature extraction module extracts features, including API call strings and permission requests.
2. For the baseline architecture, the raw features go to the Normal Matrix Formation module. The module converts the raw features in string format into a vector using a dictionary of API calls and permissions. Each element in the vector has a value of 1 or 0, depending on whether the API or permission is found in the current APK file. The vector is then reshaped into a matrix, later treated as CNN input. In the proposed architecture, the raw features go to the co-occurrence matrix feature computation module, which forms a matrix based on the concurrent presence of two APIs or permissions in the APK file.
3. Next, the CNN module is applied to learn the features and classify the APK files into benign or specific malware families.
Raw Feature Extraction
Algorithm 1: Convert string features to numbers
Input: dictionaryFeatures: a dictionary of all APIs and permissions; listFeatureFile: the list of API and permission strings of a file;
Output: vectorOutput: feature vector;
length ← length(dictionaryFeatures);
coMatrix ← new Matrix(length, length);
for i ← 1 to length do
    for j ← 1 to length do
        if vecFeature[i] = 1 and vecFeature[j] = 1 then
            coMatrix[i][j] ← 1;
        else
            coMatrix[i][j] ← 0;
        end
    end
end
To extract features from APK files, decompilation tools such as Apktool, Dex2jar, Baksmali, Androguard, Jadx, or Jd-gui can be used; in this part, Androidpytool was used to extract the features. All the features are static and extracted from two files: the XML file and the DEX file.
From the raw feature sets, the top 200 most common APIs appearing in all APK files are employed, together with 385 permission features declared and used in XML files. In string form, these features are the input of the next module in the process chain, as shown in Fig. 2.1. Algorithm 1 illustrates the process of converting string features into number vectors.
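The conversion of Algorithm 1 can be sketched in Python; the feature dictionary and the APK feature list below are illustrative placeholders, not entries from the actual dataset:

```python
def features_to_vector(dictionary_features, file_features):
    """Algorithm 1: map a file's string features onto a binary vector
    ordered by the global feature dictionary."""
    present = set(file_features)
    return [1 if feat in present else 0 for feat in dictionary_features]

# Hypothetical feature dictionary and APK feature list, for illustration only.
dictionary = ["SEND_SMS", "RECEIVE_SMS", "INTERNET", "android.net.Uri.parse"]
apk_features = ["INTERNET", "SEND_SMS"]
vec = features_to_vector(dictionary, apk_features)
# vec == [1, 0, 1, 0]
```

Using a set for the membership test keeps the conversion linear in the dictionary size rather than quadratic.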
Co-occurrence Matrix Feature Computation
The co-occurrence matrix was first mentioned in 1957, when linguist J.R. Firth referred to the relationship between words in a sentence: a word is represented semantically by the words around it, so the placement of words affects the sentence's meaning.
In that context, the co-occurrence matrix connects the words within each paragraph. Here, the concept is applied to the Android malware features. The implementation of the co-occurrence matrix computation is described in Algorithm 2.
Algorithm 2:Co-occurrence matrix computation algorithm
Input: vecFeature: feature vector; Output: coMatrix: co-occurrence matrix;
After converting raw features in string form to a vector of numbers, the next step is to reshape this vector into a matrix that can later be used as CNN input. This step may have a huge impact on the final classification results, because the order of features can change considerably when reshaping the vector to different matrix sizes. Fig. 2.2 illustrates an example of forming output matrices of different sizes. Because harmful malware tends to call an API together with another API or a permission request (e.g., the API CreateFile might be called together with the INTERNET_ACCESS permission in malware), a CNN can learn the relationship between these two elements if they are located close to each other in the output matrix, i.e., when forming a matrix of size k by k. In contrast, the CNN may lose this information if the output matrix is formed with a different size, i.e., (k+1) by (k+1), as shown in Fig. 2.2. Hence, with a CNN, the order of elements in the input vector also affects the final classification rate.
Figure 2.2: Output matrices of different sizes
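The size sensitivity described above can be illustrated with row-major index arithmetic; the indices 5 and 6 are arbitrary examples, not positions from the real feature dictionary:

```python
def grid_position(index, width):
    """Row-major (row, col) position of a vector element after reshaping
    the vector into a matrix with `width` columns."""
    return divmod(index, width)

# Two features adjacent in the vector, e.g. an API at index 5 and a
# permission at index 6 (illustrative indices).
for width in (3, 4):
    p1, p2 = grid_position(5, width), grid_position(6, width)
    adjacent = abs(p1[0] - p2[0]) + abs(p1[1] - p2[1]) == 1
    print(width, p1, p2, adjacent)
# With width 3 the pair lands at (1, 2) and (2, 0): no longer neighbors.
# With width 4 it lands at (1, 1) and (1, 2): still neighbors.
```

This is why the same feature vector can give a CNN different spatial neighborhoods depending on the chosen matrix size.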
The proposed co-occurrence matrix can address the issue of input element reordering, because it emphasizes the co-occurrence between two elements instead of a singular cell.
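Algorithm 2 amounts to marking every pair of features that are present together; a minimal sketch (the input vector is illustrative):

```python
def co_occurrence_matrix(vec_feature):
    """Algorithm 2: coMatrix[i][j] = 1 iff features i and j are both
    present in the file, 0 otherwise."""
    n = len(vec_feature)
    return [[1 if vec_feature[i] == 1 and vec_feature[j] == 1 else 0
             for j in range(n)]
            for i in range(n)]

m = co_occurrence_matrix([1, 0, 1])
# m == [[1, 0, 1], [0, 0, 0], [1, 0, 1]] -- symmetric, so a pair (i, j)
# is recorded regardless of how far apart i and j sit in the vector.
```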
Experimental Results
The dataset includes 5,438 malware files from 179 families in the Drebin dataset [146] and 6,732 benign files, including apps and games [147]. The top 10 malware families with the most samples are shown in Fig. 2.3. In the context of feature extraction, various internal feature categories exist, such as permissions, APIs, services, URLs, and intents. The present study concentrates solely on permissions and API functionalities, encompassing system and function calls within the program.
The study utilized the 398 highest-level permissions and the 200 most frequently used API function calls across all files. Hence, each APK file comprises 598 raw features. The co-occurrence matrix is computed for each permission and API group, resulting in 158,404 permission features and 40,000 API features. The features are stored in a Comma-Separated Values (CSV) file, the input for the machine learning and deep learning algorithms.
Figure 2.3: Top 10 malware families in the Drebin dataset
• Scenario 1: the 598 raw features.
• Scenario 2: 198,404 features after applying the co-occurrence matrix to the 598 features of Scenario 1 (158,404 permission features and 40,000 API features).
Figure 2.4: CNN having multi-convolutional networks
Table 2.1: Details of parameters set in the CNN model
2.1.4.3 Malware Classification Based on the CNN Model
The structure of the CNN model is shown in Fig. 2.4. The parameters of the CNN model used with the co-occurrence matrix combination features are shown in Table 2.1.
Two datasets are used: the raw data and the data transformed with the co-occurrence matrix (the two scenarios described above). Each dataset is divided using stratified k-fold cross-validation sampling with k = 10, i.e., into ten equal parts containing both benign and malware samples, with 80% for training, 10% for validation, and 10% for testing. The cross-validation process was performed ten times, and the classification results were averaged.
The results of the CNN model under the mentioned conditions are shown in detail in Table 2.2. In addition, measures such as PR, RC, F1-score, and FPR are used to evaluate the results, as shown in Table 2.3.
Table 2.2: Classification with CNN model using accuracy measure (%)
In Table 2.2, it can be seen that using the co-occurrence matrix has increased the average Acc by 0.58% (Table 2.3 reports 95.78% for raw features and 96.23% with the co-occurrence matrix), and the classification difference among the 10-fold runs has decreased from 5.5% (raw feature set) to 3.98% (co-occurrence matrix). This proved that the links between features did affect the classification results.
Results with other measures are shown in Table 2.3. The PR metric with the co-occurrence matrix feature increased by 0.3% compared with that of the raw feature set, and the F1-score metric is also better (an increase of 0.58% when using co-occurrence features). Overall, co-occurrence feature augmentation increases the classification accuracy compared with raw feature sets. However, even though the classification results are better, the overall efficiency (training and test time) was reduced due to the growth of the input from n features to n × n features. The co-occurrence matrix produced is n × n, but half of the features are duplicated (they do not affect the classification process).
Feature Augmentation Based on the Apriori Algorithm
Proposed Idea
The Apriori algorithm is a commonly employed technique in data mining. Its primary purpose is to explore association rules between various objects. In detecting Android malware, features are extracted from APK files; the two significant categories of attributes are permissions and API calls. However, the discrete features are devoid of any interconnection. Thus, the Apriori algorithm can be used to learn the association rules in this particular problem.
To apply the Apriori algorithm to advance the feature set and adapt it to the malware classification problem on Android, processing follows the procedure in Fig. 2.5, a 5-step process as follows:
• Step 1. Extract features: based on the raw dataset of APK files, decompile the APK files and perform text preprocessing to produce raw features.
• Step 2. Associative rule mining by the Apriori algorithm: apply the Apriori algorithm to identify patterns and associations among the extracted features.
Figure 2.5: The process of research and experiment using Apriori
• Step 3. Apply the mined rules: utilize the results from Step 2 (the Apriori algorithm) and apply them to the raw dataset obtained in Step 1.
• Step 4. The datasets are put into machine learning and deep learning models to evaluate whether the new, Apriori-transformed feature set is better than the raw features. Three models, namely SVM, RF, and CNN, were utilized to make the assessment.
• Step 5. Comparison and evaluation: different metrics are used to evaluate and compare the Apriori-transformed feature set and the raw dataset.
Apriori Algorithm
The Apriori algorithm was first proposed by Rakesh Agrawal, Tomasz Imielinski, and Arun Swami in 1993.
The problem is stated as follows: find the itemsets t whose support s satisfies s ≥ s0 and whose confidence c satisfies c ≥ c0 (s0 and c0 are two values specified by users, s0 = minsup, c0 = minconf). The symbol Lk denotes the set of frequent k-itemsets, and Ck denotes the set of candidate k-itemsets.
2. Use the frequent itemsets to generate association rules with some minconf confidence.
The values minsup and minconf are thresholds to be defined before generating the association rules. An itemset whose appearance frequency is at least minsup is called a frequent itemset.
Algorithm 3: Candidate generation
Ck ← ∅;  // initialize the set of candidates
for all f1, f2 ∈ Fk−1  // find all pairs of frequent itemsets
    with f1 = {i1, ..., ik−2, ik−1} and f2 = {i1, ..., ik−2, i′k−1}  // that differ only in the last item
    and ik−1 < i′k−1 do  // according to the lexicographic order
    c ← {i1, ..., ik−2, ik−1, i′k−1};  // join the two itemsets f1 and f2
    Ck ← Ck ∪ {c};  // add the new itemset c to the candidates
    for each (k−1)-subset s of c do
        if s ∉ Fk−1 then
            delete c from Ck;
        end
    end
end
return Ck;  // return the generated candidates
The idea of the Apriori algorithm
• Find all the frequent itemsets: use k-itemsets (itemsets containing k items) to find (k+1)-itemsets.
• Find all the association rules from the frequent itemsets (satisfying both minsup and minconf).
Phase 1: first, find the 1-itemsets (denoted F1). F1 is used to find F2 (the 2-itemsets), F2 is used to find F3 (the 3-itemsets), and the process continues until no frequent k-itemset is found, as shown in Algorithm 3.
Phase 2: use the frequent itemsets acquired in Phase 1 to generate the association rules that satisfy confidence ≥ minconf, as shown in Algorithm 4.
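The two phases can be sketched in plain Python; the transactions and the minsup/minconf values below are toy examples, not the permission/API sets used in the experiments:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Phase 1: all frequent itemsets with support >= min_sup."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    freq = {}
    candidates = [frozenset([i]) for i in items]
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_sup}
        freq.update(current)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets,
        # then prune candidates that have an infrequent k-subset.
        joined = {a | b for a, b in combinations(current, 2)
                  if len(a | b) == len(a) + 1}
        candidates = [c for c in joined
                      if all(frozenset(s) in current
                             for s in combinations(c, len(c) - 1))]
    return freq

def gen_rules(freq, min_conf):
    """Phase 2: rules X -> Y with confidence >= min_conf."""
    out = []
    for itemset, sup in freq.items():
        for r in range(1, len(itemset)):
            for x in map(frozenset, combinations(itemset, r)):
                conf = sup / freq[x]
                if conf >= min_conf:
                    out.append((x, itemset - x, sup, conf))
    return out

tx = [frozenset(t) for t in (("SEND_SMS", "RECEIVE_SMS", "READ_SMS"),
                             ("SEND_SMS", "RECEIVE_SMS"),
                             ("INTERNET",),
                             ("SEND_SMS", "RECEIVE_SMS", "INTERNET"))]
freq = apriori(tx, min_sup=0.5)
rules = gen_rules(freq, min_conf=0.8)
# {SEND_SMS, RECEIVE_SMS} is frequent with support 0.75, and the rule
# SEND_SMS -> RECEIVE_SMS holds with confidence 1.0.
```

A production run would typically use a library implementation instead; this sketch only mirrors the two phases described above.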
Feature Set Creation
Definition 2.1 (The initial feature set). The initial Android feature set is defined as the features extracted from the APK samples, including benign and malware files. This feature set is denoted F_A and represented as in Formula 2.1:
F_A = {f_1, f_2, ..., f_n}   (2.1)
where
Algorithm 4: Rule generation
// F is the set of all frequent itemsets
// Hm is the set of m-item consequents
for each frequent itemset fk ∈ F do
    Hm+1 ← candidate-gen(Hm);
    for each hm+1 ∈ Hm+1 do
        conf ← fk.count / (fk − hm+1).count;
        if conf ≥ minconf then
            output the rule (fk − hm+1) → hm+1 with confidence ← conf
            and support ← fk.count / n;  // n is the total number of transactions in T
        else
            delete hm+1 from Hm+1;
        end
    end
end
• Each feature f_i can be a number or a string. Two feature sets were extracted as follows:
• Set 1 contains the permission and API call features.
• Set 2 contains miscellaneous features such as permissions, API calls, file size, native libc usage, number of services, and existing features.
Definition 2.2 (The feature association rule). The association rule defines the correlation between two associated groups in the initial feature set, where each feature group is a subset of features. For two feature subsets X and Y, the association rule is defined as in Formula 2.2. The association rule is examined through support and confidence, calculated as in Formula 2.3 and Formula 2.4.
X → Y, with X ∈ F_A, Y ∈ F_A and X ∩ Y = ∅   (2.2)
support = (X ∪ Y).count / n   (2.3)
confidence = (X ∪ Y).count / X.count   (2.4)
• (X ∪ Y).count is the number of transactions containing X ∪ Y.
Definition 2.3 (Associated features). Associated features are created based on the association rules and satisfy the support and confidence thresholds. Based on the association rule described in Formula 2.2, the associated feature f_m is calculated as in Formula 2.5:
f_m = Σx + Σy + support + confidence   (2.5)
Definition 2.4 (The feature augmentation set). The feature augmentation set, denoted F_C, is the union of the initial feature set and the associated features. F_C is constructed as in Formula 2.6:
F_C = F_A ∪ {f_m}   (2.6)
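Under this construction, each mined rule X → Y contributes one associated feature whose value follows the reconstructed Formula 2.5; a sketch with illustrative numbers:

```python
def associated_feature(x_values, y_values, support, confidence):
    """Reconstructed Formula 2.5: f_m = sum(x) + sum(y) + support + confidence."""
    return sum(x_values) + sum(y_values) + support + confidence

def augment(initial_features, mined_rules):
    """Formula 2.6: the initial features plus one associated feature per rule."""
    return list(initial_features) + [
        associated_feature(x, y, s, c) for x, y, s, c in mined_rules]

f_a = [1, 0, 1, 1]                 # initial binary feature vector (illustrative)
mined = [([1, 1], [1], 0.6, 0.9)]  # one rule: X values, Y values, support, confidence
f_c = augment(f_a, mined)
# f_c == [1, 0, 1, 1, 4.5]
```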
2.2.3.3 Input Feature Normalization
Inputs for malware detection and classification using machine learning and deep learning methods are often numerical; therefore, normalization is performed on the extracted feature set. From the raw feature sets, the top 200 most common API calls appearing in all APK files are used, together with 385 permission features declared and used in XML files. Algorithm 1 illustrates the process of converting string features into a number vector.
Corresponding to each data group in the dataset described above, the Apriori algo- rithm was implemented to show the correlation between features in each group.
For group 1 (permissions), which corresponds to the permission feature set, the permissions have a tight correlation: a permission typically comes together with one or a group of other permissions in an APK file. The min_sup used in this work is 0.4. After passing the first group through the Apriori algorithm, set (i) was acquired.
For group 2 (API calls, services, and activities that have been ranked), the correlation between the features is not as tight as for permissions; therefore, the min_sup value was set to 0.2. After passing the second group through the algorithm, set (ii) was acquired.
The Apriori algorithm is applied to each permission and API feature set as described in Fig. 2.6.
Figure 2.6: Apply the Apriori algorithm to the feature set
Experimental Results
The dataset from Drebin [146] includes 5,560 malware files with 179 labels, and 7,140 benign files that are apps and games downloaded from [147]. The features of the files are saved in a CSV file with the number of rows equal to the number of files under analysis; the number of columns corresponds to the number of extracted features.
The top 10 malware families with the most samples are shown in Fig. 2.3.
Corresponding to the feature extraction in Section 2.2.3, the initial dataset was divided into four scenarios:
• Scenario 1: permission features, covering Android system permissions and user-defined permissions: 398 features.
CNN model:
• Input: n × m × 1
• Layer 1: CONV1: 3 × 3 size, 32 filters, ReLU; Max Pool: 2 × 2 size
• Layer 2: CONV2: 3 × 3 size, 32 filters, ReLU; Max Pool: 2 × 2 size
• Layer 3: CONV3: 3 × 3 size, 64 filters, ReLU; Max Pool: 2 × 2 size
• FC: 1024 hidden neurons; Dropout: 0.8
• Scenario 3: a collection of static analysis features. The proposed features include the permissions from Scenario 1 (398), APIs (200), file sizes, user-defined permissions, use of native libs, number of services, and existing features, for a total of 603 features.
• Scenario 4: the static analysis features plus the associated features of each set generated by the Apriori algorithm, in total: 603 (Scenario 3) + 2,132 (Apriori for permissions) + 3,085 (Apriori for APIs) = 5,820 features.
For each case, the data is divided into 10 folds. Each dataset is split 80% for training and 20% for testing (in the case of CNN, 10% for testing and 10% for validation).
The structure of the CNN model is shown in Fig.2.7.
Figure 2.7: Architecture of CNN model used in the experiment with Apriori
The parameters of the CNN model used with the Apriori-augmented feature sets are shown in Table 2.4.
Table 2.4: Details of parameters set in the CNN model
The experimental model is detailed in Fig.2.5.
The four datasets were fed to the CNN model to evaluate the Apriori algorithm as described in Section 2.2.4.1. The measures used in the experiment are Acc, Precision, Recall, and F1-score. The results are shown in Table 2.5 and Table 2.6, which describe the effect of the same CNN algorithm on the four datasets with and without the Apriori algorithm.
Table 2.5: Classification results by CNN
Table 2.6: Results of using CNN with measurements (%)
To compare more clearly with the machine learning models, Fig. 2.8 shows the results in chart form.
Looking at Fig. 2.8, regardless of the model, datasets preprocessed with the Apriori algorithm yield higher performance. While the accuracy is not significantly higher, it is a prerequisite for applying the Apriori algorithm to future classification tasks. The Apriori algorithm depends on two factors:
• The number of repetitions of the Apriori algorithm (k-itemset). When combining features, the number k increases. The larger the value of k, the higher the number of incorporated features, which leads to fewer new features being generated (as the incorporated initial features must pass the min_sup threshold).
In the experiment, min_sup and k-itemset were not optimized: min_sup was 0.4 for the permission feature set and 0.2 for the API call feature set, and the k-itemset used was 2. Even so, the experiment indicated that Apriori produced better results than raw features. In subsequent studies, the algorithm should be optimized to determine each dataset's optimal min_sup and k-itemset values.

Figure 2.8: Learning method implementation results
Feature Selection Based on Popularity and Contrast Value in a Multi-objective Approach
Proposed Idea
The main idea of this proposed method is to use the Pareto multi-objective optimization method to build a selection function (global function) based on three component measures: popularity, the contrast between benign and malware files, and the contrast between classes of malware.
The overall model of the method is depicted in Fig.2.9.
• First, build the component measures: popularity (M1), the contrast between benign and malware files (M2), and the contrast between classes of malware (M3).
• Second, for each feature in the raw feature set, calculate the value of the selection function based on the component measures. The selection function is the global optimal function, built on these three measures in a balanced approach between the component measures.
• Third, only features whose selection-function value is greater than or equal to the threshold are selected. With the selected feature set, the data is fed into the deep learning model to evaluate the efficiency.

Figure 2.9: Proposed feature selection model
Popularity and Contrast Computation
In this section, component measures based on the value of each feature in the dataset are built. Each component measure represents a quality characteristic of the feature. The component measures are used to construct the selection function, the global feature evaluation.
Definition 2.5 (Popularity). The popularity of each feature is a measure built on the frequency of the feature. Popularity is denoted M1 and calculated according to Equation 2.7.
Definition 2.6 (Contrast with benign). The contrast with benign is a measure that evaluates the contrast of feature values between benign and malware samples. The larger this measure, the higher the value contrast between malware and benign, and thus the better for classification. Contrast with benign is denoted M2 and calculated according to Equation 2.8.
Definition 2.7 (Contrast between classes of malware). The contrast between classes of malware measures the contrast of feature values among malware classes. It is denoted M3 and calculated by Equation 2.9.
M3 = avg_{j≤|V|} |Σ_{s∈V_j} f_i − Σ_{s∈V, s∉V_j} f_i| / max_{j≤|V|} |Σ_{s∈V_j} f_i − Σ_{s∈V, s∉V_j} f_i|   (2.9)

From Equations (2.7)-(2.9), the three component measures are obtained for every feature.
Pareto Multi-objective Optimization Method
Pareto optimization is a key method in multi-objective optimization. In this method, calling X* the solution to be found, X* must have the following properties:
• There is no alternative X ∈ D that improves one objective (f_i(X) ≥ f_i(X*)) without making at least one other objective worse (f_j(X) < f_j(X*)). On the whole, no single X can outperform X*.
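The dominance property can be sketched directly (objectives are maximized; the objective vectors below are illustrative):

```python
def dominates(a, b):
    """True if a Pareto-dominates b: at least as good on every objective
    and strictly better on at least one (maximization)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(solutions):
    """Solutions not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o != s)]

points = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.2, 0.9)]
front = pareto_front(points)
# (0.5, 0.5) is dominated by (0.6, 0.6); the other three are Pareto-optimal.
```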
Selection Function and Implementation
The multi-objective optimization method aims not to optimize a specific component but to balance and reconcile the optimization goals. In this approach, each measure is a component objective function; each of these metrics represents a particular measure of quality, called a component objective. However, simultaneously optimizing multiple component objectives is impossible; improving one component's goal may even interfere with another.

Algorithm 5: Feature selection
F1 ← ∅;
while F0 ≠ ∅ do
    take a feature fi from F0;
    if F(fi) ≥ M0 then
        F1 ← F1 ∪ {fi};
        delete fi from F0;
    else
        delete fi from F0;
    end
end
return F1;
The built selection function is the global objective function. The aim is to optimize globally and balance between the component goals. The selection function is denoted F and built from the respective weights and component measures as in Equation 2.10:
F = w1·M1 + w2·M2 + w3·M3   (2.10)
where:
• w1, w2, w3 are the weights corresponding to each measure. Depending on the problem and the optimization goal, the importance of each measure is judged and the weights are set accordingly.
The selection function is used to select suitable features. It aims to achieve the best fit and balance between the component measures; that is, it is aimed at the global target and the overall quality of the features. Depending on the problem and the number of features to be selected, an appropriate threshold value M0 is chosen, and features with value F ≥ M0 are selected. Feature selection is performed according to Algorithm 5.
M0: threshold for the selection function;
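Equation 2.10 and the thresholding loop of Algorithm 5 can be sketched together; the weights, per-feature measure values, and threshold M0 below are illustrative, not the values tuned in the experiments:

```python
def select_features(measures, weights, m0):
    """Keep features whose F = w1*M1 + w2*M2 + w3*M3 >= M0 (Algorithm 5).

    measures: {feature_name: (M1, M2, M3)}; weights: (w1, w2, w3).
    """
    selected = {}
    for name, (m1, m2, m3) in measures.items():
        f = weights[0] * m1 + weights[1] * m2 + weights[2] * m3
        if f >= m0:
            selected[name] = f
    return selected

# Hypothetical per-feature measures, for illustration only.
measures = {"READ_SMS": (0.8, 0.7, 0.5),
            "VIBRATE": (0.9, 0.1, 0.1)}
kept = select_features(measures, weights=(0.4, 0.4, 0.2), m0=0.5)
# READ_SMS scores F = 0.70 and is kept; VIBRATE scores F = 0.42 and is dropped.
```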
Figure 2.10: Top 20 malware families with the most samples in the AMD dataset
Experimental Results
In this work, the AMD malware dataset [148] was used to provide the malware part of the dataset; it contains 24,553 samples categorized into 135 varieties among 71 malware families from 2010 to 2016. 19,943 samples from 65 malware families were used; some samples were left out because their malware families had too few samples. 6,771 samples were downloaded from [147] for the benign class. Thus, the classification includes 26,714 samples from 66 families. Fig. 2.10 depicts the top 20 malware families with the most samples in the AMD dataset.
The data was divided into ten equal sections of malware and benign families. Eight parts were used to train the CNN model, one part was used for validation, and the last one was used for separate testing.
Permissions and API calls are the two main features used in this research: 877 permission features, including permissions provided by Android and declared by the user, and the top 1,000 most used API call features across all samples. If a feature (permission or API call) is used in a sample, it is numbered 1; otherwise, it is numbered 0. Algorithm 1 illustrates the process of converting string features into numeric vectors. In this work, the dataset is called dataset_raw. The permission group is sorted in ascending order of occurrences, and the API group is sorted in descending order of occurrences in the dataset.
CNN model:
• Input: n × 1
• Layer 1: CONV1: 3 × 3 size, 32 filters, ReLU; Max Pool: 2 × 2 size
• Layer 2: CONV2: 3 × 3 size, 32 filters, ReLU; Max Pool: 2 × 2 size
• Layer 3: CONV3: 3 × 3 size, 64 filters, ReLU; Max Pool: 2 × 2 size
• FC: 7040 hidden neurons; FC: 1024 hidden neurons; FC: 66 output classes
To evaluate and compare the proposed method with other feature selection methods, experiments were conducted on three datasets: feature set D1.1, which only includes APIs; feature set D1.2, which only includes permissions; and feature set D1.3, which includes both, as depicted in Fig. 2.11. Applying the proposed selection method to the three feature sets yields D2.1, D2.2, and D2.3. At the same time, the Information Gain (IG) method [118, 119] is applied to D1.1, D1.2, and D1.3 to obtain three feature sets D3.1, D3.2, and D3.3, respectively. The same CNN model was applied to all six obtained feature sets. The parameters of the CNN model are shown in Table 2.7.
Table 2.7: Details of parameters set in the CNN model for feature selection
Accordingly, the following experiment was performed: for each feature set, each feature is weighted according to the proposed algorithm or IG; features below the threshold are removed; the retained features are put into the CNN model; and the Acc and Recall measures are used to evaluate the selected set of features.
The scenarios for feature selection are as follows:
– Scenario 1.1: with the API call feature set, calculate measures M1, M2, and M3; calculate the weight for each feature using the function F.
– Scenario 1.2: with the permission feature set, calculate measures M1, M2, and M3; calculate the weight for each feature using the function F.
– Scenario 1.3: with the permission and API call feature set, calculate measures M1, M2, and M3; calculate the weight for each feature using the function F.
For Scenario 1, the function F is Formula 2.10. For each of Scenarios 1.1, 1.2, and 1.3, three different weight settings (w1, w2, w3) are applied.
– Scenario 2.1: with the API call feature set, use the IG algorithm to calculate the weight for each feature.
Figure 2.11: Experimental model when applying feature selection algorithm
– Scenario 2.2: with the permission feature set, use the IG algorithm to calculate the weight for each feature.
– Scenario 2.3: with the permission and API call feature set, use the IG algorithm to calculate the weight for each feature.
According to the proposed method, to select good features, the measures M1, M2, and M3 are calculated according to Equations (2.7), (2.8), and (2.9). The selection function F can then be calculated from the values of the respective measures and weights; each weight represents the importance of its measure. Following the Pareto multi-objective method, these weights are selected based on experience and experiment. Feature selection results are illustrated in Table 2.8, which lists the ten features with the highest selection-function value F. Feature selection is performed on three datasets: API, permission, and combined API with permission. Table 2.9 reports, for three weight settings, the results of removing features with small F values.
For comparison, Fig. 2.12 shows the performances of the ML/DL algorithms.
Table 2.8: Feature evaluation measures and selection function values (top 10), API set
1. java.lang.StringBuilder.toString (0.845)
4. java.lang.String.substring
7. android.content.Context.getPackageName
9. java.lang.String.length
10. android.content.Intent.putExtra
Table 2.9: Summary of results with datasets and feature sets
ID | Dataset | Feature set | Removed 300 (Acc/Recall) | 400 | 500 | 600
Original features (nothing removed): Acc 96.61, Recall 87.67
Features selected with F1: 89.73 / 94.91 | 83 / - | - | -
Features selected with F3: 95.44 / 88.21 | 96.1 / 88.9 | 95.8 / 87.82 | 94.42 / 85.44
Features selected with IG: 97.06 / 81.78 | 96.9 / - | - | -
Original features (nothing removed): Acc 92.15, Recall 84.85
Original features (nothing removed): Acc 98.19, Recall 93.2
These results were calculated after altering the dataset by deleting 300, 400, 500, and 600 features for each group. For each group, the two measures Acc (first four columns) and Recall (last four columns) of the proposed feature selection weightings and the IG algorithm are shown. Each group consists of eight columns, as follows:
• The first four columns of each group are the Acc measure values: the first three columns correspond to the weighting functions F1, F2, and F3; the fourth column is weighted according to the IG algorithm.
• The last four columns of each group are the Recall measure values: the 5th, 6th, and 7th columns correspond to the weighting functions F1, F2, and F3, respectively; the 8th column is weighted according to the IG algorithm.
Figure 2.12: Experimental results when applying feature selection algorithm
To evaluate the proposed method more accurately, the feature selection program following the IG algorithm in [118, 119] was also implemented. Table 2.9 and Fig. 2.12 show that:
• The Recall of the IG algorithm is lower than that of the proposed algorithm across all scenarios.
• Fig. 2.12 depicts the results for the API call and permission feature sets with the weightings F1, F2, F3, and IG and varying numbers of removed features. Between 300 and 850 features were systematically removed from each specific set (permission set, API set). The experiment showed that the proposed weightings gave stable results on the Accuracy and Recall metrics. Feature selection with IG gives relatively high Acc results but lower Recall results than F1, F2, and F3.
• Table 2.9 shows that, compared with the results on dataset_raw, the Acc of the F1-, F2-, and F3-weighted selections is not significantly lower, while the Recall is even higher than on dataset_raw. Thus, the proposed method gives positive results, especially on the Recall measure.
Chapter Summary
Feature engineering is crucial in addressing the challenges of classifying, identifying, and detecting malware. This essential process encompasses various techniques, including preprocessing, feature extraction, evaluation, feature selection, and enhancing the feature set. By applying different extraction techniques to the raw dataset, diverse feature sets can be generated. Additionally, new features can be derived from the relationships among the raw features to complement and refine the existing feature set. The augmentation of the feature set, through development and feature selection, aligns with the specific objectives of the problem, such as increasing accuracy and improving overall system performance.
Based on the content presented in this chapter, the dissertation has proposed, built, and tested three methods of improving the feature set. The summary results of these proposed methods are shown in Table 2.10.
Table 2.10: Summary of results of proposed feature augmentation methods
[Table 2.10 fragment: row 3 lists the prevalence and value contrast-based improvement method; AMD, 5.17% with the Recall score [Pub.10].]
In the study of [Pub.2], the co-occurrence matrix was used to improve the features input to the CNN model, and the outcomes produced are better than using raw features in terms of PR, RC, F1-score, and Acc (shown in Table 2.2 and Table 2.3).
While the feature generation algorithms mentioned above have demonstrated favorable results in the detection and classification processes, they are not without limitations. One such limitation is the creation of too many additional features. When implementing a co-occurrence matrix, the number of observed features grows from n to n×n. As a result, there is a significant increase in feature training and processing time.
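As a rough illustration of the dimensionality growth described above, a co-occurrence matrix over binary feature vectors can be sketched as follows; the toy data and feature count are hypothetical, and the dissertation's actual construction may differ:

```python
import numpy as np

def cooccurrence_matrix(samples):
    """Build an n x n co-occurrence matrix from binary feature vectors.

    samples: array of shape (num_samples, n), entries 0/1.
    Entry (i, j) counts how many samples contain features i and j together,
    so n input features expand to n*n derived features.
    """
    samples = np.asarray(samples)
    return samples.T @ samples  # shape (n, n)

# Hypothetical toy data: 3 samples over 4 binary features.
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1]])
M = cooccurrence_matrix(X)
# M[0, 1] == 2: features 0 and 1 co-occur in two samples.
```

Feeding M (or its flattened form) to a model is what multiplies the feature count, and hence the training time, quadratically.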
Evaluation of method based on Apriori:
During the dissertation research, a remarkable observation emerged regarding the basic features used in the feature extraction process. These features exhibit a discrete nature and lack interconnectedness, even though the connectivity of permissions and APIs plays a prominent role. Usually, permissions are acquired in clusters. For example, the ability to read messages is often combined with the ability to compose messages and access memory. In addition, Application Programming Interfaces (APIs) often represent interdependencies by calling other APIs and establishing cross-referencing relationships.
This chapter involved the application of feature-generation algorithms. The Apriori and co-occurrence matrix algorithms are examples of algorithms designed to facilitate the linkage of features. In addition, alternative algorithms, namely FP-Growth and K-means, were employed to identify malware on Internet of Things (IoT) devices. The outcomes of implementing these feature-generation algorithms exhibit great potential.
In the study of [Pub.6], the Apriori algorithm produced better results than individual features alone. The outcome is described in Table 2.5 and Table 2.6. Moreover, the results obtained with CNN models are better than those of other machine learning algorithms, such as SVM and RF.
Evaluation based on prevalence and contrast:
Data preprocessing is paramount in detection and classification tasks involving learning models, such as machine and deep learning models. Nonetheless, not all features used in these problems prove to be optimal. Although deep learning models streamline feature selection through internal convolutional layers that preprocess data, incorporating too many features can result in time-consuming analysis, training, and testing, mainly when dealing with the vast datasets prevalent in modern times.
Furthermore, deep learning models primarily aggregate features that are near each other, as dictated by the filter matrix (e.g., nearby pixels in an image). Here, however, the features are discrete (unlike the continuous nature often encountered in image processing tasks).
From there, it follows that learning to select good features helps solve the problem faster, while the accuracy in detection and classification can remain equivalent to that of the raw feature set. In the dissertation, a method was studied and proposed to select features based on multiple criteria [Pub.10]. In this study, the experiment found that it is possible to remove up to 75% of the raw features without changing the results. The results are shown in Table 2.9 and Fig. 2.12.
Apart from the advantages and disadvantages, feature generation is generally a promising research direction This feature generation can be combined with feature selection methods such as PSO, IG, etc., to extract meaningful features and use the association between features.
In the direction of feature development, the number of features is increased by combining the original features. The advantage of this method is that the accuracy of the system is improved.
The disadvantage of this method is that the training and detection times of the system are longer, because the system takes more time to analyze the application, train, and detect (as the number of features increases). In the direction of feature selection, the number of original features is reduced, so training and detection times are improved. However, the disadvantage of this method is that the classification results often decrease (although the decrease is small). Many algorithms in data mining can be applied to augment features. However, in the dissertation, some typical and widely used algorithms in data mining were used, such as Apriori and the co-occurrence matrix.
DEEP LEARNING-BASED ANDROID MALWARE
This chapter will elucidate the use of various deep-learning models to detect Android malware. The model's augmentation and the system's advancement are thoroughly examined. The content of this chapter is as follows:
•Application of deep learning models such as DBN and CNN in Android malware classificationtask.
• Presentation of the Wide & Deep CNN (WDCNN) model, an improved, more accurate version of CNN.
Applying DBN Model

DBN Model
DBN was the first deep learning network, introduced by Professor Hinton in 2006 [128]. This model uses unsupervised training and stacks multiple unsupervised networks, such as restricted Boltzmann machines or autoencoders.
DBN has been successfully applied to image classification tasks, and many research groups in other fields have also applied it. In line with this development trend, applying deep learning models to the malware detection problem has interested many research groups. In 2014, Z. Yuan et al. [149] implemented a deep learning method (using DBN) to detect Android malware. However, deep learning is still rarely used to solve the problem of detecting malware on Android.
Applying the DBN model to various Android malware datasets and setting up dif- ferent numbers of hidden layers will contribute to a more accurate evaluation of how suitable the DBN model is for detecting malware on Android.
The system structure follows the procedure in Fig. 3.1. It includes five stages:
• Stage 1: Evaluation, selection, and combination of features. During this phase, an analysis was conducted on the APK file, wherein significant components were identified and selected to serve as features. The present investigation involved the use of permissions in the XML file, API calls, and the header of the DEX file as distinctive features.
• Stage 2: Training. Using the DBN model explained previously, a program was composed to extract characteristics from the dataset containing malware samples and their corresponding labels. The model is fed with labeled data tailored to the specific task during training. The detection model was acquired upon the culmination of the training process.
• Stage 3: Detection. The model trained in the previous stage is used to process real files for verification purposes. The model infers the labels of unlabeled input APK files.
• Stage 4: Construction of cross-validation test data (k-fold). To be unbiased, k-fold divides the data and performs cross-validation. This avoids unevenly dividing the data, where many files of a particular label are concentrated in one part of the split data. Typically, only 10-fold, 5-fold, or 2-fold were utilized.
• Stage 5: Testing and evaluation. The model is subjected to k-fold cross-validation, which involves splitting the dataset into k folds of train and test data in Stage 4.
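The k-fold split in Stage 4 can be sketched in a few lines. This round-robin version is a simplification for illustration; a real pipeline would typically shuffle and possibly stratify by label:

```python
def kfold_indices(n_samples, k=10):
    """Split sample indices into k roughly equal folds for cross-validation.
    Each fold serves once as the test set while the remaining folds train
    the model. Round-robin assignment; no shuffling or stratification."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((train, test))
    return splits

splits = kfold_indices(100, k=10)
# Each of the 10 splits has 90 training and 10 test indices.
```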
Figure 3.1: System development and evaluation process using the DBN
Boltzmann Machine and Deep Belief Network
A restricted Boltzmann machine (RBM) is an artificial neural network based on a probabilistic energy model. It is a model consisting of a set of $n_v$ random binary variables collectively known as the visible vector $v$, and a hidden layer of $n_h$ such random binary variables, denoted $h$. The connections between the layers form a bipartite graph, meaning there are no connections within the same layer. The joint probability distribution is represented through the Gibbs distribution $p(v, h \mid \theta)$ with energy function $E(v, h)$, as described in Equation 3.1:

$$p(v, h) = \frac{1}{Z} e^{-E(v, h)} \qquad (3.1)$$

where $Z$ is the normalizing partition function.
Assume a training set $x$ with one-dimensional size $d$: $x = (x_1, x_2, x_3, \ldots, x_d)$, the visible input of the Boltzmann machine. In that case, the energy function of the RBM can be represented as in Equation 3.2:

$$E(x, h) = -\sum_{i=1}^{d} b'_i x_i - \sum_{j=1}^{|H|} b_j h_j - \sum_{i=1}^{d} \sum_{j=1}^{|H|} w_{ij} x_i h_j \qquad (3.2)$$
The hidden layer $h$ consists of $|H|$ hidden units: $h = (h_1, h_2, \ldots, h_{|H|})$, and the parameter mapping $\theta$ is formed from the weight set $w$ and the bias vectors $b$ and $b'$.
When $x$ is input to the first layer, the RBM activates the hidden units based on the conditional probability. Here, the sigmoid function $\sigma$ is used to calculate the conditional probabilities $P(h_j \mid x)$ and $P(x_i \mid h)$, as shown in Equation 3.3:

$$P(h_j = 1 \mid x) = \sigma\Big(b_j + \sum_{i=1}^{d} w_{ij} x_i\Big), \qquad P(x_i = 1 \mid h) = \sigma\Big(b'_i + \sum_{j=1}^{|H|} w_{ij} h_j\Big) \qquad (3.3)$$
A deep belief network is constructed by stacking layers of RBMs on top of each other, with the activation layer of one RBM serving as the input for the next RBM. It uses a layer-wise approach to training: the network's initial values are set through unsupervised learning, and the parameters are then adjusted using an optimization algorithm so that the achieved probability of the output from the corresponding input values is maximized. Fig. 3.2 describes the structure of DBN.
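Equations 3.1 to 3.3 can be made concrete with a small NumPy sketch of one Gibbs sampling step; the weights and layer sizes here are arbitrary illustrations, not the dissertation's configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_gibbs_step(x, W, b, b_prime, rng):
    """One Gibbs step using the RBM conditionals of Equation 3.3.

    x: visible vector (d,); W: weight matrix (d, |H|);
    b: hidden biases (|H|,); b_prime: visible biases (d,).
    Returns P(h_j = 1 | x) and, after sampling h, P(x_i = 1 | h).
    """
    p_h = sigmoid(b + x @ W)                          # P(h_j = 1 | x)
    h = (rng.random(p_h.shape) < p_h).astype(float)   # sample hidden units
    p_x = sigmoid(b_prime + W @ h)                    # P(x_i = 1 | h)
    return p_h, p_x

rng = np.random.default_rng(0)
d, H = 6, 3                                 # illustrative sizes
W = rng.normal(scale=0.1, size=(d, H))
p_h, p_x = rbm_gibbs_step(np.ones(d), W, np.zeros(H), np.zeros(d), rng)
# Both outputs are probabilities strictly between 0 and 1.
```

Layer-wise DBN pre-training repeats such steps per RBM, feeding each layer's hidden activations to the next.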
Experimental Results
In the experiment, the following two datasets are used:
Figure 3.2: Architectural diagram of DBN application in Android malware detection
• Dataset 1: 500 APK files from Virusshare [150], with 250 benign samples and 250 malware samples. This dataset is used for malware detection (binary classification of whether an app is benign or malware). Features used in Dataset 1:
– XML file: 63 permissions, including 24 dangerous and 39 safe permissions.
• Dataset 2: 5,405 samples with 179 malware families in the Drebin dataset [146], with 6,730 benign samples from [147]. Features used in Dataset 2:
– XML file: 877 permissions, including Android OS and user-defined permis- sions.
Based on the two datasets and the DBN model, the following three experimental scenarios are suggested:
• Scenario 1: using 121 features extracted from DEX files (100 API features and 21 header features) from Dataset 1.
• Scenario 2: using all features from Dataset 1, meaning all 184 features, of which the first 121 features are from Scenario 1, plus an additional 63 permission features from the XML files.
• Scenario 3: using all features from Dataset 2, meaning all 1,877 features extracted from DEX files and XML files, with the top (1000) API calls used in the DEX file and 877 permissions declared in the XML files.
Table 3.1 describes the results in Scenario 1, and Table 3.2 describes the results in Scenario 2. In Scenarios 1 and 2, the number of hidden layers and epochs are adjusted to evaluate how these hyperparameters affect malware detection results.
Table 3.1: Result with Acc measure (%) in Scenario 1
Table 3.2: Result with Acc measure (%) in Scenario 2
Table 3.3 describes the results in Scenario 3. For Scenario 3, the Acc, Precision, Recall, and F1-score measures are applied to evaluate the malware classification problem.
Table 3.3: Results with measures in scenario 3 (%)
• Regarding Dataset 1: the malware class detection accuracy (94% in Scenario 1 and 94.5% in Scenario 2) is significantly higher than that of the benign class (74.5% in Scenario 1 and 65.5% in Scenario 2). The overall average accuracy is also quite decent, hovering around 80%. However, it is worth noting that the test dataset might not be extensive enough, and the feature set used is relatively small, with 121 features in Scenario 1 and 184 features in Scenario 2. These factors could have influenced the results to some extent.
• Regarding Dataset 2: the malware classification shows an impressive accuracy level of 95%, indicating many correctly classified samples. However, other metrics only reach around 8x%, with the recall rate at 82% in particular. This suggests that the classification of different families is uneven, with several families having a low detection rate. The significant difference between the high accuracy and the low values of other metrics also indicates an imbalanced distribution of samples across the various families.
• The results obtained from the two independent datasets show accuracies ranging from 80% to 95% on the Acc metric. Moreover, these datasets, with varying feature sets, demonstrate that the DBN model is suitable for detecting and classifying malware.
Applying CNN Model
CNN Model
In 2012, at the ILSVRC challenge, Krizhevsky et al. achieved a top-5 error rate of 16% [151]. The model used by the authors is the deep convolutional neural network called AlexNet. From that point on, deep learning models have seen sharp improvement. In subsequent challenges, the winning models adopted deep learning methods. Meanwhile, many big companies were also attracted and produced many technologies with deep learning.
Since 2017, several research groups have begun applying CNN models to the problem of detecting malware on Android. The results showed that the CNN model could improve the detection/classification results to a new level, which has led to widespread application, testing, and customization of CNNs, along with different datasets and feature selection methods.
In this section, the CNN model was applied to classify Android malware, and static feature extraction was used to evaluate the model using the Acc metric.
Based on convolutional neural network theory, this model is used to solve Android malware problems. The entire model is described in Fig. 3.3. In the training stage, benign or malware files are extracted according to the feature set and then converted to a numeric matrix, which is used as the model's input; the feature-pooling process can occur many times, once for each pair of operations (convolution, pooling). After this process, a dense neural layer is created, fully connected with the neural outputs, and labels are correspondingly mapped to those outputs. In the detecting stage, APK files are also extracted, converted to an algebraic matrix, and fed to the network; according to a weight table created in the training stage, one of the neural outputs will be chosen; a particular class of benign or malware files is the respective label of the neural inputs attached to the files.
Figure 3.3: The overall model of the training and classification of malware using theCNN model
Experimental Results
The Drebin [146] dataset includes a total of 129,013 samples with 180 families (benign and malware). The details of the dataset are as follows:
A 10-fold cross-validation test was employed on the model An 80-10-10 ratio split on the dataset is employed for the training, testing, and validation phases.
The process of assigning labels to classes is executed in the following manner:
Each APK file will be extracted into a vector of 9,604 components, equivalent to the 9,604 features, the largest number of features in the dataset. Any files that have missing values will be filled in with 0. The features are arranged in the following order, with the label first.
All features are assembled into a CSV file row, so the feature file will have 9,605 columns (9,604 features and a label column). The data is tabular and organized in CSV files.
The raw feature set contains four types of features extracted from the manifest, which are:
• Hardware components (1): indicate the required hardware. Malware files can collect and send information, such as location, which requires GPS or networks.
• App components (3): represent the four components: activities, services, content providers, and broadcast receivers.
• Filtered intents (4): intents facilitate internal communications, sharing events and information among different app components. Malware files can exploit this mechanism to gather sensitive information.
Android apps are written in Java and compiled into bytecode contained in a DEX file, which directly establishes the app's behavior. The following information will be chosen as features:
• Restricted API calls (5): a request to use limited APIs is a suspicious action that needs monitoring, because the Android permission system restricts some critical APIs.
• Used permissions (6): required permissions that the app needs to function properly.
•Suspicious API calls (7): calls to APIs that give access to essential databases or resources.
• Network addresses (8): malware files usually require a network connection to collect data from victim devices. Some network addresses may be hackers' servers, botnets, etc.
Information will be extracted from an APK file as files containing strings, which will then be converted to a binary vector and stored in a CSV file. Each vector component corresponds to a feature with a value of 1 or 0. The value "1" represents the presence of a specific feature, while the value "0" indicates its absence. The first column is the label of the file. All missing values were assigned the value 0.
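The string-to-binary-vector encoding just described can be sketched as follows; the feature names are hypothetical placeholders:

```python
def to_binary_row(label, present_features, all_features):
    """Encode one APK as a CSV row: the label first, then 1/0 per feature.
    Features absent from the file (including missing values) become 0."""
    present = set(present_features)
    return [label] + [1 if f in present else 0 for f in all_features]

# Hypothetical feature universe and one extracted sample.
features = ["perm.SEND_SMS", "perm.INTERNET", "api.getDeviceId"]
row = to_binary_row("malware", ["perm.INTERNET"], features)
# row == ["malware", 0, 1, 0]
```

One such row per APK, over the full 9,604-feature universe, yields the 9,605-column CSV described above.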
The architecture model used in this experiment is shown in Fig. 3.3:
The feature matrix, of shape 98x98, goes through the first convolutional layer, with a 3x3 filter size and 32 filters. The output is a matrix of shape 98x98x32. A max pooling layer of size 2x2 with stride 2 is applied to the first layer's output, reducing the feature matrix to 49x49. Similarly, the max pooling layer's output is the input of the second convolution layer, with 64 filters of size 3x3, which is then reduced to 25x25x64 by a second max pooling layer. The output of the last convolution and pooling layer is a feature matrix with a shape of 13x13x64. A flatten layer changes the feature matrix to a size of 10816x1, which is then fed through a fully connected layer with 1024 neurons. Finally, the output layer's number of neurons depends on the number of malware classes introduced in the training stage.
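The layer-by-layer sizes above can be checked with a small shape-tracing helper. The stated 13x13x64 output is consistent if a third conv/pool stage (assumed here, since the text names only the "last" stage) follows the 25x25x64 output, and if pooling rounds the halved size up:

```python
import math

def trace(size, channels, layers):
    """Trace feature-map shape through 'same'-padded conv layers and
    2x2/stride-2 max pooling (spatial size halves, rounded up)."""
    for kind, value in layers:
        if kind == "conv":
            channels = value              # filters set the channel count
        elif kind == "pool":
            size = math.ceil(size / value)
    return size, size, channels

# Stages as described in the text; the third conv/pool is an assumption.
h, w, c = trace(98, 1, [("conv", 32), ("pool", 2),
                        ("conv", 64), ("pool", 2),
                        ("conv", 64), ("pool", 2)])
# (h, w, c) == (13, 13, 64); flattening gives 13*13*64 == 10816 units.
```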
The average classification results for 10-fold are shown in Table 3.4 and visually in Fig. 3.4.
The present study reveals that the outcomes obtained were superior to previous research that employed the Support Vector Machine (SVM) algorithm, which recorded an accuracy rate of 94%. This demonstrates the feasibility of utilizing deep learning models for yielding outcomes that typically surpass the efficacy of other machine learning models.
Table 3.4: Experimental results using CNN model
Set | Number of samples | Train | Test | Validation | Test accuracy rate (%)
Proposed Method using WDCNN Model for Android Malware Classification
Proposed Idea
The Wide and Deep (W&D) model has demonstrated successful application in flower classification and capacity prediction. The proposed model is highly compatible with aggregated datasets from diverse sources. Its application will be shown through its use in detecting Android malware. The W&D model comprises two components: the Deep component and the Wide component. The Deep component is tasked with extracting features from the raw feature set. The Wide component retains selected features of APK files, for example, a list of used APIs and required permissions.
Fig. 3.5 describes the WDCNN model operation diagram. First, the sample dataset, a set of APK files, is extracted to produce a raw feature set consisting of API calls, permissions, and grey image pixel features. Each grey image is generated from the bytecode extracted from an APK file. According to Algorithm 6, the raw feature set is divided into two subsets: F_w and F_d. F_w includes general API call and permission features; F_d consists of grey image pixel features. F_w is put into the wide component of the model. F_d is put into the deep component of the model, which uses CNN. This model includes two components, the wide and the deep component, as follows:
The deep learning component can help derive new features (a highly generalizable deep learning model) based on an internal structure consisting of convolutional and pooling layers. The raw "image" features (features in F_d) are used as the input to the DeepCNN model.

Figure 3.4: Test rate according to the 10-fold

The DeepCNN model has an input matrix of 128x128 and four convolutional layers with pooling layers, which generalize the features. In the first layer, the convolution has 32 filters, creating 32 matrices. The size of the max pooling is 2x2, meaning that the size of each output convolutional matrix is reduced by 4, resulting in a 64x64 matrix. In the second layer, using 32 filters and 2x2 max pooling, the number of matrices is 32, but the matrix size is again reduced by four times, becoming a 32x32 matrix. In the third layer, with 64 filters and 2x2 max pooling, 64 matrices of size 16x16 are created. The number of filters and the max pooling size in the fourth layer are the same as in layer 3, so the output is 64 matrices of size 8x8. Finally, in the flattening layer, the outputs of the fourth layer are converted to a vector of 4096 neural units. This vector is the output of the deep component. The detailed implementation steps are shown on the left of Fig. 3.5.
The wide component is a generalized linear model used for large-scale regression and classification problems [152]. This component is responsible for memorizing feature interactions. In this work, the wide component is the vector of API call and permission features. Since there are too many API calls in the raw dataset, only the top (1000) most popular ones in the dataset are chosen.
The API features are the top (1000) features from the raw dataset, and all permission features in the raw dataset are used. They are part of the wide component.

Figure 3.5: WDCNN model operation diagram

The neurons of the DeepCNN model and the wide component were combined as the input to a dense layer of 1,024 neurons, which produced the output layer as a set of labels. The detailed implementation steps are shown in Fig. 3.6.
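The combination step can be sketched as a single forward pass with random weights. The sizes used here (a 4096-unit deep vector, 1,877 wide features, 1,024 dense neurons, 229 labels) follow the experimental setup described in this chapter but are illustrative assumptions in this sketch:

```python
import numpy as np

def wide_deep_forward(deep_vec, wide_vec, W_dense, W_out):
    """Sketch of the WDCNN head: concatenate the deep component's
    flattened vector with the wide feature vector, pass the result
    through a ReLU dense layer, then a softmax output over labels."""
    combined = np.concatenate([deep_vec, wide_vec])
    hidden = np.maximum(0.0, W_dense @ combined)   # dense layer (ReLU)
    logits = W_out @ hidden
    e = np.exp(logits - logits.max())
    return e / e.sum()                             # softmax over labels

rng = np.random.default_rng(1)
deep = rng.random(4096)           # deep CNN output (flattened)
wide = rng.random(1000 + 877)     # top-1000 APIs + 877 permissions (assumed)
W1 = rng.normal(scale=0.01, size=(1024, deep.size + wide.size))
W2 = rng.normal(scale=0.01, size=(229, 1024))
probs = wide_deep_forward(deep, wide, W1, W2)
# probs is a distribution over the 229 class labels.
```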
Building Components in the WDCNN Model
The model's objective is to integrate the rapid classification capability of wide learning with the generalization capacity of deep learning. The input feature set will be partitioned into two corresponding subsets. Multiple convolutional neural networks (CNNs) were employed to execute the deep learning component. Each CNN can be configured to use a combination of convolutional and pooling layers, or to rely solely on convolutional layers encompassing convolution and filtering techniques. This facilitates the process of generalizing characteristics and decreasing the number of dimensions. Concurrent partitioning is required for the wide and deep feature sets. To construct the mathematical model, the first step involves providing the following definitions:
Definition 3.1 (Initial feature set). The initial feature set, denoted by F_0, contains all features in the W&D learning model.
Definition 3.2 (Wide feature set). The wide feature set, denoted by F_w, is a subset of F_0, used for the wide learning component in the W&D learning model.
Definition 3.3 (Deep feature set). The deep feature set, denoted by F_d, is a subset of F_0, used to generalize features in the deep learning component of the W&D learning model.
Figure 3.6: Structure and parameters of the WDCNN model
• ϵ_1 is the mapping from the wide feature set to vector v_1.
• ϵ_2 is the mapping from the deep feature set to vector v_2.
To evaluate and partition the raw feature set into a deep feature set and a wide feature set, the following definitions are proposed:
Definition 3.4 (Raw feature). A raw feature, denoted by r, is a feature that does not represent, or does not entirely represent, a behavior, operation, or attribute of malware. For example, one byte in the DEX file, i.e., one pixel in the converted "image" of the DEX file.
Definition 3.5 (General feature). A general feature, denoted by α, is a feature that represents a behavior, operation, or property of malware. For example, a permission or an API.
Definition 3.6 (Group-level general feature). A group-level general feature, denoted by g, is a feature that represents a group of malware behaviors, operations, or attributes. For example, the memory access permission and file manipulation API groups can be understood as group-level general features.
The proposed solution must also tackle the problem of set division. The set F needs to be divided into F_d and F_w. Since a raw feature often does not have a whole meaning, it needs to be transformed into a general or group-level general feature to reduce the number of dimensions; thus, such features are put into the set F_d. Group-level general features are often divided into F_w because these features take on the meaning and generalizability of the malware. However, when the system has a large group-level general feature set, it is still possible to include them in F_d to reduce the number of dimensions. Depending on the problem context and the level of generality, the general features and the group-level features can be included in F_d or F_w.
The partition of the feature set
A partition is a division of the initial feature set into a wide feature set and a deep feature set suitable for the problem context and the properties of the feature set. Algorithm 6 is devised to partition the feature set.
This dissertation focuses on the initial feature set, which is composed of three subsets of features: the permission set, the API set, and the image file converted from the bytecode in the DEX file. As previously stated, the set of pixels in the image file represents the raw feature set, as pixels do not fully describe a malware behavior, operation, or attribute. Permissions and APIs are two broad categories of features, each representing a malware behavior or an operation. The deep component will receive the raw features,
Algorithm 6: Partition of the feature set.
Input: F0: initial feature set; A: set of behaviors/operations/attributes; G: set of behavior/operation/attribute groups; R: set of rules for division.
Output: Fw, Fd.

FR ← ∅; FA ← ∅; FG ← ∅;
while F0 ≠ ∅ do
    take a feature fi from F0;
    if fi is satisfied by the set A then
        FA ← FA ∪ {fi};
    else if fi is satisfied by the set G then
        FG ← FG ∪ {fi};
    else
        FR ← FR ∪ {fi};
    end
    remove fi from F0;
end
Fd ← FR; Fw ← ∅;
while FA ≠ ∅ do
    take a feature fi from FA;
    if fi is satisfied by the set R then
        Fw ← Fw ∪ {fi};
    else
        Fd ← Fd ∪ {fi};
    end
    remove fi from FA;
end
while FG ≠ ∅ do
    take a feature fi from FG;
    if fi is satisfied by the set R then
        Fw ← Fw ∪ {fi};
    else
        Fd ← Fd ∪ {fi};
    end
    remove fi from FG;
end
return Fw, Fd;
while the wide component will receive the general features. To implement Algorithm 6, F_d is chosen as the raw feature set, and F_w is selected as the set of all general features, including permissions and API calls.
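A minimal Python rendering of this partition is sketched below. The predicates stand in for the sets A and G and the rules R, which the dissertation leaves abstract, and the example features are hypothetical:

```python
def partition_features(F0, is_general, is_group, prefer_wide):
    """Sketch of Algorithm 6: split the initial feature set F0 into a
    deep set Fd and a wide set Fw.

    is_general(f):  True if f represents a behavior/operation/attribute (set A).
    is_group(f):    True if f is a group-level feature (set G).
    prefer_wide(f): True if the division rules R keep f in the wide set.
    Raw features always go to the deep set.
    """
    FR, FA, FG = [], [], []
    for f in F0:                       # first pass: classify each feature
        if is_general(f):
            FA.append(f)
        elif is_group(f):
            FG.append(f)
        else:
            FR.append(f)               # raw features
    Fd, Fw = list(FR), []
    for f in FA + FG:                  # second pass: apply the rules R
        (Fw if prefer_wide(f) else Fd).append(f)
    return Fw, Fd

# Hypothetical features: pixels are raw; permissions/APIs are general.
F0 = ["pixel_0", "pixel_1", "perm.SEND_SMS", "api.sendTextMessage"]
Fw, Fd = partition_features(
    F0,
    is_general=lambda f: f.startswith(("perm.", "api.")),
    is_group=lambda f: False,
    prefer_wide=lambda f: True)
# Fw == ["perm.SEND_SMS", "api.sendTextMessage"]; Fd == ["pixel_0", "pixel_1"]
```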
Experimental Results
Drebin [146] and AMD [148] are two widely used datasets for Android malware classification. These two datasets do not contain benign samples. Therefore, benign samples are collected from archive.org [147], a large and free application database. The dataset's composition is as follows:
• The Drebin dataset contains 5,560 samples of 179 malware families, of which 5,438 are used.
• The AMD dataset contains 24,553 samples of 71 malware families, of which 19,299 samples of 65 malware families are used.
There are 16 overlapping malware families between the AMD and Drebin datasets, resulting in 228 families of malware used in the experiments. Therefore, the total number of samples used in the experiments is 31,467, with 229 families (228 families of malware and one benign family).
The AMD and Drebin datasets used have the following characteristics:
• The malware files are not equally distributed into families; some families are more dominant than others. Thus, the total number of files in the top (10) malware families is 16,684, and that of the next ten families is 1,612 (the numbers of malware files in the AMD and Drebin suites are 17,742 and 3,551, respectively).
• Some malware families have few samples, fewer than ten. The AMD dataset has nine malware families with a sample size of less than 10, while the combined AMD and Drebin datasets have 127 such malware families.
Fig. 3.7 describes the top (20) malware families of the AMD and Drebin datasets. Because of the uneven distribution of samples among the malware families, empirical analysis was performed on the top (10) and top (20) most populous families.
The statistical analysis is shown in Table 3.5.
The dataset necessitates extracting two distinct types of features, namely "image" features and "string" features. Regarding the image features, the DEX files were transformed into images with dimensions 128x128. The dimensionality of an image feature is
Figure 3.7: Top 20 malware families in AMD and Drebin
Table 3.5: The datasets used for the experiment
16,384. The conversion process involves interpreting each set of three bytes in the DEX file as a color pixel in the resulting image. Subsequently, the color image is transformed into a monochromatic 128x128 image.
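The byte-to-pixel conversion can be sketched as below. Averaging the three bytes to obtain the grayscale value is an assumption, since the exact color-to-monochrome formula is not specified, as is zero-padding files shorter than 48 KB:

```python
def dex_to_gray_image(dex_bytes, size=128):
    """Sketch: read DEX bytes three at a time as one color pixel, convert
    to grayscale by averaging (assumed formula), and truncate/zero-pad to
    a size x size image. 128*128 pixels consume at most 3*16384 = 48 KB."""
    n = size * size
    pixels = []
    for i in range(0, min(len(dex_bytes), 3 * n), 3):
        chunk = dex_bytes[i:i + 3]
        pixels.append(sum(chunk) // len(chunk))   # grayscale via averaging
    pixels += [0] * (n - len(pixels))             # pad short files with zeros
    return [pixels[r * size:(r + 1) * size] for r in range(size)]

img = dex_to_gray_image(bytes(range(256)) * 10)   # toy byte string, not a real DEX
# img is a 128x128 grid of grayscale values in 0..255.
```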
Permissions and APIs are the two most commonly utilized features in malware classification when presented as strings. If a program is classified as malware, it may necessitate the transfer of sensitive data from the targeted device to an external location, specifically the server of the perpetrator. To execute the task, the program must request network-related permissions, including but not limited to INTERNET and ACCESS_WIFI_STATE. The malware may request additional permissions, such as
ACCESS_FINE_LOCATION and ACCESS_COARSE_LOCATION, to access the user's location. A correlation exists between permission requests and API invocations within the malware. Consequently, the present study involves the extraction of "string" characteristics, namely permissions and APIs:
• To obtain the desired permissions from APK files, an analysis of all permissions listed in the XML file was conducted.
• The tool "APKtool" is utilized to extract the APIs by reading DEX files [153]. Subsequently, all the application programming interfaces (APIs) used within the dataset were extracted, followed by a statistical analysis of the frequency of API occurrence in each file in the dataset. The 1,000 most frequently utilized APIs are selected. The number of files associated with the foremost API, ranked number one, is 22,082; this figure drops to 7,110 for the API ranked one thousand. The findings indicate that the APIs ranked within the top 1,000 exhibit favorable characteristics, being observed across numerous files within the dataset.
As described in the previous section, those extracted features were used as input to the WDCNN operation model. The entire feature set of the dataset is contained in a CSV file. The permission and API call features in the CSV file are arranged into columns (the first column is the label), and the corresponding rows are APK files. A cell is filled with "1" if the feature is extracted from the file and "0" if the feature is absent from the file.
To evaluate the WDCNN model, some experiments are conducted as shown in Fig. 3.8, following these testing scenarios:
– Scenario 1.1: using image features in the Deep model.
– Scenario 1.2: utilizing permission and API call features in the Wide component.
– Scenario 1.3: combining image features in the Deep model and incorporating permission and API call features in the Wide component (running the WDCNN model).
For each of sub-scenarios 1.1, 1.2, and 1.3, evaluation is performed using various datasets:
– AMD+benign datasets (full dataset, top 20, and top 10 datasets) - referred to as the "simple dataset" in the experiment.
– AMD+Drebin+benign datasets (full dataset, top 20, and top 10 datasets) - referred to as the "complex dataset" in the experiment.
• Scenario 2: verify the performance of the WDCNN model against alternative machine learning models such as KNN, RF, Logistic, DNN, and RNN. The following experiments were done to ensure a fair comparison.
– Scenario 2.1: compare the performance of WDCNN against common machine learning and deep learning algorithms (RNN, DNN, RF, KNN, Logistic Regression).
– Scenario 2.2: using an independent feature extraction scheme, proposed in research [110] and applied on two malware datasets. First, 256 image features (256 pixels) are extracted from each APK file: the file is converted to binary to form a grayscale image, and a histogram is then used to obtain the 256 pixel values. The performance of the proposed WDCNN model is compared to the best model in [110] using this new feature set.
•Scenario 3: due to the large size of the DEX files in the AMD dataset, only a maximum of 48 KB of data from each DEX file can be converted into an image; consequently, the data towards the end of the file cannot be utilized. To address this, the Drebin dataset is employed for evaluation in this scenario. Additionally, since API calls are extracted from the DEX file, experiments were conducted by combining permission features in the Wide component with image features in the Deep component.
– Scenario 3.2: employing image features in the Deep model and incorporating permission features into the Wide component (running the WDCNN model without API call features).
– Scenario 3.3: utilizing image features in the Deep model and incorporating both the permission and API call features into the Wide component (running the WDCNN model with all features).
In Scenario 3, evaluation was performed using both the Drebin + benign dataset and the top-10 malware families from the Drebin + benign dataset.
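The 256-pixel histogram feature used in Scenario 2.2 above can be sketched as a 256-bin count over the raw byte values of the file; the byte string below is a placeholder for real APK contents.

```python
def byte_histogram(data: bytes):
    """Reduce a raw byte stream to a 256-bin histogram (one bin per byte value)."""
    hist = [0] * 256
    for b in data:
        hist[b] += 1
    return hist

# Toy usage on a fake byte string (a real pipeline would read the APK file):
feat = byte_histogram(b"\x00\x00\x01\xff")
```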
The K-fold cross-validation method with k = 10 was used in this experiment. The dataset is divided into 10 parts: 8 parts for training and 2 parts for testing, of which one part is used for validation and one part for the final test. The experimental process was conducted according to the described scenarios.
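The 8/1/1 split described above can be sketched as follows; the exact assignment of the validation part is an assumption, since the text does not specify it.

```python
def ten_fold_splits(indices):
    """Yield (train, val, test) index lists for 10-fold cross-validation:
    for each fold, 8 parts train, 1 part validation, 1 part test."""
    parts = [indices[i::10] for i in range(10)]  # 10 roughly equal parts
    for k in range(10):
        test = parts[k]
        val = parts[(k + 1) % 10]  # assumed choice of validation part
        train = [i for j, p in enumerate(parts)
                 if j not in (k, (k + 1) % 10) for i in p]
        yield train, val, test

splits = list(ten_fold_splits(list(range(100))))
```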
Table 3.6 and Table 3.7 present the results from Scenario 1, which consists of three sub-scenarios: 1.1, 1.2, and 1.3. Accuracy (Acc) and Recall were employed as performance metrics to evaluate the model and dataset. Each experiment underwent 10-fold cross-validation to ensure robustness.
In Fig. 3.9, the outcomes on the Simple dataset are visually depicted on a chart for convenient evaluation and comparison.
Figure 3.9: Classification of malware depending on the number of labels
Table 3.6: Experimental results of Simple dataset
(c) WDCNN model with input: images into the Deep CNN component; permissions + APIs into the Wide CNN component
In Scenario 2.1, the testing results of various machine learning and deep learning models (RNN, DNN, RF, KNN, Logistic Regression) are presented and compared with the WDCNN model, as shown in Table 3.8.
On the other hand, in Scenario 2.2, experiments similar to the study [110] were done to evaluate the image features when integrated into the proposed WDCNN model and the DNN model as implemented in [110]. The outcomes are displayed in Table 3.9.
The purpose of this scenario is to utilize the Drebin dataset (with its small DEX files) to evaluate the incorporation of image features into the Deep model. This evaluation aims to accurately assess the synthesis of image features within the Deep model.
The results of Scenario 3 are presented in Table 3.10.
3.3.3.5 Evaluation Results
a) Suitability assessment of the model
Table 3.6 and Table 3.7 show the average scores of all malware families in the two datasets (the Simple and Complex datasets) for three methods (the proposed WDCNN, DeepCNN, and the Wide component). The WDCNN shows the best performance compared to the others. The average accuracy and recall of DeepCNN are relatively low, less than
Table 3.7: Experimental results of Complex dataset
(c) WDCNN model with input: images into the Deep CNN component; permissions + APIs into the Wide CNN component
Table 3.8: Experimental results when comparing models
Features DNN Deep CNN WDCNN
70%. This leads to the conclusion that using only the image feature as input for the CNN model is unsuitable for Android malware classification.
The average accuracy of the Wide component with permission and API features is high, reaching 97.68% and 94.53% for the Simple Full and Complex Full datasets, respectively. These results demonstrate the critical importance of permission and API call features in classifying Android malware.
The superior performance of WDCNN demonstrates that the proposed model is an ideal candidate for Android malware classification (98.64% with the Simple Full dataset and 95.08% with the Complex Full dataset), as it retains the benefits of the Wide component while utilizing deep learning to extract useful information from raw features.
b) Evaluation of malware families
As mentioned above, the malware families differ widely in the number of samples.
Table 3.9: Accuracy comparison of models. Features: Images 128x128 + permission + API
Table 3.10: Experimental results with scenario 3 (%)
Scenario 3.1: Put Images into Deep model
Scenario 3.2: Put Images into Deep model; Permission into Wide component
Scenario 3.3: Put Images into Deep model; Permission and API call into Wide component
Some malware families include only three or four samples, while the largest one includes 3,970 samples, as shown in Fig. 3.7. Hence, the average accuracy over all families may not reflect the real performance of each model. Instead, the recall metric is measured for each family among the top-10 and top-20 largest malware families. The proposed WDCNN shows the best performance on the top-10 and top-20 subsets of the Simple and Complex datasets, as shown in Fig. 3.9 (99.54% with the Acc measure and 99.42% with the Recall measure when tested on the Simple dataset). Early stopping was applied to prevent the model from overfitting.
c) Comparing WDCNN with some other machine learning models
Experiment 3.1 was conducted on other models such as RNN [154], DNN [110], KNN, RF, and Logistic Regression, using the same feature set extracted from the Simple dataset (consisting of the AMD dataset and benign samples). The results of the 10-fold cross-validation are shown in Table 3.9. The results indicate that the proposed WDCNN outperformed the RNN model (the best model mentioned in [154]) by 1.38%, i.e., 98.64% compared to 97.26%. In [110], the DNN with 10 hidden layers (DNN(10)) yielded the best results when tested with the dataset, obtaining an average classification result of 94.5%, lower than that of the WDCNN model by 9.18%.
d) Evaluation when extracting features as in the study [110]
The chosen feature extraction scheme was mentioned in [110]. The scheme converts the APK file to binary, then to 256-pixel images using histograms. The converted image was then fed to the DNN and DeepCNN models. In contrast, the feature set combining the 256-pixel image feature with permission and API features was fed into the WDCNN model and the DNN(10) model. The results are shown in Table 3.8. The outcome shows that with the 256-pixel image features obtained as in [110], the result of the DeepCNN component was 7.16% higher than that of the DNN(10). However, both had low accuracy (less than 50%). When combined with the permission and API features, the WDCNN model gives a result of 97.73%, 1.26% higher than that of the DNN(10) model. Additionally, regarding the recall metric, the WDCNN model is much better than the DNN(10), 85.65% compared to 47.8%. This shows that, on average, the rate of accurate prediction of the WDCNN is much higher than that of DNN(10).
Applying the Federated Learning Model
Federated Learning Model
The primary objective of this research is to introduce a novel approach for weight synthesis in a federated learning model, which considers both the accuracy on the sample set and the size of the sample set for each client. Upon completing the training process, each workstation transmits its weight values, accuracy metrics, and sample set size to the central server. The server performs calculations based on the accuracy corresponding to each set of weights and the sample set size on the clients. The quality of a component sample set is reflected by its accuracy, while the sample set size determines the impact of that sample set on the synthesis process.
System model using federated learning:
Figure 3.10: DEX files by size in the Drebin dataset
The proposed system model is depicted in Fig. 3.11. Each client and member server uses the same CNN model.
Implementing the Federated Learning Model
To build a mathematical model as a basis for assessing the importance and aggregating the weights, the following definitions are introduced:
Definition 3.7 (The set of composite weights). The set of composite weights is the set that contains the weights calculated on the server based on the component weight sets; it is sent back for use by all clients in the system.
Definition 3.8 (The component weight set). The component weight set is the weight set trained on each client with its individual dataset by the CNN model. This weight set is sent to the server for aggregation.
Definition 3.9 (The component dataset). The component dataset is the individual dataset used to train each client. This dataset is updated and trained via transfer learning to improve the set of weights.
Definition 3.10 (The importance of the component set of weights). The importance of a component set of weights is a value that evaluates the influence of this set on the composite weights, denoted a.
In deep learning, the larger the training dataset, the more the network is trained and the more valuable the weights are. Therefore, the significance of the component weights is defined in terms of its dependence on the dataset size and the reported accuracy. The importance is calculated according to Equation 3.5:

a_i = k1 * (Acc_i / Σ_j Acc_j) + k2 * (D_i / Σ_j D_j),  with k1 + k2 = 1    (3.5)

where D_i is the sample set size of client i, Acc_i is the accuracy reported by client i, and k1, k2 are influence coefficients.

Figure 3.11: Overall model using federated learning
In this proposed federated learning model, each client needs to send a set of com- ponent weights along with the size of the dataset.
The process of aggregating weights is as follows:
•Training at clients: each client operates independently, and users use the device to check for malware. At the time of testing, the client stores the test file along with the features extracted from the test samples. At the specified time intervals, the sample characteristics are input into the model on the workstation for additional training. The updated set of weights, accuracy values, and test results of the samples are then saved.
•Send data to the server: based on the preset time in the system, the client sends its set of weights and results to the server. If a client has no data available at that time, it does not send anything. The server aggregates data based on the received information.
•Aggregate weights on the server: on the server, based on the results sent by each client, a corresponding importance is assigned to each sending client's set of weights according to Equation 3.5.
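The aggregation step above can be sketched as follows. The importance formula a_i = k1·(Acc_i/ΣAcc) + k2·(D_i/ΣD) is an assumed reading of Equation 3.5, and the tiny weight vectors are placeholders for real model parameters.

```python
def aggregate(clients, k1=0.6, k2=0.4):
    """clients: list of (weights, accuracy, n_samples); returns composite weights.

    Importance a_i = k1*(Acc_i / sum Acc) + k2*(D_i / sum D), an assumed
    reading of Equation 3.5, with k1 + k2 = 1.
    """
    acc_sum = sum(a for _, a, _ in clients)
    n_sum = sum(n for _, _, n in clients)
    importances = [k1 * a / acc_sum + k2 * n / n_sum for _, a, n in clients]
    dim = len(clients[0][0])
    return [sum(imp * w[j] for (w, _, _), imp in zip(clients, importances))
            for j in range(dim)]

# Toy usage: two equally important clients with 2-dimensional weight vectors.
composite = aggregate([([1.0, 0.0], 0.9, 100), ([0.0, 1.0], 0.9, 100)])
```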
Experimental Results
The AMD malware dataset [148] was used. This dataset contains 24,533 samples across 71 malware families, collected from 2010 to 2016. However, many malware families in this dataset contain a minimal number of files (fewer than ten files per family), and some samples are faulty during analysis and extraction. Therefore, only 37 malware families with at least 20 files each are kept; the total number of malware files used is 18,956. Combined with the malware, 6,771 benign files taken from [147] are utilized. Thus, the experimental data comprises a total of 38 classes and 25,707 files.
AndroPyTool is used to extract APK files to JSON format. From there, a Python program was developed to obtain the following features:
•Permissions: these are the permissions declared and used in the program's source code. There are two types: permissions provided by the Android operating system and permissions declared by the programmer. The total number of permissions is 877.
•API calls: these are the declared and used APIs. The number of APIs in the dataset is huge; therefore, the top 1,000 most used APIs are taken as a feature subset.
The features are converted to numeric values based on two distinct groups of statically extracted features: permissions and API calls. For each APK file, a binary representation is adopted, whereby a feature that is used is assigned a value of "1"; in contrast, a feature that is not used is assigned a value of "0". The conversion from strings to numbers for inclusion in the deep learning model is done as in Algorithm 3.
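Algorithm 3 is not reproduced in this section; a plausible sketch of the string-to-binary conversion it describes, with hypothetical feature names, is:

```python
def to_binary_vector(used_features, vocabulary):
    """Map a set of used feature strings to a 0/1 vector over a fixed vocabulary."""
    return [1 if name in used_features else 0 for name in vocabulary]

# Hypothetical vocabulary of permissions and API calls:
vocab = ["android.permission.SEND_SMS", "android.permission.INTERNET",
         "Landroid/telephony/SmsManager;->sendTextMessage"]
vec = to_binary_vector({"android.permission.INTERNET"}, vocab)
```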
The experiment conducted in the present study is similar to that of [Pub.8]. In this experiment, instead of transferring only the number of files and the weights from the clients to the server, the transmission is extended with the Acc measure from the clients to the server. The objective is to compare the Acc of both the clients and the server and to establish a collection of aggregate weights derived from each set of weights transmitted to the server.
As in the paper [Pub.8], the dataset is divided into ten parts (dividing the files of each label equally) and used for experiments according to the following steps:
•Step 1: the training and testing process begins individually on multiple computers. The data (train, train1, train2, train3) is distributed to the server (S) and three clients (CL1, CL2, CL3).
•Step 2: the server calculates the average weights from the component computers and returns the results to the clients.
•Step 3: dataset train4 is assigned to CL1, and train5 is assigned to CL2 for training and updating the set of weights.
•Step 4: using the updated weights from Step 3, the server updates the set of weights and sends the updated information back to the clients.
•Step 5: repeat Step 3 and Step 4 with new data assignments (train6 to CL1 and train7 to CL2).
•Step 6: after completing the previous steps, all data is combined and trained on a single computer for final testing.
Table 3.11: Average set of weights (accuracy - %)
Scenario 1: average weights. In this case, the clients only send the weights to the server. The server then sums them up and divides by the number of clients that sent data. The final training and test results are denoted as W1.
Scenario 2: aggregate weights depending on the number of samples. This is the direction suggested in [Pub.8]. The client computers send the set of weights and the number of samples to the server. The server calculates the aggregate weights based on the number of files each client sends. The final training and testing results are denoted as W2.
Scenario 3: aggregate weights using both the samples and the Acc.
The client sends the set of weights, the number of samples, and the tested Acc to the server. The server relies on the Acc value and the number of samples each client sends to assign importance to the weights according to Equation 3.5. The final training and testing results are denoted as W3. This experiment is evaluated according to the influence coefficients k1 and k2: (k1, k2) is increased in turn from (0, 1) to (1, 0) with a step of 0.1, satisfying k1 + k2 = 1.
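The coefficient sweep described above can be sketched as follows; the scoring function is a placeholder standing in for the full train-and-test run, not the dissertation's actual evaluation.

```python
def sweep(evaluate, steps=10):
    """Try (k1, k2) from (0, 1) to (1, 0) in steps of 0.1 with k1 + k2 = 1,
    returning the best ((k1, k2), score) pair."""
    results = []
    for i in range(steps + 1):
        k1 = round(i / steps, 1)
        k2 = round(1 - k1, 1)
        results.append(((k1, k2), evaluate(k1, k2)))
    return max(results, key=lambda r: r[1])

# Placeholder score peaking at k1 = 0.6, mimicking the reported best setting.
best = sweep(lambda k1, k2: -abs(k1 - 0.6))
```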
The training on the same machine (Step 6) is performed in each scenario. Therefore, the final result W_all is the average of three independent runs in the three scenarios.
The experimental results for Scenario 1, Scenario 2, and Scenario 3 correspond to Table 3.11, Table 3.12, and Table 3.13, respectively, and are shown in Fig. 3.12. The results of Table 3.13 and the W3 value in Fig. 3.12 show that the influence coefficients k1 = 0.6 and k2 = 0.4 give the highest result when varying the coefficients (k1, k2).
Fig. 3.13 shows the results when changing the influence coefficients (k1, k2). This shows the relationship between the number of trained files and the classification results. The above results show that the proposed weighted aggregation method gives a 97.08% result, which is the highest among the three weighting methods.
Table 3.12: Set of Weights according to the number of samples (accuracy - %)
Table 3.13: Our proposed set of weights (accuracy - %)
Chapter Summary
The CNN model was used in the experiments in Chapter 2 of the dissertation. However, to synthesize deep learning models in Chapter 3, the dissertation also covers the CNN model in Section 3.2. On the other hand, the CNN model is the basis for the proposed WDCNN model in Section 3.3.
In this chapter, the dissertation proposes, applies, tests, and improves some deep learning models for malware classification on Android. There are many deep learning models; however, the dissertation uses some typical ones for experimentation and evaluation, such as DBN, DNN, CNN, and RNN. The dissertation also proposes the WDCNN model to improve accuracy when classifying malicious code. The proposed methods have been tested for verification and evaluation on typical datasets, such as the Drebin and AMD datasets. The results are summarized in Table 3.14 as follows.
Figure 3.12: Compare the results of the weighted aggregation methods
Table 3.14: Summary of results of proposed machine learning, deep learning models and comparison
ID Deep learning and machine learning models
Types The experimental dataset
6 RF Compare in dissertation Drebin + AMD 84.80% [Pub.3]
7 RNN Compare in dissertation Drebin + AMD 97.26% [Pub.3]
8 RF Compare in dissertation Drebin 92.90% [Pub.6]
9 SVM Compare in dissertation Drebin 92.40% [Pub.6]
Conformity of the model with the features of the dataset:
Based on the summary results in Table 3.14 and the contents presented in this chapter, the dissertation provides a general evaluation of the appropriateness of the machine learning/deep learning models for the features of the datasets. The DBN model is a feedforward neural network with many hidden layers, but it is not truly a deep learning model because it lacks layers that generalize features. Therefore, it is ineffective when the number of features is large and the features are shallow. According to the experimental results and model evaluation, DBN is unsuitable for the Android malware detection task compared to the proposed models. The CNN model proposed
Figure 3.13: Classification results with influence coefficients
for the app has high accuracy and is suitable for datasets with shallow features and many features. The WDCNN model is ideal for diverse datasets, including shallow features and generalized features extracted from shallow data. The WDCNN model combines deep learning components on shallow features with traditional classification methods as a wide component. The Federated CNN model is suitable for decentralized datasets.
CNN is particularly suitable for datasets with shallow features, a large number of features, a large number of samples, and many labels.
To effectively train a model on many features, it is necessary to utilize both machine learning and deep learning models. This dissertation primarily employs deep learning models, exploring the convolutional neural network (CNN) model and its various iterations.
The efficacy of deep learning models has been demonstrated in experiments utilizing a consistent dataset and feature count relative to other machine learning models. The author of [Pub.1] presented findings indicating that implementing the CNN model with a 10-fold cross-validation approach yielded an accuracy rate of 96.23%. This outcome surpasses the accuracy achieved by SVM, which yielded a rate of 94%. The results are shown in Table 3.4.
In [Pub.3], the WDCNN model, an improvement of the CNN model, was used. In the WDCNN model, many feature sets, such as images, API calls, and permissions, were applied.
In the experiments, the results of the WDCNN model were compared with those of other deep learning and machine learning models. In addition, an experiment was also conducted according to the settings of [110]. All experiments show that the WDCNN model outperforms the other models (according to the Acc and Recall measures). The detailed results are shown in Table 3.8 and Table 3.9.
Utilizing machine learning and deep learning models has proven to be highly pro- ductive in detecting malware on Android The prevalence of the Android OS across numerous devices poses a challenge in developing a server system for malware detec- tion Transmitting APK files from client devices, including mobile phones and TVs, to the server for analysis and returning the results is time-consuming Conversely, utiliz- ing a single server may be inadequate in fulfilling the demand due to numerous client devices The utilization of multiple servers may result in elevated expenses Conse- quently, federated learning models have been employed to identify Android malware.
In [Pub.11], training on many machines has many advantages, such as:
•The devices send periodic information to the server, and the server updates the weights This helps synchronize learning on all devices, and the updates are run silently on the client and server according to the predefined time of the adminis- trator.
The study denoted as [Pub.11] reveals that the federated model, despite yielding marginally inferior outcomes in contrast to training on a solitary machine, offers numerous benefits, including instantaneous updates across all devices and reduced detection time, attributable to feature extraction and detection being executed on the client side. This demonstrates the practicality of implementing the model in actual settings. The results are shown in Table 3.13.
The Federated Learning model presented in the dissertation has been used to ex- perimentally deploy on Android devices (clients) and a server.
Research on improving the efficiency of Android malware detection and classification is significant. Solving this problem thoroughly requires research, as Android malware significantly impacts mobile users. The work done in the dissertation includes a survey of research related to detecting and classifying malware on Android from the inception of the Android operating system until now (especially studies from 2019 to the present). From these studies, I now have a more general view of the research directions in the problem of detecting and classifying malware on Android.
Regarding feature augmentation, I applied algorithms to augment features (co-occurrence matrix and Apriori), creating connections between features. Regarding feature selection, I proposed a new method for the feature selection problem (based on popularity and contrast calculation).
Regarding machine learning and deep learning models, in the dissertation I have applied traditional machine learning models such as SVM, KNN, and RF, and deep learning models such as DBN, CNN, RNN, etc. These models give positive experimental results.
In addition to applying these existing models, I used the WDCNN model, which is improved from the CNN model. Experimentally, this model has many advantages and gives higher results than other deep learning models.
Federated learning models were also studied and applied during my dissertation work. Although it does not improve the performance, this model helps put the research into practice.
I have also deployed my product on application platforms. Malware detection and classification apps can be run on websites, on mobile apps, or embedded in the Nextcloud system.
In summary, the dissertation has two contributions as follows:
– Feature augmentation based on the co-occurrence matrix in the work [Pub.2]
– Feature augmentation based on the Apriori algorithm in the work [Pub.6]
•The dissertation has obtained some significant results. However, there is room for improvement. In future work, I would like to explore the following research directions:
•In my dissertation research, I utilized the AMD and Drebin datasets to train and detect malware. However, these two datasets are outdated. Nowadays, many different types of malware have different ways to infect. Therefore, I will soon collect new malware samples, for example, from MalNet.
•Currently, new types of malware are challenging to decompile. I will collect more data related to code obfuscation and data on C++ code to enrich the dataset and be able to generalize the process of detecting and classifying malware.
•There are many feature extraction methods. In the dissertation, I mainly use the features of permissions and API calls. In the future, I will use other feature groups and feature generation methods such as function call graphs, images, etc.
•The use of a federated learning model has many advantages. However, when transferring data between the client and the server, I have not provided any safety measures. In the future, I will use encryption and protection measures to transfer data so that it is not stolen or falsified.