Toward a Deep Learning Approach for Detecting PHP Webshell

Ngoc-Hoa NGUYEN
VNU University of Engineering and Technology, Hanoi, Vietnam
hoa.nguyen@vnu.edu.vn

Viet-Ha LE
Office of the Government, Hanoi, Vietnam
levietha@chinhphu.vn

Van-On PHUNG
Office of the Government, Hanoi, Vietnam
phungvanon@gmail.com

Phuong-Hanh DU
VNU University of Engineering and Technology, Hanoi, Vietnam
hanhdp@vnu.edu.vn

ABSTRACT
The most efficient way of securing Web applications is to search for and eliminate the threats therein (both malware and vulnerabilities). When the source code of a Web application is available, Web security can be improved by detecting malicious code such as Web shells. In this paper, we propose a model that uses a deep learning approach to detect and identify malicious code inside PHP source files. Our method relies on (i) pattern matching techniques that apply Yara rules to build malicious and benign datasets, (ii) converting PHP source code to a numerical sequence of PHP opcodes, and (iii) applying a Convolutional Neural Network model to predict whether a PHP file embeds malicious code such as a webshell. We validate our approach with different webshell collections from reliable sources published on GitHub. The experimental results show that the proposed method achieves an accuracy of 99.02% with a 0.85% false positive rate.

CCS CONCEPTS
• Security and privacy → Malware and its mitigation; Web application security;

KEYWORDS
pattern matching, yara rules, deep learning, CNN, opcode sequence, webshell detection

ACM Reference Format:
Ngoc-Hoa NGUYEN, Viet-Ha LE, Van-On PHUNG, and Phuong-Hanh DU. 2019. Toward a Deep Learning Approach for Detecting PHP Webshell. In The Tenth International Symposium on Information and Communication Technology (SoICT 2019), December 4–6, 2019, Hanoi - Ha Long Bay, Viet Nam. ACM, New York, NY, USA, pages. https://doi.org/10.1145/3368926.3369733

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SoICT 2019, December 4–6, 2019, Hanoi - Ha Long Bay, Viet Nam
© 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-7245-9/19/12 $15.00
https://doi.org/10.1145/3368926.3369733

1 INTRODUCTION
Nowadays, web applications are everywhere, and Web security has received a lot of attention from both researchers and managers. According to Internet Live Stats, as of September 2019 [13], an enormous number of websites are attacked every day (from 25,000 hacked websites per day in April 2015 to 61,750 per day in September 2019), directly affecting nearly 4.43 billion Internet users.

When the source code of a Web application is available, Web security can be improved by detecting malicious code such as a webshell, which is a script installed in the source code of a web application to enable remote administration of the infected server. A webshell can be injected into the system directly by attackers or through a malicious plugin installed by the webmaster [7]. An essential feature of a webshell is command execution. With this unsophisticated weapon, an attacker can do many things, such as manipulating files and folders, listing active processes, or using it as a backdoor. These webshells are often extremely small, but their capabilities are diverse and highly adaptable. In addition, they sometimes encode themselves with methods such as base64 or gzinflate for self-defense. Because everything is wrapped in a single file, this type of webshell can be injected quickly.

A webshell can also be installed as another kind of backdoor. For example, CryptoPHP is a hidden backdoor found by Fox-IT (https://fox-it.com/). CryptoPHP is a threat that compromises Web servers on a large scale through the installation of unoriginal WordPress, Joomla, and Drupal themes and plug-ins. Its activities and properties include the following: (i) it integrates with popular content management systems such as Drupal, WordPress and Joomla, for instance by injecting hyperlinks into post content for Black Hat SEO purposes (https://en.wikipedia.org/wiki/Search_engine_optimization); (ii) it uses asymmetric cryptography (https://en.wikipedia.org/wiki/Public-key_cryptography) with RSA public keys (https://en.wikipedia.org/wiki/RSA_(cryptosystem)) for communication between the victim's server and the C&C server; (iii) when C&C servers or domains are taken down repeatedly, it can encrypt its data and send it by email to specific addresses; (iv) it supports manual control via HTTP requests; (v) it automatically updates the list of C&C servers; and (vi) it can receive a new version from the C&C server and update itself.

Several popular approaches for securing web applications [3] have been investigated, for example safe web development [11], intrusion detection and protection systems, code review, and web application firewalls. Masood et al. [10] presented an efficient way of securing web applications by searching for and eliminating the vulnerabilities therein.

In fact, an attack campaign is temporary. However, attackers might upload their backdoors to the system for persistence, so that they can come back to interact with it and steal information at any time without exploiting any vulnerability. This situation leads to serious consequences [12], since these backdoors are Web shells that allow attackers to remotely control files and databases and to execute commands.
They are not only flexible but also countless.

Figure 1: Command execution in webshell b374k

Indeed, the main root causes are web developers' lack of secure programming awareness and their limited ability to discover both malicious web shells and web vulnerabilities. These issues in web application security raise the demand for a solution that allows web developers and security penetration testers to detect security-related problems as easily as possible.

In this research, we propose a model that uses a deep learning approach to detect and identify malicious code inside PHP source files. We focus on web applications written in PHP because PHP is the most widely used server-side programming language, running on about 79.0% of all websites (as of September 2019) [2]. Our method relies on three techniques. First, we use pattern matching techniques that apply Yara rules to build malicious and benign datasets. Second, we convert PHP source code to a numerical sequence of PHP opcodes. Finally, we apply a Convolutional Neural Network model to predict whether a PHP file embeds malicious code such as a webshell.

The rest of this paper is organized as follows. In Section 2, we review some basic principles, the literature and related work in malware detection and deep learning techniques. In Section 3, we describe our proposed solution, which combines the techniques mentioned above to detect malicious code in web application source code. In Section 4, we present our experimental results, evaluate our work and provide benchmarks. The last section is dedicated to conclusions and future work.

2 PRELIMINARIES AND RELATED WORK

2.1 Yara and Pattern Matching
Yara-based detection has two stages: pre-processing and searching. The pre-processing stage builds the Yara ruleset, while the searching stage applies pattern matching techniques (using regular expressions) based on that ruleset. Figure 2 illustrates the flowchart of PHP webshell detection using Yara and the pattern matching approach.

Figure 2: Webshell detection process using Yara

A Yara ruleset is a list of rules. Each rule defines strings, called patterns, and a logical condition over the matches and non-matches of those patterns that determines the final result:

    rule example
    {
        meta:
            description = "An example of YARA rule"
        strings:
            $a = "hello"
            $b = {01 23 45 67 89 ab cd ef}
            $c = /md5: [0-9a-zA-Z]{32}/
        condition:
            $a or $b and $c
    }

Each Yara rule consists of three components:
• Meta: stores metadata such as the description, creation date, references, etc.
• Strings: defines the patterns to be matched by the rule. Three types of strings can be defined: text, hexadecimal and regular expression.
• Condition: a Boolean expression that determines the logic used to combine the pattern matching results of the individual strings.

Based on algorithm design, there are three types of pattern matching technique: prefix-based matching, suffix-based matching and factor-based matching [15].
• Prefix-based matching: the matching process starts searching from the top of the sliding window. All characters in the text are read and checked; if a character does not match, the process moves to the next character. This is the simplest strategy, but the number of comparisons is large, so execution is slow.
• Suffix-based matching: the matching process starts searching from the bottom of the sliding window. It does not read all consecutive characters in the text, but skips characters based on the result of comparing the characters at the bottom of the sliding window. This is the basis for reducing the number of comparisons and the complexity of the algorithm.
• Factor-based matching: the matching process starts searching from the bottom of the sliding window. It does not read all consecutive characters in the text, but compares each special character to predict the set of factors (subsamples) of the original sample.

When detecting malicious code by pattern matching with a Yara ruleset, the pattern matching technique only determines the resource usage and the calculation speed. Accuracy is determined entirely by the Yara ruleset. In this study, we use the latest Yara ruleset for detecting PHP webshells from GitHub (https://github.com/Yara-Rules/rules) in conjunction with the rules we collected during our research.

An opcode (operation code, https://en.wikipedia.org/wiki/Opcode) is the portion of a machine language instruction that specifies the operation to be performed. For a program written in PHP or any other language, we can extract the list of opcodes it uses [4]. Statistics over the opcode lists of benign and malicious files show a large difference between the two. This can be explained by the fact that malicious files tend to steal data, tamper with the system to gain control, or check the system environment to hide their behavior, while benign files rarely do these things. For example, malicious files often call virtualization-related functions to check whether they are being executed in a virtualized environment; if so, they do not perform their malicious behavior in order to avoid detection. For this reason, machine learning approaches often use the opcode sequence to predict whether a file is malicious or benign. According to the statistical results of Bragen [5], the 15 opcodes most used by malicious files are those shown in Table 1.

Table 1: Top 15 opcodes used exclusively by malware

    Opcode       Description
    stosq        Store String
    syscall      Fast System Call
    setno        Set Byte on Condition - not overflow (OF=0)
    cvtsd2si     Convert Scalar Double-FP Value to DW Integer
    movmskpd     Extract Packed Double-FP Sign Mask
    prefetcht1   Prefetch Data Into Caches
    fprem        Partial Remainder (for compatibility with i8087 and i287)
    cmpsq        Compare String Operands
    lodsq        Load String
    scasq        Scan String
    cvtss2si     Convert Scalar Single-FP Value to DW Integer
    fnsave       Store x87 FPU State
    orpd         Bitwise Logical OR of Double-FP Values
    fxsave       Save x87 FPU, MMX, XMM, and MXCSR State
    movmskps     Extract Packed Single-FP Sign Mask

2.2 Webshell Detection by Deep Learning Approaches
Deep learning is the application of deep neural networks to machine learning. It can approximate complex functions by learning deep nonlinear network structures. A neural network consists of an input layer, followed by a list of hidden layers, and ends with an output layer. The output of one layer becomes the input of the next. Unlike traditional machine learning techniques, deep learning learns features rather than relying on task-specific algorithms. Different layers of the network automatically learn features at different levels, so the network can work on raw data without any manual feature engineering. One preeminent advantage of deep learning is that more training data lets it learn more robust features. One of the best-known deep learning techniques is the Convolutional Neural Network (CNN), in which the local receptive field of the previous layer is processed with a sliding window. Because of these advantages, more and more research applies deep learning techniques to malware detection [8].

2.3 Related Work
In this section, we briefly introduce related research and solutions for malware detection, including some popular Web shell detectors and malware detection based on deep learning.

Web Shell Detector (http://www.shelldetector.com/) is a Python tool that helps detect Web shells. It is a fairly good solution because it is easy to use, develop and customize. However, the Web shell pattern set in its database is not up to date and is also very limited.

PHP Malware Finder (https://github.com/nbs-system/php-malware-finder) is also an effective tool for scanning Web shells with its YARA-based rules. Because its detection mechanism is quite simple, the false positive rate of its results is somewhat high. In addition, PHP Malware Finder only flags suspicious files; it does not show whether a file is precisely a Web shell or a dangerous file.

VirusTotal (https://virustotal.com) is an online service that analyzes suspicious files, including viruses, worms and malicious Web application files, by running them through dozens of anti-virus products. However, it accepts at most one file per submission. This restriction makes scanning time-consuming, so the service is hardly suitable for validating a whole Web project.

Liu and Wang [9] proposed a malware detection system using deep learning on API calls. Using Cuckoo Sandbox (https://cuckoosandbox.org/), a solution for automated malicious code analysis, they extracted the API call sequences of malicious programs and then trained and tested several deep learning models (GRU, BGRU, LSTM, SimpleRNN and BLSTM) on a dataset of 21,378 samples. The results show that BLSTM performs best for malware detection, reaching an accuracy of 97.85%.

Kemal Ozkan [18] uses image processing techniques to detect malicious code. Noting that image-based techniques have been developed together with feature extraction and classifiers in order to discover relations between malware binaries in grayscale representation, they applied CNN features to the malware detection problem. With a dataset of 12,279 malware samples, the classifier has an accuracy of 85%, which increases to 99% with a dataset containing 9,339 samples.
Another study using CNN to detect webshells, by Yifan Tian [14], focuses on the HTTP requests of a web service. The authors use the word2vec technique to segment HTTP requests into HTTP symbol words, so that each HTTP request can be represented as a matrix. With this matrix representation, they apply a CNN to extract features and train a model for detecting malicious webshells.

Using 35 different features extracted from packet flows, M. Yeo [17] proposed an automated malware detection method based on a convolutional neural network (CNN), multi-layer perceptron (MLP), support vector machine (SVM) and random forest (RF). On a netflow capture from Stratosphere IPS, in which packets of nine different public malware families and normal-state packets were converted to flow data, they report more than 85% accuracy, precision and recall for all classes using CNN and RF.

3 PHP WEBSHELL DETECTION BY DEEP LEARNING METHOD
In this section, we propose a solution that combines pattern matching for malicious code detection, using a Yara ruleset, with a CNN-based approach.

3.1 Approach
Each technique has its own advantages and disadvantages. With the pattern matching method, the true positive rate for known types of malicious code is extremely high, but the method has difficulty predicting unknown types of malicious code. With the CNN deep learning method, the prediction model only reaches high accuracy if the training dataset is correct. While researching and developing the training dataset, we had difficulty finding malicious code samples. For the benign PHP dataset, it is not difficult to search the source code of popular PHP content management systems (CMS) such as WordPress, Joomla or Drupal. For the malicious code dataset, although we tried to use the most reliable data sources, most of the datasets we found also contained clean files, which led to inaccurate training results. With thousands of files in each dataset, it is difficult to remove the clean files manually. Therefore, our idea is to use Yara-based malware detection to standardize the dataset of malicious files that serves as training input for the CNN model. Accordingly, our method for detecting PHP webshells is based on three stages:
• Convert the PHP source files to numerical sequences of PHP opcodes. These opcode sequences are used to remove duplicate PHP files from both the benign and the webshell datasets.
• Build clean datasets of both benign and webshell samples for the training and testing sets. For this, pattern matching with Yara rules is used to generate the clean datasets.
• Build the Convolutional Neural Network model from the clean datasets. This model is used to predict whether a PHP file embeds malicious code such as a webshell.
We detail the last two stages in the next subsections.

3.2 Building Clean Datasets
Our idea for building clean datasets is shown in Figure 3.

Figure 3: Building Clean Datasets using Yara rulesets

As Figure 3 shows, we first eliminate the fake malicious files from the webshell datasets by applying Yara-based webshell detection with Yara rulesets to the raw datasets. After that, the training dataset of benign and malicious PHP files is translated to opcode sequences by an Opcode Converter. This converter also eliminates duplicate opcode sequences during conversion, so that they do not affect accuracy when training the model. Completely different PHP files can produce duplicate opcode sequences because an opcode sequence is a sequence of numbers representing the list of operation codes that are called; if two files happen to call the same list of opcodes, their opcode sequences will be the same. At the end of the process, we have clean benign and webshell datasets for both the training and the testing phases.

3.3 Detecting Webshell by CNN Model
We use the CNN model to implement our deep learning approach for detecting webshells in PHP source files. Figure 4 illustrates our training and testing model.

Figure 4: Webshell Detection Using CNN Model

The training input data consists of the benign and webshell datasets. As mentioned in the previous section, because the webshell dataset is collected from many different sources, it may include benign files, so we use the pattern matching method with the Yara ruleset to make the webshell data as accurate as possible. The standardized dataset of PHP files is then converted into opcode sequences, which become the training data for the CNN. The trained model is used to predict the test datasets, classifying each file as benign or webshell.

In our research, the Convolutional Neural Network for malware detection takes opcodes as its raw input data, as shown in Figure 5. The opcodes pass through a sequence of convolution layers at different levels. At the end, the output layer gives the probabilities of a file being malware or benign. By providing a large amount of training data, we expect the neural network to learn specific patterns of malware families, as well as powerful invariant features, that distinguish malware from benign files.

Figure 5: CNN Architecture for Detecting Webshell

4 EXPERIMENT AND EVALUATION
Based on the proposed method, we built and implemented our solution, named WSDetector, in Python. The experiments were performed on a computer with Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz processors (45MB cache, 18 cores per CPU), 128GB of main memory, CentOS Linux release 7.4.1708 and Python release 2.7. For the deep learning platform, we use tensorflow v.1.14.0, scikit-learn v.0.20.4, scipy v.1.2.2, numpy v.1.16.5 and yara-python v.3.10.0.

4.1 Evaluation Metrics
To evaluate PHP webshell detection tools, we use two different test sets: one contains malicious PHP web shells and the other is a collection of clean, benign PHP code. We count the True Positive (TP), False Positive (FP), False Negative (FN) and True Negative (TN) samples, and then compute the Accuracy, Precision, Recall (sensitivity, or true positive rate, TPR), F1-score and False Positive Rate (FPR) with the following formulas [15]:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \qquad \mathrm{Precision} = \frac{TP}{TP + FP} \]
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \qquad \mathrm{FPR} = \frac{FP}{FP + TN} \]
\[ \mathrm{F1\text{-}score} = \frac{2\,TP}{2\,TP + FP + FN} \]

4.2 Datasets
To build the webshell dataset, we collected a wide range of webshells from reliable, highly starred sources on GitHub (/tennc/webshell, /bartblaze/PHP-backdoors, /b374k/b374k, /JohnTroony/php-webshells, /xl7dev/WebShell, /BlackArch/webshells, /fuzzdb-project/fuzzdb, /LuciferoO/webshell-collector, /ysrc/webshell-sample, /webshellpub/awsome-webshell, /PHP-WebShell-Bypass-WAF, /linuxsec/indoxploit-shell). In total there are 4,171 PHP webshell files. For the benign dataset, different PHP frameworks, forums and content management systems were collected from their official sites: Laravel, WordPress, Joomla, phpMyAdmin, phpPgAdmin and phpBB (GitHub: https://github.com/laravel/laravel; https://github.com/WordPress/WordPress; https://github.com/joomla/joomla-cms; https://github.com/phpmyadmin/phpmyadmin; https://github.com/phppgadmin/phppgadmin; https://github.com/phpbb/). After removing non-PHP files, the benign set contains 7,400 files in total. To train and validate our proposed method, we divided the benign and webshell datasets into two parts with a ratio of 7:3, following the rule of thumb [16]. Based on the distribution of files across the dataset sources, the split into training and testing sets is made by whole sources.
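The source-level split described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: the function name `split_by_source`, the greedy selection strategy and the per-source file counts are all hypothetical. It assigns whole sources to the testing set, largest first, as long as the testing share stays at or below roughly 30% of all files.

```python
def split_by_source(file_counts, test_percent=30):
    """Assign whole sources to train or test (hypothetical greedy sketch).

    file_counts maps a source name to its number of PHP files.
    Sources are visited from largest to smallest and added to the test
    set only if the test set stays within test_percent of all files.
    """
    total = sum(file_counts.values())
    test_sources, test_size = [], 0
    ordered = sorted(file_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for source, count in ordered:
        # Integer arithmetic avoids floating-point rounding at the boundary.
        if (test_size + count) * 100 <= total * test_percent:
            test_sources.append(source)
            test_size += count
    train_sources = sorted(s for s in file_counts if s not in test_sources)
    return train_sources, test_sources


# Hypothetical file counts per source, for illustration only.
example_counts = {"source_a": 500, "source_b": 290, "source_c": 150, "source_d": 60}
train_sources, test_sources = split_by_source(example_counts)
```

Splitting by whole sources keeps files from the same repository out of both sets at once, at the cost of hitting the 7:3 ratio only approximately.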
Thus, the following table shows our final datasets for training and testing.

Table 2: Raw Benign and Webshell Datasets

                   Benign Dataset   Webshell Dataset
    Training Set   5,802            3,684
    Testing Set    1,598            487

To convert the PHP files into opcodes, we use the vld extension of the PHP engine (see more about VLD at https://github.com/derickr/vld) to implement the opcode converter. Using this tool, the raw datasets are first cleaned by removing duplicate opcode sequences. The resulting non-duplicate datasets are shown in Table 3.

Table 3: Non-duplicate Benign and Webshell Datasets

                   Benign Dataset   Webshell Dataset
    Training Set   4,875            1,049
    Testing Set    1,182            275

4.3 Experiment Results

a. Pattern Matching based Detection
From the non-duplicate webshell training dataset, we generated 3,242 Yara rules based on our previous research [15]. We used these rules to detect PHP webshells in the non-duplicate testing datasets (both benign and webshell). Table 4 shows the resulting confusion matrix, and Table 5 shows the performance of our Yara-based PHP webshell detector.

Table 4: Confusion matrix of PHP webshell detection by using Yara rules

                    Predicted Benign   Predicted Webshell
    Real Benign     1,180              2
    Real Webshell   25                 250

Table 5: Accuracy, Precision, F1-score and FPR of Yara based testing (%)

                Benign   Webshell
    Accuracy    98.15    98.15
    Precision   97.93    99.21
    Recall      99.83    90.91
    F1-Score    98.87    94.88
    FPR         9.09     0.17

These experimental results are clearly better than the results published in [15], which had a detection F1-score of 92%.

b. CNN based Detection
As in the previous experiment, we also used the non-duplicate training datasets to train the CNN model with the tensorflow engine. The maximum opcode sequence length in our datasets is 44,335. Therefore, all training opcode sequences are padded with a padding value (meaning no-operation) so that they have the same maximum length. The CNN configuration is accordingly based on a maximum of 100,000 inputs, 128 outputs and three 1D-convolution layers. After several training runs, we finally chose the following settings: the filter sizes for the layers are 3, [...] and [...] respectively; dropout is 0.5; the activation function is softmax; the optimizer is adam; learning_rate is 0.08; the loss function is categorical_crossentropy; the validation set is 10%; batch_size is 96; and the number of epochs is 32. Using this CNN model, we ran the test datasets and obtained the results given by the confusion matrix in Table 6 and the scores in Table 7.

Table 6: Confusion matrix of PHP webshell detection by using CNN model

                    Predicted Benign   Predicted Webshell
    Real Benign     1,157              25
    Real Webshell   10                 265

Table 7: Accuracy, Precision, F1-score and FPR of CNN based testing (%)

                Benign   Webshell
    Accuracy    97.60    97.60
    Precision   99.14    91.38
    Recall      97.88    96.36
    F1-Score    98.51    93.81
    FPR         3.64     2.12

c. Yara and CNN based Detection
In the above experiment, the CNN-based detection model clearly has a lower F1-score and accuracy, and a higher false positive rate on webshell detection (2.12% versus 0.17%), than the Yara-based detection model. However, after reviewing the misdetected samples, we found that they merely contain very common functions, such as fread or file_put_contents, which manipulate the contents of a file. We also examined the raw datasets in detail and found that they contain some mislabeled samples: files in the webshell datasets that are actually benign, and vice versa. We therefore decided to combine the Yara-based detector with the CNN-based model. First, we clean all non-duplicate datasets with the Yara-based detector in order to remove the fake webshells. The resulting cleaned datasets are then used to train and test the CNN-based webshell detection model. The cleaned datasets are summarized in Table 8.

Table 8: Cleaned Benign and Webshell Datasets

                   Benign Dataset   Webshell Dataset
    Training Set   4,871            618
    Testing Set    1,180            250

Using these datasets, we trained the CNN model with the same settings as before. After that, the cleaned test datasets were used to evaluate the model. The results are shown in the confusion matrix in Table 9 and the scores in Table 10.

Table 9: Confusion matrix of PHP webshell detection by using Yara+CNN model

                    Predicted Benign   Predicted Webshell
    Real Benign     1,170              10
    Real Webshell   4                  246

Table 10: Accuracy, Precision, F1-score and FPR of Yara+CNN based Detection (%)

                Benign   Webshell   Micro Avg   Macro Avg   Weighted Avg
    Accuracy    99.02    99.02      99.02       99.02       99.02
    Precision   99.66    96.09      99.66       97.88       99.04
    Recall      99.15    98.40      99.15       98.78       99.02
    F1-Score    99.41    97.23      99.41       98.32       99.03
    FPR         1.60     0.85       0.85        1.22        1.47

These results confirm that the CNN model built from the datasets cleaned by the Yara detector is better overall than the Yara-only and CNN-only approaches.

We also performed k-fold cross validation for this model. Table 11 shows the results obtained with k=5 folds.

Table 11: 5-fold Cross Validation Results (%)

               Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Average
    Accuracy   99.40    99.23    98.63    98.97    98.63    98.97
    F1-Score   98.20    97.67    95.98    96.88    96.02    96.95
    FPR        0.00     0.92     0.93     0.00     1.14     0.60

4.4 Evaluation
To assess the performance of our PHP webshell detection method based on Yara and CNN, we compare our results with other approaches. At this time, we have not been able to run the evaluation on the same machine and the same datasets (moreover, the source code and clean datasets of the other approaches are not published). Thus, we show only the results of each approach as published by its authors. Note that we use only the accuracy, F1-score and FPR metrics in this comparison. Table 12 compares our Yara+CNN model with the other approaches.

Table 12: Comparison of different webshell detection approaches (%)

    Approach                 Accuracy   F1-Score   FPR
    php-malware-finder [1]   94.23      96.46      [...]
    Word2Vec+CNN [14]        98.6       98.6       [...]
    RF-GBDT [6]              99.16      99.09      [...]
    GuruWS [15]              85.56      92.00      [...]
    Yara                     98.15      98.87      0.17
    CNN                      97.60      97.88      2.11
    Our Yara+CNN             99.02      99.41      0.85

The FPR values given for the four previously published approaches are 4.49, 0.68 and 0.00; one entry is missing, so they are listed here without a row assignment.

5 CONCLUSION
More and more unknown malicious code is being developed for installation into the source code of the web applications that dominate cyberspace, which is a huge challenge for cybersecurity researchers today. In this paper, we proposed an efficient model that combines a deep learning approach with pattern matching based on Yara rules and with the conversion of PHP source code to a numerical sequence of opcodes, in order to predict whether a PHP file embeds malicious code. Our experimental results show that the proposed method (Yara+CNN) achieves an accuracy of 99.02% with a 0.85% false positive rate.

For future work, we aim to extend our method to other programming languages such as ASP, ASP.NET, Java and Python. In addition, we will study and test other deep learning methods such as LSTM, compare them with the current methods, and select the most accurate predictive model.

ACKNOWLEDGMENTS
This work is partially supported by the national research project No. KC.01.19/16-20, granted by the Ministry of Science and Technology of Vietnam (MOST).

REFERENCES
[1] 2019. PHP Malware Finder. https://github.com/nbs-system/php-malware-finder
[2] 2019. Web Technology Surveys. http://w3techs.com/technologies/overview/programming_language/all/
[3] G. P. Bherde and M. A. Pund. 2016. Recent attack prevention techniques in web service applications. In 2016 International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT). 1174–1180. https://doi.org/10.1109/ICACDOT.2016.7877771
[4] Daniel Bilar. 2007. Malware detection through opcode sequence analysis using machine learning. Int. J. Electronic Security and Digital Forensics (2007). https://doi.org/10.1504/IJESDF.2007.016865
[5] Simen Rune Bragen. 2015. Opcodes as predictor for malware. VDP::Mathematics and natural science: 400::Information and communication science: 420::Security and vulnerability: 424 (01 2015).
[6] H. Cui, D. Huang, Y. Fang, L. Liu, and C. Huang. 2018. Webshell Detection Based on Random Forest–Gradient Boosting Decision Tree Algorithm. In 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC). 153–160. https://doi.org/10.1109/DSC.2018.00030
[7] Z. Cui, F. Xue, X. Cai, Y. Cao, G. Wang, and J. Chen. 2018. Detection of Malicious Code Variants Based on Deep Learning. IEEE Transactions on Industrial Informatics 14 (July 2018), 3187–3196. https://doi.org/10.1109/TII.2018.2822680
[8] Z. Cui, F. Xue, X. Cai, Y. Cao, G. Wang, and J. Chen. 2018. Detection of Malicious Code Variants Based on Deep Learning. IEEE Transactions on Industrial Informatics 14 (July 2018), 3187–3196. https://doi.org/10.1109/TII.2018.2822680
[9] Y. Liu and Y. Wang. 2019. A Robust Malware Detection System Using Deep Learning on API Calls. In 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC). 1456–1460. https://doi.org/10.1109/ITNEC.2019.8728992
[10] A. Masood and J. Java. 2015. Static analysis for web service security - Tools & techniques for a secure development life cycle. In 2015 IEEE International Symposium on Technologies for Homeland Security (HST). 1–6. https://doi.org/10.1109/THS.2015.7225337
[11] M. Mazumder and T. Braje. 2016. Safe Client/Server Web Development with Haskell. In 2016 IEEE Cybersecurity Development (SecDev). 150–150. https://doi.org/10.1109/SecDev.2016.040
[12] M. A. E. Mohd Efendi, Z. Ibrahim, M. N. Ahmad Zawawi, F. Abdul Rahim, N. A. Muhamad Pahri, and A. Ismail. 2019. A Survey on Deception Techniques for Securing Web Application. In 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS). 328–331. https://doi.org/10.1109/BigDataSecurity-HPSC-IDS.2019.00066
[13] Internet Live Stats. 2019. Internet Usage and Social Media Statistics
https: //www.internetlivestats.com/ [14] Yifan Tian, Jiabao Wang, Zhenji Zhou, and Shengli Zhou 2017 CNN-Webshell: Malicious Web Shell Detection with Convolutional Neural Network In Proceedings of the 2017 VI International Conference on Network, Communication and Computing (ICNCC 2017) ACM, New York, NY, USA, 75–79 https://doi.org/10 1145/3171592.3171593 [15] Le V-G, Nguyen H-T, Pham D-P, Phung V-O, and N-H Nguyen 2019 GuruWS: A Hybrid Platform for Detecting Malicious Web Shells and Web Application Vulnerabilities Transactions on Computational Collective Intelligence, Springer, Berlin, Heidelberg 11370, XXXII (01 2019), 184–208 [16] Le V-G, Nguyen H-T, Lu L-D, and N-H Nguyen 2016 A solution for automatically malicious Web shell and Web application vulnerability detection In Computational Collective Intelligence, Volume 9875 of the series Lecture Notes in Computer Science Springer-Verlag, Berlin, Heidelberg, 367–378 [17] M Yeo, Y Koo, Y Yoon, T Hwang, J Ryu, J Song, and C Park 2018 Flow-based malware detection using convolutional neural network In 2018 International Conference on Information Networking (ICOIN) 910–913 https://doi.org/10.1109/ ICOIN.2018.8343255 [18] K ÃŰzkan, Åđ IÅ§Äśk, and Y Kartal 2018 Evaluation of convolutional neural network features for malware detection In 2018 6th International Symposium on Digital Forensic and Security (ISDFS) 1–5 https://doi.org/10.1109/ISDFS.2018 8355390 521 ... Detection by Deep Learning Approaches Deep learning is the application of deep neural networks to machine learning Deep learning is capable of simulating complex functions by learning deep nonlinear network... propose a solution that combines pattern matching for malicious code detection technique using Yara rule set and CNN based approach 3.1 Approach Each technique has its own advantages and disadvantages,... 
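
As a cross-check on the Yara+CNN results, the per-class scores reported in Table 10 can be recomputed from the confusion matrix in Table 9. The sketch below is illustrative only and is not the authors' evaluation code. The names `CONFUSION` and `per_class_metrics` are hypothetical, and the count of 4 webshell files predicted as benign is an assumption inferred from the 98.40% webshell recall in Table 10 (246 of 250).

```python
# Counts taken from Table 9 (rows = real class, columns = predicted class).
# ASSUMPTION: the "real webshell / predicted benign" count of 4 is inferred
# from the 98.40% webshell recall in Table 10, i.e. 246 / (246 + 4).
CONFUSION = {
    "benign": {"benign": 1170, "webshell": 10},
    "webshell": {"benign": 4, "webshell": 246},
}


def per_class_metrics(confusion, positive):
    """Return one-vs-rest scores (in %) with `positive` as the positive class."""
    tp = confusion[positive][positive]
    # Positive samples predicted as any other class.
    fn = sum(n for pred, n in confusion[positive].items() if pred != positive)
    # Samples of other classes predicted as the positive class.
    fp = sum(row[positive] for real, row in confusion.items() if real != positive)
    # Samples of other classes predicted as some other class.
    tn = sum(
        n
        for real, row in confusion.items()
        if real != positive
        for pred, n in row.items()
        if pred != positive
    )
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": 100 * (tp + tn) / (tp + fn + fp + tn),
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * 2 * precision * recall / (precision + recall),
        "fpr": 100 * fp / (fp + tn),
    }


metrics = {label: per_class_metrics(CONFUSION, label) for label in CONFUSION}
for label, scores in metrics.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

Under the stated assumption, the rounded values agree with the Benign and Webshell columns of Table 10. Note that the FPR of each class is computed one-vs-rest, so the 0.85% false positive rate quoted for the method is the share of benign files flagged as webshells (10 of 1,180).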