Machine Learning for Cybersecurity Cookbook: Over 80 recipes on how to implement machine learning algorithms for building security systems using Python. Packt Publishing (2019)
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Sunith Shetty
Acquisition Editor: Ali Abidi
Content Development Editor: Roshan Kumar
Senior Editor: Jack Cummings
Technical Editor: Dinesh Chaudhary
Copy Editor: Safis Editing
Project Coordinator: Aishwarya Mohan
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Production Designer: Shraddha Falebhai
First published: November 2019
About the author
Emmanuel Tsukerman graduated from Stanford University and obtained his Ph.D. from UC Berkeley. In 2017, Dr. Tsukerman's anti-ransomware product was listed in the Top 10 ransomware products of 2018 by PC Magazine. In 2018, he designed an ML-based, instant-verdict malware detection system for Palo Alto Networks' WildFire service of over 30,000 customers. In 2019, Dr. Tsukerman launched the first cybersecurity data science course.
About the reviewers
Alexander Osipenko graduated cum laude with a degree in computational chemistry. He worked in the oil and gas industry for 4 years, working with real-time data streaming and large network data. Then, he moved to the FinTech industry and cybersecurity. He is currently a machine learning leading expert in the company, utilizing the full potential of AI for intrusion detection and insider threat detection.
Yasser Ali is a cybersecurity consultant at Thales, in the Middle East. He has extensive experience in providing consultancy and advisory services to enterprises on implementing cybersecurity best practices, critical infrastructure protection, red teaming, penetration testing, and vulnerability assessment, managing bug bounty programs, and web and mobile application security assessment. He is also an advocate speaker and participant in information security industry discussions, panels, committees, and conferences, and is a specialized trainer, featuring regularly on different media platforms around the world.
Computing the hash of a sample
Scraping GitHub for files of a specific type
Training a fake review generator
Web server vulnerability scanner using machine learning
Feature engineering for insider threat detection
Cyber threats today are one of the key problems every organization faces. This book uses various Python libraries, such as TensorFlow, Keras, scikit-learn, and others, to uncover common and not-so-common challenges faced by cybersecurity researchers.
The book will help readers to implement intelligent solutions to existing cybersecurity challenges and build cutting-edge implementations that cater to increasingly complex organizational needs. By the end of this book, you will be able to build and use machine learning (ML) algorithms to curb cybersecurity threats using a recipe-based approach.
Who this book is for
This book is for cybersecurity professionals and security researchers who want to take their skills to the next level by applying machine learning algorithms and techniques to computer security. This recipe-based book will also appeal to data scientists and machine learning developers who are now looking to bring smart techniques into the cybersecurity domain. A working knowledge of Python and familiarity with the basics of cybersecurity are required.
What this book covers
Chapter 1, Machine Learning for Cybersecurity, covers the fundamental techniques of machine learning for cybersecurity.
Chapter 2, Machine Learning-Based Malware Detection, shows how to perform static and dynamic analysis on samples. You will also learn how to tackle important machine learning challenges that occur in the domain of cybersecurity, such as class imbalance and false positive rate (FPR) constraints.
Chapter 3, Advanced Malware Detection, covers more advanced concepts for malware analysis. We will also discuss how to approach obfuscated and packed malware, how to scale up the collection of N-gram features, and how to use deep learning to detect and even create malware.
Chapter 4, Machine Learning for Social Engineering, explains how to build a Twitter spear-phishing bot using machine learning. You'll also learn how to use deep learning to have a recording of a target saying whatever you want them to say. The chapter also runs through a lie detection cycle and shows you how to train a Recurrent Neural Network (RNN) so that it is able to generate new reviews, similar to the ones in the training dataset.
Chapter 5, Penetration Testing Using Machine Learning, covers a wide selection of machine learning technologies for penetration testing and security countermeasures. It also covers more specialized topics, such as deanonymizing Tor traffic, recognizing unauthorized access via keystroke dynamics, and detecting malicious URLs.
Chapter 6, Automatic Intrusion Detection, looks at designing and implementing several intrusion detection systems using machine learning. It also addresses the example-dependent, cost-sensitive, radically imbalanced, challenging problem of credit card fraud.
Chapter 7, Securing and Attacking Data with Machine Learning, covers recipes for employing machine learning to secure and attack data. It also covers an application of ML for hardware security by attacking physically unclonable functions (PUFs) using AI.
Chapter 8, Secure and Private AI, explains how to use a federated learning model using the TensorFlow Federated framework. It also includes a walk-through of the basics of encrypted computation and shows how to implement and train a differentially private deep neural network for MNIST using Keras and TensorFlow Privacy.
Appendix offers you a guide to creating infrastructure to handle the challenges of machine learning on cybersecurity data. This chapter also provides a guide to using virtual Python environments, which allow you to seamlessly work on different Python projects while avoiding package conflicts.
To get the most out of this book
You will need a basic knowledge of Python and cybersecurity.
Download the example code files
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Log in or register at www.packt.com.
Once the file is downloaded, unzip or extract the folder using the latest version of one of the following:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789614671_ColorImages.pdf
Conventions used
There are a number of text conventions used throughout this book
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Append the labels to X_outliers."
A block of code is set as follows:
from sklearn.model_selection import train_test_split
import pandas as pd
Any command-line input or output is written as follows:
pip install scikit-learn pandas
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "The most basic approach to hyperparameter tuning is called a grid search."
Warnings or important notes appear like this.
Tips and tricks appear like this.
Sections
In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also).
To give clear instructions on how to complete a recipe, use these sections as follows:
Getting ready
This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at customercare@packtpub.com.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
A machine learning practitioner in cybersecurity is in a unique and exciting position to leverage enormous amounts of data and create solutions in a constantly evolving landscape.
This chapter covers the following recipes:
Train-test-splitting your data
Standardizing your data
Summarizing large data using principal component analysis (PCA)
Generating text using Markov chains
Performing clustering using scikit-learn
Training an XGBoost classifier
Analyzing time series using statsmodels
Anomaly detection using Isolation Forest
Natural language processing (NLP) using hashing vectorizer and tf-idf with
scikit-learn
Hyperparameter tuning with scikit-optimize
Train-test-splitting your data
In machine learning, our goal is to create a program that is able to perform tasks it has never been explicitly taught to perform. The way we do that is to use data we have collected to train or fit a mathematical or statistical model. The data used to fit the model is referred to as training data. The resulting trained model is then used to make predictions on future, previously unseen data. In this way, the program is able to manage new situations without human intervention.
One of the major challenges for a machine learning practitioner is the danger of overfitting: creating a model that performs well on the training data but is not able to generalize to new, previously unseen data. In order to combat the problem of overfitting, machine learning practitioners set aside a portion of the data, called test data, and use it only to assess the performance of the trained model, as opposed to including it as part of the training dataset. This careful setting aside of testing sets is key to training classifiers in cybersecurity, where overfitting is an omnipresent danger. One small oversight, such as using only benign data from one locale, can lead to a poor classifier.
There are various other ways to validate model performance, such as cross-validation. For simplicity, we will focus mainly on train-test splitting.
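For readers curious about the cross-validation alternative mentioned above, the following is a minimal, dependency-free sketch of how k-fold cross-validation partitions a dataset. The k_fold_indices helper and its parameters are hypothetical names chosen for illustration; this is not scikit-learn's implementation, only the idea that every sample is used for validation exactly once.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Slice the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Hypothetical dataset of 10 samples split into 5 folds.
splits = list(k_fold_indices(n_samples=10, k=5))
for train, validation in splits:
    print(sorted(validation), len(train))
```

A model would be trained and scored once per fold, and the scores averaged, which costs k training runs instead of one.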
In addition, we have included the north_korea_missile_test_database.csv dataset for use in this recipe.
1. Start by importing the train_test_split module and the pandas library, and read your features into X and labels into y:
from sklearn.model_selection import train_test_split
import pandas as pd
df = pd.read_csv("north_korea_missile_test_database.csv")
y = df["Missile Name"]
X = df.drop("Missile Name", axis=1)
2. Next, randomly split the dataset and its labels into a training set consisting of 80% of the size of the original dataset and a testing set of 20% of the size:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=31
)
3. We apply the train_test_split method once more, to obtain a validation set, X_val and y_val:
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=31
)
[Screenshot: output]
How it works
We start by reading in our dataset, consisting of historical and continuing missile experiments in North Korea. We aim to predict the type of missile based on the remaining features, such as facility and time of launch. This concludes step 1. In step 2, we apply scikit-learn's train_test_split method to subdivide X and y into a training set, X_train and y_train, and also a testing set, X_test and y_test. The test_size=0.2 parameter means that the testing set consists of 20% of the original data, while the remainder is placed in the training set. The random_state parameter allows us to reproduce the same randomly generated split. Next, concerning step 3, it is important to note that, in applications, we often want to compare several different models. The danger of using the testing set to select the best model is that we may end up overfitting the testing set. This is similar to the statistical sin of data fishing. In order to combat this danger, we create an additional dataset, called the validation set. We train our models on the training set, use the validation set to compare them, and finally use the testing set to obtain an accurate indicator of the performance of the model we have chosen. So, in step 3, we set test_size=0.25, which takes a quarter of the remaining 80%, that is, 20% of the original data. The end result consists of a training set of 60% of the original dataset, a validation set of 20%, and a testing set of 20%. Finally, we double-check our assumptions by employing the len function to compute the length of the arrays (step 4).
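To make the arithmetic of the two-stage split concrete, here is a minimal, dependency-free sketch. The split helper and the 100-row stand-in dataset are hypothetical; they only mimic the idea behind train_test_split (scikit-learn's own shuffling and rounding differ in detail). The point is that holding out 25% of the remaining 80% produces a 60/20/20 split.

```python
import random


def split(items, test_fraction, seed):
    """Shuffle a copy of items and hold out the last test_fraction of it."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_held_out = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_held_out
    return shuffled[:cut], shuffled[cut:]


samples = list(range(100))                  # stand-in for the rows of a dataset
train, test = split(samples, 0.2, seed=31)  # 80 rows kept, 20 held out for testing
train, val = split(train, 0.25, seed=31)    # 0.25 * 80 = 20 rows held out for validation

# The same sanity check as in the recipe: inspect the sizes with len.
print(len(train), len(val), len(test))
```

Checking the lengths this way is a cheap guard against accidentally leaking test rows into the training set.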
Standardizing your data
For many machine learning algorithms, performance is highly sensitive to the relative scale of features. For that reason, it is often important to standardize your features. To standardize a feature means to shift all of its values so that their mean is 0 and to scale them so that their variance is 1.
One instance when standardizing is useful is when featurizing the PE header of a file. The PE header contains extremely large values (for example, the SizeOfInitializedData field) and also very small ones (for example, the number of sections). For certain ML models, such as neural networks, the large discrepancy in magnitude between features can reduce performance.
Getting ready
Preparation for this recipe consists of installing the scikit-learn and pandas packages using pip. The command for this is as follows:
pip install scikit-learn pandas
In addition, you will find a dataset named file_pe_headers.csv in the repository for this recipe.
Start by importing pandas and reading the dataset into X:
import pandas as pd
data = pd.read_csv("file_pe_headers.csv", sep=",")
X = data.drop(["Name", "Malware"], axis=1).to_numpy()
[Screenshot: dataset X]
Next, standardize X by calling fit_transform(X) on a StandardScaler instance.
How it works
We begin by reading in our dataset (step 1), which consists of the PE header information for a collection of PE files. The values vary greatly, with some columns holding values in the hundreds of thousands and others staying in the single digits. Consequently, certain models, such as neural networks, will perform poorly on such unstandardized data. In step 2, we instantiate StandardScaler() and then apply it to rescale X using fit_transform(X). As a result, we obtain a rescaled dataset, whose columns (corresponding to features) have a mean of 0 and a variance of 1.
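To see what this rescaling does to each column, the following dependency-free sketch performs the same per-column computation by hand: subtract the column mean and divide by the population standard deviation. The standardize_columns helper and the three sample rows are hypothetical and chosen only to mimic the large and small magnitudes found in PE headers; this is not scikit-learn's code.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Rescale every column of a row-major table to mean 0 and variance 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns
    # to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical rows: one very large feature and one very small feature.
rows = [[262144.0, 3.0], [4096.0, 5.0], [131072.0, 7.0]]
scaled = standardize_columns(rows)
for row in scaled:
    print(row)
```

After the transformation, both columns live on the same scale, so neither dominates a distance computation or a gradient step purely because of its units.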
Summarizing large data using principal component analysis
Suppose that you would like to build a predictor for an individual's expected net fiscal worth at age 45. There are a huge number of variables to be considered: IQ, current fiscal worth, marriage status, height, geographical location, health, education, career state, age, and many others you might come up with, such as number of LinkedIn connections or SAT scores.
The trouble with having so many features is several-fold. First, the amount of data is large, which incurs high storage costs and computational time for your algorithm. Second, with a large feature space, it is critical to have a large amount of data for the model to be accurate. That's to say, it becomes harder to distinguish the signal from the noise. For these reasons, when dealing with high-dimensional data such as this, we often employ dimensionality reduction techniques, such as PCA. More information on the topic can be found at https://en.wikipedia.org/wiki/Principal_component_analysis.
PCA allows us to take our features and return a smaller number of new features, formed from our original ones, with maximal explanatory power. In addition, since the new features are linear combinations of the old features, this helps us to anonymize our data, which is very handy when working with financial information, for example.
Getting ready
The preparation for this recipe consists of installing the scikit-learn and pandas packages using pip. The command for this is as follows:
pip install scikit-learn pandas
In addition, we will be utilizing the same dataset, file_pe_headers.csv, as in the previous recipe.
How to do it
In this section, we'll walk through a recipe showing how to use PCA on data:
1. Start by importing the necessary libraries and reading in the dataset:
from sklearn.decomposition import PCA
import pandas as pd
data = pd.read_csv("file_pe_headers.csv", sep=",")
X = data.drop(["Name", "Malware"], axis=1).to_numpy()
2. Standardize the dataset, as is necessary before applying PCA.
How it works
We begin by reading in our dataset and then standardizing it, as in the recipe on standardizing data (steps 1 and 2). (It is necessary to work with standardized data before applying PCA.) We now instantiate a new PCA transformer instance, and use it to both learn the transformation (fit) and also apply the transform to the dataset, using fit_transform (step 3). In step 4, we analyze our transformation. In particular, note that the elements of pca.explained_variance_ratio_ indicate how much of the variance is accounted for in each direction. The sum is 1, indicating that all the variance is accounted for if we consider the full space in which the data lives. However, just by taking the first few directions, we can account for a large portion of the variance, while limiting our dimensionality. In our example, the first 40 directions account for 90% of the variance:
sum(pca.explained_variance_ratio_[0:40])
This produces the following output:
0.9068522354673663
This means that we can reduce our number of features to 40 (from 78) while preserving 90% of the variance. The implication is that many of the features of the PE header are closely correlated, which is understandable, as they are not designed to be independent.
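The meaning of the explained variance ratio can be illustrated on a toy problem small enough to solve by hand. The sketch below is an assumption-laden illustration, not scikit-learn's implementation: it builds a hypothetical two-feature dataset in which the second feature is almost a multiple of the first, computes the 2x2 covariance matrix, and obtains its eigenvalues in closed form. The two features are on comparable scales, so standardization is skipped for brevity.

```python
import math
import random


def explained_variance_ratio_2d(points):
    """Return the PCA explained variance ratios for two-dimensional data."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the symmetric covariance matrix [[a, b], [b, d]].
    a = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    d = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    centre = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    eigenvalues = [centre + radius, centre - radius]
    total = a + d  # total variance equals the sum of the eigenvalues
    return [value / total for value in eigenvalues]


# Hypothetical, strongly correlated features: y is roughly 0.9 * x plus noise.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(200)]
points = [(x, 0.9 * x + rng.gauss(0, 0.3)) for x in xs]

ratios = explained_variance_ratio_2d(points)
print(ratios, sum(ratios))
```

Because the two features are nearly redundant, the first direction carries almost all of the variance, which is the same effect that lets 40 directions stand in for 78 PE header features.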
Generating text using Markov chains
Markov chains are simple stochastic models in which a system can exist in a number of states. To know the probability distribution of where the system will be next, it suffices to know where it currently is. This is in contrast with a system in which the probability distribution of the subsequent state may depend on the past history of the system. This simplifying assumption allows Markov chains to be easily applied in many domains, surprisingly fruitfully.
In this recipe, we will utilize Markov chains to generate fake reviews, which is useful for pen-testing a review system's spam detector. In a later recipe, you will upgrade the technology from Markov chains to RNNs.
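Before turning to a library, it may help to see how little machinery a word-level Markov chain needs. The following is a minimal sketch under stated assumptions: build_chain, generate, and the tiny corpus are hypothetical names and data, the state is the last two words, and sentence boundaries are ignored. A dedicated library handles sentence starts, ends, and filtering more carefully.

```python
import random
from collections import defaultdict


def build_chain(text, state_size=2):
    """Map each run of state_size words to the words observed right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        chain[state].append(words[i + state_size])
    return chain


def generate(chain, n_words, seed=0):
    """Walk the chain from a random state, emitting at most n_words words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    output = list(state)
    while len(output) < n_words:
        followers = chain.get(state)
        if not followers:  # dead end: this state was never followed by a word
            break
        # The next word depends only on the current state, not on earlier history.
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


# Hypothetical miniature corpus standing in for the review text.
corpus = (
    "the airport is small and the staff is friendly . "
    "the airport is clean and the staff is helpful . "
    "the staff is friendly and the airport is small ."
)
chain = build_chain(corpus)
print(generate(chain, n_words=12))
```

Every three-word window of the generated text occurs somewhere in the corpus, which is why the output sounds locally plausible while drifting globally.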
Getting ready
Preparation for this recipe consists of installing the markovify and pandas packages using pip. The command for this is as follows:
pip install markovify pandas
In addition, the directory in the repository for this chapter includes a CSV
dataset, airport_reviews.csv, which should be placed alongside the code for the
chapter
How to do it
Let's see how to generate text using Markov chains by performing the following steps:
1. Start by importing the markovify library and a text file whose style we would like to imitate:
import markovify
import pandas as pd
df = pd.read_csv("airport_reviews.csv")
As an illustration, I have chosen a collection of airport reviews as my text:
"The airport is certainly tiny! "
2. Next, join the individual reviews into one large text string and build a Markov chain model using the airport review text:
from itertools import chain
4. Since we are using airport reviews, we will have the following as the output after executing the previous code:
On the positive side it's a clean airport transfer from A to C gates and outgoing gates is truly enormous - but why when we
arrived at about 7.30 am for our connecting flight to Venice on TAROM.
The only really bother: you may have to wait in a polite manner Why not have bus after a short wait to check-in there were a lots
of shops and less seating.
Very inefficient and hostile airport This is one of the time easy
to access at low price from city center by train.
The distance between the incoming gates and ending with dirty and always blocked by never ending roadworks.
Surprisingly realistic! Although the reviews would have to be filtered down to the best ones.
5. Generate 3 sentences with a length of no more than 140 characters:
for i in range(3):
print(markov_chain_model.make_short_sentence(140))
With our running example, we will see the following output:
However airport staff member told us that we were put on a
connecting code share flight.
Confusing in the check-in agent was friendly.
I am definitely not keen on coming to the lack of staff Lack of staff Lack of staff at boarding pass at check-in.
How it works
We begin the recipe by importing the Markovify library, a library for Markov chain computations, and reading in text, which will inform our Markov model (step 1). In step 2, we create a Markov chain model using the text. The following is a relevant snippet from the text object's initialization code:
class Text(object):
    reject_pat = re.compile(r"(^')|('$)|\s'|'\s|[\"(\(\)\[\])]")

    def __init__(self, input_text, state_size=2, chain=None,
                 parsed_sentences=None, retain_original=True, well_formed=True,
                 reject_reg=''):
        """
        parsed_sentences: A list of lists, where each outer list is a "run"
            of the process (e.g. a single sentence), and each inner list
            contains the steps (e.g. words) in the run. If you want to simulate
            an infinite process, you can come very close by passing just one,
            very long run.
        retain_original: Indicates whether to keep the original corpus.
        well_formed: Indicates whether sentences should be well-formed,
            preventing unmatched quotes, parenthesis by default, or a custom
            regular expression
        """
Performing clustering using scikit-learn
Clustering is a collection of unsupervised machine learning algorithms in which parts of the data are grouped based on similarity. For example, clusters might consist of data that is close together in n-dimensional Euclidean space. Clustering is useful in cybersecurity for distinguishing between normal and anomalous network activity, and for helping to classify malware into families.
Getting ready
Preparation for this recipe consists of installing the scikit-learn, pandas, and plotly packages using pip. The command for this is as follows:
pip install scikit-learn plotly pandas
In addition, a dataset named file_pe_header.csv is provided in the repository for this recipe.
[Screenshot: output of step 1]
2. Extract the features and target labels:
y = df["Malware"]
X = df.drop(["Name", "Malware"], axis=1).to_numpy()
3. Next, import scikit-learn's clustering module and fit a K-means model with two clusters to the data:
from sklearn.cluster import KMeans
5. To see how the algorithm did, plot the algorithm's clusters:
[Screenshot: plot of the algorithm's clusters]
The results are not perfect, but we can see that the clustering algorithm captured much of the structure in the dataset.
How it works
We start by importing our dataset of PE header information from a collection of samples (step 1). This dataset consists of two classes of PE files: malware and benign. We then use plotly to create a nice-looking interactive 3D graph (step 1). We proceed to prepare our dataset for machine learning. Specifically, in step 2, we set X as the features and y as the classes of the dataset. Based on the fact that there are two classes, we aim to cluster the data into two groups that will match the sample classification. We utilize the K-means algorithm (step 3), about which you can find more information at https://en.wikipedia.org/wiki/K-means_clustering. With the clustering algorithm fitted, we apply it to predict to which cluster each of the samples should belong (step 4). Observing our results in step 5, we see that clustering has captured a lot of the underlying information, as it was able to fit the data well.
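The K-means procedure itself is short enough to sketch without any library. The version below is a simplified illustration, not scikit-learn's implementation: the k_means helper and the six sample points are hypothetical, and the centroids are initialized from the first k points for determinism, whereas scikit-learn defaults to the k-means++ initialization.

```python
import math


def distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, k, n_iterations=20):
    """Alternate between assigning points to centroids and re-centering them."""
    centroids = list(points[:k])  # simplistic, deterministic initialization
    labels = [0] * len(points)
    for _ in range(n_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return labels, centroids


# Two hypothetical, well-separated groups standing in for benign and malicious samples.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Note that the cluster labels are arbitrary identifiers: the algorithm never sees the true classes, so matching clusters to malware and benign families is a separate step.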
Training an XGBoost classifier
Gradient boosting is widely considered one of the most reliable and accurate algorithms for generic machine learning problems. We will utilize XGBoost to create malware detectors in future recipes.
Getting ready
The preparation for this recipe consists of installing the scikit-learn, pandas, and xgboost packages using pip. The command for this is as follows:
pip install scikit-learn xgboost pandas
In addition, a dataset named file_pe_header.csv is provided in the repository for this recipe.
X = df.drop(["Name", "Malware"], axis=1).to_numpy()
2. Next, train-test-split the dataset:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
Analyzing time series using statsmodels
A time series is a series of values obtained at successive times. For example, the price of a stock sampled every minute forms a time series. In cybersecurity, time series analysis can be very handy for predicting a cyberattack, such as an insider employee exfiltrating data, or a group of hackers colluding in preparation for their next hit.
Let's look at several techniques for making predictions using time series.
from random import random
time_series = [2 * x + random() for x in range(1, 100)]
2. Plot your data:
[Screenshot: plot of the time series]
3. There is a large variety of techniques we can use to predict the subsequent value of a time series:
Moving average (MA):
from statsmodels.tsa.arima.model import ARIMA
# The older statsmodels.tsa.arima_model.ARMA class was removed in statsmodels 0.13;
# ARIMA with order=(0, 0, 1) fits the same MA(1) model.
model = ARIMA(time_series, order=(0, 0, 1))
model_fit = model.fit()
y = model_fit.predict(len(time_series), len(time_series))