Emmanuel Tsukerman, Machine Learning for Cybersecurity Cookbook: Over 80 recipes on how to implement machine learning algorithms for building security systems using Python, Packt Publishing (2019)

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.


Machine Learning for Cybersecurity Cookbook


Commissioning Editor: Sunith Shetty

Acquisition Editor: Ali Abidi

Content Development Editor: Roshan Kumar

Senior Editor: Jack Cummings

Technical Editor: Dinesh Chaudhary

Copy Editor: Safis Editing

Project Coordinator: Aishwarya Mohan

Proofreader: Safis Editing

Indexer: Tejal Daruwale Soni

Production Designer: Shraddha Falebhai

First published: November 2019


Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Fully searchable for easy access to vital information

Copy and paste, print, and bookmark content

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at customercare@packtpub.com for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.


About the author

Emmanuel Tsukerman graduated from Stanford University and obtained his Ph.D. from UC Berkeley. In 2017, Dr. Tsukerman's anti-ransomware product was listed in the Top 10 ransomware products of 2018 by PC Magazine. In 2018, he designed an ML-based, instant-verdict malware detection system for Palo Alto Networks' WildFire service of over 30,000 customers. In 2019, Dr. Tsukerman launched the first cybersecurity data science course.


About the reviewers

Alexander Osipenko graduated cum laude with a degree in computational chemistry. He worked in the oil and gas industry for 4 years, working with real-time data streaming and large network data. Then, he moved to the FinTech industry and cybersecurity. He is currently a machine learning leading expert in the company, utilizing the full potential of AI for intrusion detection and insider threat detection.

Yasser Ali is a cybersecurity consultant at Thales, in the Middle East. He has extensive experience in providing consultancy and advisory services to enterprises on implementing cybersecurity best practices, critical infrastructure protection, red teaming, penetration testing, and vulnerability assessment, managing bug bounty programs, and web and mobile application security assessment. He is also an advocate speaker and participant in information security industry discussions, panels, committees, and conferences, and is a specialized trainer, featuring regularly on different media platforms around the world.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Computing the hash of a sample

Scraping GitHub for files of a specific type

Training a fake review generator

Web server vulnerability scanner using machine learning

Feature engineering for insider threat detection

Cyber threats today are one of the key problems every organization faces. This book uses various Python libraries, such as TensorFlow, Keras, scikit-learn, and others, to uncover common and not-so-common challenges faced by cybersecurity researchers.

The book will help readers to implement intelligent solutions to existing cybersecurity challenges and build cutting edge implementations that cater to increasingly complex organizational needs. By the end of this book, you will be able to build and use machine learning (ML) algorithms to curb cybersecurity threats using a recipe-based approach.

Who this book is for

This book is for cybersecurity professionals and security researchers who want to take their skills to the next level by implementing machine learning algorithms and techniques to improve computer security. This recipe-based book will also appeal to data scientists and machine learning developers who are now looking to bring smart techniques into the cybersecurity domain. A working knowledge of Python and familiarity with cybersecurity fundamentals are required.

What this book covers

Chapter 1, Machine Learning for Cybersecurity, covers the fundamental techniques of machine learning for cybersecurity.

Chapter 2, Machine Learning-Based Malware Detection, shows how to perform static and dynamic analysis on samples. You will also learn how to tackle important machine learning challenges that occur in the domain of cybersecurity, such as class imbalance and false positive rate (FPR) constraints.

Chapter 3, Advanced Malware Detection, covers more advanced concepts for malware analysis. We will also discuss how to approach obfuscated and packed malware, how to scale up the collection of N-gram features, and how to use deep learning to detect and even create malware.


Chapter 4, Machine Learning for Social Engineering, explains how to build a Twitter spear-phishing bot using machine learning. You'll also learn how to use deep learning to have a recording of a target saying whatever you want them to say. The chapter also runs through a lie detection cycle and shows you how to train a Recurrent Neural Network (RNN) so that it is able to generate new reviews, similar to the ones in the training dataset.

Chapter 5, Penetration Testing Using Machine Learning, covers a wide selection of machine learning technologies for penetration testing and security countermeasures. It also covers more specialized topics, such as deanonymizing Tor traffic, recognizing unauthorized access via keystroke dynamics, and detecting malicious URLs.

Chapter 6, Automatic Intrusion Detection, looks at designing and implementing several intrusion detection systems using machine learning. It also addresses the example-dependent, cost-sensitive, radically-imbalanced, challenging problem of credit card fraud.

Chapter 7, Securing and Attacking Data with Machine Learning, covers recipes for employing machine learning to secure and attack data. It also covers an application of ML for hardware security by attacking physically unclonable functions (PUFs) using AI.

Chapter 8, Secure and Private AI, explains how to use a federated learning model using the TensorFlow Federated framework. It also includes a walk-through of the basics of encrypted computation and shows how to implement and train a differentially private deep neural network for MNIST using Keras and TensorFlow Privacy.

Appendix offers you a guide to creating infrastructure to handle the challenges of machine learning on cybersecurity data. This chapter also provides a guide to using virtual Python environments, which allow you to seamlessly work on different Python projects while avoiding package conflicts.

To get the most out of this book

You will need a basic knowledge of Python and cybersecurity.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.


You can download the code files by following these steps:

Log in or register at www.packt.com

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789614671_ColorImages.pdf

Conventions used

There are a number of text conventions used throughout this book

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Append the labels to X_outliers."


A block of code is set as follows:

from sklearn.model_selection import train_test_split

import pandas as pd

Any command-line input or output is written as follows:

pip install scikit-learn pandas

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "The most basic approach to hyperparameter tuning is called a grid search."

Warnings or important notes appear like this.

Tips and tricks appear like this.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also).

To give clear instructions on how to complete a recipe, use these sections as follows:

Getting ready

This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.


Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at customercare@packtpub.com.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.


Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com


machine learning practitioner in cybersecurity is in a unique and exciting position to leverage enormous amounts of data and create solutions in a constantly evolving landscape.

This chapter covers the following recipes:

Train-test-splitting your data

Standardizing your data

Summarizing large data using principal component analysis (PCA)

Generating text using Markov chains

Performing clustering using scikit-learn

Training an XGBoost classifier

Analyzing time series using statsmodels

Anomaly detection using Isolation Forest

Natural language processing (NLP) using hashing vectorizer and tf-idf with scikit-learn

Hyperparameter tuning with scikit-optimize


Train-test-splitting your data

In machine learning, our goal is to create a program that is able to perform tasks it has never been explicitly taught to perform. The way we do that is to use data we have collected to train or fit a mathematical or statistical model. The data used to fit the model is referred to as training data. The resulting trained model is then used to make predictions on future, previously unseen data. In this way, the program is able to manage new situations without human intervention.

One of the major challenges for a machine learning practitioner is the danger of overfitting, that is, creating a model that performs well on the training data but is not able to generalize to new, previously unseen data. In order to combat the problem of overfitting, machine learning practitioners set aside a portion of the data, called test data, and use it only to assess the performance of the trained model, as opposed to including it as part of the training dataset. This careful setting aside of testing sets is key to training classifiers in cybersecurity, where overfitting is an omnipresent danger. One small oversight, such as using only benign data from one locale, can lead to a poor classifier.

There are various other ways to validate model performance, such as cross-validation. For simplicity, we will focus mainly on train-test splitting.
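For readers curious about the cross-validation alternative mentioned above, the following is a minimal, dependency-free sketch of how k-fold cross-validation partitions a dataset. The helper name k_fold_indices and its parameters are hypothetical and chosen for illustration only; scikit-learn provides its own cross-validation utilities, which this sketch does not attempt to reproduce.

```python
import random


def k_fold_indices(n_samples, k, seed=31):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Split the shuffled indices into k nearly equal folds.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Every fold takes exactly one turn as the held-out test set.
        train_idx = [
            j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold
        ]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), "held out;", len(train_idx), "used for training")
```

Each sample is held out exactly once, so the model is always evaluated on data it was not trained on, and averaging the k scores gives a less split-dependent estimate than a single train-test split.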


In addition, we have included the north_korea_missile_test_database.csv dataset for use in this recipe.

1. Start by importing the train_test_split function and the pandas library, then read your features into X and labels into y:

from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("north_korea_missile_test_database.csv")
y = df["Missile Name"]
X = df.drop("Missile Name", axis=1)

2. Next, randomly split the dataset and its labels into a training set consisting of 80% of the size of the original dataset and a testing set 20% of the size:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=31
)

3. We apply the train_test_split method once more to obtain a validation set, X_val and y_val:

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=31
)


The following screenshot shows the output:

How it works

We start by reading in our dataset, consisting of historical and continuing missile experiments in North Korea. We aim to predict the type of missile based on the remaining features, such as facility and time of launch. This concludes step 1. In step 2, we apply scikit-learn's train_test_split method to subdivide X and y into a training set, X_train and y_train, and also a testing set, X_test and y_test. The test_size = 0.2 parameter means that the testing set consists of 20% of the original data, while the remainder is placed in the training set. The random_state parameter allows us to reproduce the same randomly generated split. Next, concerning step 3, it is important to note that, in applications, we often want to compare several different models. The danger of using the testing set to select the best model is that we may end up overfitting the testing set. This is similar to the statistical sin of data fishing. In order to combat this danger, we create an additional dataset, called the validation set. We train our models on the training set, use the validation set to compare them, and finally use the testing set to obtain an accurate indicator of the performance of the model we have chosen. So, in step 3, we choose our parameters so that the end result consists of a training set of 60% of the original dataset, a validation set of 20%, and a testing set of 20%: taking 25% of the 80% that remains after step 2 gives 0.25 × 0.8 = 0.2 of the original data. Finally, we double-check our assumptions by employing the len function to compute the length of the arrays (step 4).
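The 60/20/20 arithmetic described above can be checked on a small synthetic dataset. This is a sketch under stated assumptions: the arrays X and y below are made-up stand-ins for the missile dataset (100 samples, 3 features), while the train_test_split calls use the same parameters as the recipe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset: 100 samples, 3 features.
X = np.arange(300).reshape(100, 3)
y = np.arange(100) % 2

# First split: 80% train+validation, 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=31
)
# Second split: 25% of the remaining 80% is 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=31
)

# Verify the sizes with len, as in the final step of the recipe.
print(len(X_train), len(X_val), len(X_test))
```

With 100 samples, the three lengths come out as 60, 20, and 20, matching the intended proportions.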

Standardizing your data

For many machine learning algorithms, performance is highly sensitive to the relative scale of features. For that reason, it is often important to standardize your features. To standardize a feature means to shift all of its values so that their mean = 0 and to scale them so that their variance = 1.


One instance when standardizing is useful is when featurizing the PE header of a file. The PE header contains extremely large values (for example, the SizeOfInitializedData field) and also very small ones (for example, the number of sections). For certain ML models, such as neural networks, the large discrepancy in magnitude between features can reduce performance.

Getting ready

Preparation for this recipe consists of installing the scikit-learn and pandas packages in pip. Perform the following steps:

In addition, you will find a dataset named file_pe_headers.csv in the repository for this recipe.

import pandas as pd

data = pd.read_csv("file_pe_headers.csv", sep=",")
X = data.drop(["Name", "Malware"], axis=1).to_numpy()

Dataset X looks as follows:


Next, standardize X using a StandardScaler instance:

We begin by reading in our dataset (step 1), which consists of the PE header information for a collection of PE files. The values vary greatly, with some columns reaching into the hundreds of thousands and others staying in the single digits. Consequently, certain models, such as neural networks, will perform poorly on such unstandardized data. In step 2, we instantiate StandardScaler() and then apply it to rescale X using fit_transform(X). As a result, we obtain a rescaled dataset, whose columns (corresponding to features) have a mean of 0 and a variance of 1.
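The effect of standardization can be verified directly. The following is a minimal sketch, assuming a tiny made-up array in place of the PE header dataset; the values are hypothetical and only meant to mimic one very large column next to one very small column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales, loosely mimicking
# a size-like field next to a small count.
X = np.array(
    [
        [250000.0, 3.0],
        [512000.0, 5.0],
        [1024.0, 4.0],
        [780000.0, 8.0],
    ]
)

X_standardized = StandardScaler().fit_transform(X)

# Each column should now have mean ~0 and variance ~1.
column_means = X_standardized.mean(axis=0)
column_variances = X_standardized.var(axis=0)
print(column_means)
print(column_variances)
```

Both columns end up on the same scale, so neither dominates a model simply because of its units.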

Summarizing large data using principal component analysis

Suppose that you would like to build a predictor for an individual's expected net fiscal worth at age 45. There are a huge number of variables to be considered: IQ, current fiscal worth, marriage status, height, geographical location, health, education, career state, age, and many others you might come up with, such as number of LinkedIn connections or SAT scores.


The trouble with having so many features is several-fold. First, there is the sheer amount of data, which will incur high storage costs and computational time for your algorithm. Second, with a large feature space, it is critical to have a large amount of data for the model to be accurate; otherwise, it becomes harder to distinguish the signal from the noise. For these reasons, when dealing with high-dimensional data such as this, we often employ dimensionality reduction techniques, such as PCA. More information on the topic can be found at https://en.wikipedia.org/wiki/Principal_component_analysis.

PCA allows us to take our features and return a smaller number of new features, formed from our original ones, with maximal explanatory power. In addition, since the new features are linear combinations of the old features, this allows us to anonymize our data, which is very handy when working with financial information, for example.

Getting ready

The preparation for this recipe consists of installing the scikit-learn and pandas packages in pip. The command for this is as follows:

In addition, we will be utilizing the same dataset, file_pe_headers.csv, as in the previous recipe.

How to do it

In this section, we'll walk through a recipe showing how to use PCA on data:

1. Start by importing the necessary libraries and reading in the dataset:

from sklearn.decomposition import PCA
import pandas as pd

data = pd.read_csv("file_pe_headers.csv", sep=",")
X = data.drop(["Name", "Malware"], axis=1).to_numpy()

2. Standardize the dataset, as is necessary before applying PCA:

How it works

We begin by reading in our dataset and then standardizing it, as in the recipe on standardizing data (steps 1 and 2). (Standardizing matters here because, without it, features with large scales would dominate the principal components.) We now instantiate a new PCA transformer instance, and use it to both learn the transformation (fit) and also apply the transform to the dataset, using fit_transform (step 3). In step 4, we analyze our transformation. In particular, note that the elements of pca.explained_variance_ratio_ indicate how much of the variance is accounted for in each direction. The sum is 1, indicating that all the variance is accounted for if we consider the full space in which the data lives. However, just by taking the first few directions, we can account for a large portion of the variance, while limiting our dimensionality. In our example, the first 40 directions account for 90% of the variance:

sum(pca.explained_variance_ratio_[0:40])

This produces the following output:

0.9068522354673663

This means that we can reduce our number of features to 40 (from 78) while preserving 90% of the variance. The implication is that many of the features of the PE header are closely correlated, which is understandable, as they are not designed to be independent.
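To make the relationship between correlated features and explained variance concrete, here is a minimal sketch on synthetic data. The dataset below is an assumption made for illustration (10 features generated from 3 hidden factors plus a little noise); it is not the PE header dataset, and the numbers it produces are unrelated to the 40-of-78 result above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 10 observed features driven by only 3 hidden factors,
# so the features are strongly correlated by construction.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# Standardize first, then learn and apply the PCA transformation.
X_standardized = StandardScaler().fit_transform(X)
pca = PCA()
X_pca = pca.fit_transform(X_standardized)

cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative ratio reaches 90%.
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3))
print("components needed for 90% of the variance:", n_components_90)
```

Because only three factors drive the data, a handful of components captures nearly all of the variance, which is the same effect the recipe observes on the correlated PE header fields.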

Generating text using Markov chains

Markov chains are simple stochastic models in which a system can exist in a number of states. To know the probability distribution of where the system will be next, it suffices to know where it currently is. This is in contrast with a system in which the probability distribution of the subsequent state may depend on the past history of the system. This simplifying assumption allows Markov chains to be easily applied in many domains, surprisingly fruitfully.

In this recipe, we will utilize Markov chains to generate fake reviews, which is useful for pen-testing a review system's spam detector. In a later recipe, you will upgrade the technology from Markov chains to RNNs.
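Before turning to the markovify library, the idea can be illustrated with a dependency-free sketch of a word-level Markov chain. The function names build_chain and generate, and the toy corpus, are hypothetical and exist only for this illustration; markovify's own implementation is more elaborate (for example, it uses a configurable state size).

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words observed directly after it."""
    words = text.split()
    chain = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        chain[current_word].append(next_word)
    return chain


def generate(chain, start, max_words, seed=0):
    """Walk the chain: each next word depends only on the current word."""
    rng = random.Random(seed)
    word, output = start, [start]
    while len(output) < max_words and chain.get(word):
        # Repeated successors appear multiple times in the list, so
        # rng.choice samples in proportion to the observed counts.
        word = rng.choice(chain[word])
        output.append(word)
    return " ".join(output)


# Toy corpus (hypothetical), standing in for the airport reviews.
corpus = (
    "the airport is tiny . the airport is clean . "
    "the staff is friendly . the queue is long ."
)
chain = build_chain(corpus)
sentence = generate(chain, start="the", max_words=8)
print(sentence)
```

Every consecutive word pair in the generated text was seen in the corpus, yet the overall sentence may be new, which is why the output of such models can look locally plausible.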


Getting ready

Preparation for this recipe consists of installing the markovify and pandas packages in pip. The command for this is as follows:

pip install markovify pandas

In addition, the directory in the repository for this chapter includes a CSV dataset, airport_reviews.csv, which should be placed alongside the code for the chapter.

How to do it

Let's see how to generate text using Markov chains by performing the following steps:

1. Start by importing the markovify library and a text file whose style we would like to imitate:

import markovify

import pandas as pd

df = pd.read_csv("airport_reviews.csv")

As an illustration, I have chosen a collection of airport reviews as my text:

"The airport is certainly tiny! "

2. Next, join the individual reviews into one large text string and build a Markov chain model using the airport review text:

from itertools import chain


4. Since we are using airport reviews, we will have the following as the output after executing the previous code:

On the positive side it's a clean airport transfer from A to C gates and outgoing gates is truly enormous - but why when we

arrived at about 7.30 am for our connecting flight to Venice on TAROM.

The only really bother: you may have to wait in a polite manner Why not have bus after a short wait to check-in there were a lots

of shops and less seating.

Very inefficient and hostile airport This is one of the time easy

to access at low price from city center by train.

The distance between the incoming gates and ending with dirty and always blocked by never ending roadworks.

Surprisingly realistic! Although the reviews would have to be filtered down to the best ones.

5. Generate 3 sentences with a length of no more than 140 characters:

for i in range(3):
    print(markov_chain_model.make_short_sentence(140))

With our running example, we will see the following output:

However airport staff member told us that we were put on a

connecting code share flight.

Confusing in the check-in agent was friendly.

I am definitely not keen on coming to the lack of staff Lack of staff Lack of staff at boarding pass at check-in.

How it works

We begin the recipe by importing the Markovify library, a library for Markov chain computations, and reading in text, which will inform our Markov model (step 1). In step 2, we create a Markov chain model using the text. The following is a relevant snippet from the text object's initialization code:

class Text(object):
    reject_pat = re.compile(r"(^')|('$)|\s'|'\s|[\"(\[\])]")

    def __init__(self, input_text, state_size=2, chain=None,
                 parsed_sentences=None, retain_original=True,
                 well_formed=True, reject_reg=''):


parsed_sentences: A list of lists, where each outer list is a "run" of the process (e.g. a single sentence), and each inner list contains the steps (e.g. words) in the run. If you want to simulate an infinite process, you can come very close by passing just one, very long run.

retain_original: Indicates whether to keep the original corpus.

well_formed: Indicates whether sentences should be well-formed, preventing unmatched quotes, parenthesis by default, or a custom regular expression

Performing clustering using scikit-learn

Clustering is a collection of unsupervised machine learning algorithms in which parts of the data are grouped based on similarity. For example, clusters might consist of data that is close together in n-dimensional Euclidean space. Clustering is useful in cybersecurity for distinguishing between normal and anomalous network activity, and for helping to classify malware into families.


Getting ready

Preparation for this recipe consists of installing the scikit-learn, pandas, and plotly packages in pip. The command for this is as follows:

pip install scikit-learn plotly pandas

In addition, a dataset named file_pe_header.csv is provided in the repository for this recipe.


2. Extract the features and target labels:

y = df["Malware"]

X = df.drop(["Name", "Malware"], axis=1).to_numpy()

3. Next, import scikit-learn's clustering module and fit a K-means model with two clusters to the data:

from sklearn.cluster import KMeans


To see how the algorithm did, plot the algorithm's clusters:

The results are not perfect, but we can see that the clustering algorithm captured much of the structure in the dataset.


How it works

We start by importing our dataset of PE header information from a collection of samples (step 1). This dataset consists of two classes of PE files: malware and benign. We then use plotly to create a nice-looking interactive 3D graph (step 1). We proceed to prepare our dataset for machine learning. Specifically, in step 2, we set X as the features and y as the classes of the dataset. Based on the fact that there are two classes, we aim to cluster the data into two groups that will match the sample classification. We utilize the K-means algorithm (step 3), about which you can find more information at https://en.wikipedia.org/wiki/K-means_clustering. With a thoroughly trained clustering algorithm, we are ready to predict on the testing set. We apply our clustering algorithm to predict to which cluster each of the samples should belong (step 4). Observing our results in step 5, we see that clustering has captured a lot of the underlying information, as it was able to fit the data well.
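The fit-then-compare workflow described above can be sketched on synthetic data. The two Gaussian groups below are an assumption made for illustration and are far better separated than real malware and benign samples would be; the KMeans parameters n_init and random_state are set explicitly here only for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated synthetic groups standing in for two classes of samples.
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 3))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 3))
X = np.vstack([group_a, group_b])
y = np.array([0] * 50 + [1] * 50)

# Fit K-means with two clusters and assign each sample to a cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Cluster IDs are arbitrary, so score both possible labelings.
agreement = max(
    np.mean(cluster_labels == y),
    np.mean(cluster_labels == 1 - y),
)
print("agreement with the true classes:", agreement)
```

Note that K-means never sees y; the labels are used only afterwards to judge how well the discovered clusters line up with the known classes.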

Training an XGBoost classifier

Gradient boosting is widely considered one of the most reliable and accurate algorithms for generic machine learning problems. We will utilize XGBoost to create malware detectors in future recipes.

Getting ready

The preparation for this recipe consists of installing the scikit-learn, pandas, and xgboost packages in pip. The command for this is as follows:

pip install scikit-learn xgboost pandas

In addition, a dataset named file_pe_header.csv is provided in the repository for this recipe.


X = df.drop(["Name", "Malware"], axis=1).to_numpy()

2. Next, train-test-split a dataset:

X_train, X_test, y_train, y_test = train_test_split(X, y,


Analyzing time series using statsmodels

A time series is a series of values obtained at successive times. For example, the price of the stock market sampled every minute forms a time series. In cybersecurity, time series analysis can be very handy for predicting a cyberattack, such as an insider employee exfiltrating data, or a group of hackers colluding in preparation for their next hit.

Let's look at several techniques for making predictions using time series

from random import random

time_series = [2 * x + random() for x in range(1, 100)]


Plot your data:

3. There is a large variety of techniques we can use to predict the next value of the series:

Moving average (MA):

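# Note: ARMA was removed in statsmodels 0.13; on newer versions,
# statsmodels.tsa.arima.model.ARIMA with order=(0, 0, 1) is the replacement.
from statsmodels.tsa.arima_model import ARMA

model = ARMA(time_series, order=(0, 1))
model_fit = model.fit(disp=False)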

y = model_fit.predict(len(time_series), len(time_series))
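As a sanity check for forecasts on this kind of series, a simple trend baseline can be computed without statsmodels. To be clear, the sketch below is not the MA model used above: it is a swapped-in ordinary least squares line fit, extrapolated one step ahead, and the fixed seed is an assumption added for reproducibility.

```python
from random import Random

rng = Random(0)
# Same shape of series as in the recipe: a linear trend plus uniform noise.
time_series = [2 * x + rng.random() for x in range(1, 100)]

# Ordinary least squares fit of value = slope * t + intercept, t = 1..99.
t = list(range(1, 100))
n = len(t)
t_mean = sum(t) / n
y_mean = sum(time_series) / n
covariance = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, time_series))
t_variance = sum((ti - t_mean) ** 2 for ti in t)
slope = covariance / t_variance
intercept = y_mean - slope * t_mean

# Extrapolate one step ahead, to t = 100.
next_value = slope * 100 + intercept
print(round(next_value, 2))
```

Since the series is 2x plus noise averaging about 0.5, the extrapolated value should land close to 200.5; a forecast from another model that is far from this is worth a second look.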

Title	Machine Learning for Cybersecurity
Author	Emmanuel Tsukerman
Field	Cybersecurity
Type	Book
Year of publication	2019
City	Birmingham

Format
Number of pages	338
File size	50.15 MB