
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

-

NGUYEN NGOC HAI DANG

APPLICATION OF MACHINE LEARNING ON AUTOMATIC PROGRAM REPAIR OF SECURITY VULNERABILITIES


Supervisors: Assoc. Prof. Dr. Huynh Tuong Nguyen, Assoc. Prof. Dr. Quan Thanh Tho

Examiner 1: Dr. Truong Tuan Anh

Examiner 2: Assoc. Prof. Dr. Nguyen Van Vu

This master's thesis was defended at Ho Chi Minh City University of Technology, VNU-HCM, on July 11, 2023.

Master’s Thesis Committee:

(Please write down full name and academic rank of each member of the Master’s Thesis Committee)

1. Chairman: Assoc. Prof. Dr. Le Hong Trang
2. Secretary: Dr. Phan Trong Nhan
3. Reviewer 1: Dr. Truong Tuan Anh
4. Reviewer 2: Assoc. Prof. Dr. Nguyen Van Vu
5. Member: Assoc. Prof. Dr. Nguyen Tuan Dang

Approval of the Chairman of the Master's Thesis Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis has been corrected (if any).

CHAIRMAN OF THESIS COMMITTEE

HEAD OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING


THE TASK SHEET OF MASTER’S THESIS

Full name: Nguyen Ngoc Hai Dang
Student ID: 1970513
Date of birth: 24/11/1997
Place of birth: Lam Dong
Major: Computer Science
Major ID: 8480101

I. THESIS TITLE:

ỨNG DỤNG HỌC MÁY VÀO CHƯƠNG TRÌNH TỰ ĐỘNG SỬA CHỮA LỖ HỔNG BẢO MẬT - APPLICATION OF MACHINE LEARNING ON AUTOMATIC PROGRAM REPAIR OF SECURITY VULNERABILITIES

II. TASKS AND CONTENTS:

- Research and build a system to automatically repair vulnerabilities.
- Research and propose methods to improve the accuracy of the model.
- Experiment and evaluate the results of the proposed methods.

III. THESIS START DATE: 05/09/2022

IV. THESIS COMPLETION DATE: 09/06/2023

V. SUPERVISORS: Assoc. Prof. Dr. Huynh Tuong Nguyen, Assoc. Prof. Dr. Quan Thanh Tho

Ho Chi Minh City, date ………

SUPERVISOR

(Full name and signature)

CHAIR OF PROGRAM COMMITTEE

(Full name and signature)

DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING

(Full name and signature)


Acknowledgements

I would like to acknowledge the people who have helped me with their knowledge, encouragement, and patience during the work of this thesis. The thesis would not have been completed without your help and inspiration. First and foremost, I would like to thank my supervisor at Ho Chi Minh City University of Technology, Professor Quan Thanh Tho. Thank you for your unwavering support; your insightful feedback and contributions have pushed and guided me throughout the work of this thesis. I would also like to thank my other supervisor at the Norwegian University of Science and Technology, Professor Nguyen Duc Anh. Thank you for your feedback and help. Lastly, I would like to thank my friends and family for their endless patience, support, and encouragement.


Abstract

We have, as individuals and as a society, become increasingly dependent on software, and thus the consequences of failing software have also become greater. Identifying the failing parts of the software and fixing these parts manually can be time-consuming, expensive, and frustrating. The growing research field of automated code repair aims to tackle this problem by applying machine learning techniques to repair software in an automated fashion. With the abundance of data on bugs and patches, research on the use of deep learning in code repair has been on the rise and has proven effective, as shown by the appearance of many systems [1], [2] with state-of-the-art performance. However, this approach requires a large dataset to be applicable, and this condition cannot be met for all types of bugs in applications. One such type of bug is the vulnerability, which is the target of security exploitation by attackers seeking to cause great harm to the organizations that use the affected applications. Therefore, the need to automatically identify and fix vulnerabilities is obvious, and meeting it can significantly reduce the harm caused to these organizations.

In our work, we focus on the application of deep learning to vulnerability repairing and experiment with a solution that can be used to handle the lack of data, which is a requirement for deep learning models to be applied effectively, through the use of embeddings extracted from large language models such as CodeBERT [3] and UniXcoder [4]. Although our results show that such an approach does not bring significant improvement, they can be used by other researchers to gain more insight into the proximity between the repairing tasks of different types of bugs.


We have, as individuals and as a society, become increasingly dependent on software, and thus the consequences of faulty software have also become greater. Identifying the faulty parts of software and fixing them manually can be time-consuming, costly, and frustrating. The growing research field of automated code repair aims to solve this problem by applying machine learning techniques to repair software automatically. With the abundant data on bugs and patches, research on using deep learning for code repair has grown and been proven effective through the appearance of many systems [1], [2] with state-of-the-art performance. However, this approach relies on a large dataset, and not every type of bug in applications can satisfy this condition; one such type is the security vulnerability, which attackers exploit to harm the organizations that use the applications containing it. Therefore, the need to automatically identify and fix these vulnerabilities is obvious and can significantly reduce the potential damage to these organizations.

In this thesis, we focus on applying deep learning to fixing security vulnerabilities and experiment with a solution that can be used to handle the shortage of data, which these models require in order to be effective, through the use of embeddings extracted from large language models such as CodeBERT [3] and UniXcoder [4]. Although our results show that such an approach does not bring significant improvement, they can still be used by other researchers to better understand the gap between the repair tasks of different types of bugs.


I, Nguyen Ngoc Hai Dang, declare that this thesis, with the Vietnamese title "Ứng dụng của học máy vào chương trình tự động sửa chữa lỗ hổng bảo mật" and the English title "Application of machine learning on automatic program repair of security vulnerabilities", is my own work and contains no material that has been submitted previously, in whole or in part, for the award of any other academic degree or diploma.

Nguyen Ngoc Hai Dang


Contents

2.1.1 Recurrent Neural Network (RNN)
2.1.2 Vanilla recurrent neural network
2.1.3 Long short-term memory network (LSTM)
2.1.4 Transformer Neural Network
2.4 Bug Repairing and Vulnerabilities Repairing
2.5 Source code Representation
2.5.1 GumTree
2.5.2 Byte Pair Encoding
2.6 Source code embeddings


3.2 Generative-based approach
3.2.1 SeqTrans
3.2.2 VRepair
4 Proposed Methods
5 Experiments and Results
5.1 Datasets
Validation method
5.2 Metrics of performance
5.3 Preprocessing the code as plain text
5.4 Extracting embeddings from large language models for code


List of Figures

2.1 The basic architecture of recurrent neural network
2.2 Recurrent Neural Network design patterns
2.3 LSTM network with three repeating layers
2.4 Attention-integrated recurrent network
2.5 The encoder-decoder architecture of transformer
2.6 Attention head operations
2.7 Dataset used for CodeBERT
2.8 CodeBERT architecture for replaced tokens detection task
2.9 A Python code with its comment and AST
2.10 Input for contrastive learning task of UniXcoder
3.1 Workflow of VuRLE
3.2 Architecture of SeqTrans
3.3 Input of SeqTrans
3.4 Normalized code segment
3.5 The VRepair pipeline
4.1 Design of our pipeline
5.1 Sample of buggy code and its patch
5.2 Input sequence
5.3 Output sequence
5.4 Syntax of the output sequence


List of Tables

5.1 Experiments replicating the VRepair pipeline
5.2 Experiments with embeddings as input
6.1 Complete set of hyperparameters used in our models built by OpenNMT-py


1 Introduction

In modern society, software systems play a crucial role in almost every aspect of our lives [5]. These systems have become the backbone of our interconnected world, enabling us to communicate, work, learn, and entertain ourselves efficiently and effectively [6]. From mobile applications and social media platforms to e-commerce websites and financial systems, software systems have revolutionized the way we interact, transact, and navigate the digital landscape. They have transformed industries, streamlined processes, and empowered individuals by providing access to information and services at our fingertips. The importance of software systems lies in their ability to automate tasks, enhance productivity, enable innovation, and foster connectivity on a global scale. They have become indispensable tools for businesses, governments, healthcare, education, and countless other sectors, driving progress, enabling efficiency, and shaping the future.

With the increasing reliance on software systems for critical functions such as communication, finance, healthcare, and infrastructure, ensuring the security of these systems is of paramount importance. Software security involves protecting software applications and data from unauthorized access, breaches, and malicious activities. The consequences of software security breaches can be severe, ranging from financial loss and reputational damage to compromised privacy and even threats to national security. While detecting software security issues can be done during and after software release, addressing security issues early in the development process saves time and resources. It is generally easier and less costly to fix vulnerabilities during the development stage than after the software has been deployed and is in active use.

Software security detection involves using various techniques and tools to identify vulnerabilities, weaknesses, and potential threats within the codebase. These can include static code analysis, dynamic testing, penetration testing, and security auditing. By actively searching for security issues, developers can uncover and address potential flaws before the software is deployed, reducing the risk of exploitation by malicious actors.

Once security issues are detected, code repair comes into play. It involves remediation efforts to fix the identified vulnerabilities and weaknesses. This may involve patching code, implementing security controls, updating dependencies, or improving the overall design of the software. Code repair is a critical step in mitigating security risks and ensuring that the software meets the necessary security standards.

In our research, we will explore code repair for software security issues. We will empirically investigate the application of Deep Learning to create patches for such vulnerabilities in software in an automatic manner. The contributions of this thesis are threefold:

• A literature review on state-of-the-art Deep Learning for code repair of software security issues

• Experiments with different DL approaches for vulnerable code repair

In the field of software testing, security vulnerabilities are a type of bug that is hard both to detect and to patch, as they do not explicitly affect the software's functionality but are only exposed, and cause great harm, when exploited intentionally.

Methods used for creating code patches can be classified into template-based and generative-based. Template-based patching uses predefined templates or patterns to guide the creation of code patches. These templates provide a structure or framework for making specific modifications to the code, and developers can fill in the template with the necessary changes based on the identified security issues. This approach simplifies the patching process by providing a consistent and predictable format, making it easier to apply fixes across different codebases. Template-based patching is especially useful for addressing common security vulnerabilities that have well-defined solutions. Generative-based patching, on the other hand, takes a more automated and algorithmic approach to creating code patches. Instead of relying on predefined templates, it leverages machine learning techniques, code analysis, and algorithms to generate patches automatically. Generative-based patching involves analyzing the codebase, identifying problematic areas, and generating code changes that address the detected security issues. This approach can be more flexible and adaptive, as it can handle a wider range of vulnerabilities and adapt to different programming languages and code structures.
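To make the distinction concrete, the following is a minimal, hypothetical sketch of a template-based fix, not a rule taken from any tool discussed in this thesis: a regular-expression template rewrites calls to the unsafe C function strcpy into bounded strncpy calls. The function name fix_strcpy and the pattern itself are illustrative assumptions.

```python
import re

# Hypothetical template: replace "strcpy(dst, src);" with a bounded copy.
# The pattern and the replacement are illustrative, not an actual repair rule.
STRCPY_CALL = re.compile(r"strcpy\s*\(\s*(\w+)\s*,\s*(\w+)\s*\)\s*;")

def fix_strcpy(line: str) -> str:
    """Apply the template to one line of C code, if it matches."""
    return STRCPY_CALL.sub(
        r"strncpy(\1, \2, sizeof(\1) - 1); \1[sizeof(\1) - 1] = '\\0';",
        line,
    )

if __name__ == "__main__":
    buggy = "strcpy(buffer, user_input);"
    print(fix_strcpy(buggy))
    # strncpy(buffer, user_input, sizeof(buffer) - 1); buffer[sizeof(buffer) - 1] = '\0';
```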

Automated program repair is an emerging field of Software Engineering (SE) research that allows for automated rectification of software errors and vulnerabilities [7]. Program repair, or code repair, also known as automated program repair or software patching, refers to the process of automatically identifying and fixing software bugs or defects without manual intervention. Program repair techniques typically analyze the source code, identify the root cause of the bug, and generate a patch or modification that resolves the issue. There is emerging research interest in applying Machine Learning techniques to automate the task of repairing faulty code [7]–[13].

Inspired by the work of Zimin Chen [14], whose research uses transfer learning to leverage the knowledge a deep learning model has learned from generic code-repairing tasks to improve the performance of the vulnerability-repairing task, our main focus will be to investigate the problem of generating vulnerability patches automatically using a generative model and to leverage the knowledge learned by large programming-language models such as CodeBERT [3] to improve the performance of the generative model through the use of embeddings extracted from CodeBERT.

The problem of generating patches for vulnerabilities using knowledge learned from data is quite new and still has open challenges. One such challenge is the lack of sufficient data, in terms of both volume and diversity; this causes the performance of current vulnerability-patch generation systems to be quite unreliable, even for template-based methods, which are not as data-hungry as the generative-based methods dominated by deep learning architectures.

Unlike the scarce data in the domain of vulnerability repairing, the data for buggy code in general is quite large; one can easily crawl a dataset from repositories on GitHub based on commits, and for each commit we obtain a pair consisting of the bad code and its patch. The question, then, is how we can use this abundance of data from the broader domain of code repairing in a solution for repairing vulnerable code.

As it turns out, situations where the availability of data is limited are not uncommon; in the field of deep learning, there is a method designed with the same intuition, in which we transfer the knowledge learned from one task to another, called transfer learning. The application of transfer learning to the vulnerability-repairing task has already been investigated in [15], [2] and achieved promising results on the limited dataset.

In our work, we will go even further by combining source code modeling techniques with transfer learning, in the hope that this could further improve the previously reported results.

Our research will focus on answering the following three Research Questions:

Research Question 1: What do we know about Deep Learning in vulnerable program repair?

• Reason: Deep Learning has shown great potential in various software development problems, and exploring its applicability to vulnerable program repair can lead to advancements in automated bug fixing. Understanding the existing knowledge about Deep Learning in this context can provide insights into its effectiveness, limitations, and potential for improving the efficiency and accuracy of program repair techniques.

Research Question 2: How effective are existing generative-based methods for the problems of code repairing and vulnerability repairing?

• Reason: Investigating the effectiveness of these methods in code and vulnerability repairing can provide valuable insights into their strengths, weaknesses, and limitations.

Research Question 3: Can code embedding extend the capabilities of these methods?

• Reason: Code embedding is a technique that represents source code as vectors in a continuous space, enabling the application of machine learning algorithms to code analysis and understanding. Exploring the use of code embeddings in generative-based methods for code and vulnerability repairing can offer new perspectives on improving the effectiveness and generalizability of these techniques.


1.4 Thesis Outline

The thesis is structured as follows: in Section 2 we discuss the necessary background knowledge of deep learning and its applications in the broader field of code repairing, and then we further discuss the previous methods used for the problem of vulnerabilities; in Sections 3 and 4 we then discuss in detail some of the prominent methods and from there lead into our proposed methods.


2 Literature review

We will first introduce the fundamentals of Deep Learning, secondly Learning Paradigms, and thirdly Deep Learning in Code Repair.

As mentioned in the previous sections, the current focus of research in the field is on Deep Learning-based methods. In such methods, source code is seen as a sequence of tokens, following the intuition that source code is raw text in a specific language, and the model will learn to predict the token sequences of the respective patches. We start with the general technique of the neural network (NN). A neural network is a general term for a model that is built by stacking basic layers or networks to form a larger network, in which the output of one layer is the input of the later layers; the configuration of layers can vary based on the specific task and the nature of the data itself. In the context of sequence data and text-based data, the prominent basic networks commonly used to handle such data are recurrent neural networks and transformer neural networks. In this section, we go through these networks to gain a basic understanding of them and of how they can be used in our problem of vulnerability repairing. This section is structured as follows: first, we touch on the basics of a neural network, its components and learning algorithm (parameter optimization); in the second part, we briefly discuss the basics of the recurrent neural network and then advance to the long short-term memory network, which is intuitively an enhancement of the vanilla recurrent neural network for handling long-term memorization; in the last part, we go into the details of the transformer neural network along with its attention mechanism.
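As a minimal illustration of this stacking idea, the sketch below composes a small feed-forward network from basic layers in PyTorch; the layer sizes are arbitrary assumptions chosen only for the example.

```python
import torch
import torch.nn as nn

# A tiny network built by stacking basic layers: the output of each layer
# is the input of the next one. Layer sizes here are arbitrary.
model = nn.Sequential(
    nn.Linear(32, 64),   # input features -> hidden
    nn.ReLU(),
    nn.Linear(64, 64),   # hidden -> hidden
    nn.ReLU(),
    nn.Linear(64, 10),   # hidden -> output (e.g., 10 classes)
)

x = torch.randn(8, 32)   # a batch of 8 examples with 32 features each
print(model(x).shape)    # torch.Size([8, 10])
```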

2.1.1 Recurrent Neural Network (RNN)

In the world of data, there are properties that regulate the information or knowledge that models are trying to learn; one such property is the order of values in sequence data. To be more formal about the definition of sequence data, we define a sequence as a type of data where values at later positions of the sequence are determined by the values at previous positions:

$$X_T = \{x_1, x_2, \dots, x_t, \dots, x_T\}$$

• $T$ denotes the length of the sequence
• $t$ denotes the position of $x$ in the sequence
• $x_t$ is the value at position $t$
• $X_T$ represents sequence data of length $T$

Recurrent neural networks [16] have two main types of variants: those without gated units and those with gated units. In the next sections, we discuss both types using two representative networks (one for each type): the vanilla recurrent neural network (without gated units) and the long short-term memory network (with gated units).

2.1.2 Vanilla recurrent neural network

Given a sample of sequence data $X_T = \{x_1, x_2, \dots, x_t, \dots, x_T\}$, in which $x_t$ is the value of the sequence at position $t$ or time step $t$, each of these values is subsequently fed into the network as a parameter for the functions inside, which extract information from that time step, propagate it to downstream operations, and finally make the predictions determined by the task at hand. These functions in an RNN can be intuitively described as extracting information from the current time step using the knowledge provided by the previous calculations; this also implies that information from the current time step is propagated throughout the network. The mathematical representation of this description is shown below, and figure 2.1 depicts an RNN of three layers.

$$h_t = \tanh(x_t \cdot W_{xh} + h_{t-1} \cdot W_{hh} + b) \tag{2.1}$$

First, we need to walk through the notation used in figure 2.1 before going into the details of the operations of a typical RNN.

• The $\tanh$ function is an activation that returns values in the range $[-1, 1]$.
• The $b$ term is the bias, added to allow better generalization of models.
• $W_{hh}$ is the weight matrix representing the connection between the hidden values at position $t-1$ and position $t$, which allows information learned at the previous position $x_{t-1}$ to be forwarded to the current position $x_t$.


Figure 2.1: The basic architecture of recurrent neural network

• $W_{xh}$ is the weight matrix that extracts information from the current position $x_t$.

• $h_t$ represents the latent information extracted from the current layer.

Figure 2.1 shows an RNN with three layers, where each layer is used to extract the latent values $h_t$ from its respective input $x_t$. To do so, the network is fed with the input $x_t$ along with the previously extracted information $h_{t-1}$, as described in equation 2.1; however, note that the calculation of the hidden values at each layer is done using the same weight matrices mentioned above, and only $h_t$ is propagated to the next layer; therefore, in some texts, the RNN and its other variants, including the LSTM, can be represented using the more compact form shown on the left-hand side. The behavior of propagating information from one layer to the next is what makes the RNN capable of handling sequence data, as the mentioned property of a sequence is the dependency between its values. However, depending on the specific task the network is working on, there might or might not be a prediction $y_t$ for each $x_t$; this leads to the categorization of common patterns for designing recurrent network architectures: one-to-one, many-to-one, many-to-many, and one-to-many. As depicted in figure 2.2 below, the main difference between these design patterns is the number of inputs $x_t$ that need to be fed into the network before either a single prediction $y_t$ or a sequence of predictions $y_t, \dots, y_{T'}$ can be generated. In the section above, we talked about the sequence-to-sequence design, which is also included in the figure.


Figure 2.2: Recurrent Neural Network design patterns
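As a concrete illustration of equation 2.1, the following is a minimal NumPy sketch of a single recurrent step applied across a short sequence; the dimensions and random initialization are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 16, 32          # arbitrary sizes for illustration

# Shared parameters, reused at every time step (equation 2.1).
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """h_t = tanh(x_t . W_xh + h_{t-1} . W_hh + b)."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b)

# Run a short sequence through the cell, carrying the hidden state forward.
sequence = rng.normal(size=(5, input_dim))   # 5 time steps
h = np.zeros(hidden_dim)
for x_t in sequence:
    h = rnn_step(x_t, h)
print(h.shape)   # (32,)
```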

2.1.3 Long short-term memory network (LSTM)

Although the structure of the RNN allows it to efficiently handle sequence data via the connections between hidden layers, which allow information to flow between time steps, its abilities fall short when it comes to long sequences, in which information in the later part of a sequence depends on the context provided by values whose positions are far from the current position. The LSTM [16] handles this problem of the vanilla RNN using different gated units, which serve as controllers that manipulate the information flow propagating through the network; there are three different gated units in an LSTM: the forget gate, the output gate, and the input gate. Accompanying these gated units is a cell state, which is also the key component that allows an LSTM to retain information from previous time steps; at each time step, the information contained in the cell state is regulated by the units, which forget old information and update it with new information extracted from the recent time steps.

As mentioned in the previous section, the gated units regulate the information flow stored in the cell state, which serves as information storage that allows the network to retain information from the far "past". However, this is just the intuition of the LSTM, and we need to break down the mathematical representation of these gates and their operations. The formal representation is a set of functions whose parameters are the input $x_t$, the hidden state $h_t$, and the cell state $C_t$, with the choice of parameters depending on the specific gate. Along with the new units in the LSTM, the number of learnable weight matrices also increases correspondingly; the weight matrices of an LSTM include $W_C$, for computing the candidate cell-state values $\hat{C}$, which are later added to the cell state $C$, and $W_i$, $W_f$, and $W_o$, used in the computations that determine the information to be extracted from the current time step, the amount of information to discard from the cell state, and the information to be fed directly to the next time step as hidden-state values. The mathematical representation of the mentioned computations is described below, and they are also visualized in figure 2.3 using the same notation as in this section.
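Assuming the standard LSTM formulation and the notation above, with $\sigma$ denoting the sigmoid function, $\odot$ element-wise multiplication, and $[h_{t-1}, x_t]$ the concatenation of the previous hidden state with the current input, these computations can be written as:

$$\begin{aligned}
f_t &= \sigma([h_{t-1}, x_t] \cdot W_f + b_f) \\
i_t &= \sigma([h_{t-1}, x_t] \cdot W_i + b_i) \\
\hat{C}_t &= \tanh([h_{t-1}, x_t] \cdot W_C + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \hat{C}_t \\
o_t &= \sigma([h_{t-1}, x_t] \cdot W_o + b_o) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}$$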

Figure 2.3: LSTM network with three repeating layers

Figure 2.3 shows a network with three repeating LSTM layers; each layer is equivalent to a time step, which is the term we have been using throughout this section. At each layer (time step), only the cell state $C_t$ and the hidden values $h_t$ are exposed and fed into the next layer; however, the LSTM network still shares the weight matrices of the mentioned operations across every layer.


2.1.4 Transformer Neural Network

Neural networks that are constructed using the transformer as their building block are currently the focus of research in many fields of artificial intelligence, from natural language processing to computer vision and speech processing; these networks outperform every network that came before them by a large margin and continue to push the boundary to this day. In this section, we aim to break down the components of the transformer by explaining the concept of attention, which is the building block of a transformer, and then walk step by step through the operations of a transformer module.

Attention Mechanism. In order to understand the transformer module, we first need to discuss the attention mechanism, which gives the transformer the power to retrieve information from any previous time step. One drawback of recurrent neural networks and their variants is the limited amount of information that can be retained and referenced; this remains true even for the LSTM, as the integration of the cell state only increases the window of reference, and as the number of LSTM layers increases the cell state has a harder time retaining old information.

The intuition behind attention is that, while extracting information at the current time step, the network can "reference" the previously extracted information directly and decide how much "attention", that is, the amount of information a certain part of the sequence carries, can be used for the current calculation. This behavior can be seen as an imitation of how humans process information and make predictions by referencing only certain parts of past information. However, the cost of this direct reference to past knowledge is that the network must store the result at each time step for future reference, and the operations involved in referencing this past knowledge do not come cheap, as the number of matrix multiplication operations grows rapidly as the sequence gets longer. In order to give a better understanding of the attention mechanism, we next go through a step-by-step walk-through of attention integrated with a recurrent network in the encoder-decoder architecture shown in figure 2.4. Encoder: the encoder block first extracts the information residing in the input sequence; however, the extracted information (the hidden-state values) is stored instead of being discarded as in the normal forward flow of a recurrent network. Decoder: the decoder still performs the operations that generate predictions using the extracted knowledge returned by the encoder as input; however, with the attention mechanism, the embedding returned by the encoder is concatenated with the context vector resulting from direct reference to each layer in the encoder, and this forms the input for the layers in the decoder. The context vector at each decoder layer is calculated via the following operations:

• Calculate the amount of attention that should be put on each encoder layer by taking the dot product of the hidden state $h_d$ in the decoder with every hidden state $h_e$ in the encoder, and then feeding the result of this dot product into a softmax, whose output is a numeric value in the range $[0, 1]$ interpreted as the "amount of attention" or the "contribution to the context" of the prediction:

$$\text{Attention} = \mathrm{softmax}(\mathrm{dot}(h_d, h_e)) \tag{2.8}$$

• The result of the softmax function at each layer is then multiplied with that layer's hidden state, and these vectors from all layers of the encoder are aggregated by sum or mean to form the context vector at the current decoding step.
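The two steps above can be written compactly; the following is a minimal NumPy sketch of this dot-product attention over stored encoder hidden states, with sizes chosen arbitrarily for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
hidden_dim, src_len = 32, 6

# Hidden states stored by the encoder, one per source time step.
encoder_states = rng.normal(size=(src_len, hidden_dim))   # h_e
decoder_state = rng.normal(size=hidden_dim)               # h_d at the current step

# Step 1: attention weights = softmax of the dot products (equation 2.8).
scores = encoder_states @ decoder_state          # (src_len,)
weights = softmax(scores)                        # sums to 1

# Step 2: context vector = weighted aggregation of the encoder states.
context = weights @ encoder_states               # (hidden_dim,)
print(weights.round(2), context.shape)
```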


The transformer takes this idea further by removing the recurrent component, allowing the network to be trained in a parallel manner, and by enhancing the attention mechanism beyond a direct reference to hidden states, using matrix multiplication operations and a softmax function; however, these enhancements also increase the network complexity and the computational cost of both the training and the inference process of the models. In this section, the transformer architecture will be discussed along with its advantages and disadvantages; this will give us a better understanding for the later discussion on using such an architecture to solve the problem of vulnerability repair.

The architecture of the transformer proposed in the original paper follows the structure of the traditional encoder-decoder framework, in which the encoder serves the purpose of creating a dense vector representation of the input as a reference, and the decoder uses this dense reference representation as input for generating the predictions related to the learning task. The encoder and decoder stacks in the architecture are made of $N$ identical modules stacked vertically, which make up the transformer network; the building blocks of these modules, shown in figure 2.5, are attention layers and fully connected networks, whose operations and interactions are discussed in detail in this section.

Figure 2.5: The encoder-decoder architecture of transformer

In the previous section, the attention mechanism was discussed by breaking down its operation in the context of a recurrent network, and that setup therefore suffers from one of the drawbacks of a recurrent network: the architecture cannot make use of the parallel processing power integrated deeply in modern processors. The reason is that information is extracted sequentially, meaning that previous states need to be calculated first to be used as input for the next states, until the end of the input sequence; this paradigm results in long training times and heavily affects the inference speed. Compared to the recurrent paradigm, a fully connected neural network, which is an expansion of the multi-layer perceptron obtained by adding more hidden layers, offers better training time and faster inference speed, as it can take advantage of parallel computing by calculating each hidden state in a hidden layer independently before aggregating them for the next hidden layer's calculations. This is also one of the enhancements of the transformer network over the recurrent network: the transformer allows for parallel computing by replacing the recurrent neural network with a fully connected network and a clever positional encoding "hack", as shown in figure 2.5; the method itself has the following mathematical representation.
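Assuming the standard sinusoidal encoding of the original transformer paper, this representation is:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

where $pos$ is the position of an element in the sequence, $i$ indexes the embedding dimensions, and $d$ is the embedding dimension of the model.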

Intuitively speaking, the authors define cosine and sine functions that map the position of each element in a sequence to a vector with the same dimension $d$ as the embeddings in the model. This definition of the mapping function is chosen for convenience: the authors state that it makes it easier to learn a linear relationship, which helps ease the training process without sacrificing any performance.

After positional encoding of the input sequence, the result is fed into a multi-head attention module, which is the attention-integrated fully connected layer. Each attention head in this module operates on three vectors, created by passing the positionally encoded input through three linear layers simultaneously: the query, key, and value vectors shown in figure 2.6.

The concepts of query and key come from information retrieval: for a set of queries and a set of keys representing the items in a search repository, we can compute scores denoting the relevancy between each query and each item; in the context of the transformer, these scores are later normalized to the range $[0, 1]$. The normalized values are the attention each element should give to the rest of the input sequence, or the amount of information that should be extracted from the value vectors.


Figure 2.6: Attention head operations

However, in the original paper, the authors also state that as the output of the dot product gets larger, the gradients become smaller, resulting in vanishing gradients during the backpropagation phase; therefore, the attention scores are first scaled by $\frac{1}{\sqrt{d_k}}$ before being normalized by the softmax, as in the expression below, to stabilize the gradients during the optimization process with gradient descent.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

Multiple attention heads with the same operations as described above are stacked horizontally in the transformer architecture, which means that their results are concatenated together and fed into another fully connected layer, which we refer to as a linear layer to differentiate it from the layers inside the attention heads, for aggregation; this linear layer is the last layer in each encoder and decoder block of the transformer architecture, aside from the layers that output the predictions at the tail of the model. After each of these multi-head attention layers and linear layers, there are a residual connection and a normalization layer, which both serve the purpose of stabilizing the gradients and easing the learning process of the transformer.
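A minimal NumPy sketch of the scaled dot-product attention above, with arbitrary dimensions, may make the matrix operations easier to follow; a full multi-head layer would additionally learn the linear projections that produce the query, key, and value matrices.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) relevancy scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # weighted combination of the values

rng = np.random.default_rng(2)
seq_len, d_k = 4, 8                        # arbitrary illustration sizes
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```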

Despite the fact that deep learning networks have been shown to provide large performance increases across multiple tasks and all common data types, one of the main ingredients of the success of these methods is the large amount of data for these models to learn from; however, the availability of data in many tasks does not meet the requirements of these deep models, and this is where transfer learning comes into place, which is also the technique used in the methods we are investigating for vulnerability repairing. In this technique, one simply uses a model trained on a similar task as the initial weights for the new model and then trains the model on the target task. To be more specific, the scenarios where transfer learning can be useful include:

• Feature extraction: the output of the source model is used as the input for a model of the target task (a small sketch of this scenario is shown after the guidelines below).

• Fine-tuning a pre-trained model: the last few layers of the source model are removed and replaced with layers of the new model; during the fine-tuning process, the weights of the whole new architecture can be updated, or only the newly added layers are updated, depending on the specific task we are working on.

However, the choice of transfer learning approach to use depends largely on the similarity between the tasks and on the datasets of both tasks, as stated in the guidelines in [17]:

• New dataset is small and similar to the original dataset. Since the data is small, it is not a good idea to fine-tune the source model due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the source model to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN codes.

• New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we will not overfit if we try to fine-tune through the full network.

• New dataset is small but very different from the original dataset. Since the data is small, it is likely best to train only a linear classifier. Since the dataset is very different, it might not be best to train a new model on top of the source model, which contains more dataset-specific features. Instead, it might work better to train the new model from somewhere in the middle of the source model's architecture.

• New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can afford to train the source model from scratch. However, in practice, it is very often still beneficial to initialize with weights from a pre-trained model. In this case, we would have enough data and confidence to fine-tune the entire network.
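As a sketch of the feature-extraction scenario referenced above, and of the kind of embedding extraction this thesis performs, the following loads the public microsoft/codebert-base checkpoint with the Hugging Face transformers library; the code snippet and the choice of pooling by the first token are assumptions made only for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained source model; its weights stay frozen (feature extraction).
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

code = "if (index < length) { buffer[index] = value; }"
inputs = tokenizer(code, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# One embedding per sub-word token; the first token is used here as a
# sequence-level summary to feed a downstream (target-task) model.
token_embeddings = outputs.last_hidden_state        # (1, seq_len, 768)
sequence_embedding = token_embeddings[:, 0, :]      # (1, 768)
print(token_embeddings.shape, sequence_embedding.shape)
```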

Learning methods that leverage neural networks require the data, both the input and the output, to be represented as vectors. However, these vectors need to be encoded from a raw representation that is capable of capturing the necessary implicit features within the data. In the case of source code, these implicit patterns can be the sequence of commands, the data flow, etc. Depending on the specific objectives, the use of different representations, along with the design of the neural networks, can bring varying results. In this section, we categorize the neural-network-based methods by their data representation and their choice of architecture into sequence-to-sequence learning, graph-based learning, and tree-to-tree learning. This categorization will help us better navigate the scope of this research and the proposed method.

2.3.1 Sequence to Sequence Learning

Sequence-to-sequence (also called many-to-many or encoder-decoder) is a neural network design pattern, shown in figure 2.2, in which models take in a sequence of linearly dependent values and generate an output sequence based on the objective the model is trained on. The general architecture of this type of model includes two separate modules: the encoder and the decoder. These two modules are both neural networks, whose architectures can be either identical or different. The encoder first creates a context vector, which is then used as the input to the decoder for sequence generation. In recent years, sequence-to-sequence models have normally been created based on one, or a composition, of the following networks: Recurrent Neural Networks (RNN), Long Short-term Memory networks (LSTM), and Transformers. The details of all these networks are the subject of discussion in section 2.1.


The sequence-to-sequence architecture is normally used in the field of Natural Language Processing (NLP), especially in tasks that take a sentence as input and produce an objective-based sentence as output, which can be a translation of the original sentence or its summary. However, sequence data does not exist only in the field of natural language processing; these sequence-to-sequence architectures can also be found in other fields of deep learning, such as time-series forecasting, image captioning, text-to-speech, and speech-to-text. In the methods that leverage the sequence-to-sequence design, the common pattern is that they all take in a sequence of information as input and output another sequence, and these information sequences take many forms, from visual information and language information to something as simple as numeric sequences.

Due to the architecture's popularity, research on deep learning methods for source code tasks also heavily uses this design pattern, with the sequence representation of source code as input; the models then learn to predict sequences denoting a summary of the input code block or, in the case of our research, patches for erroneous code. In an NLP application, these sequences are made up of tokens, which may be the words of natural-language sentences. When used on source code, these tokens may be variables, operators, parameters, etc. One of the methods used to generate these tokens for use in a sequence-to-sequence model was discussed in the Byte Pair Encoding section above. A significant difference between tokens in the two fields is that the vocabulary for program code can be limitless: any variable name, function name, or class name can be whatever the programmer wishes it to be. In [15] and [2], the authors also propose tokenization methods that help reduce the vocabulary size, along with sequence-to-sequence networks for patch generation, which are also topics we discuss in detail in their respective sections.
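To illustrate how sub-word tokenization keeps the vocabulary bounded even for arbitrary identifiers, the following is a small sketch using the pre-trained CodeBERT tokenizer, a byte-level BPE vocabulary; the code line and identifier names are arbitrary examples.

```python
from transformers import AutoTokenizer

# CodeBERT's byte-level BPE tokenizer: a fixed vocabulary of sub-word units.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

line = "int userInputBufferLength = strlen(userInput);"
tokens = tokenizer.tokenize(line)
print(tokens)
# An unseen identifier such as userInputBufferLength is split into known
# sub-word pieces instead of becoming an out-of-vocabulary token.
```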

2.3.2 Graph-based Learning

Representing code with graphs opens up the ability to capture both the syntactic and the semantic structure of the code, although this process is computationally expensive. The idea of using a graph representation is to lessen the requirements on model capacity, training regime, and the amount of training data needed, by introducing data from data flow and type hierarchies, and thus to be able to capture the semantic context in addition to the syntactic context of a program [18]. Research conducted on applying machine learning to programs represented as graphs [19] has implemented this idea by building on a Gated Graph Neural Network [20]. Representing source code as a graph can be done using an abstract syntax tree (AST) [21], a control flow graph (CFG) [22], or a program dependence graph (PDG) [23]. While the abstract syntax tree representation takes inspiration from natural language processing, the other two focus on different aspects of the source code, which can be suitable for different optimization purposes.
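As a small illustration of the tree view of code, the sketch below parses a Python snippet into its abstract syntax tree with the standard ast module; the snippet itself is an arbitrary example.

```python
import ast

source = """
def clamp(value, limit):
    if value > limit:
        return limit
    return value
"""

tree = ast.parse(source)           # build the abstract syntax tree
print(ast.dump(tree, indent=2))    # nested nodes: FunctionDef, If, Compare, Return, ...

# Walking the tree exposes structure that a flat token sequence does not.
for node in ast.walk(tree):
    if isinstance(node, ast.If):
        print("found an If node at line", node.lineno)
```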

2.3.3 Tree-to-tree Learning

This category of learning paradigm can be considered a subcategory of graph-based learning; however, we differentiate it from the previous one because it also resembles sequence-to-sequence learning in the sense of learning a mapping function from input to output using the same representation on both sides. In this paradigm, the tree representation of code, which is normally the abstract syntax tree or one of its variants, may be used to capture the rich syntactic structure of the code, something that the token sequence in sequence-to-sequence learning may fail to capture. Like sequence-to-sequence learning, the inspiration for tree-based learning also comes from the world of NLP. Using a neural machine translation model, an input tree representing buggy code is transformed into an output tree representing fixed code. In the works that use this representation of source code, such as [24], a code differencing tool, GumTree [25], needs to be used to identify the differences between the bad-code AST and the patch AST.

2.4 Bug Repairing and Vulnerabilities Repairing

Intuitively, bug repairing is a broader domain than vulnerability repairing [26]: a vulnerability is a type of security-related bug that can be exploited to cause harm to both software users and providers. Although both aim to detect and repair these bugs automatically, the task of vulnerability repairing takes more time and effort in detection and fixing, as the software can still run smoothly with these vulnerabilities present.
