
Master's thesis in Computer Science: Application of machine learning on automatic program repair of security vulnerabilities


DOCUMENT INFORMATION

Basic information

Title: Application of Machine Learning on Automatic Program Repair of Security Vulnerabilities
Author: Nguyen Ngoc Hai Dang
Supervisors: Assoc. Prof. Dr. Huynh Tuong Nguyen, Assoc. Prof. Dr. Quan Thanh Tho
University: Ho Chi Minh City University of Technology
Major: Computer Science
Type: Master’s Thesis
Year of publication: 2023
City: Ho Chi Minh City
Number of pages: 61

Structure

  • 1.1 Motivation
  • 1.2 Problem Statement
  • 1.3 Research Questions
  • 1.4 Thesis Outline
  • 2.1 Background on Neural Network and Deep Learning
    • 2.1.1 Recurrent Neural Network (RNN)
    • 2.1.2 Vanilla recurrent neural network
    • 2.1.3 Long short-term memory network (LSTM)
    • 2.1.4 Transformer Neural Network
  • 2.2 Transfer Learning
  • 2.3 Learning Paradigm
    • 2.3.1 Sequence to Sequence Learning
    • 2.3.2 Graphs-based Learning
    • 2.3.3 Tree-to-tree Learning
  • 2.4 Bug Repairing and Vulnerabilities Repairing
  • 2.5 Source code Representation
    • 2.5.1 GumTree
    • 2.5.2 Byte Pair Encoding
  • 2.6 Source code embeddings
    • 2.6.1 CodeBERT
    • 2.6.2 UnixCoder
  • 3.1 Template-based approach
  • 3.2 Generative-based approach
    • 3.2.1 SeqTrans
    • 3.2.2 VRepair
  • 5.1 Datasets
  • 5.2 Metrics of performance
  • 5.3 Preprocessing the code as plain text
  • 5.4 Extracting embeddings from large language models for code
  • 5.5 Environment
  • 5.6 Results
  • 6.1 Discussions of the results
  • 6.2 Main Contribution
  • 6.3 Future works
  • Figure 2.1 The basic architecture of recurrent neural network
  • Figure 2.2 Recurrent Neural Network design patterns
  • Figure 2.3 LSTM network with three repeating layers
  • Figure 2.4 Attention-integrated recurrent network
  • Figure 2.5 The encoder-decoder architecture of transformer
  • Figure 2.6 Attention head operations
  • Figure 2.7 Dataset used for CodeBERT
  • Figure 2.8 CodeBERT architecture for replaced tokens detection task
  • Figure 2.9 A Python code with its comment and AST
  • Figure 2.10 Input for contrastive learning task of UnixCoder
  • Figure 3.1 Workflow of VuRLE
  • Figure 3.2 Architecture of SeqTrans
  • Figure 3.3 Input of SeqTrans
  • Figure 3.4 Normalized code segment
  • Figure 3.5 The VRepair pipeline
  • Figure 4.1 Design of our pipeline
  • Figure 5.1 Sample of buggy code and its patch
  • Figure 5.2 Input sequence
  • Figure 5.3 Output sequence
  • Figure 5.4 Syntax of the output sequence

Content

Motivation

In the field of software testing, security vulnerabilities are a type of bug that is hard both to detect and to patch: they do not visibly affect the software's functionality, and they are exposed and cause great harm only when exploited intentionally.

Methods used for creating code patches can be classified into template-based and generative-based. Template-based patching uses predefined templates or patterns to guide the creation of code patches. These templates provide a structure for making specific modifications to the code, and developers fill in the template with the necessary changes based on the identified security issues. This approach simplifies the patching process by providing a consistent and predictable format, making it easier to apply fixes across different codebases. Template-based patching is especially useful for addressing common security vulnerabilities that have well-defined solutions.

Generative-based patching, on the other hand, takes a more automated and algorithmic approach. Instead of relying on predefined templates, it leverages machine learning techniques, code analysis, and algorithms to generate patches automatically. Generative-based patching involves analyzing the codebase, identifying problematic areas, and generating code changes that address the detected security issues. This approach can be more flexible and adaptive, as it can handle a wider range of vulnerabilities and adapt to different programming languages and code structures.

Automated program repair is an emerging field of Software Engineering (SE) research that allows for automated rectification of software errors and vulnerabilities [7]. Program repair, also known as code repair, automated program repair, or software patching, refers to the process of automatically identifying and fixing software bugs or defects without manual intervention. Program repair techniques typically analyze the source code, identify the root cause of the bug, and generate a patch or modification that resolves the issue. Applying Machine Learning techniques to automate the task of repairing faulty code is an emerging research interest [7]–[13].

This thesis is inspired by the work of Zimi Chen [14], whose research uses transfer learning to carry knowledge that deep learning models learn from generic code repairing tasks over to the vulnerability repairing task. Our main focus is to investigate the problem of generating vulnerability patches automatically using a generative model, and to improve the performance of that model with knowledge learned by large programming language models such as CodeBERT [3], in the form of embeddings extracted from CodeBERT.

Problem Statement

The problem of generating patches for vulnerabilities using knowledge learned from data is quite new and still has open challenges. One such challenge is the lack of sufficient data, in terms of both volume and diversity. As a result, current vulnerability patch generation systems are quite unreliable, even template-based methods, which are not as data-hungry as generative-based methods dominated by deep learning architectures.

In contrast to the scarcity of data in the domain of vulnerability repairing, the data for buggy code in general is quite large. One can easily crawl a data set from repositories on GitHub based on commits, and each commit gives a pair of bad code and its patch. The question then is how we can use this abundance of data from the broader domain of code repairing to solve the problem of vulnerability repairing.

As it turns out, situations where the availability of data is limited are not uncommon. In the field of deep learning there is a method designed around the same intuition, called transfer learning, in which knowledge learned from one task is transferred to another. The application of transfer learning to the vulnerability repairing task has already been investigated in [15] and [2], and achieved promising results on the limited data set.

In our work, we will go even further by combining source code modeling techniques with transfer learning with the hope that this could further improve previously reported results.

Research Questions

Our research will focus on answering the three Research Questions:

Research Question 1: What do we know about Deep Learning in Vulnerable program Repair?

• Reason: Deep Learning has shown great potential in various software development problems, and exploring its applicability in vulnerable program repair can lead to advancements in automated bug fixing. Understanding the existing knowledge about Deep Learning in this context can provide insights into its effectiveness, limitations, and potential for improving the efficiency and accuracy of program repair techniques.

Research Question 2: How effective are existing generative-based methods for the problems of code repairing and vulnerability repairing?

• Reason: Investigating the effectiveness of these methods in code and vulnerability repairing can provide valuable insights into their strengths, weaknesses, and limitations.

Research Question 3: Can code embedding extend the capabilities of these methods?

• Reason: Code embedding is a technique that represents source code as vectors in a continuous space, enabling the application of machine learning algorithms to code analysis and understanding. Exploring the use of code embedding in generative-based methods for code and vulnerability repairing can offer new perspectives on improving the effectiveness and generalizability of these techniques.

Thesis Outline

The thesis is structured as follows. In section 2 we discuss the necessary background knowledge of deep learning and its applications in the broader field of code repairing, and then the previous methods used for the problem of vulnerabilities. In sections 3 and 4 we discuss in detail some of the prominent methods, and from there lead to our proposed method.

We will first introduce the fundamentals of Deep Learning, secondly Learning Paradigms, and thirdly Deep Learning in Code Repair.

Background on Neural Network and Deep Learning

Recurrent Neural Network (RNN)

In the world of data, there are properties that regulate the information or knowledge that models are trying to learn; one such property is the order of values in sequence data. More formally, we define a sequence as a type of data in which values at later positions are determined by the values at previous positions.

• T denotes the length of the sequence

• $t$ denotes the position of $x$ in the sequence

• $x_t$ is the value at position $t$

• $X_T$ represents sequence data of length $T$

Recurrent neural networks [16] have mainly two types of variants: with gated units and without gated units. In the next sections, we will discuss both types using one representative network for each: the vanilla recurrent neural network (without gated units) and the long short-term memory network (with gated units).

Vanilla recurrent neural network

Given a sample of sequence data $X_T = \{x_1, x_2, \dots, x_t, \dots, x_T\}$, in which $x_t$ is the value of the sequence at position $t$ (or time step $t$), each of these values is fed in turn into the network as a parameter for the functions inside, which extract information from that timestep, propagate it to downstream operations, and finally make the predictions determined by the task at hand. These functions in an RNN can be intuitively described as extracting information from the current timestep using the knowledge provided by the previous calculations; this also implies that information from the current timestep is propagated throughout the network. The mathematical representation of this description is shown below, along with figure 2.1 describing an RNN of three layers.

$$h_t = \tanh(x_t \cdot W_{xh} + h_{t-1} \cdot W_{hh} + b) \qquad (2.1)$$

First we need to walk through the notations used in figure 2.1 before going into the details of the operations of a typical RNN.

• $\tanh$ is an activation function that returns values in the range $[-1, 1]$.

• $b$ is the bias term, added to allow better generalization of models.

• $W_{hh}$ is the weights matrix representing the connection between the hidden values of position $t-1$ and position $t$, which allows information learned at the previous position $x_{t-1}$ to be forwarded to the current position $x_t$.

• $W_{xh}$ is the weights matrix that extracts information from the current position $x_t$.

• $h_t$ represents the latent information extracted from the current layer.

Figure 2.1: The basic architecture of recurrent neural network

Figure 2.1 shows an RNN with three layers, where each layer is used to extract the latent values $h_t$ from its respective input $x_t$. To do so, the network is fed with the input $x_t$ along with the previously extracted information $h_{t-1}$, as described in equation 2.1. Note, however, that the calculation of hidden values at each layer uses the same weights matrices, and only $h_t$ is propagated to the next layer; therefore, in some texts, RNN and its variants including LSTM are represented using the more compact form shown on the left-hand side of the figure. The behavior of propagating information from one layer to the next is what makes an RNN capable of handling sequence data, since the defining property of a sequence is the dependency between its values.

However, depending on the specific task the network is working on, there might or might not be a prediction $y_t$ for each $x_t$. This leads to the categorization of common patterns for designing recurrent network architectures: one-to-one, many-to-one, many-to-many, and one-to-many. As depicted in figure 2.2, the main difference between these design patterns is the number of inputs $x_t$ that need to be fed into the network before either a single prediction $y_t$ or a sequence of predictions can be generated. The sequence-to-sequence design, discussed in section 2.3.1, is also included in the figure.
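As a minimal sketch of the recurrence in equation 2.1, the following pure-Python example applies the same `W_xh`, `W_hh` and `b` at every timestep and carries only the hidden state forward. The helper names (`matvec_row`, `rnn_step`, `rnn_forward`), the zero initial hidden state, and the toy weight values are assumptions made for illustration; they are not taken from the thesis.

```python
import math

def matvec_row(x, W):
    """Multiply a row vector x (length n) by a matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One application of equation 2.1: h_t = tanh(x_t W_xh + h_{t-1} W_hh + b)."""
    from_input = matvec_row(x_t, W_xh)    # information from the current position
    from_past = matvec_row(h_prev, W_hh)  # information forwarded from position t-1
    return [math.tanh(a + c + bias) for a, c, bias in zip(from_input, from_past, b)]

def rnn_forward(X, W_xh, W_hh, b):
    """Run the recurrence over a whole sequence, reusing the same weights."""
    h = [0.0] * len(b)  # assumed initial hidden state h_0 = 0
    states = []
    for x_t in X:
        h = rnn_step(x_t, h, W_xh, W_hh, b)
        states.append(h)
    return states

# Toy values (illustrative only): input dimension 2, hidden dimension 2.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.1]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # a sequence of length T = 3
states = rnn_forward(X, W_xh, W_hh, b)
```

Each entry of `states` corresponds to one $h_t$; a many-to-one design would read only the last entry, while a many-to-many design would attach a prediction to every entry.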

Figure 2.2: Recurrent Neural Network design patterns

Long short-term memory network (LSTM)

Although the structure of RNNs allows them to handle sequence data efficiently, via the connection between hidden layers that lets information flow between timesteps, their abilities fall short on long sequences, in which information in the later part of a sequence depends on context provided by values far from the current position. LSTM [16] handles this problem of the vanilla RNN using gated units, which serve as controllers that manipulate the information flow propagating through the network. There are three different gated units in an LSTM: the forget gate, the output gate, and the input gate.

Accompanying these gated units is a cell state, which is the key component that allows an LSTM to retain information from previous timesteps. At each timestep the information contained in the cell state is regulated by the gated units, which forget old information and add new information extracted from the recent timesteps.

As mentioned above, the gated units regulate the information flow stored in the cell state, which serves as information storage that allows the network to retain information from the far “past”. However, this is just the intuition of LSTM, and we need to break down the mathematical representation of these gates and their operations. The formal representation of each gate is a function whose parameters are the input $x_t$, the hidden state $h_t$, and the cell state $C_t$, with the choice of parameters depending on the specific gate. Along with the new units, the number of learnable weights matrices also increases. The weights matrices of an LSTM include $W_C$, for computing the candidate cell state values $\hat{C}$ that are later added to the cell state $C$, and $W_i$, $W_f$, and $W_o$, used in the computations that determine respectively the information to be extracted from the current timestep, the amount of information to discard from the cell state, and the information to be fed directly to the next timestep as hidden state values. The mathematical representation of these computations is given below, and they are also visualized in figure 2.3 using the same notations.

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \qquad (2.2)$$

In the above operations, a bias $b$ is added to each linear transformation to increase the generalization ability of the model; $\sigma$ and $\tanh$ are non-linear transformations that output values in the ranges $[0, 1]$ and $[-1, 1]$ respectively.

Figure 2.3: LSTM network with three repeating layers

Figure 2.3 shows a network with three repeating LSTM layers, where each layer is equivalent to a timestep, the term we have been using throughout this section. At each layer (timestep), only the cell state $C_t$ and the hidden values $h_t$ are exposed and fed into the next layer; however, the LSTM network still shares the weights matrices of the mentioned operations across every layer.
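The gate computations can be sketched in a few lines of pure Python. Only the forget gate (equation 2.2) appears explicitly in the text above; the input gate, output gate, candidate cell state, and the two state updates below follow the commonly used LSTM formulation and should be read as an assumption about the omitted equations. The dictionary layout of `p`, the helper names, and the constant toy weights are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Return W v + b for a matrix W (list of rows), a vector v and a bias b."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM timestep; p is a dict of weights and biases (hypothetical layout)."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]      # forget gate, eq. 2.2
    i_t = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]      # input gate (assumed form)
    o_t = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]      # output gate (assumed form)
    C_hat = [math.tanh(a) for a in affine(p["W_C"], z, p["b_C"])]  # candidate cell state
    # Discard part of the old cell state, then add the gated candidate values.
    C_t = [f * c + i * ch for f, c, i, ch in zip(f_t, C_prev, i_t, C_hat)]
    # Expose a gated view of the cell state as the new hidden state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, C_t)]
    return h_t, C_t

def make_params(hidden, inputs, value):
    """Build constant-valued toy parameters; real weights would be learned."""
    p = {}
    for gate in ("f", "i", "o", "C"):
        p["W_" + gate] = [[value] * (hidden + inputs) for _ in range(hidden)]
        p["b_" + gate] = [0.0] * hidden
    return p

params = make_params(hidden=2, inputs=1, value=0.1)
h, C = [0.0, 0.0], [0.0, 0.0]
for x_t in ([1.0], [0.5], [-1.0]):
    h, C = lstm_step(x_t, h, C, params)  # the same weights are reused at every timestep
```

Only `h` and `C` cross from one timestep to the next, which mirrors the description of figure 2.3.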

Transformer Neural Network

Neural networks built from transformer components are currently the focus of research in many fields of artificial intelligence, from natural language processing to computer vision and speech processing. These networks have outperformed earlier architectures by a large margin on many tasks and continue to push the boundary today. In this section, we break down the components of the transformer by first explaining the concept of attention, which is its building block, and then walking step by step through the operations of a transformer module.

Attention Mechanism

In order to understand the transformer module, we first need to discuss the attention mechanism, which gives the transformer the power to retrieve information from any previous timestep. One drawback of recurrent neural networks and their variants is the limited amount of information that can be retained and referenced. This remains true even for LSTM: the cell state only increases the window size of reference, and as the number of LSTM layers increases, the cell state has a harder time retaining old information.

The intuition behind attention is that, while extracting information at the current timestep, the network can “reference” the previously extracted information directly and decide how much “attention” to pay to each part of the sequence, that is, how much of the information it carries is used for the current calculation. This behavior can be seen as an imitation of how humans process information and make predictions by referencing only certain parts of past information. However, this direct reference to past knowledge has a cost: the network must store the result at each timestep for future reference, and the referencing operations are not cheap, as the number of matrix multiplications grows quadratically as the sequence gets longer. To give a better understanding of the attention mechanism, we next go through a step-by-step walk-through of attention integrated with a recurrent network with the encoder-decoder architecture shown in figure 2.4.

Encoder

The encoder block first extracts the information residing in the input sequence; however, the extracted information (hidden state values) is stored instead of being discarded as in the normal forward flow of a recurrent network.

Decoder

The decoder still generates predictions using the extracted knowledge returned by the encoder as input; however, with the attention mechanism, the embedding returned by the encoder is concatenated with a context vector, obtained by direct reference to each layer in the encoder, to form the input for the layers in the decoder. The context vector at each decoder layer is calculated via the following operations:

• Calculate the amount of attention that should be put on each encoder layer by taking the dot product of the hidden state $h_d$ of the decoder layer with the hidden state $h_e$ of every layer in the encoder, and feeding the results into softmax, whose output for each encoder layer is a numeric value in the range $[0, 1]$, interpreted as the “amount of attention” or the “contribution to the context” of the prediction.

• Multiply the result of the softmax function for each encoder layer with that layer's hidden states, and aggregate these vectors from all layers of the encoder by sum or mean to form the context for the prediction.
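The two operations above can be sketched as follows, assuming the encoder hidden states have been stored as plain lists and that the weighted states are aggregated by sum. The function names and the toy vectors are illustrative and not part of the original text.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights in [0, 1] that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def context_vector(h_d, encoder_states):
    """Compute attention weights and the context vector for one decoder state h_d."""
    # Step 1: score every stored encoder state against the decoder state.
    scores = [dot(h_d, h_e) for h_e in encoder_states]
    weights = softmax(scores)
    # Step 2: weight each encoder state and aggregate by sum.
    dim = len(encoder_states[0])
    context = [sum(w * h_e[k] for w, h_e in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Toy example: three stored encoder states and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_d = [2.0, 0.0]
weights, context = context_vector(h_d, encoder_states)
```

Because every decoder step scores every stored encoder state, the work per output position grows with the length of the input sequence.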

Figure 2.4: Attention-integrated recurrent network

Since its emergence in 2017, the transformer has brought about performance increases in many domains of artificial intelligence, breaking record after record on performance metrics and replacing recurrent networks as the focus of research on neural network models. The design of the transformer is built at its core solely around the concept of attention, along with modules that allow the network to be trained in parallel and that enhance the attention mechanism beyond a direct reference to hidden states through matrix multiplication and the softmax function. However, these enhancements also increase the network complexity and the computation cost of both training and inference. In this section, the transformer architecture is discussed along with its advantages and disadvantages; this gives us a better understanding for the later discussion on using such an architecture to solve the problem of vulnerability repair.

The architecture of the transformer proposed in the original paper follows the structure of the traditional encoder-decoder framework, in which the encoder creates a dense vector representation of the input as a reference and the decoder uses this dense representation as input for generating predictions for the learning task. The encoder and decoder stacks in the architecture are each made of $N$ identical modules stacked vertically; the building blocks of these modules, shown in figure 2.5, are attention layers and fully connected networks, whose operations and interactions are discussed in detail in this section.

Figure 2.5: The encoder-decoder architecture of transformer

In the previous section, the attention mechanism was discussed by breaking down its operation in the context of a recurrent network. That setup suffers from one of the drawbacks of recurrent networks: the architecture cannot make use of the parallel processing power integrated deeply in modern processors. The reason is that information is extracted sequentially, meaning that previous states need to be calculated first and used as input for the next states until the end of the input sequence; this results in a long training time and heavily affects the inference speed. Compared to the recurrent paradigm, a fully connected neural network, which is an expansion of the multi-layer perceptron with more hidden layers, offers better training time and faster inference, as it can take advantage of parallel computing by calculating each hidden state in a hidden layer independently before aggregating them for the next hidden layer. This is also one of the enhancements of the transformer over the recurrent network: the transformer allows for parallel computing by replacing the recurrent neural network with a fully connected network and a clever positional encoding “hack”, as shown in figure 2.5. The positional encoding has the following mathematical representation, where $pos$ is the position in the sequence, $i$ indexes pairs of embedding dimensions, and $d$ is the embedding dimension:

$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d}\right)$$

Intuitively speaking, the authors define cosine and sine functions to map the position of each element in a sequence to a vector with the same dimension $d$ as the embeddings in the model. This definition of the mapping function is chosen for convenience: it is easier for the model to learn a linear relationship between positions, which eases the training process without sacrificing any performance, as stated by the authors.

After positional encoding is applied to the input sequence, the result is fed into a multi-head attention module, which is the attention-integrated fully connected layer. Each attention head in this module operates on three vectors created by passing the positionally encoded input through three linear layers simultaneously: the query, key, and value vectors shown in figure 2.6.
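A short sketch of such a mapping is given below. It assumes the sinusoidal form from the original transformer paper, with the sine applied to even embedding indices and the cosine to odd ones, and a base constant of 10000; the function name and the toy sizes are chosen here for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for k in range(d_model):
            i = k // 2  # each (sine, cosine) pair shares one frequency
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Each row has the same dimension as the token embedding it is added to.
pe = positional_encoding(seq_len=5, d_model=4)
```

Since the table depends only on the position and the embedding size, it can be computed once for all positions, which is part of what lets the transformer process a whole sequence in parallel.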

Figure 2.6: Attention head operations

The concepts of queries and keys come from information retrieval: given a set of queries and a set of keys representing the items in the search repository, we can compute scores denoting the relevance between each query and each item. In the context of the transformer, these scores are then normalized to the range $[0, 1]$. The normalized values are the attention each element should give to the rest of the input sequence, or equivalently the amount of information that should be extracted from the value vectors. However, in the original paper, the authors also state that as the dot products grow large, the softmax is pushed into regions with very small gradients, resulting in vanishing gradients during the backpropagation phase. Therefore, the attention scores are first scaled by $\frac{1}{\sqrt{d_k}}$ before being normalized by softmax, as in the expression below, to stabilize the gradients during optimization with gradient descent.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

Multiple attention heads with the same operations as described above are stacked horizontally in the architecture of the transformer, which means that their results are concatenated and fed into another fully connected layer for aggregation; we refer to this as a linear layer to differentiate it from the layers inside the attention heads. This linear layer is the last layer in the encoder and decoder stacks of the transformer architecture, aside from the layers that output predictions at the tail of the model. After each of these multi-head attention layers and linear layers, there are a residual connection and a normalization layer, which both serve the purpose of stabilizing the gradients and easing the learning process of the transformer.
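The scaled dot-product operation of a single attention head can be sketched as below. The example assumes that the query, key, and value vectors have already been produced by the three linear layers, so `Q`, `K`, and `V` are given directly as lists of rows; the function names and the toy values are illustrative only.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights in [0, 1] that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Relevance of this query to every key, scaled by 1 / sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# Toy example: two queries attending over two key/value pairs.
Q = [[1.0, 0.0], [0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 4.0], [6.0, 8.0]]
out = scaled_dot_product_attention(Q, K, V)
```

The rows of `out` do not depend on each other, so every position can be computed independently, unlike the sequential recurrence of an RNN.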

Transfer Learning

Deep learning networks have been shown to provide large performance increases across multiple tasks on all common data types, but one of the main ingredients of their success is the large amount of data these models learn from. The availability of data in many tasks does not meet the requirements of these deep models, and this is where transfer learning comes into play; it is also the technique used in the methods we are investigating for vulnerability repairing. In this technique, one uses a model trained on a similar task as the initial weights for the new model, and then trains the model on the target task. More specifically, the scenarios where transfer learning can be useful include:

• Feature extraction: the output of the source model is used as input for a model of the target task.

• Fine-tuning a pre-trained model: the last few layers of the source model are removed and replaced with layers of the new model. During fine-tuning, either the weights of the whole new architecture are updated together, or only the newly added layers are updated, depending on the specific task.

However, the choice of transfer learning approach largely depends on the similarities between the tasks and between the datasets of both tasks, as stated in the guidelines in [17]:

• New dataset is small and similar to original dataset

Since the data is small, it is not a good idea to fine-tune the source model due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the source model to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN codes.

• New dataset is large and similar to the original dataset

Since we have more data, we can have more confidence that we won’t overfit if we were to try to fine-tune through the full network.

• New dataset is small but very different from the original dataset

Since the data is small, it is likely best to only train a linear classifier.

Since the dataset is very different, it might not be best to train a new model from the top of the source model, which contains more dataset-specific features Instead, it might work better to train the new model somewhere in the middle of the source model’s architecture.

• New dataset is large and very different from the original dataset

Since the dataset is very large, we may expect that we can afford to train the model from scratch. However, in practice, it is very often still beneficial to initialize with weights from a pre-trained model. In this case, we would have enough data and confidence to fine-tune the entire network.

Learning Paradigm

Sequence to Sequence Learning

Sequence-to-sequence, also called many-to-many or encoder-decoder, is a type of neural network design pattern, shown in figure 2.2, in which models take in a sequence of dependent values and generate an output sequence based on the objective on which the model is trained. The general architecture of this type of model includes two separate modules: the encoder and the decoder. These two modules are both neural networks whose architectures can be either identical or different. The encoder first creates a context vector, which is then used as the input of the decoder for sequence generation. In recent years, sequence-to-sequence models have normally been built from one or a composition of the following networks: Recurrent Neural Networks (RNN), Long Short-term Memory Networks (LSTM), and Transformers. The details of all these networks are discussed in section 2.1.

Sequence-to-sequence architecture is normally used in the field of Natural Language Processing (NLP), especially in tasks that take a sentence as input and output an objective-based sentence, which can be a translation of the original sentence or its summary. However, sequence data does not exist only in natural language processing; sequence-to-sequence architectures can also be found in other fields of deep learning, such as time series forecasting, image captioning, text-to-speech, and speech-to-text. The common pattern in these methods is that they all take in sequences of information as input and output another sequence, and these information sequences take many forms, from visual and language information to simple numeric sequences.

Due to the architecture’s popularity, researchers on deep learning methods for source code tasks also heavily use this design pattern, with the sequence representation of source code as input; the models learn to predict sequences denoting the summary of the input code block or, in our case, patches for erroneous code. In an NLP application, these sequences are made up of tokens, which may be the words in natural language sentences. When used on source code, these tokens may be variables, operators, parameters, etc. One of the methods used for generating these tokens for sequence-to-sequence models is discussed in the Byte Pair Encoding section (2.5.2). A significant difference between tokens in the two fields is that the vocabulary for program code can be limitless, because any variable name, function name, and class name can be whatever the programmer wishes it to be. In [15] and [2], the authors also proposed tokenization methods that help reduce the vocabulary size, along with sequence-to-sequence networks for patch generation; these are topics that we discuss in detail in the respective sections.

Graphs-based Learning

Representing code with graphs opens up the ability to capture both syntactic and semantic structures in the code. This process is very computationally expensive. The idea of using a graph representation is to lessen the requirements on model capacity, training regime, and amount of training data by introducing information from data flow and type hierarchies, and thus to capture the semantic context in addition to the syntactic context of a program [18]. Research on applying machine learning to programs represented as graphs [19] has implemented this idea by building on a Gated Graph Neural Network [20]. Source code can be represented as a graph using an abstract syntax tree (AST) [21], a control flow graph (CFG) [22], or a program dependence graph (PDG) [23]. While the abstract syntax tree representation takes inspiration from natural language processing, the other two focus on different aspects of the source code, which can be suitable for different optimization purposes.

Tree-to-tree Learning

This learning paradigm can be considered a subcategory of graph-based learning; however, we differentiate it from the previous one because it also resembles sequence-to-sequence learning, in that it learns a mapping function from input to output using the same representation on both sides. In this paradigm, the tree representation of code, which is normally the abstract syntax tree or one of its variants, is used to capture the rich syntactic structure of the code, which the token sequence in sequence-to-sequence learning may fail to capture. Like sequence-to-sequence learning, the inspiration for tree-based learning also comes from the world of NLP. Using a neural machine translation model, an input tree representing buggy code is transformed into an output tree representing fixed code.

In works that use this representation of source code, such as [24], a code differencing tool such as GumTree [25] is needed to identify the differences between the AST of the bad code and the AST of the patch.

Bug Repairing and Vulnerabilities Repairing

Intuitively, bug repairing is a broader domain than vulnerability repairing [26]: a vulnerability is a type of security-related bug that can be exploited and cause harm to both software users and providers. Although both tasks aim to detect and repair bugs automatically, vulnerabilities take more time and effort to detect and fix, because the software can still run smoothly while containing them.

Source code Representation

GumTree

GumTree [27] is an algorithm for extracting edit scripts from abstract syntax trees using a two-stage approach: it first produces mappings between matching nodes in the abstract syntax trees of the original code and the fixed code, and these mappings are then used as input to another algorithm (RTED, as in the original paper) to generate the edit scripts. The mapping stage consists of two phases, a top-down phase followed by a bottom-up phase.

Top-down phase. The two trees are compared to find isomorphic sub-trees. The roots of these sub-trees are called anchor mappings, which will later be used in the bottom-up phase. Anchor mappings are found using an auxiliary data structure called a height list; the algorithm iterates through this list from the root down to the nodes whose heights are larger than minHeight. The detailed operation of the top-down algorithm is as follows:

• Start from the highest node of both the source tree T1 and the destination tree T2, and check whether these nodes are isomorphic. If the two nodes are not isomorphic, their children are tested.

• A given node can have multiple matches, so all such mappings are first put into a list called the candidate list and are processed only after all unique mappings have been found.

• For each node with multiple matches in the candidate list, we choose only the mapping that gives the highest score under the dice function below, where $s(t)$ denotes the set of descendants of node $t$:

$$\mathrm{diceFunction}(t_1, t_2, M) = \frac{2 \times |\{t_1 \in s(t_1) \mid (t_1, t_2) \in M\}|}{|s(t_1)| + |s(t_2)|}$$
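The dice score used to rank candidate mappings can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the GumTree implementation: nodes are plain hashable labels, the descendant sets are passed in explicitly, and the function name `dice` and its parameters are chosen here for readability.

```python
def dice(descendants_1, descendants_2, mappings):
    """Ratio of mapped descendants shared by two nodes, in [0, 1].

    descendants_1, descendants_2: iterables of descendant node labels.
    mappings: set of (node_in_tree_1, node_in_tree_2) pairs found so far.
    """
    descendants_1 = list(descendants_1)
    targets = set(descendants_2)
    total = len(descendants_1) + len(targets)
    if total == 0:
        return 0.0
    # Count descendants of the first node that are mapped to a descendant
    # of the second node.
    common = sum(
        1 for a in descendants_1 if any((a, b) in mappings for b in targets)
    )
    return 2 * common / total


score = dice(["a", "b", "c"], ["x", "y"], {("a", "x"), ("b", "y")})
```

Here two of the three descendants on the left are mapped into the two descendants on the right, so the score is 2 × 2 / (3 + 2). A candidate whose descendants are mostly already mapped to each other therefore wins over one that shares few mapped descendants.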

Bottom-up phase. The bottom-up phase traverses the two trees from their leaves to their roots to find the highest matching nodes; these mappings are called container mappings. Two parent nodes are considered a match in this phase if:

1. The two nodes do not appear in the mapping set M generated by the top-down phase.

2. The two nodes' dice score is larger than minDice, as in the expression below:

$$\mathrm{diceFunction}(t_1, t_2, M) > \mathrm{minDice} \quad (2.13)$$

3. For nodes with multiple matches satisfying the above condition, only the mapping with the highest dice score is chosen and added to M.

These container mappings are further processed to find more matching descendants: all already-matched descendants are first removed, and then, only for sub-trees with a size smaller than maxSize, an edit script (without the move action) is generated along with the node mappings. From this edit script, we add new mappings to M only if the two nodes in a mapping have identical labels.

The edit script from the source tree to the destination tree is generated from the mappings created in the two phases above using an edit script generation algorithm such as RTED, which is also the algorithm used by the authors in the original GumTree paper. This edit script is used as a representation of our source code and as input for the downstream patch generation modules.

Byte Pair Encoding

In NLP tasks, it is common to represent the input text as vectors of tokens made up of the words in the dataset, and the same operations can be applied to source code if we view it as text. However, this representation has one problem: the vocabulary of the dataset might not contain all the words that the model will encounter when it is put into use. Character-level tokens can be used instead of words, but this risks losing the semantic properties carried within the words. So, the goal of Byte Pair Encoding (BPE) [28] is to create tokens that:

• Retain the semantic features of the token, that is, the information per token.

• Can be produced without demanding a very large vocabulary, using a finite set of words.

We use the example provided by Wikipedia to illustrate BPE. The original data is "aaabdaaabac". The algorithm searches for the most frequently occurring byte pair, which is "aa", and replaces it with a byte that does not occur in the data, "Z". This gives the following data and replacement table:

ZabdZabac
Z = aa

The above step is then iterated, placing the most frequently occurring byte pair in the table each time:

ZYdZYac
Y = ab
Z = aa

XdXac
X = ZY
Y = ab
Z = aa

The algorithm stops when no pair of bytes occurs more than once. To decompress the data, we perform the replacements in reverse order.
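The compression loop described above can be sketched as follows for the string "aaabdaaabac". This is a minimal illustration, not a production tokenizer: the placeholder symbols `Z`, `Y`, `X` are assumed not to occur in the input, and ties between equally frequent pairs are broken by an arbitrary rule (lexicographic order of the pair) chosen here only to make the run deterministic.

```python
from collections import Counter


def bpe_compress(text, placeholders="ZYXWVU"):
    """Repeatedly replace the most frequent adjacent pair with a new symbol."""
    data = list(text)
    table = []  # (placeholder, (left, right)) in the order they were created
    for symbol in placeholders:
        pairs = Counter(zip(data, data[1:]))
        if not pairs:
            break
        # Highest count wins; ties fall back to the lexicographically largest pair.
        pair, count = max(pairs.items(), key=lambda item: (item[1], item[0]))
        if count < 2:
            break  # no pair occurs more than once
        merged, i = [], 0
        while i < len(data):
            if i + 1 < len(data) and (data[i], data[i + 1]) == pair:
                merged.append(symbol)
                i += 2
            else:
                merged.append(data[i])
                i += 1
        data = merged
        table.append((symbol, pair))
    return "".join(data), table


def bpe_decompress(data, table):
    """Undo the replacements in reverse order of creation."""
    for symbol, (left, right) in reversed(table):
        data = data.replace(symbol, left + right)
    return data


compressed, table = bpe_compress("aaabdaaabac")
restored = bpe_decompress(compressed, table)
```

When BPE is used for tokenization rather than compression, the same merge table is learned on a training corpus and then applied to unseen text, so that rare identifiers are split into known sub-word units instead of falling outside the vocabulary.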

Source code embeddings

CodeBERT

CodeBERT is a bimodal model: it is trained on two types of data, code segments and their documentation, and this is what makes it capable of generating general-purpose vector representations. Another key point in the setup of CodeBERT is that it is trained on the dataset shown in figure 2.7, which mixes several programming languages, with nothing in the model to indicate which language a sample belongs to.

As CodeBERT is trained on two types of data, the input of the model has two segments, one for each type of data; however, both segments follow the same tokenization process as in a standard text processing pipeline. The generated token sequences are then combined with special tokens into a single sequence that is used as input for the model, and the model outputs (1) dense vector representations of both code tokens and word tokens, along with (2) the representation of [CLS], as stated in [3]. To generate such representations, the model is trained on two learning objectives: masked language modeling and replaced token detection.

2.6.1.1 Masked Language Modeling

This objective uses the bimodal data of natural language and programming language to learn to predict the tokens that are masked out in the input sequence, denoted by the [MASK] token. The loss function for this objective is written as follows

Given θ as the model parameters to be found, the loss is optimized when the joint probability $p^{D_1}$ of the predictions for the masked tokens $m^w$ and $m^c$, given $w^{masked}$ and $c^{masked}$, is maximized.

2.6.1.2 Replaced Token Detection

In this objective, the model learns to detect replaced tokens, as illustrated in figure 2.8, with the loss function described in equation 2.16 δ(i) =

Here $p^D$ is the probability with which the discriminator predicts that the token at position $i$ is original, and δ is an indicator function that returns 1 if the position is a corrupt token and 0 otherwise. This differs from a GAN in that we assign the "real" label to a token if the generator happens to generate the correct token.

Figure 2.8: CodeBERT architecture for replaced tokens detection task

UnixCoder

2.6.2.1 Abstraction of UnixCoder

UnixCoder [4] is another large programming language model. It leverages the AST representation of code segments, shown in figure 2.9, along with the comments, to learn embeddings in two pre-training tasks.

Figure 2.9: A Python code with its comment and AST

• The first is a contrastive learning task in which the model optimizes a cosine loss that measures the sum of the similarities of all vectorized inputs in a training batch. The input of this task, shown in figure 2.10, is a concatenation of the flattened AST representation of a code segment and the comment describing that code segment.

Figure 2.10: Input for contrastive learning task of UnixCoder

• The second pre-training task is conditional text generation, in which the model learns to generate the respective comment from the flattened AST of the code segment.

3 The state of the art program repair approaches

The problem of automatically generating patches for vulnerable code has been studied by researchers for a long time. Before the boom of neural network-based methods, the dominant approach was the template-based approach, which uses templates mined from the dataset to generate patches for the respective vulnerabilities. With the emergence of deep learning models and their outstanding performance in multiple fields, most prominently computer vision and natural language processing, it is not surprising that neural-network-based methods have become a subject of study in the field of automatic bug repairing, and these methods have been shown to surpass template-based methods in terms of performance. In this section, we discuss some recent and prominent methods in each of these categories in detail: the first subsection covers the template-based approach, and the second covers generative methods using neural networks.

Template-based approach

Methods in this category need to identify patterns that can be leveraged to generate a patch. The patterns can be created by mining a dataset of source code and its patches, or they can be predefined rules written by engineers. Each method uses a different set of fix patterns for patch generation, which makes these methods hard to evaluate and compare, because the quality of the defined fix patterns has a direct impact on the quality of the generated patches. However, the authors of TBar have categorized these fix patterns into sixteen groups and described their properties along four qualitative dimensions [30], which can be briefly described as follows:

• Change Action: What high-level operations are applied on a buggy code entity? The operations are categorized into Update, Delete, Insert, and Move. On the one hand, Update operations replace the buggy code entity with retrieved donor code, while Delete operations remove the buggy code entity from the program. On the other hand, Insert operations insert an otherwise missing code entity into the program, and Move operations change the position of the buggy code entity to a more suitable location in the program.

• Change Granularity: What kinds of code entities are directly impacted by the change actions? This entity can be an entire Method, a whole Statement, or specifically an Expression within a statement.

• Bug Context: What specific AST nodes of code entities are used to match fix patterns?

• Change Spread: How many statements are impacted by each repairing pattern?

A notable system in this category is VuRLE [24], which generates patches using a template-based method with two phases: a Learning Phase and a Repair Phase. In the learning phase, the method mines the training data to create repair templates, which are retrieved for patch generation in the later phase. This method uses a tree as the source code representation, created using GumTree. Figure 3.1 shows an overview of the two-phase workflow along with the steps needed to transform the data in each phase.

Figure 3.1: Workflow of VuRLE

Learning Phase. In this phase, VuRLE mines the data and generates the patch templates, which are then refined in the later phase to create the respective patches.

Extracting edit blocks. Pairs of buggy code and its patch are fed into GumTree to create the edit sequences that fix the buggy code.

Constructing edit groups. These edit sequences are used to create graphs by forming edges between the sequence pairs with the longest overlapping sub-sequence. The graphs of edit sequences are then split into connected components, and DBSCAN is used to cluster these components into edit groups.

Template generation. For each pair of edit sequences in an edit group, a template is created by identifying the longest overlapping edit sub-sequence and the context of this sub-sequence. The editing context is also an output of GumTree; it specifies the locations of the edit operations in the code segments.

Repair Phase. With the repair templates generated in the learning phase, for each unseen bad code segment we find the appropriate templates based on their similarity to the known bad code in the learning phase dataset. The chosen template is then further refined to match the bad code.

Selecting templates. Templates are selected by comparing the input code with the edit groups' templates mined in the learning phase.

Patches generation. The transformative operations specified in the templates' edit patterns are applied to the input code to create code patches, and only patches that do not contain redundant code are kept.
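As an illustration of the "longest overlapping sub-sequence" step, the sketch below computes a longest common subsequence of two edit sequences with dynamic programming. This is an assumption about what the overlap means: VuRLE may define it differently (for example as a contiguous run), and the edit labels used here are hypothetical placeholders rather than real GumTree output.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of the sequences a and b."""
    rows, cols = len(a), len(b)
    # table[i][j] = LCS length of a[i:] and b[j:]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover the subsequence.
    result, i, j = [], 0, 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical edit sequences for two similar fixes.
edits_a = ["insert:if", "update:call", "delete:return"]
edits_b = ["insert:if", "move:stmt", "delete:return"]
shared = longest_common_subsequence(edits_a, edits_b)
```

The shared edits form the core of a candidate template, while the edits that differ between the two sequences are the parts that would need to be adapted to each new piece of code.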

Generative-based approach

SeqTrans

SeqTrans [2] uses transfer learning: a model first trained on the bug repairing task is fine-tuned on the vulnerability repairing task, so each task needs a separate dataset. In the reported experiments, the model is trained on the bug repairing task using the Tufano [31] dataset, and the Ponta [32] dataset is then used in the fine-tuning phase; both datasets consist of Java source code.

Before going into the details of this method, let us look at the general design of the architecture in figure 3.2.

Tokenization and Normalization. Although SeqTrans does not use a tree representation of the source code, the GumTree algorithm is still used to map the AST nodes of the source and the patch, so that the diff context can be extracted using the commercial tool Understand. Each sample from both the bug repair and the vulnerability repair datasets is represented as a list of code segment pairs

$$CP = (st_{src}, st_{dst})_1, \ldots, (st_{src}, st_{dst})_n$$

These code pairs are further refined to construct the def-use chains, which are the assignment of some value to a variable containing all variable definitions from the vulnerable statement [2], turning the code pairs into

$$CP = ((def_1, \ldots, def_n, st_{src}), (def_1, \ldots, def_n, st_{dst}))_1, \ldots, ((def_1, \ldots, def_n, st_{src}), (def_1, \ldots, def_n, st_{dst}))_n$$

Figure 3.3 shows a sample code pair input for the model, in which all global variable definitions and all statements that have dependencies with the vulnerable statements are preserved, while the other statements in the same method are discarded.

After the code pair dataset has been created, each code segment is first normalized to reduce the vocabulary size of the dictionary. The vocabulary size also determines the size of the output vector, since that vector denotes the probability of each token in the dictionary being the prediction, so a smaller vocabulary eases the training process. Each numeric literal and string is turned into $num_1, \ldots, num_n$ and $str_1, \ldots, str_n$, and each variable name is replaced with $var_1, \ldots, var_n$, as shown in figure 3.4. These "placeholders" are later replaced with their real values using the mappings generated during normalization. At this stage, the input is ready to be tokenized with Byte Pair Encoding [28] along with the dictionary of the dataset.
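The normalization and its inverse can be sketched as follows. This is a simplified illustration, not the SeqTrans implementation: it assumes the code has already been lexed into `(kind, text)` pairs by some external lexer, the kind names `"num"`, `"str"`, and `"var"` are hypothetical, and it assumes that no original token already looks like a placeholder such as `var_1`.

```python
def normalize(tokens):
    """Replace literals and variable names with numbered placeholders.

    tokens: list of (kind, text) pairs, kind in {"num", "str", "var", ...}.
    Returns the normalized token texts and the placeholder -> text mapping.
    """
    counters = {"num": 0, "str": 0, "var": 0}
    forward = {}   # (kind, text) -> placeholder
    backward = {}  # placeholder -> original text
    normalized = []
    for kind, text in tokens:
        if kind in counters:
            key = (kind, text)
            if key not in forward:
                counters[kind] += 1
                placeholder = f"{kind}_{counters[kind]}"
                forward[key] = placeholder
                backward[placeholder] = text
            normalized.append(forward[key])
        else:
            normalized.append(text)  # keywords, operators, punctuation
    return normalized, backward


def denormalize(tokens, backward):
    """Put the original values back in place of the placeholders."""
    return [backward.get(token, token) for token in tokens]


sample = [("var", "count"), ("other", "="), ("var", "count"),
          ("other", "+"), ("num", "42"), ("other", ";")]
normalized, mapping = normalize(sample)
restored = denormalize(normalized, mapping)
```

Because the same name always maps to the same placeholder within a sample, the model still sees which positions refer to the same variable, while the dictionary only needs a small, fixed set of placeholder tokens.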

Figure 3.4: Normalized code segment

Pre-training and fine-tuning. SeqTrans is designed using transformer modules as building blocks, whose details have been discussed in the previous sections. The architecture is the same for both the pre-training and fine-tuning phases, with the only differences being in the dataset, batch size, and number of training steps: the pre-training model is trained with a batch size of 4096 for 300k steps, and the fine-tuning model is trained with a batch size of 4096 for an extra 30k steps [2]. SeqTrans is implemented with OpenNMT, a framework that offers a low-code way to configure the model architecture through configuration files; some of the main configurations used to build the model are listed by the authors in [2].

• Size of hidden transformer feed-forward: 2048

VRepair

In general, the architecture of VRepair [15] is the same as that of SeqTrans, because the authors take inspiration from SeqTrans and apply the method to different datasets, as stated in [15]. Aside from the choice of datasets, the main differences lie in how each method performs pre-processing and represents source code. In SeqTrans, the code pairs are generated using GumTree and further refined to extract the reference chains before being tokenized, while in VRepair, source code is handled just like natural language, and the tokenizing process is applied directly to the vulnerable code segments and their respective patches, with additional special tokens in both the buggy code segments and the patches.

Figure 3.5: The VRepair pipeline

Figure 3.5 shows both the pre-training and fine-tuning processes of VRepair, in which we can see the similarity with SeqTrans [2] apart from the preprocessing step. The authors' reasoning behind their design choice is that

• The additional tokens serve the function to localize the buggy segments in the input code and reduce the difficulty of the generative task by setting up the model to only learn to generate the modified segments instead.

• Representing multiple changes to a function allows vulnerability fixes that span multiple lines within a single code block, which makes the solution more robust compared to [1] [24] [33].

• Decreasing the length of the output sequence generated also leads to a reduction in both training cost and inference cost.

Motivation and inspiration. Aiming to clarify research question 3, we argue that using embeddings created by CodeBERT would be beneficial for vulnerability repair. The justification for our hypothesis is that the use of large-scale embeddings for low-resource downstream tasks has proven effective in many natural language processing problems. With the availability of large-scale code language models trained on large datasets for code generation tasks such as code repairing or masked token prediction, we can leverage them as embeddings for vulnerability repairing models that have limited datasets, on the premise that the upstream and downstream tasks are similar in terms of objectives and data. On the similarity between data, we know that a vulnerability is a type of bug that can be the target of security exploitation and that poses a much more difficult challenge to identify and patch manually; however, the similarity of these tasks is not.

Experiments design. In these experiments, we treat the source code as plain text to be processed by the preprocessing modules before feeding the data into the model. This choice of representation can be justified by the fact that both programming and natural languages carry sequential and structural information. Furthermore, empirical results show that this type of representation works for both discriminative and generative tasks such as bug detection, code summarization, code prediction, etc.

To test the hypothesis above, we conduct experiments in two phases using the same dataset provided in VRepair and then compare the results of the models from the two phases, both of which are built with OpenNMT. The pipeline design for both phases is shown in figure 4.1.

• First, we replicate the experiments reported by the authors of VRepair on a downscaled network, due to our limited computing resources. The network used in these experiments is the transformer-based neural translation network built using the OpenNMT-py framework.

• Second, we design a new pipeline that uses the embeddings created by programming language models as input to the VRepair architecture, and we compare the result of this pipeline with the previous one. These embeddings are fed into the same neural translation network used in the first phase.

Figure 4.1: Design of our pipeline

This section presents the setting of our experiment, including the dataset, performance metrics, data preparation, and results.

Datasets

The data provided by VRepair includes two existing datasets, Big-Vul [34] and CVEfixes [35], for training the neural translation network. The Big-Vul dataset was created by crawling CVE databases and extracting vulnerability-related information such as the CWE ID and CVE ID. Then, depending on the project, the authors developed distinct crawlers for each project's pages to obtain the git commit link that fixes the vulnerability. In total, Big-Vul contains 3754 different vulnerabilities across 348 projects, categorized into 91 different CWE IDs, with a time frame spanning from 2002 to 2019. The CVEfixes dataset is collected in a way similar to the Big-Vul dataset. It contains 5365 vulnerabilities across 1754 projects, categorized into 180 different CWE IDs, with a time frame spanning from 1999 to 2021. In our research, we conduct the experiments only on Big-Vul to narrow the scope of this thesis, and we leave experiments on a more diverse dataset to future work.

Validation method. To train and validate our experiments, we split the dataset into training, validation, and test data, with 70% for training, 10% for validation, and 20% for testing. For the Big-Vul dataset, this gives 2228 samples as training data, 318 samples as validation data, and 636 samples as testing data.
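A split with these proportions can be sketched as below. The helper name `split_dataset`, the fixed random seed, and the rounding rule (round the validation and test sizes, give the remainder to training) are assumptions made for illustration; with 3182 samples, the sum of the three reported sizes, this rule happens to reproduce the reported counts.

```python
import random


def split_dataset(samples, val_frac=0.1, test_frac=0.2, seed=0):
    """Shuffle the samples and split them into train / validation / test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle
    n_val = round(len(shuffled) * val_frac)
    n_test = round(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_val - n_test
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_dataset(range(3182))
```

Fixing the seed keeps the three subsets identical across runs, which matters when models from different experiment phases are compared on the same validation data.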

Metrics of performance

To interpret our results, it is important to choose suitable metrics for measuring our models. The OpenNMT-py framework [36] automatically reports the perplexity (PPL) and the accuracy during training and validation. Early stopping can also be configured based on these metrics. The perplexity is a measure of how uncertain the network is that the predicted outcome is correct: low PPL means low uncertainty, and high PPL means high uncertainty. It is a common way to evaluate language models in NLP. Luong et al. report that translation quality is connected to the PPL [37], claiming that a model with a low PPL produces translations of higher quality. The PPL is defined in equation 5.1, in which the network's uncertainty about the generated document $D$ is measured by the joint probability of all words in that document, normalized by the number of words $N_d$ in the document:
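As a small worked illustration of this description, the sketch below computes perplexity in the per-token form commonly used for language models: the exponential of the average negative log-probability of the tokens. This standard form is an assumption here and may differ in notation from equation 5.1; the function name and the example probabilities are chosen only for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence given the probability assigned to each token."""
    if not token_probs:
        raise ValueError("at least one token probability is required")
    negative_log_likelihood = -sum(math.log(p) for p in token_probs)
    return math.exp(negative_log_likelihood / len(token_probs))


# A model that gives every correct token probability 0.5 is as uncertain
# as a uniform choice between two options.
ppl_uncertain = perplexity([0.5, 0.5, 0.5])
# A model that is always certain and correct has the minimum perplexity.
ppl_certain = perplexity([1.0, 1.0])
```

Read this way, a perplexity of k means the model is on average as uncertain as if it were choosing uniformly among k tokens at each position.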

Another metric that OpenNMT-py reports automatically is accuracy, which measures the proportion of correctly predicted tokens. The accuracy is calculated by:

$$\mathrm{accuracy} = \frac{\text{number of correctly predicted tokens in } \hat{Y}}{\text{number of tokens in } Y} \quad (5.2)$$

where $\hat{Y}$ is the predicted output sequence and $Y$ is the target sequence. The way accuracy is calculated makes it a metric that does not give much insight into the results and the model's performance. The reason is that even if all the tokens present in the target sequence $Y$ are present in the predicted sequence $\hat{Y}$, the position of each token can differ from the target sequence, and the accuracy will still be 100%.

Bleu-score [38] is a metric used specifically for evaluating the quality of text in machine translation. It is based on the precision of the predictions over subsequences of length n (n-grams) of the predicted sequence. In the context of machine translation, precision is the number of words among the predicted tokens that also appear in the target sequence. The Bleu-score is calculated using equation 5.3, in which N is the length of the subsequence in the predicted sequence:
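The n-gram precision that Bleu-score builds on can be sketched as follows. This shows only the clipped ("modified") n-gram precision for a single reference; the full metric additionally combines several n-gram orders and applies a brevity penalty, which this sketch leaves out. The function names and the toy sequences are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous subsequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(predicted, target, n):
    """Fraction of predicted n-grams that also occur in the target.

    Each n-gram is credited at most as many times as it occurs in the target.
    """
    predicted_counts = Counter(ngrams(predicted, n))
    target_counts = Counter(ngrams(target, n))
    total = sum(predicted_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, target_counts[gram])
                  for gram, count in predicted_counts.items())
    return clipped / total


predicted = "a b c d".split()
target = "a b d c".split()
p1 = modified_precision(predicted, target, 1)
p2 = modified_precision(predicted, target, 2)
```

In this toy case every predicted token appears in the target, so the unigram precision is perfect, but only one of the three predicted bigrams matches; longer n-grams are what make the score sensitive to token order.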

Preprocessing the code as plain text

In the pipelines used in these experiments, we treat the source code and its patch as plain text, as shown in figure 5.1, and tokenize them using a byte-pair encoding algorithm. However, before tokenization, we add an extra preprocessing step that inserts special tokens into both the input sequences and the target sequences of the original dataset. This step is done before we extract the embeddings from the data using programming language models.

Figure 5.1: Sample of buggy code and its patch

• In the input sequence shown in figure 5.2, <StartLoc> and <EndLoc> are added around the location identified as vulnerable, and there is also an additional indicator "CWE-xxx" that specifies the type of vulnerability.

• For the target sequence shown in figure 5.3, we use two new unique tokens, <ModStart> and <ModEnd>, and change the target sequence to contain only the modifications needed. There are 3 types of modifications that can be made to the input to create a patch, shown in figure 5.4, leading to 3 formats of the target sequence, each indicating one type of modification made to the input sequence.

Figure 5.4: Syntax of the output sequence
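The input-side preprocessing can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_input`, the token-index interface, and the choice to put the CWE indicator at the front of the sequence are hypothetical, and the sample tokens and CWE number are made up. The target-side formats of figure 5.4 are not reproduced here.

```python
def build_input(tokens, start, end, cwe_id):
    """Mark tokens[start:end] as the vulnerable span and prefix the CWE type.

    tokens: list of source code tokens.
    start, end: token indices of the vulnerable span (end is exclusive).
    cwe_id: vulnerability type indicator such as "CWE-119".
    """
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span is outside the token sequence")
    return (
        [cwe_id]
        + tokens[:start]
        + ["<StartLoc>"]
        + tokens[start:end]
        + ["<EndLoc>"]
        + tokens[end:]
    )


marked = build_input(["int", "x", "=", "buf", "[", "i", "]", ";"], 3, 7, "CWE-119")
```

The markers need to be kept as single entries in the vocabulary, so that the later byte-pair encoding step does not split them into sub-tokens.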

Extracting embeddings from large language models for code

The embeddings extracted from these language models are stored as a look-up table over the vocabulary of the corpus; therefore, we first need to extract the vocabulary of words that exist in the corpus, which should include the newly added tokens. Each word in the vocabulary is represented by a vector, as shown in figure 4.1, which has a size of 768 and holds the semantic representation learned through the pre-training tasks of the programming language models.

In this process, we extract embeddings from two different large language models, CodeBERT [3] and UnixCoder [4], both of which are trained on a large dataset of programming languages that includes the ones in our vulnerability dataset. The main difference between these two pre-trained models is the type of input on which they are trained: CodeBERT is trained on both natural language and programming language, while UnixCoder is trained on programming language only. This difference stems from the fact that they aim to optimize performance on different tasks; the first emphasizes code summarization, while the latter is better at auto-regressive tasks such as code completion.

The extracted embeddings, representing the entire vocabulary of the corpus, are stored as a look-up table that is later used as input during the training of the downstream translation model for vulnerability repairing. One thing to note is that the programming language models used in our experiments have their own tokenizers and dictionaries. Therefore, an input token produced by OpenNMT's tokenizer might be further tokenized by these language models, which gives the output tensor a shape of n × 768, in which n is the number of tokens created from the input token. For example, the token word1 might be further tokenized into subword1 and subword2, making the output tensor of the language model have a size of 2 × 768. To create an embedding of size 1 × 768 representing one single token in our dictionary, we use two methods to aggregate the language models' output tensors:

• We take the mean of the output tensor along the second dimension. The code snippet for this method is shown below, in which tokenizer is the tokenizer used by the programming language model to map the vocab entry from the vulnerability dataset into indexes in the dictionary of the programming language model. These indexes are then fed into the language model to get the output tensor, whose mean is taken along the second dimension.

Listing 1: Extracting the embeddings using mean

    import torch

    # Split the vocabulary entry into the language model's own sub-tokens.
    code_tokens = tokenizer.tokenize(vocab)
    tokens_ids = tokenizer.convert_tokens_to_ids(code_tokens)
    # Output 0 holds the per-token hidden states, shape (1, n, 768).
    context_embeddings = model(torch.tensor(tokens_ids)[None, :])[0]
    # Average over the n sub-tokens (dimension 1) to get one 768-dimensional vector.
    context_embeddings = torch.mean(context_embeddings, dim=1).ravel()
    vul_embs[idx, :] = context_embeddings

• Whenever we feed a token from our dictionary into the language models, we concatenate it with a special token named [cls], which intuitively represents the semantic information of the entire token sequence. With this method, we only need the first row of the language model's output to act as the embedding. The code snippet for this method resembles the first one, the only difference being that the additional [cls] token is used as the embedding.

Listing 2: Embeddings extracted by using the concatenated token

    import torch

    code_tokens = tokenizer.tokenize(vocab)
    # Prepend the special classification token to the sub-tokens.
    tokens = [tokenizer.cls_token] + code_tokens
    tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
    # Keep only the hidden state at position 0, which belongs to the [cls] token.
    context_embeddings = model(torch.tensor(tokens_ids)[None, :])[0][0, 0]
    vul_embs[idx, :] = context_embeddings

Environment

All of our experiments are conducted on a machine with 32 GB of RAM and one NVIDIA Quadro RTX 6000 with 24 GB of GDDR6 memory. To train, predict, and create the vocabularies for both the embedding extraction and our translation model, we use the OpenNMT-py framework [36], a neural machine translation framework built on top of PyTorch [39]. The programming language models used in our experiments are all accessible through the Hugging Face hub and implemented with the

Results

The results in table 5.1 below come from the experiments that replicate the VRepair pipeline using a downscaled version of the transformer architecture. Section ?? shows the complete set of hyperparameters used in these experiments; in the experiments below, however, we ran multiple experiments trying out different configurations of some of the hyperparameters, which were also explored in [15], to better understand the architecture's performance when training without the embeddings from CodeBERT [3] and UnixCoder [4]. We report model performance using token-level accuracy, perplexity, and training time in the table below with the best configuration highlighted, which has a token-level accuracy of 50.229%. However, the high perplexity values in these results indicate that the models are not certain of their predictions.

Table 5.1: Experiments replicating the VRepair pipeline

Learning rate Hidden size Sequence Length Validation accuracy (%) Validation perplexity Training time (s)

Table 5.2 below shows the results of the pipeline that leverages the embeddings extracted from CodeBERT and UnixCoder as input for the downstream task. In these experiments, however, we choose only one set of hyperparameters from the configurations of the first phase, in which the learning rate is set to 0.0005, the hidden size is 768, and the sequence length is 2000. The reason is that we only want to clarify the effect of using embeddings on the downstream model, and using the same hyperparameters helps us correctly attribute any improvement in performance to the use of embeddings. Along with that, as mentioned in the earlier section, we also chose to reduce the number of training iterations from 100000 in the previous experiments to 20000, with the same justification.

As mentioned in section 5.4, we feed each word in the vocabulary into the programming language model to get the representation of the word; however, each of these models has its own input specification, leading to differences in output. The experiments denoted with the postfix (1) use the embeddings extracted with the first method, which is aggregation along the second dimension of the language model's output tensor, and the ones with the postfix (2) use the [cls] token as the embedding. As in the first phase of our experiments, results in the second phase are reported as token-level accuracy, perplexity, and training time. In addition, we report the models' ability to generate perfect patches that entirely match the samples' labels. The results show that the use of embeddings extracted from CodeBERT by the latter method does help to improve the performance slightly.

Table 5.2: Experiments with embeddings as input

Embedding Validation accuracy (%) Validation perplexity Training time (s) EM (out of 316) Bleu-score

The code snippet below is an example of a perfect patch generated by the models using samples from the validation dataset, in the format described in figure 5.3. In this specific example, the predicted sequence indicates that the generated patch will insert memset between "stride ) ;" and "( input ," at every place in the original code that has such a pattern.

stride ) ; memset ( input ,
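How such a predicted sequence turns into a patch can be sketched as follows. The sketch assumes a context of three tokens on each side of the inserted tokens, which is read off this one example, and the source token list is hypothetical; it is not the post-processing code used in the thesis.

```python
from typing import List, Tuple


def split_prediction(pred: List[str], context: int = 3) -> Tuple[List[str], List[str], List[str]]:
    """Split a prediction into (left context, new tokens, right context)."""
    return pred[:context], pred[context:-context], pred[-context:]


def apply_insertion(source: List[str], left: List[str],
                    new: List[str], right: List[str]) -> List[str]:
    """Insert `new` wherever `left` is immediately followed by `right`."""
    if not left or not right:
        raise ValueError("both context sequences must be non-empty")
    patched: List[str] = []
    i = 0
    while i < len(source):
        if source[i:i + len(left) + len(right)] == left + right:
            patched.extend(left + new)  # keep the left context, add the new tokens
            i += len(left)              # the right context is copied by later iterations
        else:
            patched.append(source[i])
            i += 1
    return patched


prediction = "stride ) ; memset ( input ,".split()
left, new, right = split_prediction(prediction)

# Hypothetical vulnerable token sequence containing the pattern once.
source = "alloc ( stride ) ; ( input , 0 ) ;".split()
patched = apply_insertion(source, left, new, right)
```

Because the match is on token context rather than on a line number, the same prediction patches every occurrence of the pattern in the source.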

Discussion. Results from both of these experiments show that the use of pre-trained embeddings does not help improve the results compared to training the models from scratch, in terms of either model performance or training time. We argue that the code repairing and vulnerability repairing tasks are not similar enough for the embeddings to serve as a medium for transferring information that improves the training of the vulnerability repairing model. The second-phase experiments that use embeddings show only a slight increase in BLEU score and exact match compared with the vanilla pipeline, although the BLEU score reaching 30 in Table 5.2 is considered understandable according to [41]. The high perplexity in both phases of the experiments shows that the models are not certain in their predictions: the probability assigned to the correctly predicted token is not much larger than that of the other tokens.
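The link between perplexity and prediction certainty can be made concrete with a small sketch. It uses the standard definition of perplexity as the exponential of the mean negative log-probability assigned to the reference tokens; the probability values are invented for illustration and are not measurements from our experiments.

```python
import math
from typing import List


def perplexity(correct_token_probs: List[float]) -> float:
    """exp of the mean negative log-probability given to the reference tokens."""
    nll = [-math.log(p) for p in correct_token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that puts 0.9 on every reference token versus one that puts 0.3.
confident = perplexity([0.9, 0.9, 0.9, 0.9])
uncertain = perplexity([0.3, 0.3, 0.3, 0.3])
```

A token can still be predicted correctly with a probability of 0.3 if no other token scores higher, which is how token-level accuracy can stay reasonable while perplexity remains high.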

Discussions of the results

Code repairing is a young research field in which both template-based and generative-based methods are used to generate code patches automatically. Most researchers focus on generic bugs because large, easily accessible datasets are available, which deep learning networks [1] [42] can leverage to achieve significant results in the field. While a vulnerability in source code is also a type of bug, it is more difficult to detect and patch, because probing an application for security errors takes more time. As a result, labeled datasets of vulnerable source code are sparse compared to those of generic bugs, which limits the application of deep networks to vulnerability repairing. In this thesis, our main objective is to address this small-dataset problem by leveraging information learned from a larger dataset of programming languages as input for the downstream vulnerability repairing models. We tried using the embeddings extracted from CodeBERT and UnixCoder as input for a transformer-based network, which serves as the vulnerability repairing model, to generate patches for the vulnerability.

Our experimental results indicate that the embeddings used do not offer a significant improvement on the task. While conducting this research, we also found that other researchers have conducted the same experiments [43] on the task of vulnerability detection and reached the same conclusion as ours.

Main Contribution

The problem statement this thesis set out to answer was the following: how can the lack of labeled data be handled when applying deep learning to automatically generate patches for vulnerabilities? To answer it, we defined a set of objectives; achieving them lets us answer the problem statement and contribute to the research field. The objectives were defined as follows:

• What do we know about machine learning in vulnerability repair and code repairing?

• How effective are existing generative-based methods on the problems of code repairing and vulnerability repairing?

• Can code embedding extend the capabilities of these methods?

Objective 1 is achieved through our review of related work on code repairing using template-based and generative-based methods. Given the premise that a vulnerability is a type of bug, we conclude that most notable recent research on code repairing or vulnerability repairing focuses on learning the patterns in the dataset from the perspective of natural language, with the input represented either as lists of tokens or as an abstract syntax tree. Objective 2 is achieved through our replication of VRepair, in which a transformer-based model is trained to generate patches on a small dataset with an average accuracy of 50%, which is quite promising. For Objective 3, we then tried to improve the performance by using the embeddings extracted from CodeBERT and UnixCoder as a medium for transferring knowledge learned from a larger dataset to the vulnerability repairing task. However, the results show that the use of such embeddings brings no significant improvement.

Future works

Although the use of embeddings does not show any improvement in our research, the results of our experiments can be used as evidence that the vulnerability repairing task and the code understanding task are not closely related. However, given the complex nature of vulnerabilities, we can try to narrow the gap between the tasks by focusing on one type of vulnerability that most closely resembles a code understanding task. Another approach that can be considered is feature engineering on the vulnerability dataset using traditional machine learning methods. The justification is that the complex nature of vulnerabilities can be explored further through code representations such as the data flow suggested by GraphCodeBERT (Guo et al., 2021).

[1] Z. Chen et al., “Sequencer: Sequence-to-sequence learning for end-to-end program repair”, IEEE Transactions on Software Engineering, vol. 47, no. 09, pp. 1943–1959, Sep. 2021.

[2] J. Chi et al., “Seqtrans: Automatic vulnerability fix via sequence to sequence learning”, IEEE Transactions on Software Engineering, vol. 49, pp. 564–585, 2020.

[3] Z. Feng et al., “CodeBERT: A pre-trained model for programming and natural languages”, in Findings of the Association for Computational Linguistics: EMNLP 2020, Online, Nov. 2020, pp. 1536–1547.

[4] D. Guo et al., “UniXcoder: Unified cross-modal pre-training for code representation”, in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022, pp. 7212–7225.

[5] A. Nguyen-Duc et al., Software Business - 13th International Conference. Bolzano, Italy: Springer, Nov. 2022, vol. 463. [Online]. Available: https://doi.org/10.1007/978-3-031-20706-8.

[6] A. N. Duc et al., Fundamentals of Software Startups: Essential Engineering and Business Aspects. Springer International Publishing, 2020. [Online]. Available: https://www.springer.com/gp/book/9783030359829.

[7] C. L. Goues et al., “Automated program repair”, Communications of the ACM, vol. 62, no. 12, pp. 56–65, Nov. 21, 2019.

[8] R. K. Saha et al., “Elixir: Effective object-oriented program repair”, in 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), Urbana, IL, USA, Oct. 2017, pp. 648–659.

[9] H. Tian et al., “Evaluating representation learning of code changes for predicting patch correctness in program repair”, in Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, New York, NY, USA, Jan. 27, 2021, pp. 981–992.

[10] S. Zhang et al., “Deep learning based recommender system: A survey and new perspectives”, ACM Computing Surveys, vol. 52, no. 1, pp. 1–38, Jan. 31, 2020.

[11] M. Vasic et al. “Neural program repair by jointly learning to localize and repair”. arXiv preprint arXiv:1904.01720 (2019).

[12] L. Schramm, “Improving performance of automatic program repair using learned heuristics”, in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, New York, NY, USA, Aug. 21, 2017, pp. 1071–1073.

[13] E. Mashhadi and H. Hemmati, “Applying CodeBERT for automated program repair of java simple bugs”, in 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), Online, May 2021, pp. 505–509.

[14] Z. Chen et al., “Plur: A unifying, graph-based view of program learning, understanding, and repair”, Advances in Neural Information Processing

[15] Z. Chen et al., “Neural transfer learning for repairing security vulnerabilities in c code”, IEEE Transactions on Software Engineering, vol. 49, no. 1, pp. 147–165, 2023.

[16] A. Sherstinsky, “Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network”, Physica D: Nonlinear Phenomena, vol. 404, p. 132 306, Mar. 2020.

[17] A. Karpathy and J. Johnson. “Cs231n convolutional neural networks for visual recognition-transfer learning”. [Online]. Available: https://cs231n.github.io/transfer-learning/ (visited on 2023).

[18] N. E. Q. E. Pålsrud, “Exploring neural machine translation architectures for automated code repair”, M.S. thesis, University of Oslo, Norway, 2022.

[19] D. Tarlow et al. “Learning to fix build errors with graph2diff neural networks”. arXiv preprint arXiv:1911.01205 (2019).

[20] Y. Li et al. “Gated graph sequence neural networks”. arXiv preprint arXiv:1511.05493 (2017).

[21] Z. Tang et al., “Ast-transformer: Encoding abstract syntax trees efficiently for code summarization”, in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), Los Alamitos, CA, USA, Nov. 2021, pp. 1193–1195.

[22] Y.-F. Ma and M. Li, “The flowing nature matters: Feature learning from the control flow graph of source code for bug localization”, Mach. Learn., vol. 111, no. 3, pp. 853–870, Mar. 2022.

[23] K. Noda et al., “Sirius: Static program repair with dependence graph-based systematic edit patterns”, in 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), Los Alamitos, CA, USA, Oct. 2021, pp. 437–447.

[24] S. Ma et al., “VuRLE: Automatic vulnerability detection and repair by learning from examples”, in Computer Security – ESORICS 2017, S. N. Foley et al., Eds., vol. 10493, 2017, pp. 229–246.

[25] J.-R. Falleri et al., “Fine-grained and accurate source code differencing”, in Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, New York, NY, USA, Sep. 15, 2014, pp. 313–324.

[26] C. L. Goues et al., “Automated program repair”, Commun. ACM, vol. 62, no. 12, pp. 56–65, Nov. 2019.

[27] J.-R. Falleri et al., “Fine-grained and accurate source code differencing”, in Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, Vasteras, Sweden, 2014, pp. 313–324.

[28] R. Sennrich et al., “Neural machine translation of rare words with subword units”, in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, Aug.

[29] K. W. Church, “Word2vec”, Natural Language Engineering, vol. 23, no. 1, pp. 155–162, 2017.

[30] K. Liu et al., “TBar: Revisiting template-based automated program repair”, in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, New York, NY, USA, Jul. 10, 2019, pp. 31–42.

[31] M. Tufano et al., “An empirical investigation into learning bug-fixing patches in the wild via neural machine translation”, in 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), New York, NY, USA, 2018, pp. 832–837.

[32] S. E. Ponta et al., “A manually-curated dataset of fixes to vulnerabilities of open-source software”, in Proceedings of the 16th International Conference on Mining Software Repositories, Montreal, Quebec, Canada, pp. 383–387.

[33] J. Guo et al., “A deep look into neural ranking models for information retrieval”, Information Processing & Management, vol. 57, no. 6, p. 102 067, 2020.

[34] J. Fan et al., “A c/c++ code vulnerability dataset with code changes and cve summaries”, in Proceedings of the 17th International Conference on Mining Software Repositories, Seoul, Republic of Korea, 2020, pp. 508–512.

[35] G. Bhandari et al., “CVEfixes: Automated collection of vulnerabilities and their fixes from open-source software”, in Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering, Athens, Greece, Aug. 2021.

[36] G. Klein et al., “OpenNMT: Open-source toolkit for neural machine translation”, in Proceedings of ACL 2017, System Demonstrations, Vancouver, Canada, Jul. 2017, pp. 67–72.

[37] T. Luong et al., “Addressing the rare word problem in neural machine translation”, in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, Jul. 2015, pp. 11–19.

[38] K. Blagec et al. “A global analysis of metrics used for measuring performance in natural language processing”. arXiv:2204.11574 (2022).

[39] A. Paszke et al. “Pytorch: An imperative style, high-performance deep learning library”. arXiv preprint arXiv:1912.01703 (2019).

[40] T. Wolf et al., “Transformers: State-of-the-art natural language processing”, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, Oct. 2020, pp. 38–45.

[41] Google. “Evaluating models”. [Online]. Available: https://cloud.google.com/translate/automl/docs/evaluate (visited on 2023).

[42] Z. Li et al., “SySeVR: A framework for using deep learning to detect software vulnerabilities”, in IEEE Transactions on Dependable and Secure Computing, vol. 19, Los Alamitos, CA, USA, Jul. 2022, pp. 2244–2258.

[43] Y. Choi et al., “Learning sequential and structural information for source code summarization”, in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, Aug. 2021, pp. 2842–2851.

Hyperparameter             Value
src_vocab_size             2000
tgt_vocab_size             2000
src_seq_length             Varied
tgt_seq_length             100
batch_size                 4
valid_batch_size           1
train_steps                Varied
valid_steps                Varied
save_checkpoint_steps      valid_steps
early_stopping             2
early_stopping_criteria    accuracy
keep_checkpoint            3
optim                      adam
learning_rate              Varied
learning_rate_decay        0.9
label_smoothing            0.1
param_init                 0
param_init_glorot          true
encoder_type               transformer
decoder_type               transformer
enc_layers                 3
dec_layers                 3
heads                      4
rnn_size                   Varied
word_vec_size              rnn_size
transformer_ff             1024
dropout                    0.1
attention_dropout          0.1
copy_attn                  true
position_encoding          true

Table 6.1: Complete set of hyperparameters used in our models built by OpenNMT-py
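As an illustration of how the "Varied" rows of Table 6.1 are filled for one run, the sketch below builds the hyperparameter set as a Python dictionary and prints it as `key: value` lines. The fixed values come from the table and the varied ones from the second-phase setting in Section 5.6, except `valid_steps=1000`, which is a hypothetical placeholder because its value is not stated. The helpers `build_config` and `render` are illustrative names, and the output covers only the hyperparameters, not the data and vocabulary paths a full OpenNMT-py configuration also needs.

```python
from typing import Dict, Union

Value = Union[int, float, str, bool]


def build_config(src_seq_length: int, train_steps: int, valid_steps: int,
                 learning_rate: float, rnn_size: int) -> Dict[str, Value]:
    """Fill the 'Varied' rows of Table 6.1; every other row is fixed."""
    return {
        "src_vocab_size": 2000, "tgt_vocab_size": 2000,
        "src_seq_length": src_seq_length, "tgt_seq_length": 100,
        "batch_size": 4, "valid_batch_size": 1,
        "train_steps": train_steps, "valid_steps": valid_steps,
        "save_checkpoint_steps": valid_steps,  # tied to valid_steps in the table
        "early_stopping": 2, "early_stopping_criteria": "accuracy",
        "keep_checkpoint": 3, "optim": "adam",
        "learning_rate": learning_rate, "learning_rate_decay": 0.9,
        "label_smoothing": 0.1, "param_init": 0, "param_init_glorot": True,
        "encoder_type": "transformer", "decoder_type": "transformer",
        "enc_layers": 3, "dec_layers": 3, "heads": 4,
        "rnn_size": rnn_size,
        "word_vec_size": rnn_size,  # tied to rnn_size in the table
        "transformer_ff": 1024, "dropout": 0.1, "attention_dropout": 0.1,
        "copy_attn": True, "position_encoding": True,
    }


def render(config: Dict[str, Value]) -> str:
    """Render the dictionary as 'key: value' lines with lowercase booleans."""
    lines = []
    for key, value in config.items():
        text = str(value).lower() if isinstance(value, bool) else str(value)
        lines.append(f"{key}: {text}")
    return "\n".join(lines)


# valid_steps=1000 is a hypothetical placeholder, not a value from the thesis.
config = build_config(src_seq_length=2000, train_steps=20000, valid_steps=1000,
                      learning_rate=0.0005, rnn_size=768)
print(render(config))
```

Tying `word_vec_size` to `rnn_size` and `save_checkpoint_steps` to `valid_steps` inside one function keeps the two dependent rows of the table consistent whenever a varied value changes.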

Date posted: 30/07/2024, 16:59
