Application of machine learning on automatic program repair of security vulnerabilities

DOCUMENT INFORMATION

Basic information

Title	Application Of Machine Learning On Automatic Program Repair Of Security Vulnerabilities
Author	Nguyen Ngoc Hai Dang
Supervisors	Assoc. Prof. Dr. Huynh Tuong Nguyen, Assoc. Prof. Dr. Quan Thanh Tho
University	Ho Chi Minh City University of Technology
Major	Computer Science
Type	master’s thesis
Year of publication	2023
City	Ho Chi Minh City

Format
Pages	61
File size	0.97 MB

Content

VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

NGUYEN NGOC HAI DANG

APPLICATION OF MACHINE LEARNING ON AUTOMATIC PROGRAM REPAIR OF SECURITY VULNERABILITIES

Major: Computer Science
Major code: 8480101

MASTER’S THESIS

HO CHI MINH CITY, July 2023

THIS THESIS IS COMPLETED AT HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM

Supervisor: Assoc. Prof. Dr. Huynh Tuong Nguyen, Assoc. Prof. Dr. Quan Thanh Tho
Examiner 1: Dr. Truong Tuan Anh
Examiner 2: Assoc. Prof. Dr. Nguyen Van Vu

This master’s thesis is defended at HCM City University of Technology, VNU-HCM City on July 11, 2023.

Master’s Thesis Committee:
(Please write down full name and academic rank of each member of the Master’s Thesis Committee)
Chairman: Assoc. Prof. Dr. Le Hong Trang
Secretary: Dr. Phan Trong Nhan
Reviewer 1: Dr. Truong Tuan Anh
Reviewer 2: Assoc. Prof. Dr. Nguyen Van Vu
Member: Assoc. Prof. Dr. Nguyen Tuan Dang

Approval of the Chairman of Master’s Thesis Committee and Dean of Faculty of Computer Science and Engineering after the thesis has been corrected (if any).

CHAIRMAN OF THESIS COMMITTEE
HEAD OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

THE TASK SHEET OF MASTER’S THESIS

Full name: Nguyen Ngoc Hai Dang
Student ID: 1970513
Date of birth: 24/11/1997
Place of birth: Lam Dong
Major: Computer Science
Major ID: 8480101

I. THESIS TITLE: ỨNG DỤNG HỌC MÁY VÀO CHƯƠNG TRÌNH TỰ ĐỘNG SỬA CHỮA LỖ HỔNG BẢO MẬT - APPLICATION OF MACHINE LEARNING ON AUTOMATIC PROGRAM REPAIR OF SECURITY VULNERABILITIES
II. TASKS AND CONTENTS:
- Research and build a system to automatically repair vulnerabilities
- Research and propose methods to improve the accuracy of the model
- Experiment and evaluate the results of the proposed methods
III. THESIS START DAY: 05/09/2022
IV. THESIS COMPLETION DAY: 09/06/2023
V. SUPERVISOR: Assoc. Prof. Dr. Huynh Tuong Nguyen, Assoc. Prof. Dr. Quan Thanh Tho

Ho Chi Minh City, date ………

SUPERVISOR (Full name and signature)
CHAIR OF PROGRAM COMMITTEE (Full name and signature)
DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING (Full name and signature)

Acknowledgement

I would like to acknowledge the people who have helped me with their knowledge, encouragement, and patience during the work of this thesis. The thesis would not have been completed without your help and inspiration.

First and foremost, I would like to thank my supervisor at Ho Chi Minh City University of Technology, Professor Quan Thanh Tho. Thank you for your unwavering support. Your insightful feedback and contributions have pushed and guided me throughout the work of this thesis.

I would also like to thank my other supervisor at the Norwegian University of Science and Technology, Professor Nguyen Duc Anh. Thank you for your feedback and help.

Lastly, I would like to thank my friends and family for their endless patience, support, and encouragement.

Abstract

We have, as individuals and as a society, become increasingly dependent on software; thus, the consequences of failing software have also become greater. Identifying the failing parts of the software and fixing them manually can be time-consuming, expensive, and frustrating. The growing research field of automated code repair aims to tackle this problem by applying machine learning techniques to repair software automatically. With the abundance of data on bugs and patches, research on the use of deep learning in code repair has been on the rise and has proven effective, with many systems [1] [2] achieving state-of-the-art performance. However, the approach depends on a large dataset, and this condition cannot be met for all types of bugs in applications. One such type is the vulnerability, which attackers target for security exploitation to cause great harm to organizations that use the applications. Therefore, the need to automatically identify and fix vulnerabilities is clear, and doing so can significantly reduce the harm caused to these organizations.

In our work, we focus on the application of deep learning to vulnerability repair and experiment with a solution for handling the lack of data, which deep learning models require to be applied effectively, through the use of embeddings extracted from large language models such as CodeBERT [3] and UnixCoder [4]. Although our results show that such an approach does not bring significant improvement, they can be used by other researchers to gain more insight into the proximity between the repair tasks of different types of bugs.

Thesis summary (translated from the Vietnamese "Tóm tắt luận văn")

As individuals and as a society, we are becoming more and more dependent on software, so the consequences of software failures are also growing. Locating the faulty parts of the software and fixing them by hand is time-consuming, costly, and frustrating. The research field of automated code repair is developing to solve this problem by applying machine learning techniques to repair software automatically. With abundant data on bugs and patches, research on the use of deep learning in code repair keeps increasing and has proven effective, with the appearance of many systems [1] [2] that reach state-of-the-art performance. However, this approach relies on a large dataset, and not every type of bug in applications can meet this condition. Among them are security vulnerabilities, which attackers target for exploitation in order to harm the organizations that use the applications containing them. The need to automatically identify and fix vulnerabilities is therefore evident, and doing so can considerably reduce the damage to businesses.

In this thesis, we focus on applying deep learning to the repair of security vulnerabilities and experiment with a solution for handling the shortage of data, which the models require to become effective, by using embeddings extracted from large language models such as CodeBERT [3] and UnixCoder [4]. Although the results show that this approach does not bring significant improvement, other researchers can use them to better understand the distance between the repair tasks of different types of bugs.

Declaration

I, Nguyen Ngoc Hai Dang, declare that this thesis, with the Vietnamese title “Ứng dụng học máy vào chương trình tự động sửa chữa lỗ hổng bảo mật” and the English title “Application of machine learning on automatic program repair of security vulnerabilities”, is my own work and contains no material that has been submitted previously, in whole or in part, for the award of any other academic degree or diploma.

Signature
Nguyen Ngoc Hai Dang

Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Research Questions
  1.4 Thesis Outline
2 Literature review
  2.1 Background on Neural Network and Deep Learning
    2.1.1 Recurrent Neural Network (RNN)
    2.1.2 Vanilla recurrent neural network
    2.1.3 Long short-term memory network (LSTM)
    2.1.4 Transformer Neural Network
  2.2 Transfer Learning
  2.3 Learning Paradigm
    2.3.1 Sequence to Sequence Learning
    2.3.2 Graphs-based Learning
    2.3.3 Tree-to-tree Learning
  2.4 Bug Repairing and Vulnerabilities Repairing
  2.5 Source code Representation
    2.5.1 GumTree
    2.5.2 Byte Pair Encoding
  2.6 Source code embeddings
    2.6.1 CodeBERT
    2.6.2 UnixCoder
3 The state of the art program repair approaches
  3.1 Template-based approach
  3.2 Generative-based approach
    3.2.1 SeqTrans
    3.2.2 VRepair
4 Proposed Methods
5 Experiments and Results
  5.1 Datasets
      Validation method
  5.2 Metrics of performance
  5.3 Preprocessing the code as plain text
  5.4 Extracting embeddings from large language models for code
  5.5 Environment
  5.6 Results
6 Discussions and Conclusion
  6.1 Discussions of the results
  6.2 Main Contribution
  6.3 Future works
References
Appendix

List of Figures

2.1 The basic architecture of recurrent neural network
2.2 Recurrent Neural Network design patterns
2.3 LSTM network with three repeating layers
2.4 Attention-integrated recurrent network
2.5 The encoder-decoder architecture of transformer
2.6 Attention head operations
2.7 Dataset used for CodeBERT
2.8 CodeBERT architecture for replaced tokens detection task
2.9 A Python code with its comment and AST
2.10 Input for contrastive learning task of UnixCoder
3.1 Workflow of VuRLE
3.2 Architecture of SeqTrans
3.3 Input of SeqTrans
3.4 Normalized code segment
3.5 The VRepair pipeline
4.1 Design of our pipeline
5.1 Sample of buggy code and its patch
5.2 Input sequence
5.3 Output sequence
5.4 Syntax of the output sequence

Date posted: 05/12/2023, 23:32

References

[1] Z. Chen et al., “SequenceR: Sequence-to-sequence learning for end-to-end program repair”, IEEE Transactions on Software Engineering, vol. 47, no. 9, pp. 1943–1959, Sep. 2021

[2] J. Chi et al., “Seqtrans: Automatic vulnerability fix via sequence to sequence learning”, IEEE Transactions on Software Engineering, vol. 49, pp. 564–585, 2020

[3] Z. Feng et al., “CodeBERT: A pre-trained model for programming and natural languages”, in Findings of the Association for Computational Linguistics: EMNLP 2020, Online, Nov. 2020, pp. 1536–1547

[4] D. Guo et al., “UniXcoder: Unified cross-modal pre-training for code representation”, in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022

[5] A. Nguyen-Duc et al., Software Business - 13th International Conference. Bolzano, Italy: Springer, Nov. 2022, vol. 463. [Online]. Available: https://doi.org/10.1007/978-3-031-20706-8

[6] A. N. Duc et al., Fundamentals of Software Startups: Essential Engineering and Business Aspects. Springer International Publishing, 2020. [Online]. Available: https://www.springer.com/gp/book/9783030359829

[7] C. L. Goues et al., “Automated program repair”, Communications of the ACM, vol. 62, no. 12, pp. 56–65, Nov. 21, 2019

[8] R. K. Saha et al., “Elixir: Effective object-oriented program repair”, in 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), Urbana, IL, USA, Oct. 2017, pp. 648–659

[9] H. Tian et al., “Evaluating representation learning of code changes for predicting patch correctness in program repair”, in Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, New York, NY, USA, Jan. 27, 2021, pp. 981–992

[10] S. Zhang et al., “Deep learning based recommender system: A survey and new perspectives”, ACM Computing Surveys, vol. 52, no. 1, pp. 1–38, Jan. 31, 2020

[11] M. Vasic et al. “Neural program repair by jointly learning to localize and repair”. arXiv preprint arXiv:1904.01720. (2019)

[12] L. Schramm, “Improving performance of automatic program repair using learned heuristics”, in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, New York, NY, USA, Aug. 21, 2017, pp. 1071–1073

[13] E. Mashhadi and H. Hemmati, “Applying CodeBERT for automated program repair of java simple bugs”, in 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), Online, May 2021, pp. 505–509

[14] Z. Chen et al., “Plur: A unifying, graph-based view of program learning, understanding, and repair”, Advances in Neural Information Processing Systems, vol. 34, pp. 23 089–23 101, 2021

[15] Z. Chen et al., “Neural transfer learning for repairing security vulnerabilities in c code”, IEEE Transactions on Software Engineering, vol. 49, no. 1, pp. 147–165, 2023

[16] A. Sherstinsky, “Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network”, Physica D: Nonlinear Phenomena, vol. 404, p. 132 306, Mar. 2020

[17] A. Karpathy and J. Johnson. “CS231n convolutional neural networks for visual recognition - transfer learning”. [Online]. Available: https://cs231n.github.io/transfer-learning/ (visited on 2023)

[18] N. E. Q. E. Pålsrud, “Exploring neural machine translation architectures for automated code repair”, M.S. thesis, University of Oslo, Norway, 2022

[20] Y. Li et al. “Gated graph sequence neural networks”. arXiv preprint arXiv:1511.05493. (2017)

[21] Z. Tang et al., “Ast-transformer: Encoding abstract syntax trees efficiently for code summarization”, in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), Los Alamitos, CA, USA, Nov. 2021, pp. 1193–1195