VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

NGUYEN NGOC HAI DANG

APPLICATION OF MACHINE LEARNING ON AUTOMATIC PROGRAM REPAIR OF SECURITY VULNERABILITIES

Major: Computer Science
Major code: 8480101

MASTER’S THESIS

HO CHI MINH CITY, July 2023

THIS THESIS IS COMPLETED AT HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY - VNU-HCM

Supervisor: Assoc. Prof. Dr. Huynh Tuong Nguyen, Assoc. Prof. Dr. Quan Thanh Tho
Examiner 1: Dr. Truong Tuan Anh
Examiner 2: Assoc. Prof. Dr. Nguyen Van Vu

This master’s thesis is defended at HCM City University of Technology, VNU-HCM City on July 11, 2023.

Master’s Thesis Committee: (Please write down full name and academic rank of each member of the Master’s Thesis Committee)
Chairman: Assoc. Prof. Dr. Le Hong Trang
Secretary: Dr. Phan Trong Nhan
Reviewer 1: Dr. Truong Tuan Anh
Reviewer 2: Assoc. Prof. Dr. Nguyen Van Vu
Member: Assoc. Prof. Dr. Nguyen Tuan Dang

Approval of the Chairman of Master’s Thesis Committee and Dean of Faculty of Computer Science and Engineering after the thesis has been corrected (if any)

CHAIRMAN OF THESIS COMMITTEE
HEAD OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

THE TASK SHEET OF MASTER’S THESIS

Full name: Nguyen Ngoc Hai Dang
Student ID: 1970513
Date of birth: 24/11/1997
Place of birth: Lam Dong
Major: Computer Science
Major ID: 8480101

I. THESIS TITLE: ỨNG DỤNG HỌC MÁY VÀO CHƯƠNG TRÌNH TỰ ĐỘNG SỬA CHỮA LỖ HỔNG BẢO MẬT - APPLICATION OF MACHINE LEARNING ON AUTOMATIC PROGRAM REPAIR OF SECURITY VULNERABILITIES

II. TASKS AND CONTENTS:
- Research and build a system to automatically repair vulnerabilities
- Research and propose methods to improve the accuracy of the model
- Experiment and evaluate the results of the proposed methods

III. THESIS START DAY: 05/09/2022

IV. THESIS COMPLETION DAY: 09/06/2023

V. SUPERVISOR: Assoc. Prof. Dr. Huynh Tuong Nguyen, Assoc. Prof. Dr. Quan Thanh Tho

Ho Chi Minh City, date ………

SUPERVISOR (Full name and signature)
CHAIR OF PROGRAM COMMITTEE (Full name and signature)
DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING (Full name and signature)

Acknowledgement

I would like to acknowledge the people who have helped me with their knowledge, encouragement, and patience during the work of this thesis. The thesis would not have been completed without your help and inspiration.

First and foremost, I would like to thank my supervisor at Ho Chi Minh City University of Technology, Professor Quan Thanh Tho. Thank you for your unwavering support. Your insightful feedback and contributions have pushed and guided me throughout the work of this thesis.

I would also like to thank my other supervisor at the Norwegian University of Science and Technology, Professor Nguyen Duc Anh. Thank you for your feedback and help.

Lastly, I would like to thank my friends and family for their endless patience, support, and encouragement.

Abstract

We have, as individuals and as a society, become increasingly dependent on software, so the consequences of failing software have also become greater. Identifying the failing parts of the software and fixing them manually can be time-consuming, expensive, and frustrating. The growing research field of automated code repair aims to tackle this problem by applying machine learning techniques to repair software in an automated fashion. With the abundance of data on bugs and patches, research on deep learning for code repair has been on the rise and has proven effective, with many systems [1] [2] achieving state-of-the-art performance. However, this approach requires a large dataset, and that condition cannot be met for every type of bug in applications. One such type is the vulnerability, which attackers exploit to cause great harm to the organizations that use the applications.
Therefore, there is a clear need to automatically identify and fix vulnerabilities, which could significantly reduce the harm caused to these organizations. In our work, we focus on applying deep learning to vulnerability repair and experiment with a way to handle the lack of data, which deep learning models need in order to be applied effectively, by using embeddings extracted from large language models such as CodeBERT [3] and UnixCoder [4]. Although our results show that this approach does not bring significant improvement, other researchers can use them to gain more insight into the proximity between the repair tasks for different types of bugs.

Tóm tắt luận văn (Thesis Summary)

We, as individuals and as a society, are becoming increasingly dependent on software, so the consequences of software failure are growing as well. Identifying the faulty parts of software and fixing them manually can be time-consuming, costly, and frustrating. The research field of automated code repair has developed to address this problem by applying machine learning techniques to repair software automatically. With abundant data on bugs and patches, research on the use of deep learning for code repair has grown and has proven effective, with many systems [1] [2] achieving state-of-the-art performance. However, this approach relies on large datasets, and not every type of bug in applications meets this condition. Among these are security vulnerabilities, which attackers exploit to harm the organizations that use the vulnerable applications. The need to automatically identify and fix vulnerabilities is therefore evident, and meeting it could significantly reduce the damage to businesses. In this thesis, we focus on applying deep learning to repairing security vulnerabilities and experiment with a solution for handling the lack of data, which models require to be effective, by using embeddings extracted from large language models such as CodeBERT [3] and UnixCoder [4]. Although the results show that this approach does not bring significant improvement, other researchers can use them to better understand the distance between the repair tasks for different types of bugs.

Declaration

I, Nguyen Ngoc Hai Dang, declare that this thesis, with the Vietnamese title "Ứng dụng học máy vào chương trình tự động sửa chữa lỗ hổng bảo mật" and the English title "Application of machine learning on automatic program repair of security vulnerabilities", is my own work and contains no material that has been submitted previously, in whole or in part, for the award of any other academic degree or diploma.

Signature
Nguyen Ngoc Hai Dang

Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Research Questions
  1.4 Thesis Outline
2 Literature review
  2.1 Background on Neural Network and Deep Learning
    2.1.1 Recurrent Neural Network (RNN)
    2.1.2 Vanilla recurrent neural network
    2.1.3 Long short-term memory network (LSTM)
    2.1.4 Transformer Neural Network
  2.2 Transfer Learning
  2.3 Learning Paradigm
    2.3.1 Sequence to Sequence Learning
    2.3.2 Graphs-based Learning
    2.3.3 Tree-to-tree Learning
  2.4 Bug Repairing and Vulnerabilities Repairing
  2.5 Source code Representation
    2.5.1 GumTree
    2.5.2 Byte Pair Encoding
  2.6 Source code embeddings
    2.6.1 CodeBERT
    2.6.2 UnixCoder
3 The state of the art program repair approaches
  3.1 Template-based approach
  3.2 Generative-based approach
    3.2.1 SeqTrans
    3.2.2 VRepair
4 Proposed Methods
5 Experiments and Results
  5.1 Datasets
    Validation method
  5.2 Metrics of performance
  5.3 Preprocessing the code as plain text
  5.4 Extracting embeddings from large language models for code
  5.5 Environment
  5.6 Results
6 Discussions and Conclusion
  6.1 Discussions of the results
  6.2 Main Contribution
  6.3 Future works
References
Appendix

List of Figures

2.1 The basic architecture of recurrent neural network
2.2 Recurrent Neural Network design patterns
2.3 LSTM network with three repeating layers
2.4 Attention-integrated recurrent network
2.5 The encoder-decoder architecture of transformer
2.6 Attention head operations
2.7 Dataset used for CodeBERT
2.8 CodeBERT architecture for replaced tokens detection task
2.9 A Python code with its comment and AST
2.10 Input for contrastive learning task of UnixCoder
3.1 Workflow of VuRLE
3.2 Architecture of SeqTrans
3.3 Input of SeqTrans
3.4 Normalized code segment
3.5 The VRepair pipeline
4.1 Design of our pipeline
5.1 Sample of buggy code and its patch
5.2 Input sequence
5.3 Output sequence
5.4 Syntax of the output sequence