Handling imbalanced data in software defect prediction

UNIVERSITY OF DA NANG
UNIVERSITY OF EDUCATION

Lê Song Tồn

HANDLING IMBALANCED DATA IN SOFTWARE DEFECT PREDICTION

MASTER'S THESIS IN INFORMATION SYSTEMS

Major: Information Systems
Code: 848.01.04
Scientific supervisor: Assoc. Prof. Dr. Nguyễn Thanh Bình

Da Nang, 2020

TABLE OF CONTENTS

DECLARATION
THESIS INFORMATION PAGE
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
  Rationale for the topic
  Research objectives
  Research subject and scope
  Research methods
  Scientific and practical significance of the topic
  Structure of the thesis
Chapter 1. OVERVIEW OF SOFTWARE DEFECT PREDICTION
  1.1 Software defects
  1.2 Software metrics
    1.2.1 Source code metrics
    1.2.2 Process metrics
    1.2.3 Evaluating algorithms
  1.3 Software defect prediction
  1.4 Software defect prediction models
  1.5 Imbalanced data in the defect prediction problem
  1.6 Chapter conclusion
Chapter 2. DATA SAMPLING TECHNIQUES
  2.1 Concept and necessity of data sampling
    2.1.1 Concept
    2.1.2 The necessity of data sampling
    2.1.3 Random Undersampling
  2.2 Random Oversampling
  2.3 SMOTE
  2.4 Chapter summary
Chapter 3. SAMPLING TECHNIQUES FOR HANDLING IMBALANCED DATA IN SOFTWARE DEFECT PREDICTION
  3.1 The data sampling problem in software defect prediction
  3.2 Evaluation criteria for the experiments
  3.3 Experimental data
  3.4 Experimental setup
  3.5 Experimental results
CONCLUSION AND FUTURE WORK
REFERENCES
DECISION ASSIGNING THE THESIS TOPIC (COPY)
THESIS REVISION REPORT
MINUTES OF THE DEFENSE COMMITTEE MEETING
COMMENTS OF THE TWO REVIEWERS

LIST OF ABBREVIATIONS

ANN: Artificial Neural Networks
CK: Chidamber-Kemerer Metrics
CK OO: Chidamber-Kemerer Object-Oriented Metrics
FN: False Negative
Forest: Random Forest
FP: False Positive
KNN: K-nearest Neighbors
LR: Logistic Regression
MLP: Multilayer Perceptron
NB: Naïve Bayes
OO: Object-Oriented
RUS: Random Undersampling
ROS: Random Oversampling
SDP: Software Defect Prediction
SMOTE: Synthetic Minority Over-sampling Technique
SVM: Support Vector Machine
TN: True Negative
TP: True Positive
Tree: Decision Tree

LIST OF TABLES

Table 1.1 Halstead metrics
Table 1.2 Description of the Halstead metrics
Table 1.3 Chidamber & Kemerer (CK) metrics
Table 3.1 Summary of the 13 highly imbalanced datasets used in the experiments
Table 3.2 The 13 datasets after sampling with Undersampling
Table 3.3 The 13 datasets after sampling with Oversampling
Table 3.4 The 13 datasets after sampling with SMOTE
Table 3.5 Mean F1-score on the imbalanced datasets Class, CM1, JM1, KC1, KC2, KC3, with and without the sampling techniques
Table 3.6 Mean F1-score on the imbalanced datasets MC1, MC2, MW1, PC1, PC2, PC3, PC4, with and without the sampling techniques
Table 3.7 Assessment of the effectiveness of the sampling techniques

LIST OF FIGURES

Figure 1.1 ROC curve
Figure 2.1 Illustration of the Undersampling algorithm
Figure 2.2 Contents of the sample file jm1.arff
Figure 2.3 Defective / non-defective ratio of the data in jm1.arff
Figure 2.4 Chart of the defective / non-defective ratio in the dataset
Figure 2.5 Computing the number of samples before and after balancing
Figure 2.6 Performing the sampling and exporting the balanced data file
Figure 2.7 Chart of the defective / non-defective sample ratio after balancing
Figure 2.8 Illustration of data under the Oversampling algorithm [43]
Figure 2.9 Computing the number of samples before and after balancing, and running the sampling technique
Figure 2.10 Chart of the defective / non-defective sample ratio after balancing with Random Oversampling
Figure 2.11 Illustration of the SMOTE algorithm [33]
Figure 2.12 Chart of the balanced ratio between defective and non-defective samples
Figure 3.1 Class distribution of the CLASS dataset before and after sampling
Figure 3.2 Class distribution of the CM1 dataset before and after sampling
Figure 3.3 Class distribution of the JM1 dataset before and after sampling
Figure 3.4 Class distribution of the KC1 dataset before and after sampling
Figure 3.5 Class distribution of the KC2 dataset before and after sampling
Figure 3.6 Class distribution of the KC3 dataset before and after sampling
Figure 3.7 Class distribution of the MC1 dataset before and after sampling
Figure 3.8 Class distribution of the MC2 dataset before and after sampling
Figure 3.9 Class distribution of the MW1 dataset before and after sampling
Figure 3.10 Class distribution of the PC1 dataset before and after sampling
Figure 3.11 Class distribution of the PC2 dataset before and after sampling
Figure 3.12 Class distribution of the PC3 dataset before and after sampling
Figure 3.13 Class distribution of the PC4 dataset before and after sampling
Figure 3.14 ROC curves on the CLASS dataset for the algorithms used in method-level defect prediction. The prediction models use the data before sampling (top left), after Undersampling (top right), after Oversampling (bottom left), and after SMOTE (bottom right)

REFERENCES

[1] "Gartner says worldwide IT spending on pace to reach $3.8 trillion in 2014," February 2016. [Online]. Available: http://www.gartner.com/newsroom/id/2643919
[2] O. F. Arar and K. Ayan, "Software defect prediction using cost-sensitive neural network," Applied Soft Computing, vol. 33, no. C, pp. 263-277, August 2015.
[3] B. S. Ainapure, "Software Testing & Quality Assurance," 1st ed. India: Technical Publications, 2014.
[4] S. Huda, "An ensemble oversampling model for class imbalance problem in software defect prediction," IEEE Access, 2018.
[5] J. M. Bieman, "Software Metrics: A Rigorous & Practical Approach," IBM Systems Journal, vol. 36, no. 4, p. 594, 1997.
[6] Jaechang Nam, "Survey on Software Defect Prediction," 2009.
[7] Rafa Al-Qutaish and Alain Abran, "Halstead Metrics: Analysis of their Design," 2010.
[8] T. Hariprasad, G. Vidhyagaran, K. Seenu, Chandrasegar Thirumalai, "Software Complexity Analysis Using Halstead Metrics," International Conference on Trends in Electronics and Informatics (ICEI), May 2017.
[9] T. J. McCabe, "A Complexity Measure," IEEE Transactions on Software Engineering, vol. SE-2, no. 4, pp. 308-320, Dec. 1976.
[10] Maurice H. Halstead, "Elements of Software Science," Amsterdam: Elsevier North-Holland, 1977.
[11] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, June 1994.
[12] Mei-Huei Tang, Ming-Hung Kao, Mei-Hwa Chen, "An empirical study on object-oriented metrics," Proceedings Sixth International Software Metrics Symposium (Cat. No. PR00403), pp. 242-249, Nov. 1999.
[13] Ramanath Subramanyam and M. S. Krishnan, "Empirical Analysis of CK Metrics for Object-Oriented Design Complexity: Implications for Software Defects," IEEE Transactions on Software Engineering, vol. 29, no. 4, pp. 297-310, 2003.
[14] K. K. Aggarwal, Yogesh Singh, Arvinder Kaur, Ruchika Malhotra, "Empirical Study of Object-Oriented Metrics," School of Information Technology, GGS Indraprastha University, Delhi 110006, India, vol. 5, no. 8, pp. 149-173, November-December 2006.
[15] Ermiyas Birihanu Belachew, Feidu Akmel Gobena and Shumet Tadesse Nigatu, "Analysis of Software Quality Using Software Metrics," International Journal on Computational Science & Applications (IJCSA), vol. 8, October 2018.
[16] Rüdiger Lincke, Jonas Lundberg and Welf Löwe, "Comparing Software Metrics Tools," International Symposium on Software Testing and Analysis - ISSTA '08, 2008.
[17] Marco D'Ambros, Michele Lanza, Romain Robbes, "Evaluating Defect Prediction Approaches: A Benchmark and an Extensive Comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531-577, 2012.
[18] R. Akbani, S. Kwek, and N. Japkowicz, "Applying Support Vector Machines to Imbalanced Datasets," in Proceedings of 15th European Conference on Machine Learning, pp. 39-50, 2004.
[19] El-Emam, K., Benlarbi, S., Goel, N., Rai, S., "A Validation of Object-Oriented Metrics," National Research Council of Canada, NRC/ERB-1063.
[20] K. O. Elish and M. O. Elish, "Predicting Defect-Prone Software Modules Using Support Vector Machines," J. Syst. Softw., vol. 81, pp. 649-660, 2008.
[21] T. J. McCabe and C. W. Butler, "Design complexity measurement and testing," Communications of the ACM, vol. 32, no. 12, December 1989.
[22] Le Hoang Son, Nakul Pritam, Manju Khari, Raghvendra Kumar, Pham Thi Minh Phuong, Pham Huy Thong, "Empirical Study of Software Defect Prediction: A Systematic Mapping," Symmetry, vol. 11, 212, 13 February 2019.
[23] Cagatay Rukshan Catal, Banu Diri, "A systematic review of software fault prediction studies," Expert Systems with Applications, vol. 36, no. 4, pp. 7346-7354, May 2009.
[24] Catal, C., and Diri, B., Expert Systems with Applications, vol. 36, no. 4, pp. 7346-7354, 2009.
[25] Menzies, T., Krishna, R., Pryor, D., "The Promise Repository of Empirical Software Engineering Data," 2015.
[26] Mrinal Singh Rawat, Sanjay Kumar Dubey, "Software Defect Prediction Models for Quality Improvement: A Literature Study," IJCSI International Journal of Computer Science Issues, vol. 9, no. 5.
[27] Victoria López, Alberto Fernández, Salvador García, Vasile Palade, Francisco Herrera, "An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics," Information Sciences, vol. 250, pp. 113-141, 2013.
[28] T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, and Y. Jiang, "Implications of ceiling effects in defect predictors," in The 4th International Workshop on Predictor Models in Software Engineering, pp. 47-54, 2008.
[29] Charles X. Ling and Chenghui Li, "Data mining for direct marketing: Problems and solutions," KDD, vol. 98, 1998.
[30] Sebastián Maldonado, Julio López, Carla Vairetti, "An alternative SMOTE oversampling strategy for high-dimensional datasets," Soft Computing Journal, vol. 76, pp. 380-389, 2019.
[31] Tom Fawcett, "https://www.svds.com/," 25 August 2016. [Online]. Available: https://www.svds.com/learning-imbalancedclasses/?fbclid=IwAR274sviaYgr5TmuL3NWoaPeHaBypFe_J4_QUinYsqSwtvjuFUI4dsNp2s
[32] Nitesh V. Chawla, Francisco Herrera, Salvador Garcia, Alberto Fernandez, "SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary," Journal of Artificial Intelligence Research, pp. 863-905, 2018.
[33] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, W. Philip Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, pp. 321-357, 2002.
[34] Nguyễn Thị Lan Anh, "Thuật toán HMU toán phân lớp liệu cân bằng," Tạp chí Khoa học Giáo dục, Trường Đại học Sư phạm Huế, pp. 101-108, 2017.
[35] H. He and E. A. Garcia, "Learning from Imbalanced Data," IEEE Trans. Knowl. Data Eng., vol. 21, pp. 1263-1284, 2009.
[36] G. M. Weiss and F. Provost, "The Effect of Class Distribution on Classifier Learning: An Empirical Study," Technical Report MLTR-43, Dept. of Computer Science, Rutgers Univ., 2001.
[37] J. Laurikkala, "Improving Identification of Difficult Small Classes by Balancing Class Distribution," Proc. Conf. AI in Medicine in Europe: Artificial Intelligence Medicine, pp. 63-66, 2001.
[38] A. Estabrooks, T. Jo, and N. Japkowicz, "A Multiple Resampling Method for Learning from Imbalanced Data Sets," Computational Intelligence, vol. 20, pp. 18-36, 2004.
[39] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 20-29, 2004.
[40] N. Japkowicz and S. Stephen, "The Class Imbalance Problem: A Systematic Study," Intelligent Data Analysis, vol. 6, no. 5, pp. 429-449, 2002.
[41] Nuno Moniz, Paula Branco, Luís Torgo, "Resampling Strategies for Imbalanced Time Series Forecasting," International Journal of Data Science and Analytics, vol. 3, no. 3, pp. 161-181, May 2017.
[42] Satwik Mishra, "Handling Imbalanced Data: SMOTE vs Random Undersampling," International Research Journal of Engineering and Technology (IRJET), vol. 4, no. 8, Aug. 2017.
[43] The Learning Machine, "The Learning Machine." [Online]. Available: https://www.thelearningmachine.ai/imbalanced
[44] María Pérez-Ortiz, Pedro Antonio Gutiérrez, Peter Tino, César Hervás-Martínez, "Oversampling the Minority Class in the Feature Space," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 9, pp. 1947-1961, Sept. 2016.
[45] Aggarwal, K. K., Singh, Y., Kaur, A., and Malhotra, R., "Empirical analysis for investigating the effect of object-oriented metrics on fault proneness," Software Process: Improvement and Practice, vol. 14, no. 1, pp. 39-62, 2009.
[46] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Machine Learning, vol. 6, pp. 37-66, January 1991.
[47] Breiman, L., "Random Forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[48] Lessmann, S., "Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings," IEEE Transactions on Software Engineering, vol. 34, no. 4, 2008.
[49] Gayathri, M. and Sudha, "A Software Defect Prediction System using Multilayer Perceptron Neural Network with Data Mining," International Journal of Recent Technology and Engineering, pp. 54-59, 2014.
[50] Caragea, D., Cook, D., Honavar, V., "Gaining Insights into Support Vector Machine Pattern Classifiers Using Projection-Based Tour Methods," Proceedings of the KDD Conference, San Francisco, CA, pp. 251-256, 2001.
[51] Del-Hoyo, R., Buldain, D., Marco, A., "Supervised Classification with Associative SOM," Lecture Notes in Computer Science, pp. 334-341, 2003.

... software source code, and defect prediction models based on software metrics. The chapter closes with the problem of class imbalance in the data and its effect on datasets in the software defect prediction problem. 1.1 Software defects. In the field of software engineering, ...

... datasets in the software defect prediction problem. The results obtained demonstrate the effectiveness of applying these techniques to improve the accuracy of software defect prediction. 3.1 The data sampling problem in software defect prediction. The dataset ...

Number of pages	85
File size	24.41 MB