A Solution for Fault Identification and Handling in Cloud Computing Infrastructure

VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

BÙI THANH KHIẾT

A SOLUTION FOR FAULT IDENTIFICATION AND HANDLING IN CLOUD COMPUTING INFRASTRUCTURE

DOCTORAL THESIS

Major: Computer Science
Major code: 62480101

Independent reviewer:
Independent reviewer:
Reviewer: Assoc. Prof. Dr. Huỳnh Trung Hiếu
Reviewer: Assoc. Prof. Dr. Nguyễn Đình Thuân
Reviewer: Assoc. Prof. Dr. Quản Thành Thơ

Supervisors: Assoc. Prof. Dr. Trần Công Hùng, Assoc. Prof. Dr. Phạm Trần Vũ

HO CHI MINH CITY - 2022

DECLARATION

I declare that this thesis is my own research work. Results written jointly with other authors were included in the thesis only with the consent of those co-authors. The results presented in the thesis are truthful and have not been published in any other work.

Declarant
Bùi Thanh Khiết

THESIS SUMMARY

Cloud computing infrastructure services offer practical convenience: they let users deploy applications flexibly and simplify renting and releasing resources, with rental costs charged according to actual use (pay-as-you-go). However, faults in cloud infrastructure services are hard to avoid because of the enormous scale and network of cloud data centers, whose complex architecture comprises thousands of physical machines (PMs) of differing reliability. The openness, flexibility, and complex structure of cloud computing give rise to many types of faults, from the infrastructure and the platform up to the application. A fault can occur at any layer of the cloud and affects the layers above it. If a fault occurs in the operating system of the platform service layer, it can cause applications in the software service layer to fail. If a fault occurs in the hardware of a physical machine, it affects the infrastructure service layer, then leads to faults in the operating system of the platform service layer, and in turn to faults in the applications of the software service layer. Faults in infrastructure services, especially hardware faults, can thus cause severe damage to the system. Detecting typical hardware faults and developing corresponding fault tolerance techniques is therefore an urgent problem. Accordingly, a cloud needs to recognize faults and respond appropriately to ensure uninterrupted operation and quality of service and to avoid data loss even when faults occur. This capability is regarded as fault tolerance on cloud infrastructure.

There are two fault tolerance strategies in cloud computing: reactive and proactive. Reactive fault tolerance aims to reduce the consequences of faults during the operation and execution of the system's applications and services. This model reacts when a fault occurs rather than acting on predictions. The effects of faults are usually eliminated using maintenance systems. Proactive fault tolerance, by contrast, aims to keep applications and services executing correctly and to avoid latent faults through preventive measures. In the context of fault tolerance, the word "proactive" is defined as the system's ability to be prepared and in control before a fault occurs. The system state is monitored continuously, and the probability of a fault is estimated using statistical methods and mathematical models. The necessary actions are then taken to prevent the fault from occurring. Although reactive fault tolerance remains popular in the research community, remarkable advances in machine learning and artificial intelligence, together with increasingly smart devices, have broadened the scope of research on proactive fault tolerance. Fault tolerance frameworks are now expected to be more intelligent, offering different strategies for different system fault contexts in order to tolerate different kinds of faults. A flexible, fault-tolerance-oriented service orchestration mechanism for the cloud is needed. In other words, a fault tolerance framework should be built that ensures high availability and efficient management and use of resources.

The thesis therefore focuses on proactive fault tolerance strategies in order to build a fault tolerance framework for cloud infrastructure. The framework has two main components: fault detection on the servers of the cloud infrastructure, and efficient resource migration. The proposed anomaly-based fault detector ensures that the fault tolerance system operates accurately and increases the system's ability to react when a fault occurs. Based on the output of the fault detection component, the effects of a fault are avoided by migrating virtualized resources while keeping resource use efficient. To make the virtual machine (VM) migration strategy more responsive and flexible, the thesis proposes a fault-tolerant VM migration controller capable of reinforcement learning.

The contributions of the thesis are:
• Building a proactive fault tolerance framework for cloud infrastructure based on the MAPE-K loop structure of autonomic systems, comprising components for monitoring, PM fault analysis, building fault-tolerant VM migration strategies with reinforcement learning capability, and executing PM resource coordination on the cloud infrastructure.
• Proposing a PM fault detection model for cloud infrastructure based on an abnormal operation score. The score is determined from the decision boundary value of the Fuzzy One-Class Support Vector Machine (FOCSVM) model, which combines fuzzy logic with the One-Class Support Vector Machine (OCSVM) to reduce the influence of noise in the training dataset. Fuzzy logic is used to compute the penalty factors of the OCSVM, which makes the model more flexible at run time and makes use of expert knowledge. The thesis proposes a fault detection method based on the abnormal operation score, named EWMA-FOCSVM, which tracks abrupt changes in the FOCSVM decision boundary value with an Exponentially Weighted Moving Average (EWMA) control chart. Monitoring data samples are labeled normal or faulty by EWMA-FOCSVM in real time, producing a labeled training dataset for analyzing the PM performance parameters related to a fault. This analysis is cast as a feature selection problem and solved with the RFE-RF algorithm, which combines Recursive Feature Elimination (RFE) with Random Forest (RF). Suspicious parameters are identified through the feature ranking of the dataset.
• Proposing a model for building fault-tolerant VM migration strategies based on fuzzy control and Fuzzy Q-Learning reinforcement learning. VM migration is controlled so as to avoid the effects of a faulty PM while ensuring that each PM receiving VMs has a low abnormal operation score and balanced resource utilization. In addition, to improve the performance of the fault-tolerant VM migration controller, the rule set component is updated through reinforcement learning even when the system starts with an incomplete rule set. The thesis proposes the V2PFQL algorithm for controlling fault-tolerant VM migration based on Fuzzy Q-Learning. One strength of fuzzy inference systems is that they convert human knowledge into intuitive IF-THEN rules. When designing a fuzzy inference system, however, the designer may have difficulty defining the rule set: no knowledge of the problem may be available, only part of the rule set may be definable, or the defined rules may be ineffective, redundant, or uncertain (correct in some cases but wrong in others). To address this, the thesis proposes a rule-set training algorithm for VM migration, named V2PFQL-AS, which combines the V2PFQL algorithm with the Ant System to complete the rule set at the design stage of the fuzzy inference system. The thesis evaluates V2PFQL after its knowledge has been updated with the results of training under V2PFQL-AS. The objective function value of the fault-tolerant VM migration problem obtained by V2PFQL is compared with those of Round Robin (RR), the Inverse Ant System (IAS) ant colony optimization algorithm, the Ant System (AS), the Max-Min Ant System (MMAS), Particle Swarm Optimization (PSO), and Simulated Annealing (SA).

ABSTRACT

Cloud computing infrastructure services bring practical convenience: they help users deploy applications flexibly and simplify renting and releasing resources, with rental costs charged on a pay-as-you-go basis. However, faults in cloud infrastructure services are unavoidable because of the large scale and network of cloud data centers, whose complex architecture comprises thousands of physical servers with different reliability. The openness, flexibility, and complex structure of cloud computing lead to many different types of faults, from the infrastructure and the platform to the application. A fault can occur at any layer of the cloud, and it affects the layers above. If a fault occurs in the operating system of the platform service layer, it can lead
to failures of applications in the software service layer. If a fault occurs in the hardware of a physical server, it affects the infrastructure service layer, which in turn leads to failures in the operating system of the platform service layer and then to application failures in the software service layer. Faults in infrastructure services, especially hardware faults, can therefore cause great damage to the system, so it is imperative to detect typical hardware faults and to develop corresponding fault tolerance techniques. Accordingly, cloud computing needs to identify faults and respond appropriately to ensure transparency, quality of service, and the avoidance of data loss even when faults occur. This ability is known as fault tolerance on cloud infrastructure.

Existing fault tolerance (FT) approaches can be classified into two basic categories: reactive and proactive. Reactive FT approaches handle faults after they appear, using system maintenance programs. They are built on responsiveness rather than predictability. They are also conservative by nature, so there is no need to inspect the system's behavior; as a result, they do not incur any unnecessary overhead. Proactive FT approaches, on the other hand, are described as the capacity of the system to remain in an active state in order to avoid potential faults, errors, and failures before they occur. Statistical, machine learning, and artificial intelligence approaches are used to continually monitor the system's health and to anticipate the likelihood of a fault occurring. The system handles fault occurrence by taking the essential actions. An FT approach thus combines fault detection and fault recovery (reactive FT) or fault forecasting and fault prevention (proactive FT). Although reactive FT frameworks have so far been popular among researchers, the scope of research on proactive FT frameworks is growing because of ongoing advancements in
machine learning and artificial intelligence.

Therefore, this thesis focuses on proactive FT strategies in order to build an FT framework for cloud computing infrastructure. The FT framework consists of two main components: physical server fault detection in the cloud infrastructure, and virtual machine migration. In particular, the proposed anomaly-based fault detector ensures that the FT system works correctly and increases the system's ability to react when a fault occurs. Based on the results of the fault detection model, the effects of a fault are avoided through virtual machine migration. To improve the responsiveness of the virtual machine migration strategy, this study proposes a virtual machine migration controller capable of reinforcement learning.

The main contributions of the thesis include:
• Building a proactive FT framework for cloud computing infrastructure based on the MAPE-K loop structure of autonomic systems, including components for monitoring, PM fault analysis, self-learning VM migration, and resource coordination execution.
• Proposing the combination of fuzzy logic and OCSVM (named FOCSVM) to improve anomaly detection when outliers appear in the dataset. Using fuzzy logic to calculate the penalty factors of the OCSVM model makes the fault detection approach more flexible at run time and takes advantage of expert knowledge. Based on the FOCSVM anomaly detection model, a fault detection and diagnosis approach is proposed that comprises anomaly detection, fault detection, and analysis of suspicious parameters. For fault detection, an exponentially weighted moving average (EWMA) chart is used to identify the abrupt changes that occur when a fault arises; the resulting method is named EWMA-FOCSVM. The fault diagnosis problem is then abstracted to a feature selection problem over a training dataset labeled by EWMA-FOCSVM. The analysis of physical server performance parameters related to the faults is
cast as a feature selection problem and solved using the RFE-RF model, which combines Recursive Feature Elimination (RFE) with Random Forest (RF). Suspicious parameters are identified through the feature ranking of the dataset.
• Designing the self-learning VM migration component by applying the Fuzzy Q-Learning algorithm to enhance the performance of the fuzzy inference system. One of the strengths of fuzzy inference systems is their ability to convert human knowledge into intuitive IF-THEN rules. VM migration strategies are treated as internal knowledge of the cloud controller, which is capable of learning in the execution environment. To implement the self-learning VM migration controller, the rule set is continually explored during execution by a self-learning rule component, which can complete the rule set at run time without prior knowledge. The migration controller observes the infrastructure state and produces the migration plans. The PMs allocated to cloud-hosted applications are monitored through load balance and abnormal score metrics. The V2PFQL algorithm is proposed to migrate VMs so as to avoid the influence of deteriorating PMs while maintaining load balance and a low abnormal score across all the safe PMs. The analysis results from the migration controller are passed to the self-learning rule component to update it. The learning mechanism for VM migration rules, named V2PFQL-AS, is designed as a combination of the V2PFQL algorithm and the Ant System algorithm. In general, the VM migration problem is expressed as migrating n VMs onto m PMs. After the VMs are migrated, the system ensures at least a given level of load balancing among the resources in each PM and a minimum of anomalies for each PM, ...

... appearing in the dataset. Fuzzy logic is used to compute the penalty factors of the OCSVM, which makes the model more flexible at run time and makes use of expert knowledge. Based on the FOCSVM anomaly detection model, a method for detecting faults and identifying fault-related parameters is proposed. For fault detection, an Exponentially Weighted Moving Average (EWMA) control chart is used to identify abrupt changes when a fault occurs; the resulting method is named EWMA-FOCSVM. The problem of identifying fault-related parameters is then cast as a feature ranking problem over a training dataset labeled by the EWMA-FOCSVM fault detection component. To analyze the PM performance parameters related to a fault, Recursive Feature Elimination (RFE) is combined with the Random Forest (RF) algorithm at each iteration. The model for detecting and analyzing fault-related PM performance parameters (named RFE-RF) is evaluated on experimental data from a private cloud infrastructure and on data extracted from the Google Cluster Trace dataset.

To address Research Question 2, a fault-tolerant VM migration strategy must be built on VM migration technology. VM migration moves an entire VM (including its processor, memory, storage, network resources, operating system, and related applications) from one PM to another. This brings many benefits to data centers, such as load balancing, online maintenance, and power management. However, the complex structure of cloud infrastructure and the dynamic nature of cloud applications at run time pose major challenges for designing a cloud controller. In practice, cloud infrastructure is not only complex but also changes over time because of cloud elasticity, which makes the system hard to model. In addition, the transparency of the cloud makes it hard to analyze the state of the infrastructure, because that state depends heavily on interactions among system components. Service providers lack specific information about the architecture of the applications deployed on the cloud infrastructure, while users lack specific information about the infrastructure itself. This makes it difficult to define optimal control parameters and rule sets. These characteristics of the cloud environment require the cloud controller to act as a decision-making agent in a sequential decision problem, in which the current action affects future decisions.

Within the scope of this thesis, VM migration strategies are treated as internal knowledge of the cloud controller, so that they can be learned in the execution environment. The thesis builds a VM migration controller whose migration rule set is capable of reinforcement learning within the MAPE-K mechanism. The VM migration control algorithm V2PFQL is built on the Fuzzy Q-Learning algorithm. In addition, to improve the performance of the VM migration controller, the rule set component uses reinforcement learning to complete the rule set throughout execution. The rule-set training algorithm V2PFQL-AS combines V2PFQL with the Ant System to complete the rule set at the design stage of the controller's fuzzy inference system. The effectiveness of V2PFQL-AS is evaluated by tuning the learning rate, the discount factor, and the exploration/exploitation factor based on the convergence of the q-values. V2PFQL is evaluated against RR and the meta-heuristic algorithms IAS, AS, MMAS, SA, and PSO.

In summary, the main contributions of the thesis are:
• Building a proactive fault tolerance framework for cloud infrastructure based on the MAPE-K cycle, with the corresponding components: monitoring system, fault analysis, fault tolerance strategy building, and resource coordination.
• Proposing a fault detection model for PMs on cloud infrastructure based on an abnormal operation score. The score is the decision boundary value of the FOCSVM model, which combines fuzzy logic with the one-class support vector machine algorithm. The thesis uses the exponentially weighted moving average (EWMA) to track abrupt changes in the FOCSVM boundary value and thereby detects faults from the abnormal operation score. In addition, the thesis combines Recursive Feature Elimination (RFE) with Random Forest (RF) to analyze the performance of physical machines in relation to faults by ranking the features in the dataset.
• Proposing fault-tolerant VM migration based on fuzzy control and Fuzzy Q-Learning reinforcement learning. The thesis proposes the V2PFQL and V2PFQL-AS algorithms within a fuzzy inference system composed of VM migration control rules, including the case in which the rule set is incomplete.

5.2 Future research directions

Based on the research and the results obtained, the thesis proposes the following problems and research directions:
• Problem 1: Continue studying fault root cause identification. So far, the thesis has only examined the analysis of physical machine performance parameters related to faults. This can serve as a premise for further research on root cause identification, leading to a complete and effective fault analysis model for physical servers.
• Problem 2: Continue studying how to determine the optimal physical server for VM migration based on network infrastructure parameters. The thesis currently determines the target physical server from performance parameters only. Combining performance parameters with network infrastructure parameters would help identify a more suitable and efficient physical server for VM migration.
• Problem 3: The proposed model for building VM migration strategies can be further evaluated against other existing models to obtain a more comprehensive assessment. The parameters in Formula (4.29) that affect the reinforcement learning process are the learning rate η, the discount factor γ, and the exploration/exploitation factor ε of the V2PFQL VM migration control algorithm. Recently, strategies for setting this exploration/exploitation parameter have been studied using deep learning methods.

LIST OF PUBLICATIONS

International journals
T. K. Bui, C. H. Tran, and T. V. Pham, “V2PFQL: A proactive fault tolerance approach for cloud-hosted applications in cloud computing environment,” IET Control Theory & Applications, vol. 16, no. 14, pp. 1474-1498, 2022.
T. K. Bui, V. L. Vo, M. C. Nguyen, T. V. Pham, and C. H. Tran, “A fault detection and diagnosis approach for multi-tier application in cloud computing,” Journal of Communications and Networks, vol. 22, no. 5, pp. 399-414, 2020.
C. H. Tran, T. K. Bui, and T. V. Pham, “Virtual machine migration policy for multi tier application in cloud computing based on Q-learning algorithm,” Computing, vol. 104, no. 6, pp. 1285-1306, 2022.
T. K. Bui, H. D. Ho, T. V. Pham, and C. H. Tran, “Virtual machines migration game approach for multi-tier application in infrastructure as a service cloud computing,” IET Networks, vol. 9, no. 6, pp. 326-337, 2020.

International conference proceedings
T. K.
Bui, V. L. Nguyen, V. T. Tran, T. V. Pham, and C. H. Tran, “A load balancing VMs migration approach for multi-tier application in cloud computing based on fuzzy set and Q-learning algorithm,” Research in Intelligent and Computing in Engineering, Springer, 2021, pp. 617-628.
T. K. Bui, T. V. Pham, and C. H. Tran, “A load balancing game approach for VM provision cloud computing based on ant colony optimization,” International Conference on Context-Aware Systems and Applications, Springer, 2017, pp. 52-63.
injection,” IEEE Transactions on Parallel Distributed Systems, vol 28, no 2, pp 503-516, 2016 T Wang, J Xu, w Zhang, z Gu, and H Zhong, “Self-adaptive cloud monitoring with online anomaly detection,” Future Generation Computer Systems, vol 80, pp 89-101,2018 M Lin, z Yao, F Gao, and Y Li, “Toward anomaly detection in iaas cloud computing platforms,” International Journal ofSecurity Its Applications, vol 9, no 12, pp 175-188,2015 141 [60] [61] [62] [63] [64] [65] [66] [67] [68] [69] [70] [71] [72] [73] [74] H Agarwal, and A Shanna, “A comprehensive survey of fault tolerance techniques in cloud computing,” in 2015 International Conference on Computing and Network Communications (CoCoNet), IEEE, 2015, pp 408-413 G Yao, Y Ding, L Ren, K Hao, and L Chen, “An immune system-inspired rescheduling algorithm for workflow in Cloud systems,” Knowledge-Based Systems, vol 99, pp 39-50, 2016 R Salvador, A Otero, J Mora, E de la Torre, L Sekanina, and T Riesgo, “Fault tolerance analysis and self-healing strategy of autonomous, evolvable hardware systems,” in 2011 International Conference on Reconfigurable Computing and FPGAs, IEEE, 2011, pp 164-169 D Ghosh, R Sharman, H R Rao, and s Upadhyaya, “Self-healing systems—survey and synthesis,” Decision support systems, vol 42, no 4, pp 2164-2185, 2007 H.-y Tu, “Comparisons of self-healing fault-tolerant computing schemes,” in World Congress on Engineering and Computer Science, Citeseer, 2010, pp 1-6 A Poize, p Troger, and F Salfner, “Timely virtual machine migration for pro-active fault tolerance,” in 2011 14th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing Workshops, IEEE, 201 l, pp 234-243 A Bala, and I Chana, “Fault Tolerance-Challenges,Techniques and Implementation in Cloud Computing,” International Journal of Computer Science Issues, vol 9, no I, pp 288-293, 2012 M Armbrust, A Fox, and R Griffit, “A view of cloud computing,” Communications of the ACM, vol 53, no 4, pp 50-58, 2010 D 
Bruneo, S Distefano, F Longo, A Puliafito, and M Scarpa, “Workload-based software rejuvenation in cloud systems,” IEEE Transactions on Computers, vol 62, no 6, pp 1072-1085,2013 L Gillam, and N Antonopoulos, Cloud computing: principles, systems and applications, Springer, 2017 A Naskos, A Gounaris, and s Sioutas, “Cloud elasticity: a survey,” in Algorithmic Aspects of Cloud Computing: First International Workshop, Springer, 2016, pp 151167 T Lorido-Botran, J Miguel-Alonso, and J A Lozano, “A review of auto-scaling techniques for elastic applications in cloud environments,” Journal of grid computing, vol 12, no 4, pp 559-592, 2014 .1 Sahoo, s Mohapatra, and R Lath, “Virtualization: A survey on concepts, taxonomy and associated security issues,” in 2010 Second International Conference on Computer and Network Technology, IEEE, 2010, pp 222-226 K Kolyshkin, “Virtualization in linux,” White paper, OpenVZ, vol 3, no 39, pp 15, 2006 H N Palit, X Li, s Lu, L c Larsen, and J A Setia, “Evaluating hardware-assisted virtualization for deploying HPC-as-a-service,” in Proceedings of the 7th international workshop on Virtualization technologies in distributed computing, ACM, 2013, pp 11-20 140 [89] [90] [91] [92] [93] [94] [95] [96] [97] [98] [99] [100] [101] [102] [103] M Lin, z Yao, F Gao, and Y Li, “Data-driven Anomaly Detection Method for Monitoring Runtime Performance of Cloud Computing Platforms,” International Journal of Hybrid Information Technology, vol 9, no 2, pp 439-450, 2016 F Doelitzscher, M Knahl, c Reich, and N Clarke, “Anomaly detection in iaas clouds,” in 2013 IEEE 5th International Conference on Cloud Computing Technology and Science, IEEE, 2013, pp 387-394 Ọ Guan, and s Fu, “Adaptive anomaly identification by exploring metric subspace in cloud computing infrastructures,” in 2013 IEEE 32nd International Symposium on Reliable Distributed Systems, IEEE, 2013, pp 205-214 D Sun, G Chang, c Miao, and X Wang, “Modelling and evaluating a high serviceability fault 
tolerance strategy in cloud computing environments,” International Journal of Security and Networks, vol 7, no 4, pp 196-210,2012 s Yi, A Andrzcjak, and D Kondo, “Monetary cost-aware checkpointing and migration on amazon cloud spot instances,” IEEE Transactions on Services Computing, vol 5, no 4, pp 512-524, 2011 B Nicolae, G Antoniu, L Bougé, D Moise, and A Carpen-Amarie, “BlobSeer: Next-generation data management for large scale infrastructures,” Journal ofParallel distributed computing, vol 71, no 2, pp 169-184, 2011 M Zhao, F D'Ugard, K A Kwiat, and c A Kamhoua, “Multi-level VM replication based survivability for mission-critical cloud computing,” in 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM), IEEE, 2015, pp 1351-1356 M Amoon, “A framework for providing a hybrid fault tolerance in cloud computing,” in 2015 Science and Information Conference (SAI), IEEE, 2015, pp 844-849 M Amoon, “Adaptive framework for reliable cloud computing environment,” IEEE Access, vol 4, pp 9469-9478, 2016 X Chen, and J.-I I Jiang, “A method of virtual machine placement for fault-tolerant cloud applications,” Intelligent Automation Soft Computing, vol 22, no 4, pp 587597,2016 R Jhawar, V Piuri, and M Santambrogio, “Fault tolerance management in cloud computing: A system-level perspective,” IEEE Systems Journal, vol 7, no 2, pp 288-297,2012 M Hasan, and M s Goraya, “Priority based cooperative computing in cloud using task backfilling,” Lecture Notes on Software Engineering, vol 4, no 3, pp 229-233, 2016 M Hasan, and M s Goraya, “A framework for priority based task execution in the distributed computing environment,” in 2015 International Conference on Signal Processing, Computing and Control (ISPCC), IEEE, 2015, pp 155-158 M s Goraya, and L Kaur, “Fault tolerance task execution through cooperative computing in grid,” Parallel Processing Letters, vol 23, no 01, pp 1-20, 2013 Y Zhang, z Zheng, and M R Lyu, “BFTCloud: A byzantine fault tolerance framework for 
voluntary-resource cloud computing,” in 2011 IEEE 4th International Conference on Cloud Computing, IEEE, 2011 , pp 444-451 142 [132] R Singh, u Sharma, E Cecchet, and p Shenoy, “Autonomic mix-aware provisioning for non-stationary data center workloads,” in Proceedings of the 7th international conference on Autonomic computing, ACM, 2010, pp 21-30 [ 133] s Pertet, and p Narasimhan, Causes offailure in web applications, Technical Report CMU-PDL-05-109, Carnegie Mellon University, 2005 [134] G Casale, N Mi, and E Smirni, “Model-driven system capacity planning under workload burstiness,” IEEE Transactions on Computers, vol 59, no l,pp 66-80, 2009 [135] H Kang, H Chen, and G Jiang, “PeerWatch: a fault detection and diagnosis tool for virtualized consolidation systems,” in Proceedings of the 7th international conference on Autonomic computing, ACM, 2010, pp 119-128 [136] G Jiang, H Chen, K Yoshihira, and A Saxena, “Ranking the importance of alerts for problem determination in large computer systems,” Cluster Computing, vol 14, no 3, pp 213-227, 2011 [137] A Verma, L Pedrosa, M Korupolu, D Oppenheimer, E Tune, and J Wilkes, “Large-scale cluster management at Google with Borg,” in Proceedings of the Tenth European Conference on Computer Systems, ACM, 2015, pp 1-17 [ 138] J Wilkes, More Google cluster data, Google research blog, 2011 [139] J Li, M Qiu, z Ming, G Quan, X Qin, and z Gu, “Online optimization for scheduling prccmptablc tasks on laaS cloud systems,” Journal of Parallel and Distributed Computing, vol 72, no 5, pp 666-677, 2012 [140] M Rahman, X Li, and H Palit, “Hybrid heuristic for scheduling data analytics workflow applications in hybrid cloud environment,” in 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, IEEE, 2011, pp 966-974 [141] B Saovapakhiran, G Michailidis, and M Devetsikiotis, “Aggregated-DAG scheduling for job flow maximization in heterogeneous cloud computing,” in 2011 IEEE Global Telecommunications 
Conference-GLOBECOM, IEEE, 2011, pp 1-6 [142] z Xiao, w Song, and Q Chen, “Dynamic resource allocation using virtual machines for cloud computing environment,” IEEE transactions on parallel distributed systems, vol 24, no 6, pp 1107-1 117, 2012 [143] A Ghascmi, and A T Haghighat, “A multi-objcctivc load balancing algorithm for virtual machine placement in cloud data centers based on machine learning,” Computing, vol 102, no 9, pp 2049-2072, 2020 [144] M M Breunig, H.-P Kriegel, R T Ng, and J Sander, “LOF: identifying density based local outliers,” in Proceedings of the 2000 ACM SIGMOD international conference on Management of data, ACM, 2000, pp 93-104 [145] M Amer, M Goldstein, and s Abdennadher, “Enhancing one-class support vector machines for unsupervised anomaly detection,” in Proceedings of the ACM SIGKDD workshop on outlier detection and description, ACM, 2013, pp 8-15 [146] J M Mendel, Uncertain rule-based fuzzy systems, Springer, 2017 [147] M L Puterman, Markov decision processes: discrete stochastic dynamic programming, John Wiley & Sons, 2014 145 [104] z Zheng, T c Zhou, M R Lyu, and King, “Component ranking for fault-tolerant cloud applications,” IEEE Transactions on Services Computing, vol 5, no 4, pp 540-550, 2011 [105] B Mohammed, M Kiran, I.-Ư Awan, and K M Maiyama, “An Integrated Virtualized Strategy for Fault Tolerance in Cloud Computing Environment,” in 2016 Inti IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet ofPeople, and Smart World Congress, IEEE, 2016, pp 542-549 [106] J Wang, w Bao, X Zhu, L T Yang, and Y Xiang, “FESTAL: fault-tolerant elastic scheduling algorithm for real-time tasks in virtualized clouds,” IEEE Transactions on Computers, vol 64, no 9, pp 2545-2558, 2014 [107] C.-A Chen, M Won, R Stoleru, and G G Xie, “Energy-efficient fault-tolerant data storage and processing in mobile cloud,” IEEE Transactions on cloud 
computing, vol 3, no 1, pp 28-41,2014 [108] G Radhakrishnan, “Adaptive application scaling for improving fault-tolerance and availability in the cloud,” Bell Labs Technical Journal, vol 17, no 2, pp 5-14,2012 [109] D Poola, K Ramamohanarao, and R Buyya, “Fault-tolerant Workflow Scheduling using Spot Instances on Clouds,” Procedia Computer Science, vol 29, pp 523-533, 2014 [110] J Cao, M Simonin, G Cooperman, and c Morin, “Checkpointing as a service in heterogeneous cloud environments,” in 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, IEEE, 2015, pp 61-70 [Ill] D Sun, G Zhang, c Wu, K Li, and w Zheng, “Building a fault tolerant framework with deadline guarantee in big data stream computing environments,” Journal of Computer System Sciences, vol 89, pp 4-23, 2017 [112] s Sidiroglou, o Laadan, c Perez, N Vicnnot, J Nich, and A Kcromytis, “Assure: automatic software self-healing using rescue points,” ACM SIGARCH Computer Architecture News, vol 37, no 1, pp 37-48, 2009 [113] G Chen, H Jin, D Zou, B B Zhou, w Qiang, and G Hu, “Shelp: Automatic selfhealing for multiple application instances in a virtual machine environment,” in 2010 IEEE International Conference on Cluster Computing, IEEE, 2010, pp 97-106 [114] I p Egwutuoha, S Chen, D Levy, B Sclic, and R Calvo, “A proactive fault tolerance approach to High Performance Computing (HPC) in the cloud,” in 2012 Second International Conference on Cloud and Green Computing, IEEE, 2012, pp 268-273 [115] A B Nagarajan, F Mueller, c Engelmann, and s L Scott, “Proactive fault tolerance for HPC w ith Xcn virtualization,” in Proceedings of the 21st annual international conference on Supercomputing, ACM, 2007, pp 23-32 [116] J Liu, J Zhou, and R Buyya, “Software rejuvenation based fault tolerance scheme for cloud applications,” in 2015 IEEE 8th International Conference on Cloud Computing, IEEE, 2015, pp 1115-1118 143 [117] p Y Glorennec, “Fuzzy Q-leaming and dynamical fuzzy Q-leaming,” in 
Proceedings of1994 IEEE 3rd International Fuzzy Systems Conference, IEEE, 1994, pp 474-479 [118] L Jouffe, “Fuzzy inference system learning by reinforcement methods,” EEE Transactions on Systems, Man, Cybernetics, Part c, vol 28, no 3, pp 338-355, 1998 [119] B F Darst, K c Malecki, and c D Engelman, “Using recursive feature elimination in random forest to account for correlated variables in high dimensional data,” BMC genetics, vol 19, no 1, pp 65, 2018 [120] I Guyon, J Weston, s Barnhill, and V Vapnik, “Gene selection for cancer classification using support vector machines,” Machine learning, vol 46, no 1-3, pp 389-422, 2002 [121] M Liu, B c Vcmuri, s.-l Amari, and F Nielsen, “Total Bregman divergence and its applications to shape retrieval,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp 3463-3468 [122] s Yin, and G Wang, “A modified partial robust M-regression to improve prediction performance for data with outliers,” in 2013 IEEE International Symposium on Industrial Electronics, IEEE, 2013, pp 1-6 [123] J Jiong, and z Hao-ran, “A fast learning algorithm for One-Class Support vector machine,” in Third International Conference on Natural Computation (ICNC 2007), IEEE, 2007, pp 19-23 [124] R.-E Fan, P.-H Chen, and C.-J Lin, “Working set selection using second order information for training support vector machines,” Journal of machine learning research, vol 6, no 12, pp 1889-1918,2005 [125] s Mahadevan, and s L Shah, “Fault detection and diagnosis in process data using one-class support vector machines,” Journal ofprocess control, vol 19, no 10, pp 1627-1639,2009 [126] T M Khoshgoftaar, M Golawala, and J Van Hulse, “An empirical study of learning from imbalanced data using random forest,” in 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI2007), IEEE, 2007, pp 310-317 [127] A Sĩrbu, and o Babaoglu, “Towards data-driven autonomies in data centers,” in 2015 International Conference 
on Cloud and Autonomic Computing, IEEE, 2015, pp 45-56 [128] G Haixiang, L Yijing, J Shang, G Mingyun, H Yuanyue, and G Bing, “Learning from class-imbalanced data: Review of methods and applications,” Expert Systems with Applications, vol 73, pp 220-239, 2017 [129] Guyon, s Gunn, M Nikravcsh, and L A Zadch, Feature extraction: foundations and applications, Springer, 2008 [130] w Buntine, M Grobclnik, D Mladenic, and J Shawe-Taylor, Machine Learning and Knowledge Discovery in Databases, Springer, 2009 [131] D A Menascé, “TPC-W: A benchmark for e-commerce,” IEEE Internet Computing, vol 6, no 3, pp 83-87, 2002 144 [148] T Duong, Y.-J Chu, T Nguyen, and J Chakareski, “Virtual machine placement via q-leaming with function approximation,” in 2015 IEEE Global Communications Conference (GLOBECOM), IEEE, 2OỈ5,pp Ì-6 [149] R s Sutton, and A G Barto, Introduction to reinforcement learning, MIT press Cambridge, 1998 [150] M Dorigo, V Maniezzo, and A Colorni, “Ant system: optimization by a colony of cooperating agents,” IEEE Transactions on Systems, Man,Cybernetics, Part B, vol 26, no l,pp 29-41, 1996 [151] R N Calheiros, R Ranjan, A Beloglazov, c A De Rose, and R Buyya, “CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms,” Software: Practice and experience, vol 41, no 1, pp 23-50, 2011 [152] T M Tawfccg, A Yousif, A Hassan, s M Alqhtani, R Hamza, M B Bashir, and A Ali, “Cloud Dynamic Load Balancing and Reactive Fault Tolerance Techniques: A Systematic Literature Review (SLR),” IEEE Access, vol 10, no 6, pp 7185371873,2022 [153] Y Guo, A L Stolyar, and A Walid, “Online VM auto-scaling algorithms for application hosting in a cloud,” IEEE Transactions on Cloud Computing, vol 8, no 3, pp 889-898, 2018 [154] S.-Y Hsieh, C.-S Liu, R Buyya, and A Y Zomaya, “Utilization-prcdiction-awarc virtual machine consolidation approach for energy-efficient cloud data centers,” Journal of Parallel Distributed 
Computing, vol 139, pp 99-109, 2020 [ 155] A Ghasemi, and A Toroghi Haghighat, “A multi-objective load balancing algorithm for virtual machine placement in cloud data centers based on machine learning,” Computing, vol 102, no 9, pp 2049-2072, 2020 146