DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Log Analysis for Failure Diagnosis and Workload Prediction in Cloud Computing

KRISTIAN HUNT
khunt@kth.se

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Master's Thesis at CSC
Supervisor: Örjan Ekeberg
Examiner: Patric Jensfelt
Principal: Ericsson AB
Contact Person at Principal: Zhangming Niu
June, 2016

Abstract

The size and complexity of cloud computing systems make runtime errors inevitable. These errors could be caused by the system having insufficient resources or by an unexpected failure in the system. In order to provide highly available cloud computing services, it is necessary to automate the resource provisioning and failure diagnosis processes as much as possible. Log files are often a good source of information about the current status of the system. In this thesis, methods for diagnosing failures and predicting system workload using log file analysis are presented, and the performance of different machine learning algorithms using our proposed methods is compared. Our experimental results show that classification tree and random forest algorithms are both suitable for diagnosing failures, and that Support Vector Regression outperforms linear regression and regression trees when predicting disk availability and memory usage. However, we conclude that predicting CPU utilization requires further studies.

Referat

Log file analysis for failure diagnosis and prediction of future load in cloud service systems

The size and complexity of today's cloud computing systems make it impossible to avoid software errors entirely. These errors can be caused by the system having insufficient resources or by an unexpected system failure. In order to offer cloud-based services with high reliability, it is necessary to automate the allocation of the system's resources and the failure diagnosis processes as far as possible. Log files are often a good source of information about the state of the system. This work presents methods for failure diagnosis and prediction of the system's load using analysis of the system's log files, and compares different machine learning methods. The experimental results show that both Classification Tree and Random Forest are suitable algorithms for failure diagnosis, and that Support Vector Regression outperforms both Linear Regression and Regression Trees for predicting disk availability and memory usage. However, further studies are needed in order to predict CPU utilization.

Acknowledgements

I would like to thank my supervisor at KTH, Örjan Ekeberg, and my supervisors at Ericsson, Per-Olof Gatter and Zhangming Niu. Thank you for your guidance and support!
I would like to thank Ola Lundin for giving me this opportunity and Fikri Aydemir for helping me with UDC. I thank Alain Kaeslin for proofreading my thesis and Anders Svantesson for translating the abstract to Swedish. Lastly, I want to express my gratitude to my dearest Helian for supporting me throughout the project.

Contents

1 Introduction
  1.1 Objective and Delimitations
  1.2 Ethics Statement
  1.3 Contributions
2 Background
  2.1 Data Automation Platform
  2.2 Machine Learning
    2.2.1 Classification
    2.2.2 Regression
  2.3 Related Work
    2.3.1 Features
    2.3.2 Anomaly Detection
    2.3.3 Failure Diagnosis
    2.3.4 Automatic Scaling
3 Methodology
  3.1 Data Collection and Structure
    3.1.1 Data Collection
    3.1.2 Log Format
  3.2 Algorithm and Feature Selection
    3.2.1 Failure Diagnosis
    3.2.2 Workload Prediction
  3.3 Model Training
    3.3.1 Cross-validation
  3.4 Evaluation
4 Results
  4.1 Failure Diagnosis
    4.1.1 Single Component Failure
    4.1.2 Double Component Failure
    4.1.3 Single or Double Component Failure
  4.2 Workload Prediction
    4.2.1 Disk Space Availability
    4.2.2 Memory Usage
    4.2.3 CPU Utilization
5 Discussion and Conclusions
  5.1 Discussion and Future Work
    5.1.1 Failure Diagnosis
    5.1.2 Workload Prediction
  5.2 Conclusions
References
Appendices
A Tree Model
B Detailed Results Tables

Chapter 1

Introduction

Cloud computing usually refers to a distributed system which consists of connected and virtualized computers that are provisioned based on the current needs of the user. When cloud computing resources are made available to the public as a service and billed by usage instead of a flat fee, it is called utility computing. Some well-known utility computing services are Amazon EC2, Google App Engine and Microsoft Azure. The size and complexity of these systems make runtime errors inevitable. Since most utility computing services have high requirements for service availability, fixing runtime errors needs a high level of automation [3]. According to Armbrust et al. [3], service availability and automatic scaling are among the top 10 obstacles for the growth of cloud computing.

Information about the current state of a running system is written into a log file. Some examples of the uses for system log files include making business decisions, monitoring the behaviour of the system and troubleshooting. In the case of troubleshooting, log files are often the only source of information [12]. A log file's content, logging frequency and data format depend heavily on the developers who implemented the software and on the intended usage of the logged messages. However, logged messages are often hard to understand without knowing the context, and it is often impossible for developers to know which messages will be useful, especially if the software is used as part of a larger system [20].

Logs can be categorized into two types: event logs and performance logs. Event logs are captured when something happens in the system, while performance logs are usually output regularly to give an overview of the current status of the system [20]. One very simple and common way of analysing event logs is to search for a specific part of a message or a keyword, e.g. "error". However, it might be difficult to figure out what to search for. Furthermore, some messages can induce false positives in the search results, for example messages such as "0 error(s) detected". According to Oliner et al. [20], machine learning techniques are commonly used to discover log messages and patterns which might be of interest when the logging volume is too large to be manually sifted through.
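As a concrete illustration of the false-positive problem mentioned above, the following minimal Python sketch contrasts a naive keyword search with a slightly stricter one. The log lines and their format are invented for the example (only the component names messaging_bus and api_server come from this thesis); the thesis itself does not prescribe this filtering step.

import re

# Hypothetical log lines; not taken from the Data Automation Platform.
log_lines = [
    "2016-04-12 10:02:11 INFO  scheduler: 0 error(s) detected during startup",
    "2016-04-12 10:02:15 ERROR messaging_bus: connection refused",
    "2016-04-12 10:02:16 INFO  api_server: request completed in 12 ms",
]

# Naive keyword search: also matches the harmless "0 error(s) detected" line.
naive_hits = [line for line in log_lines if "error" in line.lower()]

# Slightly stricter search: only match the upper-case severity token, which
# removes this particular false positive but remains brittle in general.
severity = re.compile(r"\bERROR\b")
stricter_hits = [line for line in log_lines if severity.search(line)]

print(len(naive_hits), len(stricter_hits))  # 2 vs. 1

Even the stricter variant depends entirely on the log format, which motivates the machine learning approaches discussed in the rest of the thesis.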
1.1 Objective and Delimitations

The objective of this thesis is to compare some of the viable classifiers and regressors for diagnosing system faults and predicting system workload, based on historical performance and event log files from a system in the cloud computing domain, using log analysis and machine learning techniques. A cloud computing system developed by Ericsson, called the Data Automation Platform, will be used for data generation and for evaluation of the proposed method. The project focuses on inspecting data related to the platform itself and not the customer services that would run on the platform. The scope of the project includes providing predictions for possible sources of faults and predictions of workloads, but it does not include the actions that need to be taken based on these predictions. Moreover, the predictions for sources of faults are given at the level of component name; the reason for the failure is not modelled.

1.2 Ethics Statement

The thesis at hand was conducted independently and impartially. The author of the thesis has not identified any ways in which the results of this thesis could have a direct negative environmental, economic or societal impact. However, if future studies based on the results of this thesis indicate that the results are reproducible in production environments, then they would have a positive economic impact for companies utilizing the methods proposed in this project. Additionally, a positive environmental impact is possible when an optimal amount of resources is utilized as a result of using the workload predictions as inputs for automatic scaling.

1.3 Contributions

Supervised learning has successfully been applied to failure diagnosis (see Section 2.3.3) and workload prediction (see Section 2.3.4) tasks in the field of cloud computing. This thesis contributes results for unique sets of feature vectors. Even though English word-count-based feature vectors have been used in unsupervised log analysis tasks, the author is not aware of any studies where these feature vectors have been applied to supervised learning for failure diagnosis in the field of cloud computing.

4.2.3 CPU Utilization

The results for the use case of predicting CPU utilization for the UDC are presented in Table 4.7 and illustrated in Figure 4.4. We can see that the worst results are from the linear regression algorithm and the best results are from the regression tree. However, the prediction results are worst in this use case, and the difference between the Baseline Regressor and the other regressors is the smallest, indicating poor predictive performance. The small RMSE value for the Baseline Regressor shows that the test set of CPU utilization is closer to the mean of the training data than in the previous experiments, which can be visually verified from Figure 3.2. It is also apparent from Figure 3.2 that the data about CPU utilization is noisier than the data about disk space availability and memory usage. The best cross-validation score for the regression tree algorithm was achieved with a depth of and the best cross-validation score for SVR was achieved with ε = 16.6.

Table 4.7: Root mean-squared errors for different regressors when predicting CPU utilization for UDC on the test set.

    Regressor                     RMSE
    Baseline Regressor            21.22
    Support Vector Regression     18.86
    Linear Regression             19.76
    Regression Tree               18.78
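The implementation behind Table 4.7 is not shown in this extract, but the comparison can be sketched with scikit-learn, which is an assumption (the tree export in Appendix A looks scikit-learn-like). The data below is random placeholder data, the linear SVR kernel is inferred from the later discussion of non-linear kernels as future work, and the tree depth is a stand-in since the tuned value is not given here; only ε = 16.6 is taken from the text.

import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Placeholder data: X would hold performance-log features and y the CPU
# utilization (%) to predict; shapes and values are illustrative only.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 8)), rng.random(500) * 100
X_test, y_test = rng.random((140, 8)), rng.random(140) * 100

regressors = {
    "Baseline Regressor": DummyRegressor(strategy="mean"),
    "Support Vector Regression": SVR(kernel="linear", epsilon=16.6),
    "Linear Regression": LinearRegression(),
    "Regression Tree": DecisionTreeRegressor(max_depth=3),  # depth is a placeholder
}

for name, model in regressors.items():
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"{name}: RMSE = {rmse:.2f}")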
[Figure 4.4: True and predicted CPU utilization for UDC on the test set. Three panels (SVR, Linear Regression, Regression Tree) plot average CPU utilization (%) against time in minutes.]

Chapter 5

Discussion and Conclusions

The results presented in the previous chapter indicate that using machine learning for failure diagnosis and workload prediction is viable. These results and future work will be discussed in this chapter. Lastly, the conclusions of the project will be presented.

5.1 Discussion and Future Work

5.1.1 Failure Diagnosis

We saw that predicting a single component failure of the system under test was an easy task for both tested classifiers, while diagnosing when two components failed at the same time was significantly harder. A possible explanation is that failures of some components "shadow" the failures of other components. This hypothesis could explain why failures caused by the components responsible for inter-component communication, api_server and messaging_bus, continue receiving high precision and recall scores while others do not. In order to verify this hypothesis, the data points about the components responsible for inter-component communication could be removed from the data set. This is similar to the study by Kandula et al. [11], where, for evaluating multiple simultaneous faults in enterprise networks, they had prior knowledge about fault combinations that interfere with each other, and these combinations were not injected into the system in the data collection phase.

We compared the number of logged events captured during our data collection process for Cloud Deployer, presented in Figure 3.1b, with the detailed experiment results in Appendix B. The three components of Cloud Deployer which had the least amount of logged events, when we excluded messaging_bus, also had the worst precision and recall scores. Even though there is no one-to-one mapping in our results between component log file verbosity and prediction results, in both experiments the components with more verbose log files had better results than the components with less verbose log files, with messaging_bus being the exception to this result.

The results in our failure diagnosis experiments show that the classification tree and random forest algorithms have very similar performance. If these similarities also hold after tests in production environments, then additional properties should be taken into account when choosing the algorithm, depending on the needs of the application. Such properties could be the learning speed and the memory and CPU requirements of the underlying machine learning algorithms. In case the classification tree algorithm is selected, it is suggested to implement a pruning method to prevent overfitting to the training data.

In our experiments we also attempted to use the negative English opinion word dictionary compiled by Liu et al. [15] as features from the unstructured message part of the log file for failure diagnosis. However, these experiments did not improve on the results presented in Table 4.1. This confirms the results from previous studies that negative log messages do not correlate well with general failures in the system [19]. Using the English dictionary for distinguishing between words and variables in the log files improved our results over just exploiting the structure of the log messages. However, it must be mentioned that using a dictionary for capturing words loses information when there are misspellings, application-specific terms or abbreviations.
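A minimal sketch of the word/variable split described in the preceding paragraph is given below. The tokenizer, the word list and the example message are assumptions for illustration; the thesis does not specify which English dictionary or tokenization was used.

from collections import Counter

# Tiny stand-in for an English word list (e.g. a system dictionary file).
english_words = {"connection", "refused", "to", "database", "retry", "in", "seconds"}

def word_count_features(message: str, dictionary: set) -> Counter:
    """Count dictionary words in a log message; non-dictionary tokens
    (identifiers, numbers, misspellings) are lumped together as variables."""
    features = Counter()
    for token in message.lower().replace(",", " ").split():
        if token in dictionary:
            features[token] += 1          # English word -> its own count feature
        else:
            features["<variable>"] += 1   # anything else -> generic variable count
    return features

print(word_count_features(
    "Connection refused to database db-node-42, retry in 5 seconds",
    english_words))

As the paragraph above notes, this scheme discards information carried by misspellings, abbreviations and application-specific terms, since all of them collapse into the generic variable count.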
The results for diagnosing failures in the single component failure use case are similar to the results from the study by Chen et al. [5], where they correctly identified 100% of faults with a false positive rate of 25% on log files from eBay's Centralized Application Logging framework. For the double component use case they achieved a 93% identification rate, which is a significantly better result than what was achieved in the project at hand. As mentioned earlier, we believe that the reason for such drastic differences between the results of the single fault use case and the double fault use case is interference between the components responsible for inter-component communication and the other components.

5.1.2 Workload Prediction

In the case of workload prediction we can see that in the first two use cases the SVR algorithm has the best results. SVR and linear regression predict significantly better than the regression tree algorithm in both use cases. Future studies could explore whether using non-linear kernel functions would improve the performance of SVR even further.

When predicting CPU utilization, all three evaluated regressors have better results than the baseline regressor, and the SVR RMSE scores are better than in the study by Nikravesh et al. [18], but slightly worse than in the studies by Ajila and Bankole [2] and Islam et al. [9]. However, these predictions are meant to be the input for the planning phase of the horizontal auto-scaler (a toy sketch of this interface is given at the end of this subsection), and we can see from Figure 4.4 that the regressors cannot accurately predict CPU utilization when more than 90% of the resources are utilized, which is precisely when a breach of SLAs is possible. For this reason we believe that further studies on suitable feature vectors and data preprocessing are needed for the CPU usage prediction to be usable for horizontal auto-scaling.

We hypothesize that the reason why the results from the experiment where we attempted to predict CPU utilization were not as good as the results from some other studies [2, 9] is our data set. The CPU utilization data is noisier than the data used in other studies. It seems that the CPU utilization pattern, which was created using our custom workload generator, was significantly noisier than the workload generated by the TPC-W benchmark, which simulates the workload of a web server and database application environment and which is commonly used in related studies [18, 2, 9].

However, it must be noted that none of the experiments were conducted in production environments and both systems were deployed in a single-node cluster. Moreover, a workload generator and artificial failure injection were used, which might not have all the properties of actual system workload and failures. Even though both methods are accepted in the field of log analysis, and experts of the domain were included in designing and implementing the failures and the workload generator, further studies might be necessary to find out whether these results are reproducible in production environments.
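As referenced above, the following is a toy illustration of how a workload prediction could feed the planning phase of a horizontal auto-scaler. The function name, thresholds and scaling policy are hypothetical and not part of the thesis; the sketch only shows the intended interface between prediction and planning.

def plan_scaling(predicted_cpu_pct: float, current_nodes: int,
                 scale_out_at: float = 80.0, scale_in_at: float = 30.0) -> int:
    """Toy planning step: decide the next cluster size from a predicted
    CPU utilization. Thresholds are illustrative, not from the thesis."""
    if predicted_cpu_pct > scale_out_at:
        return current_nodes + 1   # predicted saturation: add a node before an SLA breach
    if predicted_cpu_pct < scale_in_at and current_nodes > 1:
        return current_nodes - 1   # predicted idle capacity: release a node
    return current_nodes

print(plan_scaling(predicted_cpu_pct=92.0, current_nodes=3))  # 4

A policy like this is only as good as its input, which is why accurate predictions near saturation matter most.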
5.2 Conclusions

Runtime errors are unavoidable in cloud computing systems due to their size and complexity. A method for diagnosing failures using log analysis was presented in this thesis. Classification trees and random forests were compared using this method. Our experimental results show that classification trees perform as well as random forests, both being able to classify the failed component with 100% accuracy when a single cloud computing system component failed, and with approximately 72%-73% accuracy when one or two components failed at the same time. Additionally, three regressors were compared for predicting workloads on a cloud computing system. The experimental results indicate that Support Vector Regression performs best when predicting disk space availability and memory usage. However, all three evaluated regressors struggle with predicting CPU usage, and further studies are required there.

References

[1] Naoki Abe, Bianca Zadrozny, and John Langford. Outlier detection by active learning. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 504–509. ACM, 2006.
[2] Samuel A. Ajila and Akindele A. Bankole. Cloud client prediction models using machine learning techniques. In Computer Software and Applications Conference (COMPSAC), 2013 IEEE 37th Annual, pages 134–142. IEEE, 2013.
[3] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, et al. A view of cloud computing. Communications of the ACM, 53(4):50–58, 2010.
[4] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
[5] Mike Chen, Alice X. Zheng, Jim Lloyd, Michael Jordan, Eric Brewer, et al. Failure diagnosis using decision trees. In Autonomic Computing, 2004. Proceedings. International Conference on, pages 36–43. IEEE, 2004.
[6] Harris Drucker, Chris J. C. Burges, Linda Kaufman, Alex Smola, and Vladimir Vapnik. Support vector regression machines. In Advances in Neural Information Processing Systems, pages 155–161, 1997.
[7] Song Fu. Performance metric selection for autonomic anomaly detection on cloud computing systems. In Global Telecommunications Conference (GLOBECOM 2011), 2011 IEEE, pages 1–5. IEEE, 2011.
[8] Ana Gainaru, Franck Cappello, Stefan Trausan-Matu, and Bill Kramer. Event log mining tool for large scale HPC systems. In Euro-Par 2011 Parallel Processing, pages 52–64. Springer, 2011.
[9] Sadeka Islam, Jacky Keung, Kevin Lee, and Anna Liu. Empirical prediction models for adaptive resource provisioning in the cloud. Future Generation Computer Systems, 28(1):155–162, 2012.
[10] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Springer, 2013.
[11] Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Paramvir Bahl. Detailed diagnosis in enterprise networks. ACM SIGCOMM Computer Communication Review, 39(4):243–254, 2009.
[12] Kamal Kc and Xiaohui Gu. ELT: Efficient log-based troubleshooting system for cloud computing infrastructures. In Reliable Distributed Systems (SRDS), 2011 30th IEEE Symposium on, pages 11–20. IEEE, 2011.
[13] Emre Kiciman and Armando Fox. Detecting application-level failures in component-based internet services. Neural Networks, IEEE Transactions on, 16(5):1027–1041, 2005.
[14] Chinghway Lim, Navjot Singh, and Shalini Yajnik. A log mining approach to failure analysis of enterprise telephony systems. In Dependable Systems and Networks With FTCS and DCC, 2008. DSN 2008. IEEE International Conference on, pages 398–403. IEEE, 2008.
[15] Bing Liu, Minqing Hu, and Junsheng Cheng. Opinion Observer: Analyzing and comparing opinions on the web. In Proceedings of the 14th International Conference on World Wide Web, pages 342–351. ACM, 2005.
[16] Tania Lorido-Botran, Jose Miguel-Alonso, and Jose A. Lozano. A review of auto-scaling techniques for elastic applications in cloud environments. Journal of Grid Computing, 12(4):559–592, 2014.
[17] John Mingers. An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3(4):319–342, 1989.
[18] Ali Yadavar Nikravesh, Samuel A. Ajila, and Chung-Horng Lung. Measuring prediction sensitivity of a cloud auto-scaling system. In Computer Software and Applications Conference Workshops (COMPSACW), 2014 IEEE 38th International, pages 690–695. IEEE, 2014.
[19] Adam Oliner and Jon Stearley. What supercomputers say: A study of five system logs. In Dependable Systems and Networks, 2007. DSN'07. 37th Annual IEEE/IFIP International Conference on, pages 575–584. IEEE, 2007.
[20] Adam Oliner, Archana Ganapathi, and Wei Xu. Advances and challenges in log analysis. Communications of the ACM, 55(2):55–61, 2012.
[21] Radu Prodan and Vlad Nae. Prediction-based real-time resource provisioning for massively multiplayer online games. Future Generation Computer Systems, 25(7):785–793, 2009.
[22] Andres Quiroz, Hyunjoo Kim, Manish Parashar, Nathan Gnanasambandam, and Naveen Sharma. Towards autonomic workload provisioning for enterprise grids and clouds. In Grid Computing, 2009. 10th IEEE/ACM International Conference on, pages 50–57. IEEE, 2009.
[23] Irina Rish, Mark Brodie, Sheng Ma, Natalia Odintsova, Alina Beygelzimer, Genady Grabarnik, and Karina Hernandez. Adaptive diagnosis in distributed systems. Neural Networks, IEEE Transactions on, 16(5):1088–1109, 2005.
[24] Ingo Steinwart, Don R. Hush, and Clint Scovel. A classification framework for anomaly detection. Journal of Machine Learning Research, pages 211–232, 2005.
[25] Narate Taerat, Jim Brandt, Ann Gentile, Matthew Wong, and Chokchai Leangsuksun. Baler: Deterministic, lossless log message clustering tool. Computer Science - Research and Development, 26(3-4):285–295, 2011.
[26] James P. Theiler and D. Michael Cai. Resampling approach for anomaly detection in multispectral images. In AeroSense 2003, pages 230–240. International Society for Optics and Photonics, 2003.
[27] Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael I. Jordan. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 117–132. ACM, 2009.
[28] Kenji Yamanishi and Yuko Maruyama. Dynamic syslog mining for network failure monitoring. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 499–508. ACM, 2005.

Appendix A

Tree Model

The classification tree trained for the single component failure use case in Cloud Deployer is visualized in Figure A.1. It can be seen that some component failures are easier to detect using classification trees than others. For example, it is enough to check only one feature value to determine that the failed component was messaging_bus, while it is necessary to check seven feature values to know that the failed component was cloud_deployer_manager. It can also be observed that the model is not visibly overfitting to the training data.

[Figure A.1: Classification tree for the single component failure use case in Cloud Deployer. Leaf nodes report the number of samples, the class distribution over the eight components and the predicted class (e.g. cloud_deployer_manager, database_service, cloud_deployer_1, openstack); internal nodes split on features such as X[48].]
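For reference, the following is a hedged sketch of how a tree like the one in Figure A.1 could be trained and inspected, assuming scikit-learn (consistent with the X[48]/samples/value/class labels visible in the figure). The training data is synthetic, the component list is shortened, and the depth is chosen only for illustration; none of this reproduces the thesis model.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative subset of component names from the thesis.
components = ["api_server", "cloud_deployer_manager", "database_service", "messaging_bus"]

# Synthetic stand-in for the word-count feature vectors and failed-component labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(400, 50))
y = rng.integers(0, len(components), size=400)

tree = DecisionTreeClassifier(max_depth=7, random_state=0).fit(X, y)

# Text rendering of the tree; the thesis used a graphical export (Figure A.1).
print(export_text(tree, feature_names=[f"X[{i}]" for i in range(50)]))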