Failure Prediction for Critical Infrastructures
Final Report Abstract
Critical infrastructures provide indispensable support for modern life. Among the best-known examples are the power grid, telephone and data networks, and transportation systems, but gas supply, water supply, medical services, and food supply are also considered critical infrastructures. An outage in any of these can have a severe impact on the economy or, even worse, on human lives. As the complexity of infrastructures continues to grow, the avoidance of major failures becomes a vital concern of any modern society. In order to avoid an upcoming failure, the first step is to be warned about it sufficiently in advance. It was this project's goal to lay a foundation for predicting failures in critical infrastructures. While on the one hand complexity and dynamicity grow, on the other hand such systems reveal more and more of their internal status by providing extensive monitoring facilities. This trend suggests that the monitoring data should be analyzed not only locally by each entity itself, but also on a wider scale, combining the data of many entities of the critical infrastructure in order to predict major outages of the entire infrastructure (or at least large parts of it). Because entities are in many cases privately owned and no details of their internal structure are available, they have to be treated as black boxes for which nothing but the provided monitoring data is available. Therefore, this project focused on data-driven failure prediction using model-driven machine-learning techniques. Along with the wealth of available monitoring data comes the problem of efficient, flexible, reliable, and trustworthy data delivery. To address this issue, it was the original goal of the project to build a monitoring data delivery infrastructure using a publish-subscribe concept implemented by a dependable and composable service-oriented architecture.
We have developed a new approach to event-based failure prediction that employs pattern recognition. The use of model-driven pattern recognition is motivated by the fact that dependencies exist among the entities in the infrastructure. Once one entity encounters a failure, dependent entities might have a problem, too. This leads to a chain reaction, as is typical for failures of critical infrastructures. The task of failure prediction is to identify those patterns that are symptomatic of failures and to recognize them quickly, before they turn into a severe outage. As a pattern recognizer, we have developed an extension to hidden Markov models that incorporates time into the underlying stochastic process. The resulting failure predictor shows various characteristics that are of key importance for critical infrastructure protection, such as the capability to handle changes in the order of events and the ability to tolerate noise in the data. We have conducted a detailed analysis of the failure prediction approach using data from an industrial telecommunication platform. Results have shown that, compared to the best-known event-based failure prediction approaches, our failure predictor advances the state of the art significantly. We have also developed novel and comprehensive methods for evaluating service availability as well as a method for evaluating business process availability. In addition, a fault taxonomy for service-oriented architectures and a comprehensive survey of dependability evaluation tools have been developed.
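To illustrate the pattern recognition idea, the following minimal sketch classifies an event sequence by comparing its likelihood under two plain discrete hidden Markov models, one assumed to be trained on sequences preceding failures and one on failure-free sequences. It is not the project's predictor: the time extension described above is omitted, and the event symbols, model parameters, threshold, and function names are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical event symbols: 0 = info, 1 = warning, 2 = error.
# Each model is (start, trans, emit); all numbers are made up for illustration.
FAILURE_MODEL = (
    [0.6, 0.4],                            # initial state probabilities
    [[0.7, 0.3], [0.2, 0.8]],              # state transition probabilities
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],    # per-state symbol probabilities
)
NORMAL_MODEL = (
    [0.9, 0.1],
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.8, 0.15, 0.05], [0.5, 0.4, 0.1]],
)


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    n = len(start)
    # Initialisation: start in state i and emit the first symbol.
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Induction: move to state j from any state i, then emit obs[t].
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        # Rescale to avoid underflow and accumulate the log of the scale factor.
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_p


def predicts_failure(obs, failure_model, normal_model, threshold=0.0):
    """Warn if the log-likelihood ratio favours the failure model."""
    score = log_likelihood(obs, *failure_model) - log_likelihood(obs, *normal_model)
    return score > threshold


if __name__ == "__main__":
    print(predicts_failure([1, 2, 2, 1, 2], FAILURE_MODEL, NORMAL_MODEL))
    print(predicts_failure([0, 0, 0, 0, 0], FAILURE_MODEL, NORMAL_MODEL))
```

Because each sequence is scored as a whole under a probabilistic model, a few reordered or spurious events lower the likelihood only gradually, which is one way to read the tolerance to noise and event reordering mentioned above. The threshold trades false warnings against missed failures.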
Publications
- Brüning, S.; Weißleder, S.; Malek, M.: A Fault Taxonomy for Service-Oriented Architecture, Fast Abstract in: 10th IEEE High Assurance Engineering Symposium (HASE'07), Dallas, Texas, November 14-16, 2007
- Chan, P.W.; Lyu, M.R.; Malek, M.: Reliable Web Services: Methodology, Experiment and Modeling, IEEE International Conference on Web Services (ICWS 2007), Salt Lake City, USA, July 2007
- Hoffmann, G.A.; Trivedi, K.S.; Malek, M.: A Best Practice Guide to Resource Forecasting for Computing Systems, IEEE Transactions on Reliability, December 2007
- Malek, M.: Online Dependability Assessment through Runtime Monitoring and Prediction, Panel Contribution, Proceedings of the 7th European Dependable Computing Symposium (EDCC-7), Vilnius, Lithuania, May 2008
- Malek, M.; Milic, B.; Milanovic, N.: Analytical Availability Assessment of IT Services, International Service Availability Symposium (ISAS 2008), Lecture Notes in Computer Science, vol. 5017, Springer Verlag, 2008
- Malek, M.; Reitenspieß, M.; van Moorsel, A.P.A. (eds.): 4th International Service Availability Symposium (ISAS), Lecture Notes in Computer Science, vol. 4526, Springer Verlag, 2007
- Milanovic, N.; Milic, B.; Malek, M.: Modeling Business Process Availability, IEEE International Workshop on Methodologies for Non-functional Properties in Services Computing (MNPSC), Honolulu, Hawaii, July 2008
- Milanovic, N.; Malek, M.: Adaptive Search- and Learning-Based Approaches for Automatic Web Service Composition, Chapter VI in "Modern Technologies in Web Services Research", Liang-Jie Zhang (ed.), 2008
- Milanovic, N.; Malek, M.: Adaptive Search- and Learning-based Approaches for Automatic Web Service Composition, Web Services Research and Practices, Volume 2, 2007
- Milanovic, N.; Malek, M.: Search Strategies for Automatic Web Service Composition, International Journal of Web Services Research 3(2), pp. 1-32, 2006
- Milanovic, N.; Malek, M.: Service-Oriented Operating System: A Key Element in Improving Service Availability, Proceedings of International Service Availability Symposium (ISAS 2007), pp. 31-42, Lecture Notes in Computer Science 4526, Springer, 2007
- Nanya, T.; Maruyama, F.; Pataricza, A.; Malek, M. (eds.): 5th International Service Availability Symposium (ISAS), Lecture Notes in Computer Science, vol. 5017, Springer Verlag, 2008
- Salfner, F. et al.: Hardware Reliability; Software Reliability; Performability, In: Dependability Metrics, I. Eusgeld, F. Freiling, R. Reussner (eds.), Lecture Notes in Computer Science, vol. 4909, Springer Verlag, Berlin, 2008
- Salfner, F.: Event-based Failure Prediction: An Extended Hidden Markov Model Approach, Dissertation.de - Verlag im Internet, Berlin, 2008
- Salfner, F.; Malek, M.: Using Hidden Semi-Markov Models for Effective Online Failure Prediction, In: IEEE Proceedings of the 26th Symposium on Reliable Distributed Systems (SRDS), Oct. 2007
- Salfner, F.; Schieschke, M.; Malek, M.: Predicting Failures of Computer Systems: A Case Study for a Telecommunication System, In: Proceedings of 11th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems (DPDNS), April 2006