Marc Platini - Machine learning applied to the analysis and to the prediction of failures in HPC systems

Organized by: 
Marc Platini

Jury:

  • Franck Cappello, research director, Argonne National Laboratory, reviewer
  • Jean-Marc Menaud, professor, IMT-Atlantique, reviewer
  • Sara Bouchenak, professor, INSA Lyon, examiner
  • Benoit Pelletier, R&D section director, ATOS, invited member
  • Noël de Palma, professor, Université Grenoble Alpes, thesis supervisor
  • Thomas Ropars, associate professor (HDR), Université Grenoble Alpes, co-supervisor


As supercomputers grow in size, the number of failures and abnormal events grows with them, which reduces the availability of these systems. To manage these failures and reduce their impact on HPC systems, it is important to implement solutions that help understand and predict them. HPC systems produce a large amount of monitoring data that contains useful information about their status. However, analysing these data is difficult and tedious, because the data reflect the complexity and size of HPC systems. The work presented in this thesis proposes machine-learning-based solutions to analyse these data in an automated way. More precisely, the thesis presents two main contributions: the first focuses on predicting processor overheating events in HPC systems; the second focuses on analysing and highlighting the relationships between the events recorded in the system logs. Both contributions are evaluated on real data from a large HPC system used in production.
To predict CPU overheating events, we propose a solution that uses only the temperature of the CPUs. It analyses the general shape of the temperature curve prior to an overheating event and uses a supervised learning model to learn automatically the correlations between this shape and overheating events. Relying on the general shape of the curve and on a supervised model makes it possible to learn from temperature data with low accuracy and from a limited number of overheating events. The evaluation shows that the solution is able to predict overheating events several minutes in advance with high accuracy and recall. It further shows that preventive actions based on these predictions can reduce the impact of overheating events on the system.
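The idea of learning from the general shape of a temperature curve can be illustrated with a minimal sketch. It is not the method of the thesis: the abstract does not specify the features or the model, so the two shape features (overall slope and late rise), the nearest-centroid classifier, the function names, and the integer-degree sample windows below are all assumptions chosen for illustration.

```python
from statistics import mean

def shape_features(window):
    """Summarise the general shape of a temperature window (oldest sample first)."""
    n = len(window)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(window)
    # Least-squares slope: overall trend in degrees per sample.
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, window))
             / sum((x - x_bar) ** 2 for x in xs))
    # Mean of the second half minus mean of the first half: how much the curve rises late.
    half = n // 2
    late_rise = mean(window[half:]) - mean(window[:half])
    return (slope, late_rise)

def train_centroids(windows, labels):
    """Supervised step (assumed model): average the feature vectors of each class."""
    by_label = {}
    for window, label in zip(windows, labels):
        by_label.setdefault(label, []).append(shape_features(window))
    return {label: tuple(mean(col) for col in zip(*feats))
            for label, feats in by_label.items()}

def predict(centroids, window):
    """Return the label whose centroid is closest to the window's shape features."""
    f = shape_features(window)
    return min(centroids,
               key=lambda label: sum((a - b) ** 2 for a, b in zip(f, centroids[label])))

# Hypothetical training data: coarse integer readings, label 1 = followed by overheating.
train_windows = [
    [60, 60, 61, 61, 62, 63, 65, 67],
    [58, 59, 59, 60, 62, 64, 66, 69],
    [60, 60, 61, 60, 60, 61, 60, 60],
    [55, 55, 54, 55, 55, 56, 55, 55],
]
train_labels = [1, 1, 0, 0]
centroids = train_centroids(train_windows, train_labels)

rising = predict(centroids, [62, 62, 63, 64, 65, 67, 69, 72])
steady = predict(centroids, [57, 57, 57, 58, 57, 57, 58, 57])
```

Because the features describe the trend rather than exact values, they change little when readings are rounded to whole degrees, which is one plausible reason a shape-based approach tolerates low-accuracy sensors.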
To analyse and automatically extract the causal relations between the events described in HPC system logs, we propose an unconventional use of a deep learning model. This type of model is classically used for prediction tasks. Thanks to the addition of a new layer proposed by state-of-the-art contributions from the machine learning community, it is possible to determine the weight of each of the model's inputs in its prediction. Using this information, we are able to detect the causal relations between the different events. The evaluation shows that the solution is able to extract the causal relations of the vast majority of events occurring in an HPC system. Moreover, its evaluation by administrators validates the highlighted correlations.
Both contributions and their evaluations show the benefit of using machine learning solutions for understanding and predicting failures in HPC systems by automating the analysis of monitoring data.
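For the log analysis, reading per-input weights out of a model can be sketched as follows. The abstract does not name the layer, so this sketch assumes an attention-style layer that turns one score per input event into normalised weights. The event names, the scores, and the threshold are hypothetical, and no trained model is involved.

```python
import math

def softmax(scores):
    """Normalise raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def related_events(event_names, scores, threshold):
    """Return (name, weight) pairs whose weight reaches the threshold, highest first."""
    weights = softmax(scores)
    ranked = sorted(zip(event_names, weights), key=lambda pair: pair[1], reverse=True)
    return [(name, weight) for name, weight in ranked if weight >= threshold]

# Hypothetical scores produced by the weighting layer when predicting one target event.
events = ["fan_speed_warning", "link_flap", "user_login", "ecc_error"]
scores = [2.0, 0.1, -1.0, 1.5]

candidates = related_events(events, scores, threshold=0.2)
candidate_names = [name for name, _ in candidates]
```

Inputs that receive a large weight are candidates for being related to the predicted event. Confirming that such a relation is causal still requires a check such as the administrator review mentioned in the abstract.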