Amina Guermouche - A New Hierarchical Fault Tolerance Protocol for MPI HPC Applications

Organized by: 

Arnaud Legrand


Amina Guermouche


High-performance computing will probably reach exascale within this decade. At that scale, the mean time between failures is expected to be a few hours, and existing fault tolerance protocols will no longer be suitable: coordinated checkpointing protocols force all processes to restart after a failure, while message logging protocols log all messages and their determinants. To overcome these limits, one can combine the two kinds of protocol and apply them to clusters of processes. Many protocols based on this idea already exist in the literature. This talk presents a new hierarchical protocol that, unlike all existing hierarchical protocols, logs only message payloads. It is based on a study of MPI applications, which shows that many MPI applications are "send-deterministic" and that, in many cases, the communication patterns of the application allow creating groups of processes. The talk will also address theoretical models of fault-tolerant protocols.