Valentin Reis - Learning to control large-scale parallel platforms

Organized by: 
Valentin Reis

Providing the computational infrastructure needed to solve complex problems arising in modern society is a strategic challenge. Organizations usually address this challenge by building extreme-scale parallel and distributed platforms. High Performance Computing (HPC) vendors race for more computing power and storage capacity, leading to sophisticated, specialized Petascale platforms that will soon give way to Exascale platforms. These systems are centrally managed by dedicated software called Resource and Job Management Systems (RJMS). A crucial problem addressed by this software layer is job scheduling, in which the RJMS chooses when and on which resources computational tasks will be executed. This manuscript provides ways to address this scheduling problem.

No two platforms are identical: the infrastructure, user behavior, and the organization's goals all change from one system to another. We therefore argue that scheduling policies should adapt to the system's behavior. In this manuscript, we provide multiple ways to achieve this adaptivity. Through an experimental approach, we study various trade-offs between the complexity of the approach, the potential gain, and the risks taken.