Event

Research Seminar: RM-replay: a high-fidelity tuning, optimization and exploration tool for resource management

  • Speaker  Maxime Martinasso, Chief Technical Officer, CSCS

  • Venue

    Room 1.030, Maison du Nombre (MNO), Campus Belval, 6, avenue de la Fonte, L-4364 Esch-sur-Alzette, Luxembourg

Leading hybrid and heterogeneous supercomputing systems process hundreds of thousands of jobs using complex scheduling algorithms and parameters. The centres operating these systems aim to achieve higher levels of resource utilization while being restricted by compliance with policy constraints. There is a critical need for a high-fidelity, high-performance tool with familiar interfaces that not only allows tuning and optimization of the operational job scheduler but also enables exploration of new resource management algorithms. We propose a new methodology and a tool called RM-Replay, which is not a simulator but a fast replay engine for production workloads. Slurm is used as a platform to demonstrate the capabilities of our replay engine. The tool's accuracy is discussed, and our investigation shows that scheduling metric values vary when better job runtime estimates are provided or topology-aware allocation is used. The presented methodology for creating fast replay engines can be extended to other complex systems.

Maxime Martinasso is a computer scientist and deputy of the Chief Technical Officer at CSCS, the Swiss National Supercomputing Centre. In his role, he is part of a team leading the Centre towards its strategic goals by managing the conception and development of key initiatives such as cloud technology and hardware benchmarking. His interests focus on performance modelling, networking, resource management and HPC technology in general. Previously, Maxime worked as an HPC specialist and team lead for a major oil industry company. He obtained his PhD on performance modelling applied to HPC technology in 2007 from the University Joseph Fourier, France.