Slurm is the most widely used batch scheduler for HPC systems. The open source community is very active in developing the Slurm ecosystem, contributing CLI tools for accounting and monitoring, notebook integrations, and more. Many of these client environments are nowadays built on containers, which have become a ubiquitous way of running applications. However, this way of working poses new challenges in HPC environments, especially when using Slurm. Slurm requires careful management of shared cluster secrets and cluster-wide configuration files, which must be kept in sync for it to work efficiently and securely. This talk proposes a novel and simple tool called Straw, which allows the creation of secret-less and config-less Slurm client environments. It thereby simplifies the creation of (containerised) environments by removing the burden of maintaining config files, sensitive munge secrets, and additional daemons.
This talk will first provide an introduction to Slurm, followed by a description (mostly drawing from personal experience) of common patterns and pitfalls when creating containers that interact with Slurm clusters for different purposes (monitoring, notebooks, etc.). Next, I will introduce Straw, explaining why it was needed and why, despite its simplicity (it mostly just fetches a bunch of config files), it can perform a task that regular Slurm tools can't, thereby simplifying Slurm client environments. Finally, I will conclude by showing a simple example of how the tool can be used, and how it compares to the usual scenarios in which config files, extra daemons, and secrets need to be carefully managed. If time allows, I might detail some of the weaknesses of this approach: the Slurm protocol isn't really documented, so this tool relies on "reverse engineering" (as far as one can call it reverse engineering when no documentation exists but the code is available) to keep up with new Slurm releases.
Speakers: Pablo Llopis Sanmillan