This talk gives an overview of how we use Slurm to schedule the workloads of over 6000 scientists at NERSC while providing high throughput, ease of use and, ultimately, user satisfaction. With the emergence of data-intensive applications, it became necessary to update the classic scheduling infrastructure to handle things like user-defined software stacks (read: containers), data movement and storage provisioning. We did all of this and more through facilities provided by Slurm. In addition to these features, we will discuss priority management and quality of service, and how they can greatly improve the user experience of computational infrastructures.
This talk will be a walkthrough of the features that make Slurm great, using a supercomputing site as an example. None of the introduced interfaces is specific to the site in question; all of them can be used by the broader community. After a brief introduction to how a workload manager/scheduler works in general, and Slurm in particular, we'll go into some of the features and how they open up possibilities for frictionless work with all kinds of use cases, way beyond classical HPC workloads:
- Container integration
- Data staging
- On-demand filesystem provisioning
- On-the-fly job rewriting
- Cluster federation
- and a healthy plugin ecosystem
No previous knowledge of high-performance computing or batch processing is required.
Speakers: Georg Rath