Facilitating HPC job debugging through job scripts archival
FOSDEM 2020

SArchive is a lightweight tool that facilitates debugging HPC job issues by providing support teams with the exact version of the job script that was run, archived either on the filesystem, in Elasticsearch, or on a Kafka topic.

HPC schedulers usually keep a version of the user's job script in their spool directory for the lifetime of the job, i.e., from job submission until the job has run to completion, whether successfully or not. However, once the job has completed, the job script and associated files are removed to avoid accumulating a large number of files. HPC systems typically run several million jobs, if not many more, over their lifetime; it is not feasible to keep them all in the spool directory. When a job fails, user support teams are often asked to help figure out the cause of the failure, and for that it often helps to have the exact job script available. Since a typical scheduler setup changes every submitted script through, e.g., a submission filter, simply obtaining what the user submitted requires the extra hoop of running that script through the filter(s) again. Furthermore, users may have tweaked, changed, or removed the job script in the meantime, which adds to the difficulty of debugging the issue at hand.

SArchive aims to address this problem by providing user support teams with an exact copy of the script that was run, along with the exact additional files used by the scheduler, e.g., to set up the environment in which the job runs. It can be argued that making a backup copy is actually the job of the scheduler itself, but we decided to use a tool outside the scheduler. This has two advantages: (i) one need not have access to the scheduler's source code (not all schedulers are open source), and (ii) sites running multiple schedulers need not make changes to each of them, but only to SArchive — which should be a fairly limited effort, if any at all. SArchive is currently tailored towards the Slurm scheduler (hence the name), but it also supports the Torque resource manager. Adding support for other schedulers should be fairly straightforward — pull requests are welcome :)
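
To make the idea concrete, the sketch below shows the general approach in Rust (the language SArchive is written in): scan the scheduler's spool directory and copy every job script to an archive before the scheduler removes it. This is a simplified, polling-based illustration, not SArchive's actual code; the spool and archive paths are placeholders.

    use std::collections::HashSet;
    use std::fs;
    use std::path::Path;
    use std::thread::sleep;
    use std::time::Duration;

    /// Copy any job script we have not seen before from the scheduler's
    /// spool directory into the archive directory. Paths and the polling
    /// approach are illustrative only.
    fn archive_new_scripts(
        spool: &Path,
        archive: &Path,
        seen: &mut HashSet<String>,
    ) -> std::io::Result<()> {
        for entry in fs::read_dir(spool)? {
            let entry = entry?;
            if !entry.file_type()?.is_file() {
                continue;
            }
            let name = entry.file_name().to_string_lossy().into_owned();
            if seen.contains(&name) {
                continue;
            }
            // Copy the file before the scheduler removes it at job completion.
            fs::copy(entry.path(), archive.join(&name))?;
            seen.insert(name);
        }
        Ok(())
    }

    fn main() -> std::io::Result<()> {
        // Hypothetical locations; a real setup would point at the scheduler's
        // actual spool hierarchy and a site-specific archive directory.
        let spool = Path::new("/var/spool/slurm/jobs");
        let archive = Path::new("/var/lib/sarchive/archive");
        let mut seen = HashSet::new();

        loop {
            archive_new_scripts(spool, archive, &mut seen)?;
            sleep(Duration::from_secs(1));
        }
    }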

Currently, SArchive provides three archival options: storing archived files in a file hierarchy, shipping them to Elasticsearch, or producing them to a Kafka topic. File archival is fairly feature-complete; the code for shipping to Elasticsearch and Kafka is still under development and only covers what is needed in our (HPCUGent) specific setup — which may evolve.
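
One way to structure such pluggable backends — a hedged sketch with illustrative names, not SArchive's actual types — is a common interface that each backend implements: the file backend writes the script into a directory hierarchy, while Elasticsearch and Kafka backends would implement the same interface by indexing the script as a document or producing it to a topic.

    use std::fs;
    use std::io;
    use std::path::PathBuf;

    /// Common interface for archival backends (names are illustrative).
    trait Archiver {
        fn archive(&self, job_id: &str, script: &[u8]) -> io::Result<()>;
    }

    /// File backend: write the script into a directory hierarchy on disk.
    struct FileArchiver {
        root: PathBuf,
    }

    impl Archiver for FileArchiver {
        fn archive(&self, job_id: &str, script: &[u8]) -> io::Result<()> {
            let target = self.root.join(format!("job_{}.sh", job_id));
            fs::write(target, script)
        }
    }

    // Elasticsearch and Kafka backends would implement the same trait.

    fn main() -> io::Result<()> {
        // Hypothetical archive location and job script, for illustration.
        let archiver = FileArchiver { root: PathBuf::from("/tmp/sarchive-demo") };
        fs::create_dir_all(&archiver.root)?;
        archiver.archive("123456", b"#!/bin/bash\necho hello\n")
    }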

Speakers: Andy Georges