Spack is a software distribution targeted at HPC systems, with over 6,800 packages. While Spack has long been a source-only distribution, in June 2022 we added public build caches that offer fixed and rolling binary releases. With 400-500 pull requests per month, most of them package updates, this was a non-trivial task. Spack's build cache model is similar to those of Nix and Guix: it assumes no ABI compatibility, so any change to a package's dependencies triggers rebuilds of its dependents. Despite these challenges, we have built a CI system that builds and tests packages on pull requests and on release branches, for a subset of packages comprising several thousand builds for x86_64, Power, and aarch64, as well as for AMD and NVIDIA GPUs and Intel's oneAPI compilers. This talk will cover some of the main challenges we have faced: reliable build infrastructure, integration with pull request workflows, Kubernetes auto-scaling and AWS instance selection, and optimizing build performance in the cloud. We'll talk about the infrastructure, as well as the algorithmic complexities of choosing CI commits carefully to minimize builds.
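To make the rebuild semantics concrete, here is a minimal Python sketch of the hash-based cache keys such a model implies (this is an illustration, not Spack's actual implementation, which hashes full concretized specs): a package's key covers its own recipe plus the keys of all of its dependencies, so a change anywhere below a package in the dependency graph produces a new key and forces a rebuild.

```python
import hashlib

def cache_key(pkg, recipes, deps):
    """Content-addressed cache key for a package build (illustrative).

    recipes: mapping of package name -> recipe/configuration contents
    deps:    mapping of package name -> list of dependency names

    Because each key folds in the keys of all dependencies, changing any
    recipe in a package's subtree changes the package's key. That is why
    a no-ABI-compatibility model rebuilds every dependent of a change.
    """
    h = hashlib.sha256(recipes[pkg].encode())
    for dep in sorted(deps.get(pkg, [])):
        h.update(cache_key(dep, recipes, deps).encode())
    return h.hexdigest()

recipes = {"zlib": "v1", "hdf5": "v1"}
deps = {"hdf5": ["zlib"]}
old = cache_key("hdf5", recipes, deps)
recipes["zlib"] = "v2"                           # a change in a dependency...
assert cache_key("hdf5", recipes, deps) != old   # ...invalidates its dependents
```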
This talk is targeted at distribution maintainers, particularly people managing build farms for large distributions. The goal of the presentation is to highlight the challenges of managing a rolling release for continuously evolving distributions hosted on GitHub.
Spack manages 400-500 pull requests per month and is a very active repository. Keeping all builds working all the time, particularly in a build-from-source system where the recipes are templated, is a big challenge. The same package may be built many different ways in Spack (e.g., for different CPUs, GPUs, or MPI implementations), and all of those configurations are managed by a single recipe and a sophisticated dependency/package-configuration resolver.
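For illustration, a sketch of what such a templated recipe looks like; `Example` is a made-up package, but the `variant` and `depends_on` directives are the real mechanism by which one recipe covers many configurations:

```python
# A hypothetical Spack recipe. One templated recipe like this covers
# CPU-only, MPI, and CUDA builds; the resolver (concretizer) picks one
# concrete configuration per install.
from spack.package import *


class Example(CMakePackage):
    """Hypothetical package used to illustrate variants."""

    homepage = "https://example.com"
    url = "https://example.com/example-1.0.tar.gz"

    version("1.0", sha256="0" * 64)  # placeholder checksum

    variant("mpi", default=True, description="Build with MPI support")
    variant("cuda", default=False, description="Build with CUDA support")

    depends_on("mpi", when="+mpi")   # any MPI provider (OpenMPI, MPICH, ...)
    depends_on("cuda", when="+cuda")

    def cmake_args(self):
        return [
            self.define_from_variant("ENABLE_MPI", "mpi"),
            self.define_from_variant("ENABLE_CUDA", "cuda"),
        ]
```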
The talk will cover how we've scaled build caching to manage PRs as well as release branches, and how we've set up infrastructure that allows us to request builders with specific processor microarchitectures. The CI system is orchestrated by a high-availability GitLab CI instance in the cloud. Builds are automated and triggered by pull requests, with runners both in the cloud and on bare metal. We will talk about the architecture of the CI system, from the user-facing stack descriptions in YAML to backend services like Kubernetes, Karpenter, S3, and CloudFront, as well as the challenges of tuning runners for good build performance.
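For flavor, a minimal sketch of the kind of `spack.yaml` environment file that describes a CI stack (package names and the mirror URL are illustrative; the real files are larger and also carry a section mapping builds to GitLab runner configurations):

```yaml
# Illustrative spack.yaml for one CI stack.
spack:
  view: false
  specs:
    - hdf5 +mpi
    - gromacs +cuda
  packages:
    all:
      target: [x86_64_v3]   # pin a microarchitecture for reproducible binaries
  mirrors:
    buildcache: s3://example-bucket/mirror   # hypothetical binary mirror
```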
We'll also cover a bit about security in a completely PR-driven CI system: specifically, how we ensure that arbitrary public CI users can't inject malicious code into our binaries.
Choosing which commits to build from the distribution tree has also been a challenge, and in a build-cached system, the commit a PR is merged with before testing can be a key factor in minimizing the number of builds that need to be done. Specifically, we try to maximize reuse of builds done on the rolling release branch while still allowing PRs to update frequently. Orchestrating this in CI can be difficult, particularly when most systems want to merge with the head of the mainline.
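A sketch of the commit-selection idea (the function and data names here are hypothetical): instead of merging a PR with the tip of the mainline, merge with the most recent mainline commit whose pipeline has already populated the build cache, so the PR pipeline only builds what the PR itself changes.

```python
def pick_merge_commit(mainline_commits, fully_built):
    """Pick the commit to merge a PR with before running CI (illustrative).

    mainline_commits: commits on the rolling release branch, newest first.
    fully_built:      set of commits whose pipeline completed, i.e. whose
                      builds are all available in the build cache.

    Merging with the newest fully built commit maximizes cache reuse:
    every package unaffected by the PR can be fetched instead of rebuilt.
    Merging with HEAD would instead force the PR to rebuild anything still
    in flight on the mainline.
    """
    for commit in mainline_commits:
        if commit in fully_built:
            return commit
    # Fall back to the oldest known commit if nothing is fully built yet.
    return mainline_commits[-1]

# Hypothetical history, newest first; only "c2" has a completed pipeline.
print(pick_merge_commit(["c4", "c3", "c2", "c1"], {"c2"}))  # -> "c2"
```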
Finally, we'll talk about some of the architectural decisions in Spack itself that had to change to better support CI, specifically the dependency resolver and how we choose exactly which stack configurations to build in CI for the rolling release.
All of these topics should help distribution managers better orchestrate build farms, and they should be particularly relevant to distributions that are similar to Spack, like Nix and Guix.
Speakers: Todd Gamblin