Bare-metal servers as a container runtime

FOSDEM 2023

At Scaleway, we built a large-scale PXE-based imaging infrastructure to manage the fleet of machines that power our various storage services. Using this infrastructure, we can reliably deploy new machines, and reconfigure existing ones, in predictable ways. In this talk, we will explain the problems we faced, how we reasoned through them, and what we built to solve them. The outcome was improved reliability, decreased time to production, and increased stability, all without sacrificing usability or end-user experience.

At the outset of the project, the existing fleet deployment and management systems, which had grown organically over the previous years, were in need of an upgrade. We embarked on a new initiative with the aim of reducing the time to deployment, increasing the reliability of those deployments, and improving the maintainability of the management infrastructure as a whole. Leveraging PXE, Ansible, chroot, DHCP, and a clean monorepo backend, we were ultimately able to scale the number of managed machines by orders of magnitude without linearly scaling the time or resource commitments for the sysadmins. In this talk, we'll look at architectures, code examples, decision trees, challenges, and solutions. At the end of the session, the audience will have a better understanding of not only the technologies involved, but also the business cases behind those choices.

Speakers: Florian Florensa