HPC systems have traditionally been operated as monolithic installations on bare-metal hardware, used primarily by users with a computational background who submit classic batch jobs. However, the commoditization of compute resources and the introduction of new scientific fields such as the life sciences to high performance computing have caused a shift in this paradigm. Today, an increasing amount of biological software is made accessible through web portals. This improved ease of use has led to a democratization of access to computational resources.
Users from these fields don’t have the same computational background as traditional HPC users from physics or chemistry, and they bring different kinds of workloads and applications that don’t fit traditional non-interactive batch scheduling and resource management systems. At the same time, cloud computing is becoming more and more relevant, and various efforts to lift HPC into the cloud have been started.
We manage the HPC infrastructure for three life science and two particle physics institutions at the Vienna Bio Center (VBC). For the new HPC system procured at the end of 2018, we decided to go with an on-premises cloud framework based on OpenStack to accommodate the various emerging workflows and programs. OpenStack is not a finished product and requires a considerable amount of engineering: it took us around two years of testing and development to feel confident deploying the new HPC infrastructure on top of it. Since summer 2019, our 200-node production SLURM cluster has been running on top of VMs in OpenStack.
In this talk we want to share our experiences from our endeavor into HPC on OpenStack. We will briefly discuss the reasoning behind HPC in the cloud, and behind OpenStack specifically.
Often, these kinds of projects either fade away completely in case of failure or get published in a high-level white paper that is only useful as marketing material.
We want to share our honest experience from both the implementer and the operator perspective. We will discuss how we use three environments to test updates and configuration changes, and explain our approach to automation and infrastructure as code all the way from the underlying infrastructure to the SLURM payload, as well as how we keep our sanity using development procedures built around pull requests and code reviews. We will also share some stories from the trenches, such as why you still learn new things about OpenStack after 1000 deployments, or how a simple configuration change can destroy performance.
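To give a flavor of what infrastructure as code means at the lowest layer, the following is a minimal, illustrative sketch of provisioning a single virtual SLURM compute node with the openstacksdk Python library. The cloud, flavor, image, network, and key names are hypothetical placeholders, not the actual tooling or naming used in our deployment, which builds on higher-level automation.

# Minimal sketch: booting one virtual SLURM compute node via openstacksdk.
# All names (cloud, flavor, image, network, key) are illustrative placeholders.
import openstack

# Credentials and endpoints are read from clouds.yaml for the named cloud.
conn = openstack.connect(cloud="vbc-hpc")

# Look up the building blocks of the VM by name.
flavor = conn.compute.find_flavor("slurm.compute.large")
image = conn.image.find_image("centos7-slurm-node")
network = conn.network.find_network("hpc-internal")

# Boot the VM; configuration management later turns it into a SLURM worker
# and registers it with the controller.
server = conn.compute.create_server(
    name="slurm-node-001",
    flavor_id=flavor.id,
    image_id=image.id,
    networks=[{"uuid": network.id}],
    key_name="hpc-admin",
)

# Block until the VM is ACTIVE before handing it over to configuration management.
server = conn.compute.wait_for_server(server)
print(f"{server.name} is {server.status}")

In practice, such calls are wrapped in version-controlled automation and rolled out through the same pull-request and review workflow as every other change.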
This talk will contain information that you won’t find in success stories or white papers, but that is hopefully very helpful for anyone who considers deploying HPC on OpenStack.