In this session we present our experience migrating from traditional HPC to cloud-native HPC, using a compute-heavy scientific workflow that is usually carried out at national supercomputing centers. Our scientific application is atomistic biomolecular simulation with the GROMACS molecular dynamics simulation toolkit.
Molecular dynamics simulations are computationally challenging in two respects. First, individual simulations usually need to be parallelized over as many resources (cores, GPUs, nodes) as practicable, to reduce the time to solution from months down to weeks or possibly less. Second, we as scientists are not so much interested in individual simulations as in average properties of the simulated systems. These properties, however, can only be obtained from ensemble runs of many (typically hundreds of) slightly different replicas of the system, which requires an enormous amount of compute time.
Cloud-based HPC can address both challenges: The cloud offers as much compute time as desired, plus the possibility to efficiently scale individual simulations over multiple instances connected by a high-performance interconnect. We built a cloud-based HPC cluster in a straightforward and reproducible way by simplifying software management with Spack and cluster lifecycle management with AWS ParallelCluster. With the Spack package manager, diverse hardware is easily incorporated into a single cluster, e.g. instances with AMD, ARM, and Intel processors, instances with one or more GPUs, and instances with a high-performance interconnect.
On the cluster, we used several representative biomolecular systems to benchmark GROMACS performance on the various hardware available in the cloud, both on individual instances and across multiple instances. This way we identified which instances deliver the highest performance or the best performance-to-price ratio for GROMACS simulations.
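The distinction between highest performance and best performance-to-price ratio can be sketched in a few lines of Python. This is a minimal illustration, not our benchmarking code: the instance labels and all numbers below are hypothetical placeholders rather than measured results, and the only assumptions are that performance is given in ns/day (the unit GROMACS reports) and that price is given in USD per instance-hour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    instance: str        # hypothetical label, not a real instance type
    ns_per_day: float    # simulated nanoseconds per day of wall time
    usd_per_hour: float  # assumed hourly instance price

    @property
    def ns_per_usd(self) -> float:
        # Performance-to-price ratio: ns/day divided by the cost of one day.
        return self.ns_per_day / (self.usd_per_hour * 24.0)


def rank(results):
    """Return (fastest, best_value) among the given benchmark results."""
    fastest = max(results, key=lambda r: r.ns_per_day)
    best_value = max(results, key=lambda r: r.ns_per_usd)
    return fastest, best_value


# Placeholder numbers for illustration only; these are NOT measurements.
results = [
    BenchmarkResult("cpu-large", ns_per_day=40.0, usd_per_hour=2.0),
    BenchmarkResult("gpu-single", ns_per_day=100.0, usd_per_hour=1.0),
    BenchmarkResult("gpu-multi", ns_per_day=150.0, usd_per_hour=6.0),
]

fastest, best_value = rank(results)
print(f"fastest:    {fastest.instance} ({fastest.ns_per_day:.1f} ns/day)")
print(f"best value: {best_value.instance} ({best_value.ns_per_usd:.2f} ns/USD)")
```

With these placeholder values the two criteria pick different instances, which is why both rankings are worth reporting: the fastest instance shortens the time to solution of a single simulation, while the best-value instance maximizes the total trajectory length obtainable from a fixed budget.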
As a next step, we are preparing to run a large ensemble of 20,000 individual simulations in the cloud, using all globally available resources to reduce the time to solution as much as possible. In principle, such an ensemble simulation, which would occupy a medium-sized compute cluster for weeks or even months, could then finish within a day. Containerization will be a key concept here to provide a common software environment across a wide variety of hardware.
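The back-of-the-envelope reasoning behind this estimate can be sketched as follows. Only the ensemble size of 20,000 comes from the text; the runtime per simulation and the number of concurrently available instances are assumed values chosen for illustration. The sketch further assumes one simulation per instance, identical runtimes, and no queueing or startup overhead.

```python
import math


def ensemble_wall_time_hours(n_simulations: int,
                             hours_per_simulation: float,
                             concurrent_instances: int) -> float:
    """Estimate wall time when simulations run in equally long 'waves'."""
    if concurrent_instances < 1:
        raise ValueError("need at least one instance")
    # Each wave runs one simulation on every available instance.
    waves = math.ceil(n_simulations / concurrent_instances)
    return waves * hours_per_simulation


N_SIMULATIONS = 20_000        # ensemble size from the text
HOURS_PER_SIMULATION = 10.0   # assumed runtime of one simulation

# Assumed capacities: a medium-sized cluster vs. globally pooled cloud resources.
cluster_hours = ensemble_wall_time_hours(N_SIMULATIONS, HOURS_PER_SIMULATION, 200)
cloud_hours = ensemble_wall_time_hours(N_SIMULATIONS, HOURS_PER_SIMULATION, 10_000)

print(f"cluster: {cluster_hours / 24:.1f} days")
print(f"cloud:   {cloud_hours:.0f} hours")
```

Under these assumptions the ensemble takes 1,000 hours (about six weeks) on 200 concurrent instances but 20 hours on 10,000, so the time to solution is limited mainly by how many instances can be acquired at once.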