HPC support teams are often tasked with installing
scientific software for their user community, and managing a
large software stack quickly becomes challenging. Software
installation requires a team with domain expertise and countless
hours of troubleshooting to build an optimal software state that
is tuned to the architecture. In the past decade, two software
build tools (EasyBuild, Spack) have emerged and are widely
accepted in the HPC community to accelerate building a complete
software stack for HPC systems. The support team is constantly
involved in fulfilling software requests for end-users, which
leads to an ever-growing software ecosystem. Once a software
package is installed, the support team hands it off to the user
without any testing because testing scientific software requires
domain expertise. Some software packages are shipped with a test
suite that can be run post-build, while many others have no
mechanism for testing. This creates a knowledge gap between the
HPC support team and end-users on the type of
testing to do. Some HPC centers may have developed in-house test
scripts that are suitable for testing their software, but these tests
are not portable due to hardcoded paths and are often
site-dependent. In addition, there is no collaboration between HPC
sites in building a test repository that will benefit the community.
In this talk I will present buildtest, a framework to automate software
testing for a software stack along with several module operations
that would be of interest to the HPC support team.
An HPC computing environment is a tightly coupled system that
includes a cluster of nodes and accelerators connected by
a high-speed interconnect, a parallel filesystem, multiple storage
tiers, a batch scheduler for users to submit jobs to the cluster, and
a software stack for users to run their workflows. A software
stack is a collection of compilers, MPI, libraries, system utilities
and scientific packages typically installed in a parallel filesystem.
A module tool such as environment-modules or Lmod is generally used
for loading the software environment into the users' shell environment.
Software is packaged in various forms that determine how
it is installed. A few package formats are: binary, Makefile,
CMake, Autoconf, GitHub, PyPI, Conda, RPM, tarball, RubyGem,
MakeCp, jar, and many more. This variety of packaging formats
creates a burden for the HPC support team, which must learn how to
build software when each format has a unique build process. Software
build tools like EasyBuild and Spack can build more than
1000 software packages by supporting many packaging
formats to address all sorts of software builds. EasyBuild and
Spack provide end-to-end software build automation that helps
HPC sites build a very large software stack with many
combinatorial software configurations. During installation,
some packages provide a test harness that can be executed
via EasyBuild or Spack, which typically invokes make test or
ctest for packages that follow the ConfigureMake, Autoconf, or
CMake install process.
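As a rough illustration of the kind of test harness described above, the following Python sketch picks a test command for a build directory. It is a minimal example under stated assumptions, not how EasyBuild or Spack implement this step: the function name pick_test_command is hypothetical, the presence of a Makefile does not guarantee that a test target exists, and the check for CTestTestfile.cmake assumes the CMake project enabled testing.

```python
from pathlib import Path
from typing import List, Optional


def pick_test_command(build_dir: str) -> Optional[List[str]]:
    """Return a test command for build_dir, or None if no harness is detected.

    Hypothetical helper for illustration only; real build tools use
    their own per-package logic.
    """
    root = Path(build_dir)
    # CMake writes CTestTestfile.cmake into build trees that enable testing.
    if (root / "CTestTestfile.cmake").is_file():
        return ["ctest", "--output-on-failure"]
    # Otherwise assume a conventional "make test" target (it may not exist).
    if (root / "Makefile").is_file():
        return ["make", "test"]
    # Many packages ship no test mechanism at all.
    return None
```

A caller could pass the returned list to subprocess.run inside the build directory and treat a None result as "no post-build test available", which is the case the rest of this talk is concerned with.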
Many HPC sites rely on their users for testing the software
stack, and some sites may develop in-house test scripts to run
sanity checks for popular scientific tools. Despite these efforts,
there is little or no collaboration between HPC sites on sharing
tests because the tests are site-specific and often
undocumented. For many sites, the HPC support team does not
have the time to conduct software stack testing because of: (1)
a lack of domain expertise and understaffing, and (2) the absence of
a standard test suite and framework to automate test build and execution.
Frankly, HPC support teams are so busy with important day-to-day
operations and engineering projects that software testing is either
neglected or left to end-users. This calls for a concerted
effort by the HPC community to build a strong open-source
community around software stack testing.
There are two points that need to be addressed. First, we need
a framework to automatically test the installed software stack.
Second, we need to build a test repository for scientific software
that is community driven and reusable across the HPC community.
An automated test framework is a harness for automating the
test creation process, but it requires community contributions to
grow this repository on a per-package basis. Before we
dive in, note that this talk focuses on conducting sanity checks of
the software stack, so tests will need to be generic, with simple
examples that can be compiled easily. In the future, buildtest will
focus on domain-specific tests once there is a strong community
behind this project.
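To make the idea of a generic sanity check concrete, here is a minimal Python sketch that compiles and runs a trivial C program with a given compiler. It is not buildtest's implementation or interface: the function sanity_check, the embedded source HELLO_C, and the default compiler name gcc are assumptions chosen for illustration. The sketch reports "skip" when the compiler is not on the PATH (for example, when the corresponding module has not been loaded).

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Trivial program used as the generic test case (an assumption for this sketch).
HELLO_C = '#include <stdio.h>\nint main(void) { puts("hello"); return 0; }\n'


def sanity_check(compiler: str = "gcc", source: str = HELLO_C) -> str:
    """Compile and run a tiny program; return 'pass', 'fail', or 'skip'."""
    if shutil.which(compiler) is None:
        return "skip"  # compiler not found on PATH
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "hello.c"
        exe = Path(tmp) / "hello"
        src.write_text(source)
        try:
            build = subprocess.run(
                [compiler, str(src), "-o", str(exe)],
                capture_output=True, text=True,
            )
            if build.returncode != 0:
                return "fail"  # compilation failed
            run = subprocess.run([str(exe)], capture_output=True, text=True)
        except OSError:
            return "fail"  # e.g. the binary could not be executed
        ok = run.returncode == 0 and run.stdout.strip() == "hello"
        return "pass" if ok else "fail"
```

Because the check depends only on a compiler name and a self-contained source file, it contains no hardcoded site paths, which is the portability property the shared test repository would need.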