Traditionally, HPC systems are assumed to be in a secure, isolated environment, and
as many barriers as possible are removed in order to achieve the highest
possible performance. While these assumptions may still hold for traditional
simulation codes, many HPC clusters are now used for heterogeneous workloads.
Such workloads increasingly involve the integration of input data from a variety
of sources, notably in the life sciences. Scientists are now operating at the
population scale, where datasets are ultimately derived from real people. In
this talk we discuss some of the restrictions placed on the use of such
datasets, how those restrictions conflict with the goals of high-performance
computing, and some alternative strategies that meet the data requirements
without hobbling the speed of analytical workloads.
HPC systems are by definition optimised to run user codes at the fastest
possible speed. Many of the normal safeguards and security procedures of Linux
systems are removed in furtherance of this goal. For example, firewalls are
often disabled and password-less SSH is usually enabled between nodes. The
parallel filesystems required for high performance often dictate further
security compromises. Normally these systems will be placed on an isolated
network, mitigating the risk to the wider infrastructure. In some commercial
organisations, an entire compute node will be dedicated to a single user. Conversely,
the norm in academic clusters is for different users to share the nodes.
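As a minimal illustration of what node sharing exposes, the Python sketch below
(assuming a Linux compute node whose /proc is mounted with default options, i.e.
without hidepid) lists the owner and command line of every process on the node;
on a shared node those command lines can embed other users' file paths, job
parameters or sample identifiers.

    import os
    import pwd

    # On a node whose /proc uses default mount options, any user can read the
    # command line of every other user's processes.
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            owner = pwd.getpwuid(os.stat(f'/proc/{pid}').st_uid).pw_name
            with open(f'/proc/{pid}/cmdline', 'rb') as fh:
                cmdline = fh.read().replace(b'\0', b' ').decode(errors='replace').strip()
            if cmdline:
                print(f'{owner}: {cmdline}')
        except (OSError, KeyError):
            continue  # process exited, hidepid is in effect, or the uid is unknown

Remounting /proc with hidepid=2 is one common mitigation, but it is exactly the
kind of hardening that tends to be left at the default on performance-focused
clusters.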
As long as the jobs running on these clusters were simulations, or used data
without access restrictions, none of the above approaches was problematic:
simple POSIX permissions were sufficient to provide basic security and isolation.
These assumptions used to hold for life sciences HPC jobs too, when those jobs
operated on data obtained in vitro or from non-human organisms.
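To make that baseline concrete, the sketch below shows the kind of group-based
POSIX isolation referred to above; the directory path and group name are
hypothetical, and changing ownership generally requires root.

    import grp
    import os
    import stat

    # Hypothetical project directory and Unix group; substitute local values.
    project_dir = '/shared/projects/sensitive_study'
    project_group = 'study_members'

    gid = grp.getgrnam(project_group).gr_gid
    os.makedirs(project_dir, exist_ok=True)
    os.chown(project_dir, -1, gid)  # keep the owner, assign the project group
    # rwx for owner and group, nothing for other users; the setgid bit makes new
    # files and subdirectories inherit the project group.
    os.chmod(project_dir, stat.S_IRWXU | stat.S_IRWXG | stat.S_ISGID)

This is adequate only for as long as group membership exactly mirrors who is
permitted to see the data, a property that becomes much harder to maintain once
per-dataset agreements enter the picture.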
In recent years there have been a variety of efforts to obtain human data, often
at population scale. Examples include UK Biobank, the 100,000 Genomes project,
and The Cancer Genome Atlas (TCGA). A quotation from the latter illustrates the
ambitions of these sorts of initiatives:
"The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to
accelerate our understanding of the molecular basis of cancer through the
application of genome analysis technologies, including large-scale genome
sequencing."
https://cancergenome.nih.gov/
Scientists wishing to use TCGA data need to register and comply with access
policies:
https://wiki.nci.nih.gov/display/TCGA/Access+Tiers
As an example, I facilitated the download of 800 TB of TCGA data onto eMedLab, and
careful attention was needed to ensure that collaborators of the lab that had
signed the TCGA agreements, but who were not themselves covered by those
agreements, were not able to see those data.
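One practical measure in that situation is a periodic audit of the permission
bits on the downloaded tree. The sketch below (the dataset path is hypothetical)
walks a controlled-access directory and flags anything that is readable,
writable or executable by users outside its owning group.

    import os
    import stat
    import sys

    # Hypothetical default location; pass the real dataset path as an argument.
    dataset_root = sys.argv[1] if len(sys.argv) > 1 else '/data/controlled/tcga'

    for dirpath, dirnames, filenames in os.walk(dataset_root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode  # do not follow symlinks
            except OSError:
                continue  # the entry vanished or cannot be inspected
            if mode & (stat.S_IROTH | stat.S_IWOTH | stat.S_IXOTH):
                print(f'world-accessible: {stat.filemode(mode)} {path}')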
The 100,000 Genomes project, organised by Genomics England, only allows analysis
via its strictly controlled 'embassy' system:
https://www.genomicsengland.co.uk/the-100000-genomes-project/data/
https://www.genomicsengland.co.uk/about-gecip/for-gecip-members/data-and-data-access/
UK Biobank, which holds data of many types from 500,000 individuals, also
has strict data access policies:
http://biobank.ctsu.ox.ac.uk/showcase/exinfo.cgi?src=accessingdataguide
In some cases, researchers must travel to specific locations and use physically
isolated computers in order to gain access to the data.
Clearly, these policies are in direct conflict with the barrier-free approach
that is the norm in HPC facilities.
In this presentation we discuss possible approaches to compliance
with data licensing and security requirements, while allowing good performance
for researchers working on increasingly large-scale analyses.