Does data security rule out high performance?
FOSDEM 2018

Traditionally, HPC systems are assumed to sit in a secure, isolated environment, and as many barriers as possible are removed in order to achieve the highest possible performance. While these assumptions may still hold for traditional simulation codes, many HPC clusters are now used for heterogeneous workloads. Such workloads increasingly involve the integration of input data from a variety of sources, notably in the life sciences. Scientists are now operating at the population scale, where datasets are ultimately derived from real people. In this talk we discuss some of the restrictions placed on the use of such datasets, how those restrictions interfere with the goal of high performance computing, and some alternative strategies that meet the data requirements without hobbling the speed of analytical workloads.

HPC systems are by definition optimised to run user codes at the fastest possible speed. Many of the normal safeguards and security procedures of Linux systems are removed in furtherance of this goal. For example, firewalls are often disabled and password-less SSH is usually enabled between nodes. The parallel filesystems required for high performance often dictate further security compromises. Normally, these systems are placed on an isolated network, mitigating the risk to the wider infrastructure. In some commercial organisations an entire compute node will be dedicated to a single user; conversely, the norm on academic clusters is for different users to share nodes.

As long as the jobs running on these clusters were simulations, or used data with no access concerns, none of the above approaches was problematic. Simple POSIX permissions were sufficient to provide basic security and isolation. These assumptions used to hold for life sciences HPC jobs too, which operated on data obtained in vivo or from non-human organisms.
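To make that model concrete, here is a minimal sketch in Python of the sort of check it relies on; the directory path and Unix group name are hypothetical, not taken from any particular cluster. A dataset is considered isolated if it is owned by the project group and carries no permission bits for 'other' users.

    import grp
    import os
    import stat

    # Hypothetical project directory and Unix group -- adjust for a real site.
    DATA_DIR = "/data/project_x"
    PROJECT_GROUP = "project_x_users"

    def is_group_isolated(path: str, group: str) -> bool:
        """Return True if 'path' is owned by 'group' and grants no
        read/write/execute access to 'other' users -- the basic POSIX
        isolation model used on shared academic clusters."""
        st = os.stat(path)
        expected_gid = grp.getgrnam(group).gr_gid
        other_bits = st.st_mode & (stat.S_IROTH | stat.S_IWOTH | stat.S_IXOTH)
        return st.st_gid == expected_gid and other_bits == 0

    if __name__ == "__main__":
        if is_group_isolated(DATA_DIR, PROJECT_GROUP):
            print(f"{DATA_DIR} is restricted to members of {PROJECT_GROUP}")
        else:
            print(f"WARNING: {DATA_DIR} is visible beyond {PROJECT_GROUP}")

When every user on a shared node is trusted, this is all the isolation that is needed; the access policies described below are where that assumption breaks down.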

In recent years there have been a variety of efforts to obtain human data, often at population scale. Examples include UK Biobank, the 100,000 Genomes project, and The Cancer Genome Atlas (TCGA). A quotation from the latter illustrates the ambitions of these sorts of initiatives:

"The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing."

https://cancergenome.nih.gov/

Scientists wishing to use TCGA data need to register and comply with access policies:

https://wiki.nci.nih.gov/display/TCGA/Access+Tiers

As an example, I facilitated the download of 800 TB of TCGA data onto eMedLab, and careful attention was needed to ensure that collaborators of the lab that had signed the TCGA agreements were not able to see those data.
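As a rough illustration of the kind of audit this involved (the paths and group name below are hypothetical, not the actual eMedLab layout), a short Python walk over the dataset can flag any entry whose group ownership or permission bits would expose it to users who have not signed the agreements:

    import grp
    import os
    import stat

    # Hypothetical locations and group -- not the real eMedLab layout.
    TCGA_ROOT = "/data/tcga"
    APPROVED_GROUP = "tcga_signed"

    def audit_tree(root: str, group: str):
        """Yield paths under 'root' that are world-readable or not owned by
        the approved group, i.e. entries that users outside the group that
        signed the data access agreement could read."""
        approved_gid = grp.getgrnam(group).gr_gid
        for dirpath, dirnames, filenames in os.walk(root):
            for name in dirnames + filenames:
                full = os.path.join(dirpath, name)
                st = os.stat(full)
                world_readable = bool(st.st_mode & stat.S_IROTH)
                wrong_group = st.st_gid != approved_gid
                if world_readable or wrong_group:
                    yield full

    if __name__ == "__main__":
        for exposed in audit_tree(TCGA_ROOT, APPROVED_GROUP):
            print(f"exposed: {exposed}")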

The 100,000 Genomes project, organised by Genomics England, allows analysis only via its strictly controlled 'embassy' system:

https://www.genomicsengland.co.uk/the-100000-genomes-project/data/

https://www.genomicsengland.co.uk/about-gecip/for-gecip-members/data-and-data-access/

UK Biobank, which contains various data sources from 500,000 individuals, also has strict data access policies:

http://biobank.ctsu.ox.ac.uk/showcase/exinfo.cgi?src=accessingdataguide

In some cases, researchers must travel to specific locations with physically isolated computers in order to gain access to the data.

Clearly these policies are directly in conflict with the barrier-free approach that is normal in HPC facilities.

In this presentation we discuss possible approaches to compliance with data licensing and security requirements, while allowing good performance for researchers working on increasingly large-scale analyses.

Speaker: Adam Huffman