Traditionally, HPC systems are assumed to be in a secure, isolated environment, and
as many barriers as possible are removed in order to achieve the highest
possible performance. While these assumptions may still hold for traditional
simulation codes, many HPC clusters are now used for heterogeneous workloads.
Such workloads increasingly involve the integration of input data from a variety
of sources, notably in the life sciences. Scientists are now operating at the
population scale, where datasets are ultimately derived from real people. In
this talk we discuss some of the restrictions placed on the use of such
datasets, how those restrictions conflict with the goals of high-performance
computing, and some alternative strategies that meet the data requirements
without hobbling the speed of analytical workloads.
HPC systems are by definition optimised to run user codes at the fastest
possible speed. Many of the normal safeguards and security procedures of Linux
systems are removed in furtherance of this goal. For example, firewalls are
often disabled and password-less SSH is usually enabled between nodes. The
parallel filesystems required for high performance often dictate further
security compromises. Normally these systems will be placed on an isolated
network, mitigating the risk to the wider infrastructure. In some commercial
organisations, an entire compute node will be dedicated to a single user. Conversely,
the norm in academic clusters is for different users to share the nodes.
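As a minimal illustration of what node sharing exposes, the Python sketch below
(assuming a Linux compute node whose /proc is mounted with default options, i.e.
without hidepid) lists the owner and command line of every process on the node;
on a shared node those command lines can embed other users' file paths, job
parameters or sample identifiers.

    import os
    import pwd

    # On a node whose /proc uses default mount options, any user can read the
    # command line of every other user's processes.
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            owner = pwd.getpwuid(os.stat(f'/proc/{pid}').st_uid).pw_name
            with open(f'/proc/{pid}/cmdline', 'rb') as fh:
                cmdline = fh.read().replace(b'\0', b' ').decode(errors='replace').strip()
            if cmdline:
                print(f'{owner}: {cmdline}')
        except (OSError, KeyError):
            continue  # process exited, hidepid is in effect, or the uid is unknown

Remounting /proc with hidepid=2 is one common mitigation, but it is exactly the
kind of hardening that tends to be left at the default on performance-focused
clusters.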
As long as the jobs running on these clusters were simulations, or used data
without access restrictions, none of the above approaches was problematic:
simple POSIX permissions were sufficient to provide basic security and isolation.
These assumptions used to hold for life sciences HPC jobs too, when those jobs
operated on data obtained in vitro or from non-human organisms.
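To make that baseline concrete, the sketch below shows the kind of group-based
POSIX isolation referred to above; the directory path and group name are
hypothetical, and changing ownership generally requires root.

    import grp
    import os
    import stat

    # Hypothetical project directory and Unix group; substitute local values.
    project_dir = '/shared/projects/sensitive_study'
    project_group = 'study_members'

    gid = grp.getgrnam(project_group).gr_gid
    os.makedirs(project_dir, exist_ok=True)
    os.chown(project_dir, -1, gid)  # keep the owner, assign the project group
    # rwx for owner and group, nothing for other users; the setgid bit makes new
    # files and subdirectories inherit the project group.
    os.chmod(project_dir, stat.S_IRWXU | stat.S_IRWXG | stat.S_ISGID)

This is adequate only for as long as group membership exactly mirrors who is
permitted to see the data, a property that becomes much harder to maintain once
per-dataset agreements enter the picture.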
In recent years there have been a variety of efforts to obtain human data, often
at population scale. Examples include UK Biobank, the 100,000 Genomes project,
and The Cancer Genome Atlas (TCGA). A quotation from the latter illustrates the
ambitions of these sorts of initiatives:
"The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to
accelerate our understanding of the molecular basis of cancer through the
application of genome analysis technologies, including large-scale genome
sequencing."
https://cancergenome.nih.gov/
Scientists wishing to use TCGA data need to register and comply with access
policies:
https://wiki.nci.nih.gov/display/TCGA/Access+Tiers
As an example, I facilitated the download of 800 TB of TCGA data onto eMedLab, and
careful attention was needed to ensure that collaborators of the lab that had
signed the TCGA agreements, but who were not themselves covered by those
agreements, were not able to see those data.
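One practical measure in that situation is a periodic audit of the permission
bits on the downloaded tree. The sketch below (the dataset path is hypothetical)
walks a controlled-access directory and flags anything that is readable,
writable or executable by users outside its owning group.

    import os
    import stat
    import sys

    # Hypothetical default location; pass the real dataset path as an argument.
    dataset_root = sys.argv[1] if len(sys.argv) > 1 else '/data/controlled/tcga'

    for dirpath, dirnames, filenames in os.walk(dataset_root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode  # do not follow symlinks
            except OSError:
                continue  # the entry vanished or cannot be inspected
            if mode & (stat.S_IROTH | stat.S_IWOTH | stat.S_IXOTH):
                print(f'world-accessible: {stat.filemode(mode)} {path}')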
The 100,000 Genomes project, organised by Genomics England, only allows analysis
via its strictly controlled 'embassy' system:
https://www.genomicsengland.co.uk/the-100000-genomes-project/data/
https://www.genomicsengland.co.uk/about-gecip/for-gecip-members/data-and-data-access/
UK Biobank, which holds data of many types from 500,000 individuals, also
has strict data access policies:
http://biobank.ctsu.ox.ac.uk/showcase/exinfo.cgi?src=accessingdataguide
In some cases, researchers must travel to specific locations and use physically
isolated computers in order to gain access to the data.
Clearly, these policies are in direct conflict with the barrier-free approach
that is the norm in HPC facilities.
In this presentation we discuss possible approaches to compliance
with data licensing and security requirements, while allowing good performance
for researchers working on increasingly large-scale analyses.