ML inference acceleration for lightweight VMMs

FOSDEM 2021

The debate on how to deploy applications, monoliths or micro services, is in full swing. Part of this discussion relates to how the new paradigm incorporates support for accessing accelerators, e.g. GPUs, FPGAs. That kind of support has been made available to traditional programming models the last couple of decades and its tooling has evolved to be stable and standardized (eg. CUDA, OpenCL/OpenACC, Tensorflow etc.).

On the other hand, what does it mean for a highly distributed application instance (i.e. a Serverless deployment) to access an accelerator? Should the function invoked to classify an image, for instance, link against the whole acceleration runtime and program the hardware device itself? It seems quite counter-intuitive to create such bloated functions.

Things get more complicated when we consider the low-level layers of the service architecture. To ensure user and data isolation, infrastructure providers employ virtualization techniques. However, generic hardware accelerators are not designed to be shared by multiple untrusted tenants. Current solutions (device passthrough, API-remoting) impose inflexible setups, present security trade-offs and add significant performance overheads.

To this end, we introduce vAccel, a lightweight framework to expose hardware acceleration functionality to VM tenants. Our framework is based on a thin runtime system, vAccelRT, which is, essentially, an acceleration API: it offers support for a set of operators that use generic hardware acceleration frameworks to increase performance, such as machine learning and linear algebra operators. vAccelRT abstracts away any hardware/vendor-specific code by employing a modular design where backends implement bindings for popular acceleration frameworks and the frontend exposes a function prototype for each available acceleration function. On top of that, using an optimized paravirtual interface, vAccelRT is exposed to a VM’s user-space, where applications can benefit from hardware acceleration via a simple function call.

In this talk we present the design and implementation of vAccel on two KVM VMMs: QEMU and AWS Firecracker. We go through a brief design description and focus on the key aspects of enabling hardware acceleration for machine learning inference for ligthweight VMs both on x86_64 and aarch64 architectures. Our current implementation supports jetson-inference & TensorRT, as well as Google Coral TPU, while facilitating integration with NVIDIA GPUs (CUDA) and Intel Iris GPUs (OpenCL).

Finally, we present a demo of vAccel in action, using a containerized environment to simplify configuration & deployment

[1] https://blog.cloudkernels.net/posts/vaccel
[2] https://blog.cloudkernels.net/posts/vaccel_v2
[3] https://vaccel.org
[4] https://github.com/nubificus/docker-jetson-inference

Speakers: Anastassios Nanos Babis Chalios