The debate on how to deploy applications, as monoliths or as microservices, is
in full swing. Part of this discussion relates to how the new paradigm
incorporates support for accessing accelerators, e.g. GPUs and FPGAs. Such
support has been available to traditional programming models for the last
couple of decades, and its tooling has evolved to be stable and standardized
(e.g. CUDA, OpenCL/OpenACC, TensorFlow etc.).
On the other hand, what does it mean for a highly distributed application
instance (i.e. a serverless deployment) to access an accelerator? Should the
function invoked to classify an image, for instance, link against the whole
acceleration runtime and program the hardware device itself? Creating such
bloated functions seems quite counterintuitive.
Things get more complicated when we consider the low-level layers of the
service architecture. To ensure user and data isolation, infrastructure
providers employ virtualization techniques. However, generic hardware
accelerators are not designed to be shared by multiple untrusted tenants.
Current solutions (device passthrough, API remoting) impose inflexible setups,
present security trade-offs, and add significant performance overhead.
To this end, we introduce vAccel, a lightweight framework that exposes hardware
acceleration functionality to VM tenants. Our framework is built around a thin
runtime system, vAccelRT, which is essentially an acceleration API: it offers a
set of operators, such as machine learning and linear algebra operators, that
use generic hardware acceleration frameworks to increase performance. vAccelRT
abstracts away any hardware/vendor-specific code through a modular design:
backends implement bindings for popular acceleration frameworks, while the
frontend exposes a function prototype for each available acceleration function.
On top of that, using an optimized paravirtual interface, vAccelRT is exposed
to a VM's user-space, where applications can benefit from hardware acceleration
via a simple function call.
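To make the "simple function call" claim concrete, the sketch below shows how a
guest application might perform image classification through vAccelRT. It is a
minimal sketch modeled on the image-classification examples from the vAccel
blog posts; the exact names and signatures (vaccel_sess_init,
vaccel_image_classification, vaccel_sess_free, and the vaccel.h header) should
be treated as illustrative, since they may differ between releases.

```c
#include <stdio.h>
#include <stdlib.h>

#include <vaccel.h> /* vAccelRT user-space API (header name assumed) */

int main(int argc, char *argv[])
{
	struct vaccel_session sess;
	char out_text[512], out_imagename[512];
	unsigned char *image;
	long image_size;
	FILE *fp;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <image file>\n", argv[0]);
		return EXIT_FAILURE;
	}

	/* Read the raw image bytes; vAccelRT hands them to whichever
	 * backend plugin is loaded (e.g. jetson-inference/TensorRT). */
	fp = fopen(argv[1], "rb");
	if (!fp)
		return EXIT_FAILURE;
	fseek(fp, 0, SEEK_END);
	image_size = ftell(fp);
	fseek(fp, 0, SEEK_SET);
	image = malloc(image_size);
	if (!image || fread(image, 1, image_size, fp) != (size_t)image_size) {
		fclose(fp);
		return EXIT_FAILURE;
	}
	fclose(fp);

	/* Create a vAccel session; backend selection happens inside the
	 * runtime, not in application code. */
	if (vaccel_sess_init(&sess, 0)) {
		free(image);
		return EXIT_FAILURE;
	}

	/* The single call the application makes: the paravirtual transport
	 * and the hardware-specific plumbing are hidden behind the operator. */
	if (vaccel_image_classification(&sess, image,
			(unsigned char *)out_text,
			(unsigned char *)out_imagename, image_size,
			sizeof(out_text), sizeof(out_imagename)))
		fprintf(stderr, "image classification failed\n");
	else
		printf("classification tag: %s\n", out_text);

	vaccel_sess_free(&sess);
	free(image);
	return EXIT_SUCCESS;
}
```

The point of the modular design is that a binary like this runs unmodified
inside the guest regardless of which backend serves the operator; swapping, say,
a TensorRT plugin for a Coral TPU one should require no recompilation.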
In this talk we present the design and implementation of vAccel on two KVM
VMMs: QEMU and AWS Firecracker. We briefly describe the design and focus on the
key aspects of enabling hardware acceleration for machine learning inference in
lightweight VMs, on both x86_64 and aarch64 architectures. Our current
implementation supports jetson-inference & TensorRT, as well as the Google
Coral TPU, while facilitating integration with NVIDIA GPUs (CUDA) and Intel
Iris GPUs (OpenCL).
Finally, we present a demo of vAccel in action, using a containerized
environment to simplify configuration & deployment.