Seccomp has supported a notify target for a little while now. It can be used to delegate system call handling to a userspace process. With it, any syscall (even a nonexistent one) can be intercepted and handled by a potentially more privileged userspace process.
In LXD, we like running very safe containers. For that we use user namespaces by default, combined with both AppArmor and Seccomp policies. The result is very safe containers, but because of all that security, a number of actions just aren't possible or return odd values.
System call interception gives us a way out: we can selectively intercept system calls such as mount, sysinfo, bpf, ..., run them through policies and, if they are found to be safe, run them again with elevated privileges. We have been progressively growing the list of system calls that can be handled this way, with the eventual goal of completely deprecating the use of privileged containers.
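The intercept, check, then re-run flow described above can be sketched roughly as follows. This is a minimal illustration of the policy decision step only, not LXD's actual code and not a real seccomp listener: the `Notification`, `Decision`, and `decide` names, as well as the allowed filesystem list, are hypothetical and chosen purely for the example.

```python
import errno
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"  # let the kernel run the syscall unchanged
    EMULATE = "emulate"    # supervisor re-runs it with elevated privileges
    DENY = "deny"          # container receives an errno instead


@dataclass(frozen=True)
class Notification:
    """Hypothetical, simplified stand-in for a seccomp notification."""
    pid: int
    syscall: str
    args: tuple


@dataclass(frozen=True)
class Decision:
    action: Action
    error: int = 0


# Assumed policy for illustration: filesystem types a container may mount.
ALLOWED_MOUNT_FSTYPES = frozenset({"ext4"})


def decide(notif: Notification) -> Decision:
    """Run an intercepted syscall through the (hypothetical) policy."""
    if notif.syscall == "mount":
        _source, _target, fstype = notif.args
        if fstype in ALLOWED_MOUNT_FSTYPES:
            # Judged safe: the privileged supervisor performs the mount.
            return Decision(Action.EMULATE)
        return Decision(Action.DENY, errno.EPERM)
    if notif.syscall == "sysinfo":
        # The supervisor answers on the container's behalf.
        return Decision(Action.EMULATE)
    # Anything the policy does not cover is left to the kernel.
    return Decision(Action.CONTINUE)
```

In this sketch, a mount of an allowed filesystem type yields `Action.EMULATE`, a mount of any other type yields `Action.DENY` with `EPERM`, and syscalls outside the policy fall through to `Action.CONTINUE`. A real supervisor would also have to read and validate the arguments from the calling process's memory, which is one of the places where great care is needed.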
In this presentation, we'll cover the general concept behind system call interception through the seccomp notify mechanism, some of the things to be extremely careful about, and the system calls we have implemented so far.
Speakers: Stéphane Graber