Recently the kernel landed seccomp support for SECCOMPRETUSER_NOTIF which enables a process (supervisee) to retrieve a fd for its seccomp filter. This fd can then be handed to another (usually more privileged) process (supervisor). The supervisor will then be able to receive seccomp messages about the syscalls having been performed by the supervisee.
We have integrated this feature into userspace and currently make heavy use of this to intercept mknod(), mount(), and other syscalls in user namespaces aka in containers. For example, if the mknod() syscall matches a device in a pre-determined whitelist the privileged supervisor will perform the mknod syscall in lieu of the unprivileged supervisee and report back to the supervisee on the success or failure of its attempt. If the syscall does not match a device in a whitelist we simply report an error.
This talk is going to show how this works and what limitations we run into and what future improvements we plan on doing in the kernel.
Speakers: Christian Brauner