    seccomp: notify about unused filter · 99cdb8b9
    Christian Brauner authored
    We've been making heavy use of the seccomp notifier to intercept and
    handle certain syscalls for containers. This patch allows a syscall
    supervisor listening on a given notifier to be notified when a seccomp
    filter has become unused.
    
    A container is often managed by a singleton supervisor process, the
    so-called "monitor". This monitor process has an event loop which has
    various event handlers registered. If the user specified a seccomp
    profile that included a notifier for various syscalls then we also
    register a seccomp notify event handler. For any container using a
    separate pid namespace the lifecycle of the seccomp notifier is bound to
    the init process of the pid namespace, i.e. when the init process exits
    the filter must be unused.
    
    If a new process attaches to a container we force it to assume a seccomp
    profile. This can either be the same seccomp profile as the container
    was started with or a modified one. If the attaching process makes use
    of the seccomp notifier we will register a new seccomp notifier handler
    in the monitor's event loop. However, when the attaching process exits
    we can't simply delete the handler since other child processes could've
    been created (daemons spawned etc.) that have inherited the seccomp
    filter and so we need to keep the seccomp notifier fd alive in the event
    loop. But this is problematic since we don't get a notification when the
    seccomp filter has become unused. As a result we currently never remove
    the seccomp notifier fd from the event loop and just keep accumulating
    fds there. We've had this issue for a while but it has recently become
    more pressing as more and larger users make use of this.
    
    To fix this, we introduce a new "users" reference counter that tracks any
    tasks and dependent filters making use of a filter. Once that counter
    drops to zero the filter is unused and, if a notifier is registered,
    tasks waiting on it are notified by receiving an (E)POLLHUP event.
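
From the supervisor's side, the (E)POLLHUP event described above can be handled roughly as in the sketch below. It is only an illustration: an ordinary pipe stands in for the seccomp notifier fd (closing the write end makes the read end report EPOLLHUP), and the helper names `watch_fd` and `reap_on_hup` are hypothetical, not taken from any monitor implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd with the event loop. EPOLLHUP is always reported by epoll,
 * so it does not need to be requested in the event mask. */
static int watch_fd(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };

    assert(epfd >= 0 && fd >= 0);
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Wait for one event. If it carries EPOLLHUP, remove the fd from the
 * event loop and close it. Returns true if an fd was reaped. */
static bool reap_on_hup(int epfd, int timeout_ms)
{
    struct epoll_event out;

    if (epoll_wait(epfd, &out, 1, timeout_ms) != 1)
        return false;
    if (!(out.events & EPOLLHUP))
        return false; /* a real monitor would service the event here */

    epoll_ctl(epfd, EPOLL_CTL_DEL, out.data.fd, NULL);
    close(out.data.fd);
    return true;
}
```

With a handler along these lines the event loop stops accumulating fds: each notifier fd is dropped as soon as it reports a hang-up.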
    
    The concept this patch introduces is the same as for signal_struct,
    i.e. reference counting for life-cycle management is decoupled from
    reference counting tasks using the object. There's probably some trickery
    possible but the second counter is just the correct way of doing this
    IMHO and has precedent.
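
The split between the two counters can be modelled in userspace roughly as follows. This is a simplified sketch, not the kernel code: the struct layout, the function names, and the `hup_signalled` flag (which stands in for waking pollers with EPOLLHUP) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model of a filter with two counters. */
struct filter {
    atomic_int refs;     /* lifetime: memory is freed when this hits 0 */
    atomic_int users;    /* tasks and dependent filters using the filter */
    bool hup_signalled;  /* stand-in for waking pollers with EPOLLHUP */
    struct filter *prev; /* filter this one depends on, or NULL */
};

/* The caller owns one ref and one user on the new filter. A dependent
 * filter additionally takes one ref and one user on its parent. */
static struct filter *filter_new(struct filter *prev)
{
    struct filter *f = calloc(1, sizeof(*f));

    assert(f != NULL);
    atomic_init(&f->refs, 1);
    atomic_init(&f->users, 1);
    f->prev = prev;
    if (prev) {
        atomic_fetch_add(&prev->refs, 1);
        atomic_fetch_add(&prev->users, 1);
    }
    return f;
}

/* Lifetime-only reference, e.g. held by an open notifier fd. */
static void filter_get_ref(struct filter *f)
{
    atomic_fetch_add(&f->refs, 1);
}

/* Drop a user. When the last user goes away, signal the hang-up and
 * drop the user this filter held on its parent, walking up the chain. */
static void filter_put_user(struct filter *f)
{
    while (f && atomic_fetch_sub(&f->users, 1) == 1) {
        f->hup_signalled = true;
        f = f->prev;
    }
}

/* Drop a lifetime reference and free filters whose count reaches 0. */
static void filter_put_ref(struct filter *f)
{
    while (f && atomic_fetch_sub(&f->refs, 1) == 1) {
        struct filter *prev = f->prev;

        free(f);
        f = prev;
    }
}
```

In this model a listener that still holds a lifetime reference keeps the memory valid after the last user has gone away, which is what allows it to observe the hang-up safely.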
    
    Cc: Tycho Andersen <tycho@tycho.ws>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Matt Denton <mpdenton@google.com>
    Cc: Sargun Dhillon <sargun@sargun.me>
    Cc: Jann Horn <jannh@google.com>
    Cc: Chris Palmer <palmer@google.com>
    Cc: Aleksa Sarai <cyphar@cyphar.com>
    Cc: Robert Sesek <rsesek@google.com>
    Cc: Jeffrey Vander Stoep <jeffv@google.com>
    Cc: Linux Containers <containers@lists.linux-foundation.org>
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    Link: https://lore.kernel.org/r/20200531115031.391515-3-christian.brauner@ubuntu.com
    Signed-off-by: Kees Cook <keescook@chromium.org>