    binfmt_misc: enable sandboxed mounts · 21ca59b3
    Christian Brauner authored
    Enable unprivileged sandboxes to create their own binfmt_misc mounts.
    This is based on Laurent's work in [1] but has been significantly
    reworked to fix various issues we identified in earlier versions.
    
    While binfmt_misc can currently only be mounted in the initial user
    namespace, binary types registered in this binfmt_misc instance are
    available to all sandboxes (either by having them installed in the
    sandbox or by registering the binary type with the F flag causing the
    interpreter to be opened right away). So binfmt_misc binary types are
    already delegated to sandboxes implicitly.
    
    However, while a sandbox has access to all binary types registered in
    binfmt_misc, it cannot currently register its own binary types there.
    This has prevented various use-cases, some of which were already
    outlined in [1], and we have a range of issues associated with this
    (cf. [3]-[5] below, which are just a small sample).
    
    Extend binfmt_misc to be mountable in non-initial user namespaces.
    Similar to other filesystems such as nfsd, mqueue, and sunrpc, we use
    keyed superblock management. The key determines whether we need to
    create a new superblock or can reuse an already existing one. We use the
    user namespace of the mount as key. This means a new binfmt_misc
    superblock is created at most once per user namespace. Subsequent
    mounts of binfmt_misc in the same user namespace will mount the same
    binfmt_misc instance. We explicitly do not create a new binfmt_misc
    superblock on every binfmt_misc mount as the semantics for
    load_misc_binary() line up with the keying model. This also allows us to
    retrieve the relevant binfmt_misc instance based on the caller's user
    namespace which can be done in a simple (bounded to 32 levels) loop.
    
    Similar to the current binfmt_misc semantics, which allow access to the
    binary types in the initial binfmt_misc instance, we allow sandboxes
    access to their parent's binfmt_misc mounts if they have not created
    a separate binfmt_misc instance.
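
    The keying and lookup model described above can be sketched in plain
    user-space C. This is an illustrative model under stated assumptions,
    not the kernel implementation: struct user_ns, struct binfmt_instance,
    mount_binfmt(), lookup_binfmt() and MAX_NS_LEVEL are hypothetical names
    chosen for this sketch, and the exact bound used by the kernel loop is
    not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed nesting bound, taken from the "32 levels" quoted in the text. */
#define MAX_NS_LEVEL 32

/* Hypothetical stand-in for one binfmt_misc instance (one superblock). */
struct binfmt_instance {
    const char *name;
};

/* Hypothetical stand-in for a user namespace. */
struct user_ns {
    struct user_ns *parent;          /* NULL for the initial namespace */
    struct binfmt_instance *binfmt;  /* NULL until binfmt_misc is mounted here */
};

/*
 * Keyed mount: the user namespace is the key. The first mount in a
 * namespace attaches the fresh instance; later mounts in the same
 * namespace reuse the instance that is already there.
 */
static struct binfmt_instance *mount_binfmt(struct user_ns *ns,
                                            struct binfmt_instance *fresh)
{
    assert(ns != NULL && fresh != NULL);
    if (ns->binfmt == NULL)
        ns->binfmt = fresh;
    return ns->binfmt;
}

/*
 * Lookup as an exec-time caller might do it: start at the caller's
 * namespace and walk towards the initial namespace until an instance
 * is found. The walk is bounded by the assumed nesting limit.
 */
static struct binfmt_instance *lookup_binfmt(const struct user_ns *ns)
{
    for (int level = 0; ns != NULL && level <= MAX_NS_LEVEL;
         level++, ns = ns->parent) {
        if (ns->binfmt != NULL)
            return ns->binfmt;
    }
    return NULL;
}
```

    In this model a sandbox without its own mount resolves to the nearest
    ancestor's instance, and a second mount in the same namespace returns
    the instance created by the first.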
    
    Overall, this will unblock the use-cases mentioned below and, in
    general, will also make it possible to support and harden execution of
    another architecture's binaries in tight sandboxes. For instance, using
    the unshare binary it is possible to start a chroot of another
    architecture and configure the binfmt_misc interpreter without being
    root to run the binaries in this chroot, without requiring the host to
    modify its binary type handlers.
    
    Henning had already posted a few experiments in the cover letter at [1].
    But here's an additional example where an unprivileged container
    registers qemu-user-static binary handlers for various binary types in
    its separate binfmt_misc mount and is then seamlessly able to start
    containers with a different architecture without affecting the host:
    
    root    [lxc monitor] /var/snap/lxd/common/lxd/containers f1
    1000000  \_ /sbin/init
    1000000      \_ /lib/systemd/systemd-journald
    1000000      \_ /lib/systemd/systemd-udevd
    1000100      \_ /lib/systemd/systemd-networkd
    1000101      \_ /lib/systemd/systemd-resolved
    1000000      \_ /usr/sbin/cron -f
    1000103      \_ /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
    1000000      \_ /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
    1000104      \_ /usr/sbin/rsyslogd -n -iNONE
    1000000      \_ /lib/systemd/systemd-logind
    1000000      \_ /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
    1000107      \_ dnsmasq --conf-file=/dev/null -u lxc-dnsmasq --strict-order --bind-interfaces --pid-file=/run/lxc/dnsmasq.pid --liste
    1000000      \_ [lxc monitor] /var/lib/lxc f1-s390x
    1100000          \_ /usr/bin/qemu-s390x-static /sbin/init
    1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-journald
    1100000              \_ /usr/bin/qemu-s390x-static /usr/sbin/cron -f
    1100103              \_ /usr/bin/qemu-s390x-static /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-ac
    1100000              \_ /usr/bin/qemu-s390x-static /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
    1100104              \_ /usr/bin/qemu-s390x-static /usr/sbin/rsyslogd -n -iNONE
    1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-logind
    1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
    1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/0 115200,38400,9600 vt220
    1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/1 115200,38400,9600 vt220
    1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/2 115200,38400,9600 vt220
    1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/3 115200,38400,9600 vt220
    1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-udevd
    
    [1]: https://lore.kernel.org/all/20191216091220.465626-1-laurent@vivier.eu
    [2]: https://discuss.linuxcontainers.org/t/binfmt-misc-permission-denied
    [3]: https://discuss.linuxcontainers.org/t/lxd-binfmt-support-for-qemu-static-interpreters
    [4]: https://discuss.linuxcontainers.org/t/3-1-0-binfmt-support-service-in-unprivileged-guest-requires-write-access-on-hosts-proc-sys-fs-binfmt-misc
    [5]: https://discuss.linuxcontainers.org/t/qemu-user-static-not-working-4-11
    
    Link: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu (origin)
    Link: https://lore.kernel.org/r/20211028103114.2849140-2-brauner@kernel.org (v1)
    Cc: Sargun Dhillon <sargun@sargun.me>
    Cc: Serge Hallyn <serge@hallyn.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Henning Schild <henning.schild@siemens.com>
    Cc: Andrei Vagin <avagin@gmail.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Laurent Vivier <laurent@vivier.eu>
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Laurent Vivier <laurent@vivier.eu>
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    Signed-off-by: Christian Brauner <brauner@kernel.org>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    ---
    /* v2 */
    - Serge Hallyn <serge@hallyn.com>:
      - Use GFP_KERNEL_ACCOUNT for userspace triggered allocations when a
        new binary type handler is registered.
    - Christian Brauner <christian.brauner@ubuntu.com>:
      - Switch authorship to me. I refused to do that earlier even though
        Laurent said I should do so because I think it's genuinely bad form.
        But by now I have changed so many things that it'd be unfair to
        blame Laurent for any potential bugs in here.
      - Add more comments that explain what's going on.
      - Rename functions while changing them to better reflect what they are
        doing to make the code easier to understand.
      - In the first version, when a specific binary type handler was
        removed through a write to the entry's file, or all binary type
        handlers were removed by a write to the binfmt_misc mount's status
        file, all cleanup work happened during inode eviction.
        That included removal of the relevant entries from the entry list.
        While that works fine, I disliked that model after thinking about
        it for a bit, because it means that there was a window where
        someone had already removed one or all binary handlers but they
        could still be safely reached from load_misc_binary() when it had
        managed to take the read_lock() on the entries list while inode
        eviction was already happening. Again, that's perfectly benign, but
        it's cleaner to remove the binary handler from the list immediately,
        meaning that once the write to the entry's file or the binfmt_misc
        status file returns, the binary type cannot be executed anymore.
        That gives stronger guarantees to the user.
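
    The removal semantics described in the last item can be illustrated
    with a small single-threaded C model. It is a sketch under stated
    assumptions, not the code of this patch: struct handler,
    struct handler_list, handler_add(), handler_get(), handler_put() and
    handler_remove() are hypothetical names, the users counter is an
    assumed way of keeping an in-flight handler alive, and the locking
    that the kernel needs around the list is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical binary type handler; storage is owned by the caller. */
struct handler {
    struct handler *next;
    const char *name;
    int users;  /* one reference for the list, plus one per in-flight user */
};

struct handler_list {
    struct handler *head;
};

/* Register a handler; the list holds the initial reference. */
static void handler_add(struct handler_list *list, struct handler *h)
{
    h->users = 1;
    h->next = list->head;
    list->head = h;
}

/* Exec-time lookup: only handlers still on the list can be found. */
static struct handler *handler_get(struct handler_list *list, const char *name)
{
    for (struct handler *h = list->head; h != NULL; h = h->next) {
        if (strcmp(h->name, name) == 0) {
            h->users++;
            return h;
        }
    }
    return NULL;
}

/* Drop one reference; returns true when the last reference is gone. */
static bool handler_put(struct handler *h)
{
    assert(h->users > 0);
    return --h->users == 0;
}

/*
 * Removal unlinks the handler right away and drops the list's reference.
 * Once this returns, handler_get() can no longer find the handler, even
 * though an earlier in-flight user may still hold its own reference.
 */
static bool handler_remove(struct handler_list *list, const char *name)
{
    for (struct handler **pp = &list->head; *pp != NULL; pp = &(*pp)->next) {
        struct handler *h = *pp;
        if (strcmp(h->name, name) == 0) {
            *pp = h->next;
            h->next = NULL;
            handler_put(h);
            return true;
        }
    }
    return false;
}
```

    The point of the model is the ordering: visibility ends at removal
    time, while the final cleanup can still be deferred until the last
    reference is dropped.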