    open: add close_range() · 278a5fba
    Christian Brauner authored
    This adds the close_range() syscall. It allows a task to efficiently close a
    range of its file descriptors, up to and including all of them.
    
    I was contacted by FreeBSD as they wanted to have the same close_range()
    syscall as we proposed here. We've coordinated this and in the meantime, Kyle
    was fast enough to merge close_range() into FreeBSD already in April:
    https://reviews.freebsd.org/D21627
    https://svnweb.freebsd.org/base?view=revision&revision=359836
    and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
    once it's merged in Linux too. Python is in the process of switching to
    close_range() on FreeBSD and they are waiting on us to merge this to switch on
    Linux as well: https://bugs.python.org/issue38061
    
    The syscall came up in a recent discussion around the new mount API and
    making new file descriptor types cloexec by default. During this
    discussion, Al suggested the close_range() syscall (cf. [1]). Note that a
    syscall of this kind has been requested by various people over time.
    
    First, it helps to close all file descriptors of an exec()ing task. This
    can be done safely via (quoting Al's example from [1] verbatim):
    
            /* that exec is sensitive */
            unshare(CLONE_FILES);
            /* we don't want anything past stderr here */
            close_range(3, ~0U);
            execve(....);
    
    The code snippet above is one way of working around the problem that file
    descriptors are not cloexec by default. This is aggravated by the fact that
    we can't just switch them over without massively regressing userspace. For
    a whole class of programs, having an in-kernel method of closing all file
    descriptors is very helpful (e.g. daemons, service managers, programming
    language standard libraries, container managers etc.).
    (Please note, unshare(CLONE_FILES) should only be needed if the calling
    task is multi-threaded and shares the file descriptor table with another
    thread in which case two threads could race with one thread allocating file
    descriptors and the other one closing them via close_range(). For the
    general case close_range() before the execve() is sufficient.)
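
As a minimal sketch of the pattern above from userspace: the wrapper below
invokes the syscall through syscall(2) where the headers define
SYS_close_range and otherwise falls back to a bounded close() loop. The name
close_fd_range(), the EPERM handling, and the fallback itself are illustrative
assumptions, not part of this patch. flags is passed as 0 because this message
defines no flags.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): close every fd in
 * [first, last]. Returns 0 on success, -1 with errno set on failure.
 */
static int close_fd_range(unsigned int first, unsigned int last)
{
        struct rlimit rl;
        unsigned int fd;

        if (first > last) {
                errno = EINVAL;
                return -1;
        }

#ifdef SYS_close_range
        /* Preferred path: one syscall; no flags are defined, so pass 0. */
        if (syscall(SYS_close_range, first, last, 0U) == 0)
                return 0;
        /* ENOSYS: kernel lacks the syscall. Some sandboxes report EPERM. */
        if (errno != ENOSYS && errno != EPERM)
                return -1;
#endif

        /* Fallback (assumption): brute-force close() up to RLIMIT_NOFILE. */
        if (getrlimit(RLIMIT_NOFILE, &rl) == 0 &&
            rl.rlim_cur != RLIM_INFINITY) {
                if (rl.rlim_cur <= first)
                        return 0;       /* whole range is above the limit */
                if (rl.rlim_cur <= last)
                        last = (unsigned int)rl.rlim_cur - 1;
        }
        for (fd = first; ; fd++) {
                close((int)fd);         /* errors ignored, like the syscall */
                if (fd == last)
                        break;          /* stop here to avoid wrapping fd */
        }
        return 0;
}
```

In a single-threaded task this would be called as close_fd_range(3, ~0U)
right before execve(). A multi-threaded task would call
unshare(CLONE_FILES) first, as described above.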
    
    Second, it allows userspace to avoid implementing closing all file
    descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
    file descriptor. From looking at various large(ish) userspace code bases
    this or similar patterns are very common in:
    - service managers (cf. [4])
    - libcs (cf. [6])
    - container runtimes (cf. [5])
    - programming language runtimes/standard libraries
      - Python (cf. [2])
      - Rust (cf. [7], [8])
    As Dmitry pointed out there's even a long-standing glibc bug about missing
    kernel support for this task (cf. [3]).
    In addition, the syscall will also work for tasks that do not have procfs
    mounted and on kernels that do not have procfs support compiled in. In such
    situations the only way to make sure that all file descriptors are closed
    is to call close() on each file descriptor up to UINT_MAX, or to resort to
    RLIMIT_NOFILE or OPEN_MAX trickery (cf. comment [8] on Rust).
    
    The performance is striking. For good measure, comparing the following
    simple close_all_fds() userspace implementation that is essentially just
    glibc's version in [6]:
    
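    #include <dirent.h>   /* opendir(), readdir(), dirfd(), closedir() */
    #include <stdlib.h>   /* atoi() */
    #include <string.h>   /* strcmp() */
    #include <unistd.h>   /* close() */

    static int close_all_fds(void)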
    {
            int dir_fd;
            DIR *dir;
            struct dirent *direntp;
    
            dir = opendir("/proc/self/fd");
            if (!dir)
                    return -1;
            dir_fd = dirfd(dir);
            while ((direntp = readdir(dir))) {
                    int fd;
                    if (strcmp(direntp->d_name, ".") == 0)
                            continue;
                    if (strcmp(direntp->d_name, "..") == 0)
                            continue;
                    fd = atoi(direntp->d_name);
                    if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                            continue;
                    close(fd);
            }
            closedir(dir);
            return 0;
    }
    
    to close_range() yields:
    1. closing 4 open files:
       - close_all_fds(): ~280 us
       - close_range():    ~24 us
    
    2. closing 1000 open files:
       - close_all_fds(): ~5000 us
       - close_range():   ~800 us
    
    close_range() is designed to allow for some flexibility. Specifically, it
    does not simply always close all open file descriptors of a task. Instead,
    callers can specify an upper bound.
    This is e.g. useful for scenarios where specific file descriptors are
    created with well-known numbers that are supposed to be excluded from
    getting closed.
    For extra paranoia close_range() comes with a flags argument. This can e.g.
    be used to implement extensions. One can imagine userspace wanting to stop
    at the first error instead of ignoring errors under certain circumstances.
    There might be other valid ideas in the future. In any case, a flags
    argument doesn't hurt and keeps us on the safe side.
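
One way the upper bound could be used to spare well-known file descriptors is
to split the full range into the gaps between the descriptors that must stay
open and call close_range() once per gap. The sketch below only computes those
gaps. struct fd_range and ranges_to_close() are hypothetical names introduced
for illustration, and the input is assumed to be sorted in ascending order.

```c
#include <stddef.h>

struct fd_range {
        unsigned int first;
        unsigned int last;
};

/*
 * Hypothetical helper: split [first, last] into the sub-ranges that skip
 * every fd listed in keep[] (sorted ascending). out[] needs room for
 * n + 1 entries. Returns the number of ranges written.
 */
static size_t ranges_to_close(const unsigned int *keep, size_t n,
                              unsigned int first, unsigned int last,
                              struct fd_range *out)
{
        size_t count = 0;
        unsigned int start = first;
        size_t i;

        if (first > last)
                return 0;

        for (i = 0; i < n; i++) {
                unsigned int k = keep[i];

                if (k < start)
                        continue;       /* below the range, or a duplicate */
                if (k > last)
                        break;          /* sorted input: nothing else applies */
                if (k > start) {
                        out[count].first = start;
                        out[count].last = k - 1;
                        count++;
                }
                if (k == last)
                        return count;   /* range ends on a kept fd */
                start = k + 1;
        }

        /* Everything after the last kept fd. */
        out[count].first = start;
        out[count].last = last;
        return count + 1;
}
```

Keeping fds 5 and 10 while closing everything from 3 upwards yields the three
ranges 3-4, 6-9 and 11-~0U. Each would then be passed to close_range() with
flags set to 0.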
    
    On the implementation side, this is kept rather simple. It saw some input
    from David and Jann but all nonsense is obviously my own!
    - Errors from closing file descriptors are currently ignored. (This could be
      changed by setting a flag in the future if needed.)
    - __close_range() is a rather simplistic wrapper around __close_fd().
      My reasoning behind this is based on the nature of how __close_fd() needs
      to release an fd. But maybe I misunderstood specifics:
      We take the files_lock and rcu-dereference the fdtable of the calling
      task, we find the entry in the fdtable, get the file and need to release
      files_lock before calling filp_close().
      In the meantime the fdtable might have been altered so we can't just
      retake the spinlock and keep the old rcu-reference of the fdtable
      around. Instead we need to grab a fresh reference to the fdtable.
      If my reasoning is correct then there's really no point in making
      __close_range() any fancier: we just need to rcu-dereference the fdtable
      of the calling task once to cap the max_fd value correctly and then go on
      calling __close_fd() in a loop.
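
The cap-then-loop shape described above can be modelled in userspace as
follows. This is a toy sketch, not the kernel code: struct toy_fdtable and the
toy_*() functions are hypothetical stand-ins, there is no locking or RCU, and
the -EINVAL check for an inverted range is an assumption about the behaviour.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for the kernel fdtable; all names here are hypothetical. */
struct toy_fdtable {
        unsigned int max_fds;   /* number of slots in open[] */
        bool *open;             /* open[fd] is true while fd is in use */
};

/* Models __close_fd(): release a single fd, -EBADF if it is not open. */
static int toy_close_fd(struct toy_fdtable *t, unsigned int fd)
{
        if (fd >= t->max_fds || !t->open[fd])
                return -EBADF;
        t->open[fd] = false;
        return 0;
}

/* Models __close_range(): cap max_fd once, then close in a loop. */
static int toy_close_range(struct toy_fdtable *t, unsigned int fd,
                           unsigned int max_fd)
{
        if (fd > max_fd)
                return -EINVAL;         /* assumed handling of a bad range */
        if (t->max_fds == 0)
                return 0;
        /* Cap the upper bound at the last slot the table can hold. */
        if (max_fd > t->max_fds - 1)
                max_fd = t->max_fds - 1;
        while (fd <= max_fd)
                toy_close_fd(t, fd++);  /* per-fd errors are ignored */
        return 0;
}
```

Because the bound is capped at the table size, a caller passing ~0U as max_fd
costs one iteration per table slot rather than one per possible fd number.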
    
    /* References */
    [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
    [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
    [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
    [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
    [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
    [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
         Note that this is an internal implementation that is not exported.
         Currently, libc seems to not provide an exported version of this
         because of missing kernel support to do this.
         Note, in a recent patch series Florian made grantpt() a nop thereby
         removing the code referenced here.
    [7]: https://github.com/rust-lang/rust/issues/12148
    [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
         Rust's solution is slightly different but performs equally poorly.
         Rust calls getdtablesize() which is a glibc library function that
         simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
         goes on to call close() on each fd. That's obviously overkill for most
         tasks. Tasks, especially non-daemons, rarely hit RLIMIT_NOFILE or
         OPEN_MAX.
         Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
         to 1024. Even in this case, there's a very high chance that in the
         common case Rust is calling the close() syscall 1021 times pointlessly
         if the task just has 0, 1, and 2 open.
    Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Kyle Evans <self@kyle-evans.net>
    Cc: Jann Horn <jannh@google.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Dmitry V. Levin <ldv@altlinux.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Florian Weimer <fweimer@redhat.com>
    Cc: linux-api@vger.kernel.org
file.c 25.6 KB