  1. 13 May, 2020 1 commit
    • nsproxy: attach to namespaces via pidfds · 303cc571
      Christian Brauner authored
      For quite a while we have been thinking about using pidfds to attach to
      namespaces. This patchset has existed for about a year already but we've
      wanted to wait to see how the general api would be received and adopted.
      Now that more and more programs in userspace have started using pidfds
      for process management it's time to send this one out.
      
      This patch makes it possible to use pidfds to attach to the namespaces
      of another process, i.e. they can be passed as the first argument to the
      setns() syscall. When only a single namespace type is specified the
      semantics are equivalent to passing an nsfd. That means
      setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
      when a pidfd is passed, multiple namespace flags can be specified in the
      second setns() argument and setns() will attach the caller to all the
      specified namespaces all at once or to none of them. Specifying 0 is not
      valid together with a pidfd.
      
      Here are just two obvious examples:
      setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
      setns(pidfd, CLONE_NEWUSER);
      Allowing callers to also attach to subsets of namespaces supports various
      use-cases where callers setns() to a subset of namespaces to retain
      privilege, perform an action, and then re-attach to another subset of
      namespaces.
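
      As a rough userspace illustration of the calls above (a sketch, not part
      of this patch): the helper names check_ns_flags() and attach_to_pid() are
      hypothetical, and pidfd_open() is invoked through syscall() on the
      assumption that the libc may not ship a wrapper for it. The flag pre-check
      only mirrors the rule that 0 is not valid together with a pidfd; the
      kernel remains the authority on which flags it accepts.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Namespace flags this sketch knows about. CLONE_NEWTIME is left out on
 * purpose because older userspace headers do not define it. */
#define KNOWN_NS_FLAGS (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | \
                        CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET | \
                        CLONE_NEWCGROUP)

/* Hypothetical helper: 0 if flags is a non-empty set of known namespace
 * flags, -EINVAL otherwise. */
int check_ns_flags(int flags)
{
    if (flags == 0)
        return -EINVAL;          /* 0 is not valid together with a pidfd */
    if (flags & ~KNOWN_NS_FLAGS)
        return -EINVAL;          /* not a namespace flag */
    return 0;
}

/* Hypothetical helper: attach the caller to the namespaces of pid that are
 * selected by flags. Returns 0 on success or a negative errno value. */
int attach_to_pid(pid_t pid, int flags)
{
    int ret = check_ns_flags(flags);

    if (ret < 0)
        return ret;

#ifdef SYS_pidfd_open
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);

    if (pidfd < 0)
        return -errno;

    /* One syscall: either all requested namespaces are switched or none. */
    ret = setns(pidfd, flags) < 0 ? -errno : 0;
    close(pidfd);
    return ret;
#else
    return -ENOSYS;              /* headers too old to know pidfd_open() */
#endif
}
```

      A caller would use it as attach_to_pid(pid, CLONE_NEWNS | CLONE_NEWNET);
      on a kernel without this patch, or without sufficient privilege, the
      setns() call is expected to fail and the negative errno value is returned.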
      
      If the need arises, as Eric suggested, we can extend this patchset to
      assume even more context than just attaching all namespaces. His suggestion
      specifically was about assuming the process' root directory when
      setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just
      keep it flexible in terms of supporting subsets of namespaces but let's
      wait until we have users asking for even more context to be assumed. At
      that point we can add an extension.
      
      The obvious example where this is useful is a standard container
      manager interacting with a running container: pushing and pulling files
      or directories, injecting mounts, attaching/execing any kind of process,
      managing network devices. All of these operations require attaching to all
      or at least multiple namespaces at the same time. Given that nowadays
      most containers are spawned with all namespaces enabled, we're currently
      looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns>
      nsfds, another 7 to actually perform the namespace switch. With time
      namespaces we're looking at about 16 syscalls.
      (We could amortize the first 7 or 8 syscalls for opening the nsfds by
       stashing them in each container's monitor process but that would mean
       we need to send around those file descriptors through unix sockets
       every time we want to interact with the container or keep on-disk
       state. Even in scenarios where a caller wants to join a particular
       namespace in a particular order, callers still profit from batching
       other namespaces. That mostly applies to the user namespace but
       all container runtimes I found join the user namespace first no matter
       if it privileges or deprivileges the container similar to how unshare
       behaves.)
      With pidfds this becomes a single syscall no matter how many namespaces
      are supposed to be attached to.
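
      For comparison, the nsfd-based path described above can be sketched
      roughly as follows. This is an illustration under assumptions, not code
      from any particular container manager: ns_path(), legacy_syscall_count()
      and attach_legacy() are hypothetical names, and the list covers the seven
      namespaces counted above without the time namespace. The close() calls
      are extra syscalls that the count above does not include.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define NS_COUNT 7

/* User namespace first, matching what the text says runtimes tend to do. */
const char *const ns_names[NS_COUNT] = {
    "user", "mnt", "pid", "uts", "ipc", "net", "cgroup"
};

/* Build "/proc/<pid>/ns/<name>"; 0 on success, -ENAMETOOLONG if truncated. */
int ns_path(char *buf, size_t len, pid_t pid, const char *name)
{
    int n = snprintf(buf, len, "/proc/%ld/ns/%s", (long)pid, name);

    return (n < 0 || (size_t)n >= len) ? -ENAMETOOLONG : 0;
}

/* One open() plus one setns() per namespace. */
int legacy_syscall_count(void)
{
    return 2 * NS_COUNT;
}

/* Attach to all NS_COUNT namespaces of pid, one nsfd at a time.
 * Returns 0 on success or a negative errno value. */
int attach_legacy(pid_t pid)
{
    int fds[NS_COUNT];
    char path[64];
    int i, ret = 0;

    for (i = 0; i < NS_COUNT; i++)
        fds[i] = -1;

    /* Phase 1: open every nsfd while /proc/<pid> still names the target. */
    for (i = 0; i < NS_COUNT && ret == 0; i++) {
        ret = ns_path(path, sizeof(path), pid, ns_names[i]);
        if (ret == 0) {
            fds[i] = open(path, O_RDONLY | O_CLOEXEC);
            if (fds[i] < 0)
                ret = -errno;
        }
    }

    /* Phase 2: switch one namespace per call. A failure in the middle
     * leaves the caller attached to only some of the namespaces. */
    for (i = 0; i < NS_COUNT && ret == 0; i++) {
        if (setns(fds[i], 0) < 0)
            ret = -errno;
    }

    for (i = 0; i < NS_COUNT; i++) {
        if (fds[i] >= 0)
            close(fds[i]);
    }
    return ret;
}
```

      Note that the pid used to build the paths can be recycled between the
      individual open() calls, which is the race a pidfd avoids.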
      
      A decently designed, large-scale container manager usually isn't the
      parent of any of the containers it spawns so the containers don't die
      when it crashes or needs to update or reinitialize. This means that
      interacting with containers through pids is inherently racy for the
      manager, especially on systems where the maximum pid number is not
      significantly bumped. This is even more problematic since we often spawn
      and manage thousands or tens of thousands of containers. Interacting with a
      container through a pid can thus become risky quite quickly, especially
      since we allow for an administrator to enable advanced features such as
      syscall interception where we're performing syscalls in lieu of the
      container. In all of those cases we use pidfds if they are available and
      we pass them around as stable references. Using them to setns() to the
      target process' namespaces is as reliable as using nsfds. Either the
      target process is already dead and we get ESRCH or we manage to attach
      to its namespaces but we can't accidentally attach to another process'
      namespaces. So pidfds lend themselves to be used with this api.
      The other main advantage is that with this change the pidfd becomes the
      only relevant token for most container interactions and it's the only
      token we need to create and send around.
      
      Apart from significantly reducing the number of syscalls from double
      digits to a single digit, which is a decent reason post-spectre/meltdown,
      this also allows switching to a set of namespaces atomically, i.e.
      either attaching to all the specified namespaces succeeds or we fail. If
      we fail we haven't changed a single namespace. There are currently three
      namespaces that can fail (other than for ENOMEM which really is not
      very interesting since we then have other problems anyway) for
      non-trivial reasons: user, mount, and pid namespaces. We can fail to
      attach to a pid namespace if it is not our current active pid namespace
      or a descendant of it. We can fail to attach to a user namespace because
      we are multi-threaded or because our current mount namespace shares
      filesystem state with other tasks, or because we're trying to setns()
      to the same user namespace, i.e. the target task has the same user
      namespace as we do. We can fail to attach to a mount namespace because
      it shares filesystem state with other tasks or because we fail to lookup
      the new root for the new mount namespace. In most non-pathological
      scenarios these issues can be somewhat mitigated. But there are cases where
      we're half-attached to some namespace and failing to attach to another one.
      I've talked about some of these problems during the hallway track (something
      only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles
      in 2018(?). Even if all these issues could be avoided with super careful
      userspace coding it would be nicer to have this done in-kernel. Pidfds seem
      to lend themselves nicely to this.
      
      The other neat thing about this is that setns() becomes an actual
      counterpart to the namespace bits of unshare().
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Serge Hallyn <serge@hallyn.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Link: https://lore.kernel.org/r/20200505140432.181565-3-christian.brauner@ubuntu.com
  2. 09 May, 2020 1 commit
    • nsproxy: add struct nsset · f2a8d52e
      Christian Brauner authored
      Add a simple struct nsset. It holds all necessary pieces to switch to a new
      set of namespaces without leaving a task in a half-switched state which we
      will make use of in the next patch. This patch switches the existing setns
      logic over without causing a change in setns() behavior. This brings
      setns() closer to how unshare() works. The prepare_ns() function is
      responsible for preparing all necessary information. There are two reasons
      for this. First, it minimizes dependencies between individual namespaces,
      i.e. all install handlers can expect that all fields are properly
      initialized independent of the order in which they are called. Second, this
      makes the code easier to maintain and easier to follow if it needs to be
      changed.
      
      The prepare_ns() helper will only be switched over to use a flags argument
      in the next patch. Here it will still use nstype as a simple integer
      argument, which was argued to be clearer. I'm not particularly
      opinionated about whether it really helps or not. The struct nsset itself
      already contains the flags field since its name already indicates that it
      can contain information required by different namespaces. None of this
      should have functional consequences.
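
      The prepare/install/commit idea behind struct nsset can be illustrated
      with a toy userspace model. This is only a sketch of the pattern, not the
      kernel's implementation: every toy_* name is hypothetical, namespaces are
      reduced to plain integers, and a negative target stands in for any
      validation failure an install handler might hit.

```c
#include <assert.h>

enum { TOY_NS_COUNT = 3 };

/* The "task": the namespaces it currently lives in. */
struct toy_task {
    int ns[TOY_NS_COUNT];
};

/* Working set: everything needed to switch, kept apart from the task. */
struct toy_nsset {
    unsigned flags;              /* bit i set: slot i is to be switched */
    int ns[TOY_NS_COUNT];        /* private copy that install handlers edit */
};

/* Prepare: initialize every field up front so install handlers do not
 * depend on the order in which they run. */
void toy_prepare(struct toy_nsset *set, const struct toy_task *task,
                 unsigned flags)
{
    set->flags = flags;
    for (int i = 0; i < TOY_NS_COUNT; i++)
        set->ns[i] = task->ns[i];
}

/* Install handler: may fail, and only ever touches the working set. */
int toy_install(struct toy_nsset *set, int slot, int target)
{
    if (target < 0)
        return -1;               /* stand-in for a real validation error */
    set->ns[slot] = target;
    return 0;
}

/* Commit: cannot fail, publishes the working set to the task. */
void toy_commit(struct toy_task *task, const struct toy_nsset *set)
{
    for (int i = 0; i < TOY_NS_COUNT; i++)
        task->ns[i] = set->ns[i];
}

/* All-or-nothing switch: the task is modified only if every requested
 * install succeeded. */
int toy_setns(struct toy_task *task, unsigned flags,
              const int targets[TOY_NS_COUNT])
{
    struct toy_nsset set;

    toy_prepare(&set, task, flags);
    for (int i = 0; i < TOY_NS_COUNT; i++) {
        if ((flags & (1u << i)) && toy_install(&set, i, targets[i]) < 0)
            return -1;           /* task untouched, nothing half-switched */
    }
    toy_commit(task, &set);
    return 0;
}
```

      Because failures can only happen before toy_commit(), a failed call
      leaves the task exactly as it was, which is the property the text above
      describes as not leaving a task in a half-switched state.
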
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Serge Hallyn <serge@hallyn.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com
  3. 03 May, 2020 4 commits
  4. 02 May, 2020 8 commits
    • Merge tag 'pm-5.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 743f0573
      Linus Torvalds authored
      Pull power management fixes from Rafael Wysocki:
      
       - prevent the intel_pstate driver from printing excessive diagnostic
         messages in some cases (Chris Wilson)
      
       - make the hibernation restore kernel freeze kernel threads as well as
         user space tasks (Dexuan Cui)
      
       - fix the ACPI device PM diagnostic messages to include the correct
         power state name (Kai-Heng Feng).
      
      * tag 'pm-5.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        PM: ACPI: Output correct message on target power state
        PM: hibernate: Freeze kernel threads in software_resume()
        cpufreq: intel_pstate: Only mention the BIOS disabling turbo mode once
    • Merge branches 'pm-cpufreq' and 'pm-sleep' · a5383996
      Rafael J. Wysocki authored
      * pm-cpufreq:
        cpufreq: intel_pstate: Only mention the BIOS disabling turbo mode once
      
      * pm-sleep:
        PM: hibernate: Freeze kernel threads in software_resume()
    • Merge tag 'iomap-5.7-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · f66ed1eb
      Linus Torvalds authored
      Pull iomap fix from Darrick Wong:
       "Hoist the check for an unrepresentable FIBMAP return value into
        ioctl_fibmap.
      
        The internal kernel function can handle 64-bit values (and is needed
        to fix a regression on ext4 + jbd2). It is only the userspace ioctl
        that is so old that it cannot deal"
      
      * tag 'iomap-5.7-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        fibmap: Warn and return an error in case of block > INT_MAX
    • Merge tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 29a47f45
      Linus Torvalds authored
      Pull NFS client bugfixes from Trond Myklebust:
       "Highlights include:
      
        Stable fixes:
         - fix handling of backchannel binding in BIND_CONN_TO_SESSION
      
        Bugfixes:
         - Fix a credential use-after-free issue in pnfs_roc()
         - Fix potential posix_acl refcnt leak in nfs3_set_acl
         - defer slow parts of rpc_free_client() to a workqueue
         - Fix an Oopsable race in __nfs_list_for_each_server()
         - Fix trace point use-after-free race
         - Regression: the RDMA client no longer responds to server disconnect
           requests
         - Fix return values of xdr_stream_encode_item_{present, absent}
         - _pnfs_return_layout() must always wait for layoutreturn completion
      
        Cleanups:
         - Remove unreachable error conditions"
      
      * tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        NFS: Fix a race in __nfs_list_for_each_server()
        NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION
        SUNRPC: defer slow parts of rpc_free_client() to a workqueue.
        NFSv4: Remove unreachable error condition due to rpc_run_task()
        SUNRPC: Remove unreachable error condition
        xprtrdma: Fix use of xdr_stream_encode_item_{present, absent}
        xprtrdma: Fix trace point use-after-free race
        xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler()
        nfs: Fix potential posix_acl refcnt leak in nfs3_set_acl
        NFS/pnfs: Fix a credential use-after-free issue in pnfs_roc()
        NFS/pnfs: Ensure that _pnfs_return_layout() waits for layoutreturn completion
    • Merge tag 'dmaengine-fix-5.7-rc4' of git://git.infradead.org/users/vkoul/slave-dma · ed6889db
      Linus Torvalds authored
      Pull dmaengine fixes from Vinod Koul:
       "Core:
         - Documentation typo fixes
         - fix the channel indexes
         - dmatest: fixes for process hang and iterations
      
        Drivers:
         - hisilicon: build error fix without PCI_MSI
         - ti-k3: deadlock fix
         - uniphier-xdmac: fix for reg region
         - pch: fix data race
         - tegra: fix clock state"
      
      * tag 'dmaengine-fix-5.7-rc4' of git://git.infradead.org/users/vkoul/slave-dma:
        dmaengine: dmatest: Fix process hang when reading 'wait' parameter
        dmaengine: dmatest: Fix iteration non-stop logic
        dmaengine: tegra-apb: Ensure that clock is enabled during of DMA synchronization
        dmaengine: fix channel index enumeration
        dmaengine: mmp_tdma: Reset channel error on release
        dmaengine: mmp_tdma: Do not ignore slave config validation errors
        dmaengine: pch_dma.c: Avoid data race between probe and irq handler
        dt-bindings: dma: uniphier-xdmac: switch to single reg region
        include/linux/dmaengine: Typos fixes in API documentation
        dmaengine: xilinx_dma: Add missing check for empty list
        dmaengine: ti: k3-psil: fix deadlock on error path
        dmaengine: hisilicon: Fix build error without PCI_MSI
    • Merge tag 'vfio-v5.7-rc4' of git://github.com/awilliam/linux-vfio · 690e2aba
      Linus Torvalds authored
      Pull VFIO fixes from Alex Williamson:
      
       - copy_*_user validity check for new vfio_dma_rw interface (Yan Zhao)
      
       - Fix a potential math overflow (Yan Zhao)
      
       - Use follow_pfn() for calculating PFNMAPs (Sean Christopherson)
      
      * tag 'vfio-v5.7-rc4' of git://github.com/awilliam/linux-vfio:
        vfio/type1: Fix VA->PA translation for PFNMAP VMAs in vaddr_get_pfn()
        vfio: avoid possible overflow in vfio_iommu_type1_pin_pages
        vfio: checking of validity of user vaddr in vfio_dma_rw
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 42eb62d4
      Linus Torvalds authored
      Pull arm64 fix from Catalin Marinas:
       "Add -fasynchronous-unwind-tables to the vDSO CFLAGS"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: vdso: Add -fasynchronous-unwind-tables to cflags
    • Merge tag 'io_uring-5.7-2020-05-01' of git://git.kernel.dk/linux-block · cf018530
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
      
       - Fix for statx not grabbing the file table, making AT_EMPTY_PATH fail
      
       - Cover a few cases where async poll can handle retry, eliminating the
         need for an async thread
      
       - fallback request busy/free fix (Bijan)
      
       - syzbot reported SQPOLL thread exit fix for non-preempt (Xiaoguang)
      
       - Fix extra put of req for sync_file_range (Pavel)
      
       - Always punt splice async. We'll improve this for 5.8, but wanted to
         eliminate the inode mutex lock from the non-blocking path for 5.7
         (Pavel)
      
      * tag 'io_uring-5.7-2020-05-01' of git://git.kernel.dk/linux-block:
        io_uring: punt splice async because of inode mutex
        io_uring: check non-sync defer_list carefully
        io_uring: fix extra put in sync_file_range()
        io_uring: use cond_resched() in io_ring_ctx_wait_and_kill()
        io_uring: use proper references for fallback_req locking
        io_uring: only force async punt if poll based retry can't handle it
        io_uring: enable poll retry for any file with ->read_iter / ->write_iter
        io_uring: statx must grab the file table for valid fd
  5. 01 May, 2020 19 commits
  6. 30 Apr, 2020 7 commits