1. 24 Jan, 2021 36 commits
    • fs: introduce MOUNT_ATTR_IDMAP · 9caccd41
      Christian Brauner authored
      Introduce a new bind mount property to allow idmapping mounts. The
      MOUNT_ATTR_IDMAP flag can be set via the new mount_setattr() syscall
      together with a file descriptor referring to a user namespace.
      
      The user namespace referenced by the namespace file descriptor will be
      attached to the bind mount. All interactions with the filesystem going
      through that mount will be mapped according to the mapping specified in
      the user namespace attached to it.
      
      Using user namespaces to mark mounts means we can reuse all the existing
      infrastructure in the kernel that already exists to handle idmappings
      and can also use this for permission checking to allow unprivileged users
      to create idmapped mounts in the future.
      
      Idmapping a mount is decoupled from the caller's user and mount
      namespace. This means idmapped mounts can be created in the initial
      user namespace which is an important use-case for systemd-homed,
      portable usb-sticks between systems, sharing data between the initial
      user namespace and unprivileged containers, and other use-cases that
      have been brought up. For example, assume a home directory where all
      files are owned by uid and gid 1000 and the home directory is brought to
      a new laptop where the user has id 12345. The system administrator can
      simply create a mount of this home directory with a mapping of
      1000:12345:1 and other mappings to indicate the ids should be kept.
      (With this it is e.g. also possible to create idmapped mounts on the
      host with an identity mapping 1:1:100000 where the root user is not
      mapped. A user with root access that e.g. has been pivot rooted into
      such a mount on the host will not be able to execute, read, write, or
      create files as root.)
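
      The two mappings mentioned above can be modelled in userspace. The
      following is a minimal sketch, not the kernel's implementation: it
      assumes each triple is read as first:target:count in the order the
      message writes it, and struct id_extent, map_id() and ID_UNMAPPED are
      names invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel for "this id has no mapping". */
#define ID_UNMAPPED UINT32_MAX

/* One "first:target:count" extent, in the order used by the message. */
struct id_extent {
	uint32_t first;   /* first id as stored on the filesystem */
	uint32_t target;  /* first id as seen through the mount */
	uint32_t count;   /* number of consecutive ids covered */
};

/* Translate an on-disk id into the id seen through the idmapped mount. */
uint32_t map_id(const struct id_extent *map, size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++) {
		if (id >= map[i].first && id - map[i].first < map[i].count)
			return map[i].target + (id - map[i].first);
	}
	return ID_UNMAPPED;
}

/* Laptop example: files owned by 1000 appear as owned by 12345. */
const struct id_extent home_map[] = { { 1000, 12345, 1 } };

/* Parenthetical example: 1:1:100000 keeps ids but leaves root (0) unmapped. */
const struct id_extent no_root_map[] = { { 1, 1, 100000 } };
```

      In this model an unmapped id simply has no valid owner through the
      mount, which is why root cannot act as root in the second example.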
      
      Given that mapping a mount is decoupled from the caller's user namespace
      a sufficiently privileged process such as a container manager can set up
      an idmapped mount for the container and the container can simply pivot
      root to it. There's no need for the container to do anything. The mount
      will appear correctly mapped independent of the user namespace the
      container uses. This means we don't need to mark a mount as idmappable.
      
      In order to create an idmapped mount the caller must currently be
      privileged in the user namespace of the superblock the mount belongs to.
      Once a mount has been idmapped we don't allow it to change its mapping.
      This keeps permission checking and life-cycle management simple. Users
      wanting to change the idmapping can always create a new detached mount
      with a different idmapping.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-36-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Mauricio Vásquez Bernal <mauricio@kinvolk.io>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • fs: add mount_setattr() · 2a186721
      Christian Brauner authored
      This implements the missing mount_setattr() syscall. While the new mount
      api allows changing the properties of a superblock, there is currently
      no way to change the properties of a mount or a mount tree using file
      descriptors which the new mount api is based on. In addition the old
      mount api has the restriction that mount options cannot be applied
      recursively. This hasn't changed since changing mount options on a
      per-mount basis was implemented in [1] and has been a frequent request
      not just for convenience but also for security reasons. The legacy
      mount syscall is unable to accommodate this behavior without introducing
      a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
      MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
      mount. Changing MS_REC to apply to the whole mount tree would mean
      introducing a significant uapi change and would likely cause significant
      regressions.
      
      The new mount_setattr() syscall allows recursively clearing and setting
      mount options in one shot. Multiple calls to change mount options
      requesting the same changes are idempotent:
      
      int mount_setattr(int dfd, const char *path, unsigned flags,
                        struct mount_attr *uattr, size_t usize);
      
      Flags to modify path resolution behavior are specified in the @flags
      argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
      and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
      restrict path resolution as introduced with openat2() might be supported
      in the future.
      
      The mount_setattr() syscall can be expected to grow over time and is
      designed with extensibility in mind. It follows the extensible syscall
      pattern we have used with other syscalls such as openat2(), clone3(),
      sched_{set,get}attr(), and others.
      The set of mount options is passed in the uapi struct mount_attr which
      currently has the following layout:
      
      struct mount_attr {
      	__u64 attr_set;
      	__u64 attr_clr;
      	__u64 propagation;
      	__u64 userns_fd;
      };
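
      The extensible-struct pattern hinges on how a size mismatch between
      caller and kernel is handled. A minimal userspace sketch of that
      rule, assuming the usual convention (short structs are zero-extended,
      long structs must have zeroed tails); copy_extensible() and struct
      mount_attr_v0 are names made up for this example, not kernel API:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout quoted in the commit message, using fixed-width userspace types. */
struct mount_attr_v0 {
	uint64_t attr_set;
	uint64_t attr_clr;
	uint64_t propagation;
	uint64_t userns_fd;
};

/*
 * Copy a caller-supplied struct of usize bytes into a ksize-byte destination.
 * Hypothetical stand-in for the kernel-side copy step.
 */
int copy_extensible(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t n = usize < ksize ? usize : ksize;
	const unsigned char *p = src;

	/* Newer caller: fields we do not know about must all be zero. */
	for (size_t i = ksize; i < usize; i++)
		if (p[i] != 0)
			return -E2BIG;

	/* Older caller: fields it does not know about default to zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, n);
	return 0;
}
```

      Under this rule an old binary keeps working against a newer struct,
      and a new binary only fails on an old kernel if it actually uses a
      field that kernel cannot honour.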
      
      The @attr_set and @attr_clr members are used to set and clear mount
      options. This way a user can e.g. request that a set of flags is to be
      raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
      @attr_set while at the same time requesting that another set of flags is
      to be lowered such as removing noexec from a mount tree by specifying
      MOUNT_ATTR_NOEXEC in @attr_clr.
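
      The combined effect of the two fields can be sketched as one
      expression. This is an illustration only: the EX_ATTR_* names and
      values are assumptions of the example, and apply_attrs() is a
      hypothetical helper rather than a kernel function.

```c
#include <stdint.h>

/* Illustrative flag bits; names and values are assumptions of this sketch. */
#define EX_ATTR_RDONLY 0x1u
#define EX_ATTR_NOSUID 0x2u
#define EX_ATTR_NOEXEC 0x8u

/* Clear first, then set, so one call can both lower and raise flags. */
uint64_t apply_attrs(uint64_t cur, uint64_t attr_set, uint64_t attr_clr)
{
	return (cur & ~attr_clr) | attr_set;
}
```

      Applying the same request a second time leaves the flags unchanged,
      which matches the idempotency described above.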
      
      Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
      not a bitmap, users wanting to transition to a different atime setting
      cannot simply specify the atime setting in @attr_set, but must also
      specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
      MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
      can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
      @attr_clr.
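
      The two atime rules can be written down as a small validation
      routine. A hedged sketch follows; check_atime() and the EX_ATTR_*
      constants are hypothetical, chosen so that the atime modes form a
      multi-bit field whose first value is 0, as the note describes.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative values: a multi-bit atime field, not independent flag bits. */
#define EX_ATTR_ATIME_MASK  0x70u /* all bits of the atime field */
#define EX_ATTR_RELATIME    0x00u
#define EX_ATTR_NOATIME     0x10u
#define EX_ATTR_STRICTATIME 0x20u

int check_atime(uint64_t attr_set, uint64_t attr_clr)
{
	uint64_t clr = attr_clr & EX_ATTR_ATIME_MASK;

	/* The atime field may only be cleared as a whole. */
	if (clr != 0 && clr != EX_ATTR_ATIME_MASK)
		return -EINVAL;

	/* Setting an atime mode requires clearing the old one as well. */
	if ((attr_set & EX_ATTR_ATIME_MASK) != 0 && clr == 0)
		return -EINVAL;

	return 0;
}
```

      Because the first mode is encoded as 0, a request for it cannot be
      told apart from "no atime change" by looking at @attr_set alone; the
      full mask in @attr_clr is what signals the transition.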
      
      The @propagation field lets callers specify the propagation type of a
      mount tree. Propagation is a single property that has four different
      settings and as such is not really a flag argument but an enum.
      Specifically, it would be unclear what setting and clearing propagation
      settings in combination would amount to. The legacy mount() syscall thus
      forbids the combination of multiple propagation settings too. The goal
      is to keep the semantics of mount propagation somewhat simple as they
      are overly complex as it is.
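
      Treating propagation as an enum means at most one type may be
      requested per call. One way to express that check is shown below as
      a sketch; the EX_PROP_* bit values and check_propagation() are
      assumptions made for the example, and 0 is taken to mean "leave
      propagation unchanged".

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative bits for the four propagation types (assumed values). */
#define EX_PROP_UNBINDABLE (1u << 0)
#define EX_PROP_PRIVATE    (1u << 1)
#define EX_PROP_SLAVE      (1u << 2)
#define EX_PROP_SHARED     (1u << 3)
#define EX_PROP_ALL (EX_PROP_UNBINDABLE | EX_PROP_PRIVATE | \
		     EX_PROP_SLAVE | EX_PROP_SHARED)

int check_propagation(uint64_t propagation)
{
	/* Reject bits that are not propagation types at all. */
	if (propagation & ~(uint64_t)EX_PROP_ALL)
		return -EINVAL;

	/* x & (x - 1) clears the lowest set bit; nonzero means 2+ bits set. */
	if (propagation & (propagation - 1))
		return -EINVAL;

	return 0;
}
```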
      
      The @userns_fd field lets users specify a user namespace whose idmapping
      becomes the idmapping of the mount. This is implemented and explained in
      detail in the next patch.
      
      [1]: commit 2e4b7fcd ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
      
      Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-api@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • fs: add attr_flags_to_mnt_flags helper · 5b490500
      Christian Brauner authored
      Add a simple helper to translate uapi MOUNT_ATTR_* flags to MNT_* flags
      which we will use in follow-up patches too.
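
      A translation helper of this kind is typically a chain of bit tests.
      The following sketch shows the shape only; the EX_ATTR_* and EX_MNT_*
      constants and ex_attr_to_mnt_flags() are hypothetical stand-ins, not
      the kernel's definitions.

```c
#include <stdint.h>

/* Hypothetical uapi-side bits (what userspace passes in). */
#define EX_ATTR_RDONLY 0x1u
#define EX_ATTR_NOSUID 0x2u
#define EX_ATTR_NODEV  0x4u
#define EX_ATTR_NOEXEC 0x8u

/* Hypothetical internal bits (what the mount stores). */
#define EX_MNT_NOSUID   0x01u
#define EX_MNT_NODEV    0x02u
#define EX_MNT_NOEXEC   0x04u
#define EX_MNT_READONLY 0x40u

/* Map each external bit to its internal counterpart. */
unsigned int ex_attr_to_mnt_flags(uint64_t attr_flags)
{
	unsigned int mnt_flags = 0;

	if (attr_flags & EX_ATTR_RDONLY)
		mnt_flags |= EX_MNT_READONLY;
	if (attr_flags & EX_ATTR_NOSUID)
		mnt_flags |= EX_MNT_NOSUID;
	if (attr_flags & EX_ATTR_NODEV)
		mnt_flags |= EX_MNT_NODEV;
	if (attr_flags & EX_ATTR_NOEXEC)
		mnt_flags |= EX_MNT_NOEXEC;

	return mnt_flags;
}
```

      Keeping the two bit spaces separate lets the external values stay
      stable even if the internal ones are rearranged.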
      
      Link: https://lore.kernel.org/r/20210121131959.646623-34-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • fs: split out functions to hold writers · fbdc2f6c
      Christian Brauner authored
      When a mount is marked read-only we set MNT_WRITE_HOLD on it if there
      aren't currently any active writers. Split this logic out into simple
      helpers that we can use in follow-up patches.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-33-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • namespace: only take read lock in do_reconfigure_mnt() · e58ace1a
      Christian Brauner authored
      do_reconfigure_mnt() used to take the down_write(&sb->s_umount) lock
      which seems unnecessary since we're not changing the superblock. We're
      only checking whether it is already read-only. Setting other mount
      attributes is protected by lock_mount_hash() afaict and not by s_umount.
      
      The history of down_write(&sb->s_umount) lock being taken when setting
      mount attributes dates back to the introduction of MNT_READONLY in [2].
      This introduced the concept of having read-only mounts in contrast to
      just having a read-only superblock. When it got introduced it was simply
      plumbed into do_remount() which already took down_write(&sb->s_umount)
      because it was only used to actually change the superblock before [2].
      Afaict, it would've already been possible back then to only use
      down_read(&sb->s_umount) for MS_BIND | MS_REMOUNT since actual mount
      options were protected by the vfsmount lock already. But that would've
      meant special casing the locking for MS_BIND | MS_REMOUNT in
      do_remount() which people might not have considered worth it.
      Then in [1] MS_BIND | MS_REMOUNT mount option changes were split out of
      do_remount() into do_reconfigure_mnt() but the down_write(&sb->s_umount)
      lock was simply copied over.
      Now that this is a separate helper, only take the
      down_read(&sb->s_umount) lock since we're only interested in checking
      whether the super block is currently read-only and blocking any writers
      from changing it. Essentially, checking that the super block is
      read-only has the advantage that we can avoid having to go into the
      slowpath and through MNT_WRITE_HOLD and can simply set the read-only
      flag on the mount in set_mount_attributes().
      
      [1]: commit 43f5e655 ("vfs: Separate changing mount flags full remount")
      [2]: commit 2e4b7fcd ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
      
      Link: https://lore.kernel.org/r/20210121131959.646623-32-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • mount: make {lock,unlock}_mount_hash() static · d033cb67
      Christian Brauner authored
      The lock_mount_hash() and unlock_mount_hash() helpers are never called
      outside a single file. Remove them from the header and make them static
      to reflect this fact. There's no need to have them callable from other
      places right now, as Christoph observed.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-31-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • namespace: take lock_mount_hash() directly when changing flags · 68847c94
      Christian Brauner authored
      Changing mount options always ends up taking lock_mount_hash() but when
      MNT_READONLY is requested and neither the mount nor the superblock are
      MNT_READONLY we end up taking the lock, dropping it, and retaking it to
      change the other mount attributes. Instead, let's acquire the lock once
      when changing the mount attributes. This simplifies the locking in these
      codepaths, makes them easier to reason about and avoids having to
      reacquire the lock right after dropping it.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-30-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • nfs: do not export idmapped mounts · 899bf2ce
      Christian Brauner authored
      Prevent nfs from exporting idmapped mounts until we have ported it to
      support exporting idmapped mounts.
      
      Link: https://lore.kernel.org/linux-api/20210123130958.3t6kvgkl634njpsm@wittgenstein
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "J. Bruce Fields" <bfields@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • overlayfs: do not mount on top of idmapped mounts · 029a52ad
      Christian Brauner authored
      Prevent overlayfs from being mounted on top of idmapped mounts.
      Stacking filesystems need to be prevented from being mounted on top of
      idmapped mounts until they have been converted to handle this.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-29-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • ecryptfs: do not mount on top of idmapped mounts · 0f16ff0f
      Christian Brauner authored
      Prevent ecryptfs from being mounted on top of idmapped mounts.
      Stacking filesystems need to be prevented from being mounted on top of
      idmapped mounts until they have been converted to handle this.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-28-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • ima: handle idmapped mounts · a2d2329e
      Christian Brauner authored
      IMA does sometimes access the inode's i_uid and compares it against the
      rules' fowner. Enable IMA to handle idmapped mounts by passing down the
      mount's user namespace. We simply make use of the helpers we introduced
      before. If the initial user namespace is passed nothing changes so
      non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-27-christian.brauner@ubuntu.com
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • apparmor: handle idmapped mounts · 3cee6079
      Christian Brauner authored
      The i_uid and i_gid are mostly used when logging for AppArmor. This is
      broken in a bunch of places where the global root id is reported instead
      of the i_uid or i_gid of the file. Nonetheless, be kind and log the
      mapped inode if we're coming from an idmapped mount. If the initial user
      namespace is passed nothing changes so non-idmapped mounts will see
      identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-26-christian.brauner@ubuntu.com
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • fs: make helpers idmap mount aware · 549c7297
      Christian Brauner authored
      Extend some inode methods with an additional user namespace argument. A
      filesystem that is aware of idmapped mounts will receive the user
      namespace the mount has been marked with. This can be used for
      additional permission checking and also to enable filesystems to
      translate between uids and gids if they need to. We have implemented all
      relevant helpers in earlier patches.
      
      As requested we simply extend the existing inode methods instead of
      introducing new ones. This is a little more code churn but it's mostly
      mechanical and doesn't leave us with additional inode methods.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • exec: handle idmapped mounts · 1ab29965
      Christian Brauner authored
      When executing a setuid binary the kernel will verify in bprm_fill_uid()
      that the inode has a mapping in the caller's user namespace before
      setting the caller's uid and gid. Let bprm_fill_uid() handle idmapped
      mounts. If the inode is accessed through an idmapped mount it is mapped
      according to the mount's user namespace. Afterwards the checks are
      identical to non-idmapped mounts. If the initial user namespace is
      passed nothing changes so non-idmapped mounts will see identical
      behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-24-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • would_dump: handle idmapped mounts · 435ac621
      Christian Brauner authored
      When determining whether or not to create a coredump the vfs will verify
      that the caller is privileged over the inode. Make the would_dump()
      helper handle idmapped mounts by passing down the mount's user namespace
      of the exec file. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-23-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • ioctl: handle idmapped mounts · 0f5d220b
      Christian Brauner authored
      Enable generic ioctls to handle idmapped mounts by passing down the
      mount's user namespace. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-22-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • init: handle idmapped mounts · b816dd5d
      Christian Brauner authored
      Enable the init helpers to handle idmapped mounts by passing down the
      mount's user namespace. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-21-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • fcntl: handle idmapped mounts · 9eccd12c
      Christian Brauner authored
      Enable the setfl() helper to handle idmapped mounts by passing down the
      mount's user namespace. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-20-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • utimes: handle idmapped mounts · d06c26f1
      Christian Brauner authored
      Enable the vfs_utimes() helper to handle idmapped mounts by passing down
      the mount's user namespace. If the initial user namespace is passed
      nothing changes so non-idmapped mounts will see identical behavior as
      before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-19-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • af_unix: handle idmapped mounts · 7c02cf73
      Christian Brauner authored
      When binding a non-abstract AF_UNIX socket it will gain a representation
      in the filesystem. Enable the socket infrastructure to handle idmapped
      mounts by passing down the user namespace of the mount the socket will
      be created from. If the initial user namespace is passed nothing changes
      so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-18-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • open: handle idmapped mounts · b8b546a0
      Christian Brauner authored
      For core file operations such as changing directories or chrooting,
      determining file access, changing mode or ownership the vfs will verify
      that the caller is privileged over the inode. Extend the various helpers
      to handle idmapped mounts. If the inode is accessed through an idmapped
      mount, map it into the mount's user namespace. Afterwards the permission
      checks are identical to non-idmapped mounts. When changing file
      ownership we need to map the uid and gid from the mount's user
      namespace. If the initial user namespace is passed nothing changes so
      non-idmapped mounts will see identical behavior as before.
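
      The two directions mentioned here (mapping an inode's owner into the
      mount for checks, and mapping a requested owner back from the mount
      when changing ownership) can be sketched with a single extent. This
      is a simplified userspace model; struct id_extent, to_mount_view(),
      from_mount_view() and ID_UNMAPPED are names invented for the example.

```c
#include <stdint.h>

/* Hypothetical sentinel for "this id has no mapping". */
#define ID_UNMAPPED UINT32_MAX

/* A single mapping extent between on-disk ids and ids seen via the mount. */
struct id_extent {
	uint32_t disk;   /* first id as stored on the filesystem */
	uint32_t mount;  /* first id as seen through the mount */
	uint32_t count;  /* number of consecutive ids covered */
};

/* Permission checks: translate the inode's owner into the mount's view. */
uint32_t to_mount_view(const struct id_extent *e, uint32_t disk_id)
{
	if (disk_id >= e->disk && disk_id - e->disk < e->count)
		return e->mount + (disk_id - e->disk);
	return ID_UNMAPPED;
}

/* Ownership changes: translate the requested id back to the on-disk id. */
uint32_t from_mount_view(const struct id_extent *e, uint32_t mount_id)
{
	if (mount_id >= e->mount && mount_id - e->mount < e->count)
		return e->disk + (mount_id - e->mount);
	return ID_UNMAPPED;
}
```

      With an identity extent both functions return their input, which is
      the model's counterpart of "nothing changes" for non-idmapped mounts.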
      
      Link: https://lore.kernel.org/r/20210121131959.646623-17-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • open: handle idmapped mounts in do_truncate() · 643fe55a
      Christian Brauner authored
      When truncating files the vfs will verify that the caller is privileged
      over the inode. Extend it to handle idmapped mounts. If the inode is
      accessed through an idmapped mount it is mapped according to the mount's
      user namespace. Afterwards the permission checks are identical to
      non-idmapped mounts. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-16-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • namei: prepare for idmapped mounts · 6521f891
      Christian Brauner authored
      The various vfs_*() helpers are called by filesystems or by the vfs
      itself to perform core operations such as create, link, mkdir, mknod, rename,
      rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the
      inode is accessed through an idmapped mount, map it into the
      mount's user namespace and pass it down. Afterwards the checks and
      operations are identical to non-idmapped mounts. If the initial user
      namespace is passed nothing changes so non-idmapped mounts will see
      identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • namei: introduce struct renamedata · 9fe61450
      Christian Brauner authored
      In order to handle idmapped mounts we will extend the vfs rename helper
      to take two new arguments in follow-up patches. Since this operation
      already takes a bunch of arguments add a simple struct renamedata and
      make the current helper use it before we extend it.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-14-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • namei: handle idmapped mounts in may_*() helpers · ba73d987
      Christian Brauner authored
      The may_follow_link(), may_linkat(), may_lookup(), may_open(),
      may_o_create(), may_create_in_sticky(), may_delete(), and may_create()
      helpers determine whether the caller is privileged enough to perform the
      associated operations. Let them handle idmapped mounts by mapping the
      inode or fsids according to the mount's user namespace. Afterwards the
      checks are identical to non-idmapped inodes. The patch takes care to
      retrieve the mount's user namespace right before performing permission
      checks and passing it down into the filesystem so the user namespace
      can't change in between by someone idmapping a mount that is currently
      not idmapped. If the initial user namespace is passed nothing changes so
      non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • stat: handle idmapped mounts · 0d56a451
      Christian Brauner authored
      The generic_fillattr() helper fills in the basic attributes associated
      with an inode. Enable it to handle idmapped mounts. If the inode is
      accessed through an idmapped mount, map it into the mount's user
      namespace before we store the uid and gid. If the initial user namespace
      is passed nothing changes so non-idmapped mounts will see identical
      behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • commoncap: handle idmapped mounts · 71bc356f
      Christian Brauner authored
      When interacting with user namespace and non-user namespace aware
      filesystem capabilities the vfs will perform various security checks to
      determine whether or not the filesystem capabilities can be used by the
      caller, whether they need to be removed and so on. The main
      infrastructure for this resides in the capability codepaths but they are
      called through the LSM security infrastructure even though they are not
      technically an LSM or optional. This extends the existing security hooks
      security_inode_removexattr(), security_inode_killpriv(),
      security_inode_getsecurity() to pass down the mount's user namespace and
      makes them aware of idmapped mounts.
      
      In order to actually get filesystem capabilities from disk the
      capability infrastructure exposes the get_vfs_caps_from_disk() helper.
      For user namespace aware filesystem capabilities a root uid is stored
      alongside the capabilities.
      
      In order to determine whether the caller can make use of the filesystem
      capability or whether it needs to be ignored, the root uid is translated
      according to the superblock's user namespace. If it can be translated to uid 0
      according to that id mapping the caller can use the filesystem
      capabilities stored on disk. If we are accessing the inode that holds
      the filesystem capabilities through an idmapped mount we map the root
      uid according to the mount's user namespace. Afterwards the checks are
      identical to non-idmapped mounts: reading filesystem caps from disk
      enforces that the root uid associated with the filesystem capability
      must have a mapping in the superblock's user namespace and that the
      caller is either in the same user namespace or is a descendant of the
      superblock's user namespace. For filesystems that are mountable inside
      user namespaces the caller can just mount the filesystem and won't
      usually need to idmap it. If they do want to idmap it they can create an
      idmapped mount and mark it with a user namespace they created and which
      is thus a descendant of s_user_ns. For filesystems that are not
      mountable inside user namespaces the descendant rule is trivially true
      because the s_user_ns will be the initial user namespace.
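      
      A heavily simplified userspace sketch of the root uid check, under
      stated assumptions: struct idmap_model, id_into_mnt(), and
      fscaps_usable_model() are hypothetical names, the mapping has a single
      extent, and the superblock translation and the descendant rule
      described above are left out. It only shows that the stored root uid is
      first mapped through the mount and then compared with the kernel-side
      uid that acts as uid 0 for the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID UINT32_MAX /* stand-in for "no mapping" */

/* Hypothetical single-extent idmapping; NULL models a non-idmapped mount. */
struct idmap_model {
    uint32_t disk_first; /* first id as stored by the filesystem */
    uint32_t mnt_first;  /* first id as seen through the mount */
    uint32_t count;      /* number of consecutive ids covered */
};

/* Map a stored id to the id seen through the mount. */
uint32_t id_into_mnt(const struct idmap_model *m, uint32_t id)
{
    if (m == NULL)
        return id; /* initial user namespace: nothing changes */
    assert(m->count > 0);
    if (id >= m->disk_first && id - m->disk_first < m->count)
        return m->mnt_first + (id - m->disk_first);
    return INVALID_ID;
}

/*
 * Simplified check: the stored root uid, seen through the mount, must be
 * the kernel-side uid that acts as uid 0 in the caller's user namespace.
 */
int fscaps_usable_model(const struct idmap_model *m, uint32_t disk_rootid,
                        uint32_t caller_ns_root)
{
    uint32_t rootid = id_into_mnt(m, disk_rootid);

    return rootid != INVALID_ID && rootid == caller_ns_root;
}
```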
      
      If the initial user namespace is passed nothing changes so non-idmapped
      mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-11-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      71bc356f
    • xattr: handle idmapped mounts · c7c7a1a1
      Tycho Andersen authored
      When interacting with extended attributes the vfs verifies that the
      caller is privileged over the inode with which the extended attribute is
      associated. For posix access and posix default extended attributes a uid
      or gid can be stored on-disk. Let the functions handle posix extended
      attributes on idmapped mounts. If the inode is accessed through an
      idmapped mount we need to map it according to the mount's user
      namespace. Afterwards the checks are identical to non-idmapped mounts.
      This has no effect for e.g. security xattrs since they don't store uids
      or gids and don't perform permission checks on them like posix acls do.
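      
      As a small illustration of which attributes are affected, the sketch
      below tests an xattr name against the two posix acl names. The helper
      is_posix_acl_xattr() is a hypothetical name for this example and is not
      taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if the xattr name is one of the two posix acl attributes, which
 * can carry uids or gids and therefore need mapping on idmapped mounts.
 */
int is_posix_acl_xattr(const char *name)
{
    assert(name != NULL);
    return strcmp(name, "system.posix_acl_access") == 0 ||
           strcmp(name, "system.posix_acl_default") == 0;
}
```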
      
      Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      c7c7a1a1
    • acl: handle idmapped mounts · e65ce2a5
      Christian Brauner authored
      The posix acl permission checking helpers determine whether a caller is
      privileged over an inode according to the acls associated with the
      inode. Add helpers that make it possible to handle acls on idmapped
      mounts.
      
      The vfs and the filesystems targeted by this first iteration make use of
      posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
      translate basic posix access and default permissions such as the
      ACL_USER and ACL_GROUP type according to the initial user namespace (or
      the superblock's user namespace) to and from the caller's current user
      namespace. Adapt these two helpers to handle idmapped mounts whereby we
      either map from or into the mount's user namespace depending on in which
      direction we're translating.
      Similarly, cap_convert_nscap() is used by the vfs to translate user
      namespace and non-user namespace aware filesystem capabilities from the
      superblock's user namespace to the caller's user namespace. Enable it to
      handle idmapped mounts by accounting for the mount's user namespace.
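      
      The two translation directions can be sketched in userspace as follows.
      This is an illustrative model only: the tag enum, struct
      acl_entry_model, and acl_fix_ids() are hypothetical, the mapping is
      assumed to have a single extent, and the caller's own user namespace is
      ignored. Only entries that name a specific user or group carry an id
      that needs translating.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID UINT32_MAX /* stand-in for "no mapping" */

/* Hypothetical single-extent idmapping; NULL models a non-idmapped mount. */
struct idmap_model {
    uint32_t disk_first, mnt_first, count;
};

/* Hypothetical entry tags; only TAG_USER and TAG_GROUP carry an id. */
enum { TAG_USER_OBJ, TAG_USER, TAG_GROUP_OBJ, TAG_GROUP, TAG_OTHER };

struct acl_entry_model { int tag; uint32_t perm; uint32_t id; };

/* Translate one id between [from, from + count) and [to, to + count). */
static uint32_t map_range(uint32_t id, uint32_t from, uint32_t to,
                          uint32_t count)
{
    if (id >= from && id - from < count)
        return to + (id - from);
    return INVALID_ID;
}

/* to_user != 0 maps stored ids into the mount, 0 maps them back. */
void acl_fix_ids(const struct idmap_model *m, struct acl_entry_model *e,
                 size_t n, int to_user)
{
    assert(e != NULL || n == 0);
    if (m == NULL)
        return; /* non-idmapped mount: nothing changes */
    for (size_t i = 0; i < n; i++) {
        if (e[i].tag != TAG_USER && e[i].tag != TAG_GROUP)
            continue;
        e[i].id = to_user
            ? map_range(e[i].id, m->disk_first, m->mnt_first, m->count)
            : map_range(e[i].id, m->mnt_first, m->disk_first, m->count);
    }
}
```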
      
      In addition the filesystems targeted in the first iteration of this patch
      series make use of the posix_acl_chmod() and posix_acl_update_mode()
      helpers. Both helpers perform permission checks on the target inode. Let
      them handle idmapped mounts. These two helpers are called when posix
      acls are set by the respective filesystems. To handle this case we extend
      the ->set() method to take an additional user namespace argument to pass
      the mount's user namespace down.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      e65ce2a5
    • attr: handle idmapped mounts · 2f221d6f
      Christian Brauner authored
      When file attributes are changed most filesystems rely on the
      setattr_prepare(), setattr_copy(), and notify_change() helpers for
      initialization and permission checking. Let them handle idmapped mounts.
      If the inode is accessed through an idmapped mount, map it into the
      mount's user namespace. Afterwards the checks are identical to
      non-idmapped mounts. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Helpers that perform checks on the ia_uid and ia_gid fields in struct
      iattr assume that ia_uid and ia_gid are intended values and have already
      been mapped correctly at the userspace-kernelspace boundary as we
      already do today. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
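      
      A simplified userspace sketch of an ownership-change check in this
      style. The names struct idmap_model, id_into_mnt(), and chown_ok_model()
      are hypothetical, the mapping is assumed to have a single extent, and
      the capability check is reduced to a flag. The requested ia_uid is
      taken as the intended value and is compared against the inode's owner
      as seen through the mount.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID UINT32_MAX /* stand-in for "no mapping" */

/* Hypothetical single-extent idmapping; NULL models a non-idmapped mount. */
struct idmap_model {
    uint32_t disk_first; /* first id as stored by the filesystem */
    uint32_t mnt_first;  /* first id as seen through the mount */
    uint32_t count;      /* number of consecutive ids covered */
};

/* Map a stored id to the id seen through the mount. */
uint32_t id_into_mnt(const struct idmap_model *m, uint32_t id)
{
    if (m == NULL)
        return id; /* initial user namespace: nothing changes */
    assert(m->count > 0);
    if (id >= m->disk_first && id - m->disk_first < m->count)
        return m->mnt_first + (id - m->disk_first);
    return INVALID_ID;
}

/*
 * Allow the change if the caller owns the inode (as seen through the mount)
 * and keeps the owner unchanged, or if the caller holds the capability and
 * the owner has a mapping at all.
 */
int chown_ok_model(const struct idmap_model *m, uint32_t inode_uid,
                   uint32_t caller_fsuid, uint32_t ia_uid, int has_cap_chown)
{
    uint32_t owner = id_into_mnt(m, inode_uid);

    if (owner == INVALID_ID)
        return 0;
    if (caller_fsuid == owner && ia_uid == owner)
        return 1;
    return has_cap_chown != 0;
}
```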
      
      Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      2f221d6f
    • inode: make init and permission helpers idmapped mount aware · 21cb47be
      Christian Brauner authored
      The inode_owner_or_capable() helper determines whether the caller is the
      owner of the inode or is capable with respect to that inode. Allow it to
      handle idmapped mounts. If the inode is accessed through an idmapped
      mount, map it according to the mount's user namespace. Afterwards the checks
      are identical to non-idmapped mounts. If the initial user namespace is
      passed nothing changes so non-idmapped mounts will see identical
      behavior as before.
      
      Similarly, allow the inode_init_owner() helper to handle idmapped
      mounts. It initializes a new inode on idmapped mounts by mapping the
      fsuid and fsgid of the caller from the mount's user namespace. If the
      initial user namespace is passed nothing changes so non-idmapped mounts
      will see identical behavior as before.
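      
      Both directions can be sketched in userspace as below. All names
      (struct idmap_model, id_into_mnt(), id_from_mnt(), inode_owner_model(),
      inode_init_owner_model()) are hypothetical, the mapping is assumed to
      have a single extent, and the capability branch of the ownership check
      is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID UINT32_MAX /* stand-in for "no mapping" */

/* Hypothetical single-extent idmapping; NULL models a non-idmapped mount. */
struct idmap_model {
    uint32_t disk_first, mnt_first, count;
};

struct inode_model { uint32_t uid, gid; };

/* Stored id -> id seen through the mount. */
uint32_t id_into_mnt(const struct idmap_model *m, uint32_t id)
{
    if (m == NULL)
        return id;
    if (id >= m->disk_first && id - m->disk_first < m->count)
        return m->mnt_first + (id - m->disk_first);
    return INVALID_ID;
}

/* Id seen through the mount -> id to store. */
uint32_t id_from_mnt(const struct idmap_model *m, uint32_t id)
{
    if (m == NULL)
        return id;
    if (id >= m->mnt_first && id - m->mnt_first < m->count)
        return m->disk_first + (id - m->mnt_first);
    return INVALID_ID;
}

/* Ownership part only: is the caller the owner as seen through the mount? */
int inode_owner_model(const struct idmap_model *m,
                      const struct inode_model *inode, uint32_t fsuid)
{
    uint32_t owner = id_into_mnt(m, inode->uid);

    return owner != INVALID_ID && owner == fsuid;
}

/* A new inode stores the caller's fsuid and fsgid mapped back from the mount. */
void inode_init_owner_model(const struct idmap_model *m,
                            struct inode_model *inode,
                            uint32_t fsuid, uint32_t fsgid)
{
    assert(inode != NULL);
    inode->uid = id_from_mnt(m, fsuid);
    inode->gid = id_from_mnt(m, fsgid);
}
```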
      
      Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      21cb47be
    • namei: make permission helpers idmapped mount aware · 47291baa
      Christian Brauner authored
      The two helpers inode_permission() and generic_permission() are used by
      the vfs to perform basic permission checking by verifying that the
      caller is privileged over an inode. In order to handle idmapped mounts
      we extend the two helpers with an additional user namespace argument.
      On idmapped mounts the two helpers will make sure to map the inode
      according to the mount's user namespace and then perform identical
      permission checks to inode_permission() and generic_permission(). If the
      initial user namespace is passed nothing changes so non-idmapped mounts
      will see identical behavior as before.
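      
      A minimal userspace sketch of a mode-bit check with the extra mapping
      argument. The names struct idmap_model, id_into_mnt(), and
      permission_model() are hypothetical; the mapping is assumed to have a
      single extent, and supplementary groups, acls, and capabilities are
      ignored. The mask uses 4, 2, and 1 for read, write, and execute.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID UINT32_MAX /* stand-in for "no mapping" */

/* Hypothetical single-extent idmapping; NULL models a non-idmapped mount. */
struct idmap_model {
    uint32_t disk_first; /* first id as stored by the filesystem */
    uint32_t mnt_first;  /* first id as seen through the mount */
    uint32_t count;      /* number of consecutive ids covered */
};

struct inode_model { uint32_t uid, gid, mode; };

/* Map a stored id to the id seen through the mount. */
uint32_t id_into_mnt(const struct idmap_model *m, uint32_t id)
{
    if (m == NULL)
        return id; /* initial user namespace: nothing changes */
    assert(m->count > 0);
    if (id >= m->disk_first && id - m->disk_first < m->count)
        return m->mnt_first + (id - m->disk_first);
    return INVALID_ID;
}

/* Pick the owner, group, or other bits after mapping the inode's ids. */
int permission_model(const struct idmap_model *m,
                     const struct inode_model *inode,
                     uint32_t fsuid, uint32_t fsgid, uint32_t mask)
{
    uint32_t uid = id_into_mnt(m, inode->uid);
    uint32_t gid = id_into_mnt(m, inode->gid);
    uint32_t mode = inode->mode;

    if (uid != INVALID_ID && uid == fsuid)
        mode >>= 6; /* owner bits */
    else if (gid != INVALID_ID && gid == fsgid)
        mode >>= 3; /* group bits */
    return (mask & ~mode & 7u) == 0 ? 0 : -EACCES;
}
```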
      
      Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      47291baa
    • capability: handle idmapped mounts · 0558c1bf
      Christian Brauner authored
      In order to determine whether a caller holds privilege over a given
      inode the capability framework exposes the two helpers
      privileged_wrt_inode_uidgid() and capable_wrt_inode_uidgid(). The former
      verifies that the inode has a mapping in the caller's user namespace and
      the latter additionally verifies that the caller has the requested
      capability in their current user namespace.
      If the inode is accessed through an idmapped mount, map it into the
      mount's user namespace. Afterwards the checks are identical to
      non-idmapped inodes. If the initial user namespace is passed all
      operations are a nop so non-idmapped mounts will not see a change in
      behavior.
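      
      The two checks can be modeled in userspace roughly as follows. This is
      a sketch under assumptions: struct idmap_model, struct userns_model,
      and the *_model() functions are hypothetical, both mappings have a
      single extent, and the capability lookup is reduced to a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID UINT32_MAX /* stand-in for "no mapping" */

/* Hypothetical single-extent idmapping; NULL models a non-idmapped mount. */
struct idmap_model {
    uint32_t disk_first, mnt_first, count;
};

/* Hypothetical caller namespace: the range of kernel-side ids it maps. */
struct userns_model {
    uint32_t host_first, count;
};

/* Map a stored id to the id seen through the mount. */
uint32_t id_into_mnt(const struct idmap_model *m, uint32_t id)
{
    if (m == NULL)
        return id; /* nop for non-idmapped mounts */
    if (id >= m->disk_first && id - m->disk_first < m->count)
        return m->mnt_first + (id - m->disk_first);
    return INVALID_ID;
}

int ns_has_mapping(const struct userns_model *ns, uint32_t id)
{
    assert(ns != NULL);
    return id != INVALID_ID && id >= ns->host_first &&
           id - ns->host_first < ns->count;
}

/* Do the inode's uid and gid, seen through the mount, map in the caller's ns? */
int privileged_wrt_inode_model(const struct userns_model *ns,
                               const struct idmap_model *m,
                               uint32_t uid, uint32_t gid)
{
    return ns_has_mapping(ns, id_into_mnt(m, uid)) &&
           ns_has_mapping(ns, id_into_mnt(m, gid));
}

/* Additionally require that the caller holds the requested capability. */
int capable_wrt_inode_model(const struct userns_model *ns,
                            const struct idmap_model *m,
                            uint32_t uid, uint32_t gid, int has_cap)
{
    return has_cap && privileged_wrt_inode_model(ns, m, uid, gid);
}
```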
      
      Link: https://lore.kernel.org/r/20210121131959.646623-5-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      0558c1bf
    • fs: add file and path permissions helpers · 02f92b38
      Christian Brauner authored
      Add two simple helpers to check permissions on a file and path
      respectively and convert over some callers. It simplifies quite a few
      codepaths and also reduces the churn in later patches quite a bit.
      Christoph also correctly points out that this makes codepaths (e.g.
      ioctls) way easier to follow that would otherwise have to do more
      complex argument passing than necessary.
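      
      The shape of such wrappers can be sketched in userspace as below. Every
      type and function here is a hypothetical model, not the kernel
      definition, and the inner check only looks at the "other" permission
      bits. The wrappers exist so that callers holding a file or a path do
      not have to dig out the inode and the mount's mapping themselves.

```c
#include <assert.h>
#include <stddef.h>

struct idmap_model; /* opaque here; NULL models a non-idmapped mount */

struct inode_model { unsigned mode; };
struct mount_model { const struct idmap_model *idmap; };
struct path_model  { struct mount_model *mnt; struct inode_model *inode; };
struct file_model  { struct path_model path; };

/* Stand-in for the full check: tests the "other" bits only. */
int inode_permission_model(const struct idmap_model *m,
                           const struct inode_model *inode, unsigned mask)
{
    (void)m; /* a fuller model would map the inode's ids through m */
    return (mask & ~inode->mode & 7u) == 0 ? 0 : -1;
}

/* Check permissions on a path: the mount supplies the mapping. */
int path_permission_model(const struct path_model *p, unsigned mask)
{
    assert(p != NULL && p->mnt != NULL && p->inode != NULL);
    return inode_permission_model(p->mnt->idmap, p->inode, mask);
}

/* Check permissions on an open file via the path it was opened through. */
int file_permission_model(const struct file_model *f, unsigned mask)
{
    assert(f != NULL);
    return path_permission_model(&f->path, mask);
}
```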
      
      Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      02f92b38
    • fs: add id translation helpers · e6c9a714
      Christian Brauner authored
      Add simple helpers to make it easy to map kuids into and from idmapped
      mounts. We provide simple wrappers that filesystems can use to e.g.
      initialize inodes similar to i_{uid,gid}_read() and i_{uid,gid}_write().
      Accessing an inode through an idmapped mount maps the i_uid and i_gid of
      the inode to the mount's user namespace. If the fsids are used to
      initialize inodes they are unmapped according to the mount's user
      namespace. Passing the initial user namespace to these helpers makes
      them a nop and so any non-idmapped paths will not be impacted.
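      
      A minimal userspace model of the two directions, assuming a
      single-extent mapping. The names struct idmap_model, id_into_mnt(), and
      id_from_mnt() are hypothetical and only mirror the idea described
      above; a NULL mapping stands in for the initial user namespace and
      makes both helpers a nop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID UINT32_MAX /* stand-in for "no mapping" */

/* Hypothetical single-extent idmapping; NULL models a non-idmapped mount. */
struct idmap_model {
    uint32_t disk_first; /* first id as stored by the filesystem */
    uint32_t mnt_first;  /* first id as seen through the mount */
    uint32_t count;      /* number of consecutive ids covered */
};

/* Map an inode's stored id to the id seen through the mount. */
uint32_t id_into_mnt(const struct idmap_model *m, uint32_t id)
{
    if (m == NULL)
        return id; /* nop for non-idmapped mounts */
    assert(m->count > 0);
    if (id >= m->disk_first && id - m->disk_first < m->count)
        return m->mnt_first + (id - m->disk_first);
    return INVALID_ID;
}

/* Reverse direction: map a caller's fsuid or fsgid to the id to store. */
uint32_t id_from_mnt(const struct idmap_model *m, uint32_t id)
{
    if (m == NULL)
        return id; /* nop for non-idmapped mounts */
    assert(m->count > 0);
    if (id >= m->mnt_first && id - m->mnt_first < m->count)
        return m->disk_first + (id - m->mnt_first);
    return INVALID_ID;
}
```

      With a mapping of { 1000, 12345, 1 }, a file stored with uid 1000
      appears as uid 12345 through the mount, and a file created by a caller
      with fsuid 12345 is stored with uid 1000.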
      
      Link: https://lore.kernel.org/r/20210121131959.646623-3-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      e6c9a714
    • mount: attach mappings to mounts · a6435940
      Christian Brauner authored
      In order to support per-mount idmappings vfsmounts are marked with user
      namespaces. The idmapping of the user namespace will be used to map the
      ids of vfs objects when they are accessed through that mount. By default
      all vfsmounts are marked with the initial user namespace. The initial
      user namespace is used to indicate that a mount is not idmapped. All
      operations behave as before.
      
      Based on prior discussions we want to attach the whole user namespace
      and not just a dedicated idmapping struct. This allows us to reuse all
      the helpers that already exist for dealing with idmappings instead of
      introducing a whole new range of helpers. In addition, if we decide in
      the future that we are confident enough to enable unprivileged users to
      set up idmapped mounts the permission checking can take into account
      whether the caller is privileged in the user namespace the mount is
      currently marked with.
      Later patches enforce that once a mount has been idmapped it can't be
      remapped. This keeps permission checking and life-cycle management
      simple. Users wanting to change the idmapping can always create a new
      detached mount with a different idmapping.
      
      Add a new mnt_userns member to vfsmount and two simple helpers to
      retrieve the mnt_userns from vfsmounts and files.
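      
      A userspace sketch of the data layout and the one-shot rule described
      above. The *_model types, the accessor names, and the -1 error value
      are hypothetical choices for this example; only the mnt_userns member
      name comes from the text. Pointer identity with a single initial
      namespace object marks a mount as not idmapped.

```c
#include <assert.h>
#include <stddef.h>

struct userns_model { const char *name; };

/* The single object that stands for the initial user namespace. */
const struct userns_model init_userns_model = { "init" };

struct vfsmount_model { const struct userns_model *mnt_userns; };
struct file_model     { struct vfsmount_model *mnt; };

/* By default every mount is marked with the initial user namespace. */
void vfsmount_model_init(struct vfsmount_model *mnt)
{
    assert(mnt != NULL);
    mnt->mnt_userns = &init_userns_model;
}

/* Accessor for mounts. */
const struct userns_model *mnt_userns_of(const struct vfsmount_model *mnt)
{
    return mnt->mnt_userns;
}

/* Accessor for files: go through the mount the file was opened on. */
const struct userns_model *file_mnt_userns_of(const struct file_model *f)
{
    return mnt_userns_of(f->mnt);
}

int mount_is_idmapped(const struct vfsmount_model *mnt)
{
    return mnt->mnt_userns != &init_userns_model;
}

/* Once a mount has been idmapped it cannot be remapped. */
int mount_set_idmap(struct vfsmount_model *mnt, const struct userns_model *ns)
{
    if (mount_is_idmapped(mnt))
        return -1;
    mnt->mnt_userns = ns;
    return 0;
}
```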
      
      The idea to attach user namespaces to vfsmounts has been floated around
      in various forms at Linux Plumbers in ~2018 with the original idea
      tracing back to a discussion in 2017 at a conference in St. Petersburg
      between Christoph, Tycho, and myself.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-2-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      a6435940
  2. 18 Jan, 2021 1 commit
  3. 17 Jan, 2021 3 commits
    • Merge tag 'perf-tools-fixes-2021-01-17' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux · e2da7836
      Linus Torvalds authored
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Fix 'CPU too large' error in Intel PT
      
       - Correct event attribute sizes in 'perf inject'
      
       - Sync build_bug.h and kvm.h kernel copies
      
       - Fix bpf.h header include directive in 5sec.c 'perf trace' bpf example
      
       - libbpf tests fixes
      
       - Fix shadow stat 'perf test' for non-bash shells
      
       - Take cgroups into account for shadow stats in 'perf stat'
      
      * tag 'perf-tools-fixes-2021-01-17' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
        perf inject: Correct event attribute sizes
        perf intel-pt: Fix 'CPU too large' error
        perf stat: Take cgroups into account for shadow stats
        perf stat: Introduce struct runtime_stat_data
        libperf tests: Fail when failing to get a tracepoint id
        libperf tests: If a test fails return non-zero
        libperf tests: Avoid uninitialized variable warning
        perf test: Fix shadow stat test for non-bash shells
        tools headers: Syncronize linux/build_bug.h with the kernel sources
        tools headers UAPI: Sync kvm.h headers with the kernel sources
        perf bpf examples: Fix bpf.h header include directive in 5sec.c example
      e2da7836
    • Merge tag 'powerpc-5.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · a1339d63
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "One fix for a lack of alignment in our linker script, that can lead to
        crashes depending on configuration etc.
      
        One fix for the 32-bit VDSO after the C VDSO conversion.
      
        Thanks to Andreas Schwab, Ariel Marcovitch, and Christophe Leroy"
      
      * tag 'powerpc-5.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/vdso: Fix clock_gettime_fallback for vdso32
        powerpc: Fix alignment bug within the init sections
      a1339d63
    • Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · a527a2b3
      Linus Torvalds authored
      Pull misc vfs fixes from Al Viro:
       "Several assorted fixes.
      
        I still think that audit ->d_name race is better fixed this way for
        the benefit of backports, with any possibly fancier variants done on
        top of it"
      
      * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        dump_common_audit_data(): fix racy accesses to ->d_name
        iov_iter: fix the uaccess area in copy_compat_iovec_from_user
        umount(2): move the flag validity checks first
      a527a2b3