1. 24 Jan, 2021 15 commits
    • open: handle idmapped mounts in do_truncate() · 643fe55a
      Christian Brauner authored
      When truncating files the vfs will verify that the caller is privileged
      over the inode. Extend it to handle idmapped mounts. If the inode is
      accessed through an idmapped mount it is mapped according to the mount's
      user namespace. Afterwards the permissions checks are identical to
      non-idmapped mounts. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-16-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • namei: prepare for idmapped mounts · 6521f891
      Christian Brauner authored
      The various vfs_*() helpers are called by filesystems or by the vfs
      itself to perform core operations such as create, link, mkdir, mknod, rename,
      rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the
      inode is accessed through an idmapped mount map it into the
      mount's user namespace and pass it down. Afterwards the checks and
      operations are identical to non-idmapped mounts. If the initial user
      namespace is passed nothing changes so non-idmapped mounts will see
      identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • namei: introduce struct renamedata · 9fe61450
      Christian Brauner authored
      In order to handle idmapped mounts we will extend the vfs rename helper
      to take two new arguments in follow up patches. Since this operation
      already takes a bunch of arguments add a simple struct renamedata and
      make the current helper use it before we extend it.
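
      As a rough illustration of the pattern only (this is not the kernel code), the sketch below bundles the arguments of a rename helper into one parameter object, so that fields can be added later without touching every call site. The class, its field names, and the `rename` function are all hypothetical, and a directory is modelled as a plain dict.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Toy model: a "directory" is a dict mapping entry names to file contents.
Directory = Dict[str, str]


@dataclass
class RenameData:
    """Hypothetical argument bundle for a rename helper."""
    old_dir: Directory
    old_name: str
    new_dir: Directory
    new_name: str
    flags: int = 0
    # Fields added later carry defaults, so existing callers keep working.
    # These two names are assumptions made for this sketch.
    old_mnt_userns: Optional[object] = None
    new_mnt_userns: Optional[object] = None


def rename(rd: RenameData) -> None:
    """Move an entry between directories using only the bundled arguments."""
    if rd.old_name not in rd.old_dir:
        raise FileNotFoundError(rd.old_name)
    rd.new_dir[rd.new_name] = rd.old_dir.pop(rd.old_name)
```

      A caller builds one `RenameData` and passes it down, so extending the helper later means adding a field rather than changing a signature.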
      
      Link: https://lore.kernel.org/r/20210121131959.646623-14-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • namei: handle idmapped mounts in may_*() helpers · ba73d987
      Christian Brauner authored
      The may_follow_link(), may_linkat(), may_lookup(), may_open(),
      may_o_create(), may_create_in_sticky(), may_delete(), and may_create()
      helpers determine whether the caller is privileged enough to perform the
      associated operations. Let them handle idmapped mounts by mapping the
      inode or fsids according to the mount's user namespace. Afterwards the
      checks are identical to non-idmapped inodes. The patch takes care to
      retrieve the mount's user namespace right before performing permission
      checks and passing it down into the filesystem so the user namespace
      can't change in between by someone idmapping a mount that is currently
      not idmapped. If the initial user namespace is passed nothing changes so
      non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • stat: handle idmapped mounts · 0d56a451
      Christian Brauner authored
      The generic_fillattr() helper fills in the basic attributes associated
      with an inode. Enable it to handle idmapped mounts. If the inode is
      accessed through an idmapped mount map it into the mount's user
      namespace before we store the uid and gid. If the initial user namespace
      is passed nothing changes so non-idmapped mounts will see identical
      behavior as before.
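
      A toy userspace model of this step, under stated assumptions: a mount's mapping is reduced to a single (first, lower_first, count) extent, `None` stands in for the initial user namespace, and an id without a mapping is reported as `None`. The names `into_mnt` and `fill_attr` are made up for this sketch and are not the kernel's functions.

```python
from typing import NamedTuple, Optional, Tuple

# (first, lower_first, count); None models the initial user namespace.
IdMap = Optional[Tuple[int, int, int]]


class Inode(NamedTuple):
    uid: int
    gid: int
    mode: int


def into_mnt(mnt_map: IdMap, raw_id: int) -> Optional[int]:
    """Translate a stored id through the mount's mapping (identity if None)."""
    if mnt_map is None:
        return raw_id
    first, lower_first, count = mnt_map
    if first <= raw_id < first + count:
        return lower_first + (raw_id - first)
    return None  # this id has no mapping in the mount's namespace


def fill_attr(mnt_map: IdMap, inode: Inode) -> dict:
    """Build stat-like attributes, mapping ownership before storing it."""
    return {
        "uid": into_mnt(mnt_map, inode.uid),
        "gid": into_mnt(mnt_map, inode.gid),
        "mode": inode.mode,
    }
```

      With `mnt_map=None` the attributes come back unchanged, which mirrors the "nothing changes" case for non-idmapped mounts.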
      
      Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • commoncap: handle idmapped mounts · 71bc356f
      Christian Brauner authored
      When interacting with user namespace and non-user namespace aware
      filesystem capabilities the vfs will perform various security checks to
      determine whether or not the filesystem capabilities can be used by the
      caller, whether they need to be removed and so on. The main
      infrastructure for this resides in the capability codepaths but they are
      called through the LSM security infrastructure even though they are not
      technically an LSM or optional. This extends the existing security hooks
      security_inode_removexattr(), security_inode_killpriv(),
      security_inode_getsecurity() to pass down the mount's user namespace and
      makes them aware of idmapped mounts.
      
      In order to actually get filesystem capabilities from disk the
      capability infrastructure exposes the get_vfs_caps_from_disk() helper.
      For user namespace aware filesystem capabilities a root uid is stored
      alongside the capabilities.
      
      In order to determine whether the caller can make use of the filesystem
      capability or whether it needs to be ignored it is translated according
      to the superblock's user namespace. If it can be translated to uid 0
      according to that id mapping the caller can use the filesystem
      capabilities stored on disk. If we are accessing the inode that holds
      the filesystem capabilities through an idmapped mount we map the root
      uid according to the mount's user namespace. Afterwards the checks are
      identical to non-idmapped mounts: reading filesystem caps from disk
      enforces that the root uid associated with the filesystem capability
      must have a mapping in the superblock's user namespace and that the
      caller is either in the same user namespace or is a descendant of the
      superblock's user namespace. For filesystems that are mountable inside
      user namespace the caller can just mount the filesystem and won't
      usually need to idmap it. If they do want to idmap it they can create an
      idmapped mount and mark it with a user namespace they created and which
      is thus a descendant of s_user_ns. For filesystems that are not
      mountable inside user namespaces the descendant rule is trivially true
      because the s_user_ns will be the initial user namespace.
      
      If the initial user namespace is passed nothing changes so non-idmapped
      mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-11-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • xattr: handle idmapped mounts · c7c7a1a1
      Tycho Andersen authored
      When interacting with extended attributes the vfs verifies that the
      caller is privileged over the inode with which the extended attribute is
      associated. For posix access and posix default extended attributes a uid
      or gid can be stored on-disk. Let the functions handle posix extended
      attributes on idmapped mounts. If the inode is accessed through an
      idmapped mount we need to map it according to the mount's user
      namespace. Afterwards the checks are identical to non-idmapped mounts.
      This has no effect for e.g. security xattrs since they don't store uids
      or gids and don't perform permission checks on them like posix acls do.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • acl: handle idmapped mounts · e65ce2a5
      Christian Brauner authored
      The posix acl permission checking helpers determine whether a caller is
      privileged over an inode according to the acls associated with the
      inode. Add helpers that make it possible to handle acls on idmapped
      mounts.
      
      The vfs and the filesystems targeted by this first iteration make use of
      posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
      translate basic posix access and default permissions such as the
      ACL_USER and ACL_GROUP type according to the initial user namespace (or
      the superblock's user namespace) to and from the caller's current user
      namespace. Adapt these two helpers to handle idmapped mounts whereby we
      either map from or into the mount's user namespace depending on in which
      direction we're translating.
      Similarly, cap_convert_nscap() is used by the vfs to translate user
      namespace and non-user namespace aware filesystem capabilities from the
      superblock's user namespace to the caller's user namespace. Enable it to
      handle idmapped mounts by accounting for the mount's user namespace.
      
      In addition the filesystems targeted in the first iteration of this patch
      series make use of the posix_acl_chmod() and posix_acl_update_mode()
      helpers. Both helpers perform permission checks on the target inode. Let
      them handle idmapped mounts. These two helpers are called when posix
      acls are set by the respective filesystems. To handle this case we extend
      the ->set() method to take an additional user namespace argument to pass
      the mount's user namespace down.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • attr: handle idmapped mounts · 2f221d6f
      Christian Brauner authored
      When file attributes are changed most filesystems rely on the
      setattr_prepare(), setattr_copy(), and notify_change() helpers for
      initialization and permission checking. Let them handle idmapped mounts.
      If the inode is accessed through an idmapped mount map it into the
      mount's user namespace. Afterwards the checks are identical to
      non-idmapped mounts. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Helpers that perform checks on the ia_uid and ia_gid fields in struct
      iattr assume that ia_uid and ia_gid are intended values and have already
      been mapped correctly at the userspace-kernelspace boundary as we
      already do today. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • inode: make init and permission helpers idmapped mount aware · 21cb47be
      Christian Brauner authored
      The inode_owner_or_capable() helper determines whether the caller is the
      owner of the inode or is capable with respect to that inode. Allow it to
      handle idmapped mounts. If the inode is accessed through an idmapped
      mount, map it according to the mount's user namespace. Afterwards the checks
      are identical to non-idmapped mounts. If the initial user namespace is
      passed nothing changes so non-idmapped mounts will see identical
      behavior as before.
      
      Similarly, allow the inode_init_owner() helper to handle idmapped
      mounts. It initializes a new inode on idmapped mounts by mapping the
      fsuid and fsgid of the caller from the mount's user namespace. If the
      initial user namespace is passed nothing changes so non-idmapped mounts
      will see identical behavior as before.
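
      The two directions described above can be sketched in a small userspace model. Everything here is an assumption made for illustration: the single-extent mapping, the use of `None` for the initial user namespace, the function names, the `has_cap` flag standing in for a capability check, and the choice to raise an error when the caller's id has no mapping.

```python
from typing import NamedTuple, Optional, Tuple

# (first, lower_first, count); None models the initial user namespace.
IdMap = Optional[Tuple[int, int, int]]


class Inode(NamedTuple):
    uid: int
    gid: int


def into_mnt(mnt_map: IdMap, stored_id: int) -> Optional[int]:
    """Stored id -> id as seen through the mount."""
    if mnt_map is None:
        return stored_id
    first, lower_first, count = mnt_map
    if first <= stored_id < first + count:
        return lower_first + (stored_id - first)
    return None


def from_mnt(mnt_map: IdMap, caller_id: int) -> Optional[int]:
    """Caller's id -> id to store; the inverse of into_mnt()."""
    if mnt_map is None:
        return caller_id
    first, lower_first, count = mnt_map
    if lower_first <= caller_id < lower_first + count:
        return first + (caller_id - lower_first)
    return None


def init_owner(mnt_map: IdMap, fsuid: int, fsgid: int) -> Inode:
    """Create a new inode owned by the caller, as stored on the filesystem."""
    uid, gid = from_mnt(mnt_map, fsuid), from_mnt(mnt_map, fsgid)
    if uid is None or gid is None:
        raise ValueError("caller ids have no mapping in this mount")
    return Inode(uid, gid)


def owner_or_capable(mnt_map: IdMap, inode: Inode, fsuid: int,
                     has_cap: bool = False) -> bool:
    """True if the caller owns the inode as seen through the mount."""
    return into_mnt(mnt_map, inode.uid) == fsuid or has_cap
```

      In this model a caller with fsuid 101000 creating a file through a mount mapped as (0, 100000, 65536) stores uid 1000, and the ownership check succeeds only when the inode is viewed through that same mapping.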
      
      Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • namei: make permission helpers idmapped mount aware · 47291baa
      Christian Brauner authored
      The two helpers inode_permission() and generic_permission() are used by
      the vfs to perform basic permission checking by verifying that the
      caller is privileged over an inode. In order to handle idmapped mounts
      we extend the two helpers with an additional user namespace argument.
      On idmapped mounts the two helpers will make sure to map the inode
      according to the mount's user namespace and then perform identical
      permission checks to inode_permission() and generic_permission(). If the
      initial user namespace is passed nothing changes so non-idmapped mounts
      will see identical behavior as before.
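
      A heavily simplified model of "map first, then run the usual check" might look like the following. It only covers the classic owner/group/other mode bits and ignores capabilities and ACLs. The single-extent mapping, the `None` convention for the initial user namespace, and the function names are assumptions of this sketch.

```python
from typing import NamedTuple, Optional, Set, Tuple

# (first, lower_first, count); None models the initial user namespace.
IdMap = Optional[Tuple[int, int, int]]

MAY_READ, MAY_WRITE, MAY_EXEC = 4, 2, 1


class Inode(NamedTuple):
    uid: int
    gid: int
    mode: int


def into_mnt(mnt_map: IdMap, stored_id: int) -> Optional[int]:
    """Translate a stored id through the mount's mapping (identity if None)."""
    if mnt_map is None:
        return stored_id
    first, lower_first, count = mnt_map
    if first <= stored_id < first + count:
        return lower_first + (stored_id - first)
    return None


def permission(mnt_map: IdMap, inode: Inode, fsuid: int,
               groups: Set[int], mask: int) -> bool:
    """Mode-bit check on the inode as it appears through the mount."""
    uid = into_mnt(mnt_map, inode.uid)
    gid = into_mnt(mnt_map, inode.gid)
    if uid is not None and uid == fsuid:
        bits = (inode.mode >> 6) & 7   # owner triad
    elif gid is not None and gid in groups:
        bits = (inode.mode >> 3) & 7   # group triad
    else:
        bits = inode.mode & 7          # other triad
    return (bits & mask) == mask
```

      The check itself is unchanged between the two cases; only the ids fed into it differ, which is the property the commit message describes.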
      
      Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • capability: handle idmapped mounts · 0558c1bf
      Christian Brauner authored
      In order to determine whether a caller holds privilege over a given
      inode the capability framework exposes the two helpers
      privileged_wrt_inode_uidgid() and capable_wrt_inode_uidgid(). The former
      verifies that the inode has a mapping in the caller's user namespace and
      the latter additionally verifies that the caller has the requested
      capability in their current user namespace.
      If the inode is accessed through an idmapped mount map it into the
      mount's user namespace. Afterwards the checks are identical to
      non-idmapped inodes. If the initial user namespace is passed all
      operations are a nop so non-idmapped mounts will not see a change in
      behavior.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-5-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • fs: add file and path permissions helpers · 02f92b38
      Christian Brauner authored
      Add two simple helpers to check permissions on a file and path
      respectively and convert over some callers. It simplifies quite a few
      codepaths and also reduces the churn in later patches quite a bit.
      Christoph also correctly points out that this makes codepaths (e.g.
      ioctls) way easier to follow that would otherwise have to do more
      complex argument passing than necessary.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • fs: add id translation helpers · e6c9a714
      Christian Brauner authored
      Add simple helpers to make it easy to map kuids into and from idmapped
      mounts. We provide simple wrappers that filesystems can use to e.g.
      initialize inodes similar to i_{uid,gid}_read() and i_{uid,gid}_write().
      Accessing an inode through an idmapped mount maps the i_uid and i_gid of
      the inode to the mount's user namespace. If the fsids are used to
      initialize inodes they are unmapped according to the mount's user
      namespace. Passing the initial user namespace to these helpers makes
      them a nop and so any non-idmapped paths will not be impacted.
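
      To make the two directions concrete, here is a small userspace model of a namespace as a list of mapping extents, with one helper per direction. The class and function names are invented for this sketch, and the direction chosen (a stored id is read as an id inside the mount's namespace and translated to a kernel-side id, with the inverse used for the caller's fsids) is an interpretation of the text above, not a statement about the actual helpers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Extent:
    first: int        # first id as seen inside the namespace
    lower_first: int  # first id on the kernel side
    count: int        # number of consecutive ids covered


@dataclass(frozen=True)
class UserNs:
    extents: Tuple[Extent, ...]

    def map_down(self, ns_id: int) -> Optional[int]:
        """Namespace id -> kernel-side id, or None if unmapped."""
        for e in self.extents:
            if e.first <= ns_id < e.first + e.count:
                return e.lower_first + (ns_id - e.first)
        return None

    def map_up(self, kernel_id: int) -> Optional[int]:
        """Kernel-side id -> namespace id, or None if unmapped."""
        for e in self.extents:
            if e.lower_first <= kernel_id < e.lower_first + e.count:
                return e.first + (kernel_id - e.lower_first)
        return None


# Identity mapping: translating through it changes nothing.
INIT_USER_NS = UserNs((Extent(0, 0, 2**32 - 1),))


def id_into_mnt(mnt_userns: UserNs, stored_id: int) -> Optional[int]:
    """Map an inode's stored id into the mount's namespace."""
    return mnt_userns.map_down(stored_id)


def id_from_mnt(mnt_userns: UserNs, caller_id: int) -> Optional[int]:
    """Map a caller's fsid back to the value to store in a new inode."""
    return mnt_userns.map_up(caller_id)
```

      Passing `INIT_USER_NS` makes both helpers return their input, which is the no-op behaviour the message describes for non-idmapped paths.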
      
      Link: https://lore.kernel.org/r/20210121131959.646623-3-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • mount: attach mappings to mounts · a6435940
      Christian Brauner authored
      In order to support per-mount idmappings vfsmounts are marked with user
      namespaces. The idmapping of the user namespace will be used to map the
      ids of vfs objects when they are accessed through that mount. By default
      all vfsmounts are marked with the initial user namespace. The initial
      user namespace is used to indicate that a mount is not idmapped. All
      operations behave as before.
      
      Based on prior discussions we want to attach the whole user namespace
      and not just a dedicated idmapping struct. This allows us to reuse all
      the helpers that already exist for dealing with idmappings instead of
      introducing a whole new range of helpers. In addition, if we decide in
      the future that we are confident enough to enable unprivileged users to
      set up idmapped mounts the permission checking can take into account
      whether the caller is privileged in the user namespace the mount is
      currently marked with.
      Later patches enforce that once a mount has been idmapped it can't be
      remapped. This keeps permission checking and life-cycle management
      simple. Users wanting to change the idmapping can always create a new
      detached mount with a different idmapping.
      
      Add a new mnt_userns member to vfsmount and two simple helpers to
      retrieve the mnt_userns from vfsmounts and files.
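
      The shape of this design can be sketched in a few lines of userspace code. The classes and the `set_idmap` method are hypothetical. The accessor names and the choice of `PermissionError` for a second remap are assumptions of this sketch rather than details taken from the text.

```python
class UserNs:
    """Opaque stand-in for a user namespace."""

    def __init__(self, name: str) -> None:
        self.name = name


# Marking a mount with this namespace means "not idmapped".
INIT_USER_NS = UserNs("init")


class Mount:
    def __init__(self) -> None:
        # By default every mount carries the initial user namespace.
        self.mnt_userns = INIT_USER_NS

    def is_idmapped(self) -> bool:
        return self.mnt_userns is not INIT_USER_NS

    def set_idmap(self, userns: UserNs) -> None:
        """Mark the mount once; a second remap is refused."""
        if self.is_idmapped():
            raise PermissionError("mount is already idmapped")
        self.mnt_userns = userns


class File:
    def __init__(self, mount: Mount) -> None:
        self.mount = mount


def mnt_user_ns(mount: Mount) -> UserNs:
    """Retrieve the namespace a mount is marked with."""
    return mount.mnt_userns


def file_mnt_user_ns(file: File) -> UserNs:
    """Retrieve the namespace of the mount a file was opened through."""
    return mnt_user_ns(file.mount)
```

      A user who wants a different mapping creates a fresh `Mount` and marks that one instead of changing an existing mount.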
      
      The idea to attach user namespaces to vfsmounts has been floated around
      in various forms at Linux Plumbers in ~2018 with the original idea
      tracing back to a discussion in 2017 at a conference in St. Petersburg
      between Christoph, Tycho, and myself.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-2-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  2. 18 Jan, 2021 1 commit
  3. 17 Jan, 2021 4 commits
    • Merge tag 'perf-tools-fixes-2021-01-17' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux · e2da7836
      Linus Torvalds authored
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Fix 'CPU too large' error in Intel PT
      
       - Correct event attribute sizes in 'perf inject'
      
       - Sync build_bug.h and kvm.h kernel copies
      
       - Fix bpf.h header include directive in 5sec.c 'perf trace' bpf example
      
       - libbpf tests fixes
      
       - Fix shadow stat 'perf test' for non-bash shells
      
       - Take cgroups into account for shadow stats in 'perf stat'
      
      * tag 'perf-tools-fixes-2021-01-17' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
        perf inject: Correct event attribute sizes
        perf intel-pt: Fix 'CPU too large' error
        perf stat: Take cgroups into account for shadow stats
        perf stat: Introduce struct runtime_stat_data
        libperf tests: Fail when failing to get a tracepoint id
        libperf tests: If a test fails return non-zero
        libperf tests: Avoid uninitialized variable warning
        perf test: Fix shadow stat test for non-bash shells
        tools headers: Syncronize linux/build_bug.h with the kernel sources
        tools headers UAPI: Sync kvm.h headers with the kernel sources
        perf bpf examples: Fix bpf.h header include directive in 5sec.c example
    • Merge tag 'powerpc-5.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · a1339d63
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "One fix for a lack of alignment in our linker script, that can lead to
        crashes depending on configuration etc.
      
        One fix for the 32-bit VDSO after the C VDSO conversion.
      
        Thanks to Andreas Schwab, Ariel Marcovitch, and Christophe Leroy"
      
      * tag 'powerpc-5.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/vdso: Fix clock_gettime_fallback for vdso32
        powerpc: Fix alignment bug within the init sections
    • Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · a527a2b3
      Linus Torvalds authored
      Pull misc vfs fixes from Al Viro:
       "Several assorted fixes.
      
        I still think that audit ->d_name race is better fixed this way for
        the benefit of backports, with any possibly fancier variants done on
        top of it"
      
      * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        dump_common_audit_data(): fix racy accesses to ->d_name
        iov_iter: fix the uaccess area in copy_compat_iovec_from_user
        umount(2): move the flag validity checks first
    • mm: don't put pinned pages into the swap cache · feb889fb
      Linus Torvalds authored
      So technically there is nothing wrong with adding a pinned page to the
      swap cache, but the pinning obviously means that the page can't actually
      be free'd right now anyway, so it's a bit pointless.
      
      However, the real problem is not with it being a bit pointless: the real
      issue is that after we've added it to the swap cache, we'll try to unmap
      the page.  That will succeed, because the code in mm/rmap.c doesn't know
      or care about pinned pages.
      
      Even the unmapping isn't fatal per se, since the page will stay around
      in memory due to the pinning, and we do hold the connection to it using
      the swap cache.  But when we then touch it next and take a page fault,
      the logic in do_swap_page() will map it back into the process as a
      possibly read-only page, and we'll then break the page association on
      the next COW fault.
      
      Honestly, this issue could have been fixed in any of those other places:
      (a) we could refuse to unmap a pinned page (which makes conceptual
      sense), or (b) we could make sure to re-map a pinned page writably in
      do_swap_page(), or (c) we could just make do_wp_page() not COW the
      pinned page (which was what we historically did before that "mm:
      do_wp_page() simplification" commit).
      
      But while all of them are equally valid models for breaking this chain,
      not putting pinned pages into the swap cache in the first place is the
      simplest one by far.
      
      It's also the safest one: the reason why do_wp_page() was changed in the
      first place was that getting the "can I re-use this page" wrong is so
      fraught with errors.  If you do it wrong, you end up with an incorrectly
      shared page.
      
      As a result, using "page_maybe_dma_pinned()" in either do_wp_page() or
      do_swap_page() would be a serious bug since it is only a (very good)
      heuristic.  Re-using the page requires a hard black-and-white rule with
      no room for ambiguity.
      
      In contrast, saying "this page is very likely dma pinned, so let's not
      add it to the swap cache and try to unmap it" is an obviously safe thing
      to do, and if the heuristic might very rarely be a false positive, no
      harm is done.
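
      The asymmetry argued for above can be shown with a toy model: a heuristic that may report false positives is acceptable as a reason to skip work, because the worst case is a page that stays out of the swap cache. The `Page` class, the threshold value, and both function names are assumptions of this sketch, not the kernel's code.

```python
from dataclasses import dataclass
from typing import List

# Assumed for this sketch: each pin adds a large bias to the reference
# count, so a high count suggests (but does not prove) a pinned page.
PIN_BIAS = 1024


@dataclass
class Page:
    refcount: int


def maybe_pinned(page: Page) -> bool:
    """Heuristic: can be True for an unpinned page with many references."""
    return page.refcount >= PIN_BIAS


def try_add_to_swap_cache(page: Page, swap_cache: List[Page]) -> bool:
    """Skip pages that look pinned; a false positive only keeps a page in memory."""
    if maybe_pinned(page):
        return False
    swap_cache.append(page)
    return True
```

      Using the same heuristic to decide whether a page may be reused would be a different matter, since a wrong answer there shares a page that should not be shared.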
      
      Fixes: 09854ba9 ("mm: do_wp_page() simplification")
      Reported-and-tested-by: Martin Raiber <martin@urbackup.org>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 16 Jan, 2021 12 commits
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 0da0a8a0
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Nine minor fixes, seven in drivers and two in the core SCSI disk
        driver (sd) which should be harmless involving removing an unused
        variable and quietening a spurious warning"
      Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: sd: Remove obsolete variable in sd_remove()
        scsi: sd: Suppress spurious errors when WRITE SAME is being disabled
        scsi: scsi_debug: Fix memleak in scsi_debug_init()
        scsi: mpt3sas: Fix spelling mistake in Kconfig "compatiblity" -> "compatibility"
        scsi: qedi: Correct max length of CHAP secret
        scsi: ufs: Correct the LUN used in eh_device_reset_handler() callback
        scsi: ufs: Relocate flush of exceptional event
        scsi: ufs: Relax the condition of UFSHCI_QUIRK_SKIP_MANUAL_WB_FLUSH_CTRL
        scsi: ufs: Fix possible power drain during system suspend
    • dump_common_audit_data(): fix racy accesses to ->d_name · d36a1dd9
      Al Viro authored
      We are not guaranteed the locking environment that would prevent
      dentry getting renamed right under us.  And it's possible for
      old long name to be freed after rename, leading to UAF here.
      
      Cc: stable@kernel.org # v2.6.2+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • Merge tag 'block-5.11-2021-01-16' of git://git.kernel.dk/linux-block · 54c6247d
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Just an nvme pull request via Christoph:
      
         - don't initialize hwmon for discover controllers (Sagi Grimberg)
      
         - fix iov_iter handling in nvme-tcp (Sagi Grimberg)
      
         - fix a preempt warning in nvme-tcp (Sagi Grimberg)
      
         - fix a possible NULL pointer dereference in nvme (Israel Rukshin)"
      
      * tag 'block-5.11-2021-01-16' of git://git.kernel.dk/linux-block:
        nvme: don't intialize hwmon for discovery controllers
        nvme-tcp: fix possible data corruption with bio merges
        nvme-tcp: Fix warning with CONFIG_DEBUG_PREEMPT
        nvmet-rdma: Fix NULL deref when setting pi_enable and traddr INADDR_ANY
    • Merge tag 'io_uring-5.11-2021-01-16' of git://git.kernel.dk/linux-block · 11c0239a
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
       "We still have a pending fix for a cancelation issue, but it's still
        being investigated. In the meantime:
      
         - Dead mm handling fix (Pavel)
      
         - SQPOLL setup error handling (Pavel)
      
         - Flush timeout sequence fix (Marcelo)
      
         - Missing finish_wait() for one exit case"
      
      * tag 'io_uring-5.11-2021-01-16' of git://git.kernel.dk/linux-block:
        io_uring: ensure finish_wait() is always called in __io_uring_task_cancel()
        io_uring: flush timeouts that should already have expired
        io_uring: do sqo disable on install_fd error
        io_uring: fix null-deref in io_disable_sqo_submit
        io_uring: don't take files/mm for a dead task
        io_uring: drop mm and files after task_work_run
    • Merge tag 'riscv-for-linus-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · acda701b
      Linus Torvalds authored
      Pull RISC-V fixes from Palmer Dabbelt:
       "There are a few more fixes than a normal rc4, largely due to the
        bubble introduced by the holiday break:
      
         - return -ENOSYS for syscall number -1, which previously returned an
           uninitialized value.
      
         - ensure of_clk_init() has been called in time_init(), without which
           clock drivers may not be initialized.
      
         - fix sifive,uart0 driver to properly display the baud rate. A fix to
           initialize MPIE that allows interrupts to be processed during
           system calls.
      
         - avoid erroneously beginning to trace IRQs when interrupts are
           disabled, which at least triggers spurious lockdep failures.
      
         - workaround for a warning related to calling smp_processor_id()
           while preemptible. The warning itself is spurious on currently
           available systems.
      
         - properly include the generic time VDSO calls. A fix to our kasan
           address mapping. A fix to the HiFive Unleashed device tree, which
           allows the Ethernet PHY to be properly initialized by Linux (as
           opposed to relying on the bootloader).
      
         - defconfig update to include SiFive's GPIO driver, which is present
           on the HiFive Unleashed and necessary to initialize the PHY.
      
         - avoid allocating memory while initializing reserved memory.
      
         - avoid allocating the last 4K of memory, as pointers there alias
           with syscall errors.
      
        There are also two cleanups that should have no functional effect but
        do fix build warnings:
      
         - drop a duplicated definition of PAGE_KERNEL_EXEC.
      
         - properly declare the asm register SP shim.
      
         - cleanup the rv32 memory size Kconfig entry, to reflect the actual
           size of memory available"
      
      * tag 'riscv-for-linus-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        RISC-V: Fix maximum allowed phsyical memory for RV32
        RISC-V: Set current memblock limit
        RISC-V: Do not allocate memblock while iterating reserved memblocks
        riscv: stacktrace: Move register keyword to beginning of declaration
        riscv: defconfig: enable gpio support for HiFive Unleashed
        dts: phy: add GPIO number and active state used for phy reset
        dts: phy: fix missing mdio device and probe failure of vsc8541-01 device
        riscv: Fix KASAN memory mapping.
        riscv: Fixup CONFIG_GENERIC_TIME_VSYSCALL
        riscv: cacheinfo: Fix using smp_processor_id() in preemptible
        riscv: Trace irq on only interrupt is enabled
        riscv: Drop a duplicated PAGE_KERNEL_EXEC
        riscv: Enable interrupts during syscalls with M-Mode
        riscv: Fix sifive serial driver
        riscv: Fix kernel time_init()
        riscv: return -ENOSYS for syscall -1
      acda701b
    • Linus Torvalds's avatar
      mm: don't play games with pinned pages in clear_page_refs · 9348b73c
      Linus Torvalds authored
      Turning a pinned page read-only breaks the pinning after COW.  Don't do it.
      
      The whole "track page soft dirty" state doesn't work with pinned pages
      anyway, since the page might be dirtied by the pinning entity without
      ever being noticed in the page tables.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9348b73c
    • Linus Torvalds's avatar
      mm: fix clear_refs_write locking · 29a951df
      Linus Torvalds authored
      Turning page table entries read-only requires the mmap_sem held for
      writing.
      
      So stop doing the odd games with turning things from read locks to write
      locks and back.  Just get the write lock.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29a951df
    • Atish Patra's avatar
      RISC-V: Fix maximum allowed phsyical memory for RV32 · e5577937
      Atish Patra authored
      The Linux kernel can only map 1GB of address space for RV32, as the page
      offset is set to 0xC0000000. The current description in the Kconfig is
      confusing, as it indicates that RV32 can support 2GB of physical memory.
      That is simply not true for the current kernel. In the future, 2GB split
      support can be added to allow a 2GB physical address space.
      Reviewed-by: Anup Patel <anup@brainfault.org>
      Signed-off-by: Atish Patra <atish.patra@wdc.com>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
      e5577937
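The 1GB figure in the commit message above is plain address arithmetic: with a 32-bit virtual address space and the page offset at 0xC0000000, only the range from the page offset to the top of the address space is left for the kernel's linear map. A minimal sketch of that calculation follows. The macro name `LINEAR_MAP_BYTES` is hypothetical and not taken from the kernel sources, and the 0x80000000 value used for a 2GB split is an assumption for illustration only.

```c
#include <stdint.h>

/* Page offset quoted in the commit message for RV32. */
#define RV32_PAGE_OFFSET UINT32_C(0xC0000000)

/*
 * Hypothetical helper (not a kernel macro): bytes of 32-bit virtual
 * address space that remain at and above a given page offset.
 */
#define LINEAR_MAP_BYTES(page_offset) \
	((UINT64_C(1) << 32) - (uint64_t)(page_offset))

/* 0x100000000 - 0xC0000000 = 0x40000000, i.e. 1 GiB. */
#define RV32_LINEAR_MAP_BYTES LINEAR_MAP_BYTES(RV32_PAGE_OFFSET)
```

Under the assumed 2GB split, `LINEAR_MAP_BYTES(UINT32_C(0x80000000))` evaluates to 0x80000000 bytes, which matches the 2GB case the message mentions as future work.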
    • Atish Patra's avatar
      RISC-V: Set current memblock limit · abb8e86b
      Atish Patra authored
      Currently, the Linux kernel cannot use the last 4K bytes of addressable
      space because the IS_ERR_VALUE macro treats those addresses as errors.
      This is an issue for RV32, as any memblock allocator may allocate a chunk
      of memory from the end of DRAM (2GB), leading to a bad address error even
      though the address was technically valid.
      
      Fix this issue by limiting the memblock if available memory spans the
      entire address space.
      Reviewed-by: Anup Patel <anup@brainfault.org>
      Signed-off-by: Atish Patra <atish.patra@wdc.com>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
      abb8e86b
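The aliasing described in the commit message above can be reproduced outside the kernel. The sketch below is modeled on the kernel's `IS_ERR_VALUE` check, which compares a value against `-MAX_ERRNO` as an unsigned number. The name `IS_ERR_VALUE32` and the use of `uint32_t` to stand in for RV32's 32-bit `unsigned long` are assumptions made so the example behaves the same on a 64-bit host; this is not the kernel's own definition.

```c
#include <stdint.h>

/* Largest errno value the kernel encodes in a pointer-sized value. */
#define MAX_ERRNO 4095

/*
 * Hypothetical 32-bit model of the kernel's IS_ERR_VALUE() check.
 * (uint32_t)-MAX_ERRNO is 0xFFFFF001, so every value from 0xFFFFF001
 * to 0xFFFFFFFF is classified as an encoded error.
 */
#define IS_ERR_VALUE32(x) ((uint32_t)(x) >= (uint32_t)-MAX_ERRNO)
```

With this model, `IS_ERR_VALUE32(0xFFFFF800u)` is true even though the value could be a perfectly valid address near the top of a 32-bit address space, while `IS_ERR_VALUE32(0xFFFFF000u)` is false. Keeping the allocator away from that top range, as the commit does by limiting memblock, avoids the ambiguity.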
    • Atish Patra's avatar
      RISC-V: Do not allocate memblock while iterating reserved memblocks · 797f0375
      Atish Patra authored
      Currently, the resource tree allocates memory blocks while iterating on
      the list. This leads to the following kernel warning because memblock
      allocation also invokes the memory block reservation API.
      
      [    0.000000] ------------[ cut here ]------------
      [    0.000000] WARNING: CPU: 0 PID: 0 at kernel/resource.c:795
      __insert_resource+0x8e/0xd0
      [    0.000000] Modules linked in:
      [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
      5.10.0-00022-ge20097fb37e2-dirty #549
      [    0.000000] epc: c00125c2 ra : c001262c sp : c1c01f50
      [    0.000000]  gp : c1d456e0 tp : c1c0a980 t0 : ffffcf20
      [    0.000000]  t1 : 00000000 t2 : 00000000 s0 : c1c01f60
      [    0.000000]  s1 : ffffcf00 a0 : ffffff00 a1 : c1c0c0c4
      [    0.000000]  a2 : 80c12b15 a3 : 80402000 a4 : 80402000
      [    0.000000]  a5 : c1c0c0c4 a6 : 80c12b15 a7 : f5faf600
      [    0.000000]  s2 : c1c0c0c4 s3 : c1c0e000 s4 : c1009a80
      [    0.000000]  s5 : c1c0c000 s6 : c1d48000 s7 : c1613b4c
      [    0.000000]  s8 : 00000fff s9 : 80000200 s10: c1613b40
      [    0.000000]  s11: 00000000 t3 : c1d4a000 t4 : ffffffff
      
      This is also unnecessary, as we can pre-compute the total memblocks
      required for each memory region and allocate them before the loop. It
      also saves precious boot time by not going through the memblock
      allocation code every time.
      
      Fixes: 00ab027a ("RISC-V: Add kernel image sections to the resource tree")
      Reviewed-by: Anup Patel <anup@brainfault.org>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Atish Patra <atish.patra@wdc.com>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
      797f0375
    • Christoph Hellwig's avatar
      iov_iter: fix the uaccess area in copy_compat_iovec_from_user · a959a978
      Christoph Hellwig authored
      sizeof needs to be called on the compat pointer, not the native one.
      
      Fixes: 89cd35c5 ("iov_iter: transparently handle compat iovecs in import_iovec")
      Reported-by: David Laight <David.Laight@ACULAB.COM>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      a959a978
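The one-line message of the iov_iter commit above is easier to follow with the two element sizes side by side. The sketch below uses hypothetical stand-in structs, not the kernel's own definitions, and assumes a 64-bit kernel serving a 32-bit (compat) caller: each compat iovec entry holds two 32-bit fields, while a native entry holds two 64-bit fields. Sizing the user access area from the native type would therefore cover twice as many bytes as the compat array actually occupies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a compat (32-bit userspace) iovec entry. */
struct compat_iovec_model {
	uint32_t iov_base;	/* 32-bit user pointer */
	uint32_t iov_len;	/* 32-bit length */
};

/* Hypothetical model of a native 64-bit iovec entry. */
struct native_iovec_model {
	uint64_t iov_base;	/* 64-bit user pointer */
	uint64_t iov_len;	/* 64-bit length */
};

/* Bytes spanned by nr array entries of the given element type. */
#define AREA_BYTES(type, nr) ((size_t)(nr) * sizeof(type))
```

For four segments, `AREA_BYTES(struct compat_iovec_model, 4)` is 32 bytes, whereas `AREA_BYTES(struct native_iovec_model, 4)` is 64 bytes. The difference is why the size must be computed from the compat type when the array comes from a compat caller.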
    • Linus Torvalds's avatar
      Merge tag 'for-5.11/dm-fixes-1' of... · 1d94330a
      Linus Torvalds authored
      Merge tag 'for-5.11/dm-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Pull device mapper fixes from Mike Snitzer:
      
       - Fix DM-raid's raid1 discard limits so discards work.
      
       - Select missing Kconfig dependencies for DM integrity and zoned
         targets.
      
       - Four fixes for DM crypt target's support to optionally bypass kcryptd
         workqueues.
      
       - Fix DM snapshot merge support's missing data flushes before
         committing metadata.
      
       - Fix DM integrity data device flushing when external metadata is used.
      
       - Fix DM integrity's maximum number of supported constructor arguments
         that a user can request when creating an integrity device.
      
       - Eliminate DM core ioctl logging noise when an ioctl is issued without
         required CAP_SYS_RAWIO permission.
      
      * tag 'for-5.11/dm-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm crypt: defer decryption to a tasklet if interrupts disabled
        dm integrity: fix the maximum number of arguments
        dm crypt: do not call bio_endio() from the dm-crypt tasklet
        dm integrity: fix flush with external metadata device
        dm: eliminate potential source of excessive kernel log noise
        dm snapshot: flush merged data before committing metadata
        dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
        dm crypt: do not wait for backlogged crypto request completion in softirq
        dm zoned: select CONFIG_CRC32
        dm integrity: select CRYPTO_SKCIPHER
        dm raid: fix discard limits for raid1
      1d94330a
  5. 15 Jan, 2021 8 commits
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · b45e2da6
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "10 patches.
      
        Subsystems affected by this patch series: MAINTAINERS and mm (slub,
        pagealloc, memcg, kasan, vmalloc, migration, hugetlb, memory-failure,
        and process_vm_access)"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm/process_vm_access.c: include compat.h
        mm,hwpoison: fix printing of page flags
        MAINTAINERS: add Vlastimil as slab allocators maintainer
        mm/hugetlb: fix potential missing huge page size info
        mm: migrate: initialize err in do_migrate_pages
        mm/vmalloc.c: fix potential memory leak
        arm/kasan: fix the array size of kasan_early_shadow_pte[]
        mm/memcontrol: fix warning in mem_cgroup_page_lruvec()
        mm/page_alloc: add a missing mm_page_alloc_zone_locked() tracepoint
        mm, slub: consider rest of partial list if acquire_slab() fails
      b45e2da6
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 8cbe71e7
      Linus Torvalds authored
      Pull rdma fixes from Jason Gunthorpe:
       "A fairly modest set of bug fixes, nothing abnormal from the merge
        window
      
        The ucma patch is a bit on the larger side, but given the regression
        was recently added I've opted to forward it to the rc stream.
      
         - Fix a ucma memory leak introduced in v5.9 while fixing the
           Syzkaller bugs
      
         - Don't fail when the xarray wraps for user verbs objects
      
         - User triggerable oops regression from the umem page size rework
      
         - Error unwind bugs in usnic, ocrdma, mlx5 and cma"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        RDMA/cma: Fix error flow in default_roce_mode_store
        RDMA/mlx5: Fix wrong free of blue flame register on error
        IB/mlx5: Fix error unwinding when set_has_smi_cap fails
        RDMA/umem: Avoid undefined behavior of rounddown_pow_of_two()
        RDMA/ocrdma: Fix use after free in ocrdma_dealloc_ucontext_pd()
        RDMA/usnic: Fix memleak in find_free_vf_and_create_qp_grp
        RDMA/restrack: Don't treat as an error allocation ID wrapping
        RDMA/ucma: Do not miss ctx destruction steps in some cases
      8cbe71e7
    • Jens Axboe's avatar
      io_uring: ensure finish_wait() is always called in __io_uring_task_cancel() · a8d13dbc
      Jens Axboe authored
      If we enter with requests pending and perform cancelations, we'll have
      a different inflight count before and after calling prepare_to_wait().
      This causes the loop to restart. If we actually ended up canceling
      everything, or everything completed in-between, then we'll break out
      of the loop without calling finish_wait() on the waitqueue. This can
      trigger a warning on exit_signals(), as we leave the task state in
      TASK_UNINTERRUPTIBLE.
      
      Put a finish_wait() after the loop to catch that case.
      
      Cc: stable@vger.kernel.org # 5.9+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a8d13dbc
    • Linus Torvalds's avatar
      Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 0bc9bc1d
      Linus Torvalds authored
      Pull ext4 fixes from Ted Ts'o:
       "A number of bug fixes for ext4:
      
         - Fix for the new fast_commit feature
      
         - Fix some error handling codepaths in whiteout handling and
           mountpoint sampling
      
         - Fix how we write ext4_error information so it goes through the
           journal when journalling is active, to avoid races that can lead to
           lost error information, superblock checksum failures, or DIF/DIX
           features"
      
      * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: remove expensive flush on fast commit
        ext4: fix bug for rename with RENAME_WHITEOUT
        ext4: fix wrong list_splice in ext4_fc_cleanup
        ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR
        ext4: don't leak old mountpoint samples
        ext4: drop ext4_handle_dirty_super()
        ext4: fix superblock checksum failure when setting password salt
        ext4: use sbi instead of EXT4_SB(sb) in ext4_update_super()
        ext4: save error info to sb through journal if available
        ext4: protect superblock modifications with a buffer lock
        ext4: drop sync argument of ext4_commit_super()
        ext4: combine ext4_handle_error() and save_error_info()
      0bc9bc1d
    • Linus Torvalds's avatar
      Merge tag '5.11-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6 · 7cd3c412
      Linus Torvalds authored
      Pull cifs fixes from Steve French:
       "Two small cifs fixes for stable (including an important handle leak
        fix) and three small cleanup patches"
      
      * tag '5.11-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: style: replace one-element array with flexible-array
        cifs: connect: style: Simplify bool comparison
        fs: cifs: remove unneeded variable in smb3_fs_context_dup
        cifs: fix interrupted close commands
        cifs: check pointer before freeing
      7cd3c412
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 82821be8
      Linus Torvalds authored
      Pull arm64 fixes from Catalin Marinas:
      
       - Set the minimum GCC version to 5.1 for arm64 due to earlier compiler
         bugs.
      
       - Make atomic helpers __always_inline to avoid a section mismatch when
         compiling with clang.
      
       - Fix the CMA and crashkernel reservations to use ZONE_DMA (remove the
         arm64_dma32_phys_limit variable, no longer needed with a dynamic
         ZONE_DMA sizing in 5.11).
      
       - Remove redundant IRQ flag tracing that was leaving lockdep
         inconsistent with the hardware state.
      
       - Revert perf events based hard lockup detector that was causing
         smp_processor_id() to be called in preemptible context.
      
       - Some trivial cleanups - spelling fix, renaming S_FRAME_SIZE to
         PT_REGS_SIZE, function prototypes added.
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: selftests: Fix spelling of 'Mismatch'
        arm64: syscall: include prototype for EL0 SVC functions
        compiler.h: Raise minimum version of GCC to 5.1 for arm64
        arm64: make atomic helpers __always_inline
        arm64: rename S_FRAME_SIZE to PT_REGS_SIZE
        Revert "arm64: Enable perf events based hard lockup detector"
        arm64: entry: remove redundant IRQ flag tracing
        arm64: Remove arm64_dma32_phys_limit and its uses
      82821be8
    • Linus Torvalds's avatar
      Merge tag 'mips_fixes_5.11.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · f288c895
      Linus Torvalds authored
      Pull MIPS fixes from Thomas Bogendoerfer:
      
       - fix coredumps on 64bit kernels
      
       - fix for alignment bugs preventing booting
      
       - fix checking for failed irq_alloc_desc calls
      
      * tag 'mips_fixes_5.11.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
        MIPS: OCTEON: fix unreachable code in octeon_irq_init_ciu
        MIPS: relocatable: fix possible boot hangup with KASLR enabled
        MIPS: Fix malformed NT_FILE and NT_SIGINFO in 32bit coredumps
        MIPS: boot: Fix unaligned access with CONFIG_MIPS_RAW_APPENDED_DTB
      f288c895
    • Al Grant's avatar
      perf inject: Correct event attribute sizes · 648b054a
      Al Grant authored
      When 'perf inject' reads a perf.data file from an older version of perf,
      it writes event attributes into the output with the original size field,
      but lays them out as if they had the size currently used. Readers see a
      corrupt file. Update the size field to match the layout.
      Signed-off-by: Al Grant <al.grant@foss.arm.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/20201124195818.30603-1-al.grant@arm.com
      Signed-off-by: Denis Nikitin <denik@chromium.org>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      648b054a