1. 30 Oct, 2023 15 commits
    • ovl: do not open/llseek lower file with upper sb_writers held · c63e56a4
      Amir Goldstein authored
      overlayfs file open (ovl_maybe_lookup_lowerdata) and overlay file llseek
      take the ovl_inode_lock, without holding upper sb_writers.
      
      In the case of a nested lower overlay that uses the same upper fs as
      this overlay, lockdep will warn about a (possibly false positive)
      circular lock dependency when doing open/llseek of a lower ovl file
      during copy up with our upper sb_writers held, because the lock ordering
      appears to be the reverse of the locking order in ovl_copy_up_start():
      
      - lower ovl_inode_lock
      - upper sb_writers
      
      Let the copy up "transaction" keep an elevated mnt write count on upper
      mnt, but leave taking upper sb_writers to lower level helpers only when
      they actually need it.  This avoids holding upper sb_writers during
      lower file open/llseek and prevents the lockdep warning.
      
      Minimizing the scope of upper sb_writers during copy up is also needed
      for fixing other possible deadlocks in a following patch.
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    • ovl: reorder ovl_want_write() after ovl_inode_lock() · 162d0644
      Amir Goldstein authored
      Make the locking order of ovl_inode_lock() strictly between the two
      vfs stacked layers, i.e.:
      - ovl vfs locks: sb_writers, inode_lock, ...
      - ovl_inode_lock
      - upper vfs locks: sb_writers, inode_lock, ...
      
      To that effect, move ovl_want_write() into the helpers ovl_nlink_start()
      and ovl_copy_up_start(), which currently take the ovl_inode_lock() after
      ovl_want_write().
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
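The three-level order described in this commit (ovl vfs locks, then ovl_inode_lock, then upper vfs locks) can be pictured with a toy user-space checker. This is only a sketch under stated assumptions, not kernel code: the `RANK` table, the `OrderedLocks` class, and the `copy_up_start()`/`copy_up_end()` helpers are hypothetical names invented for illustration, and the checker assumes a single calling thread.

```python
import threading

# Hypothetical ranks modelling the order stated in the commit message:
# ovl vfs locks -> ovl_inode_lock -> upper vfs locks.
RANK = {"ovl_sb_writers": 0, "ovl_inode_lock": 1, "upper_sb_writers": 2}


class OrderedLocks:
    """Toy lock-order checker (user-space model, single caller assumed)."""

    def __init__(self):
        self._locks = {name: threading.Lock() for name in RANK}
        self._held = []  # names of locks currently held by the caller

    def acquire(self, name):
        # A lock may only be taken if it ranks strictly after every held lock.
        if self._held and RANK[name] <= max(RANK[h] for h in self._held):
            raise RuntimeError(
                f"lock order violation: {name} while holding {self._held}"
            )
        self._locks[name].acquire()
        self._held.append(name)

    def release(self, name):
        self._held.remove(name)
        self._locks[name].release()


def copy_up_start(locks):
    """Model of the reordered sequence: inode lock first, then upper writers."""
    locks.acquire("ovl_inode_lock")
    locks.acquire("upper_sb_writers")


def copy_up_end(locks):
    """Release in the reverse order of acquisition."""
    locks.release("upper_sb_writers")
    locks.release("ovl_inode_lock")
```

In this model, taking `upper_sb_writers` first and `ovl_inode_lock` second raises `RuntimeError`, which mirrors the kind of inversion that lockdep reports.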
    • ovl: split ovl_want_write() into two helpers · d08d3b3c
      Amir Goldstein authored
      ovl_get_write_access() gets write access to upper mnt without taking
      freeze protection on upper sb and ovl_start_write() only takes freeze
      protection on upper sb.
      
      These helpers will be used to break up the large ovl_want_write() scope
      during copy up into finer grained freeze protection scopes.
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    • ovl: add helper ovl_file_modified() · c002728f
      Amir Goldstein authored
      A simple wrapper for updating ovl inode size/mtime, to conform
      with ovl_file_accessed().
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    • ovl: protect copying of realinode attributes to ovl inode · f7621b11
      Amir Goldstein authored
      ovl_copyattr() may be called concurrently from aio completion context
      without any lock and that could lead to overlay inode attributes getting
      permanently out of sync with real inode attributes.
      
      Use ovl inode spinlock to protect ovl_copyattr().
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
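The idea of copying several attributes as one unit under a per-inode lock can be sketched in user space as follows. This is a simplified model and not the kernel implementation: the `Inode` class and `copyattr()` function are hypothetical stand-ins, a `threading.Lock` takes the place of the inode spinlock, and only size and mtime are modelled.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Inode:
    """Hypothetical stand-in for an inode carrying two attributes and a lock."""
    size: int = 0
    mtime: float = 0.0
    lock: threading.Lock = field(default_factory=threading.Lock)


def copyattr(ovl: Inode, real: Inode) -> None:
    """Copy size and mtime from the real inode while holding the ovl lock.

    Reading and writing inside one critical section means two concurrent
    callers cannot interleave and leave a mix of old and new values.
    """
    with ovl.lock:
        ovl.size = real.size
        ovl.mtime = real.mtime
```

Several threads may call `copyattr(ovl, real)` at once in this model; each copy completes as a whole before the next one starts.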
    • ovl: punt write aio completion to workqueue · 389a4a4a
      Amir Goldstein authored
      We want to protect concurrent updates of ovl inode size and mtime
      (i.e. ovl_copyattr()) from aio completion context.
      
      Punt write aio completion to a workqueue so that we can protect
      ovl_copyattr() with a spinlock.
      
      Export sb_init_dio_done_wq(), so that overlayfs can use its own
      dio workqueue to punt aio completions.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Link: https://lore.kernel.org/r/8620dfd3-372d-4ae0-aa3f-2fe97dda1bca@kernel.dk/
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    • ovl: propagate IOCB_APPEND flag on writes to realfile · 5f034d34
      Amir Goldstein authored
      If an ovl file is opened O_APPEND, the underlying realfile is also
      opened O_APPEND, so it makes sense to propagate the IOCB_APPEND flag
      on sync writes to realfile, just as we do with aio writes.
      
      Effectively, because sync ovl writes are protected by inode lock,
      this change only makes a difference if the realfile is written to (size
      extending writes) from underneath overlayfs.  The behavior in this case
      is undefined, so it is ok if we change the behavior (to fail the ovl
      IOCB_APPEND write).
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    • ovl: use simpler function to convert iocb to rw flags · db5b5e83
      Amir Goldstein authored
      Overlayfs implements its own function to translate iocb flags into rw
      flags, so that they can be passed into another vfs call.
      
      With commit ce71bfea ("fs: align IOCB_* flags with RWF_* flags")
      Jens created a 1:1 matching between the iocb flags and rw flags,
      simplifying the conversion.
      Signed-off-by: Alessio Balsini <balsini@android.com>
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
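Why a 1:1 flag layout simplifies the conversion can be shown with a small model. The RWF_* values below are the ones from the Linux UAPI header; everything else is an assumption made for illustration: `FORWARDED_MASK` (which bits get passed on), `IOCB_INTERNAL_EXAMPLE` (a made-up kernel-internal bit), and both function names are hypothetical and do not reproduce the overlayfs source.

```python
# RWF_* values as defined in the Linux UAPI header <linux/fs.h>.
RWF_HIPRI = 0x01
RWF_DSYNC = 0x02
RWF_SYNC = 0x04
RWF_NOWAIT = 0x08

# With the aligned layout, each IOCB_* bit equals its RWF_* counterpart.
IOCB_HIPRI = RWF_HIPRI
IOCB_DSYNC = RWF_DSYNC
IOCB_SYNC = RWF_SYNC
IOCB_NOWAIT = RWF_NOWAIT

# Hypothetical internal-only bit with no RWF_* equivalent (illustrative value).
IOCB_INTERNAL_EXAMPLE = 1 << 16

# Assumption of this sketch: these are the bits forwarded to the real file.
FORWARDED_MASK = IOCB_HIPRI | IOCB_DSYNC | IOCB_SYNC | IOCB_NOWAIT


def iocb_to_rw_flags_per_bit(ki_flags: int) -> int:
    """Older style: test each iocb bit and set the matching rw flag."""
    flags = 0
    if ki_flags & IOCB_HIPRI:
        flags |= RWF_HIPRI
    if ki_flags & IOCB_DSYNC:
        flags |= RWF_DSYNC
    if ki_flags & IOCB_SYNC:
        flags |= RWF_SYNC
    if ki_flags & IOCB_NOWAIT:
        flags |= RWF_NOWAIT
    return flags


def iocb_to_rw_flags_masked(ki_flags: int) -> int:
    """Simpler style: with identical bit values, a single mask is enough."""
    return ki_flags & FORWARDED_MASK
```

Both functions return the same value for every input in this model, and bits outside the mask, such as the internal example bit, are dropped.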
    • Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs · 14ab6d42
      Linus Torvalds authored
      Pull vfs inode time accessor updates from Christian Brauner:
       "This finishes the conversion of all inode time fields to accessor
        functions as discussed on list. Changing timestamps manually as we
        used to do before is error prone. Using accessor functions makes this
        robust.
      
        It does not contain the switch of the time fields to discrete 64 bit
        integers to replace struct timespec and free up space in struct inode.
        But after this, the switch can be trivially made and the patch should
        only affect the vfs if we decide to do it"
      
      * tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
        fs: rename inode i_atime and i_mtime fields
        security: convert to new timestamp accessors
        selinux: convert to new timestamp accessors
        apparmor: convert to new timestamp accessors
        sunrpc: convert to new timestamp accessors
        mm: convert to new timestamp accessors
        bpf: convert to new timestamp accessors
        ipc: convert to new timestamp accessors
        linux: convert to new timestamp accessors
        zonefs: convert to new timestamp accessors
        xfs: convert to new timestamp accessors
        vboxsf: convert to new timestamp accessors
        ufs: convert to new timestamp accessors
        udf: convert to new timestamp accessors
        ubifs: convert to new timestamp accessors
        tracefs: convert to new timestamp accessors
        sysv: convert to new timestamp accessors
        squashfs: convert to new timestamp accessors
        server: convert to new timestamp accessors
        client: convert to new timestamp accessors
        ...
    • Merge tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs · 7352a676
      Linus Torvalds authored
      Pull vfs xattr updates from Christian Brauner:
       "The 's_xattr' field of 'struct super_block' currently requires a
        mutable table of 'struct xattr_handler' entries (although each handler
        itself is const). However, no code in vfs actually modifies the
        tables.
      
        This changes the type of 's_xattr' to allow const tables, and modifies
        existing file systems to move their tables to .rodata. This is
        desirable because these tables contain entries with function pointers
        in them; moving them to .rodata makes them considerably less likely to
        be modified accidentally or maliciously at runtime"
      
      * tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
        const_structs.checkpatch: add xattr_handler
        net: move sockfs_xattr_handlers to .rodata
        shmem: move shmem_xattr_handlers to .rodata
        overlayfs: move xattr tables to .rodata
        xfs: move xfs_xattr_handlers to .rodata
        ubifs: move ubifs_xattr_handlers to .rodata
        squashfs: move squashfs_xattr_handlers to .rodata
        smb: move cifs_xattr_handlers to .rodata
        reiserfs: move reiserfs_xattr_handlers to .rodata
        orangefs: move orangefs_xattr_handlers to .rodata
        ocfs2: move ocfs2_xattr_handlers and ocfs2_xattr_handler_map to .rodata
        ntfs3: move ntfs_xattr_handlers to .rodata
        nfs: move nfs4_xattr_handlers to .rodata
        kernfs: move kernfs_xattr_handlers to .rodata
        jfs: move jfs_xattr_handlers to .rodata
        jffs2: move jffs2_xattr_handlers to .rodata
        hfsplus: move hfsplus_xattr_handlers to .rodata
        hfs: move hfs_xattr_handlers to .rodata
        gfs2: move gfs2_xattr_handlers_max to .rodata
        fuse: move fuse_xattr_handlers to .rodata
        ...
    • Merge tag 'vfs-6.7.iov_iter' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs · df9c65b5
      Linus Torvalds authored
      Pull iov_iter updates from Christian Brauner:
       "This contains David's iov_iter cleanup work to convert the iov_iter
        iteration macros to inline functions:
      
         - Remove last_offset from iov_iter as it was only used by ITER_PIPE
      
         - Add a __user tag on copy_mc_to_user()'s dst argument on x86 to
           match that on powerpc and get rid of a sparse warning
      
         - Convert iter->user_backed to user_backed_iter() in the sound PCM
           driver
      
         - Convert iter->user_backed to user_backed_iter() in a couple of
           infiniband drivers
      
         - Renumber the type enum so that the ITER_* constants match the order
           in iterate_and_advance*()
      
         - Since the preceding patch puts UBUF and IOVEC at 0 and 1, change
           user_backed_iter() to just use the type value and get rid of the
           extra flag
      
         - Convert the iov_iter iteration macros to always-inline functions to
           make the code easier to follow. It uses function pointers, but they
           get optimised away
      
         - Move the check for ->copy_mc to _copy_from_iter() and
           copy_page_from_iter_atomic() rather than in memcpy_from_iter_mc()
           where it gets repeated for every segment. Instead, we check once
           and invoke a side function that can use iterate_bvec() rather than
           iterate_and_advance() and supply a different step function
      
         - Move the copy-and-csum code to net/ where it can be in proximity
           with the code that uses it
      
         - Fold memcpy_and_csum() into its two users
      
         - Move csum_and_copy_from_iter_full() out of line and merge in
           csum_and_copy_from_iter() since the former is the only caller of
           the latter
      
         - Move hash_and_copy_to_iter() to net/ where it can be with its only
           caller"
      
      * tag 'vfs-6.7.iov_iter' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
        iov_iter, net: Move hash_and_copy_to_iter() to net/
        iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together
        iov_iter, net: Fold in csum_and_memcpy()
        iov_iter, net: Move csum_and_copy_to/from_iter() to net/
        iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()
        iov_iter: Convert iterate*() to inline funcs
        iov_iter: Derive user-backedness from the iterator type
        iov_iter: Renumber ITER_* constants
        infiniband: Use user_backed_iter() to see if iterator is UBUF/IOVEC
        sound: Fix snd_pcm_readv()/writev() to use iov access functions
        iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user()
        iov_iter: Remove last_offset from iov_iter as it was for ITER_PIPE
    • Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs · 3b3f874c
      Linus Torvalds authored
      Pull misc vfs updates from Christian Brauner:
       "This contains the usual miscellaneous features, cleanups, and fixes
        for vfs and individual fses.
      
        Features:
      
         - Rename and export helpers that get write access to a mount. They
           are used in overlayfs to get write access to the upper mount.
      
         - Print the pretty name of the root device on boot failure. This
           helps in scenarios where we would usually only print
           "unknown-block(1,2)".
      
         - Add an internal SB_I_NOUMASK flag. This is another part in the
           endless POSIX ACL saga in a way.
      
           When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
           the umask because if the relevant inode has POSIX ACLs set it might
           take the umask from there. But if the inode doesn't have any POSIX
           ACLs set then we apply the umask in the filesystem itself. So we end
           up with:
      
            (1) no SB_POSIXACL -> strip umask in vfs
            (2) SB_POSIXACL    -> strip umask in filesystem
      
           The umask semantics associated with SB_POSIXACL allowed filesystems
           that don't even support POSIX ACLs at all to raise SB_POSIXACL
           purely to avoid umask stripping. That specifically means NFS v4 and
           Overlayfs. NFS v4 does it because it delegates this to the server
           and Overlayfs because it needs to delegate umask stripping to the
           upper filesystem, i.e., the filesystem used as the writable layer.
      
           This went so far that SB_POSIXACL is raised even on kernels that
           don't even have POSIX ACL support at all.
      
           Stop this blatant abuse and add SB_I_NOUMASK which is an internal
           superblock flag that filesystems can raise to opt out of umask
           handling. That should really only be the two mentioned above. It's
           not that we want any filesystems to do this. Ideally we have all
           umask handling always in the vfs.
      
         - Make overlayfs use SB_I_NOUMASK too.
      
         - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
           IS_POSIXACL() if the kernel doesn't have support for it. This is a
           very old patch but it's only possible to do this now with the wider
           cleanup that was done.
      
         - Follow-up work on fake path handling from last cycle. Citing mostly
           from Amir:
      
           When overlayfs was first merged, overlayfs files of regular files
           and directories, the ones that are installed in file table, had a
           "fake" path, namely, f_path is the overlayfs path and f_inode is
           the "real" inode on the underlying filesystem.
      
           In v6.5, we took another small step by introducing the
           backing_file container and the file_real_path() helper. This change
           allowed vfs and filesystem code to get the "real" path of an
           overlayfs backing file. With this change, we were able to make
           fsnotify work correctly and report events on the "real" filesystem
           objects that were accessed via overlayfs.
      
           This method works fine, but it still leaves the vfs vulnerable to
           new code that is not aware of files with fake path. A recent
           example is commit db1d1e8b ("IMA: use vfs_getattr_nosec to get
           the i_version"). This commit uses direct referencing to f_path in
           IMA code that otherwise uses file_inode() and file_dentry() to
           reference the filesystem objects that it is measuring.
      
           This contains work to switch things around: instead of having
           filesystem code opt-in to get the "real" path, have generic code
           opt-in for the "fake" path in the few places that it is needed.
      
           It is far more likely that new filesystem code that does not use
           the file_dentry() and file_real_path() helpers will end up causing
           crashes or averting LSM/audit rules if we keep the "fake" path
           exposed by default.
      
           This change already makes file_dentry() moot, but for now we did
           not change this helper; we just added a WARN_ON() in ovl_d_real()
           to catch if we have made any wrong assumptions.
      
           After the dust settles on this change, we can make file_dentry() a
           plain accessor and we can drop the inode argument to ->d_real().
      
         - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small
           change but it really isn't and I would like to see everyone on
           their tippie toes for any possible bugs from this work.
      
           Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU does
           for files for a very long time because of the nasty interactions
           between the SCM_RIGHTS file descriptor garbage collection. So
           extending it makes a lot of sense but it is a subtle change. There
           are almost no places that fiddle with file rcu semantics directly
           and the ones that did mess around with struct file internals under
           rcu have been made to stop doing that because it really was always
           dodgy.
      
           I forgot to put in the link tag for this change and the discussion
           in the commit so adding it into the merge message:
      
             https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com
      
        Cleanups:
      
         - Various smaller pipe cleanups including the removal of a spin lock
           that was only used to protect against writes without pipe_lock()
           from O_NOTIFICATION_PIPE aka watch queues. As that was never
           implemented remove the additional locking from pipe_write().
      
         - Annotate struct watch_filter with the new __counted_by attribute.
      
         - Clarify do_unlinkat() cleanup so that it doesn't look like an extra
           iput() is done that would cause issues.
      
         - Simplify file cleanup when the file has never been opened.
      
         - Use module helper instead of open-coding it.
      
         - Predict error unlikely for stale retry.
      
         - Use WRITE_ONCE() for mount expiry field instead of just commenting
           that one hopes the compiler doesn't get smart.
      
        Fixes:
      
         - Fix readahead on block devices.
      
         - Fix writeback when lazytime is enabled and inodes whose timestamp
           is the only thing that changed reside on wb->b_dirty_time. This
           caused excessively large zombie memory cgroups when lazytime was
           enabled as such inodes weren't handled fast enough.
      
         - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()"
      
      * tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
        file, i915: fix file reference for mmap_singleton()
        vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups
        writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
        chardev: Simplify usage of try_module_get()
        ovl: rely on SB_I_NOUMASK
        fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n
        fs: store real path instead of fake path in backing file f_path
        fs: create helper file_user_path() for user displayed mapped file path
        fs: get mnt_writers count for an open backing file's real path
        vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked
        vfs: predict the error in retry_estale as unlikely
        backing file: free directly
        vfs: fix readahead(2) on block devices
        io_uring: use files_lookup_fd_locked()
        file: convert to SLAB_TYPESAFE_BY_RCU
        vfs: shave work on failed file open
        fs: simplify misleading code to remove ambiguity regarding ihold()/iput()
        watch_queue: Annotate struct watch_filter with __counted_by
        fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
        fs/pipe: remove unnecessary spinlock from pipe_write()
        ...
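The umask rules in the SB_I_NOUMASK item of the vfs-6.7.misc pull message above can be restated as a small decision function. This is a simplified model of the described semantics, not the kernel code: `vfs_strips_umask()` and `prepare_mode()` are hypothetical names, and the superblock flags are reduced to two booleans.

```python
def vfs_strips_umask(sb_posixacl: bool, sb_i_noumask: bool) -> bool:
    """Return True when the generic vfs layer should apply the umask.

    Model of the rules in the pull message:
      - no SB_POSIXACL -> the vfs strips the umask
      - SB_POSIXACL    -> the filesystem strips it
      - SB_I_NOUMASK   -> the filesystem opted out of vfs umask handling
    """
    return not sb_posixacl and not sb_i_noumask


def prepare_mode(mode: int, umask: int, sb_posixacl: bool,
                 sb_i_noumask: bool = False) -> int:
    """Apply the umask in the vfs only when the rules above say so."""
    if vfs_strips_umask(sb_posixacl, sb_i_noumask):
        return mode & ~umask
    return mode  # left untouched; the filesystem (or server) handles it
```

For example, `prepare_mode(0o666, 0o022, sb_posixacl=False)` yields `0o644`, while setting either flag leaves the mode at `0o666` for the filesystem to handle.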
    • Merge tag 'vfs-6.7.autofs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs · 0d63d8b2
      Linus Torvalds authored
      Pull autofs mount api updates from Christian Brauner:
       "This ports autofs to the new mount api. The patchset has existed for
        quite a while but never made it upstream. Ian picked it back up.
      
        This also fixes a bug where fs_param_is_fd() was passed a garbage
        param->dirfd but expected it to be set to the fd that was used to
        set param->file; otherwise result->uint_32 contains nonsense. So make
        sure it's set.
      
        One less filesystem using the old mount api. We're getting there,
        albeit rather slowly. The last remaining major filesystem that hasn't
        converted is btrfs. Patches exist - I even wrote them - but so far
        they haven't made it upstream"
      
      * tag 'vfs-6.7.autofs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
        autofs: fix add autofs_parse_fd()
        fsconfig: ensure that dirfd is set to aux
        autofs: fix protocol sub version setting
        autofs: convert autofs to use the new mount api
        autofs: validate protocol version
        autofs: refactor parse_options()
        autofs: reformat 0pt enum declaration
        autofs: refactor super block info init
        autofs: add autofs_parse_fd()
        autofs: refactor autofs_prepare_pipe()
    • Merge tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs · d4e175f2
      Linus Torvalds authored
      Pull vfs superblock updates from Christian Brauner:
       "This contains the work to make block device opening functions return a
        struct bdev_handle instead of just a struct block_device. The same
        struct bdev_handle is then also passed to block device closing
        functions.
      
        This allows us to propagate context from opening to closing a block
        device without having to modify all users every time.
      
        Sidenote, in the future we might even want to try and have block
        device opening functions return a struct file directly but that's a
        series on top of this.
      
        These are further preparatory changes to be able to count writable
        opens and blocking writes to mounted block devices. That's a separate
        piece of work for next cycle and for that we absolutely need the
        changes to btrfs that have been quietly dropped somehow.
      
        Originally the series contained a patch that removed the old
        blkdev_*() helpers. But since this would've caused needless churn in
        -next for bcachefs we ended up delaying it.
      
        The second piece of work addresses one of the major annoyances about
        the work last cycle, namely that we required dropping s_umount
        whenever we used the superblock and fs_holder_ops for a block device.
      
        The reason for that requirement had been that in some codepaths
        s_umount could've been taken under disk->open_mutex (that's always
        been the case, at least theoretically). For example, on surprise block
        device removal or media change. And opening and closing block devices
        required grabbing disk->open_mutex as well.
      
        So we did the work and went through the block layer and fixed all
        those places so that s_umount is never taken under disk->open_mutex.
        This means no more brittle games where we yield and reacquire s_umount
        during block device opening and closing and no more requirements where
        block devices need to be closed. Filesystems don't need to care about
        this.
      
        There's a bunch of other follow-up work such as moving block device
        freezing and thawing to holder operations which makes it work for all
        block devices and not just the main block device just as we did for
        surprise removal. But that is for next cycle.
      
        Tested with fstests for all major fses, blktests, LTP"
      
      * tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
        porting: update locking requirements
        fs: assert that open_mutex isn't held over holder ops
        block: assert that we're not holding open_mutex over blk_report_disk_dead
        block: move bdev_mark_dead out of disk_check_media_change
        block: WARN_ON_ONCE() when we remove active partitions
        block: simplify bdev_del_partition()
        fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
        jfs: fix log->bdev_handle null ptr deref in lbmStartIO
        bcache: Fixup error handling in register_cache()
        xfs: Convert to bdev_open_by_path()
        reiserfs: Convert to bdev_open_by_dev/path()
        ocfs2: Convert to use bdev_open_by_dev()
        nfs/blocklayout: Convert to use bdev_open_by_dev/path()
        jfs: Convert to bdev_open_by_dev()
        f2fs: Convert to bdev_open_by_dev/path()
        ext4: Convert to bdev_open_by_dev()
        erofs: Convert to use bdev_open_by_path()
        btrfs: Convert to bdev_open_by_path()
        fs: Convert to bdev_open_by_dev()
        mm/swap: Convert to use bdev_open_by_dev()
        ...
    • Linux 6.6 · ffc25326
      Linus Torvalds authored
  2. 28 Oct, 2023 25 commits