1. 30 Oct, 2023 4 commits
    • Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs · 3b3f874c
      Linus Torvalds authored
      Pull misc vfs updates from Christian Brauner:
       "This contains the usual miscellaneous features, cleanups, and fixes
        for vfs and individual fses.
      
        Features:
      
         - Rename and export helpers that get write access to a mount. They
           are used in overlayfs to get write access to the upper mount.
      
         - Print the pretty name of the root device on boot failure. This
           helps in scenarios where we would usually only print
           "unknown-block(1,2)".
      
         - Add an internal SB_I_NOUMASK flag. This is another part in the
           endless POSIX ACL saga in a way.
      
           When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
           the umask because if the relevant inode has POSIX ACLs set it might
           take the umask from there. But if the inode doesn't have any POSIX
           ACLs set then we apply the umask in the filesystem itself. So we end
           up with:
      
            (1) no SB_POSIXACL -> strip umask in vfs
            (2) SB_POSIXACL    -> strip umask in filesystem
      
           The umask semantics associated with SB_POSIXACL allowed filesystems
           that don't even support POSIX ACLs at all to raise SB_POSIXACL
           purely to avoid umask stripping. That specifically means NFS v4 and
           Overlayfs. NFS v4 does it because it delegates this to the server
           and Overlayfs because it needs to delegate umask stripping to the
           upper filesystem, i.e., the filesystem used as the writable layer.
      
           This went so far that SB_POSIXACL is raised even on kernels that
           don't even have POSIX ACL support at all.
      
           Stop this blatant abuse and add SB_I_NOUMASK which is an internal
           superblock flag that filesystems can raise to opt out of umask
           handling. That should really only be the two mentioned above. It's
           not that we want any filesystems to do this. Ideally we have all
           umask handling always in the vfs.
      
         - Make overlayfs use SB_I_NOUMASK too.
      
         - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
           IS_POSIXACL() if the kernel doesn't have support for it. This is a
           very old patch but it's only possible to do this now with the wider
           cleanup that was done.
      
         - Follow-up work on fake path handling from last cycle. Citing mostly
           from Amir:
      
           When overlayfs was first merged, overlayfs files of regular files
           and directories, the ones that are installed in file table, had a
           "fake" path, namely, f_path is the overlayfs path and f_inode is
           the "real" inode on the underlying filesystem.
      
           In v6.5, we took another small step by introducing the
           backing_file container and the file_real_path() helper. This change
           allowed vfs and filesystem code to get the "real" path of an
           overlayfs backing file. With this change, we were able to make
           fsnotify work correctly and report events on the "real" filesystem
           objects that were accessed via overlayfs.
      
           This method works fine, but it still leaves the vfs vulnerable to
           new code that is not aware of files with fake path. A recent
           example is commit db1d1e8b ("IMA: use vfs_getattr_nosec to get
           the i_version"). This commit uses direct referencing to f_path in
           IMA code that otherwise uses file_inode() and file_dentry() to
           reference the filesystem objects that it is measuring.
      
           This contains work to switch things around: instead of having
           filesystem code opt-in to get the "real" path, have generic code
           opt-in for the "fake" path in the few places that it is needed.
      
           It is far more likely that new filesystem code that does not use
           the file_dentry() and file_real_path() helpers will end up causing
           crashes or averting LSM/audit rules if we keep the "fake" path
           exposed by default.
      
           This change already makes file_dentry() moot, but for now we did
           not change this helper; we just added a WARN_ON() in ovl_d_real() to
           catch if we have made any wrong assumptions.
      
           After the dust settles on this change, we can make file_dentry() a
           plain accessor and we can drop the inode argument to ->d_real().
      
         - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small
           change but it really isn't and I would like to see everyone on
           their tippie toes for any possible bugs from this work.
      
           Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU does
           for files for a very long time because of the nasty interactions
           with the SCM_RIGHTS file descriptor garbage collection. So
           extending it makes a lot of sense but it is a subtle change. There
           are almost no places that fiddle with file rcu semantics directly
           and the ones that did mess around with struct file internals under
           rcu have been made to stop doing that because it really was always
           dodgy.
      
           I forgot to put in the link tag for this change and the discussion
           in the commit so adding it into the merge message:
      
             https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com
      
        Cleanups:
      
         - Various smaller pipe cleanups including the removal of a spin lock
           that was only used to protect against writes without pipe_lock()
           from O_NOTIFICATION_PIPE aka watch queues. As that was never
           implemented, remove the additional locking from pipe_write().
      
         - Annotate struct watch_filter with the new __counted_by attribute.
      
         - Clarify do_unlinkat() cleanup so that it doesn't look like an extra
           iput() is done that would cause issues.
      
         - Simplify file cleanup when the file has never been opened.
      
         - Use module helper instead of open-coding it.
      
         - Predict error unlikely for stale retry.
      
         - Use WRITE_ONCE() for mount expiry field instead of just commenting
           that one hopes the compiler doesn't get smart.
      
        Fixes:
      
         - Fix readahead on block devices.
      
         - Fix writeback when lazytime is enabled and inodes whose timestamp
           is the only thing that changed reside on wb->b_dirty_time. This
           caused an excessive number of zombie memory cgroups when lazytime
           was enabled, as such inodes weren't handled fast enough.
      
         - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()"
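
The umask rules described in the merge message above can be summarized as a small decision function. What follows is a minimal userspace sketch in C, not kernel code: the struct, the flag values, and the function names are hypothetical stand-ins, and the `kernel_has_posix_acl` parameter is an assumption that models whether POSIX ACL support is compiled in.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; the kernel's real values differ. */
#define SKETCH_SB_POSIXACL  (1u << 0) /* s_flags: filesystem handles POSIX ACLs */
#define SKETCH_SB_I_NOUMASK (1u << 1) /* s_iflags: opt out of vfs umask handling */

/* Hypothetical stand-in for the two superblock flag words. */
struct sketch_sb {
    unsigned int s_flags;
    unsigned int s_iflags;
};

/* True when the vfs itself should strip the umask for this superblock. */
static bool sketch_vfs_strips_umask(const struct sketch_sb *sb,
                                    bool kernel_has_posix_acl)
{
    /* The filesystem explicitly opted out (NFS v4, overlayfs). */
    if (sb->s_iflags & SKETCH_SB_I_NOUMASK)
        return false;
    /* SB_POSIXACL only counts when the kernel supports POSIX ACLs. */
    if (kernel_has_posix_acl && (sb->s_flags & SKETCH_SB_POSIXACL))
        return false;
    return true;
}

/* Apply the umask in the "vfs" only when the rules above say so. */
static unsigned int sketch_vfs_prepare_mode(const struct sketch_sb *sb,
                                            bool kernel_has_posix_acl,
                                            unsigned int mode,
                                            unsigned int umask)
{
    assert((umask & ~0777u) == 0); /* umask covers permission bits only */
    if (sketch_vfs_strips_umask(sb, kernel_has_posix_acl))
        mode &= ~umask;
    return mode;
}
```

In this sketch a superblock with neither flag has its umask applied by the vfs, one with only the SB_POSIXACL stand-in defers to the filesystem when ACL support is present, and one with the SB_I_NOUMASK stand-in always defers.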
      
      * tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
        file, i915: fix file reference for mmap_singleton()
        vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups
        writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
        chardev: Simplify usage of try_module_get()
        ovl: rely on SB_I_NOUMASK
        fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n
        fs: store real path instead of fake path in backing file f_path
        fs: create helper file_user_path() for user displayed mapped file path
        fs: get mnt_writers count for an open backing file's real path
        vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked
        vfs: predict the error in retry_estale as unlikely
        backing file: free directly
        vfs: fix readahead(2) on block devices
        io_uring: use files_lookup_fd_locked()
        file: convert to SLAB_TYPESAFE_BY_RCU
        vfs: shave work on failed file open
        fs: simplify misleading code to remove ambiguity regarding ihold()/iput()
        watch_queue: Annotate struct watch_filter with __counted_by
        fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
        fs/pipe: remove unnecessary spinlock from pipe_write()
        ...
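
The SLAB_TYPESAFE_BY_RCU switch mentioned in the merge message is subtle because memory of a freed object may be reused for another object of the same type while a lockless reader still holds a pointer to it. A reader therefore has to take a reference only if the count is non-zero and then recheck that the slot still points at the same object. The following is a rough userspace sketch of that pin-then-recheck idea using C11 atomics. It is not the kernel implementation; every name in it is hypothetical, and it collapses "slot empty" and "please retry" into a single NULL result for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted, type-stable object. */
struct sketch_file {
    atomic_long count; /* 0 means freed; the memory may be reused */
};

/* Hypothetical stand-in for one entry of a lookup table. */
typedef _Atomic(struct sketch_file *) sketch_slot;

/* Take a reference only if the object is still live (count != 0). */
static int sketch_get_unless_zero(struct sketch_file *f)
{
    long old = atomic_load(&f->count);

    while (old != 0) {
        /* On failure, old is reloaded and the zero check runs again. */
        if (atomic_compare_exchange_weak(&f->count, &old, old + 1))
            return 1;
    }
    return 0;
}

/* Drop one reference. */
static void sketch_put(struct sketch_file *f)
{
    long old = atomic_fetch_sub(&f->count, 1);

    assert(old > 0);
    (void)old;
}

/* Step 1: pin whatever the slot points at, if it is still live. */
static struct sketch_file *sketch_pin(sketch_slot *slot)
{
    struct sketch_file *f = atomic_load(slot);

    if (!f || !sketch_get_unless_zero(f))
        return NULL;
    return f;
}

/* Step 2: keep the reference only if the slot still points at f. */
static struct sketch_file *sketch_recheck(sketch_slot *slot,
                                          struct sketch_file *f)
{
    if (atomic_load(slot) == f)
        return f;
    sketch_put(f); /* we pinned a recycled or replaced object */
    return NULL;
}

/* One lookup attempt: NULL means "empty or retry" in this sketch. */
static struct sketch_file *sketch_lookup(sketch_slot *slot)
{
    struct sketch_file *f = sketch_pin(slot);

    return f ? sketch_recheck(slot, f) : NULL;
}
```

Splitting the attempt into a pin step and a recheck step is only done here so the window between them is visible; a caller would normally use the combined lookup and loop on a retry.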
    • Merge tag 'vfs-6.7.autofs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs · 0d63d8b2
      Linus Torvalds authored
      Pull autofs mount api updates from Christian Brauner:
       "This ports autofs to the new mount api. The patchset has existed for
        quite a while but never made it upstream. Ian picked it back up.
      
        This also fixes a bug where fs_param_is_fd() was passed a garbage
        param->dirfd. It expects param->dirfd to be set to the fd that was
        used to set param->file; otherwise result->uint_32 contains nonsense.
        So make sure it's set.
      
        One less filesystem using the old mount api. We're getting there,
        albeit rather slowly. The last remaining major filesystem that hasn't
        converted is btrfs. Patches exist - I even wrote them - but so far
        they haven't made it upstream"
      
      * tag 'vfs-6.7.autofs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
        autofs: fix add autofs_parse_fd()
        fsconfig: ensure that dirfd is set to aux
        autofs: fix protocol sub version setting
        autofs: convert autofs to use the new mount api
        autofs: validate protocol version
        autofs: refactor parse_options()
        autofs: reformat 0pt enum declaration
        autofs: refactor super block info init
        autofs: add autofs_parse_fd()
        autofs: refactor autofs_prepare_pipe()
    • Merge tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs · d4e175f2
      Linus Torvalds authored
      Pull vfs superblock updates from Christian Brauner:
       "This contains the work to make block device opening functions return a
        struct bdev_handle instead of just a struct block_device. The same
        struct bdev_handle is then also passed to block device closing
        functions.
      
        This allows us to propagate context from opening to closing a block
        device without having to modify all users every time.
      
        Sidenote, in the future we might even want to try and have block
        device opening functions return a struct file directly but that's a
        series on top of this.
      
        These are further preparatory changes to be able to count writable
        opens and blocking writes to mounted block devices. That's a separate
        piece of work for next cycle and for that we absolutely need the
        changes to btrfs that have been quietly dropped somehow.
      
        Originally the series contained a patch that removed the old
        blkdev_*() helpers. But since this would've caused needless churn in
        -next for bcachefs we ended up delaying it.
      
        The second piece of work addresses one of the major annoyances about
        the work last cycle, namely that we required dropping s_umount
        whenever we used the superblock and fs_holder_ops for a block device.
      
        The reason for that requirement had been that in some codepaths
        s_umount could've been taken under disk->open_mutex (that's always
        been the case, at least theoretically). For example, on surprise block
        device removal or media change. And opening and closing block devices
        required grabbing disk->open_mutex as well.
      
        So we did the work and went through the block layer and fixed all
        those places so that s_umount is never taken under disk->open_mutex.
        This means no more brittle games where we yield and reacquire s_umount
        during block device opening and closing and no more requirements where
        block devices need to be closed. Filesystems don't need to care about
        this.
      
        There's a bunch of other follow-up work such as moving block device
        freezing and thawing to holder operations which makes it work for all
        block devices and not just the main block device just as we did for
        surprise removal. But that is for next cycle.
      
        Tested with fstests for all major fses, blktests, LTP"
      
      * tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
        porting: update locking requirements
        fs: assert that open_mutex isn't held over holder ops
        block: assert that we're not holding open_mutex over blk_report_disk_dead
        block: move bdev_mark_dead out of disk_check_media_change
        block: WARN_ON_ONCE() when we remove active partitions
        block: simplify bdev_del_partition()
        fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
        jfs: fix log->bdev_handle null ptr deref in lbmStartIO
        bcache: Fixup error handling in register_cache()
        xfs: Convert to bdev_open_by_path()
        reiserfs: Convert to bdev_open_by_dev/path()
        ocfs2: Convert to use bdev_open_by_dev()
        nfs/blocklayout: Convert to use bdev_open_by_dev/path()
        jfs: Convert to bdev_open_by_dev()
        f2fs: Convert to bdev_open_by_dev/path()
        ext4: Convert to bdev_open_by_dev()
        erofs: Convert to use bdev_open_by_path()
        btrfs: Convert to bdev_open_by_path()
        fs: Convert to bdev_open_by_dev()
        mm/swap: Convert to use bdev_open_by_dev()
        ...
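
The idea of returning a handle from the opening function and passing the same handle to the closing function can be illustrated outside the kernel. The following is a small userspace sketch in C under stated assumptions: the types, mode bits, counters, and function names are all hypothetical and do not mirror the real struct bdev_handle layout or the bdev_open_by_*() signatures. It shows how the close path can undo exactly what the open path did (here, a writable-open count) without callers passing the mode or holder again.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical mode bits for this sketch. */
#define SKETCH_MODE_READ  (1u << 0)
#define SKETCH_MODE_WRITE (1u << 1)

/* Hypothetical stand-in for a block device with simple counters. */
struct sketch_bdev {
    int openers; /* number of outstanding handles */
    int writers; /* number of outstanding writable handles */
};

/* Context captured at open time and carried to release time. */
struct sketch_bdev_handle {
    struct sketch_bdev *bdev;
    void *holder;      /* who opened the device; may be NULL */
    unsigned int mode; /* mode bits the device was opened with */
};

/* Open: record the context in a handle and update the counters. */
static struct sketch_bdev_handle *sketch_bdev_open(struct sketch_bdev *bdev,
                                                   unsigned int mode,
                                                   void *holder)
{
    struct sketch_bdev_handle *handle = malloc(sizeof(*handle));

    if (!handle)
        return NULL;
    handle->bdev = bdev;
    handle->holder = holder;
    handle->mode = mode;
    bdev->openers++;
    if (mode & SKETCH_MODE_WRITE)
        bdev->writers++;
    return handle;
}

/* Release: everything needed to undo the open lives in the handle. */
static void sketch_bdev_release(struct sketch_bdev_handle *handle)
{
    assert(handle && handle->bdev->openers > 0);
    if (handle->mode & SKETCH_MODE_WRITE)
        handle->bdev->writers--;
    handle->bdev->openers--;
    free(handle);
}
```

Because the release function takes only the handle, adding another piece of open-time context later means touching the handle and these two functions rather than every caller.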
    • Linux 6.6 · ffc25326
      Linus Torvalds authored
  2. 28 Oct, 2023 36 commits