1. 08 Jan, 2024 4 commits
    • Merge tag 'vfs-6.8.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 8c9440fe
      Linus Torvalds authored
      Pull vfs mount updates from Christian Brauner:
       "This contains the work to retrieve detailed information about mounts
        via two new system calls. This is hopefully the beginning of the end
        of the saga that started with fsinfo() years ago.
      
        The LWN articles in [1] and [2] can serve as a summary so we can avoid
        rehashing everything here.
      
        At LSFMM in May 2022 we got into a room and agreed on what we want to
        do about fsinfo(). Basically, split it into pieces. This is the first
        part of that agreement. Specifically, it is concerned with retrieving
        information about mounts. So this only concerns the mount information
        retrieval, not the mount table change notification, or the extended
        filesystem specific mount option work. That is separate work.
      
        Currently mounts have a 32bit id. Mount ids are already in heavy use
        by libmount and other low-level userspace but they can't be relied
        upon because they're recycled very quickly. We agreed that mounts
        should carry a unique 64bit id by which they can be referenced
        directly. This is now implemented as part of this work.
      
        The new 64bit mount id is exposed in statx() through the new
        STATX_MNT_ID_UNIQUE flag. If the flag isn't raised the old mount id is
        returned. If it is raised and the kernel supports the new 64bit mount
        id the flag is raised in the result mask and the new 64bit mount id is
        returned. New and old mount ids do not overlap so they cannot be
        conflated.
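The result-mask rule above can be sketched as a small check. This is only an illustration: `classify_mnt_id` and the enum are hypothetical names, and the flag values are defined locally on the assumption that they match the 6.8 uapi `<linux/stat.h>`.

```c
#include <stdint.h>

/* Flag values as assumed from the 6.8 uapi <linux/stat.h>; defined
 * locally so the sketch also builds against older headers. */
#ifndef STATX_MNT_ID
#define STATX_MNT_ID		0x00001000U
#endif
#ifndef STATX_MNT_ID_UNIQUE
#define STATX_MNT_ID_UNIQUE	0x00004000U
#endif

/* Hypothetical classification of what stx_mnt_id holds. */
enum mnt_id_kind {
	MNT_ID_ABSENT,		/* kernel reported no mount id at all */
	MNT_ID_RECYCLED_32BIT,	/* old id, may be reused quickly */
	MNT_ID_UNIQUE_64BIT,	/* new unique id */
};

/* Decide how to interpret stx_mnt_id from the mask statx() returned. */
static enum mnt_id_kind classify_mnt_id(uint32_t result_mask)
{
	if (result_mask & STATX_MNT_ID_UNIQUE)
		return MNT_ID_UNIQUE_64BIT;
	if (result_mask & STATX_MNT_ID)
		return MNT_ID_RECYCLED_32BIT;
	return MNT_ID_ABSENT;
}
```

A caller would pass `stx_mask` from the filled-in `struct statx` and only hand the id to statmount() or listmount() when the result is `MNT_ID_UNIQUE_64BIT`.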
      
        Two new system calls are introduced that operate on the 64bit mount
        id: statmount() and listmount(). A summary of the api and usage can be
        found on LWN as well (cf. [3]) but of course, I'll provide a summary
        here as well.
      
        Both system calls rely on struct mnt_id_req, the request struct used
        to pass the 64bit mount id identifying the mount to operate on. It
        is extensible to allow for the addition of new parameters and for
        future use in other apis that make use of mount ids.
      
        statmount() mimics the semantics of statx() and exposes a set of
        flags that userspace may raise in mnt_id_req to request specific
        information to be retrieved. A statmount() call returns a struct
        statmount filled in with information about the requested mount.
        Supported requests are indicated by raising the corresponding
        request flag from struct mnt_id_req in the @mask field of struct
        statmount.
      
        Currently we do support:
      
         - STATMOUNT_SB_BASIC
           Basic filesystem info
      
         - STATMOUNT_MNT_BASIC
           Mount information (mount id, parent mount id, mount attributes etc)
      
         - STATMOUNT_PROPAGATE_FROM
           Propagation from what mount in current namespace
      
         - STATMOUNT_MNT_ROOT
           Path of the root of the mount (e.g., mount --bind /bla /mnt returns /bla)
      
         - STATMOUNT_MNT_POINT
           Path of the mount point (e.g., mount --bind /bla /mnt returns /mnt)
      
         - STATMOUNT_FS_TYPE
           Name of the filesystem type as the magic number isn't enough due to submounts
      
        The string options STATMOUNT_MNT_{ROOT,POINT} and STATMOUNT_FS_TYPE
        are appended to the end of the struct. Userspace can use the offsets
        in @fs_type, @mnt_root, and @mnt_point to reference those strings
        easily.
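A minimal sketch of how the string offsets might be consumed. `struct statmount_sketch` is a simplified, hypothetical stand-in that models only the three offsets and the trailing string area, not the real uapi layout, and it assumes the offsets are relative to the start of that string area. `build_reply` only fakes a kernel reply so the accessors have something to read.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct statmount. */
struct statmount_sketch {
	uint32_t fs_type;	/* offset of the filesystem type name in str[] */
	uint32_t mnt_root;	/* offset of the mount root path in str[] */
	uint32_t mnt_point;	/* offset of the mount point path in str[] */
	char str[];		/* NUL-terminated strings, packed back to back */
};

/* Accessors: each string lives at str + offset (assumed convention). */
static const char *sm_fs_type(const struct statmount_sketch *sm)
{
	return sm->str + sm->fs_type;
}

static const char *sm_mnt_root(const struct statmount_sketch *sm)
{
	return sm->str + sm->mnt_root;
}

static const char *sm_mnt_point(const struct statmount_sketch *sm)
{
	return sm->str + sm->mnt_point;
}

/* Fake a reply in userspace; the caller frees the result. */
static struct statmount_sketch *build_reply(const char *fs_type,
					    const char *root,
					    const char *point)
{
	size_t a = strlen(fs_type) + 1;
	size_t b = strlen(root) + 1;
	size_t c = strlen(point) + 1;
	struct statmount_sketch *sm = malloc(sizeof(*sm) + a + b + c);

	if (!sm)
		return NULL;
	sm->fs_type = 0;
	sm->mnt_root = (uint32_t)a;
	sm->mnt_point = (uint32_t)(a + b);
	memcpy(sm->str, fs_type, a);
	memcpy(sm->str + a, root, b);
	memcpy(sm->str + a + b, point, c);
	return sm;
}
```

For the `mount --bind /bla /mnt` example from the list above, `sm_mnt_root()` would yield `/bla` and `sm_mnt_point()` would yield `/mnt`.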
      
        The struct statmount reserves quite a bit of space currently for
        future extensibility. This isn't really a problem and if this bothers
        us we can just send a follow-up pull request during this cycle.
      
        listmount() is given a 64bit mount id via mnt_id_req just as
        statmount(). It takes a buffer and a size to return an array of the
        64bit ids of the child mounts of the requested mount. Userspace can
        thus choose to either retrieve child mounts for a mount in batches or
        iterate through the child mounts. For most use-cases it will be
        sufficient to just leave space for a few child mounts. But for big
        mount tables having an iterator is really helpful. Iterating through a
        mount table works by setting @param in mnt_id_req to the mount id of
        the last child mount retrieved in the previous listmount() call"
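The iteration scheme described above can be sketched in userspace. This does not call the real listmount(): `listmount_sim` is a hypothetical simulation over an in-memory array, and it assumes that child ids are returned in ascending order so that "everything after the last id seen" is well defined.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Simulated listmount(): copy up to `cap` child ids that are greater
 * than `last_id` into `buf` and return how many were copied.
 * `children` must be sorted in ascending order (an assumption here).
 */
static size_t listmount_sim(const uint64_t *children, size_t nchildren,
			    uint64_t last_id, uint64_t *buf, size_t cap)
{
	size_t n = 0;

	for (size_t i = 0; i < nchildren && n < cap; i++)
		if (children[i] > last_id)
			buf[n++] = children[i];
	return n;
}

/*
 * Walk all children in batches of at most `batch` ids (capped at 8),
 * feeding the last id of each batch back in as the cursor. Stores every
 * id seen in `out` and returns the total count.
 */
static size_t walk_children(const uint64_t *children, size_t nchildren,
			    size_t batch, uint64_t *out)
{
	uint64_t buf[8];
	uint64_t last = 0;	/* 0: start from the first child */
	size_t total = 0;
	size_t n;

	if (batch == 0 || batch > 8)
		batch = 8;
	while ((n = listmount_sim(children, nchildren, last, buf, batch)) > 0) {
		for (size_t i = 0; i < n; i++)
			out[total++] = buf[i];
		last = buf[n - 1];	/* cursor for the next call */
	}
	return total;
}
```

With a small batch size the loop makes several calls, and with a large enough buffer it finishes in one, which mirrors the batch versus iterate choice left to userspace.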
      
      Link: https://lwn.net/Articles/934469 [1]
      Link: https://lwn.net/Articles/829212 [2]
      Link: https://lwn.net/Articles/950569 [3]
      
      * tag 'vfs-6.8.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        add selftest for statmount/listmount
        fs: keep struct mnt_id_req extensible
        wire up syscalls for statmount/listmount
        add listmount(2) syscall
        statmount: simplify string option retrieval
        statmount: simplify numeric option retrieval
        add statmount(2) syscall
        namespace: extract show_path() helper
        mounts: keep list of mounts in an rbtree
        add unique mount ID
    • Merge tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 3f6984e7
      Linus Torvalds authored
      Pull vfs super updates from Christian Brauner:
       "This contains the super work for this cycle including the long-awaited
        series by Jan to make it possible to prevent writing to mounted block
        devices:
      
         - Writing to mounted devices is dangerous and can lead to filesystem
           corruption as well as crashes. Furthermore syzbot comes up with
           more and more involved examples of how to corrupt a block device
           under a mounted filesystem, leading to kernel crashes and reports
           we can do nothing about. Add tracking of writers to each block
           device and a kernel cmdline argument which controls whether other
           writeable opens of block devices opened with the
           BLK_OPEN_RESTRICT_WRITES flag are allowed.
      
           Note that this effectively only prevents modification of the
           particular block device's page cache by other writers. The actual
           device content can still be modified by other means - e.g. by
           issuing direct scsi commands, by doing writes through devices lower
           in the storage stack (e.g. in case loop devices, DM, or MD are
           involved) etc. But blocking direct modifications of the block
           device page cache is enough to give filesystems a chance to perform
           data validation when loading data from the underlying storage and
           thus prevent kernel crashes.
      
           Syzbot can use this cmdline argument option to avoid uninteresting
           crashes. Also users whose userspace setup does not need writing to
           mounted block devices can set this option for hardening. We expect
           that this will be interesting to quite a few workloads.
      
           Btrfs is currently opted out of this because they still haven't
           merged patches we require for this to work from three kernel
           releases ago.
      
         - Reimplement block device freezing and thawing as holder operations
           on the block device.
      
           This allows us to extend block device freezing to all devices
           associated with a superblock and not just the main device. It also
           allows us to remove get_active_super() and thus another function
           that scans the global list of superblocks.
      
           Freezing via additional block devices only works if the filesystem
           chooses to use @fs_holder_ops for these additional devices as well.
           That currently only includes ext4 and xfs.
      
           Earlier releases switched get_tree_bdev() and mount_bdev() to use
           @fs_holder_ops. The remaining nilfs2 open-coded version of
           mount_bdev() has been converted to rely on @fs_holder_ops as well.
           So block device freezing for the main block device will continue to
           work as before.
      
           There should be no regressions in functionality. The only special
           case is btrfs where block device freezing for the main block device
           never worked because sb->s_bdev isn't set. Block device freezing
           for btrfs can be fixed once they can switch to @fs_holder_ops but
           that can happen whenever they're ready"
      
      * tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
        block: Fix a memory leak in bdev_open_by_dev()
        super: don't bother with WARN_ON_ONCE()
        super: massage wait event mechanism
        ext4: Block writes to journal device
        xfs: Block writes to log device
        fs: Block writes to mounted block devices
        btrfs: Do not restrict writes to btrfs devices
        block: Add config option to not allow writing to mounted devices
        block: Remove blkdev_get_by_*() functions
        bcachefs: Convert to bdev_open_by_path()
        fs: handle freezing from multiple devices
        fs: remove dead check
        nilfs2: simplify device handling
        fs: streamline thaw_super_locked
        ext4: simplify device handling
        xfs: simplify device handling
        fs: simplify setup_bdev_super() calls
        blkdev: comment fs_holder_ops
        porting: document block device freeze and thaw changes
        fs: remove unused helper
        ...
    • Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · c604110e
      Linus Torvalds authored
      Pull misc vfs updates from Christian Brauner:
       "This contains the usual miscellaneous features, cleanups, and fixes
        for vfs and individual fses.
      
        Features:
      
         - Add Jan Kara as VFS reviewer
      
         - Show correct device and inode numbers in /proc/<pid>/maps for vma
           files on stacked filesystems. This is now easily doable thanks to
           the backing file work from the last cycles. This comes with
           selftests
      
        Cleanups:
      
         - Remove a redundant might_sleep() from wait_on_inode()
      
         - Initialize pointer with NULL, not 0
      
         - Clarify comment on access_override_creds()
      
         - Rework and simplify eventfd_signal() and eventfd_signal_mask()
           helpers
      
         - Process aio completions in batches to avoid needless wakeups
      
         - Completely decouple struct mnt_idmap from namespaces. We now only
           keep the actual idmapping around and don't stash references to
           namespaces
      
         - Reformat maintainer entries to indicate that a given subsystem
           belongs to fs/
      
         - Simplify fput() for files that were never opened
      
         - Get rid of various pointless file helpers
      
         - Rename various file helpers
      
         - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from
           last cycle
      
         - Make relatime_need_update() return bool
      
         - Use GFP_KERNEL instead of GFP_USER when allocating superblocks
      
         - Replace deprecated ida_simple_*() calls with their current ida_*()
           counterparts
      
        Fixes:
      
         - Fix comments on user namespace id mapping helpers. They aren't
           kernel doc comments so they shouldn't be using /**
      
         - s/Retuns/Returns/g in various places
      
         - Add missing parameter documentation on can_move_mount_beneath()
      
         - Rename i_mapping->private_data to i_mapping->i_private_data
      
         - Fix a false-positive lockdep warning in pipe_write() for watch
           queues
      
         - Improve __fget_files_rcu() code generation to improve performance
      
         - Only notify writers that pipe resizing has finished after setting
           pipe->max_usage, otherwise writers are never notified that the
           pipe has been resized and hang
      
         - Fix some kernel docs in hfsplus
      
         - s/passs/pass/g in various places
      
         - Fix kernel docs in ntfs
      
         - Fix kcalloc() arguments order reported by gcc 14
      
         - Fix uninitialized value in reiserfs"
      
      * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
        reiserfs: fix uninit-value in comp_keys
        watch_queue: fix kcalloc() arguments order
        ntfs: dir.c: fix kernel-doc function parameter warnings
        fs: fix doc comment typo fs tree wide
        selftests/overlayfs: verify device and inode numbers in /proc/pid/maps
        fs/proc: show correct device and inode numbers in /proc/pid/maps
        eventfd: Remove usage of the deprecated ida_simple_xx() API
        fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation
        fs/hfsplus: wrapper.c: fix kernel-doc warnings
        fs: add Jan Kara as reviewer
        fs/inode: Make relatime_need_update return bool
        pipe: wakeup wr_wait after setting max_usage
        file: remove __receive_fd()
        file: stop exposing receive_fd_user()
        fs: replace f_rcuhead with f_task_work
        file: remove pointless wrapper
        file: s/close_fd_get_file()/file_close_fd()/g
        Improve __fget_files_rcu() code generation (and thus __fget_light())
        file: massage cleanup of files that failed to open
        fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()
        ...
    • asm-generic: make sparse happy with odd-sized put_unaligned_*() · 1ab33c03
      Dmitry Torokhov authored
      __put_unaligned_be24() and friends use implicit casts to convert
      larger-sized data to bytes, which trips sparse truncation warnings when
      the argument is a constant:
      
          CC [M]  drivers/input/touchscreen/hynitron_cstxxx.o
          CHECK   drivers/input/touchscreen/hynitron_cstxxx.c
        drivers/input/touchscreen/hynitron_cstxxx.c: note: in included file (through arch/x86/include/generated/asm/unaligned.h):
        include/asm-generic/unaligned.h:119:16: warning: cast truncates bits from constant value (aa01a0 becomes a0)
        include/asm-generic/unaligned.h:120:20: warning: cast truncates bits from constant value (aa01 becomes 1)
        include/asm-generic/unaligned.h:119:16: warning: cast truncates bits from constant value (ab00d0 becomes d0)
        include/asm-generic/unaligned.h:120:20: warning: cast truncates bits from constant value (ab00 becomes 0)
      
      To avoid this, let's mask off the upper bits explicitly. The resulting
      code should be exactly the same, but it will keep sparse happy.

      Reported-by: kernel test robot <lkp@intel.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Closes: https://lore.kernel.org/oe-kbuild-all/202401070147.gqwVulOn-lkp@intel.com/
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
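The explicit masking described in this commit can be illustrated with a userspace sketch. `put_be24_sketch` is a hypothetical name and uses `<stdint.h>` types rather than the kernel's `u32`/`u8`, so it shows the technique without claiming to be the exact patched helper.

```c
#include <stdint.h>

/*
 * Store the low 24 bits of `val` at `p` in big-endian byte order.
 * Each byte is masked with 0xff before the store, so no bits of a
 * constant argument are dropped by an implicit narrowing conversion,
 * which is the pattern the sparse warnings above complain about.
 */
static inline void put_be24_sketch(uint32_t val, uint8_t *p)
{
	p[0] = (val >> 16) & 0xff;
	p[1] = (val >> 8) & 0xff;
	p[2] = val & 0xff;
}
```

For the constant `0xaa01a0` from the warning output, the three stored bytes are `0xaa`, `0x01` and `0xa0`, the same values the implicit truncation produced.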
  2. 07 Jan, 2024 1 commit
  3. 06 Jan, 2024 2 commits
  4. 05 Jan, 2024 25 commits
  5. 04 Jan, 2024 8 commits