1. 23 Apr, 2024 2 commits
  2. 22 Apr, 2024 22 commits
  3. 20 Apr, 2024 1 commit
    • xfs: fix CIL sparse lock context warnings · 2c03d956
      Dave Chinner authored
      Sparse reports:
      
      fs/xfs/xfs_log_cil.c:1127:1: warning: context imbalance in 'xlog_cil_push_work' - different lock contexts for basic block
      fs/xfs/xfs_log_cil.c:1380:1: warning: context imbalance in 'xlog_cil_push_background' - wrong count at exit
      fs/xfs/xfs_log_cil.c:1623:9: warning: context imbalance in 'xlog_cil_commit' - unexpected unlock
      
      xlog_cil_push_background() has a locking annotation for an rw_sem.
      Sparse does not track lock contexts for rw_sems, so the
      annotation generates false warnings. Remove the annotation.
      
      xlog_wait_on_iclog() drops the log->l_icloglock. The function has a
      sparse annotation, but the prototype in xfs_log_priv.h does not.
      Hence the warning from xlog_cil_push_work() which calls
      xlog_wait_on_iclog(). Add the missing annotation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      2c03d956
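The prototype/definition mismatch described in this commit can be illustrated with a small sketch. The macros follow the pattern the kernel uses for sparse (`__CHECKER__` is defined only when sparse runs), while `demo_log`, `demo_lock`, and `demo_wait_on_iclog` are hypothetical stand-ins and not XFS code.

```c
#include <assert.h>

/* Sparse defines __CHECKER__; a normal compiler sees empty macros. */
#ifdef __CHECKER__
# define __releases(x)	__attribute__((context(x, 1, 0)))
# define __release(x)	__context__(x, -1)
#else
# define __releases(x)
# define __release(x)	(void)0
#endif

/* Toy lock: just a flag, standing in for a spinlock. */
struct demo_lock {
	int	held;
};

struct demo_log {
	struct demo_lock	lock;
	int			waiters;
};

/*
 * Prototype, as it would appear in a header.  Callers in other files see
 * only this declaration, so the annotation has to be repeated here;
 * otherwise the checker cannot know the lock is dropped on return.
 */
int demo_wait_on_iclog(struct demo_log *log) __releases(log->lock);

/* Definition: entered with log->lock held, returns with it dropped. */
int demo_wait_on_iclog(struct demo_log *log)
	__releases(log->lock)
{
	assert(log->lock.held);		/* caller must hold the lock */
	log->waiters++;
	log->lock.held = 0;		/* stand-in for spin_unlock() */
	__release(log->lock);		/* tell the checker the context dropped */
	return 0;
}
```

With a regular compiler the annotations expand to nothing, so they cost nothing at run time; they only matter when the checker walks the call graph.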
  4. 16 Apr, 2024 15 commits
    • Merge tag 'retain-ilock-during-dir-ops-6.10_2024-04-15' of... · 9cb5f15d
      Chandan Babu R authored
      Merge tag 'retain-ilock-during-dir-ops-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: retain ILOCK during directory updates
      
      This series changes the directory update code to retain the ILOCK on all
      files involved in a rename until the end of the operation.  The upcoming
      parent pointers patchset applies parent pointers in a separate chained
      update from the actual directory update, which is why it is now
      necessary to keep the ILOCK instead of dropping it after the first
      transaction in the chain.
      
      As a side effect, we no longer need to hold the IOLOCK during an rmapbt
      scan of inodes to serialize the scan with ongoing directory updates.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'retain-ilock-during-dir-ops-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: unlock new repair tempfiles after creation
        xfs: don't pick up IOLOCK during rmapbt repair scan
        xfs: Hold inode locks in xfs_rename
        xfs: Hold inode locks in xfs_trans_alloc_dir
        xfs: Hold inode locks in xfs_ialloc
        xfs: Increase XFS_QM_TRANS_MAXDQS to 5
        xfs: Increase XFS_DEFER_OPS_NR_INODES to 5
      9cb5f15d
    • Merge tag 'online-fsck-design-6.10_2024-04-15' of... · f910defd
      Chandan Babu R authored
      Merge tag 'online-fsck-design-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: design documentation for online fsck, part 2
      
      This series updates the design documentation for online fsck to reflect
      the final design of the parent pointers feature as well as the
      implementation of online fsck for the new metadata.
      
      This has been running on the djcloud for months with no problems.  Enjoy!
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'online-fsck-design-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        docs: describe xfs directory tree online fsck
        docs: update offline parent pointer repair strategy
        docs: update online directory and parent pointer repair sections
        docs: update the parent pointers documentation to the final version
      f910defd
    • Merge tag 'discard-relax-locks-6.10_2024-04-15' of... · 6ad1b914
      Chandan Babu R authored
      Merge tag 'discard-relax-locks-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: less heavy locks during fstrim
      
      Congratulations!  You have made it to the final patchset of the main
      online fsck feature!  This patchset fixes some stalling behavior that I
      observed when running FITRIM against large flash-based filesystems with
      very heavily fragmented free space data.  In summary -- the current
      fstrim implementation optimizes for trimming the largest free extents
      first, and holds the AGF lock for the duration of the operation.  This
      is great if fstrim is being run as a foreground process by a sysadmin.
      
      For xfs_scrub, however, this isn't so good -- we don't really want to
      block on one huge kernel call while reporting no progress information.
      We don't want to hold the AGF so long that background processes stall.
      These problems are easily fixable by issuing smaller FITRIM calls, but
      there's still the problem of walking the entire cntbt.  To solve that
      second problem, we introduce a new sub-AG FITRIM implementation.  To
      solve the first problem, make it relax the AGF periodically.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'discard-relax-locks-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: fix performance problems when fstrimming a subset of a fragmented AG
      6ad1b914
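The "issue smaller FITRIM calls" idea from the fstrim cover letter can be sketched from userspace as follows. This is an illustration only and not how xfs_scrub is implemented; `demo_trim_in_chunks` and its parameters are hypothetical names, while `FITRIM` and `struct fstrim_range` come from `<linux/fs.h>`.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FITRIM, struct fstrim_range */

/*
 * Trim the byte range [start, start + len) of the filesystem that fd
 * lives on, at most chunk bytes per call.  Each call returns quickly,
 * so the caller can report progress between calls.  Returns 0 on
 * success or a negative errno; *trimmed accumulates the byte counts
 * the kernel reports back.  Overflow of start + len is not checked.
 */
int demo_trim_in_chunks(int fd, uint64_t start, uint64_t len,
			uint64_t chunk, uint64_t *trimmed)
{
	uint64_t off = start;
	uint64_t end = start + len;

	assert(chunk > 0);
	*trimmed = 0;

	while (off < end) {
		struct fstrim_range range = {
			.start	= off,
			.len	= (end - off < chunk) ? end - off : chunk,
			.minlen	= 0,
		};
		uint64_t step = range.len;

		if (ioctl(fd, FITRIM, &range) < 0)
			return -errno;

		*trimmed += range.len;	/* kernel writes back bytes trimmed */
		off += step;
		/* a real caller could print progress information here */
	}
	return 0;
}
```

Smaller chunks give finer-grained progress at the price of more system calls; the right chunk size is a tuning decision this sketch does not attempt to make.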
    • Merge tag 'inode-repair-improvements-6.10_2024-04-15' of... · 9ba8e658
      Chandan Babu R authored
      Merge tag 'inode-repair-improvements-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: inode-related repair fixes
      
      While doing QA of the online fsck code, I made a few observations:
      First, nobody was checking that the di_onlink field is actually zero;
      Second, that allocating a temporary file for repairs can fail (and
      thus bring down the entire fs) if the inode cluster is corrupt;
      Third, that file link counts do not pin at ~0U to prevent integer
      overflows; and Fourth, that the x{chk,rep}_metadata_inode_fork
      functions should be subclassing the main scrub context, not modifying
      the parent's setup willy-nilly.
      
      This scattered patchset fixes those four problems.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'inode-repair-improvements-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: create subordinate scrub contexts for xchk_metadata_inode_subtype
        xfs: pin inodes that would otherwise overflow link count
        xfs: try to avoid allocating from sick inode clusters
        xfs: check unused nlink fields in the ondisk inode
      9ba8e658
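Pinning a link count at ~0U, as mentioned in the inode repair cover letter, amounts to saturating arithmetic. The sketch below shows the idea in isolation; `DEMO_NLINK_PINNED` and both helpers are hypothetical names and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel: a link count of ~0U is treated as "pinned". */
#define DEMO_NLINK_PINNED	(~0U)

/* Add a link, but never wrap around to zero. */
static inline uint32_t demo_inc_nlink(uint32_t nlink)
{
	if (nlink == DEMO_NLINK_PINNED)
		return nlink;		/* already saturated; stay pinned */
	return nlink + 1;
}

/* Drop a link; a pinned count no longer tracks reality, so leave it. */
static inline uint32_t demo_drop_nlink(uint32_t nlink)
{
	assert(nlink != 0);		/* dropping below zero is a caller bug */
	if (nlink == DEMO_NLINK_PINNED)
		return nlink;
	return nlink - 1;
}
```

Once a count reaches the sentinel it stays there, so a file with an enormous number of links can never appear to have zero links after an overflow.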
    • Merge tag 'repair-iunlink-6.10_2024-04-15' of... · 1eef0125
      Chandan Babu R authored
      Merge tag 'repair-iunlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: online fsck of iunlink buckets
      
      This series enhances the AGI scrub code to check the unlinked inode
      bucket lists for errors, and fixes them if necessary.  Now that iunlink
      pointer updates are virtual log items, we can batch updates pretty
      efficiently in the logging code.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'repair-iunlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: repair AGI unlinked inode bucket lists
        xfs: hoist AGI repair context to a heap object
        xfs: check AGI unlinked inode buckets
      1eef0125
    • Merge tag 'repair-symlink-6.10_2024-04-15' of... · 0313dd8f
      Chandan Babu R authored
      Merge tag 'repair-symlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: online repair of symbolic links
      
      The patches in this set add the ability to repair the target buffer of
      a symbolic link, using the same salvage, rebuild, and swap strategy used
      everywhere else.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'repair-symlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: online repair of symbolic links
        xfs: pass the owner to xfs_symlink_write_target
        xfs: expose xfs_bmap_local_to_extents for online repair
      0313dd8f
    • Merge tag 'repair-orphanage-6.10_2024-04-15' of... · 067d3f71
      Chandan Babu R authored
      Merge tag 'repair-orphanage-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: move orphan files to lost and found
      
      Orphaned files are defined to be files with nonzero ondisk link count
      but no observable parent directory.  This series enables online repair
      to reparent orphaned files into the filesystem directory tree, and wires
      up this reparenting ability into the directory, file link count, and
      parent pointer repair functions.  This is how we fix files with positive
      link count that are not reachable through the directory tree.
      
      This patch will also create the orphanage directory (lost+found) if it
      is not present.  In contrast to xfs_repair, we follow e2fsck in creating
      the lost+found without group or other-owner access to avoid accidental
      disclosure of files that were previously hidden by an 0700 directory.
      That's silly security, but people have been known to do it.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'repair-orphanage-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: ensure dentry consistency when the orphanage adopts a file
        xfs: move files to orphanage instead of letting nlinks drop to zero
        xfs: move orphan files to the orphanage
      067d3f71
    • Merge tag 'repair-dirs-6.10_2024-04-15' of... · 9e6b93b7
      Chandan Babu R authored
      Merge tag 'repair-dirs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: online repair of directories
      
      This series employs atomic extent swapping to enable safe reconstruction
      of directory data.  For now, XFS does not support reverse directory
      links (aka parent pointers), so we can only salvage the dirents of a
      directory and construct a new structure.
      
      Directory repair therefore consists of five main parts:
      
      First, we walk the existing directory to salvage as many entries as we
      can, by adding them as new directory entries to the repair temp dir.
      
      Second, we validate the parent pointer found in the directory.  If one
      was not found, we scan the entire filesystem looking for a potential
      parent.
      
      Third, we use atomic extent swaps to exchange the entire data fork
      between the two directories.
      
      Fourth, we reap the old directory blocks as carefully as we can.
      
      To wrap up the directory repair code, we need to add to the regular
      filesystem the ability to free all the data fork blocks in a directory.
      This does not change anything with normal directories, since they must
      still unlink and shrink one entry at a time.  However, this will
      facilitate freeing of partially-inactivated temporary directories during
      log recovery.
      
      The second half of this patchset implements repairs for the dotdot
      entries of directories.  For now there is only rudimentary support for
      this, because there are no directory parent pointers, so the best we can
      do is scanning the filesystem and the VFS dcache for answers.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'repair-dirs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: ask the dentry cache if it knows the parent of a directory
        xfs: online repair of parent pointers
        xfs: scan the filesystem to repair a directory dotdot entry
        xfs: online repair of directories
        xfs: inactivate directory data blocks
      9e6b93b7
    • Merge tag 'repair-unlinked-inode-state-6.10_2024-04-15' of... · 902603bf
      Chandan Babu R authored
      Merge tag 'repair-unlinked-inode-state-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: online repair of inode unlinked state
      
      This series adds some logic to the inode scrubbers so that they can
      detect and deal with consistency errors between the link count and the
      per-inode unlinked list state.  The helpers needed to do this are
      presented here because they are a prerequisite for rebuilding directories,
      since we need to get a rebuilt non-empty directory off the unlinked
      list.
      
      Note that this patchset does not provide comprehensive reconstruction of
      the AGI unlinked list; that is coming in a subsequent patchset.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'repair-unlinked-inode-state-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: update the unlinked list when repairing link counts
        xfs: ensure unlinked list state is consistent with nlink during scrub
      902603bf
    • Merge tag 'repair-xattrs-6.10_2024-04-15' of... · 5f3e9511
      Chandan Babu R authored
      Merge tag 'repair-xattrs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: online repair of extended attributes
      
      This series employs atomic extent swapping to enable safe reconstruction
      of extended attribute data attached to a file.  Because xattrs do not
      have any redundant information to draw off of, we can at best salvage
      as much data as we can and build a new structure.
      
      Rebuilding an extended attribute structure consists of these three
      steps:
      
      First, we walk the existing attributes to salvage as many of them as we
      can, by adding them as new attributes attached to the repair tempfile.
      We need to add a new xfile-based data structure to hold blobs of
      arbitrary length to stage the xattr names and values.
      
      Second, we write the salvaged attributes to a temporary file, and use
      atomic extent swaps to exchange the entire attribute fork between the
      two files.
      
      Finally, we reap the old xattr blocks (which are now in the temporary
      file) as carefully as we can.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'repair-xattrs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: create an xattr iteration function for scrub
        xfs: flag empty xattr leaf blocks for optimization
        xfs: scrub should set preen if attr leaf has holes
        xfs: repair extended attributes
        xfs: use atomic extent swapping to fix user file fork data
        xfs: create a blob array data structure
        xfs: enable discarding of folios backing an xfile
      5f3e9511
    • Merge tag 'dirattr-validate-owners-6.10_2024-04-15' of... · fb1f7c66
      Chandan Babu R authored
      Merge tag 'dirattr-validate-owners-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: set and validate dir/attr block owners
      
      There are a couple of significant changes that need to be made to the
      directory and xattr code before we can support online repairs of those
      data structures.
      
      The first change is because online repair is designed to use libxfs to
      create a replacement dir/xattr structure in a temporary file, and use
      atomic extent swapping to commit the corrected structure.  To avoid the
      performance hit of walking every block of the new structure to rewrite
      the owner number before the swap, we instead change libxfs to allow
      callers of the dir and xattr code the ability to set an explicit owner
      number to be written into the header fields of any new blocks that are
      created.  For regular operation this will be the directory inode number.
      
      The second change is to update the dir/xattr code to actually *check*
      the owner number in each block that is read off the disk, since we don't
      currently do that.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'dirattr-validate-owners-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: validate explicit directory free block owners
        xfs: validate explicit directory block buffer owners
        xfs: validate explicit directory data buffer owners
        xfs: validate directory leaf buffer owners
        xfs: validate dabtree node buffer owners
        xfs: validate attr remote value buffer owners
        xfs: validate attr leaf buffer owners
        xfs: reduce indenting in xfs_attr_node_list
        xfs: use the xfs_da_args owner field to set new dir/attr block owner
        xfs: add an explicit owner field to xfs_da_args
      fb1f7c66
    • Merge tag 'repair-rtsummary-6.10_2024-04-15' of... · 8b309acd
      Chandan Babu R authored
      Merge tag 'repair-rtsummary-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: online repair of realtime summaries
      
      We now have all the infrastructure we need to repair file metadata.
      We'll begin with the realtime summary file, because it is the least
      complex data structure.  To support this we need to add three more
      pieces to the temporary file code from the previous patchset --
      preallocating space in the temp file, formatting metadata into that
      space and writing the blocks to disk, and swapping the fork mappings
      atomically.
      
      After that, the actual reconstruction of the realtime summary
      information is pretty simple, since we can simply write the incore
      copy computed by the rtsummary scrubber to the temporary file, swap the
      contents, and reap the old blocks.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'repair-rtsummary-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: online repair of realtime summaries
        xfs: teach the tempfile to set up atomic file content exchanges
        xfs: support preallocating and copying content into temporary files
      8b309acd
    • Merge tag 'repair-tempfiles-6.10_2024-04-15' of... · 783c5170
      Chandan Babu R authored
      Merge tag 'repair-tempfiles-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: create temporary files for online repair
      
      As mentioned earlier, the repair strategy for file-based metadata is to
      build a new copy in a temporary file and swap the file fork mappings
      with the metadata inode.  We've built the atomic extent swap facility,
      so now we need to build a facility for handling private temporary files.
      
      The first step is to teach the filesystem to ignore the temporary files.
      We'll mark them as PRIVATE in the VFS so that the kernel security
      modules will leave them alone.  The second step is to add to the online
      repair code the ability to create a temporary file and reap extents from
      the temporary file after the extent swap.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'repair-tempfiles-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: add the ability to reap entire inode forks
        xfs: refactor live buffer invalidation for repairs
        xfs: create temporary files and directories for online repair
        xfs: hide private inodes from bulkstat and handle functions
      783c5170
    • Merge tag 'atomic-file-updates-6.10_2024-04-15' of... · 22d5a8e5
      Chandan Babu R authored
      Merge tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: atomic file content exchanges
      
      This series creates a new XFS_IOC_EXCHANGE_RANGE ioctl to exchange
      ranges of bytes between two files atomically.
      
      This new functionality enables data storage programs to stage and commit
      file updates such that reader programs will see either the old contents
      or the new contents in their entirety, with no chance of torn writes.  A
      successful call completion guarantees that the new contents will be seen
      even if the system fails.
      
      The ability to exchange file fork mappings between files in this manner
      is critical to supporting online filesystem repair, which is built upon
      the strategy of constructing a clean copy of a damaged structure and
      committing the new structure into the metadata file atomically.  The
      ioctls exist to facilitate testing of the new functionality and to
      enable future application program designs.
      
      User programs will be able to update files atomically by opening an
      O_TMPFILE, reflinking the source file to it, making whatever updates
      they want to make, and exchanging the relevant ranges of the temp file
      with the original file.  If the updates are aligned with the file block
      size, a new (since v2) flag provides for exchanging only the written
      areas.  Note that application software must quiesce writes to the file
      while it stages an atomic update.  This will be addressed by a
      subsequent series.
      
      This mechanism solves the clunkiness of two existing atomic file update
      mechanisms: for O_TRUNC + rewrite, this eliminates the brief period
      where other programs can see an empty file.  For create tempfile +
      rename, the need to copy file attributes and extended attributes for
      each file update is eliminated.
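For comparison, the older "create tempfile + rename" idiom mentioned above looks roughly like the sketch below. It uses only POSIX calls; `demo_replace_by_rename` is a hypothetical helper name, and short writes are treated as errors to keep the sketch brief.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Replace the contents of path with buf by writing a new file in the
 * same directory and renaming it over the original.  Readers see either
 * the old file or the new one.  Note what is lost: the new inode has
 * mkstemp()'s 0600 mode and no extended attributes, so a careful caller
 * must copy ownership, mode, and xattrs from the old file by hand.
 * Returns 0 on success, -1 on failure.
 */
int demo_replace_by_rename(const char *path, const void *buf, size_t len)
{
	char tmp[4096];
	int fd;

	assert(path != NULL);
	if (snprintf(tmp, sizeof(tmp), "%s.XXXXXX", path) >= (int)sizeof(tmp))
		return -1;

	fd = mkstemp(tmp);
	if (fd < 0)
		return -1;

	/* Persist the data before the rename makes it visible. */
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	close(fd);

	if (rename(tmp, path) < 0) {
		unlink(tmp);
		return -1;
	}
	return 0;
}
```

The attribute copying that this idiom forces on every update is exactly the overhead the exchange-range approach is described as removing, since the original inode stays in place.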
      
      However, this method introduces its own awkwardness -- any program
      initiating an exchange now needs to have a way to signal to other
      programs that the file contents have changed.  For file access mediated
      via read and write, fanotify or inotify are probably sufficient.  For
      mmaped files, that may not be fast enough.
      
      Here is the proposed manual page:
      
      IOCTL-XFS-EXCHANGE-RANGE(2)  System Calls Manual  IOCTL-XFS-EXCHANGE-RANGE(2)
      
      NAME
             ioctl_xfs_exchange_range  -  exchange  the contents of parts of
             two files
      
      SYNOPSIS
             #include <sys/ioctl.h>
             #include <xfs/xfs_fs.h>
      
             int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE,
                       struct xfs_exchange_range *arg);
      
      DESCRIPTION
             Given  a  range  of bytes in a first file file1_fd and a second
             range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
             changes the contents of the two ranges.
      
             Exchanges are atomic with regard to concurrent file operations.
             Implementations must guarantee that readers see either the old
             contents or the new contents in their entirety, even if the
             system fails.
      
             The system call parameters are conveyed in  structures  of  the
             following form:
      
                 struct xfs_exchange_range {
                     __s32    file1_fd;
                     __u32    pad;
                     __u64    file1_offset;
                     __u64    file2_offset;
                     __u64    length;
                     __u64    flags;
                 };
      
             The field pad must be zero.
      
             The  fields file1_fd, file1_offset, and length define the first
             range of bytes to be exchanged.
      
             The fields file2_fd, file2_offset, and length define the second
             range of bytes to be exchanged.
      
             Both  files must be from the same filesystem mount.  If the two
             file descriptors represent the same file, the byte ranges  must
             not  overlap.   Most  disk-based  filesystems  require that the
             starts of both ranges must be aligned to the file  block  size.
             If  this  is  the  case, the ends of the ranges must also be so
             aligned unless the XFS_EXCHANGE_RANGE_TO_EOF flag is set.
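The alignment and overlap rules in the preceding paragraph can be pre-checked in userspace along these lines. This is a sketch of the stated rules only, for the case where XFS_EXCHANGE_RANGE_TO_EOF is not set; `demo_ranges_ok` is a hypothetical helper, and the kernel remains the authority on what it accepts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Check a fixed-length exchange request against the documented rules:
 * both start offsets and the length must be multiples of the file block
 * size, and when both descriptors refer to the same file the two ranges
 * must not overlap.
 */
bool demo_ranges_ok(bool same_file, uint64_t file1_offset,
		    uint64_t file2_offset, uint64_t length, uint64_t blksz)
{
	assert(blksz > 0);

	/* Starts must be aligned to the file block size. */
	if (file1_offset % blksz || file2_offset % blksz)
		return false;

	/* Aligned start plus aligned length gives an aligned end. */
	if (length % blksz)
		return false;

	if (same_file) {
		uint64_t lo = file1_offset < file2_offset ?
				file1_offset : file2_offset;
		uint64_t hi = file1_offset < file2_offset ?
				file2_offset : file1_offset;

		/* The lower range must end at or before the higher start. */
		if (hi - lo < length)
			return false;
	}
	return true;
}
```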
      
             The field flags controls the behavior of the exchange operation.
      
                 XFS_EXCHANGE_RANGE_TO_EOF
                        Ignore the length parameter.  All bytes in  file1_fd
                        from  file1_offset to EOF are moved to file2_fd, and
                        file2's size is set to  (file2_offset+(file1_length-
                        file1_offset)).   Meanwhile, all bytes in file2 from
                        file2_offset to EOF are moved to file1  and  file1's
                        size    is   set   to   (file1_offset+(file2_length-
                        file2_offset)).
      
                 XFS_EXCHANGE_RANGE_DSYNC
                        Ensure that all modified in-core data in  both  file
                        ranges  and  all  metadata updates pertaining to the
                        exchange operation are flushed to persistent storage
                        before  the  call  returns.  Opening either file de‐
                        scriptor with O_SYNC or O_DSYNC will have  the  same
                        effect.
      
                 XFS_EXCHANGE_RANGE_FILE1_WRITTEN
                        Only  exchange sub-ranges of file1_fd that are known
                        to contain data  written  by  application  software.
                        Each  sub-range  may  be  expanded (both upwards and
                        downwards) to align with the file  allocation  unit.
                        For files on the data device, this is one filesystem
                        block.  For files on the realtime  device,  this  is
                        the realtime extent size.  This facility can be used
                        to implement fast atomic  scatter-gather  writes  of
                        any  complexity for software-defined storage targets
                        if all writes are aligned  to  the  file  allocation
                        unit.
      
                 XFS_EXCHANGE_RANGE_DRY_RUN
                        Check  the parameters and the feasibility of the op‐
                        eration, but do not change anything.
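The size arithmetic given for XFS_EXCHANGE_RANGE_TO_EOF in the flag list above can be worked through with a small helper. The formulas are the ones stated in the text; `demo_sizes` and `demo_to_eof_sizes` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

struct demo_sizes {
	uint64_t	file1_size;
	uint64_t	file2_size;
};

/*
 * Compute the file sizes after a TO_EOF exchange:
 *   file2 size = file2_offset + (file1_length - file1_offset)
 *   file1 size = file1_offset + (file2_length - file2_offset)
 * Each file keeps its bytes below its own offset and receives the
 * other file's tail.
 */
struct demo_sizes demo_to_eof_sizes(uint64_t file1_length,
				    uint64_t file1_offset,
				    uint64_t file2_length,
				    uint64_t file2_offset)
{
	struct demo_sizes s;

	assert(file1_offset <= file1_length);
	assert(file2_offset <= file2_length);

	s.file1_size = file1_offset + (file2_length - file2_offset);
	s.file2_size = file2_offset + (file1_length - file1_offset);
	return s;
}
```

With both offsets at zero the two files simply trade lengths: a 1500-byte file1 and a 500-byte file2 end up as 500 and 1500 bytes respectively.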
      
      RETURN VALUE
             On error, -1 is returned, and errno is set to indicate the  er‐
             ror.
      
      ERRORS
             Error  codes can be one of, but are not limited to, the follow‐
             ing:
      
             EBADF  file1_fd is not open for reading and writing or is  open
                    for  append-only  writes;  or  file2_fd  is not open for
                    reading and writing or is open for append-only writes.
      
             EINVAL The parameters are not correct for  these  files.   This
                    error  can  also appear if either file descriptor repre‐
                    sents a device, FIFO, or socket.  Disk filesystems  gen‐
                    erally  require  the  offset  and length arguments to be
                    aligned to the fundamental block sizes of both files.
      
             EIO    An I/O error occurred.
      
             EISDIR One of the files is a directory.
      
             ENOMEM The kernel was unable to allocate sufficient  memory  to
                    perform the operation.
      
             ENOSPC There is not enough free space in the filesystem to
                    exchange the contents safely.
      
             EOPNOTSUPP
                    The filesystem does not support exchanging bytes between
                    the two files.
      
             EPERM  file1_fd or file2_fd are immutable.
      
             ETXTBSY
                    One of the files is a swap file.
      
             EUCLEAN
                    The filesystem is corrupt.
      
             EXDEV  file1_fd  and  file2_fd  are  not  on  the  same mounted
                    filesystem.
      
      CONFORMING TO
             This API is XFS-specific.
      
      USE CASES
             Several use cases are imagined for this system  call.   In  all
             cases, application software must coordinate updates to the file
             because the exchange is performed unconditionally.
      
             The first is a data storage program that wants to  commit  non-
             contiguous  updates  to a file atomically and coordinates write
             access to that file.  This can be done by creating a  temporary
             file, calling FICLONE(2) to share the contents, and staging the
             updates into the temporary file.  The XFS_EXCHANGE_RANGE_TO_EOF
             flag is recommended for this purpose.  The temporary file can be
             deleted or punched out afterwards.
      
             An example program might look like this:
      
                 int fd = open("/some/file", O_RDWR);
                 int temp_fd = open("/some", O_TMPFILE | O_RDWR, 0600);
      
                 ioctl(temp_fd, FICLONE, fd);
      
                 /* append 1MB of records */
                 lseek(temp_fd, 0, SEEK_END);
                 write(temp_fd, data1, 1000000);
      
                 /* update record index */
                 pwrite(temp_fd, data1, 600, 98765);
                 pwrite(temp_fd, data2, 320, 54321);
                 pwrite(temp_fd, data2, 15, 0);
      
                 /* commit the entire update */
                 struct xfs_exchange_range args = {
                     .file1_fd = temp_fd,
                     .flags = XFS_EXCHANGE_RANGE_TO_EOF,
                 };
      
                 ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
      
             The second is a software-defined  storage  host  (e.g.  a  disk
             jukebox)  which  implements an atomic scatter-gather write com‐
             mand.  Provided the exported disk's logical block size  matches
             the file's allocation unit size, this can be done by creating a
             temporary file and writing the data at the appropriate offsets.
             It  is  recommended that the temporary file be truncated to the
             size of the regular file before any writes are  staged  to  the
             temporary  file  to avoid issues with zeroing during EOF exten‐
             sion.  Use this call with the FILE1_WRITTEN  flag  to  exchange
             only  the  file  allocation  units involved in the emulated de‐
             vice's write command.  The temporary file should  be  truncated
             or  punched out completely before being reused to stage another
             write.
      
             An example program might look like this:
      
                 int fd = open("/some/file", O_RDWR);
                 /* O_TMPFILE requires a mode argument, like O_CREAT */
                 int temp_fd = open("/some", O_TMPFILE | O_RDWR, 0600);
                 struct stat sb;
                 int blksz;
      
                 fstat(fd, &sb);
                 blksz = sb.st_blksize;

                 /* size the staging file to match, as recommended above */
                 ftruncate(temp_fd, sb.st_size);
      
                 /* land scatter gather writes between 100fsb and 500fsb */
                 pwrite(temp_fd, data1, blksz * 2, blksz * 100);
                 pwrite(temp_fd, data2, blksz * 20, blksz * 480);
                 pwrite(temp_fd, data3, blksz * 7, blksz * 257);
      
                 /* commit the entire update */
                 struct xfs_exchange_range args = {
                     .file1_fd = temp_fd,
                     .file1_offset = blksz * 100,
                     .file2_offset = blksz * 100,
                     .length       = blksz * 400,
                     .flags        = XFS_EXCHANGE_RANGE_FILE1_WRITTEN |
                                     XFS_EXCHANGE_RANGE_DSYNC,
                 };
      
                 ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
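
             The offset and length passed in the example above must cover
             every staged write and stay aligned to the allocation unit.  A
             minimal sketch of that arithmetic follows; struct staged_write,
             struct exchange_span, and covering_span() are hypothetical
             names introduced here for illustration and do not appear in
             any XFS header.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one staged write, as a byte offset and byte length. */
struct staged_write {
    uint64_t offset;
    uint64_t length;
};

/* Hypothetical: the byte range to hand to the exchange call. */
struct exchange_span {
    uint64_t offset;
    uint64_t length;
};

/*
 * Compute the smallest blksz-aligned byte range that covers every
 * staged write.  Returns 0 on success, or -1 if there are no writes
 * or blksz is zero.
 */
int covering_span(const struct staged_write *w, size_t n,
                  uint64_t blksz, struct exchange_span *out)
{
    uint64_t start, end;
    size_t i;

    if (n == 0 || blksz == 0)
        return -1;

    start = w[0].offset;
    end = w[0].offset + w[0].length;
    for (i = 1; i < n; i++) {
        if (w[i].offset < start)
            start = w[i].offset;
        if (w[i].offset + w[i].length > end)
            end = w[i].offset + w[i].length;
    }

    /* Round the start down and the end up to the allocation unit. */
    start -= start % blksz;
    if (end % blksz != 0)
        end += blksz - (end % blksz);

    out->offset = start;
    out->length = end - start;
    return 0;
}
```

             For the three writes in the example, this yields an offset of
             blksz * 100 and a length of blksz * 400, matching the values
             filled into the args structure.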
      
      NOTES
             Some filesystems may limit the amount of data or the number  of
             extents that can be exchanged in a single call.
      
      SEE ALSO
             ioctl(2)
      
      XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
      
      The reference implementation in XFS creates a new log incompat feature
      and log intent items to track high level progress of swapping ranges of
      two files and finish interrupted work if the system goes down.  Sample
      code can be found in the corresponding changes to xfs_io to exercise the
      use case mentioned above.
      
      Note that this function is /not/ the O_DIRECT atomic untorn file writes
      concept that has also been floating around for years.  It is also not
      the RWF_ATOMIC patchset that has been shared.  This RFC is constructed
      entirely in software, which means that there are no limitations other
      than the general filesystem limits.
      
      As a side note, the original motivation behind the kernel functionality
      is online repair of file-based metadata.  The atomic file content
      exchange is implemented as an atomic exchange of file fork mappings,
      which means that we can implement online reconstruction of extended
      attributes and directories by building a new one in another inode and
      exchanging the contents.
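
      The idea of building a replacement in a second inode and then
      exchanging fork mappings can be pictured with a toy model.  This is
      only an illustrative sketch: struct toy_fork, struct toy_inode, and
      toy_exchange_forks() are invented names, the whole fork is swapped
      in one step, and none of the logging or crash recovery that the
      real kernel code performs is modelled.

```c
#include <stddef.h>

/* Toy stand-in for a file fork: a label and a mapping count. */
struct toy_fork {
    const char *contents;
    size_t nextents;
};

/* Toy stand-in for an inode: identity plus one data fork. */
struct toy_inode {
    unsigned long ino;
    struct toy_fork data;
};

/*
 * Exchange the fork contents of two toy inodes.  The inode numbers
 * stay put, so anything referring to either inode keeps working and
 * simply sees the other file's contents after the exchange.
 */
void toy_exchange_forks(struct toy_inode *a, struct toy_inode *b)
{
    struct toy_fork tmp = a->data;

    a->data = b->data;
    b->data = tmp;
}
```

      In this picture, repair would fill the temporary inode's fork with
      the rebuilt structure, exchange it with the damaged inode, and then
      discard the temporary inode along with the old contents.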
      
      Subsequent patchsets adapt the online filesystem repair code to use
      atomic file exchanges.  This enables repair functions to construct a
      clean copy of a directory, xattr information, symbolic links, realtime
      bitmaps, and realtime summary information in a temporary inode.  If this
      completes successfully, the new contents can be committed atomically
      into the inode being repaired.  This is essential to avoid making
      corruption problems worse if the system goes down in the middle of
      running repair.
      
      For userspace, this series also includes the userspace pieces needed to
      test the new functionality, and a sample implementation of atomic file
      updates.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: enable logged file mapping exchange feature
        docs: update swapext -> exchmaps language
        xfs: capture inode generation numbers in the ondisk exchmaps log item
        xfs: support non-power-of-two rtextsize with exchange-range
        xfs: make file range exchange support realtime files
        xfs: condense symbolic links after a mapping exchange operation
        xfs: condense directories after a mapping exchange operation
        xfs: condense extended attributes after a mapping exchange operation
        xfs: add error injection to test file mapping exchange recovery
        xfs: bind together the front and back ends of the file range exchange code
        xfs: create deferred log items for file mapping exchanges
        xfs: introduce a file mapping exchange log intent item
        xfs: create a incompat flag for atomic file mapping exchanges
        xfs: introduce new file range exchange ioctl
        vfs: export remap and write check helpers
      22d5a8e5
    • Chandan Babu R's avatar
      Merge tag 'file-exchange-refactorings-6.10_2024-04-15' of... · 4ec2e3c1
      Chandan Babu R authored
      Merge tag 'file-exchange-refactorings-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA
      
      xfs: refactorings for atomic file content exchanges
      
      This series applies various cleanups and refactorings to file IO
      handling code ahead of the main series to implement atomic file content
      exchanges.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      
      * tag 'file-exchange-refactorings-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: constify xfs_bmap_is_written_extent
        xfs: refactor non-power-of-two alignment checks
        xfs: hoist multi-fsb allocation unit detection to a helper
        xfs: create a new helper to return a file's allocation unit
        xfs: declare xfs_file.c symbols in xfs_file.h
        xfs: move xfs_iops.c declarations out of xfs_inode.h
        xfs: move inode lease breaking functions to xfs_inode.c
      4ec2e3c1