Commit 22d5a8e5 authored by Chandan Babu R's avatar Chandan Babu R

Merge tag 'atomic-file-updates-6.10_2024-04-15' of...

Merge tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

xfs: atomic file content exchanges

This series creates a new XFS_IOC_EXCHANGE_RANGE ioctl to exchange
ranges of bytes between two files atomically.

This new functionality enables data storage programs to stage and commit
file updates such that reader programs will see either the old contents
or the new contents in their entirety, with no chance of torn writes.  A
successful call completion guarantees that the new contents will be seen
even if the system fails.

The ability to exchange file fork mappings between files in this manner
is critical to supporting online filesystem repair, which is built upon
the strategy of constructing a clean copy of a damaged structure and
committing the new structure into the metadata file atomically.  The
ioctls exist to facilitate testing of the new functionality and to
enable future application program designs.

User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want to make, and exchange the relevant ranges of the temp file
with the original file.  If the updates are aligned with the file block
size, a new (since v2) flag provides for exchanging only the written
areas.  Note that application software must quiesce writes to the file
while it stages an atomic update.  This will be addressed by a
subsequent series.

This mechanism solves the clunkiness of two existing atomic file update
mechanisms: for O_TRUNC + rewrite, this eliminates the brief period
where other programs can see an empty file.  For create tempfile +
rename, the need to copy file attributes and extended attributes for
each file update is eliminated.

However, this method introduces its own awkwardness -- any program
initiating an exchange now needs to have a way to signal to other
programs that the file contents have changed.  For file access mediated
via read and write, fanotify or inotify are probably sufficient.  For
mmaped files, that may not be fast enough.

Here is the proposed manual page:

IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)

NAME
       ioctl_xfs_exchange_range  -  exchange  the contents of parts of
       two files

SYNOPSIS
       #include <sys/ioctl.h>
       #include <xfs/xfs_fs.h>

       int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct  xfs_ex‐
       change_range *arg);

DESCRIPTION
       Given  a  range  of bytes in a first file file1_fd and a second
       range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
       changes the contents of the two ranges.

       Exchanges  are  atomic  with  regards to concurrent file opera‐
       tions.  Implementations must guarantee that readers see  either
       the old contents or the new contents in their entirety, even if
       the system fails.

       The system call parameters are conveyed in  structures  of  the
       following form:

           struct xfs_exchange_range {
               __s32    file1_fd;
               __u32    pad;
               __u64    file1_offset;
               __u64    file2_offset;
               __u64    length;
               __u64    flags;
           };

       The field pad must be zero.

       The  fields file1_fd, file1_offset, and length define the first
       range of bytes to be exchanged.

       The fields file2_fd, file2_offset, and length define the second
       range of bytes to be exchanged.

       Both  files must be from the same filesystem mount.  If the two
       file descriptors represent the same file, the byte ranges  must
       not  overlap.   Most  disk-based  filesystems  require that the
       starts of both ranges must be aligned to the file  block  size.
       If  this  is  the  case, the ends of the ranges must also be so
       aligned unless the XFS_EXCHANGE_RANGE_TO_EOF flag is set.

       The field flags control the behavior of the exchange operation.

           XFS_EXCHANGE_RANGE_TO_EOF
                  Ignore the length parameter.  All bytes in  file1_fd
                  from  file1_offset to EOF are moved to file2_fd, and
                  file2's size is set to  (file2_offset+(file1_length-
                  file1_offset)).   Meanwhile, all bytes in file2 from
                  file2_offset to EOF are moved to file1  and  file1's
                  size    is   set   to   (file1_offset+(file2_length-
                  file2_offset)).

           XFS_EXCHANGE_RANGE_DSYNC
                  Ensure that all modified in-core data in  both  file
                  ranges  and  all  metadata updates pertaining to the
                  exchange operation are flushed to persistent storage
                  before  the  call  returns.  Opening either file de‐
                  scriptor with O_SYNC or O_DSYNC will have  the  same
                  effect.

           XFS_EXCHANGE_RANGE_FILE1_WRITTEN
                  Only  exchange sub-ranges of file1_fd that are known
                  to contain data  written  by  application  software.
                  Each  sub-range  may  be  expanded (both upwards and
                  downwards) to align with the file  allocation  unit.
                  For files on the data device, this is one filesystem
                  block.  For files on the realtime  device,  this  is
                  the realtime extent size.  This facility can be used
                  to implement fast atomic  scatter-gather  writes  of
                  any  complexity for software-defined storage targets
                  if all writes are aligned  to  the  file  allocation
                  unit.

           XFS_EXCHANGE_RANGE_DRY_RUN
                  Check  the parameters and the feasibility of the op‐
                  eration, but do not change anything.

RETURN VALUE
       On error, -1 is returned, and errno is set to indicate the  er‐
       ror.

ERRORS
       Error  codes can be one of, but are not limited to, the follow‐
       ing:

       EBADF  file1_fd is not open for reading and writing or is  open
              for  append-only  writes;  or  file2_fd  is not open for
              reading and writing or is open for append-only writes.

       EINVAL The parameters are not correct for  these  files.   This
              error  can  also appear if either file descriptor repre‐
              sents a device, FIFO, or socket.  Disk filesystems  gen‐
              erally  require  the  offset  and length arguments to be
              aligned to the fundamental block sizes of both files.

       EIO    An I/O error occurred.

       EISDIR One of the files is a directory.

       ENOMEM The kernel was unable to allocate sufficient  memory  to
              perform the operation.

       ENOSPC There  is  not  enough  free space in the filesystem ex‐
              change the contents safely.

       EOPNOTSUPP
              The filesystem does not support exchanging bytes between
              the two files.

       EPERM  file1_fd or file2_fd are immutable.

       ETXTBSY
              One of the files is a swap file.

       EUCLEAN
              The filesystem is corrupt.

       EXDEV  file1_fd  and  file2_fd  are  not  on  the  same mounted
              filesystem.

CONFORMING TO
       This API is XFS-specific.

USE CASES
       Several use cases are imagined for this system  call.   In  all
       cases, application software must coordinate updates to the file
       because the exchange is performed unconditionally.

       The first is a data storage program that wants to  commit  non-
       contiguous  updates  to a file atomically and coordinates write
       access to that file.  This can be done by creating a  temporary
       file, calling FICLONE(2) to share the contents, and staging the
       updates into the temporary file.  The FULL_FILES flag is recom‐
       mended  for this purpose.  The temporary file can be deleted or
       punched out afterwards.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR);

           ioctl(temp_fd, FICLONE, fd);

           /* append 1MB of records */
           lseek(temp_fd, 0, SEEK_END);
           write(temp_fd, data1, 1000000);

           /* update record index */
           pwrite(temp_fd, data1, 600, 98765);
           pwrite(temp_fd, data2, 320, 54321);
           pwrite(temp_fd, data2, 15, 0);

           /* commit the entire update */
           struct xfs_exchange_range args = {
               .file1_fd = temp_fd,
               .flags = XFS_EXCHANGE_RANGE_TO_EOF,
           };

           ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);

       The second is a software-defined  storage  host  (e.g.  a  disk
       jukebox)  which  implements an atomic scatter-gather write com‐
       mand.  Provided the exported disk's logical block size  matches
       the file's allocation unit size, this can be done by creating a
       temporary file and writing the data at the appropriate offsets.
       It  is  recommended that the temporary file be truncated to the
       size of the regular file before any writes are  staged  to  the
       temporary  file  to avoid issues with zeroing during EOF exten‐
       sion.  Use this call with the FILE1_WRITTEN  flag  to  exchange
       only  the  file  allocation  units involved in the emulated de‐
       vice's write command.  The temporary file should  be  truncated
       or  punched out completely before being reused to stage another
       write.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR);
           struct stat sb;
           int blksz;

           fstat(fd, &sb);
           blksz = sb.st_blksize;

           /* land scatter gather writes between 100fsb and 500fsb */
           pwrite(temp_fd, data1, blksz * 2, blksz * 100);
           pwrite(temp_fd, data2, blksz * 20, blksz * 480);
           pwrite(temp_fd, data3, blksz * 7, blksz * 257);

           /* commit the entire update */
           struct xfs_exchange_range args = {
               .file1_fd = temp_fd,
               .file1_offset = blksz * 100,
               .file2_offset = blksz * 100,
               .length       = blksz * 400,
               .flags        = XFS_EXCHANGE_RANGE_FILE1_WRITTEN |
                               XFS_EXCHANGE_RANGE_FILE1_DSYNC,
           };

           ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);

NOTES
       Some filesystems may limit the amount of data or the number  of
       extents that can be exchanged in a single call.

SEE ALSO
       ioctl(2)

XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)

The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down.  Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.

Note that this function is /not/ the O_DIRECT atomic untorn file writes
concept that has also been floating around for years.  It is also not
the RWF_ATOMIC patchset that has been shared.  This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.

As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata.  The atomic file content
exchange is implemented as an atomic exchange of file fork mappings,
which means that we can implement online reconstruction of extended
attributes and directories by building a new one in another inode and
exchanging the contents.

Subsequent patchsets adapt the online filesystem repair code to use
atomic file exchanges.  This enables repair functions to construct a
clean copy of a directory, xattr information, symbolic links, realtime
bitmaps, and realtime summary information in a temporary inode.  If this
completes successfully, the new contents can be committed atomically
into the inode being repaired.  This is essential to avoid making
corruption problems worse if the system goes down in the middle of
running repair.

For userspace, this series also includes the userspace pieces needed to
test the new functionality, and a sample implementation of atomic file
updates.
Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>

* tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: enable logged file mapping exchange feature
  docs: update swapext -> exchmaps language
  xfs: capture inode generation numbers in the ondisk exchmaps log item
  xfs: support non-power-of-two rtextsize with exchange-range
  xfs: make file range exchange support realtime files
  xfs: condense symbolic links after a mapping exchange operation
  xfs: condense directories after a mapping exchange operation
  xfs: condense extended attributes after a mapping exchange operation
  xfs: add error injection to test file mapping exchange recovery
  xfs: bind together the front and back ends of the file range exchange code
  xfs: create deferred log items for file mapping exchanges
  xfs: introduce a file mapping exchange log intent item
  xfs: create a incompat flag for atomic file mapping exchanges
  xfs: introduce new file range exchange ioctl
  vfs: export remap and write check helpers
parents 4ec2e3c1 0730e8d8
......@@ -1667,6 +1667,7 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
return 0;
}
EXPORT_SYMBOL_GPL(generic_write_check_limits);
/* Like generic_write_checks(), but takes size of write instead of iter. */
int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
......
......@@ -99,8 +99,7 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in,
return 0;
}
static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
bool write)
int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write)
{
int mask = write ? MAY_WRITE : MAY_READ;
loff_t tmp;
......@@ -118,6 +117,7 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
return fsnotify_file_area_perm(file, mask, &pos, len);
}
EXPORT_SYMBOL_GPL(remap_verify_area);
/*
* Ensure that we don't remap a partial EOF block in the middle of something
......
......@@ -34,6 +34,7 @@ xfs-y += $(addprefix libxfs/, \
xfs_dir2_node.o \
xfs_dir2_sf.o \
xfs_dquot_buf.o \
xfs_exchmaps.o \
xfs_ialloc.o \
xfs_ialloc_btree.o \
xfs_iext_tree.o \
......@@ -67,6 +68,7 @@ xfs-y += xfs_aops.o \
xfs_dir2_readdir.o \
xfs_discard.o \
xfs_error.o \
xfs_exchrange.o \
xfs_export.o \
xfs_extent_busy.o \
xfs_file.o \
......@@ -101,6 +103,7 @@ xfs-y += xfs_log.o \
xfs_buf_item.o \
xfs_buf_item_recover.o \
xfs_dquot_item_recover.o \
xfs_exchmaps_item.o \
xfs_extfree_item.o \
xfs_attr_item.o \
xfs_icreate_item.o \
......
......@@ -27,6 +27,7 @@
#include "xfs_da_btree.h"
#include "xfs_attr.h"
#include "xfs_trans_priv.h"
#include "xfs_exchmaps.h"
static struct kmem_cache *xfs_defer_pending_cache;
......@@ -1176,6 +1177,10 @@ xfs_defer_init_item_caches(void)
error = xfs_attr_intent_init_cache();
if (error)
goto err;
error = xfs_exchmaps_intent_init_cache();
if (error)
goto err;
return 0;
err:
xfs_defer_destroy_item_caches();
......@@ -1186,6 +1191,7 @@ xfs_defer_init_item_caches(void)
void
xfs_defer_destroy_item_caches(void)
{
xfs_exchmaps_intent_destroy_cache();
xfs_attr_intent_destroy_cache();
xfs_extfree_intent_destroy_cache();
xfs_bmap_intent_destroy_cache();
......
......@@ -72,7 +72,7 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
extern const struct xfs_defer_op_type xfs_attr_defer_type;
extern const struct xfs_defer_op_type xfs_exchmaps_defer_type;
/*
* Deferred operation item relogging limits.
......
......@@ -63,7 +63,8 @@
#define XFS_ERRTAG_ATTR_LEAF_TO_NODE 41
#define XFS_ERRTAG_WB_DELAY_MS 42
#define XFS_ERRTAG_WRITE_DELAY_MS 43
#define XFS_ERRTAG_MAX 44
#define XFS_ERRTAG_EXCHMAPS_FINISH_ONE 44
#define XFS_ERRTAG_MAX 45
/*
* Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
......@@ -111,5 +112,6 @@
#define XFS_RANDOM_ATTR_LEAF_TO_NODE 1
#define XFS_RANDOM_WB_DELAY_MS 3000
#define XFS_RANDOM_WRITE_DELAY_MS 3000
#define XFS_RANDOM_EXCHMAPS_FINISH_ONE 1
#endif /* __XFS_ERRORTAG_H_ */
This diff is collapsed.
/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
* Copyright (c) 2020-2024 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#ifndef __XFS_EXCHMAPS_H__
#define __XFS_EXCHMAPS_H__
/* In-core deferred operation info about a file mapping exchange request. */
struct xfs_exchmaps_intent {
/* List of other incore deferred work. */
struct list_head xmi_list;
/* Inodes participating in the operation. */
struct xfs_inode *xmi_ip1;
struct xfs_inode *xmi_ip2;
/* File offset range information. */
xfs_fileoff_t xmi_startoff1;
xfs_fileoff_t xmi_startoff2;
xfs_filblks_t xmi_blockcount;
/* Set these file sizes after the operation, unless negative. */
xfs_fsize_t xmi_isize1;
xfs_fsize_t xmi_isize2;
uint64_t xmi_flags; /* XFS_EXCHMAPS_* flags */
};
/* Try to convert inode2 from block to short format at the end, if possible. */
#define __XFS_EXCHMAPS_INO2_SHORTFORM (1ULL << 63)
#define XFS_EXCHMAPS_INTERNAL_FLAGS (__XFS_EXCHMAPS_INO2_SHORTFORM)
/* flags that can be passed to xfs_exchmaps_{estimate,mappings} */
#define XFS_EXCHMAPS_PARAMS (XFS_EXCHMAPS_ATTR_FORK | \
XFS_EXCHMAPS_SET_SIZES | \
XFS_EXCHMAPS_INO1_WRITTEN)
static inline int
xfs_exchmaps_whichfork(const struct xfs_exchmaps_intent *xmi)
{
if (xmi->xmi_flags & XFS_EXCHMAPS_ATTR_FORK)
return XFS_ATTR_FORK;
return XFS_DATA_FORK;
}
/* Parameters for a mapping exchange request. */
struct xfs_exchmaps_req {
/* Inodes participating in the operation. */
struct xfs_inode *ip1;
struct xfs_inode *ip2;
/* File offset range information. */
xfs_fileoff_t startoff1;
xfs_fileoff_t startoff2;
xfs_filblks_t blockcount;
/* XFS_EXCHMAPS_* operation flags */
uint64_t flags;
/*
* Fields below this line are filled out by xfs_exchmaps_estimate;
* callers should initialize this part of the struct to zero.
*/
/*
* Data device blocks to be moved out of ip1, and free space needed to
* handle the bmbt changes.
*/
xfs_filblks_t ip1_bcount;
/*
* Data device blocks to be moved out of ip2, and free space needed to
* handle the bmbt changes.
*/
xfs_filblks_t ip2_bcount;
/* rt blocks to be moved out of ip1. */
xfs_filblks_t ip1_rtbcount;
/* rt blocks to be moved out of ip2. */
xfs_filblks_t ip2_rtbcount;
/* Free space needed to handle the bmbt changes */
unsigned long long resblks;
/* Number of exchanges needed to complete the operation */
unsigned long long nr_exchanges;
};
static inline int
xfs_exchmaps_reqfork(const struct xfs_exchmaps_req *req)
{
if (req->flags & XFS_EXCHMAPS_ATTR_FORK)
return XFS_ATTR_FORK;
return XFS_DATA_FORK;
}
int xfs_exchmaps_estimate(struct xfs_exchmaps_req *req);
extern struct kmem_cache *xfs_exchmaps_intent_cache;
int __init xfs_exchmaps_intent_init_cache(void);
void xfs_exchmaps_intent_destroy_cache(void);
struct xfs_exchmaps_intent *xfs_exchmaps_init_intent(
const struct xfs_exchmaps_req *req);
void xfs_exchmaps_ensure_reflink(struct xfs_trans *tp,
const struct xfs_exchmaps_intent *xmi);
void xfs_exchmaps_upgrade_extent_counts(struct xfs_trans *tp,
const struct xfs_exchmaps_intent *xmi);
int xfs_exchmaps_finish_one(struct xfs_trans *tp,
struct xfs_exchmaps_intent *xmi);
int xfs_exchmaps_check_forks(struct xfs_mount *mp,
const struct xfs_exchmaps_req *req);
void xfs_exchange_mappings(struct xfs_trans *tp,
const struct xfs_exchmaps_req *req);
#endif /* __XFS_EXCHMAPS_H__ */
......@@ -367,19 +367,21 @@ xfs_sb_has_ro_compat_feature(
return (sbp->sb_features_ro_compat & feature) != 0;
}
#define XFS_SB_FEAT_INCOMPAT_FTYPE (1 << 0) /* filetype in dirent */
#define XFS_SB_FEAT_INCOMPAT_SPINODES (1 << 1) /* sparse inode chunks */
#define XFS_SB_FEAT_INCOMPAT_META_UUID (1 << 2) /* metadata UUID */
#define XFS_SB_FEAT_INCOMPAT_BIGTIME (1 << 3) /* large timestamps */
#define XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR (1 << 4) /* needs xfs_repair */
#define XFS_SB_FEAT_INCOMPAT_NREXT64 (1 << 5) /* large extent counters */
#define XFS_SB_FEAT_INCOMPAT_FTYPE (1 << 0) /* filetype in dirent */
#define XFS_SB_FEAT_INCOMPAT_SPINODES (1 << 1) /* sparse inode chunks */
#define XFS_SB_FEAT_INCOMPAT_META_UUID (1 << 2) /* metadata UUID */
#define XFS_SB_FEAT_INCOMPAT_BIGTIME (1 << 3) /* large timestamps */
#define XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR (1 << 4) /* needs xfs_repair */
#define XFS_SB_FEAT_INCOMPAT_NREXT64 (1 << 5) /* large extent counters */
#define XFS_SB_FEAT_INCOMPAT_EXCHRANGE (1 << 6) /* exchangerange supported */
#define XFS_SB_FEAT_INCOMPAT_ALL \
(XFS_SB_FEAT_INCOMPAT_FTYPE| \
XFS_SB_FEAT_INCOMPAT_SPINODES| \
XFS_SB_FEAT_INCOMPAT_META_UUID| \
XFS_SB_FEAT_INCOMPAT_BIGTIME| \
XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR| \
XFS_SB_FEAT_INCOMPAT_NREXT64)
(XFS_SB_FEAT_INCOMPAT_FTYPE | \
XFS_SB_FEAT_INCOMPAT_SPINODES | \
XFS_SB_FEAT_INCOMPAT_META_UUID | \
XFS_SB_FEAT_INCOMPAT_BIGTIME | \
XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR | \
XFS_SB_FEAT_INCOMPAT_NREXT64 | \
XFS_SB_FEAT_INCOMPAT_EXCHRANGE)
#define XFS_SB_FEAT_INCOMPAT_UNKNOWN ~XFS_SB_FEAT_INCOMPAT_ALL
static inline bool
......
......@@ -239,6 +239,7 @@ typedef struct xfs_fsop_resblks {
#define XFS_FSOP_GEOM_FLAGS_BIGTIME (1 << 21) /* 64-bit nsec timestamps */
#define XFS_FSOP_GEOM_FLAGS_INOBTCNT (1 << 22) /* inobt btree counter */
#define XFS_FSOP_GEOM_FLAGS_NREXT64 (1 << 23) /* large extent counters */
#define XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE (1 << 24) /* exchange range */
/*
* Minimum and maximum sizes need for growth checks.
......@@ -772,6 +773,46 @@ struct xfs_scrub_metadata {
# define XFS_XATTR_LIST_MAX 65536
#endif
/*
* Exchange part of file1 with part of the file that this ioctl that is being
* called against (which we'll call file2). Filesystems must be able to
* restart and complete the operation even after the system goes down.
*/
struct xfs_exchange_range {
__s32 file1_fd;
__u32 pad; /* must be zeroes */
__u64 file1_offset; /* file1 offset, bytes */
__u64 file2_offset; /* file2 offset, bytes */
__u64 length; /* bytes to exchange */
__u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
};
/*
* Exchange file data all the way to the ends of both files, and then exchange
* the file sizes. This flag can be used to replace a file's contents with a
* different amount of data. length will be ignored.
*/
#define XFS_EXCHANGE_RANGE_TO_EOF (1ULL << 0)
/* Flush all changes in file data and file metadata to disk before returning. */
#define XFS_EXCHANGE_RANGE_DSYNC (1ULL << 1)
/* Dry run; do all the parameter verification but do not change anything. */
#define XFS_EXCHANGE_RANGE_DRY_RUN (1ULL << 2)
/*
* Exchange only the parts of the two files where the file allocation units
* mapped to file1's range have been written to. This can accelerate
* scatter-gather atomic writes with a temp file if all writes are aligned to
* the file allocation unit.
*/
#define XFS_EXCHANGE_RANGE_FILE1_WRITTEN (1ULL << 3)
#define XFS_EXCHANGE_RANGE_ALL_FLAGS (XFS_EXCHANGE_RANGE_TO_EOF | \
XFS_EXCHANGE_RANGE_DSYNC | \
XFS_EXCHANGE_RANGE_DRY_RUN | \
XFS_EXCHANGE_RANGE_FILE1_WRITTEN)
/*
* ioctl commands that are used by Linux filesystems
......@@ -843,6 +884,7 @@ struct xfs_scrub_metadata {
#define XFS_IOC_FSGEOMETRY _IOR ('X', 126, struct xfs_fsop_geom)
#define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
#define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
#define XFS_IOC_EXCHANGE_RANGE _IOWR('X', 129, struct xfs_exchange_range)
/* XFS_IOC_GETFSUUID ---------- deprecated 140 */
......
......@@ -117,8 +117,9 @@ struct xfs_unmount_log_format {
#define XLOG_REG_TYPE_ATTRD_FORMAT 28
#define XLOG_REG_TYPE_ATTR_NAME 29
#define XLOG_REG_TYPE_ATTR_VALUE 30
#define XLOG_REG_TYPE_MAX 30
#define XLOG_REG_TYPE_XMI_FORMAT 31
#define XLOG_REG_TYPE_XMD_FORMAT 32
#define XLOG_REG_TYPE_MAX 32
/*
* Flags to log operation header
......@@ -243,6 +244,8 @@ typedef struct xfs_trans_header {
#define XFS_LI_BUD 0x1245
#define XFS_LI_ATTRI 0x1246 /* attr set/remove intent*/
#define XFS_LI_ATTRD 0x1247 /* attr set/remove done */
#define XFS_LI_XMI 0x1248 /* mapping exchange intent */
#define XFS_LI_XMD 0x1249 /* mapping exchange done */
#define XFS_LI_TYPE_DESC \
{ XFS_LI_EFI, "XFS_LI_EFI" }, \
......@@ -260,7 +263,9 @@ typedef struct xfs_trans_header {
{ XFS_LI_BUI, "XFS_LI_BUI" }, \
{ XFS_LI_BUD, "XFS_LI_BUD" }, \
{ XFS_LI_ATTRI, "XFS_LI_ATTRI" }, \
{ XFS_LI_ATTRD, "XFS_LI_ATTRD" }
{ XFS_LI_ATTRD, "XFS_LI_ATTRD" }, \
{ XFS_LI_XMI, "XFS_LI_XMI" }, \
{ XFS_LI_XMD, "XFS_LI_XMD" }
/*
* Inode Log Item Format definitions.
......@@ -878,6 +883,61 @@ struct xfs_bud_log_format {
uint64_t bud_bui_id; /* id of corresponding bui */
};
/*
* XMI/XMD (file mapping exchange) log format definitions
*/
/* This is the structure used to lay out an mapping exchange log item. */
struct xfs_xmi_log_format {
uint16_t xmi_type; /* xmi log item type */
uint16_t xmi_size; /* size of this item */
uint32_t __pad; /* must be zero */
uint64_t xmi_id; /* xmi identifier */
uint64_t xmi_inode1; /* inumber of first file */
uint64_t xmi_inode2; /* inumber of second file */
uint32_t xmi_igen1; /* generation of first file */
uint32_t xmi_igen2; /* generation of second file */
uint64_t xmi_startoff1; /* block offset into file1 */
uint64_t xmi_startoff2; /* block offset into file2 */
uint64_t xmi_blockcount; /* number of blocks */
uint64_t xmi_flags; /* XFS_EXCHMAPS_* */
uint64_t xmi_isize1; /* intended file1 size */
uint64_t xmi_isize2; /* intended file2 size */
};
/* Exchange mappings between extended attribute forks instead of data forks. */
#define XFS_EXCHMAPS_ATTR_FORK (1ULL << 0)
/* Set the file sizes when finished. */
#define XFS_EXCHMAPS_SET_SIZES (1ULL << 1)
/*
* Exchange the mappings of the two files only if the file allocation units
* mapped to file1's range have been written.
*/
#define XFS_EXCHMAPS_INO1_WRITTEN (1ULL << 2)
/* Clear the reflink flag from inode1 after the operation. */
#define XFS_EXCHMAPS_CLEAR_INO1_REFLINK (1ULL << 3)
/* Clear the reflink flag from inode2 after the operation. */
#define XFS_EXCHMAPS_CLEAR_INO2_REFLINK (1ULL << 4)
#define XFS_EXCHMAPS_LOGGED_FLAGS (XFS_EXCHMAPS_ATTR_FORK | \
XFS_EXCHMAPS_SET_SIZES | \
XFS_EXCHMAPS_INO1_WRITTEN | \
XFS_EXCHMAPS_CLEAR_INO1_REFLINK | \
XFS_EXCHMAPS_CLEAR_INO2_REFLINK)
/* This is the structure used to lay out an mapping exchange done log item. */
struct xfs_xmd_log_format {
uint16_t xmd_type; /* xmd log item type */
uint16_t xmd_size; /* size of this item */
uint32_t __pad;
uint64_t xmd_xmi_id; /* id of corresponding xmi */
};
/*
* Dquot Log format definitions.
*
......
......@@ -75,6 +75,8 @@ extern const struct xlog_recover_item_ops xlog_cui_item_ops;
extern const struct xlog_recover_item_ops xlog_cud_item_ops;
extern const struct xlog_recover_item_ops xlog_attri_item_ops;
extern const struct xlog_recover_item_ops xlog_attrd_item_ops;
extern const struct xlog_recover_item_ops xlog_xmi_item_ops;
extern const struct xlog_recover_item_ops xlog_xmd_item_ops;
/*
* Macros, structures, prototypes for internal log manager use.
......@@ -121,6 +123,8 @@ bool xlog_is_buffer_cancelled(struct xlog *log, xfs_daddr_t blkno, uint len);
int xlog_recover_iget(struct xfs_mount *mp, xfs_ino_t ino,
struct xfs_inode **ipp);
int xlog_recover_iget_handle(struct xfs_mount *mp, xfs_ino_t ino, uint32_t gen,
struct xfs_inode **ipp);
void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type,
uint64_t intent_id);
int xlog_alloc_buf_cancel_table(struct xlog *log);
......
......@@ -26,6 +26,7 @@
#include "xfs_health.h"
#include "xfs_ag.h"
#include "xfs_rtbitmap.h"
#include "xfs_exchrange.h"
/*
* Physical superblock buffer manipulations. Shared with libxfs in userspace.
......@@ -175,6 +176,8 @@ xfs_sb_version_to_features(
features |= XFS_FEAT_NEEDSREPAIR;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_NREXT64)
features |= XFS_FEAT_NREXT64;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_EXCHRANGE)
features |= XFS_FEAT_EXCHANGE_RANGE;
return features;
}
......@@ -1259,6 +1262,8 @@ xfs_fs_geometry(
}
if (xfs_has_large_extent_counts(mp))
geo->flags |= XFS_FSOP_GEOM_FLAGS_NREXT64;
if (xfs_has_exchange_range(mp))
geo->flags |= XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE;
geo->rtsectsize = sbp->sb_blocksize;
geo->dirblocksize = xfs_dir2_dirblock_bytes(sbp);
......
......@@ -380,3 +380,50 @@ xfs_symlink_write_target(
ASSERT(pathlen == 0);
return 0;
}
/* Remove all the blocks from a symlink and invalidate buffers. */
int
xfs_symlink_remote_truncate(
struct xfs_trans *tp,
struct xfs_inode *ip)
{
struct xfs_bmbt_irec mval[XFS_SYMLINK_MAPS];
struct xfs_mount *mp = tp->t_mountp;
struct xfs_buf *bp;
int nmaps = XFS_SYMLINK_MAPS;
int done = 0;
int i;
int error;
/* Read mappings and invalidate buffers. */
error = xfs_bmapi_read(ip, 0, XFS_MAX_FILEOFF, mval, &nmaps, 0);
if (error)
return error;
for (i = 0; i < nmaps; i++) {
if (!xfs_bmap_is_real_extent(&mval[i]))
break;
error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
XFS_FSB_TO_DADDR(mp, mval[i].br_startblock),
XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0,
&bp);
if (error)
return error;
xfs_trans_binval(tp, bp);
}
/* Unmap the remote blocks. */
error = xfs_bunmapi(tp, ip, 0, XFS_MAX_FILEOFF, 0, nmaps, &done);
if (error)
return error;
if (!done) {
ASSERT(done);
xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
return -EFSCORRUPTED;
}
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
return 0;
}
......@@ -22,5 +22,6 @@ int xfs_symlink_remote_read(struct xfs_inode *ip, char *link);
int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip,
const char *target_path, int pathlen, xfs_fsblock_t fs_blocks,
uint resblks);
int xfs_symlink_remote_truncate(struct xfs_trans *tp, struct xfs_inode *ip);
#endif /* __XFS_SYMLINK_REMOTE_H */
......@@ -10,6 +10,10 @@
* Components of space reservations.
*/
/* Worst case number of bmaps that can be held in a block. */
#define XFS_MAX_CONTIG_BMAPS_PER_BLOCK(mp) \
(((mp)->m_bmap_dmxr[0]) - ((mp)->m_bmap_dmnr[0]))
/* Worst case number of rmaps that can be held in a block. */
#define XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) \
(((mp)->m_rmap_mxr[0]) - ((mp)->m_rmap_mnr[0]))
......
......@@ -62,6 +62,7 @@ static unsigned int xfs_errortag_random_default[] = {
XFS_RANDOM_ATTR_LEAF_TO_NODE,
XFS_RANDOM_WB_DELAY_MS,
XFS_RANDOM_WRITE_DELAY_MS,
XFS_RANDOM_EXCHMAPS_FINISH_ONE,
};
struct xfs_errortag_attr {
......@@ -179,6 +180,7 @@ XFS_ERRORTAG_ATTR_RW(da_leaf_split, XFS_ERRTAG_DA_LEAF_SPLIT);
XFS_ERRORTAG_ATTR_RW(attr_leaf_to_node, XFS_ERRTAG_ATTR_LEAF_TO_NODE);
XFS_ERRORTAG_ATTR_RW(wb_delay_ms, XFS_ERRTAG_WB_DELAY_MS);
XFS_ERRORTAG_ATTR_RW(write_delay_ms, XFS_ERRTAG_WRITE_DELAY_MS);
XFS_ERRORTAG_ATTR_RW(exchmaps_finish_one, XFS_ERRTAG_EXCHMAPS_FINISH_ONE);
static struct attribute *xfs_errortag_attrs[] = {
XFS_ERRORTAG_ATTR_LIST(noerror),
......@@ -224,6 +226,7 @@ static struct attribute *xfs_errortag_attrs[] = {
XFS_ERRORTAG_ATTR_LIST(attr_leaf_to_node),
XFS_ERRORTAG_ATTR_LIST(wb_delay_ms),
XFS_ERRORTAG_ATTR_LIST(write_delay_ms),
XFS_ERRORTAG_ATTR_LIST(exchmaps_finish_one),
NULL,
};
ATTRIBUTE_GROUPS(xfs_errortag);
......
This diff is collapsed.
/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
* Copyright (c) 2020-2024 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#ifndef __XFS_EXCHMAPS_ITEM_H__
#define __XFS_EXCHMAPS_ITEM_H__
/*
* The file mapping exchange intent item helps us exchange multiple file
* mappings between two inode forks. It does this by tracking the range of
* file block offsets that still need to be exchanged, and relogs as progress
* happens.
*
* *I items should be recorded in the *first* of a series of rolled
* transactions, and the *D items should be recorded in the same transaction
* that records the associated bmbt updates.
*
* Should the system crash after the commit of the first transaction but
* before the commit of the final transaction in a series, log recovery will
* use the redo information recorded by the intent items to replay the
* rest of the mapping exchanges.
*/
/* kernel only XMI/XMD definitions */
struct xfs_mount;
struct kmem_cache;
/*
* This is the incore file mapping exchange intent log item. It is used to log
* the fact that we are exchanging mappings between two files. It is used in
* conjunction with the incore file mapping exchange done log item described
* below.
*
* These log items follow the same rules as struct xfs_efi_log_item; see the
* comments about that structure (in xfs_extfree_item.h) for more details.
*/
struct xfs_xmi_log_item {
struct xfs_log_item xmi_item;
atomic_t xmi_refcount;
struct xfs_xmi_log_format xmi_format;
};
/*
* This is the incore file mapping exchange done log item. It is used to log
* the fact that an exchange mentioned in an earlier xmi item have been
* performed.
*/
struct xfs_xmd_log_item {
struct xfs_log_item xmd_item;
struct xfs_xmi_log_item *xmd_intent_log_item;
struct xfs_xmd_log_format xmd_format;
};
extern struct kmem_cache *xfs_xmi_cache;
extern struct kmem_cache *xfs_xmd_cache;
struct xfs_exchmaps_intent;
void xfs_exchmaps_defer_add(struct xfs_trans *tp,
struct xfs_exchmaps_intent *xmi);
#endif /* __XFS_EXCHMAPS_ITEM_H__ */
This diff is collapsed.
/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
* Copyright (c) 2020-2024 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#ifndef __XFS_EXCHRANGE_H__
#define __XFS_EXCHRANGE_H__
/* Update the mtime/cmtime of file1 and file2 */
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME1 (1ULL << 63)
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME2 (1ULL << 62)
#define XFS_EXCHANGE_RANGE_PRIV_FLAGS (__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
__XFS_EXCHANGE_RANGE_UPD_CMTIME2)
struct xfs_exchrange {
struct file *file1;
struct file *file2;
loff_t file1_offset;
loff_t file2_offset;
u64 length;
u64 flags; /* XFS_EXCHANGE_RANGE flags */
};
long xfs_ioc_exchange_range(struct file *file,
struct xfs_exchange_range __user *argp);
struct xfs_exchmaps_req;
void xfs_exchrange_ilock(struct xfs_trans *tp, struct xfs_inode *ip1,
struct xfs_inode *ip2);
void xfs_exchrange_iunlock(struct xfs_inode *ip1, struct xfs_inode *ip2);
int xfs_exchrange_estimate(struct xfs_exchmaps_req *req);
#endif /* __XFS_EXCHRANGE_H__ */
......@@ -40,6 +40,7 @@
#include "xfs_xattr.h"
#include "xfs_rtbitmap.h"
#include "xfs_file.h"
#include "xfs_exchrange.h"
#include <linux/mount.h>
#include <linux/namei.h>
......@@ -2170,6 +2171,9 @@ xfs_file_ioctl(
return error;
}
case XFS_IOC_EXCHANGE_RANGE:
return xfs_ioc_exchange_range(filp, arg);
default:
return -ENOTTY;
}
......
......@@ -1767,6 +1767,37 @@ xlog_recover_iget(
return 0;
}
/*
* Get an inode so that we can recover a log operation.
*
* Log intent items that target inodes effectively contain a file handle.
* Check that the generation number matches the intent item like we do for
* other file handles. Log intent items defined after this validation weakness
* was identified must use this function.
*/
int
xlog_recover_iget_handle(
struct xfs_mount *mp,
xfs_ino_t ino,
uint32_t gen,
struct xfs_inode **ipp)
{
struct xfs_inode *ip;
int error;
error = xlog_recover_iget(mp, ino, &ip);
if (error)
return error;
if (VFS_I(ip)->i_generation != gen) {
xfs_irele(ip);
return -EFSCORRUPTED;
}
*ipp = ip;
return 0;
}
/******************************************************************************
*
* Log recover routines
......@@ -1789,6 +1820,8 @@ static const struct xlog_recover_item_ops *xlog_recover_item_ops[] = {
&xlog_bud_item_ops,
&xlog_attri_item_ops,
&xlog_attrd_item_ops,
&xlog_xmi_item_ops,
&xlog_xmd_item_ops,
};
static const struct xlog_recover_item_ops *
......
......@@ -292,6 +292,7 @@ typedef struct xfs_mount {
#define XFS_FEAT_BIGTIME (1ULL << 24) /* large timestamps */
#define XFS_FEAT_NEEDSREPAIR (1ULL << 25) /* needs xfs_repair */
#define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */
#define XFS_FEAT_EXCHANGE_RANGE (1ULL << 27) /* exchange range */
/* Mount features */
#define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */
......@@ -355,6 +356,7 @@ __XFS_HAS_FEAT(inobtcounts, INOBTCNT)
__XFS_HAS_FEAT(bigtime, BIGTIME)
__XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
__XFS_HAS_FEAT(large_extent_counts, NREXT64)
__XFS_HAS_FEAT(exchange_range, EXCHANGE_RANGE)
/*
* Mount features
......
......@@ -43,6 +43,7 @@
#include "xfs_iunlink_item.h"
#include "xfs_dahash_test.h"
#include "xfs_rtbitmap.h"
#include "xfs_exchmaps_item.h"
#include "scrub/stats.h"
#include "scrub/rcbag_btree.h"
......@@ -1727,6 +1728,10 @@ xfs_fs_fill_super(
goto out_filestream_unmount;
}
if (xfs_has_exchange_range(mp))
xfs_warn(mp,
"EXPERIMENTAL exchange-range feature enabled. Use at your own risk!");
error = xfs_mountfs(mp);
if (error)
goto out_filestream_unmount;
......@@ -2185,8 +2190,24 @@ xfs_init_caches(void)
if (!xfs_iunlink_cache)
goto out_destroy_attri_cache;
xfs_xmd_cache = kmem_cache_create("xfs_xmd_item",
sizeof(struct xfs_xmd_log_item),
0, 0, NULL);
if (!xfs_xmd_cache)
goto out_destroy_iul_cache;
xfs_xmi_cache = kmem_cache_create("xfs_xmi_item",
sizeof(struct xfs_xmi_log_item),
0, 0, NULL);
if (!xfs_xmi_cache)
goto out_destroy_xmd_cache;
return 0;
out_destroy_xmd_cache:
kmem_cache_destroy(xfs_xmd_cache);
out_destroy_iul_cache:
kmem_cache_destroy(xfs_iunlink_cache);
out_destroy_attri_cache:
kmem_cache_destroy(xfs_attri_cache);
out_destroy_attrd_cache:
......@@ -2243,6 +2264,8 @@ xfs_destroy_caches(void)
* destroy caches.
*/
rcu_barrier();
kmem_cache_destroy(xfs_xmd_cache);
kmem_cache_destroy(xfs_xmi_cache);
kmem_cache_destroy(xfs_iunlink_cache);
kmem_cache_destroy(xfs_attri_cache);
kmem_cache_destroy(xfs_attrd_cache);
......
......@@ -250,19 +250,12 @@ xfs_symlink(
*/
STATIC int
xfs_inactive_symlink_rmt(
struct xfs_inode *ip)
struct xfs_inode *ip)
{
struct xfs_buf *bp;
int done;
int error;
int i;
xfs_mount_t *mp;
xfs_bmbt_irec_t mval[XFS_SYMLINK_MAPS];
int nmaps;
int size;
xfs_trans_t *tp;
mp = ip->i_mount;
struct xfs_mount *mp = ip->i_mount;
struct xfs_trans *tp;
int error;
ASSERT(!xfs_need_iread_extents(&ip->i_df));
/*
* We're freeing a symlink that has some
......@@ -286,44 +279,14 @@ xfs_inactive_symlink_rmt(
* locked for the second transaction. In the error paths we need it
* held so the cancel won't rele it, see below.
*/
size = (int)ip->i_disk_size;
ip->i_disk_size = 0;
VFS_I(ip)->i_mode = (VFS_I(ip)->i_mode & ~S_IFMT) | S_IFREG;
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
/*
* Find the block(s) so we can inval and unmap them.
*/
done = 0;
nmaps = ARRAY_SIZE(mval);
error = xfs_bmapi_read(ip, 0, xfs_symlink_blocks(mp, size),
mval, &nmaps, 0);
if (error)
goto error_trans_cancel;
/*
* Invalidate the block(s). No validation is done.
*/
for (i = 0; i < nmaps; i++) {
error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
XFS_FSB_TO_DADDR(mp, mval[i].br_startblock),
XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0,
&bp);
if (error)
goto error_trans_cancel;
xfs_trans_binval(tp, bp);
}
/*
* Unmap the dead block(s) to the dfops.
*/
error = xfs_bunmapi(tp, ip, 0, size, 0, nmaps, &done);
error = xfs_symlink_remote_truncate(tp, ip);
if (error)
goto error_trans_cancel;
ASSERT(done);
/*
* Commit the transaction. This first logs the EFI and the inode, then
* rolls and commits the transaction that frees the extents.
*/
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
error = xfs_trans_commit(tp);
if (error) {
ASSERT(xfs_is_shutdown(mp));
......
......@@ -39,6 +39,8 @@
#include "xfs_buf_mem.h"
#include "xfs_btree_mem.h"
#include "xfs_bmap.h"
#include "xfs_exchmaps.h"
#include "xfs_exchrange.h"
/*
* We include this last to have the helpers above available for the trace
......
This diff is collapsed.
......@@ -2119,6 +2119,7 @@ extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
loff_t, size_t, unsigned int);
int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write);
int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *len, unsigned int remap_flags,
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment