Commit 22d5a8e5 authored by Chandan Babu R

Merge tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

xfs: atomic file content exchanges

This series creates a new XFS_IOC_EXCHANGE_RANGE ioctl to exchange
ranges of bytes between two files atomically.

This new functionality enables data storage programs to stage and commit
file updates such that reader programs will see either the old contents
or the new contents in their entirety, with no chance of torn writes.  A
successful call completion guarantees that the new contents will be seen
even if the system fails.

The ability to exchange file fork mappings between files in this manner
is critical to supporting online filesystem repair, which is built upon
the strategy of constructing a clean copy of a damaged structure and
committing the new structure into the metadata file atomically.  The
ioctls exist to facilitate testing of the new functionality and to
enable future application program designs.

User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want to make, and exchanging the relevant ranges of the temp file
with the original file.  If the updates are aligned with the file block
size, a new (since v2) flag provides for exchanging only the written
areas.  Note that application software must quiesce writes to the file
while it stages an atomic update.  This will be addressed by a
subsequent series.

This mechanism solves the clunkiness of two existing atomic file update
mechanisms: for O_TRUNC + rewrite, this eliminates the brief period
where other programs can see an empty file.  For create tempfile +
rename, the need to copy file attributes and extended attributes for
each file update is eliminated.

However, this method introduces its own awkwardness -- any program
initiating an exchange now needs to have a way to signal to other
programs that the file contents have changed.  For file access mediated
via read and write, fanotify or inotify are probably sufficient.  For
mmapped files, that may not be fast enough.

Here is the proposed manual page:

IOCTL-XFS-EXCHANGE-RANGE(2)  System Calls Manual  IOCTL-XFS-EXCHANGE-RANGE(2)

NAME
       ioctl_xfs_exchange_range  -  exchange  the contents of parts of
       two files

SYNOPSIS
       #include <sys/ioctl.h>
       #include <xfs/xfs_fs.h>

       int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct  xfs_ex‐
       change_range *arg);

DESCRIPTION
       Given  a  range  of bytes in a first file file1_fd and a second
       range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
       changes the contents of the two ranges.

       Exchanges are atomic with regard to concurrent file operations.
       Implementations must guarantee that readers see either
       the old contents or the new contents in their entirety, even if
       the system fails.

       The system call parameters are conveyed in  structures  of  the
       following form:

           struct xfs_exchange_range {
               __s32    file1_fd;
               __u32    pad;
               __u64    file1_offset;
               __u64    file2_offset;
               __u64    length;
               __u64    flags;
           };

       The field pad must be zero.

       The  fields file1_fd, file1_offset, and length define the first
       range of bytes to be exchanged.

       The fields file2_fd, file2_offset, and length define the second
       range of bytes to be exchanged.

       Both  files must be from the same filesystem mount.  If the two
       file descriptors represent the same file, the byte ranges  must
       not  overlap.   Most  disk-based  filesystems  require that the
       starts of both ranges must be aligned to the file  block  size.
       If  this  is  the  case, the ends of the ranges must also be so
       aligned unless the XFS_EXCHANGE_RANGE_TO_EOF flag is set.
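
       The alignment rule above can be checked before issuing the call.
       The following is a minimal sketch only: the helper name
       exchange_range_is_aligned() is hypothetical and not part of the
       XFS API, and blksz is assumed to be the file block size (for
       example, st_blksize as used in the examples in USE CASES).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the XFS API.  Returns true if both
 * start offsets are multiples of blksz and, unless to_eof is set, the
 * length is as well (so that both range ends are aligned too).
 */
static bool exchange_range_is_aligned(uint64_t file1_offset,
                                      uint64_t file2_offset,
                                      uint64_t length,
                                      uint64_t blksz,
                                      bool to_eof)
{
    if (blksz == 0)
        return false;

    /* The starts of both ranges must be block aligned. */
    if (file1_offset % blksz || file2_offset % blksz)
        return false;

    /* With TO_EOF the length is ignored, so the ends may be unaligned. */
    if (!to_eof && length % blksz)
        return false;

    return true;
}
```

       A caller might run this check first and fall back to a full-file
       exchange when it fails; the kernel remains the final authority
       and may still return EINVAL.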

       The field flags controls the behavior of the exchange operation.

           XFS_EXCHANGE_RANGE_TO_EOF
                  Ignore the length parameter.  All bytes in  file1_fd
                  from  file1_offset to EOF are moved to file2_fd, and
                  file2's size is set to  (file2_offset+(file1_length-
                  file1_offset)).   Meanwhile, all bytes in file2 from
                  file2_offset to EOF are moved to file1  and  file1's
                  size    is   set   to   (file1_offset+(file2_length-
                  file2_offset)).

           XFS_EXCHANGE_RANGE_DSYNC
                  Ensure that all modified in-core data in  both  file
                  ranges  and  all  metadata updates pertaining to the
                  exchange operation are flushed to persistent storage
                  before  the  call  returns.  Opening either file de‐
                  scriptor with O_SYNC or O_DSYNC will have  the  same
                  effect.

           XFS_EXCHANGE_RANGE_FILE1_WRITTEN
                  Only  exchange sub-ranges of file1_fd that are known
                  to contain data  written  by  application  software.
                  Each  sub-range  may  be  expanded (both upwards and
                  downwards) to align with the file  allocation  unit.
                  For files on the data device, this is one filesystem
                  block.  For files on the realtime  device,  this  is
                  the realtime extent size.  This facility can be used
                  to implement fast atomic  scatter-gather  writes  of
                  any  complexity for software-defined storage targets
                  if all writes are aligned  to  the  file  allocation
                  unit.

           XFS_EXCHANGE_RANGE_DRY_RUN
                  Check  the parameters and the feasibility of the op‐
                  eration, but do not change anything.
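
       The size arithmetic described for XFS_EXCHANGE_RANGE_TO_EOF can
       be written out as follows.  This is an illustrative sketch, not
       kernel code: the struct and function names are hypothetical, and
       rejecting offsets beyond EOF is an assumption made here only to
       keep the unsigned subtraction well defined.

```c
#include <stdint.h>

/* Hypothetical container for the file sizes after a TO_EOF exchange. */
struct exchange_sizes {
    uint64_t file1_size;
    uint64_t file2_size;
};

/*
 * Compute the resulting sizes using the formulas from the flag
 * description.  Returns 0 on success, or -1 if an offset lies past
 * EOF (an assumption of this sketch, not documented behavior).
 */
static int to_eof_sizes(uint64_t file1_length, uint64_t file1_offset,
                        uint64_t file2_length, uint64_t file2_offset,
                        struct exchange_sizes *out)
{
    if (file1_offset > file1_length || file2_offset > file2_length)
        return -1;

    /* file2 receives file1's tail: file2_offset+(file1_length-file1_offset) */
    out->file2_size = file2_offset + (file1_length - file1_offset);

    /* file1 receives file2's tail: file1_offset+(file2_length-file2_offset) */
    out->file1_size = file1_offset + (file2_length - file2_offset);

    return 0;
}
```

       For example, exchanging from offset 400 of a 1000-byte file1
       with offset 100 of a 5000-byte file2 leaves file1 at 5300 bytes
       and file2 at 700 bytes.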

RETURN VALUE
       On error, -1 is returned, and errno is set to indicate the  er‐
       ror.

ERRORS
       Error  codes can be one of, but are not limited to, the follow‐
       ing:

       EBADF  file1_fd is not open for reading and writing or is  open
              for  append-only  writes;  or  file2_fd  is not open for
              reading and writing or is open for append-only writes.

       EINVAL The parameters are not correct for  these  files.   This
              error  can  also appear if either file descriptor repre‐
              sents a device, FIFO, or socket.  Disk filesystems  gen‐
              erally  require  the  offset  and length arguments to be
              aligned to the fundamental block sizes of both files.

       EIO    An I/O error occurred.

       EISDIR One of the files is a directory.

       ENOMEM The kernel was unable to allocate sufficient  memory  to
              perform the operation.

       ENOSPC There is not enough free space in the filesystem to exchange
              the contents safely.

       EOPNOTSUPP
              The filesystem does not support exchanging bytes between
              the two files.

       EPERM  file1_fd or file2_fd is immutable.

       ETXTBSY
              One of the files is a swap file.

       EUCLEAN
              The filesystem is corrupt.

       EXDEV  file1_fd  and  file2_fd  are  not  on  the  same mounted
              filesystem.
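
       Callers will usually want to report these errors in terms of
       the exchange rather than with generic text.  The sketch below
       maps a few of the codes listed above to short messages; the
       function name exchange_range_strerror() is hypothetical, the
       wording is paraphrased from this page, and any code it does not
       recognize is passed to strerror(3).

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the XFS API: translate an errno
 * value returned by the exchange ioctl into a short description.
 */
static const char *exchange_range_strerror(int err)
{
    switch (err) {
    case EBADF:
        return "a file is not open for reading and writing, or is append-only";
    case EISDIR:
        return "one of the files is a directory";
    case EOPNOTSUPP:
        return "filesystem does not support exchanging these files";
    case EPERM:
        return "one of the files is immutable";
    case ETXTBSY:
        return "one of the files is a swap file";
    case EXDEV:
        return "files are not on the same mounted filesystem";
    default:
        /* Anything else (EIO, ENOMEM, ENOSPC, ...) uses the libc text. */
        return strerror(err);
    }
}
```

       Because the list of error codes is explicitly not exhaustive,
       the default branch matters: new codes should degrade to the
       standard message rather than be treated as impossible.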

CONFORMING TO
       This API is XFS-specific.

USE CASES
       Several use cases are imagined for this system  call.   In  all
       cases, application software must coordinate updates to the file
       because the exchange is performed unconditionally.

       The first is a data storage program that wants to  commit  non-
       contiguous  updates  to a file atomically and coordinates write
       access to that file.  This can be done by creating a  temporary
       file, calling FICLONE(2) to share the contents, and staging the
       updates into the temporary file.  The XFS_EXCHANGE_RANGE_TO_EOF flag is
       recommended for this purpose.  The temporary file can be deleted or
       punched out afterwards.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR, 0600);

           ioctl(temp_fd, FICLONE, fd);

           /* append 1MB of records */
           lseek(temp_fd, 0, SEEK_END);
           write(temp_fd, data1, 1000000);

           /* update record index */
           pwrite(temp_fd, data1, 600, 98765);
           pwrite(temp_fd, data2, 320, 54321);
           pwrite(temp_fd, data2, 15, 0);

           /* commit the entire update */
           struct xfs_exchange_range args = {
               .file1_fd = temp_fd,
               .flags = XFS_EXCHANGE_RANGE_TO_EOF,
           };

           ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);

       The second is a software-defined  storage  host  (e.g.  a  disk
       jukebox)  which  implements an atomic scatter-gather write com‐
       mand.  Provided the exported disk's logical block size  matches
       the file's allocation unit size, this can be done by creating a
       temporary file and writing the data at the appropriate offsets.
       It  is  recommended that the temporary file be truncated to the
       size of the regular file before any writes are  staged  to  the
       temporary  file  to avoid issues with zeroing during EOF exten‐
       sion.  Use this call with the FILE1_WRITTEN  flag  to  exchange
       only  the  file  allocation  units involved in the emulated de‐
       vice's write command.  The temporary file should  be  truncated
       or  punched out completely before being reused to stage another
       write.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR, 0600);
           struct stat sb;
           int blksz;

           fstat(fd, &sb);
           blksz = sb.st_blksize;

           /* land scatter gather writes between 100fsb and 500fsb */
           pwrite(temp_fd, data1, blksz * 2, blksz * 100);
           pwrite(temp_fd, data2, blksz * 20, blksz * 480);
           pwrite(temp_fd, data3, blksz * 7, blksz * 257);

           /* commit the entire update */
           struct xfs_exchange_range args = {
               .file1_fd = temp_fd,
               .file1_offset = blksz * 100,
               .file2_offset = blksz * 100,
               .length       = blksz * 400,
               .flags        = XFS_EXCHANGE_RANGE_FILE1_WRITTEN |
                               XFS_EXCHANGE_RANGE_DSYNC,
           };

           ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);

NOTES
       Some filesystems may limit the amount of data or the number  of
       extents that can be exchanged in a single call.

SEE ALSO
       ioctl(2)

XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)

The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down.  Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.

Note that this function is /not/ the O_DIRECT atomic untorn file writes
concept that has also been floating around for years.  It is also not
the RWF_ATOMIC patchset that has been shared.  This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.

As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata.  The atomic file content
exchange is implemented as an atomic exchange of file fork mappings,
which means that we can implement online reconstruction of extended
attributes and directories by building a new one in another inode and
exchanging the contents.

Subsequent patchsets adapt the online filesystem repair code to use
atomic file exchanges.  This enables repair functions to construct a
clean copy of a directory, xattr information, symbolic links, realtime
bitmaps, and realtime summary information in a temporary inode.  If this
completes successfully, the new contents can be committed atomically
into the inode being repaired.  This is essential to avoid making
corruption problems worse if the system goes down in the middle of
running repair.

For userspace, this series also includes the userspace pieces needed to
test the new functionality, and a sample implementation of atomic file
updates.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: enable logged file mapping exchange feature
  docs: update swapext -> exchmaps language
  xfs: capture inode generation numbers in the ondisk exchmaps log item
  xfs: support non-power-of-two rtextsize with exchange-range
  xfs: make file range exchange support realtime files
  xfs: condense symbolic links after a mapping exchange operation
  xfs: condense directories after a mapping exchange operation
  xfs: condense extended attributes after a mapping exchange operation
  xfs: add error injection to test file mapping exchange recovery
  xfs: bind together the front and back ends of the file range exchange code
  xfs: create deferred log items for file mapping exchanges
  xfs: introduce a file mapping exchange log intent item
  xfs: create a incompat flag for atomic file mapping exchanges
  xfs: introduce new file range exchange ioctl
  vfs: export remap and write check helpers
parents 4ec2e3c1 0730e8d8
@@ -2167,7 +2167,7 @@ The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
 function frees them all because compaction is not needed.
 The details of repairing directories and extended attributes will be discussed
-in a subsequent section about atomic extent swapping.
+in a subsequent section about atomic file content exchanges.
 However, it should be noted that these repair functions only use blob storage
 to cache a small number of entries before adding them to a temporary ondisk
 file, which is why compaction is not required.
@@ -2802,7 +2802,8 @@ follows this format:
 Repairs for file-based metadata such as extended attributes, directories,
 symbolic links, quota files and realtime bitmaps are performed by building a
-new structure attached to a temporary file and swapping the forks.
+new structure attached to a temporary file and exchanging all mappings in the
+file forks.
 Afterward, the mappings in the old file fork are the candidate blocks for
 disposal.
@@ -3851,8 +3852,8 @@ Because file forks can consume as much space as the entire filesystem, repairs
 cannot be staged in memory, even when a paging scheme is available.
 Therefore, online repair of file-based metadata createas a temporary file in
 the XFS filesystem, writes a new structure at the correct offsets into the
-temporary file, and atomically swaps the fork mappings (and hence the fork
-contents) to commit the repair.
+temporary file, and atomically exchanges all file fork mappings (and hence the
+fork contents) to commit the repair.
 Once the repair is complete, the old fork can be reaped as necessary; if the
 system goes down during the reap, the iunlink code will delete the blocks
 during log recovery.
@@ -3862,10 +3863,11 @@ consistent to use a temporary file safely!
 This dependency is the reason why online repair can only use pageable kernel
 memory to stage ondisk space usage information.
-Swapping metadata extents with a temporary file requires the owner field of the
-block headers to match the file being repaired and not the temporary file. The
-directory, extended attribute, and symbolic link functions were all modified to
-allow callers to specify owner numbers explicitly.
+Exchanging metadata file mappings with a temporary file requires the owner
+field of the block headers to match the file being repaired and not the
+temporary file.
+The directory, extended attribute, and symbolic link functions were all
+modified to allow callers to specify owner numbers explicitly.
 There is a downside to the reaping process -- if the system crashes during the
 reap phase and the fork extents are crosslinked, the iunlink processing will
@@ -3974,8 +3976,8 @@ The proposed patches are in the
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
 series.
-Atomic Extent Swapping
-----------------------
+Logged File Content Exchanges
+-----------------------------
 Once repair builds a temporary file with a new data structure written into
 it, it must commit the new changes into the existing file.
@@ -4010,17 +4012,21 @@ e. Old blocks in the file may be cross-linked with another structure and must
 These problems are overcome by creating a new deferred operation and a new type
 of log intent item to track the progress of an operation to exchange two file
 ranges.
-The new deferred operation type chains together the same transactions used by
-the reverse-mapping extent swap code.
+The new exchange operation type chains together the same transactions used by
+the reverse-mapping extent swap code, but records intermedia progress in the
+log so that operations can be restarted after a crash.
+This new functionality is called the file contents exchange (xfs_exchrange)
+code.
+The underlying implementation exchanges file fork mappings (xfs_exchmaps).
 The new log item records the progress of the exchange to ensure that once an
 exchange begins, it will always run to completion, even there are
 interruptions.
-The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
+The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag
 in the superblock protects these new log item records from being replayed on
 old kernels.
 The proposed patchset is the
-`atomic extent swap
+`file contents exchange
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
 series.
@@ -4061,72 +4067,73 @@ series.
 | The feature bit will not be cleared from the superblock until the log    |
 | becomes clean.                                                           |
 |                                                                          |
-| Log-assisted extended attribute updates and atomic extent swaps both use |
-| log incompat features and provide convenience wrappers around the        |
+| Log-assisted extended attribute updates and file content exchanges bothe |
+| use log incompat features and provide convenience wrappers around the    |
 | functionality.                                                           |
 +--------------------------------------------------------------------------+
-Mechanics of an Atomic Extent Swap
-``````````````````````````````````
+Mechanics of a Logged File Content Exchange
+```````````````````````````````````````````
-Swapping entire file forks is a complex task.
+Exchanging contents between file forks is a complex task.
 The goal is to exchange all file fork mappings between two file fork offset
 ranges.
 There are likely to be many extent mappings in each fork, and the edges of
 the mappings aren't necessarily aligned.
-Furthermore, there may be other updates that need to happen after the swap,
+Furthermore, there may be other updates that need to happen after the exchange,
 such as exchanging file sizes, inode flags, or conversion of fork data to local
 format.
-This is roughly the format of the new deferred extent swap work item:
+This is roughly the format of the new deferred exchange-mapping work item:
 .. code-block:: c
-struct xfs_swapext_intent {
+struct xfs_exchmaps_intent {
     /* Inodes participating in the operation. */
-    struct xfs_inode *sxi_ip1;
-    struct xfs_inode *sxi_ip2;
+    struct xfs_inode *xmi_ip1;
+    struct xfs_inode *xmi_ip2;
     /* File offset range information. */
-    xfs_fileoff_t sxi_startoff1;
-    xfs_fileoff_t sxi_startoff2;
-    xfs_filblks_t sxi_blockcount;
+    xfs_fileoff_t xmi_startoff1;
+    xfs_fileoff_t xmi_startoff2;
+    xfs_filblks_t xmi_blockcount;
     /* Set these file sizes after the operation, unless negative. */
-    xfs_fsize_t sxi_isize1;
-    xfs_fsize_t sxi_isize2;
+    xfs_fsize_t xmi_isize1;
+    xfs_fsize_t xmi_isize2;
-    /* XFS_SWAP_EXT_* log operation flags */
-    uint64_t sxi_flags;
+    /* XFS_EXCHMAPS_* log operation flags */
+    uint64_t xmi_flags;
 };
 The new log intent item contains enough information to track two logical fork
 offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
 blockcount)``.
-Each step of a swap operation exchanges the largest file range mapping possible
-from one file to the other.
-After each step in the swap operation, the two startoff fields are incremented
-and the blockcount field is decremented to reflect the progress made.
-The flags field captures behavioral parameters such as swapping the attr fork
-instead of the data fork and other work to be done after the extent swap.
-The two isize fields are used to swap the file size at the end of the operation
-if the file data fork is the target of the swap operation.
+Each step of an exchange operation exchanges the largest file range mapping
+possible from one file to the other.
+After each step in the exchange operation, the two startoff fields are
+incremented and the blockcount field is decremented to reflect the progress
+made.
+The flags field captures behavioral parameters such as exchanging attr fork
+mappings instead of the data fork and other work to be done after the exchange.
+The two isize fields are used to exchange the file sizes at the end of the
+operation if the file data fork is the target of the operation.
-When the extent swap is initiated, the sequence of operations is as follows:
+When the exchange is initiated, the sequence of operations is as follows:
-1. Create a deferred work item for the extent swap.
-At the start, it should contain the entirety of the file ranges to be
-swapped.
+1. Create a deferred work item for the file mapping exchange.
+At the start, it should contain the entirety of the file block ranges to be
+exchanged.
 2. Call ``xfs_defer_finish`` to process the exchange.
-This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
+This is encapsulated in ``xrep_tempexch_contents`` for scrub operations.
 This will log an extent swap intent item to the transaction for the deferred
-extent swap work item.
+mapping exchange work item.
-3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
+3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is zero,
-a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
-``sxi_startoff2``, respectively, and compute the longest extent that can
-be swapped in a single step.
+a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and
+``xmi_startoff2``, respectively, and compute the longest extent that can
+be exchanged in a single step.
 This is the minimum of the two ``br_blockcount`` s in the mappings.
 Keep advancing through the file forks until at least one of the mappings
 contains written blocks.
@@ -4148,20 +4155,20 @@ When the extent swap is initiated, the sequence of operations is as follows:
 g. Extend the ondisk size of either file if necessary.
-h. Log an extent swap done log item for the extent swap intent log item
-that was read at the start of step 3.
+h. Log a mapping exchange done log item for th mapping exchange intent log
+item that was read at the start of step 3.
 i. Compute the amount of file range that has just been covered.
 This quantity is ``(map1.br_startoff + map1.br_blockcount -
-sxi_startoff1)``, because step 3a could have skipped holes.
+xmi_startoff1)``, because step 3a could have skipped holes.
-j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
+j. Increase the starting offsets of ``xmi_startoff1`` and ``xmi_startoff2``
 by the number of blocks computed in the previous step, and decrease
-``sxi_blockcount`` by the same quantity.
+``xmi_blockcount`` by the same quantity.
 This advances the cursor.
-k. Log a new extent swap intent log item reflecting the advanced state of
-the work item.
+k. Log a new mapping exchange intent log item reflecting the advanced state
+of the work item.
 l. Return the proper error code (EAGAIN) to the deferred operation manager
 to inform it that there is more work to be done.
@@ -4172,22 +4179,23 @@ When the extent swap is initiated, the sequence of operations is as follows:
 This will be discussed in more detail in subsequent sections.
 If the filesystem goes down in the middle of an operation, log recovery will
-find the most recent unfinished extent swap log intent item and restart from
-there.
-This is how extent swapping guarantees that an outside observer will either see
-the old broken structure or the new one, and never a mismash of both.
+find the most recent unfinished maping exchange log intent item and restart
+from there.
+This is how atomic file mapping exchanges guarantees that an outside observer
+will either see the old broken structure or the new one, and never a mismash of
+both.
-Preparation for Extent Swapping
-```````````````````````````````
+Preparation for File Content Exchanges
+``````````````````````````````````````
 There are a few things that need to be taken care of before initiating an
-atomic extent swap operation.
+atomic file mapping exchange operation.
 First, regular files require the page cache to be flushed to disk before the
 operation begins, and directio writes to be quiesced.
-Like any filesystem operation, extent swapping must determine the maximum
-amount of disk space and quota that can be consumed on behalf of both files in
-the operation, and reserve that quantity of resources to avoid an unrecoverable
-out of space failure once it starts dirtying metadata.
+Like any filesystem operation, file mapping exchanges must determine the
+maximum amount of disk space and quota that can be consumed on behalf of both
+files in the operation, and reserve that quantity of resources to avoid an
+unrecoverable out of space failure once it starts dirtying metadata.
 The preparation step scans the ranges of both files to estimate:
 - Data device blocks needed to handle the repeated updates to the fork
@@ -4201,56 +4209,59 @@ The preparation step scans the ranges of both files to estimate:
 to different extents on the realtime volume, which could happen if the
 operation fails to run to completion.
-The need for precise estimation increases the run time of the swap operation,
-but it is very important to maintain correct accounting.
-The filesystem must not run completely out of free space, nor can the extent
-swap ever add more extent mappings to a fork than it can support.
+The need for precise estimation increases the run time of the exchange
+operation, but it is very important to maintain correct accounting.
+The filesystem must not run completely out of free space, nor can the mapping
+exchange ever add more extent mappings to a fork than it can support.
 Regular users are required to abide the quota limits, though metadata repairs
 may exceed quota to resolve inconsistent metadata elsewhere.
-Special Features for Swapping Metadata File Extents
-```````````````````````````````````````````````````
+Special Features for Exchanging Metadata File Contents
+``````````````````````````````````````````````````````
 Extended attributes, symbolic links, and directories can set the fork format to
 "local" and treat the fork as a literal area for data storage.
 Metadata repairs must take extra steps to support these cases:
 - If both forks are in local format and the fork areas are large enough, the
-swap is performed by copying the incore fork contents, logging both forks,
-and committing.
-The atomic extent swap mechanism is not necessary, since this can be done
-with a single transaction.
+exchange is performed by copying the incore fork contents, logging both
+forks, and committing.
+The atomic file mapping exchange mechanism is not necessary, since this can
+be done with a single transaction.
-- If both forks map blocks, then the regular atomic extent swap is used.
+- If both forks map blocks, then the regular atomic file mapping exchange is
+used.
 - Otherwise, only one fork is in local format.
 The contents of the local format fork are converted to a block to perform the
-swap.
+exchange.
 The conversion to block format must be done in the same transaction that
-logs the initial extent swap intent log item.
+logs the initial mapping exchange intent log item.
-The regular atomic extent swap is used to exchange the mappings.
+The regular atomic mapping exchange is used to exchange the metadata file
Special flags are set on the swap operation so that the transaction can be mappings.
rolled one more time to convert the second file's fork back to local format Special flags are set on the exchange operation so that the transaction can
so that the second file will be ready to go as soon as the ILOCK is dropped. be rolled one more time to convert the second file's fork back to local
format so that the second file will be ready to go as soon as the ILOCK is
dropped.
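The three cases above can be sketched as a small decision function. This is a simplified userspace model, not kernel code: `fork_desc`, `exch_strategy`, and `choose_exchange_strategy` are hypothetical names, and sending two local forks that do not fit each other's fork area down the conversion path is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified fork formats; the kernel has more than two. */
enum fork_format {
	FORK_LOCAL,	/* contents stored inline in the fork area */
	FORK_BLOCKS,	/* fork maps blocks */
};

/* Hypothetical description of one file's fork, for illustration only. */
struct fork_desc {
	enum fork_format	format;
	size_t			bytes_used;	/* inline bytes, if local */
	size_t			area_size;	/* capacity of the fork area */
};

enum exch_strategy {
	EXCH_COPY_LOCAL,		/* copy incore contents, one transaction */
	EXCH_ATOMIC,			/* regular atomic mapping exchange */
	EXCH_CONVERT_THEN_ATOMIC,	/* convert local contents to a block first */
};

static enum exch_strategy
choose_exchange_strategy(
	const struct fork_desc	*f1,
	const struct fork_desc	*f2)
{
	bool			local1 = f1->format == FORK_LOCAL;
	bool			local2 = f2->format == FORK_LOCAL;

	/* Each fork's inline contents must fit in the other fork's area. */
	if (local1 && local2 &&
	    f1->bytes_used <= f2->area_size &&
	    f2->bytes_used <= f1->area_size)
		return EXCH_COPY_LOCAL;

	/* Both forks map blocks: nothing special to do. */
	if (!local1 && !local2)
		return EXCH_ATOMIC;

	/* Assumption: every remaining combination takes the convert path. */
	return EXCH_CONVERT_THEN_ATOMIC;
}
```

In this model only the last strategy needs the extra transaction roll that converts the second file's fork back to local format.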
Extended attributes and directories stamp the owning inode into every block, Extended attributes and directories stamp the owning inode into every block,
but the buffer verifiers do not actually check the inode number! but the buffer verifiers do not actually check the inode number!
Although there is no verification, it is still important to maintain Although there is no verification, it is still important to maintain
referential integrity, so prior to performing the extent swap, online repair referential integrity, so prior to performing the mapping exchange, online
builds every block in the new data structure with the owner field of the file repair builds every block in the new data structure with the owner field of the
being repaired. file being repaired.
After a successful swap operation, the repair operation must reap the old fork After a successful exchange operation, the repair operation must reap the old
blocks by processing each fork mapping through the standard :ref:`file extent fork blocks by processing each fork mapping through the standard :ref:`file
reaping <reaping>` mechanism that is done post-repair. extent reaping <reaping>` mechanism that is done post-repair.
If the filesystem should go down during the reap part of the repair, the If the filesystem should go down during the reap part of the repair, the
iunlink processing at the end of recovery will free both the temporary file and iunlink processing at the end of recovery will free both the temporary file and
whatever blocks were not reaped. whatever blocks were not reaped.
However, this iunlink processing omits the cross-link detection of online However, this iunlink processing omits the cross-link detection of online
repair, and is not completely foolproof. repair, and is not completely foolproof.
Swapping Temporary File Extents Exchanging Temporary File Contents
``````````````````````````````` ``````````````````````````````````
To repair a metadata file, online repair proceeds as follows: To repair a metadata file, online repair proceeds as follows:
...@@ -4260,14 +4271,14 @@ To repair a metadata file, online repair proceeds as follows: ...@@ -4260,14 +4271,14 @@ To repair a metadata file, online repair proceeds as follows:
file. file.
The same fork must be written to as is being repaired. The same fork must be written to as is being repaired.
3. Commit the scrub transaction, since the swap estimation step must be 3. Commit the scrub transaction, since the exchange resource estimation step
completed before transaction reservations are made. must be completed before transaction reservations are made.
4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with 4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction with
the appropriate resource reservations, locks, and fill out a ``struct the appropriate resource reservations, locks, and fill out a ``struct
xfs_swapext_req`` with the details of the swap operation. xfs_exchmaps_req`` with the details of the exchange operation.
5. Call ``xrep_tempswap_contents`` to swap the contents. 5. Call ``xrep_tempexch_contents`` to exchange the contents.
6. Commit the transaction to complete the repair. 6. Commit the transaction to complete the repair.
...@@ -4309,7 +4320,7 @@ To check the summary file against the bitmap: ...@@ -4309,7 +4320,7 @@ To check the summary file against the bitmap:
3. Compare the contents of the xfile against the ondisk file. 3. Compare the contents of the xfile against the ondisk file.
To repair the summary file, write the xfile contents into the temporary file To repair the summary file, write the xfile contents into the temporary file
and use atomic extent swap to commit the new contents. and use atomic mapping exchange to commit the new contents.
The temporary file is then reaped. The temporary file is then reaped.
The proposed patchset is the The proposed patchset is the
...@@ -4352,8 +4363,8 @@ Salvaging extended attributes is done as follows: ...@@ -4352,8 +4363,8 @@ Salvaging extended attributes is done as follows:
memory or there are no more attr fork blocks to examine, unlock the file and memory or there are no more attr fork blocks to examine, unlock the file and
add the staged extended attributes to the temporary file. add the staged extended attributes to the temporary file.
3. Use atomic extent swapping to exchange the new and old extended attribute 3. Use atomic file mapping exchange to exchange the new and old extended
structures. attribute structures.
The old attribute blocks are now attached to the temporary file. The old attribute blocks are now attached to the temporary file.
4. Reap the temporary file. 4. Reap the temporary file.
...@@ -4410,7 +4421,8 @@ salvaging directories is straightforward: ...@@ -4410,7 +4421,8 @@ salvaging directories is straightforward:
directory and add the staged dirents into the temporary directory. directory and add the staged dirents into the temporary directory.
Truncate the staging files. Truncate the staging files.
4. Use atomic extent swapping to exchange the new and old directory structures. 4. Use atomic file mapping exchange to exchange the new and old directory
structures.
The old directory blocks are now attached to the temporary file. The old directory blocks are now attached to the temporary file.
5. Reap the temporary file. 5. Reap the temporary file.
...@@ -4542,7 +4554,7 @@ a :ref:`directory entry live update hook <liveupdate>` as follows: ...@@ -4542,7 +4554,7 @@ a :ref:`directory entry live update hook <liveupdate>` as follows:
Instead, we stash updates in the xfarray and rely on the scanner thread Instead, we stash updates in the xfarray and rely on the scanner thread
to apply the stashed updates to the temporary directory. to apply the stashed updates to the temporary directory.
5. When the scan is complete, atomically swap the contents of the temporary 5. When the scan is complete, atomically exchange the contents of the temporary
directory and the directory being repaired. directory and the directory being repaired.
The temporary directory now contains the damaged directory structure. The temporary directory now contains the damaged directory structure.
...@@ -4629,8 +4641,8 @@ directory reconstruction: ...@@ -4629,8 +4641,8 @@ directory reconstruction:
5. Copy all non-parent pointer extended attributes to the temporary file. 5. Copy all non-parent pointer extended attributes to the temporary file.
6. When the scan is complete, atomically swap the attribute fork of the 6. When the scan is complete, atomically exchange the mappings of the attribute
temporary file and the file being repaired. forks of the temporary file and the file being repaired.
The temporary file now contains the damaged extended attribute structure. The temporary file now contains the damaged extended attribute structure.
7. Reap the temporary file. 7. Reap the temporary file.
...@@ -5105,18 +5117,18 @@ make it easier for code readers to understand what has been built, for whom it ...@@ -5105,18 +5117,18 @@ make it easier for code readers to understand what has been built, for whom it
has been built, and why. has been built, and why.
Please feel free to contact the XFS mailing list with questions. Please feel free to contact the XFS mailing list with questions.
FIEXCHANGE_RANGE XFS_IOC_EXCHANGE_RANGE
---------------- ----------------------
As discussed earlier, a second frontend to the atomic extent swap mechanism is As discussed earlier, a second frontend to the atomic file mapping exchange
a new ioctl call that userspace programs can use to commit updates to files mechanism is a new ioctl call that userspace programs can use to commit updates
atomically. to files atomically.
This frontend has been out for review for several years now, though the This frontend has been out for review for several years now, though the
necessary refinements to online repair and lack of customer demand mean that necessary refinements to online repair and lack of customer demand mean that
the proposal has not been pushed very hard. the proposal has not been pushed very hard.
Extent Swapping with Regular User Files File Content Exchanges with Regular User Files
``````````````````````````````````````` ``````````````````````````````````````````````
As mentioned earlier, XFS has long had the ability to swap extents between As mentioned earlier, XFS has long had the ability to swap extents between
files, which is used almost exclusively by ``xfs_fsr`` to defragment files. files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
...@@ -5131,12 +5143,12 @@ the consistency of the fork mappings with the reverse mapping index was to ...@@ -5131,12 +5143,12 @@ the consistency of the fork mappings with the reverse mapping index was to
develop an iterative mechanism that used deferred bmap and rmap operations to develop an iterative mechanism that used deferred bmap and rmap operations to
swap mappings one at a time. swap mappings one at a time.
This mechanism is identical to steps 2-3 from the procedure above except for This mechanism is identical to steps 2-3 from the procedure above except for
the new tracking items, because the atomic extent swap mechanism is an the new tracking items, because the atomic file mapping exchange mechanism is
iteration of an existing mechanism and not something totally novel. an iteration of an existing mechanism and not something totally novel.
For the narrow case of file defragmentation, the file contents must be For the narrow case of file defragmentation, the file contents must be
identical, so the recovery guarantees are not much of a gain. identical, so the recovery guarantees are not much of a gain.
Atomic extent swapping is much more flexible than the existing swapext Atomic file content exchanges are much more flexible than the existing swapext
implementations because it can guarantee that the caller never sees a mix of implementations because it can guarantee that the caller never sees a mix of
old and new contents even after a crash, and it can operate on two arbitrary old and new contents even after a crash, and it can operate on two arbitrary
file fork ranges. file fork ranges.
...@@ -5147,11 +5159,11 @@ The extra flexibility enables several new use cases: ...@@ -5147,11 +5159,11 @@ The extra flexibility enables several new use cases:
Next, it opens a temporary file and calls the file clone operation to reflink Next, it opens a temporary file and calls the file clone operation to reflink
the first file's contents into the temporary file. the first file's contents into the temporary file.
Writes to the original file should instead be written to the temporary file. Writes to the original file should instead be written to the temporary file.
Finally, the process calls the atomic extent swap system call Finally, the process calls the atomic file mapping exchange system call
(``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all (``XFS_IOC_EXCHANGE_RANGE``) to exchange the file contents, thereby
of the updates to the original file, or none of them. committing all of the updates to the original file, or none of them.
.. _swapext_if_unchanged: .. _exchrange_if_unchanged:
- **Transactional file updates**: The same mechanism as above, but the caller - **Transactional file updates**: The same mechanism as above, but the caller
only wants the commit to occur if the original file's contents have not only wants the commit to occur if the original file's contents have not
...@@ -5160,16 +5172,17 @@ The extra flexibility enables several new use cases: ...@@ -5160,16 +5172,17 @@ The extra flexibility enables several new use cases:
change timestamps of the original file before reflinking its data to the change timestamps of the original file before reflinking its data to the
temporary file. temporary file.
When the program is ready to commit the changes, it passes the timestamps When the program is ready to commit the changes, it passes the timestamps
into the kernel as arguments to the atomic extent swap system call. into the kernel as arguments to the atomic file mapping exchange system call.
The kernel only commits the changes if the provided timestamps match the The kernel only commits the changes if the provided timestamps match the
original file. original file.
A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this.
- **Emulation of atomic block device writes**: Export a block device with a - **Emulation of atomic block device writes**: Export a block device with a
logical sector size matching the filesystem block size to force all writes logical sector size matching the filesystem block size to force all writes
to be aligned to the filesystem block size. to be aligned to the filesystem block size.
Stage all writes to a temporary file, and when that is complete, call the Stage all writes to a temporary file, and when that is complete, call the
atomic extent swap system call with a flag to indicate that holes in the atomic file mapping exchange system call with a flag to indicate that holes
temporary file should be ignored. in the temporary file should be ignored.
This emulates an atomic device write in software, and can support arbitrary This emulates an atomic device write in software, and can support arbitrary
scattered writes. scattered writes.
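The "transactional file updates" case above amounts to a compare-then-commit check on timestamps sampled before staging began. The following is a minimal userspace model of that check only. The types and the function `commit_if_unchanged` are hypothetical and do not reflect the layout of the real ioctl argument, and returning `-EBUSY` on a mismatch is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timestamp; not the kernel's ioctl structure. */
struct model_ts {
	int64_t		sec;
	int32_t		nsec;
};

/* Modification and change timestamps sampled before staging began. */
struct model_freshness {
	struct model_ts	mtime;
	struct model_ts	ctime;
};

/* Toy file: its current timestamps and an integer standing in for data. */
struct model_file {
	struct model_freshness	stamps;
	int			contents;
};

static bool
model_ts_equal(const struct model_ts *a, const struct model_ts *b)
{
	return a->sec == b->sec && a->nsec == b->nsec;
}

/*
 * Commit new_contents only if the file still carries the timestamps the
 * caller sampled; otherwise leave the file untouched and report failure.
 */
static int
commit_if_unchanged(
	struct model_file		*file,
	const struct model_freshness	*sampled,
	int				new_contents)
{
	if (!model_ts_equal(&file->stamps.mtime, &sampled->mtime) ||
	    !model_ts_equal(&file->stamps.ctime, &sampled->ctime))
		return -EBUSY;	/* assumed error code for this sketch */

	file->contents = new_contents;
	return 0;
}
```

A caller that gets the failure would typically resample the timestamps, restage its changes against the newer contents, and try again.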
...@@ -5251,8 +5264,8 @@ of the file to try to share the physical space with a dummy file. ...@@ -5251,8 +5264,8 @@ of the file to try to share the physical space with a dummy file.
Cloning the extent means that the original owners cannot overwrite the Cloning the extent means that the original owners cannot overwrite the
contents; any changes will be written somewhere else via copy-on-write. contents; any changes will be written somewhere else via copy-on-write.
Clearspace makes its own copy of the frozen extent in an area that is not being Clearspace makes its own copy of the frozen extent in an area that is not being
cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic extent swap cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic file content exchanges
<swapext_if_unchanged>` feature) to change the target file's data extent <exchrange_if_unchanged>` feature) to change the target file's data extent
mapping away from the area being cleared. mapping away from the area being cleared.
When all other mappings have been moved, clearspace reflinks the space into the When all other mappings have been moved, clearspace reflinks the space into the
space collector file so that it becomes unavailable. space collector file so that it becomes unavailable.
......
...@@ -1667,6 +1667,7 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count) ...@@ -1667,6 +1667,7 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
return 0; return 0;
} }
EXPORT_SYMBOL_GPL(generic_write_check_limits);
/* Like generic_write_checks(), but takes size of write instead of iter. */ /* Like generic_write_checks(), but takes size of write instead of iter. */
int generic_write_checks_count(struct kiocb *iocb, loff_t *count) int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
......
...@@ -99,8 +99,7 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in, ...@@ -99,8 +99,7 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in,
return 0; return 0;
} }
static int remap_verify_area(struct file *file, loff_t pos, loff_t len, int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write)
bool write)
{ {
int mask = write ? MAY_WRITE : MAY_READ; int mask = write ? MAY_WRITE : MAY_READ;
loff_t tmp; loff_t tmp;
...@@ -118,6 +117,7 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len, ...@@ -118,6 +117,7 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
return fsnotify_file_area_perm(file, mask, &pos, len); return fsnotify_file_area_perm(file, mask, &pos, len);
} }
EXPORT_SYMBOL_GPL(remap_verify_area);
/* /*
* Ensure that we don't remap a partial EOF block in the middle of something * Ensure that we don't remap a partial EOF block in the middle of something
......
...@@ -34,6 +34,7 @@ xfs-y += $(addprefix libxfs/, \ ...@@ -34,6 +34,7 @@ xfs-y += $(addprefix libxfs/, \
xfs_dir2_node.o \ xfs_dir2_node.o \
xfs_dir2_sf.o \ xfs_dir2_sf.o \
xfs_dquot_buf.o \ xfs_dquot_buf.o \
xfs_exchmaps.o \
xfs_ialloc.o \ xfs_ialloc.o \
xfs_ialloc_btree.o \ xfs_ialloc_btree.o \
xfs_iext_tree.o \ xfs_iext_tree.o \
...@@ -67,6 +68,7 @@ xfs-y += xfs_aops.o \ ...@@ -67,6 +68,7 @@ xfs-y += xfs_aops.o \
xfs_dir2_readdir.o \ xfs_dir2_readdir.o \
xfs_discard.o \ xfs_discard.o \
xfs_error.o \ xfs_error.o \
xfs_exchrange.o \
xfs_export.o \ xfs_export.o \
xfs_extent_busy.o \ xfs_extent_busy.o \
xfs_file.o \ xfs_file.o \
...@@ -101,6 +103,7 @@ xfs-y += xfs_log.o \ ...@@ -101,6 +103,7 @@ xfs-y += xfs_log.o \
xfs_buf_item.o \ xfs_buf_item.o \
xfs_buf_item_recover.o \ xfs_buf_item_recover.o \
xfs_dquot_item_recover.o \ xfs_dquot_item_recover.o \
xfs_exchmaps_item.o \
xfs_extfree_item.o \ xfs_extfree_item.o \
xfs_attr_item.o \ xfs_attr_item.o \
xfs_icreate_item.o \ xfs_icreate_item.o \
......
...@@ -27,6 +27,7 @@ ...@@ -27,6 +27,7 @@
#include "xfs_da_btree.h" #include "xfs_da_btree.h"
#include "xfs_attr.h" #include "xfs_attr.h"
#include "xfs_trans_priv.h" #include "xfs_trans_priv.h"
#include "xfs_exchmaps.h"
static struct kmem_cache *xfs_defer_pending_cache; static struct kmem_cache *xfs_defer_pending_cache;
...@@ -1176,6 +1177,10 @@ xfs_defer_init_item_caches(void) ...@@ -1176,6 +1177,10 @@ xfs_defer_init_item_caches(void)
error = xfs_attr_intent_init_cache(); error = xfs_attr_intent_init_cache();
if (error) if (error)
goto err; goto err;
error = xfs_exchmaps_intent_init_cache();
if (error)
goto err;
return 0; return 0;
err: err:
xfs_defer_destroy_item_caches(); xfs_defer_destroy_item_caches();
...@@ -1186,6 +1191,7 @@ xfs_defer_init_item_caches(void) ...@@ -1186,6 +1191,7 @@ xfs_defer_init_item_caches(void)
void void
xfs_defer_destroy_item_caches(void) xfs_defer_destroy_item_caches(void)
{ {
xfs_exchmaps_intent_destroy_cache();
xfs_attr_intent_destroy_cache(); xfs_attr_intent_destroy_cache();
xfs_extfree_intent_destroy_cache(); xfs_extfree_intent_destroy_cache();
xfs_bmap_intent_destroy_cache(); xfs_bmap_intent_destroy_cache();
......
...@@ -72,7 +72,7 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type; ...@@ -72,7 +72,7 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
extern const struct xfs_defer_op_type xfs_extent_free_defer_type; extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
extern const struct xfs_defer_op_type xfs_agfl_free_defer_type; extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
extern const struct xfs_defer_op_type xfs_attr_defer_type; extern const struct xfs_defer_op_type xfs_attr_defer_type;
extern const struct xfs_defer_op_type xfs_exchmaps_defer_type;
/* /*
* Deferred operation item relogging limits. * Deferred operation item relogging limits.
......
...@@ -63,7 +63,8 @@ ...@@ -63,7 +63,8 @@
#define XFS_ERRTAG_ATTR_LEAF_TO_NODE 41 #define XFS_ERRTAG_ATTR_LEAF_TO_NODE 41
#define XFS_ERRTAG_WB_DELAY_MS 42 #define XFS_ERRTAG_WB_DELAY_MS 42
#define XFS_ERRTAG_WRITE_DELAY_MS 43 #define XFS_ERRTAG_WRITE_DELAY_MS 43
#define XFS_ERRTAG_MAX 44 #define XFS_ERRTAG_EXCHMAPS_FINISH_ONE 44
#define XFS_ERRTAG_MAX 45
/* /*
* Random factors for above tags, 1 means always, 2 means 1/2 time, etc. * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
...@@ -111,5 +112,6 @@ ...@@ -111,5 +112,6 @@
#define XFS_RANDOM_ATTR_LEAF_TO_NODE 1 #define XFS_RANDOM_ATTR_LEAF_TO_NODE 1
#define XFS_RANDOM_WB_DELAY_MS 3000 #define XFS_RANDOM_WB_DELAY_MS 3000
#define XFS_RANDOM_WRITE_DELAY_MS 3000 #define XFS_RANDOM_WRITE_DELAY_MS 3000
#define XFS_RANDOM_EXCHMAPS_FINISH_ONE 1
#endif /* __XFS_ERRORTAG_H_ */ #endif /* __XFS_ERRORTAG_H_ */
// SPDX-License-Identifier: GPL-2.0-or-later
/*
* Copyright (c) 2020-2024 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#include "xfs.h"
#include "xfs_fs.h"
#include "xfs_shared.h"
#include "xfs_format.h"
#include "xfs_log_format.h"
#include "xfs_trans_resv.h"
#include "xfs_mount.h"
#include "xfs_defer.h"
#include "xfs_inode.h"
#include "xfs_trans.h"
#include "xfs_bmap.h"
#include "xfs_icache.h"
#include "xfs_quota.h"
#include "xfs_exchmaps.h"
#include "xfs_trace.h"
#include "xfs_bmap_btree.h"
#include "xfs_trans_space.h"
#include "xfs_error.h"
#include "xfs_errortag.h"
#include "xfs_health.h"
#include "xfs_exchmaps_item.h"
#include "xfs_da_format.h"
#include "xfs_da_btree.h"
#include "xfs_attr_leaf.h"
#include "xfs_attr.h"
#include "xfs_dir2_priv.h"
#include "xfs_dir2.h"
#include "xfs_symlink_remote.h"
struct kmem_cache *xfs_exchmaps_intent_cache;
/* bmbt mappings adjacent to a pair of records. */
struct xfs_exchmaps_adjacent {
struct xfs_bmbt_irec left1;
struct xfs_bmbt_irec right1;
struct xfs_bmbt_irec left2;
struct xfs_bmbt_irec right2;
};
#define ADJACENT_INIT { \
.left1 = { .br_startblock = HOLESTARTBLOCK }, \
.right1 = { .br_startblock = HOLESTARTBLOCK }, \
.left2 = { .br_startblock = HOLESTARTBLOCK }, \
.right2 = { .br_startblock = HOLESTARTBLOCK }, \
}
/* Information to reset reflink flag / CoW fork state after an exchange. */
/*
* If the reflink flag is set on either inode, make sure it has an incore CoW
* fork, since all reflink inodes must have them. If there's a CoW fork and it
* has mappings in it, make sure the inodes are tagged appropriately so that
* speculative preallocations can be GC'd if we run low on space.
*/
static inline void
xfs_exchmaps_ensure_cowfork(
struct xfs_inode *ip)
{
struct xfs_ifork *cfork;
if (xfs_is_reflink_inode(ip))
xfs_ifork_init_cow(ip);
cfork = xfs_ifork_ptr(ip, XFS_COW_FORK);
if (!cfork)
return;
if (cfork->if_bytes > 0)
xfs_inode_set_cowblocks_tag(ip);
else
xfs_inode_clear_cowblocks_tag(ip);
}
/*
* Adjust the on-disk inode size upwards if needed so that we never add
* mappings into the file past EOF. This is crucial so that log recovery won't
* get confused by the sudden appearance of post-eof mappings.
*/
STATIC void
xfs_exchmaps_update_size(
struct xfs_trans *tp,
struct xfs_inode *ip,
struct xfs_bmbt_irec *imap,
xfs_fsize_t new_isize)
{
struct xfs_mount *mp = tp->t_mountp;
xfs_fsize_t len;
if (new_isize < 0)
return;
len = min(XFS_FSB_TO_B(mp, imap->br_startoff + imap->br_blockcount),
new_isize);
if (len <= ip->i_disk_size)
return;
trace_xfs_exchmaps_update_inode_size(ip, len);
ip->i_disk_size = len;
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
}
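The size adjustment in xfs_exchmaps_update_size() can be modelled in userspace as a pure function. This is only a sketch: `model_update_size` and `MODEL_BLOCK_SIZE` are hypothetical names, the 4096-byte block size is an assumption, and the inode logging and tracepoint are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_BLOCK_SIZE	4096	/* assumed filesystem block size */

/*
 * Return the on-disk size the inode should have after a mapping covering
 * blocks [startoff, startoff + blockcount) has been added to the file.
 * The size only ever grows, and never past new_isize.
 */
static int64_t
model_update_size(
	int64_t		disk_size,
	uint64_t	startoff,
	uint64_t	blockcount,
	int64_t		new_isize)
{
	int64_t		map_end;
	int64_t		len;

	/* A negative new size means "do not touch the size". */
	if (new_isize < 0)
		return disk_size;

	/* Byte offset of the end of the mapping, clamped to the new size. */
	map_end = (int64_t)((startoff + blockcount) * MODEL_BLOCK_SIZE);
	len = map_end < new_isize ? map_end : new_isize;

	return len > disk_size ? len : disk_size;
}
```

For example, with a 4096-byte file, a mapping that ends at byte 16384, and a final size of 10000 bytes, the model raises the on-disk size to 10000 rather than 16384.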
/* Advance the incore state tracking after exchanging a mapping. */
static inline void
xmi_advance(
struct xfs_exchmaps_intent *xmi,
const struct xfs_bmbt_irec *irec)
{
xmi->xmi_startoff1 += irec->br_blockcount;
xmi->xmi_startoff2 += irec->br_blockcount;
xmi->xmi_blockcount -= irec->br_blockcount;
}
/* Do we still have more mappings to exchange? */
static inline bool
xmi_has_more_exchange_work(const struct xfs_exchmaps_intent *xmi)
{
return xmi->xmi_blockcount > 0;
}
/* Do we have post-operation cleanups to perform? */
static inline bool
xmi_has_postop_work(const struct xfs_exchmaps_intent *xmi)
{
return xmi->xmi_flags & (XFS_EXCHMAPS_CLEAR_INO1_REFLINK |
XFS_EXCHMAPS_CLEAR_INO2_REFLINK |
__XFS_EXCHMAPS_INO2_SHORTFORM);
}
/* Check all mappings to make sure we can actually exchange them. */
int
xfs_exchmaps_check_forks(
struct xfs_mount *mp,
const struct xfs_exchmaps_req *req)
{
struct xfs_ifork *ifp1, *ifp2;
int whichfork = xfs_exchmaps_reqfork(req);
/* No fork? */
ifp1 = xfs_ifork_ptr(req->ip1, whichfork);
ifp2 = xfs_ifork_ptr(req->ip2, whichfork);
if (!ifp1 || !ifp2)
return -EINVAL;
/* We don't know how to exchange local format forks. */
if (ifp1->if_format == XFS_DINODE_FMT_LOCAL ||
ifp2->if_format == XFS_DINODE_FMT_LOCAL)
return -EINVAL;
return 0;
}
#ifdef CONFIG_XFS_QUOTA
/* Log the actual updates to the quota accounting. */
static inline void
xfs_exchmaps_update_quota(
struct xfs_trans *tp,
struct xfs_exchmaps_intent *xmi,
struct xfs_bmbt_irec *irec1,
struct xfs_bmbt_irec *irec2)
{
int64_t ip1_delta = 0, ip2_delta = 0;
unsigned int qflag;
qflag = XFS_IS_REALTIME_INODE(xmi->xmi_ip1) ? XFS_TRANS_DQ_RTBCOUNT :
XFS_TRANS_DQ_BCOUNT;
if (xfs_bmap_is_real_extent(irec1)) {
ip1_delta -= irec1->br_blockcount;
ip2_delta += irec1->br_blockcount;
}
if (xfs_bmap_is_real_extent(irec2)) {
ip1_delta += irec2->br_blockcount;
ip2_delta -= irec2->br_blockcount;
}
xfs_trans_mod_dquot_byino(tp, xmi->xmi_ip1, qflag, ip1_delta);
xfs_trans_mod_dquot_byino(tp, xmi->xmi_ip2, qflag, ip2_delta);
}
#else
# define xfs_exchmaps_update_quota(tp, xmi, irec1, irec2) ((void)0)
#endif
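The quota arithmetic in xfs_exchmaps_update_quota() can be illustrated with a small userspace model. The names `model_mapping`, `quota_delta`, and `model_quota_delta` are hypothetical, and the sketch only computes the block count deltas; it does not model dquots or the realtime versus data device distinction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mapping: is_real is false for holes and the like. */
struct model_mapping {
	bool		is_real;
	uint64_t	blockcount;
};

/* Change in block usage charged to each file's owner. */
struct quota_delta {
	int64_t		ip1;
	int64_t		ip2;
};

/*
 * Real blocks leaving file 1 are credited to it and charged to file 2,
 * and vice versa.  Mappings that are not real extents cost nothing.
 */
static struct quota_delta
model_quota_delta(
	const struct model_mapping	*m1,
	const struct model_mapping	*m2)
{
	struct quota_delta		d = { 0, 0 };

	if (m1->is_real) {
		d.ip1 -= (int64_t)m1->blockcount;
		d.ip2 += (int64_t)m1->blockcount;
	}
	if (m2->is_real) {
		d.ip1 += (int64_t)m2->blockcount;
		d.ip2 -= (int64_t)m2->blockcount;
	}
	return d;
}
```

When both mappings are real and of equal length the deltas cancel, so usage only shifts between owners when a real extent is exchanged with a hole.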
/* Decide if we want to skip this mapping from file1. */
static inline bool
xfs_exchmaps_can_skip_mapping(
struct xfs_exchmaps_intent *xmi,
struct xfs_bmbt_irec *irec)
{
struct xfs_mount *mp = xmi->xmi_ip1->i_mount;
/* Do not skip this mapping if the caller did not tell us to. */
if (!(xmi->xmi_flags & XFS_EXCHMAPS_INO1_WRITTEN))
return false;
/* Do not skip mapped, written mappings. */
if (xfs_bmap_is_written_extent(irec))
return false;
/*
* The mapping is unwritten or a hole. It cannot be a delalloc
* reservation because we already excluded those. It cannot be an
* unwritten extent with dirty page cache because we flushed the page
* cache. For files where the allocation unit is 1FSB (files on the
* data dev, rt files if the extent size is 1FSB), we can safely
* skip this mapping.
*/
if (!xfs_inode_has_bigrtalloc(xmi->xmi_ip1))
return true;
/*
* For a realtime file with a multi-fsb allocation unit, the decision
* is trickier because we can only swap full allocation units.
* Unwritten mappings can appear in the middle of an rtx if the rtx is
* partially written, but they can also appear for preallocations.
*
* If the mapping is a hole, skip it entirely. Holes should align with
* rtx boundaries.
*/
if (!xfs_bmap_is_real_extent(irec))
return true;
/*
* All mappings below this point are unwritten.
*
* - If the beginning is not aligned to an rtx, trim the end of the
* mapping so that it does not cross an rtx boundary, and swap it.
*
* - If both ends are aligned to an rtx, skip the entire mapping.
*/
if (!isaligned_64(irec->br_startoff, mp->m_sb.sb_rextsize)) {
xfs_fileoff_t new_end;
new_end = roundup_64(irec->br_startoff, mp->m_sb.sb_rextsize);
irec->br_blockcount = min(irec->br_blockcount,
new_end - irec->br_startoff);
return false;
}
if (isaligned_64(irec->br_blockcount, mp->m_sb.sb_rextsize))
return true;
/*
* All mappings below this point are unwritten, start on an rtx
* boundary, and do not end on an rtx boundary.
*
* - If the mapping is longer than one rtx, trim the end of the mapping
* down to an rtx boundary and skip it.
*
* - The mapping is shorter than one rtx. Swap it.
*/
if (irec->br_blockcount > mp->m_sb.sb_rextsize) {
xfs_fileoff_t new_end;
new_end = rounddown_64(irec->br_startoff + irec->br_blockcount,
mp->m_sb.sb_rextsize);
irec->br_blockcount = new_end - irec->br_startoff;
return true;
}
return false;
}
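The skip and trim rules in xfs_exchmaps_can_skip_mapping() are easier to follow with concrete numbers, so here is a userspace model of the same decision tree. It is a sketch under stated assumptions: `model_map`, `map_state`, and `model_can_skip` are hypothetical names, and "realtime file with a multi-block allocation unit" is approximated as `rextsize > 1`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum map_state { MAP_HOLE, MAP_UNWRITTEN, MAP_WRITTEN };

/* Hypothetical mapping, in units of filesystem blocks. */
struct model_map {
	uint64_t	startoff;
	uint64_t	blockcount;
	enum map_state	state;
};

/*
 * Return true if the mapping should be skipped.  The mapping may be
 * trimmed so that the caller advances by whole allocation units.
 */
static bool
model_can_skip(struct model_map *m, bool skip_flag, uint64_t rextsize)
{
	uint64_t	new_end;

	if (!skip_flag || m->state == MAP_WRITTEN)
		return false;
	/* Single-block allocation unit: unwritten and holes are skipped. */
	if (rextsize <= 1)
		return true;
	if (m->state == MAP_HOLE)
		return true;

	/* Unwritten, unaligned start: trim to the next boundary, exchange. */
	if (m->startoff % rextsize != 0) {
		new_end = ((m->startoff + rextsize - 1) / rextsize) * rextsize;
		if (m->blockcount > new_end - m->startoff)
			m->blockcount = new_end - m->startoff;
		return false;
	}
	/* Unwritten and aligned at both ends: skip all of it. */
	if (m->blockcount % rextsize == 0)
		return true;
	/* Aligned start, longer than one unit: skip the aligned part. */
	if (m->blockcount > rextsize) {
		new_end = ((m->startoff + m->blockcount) / rextsize) * rextsize;
		m->blockcount = new_end - m->startoff;
		return true;
	}
	/* Shorter than one allocation unit: exchange it. */
	return false;
}
```

With a four-block allocation unit, an unwritten mapping of 10 blocks starting at block 8 is trimmed to 8 blocks and skipped, which leaves the 2-block tail for the next pass to exchange.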
/*
* Walk forward through the file ranges in @xmi until we find two different
* mappings to exchange. If there is work to do, return the mappings;
* otherwise we've reached the end of the range and xmi_blockcount will be
* zero.
*
* If the walk skips over a pair of mappings to the same storage, save them as
* the left records in @adj (if provided) so that the simulation phase can
* avoid an extra lookup.
*/
static int
xfs_exchmaps_find_mappings(
struct xfs_exchmaps_intent *xmi,
struct xfs_bmbt_irec *irec1,
struct xfs_bmbt_irec *irec2,
struct xfs_exchmaps_adjacent *adj)
{
int nimaps;
int bmap_flags;
int error;
bmap_flags = xfs_bmapi_aflag(xfs_exchmaps_whichfork(xmi));
for (; xmi_has_more_exchange_work(xmi); xmi_advance(xmi, irec1)) {
/* Read mapping from the first file */
nimaps = 1;
error = xfs_bmapi_read(xmi->xmi_ip1, xmi->xmi_startoff1,
xmi->xmi_blockcount, irec1, &nimaps,
bmap_flags);
if (error)
return error;
if (nimaps != 1 ||
irec1->br_startblock == DELAYSTARTBLOCK ||
irec1->br_startoff != xmi->xmi_startoff1) {
/*
* We should never get no mapping or a delalloc mapping
* or something that doesn't match what we asked for,
* since the caller flushed both inodes and we hold the
* ILOCKs for both inodes.
*/
ASSERT(0);
return -EINVAL;
}
if (xfs_exchmaps_can_skip_mapping(xmi, irec1)) {
trace_xfs_exchmaps_mapping1_skip(xmi->xmi_ip1, irec1);
continue;
}
/* Read mapping from the second file */
nimaps = 1;
error = xfs_bmapi_read(xmi->xmi_ip2, xmi->xmi_startoff2,
irec1->br_blockcount, irec2, &nimaps,
bmap_flags);
if (error)
return error;
if (nimaps != 1 ||
irec2->br_startblock == DELAYSTARTBLOCK ||
irec2->br_startoff != xmi->xmi_startoff2) {
/*
* We should never get no mapping or a delalloc mapping
* or something that doesn't match what we asked for,
* since the caller flushed both inodes and we hold the
* ILOCKs for both inodes.
*/
ASSERT(0);
return -EINVAL;
}
/*
* We can only exchange as many blocks as the smaller of the
* two mappings.
*/
irec1->br_blockcount = min(irec1->br_blockcount,
irec2->br_blockcount);
trace_xfs_exchmaps_mapping1(xmi->xmi_ip1, irec1);
trace_xfs_exchmaps_mapping2(xmi->xmi_ip2, irec2);
/* We found something to exchange, so return it. */
if (irec1->br_startblock != irec2->br_startblock)
return 0;
/*
* Two mappings pointing to the same physical block must not
* have different states; that's filesystem corruption. Move
* on to the next mapping if they're both holes or both point
* to the same physical space extent.
*/
if (irec1->br_state != irec2->br_state) {
xfs_bmap_mark_sick(xmi->xmi_ip1,
xfs_exchmaps_whichfork(xmi));
xfs_bmap_mark_sick(xmi->xmi_ip2,
xfs_exchmaps_whichfork(xmi));
return -EFSCORRUPTED;
}
/*
* Save the mappings if we're estimating work and skipping
* these identical mappings.
*/
if (adj) {
memcpy(&adj->left1, irec1, sizeof(*irec1));
memcpy(&adj->left2, irec2, sizeof(*irec2));
}
}
return 0;
}
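The per-iteration decision in xfs_exchmaps_find_mappings can be modeled in user space. This is a minimal sketch, not the kernel code: `struct mapping`, `enum verdict`, and `classify_pair` are hypothetical names, and the state field is reduced to a plain integer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xfs_bmbt_irec; illustration only. */
struct mapping {
	uint64_t	startoff;	/* logical file offset, in blocks */
	uint64_t	startblock;	/* physical block */
	uint64_t	blockcount;	/* length, in blocks */
	int		state;		/* written vs. unwritten */
};

enum verdict { VERDICT_EXCHANGE, VERDICT_SKIP, VERDICT_CORRUPT };

/*
 * Clamp m1 to the shorter of the two mappings, then decide what to do with
 * the pair: exchange it, skip it because both sides already point at the
 * same physical space, or report corruption if the states disagree.
 */
static enum verdict
classify_pair(
	struct mapping		*m1,
	const struct mapping	*m2)
{
	if (m2->blockcount < m1->blockcount)
		m1->blockcount = m2->blockcount;

	if (m1->startblock != m2->startblock)
		return VERDICT_EXCHANGE;
	if (m1->state != m2->state)
		return VERDICT_CORRUPT;
	return VERDICT_SKIP;
}
```

In this model a skipped pair corresponds to the loop advancing the cursor, and an exchange verdict corresponds to the early `return 0`.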
/* Exchange these two mappings. */
static void
xfs_exchmaps_one_step(
struct xfs_trans *tp,
struct xfs_exchmaps_intent *xmi,
struct xfs_bmbt_irec *irec1,
struct xfs_bmbt_irec *irec2)
{
int whichfork = xfs_exchmaps_whichfork(xmi);
xfs_exchmaps_update_quota(tp, xmi, irec1, irec2);
/* Remove both mappings. */
xfs_bmap_unmap_extent(tp, xmi->xmi_ip1, whichfork, irec1);
xfs_bmap_unmap_extent(tp, xmi->xmi_ip2, whichfork, irec2);
/*
* Re-add both mappings. We exchange the file offsets between the two
* maps and add the opposite map, which has the effect of filling the
* logical offsets we just unmapped, but with the physical mapping
* information exchanged.
*/
swap(irec1->br_startoff, irec2->br_startoff);
xfs_bmap_map_extent(tp, xmi->xmi_ip1, whichfork, irec2);
xfs_bmap_map_extent(tp, xmi->xmi_ip2, whichfork, irec1);
/* Make sure we're not adding mappings past EOF. */
if (whichfork == XFS_DATA_FORK) {
xfs_exchmaps_update_size(tp, xmi->xmi_ip1, irec2,
xmi->xmi_isize1);
xfs_exchmaps_update_size(tp, xmi->xmi_ip2, irec1,
xmi->xmi_isize2);
}
/*
* Advance our cursor and exit. The caller (either defer ops or log
* recovery) will log the XMD item, and if *blockcount is nonzero, it
* will log a new XMI item for the remainder and call us back.
*/
xmi_advance(xmi, irec1);
}
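The swap-then-remap trick in xfs_exchmaps_one_step can be shown with plain structs. This is a sketch under stated assumptions: `struct mapping` and `exchange_step` are hypothetical, and the two output parameters stand in for the xfs_bmap_map_extent calls.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xfs_bmbt_irec; illustration only. */
struct mapping {
	uint64_t	startoff;	/* logical file offset, in blocks */
	uint64_t	startblock;	/* physical block */
	uint64_t	blockcount;	/* length, in blocks */
};

/*
 * Model of one exchange step: swap only the logical offsets, then give each
 * file the other file's record. file1_new and file2_new receive what would
 * be mapped into each file afterwards.
 */
static void
exchange_step(
	struct mapping	*irec1,
	struct mapping	*irec2,
	struct mapping	*file1_new,
	struct mapping	*file2_new)
{
	uint64_t	tmp = irec1->startoff;

	irec1->startoff = irec2->startoff;
	irec2->startoff = tmp;

	*file1_new = *irec2;	/* file1's offset, file2's physical blocks */
	*file2_new = *irec1;	/* file2's offset, file1's physical blocks */
}
```

Each file keeps its own logical offset and receives the other file's physical blocks, which is why only `br_startoff` is swapped.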
/* Convert inode2's leaf attr fork back to shortform, if possible. */
STATIC int
xfs_exchmaps_attr_to_sf(
struct xfs_trans *tp,
struct xfs_exchmaps_intent *xmi)
{
struct xfs_da_args args = {
.dp = xmi->xmi_ip2,
.geo = tp->t_mountp->m_attr_geo,
.whichfork = XFS_ATTR_FORK,
.trans = tp,
};
struct xfs_buf *bp;
int forkoff;
int error;
if (!xfs_attr_is_leaf(xmi->xmi_ip2))
return 0;
error = xfs_attr3_leaf_read(tp, xmi->xmi_ip2, 0, &bp);
if (error)
return error;
forkoff = xfs_attr_shortform_allfit(bp, xmi->xmi_ip2);
if (forkoff == 0)
return 0;
return xfs_attr3_leaf_to_shortform(bp, &args, forkoff);
}
/* Convert inode2's block dir fork back to shortform, if possible. */
STATIC int
xfs_exchmaps_dir_to_sf(
struct xfs_trans *tp,
struct xfs_exchmaps_intent *xmi)
{
struct xfs_da_args args = {
.dp = xmi->xmi_ip2,
.geo = tp->t_mountp->m_dir_geo,
.whichfork = XFS_DATA_FORK,
.trans = tp,
};
struct xfs_dir2_sf_hdr sfh;
struct xfs_buf *bp;
bool isblock;
int size;
int error;
error = xfs_dir2_isblock(&args, &isblock);
if (error)
return error;
if (!isblock)
return 0;
error = xfs_dir3_block_read(tp, xmi->xmi_ip2, &bp);
if (error)
return error;
size = xfs_dir2_block_sfsize(xmi->xmi_ip2, bp->b_addr, &sfh);
if (size > xfs_inode_data_fork_size(xmi->xmi_ip2))
return 0;
return xfs_dir2_block_to_sf(&args, bp, size, &sfh);
}
/* Convert inode2's remote symlink target back to shortform, if possible. */
STATIC int
xfs_exchmaps_link_to_sf(
struct xfs_trans *tp,
struct xfs_exchmaps_intent *xmi)
{
struct xfs_inode *ip = xmi->xmi_ip2;
struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK);
char *buf;
int error;
if (ifp->if_format == XFS_DINODE_FMT_LOCAL ||
ip->i_disk_size > xfs_inode_data_fork_size(ip))
return 0;
/* Read the current symlink target into a buffer. */
buf = kmalloc(ip->i_disk_size + 1,
GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOFAIL);
if (!buf) {
ASSERT(0);
return -ENOMEM;
}
error = xfs_symlink_remote_read(ip, buf);
if (error)
goto free;
/* Remove the blocks. */
error = xfs_symlink_remote_truncate(tp, ip);
if (error)
goto free;
/* Convert fork to local format and log our changes. */
xfs_idestroy_fork(ifp);
ifp->if_bytes = 0;
ifp->if_format = XFS_DINODE_FMT_LOCAL;
xfs_init_local_fork(ip, XFS_DATA_FORK, buf, ip->i_disk_size);
xfs_trans_log_inode(tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
free:
kfree(buf);
return error;
}
/* Clear the reflink flag after an exchange. */
static inline void
xfs_exchmaps_clear_reflink(
struct xfs_trans *tp,
struct xfs_inode *ip)
{
trace_xfs_reflink_unset_inode_flag(ip);
ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
}
/* Finish whatever work might come after an exchange operation. */
static int
xfs_exchmaps_do_postop_work(
struct xfs_trans *tp,
struct xfs_exchmaps_intent *xmi)
{
if (xmi->xmi_flags & __XFS_EXCHMAPS_INO2_SHORTFORM) {
int error = 0;
if (xmi->xmi_flags & XFS_EXCHMAPS_ATTR_FORK)
error = xfs_exchmaps_attr_to_sf(tp, xmi);
else if (S_ISDIR(VFS_I(xmi->xmi_ip2)->i_mode))
error = xfs_exchmaps_dir_to_sf(tp, xmi);
else if (S_ISLNK(VFS_I(xmi->xmi_ip2)->i_mode))
error = xfs_exchmaps_link_to_sf(tp, xmi);
xmi->xmi_flags &= ~__XFS_EXCHMAPS_INO2_SHORTFORM;
if (error)
return error;
}
if (xmi->xmi_flags & XFS_EXCHMAPS_CLEAR_INO1_REFLINK) {
xfs_exchmaps_clear_reflink(tp, xmi->xmi_ip1);
xmi->xmi_flags &= ~XFS_EXCHMAPS_CLEAR_INO1_REFLINK;
}
if (xmi->xmi_flags & XFS_EXCHMAPS_CLEAR_INO2_REFLINK) {
xfs_exchmaps_clear_reflink(tp, xmi->xmi_ip2);
xmi->xmi_flags &= ~XFS_EXCHMAPS_CLEAR_INO2_REFLINK;
}
return 0;
}
/* Finish one step in a mapping exchange operation, possibly relogging. */
int
xfs_exchmaps_finish_one(
struct xfs_trans *tp,
struct xfs_exchmaps_intent *xmi)
{
struct xfs_bmbt_irec irec1, irec2;
int error;
if (xmi_has_more_exchange_work(xmi)) {
/*
* If the operation state says that some range of the files
* has not yet been exchanged, look for mappings in that range
* to exchange. If we find some mappings, exchange them.
*/
error = xfs_exchmaps_find_mappings(xmi, &irec1, &irec2, NULL);
if (error)
return error;
if (xmi_has_more_exchange_work(xmi))
xfs_exchmaps_one_step(tp, xmi, &irec1, &irec2);
/*
* If the caller asked us to exchange the file sizes after the
* exchange and either we just exchanged the last mappings in
* the range or we didn't find anything to exchange, update the
* ondisk file sizes.
*/
if ((xmi->xmi_flags & XFS_EXCHMAPS_SET_SIZES) &&
!xmi_has_more_exchange_work(xmi)) {
xmi->xmi_ip1->i_disk_size = xmi->xmi_isize1;
xmi->xmi_ip2->i_disk_size = xmi->xmi_isize2;
xfs_trans_log_inode(tp, xmi->xmi_ip1, XFS_ILOG_CORE);
xfs_trans_log_inode(tp, xmi->xmi_ip2, XFS_ILOG_CORE);
}
} else if (xmi_has_postop_work(xmi)) {
/*
* Now that we're finished with the exchange operation,
* complete the post-op cleanup work.
*/
error = xfs_exchmaps_do_postop_work(tp, xmi);
if (error)
return error;
}
if (XFS_TEST_ERROR(false, tp->t_mountp, XFS_ERRTAG_EXCHMAPS_FINISH_ONE))
return -EIO;
/* If we still have work to do, ask for a new transaction. */
if (xmi_has_more_exchange_work(xmi) || xmi_has_postop_work(xmi)) {
trace_xfs_exchmaps_defer(tp->t_mountp, xmi);
return -EAGAIN;
}
/*
* If we reach here, we've finished all the exchange work and the post
* operation work. The last thing we need to do before returning to
* the caller is to make sure that COW forks are set up correctly.
*/
if (!(xmi->xmi_flags & XFS_EXCHMAPS_ATTR_FORK)) {
xfs_exchmaps_ensure_cowfork(xmi->xmi_ip1);
xfs_exchmaps_ensure_cowfork(xmi->xmi_ip2);
}
return 0;
}
/*
* Compute the amount of bmbt blocks we should reserve for each file. In the
* worst case, each exchange will fill a hole with a new mapping, which could
* result in a btree split every time we add a new leaf block.
*/
static inline uint64_t
xfs_exchmaps_bmbt_blocks(
struct xfs_mount *mp,
const struct xfs_exchmaps_req *req)
{
return howmany_64(req->nr_exchanges,
XFS_MAX_CONTIG_BMAPS_PER_BLOCK(mp)) *
XFS_EXTENTADD_SPACE_RES(mp, xfs_exchmaps_reqfork(req));
}
/* Compute the space we should reserve for the rmap btree expansions. */
static inline uint64_t
xfs_exchmaps_rmapbt_blocks(
struct xfs_mount *mp,
const struct xfs_exchmaps_req *req)
{
if (!xfs_has_rmapbt(mp))
return 0;
if (XFS_IS_REALTIME_INODE(req->ip1))
return 0;
return howmany_64(req->nr_exchanges,
XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp)) *
XFS_RMAPADD_SPACE_RES(mp);
}
/* Estimate the bmbt and rmapbt overhead required to exchange mappings. */
static int
xfs_exchmaps_estimate_overhead(
struct xfs_exchmaps_req *req)
{
struct xfs_mount *mp = req->ip1->i_mount;
xfs_filblks_t bmbt_blocks;
xfs_filblks_t rmapbt_blocks;
xfs_filblks_t resblks = req->resblks;
/*
* Compute the number of bmbt and rmapbt blocks we might need to handle
* the estimated number of exchanges.
*/
bmbt_blocks = xfs_exchmaps_bmbt_blocks(mp, req);
rmapbt_blocks = xfs_exchmaps_rmapbt_blocks(mp, req);
trace_xfs_exchmaps_overhead(mp, bmbt_blocks, rmapbt_blocks);
/* Make sure the change in file block count doesn't overflow. */
if (check_add_overflow(req->ip1_bcount, bmbt_blocks, &req->ip1_bcount))
return -EFBIG;
if (check_add_overflow(req->ip2_bcount, bmbt_blocks, &req->ip2_bcount))
return -EFBIG;
/*
* Add together the number of blocks we need to handle btree growth,
* then add it to the number of blocks we need to reserve for this
* transaction.
*/
if (check_add_overflow(resblks, bmbt_blocks, &resblks))
return -ENOSPC;
if (check_add_overflow(resblks, bmbt_blocks, &resblks))
return -ENOSPC;
if (check_add_overflow(resblks, rmapbt_blocks, &resblks))
return -ENOSPC;
if (check_add_overflow(resblks, rmapbt_blocks, &resblks))
return -ENOSPC;
/* Can't actually reserve more than UINT_MAX blocks. */
if (resblks > UINT_MAX)
return -ENOSPC;
req->resblks = resblks;
trace_xfs_exchmaps_final_estimate(req);
return 0;
}
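The overflow-checked accumulation above can be sketched in user space. This is an illustration, not the kernel implementation: `howmany` and `estimate_resblks` are hypothetical names, and `__builtin_add_overflow` (a GCC/Clang builtin) stands in for the kernel's check_add_overflow. The sketch compares the accumulated total against UINT_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Round-up division, in the spirit of the kernel's howmany_64(). */
static uint64_t
howmany(uint64_t x, uint64_t y)
{
	return x / y + (x % y != 0);
}

/*
 * Add the bmbt and rmapbt overhead to a base reservation, once for each of
 * the two files. Returns false if the sum overflows 64 bits or exceeds
 * UINT_MAX; otherwise stores the total in *out.
 */
static bool
estimate_resblks(
	uint64_t	base,
	uint64_t	bmbt_blocks,
	uint64_t	rmapbt_blocks,
	uint64_t	*out)
{
	const uint64_t	terms[4] = {
		bmbt_blocks, bmbt_blocks, rmapbt_blocks, rmapbt_blocks,
	};
	uint64_t	res = base;
	int		i;

	for (i = 0; i < 4; i++) {
		if (__builtin_add_overflow(res, terms[i], &res))
			return false;
	}
	if (res > UINT_MAX)
		return false;

	*out = res;
	return true;
}
```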
/* Decide if we can merge two real mappings. */
static inline bool
xmi_can_merge(
const struct xfs_bmbt_irec *b1,
const struct xfs_bmbt_irec *b2)
{
/* Don't merge holes. */
if (b1->br_startblock == HOLESTARTBLOCK ||
b2->br_startblock == HOLESTARTBLOCK)
return false;
/* Don't merge anything that isn't a real extent (e.g. delalloc). */
if (!xfs_bmap_is_real_extent(b1) || !xfs_bmap_is_real_extent(b2))
return false;
if (b1->br_startoff + b1->br_blockcount == b2->br_startoff &&
b1->br_startblock + b1->br_blockcount == b2->br_startblock &&
b1->br_state == b2->br_state &&
b1->br_blockcount + b2->br_blockcount <= XFS_MAX_BMBT_EXTLEN)
return true;
return false;
}
/*
* Decide if we can merge three mappings. Caller must ensure that none of
* the three mappings are holes or delalloc reservations.
*/
static inline bool
xmi_can_merge_all(
const struct xfs_bmbt_irec *l,
const struct xfs_bmbt_irec *m,
const struct xfs_bmbt_irec *r)
{
xfs_filblks_t new_len;
new_len = l->br_blockcount + m->br_blockcount + r->br_blockcount;
return new_len <= XFS_MAX_BMBT_EXTLEN;
}
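The merge test can be tried out in isolation. A minimal sketch, assuming a simplified `struct mapping`, a hypothetical `HOLE` marker, and a `MAX_EXTLEN` of 2^21 - 1 blocks (the value is an assumption about XFS_MAX_BMBT_EXTLEN, which this diff does not show):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HOLE		((uint64_t)-1)		/* hypothetical hole marker */
#define MAX_EXTLEN	((uint64_t)0x1FFFFF)	/* assumed: 2^21 - 1 blocks */

/* Hypothetical stand-in for struct xfs_bmbt_irec; illustration only. */
struct mapping {
	uint64_t	startoff;	/* logical file offset, in blocks */
	uint64_t	startblock;	/* physical block */
	uint64_t	blockcount;	/* length, in blocks */
	int		state;		/* written vs. unwritten */
};

/*
 * Two mappings merge only if both are real, b2 starts exactly where b1 ends
 * both logically and physically, the states match, and the combined length
 * still fits in one record.
 */
static bool
can_merge(
	const struct mapping	*b1,
	const struct mapping	*b2)
{
	if (b1->startblock == HOLE || b2->startblock == HOLE)
		return false;
	if (b1->blockcount == 0 || b2->blockcount == 0)
		return false;

	return b1->startoff + b1->blockcount == b2->startoff &&
	       b1->startblock + b1->blockcount == b2->startblock &&
	       b1->state == b2->state &&
	       b1->blockcount + b2->blockcount <= MAX_EXTLEN;
}
```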
#define CLEFT_CONTIG 0x01
#define CRIGHT_CONTIG 0x02
#define CHOLE 0x04
#define CBOTH_CONTIG (CLEFT_CONTIG | CRIGHT_CONTIG)
#define NLEFT_CONTIG 0x10
#define NRIGHT_CONTIG 0x20
#define NHOLE 0x40
#define NBOTH_CONTIG (NLEFT_CONTIG | NRIGHT_CONTIG)
/* Estimate the effect of a single exchange on mapping count. */
static inline int
xmi_delta_nextents_step(
struct xfs_mount *mp,
const struct xfs_bmbt_irec *left,
const struct xfs_bmbt_irec *curr,
const struct xfs_bmbt_irec *new,
const struct xfs_bmbt_irec *right)
{
bool lhole, rhole, chole, nhole;
unsigned int state = 0;
int ret = 0;
lhole = left->br_startblock == HOLESTARTBLOCK;
rhole = right->br_startblock == HOLESTARTBLOCK;
chole = curr->br_startblock == HOLESTARTBLOCK;
nhole = new->br_startblock == HOLESTARTBLOCK;
if (chole)
state |= CHOLE;
if (!lhole && !chole && xmi_can_merge(left, curr))
state |= CLEFT_CONTIG;
if (!rhole && !chole && xmi_can_merge(curr, right))
state |= CRIGHT_CONTIG;
if ((state & CBOTH_CONTIG) == CBOTH_CONTIG &&
!xmi_can_merge_all(left, curr, right))
state &= ~CRIGHT_CONTIG;
if (nhole)
state |= NHOLE;
if (!lhole && !nhole && xmi_can_merge(left, new))
state |= NLEFT_CONTIG;
if (!rhole && !nhole && xmi_can_merge(new, right))
state |= NRIGHT_CONTIG;
if ((state & NBOTH_CONTIG) == NBOTH_CONTIG &&
!xmi_can_merge_all(left, new, right))
state &= ~NRIGHT_CONTIG;
switch (state & (CLEFT_CONTIG | CRIGHT_CONTIG | CHOLE)) {
case CLEFT_CONTIG | CRIGHT_CONTIG:
/*
* left/curr/right are the same mapping, so deleting curr
* causes 2 new mappings to be created.
*/
ret += 2;
break;
case 0:
/*
* curr is not contiguous with any mapping, so we remove curr
* completely
*/
ret--;
break;
case CHOLE:
/* hole, do nothing */
break;
case CLEFT_CONTIG:
case CRIGHT_CONTIG:
/* trim either left or right, no change */
break;
}
switch (state & (NLEFT_CONTIG | NRIGHT_CONTIG | NHOLE)) {
case NLEFT_CONTIG | NRIGHT_CONTIG:
/*
* left/new/right will become the same mapping, so adding
* new causes the deletion of right.
*/
ret--;
break;
case 0:
/* new is not contiguous with any mapping */
ret++;
break;
case NHOLE:
/* hole, do nothing. */
break;
case NLEFT_CONTIG:
case NRIGHT_CONTIG:
/* new is absorbed into left or right, no change */
break;
}
trace_xfs_exchmaps_delta_nextents_step(mp, left, curr, new, right, ret,
state);
return ret;
}
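The two switch statements in xmi_delta_nextents_step amount to a small lookup on the contiguity state. A hedged sketch, with the removal and insertion states kept in separate words for clarity (the function names and the split into two helpers are illustrative, not the kernel's structure):

```c
#include <assert.h>

#define LEFT_CONTIG	0x1
#define RIGHT_CONTIG	0x2
#define IS_HOLE		0x4

/* Change in mapping count caused by removing curr from its neighbours. */
static int
delta_remove(unsigned int state)
{
	switch (state & (LEFT_CONTIG | RIGHT_CONTIG | IS_HOLE)) {
	case LEFT_CONTIG | RIGHT_CONTIG:
		return 2;	/* one merged record splits into left/right */
	case 0:
		return -1;	/* a standalone record disappears */
	default:
		return 0;	/* hole, or a neighbour is merely trimmed */
	}
}

/* Change in mapping count caused by adding new between the neighbours. */
static int
delta_add(unsigned int state)
{
	switch (state & (LEFT_CONTIG | RIGHT_CONTIG | IS_HOLE)) {
	case LEFT_CONTIG | RIGHT_CONTIG:
		return -1;	/* left, new and right collapse into one */
	case 0:
		return 1;	/* a standalone record appears */
	default:
		return 0;	/* hole, or new is absorbed by a neighbour */
	}
}

/* Net effect of replacing curr with new in one file. */
static int
delta_step(unsigned int curr_state, unsigned int new_state)
{
	return delta_remove(curr_state) + delta_add(new_state);
}
```

In this model the largest growth for a single step is +3: carving a record out of the middle of a merged run and replacing it with a mapping that touches neither neighbour.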
/* Make sure we don't overflow the extent (mapping) counters. */
static inline int
xmi_ensure_delta_nextents(
struct xfs_exchmaps_req *req,
struct xfs_inode *ip,
int64_t delta)
{
struct xfs_mount *mp = ip->i_mount;
int whichfork = xfs_exchmaps_reqfork(req);
struct xfs_ifork *ifp = xfs_ifork_ptr(ip, whichfork);
uint64_t new_nextents;
xfs_extnum_t max_nextents;
if (delta < 0)
return 0;
/*
* It's always an error if the delta causes integer overflow. delta
* needs an explicit cast here to avoid warnings about implicit casts
* coded into the overflow check.
*/
if (check_add_overflow(ifp->if_nextents, (uint64_t)delta,
&new_nextents))
return -EFBIG;
if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_REDUCE_MAX_IEXTENTS) &&
new_nextents > 10)
return -EFBIG;
/*
* We always promote both inodes to have large extent counts if the
* superblock feature is enabled, so we only need to check against the
* theoretical maximum.
*/
max_nextents = xfs_iext_max_nextents(xfs_has_large_extent_counts(mp),
whichfork);
if (new_nextents > max_nextents)
return -EFBIG;
return 0;
}
/* Find the next mapping after irec. */
static inline int
xmi_next(
struct xfs_inode *ip,
int bmap_flags,
const struct xfs_bmbt_irec *irec,
struct xfs_bmbt_irec *nrec)
{
xfs_fileoff_t off;
xfs_filblks_t blockcount;
int nimaps = 1;
int error;
off = irec->br_startoff + irec->br_blockcount;
blockcount = XFS_MAX_FILEOFF - off;
error = xfs_bmapi_read(ip, off, blockcount, nrec, &nimaps, bmap_flags);
if (error)
return error;
if (nrec->br_startblock == DELAYSTARTBLOCK ||
nrec->br_startoff != off) {
/*
* If we don't get the mapping we want, return a zero-length
* mapping, which our estimator function will pretend is a hole.
* We shouldn't get delalloc reservations.
*/
nrec->br_startblock = HOLESTARTBLOCK;
}
return 0;
}
int __init
xfs_exchmaps_intent_init_cache(void)
{
xfs_exchmaps_intent_cache = kmem_cache_create("xfs_exchmaps_intent",
sizeof(struct xfs_exchmaps_intent),
0, 0, NULL);
return xfs_exchmaps_intent_cache != NULL ? 0 : -ENOMEM;
}
void
xfs_exchmaps_intent_destroy_cache(void)
{
kmem_cache_destroy(xfs_exchmaps_intent_cache);
xfs_exchmaps_intent_cache = NULL;
}
/*
* Decide if we will exchange the reflink flags between the two files after the
* exchange. The only time we want to do this is if we're exchanging all
* mappings under EOF and the inode reflink flags have different states.
*/
static inline bool
xmi_can_exchange_reflink_flags(
const struct xfs_exchmaps_req *req,
unsigned int reflink_state)
{
struct xfs_mount *mp = req->ip1->i_mount;
if (hweight32(reflink_state) != 1)
return false;
if (req->startoff1 != 0 || req->startoff2 != 0)
return false;
if (req->blockcount != XFS_B_TO_FSB(mp, req->ip1->i_disk_size))
return false;
if (req->blockcount != XFS_B_TO_FSB(mp, req->ip2->i_disk_size))
return false;
return true;
}
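The reflink-flag decision reduces to a population count plus range checks. A minimal sketch, assuming the file sizes have already been converted to block counts by the caller; the function name and parameters are hypothetical, and `__builtin_popcount` (GCC/Clang) stands in for hweight32:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * reflink_state uses bit 0 for file1 and bit 1 for file2, as in the source.
 * The flags can be exchanged only if exactly one file carries the flag and
 * the request covers both files from offset zero to EOF.
 */
static bool
can_exchange_reflink_flags(
	unsigned int	reflink_state,
	uint64_t	startoff1,
	uint64_t	startoff2,
	uint64_t	blockcount,
	uint64_t	file1_blocks,
	uint64_t	file2_blocks)
{
	if (__builtin_popcount(reflink_state) != 1)
		return false;
	if (startoff1 != 0 || startoff2 != 0)
		return false;
	return blockcount == file1_blocks && blockcount == file2_blocks;
}
```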
/* Allocate and initialize a new incore intent item from a request. */
struct xfs_exchmaps_intent *
xfs_exchmaps_init_intent(
const struct xfs_exchmaps_req *req)
{
struct xfs_exchmaps_intent *xmi;
unsigned int rs = 0;
xmi = kmem_cache_zalloc(xfs_exchmaps_intent_cache,
GFP_NOFS | __GFP_NOFAIL);
INIT_LIST_HEAD(&xmi->xmi_list);
xmi->xmi_ip1 = req->ip1;
xmi->xmi_ip2 = req->ip2;
xmi->xmi_startoff1 = req->startoff1;
xmi->xmi_startoff2 = req->startoff2;
xmi->xmi_blockcount = req->blockcount;
xmi->xmi_isize1 = xmi->xmi_isize2 = -1;
xmi->xmi_flags = req->flags & XFS_EXCHMAPS_PARAMS;
if (xfs_exchmaps_whichfork(xmi) == XFS_ATTR_FORK) {
xmi->xmi_flags |= __XFS_EXCHMAPS_INO2_SHORTFORM;
return xmi;
}
if (req->flags & XFS_EXCHMAPS_SET_SIZES) {
xmi->xmi_flags |= XFS_EXCHMAPS_SET_SIZES;
xmi->xmi_isize1 = req->ip2->i_disk_size;
xmi->xmi_isize2 = req->ip1->i_disk_size;
}
/* Record the state of each inode's reflink flag before the op. */
if (xfs_is_reflink_inode(req->ip1))
rs |= 1;
if (xfs_is_reflink_inode(req->ip2))
rs |= 2;
/*
* Figure out if we're clearing the reflink flags (which effectively
* exchanges them) after the operation.
*/
if (xmi_can_exchange_reflink_flags(req, rs)) {
if (rs & 1)
xmi->xmi_flags |= XFS_EXCHMAPS_CLEAR_INO1_REFLINK;
if (rs & 2)
xmi->xmi_flags |= XFS_EXCHMAPS_CLEAR_INO2_REFLINK;
}
if (S_ISDIR(VFS_I(xmi->xmi_ip2)->i_mode) ||
S_ISLNK(VFS_I(xmi->xmi_ip2)->i_mode))
xmi->xmi_flags |= __XFS_EXCHMAPS_INO2_SHORTFORM;
return xmi;
}
/*
* Estimate the number of exchange operations and the number of file blocks
* in each file that will be affected by the exchange operation.
*/
int
xfs_exchmaps_estimate(
struct xfs_exchmaps_req *req)
{
struct xfs_exchmaps_intent *xmi;
struct xfs_bmbt_irec irec1, irec2;
struct xfs_exchmaps_adjacent adj = ADJACENT_INIT;
xfs_filblks_t ip1_blocks = 0, ip2_blocks = 0;
int64_t d_nexts1, d_nexts2;
int bmap_flags;
int error;
ASSERT(!(req->flags & ~XFS_EXCHMAPS_PARAMS));
bmap_flags = xfs_bmapi_aflag(xfs_exchmaps_reqfork(req));
xmi = xfs_exchmaps_init_intent(req);
/*
* To guard against the possibility of overflowing the extent counters,
* we have to estimate an upper bound on the potential increase in that
* counter. We can split the mapping at each end of the range, and for
* each step of the exchange we can split the mapping that we're
* working on if the mappings do not align.
*/
d_nexts1 = d_nexts2 = 3;
while (xmi_has_more_exchange_work(xmi)) {
/*
* Walk through the file ranges until we find something to
* exchange. Because we're simulating the exchange, pass in
* adj to capture skipped mappings for correct estimation of
* bmbt record merges.
*/
error = xfs_exchmaps_find_mappings(xmi, &irec1, &irec2, &adj);
if (error)
goto out_free;
if (!xmi_has_more_exchange_work(xmi))
break;
/* Update accounting. */
if (xfs_bmap_is_real_extent(&irec1))
ip1_blocks += irec1.br_blockcount;
if (xfs_bmap_is_real_extent(&irec2))
ip2_blocks += irec2.br_blockcount;
req->nr_exchanges++;
/* Read the next mappings from both files. */
error = xmi_next(req->ip1, bmap_flags, &irec1, &adj.right1);
if (error)
goto out_free;
error = xmi_next(req->ip2, bmap_flags, &irec2, &adj.right2);
if (error)
goto out_free;
/* Update extent count deltas. */
d_nexts1 += xmi_delta_nextents_step(req->ip1->i_mount,
&adj.left1, &irec1, &irec2, &adj.right1);
d_nexts2 += xmi_delta_nextents_step(req->ip1->i_mount,
&adj.left2, &irec2, &irec1, &adj.right2);
/* Now pretend we exchanged the mappings. */
if (xmi_can_merge(&adj.left2, &irec1))
adj.left2.br_blockcount += irec1.br_blockcount;
else
memcpy(&adj.left2, &irec1, sizeof(irec1));
if (xmi_can_merge(&adj.left1, &irec2))
adj.left1.br_blockcount += irec2.br_blockcount;
else
memcpy(&adj.left1, &irec2, sizeof(irec2));
xmi_advance(xmi, &irec1);
}
/* Account for the blocks that are being exchanged. */
if (XFS_IS_REALTIME_INODE(req->ip1) &&
xfs_exchmaps_reqfork(req) == XFS_DATA_FORK) {
req->ip1_rtbcount = ip1_blocks;
req->ip2_rtbcount = ip2_blocks;
} else {
req->ip1_bcount = ip1_blocks;
req->ip2_bcount = ip2_blocks;
}
/*
* Make sure that both forks have enough slack left in their extent
* counters that the exchange operation will not overflow.
*/
trace_xfs_exchmaps_delta_nextents(req, d_nexts1, d_nexts2);
if (req->ip1 == req->ip2) {
error = xmi_ensure_delta_nextents(req, req->ip1,
d_nexts1 + d_nexts2);
} else {
error = xmi_ensure_delta_nextents(req, req->ip1, d_nexts1);
if (error)
goto out_free;
error = xmi_ensure_delta_nextents(req, req->ip2, d_nexts2);
}
if (error)
goto out_free;
trace_xfs_exchmaps_initial_estimate(req);
error = xfs_exchmaps_estimate_overhead(req);
out_free:
kmem_cache_free(xfs_exchmaps_intent_cache, xmi);
return error;
}
/* Set the reflink flag before an operation. */
static inline void
xfs_exchmaps_set_reflink(
struct xfs_trans *tp,
struct xfs_inode *ip)
{
trace_xfs_reflink_set_inode_flag(ip);
ip->i_diflags2 |= XFS_DIFLAG2_REFLINK;
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
}
/*
* If either file has shared blocks and we're exchanging data forks, we must
* flag the other file as having shared blocks so that we get the shared-block
* rmap functions if we need to fix up the rmaps.
*/
void
xfs_exchmaps_ensure_reflink(
struct xfs_trans *tp,
const struct xfs_exchmaps_intent *xmi)
{
unsigned int rs = 0;
if (xfs_is_reflink_inode(xmi->xmi_ip1))
rs |= 1;
if (xfs_is_reflink_inode(xmi->xmi_ip2))
rs |= 2;
if ((rs & 1) && !xfs_is_reflink_inode(xmi->xmi_ip2))
xfs_exchmaps_set_reflink(tp, xmi->xmi_ip2);
if ((rs & 2) && !xfs_is_reflink_inode(xmi->xmi_ip1))
xfs_exchmaps_set_reflink(tp, xmi->xmi_ip1);
}
/* Set the large extent count flag before an operation if needed. */
static inline void
xfs_exchmaps_ensure_large_extent_counts(
struct xfs_trans *tp,
struct xfs_inode *ip)
{
if (xfs_inode_has_large_extent_counts(ip))
return;
ip->i_diflags2 |= XFS_DIFLAG2_NREXT64;
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
}
/* Widen the extent counter fields of both inodes if necessary. */
void
xfs_exchmaps_upgrade_extent_counts(
struct xfs_trans *tp,
const struct xfs_exchmaps_intent *xmi)
{
if (!xfs_has_large_extent_counts(tp->t_mountp))
return;
xfs_exchmaps_ensure_large_extent_counts(tp, xmi->xmi_ip1);
xfs_exchmaps_ensure_large_extent_counts(tp, xmi->xmi_ip2);
}
/*
* Schedule an exchange of a range of mappings between two inodes.
*
* The use of file mapping exchange log intent items ensures the operation can
* be resumed even if the system goes down. The caller must commit the
* transaction to start the work.
*
* The caller must ensure that the inodes are joined to the transaction and
* ILOCKd; they will still be joined to the transaction at exit.
*/
void
xfs_exchange_mappings(
struct xfs_trans *tp,
const struct xfs_exchmaps_req *req)
{
struct xfs_exchmaps_intent *xmi;
BUILD_BUG_ON(XFS_EXCHMAPS_INTERNAL_FLAGS & XFS_EXCHMAPS_LOGGED_FLAGS);
xfs_assert_ilocked(req->ip1, XFS_ILOCK_EXCL);
xfs_assert_ilocked(req->ip2, XFS_ILOCK_EXCL);
ASSERT(!(req->flags & ~XFS_EXCHMAPS_LOGGED_FLAGS));
if (req->flags & XFS_EXCHMAPS_SET_SIZES)
ASSERT(!(req->flags & XFS_EXCHMAPS_ATTR_FORK));
ASSERT(xfs_has_exchange_range(tp->t_mountp));
if (req->blockcount == 0)
return;
xmi = xfs_exchmaps_init_intent(req);
xfs_exchmaps_defer_add(tp, xmi);
xfs_exchmaps_ensure_reflink(tp, xmi);
xfs_exchmaps_upgrade_extent_counts(tp, xmi);
}
/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
* Copyright (c) 2020-2024 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#ifndef __XFS_EXCHMAPS_H__
#define __XFS_EXCHMAPS_H__
/* In-core deferred operation info about a file mapping exchange request. */
struct xfs_exchmaps_intent {
/* List of other incore deferred work. */
struct list_head xmi_list;
/* Inodes participating in the operation. */
struct xfs_inode *xmi_ip1;
struct xfs_inode *xmi_ip2;
/* File offset range information. */
xfs_fileoff_t xmi_startoff1;
xfs_fileoff_t xmi_startoff2;
xfs_filblks_t xmi_blockcount;
/* Set these file sizes after the operation, unless negative. */
xfs_fsize_t xmi_isize1;
xfs_fsize_t xmi_isize2;
uint64_t xmi_flags; /* XFS_EXCHMAPS_* flags */
};
/* Try to convert inode2 from block to short format at the end, if possible. */
#define __XFS_EXCHMAPS_INO2_SHORTFORM (1ULL << 63)
#define XFS_EXCHMAPS_INTERNAL_FLAGS (__XFS_EXCHMAPS_INO2_SHORTFORM)
/* flags that can be passed to xfs_exchmaps_{estimate,mappings} */
#define XFS_EXCHMAPS_PARAMS (XFS_EXCHMAPS_ATTR_FORK | \
XFS_EXCHMAPS_SET_SIZES | \
XFS_EXCHMAPS_INO1_WRITTEN)
static inline int
xfs_exchmaps_whichfork(const struct xfs_exchmaps_intent *xmi)
{
if (xmi->xmi_flags & XFS_EXCHMAPS_ATTR_FORK)
return XFS_ATTR_FORK;
return XFS_DATA_FORK;
}
/* Parameters for a mapping exchange request. */
struct xfs_exchmaps_req {
/* Inodes participating in the operation. */
struct xfs_inode *ip1;
struct xfs_inode *ip2;
/* File offset range information. */
xfs_fileoff_t startoff1;
xfs_fileoff_t startoff2;
xfs_filblks_t blockcount;
/* XFS_EXCHMAPS_* operation flags */
uint64_t flags;
/*
* Fields below this line are filled out by xfs_exchmaps_estimate;
* callers should initialize this part of the struct to zero.
*/
/*
* Data device blocks to be moved out of ip1, and free space needed to
* handle the bmbt changes.
*/
xfs_filblks_t ip1_bcount;
/*
* Data device blocks to be moved out of ip2, and free space needed to
* handle the bmbt changes.
*/
xfs_filblks_t ip2_bcount;
/* rt blocks to be moved out of ip1. */
xfs_filblks_t ip1_rtbcount;
/* rt blocks to be moved out of ip2. */
xfs_filblks_t ip2_rtbcount;
/* Free space needed to handle the bmbt changes */
unsigned long long resblks;
/* Number of exchanges needed to complete the operation */
unsigned long long nr_exchanges;
};
static inline int
xfs_exchmaps_reqfork(const struct xfs_exchmaps_req *req)
{
if (req->flags & XFS_EXCHMAPS_ATTR_FORK)
return XFS_ATTR_FORK;
return XFS_DATA_FORK;
}
int xfs_exchmaps_estimate(struct xfs_exchmaps_req *req);
extern struct kmem_cache *xfs_exchmaps_intent_cache;
int __init xfs_exchmaps_intent_init_cache(void);
void xfs_exchmaps_intent_destroy_cache(void);
struct xfs_exchmaps_intent *xfs_exchmaps_init_intent(
const struct xfs_exchmaps_req *req);
void xfs_exchmaps_ensure_reflink(struct xfs_trans *tp,
const struct xfs_exchmaps_intent *xmi);
void xfs_exchmaps_upgrade_extent_counts(struct xfs_trans *tp,
const struct xfs_exchmaps_intent *xmi);
int xfs_exchmaps_finish_one(struct xfs_trans *tp,
struct xfs_exchmaps_intent *xmi);
int xfs_exchmaps_check_forks(struct xfs_mount *mp,
const struct xfs_exchmaps_req *req);
void xfs_exchange_mappings(struct xfs_trans *tp,
const struct xfs_exchmaps_req *req);
#endif /* __XFS_EXCHMAPS_H__ */
@@ -373,13 +373,15 @@ xfs_sb_has_ro_compat_feature(
#define XFS_SB_FEAT_INCOMPAT_BIGTIME (1 << 3) /* large timestamps */
#define XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR (1 << 4) /* needs xfs_repair */
#define XFS_SB_FEAT_INCOMPAT_NREXT64 (1 << 5) /* large extent counters */
#define XFS_SB_FEAT_INCOMPAT_EXCHRANGE (1 << 6) /* exchangerange supported */
#define XFS_SB_FEAT_INCOMPAT_ALL \
-(XFS_SB_FEAT_INCOMPAT_FTYPE| \
-XFS_SB_FEAT_INCOMPAT_SPINODES| \
-XFS_SB_FEAT_INCOMPAT_META_UUID| \
-XFS_SB_FEAT_INCOMPAT_BIGTIME| \
-XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR| \
-XFS_SB_FEAT_INCOMPAT_NREXT64)
+(XFS_SB_FEAT_INCOMPAT_FTYPE | \
+XFS_SB_FEAT_INCOMPAT_SPINODES | \
+XFS_SB_FEAT_INCOMPAT_META_UUID | \
+XFS_SB_FEAT_INCOMPAT_BIGTIME | \
+XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR | \
+XFS_SB_FEAT_INCOMPAT_NREXT64 | \
+XFS_SB_FEAT_INCOMPAT_EXCHRANGE)
#define XFS_SB_FEAT_INCOMPAT_UNKNOWN ~XFS_SB_FEAT_INCOMPAT_ALL
static inline bool
......
@@ -239,6 +239,7 @@ typedef struct xfs_fsop_resblks {
#define XFS_FSOP_GEOM_FLAGS_BIGTIME (1 << 21) /* 64-bit nsec timestamps */
#define XFS_FSOP_GEOM_FLAGS_INOBTCNT (1 << 22) /* inobt btree counter */
#define XFS_FSOP_GEOM_FLAGS_NREXT64 (1 << 23) /* large extent counters */
#define XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE (1 << 24) /* exchange range */
/*
* Minimum and maximum sizes need for growth checks.
@@ -772,6 +773,46 @@ struct xfs_scrub_metadata {
# define XFS_XATTR_LIST_MAX 65536
#endif
/*
* Exchange part of file1 with part of the file that this ioctl is being
* called against (which we'll call file2). Filesystems must be able to
* restart and complete the operation even after the system goes down.
*/
struct xfs_exchange_range {
__s32 file1_fd;
__u32 pad; /* must be zeroes */
__u64 file1_offset; /* file1 offset, bytes */
__u64 file2_offset; /* file2 offset, bytes */
__u64 length; /* bytes to exchange */
__u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
};
/*
* Exchange file data all the way to the ends of both files, and then exchange
* the file sizes. This flag can be used to replace a file's contents with a
* different amount of data. length will be ignored.
*/
#define XFS_EXCHANGE_RANGE_TO_EOF (1ULL << 0)
/* Flush all changes in file data and file metadata to disk before returning. */
#define XFS_EXCHANGE_RANGE_DSYNC (1ULL << 1)
/* Dry run; do all the parameter verification but do not change anything. */
#define XFS_EXCHANGE_RANGE_DRY_RUN (1ULL << 2)
/*
* Exchange only the parts of the two files where the file allocation units
* mapped to file1's range have been written to. This can accelerate
* scatter-gather atomic writes with a temp file if all writes are aligned to
* the file allocation unit.
*/
#define XFS_EXCHANGE_RANGE_FILE1_WRITTEN (1ULL << 3)
#define XFS_EXCHANGE_RANGE_ALL_FLAGS (XFS_EXCHANGE_RANGE_TO_EOF | \
XFS_EXCHANGE_RANGE_DSYNC | \
XFS_EXCHANGE_RANGE_DRY_RUN | \
XFS_EXCHANGE_RANGE_FILE1_WRITTEN)
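From user space, committing a staged temp file with these definitions might look like the sketch below. It carries local copies of the struct, two flags, and the ioctl number as they appear in this diff, using fixed-width types in place of `__s32`/`__u64`; a real program would take them from the installed XFS headers instead. `commit_staged_file` is a hypothetical helper, not part of the series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Local copies of the definitions from this diff, for illustration only. */
struct xfs_exchange_range {
	int32_t		file1_fd;
	uint32_t	pad;		/* must be zeroes */
	uint64_t	file1_offset;	/* file1 offset, bytes */
	uint64_t	file2_offset;	/* file2 offset, bytes */
	uint64_t	length;		/* bytes to exchange */
	uint64_t	flags;		/* XFS_EXCHANGE_RANGE_* */
};

#define XFS_EXCHANGE_RANGE_TO_EOF	(1ULL << 0)
#define XFS_EXCHANGE_RANGE_DSYNC	(1ULL << 1)
#define XFS_IOC_EXCHANGE_RANGE	_IOWR('X', 129, struct xfs_exchange_range)

/*
 * Hypothetical helper: exchange the entire contents of file1_fd (for example
 * a staged O_TMPFILE) with file2_fd, the file the ioctl is called against,
 * and flush the result before returning. Returns ioctl()'s result: 0 on
 * success, -1 with errno set on failure.
 */
static int
commit_staged_file(int file2_fd, int file1_fd)
{
	struct xfs_exchange_range	args;

	memset(&args, 0, sizeof(args));	/* also zeroes pad and both offsets */
	args.file1_fd = file1_fd;
	args.flags = XFS_EXCHANGE_RANGE_TO_EOF | XFS_EXCHANGE_RANGE_DSYNC;

	return ioctl(file2_fd, XFS_IOC_EXCHANGE_RANGE, &args);
}
```

With XFS_EXCHANGE_RANGE_TO_EOF set, `length` is ignored, so it can stay zero. Callers still need to quiesce writers to the target file themselves, as the commit message notes.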
/*
* ioctl commands that are used by Linux filesystems
@@ -843,6 +884,7 @@ struct xfs_scrub_metadata {
#define XFS_IOC_FSGEOMETRY _IOR ('X', 126, struct xfs_fsop_geom)
#define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
#define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
#define XFS_IOC_EXCHANGE_RANGE _IOWR('X', 129, struct xfs_exchange_range)
/* XFS_IOC_GETFSUUID ---------- deprecated 140 */
......
@@ -117,8 +117,9 @@ struct xfs_unmount_log_format {
#define XLOG_REG_TYPE_ATTRD_FORMAT 28
#define XLOG_REG_TYPE_ATTR_NAME 29
#define XLOG_REG_TYPE_ATTR_VALUE 30
-#define XLOG_REG_TYPE_MAX 30
+#define XLOG_REG_TYPE_XMI_FORMAT 31
+#define XLOG_REG_TYPE_XMD_FORMAT 32
+#define XLOG_REG_TYPE_MAX 32
/*
* Flags to log operation header
@@ -243,6 +244,8 @@ typedef struct xfs_trans_header {
#define XFS_LI_BUD 0x1245
#define XFS_LI_ATTRI 0x1246 /* attr set/remove intent*/
#define XFS_LI_ATTRD 0x1247 /* attr set/remove done */
#define XFS_LI_XMI 0x1248 /* mapping exchange intent */
#define XFS_LI_XMD 0x1249 /* mapping exchange done */
#define XFS_LI_TYPE_DESC \
{ XFS_LI_EFI, "XFS_LI_EFI" }, \
@@ -260,7 +263,9 @@ typedef struct xfs_trans_header {
{ XFS_LI_BUI, "XFS_LI_BUI" }, \
{ XFS_LI_BUD, "XFS_LI_BUD" }, \
{ XFS_LI_ATTRI, "XFS_LI_ATTRI" }, \
-{ XFS_LI_ATTRD, "XFS_LI_ATTRD" }
+{ XFS_LI_ATTRD, "XFS_LI_ATTRD" }, \
+{ XFS_LI_XMI, "XFS_LI_XMI" }, \
+{ XFS_LI_XMD, "XFS_LI_XMD" }
/*
* Inode Log Item Format definitions.
@@ -878,6 +883,61 @@ struct xfs_bud_log_format {
 	uint64_t bud_bui_id; /* id of corresponding bui */
 };
+/*
+ * XMI/XMD (file mapping exchange) log format definitions
+ */
+/* This is the structure used to lay out a mapping exchange log item. */
+struct xfs_xmi_log_format {
+	uint16_t xmi_type; /* xmi log item type */
+	uint16_t xmi_size; /* size of this item */
+	uint32_t __pad; /* must be zero */
+	uint64_t xmi_id; /* xmi identifier */
+	uint64_t xmi_inode1; /* inumber of first file */
+	uint64_t xmi_inode2; /* inumber of second file */
+	uint32_t xmi_igen1; /* generation of first file */
+	uint32_t xmi_igen2; /* generation of second file */
+	uint64_t xmi_startoff1; /* block offset into file1 */
+	uint64_t xmi_startoff2; /* block offset into file2 */
+	uint64_t xmi_blockcount; /* number of blocks */
+	uint64_t xmi_flags; /* XFS_EXCHMAPS_* */
+	uint64_t xmi_isize1; /* intended file1 size */
+	uint64_t xmi_isize2; /* intended file2 size */
+};
+/* Exchange mappings between extended attribute forks instead of data forks. */
+#define XFS_EXCHMAPS_ATTR_FORK (1ULL << 0)
+/* Set the file sizes when finished. */
+#define XFS_EXCHMAPS_SET_SIZES (1ULL << 1)
+/*
+ * Exchange the mappings of the two files only if the file allocation units
+ * mapped to file1's range have been written.
+ */
+#define XFS_EXCHMAPS_INO1_WRITTEN (1ULL << 2)
+/* Clear the reflink flag from inode1 after the operation. */
+#define XFS_EXCHMAPS_CLEAR_INO1_REFLINK (1ULL << 3)
+/* Clear the reflink flag from inode2 after the operation. */
+#define XFS_EXCHMAPS_CLEAR_INO2_REFLINK (1ULL << 4)
+#define XFS_EXCHMAPS_LOGGED_FLAGS (XFS_EXCHMAPS_ATTR_FORK | \
+		XFS_EXCHMAPS_SET_SIZES | \
+		XFS_EXCHMAPS_INO1_WRITTEN | \
+		XFS_EXCHMAPS_CLEAR_INO1_REFLINK | \
+		XFS_EXCHMAPS_CLEAR_INO2_REFLINK)
+/* This is the structure used to lay out a mapping exchange done log item. */
+struct xfs_xmd_log_format {
+	uint16_t xmd_type; /* xmd log item type */
+	uint16_t xmd_size; /* size of this item */
+	uint32_t __pad;
+	uint64_t xmd_xmi_id; /* id of corresponding xmi */
+};
 /*
  * Dquot Log format definitions.
  *
......
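The two log format structures above have fixed layouts. The following is a minimal userspace sketch, not kernel code, that mirrors their fields and checks the sizes at compile time. The `*_sketch` names are hypothetical; the expected sizes (88 and 16 bytes) come from adding up the field widths shown in the hunk, where every field already falls on its natural alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of struct xfs_xmi_log_format (hypothetical name). */
struct xmi_log_format_sketch {
	uint16_t xmi_type;       /* log item type */
	uint16_t xmi_size;       /* size of this item */
	uint32_t pad;            /* must be zero */
	uint64_t xmi_id;         /* intent identifier */
	uint64_t xmi_inode1;     /* inumber of first file */
	uint64_t xmi_inode2;     /* inumber of second file */
	uint32_t xmi_igen1;      /* generation of first file */
	uint32_t xmi_igen2;      /* generation of second file */
	uint64_t xmi_startoff1;  /* block offset into file1 */
	uint64_t xmi_startoff2;  /* block offset into file2 */
	uint64_t xmi_blockcount; /* number of blocks */
	uint64_t xmi_flags;      /* XFS_EXCHMAPS_* bits */
	uint64_t xmi_isize1;     /* intended file1 size */
	uint64_t xmi_isize2;     /* intended file2 size */
};

/* Userspace mirror of struct xfs_xmd_log_format (hypothetical name). */
struct xmd_log_format_sketch {
	uint16_t xmd_type;       /* log item type */
	uint16_t xmd_size;       /* size of this item */
	uint32_t pad;
	uint64_t xmd_xmi_id;     /* id of the intent being completed */
};

/* The layout has no implicit padding, so the sizes are sums of the fields. */
_Static_assert(sizeof(struct xmi_log_format_sketch) == 88,
	       "XMI format sketch should be 88 bytes");
_Static_assert(sizeof(struct xmd_log_format_sketch) == 16,
	       "XMD format sketch should be 16 bytes");
_Static_assert(offsetof(struct xmi_log_format_sketch, xmi_id) == 8,
	       "xmi_id should follow the 8-byte header");
```

Checks of this kind catch accidental layout changes early, which matters for structures that are written to the log and read back during recovery.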
@@ -75,6 +75,8 @@ extern const struct xlog_recover_item_ops xlog_cui_item_ops;
 extern const struct xlog_recover_item_ops xlog_cud_item_ops;
 extern const struct xlog_recover_item_ops xlog_attri_item_ops;
 extern const struct xlog_recover_item_ops xlog_attrd_item_ops;
+extern const struct xlog_recover_item_ops xlog_xmi_item_ops;
+extern const struct xlog_recover_item_ops xlog_xmd_item_ops;
 /*
  * Macros, structures, prototypes for internal log manager use.
@@ -121,6 +123,8 @@ bool xlog_is_buffer_cancelled(struct xlog *log, xfs_daddr_t blkno, uint len);
 int xlog_recover_iget(struct xfs_mount *mp, xfs_ino_t ino,
 		struct xfs_inode **ipp);
+int xlog_recover_iget_handle(struct xfs_mount *mp, xfs_ino_t ino, uint32_t gen,
+		struct xfs_inode **ipp);
 void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type,
 		uint64_t intent_id);
 int xlog_alloc_buf_cancel_table(struct xlog *log);
......
@@ -26,6 +26,7 @@
 #include "xfs_health.h"
 #include "xfs_ag.h"
 #include "xfs_rtbitmap.h"
+#include "xfs_exchrange.h"
 /*
  * Physical superblock buffer manipulations. Shared with libxfs in userspace.
@@ -175,6 +176,8 @@ xfs_sb_version_to_features(
 		features |= XFS_FEAT_NEEDSREPAIR;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_NREXT64)
 		features |= XFS_FEAT_NREXT64;
+	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_EXCHRANGE)
+		features |= XFS_FEAT_EXCHANGE_RANGE;
 	return features;
 }
@@ -1259,6 +1262,8 @@ xfs_fs_geometry(
 	}
 	if (xfs_has_large_extent_counts(mp))
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_NREXT64;
+	if (xfs_has_exchange_range(mp))
+		geo->flags |= XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE;
 	geo->rtsectsize = sbp->sb_blocksize;
 	geo->dirblocksize = xfs_dir2_dirblock_bytes(sbp);
......
@@ -380,3 +380,50 @@ xfs_symlink_write_target(
 	ASSERT(pathlen == 0);
 	return 0;
 }
+/* Remove all the blocks from a symlink and invalidate buffers. */
+int
+xfs_symlink_remote_truncate(
+	struct xfs_trans *tp,
+	struct xfs_inode *ip)
+{
+	struct xfs_bmbt_irec mval[XFS_SYMLINK_MAPS];
+	struct xfs_mount *mp = tp->t_mountp;
+	struct xfs_buf *bp;
+	int nmaps = XFS_SYMLINK_MAPS;
+	int done = 0;
+	int i;
+	int error;
+	/* Read mappings and invalidate buffers. */
+	error = xfs_bmapi_read(ip, 0, XFS_MAX_FILEOFF, mval, &nmaps, 0);
+	if (error)
+		return error;
+	for (i = 0; i < nmaps; i++) {
+		if (!xfs_bmap_is_real_extent(&mval[i]))
+			break;
+		error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
+				XFS_FSB_TO_DADDR(mp, mval[i].br_startblock),
+				XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0,
+				&bp);
+		if (error)
+			return error;
+		xfs_trans_binval(tp, bp);
+	}
+	/* Unmap the remote blocks. */
+	error = xfs_bunmapi(tp, ip, 0, XFS_MAX_FILEOFF, 0, nmaps, &done);
+	if (error)
+		return error;
+	if (!done) {
+		ASSERT(done);
+		xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
+		return -EFSCORRUPTED;
+	}
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	return 0;
+}
@@ -22,5 +22,6 @@ int xfs_symlink_remote_read(struct xfs_inode *ip, char *link);
 int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip,
 		const char *target_path, int pathlen, xfs_fsblock_t fs_blocks,
 		uint resblks);
+int xfs_symlink_remote_truncate(struct xfs_trans *tp, struct xfs_inode *ip);
 #endif /* __XFS_SYMLINK_REMOTE_H */
@@ -10,6 +10,10 @@
  * Components of space reservations.
  */
+/* Worst case number of bmaps that can be held in a block. */
+#define XFS_MAX_CONTIG_BMAPS_PER_BLOCK(mp) \
+		(((mp)->m_bmap_dmxr[0]) - ((mp)->m_bmap_dmnr[0]))
 /* Worst case number of rmaps that can be held in a block. */
 #define XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) \
 		(((mp)->m_rmap_mxr[0]) - ((mp)->m_rmap_mnr[0]))
......
@@ -62,6 +62,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_ATTR_LEAF_TO_NODE,
 	XFS_RANDOM_WB_DELAY_MS,
 	XFS_RANDOM_WRITE_DELAY_MS,
+	XFS_RANDOM_EXCHMAPS_FINISH_ONE,
 };
 struct xfs_errortag_attr {
@@ -179,6 +180,7 @@ XFS_ERRORTAG_ATTR_RW(da_leaf_split, XFS_ERRTAG_DA_LEAF_SPLIT);
 XFS_ERRORTAG_ATTR_RW(attr_leaf_to_node, XFS_ERRTAG_ATTR_LEAF_TO_NODE);
 XFS_ERRORTAG_ATTR_RW(wb_delay_ms, XFS_ERRTAG_WB_DELAY_MS);
 XFS_ERRORTAG_ATTR_RW(write_delay_ms, XFS_ERRTAG_WRITE_DELAY_MS);
+XFS_ERRORTAG_ATTR_RW(exchmaps_finish_one, XFS_ERRTAG_EXCHMAPS_FINISH_ONE);
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -224,6 +226,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(attr_leaf_to_node),
 	XFS_ERRORTAG_ATTR_LIST(wb_delay_ms),
 	XFS_ERRORTAG_ATTR_LIST(write_delay_ms),
+	XFS_ERRORTAG_ATTR_LIST(exchmaps_finish_one),
 	NULL,
 };
 ATTRIBUTE_GROUPS(xfs_errortag);
......
// SPDX-License-Identifier: GPL-2.0-or-later
/*
* Copyright (c) 2020-2024 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#include "xfs.h"
#include "xfs_fs.h"
#include "xfs_format.h"
#include "xfs_log_format.h"
#include "xfs_trans_resv.h"
#include "xfs_bit.h"
#include "xfs_shared.h"
#include "xfs_mount.h"
#include "xfs_defer.h"
#include "xfs_inode.h"
#include "xfs_trans.h"
#include "xfs_trans_priv.h"
#include "xfs_exchmaps_item.h"
#include "xfs_exchmaps.h"
#include "xfs_log.h"
#include "xfs_bmap.h"
#include "xfs_icache.h"
#include "xfs_bmap_btree.h"
#include "xfs_trans_space.h"
#include "xfs_error.h"
#include "xfs_log_priv.h"
#include "xfs_log_recover.h"
#include "xfs_exchrange.h"
#include "xfs_trace.h"
struct kmem_cache *xfs_xmi_cache;
struct kmem_cache *xfs_xmd_cache;
static const struct xfs_item_ops xfs_xmi_item_ops;
static inline struct xfs_xmi_log_item *XMI_ITEM(struct xfs_log_item *lip)
{
return container_of(lip, struct xfs_xmi_log_item, xmi_item);
}
STATIC void
xfs_xmi_item_free(
struct xfs_xmi_log_item *xmi_lip)
{
kvfree(xmi_lip->xmi_item.li_lv_shadow);
kmem_cache_free(xfs_xmi_cache, xmi_lip);
}
/*
* Freeing the XMI requires that we remove it from the AIL if it has already
* been placed there. However, the XMI may not yet have been placed in the AIL
* when called by xfs_xmi_release() from XMD processing due to the ordering of
* committed vs unpin operations in bulk insert operations. Hence the reference
* count to ensure only the last caller frees the XMI.
*/
STATIC void
xfs_xmi_release(
struct xfs_xmi_log_item *xmi_lip)
{
ASSERT(atomic_read(&xmi_lip->xmi_refcount) > 0);
if (atomic_dec_and_test(&xmi_lip->xmi_refcount)) {
xfs_trans_ail_delete(&xmi_lip->xmi_item, 0);
xfs_xmi_item_free(xmi_lip);
}
}
STATIC void
xfs_xmi_item_size(
struct xfs_log_item *lip,
int *nvecs,
int *nbytes)
{
*nvecs += 1;
*nbytes += sizeof(struct xfs_xmi_log_format);
}
/*
* This is called to fill in the vector of log iovecs for the given xmi log
* item. We use only 1 iovec, and we point that at the xmi_log_format structure
* embedded in the xmi item.
*/
STATIC void
xfs_xmi_item_format(
struct xfs_log_item *lip,
struct xfs_log_vec *lv)
{
struct xfs_xmi_log_item *xmi_lip = XMI_ITEM(lip);
struct xfs_log_iovec *vecp = NULL;
xmi_lip->xmi_format.xmi_type = XFS_LI_XMI;
xmi_lip->xmi_format.xmi_size = 1;
xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_XMI_FORMAT,
&xmi_lip->xmi_format,
sizeof(struct xfs_xmi_log_format));
}
/*
* The unpin operation is the last place an XMI is manipulated in the log. It
* is either inserted in the AIL or aborted in the event of a log I/O error. In
* either case, the XMI transaction has been successfully committed to make it
* this far. Therefore, we expect whoever committed the XMI to either construct
* and commit the XMD or drop the XMD's reference in the event of error. Simply
* drop the log's XMI reference now that the log is done with it.
*/
STATIC void
xfs_xmi_item_unpin(
struct xfs_log_item *lip,
int remove)
{
struct xfs_xmi_log_item *xmi_lip = XMI_ITEM(lip);
xfs_xmi_release(xmi_lip);
}
/*
* The XMI has been either committed or aborted if the transaction has been
* cancelled. If the transaction was cancelled, an XMD isn't going to be
* constructed and thus we free the XMI here directly.
*/
STATIC void
xfs_xmi_item_release(
struct xfs_log_item *lip)
{
xfs_xmi_release(XMI_ITEM(lip));
}
/* Allocate and initialize an xmi item. */
STATIC struct xfs_xmi_log_item *
xfs_xmi_init(
struct xfs_mount *mp)
{
struct xfs_xmi_log_item *xmi_lip;
xmi_lip = kmem_cache_zalloc(xfs_xmi_cache, GFP_KERNEL | __GFP_NOFAIL);
xfs_log_item_init(mp, &xmi_lip->xmi_item, XFS_LI_XMI, &xfs_xmi_item_ops);
xmi_lip->xmi_format.xmi_id = (uintptr_t)(void *)xmi_lip;
atomic_set(&xmi_lip->xmi_refcount, 2);
return xmi_lip;
}
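The XMI starts life with a reference count of two: one for the log and one for whoever commits the matching XMD (or aborts). The following is a minimal userspace sketch of that "last caller frees" pattern using C11 atomics. It is not the kernel implementation; `intent_sketch`, `intent_init_sketch`, and `intent_release_sketch` are hypothetical names, and the return value exists only so the behaviour can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an intent log item with two owners. */
struct intent_sketch {
	atomic_int refcount;
};

/* Allocate an intent holding two references, as xfs_xmi_init() does. */
struct intent_sketch *intent_init_sketch(void)
{
	struct intent_sketch *it = calloc(1, sizeof(*it));

	if (!it)
		return NULL;
	atomic_init(&it->refcount, 2);
	return it;
}

/*
 * Drop one reference. Only the caller that drops the last reference frees
 * the object. Returns true if this call freed it.
 */
bool intent_release_sketch(struct intent_sketch *it)
{
	assert(atomic_load(&it->refcount) > 0);
	/* atomic_fetch_sub returns the value held before the subtraction. */
	if (atomic_fetch_sub(&it->refcount, 1) == 1) {
		free(it);
		return true;
	}
	return false;
}
```

With this shape, the two release paths can run in either order without either side needing to know whether the other has finished.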
static inline struct xfs_xmd_log_item *XMD_ITEM(struct xfs_log_item *lip)
{
return container_of(lip, struct xfs_xmd_log_item, xmd_item);
}
STATIC void
xfs_xmd_item_size(
struct xfs_log_item *lip,
int *nvecs,
int *nbytes)
{
*nvecs += 1;
*nbytes += sizeof(struct xfs_xmd_log_format);
}
/*
* This is called to fill in the vector of log iovecs for the given xmd log
* item. We use only 1 iovec, and we point that at the xmd_log_format structure
* embedded in the xmd item.
*/
STATIC void
xfs_xmd_item_format(
struct xfs_log_item *lip,
struct xfs_log_vec *lv)
{
struct xfs_xmd_log_item *xmd_lip = XMD_ITEM(lip);
struct xfs_log_iovec *vecp = NULL;
xmd_lip->xmd_format.xmd_type = XFS_LI_XMD;
xmd_lip->xmd_format.xmd_size = 1;
xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_XMD_FORMAT, &xmd_lip->xmd_format,
sizeof(struct xfs_xmd_log_format));
}
/*
* The XMD is either committed or aborted if the transaction is cancelled. If
* the transaction is cancelled, drop our reference to the XMI and free the
* XMD.
*/
STATIC void
xfs_xmd_item_release(
struct xfs_log_item *lip)
{
struct xfs_xmd_log_item *xmd_lip = XMD_ITEM(lip);
xfs_xmi_release(xmd_lip->xmd_intent_log_item);
kvfree(xmd_lip->xmd_item.li_lv_shadow);
kmem_cache_free(xfs_xmd_cache, xmd_lip);
}
static struct xfs_log_item *
xfs_xmd_item_intent(
struct xfs_log_item *lip)
{
return &XMD_ITEM(lip)->xmd_intent_log_item->xmi_item;
}
static const struct xfs_item_ops xfs_xmd_item_ops = {
.flags = XFS_ITEM_RELEASE_WHEN_COMMITTED |
XFS_ITEM_INTENT_DONE,
.iop_size = xfs_xmd_item_size,
.iop_format = xfs_xmd_item_format,
.iop_release = xfs_xmd_item_release,
.iop_intent = xfs_xmd_item_intent,
};
/* Log file mapping exchange information in the intent item. */
STATIC struct xfs_log_item *
xfs_exchmaps_create_intent(
struct xfs_trans *tp,
struct list_head *items,
unsigned int count,
bool sort)
{
struct xfs_xmi_log_item *xmi_lip;
struct xfs_exchmaps_intent *xmi;
struct xfs_xmi_log_format *xlf;
ASSERT(count == 1);
xmi = list_first_entry_or_null(items, struct xfs_exchmaps_intent,
xmi_list);
xmi_lip = xfs_xmi_init(tp->t_mountp);
xlf = &xmi_lip->xmi_format;
xlf->xmi_inode1 = xmi->xmi_ip1->i_ino;
xlf->xmi_igen1 = VFS_I(xmi->xmi_ip1)->i_generation;
xlf->xmi_inode2 = xmi->xmi_ip2->i_ino;
xlf->xmi_igen2 = VFS_I(xmi->xmi_ip2)->i_generation;
xlf->xmi_startoff1 = xmi->xmi_startoff1;
xlf->xmi_startoff2 = xmi->xmi_startoff2;
xlf->xmi_blockcount = xmi->xmi_blockcount;
xlf->xmi_isize1 = xmi->xmi_isize1;
xlf->xmi_isize2 = xmi->xmi_isize2;
xlf->xmi_flags = xmi->xmi_flags & XFS_EXCHMAPS_LOGGED_FLAGS;
return &xmi_lip->xmi_item;
}
STATIC struct xfs_log_item *
xfs_exchmaps_create_done(
struct xfs_trans *tp,
struct xfs_log_item *intent,
unsigned int count)
{
struct xfs_xmi_log_item *xmi_lip = XMI_ITEM(intent);
struct xfs_xmd_log_item *xmd_lip;
xmd_lip = kmem_cache_zalloc(xfs_xmd_cache, GFP_KERNEL | __GFP_NOFAIL);
xfs_log_item_init(tp->t_mountp, &xmd_lip->xmd_item, XFS_LI_XMD,
&xfs_xmd_item_ops);
xmd_lip->xmd_intent_log_item = xmi_lip;
xmd_lip->xmd_format.xmd_xmi_id = xmi_lip->xmi_format.xmi_id;
return &xmd_lip->xmd_item;
}
/* Add this deferred XMI to the transaction. */
void
xfs_exchmaps_defer_add(
struct xfs_trans *tp,
struct xfs_exchmaps_intent *xmi)
{
trace_xfs_exchmaps_defer(tp->t_mountp, xmi);
xfs_defer_add(tp, &xmi->xmi_list, &xfs_exchmaps_defer_type);
}
static inline struct xfs_exchmaps_intent *xmi_entry(const struct list_head *e)
{
return list_entry(e, struct xfs_exchmaps_intent, xmi_list);
}
/* Cancel a deferred file mapping exchange. */
STATIC void
xfs_exchmaps_cancel_item(
struct list_head *item)
{
struct xfs_exchmaps_intent *xmi = xmi_entry(item);
kmem_cache_free(xfs_exchmaps_intent_cache, xmi);
}
/* Process a deferred file mapping exchange. */
STATIC int
xfs_exchmaps_finish_item(
struct xfs_trans *tp,
struct xfs_log_item *done,
struct list_head *item,
struct xfs_btree_cur **state)
{
struct xfs_exchmaps_intent *xmi = xmi_entry(item);
int error;
/*
* Exchange one more mapping between two files. If there's still more
* work to do, we want to requeue ourselves after all other pending
* deferred operations have finished. This includes all of the dfops
* that we queued directly as well as any new ones created in the
* process of finishing the others. Doing so prevents us from queuing
* a large number of XMI log items in kernel memory, which in turn
* prevents us from pinning the tail of the log (while logging those
* new XMI items) until the first XMI items can be processed.
*/
error = xfs_exchmaps_finish_one(tp, xmi);
if (error != -EAGAIN)
xfs_exchmaps_cancel_item(item);
return error;
}
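The finish handler above frees the incore intent unless the step reports `-EAGAIN`, which signals that more work remains. Below is a small userspace sketch of that contract, assuming each call handles one unit of work. The struct and function names are hypothetical, and the plain loop stands in for the deferred-operations machinery, which in the kernel requeues the item behind other pending work rather than calling it back immediately.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical progress tracker for a multi-step exchange. */
struct exch_work_sketch {
	uint64_t blocks_left; /* units of work still to do */
	uint64_t steps;       /* how many times we made progress */
};

/* Do at most one unit of work; return -EAGAIN if more remains. */
int finish_one_sketch(struct exch_work_sketch *w)
{
	if (w->blocks_left == 0)
		return 0;
	w->blocks_left--;
	w->steps++;
	return w->blocks_left ? -EAGAIN : 0;
}

/* Keep calling the step function until it stops asking to be requeued. */
int run_deferred_sketch(struct exch_work_sketch *w)
{
	int error;

	do {
		error = finish_one_sketch(w);
	} while (error == -EAGAIN);
	return error;
}
```

Doing one bounded step per call keeps each transaction small, which is the property the comment in `xfs_exchmaps_finish_item` relies on to avoid pinning the log tail.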
/* Abort all pending XMIs. */
STATIC void
xfs_exchmaps_abort_intent(
struct xfs_log_item *intent)
{
xfs_xmi_release(XMI_ITEM(intent));
}
/* Is this recovered XMI ok? */
static inline bool
xfs_xmi_validate(
struct xfs_mount *mp,
struct xfs_xmi_log_item *xmi_lip)
{
struct xfs_xmi_log_format *xlf = &xmi_lip->xmi_format;
if (!xfs_has_exchange_range(mp))
return false;
if (xmi_lip->xmi_format.__pad != 0)
return false;
if (xlf->xmi_flags & ~XFS_EXCHMAPS_LOGGED_FLAGS)
return false;
if (!xfs_verify_ino(mp, xlf->xmi_inode1) ||
!xfs_verify_ino(mp, xlf->xmi_inode2))
return false;
if (!xfs_verify_fileext(mp, xlf->xmi_startoff1, xlf->xmi_blockcount))
return false;
return xfs_verify_fileext(mp, xlf->xmi_startoff2, xlf->xmi_blockcount);
}
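Two of the checks in `xfs_xmi_validate` are pure bit tests: the padding must be zero and no flag outside the logged set may be present. The sketch below reproduces just those two checks in userspace. The flag values are copied from the log format hunk in this diff; the function name `recovered_flags_ok_sketch` is hypothetical, and the inode and file-extent checks are deliberately left out because they need a mounted filesystem.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in the log format hunk of this diff. */
#define XFS_EXCHMAPS_ATTR_FORK          (1ULL << 0)
#define XFS_EXCHMAPS_SET_SIZES          (1ULL << 1)
#define XFS_EXCHMAPS_INO1_WRITTEN       (1ULL << 2)
#define XFS_EXCHMAPS_CLEAR_INO1_REFLINK (1ULL << 3)
#define XFS_EXCHMAPS_CLEAR_INO2_REFLINK (1ULL << 4)

#define XFS_EXCHMAPS_LOGGED_FLAGS (XFS_EXCHMAPS_ATTR_FORK | \
				   XFS_EXCHMAPS_SET_SIZES | \
				   XFS_EXCHMAPS_INO1_WRITTEN | \
				   XFS_EXCHMAPS_CLEAR_INO1_REFLINK | \
				   XFS_EXCHMAPS_CLEAR_INO2_REFLINK)

/* Reject nonzero padding and any flag bit outside the logged set. */
bool recovered_flags_ok_sketch(uint32_t pad, uint64_t flags)
{
	if (pad != 0)
		return false;
	if (flags & ~XFS_EXCHMAPS_LOGGED_FLAGS)
		return false;
	return true;
}
```

Rejecting unknown bits at recovery time means a log written by a newer kernel with additional flags is reported as corrupt instead of being silently misinterpreted.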
/*
* Use the recovered log state to create a new request, estimate resource
* requirements, and create a new incore intent state.
*/
STATIC struct xfs_exchmaps_intent *
xfs_xmi_item_recover_intent(
struct xfs_mount *mp,
struct xfs_defer_pending *dfp,
const struct xfs_xmi_log_format *xlf,
struct xfs_exchmaps_req *req,
struct xfs_inode **ipp1,
struct xfs_inode **ipp2)
{
struct xfs_inode *ip1, *ip2;
struct xfs_exchmaps_intent *xmi;
int error;
/*
* Grab both inodes and set IRECOVERY to prevent trimming of post-eof
* mappings and freeing of unlinked inodes until we're totally done
* processing files. The ondisk format of this new log item contains
* file handle information, which is why recovery for other items does
* not check the inode generation number.
*/
error = xlog_recover_iget_handle(mp, xlf->xmi_inode1, xlf->xmi_igen1,
&ip1);
if (error) {
XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, xlf,
sizeof(*xlf));
return ERR_PTR(error);
}
error = xlog_recover_iget_handle(mp, xlf->xmi_inode2, xlf->xmi_igen2,
&ip2);
if (error) {
XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, xlf,
sizeof(*xlf));
goto err_rele1;
}
req->ip1 = ip1;
req->ip2 = ip2;
req->startoff1 = xlf->xmi_startoff1;
req->startoff2 = xlf->xmi_startoff2;
req->blockcount = xlf->xmi_blockcount;
req->flags = xlf->xmi_flags & XFS_EXCHMAPS_PARAMS;
xfs_exchrange_ilock(NULL, ip1, ip2);
error = xfs_exchmaps_estimate(req);
xfs_exchrange_iunlock(ip1, ip2);
if (error)
goto err_rele2;
*ipp1 = ip1;
*ipp2 = ip2;
xmi = xfs_exchmaps_init_intent(req);
xfs_defer_add_item(dfp, &xmi->xmi_list);
return xmi;
err_rele2:
xfs_irele(ip2);
err_rele1:
xfs_irele(ip1);
req->ip2 = req->ip1 = NULL;
return ERR_PTR(error);
}
/* Process a file mapping exchange item that was recovered from the log. */
STATIC int
xfs_exchmaps_recover_work(
struct xfs_defer_pending *dfp,
struct list_head *capture_list)
{
struct xfs_exchmaps_req req = { .flags = 0 };
struct xfs_trans_res resv;
struct xfs_exchmaps_intent *xmi;
struct xfs_log_item *lip = dfp->dfp_intent;
struct xfs_xmi_log_item *xmi_lip = XMI_ITEM(lip);
struct xfs_mount *mp = lip->li_log->l_mp;
struct xfs_trans *tp;
struct xfs_inode *ip1, *ip2;
int error = 0;
if (!xfs_xmi_validate(mp, xmi_lip)) {
XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
&xmi_lip->xmi_format,
sizeof(xmi_lip->xmi_format));
return -EFSCORRUPTED;
}
xmi = xfs_xmi_item_recover_intent(mp, dfp, &xmi_lip->xmi_format, &req,
&ip1, &ip2);
if (IS_ERR(xmi))
return PTR_ERR(xmi);
trace_xfs_exchmaps_recover(mp, xmi);
resv = xlog_recover_resv(&M_RES(mp)->tr_write);
error = xfs_trans_alloc(mp, &resv, req.resblks, 0, 0, &tp);
if (error)
goto err_rele;
xfs_exchrange_ilock(tp, ip1, ip2);
xfs_exchmaps_ensure_reflink(tp, xmi);
xfs_exchmaps_upgrade_extent_counts(tp, xmi);
error = xlog_recover_finish_intent(tp, dfp);
if (error == -EFSCORRUPTED)
XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
&xmi_lip->xmi_format,
sizeof(xmi_lip->xmi_format));
if (error)
goto err_cancel;
/*
* Commit transaction, which frees the transaction and saves the inodes
* for later replay activities.
*/
error = xfs_defer_ops_capture_and_commit(tp, capture_list);
goto err_unlock;
err_cancel:
xfs_trans_cancel(tp);
err_unlock:
xfs_exchrange_iunlock(ip1, ip2);
err_rele:
xfs_irele(ip2);
xfs_irele(ip1);
return error;
}
/* Relog an intent item to push the log tail forward. */
static struct xfs_log_item *
xfs_exchmaps_relog_intent(
struct xfs_trans *tp,
struct xfs_log_item *intent,
struct xfs_log_item *done_item)
{
struct xfs_xmi_log_item *xmi_lip;
struct xfs_xmi_log_format *old_xlf, *new_xlf;
old_xlf = &XMI_ITEM(intent)->xmi_format;
xmi_lip = xfs_xmi_init(tp->t_mountp);
new_xlf = &xmi_lip->xmi_format;
new_xlf->xmi_inode1 = old_xlf->xmi_inode1;
new_xlf->xmi_inode2 = old_xlf->xmi_inode2;
new_xlf->xmi_igen1 = old_xlf->xmi_igen1;
new_xlf->xmi_igen2 = old_xlf->xmi_igen2;
new_xlf->xmi_startoff1 = old_xlf->xmi_startoff1;
new_xlf->xmi_startoff2 = old_xlf->xmi_startoff2;
new_xlf->xmi_blockcount = old_xlf->xmi_blockcount;
new_xlf->xmi_flags = old_xlf->xmi_flags;
new_xlf->xmi_isize1 = old_xlf->xmi_isize1;
new_xlf->xmi_isize2 = old_xlf->xmi_isize2;
return &xmi_lip->xmi_item;
}
const struct xfs_defer_op_type xfs_exchmaps_defer_type = {
.name = "exchmaps",
.max_items = 1,
.create_intent = xfs_exchmaps_create_intent,
.abort_intent = xfs_exchmaps_abort_intent,
.create_done = xfs_exchmaps_create_done,
.finish_item = xfs_exchmaps_finish_item,
.cancel_item = xfs_exchmaps_cancel_item,
.recover_work = xfs_exchmaps_recover_work,
.relog_intent = xfs_exchmaps_relog_intent,
};
STATIC bool
xfs_xmi_item_match(
struct xfs_log_item *lip,
uint64_t intent_id)
{
return XMI_ITEM(lip)->xmi_format.xmi_id == intent_id;
}
static const struct xfs_item_ops xfs_xmi_item_ops = {
.flags = XFS_ITEM_INTENT,
.iop_size = xfs_xmi_item_size,
.iop_format = xfs_xmi_item_format,
.iop_unpin = xfs_xmi_item_unpin,
.iop_release = xfs_xmi_item_release,
.iop_match = xfs_xmi_item_match,
};
/*
* This routine is called to create an in-core file mapping exchange item from
* the xmi format structure which was logged on disk. It allocates an in-core
* xmi, copies the exchange information from the format structure into it, and
* adds the xmi to the AIL with the given LSN.
*/
STATIC int
xlog_recover_xmi_commit_pass2(
struct xlog *log,
struct list_head *buffer_list,
struct xlog_recover_item *item,
xfs_lsn_t lsn)
{
struct xfs_mount *mp = log->l_mp;
struct xfs_xmi_log_item *xmi_lip;
struct xfs_xmi_log_format *xmi_formatp;
size_t len;
len = sizeof(struct xfs_xmi_log_format);
if (item->ri_buf[0].i_len != len) {
XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
return -EFSCORRUPTED;
}
xmi_formatp = item->ri_buf[0].i_addr;
if (xmi_formatp->__pad != 0) {
XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
return -EFSCORRUPTED;
}
xmi_lip = xfs_xmi_init(mp);
memcpy(&xmi_lip->xmi_format, xmi_formatp, len);
xlog_recover_intent_item(log, &xmi_lip->xmi_item, lsn,
&xfs_exchmaps_defer_type);
return 0;
}
const struct xlog_recover_item_ops xlog_xmi_item_ops = {
.item_type = XFS_LI_XMI,
.commit_pass2 = xlog_recover_xmi_commit_pass2,
};
/*
* This routine is called when an XMD format structure is found in a committed
* transaction in the log. Its purpose is to cancel the corresponding XMI if it
* was still in the log. To do this it searches the AIL for the XMI with an id
* equal to that in the XMD format structure. If we find it we drop the XMD
* reference, which removes the XMI from the AIL and frees it.
*/
STATIC int
xlog_recover_xmd_commit_pass2(
struct xlog *log,
struct list_head *buffer_list,
struct xlog_recover_item *item,
xfs_lsn_t lsn)
{
struct xfs_xmd_log_format *xmd_formatp;
xmd_formatp = item->ri_buf[0].i_addr;
if (item->ri_buf[0].i_len != sizeof(struct xfs_xmd_log_format)) {
XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
return -EFSCORRUPTED;
}
xlog_recover_release_intent(log, XFS_LI_XMI, xmd_formatp->xmd_xmi_id);
return 0;
}
const struct xlog_recover_item_ops xlog_xmd_item_ops = {
.item_type = XFS_LI_XMD,
.commit_pass2 = xlog_recover_xmd_commit_pass2,
};
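During recovery, a done item cancels the intent whose id it carries; intents with no matching done item are the ones that still need to be replayed. The following userspace sketch illustrates that matching with a flat array. It is an illustration only: the kernel searches the AIL through `xlog_recover_release_intent`, and the names below are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of an intent recovered from the log. */
struct pending_intent_sketch {
	uint64_t id;   /* value logged in xmi_id */
	bool     live; /* still waiting for a done item? */
};

/* A done item arrived: cancel the live intent with the same id, if any. */
bool cancel_intent_sketch(struct pending_intent_sketch *tbl, size_t n,
			  uint64_t done_id)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].live && tbl[i].id == done_id) {
			tbl[i].live = false;
			return true;
		}
	}
	return false;
}

/* Count the intents that recovery would still have to finish. */
size_t count_live_sketch(const struct pending_intent_sketch *tbl, size_t n)
{
	size_t live = 0;

	for (size_t i = 0; i < n; i++)
		if (tbl[i].live)
			live++;
	return live;
}
```

A done item whose id matches nothing is simply ignored in this sketch, which corresponds to the case where the intent had already left the log before the crash.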
/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
* Copyright (c) 2020-2024 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#ifndef __XFS_EXCHMAPS_ITEM_H__
#define __XFS_EXCHMAPS_ITEM_H__
/*
* The file mapping exchange intent item helps us exchange multiple file
* mappings between two inode forks. It does this by tracking the range of
* file block offsets that still need to be exchanged, and relogs as progress
* happens.
*
* *I items should be recorded in the *first* of a series of rolled
* transactions, and the *D items should be recorded in the same transaction
* that records the associated bmbt updates.
*
* Should the system crash after the commit of the first transaction but
* before the commit of the final transaction in a series, log recovery will
* use the redo information recorded by the intent items to replay the
* rest of the mapping exchanges.
*/
/* kernel only XMI/XMD definitions */
struct xfs_mount;
struct kmem_cache;
/*
* This is the incore file mapping exchange intent log item. It is used to log
* the fact that we are exchanging mappings between two files. It is used in
* conjunction with the incore file mapping exchange done log item described
* below.
*
* These log items follow the same rules as struct xfs_efi_log_item; see the
* comments about that structure (in xfs_extfree_item.h) for more details.
*/
struct xfs_xmi_log_item {
struct xfs_log_item xmi_item;
atomic_t xmi_refcount;
struct xfs_xmi_log_format xmi_format;
};
/*
* This is the incore file mapping exchange done log item. It is used to log
* the fact that an exchange mentioned in an earlier xmi item has been
* performed.
*/
struct xfs_xmd_log_item {
struct xfs_log_item xmd_item;
struct xfs_xmi_log_item *xmd_intent_log_item;
struct xfs_xmd_log_format xmd_format;
};
extern struct kmem_cache *xfs_xmi_cache;
extern struct kmem_cache *xfs_xmd_cache;
struct xfs_exchmaps_intent;
void xfs_exchmaps_defer_add(struct xfs_trans *tp,
struct xfs_exchmaps_intent *xmi);
#endif /* __XFS_EXCHMAPS_ITEM_H__ */
// SPDX-License-Identifier: GPL-2.0-or-later
/*
* Copyright (c) 2020-2024 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#include "xfs.h"
#include "xfs_shared.h"
#include "xfs_format.h"
#include "xfs_log_format.h"
#include "xfs_trans_resv.h"
#include "xfs_mount.h"
#include "xfs_defer.h"
#include "xfs_inode.h"
#include "xfs_trans.h"
#include "xfs_quota.h"
#include "xfs_bmap_util.h"
#include "xfs_reflink.h"
#include "xfs_trace.h"
#include "xfs_exchrange.h"
#include "xfs_exchmaps.h"
#include "xfs_sb.h"
#include "xfs_icache.h"
#include "xfs_log.h"
#include "xfs_rtbitmap.h"
#include <linux/fsnotify.h>
/* Lock (and optionally join) two inodes for a file range exchange. */
void
xfs_exchrange_ilock(
struct xfs_trans *tp,
struct xfs_inode *ip1,
struct xfs_inode *ip2)
{
if (ip1 != ip2)
xfs_lock_two_inodes(ip1, XFS_ILOCK_EXCL,
ip2, XFS_ILOCK_EXCL);
else
xfs_ilock(ip1, XFS_ILOCK_EXCL);
if (tp) {
xfs_trans_ijoin(tp, ip1, 0);
if (ip2 != ip1)
xfs_trans_ijoin(tp, ip2, 0);
}
}
/* Unlock two inodes after a file range exchange operation. */
void
xfs_exchrange_iunlock(
struct xfs_inode *ip1,
struct xfs_inode *ip2)
{
if (ip2 != ip1)
xfs_iunlock(ip2, XFS_ILOCK_EXCL);
xfs_iunlock(ip1, XFS_ILOCK_EXCL);
}
/*
* Estimate the resource requirements to exchange file contents between the two
* files. The caller is required to hold the IOLOCK and the MMAPLOCK and to
* have flushed both inodes' pagecache and active direct-ios.
*/
int
xfs_exchrange_estimate(
struct xfs_exchmaps_req *req)
{
int error;
xfs_exchrange_ilock(NULL, req->ip1, req->ip2);
error = xfs_exchmaps_estimate(req);
xfs_exchrange_iunlock(req->ip1, req->ip2);
return error;
}
#define QRETRY_IP1 (0x1)
#define QRETRY_IP2 (0x2)
/*
* Obtain a quota reservation to make sure we don't hit EDQUOT. We can skip
* this if quota enforcement is disabled or if both inodes' dquots are the
* same. The qretry structure must be initialized to zeroes before the first
* call to this function.
*/
STATIC int
xfs_exchrange_reserve_quota(
struct xfs_trans *tp,
const struct xfs_exchmaps_req *req,
unsigned int *qretry)
{
int64_t ddelta, rdelta;
int ip1_error = 0;
int error;
/*
* Don't bother with a quota reservation if we're not enforcing them
* or the two inodes have the same dquots.
*/
if (!XFS_IS_QUOTA_ON(tp->t_mountp) || req->ip1 == req->ip2 ||
(req->ip1->i_udquot == req->ip2->i_udquot &&
req->ip1->i_gdquot == req->ip2->i_gdquot &&
req->ip1->i_pdquot == req->ip2->i_pdquot))
return 0;
*qretry = 0;
/*
* For each file, compute the net gain in the number of regular blocks
* that will be mapped into that file and reserve that much quota. The
* quota counts must be able to absorb at least that much space.
*/
ddelta = req->ip2_bcount - req->ip1_bcount;
rdelta = req->ip2_rtbcount - req->ip1_rtbcount;
if (ddelta > 0 || rdelta > 0) {
error = xfs_trans_reserve_quota_nblks(tp, req->ip1,
ddelta > 0 ? ddelta : 0,
rdelta > 0 ? rdelta : 0,
false);
if (error == -EDQUOT || error == -ENOSPC) {
/*
* Save this error and see what happens if we try to
* reserve quota for ip2. Then report both.
*/
*qretry |= QRETRY_IP1;
ip1_error = error;
error = 0;
}
if (error)
return error;
}
if (ddelta < 0 || rdelta < 0) {
error = xfs_trans_reserve_quota_nblks(tp, req->ip2,
ddelta < 0 ? -ddelta : 0,
rdelta < 0 ? -rdelta : 0,
false);
if (error == -EDQUOT || error == -ENOSPC)
*qretry |= QRETRY_IP2;
if (error)
return error;
}
if (ip1_error)
return ip1_error;
/*
* For each file, forcibly reserve the gross gain in mapped blocks so
* that we don't trip over any quota block reservation assertions.
* We must reserve the gross gain because the quota code subtracts from
* bcount the number of blocks that we unmap; it does not add that
* quantity back to the quota block reservation.
*/
error = xfs_trans_reserve_quota_nblks(tp, req->ip1, req->ip1_bcount,
req->ip1_rtbcount, true);
if (error)
return error;
return xfs_trans_reserve_quota_nblks(tp, req->ip2, req->ip2_bcount,
req->ip2_rtbcount, true);
}
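The first stage of the quota reservation above depends only on the sign of the block-count difference: whichever file ends up with more blocks than it started with needs that net gain reserved. Here is a minimal userspace sketch of that arithmetic for the data-device counts alone. The struct and function names are hypothetical, and the realtime counts, error handling, and the second "gross gain" reservation are omitted.

```c
#include <stdint.h>

/* Hypothetical result: how many blocks each file must be able to absorb. */
struct quota_need_sketch {
	int64_t ip1_blocks;
	int64_t ip2_blocks;
};

/*
 * After the exchange, file1 holds what file2 had and vice versa. So file1's
 * net gain is ip2_bcount - ip1_bcount, and file2's net gain is the negation.
 * Only a positive net gain needs a reservation.
 */
struct quota_need_sketch net_quota_need_sketch(int64_t ip1_bcount,
					       int64_t ip2_bcount)
{
	struct quota_need_sketch need = { 0, 0 };
	int64_t ddelta = ip2_bcount - ip1_bcount;

	if (ddelta > 0)
		need.ip1_blocks = ddelta;
	else if (ddelta < 0)
		need.ip2_blocks = -ddelta;
	return need;
}
```

For example, exchanging a 10-block range in file1 with a 25-block range in file2 requires file1's quota to absorb 15 more blocks, while file2 needs no net reservation.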
/* Exchange the mappings (and hence the contents) of two files' forks. */
STATIC int
xfs_exchrange_mappings(
const struct xfs_exchrange *fxr,
struct xfs_inode *ip1,
struct xfs_inode *ip2)
{
struct xfs_mount *mp = ip1->i_mount;
struct xfs_exchmaps_req req = {
.ip1 = ip1,
.ip2 = ip2,
.startoff1 = XFS_B_TO_FSBT(mp, fxr->file1_offset),
.startoff2 = XFS_B_TO_FSBT(mp, fxr->file2_offset),
.blockcount = XFS_B_TO_FSB(mp, fxr->length),
};
struct xfs_trans *tp;
unsigned int qretry;
bool retried = false;
int error;
trace_xfs_exchrange_mappings(fxr, ip1, ip2);
if (fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF)
req.flags |= XFS_EXCHMAPS_SET_SIZES;
if (fxr->flags & XFS_EXCHANGE_RANGE_FILE1_WRITTEN)
req.flags |= XFS_EXCHMAPS_INO1_WRITTEN;
/*
* Round the request length up to the nearest file allocation unit.
* The prep function already checked that the request offsets and
* length in @fxr are safe to round up.
*/
if (xfs_inode_has_bigrtalloc(ip2))
req.blockcount = xfs_rtb_roundup_rtx(mp, req.blockcount);
error = xfs_exchrange_estimate(&req);
if (error)
return error;
retry:
/* Allocate the transaction, lock the inodes, and join them. */
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, req.resblks, 0,
XFS_TRANS_RES_FDBLKS, &tp);
if (error)
return error;
xfs_exchrange_ilock(tp, ip1, ip2);
trace_xfs_exchrange_before(ip2, 2);
trace_xfs_exchrange_before(ip1, 1);
error = xfs_exchmaps_check_forks(mp, &req);
if (error)
goto out_trans_cancel;
/*
* Reserve ourselves some quota if any of them are in enforcing mode.
* In theory we only need enough to satisfy the change in the number
* of blocks between the two ranges being remapped.
*/
error = xfs_exchrange_reserve_quota(tp, &req, &qretry);
if ((error == -EDQUOT || error == -ENOSPC) && !retried) {
xfs_trans_cancel(tp);
xfs_exchrange_iunlock(ip1, ip2);
if (qretry & QRETRY_IP1)
xfs_blockgc_free_quota(ip1, 0);
if (qretry & QRETRY_IP2)
xfs_blockgc_free_quota(ip2, 0);
retried = true;
goto retry;
}
if (error)
goto out_trans_cancel;
/* If we got this far on a dry run, all parameters are ok. */
if (fxr->flags & XFS_EXCHANGE_RANGE_DRY_RUN)
goto out_trans_cancel;
/* Update the mtime and ctime of both files. */
if (fxr->flags & __XFS_EXCHANGE_RANGE_UPD_CMTIME1)
xfs_trans_ichgtime(tp, ip1, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
if (fxr->flags & __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
xfs_trans_ichgtime(tp, ip2, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
xfs_exchange_mappings(tp, &req);
/*
* Force the log to persist metadata updates if the caller or the
* administrator requires this. The generic prep function already
* flushed the relevant parts of the page cache.
*/
if (xfs_has_wsync(mp) || (fxr->flags & XFS_EXCHANGE_RANGE_DSYNC))
xfs_trans_set_sync(tp);
error = xfs_trans_commit(tp);
trace_xfs_exchrange_after(ip2, 2);
trace_xfs_exchrange_after(ip1, 1);
if (error)
goto out_unlock;
/*
* If the caller wanted us to exchange the contents of two complete
* files of unequal length, exchange the incore sizes now. This should
* be safe because we flushed both files' page caches, exchanged all
* the mappings, and updated the ondisk sizes.
*/
if (fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF) {
loff_t temp;
temp = i_size_read(VFS_I(ip2));
i_size_write(VFS_I(ip2), i_size_read(VFS_I(ip1)));
i_size_write(VFS_I(ip1), temp);
}
out_unlock:
xfs_exchrange_iunlock(ip1, ip2);
return error;
out_trans_cancel:
xfs_trans_cancel(tp);
goto out_unlock;
}
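/*
 * Illustration only, not part of the patch: a userspace sketch of the
 * "retry exactly once after reclaiming space" pattern that
 * xfs_exchrange_mappings() applies when the quota reservation fails with
 * EDQUOT or ENOSPC. struct fake_quota and the fake_* callbacks are
 * hypothetical stand-ins for the transaction, the quota reservation, and
 * xfs_blockgc_free_quota().
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a quota with some space that can be reclaimed. */
struct fake_quota {
	int free_blocks;	/* blocks available right now */
	int reclaimable;	/* blocks a garbage-collection pass can free */
	int want;		/* blocks the caller needs */
	int attempts;		/* how many reservation attempts were made */
};

static int fake_reserve(struct fake_quota *q)
{
	q->attempts++;
	if (q->free_blocks < q->want)
		return -EDQUOT;
	q->free_blocks -= q->want;
	return 0;
}

static void fake_reclaim(struct fake_quota *q)
{
	q->free_blocks += q->reclaimable;
	q->reclaimable = 0;
}

/* Try to reserve; on a space error, reclaim once and try one more time. */
static int reserve_with_one_retry(struct fake_quota *q)
{
	bool retried = false;
	int error;

	assert(q);
retry:
	error = fake_reserve(q);
	if ((error == -EDQUOT || error == -ENOSPC) && !retried) {
		fake_reclaim(q);
		retried = true;
		goto retry;
	}
	return error;
}
```

/*
 * The retried flag bounds the loop: a second space error is returned to the
 * caller instead of triggering another reclaim pass.
 */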
/*
* Generic code for exchanging ranges of two files via XFS_IOC_EXCHANGE_RANGE.
* This part deals with struct file objects and byte ranges and does not deal
* with XFS-specific data structures such as xfs_inodes and block ranges. This
* separation may some day facilitate porting to another filesystem.
*
* The goal is to exchange fxr.length bytes starting at fxr.file1_offset in
* file1 with the same number of bytes starting at fxr.file2_offset in file2.
* Implementations must call xfs_exchange_range_prep to prepare the two
* files prior to taking locks; and they must update the inode change and mod
* times of both files as part of the metadata update. The timestamp update
* and freshness checks must be done atomically as part of the data exchange
* operation to ensure correctness of the freshness check.
* xfs_exchange_range_finish must be called after the operation completes
* successfully but before locks are dropped.
*/
/* Verify that we have security clearance to perform this operation. */
static int
xfs_exchange_range_verify_area(
struct xfs_exchrange *fxr)
{
int ret;
ret = remap_verify_area(fxr->file1, fxr->file1_offset, fxr->length,
true);
if (ret)
return ret;
return remap_verify_area(fxr->file2, fxr->file2_offset, fxr->length,
true);
}
/*
* Performs necessary checks before doing a range exchange, having stabilized
* mutable inode attributes via i_rwsem.
*/
static inline int
xfs_exchange_range_checks(
struct xfs_exchrange *fxr,
unsigned int alloc_unit)
{
struct inode *inode1 = file_inode(fxr->file1);
struct inode *inode2 = file_inode(fxr->file2);
uint64_t allocmask = alloc_unit - 1;
int64_t test_len;
uint64_t blen;
loff_t size1, size2, tmp;
int error;
/* Don't touch certain kinds of inodes */
if (IS_IMMUTABLE(inode1) || IS_IMMUTABLE(inode2))
return -EPERM;
if (IS_SWAPFILE(inode1) || IS_SWAPFILE(inode2))
return -ETXTBSY;
size1 = i_size_read(inode1);
size2 = i_size_read(inode2);
/* Ranges cannot start after EOF. */
if (fxr->file1_offset > size1 || fxr->file2_offset > size2)
return -EINVAL;
/*
* If the caller said to exchange to EOF, we set the length of the
* request large enough to cover everything to the end of both files.
*/
if (fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF) {
fxr->length = max_t(int64_t, size1 - fxr->file1_offset,
size2 - fxr->file2_offset);
error = xfs_exchange_range_verify_area(fxr);
if (error)
return error;
}
/*
* The start of both ranges must be aligned to the file allocation
* unit.
*/
if (!IS_ALIGNED(fxr->file1_offset, alloc_unit) ||
!IS_ALIGNED(fxr->file2_offset, alloc_unit))
return -EINVAL;
/* Ensure offsets don't wrap. */
if (check_add_overflow(fxr->file1_offset, fxr->length, &tmp) ||
check_add_overflow(fxr->file2_offset, fxr->length, &tmp))
return -EINVAL;
/*
* We require both ranges to end within EOF, unless we're exchanging
* to EOF.
*/
if (!(fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF) &&
(fxr->file1_offset + fxr->length > size1 ||
fxr->file2_offset + fxr->length > size2))
return -EINVAL;
/*
* Make sure we don't hit any file size limits. If we hit any size
* limits such that test_len was adjusted, we abort the whole
* operation.
*/
test_len = fxr->length;
error = generic_write_check_limits(fxr->file2, fxr->file2_offset,
&test_len);
if (error)
return error;
error = generic_write_check_limits(fxr->file1, fxr->file1_offset,
&test_len);
if (error)
return error;
if (test_len != fxr->length)
return -EINVAL;
/*
* If the user wanted us to exchange up to the infile's EOF, round up
* to the next allocation unit boundary for this check. Do the same
* for the outfile.
*
* Otherwise, reject the range length if it's not aligned to an
* allocation unit.
*/
if (fxr->file1_offset + fxr->length == size1)
blen = ALIGN(size1, alloc_unit) - fxr->file1_offset;
else if (fxr->file2_offset + fxr->length == size2)
blen = ALIGN(size2, alloc_unit) - fxr->file2_offset;
else if (!IS_ALIGNED(fxr->length, alloc_unit))
return -EINVAL;
else
blen = fxr->length;
/* Don't allow overlapped exchanges within the same file. */
if (inode1 == inode2 &&
fxr->file2_offset + blen > fxr->file1_offset &&
fxr->file1_offset + blen > fxr->file2_offset)
return -EINVAL;
/*
* Ensure that we don't exchange a partial EOF block into the middle of
* another file.
*/
if ((fxr->length & allocmask) == 0)
return 0;
blen = fxr->length;
if (fxr->file2_offset + blen < size2)
blen &= ~allocmask;
if (fxr->file1_offset + blen < size1)
blen &= ~allocmask;
return blen == fxr->length ? 0 : -EINVAL;
}
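/*
 * Illustration only, not part of the patch: a simplified userspace sketch of
 * three of the checks above (power-of-two alignment, offset wraparound, and
 * same-file overlap). It deliberately ignores the EOF rounding cases. The
 * function names are hypothetical, and __builtin_add_overflow is a GCC/Clang
 * builtin used here in place of the kernel's check_add_overflow().
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* True if @x is a multiple of @unit; @unit must be a power of two. */
static bool is_aligned_pow2(uint64_t x, uint64_t unit)
{
	assert(unit != 0 && (unit & (unit - 1)) == 0);
	return (x & (unit - 1)) == 0;
}

/* True if [off1, off1 + len) and [off2, off2 + len) share any byte. */
static bool ranges_overlap(uint64_t off1, uint64_t off2, uint64_t len)
{
	return off2 + len > off1 && off1 + len > off2;
}

/* Validate an exchange between two ranges of the same file. */
static int check_same_file_request(uint64_t off1, uint64_t off2,
				   uint64_t len, uint64_t unit)
{
	uint64_t end;

	if (!is_aligned_pow2(off1, unit) || !is_aligned_pow2(off2, unit) ||
	    !is_aligned_pow2(len, unit))
		return -EINVAL;

	/* Reject requests whose end offset would wrap around. */
	if (__builtin_add_overflow(off1, len, &end) ||
	    __builtin_add_overflow(off2, len, &end))
		return -EINVAL;

	if (ranges_overlap(off1, off2, len))
		return -EINVAL;
	return 0;
}
```

/*
 * The overflow check runs before the overlap test so that the additions
 * inside ranges_overlap() cannot wrap.
 */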
/*
* Check that the two inodes are eligible for range exchanges, the ranges make
* sense, and then flush all dirty data. Caller must ensure that the inodes
* have been locked against any other modifications.
*/
static inline int
xfs_exchange_range_prep(
struct xfs_exchrange *fxr,
unsigned int alloc_unit)
{
struct inode *inode1 = file_inode(fxr->file1);
struct inode *inode2 = file_inode(fxr->file2);
bool same_inode = (inode1 == inode2);
int error;
/* Check that we don't violate system file offset limits. */
error = xfs_exchange_range_checks(fxr, alloc_unit);
if (error || fxr->length == 0)
return error;
/* Wait for the completion of any pending IOs on both files */
inode_dio_wait(inode1);
if (!same_inode)
inode_dio_wait(inode2);
error = filemap_write_and_wait_range(inode1->i_mapping,
fxr->file1_offset,
fxr->file1_offset + fxr->length - 1);
if (error)
return error;
error = filemap_write_and_wait_range(inode2->i_mapping,
fxr->file2_offset,
fxr->file2_offset + fxr->length - 1);
if (error)
return error;
/*
* If the files or inodes involved require synchronous writes, amend
* the request to force the filesystem to flush all data and metadata
* to disk after the operation completes.
*/
if (((fxr->file1->f_flags | fxr->file2->f_flags) & O_SYNC) ||
IS_SYNC(inode1) || IS_SYNC(inode2))
fxr->flags |= XFS_EXCHANGE_RANGE_DSYNC;
return 0;
}
/*
* Finish a range exchange operation, if it was successful. Caller must ensure
* that the inodes are still locked against any other modifications.
*/
static inline int
xfs_exchange_range_finish(
struct xfs_exchrange *fxr)
{
int error;
error = file_remove_privs(fxr->file1);
if (error)
return error;
if (file_inode(fxr->file1) == file_inode(fxr->file2))
return 0;
return file_remove_privs(fxr->file2);
}
/*
* Check the alignment of an exchange request when the allocation unit size
* isn't a power of two. The generic file-level helpers use (fast)
* bitmask-based alignment checks, but here we have to use slow long division.
*/
static int
xfs_exchrange_check_rtalign(
const struct xfs_exchrange *fxr,
struct xfs_inode *ip1,
struct xfs_inode *ip2,
unsigned int alloc_unit)
{
uint64_t length = fxr->length;
uint64_t blen;
loff_t size1, size2;
size1 = i_size_read(VFS_I(ip1));
size2 = i_size_read(VFS_I(ip2));
/* The start of both ranges must be aligned to a rt extent. */
if (!isaligned_64(fxr->file1_offset, alloc_unit) ||
!isaligned_64(fxr->file2_offset, alloc_unit))
return -EINVAL;
if (fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF)
length = max_t(int64_t, size1 - fxr->file1_offset,
size2 - fxr->file2_offset);
/*
* If the user wanted us to exchange up to the infile's EOF, round up
* to the next rt extent boundary for this check. Do the same for the
* outfile.
*
* Otherwise, reject the range length if it's not rt extent aligned.
* We already confirmed the starting offsets' rt extent block
* alignment.
*/
if (fxr->file1_offset + length == size1)
blen = roundup_64(size1, alloc_unit) - fxr->file1_offset;
else if (fxr->file2_offset + length == size2)
blen = roundup_64(size2, alloc_unit) - fxr->file2_offset;
else if (!isaligned_64(length, alloc_unit))
return -EINVAL;
else
blen = length;
/* Don't allow overlapped exchanges within the same file. */
if (ip1 == ip2 &&
fxr->file2_offset + blen > fxr->file1_offset &&
fxr->file1_offset + blen > fxr->file2_offset)
return -EINVAL;
/*
* Ensure that we don't exchange a partial EOF rt extent into the
* middle of another file.
*/
if (isaligned_64(length, alloc_unit))
return 0;
blen = length;
if (fxr->file2_offset + length < size2)
blen = rounddown_64(blen, alloc_unit);
if (fxr->file1_offset + blen < size1)
blen = rounddown_64(blen, alloc_unit);
return blen == length ? 0 : -EINVAL;
}
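/*
 * Illustration only, not part of the patch: userspace stand-ins for the
 * division-based helpers that xfs_exchrange_check_rtalign() relies on when
 * the allocation unit is not a power of two. The *_u64 names are
 * hypothetical; they are not the kernel's isaligned_64(), roundup_64(), or
 * rounddown_64(), only sketches of the same arithmetic.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if @x is a multiple of @unit (any nonzero unit, not just 2^n). */
static bool isaligned_u64(uint64_t x, uint32_t unit)
{
	assert(unit != 0);
	return x % unit == 0;
}

/* Largest multiple of @unit that is <= @x. */
static uint64_t rounddown_u64(uint64_t x, uint32_t unit)
{
	assert(unit != 0);
	return x - (x % unit);
}

/*
 * Smallest multiple of @unit that is >= @x. The caller must ensure that
 * x + unit - 1 does not overflow.
 */
static uint64_t roundup_u64(uint64_t x, uint32_t unit)
{
	return rounddown_u64(x + unit - 1, unit);
}
```

/*
 * A bitmask test such as (x & (unit - 1)) gives wrong answers for a unit
 * like 12288 (three 4096-byte blocks), which is why the modulo form is
 * needed in that case.
 */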
/* Prepare two files to have their data exchanged. */
STATIC int
xfs_exchrange_prep(
struct xfs_exchrange *fxr,
struct xfs_inode *ip1,
struct xfs_inode *ip2)
{
struct xfs_mount *mp = ip2->i_mount;
unsigned int alloc_unit = xfs_inode_alloc_unitsize(ip2);
int error;
trace_xfs_exchrange_prep(fxr, ip1, ip2);
/* Verify both files are either real-time or non-realtime */
if (XFS_IS_REALTIME_INODE(ip1) != XFS_IS_REALTIME_INODE(ip2))
return -EINVAL;
/* Check non-power of two alignment issues, if necessary. */
if (!is_power_of_2(alloc_unit)) {
error = xfs_exchrange_check_rtalign(fxr, ip1, ip2, alloc_unit);
if (error)
return error;
/*
* Do the generic file-level checks with the regular block
* alignment.
*/
alloc_unit = mp->m_sb.sb_blocksize;
}
error = xfs_exchange_range_prep(fxr, alloc_unit);
if (error || fxr->length == 0)
return error;
/* Attach dquots to both inodes before changing block maps. */
error = xfs_qm_dqattach(ip2);
if (error)
return error;
error = xfs_qm_dqattach(ip1);
if (error)
return error;
trace_xfs_exchrange_flush(fxr, ip1, ip2);
/* Flush the relevant ranges of both files. */
error = xfs_flush_unmap_range(ip2, fxr->file2_offset, fxr->length);
if (error)
return error;
error = xfs_flush_unmap_range(ip1, fxr->file1_offset, fxr->length);
if (error)
return error;
/*
* Cancel CoW fork preallocations for the ranges of both files. The
* prep function should have flushed all the dirty data, so the only
* CoW mappings remaining should be speculative.
*/
if (xfs_inode_has_cow_data(ip1)) {
error = xfs_reflink_cancel_cow_range(ip1, fxr->file1_offset,
fxr->length, true);
if (error)
return error;
}
if (xfs_inode_has_cow_data(ip2)) {
error = xfs_reflink_cancel_cow_range(ip2, fxr->file2_offset,
fxr->length, true);
if (error)
return error;
}
return 0;
}
/*
* Exchange contents of files. This is the binding between the generic
* file-level concepts and the XFS inode-specific implementation.
*/
STATIC int
xfs_exchrange_contents(
struct xfs_exchrange *fxr)
{
struct inode *inode1 = file_inode(fxr->file1);
struct inode *inode2 = file_inode(fxr->file2);
struct xfs_inode *ip1 = XFS_I(inode1);
struct xfs_inode *ip2 = XFS_I(inode2);
struct xfs_mount *mp = ip1->i_mount;
int error;
if (!xfs_has_exchange_range(mp))
return -EOPNOTSUPP;
if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
XFS_EXCHANGE_RANGE_PRIV_FLAGS))
return -EINVAL;
if (xfs_is_shutdown(mp))
return -EIO;
/* Lock both files against IO */
error = xfs_ilock2_io_mmap(ip1, ip2);
if (error)
goto out_err;
/* Prepare and then exchange file contents. */
error = xfs_exchrange_prep(fxr, ip1, ip2);
if (error)
goto out_unlock;
error = xfs_exchrange_mappings(fxr, ip1, ip2);
if (error)
goto out_unlock;
/*
* Finish the exchange by removing special file privileges like any
* other file write would do. This may involve turning on support for
* logged xattrs if either file has security capabilities.
*/
error = xfs_exchange_range_finish(fxr);
if (error)
goto out_unlock;
out_unlock:
xfs_iunlock2_io_mmap(ip1, ip2);
out_err:
if (error)
trace_xfs_exchrange_error(ip2, error, _RET_IP_);
return error;
}
/* Exchange parts of two files. */
static int
xfs_exchange_range(
struct xfs_exchrange *fxr)
{
struct inode *inode1 = file_inode(fxr->file1);
struct inode *inode2 = file_inode(fxr->file2);
int ret;
BUILD_BUG_ON(XFS_EXCHANGE_RANGE_ALL_FLAGS &
XFS_EXCHANGE_RANGE_PRIV_FLAGS);
/* Both files must be on the same mount/filesystem. */
if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
return -EXDEV;
if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
return -EINVAL;
/* Userspace requests only honored for regular files. */
if (S_ISDIR(inode1->i_mode) || S_ISDIR(inode2->i_mode))
return -EISDIR;
if (!S_ISREG(inode1->i_mode) || !S_ISREG(inode2->i_mode))
return -EINVAL;
/* Both files must be opened for read and write. */
if (!(fxr->file1->f_mode & FMODE_READ) ||
!(fxr->file1->f_mode & FMODE_WRITE) ||
!(fxr->file2->f_mode & FMODE_READ) ||
!(fxr->file2->f_mode & FMODE_WRITE))
return -EBADF;
/* Neither file can be opened append-only. */
if ((fxr->file1->f_flags & O_APPEND) ||
(fxr->file2->f_flags & O_APPEND))
return -EBADF;
/*
* If we're not exchanging to EOF, we can check the areas before
* stabilizing both files' i_size.
*/
if (!(fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF)) {
ret = xfs_exchange_range_verify_area(fxr);
if (ret)
return ret;
}
/* Update cmtime if the fd/inode don't forbid it. */
if (!(fxr->file1->f_mode & FMODE_NOCMTIME) && !IS_NOCMTIME(inode1))
fxr->flags |= __XFS_EXCHANGE_RANGE_UPD_CMTIME1;
if (!(fxr->file2->f_mode & FMODE_NOCMTIME) && !IS_NOCMTIME(inode2))
fxr->flags |= __XFS_EXCHANGE_RANGE_UPD_CMTIME2;
file_start_write(fxr->file2);
ret = xfs_exchrange_contents(fxr);
file_end_write(fxr->file2);
if (ret)
return ret;
fsnotify_modify(fxr->file1);
if (fxr->file2 != fxr->file1)
fsnotify_modify(fxr->file2);
return 0;
}
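/*
 * Illustration only, not part of the patch: a userspace sketch of how the
 * public and private flag masks are kept apart. The private bit positions
 * (63 and 62) match xfs_exchrange.h in this series; the public DEMO_* values
 * are assumptions made for this example, since the real ones are defined in
 * the XFS uapi header, which is not shown here.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed public flag bits (hypothetical values for this sketch). */
#define DEMO_TO_EOF		(1ULL << 0)
#define DEMO_DSYNC		(1ULL << 1)
#define DEMO_DRY_RUN		(1ULL << 2)
#define DEMO_FILE1_WRITTEN	(1ULL << 3)
#define DEMO_ALL_FLAGS		(DEMO_TO_EOF | DEMO_DSYNC | \
				 DEMO_DRY_RUN | DEMO_FILE1_WRITTEN)

/* Kernel-internal bits, placed at the top of the 64-bit flag word. */
#define DEMO_UPD_CMTIME1	(1ULL << 63)
#define DEMO_UPD_CMTIME2	(1ULL << 62)
#define DEMO_PRIV_FLAGS		(DEMO_UPD_CMTIME1 | DEMO_UPD_CMTIME2)

/* Compile-time analogue of the BUILD_BUG_ON in xfs_exchange_range(). */
static_assert((DEMO_ALL_FLAGS & DEMO_PRIV_FLAGS) == 0,
	      "public and private flags must not overlap");

/* Entry-point check: callers may only pass public flags. */
static int check_user_flags(uint64_t flags)
{
	return (flags & ~DEMO_ALL_FLAGS) ? -EINVAL : 0;
}

/* Internal check: public and private flags are both acceptable. */
static int check_internal_flags(uint64_t flags)
{
	return (flags & ~(DEMO_ALL_FLAGS | DEMO_PRIV_FLAGS)) ? -EINVAL : 0;
}
```

/*
 * Because the entry point rejects the private bits, only kernel code can set
 * them, which it does after validation and before the internal check runs.
 */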
/* Collect exchange-range arguments from userspace. */
long
xfs_ioc_exchange_range(
struct file *file,
struct xfs_exchange_range __user *argp)
{
struct xfs_exchrange fxr = {
.file2 = file,
};
struct xfs_exchange_range args;
struct fd file1;
int error;
if (copy_from_user(&args, argp, sizeof(args)))
return -EFAULT;
if (memchr_inv(&args.pad, 0, sizeof(args.pad)))
return -EINVAL;
if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
return -EINVAL;
fxr.file1_offset = args.file1_offset;
fxr.file2_offset = args.file2_offset;
fxr.length = args.length;
fxr.flags = args.flags;
file1 = fdget(args.file1_fd);
if (!file1.file)
return -EBADF;
fxr.file1 = file1.file;
error = xfs_exchange_range(&fxr);
fdput(file1);
return error;
}
/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
* Copyright (c) 2020-2024 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#ifndef __XFS_EXCHRANGE_H__
#define __XFS_EXCHRANGE_H__
/* Update the mtime and ctime of file1 and file2 */
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME1 (1ULL << 63)
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME2 (1ULL << 62)
#define XFS_EXCHANGE_RANGE_PRIV_FLAGS (__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
__XFS_EXCHANGE_RANGE_UPD_CMTIME2)
struct xfs_exchrange {
struct file *file1;
struct file *file2;
loff_t file1_offset;
loff_t file2_offset;
u64 length;
u64 flags; /* XFS_EXCHANGE_RANGE flags */
};
long xfs_ioc_exchange_range(struct file *file,
struct xfs_exchange_range __user *argp);
struct xfs_exchmaps_req;
void xfs_exchrange_ilock(struct xfs_trans *tp, struct xfs_inode *ip1,
struct xfs_inode *ip2);
void xfs_exchrange_iunlock(struct xfs_inode *ip1, struct xfs_inode *ip2);
int xfs_exchrange_estimate(struct xfs_exchmaps_req *req);
#endif /* __XFS_EXCHRANGE_H__ */
@@ -40,6 +40,7 @@
 #include "xfs_xattr.h"
 #include "xfs_rtbitmap.h"
 #include "xfs_file.h"
+#include "xfs_exchrange.h"
 #include <linux/mount.h>
 #include <linux/namei.h>
@@ -2170,6 +2171,9 @@ xfs_file_ioctl(
 return error;
 }
+case XFS_IOC_EXCHANGE_RANGE:
+return xfs_ioc_exchange_range(filp, arg);
 default:
 return -ENOTTY;
 }
......
@@ -1767,6 +1767,37 @@ xlog_recover_iget(
 return 0;
 }
+/*
+* Get an inode so that we can recover a log operation.
+*
+* Log intent items that target inodes effectively contain a file handle.
+* Check that the generation number matches the intent item like we do for
+* other file handles. Log intent items defined after this validation weakness
+* was identified must use this function.
+*/
+int
+xlog_recover_iget_handle(
+struct xfs_mount *mp,
+xfs_ino_t ino,
+uint32_t gen,
+struct xfs_inode **ipp)
+{
+struct xfs_inode *ip;
+int error;
+error = xlog_recover_iget(mp, ino, &ip);
+if (error)
+return error;
+if (VFS_I(ip)->i_generation != gen) {
+xfs_irele(ip);
+return -EFSCORRUPTED;
+}
+*ipp = ip;
+return 0;
+}
 /******************************************************************************
 *
 * Log recover routines
@@ -1789,6 +1820,8 @@ static const struct xlog_recover_item_ops *xlog_recover_item_ops[] = {
 &xlog_bud_item_ops,
 &xlog_attri_item_ops,
 &xlog_attrd_item_ops,
+&xlog_xmi_item_ops,
+&xlog_xmd_item_ops,
 };
 static const struct xlog_recover_item_ops *
......
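/*
 * Illustration only, not part of the patch: a userspace sketch of the
 * file-handle style validation that xlog_recover_iget_handle() performs. An
 * inode number alone is not enough, because the number may have been reused;
 * the generation must match too. struct demo_inode and the demo_* functions
 * are hypothetical, and -ESTALE stands in for the kernel's -EFSCORRUPTED,
 * which has no userspace definition.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory inode: a number plus a reuse counter. */
struct demo_inode {
	uint64_t ino;
	uint32_t generation;
};

/* Linear lookup by inode number; returns NULL if not present. */
static struct demo_inode *
demo_lookup(struct demo_inode *table, size_t n, uint64_t ino)
{
	for (size_t i = 0; i < n; i++)
		if (table[i].ino == ino)
			return &table[i];
	return NULL;
}

/* Look up (ino, gen) and reject a handle whose generation is stale. */
static int
demo_iget_handle(struct demo_inode *table, size_t n, uint64_t ino,
		 uint32_t gen, struct demo_inode **ipp)
{
	struct demo_inode *ip;

	assert(ipp);
	ip = demo_lookup(table, n, ino);
	if (!ip)
		return -ENOENT;
	if (ip->generation != gen)
		return -ESTALE;	/* the kernel returns -EFSCORRUPTED here */
	*ipp = ip;
	return 0;
}
```

/*
 * The output pointer is written only on success, so a caller never sees a
 * reference to an inode that failed the generation check.
 */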
@@ -292,6 +292,7 @@ typedef struct xfs_mount {
 #define XFS_FEAT_BIGTIME (1ULL << 24) /* large timestamps */
 #define XFS_FEAT_NEEDSREPAIR (1ULL << 25) /* needs xfs_repair */
 #define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */
+#define XFS_FEAT_EXCHANGE_RANGE (1ULL << 27) /* exchange range */
 /* Mount features */
 #define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */
@@ -355,6 +356,7 @@ __XFS_HAS_FEAT(inobtcounts, INOBTCNT)
 __XFS_HAS_FEAT(bigtime, BIGTIME)
 __XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
 __XFS_HAS_FEAT(large_extent_counts, NREXT64)
+__XFS_HAS_FEAT(exchange_range, EXCHANGE_RANGE)
 /*
 * Mount features
......
@@ -43,6 +43,7 @@
 #include "xfs_iunlink_item.h"
 #include "xfs_dahash_test.h"
 #include "xfs_rtbitmap.h"
+#include "xfs_exchmaps_item.h"
 #include "scrub/stats.h"
 #include "scrub/rcbag_btree.h"
@@ -1727,6 +1728,10 @@ xfs_fs_fill_super(
 goto out_filestream_unmount;
 }
+if (xfs_has_exchange_range(mp))
+xfs_warn(mp,
+"EXPERIMENTAL exchange-range feature enabled. Use at your own risk!");
 error = xfs_mountfs(mp);
 if (error)
 goto out_filestream_unmount;
@@ -2185,8 +2190,24 @@ xfs_init_caches(void)
 if (!xfs_iunlink_cache)
 goto out_destroy_attri_cache;
+xfs_xmd_cache = kmem_cache_create("xfs_xmd_item",
+sizeof(struct xfs_xmd_log_item),
+0, 0, NULL);
+if (!xfs_xmd_cache)
+goto out_destroy_iul_cache;
+xfs_xmi_cache = kmem_cache_create("xfs_xmi_item",
+sizeof(struct xfs_xmi_log_item),
+0, 0, NULL);
+if (!xfs_xmi_cache)
+goto out_destroy_xmd_cache;
 return 0;
+out_destroy_xmd_cache:
+kmem_cache_destroy(xfs_xmd_cache);
+out_destroy_iul_cache:
+kmem_cache_destroy(xfs_iunlink_cache);
 out_destroy_attri_cache:
 kmem_cache_destroy(xfs_attri_cache);
 out_destroy_attrd_cache:
@@ -2243,6 +2264,8 @@ xfs_destroy_caches(void)
 * destroy caches.
 */
 rcu_barrier();
+kmem_cache_destroy(xfs_xmd_cache);
+kmem_cache_destroy(xfs_xmi_cache);
 kmem_cache_destroy(xfs_iunlink_cache);
 kmem_cache_destroy(xfs_attri_cache);
 kmem_cache_destroy(xfs_attrd_cache);
......
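/*
 * Illustration only, not part of the patch: a userspace sketch of the
 * staged-allocation pattern that the xfs_init_caches() hunk extends, where
 * each new cache gets its own unwind label and failures tear down earlier
 * allocations in reverse order. struct demo_caches, the demo_* functions,
 * and the failure-injection counters are hypothetical.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for three of the slab caches. */
struct demo_caches {
	void *iunlink;
	void *xmd;
	void *xmi;
};

static int demo_live;		/* caches currently allocated */
static int demo_calls;		/* create calls made so far */
static int demo_fail_at;	/* 1-based call that should fail; 0 = never */

static void *demo_cache_create(size_t size)
{
	void *p;

	if (++demo_calls == demo_fail_at)
		return NULL;
	p = malloc(size);
	if (p)
		demo_live++;
	return p;
}

static void demo_cache_destroy(void *cache)
{
	if (cache)
		demo_live--;
	free(cache);
}

/* Create the caches in order; on failure, undo only what was created. */
static int demo_init_caches(struct demo_caches *c)
{
	assert(c);
	c->iunlink = demo_cache_create(64);
	if (!c->iunlink)
		goto out;
	c->xmd = demo_cache_create(64);
	if (!c->xmd)
		goto out_destroy_iul;
	c->xmi = demo_cache_create(64);
	if (!c->xmi)
		goto out_destroy_xmd;
	return 0;

out_destroy_xmd:
	demo_cache_destroy(c->xmd);
out_destroy_iul:
	demo_cache_destroy(c->iunlink);
out:
	c->iunlink = c->xmd = c->xmi = NULL;
	return -ENOMEM;
}
```

/*
 * Each label falls through to the ones below it, so adding a cache means
 * adding one allocation and one label, which is what the hunk above does
 * for the xmd and xmi caches.
 */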
@@ -252,17 +252,10 @@ STATIC int
 xfs_inactive_symlink_rmt(
 struct xfs_inode *ip)
 {
-struct xfs_buf *bp;
-int done;
+struct xfs_mount *mp = ip->i_mount;
+struct xfs_trans *tp;
 int error;
-int i;
-xfs_mount_t *mp;
-xfs_bmbt_irec_t mval[XFS_SYMLINK_MAPS];
-int nmaps;
-int size;
-xfs_trans_t *tp;
-mp = ip->i_mount;
 ASSERT(!xfs_need_iread_extents(&ip->i_df));
 /*
 * We're freeing a symlink that has some
@@ -286,44 +279,14 @@ xfs_inactive_symlink_rmt(
 * locked for the second transaction. In the error paths we need it
 * held so the cancel won't rele it, see below.
 */
-size = (int)ip->i_disk_size;
 ip->i_disk_size = 0;
 VFS_I(ip)->i_mode = (VFS_I(ip)->i_mode & ~S_IFMT) | S_IFREG;
 xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-/*
-* Find the block(s) so we can inval and unmap them.
-*/
-done = 0;
-nmaps = ARRAY_SIZE(mval);
-error = xfs_bmapi_read(ip, 0, xfs_symlink_blocks(mp, size),
-mval, &nmaps, 0);
-if (error)
-goto error_trans_cancel;
-/*
-* Invalidate the block(s). No validation is done.
-*/
-for (i = 0; i < nmaps; i++) {
-error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
-XFS_FSB_TO_DADDR(mp, mval[i].br_startblock),
-XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0,
-&bp);
-if (error)
-goto error_trans_cancel;
-xfs_trans_binval(tp, bp);
-}
-/*
-* Unmap the dead block(s) to the dfops.
-*/
-error = xfs_bunmapi(tp, ip, 0, size, 0, nmaps, &done);
+error = xfs_symlink_remote_truncate(tp, ip);
 if (error)
 goto error_trans_cancel;
-ASSERT(done);
-/*
-* Commit the transaction. This first logs the EFI and the inode, then
-* rolls and commits the transaction that frees the extents.
-*/
-xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 error = xfs_trans_commit(tp);
 if (error) {
 ASSERT(xfs_is_shutdown(mp));
......
@@ -39,6 +39,8 @@
 #include "xfs_buf_mem.h"
 #include "xfs_btree_mem.h"
 #include "xfs_bmap.h"
+#include "xfs_exchmaps.h"
+#include "xfs_exchrange.h"
 /*
 * We include this last to have the helpers above available for the trace
@@ -82,6 +82,9 @@ struct xfs_perag;
 struct xfbtree;
 struct xfs_btree_ops;
 struct xfs_bmap_intent;
+struct xfs_exchmaps_intent;
+struct xfs_exchmaps_req;
+struct xfs_exchrange;
 #define XFS_ATTR_FILTER_FLAGS \
 { XFS_ATTR_ROOT, "ROOT" }, \
@@ -4770,6 +4773,330 @@ DEFINE_XFBTREE_FREESP_EVENT(xfbtree_alloc_block);
 DEFINE_XFBTREE_FREESP_EVENT(xfbtree_free_block);
 #endif /* CONFIG_XFS_BTREE_IN_MEM */
/* exchmaps tracepoints */
#define XFS_EXCHMAPS_STRINGS \
{ XFS_EXCHMAPS_ATTR_FORK, "ATTRFORK" }, \
{ XFS_EXCHMAPS_SET_SIZES, "SETSIZES" }, \
{ XFS_EXCHMAPS_INO1_WRITTEN, "INO1_WRITTEN" }, \
{ XFS_EXCHMAPS_CLEAR_INO1_REFLINK, "CLEAR_INO1_REFLINK" }, \
{ XFS_EXCHMAPS_CLEAR_INO2_REFLINK, "CLEAR_INO2_REFLINK" }, \
{ __XFS_EXCHMAPS_INO2_SHORTFORM, "INO2_SF" }
DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping1_skip);
DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping1);
DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping2);
DEFINE_ITRUNC_EVENT(xfs_exchmaps_update_inode_size);
#define XFS_EXCHRANGE_INODES \
{ 1, "file1" }, \
{ 2, "file2" }
DECLARE_EVENT_CLASS(xfs_exchrange_inode_class,
TP_PROTO(struct xfs_inode *ip, int whichfile),
TP_ARGS(ip, whichfile),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(int, whichfile)
__field(xfs_ino_t, ino)
__field(int, format)
__field(xfs_extnum_t, nex)
__field(int, broot_size)
__field(int, fork_off)
),
TP_fast_assign(
__entry->dev = VFS_I(ip)->i_sb->s_dev;
__entry->whichfile = whichfile;
__entry->ino = ip->i_ino;
__entry->format = ip->i_df.if_format;
__entry->nex = ip->i_df.if_nextents;
__entry->fork_off = xfs_inode_fork_boff(ip);
),
TP_printk("dev %d:%d ino 0x%llx whichfile %s format %s num_extents %llu forkoff 0x%x",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->ino,
__print_symbolic(__entry->whichfile, XFS_EXCHRANGE_INODES),
__print_symbolic(__entry->format, XFS_INODE_FORMAT_STR),
__entry->nex,
__entry->fork_off)
)
#define DEFINE_EXCHRANGE_INODE_EVENT(name) \
DEFINE_EVENT(xfs_exchrange_inode_class, name, \
TP_PROTO(struct xfs_inode *ip, int whichfile), \
TP_ARGS(ip, whichfile))
DEFINE_EXCHRANGE_INODE_EVENT(xfs_exchrange_before);
DEFINE_EXCHRANGE_INODE_EVENT(xfs_exchrange_after);
DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
#define XFS_EXCHANGE_RANGE_FLAGS_STRS \
{ XFS_EXCHANGE_RANGE_TO_EOF, "TO_EOF" }, \
{ XFS_EXCHANGE_RANGE_DSYNC , "DSYNC" }, \
{ XFS_EXCHANGE_RANGE_DRY_RUN, "DRY_RUN" }, \
{ XFS_EXCHANGE_RANGE_FILE1_WRITTEN, "F1_WRITTEN" }, \
{ __XFS_EXCHANGE_RANGE_UPD_CMTIME1, "CMTIME1" }, \
{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }
/* file exchange-range tracepoint class */
DECLARE_EVENT_CLASS(xfs_exchrange_class,
TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip1,
struct xfs_inode *ip2),
TP_ARGS(fxr, ip1, ip2),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(xfs_ino_t, ip1_ino)
__field(loff_t, ip1_isize)
__field(loff_t, ip1_disize)
__field(xfs_ino_t, ip2_ino)
__field(loff_t, ip2_isize)
__field(loff_t, ip2_disize)
__field(loff_t, file1_offset)
__field(loff_t, file2_offset)
__field(unsigned long long, length)
__field(unsigned long long, flags)
),
TP_fast_assign(
__entry->dev = VFS_I(ip1)->i_sb->s_dev;
__entry->ip1_ino = ip1->i_ino;
__entry->ip1_isize = VFS_I(ip1)->i_size;
__entry->ip1_disize = ip1->i_disk_size;
__entry->ip2_ino = ip2->i_ino;
__entry->ip2_isize = VFS_I(ip2)->i_size;
__entry->ip2_disize = ip2->i_disk_size;
__entry->file1_offset = fxr->file1_offset;
__entry->file2_offset = fxr->file2_offset;
__entry->length = fxr->length;
__entry->flags = fxr->flags;
),
TP_printk("dev %d:%d flags %s bytecount 0x%llx "
"ino1 0x%llx isize 0x%llx disize 0x%llx pos 0x%llx -> "
"ino2 0x%llx isize 0x%llx disize 0x%llx pos 0x%llx",
MAJOR(__entry->dev), MINOR(__entry->dev),
__print_flags_u64(__entry->flags, "|", XFS_EXCHANGE_RANGE_FLAGS_STRS),
__entry->length,
__entry->ip1_ino,
__entry->ip1_isize,
__entry->ip1_disize,
__entry->file1_offset,
__entry->ip2_ino,
__entry->ip2_isize,
__entry->ip2_disize,
__entry->file2_offset)
)
#define DEFINE_EXCHRANGE_EVENT(name) \
DEFINE_EVENT(xfs_exchrange_class, name, \
TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip1, \
struct xfs_inode *ip2), \
TP_ARGS(fxr, ip1, ip2))
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
TRACE_EVENT(xfs_exchmaps_overhead,
TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
unsigned long long rmapbt_blocks),
TP_ARGS(mp, bmbt_blocks, rmapbt_blocks),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(unsigned long long, bmbt_blocks)
__field(unsigned long long, rmapbt_blocks)
),
TP_fast_assign(
__entry->dev = mp->m_super->s_dev;
__entry->bmbt_blocks = bmbt_blocks;
__entry->rmapbt_blocks = rmapbt_blocks;
),
TP_printk("dev %d:%d bmbt_blocks 0x%llx rmapbt_blocks 0x%llx",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->bmbt_blocks,
__entry->rmapbt_blocks)
);
DECLARE_EVENT_CLASS(xfs_exchmaps_estimate_class,
TP_PROTO(const struct xfs_exchmaps_req *req),
TP_ARGS(req),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(xfs_ino_t, ino1)
__field(xfs_ino_t, ino2)
__field(xfs_fileoff_t, startoff1)
__field(xfs_fileoff_t, startoff2)
__field(xfs_filblks_t, blockcount)
__field(uint64_t, flags)
__field(xfs_filblks_t, ip1_bcount)
__field(xfs_filblks_t, ip2_bcount)
__field(xfs_filblks_t, ip1_rtbcount)
__field(xfs_filblks_t, ip2_rtbcount)
__field(unsigned long long, resblks)
__field(unsigned long long, nr_exchanges)
),
TP_fast_assign(
__entry->dev = req->ip1->i_mount->m_super->s_dev;
__entry->ino1 = req->ip1->i_ino;
__entry->ino2 = req->ip2->i_ino;
__entry->startoff1 = req->startoff1;
__entry->startoff2 = req->startoff2;
__entry->blockcount = req->blockcount;
__entry->flags = req->flags;
__entry->ip1_bcount = req->ip1_bcount;
__entry->ip2_bcount = req->ip2_bcount;
__entry->ip1_rtbcount = req->ip1_rtbcount;
__entry->ip2_rtbcount = req->ip2_rtbcount;
__entry->resblks = req->resblks;
__entry->nr_exchanges = req->nr_exchanges;
),
TP_printk("dev %d:%d ino1 0x%llx fileoff1 0x%llx ino2 0x%llx fileoff2 0x%llx fsbcount 0x%llx flags (%s) bcount1 0x%llx rtbcount1 0x%llx bcount2 0x%llx rtbcount2 0x%llx resblks 0x%llx nr_exchanges %llu",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->ino1, __entry->startoff1,
__entry->ino2, __entry->startoff2,
__entry->blockcount,
__print_flags_u64(__entry->flags, "|", XFS_EXCHMAPS_STRINGS),
__entry->ip1_bcount,
__entry->ip1_rtbcount,
__entry->ip2_bcount,
__entry->ip2_rtbcount,
__entry->resblks,
__entry->nr_exchanges)
);
#define DEFINE_EXCHMAPS_ESTIMATE_EVENT(name) \
DEFINE_EVENT(xfs_exchmaps_estimate_class, name, \
TP_PROTO(const struct xfs_exchmaps_req *req), \
TP_ARGS(req))
DEFINE_EXCHMAPS_ESTIMATE_EVENT(xfs_exchmaps_initial_estimate);
DEFINE_EXCHMAPS_ESTIMATE_EVENT(xfs_exchmaps_final_estimate);
DECLARE_EVENT_CLASS(xfs_exchmaps_intent_class,
TP_PROTO(struct xfs_mount *mp, const struct xfs_exchmaps_intent *xmi),
TP_ARGS(mp, xmi),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(xfs_ino_t, ino1)
__field(xfs_ino_t, ino2)
__field(uint64_t, flags)
__field(xfs_fileoff_t, startoff1)
__field(xfs_fileoff_t, startoff2)
__field(xfs_filblks_t, blockcount)
__field(xfs_fsize_t, isize1)
__field(xfs_fsize_t, isize2)
__field(xfs_fsize_t, new_isize1)
__field(xfs_fsize_t, new_isize2)
),
TP_fast_assign(
__entry->dev = mp->m_super->s_dev;
__entry->ino1 = xmi->xmi_ip1->i_ino;
__entry->ino2 = xmi->xmi_ip2->i_ino;
__entry->flags = xmi->xmi_flags;
__entry->startoff1 = xmi->xmi_startoff1;
__entry->startoff2 = xmi->xmi_startoff2;
__entry->blockcount = xmi->xmi_blockcount;
__entry->isize1 = xmi->xmi_ip1->i_disk_size;
__entry->isize2 = xmi->xmi_ip2->i_disk_size;
__entry->new_isize1 = xmi->xmi_isize1;
__entry->new_isize2 = xmi->xmi_isize2;
),
TP_printk("dev %d:%d ino1 0x%llx fileoff1 0x%llx ino2 0x%llx fileoff2 0x%llx fsbcount 0x%llx flags (%s) isize1 0x%llx newisize1 0x%llx isize2 0x%llx newisize2 0x%llx",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->ino1, __entry->startoff1,
__entry->ino2, __entry->startoff2,
__entry->blockcount,
__print_flags_u64(__entry->flags, "|", XFS_EXCHMAPS_STRINGS),
__entry->isize1, __entry->new_isize1,
__entry->isize2, __entry->new_isize2)
);
#define DEFINE_EXCHMAPS_INTENT_EVENT(name) \
DEFINE_EVENT(xfs_exchmaps_intent_class, name, \
TP_PROTO(struct xfs_mount *mp, const struct xfs_exchmaps_intent *xmi), \
TP_ARGS(mp, xmi))
DEFINE_EXCHMAPS_INTENT_EVENT(xfs_exchmaps_defer);
DEFINE_EXCHMAPS_INTENT_EVENT(xfs_exchmaps_recover);
TRACE_EVENT(xfs_exchmaps_delta_nextents_step,
TP_PROTO(struct xfs_mount *mp,
const struct xfs_bmbt_irec *left,
const struct xfs_bmbt_irec *curr,
const struct xfs_bmbt_irec *new,
const struct xfs_bmbt_irec *right,
int delta, unsigned int state),
TP_ARGS(mp, left, curr, new, right, delta, state),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(xfs_fileoff_t, loff)
__field(xfs_fsblock_t, lstart)
__field(xfs_filblks_t, lcount)
__field(xfs_fileoff_t, coff)
__field(xfs_fsblock_t, cstart)
__field(xfs_filblks_t, ccount)
__field(xfs_fileoff_t, noff)
__field(xfs_fsblock_t, nstart)
__field(xfs_filblks_t, ncount)
__field(xfs_fileoff_t, roff)
__field(xfs_fsblock_t, rstart)
__field(xfs_filblks_t, rcount)
__field(int, delta)
__field(unsigned int, state)
),
TP_fast_assign(
__entry->dev = mp->m_super->s_dev;
__entry->loff = left->br_startoff;
__entry->lstart = left->br_startblock;
__entry->lcount = left->br_blockcount;
__entry->coff = curr->br_startoff;
__entry->cstart = curr->br_startblock;
__entry->ccount = curr->br_blockcount;
__entry->noff = new->br_startoff;
__entry->nstart = new->br_startblock;
__entry->ncount = new->br_blockcount;
__entry->roff = right->br_startoff;
__entry->rstart = right->br_startblock;
__entry->rcount = right->br_blockcount;
__entry->delta = delta;
__entry->state = state;
),
TP_printk("dev %d:%d left 0x%llx:0x%llx:0x%llx; curr 0x%llx:0x%llx:0x%llx <- new 0x%llx:0x%llx:0x%llx; right 0x%llx:0x%llx:0x%llx delta %d state 0x%x",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->loff, __entry->lstart, __entry->lcount,
__entry->coff, __entry->cstart, __entry->ccount,
__entry->noff, __entry->nstart, __entry->ncount,
__entry->roff, __entry->rstart, __entry->rcount,
__entry->delta, __entry->state)
);
TRACE_EVENT(xfs_exchmaps_delta_nextents,
TP_PROTO(const struct xfs_exchmaps_req *req, int64_t d_nexts1,
int64_t d_nexts2),
TP_ARGS(req, d_nexts1, d_nexts2),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(xfs_ino_t, ino1)
__field(xfs_ino_t, ino2)
__field(xfs_extnum_t, nexts1)
__field(xfs_extnum_t, nexts2)
__field(int64_t, d_nexts1)
__field(int64_t, d_nexts2)
),
TP_fast_assign(
int whichfork = xfs_exchmaps_reqfork(req);
__entry->dev = req->ip1->i_mount->m_super->s_dev;
__entry->ino1 = req->ip1->i_ino;
__entry->ino2 = req->ip2->i_ino;
__entry->nexts1 = xfs_ifork_ptr(req->ip1, whichfork)->if_nextents;
__entry->nexts2 = xfs_ifork_ptr(req->ip2, whichfork)->if_nextents;
__entry->d_nexts1 = d_nexts1;
__entry->d_nexts2 = d_nexts2;
),
TP_printk("dev %d:%d ino1 0x%llx nexts %llu ino2 0x%llx nexts %llu delta1 %lld delta2 %lld",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->ino1, __entry->nexts1,
__entry->ino2, __entry->nexts2,
__entry->d_nexts1, __entry->d_nexts2)
);
#endif /* _TRACE_XFS_H */
#undef TRACE_INCLUDE_PATH
......
@@ -2119,6 +2119,7 @@ extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
loff_t, size_t, unsigned int);
int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write);
int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *len, unsigned int remap_flags,
......