Commit 5eda4300 authored by Darrick J. Wong's avatar Darrick J. Wong

xfs: mark speculative prealloc CoW fork extents unwritten

Christoph Hellwig pointed out that there's a potentially nasty race when
performing simultaneous nearby directio cow writes:

"Thread 1 writes a range from B to c

"                    B --------- C
                           p

"a little later thread 2 writes from A to B

"        A --------- B
               p

[editor's note: the 'p' denote cowextsize boundaries, which I added to
make this more clear]

"but the code preallocates beyond B into the range where thread
"1 has just written, but ->end_io hasn't been called yet.
"But once ->end_io is called thread 2 has already allocated
"up to the extent size hint into the write range of thread 1,
"so the end_io handler will splice the unintialized blocks from
"that preallocation back into the file right after B."

We can avoid this race by ensuring that thread 1 cannot accidentally
remap the blocks that thread 2 allocated (as part of speculative
preallocation) as part of t2's write preparation in t1's end_io handler.
The way we make this happen is by taking advantage of the unwritten
extent flag as an intermediate step.

Recall that when we begin the process of writing data to shared blocks,
we create a delayed allocation extent in the CoW fork:

D: --RRRRRRSSSRRRRRRRR---
C: ------DDDDDDD---------

When a thread prepares to CoW some dirty data out to disk, it will now
convert the delalloc reservation into an /unwritten/ allocated extent in
the cow fork.  The da conversion code tries to opportunistically
allocate as much of a (speculatively prealloc'd) extent as possible, so
we may end up allocating a larger extent than we're actually writing
out:

D: --RRRRRRSSSRRRRRRRR---
U: ------UUUUUUU---------

Next, we convert only the part of the extent that we're actively
planning to write to normal (i.e. not unwritten) status:

D: --RRRRRRSSSRRRRRRRR---
U: ------UURRUUU---------

If the write succeeds, the end_cow function will now scan the relevant
range of the CoW fork for real extents and remap only the real extents
into the data fork:

D: --RRRRRRRRSRRRRRRRR---
U: ------UU--UUU---------

This ensures that we never obliterate valid data fork extents with
unwritten blocks from the CoW fork.
Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
parent 05a630d7
...@@ -481,6 +481,12 @@ xfs_submit_ioend( ...@@ -481,6 +481,12 @@ xfs_submit_ioend(
struct xfs_ioend *ioend, struct xfs_ioend *ioend,
int status) int status)
{ {
/* Convert CoW extents to regular */
if (!status && ioend->io_type == XFS_IO_COW) {
status = xfs_reflink_convert_cow(XFS_I(ioend->io_inode),
ioend->io_offset, ioend->io_size);
}
/* Reserve log space if we might write beyond the on-disk inode size. */ /* Reserve log space if we might write beyond the on-disk inode size. */
if (!status && if (!status &&
ioend->io_type != XFS_IO_UNWRITTEN && ioend->io_type != XFS_IO_UNWRITTEN &&
......
...@@ -685,7 +685,7 @@ xfs_iomap_write_allocate( ...@@ -685,7 +685,7 @@ xfs_iomap_write_allocate(
int nres; int nres;
if (whichfork == XFS_COW_FORK) if (whichfork == XFS_COW_FORK)
flags |= XFS_BMAPI_COWFORK; flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
/* /*
* Make sure that the dquots are there. * Make sure that the dquots are there.
......
...@@ -82,11 +82,22 @@ ...@@ -82,11 +82,22 @@
* mappings are a reservation against the free space in the filesystem; * mappings are a reservation against the free space in the filesystem;
* adjacent mappings can also be combined into fewer larger mappings. * adjacent mappings can also be combined into fewer larger mappings.
* *
* As an optimization, the CoW extent size hint (cowextsz) creates
* outsized aligned delalloc reservations in the hope of landing out of
* order nearby CoW writes in a single extent on disk, thereby reducing
* fragmentation and improving future performance.
*
* D: --RRRRRRSSSRRRRRRRR--- (data fork)
* C: ------DDDDDDD--------- (CoW fork)
*
* When dirty pages are being written out (typically in writepage), the * When dirty pages are being written out (typically in writepage), the
* delalloc reservations are converted into real mappings by allocating * delalloc reservations are converted into unwritten mappings by
* blocks and replacing the delalloc mapping with real ones. A delalloc * allocating blocks and replacing the delalloc mapping with real ones.
* mapping can be replaced by several real ones if the free space is * A delalloc mapping can be replaced by several unwritten ones if the
* fragmented. * free space is fragmented.
*
* D: --RRRRRRSSSRRRRRRRR---
* C: ------UUUUUUU---------
* *
* We want to adapt the delalloc mechanism for copy-on-write, since the * We want to adapt the delalloc mechanism for copy-on-write, since the
* write paths are similar. The first two steps (creating the reservation * write paths are similar. The first two steps (creating the reservation
...@@ -101,13 +112,29 @@ ...@@ -101,13 +112,29 @@
* Block-aligned directio writes will use the same mechanism as buffered * Block-aligned directio writes will use the same mechanism as buffered
* writes. * writes.
* *
* Just prior to submitting the actual disk write requests, we convert
* the extents representing the range of the file actually being written
* (as opposed to extra pieces created for the cowextsize hint) to real
* extents. This will become important in the next step:
*
* D: --RRRRRRSSSRRRRRRRR---
* C: ------UUrrUUU---------
*
* CoW remapping must be done after the data block write completes, * CoW remapping must be done after the data block write completes,
* because we don't want to destroy the old data fork map until we're sure * because we don't want to destroy the old data fork map until we're sure
* the new block has been written. Since the new mappings are kept in a * the new block has been written. Since the new mappings are kept in a
* separate fork, we can simply iterate these mappings to find the ones * separate fork, we can simply iterate these mappings to find the ones
* that cover the file blocks that we just CoW'd. For each extent, simply * that cover the file blocks that we just CoW'd. For each extent, simply
* unmap the corresponding range in the data fork, map the new range into * unmap the corresponding range in the data fork, map the new range into
* the data fork, and remove the extent from the CoW fork. * the data fork, and remove the extent from the CoW fork. Because of
* the presence of the cowextsize hint, however, we must be careful
* only to remap the blocks that we've actually written out -- we must
* never remap delalloc reservations nor CoW staging blocks that have
* yet to be written. This corresponds exactly to the real extents in
* the CoW fork:
*
* D: --RRRRRRrrSRRRRRRRR---
* C: ------UU--UUU---------
* *
* Since the remapping operation can be applied to an arbitrary file * Since the remapping operation can be applied to an arbitrary file
* range, we record the need for the remap step as a flag in the ioend * range, we record the need for the remap step as a flag in the ioend
...@@ -296,6 +323,65 @@ xfs_reflink_reserve_cow( ...@@ -296,6 +323,65 @@ xfs_reflink_reserve_cow(
return 0; return 0;
} }
/* Convert part of an unwritten CoW extent to a real one. */
STATIC int
xfs_reflink_convert_cow_extent(
struct xfs_inode *ip,
struct xfs_bmbt_irec *imap,
xfs_fileoff_t offset_fsb,
xfs_filblks_t count_fsb,
struct xfs_defer_ops *dfops)
{
struct xfs_bmbt_irec irec = *imap;
xfs_fsblock_t first_block;
int nimaps = 1;
if (imap->br_state == XFS_EXT_NORM)
return 0;
xfs_trim_extent(&irec, offset_fsb, count_fsb);
trace_xfs_reflink_convert_cow(ip, &irec);
if (irec.br_blockcount == 0)
return 0;
return xfs_bmapi_write(NULL, ip, irec.br_startoff, irec.br_blockcount,
XFS_BMAPI_COWFORK | XFS_BMAPI_CONVERT, &first_block,
0, &irec, &nimaps, dfops);
}
/* Convert all of the unwritten CoW extents in a file's range to real ones. */
int
xfs_reflink_convert_cow(
struct xfs_inode *ip,
xfs_off_t offset,
xfs_off_t count)
{
struct xfs_bmbt_irec got;
struct xfs_defer_ops dfops;
struct xfs_mount *mp = ip->i_mount;
struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset);
xfs_fileoff_t end_fsb = XFS_B_TO_FSB(mp, offset + count);
xfs_extnum_t idx;
bool found;
int error;
xfs_ilock(ip, XFS_ILOCK_EXCL);
/* Convert all the extents to real from unwritten. */
for (found = xfs_iext_lookup_extent(ip, ifp, offset_fsb, &idx, &got);
found && got.br_startoff < end_fsb;
found = xfs_iext_get_extent(ifp, ++idx, &got)) {
error = xfs_reflink_convert_cow_extent(ip, &got, offset_fsb,
end_fsb - offset_fsb, &dfops);
if (error)
break;
}
/* Finish up. */
xfs_iunlock(ip, XFS_ILOCK_EXCL);
return error;
}
/* Allocate all CoW reservations covering a range of blocks in a file. */ /* Allocate all CoW reservations covering a range of blocks in a file. */
static int static int
__xfs_reflink_allocate_cow( __xfs_reflink_allocate_cow(
...@@ -328,6 +414,7 @@ __xfs_reflink_allocate_cow( ...@@ -328,6 +414,7 @@ __xfs_reflink_allocate_cow(
goto out_unlock; goto out_unlock;
ASSERT(nimaps == 1); ASSERT(nimaps == 1);
/* Make sure there's a CoW reservation for it. */
error = xfs_reflink_reserve_cow(ip, &imap, &shared); error = xfs_reflink_reserve_cow(ip, &imap, &shared);
if (error) if (error)
goto out_trans_cancel; goto out_trans_cancel;
...@@ -337,14 +424,16 @@ __xfs_reflink_allocate_cow( ...@@ -337,14 +424,16 @@ __xfs_reflink_allocate_cow(
goto out_trans_cancel; goto out_trans_cancel;
} }
/* Allocate the entire reservation as unwritten blocks. */
xfs_trans_ijoin(tp, ip, 0); xfs_trans_ijoin(tp, ip, 0);
error = xfs_bmapi_write(tp, ip, imap.br_startoff, imap.br_blockcount, error = xfs_bmapi_write(tp, ip, imap.br_startoff, imap.br_blockcount,
XFS_BMAPI_COWFORK, &first_block, XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, &first_block,
XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
&imap, &nimaps, &dfops); &imap, &nimaps, &dfops);
if (error) if (error)
goto out_trans_cancel; goto out_trans_cancel;
/* Finish up. */
error = xfs_defer_finish(&tp, &dfops, NULL); error = xfs_defer_finish(&tp, &dfops, NULL);
if (error) if (error)
goto out_trans_cancel; goto out_trans_cancel;
...@@ -389,11 +478,12 @@ xfs_reflink_allocate_cow_range( ...@@ -389,11 +478,12 @@ xfs_reflink_allocate_cow_range(
if (error) { if (error) {
trace_xfs_reflink_allocate_cow_range_error(ip, error, trace_xfs_reflink_allocate_cow_range_error(ip, error,
_RET_IP_); _RET_IP_);
break; return error;
} }
} }
return error; /* Convert the CoW extents to regular. */
return xfs_reflink_convert_cow(ip, offset, count);
} }
/* /*
...@@ -641,6 +731,16 @@ xfs_reflink_end_cow( ...@@ -641,6 +731,16 @@ xfs_reflink_end_cow(
ASSERT(!isnullstartblock(got.br_startblock)); ASSERT(!isnullstartblock(got.br_startblock));
/*
* Don't remap unwritten extents; these are
* speculatively preallocated CoW extents that have been
* allocated but have not yet been involved in a write.
*/
if (got.br_state == XFS_EXT_UNWRITTEN) {
idx--;
goto next_extent;
}
/* Unmap the old blocks in the data fork. */ /* Unmap the old blocks in the data fork. */
xfs_defer_init(&dfops, &firstfsb); xfs_defer_init(&dfops, &firstfsb);
rlen = del.br_blockcount; rlen = del.br_blockcount;
......
...@@ -30,6 +30,8 @@ extern int xfs_reflink_reserve_cow(struct xfs_inode *ip, ...@@ -30,6 +30,8 @@ extern int xfs_reflink_reserve_cow(struct xfs_inode *ip,
struct xfs_bmbt_irec *imap, bool *shared); struct xfs_bmbt_irec *imap, bool *shared);
extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip, extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
xfs_off_t offset, xfs_off_t count); xfs_off_t offset, xfs_off_t count);
extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
xfs_off_t count);
extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset, extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
struct xfs_bmbt_irec *imap); struct xfs_bmbt_irec *imap);
extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip, extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
......
...@@ -3088,6 +3088,7 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class, ...@@ -3088,6 +3088,7 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class,
__field(xfs_fileoff_t, lblk) __field(xfs_fileoff_t, lblk)
__field(xfs_extlen_t, len) __field(xfs_extlen_t, len)
__field(xfs_fsblock_t, pblk) __field(xfs_fsblock_t, pblk)
__field(int, state)
), ),
TP_fast_assign( TP_fast_assign(
__entry->dev = VFS_I(ip)->i_sb->s_dev; __entry->dev = VFS_I(ip)->i_sb->s_dev;
...@@ -3095,13 +3096,15 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class, ...@@ -3095,13 +3096,15 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class,
__entry->lblk = irec->br_startoff; __entry->lblk = irec->br_startoff;
__entry->len = irec->br_blockcount; __entry->len = irec->br_blockcount;
__entry->pblk = irec->br_startblock; __entry->pblk = irec->br_startblock;
__entry->state = irec->br_state;
), ),
TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu", TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu st %d",
MAJOR(__entry->dev), MINOR(__entry->dev), MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->ino, __entry->ino,
__entry->lblk, __entry->lblk,
__entry->len, __entry->len,
__entry->pblk) __entry->pblk,
__entry->state)
); );
#define DEFINE_INODE_IREC_EVENT(name) \ #define DEFINE_INODE_IREC_EVENT(name) \
DEFINE_EVENT(xfs_inode_irec_class, name, \ DEFINE_EVENT(xfs_inode_irec_class, name, \
...@@ -3241,6 +3244,7 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_around_shared); ...@@ -3241,6 +3244,7 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_around_shared);
DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_alloc); DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_alloc);
DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_found); DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_found);
DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc); DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
DEFINE_INODE_IREC_EVENT(xfs_reflink_convert_cow);
DEFINE_RW_EVENT(xfs_reflink_reserve_cow); DEFINE_RW_EVENT(xfs_reflink_reserve_cow);
DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range); DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment