Commit 09c3717c authored by Chris Mason's avatar Chris Mason Committed by David Sterba

btrfs: only subtract from len_to_oe_boundary when it is tracking an extent

bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone
as we submit bios for writes.  Every time we add a page to the bio, we
decrement those bytes from len_to_oe_boundary, and then we submit the
bio if we happen to hit zero.

Most of the time, len_to_oe_boundary gets set to U32_MAX.
submit_extent_page() adds pages into our bio, and the size of the bio
ends up limited by:

- Are we contiguous on disk?
- Does bio_add_page() allow us to stuff more in?
- is len_to_oe_boundary > 0?

The len_to_oe_boundary math starts with U32_MAX, which isn't page or
sector aligned, and subtracts from it until it hits zero.  In the
non-zoned case, the last IO we submit before we hit zero is going to be
unaligned, triggering BUGs.

This is hard to trigger because bio_add_page() isn't going to make a bio
of U32_MAX size unless you give it a perfect set of pages and fully
contiguous extents on disk.  We can hit it pretty reliably while making
large swapfiles during provisioning because the machine is freshly
booted, mostly idle, and the disk is freshly formatted.  It's also
possible to trigger with reads when read_ahead_kb is set to 4GB.

The code has been clean up and shifted around a few times, but this flaw
has been lurking since the counter was added.  I think the commit
24e6c808 ("btrfs: simplify main loop in submit_extent_page") ended
up exposing the bug.

The fix used here is to skip doing math on len_to_oe_boundary unless
we've changed it from the default U32_MAX value.  bio_add_page() is the
real limit we want, and there's no reason to do extra math when block
layer is doing it for us.

Sample reproducer, note you'll need to change the path to the bdi and
device:

  SUBVOL=/btrfs/swapvol
  SWAPFILE=$SUBVOL/swapfile
  SZMB=8192

  mkfs.btrfs -f /dev/vdb
  mount /dev/vdb /btrfs

  btrfs subvol create $SUBVOL
  chattr +C $SUBVOL
  dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB
  sync

  echo 4 > /proc/sys/vm/drop_caches

  echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb

  while true; do
	  echo 1 > /proc/sys/vm/drop_caches
	  echo 1 > /proc/sys/vm/drop_caches
	  dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock
  done

Fixes: 24e6c808 ("btrfs: simplify main loop in submit_extent_page")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
Signed-off-by: default avatarChris Mason <clm@fb.com>
Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
parent b471965f
......@@ -902,7 +902,30 @@ static void submit_extent_page(struct btrfs_bio_ctrl *bio_ctrl,
size -= len;
pg_offset += len;
disk_bytenr += len;
bio_ctrl->len_to_oe_boundary -= len;
/*
* len_to_oe_boundary defaults to U32_MAX, which isn't page or
* sector aligned. alloc_new_bio() then sets it to the end of
* our ordered extent for writes into zoned devices.
*
* When len_to_oe_boundary is tracking an ordered extent, we
* trust the ordered extent code to align things properly, and
* the check above to cap our write to the ordered extent
* boundary is correct.
*
* When len_to_oe_boundary is U32_MAX, the cap above would
* result in a 4095 byte IO for the last page right before
* we hit the bio limit of UINT_MAX. bio_add_page() has all
* the checks required to make sure we don't overflow the bio,
* and we should just ignore len_to_oe_boundary completely
* unless we're using it to track an ordered extent.
*
* It's pretty hard to make a bio sized U32_MAX, but it can
* happen when the page cache is able to feed us contiguous
* pages for large extents.
*/
if (bio_ctrl->len_to_oe_boundary != U32_MAX)
bio_ctrl->len_to_oe_boundary -= len;
/* Ordered extent boundary: move on to a new bio. */
if (bio_ctrl->len_to_oe_boundary == 0)
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment