    btrfs: properly split extent_map for REQ_OP_ZONE_APPEND
    Damien reported a test failure with btrfs/209. The test itself ran
    fine, but the fsck run afterwards reported a corrupted filesystem.
    
    The filesystem corruption happens because we are splitting an extent
    and then writing it twice. We have to split the extent, though,
    because otherwise we would create extents that are too large for a
    REQ_OP_ZONE_APPEND operation.
    
    When dumping the extent tree, we can see two EXTENT_ITEMs at the same
    start address but different lengths.
    
    $ btrfs inspect dump-tree /dev/nullb1 -t extent
    ...
       item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
               refs 1 gen 7 flags DATA
               extent data backref root FS_TREE objectid 257 offset 786432 count 1
       item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
               refs 1 gen 7 flags DATA
               extent data backref root FS_TREE objectid 257 offset 786432 count 1
    
    The duplicated EXTENT_ITEMs originate from a wrongly split extent_map
    in extract_ordered_extent(). Since extract_ordered_extent() uses
    create_io_em() to split an existing extent_map, we end up with
    split->orig_start != split->start. The split extent is then logged
    with a non-zero "extent data offset", and finally the logged entries
    are replayed into a duplicated EXTENT_ITEM.
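
    To make this concrete, here is a small userspace sketch (illustrative
    only, not kernel code; the struct mirrors just the relevant fields of
    struct extent_map). The logged "extent data offset" is computed as
    start - orig_start, so the tail of a create_io_em()-style split
    carries a non-zero offset:

    #include <stdio.h>
    #include <stdint.h>

    typedef uint64_t u64;

    /* Minimal stand-in for the fields of struct extent_map that matter. */
    struct em {
        u64 start;      /* file offset this mapping begins at */
        u64 len;        /* length of the mapping */
        u64 orig_start; /* file offset of the original, unsplit extent */
    };

    int main(void)
    {
        struct em orig = { .start = 786432, .len = 262144,
                           .orig_start = 786432 };
        u64 pre = 126976; /* the part that fits into one ZA BIO */

        /* create_io_em()-style split: the tail keeps the old orig_start. */
        struct em tail = { .start = orig.start + pre,
                           .len = orig.len - pre,
                           .orig_start = orig.orig_start };

        /* The logged "extent data offset" is start - orig_start. It is
         * non-zero for the tail, so log replay treats the tail as a
         * reference into the old extent and ends up inserting a second
         * EXTENT_ITEM at the same bytenr with a different length. */
        printf("tail extent data offset = %llu\n",
               (unsigned long long)(tail.start - tail.orig_start));
        return 0;
    }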
    
    Introduce and use a proper splitting function for extent_map. The
    function is intended to be simple and specific to its use in
    extract_ordered_extent(), e.g. it does not support the compression
    case (we do not allow splitting a compressed extent_map anyway).
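
    The sketch below is a simplified userspace model of the key invariant
    (the name split_em(), its signature, and the struct are hypothetical;
    extent tree locking and mapping replacement are omitted): both pieces
    come out with orig_start == start, i.e. zero extent data offset, and
    compressed extents are rejected up front:

    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t u64;

    struct em {
        u64 start;       /* file offset */
        u64 len;
        u64 orig_start;
        u64 block_start; /* disk bytenr */
        bool compressed;
    };

    /* Split @em at @pre bytes into two self-contained mappings. */
    static int split_em(const struct em *em, u64 pre,
                        struct em *front, struct em *tail)
    {
        /* Splitting a compressed extent_map is not supported. */
        if (em->compressed)
            return -EINVAL;
        if (pre == 0 || pre >= em->len)
            return -EINVAL;

        *front = (struct em){
            .start = em->start,
            .len = pre,
            .orig_start = em->start,          /* start == orig_start ... */
            .block_start = em->block_start,
        };
        *tail = (struct em){
            .start = em->start + pre,
            .len = em->len - pre,
            .orig_start = em->start + pre,    /* ... for both pieces */
            /* provisional location; zoned writes record the real one
             * returned by the device at bio completion */
            .block_start = em->block_start + pre,
        };
        return 0;
    }

    int main(void)
    {
        struct em em = { .start = 786432, .len = 262144,
                         .orig_start = 786432, .block_start = 269484032 };
        struct em front, tail;

        /* Both pieces log with a zero extent data offset, so replay
         * inserts one EXTENT_ITEM per piece and no duplicates. */
        if (!split_em(&em, 126976, &front, &tail))
            printf("offsets: front %llu, tail %llu\n",
                   (unsigned long long)(front.start - front.orig_start),
                   (unsigned long long)(tail.start - tail.orig_start));
        return 0;
    }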
    
    There was a question raised by Qu, in summary: why do we want to
    split the extent map (and not the bio)? The answer:
    
    The problem is not the limit on the zone end, which, as you mention,
    is the same as the block group end. The problem is that data writes
    use zone append (ZA) operations. ZA BIOs cannot be split, so a large
    extent may need to be processed with multiple ZA BIOs. While that is
    also true for regular writes, the major difference is that ZA is a
    "nameless" write operation that gives back the written sectors on
    completion. On top of that, ZA operations may be reordered by the
    block layer (not intentionally, though). Combine both of these
    characteristics and you can see that the data of a large extent may
    end up shuffled when written, resulting in data corruption and the
    impossibility of mapping the extent to a single start sector.
    
    To avoid this problem, zoned btrfs uses the principle "one data
    extent == one ZA BIO". So large extents need to be split. This is
    unfortunate, but we can revisit it later and optimize, e.g. by
    merging the fragments of an extent back together once written, if
    they actually were written sequentially in the zone.
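
    As a userspace illustration of the "one data extent == one ZA BIO"
    principle (MAX_ZONE_APPEND_SIZE and submit_zone_append() are
    placeholders for the device queue limit and the actual bio
    submission, not kernel APIs):

    #include <stdio.h>
    #include <stdint.h>

    typedef uint64_t u64;

    /* Example device limit; real code reads the queue's ZA limit. */
    #define MAX_ZONE_APPEND_SIZE (128 * 1024)

    /* Placeholder: build and submit one REQ_OP_ZONE_APPEND bio. */
    static void submit_zone_append(u64 file_off, u64 len)
    {
        printf("ZA BIO: file offset %llu, len %llu\n",
               (unsigned long long)file_off, (unsigned long long)len);
    }

    int main(void)
    {
        u64 start = 786432, len = 262144; /* extent from the dump above */
        u64 off = 0;

        /* Split the extent so every piece fits in a single zone append
         * operation; each piece learns its disk location from the
         * sector the device returns at bio completion. */
        while (off < len) {
            u64 piece = len - off < MAX_ZONE_APPEND_SIZE ?
                        len - off : MAX_ZONE_APPEND_SIZE;
            submit_zone_append(start + off, piece);
            off += piece;
        }
        return 0;
    }
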
    Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
    Fixes: d22002fd ("btrfs: zoned: split ordered extent when bio is sent")
    CC: stable@vger.kernel.org # 5.12+
    CC: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
    Signed-off-by: David Sterba <dsterba@suse.com>