    btrfs: zoned: fix dev-replace after the scrub rework · b675df02
    Qu Wenruo authored
    [BUG]
    After commit e02ee89b ("btrfs: scrub: switch scrub_simple_mirror()
    to scrub_stripe infrastructure"), scrub no longer works on zoned
    devices at all.
    
    Even a device in an empty zoned btrfs cannot be replaced:
    
      # mkfs.btrfs -f /dev/nvme0n1
      # mount /dev/nvme0n1 /mnt/btrfs
      # btrfs replace start -Bf 1 /dev/nvme0n2 /mnt/btrfs
      Resetting device zones /dev/nvme1n1 (160 zones) ...
      ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Input/output error
    
    We can also hit a related kernel crash:
    
      BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme3n1, 160 zones of 134217728 bytes
      BTRFS info (device nvme1n1): dev_replace from /dev/nvme2n1 (devid 2) to /dev/nvme3n1 started
      nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR
      I/O error, dev nvme3n1, sector 786432 op 0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2
      BTRFS error (device nvme1n1): bdev /dev/nvme3n1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
      BUG: kernel NULL pointer dereference, address: 00000000000000a8
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
      Call Trace:
       <IRQ>
       btrfs_lookup_ordered_extent+0x31/0x190
       btrfs_record_physical_zoned+0x18/0x40
       btrfs_simple_end_io+0xaf/0xc0
       blk_update_request+0x153/0x4c0
       blk_mq_end_request+0x15/0xd0
       nvme_poll_cq+0x1d3/0x360
       nvme_irq+0x39/0x80
       __handle_irq_event_percpu+0x3b/0x190
       handle_irq_event+0x2f/0x70
       handle_edge_irq+0x7c/0x210
       __common_interrupt+0x34/0xa0
       common_interrupt+0x7d/0xa0
       </IRQ>
       <TASK>
       asm_common_interrupt+0x22/0x40
    
    [CAUSE]
    Dev-replace reuses scrub code to iterate all extents and write the
    existing content back to the new device.
    
    For zoned devices, we call fill_writer_pointer_gap() to make sure
    all writes to the zoned device are sequential, even if there are
    gaps between the writes.
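
    As a rough illustration of the gap filling described above, the
    following user-space C sketch models a single sequential zone. It is
    a toy model, not kernel code: struct toy_zone, toy_zone_write() and
    toy_fill_gap() are hypothetical names, and zeroing is modeled as an
    ordinary write covering the gap.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one sequential-write-required zone (not kernel code). */
struct toy_zone {
	uint64_t start;	/* first byte of the zone */
	uint64_t len;	/* zone size in bytes */
	uint64_t wp;	/* device-side write pointer */
};

/* A regular write is only accepted exactly at the device write pointer. */
int toy_zone_write(struct toy_zone *z, uint64_t pos, uint64_t len)
{
	if (pos != z->wp || pos + len > z->start + z->len)
		return -EIO;
	z->wp += len;
	return 0;
}

/*
 * Zero-fill from the tracked write pointer up to 'physical', so that a
 * following write at 'physical' lands exactly on the device write pointer.
 */
int toy_fill_gap(struct toy_zone *z, uint64_t *tracked_wp, uint64_t physical)
{
	int ret = 0;

	assert(tracked_wp != NULL);
	if (*tracked_wp < physical) {
		ret = toy_zone_write(z, *tracked_wp, physical - *tracked_wp);
		if (ret == 0)
			*tracked_wp = physical;
	}
	return ret;
}
```

    In this model, a write at offset 65536 into an empty zone is rejected
    with -EIO, while the same write succeeds once toy_fill_gap() has
    been called with physical = 65536.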
    
    However, we have several different bugs, all related to zoned
    dev-replace:
    
    - We are using the ZONE_APPEND operation for metadata-style writeback
      For zoned devices, btrfs has two ways to write data:

      * ZONE_APPEND for data
        This allows a higher queue depth, but the submitter cannot know
        in advance where the write will land.
        Thus the real on-disk physical location has to be grabbed in its
        endio.

      * WRITE for metadata
        This requires a queue depth of one (a new write can only be
        submitted after the previous one has finished), and all writes
        must be sequential.

      For scrub, we use a queue depth of one but still go with
      ZONE_APPEND, which requires btrfs_bio::inode to be populated.
      This is the cause of the crash above.
    
    - No correct tracking of write_pointer
      After a write finishes, we should forward sctx->write_pointer, or
      fill_writer_pointer_gap() would not work properly: it would zero
      out more than necessary and fill the whole zone prematurely.
    
    - Incorrect physical bytenr passed to fill_writer_pointer_gap()
      In scrub_write_sectors(), one call site passes a logical address,
      which is completely wrong.

      The other call site passes the physical address of the current
      sector, but we should pass the physical address of the btrfs_bio
      we're submitting.

      This is the cause of the -EIO errors.
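
    The effect of not forwarding the tracked write pointer can be shown
    with plain arithmetic. The sketch below is an assumption-laden toy,
    not the scrub code: struct toy_sctx, toy_fill_gap_len(),
    toy_write_done() and toy_total_zeroed() are hypothetical names, and
    only the length of each zero-fill is computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the scrub context: only the tracked pointer. */
struct toy_sctx {
	uint64_t write_pointer;
};

/*
 * Return how many bytes a gap fill would zero before a bio that starts
 * at 'bio_physical', and move the tracked pointer to that position.
 */
uint64_t toy_fill_gap_len(struct toy_sctx *sctx, uint64_t bio_physical)
{
	uint64_t gap = 0;

	assert(sctx != NULL);
	if (sctx->write_pointer < bio_physical) {
		gap = bio_physical - sctx->write_pointer;
		sctx->write_pointer = bio_physical;
	}
	return gap;
}

/* Account for a finished write; 'forward' models the fixed behaviour. */
void toy_write_done(struct toy_sctx *sctx, uint64_t bio_physical,
		    uint64_t len, bool forward)
{
	if (forward)
		sctx->write_pointer = bio_physical + len;
}

/* Bytes zeroed while writing two 16 KiB bios at 0 and at 64 KiB. */
uint64_t toy_total_zeroed(bool forward)
{
	struct toy_sctx sctx = { .write_pointer = 0 };
	uint64_t zeroed = 0;

	zeroed += toy_fill_gap_len(&sctx, 0);
	toy_write_done(&sctx, 0, 16384, forward);
	zeroed += toy_fill_gap_len(&sctx, 65536);
	toy_write_done(&sctx, 65536, 16384, forward);
	return zeroed;
}
```

    With forwarding, the second gap fill covers only the 48 KiB between
    the end of the first bio and the start of the second. Without it,
    the computed gap is the full 64 KiB, which includes the 16 KiB that
    were already written.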
    
    [FIX]
    - Do not use ZONE_APPEND for btrfs_submit_repair_write()

    - Manually forward sctx->write_pointer after a successful writeback

    - Use the physical address of the to-be-submitted btrfs_bio for
      fill_writer_pointer_gap()

    With these fixes, zoned device replace works as expected.

    Reported-by: Christoph Hellwig <hch@lst.de>
    Fixes: e02ee89b ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>