1. 25 Jul, 2022 40 commits
    • btrfs: clean up chained assignments · c1867eb3
      David Sterba authored
      The chained assignments may be convenient to write, but make readability
      a bit worse as it's too easy to overlook that there are several values
      set on the same line while this is rather an exception.  Making it
      consistent everywhere avoids surprises.
      
      The pattern where inode times are initialized reuses the first value and
      the order is mtime, ctime. In other blocks the assignments are expanded
      so the order of variables is similar to the neighboring code.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: merge calculations for simple striped profiles in btrfs_rmap_block · ac067734
      David Sterba authored
      Use the same expression for stripe_nr for RAID0 (map->sub_stripes is 1)
      and RAID10 (map->sub_stripes is 2), with equivalent results.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use mask for all RAID1* profiles in btrfs_calc_avail_data_space · d09cb9e1
      David Sterba authored
      There's a sequence of hard coded values for RAID1 profiles that are
      already stored in the raid_attr table that should be used instead.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: properly flag filesystem with BTRFS_FEATURE_INCOMPAT_BIG_METADATA · e26b04c4
      Nikolay Borisov authored
      Commit 6f93e834 seemingly inadvertently moved the code responsible
      for flagging the filesystem as having BIG_METADATA to a place where
      setting the flag was essentially lost. This means that
      filesystems created with kernels containing this bug (starting with 5.15)
      can potentially be mounted by older (pre-3.4) kernels. In reality
      chances for this happening are low because there are other incompat
      flags introduced in the meantime. Still the correct behavior is to set
      INCOMPAT_BIG_METADATA flag and persist this in the superblock.
      
      Fixes: 6f93e834 ("btrfs: fix upper limit for max_inline for page size 64K")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: print checksum type and implementation at mount time · c8a5f8ca
      David Sterba authored
      Per user request, print the checksum type and implementation at mount
      time among the messages. The checksum is user configurable and the
      actual crypto implementation is useful to see for performance reasons.
      The same information is also available after mount in the
      /sys/fs/btrfs/FSID/checksum file.
      
      Example:
      
        [25.323662] BTRFS info (device vdb): using sha256 (sha256-generic) checksum algorithm
      
      Link: https://github.com/kdave/btrfs-progs/issues/483
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
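For anyone who wants to pick the checksum type and implementation out of the kernel log, here is a minimal sketch. The regular expression is modelled only on the example message quoted in the commit above, and the helper name `parse_csum_line` is hypothetical; nothing here is taken from the kernel source.

```python
import re

# Pattern derived from the single example line in the changelog (an assumption,
# other kernels may format the message differently).
CSUM_RE = re.compile(
    r"BTRFS info \(device (?P<device>[^)]+)\): "
    r"using (?P<type>\S+) \((?P<impl>[^)]+)\) checksum algorithm"
)


def parse_csum_line(line):
    """Return device, checksum type and implementation, or None if no match."""
    match = CSUM_RE.search(line)
    if match is None:
        return None
    return match.groupdict()


example = ("[25.323662] BTRFS info (device vdb): "
           "using sha256 (sha256-generic) checksum algorithm")
info = parse_csum_line(example)
print(info)
```

The same values could instead be read from the sysfs checksum file after mount; the log-based approach is only useful when all you have is a saved dmesg.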
    • btrfs: reset block group chunk force if we have to wait · 1314ca78
      Josef Bacik authored
      If you try to force a chunk allocation, but you race with another chunk
      allocation, you will end up waiting on the chunk allocation that just
      occurred and then allocate another chunk.  If you have many threads all
      doing this at once you can way over-allocate chunks.
      
      Fix this by resetting force to NO_FORCE, that way if we think we need to
      allocate we can, otherwise we don't force another chunk allocation if
      one is already happening.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: add new command FILEATTR for file attributes · 48247359
      David Sterba authored
      There are file attributes inherited from previous ext2 SETFLAGS/GETFLAGS
      and later from XFLAGS interfaces, now commonly found under the
      'fileattr' API. This corresponds to the individual inode bits and that's
      part of the on-disk format, so this is suitable for the protocol. The
      other interfaces contain a lot of cruft or bits that btrfs does not
      support yet.
      
      Currently the value is u64 and matches btrfs_inode_item. Not all the
      bits can be set by ioctls (like NODATASUM or READONLY), but we can send
      them over the protocol and leave it up to the receiving side what and
      how to apply.
      
      As some of the flags, e.g. IMMUTABLE, can prevent any further changes,
      the receiving side needs to understand that and apply the changes in the
      right order, or possibly with some intermediate steps. This should be
      easier, future proof and simpler on the protocol layer than implementing
      in kernel.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: add OTIME as utimes attribute for proto 2+ by default · 22a5b2ab
      David Sterba authored
      When send v1 was introduced the otime (inode creation time) was not
      available, however the attribute in btrfs send protocol exists. Though
      it would be possible to add it for v1 too as the attribute would be
      ignored by v1 receive, let's not change the layout of v1 and only add
      that to v2+.  The otime cannot be changed and is only informative.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: output mirror number for bad metadata · 8f0ed7d4
      Qu Wenruo authored
      When handling a real world transid mismatch image, it's hard to know
      which copy is corrupted, as the error messages just look like this:
      
        BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
      
      We don't even know if the retry is caused by btrfs or the VFS retry.
      
      To make things a little easier to read, add mirror number for all
      related tree block read errors.
      
      So the above messages would look like this:
      
        BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ update messages, add "logical" ]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
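With the mirror number in the message, failures can be grouped per copy. The sketch below does that for the four new-format lines quoted above; the regular expression is derived only from those example lines and the function name `count_failures_by_mirror` is hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the new-format example messages (an assumption).
MIRROR_RE = re.compile(
    r"checksum verify failed on logical (?P<logical>\d+) mirror (?P<mirror>\d+) "
)


def count_failures_by_mirror(lines):
    """Count checksum failures per (logical address, mirror number)."""
    counts = Counter()
    for line in lines:
        match = MIRROR_RE.search(line)
        if match:
            key = (int(match.group("logical")), int(match.group("mirror")))
            counts[key] += 1
    return counts


log = [
    "BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0",
    "BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0",
    "BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0",
    "BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0",
]
counts = count_failures_by_mirror(log)
for (logical, mirror), n in sorted(counts.items()):
    print(f"logical {logical} mirror {mirror}: {n} failure(s)")
```

In this example both mirrors fail twice, which suggests the read was retried once across both copies rather than four times on a single copy.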
    • btrfs: replace unnecessary goto with direct return at cow_file_range() · aaafa1eb
      Naohiro Aota authored
      The 'goto out' statements in the exit block of cow_file_range() are
      unnecessary and jump backwards. Replace them with return, while still
      keeping 'goto out' in the main code.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ keep goto in the main code, update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix error handling of fallback uncompress write · 71aa147b
      Naohiro Aota authored
      When cow_file_range() fails in the middle of the allocation loop, it
      unlocks the pages but leaves the ordered extents intact. Thus, we need
      to call btrfs_cleanup_ordered_extents() to finish the created ordered
      extents.
      
      Also, we need to call end_extent_writepage() if locked_page is available
      because btrfs_cleanup_ordered_extents() never processes the region on
      the locked_page.
      
      Furthermore, we need to set the mapping as error if locked_page is
      unavailable before unlocking the pages, so that the errno is properly
      propagated to the user space.
      
      CC: stable@vger.kernel.org # 5.18+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: extend btrfs_cleanup_ordered_extents for NULL locked_page · 99826e4c
      Naohiro Aota authored
      btrfs_cleanup_ordered_extents() assumes locked_page to be non-NULL, so it
      is not usable for submit_uncompressed_range() which can have NULL
      locked_page.
      
      Add support for the locked_page == NULL case. Also, rewrite the
      redundant "page_offset(locked_page)".
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: ensure pages are unlocked on cow_file_range() failure · 9ce7466f
      Naohiro Aota authored
      There is a hung_task report on zoned btrfs like below.
      
      https://github.com/naota/linux/issues/59
      
        [726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
        [726.329839]       Not tainted 5.16.0-rc1+ #1
        [726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [726.331603] task:rocksdb:high0   state:D stack:    0 pid:11085 ppid: 11082 flags:0x00000000
        [726.331608] Call Trace:
        [726.331611]  <TASK>
        [726.331614]  __schedule+0x2e5/0x9d0
        [726.331622]  schedule+0x58/0xd0
        [726.331626]  io_schedule+0x3f/0x70
        [726.331629]  __folio_lock+0x125/0x200
        [726.331634]  ? find_get_entries+0x1bc/0x240
        [726.331638]  ? filemap_invalidate_unlock_two+0x40/0x40
        [726.331642]  truncate_inode_pages_range+0x5b2/0x770
        [726.331649]  truncate_inode_pages_final+0x44/0x50
        [726.331653]  btrfs_evict_inode+0x67/0x480
        [726.331658]  evict+0xd0/0x180
        [726.331661]  iput+0x13f/0x200
        [726.331664]  do_unlinkat+0x1c0/0x2b0
        [726.331668]  __x64_sys_unlink+0x23/0x30
        [726.331670]  do_syscall_64+0x3b/0xc0
        [726.331674]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [726.331677] RIP: 0033:0x7fb9490a171b
        [726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
        [726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
        [726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
        [726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
        [726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
        [726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
        [726.331693]  </TASK>
      
      While debugging the issue, we found that running fstests generic/551 on
      a 5GB non-zoned null_blk device in the emulated zoned mode also hit a
      similar hang.
      
      Also, we can reproduce the same symptom with an error injected
      cow_file_range() setup.
      
      The hang occurs when cow_file_range() fails in the middle of
      allocation. cow_file_range() called from do_allocation_zoned() can
      split the given region ([start, end]) for allocation depending on
      current block group usages. When btrfs can allocate bytes for one part
      of the split regions but fails for the other region (e.g. because of
      -ENOSPC), we return the error leaving the pages in the succeeded regions
      locked. Technically, this occurs only when @unlock == 0. Otherwise, we
      unlock the pages in an allocated region after creating an ordered
      extent.
      
      Considering the callers of cow_file_range(unlock=0) won't write out
      the pages, we can unlock the pages on error exit from
      cow_file_range(). So, we can ensure all the pages except @locked_page
      are unlocked in the error case.
      
      In summary, cow_file_range now behaves like this:
      
      - page_started == 1 (return value)
        - All the pages are unlocked. IO is started.
      - unlock == 1
        - All the pages except @locked_page are unlocked in any case
      - unlock == 0
        - On success, all the pages are locked for writing them out
        - On failure, all the pages except @locked_page are unlocked
      
      Fixes: 42c01100 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
      CC: stable@vger.kernel.org # 5.12+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
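The behavior summary in the changelog above can be restated as a small decision function. This is only a model of the listed cases, not kernel code; the function name `pages_unlocked_on_return` and the returned strings are hypothetical.

```python
def pages_unlocked_on_return(unlock, success, page_started=False):
    """Model which pages are unlocked when cow_file_range() returns.

    Mirrors the three cases listed in the changelog:
    page_started == 1, unlock == 1, and unlock == 0.
    """
    if page_started:
        # IO has been started and every page is unlocked.
        return "all"
    if unlock:
        # Unlocked in any case, success or failure.
        return "all except locked_page"
    if success:
        # unlock == 0 and success: pages stay locked for the caller to write out.
        return "none"
    # unlock == 0 and failure: the case this commit fixes.
    return "all except locked_page"


for unlock in (1, 0):
    for success in (True, False):
        print(unlock, success, pages_unlocked_on_return(unlock, success))
```

The last branch is the one that changed: before the fix, pages in the regions that had already been allocated stayed locked on failure, which matches the hung task in the report.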
    • btrfs: sysfs: export commit stats · 140a8ff7
      Ioannis Angelakopoulos authored
      Export commit stats in file
      
        /sys/fs/btrfs/UUID/commit_stats
      
      with example output like:
      
        commits 123
        last_commit_ms 11
        max_commit_ms 150
        total_commit_ms 2000
      
      The values are in one file so reading them at a single time will give a
      more consistent view. The stats are internally tracked in nanoseconds so
      the cumulative values should not suffer from rounding errors.
      
      Writing 0 to the file 'commit_stats' will reset max_commit_ms.
      Initial values are set at first mount of the filesystem.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      [ update changelog ]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
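A minimal sketch of how the file could be consumed from userspace, assuming the "name value" layout shown in the example output above. The helper name `parse_commit_stats` is hypothetical, and the sample text stands in for reading the real sysfs file.

```python
def parse_commit_stats(text):
    """Parse 'name value' lines as shown in the example output into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


# Sample copied from the changelog; a real caller would read
# /sys/fs/btrfs/UUID/commit_stats in a single read() instead.
sample = """commits 123
last_commit_ms 11
max_commit_ms 150
total_commit_ms 2000
"""

stats = parse_commit_stats(sample)
avg_ms = stats["total_commit_ms"] / stats["commits"]
print(f"average commit time: {avg_ms:.2f} ms")
```

Because all four values come from one file, a single read gives a consistent snapshot from which derived figures such as the average can be computed.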
    • btrfs: collect commit stats, count, duration · e55958c8
      Ioannis Angelakopoulos authored
      Track several stats about transaction commit, to be later exported via
      sysfs:
      
      - number of commits so far
      - duration of the last commit in ns
      - maximum commit duration seen so far in ns
      - total duration for all commits so far in ns
      
      The update of the commit stats occurs after the commit thread has gone
      through all the logic that checks if there is another thread committing
      at the same time. This means that we only account for actual commit work
      in the commit stats we report and not the time the thread spends waiting
      until it is ready to do the commit work.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
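The four counters listed above can be modelled in a few lines. This is a sketch of the bookkeeping only, under the assumption that values are kept in nanoseconds and converted to milliseconds with integer division when reported; the class name `CommitStats` and its methods are hypothetical and do not mirror the kernel structures.

```python
NSEC_PER_MSEC = 1_000_000


class CommitStats:
    """Toy model of the per-filesystem commit counters, kept in nanoseconds."""

    def __init__(self):
        self.commits = 0
        self.last_ns = 0
        self.max_ns = 0
        self.total_ns = 0

    def record(self, duration_ns):
        """Account for one finished commit of the given duration."""
        self.commits += 1
        self.last_ns = duration_ns
        self.max_ns = max(self.max_ns, duration_ns)
        self.total_ns += duration_ns

    def report_ms(self):
        """Convert to milliseconds only at reporting time (assumed truncation)."""
        return {
            "commits": self.commits,
            "last_commit_ms": self.last_ns // NSEC_PER_MSEC,
            "max_commit_ms": self.max_ns // NSEC_PER_MSEC,
            "total_commit_ms": self.total_ns // NSEC_PER_MSEC,
        }


stats = CommitStats()
for duration in (1_500_000, 700_000):  # 1.5 ms and 0.7 ms
    stats.record(duration)
print(stats.report_ms())
```

Summing in nanoseconds gives a total of 2 ms here, whereas adding up per-commit millisecond values (1 + 0) would have reported 1 ms.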
    • btrfs: remove extent writepage address space operation · f3e90c1c
      Christoph Hellwig authored
      Same as in commit 21b4ee70 ("xfs: drop ->writepage completely"): we
      can remove the callback as it's only used in one place - single page
      writeback from memory reclaim and is not called for cgroup writeback at
      all.
      
      We only allow such writeback from kswapd, not from direct memory
      reclaim, and so it is rarely used. When it comes from kswapd, it is
      effectively random dirty page shoot-down, which is horrible for IO
      patterns. We can rely on background writeback to clean all dirty pages
      in an efficient way and not let it be interrupted by kswapd.
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: use boolean types for current inode status · 9555e1f1
      David Sterba authored
      The new, new_gen and deleted indicate a status, use boolean type instead
      of int.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: remove old TODO regarding ERESTARTSYS · cec3dad9
      David Sterba authored
      The whole send operation is restartable and handling properly a buffer
      write may not be easy. We can't know what caused that and if a short
      delay and retry will fix it or how many retries should be performed in
      case it's a temporary condition.
      
      The error value is returned to the ioctl caller so in case it's a
      transient problem, the user would be notified about the reason. Remove
      the TODO note as there's no plan to handle ERESTARTSYS.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: simplify includes · 8234d3f6
      David Sterba authored
      We don't need the whole ctree.h in send.h, none of the data types
      defined there are used.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: drop __KERNEL__ ifdef from send.h · e3b4b904
      David Sterba authored
      We don't need this ifdef as the header file is not shared, the protocol
      definition used by userspace should be from libbtrfs or libbtrfsutil.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: increase direct io read size limit to 256 sectors · ee5b46a3
      Christoph Hellwig authored
      Btrfs currently limits direct I/O reads to a single sector, which goes
      back to commit c329861d ("Btrfs: don't allocate a separate csums
      array for direct reads") from Josef.  That commit changes the direct I/O
      code to ".. use the private part of the io_tree for our csums.", but ten
      years later that isn't how checksums for direct reads work, instead they
      use a csums allocation on a per-btrfs_dio_private basis (which has its
      own performance problem for small I/O, but that will be addressed later).
      
      There is no fundamental limit in btrfs itself to limit the I/O size
      except for the size of the checksum array that scales linearly with
      the number of sectors in an I/O.  Pick a somewhat arbitrary limit of
      256 sectors, which matches what buffered reads typically see as the
      upper limit, as the limit for direct I/O as well.
      
      This significantly improves direct read performance.  For example a fio
      run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
      throughput:
      
      Baseline:
      
      READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec
      
      With this patch:
      
      READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msec
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
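The numbers in the changelog above can be checked with a little arithmetic. The bandwidth figures are taken from the two fio lines; the 4 KiB sector size and the 4-byte (crc32c) checksum size are assumptions made for illustration, since the commit message does not state them.

```python
# Bandwidth figures copied from the fio output in the changelog.
baseline_mib_s = 65.3
patched_mib_s = 196.0
speedup = patched_mib_s / baseline_mib_s
print(f"speedup: {speedup:.2f}x")

# Assumed values, not stated in the commit message.
SECTORSIZE = 4096   # bytes per sector (common default)
CSUM_SIZE = 4       # bytes per sector checksum (crc32c)

MAX_SECTORS = 256   # the new limit picked by the patch
max_io_bytes = MAX_SECTORS * SECTORSIZE
csum_array_bytes = MAX_SECTORS * CSUM_SIZE
print(f"max direct read per bio: {max_io_bytes // 1024} KiB")
print(f"checksum array for that read: {csum_array_bytes} bytes")
```

Under these assumptions a 256-sector limit corresponds to 1 MiB per read, and the checksum array grows linearly with the sector count as the commit message describes.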
    • btrfs: raid56: don't trust any cached sector in __raid56_parity_recover() · f6065f8e
      Qu Wenruo authored
      [BUG]
      There is a small workload which will always fail with recent kernels:
      (A simplified version from btrfs/125 test case)
      
        mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
        mount $dev1 $mnt
        xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
        sync
        umount $mnt
        btrfs dev scan -u $dev3
        mount -o degraded $dev1 $mnt
        xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
        umount $mnt
        btrfs dev scan
        mount $dev1 $mnt
        btrfs balance start --full-balance $mnt
        umount $mnt
      
      The failure is always a failure to read some tree blocks:
      
        BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
        BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
        BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
        ...
      
      [CAUSE]
      With the recently added debug output, we can see all RAID56 operations
      related to full stripe 38928384:
      
        56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
        56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
        56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
        56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
        56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
        56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
        56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
        56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
        56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
        56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
        56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
        56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
      
      Before we enter raid56_parity_recover(), we have triggered some metadata
      write for the full stripe 38928384, which leads us to read all the
      sectors from disk.
      
      Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to
      avoid unnecessary read.
      
      This means, for that full stripe, after any partial write, we will have
      stale data, along with P/Q calculated using that stale data.
      
      Thankfully due to patch "btrfs: only write the sectors in the vertical stripe
      which has data stripes" we haven't submitted all the corrupted P/Q to disk.
      
      When we really need to recover a certain range, aka in
      raid56_parity_recover(), we will use the cached rbio, along with its
      cached sectors (the full stripe is all cached).
      
      This explains why we have no event raid56_scrub_read_recover()
      triggered.
      
      Since we have the cached P/Q which is calculated using the stale data,
      the recovered one will just be stale.
      
      In our particular test case, it will always return the same incorrect
      metadata, thus causing the same error message "parent transid verify
      failed on 39010304 wanted 9 found 7" again and again.
      
      [BTRFS DESTRUCTIVE RMW PROBLEM]
      
      Test case btrfs/125 (and above workload) always has its trouble with
      the destructive read-modify-write (RMW) cycle:
      
                0       32K     64K
        Data1:  | Good  | Good  |
        Data2:  | Bad   | Bad   |
        Parity: | Good  | Good  |
      
      In above case, if we trigger any write into Data1, we will use the bad
      data in Data2 to re-generate parity, killing the only chance to recover
      Data2, thus Data2 is lost forever.
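
The destructive cycle described above can be illustrated with plain XOR parity, as used for the single-parity (RAID5) case. This is a toy model with made-up 4-byte stripes, not btrfs code; all names and values are hypothetical.

```python
def xor(a, b):
    """Byte-wise XOR of two equally sized buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


# What the stripe should contain.
good_d1 = b"\x11" * 4
good_d2 = b"\x22" * 4
parity = xor(good_d1, good_d2)      # on-disk parity is still good

# What is actually on disk for Data2 (stale or corrupted).
stale_d2 = b"\x00" * 4

# Before any write, Data2 can still be rebuilt from Data1 and parity.
assert xor(good_d1, parity) == good_d2

# Destructive RMW: a write to Data1 regenerates parity from the stale Data2.
new_d1 = b"\x33" * 4
bad_parity = xor(new_d1, stale_d2)

# Recovery now only reproduces the stale content.
recovered = xor(new_d1, bad_parity)
print(recovered == stale_d2, recovered == good_d2)
```

Once the parity has been recomputed from the bad data, no combination of the remaining stripes yields the original Data2 any more.
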
      
      This destructive RMW cycle is not specific to btrfs RAID56, but there
      are some btrfs specific behaviors making the case even worse:
      
      - Btrfs will cache sectors for unrelated vertical stripes.
      
        In above example, if we're only writing into 0~32K range, btrfs will
        still read data range (32K ~ 64K) of Data1, and (64K~128K) of Data2.
        This behavior is to cache sectors for later update.
      
        Incidentally commit d4e28d9b ("btrfs: raid56: make steal_rbio()
        subpage compatible") has a bug which makes RAID56 never trust the
        cached sectors, thus slightly improving the situation for recovery.
      
        Unfortunately, follow up fix "btrfs: update stripe_sectors::uptodate in
        steal_rbio" will revert the behavior back to the old one.
      
      - Btrfs raid56 partial write will update all P/Q sectors and cache them
      
        This means that even if the data at (64K ~ 96K) of Data2 is free
        space, only (96K ~ 128K) of Data2 is really stale data, and we write
        into that (96K ~ 128K) range, we will update all the parity sectors
        for the full stripe.
      
        This unnecessary behavior will completely kill the chance of recovery.
      
        Thankfully, an unrelated optimization "btrfs: only write the sectors
        in the vertical stripe which has data stripes" will prevent
        submitting the write bio for untouched vertical sectors.
      
        That optimization will keep the on-disk P/Q untouched for a chance for
        later recovery.
      
      [FIX]
      Although we have no good way to completely fix the destructive RMW
      (unless we go full scrub for each partial write), we can still limit the
      damage.
      
      With patch "btrfs: only write the sectors in the vertical stripe which
      has data stripes" now we won't really submit the P/Q of unrelated
      vertical stripes, so the on-disk P/Q should still be fine.
      
      Now all we really need to do is drop all the cached sectors when doing
      recovery.
      
      By this, we have a chance to read the original P/Q from disk, and have a
      chance to recover the stale data, while still keeping the cache to speed
      up the regular write path.
      
      In fact, just dropping all the cache for recovery path is good enough to
      allow the test case btrfs/125 along with the small script to pass
      reliably.
      
      The lack of metadata write after the degraded mount, and forced metadata
      COW is saving us this time.
      
      So this patch will fix the behavior by not trusting any cache in
      __raid56_parity_recover(), to solve the problem while still keeping the
      cache useful.
      
      But please note that this test pass DOES NOT mean we have solved the
      destructive RMW problem, we just do damage control a little better.
      
      Related patches:
      
      - btrfs: only write the sectors in the vertical stripe
      - d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible")
      - btrfs: update stripe_sectors::uptodate in steal_rbio
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the finish_func argument to btrfs_mark_ordered_io_finished · 711f447b
      Christoph Hellwig authored
      finish_func is always set to finish_ordered_fn, so remove it and also
      the now pointless and somewhat confusingly named
      __endio_write_update_ordered wrapper.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: batch up release of reserved metadata for delayed items used for deletion · 1f4f639f
      Nikolay Borisov authored
      With Filipe's recent rework of the delayed inode code one aspect which
      isn't batched is the release of the reserved metadata of delayed inode's
      delete items. With this patch on top of Filipe's rework and running the
      same test as provided in the description of a patch titled
      "btrfs: improve batch deletion of delayed dir index items" I observe
      the following change of the number of calls to btrfs_block_rsv_release:
      
      Before this change:
      - block_rsv_release:                      1004
      - btrfs_delete_delayed_items_total_time: 14602
      - delete_batches:                          505
      
      After:
      - block_rsv_release:                       510
      - btrfs_delete_delayed_items_total_time: 13643
      - delete_batches:                          507
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
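To put the before and after figures from the changelog above into relative terms, the following sketch computes the percentage change for each counter. The numbers are copied from the commit message; the dictionary keys are shortened and the helper name `pct_change` is hypothetical. The unit of the total time is not stated in the source, so it is left unlabelled.

```python
before = {"block_rsv_release": 1004, "total_time": 14602, "delete_batches": 505}
after = {"block_rsv_release": 510, "total_time": 13643, "delete_batches": 507}


def pct_change(old, new):
    """Relative change from old to new, in percent (negative means a drop)."""
    return (new - old) / old * 100


changes = {name: pct_change(before[name], after[name]) for name in before}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

The number of release calls roughly halves while the number of delete batches stays almost the same, which is consistent with one release per batch instead of one per item.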
    • btrfs: warn about dev extents that are inside the reserved range · 3613249a
      Qu Wenruo authored
      Btrfs on-disk format has reserved the first 1MiB for the primary super
      block (at 64KiB offset) and bootloaders may also use this space.
      
      This behavior was only introduced in the v4.1 btrfs-progs release.
      Although the kernel can ensure we never touch the reserved range of
      super blocks, it's better to inform the end users; a balance will
      resolve the problem.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ update changelog and message ]
      Signed-off-by: David Sterba <dsterba@suse.com>
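The check behind the warning can be sketched as a comparison against the 1MiB boundary. This is a simplified model based only on the description above; the constant and function names are hypothetical and the kernel's actual condition may differ.

```python
RESERVED_BYTES = 1024 * 1024        # first 1MiB of each device is reserved
PRIMARY_SUPER_OFFSET = 64 * 1024    # primary super block lives at 64KiB


def dev_extent_in_reserved_range(physical_start):
    """Return True if a dev extent starts inside the reserved first 1MiB."""
    return physical_start < RESERVED_BYTES


# The primary super block sits well inside the reserved range.
assert PRIMARY_SUPER_OFFSET < RESERVED_BYTES

# Hypothetical dev extent start offsets, in bytes.
for start in (0, 512 * 1024, 1024 * 1024, 8 * 1024 * 1024):
    if dev_extent_in_reserved_range(start):
        print(f"dev extent at {start} is inside the reserved range")
```

An extent starting exactly at 1MiB is outside the reserved range, so only the first two offsets in the loop are reported.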
    • btrfs: use named constant for reserved device space · 37f85ec3
      Qu Wenruo authored
      There's a reserved space on each device of size 1MiB that can be used by
      bootloaders or to avoid accidental overwrite. Use a symbolic constant
      with the explaining comment instead of hard coding the value and
      multiple comments.
      
      Note: since btrfs-progs v4.1, mkfs.btrfs will reserve the first 1MiB for
      the primary super block (at offset 64KiB), until then the range could
      have been used by mistake. Kernel has been always respecting the 1MiB
      range for writes.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      37f85ec3
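The reserved range described in the two commits above can be illustrated with a minimal sketch. The macro and function names here are hypothetical and chosen for this example only; the sketch assumes that a device extent counts as "inside the reserved range" when its physical start offset is below 1MiB.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical names for this sketch. The first 1MiB of each device is
 * reserved; the primary super block lives at offset 64KiB inside it.
 */
#define DEVICE_RANGE_RESERVED ((uint64_t)1024 * 1024)
#define PRIMARY_SUPER_OFFSET  ((uint64_t)64 * 1024)

/*
 * Return true if a device extent that starts at physical_offset begins
 * inside the reserved range and therefore deserves a warning.
 */
static bool dev_extent_in_reserved_range(uint64_t physical_offset)
{
	return physical_offset < DEVICE_RANGE_RESERVED;
}
```

A single named constant keeps the comparison and its explanation in one place instead of repeating a literal and a comment at every use.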
    • David Sterba's avatar
      btrfs: sink iterator parameter to btrfs_ioctl_logical_to_ino · e3059ec0
      David Sterba authored
      There's only one function we pass to iterate_inodes_from_logical as
      the iterator, so we can drop the indirection and call it directly
      after moving the function to backref.c.
      Signed-off-by: David Sterba <dsterba@suse.com>
      e3059ec0
    • David Sterba's avatar
      btrfs: simplify parameters of backref iterators · 875d1daa
      David Sterba authored
      The inode reference iterator interface takes parameters that are derived
      from the context parameter, but as it's a void* type the values are
      passed individually.
      
      Change the ctx type to inode_fs_path as it's the only thing we pass and
      drop any parameters that are derived from that.
      Signed-off-by: David Sterba <dsterba@suse.com>
      875d1daa
    • David Sterba's avatar
      btrfs: call inode_to_path directly and drop indirection · ad6240f6
      David Sterba authored
      The functions for iterating inode references take a function parameter
      but there's only one value, inode_to_path(). Remove the indirection and
      call the function. As paths_from_inode would become just an alias for
      iterate_irefs(), merge the two into one function.
      Signed-off-by: David Sterba <dsterba@suse.com>
      ad6240f6
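The three backref iterator cleanups above share one idea: when a callback parameter only ever receives a single function, the indirection and the untyped context can go. The following is a minimal sketch of that refactoring pattern with hypothetical types and names (path_ctx, record_path, iterate_refs_*); it does not reproduce the actual backref.c code.

```c
#include <stddef.h>

/* Hypothetical context type standing in for a typed iterator context. */
struct path_ctx {
	size_t count;            /* number of references recorded */
	unsigned long last_inum; /* last inode number seen */
};

/* The only function ever used as the iterator in this sketch. */
static int record_path(unsigned long inum, struct path_ctx *ctx)
{
	ctx->count++;
	ctx->last_inum = inum;
	return 0;
}

/* Before: callback type with an untyped context. */
typedef int (*iterate_fn)(unsigned long inum, void *ctx);

static int record_path_cb(unsigned long inum, void *ctx)
{
	return record_path(inum, ctx);
}

static int iterate_refs_indirect(const unsigned long *inums, size_t n,
				 iterate_fn fn, void *ctx)
{
	for (size_t i = 0; i < n; i++) {
		int ret = fn(inums[i], ctx);

		if (ret)
			return ret;
	}
	return 0;
}

/* After: typed context, direct call, no function pointer. */
static int iterate_refs_direct(const unsigned long *inums, size_t n,
			       struct path_ctx *ctx)
{
	for (size_t i = 0; i < n; i++) {
		int ret = record_path(inums[i], ctx);

		if (ret)
			return ret;
	}
	return 0;
}
```

Both variants record the same references; the direct one lets the compiler check the context type and removes a parameter from every call site.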
    • Qu Wenruo's avatar
      btrfs: use ncopies from btrfs_raid_array in btrfs_num_copies() · 6d322b48
      Qu Wenruo authored
      For all non-RAID56 profiles, we can use btrfs_raid_array[].ncopies
      directly. Only RAID5 and RAID6 need some extra handling as there's no
      table value for them.
      
      For RAID10 there's a change from sub_stripes to ncopies. The values are
      the same but semantically we want to use number of copies, as this is
      what btrfs_num_copies does.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d322b48
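A table-driven copy count, as described in the commit above, might look like the sketch below. The enum, table, and function names are hypothetical stand-ins rather than the kernel's definitions. The RAID5 and RAID6 branches are assumptions made for this example (two attempts for RAID5, one attempt per stripe for RAID6); the commit message only says those profiles need extra handling.

```c
/* Hypothetical profile identifiers for this sketch. */
enum profile {
	P_SINGLE, P_DUP, P_RAID0, P_RAID1, P_RAID1C3, P_RAID1C4,
	P_RAID10, P_RAID5, P_RAID6, P_NR
};

/* Number of copies per profile; RAID5/6 are left at 0 and handled separately. */
static const int ncopies_table[P_NR] = {
	[P_SINGLE]  = 1,
	[P_DUP]     = 2,
	[P_RAID0]   = 1,
	[P_RAID1]   = 2,
	[P_RAID1C3] = 3,
	[P_RAID1C4] = 4,
	[P_RAID10]  = 2,
};

static int num_copies(enum profile p, int num_stripes)
{
	/* Assumption: RAID5 gets two attempts (direct read, then rebuild). */
	if (p == P_RAID5)
		return 2;
	/* Assumption: RAID6 may retry once per stripe in the chunk. */
	if (p == P_RAID6)
		return num_stripes;
	/* Every other profile reads the value straight from the table. */
	return ncopies_table[p];
}
```

For RAID10 the table lookup returns 2, the same number that sub_stripes would give, but the name now matches what the caller is asking for.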
    • Qu Wenruo's avatar
      btrfs: use btrfs_raid_array to calculate number of parity stripes · 0b30f719
      Qu Wenruo authored
      Use the raid table instead of hard coded values and rename the helper as
      it is exported.  This could make later extensions of RAID56-based
      profiles easier.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0b30f719
    • Qu Wenruo's avatar
      btrfs: use btrfs_chunk_max_errors() to replace tolerance calculation · 6dead96c
      Qu Wenruo authored
      In __btrfs_map_block() we have an assignment to @max_errors using
      nr_parity_stripes().
      
      Although it works for RAID56 it's confusing.  Replace it with
      btrfs_chunk_max_errors().
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6dead96c
    • Qu Wenruo's avatar
      btrfs: remove parameter dev_extent_len from scrub_stripe() · bc88b486
      Qu Wenruo authored
      For scrub_stripe() we can easily calculate the dev extent length as we
      have the full info of the chunk.
      
      Thus there is no need to pass @dev_extent_len from the caller, and we
      introduce a helper, btrfs_calc_stripe_length(), to do the calculation
      from extent_map structure.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bc88b486
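To show why the chunk information is enough to derive the dev extent length, here is a minimal sketch. It assumes the length equals the chunk length divided by the number of data stripes, where parity profiles subtract their parity stripes and the other profiles divide by the number of copies. The structs and function names are hypothetical simplifications, not the kernel's extent_map or btrfs_calc_stripe_length().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified per-profile attributes. */
struct raid_attr {
	int ncopies; /* copies of each data stripe */
	int nparity; /* parity stripes, 0 for non-parity profiles */
};

/* Hypothetical, simplified chunk mapping. */
struct chunk_map {
	uint64_t len;    /* logical length of the chunk in bytes */
	int num_stripes; /* device stripes backing the chunk */
	struct raid_attr attr;
};

static int calc_data_stripes(const struct chunk_map *map)
{
	if (map->attr.nparity)
		return map->num_stripes - map->attr.nparity;
	return map->num_stripes / map->attr.ncopies;
}

/* Length of each device extent, derived from the chunk alone. */
static uint64_t calc_stripe_length(const struct chunk_map *map)
{
	int data_stripes = calc_data_stripes(map);

	assert(data_stripes > 0);
	return map->len / (uint64_t)data_stripes;
}
```

Under these assumptions a 2GiB RAID10 chunk on four stripes has two data stripes, so each device extent is 1GiB long, and the caller no longer needs to pass that length in.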
    • David Sterba's avatar
      btrfs: unify tree search helper returning prev and next nodes · 9db33891
      David Sterba authored
      Simplify the helper to return only the next and prev pointers. We don't
      need all the node/parent/prev/next pointers of __etree_search as there
      are now other specialized helpers. Rename parameters so they follow the
      naming.
      Signed-off-by: David Sterba <dsterba@suse.com>
      9db33891
    • David Sterba's avatar
      btrfs: make tree search for insert more generic and use it for tree_search · ec60c76f
      David Sterba authored
      With a slight extension of tree_search_for_insert (fill the return node
      and parent return parameters) we can avoid calling __etree_search from
      tree_search, which could eventually be removed in followup patches.
      Signed-off-by: David Sterba <dsterba@suse.com>
      ec60c76f
    • David Sterba's avatar
      btrfs: open code inexact rbtree search in tree_search · bebb22c1
      David Sterba authored
      The call chain from
      
      tree_search
        tree_search_for_insert
          __etree_search
      
      can be open coded, which allows further simplifications. Here we need
      a tree search that falls back to the next node in case the exact one is
      not found. This is represented as __etree_search parameters
      next_ret=valid, prev_ret=NULL.
      Signed-off-by: David Sterba <dsterba@suse.com>
      bebb22c1
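The "search with fallback to the next node" described above can be sketched on a plain binary search tree of non-overlapping ranges. This is an illustration under simplifying assumptions: it uses a hypothetical range_node struct instead of the kernel rbtree API, and the function name search_or_next is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node: an inclusive range [start, end] in a plain BST. */
struct range_node {
	uint64_t start;
	uint64_t end;
	struct range_node *left;
	struct range_node *right;
};

/*
 * Return the node whose range contains offset. If there is none, return
 * the first node that starts after offset, or NULL if no such node exists.
 */
static struct range_node *search_or_next(struct range_node *root,
					 uint64_t offset)
{
	struct range_node *next = NULL;
	struct range_node *node = root;

	while (node) {
		if (offset < node->start) {
			/* Closest range after offset seen so far. */
			next = node;
			node = node->left;
		} else if (offset > node->end) {
			node = node->right;
		} else {
			return node;
		}
	}
	return next;
}
```

Each step to the left records a tighter upper neighbor, so when the descent ends without a match the recorded node is the nearest range after the offset.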
    • David Sterba's avatar
      btrfs: remove node and parent parameters from insert_state · c367602a
      David Sterba authored
      There's no caller left that would pass valid pointers to insert_state so
      we can drop them.
      Signed-off-by: David Sterba <dsterba@suse.com>
      c367602a
    • David Sterba's avatar
      btrfs: add fast path for extent_state insertion · fb8f07d2
      David Sterba authored
      In two cases the exact location where to insert the extent state is
      known at call time, so instead of passing it to insert_state to take
      the fast path there, handle the insertion in a separate fast path.
      Signed-off-by: David Sterba <dsterba@suse.com>
      fb8f07d2
    • David Sterba's avatar
      btrfs: pass bits by value not by pointer for extent_state helpers · 6d92b304
      David Sterba authored
      The bits are passed by pointer to all extent state helpers for no
      apparent reason. The value is only read and never updated, so remove the
      indirection and pass it directly. Also unify the type to u32 where needed.
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d92b304
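The by-pointer to by-value change can be shown with a minimal before/after sketch. The struct and helper names are hypothetical and only stand in for the extent state helpers; uint32_t plays the role of the kernel's u32.

```c
#include <stdint.h>

/* Hypothetical stand-in for a state object carrying a bit mask. */
struct state {
	uint32_t bits;
};

/* Before: the bits are passed by pointer although they are only read. */
static void set_state_bits_by_ptr(struct state *s, const uint32_t *bits)
{
	s->bits |= *bits;
}

/* After: the bits are passed by value with a unified 32-bit type. */
static void set_state_bits(struct state *s, uint32_t bits)
{
	s->bits |= bits;
}
```

The result is identical, but the by-value form makes it obvious at the call site that the helper cannot modify the caller's mask and avoids a dereference.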