1. 01 Aug, 2024 4 commits
    • btrfs: fix qgroup reserve leaks in cow_file_range · 30479f31
      Boris Burkov authored
      In the buffered write path, the dirty page owns the qgroup reserve until
      it creates an ordered_extent.
      
      Therefore, any errors that occur before the ordered_extent is created
      must free that reservation, or else the space is leaked. The fstest
      generic/475 exercises various IO error paths, and is able to trigger
      errors in cow_file_range where we fail to get to allocating the ordered
      extent. Note that because we *do* clear delalloc, we are likely to
      remove the inode from the delalloc list, so the inodes/pages do not have
      invalidate/launder called on them in the commit abort path.
      
      This results in failures at the unmount stage of the test that look like:
      
        BTRFS: error (device dm-8 state EA) in cleanup_transaction:2018: errno=-5 IO failure
        BTRFS: error (device dm-8 state EA) in btrfs_replace_file_extents:2416: errno=-5 IO failure
        BTRFS warning (device dm-8 state EA): qgroup 0/5 has unreleased space, type 0 rsv 28672
        ------------[ cut here ]------------
        WARNING: CPU: 3 PID: 22588 at fs/btrfs/disk-io.c:4333 close_ctree+0x222/0x4d0 [btrfs]
        Modules linked in: btrfs blake2b_generic libcrc32c xor zstd_compress raid6_pq
        CPU: 3 PID: 22588 Comm: umount Kdump: loaded Tainted: G W          6.10.0-rc7-gab56fde445b8 #21
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
        RIP: 0010:close_ctree+0x222/0x4d0 [btrfs]
        RSP: 0018:ffffb4465283be00 EFLAGS: 00010202
        RAX: 0000000000000001 RBX: ffffa1a1818e1000 RCX: 0000000000000001
        RDX: 0000000000000000 RSI: ffffb4465283bbe0 RDI: ffffa1a19374fcb8
        RBP: ffffa1a1818e13c0 R08: 0000000100028b16 R09: 0000000000000000
        R10: 0000000000000003 R11: 0000000000000003 R12: ffffa1a18ad7972c
        R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
        FS:  00007f9168312b80(0000) GS:ffffa1a4afcc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f91683c9140 CR3: 000000010acaa000 CR4: 00000000000006f0
        Call Trace:
         <TASK>
         ? close_ctree+0x222/0x4d0 [btrfs]
         ? __warn.cold+0x8e/0xea
         ? close_ctree+0x222/0x4d0 [btrfs]
         ? report_bug+0xff/0x140
         ? handle_bug+0x3b/0x70
         ? exc_invalid_op+0x17/0x70
         ? asm_exc_invalid_op+0x1a/0x20
         ? close_ctree+0x222/0x4d0 [btrfs]
         generic_shutdown_super+0x70/0x160
         kill_anon_super+0x11/0x40
         btrfs_kill_super+0x11/0x20 [btrfs]
         deactivate_locked_super+0x2e/0xa0
         cleanup_mnt+0xb5/0x150
         task_work_run+0x57/0x80
         syscall_exit_to_user_mode+0x121/0x130
         do_syscall_64+0xab/0x1a0
         entry_SYSCALL_64_after_hwframe+0x77/0x7f
        RIP: 0033:0x7f916847a887
        ---[ end trace 0000000000000000 ]---
        BTRFS error (device dm-8 state EA): qgroup reserved space leaked
      
      Cases 2 and 3 in the out_reserve path both pertain to this type of leak
      and must free the reserved qgroup data. Because it is already an error
      path, I opted not to handle the possible errors in
      btrfs_free_qgroup_data.
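
The ownership rule described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_qgroup`, `toy_dirty_range` and `toy_cow_range()` are hypothetical names invented for illustration and are not the kernel's data structures or the actual patch. The success branch collapses "the ordered extent takes ownership" and "the ordered extent later releases it" into one step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy qgroup: only tracks how many data bytes are currently reserved. */
struct toy_qgroup {
	uint64_t reserved;
};

/* Toy dirty range: owns `rsv` bytes until an ordered extent takes over. */
struct toy_dirty_range {
	uint64_t rsv;
};

/* Buffered write: the dirty range takes a reservation from the qgroup. */
static void toy_reserve(struct toy_qgroup *qg, struct toy_dirty_range *r,
			uint64_t bytes)
{
	qg->reserved += bytes;
	r->rsv += bytes;
}

/* Give back whatever the dirty range still owns. */
static void toy_free_rsv(struct toy_qgroup *qg, struct toy_dirty_range *r)
{
	assert(qg->reserved >= r->rsv);
	qg->reserved -= r->rsv;
	r->rsv = 0;
}

/*
 * Hypothetical stand-in for the COW path. With io_error set, the function
 * fails before any ordered extent exists; free_on_error selects between the
 * leaking behaviour (false) and the fixed behaviour (true).
 */
static int toy_cow_range(struct toy_qgroup *qg, struct toy_dirty_range *r,
			 bool io_error, bool free_on_error)
{
	if (io_error) {
		if (free_on_error)
			toy_free_rsv(qg, r);
		return -5; /* -EIO */
	}
	/* Success: ownership moves to the ordered extent, released later. */
	toy_free_rsv(qg, r);
	return 0;
}
```

Calling `toy_cow_range()` with `io_error` set and `free_on_error` clear leaves `reserved` non-zero, which is the toy analogue of the "unreleased space" warning at unmount.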
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: implement launder_folio for clearing dirty page reserve · 872617a0
      Boris Burkov authored
      In the buffered write path, dirty pages can be said to "own" the qgroup
      reservation until they create an ordered_extent. It is possible for
      there to be outstanding dirty pages when a transaction is aborted, in
      which case there is no cancellation path for freeing this reservation
      and it is leaked.
      
      We do already walk the list of outstanding delalloc inodes in
      btrfs_destroy_delalloc_inodes() and call invalidate_inode_pages2() on them.
      
      This does *not* call btrfs_invalidate_folio(), as one might guess, but
      rather calls launder_folio() and release_folio(). Since this is a
      reservation associated with dirty pages only, rather than something
      associated with the private bit (ordered_extent is cancelled separately
      already in the cleanup transaction path), implementing this release
      should be done via launder_folio.
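
A simplified userspace model of the callback order described above is sketched below. It assumes a much reduced view of the invalidation path (dirty folios are laundered first, then every folio is released); `toy_folio`, `toy_aops` and `toy_invalidate_folio2()` are hypothetical names, not kernel APIs, and the real path has more steps and error handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy folio: a dirty flag plus the qgroup reservation it still owns. */
struct toy_folio {
	bool dirty;
	uint64_t qgroup_rsv;
};

/* Toy address space operations; either callback may be NULL. */
struct toy_aops {
	int (*launder_folio)(struct toy_folio *folio);
	bool (*release_folio)(struct toy_folio *folio);
};

/* Assumed launder hook: drop the reservation owned by the dirty folio. */
static int toy_launder_folio(struct toy_folio *folio)
{
	folio->qgroup_rsv = 0;
	return 0;
}

/* Assumed release hook: nothing private to tear down in this model. */
static bool toy_release_folio(struct toy_folio *folio)
{
	(void)folio;
	return true;
}

/* Reduced invalidation: launder only dirty folios, then release. */
static int toy_invalidate_folio2(const struct toy_aops *ops,
				 struct toy_folio *folio)
{
	assert(ops != NULL && folio != NULL);

	if (folio->dirty && ops->launder_folio) {
		int ret = ops->launder_folio(folio);

		if (ret)
			return ret;
	}
	if (ops->release_folio && !ops->release_folio(folio))
		return -16; /* -EBUSY */
	return 0;
}
```

With a NULL `launder_folio` the reservation survives invalidation in this model, which corresponds to the leak the commit message describes.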
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: update last_physical after scrubbing one stripe · 63447b7d
      Qu Wenruo authored
      Currently sctx->stat.last_physical is only updated in the following
      cases:
      
      - When the last stripe of a non-RAID56 chunk is scrubbed
        This implies a pitfall, if the last stripe is at the chunk boundary,
        and we finished the scrub of the whole chunk, we won't update
        last_physical at all until the next chunk.
      
      - When a P/Q stripe of a RAID56 chunk is scrubbed
      
      This leads to the following two problems:
      
      - sctx->stat.last_physical is not updated for an almost full chunk
        This is especially bad, affecting scrub resume, as the resume would
        start from last_physical, causing unnecessary re-scrub.
      
      - "btrfs scrub status" will not report any progress for a long time
      
      Fix the problem by properly updating @last_physical after each stripe is
      scrubbed.
      
      And since we're here, for the sake of consistency, use spin lock to
      protect the update of @last_physical, just like all the remaining
      call sites touching sctx->stat.
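
The per-stripe update can be sketched in userspace as follows. This is an assumption-laden model, not the patch: `toy_scrub_ctx` and `toy_scrub_chunk()` are hypothetical, a C11 `atomic_flag` stands in for the kernel spin lock, and the 64K stripe size is taken from the commit text of the neighbouring helper commit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_STRIPE_LEN (64 * 1024ULL)

/* Toy scrub context: a lock and the progress marker it protects. */
struct toy_scrub_ctx {
	atomic_flag stat_lock;   /* stands in for the kernel spin lock */
	uint64_t last_physical;
};

static void toy_scrub_ctx_init(struct toy_scrub_ctx *sctx)
{
	atomic_flag_clear(&sctx->stat_lock);
	sctx->last_physical = 0;
}

static void toy_lock(atomic_flag *lock)
{
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void toy_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/*
 * Scrub at most max_stripes stripes of the physical range
 * [physical_start, physical_start + length) and record progress after
 * every stripe. Returns the number of stripes processed.
 */
static unsigned int toy_scrub_chunk(struct toy_scrub_ctx *sctx,
				    uint64_t physical_start, uint64_t length,
				    unsigned int max_stripes)
{
	uint64_t end = physical_start + length;
	uint64_t cur = physical_start;
	unsigned int done = 0;

	assert(length > 0);
	while (cur < end && done < max_stripes) {
		uint64_t len = end - cur < TOY_STRIPE_LEN ?
			       end - cur : TOY_STRIPE_LEN;

		/* ... verification of the stripe would happen here ... */

		toy_lock(&sctx->stat_lock);
		sctx->last_physical = cur + len;
		toy_unlock(&sctx->stat_lock);

		cur += len;
		done++;
	}
	return done;
}
```

If the scrub is interrupted after three stripes, a resume in this model would start from `last_physical` rather than from the beginning of the chunk.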
      Reported-by: Michel Palleau <michel.palleau@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAMFk-+igFTv2E8svg=cQ6o3e6CrR5QwgQ3Ok9EyRaEvvthpqCQ@mail.gmail.com/
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: factor out stripe length calculation into a helper · 33eb1e5d
      Qu Wenruo authored
      Currently there are two locations which need to calculate the real
      length of a stripe (which can be at the end of a chunk, and the chunk
      size may not always be 64K aligned).
      
      Factor them into a helper as we're going to have a third user soon.
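
A minimal sketch of such a helper, assuming a 64K stripe size and a chunk described by its start and length, might look like the following. The name `toy_stripe_length()` and its parameters are hypothetical; the kernel helper operates on its own structures.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_STRIPE_LEN (64 * 1024ULL)

/*
 * Real length of the stripe starting at stripe_logical: a full stripe,
 * unless the chunk ends earlier because its size is not 64K aligned.
 */
static uint64_t toy_stripe_length(uint64_t chunk_start, uint64_t chunk_len,
				  uint64_t stripe_logical)
{
	uint64_t chunk_end = chunk_start + chunk_len;
	uint64_t remaining;

	assert(stripe_logical >= chunk_start && stripe_logical < chunk_end);
	remaining = chunk_end - stripe_logical;
	return remaining < TOY_STRIPE_LEN ? remaining : TOY_STRIPE_LEN;
}
```

For a 100K chunk starting at 0, the stripe at offset 0 is a full 64K while the stripe at offset 64K is only 36K long.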
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 30 Jul, 2024 1 commit
  3. 29 Jul, 2024 4 commits
    • btrfs: fix corruption after buffer fault in during direct IO append write · 939b656b
      Filipe Manana authored
      During an append (O_APPEND write flag) direct IO write if the input buffer
      was not previously faulted in, we can corrupt the file in a way that the
      final size is unexpected and it includes an unexpected hole.
      
      The problem happens like this:
      
      1) We have an empty file, with size 0, for example;
      
      2) We do an O_APPEND direct IO with a length of 4096 bytes and the input
         buffer is not currently faulted in;
      
      3) We enter btrfs_direct_write(), lock the inode and call
         generic_write_checks(), which calls generic_write_checks_count(), and
         that function sets the iocb position to 0 with the following code:
      
      	if (iocb->ki_flags & IOCB_APPEND)
      		iocb->ki_pos = i_size_read(inode);
      
      4) We call btrfs_dio_write() and enter into iomap, which will end up
         calling btrfs_dio_iomap_begin() and that calls
         btrfs_get_blocks_direct_write(), where we update the i_size of the
         inode to 4096 bytes;
      
      5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access
         the page of the write input buffer (at iomap_dio_bio_iter(), with a
         call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets
         returned to btrfs at btrfs_direct_write() via btrfs_dio_write();
      
      6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode,
         fault in the write buffer and then go to the label 'relock';
      
      7) We lock the inode again, redo all the necessary checks and call
         generic_write_checks() again, which calls generic_write_checks_count()
         again, and there we set the iocb's position to 4K, which is the current
         i_size of the inode, with the same code shown above:
      
              if (iocb->ki_flags & IOCB_APPEND)
                      iocb->ki_pos = i_size_read(inode);
      
      8) Then we go again to btrfs_dio_write() and enter iomap and the write
         succeeds, but it wrote to the file range [4K, 8K), leaving a hole in
         the [0, 4K) range and an i_size of 8K, which goes against the
         expectations of having the data written to the range [0, 4K) and get an
         i_size of 4K.
      
      Fix this by not unlocking the inode before faulting in the input buffer,
      in case we get -EFAULT or an incomplete write, and not jumping to the
      'relock' label after faulting in the buffer - instead jump to a location
      immediately before calling iomap, skipping all the write checks and
      relocking. This solves this problem and it's fine even in case the input
      buffer is memory mapped to the same file range, since only holding the
      range locked in the inode's io tree can cause a deadlock, it's safe to
      keep the inode lock (VFS lock), as was fixed and described in commit
      51bd9563 ("btrfs: fix deadlock due to page faults during direct IO
      reads and writes").
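
The sequence of steps above, and the effect of skipping the write checks on retry, can be replayed with a small userspace simulation. This is a sketch only: `toy_inode`, `toy_result` and `toy_append_dio()` are hypothetical names, and the model reduces the whole direct IO path to "compute position, extend i_size, maybe fault, retry".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_inode {
	uint64_t i_size;
};

struct toy_result {
	uint64_t write_pos; /* file offset the data finally landed at */
	uint64_t i_size;    /* file size after the write */
};

/*
 * Simulate one O_APPEND direct IO write of `len` bytes.
 * recheck_after_fault == true models the old flow (jump back to 'relock'
 * and redo the write checks); false models the fixed flow (retry just
 * before calling iomap, keeping the position computed the first time).
 */
static struct toy_result toy_append_dio(struct toy_inode *inode, uint64_t len,
					bool buffer_faulted_in,
					bool recheck_after_fault)
{
	/* Write checks: with O_APPEND the position is the current i_size. */
	uint64_t pos = inode->i_size;

	assert(len > 0);
	for (;;) {
		/* iomap begin: i_size is extended to cover the write. */
		if (pos + len > inode->i_size)
			inode->i_size = pos + len;

		if (buffer_faulted_in)
			break; /* pages pinned, the write succeeds */

		/* -EFAULT: fault in the buffer, then retry. */
		buffer_faulted_in = true;
		if (recheck_after_fault)
			pos = inode->i_size; /* position is recomputed */
	}
	return (struct toy_result){ pos, inode->i_size };
}
```

Starting from an empty file, the old flow ends with the data at offset 4096 and a size of 8192, while the fixed flow ends with the data at offset 0 and a size of 4096.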
      
      A sample reproducer provided by a reporter is the following:
      
         $ cat test.c
         #ifndef _GNU_SOURCE
         #define _GNU_SOURCE
         #endif
      
         #include <fcntl.h>
         #include <stdio.h>
         #include <sys/mman.h>
         #include <sys/stat.h>
         #include <unistd.h>
      
         int main(int argc, char *argv[])
         {
             if (argc < 2) {
                 fprintf(stderr, "Usage: %s <test file>\n", argv[0]);
                 return 1;
             }
      
             int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT |
                           O_APPEND, 0644);
             if (fd < 0) {
                 perror("creating test file");
                 return 1;
             }
      
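             /* Deliberately not touched, so the buffer is not faulted in yet. */
             char *buf = mmap(NULL, 4096, PROT_READ,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
             if (buf == MAP_FAILED) {
                 perror("mmap");
                 return 1;
             }
             ssize_t ret = write(fd, buf, 4096);
             if (ret < 0) {
                 perror("write");
                 return 1;
             }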
      
             struct stat stbuf;
             ret = fstat(fd, &stbuf);
             if (ret < 0) {
                 perror("stat");
                 return 1;
             }
      
             printf("size: %llu\n", (unsigned long long)stbuf.st_size);
             return stbuf.st_size == 4096 ? 0 : 1;
         }
      
      A test case for fstests will be sent soon.
      Reported-by: Hanna Czenczek <hreitz@redhat.com>
      Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/
      Fixes: 8184620a ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb")
      CC: stable@vger.kernel.org # 6.1+
      Tested-by: Hanna Czenczek <hreitz@redhat.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: fix zone_unusable accounting on making block group read-write again · 8cd44dd1
      Naohiro Aota authored
      When btrfs makes a block group read-only, it adds all free regions in the
      block group to space_info->bytes_readonly. That free space excludes
      reserved and pinned regions. OTOH, when btrfs makes the block group
      read-write again, it moves all the unused regions into the block group's
      zone_unusable. That unused region includes reserved and pinned regions.
      As a result, it counts too many zone_unusable bytes.
      
      Fortunately (or unfortunately), having erroneous zone_unusable does not
      affect the calculation of space_info->bytes_readonly, because free
      space (num_bytes in btrfs_dec_block_group_ro) calculation is done based on
      the erroneous zone_unusable and it reduces the num_bytes just to cancel the
      error.
      
      This behavior can be easily discovered by adding a WARN_ON to check, e.g.,
      "bg->pinned > 0" in btrfs_dec_block_group_ro(), and running an fstests
      test case like btrfs/282.
      
      Fix it by properly considering pinned and reserved in
      btrfs_dec_block_group_ro(). Also, add a WARN_ON and introduce
      btrfs_space_info_update_bytes_zone_unusable() to catch a similar mistake.
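
The over-counting can be made concrete with a little arithmetic. The sketch below is my reading of the commit message, not the patch itself: `toy_bg` and both functions are hypothetical, and the formulas assume that the unused region is everything below the allocation offset that is not `used`, plus the part of the zone beyond its capacity.

```c
#include <assert.h>
#include <stdint.h>

/* Toy block group counters for a zoned device (all in bytes). */
struct toy_bg {
	uint64_t length;        /* zone size */
	uint64_t zone_capacity; /* usable part of the zone */
	uint64_t alloc_offset;  /* write pointer within the zone */
	uint64_t used;
	uint64_t pinned;
	uint64_t reserved;
};

/* Assumed old behaviour: pinned and reserved are counted as unusable. */
static uint64_t toy_zone_unusable_old(const struct toy_bg *bg)
{
	assert(bg->alloc_offset >= bg->used);
	return (bg->alloc_offset - bg->used) +
	       (bg->length - bg->zone_capacity);
}

/* Assumed fixed behaviour: pinned and reserved are excluded. */
static uint64_t toy_zone_unusable_fixed(const struct toy_bg *bg)
{
	assert(bg->alloc_offset >= bg->used + bg->pinned + bg->reserved);
	return (bg->alloc_offset - bg->used - bg->pinned - bg->reserved) +
	       (bg->length - bg->zone_capacity);
}
```

With `alloc_offset = 100`, `used = 60`, `pinned = 10` and `reserved = 5` in a zone whose capacity equals its length, the old formula yields 40 unusable bytes while the fixed one yields 25; the difference is exactly `pinned + reserved`.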
      
      Fixes: 169e0da9 ("btrfs: zoned: track unusable bytes for zones")
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not subtract delalloc from avail bytes · d89c285d
      Naohiro Aota authored
      The block group's avail bytes printed when dumping a space info subtract
      the delalloc_bytes. However, as shown in btrfs_add_reserved_bytes() and
      btrfs_free_reserved_bytes(), it is added or subtracted along with
      "reserved" for the delalloc case, which means the "delalloc_bytes" is a
      part of the "reserved" bytes. So, excluding it to calculate the avail space
      counts delalloc_bytes twice, which can lead to an invalid result.
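
The double counting is easy to see numerically. The following sketch assumes a block group described by the counters named in the message plus two extra counters (`bytes_super`, `zone_unusable`) that I believe take part in the calculation; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy block group counters (bytes). delalloc_bytes is part of reserved. */
struct toy_bg_counters {
	uint64_t length;
	uint64_t used;
	uint64_t pinned;
	uint64_t reserved;
	uint64_t delalloc_bytes;
	uint64_t bytes_super;
	uint64_t zone_unusable;
};

/* Assumed old calculation: delalloc_bytes is subtracted a second time. */
static uint64_t toy_avail_old(const struct toy_bg_counters *bg)
{
	assert(bg != NULL);
	return bg->length - bg->used - bg->pinned - bg->reserved -
	       bg->delalloc_bytes - bg->bytes_super - bg->zone_unusable;
}

/* Assumed fixed calculation: reserved already covers delalloc_bytes. */
static uint64_t toy_avail_fixed(const struct toy_bg_counters *bg)
{
	assert(bg != NULL);
	return bg->length - bg->used - bg->pinned - bg->reserved -
	       bg->bytes_super - bg->zone_unusable;
}
```

For a 1000-byte group with 600 bytes used and 300 bytes reserved, all of them delalloc, the fixed calculation reports 100 available bytes, while the old one underflows the unsigned type and reports a value larger than the group itself.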
      
      Fixes: e50b122b ("btrfs: print available space for a block group when dumping a space info")
      CC: stable@vger.kernel.org # 6.6+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make cow_file_range_inline() honor locked_page on error · 47857437
      Boris Burkov authored
      The btrfs buffered write path runs through __extent_writepage() which
      has some tricky return value handling for writepage_delalloc().
      Specifically, when that returns 1, we exit, but for other return values
      we continue and end up calling btrfs_folio_end_all_writers(). If the
      folio has been unlocked (note that we check the PageLocked bit at the
      start of __extent_writepage()), this results in an assert panic like
      this one from syzbot:
      
        BTRFS: error (device loop0 state EAL) in free_log_tree:3267: errno=-5 IO failure
        BTRFS warning (device loop0 state EAL): Skipping commit of aborted transaction.
        BTRFS: error (device loop0 state EAL) in cleanup_transaction:2018: errno=-5 IO failure
        assertion failed: folio_test_locked(folio), in fs/btrfs/subpage.c:871
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/subpage.c:871!
        Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
        CPU: 1 PID: 5090 Comm: syz-executor225 Not tainted
        6.10.0-syzkaller-05505-gb1bc554e #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
        Google 06/27/2024
        RIP: 0010:btrfs_folio_end_all_writers+0x55b/0x610 fs/btrfs/subpage.c:871
        Code: e9 d3 fb ff ff e8 25 22 c2 fd 48 c7 c7 c0 3c 0e 8c 48 c7 c6 80 3d
        0e 8c 48 c7 c2 60 3c 0e 8c b9 67 03 00 00 e8 66 47 ad 07 90 <0f> 0b e8
        6e 45 b0 07 4c 89 ff be 08 00 00 00 e8 21 12 25 fe 4c 89
        RSP: 0018:ffffc900033d72e0 EFLAGS: 00010246
        RAX: 0000000000000045 RBX: 00fff0000000402c RCX: 663b7a08c50a0a00
        RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
        RBP: ffffc900033d73b0 R08: ffffffff8176b98c R09: 1ffff9200067adfc
        R10: dffffc0000000000 R11: fffff5200067adfd R12: 0000000000000001
        R13: dffffc0000000000 R14: 0000000000000000 R15: ffffea0001cbee80
        FS:  0000000000000000(0000) GS:ffff8880b9500000(0000)
        knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f5f076012f8 CR3: 000000000e134000 CR4: 00000000003506f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
        <TASK>
        __extent_writepage fs/btrfs/extent_io.c:1597 [inline]
        extent_write_cache_pages fs/btrfs/extent_io.c:2251 [inline]
        btrfs_writepages+0x14d7/0x2760 fs/btrfs/extent_io.c:2373
        do_writepages+0x359/0x870 mm/page-writeback.c:2656
        filemap_fdatawrite_wbc+0x125/0x180 mm/filemap.c:397
        __filemap_fdatawrite_range mm/filemap.c:430 [inline]
        __filemap_fdatawrite mm/filemap.c:436 [inline]
        filemap_flush+0xdf/0x130 mm/filemap.c:463
        btrfs_release_file+0x117/0x130 fs/btrfs/file.c:1547
        __fput+0x24a/0x8a0 fs/file_table.c:422
        task_work_run+0x24f/0x310 kernel/task_work.c:222
        exit_task_work include/linux/task_work.h:40 [inline]
        do_exit+0xa2f/0x27f0 kernel/exit.c:877
        do_group_exit+0x207/0x2c0 kernel/exit.c:1026
        __do_sys_exit_group kernel/exit.c:1037 [inline]
        __se_sys_exit_group kernel/exit.c:1035 [inline]
        __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1035
        x64_sys_call+0x2634/0x2640
        arch/x86/include/generated/asm/syscalls_64.h:232
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
        entry_SYSCALL_64_after_hwframe+0x77/0x7f
        RIP: 0033:0x7f5f075b70c9
        Code: Unable to access opcode bytes at
        0x7f5f075b709f.
      
      I was hitting the same issue by doing hundreds of accelerated runs of
      generic/475, which also hits IO errors by design.
      
      I instrumented that reproducer with bpftrace and found that the
      undesirable folio_unlock was coming from the following callstack:
      
        folio_unlock+5
        __process_pages_contig+475
        cow_file_range_inline.constprop.0+230
        cow_file_range+803
        btrfs_run_delalloc_range+566
        writepage_delalloc+332
        __extent_writepage # inlined in my stacktrace, but I added it here
        extent_write_cache_pages+622
      
      Looking at the bisected-to patch in the syzbot report, Josef realized
      that the logic of the cow_file_range_inline error path had subtly changed.
      In the past, on error, it jumped to out_unlock in cow_file_range(),
      which honors the locked_page, so when we ultimately call
      folio_end_all_writers(), the folio of interest is still locked. After
      the change, we always unlocked ignoring the locked_page, on both success
      and error. On the success path, this all results in returning 1 to
      __extent_writepage(), which skips the folio_end_all_writers() call,
      which makes it OK to have unlocked.
      
      Fix the bug by wiring the locked_page into cow_file_range_inline() and
      only setting locked_page to NULL on success.
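
The locking rule in the fix can be modelled in userspace as follows. This is a hypothetical sketch: `toy_page`, `toy_unlock_range()` and `toy_cow_inline()` are invented for illustration, and returning 1 on success only mirrors the convention described in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool locked;
};

/* Unlock every page in the range except locked_page (which may be NULL). */
static void toy_unlock_range(struct toy_page *pages, size_t n,
			     const struct toy_page *locked_page)
{
	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (&pages[i] != locked_page)
			pages[i].locked = false;
	}
}

/*
 * Toy inline-extent path. On error the caller's locked_page stays locked,
 * because the caller will still touch it. On success the caller exits
 * early, so every page, including locked_page, may be unlocked.
 */
static int toy_cow_inline(struct toy_page *pages, size_t n,
			  struct toy_page *locked_page, bool fail)
{
	if (fail) {
		toy_unlock_range(pages, n, locked_page);
		return -5; /* -EIO */
	}
	toy_unlock_range(pages, n, NULL);
	return 1;
}
```

After a failing call, the page passed as `locked_page` is still locked in this model, so a later "end all writers" step on it would not trip a locked-page assertion.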
      
      Reported-by: syzbot+a14d8ac9af3a2a4fd0c8@syzkaller.appspotmail.com
      Fixes: 0586d0a8 ("btrfs: move extent bit and page cleanup into cow_file_range_inline")
      CC: stable@vger.kernel.org # 6.10+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 25 Jul, 2024 2 commits
    • btrfs: fix corrupt read due to bad offset of a compressed extent map · de9f46cb
      Filipe Manana authored
      If we attempt to insert a compressed extent map that has a range that
      overlaps another extent map we have in the inode's extent map tree, we
      can end up with an incorrect offset after adjusting the new extent map at
      merge_extent_mapping() because we don't update the extent map's offset.
      
      For example consider the following scenario:
      
      1) We have a file extent item for a compressed extent covering the file
         range [108K, 144K) and currently there's no corresponding extent map
         in the inode's extent map tree;
      
      2) The inode's size is 141K;
      
      3) We have an encoded write (compressed) into the file range [120K, 128K),
         which overlaps the existing file extent item. The encoded write creates
         a matching extent map, adds it to the inode's extent map tree and
         creates an ordered extent for it.
      
         Note that the corresponding file extent item is added to the subvolume
         tree only when the ordered extent completes (when executing
         btrfs_finish_one_ordered());
      
      4) We have a write into the file range [160K, 164K).
      
         This write increases the i_size of the file, and there's a hole
         between the current i_size (141K) and the start offset of this write,
         and since the old i_size is in the middle of the block [140K, 144K),
         we have to write zeroes to the range [141K, 144K) (3072 bytes) and
         therefore dirty that page.
      
         We then call btrfs_set_extent_delalloc() with a start offset of 140K.
         We then end up at btrfs_find_new_delalloc_bytes() which will call
         btrfs_get_extent() for the range [140K, 144K);
      
      5) The btrfs_get_extent() doesn't find any extent map in the inode's
         extent map tree covering the range [140K, 144K), so it searches the
         subvolume tree for any file extent items covering that range.
      
         There it finds the file extent item for the range [108K, 144K),
         creates a compressed extent map for that range and then calls
         btrfs_add_extent_mapping() with that extent map and passes the
         range [140K, 144K) via the "start" and "len" parameters;
      
      6) The call to add_extent_mapping() done by btrfs_add_extent_mapping()
         fails with -EEXIST because there's an extent map, created at step 3
         for the [120K, 128K) range, that overlaps with the range of the
         given extent map ([108K, 144K)).
      
         Then it does a lookup for the extent map from step 3 and calls
         merge_extent_mapping() to adjust the input extent map ([108K, 144K)).
         That adjusts the extent map to a start offset of 128K and a length
         of 16K (starting just after the extent map from step 3), but it does
         not update the offset field of the extent map, leaving it with a value
         of zero instead of updating to a value of 20K (128K - 108K = 20K).
      
         As a result any read for the range [128K, 144K) can return
         incorrect data since we read from a wrong section of the extent (unless
         both the correct and incorrect ranges happen to have the same data).
      
      So fix this by changing merge_extent_mapping() to update the extent map's
      offset even if it's compressed. Also add a test case to the self tests.
      This didn't happen before the patchset that does big changes in the extent
      map structure (which includes the commit in the Fixes tag below) because
      we kept track of the original start offset in the extent map (member
      "orig_start") so we could always calculate the correct offset by
      subtracting that offset from the start offset.
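
The adjustment that the fix requires is plain offset arithmetic, sketched below in userspace. `toy_em` and `toy_trim_front()` are hypothetical names; the real extent map carries more fields and the real merge logic also handles trimming at the end.

```c
#include <assert.h>
#include <stdint.h>

/* Toy extent map: file range plus the offset into the on-disk extent. */
struct toy_em {
	uint64_t start;  /* file offset where the mapping begins */
	uint64_t len;    /* length of the mapped file range */
	uint64_t offset; /* offset into the extent the range starts at */
};

/* Move the front of the mapping forward to new_start. */
static void toy_trim_front(struct toy_em *em, uint64_t new_start)
{
	uint64_t delta;

	assert(new_start >= em->start && new_start < em->start + em->len);
	delta = new_start - em->start;
	em->start = new_start;
	em->len -= delta;
	/* The step that was missing for compressed extent maps. */
	em->offset += delta;
}
```

Applied to the scenario above, trimming the [108K, 144K) map to start at 128K gives a length of 16K and an offset of 20K.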
      
      A test case for fstests that triggered this problem using send/receive
      with compressed writes will be added soon.
      
      Fixes: 3d2ac992 ("btrfs: introduce new members for extent_map")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tree-checker: validate dref root and objectid · f333a3c7
      Qu Wenruo authored
      [CORRUPTION]
      There is a bug report that btrfs flips RO due to a corruption in the
      extent tree. The involved dump looks like this:
      
       	item 188 key (402811572224 168 4096) itemoff 14598 itemsize 79
       		extent refs 3 gen 3678544 flags 1
       		ref#0: extent data backref root 13835058055282163977 objectid 281473384125923 offset 81432576 count 1
       		ref#1: shared data backref parent 1947073626112 count 1
       		ref#2: shared data backref parent 1156030103552 count 1
       BTRFS critical (device vdc1: state EA): unable to find ref byte nr 402811572224 parent 0 root 265 owner 28703026 offset 81432576 slot 189
       BTRFS error (device vdc1: state EA): failed to run delayed ref for logical 402811572224 num_bytes 4096 type 178 action 2 ref_mod 1: -2
      
      [CAUSE]
      The corrupted entry is ref#0 of item 188.
      The root number 13835058055282163977 is beyond the upper limit for root
      items (the current limit is 1 << 48), and the objectid also looks
      suspicious.
      
      Only the offset and count are correct.
      
      [ENHANCEMENT]
      Although it's still unknown why so many bytes were corrupted
      randomly, we can still enhance the tree-checker for data backrefs by:
      
      - Validate the root value
        For now there should only be 3 types of roots that can have data backrefs:
        * subvolume trees
        * data reloc trees
        * root tree
          Only for v1 space cache
      
      - validate the objectid value
        The objectid should be a valid inode number.
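
The two checks listed above can be sketched in userspace like this. All constants are assumptions based on my understanding of the btrfs on-disk format (only the `1 << 48` limit is stated in the message), and the function names are hypothetical rather than the tree-checker's own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed object ids; treat every value here as an assumption. */
#define TOY_ROOT_TREE_OBJECTID        1ULL
#define TOY_FS_TREE_OBJECTID          5ULL
#define TOY_FIRST_FREE_OBJECTID       256ULL
#define TOY_LAST_FREE_OBJECTID        ((uint64_t)-256LL)
#define TOY_DATA_RELOC_TREE_OBJECTID  ((uint64_t)-9LL)
#define TOY_ROOT_ID_LIMIT             (1ULL << 48)

/* A data backref root must be a subvolume, data reloc or root tree. */
static bool toy_is_valid_dref_root(uint64_t rootid)
{
	bool is_subvolume = rootid == TOY_FS_TREE_OBJECTID ||
			    (rootid >= TOY_FIRST_FREE_OBJECTID &&
			     rootid < TOY_ROOT_ID_LIMIT);

	return is_subvolume ||
	       rootid == TOY_DATA_RELOC_TREE_OBJECTID ||
	       rootid == TOY_ROOT_TREE_OBJECTID;
}

/* A data backref objectid must fall in the inode number range. */
static bool toy_is_valid_dref_objectid(uint64_t objectid)
{
	return objectid >= TOY_FIRST_FREE_OBJECTID &&
	       objectid <= TOY_LAST_FREE_OBJECTID;
}
```

Under these assumptions the root 13835058055282163977 from the dump is rejected, while the expected root 265 and owner 28703026 from the error message both pass.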
      
      Hopefully we can catch such problems in the future with the new checkers.
      Reported-by: Kai Krakow <hurikhan77@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAMthOuPjg5RDT-G_LXeBBUUtzt3cq=JywF+D1_h+JYxe=WKp-Q@mail.gmail.com/#t
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 19 Jul, 2024 1 commit
    • btrfs: change BTRFS_MOUNT_* flags to 64bit type · c3ece6b7
      Qu Wenruo authored
      Currently the BTRFS_MOUNT_* flags already extend beyond 32 bits. This is
      going to cause compilation errors on some 32 bit systems, as their
      unsigned long is only 32 bits long, so the flag
      BTRFS_MOUNT_IGNORESUPERFLAGS overflows and can lead to errors.
      
      Fix the problem by:
      
      - Migrate all existing BTRFS_MOUNT_* flags to unsigned long long
      - Migrate all mount option related variables to unsigned long long
        * btrfs_fs_info::mount_opt
        * btrfs_fs_context::mount_opt
        * mount_opt parameter of btrfs_check_options()
        * old_opts parameter of btrfs_remount_begin()
        * old_opts parameter of btrfs_remount_cleanup()
        * mount_opt parameter of btrfs_check_mountopts_zoned()
        * mount_opt and opt parameters of check_ro_option()
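
Why a 32-bit container cannot hold such a flag can be shown with fixed-width types in userspace. In this sketch `uint32_t` stands in for a 32-bit `unsigned long`, bit 32 is used as a hypothetical flag position, and the helper names are invented; the actual failure mode on a given compiler may be a build error rather than silent truncation.

```c
#include <assert.h>
#include <stdint.h>

/* Set flag `bit` in a 32-bit option word; bits >= 32 are truncated away. */
static uint32_t toy_set_flag_32(uint32_t opts, unsigned int bit)
{
	assert(bit < 64);
	return opts | (uint32_t)(1ULL << bit);
}

/* Set flag `bit` in a 64-bit option word; every bit below 64 fits. */
static uint64_t toy_set_flag_64(uint64_t opts, unsigned int bit)
{
	assert(bit < 64);
	return opts | (1ULL << bit);
}
```

Setting bit 32 through the 32-bit helper leaves the option word unchanged, which is the toy equivalent of a mount option that silently never takes effect; the 64-bit helper keeps it.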
      
      Fixes: 32e62165 ("btrfs: introduce new "rescue=ignoresuperflags" mount option")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 11 Jul, 2024 28 commits