1. 01 Aug, 2024 1 commit
  2. 30 Jul, 2024 1 commit
  3. 29 Jul, 2024 4 commits
    • Filipe Manana's avatar
      btrfs: fix corruption after buffer fault in during direct IO append write · 939b656b
      Filipe Manana authored
      During an append (O_APPEND write flag) direct IO write if the input buffer
      was not previously faulted in, we can corrupt the file in a way that the
      final size is unexpected and it includes an unexpected hole.
      
      The problem happens like this:
      
      1) We have an empty file, with size 0, for example;
      
      2) We do an O_APPEND direct IO with a length of 4096 bytes and the input
         buffer is not currently faulted in;
      
      3) We enter btrfs_direct_write(), lock the inode and call
         generic_write_checks(), which calls generic_write_checks_count(), and
         that function sets the iocb position to 0 with the following code:
      
      	if (iocb->ki_flags & IOCB_APPEND)
      		iocb->ki_pos = i_size_read(inode);
      
      4) We call btrfs_dio_write() and enter into iomap, which will end up
         calling btrfs_dio_iomap_begin() and that calls
         btrfs_get_blocks_direct_write(), where we update the i_size of the
         inode to 4096 bytes;
      
      5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access
         the page of the write input buffer (at iomap_dio_bio_iter(), with a
         call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets
         returned to btrfs at btrfs_direct_write() via btrfs_dio_write();
      
      6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode,
         fault in the write buffer and then goto to the label 'relock';
      
      7) We lock again the inode, do all the necessary checks again and call
         again generic_write_checks(), which calls generic_write_checks_count()
         again, and there we set the iocb's position to 4K, which is the current
         i_size of the inode, with the following code pointed above:
      
              if (iocb->ki_flags & IOCB_APPEND)
                      iocb->ki_pos = i_size_read(inode);
      
      8) Then we go again to btrfs_dio_write() and enter iomap and the write
         succeeds, but it wrote to the file range [4K, 8K), leaving a hole in
         the [0, 4K) range and an i_size of 8K, which goes against the
         expectations of having the data written to the range [0, 4K) and get an
         i_size of 4K.
      
      Fix this by not unlocking the inode before faulting in the input buffer,
      in case we get -EFAULT or an incomplete write, and not jumping to the
      'relock' label after faulting in the buffer - instead jump to a location
      immediately before calling iomap, skipping all the write checks and
      relocking. This solves this problem and it's fine even in case the input
      buffer is memory mapped to the same file range, since only holding the
      range locked in the inode's io tree can cause a deadlock, it's safe to
      keep the inode lock (VFS lock), as was fixed and described in commit
      51bd9563 ("btrfs: fix deadlock due to page faults during direct IO
      reads and writes").
      
      A sample reproducer provided by a reporter is the following:
      
         $ cat test.c
         #ifndef _GNU_SOURCE
         #define _GNU_SOURCE
         #endif
      
         #include <fcntl.h>
         #include <stdio.h>
         #include <sys/mman.h>
         #include <sys/stat.h>
         #include <unistd.h>
      
         int main(int argc, char *argv[])
         {
             if (argc < 2) {
                 fprintf(stderr, "Usage: %s <test file>\n", argv[0]);
                 return 1;
             }
      
             int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT |
                           O_APPEND, 0644);
             if (fd < 0) {
                 perror("creating test file");
                 return 1;
             }
      
             char *buf = mmap(NULL, 4096, PROT_READ,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
             ssize_t ret = write(fd, buf, 4096);
             if (ret < 0) {
                 perror("pwritev2");
                 return 1;
             }
      
             struct stat stbuf;
             ret = fstat(fd, &stbuf);
             if (ret < 0) {
                 perror("stat");
                 return 1;
             }
      
             printf("size: %llu\n", (unsigned long long)stbuf.st_size);
             return stbuf.st_size == 4096 ? 0 : 1;
         }
      
      A test case for fstests will be sent soon.
      Reported-by: default avatarHanna Czenczek <hreitz@redhat.com>
      Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/
      Fixes: 8184620a ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb")
      CC: stable@vger.kernel.org # 6.1+
      Tested-by: default avatarHanna Czenczek <hreitz@redhat.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      939b656b
    • Naohiro Aota's avatar
      btrfs: zoned: fix zone_unusable accounting on making block group read-write again · 8cd44dd1
      Naohiro Aota authored
      When btrfs makes a block group read-only, it adds all free regions in the
      block group to space_info->bytes_readonly. That free space excludes
      reserved and pinned regions. OTOH, when btrfs makes the block group
      read-write again, it moves all the unused regions into the block group's
      zone_unusable. That unused region includes reserved and pinned regions.
      As a result, it counts too much zone_unusable bytes.
      
      Fortunately (or unfortunately), having erroneous zone_unusable does not
      affect the calculation of space_info->bytes_readonly, because free
      space (num_bytes in btrfs_dec_block_group_ro) calculation is done based on
      the erroneous zone_unusable and it reduces the num_bytes just to cancel the
      error.
      
      This behavior can be easily discovered by adding a WARN_ON to check e.g,
      "bg->pinned > 0" in btrfs_dec_block_group_ro(), and running fstests test
      case like btrfs/282.
      
      Fix it by properly considering pinned and reserved in
      btrfs_dec_block_group_ro(). Also, add a WARN_ON and introduce
      btrfs_space_info_update_bytes_zone_unusable() to catch a similar mistake.
      
      Fixes: 169e0da9 ("btrfs: zoned: track unusable bytes for zones")
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8cd44dd1
    • Naohiro Aota's avatar
      btrfs: do not subtract delalloc from avail bytes · d89c285d
      Naohiro Aota authored
      The block group's avail bytes printed when dumping a space info subtract
      the delalloc_bytes. However, as shown in btrfs_add_reserved_bytes() and
      btrfs_free_reserved_bytes(), it is added or subtracted along with
      "reserved" for the delalloc case, which means the "delalloc_bytes" is a
      part of the "reserved" bytes. So, excluding it to calculate the avail space
      counts delalloc_bytes twice, which can lead to an invalid result.
      
      Fixes: e50b122b ("btrfs: print available space for a block group when dumping a space info")
      CC: stable@vger.kernel.org # 6.6+
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d89c285d
    • Boris Burkov's avatar
      btrfs: make cow_file_range_inline() honor locked_page on error · 47857437
      Boris Burkov authored
      The btrfs buffered write path runs through __extent_writepage() which
      has some tricky return value handling for writepage_delalloc().
      Specifically, when that returns 1, we exit, but for other return values
      we continue and end up calling btrfs_folio_end_all_writers(). If the
      folio has been unlocked (note that we check the PageLocked bit at the
      start of __extent_writepage()), this results in an assert panic like
      this one from syzbot:
      
        BTRFS: error (device loop0 state EAL) in free_log_tree:3267: errno=-5 IO failure
        BTRFS warning (device loop0 state EAL): Skipping commit of aborted transaction.
        BTRFS: error (device loop0 state EAL) in cleanup_transaction:2018: errno=-5 IO failure
        assertion failed: folio_test_locked(folio), in fs/btrfs/subpage.c:871
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/subpage.c:871!
        Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
        CPU: 1 PID: 5090 Comm: syz-executor225 Not tainted
        6.10.0-syzkaller-05505-gb1bc554e #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
        Google 06/27/2024
        RIP: 0010:btrfs_folio_end_all_writers+0x55b/0x610 fs/btrfs/subpage.c:871
        Code: e9 d3 fb ff ff e8 25 22 c2 fd 48 c7 c7 c0 3c 0e 8c 48 c7 c6 80 3d
        0e 8c 48 c7 c2 60 3c 0e 8c b9 67 03 00 00 e8 66 47 ad 07 90 <0f> 0b e8
        6e 45 b0 07 4c 89 ff be 08 00 00 00 e8 21 12 25 fe 4c 89
        RSP: 0018:ffffc900033d72e0 EFLAGS: 00010246
        RAX: 0000000000000045 RBX: 00fff0000000402c RCX: 663b7a08c50a0a00
        RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
        RBP: ffffc900033d73b0 R08: ffffffff8176b98c R09: 1ffff9200067adfc
        R10: dffffc0000000000 R11: fffff5200067adfd R12: 0000000000000001
        R13: dffffc0000000000 R14: 0000000000000000 R15: ffffea0001cbee80
        FS:  0000000000000000(0000) GS:ffff8880b9500000(0000)
        knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f5f076012f8 CR3: 000000000e134000 CR4: 00000000003506f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
        <TASK>
        __extent_writepage fs/btrfs/extent_io.c:1597 [inline]
        extent_write_cache_pages fs/btrfs/extent_io.c:2251 [inline]
        btrfs_writepages+0x14d7/0x2760 fs/btrfs/extent_io.c:2373
        do_writepages+0x359/0x870 mm/page-writeback.c:2656
        filemap_fdatawrite_wbc+0x125/0x180 mm/filemap.c:397
        __filemap_fdatawrite_range mm/filemap.c:430 [inline]
        __filemap_fdatawrite mm/filemap.c:436 [inline]
        filemap_flush+0xdf/0x130 mm/filemap.c:463
        btrfs_release_file+0x117/0x130 fs/btrfs/file.c:1547
        __fput+0x24a/0x8a0 fs/file_table.c:422
        task_work_run+0x24f/0x310 kernel/task_work.c:222
        exit_task_work include/linux/task_work.h:40 [inline]
        do_exit+0xa2f/0x27f0 kernel/exit.c:877
        do_group_exit+0x207/0x2c0 kernel/exit.c:1026
        __do_sys_exit_group kernel/exit.c:1037 [inline]
        __se_sys_exit_group kernel/exit.c:1035 [inline]
        __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1035
        x64_sys_call+0x2634/0x2640
        arch/x86/include/generated/asm/syscalls_64.h:232
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
        entry_SYSCALL_64_after_hwframe+0x77/0x7f
        RIP: 0033:0x7f5f075b70c9
        Code: Unable to access opcode bytes at
        0x7f5f075b709f.
      
      I was hitting the same issue by doing hundreds of accelerated runs of
      generic/475, which also hits IO errors by design.
      
      I instrumented that reproducer with bpftrace and found that the
      undesirable folio_unlock was coming from the following callstack:
      
        folio_unlock+5
        __process_pages_contig+475
        cow_file_range_inline.constprop.0+230
        cow_file_range+803
        btrfs_run_delalloc_range+566
        writepage_delalloc+332
        __extent_writepage # inlined in my stacktrace, but I added it here
        extent_write_cache_pages+622
      
      Looking at the bisected-to patch in the syzbot report, Josef realized
      that the logic of the cow_file_range_inline error path subtly changing.
      In the past, on error, it jumped to out_unlock in cow_file_range(),
      which honors the locked_page, so when we ultimately call
      folio_end_all_writers(), the folio of interest is still locked. After
      the change, we always unlocked ignoring the locked_page, on both success
      and error. On the success path, this all results in returning 1 to
      __extent_writepage(), which skips the folio_end_all_writers() call,
      which makes it OK to have unlocked.
      
      Fix the bug by wiring the locked_page into cow_file_range_inline() and
      only setting locked_page to NULL on success.
      
      Reported-by: syzbot+a14d8ac9af3a2a4fd0c8@syzkaller.appspotmail.com
      Fixes: 0586d0a8 ("btrfs: move extent bit and page cleanup into cow_file_range_inline")
      CC: stable@vger.kernel.org # 6.10+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      47857437
  4. 25 Jul, 2024 2 commits
    • Filipe Manana's avatar
      btrfs: fix corrupt read due to bad offset of a compressed extent map · de9f46cb
      Filipe Manana authored
      If we attempt to insert a compressed extent map that has a range that
      overlaps another extent map we have in the inode's extent map tree, we
      can end up with an incorrect offset after adjusting the new extent map at
      merge_extent_mapping() because we don't update the extent map's offset.
      
      For example consider the following scenario:
      
      1) We have a file extent item for a compressed extent covering the file
         range [108K, 144K) and currently there's no corresponding extent map
         in the inode's extent map tree;
      
      2) The inode's size is 141K;
      
      3) We have an encoded write (compressed) into the file range [120K, 128K),
         which overlaps the existing file extent item. The encoded write creates
         a matching extent map, adds it to the inode's extent map tree and
         creates an ordered extent for it.
      
         Note that the corresponding file extent item is added to the subvolume
         tree only when the ordered extent completes (when executing
         btrfs_finish_one_ordered());
      
      4) We have a write into the file range [160K, 164K).
      
         This writes increases the i_size of the file, and there's a hole
         between the current i_size (141K) and the start offset of this write,
         and since the old i_size is in the middle of the block [140K, 144K),
         we have to write zeroes to the range [141K, 144K) (3072 bytes) and
         therefore dirty that page.
      
         We then call btrfs_set_extent_delalloc() with a start offset of 140K.
         We then end up at btrfs_find_new_delalloc_bytes() which will call
         btrfs_get_extent() for the range [140K, 144K);
      
      5) The btrfs_get_extent() doesn't find any extent map in the inode's
         extent map tree covering the range [140K, 144K), so it searches the
         subvolume tree for any file extent items covering that range.
      
         There it finds the file extent item for the range [108K, 144K),
         creates a compressed extent map for that range and then calls
         btrfs_add_extent_mapping() with that extent map and passes the
         range [140K, 144K) via the "start" and "len" parameters;
      
      6) The call to add_extent_mapping() done by btrfs_add_extent_mapping()
         fails with -EEXIST because there's an extent map, created at step 2
         for the [120K, 128K) range, that covers that overlaps with the range
         of the given extent map ([108K, 144K)).
      
         Then it does a lookup for extent map from step 2 add calls
         merge_extent_mapping() to adjust the input extent map ([108K, 144K)).
         That adjust the extent map to a start offset of 128K and a length
         of 16K (starting just after the extent map from step 2), but it does
         not update the offset field of the extent map, leaving it with a value
         of zero instead of updating to a value of 20K (128K - 108K = 20K).
      
         As a result any read for the range [128K, 144K) can return
         incorrect data since we read from a wrong section of the extent (unless
         both the correct and incorrect ranges happen to have the same data).
      
      So fix this by changing merge_extent_mapping() to update the extent map's
      offset even if it's compressed. Also add a test case to the self tests.
      This didn't happen before the patchset that does big changes in the extent
      map structure (which includes the commit in the Fixes tag below) because
      we kept track of the original start offset in the extent map (member
      "orig_start") so we could always calculate the correct offset by
      subtracting that offset from the start offset.
      
      A test case for fstests that triggered this problem using send/receive
      with compressed writes will be added soon.
      
      Fixes: 3d2ac992 ("btrfs: introduce new members for extent_map")
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      de9f46cb
    • Qu Wenruo's avatar
      btrfs: tree-checker: validate dref root and objectid · f333a3c7
      Qu Wenruo authored
      [CORRUPTION]
      There is a bug report that btrfs flips RO due to a corruption in the
      extent tree, the involved dumps looks like this:
      
       	item 188 key (402811572224 168 4096) itemoff 14598 itemsize 79
       		extent refs 3 gen 3678544 flags 1
       		ref#0: extent data backref root 13835058055282163977 objectid 281473384125923 offset 81432576 count 1
       		ref#1: shared data backref parent 1947073626112 count 1
       		ref#2: shared data backref parent 1156030103552 count 1
       BTRFS critical (device vdc1: state EA): unable to find ref byte nr 402811572224 parent 0 root 265 owner 28703026 offset 81432576 slot 189
       BTRFS error (device vdc1: state EA): failed to run delayed ref for logical 402811572224 num_bytes 4096 type 178 action 2 ref_mod 1: -2
      
      [CAUSE]
      The corrupted entry is ref#0 of item 188.
      The root number 13835058055282163977 is beyond the upper limit for root
      items (the current limit is 1 << 48), and the objectid also looks
      suspicious.
      
      Only the offset and count is correct.
      
      [ENHANCEMENT]
      Although it's still unknown why we have such many bytes corrupted
      randomly, we can still enhance the tree-checker for data backrefs by:
      
      - Validate the root value
        For now there should only be 3 types of roots can have data backref:
        * subvolume trees
        * data reloc trees
        * root tree
          Only for v1 space cache
      
      - validate the objectid value
        The objectid should be a valid inode number.
      
      Hopefully we can catch such problem in the future with the new checkers.
      Reported-by: default avatarKai Krakow <hurikhan77@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAMthOuPjg5RDT-G_LXeBBUUtzt3cq=JywF+D1_h+JYxe=WKp-Q@mail.gmail.com/#tReviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f333a3c7
  5. 19 Jul, 2024 1 commit
    • Qu Wenruo's avatar
      btrfs: change BTRFS_MOUNT_* flags to 64bit type · c3ece6b7
      Qu Wenruo authored
      Currently the BTRFS_MOUNT_* flags are already beyond 32 bits, this is
      going to cause compilation errors for some 32 bit systems, as their
      unsigned long is only 32 bits long, thus flag
      BTRFS_MOUNT_IGNORESUPERFLAGS overflows and can lead to errors.
      
      Fix the problem by:
      
      - Migrate all existing BTRFS_MOUNT_* flags to unsigned long long
      - Migrate all mount option related variables to unsigned long long
        * btrfs_fs_info::mount_opt
        * btrfs_fs_context::mount_opt
        * mount_opt parameter of btrfs_check_options()
        * old_opts parameter of btrfs_remount_begin()
        * old_opts parameter of btrfs_remount_cleanup()
        * mount_opt parameter of btrfs_check_mountopts_zoned()
        * mount_opt and opt parameters of check_ro_option()
      
      Fixes: 32e62165 ("btrfs: introduce new "rescue=ignoresuperflags" mount option")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c3ece6b7
  6. 11 Jul, 2024 31 commits