1. 03 Jan, 2023 7 commits
    • Qu Wenruo's avatar
      btrfs: fix compat_ro checks against remount · 2ba48b20
      Qu Wenruo authored
      [BUG]
      Even with commit 81d5d614 ("btrfs: enhance unsupported compat RO
      flags handling"), btrfs can still mount a fs with unsupported compat_ro
      flags read-only, then remount it RW:
      
        # btrfs ins dump-super /dev/loop0 | grep compat_ro_flags -A 3
        compat_ro_flags		0x403
      			( FREE_SPACE_TREE |
      			  FREE_SPACE_TREE_VALID |
      			  unknown flag: 0x400 )
      
        # mount /dev/loop0 /mnt/btrfs
        mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
               dmesg(1) may have more information after failed mount system call.
        ^^^ RW mount failed as expected ^^^
      
        # dmesg -t | tail -n5
        loop0: detected capacity change from 0 to 1048576
        BTRFS: device fsid cb5b82f5-0fdd-4d81-9b4b-78533c324afa devid 1 transid 7 /dev/loop0 scanned by mount (1146)
        BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm
        BTRFS info (device loop0): using free space tree
        BTRFS error (device loop0): cannot mount read-write because of unknown compat_ro features (0x403)
        BTRFS error (device loop0): open_ctree failed
      
        # mount /dev/loop0 -o ro /mnt/btrfs
        # mount -o remount,rw /mnt/btrfs
        ^^^ RW remount succeeded unexpectedly ^^^
      
      [CAUSE]
      Currently we use btrfs_check_features() to check compat_ro flags against
      our current mount flags.
      
      That function get reused between open_ctree() and btrfs_remount().
      
      But for btrfs_remount(), the super block we passed in still has the old
      mount flags, thus btrfs_check_features() still believes we're mounting
      read-only.
      
      [FIX]
      Replace the existing @sb argument with @is_rw_mount.
      
      As originally we only use @sb to determine if the mount is RW.
      
      Now it's callers' responsibility to determine if the mount is RW, and
      since there are only two callers, the check is pretty simple:
      
      - caller in open_ctree()
        Just pass !sb_rdonly().
      
      - caller in btrfs_remount()
        Pass !(*flags & SB_RDONLY), as our check should be against the new
        flags.
      
      Now we can correctly reject the RW remount:
      
        # mount /dev/loop0 -o ro /mnt/btrfs
        # mount -o remount,rw /mnt/btrfs
        mount: /mnt/btrfs: mount point not mounted or bad option.
               dmesg(1) may have more information after failed mount system call.
        # dmesg -t | tail -n 1
        BTRFS error (device loop0: state M): cannot mount read-write because of unknown compat_ro features (0x403)
      Reported-by: default avatarChung-Chiang Cheng <shepjeng@gmail.com>
      Fixes: 81d5d614 ("btrfs: enhance unsupported compat RO flags handling")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2ba48b20
    • Qu Wenruo's avatar
      btrfs: always report error in run_one_delayed_ref() · 39f501d6
      Qu Wenruo authored
      Currently we have a btrfs_debug() for run_one_delayed_ref() failure, but
      if end users hit such problem, there will be no chance that
      btrfs_debug() is enabled.  This can lead to very little useful info for
      debugging.
      
      This patch will:
      
      - Add extra info for error reporting
        Including:
        * logical bytenr
        * num_bytes
        * type
        * action
        * ref_mod
      
      - Replace the btrfs_debug() with btrfs_err()
      
      - Move the error reporting into run_one_delayed_ref()
        This is to avoid use-after-free, the @node can be freed in the caller.
      
      This error should only be triggered at most once.
      
      As if run_one_delayed_ref() failed, we trigger the error message, then
      causing the call chain to error out:
      
      btrfs_run_delayed_refs()
      `- btrfs_run_delayed_refs()
         `- btrfs_run_delayed_refs_for_head()
            `- run_one_delayed_ref()
      
      And we will abort the current transaction in btrfs_run_delayed_refs().
      If we have to run delayed refs for the abort transaction,
      run_one_delayed_ref() will just cleanup the refs and do nothing, thus no
      new error messages would be output.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      39f501d6
    • Qu Wenruo's avatar
      btrfs: handle case when repair happens with dev-replace · d73a27b8
      Qu Wenruo authored
      [BUG]
      There is a bug report that a BUG_ON() in btrfs_repair_io_failure()
      (originally repair_io_failure() in v6.0 kernel) got triggered when
      replacing a unreliable disk:
      
        BTRFS warning (device sda1): csum failed root 257 ino 2397453 off 39624704 csum 0xb0d18c75 expected csum 0x4dae9c5e mirror 3
        kernel BUG at fs/btrfs/extent_io.c:2380!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 9 PID: 3614331 Comm: kworker/u257:2 Tainted: G           OE      6.0.0-5-amd64 #1  Debian 6.0.10-2
        Hardware name: Micro-Star International Co., Ltd. MS-7C60/TRX40 PRO WIFI (MS-7C60), BIOS 2.70 07/01/2021
        Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
        RIP: 0010:repair_io_failure+0x24a/0x260 [btrfs]
        Call Trace:
         <TASK>
         clean_io_failure+0x14d/0x180 [btrfs]
         end_bio_extent_readpage+0x412/0x6e0 [btrfs]
         ? __switch_to+0x106/0x420
         process_one_work+0x1c7/0x380
         worker_thread+0x4d/0x380
         ? rescuer_thread+0x3a0/0x3a0
         kthread+0xe9/0x110
         ? kthread_complete_and_exit+0x20/0x20
         ret_from_fork+0x22/0x30
      
      [CAUSE]
      
      Before the BUG_ON(), we got some read errors from the replace target
      first, note the mirror number (3, which is beyond RAID1 duplication,
      thus it's read from the replace target device).
      
      Then at the BUG_ON() location, we are trying to writeback the repaired
      sectors back the failed device.
      
      The check looks like this:
      
      		ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical,
      				      &map_length, &bioc, mirror_num);
      		if (ret)
      			goto out_counter_dec;
      		BUG_ON(mirror_num != bioc->mirror_num);
      
      But inside btrfs_map_block(), we can modify bioc->mirror_num especially
      for dev-replace:
      
      	if (dev_replace_is_ongoing && mirror_num == map->num_stripes + 1 &&
      	    !need_full_stripe(op) && dev_replace->tgtdev != NULL) {
      		ret = get_extra_mirror_from_replace(fs_info, logical, *length,
      						    dev_replace->srcdev->devid,
      						    &mirror_num,
      					    &physical_to_patch_in_first_stripe);
      		patch_the_first_stripe_for_dev_replace = 1;
      	}
      
      Thus if we're repairing the replace target device, we're going to
      trigger that BUG_ON().
      
      But in reality, the read failure from the replace target device may be
      that, our replace hasn't reached the range we're reading, thus we're
      reading garbage, but with replace running, the range would be properly
      filled later.
      
      Thus in that case, we don't need to do anything but let the replace
      routine to handle it.
      
      [FIX]
      Instead of a BUG_ON(), just skip the repair if we're repairing the
      device replace target device.
      Reported-by: default avatar小太 <nospam@kota.moe>
      Link: https://lore.kernel.org/linux-btrfs/CACsxjPYyJGQZ+yvjzxA1Nn2LuqkYqTCcUH43S=+wXhyf8S00Ag@mail.gmail.com/
      CC: stable@vger.kernel.org # 6.0+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d73a27b8
    • Filipe Manana's avatar
      btrfs: fix off-by-one in delalloc search during lseek · 2f2e84ca
      Filipe Manana authored
      During lseek, when searching for delalloc in a range that represents a
      hole and that range has a length of 1 byte, we end up not doing the actual
      delalloc search in the inode's io tree, resulting in not correctly
      reporting the offset with data or a hole. This actually only happens when
      the start offset is 0 because with any other start offset we round it down
      by sector size.
      
      Reproducer:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ xfs_io -f -c "pwrite -q 0 1" /mnt/sdc/foo
      
        $ xfs_io -c "seek -d 0" /mnt/sdc/foo
        Whence   Result
        DATA	   EOF
      
      It should have reported an offset of 0 instead of EOF.
      
      Fix this by updating btrfs_find_delalloc_in_range() and count_range_bits()
      to deal with inclusive ranges properly. These functions are already
      supposed to work with inclusive end offsets, they just got it wrong in a
      couple places due to off-by-one mistakes.
      
      A test case for fstests will be added later.
      Reported-by: default avatarJoan Bruguera Micó <joanbrugueram@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/20221223020509.457113-1-joanbrugueram@gmail.com/
      Fixes: b6e83356 ("btrfs: make hole and data seeking a lot more efficient")
      CC: stable@vger.kernel.org # 6.1
      Tested-by: default avatarJoan Bruguera Micó <joanbrugueram@gmail.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2f2e84ca
    • Qu Wenruo's avatar
      btrfs: fix false alert on bad tree level check · 1d854e4f
      Qu Wenruo authored
      [BUG]
      There is a bug report that on a RAID0 NVMe btrfs system, under heavy
      write load the filesystem can flip RO randomly.
      
      With extra debugging, it shows some tree blocks failed to pass their
      level checks, and if that happens at critical path of a transaction, we
      abort the transaction:
      
        BTRFS error (device nvme0n1p3): level verify failed on logical 5446121209856 mirror 1 wanted 0 found 1
        BTRFS error (device nvme0n1p3: state A): Transaction aborted (error -5)
        BTRFS: error (device nvme0n1p3: state A) in btrfs_finish_ordered_io:3343: errno=-5 IO failure
        BTRFS info (device nvme0n1p3: state EA): forced readonly
      
      [CAUSE]
      The reporter has already bisected to commit 947a6299 ("btrfs: move
      tree block parentness check into validate_extent_buffer()").
      
      And with extra debugging, it shows we can have btrfs_tree_parent_check
      filled with all zeros in the following call trace:
      
        submit_one_bio+0xd4/0xe0
        submit_extent_page+0x142/0x550
        read_extent_buffer_pages+0x584/0x9c0
        ? __pfx_end_bio_extent_readpage+0x10/0x10
        ? folio_unlock+0x1d/0x50
        btrfs_read_extent_buffer+0x98/0x150
        read_tree_block+0x43/0xa0
        read_block_for_search+0x266/0x370
        btrfs_search_slot+0x351/0xd30
        ? lock_is_held_type+0xe8/0x140
        btrfs_lookup_csum+0x63/0x150
        btrfs_csum_file_blocks+0x197/0x6c0
        ? sched_clock_cpu+0x9f/0xc0
        ? lock_release+0x14b/0x440
        ? _raw_read_unlock+0x29/0x50
        btrfs_finish_ordered_io+0x441/0x860
        btrfs_work_helper+0xfe/0x400
        ? lock_is_held_type+0xe8/0x140
        process_one_work+0x294/0x5b0
        worker_thread+0x4f/0x3a0
        ? __pfx_worker_thread+0x10/0x10
        kthread+0xf5/0x120
        ? __pfx_kthread+0x10/0x10
        ret_from_fork+0x2c/0x50
      
      Currently we only copy the btrfs_tree_parent_check structure into bbio
      at read_extent_buffer_pages() after we have assembled the bbio.
      
      But as shown above, submit_extent_page() itself can already submit the
      bbio, leaving the bbio->parent_check uninitialized, and cause the false
      alert.
      
      [FIX]
      Instead of copying @check into bbio after bbio is assembled, we pass
      @check in btrfs_bio_ctrl::parent_check, and copy the content of
      parent_check in submit_one_bio() for metadata read.
      
      By this we should be able to pass the needed info for metadata endio
      verification, and fix the false alert.
      Reported-by: default avatarMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CABXGCsNzVxo4iq-tJSGm_kO1UggHXgq6CdcHDL=z5FL4njYXSQ@mail.gmail.com/
      Fixes: 947a6299 ("btrfs: move tree block parentness check into validate_extent_buffer()")
      Tested-by: default avatarMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1d854e4f
    • Qu Wenruo's avatar
      btrfs: add error message for metadata level mismatch · 77177ed1
      Qu Wenruo authored
      From a recent regression report, we found that after commit 947a6299
      ("btrfs: move tree block parentness check into
      validate_extent_buffer()") if we have a level mismatch (false alert
      though), there is no error message at all.
      
      This makes later debugging harder.  This patch will add the proper error
      message for such case.
      
      Link: https://lore.kernel.org/linux-btrfs/CABXGCsNzVxo4iq-tJSGm_kO1UggHXgq6CdcHDL=z5FL4njYXSQ@mail.gmail.com/Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      77177ed1
    • Tanmay Bhushan's avatar
      btrfs: fix ASSERT em->len condition in btrfs_get_extent · 946c2923
      Tanmay Bhushan authored
      The em->len value is supposed to be verified in the assertion condition
      though we expect it to be same as the sectorsize.
      
      Fixes: a196a894 ("btrfs: do not reset extent map members for inline extents read")
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarTanmay Bhushan <007047221b@gmail.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      946c2923
  2. 20 Dec, 2022 3 commits
    • Filipe Manana's avatar
      btrfs: fix fscrypt name leak after failure to join log transaction · fee4c199
      Filipe Manana authored
      When logging a new name, we don't expect to fail joining a log transaction
      since we know at least one of the inodes was logged before in the current
      transaction. However if we fail for some unexpected reason, we end up not
      freeing the fscrypt name we previously allocated. So fix that by freeing
      the name in case we failed to join a log transaction.
      
      Fixes: ab3c5c18 ("btrfs: setup qstr from dentrys using fscrypt helper")
      Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fee4c199
    • Josef Bacik's avatar
      btrfs: scrub: fix uninitialized return value in recover_scrub_rbio · e7fc357e
      Josef Bacik authored
      Commit 75b47033 ("btrfs: raid56: migrate recovery and scrub recovery
      path to use error_bitmap") introduced an uninitialized return variable.
      
      This can be caught by gcc 12.1 by -Wmaybe-uninitialized:
      
        CC [M]  fs/btrfs/raid56.o
      fs/btrfs/raid56.c: In function ‘scrub_rbio’:
      fs/btrfs/raid56.c:2801:15: warning: ‘ret’ may be used uninitialized [-Wmaybe-uninitialized]
       2801 |         ret = recover_scrub_rbio(rbio);
            |               ^~~~~~~~~~~~~~~~~~~~~~~~
      fs/btrfs/raid56.c:2649:13: note: ‘ret’ was declared here
       2649 |         int ret;
      
      The warning is disabled by default so we haven't caught that.
      
      Due to the bug the raid56 scrub fstests have been failing since the
      patch was merged, so initialize that.
      
      Fixes: 75b47033 ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap")
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e7fc357e
    • Boris Burkov's avatar
      btrfs: fix resolving backrefs for inline extent followed by prealloc · 560840af
      Boris Burkov authored
      If a file consists of an inline extent followed by a regular or prealloc
      extent, then a legitimate attempt to resolve a logical address in the
      non-inline region will result in add_all_parents reading the invalid
      offset field of the inline extent. If the inline extent item is placed
      in the leaf eb s.t. it is the first item, attempting to access the
      offset field will not only be meaningless, it will go past the end of
      the eb and cause this panic:
      
        [17.626048] BTRFS warning (device dm-2): bad eb member end: ptr 0x3fd4 start 30834688 member offset 16377 size 8
        [17.631693] general protection fault, probably for non-canonical address 0x5088000000000: 0000 [#1] SMP PTI
        [17.635041] CPU: 2 PID: 1267 Comm: btrfs Not tainted 5.12.0-07246-g75175d5adc74-dirty #199
        [17.637969] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [17.641995] RIP: 0010:btrfs_get_64+0xe7/0x110
        [17.649890] RSP: 0018:ffffc90001f73a08 EFLAGS: 00010202
        [17.651652] RAX: 0000000000000001 RBX: ffff88810c42d000 RCX: 0000000000000000
        [17.653921] RDX: 0005088000000000 RSI: ffffc90001f73a0f RDI: 0000000000000001
        [17.656174] RBP: 0000000000000ff9 R08: 0000000000000007 R09: c0000000fffeffff
        [17.658441] R10: ffffc90001f73790 R11: ffffc90001f73788 R12: ffff888106afe918
        [17.661070] R13: 0000000000003fd4 R14: 0000000000003f6f R15: cdcdcdcdcdcdcdcd
        [17.663617] FS:  00007f64e7627d80(0000) GS:ffff888237c80000(0000) knlGS:0000000000000000
        [17.666525] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [17.668664] CR2: 000055d4a39152e8 CR3: 000000010c596002 CR4: 0000000000770ee0
        [17.671253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [17.673634] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [17.676034] PKRU: 55555554
        [17.677004] Call Trace:
        [17.677877]  add_all_parents+0x276/0x480
        [17.679325]  find_parent_nodes+0xfae/0x1590
        [17.680771]  btrfs_find_all_leafs+0x5e/0xa0
        [17.682217]  iterate_extent_inodes+0xce/0x260
        [17.683809]  ? btrfs_inode_flags_to_xflags+0x50/0x50
        [17.685597]  ? iterate_inodes_from_logical+0xa1/0xd0
        [17.687404]  iterate_inodes_from_logical+0xa1/0xd0
        [17.689121]  ? btrfs_inode_flags_to_xflags+0x50/0x50
        [17.691010]  btrfs_ioctl_logical_to_ino+0x131/0x190
        [17.692946]  btrfs_ioctl+0x104a/0x2f60
        [17.694384]  ? selinux_file_ioctl+0x182/0x220
        [17.695995]  ? __x64_sys_ioctl+0x84/0xc0
        [17.697394]  __x64_sys_ioctl+0x84/0xc0
        [17.698697]  do_syscall_64+0x33/0x40
        [17.700017]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [17.701753] RIP: 0033:0x7f64e72761b7
        [17.709355] RSP: 002b:00007ffefb067f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [17.712088] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f64e72761b7
        [17.714667] RDX: 00007ffefb067fb0 RSI: 00000000c0389424 RDI: 0000000000000003
        [17.717386] RBP: 00007ffefb06d188 R08: 000055d4a390d2b0 R09: 00007f64e7340a60
        [17.719938] R10: 0000000000000231 R11: 0000000000000246 R12: 0000000000000001
        [17.722383] R13: 0000000000000000 R14: 00000000c0389424 R15: 000055d4a38fd2a0
        [17.724839] Modules linked in:
      
      Fix the bug by detecting the inline extent item in add_all_parents and
      skipping to the next extent item.
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      560840af
  3. 15 Dec, 2022 5 commits
  4. 05 Dec, 2022 25 commits
    • Filipe Manana's avatar
      btrfs: print transaction aborted messages with an error level · b7af0635
      Filipe Manana authored
      Currently we print the transaction aborted message with a debug level, but
      a transaction abort is an exceptional event that indicates something went
      wrong and it's useful to have it printed with an error level as it helps
      analysing problems in a production environment, where debug level messages
      are typically not logged. For example reports from syzbot never include
      the transaction aborted message, since the log level on the test machines
      is above the debug level.
      
      So change the log level from debug to error.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b7af0635
    • Josef Bacik's avatar
      btrfs: sync some cleanups from progs into uapi/btrfs.h · a2813530
      Josef Bacik authored
      When syncing this code into btrfs-progs Dave noticed there's some things
      we were losing in the sync that are needed.  This syncs those changes
      into the kernel, which include a few comments that weren't in the
      kernel, some whitespace changes, an attribute, and the cplusplus bit.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a2813530
    • Filipe Manana's avatar
      btrfs: do not BUG_ON() on ENOMEM when dropping extent items for a range · 162d053e
      Filipe Manana authored
      If we get -ENOMEM while dropping file extent items in a given range, at
      btrfs_drop_extents(), due to failure to allocate memory when attempting to
      increment the reference count for an extent or drop the reference count,
      we handle it with a BUG_ON(). This is excessive, instead we can simply
      abort the transaction and return the error to the caller. In fact most
      callers of btrfs_drop_extents(), directly or indirectly, already abort
      the transaction if btrfs_drop_extents() returns any error.
      
      Also, we already have error paths at btrfs_drop_extents() that may return
      -ENOMEM and in those cases we abort the transaction, like for example
      anything that changes the b+tree may return -ENOMEM due to a failure to
      allocate a new extent buffer when COWing an existing extent buffer, such
      as a call to btrfs_duplicate_item() for example.
      
      So replace the BUG_ON() calls with proper logic to abort the transaction
      and return the error.
      
      Reported-by: syzbot+0b1fb6b0108c27419f9f@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/00000000000089773e05ee4b9cb4@google.com/
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      162d053e
    • void0red's avatar
      btrfs: fix extent map use-after-free when handling missing device in read_one_chunk · 1742e1c9
      void0red authored
      Store the error code before freeing the extent_map. Though it's
      reference counted structure, in that function it's the first and last
      allocation so this would lead to a potential use-after-free.
      
      The error can happen eg. when chunk is stored on a missing device and
      the degraded mount option is missing.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216721Reported-by: default avatareriri <1527030098@qq.com>
      Fixes: adfb69af ("btrfs: add_missing_dev() should return the actual error")
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: default avatarvoid0red <void0red@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1742e1c9
    • Filipe Manana's avatar
      btrfs: remove outdated logic from overwrite_item() and add assertion · 3eb42344
      Filipe Manana authored
      As of commit 193df624 ("btrfs: search for last logged dir index if
      it's not cached in the inode"), the overwrite_item() function is always
      called for a root that is from a fs/subvolume tree. In other words, now
      it's only used during log replay to modify a fs/subvolume tree. Therefore
      we can remove the logic that checks if we are dealing with a log tree at
      overwrite_item().
      
      So remove that logic, replacing it with an assertion and document that if
      we ever need to support a log root there, we will need to clone the leaf
      from the fs/subvolume tree and then release it before modifying the log
      tree, which is needed to avoid a potential deadlock, similar to the one
      recently fixed by a patch with the subject:
      
        "btrfs: do not modify log tree while holding a leaf from fs tree locked"
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3eb42344
    • Filipe Manana's avatar
      btrfs: unify overwrite_item() and do_overwrite_item() · 3a8d1db3
      Filipe Manana authored
      After commit 193df624 ("btrfs: search for last logged dir index if
      it's not cached in the inode"), there are no more callers of
      do_overwrite_item(), except overwrite_item().
      
      Originally both used to be the same function, but were split in
      commit 086dcbfa ("btrfs: insert items in batches when logging a
      directory when possible"), as there was the need to execute all logic
      of overwrite_item() but skip the tree search, since in the context of
      directory logging we already had a path with a leaf to copy data from.
      
      So unify them again as there is no more need to have them split.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3a8d1db3
    • Artem Chernyshev's avatar
      btrfs: replace strncpy() with strscpy() · 63d5429f
      Artem Chernyshev authored
      Using strncpy() on NUL-terminated strings are deprecated.  To avoid
      possible forming of non-terminated string strscpy() should be used.
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE.
      
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: default avatarArtem Chernyshev <artem.chernyshev@red-soft.ru>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      63d5429f
    • Josef Bacik's avatar
      btrfs: fix uninitialized variable in find_first_clear_extent_bit · 26df39a9
      Josef Bacik authored
      This was caught when syncing extent-io-tree.c into btrfs-progs.  This
      however isn't really a problem, the only way next would be uninitialized
      is if we found the range we were looking for, and in this case we don't
      care about next.  However it's a compile error, so fix it up.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      26df39a9
    • Josef Bacik's avatar
      btrfs: fix uninitialized parent in insert_state · d7c9e1be
      Josef Bacik authored
      I don't know how this isn't caught when we build this in the kernel, but
      while syncing extent-io-tree.c into btrfs-progs I got an error because
      parent could potentially be uninitialized when we link in a new node,
      specifically when the extent_io_tree is empty.  This means we could have
      garbage in the parent color.  I don't know what the ramifications are of
      that, but it's probably not great, so fix this by initializing parent to
      NULL.  I spot checked all of our other usages in btrfs and we appear to
      be doing the correct thing everywhere else.
      
      Fixes: c7e118cf ("btrfs: open code rbtree search in insert_state")
      CC: stable@vger.kernel.org # 6.0+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d7c9e1be
    • ChenXiaoSong's avatar
      btrfs: add might_sleep() annotations · a4c853af
      ChenXiaoSong authored
      Add annotations to functions that might sleep due to allocations or IO
      and could be called from various contexts. In case of btrfs_search_slot
      it's not obvious why it would sleep:
      
          btrfs_search_slot
            setup_nodes_for_search
              reada_for_balance
                btrfs_readahead_node_child
                  btrfs_readahead_tree_block
                    btrfs_find_create_tree_block
                      alloc_extent_buffer
                        kmem_cache_zalloc
                          /* allocate memory non-atomically, might sleep */
                          kmem_cache_alloc(GFP_NOFS|__GFP_NOFAIL|__GFP_ZERO)
                    read_extent_buffer_pages
                      submit_extent_page
                        /* disk IO, might sleep */
                        submit_one_bio
      
      Other examples where the sleeping could happen is in 3 places might
      sleep in update_qgroup_limit_item(), as shown below:
      
        update_qgroup_limit_item
          btrfs_alloc_path
            /* allocate memory non-atomically, might sleep */
            kmem_cache_zalloc(btrfs_path_cachep, GFP_NOFS)
      Signed-off-by: default avatarChenXiaoSong <chenxiaosong2@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a4c853af
    • Josef Bacik's avatar
      btrfs: add stack helpers for a few btrfs items · 054056bd
      Josef Bacik authored
      We don't have these defined in the kernel because we don't have any
      users of these helpers.  However we do use them in btrfs-progs, so
      define them to make keeping accessors.h in sync between progs and the
      kernel easier.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      054056bd
    • Josef Bacik's avatar
      btrfs: add nr_global_roots to the super block definition · 0c703003
      Josef Bacik authored
      We already have this defined in btrfs-progs, add it to the kernel to
      make it easier to sync these files into btrfs-progs.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0c703003
    • Josef Bacik's avatar
      btrfs: remove BTRFS_LEAF_DATA_OFFSET · 8009adf3
      Josef Bacik authored
      This is simply the same thing as btrfs_item_nr_offset(leaf, 0), so
      remove this helper and replace it's usage with the above statement.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8009adf3
    • Josef Bacik's avatar
      btrfs: add helpers for manipulating leaf items and data · 637e3b48
      Josef Bacik authored
      We have some gnarly memmove and copy_extent_buffer calls for leaf
      manipulation.  This is because our item offsets aren't absolute, they're
      based on 0 being where the items start in the leaf, which is after the
      btrfs_header.  This means any manipulation of the data requires adding
      sizeof(struct btrfs_header) to the offsets we pull from the items.
      Moving the items themselves is easier as the helpers are absolute
      offsets, however we of course have to call the helpers to get the
      offsets for the item numbers.  This makes for
      copy_extent_buffer/memmove_extent_buffer calls that are kind of hard to
      reason about what's happening.
      
      Fix this by pushing this logic into helpers.  For data we'll only use
      the item provided offsets, and the helpers will use the
      BTRFS_LEAF_DATA_OFFSET addition for the offsets.  Additionally for the
      item manipulation simply pass in the item numbers, and then the helpers
      will call the offset helper to get the actual offset into the leaf.
      
      The diffstat makes this look like more code, but that's simply because I
      added comments for the helpers, it's net negative for the amount of
      code, and is easier to reason.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      637e3b48
    • Josef Bacik's avatar
      btrfs: add eb to btrfs_node_key_ptr_offset · e23efd8e
      Josef Bacik authored
      This is a change needed for extent tree v2, as we will be growing the
      header size.  This exists in btrfs-progs currently, and not having it
      makes syncing accessors.[ch] more problematic.  So make this change to
      set us up for extent tree v2 and match what btrfs-progs does to make
      syncing easier.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e23efd8e
    • Josef Bacik's avatar
      btrfs: pass the extent buffer for the btrfs_item_nr helpers · 42c9419a
      Josef Bacik authored
      This is actually a change for extent tree v2, but it exists in
      btrfs-progs but not in the kernel.  This makes it annoying to sync
      accessors.h with btrfs-progs, and since this is the way I need it for
      extent-tree v2 simply update these helpers to take the extent buffer in
      order to make syncing possible now, and make the extent tree v2 stuff
      easier moving forward.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      42c9419a
    • Josef Bacik's avatar
      btrfs: move the csum helpers into ctree.h · 0e6c40eb
      Josef Bacik authored
      These got moved because of copy+paste, but this code exists in ctree.c,
      so move the declarations back into ctree.h.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0e6c40eb
    • Josef Bacik's avatar
      btrfs: move eb offset helpers into extent_io.h · 9b48adda
      Josef Bacik authored
      These are very specific to how the extent buffer is defined, so this
      differs between btrfs-progs and the kernel.  Make things easier by
      moving these helpers into extent_io.h so we don't have to worry about
      this when syncing ctree.h.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9b48adda
    • Josef Bacik's avatar
      btrfs: move file_extent_item helpers into file-item.h · 6bfd0ffa
      Josef Bacik authored
      These helpers use functions that are in multiple places, which makes it
      tricky to sync them into btrfs-progs.  Move them to file-item.h and then
      include file-item.h in places that use these helpers.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6bfd0ffa
    • Josef Bacik's avatar
      btrfs: move leaf_data_end into ctree.c · 3a3178c7
      Josef Bacik authored
      This is only used in ctree.c, with the exception of zero'ing out extent
      buffers we're getting ready to write out.  In theory we shouldn't have
      an extent buffer with 0 items that we're writing out, however I'd rather
      be safe than sorry so open code it in extent_io.c, and then copy the
      helper into ctree.c.  This will make it easier to sync accessors.[ch]
      into btrfs-progs, as this requires a helper that isn't defined in
      accessors.h.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3a3178c7
    • Josef Bacik's avatar
      btrfs: move root helpers back into ctree.h · 1fe5ebc4
      Josef Bacik authored
      These accidentally got brought into accessors.h, but belong with the
      btrfs_root definitions which are currently in ctree.h.  Move these to
      make it easier to sync accessors.[ch] into btrfs-progs.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1fe5ebc4
    • Christoph Hellwig's avatar
      btrfs: move repair_io_failure to bio.c · bacf60e5
      Christoph Hellwig authored
      repair_io_failure ties directly into all the glory low-level details of
      mapping a bio with a logic address to the actual physical location.
      Move it right below btrfs_submit_bio to keep all the related logic
      together.
      
      Also move btrfs_repair_eb_io_failure to its caller in disk-io.c now that
      repair_io_failure is available in a header.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bacf60e5
    • Christoph Hellwig's avatar
      btrfs: split the bio submission path into a separate file · 103c1972
      Christoph Hellwig authored
      The code used by btrfs_submit_bio only interacts with the rest of
      volumes.c through __btrfs_map_block (which itself is a more generic
      version of two exported helpers) and does not really have anything
      to do with volumes.c.  Create a new bio.c file and a bio.h header
      going along with it for the btrfs_bio-based storage layer, which
      will grow even more going forward.
      
      Also update the file with my copyright notice given that a large
      part of the moved code was written or rewritten by me.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      103c1972
    • Christoph Hellwig's avatar
      btrfs: move struct btrfs_tree_parent_check out of disk-io.h · 27137fac
      Christoph Hellwig authored
      Move struct btrfs_tree_parent_check out of disk-io.h so that volumes.h
      an various .c files don't have to include disk-io.h just for it.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ use tree-checker.h for the structure ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      27137fac
    • Qu Wenruo's avatar
      btrfs: raid56: do data csum verification during RMW cycle · 7a315072
      Qu Wenruo authored
      [BUG]
      For the following small script, btrfs will be unable to recover the
      content of file1:
      
        mkfs.btrfs -f -m raid1 -d raid5 -b 1G $dev1 $dev2 $dev3
      
        mount $dev1 $mnt
        xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/file1
        md5sum $mnt/file1
        umount $mnt
      
        # Corrupt the above 64K data stripe.
        xfs_io -f -c "pwrite -S 0x00 323026944 64K" -c sync $dev3
        mount $dev1 $mnt
      
        # Write a new 64K, which should be in the other data stripe
        # And this is a sub-stripe write, which will cause RMW
        xfs_io -f -c "pwrite 0 64k" -c sync $mnt/file2
        md5sum $mnt/file1
        umount $mnt
      
      Above md5sum would fail.
      
      [CAUSE]
      There is a long existing problem for raid56 (not limited to btrfs
      raid56) that, if we already have some corrupted on-disk data, and then
      trigger a sub-stripe write (which needs RMW cycle), it can cause further
      damage into P/Q stripe.
      
        Disk 1: data 1 |0x000000000000| <- Corrupted
        Disk 2: data 2 |0x000000000000|
        Disk 2: parity |0xffffffffffff|
      
      In above case, data 1 is already corrupted, the original data should be
      64KiB of 0xff.
      
      At this stage, if we read data 1, and it has data checksum, we can still
      recovery going via the regular RAID56 recovery path.
      
      But if now we decide to write some data into data 2, then we need to go
      RMW.
      Let's say we want to write 64KiB of '0x00' into data 2, then we read the
      on-disk data of data 1, calculate the new parity, resulting the
      following layout:
      
        Disk 1: data 1 |0x000000000000| <- Corrupted
        Disk 2: data 2 |0x000000000000| <- New '0x00' writes
        Disk 2: parity |0x000000000000| <- New Parity.
      
      But the new parity is calculated using the *corrupted* data 1, we can
      no longer recover the correct data of data1.  Thus the corruption is
      forever there.
      
      [FIX]
      To solve above problem, this patch will do a full stripe data checksum
      verification at RMW time.
      
      This involves the following changes:
      
      - Always read the full stripe (including data/P/Q) when doing RMW
        Before we only read the missing data sectors, but since we may do a
        data csum verification and recovery, we need to read everything out.
      
        Please note that, if we have a cached rbio, we don't need to read
        anything, and can treat it the same as full stripe write.
      
        As only stripe with all its csum matches can be cached.
      
      - Verify the data csum during read.
        The goal is only the rbio stripe sectors, and only if the rbio
        already has csum_buf/csum_bitmap filled.
      
        And sectors which cannot pass csum verification will have their bit
        set in error_bitmap.
      
      - Always call recovery_sectors() after we read out all the sectors
        Since error_bitmap will be updated during read, recover_sectors()
        can easily find out all the bad sectors and try to recover (if still
        under tolerance).
      
        And since recovery_sectors() is already migrated to use error_bitmap,
        it can skip vertical stripes which don't have any error.
      
      - Verify the repaired sectors against its csum in recover_vertical()
      
      - Rename rmw_read_and_wait() to rmw_read_wait_recover()
        Since we will always recover the sectors, the old name is no longer
        accurate.
      
        Furthermore since recovery is already done in rmw_read_wait_recover(),
        we no longer need to call recovery_sectors() inside rmw_rbio().
      
      Obviously this will have a performance impact, as we are doing more
      work during RMW cycle:
      
      - Fetch the data checksums
      - Do checksum verification for all data stripes
      - Do checksum verification again after repair
      
      But for full stripe write or cached rbio we won't have the overhead all,
      thus for fully optimized RAID56 workload (always full stripe write),
      there should be no extra overhead.
      
      To me, the extra overhead looks reasonable, as data consistency is way
      more important than performance.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7a315072