1. 03 Sep, 2024 1 commit
    • btrfs: fix race between direct IO write and fsync when using same fd · cd9253c2
      Filipe Manana authored
      If we have 2 threads that are using the same file descriptor and one of
      them is doing direct IO writes while the other is doing fsync, we have a
      race where we can end up either:
      
      1) Attempt a fsync without holding the inode's lock, triggering an
         assertion failure when assertions are enabled;
      
      2) Do an invalid memory access from the fsync task because the file private
         points to memory allocated on stack by the direct IO task and it may be
         used by the fsync task after the stack was destroyed.
      
      The race happens like this:
      
      1) A user space program opens a file descriptor with O_DIRECT;
      
      2) The program spawns 2 threads using libpthread for example;
      
      3) One of the threads uses the file descriptor to do direct IO writes,
         while the other calls fsync using the same file descriptor.
      
      4) Call task A the thread doing direct IO writes and task B the thread
         doing fsyncs;
      
      5) Task A does a direct IO write, and at btrfs_direct_write() sets the
         file's private to an on stack allocated private with the member
         'fsync_skip_inode_lock' set to true;
      
      6) Task B enters btrfs_sync_file() and sees that there's a private
         structure associated to the file which has 'fsync_skip_inode_lock' set
         to true, so it skips locking the inode's VFS lock;
      
      7) Task A completes the direct IO write, and resets the file's private to
         NULL since it had no prior private and our private was stack allocated.
         Then it unlocks the inode's VFS lock;
      
      8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
         assertion that checks the inode's VFS lock is held fails, since task B
         never locked it and task A has already unlocked it.
      
      The stack trace produced is the following:
      
         assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
         ------------[ cut here ]------------
         kernel BUG at fs/btrfs/ordered-data.c:983!
         Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
         CPU: 9 PID: 5072 Comm: worker Tainted: G     U     OE      6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
         Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
         RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
         Code: 50 d6 86 c0 e8 (...)
         RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
         RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
         RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
         RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
         R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
         R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
         FS:  00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
         Call Trace:
          <TASK>
          ? __die_body.cold+0x14/0x24
          ? die+0x2e/0x50
          ? do_trap+0xca/0x110
          ? do_error_trap+0x6a/0x90
          ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          ? exc_invalid_op+0x50/0x70
          ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          ? asm_exc_invalid_op+0x1a/0x20
          ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          ? __seccomp_filter+0x31d/0x4f0
          __x64_sys_fdatasync+0x4f/0x90
          do_syscall_64+0x82/0x160
          ? do_futex+0xcb/0x190
          ? __x64_sys_futex+0x10e/0x1d0
          ? switch_fpu_return+0x4f/0xd0
          ? syscall_exit_to_user_mode+0x72/0x220
          ? do_syscall_64+0x8e/0x160
          ? syscall_exit_to_user_mode+0x72/0x220
          ? do_syscall_64+0x8e/0x160
          ? syscall_exit_to_user_mode+0x72/0x220
          ? do_syscall_64+0x8e/0x160
          ? syscall_exit_to_user_mode+0x72/0x220
          ? do_syscall_64+0x8e/0x160
          entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      Another problem here is if task B grabs the private pointer and then uses
      it after task A has finished, since the private was allocated in the stack
      of task A, it results in some invalid memory access with a hard to predict
      result.
      
      This issue, triggering the assertion, was observed with QEMU workloads by
      two users in the Link tags below.
      
      Fix this by not relying on a file's private to pass information to fsync
      that it should skip locking the inode and instead pass this information
      through a special value stored in current->journal_info. This is safe
      because in the relevant section of the direct IO write path we are not
      holding a transaction handle, so current->journal_info is NULL.
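
The move from shared per-file state to per-task state can be modelled in user space. This is only a sketch under stated assumptions, not the kernel change: a C11 `_Thread_local` variable stands in for current->journal_info, and `writer_begin()`, `writer_end()`, `fsync_should_skip_lock()` and `hint_leaks_to_other_thread()` are hypothetical names.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-thread hint; a stand-in for current->journal_info (assumption). */
static _Thread_local bool skip_inode_lock_hint;

/* The writer marks only its own context, never shared per-file state. */
static void writer_begin(void) { skip_inode_lock_hint = true; }
static void writer_end(void)   { skip_inode_lock_hint = false; }

/* The fsync path consults only the calling thread's hint. */
static bool fsync_should_skip_lock(void) { return skip_inode_lock_hint; }

static void *other_thread(void *arg)
{
    /* Record what a concurrent fsync caller would observe. */
    *(bool *)arg = fsync_should_skip_lock();
    return NULL;
}

/*
 * Returns what a second thread observes while this thread is inside its
 * write section. Returns true if the hint leaked (or the thread could not
 * be created), false if the hint stayed private to the writer.
 */
static bool hint_leaks_to_other_thread(void)
{
    pthread_t t;
    bool seen = true;

    writer_begin();
    if (pthread_create(&t, NULL, other_thread, &seen) != 0) {
        writer_end();
        return true;
    }
    pthread_join(t, NULL);
    writer_end();
    return seen;
}
```

With a per-thread hint, the second thread reads its own copy of the variable, so it cannot be told to skip a lock that it never took.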
      
      The following C program triggers the issue:
      
         $ cat repro.c
         /* Get the O_DIRECT definition. */
         #ifndef _GNU_SOURCE
         #define _GNU_SOURCE
         #endif
      
         #include <stdio.h>
         #include <stdlib.h>
         #include <unistd.h>
         #include <stdint.h>
         #include <fcntl.h>
         #include <errno.h>
         #include <string.h>
         #include <pthread.h>
      
         static int fd;
      
         static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
         {
             while (count > 0) {
                 ssize_t ret;
      
                 ret = pwrite(fd, buf, count, offset);
                 if (ret < 0) {
                     if (errno == EINTR)
                         continue;
                     return ret;
                 }
                 count -= ret;
                 buf += ret;
             }
             return 0;
         }
      
         static void *fsync_loop(void *arg)
         {
             while (1) {
                 int ret;
      
                 ret = fsync(fd);
                 if (ret != 0) {
                     perror("Fsync failed");
                     exit(6);
                 }
             }
         }
      
         int main(int argc, char *argv[])
         {
             long pagesize;
             void *write_buf;
             pthread_t fsyncer;
             int ret;
      
             if (argc != 2) {
                 fprintf(stderr, "Use: %s <file path>\n", argv[0]);
                 return 1;
             }
      
             fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
             if (fd == -1) {
                 perror("Failed to open/create file");
                 return 1;
             }
      
             pagesize = sysconf(_SC_PAGE_SIZE);
             if (pagesize == -1) {
                 perror("Failed to get page size");
                 return 2;
             }
      
             ret = posix_memalign(&write_buf, pagesize, pagesize);
             if (ret) {
                 /* posix_memalign() returns the error code and does not set errno. */
                 fprintf(stderr, "Failed to allocate buffer: %s\n", strerror(ret));
                 return 3;
             }
      
             ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
             if (ret != 0) {
                 fprintf(stderr, "Failed to create fsync thread: %d\n", ret);
                 return 4;
             }
      
             while (1) {
                 ret = do_write(fd, write_buf, pagesize, 0);
                 if (ret != 0) {
                     perror("Write failed");
                     exit(5);
                 }
             }
      
             return 0;
         }
      
         $ mkfs.btrfs -f /dev/sdi
         $ mount /dev/sdi /mnt/sdi
         $ timeout 10 ./repro /mnt/sdi/foo
      
      Usually the race is triggered within less than 1 second. A test case for
      fstests will follow soon.
      
      Reported-by: Paulo Dias <paulo.miguel.dias@gmail.com>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
      Reported-by: Andreas Jahn <jahn-andi@web.de>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
      Reported-by: syzbot+4704b3cc972bd76024f1@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
      Fixes: 939b656b ("btrfs: fix corruption after buffer fault in during direct IO append write")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 02 Sep, 2024 2 commits
    • btrfs: zoned: handle broken write pointer on zones · b1934cd6
      Naohiro Aota authored
      Btrfs refuses to mount a filesystem if it finds a block group with a broken
      write pointer (e.g., unequal write pointers on two zones of a RAID1 block
      group). Since such a case can easily happen after a power loss or a system
      crash, we need to handle it more gracefully.
      
      Handle such a block group by making it unallocatable, so that there will be
      no writes into it. That can be done by setting the allocation pointer at
      the end of the allocating region (= block_group->zone_capacity). Then, the
      existing code handles zone_unusable properly.
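
As a rough illustration of parking the allocation pointer at the end of the zone, the sketch below uses a simplified, hypothetical `struct zoned_bg` with only the two fields the text mentions. The helper names are assumptions and this is not the btrfs code.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for a zoned block group. */
struct zoned_bg {
    uint64_t zone_capacity; /* bytes usable for allocation */
    uint64_t alloc_offset;  /* next allocation position */
};

/* Bytes that can still be allocated from the block group. */
static uint64_t zoned_bg_free_bytes(const struct zoned_bg *bg)
{
    if (bg->alloc_offset >= bg->zone_capacity)
        return 0;
    return bg->zone_capacity - bg->alloc_offset;
}

/* Handle a broken write pointer by leaving no allocatable space. */
static void zoned_bg_mark_unallocatable(struct zoned_bg *bg)
{
    bg->alloc_offset = bg->zone_capacity;
}
```

Once the offset equals the capacity, an allocator that checks the remaining space finds none and never writes into the block group.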
      
      Having a proper zone_capacity is necessary for the change. So, set it as
      early as possible.
      
      We cannot handle the RAID0 and RAID10 cases like this. But they cannot be
      read anyway because of a missing stripe.
      
      Fixes: 265f7237 ("btrfs: zoned: allow DUP on meta-data block groups")
      Fixes: 568220fa ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
      CC: stable@vger.kernel.org # 6.1+
      Reported-by: HAN Yuwei <hrx@bupt.moe>
      Cc: Xuefer <xuefer@gmail.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: don't use extent changeset when not needed · c346c629
      Fedor Pchelkin authored
      The local extent changeset is passed to clear_record_extent_bits() where
      it may have some additional memory dynamically allocated for ulist. When
      qgroup is disabled, the memory is leaked because in this case the
      changeset is not released upon __btrfs_qgroup_release_data() return.
      
      Since the recorded contents of the changeset are not used thereafter, just
      don't pass it.
      
      Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
      
      Reported-by: syzbot+81670362c283f3dd889c@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/lkml/000000000000aa8c0c060ade165e@google.com
      Fixes: af0e2aab ("btrfs: qgroup: flush reservations during quota disable")
      CC: stable@vger.kernel.org # 6.10+
      Reviewed-by: Boris Burkov <boris@bur.io>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 27 Aug, 2024 1 commit
    • btrfs: fix uninitialized return value from btrfs_reclaim_sweep() · ecb54277
      Filipe Manana authored
      The return variable 'ret' at btrfs_reclaim_sweep() is never assigned if
      none of the space infos is reclaimable (for example if periodic reclaim
      is disabled, which is the default), so we return an undefined value.
      
      This can be fixed by making btrfs_reclaim_sweep() not return any value
      as well as do_reclaim_sweep() because:
      
      1) do_reclaim_sweep() always returns 0, so we can make it return void;
      
      2) The only caller of btrfs_reclaim_sweep() (btrfs_reclaim_bgs()) doesn't
         care about its return value, and in its context there's nothing to do
         about any errors anyway.
      
      Therefore remove the return value from btrfs_reclaim_sweep() and
      do_reclaim_sweep().
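
The pattern can be sketched outside the kernel. The names below (`struct space_info_model`, `sweep_one()`, `reclaim_sweep()`) are hypothetical and only mirror the shape of the problem: a return variable assigned solely inside a loop body that may never run.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a space info; not the btrfs structure. */
struct space_info_model {
    bool reclaimable;
    int swept; /* how many times this entry was swept */
};

/*
 * Problematic shape (shown only as a comment):
 *
 *     int ret;
 *     for (...)
 *         if (infos[i].reclaimable)
 *             ret = sweep_one(&infos[i]);
 *     return ret;   // indeterminate when nothing was reclaimable
 */

/* The per-entry work cannot fail in this model, so it returns void. */
static void sweep_one(struct space_info_model *si)
{
    si->swept++;
}

/* No caller needs a result, so there is no return value to forget. */
static void reclaim_sweep(struct space_info_model *infos, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!infos[i].reclaimable)
            continue;
        sweep_one(&infos[i]);
    }
}
```

Dropping the return value removes the uninitialized read without inventing an error code that no caller would act on.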
      
      Fixes: e4ca3932 ("btrfs: periodic block_group reclaim")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 26 Aug, 2024 2 commits
    • btrfs: fix a use-after-free when hitting errors inside btrfs_submit_chunk() · 10d9d8c3
      Qu Wenruo authored
      [BUG]
      There is an internal report that KASAN is reporting use-after-free, with
      the following backtrace:
      
        BUG: KASAN: slab-use-after-free in btrfs_check_read_bio+0xa68/0xb70 [btrfs]
        Read of size 4 at addr ffff8881117cec28 by task kworker/u16:2/45
        CPU: 1 UID: 0 PID: 45 Comm: kworker/u16:2 Not tainted 6.11.0-rc2-next-20240805-default+ #76
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
        Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
        Call Trace:
         dump_stack_lvl+0x61/0x80
         print_address_description.constprop.0+0x5e/0x2f0
         print_report+0x118/0x216
         kasan_report+0x11d/0x1f0
         btrfs_check_read_bio+0xa68/0xb70 [btrfs]
         process_one_work+0xce0/0x12a0
         worker_thread+0x717/0x1250
         kthread+0x2e3/0x3c0
         ret_from_fork+0x2d/0x70
         ret_from_fork_asm+0x11/0x20
      
        Allocated by task 20917:
         kasan_save_stack+0x37/0x60
         kasan_save_track+0x10/0x30
         __kasan_slab_alloc+0x7d/0x80
         kmem_cache_alloc_noprof+0x16e/0x3e0
         mempool_alloc_noprof+0x12e/0x310
         bio_alloc_bioset+0x3f0/0x7a0
         btrfs_bio_alloc+0x2e/0x50 [btrfs]
         submit_extent_page+0x4d1/0xdb0 [btrfs]
         btrfs_do_readpage+0x8b4/0x12a0 [btrfs]
         btrfs_readahead+0x29a/0x430 [btrfs]
         read_pages+0x1a7/0xc60
         page_cache_ra_unbounded+0x2ad/0x560
         filemap_get_pages+0x629/0xa20
         filemap_read+0x335/0xbf0
         vfs_read+0x790/0xcb0
         ksys_read+0xfd/0x1d0
         do_syscall_64+0x6d/0x140
         entry_SYSCALL_64_after_hwframe+0x4b/0x53
      
        Freed by task 20917:
         kasan_save_stack+0x37/0x60
         kasan_save_track+0x10/0x30
         kasan_save_free_info+0x37/0x50
         __kasan_slab_free+0x4b/0x60
         kmem_cache_free+0x214/0x5d0
         bio_free+0xed/0x180
         end_bbio_data_read+0x1cc/0x580 [btrfs]
         btrfs_submit_chunk+0x98d/0x1880 [btrfs]
         btrfs_submit_bio+0x33/0x70 [btrfs]
         submit_one_bio+0xd4/0x130 [btrfs]
         submit_extent_page+0x3ea/0xdb0 [btrfs]
         btrfs_do_readpage+0x8b4/0x12a0 [btrfs]
         btrfs_readahead+0x29a/0x430 [btrfs]
         read_pages+0x1a7/0xc60
         page_cache_ra_unbounded+0x2ad/0x560
         filemap_get_pages+0x629/0xa20
         filemap_read+0x335/0xbf0
         vfs_read+0x790/0xcb0
         ksys_read+0xfd/0x1d0
         do_syscall_64+0x6d/0x140
         entry_SYSCALL_64_after_hwframe+0x4b/0x53
      
      [CAUSE]
      Although I cannot reproduce the error, the report itself is good enough
      to pin down the cause.
      
      The call trace is the regular endio workqueue context, but the
      free-by-task trace shows that during btrfs_submit_chunk() we had
      already hit a critical error and were calling btrfs_bio_end_io() to
      error out. The original endio function then called bio_put() to free
      the whole bio.
      
      This means a double free, which causes the use-after-free, e.g.:
      
      1. Enter btrfs_submit_bio() with a read bio
         The read bio length is 128K, crossing two 64K stripes.
      
      2. The first run of btrfs_submit_chunk()
      
      2.1 Call btrfs_map_block(), which returns 64K
      2.2 Call btrfs_split_bio()
          Now there are two bios, one referring to the first 64K, the other
          referring to the second 64K.
      2.3 The first half is submitted.
      
      3. The second run of btrfs_submit_chunk()
      
      3.1 Call btrfs_map_block(), which somehow fails
          Now we call btrfs_bio_end_io() to handle the error
      
      3.2 btrfs_bio_end_io() calls the original endio function
          Which is end_bbio_data_read(), and it calls bio_put() for the
          original bio.
      
          Now the original bio is freed.
      
      4. The submitted first 64K bio finishes
         Now we call into btrfs_check_read_bio() and try to advance the bio
         iter.
         But since the original bio (and thus its iter) is already freed, we
         trigger the above use-after-free.
      
         And even if the memory is not poisoned/corrupted, we will later call
         the original endio function, causing a double free.
      
      [FIX]
      Instead of calling btrfs_bio_end_io(), call btrfs_orig_bbio_end_io(),
      which has the extra check on split bios and does the proper refcounting
      for cloned bios.
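
The refcounting idea can be modelled with a plain counter. This is a single-threaded sketch with hypothetical names (`struct parent_io` and its helpers), not the btrfs bio code; a real implementation would need an atomic counter.

```c
/* Hypothetical model of an original I/O that may be split into parts. */
struct parent_io {
    int pending;     /* outstanding parts, including the remainder */
    int status;      /* first error seen, 0 if none */
    int completions; /* how many times the final endio ran */
};

static void parent_io_init(struct parent_io *p)
{
    p->pending = 1; /* the original itself counts as one part */
    p->status = 0;
    p->completions = 0;
}

/* Splitting off a part adds one more outstanding completion. */
static void parent_io_split(struct parent_io *p)
{
    p->pending++;
}

/*
 * Every part, including one that fails in the submission path, ends
 * through this function. The final endio runs only when the last part
 * is done, so an early error cannot free state that a submitted part
 * still uses.
 */
static void parent_io_part_done(struct parent_io *p, int status)
{
    if (status && !p->status)
        p->status = status;
    if (--p->pending == 0)
        p->completions++;
}
```

In this model, a failure on the second half only records the error; the completion runs exactly once, after the already submitted first half finishes.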
      
      Furthermore, there is already one extra btrfs_cleanup_bio() call, but
      it duplicates the btrfs_orig_bbio_end_io() call, so remove that
      label completely.
      
      Reported-by: David Sterba <dsterba@suse.com>
      Fixes: 852eee62 ("btrfs: allow btrfs_submit_bio to split bios")
      CC: stable@vger.kernel.org # 6.6+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: initialize last_extent_end to fix -Wmaybe-uninitialized warning in extent_fiemap() · 33f58a04
      David Sterba authored
      There's a warning (probably on some older compiler version):
      
      fs/btrfs/fiemap.c: warning: 'last_extent_end' may be used uninitialized in this function [-Wmaybe-uninitialized]:  => 822:19
      
      Initialize the variable to 0 although it's not necessary as it's either
      properly set or not used after an error. The called function is in the
      same file so this is a false alert but we want to fix all
      -Wmaybe-uninitialized reports.
      
      Link: https://lore.kernel.org/all/20240819070639.2558629-1-geert@linux-m68k.org/
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 25 Aug, 2024 1 commit
    • btrfs: run delayed iputs when flushing delalloc · 2d344726
      Josef Bacik authored
      We have transient failures with btrfs/301, specifically in the part
      where we do
      
        for i in $(seq 0 10); do
      	  write 50m to file
      	  rm -f file
        done
      
      Sometimes this will result in a transient quota error, and it's because
      sometimes we start writeback on the file which results in a delayed
      iput, and thus the rm doesn't actually clean the file up.  When we're
      flushing the quota space we need to run the delayed iputs to make sure
      all the unlinks that we think have completed have actually completed.
      This removes the small window where we could fail to find enough space
      in our quota.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 16 Aug, 2024 1 commit
  7. 15 Aug, 2024 4 commits
    • btrfs: zoned: properly take lock to read/update block group's zoned variables · e30729d4
      Naohiro Aota authored
      __btrfs_add_free_space_zoned() references and modifies bg's alloc_offset,
      ro, and zone_unusable, but without taking the lock. It is mostly safe
      because they monotonically increase (at least for now) and this function is
      mostly called by a transaction commit, which is serialized by itself.
      
      Still, taking the lock is a safer and correct option and I'm going to add a
      change to reset zone_unusable while a block group is still alive. So, add
      locking around the operations.
      
      Fixes: 169e0da9 ("btrfs: zoned: track unusable bytes for zones")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tree-checker: add dev extent item checks · 008e2512
      Qu Wenruo authored
      [REPORT]
      There is a corruption report that btrfs refused to mount a fs that has
      overlapping dev extents:
      
        BTRFS error (device sdc): dev extent devid 4 physical offset 14263979671552 overlap with previous dev extent end 14263980982272
        BTRFS error (device sdc): failed to verify dev extents against chunks: -117
        BTRFS error (device sdc): open_ctree failed
      
      [CAUSE]
      The direct cause is obvious: there is a bad dev extent item with an
      incorrect length.
      
      With btrfs check reporting two overlapping extents, the second one gives
      a clue to the cause:
      
        ERROR: dev extent devid 4 offset 14263979671552 len 6488064 overlap with previous dev extent end 14263980982272
        ERROR: dev extent devid 13 offset 2257707008000 len 6488064 overlap with previous dev extent end 2257707270144
        ERROR: errors found in extent allocation tree or chunk allocation
      
      The second one looks like a bitflip happened during new chunk
      allocation:
      hex(2257707008000) = 0x20da9d30000
      hex(2257707270144) = 0x20da9d70000
      diff               = 0x00000040000
      
      So it looks like a bitflip happened during new dev extent allocation,
      resulting in the second overlap.
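
The single-bit difference can be checked mechanically. The helpers below are a small sketch with hypothetical names; the two offsets are the ones from the btrfs check output above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits that differ between two values. */
static uint64_t differing_bits(uint64_t a, uint64_t b)
{
    return a ^ b;
}

/* True when exactly one bit is set, i.e. the value is a power of two. */
static bool is_single_bit(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}
```

For 2257707008000 and 2257707270144 the XOR is 0x40000, a single set bit (bit 18), which is consistent with a memory bitflip.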
      
      Currently we only do the dev-extent verification at mount time, but if the
      corruption is caused by memory bitflip, we really want to catch it before
      writing the corruption to the storage.
      
      Furthermore, the dev extent items have the following key definition:
      
      	(<device id> DEV_EXTENT <physical offset>)
      
      Thus we cannot rely only on the generic key order check to make sure
      there is no overlap.
      
      [ENHANCEMENT]
      Introduce dedicated dev extent checks, including:
      
      - Fixed member checks
        * chunk_tree should always be BTRFS_CHUNK_TREE_OBJECTID (3)
        * chunk_objectid should always be
          BTRFS_FIRST_CHUNK_TREE_OBJECTID (256)
      
      - Alignment checks
        * chunk_offset should be aligned to sectorsize
        * length should be aligned to sectorsize
        * key.offset should be aligned to sectorsize
      
      - Overlap checks
        If the previous key is also a dev-extent item, with the same
        device id, make sure we do not overlap with the previous dev extent.
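
The three groups of checks can be sketched as one validation function. This is an illustration under assumptions, not the tree-checker code: `struct dev_extent_model`, `enum dev_extent_err` and `check_dev_extent()` are hypothetical, and only the two constant values (3 and 256) come from the text above.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_TREE_OBJECTID       3ULL   /* value quoted in the list above */
#define FIRST_CHUNK_TREE_OBJECTID 256ULL /* value quoted in the list above */

/* Hypothetical flattened view of a dev extent key and item. */
struct dev_extent_model {
    uint64_t devid;          /* key objectid: device id */
    uint64_t physical;       /* key offset: physical start on the device */
    uint64_t chunk_tree;
    uint64_t chunk_objectid;
    uint64_t chunk_offset;
    uint64_t length;
};

enum dev_extent_err { DE_OK, DE_BAD_FIXED, DE_UNALIGNED, DE_OVERLAP };

/* prev may be NULL when the previous key is not a dev extent item. */
static enum dev_extent_err check_dev_extent(const struct dev_extent_model *cur,
                                            const struct dev_extent_model *prev,
                                            uint64_t sectorsize)
{
    /* Fixed member checks. */
    if (cur->chunk_tree != CHUNK_TREE_OBJECTID ||
        cur->chunk_objectid != FIRST_CHUNK_TREE_OBJECTID)
        return DE_BAD_FIXED;

    /* Alignment checks; sectorsize is assumed to be non-zero. */
    if (cur->chunk_offset % sectorsize || cur->length % sectorsize ||
        cur->physical % sectorsize)
        return DE_UNALIGNED;

    /* Overlap check against the previous dev extent of the same device. */
    if (prev && prev->devid == cur->devid &&
        prev->physical + prev->length > cur->physical)
        return DE_OVERLAP;

    return DE_OK;
}
```

Because the key only carries the start offset, the overlap test has to add the previous item's length, which the generic key order check never looks at.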
      
      Reported: Stefan N <stefannnau@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CA+W5K0rSO3koYTo=nzxxTm1-Pdu1HYgVxEpgJ=aGc7d=E8mGEg@mail.gmail.com/
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update target inode's ctime on unlink · 3bc2ac2f
      Jeff Layton authored
      Unlink changes the link count on the target inode. POSIX mandates that
      the ctime must also change when this occurs.
      
      According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html:
      
      "Upon successful completion, unlink() shall mark for update the last data
       modification and last file status change timestamps of the parent
       directory. Also, if the file's link count is not 0, the last file status
       change timestamp of the file shall be marked for update."
      
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add link to the opengroup docs ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: annotate struct name_cache_entry with __counted_by() · c0247d28
      Thorsten Blum authored
      Add the __counted_by compiler attribute to the flexible array member
      name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
      CONFIG_FORTIFY_SOURCE.
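
Outside the kernel, the same annotation can be sketched with the underlying compiler attribute. The `COUNTED_BY()` macro, `struct name_entry` and `name_entry_new()` below are hypothetical; the macro expands to nothing on compilers without the `counted_by` attribute. The sketch stores the name without a trailing NUL so that the buffer holds exactly `name_len` bytes.

```c
#include <stdlib.h>
#include <string.h>

/* Use the attribute only where the compiler reports support for it. */
#if defined(__has_attribute)
#  if __has_attribute(counted_by)
#    define COUNTED_BY(member) __attribute__((counted_by(member)))
#  endif
#endif
#ifndef COUNTED_BY
#  define COUNTED_BY(member)
#endif

/* Hypothetical entry with a flexible array member sized by name_len. */
struct name_entry {
    int name_len;
    char name[] COUNTED_BY(name_len);
};

/* Allocates an entry holding len bytes of s; len must not be negative. */
static struct name_entry *name_entry_new(const char *s, int len)
{
    struct name_entry *e = malloc(sizeof(*e) + (size_t)len);

    if (!e)
        return NULL;
    e->name_len = len; /* set the count before touching name[] */
    memcpy(e->name, s, (size_t)len);
    return e;
}
```

The count member has to be assigned before the array is accessed, otherwise a bounds checker would compare against a stale or uninitialized length.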
      
      Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 13 Aug, 2024 5 commits
    • btrfs: fix invalid mapping of extent xarray state · 6252690f
      Naohiro Aota authored
      In __extent_writepage_io(), we call btrfs_set_range_writeback() ->
      folio_start_writeback(), which clears PAGECACHE_TAG_DIRTY mark from the
      mapping xarray if the folio is not dirty. This worked fine before commit
      97713b1a ("btrfs: do not clear page dirty inside
      extent_write_locked_range()").
      
      After the commit, however, the folio is still dirty at this point, so the
      mapping DIRTY tag is not cleared anymore. Then, __extent_writepage_io()
      calls btrfs_folio_clear_dirty() to clear the folio's dirty flag. That
      results in the page being unlocked with a "strange" state. The page is not
      PageDirty, but the mapping tag is set as PAGECACHE_TAG_DIRTY.
      
      This strange state appears to cause a hang, with the call trace below,
      when running fstests generic/091 on a null_blk device. The task is
      waiting for a folio lock.
      
      While I don't have an exact relation between this hang and the strange
      state, fixing the state also fixes the hang. And, that state is worth
      fixing anyway.
      
      This commit reorders btrfs_folio_clear_dirty() and
      btrfs_set_range_writeback() in __extent_writepage_io(), so that the
      PAGECACHE_TAG_DIRTY tag is properly removed from the xarray.
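
Why the order matters can be shown with a toy model. The sketch assumes only what the text above states, namely that starting writeback clears the mapping's dirty tag when the folio is not dirty. `struct folio_model` and the `model_*()` helpers are hypothetical and do not reflect the kernel data structures.

```c
#include <stdbool.h>

/* Toy model: one folio plus its dirty tag in the mapping. */
struct folio_model {
    bool dirty;     /* the folio's own dirty flag */
    bool writeback; /* the folio is under writeback */
    bool tag_dirty; /* stands in for PAGECACHE_TAG_DIRTY */
};

/* Starting writeback drops the dirty tag only if the folio is clean. */
static void model_start_writeback(struct folio_model *f)
{
    f->writeback = true;
    if (!f->dirty)
        f->tag_dirty = false;
}

static void model_clear_dirty(struct folio_model *f)
{
    f->dirty = false;
}

/* The state is consistent when the flag and the tag agree. */
static bool model_consistent(const struct folio_model *f)
{
    return f->dirty == f->tag_dirty;
}
```

Starting writeback first and clearing the dirty flag afterwards leaves the tag set on a clean folio; clearing the flag first lets the writeback step drop the tag as well.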
      
        [464.274] task:fsx             state:D stack:0     pid:3034  tgid:3034  ppid:2853   flags:0x00004002
        [464.286] Call Trace:
        [464.291]  <TASK>
        [464.295]  __schedule+0x10ed/0x6260
        [464.301]  ? __pfx___blk_flush_plug+0x10/0x10
        [464.308]  ? __submit_bio+0x37c/0x450
        [464.314]  ? __pfx___schedule+0x10/0x10
        [464.321]  ? lock_release+0x567/0x790
        [464.327]  ? __pfx_lock_acquire+0x10/0x10
        [464.334]  ? __pfx_lock_release+0x10/0x10
        [464.340]  ? __pfx_lock_acquire+0x10/0x10
        [464.347]  ? __pfx_lock_release+0x10/0x10
        [464.353]  ? do_raw_spin_lock+0x12e/0x270
        [464.360]  schedule+0xdf/0x3b0
        [464.365]  io_schedule+0x8f/0xf0
        [464.371]  folio_wait_bit_common+0x2ca/0x6d0
        [464.378]  ? folio_wait_bit_common+0x1cc/0x6d0
        [464.385]  ? __pfx_folio_wait_bit_common+0x10/0x10
        [464.392]  ? __pfx_filemap_get_folios_tag+0x10/0x10
        [464.400]  ? __pfx_wake_page_function+0x10/0x10
        [464.407]  ? __pfx___might_resched+0x10/0x10
        [464.414]  ? do_raw_spin_unlock+0x58/0x1f0
        [464.420]  extent_write_cache_pages+0xe49/0x1620 [btrfs]
        [464.428]  ? lock_acquire+0x435/0x500
        [464.435]  ? __pfx_extent_write_cache_pages+0x10/0x10 [btrfs]
        [464.443]  ? btrfs_do_write_iter+0x493/0x640 [btrfs]
        [464.451]  ? orc_find.part.0+0x1d4/0x380
        [464.457]  ? __pfx_lock_release+0x10/0x10
        [464.464]  ? __pfx_lock_release+0x10/0x10
        [464.471]  ? btrfs_do_write_iter+0x493/0x640 [btrfs]
        [464.478]  btrfs_writepages+0x1cc/0x460 [btrfs]
        [464.485]  ? __pfx_btrfs_writepages+0x10/0x10 [btrfs]
        [464.493]  ? is_bpf_text_address+0x6e/0x100
        [464.500]  ? kernel_text_address+0x145/0x160
        [464.507]  ? unwind_get_return_address+0x5e/0xa0
        [464.514]  ? arch_stack_walk+0xac/0x100
        [464.521]  do_writepages+0x176/0x780
        [464.527]  ? lock_release+0x567/0x790
        [464.533]  ? __pfx_do_writepages+0x10/0x10
        [464.540]  ? __pfx_lock_acquire+0x10/0x10
        [464.546]  ? __pfx_stack_trace_save+0x10/0x10
        [464.553]  ? do_raw_spin_lock+0x12e/0x270
        [464.560]  ? do_raw_spin_unlock+0x58/0x1f0
        [464.566]  ? _raw_spin_unlock+0x23/0x40
        [464.573]  ? wbc_attach_and_unlock_inode+0x3da/0x7d0
        [464.580]  filemap_fdatawrite_wbc+0x113/0x180
        [464.587]  ? prepare_pages.constprop.0+0x13c/0x5c0 [btrfs]
        [464.596]  __filemap_fdatawrite_range+0xaf/0xf0
        [464.603]  ? __pfx___filemap_fdatawrite_range+0x10/0x10
        [464.611]  ? trace_irq_enable.constprop.0+0xce/0x110
        [464.618]  ? kasan_quarantine_put+0xd7/0x1e0
        [464.625]  btrfs_start_ordered_extent+0x46f/0x570 [btrfs]
        [464.633]  ? __pfx_btrfs_start_ordered_extent+0x10/0x10 [btrfs]
        [464.642]  ? __clear_extent_bit+0x2c0/0x9d0 [btrfs]
        [464.650]  btrfs_lock_and_flush_ordered_range+0xc6/0x180 [btrfs]
        [464.659]  ? __pfx_btrfs_lock_and_flush_ordered_range+0x10/0x10 [btrfs]
        [464.669]  btrfs_read_folio+0x12a/0x1d0 [btrfs]
        [464.676]  ? __pfx_btrfs_read_folio+0x10/0x10 [btrfs]
        [464.684]  ? __pfx_filemap_add_folio+0x10/0x10
        [464.691]  ? __pfx___might_resched+0x10/0x10
        [464.698]  ? __filemap_get_folio+0x1c5/0x450
        [464.705]  prepare_uptodate_page+0x12e/0x4d0 [btrfs]
        [464.713]  prepare_pages.constprop.0+0x13c/0x5c0 [btrfs]
        [464.721]  ? fault_in_iov_iter_readable+0xd2/0x240
        [464.729]  btrfs_buffered_write+0x5bd/0x12f0 [btrfs]
        [464.737]  ? __pfx_btrfs_buffered_write+0x10/0x10 [btrfs]
        [464.745]  ? __pfx_lock_release+0x10/0x10
        [464.752]  ? generic_write_checks+0x275/0x400
        [464.759]  ? down_write+0x118/0x1f0
        [464.765]  ? up_write+0x19b/0x500
        [464.770]  btrfs_direct_write+0x731/0xba0 [btrfs]
        [464.778]  ? __pfx_btrfs_direct_write+0x10/0x10 [btrfs]
        [464.785]  ? __pfx___might_resched+0x10/0x10
        [464.792]  ? lock_acquire+0x435/0x500
        [464.798]  ? lock_acquire+0x435/0x500
        [464.804]  btrfs_do_write_iter+0x494/0x640 [btrfs]
        [464.811]  ? __pfx_btrfs_do_write_iter+0x10/0x10 [btrfs]
        [464.819]  ? __pfx___might_resched+0x10/0x10
        [464.825]  ? rw_verify_area+0x6d/0x590
        [464.831]  vfs_write+0x5d7/0xf50
        [464.837]  ? __might_fault+0x9d/0x120
        [464.843]  ? __pfx_vfs_write+0x10/0x10
        [464.849]  ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
        [464.856]  ? lock_release+0x567/0x790
        [464.862]  ksys_write+0xfb/0x1d0
        [464.867]  ? __pfx_ksys_write+0x10/0x10
        [464.873]  ? _raw_spin_unlock+0x23/0x40
        [464.879]  ? btrfs_getattr+0x4af/0x670 [btrfs]
        [464.886]  ? vfs_getattr_nosec+0x79/0x340
        [464.892]  do_syscall_64+0x95/0x180
        [464.898]  ? __do_sys_newfstat+0xde/0xf0
        [464.904]  ? __pfx___do_sys_newfstat+0x10/0x10
        [464.911]  ? trace_irq_enable.constprop.0+0xce/0x110
        [464.918]  ? syscall_exit_to_user_mode+0xac/0x2a0
        [464.925]  ? do_syscall_64+0xa1/0x180
        [464.931]  ? trace_irq_enable.constprop.0+0xce/0x110
        [464.939]  ? trace_irq_enable.constprop.0+0xce/0x110
        [464.946]  ? syscall_exit_to_user_mode+0xac/0x2a0
        [464.953]  ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
        [464.960]  ? do_syscall_64+0xa1/0x180
        [464.966]  ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
        [464.973]  ? trace_irq_enable.constprop.0+0xce/0x110
        [464.980]  ? syscall_exit_to_user_mode+0xac/0x2a0
        [464.987]  ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs]
        [464.995]  ? trace_irq_enable.constprop.0+0xce/0x110
        [465.002]  ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs]
        [465.010]  ? do_syscall_64+0xa1/0x180
        [465.016]  ? lock_release+0x567/0x790
        [465.022]  ? __pfx_lock_acquire+0x10/0x10
        [465.028]  ? __pfx_lock_release+0x10/0x10
        [465.034]  ? trace_irq_enable.constprop.0+0xce/0x110
        [465.042]  ? syscall_exit_to_user_mode+0xac/0x2a0
        [465.049]  ? do_syscall_64+0xa1/0x180
        [465.055]  ? syscall_exit_to_user_mode+0xac/0x2a0
        [465.062]  ? do_syscall_64+0xa1/0x180
        [465.068]  ? syscall_exit_to_user_mode+0xac/0x2a0
        [465.075]  ? do_syscall_64+0xa1/0x180
        [465.081]  ? clear_bhb_loop+0x25/0x80
        [465.087]  ? clear_bhb_loop+0x25/0x80
        [465.093]  ? clear_bhb_loop+0x25/0x80
        [465.099]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
        [465.106] RIP: 0033:0x7f093b8ee784
        [465.111] RSP: 002b:00007ffc29d31b28 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
        [465.122] RAX: ffffffffffffffda RBX: 0000000000006000 RCX: 00007f093b8ee784
        [465.131] RDX: 000000000001de00 RSI: 00007f093b6ed200 RDI: 0000000000000003
        [465.141] RBP: 000000000001de00 R08: 0000000000006000 R09: 0000000000000000
        [465.150] R10: 0000000000023e00 R11: 0000000000000202 R12: 0000000000006000
        [465.160] R13: 0000000000023e00 R14: 0000000000023e00 R15: 0000000000000001
        [465.170]  </TASK>
        [465.174] INFO: lockdep is turned off.
      Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Fixes: 97713b1a ("btrfs: do not clear page dirty inside extent_write_locked_range()")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6252690f
    • Filipe Manana's avatar
      btrfs: send: allow cloning non-aligned extent if it ends at i_size · 46a6e10a
      Filipe Manana authored
      If we find that an extent is shared but its end offset is not sector
      size aligned, then we don't clone it and issue write operations instead.
      This is because the reflink (remap_file_range) operation does not allow
      cloning unaligned ranges, except if the end offset of the range matches
      the i_size of the source and destination files (and the start offset is
      sector size aligned).
      
      While this is not incorrect because send can only guarantee that a file
      has the same data in the source and destination snapshots, it's not
      optimal and generates confusion and surprising behaviour for users.
      
      For example, running this test:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        # Use a file size not aligned to any possible sector size.
        file_size=$((1 * 1024 * 1024 + 5)) # 1MB + 5 bytes
        dd if=/dev/random of=$MNT/foo bs=$file_size count=1
        cp --reflink=always $MNT/foo $MNT/bar
      
        btrfs subvolume snapshot -r $MNT/ $MNT/snap
        rm -f /tmp/send-test
        btrfs send -f /tmp/send-test $MNT/snap
      
        umount $MNT
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        btrfs receive -vv -f /tmp/send-test $MNT
      
        xfs_io -r -c "fiemap -v" $MNT/snap/bar
      
        umount $MNT
      
      Gives the following result:
      
        (...)
        mkfile o258-7-0
        rename o258-7-0 -> bar
        write bar - offset=0 length=49152
        write bar - offset=49152 length=49152
        write bar - offset=98304 length=49152
        write bar - offset=147456 length=49152
        write bar - offset=196608 length=49152
        write bar - offset=245760 length=49152
        write bar - offset=294912 length=49152
        write bar - offset=344064 length=49152
        write bar - offset=393216 length=49152
        write bar - offset=442368 length=49152
        write bar - offset=491520 length=49152
        write bar - offset=540672 length=49152
        write bar - offset=589824 length=49152
        write bar - offset=638976 length=49152
        write bar - offset=688128 length=49152
        write bar - offset=737280 length=49152
        write bar - offset=786432 length=49152
        write bar - offset=835584 length=49152
        write bar - offset=884736 length=49152
        write bar - offset=933888 length=49152
        write bar - offset=983040 length=49152
        write bar - offset=1032192 length=16389
        chown bar - uid=0, gid=0
        chmod bar - mode=0644
        utimes bar
        utimes
        BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=06d640da-9ca1-604c-b87c-3375175a8eb3, stransid=7
        /mnt/sdi/snap/bar:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..2055]:       26624..28679      2056   0x1
      
      There's no clone operation to clone extents from the file foo into file
      bar and fiemap confirms there's no shared flag (0x2000).
      
      So update send_write_or_clone() so that it proceeds with cloning if the
      source and destination ranges end at the i_size of the respective files.
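
The rule described above can be sketched as a small predicate. This is only an illustration under stated assumptions, not the kernel's send_write_or_clone(): the function name, its parameters, and the 4096-byte sector size used in the example call are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: decide whether a shared extent range
 * may be cloned instead of being sent as write operations.
 */
bool sketch_can_clone(uint64_t src_start, uint64_t src_end,
		      uint64_t dst_end, uint64_t src_i_size,
		      uint64_t dst_i_size, uint64_t sectorsize)
{
	/* The start offset must always be sector size aligned. */
	if (src_start % sectorsize != 0)
		return false;

	/* A sector size aligned end offset could already be cloned. */
	if (src_end % sectorsize == 0)
		return true;

	/* New case: an unaligned end is fine if both ranges end at i_size. */
	return src_end == src_i_size && dst_end == dst_i_size;
}
```

For the 1048581-byte test file, sketch_can_clone(0, 1048581, 1048581, 1048581, 1048581, 4096) returns true, which corresponds to a single clone operation replacing the 21 writes of 49152 bytes plus one of 16389 bytes shown in the first output.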
      
      After this change the result of the test is:
      
        (...)
        mkfile o258-7-0
        rename o258-7-0 -> bar
        clone bar - source=foo source offset=0 offset=0 length=1048581
        chown bar - uid=0, gid=0
        chmod bar - mode=0644
        utimes bar
        utimes
        BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=582420f3-ea7d-564e-bbe5-ce440d622190, stransid=7
        /mnt/sdi/snap/bar:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..2055]:       26624..28679      2056 0x2001
      
      A test case for fstests will also follow up soon.
      
      Link: https://github.com/kdave/btrfs-progs/issues/572#issuecomment-2282841416
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      46a6e10a
    • Filipe Manana's avatar
      btrfs: only run the extent map shrinker from kswapd tasks · ae1e766f
      Filipe Manana authored
      Currently the extent map shrinker can be run by any task when attempting
      to allocate memory and there's enough memory pressure to trigger it.
      
      To avoid too much latency we stop iterating over extent maps and removing
      them once the task needs to reschedule. This logic was introduced in commit
      b3ebb9b7 ("btrfs: stop extent map shrinker if reschedule is needed").
      
      While that solved high latency problems for some use cases, it's still
      not enough because with a too high number of tasks entering the extent map
      shrinker code, either due to memory allocations or because they are a
      kswapd task, we end up having a very high level of contention on some
      spin locks, namely:
      
      1) The fs_info->fs_roots_radix_lock spin lock, which we need to find
         roots to iterate over their inodes;
      
      2) The spin lock of the xarray used to track open inodes for a root
         (struct btrfs_root::inodes) - on 6.10 kernels and below, it used to
         be a red black tree and the spin lock was root->inode_lock;
      
      3) The fs_info->delayed_iput_lock spin lock since the shrinker adds
         delayed iputs (calls btrfs_add_delayed_iput()).
      
      Instead of allowing the extent map shrinker to be run by any task, make
      it run only by kswapd tasks. This still solves the problem of running
      into OOM situations due to an unbounded extent map creation, which is
      simple to trigger by direct IO writes, as described in the changelog
      of commit 956a17d9 ("btrfs: add a shrinker for extent maps"), and
      by a similar case when doing buffered IO on files with a very large
      number of holes (keeping the file open and creating many holes, whose
      extent maps are only released when the file is closed).
      Reported-by: kzd <kzd@56709.net>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121
      Reported-by: Octavia Togami <octavia.togami@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/
      Fixes: 956a17d9 ("btrfs: add a shrinker for extent maps")
      CC: stable@vger.kernel.org # 6.10+
      Tested-by: kzd <kzd@56709.net>
      Tested-by: Octavia Togami <octavia.togami@gmail.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ae1e766f
    • Qu Wenruo's avatar
      btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir type · 31723c95
      Qu Wenruo authored
      [REPORT]
      There is a bug report that the kernel is rejecting a mismatching inode
      mode and its dir item:
      
        [ 1881.553937] BTRFS critical (device dm-0): inode mode mismatch with
        dir: inode mode=040700 btrfs type=2 dir type=0
      
      [CAUSE]
      It looks like the inode mode is correct, while the dir item type
      0 is BTRFS_FT_UNKNOWN, which should not be generated by btrfs at all.
      
      This may be caused by a memory bit flip.
      
      [ENHANCEMENT]
      Although tree-checker is not able to do any cross-leaf verification, for
      this particular case we can at least reject any dir type with
      BTRFS_FT_UNKNOWN.
      
      So here we enhance the dir type check from [0, BTRFS_FT_MAX), to
      (0, BTRFS_FT_MAX).
      Although the existing corruption can not be fixed just by such enhanced
      checking, it should prevent the same 0x2->0x0 bitflip for dir type to
      reach disk in the future.
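
The change of range from [0, BTRFS_FT_MAX) to (0, BTRFS_FT_MAX) can be illustrated with two small predicates. This is a sketch, not the tree-checker code: the function names are hypothetical and the constant values are assumptions meant to mirror the on-disk format (the report itself confirms only that 0 is BTRFS_FT_UNKNOWN and that the inode's type was 2).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; only the lower and upper bounds matter for the check. */
enum {
	SKETCH_FT_UNKNOWN = 0,	/* stands in for BTRFS_FT_UNKNOWN */
	SKETCH_FT_DIR = 2,	/* the "btrfs type=2" from the report */
	SKETCH_FT_MAX = 9,	/* assumed value of BTRFS_FT_MAX */
};

/* Old check: accepts [0, MAX), so dir type 0 passes. */
bool sketch_dir_type_valid_old(uint8_t dir_type)
{
	return dir_type < SKETCH_FT_MAX;
}

/* Enhanced check: accepts (0, MAX), so dir type 0 is rejected. */
bool sketch_dir_type_valid_new(uint8_t dir_type)
{
	return dir_type > SKETCH_FT_UNKNOWN && dir_type < SKETCH_FT_MAX;
}
```

A dir type of 0x2 with its only set bit cleared becomes 0x0, which the old predicate accepts and the new one rejects.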
      Reported-by: Kota <nospam@kota.moe>
      Link: https://lore.kernel.org/linux-btrfs/CACsxjPYnQF9ZF-0OhH16dAx50=BXXOcP74MxBc3BG+xae4vTTw@mail.gmail.com/
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      31723c95
    • Josef Bacik's avatar
      btrfs: check delayed refs when we're checking if a ref exists · 42fac187
      Josef Bacik authored
      In the patch 78c52d9e ("btrfs: check for refs on snapshot delete
      resume") I added some code to handle file systems that had been
      corrupted by a bug that incorrectly skipped updating the drop progress
      key while dropping a snapshot.  This code would check to see if we had
      already deleted our reference for a child block, and skip the deletion
      if we had already.
      
      Unfortunately there is a bug, as the check would only check the on-disk
      references.  I made an incorrect assumption that blocks in an already
      deleted snapshot that was having the deletion resume on mount wouldn't
      be modified.
      
      If we have 2 pending deleted snapshots that share blocks, we can easily
      modify the rules for a block.  Take the following example
      
      subvolume a exists, and subvolume b is a snapshot of subvolume a.  They
      share references to block 1.  Block 1 will have 2 full references, one
      for subvolume a and one for subvolume b, and it belongs to subvolume a
      (btrfs_header_owner(block 1) == subvolume a).
      
      When deleting subvolume a, we will drop our full reference for block 1,
      and because we are the owner we will drop our full reference for all of
      block 1's children, convert block 1 to FULL BACKREF, and add a shared
      reference to all of block 1's children.
      
      Then we will start the snapshot deletion of subvolume b.  We look up the
      extent info for block 1, which checks delayed refs and tells us that
      FULL BACKREF is set, so sets parent to the bytenr of block 1.  However
      because this is a resumed snapshot deletion, we call into
      check_ref_exists().  Because check_ref_exists() only looks at the disk,
      it doesn't find the shared backref for the child of block 1, and thus
      returns 0 and we skip deleting the reference for the child of block 1
      and continue.  This orphans the child of block 1.
      
      The fix is to lookup the delayed refs, similar to what we do in
      btrfs_lookup_extent_info().  However we only care about whether the
      reference exists or not.  If we fail to find our reference on disk, go
      look up the bytenr in the delayed refs, and if it exists look for an
      existing ref in the delayed ref head.  If that exists then we know we
      can delete the reference safely and carry on.  If it doesn't exist we
      know we have to skip over this block.
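
The lookup order described above can be summarised as a small decision function. This is a sketch under stated assumptions, not the real check_ref_exists(): the structure and function names are hypothetical, and the three booleans stand in for the results of the on-disk and delayed ref lookups.

```c
#include <stdbool.h>

/* Hypothetical summary of what the lookups found; not a kernel structure. */
struct sketch_ref_state {
	bool on_disk_ref;	/* reference found in the on-disk extent tree */
	bool delayed_head;	/* a delayed ref head exists for the bytenr */
	bool delayed_ref;	/* a matching ref is queued in that head */
};

/* Returns true when the reference exists and can be dropped safely. */
bool sketch_ref_exists(const struct sketch_ref_state *s)
{
	if (s->on_disk_ref)
		return true;

	/* Not on disk: fall back to the delayed refs. */
	if (!s->delayed_head)
		return false;

	return s->delayed_ref;
}
```

The case from the example, a shared backref that exists only as a delayed ref, is the one the on-disk-only check missed: here it returns true instead of causing the block to be skipped.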
      
      This bug has existed since I introduced this fix, however it requires
      having multiple deleted snapshots pending when we unmount.  We noticed
      this in production because our shutdown path stops the container on the
      system, which deletes a bunch of subvolumes, and then reboots the box.
      This gives us plenty of opportunities to hit this issue.  Looking at the
      history we've seen this occasionally in production, but we had a big
      spike recently thanks to faster machines getting jobs with multiple
      subvolumes in the job.
      
      Chris Mason wrote a reproducer which does the following
      
      mount /dev/nvme4n1 /btrfs
      btrfs subvol create /btrfs/s1
      simoop -E -f 4k -n 200000 -z /btrfs/s1
      while(true) ; do
      	btrfs subvol snap /btrfs/s1 /btrfs/s2
      	simoop -f 4k -n 200000 -r 10 -z /btrfs/s2
      	btrfs subvol snap /btrfs/s2 /btrfs/s3
      	btrfs balance start -dusage=80 /btrfs
      	btrfs subvol del /btrfs/s2 /btrfs/s3
      	umount /btrfs
      	btrfsck /dev/nvme4n1 || exit 1
      	mount /dev/nvme4n1 /btrfs
      done
      
      On the second loop this would fail consistently, with my patch it has
      been running for hours and hasn't failed.
      
      I also used dm-log-writes to capture the state of the failure so I could
      debug the problem.  Using the existing failure case to test my patch
      validated that it fixes the problem.
      
      Fixes: 78c52d9e ("btrfs: check for refs on snapshot delete resume")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      42fac187
  9. 11 Aug, 2024 9 commits
    • Linus Torvalds's avatar
      Linux 6.11-rc3 · 7c626ce4
      Linus Torvalds authored
      7c626ce4
    • Linus Torvalds's avatar
      Merge tag 'x86-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7006fe2f
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
      
       - Fix 32-bit PTI for real.
      
         pti_clone_entry_text() is called twice, once before initcalls so that
         initcalls can use the user-mode helper and then again after text is
         set read only. Setting read only on 32-bit might break up the PMD
         mapping, which makes the second invocation of pti_clone_entry_text()
         find the mappings out of sync and fail.
      
         Allow the second call to split the existing PMDs in the user mapping
         and synchronize with the kernel mapping.
      
       - Don't make acpi_mp_wake_mailbox read-only after init as the mail box
         must be writable in the case that CPU hotplug operations happen after
         boot. Otherwise the attempt to start a CPU crashes with a write to
         read only memory.
      
       - Add a missing sanity check in mtrr_save_state() to ensure that the
         fixed MTRR MSRs are supported.
      
         Otherwise mtrr_save_state() ends up in a #GP, which is fixed up, but
         the WARN_ON() can bring systems down when panic on warn is set.
      
      * tag 'x86-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/mtrr: Check if fixed MTRRs exist before saving them
        x86/paravirt: Fix incorrect virt spinlock setting on bare metal
        x86/acpi: Remove __ro_after_init from acpi_mp_wake_mailbox
        x86/mm: Fix PTI for i386 some more
      7006fe2f
    • Linus Torvalds's avatar
      Merge tag 'timers-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7270e931
      Linus Torvalds authored
      Pull time keeping fixes from Thomas Gleixner:
      
       - Fix a couple of issues in the NTP code where user supplied values are
         neither sanity checked nor clamped to the operating range. This
         results in integer overflows and eventually NTP getting out of sync.
      
         According to the history the sanity checks had been removed in favor
         of clamping the values, but the clamping never worked correctly under
         all circumstances. The NTP people asked to not bring the sanity
         checks back as it might break existing applications.
      
         Make the clamping work correctly and add it where it's missing.
      
       - If adjtimex() sets the clock it has to trigger the hrtimer subsystem
         so it can adjust and if the clock was set into the future expire
         timers if needed. The caller should provide a bitmask to tell
         hrtimers which clocks have been adjusted.
      
         adjtimex() does not use the proper constant and uses CLOCK_REALTIME
         instead, which is 0. So hrtimers adjusts only the clocks, but does
         not check for expired timers, which might make them expire really
         late. Use the proper bitmask constant instead.
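
Why passing CLOCK_REALTIME where a bitmask is expected goes wrong, as described in the second item above, can be shown with a few lines of C. This is a sketch only: the bit definitions and the function name are hypothetical and do not match the kernel's actual bitmask constants. The single grounded fact is that CLOCK_REALTIME is 0, mirrored here as a local constant.

```c
/* CLOCK_REALTIME is 0 (as noted above); mirrored as a local constant. */
#define SKETCH_CLOCK_REALTIME	0u

/* Hypothetical per-clock bits; the kernel's real constants differ. */
#define SKETCH_BASE_REALTIME	(1u << 0)
#define SKETCH_BASE_TAI		(1u << 1)
#define SKETCH_SET_WALL		(SKETCH_BASE_REALTIME | SKETCH_BASE_TAI)

/* Count how many clock bases a "which clocks were set" mask selects. */
unsigned int sketch_bases_selected(unsigned int mask)
{
	unsigned int n = 0;

	for (; mask != 0; mask >>= 1)
		n += mask & 1u;
	return n;
}
```

Interpreted as a mask, SKETCH_CLOCK_REALTIME selects no clock base at all, while a proper mask such as SKETCH_SET_WALL selects every base whose timers may need to expire.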
      
      * tag 'timers-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        timekeeping: Fix bogus clock_was_set() invocation in do_adjtimex()
        ntp: Safeguard against time_constant overflow
        ntp: Clamp maxerror and esterror to operating range
      7270e931
    • Linus Torvalds's avatar
      Merge tag 'irq-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 56fe0a6a
      Linus Torvalds authored
      Pull irq fixes from Thomas Gleixner:
       "Three small fixes for interrupt core and drivers:
      
         - The interrupt core fails to honor caller supplied affinity hints
           for non-managed interrupts and uses the system default affinity on
           startup instead. Set the missing flag in the descriptor to tell the
           core to use the provided affinity.
      
         - Fix a shift out of bounds error in the Xilinx driver
      
         - Handle switching to level trigger correctly in the RISCV APLIC
           driver. It failed to retrigger the interrupt which causes it to
           become stale"
      
      * tag 'irq-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/riscv-aplic: Retrigger MSI interrupt on source configuration
        irqchip/xilinx: Fix shift out of bounds
        genirq/irqdesc: Honor caller provided affinity in alloc_desc()
      56fe0a6a
    • Linus Torvalds's avatar
      Merge tag 'usb-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · cb2e5ee8
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are a number of small USB driver fixes for reported issues for
        6.11-rc3. Included in here are:
      
         - usb serial driver MODULE_DESCRIPTION() updates
      
         - usb serial driver fixes
      
         - typec driver fixes
      
         - usb-ip driver fix
      
         - gadget driver fixes
      
         - dt binding update
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'usb-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usb: typec: ucsi: Fix a deadlock in ucsi_send_command_common()
        usb: typec: tcpm: avoid sink goto SNK_UNATTACHED state if not received source capability message
        usb: gadget: f_fs: pull out f->disable() from ffs_func_set_alt()
        usb: gadget: f_fs: restore ffs_func_disable() functionality
        USB: serial: debug: do not echo input by default
        usb: typec: tipd: Delete extra semi-colon
        usb: typec: tipd: Fix dereferencing freeing memory in tps6598x_apply_patch()
        usb: gadget: u_serial: Set start_delayed during suspend
        usb: typec: tcpci: Fix error code in tcpci_check_std_output_cap()
        usb: typec: fsa4480: Check if the chip is really there
        usb: gadget: core: Check for unset descriptor
        usb: vhci-hcd: Do not drop references before new references are gained
        usb: gadget: u_audio: Check return codes from usb_ep_enable and config_ep_by_speed.
        usb: gadget: midi2: Fix the response for FB info with block 0xff
        dt-bindings: usb: microchip,usb2514: Add USB2517 compatible
        USB: serial: garmin_gps: use struct_size() to allocate pkt
        USB: serial: garmin_gps: annotate struct garmin_packet with __counted_by
        USB: serial: add missing MODULE_DESCRIPTION() macros
        USB: serial: spcp8x5: remove unused struct 'spcp8x5_usb_ctrl_arg'
      cb2e5ee8
    • Linus Torvalds's avatar
      Merge tag 'tty-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 42b34a8d
      Linus Torvalds authored
      Pull tty / serial driver fixes from Greg KH:
       "Here are some small tty and serial driver fixes for reported problems
        for 6.11-rc3. Included in here are:
      
         - sc16is7xx serial driver fixes
      
         - uartclk bugfix for a divide by zero issue
      
         - conmakehash userspace build issue fix
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'tty-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        tty: vt: conmakehash: cope with abs_srctree no longer in env
        serial: sc16is7xx: fix invalid FIFO access with special register set
        serial: sc16is7xx: fix TX fifo corruption
        serial: core: check uartclk for zero to avoid divide by zero
      42b34a8d
    • Linus Torvalds's avatar
      Merge tag 'driver-core-6.11-rc3' of... · 84e6da57
      Linus Torvalds authored
      Merge tag 'driver-core-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
      
      Pull driver core / documentation fixes from Greg KH:
       "Here are some small fixes, and some documentation updates for
        6.11-rc3. Included in here are:
      
         - embargoed hardware documentation updates based on a lot of review by
           legal-types in lots of companies to try to make the process a _bit_
           easier for us to manage over time.
      
         - rust firmware documentation fix
      
         - driver detach race fix for the fix that went into 6.11-rc1
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'driver-core-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
        driver core: Fix uevent_show() vs driver detach race
        Documentation: embargoed-hardware-issues.rst: add a section documenting the "early access" process
        Documentation: embargoed-hardware-issues.rst: minor cleanups and fixes
        rust: firmware: fix invalid rustdoc link
      84e6da57
    • Linus Torvalds's avatar
      Merge tag 'char-misc-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 9221afb2
      Linus Torvalds authored
      Pull char/misc fixes from Greg KH:
       "Here are some small char/misc/other driver fixes for 6.11-rc3 for
        reported issues. Included in here are:
      
         - binder driver fixes
      
         - fsi MODULE_DESCRIPTION() additions (people seem to love them...)
      
         - eeprom driver fix
      
         - Kconfig dependency fix to resolve build issues
      
         - spmi driver fixes
      
        All of these have been in linux-next for a while with no reported
        problems"
      
      * tag 'char-misc-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        spmi: pmic-arb: add missing newline in dev_err format strings
        spmi: pmic-arb: Pass the correct of_node to irq_domain_add_tree
        binder_alloc: Fix sleeping function called from invalid context
        binder: fix descriptor lookup for context manager
        char: add missing NetWinder MODULE_DESCRIPTION() macros
        misc: mrvl-cn10k-dpi: add PCI_IOV dependency
        eeprom: ee1004: Fix locking issues in ee1004_probe()
        fsi: add missing MODULE_DESCRIPTION() macros
      9221afb2
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 04cc50c2
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Two core fixes: one to prevent discard type changes (seen on iSCSI)
        during intermittent errors and the other is fixing a lockdep problem
        caused by the queue limits change.
      
        And one driver fix in ufs"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: sd: Keep the discard mode stable
        scsi: sd: Move sd_read_cpr() out of the q->limits_lock region
        scsi: ufs: core: Fix hba->last_dme_cmd_tstamp timestamp updating logic
      04cc50c2
  10. 10 Aug, 2024 8 commits
  11. 09 Aug, 2024 6 commits
    • Kent Overstreet's avatar
      bcachefs: bcachefs_metadata_version_disk_accounting_v3 · 8a2491db
      Kent Overstreet authored
      bcachefs_metadata_version_disk_accounting_v2 erroneously had padding
      bytes in disk_accounting_key, which is a problem because we have to
      guarantee that all unused bytes in disk_accounting_key are zeroed.
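
How padding bytes end up inside a key structure can be seen with a hypothetical layout. The struct below is not the real disk_accounting_key, and the packed variant is only one possible way to avoid padding (it relies on a GCC/Clang attribute); how version 3 actually lays out the key is not stated in this message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key layout, not the real disk_accounting_key. */
struct sketch_key_padded {
	uint8_t  type;
	uint64_t counter_id;	/* alignment forces padding after 'type' */
};

/* Same fields with the padding removed (GCC/Clang attribute). */
struct sketch_key_packed {
	uint8_t  type;
	uint64_t counter_id;
} __attribute__((packed));

/* Bytes in the padded struct that belong to no member. */
size_t sketch_padding_bytes(void)
{
	return sizeof(struct sketch_key_padded) -
	       (sizeof(uint8_t) + sizeof(uint64_t));
}
```

Any byte counted by sketch_padding_bytes() is not covered by a member assignment, so it has to be zeroed explicitly if the whole key is stored or compared as raw bytes.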
      
      Fortunately 6.11 isn't out yet, so it's cheap to fix this by spinning a
      new version.
      Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      8a2491db
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2024-08-10' of https://gitlab.freedesktop.org/drm/kernel · 15833fea
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Weekly regular fixes, mostly amdgpu with i915/xe having a few each,
        and then some misc bits across the board, seems about right for rc3
        time.
      
        client:
         - fix null ptr deref
      
        bridge:
         - connector: fix double free
      
        atomic:
         - fix async flip update
      
        panel:
         - document panel
      
        omap:
         - add config dependency
      
        tests:
         - fix gem shmem test
      
        drm buddy:
         - Add start address to trim function
      
        amdgpu:
         - DMCUB fix
         - Fix DET programming on some DCNs
         - DCC fixes
         - DCN 4.0.1 fixes
         - SMU 14.0.x update
         - MMHUB fix
         - DCN 3.1.4 fix
         - GC 12.0 fixes
         - Fix soft recovery error propagation
         - SDMA 7.0 fixes
         - DSC fix
      
        xe:
         - Fix off-by-one when processing RTP rules
         - Use dma_fence_chain_free in chain fence unused as a sync
         - Fix PL1 disable flow in xe_hwmon_power_max_write
         - Take ref to VM in delayed dump snapshot
      
        i915:
         - correct dual pps handling for MTL_PCH+ [display]
         - Adjust vma offset for framebuffer mmap offset [gem]
         - Fix Virtual Memory mapping boundaries calculation [gem]
         - Allow evicting to use the requested placement
         - Attempt to get pages without eviction first"
      
      * tag 'drm-fixes-2024-08-10' of https://gitlab.freedesktop.org/drm/kernel: (31 commits)
        drm/xe: Take ref to VM in delayed snapshot
        drm/xe/hwmon: Fix PL1 disable flow in xe_hwmon_power_max_write
        drm/xe: Use dma_fence_chain_free in chain fence unused as a sync
        drm/xe/rtp: Fix off-by-one when processing rules
        drm/amdgpu: Add DCC GFX12 flag to enable address alignment
        drm/amdgpu: correct sdma7 max dw
        drm/amdgpu: Add address alignment support to DCC buffers
        drm/amd/display: Skip Recompute DSC Params if no Stream on Link
        drm/amdgpu: change non-dcc buffer copy configuration
        drm/amdgpu: Forward soft recovery errors to userspace
        drm/amdgpu: add golden setting for gc v12
        drm/buddy: Add start address support to trim function
        drm/amd/display: Add missing program DET segment call to pipe init
        drm/amd/display: Add missing DCN314 to the DML Makefile
        drm/amdgpu: force to use legacy inv in mmhub
        drm/amd/pm: update powerplay structure on smu v14.0.2/3
        drm/amd/display: Add missing mcache registers
        drm/amd/display: Add dcc propagation value
        drm/amd/display: Add missing DET segments programming
        drm/amd/display: Replace dm_execute_dmub_cmd with dc_wake_and_execute_dmub_cmd
        ...
      15833fea
    • Kent Overstreet's avatar
      bcachefs: improve bch2_dev_usage_to_text() · 1a9e219d
      Kent Overstreet authored
      Add a line for capacity
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      1a9e219d
    • Kent Overstreet's avatar
      bcachefs: bch2_accounting_invalid() · 077e4737
      Kent Overstreet authored
      Implement bch2_accounting_invalid(); check for junk at the end, and
      replicas accounting entries in particular need to be checked or we'll
      pop asserts later.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      077e4737
    • Linus Torvalds's avatar
      Merge tag 'bitmap-6.11-rc' of https://github.com/norov/linux · afdab700
      Linus Torvalds authored
      Pull cpumask fix from Yury Norov:
       "Fix for cpumask merge"
      
      [ Mea culpa, this was my mismerge due to too much cut-and-paste - Linus ]
      
      * tag 'bitmap-6.11-rc' of https://github.com/norov/linux:
        cpumask: Fix crash on updating CPU enabled mask
      afdab700
    • Linus Torvalds's avatar
      Merge tag 'pm-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 85082897
      Linus Torvalds authored
      Pull power management fix from Rafael Wysocki:
       "Change the default EPP (energy-performance preference) value for the
        Emerald Rapids processor in the intel_pstate driver.
      
        This should improve both the performance and energy efficiency (Pedro
        Henrique Kopper)"
      
      * tag 'pm-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        cpufreq: intel_pstate: Update Balance performance EPP for Emerald Rapids
      85082897