1. 07 Oct, 2020 40 commits
    • Qu Wenruo's avatar
      btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref() · 07cce5cf
      Qu Wenruo authored
      [BUG]
      With a crafted image, btrfs can panic at insert_inline_extent_backref():
      
        kernel BUG at fs/btrfs/extent-tree.c:1857!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 0 PID: 1117 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
        RIP: 0010:insert_inline_extent_backref+0xcc/0xe0
        RSP: 0018:ffffac4dc1287be8 EFLAGS: 00010293
        RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000000000001
        RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffffac4dc1287c28 R08: ffffac4dc1287ab8 R09: ffffac4dc1287ac0
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
        R13: ffff8febef88a540 R14: ffff8febeaa7bc30 R15: 0000000000000000
        FS: 0000000000000000(0000) GS:ffff8febf7a00000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f663ace94c0 CR3: 0000000235698006 CR4: 00000000000206f0
        Call Trace:
        ? _cond_resched+0x1a/0x50
        __btrfs_inc_extent_ref.isra.64+0x7e/0x240
        ? btrfs_merge_delayed_refs+0xa5/0x330
        __btrfs_run_delayed_refs+0x653/0x1120
        btrfs_run_delayed_refs+0xdb/0x1b0
        btrfs_commit_transaction+0x52/0x950
        ? start_transaction+0x94/0x450
        transaction_kthread+0x163/0x190
        kthread+0x105/0x140
        ? btrfs_cleanup_transaction+0x560/0x560
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x35/0x40
        Modules linked in:
        ---[ end trace 2ad8b3de903cf825 ]---
      
      [CAUSE]
      Due to extent tree corruption (still valid by itself, but bad cross
      ref), we can allocate an extent which is still in extent tree.  The
      offending tree block of that case is from csum tree.  The newly
      allocated tree block is also for csum tree.
      
      Then we will try to insert a tree block ref for the existing tree block
      ref.
      
      For a tree extent item, tree block can never be shared directly by the
      same tree twice.  We have such BUG_ON() to prevent such problem, but
      this is not a proper error handling.
      
      [FIX]
      Replace that BUG_ON() with proper error message and leaf dump for debug
      build.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202829Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      07cce5cf
    • Qu Wenruo's avatar
      btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent() · 1c2a07f5
      Qu Wenruo authored
      __btrfs_free_extent() is doing two things:
      
      1. Reduce the refs number of an extent backref
         Either it's an inline extent backref (inside EXTENT/METADATA item) or
         a keyed extent backref (SHARED_* item).
         We only need to locate that backref line, either reduce the number or
         remove the backref line completely.
      
      2. Update the refs count in EXTENT/METADATA_ITEM
      
      During step 1), we will try to locate the EXTENT/METADATA_ITEM without
      triggering another btrfs_search_slot() as fast path.
      
      Only when we fail to locate that item, we will trigger another
      btrfs_search_slot() to get that EXTENT/METADATA_ITEM after we
      updated/deleted the backref line.
      
      And we have a lot of strict checks on things like refs_to_drop against
      extent refs and special case checks for single ref extents.
      
      There are 7 BUG_ON()s, although they're doing correct checks, they can
      be triggered by crafted images.
      
      This patch improves the function:
      
      - Introduce two examples to show what __btrfs_free_extent() is doing
        One inline backref case and one keyed case.  Should cover most cases.
      
      - Kill all BUG_ON()s with proper error message and optional leaf dump
      
      - Add comment to show the overall flow
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202819
      [ The report triggers one BUG_ON() in __btrfs_free_extent() ]
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1c2a07f5
    • Qu Wenruo's avatar
      btrfs: extent_io: do extra check for extent buffer read write functions · f98b6215
      Qu Wenruo authored
      Although we have start, len check for extent buffer reader/write (e.g.
      read_extent_buffer()), these checks have limitations:
      
      - No overflow check
        Values like start = 1024 len = -1024 can still pass the basic
         (start + len) > eb->len check.
      
      - Checks are not consistent
        For read_extent_buffer() we only check (start + len) against eb->len.
        While for memcmp_extent_buffer() we also check start against eb->len.
      
      - Different error reporting mechanism
        We use WARN() in read_extent_buffer() but BUG() in
        memcpy_extent_buffer().
      
      - Still modify memory if the request is obviously wrong
        In read_extent_buffer() even we find (start + len) > eb->len, we still
        call memset(dst, 0, len), which can easily cause memory access error
        if start + len overflows.
      
      To address above problems, this patch creates a new common function to
      check such access, check_eb_range().
      
      - Add overflow check
        This function checks start, start + len against eb->len and overflow
        check.
      
      - Unified checks
      
      - Unified error reports
        Will call WARN() if CONFIG_BTRFS_DEBUG is configured.
        And also do btrfs_warn() message for non-debug build.
      
      - Exit ASAP if check fails
        No more possible memory corruption.
      
      - Add extra comment for @start @len used in those functions as it's
        sometimes confused with the logical addressing instead of a range
        inside the eb space
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
      [ Inspired by above report, the report itself is already addressed ]
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ use check_add_overflow ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f98b6215
    • Nikolay Borisov's avatar
      btrfs: rework error detection in init_tree_roots · 217f5004
      Nikolay Borisov authored
      To avoid duplicating 3 lines of code the error detection logic in
      init_tree_roots is somewhat quirky. It first checks for the presence of
      any error condition, then checks for the specific condition to perform
      any specific actions. That's spurious because directly checking for
      each respective error condition and doing the necessary steps is more
      obvious. While at it change the -EUCLEAN to -EIO in case the extent
      buffer is not read correctly, this is in line with other sites which
      return -EIO when the eb couldn't be read.
      
      Additionally it results in smaller code and the code reads
      more linearly:
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-95 (-95)
      Function                                     old     new   delta
      open_ctree                                 17243   17148     -95
      Total: Before=113104, After=113009, chg -0.08%
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      217f5004
    • Qu Wenruo's avatar
      btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations · e85fde51
      Qu Wenruo authored
      [BUG]
      When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:
      
        generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)
      
      And with the following metadata leak:
      
        BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
        Call Trace:
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         __prepare_exit_to_usermode+0x1bc/0x1c0
         __syscall_return_slowpath+0x47/0x230
         do_syscall_64+0x64/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace a6cfd45ba80e4e06 ]---
        BTRFS error (device dm-3): qgroup reserved space leaked
        BTRFS info (device dm-3): disk space caching is enabled
        BTRFS info (device dm-3): has skinny extents
      
      [CAUSE]
      The qgroup preallocated meta rsv operations of that offending root are:
      
        btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
        btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
        btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
        btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
        btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
      
      It's pretty obvious that, we reserve qgroup meta rsv in
      btrfs_subvolume_reserve_metadata(), but doesn't have corresponding
      release/convert calls in btrfs_subvolume_release_metadata().
      
      This leads to the leakage.
      
      [FIX]
      To fix this bug, we should follow what we're doing in
      btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and
      add it to block_rsv->qgroup_rsv_reserved.
      
      And free the qgroup reserved metadata space when releasing the
      block_rsv.
      
      To do this, we need to change the btrfs_subvolume_release_metadata() to
      accept btrfs_root, and record the qgroup_to_release number, and call
      btrfs_qgroup_convert_reserved_meta() for it.
      
      Fixes: 733e03a0 ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e85fde51
    • Qu Wenruo's avatar
      btrfs: qgroup: fix wrong qgroup metadata reserve for delayed inode · b4c5d8fd
      Qu Wenruo authored
      For delayed inode facility, qgroup metadata is reserved for it, and
      later freed.
      
      However we're freeing more bytes than we reserved.
      In btrfs_delayed_inode_reserve_metadata():
      
      	num_bytes = btrfs_calc_metadata_size(fs_info, 1);
      	...
      		ret = btrfs_qgroup_reserve_meta_prealloc(root,
      				fs_info->nodesize, true);
      		...
      		if (!ret) {
      			node->bytes_reserved = num_bytes;
      
      But in btrfs_delayed_inode_release_metadata():
      
      	if (qgroup_free)
      		btrfs_qgroup_free_meta_prealloc(node->root,
      				node->bytes_reserved);
      	else
      		btrfs_qgroup_convert_reserved_meta(node->root,
      				node->bytes_reserved);
      
      This means, we're always releasing more qgroup metadata rsv than we have
      reserved.
      
      This won't trigger selftest warning, as btrfs qgroup metadata rsv has
      extra protection against cases like quota enabled half-way.
      
      But we still need to fix this problem any way.
      
      This patch will use the same num_bytes for qgroup metadata rsv so we
      could handle it correctly.
      
      Fixes: f218ea6c ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b4c5d8fd
    • Josef Bacik's avatar
      btrfs: do not hold device_list_mutex when closing devices · 425c6ed6
      Josef Bacik authored
      The following lockdep splat
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted
      ------------------------------------------------------
      fsstress/8739 is trying to acquire lock:
      ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70
      
      but task is already holding lock:
      ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #10 (sb_pagefaults){.+.+}-{0:0}:
             __sb_start_write+0x129/0x210
             btrfs_page_mkwrite+0x6a/0x4a0
             do_page_mkwrite+0x4d/0xc0
             handle_mm_fault+0x103c/0x1730
             exc_page_fault+0x340/0x660
             asm_exc_page_fault+0x1e/0x30
      
      -> #9 (&mm->mmap_lock#2){++++}-{3:3}:
             __might_fault+0x68/0x90
             _copy_to_user+0x1e/0x80
             perf_read+0x141/0x2c0
             vfs_read+0xad/0x1b0
             ksys_read+0x5f/0xe0
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #8 (&cpuctx_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             perf_event_init_cpu+0x88/0x150
             perf_event_init+0x1db/0x20b
             start_kernel+0x3ae/0x53c
             secondary_startup_64+0xa4/0xb0
      
      -> #7 (pmus_lock){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             perf_event_init_cpu+0x4f/0x150
             cpuhp_invoke_callback+0xb1/0x900
             _cpu_up.constprop.26+0x9f/0x130
             cpu_up+0x7b/0xc0
             bringup_nonboot_cpus+0x4f/0x60
             smp_init+0x26/0x71
             kernel_init_freeable+0x110/0x258
             kernel_init+0xa/0x103
             ret_from_fork+0x1f/0x30
      
      -> #6 (cpu_hotplug_lock){++++}-{0:0}:
             cpus_read_lock+0x39/0xb0
             kmem_cache_create_usercopy+0x28/0x230
             kmem_cache_create+0x12/0x20
             bioset_init+0x15e/0x2b0
             init_bio+0xa3/0xaa
             do_one_initcall+0x5a/0x2e0
             kernel_init_freeable+0x1f4/0x258
             kernel_init+0xa/0x103
             ret_from_fork+0x1f/0x30
      
      -> #5 (bio_slab_lock){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             bioset_init+0xbc/0x2b0
             __blk_alloc_queue+0x6f/0x2d0
             blk_mq_init_queue_data+0x1b/0x70
             loop_add+0x110/0x290 [loop]
             fq_codel_tcf_block+0x12/0x20 [sch_fq_codel]
             do_one_initcall+0x5a/0x2e0
             do_init_module+0x5a/0x220
             load_module+0x2459/0x26e0
             __do_sys_finit_module+0xba/0xe0
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #4 (loop_ctl_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             lo_open+0x18/0x50 [loop]
             __blkdev_get+0xec/0x570
             blkdev_get+0xe8/0x150
             do_dentry_open+0x167/0x410
             path_openat+0x7c9/0xa80
             do_filp_open+0x93/0x100
             do_sys_openat2+0x22a/0x2e0
             do_sys_open+0x4b/0x80
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             blkdev_put+0x1d/0x120
             close_fs_devices.part.31+0x84/0x130
             btrfs_close_devices+0x44/0xb0
             close_ctree+0x296/0x2b2
             generic_shutdown_super+0x69/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             btrfs_run_dev_stats+0x49/0x480
             commit_cowonly_roots+0xb5/0x2a0
             btrfs_commit_transaction+0x516/0xa60
             sync_filesystem+0x6b/0x90
             generic_shutdown_super+0x22/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             btrfs_commit_transaction+0x4bb/0xa60
             sync_filesystem+0x6b/0x90
             generic_shutdown_super+0x22/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
             __lock_acquire+0x1272/0x2310
             lock_acquire+0x9e/0x360
             __mutex_lock+0x9f/0x930
             btrfs_record_root_in_trans+0x43/0x70
             start_transaction+0xd1/0x5d0
             btrfs_dirty_inode+0x42/0xd0
             file_update_time+0xc8/0x110
             btrfs_page_mkwrite+0x10c/0x4a0
             do_page_mkwrite+0x4d/0xc0
             handle_mm_fault+0x103c/0x1730
             exc_page_fault+0x340/0x660
             asm_exc_page_fault+0x1e/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(sb_pagefaults);
                                     lock(&mm->mmap_lock#2);
                                     lock(sb_pagefaults);
        lock(&fs_info->reloc_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by fsstress/8739:
       #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660
       #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
       #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0
      
      stack backtrace:
      CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929
      Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
      Call Trace:
       dump_stack+0x78/0xa0
       check_noncircular+0x165/0x180
       __lock_acquire+0x1272/0x2310
       ? btrfs_get_alloc_profile+0x150/0x210
       lock_acquire+0x9e/0x360
       ? btrfs_record_root_in_trans+0x43/0x70
       __mutex_lock+0x9f/0x930
       ? btrfs_record_root_in_trans+0x43/0x70
       ? lock_acquire+0x9e/0x360
       ? join_transaction+0x5d/0x450
       ? find_held_lock+0x2d/0x90
       ? btrfs_record_root_in_trans+0x43/0x70
       ? join_transaction+0x3d5/0x450
       ? btrfs_record_root_in_trans+0x43/0x70
       btrfs_record_root_in_trans+0x43/0x70
       start_transaction+0xd1/0x5d0
       btrfs_dirty_inode+0x42/0xd0
       file_update_time+0xc8/0x110
       btrfs_page_mkwrite+0x10c/0x4a0
       ? handle_mm_fault+0x5e/0x1730
       do_page_mkwrite+0x4d/0xc0
       ? __do_fault+0x32/0x150
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       ? asm_exc_page_fault+0x8/0x30
       asm_exc_page_fault+0x1e/0x30
      RIP: 0033:0x7faa6c9969c4
      
      Was seen in testing.  The fix is similar to that of
      
        btrfs: open device without device_list_mutex
      
      where we're holding the device_list_mutex and then grab the bd_mutex,
      which pulls in a bunch of dependencies under the bd_mutex.  We only ever
      call btrfs_close_devices() on mount failure or unmount, so we're save to
      not have the device_list_mutex here.  We're already holding the
      uuid_mutex which keeps us safe from any external modification of the
      fs_devices.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      425c6ed6
    • Josef Bacik's avatar
      btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks · 62cf5391
      Josef Bacik authored
      When closing and freeing the source device we could end up doing our
      final blkdev_put() on the bdev, which will grab the bd_mutex.  As such
      we want to be holding as few locks as possible, so move this call
      outside of the dev_replace->lock_finishing_cancel_unmount lock.  Since
      we're modifying the fs_devices we need to make sure we're holding the
      uuid_mutex here, so take that as well.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      62cf5391
    • Nikolay Borisov's avatar
      btrfs: remove alloc_list splice in btrfs_prepare_sprout · 68abf360
      Nikolay Borisov authored
      btrfs_prepare_sprout is called when the first rw device is added to a
      seed filesystem. This means the filesystem can't have its alloc_list
      be non-empty, since seed filesystems are read only. Simply remove the
      code altogether.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      68abf360
    • Nikolay Borisov's avatar
      btrfs: document some invariants of seed code · 427c8fdd
      Nikolay Borisov authored
      Without good understanding of how seed devices works it's hard to grok
      some of what the code in open_seed_devices or btrfs_prepare_sprout does.
      
      Add comments hopefully reducing some of the cognitive load.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      427c8fdd
    • Nikolay Borisov's avatar
      btrfs: switch seed device to list api · 944d3f9f
      Nikolay Borisov authored
      While this patch touches a bunch of files the conversion is
      straighforward. Instead of using the implicit linked list anchored at
      btrfs_fs_devices::seed the code is switched to using
      list_for_each_entry.
      
      Previous patches in the series already factored out code that processed
      both main and seed devices so in those cases the factored out functions
      are called on the main fs_devices and then on every seed dev inside
      list_for_each_entry.
      
      Using list api also allows to simplify deletion from the seed dev list
      performed in btrfs_rm_device and btrfs_rm_dev_replace_free_srcdev by
      substituting a while() loop with a simple list_del_init.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      944d3f9f
    • Nikolay Borisov's avatar
      btrfs: simplify setting/clearing fs_info to btrfs_fs_devices · c4989c2f
      Nikolay Borisov authored
      It makes no sense to have sysfs-related routines be responsible for
      properly initialising the fs_info pointer of struct btrfs_fs_device.
      Instead this can be streamlined by making it the responsibility of
      btrfs_init_devices_late to initialize it. That function already
      initializes fs_info of every individual device in btrfs_fs_devices.
      
      As far as clearing it is concerned it makes sense to move it to
      close_fs_devices. That function is only called when struct
      btrfs_fs_devices is no longer in use - either for holding seeds or
      main devices for a mounted filesystem.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c4989c2f
    • Nikolay Borisov's avatar
      btrfs: make close_fs_devices return void · 54eed6ae
      Nikolay Borisov authored
      The return value of this function conveys absolutely no information.
      All callers already check the state of fs_devices->opened to decide how
      to proceed. So convert the function to returning void. While at it make
      btrfs_close_devices also return void.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      54eed6ae
    • Nikolay Borisov's avatar
      btrfs: factor out loop logic from btrfs_free_extra_devids · 3712ccb7
      Nikolay Borisov authored
      This prepares the code to switching seeds devices to a proper list.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3712ccb7
    • Nikolay Borisov's avatar
      btrfs: factor out reada loop in __reada_start_machine · dc0ab488
      Nikolay Borisov authored
      This is in preparation for moving fs_devices to proper lists.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dc0ab488
    • Nikolay Borisov's avatar
      btrfs: remove err variable from btrfs_get_extent · 1028d1c4
      Nikolay Borisov authored
      There's no practical reason too use 'err' as a variable to convey
      errors. In fact it's value is either set explicitly in the beginning of
      the function or it simply takes the value of 'ret'. Not conforming to
      the usual pattern of having ret be the only variable used to convey
      errors makes the code more error prone to bugs. In fact one such bug
      was introduced by 6bf9e4bd ("btrfs: inode: Verify inode mode toi
      avoid NULL pointer dereference") by assigning the error value to 'ret'
      and not 'err'.
      
      Let's fix that issue and make the function less tricky by leaving only
      ret to convey error values.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1028d1c4
    • Josef Bacik's avatar
      btrfs: dio iomap DSYNC workaround · 0eb79294
      Josef Bacik authored
      iomap dio will run generic_write_sync() for us if the iocb is DSYNC.
      This is problematic for us because of 2 reasons:
      
      1. we hold the inode_lock() during this operation, and we take it in
         generic_write_sync()
      2. we hold a read lock on the dio_sem but take the write lock in fsync
      
      Since we don't want to rip out this code right now, but reworking the
      locking is a bit much to do at this point, work around this problem with
      this masterpiece of a patch.
      
      First, we clear DSYNC on the iocb so that the iomap stuff doesn't know
      that it needs to handle the sync.  We save this fact in
      current->journal_info, because we need to see do special things once
      we're in iomap_begin, and we have no way to pass private information
      into iomap_dio_rw().
      
      Next we specify a separate iomap_dio_ops for sync, which implements an
      ->end_io() callback that gets called when the dio completes.  This is
      important for AIO, because we really do need to run generic_write_sync()
      if we complete asynchronously.  However if we're still in the submitting
      context when we enter ->end_io() we clear the flag so that the submitter
      knows they're the ones that needs to run generic_write_sync().
      
      This is meant to be temporary.  We need to work out how to eliminate the
      inode_lock() and the dio_sem in our fsync and use another mechanism to
      protect these operations.
      Tested-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0eb79294
    • Goldwyn Rodrigues's avatar
      btrfs: switch to iomap for direct IO · f85781fb
      Goldwyn Rodrigues authored
      We're using direct io implementation based on buffer heads. This patch
      switches to the new iomap infrastructure.
      
      Switch from __blockdev_direct_IO() to iomap_dio_rw().  Rename
      btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as
      iomap_begin() for iomap direct I/O functions. This function allocates
      and locks all the blocks required for the I/O.  btrfs_submit_direct() is
      used as the submit_io() hook for direct I/O ops.
      
      Since we need direct I/O reads to go through iomap_dio_rw(), we change
      file_operations.read_iter() to a btrfs_file_read_iter() which calls
      btrfs_direct_IO() for direct reads and falls back to
      generic_file_buffered_read() for incomplete reads and buffered reads.
      
      We don't need address_space.direct_IO() anymore: set it to noop.
      
      Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
      capable of direct I/O reads from a hole, so we don't need to return
      -ENOENT.
      
      Btrfs direct I/O is now done under i_rwsem, shared in case of reads and
      exclusive in case of writes. This guards against simultaneous truncates.
      
      Use iomap->iomap_end() to check for failed or incomplete direct I/O:
      
        - for writes, call __endio_write_update_ordered()
        - for reads, unlock extents
      
      btrfs_dio_data is now hooked in iomap->private and not
      current->journal_info. It carries the reservation variable and the
      amount of data submitted, so we can calculate the amount of data to call
      __endio_write_update_ordered in case of an error.
      
      This patch removes last use of struct buffer_head from btrfs.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f85781fb
    • Qu Wenruo's avatar
      btrfs: add owner and fs_info to alloc_state io_tree · 154f7cb8
      Qu Wenruo authored
      Commit 1c11b63e ("btrfs: replace pending/pinned chunks lists with io
      tree") introduced btrfs_device::alloc_state extent io tree, but it
      doesn't initialize the fs_info and owner member.
      
      This means the following features are not properly supported:
      
      - Fs owner report for insert_state() error
        Without fs_info initialized, although btrfs_err() won't panic, it
        won't output which fs is causing the error.
      
      - Wrong owner for trace events
        alloc_state will get the owner as pinned extents.
      
      Fix this by assiging proper fs_info and owner for
      btrfs_device::alloc_state.
      
      Fixes: 1c11b63e ("btrfs: replace pending/pinned chunks lists with io tree")
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      154f7cb8
    • Marcos Paulo de Souza's avatar
      btrfs: make read_block_group_item return void · 4c448ce8
      Marcos Paulo de Souza authored
      Since it's inclusion on 9afc6649 ("btrfs: block-group: refactor how
      we read one block group item") this function always returned 0, so there
      is no need to check for the returned value.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4c448ce8
    • Leon Romanovsky's avatar
      btrfs: sysfs: fix unused-but-set-variable warnings · 24646481
      Leon Romanovsky authored
      The compilation with W=1 generates the following warnings:
       fs/btrfs/sysfs.c:1630:6: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
        1630 |  int ret;
             |      ^~~
       fs/btrfs/sysfs.c:1629:6: warning: variable 'features' set but not used [-Wunused-but-set-variable]
        1629 |  u64 features;
             |      ^~~~~~~~
      
      [ The unused variables are leftover from e410e34f ("Revert "btrfs:
        synchronize incompat feature bits with sysfs files""), which needs
        to be properly fixed by moving feature bit manipulation from the sysfs
        context.  Silence the warning to save pepople time, we got several
        reports. ]
      Signed-off-by: default avatarLeon Romanovsky <leonro@nvidia.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      24646481
    • Filipe Manana's avatar
      btrfs: make fast fsyncs wait only for writeback · 48778179
      Filipe Manana authored
      Currently regardless of a full or a fast fsync we always wait for ordered
      extents to complete, and then start logging the inode after that. However
      for fast fsyncs we can just wait for the writeback to complete, we don't
      need to wait for the ordered extents to complete since we use the list of
      modified extents maps to figure out which extents we must log and we can
      get their checksums directly from the ordered extents that are still in
      flight, otherwise look them up from the checksums tree.
      
      Until commit b5e6c3e1 ("btrfs: always wait on ordered extents at
      fsync time"), for fast fsyncs, we used to start logging without even
      waiting for the writeback to complete first, we would wait for it to
      complete after logging, while holding a transaction open, which lead to
      performance issues when using cgroups and probably for other cases too,
      as wait for IO while holding a transaction handle should be avoided as
      much as possible. After that, for fast fsyncs, we started to wait for
      ordered extents to complete before starting to log, which adds some
      latency to fsyncs and we even got at least one report about a performance
      drop which bisected to that particular change:
      
      https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/
      
      This change makes fast fsyncs only wait for writeback to finish before
      starting to log the inode, instead of waiting for both the writeback to
      finish and for the ordered extents to complete. This brings back part of
      the logic we had that extracts checksums from in flight ordered extents,
      which are not yet in the checksums tree, and making sure transaction
      commits wait for the completion of ordered extents previously logged
      (by far most of the time they have already completed by the time a
      transaction commit starts, resulting in no wait at all), to avoid any
      data loss if an ordered extent completes after the transaction used to
      log an inode is committed, followed by a power failure.
      
      When there are no other tasks accessing the checksums and the subvolume
      btrees, the ordered extent completion is pretty fast, typically taking
      100 to 200 microseconds only in my observations. However when there are
      other tasks accessing these btrees, ordered extent completion can take a
      lot more time due to lock contention on nodes and leaves of these btrees.
      I've seen cases over 2 milliseconds, which starts to be significant. In
      particular when we do have concurrent fsyncs against different files there
      is a lot of contention on the checksums btree, since we have many tasks
      writing the checksums into the btree and other tasks that already started
      the logging phase are doing lookups for checksums in the btree.
      
      This change also turns all ranged fsyncs into full ranged fsyncs, which
      is something we already did when not using the NO_HOLES features or when
      doing a full fsync. This is to guarantee we never miss checksums due to
      writeback having been triggered only for a part of an extent, and we end
      up logging the full extent but only checksums for the written range, which
      results in missing checksums after log replay. Allowing ranged fsyncs to
      operate again only in the original range, when using the NO_HOLES feature
      and doing a fast fsync is doable but requires some non trivial changes to
      the writeback path, which can always be worked on later if needed, but I
      don't think they are a very common use case.
      
      Several tests were performed using fio for different numbers of concurrent
      jobs, each writing and fsyncing its own file, for both sequential and
      random file writes. The tests were run on bare metal, no virtualization,
      on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device,
      with a kernel configuration that is the default of typical distributions
      (debian in this case), without debug options enabled (kasan, kmemleak,
      slub debug, debug of page allocations, lock debugging, etc).
      
      The following script that calls fio was used:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 5 ]; then
          echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
          exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
        BLOCK_SIZE=$4
        WRITE_MODE=$5
      
        if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
          echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
          exit 1
        fi
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=$WRITE_MODE
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=$BLOCK_SIZE
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        echo
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The results were the following:
      
      *************************
      *** sequential writes ***
      *************************
      
      ==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec
      
      After patch:
      
      WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
      (+9.8%, -8.8% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec
      
      After patch:
      
      WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
      (+21.5% throughput, -17.8% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec
      
      After patch:
      
      WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
      (+28.7% throughput, -22.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec
      
      After patch:
      
      WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
      (+35.6% throughput, -25.2% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec
      
      After patch:
      
      WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
      (+34.1% throughput, -25.6% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec
      
      After patch:
      
      WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
      (+19.1% throughput, -16.4% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec
      
      After patch:
      
      WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
      (+23.1% throughput, -18.7% runtime)
      
      ************************
      ***   random writes  ***
      ************************
      
      ==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec
      
      After patch:
      
      WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
      (+0.9% throughput, -1.7% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec
      
      After patch:
      
      WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
      (+2.3% throughput, -2.0% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec
      
      After patch:
      
      WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
      (+15.6% throughput, -13.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec
      
      After patch:
      
      WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
      (+11.6% throughput, -10.7% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec
      
      After patch:
      
      WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
      (+3.8% throughput, -3.8% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec
      
      After patch:
      
      WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
      (+12.7% throughput, -11.2% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec
      
      After patch:
      
      WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
      (+6.3% throughput, -6.0% runtime)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      48778179
    • Filipe Manana's avatar
      btrfs: do not commit logs and transactions during link and rename operations · 75b463d2
      Filipe Manana authored
      Since commit d4682ba0 ("Btrfs: sync log after logging new name") we
      started to commit logs, and fallback to transaction commits when we failed
      to log the new names or commit the logs, after link and rename operations
      when the target inodes (or their parents) were previously logged in the
      current transaction. This was to avoid losing directories despite an
      explicit fsync on them when they are ancestors of some inode that got a
      new named logged, due to a link or rename operation. However that adds the
      cost of starting IO and waiting for it to complete, which can cause higher
      latencies for applications.
      
      Instead of doing that, just make sure that when we log a new name for an
      inode we don't mark any of its ancestors as logged, so that if any one
      does an fsync against any of them, without doing any other change on them,
      the fsync commits the log. This way we only pay the cost of a log commit
      (or a transaction commit if something goes wrong or a new block group was
      created) if the application explicitly asks to fsync any of the parent
      directories.
      
      Using dbench, which mixes several filesystems operations including renames,
      revealed some significant latency gains. The following script that uses
      dbench was used to test this:
      
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-m single -d single"
        THREADS=16
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -t 300 -D $MNT $THREADS
      
        umount $MNT
      
      The test was run on bare metal, no virtualization, on a box with 12 cores
      (Intel i7-8700), 64Gb of RAM and using a NVMe device, with a kernel
      configuration that is the default of typical distributions (debian in this
      case), without debug options enabled (kasan, kmemleak, slub debug, debug
      of page allocations, lock debugging, etc).
      
      Results before this patch:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    10750455     0.011   155.088
       Close         7896674     0.001     0.243
       Rename         455222     2.158  1101.947
       Unlink        2171189     0.067   121.638
       Deltree           256     2.425     7.816
       Mkdir             128     0.002     0.003
       Qpathinfo     9744323     0.006    21.370
       Qfileinfo     1707092     0.001     0.146
       Qfsinfo       1786756     0.001    11.228
       Sfileinfo      875612     0.003    21.263
       Find          3767281     0.025     9.617
       WriteX        5356924     0.011   211.390
       ReadX        16852694     0.003     9.442
       LockX           35008     0.002     0.119
       UnlockX         35008     0.001     0.138
       Flush          753458     4.252  1102.249
      
      Throughput 1128.35 MB/sec  16 clients  16 procs  max_latency=1102.255 ms
      
      Results after this patch:
      
      16 clients, after
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    11471098     0.012   448.281
       Close         8426396     0.001     0.925
       Rename         485746     0.123   267.183
       Unlink        2316477     0.080    63.433
       Deltree           288     2.830    11.144
       Mkdir             144     0.003     0.010
       Qpathinfo    10397420     0.006    10.288
       Qfileinfo     1822039     0.001     0.169
       Qfsinfo       1906497     0.002    14.039
       Sfileinfo      934433     0.004     2.438
       Find          4019879     0.026    10.200
       WriteX        5718932     0.011   200.985
       ReadX        17981671     0.003    10.036
       LockX           37352     0.002     0.076
       UnlockX         37352     0.001     0.109
       Flush          804018     5.015   778.033
      
      Throughput 1201.98 MB/sec  16 clients  16 procs  max_latency=778.036 ms
      (+6.5% throughput, -29.4% max latency, -75.8% rename latency)
      
      Test case generic/498 from fstests tests the scenario that the previously
      mentioned commit fixed.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      75b463d2
    • Filipe Manana's avatar
      btrfs: do not take the log_mutex of the subvolume when pinning the log · 5522a27e
      Filipe Manana authored
      During a rename we pin the log to make sure no one commits a log that
      reflects an ongoing rename operation, as it might result in a committed
      log where it recorded the unlink of the old name without having recorded
      the new name. However we are taking the subvolume's log_mutex before
      incrementing the log_writers counter, which is not necessary since that
      counter is atomic and we only remove the old name from the log and add
      the new name to the log after we have incremented log_writers, ensuring
      that no one can commit the log after we have removed the old name from
      the log and before we added the new name to the log.
      
      By taking the log_mutex lock we are just adding unnecessary contention on
      the lock, which can become visible for workloads that mix renames with
      fsyncs, writes for files opened with O_SYNC and unlink operations (if the
      inode or its parent were fsynced before in the current transaction).
      
      So just remove the lock and unlock of the subvolume's log_mutex at
      btrfs_pin_log_trans().
      
      Using dbench, which mixes different types of operations that end up taking
      that mutex (fsyncs, renames, unlinks and writes into files opened with
      O_SYNC) revealed some small gains. The following script that calls dbench
      was used:
      
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-m single -d single"
        THREADS=32
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -s -t 600 -D $MNT $THREADS
      
        umount $MNT
      
      The test was run on bare metal, no virtualization, on a box with 12 cores
      (Intel i7-8700), 64Gb of RAM and using a NVMe device, with a kernel
      configuration that is the default of typical distributions (debian in this
      case), without debug options enabled (kasan, kmemleak, slub debug, debug
      of page allocations, lock debugging, etc).
      
      Results before this patch:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4410848     0.017   738.640
       Close        3240222     0.001     0.834
       Rename        186850     7.478  1272.476
       Unlink        890875     0.128   785.018
       Deltree          128     2.846    12.081
       Mkdir             64     0.002     0.003
       Qpathinfo    3997659     0.009    11.171
       Qfileinfo     701307     0.001     0.478
       Qfsinfo       733494     0.002     1.103
       Sfileinfo     359362     0.004     3.266
       Find         1546226     0.041     4.128
       WriteX       2202803     7.905  1376.989
       ReadX        6917775     0.003     3.887
       LockX          14392     0.002     0.043
       UnlockX        14392     0.001     0.085
       Flush         309225     0.128  1033.936
      
      Throughput 231.555 MB/sec (sync open)  32 clients  32 procs  max_latency=1376.993 ms
      
      Results after this patch:
      
      Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4603244     0.017   232.776
       Close        3381299     0.001     1.041
       Rename        194871     7.251  1073.165
       Unlink        929730     0.133   119.233
       Deltree          128     2.871    10.199
       Mkdir             64     0.002     0.004
       Qpathinfo    4171343     0.009    11.317
       Qfileinfo     731227     0.001     1.635
       Qfsinfo       765079     0.002     3.568
       Sfileinfo     374881     0.004     1.220
       Find         1612964     0.041     4.675
       WriteX       2296720     7.569  1178.204
       ReadX        7213633     0.003     3.075
       LockX          14976     0.002     0.076
       UnlockX        14976     0.001     0.061
       Flush         322635     0.102   579.505
      
      Throughput 241.4 MB/sec (sync open)  32 clients  32 procs  max_latency=1178.207 ms
      (+4.3% throughput, -14.4% max latency)
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5522a27e
    • David Sterba's avatar
      btrfs: send: remove indirect callback parameter for changed_cb · 1b51d6fc
      David Sterba authored
      There's a custom callback passed to btrfs_compare_trees which happens to
      be named exactly same as the existing function implementing it. This is
      confusing and the indirection is not necessary for our needs. Compiler
      is clever enough to call it directly so there's effectively no change.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1b51d6fc
    • David Sterba's avatar
      btrfs: scrub: rename ratelimit state varaible to avoid shadowing · 8bb1cf1b
      David Sterba authored
      There's already defined _rs within ctree.h:btrfs_printk_ratelimited,
      local variables should not use _ to avoid such name clashes with
      macro-local variables.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8bb1cf1b
    • David Sterba's avatar
      btrfs: remove unnecessarily shadowed variables · 0af447d0
      David Sterba authored
      In btrfs_orphan_cleanup, there's another instance of fs_info, but it's
      the same as the one we already have.
      
      In btrfs_backref_finish_upper_links, rb_node is same type and used
      as temporary cursor to the tree.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0af447d0
    • David Sterba's avatar
      btrfs: compression: move declarations to header · cb4c9198
      David Sterba authored
      The declarations of compression algorithm callbacks are defined in the
      .c file as they're used from there. Compiler warns that there are no
      declarations for public functions when compiling lzo.c/zlib.c/zstd.c.
      Fix that by moving the declarations to the header as it's the common
      place for all of them.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cb4c9198
    • David Sterba's avatar
      btrfs: remove const from btrfs_feature_set_name · 9e6df7ce
      David Sterba authored
      The function btrfs_feature_set_name returns a const char pointer, the
      second const is not necessary and reported as a warning:
      
       In file included from fs/btrfs/space-info.c:6:
       fs/btrfs/sysfs.h:16:1: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
          16 | const char * const btrfs_feature_set_name(enum btrfs_feature_set set);
             | ^~~~~
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9e6df7ce
    • Qu Wenruo's avatar
      btrfs: cleanup calculation of lockend in lock_and_cleanup_extent_if_need() · e21139c6
      Qu Wenruo authored
      We're just doing rounding up to sectorsize to calculate the lockend.
      There is no need to do the unnecessary length calculation, just direct
      round_up() is enough.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e21139c6
    • Josef Bacik's avatar
      btrfs: fix possible infinite loop in data async reclaim · c4923027
      Josef Bacik authored
      Dave reported an issue where generic/102 would sometimes hang.  This
      turned out to be because we'd get into this spot where we were no longer
      making progress on data reservations because our exit condition was not
      met.  The log is basically
      
      while (!space_info->full && !list_empty(&space_info->tickets))
      	flush_space(space_info, flush_state);
      
      where flush state is our various flush states, but doesn't include
      ALLOC_CHUNK_FORCE.  This is because we actually lead with allocating
      chunks, and so the assumption was that once you got to the actual
      flushing states you could no longer allocate chunks.  This was a stupid
      assumption, because you could have deleted block groups that would be
      reclaimed by a transaction commit, thus unsetting space_info->full.
      This is essentially what happens with generic/102, and so sometimes
      you'd get stuck in the flushing loop because we weren't allocating
      chunks, but flushing space wasn't giving us what we needed to make
      progress.
      
      Fix this by adding ALLOC_CHUNK_FORCE to the end of our flushing states,
      that way we will eventually bail out because we did end up with
      space_info->full if we free'd a chunk previously.  Otherwise, as is the
      case for this test, we'll allocate our chunk and continue on our happy
      merry way.
      Reported-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c4923027
    • Josef Bacik's avatar
      btrfs: add a comment explaining the data flush steps · 1a7a92c8
      Josef Bacik authored
      The data flushing steps are not obvious to people other than myself and
      Chris.  Write a giant comment explaining the reasoning behind each flush
      step for data as well as why it is in that particular order.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1a7a92c8
    • Josef Bacik's avatar
      btrfs: do async reclaim for data reservations · 57056740
      Josef Bacik authored
      Now that we have the data ticketing stuff in place, move normal data
      reservations to use an async reclaim helper to satisfy tickets.  Before
      we could have multiple tasks race in and both allocate chunks, resulting
      in more data chunks than we would necessarily need.  Serializing these
      allocations and making a single thread responsible for flushing will
      only allocate chunks as needed, as well as cut down on transaction
      commits and other flush related activities.
      
      Priority reservations will still work as they have before, simply
      trying to allocate a chunk until they can make their reservation.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      57056740
    • Josef Bacik's avatar
      btrfs: flush delayed refs when trying to reserve data space · cb3e3930
      Josef Bacik authored
      We can end up with freed extents in the delayed refs, and thus
      may_commit_transaction() may not think we have enough pinned space to
      commit the transaction and we'll ENOSPC early.  Handle this by running
      the delayed refs in order to make sure pinned is uptodate before we try
      to commit the transaction.
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cb3e3930
    • Josef Bacik's avatar
      btrfs: run delayed iputs before committing the transaction for data · 327feeeb
      Josef Bacik authored
      Before we were waiting on iputs after we committed the transaction, but
      this doesn't really make much sense.  We want to reclaim any space we
      may have in order to be more likely to commit the transaction, due to
      pinned space being added by running the delayed iputs.  Fix this by
      making delayed iputs run before committing the transaction.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      327feeeb
    • Josef Bacik's avatar
      btrfs: don't force commit if we are data · bb86bd3d
      Josef Bacik authored
      We used to unconditionally commit the transaction at least 2 times and
      then on the 3rd try check against pinned space to make sure committing
      the transaction was worth the effort.  This is overkill, we know nobody
      is going to steal our reservation, and if we can't make our reservation
      with the pinned amount simply bail out.
      
      This also cleans up the passing of bytes_needed to
      may_commit_transaction, as that was the thing we added into place in
      order to accomplish this behavior.  We no longer need it so remove that
      mess.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bb86bd3d
    • Josef Bacik's avatar
      btrfs: drop the commit_cycles stuff for data reservations · 02827001
      Josef Bacik authored
      This was an old wart left over from how we previously did data
      reservations.  Before we could have people race in and take a
      reservation while we were flushing space, so we needed to make sure we
      looped a few times before giving up.  Now that we're using the ticketing
      infrastructure we don't have to worry about this and can drop the logic
      altogether.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      02827001
    • Josef Bacik's avatar
      btrfs: use the same helper for data and metadata reservations · f3bda421
      Josef Bacik authored
      Now that data reservations follow the same pattern as metadata
      reservations we can simply rename __reserve_metadata_bytes to
      __reserve_bytes and use that helper for data reservations.
      
      Things to keep in mind, btrfs_can_overcommit() returns 0 for data,
      because we can never overcommit.  We also will never pass in FLUSH_ALL
      for data, so we'll simply be added to the priority list and go straight
      into handle_reserve_ticket.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f3bda421
    • Josef Bacik's avatar
      btrfs: serialize data reservations if we are flushing · 0532a6f8
      Josef Bacik authored
      Nikolay reported a problem where generic/371 would fail sometimes with a
      slow drive.  The gist of the test is that we fallocate a file in
      parallel with a pwrite of a different file.  These two files combined
      are smaller than the file system, but sometimes the pwrite would ENOSPC.
      
      A fair bit of investigation uncovered the fact that the fallocate
      workload was racing in and grabbing the free space that the pwrite
      workload was trying to free up so it could make its own reservation.
      After a few loops of this eventually the pwrite workload would error out
      with an ENOSPC.
      
      We've had the same problem with metadata as well, and we serialized all
      metadata allocations to satisfy this problem.  This wasn't usually a
      problem with data because data reservations are more straightforward,
      but obviously could still happen.
      
      Fix this by not allowing reservations to occur if there are any pending
      tickets waiting to be satisfied on the space info.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0532a6f8
    • Josef Bacik's avatar
      btrfs: use ticketing for data space reservations · 1004f686
      Josef Bacik authored
      Now that we have all the infrastructure in place, use the ticketing
      infrastructure to make data allocations.  This still maintains the exact
      same flushing behavior, but now we're using tickets to get our
      reservations satisfied.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1004f686