1. 21 Jul, 2020 2 commits
    • Qu Wenruo's avatar
      btrfs: qgroup: fix data leak caused by race between writeback and truncate · fa91e4aa
      Qu Wenruo authored
      [BUG]
      When running tests like generic/013 on test device with btrfs quota
      enabled, it can normally lead to data leak, detected at unmount time:
      
        BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096
        ------------[ cut here ]------------
        WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
        RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
        Call Trace:
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         __prepare_exit_to_usermode+0x1bc/0x1c0
         __syscall_return_slowpath+0x47/0x230
         do_syscall_64+0x64/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace caf08beafeca2392 ]---
        BTRFS error (device dm-3): qgroup reserved space leaked
      
      [CAUSE]
      In the offending case, the offending operations are:
      2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0
      2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0
      
      The following sequence of events could happen after the writev():
      	CPU1 (writeback)		|		CPU2 (truncate)
      -----------------------------------------------------------------
      btrfs_writepages()			|
      |- extent_write_cache_pages()		|
         |- Got page for 1003520		|
         |  1003520 is Dirty, no writeback	|
         |  So (!clear_page_dirty_for_io())   |
         |  gets called for it		|
         |- Now page 1003520 is Clean.	|
         |					| btrfs_setattr()
         |					| |- btrfs_setsize()
         |					|    |- truncate_setsize()
         |					|       New i_size is 18388
         |- __extent_writepage()		|
         |  |- page_offset() > i_size		|
            |- btrfs_invalidatepage()		|
      	 |- Page is clean, so no qgroup |
      	    callback executed
      
      This means, the qgroup reserved data space is not properly released in
      btrfs_invalidatepage() as the page is Clean.
      
      [FIX]
      Instead of checking the dirty bit of a page, call
      btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage().
      
      As qgroup rsv are completely bound to the QGROUP_RESERVED bit of
      io_tree, not bound to page status, thus we won't cause double freeing
      anyway.
      
      Fixes: 0b34c261 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa91e4aa
    • Filipe Manana's avatar
      btrfs: fix double free on ulist after backref resolution failure · 580c079b
      Filipe Manana authored
      At btrfs_find_all_roots_safe() we allocate a ulist and set the **roots
      argument to point to it. However if later we fail due to an error returned
      by find_parent_nodes(), we free that ulist but leave a dangling pointer in
      the **roots argument. Upon receiving the error, a caller of this function
      can attempt to free the same ulist again, resulting in an invalid memory
      access.
      
      One such scenario is during qgroup accounting:
      
      btrfs_qgroup_account_extents()
      
       --> calls btrfs_find_all_roots() passes &new_roots (a stack allocated
           pointer) to btrfs_find_all_roots()
      
         --> btrfs_find_all_roots() just calls btrfs_find_all_roots_safe()
             passing &new_roots to it
      
           --> allocates ulist and assigns its address to **roots (which
               points to new_roots from btrfs_qgroup_account_extents())
      
           --> find_parent_nodes() returns an error, so we free the ulist
               and leave **roots pointing to it after returning
      
       --> btrfs_qgroup_account_extents() sees btrfs_find_all_roots() returned
           an error and jumps to the label 'cleanup', which just tries to
           free again the same ulist
      
      Stack trace example:
      
       ------------[ cut here ]------------
       BTRFS: tree first key check failed
       WARNING: CPU: 1 PID: 1763215 at fs/btrfs/disk-io.c:422 btrfs_verify_level_key+0xe0/0x180 [btrfs]
       Modules linked in: dm_snapshot dm_thin_pool (...)
       CPU: 1 PID: 1763215 Comm: fsstress Tainted: G        W         5.8.0-rc3-btrfs-next-64 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_verify_level_key+0xe0/0x180 [btrfs]
       Code: 28 5b 5d (...)
       RSP: 0018:ffffb89b473779a0 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: ffff90397759bf08 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
       RBP: ffff9039a419c000 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: ffffb89b43301000 R12: 000000000000005e
       R13: ffffb89b47377a2e R14: ffffb89b473779af R15: 0000000000000000
       FS:  00007fc47e1e1000(0000) GS:ffff9039ac200000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fc47e1df000 CR3: 00000003d9e4e001 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        read_block_for_search+0xf6/0x350 [btrfs]
        btrfs_next_old_leaf+0x242/0x650 [btrfs]
        resolve_indirect_refs+0x7cf/0x9e0 [btrfs]
        find_parent_nodes+0x4ea/0x12c0 [btrfs]
        btrfs_find_all_roots_safe+0xbf/0x130 [btrfs]
        btrfs_qgroup_account_extents+0x9d/0x390 [btrfs]
        btrfs_commit_transaction+0x4f7/0xb20 [btrfs]
        btrfs_sync_file+0x3d4/0x4d0 [btrfs]
        do_fsync+0x38/0x70
        __x64_sys_fdatasync+0x13/0x20
        do_syscall_64+0x5c/0xe0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7fc47e2d72e3
       Code: Bad RIP value.
       RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
       RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3
       RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003
       RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003
       R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8
       R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffffb8eb5e85>] copy_process+0x755/0x1eb0
       softirqs last  enabled at (0): [<ffffffffb8eb5e85>] copy_process+0x755/0x1eb0
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace 8639237550317b48 ]---
       BTRFS error (device sdc): tree first key mismatch detected, bytenr=62324736 parent_transid=94 key expected=(262,108,1351680) has=(259,108,1921024)
       general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
       CPU: 2 PID: 1763215 Comm: fsstress Tainted: G        W         5.8.0-rc3-btrfs-next-64 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:ulist_release+0x14/0x60 [btrfs]
       Code: c7 07 00 (...)
       RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282
       RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840
       RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840
       R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840
       FS:  00007fc47e1e1000(0000) GS:ffff9039ac600000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f8c1c0a51c8 CR3: 00000003d9e4e004 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        ulist_free+0x13/0x20 [btrfs]
        btrfs_qgroup_account_extents+0xf3/0x390 [btrfs]
        btrfs_commit_transaction+0x4f7/0xb20 [btrfs]
        btrfs_sync_file+0x3d4/0x4d0 [btrfs]
        do_fsync+0x38/0x70
        __x64_sys_fdatasync+0x13/0x20
        do_syscall_64+0x5c/0xe0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7fc47e2d72e3
       Code: Bad RIP value.
       RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
       RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3
       RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003
       RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003
       R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8
       R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50
       Modules linked in: dm_snapshot dm_thin_pool (...)
       ---[ end trace 8639237550317b49 ]---
       RIP: 0010:ulist_release+0x14/0x60 [btrfs]
       Code: c7 07 00 (...)
       RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282
       RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840
       RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840
       R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840
       FS:  00007fc47e1e1000(0000) GS:ffff9039ad200000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f6a776f7d40 CR3: 00000003d9e4e002 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      Fix this by making btrfs_find_all_roots_safe() set *roots to NULL after
      it frees the ulist.
      
      Fixes: 8da6d581 ("Btrfs: added btrfs_find_all_roots()")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      580c079b
  2. 09 Jul, 2020 2 commits
  3. 07 Jul, 2020 1 commit
    • Qu Wenruo's avatar
      btrfs: discard: add missing put when grabbing block group from unused list · 04e484c5
      Qu Wenruo authored
      [BUG]
      The following small test script can trigger ASSERT() at unmount time:
      
        mkfs.btrfs -f $dev
        mount $dev $mnt
        mount -o remount,discard=async $mnt
        umount $mnt
      
      The call trace:
        assertion failed: atomic_read(&block_group->count) == 1, in fs/btrfs/block-group.c:3431
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3204!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 4 PID: 10389 Comm: umount Tainted: G           O      5.8.0-rc3-custom+ #68
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         btrfs_free_block_groups.cold+0x22/0x55 [btrfs]
         close_ctree+0x2cb/0x323 [btrfs]
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         __prepare_exit_to_usermode+0x1bc/0x1c0
         __syscall_return_slowpath+0x47/0x230
         do_syscall_64+0x64/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The code:
                      ASSERT(atomic_read(&block_group->count) == 1);
                      btrfs_put_block_group(block_group);
      
      [CAUSE]
      Obviously it's some btrfs_get_block_group() call doesn't get its put
      call.
      
      The offending btrfs_get_block_group() happens here:
      
        void btrfs_mark_bg_unused(struct btrfs_block_group *bg)
        {
        	if (list_empty(&bg->bg_list)) {
        		btrfs_get_block_group(bg);
      		list_add_tail(&bg->bg_list, &fs_info->unused_bgs);
        	}
        }
      
      So every call sites removing the block group from unused_bgs list should
      reduce the ref count of that block group.
      
      However for async discard, it didn't follow the call convention:
      
        void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info)
        {
        	list_for_each_entry_safe(block_group, next, &fs_info->unused_bgs,
        				 bg_list) {
        		list_del_init(&block_group->bg_list);
        		btrfs_discard_queue_work(&fs_info->discard_ctl, block_group);
        	}
        }
      
      And in btrfs_discard_queue_work(), it doesn't call
      btrfs_put_block_group() either.
      
      [FIX]
      Fix the problem by reducing the reference count when we grab the block
      group from unused_bgs list.
      Reported-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Fixes: 6e80d4f8 ("btrfs: handle empty block_group removal for async discard")
      CC: stable@vger.kernel.org # 5.6+
      Tested-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      04e484c5
  4. 02 Jul, 2020 4 commits
    • Josef Bacik's avatar
      btrfs: reset tree root pointer after error in init_tree_roots · 0465337c
      Josef Bacik authored
      Eric reported an issue where mounting -o recovery with a fuzzed fs
      resulted in a kernel panic.  This is because we tried to free the tree
      node, except it was an error from the read.  Fix this by properly
      resetting the tree_root->node == NULL in this case.  The panic was the
      following
      
        BTRFS warning (device loop0): failed to read tree root
        BUG: kernel NULL pointer dereference, address: 000000000000001f
        RIP: 0010:free_extent_buffer+0xe/0x90 [btrfs]
        Call Trace:
         free_root_extent_buffers.part.0+0x11/0x30 [btrfs]
         free_root_pointers+0x1a/0xa2 [btrfs]
         open_ctree+0x1776/0x18a5 [btrfs]
         btrfs_mount_root.cold+0x13/0xfa [btrfs]
         ? selinux_fs_context_parse_param+0x37/0x80
         legacy_get_tree+0x27/0x40
         vfs_get_tree+0x25/0xb0
         fc_mount+0xe/0x30
         vfs_kern_mount.part.0+0x71/0x90
         btrfs_mount+0x147/0x3e0 [btrfs]
         ? cred_has_capability+0x7c/0x120
         ? legacy_get_tree+0x27/0x40
         legacy_get_tree+0x27/0x40
         vfs_get_tree+0x25/0xb0
         do_mount+0x735/0xa40
         __x64_sys_mount+0x8e/0xd0
         do_syscall_64+0x4d/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Nik says: this is problematic only if we fail on the last iteration of
      the loop as this results in init_tree_roots returning err value with
      tree_root->node = -ERR. Subsequently the caller does: fail_tree_roots
      which calls free_root_pointers on the bogus value.
      Reported-by: default avatarEric Sandeen <sandeen@redhat.com>
      Fixes: b8522a1e ("btrfs: Factor out tree roots initialization during mount")
      CC: stable@vger.kernel.org # 5.5+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add details how the pointer gets dereferenced ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0465337c
    • Filipe Manana's avatar
      btrfs: fix reclaim_size counter leak after stealing from global reserve · 6d548b9e
      Filipe Manana authored
      Commit 7f9fe614 ("btrfs: improve global reserve stealing logic"),
      added in the 5.8 merge window, introduced another leak for the space_info's
      reclaim_size counter. This is very often triggered by the test cases
      generic/269 and generic/416 from fstests, producing a stack trace like the
      following during unmount:
      
      [37079.155499] ------------[ cut here ]------------
      [37079.156844] WARNING: CPU: 2 PID: 2000423 at fs/btrfs/block-group.c:3422 btrfs_free_block_groups+0x2eb/0x300 [btrfs]
      [37079.158090] Modules linked in: dm_snapshot btrfs dm_thin_pool (...)
      [37079.164440] CPU: 2 PID: 2000423 Comm: umount Tainted: G        W         5.7.0-rc7-btrfs-next-62 #1
      [37079.165422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), (...)
      [37079.167384] RIP: 0010:btrfs_free_block_groups+0x2eb/0x300 [btrfs]
      [37079.168375] Code: bd 58 ff ff ff 00 4c 8d (...)
      [37079.170199] RSP: 0018:ffffaa53875c7de0 EFLAGS: 00010206
      [37079.171120] RAX: ffff98099e701cf8 RBX: ffff98099e2d4000 RCX: 0000000000000000
      [37079.172057] RDX: 0000000000000001 RSI: ffffffffc0acc5b1 RDI: 00000000ffffffff
      [37079.173002] RBP: ffff98099e701cf8 R08: 0000000000000000 R09: 0000000000000000
      [37079.173886] R10: 0000000000000000 R11: 0000000000000000 R12: ffff98099e701c00
      [37079.174730] R13: ffff98099e2d5100 R14: dead000000000122 R15: dead000000000100
      [37079.175578] FS:  00007f4d7d0a5840(0000) GS:ffff9809ec600000(0000) knlGS:0000000000000000
      [37079.176434] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [37079.177289] CR2: 0000559224dcc000 CR3: 000000012207a004 CR4: 00000000003606e0
      [37079.178152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [37079.178935] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [37079.179675] Call Trace:
      [37079.180419]  close_ctree+0x291/0x2d1 [btrfs]
      [37079.181162]  generic_shutdown_super+0x6c/0x100
      [37079.181898]  kill_anon_super+0x14/0x30
      [37079.182641]  btrfs_kill_super+0x12/0x20 [btrfs]
      [37079.183371]  deactivate_locked_super+0x31/0x70
      [37079.184012]  cleanup_mnt+0x100/0x160
      [37079.184650]  task_work_run+0x68/0xb0
      [37079.185284]  exit_to_usermode_loop+0xf9/0x100
      [37079.185920]  do_syscall_64+0x20d/0x260
      [37079.186556]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
      [37079.187197] RIP: 0033:0x7f4d7d2d9357
      [37079.187836] Code: eb 0b 00 f7 d8 64 89 01 48 (...)
      [37079.189180] RSP: 002b:00007ffee4e0d368 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [37079.189845] RAX: 0000000000000000 RBX: 00007f4d7d3fb224 RCX: 00007f4d7d2d9357
      [37079.190515] RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000559224dc5c90
      [37079.191173] RBP: 0000559224dc1970 R08: 0000000000000000 R09: 00007ffee4e0c0e0
      [37079.191815] R10: 0000559224dc7b00 R11: 0000000000000246 R12: 0000000000000000
      [37079.192451] R13: 0000559224dc5c90 R14: 0000559224dc1a80 R15: 0000559224dc1ba0
      [37079.193096] irq event stamp: 0
      [37079.193729] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [37079.194379] hardirqs last disabled at (0): [<ffffffff97ab8935>] copy_process+0x755/0x1ea0
      [37079.195033] softirqs last  enabled at (0): [<ffffffff97ab8935>] copy_process+0x755/0x1ea0
      [37079.195700] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [37079.196318] ---[ end trace b32710d864dea887 ]---
      
      In the past commit d611add4 ("btrfs: fix reclaim counter leak of
      space_info objects") fixed similar cases. That commit however has a date
      more recent (April 7 2020) then the commit mentioned before (March 13
      2020), however it was merged in kernel 5.7 while the older commit, which
      introduces a new leak, was merged only in the 5.8 merge window. So the
      leak sneaked in unnoticed.
      
      Fix this by making steal_from_global_rsv() remove the ticket using the
      helper remove_ticket(), which decrements the reclaim_size counter of the
      space_info object.
      
      Fixes: 7f9fe614 ("btrfs: improve global reserve stealing logic")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6d548b9e
    • Boris Burkov's avatar
      btrfs: fix fatal extent_buffer readahead vs releasepage race · 6bf9cd2e
      Boris Burkov authored
      Under somewhat convoluted conditions, it is possible to attempt to
      release an extent_buffer that is under io, which triggers a BUG_ON in
      btrfs_release_extent_buffer_pages.
      
      This relies on a few different factors. First, extent_buffer reads done
      as readahead for searching use WAIT_NONE, so they free the local extent
      buffer reference while the io is outstanding. However, they should still
      be protected by TREE_REF. However, if the system is doing signficant
      reclaim, and simultaneously heavily accessing the extent_buffers, it is
      possible for releasepage to race with two concurrent readahead attempts
      in a way that leaves TREE_REF unset when the readahead extent buffer is
      released.
      
      Essentially, if two tasks race to allocate a new extent_buffer, but the
      winner who attempts the first io is rebuffed by a page being locked
      (likely by the reclaim itself) then the loser will still go ahead with
      issuing the readahead. The loser's call to find_extent_buffer must also
      race with the reclaim task reading the extent_buffer's refcount as 1 in
      a way that allows the reclaim to re-clear the TREE_REF checked by
      find_extent_buffer.
      
      The following represents an example execution demonstrating the race:
      
                  CPU0                                                         CPU1                                           CPU2
      reada_for_search                                            reada_for_search
        readahead_tree_block                                        readahead_tree_block
          find_create_tree_block                                      find_create_tree_block
            alloc_extent_buffer                                         alloc_extent_buffer
                                                                        find_extent_buffer // not found
                                                                        allocates eb
                                                                        lock pages
                                                                        associate pages to eb
                                                                        insert eb into radix tree
                                                                        set TREE_REF, refs == 2
                                                                        unlock pages
                                                                    read_extent_buffer_pages // WAIT_NONE
                                                                      not uptodate (brand new eb)
                                                                                                                  lock_page
                                                                      if !trylock_page
                                                                        goto unlock_exit // not an error
                                                                    free_extent_buffer
                                                                      release_extent_buffer
                                                                        atomic_dec_and_test refs to 1
              find_extent_buffer // found
                                                                                                                  try_release_extent_buffer
                                                                                                                    take refs_lock
                                                                                                                    reads refs == 1; no io
                atomic_inc_not_zero refs to 2
                mark_buffer_accessed
                  check_buffer_tree_ref
                    // not STALE, won't take refs_lock
                    refs == 2; TREE_REF set // no action
          read_extent_buffer_pages // WAIT_NONE
                                                                                                                    clear TREE_REF
                                                                                                                    release_extent_buffer
                                                                                                                      atomic_dec_and_test refs to 1
                                                                                                                      unlock_page
            still not uptodate (CPU1 read failed on trylock_page)
            locks pages
            set io_pages > 0
            submit io
            return
          free_extent_buffer
            release_extent_buffer
              dec refs to 0
              delete from radix tree
              btrfs_release_extent_buffer_pages
                BUG_ON(io_pages > 0)!!!
      
      We observe this at a very low rate in production and were also able to
      reproduce it in a test environment by introducing some spurious delays
      and by introducing probabilistic trylock_page failures.
      
      To fix it, we apply check_tree_ref at a point where it could not
      possibly be unset by a competing task: after io_pages has been
      incremented. All the codepaths that clear TREE_REF check for io, so they
      would not be able to clear it after this point until the io is done.
      
      Stack trace, for reference:
      [1417839.424739] ------------[ cut here ]------------
      [1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
      [1417839.447024] invalid opcode: 0000 [#1] SMP
      [1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
      [1417839.517008] Code: ed e9 ...
      [1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
      [1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
      [1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
      [1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
      [1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
      [1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
      [1417839.651549] FS:  00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
      [1417839.669810] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
      [1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [1417839.731320] Call Trace:
      [1417839.737103]  release_extent_buffer+0x39/0x90
      [1417839.746913]  read_block_for_search.isra.38+0x2a3/0x370
      [1417839.758645]  btrfs_search_slot+0x260/0x9b0
      [1417839.768054]  btrfs_lookup_file_extent+0x4a/0x70
      [1417839.778427]  btrfs_get_extent+0x15f/0x830
      [1417839.787665]  ? submit_extent_page+0xc4/0x1c0
      [1417839.797474]  ? __do_readpage+0x299/0x7a0
      [1417839.806515]  __do_readpage+0x33b/0x7a0
      [1417839.815171]  ? btrfs_releasepage+0x70/0x70
      [1417839.824597]  extent_readpages+0x28f/0x400
      [1417839.833836]  read_pages+0x6a/0x1c0
      [1417839.841729]  ? startup_64+0x2/0x30
      [1417839.849624]  __do_page_cache_readahead+0x13c/0x1a0
      [1417839.860590]  filemap_fault+0x6c7/0x990
      [1417839.869252]  ? xas_load+0x8/0x80
      [1417839.876756]  ? xas_find+0x150/0x190
      [1417839.884839]  ? filemap_map_pages+0x295/0x3b0
      [1417839.894652]  __do_fault+0x32/0x110
      [1417839.902540]  __handle_mm_fault+0xacd/0x1000
      [1417839.912156]  handle_mm_fault+0xaa/0x1c0
      [1417839.921004]  __do_page_fault+0x242/0x4b0
      [1417839.930044]  ? page_fault+0x8/0x30
      [1417839.937933]  page_fault+0x1e/0x30
      [1417839.945631] RIP: 0033:0x33c4bae
      [1417839.952927] Code: Bad RIP value.
      [1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
      [1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
      [1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
      [1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
      [1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
      [1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6bf9cd2e
    • Marcos Paulo de Souza's avatar
      btrfs: convert comments to fallthrough annotations · c730ae0c
      Marcos Paulo de Souza authored
      Convert fall through comments to the pseudo-keyword which is now the
      preferred way.
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c730ae0c
  5. 16 Jun, 2020 10 commits
    • Waiman Long's avatar
      btrfs: use kfree() in btrfs_ioctl_get_subvol_info() · b091f7fe
      Waiman Long authored
      In btrfs_ioctl_get_subvol_info(), there is a classic case where kzalloc()
      was incorrectly paired with kzfree(). According to David Sterba, there
      isn't any sensitive information in the subvol_info that needs to be
      cleared before freeing. So kzfree() isn't really needed, use kfree()
      instead.
      Signed-off-by: default avatarWaiman Long <longman@redhat.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b091f7fe
    • Filipe Manana's avatar
      btrfs: fix RWF_NOWAIT writes blocking on extent locks and waiting for IO · 5dbb75ed
      Filipe Manana authored
      A RWF_NOWAIT write is not supposed to wait on filesystem locks that can be
      held for a long time or for ongoing IO to complete.
      
      However when calling check_can_nocow(), if the inode has prealloc extents
      or has the NOCOW flag set, we can block on extent (file range) locks
      through the call to btrfs_lock_and_flush_ordered_range(). Such lock can
      take a significant amount of time to be available. For example, a fiemap
      task may be running, and iterating through the entire file range checking
      all extents and doing backref walking to determine if they are shared,
      or a readpage operation may be in progress.
      
      Also at btrfs_lock_and_flush_ordered_range(), called by check_can_nocow(),
      after locking the file range we wait for any existing ordered extent that
      is in progress to complete. Another operation that can take a significant
      amount of time and defeat the purpose of RWF_NOWAIT.
      
      So fix this by trying to lock the file range and if it's currently locked
      return -EAGAIN to user space. If we are able to lock the file range without
      waiting and there is an ordered extent in the range, return -EAGAIN as
      well, instead of waiting for it to complete. Finally, don't bother trying
      to lock the snapshot lock of the root when attempting a RWF_NOWAIT write,
      as that is only important for buffered writes.
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5dbb75ed
    • Filipe Manana's avatar
      btrfs: fix RWF_NOWAIT write not failling when we need to cow · 260a6339
      Filipe Manana authored
      If we attempt to do a RWF_NOWAIT write against a file range for which we
      can only do NOCOW for a part of it, due to the existence of holes or
      shared extents for example, we proceed with the write as if it were
      possible to NOCOW the whole range.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ touch /mnt/sdj/bar
        $ chattr +C /mnt/sdj/bar
      
        $ xfs_io -d -c "pwrite -S 0xab -b 256K 0 256K" /mnt/bar
        wrote 262144/262144 bytes at offset 0
        256 KiB, 1 ops; 0.0003 sec (694.444 MiB/sec and 2777.7778 ops/sec)
      
        $ xfs_io -c "fpunch 64K 64K" /mnt/bar
        $ sync
      
        $ xfs_io -d -c "pwrite -N -V 1 -b 128K -S 0xfe 0 128K" /mnt/bar
        wrote 131072/131072 bytes at offset 0
        128 KiB, 1 ops; 0.0007 sec (160.051 MiB/sec and 1280.4097 ops/sec)
      
      This last write should fail with -EAGAIN since the file range from 64K to
      128K is a hole. On xfs it fails, as expected, but on ext4 it currently
      succeeds because apparently it is expensive to check if there are extents
      allocated for the whole range, but I'll check with the ext4 people.
      
      Fix the issue by checking if check_can_nocow() returns a number of
      NOCOW'able bytes smaller then the requested number of bytes, and if it
      does return -EAGAIN.
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      260a6339
    • Filipe Manana's avatar
      btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eof · 4b194628
      Filipe Manana authored
      If we attempt to write to prealloc extent located after eof using a
      RWF_NOWAIT write, we always fail with -EAGAIN.
      
      We do actually check if we have an allocated extent for the write at
      the start of btrfs_file_write_iter() through a call to check_can_nocow(),
      but later when we go into the actual direct IO write path we simply
      return -EAGAIN if the write starts at or beyond EOF.
      
      Trivial to reproduce:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ touch /mnt/foo
        $ chattr +C /mnt/foo
      
        $ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foo
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0004 sec (135.575 MiB/sec and 34707.1584 ops/sec)
      
        $ xfs_io -c "falloc -k 64K 1M" /mnt/foo
      
        $ xfs_io -d -c "pwrite -N -V 1 -S 0xfe -b 64K 64K 64K" /mnt/foo
        pwrite: Resource temporarily unavailable
      
      On xfs and ext4 the write succeeds, as expected.
      
      Fix this by removing the wrong check at btrfs_direct_IO().
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4b194628
    • Filipe Manana's avatar
      btrfs: fix hang on snapshot creation after RWF_NOWAIT write · f2cb2f39
      Filipe Manana authored
      If we do a successful RWF_NOWAIT write we end up locking the snapshot lock
      of the inode, through a call to check_can_nocow(), but we never unlock it.
      
      This means the next attempt to create a snapshot on the subvolume will
      hang forever.
      
      Trivial reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ touch /mnt/foobar
        $ chattr +C /mnt/foobar
        $ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foobar
        $ xfs_io -d -c "pwrite -N -V 1 -S 0xfe 0 64K" /mnt/foobar
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap
          --> hangs
      
      Fix this by unlocking the snapshot lock if check_can_nocow() returned
      success.
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f2cb2f39
    • Filipe Manana's avatar
      btrfs: check if a log root exists before locking the log_mutex on unlink · e7a79811
      Filipe Manana authored
      This brings back an optimization that commit e678934c ("btrfs:
      Remove unnecessary check from join_running_log_trans") removed, but in
      a different form. So it's almost equivalent to a revert.
      
      That commit removed an optimization where we avoid locking a root's
      log_mutex when there is no log tree created in the current transaction.
      The affected code path is triggered through unlink operations.
      
      That commit was based on the assumption that the optimization was not
      necessary because we used to have the following checks when the patch
      was authored:
      
        int btrfs_del_dir_entries_in_log(...)
        {
              (...)
              if (dir->logged_trans < trans->transid)
                  return 0;
      
              ret = join_running_log_trans(root);
              (...)
         }
      
         int btrfs_del_inode_ref_in_log(...)
         {
              (...)
              if (inode->logged_trans < trans->transid)
                  return 0;
      
              ret = join_running_log_trans(root);
              (...)
         }
      
      However before that patch was merged, another patch was merged first which
      replaced those checks because they were buggy.
      
      That other patch corresponds to commit 803f0f64 ("Btrfs: fix fsync
      not persisting dentry deletions due to inode evictions"). The assumption
      that if the logged_trans field of an inode had a smaller value then the
      current transaction's generation (transid) meant that the inode was not
      logged in the current transaction was only correct if the inode was not
      evicted and reloaded in the current transaction. So the corresponding bug
      fix changed those checks and replaced them with the following helper
      function:
      
        static bool inode_logged(struct btrfs_trans_handle *trans,
                                 struct btrfs_inode *inode)
        {
              if (inode->logged_trans == trans->transid)
                      return true;
      
              if (inode->last_trans == trans->transid &&
                  test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) &&
                  !test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags))
                      return true;
      
              return false;
        }
      
      So if we have a subvolume without a log tree in the current transaction
      (because we had no fsyncs), every time we unlink an inode we can end up
      trying to lock the log_mutex of the root through join_running_log_trans()
      twice, once for the inode being unlinked (by btrfs_del_inode_ref_in_log())
      and once for the parent directory (with btrfs_del_dir_entries_in_log()).
      
      This means if we have several unlink operations happening in parallel for
      inodes in the same subvolume, and the those inodes and/or their parent
      inode were changed in the current transaction, we end up having a lot of
      contention on the log_mutex.
      
      The test robots from intel reported a -30.7% performance regression for
      a REAIM test after commit e678934c ("btrfs: Remove unnecessary check
      from join_running_log_trans").
      
      So just bring back the optimization to join_running_log_trans() where we
      check first if a log root exists before trying to lock the log_mutex. This
      is done by checking for a bit that is set on the root when a log tree is
      created and removed when a log tree is freed (at transaction commit time).
      
      Commit e678934c ("btrfs: Remove unnecessary check from
      join_running_log_trans") was merged in the 5.4 merge window while commit
      803f0f64 ("Btrfs: fix fsync not persisting dentry deletions due to
      inode evictions") was merged in the 5.3 merge window. But the first
      commit was actually authored before the second commit (May 23 2019 vs
      June 19 2019).
      Reported-by: default avatarkernel test robot <rong.a.chen@intel.com>
      Link: https://lore.kernel.org/lkml/20200611090233.GL12456@shao2-debian/
      Fixes: e678934c ("btrfs: Remove unnecessary check from join_running_log_trans")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e7a79811
    • Filipe Manana's avatar
      btrfs: fix bytes_may_use underflow when running balance and scrub in parallel · 6bd335b4
      Filipe Manana authored
      When balance and scrub are running in parallel it is possible to end up
      with an underflow of the bytes_may_use counter of the data space_info
      object, which triggers a warning like the following:
      
         [134243.793196] BTRFS info (device sdc): relocating block group 1104150528 flags data
         [134243.806891] ------------[ cut here ]------------
         [134243.807561] WARNING: CPU: 1 PID: 26884 at fs/btrfs/space-info.h:125 btrfs_add_reserved_bytes+0x1da/0x280 [btrfs]
         [134243.808819] Modules linked in: btrfs blake2b_generic xor (...)
         [134243.815779] CPU: 1 PID: 26884 Comm: kworker/u8:8 Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
         [134243.816944] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
         [134243.818389] Workqueue: writeback wb_workfn (flush-btrfs-108483)
         [134243.819186] RIP: 0010:btrfs_add_reserved_bytes+0x1da/0x280 [btrfs]
         [134243.819963] Code: 0b f2 85 (...)
         [134243.822271] RSP: 0018:ffffa4160aae7510 EFLAGS: 00010287
         [134243.822929] RAX: 000000000000c000 RBX: ffff96159a8c1000 RCX: 0000000000000000
         [134243.823816] RDX: 0000000000008000 RSI: 0000000000000000 RDI: ffff96158067a810
         [134243.824742] RBP: ffff96158067a800 R08: 0000000000000001 R09: 0000000000000000
         [134243.825636] R10: ffff961501432a40 R11: 0000000000000000 R12: 000000000000c000
         [134243.826532] R13: 0000000000000001 R14: ffffffffffff4000 R15: ffff96158067a810
         [134243.827432] FS:  0000000000000000(0000) GS:ffff9615baa00000(0000) knlGS:0000000000000000
         [134243.828451] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [134243.829184] CR2: 000055bd7e414000 CR3: 00000001077be004 CR4: 00000000003606e0
         [134243.830083] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [134243.830975] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [134243.831867] Call Trace:
         [134243.832211]  find_free_extent+0x4a0/0x16c0 [btrfs]
         [134243.832846]  btrfs_reserve_extent+0x91/0x180 [btrfs]
         [134243.833487]  cow_file_range+0x12d/0x490 [btrfs]
         [134243.834080]  fallback_to_cow+0x82/0x1b0 [btrfs]
         [134243.834689]  ? release_extent_buffer+0x121/0x170 [btrfs]
         [134243.835370]  run_delalloc_nocow+0x33f/0xa30 [btrfs]
         [134243.836032]  btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
         [134243.836725]  ? find_lock_delalloc_range+0x221/0x250 [btrfs]
         [134243.837450]  writepage_delalloc+0xe8/0x150 [btrfs]
         [134243.838059]  __extent_writepage+0xe8/0x4c0 [btrfs]
         [134243.838674]  extent_write_cache_pages+0x237/0x530 [btrfs]
         [134243.839364]  extent_writepages+0x44/0xa0 [btrfs]
         [134243.839946]  do_writepages+0x23/0x80
         [134243.840401]  __writeback_single_inode+0x59/0x700
         [134243.841006]  writeback_sb_inodes+0x267/0x5f0
         [134243.841548]  __writeback_inodes_wb+0x87/0xe0
         [134243.842091]  wb_writeback+0x382/0x590
         [134243.842574]  ? wb_workfn+0x4a2/0x6c0
         [134243.843030]  wb_workfn+0x4a2/0x6c0
         [134243.843468]  process_one_work+0x26d/0x6a0
         [134243.843978]  worker_thread+0x4f/0x3e0
         [134243.844452]  ? process_one_work+0x6a0/0x6a0
         [134243.844981]  kthread+0x103/0x140
         [134243.845400]  ? kthread_create_worker_on_cpu+0x70/0x70
         [134243.846030]  ret_from_fork+0x3a/0x50
         [134243.846494] irq event stamp: 0
         [134243.846892] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
         [134243.847682] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
         [134243.848687] softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
         [134243.849913] softirqs last disabled at (0): [<0000000000000000>] 0x0
         [134243.850698] ---[ end trace bd7c03622e0b0a96 ]---
         [134243.851335] ------------[ cut here ]------------
      
      When relocating a data block group, for each extent allocated in the
      block group we preallocate another extent with the same size for the
      data relocation inode (we do it at prealloc_file_extent_cluster()).
      We reserve space by calling btrfs_check_data_free_space(), which ends
      up incrementing the data space_info's bytes_may_use counter, and
      then call btrfs_prealloc_file_range() to allocate the extent, which
      always decrements the bytes_may_use counter by the same amount.
      
      The expectation is that writeback of the data relocation inode always
      follows a NOCOW path, by writing into the preallocated extents. However,
      when starting writeback we might end up falling back into the COW path,
      because the block group that contains the preallocated extent was turned
      into RO mode by a scrub running in parallel. The COW path then calls the
      extent allocator which ends up calling btrfs_add_reserved_bytes(), and
      this function decrements the bytes_may_use counter of the data space_info
      object by an amount corresponding to the size of the allocated extent,
      despite we haven't previously incremented it. When the counter currently
      has a value smaller then the allocated extent we reset the counter to 0
      and emit a warning, otherwise we just decrement it and slowly mess up
      with this counter which is crucial for space reservation, the end result
      can be granting reserved space to tasks when there isn't really enough
      free space, and having the tasks fail later in critical places where
      error handling consists of a transaction abort or hitting a BUG_ON().
      
      Fix this by making sure that if we fallback to the COW path for a data
      relocation inode, we increment the bytes_may_use counter of the data
      space_info object. The COW path will then decrement it at
      btrfs_add_reserved_bytes() on success or through its error handling part
      by a call to extent_clear_unlock_delalloc() (which ends up calling
      btrfs_clear_delalloc_extent() that does the decrement operation) in case
      of an error.
      
      Test case btrfs/061 from fstests could sporadically trigger this.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6bd335b4
    • Filipe Manana's avatar
      btrfs: fix data block group relocation failure due to concurrent scrub · 432cd2a1
      Filipe Manana authored
      When running relocation of a data block group while scrub is running in
      parallel, it is possible that the relocation will fail and abort the
      current transaction with an -EINVAL error:
      
         [134243.988595] BTRFS info (device sdc): found 14 extents, stage: move data extents
         [134243.999871] ------------[ cut here ]------------
         [134244.000741] BTRFS: Transaction aborted (error -22)
         [134244.001692] WARNING: CPU: 0 PID: 26954 at fs/btrfs/ctree.c:1071 __btrfs_cow_block+0x6a7/0x790 [btrfs]
         [134244.003380] Modules linked in: btrfs blake2b_generic xor raid6_pq (...)
         [134244.012577] CPU: 0 PID: 26954 Comm: btrfs Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
         [134244.014162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
         [134244.016184] RIP: 0010:__btrfs_cow_block+0x6a7/0x790 [btrfs]
         [134244.017151] Code: 48 c7 c7 (...)
         [134244.020549] RSP: 0018:ffffa41607863888 EFLAGS: 00010286
         [134244.021515] RAX: 0000000000000000 RBX: ffff9614bdfe09c8 RCX: 0000000000000000
         [134244.022822] RDX: 0000000000000001 RSI: ffffffffb3d63980 RDI: 0000000000000001
         [134244.024124] RBP: ffff961589e8c000 R08: 0000000000000000 R09: 0000000000000001
         [134244.025424] R10: ffffffffc0ae5955 R11: 0000000000000000 R12: ffff9614bd530d08
         [134244.026725] R13: ffff9614ced41b88 R14: ffff9614bdfe2a48 R15: 0000000000000000
         [134244.028024] FS:  00007f29b63c08c0(0000) GS:ffff9615ba600000(0000) knlGS:0000000000000000
         [134244.029491] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [134244.030560] CR2: 00007f4eb339b000 CR3: 0000000130d6e006 CR4: 00000000003606f0
         [134244.031997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [134244.033153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [134244.034484] Call Trace:
         [134244.034984]  btrfs_cow_block+0x12b/0x2b0 [btrfs]
         [134244.035859]  do_relocation+0x30b/0x790 [btrfs]
         [134244.036681]  ? do_raw_spin_unlock+0x49/0xc0
         [134244.037460]  ? _raw_spin_unlock+0x29/0x40
         [134244.038235]  relocate_tree_blocks+0x37b/0x730 [btrfs]
         [134244.039245]  relocate_block_group+0x388/0x770 [btrfs]
         [134244.040228]  btrfs_relocate_block_group+0x161/0x2e0 [btrfs]
         [134244.041323]  btrfs_relocate_chunk+0x36/0x110 [btrfs]
         [134244.041345]  btrfs_balance+0xc06/0x1860 [btrfs]
         [134244.043382]  ? btrfs_ioctl_balance+0x27c/0x310 [btrfs]
         [134244.045586]  btrfs_ioctl_balance+0x1ed/0x310 [btrfs]
         [134244.045611]  btrfs_ioctl+0x1880/0x3760 [btrfs]
         [134244.049043]  ? do_raw_spin_unlock+0x49/0xc0
         [134244.049838]  ? _raw_spin_unlock+0x29/0x40
         [134244.050587]  ? __handle_mm_fault+0x11b3/0x14b0
         [134244.051417]  ? ksys_ioctl+0x92/0xb0
         [134244.052070]  ksys_ioctl+0x92/0xb0
         [134244.052701]  ? trace_hardirqs_off_thunk+0x1a/0x1c
         [134244.053511]  __x64_sys_ioctl+0x16/0x20
         [134244.054206]  do_syscall_64+0x5c/0x280
         [134244.054891]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
         [134244.055819] RIP: 0033:0x7f29b51c9dd7
         [134244.056491] Code: 00 00 00 (...)
         [134244.059767] RSP: 002b:00007ffcccc1dd08 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
         [134244.061168] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f29b51c9dd7
         [134244.062474] RDX: 00007ffcccc1dda0 RSI: 00000000c4009420 RDI: 0000000000000003
         [134244.063771] RBP: 0000000000000003 R08: 00005565cea4b000 R09: 0000000000000000
         [134244.065032] R10: 0000000000000541 R11: 0000000000000202 R12: 00007ffcccc2060a
         [134244.066327] R13: 00007ffcccc1dda0 R14: 0000000000000002 R15: 00007ffcccc1dec0
         [134244.067626] irq event stamp: 0
         [134244.068202] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
         [134244.069351] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
         [134244.070909] softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
         [134244.072392] softirqs last disabled at (0): [<0000000000000000>] 0x0
         [134244.073432] ---[ end trace bd7c03622e0b0a99 ]---
      
      The -EINVAL error comes from the following chain of function calls:
      
        __btrfs_cow_block() <-- aborts the transaction
          btrfs_reloc_cow_block()
            replace_file_extents()
              get_new_location() <-- returns -EINVAL
      
      When relocating a data block group, for each allocated extent of the block
      group, we preallocate another extent (at prealloc_file_extent_cluster()),
      associated with the data relocation inode, and then dirty all its pages.
      These preallocated extents have, and must have, the same size that extents
      from the data block group being relocated have.
      
      Later before we start the relocation stage that updates pointers (bytenr
      field of file extent items) to point to the the new extents, we trigger
      writeback for the data relocation inode. The expectation is that writeback
      will write the pages to the previously preallocated extents, that it
      follows the NOCOW path. That is generally the case, however, if a scrub
      is running it may have turned the block group that contains those extents
      into RO mode, in which case writeback falls back to the COW path.
      
      However in the COW path instead of allocating exactly one extent with the
      expected size, the allocator may end up allocating several smaller extents
      due to free space fragmentation - because we tell it at cow_file_range()
      that the minimum allocation size can match the filesystem's sector size.
      This later breaks the relocation's expectation that an extent associated
      to a file extent item in the data relocation inode has the same size as
      the respective extent pointed by a file extent item in another tree - in
      this case the extent to which the relocation inode poins to is smaller,
      causing relocation.c:get_new_location() to return -EINVAL.
      
      For example, if we are relocating a data block group X that has a logical
      address of X and the block group has an extent allocated at the logical
      address X + 128KiB with a size of 64KiB:
      
      1) At prealloc_file_extent_cluster() we allocate an extent for the data
         relocation inode with a size of 64KiB and associate it to the file
         offset 128KiB (X + 128KiB - X) of the data relocation inode. This
         preallocated extent was allocated at block group Z;
      
      2) A scrub running in parallel turns block group Z into RO mode and
         starts scrubing its extents;
      
      3) Relocation triggers writeback for the data relocation inode;
      
      4) When running delalloc (btrfs_run_delalloc_range()), we try first the
         NOCOW path because the data relocation inode has BTRFS_INODE_PREALLOC
         set in its flags. However, because block group Z is in RO mode, the
         NOCOW path (run_delalloc_nocow()) falls back into the COW path, by
         calling cow_file_range();
      
      5) At cow_file_range(), in the first iteration of the while loop we call
         btrfs_reserve_extent() to allocate a 64KiB extent and pass it a minimum
         allocation size of 4KiB (fs_info->sectorsize). Due to free space
         fragmentation, btrfs_reserve_extent() ends up allocating two extents
         of 32KiB each, each one on a different iteration of that while loop;
      
      6) Writeback of the data relocation inode completes;
      
      7) Relocation proceeds and ends up at relocation.c:replace_file_extents(),
         with a leaf which has a file extent item that points to the data extent
         from block group X, that has a logical address (bytenr) of X + 128KiB
         and a size of 64KiB. Then it calls get_new_location(), which does a
         lookup in the data relocation tree for a file extent item starting at
         offset 128KiB (X + 128KiB - X) and belonging to the data relocation
         inode. It finds a corresponding file extent item, however that item
         points to an extent that has a size of 32KiB, which doesn't match the
         expected size of 64KiB, resuling in -EINVAL being returned from this
         function and propagated up to __btrfs_cow_block(), which aborts the
         current transaction.
      
      To fix this make sure that at cow_file_range() when we call the allocator
      we pass it a minimum allocation size corresponding the desired extent size
      if the inode belongs to the data relocation tree, otherwise pass it the
      filesystem's sector size as the minimum allocation size.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      432cd2a1
    • Filipe Manana's avatar
      btrfs: fix race between block group removal and block group creation · ffcb9d44
      Filipe Manana authored
      There is a race between block group removal and block group creation
      when the removal is completed by a task running fitrim or scrub. When
      this happens we end up failing the block group creation with an error
      -EEXIST since we attempt to insert a duplicate block group item key
      in the extent tree. That results in a transaction abort.
      
      The race happens like this:
      
      1) Task A is doing a fitrim, and at btrfs_trim_block_group() it freezes
         block group X with btrfs_freeze_block_group() (until very recently
         that was named btrfs_get_block_group_trimming());
      
      2) Task B starts removing block group X, either because it's now unused
         or due to relocation for example. So at btrfs_remove_block_group(),
         while holding the chunk mutex and the block group's lock, it sets
         the 'removed' flag of the block group and it sets the local variable
         'remove_em' to false, because the block group is currently frozen
         (its 'frozen' counter is > 0, until very recently this counter was
         named 'trimming');
      
      3) Task B unlocks the block group and the chunk mutex;
      
      4) Task A is done trimming the block group and unfreezes the block group
         by calling btrfs_unfreeze_block_group() (until very recently this was
         named btrfs_put_block_group_trimming()). In this function we lock the
         block group and set the local variable 'cleanup' to true because we
         were able to decrement the block group's 'frozen' counter down to 0 and
         the flag 'removed' is set in the block group.
      
         Since 'cleanup' is set to true, it locks the chunk mutex and removes
         the extent mapping representing the block group from the mapping tree;
      
      5) Task C allocates a new block group Y and it picks up the logical address
         that block group X had as the logical address for Y, because X was the
         block group with the highest logical address and now the second block
         group with the highest logical address, the last in the fs mapping tree,
         ends at an offset corresponding to block group X's logical address (this
         logical address selection is done at volumes.c:find_next_chunk()).
      
         At this point the new block group Y does not have yet its item added
         to the extent tree (nor the corresponding device extent items and
         chunk item in the device and chunk trees). The new group Y is added to
         the list of pending block groups in the transaction handle;
      
      6) Before task B proceeds to removing the block group item for block
         group X from the extent tree, which has a key matching:
      
         (X logical offset, BTRFS_BLOCK_GROUP_ITEM_KEY, length)
      
         task C while ending its transaction handle calls
         btrfs_create_pending_block_groups(), which finds block group Y and
         tries to insert the block group item for Y into the exten tree, which
         fails with -EEXIST since logical offset is the same that X had and
         task B hasn't yet deleted the key from the extent tree.
         This failure results in a transaction abort, producing a stack like
         the following:
      
      ------------[ cut here ]------------
       BTRFS: Transaction aborted (error -17)
       WARNING: CPU: 2 PID: 19736 at fs/btrfs/block-group.c:2074 btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
       Modules linked in: btrfs blake2b_generic xor raid6_pq (...)
       CPU: 2 PID: 19736 Comm: fsstress Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
       Code: ff ff ff 48 8b 55 50 f0 48 (...)
       RSP: 0018:ffffa4160a1c7d58 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: ffff961581909d98 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: ffffffffb3d63990 RDI: 0000000000000001
       RBP: ffff9614f3356a58 R08: 0000000000000000 R09: 0000000000000001
       R10: ffff9615b65b0040 R11: 0000000000000000 R12: ffff961581909c10
       R13: ffff9615b0c32000 R14: ffff9614f3356ab0 R15: ffff9614be779000
       FS:  00007f2ce2841e80(0000) GS:ffff9615bae00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000555f18780000 CR3: 0000000131d34005 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_start_dirty_block_groups+0x398/0x4e0 [btrfs]
        btrfs_commit_transaction+0xd0/0xc50 [btrfs]
        ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
        ? __ia32_sys_fdatasync+0x20/0x20
        iterate_supers+0xdb/0x180
        ksys_sync+0x60/0xb0
        __ia32_sys_sync+0xa/0x10
        do_syscall_64+0x5c/0x280
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7f2ce1d4d5b7
       Code: 83 c4 08 48 3d 01 (...)
       RSP: 002b:00007ffd8b558c58 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
       RAX: ffffffffffffffda RBX: 000000000000002c RCX: 00007f2ce1d4d5b7
       RDX: 00000000ffffffff RSI: 00000000186ba07b RDI: 000000000000002c
       RBP: 0000555f17b9e520 R08: 0000000000000012 R09: 000000000000ce00
       R10: 0000000000000078 R11: 0000000000000202 R12: 0000000000000032
       R13: 0000000051eb851f R14: 00007ffd8b558cd0 R15: 0000555f1798ec20
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
       softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace bd7c03622e0b0a9c ]---
      
      Fix this simply by making btrfs_remove_block_group() remove the block
      group's item from the extent tree before it flags the block group as
      removed. Also make the free space deletion from the free space tree
      before flagging the block group as removed, to avoid a similar race
      with adding and removing free space entries for the free space tree.
      
      Fixes: 04216820 ("Btrfs: fix race between fs trimming and block group remove/allocation")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ffcb9d44
    • Filipe Manana's avatar
      btrfs: fix a block group ref counter leak after failure to remove block group · 9fecd132
      Filipe Manana authored
      When removing a block group, if we fail to delete the block group's item
      from the extent tree, we jump to the 'out' label and end up decrementing
      the block group's reference count once only (by 1), resulting in a counter
      leak because the block group at that point was already removed from the
      block group cache rbtree - so we have to decrement the reference count
      twice, once for the rbtree and once for our lookup at the start of the
      function.
      
      There is a second bug where if removing the free space tree entries (the
      call to remove_block_group_free_space()) fails we end up jumping to the
      'out_put_group' label but end up decrementing the reference count only
      once, when we should have done it twice, since we have already removed
      the block group from the block group cache rbtree. This happens because
      the reference count decrement for the rbtree reference happens after
      attempting to remove the free space tree entries, which is far away from
      the place where we remove the block group from the rbtree.
      
      To make things less error prone, decrement the reference count for the
      rbtree immediately after removing the block group from it. This also
      eleminates the need for two different exit labels on error, renaming
      'out_put_label' to just 'out' and removing the old 'out'.
      
      Fixes: f6033c5e ("btrfs: fix block group leak when removing fails")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9fecd132
  6. 13 Jun, 2020 1 commit
    • David Sterba's avatar
      Revert "btrfs: switch to iomap_dio_rw() for dio" · 55e20bd1
      David Sterba authored
      This reverts commit a43a67a2.
      
      This patch reverts the main part of switching direct io implementation
      to iomap infrastructure. There's a problem in invalidate page that
      couldn't be solved as regression in this development cycle.
      
      The problem occurs when buffered and direct io are mixed, and the ranges
      overlap. Although this is not recommended, filesystems implement
      measures or fallbacks to make it somehow work. In this case, fallback to
      buffered IO would be an option for btrfs (this already happens when
      direct io is done on compressed data), but the change would be needed in
      the iomap code, bringing new semantics to other filesystems.
      
      Another problem arises when again the buffered and direct ios are mixed,
      invalidation fails, then -EIO is set on the mapping and fsync will fail,
      though there's no real error.
      
      There have been discussions how to fix that, but revert seems to be the
      least intrusive option.
      
      Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      55e20bd1
  7. 09 Jun, 2020 3 commits
  8. 28 May, 2020 9 commits
    • Filipe Manana's avatar
      btrfs: fix space_info bytes_may_use underflow during space cache writeout · 2166e5ed
      Filipe Manana authored
      We always preallocate a data extent for writing a free space cache, which
      causes writeback to always try the nocow path first, since the free space
      inode has the prealloc bit set in its flags.
      
      However if the block group that contains the data extent for the space
      cache has been turned to RO mode due to a running scrub or balance for
      example, we have to fallback to the cow path. In that case once a new data
      extent is allocated we end up calling btrfs_add_reserved_bytes(), which
      decrements the counter named bytes_may_use from the data space_info object
      with the expection that this counter was previously incremented with the
      same amount (the size of the data extent).
      
      However when we started writeout of the space cache at cache_save_setup(),
      we incremented the value of the bytes_may_use counter through a call to
      btrfs_check_data_free_space() and then decremented it through a call to
      btrfs_prealloc_file_range_trans() immediately after. So when starting the
      writeback if we fallback to cow mode we have to increment the counter
      bytes_may_use of the data space_info again to compensate for the extent
      allocation done by the cow path.
      
      When this issue happens we are incorrectly decrementing the bytes_may_use
      counter and when its current value is smaller then the amount we try to
      subtract we end up with the following warning:
      
       ------------[ cut here ]------------
       WARNING: CPU: 3 PID: 657 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
       Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
       CPU: 3 PID: 657 Comm: kworker/u8:7 Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       Workqueue: writeback wb_workfn (flush-btrfs-1591)
       RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
       Code: ff ff 48 (...)
       RSP: 0000:ffffa41608f13660 EFLAGS: 00010287
       RAX: 0000000000001000 RBX: ffff9615b93ae400 RCX: 0000000000000000
       RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9615b96ab410
       RBP: fffffffffffee000 R08: 0000000000000001 R09: 0000000000000000
       R10: ffff961585e62a40 R11: 0000000000000000 R12: ffff9615b96ab400
       R13: ffff9615a1a2a000 R14: 0000000000012000 R15: ffff9615b93ae400
       FS:  0000000000000000(0000) GS:ffff9615bb200000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055cbbc2ae178 CR3: 0000000115794006 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        find_free_extent+0x4a0/0x16c0 [btrfs]
        btrfs_reserve_extent+0x91/0x180 [btrfs]
        cow_file_range+0x12d/0x490 [btrfs]
        btrfs_run_delalloc_range+0x9f/0x6d0 [btrfs]
        ? find_lock_delalloc_range+0x221/0x250 [btrfs]
        writepage_delalloc+0xe8/0x150 [btrfs]
        __extent_writepage+0xe8/0x4c0 [btrfs]
        extent_write_cache_pages+0x237/0x530 [btrfs]
        extent_writepages+0x44/0xa0 [btrfs]
        do_writepages+0x23/0x80
        __writeback_single_inode+0x59/0x700
        writeback_sb_inodes+0x267/0x5f0
        __writeback_inodes_wb+0x87/0xe0
        wb_writeback+0x382/0x590
        ? wb_workfn+0x4a2/0x6c0
        wb_workfn+0x4a2/0x6c0
        process_one_work+0x26d/0x6a0
        worker_thread+0x4f/0x3e0
        ? process_one_work+0x6a0/0x6a0
        kthread+0x103/0x140
        ? kthread_create_worker_on_cpu+0x70/0x70
        ret_from_fork+0x3a/0x50
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
       softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace bd7c03622e0b0a52 ]---
       ------------[ cut here ]------------
      
      So fix this by incrementing the bytes_may_use counter of the data
      space_info when we fallback to the cow path. If the cow path is successful
      the counter is decremented after extent allocation (by
      btrfs_add_reserved_bytes()), if it fails it ends up being decremented as
      well when clearing the delalloc range (extent_clear_unlock_delalloc()).
      
      This could be triggered sporadically by the test case btrfs/061 from
      fstests.
      
      Fixes: 82d5902d ("Btrfs: Support reading/writing on disk free ino cache")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2166e5ed
    • Filipe Manana's avatar
      btrfs: fix space_info bytes_may_use underflow after nocow buffered write · 467dc47e
      Filipe Manana authored
      When doing a buffered write we always try to reserve data space for it,
      even when the file has the NOCOW bit set or the write falls into a file
      range covered by a prealloc extent. This is done both because it is
      expensive to check if we can do a nocow write (checking if an extent is
      shared through reflinks or if there's a hole in the range for example),
      and because when writeback starts we might actually need to fallback to
      COW mode (for example the block group containing the target extents was
      turned into RO mode due to a scrub or balance).
      
      When we are unable to reserve data space we check if we can do a nocow
      write, and if we can, we proceed with dirtying the pages and setting up
      the range for delalloc. In this case the bytes_may_use counter of the
      data space_info object is not incremented, unlike in the case where we
      are able to reserve data space (done through btrfs_check_data_free_space()
      which calls btrfs_alloc_data_chunk_ondemand()).
      
      Later when running delalloc we attempt to start writeback in nocow mode
      but we might revert back to cow mode, for example because in the meanwhile
      a block group was turned into RO mode by a scrub or relocation. The cow
      path after successfully allocating an extent ends up calling
      btrfs_add_reserved_bytes(), which expects the bytes_may_use counter of
      the data space_info object to have been incremented before - but we did
      not do it when the buffered write started, since there was not enough
      available data space. So btrfs_add_reserved_bytes() ends up decrementing
      the bytes_may_use counter anyway, and when the counter's current value
      is smaller then the size of the allocated extent we get a stack trace
      like the following:
      
       ------------[ cut here ]------------
       WARNING: CPU: 0 PID: 20138 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
       Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
       CPU: 0 PID: 20138 Comm: kworker/u8:15 Not tainted 5.6.0-rc7-btrfs-next-58 #5
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       Workqueue: writeback wb_workfn (flush-btrfs-1754)
       RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
       Code: ff ff 48 (...)
       RSP: 0018:ffffbda18a4b3568 EFLAGS: 00010287
       RAX: 0000000000000000 RBX: ffff9ca076f5d800 RCX: 0000000000000000
       RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9ca068470410
       RBP: fffffffffffff000 R08: 0000000000000001 R09: 0000000000000000
       R10: ffff9ca079d58040 R11: 0000000000000000 R12: ffff9ca068470400
       R13: ffff9ca0408b2000 R14: 0000000000001000 R15: ffff9ca076f5d800
       FS:  0000000000000000(0000) GS:ffff9ca07a600000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00005605dbfe7048 CR3: 0000000138570006 CR4: 00000000003606f0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        find_free_extent+0x4a0/0x16c0 [btrfs]
        btrfs_reserve_extent+0x91/0x180 [btrfs]
        cow_file_range+0x12d/0x490 [btrfs]
        run_delalloc_nocow+0x341/0xa40 [btrfs]
        btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
        ? find_lock_delalloc_range+0x221/0x250 [btrfs]
        writepage_delalloc+0xe8/0x150 [btrfs]
        __extent_writepage+0xe8/0x4c0 [btrfs]
        extent_write_cache_pages+0x237/0x530 [btrfs]
        ? btrfs_wq_submit_bio+0x9f/0xc0 [btrfs]
        extent_writepages+0x44/0xa0 [btrfs]
        do_writepages+0x23/0x80
        __writeback_single_inode+0x59/0x700
        writeback_sb_inodes+0x267/0x5f0
        __writeback_inodes_wb+0x87/0xe0
        wb_writeback+0x382/0x590
        ? wb_workfn+0x4a2/0x6c0
        wb_workfn+0x4a2/0x6c0
        process_one_work+0x26d/0x6a0
        worker_thread+0x4f/0x3e0
        ? process_one_work+0x6a0/0x6a0
        kthread+0x103/0x140
        ? kthread_create_worker_on_cpu+0x70/0x70
        ret_from_fork+0x3a/0x50
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
       softirqs last  enabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace f9f6ef8ec4cd8ec9 ]---
      
      So to fix this, when falling back into cow mode check if space was not
      reserved, by testing for the bit EXTENT_NORESERVE in the respective file
      range, and if not, increment the bytes_may_use counter for the data
      space_info object. Also clear the EXTENT_NORESERVE bit from the range, so
      that if the cow path fails it decrements the bytes_may_use counter when
      clearing the delalloc range (through the btrfs_clear_delalloc_extent()
      callback).
      
      Fixes: 7ee9e440 ("Btrfs: check if we can nocow if we don't have data space")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      467dc47e
    • Filipe Manana's avatar
      btrfs: fix wrong file range cleanup after an error filling dealloc range · e2c8e92d
      Filipe Manana authored
      If an error happens while running dellaloc in COW mode for a range, we can
      end up calling extent_clear_unlock_delalloc() for a range that goes beyond
      our range's end offset by 1 byte, which affects 1 extra page. This results
      in clearing bits and doing page operations (such as a page unlock) outside
      our target range.
      
      Fix that by calling extent_clear_unlock_delalloc() with an inclusive end
      offset, instead of an exclusive end offset, at cow_file_range().
      
      Fixes: a315e68f ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e2c8e92d
    • Nikolay Borisov's avatar
      btrfs: remove redundant local variable in read_block_for_search · 213ff4b7
      Nikolay Borisov authored
      The local 'b' variable is only used to directly read values from passed
      extent buffer. So eliminate  it and directly use the input parameter.
      Furthermore this shrinks the size of the following functions:
      
      ./scripts/bloat-o-meter ctree.orig fs/btrfs/ctree.o
      add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-73 (-73)
      Function                                     old     new   delta
      read_block_for_search.isra                   876     871      -5
      push_node_left                              1112    1044     -68
      Total: Before=50348, After=50275, chg -0.14%
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      213ff4b7
    • Nikolay Borisov's avatar
      btrfs: open code key_search · 995e9a16
      Nikolay Borisov authored
      This function wraps the optimisation implemented by d7396f07
      ("Btrfs: optimize key searches in btrfs_search_slot") however this
      optimisation is really used in only one place - btrfs_search_slot.
      
      Just open code the optimisation and also add a comment explaining how it
      works since it's not clear just by looking at the code - the key point
      here is it depends on an internal invariant that BTRFS' btree provides,
      namely intermediate pointers always contain the key at slot0 at the
      child node. So in the case of exact match we can safely assume that the
      given key will always be in slot 0 on lower levels.
      
      Furthermore this results in a reduction of btrfs_search_slot's size:
      
      ./scripts/bloat-o-meter ctree.orig fs/btrfs/ctree.o
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-75 (-75)
      Function                                     old     new   delta
      btrfs_search_slot                           2783    2708     -75
      Total: Before=50423, After=50348, chg -0.15%
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      995e9a16
    • Christoph Hellwig's avatar
      btrfs: split btrfs_direct_IO to read and write part · d8f3e735
      Christoph Hellwig authored
      The read and write versions don't have anything in common except for the
      call to iomap_dio_rw.  So split this function, and merge each half into
      its only caller.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d8f3e735
    • Goldwyn Rodrigues's avatar
      btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK · 5f008163
      Goldwyn Rodrigues authored
      Since we now perform direct reads using i_rwsem, we can remove this
      inode flag used to co-ordinate unlocked reads.
      
      The truncate call takes i_rwsem. This means it is correctly synchronized
      with concurrent direct reads.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jth@kernel.org>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5f008163
    • Goldwyn Rodrigues's avatar
      fs: remove dio_end_io() · b75b7ca7
      Goldwyn Rodrigues authored
      Since we removed the last user of dio_end_io(), remove the helper
      function dio_end_io().
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b75b7ca7
    • Goldwyn Rodrigues's avatar
      btrfs: switch to iomap_dio_rw() for dio · a43a67a2
      Goldwyn Rodrigues authored
      Switch from __blockdev_direct_IO() to iomap_dio_rw().
      Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
      as iomap_begin() for iomap direct I/O functions. This function
      allocates and locks all the blocks required for the I/O.
      btrfs_submit_direct() is used as the submit_io() hook for direct I/O
      ops.
      
      Since we need direct I/O reads to go through iomap_dio_rw(), we change
      file_operations.read_iter() to a btrfs_file_read_iter() which calls
      btrfs_direct_IO() for direct reads and falls back to
      generic_file_buffered_read() for incomplete reads and buffered reads.
      
      We don't need address_space.direct_IO() anymore so set it to noop.
      Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
      capable of direct I/O reads from a hole, so we don't need to return
      -ENOENT.
      
      BTRFS direct I/O is now done under i_rwsem, shared in case of reads and
      exclusive in case of writes. This guards against simultaneous truncates.
      
      Use iomap->iomap_end() to check for failed or incomplete direct I/O:
       - for writes, call __endio_write_update_ordered()
       - for reads, unlock extents
      
      btrfs_dio_data is now hooked in iomap->private and not
      current->journal_info. It carries the reservation variable and the
      amount of data submitted, so we can calculate the amount of data to call
      __endio_write_update_ordered in case of an error.
      
      This patch removes last use of struct buffer_head from btrfs.
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a43a67a2
  9. 25 May, 2020 8 commits
    • Goldwyn Rodrigues's avatar
      iomap: remove lockdep_assert_held() · 3ad99bec
      Goldwyn Rodrigues authored
      Filesystems such as btrfs can perform direct I/O without holding the
      inode->i_rwsem in some of the cases like writing within i_size.  So,
      remove the check for lockdep_assert_held() in iomap_dio_rw().
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3ad99bec
    • Goldwyn Rodrigues's avatar
      iomap: add a filesystem hook for direct I/O bio submission · 8cecd0ba
      Goldwyn Rodrigues authored
      This helps filesystems to perform tasks on the bio while submitting for
      I/O. This could be post-write operations such as data CRC or data
      replication for fs-handled RAID.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8cecd0ba
    • Goldwyn Rodrigues's avatar
      fs: export generic_file_buffered_read() · d85dc2e1
      Goldwyn Rodrigues authored
      Export generic_file_buffered_read() to be used to supplement incomplete
      direct reads.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d85dc2e1
    • Filipe Manana's avatar
      btrfs: turn space cache writeout failure messages into debug messages · bbcd1f4d
      Filipe Manana authored
      Since commit 1afb648e ("btrfs: use standard debug config option to
      enable free-space-cache debug prints"), we started to log error messages
      that were never logged before since there was no DEBUG macro defined
      anywhere. This started to make test case btrfs/187 to fail very often,
      as it greps for any btrfs error messages in dmesg/syslog and fails if
      any is found:
      
      (...)
      btrfs/186 1s ...  2s
      btrfs/187       - output mismatch (see .../results//btrfs/187.out.bad)
          \--- tests/btrfs/187.out     2019-05-17 12:48:32.537340749 +0100
          \+++ /home/fdmanana/git/hub/xfstests/results//btrfs/187.out.bad ...
          \@@ -1,3 +1,8 @@
           QA output created by 187
           Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/snap1'
           Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/snap2'
          +[268364.139958] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          +[268380.156503] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          +[268380.161703] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          +[268380.253180] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          ...
          (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/187.out ...
      btrfs/188 4s ...  2s
      (...)
      
      The space cache write failures happen due to ENOSPC when attempting to
      update the free space cache items in the root tree. This happens because
      when starting or joining a transaction we don't know how many block
      groups we will end up changing (due to extent allocation or release) and
      therefore never reserve space for updating free space cache items.
      More often than not, the free space cache writeout succeeds since the
      metadata space info is not yet full nor very close to being full, but
      when it is, the space cache writeout fails with ENOSPC.
      
      Occasional failures to write space caches are not considered critical
      since they can be rebuilt when mounting the filesystem or the next
      attempt to write a free space cache in the next transaction commit might
      succeed, so we used to hide those error messages with a preprocessor
      check for the existence of the DEBUG macro that was never enabled
      anywhere.
      
      A few other generic test cases also trigger the error messages due to
      ENOSPC failure when writing free space caches as well, however they don't
      fail since they don't grep dmesg/syslog for any btrfs specific error
      messages.
      
      So change the messages from 'error' level to 'debug' level, as it doesn't
      make much sense to have error messages triggered only if the debug macro
      is enabled plus, more importantly, the error is not serious nor highly
      unexpected.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bbcd1f4d
    • Filipe Manana's avatar
      btrfs: include error on messages about failure to write space/inode caches · 2e69a7a6
      Filipe Manana authored
      Currently the error messages logged when we fail to write a free space
      cache or an inode cache are not very useful as they don't mention what
      was the error. So include the error number in the messages.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2e69a7a6
    • Filipe Manana's avatar
      btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks() · 918cdf44
      Filipe Manana authored
      The label 'fail_unlock' is pointless, all it does is to jump to the label
      'out', so just remove it.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      918cdf44
    • Filipe Manana's avatar
      btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums · 7e4a3f7e
      Filipe Manana authored
      We are currently treating any non-zero return value from btrfs_next_leaf()
      the same way, by going to the code that inserts a new checksum item in the
      tree. However if btrfs_next_leaf() returns an error (a value < 0), we
      should just stop and return the error, and not behave as if nothing has
      happened, since in that case we do not have a way to know if there is a
      next leaf or we are currently at the last leaf already.
      
      So fix that by returning the error from btrfs_next_leaf().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7e4a3f7e
    • Filipe Manana's avatar
      btrfs: make checksum item extension more efficient · cc14600c
      Filipe Manana authored
      When we want to add checksums into the checksums tree, or a log tree, we
      try whenever possible to extend existing checksum items, as this helps
      reduce amount of metadata space used, since adding a new item uses extra
      metadata space for a btrfs_item structure (25 bytes).
      
      However we have two inefficiencies in the current approach:
      
      1) After finding a checksum item that covers a range with an end offset
         that matches the start offset of the checksum range we want to insert,
         we release the search path populated by btrfs_lookup_csum() and then
         do another COW search on tree with the goal of getting additional
         space for at least one checksum. Doing this path release and then
         searching again is a waste of time because very often the leaf already
         has enough free space for at least one more checksum;
      
      2) After the COW search that guarantees we get free space in the leaf for
         at least one more checksum, we end up not doing the extension of the
         previous checksum item, and fallback to insertion of a new checksum
         item, if the leaf doesn't have an amount of free space larger then the
         space required for 2 checksums plus one btrfs_item structure - this is
         pointless for two reasons:
      
         a) We want to extend an existing item, so we don't need to account for
            a btrfs_item structure (25 bytes);
      
         b) We made the COW search with an insertion size for 1 single checksum,
            so if the leaf ends up with a free space amount smaller then 2
            checksums plus the size of a btrfs_item structure, we give up on the
            extension of the existing item and jump to the 'insert' label, where
            we end up releasing the path and then doing yet another search to
            insert a new checksum item for a single checksum.
      
      Fix these inefficiencies by doing the following:
      
      - For case 1), before releasing the path just check if the leaf already
        has enough space for at least 1 more checksum, and if it does, jump
        directly to the item extension code, with releasing our current path,
        which was already COWed by btrfs_lookup_csum();
      
      - For case 2), fix the logic so that for item extension we require only
        that the leaf has enough free space for 1 checksum, and not a minimum
        of 2 checksums plus space for a btrfs_item structure.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cc14600c