15 Oct, 2018 (40 commits)
    • btrfs: tree-checker: Check level for leaves and nodes · f556faa4
      Qu Wenruo authored
      Although we check the tree level at tree-read time, that check is
      based entirely on the parent's level.
      We still need an accurate level check to keep invalid tree blocks
      from sneaking into kernel space.
      
      The check itself is simple: a leaf's level must always be 0, and a
      node's level must be in the range [1, BTRFS_MAX_LEVEL - 1].
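      The rule above can be sketched in a few lines of userspace C. This is
      a minimal sketch, not the actual patch: the function name and
      parameters are illustrative stand-ins for the kernel's tree-checker,
      which operates on extent buffer headers.

```c
#include <stdbool.h>

#define BTRFS_MAX_LEVEL 8   /* matches the btrfs on-disk format limit */

/*
 * Illustrative level check: a leaf must be level 0, a node must be
 * in [1, BTRFS_MAX_LEVEL - 1]. Names are hypothetical, not kernel API.
 */
static bool tree_block_level_valid(int level, bool is_leaf)
{
	if (is_leaf)
		return level == 0;
	return level >= 1 && level <= BTRFS_MAX_LEVEL - 1;
}
```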
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group · 3d0174f7
      Qu Wenruo authored
      For qgroup_trace_extent_swap(), if we find one leaf that needs to be
      traced, we will also iterate all file extents and trace them.
      
      This is OK if we're relocating data block groups, but if we're
      relocating metadata block groups, the balance code itself has already
      ensured that the subtrees of the file tree and the reloc tree contain
      the same contents.
      
      That is to say, if we're relocating metadata block groups, all file
      extents in the reloc and file trees should match, so there is no need
      to trace them. This should reduce the total number of dirty extents
      processed during metadata block group balance.
      
      [[Benchmark]] (with all previous enhancements)
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
                           | v4.19-rc1    | w/ patchset    | diff (*)
      ---------------------------------------------------------------
      relocated extents    | 22929        | 22851          | -0.3%
      qgroup dirty extents | 227757       | 140886         | -38.1%
      time (sys)           | 65.253s      | 37.464s        | -42.6%
      time (real)          | 74.032s      | 44.722s        | -39.6%
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Don't trace subtree if we're dropping reloc tree · 2cd86d30
      Qu Wenruo authored
      The reloc tree doesn't contribute to qgroup numbers, as they have
      already been accounted at balance time (see replace_path()).
      
      Skipping the unneeded subtree tracing should reduce the overhead.
      
      [[Benchmark]]
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
                           | v4.19-rc1    | w/ patchset    | diff (*)
      ---------------------------------------------------------------
      relocated extents    | 22929        | 22900          | -0.1%
      qgroup dirty extents | 227757       | 167139         | -26.6%
      time (sys)           | 65.253s      | 50.123s        | -23.2%
      time (real)          | 74.032s      | 52.551s        | -29.0%
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Use generation-aware subtree swap to mark dirty extents · 5f527822
      Qu Wenruo authored
      Before this patch, with quota enabled during balance, we need to mark
      the whole subtree dirty for quota.
      
      E.g.
      OO = Old tree blocks (from file tree)
      NN = New tree blocks (from reloc tree)
      
              File tree (src)		          Reloc tree (dst)
                  OO (a)                              NN (a)
                 /  \                                /  \
           (b) OO    OO (c)                    (b) NN    NN (c)
              /  \  /  \                          /  \  /  \
             OO  OO OO OO (d)                    OO  OO OO NN (d)
      
      For the old balance + quota case, quota will mark the whole src and
      dst tree dirty, including all 3 old tree blocks in the reloc tree.
      
      This is tolerable for a small file tree, or when the new tree blocks
      are all located at a lower level.
      
      But for a large file tree, or when the new tree blocks are located at
      a higher level, this means marking the whole tree dirty, which is
      unbelievably slow.
      
      This patch changes how such balance-with-quota cases are handled.
      
      Now we will search from (b) and (c) for any new tree blocks whose
      generation is equal to @last_snapshot, and only mark them dirty.
      
      In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
      (NN(a) will be traced when COW happens for nodeptr modification).  And
      also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW
      happens for nodeptr modification.)
      
      For above case, we could skip 3 tree blocks, but for larger tree, we can
      skip tons of unmodified tree blocks, and hugely speed up balance.
      
      This patch will introduce a new function,
      btrfs_qgroup_trace_subtree_swap(), which will do the following main
      work:
      
      1) Read out real root eb
         And setup basic dst_path for later calls
      2) Call qgroup_trace_new_subtree_blocks()
         To trace all new tree blocks in the reloc tree and their
         counterparts in the file tree.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree · ea49f3e7
      Qu Wenruo authored
      Introduce a new function, qgroup_trace_new_subtree_blocks(), to
      iterate all new tree blocks in a reloc tree, so that qgroup can skip
      unrelated tree blocks during balance. This should hugely speed up
      balance when quota is enabled.
      
      The function qgroup_trace_new_subtree_blocks() itself only cares about
      new tree blocks in reloc tree.
      
      Its main work is:
      
      1) Read out tree blocks according to parent pointers
      
      2) Do a recursive depth-first search
         Call the same function on all child tree blocks, with the search
         level set to the current level - 1, and skip all children whose
         generation is smaller than @last_snapshot.
      
      3) Call qgroup_trace_extent_swap() to trace tree blocks
      
      So although the parameter list refers to the source file tree, those
      parameters are not used directly; they are only passed through to
      qgroup_trace_extent_swap(). Thus, aside from the tree-read code, the
      core is short and consists mostly of the recursive depth-first
      search.
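      The steps above can be sketched as a self-contained walk. The struct
      and helper below are illustrative stand-ins for extent buffers and
      qgroup_trace_new_subtree_blocks(); only the control flow (recurse
      level by level, skip subtrees whose generation is smaller than
      @last_snapshot) mirrors the changelog.

```c
#include <stddef.h>

/* Hypothetical stand-in for an extent buffer. */
struct demo_block {
	unsigned long long generation;
	int level;                       /* 0 for leaves */
	int nr_children;
	struct demo_block **children;
};

/* Return how many new tree blocks were "traced". */
static int trace_new_subtree_blocks(const struct demo_block *blk,
				    unsigned long long last_snapshot)
{
	int i, traced = 1;               /* trace the current block */

	if (blk->level == 0)
		return traced;
	for (i = 0; i < blk->nr_children; i++) {
		const struct demo_block *child = blk->children[i];

		/* Skip whole subtrees that predate the last snapshot. */
		if (child->generation < last_snapshot)
			continue;
		traced += trace_new_subtree_blocks(child, last_snapshot);
	}
	return traced;
}
```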
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Introduce function to trace two swapped extents · 25982561
      Qu Wenruo authored
      Introduce a new function, qgroup_trace_extent_swap(), which will be used
      later for balance qgroup speedup.
      
      The basic idea of balance is swapping tree blocks between the reloc
      tree and the real file tree.
      
      The swap happens at the highest possible tree block, but there may
      still be a lot of tree blocks involved.
      
      For example:
       OO = Old tree blocks
       NN = New tree blocks allocated during balance
      
                File tree (257)                  Reloc tree for 257
      L2              OO                                NN
                    /    \                            /    \
      L1          OO      OO (a)                    OO      NN (a)
                 / \     / \                       / \     / \
      L0       OO   OO OO   OO                   OO   OO NN   NN
                       (b)  (c)                          (b)  (c)
      
      When calling qgroup_trace_extent_swap(), we will pass:
      @src_eb = OO(a)
      @dst_path = [ nodes[1] = NN(a), nodes[0] = NN(c) ]
      @dst_level = 0
      @root_level = 1
      
      In that case, qgroup_trace_extent_swap() will search from OO(a) to
      reach OO(c), then mark both OO(c) and NN(c) as qgroup dirty.
      
      The main work of qgroup_trace_extent_swap() can be split into 3 parts:
      
      1) Tree search from @src_eb
         It acts as a simplified btrfs_search_slot().
         The key for the search can be extracted from
         @dst_path->nodes[dst_level] (first key).
      
      2) Mark the final tree blocks in @src_path and @dst_path qgroup dirty
         NOTE: In the above case, OO(a) and NN(a) won't be marked qgroup
         dirty. They should have been marked during the previous
         (@dst_level = 1) iteration.
      
      3) Mark file extents in leaves dirty
         We don't have a good way to pick out only the new file extents, so
         we still follow the old method of scanning all file extents in the
         leaf.
      
      This function frees us from keeping two paths; later we only need to
      care about how to iterate over all new tree blocks in the reloc tree.
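      Part 1) above can be illustrated with a simplified descent. The
      struct, field names, and helper are hypothetical; the real code walks
      extent buffers and compares btrfs keys. At each level we pick the
      last slot whose first key is <= the search key (taken from
      @dst_path), until @dst_level is reached.

```c
/* Hypothetical node: first key of each child plus child pointers. */
struct demo_node {
	int nr_items;
	unsigned long long first_key[16];
	struct demo_node *child[16];
};

static struct demo_node *search_to_level(struct demo_node *root,
					 int root_level,
					 unsigned long long key,
					 int dst_level)
{
	struct demo_node *cur = root;
	int level;

	for (level = root_level; level > dst_level; level--) {
		int slot = 0, i;

		/* Linear scan stands in for the binary key search. */
		for (i = 1; i < cur->nr_items; i++)
			if (cur->first_key[i] <= key)
				slot = i;
		cur = cur->child[slot];
	}
	return cur;
}
```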
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ copy changelog to function comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted · c337e7b0
      Qu Wenruo authored
      The number of qgroup dirty extents is directly linked to the
      performance overhead, so add a new trace event,
      trace_qgroup_num_dirty_extents(), to record how many dirty extents
      are processed in btrfs_qgroup_account_extents().
      
      This will be pretty handy for analyzing the balance performance
      improvements that follow.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relocation: Add basic extent backref related comments for build_backref_tree · fa6ac715
      Qu Wenruo authored
      fs/btrfs/relocation.c:build_backref_tree() dates from the 2009 era.
      Although it works fine, it's not easy to understand, especially
      combined with the complex btrfs backref format.
      
      This patch adds basic comments for the backref-build part of the
      code, making it less hard to read, at least for the backref-searching
      part.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: get rid of btrfs_symlink_aops · 4779cc04
      Omar Sandoval authored
      The only aops we define for symlinks are identical to the aops for
      regular files. This has been the case since symlink support was added in
      commit 2b8d99a7 ("Btrfs: symlinks and hard links"). As far as I can
      tell, there wasn't a good reason to have separate aops then, and there
      isn't now, so let's just do what most other filesystems do and reuse the
      same structure.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: don't clean dirty pages during buffered writes · 7703bdd8
      Chris Mason authored
      During buffered writes, we follow this basic series of steps:
      
      again:
      	lock all the pages
      	wait for writeback on all the pages
      	Take the extent range lock
      	wait for ordered extents on the whole range
      	clean all the pages
      
      	if (copy_from_user_in_atomic() hits a fault) {
      		drop our locks
      		goto again;
      	}
      
      	dirty all the pages
      	release all the locks
      
      The extra waiting, cleaning and locking are there to make sure we don't
      modify pages in flight to the drive, after they've been crc'd.
      
      If some of the pages in the range were already dirty when the write
      began, and we need to goto again, we create a window where a dirty page
      has been cleaned and unlocked.  It may be reclaimed before we're able to
      lock it again, which means we'll read the old contents off the drive and
      lose any modifications that had been pending writeback.
      
      We don't actually need to clean the pages.  All of the other locking in
      place makes sure we don't start IO on the pages, so we can just leave
      them dirty for the duration of the write.
      
      Fixes: 73d59314 (the original btrfs merge)
      CC: stable@vger.kernel.org # v4.4+
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use common helper instead of open coding a bit test · 818255fe
      David Sterba authored
      The helper does the same math, and it also takes care of the special
      case when flags is 0.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: refactor __btrfs_run_delayed_refs loop · 0110a4c4
      Nikolay Borisov authored
      Refactor the delayed refs loop by using the newly introduced
      btrfs_run_delayed_refs_for_head function. This greatly simplifies
      __btrfs_run_delayed_refs and makes it more obvious what is happening.
      
      We now have one loop that iterates the existing delayed heads, and
      each selected ref head is processed by the new helper. All existing
      semantics of the code are preserved, so there are no functional
      changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Factor out loop processing all refs of a head · e7261386
      Nikolay Borisov authored
      This patch introduces a new helper encompassing the implicit inner loop
      in __btrfs_run_delayed_refs which processes all the refs for a given
      head. The code is mostly copy/paste, the only difference is that if we
      detect a newer reference then -EAGAIN is returned so that callers can
      react correctly.
      
      Also, at the end of the loop the head is relocked and
      btrfs_merge_delayed_refs is run again to retain the pre-refactoring
      semantics.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Factor out ref head locking code in __btrfs_run_delayed_refs · b1cdbcb5
      Nikolay Borisov authored
      This is in preparation to refactor the giant loop in
      __btrfs_run_delayed_refs. As a first step define a new function
      which implements acquiring a reference to a btrfs_delayed_refs_head and
      use it. No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tests: polish ifdefs around testing helper · b2fa1154
      David Sterba authored
      Avoid the inline ifdefs and use two sections for self-tests enabled and
      disabled.
      
      Though we could drop the ifdef and test BTRFS_FS_STATE_DUMMY_FS_INFO
      unconditionally, the static inline helps the compiler optimize out
      any code that depends on conditions using btrfs_is_testing.
      
      As this is only for the testing code, drop unlikely().
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • a654666a
    • btrfs: tests: move testing members of struct btrfs_root to the end · 57ec5fb4
      David Sterba authored
      The data used only for tests are better placed at the end of the
      structure so that they don't change the structure layout. All new
      members of btrfs_root should be placed before them.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tests: add separate stub for find_lock_delalloc_range · 9c36396c
      David Sterba authored
      The helper find_lock_delalloc_range is now conditionally built
      static, depending on whether the self-tests are enabled or not.
      There's a macro that is supposed to hide the export, used only once.
      To discourage further use, drop it and add a public wrapper for the
      helper needed by tests.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: preftree: use rb_first_cached · ecf160b4
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      While resolving indirect refs and missing refs, we always look for
      the first rb entry in a while loop, so it's helpful to use
      rb_first_cached instead.
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: extent_map: use rb_first_cached · 07e1ce09
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the
      same job as rb_first() but in O(1).
      
      As evict_inode_truncate_pages() removes all extent mapping by always
      looking for the first rb entry, it's helpful to use rb_first_cached
      instead.
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root · 03a1d4c8
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the same job as
      rb_first() but in O(1).
      
      Functions manipulating delayed_item need to get the first entry; this
      converts them to use rb_first_cached().
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: delayed-refs: use rb_first_cached for ref_tree · e3d03965
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      Functions manipulating href->ref_tree need to get the first entry;
      this converts href->ref_tree to use rb_first_cached().
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: delayed-refs: use rb_first_cached for href_root · 5c9d028b
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      Functions manipulating href_root need to get the first entry; this
      converts href_root to use rb_first_cached().
      
      This patch is the first in a sequence of similar updates to other
      rbtrees, and here is an analysis of the expected behaviour and
      improvements.
      
      There's a common pattern:
      
      while (node = rb_first) {
              entry = rb_entry(node)
              next = rb_next(node)
              rb_erase(node)
              cleanup(entry)
      }
      
      rb_first needs to traverse the tree up to logN depth, rb_erase can
      completely reshuffle the tree. With the caching we'll skip the traversal
      in rb_first.  That's a cached memory access vs looped pointer
      dereference trade-off that IMHO has a clear winner.
      
      Measurements show there's not much difference in a sample tree with
      10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of
      caching and pointer chasing are unpredictable though.
      
      Further optimizations can be done to avoid the expensive rb_erase
      step. In some cases it's OK to process the nodes in any order, so the
      tree can be traversed in post-order, without rebalancing the child
      nodes and just calling free. Care must be taken regarding the next
      node.
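      The caching idea can be shown with a userspace toy: keep a "leftmost"
      pointer up to date on insertion so that fetching the smallest node is
      O(1) instead of an O(log N) descent. This is a plain unbalanced BST,
      not the kernel rbtree, and all names here are illustrative; note that
      erasing a node would also have to refresh the cached pointer, just as
      rb_erase_cached() does.

```c
#include <stddef.h>

struct cached_node {
	long key;
	struct cached_node *left, *right;
};

struct cached_tree {
	struct cached_node *root;
	struct cached_node *leftmost;    /* cached smallest node */
};

static void cached_insert(struct cached_tree *t, struct cached_node *n)
{
	struct cached_node **link = &t->root;
	int leftmost = 1;

	n->left = n->right = NULL;
	while (*link) {
		if (n->key < (*link)->key) {
			link = &(*link)->left;
		} else {
			link = &(*link)->right;
			leftmost = 0;    /* went right at least once */
		}
	}
	*link = n;
	if (leftmost)
		t->leftmost = n;         /* maintain the cache on insert */
}

/* The O(1) counterpart of rb_first(). */
static struct cached_node *cached_first(struct cached_tree *t)
{
	return t->leftmost;
}
```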
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog from mail discussions ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: wait on caching when putting the bg cache · 3aa7c7a3
      Josef Bacik authored
      While testing my backport I noticed there was a panic if I ran
      generic/416, generic/417, and generic/418 all in a row. This happened
      to uncover a race where we had outstanding IO after destroying all of
      our workqueues, and then we'd try to queue endio work on those freed
      workqueues.
      
      This is because we aren't waiting for the caching threads to be done
      before freeing everything up, so to fix this make sure we wait on any
      outstanding caching that's being done before we free up the block group,
      so we're sure to be done with all IO by the time we get to
      btrfs_stop_all_workers().  This fixes the panic I was seeing
      consistently in testing.
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/volumes.c:6112!
      SMP PTI
      Modules linked in:
      CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      Workqueue: btrfs-cache btrfs_cache_helper
      RIP: 0010:btrfs_map_bio+0x346/0x370
      RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202
      RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000
      RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000
      RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000
      R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00
      FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       btree_submit_bio_hook+0x8a/0xd0
       submit_one_bio+0x5d/0x80
       read_extent_buffer_pages+0x18a/0x320
       btree_read_extent_buffer_pages+0xbc/0x200
       ? alloc_extent_buffer+0x359/0x3e0
       read_tree_block+0x3d/0x60
       read_block_for_search.isra.30+0x1a5/0x360
       btrfs_search_slot+0x41b/0xa10
       btrfs_next_old_leaf+0x212/0x470
       caching_thread+0x323/0x490
       normal_work_helper+0xc5/0x310
       process_one_work+0x141/0x340
       worker_thread+0x44/0x3c0
       kthread+0xf8/0x130
       ? process_one_work+0x340/0x340
       ? kthread_bind+0x10/0x10
       ret_from_fork+0x35/0x40
      RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0
      ---[ end trace 827eb13e50846033 ]---
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: disabled
      ---[ end Kernel panic - not syncing: Fatal exception
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: keep trim from interfering with transaction commits · fee7acc3
      Jeff Mahoney authored
      Commit 499f377f (btrfs: iterate over unused chunk space in FITRIM)
      fixed free space trimming, but introduced latency when it was running.
      This is due to it pinning the transaction using both an incremented
      refcount and the commit root sem, held for the duration of a single
      trim operation.
      
      This was to ensure safety, but it's unnecessary.  We already hold the
      chunk mutex, so we know that the chunk we're using can't be allocated
      while we're trimming it.
      
      In order to check against chunks already allocated in this
      transaction, we need to check the pending chunks list.  To do that
      safely without joining the transaction (or attaching and then having
      to commit it), we need to ensure that the dev root's commit root
      doesn't change underneath us and that the pending chunk list stays
      around until we're done with it.
      
      We can ensure the former by holding the commit root sem and the latter
      by pinning the transaction.  We do this now, but the critical section
      covers the trim operation itself and we don't need to do that.
      
      This patch moves the pinning and unpinning logic into helpers and unpins
      the transaction after performing the search and check for pending
      chunks.
      
      Limiting the critical section of the transaction pinning improves the
      latency substantially on slower storage (e.g. image files over NFS).
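      The shape of the change can be sketched with stubs standing in for
      the real helpers (the patch introduces pin/unpin helpers; these names
      are illustrative, not the kernel's). The transaction is pinned only
      around the commit-root search and pending-chunk check, not around the
      trim itself.

```c
static int pinned;                  /* 1 while the transaction is pinned */
static int trimmed_while_pinned;    /* should stay 0 in the new scheme */

static void pin_transaction(void)   { pinned = 1; }
static void unpin_transaction(void) { pinned = 0; }

static void search_and_check_pending_chunks(void)
{
	/* needs the pin: reads the commit root and pending chunk list */
}

static void trim_one_range(void)
{
	trimmed_while_pinned += pinned;
}

static void trim_range_new_scheme(void)
{
	pin_transaction();
	search_and_check_pending_chunks();
	unpin_transaction();            /* drop the pin before trimming */
	trim_one_range();               /* slow part runs unpinned */
}
```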
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't attempt to trim devices that don't support it · 0be88e36
      Jeff Mahoney authored
      We check whether any device the file system is using supports discard in
      the ioctl call, but then we attempt to trim free extents on every device
      regardless of whether discard is supported.  Due to the way we mask off
      EOPNOTSUPP, we can end up issuing the trim operations on each free range
      on devices that don't support it, just wasting time.
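      The behavior change amounts to skipping non-discard devices up front
      instead of issuing trims and masking EOPNOTSUPP afterwards. The
      struct and the supports_discard flag below are illustrative; the
      kernel checks the device's request queue.

```c
/* Hypothetical device descriptor for illustration. */
struct demo_device {
	int supports_discard;
	int trim_calls;              /* counts trim attempts */
};

static void trim_free_extents_all(struct demo_device *devs, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (!devs[i].supports_discard)
			continue;    /* don't waste time on no-op trims */
		devs[i].trim_calls++;
	}
}
```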
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: iterate all devices during trim, instead of fs_devices::alloc_list · d4e329de
      Jeff Mahoney authored
      btrfs_trim_fs iterates over the fs_devices->alloc_list while holding the
      device_list_mutex.  The problem is that ->alloc_list is protected by the
      chunk mutex.  We don't want to hold the chunk mutex over the trim of the
      entire file system.  Fortunately, the ->dev_list list is protected by
      the dev_list mutex and while it will give us all devices, including
      read-only devices, we already just skip the read-only devices.  Then we
      can continue to take and release the chunk mutex while scanning each
      device.
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Ensure btrfs_trim_fs can trim the whole filesystem · 6ba9fc8e
      Qu Wenruo authored
      [BUG]
      fstrim on some btrfs filesystems only trims the unallocated space,
      without trimming any space in existing block groups.
      
      [CAUSE]
      Before the fstrim_range is passed to btrfs_trim_fs(), it gets
      truncated to the range [0, super->total_bytes), so btrfs_trim_fs()
      will only be able to trim block groups in that range.
      
      But for btrfs, any bytenr aligned to sectorsize is valid, since btrfs
      uses its own logical address space, and there is nothing limiting
      where block groups are placed.
      
      For a filesystem with frequent balance, it's quite easy to relocate
      all block groups so that their bytenr starts beyond
      super->total_bytes.
      
      In that case, btrfs will not trim the existing block groups.
      
      [FIX]
      Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
      can get the unmodified range, which is normally set to [0, U64_MAX].
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Fixes: f4c697e6 ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
      CC: <stable@vger.kernel.org> # v4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6ba9fc8e
    • btrfs: Enhance btrfs_trim_fs function to handle error better · 93bba24d
      Qu Wenruo authored
      Function btrfs_trim_fs() doesn't handle errors in a consistent way.
      If an error happens while trimming existing block groups, it skips
      the remaining block groups and continues to trim unallocated space
      on each device.
      
      The return value will only reflect the final error from device
      trimming.
      
      This patch fixes that behavior by:
      
      1) Recording the last error from block group or device trimming,
         so the return value also reflects the last error encountered
         during trimming, making developers more aware of the problem.
      
      2) Continuing trimming if possible.
         If we fail to trim one block group or device, we can still try
         the next block group or device.
      
      3) Reporting the number of failures during block group and device
         trimming.  This is less noisy, but still gives the user a brief
         summary of what went wrong.
      
      Such behavior avoids confusion in cases where, say, trimming the
      first block group fails and only unallocated space ends up trimmed.
      Reported-by: Chris Murphy <lists@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add bg_ret and dev_ret to the messages ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      93bba24d
    • btrfs: fix error handling in btrfs_dev_replace_start · 5c061471
      Jeff Mahoney authored
      When we fail to start a transaction in btrfs_dev_replace_start, we
      leave dev_replace->replace_state set to STARTED but clear ->srcdev
      and ->tgtdev.  Later, that can result in an oops in
      btrfs_dev_replace_progress, since a state of STARTED or SUSPENDED
      implies that ->srcdev is valid.
      
      Also fix error handling when the state is already STARTED or
      SUSPENDED while starting.  That path, too, will clear ->srcdev and
      ->tgtdev even though it doesn't own them.  This should be an
      impossible case to hit since we should be protected by the
      BTRFS_FS_EXCL_OP bit being set.  Let's add an ASSERT there while
      we're at it.
      
      Fixes: e93c89c1 (Btrfs: add new sources for device replace code)
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5c061471
    • btrfs: change remove_extent_mapping to return void · c1766dd7
      zhong jiang authored
      remove_extent_mapping uses the variable "ret" for its return value,
      but it is never modified after initialization.  Furthermore, none
      of the callers handle the return value, and the callees are only
      simple functions, so the return value does not need to be passed.
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c1766dd7
    • btrfs: handle error of get_old_root · 315bed43
      Nikolay Borisov authored
      In btrfs_search_old_slot, get_old_root is always used with the
      assumption that it cannot fail.  However, in rare circumstances it
      can fail and return NULL, which leads to a NULL pointer dereference
      when the header is read.  Fix this by checking the return value and
      handling NULL properly: set ret to -EIO and return gracefully.
      
      Coverity-id: 1087503
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      315bed43
    • btrfs: Remove logically dead code from btrfs_orphan_cleanup · 28bee489
      Nikolay Borisov authored
      In btrfs_orphan_cleanup, the final 'if (ret) goto out' can never be
      executed.  This is because the last assignment to 'ret' depends on
      the return value of btrfs_iget.  If an error other than -ENOENT is
      returned, the loop is prematurely terminated by 'goto out'.  On the
      other hand, if the error value is -ENOENT, a subsequent if branch
      is executed that always re-assigns 'ret' and, in case of an error,
      just terminates the loop.  No functional changes.
      
      Coverity-id: 1437392
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      28bee489
    • Btrfs: remove wait_ordered_range in btrfs_evict_inode · 4183c52c
      Liu Bo authored
      When we delete an inode,
      
      btrfs_evict_inode() {
          truncate_inode_pages_final()
              truncate_inode_pages_range()
                  lock_page()
                  truncate_cleanup_page()
                      btrfs_invalidatepage()
                          wait_on_page_writeback()
                              btrfs_lookup_ordered_range()
                      cancel_dirty_page()
                  unlock_page()
          ...
          btrfs_wait_ordered_range()
          ...
      
      As the VFS has already called ->invalidatepage() to get all ordered
      extents done (if there are any) and truncated all page cache pages
      (leaving no dirty pages to write back after this step),
      btrfs_wait_ordered_range() is just a no-op.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4183c52c
    • Btrfs: skip set_page_dirty if eb pages are already dirty · abb57ef3
      Liu Bo authored
      As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
      are dirty, so there is no need to set the pages dirty again.
      
      Ftrace showed that the loop took 10us on my dev box, so removing it
      saves us at least 10us when eb is already dirty, and otherwise
      avoids potentially expensive calls to set_page_dirty.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      abb57ef3
    • Btrfs: assert page dirty bit on extent buffer pages · 51995c39
      Liu Bo authored
      Just in case someone breaks the rule that pages are dirty as long
      as the eb is dirty.  The next patch will dirty the pages
      conditionally.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      51995c39
    • Btrfs: remove unnecessary level check in balance_level · 98e6b1eb
      Liu Bo authored
      In the callchain:
      
      btrfs_search_slot()
         if (level != 0)
            setup_nodes_for_search()
                balance_level()
      
      It is simply impossible to have level == 0 in balance_level(), so
      we can drop the check.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      98e6b1eb
    • Btrfs: unify error handling of btrfs_lookup_dir_item · 3cf5068f
      Liu Bo authored
      Unify the error handling of directory item lookups using IS_ERR_OR_NULL.
      No functional changes.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3cf5068f
    • Btrfs: use args in the correct order for kcalloc in btrfsic_read_block · 3b2fd801
      Liu Bo authored
      kcalloc is defined as:
      
        kcalloc(size_t n, size_t size, gfp_t flags)
      
      Although this doesn't cause problems in practice,
      btrfsic_read_block() passes n and size to kcalloc in the opposite
      order.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3b2fd801
    • btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly · a27a94c2
      Nikolay Borisov authored
      Instead of returning an error value and using one of the parameters for
      returning the actual object we are interested in just refactor the
      function to directly return btrfs_device *. Also bubble up the error
      handling for the special BTRFS_ERROR_DEV_MISSING_NOT_FOUND value into
      btrfs_rm_device. No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a27a94c2