1. 25 Feb, 2019 40 commits
    • Anand Jain's avatar
      btrfs: scrub: add scrub_lock lockdep check in scrub_workers_get · eb4318e5
      Anand Jain authored
      scrub_workers_refcnt is protected by scrub_lock, add lockdep_assert_held()
      in scrub_workers_get().
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Suggested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eb4318e5
    • Anand Jain's avatar
      btrfs: scrub: fix circular locking dependency warning · 1cec3f27
      Anand Jain authored
      This fixes a longstanding lockdep warning triggered by
      fstests/btrfs/011.
      
      Circular locking dependency check reports warning[1], that's because the
      btrfs_scrub_dev() calls the stack #0 below with, the fs_info::scrub_lock
      held. The test case leading to this warning:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /btrfs
        $ btrfs scrub start -B /btrfs
      
      In fact we have fs_info::scrub_workers_refcnt to track if the init and destroy
      of the scrub workers are needed. So once we have incremented and decremented
      the fs_info::scrub_workers_refcnt value in the thread, its ok to drop the
      scrub_lock, and then actually do the btrfs_destroy_workqueue() part. So this
      patch drops the scrub_lock before calling btrfs_destroy_workqueue().
      
        [359.258534] ======================================================
        [359.260305] WARNING: possible circular locking dependency detected
        [359.261938] 5.0.0-rc6-default #461 Not tainted
        [359.263135] ------------------------------------------------------
        [359.264672] btrfs/20975 is trying to acquire lock:
        [359.265927] 00000000d4d32bea ((wq_completion)"%s-%s""btrfs", name){+.+.}, at: flush_workqueue+0x87/0x540
        [359.268416]
        [359.268416] but task is already holding lock:
        [359.270061] 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
        [359.272418]
        [359.272418] which lock already depends on the new lock.
        [359.272418]
        [359.274692]
        [359.274692] the existing dependency chain (in reverse order) is:
        [359.276671]
        [359.276671] -> #3 (&fs_info->scrub_lock){+.+.}:
        [359.278187]        __mutex_lock+0x86/0x9c0
        [359.279086]        btrfs_scrub_pause+0x31/0x100 [btrfs]
        [359.280421]        btrfs_commit_transaction+0x1e4/0x9e0 [btrfs]
        [359.281931]        close_ctree+0x30b/0x350 [btrfs]
        [359.283208]        generic_shutdown_super+0x64/0x100
        [359.284516]        kill_anon_super+0x14/0x30
        [359.285658]        btrfs_kill_super+0x12/0xa0 [btrfs]
        [359.286964]        deactivate_locked_super+0x29/0x60
        [359.288242]        cleanup_mnt+0x3b/0x70
        [359.289310]        task_work_run+0x98/0xc0
        [359.290428]        exit_to_usermode_loop+0x83/0x90
        [359.291445]        do_syscall_64+0x15b/0x180
        [359.292598]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.294011]
        [359.294011] -> #2 (sb_internal#2){.+.+}:
        [359.295432]        __sb_start_write+0x113/0x1d0
        [359.296394]        start_transaction+0x369/0x500 [btrfs]
        [359.297471]        btrfs_finish_ordered_io+0x2aa/0x7c0 [btrfs]
        [359.298629]        normal_work_helper+0xcd/0x530 [btrfs]
        [359.299698]        process_one_work+0x246/0x610
        [359.300898]        worker_thread+0x3c/0x390
        [359.302020]        kthread+0x116/0x130
        [359.303053]        ret_from_fork+0x24/0x30
        [359.304152]
        [359.304152] -> #1 ((work_completion)(&work->normal_work)){+.+.}:
        [359.306100]        process_one_work+0x21f/0x610
        [359.307302]        worker_thread+0x3c/0x390
        [359.308465]        kthread+0x116/0x130
        [359.309357]        ret_from_fork+0x24/0x30
        [359.310229]
        [359.310229] -> #0 ((wq_completion)"%s-%s""btrfs", name){+.+.}:
        [359.311812]        lock_acquire+0x90/0x180
        [359.312929]        flush_workqueue+0xaa/0x540
        [359.313845]        drain_workqueue+0xa1/0x180
        [359.314761]        destroy_workqueue+0x17/0x240
        [359.315754]        btrfs_destroy_workqueue+0x57/0x200 [btrfs]
        [359.317245]        scrub_workers_put+0x2c/0x60 [btrfs]
        [359.318585]        btrfs_scrub_dev+0x336/0x590 [btrfs]
        [359.319944]        btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
        [359.321622]        btrfs_ioctl+0x28a4/0x2e40 [btrfs]
        [359.322908]        do_vfs_ioctl+0xa2/0x6d0
        [359.324021]        ksys_ioctl+0x3a/0x70
        [359.325066]        __x64_sys_ioctl+0x16/0x20
        [359.326236]        do_syscall_64+0x54/0x180
        [359.327379]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.328772]
        [359.328772] other info that might help us debug this:
        [359.328772]
        [359.330990] Chain exists of:
        [359.330990]   (wq_completion)"%s-%s""btrfs", name --> sb_internal#2 --> &fs_info->scrub_lock
        [359.330990]
        [359.334376]  Possible unsafe locking scenario:
        [359.334376]
        [359.336020]        CPU0                    CPU1
        [359.337070]        ----                    ----
        [359.337821]   lock(&fs_info->scrub_lock);
        [359.338506]                                lock(sb_internal#2);
        [359.339506]                                lock(&fs_info->scrub_lock);
        [359.341461]   lock((wq_completion)"%s-%s""btrfs", name);
        [359.342437]
        [359.342437]  *** DEADLOCK ***
        [359.342437]
        [359.343745] 1 lock held by btrfs/20975:
        [359.344788]  #0: 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
        [359.346778]
        [359.346778] stack backtrace:
        [359.347897] CPU: 0 PID: 20975 Comm: btrfs Not tainted 5.0.0-rc6-default #461
        [359.348983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
        [359.350501] Call Trace:
        [359.350931]  dump_stack+0x67/0x90
        [359.351676]  print_circular_bug.isra.37.cold.56+0x15c/0x195
        [359.353569]  check_prev_add.constprop.44+0x4f9/0x750
        [359.354849]  ? check_prev_add.constprop.44+0x286/0x750
        [359.356505]  __lock_acquire+0xb84/0xf10
        [359.357505]  lock_acquire+0x90/0x180
        [359.358271]  ? flush_workqueue+0x87/0x540
        [359.359098]  flush_workqueue+0xaa/0x540
        [359.359912]  ? flush_workqueue+0x87/0x540
        [359.360740]  ? drain_workqueue+0x1e/0x180
        [359.361565]  ? drain_workqueue+0xa1/0x180
        [359.362391]  drain_workqueue+0xa1/0x180
        [359.363193]  destroy_workqueue+0x17/0x240
        [359.364539]  btrfs_destroy_workqueue+0x57/0x200 [btrfs]
        [359.365673]  scrub_workers_put+0x2c/0x60 [btrfs]
        [359.366618]  btrfs_scrub_dev+0x336/0x590 [btrfs]
        [359.367594]  ? start_transaction+0xa1/0x500 [btrfs]
        [359.368679]  btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
        [359.369545]  btrfs_ioctl+0x28a4/0x2e40 [btrfs]
        [359.370186]  ? __lock_acquire+0x263/0xf10
        [359.370777]  ? kvm_clock_read+0x14/0x30
        [359.371392]  ? kvm_sched_clock_read+0x5/0x10
        [359.372248]  ? sched_clock+0x5/0x10
        [359.372786]  ? sched_clock_cpu+0xc/0xc0
        [359.373662]  ? do_vfs_ioctl+0xa2/0x6d0
        [359.374552]  do_vfs_ioctl+0xa2/0x6d0
        [359.375378]  ? do_sigaction+0xff/0x250
        [359.376233]  ksys_ioctl+0x3a/0x70
        [359.376954]  __x64_sys_ioctl+0x16/0x20
        [359.377772]  do_syscall_64+0x54/0x180
        [359.378841]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.380422] RIP: 0033:0x7f5429296a97
      
      Backporting to older kernels: scrub_nocow_workers must be freed the same
      way as the others.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      [ update changelog ]
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1cec3f27
    • Anand Jain's avatar
      btrfs: fix comment its device list mutex not volume lock · 7faad6e2
      Anand Jain authored
      We have killed volume mutex (commit: dccdb07b
      btrfs: kill btrfs_fs_info::volume_mutex). This a trival one seems to have
      escaped.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7faad6e2
    • Qu Wenruo's avatar
      btrfs: extent_io: Kill the forward declaration of flush_write_bio · bb58eb9e
      Qu Wenruo authored
      There is no need to forward declare flush_write_bio(), as it only
      depends on submit_one_bio().  Both of them are pretty small, just move
      them to kill the forward declaration.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bb58eb9e
    • Nikolay Borisov's avatar
      btrfs: Fix grossly misleading argument names in extent io search · 352646c7
      Nikolay Borisov authored
      The variables and function parameters of __etree_search which pertain to
      prev/next are grossly misnamed. Namely, prev_ret holds the next state
      and not the previous. Similarly, next_ret actually holds the previous
      extent state relating to the offset we are interested in. Fix this by
      renaming the variables as well as switching the arguments order. No
      functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      352646c7
    • Nikolay Borisov's avatar
      btrfs: Remove EXTENT_FIRST_DELALLOC bit · ba8f5206
      Nikolay Borisov authored
      With the refactoring introduced in 8b62f87b ("Btrfs: reworki
      outstanding_extents") this flag became unused. Remove it and renumber
      the following flags accordingly. No functional changes.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ba8f5206
    • Nikolay Borisov's avatar
      btrfs: use WARN_ON in a canonical form btrfs_remove_block_group · 9a0ec83d
      Nikolay Borisov authored
      There is no point in using a construct like 'if (!condition)
      WARN_ON(1)'. Use WARN_ON(!condition) directly. No functional changes.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9a0ec83d
    • Josef Bacik's avatar
      btrfs: reserve extra space during evict · 260e7702
      Josef Bacik authored
      We could generate a lot of delayed refs in evict but never have any left
      over space from our block rsv to make up for that fact.  So reserve some
      extra space and give it to the transaction so it can be used to refill
      the delayed refs rsv every loop through the truncate path.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      260e7702
    • Josef Bacik's avatar
      btrfs: be more explicit about allowed flush states · 8a1bbe1d
      Josef Bacik authored
      For FLUSH_LIMIT flushers we really can only allocate chunks and flush
      delayed inode items, everything else is problematic.  I added a bunch of
      new states and it lead to weirdness in the FLUSH_LIMIT case because I
      forgot about how it worked.  So instead explicitly declare the states
      that are ok for flushing with FLUSH_LIMIT and use that for our state
      machine.  Then as we add new things that are safe we can just add them
      to this list.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8a1bbe1d
    • Josef Bacik's avatar
      btrfs: loop in inode_rsv_refill · 5df11363
      Josef Bacik authored
      With severe fragmentation we can end up with our inode rsv size being
      huge during writeout, which would cause us to need to make very large
      metadata reservations.
      
      However we may not actually need that much once writeout is complete,
      because of the over-reservation for the worst case.
      
      So instead try to make our reservation, and if we couldn't make it
      re-calculate our new reservation size and try again.  If our reservation
      size doesn't change between tries then we know we are actually out of
      space and can error. Flushing that could have been running in parallel
      did not make any space.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ rename to calc_refill_bytes, update comment and changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5df11363
    • Josef Bacik's avatar
      btrfs: don't enospc all tickets on flush failure · f91587e4
      Josef Bacik authored
      With the introduction of the per-inode block_rsv it became possible to
      have really really large reservation requests made because of data
      fragmentation.  Since the ticket stuff assumed that we'd always have
      relatively small reservation requests it just killed all tickets if we
      were unable to satisfy the current request.
      
      However, this is generally not the case anymore.  So fix this logic to
      instead see if we had a ticket that we were able to give some
      reservation to, and if we were continue the flushing loop again.
      
      Likewise we make the tickets use the space_info_add_old_bytes() method
      of returning what reservation they did receive in hopes that it could
      satisfy reservations down the line.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f91587e4
    • Josef Bacik's avatar
      btrfs: don't use global reserve for chunk allocation · 450114fc
      Josef Bacik authored
      We've done this forever because of the voodoo around knowing how much
      space we have.  However, we have better ways of doing this now, and on
      normal file systems we'll easily have a global reserve of 512MiB, and
      since metadata chunks are usually 1GiB that means we'll allocate
      metadata chunks more readily.  Instead use the actual used amount when
      determining if we need to allocate a chunk or not.
      
      This has a side effect for mixed block group fs'es where we are no
      longer allocating enough chunks for the data/metadata requirements.  To
      deal with this add a ALLOC_CHUNK_FORCE step to the flushing state
      machine.  This will only get used if we've already made a full loop
      through the flushing machinery and tried committing the transaction.
      
      If we have then we can try and force a chunk allocation since we likely
      need it to make progress.  This resolves issues I was seeing with
      the mixed bg tests in xfstests without the new flushing state.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      450114fc
    • Josef Bacik's avatar
      btrfs: dump block_rsv details when dumping space info · b78e5616
      Josef Bacik authored
      For enospc_debug having the block rsvs is super helpful to see if we've
      done something wrong.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b78e5616
    • Josef Bacik's avatar
      btrfs: check if there are free block groups for commit · d89dbefb
      Josef Bacik authored
      may_commit_transaction will skip committing the transaction if we don't
      have enough pinned space or if we're trying to find space for a SYSTEM
      chunk.  However, if we have pending free block groups in this transaction
      we still want to commit as we may be able to allocate a chunk to make
      our reservation.  So instead of just returning ENOSPC, check if we have
      free block groups pending, and if so commit the transaction to allow us
      to use that free space.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d89dbefb
    • Dennis Zhou's avatar
      btrfs: add zstd compression level support · 3f93aef5
      Dennis Zhou authored
      Zstd compression requires different amounts of memory for each level of
      compression. The prior patches implemented indirection to allow for each
      compression type to manage their workspaces independently. This patch
      uses this indirection to implement compression level support for zstd.
      
      To manage the additional memory require, each compression level has its
      own queue of workspaces. A global LRU is used to help with reclaim.
      Reclaim is done via a timer which provides a mechanism to decrease
      memory utilization by keeping only workspaces around that are sized
      appropriately. Forward progress is guaranteed by a preallocated max
      workspace hidden from the LRU.
      
      When getting a workspace, it uses a bitmap to identify the levels that
      are populated and scans up. If it finds a workspace that is greater than
      it, it uses it, but does not update the last_used time and the
      corresponding place in the LRU. If we hit memory pressure, we sleep on
      the max level workspace. We continue to rescan in case we can use a
      smaller workspace, but eventually should be able to obtain the max level
      workspace or allocate one again should memory pressure subside.
      
      The memory requirement for decompression is the same as level 1, and
      therefore can use any of available workspace.
      
      The number of workspaces is bound by an upper limit of the workqueue's
      limit which currently is 2 (percpu limit). The reclaim timer is used to
      free inactive/improperly sized workspaces and is set to 307s to avoid
      colliding with transaction commit (every 30s).
      
      Repeating the experiment from v2 [1], the Silesia corpus was copied to a
      btrfs filesystem 10 times and then read back after dropping the caches.
      The btrfs filesystem was on an SSD.
      
      Level   Ratio   Compression (MB/s)  Decompression (MB/s)  Memory (KB)
      1       2.658        438.47                910.51            780
      2       2.744        364.86                886.55           1004
      3       2.801        336.33                828.41           1260
      4       2.858        286.71                886.55           1260
      5       2.916        212.77                556.84           1388
      6       2.363        119.82                990.85           1516
      7       3.000        154.06                849.30           1516
      8       3.011        159.54                875.03           1772
      9       3.025        100.51                940.15           1772
      10      3.033        118.97                616.26           1772
      11      3.036         94.19                802.11           1772
      12      3.037         73.45                931.49           1772
      13      3.041         55.17                835.26           2284
      14      3.087         44.70                716.78           2547
      15      3.126         37.30                878.84           2547
      
      [1] https://lore.kernel.org/linux-btrfs/20181031181108.289340-1-terrelln@fb.com/
      
      Cc: Nick Terrell <terrelln@fb.com>
      Cc: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3f93aef5
    • Dennis Zhou's avatar
      btrfs: make zstd memory requirements monotonic · d3c6ab75
      Dennis Zhou authored
      It is possible based on the level configurations that a higher level
      workspace uses less memory than a lower level workspace. In order to
      reuse workspaces, this must be made a monotonic relationship. This
      precomputes the required memory for each level and enforces the
      monotonicity between level and memory required. This is also done
      in upstream zstd in [1].
      
      [1] https://github.com/facebook/zstd/commit/a68b76afefec6876f8e8a538155109a5aeac0143
      
      Cc: Nick Terrell <terrelln@fb.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d3c6ab75
    • Dennis Zhou's avatar
      btrfs: zstd use the passed through level instead of default · e0dc87af
      Dennis Zhou authored
      Zstd currently only supports the default level of compression. This
      patch switches to using the level passed in for btrfs zstd
      configuration.
      
      Zstd workspaces now keep track of the requested level as this can differ
      from the size of the workspace.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e0dc87af
    • Dennis Zhou's avatar
      btrfs: change set_level() to bound the level passed in · d0ab62ce
      Dennis Zhou authored
      Currently, the only user of set_level() is zlib which sets an internal
      workspace parameter. As level is now plumbed into get_workspace(), this
      can be handled there rather than separately.
      
      This repurposes set_level() to bound the level passed in so it can be
      used when setting the mounts compression level and as well as verifying
      the level before getting a workspace. The other benefit is this divides
      the meaning of compress(0) and get_workspace(0). The former means we
      want to use the default compression level of the compression type. The
      latter means we can use any workspace available.
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d0ab62ce
    • Dennis Zhou's avatar
      btrfs: plumb level through the compression interface · 7bf49943
      Dennis Zhou authored
      Zlib compression supports multiple levels, but doesn't require changing
      in how a workspace itself is created and managed. Zstd introduces a
      different memory requirement such that higher levels of compression
      require more memory.
      
      This requires changes in how the alloc()/get() methods work for zstd.
      This pach plumbs compression level through the interface as a parameter
      in preparation for zstd compression levels.  This gives the compression
      types opportunity to create/manage based on the compression level.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7bf49943
    • Dennis Zhou's avatar
      btrfs: move to function pointers for get/put workspaces · 92ee5530
      Dennis Zhou authored
      The previous patch added generic helpers for get_workspace() and
      put_workspace(). Now, we can migrate ownership of the workspace_manager
      to be in the compression type code as the compression code itself
      doesn't care beyond being able to get a workspace. The init/cleanup and
      get/put methods are abstracted so each compression algorithm can decide
      how they want to manage their workspaces.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      92ee5530
    • Dennis Zhou's avatar
      btrfs: add compression interface in (get/put)_workspace · 929f4baf
      Dennis Zhou authored
      There are two levels of workspace management. First, alloc()/free()
      which are responsible for actually creating and destroy workspaces.
      Second, at a higher level, get()/put() which is the compression code
      asking for a workspace from a workspace_manager.
      
      The compression code shouldn't really care how it gets a workspace, but
      that it got a workspace. This adds get_workspace() and put_workspace()
      to be the higher level interface which is responsible for indexing into
      the appropriate compression type. It also introduces
      btrfs_put_workspace() and btrfs_get_workspace() to be the generic
      implementations of the higher interface.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      929f4baf
    • Dennis Zhou's avatar
      btrfs: add helper methods for workspace manager init and cleanup · 1666edab
      Dennis Zhou authored
      Workspace manager init and cleanup code is open coded inside a for loop
      over the compression types. This forces each compression type to rely on
      the same workspace manager implementation. This patch creates helper
      methods that will be the generic implementation for btrfs workspace
      management.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1666edab
    • Dennis Zhou's avatar
      btrfs: unify compression ops with workspace_manager · 10b94a51
      Dennis Zhou authored
      Make the workspace_manager own the interface operations rather than
      managing index-paired arrays for the workspace_manager and compression
      operations.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      10b94a51
    • Dennis Zhou's avatar
      btrfs: manage heuristic workspace as index 0 · ca4ac360
      Dennis Zhou authored
      While the heuristic workspaces aren't really compression workspaces,
      they use the same interface for managing them. So rather than branching,
      let's just handle them once again as the index 0 compression type.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca4ac360
    • Dennis Zhou's avatar
      btrfs: rename workspaces_list to workspace_manager · acce85de
      Dennis Zhou authored
      This is in preparation for zstd compression levels. As each level will
      require different size of workspace, workspaces_list is no longer a
      really fitting name.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      acce85de
    • Dennis Zhou's avatar
      btrfs: add helpers for compression type and level · 1972708a
      Dennis Zhou authored
      It is very easy to miss places that rely on a certain bitshifting for
      decoding the type_level overloading. Add helpers to do this instead.
      
      Cc: Omar Sandoval <osandov@osandov.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1972708a
    • Anand Jain's avatar
      btrfs: introduce new ioctl to unregister a btrfs device · 228a73ab
      Anand Jain authored
      Support for a new command that can be used eg. as a command
      
        $ btrfs device scan --forget [dev]'
      (the final name may change though)
      
      to undo the effects of 'btrfs device scan [dev]'. For this purpose
      this patch proposes to use ioctl #5 as it was empty and is next to the
      SCAN ioctl.
      
      The new ioctl BTRFS_IOC_FORGET_DEV works only on the control device
      (/dev/btrfs-control) to unregister one or all devices, devices that are
      not mounted.
      
      The argument is struct btrfs_ioctl_vol_args, ::name specifies the device
      path. To unregister all device, the path is an empty string.
      
      Again, the devices are removed only if they aren't part of a mounte
      filesystem.
      
      This new ioctl provides:
      
      - release of unwanted btrfs_fs_devices and btrfs_devices structures
        from memory if the device is not going to be mounted
      
      - ability to mount filesystem in degraded mode, when one devices is
        corrupted like in split brain raid1
      
      - running test cases which would require reloading the kernel module
        but this is not possible eg. due to mounted filesystem or built-in
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      228a73ab
    • Josef Bacik's avatar
      btrfs: replace cleaner_delayed_iput_mutex with a waitqueue · 034f784d
      Josef Bacik authored
      The throttle path doesn't take cleaner_delayed_iput_mutex, which means
      we could think we're done flushing iputs in the data space reservation
      path when we could have a throttler doing an iput.  There's no real
      reason to serialize the delayed iput flushing, so instead of taking the
      cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
      replace it with an atomic counter and a waitqueue.  This removes the
      short (or long depending on how big the inode is) window where we think
      there are no more pending iputs when there really are some.
      
      The waiting is killable as it could be indirectly called from user
      operations like fallocate or zero-range. Such call sites should handle
      the error but otherwise it's not necessary. Eg. flush_space just needs
      to attempt to make space by waiting on iputs.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ add killable comment and changelog parts ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      034f784d
    • Qu Wenruo's avatar
      btrfs: Output ENOSPC debug info in inc_block_group_ro · 3ece54e5
      Qu Wenruo authored
      Since inc_block_group_ro() would return -ENOSPC, outputting debug info
      for enospc_debug mount option would be helpful to debug some balance
      false ENOSPC report.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3ece54e5
    • Qu Wenruo's avatar
      btrfs: qgroup: Remove duplicated trace points for qgroup_rsv_add/release · c8f72b98
      Qu Wenruo authored
      Inside qgroup_rsv_add/release(), we have trace events
      trace_qgroup_update_reserve() to catch reserved space update.
      
      However we still have two manual trace_qgroup_update_reserve() calls
      just outside these functions.  Remove these duplicated calls.
      
      Fixes: 64ee4e75 ("btrfs: qgroup: Update trace events to use new separate rsv types")
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c8f72b98
    • Anders Roxell's avatar
      btrfs: let the assertion expression compile in all configs · 2eec5f00
      Anders Roxell authored
      A compiler warning (in a patch in development) pointed to a variable
      that was used only inside and ASSERT:
      
        u64 root_objectid = root->root_key.objectid;
        ASSERT(root_objectid == ...);
      
        fs/btrfs/relocation.c: In function ‘insert_dirty_subv’:
        fs/btrfs/relocation.c:2138:6: warning: unused variable ‘root_objectid’ [-Wunused-variable]
          u64 root_objectid = root->root_key.objectid;
      	^~~~~~~~~~~~~
      
      When CONFIG_BRTFS_ASSERT isn't enabled, variable root_objectid isn't used.
      
      Rework the assertion helper by adding a runtime check instead of the
      '#ifdef CONFIG_BTRFS_ASSERT #else ...", so the compiler sees the
      condition being passed into an inline function after preprocessing.
      Signed-off-by: default avatarAnders Roxell <anders.roxell@linaro.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2eec5f00
    • David Sterba's avatar
      btrfs: merge btrfs_set_lock_blocking_rw with it's caller · 766ece54
      David Sterba authored
      The last caller that does not have a fixed value of lock is
      btrfs_set_path_blocking, that actually does the same conditional swtich
      by the lock type so we can merge the branches together and remove the
      helper.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      766ece54
    • David Sterba's avatar
      btrfs: simplify waiting loop in btrfs_tree_lock · 970e74d9
      David Sterba authored
      Currently, the number of readers and writers is checked and in case
      there are any, wait and redo the locks. There's some duplication
      before the branches go back to again label, eg. calling wait_event on
      blocking_readers twice.
      
      The sequence is transformed
      
      loop:
      * wait for readers
      * wait for writers
      * write_lock
      * check readers, unlock and wait for readers, loop
      * check writers, unlock and wait for writers, loop
      
      The new sequence is not exactly the same due to the simplification, for
      readers it's slightly faster. For the writers, original code does
      
      * wait for writers
      * (loop) wait for readers
      *        wait for writers -- again
      
      while the new goes directly to the reader check. This should behave the
      same on a contended lock with multiple writers and readers, but can
      reduce number of times we're waiting on something.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      970e74d9
    • David Sterba's avatar
      btrfs: open code now trivial btrfs_set_lock_blocking · 8bead258
      David Sterba authored
      btrfs_set_lock_blocking is now only a simple wrapper around
      btrfs_set_lock_blocking_write. The name does not bring any semantic
      value that could not be inferred from the new function so there's no
      point keeping it.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8bead258
    • David Sterba's avatar
      btrfs: replace btrfs_set_lock_blocking_rw with appropriate helpers · 300aa896
      David Sterba authored
      We can use the right helper where the lock type is a fixed parameter.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      300aa896
    • David Sterba's avatar
      btrfs: split btrfs_clear_lock_blocking_rw to read and write helpers · aa12c027
      David Sterba authored
      There are many callers that hardcode the desired lock type so we can
      avoid the switch and call them directly. Split the current function to
      two. There are no remaining users of btrfs_clear_lock_blocking_rw so
      it's removed.  The call sites will be converted in followup patches.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      aa12c027
    • David Sterba's avatar
      btrfs: split btrfs_set_lock_blocking_rw to read and write helpers · b95be2d9
      David Sterba authored
      There are many callers that hardcode the desired lock type so we can
      avoid the switch and call them directly. Split the current function to
      two but leave a helper that still takes the variable lock type to make
      current code compile.  The call sites will be converted in followup
      patches.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b95be2d9
    • Qu Wenruo's avatar
      btrfs: qgroup: Cleanup old subtree swap code · 9627736b
      Qu Wenruo authored
      Since it's replaced by new delayed subtree swap code, remove the
      original code.
      
      The cleanup is small since most of its core function is still used by
      delayed subtree swap trace.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9627736b
    • Qu Wenruo's avatar
      btrfs: qgroup: Use delayed subtree rescan for balance · f616f5cd
      Qu Wenruo authored
      Before this patch, qgroup code traces the whole subtree of subvolume and
      reloc trees unconditionally.
      
      This makes qgroup numbers consistent, but it could cause tons of
      unnecessary extent tracing, which causes a lot of overhead.
      
      However for subtree swap of balance, just swap both subtrees because
      they contain the same contents and tree structure, so qgroup numbers
      won't change.
      
      It's the race window between subtree swap and transaction commit could
      cause qgroup number change.
      
      This patch will delay the qgroup subtree scan until COW happens for the
      subtree root.
      
      So if there is no other operations for the fs, balance won't cause extra
      qgroup overhead. (best case scenario)
      Depending on the workload, most of the subtree scan can still be
      avoided.
      
      Only for worst case scenario, it will fall back to old subtree swap
      overhead. (scan all swapped subtrees)
      
      [[Benchmark]]
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
      And after file system population, there is no other activity, so it
      should be the best case scenario.
      
                           | v4.20-rc1            | w/ patchset    | diff
      -----------------------------------------------------------------------
      relocated extents    | 22615                | 22457          | -0.1%
      qgroup dirty extents | 163457               | 121606         | -25.6%
      time (sys)           | 22.884s              | 18.842s        | -17.6%
      time (real)          | 27.724s              | 22.884s        | -17.5%
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f616f5cd
    • Qu Wenruo's avatar
      btrfs: qgroup: Introduce per-root swapped blocks infrastructure · 370a11b8
      Qu Wenruo authored
      To allow delayed subtree swap rescan, btrfs needs to record per-root
      information about which tree blocks get swapped.  This patch introduces
      the required infrastructure.
      
      The designed workflow will be:
      
      1) Record the subtree root block that gets swapped.
      
         During subtree swap:
         O = Old tree blocks
         N = New tree blocks
               reloc tree                         subvolume tree X
                  Root                               Root
                 /    \                             /    \
               NA     OB                          OA      OB
             /  |     |  \                      /  |      |  \
           NC  ND     OE  OF                   OC  OD     OE  OF
      
        In this case, NA and OA are going to be swapped, record (NA, OA) into
        subvolume tree X.
      
      2) After subtree swap.
               reloc tree                         subvolume tree X
                  Root                               Root
                 /    \                             /    \
               OA     OB                          NA      OB
             /  |     |  \                      /  |      |  \
           OC  OD     OE  OF                   NC  ND     OE  OF
      
      3a) COW happens for OB
          If we are going to COW tree block OB, we check OB's bytenr against
          tree X's swapped_blocks structure.
          If it doesn't fit any, nothing will happen.
      
      3b) COW happens for NA
          Check NA's bytenr against tree X's swapped_blocks, and get a hit.
          Then we do subtree scan on both subtrees OA and NA.
          Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
      
          Then no matter what we do to subvolume tree X, qgroup numbers will
          still be correct.
          Then NA's record gets removed from X's swapped_blocks.
      
      4)  Transaction commit
          Any record in X's swapped_blocks gets removed, since there is no
          modification to swapped subtrees, no need to trigger heavy qgroup
          subtree rescan for them.
      
      This will introduce 128 bytes overhead for each btrfs_root even qgroup
      is not enabled. This is to reduce memory allocations and potential
      failures.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      370a11b8