18 Nov, 2019 (40 commits)
    • btrfs: use btrfs_block_group_cache_done in update_block_group · a60adce8
      Josef Bacik authored
      When freeing extents in a block group we check to see if the block
      group is not cached, and then cache it if we need to.  However we'll
      just carry on as long as we're loading the cache.  This is problematic
      because we are dirtying the block group here.  If we are fast enough we
      could do a transaction commit and clear the free space cache while we're
      still loading the space cache in another thread.  This truncates the
      free space inode, which will keep it from loading the space cache.
      
      Fix this by using the btrfs_block_group_cache_done helper so that we try
      to load the space cache unconditionally here, which will result in the
      caller waiting for the fast caching to complete and keep us from
      truncating the free space inode.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: check page->mapping when loading free space cache · 3797136b
      Josef Bacik authored
      While testing 5.2 we ran into the following panic
      
      [52238.017028] BUG: kernel NULL pointer dereference, address: 0000000000000001
      [52238.105608] RIP: 0010:drop_buffers+0x3d/0x150
      [52238.304051] Call Trace:
      [52238.308958]  try_to_free_buffers+0x15b/0x1b0
      [52238.317503]  shrink_page_list+0x1164/0x1780
      [52238.325877]  shrink_inactive_list+0x18f/0x3b0
      [52238.334596]  shrink_node_memcg+0x23e/0x7d0
      [52238.342790]  ? do_shrink_slab+0x4f/0x290
      [52238.350648]  shrink_node+0xce/0x4a0
      [52238.357628]  balance_pgdat+0x2c7/0x510
      [52238.365135]  kswapd+0x216/0x3e0
      [52238.371425]  ? wait_woken+0x80/0x80
      [52238.378412]  ? balance_pgdat+0x510/0x510
      [52238.386265]  kthread+0x111/0x130
      [52238.392727]  ? kthread_create_on_node+0x60/0x60
      [52238.401782]  ret_from_fork+0x1f/0x30
      
      The page we were trying to drop had a page->private, but had no
      page->mapping and so called drop_buffers, assuming that we had a
      buffer_head on the page, and then panic'ed trying to deref 1, which is
      our page->private for data pages.
      
      This is happening because we're truncating the free space cache while
      we're trying to load the free space cache.  This isn't supposed to
      happen, and I'll fix that in a followup patch.  However we still
      shouldn't allow those sort of mistakes to result in messing with pages
      that do not belong to us.  So add the page->mapping check to verify that
      we still own this page after dropping and re-acquiring the page lock.
      
      The page is being unlocked as:
      btrfs_readpage
        extent_read_full_page
          __extent_read_full_page
            __do_readpage
              if (!nr)
      	   unlock_page  <-- nr can be 0 only if submit_extent_page
      			    returns an error
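The defensive check described above can be modeled in a few lines of userspace C. The struct names below mirror the kernel's page and address_space but are local stand-ins, not kernel code: after the lock is re-acquired, a NULL or foreign page->mapping means the page is no longer ours and must not be treated as our data page.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the kernel structures involved. */
struct mapping { int id; };
struct page { struct mapping *mapping; unsigned long private_data; };

/* After dropping and re-acquiring the page lock, verify the page still
 * belongs to our mapping before touching page->private. */
static bool page_still_ours(const struct page *p, const struct mapping *ours)
{
    return p->mapping == ours;   /* NULL or a foreign mapping: bail out */
}

static bool demo_ours(void)
{
    struct mapping m = { 1 };
    struct page p = { &m, 0 };
    return page_still_ours(&p, &m);
}

static bool demo_truncated(void)
{
    struct mapping m = { 1 };
    struct page p = { NULL, 1 };   /* truncated while we slept */
    return page_still_ours(&p, &m);
}
```

A page that was truncated and reused elsewhere fails the check and is skipped instead of being misinterpreted as carrying buffer_heads.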
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      [ add callchain ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix metadata space leak on fixup worker failure to set range as delalloc · 53687007
      Filipe Manana authored
      In the fixup worker, if we fail to mark the range as delalloc in the io
      tree, we must release the previously reserved metadata, as well as update
      the outstanding extents counter for the inode, otherwise we leak metadata
      space.
      
      In practice we can't return an error from btrfs_set_extent_delalloc(),
      which is just a wrapper around __set_extent_bit(), as for most errors
      __set_extent_bit() does a BUG_ON() (or panics which hits a BUG_ON() as
      well) and returning an -EEXIST error doesn't happen in this case since
      the exclusive bits parameter always has a value of 0 through this code
      path. Nevertheless, just fix the error handling in the fixup worker,
      in case one day __set_extent_bit() can return an error to this code
      path.
      
      Fixes: f3038ee3 ("btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix negative subv_writers counter and data space leak after buffered write · a0e248bb
      Filipe Manana authored
      When doing a buffered write it's possible to leave the root's
      subv_writers counter, used for synchronization between buffered nocow
      writers and snapshotting, with a negative value. This happens in an
      exceptional case like the following:
      
      1) We fail to allocate data space for the write, since there's not
         enough available data space nor enough unallocated space for allocating
         a new data block group;
      
      2) Because of that failure, we try to go to NOCOW mode, which succeeds
         and therefore we set the local variable 'only_release_metadata' to true
         and set the root's subv_writers counter to 1 through the call to
         btrfs_start_write_no_snapshotting() made by check_can_nocow();
      
      3) The call to btrfs_copy_from_user() returns zero, which is very unlikely
         to happen but not impossible;
      
      4) No pages are copied because btrfs_copy_from_user() returned zero;
      
      5) We call btrfs_end_write_no_snapshotting() which decrements the root's
         subv_writers counter to 0;
      
      6) We don't set 'only_release_metadata' back to 'false' because we do
         it only if 'copied', the value returned by btrfs_copy_from_user(), is
         greater than zero;
      
      7) On the next iteration of the while loop, which processes the same
         page range, we are now able to allocate data space for the write (we
         got enough data space released in the meanwhile);
      
      8) After this if we fail at btrfs_delalloc_reserve_metadata(), because
         now there isn't enough free metadata space, or in some other place
         further below (prepare_pages(), lock_and_cleanup_extent_if_need(),
         btrfs_dirty_pages()), we break out of the while loop with
         'only_release_metadata' having a value of 'true';
      
      9) Because 'only_release_metadata' is 'true' we end up decrementing the
         root's subv_writers counter to -1 (through a call to
         btrfs_end_write_no_snapshotting()), and we also end up not releasing the
         data space previously reserved through btrfs_check_data_free_space().
         As a consequence the mechanism for synchronizing NOCOW buffered writes
         with snapshotting gets broken.
      
      Fix this by always setting 'only_release_metadata' to false at the start
      of each iteration.
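A toy model of the fix, assuming nothing about the surrounding btrfs code: the flag simply has to be reset at the top of every loop iteration so a value set by an earlier NOCOW fallback cannot leak into a later error path.

```c
#include <stdbool.h>

/* Returns the flag value seen by the error path after the loop.
 * Iteration 0 models the NOCOW fallback with 0 bytes copied; iteration 1
 * models the retry that allocates data space and then fails later. */
static bool leftover_flag(bool reset_each_iteration)
{
    bool only_release_metadata = false;

    for (int iter = 0; iter < 2; iter++) {
        if (reset_each_iteration)
            only_release_metadata = false;   /* the fix */
        if (iter == 0)
            only_release_metadata = true;    /* NOCOW fallback, 0 bytes copied */
        /* iteration 1: data space allocated, a later step fails, breaks out */
    }
    return only_release_metadata;            /* true here leaks the counter */
}
```

Without the per-iteration reset the error path runs with stale NOCOW state; with it, the flag reflects only the current iteration.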
      
      Fixes: 8257b2dc ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume")
      Fixes: 7ee9e440 ("Btrfs: check if we can nocow if we don't have data space")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: ioctl: Try to use btrfs_fs_info instead of *file · b929c1d8
      Marcos Paulo de Souza authored
      Some functions are doing some unnecessary indirection to reach the
      btrfs_fs_info struct. Change these functions to receive a btrfs_fs_info
      struct instead of a *file.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use bool argument in free_root_pointers() · 4273eaff
      Anand Jain authored
      We don't need an int argument, a bool will do in free_root_pointers().
      Also rename the argument, as it has confused two people.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use better definition of number of compression type · ce96b7ff
      Chengguang Xu authored
      The compression type upper limit constant is the same as the last value
      and this is confusing.  In order to keep the coding style consistent, use
      BTRFS_NR_COMPRESS_TYPES as the total number, following the idiom of
      'NR' being one more than the last value.
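The 'NR' idiom can be illustrated with a self-contained enum. The names below are local illustrations, not the btrfs definitions:

```c
/* The count constant sits one past the last valid value, so bounds
 * checks and array sizes stay correct when a new type is appended. */
enum compress_type {
    COMPRESS_NONE = 0,
    COMPRESS_ZLIB,
    COMPRESS_LZO,
    COMPRESS_ZSTD,
    NR_COMPRESS_TYPES,      /* always last: equals the number of valid types */
};

static int compress_type_valid(int t)
{
    return t >= 0 && t < NR_COMPRESS_TYPES;
}
```

Any table indexed by the type can be declared with size NR_COMPRESS_TYPES and never needs touching when types are added before the sentinel.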
      Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use enum for extent type defines · b9b1a53e
      Chengguang Xu authored
      Use enum to replace macro definitions of extent types.
      Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: props: remove unnecessary hash_init() · b2cd2959
      Chengguang Xu authored
      DEFINE_HASHTABLE itself already includes the initialization code, so
      we don't have to call hash_init() again; remove it.
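A minimal stand-in for the pattern (the macro below is a local illustration, not the kernel's DEFINE_HASHTABLE): a definition macro that expands to a statically initialized object leaves nothing for a runtime init call to do.

```c
/* Definition macro that carries its own static initializer, so a
 * separate runtime init pass would be redundant. */
#define DEFINE_TABLE(name, n) static int name[n] = { 0 }

DEFINE_TABLE(mytable, 8);

/* Verify the table is already in its initialized (zeroed) state. */
static int table_is_zeroed(void)
{
    for (int i = 0; i < 8; i++)
        if (mytable[i])
            return 0;
    return 1;
}
```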
      Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Rename btrfs_join_transaction_nolock · 8d510121
      Nikolay Borisov authored
      This function is used only during the final phase of freespace cache
      writeout. This is necessary since using the plain btrfs_join_transaction
      api is deadlock prone. The deadlock looks like:
      
      T1:
      btrfs_commit_transaction
        commit_cowonly_roots
          btrfs_write_dirty_block_groups
            btrfs_wait_cache_io
              __btrfs_wait_cache_io
                btrfs_wait_ordered_range <-- triggers ordered IO for freespace
                                             inode and blocks transaction commit
                                             until freespace cache writeout
      
      T2: <-- after T1 has triggered the writeout
      finish_ordered_fn
        btrfs_finish_ordered_io
          btrfs_join_transaction <--- this would block waiting for the current
                                      transaction to commit, but the commit is
                                      itself waiting for this writeout to
                                      finish, so we deadlock
      
      The special purpose function prevents this by simply skipping the "wait
      for writeout" step, since it's guaranteed the transaction won't proceed
      until we are done.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Use assert to document transaction requirement · ce6d3eb6
      Nikolay Borisov authored
      Using an ASSERT in btrfs_pin_extent allows us to more stringently check
      whether the function is called under a transaction or not.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: opencode extent_buffer_get · 67439dad
      David Sterba authored
      The helper is trivial and we can understand what the atomic_inc on
      something named refs does.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Avoid getting stuck during cyclic writebacks · f7bddf1e
      Tejun Heo authored
      During a cyclic writeback, extent_write_cache_pages() uses done_index
      to update the writeback_index after the current run is over.  However,
      instead of the current index + 1, it gets set to the current index
      itself.
      
      Unfortunately, this, combined with returning on EOF instead of looping
      back, can lead to the following pathological behavior.
      
      1. There is a single file which has accumulated enough dirty pages to
         trigger balance_dirty_pages() and a writer is appending to the file
         with a series of short writes.
      
      2. balance_dirty_pages kicks in, wakes up background writeback and sleeps.
      
      3. Writeback kicks in and the cursor is on the last page of the dirty
         file.  Writeback is started or skipped if already in progress.  As
         it's EOF, extent_write_cache_pages() returns and the cursor is set
         to done_index which is pointing to the last page.
      
      4. Writeback is done.  Nothing happens till balance_dirty_pages
         finishes, at which point we go back to #1.
      
      This can almost completely stall out writing back of the file and keep
      the system over dirty threshold for a long time which can mess up the
      whole system.  We encountered this issue in production with a package
      handling application which can reliably reproduce the issue when
      running under tight memory limits.
      
      Reading the comment in the error handling section, this seems to be to
      avoid accidentally skipping a page in case the write attempt on the
      page doesn't succeed.  However, this concern seems bogus.
      
      On each page, the code either:
      
      * Skips and moves onto the next page.
      
      * Fails to issue and sets done_index to index + 1.
      
      * Successfully issues and continues to the next page if budget allows
        and it's not EOF.
      
      IOW, as long as it's not EOF and there's budget, the code never
      retries writing back the same page.  Only when a page happens to be
      the last page of a particular run, we end up retrying the page, which
      can't possibly guarantee anything data integrity related.  Besides,
      cyclic writes are only used for non-syncing writebacks meaning that
      there's no data integrity implication to begin with.
      
      Fix it by always setting done_index past the current page being
      processed.
      
      Note that this problem exists in other writepages too.
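A toy model of the cursor behavior, with nothing kernel-specific assumed: if the resume point is the last written page itself, a writer that keeps re-dirtying that page pins the cursor forever; pointing past the current page lets it advance.

```c
/* Resume point after a cyclic run ending at last_page.  The buggy
 * variant points at the last page itself; the fixed one points past it. */
static unsigned long resume_index(unsigned long last_page, int fixed)
{
    return fixed ? last_page + 1 : last_page;
}

/* Simulate several writeback runs against a writer that always
 * re-dirties the page the cursor lands on. */
static unsigned long cursor_after_runs(int fixed, int runs)
{
    unsigned long cursor = 10;   /* last dirty page of the file */
    for (int i = 0; i < runs; i++)
        cursor = resume_index(cursor, fixed);
    return cursor;
}
```

With the buggy resume point the cursor never moves; with the fix it advances one page per run and the rest of the file eventually gets its turn.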
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: block-group: Rework documentation of check_system_chunk function · a9143bd3
      Marcos Paulo de Souza authored
      Commit 4617ea3a ("Btrfs: fix necessary chunk tree space calculation
      when allocating a chunk") removed the is_allocation argument from
      check_system_chunk, since the formula for reserving the necessary space
      for allocating or removing a chunk would be the same.
      
      So, rework the comment by removing the mention of is_allocation
      argument.
      Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Enhance error output for write time tree checker · c06631b0
      Qu Wenruo authored
      Unlike read time tree checker errors, write time errors can't be
      inspected by "btrfs inspect dump-tree", so we need extra information to
      determine what's going wrong.
      
      The patch will add the following output for write time tree checker
      error:
      
      - The content of the offending tree block
        To help determine if it's a false alert.
      
      - Kernel WARN_ON() for debug build
        This is helpful for us to detect unexpected write time tree checker
        error, especially fstests could catch the dmesg.
        Since the WARN_ON() is only triggered for write time tree checker,
        test cases utilizing dm-error won't trigger this WARN_ON(), thus no
        extra noise.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tree-checker: Refactor prev_key check for ino into a function · 80d7fd1e
      Qu Wenruo authored
      Refactor the check for prev_key->objectid of the following key types
      into one function, check_prev_ino():
      
      - EXTENT_DATA
      - INODE_REF
      - DIR_INDEX
      - DIR_ITEM
      - XATTR_ITEM
      
      Also add the check of prev_key for INODE_REF.
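A simplified sketch of the shared rule (a stand-in, not the kernel's check_prev_ino()): within one leaf, the previous key, if present, must refer to the same inode as the current key.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal key: only the objectid (the inode number) matters here. */
struct key { uint64_t objectid; };

/* The rule shared by EXTENT_DATA, INODE_REF, DIR_INDEX, DIR_ITEM and
 * XATTR_ITEM in this simplified model: the previous key in the leaf,
 * if any, must belong to the same inode. */
static bool check_prev_ino(const struct key *prev, const struct key *cur)
{
    if (!prev)
        return true;                 /* first item in the leaf */
    return prev->objectid == cur->objectid;
}

static bool same_ino(uint64_t prev_ino, uint64_t cur_ino)
{
    struct key prev = { prev_ino }, cur = { cur_ino };
    return check_prev_ino(&prev, &cur);
}
```

Factoring the comparison out lets each item-type checker call one helper instead of repeating the objectid comparison inline.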
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: extent_write_locked_range() should attach inode->i_wb · dbb70bec
      Chris Mason authored
      extent_write_locked_range() is used when we're falling back to buffered
      IO from inside of compression.  It allocates its own wbc and should
      associate it with the inode's i_wb to make sure the IO goes down from
      the correct cgroup.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios · ec39f769
      Chris Mason authored
      Async CRCs and compression submit IO through helper threads, which means
      they have IO priority inversions when cgroup IO controllers are in use.
      
      This flags all of the writes submitted by btrfs helper threads as
      REQ_CGROUP_PUNT.  submit_bio() will punt these to dedicated per-blkcg
      work items to avoid the priority inversion.
      
      For the compression code, we take a reference on the wbc's blkg css and
      pass it down to the async workers.
      
      For the async CRCs, the bio already has the correct css, we just need to
      tell the block layer to use REQ_CGROUP_PUNT.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Modified-and-reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: only associate the locked page with one async_chunk struct · 1d53c9e6
      Chris Mason authored
      The btrfs writepages function collects a large range of pages flagged
      for delayed allocation, and then sends them down through the COW code
      for processing.  When compression is on, we allocate one async_chunk
      structure for every 512K, and then run those pages through the
      compression code for IO submission.
      
      writepages starts all of this off with a single page, locked by the
      original call to extent_write_cache_pages(), and it's important to keep
      track of this page because it has already been through
      clear_page_dirty_for_io().
      
      The btrfs async_chunk struct has a pointer to the locked_page, and when
      we're redirtying the page because compression had to fallback to
      uncompressed IO, we use page->index to decide if a given async_chunk
      struct really owns that page.
      
      But, this is racy.  If a given delalloc range is broken up into two
      async_chunks (chunkA and chunkB), we can end up with something like
      this:
      
       compress_file_range(chunkA)
       submit_compress_extents(chunkA)
       submit compressed bios(chunkA)
       put_page(locked_page)
      
      				 compress_file_range(chunkB)
      				 ...
      
      Or:
      
       async_cow_submit
        submit_compressed_extents <--- falls back to buffered writeout
         cow_file_range
          extent_clear_unlock_delalloc
           __process_pages_contig
             put_page(locked_pages)
      
      					    async_cow_submit
      
      The end result is that chunkA is completed and cleaned up before chunkB
      even starts processing.  This means we can free locked_page and reuse
      it elsewhere.  If we get really lucky, it'll have the same page->index
      in its new home as it did before.
      
      While we're processing chunkB, we might decide we need to fall back to
      uncompressed IO, and so compress_file_range() will call
      __set_page_dirty_nobuffers() on chunkB->locked_page.
      
      Without cgroups in use, this creates a phantom dirty page, which
      isn't great but isn't the end of the world. What can happen is that it
      goes through the fixup worker and the whole COW machinery again:
      
      in submit_compressed_extents():
        while (async extents) {
        ...
          cow_file_range
          if (!page_started ...)
            extent_write_locked_range
          else if (...)
            unlock_page
          continue;
      
      This hasn't been observed in practice but is still possible.
      
      With cgroups in use, we might crash in the accounting code because
      page->mapping->i_wb isn't set.
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
        IP: percpu_counter_add_batch+0x11/0x70
        PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
        CPU: 16 PID: 2172 Comm: rm Not tainted
        RIP: 0010:percpu_counter_add_batch+0x11/0x70
        RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
        RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
        RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
        RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
        R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
        R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
        FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         account_page_cleaned+0x15b/0x1f0
         __cancel_dirty_page+0x146/0x200
         truncate_cleanup_page+0x92/0xb0
         truncate_inode_pages_range+0x202/0x7d0
         btrfs_evict_inode+0x92/0x5a0
         evict+0xc1/0x190
         do_unlinkat+0x176/0x280
         do_syscall_64+0x63/0x1a0
         entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      The fix here is to make async_chunk->locked_page NULL everywhere but the
      one async_chunk struct that's allowed to do things to the locked page.
      
      Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
      Fixes: 771ed689 ("Btrfs: Optimize compressed writeback and reads")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      [ update changelog from mail thread discussion ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: delete the entire async bio submission framework · ba8a9d07
      Chris Mason authored
      Now that we're not using btrfs_schedule_bio() anymore, delete all the
      code that supported it.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: stop using btrfs_schedule_bio() · 08635bae
      Chris Mason authored
      btrfs_schedule_bio() hands IO off to a helper thread to do the actual
      submit_bio() call.  This has been used to make sure async crc and
      compression helpers don't get stuck on IO submission.  To maintain good
      performance, over time the IO submission threads duplicated some IO
      scheduler characteristics such as high and low priority IOs and they
      also made some ugly assumptions about request allocation batch sizes.
      
      All of this cost at least one extra context switch during IO submission,
      and doesn't fit well with the modern blkmq IO stack.  So, this commit stops
      using btrfs_schedule_bio().  We may need to adjust the number of async
      helper threads for crcs and compression, but long term it's a better
      path.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add __pure attribute to functions · e1f60a65
      David Sterba authored
      The attribute is more relaxed than const and the functions could
      dereference pointers, as long as the observable state is not changed. We
      do have such functions, based on -Wsuggest-attribute=pure .
      
      The visible effects of this patch are negligible, there are differences
      in the assembly but hard to summarize.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add const function attribute · 4143cb8b
      David Sterba authored
      For some reason the attribute is called __attribute_const__ and not
      __const. It marks functions that have no observable effects on program
      state, IOW they don't read pointers, just the arguments, and calculate
      a value. This allows the compiler to do some optimizations, based on
      -Wsuggest-attribute=const . The effects are rather small, though, about
      a 60 byte decrease of btrfs.ko.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add __cold attribute to more functions · b105e927
      David Sterba authored
      The attribute can mark functions supposed to be called rarely if at all
      and the text can be moved to sections far from the other code. The
      attribute has been added to several functions already, this patch is
      based on hints given by gcc -Wsuggest-attribute=cold.
      
      The net effect of this patch is a decrease of btrfs.ko by 1000-1300
      bytes, depending on the config options.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
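The three attribute patches above (pure, const, cold) can be sketched together in plain C. The functions below are illustrative, and all three attributes are optimization hints rather than semantic changes:

```c
/* const: the result depends only on the arguments, no memory reads. */
__attribute__((const)) static int square(int x) { return x * x; }

static int table[4] = { 1, 2, 3, 4 };

/* pure: may read global state, but has no side effects, so repeated
 * calls with the same inputs can be merged by the compiler. */
__attribute__((pure)) static int table_sum(int n)
{
    int s = 0;
    for (int i = 0; i < n && i < 4; i++)
        s += table[i];
    return s;
}

/* cold: rarely executed; the compiler may place it away from hot text
 * and optimize it for size. */
__attribute__((cold)) static int report_error(int err) { return -err; }
```

Building with -Wsuggest-attribute=const (or =pure, =cold) is how candidates for these annotations were found in the patches above.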
    • btrfs: drop unused parameter is_new from btrfs_iget · 4c66e0d4
      David Sterba authored
      The parameter is now always set to NULL and could be dropped. The last
      user was get_default_root but that got reworked in 05dbe683 ("Btrfs:
      unify subvol= and subvolid= mounting") and the parameter became unused.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use refcount_inc_not_zero in kill_all_nodes · baf320b9
      Josef Bacik authored
      We hit the following warning while running down a different problem
      
      [ 6197.175850] ------------[ cut here ]------------
      [ 6197.185082] refcount_t: underflow; use-after-free.
      [ 6197.194704] WARNING: CPU: 47 PID: 966 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60
      [ 6197.521792] Call Trace:
      [ 6197.526687]  __btrfs_release_delayed_node+0x76/0x1c0
      [ 6197.536615]  btrfs_kill_all_delayed_nodes+0xec/0x130
      [ 6197.546532]  ? __btrfs_btree_balance_dirty+0x60/0x60
      [ 6197.556482]  btrfs_clean_one_deleted_snapshot+0x71/0xd0
      [ 6197.566910]  cleaner_kthread+0xfa/0x120
      [ 6197.574573]  kthread+0x111/0x130
      [ 6197.581022]  ? kthread_create_on_node+0x60/0x60
      [ 6197.590086]  ret_from_fork+0x1f/0x30
      [ 6197.597228] ---[ end trace 424bb7ae00509f56 ]---
      
      This is because the free side drops the ref without the lock, and then
      takes the lock if our refcount is 0.  So you can have nodes on the tree
      that have a refcount of 0.  Fix this by zeroing out that element in our
      temporary array so we don't try to kill it again.
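The refcount_inc_not_zero() idea can be sketched with userspace C11 atomics. This is a stand-in, not the kernel's refcount_t implementation: take a reference only while the count is still non-zero, so an object already headed for freeing is skipped rather than resurrected.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Increment the refcount only if it is currently non-zero.  A failed
 * compare-exchange reloads 'old', so the loop retries until either the
 * increment lands or the count is observed at zero. */
static bool refcount_inc_not_zero(atomic_int *refs)
{
    int old = atomic_load(refs);
    while (old != 0) {
        if (atomic_compare_exchange_weak(refs, &old, old + 1))
            return true;
    }
    return false;   /* object is dying; do not touch it */
}

/* Helper: returns the count after a successful grab, or -1 if the
 * object was already at zero and the grab was refused. */
static int try_get_then_count(int initial)
{
    atomic_int refs = initial;
    return refcount_inc_not_zero(&refs) ? atomic_load(&refs) : -1;
}
```

An unconditional increment on a node whose count already hit zero would "revive" a dying object, which is exactly the use-after-free the warning above flagged.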
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: print process name and pid that calls device scanning · aa6c0df7
      Anand Jain authored
      It's very helpful to have the device scanner's process name logged when
      debugging the race condition between the systemd-udevd scan and the
      user initiated device forget command.
      
      This patch adds process name and pid to the scan message.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add pid to the message ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Open-code name_in_log_ref in replay_one_name · 725af92a
      Nikolay Borisov authored
      That function adds unnecessary indirection between backref_in_log and
      the caller. Furthermore it also "downgrades" backref_in_log's return
      value to a boolean, when in fact it could very well be an error.
      
      Rectify the situation by open coding name_in_log_ref in
      replay_one_name and properly handling the possible return codes from
      backref_in_log.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      725af92a
    • Nikolay Borisov's avatar
      btrfs: Properly handle backref_in_log retval · d3316c82
      Nikolay Borisov authored
      This function can return a negative error value if btrfs_search_slot
      errors for whatever reason or if btrfs_alloc_path runs out of memory.
      This is currently problematic because backref_in_log is treated by its
      callers as if it returned a boolean.
      
      Fix this by adding proper error handling in callers. That also enables
      the function to return the direct error code from btrfs_search_slot.
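      The caller-side pattern looks roughly like this (a simplified sketch
      with a stub standing in for backref_in_log, not the actual btrfs
      code):

      ```c
      #include <assert.h>

      /* Stand-in for backref_in_log: 1 if the backref exists in the log
       * tree, 0 if not, or a negative errno such as -ENOMEM. */
      static int backref_in_log_stub(int result)
      {
              return result;
      }

      /* Callers must separate "real error" from "not found" instead of
       * collapsing the result into a boolean. */
      static int handle_backref(int result)
      {
              int ret = backref_in_log_stub(result);

              if (ret < 0)
                      return ret;     /* propagate the error */
              if (ret == 0)
                      return 0;       /* backref not in the log */
              return 1;               /* backref found in the log */
      }
      ```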
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d3316c82
    • Nikolay Borisov's avatar
      btrfs: Don't opencode btrfs_find_name_in_backref in backref_in_log · 89cbf5f6
      Nikolay Borisov authored
      Direct replacement, though note that the inside of the loop in
      btrfs_find_name_in_backref is organized in a slightly different way but
      is equivalent.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      89cbf5f6
    • Qu Wenruo's avatar
      btrfs: transaction: Cleanup unused TRANS_STATE_BLOCKED · 3296bf56
      Qu Wenruo authored
      The state was introduced in commit 4a9d8bde ("Btrfs: make the state
      of the transaction more readable"), then commit 302167c5
      ("btrfs: don't end the transaction for delayed refs in throttle")
      removed the last place that set it.
      
      So we can just remove the state, since it is only ever compared and
      never set.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3296bf56
    • Qu Wenruo's avatar
      btrfs: transaction: describe transaction states and transitions · 61c047b5
      Qu Wenruo authored
      Add an overview of the basic btrfs transaction transitions, including
      the following states:
      
      - No transaction states
      - Transaction N [[TRANS_STATE_RUNNING]]
      - Transaction N [[TRANS_STATE_COMMIT_START]]
      - Transaction N [[TRANS_STATE_COMMIT_DOING]]
      - Transaction N [[TRANS_STATE_UNBLOCKED]]
      - Transaction N [[TRANS_STATE_COMPLETED]]
      
      For each state, the comment will include:
      
      - Basic explanation of the current state
      - How to move to the next state
      - What will happen if we call various start_transaction() functions
      - Relationship to transaction N+1
      
      This doesn't provide technical details, but serves as a cheat sheet
      for readers to get into the code a little more easily.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      61c047b5
    • David Sterba's avatar
      btrfs: use has_single_bit_set for clarity · c1499166
      David Sterba authored
      Replace is_power_of_2 with the self-documenting helper and remove the
      open-coded call in alloc_profile_is_valid.
      Signed-off-by: David Sterba <dsterba@suse.com>
      c1499166
    • David Sterba's avatar
      btrfs: add 64bit safe helper for power of two checks · 79c8264e
      David Sterba authored
      As is_power_of_2 takes an unsigned long, it's not safe on 32-bit
      architectures, yet we could pass any u64 value in several places. Add a
      separate helper and also an alias that better expresses the purpose for
      which the helper is used.
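      A minimal userspace sketch of such a check (helper names follow the
      commit subject, but the exact in-kernel signatures may differ):

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>

      /* Safe for the full u64 range even where unsigned long is 32 bits
       * wide: a value is a power of two iff exactly one bit is set. */
      static inline bool is_power_of_two_u64(uint64_t n)
      {
              return n != 0 && (n & (n - 1)) == 0;
      }

      /* Alias that better expresses intent when validating that a flags
       * field has exactly one bit set. */
      static inline bool has_single_bit_set(uint64_t n)
      {
              return is_power_of_two_u64(n);
      }
      ```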
      Signed-off-by: David Sterba <dsterba@suse.com>
      79c8264e
    • Anand Jain's avatar
      btrfs: balance: use term redundancy instead of integrity in message · e62869be
      Anand Jain authored
      When balance reduces the number of copies of metadata, it reduces the
      redundancy, so use the term redundancy instead of integrity.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e62869be
    • David Sterba's avatar
      btrfs: move btrfs_unlock_up_safe to other locking functions · 1f95ec01
      David Sterba authored
      The function belongs to the family of locking functions, so move it
      there. The 'noinline' keyword is dropped as it's now an exported
      function that does not need it.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1f95ec01
    • David Sterba's avatar
      btrfs: move btrfs_set_path_blocking to other locking functions · ed2b1d36
      David Sterba authored
      The function belongs to the family of locking functions, so move it
      there. The 'noinline' keyword is dropped as it's now an exported
      function that does not need it.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ed2b1d36
    • David Sterba's avatar
      btrfs: make btrfs_assert_tree_locked static inline · 31f6e769
      David Sterba authored
      The function btrfs_assert_tree_locked is used outside of the locking
      code, so it is exported; however, we can make it static inline as it's
      fairly trivial.
      
      This is the only locking assertion used in release builds, inlining
      improves the text size by 174 bytes and reduces stack consumption in the
      callers.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      31f6e769
    • David Sterba's avatar
      btrfs: make locking assertion helpers static inline · d6156218
      David Sterba authored
      I've noticed that none of the btrfs_assert_*lock* debugging helpers is
      inlined, even though they're short and mostly a value update. Making
      them inline shaves 67 bytes off the text size, reduces stack
      consumption, and perhaps also slightly improves performance by
      avoiding unnecessary calls.
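      The shape of such a helper, sketched in plain C (struct and field
      names are made up; the real helpers check btrfs extent buffer lock
      state):

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Hypothetical stand-in for an extent buffer's lock state. */
      struct eb_stub {
              bool write_locked;
      };

      /* Being static inline in the header, this compiles down to a single
       * value test at each call site, with no function call overhead. */
      static inline void assert_tree_locked(const struct eb_stub *eb)
      {
              assert(eb->write_locked);
      }
      ```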
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d6156218
    • Omar Sandoval's avatar
      btrfs: get rid of pointless wtag variable in async-thread.c · c9eb55db
      Omar Sandoval authored
      Commit ac0c7cf8 ("btrfs: fix crash when tracepoint arguments are
      freed by wq callbacks") added a void pointer, wtag, which is passed into
      trace_btrfs_all_work_done() instead of the freed work item. This is
      silly for a few reasons:
      
      1. The freed work item still has the same address.
      2. work is still in scope after it's freed, so assigning wtag doesn't
         stop anyone from using it.
      3. The tracepoint has always taken a void * argument, so assigning wtag
         doesn't actually make things any more type-safe. (Note that the
         original bug in commit bc074524 ("btrfs: prefix fsid to all trace
         events") was that the void * was implicitly casted when it was passed
         to btrfs_work_owner() in the trace point itself).
      
      Instead, let's add some clearer warnings as comments.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c9eb55db