1. 18 Apr, 2018 1 commit
    • btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT · a514d638
      Qu Wenruo authored
      Unlike the previous method, which tried to commit the transaction
      inside qgroup_reserve(), this time we commit the transaction through
      fs_info->transaction_kthread. This avoids nested transactions and any
      need to worry about the locking context.
      
      Since this is an asynchronous call and we do not wait for the
      transaction commit, unlike the previous method, we must trigger it
      before we hit the qgroup limit.
      
      So this patch uses the ratio and size of the qgroup meta_pertrans
      reservation as indicators to decide whether to trigger a transaction
      commit.  (meta_prealloc is not cleaned up by a transaction commit, so
      it's useless here anyway.)
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
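
      A minimal user-space sketch of the "commit in advance" check
      described in this message, assuming the decision is made by
      comparing the meta_pertrans reservation against a fraction of the
      limit and against an absolute size. The function name, the macro
      names and both threshold values are hypothetical and chosen only for
      illustration; they are not taken from the kernel source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical thresholds, chosen only for illustration. */
#define PERTRANS_RATIO 32u                      /* 1/32 of the limit */
#define PERTRANS_SIZE  (32ull * 1024 * 1024)    /* 32 MiB */

/*
 * Return true when the meta_pertrans reservation is large enough that an
 * early, asynchronous transaction commit looks worthwhile.
 * A limit of 0 is treated as "no limit set", so no commit is requested.
 */
bool should_request_commit(uint64_t limit, uint64_t pertrans_rsv)
{
	if (limit == 0)
		return false;
	return pertrans_rsv > limit / PERTRANS_RATIO ||
	       pertrans_rsv > PERTRANS_SIZE;
}
```

      Because the commit is only requested and not waited for, the check
      has to fire while there is still headroom below the limit, which is
      why it looks at a fraction of the limit and not the limit itself.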
  2. 13 Apr, 2018 1 commit
    • btrfs: Only check first key for committed tree blocks · 5d41be6f
      Qu Wenruo authored
      When looping btrfs/074 with many CPUs (>= 8), it's possible to trigger
      a kernel warning from the first key verification:
      
      [ 4239.523446] WARNING: CPU: 5 PID: 2381 at fs/btrfs/disk-io.c:460 btree_read_extent_buffer_pages+0x1ad/0x210
      [ 4239.523830] Modules linked in:
      [ 4239.524630] RIP: 0010:btree_read_extent_buffer_pages+0x1ad/0x210
      [ 4239.527101] Call Trace:
      [ 4239.527251]  read_tree_block+0x42/0x70
      [ 4239.527434]  read_node_slot+0xd2/0x110
      [ 4239.527632]  push_leaf_right+0xad/0x1b0
      [ 4239.527809]  split_leaf+0x4ea/0x700
      [ 4239.527988]  ? leaf_space_used+0xbc/0xe0
      [ 4239.528192]  ? btrfs_set_lock_blocking_rw+0x99/0xb0
      [ 4239.528416]  btrfs_search_slot+0x8cc/0xa40
      [ 4239.528605]  btrfs_insert_empty_items+0x71/0xc0
      [ 4239.528798]  __btrfs_run_delayed_refs+0xa98/0x1680
      [ 4239.529013]  btrfs_run_delayed_refs+0x10b/0x1b0
      [ 4239.529205]  btrfs_commit_transaction+0x33/0xaf0
      [ 4239.529445]  ? start_transaction+0xa8/0x4f0
      [ 4239.529630]  btrfs_alloc_data_chunk_ondemand+0x1b0/0x4e0
      [ 4239.529833]  btrfs_check_data_free_space+0x54/0xa0
      [ 4239.530045]  btrfs_delalloc_reserve_space+0x25/0x70
      [ 4239.531907]  btrfs_direct_IO+0x233/0x3d0
      [ 4239.532098]  generic_file_direct_write+0xcb/0x170
      [ 4239.532296]  btrfs_file_write_iter+0x2bb/0x5f4
      [ 4239.532491]  aio_write+0xe2/0x180
      [ 4239.532669]  ? lock_acquire+0xac/0x1e0
      [ 4239.532839]  ? __might_fault+0x3e/0x90
      [ 4239.533032]  do_io_submit+0x594/0x860
      [ 4239.533223]  ? do_io_submit+0x594/0x860
      [ 4239.533398]  SyS_io_submit+0x10/0x20
      [ 4239.533560]  ? SyS_io_submit+0x10/0x20
      [ 4239.533729]  do_syscall_64+0x75/0x1d0
      [ 4239.533979]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
      [ 4239.534182] RIP: 0033:0x7f8519741697
      
      The problem is that in btree_read_extent_buffer_pages() we have not
      acquired the read/write lock on the extent buffer, so only basic info
      such as level and bytenr is reliable.
      
      So a race condition leads to this false alert.
      
      However, at the current call site it's impossible to acquire the
      proper lock without a race window.
      To fix the problem, we only verify the first key for committed tree
      blocks (those whose generation is no larger than
      fs_info->last_trans_committed). The content of such tree blocks will
      not change, so there is no need to take the read/write lock.
      Reported-by: Nikolay Borisov <nborisov@suse.com>
      Fixes: 581c1760 ("btrfs: Validate child tree block's level and first key")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
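
      The rule above can be sketched in user space as a single comparison.
      The struct and function names below are hypothetical stand-ins, not
      the kernel's types; only the comparison against last_trans_committed
      comes from the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for an extent buffer; the field name is an assumption. */
struct eb_info {
	uint64_t generation;
};

/*
 * Only blocks written by an already committed transaction have stable
 * content, so only those can be checked without holding the block's lock.
 */
bool can_verify_first_key(const struct eb_info *eb,
			  uint64_t last_trans_committed)
{
	return eb->generation <= last_trans_committed;
}
```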
  3. 12 Apr, 2018 5 commits
    • btrfs: add SPDX header to Kconfig · 852eb3ae
      David Sterba authored
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace GPL boilerplate by SPDX -- sources · c1d7c514
      David Sterba authored
      Remove GPL boilerplate text (long, short, one-line) and keep the rest,
      ie. personal, company or original source copyright statements. Add the
      SPDX header.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace GPL boilerplate by SPDX -- headers · 9888c340
      David Sterba authored
      Remove GPL boilerplate text (long, short, one-line) and keep the rest,
      ie. personal, company or original source copyright statements. Add the
      SPDX header.
      
      Unify the include protection macros to match the file names.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix loss of prealloc extents past i_size after fsync log replay · 471d557a
      Filipe Manana authored
      Currently if we allocate extents beyond an inode's i_size (through the
      fallocate system call) and then fsync the file, we log the extents but
      after a power failure we replay them and then immediately drop them.
      This behaviour has existed since about 2009, commit c71bf099 ("Btrfs:
      Avoid orphan inodes cleanup while replaying log"), because it marks
      the inode as an orphan instead of dropping any extents beyond i_size
      before replaying logged extents, so after the log replay, and while
      the mount operation is still ongoing, we find the inode marked as an
      orphan and then perform a truncation (drop extents beyond the inode's
      i_size). Because the processing of orphan inodes is still done
      right after replaying the log and before the mount operation finishes,
      the intention of that commit does not make any sense (at least as
      of today). However reverting that behaviour is not enough, because
      we can not simply discard all extents beyond i_size and then replay
      logged extents, because we risk dropping extents beyond i_size created
      in past transactions, for example:
      
        add prealloc extent beyond i_size
        fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
        transaction commit
        add another prealloc extent beyond i_size
        fsync - triggers the fast fsync path
        power failure
      
      In that scenario, we would drop the first extent and then replay the
      second one. To fix this just make sure that all prealloc extents
      beyond i_size are logged, and if we find too many (which is far from
      a common case), fall back to a full transaction commit (like we do when
      logging regular extents in the fast fsync path).
      
      Trivial reproducer:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
       $ sync
       $ xfs_io -c "falloc -k 256K 1M" /mnt/foo
       $ xfs_io -c "fsync" /mnt/foo
       <power failure>
      
       # mount to replay log
       $ mount /dev/sdb /mnt
       # at this point the file only has one extent, at offset 0, size 256K
      
      A test case for fstests follows soon, covering multiple scenarios that
      involve adding prealloc extents with previous shrinking truncates and
      without such truncates.
      
      Fixes: c71bf099 ("Btrfs: Avoid orphan inodes cleanup while replaying log")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: clean up resources during umount after trans is aborted · af722733
      Liu Bo authored
      Currently if some fatal error occurs, such as all IO getting -EIO,
      resources are cleaned up when
      a) a transaction is being committed or
      b) BTRFS_FS_STATE_ERROR is set
      
      However, in some rare cases resources may be left behind after the
      transaction is aborted, and umount may then run into an ASSERT(), e.g.
      ASSERT(list_empty(&block_group->dirty_list));
      
      For case a), in btrfs_commit_transaction() there are several places at
      the beginning where we just call btrfs_end_transaction() without
      cleaning up resources.  For case b), it is possible that the trans
      handle doesn't have any dirty stuff, so only the trans handle is marked
      as aborted while BTRFS_FS_STATE_ERROR is not set, and resources remain
      in memory.
      
      This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure
      that no resources stay in memory after umount.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
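
      A small user-space sketch of the extended check, assuming the two
      states are kept as bits in one state word. The enum names and bit
      positions are illustrative assumptions; only the idea of testing for
      either state comes from the message.

```c
#include <stdbool.h>

/* Illustrative bit values; the kernel's actual definitions may differ. */
enum {
	FS_STATE_ERROR         = 1 << 0,
	FS_STATE_TRANS_ABORTED = 1 << 1,
};

/*
 * Clean up on umount if either the filesystem error state or the
 * "transaction aborted" state is set, not only the former.
 */
bool needs_cleanup_on_umount(unsigned int fs_state)
{
	return (fs_state & (FS_STATE_ERROR | FS_STATE_TRANS_ABORTED)) != 0;
}
```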
  4. 05 Apr, 2018 3 commits
  5. 31 Mar, 2018 18 commits
    • btrfs: lift errors from add_extent_changeset to the callers · 57599c7e
      David Sterba authored
      The missing error handling in add_extent_changeset was hidden, so make
      it at least visible in the callers.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: print error messages when failing to read trees · f50f4353
      Liu Bo authored
      When mount fails to read trees such as the fs tree, checksum tree,
      extent tree, etc., there is not enough information about what went
      wrong.
      
      With this, messages like
      
      "BTRFS warning (device sdf): failed to read root (objectid=7): -5"
      
      would help us a bit.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use proper type for btrfs_mask_flags flags · 38e82de8
      David Sterba authored
      All users pass a local unsigned int and not the __uXX types that are
      supposed to be used for userspace interfaces.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: split dev-replace locking helpers for read and write · 7e79cb86
      David Sterba authored
      The current calls are unclear in what way btrfs_dev_replace_lock takes
      the locks, so drop the argument, split the helpers and use similar
      naming as for read and write locks.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove stale comments about fs_mutex · e7ab0af6
      David Sterba authored
      The fs_mutex was killed in 2008 by a2135011 ("Btrfs: Replace
      the big fs_mutex with a collection of other locks"), but is still
      mentioned in some comments.
      
      We don't have any extra needs for locking in the ACL handlers.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use RCU in btrfs_show_devname for device list traversal · 88c14590
      David Sterba authored
      The show_devname callback is used to print the device name in
      /proc/self/mounts. We need to traverse the device list consistently
      and read the name, which is copied to a seq buffer, so we don't need
      further locking.
      
      If the first device is being deleted at the same time, the RCU will
      allow us to read the device name, though it will become stale right
      after the RCU protection ends. This is unavoidable and the user can
      expect that the device will disappear from the filesystem's list at some
      point.
      
      The device_list_mutex was pretty heavy as it is used eg. for writing
      superblock and a few other IO related contexts. This can stall any
      application that reads the proc file for no reason.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update barrier in should_cow_block · d1980131
      David Sterba authored
      Once there was a simple int force_cow that was used with the plain
      barriers, and then converted to a bit, so we should use the appropriate
      barrier helper.
      
      Other variables in the complex if condition do not depend on a barrier,
      so we should be fine in case the atomic barrier becomes a no-op.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use lockdep_assert_held for mutexes · a32bf9a3
      David Sterba authored
      Using lockdep_assert_held is preferred, replace mutex_is_locked.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use lockdep_assert_held for spinlocks · a4666e68
      David Sterba authored
      Using lockdep_assert_held is preferred, replace assert_spin_locked.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Validate child tree block's level and first key · 581c1760
      Qu Wenruo authored
      We have several reports of node pointers pointing to incorrect child
      tree blocks, which could even have the wrong owner and level while
      still having a valid generation and checksum.
      
      Although btrfs check can handle this and prints an error message like:
      leaf parent key incorrect 60670574592
      
      the kernel does not have enough checks for this type of corruption.
      At least add such a check to read_tree_block() and btrfs_read_buffer(),
      which take two new parameters, @level and @first_key, to verify the
      child tree block.
      
      The new @level check is mandatory and all call sites are already
      modified to extract expected level from its call chain.
      
      While @first_key is optional, the following call sites are skipping such
      check:
      1) Root node/leaf
         As ROOT_ITEM doesn't contain the first key, skip @first_key check.
      2) Direct backref
         Only parent bytenr and level is known and we need to resolve the key
         all by ourselves, skip @first_key check.
      
      Another note on this verification: it needs extra info from the
      nodeptr or ROOT_ITEM, so it can't fit into the current tree-checker
      framework, which is limited to node/leaf boundaries.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
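
      The verification described in this message can be sketched in user
      space as follows: the level check is mandatory, and the first key
      check is skipped when no expected key is available (root node/leaf,
      direct backref). The struct layouts, the function names and the -1
      error value are simplified assumptions, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; the real kernel structures are richer. */
struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

struct child_block {
	int level;
	struct key first_key;
};

bool key_equal(const struct key *a, const struct key *b)
{
	return a->objectid == b->objectid && a->type == b->type &&
	       a->offset == b->offset;
}

/*
 * Return 0 when the child matches what its parent expects, -1 otherwise.
 * Pass NULL as expected_first_key to skip the first key check.
 */
int verify_child(const struct child_block *child, int expected_level,
		 const struct key *expected_first_key)
{
	if (child->level != expected_level)
		return -1;
	if (expected_first_key &&
	    !key_equal(&child->first_key, expected_first_key))
		return -1;
	return 0;
}
```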
    • btrfs: tests/qgroup: Fix wrong tree backref level · 3c0efdf0
      Qu Wenruo authored
      The extent tree of the test fs is like the following:
      
       BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
        item 0 key (4096 168 4096) itemoff 3944 itemsize 51
                extent refs 1 gen 1 flags 2
                tree block key (68719476736 0 0) level 1
                                                 ^^^^^^^
                ref#0: tree block backref root 5
      
      And it's using an empty tree for fs tree, so there is no way that its
      level can be 1.
      
      For REAL (created by mkfs) fs tree backref with no skinny metadata, the
      result should look like:
      
       item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
               refs 1 gen 4 flags TREE_BLOCK
               tree block key (256 INODE_ITEM 0) level 0
                                                 ^^^^^^^
               tree block backref root 5
      
      Fix the level to 0, so it won't break later tree level checker.
      
      Fixes: faa2dbf0 ("Btrfs: add sanity tests for new qgroup accounting code")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix copy_items() return value when logging an inode · 8434ec46
      Filipe Manana authored
      When logging an inode, at tree-log.c:copy_items(), if we call
      btrfs_next_leaf() in the loop which checks for the need to log holes,
      we need to make sure copy_items() returns the value 1 to its caller
      and not 0 (on success). This is because the path the caller passed was
      released and is now different from what it was before. The caller
      expects a return value of 0 to mean both success and that the path
      has not changed, while a return value of 1 means success and also
      signals that the caller can not reuse the path and has to perform
      another tree search.
      
      Even though this case should not be triggered under normal
      circumstances, or only very rarely, its consequences can be very
      unpredictable (especially when replaying a log tree).
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
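
      The return convention described in this message (negative for an
      error, 0 for success with the path unchanged, 1 for success with the
      path released) can be sketched from the caller's side as below. The
      enum and function names are hypothetical and only illustrate how a
      caller might act on each value.

```c
/* What a caller should do next; names are illustrative only. */
enum next_step {
	STEP_FAIL,          /* propagate the error */
	STEP_REUSE_PATH,    /* path is still valid, keep iterating */
	STEP_SEARCH_AGAIN,  /* path was released, redo the tree search */
};

/* Map a copy_items()-style return value to the caller's next action. */
enum next_step next_step_for(int ret)
{
	if (ret < 0)
		return STEP_FAIL;
	if (ret == 0)
		return STEP_REUSE_PATH;
	return STEP_SEARCH_AGAIN;
}
```

      Returning 0 after the path was released would send the caller down
      the STEP_REUSE_PATH branch with a path that no longer points where it
      expects, which is the bug this commit addresses.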
    • Btrfs: fix fsync after hole punching when using no-holes feature · 4ee3fad3
      Filipe Manana authored
      When we have the no-holes mode enabled and fsync a file after punching a
      hole in it, we can end up not logging the whole hole range in the log tree.
      This happens if the file has extent items that span more than one leaf and
      we punch a hole that covers a range that starts in a leaf but does not go
      beyond the offset of the first extent in the next leaf.
      
      Example:
      
        $ mkfs.btrfs -f -O no-holes -n 65536 /dev/sdb
        $ mount /dev/sdb /mnt
        $ for ((i = 0; i <= 831; i++)); do
      	offset=$((i * 2 * 256 * 1024))
      	xfs_io -f -c "pwrite -S 0xab -b 256K $offset 256K" \
      		/mnt/foobar >/dev/null
          done
        $ sync
      
        # We now have 2 leaves in our filesystem fs tree, the first leaf has an
        # item corresponding to the extent at file offset 216530944 and the
        # second leaf has a first item corresponding to the extent at offset
        # 217055232. Now we punch a hole that partially covers the range of the
        # extent at offset 216530944 but does not go beyond the offset 217055232.
      
        $ xfs_io -c "fpunch $((216530944 + 128 * 1024 - 4000)) 256K" /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
      
        <power fail>
      
        # mount to replay the log
        $ mount /dev/sdb /mnt
      
        # Before this patch, only the subrange [216658016, 216662016[ (length of
        # 4000 bytes) was logged, leaving an incorrect file layout after log
        # replay.
      
      Fix this by checking if there is a hole between the last extent item that
      we processed and the first extent item in the next leaf, and if there is
      one, log an explicit hole extent item.
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
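
      The fix amounts to a gap computation between the end of the last
      extent item processed and the start of the first extent item in the
      next leaf. A minimal user-space sketch, with a hypothetical function
      name:

```c
#include <stdint.h>

/*
 * Length of the hole between the end of the last extent item processed
 * and the start of the first extent item in the next leaf.
 * Returns 0 when the two are adjacent (or overlap), i.e. no hole.
 */
uint64_t hole_len_between(uint64_t last_extent_end,
			  uint64_t next_extent_start)
{
	if (next_extent_start <= last_extent_end)
		return 0;
	return next_extent_start - last_extent_end;
}
```

      In the example above, assuming the punched extent is trimmed to end
      where the hole starts (offset 216658016) and the next leaf starts at
      offset 217055232, the gap is much larger than the 4000 bytes that
      were logged before the patch.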
    • btrfs: use helper to set ulist aux from a qgroup · a1840b50
      David Sterba authored
      We have a nice helper that does the proper casting of a qgroup to a
      ulist aux value, and several places that could make use of it.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Revert "btrfs: qgroups: Retry after commit on getting EDQUOT" · 0b78877a
      Qu Wenruo authored
      This reverts commit 48a89bc4.
      
      The idea of committing the transaction to free some space after
      hitting the qgroup limit is good, but it can easily cause deadlocks.
      
      One deadlock example is caused by trying to flush data while still
      holding it:
      
      Call Trace:
       __schedule+0x49d/0x10f0
       schedule+0xc6/0x290
       schedule_timeout+0x187/0x1c0
       wait_for_completion+0x204/0x3a0
       btrfs_wait_ordered_extents+0xa40/0xaf0 [btrfs]
       qgroup_reserve+0x913/0xa10 [btrfs]
       btrfs_qgroup_reserve_data+0x3ef/0x580 [btrfs]
       btrfs_check_data_free_space+0x96/0xd0 [btrfs]
       __btrfs_buffered_write+0x3ac/0xd40 [btrfs]
       btrfs_file_write_iter+0x62a/0xba0 [btrfs]
       __vfs_write+0x320/0x430
       vfs_write+0x107/0x270
       SyS_write+0xbf/0x150
       do_syscall_64+0x1b0/0x3d0
       entry_SYSCALL64_slow_path+0x25/0x25
      
      Another can be caused by trying to commit one transaction while nesting
      with trans handle held by ourselves:
      
      btrfs_start_transaction()
      |- btrfs_qgroup_reserve_meta_pertrans()
         |- qgroup_reserve()
            |- btrfs_join_transaction()
            |- btrfs_commit_transaction()
      
      The retry is causing more problems than expected when a limit is
      enabled. At least a graceful EDQUOT is way better than a deadlock.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Update trace events for metadata reservation · 4ee0d883
      Qu Wenruo authored
      Now trace_qgroup_meta_reserve() takes an extra type parameter.
      
      And introduce two new trace events:
      
      1) trace_qgroup_meta_free_all_pertrans()
         For btrfs_qgroup_free_meta_all_pertrans()
      
      2) trace_qgroup_meta_convert()
         For btrfs_qgroup_convert_reserved_meta()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space · 8287475a
      Qu Wenruo authored
      For the quota disabled->enabled case, it's possible that at
      reservation time quota was not enabled, so no bytes were really
      reserved, while at release time quota was enabled, so we would try to
      release bytes we never owned.
      
      Such a situation can cause metadata reservation underflow for both
      types, although it is less likely for the per-trans type since
      enabling quota commits the transaction.
      
      To address this, record qgroup meta reserved bytes in
      root::qgroup_meta_rsv_pertrans and ::prealloc.
      That way, at release time we won't free any bytes we didn't reserve.
      
      For DATA, it's already handled by io_tree, so nothing needs to be done
      there.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
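
      A user-space sketch of the bookkeeping idea: keep a per-root record
      of what was actually reserved and clamp every release to that record,
      so a release issued after quota was enabled can not underflow. The
      struct and function names are hypothetical; only the two counters
      (pertrans and prealloc) come from the message.

```c
#include <stdint.h>

/* Per-root record of bytes actually reserved; a simplified stand-in. */
struct root_meta_rsv {
	uint64_t pertrans;
	uint64_t prealloc;
};

/*
 * Release up to num_bytes of the per-trans reservation, but never more
 * than was recorded. Returns the number of bytes actually released,
 * which is what should be handed back to the qgroup.
 */
uint64_t release_pertrans(struct root_meta_rsv *rsv, uint64_t num_bytes)
{
	uint64_t n = num_bytes < rsv->pertrans ? num_bytes : rsv->pertrans;

	rsv->pertrans -= n;
	return n;
}
```

      If nothing was recorded because quota was disabled at reservation
      time, the function releases 0 bytes and the qgroup counters stay
      untouched.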
    • btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item · 4f5427cc
      Qu Wenruo authored
      Quite similar to delalloc, with some modifications to the
      delayed-inode and delayed-item reservation.  It also needs an extra
      parameter for the release case to distinguish a normal release from an
      error release.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 30 Mar, 2018 12 commits