1. 12 Jun, 2015 2 commits
  2. 10 Jun, 2015 27 commits
    • Zhao Lei's avatar
      btrfs: wait for delayed iputs on no space · 9a4e7276
      Zhao Lei authored
      btrfs will report no_space when we run following write and delete
      file loop:
       # FILE_SIZE_M=[ 75% of fs space ]
       # DEV=[ some dev ]
       # MNT=[ some dir ]
       #
       # mkfs.btrfs -f "$DEV"
       # mount -o nodatacow "$DEV" "$MNT"
       # for ((i = 0; i < 100; i++)); do dd if=/dev/zero of="$MNT"/file0 bs=1M count="$FILE_SIZE_M"; rm -f "$MNT"/file0; done
       #
      
      Reason:
       iput() and evict() is run after write pages to block device, if
       write pages work is not finished before next write, the "rm"ed space
       is not freed, and caused above bug.
      
      Fix:
       We can add "-o flushoncommit" mount option to avoid above bug, but
       it have performance problem. Actually, we can to wait for on-the-fly
       writes only when no-space happened, it is which this patch do.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      9a4e7276
    • Qu Wenruo's avatar
      btrfs: qgroup: Make snapshot accounting work with new extent-oriented · d6726335
      Qu Wenruo authored
      qgroup.
      
      Make snapshot accounting work with new extent-oriented mechanism by
      skipping given root in new/old_roots in create_pending_snapshot().
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d6726335
    • Qu Wenruo's avatar
      btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots. · 9086db86
      Qu Wenruo authored
      This is used by later qgroup fix patches for snapshot.
      
      As current snapshot accounting is done by btrfs_qgroup_inherit(), but
      new extent oriented quota mechanism will account extent from
      btrfs_copy_root() and other snapshot things, causing wrong result.
      
      So add this ability to handle snapshot accounting.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      9086db86
    • Qu Wenruo's avatar
      btrfs: ulist: Add ulist_del() function. · d4b80404
      Qu Wenruo authored
      This function will delete unode with given (val,aux) pair.
      And with this patch, seqnum for debug usage doesn't have any meaning
      now, so remove them.
      
      This is used by later patches to skip snapshot root.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d4b80404
    • Qu Wenruo's avatar
      btrfs: qgroup: Cleanup the old ref_node-oriented mechanism. · e69bcee3
      Qu Wenruo authored
      Goodbye, the old mechanisim.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e69bcee3
    • Qu Wenruo's avatar
      btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism. · 442244c9
      Qu Wenruo authored
      Since the self test transaction don't have delayed_ref_roots, so use
      find_all_roots() and export btrfs_qgroup_account_extent() to simulate it
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      442244c9
    • Qu Wenruo's avatar
      btrfs: qgroup: Switch to new extent-oriented qgroup mechanism. · 0ed4792a
      Qu Wenruo authored
      Switch from old ref_node based qgroup to extent based qgroup mechanism
      for normal operations.
      
      The new mechanism should hugely reduce the overhead of btrfs quota
      system, and further more, the codes and logic should be more clean and
      easier to maintain.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      0ed4792a
    • Qu Wenruo's avatar
      btrfs: qgroup: Switch rescan to new mechanism. · 9d220c95
      Qu Wenruo authored
      Switch rescan to use the new new extent oriented mechanism.
      
      As rescan is also based on extent, new mechanism is just a perfect match
      for rescan.
      
      With re-designed internal functions, rescan is quite easy, just call
      btrfs_find_all_roots() and then btrfs_qgroup_account_one_extent().
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      9d220c95
    • Qu Wenruo's avatar
      btrfs: qgroup: Add new qgroup calculation function · 550d7a2e
      Qu Wenruo authored
      btrfs_qgroup_account_extents().
      
      The new btrfs_qgroup_account_extents() function should be called in
      btrfs_commit_transaction() and it will update all the qgroup according
      to delayed_ref_root->dirty_extent_root.
      
      The new function can handle both normal operation during
      commit_transaction() or in rescan in a unified method with clearer
      logic.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      550d7a2e
    • Qu Wenruo's avatar
      btrfs: backref: Add special time_seq == (u64)-1 case for · 21633fc6
      Qu Wenruo authored
      btrfs_find_all_roots().
      
      Allow btrfs_find_all_roots() to skip all delayed_ref_head lock and tree
      lock to do tree search.
      
      This is important for later qgroup implement which will call
      find_all_roots() after fs trees are committed.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      21633fc6
    • Qu Wenruo's avatar
      btrfs: qgroup: Add new function to record old_roots. · 3b7d00f9
      Qu Wenruo authored
      Add function btrfs_qgroup_prepare_account_extents() to get old_roots
      which are needed for qgroup.
      
      We do it in commit_transaction() and before switch_roots(), and only
      search commit_root, so it gives a quite accurate view for previous
      transaction.
      
      With old_roots from previous transaction, we can use it to do accurate
      account with current transaction.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3b7d00f9
    • Qu Wenruo's avatar
      btrfs: qgroup: Record possible quota-related extent for qgroup. · 3368d001
      Qu Wenruo authored
      Add hook in add_delayed_ref_head() to record quota-related extent record
      into delayed_ref_root->dirty_extent_record rb-tree for later qgroup
      accounting.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3368d001
    • Qu Wenruo's avatar
      btrfs: qgroup: Add function qgroup_update_counters(). · 823ae5b8
      Qu Wenruo authored
      Add function qgroup_update_counters(), which will update related
      qgroups' rfer/excl according to old/new_roots.
      
      This is one of the two core functions for the new qgroup implement.
      
      This is based on btrfs_adjust_coutners() but with clearer logic and
      comment.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      823ae5b8
    • Qu Wenruo's avatar
      btrfs: qgroup: Add function qgroup_update_refcnt(). · d810ef2b
      Qu Wenruo authored
      This function is used to update refcnt for qgroups.
      And is one of the two core functions used in the new qgroup implement.
      
      This is based on the old update_old/new_refcnt, but provides a unified
      logic and behavior.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d810ef2b
    • Qu Wenruo's avatar
      btrfs: extent-tree: Use ref_node to replace unneeded parameters in... · c682f9b3
      Qu Wenruo authored
      btrfs: extent-tree: Use ref_node to replace unneeded parameters in __inc_extent_ref() and __free_extent()
      
      __btrfs_inc_extent_ref() and __btrfs_free_extent() have already had too
      many parameters, but three of them can be extracted from
      btrfs_delayed_ref_node struct.
      
      So use btrfs_delayed_ref_node struct as a single parameter to replace
      the bytenr/num_byte/no_quota parameters.
      
      The real objective of this patch is to allow btrfs_qgroup_record_ref()
      get the delayed_ref_node in incoming qgroup patches.
      
      Other functions calling btrfs_qgroup_record_ref() are not affected since
      the rest will only add/sub exclusive extents, where node is not used.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c682f9b3
    • Qu Wenruo's avatar
      btrfs: qgroup: Cleanup open-coded old/new_refcnt update and read. · 9c542136
      Qu Wenruo authored
      Use inline functions to do such things, to improve readability.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Acked-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      9c542136
    • Qu Wenruo's avatar
      btrfs: delayed-ref: Cleanup the unneeded functions. · c43d160f
      Qu Wenruo authored
      Cleanup the rb_tree merge/insert/update functions, since now we use list
      instead of rb_tree now.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c43d160f
    • Qu Wenruo's avatar
      btrfs: delayed-ref: Use list to replace the ref_root in ref_head. · c6fc2454
      Qu Wenruo authored
      This patch replace the rbtree used in ref_head to list.
      This has the following advantage:
      1) Easier merge logic.
      With the new list implement, we only need to care merging the tail
      ref_node with the new ref_node.
      And this can be done quite easy at insert time, no need to do a
      indicated merge at run_delayed_refs().
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c6fc2454
    • Qu Wenruo's avatar
      btrfs: backref: Don't merge refs which are not for same block. · 00db646d
      Qu Wenruo authored
      Old __merge_refs() in backref.c will even merge refs whose root_id are
      different, which makes qgroup gives wrong result.
      
      Fix it by checking ref_for_same_block() before any mode specific works.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      00db646d
    • Zhao Lei's avatar
      btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx() · 20b2e302
      Zhao Lei authored
      lockdep report following warning in test:
       [25176.843958] =================================
       [25176.844519] [ INFO: inconsistent lock state ]
       [25176.845047] 4.1.0-rc3 #22 Tainted: G        W
       [25176.845591] ---------------------------------
       [25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
       [25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes:
       [25176.847246]  (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs]
       [25176.847838] {SOFTIRQ-ON-W} state was registered at:
       [25176.848396]   [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10
       [25176.848955]   [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0
       [25176.849491]   [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410
       [25176.850029]   [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs]
       [25176.850575]   [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs]
       [25176.851110]   [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs]
       [25176.851660]   [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs]
       [25176.852189]   [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs]
       [25176.852771]   [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs]
       [25176.853315]   [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570
       [25176.853868]   [<ffffffff8121c851>] SyS_ioctl+0x41/0x80
       [25176.854406]   [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f
       [25176.854935] irq event stamp: 51506
       [25176.855511] hardirqs last  enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0
       [25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0
       [25176.856642] softirqs last  enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640
       [25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120
       [25176.857746]
       other info that might help us debug this:
       [25176.858845]  Possible unsafe locking scenario:
       [25176.859981]        CPU0
       [25176.860537]        ----
       [25176.861059]   lock(&wr_ctx->wr_lock);
       [25176.861705]   <Interrupt>
       [25176.862272]     lock(&wr_ctx->wr_lock);
       [25176.862881]
        *** DEADLOCK ***
      
      Reason:
       Above warning is caused by:
       Interrupt
       -> bio_endio()
       -> ...
       -> scrub_put_ctx()
       -> scrub_free_ctx() *1
       -> ...
       -> mutex_lock(&wr_ctx->wr_lock);
      
       scrub_put_ctx() is allowed to be called in end_bio interrupt, but
       in code design, it will never call scrub_free_ctx(sctx) in interrupe
       context(above *1), because btrfs_scrub_dev() get one additional
       reference of sctx->refs, which makes scrub_free_ctx() only called
       withine btrfs_scrub_dev().
      
       Now the code runs out of our wish, because free sequence in
       scrub_pending_bio_dec() have a gap.
      
       Current code:
       -----------------------------------+-----------------------------------
       scrub_pending_bio_dec()            |  btrfs_scrub_dev
       -----------------------------------+-----------------------------------
       atomic_dec(&sctx->bios_in_flight); |
       wake_up(&sctx->list_wait);         |
                                          | scrub_put_ctx()
                                          | -> atomic_dec_and_test(&sctx->refs)
       scrub_put_ctx(sctx);               |
       -> atomic_dec_and_test(&sctx->refs)|
       -> scrub_free_ctx()                |
       -----------------------------------+-----------------------------------
      
       We expected:
       -----------------------------------+-----------------------------------
       scrub_pending_bio_dec()            |  btrfs_scrub_dev
       -----------------------------------+-----------------------------------
       atomic_dec(&sctx->bios_in_flight); |
       wake_up(&sctx->list_wait);         |
       scrub_put_ctx(sctx);               |
       -> atomic_dec_and_test(&sctx->refs)|
                                          | scrub_put_ctx()
                                          | -> atomic_dec_and_test(&sctx->refs)
                                          | -> scrub_free_ctx()
       -----------------------------------+-----------------------------------
      
      Fix:
       Move scrub_pending_bio_dec() to a workqueue, to avoid this function run
       in interrupt context.
       Tested by check tracelog in debug.
      
      Changelog v1->v2:
       Use workqueue instead of adjust function call sequence in v1,
       because v1 will introduce a bug pointed out by:
       Filipe David Manana <fdmanana@gmail.com>
      Reported-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      20b2e302
    • Mark Fasheh's avatar
      btrfs: Handle unaligned length in extent_same · e1d227a4
      Mark Fasheh authored
      The extent-same code rejects requests with an unaligned length. This
      poses a problem when we want to dedupe the tail extent of files as we
      skip cloning the portion between i_size and the extent boundary.
      
      If we don't clone the entire extent, it won't be deleted. So the
      combination of these behaviors winds up giving us worst-case dedupe on
      many files.
      
      We can fix this by allowing a length that extents to i_size and
      internally aligining those to the end of the block. This is what
      btrfs_ioctl_clone() so we can just copy that check over.
      Signed-off-by: default avatarMark Fasheh <mfasheh@suse.de>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e1d227a4
    • chandan's avatar
      Btrfs: btrfs_defrag_file: Fix calculation of max_to_defrag. · 070034bd
      chandan authored
      max_to_defrag represents the number of pages to defrag rather than the last
      page of the file range to be defragged.
      
      Consider a file having 10 4k blocks (i.e. blocks in the range [0 - 9]). If the
      defrag ioctl was invoked for the block range [3 - 6], then max_to_defrag
      should actually have the value 4. Instead in the current code we end up
      setting it to 6.
      
      Now, this does not (yet) cause an issue since the first part of the while loop
      condition in btrfs_defrag_file() (i.e. "i <= last_index") causes the control
      to flow out of the while loop before any buggy behavior is actually caused. So
      the patch just makes sure that max_to_defrag ends up having the right value
      rather than fixing a bug. I did run the xfstests suite to make sure that the
      code does not regress.
      
      Changelog: v1->v2:
      Provide a much descriptive commit message.
      Signed-off-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      070034bd
    • chandan's avatar
      Btrfs: btrfs_defrag_file: Fix ra_index computation. · e4826a5b
      chandan authored
      Read-ahead is done for the pages in the range [ra_index, ra_index + cluster -
      1]. So the next read-ahead should be starting from the page at index 'ra_index
      + cluster' (unless we deemed that the extent at 'ra_index + cluster' as
      non-defraggable) rather than from the page at index 'ra_index +
      max_cluster'. This patch fixes this. I did run the xfstests suite to make sure
      that the code does not regress.
      Signed-off-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e4826a5b
    • Filipe Manana's avatar
      Btrfs: fix necessary chunk tree space calculation when allocating a chunk · 4617ea3a
      Filipe Manana authored
      When allocating a new chunk or removing one we need to update num_devs
      device items and insert or remove a chunk item in the chunk tree, so
      in the worst case the space needed in the chunk space_info is:
      
        btrfs_calc_trunc_metadata_size(chunk_root, num_devs) +
           btrfs_calc_trans_metadata_size(chunk_root, 1)
      
      That is, in the worst case we need to cow num_devs paths and cow 1 other
      path that can result in splitting every node and leaf, and each path
      consisting of BTRFS_MAX_LEVEL - 1 nodes and 1 leaf. We were requiring
      some additional chunk_root->nodesize * BTRFS_MAX_LEVEL * num_devs bytes,
      which were unnecessary since updating the existing device items does
      not result in splitting the nodes and leaf since after updating them
      they remain with the same size.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      4617ea3a
    • Filipe Manana's avatar
      Btrfs: don't attach unnecessary extents to transaction on fsync · 7558c8bc
      Filipe Manana authored
      We don't need to attach ordered extents that have completed to the current
      transaction. Doing so only makes us hold memory for longer than necessary
      and delaying the iput of the inode until the transaction is committed (for
      each created ordered extent we do an igrab and then schedule an asynchronous
      iput when the ordered extent's reference count drops to 0), preventing the
      inode from being evictable until the transaction commits.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      7558c8bc
    • Filipe Manana's avatar
      Btrfs: avoid syncing log in the fast fsync path when not necessary · b659ef02
      Filipe Manana authored
      Commit 3a8b36f3 ("Btrfs: fix data loss in the fast fsync path") added
      a performance regression for that causes an unnecessary sync of the log
      trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done
      against a file, without no writes or any metadata updates to the inode in
      between them and if a transaction is committed before the second fsync is
      called.
      
      Huang Ying reported this to lkml (https://lkml.org/lkml/2015/3/18/99)
      after a test sysbench test that measured a -62% decrease of file io
      requests per second for that tests' workload.
      
      The test is:
      
        echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
        echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
        echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
        echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
        mkfs -t btrfs /dev/sda2
        mount -t btrfs /dev/sda2 /fs/sda2
        cd /fs/sda2
        for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done
        sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \
          --file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync \
          --file-num=1024 run
      
      A test on kvm guest, running a debug kernel gave me the following results:
      
      Without 3a8b36f3:             16.01 reqs/sec
      With 3a8b36f3:                 3.39 reqs/sec
      With 3a8b36f3 and this patch: 16.04 reqs/sec
      Reported-by: default avatarHuang Ying <ying.huang@intel.com>
      Tested-by: default avatarHuang, Ying <ying.huang@intel.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      b659ef02
    • Chris Mason's avatar
      Merge branch 'send_fixes_4.2' of... · 1ab818b1
      Chris Mason authored
      Merge branch 'send_fixes_4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.2
      1ab818b1
  3. 03 Jun, 2015 11 commits
    • Filipe Manana's avatar
      Btrfs: fix hang during inode eviction due to concurrent readahead · 6ca07097
      Filipe Manana authored
      Zygo Blaxell and other users have reported occasional hangs while an
      inode is being evicted, leading to traces like the following:
      
      [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
      [ 5281.973836]       Not tainted 4.0.0-rc5-btrfs-next-9+ #2
      [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 5281.976364] rm              D ffff8800724cfc38     0 20488   7747 0x00000000
      [ 5281.977506]  ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
      [ 5281.978461]  ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
      [ 5281.979541]  ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
      [ 5281.981396] Call Trace:
      [ 5281.982066]  [<ffffffff8143107e>] schedule+0x74/0x83
      [ 5281.983341]  [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
      [ 5281.985127]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [ 5281.986715]  [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
      [ 5281.988680]  [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
      [ 5281.990200]  [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
      [ 5281.991781]  [<ffffffff8116964d>] evict+0xa0/0x148
      [ 5281.992735]  [<ffffffff8116a43d>] iput+0x18f/0x1e5
      [ 5281.993796]  [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
      [ 5281.994806]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
      [ 5281.996120]  [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
      [ 5281.997562]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [ 5281.998815]  [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
      [ 5281.999920]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [ 5282.001299] 1 lock held by rm/20488:
      [ 5282.002066]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b
      
      This happens when we have readahead, which calls readpages(), happening
      right before the inode eviction handler is invoked. So the reason is
      essentially:
      
      1) readpages() is called while a reference on the inode is held, so
         eviction can not be triggered before readpages() returns. It also
         locks one or more ranges in the inode's io_tree (which is done at
         extent_io.c:__do_contiguous_readpages());
      
      2) readpages() submits several read bios, all with an end io callback
         that runs extent_io.c:end_bio_extent_readpage() and that is executed
         by other task when a bio finishes, corresponding to a work queue
         (fs_info->end_io_workers) worker kthread. This callback unlocks
         the ranges in the inode's io_tree that were previously locked in
         step 1;
      
      3) readpages() returns, the reference on the inode is dropped;
      
      4) One or more of the read bios previously submitted are still not
         complete (their end io callback was not yet invoked or has not
         yet finished execution);
      
      5) Inode eviction is triggered (through an unlink call for example).
         The inode reference count was not incremented before submitting
         the read bios, therefore this is possible;
      
      6) The eviction handler starts executing and enters the loop that
         iterates over all extent states in the inode's io_tree;
      
      7) The loop picks one extent state record and uses its ->start and
         ->end fields, after releasing the inode's io_tree spinlock, to
         call lock_extent_bits() and clear_extent_bit(). The call to lock
         the range [state->start, state->end] blocks because the whole
         range or a part of it was locked by the previous call to
         readpages() and the corresponding end io callback, which unlocks
         the range was not yet executed;
      
      8) The end io callback for the read bio is executed and unlocks the
         range [state->start, state->end] (or a superset of that range).
         And at clear_extent_bit() the extent_state record state is used
         as a second argument to split_state(), which sets state->start to
         a larger value;
      
      9) The task executing the eviction handler is woken up by the task
         executing the bio's end io callback (through clear_state_bit) and
         the eviction handler locks the range
         [old value for state->start, state->end]. Shortly after, when
         calling clear_extent_bit(), it unlocks the range
         [new value for state->start, state->end], so it ends up unlocking
         only part of the range that it locked, leaving an extent state
         record in the io_tree that represents the unlocked subrange;
      
      10) The eviction handler loop, in its next iteration, gets the
          extent_state record for the subrange that it did not unlock in the
          previous step and then tries to lock it, resulting in an hang.
      
      So fix this by not using the ->start and ->end fields of an existing
      extent_state record. This is a simple solution, and an alternative
      could be to bump the inode's reference count before submitting each
      read bio and having it dropped in the bio's end io callback. But that
      would be a more invasive/complex change and would not protect against
      other possible places that are not holding a reference on the inode
      as well. Something to consider in the future.
      
      Many thanks to Zygo Blaxell for reporting, in the mailing list, the
      issue, a set of scripts to trigger it and testing this fix.
      Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Tested-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      6ca07097
    • Liu Bo's avatar
      Btrfs: fix up read_tree_block to return proper error · 64c043de
      Liu Bo authored
      The return value of read_tree_block() can confuse callers as it always
      returns NULL for either -ENOMEM or -EIO, so it's likely that callers
      parse it to a wrong error, for instance, in btrfs_read_tree_root().
      
      This fixes the above issue.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      64c043de
    • Liu Bo's avatar
      Btrfs: add missing free_extent_buffer · 8635eda9
      Liu Bo authored
      read_tree_block may take a reference on the 'eb', a following
      free_extent_buffer is necessary.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8635eda9
    • Liu Bo's avatar
      Btrfs: remove csum_bytes_left · 0c304304
      Liu Bo authored
      After commit 8407f553
      ("Btrfs: fix data corruption after fast fsync and writeback error"),
      during wait_ordered_extents(), we wait for ordered extent setting
      BTRFS_ORDERED_IO_DONE or BTRFS_ORDERED_IOERR, at which point we've
      already got checksum information, so we don't need to check
      (csum_bytes_left == 0) in the whole logging path.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      0c304304
    • Filipe Manana's avatar
      Btrfs: fix -ENOSPC on block group removal · 39c2d7fa
      Filipe Manana authored
      Unlike when attempting to allocate a new block group, where we check
      that we have enough space in the system space_info to update the device
      items and insert a new chunk item in the chunk tree, we were not checking
      if the system space_info had enough space for updating the device items
      and deleting the chunk item in the chunk tree. This often lead to -ENOSPC
      error when attempting to allocate blocks for the chunk tree (during btree
      node/leaf COW operations) while updating the device items or deleting the
      chunk item, which resulted in the current transaction being aborted and
      turning the filesystem into read-only mode.
      
      While running fstests generic/038, which stresses allocation of block
      groups and removal of unused block groups, with a large scratch device
      (750Gb) this happened often, despite more than enough unallocated space,
      and resulted in the following trace:
      
      [68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [68663.600407] BTRFS: Transaction aborted (error -28)
      (...)
      [68663.730829] Call Trace:
      [68663.732585]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [68663.734334]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [68663.739980]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [68663.757153]  [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [68663.760925]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [68663.762854]  [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs]
      [68663.764073]  [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [68663.765130]  [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs]
      [68663.765998]  [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs]
      [68663.767068]  [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs]
      [68663.768227]  [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c
      [68663.769081]  [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs]
      [68663.799485]  [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs]
      [68663.809208]  [<ffffffff8105f367>] kthread+0xef/0xf7
      [68663.828795]  [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
      [68663.844942]  [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
      [68663.846486]  [<ffffffff81435a88>] ret_from_fork+0x58/0x90
      [68663.847760]  [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
      [68663.849503] ---[ end trace 798477c6d6dbaad6 ]---
      [68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left
      
      So fix this by verifying that enough space exists in system space_info,
      and reserving the space in the chunk block reserve, before attempting to
      delete the block group and allocate a new system chunk if we don't have
      enough space to perform the necessary updates and delete in the chunk
      tree. Like for the block group creation case, we don't error our if we
      fail to allocate a new system chunk, since we might end up not needing
      it (no node/leaf splits happen during the COW operations and/or we end
      up not needing to COW any btree nodes or leafs because they were already
      COWed in the current transaction and their writeback didn't start yet).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      39c2d7fa
    • Filipe Manana's avatar
      Btrfs: fix -ENOSPC when finishing block group creation · 4fbcdf66
      Filipe Manana authored
      While creating a block group, we often end up getting ENOSPC while updating
      the chunk tree, which leads to a transaction abortion that produces a trace
      like the following:
      
      [30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
      [30670.117777] BTRFS: Transaction aborted (error -28)
      (...)
      [30670.163567] Call Trace:
      [30670.163906]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [30670.164522]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [30670.165171]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [30670.166323]  [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [30670.167213]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [30670.167862]  [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [30670.169116]  [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
      [30670.170593]  [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs]
      [30670.171960]  [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [30670.174649]  [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
      [30670.176092]  [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs]
      [30670.177218]  [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15
      [30670.178622]  [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de
      [30670.179642]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
      [30670.180692]  [<ffffffff81152863>] SyS_fallocate+0x47/0x62
      [30670.186737]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [30670.187792] ---[ end trace 0373e6b491c4a8cc ]---
      
      This is because we don't do proper space reservation for the chunk block
      reserve when we have multiple tasks allocating chunks in parallel.
      
      So block group creation has 2 phases, and the first phase essentially
      checks if there is enough space in the system space_info, allocating a
      new system chunk if there isn't, while the second phase updates the
      device, extent and chunk trees. However, because the updates to the
      chunk tree happen in the second phase, if we have N tasks, each with
      its own transaction handle, allocating new chunks in parallel and if
      there is only enough space in the system space_info to allocate M chunks,
      where M < N, none of the tasks ends up allocating a new system chunk in
      the first phase and N - M tasks will get -ENOSPC when attempting to
      update the chunk tree in phase 2 if they need to COW any nodes/leafs
      from the chunk tree.
      
      Fix this by doing proper reservation in the chunk block reserve.
      
      The issue could be reproduced by running fstests generic/038 in a loop,
      which eventually triggered the problem.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      4fbcdf66
    • Josef Bacik's avatar
      Btrfs: set UNWRITTEN for prealloc'ed extents in fiemap · 0d2b2372
      Josef Bacik authored
      We should be doing this, it's weird we hadn't been doing this.
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      0d2b2372
    • Omar Sandoval's avatar
      Btrfs: show subvol= and subvolid= in /proc/mounts · c8d3fe02
      Omar Sandoval authored
      Now that we're guaranteed to have a meaningful root dentry, we can just
      export seq_dentry() and use it in btrfs_show_options(). The subvolume ID
      is easy to get and can also be useful, so put that in there, too.
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c8d3fe02
    • Omar Sandoval's avatar
      Btrfs: unify subvol= and subvolid= mounting · 05dbe683
      Omar Sandoval authored
      Currently, mounting a subvolume with subvolid= takes a different code
      path than mounting with subvol=. This isn't really a big deal except for
      the fact that mounts done with subvolid= or the default subvolume don't
      have a dentry that's connected to the dentry tree like in the subvol=
      case. To unify the code paths, when given subvolid= or using the default
      subvolume ID, translate it into a subvolume name by walking
      ROOT_BACKREFs in the root tree and INODE_REFs in the filesystem trees.
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      05dbe683
    • Omar Sandoval's avatar
      Btrfs: fail on mismatched subvol and subvolid mount options · bb289b7b
      Omar Sandoval authored
      There's nothing to stop a user from passing both subvol= and subvolid=
      to mount, but if they don't refer to the same subvolume, someone is
      going to be surprised at some point. Error out on this case, but allow
      users to pass in both if they do match (which they could, for example,
      get out of /proc/mounts).
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      bb289b7b
    • Omar Sandoval's avatar
      Btrfs: clean up error handling in mount_subvol() · fa330659
      Omar Sandoval authored
      In preparation for new functionality in mount_subvol(), give it
      ownership of subvol_name and tidy up the error paths.
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      fa330659