1. 10 Sep, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock · 85e0a0f2
      Filipe Manana authored
      After commmit e44163e1 ("btrfs: explictly delete unused block groups
      in close_ctree and ro-remount"), added in the 4.3 merge window, we have
      calls to btrfs_delete_unused_bgs() while holding the cleaner_mutex.
      This can cause a deadlock with a concurrent block group relocation (when
      a filesystem balance or shrink operation is in progress for example)
      because btrfs_delete_unused_bgs() locks delete_unused_bgs_mutex and the
      relocation path locks first delete_unused_bgs_mutex and then it locks
      cleaner_mutex, resulting in a classic ABBA deadlock:
      
               CPU 0                                        CPU 1
      
      lock fs_info->cleaner_mutex
      
                                                 __btrfs_balance() || btrfs_shrink_device()
                                                   lock fs_info->delete_unused_bgs_mutex
                                                   btrfs_relocate_chunk()
                                                     btrfs_relocate_block_group()
                                                       lock fs_info->cleaner_mutex
      btrfs_delete_unused_bgs()
        lock fs_info->delete_unused_bgs_mutex
      
      Fix this by not taking the cleaner_mutex before calling
      btrfs_delete_unused_bgs() because it's no longer needed after
      commit 67c5e7d4 ("Btrfs: fix race between balance and unused block
      group deletion"). The mutex fs_info->delete_unused_bgs_mutex, the
      spinlock fs_info->unused_bgs_lock and a block group's spinlock are
      enough to get correct serialization between tasks running relocation
      and unused block group deletion (as well as between multiple tasks
      concurrently calling btrfs_delete_unused_bgs()).
      
      This issue was discussed (in the mailing list) during the review of
      the patch titled "btrfs: explictly delete unused block groups in
      close_ctree and ro-remount" and it was agreed that acquiring the
      cleaner mutex had to be dropped after the patch titled
      "Btrfs: fix race between balance and unused block group deletion"
      got merged (both patches were submitted at about the same time, but
      one landed in kernel 4.2 and the other in the 4.3 merge window).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      85e0a0f2
  2. 08 Sep, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: don't initialize a space info as full to prevent ENOSPC · 6af3e3ad
      Filipe Manana authored
      Commit 2e6e5183 ("Btrfs: fix block group ->space_info null pointer
      dereference") accidently marked a space info as full when initializing
      it with a value of 0 total bytes. This introduces an ENOSPC problem when
      writing file data if we mount a filesystem that has no data block groups
      allocated, because the data space info is initialized with 0 total bytes,
      marked as full, and it never gets its total bytes incremented by a
      (positive) value to unmark it as full (because there are no data block
      groups loaded when the fs is mounted).
      For metadata and system spaces this issue can never happen since we always
      have at least one metadata block group and one system block group (even
      for an empty filesystem).
      
      So fix this by just not initializing a space info as full, reverting the
      offending part of the commit mentioned above.
      
      The following test case for fstests reproduces the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
      
        # Mount our filesystem without space caches enabled so that we do not
        # get any space used from the initial data block group that mkfs creates
        # (space caches used space from data block groups).
        _scratch_mount "-o nospace_cache"
      
        # Need an fs with at least 2Gb to make sure mkfs.btrfs does not create
        # an fs using mixed block groups (used both for data and metadata). We
        # really need to have dedicated block groups for data to reproduce the
        # issue and mkfs.btrfs defaults to mixed block groups only for small
        # filesystems (up to 1Gb).
        _require_fs_space $SCRATCH_MNT $((2 * 1024 * 1024))
      
        # Run balance with the purpose of deleting the unused data block group
        # that mkfs created. We could also wait for the background kthread to
        # automatically delete the unused block group, but we do not have a way
        # to make it run and wait for it to complete, so just do a balance
        # instead of some unreliable sleep
        _run_btrfs_util_prog balance start -dusage=0 $SCRATCH_MNT
      
        # Now unmount the filesystem, mount it again (either with or with space
        # caches enabled, it does not matter to trigger the problem) and attempt
        # to create a file with some data - this used to fail with ENOSPC
        # because there were no data block groups when the filesystem was
        # mounted and the data space info object was marked as full when
        # initialized (because it had 0 total bytes), which prevented the file
        # write path from attempting to allocate a data block group and fail
        # immediately with ENOSPC.
        _scratch_remount
        echo "hello world" > $SCRATCH_MNT/foobar
      
        echo "Silence is golden"
        status=0
        exit
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      6af3e3ad
  3. 31 Aug, 2015 6 commits
  4. 21 Aug, 2015 1 commit
    • Chris Mason's avatar
      btrfs: fix compile when block cgroups are not enabled · 3a9508b0
      Chris Mason authored
      bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on.
      This adds an ifdef around them.  It's not perfect, but our
      use of bi_ioc is being removed in the 4.3 merge window.
      
      The bi_css usage really should go into bio_clone, but I want to make
      sure that doesn't introduce problems for other bio_clone use cases.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3a9508b0
  5. 19 Aug, 2015 6 commits
    • Filipe Manana's avatar
      Btrfs: fix file read corruption after extent cloning and fsync · b84b8390
      Filipe Manana authored
      If we partially clone one extent of a file into a lower offset of the
      file, fsync the file, power fail and then mount the fs to trigger log
      replay, we can get multiple checksum items in the csum tree that overlap
      each other and result in checksum lookup failures later. Those failures
      can make file data read requests assume a checksum value of 0, but they
      will not return an error (-EIO for example) to userspace exactly because
      the expected checksum value 0 is a special value that makes the read bio
      endio callback return success and set all the bytes of the corresponding
      page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()).
      From a userspace perspective this is equivalent to file corruption
      because we are not returning what was written to the file.
      
      Details about how this can happen, and why, are included inline in the
      following reproducer test case for fstests and the comment added to
      tree-log.c.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _require_cloner
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file with a single 100K extent starting at file
        # offset 800K. We fsync the file here to make the fsync log tree gets
        # a single csum item that covers the whole 100K extent, which causes
        # the second fsync, done after the cloning operation below, to not
        # leave in the log tree two csum items covering two sub-ranges
        # ([0, 20K[ and [20K, 100K[)) of our extent.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K"  \
                        -c "fsync"                     \
                         $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now clone part of our extent into file offset 400K. This adds a file
        # extent item to our inode's metadata that points to the 100K extent
        # we created before, using a data offset of 20K and a data length of
        # 20K, so that it refers to the sub-range [20K, 40K[ of our original
        # extent.
        $CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \
            -l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo
      
        # Now fsync our file to make sure the extent cloning is durably
        # persisted. This fsync will not add a second csum item to the log
        # tree containing the checksums for the blocks in the sub-range
        # [20K, 40K[ of our extent, because there was already a csum item in
        # the log tree covering the whole extent, added by the first fsync
        # we did before.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        echo "File digest before power failure:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
      
        # Silently drop all writes and ummount to simulate a crash/power
        # failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file
        # contents.
        # The fsync log replay first processes the file extent item
        # corresponding to the file offset 400K (the one which refers to the
        # [20K, 40K[ sub-range of our 100K extent) and then processes the file
        # extent item for file offset 800K. It used to happen that when
        # processing the later, it erroneously left in the csum tree 2 csum
        # items that overlapped each other, 1 for the sub-range [20K, 40K[ and
        # 1 for the whole range of our extent. This introduced a problem where
        # subsequent lookups for the checksums of blocks within the range
        # [40K, 100K[ of our extent would not find anything because lookups in
        # the csum tree ended up looking only at the smaller csum item, the
        # one covering the subrange [20K, 40K[. This made read requests assume
        # an expected checksum with a value of 0 for those blocks, which caused
        # checksum verification failure when the read operations finished.
        # However those checksum failure did not result in read requests
        # returning an error to user space (like -EIO for e.g.) because the
        # expected checksum value had the special value 0, and in that case
        # btrfs set all bytes of the corresponding pages with the value 0x01
        # and produce the following warning in dmesg/syslog:
        #
        #  "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\
        #   1322675045 expected csum 0"
        #
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        echo "File digest after log replay:"
        # Must match the same digest he had after cloning the extent and
        # before the power failure happened.
        md5sum $SCRATCH_MNT/foo | _filter_scratch
      
        _unmount_flakey
      
        status=0
        exit
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      b84b8390
    • Filipe Manana's avatar
      Btrfs: check if previous transaction aborted to avoid fs corruption · 1f9b8c8f
      Filipe Manana authored
      While we are committing a transaction, it's possible the previous one is
      still finishing its commit and therefore we wait for it to finish first.
      However we were not checking if that previous transaction ended up getting
      aborted after we waited for it to commit, so we ended up committing the
      current transaction which can lead to fs corruption because the new
      superblock can point to trees that have had one or more nodes/leafs that
      were never durably persisted.
      The following sequence diagram exemplifies how this is possible:
      
                CPU 0                                                        CPU 1
      
        transaction N starts
      
        (...)
      
        btrfs_commit_transaction(N)
      
          cur_trans->state = TRANS_STATE_COMMIT_START;
          (...)
          cur_trans->state = TRANS_STATE_COMMIT_DOING;
          (...)
      
          cur_trans->state = TRANS_STATE_UNBLOCKED;
          root->fs_info->running_transaction = NULL;
      
                                                                    btrfs_start_transaction()
                                                                       --> starts transaction N + 1
      
          btrfs_write_and_wait_transaction(trans, root);
            --> starts writing all new or COWed ebs created
                at transaction N
      
                                                                    creates some new ebs, COWs some
                                                                    existing ebs but doesn't COW or
                                                                    deletes eb X
      
                                                                    btrfs_commit_transaction(N + 1)
                                                                      (...)
                                                                      cur_trans->state = TRANS_STATE_COMMIT_START;
                                                                      (...)
                                                                      wait_for_commit(root, prev_trans);
                                                                        --> prev_trans == transaction N
      
          btrfs_write_and_wait_transaction() continues
          writing ebs
             --> fails writing eb X, we abort transaction N
                 and set bit BTRFS_FS_STATE_ERROR on
                 fs_info->fs_state, so no new transactions
                 can start after setting that bit
      
             cleanup_transaction()
               btrfs_cleanup_one_transaction()
                 wakes up task at CPU 1
      
                                                                      continues, doesn't abort because
                                                                      cur_trans->aborted (transaction N + 1)
                                                                      is zero, and no checks for bit
                                                                      BTRFS_FS_STATE_ERROR in fs_info->fs_state
                                                                      are made
      
                                                                      btrfs_write_and_wait_transaction(trans, root);
                                                                        --> succeeds, no errors during writeback
      
                                                                      write_ctree_super(trans, root, 0);
                                                                        --> succeeds
                                                                        --> we have now a superblock that points us
                                                                            to some root that uses eb X, which was
                                                                            never written to disk
      
      In this scenario future attempts to read eb X from disk results in an
      error message like "parent transid verify failed on X wanted Y found Z".
      
      So fix this by aborting the current transaction if after waiting for the
      previous transaction we verify that it was aborted.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1f9b8c8f
    • Michal Hocko's avatar
      btrfs: use __GFP_NOFAIL in alloc_btrfs_bio · 277fb5fc
      Michal Hocko authored
      alloc_btrfs_bio relies on GFP_NOFS allocation when committing the
      transaction but this allocation context is rather weak wrt. reclaim
      capabilities. The page allocator currently tries hard to not fail these
      allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) but it can
      still fail if the _current_ process is the OOM killer victim. Moreover
      there is an attempt to move away from the default no-fail behavior and
      allow these allocation to fail more eagerly. This would lead to:
      
      [   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045
      
      which is clearly undesirable and the nofail behavior should be explicit
      if the allocation failure cannot be tolerated.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      277fb5fc
    • Michal Hocko's avatar
      btrfs: Prevent from early transaction abort · d1b5c567
      Michal Hocko authored
      Btrfs relies on GFP_NOFS allocation when committing the transaction but
      this allocation context is rather weak wrt. reclaim capabilities. The
      page allocator currently tries hard to not fail these allocations if
      they are small (<=PAGE_ALLOC_COSTLY_ORDER) so this is not a problem
      currently but there is an attempt to move away from the default no-fail
      behavior and allow these allocation to fail more eagerly. And this would
      lead to a pre-mature transaction abort as follows:
      
      [   55.328093] Call Trace:
      [   55.328890]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
      [   55.330518]  [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
      [   55.332738]  [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
      [   55.334910]  [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
      [   55.336844]  [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
      [   55.338973]  [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
      [   55.341329]  [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
      [   55.343566]  [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
      [   55.345577]  [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
      [   55.347679]  [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
      [   55.349434]  [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
      [   55.351681]  [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
      [   55.353979]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
      [   55.356212]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.358378]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.360626]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
      [   55.362894]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.365221]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
      [   55.367273]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
      [   55.369047]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
      [   55.370654]  [<ffffffff81186869>] do_fsync+0x34/0x4e
      [   55.372246]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
      [   55.373851]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
      [   55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
      [   55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
      [   55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
      [   55.384280] ------------[ cut here ]------------
      [   55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
      [...]
      [   55.384337] Call Trace:
      [   55.384353]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
      [   55.384357]  [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
      [   55.384359]  [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
      [   55.384398]  [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
      [   55.384400]  [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
      [   55.384423]  [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
      [   55.384446]  [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
      [   55.384455]  [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
      [   55.384476]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
      [   55.384499]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.384521]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.384543]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
      [   55.384565]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.384588]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
      [   55.384591]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
      [   55.384592]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
      [   55.384593]  [<ffffffff81186869>] do_fsync+0x34/0x4e
      [   55.384594]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
      [   55.384595]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
      [...]
      [   55.384608] ---[ end trace c29799da1d4dd621 ]---
      [   55.437323] BTRFS info (device hdb1): forced readonly
      [   55.438815] BTRFS info (device hdb1): delayed_refs has NO entry
      
      Fix this by being explicit about the no-fail behavior of this allocation
      path and use __GFP_NOFAIL.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d1b5c567
    • Zhaolei's avatar
      btrfs: Remove unused arguments in tree-log.c · 60d53eb3
      Zhaolei authored
      Following arguments are not used in tree-log.c:
       insert_one_name(): path, type
       wait_log_commit(): trans
       wait_for_writer(): trans
      
      This patch remove them.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      60d53eb3
    • Zhaolei's avatar
      btrfs: Remove useless condition in start_log_trans() · 34eb2a52
      Zhaolei authored
      Dan Carpenter <dan.carpenter@oracle.com> reported a smatch warning
      for start_log_trans():
       fs/btrfs/tree-log.c:178 start_log_trans()
       warn: we tested 'root->log_root' before and it was 'false'
      
       fs/btrfs/tree-log.c
       147          if (root->log_root) {
       We test "root->log_root" here.
       ...
      
      Reason:
       Condition of:
       fs/btrfs/tree-log.c:178: if (!root->log_root) {
       is not necessary after commit: 7237f183
      
       It caused a smatch warning, and no functionally error.
      
      Fix:
       Deleting above condition will make smatch shut up,
       but a better way is to do cleanup for start_log_trans()
       to remove duplicated code and make code more readable.
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      34eb2a52
  6. 09 Aug, 2015 25 commits