1. 28 Jan, 2014 40 commits
    • David Sterba's avatar
      btrfs: check balance of send_in_progress · 66ef7d65
      David Sterba authored
      Warn if the balance goes below zero, which appears to be unlikely
      though. Otherwise cleans up the code a bit.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      66ef7d65
    • Wang Shilong's avatar
      Btrfs: remove transaction from btrfs send · 41ce9970
      Wang Shilong authored
      Since daivd did the work that force us to use readonly snapshot,
      we can safely remove transaction protection from btrfs send.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      41ce9970
    • Miao Xie's avatar
      Btrfs: fix double initialization of the raid kobject · 536cd964
      Miao Xie authored
      We met the following oops when doing space balance:
       kobject (ffff88081b590278): tried to init an initialized object, something is seriously wrong.
       ...
       Call Trace:
        [<ffffffff81937262>] dump_stack+0x49/0x5f
        [<ffffffff8137d259>] kobject_init+0x89/0xa0
        [<ffffffff8137d36a>] kobject_init_and_add+0x2a/0x70
        [<ffffffffa009bd79>] ? clear_extent_bit+0x199/0x470 [btrfs]
        [<ffffffffa005e82c>] __link_block_group+0xfc/0x120 [btrfs]
        [<ffffffffa006b9db>] btrfs_make_block_group+0x24b/0x370 [btrfs]
        [<ffffffffa00a899b>] __btrfs_alloc_chunk+0x54b/0x7e0 [btrfs]
        [<ffffffffa00a8c6f>] btrfs_alloc_chunk+0x3f/0x50 [btrfs]
        [<ffffffffa0060123>] do_chunk_alloc+0x363/0x440 [btrfs]
        [<ffffffffa00633d4>] btrfs_check_data_free_space+0x104/0x310 [btrfs]
        [<ffffffffa0069f4d>] btrfs_write_dirty_block_groups+0x48d/0x600 [btrfs]
        [<ffffffffa007aad4>] commit_cowonly_roots+0x184/0x250 [btrfs]
        ...
      
      Steps to reproduce:
       # mkfs.btrfs -f <dev>
       # mount -o nospace_cache <dev> <mnt>
       # btrfs balance start <mnt>
       # dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1
      
      The reason of this problem is that we initialized the raid kobject when we added
      a block group into a empty raid list. As we know, when we mounted a btrfs filesystem,
      the raid list was empty, we would initialize the raid kobject when we added the first
      block group. But if there was not data stored in the block group, the block group
      would be freed when doing balance, and the raid list would be empty. And then if we
      allocated a new block group and added it into the raid list, we would initialize
      the raid kobject again, the oops happened.
      
      Fix this problem by initializing the raid kobject just when mounting the fs.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Reported-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      536cd964
    • Wang Shilong's avatar
      Btrfs: fix a warning when iput a file · 180589ef
      Wang Shilong authored
      See the warning below:
      
      [ 1209.102076]  [<ffffffffa04721b9>] remove_extent_mapping+0x69/0x70 [btrfs]
      [ 1209.102084]  [<ffffffffa0466b06>] btrfs_evict_inode+0x96/0x4d0 [btrfs]
      [ 1209.102089]  [<ffffffff81073010>] ? wake_atomic_t_function+0x40/0x40
      [ 1209.102092]  [<ffffffff8118ab2e>] evict+0x9e/0x190
      [ 1209.102094]  [<ffffffff8118b313>] iput+0xf3/0x180
      [ 1209.102101]  [<ffffffffa0461fd1>] btrfs_run_delayed_iputs+0xb1/0xd0 [btrfs]
      [ 1209.102107]  [<ffffffffa045d358>] __btrfs_end_transaction+0x268/0x350 [btrfs]
      
      clear extent bit here to avoid triggering WARN_ON() in remove_extent_mapping()
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      180589ef
    • David Sterba's avatar
      btrfs: Check read-only status of roots during send · 2c686537
      David Sterba authored
      All the subvolues that are involved in send must be read-only during the
      whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the
      status to read-write and the result of send stream is undefined if the
      data change unexpectedly.
      
      Fix that by adding a refcount for all involved roots and verify that
      there's no send in progress during SUBVOL_SETFLAGS ioctl call that does
      read-only -> read-write transition.
      
      We need refcounts because there are no restrictions on number of send
      parallel operations currently run on a single subvolume, be it source,
      parent or one of the multiple clone sources.
      
      Kernel is silent when the RO checks fail and returns EPERM. The same set
      of checks is done already in userspace before send starts.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      2c686537
    • David Sterba's avatar
      btrfs: remove unused mnt from send_ctx · a8d89f5b
      David Sterba authored
      Unused since ed259095
      "Btrfs: stop using vfs_read in send".
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      a8d89f5b
    • David Sterba's avatar
      btrfs: send: clean up dead code · 95bc79d5
      David Sterba authored
      Remove ifdefed code:
      
      - tlv_put for 8, 16 and 32, add a generic tempalte if needed in future
      - tlv_put_timespec - the btrfs_timespec fields are used
      - fs_path_remove obsoleted long ago
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      95bc79d5
    • Filipe David Borba Manana's avatar
      Btrfs: fix deadlock when iterating inode refs and running delayed inodes · 3fe81ce2
      Filipe David Borba Manana authored
      While running btrfs/004 from xfstests, after 503 iterations, dmesg reported
      a deadlock between tasks iterating inode refs and tasks running delayed inodes
      (during a transaction commit).
      
      It turns out that iterating inode refs implies doing one tree search and
      release all nodes in the path except the leaf node, and then passing that
      leaf node to btrfs_ref_to_path(), which in turn does another tree search
      without releasing the lock on the leaf node it received as parameter.
      
      This is a problem when other task wants to write to the btree as well and
      ends up updating the leaf that is read locked - the writer task locks the
      parent of the leaf and then blocks waiting for the leaf's lock to be
      released - at the same time, the task executing btrfs_ref_to_path()
      does a second tree search, without releasing the lock on the first leaf,
      and wants to access a leaf (the same or another one) that is a child of
      the same parent, resulting in a deadlock.
      
      The trace reported by lockdep follows.
      
      [84314.936373] INFO: task fsstress:11930 blocked for more than 120 seconds.
      [84314.936381]       Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
      [84314.936383] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [84314.936386] fsstress        D ffff8806e1bf8000     0 11930  11926 0x00000000
      [84314.936393]  ffff8804d6d89b78 0000000000000046 ffff8804d6d89b18 ffffffff810bd8bd
      [84314.936399]  ffff8806e1bf8000 ffff8804d6d89fd8 ffff8804d6d89fd8 ffff8804d6d89fd8
      [84314.936405]  ffff880806308000 ffff8806e1bf8000 ffff8804d6d89c08 ffff8804deb8f190
      [84314.936410] Call Trace:
      [84314.936421]  [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
      [84314.936428]  [<ffffffff81774269>] schedule+0x29/0x70
      [84314.936451]  [<ffffffffa0715bf5>] btrfs_tree_lock+0x75/0x270 [btrfs]
      [84314.936457]  [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
      [84314.936470]  [<ffffffffa06ba231>] btrfs_search_slot+0x7f1/0x930 [btrfs]
      [84314.936489]  [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
      [84314.936504]  [<ffffffffa06d2e1f>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
      [84314.936510]  [<ffffffff810bd6ef>] ? trace_hardirqs_on_caller+0x1f/0x1e0
      [84314.936528]  [<ffffffffa073173c>] __btrfs_update_delayed_inode+0x4c/0x1d0 [btrfs]
      [84314.936543]  [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
      [84314.936558]  [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
      [84314.936573]  [<ffffffffa0731c82>] __btrfs_run_delayed_items+0x192/0x1e0 [btrfs]
      [84314.936589]  [<ffffffffa0731d03>] btrfs_run_delayed_items+0x13/0x20 [btrfs]
      [84314.936604]  [<ffffffffa06dbcd4>] btrfs_flush_all_pending_stuffs+0x24/0x80 [btrfs]
      [84314.936620]  [<ffffffffa06ddc13>] btrfs_commit_transaction+0x223/0xa20 [btrfs]
      [84314.936630]  [<ffffffffa06ae5ae>] btrfs_sync_fs+0x6e/0x110 [btrfs]
      [84314.936635]  [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60
      [84314.936639]  [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60
      [84314.936643]  [<ffffffff811d0b70>] sync_fs_one_sb+0x20/0x30
      [84314.936648]  [<ffffffff811a3541>] iterate_supers+0xf1/0x100
      [84314.936652]  [<ffffffff811d0c45>] sys_sync+0x55/0x90
      [84314.936658]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
      [84314.936660] INFO: lockdep is turned off.
      [84314.936663] INFO: task btrfs:11955 blocked for more than 120 seconds.
      [84314.936666]       Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
      [84314.936668] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [84314.936670] btrfs           D ffff880541729a88     0 11955  11608 0x00000000
      [84314.936674]  ffff880541729a38 0000000000000046 ffff8805417299d8 ffffffff810bd8bd
      [84314.936680]  ffff88075430c8a0 ffff880541729fd8 ffff880541729fd8 ffff880541729fd8
      [84314.936685]  ffffffff81c104e0 ffff88075430c8a0 ffff8804de8b00b8 ffff8804de8b0000
      [84314.936690] Call Trace:
      [84314.936695]  [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
      [84314.936700]  [<ffffffff81774269>] schedule+0x29/0x70
      [84314.936717]  [<ffffffffa0715815>] btrfs_tree_read_lock+0xd5/0x140 [btrfs]
      [84314.936721]  [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
      [84314.936733]  [<ffffffffa06ba201>] btrfs_search_slot+0x7c1/0x930 [btrfs]
      [84314.936746]  [<ffffffffa06bd505>] btrfs_find_item+0x55/0x160 [btrfs]
      [84314.936763]  [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs]
      [84314.936780]  [<ffffffffa073c9ca>] btrfs_ref_to_path+0xba/0x1e0 [btrfs]
      [84314.936797]  [<ffffffffa06f9719>] ? release_extent_buffer+0xb9/0xe0 [btrfs]
      [84314.936813]  [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs]
      [84314.936830]  [<ffffffffa073cb50>] inode_to_path+0x60/0xd0 [btrfs]
      [84314.936846]  [<ffffffffa073d365>] paths_from_inode+0x115/0x3c0 [btrfs]
      [84314.936851]  [<ffffffff8118dd44>] ? kmem_cache_alloc_trace+0x114/0x200
      [84314.936868]  [<ffffffffa0714494>] btrfs_ioctl+0xf14/0x2030 [btrfs]
      [84314.936873]  [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
      [84314.936877]  [<ffffffff8116598f>] ? handle_mm_fault+0x34f/0xb00
      [84314.936882]  [<ffffffff81075563>] ? up_read+0x23/0x40
      [84314.936886]  [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
      [84314.936892]  [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
      [84314.936896]  [<ffffffff81776e23>] ? error_sti+0x5/0x6
      [84314.936901]  [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
      [84314.936906]  [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
      [84314.936910]  [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
      [84314.936915]  [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [84314.936920]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
      [84314.936922] INFO: lockdep is turned off.
      [84434.866873] INFO: task btrfs-transacti:11921 blocked for more than 120 seconds.
      [84434.866881]       Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
      [84434.866883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [84434.866886] btrfs-transacti D ffff880755b6a478     0 11921      2 0x00000000
      [84434.866893]  ffff8800735b9ce8 0000000000000046 ffff8800735b9c88 ffffffff810bd8bd
      [84434.866899]  ffff8805a1b848a0 ffff8800735b9fd8 ffff8800735b9fd8 ffff8800735b9fd8
      [84434.866904]  ffffffff81c104e0 ffff8805a1b848a0 ffff880755b6a478 ffff8804cece78f0
      [84434.866910] Call Trace:
      [84434.866920]  [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
      [84434.866927]  [<ffffffff81774269>] schedule+0x29/0x70
      [84434.866948]  [<ffffffffa06dd2ef>] wait_current_trans.isra.33+0xbf/0x120 [btrfs]
      [84434.866954]  [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
      [84434.866970]  [<ffffffffa06dec18>] start_transaction+0x388/0x5a0 [btrfs]
      [84434.866985]  [<ffffffffa06db9b5>] ? transaction_kthread+0xb5/0x280 [btrfs]
      [84434.866999]  [<ffffffffa06dee97>] btrfs_attach_transaction+0x17/0x20 [btrfs]
      [84434.867012]  [<ffffffffa06dba9e>] transaction_kthread+0x19e/0x280 [btrfs]
      [84434.867026]  [<ffffffffa06db900>] ? open_ctree+0x2260/0x2260 [btrfs]
      [84434.867030]  [<ffffffff81070dad>] kthread+0xed/0x100
      [84434.867035]  [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190
      [84434.867040]  [<ffffffff8177ee6c>] ret_from_fork+0x7c/0xb0
      [84434.867044]  [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3fe81ce2
    • Wang Shilong's avatar
      Btrfs: remove dead comments for read_csums() · 663df053
      Wang Shilong authored
      Chris introduced hleper function  read_csums() and this function
      has been removed, but we forgot to remove its corresponding comments.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      663df053
    • Filipe David Borba Manana's avatar
      Btrfs: remove field tree_mod_seq_elem from btrfs_fs_info struct · e223cfcd
      Filipe David Borba Manana authored
      It's not used anywhere, so just drop it.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e223cfcd
    • Filipe David Borba Manana's avatar
      Btrfs: fix use of uninitialized err variable · fc28b62d
      Filipe David Borba Manana authored
      fs/btrfs/file.c: In function ‘prepare_pages.isra.18’:
      fs/btrfs/file.c:1265:6: warning: ‘err’ may be used uninitialized in this function [-Wuninitialized]
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      fc28b62d
    • Wang Shilong's avatar
      Btrfs: remove unnecessary filemap writting and waiting after block group relocation · 54eb72c0
      Wang Shilong authored
      We have commited transaction before, remove redundant filemap writting and
      waiting here, it can speed up balance relocation process.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      54eb72c0
    • Tsutomu Itoh's avatar
      Btrfs: fix error check of btrfs_lookup_dentry() · 5662344b
      Tsutomu Itoh authored
      Clean up btrfs_lookup_dentry() to never return NULL, but PTR_ERR(-ENOENT)
      instead. This keeps the return value convention consistent.
      
      Callers who use btrfs_lookup_dentry() require a trivial update.
      
      create_snapshot() in particular looks like it can also lose a BUG_ON(!inode)
      which is not really needed - there seems less harm in returning ENOENT to
      userspace at that point in the stack than there is to crash the machine.
      Signed-off-by: default avatarTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      5662344b
    • Filipe David Borba Manana's avatar
      Btrfs: return immediately if tree log mod is not necessary · 78357766
      Filipe David Borba Manana authored
      In ctree.c:tree_mod_log_set_node_key() we were calling
      __tree_mod_log_insert_key() even when the modification doesn't need
      to be logged. This would allocate a tree_mod_elem structure, fill it
      and pass it to  __tree_mod_log_insert(), which would just acquire
      the tree mod log write lock and then free the tree_mod_elem structure
      and return (that is, a no-op).
      
      Therefore call tree_mod_log_insert() instead of __tree_mod_log_insert()
      which just returns immediately if the modification doesn't need to be
      logged (without allocating the structure, fill it, acquire write lock,
      free structure).
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      78357766
    • Josef Bacik's avatar
      Btrfs: move the extent buffer radix tree into the fs_info · f28491e0
      Josef Bacik authored
      I need to create a fake tree to test qgroups and I don't want to have to setup a
      fake btree_inode.  The fact is we only use the radix tree for the fs_info, so
      everybody else who allocates an extent_io_tree is just wasting the space anyway.
      This patch moves the radix tree and its lock into btrfs_fs_info so there is less
      stuff I have to fake to do qgroup sanity tests.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      f28491e0
    • Josef Bacik's avatar
      Btrfs: use a bit to track if we're in the radix tree · 34b41ace
      Josef Bacik authored
      For creating a dummy in-memory btree I need to be able to use the radix tree to
      keep track of the buffers like normal extent buffers.  With dummy buffers we
      skip the radix tree step, and we still want to do that for the tree mod log
      dummy buffers but for my test buffers we need to be able to remove them from the
      radix tree like normal.  This will give me a way to do that.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      34b41ace
    • Josef Bacik's avatar
      Btrfs: deal with io_tree->mapping being NULL · a5dee37d
      Josef Bacik authored
      I need to add infrastructure to allocate dummy extent buffers for running sanity
      tests, and to do this I need to not have to worry about having an
      address_mapping for an io_tree, so just fix up the places where we assume that
      all io_tree's have a non-NULL ->mapping.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      a5dee37d
    • Filipe David Borba Manana's avatar
      Btrfs: more efficient push_leaf_right · 2ef1fed2
      Filipe David Borba Manana authored
      Currently when finding the leaf to insert a key into a btree, if the
      leaf doesn't have enough space to store the item we attempt to move
      off some items from our leaf to its right neighbor leaf, and if this
      fails to create enough free space in our leaf, we try to move off more
      items to the left neighbor leaf as well.
      
      When trying to move off items to the right neighbor leaf, if it has
      enough room to store the new key but not not enough room to move off
      at least one item from our target leaf, __push_leaf_right returns 1 and
      we have to attempt to move items to the left neighbor (push_leaf_left
      function) without touching the right neighbor leaf.
      For the case where the right leaf has enough room to store at least 1
      item from our leaf, we end up modifying (and dirtying) both our leaf
      and the right leaf. This is non-optimal for the case where the new key
      is greater than any key in our target leaf because it can be inserted at
      slot 0 of the right neighbor leaf and we don't need to touch our leaf
      at all nor to attempt to move off items to the left neighbor leaf.
      
      Therefore this change just selects the right neighbor leaf as our new
      target leaf if it has enough room for the new key without modifying our
      initial target leaf - we do this only if the new key is higher than any
      key in the initial target leaf.
      
      While running the following test, push_leaf_right was called by split_leaf
      4802 times. Out of those 4802 calls, for 2571 calls (53.5%) we hit this
      special case (right leaf has enough room and new key is higher than any key
      in the initial target leaf).
      
      Test:
      
        sysbench --test=fileio --file-num=512 --file-total-size=5G \
          --file-test-mode=[seqwr|rndwr] --num-threads=512 --file-block-size=8192 \
          --max-requests=100000 --file-io-mode=sync [prepare|run]
      
      Results:
      
      sequential writes
      
      Throughput before this change: 65.71Mb/sec (average of 10 runs)
      Throughput after this change:  66.58Mb/sec (average of 10 runs)
      
      random writes
      
      Throughput before this change: 10.75Mb/sec (average of 10 runs)
      Throughput after this change:  11.56Mb/sec (average of 10 runs)
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      2ef1fed2
    • Wang Shilong's avatar
      Btrfs: wrap repeated code into scrub_blocked_if_needed() · cb7ab021
      Wang Shilong authored
      Just wrap same code into one function scrub_blocked_if_needed().
      
      This make a change that we will move waiting (@workers_pending = 0)
      before we can wake up commiting transaction(atomic_inc(@scrub_paused)),
      we must take carefully to not deadlock here.
      
      Thread 1			Thread 2
      				|->btrfs_commit_transaction()
      					|->set trans type(COMMIT_DOING)
      					|->btrfs_scrub_paused()(blocked)
      |->join_transaction(blocked)
      
      Move btrfs_scrub_paused() before setting trans type which means we can
      still join a transaction when commiting_transaction is blocked.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Suggested-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      cb7ab021
    • Wang Shilong's avatar
      Btrfs: fix wrong super generation mismatch when scrubbing supers · 3cb0929a
      Wang Shilong authored
      We came a race condition when scrubbing superblocks, the story is:
      
      In commiting transaction, we will update @last_trans_commited after
      writting superblocks, if scrubber start after writting superblocks
      and before updating @last_trans_commited, generation mismatch happens!
      
      We fix this by checking @scrub_pause_req, and we won't start a srubber
      until commiting transaction is finished.(after btrfs_scrub_continue()
      finished.)
      Reported-by: default avatarSebastian Ochmann <ochmann@informatik.uni-bonn.de>
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3cb0929a
    • Filipe David Borba Manana's avatar
      Btrfs: fix pass of transid with wrong endianness in send.c · 5a0f4e2c
      Filipe David Borba Manana authored
      fs/btrfs/send.c:2190:9: warning: incorrect type in argument 3 (different base types)
      fs/btrfs/send.c:2190:9:    expected unsigned long long [unsigned] [usertype] value
      fs/btrfs/send.c:2190:9:    got restricted __le64 [usertype] ctransid
      fs/btrfs/send.c:2195:17: warning: incorrect type in argument 3 (different base types)
      fs/btrfs/send.c:2195:17:    expected unsigned long long [unsigned] [usertype] value
      fs/btrfs/send.c:2195:17:    got restricted __le64 [usertype] ctransid
      fs/btrfs/send.c:3716:9: warning: incorrect type in argument 3 (different base types)
      fs/btrfs/send.c:3716:9:    expected unsigned long long [unsigned] [usertype] value
      fs/btrfs/send.c:3716:9:    got restricted __le64 [usertype] ctransid
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      5a0f4e2c
    • Filipe David Borba Manana's avatar
      Btrfs: fix extent_map block_len after merging · d527afe1
      Filipe David Borba Manana authored
      When merging an extent_map with its right neighbor, increment
      its block_len with the neighbor's block_len.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d527afe1
    • Michal Nazarewicz's avatar
      btrfs: remove dead code · 11850392
      Michal Nazarewicz authored
      [commit 8185554d: fix incorrect inode acl reset] introduced a dead
      code by adding a condition which can never be true to an else
      branch.  The condition can never be true because it is already
      checked by a previous if statement which causes function to return.
      Signed-off-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Reviewed-By: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      11850392
    • Filipe David Borba Manana's avatar
      Btrfs: fix max dir item size calculation · 878f2d2c
      Filipe David Borba Manana authored
      We were accounting for sizeof(struct btrfs_item) twice, once
      in the data_size variable and another time in the if statement
      below.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      878f2d2c
    • Filipe David Borba Manana's avatar
      Btrfs: more efficient extent state insertions · 12cfbad9
      Filipe David Borba Manana authored
      Currently we do 2 traversals of an inode's extent_io_tree
      before inserting an extent state structure: 1 to see if a
      matching extent state already exists and 1 to do the insertion
      if the fist traversal didn't found such extent state.
      
      This change just combines those tree traversals into a single one.
      While running sysbench tests (random writes) I captured the number
      of elements in extent_io_tree trees for a while (into a procfs file
      backed by a seq_list from seq_file module) and got this histogram:
      
      Count: 9310
      Range: 51.000 - 21386.000; Mean: 11785.243; Median: 18743.500; Stddev: 8923.688
      Percentiles:  90th: 20985.000; 95th: 21155.000; 99th: 21369.000
        51.000 -   93.933:   693 ########
        93.933 -  172.314:   938 ##########
       172.314 -  315.408:   856 #########
       315.408 -  576.646:    95 #
       576.646 - 6415.830:   888 ##########
      6415.830 - 11713.809:  1024 ###########
      11713.809 - 21386.000:  4816 #####################################################
      
      So traversing such trees can take some significant time that can
      easily be avoided.
      
      Ran the following sysbench tests, 5 times each, for sequential and
      random writes, and got the following results:
      
        sysbench --test=fileio --file-num=1 --file-total-size=2G \
          --file-test-mode=seqwr --num-threads=16 --file-block-size=65536 \
          --max-requests=0 --max-time=60 --file-io-mode=sync
      
        sysbench --test=fileio --file-num=1 --file-total-size=2G \
          --file-test-mode=rndwr --num-threads=16 --file-block-size=65536 \
          --max-requests=0 --max-time=60 --file-io-mode=sync
      
      Before this change:
      
      sequential writes: 69.28Mb/sec (average of 5 runs)
      random writes:     4.14Mb/sec  (average of 5 runs)
      
      After this change:
      
      sequential writes: 69.91Mb/sec (average of 5 runs)
      random writes:     5.69Mb/sec  (average of 5 runs)
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      12cfbad9
    • Filipe David Borba Manana's avatar
      Btrfs: add missing extent state caching calls · c42ac0bc
      Filipe David Borba Manana authored
      When we didn't find a matching extent state, we inserted a new one
      but didn't cache it in the **cached_state parameter, which makes a
      subsequent call do a tree lookup to get it.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c42ac0bc
    • Filipe David Borba Manana's avatar
      Btrfs: faster and more efficient extent map insertion · 32193c14
      Filipe David Borba Manana authored
      Before this change, adding an extent map to the extent map tree of an
      inode required 2 tree nevigations:
      
      1) doing a tree navigation to search for an existing extent map starting
         at the same offset or an extent map that overlaps the extent map we
         want to insert;
      
      2) Another tree navigation to add the extent map to the tree (if the
         former tree search didn't found anything).
      
      This change just merges these 2 steps into a single one.
      While running first few btrfs xfstests I had noticed these trees easily
      had a few hundred elements, and then with the following sysbench test it
      reached over 1100 elements very often.
      
      Test:
      
        sysbench --test=fileio --file-num=32 --file-total-size=10G \
          --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
          --max-requests=1000000 --file-io-mode=sync [prepare|run]
      
      (fs created with mkfs.btrfs -l 4096 -f /dev/sdb3 before each sysbench
      prepare phase)
      
      Before this patch:
      
      run 1 - 41.894Mb/sec
      run 2 - 40.527Mb/sec
      run 3 - 40.922Mb/sec
      run 4 - 49.433Mb/sec
      run 5 - 40.959Mb/sec
      
      average - 42.75Mb/sec
      
      After this patch:
      
      run 1 - 48.036Mb/sec
      run 2 - 50.21Mb/sec
      run 3 - 50.929Mb/sec
      run 4 - 46.881Mb/sec
      run 5 - 53.192Mb/sec
      
      average - 49.85Mb/sec
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      32193c14
    • Filipe David Borba Manana's avatar
    • Filipe David Borba Manana's avatar
      Btrfs: try harder to avoid btree node splits · 5a4267ca
      Filipe David Borba Manana authored
      When attempting to move items from our target leaf to its neighbor
      leaves (right and left), we only need to free data_size - free_space
      bytes from our leaf in order to add the new item (which has size of
      data_size bytes). Therefore attempt to move items to the right and
      left leaves if they have at least data_size - free_space bytes free,
      instead of data_size bytes free.
      
      After 5 runs of the following test, I got a smaller number of btree
      node splits overall:
      
      sysbench --test=fileio --file-num=512 --file-total-size=5G \
        --file-test-mode=seqwr --num-threads=512 \
         --file-block-size=8192 --max-requests=100000 --file-io-mode=sync
      
      Before this change:
      * 6171 splits (average of 5 test runs)
      * 61.508Mb/sec of throughput (average of 5 test runs)
      
      After this change:
      * 6036 splits (average of 5 test runs)
      * 63.533Mb/sec of throughput (average of 5 test runs)
      
      An ideal test would not just have multiple threads/processes writing
      to a file (insertion of file extent items) but also do other operations
      that result in insertion of items with varied sizes, like file/directory
      creations, creation of links, symlinks, xattrs, etc.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      5a4267ca
    • Filipe David Borba Manana's avatar
      Btrfs: avoid unnecessary ordered extent cache resets · 1b8e7e45
      Filipe David Borba Manana authored
      After an ordered extent completes, don't blindly reset the
      inode's ordered tree last accessed ordered extent pointer.
      
      While running the xfstests I noticed that about 29% of the
      time the ordered extent to which tree->last pointed was not
      the same as our just completed ordered extent. After that I
      ran the following sysbench test (after a prepare phase) and
      noticed that about 68% of the time tree->last pointed to
      a different ordered extent too.
      
      sysbench --test=fileio --file-num=32 --file-total-size=4G \
          --file-test-mode=rndwr --num-threads=512 \
          --file-block-size=32768 --max-time=60 --max-requests=0 run
      
      Therefore reset tree->last on ordered extent removal only if
      it pointed to the ordered extent we're removing from the tree.
      
      Results from 4 runs of the following test before and after
      applying this patch:
      
      $ sysbench --test=fileio --file-num=32 --file-total-size=4G \
        --file-test-mode=seqwr --num-threads=512 \
        --file-block-size=32768 --max-time=60 --file-io-mode=sync prepare
      $ sysbench --test=fileio --file-num=32 --file-total-size=4G \
        --file-test-mode=seqwr --num-threads=512 \
        --file-block-size=32768 --max-time=60 --file-io-mode=sync run
      
      Before this path:
      
      run 1 - 64.049Mb/sec
      run 2 - 63.455Mb/sec
      run 3 - 64.656Mb/sec
      run 4 - 63.833Mb/sec
      
      After this patch:
      
      run 1 - 66.149Mb/sec
      run 2 - 68.459Mb/sec
      run 3 - 66.338Mb/sec
      run 4 - 66.176Mb/sec
      
      With random writes (--file-test-mode=rndwr) I had huge fluctuations
      on the results (+- 35% easily).
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1b8e7e45
    • Jeff Mahoney's avatar
      btrfs: fix leaks during sysfs teardown · e453d989
      Jeff Mahoney authored
      Filipe noticed that we were leaking the features attribute group
      after umount. His fix of just calling sysfs_remove_group() wasn't enough
      since that removes just the supported features and not the unsupported
      features.
      
      This patch changes the unknown feature handling to add them individually
      so we can skip the kmalloc and uses the same iteration to tear them down
      later.
      
      We also fix the error handling during mount so that we catch the
      failing creation of the per-super kobject, and handle proper teardown
      of a half-setup sysfs context.
      
      Tested properly with kmemleak enabled this time.
      Reported-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Tested-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e453d989
    • Jeff Mahoney's avatar
      btrfs: fix static checker warnings · 1b8e5df6
      Jeff Mahoney authored
      This patch fixes the following warnings:
      fs/btrfs/extent-tree.c:6201:12: sparse: symbol 'get_raid_name' was not declared. Should it be static?
      fs/btrfs/extent-tree.c:8430:9: error: format not a string literal and no format arguments [-Werror=format-security] get_raid_name(index));
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1b8e5df6
    • Filipe David Borba Manana's avatar
      Btrfs: fix very slow inode eviction and fs unmount · 131e404a
      Filipe David Borba Manana authored
      The inode eviction can be very slow, because during eviction we
      tell the VFS to truncate all of the inode's pages. This results
      in calls to btrfs_invalidatepage() which in turn does calls to
      lock_extent_bits() and clear_extent_bit(). These calls result in
      too many merges and splits of extent_state structures, which
      consume a lot of time and cpu when the inode has many pages. In
      some scenarios I have experienced umount times higher than 15
      minutes, even when there's no pending IO (after a btrfs fs sync).
      
      A quick way to reproduce this issue:
      
      $ mkfs.btrfs -f /dev/sdb3
      $ mount /dev/sdb3 /mnt/btrfs
      $ cd /mnt/btrfs
      $ sysbench --test=fileio --file-num=128 --file-total-size=16G \
          --file-test-mode=seqwr --num-threads=128 \
          --file-block-size=16384 --max-time=60 --max-requests=0 run
      $ time btrfs fi sync .
      FSSync '.'
      
      real	0m25.457s
      user	0m0.000s
      sys	0m0.092s
      $ cd ..
      $ time umount /mnt/btrfs
      
      real	1m38.234s
      user	0m0.000s
      sys	1m25.760s
      
      The same test on ext4 runs much faster:
      
      $ mkfs.ext4 /dev/sdb3
      $ mount /dev/sdb3 /mnt/ext4
      $ cd /mnt/ext4
      $ sysbench --test=fileio --file-num=128 --file-total-size=16G \
          --file-test-mode=seqwr --num-threads=128 \
          --file-block-size=16384 --max-time=60 --max-requests=0 run
      $ sync
      $ cd ..
      $ time umount /mnt/ext4
      
      real	0m3.626s
      user	0m0.004s
      sys	0m3.012s
      
      After this patch, the unmount (inode evictions) is much faster:
      
      $ mkfs.btrfs -f /dev/sdb3
      $ mount /dev/sdb3 /mnt/btrfs
      $ cd /mnt/btrfs
      $ sysbench --test=fileio --file-num=128 --file-total-size=16G \
          --file-test-mode=seqwr --num-threads=128 \
          --file-block-size=16384 --max-time=60 --max-requests=0 run
      $ time btrfs fi sync .
      FSSync '.'
      
      real	0m26.774s
      user	0m0.000s
      sys	0m0.084s
      $ cd ..
      $ time umount /mnt/btrfs
      
      real	0m1.811s
      user	0m0.000s
      sys	0m1.564s
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      131e404a
    • Wang Shilong's avatar
      Btrfs: improve forever loop when doing balance relocation · 0647bf56
      Wang Shilong authored
      We hit a forever loop when doing balance relocation,the reason
      is that we firstly reserve 4M(node size is 16k).and within transaction
      we will try to add extra reservation for snapshot roots,this will
      return -EAGAIN if there has been a thread flushing space to reserve
      space.We will do this again and again with filesystem becoming nearly
      full.
      
      If the above '-EAGAIN' case happens, we try to refill reservation more
      outsize of transaction, and this will return eariler in enospc case,however,
      this dosen't really hurt because it makes no sense doing balance relocation
      with the filesystem nearly full.
      
      Miao Xie helped a lot to track this issue, thanks.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      0647bf56
    • Filipe David Borba Manana's avatar
      Btrfs: fix ordered extent check in btrfs_punch_hole · 6126e3ca
      Filipe David Borba Manana authored
      If the ordered extent's last byte was 1 less than our region's
      start byte, we would unnecessarily wait for the completion of
      that ordered extent, because it doesn't intersect our target
      range.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      6126e3ca
    • Miao Xie's avatar
      Btrfs: fix the reserved space leak caused by the race between nonlock dio and buffered io · 376cc685
      Miao Xie authored
      When we ran sysbench on the fs with compression, the following WARN_ONs were
      triggered:
       fs/btrfs/inode.c:7829	WARN_ON(BTRFS_I(inode)->outstanding_extents);
       fs/btrfs/inode.c:7830	WARN_ON(BTRFS_I(inode)->reserved_extents);
       fs/btrfs/inode.c:7832	WARN_ON(BTRFS_I(inode)->csum_bytes);
      
      Steps to reproduce:
       # mkfs.btrfs -f <dev>
       # mount -o compress <dev> <mnt>
       # cd <mnt>
       # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
       > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
       > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
       > --file-test-mode=sync prepare
       # cd -
       # umount <mnt>
       # mount -o compress <dev> <mnt>
       # cd <mnt>
       # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
       > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
       > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
       > --file-test-mode=sync run
       # cd -
       # umount <mnt>
      
      The reason of this problem is:
      Task0				Task1
      btrfs_direct_IO
        unlock(&inode->i_mutex)
      				lock(&inode->i_mutex)
      				reserve_space()
      				prepare_pages()
      				  lock_extent()
      				  clear_extent()
      				  unlock_extent()
        lock_extent()
        test_extent(uptodate)
          return false
      				copy_data()
      				set_delalloc_extent()
        extent need compress
          go back to buffered write
        clear_extent(DELALLOC | DIRTY)
        unlock_extent()
      
      Task 0 and 1 wrote the same place, and task0 cleared the delalloc flag which
      was set by task1, it made the dirty pages in that extents couldn't be flushed
      into the disk, so the reserved space for that extent was not released at
      the end.
      
      This patch fixes the above bug by unlocking the extent after the delalloc.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      376cc685
    • Miao Xie's avatar
      Btrfs: cleanup unnecessary parameter and variant of prepare_pages() · b37392ea
      Miao Xie authored
      - the caller has gotten the inode object, needn't pass the file object.
        And if so, we needn't define a inode pointer variant.
      - the position should be aligned by the page size not sector size, so
        we also needn't pass the root object into prepare_pages().
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      b37392ea
    • David Sterba's avatar
      btrfs: replace BUG in can_modify_feature · cc37bb04
      David Sterba authored
      We don't need to crash hard here, it's just reading a sysfs file. The
      values considered in switch are from a fixed set, the default case
      should not happen at all.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      cc37bb04
    • David Sterba's avatar
      btrfs: reserve no transaction units in btrfs_feature_attr_store · 43d87fa2
      David Sterba authored
      Added in patch "btrfs: add ability to change features via sysfs",
      modifications to superblock don't need to reserve metadata blocks when
      starting a transaction.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      43d87fa2
    • Frank Holton's avatar
      Btrfs: make btrfs_debug match pr_debug handling related to DEBUG · 27a0dd61
      Frank Holton authored
      The kernel macro pr_debug is defined as a empty statement when DEBUG is
      not defined. Make btrfs_debug match pr_debug to avoid spamming
      the kernel log with debug messages
      Signed-off-by: default avatarFrank Holton <fholton@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      27a0dd61