- 28 Jan, 2014 40 commits
-
-
David Sterba authored
Warn if the balance goes below zero, which appears to be unlikely though. Otherwise cleans up the code a bit. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Wang Shilong authored
Since daivd did the work that force us to use readonly snapshot, we can safely remove transaction protection from btrfs send. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
We met the following oops when doing space balance: kobject (ffff88081b590278): tried to init an initialized object, something is seriously wrong. ... Call Trace: [<ffffffff81937262>] dump_stack+0x49/0x5f [<ffffffff8137d259>] kobject_init+0x89/0xa0 [<ffffffff8137d36a>] kobject_init_and_add+0x2a/0x70 [<ffffffffa009bd79>] ? clear_extent_bit+0x199/0x470 [btrfs] [<ffffffffa005e82c>] __link_block_group+0xfc/0x120 [btrfs] [<ffffffffa006b9db>] btrfs_make_block_group+0x24b/0x370 [btrfs] [<ffffffffa00a899b>] __btrfs_alloc_chunk+0x54b/0x7e0 [btrfs] [<ffffffffa00a8c6f>] btrfs_alloc_chunk+0x3f/0x50 [btrfs] [<ffffffffa0060123>] do_chunk_alloc+0x363/0x440 [btrfs] [<ffffffffa00633d4>] btrfs_check_data_free_space+0x104/0x310 [btrfs] [<ffffffffa0069f4d>] btrfs_write_dirty_block_groups+0x48d/0x600 [btrfs] [<ffffffffa007aad4>] commit_cowonly_roots+0x184/0x250 [btrfs] ... Steps to reproduce: # mkfs.btrfs -f <dev> # mount -o nospace_cache <dev> <mnt> # btrfs balance start <mnt> # dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1 The reason of this problem is that we initialized the raid kobject when we added a block group into a empty raid list. As we know, when we mounted a btrfs filesystem, the raid list was empty, we would initialize the raid kobject when we added the first block group. But if there was not data stored in the block group, the block group would be freed when doing balance, and the raid list would be empty. And then if we allocated a new block group and added it into the raid list, we would initialize the raid kobject again, the oops happened. Fix this problem by initializing the raid kobject just when mounting the fs. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reported-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Wang Shilong authored
See the warning below: [ 1209.102076] [<ffffffffa04721b9>] remove_extent_mapping+0x69/0x70 [btrfs] [ 1209.102084] [<ffffffffa0466b06>] btrfs_evict_inode+0x96/0x4d0 [btrfs] [ 1209.102089] [<ffffffff81073010>] ? wake_atomic_t_function+0x40/0x40 [ 1209.102092] [<ffffffff8118ab2e>] evict+0x9e/0x190 [ 1209.102094] [<ffffffff8118b313>] iput+0xf3/0x180 [ 1209.102101] [<ffffffffa0461fd1>] btrfs_run_delayed_iputs+0xb1/0xd0 [btrfs] [ 1209.102107] [<ffffffffa045d358>] __btrfs_end_transaction+0x268/0x350 [btrfs] clear extent bit here to avoid triggering WARN_ON() in remove_extent_mapping() Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
David Sterba authored
All the subvolues that are involved in send must be read-only during the whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the status to read-write and the result of send stream is undefined if the data change unexpectedly. Fix that by adding a refcount for all involved roots and verify that there's no send in progress during SUBVOL_SETFLAGS ioctl call that does read-only -> read-write transition. We need refcounts because there are no restrictions on number of send parallel operations currently run on a single subvolume, be it source, parent or one of the multiple clone sources. Kernel is silent when the RO checks fail and returns EPERM. The same set of checks is done already in userspace before send starts. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
David Sterba authored
Unused since ed259095 "Btrfs: stop using vfs_read in send". Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
David Sterba authored
Remove ifdefed code: - tlv_put for 8, 16 and 32, add a generic tempalte if needed in future - tlv_put_timespec - the btrfs_timespec fields are used - fs_path_remove obsoleted long ago Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
While running btrfs/004 from xfstests, after 503 iterations, dmesg reported a deadlock between tasks iterating inode refs and tasks running delayed inodes (during a transaction commit). It turns out that iterating inode refs implies doing one tree search and release all nodes in the path except the leaf node, and then passing that leaf node to btrfs_ref_to_path(), which in turn does another tree search without releasing the lock on the leaf node it received as parameter. This is a problem when other task wants to write to the btree as well and ends up updating the leaf that is read locked - the writer task locks the parent of the leaf and then blocks waiting for the leaf's lock to be released - at the same time, the task executing btrfs_ref_to_path() does a second tree search, without releasing the lock on the first leaf, and wants to access a leaf (the same or another one) that is a child of the same parent, resulting in a deadlock. The trace reported by lockdep follows. [84314.936373] INFO: task fsstress:11930 blocked for more than 120 seconds. [84314.936381] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70 [84314.936383] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [84314.936386] fsstress D ffff8806e1bf8000 0 11930 11926 0x00000000 [84314.936393] ffff8804d6d89b78 0000000000000046 ffff8804d6d89b18 ffffffff810bd8bd [84314.936399] ffff8806e1bf8000 ffff8804d6d89fd8 ffff8804d6d89fd8 ffff8804d6d89fd8 [84314.936405] ffff880806308000 ffff8806e1bf8000 ffff8804d6d89c08 ffff8804deb8f190 [84314.936410] Call Trace: [84314.936421] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10 [84314.936428] [<ffffffff81774269>] schedule+0x29/0x70 [84314.936451] [<ffffffffa0715bf5>] btrfs_tree_lock+0x75/0x270 [btrfs] [84314.936457] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60 [84314.936470] [<ffffffffa06ba231>] btrfs_search_slot+0x7f1/0x930 [btrfs] [84314.936489] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs] [84314.936504] [<ffffffffa06d2e1f>] btrfs_lookup_inode+0x2f/0xa0 [btrfs] [84314.936510] [<ffffffff810bd6ef>] ? trace_hardirqs_on_caller+0x1f/0x1e0 [84314.936528] [<ffffffffa073173c>] __btrfs_update_delayed_inode+0x4c/0x1d0 [btrfs] [84314.936543] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs] [84314.936558] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs] [84314.936573] [<ffffffffa0731c82>] __btrfs_run_delayed_items+0x192/0x1e0 [btrfs] [84314.936589] [<ffffffffa0731d03>] btrfs_run_delayed_items+0x13/0x20 [btrfs] [84314.936604] [<ffffffffa06dbcd4>] btrfs_flush_all_pending_stuffs+0x24/0x80 [btrfs] [84314.936620] [<ffffffffa06ddc13>] btrfs_commit_transaction+0x223/0xa20 [btrfs] [84314.936630] [<ffffffffa06ae5ae>] btrfs_sync_fs+0x6e/0x110 [btrfs] [84314.936635] [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60 [84314.936639] [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60 [84314.936643] [<ffffffff811d0b70>] sync_fs_one_sb+0x20/0x30 [84314.936648] [<ffffffff811a3541>] iterate_supers+0xf1/0x100 [84314.936652] [<ffffffff811d0c45>] sys_sync+0x55/0x90 [84314.936658] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [84314.936660] INFO: lockdep is turned off. [84314.936663] INFO: task btrfs:11955 blocked for more than 120 seconds. [84314.936666] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70 [84314.936668] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [84314.936670] btrfs D ffff880541729a88 0 11955 11608 0x00000000 [84314.936674] ffff880541729a38 0000000000000046 ffff8805417299d8 ffffffff810bd8bd [84314.936680] ffff88075430c8a0 ffff880541729fd8 ffff880541729fd8 ffff880541729fd8 [84314.936685] ffffffff81c104e0 ffff88075430c8a0 ffff8804de8b00b8 ffff8804de8b0000 [84314.936690] Call Trace: [84314.936695] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10 [84314.936700] [<ffffffff81774269>] schedule+0x29/0x70 [84314.936717] [<ffffffffa0715815>] btrfs_tree_read_lock+0xd5/0x140 [btrfs] [84314.936721] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60 [84314.936733] [<ffffffffa06ba201>] btrfs_search_slot+0x7c1/0x930 [btrfs] [84314.936746] [<ffffffffa06bd505>] btrfs_find_item+0x55/0x160 [btrfs] [84314.936763] [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs] [84314.936780] [<ffffffffa073c9ca>] btrfs_ref_to_path+0xba/0x1e0 [btrfs] [84314.936797] [<ffffffffa06f9719>] ? release_extent_buffer+0xb9/0xe0 [btrfs] [84314.936813] [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs] [84314.936830] [<ffffffffa073cb50>] inode_to_path+0x60/0xd0 [btrfs] [84314.936846] [<ffffffffa073d365>] paths_from_inode+0x115/0x3c0 [btrfs] [84314.936851] [<ffffffff8118dd44>] ? kmem_cache_alloc_trace+0x114/0x200 [84314.936868] [<ffffffffa0714494>] btrfs_ioctl+0xf14/0x2030 [btrfs] [84314.936873] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50 [84314.936877] [<ffffffff8116598f>] ? handle_mm_fault+0x34f/0xb00 [84314.936882] [<ffffffff81075563>] ? up_read+0x23/0x40 [84314.936886] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0 [84314.936892] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570 [84314.936896] [<ffffffff81776e23>] ? error_sti+0x5/0x6 [84314.936901] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0 [84314.936906] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13 [84314.936910] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0 [84314.936915] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f [84314.936920] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [84314.936922] INFO: lockdep is turned off. [84434.866873] INFO: task btrfs-transacti:11921 blocked for more than 120 seconds. [84434.866881] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70 [84434.866883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [84434.866886] btrfs-transacti D ffff880755b6a478 0 11921 2 0x00000000 [84434.866893] ffff8800735b9ce8 0000000000000046 ffff8800735b9c88 ffffffff810bd8bd [84434.866899] ffff8805a1b848a0 ffff8800735b9fd8 ffff8800735b9fd8 ffff8800735b9fd8 [84434.866904] ffffffff81c104e0 ffff8805a1b848a0 ffff880755b6a478 ffff8804cece78f0 [84434.866910] Call Trace: [84434.866920] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10 [84434.866927] [<ffffffff81774269>] schedule+0x29/0x70 [84434.866948] [<ffffffffa06dd2ef>] wait_current_trans.isra.33+0xbf/0x120 [btrfs] [84434.866954] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60 [84434.866970] [<ffffffffa06dec18>] start_transaction+0x388/0x5a0 [btrfs] [84434.866985] [<ffffffffa06db9b5>] ? transaction_kthread+0xb5/0x280 [btrfs] [84434.866999] [<ffffffffa06dee97>] btrfs_attach_transaction+0x17/0x20 [btrfs] [84434.867012] [<ffffffffa06dba9e>] transaction_kthread+0x19e/0x280 [btrfs] [84434.867026] [<ffffffffa06db900>] ? open_ctree+0x2260/0x2260 [btrfs] [84434.867030] [<ffffffff81070dad>] kthread+0xed/0x100 [84434.867035] [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190 [84434.867040] [<ffffffff8177ee6c>] ret_from_fork+0x7c/0xb0 [84434.867044] [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190 Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Wang Shilong authored
Chris introduced hleper function read_csums() and this function has been removed, but we forgot to remove its corresponding comments. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
It's not used anywhere, so just drop it. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
fs/btrfs/file.c: In function ‘prepare_pages.isra.18’: fs/btrfs/file.c:1265:6: warning: ‘err’ may be used uninitialized in this function [-Wuninitialized] Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Wang Shilong authored
We have commited transaction before, remove redundant filemap writting and waiting here, it can speed up balance relocation process. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Tsutomu Itoh authored
Clean up btrfs_lookup_dentry() to never return NULL, but PTR_ERR(-ENOENT) instead. This keeps the return value convention consistent. Callers who use btrfs_lookup_dentry() require a trivial update. create_snapshot() in particular looks like it can also lose a BUG_ON(!inode) which is not really needed - there seems less harm in returning ENOENT to userspace at that point in the stack than there is to crash the machine. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
In ctree.c:tree_mod_log_set_node_key() we were calling __tree_mod_log_insert_key() even when the modification doesn't need to be logged. This would allocate a tree_mod_elem structure, fill it and pass it to __tree_mod_log_insert(), which would just acquire the tree mod log write lock and then free the tree_mod_elem structure and return (that is, a no-op). Therefore call tree_mod_log_insert() instead of __tree_mod_log_insert() which just returns immediately if the modification doesn't need to be logged (without allocating the structure, fill it, acquire write lock, free structure). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Josef Bacik authored
I need to create a fake tree to test qgroups and I don't want to have to setup a fake btree_inode. The fact is we only use the radix tree for the fs_info, so everybody else who allocates an extent_io_tree is just wasting the space anyway. This patch moves the radix tree and its lock into btrfs_fs_info so there is less stuff I have to fake to do qgroup sanity tests. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Josef Bacik authored
For creating a dummy in-memory btree I need to be able to use the radix tree to keep track of the buffers like normal extent buffers. With dummy buffers we skip the radix tree step, and we still want to do that for the tree mod log dummy buffers but for my test buffers we need to be able to remove them from the radix tree like normal. This will give me a way to do that. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Josef Bacik authored
I need to add infrastructure to allocate dummy extent buffers for running sanity tests, and to do this I need to not have to worry about having an address_mapping for an io_tree, so just fix up the places where we assume that all io_tree's have a non-NULL ->mapping. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
Currently when finding the leaf to insert a key into a btree, if the leaf doesn't have enough space to store the item we attempt to move off some items from our leaf to its right neighbor leaf, and if this fails to create enough free space in our leaf, we try to move off more items to the left neighbor leaf as well. When trying to move off items to the right neighbor leaf, if it has enough room to store the new key but not not enough room to move off at least one item from our target leaf, __push_leaf_right returns 1 and we have to attempt to move items to the left neighbor (push_leaf_left function) without touching the right neighbor leaf. For the case where the right leaf has enough room to store at least 1 item from our leaf, we end up modifying (and dirtying) both our leaf and the right leaf. This is non-optimal for the case where the new key is greater than any key in our target leaf because it can be inserted at slot 0 of the right neighbor leaf and we don't need to touch our leaf at all nor to attempt to move off items to the left neighbor leaf. Therefore this change just selects the right neighbor leaf as our new target leaf if it has enough room for the new key without modifying our initial target leaf - we do this only if the new key is higher than any key in the initial target leaf. While running the following test, push_leaf_right was called by split_leaf 4802 times. Out of those 4802 calls, for 2571 calls (53.5%) we hit this special case (right leaf has enough room and new key is higher than any key in the initial target leaf). Test: sysbench --test=fileio --file-num=512 --file-total-size=5G \ --file-test-mode=[seqwr|rndwr] --num-threads=512 --file-block-size=8192 \ --max-requests=100000 --file-io-mode=sync [prepare|run] Results: sequential writes Throughput before this change: 65.71Mb/sec (average of 10 runs) Throughput after this change: 66.58Mb/sec (average of 10 runs) random writes Throughput before this change: 10.75Mb/sec (average of 10 runs) Throughput after this change: 11.56Mb/sec (average of 10 runs) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Wang Shilong authored
Just wrap same code into one function scrub_blocked_if_needed(). This make a change that we will move waiting (@workers_pending = 0) before we can wake up commiting transaction(atomic_inc(@scrub_paused)), we must take carefully to not deadlock here. Thread 1 Thread 2 |->btrfs_commit_transaction() |->set trans type(COMMIT_DOING) |->btrfs_scrub_paused()(blocked) |->join_transaction(blocked) Move btrfs_scrub_paused() before setting trans type which means we can still join a transaction when commiting_transaction is blocked. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Suggested-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Wang Shilong authored
We came a race condition when scrubbing superblocks, the story is: In commiting transaction, we will update @last_trans_commited after writting superblocks, if scrubber start after writting superblocks and before updating @last_trans_commited, generation mismatch happens! We fix this by checking @scrub_pause_req, and we won't start a srubber until commiting transaction is finished.(after btrfs_scrub_continue() finished.) Reported-by: Sebastian Ochmann <ochmann@informatik.uni-bonn.de> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
fs/btrfs/send.c:2190:9: warning: incorrect type in argument 3 (different base types) fs/btrfs/send.c:2190:9: expected unsigned long long [unsigned] [usertype] value fs/btrfs/send.c:2190:9: got restricted __le64 [usertype] ctransid fs/btrfs/send.c:2195:17: warning: incorrect type in argument 3 (different base types) fs/btrfs/send.c:2195:17: expected unsigned long long [unsigned] [usertype] value fs/btrfs/send.c:2195:17: got restricted __le64 [usertype] ctransid fs/btrfs/send.c:3716:9: warning: incorrect type in argument 3 (different base types) fs/btrfs/send.c:3716:9: expected unsigned long long [unsigned] [usertype] value fs/btrfs/send.c:3716:9: got restricted __le64 [usertype] ctransid Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
When merging an extent_map with its right neighbor, increment its block_len with the neighbor's block_len. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Michal Nazarewicz authored
[commit 8185554d: fix incorrect inode acl reset] introduced a dead code by adding a condition which can never be true to an else branch. The condition can never be true because it is already checked by a previous if statement which causes function to return. Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Reviewed-By: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
We were accounting for sizeof(struct btrfs_item) twice, once in the data_size variable and another time in the if statement below. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
Currently we do 2 traversals of an inode's extent_io_tree before inserting an extent state structure: 1 to see if a matching extent state already exists and 1 to do the insertion if the fist traversal didn't found such extent state. This change just combines those tree traversals into a single one. While running sysbench tests (random writes) I captured the number of elements in extent_io_tree trees for a while (into a procfs file backed by a seq_list from seq_file module) and got this histogram: Count: 9310 Range: 51.000 - 21386.000; Mean: 11785.243; Median: 18743.500; Stddev: 8923.688 Percentiles: 90th: 20985.000; 95th: 21155.000; 99th: 21369.000 51.000 - 93.933: 693 ######## 93.933 - 172.314: 938 ########## 172.314 - 315.408: 856 ######### 315.408 - 576.646: 95 # 576.646 - 6415.830: 888 ########## 6415.830 - 11713.809: 1024 ########### 11713.809 - 21386.000: 4816 ##################################################### So traversing such trees can take some significant time that can easily be avoided. Ran the following sysbench tests, 5 times each, for sequential and random writes, and got the following results: sysbench --test=fileio --file-num=1 --file-total-size=2G \ --file-test-mode=seqwr --num-threads=16 --file-block-size=65536 \ --max-requests=0 --max-time=60 --file-io-mode=sync sysbench --test=fileio --file-num=1 --file-total-size=2G \ --file-test-mode=rndwr --num-threads=16 --file-block-size=65536 \ --max-requests=0 --max-time=60 --file-io-mode=sync Before this change: sequential writes: 69.28Mb/sec (average of 5 runs) random writes: 4.14Mb/sec (average of 5 runs) After this change: sequential writes: 69.91Mb/sec (average of 5 runs) random writes: 5.69Mb/sec (average of 5 runs) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
When we didn't find a matching extent state, we inserted a new one but didn't cache it in the **cached_state parameter, which makes a subsequent call do a tree lookup to get it. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
Before this change, adding an extent map to the extent map tree of an inode required 2 tree nevigations: 1) doing a tree navigation to search for an existing extent map starting at the same offset or an extent map that overlaps the extent map we want to insert; 2) Another tree navigation to add the extent map to the tree (if the former tree search didn't found anything). This change just merges these 2 steps into a single one. While running first few btrfs xfstests I had noticed these trees easily had a few hundred elements, and then with the following sysbench test it reached over 1100 elements very often. Test: sysbench --test=fileio --file-num=32 --file-total-size=10G \ --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \ --max-requests=1000000 --file-io-mode=sync [prepare|run] (fs created with mkfs.btrfs -l 4096 -f /dev/sdb3 before each sysbench prepare phase) Before this patch: run 1 - 41.894Mb/sec run 2 - 40.527Mb/sec run 3 - 40.922Mb/sec run 4 - 49.433Mb/sec run 5 - 40.959Mb/sec average - 42.75Mb/sec After this patch: run 1 - 48.036Mb/sec run 2 - 50.21Mb/sec run 3 - 50.929Mb/sec run 4 - 46.881Mb/sec run 5 - 53.192Mb/sec average - 49.85Mb/sec Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
When attempting to move items from our target leaf to its neighbor leaves (right and left), we only need to free data_size - free_space bytes from our leaf in order to add the new item (which has size of data_size bytes). Therefore attempt to move items to the right and left leaves if they have at least data_size - free_space bytes free, instead of data_size bytes free. After 5 runs of the following test, I got a smaller number of btree node splits overall: sysbench --test=fileio --file-num=512 --file-total-size=5G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=8192 --max-requests=100000 --file-io-mode=sync Before this change: * 6171 splits (average of 5 test runs) * 61.508Mb/sec of throughput (average of 5 test runs) After this change: * 6036 splits (average of 5 test runs) * 63.533Mb/sec of throughput (average of 5 test runs) An ideal test would not just have multiple threads/processes writing to a file (insertion of file extent items) but also do other operations that result in insertion of items with varied sizes, like file/directory creations, creation of links, symlinks, xattrs, etc. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
After an ordered extent completes, don't blindly reset the inode's ordered tree last accessed ordered extent pointer. While running the xfstests I noticed that about 29% of the time the ordered extent to which tree->last pointed was not the same as our just completed ordered extent. After that I ran the following sysbench test (after a prepare phase) and noticed that about 68% of the time tree->last pointed to a different ordered extent too. sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=rndwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --max-requests=0 run Therefore reset tree->last on ordered extent removal only if it pointed to the ordered extent we're removing from the tree. Results from 4 runs of the following test before and after applying this patch: $ sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --file-io-mode=sync prepare $ sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --file-io-mode=sync run Before this path: run 1 - 64.049Mb/sec run 2 - 63.455Mb/sec run 3 - 64.656Mb/sec run 4 - 63.833Mb/sec After this patch: run 1 - 66.149Mb/sec run 2 - 68.459Mb/sec run 3 - 66.338Mb/sec run 4 - 66.176Mb/sec With random writes (--file-test-mode=rndwr) I had huge fluctuations on the results (+- 35% easily). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Jeff Mahoney authored
Filipe noticed that we were leaking the features attribute group after umount. His fix of just calling sysfs_remove_group() wasn't enough since that removes just the supported features and not the unsupported features. This patch changes the unknown feature handling to add them individually so we can skip the kmalloc and uses the same iteration to tear them down later. We also fix the error handling during mount so that we catch the failing creation of the per-super kobject, and handle proper teardown of a half-setup sysfs context. Tested properly with kmemleak enabled this time. Reported-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Tested-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Jeff Mahoney authored
This patch fixes the following warnings: fs/btrfs/extent-tree.c:6201:12: sparse: symbol 'get_raid_name' was not declared. Should it be static? fs/btrfs/extent-tree.c:8430:9: error: format not a string literal and no format arguments [-Werror=format-security] get_raid_name(index)); Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
The inode eviction can be very slow, because during eviction we tell the VFS to truncate all of the inode's pages. This results in calls to btrfs_invalidatepage() which in turn does calls to lock_extent_bits() and clear_extent_bit(). These calls result in too many merges and splits of extent_state structures, which consume a lot of time and cpu when the inode has many pages. In some scenarios I have experienced umount times higher than 15 minutes, even when there's no pending IO (after a btrfs fs sync). A quick way to reproduce this issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ cd /mnt/btrfs $ sysbench --test=fileio --file-num=128 --file-total-size=16G \ --file-test-mode=seqwr --num-threads=128 \ --file-block-size=16384 --max-time=60 --max-requests=0 run $ time btrfs fi sync . FSSync '.' real 0m25.457s user 0m0.000s sys 0m0.092s $ cd .. $ time umount /mnt/btrfs real 1m38.234s user 0m0.000s sys 1m25.760s The same test on ext4 runs much faster: $ mkfs.ext4 /dev/sdb3 $ mount /dev/sdb3 /mnt/ext4 $ cd /mnt/ext4 $ sysbench --test=fileio --file-num=128 --file-total-size=16G \ --file-test-mode=seqwr --num-threads=128 \ --file-block-size=16384 --max-time=60 --max-requests=0 run $ sync $ cd .. $ time umount /mnt/ext4 real 0m3.626s user 0m0.004s sys 0m3.012s After this patch, the unmount (inode evictions) is much faster: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ cd /mnt/btrfs $ sysbench --test=fileio --file-num=128 --file-total-size=16G \ --file-test-mode=seqwr --num-threads=128 \ --file-block-size=16384 --max-time=60 --max-requests=0 run $ time btrfs fi sync . FSSync '.' real 0m26.774s user 0m0.000s sys 0m0.084s $ cd .. $ time umount /mnt/btrfs real 0m1.811s user 0m0.000s sys 0m1.564s Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Wang Shilong authored
We hit a forever loop when doing balance relocation,the reason is that we firstly reserve 4M(node size is 16k).and within transaction we will try to add extra reservation for snapshot roots,this will return -EAGAIN if there has been a thread flushing space to reserve space.We will do this again and again with filesystem becoming nearly full. If the above '-EAGAIN' case happens, we try to refill reservation more outsize of transaction, and this will return eariler in enospc case,however, this dosen't really hurt because it makes no sense doing balance relocation with the filesystem nearly full. Miao Xie helped a lot to track this issue, thanks. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe David Borba Manana authored
If the ordered extent's last byte was 1 less than our region's start byte, we would unnecessarily wait for the completion of that ordered extent, because it doesn't intersect our target range. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
When we ran sysbench on the fs with compression, the following WARN_ONs were triggered: fs/btrfs/inode.c:7829 WARN_ON(BTRFS_I(inode)->outstanding_extents); fs/btrfs/inode.c:7830 WARN_ON(BTRFS_I(inode)->reserved_extents); fs/btrfs/inode.c:7832 WARN_ON(BTRFS_I(inode)->csum_bytes); Steps to reproduce: # mkfs.btrfs -f <dev> # mount -o compress <dev> <mnt> # cd <mnt> # sysbench --test=fileio --num-threads=8 --file-total-size=8G \ > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \ > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \ > --file-test-mode=sync prepare # cd - # umount <mnt> # mount -o compress <dev> <mnt> # cd <mnt> # sysbench --test=fileio --num-threads=8 --file-total-size=8G \ > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \ > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \ > --file-test-mode=sync run # cd - # umount <mnt> The reason of this problem is: Task0 Task1 btrfs_direct_IO unlock(&inode->i_mutex) lock(&inode->i_mutex) reserve_space() prepare_pages() lock_extent() clear_extent() unlock_extent() lock_extent() test_extent(uptodate) return false copy_data() set_delalloc_extent() extent need compress go back to buffered write clear_extent(DELALLOC | DIRTY) unlock_extent() Task 0 and 1 wrote the same place, and task0 cleared the delalloc flag which was set by task1, it made the dirty pages in that extents couldn't be flushed into the disk, so the reserved space for that extent was not released at the end. This patch fixes the above bug by unlocking the extent after the delalloc. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Miao Xie authored
- the caller has gotten the inode object, needn't pass the file object. And if so, we needn't define a inode pointer variant. - the position should be aligned by the page size not sector size, so we also needn't pass the root object into prepare_pages(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
David Sterba authored
We don't need to crash hard here, it's just reading a sysfs file. The values considered in switch are from a fixed set, the default case should not happen at all. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
David Sterba authored
Added in patch "btrfs: add ability to change features via sysfs", modifications to superblock don't need to reserve metadata blocks when starting a transaction. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Frank Holton authored
The kernel macro pr_debug is defined as a empty statement when DEBUG is not defined. Make btrfs_debug match pr_debug to avoid spamming the kernel log with debug messages Signed-off-by: Frank Holton <fholton@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
-