- 28 May, 2018 40 commits
-
Nikolay Borisov authored
The invariant is that when nr_delalloc_inodes is 0, the root must not have any inodes on its delalloc inodes list. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
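A minimal sketch of what such an invariant check can look like (illustrative only; ASSERT is the btrfs debug assertion macro and the field names follow the description above):

    /* wherever nr_delalloc_inodes is decremented for the root */
    if (!root->nr_delalloc_inodes)
        ASSERT(list_empty(&root->delalloc_inodes));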
-
Robbie Ko authored
Currently, when checking if a directory can be deleted, we always check whether all its children have been processed, which can take a very long time for large directories. Instead of checking all children on every call to can_rmdir(), keep track of the directory index offset of the child last checked in the previous call to can_rmdir(), and then use it as the starting point for future calls to can_rmdir().

Example: a directory with 2,000,000 files was deleted
  original: 1994m57.071s
  patch:       1m38.554s

Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
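A minimal userspace-style sketch of the idea (the structure and helper here are illustrative, not the real send.c code): remember where the previous scan stopped and resume from there, so repeated calls do not rescan children that are already known to be processed.

    #include <stdbool.h>
    #include <stddef.h>

    struct dir_check_state {
        size_t last_index;    /* first child not yet verified as processed */
    };

    /* children_done[i] is true once child i has been fully processed */
    static bool can_rmdir(struct dir_check_state *st,
                          const bool *children_done, size_t nr_children)
    {
        for (size_t i = st->last_index; i < nr_children; i++) {
            if (!children_done[i]) {
                st->last_index = i;    /* resume here on the next call */
                return false;
            }
        }
        st->last_index = nr_children;
        return true;
    }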
-
Robbie Ko authored
Move the allocation after the search when it's clear that the new entry will be added. Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
add_delayed_ref_head really performed 2 independent operations - initialising the ref head and adding it to a list. Now that the init part is in a separate function, let's complete the separation between the two operations. This results in a much simpler interface for add_delayed_ref_head, since the function now deals solely with either adding the newly initialised delayed ref head or merging it into an existing delayed ref head. It also results in a vastly simplified function signature, since 5 arguments are dropped. The only other thing worth mentioning is that due to this split, the WARN_ON catching re-initialisation of an existing head had to be adjusted: the condition is extended with qrecord && head_ref->qgroup_ref_root && head_ref->qgroup_reserved. This is done because the two qgroup_* prefixed members are set only if both ref_root and reserved are passed. So functionally it's equivalent to the old WARN_ON and allows removing the two args from add_delayed_ref_head. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Use the newly introduced function when initialising the head_ref in add_delayed_ref_head. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
add_delayed_ref_head implements the logic to both initialize a head_ref structure and perform the necessary operations to add it to the delayed ref machinery. This has resulted in a very cumbersome interface with loads of parameters and code which, at first glance, looks very unwieldy. Begin untangling it by first extracting the initialization-only code into its own function. It's a more or less verbatim copy of the first part of add_delayed_ref_head. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Now that the initialization part and the critical section code have been split, it's a lot easier to open code add_delayed_data_ref. Do so in the following manner:
1. The common init function is put immediately after the memory-to-be-initialized is allocated, followed by the data-ref-specific initialization.
2. The only piece of code that remains in the critical section is the insert_delayed_ref call.
3. Tracing and memory freeing code is moved outside of the critical section.
No functional changes, just an overall shorter critical section. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
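A simplified userspace analogue of the restructured flow (illustrative names, not the actual delayed-ref code): allocation and initialization happen before the lock is taken, only the list insertion stays in the critical section, and any tracing or freeing happens after the unlock.

    #include <pthread.h>
    #include <stdlib.h>

    struct ref { struct ref *next; int data; };

    static pthread_mutex_t refs_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct ref *refs_head;

    static int add_ref(int data)
    {
        struct ref *ref = malloc(sizeof(*ref));    /* 1. allocate and ...     */
        if (!ref)
            return -1;
        ref->data = data;                          /* ... init before locking */

        pthread_mutex_lock(&refs_lock);            /* 2. only the insert is   */
        ref->next = refs_head;                     /*    inside the lock      */
        refs_head = ref;
        pthread_mutex_unlock(&refs_lock);

        /* 3. tracing/logging would go here, outside the critical section */
        return 0;
    }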
-
Nikolay Borisov authored
Now that the initialization part and the critical section code have been split, it's a lot easier to open code add_delayed_tree_ref. Do so in the following manner:
1. The common init code is put immediately after the memory-to-be-initialized is allocated, followed by the ref-specific member initialization.
2. The only piece of code that remains in the critical section is the insert_delayed_ref call.
3. Tracing and memory freeing code is put outside of the critical section as well.
The only real change here is an overall shorter critical section when dealing with delayed tree refs. From a functional point of view, the code is unchanged. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Use the newly introduced helper and remove the duplicate code. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Use the newly introduced common helper. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
The majority of the init code for struct btrfs_delayed_ref_node is duplicated in add_delayed_data_ref and add_delayed_tree_ref. Factor out the common bits into init_delayed_ref_common. This function is going to be used in future patches to clean that up. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
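A hedged sketch of what such a common initializer might look like (the parameter list and field names are illustrative and may differ from the actual helper):

    static void init_delayed_ref_common(struct btrfs_delayed_ref_node *ref,
                                        u64 bytenr, u64 num_bytes,
                                        u64 seq, int action, u8 ref_type)
    {
        /* the part add_delayed_tree_ref and add_delayed_data_ref used to duplicate */
        refcount_set(&ref->refs, 1);
        ref->bytenr = bytenr;
        ref->num_bytes = num_bytes;
        ref->ref_mod = 1;
        ref->action = action;
        ref->type = ref_type;
        ref->seq = seq;
        RB_CLEAR_NODE(&ref->ref_node);
        INIT_LIST_HEAD(&ref->add_list);
    }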
-
Chengguang Xu authored
It's not good to overwrite -ENOMEM with -EINVAL when failing from mount option parsing, so just return the original error code. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
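A minimal userspace-style sketch of the pattern (the helper names are hypothetical): when the lower-level parser fails with -ENOMEM, propagate that code instead of replacing it with -EINVAL.

    #include <errno.h>
    #include <stdlib.h>
    #include <string.h>

    static int parse_u64(const char *s, unsigned long long *out)
    {
        char *dup = strdup(s);        /* can fail under memory pressure */
        if (!dup)
            return -ENOMEM;
        *out = strtoull(dup, NULL, 10);
        free(dup);
        return 0;
    }

    static int parse_mount_option(const char *value, unsigned long long *limit)
    {
        int ret = parse_u64(value, limit);
        if (ret)
            return ret;               /* keep -ENOMEM, don't overwrite with -EINVAL */
        return 0;
    }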
-
David Sterba authored
The fs_info is always available from the context so we don't need to store it in the structure. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
When debugging a quota rescan race, btrfs rescan could sometimes account an old (committed) leaf and then re-account the newly committed leaf in the next generation. This race needs the extra transid to locate, so add @transid to trace_btrfs_qgroup_account_extent() for such debugging. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Colin Ian King authored
Trivial fix for a spelling mistake of a function name in a btrfs_err message. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
This function is used in only one place and the devid argument is always passed as 0. So just remove it, similarly to how it was removed in the userspace code. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Originally, trace_qgroup_update_counters() only recorded the qgroup id and its reference count change. That's good enough to debug qgroup accounting changes, but when a rescan race is involved, it's pretty hard to distinguish which modification belongs to which rescan. So add old_rfer and old_excl to the trace output to help distinguish different rescan instances. (A different rescan instance should reset its qgroup->rfer to 0.) As for the trace event parameters, it just changes from u64 qgroup_id to struct btrfs_qgroup *qgroup, so the number of parameters is not changed at all. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
It's used only in inode.c, so it makes no sense to have it exported. Also move the definition of btrfs_delalloc_work to inode.c since it's used only in this file. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
When allocating a delalloc work we always set delay_iput to 0. So remove the delay_iput member of btrfs_delalloc_work and, as a result, also remove it as a parameter from btrfs_alloc_delalloc_work since it's not used anymore. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
It's always set to 0 so remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ rename to start_delalloc_inodes ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
It's always set to 0, so just remove it and collapse the constant value into the only function we pass it to. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
This parameter was introduced alongside the function in eb73c1b7 ("Btrfs: introduce per-subvolume delalloc inode list") to avoid deadlocks since this function was used in the transaction commit path. However, commit 8d875f95 ("btrfs: disable strict file flushes for renames and truncates") removed that usage, rendering the parameter obsolete. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Gu Jinxiang authored
In btrfs_shrink_device, path->reada is set to READA_FORWARD before btrfs_search_slot. But I think READA_BACK is correct, since:
1. key.offset is set to (u64)-1
2. after btrfs_search_slot, btrfs_previous_item is called
So, for reading ahead previous items, READA_BACK is the correct one. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
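An excerpt-style sketch of the access pattern being described (simplified, surrounding setup omitted): the search starts at the highest possible offset and then steps backwards with btrfs_previous_item, so backward readahead matches the iteration direction.

    path->reada = READA_BACK;              /* we iterate towards smaller keys */
    key.objectid = device->devid;
    key.type = BTRFS_DEV_EXTENT_KEY;
    key.offset = (u64)-1;                  /* start past the last possible item */

    ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
    if (ret < 0)
        goto done;
    ret = btrfs_previous_item(root, path, 0, key.type);   /* step backwards */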
-
Qu Wenruo authored
This patch adds the following trace events:
1) btrfs_remove_block_group - for the btrfs_remove_block_group() function, triggered when a block group is really removed
2) btrfs_add_unused_block_group - triggered when a block group is added to the unused_bgs list
3) btrfs_skip_unused_block_group - triggered when an unused block group is not deleted
These trace events are pretty handy for debugging cases related to automatic block group removal. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
fs_info can be extracted from btrfs_block_group_cache, and every btrfs_block_group_cache is created by btrfs_create_block_group_cache() with fs_info initialized, so there is no need to worry about a NULL pointer dereference. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
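A small before/after sketch of the cleanup (the function name is hypothetical): callers stop passing fs_info separately because the block group cache already carries a valid pointer to it.

    /* before: both pointers had to be passed around */
    void handle_bg(struct btrfs_fs_info *fs_info,
                   struct btrfs_block_group_cache *cache);

    /* after: fs_info is taken from the cache itself */
    void handle_bg(struct btrfs_block_group_cache *cache)
    {
        struct btrfs_fs_info *fs_info = cache->fs_info;
        /* ... */
    }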
-
Gu Jinxiang authored
Since commit c6100a4b ("Btrfs: replace tree->mapping with tree->private_data"), the fs_info parameter of alloc_reloc_control is not used. So remove it. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Add a new member struct btrfs_raid_attr::mindev_error so that btrfs_raid_array can maintain the error code to return if the minimum number of devices condition is not met while trying to delete a device in the given raid. And so we can drop btrfs_raid_mindev_error. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Add a new member struct btrfs_raid_attr::bg_flag so that btrfs_raid_array can maintain the bit map flag of the raid type, and so we can drop btrfs_raid_group. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Add a new member struct btrfs_raid_attr::raid_name so that btrfs_raid_array can maintain the name of the raid type, and so we can drop btrfs_raid_type_names. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
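Taken together with the bg_flag and mindev_error additions above, the table ends up with roughly the following shape (a hedged sketch; only a subset of the members and a single entry are shown, and the exact layout may differ):

    struct btrfs_raid_attr {
        int devs_min;            /* minimum number of devices */
        int mindev_error;        /* error when devs_min can't be met on delete */
        u64 bg_flag;             /* block group flag bit of this profile */
        const char *raid_name;   /* human readable profile name */
    };

    static const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
        [BTRFS_RAID_RAID1] = {
            .devs_min     = 2,
            .mindev_error = BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET,
            .bg_flag      = BTRFS_BLOCK_GROUP_RAID1,
            .raid_name    = "raid1",
        },
        /* ... one entry per BTRFS_RAID_* type ... */
    };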
-
Qu Wenruo authored
It's pretty handy if we can get debug output for the locking status of an extent buffer, especially for race condition related debugging. So add the following output to btrfs_print_tree() and btrfs_print_leaf():
- refs
- write_locks (as w:%d)
- read_locks (as r:%d)
- blocking_writers (as bw:%d)
- blocking_readers (as br:%d)
- spinning_writers (as sw:%d)
- spinning_readers (as sr:%d)
- lock_owner
- current->pid
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comment ] Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The helper is quite simple and I'd like to see the locking in the caller. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
While the spinlock does not cause problems, using the mutex is more correct and consistent with the others. The global status of balance is e.g. checked from btrfs_pause_balance or btrfs_cancel_balance with the mutex held. Resuming balance happens during mount or ro->rw remount. In the former case, no other user of the balance_ctl exists; in the latter, balance cannot run until the ro/rw transition is finished. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The parameter controls locking of the stats part but we can lock it unconditionally, as this only happens once when balance starts. This is not performance critical. Add the prefix for an exported function. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Balance cannot be started on a read-only filesystem and will have to finish/exit before e.g. going to read-only via remount. In case the filesystem is forcibly set to read-only after an error, balance will finish anyway, and if the cancel call is too fast it will just wait for that to happen. The last case is when balance is paused after mount but the filesystem is read-only, and cancelling would want to delete the balance item. The test is moved after the check whether balance is running at all, as it looks more logical to report "no balance running" instead of "read-only filesystem". Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
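An excerpt-style sketch of the reordered checks in the cancel path (simplified; the error codes follow common conventions and are assumptions here): report that no balance is running before complaining about the read-only state.

    mutex_lock(&fs_info->balance_mutex);
    if (!fs_info->balance_ctl) {
        ret = -ENOTCONN;               /* "no balance running" is reported first */
        goto out_unlock;
    }
    if (sb_rdonly(fs_info->sb)) {      /* the read-only check comes only after that */
        ret = -EROFS;
        goto out_unlock;
    }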
-
David Sterba authored
Currently fs_info::balance_running is 0 or 1 and does not use the semantics of atomics. The pause and cancel operations check for 0, which can happen only after __btrfs_balance exits for whatever reason. Parallel calls to the balance ioctl may enter btrfs_ioctl_balance multiple times but will block on the balance_mutex that protects the fs_info::flags bit. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
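A hedged sketch of what the bit-based tracking can look like in btrfs_balance (the bit name is an assumption here, not confirmed by the text above): the bit is flipped under balance_mutex, and __btrfs_balance runs with the mutex dropped.

    set_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags);
    mutex_unlock(&fs_info->balance_mutex);

    ret = __btrfs_balance(fs_info);

    mutex_lock(&fs_info->balance_mutex);
    clear_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags);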
-
David Sterba authored
Mutual exclusion of device add/rm and balance was done by the volume mutex up to version 3.7. The commit 5ac00add ("Btrfs: disallow mutually exclusive admin operations from user mode") added a bit that essentially tracked the same information. The status bit has an advantage over a mutex in that it can be set without the restrictions of function context, so it started to be used in the mount-time resuming of balance or device replace. But we don't really need to track the same information in two ways.
1) After the previous cleanups, the main ioctl handlers for add/del/resize copy the EXCL_OP bit next to the volume mutex; here it's clearly safe.
2) Resuming balance during mount or after rw remount sets only the EXCL_OP bit, and the volume_mutex is held in the kernel thread that calls btrfs_balance.
3) Resuming device replace during mount or after rw remount is done after balance and is excluded by the EXCL_OP bit. It does not take the volume_mutex at all and completely relies on the EXCL_OP bit.
4) The resuming of balance and dev-replace cannot happen at the same time as the ioctls cannot be started in parallel. Nevertheless, a crafted image could trigger that, and a warning is printed.
5) Balance is normally excluded by EXCL_OP and also uses its own mutex to protect against concurrent access to its status data. There's some trickery to maintain the right lock nesting in case we need to reexamine the status in btrfs_ioctl_balance. The volume_mutex is removed and the unlock/lock sequence is left in place as we might expect other waiters to proceed.
6) Similar to 5, the unlock/lock sequence is kept in btrfs_cancel_balance to allow waiters to continue.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
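A hedged, excerpt-style sketch of the bit-based exclusion in an ioctl entry point (simplified; the actual work is elided): the status bit, not the volume mutex, is what serializes the mutually exclusive admin operations.

    if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
        return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;  /* another exclusive op runs */

    /* ... the actual add/remove/resize work goes here ... */

    clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);     /* release the exclusive status */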
-
David Sterba authored
The volume mutex does not protect against anything in this case; the comment about scrub is right but not related to locking and looks confusing. The comment in btrfs_find_device_missing_or_by_path is wrong and confusing too. The device_list_mutex is not held here to protect device lookup, but in this case device replace cannot run in parallel with device removal (due to exclusive op protection), so we don't need further locking here. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The name of __cancel_balance is easily confused with the balance cancel operation, while the function really resets the balance state back to zero. The unset_balance_control helper is called only from one place and is simple enough to be inlined. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Replace a WARN_ON with a proper check and message in case something goes really wrong and resumed balance cannot set up its exclusive status. The check is a user friendly assertion; I don't expect it to ever trigger under normal circumstances. Also document that the paused balance starts here and owns the exclusive op status. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
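A sketch of the friendlier check (the message text is illustrative): instead of a bare WARN_ON, print a warning that tells the user what happened and how to recover.

    /* resumed balance owns the exclusive op status from this point on */
    if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
        btrfs_warn(fs_info,
            "cannot set exclusive op status to balance, resume manually");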
-
David Sterba authored
The device replace is paused by unmount or read-only remount, and resumed on the next mount or read-write remount. The exclusive status should be checked properly as it's a global invariant and we must not allow 2 operations to run at the same time. In this case, balance can also be paused and resumed under the same conditions. It's always checked first, so dev-replace could see the EXCL_OP already taken; BUT, the ioctls would never let both start at the same time. Replace the WARN_ON with a message and return 0, indicating no error, as this is purely theoretical and the user will be informed. Resolving that manually should be possible by waiting for the other operation to finish or cancelling the paused state. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-