- 12 Oct, 2023 40 commits
-
-
Josef Bacik authored
My overcommit patch exposed a bug with btrfs/177 [1]. The problem here is that when we grow the device we're not adding to ->free_chunk_space, so subsequent allocations can cause ->free_chunk_space to wrap, which causes problems in can_overcommit because we add this to ->total_bytes, which causes the counter to wrap and gives us an unexpected ENOSPC. Fix this by properly updating ->free_chunk_space with the new available space in btrfs_grow_device. [1] First version of the fix: https://lore.kernel.org/linux-btrfs/b97e47ce0ce1d41d221878de7d6090b90aa7a597.1695065233.git.josef@toxicpanda.com/Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
There are two bugs in how we adjust ->free_chunk_space in btrfs_shrink_device. First we're removing the entire diff between new_size and old_size from ->free_chunk_space. This only works if we're reducing the free area, which we could potentially not be. So adjust the math to only subtract the diff in the free space from ->free_chunk_space. Additionally in the error case we're unconditionally adding the diff back into ->free_chunk_space, which we need to only do if this device is writeable. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Currently, at find_first_extent_bit(), when we are given a cached extent state that happens to have its end offset match the desired range start, we find the next extent state using that cached state, with next_state() calls, and then return it. We then try to cache that next state by calling cache_state_if_flags(), but that will not cache the state because we haven't reset *cached_state to NULL, so we end up with the cached_state unchanged, and if the caller is iterating over extent states in the io tree, its next call to find_first_extent_bit() will not use the current cached state as its end offset does not match the minimum start range offset, therefore the cached state is reset and we have to search the rbtree to find the next suitable extent state record. So fix this by resetting the cached state to NULL (and dropping our ref on it) when we have a suitable cached state and we found a next state by using next_state() starting from the cached state. This makes use cases of calling find_first_extent_bit() to go over all ranges in the io tree to do a single rbtree full search, only on the first call, and the next calls will just do next_state() (rb_next() wrapper) calls, which is more efficient. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When freeing a log tree, during a transaction commit, we clear its dirty log pages io tree by calling clear_extent_bits() using a range from 0 to (u64)-1. This will iterate the io tree's rbtree and call rb_erase() on each node before freeing it, which will often trigger rebalance operations on the rbtree. A better alternative it to use extent_io_tree_release(), which will not do deletions and trigger rebalances. So use extent_io_tree_release() instead of clear_extent_bits(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Currently extent_io_tree_release() is a loop that keeps getting the first node in the io tree, using rb_first() which is a loop that gets to the leftmost node of the rbtree, and then for each node it calls rb_erase(), which often requires rebalancing the rbtree. We can make this more efficient by using rbtree_postorder_for_each_entry_safe() to free each node without having to delete it from the rbtree and without looping to get the first node. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The wait_on_state() function is very short and has a single caller, which is wait_extent_bit(), so remove the function and put its code into the caller. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The memory barrier at extent_io_tree_release() is redundant. Holding spin_lock here is not enough to drop the barrier completely. We only change the waitqueue of an extent state record while holding the tree lock - see wait_on_state(). The update to waitqueue state will not become stale because there will be an spin_unlock/spin_lock sequence between the change and waiting, this implies a full memory barrier. So remove the explicit smp_mb() barrier. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reword reasoning ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The function wait_extent_bit() is not used outside extent-io-tree.c so make it static. Furthermore the function doesn't have the 'btrfs_' prefix. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
There's this comment at extent_io_tree_release() that mentions io btrees, but this function is no longer used only for io btrees. Originally it was added as a static function named clear_btree_io_tree() at transaction.c, in commit 663dfbb0 ("Btrfs: deal with convert_extent_bit errors to avoid fs corruption"), as it was used only for cleaning one of the io trees that track dirty extent buffers, the dirty_log_pages io tree of a a root and the dirty_pages io tree of a transaction. Later it was renamed and exported and now it's used to cleanup other io trees such as the allocation state io tree of a device or the csums range io tree of a log root. So remove that comment and replace it with one at the top of the function that is more complete, mentioning what the function does and that it's expected to be called only when a task is sure no one else will need to use the tree anymore, as well as there should be no locked ranges in the tree and therefore no waiters on its extent state records. Also add an assertion to check that there are no locked extent state records in the tree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When inserting a new extent state record into an io tree that happens to be mergeable, we currently do the following: 1) Insert the extent state record in the io tree's rbtree. This requires going down the tree to find where to insert it, and during the insertion we often need to balance the rbtree; 2) We then check if the previous node is mergeable, so we call rb_prev() to find it, which requires some looping to find the previous node; 3) If the previous node is mergeable, we adjust our node to include the range of the previous node and then delete the previous node from the rbtree, which again may need to balance the rbtree; 4) Then we check if the next node is mergeable with the node we inserted, so we call rb_next(), which requires some looping too. If the next node is indeed mergeable, we expand the range of our node to include the next node's range and then delete the next node from the rbtree, which again may need to balance the tree. So these are quite of lot of iterations and looping over the rbtree, and some of the operations may need to rebalance the rb tree. This can be made a bit more efficient by: 1) When iterating the rbtree, once we find a node that is mergeable with the node we want to insert, we can just adjust that node's range with the range of the node to insert - this avoids continuing iterating over the tree and deleting a node from the rbtree; 2) If we expand the range of a mergeable node, then we find the next or the previous node, depending on other we merged a range to the right or to the left of the node we are currently at during the iteration. This merging is as before, we find the next or previous node with rb_next() or rb_prev() and if that other node is mergeable with the current one, we adjust the range of the current node and remove the other node from the rbtree; 3) Whenever we need to insert the new extent state record it's because we don't have any extent state record in the rbtree which can be merged, so we can remove the call to merge_state() after the insertion, saving rb_next() and rb_prev() calls, which require some looping. So update the insertion function insert_state() to have this behaviour. Running dbench for 120 seconds and capturing the execution times of set_extent_bit() at pin_down_extent(), resulted in the following data (time values are in nanoseconds): Before this change: Count: 2278299 Range: 0.000 - 4003728.000; Mean: 713.436; Median: 612.000; Stddev: 3606.952 Percentiles: 90th: 1187.000; 95th: 1350.000; 99th: 1724.000 0.000 - 7.534: 5 | 7.534 - 35.418: 36 | 35.418 - 154.403: 273 | 154.403 - 662.138: 1244016 ##################################################### 662.138 - 2828.745: 1031335 ############################################ 2828.745 - 12074.102: 1395 | 12074.102 - 51525.930: 806 | 51525.930 - 219874.955: 162 | 219874.955 - 938254.688: 22 | 938254.688 - 4003728.000: 3 | After this change: Count: 2275862 Range: 0.000 - 1605175.000; Mean: 678.903; Median: 590.000; Stddev: 2149.785 Percentiles: 90th: 1105.000; 95th: 1245.000; 99th: 1590.000 0.000 - 10.219: 10 | 10.219 - 40.957: 36 | 40.957 - 155.907: 262 | 155.907 - 585.789: 1127214 #################################################### 585.789 - 2193.431: 1145134 ##################################################### 2193.431 - 8205.578: 1648 | 8205.578 - 30689.378: 1039 | 30689.378 - 114772.699: 362 | 114772.699 - 429221.537: 52 | 429221.537 - 1605175.000: 10 | Maximum duration (range), average duration, percentiles and standard deviation are all better. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The semantics of test_range_bit() with filled == 0 is now in it's own helper so test_range_bit will check the whole range unconditionally. The detection logic is flipped and assumes success by default and catches exceptions. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The existing helper test_range_bit works in two ways, checks if the whole range contains all the bits, or stop on the first occurrence. By adding a specific helper for the latter case, the inner loop can be simplified and contains fewer conditionals, making it a bit faster. There's no caller that uses the cached state pointer so this reduces the argument count further. Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
btrfs_realloc_node() is only used by the defrag code. Nowadays we have a defrag.c file, so move it, and its helper close_blocks(), into defrag.c. During the move also do a few minor cosmetic changes: 1) Change the return value of close_blocks() from int to bool; 2) Use SZ_32K instead of 32768 at close_blocks(); 3) Make some variables const in btrfs_realloc_node(), 'blocksize' and 'end_slot'; 4) Get rid of 'parent_nritems' variable, in both places where it was used it could be replaced by calling btrfs_header_nritems(parent); 5) Change the type of a couple variables from int to bool; 6) Rename variable 'err' to 'ret', as that's the most common name we use to track the return value of a function; 7) Move some variables from the top scope to the scope of the for loop where they are used. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Export comp_keys() out of ctree.c, as btrfs_comp_keys(), so that in a later patch we can move out defrag specific code from ctree.c into defrag.c. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Rename and export __btrfs_cow_block() as btrfs_force_cow_block(). This is to allow to move defrag specific code out of ctree.c and into defrag.c in one of the next patches. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At btrfs_cow_block() we can use round_down() to align the extent buffer's logical offset to the start offset of a metadata block group, instead of the less easy to read set of bitwise operations (two plus one subtraction). So replace the bitwise operations with a round_down() call. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
It's pointless to have the noiline attribute for btrfs_cow_block(), as the function is exported and widely used. So remove it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Previous commit ("btrfs: reject devices with CHANGING_FSID_V2") has stopped the assembly of devices with the CHANGING_FSID_V2 flag in the kernel. Such devices can be scanned but will not be registered and can't be mounted without a manual fix by btrfstune. Remove the related logic and now unused code. The original motivation was to allow an interrupted partial conversion fix itself on next mount, in case the system has to be rebooted. This is a convenience but brings a lot of complexity the device scanning and handling the partial states. It's hard to estimate if this was ever needed in practice, expecting the typical use case like a manual conversion of an unmounted filesystem where the user can verify the success and rerun it eventually. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add historical context ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
The BTRFS_SUPER_FLAG_CHANGING_FSID_V2 flag indicates a transient state where the device in the userspace btrfstune -m|-M operation failed to complete changing the fsid. This flag makes the kernel to automatically determine the other partner devices to which a given device can be associated, based on the fsid, metadata_uuid and generation values. btrfstune -m|M feature is especially useful in virtual cloud setups, where compute instances (disk images) are quickly copied, fsid changed, and launched. Given numerous disk images with the same metadata_uuid but different fsid, there's no clear way a device can be correctly assembled with the proper partners when the CHANGING_FSID_V2 flag is set. So, the disk could be assembled incorrectly, as in the example below: Before this patch: Consider the following two filesystems: /dev/loop[2-3] are raw copies of /dev/loop[0-1] and the btrsftune -m operation fails. In this scenario, as the /dev/loop0's fsid change is interrupted, and the CHANGING_FSID_V2 flag is set as shown below. $ p="device|devid|^metadata_uuid|^fsid|^incom|^generation|^flags" $ btrfs inspect dump-super /dev/loop0 | egrep '$p' superblock: bytenr=65536, device=/dev/loop0 flags 0x1000000001 fsid 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45 metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b generation 9 num_devices 2 incompat_flags 0x741 dev_item.devid 1 $ btrfs inspect dump-super /dev/loop1 | egrep '$p' superblock: bytenr=65536, device=/dev/loop1 flags 0x1 fsid 11d2af4d-1b71-45a9-83f6-f2100766939d metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b generation 10 num_devices 2 incompat_flags 0x741 dev_item.devid 2 $ btrfs inspect dump-super /dev/loop2 | egrep '$p' superblock: bytenr=65536, device=/dev/loop2 flags 0x1 fsid 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45 metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b generation 8 num_devices 2 incompat_flags 0x741 dev_item.devid 1 $ btrfs inspect dump-super /dev/loop3 | egrep '$p' superblock: bytenr=65536, device=/dev/loop3 flags 0x1 fsid 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45 metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b generation 8 num_devices 2 incompat_flags 0x741 dev_item.devid 2 It is normal that some devices aren't instantly discovered during system boot or iSCSI discovery. The controlled scan below demonstrates this. $ btrfs device scan --forget $ btrfs device scan /dev/loop0 Scanning for btrfs filesystems on '/dev/loop0' $ mount /dev/loop3 /btrfs $ btrfs filesystem show -m Label: none uuid: 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45 Total devices 2 FS bytes used 144.00KiB devid 1 size 300.00MiB used 48.00MiB path /dev/loop0 devid 2 size 300.00MiB used 40.00MiB path /dev/loop3 /dev/loop0 and /dev/loop3 are incorrectly partnered. This kernel patch removes functions and code connected to the CHANGING_FSID_V2 flag. With this patch, now devices with the CHANGING_FSID_V2 flag are rejected. And its partner will fail to mount with the extra -o degraded option. The check is removed from open_ctree(), devices are rejected during scanning which in turn fails the mount. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Lots of the functions in relocation.c don't change pointer parameters but lack the annotations. Add them and reformat according to current coding style if needed. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
btrfs_should_ignore_reloc_root() is a predicate so it should return bool. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The btrfs_backref_cache::is_reloc is an indicator variable and should use a bool type. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There's only one user of mapping_tree_init, we don't need a helper for the simple initialization. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Use bool types for the indicators instead of bitfields. The structure size slightly grows but the new types are placed within the padding. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Add an enum type for data relocation stages. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We don't need to use bitfields for tree_block::level and tree_block::key_ready, there's enough padding in the structure for proper types. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The btrfs_defrag_root() function does not really belong in the transaction.{c,h} module and as we have a defrag.{c,h} nowadays, move it to there instead. This also allows to stop exporting btrfs_defrag_leaves(), so we can make it static. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename info to fs_info for consistency ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The root argument for fixup_inode_link_count() always matches the root of the given inode, so remove the root argument and get it from the inode argument. This also applies to the helpers count_inode_extrefs() and count_inode_refs() used by fixup_inode_link_count() - they don't need the root argument, as it always matches the root of the inode passed to them. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The root argument for maybe_insert_hole() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The root argument for btrfs_delayed_update_inode() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The root argument for btrfs_update_inode_item() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The root argument for btrfs_update_inode() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The root argument for btrfs_update_inode_fallback() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The noinline attribute of btrfs_update_inode() is pointless as the function is exported and widely used, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The following condition at btrfs_dirty_inode() is redundant: if (ret && (ret == -ENOSPC || ret == -EDQUOT)) The first check for a non-zero 'ret' value is pointless, we can simplify this to simply: if (ret == -ENOSPC || ret == -EDQUOT) Not only this makes it easier to read, it also slightly reduces the text size of the btrfs kernel module: $ size fs/btrfs/btrfs.ko.before text data bss dec hex filename 1641400 168265 16864 1826529 1bdee1 fs/btrfs/btrfs.ko.before $ size fs/btrfs/btrfs.ko.after text data bss dec hex filename 1641224 168181 16864 1826269 1bdddd fs/btrfs/btrfs.ko.after Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Boris Burkov authored
In open_ctree, we set BTRFS_FS_QUOTA_ENABLED as soon as we see a quota_root, as opposed to after we are done setting up the qgroup structures. In the quota_enable path, we wait until after the structures are set up. Likewise, in disable, we clear the bit before tearing down the structures. I feel that this organization is less surprising for the open_ctree path. I don't believe this fixes any actual bug, but avoids potential confusion when using btrfs_qgroup_mode in an intermediate state where we are enabled but haven't yet setup the qgroup status flags. It also avoids any risk of calling a qgroup function and attempting to use the qgroup rbtrees before they exist/are setup. This all occurs before we do rw setup, so I believe it should be mostly a no-op. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
-
Boris Burkov authored
Relocation data allocations are quite tricky for simple quotas. The basic data relocation sequence is (ignoring details that aren't relevant to this fix): - create a fake relocation data fs root - create a fake relocation inode in that root - for each data extent: - preallocate a data extent on behalf of the fake inode - copy over the data - for each extent - swap the refs so that the original file extent now refers to the new extent item - drop the fake root, dropping its refs on the old extents, which lets us delete them. Done naively, this results in storing an extent item in the extent tree whose owner_ref points at the relocation data root and a no-op squota recording, since the reloc root is not a legit fstree. So far, that's OK. The problem comes when you do the swap, and leave an extent item owned by this bogus root as the real permanent extents of the file. If the file then drops that ref, we free it and no-op account that against the fake relocation root. Essentially, this means that relocation is simple quota "extent laundering", since we re-own the extents into a fake root. Simple quotas very intentionally doesn't have a mechanism for transferring ownership of extents, as that is exactly the complicated thing we are trying to avoid with the new design. Further, it cannot be correctly done in this case, since at the time you create the new "real" refs, there is no way to know which was the original owner before relocation unless we track it. Therefore, it makes more sense to trick the preallocation to handle relocation as a special case and note the proper owner ref from the beginning. That way, we never write out an extent item without the correct owner ref that it will eventually have. This could be done by wiring a special root parameter all the way through the allocation code path, but to avoid that special case touching all the code, take advantage of the serial nature of relocation to store the src root on the relocation root object. Then when we finish the prealloc, if it happens to be this case, prepare the delayed ref appropriately. We must also add logic to handle relocating adjacent extents with different owning roots. Those cannot be preallocated together in a cluster as it would lose the separate ownership information. This is obviously a smelly bit of code, but I think it is the best solution to the problem, given the relocation implementation. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
-
Boris Burkov authored
Relocation COWs metadata blocks in two cases for the reloc root: - copying the subvolume root item when creating the reloc root - copying a btree node when there is a COW during relocation In both cases, the resulting btree node hits an abnormal code path with respect to the owner field in its btrfs_header. It first creates the root item for the new objectid, which populates the reloc root id, and it at this point that delayed refs are created. Later, it fully copies the old node into the new node (including the original owner field) which overwrites it. This results in a simple quotas mismatch where we run the delayed ref for the reloc root which has no simple quota effect (reloc root is not an fstree) but when we ultimately delete the node, the owner is the real original fstree and we do free the space. To work around this without tampering with the behavior of relocation, add a parameter to btrfs_add_tree_block that lets the relocation code path specify a different owning root than the "operating" root (in this case, owning root is the real root and the operating root is the reloc root). These can naturally be plumbed into delayed refs that have the same concept. Note that this is a double count in some sense, but a relatively natural one, as there are really two extents, and the old one will be deleted soon. This is consistent with how data relocation extents are accounted by simple quotas. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
-
Boris Burkov authored
Simple quotas count extents only from the moment the feature is enabled. Therefore, if we do something like: 1. create subvol S 2. write F in S 3. enable quotas 4. remove F 5. write G in S then after 3. and 4. we would expect the simple quota usage of S to be 0 (putting aside some metadata extents that might be written) and after 5., it should be the size of G plus metadata. Therefore, we need to be able to determine whether a particular quota delta we are processing predates simple quota enablement. To do this, store the transaction id when quotas were enabled. In fs_info for immediate use and in the quota status item to make it recoverable on mount. When we see a delta, check if the generation of the extent item is less than that of quota enablement. If so, we should ignore the delta from this extent. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
-
Boris Burkov authored
Consider the following sequence: - enable quotas - create subvol S id 256 at dir outer/ - create a qgroup 1/100 - add 0/256 (S's auto qgroup) to 1/100 - create subvol T id 257 at dir outer/inner/ With full qgroups, there is no relationship between 0/257 and either of 0/256 or 1/100. There is an inherit feature that the creator of inner/ can use to specify it ought to be in 1/100. Simple quotas are targeted at container isolation, where such automatic inheritance for not necessarily trusted/controlled nested subvol creation would be quite helpful. Therefore, add a new default behavior for simple quotas: when you create a nested subvol, automatically inherit as parents any parents of the qgroup of the subvol the new inode is going in. In our example, 257/0 would also be under 1/100, allowing easy control of a total quota over an arbitrary hierarchy of subvolumes. I think this _might_ be a generally useful behavior, so it could be interesting to put it behind a new inheritance flag that simple quotas always use while traditional quotas let the user specify, but this is a minimally intrusive change to start. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
-