Commits · 8d47a0d8f7947422dd359ac8e462687f81a7a137 · Kirill Smelkov / linux

29 Apr, 2019 40 commits

btrfs: Do mandatory tree block check before submitting bio · 8d47a0d8

Qu Wenruo authored Apr 04, 2019

There are at least 2 reports about a memory bit flip sneaking into
on-disk data.

Currently we only have a relaxed check triggered at
btrfs_mark_buffer_dirty() time, as it's not mandatory and only for
CONFIG_BTRFS_FS_CHECK_INTEGRITY enabled build, it doesn't help users to
detect such problem.

This patch will address the hole by triggering comprehensive check on
tree blocks before writing it back to disk.

The design points are:

- Timing of the check: Tree block write hook
  This timing is chosen to reduce the overhead.
  The comprehensive check should be as expensive as a checksum
  calculation.
  Doing full check at btrfs_mark_buffer_dirty() is too expensive for end
  user.

- Loose empty leaf check
  Originally for an empty leaf, tree-checker will report error if it's
  not a tree root.

  The problem for such check at write time is:
  * False alert for tree root created in current transaction
    In that case, the commit root still needs to be written to disk.
    And since current root can differ from commit root, then it will
    cause false alert.
    This happens for log tree.

  * False alert for relocated tree block
    Relocated tree block can be written to disk due to memory pressure,
    in that case an empty csum tree root can be written to disk and
    cause false alert, since csum root node hasn't been updated.

  Previous patch of removing comprehensive empty leaf owner check has
  paved the way for this patch.

The example error output will be something like:

  BTRFS critical (device dm-3): corrupt leaf: root=2 block=1350630375424 slot=68, bad key order, prev (10510212874240 169 0) current (1714119868416 169 0)
  BTRFS error (device dm-3): block=1350630375424 write time tree block corruption detected
  BTRFS: error (device dm-3) in btrfs_commit_transaction:2220: errno=-5 IO failure (Error while writing out transaction)
  BTRFS info (device dm-3): forced readonly
  BTRFS warning (device dm-3): Skipping commit of aborted transaction.
  BTRFS: error (device dm-3) in cleanup_transaction:1839: errno=-5 IO failure
  BTRFS info (device dm-3): delayed_refs has NO entry
Reported-by: Leonard Lausen <leonard@lausen.nl>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8d47a0d8

btrfs: tree-checker: Remove comprehensive root owner check · ff2ac107

Qu Wenruo authored Apr 04, 2019

Commit 1ba98d08 ("Btrfs: detect corruption when non-root leaf has
zero item") introduced comprehensive root owner checker.

However it's pretty expensive tree search to locate the owner root,
especially when it get reused by mandatory read and write time
tree-checker.

This patch will remove that check, and completely rely on owner based
empty leaf check, which is much faster and still works fine for most
case.

And since we skip the old root owner check, now write time tree check
can be merged with btrfs_check_leaf_full().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ff2ac107

Btrfs: fix data bytes_may_use underflow with fallocate due to failed quota reserve · 39ad3173

Robbie Ko authored Mar 26, 2019

When doing fallocate, we first add the range to the reserve_list and
then reserve the quota.  If quota reservation fails, we'll release all
reserved parts of reserve_list.

However, cur_offset is not updated to indicate that this range is
already been inserted into the list.  Therefore, the same range is freed
twice.  Once at list_for_each_entry loop, and once at the end of the
function.  This will result in WARN_ON on bytes_may_use when we free the
remaining space.

At the end, under the 'out' label we have a call to:

   btrfs_free_reserved_data_space(inode, data_reserved, alloc_start, alloc_end - cur_offset);

The start offset, third argument, should be cur_offset.

Everything from alloc_start to cur_offset was freed by the
list_for_each_entry_safe_loop.

Fixes: 18513091 ("btrfs: update btrfs_space_info's bytes_may_use timely")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>

39ad3173

btrfs: get fs_info from eb in read_one_dev · 17850759

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

17850759

btrfs: get fs_info from eb in read_one_chunk · 9690ac09

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

9690ac09

btrfs: get fs_info from eb in btrfs_check_chunk_valid · ddaf1d5a

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

ddaf1d5a

btrfs: get fs_info from eb in should_balance_chunk · 6ec0896c

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

6ec0896c

btrfs: get fs_info from eb in btrfs_check_node · 813fd1dc

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

813fd1dc

btrfs: get fs_info from eb in btrfs_check_leaf_relaxed · cfdaad5e

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

cfdaad5e

btrfs: get fs_info from eb in btrfs_check_leaf_full · 1c4360ee

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

1c4360ee

btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit · 929be17a

Nikolay Borisov authored Mar 27, 2019

Instead of always calling the allocator to search for a free extent,
that satisfies the input criteria, switch btrfs_trim_free_extents to
using find_first_clear_extent_bit. With this change it's no longer
necessary to read the device tree in order to figure out holes in
the devices.

Now the code always searches in-memory data structure to figure out the
space range which contains the requested which should result in speed
improvements.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

929be17a

btrfs: Implement find_first_clear_extent_bit · 45bfcfc1

Nikolay Borisov authored Mar 27, 2019

This function is very similar to find_first_extent_bit except that it
locates the first contiguous span of space which does not have bits set.
It's intended use is in the freespace trimming code.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

45bfcfc1

btrfs: Optimize unallocated chunks discard · 8811133d

Nikolay Borisov authored Mar 27, 2019

Currently unallocated chunks are always trimmed. For example
2 consecutive trims on large storage would trim freespace twice
irrespective of whether the space was actually allocated or not between
those trims.

Optimise this behavior by exploiting the newly introduced alloc_state
tree of btrfs_device. A new CHUNK_TRIMMED bit is used to mark
those unallocated chunks which have been trimmed and have not been
allocated afterwards. On chunk allocation the respective underlying devices'
physical space will have its CHUNK_TRIMMED flag cleared. This avoids
submitting discards for space which hasn't been changed since the last
time discard was issued.

This applies to the single mount period of the filesystem as the
information is not stored permanently.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8811133d

btrfs: Factor out in_range macro · e74e3993

Nikolay Borisov authored Mar 27, 2019

This is used in more than one places so let's factor it out in ctree.h.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e74e3993

btrfs: Remove 'trans' argument from find_free_dev_extent(_start) · 60dfdf25

Nikolay Borisov authored Mar 27, 2019

Now that these functions no longer require a handle to transaction to
inspect pending/pinned chunks the argument can be removed. At the same
time also remove any surrounding code which acquired the handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

60dfdf25

btrfs: replace pending/pinned chunks lists with io tree · 1c11b63e

Jeff Mahoney authored Mar 27, 2019

The pending chunks list contains chunks that are allocated in the
current transaction but haven't been created yet. The pinned chunks
list contains chunks that are being released in the current transaction.
Both describe chunks that are not reflected on disk as in use but are
unavailable just the same.

The pending chunks list is anchored by the transaction handle, which
means that we need to hold a reference to a transaction when working
with the list.

The way we use them is by iterating over both lists to perform
comparisons on the stripes they describe for each device. This is
backwards and requires that we keep a transaction handle open while
we're trimming.

This patchset adds an extent_io_tree to btrfs_device that maintains
the allocation state of the device.  Extents are set dirty when
chunks are first allocated -- when the extent maps are added to the
mapping tree. They're cleared when last removed -- when the extent
maps are removed from the mapping tree. This matches the lifespan
of the pending and pinned chunks list and allows us to do trims
on unallocated space safely without pinning the transaction for what
may be a lengthy operation. We can also use this io tree to mark
which chunks have already been trimmed so we don't repeat the operation.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

1c11b63e

btrfs: Transpose btrfs_close_devices/btrfs_mapping_tree_free in close_ctree · 68c94e55

Nikolay Borisov authored Feb 12, 2019

Following the introduction of the alloc_state tree, some of the callees
of btrfs_mapping_tree_free will have to interact with the btrfs_device
of the constituent devices. Enable this by moving the code responsible
for freeing devices after the last user (btrfs_mapping_tree_free).
Otherwise the kernel could crash due to use-after-free.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

68c94e55

btrfs: Stop using call_rcu for device freeing · 8e75fd89

Nikolay Borisov authored Mar 27, 2019

btrfs_device structs are freed from RCU context since device iteration
is protected by RCU. Currently this is achieved by using call_rcu since
no blocking functions are called within btrfs_free_device. Future
refactoring of pending/pinned chunks will require calling sleeping
functions.

This patch is in preparation for these changes by simply switching from
RCU callbacks to explicit calls of synchronize_rcu and calling
btrfs_free_device directly. This is functionally equivalent, making sure
that there are no readers at that time.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8e75fd89

btrfs: Implement set_extent_bits_nowait · 4ca73656

Nikolay Borisov authored Mar 27, 2019

It will be used in a future patch that will require modifying an
extent_io_tree struct under a spinlock.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

4ca73656

btrfs: Introduce new bits for device allocation tree · 930b0907

Nikolay Borisov authored Mar 25, 2019

Rather than hijacking the existing defines let's just define new bits,
with more descriptive names. Instead of using yet more (currently at 18)
bits for the new flags, use the fact those flags will be specific to
the device allocation tree so define them using existing EXTENT_* flags.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

930b0907

btrfs: Populate ->orig_block_len during read_one_chunk · 39e264a4

Nikolay Borisov authored Mar 25, 2019

Chunks read from disk currently don't get their ->orig_block_len member
set, in contrast when a new chunk is allocated, the respective
extent_map's ->orig_block_len is assigned the size of the stripe of this
chunk.

Let's apply the same strategy for chunks which are read from
disk, not only does this codify the invariant that ->orig_block_len
always contains the size of the stripe for a chunk (when the em belongs
to the mapping tree). But it's also a preparatory patch for further work
around tracking chunk allocation in an extent tree rather than
pinned/pending lists.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

39e264a4

btrfs: Rename and export clear_btree_io_tree · 41e7acd3

Nikolay Borisov authored Mar 25, 2019

This function is going to be used to clear out the device extent
allocation information. Give it a more generic name and export it. This
is in preparation to replacing the pending/pinned chunk lists with an
extent tree. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

41e7acd3

btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink · 61d0d0d2

Nikolay Borisov authored Mar 25, 2019

During device shrink pinned/pending chunks (i.e. those which have been
deleted/created respectively, in the current transaction and haven't
touched disk) need to be accounted when doing device shrink. Presently
this happens after the main relocation loop in btrfs_shrink_device,
which could lead to making another go in the body of the function.

Since there is no hard requirement to perform pinned/pending chunks
handling after the relocation loop, move the code before it. This leads
to simplifying the code flow around - i.e. no need to use 'goto again'.

A notable side effect of this change is that modification of the
device's size requires a transaction to be started and committed before
the relocation loop starts. This is necessary to ensure that relocation
process sees the shrunk device size.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

61d0d0d2

btrfs: combine device update operations during transaction commit · bbbf7243

Nikolay Borisov authored Mar 25, 2019

We currently overload the pending_chunks list to handle updating
btrfs_device->commit_bytes used.  We don't actually care about the
extent mapping or even the device mapping for the chunk - we just need
the device, and we can end up processing it multiple times.  The
fs_devices->resized_list does more or less the same thing, but with the
disk size.  They are called consecutively during commit and have more or
less the same purpose.

We can combine the two lists into a single list that attaches to the
transaction and contains a list of devices that need updating.  Since we
always add the device to a list when we change bytes_used or
disk_total_size, there's no harm in copying both values at once.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

bbbf7243

btrfs: Honour FITRIM range constraints during free space trim · c2d1b3aa

Nikolay Borisov authored Mar 25, 2019

Up until now trimming the freespace was done irrespective of what the
arguments of the FITRIM ioctl were. For example fstrim's -o/-l arguments
will be entirely ignored. Fix it by correctly handling those paramter.
This requires breaking if the found freespace extent is after the end of
the passed range as well as completing trim after trimming
fstrim_range::len bytes.

Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c2d1b3aa

Btrfs: send, improve clone range · 040ee612

Robbie Ko authored Mar 29, 2019

Improve clone_range in two scenarios.

1. Remove the limit of inode size when find clone inodes We can do
   partial clone, so there is no need to limit the size of the candidate
   inode.  When clone a range, we clone the legal range only by bytenr,
   offset, len, inode size.

2. In the scenarios of rewrite or clone_range, data_offset rarely
   matches exactly, so the chance of a clone is missed.

e.g.
    1. Write a 1M file
        dd if=/dev/zero of=1M bs=1M count=1

    2. Clone 1M file
       cp --reflink 1M clone

    3. Rewrite 4k on the clone file
       dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc

    The disk layout is as follows:
    item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53
	extent data disk byte 1103101952 nr 1048576
	extent data offset 0 nr 1048576 ram 1048576
	extent compression(none)
    ...
    item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53
	extent data disk byte 1104150528 nr 4096
	extent data offset 0 nr 4096 ram 4096
	extent compression(none)
    item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53
	extent data disk byte 1103101952 nr 1048576
	extent data offset 4096 nr 1044480 ram 1048576
	extent compression(none)

When send, inode 258 file offset 4096~1048576 (item 23) has a chance to
clone_range, but because data_offset does not match inode 257 (item 16),
it causes missed clone and can only transfer actual data.

Improve the problem by judging whether the current data_offset has
overlap with the file extent item, and if so, adjusting offset and
extent_len so that we can clone correctly.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>

040ee612

btrfs: prop: open code btrfs_set_prop in inherit_prop · 8b4d1efc

Anand Jain authored Apr 02, 2019

When an inode inherits property from its parent, we call btrfs_set_prop().
btrfs_set_prop() does an elaborate checks, which is not required in the
context of inheriting a property. Instead just open-code only the required
items from btrfs_set_prop() and then call btrfs_setxattr() directly. So
now the only user of btrfs_set_prop() is gone, (except for the wraper
function btrfs_set_prop_trans()).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8b4d1efc

btrfs: drop unused parameter in mount_subvol · ae0bc863

Anand Jain authored Mar 29, 2019

@device_name in mount_subvol() is not used, drop it.  Also see:
5bedc48a ("btrfs: drop unused parameters from mount_subvol").
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ae0bc863

btrfs: tree-checker: get fs_info from eb in check_inode_item · 39e57f49

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

39e57f49

btrfs: tree-checker: get fs_info from eb in check_dev_item · 412a2312

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

412a2312

btrfs: tree-checker: get fs_info from eb in dev_item_err · 5617ed80

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

5617ed80

btrfs: tree-checker: get fs_info from eb in chunk_err · d001e4a3

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

d001e4a3

btrfs: tree-checker: get fs_info from eb in check_leaf · e2ccd361

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

e2ccd361

btrfs: tree-checker: get fs_info from eb in check_leaf_item · 0076bc89

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

0076bc89

btrfs: tree-checker: get fs_info from eb in check_extent_data_item · ae2a19d8

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

ae2a19d8

btrfs: tree-checker: get fs_info from eb in check_block_group_item · af60ce2b

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

af60ce2b

btrfs: tree-checker: get fs_info from eb in block_group_err · 4806bd88

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

4806bd88

btrfs: tree-checker: get fs_info from eb in check_dir_item · ce4252c0

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

ce4252c0

btrfs: tree-checker: get fs_info from eb in dir_item_err · d98ced68

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

d98ced68

btrfs: tree-checker: get fs_info from eb in check_csum_item · 68128ce7

David Sterba authored Mar 20, 2019

We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>

68128ce7