Commits · feb5cc398120ce09fd7c72d361b3d14d9e280b96 · Kirill Smelkov / linux

22 Oct, 2023 40 commits

bcachefs: trace_read_nopromote() · feb5cc39

Kent Overstreet authored Sep 11, 2023

Add a tracepoint to print the reason a read wasn't promoted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Log finsert/fcollapse operations · f3e374ef

Kent Overstreet authored Sep 10, 2023

Now that we have the logged operations btree, we can make
finsert/fcollapse atomic w.r.t. unclean shutdown as well.

This adds bch_logged_op_finsert to represent the state of an finsert or
fcollapse, which is a bit more complicated than truncate since we need
to track our position in the "shift extents" operation.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
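
As a rough illustration of why finsert/fcollapse needs more state than truncate, the sketch below models a resumable "shift extents" loop. The names (`logged_shift`, `shift_step`, `shift_resume`) are hypothetical and do not reflect the actual layout of bch_logged_op_finsert; the only point is that persisting the current position lets an interrupted operation continue where it stopped.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical persisted state for a "shift extents to the right" operation. */
struct logged_shift {
	size_t pos;	/* elements already moved; survives a crash */
	size_t count;	/* elements to move in total */
	size_t shift;	/* distance each element moves */
};

/* Move one element and advance the logged position; returns false when done. */
bool shift_step(int *buf, size_t base, struct logged_shift *op)
{
	size_t src;

	if (op->pos >= op->count)
		return false;

	/* Work from the end so unmoved elements are never overwritten. */
	src = base + op->count - 1 - op->pos;
	buf[src + op->shift] = buf[src];
	op->pos++;
	return true;
}

/* Run, or resume after a crash, from whatever position was logged. */
void shift_resume(int *buf, size_t base, struct logged_shift *op)
{
	while (shift_step(buf, base, op))
		;
}
```

In this model a crash between two steps loses nothing: calling `shift_resume()` with the saved `logged_shift` finishes the remaining moves.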


bcachefs: Log truncate operations · b030e262

Kent Overstreet authored Sep 10, 2023

Previously, we guaranteed atomicity of truncate after unclean shutdown
with the BCH_INODE_I_SIZE_DIRTY flag - which required a full scan of the
inodes btree.

Recently the deleted inodes btree was added so that we no longer have to
scan for deleted inodes, but truncate was unfinished and that change
left it broken.

This patch uses the new logged operations btree to fix truncate
atomicity: at the start of a truncate, we now log an operation that can
be replayed after an unclean shutdown.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: BTREE_ID_logged_ops · aaad530a

Kent Overstreet authored Aug 27, 2023

Add a new btree for long running logged operations - i.e. for logging
operations that we can't do within a single btree transaction, so that
they can be resumed if we crash.

Keys in the logged operations btree will represent operations in
progress, with the state of the operation stored in the value.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: New io_misc.c helpers · 5902cc28

Kent Overstreet authored Sep 04, 2023

This pulls the non-VFS-specific parts of truncate and finsert/fcollapse
out of fs-io.c, and moves them to io_misc.c.

This is prep work for logging these operations, to make them atomic in
the event of a crash.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Break up io.c · 1809b8cb

Kent Overstreet authored Sep 10, 2023

More reorganization: this splits up io.c into
 - io_read.c
 - io_misc.c - fallocate, fpunch, truncate
 - io_write.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: bch2_trans_update_get_key_cache() · cbf57db5

Kent Overstreet authored Sep 11, 2023

Factor out a slowpath into a separate function.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: __bch2_btree_insert() -> bch2_btree_insert_trans() · aef32bf7

Kent Overstreet authored Sep 11, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill incorrect assertion · 39791d7d

Kent Overstreet authored Sep 11, 2023

In the bch2_fs_alloc() error path we call bch2_fs_free() without setting
BCH_FS_STOPPING - this is fine.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Convert more code to bch_err_msg() · e46c181a

Kent Overstreet authored Sep 11, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill missing inode warnings in bch2_quota_read() · da187cac

Kent Overstreet authored Sep 10, 2023

bch2_quota_read(), when scanning for inodes, may attempt to look up
inodes that have been deleted in the main subvolume - this is not an
error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Fix bch_sb_handle type · c7afec9b

Kent Overstreet authored Sep 10, 2023

blk_mode_t was recently introduced; we should be using it now, instead
of fmode_t.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Fix bch2_propagate_key_to_snapshot_leaves() · c872afa2

Kent Overstreet authored Sep 10, 2023

When we handle a transaction restart in a nested context, we need to
return -BCH_ERR_transaction_restart_nested because we invalidated the
outer context's iterators and locks.

bch2_propagate_key_to_snapshot_leaves() wasn't doing this; this patch
fixes it to use trans_was_restarted().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
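
The rule described above can be sketched with a simplified model. Everything here is a stand-in: `trans_model`, the `ERR_*` values and `model_trans_was_restarted()` are hypothetical and only imitate the idea of comparing a restart counter sampled on entry with its value on exit; they are not the bcachefs definitions.

```c
#include <stdint.h>

/* Stand-in error codes; not the real bcachefs values. */
enum {
	ERR_transaction_restart        = 1,
	ERR_transaction_restart_nested = 2,
};

/* Minimal model of a transaction: just a restart counter. */
struct trans_model {
	uint32_t restart_count;
};

/* Nonzero if a restart happened since restart_count was sampled. */
int model_trans_was_restarted(const struct trans_model *trans,
			      uint32_t restart_count)
{
	return trans->restart_count != restart_count
		? -ERR_transaction_restart_nested
		: 0;
}

/* Inner step that hits a restart on its first attempt, then succeeds. */
int inner_step(struct trans_model *trans, int attempt)
{
	if (attempt == 0) {
		trans->restart_count++;
		return -ERR_transaction_restart;
	}
	return 0;
}

/* Nested helper: retries internally, but reports the restart to its caller. */
int nested_helper(struct trans_model *trans)
{
	uint32_t restart_count = trans->restart_count;
	int ret, attempt = 0;

	do {
		ret = inner_step(trans, attempt++);
	} while (ret == -ERR_transaction_restart);

	/* The work succeeded, but the outer iterators and locks are stale. */
	return ret ? ret : model_trans_was_restarted(trans, restart_count);
}
```

Even though the inner retry succeeds, `nested_helper()` returns the nested-restart code so that the outer context knows it has to start over.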


bcachefs: Fix silent enum conversion error · 5b7fbdcd

Kent Overstreet authored Sep 09, 2023

This changes mark_btree_node_locked() to take an enum
btree_node_locked_type, not a six_lock_type, since BTREE_NODE_UNLOCKED
is -1 which may cause problems converting back and forth to
six_lock_type if short enums are in use.

With this change, we never store BTREE_NODE_UNLOCKED in a six_lock_type
enum.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
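
A minimal sketch of the idea, assuming the enumerator names below (only BTREE_NODE_UNLOCKED, six_lock_type and btree_node_locked_type appear in the message; the other enumerators and the helper `locked_type_to_six()` are illustrative): the "unlocked" value lives only in the wider enum, and conversion to the lock-type enum is refused for it.

```c
#include <stdbool.h>

/* Lock types proper: no negative values. */
enum six_lock_type {
	SIX_LOCK_read,
	SIX_LOCK_intent,
	SIX_LOCK_write,
};

/* Superset that adds an "unlocked" state with the value -1. */
enum btree_node_locked_type {
	BTREE_NODE_UNLOCKED      = -1,
	BTREE_NODE_READ_LOCKED   = SIX_LOCK_read,
	BTREE_NODE_INTENT_LOCKED = SIX_LOCK_intent,
	BTREE_NODE_WRITE_LOCKED  = SIX_LOCK_write,
};

/*
 * Hypothetical helper: convert only when a lock is actually held, so -1 is
 * never stored in an enum six_lock_type (which may be unsigned or narrow
 * when short enums are enabled).
 */
bool locked_type_to_six(enum btree_node_locked_type t, enum six_lock_type *out)
{
	if (t == BTREE_NODE_UNLOCKED)
		return false;
	*out = (enum six_lock_type) t;
	return true;
}
```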


bcachefs: Array bounds fixes · 5cfd6977

Kent Overstreet authored Sep 09, 2023

It's no longer legal to use a zero size array as a flexible array
member - this causes UBSAN to complain.

This patch switches our zero size arrays to normal flexible array
members when possible, and inserts casts in other places (e.g. where we
use the zero size array as a marker partway through an array).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
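
For readers unfamiliar with the distinction, here is a small userspace sketch of a C99 flexible array member replacing a zero-size trailing array. The struct and helpers (`entries`, `entries_alloc`, `entries_sum`) are made up for illustration and are not taken from the patch.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Older style, shown only for contrast:
 *	struct entries { uint32_t nr; uint64_t d[0]; };
 */
struct entries {
	uint32_t nr;
	uint64_t d[];	/* flexible array member: must be the last field */
};

/* Allocate the header plus nr trailing elements in one block. */
struct entries *entries_alloc(uint32_t nr)
{
	struct entries *e = malloc(sizeof(*e) + (size_t) nr * sizeof(e->d[0]));
	uint32_t i;

	if (!e)
		return NULL;
	e->nr = nr;
	for (i = 0; i < nr; i++)
		e->d[i] = i;
	return e;
}

/* Indexing stays within the bounds recorded in nr. */
uint64_t entries_sum(const struct entries *e)
{
	uint64_t sum = 0;
	uint32_t i;

	for (i = 0; i < e->nr; i++)
		sum += e->d[i];
	return sum;
}
```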


bcachefs: bch2_acl_to_text() · a9a7bbab

Kent Overstreet authored Sep 08, 2023

We can now print out acls from bch2_xattr_to_text(), when the xattr
contains an acl.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: restart journal reclaim thread on ro->rw transitions · 197763a7

Brian Foster authored Aug 30, 2023

Commit c2d5ff36065a4 ("bcachefs: Start journal reclaim thread
earlier") tweaked reclaim thread management to start a bit earlier
in the mount sequence by moving the start call from
__bch2_fs_read_write() to bch2_fs_journal_start(). This has the side
effect of never starting the reclaim thread on a ro->rw transition,
which can be observed by monitoring reclaim behavior via the
journal_reclaim tracepoints. I.e. once an fs has remounted ro->rw,
we only ever rely on direct reclaim from that point forward.

Since bch2_journal_reclaim_start() properly handles the case where
the reclaim thread has already been created, restore the start call
in the read-write helper. This allows the reclaim thread to start
early when appropriate and also exit/restart on remounts or freeze
cycles. In the latter case it may be possible to simply allow the
task to freeze rather than destroy it, but for now just fix the
immediate bug.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Fix snapshot_skiplist_good() · 097d4cc8

Kent Overstreet authored Aug 28, 2023

We weren't correctly checking snapshot skiplist nodes - we were checking
if they were in the same tree, not if they were an actual ancestor.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Kill stripe check in bch2_alloc_v4_invalid() · cba37d81

Kent Overstreet authored Aug 24, 2023

Since we set bucket data type to BCH_DATA_stripe based on the data
pointer, not just the stripe pointer, it doesn't make sense to check for
no stripe in the .key_invalid method - this is a situation that
shouldn't happen, but our other fsck/repair code handles it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Improve bch2_moving_ctxt_to_text() · 9d2a7bd8

Kent Overstreet authored Aug 23, 2023

Print more information out about moving contexts - fold in the output of
the redundant bch2_data_jobs_to_text(), and also include information
relevant to whether move_data() should be blocked.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Put bkey invalid check in commit path in a more useful place · cc07773f

Kent Overstreet authored Aug 22, 2023

When doing updates early in recovery, before we can go RW, we still want
to check that keys are valid at commit time - this moves key invalid
checking to before the "btree updates to journal" path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Always check alloc data type · 71aba590

Kent Overstreet authored Aug 22, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix a double free on invalid bkey · 4491283f

Kent Overstreet authored Aug 22, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_propagate_key_to_snapshot_leaves() · a111901f

Kent Overstreet authored Aug 18, 2023

If fsck finds a key that needs work done, the primary example being an
unlinked inode that needs to be deleted, and the key is in an internal
snapshot node, we have a bit of a conundrum.

The conundrum is that internal snapshot nodes are shared, and we in
general can't do updates in internal snapshot nodes because there may be
overwrites in some snapshots and not others, and this may affect other
keys referenced by this key (i.e. extents).

For example, we might be seeing an unlinked inode in an internal
snapshot node, but then in one child snapshot the inode might have been
reattached and might not be unlinked. Deleting the inode in the internal
snapshot node would be wrong, because then we'll delete all the extents
that the child snapshot references.

But if an unlinked inode does not have any overwrites in child
snapshots, we're fine: the inode is not overwritten in any child
snapshot, so we can do the deletion at the point of commonality in the
snapshot tree, i.e. the node where we found it.

This patch adds a new helper, bch2_propagate_key_to_snapshot_leaves(),
to handle the case where we need to update a key that does have
overwrites in child snapshots: we copy the key to leaf snapshot nodes,
and then rewind fsck and process the needed updates there.

With this, fsck can now always correctly handle unlinked inodes found in
internal snapshot nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Cleanup redundant snapshot nodes · f55d6e07

Kent Overstreet authored Aug 17, 2023

After deleting snapshots, we may be left with a snapshot tree where
some nodes only have one child, and we have a linear chain.

Interior snapshot nodes are never used directly (i.e. they never have
subvolumes that point to them), they are only referred to by child
snapshot nodes - hence, they are redundant.

The existing code talks about redundant snapshot nodes as forming an
equivalence class; i.e. nodes for which snapshot_t->equiv is equal. In a
given equivalence class, we only ever need a single key at a given
position - i.e. multiple versions with different snapshot fields are
redundant.

The existing snapshot cleanup code deletes these redundant keys, but not
redundant nodes. It turns out this is buggy, because we assume that
after snapshot deletion finishes we should only have a single key per
equivalence class, but the btree update path doesn't preserve this -
overwriting keys in old snapshots doesn't check for the equivalence
class being equal, and thus we can end up with duplicate keys in the
same equivalence class and fsck complaining about snapshot deletion not
having run correctly.

The equivalence class notion has been leaking out of the core snapshots
code and into too much other code, i.e. fsck, so this patch takes a
different approach: snapshot deletion now moves keys to the node in an
equivalence class being kept (the leafiest node) and then deletes the
redundant nodes in the equivalence class.

Some work has to be done to correctly delete interior snapshot nodes;
snapshot node depth and skiplist fields for descendent nodes have to be
fixed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
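
To make the "linear chain" case concrete, the following toy model finds the node that would be kept for a chain of only-children. The node layout (`snap_node` with two child slots, 0 meaning "none") and the helpers are assumptions for illustration; they say nothing about how the patch fixes depth or skiplist fields.

```c
#include <stdbool.h>

/* Hypothetical snapshot node: up to two children, 0 meaning "none". */
struct snap_node {
	int children[2];
};

/* Follow a chain of only-children down to the leafiest node, which is kept. */
int chain_keeper(const struct snap_node *t, int id)
{
	for (;;) {
		int a = t[id].children[0];
		int b = t[id].children[1];

		if ((a && b) || (!a && !b))
			return id;	/* branch point or leaf: stop here */
		id = a ? a : b;		/* only child: keep descending */
	}
}

/* A node is redundant if some node below it in its chain is kept instead. */
bool node_is_redundant(const struct snap_node *t, int id)
{
	return chain_keeper(t, id) != id;
}
```

With the chain 1 -> 2 -> 3, where node 3 has two children, nodes 1 and 2 are reported as redundant and node 3 is the one kept.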


bcachefs: Fix btree write buffer with snapshots btrees · da525760

Kent Overstreet authored Aug 21, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix is_ancestor bitmap · 66487c54

Kent Overstreet authored Jul 13, 2023

The is_ancestor bitmap is an optimization for bch2_snapshot_is_ancestor;
once we get sufficiently close to the ancestor ID we're searching for we
test a bitmap.

But initialization of the is_ancestor bitmap was broken; we do it by
using bch2_snapshot_parent(), but we call that on nodes that haven't
been initialized yet with bch2_mark_snapshot().

Fix this by adding a separate loop in bch2_snapshots_read() for
initializing the is_ancestor bitmap, and also add some new debug asserts
for checking this sort of breakage in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
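
The ordering problem can be shown with a small model. It assumes that a parent always has a larger ID than its children and uses a tiny 4-ID window; `snap`, `mark_snapshot()`, `init_is_ancestor()` and `is_ancestor()` are hypothetical stand-ins, not the bcachefs functions. The key point is that the bitmap is built in a second pass, after every parent link has been recorded.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_SNAPSHOTS	16	/* ids 1..15 are usable, 0 means "no parent" */
#define WINDOW		4	/* how many ids above a node the bitmap covers */

struct snap {
	uint32_t parent;
	uint32_t is_ancestor;	/* bit i set: id + 1 + i is an ancestor */
};

struct snap snap_table[NR_SNAPSHOTS];

/* Pass 1: record a parent link; nothing reads other nodes here. */
void mark_snapshot(uint32_t id, uint32_t parent)
{
	snap_table[id].parent = parent;
}

/* Pass 2: all parent links exist, so walking them is now safe. */
void init_is_ancestor(void)
{
	uint32_t id, p;

	for (id = 1; id < NR_SNAPSHOTS; id++) {
		snap_table[id].is_ancestor = 0;
		for (p = snap_table[id].parent; p; p = snap_table[p].parent)
			if (p - id - 1 < WINDOW)
				snap_table[id].is_ancestor |= UINT32_C(1) << (p - id - 1);
	}
}

/* Walk up until the target is inside the window, then test the bitmap. */
bool is_ancestor(uint32_t id, uint32_t ancestor)
{
	while (id && id + WINDOW < ancestor)
		id = snap_table[id].parent;

	if (!id || ancestor < id)
		return false;
	if (ancestor == id)
		return true;
	return (snap_table[id].is_ancestor >> (ancestor - id - 1)) & 1;
}
```

If the bitmap were filled in while nodes were still being marked, a child marked before its parent would see a parent link of 0 and end up with an incomplete bitmap.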


bcachefs: move check_pos_snapshot_overwritten() to snapshot.c · fa5bed37

Kent Overstreet authored Aug 18, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix bch2_mount error path · 7573041a

Kent Overstreet authored Aug 18, 2023

In the bch2_mount() error path, we were calling
deactivate_locked_super(), which calls ->kill_sb(), which in our case
was calling bch2_fs_free() without __bch2_fs_stop().

This changes bch2_mount() to just call bch2_fs_stop() directly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Delete a faulty assertion · adc0e950

Kent Overstreet authored Aug 18, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improve btree_path_relock_fail tracepoint · 55d5276d

Kent Overstreet authored Aug 17, 2023

In https://github.com/koverstreet/bcachefs/issues/450, we're seeing
unexplained btree_path_relock_fail events - according to the information
currently in the tracepoint, it appears the relock should be succeeding.

This adds lock counts to the tracepoint to help track it down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Fix divide by zero in rebalance_work() · d0445e13

Kent Overstreet authored Aug 17, 2023

This fixes https://github.com/koverstreet/bcachefs-tools/issues/159
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Split out snapshot.c · 8e877caa

Kent Overstreet authored Aug 16, 2023

subvolume.c has gotten a bit large; this splits out a separate file just
for managing snapshot trees - BTREE_ID_snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: stack_trace_save_tsk() depends on CONFIG_STACKTRACE · e5570df2

Kent Overstreet authored Aug 16, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix swallowing of data in buffered write path · 62898dd1

Kent Overstreet authored Aug 14, 2023

In __bch2_buffered_write, if we fail to write to an entire !uptodate
folio, we have to back out the write, bail out and retry.

But we were missing an iov_iter_revert() call, so the data written to
the folio was lost and the rest of the write shifted to the wrong
offset.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
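
The effect of the missing revert can be modelled in userspace. The iterator type and helpers below (`src_iter`, `iter_copy`, `iter_revert`, `write_block`) are simplified stand-ins for the kernel's iov_iter machinery, with a `limit` parameter simulating a short copy; this is a sketch of the pattern, not the code in __bch2_buffered_write.

```c
#include <stddef.h>
#include <string.h>

/* Simplified source iterator: a buffer and a current position. */
struct src_iter {
	const char *data;
	size_t len;
	size_t pos;
};

/* Copy up to n bytes, but never more than limit (simulates a short copy). */
size_t iter_copy(struct src_iter *it, char *dst, size_t n, size_t limit)
{
	size_t avail = it->len - it->pos;

	if (n > avail)
		n = avail;
	if (n > limit)
		n = limit;
	memcpy(dst, it->data + it->pos, n);
	it->pos += n;
	return n;
}

/* Undo the advance made by a copy that is being backed out. */
void iter_revert(struct src_iter *it, size_t n)
{
	it->pos -= n;
}

/* Fill a whole block or nothing; returns the number of bytes accepted. */
size_t write_block(struct src_iter *it, char *block, size_t block_size,
		   size_t limit)
{
	size_t copied = iter_copy(it, block, block_size, limit);

	if (copied < block_size) {
		/* Back out so the retry starts at the same source offset. */
		iter_revert(it, copied);
		return 0;
	}
	return copied;
}
```

Without the `iter_revert()` call, the retry would start `copied` bytes further into the source, which is the "shifted to the wrong offset" symptom described above.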


bcachefs: fix up wonky error handling in bch2_seek_pagecache_hole() · 8c9b0f7b

Brian Foster authored Aug 14, 2023

The folio_hole_offset() helper returns a mix of bool and int types.
The latter is to support a possible -EAGAIN error code when using
nonblocking locks. This is not only confusing, but the only caller
also essentially ignores errors outside of stopping the range
iteration. This means an -EAGAIN error can't return directly from
folio_hole_offset() and may be lost via bch2_clamp_data_hole().

Fix up the error handling and make it more readable.
__filemap_get_folio() returns -ENOENT instead of NULL when no folio
exists, so reuse the same error code in folio_hole_offset(). Fix up
bch2_seek_pagecache_hole() to return the current offset on -ENOENT,
but otherwise return any unexpected error code up to the caller.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
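
The error-handling convention described here (treat -ENOENT as "this is the hole", pass every other error up) can be sketched as follows. The cache model and function names (`slot`, `slot_check`, `seek_cache_hole`) are invented for the example and do not mirror the real page cache API.

```c
#include <errno.h>

/* Toy page cache: each slot is absent, holds data, or is busy (locked). */
enum slot { SLOT_ABSENT, SLOT_DATA, SLOT_BUSY };

/* Returns 0 for data, -ENOENT if nothing is cached, -EAGAIN if busy. */
int slot_check(const enum slot *cache, long idx)
{
	switch (cache[idx]) {
	case SLOT_DATA:
		return 0;
	case SLOT_BUSY:
		return -EAGAIN;
	default:
		return -ENOENT;
	}
}

/*
 * Returns the first index in [start, end) with no cached data, end if
 * there is none, or a negative errno for any unexpected failure.
 */
long seek_cache_hole(const enum slot *cache, long start, long end)
{
	long idx;

	for (idx = start; idx < end; idx++) {
		int ret = slot_check(cache, idx);

		if (ret == -ENOENT)
			return idx;	/* expected: this is the hole */
		if (ret)
			return ret;	/* unexpected: propagate to the caller */
	}
	return end;
}
```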


bcachefs: Fix bkey format calculation · 029b85fe

Kent Overstreet authored Aug 13, 2023

For extents, we increase the number of bits of the size field to allow
extents to get bigger due to merging - but this code didn't check for
overflow.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
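
The message does not say how the overflow check is implemented, so the following is only a generic sketch of clamping a field width when extra bits are added. `FIELD_MAX_BITS`, `widen_field_bits()` and `field_max_value()` are hypothetical names, and the 64-bit limit is an assumption of this example.

```c
#include <stdint.h>

#define FIELD_MAX_BITS 64u	/* assumed upper bound on a packed field */

/* Widen a field by extra bits without exceeding FIELD_MAX_BITS. */
unsigned widen_field_bits(unsigned bits, unsigned extra)
{
	unsigned room = FIELD_MAX_BITS - bits;	/* assumes bits <= FIELD_MAX_BITS */

	return bits + (extra < room ? extra : room);
}

/* Largest value representable in bits bits; avoids shifting by 64. */
uint64_t field_max_value(unsigned bits)
{
	return bits >= 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}
```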


bcachefs: Fix bch2_extent_fallocate() · c8ef8c3e

Kent Overstreet authored Aug 13, 2023

 - There was no need for a retry loop in bch2_extent_fallocate(); if we
   have to retry we may be overwriting something different and we need
   to return an error and let the caller retry.
 - The bch2_alloc_sectors_start() error path was wrong, and wasn't
   running our cleanup at the end of the function.

This also fixes a very rare open bucket leak due to the missing cleanup.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Zero btree_paths on allocation · ff5b741c

Kent Overstreet authored Aug 13, 2023

This fixes a bug in the cycle detector, bch2_check_for_deadlock() - we
have to make sure the node pointers in the btree paths array are set to
something not-garbage before another thread may see them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
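
A userspace analogue of the fix, assuming a made-up `path_model` struct: allocating with calloc() guarantees that every node pointer starts out NULL, so a reader scanning the array never follows an uninitialized pointer. The kernel code uses its own allocators; this only illustrates the principle.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, much reduced stand-in for a btree path. */
struct path_model {
	void *node;
	unsigned level;
};

/* calloc() zeroes the array, so every node pointer starts out NULL. */
struct path_model *paths_alloc(size_t nr)
{
	return calloc(nr, sizeof(struct path_model));
}

/* A scanner can then skip unused slots with a plain NULL check. */
size_t paths_in_use(const struct path_model *paths, size_t nr)
{
	size_t i, used = 0;

	for (i = 0; i < nr; i++)
		if (paths[i].node)
			used++;
	return used;
}
```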


bcachefs: Fix 'pointer to invalid device' check · e9679b4a

Kent Overstreet authored Aug 13, 2023

This fixes the device removal tests, which have been failing at random
due to the fact that when we're running the .key_invalid checks in the
write path the key may actually no longer exist - we might be racing
with the keys being deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
