Commits · a0b73c1c5363f5e2cd9a7a7968a9d6579548050a · Kirill Smelkov / linux

22 Oct, 2023 40 commits

bcachefs: Add (partial) support for fixing btree topology · a0b73c1c

Kent Overstreet authored Jan 26, 2021

When we walk the btrees during recovery, part of that is checking that
btree topology is correct: for every interior btree node, its child
nodes should exactly span the range the parent node covers.

Previously, we had checks for this, but not repair code. Now that we
have the ability to do btree updates during initial GC, this patch adds
that repair code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
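
The topology invariant described above can be sketched as follows. This is a minimal illustration, not the bcachefs code: it assumes keys are plain integers with inclusive bounds, and struct node_range and children_span_parent() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified node range: integer keys, inclusive bounds.
 * This is not the bcachefs on-disk format. */
struct node_range {
	uint64_t min_key;
	uint64_t max_key;
};

/*
 * Return true if the children, in order, exactly tile the parent's range:
 * the first child starts at parent->min_key, each child starts right after
 * the previous one ends, and the last child ends at parent->max_key.
 */
bool children_span_parent(const struct node_range *parent,
			  const struct node_range *children,
			  size_t nr)
{
	if (nr == 0)
		return false;

	uint64_t expected_min = parent->min_key;

	for (size_t i = 0; i < nr; i++) {
		if (children[i].min_key != expected_min)
			return false;	/* gap or overlap between siblings */
		if (children[i].max_key < children[i].min_key ||
		    children[i].max_key > parent->max_key)
			return false;	/* inverted range, or overruns parent */
		if (i + 1 < nr) {
			if (children[i].max_key == parent->max_key)
				return false;	/* no room left for later children */
			expected_min = children[i].max_key + 1;
		}
	}
	return children[nr - 1].max_key == parent->max_key;
}
```

A repair pass would act on the first child for which this check fails; how the real code chooses between adjusting a range and dropping a node is not covered by this sketch.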

a0b73c1c

bcachefs: Add support for doing btree updates prior to journal replay · 5b593ee1

Kent Overstreet authored Jan 26, 2021

Some errors may need to be fixed in order for GC to successfully run,
i.e. to walk and mark all metadata. But we can't start the allocators and do
normal btree updates until after GC has completed, and allocation
information is known to be consistent, so we need a different method of
doing btree updates.

Fortunately, we already have code for walking the btree while overlaying
keys from the journal to be replayed. This patch adds an update path
that adds keys to the list of keys to be replayed by journal replay, and
also fixes up iterators.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
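
The idea of walking the btree while overlaying keys from the journal can be sketched as a two-way merge in which a journal key shadows a btree key at the same position. This is a simplified illustration only: struct kv, struct overlay_iter and overlay_next() are hypothetical names, and positions are assumed to be plain integers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified key: a position plus a value. */
struct kv {
	uint64_t pos;
	int	 val;
};

/* Iterator over two arrays, each sorted by pos. */
struct overlay_iter {
	const struct kv *btree;
	size_t		 nr_btree;
	size_t		 b;		/* next unread btree key */
	const struct kv *journal;
	size_t		 nr_journal;
	size_t		 j;		/* next unread journal key */
};

/* Return the next key in position order, or NULL once both inputs are
 * exhausted. On a tie the journal key wins and the btree key is skipped. */
const struct kv *overlay_next(struct overlay_iter *it)
{
	const struct kv *bk = it->b < it->nr_btree   ? &it->btree[it->b]   : NULL;
	const struct kv *jk = it->j < it->nr_journal ? &it->journal[it->j] : NULL;

	if (!bk && !jk)
		return NULL;

	if (jk && (!bk || jk->pos <= bk->pos)) {
		if (bk && bk->pos == jk->pos)
			it->b++;	/* journal key overrides the btree key */
		it->j++;
		return jk;
	}

	it->b++;
	return bk;
}
```

Under this model, an update made before journal replay amounts to inserting a key into the sorted journal array, after which any iterator already in flight has to have its index adjusted; that adjustment is the "fixes up iterators" part and is not shown here.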

5b593ee1

bcachefs: Add BTREE_PTR_RANGE_UPDATED · 51d2dfb8

Kent Overstreet authored Jan 26, 2021

This is so that when we discover btree topology issues, we can just
update the pointer to a btree node and signal btree read path that the
min/max keys in the node header should be updated from the node pointer.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

51d2dfb8

bcachefs: Refactor checking of btree topology · a66f7989

Kent Overstreet authored Jan 26, 2021

Still a lot of work to be done here: we can't yet repair btree topology
issues, but this patch refactors things so that we have better access to
what we need in the topology checks. Next up will be figuring out a way
to do btree updates during gc, before journal replay is done.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a66f7989

bcachefs: Improve diagnostics when journal entries are missing · e4c3f386

Kent Overstreet authored Jan 26, 2021

There's an outstanding bug where journal entries go missing in journal
replay. This patch adds code to print the physical locations of the
journal entries on either side of the missing entry or entries, which
should make debugging easier.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

e4c3f386

bcachefs: Fix BCH_REPLICAS_MAX check · 522c25f0

Kent Overstreet authored Jan 26, 2021

Ideally, this limit will be going away in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

522c25f0

bcachefs: Fix build in userspace · 0093a50f

Kent Overstreet authored Jan 27, 2021

The userspace bch_err() macro doesn't use the filesystem argument. Could
also be fixed with a better macro.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

0093a50f

bcachefs: Fix an assertion · 4529ae09

Kent Overstreet authored Jan 25, 2021

If we're invalidating a bucket that has cached data in it, data_type
won't be 0 - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

4529ae09

bcachefs: Mark superblocks transactionally · bfcf840d

Kent Overstreet authored Jan 22, 2021

More work towards getting rid of the in memory struct bucket: this patch
adds code for marking superblock and journal buckets via the btree, and
uses it in the device add and journal resize paths.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bfcf840d

bcachefs: Kill bch2_invalidate_bucket() · 9afc6652

Kent Overstreet authored Jan 22, 2021

This patch is working towards eventually getting rid of the in memory
struct bucket, and relying only on the btree representation.

Since bch2_invalidate_bucket() was only used for incrementing gens, not
invalidating cached data, no other counters were being changed as a side
effect - meaning it's safe for the allocator code to increment the
bucket gen directly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

9afc6652

bcachefs: Refactor dev usage · 72eab8da

Kent Overstreet authored Jan 21, 2021

This is to make it more amenable to serialization.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

72eab8da

bcachefs: Kill metadata only gc · 079663d8

Kent Overstreet authored Jan 21, 2021

This was useful before we had transactional updates to interior btree
nodes - but now, it's just extra unneeded complexity.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

079663d8

bcachefs: Ensure __bch2_trans_commit() always calls bch2_trans_reset() · b7cf4bd7

Kent Overstreet authored Jan 21, 2021

This was leading to a very strange bug in bch2_bucket_io_time_reset(),
where we'd retry without clearing out the list of updates.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

b7cf4bd7

bcachefs: Fix a faulty assertion · fdbb88ac

Kent Overstreet authored Jan 21, 2021

If journal replay hasn't finished, the journal can't be empty - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

fdbb88ac

bcachefs: Switch replicas.c allocations to GFP_KERNEL · e46b8557

Kent Overstreet authored Jan 21, 2021

We're transitioning to memalloc_nofs_save/restore instead of GFP flags
with the rest of the kernel, and GFP_NOIO was excessively strict and
causing unnecessary allocation failures - these allocations are done
with btree locks dropped.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

e46b8557

bcachefs: Fix loopback in dio mode · b4725cc1

Kent Overstreet authored Jan 21, 2021

We had a deadlock on page_lock, because buffered reads signal completion
by unlocking the page, but the dio read path normally dirties the pages
it's reading into with set_page_dirty_lock().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

b4725cc1

bcachefs: Clean up bch2_extent_can_insert · ef470b48

Kent Overstreet authored Jan 20, 2021

It was using an internal btree node iterator interface, when
bch2_btree_iter_peek_slot() sufficed. We were hitting a null ptr deref
that looked like it was from the iterator not being uptodate - this will
also fix that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

ef470b48

bcachefs: Fix an assertion pop · a5cd80ea

Kent Overstreet authored Jan 20, 2021

There was a race: btree node writes drop their reference on journal pins
before clearing the btree_node_write_in_flight flag.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a5cd80ea

bcachefs: Don't allocate stripes at POS_MIN · 33ccd718

Kent Overstreet authored Jan 18, 2021

In the future, stripe index 0 will be a sentinel value. This patch
doesn't disallow stripes at POS_MIN yet, leaving that for when we do the
on disk format changes.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

33ccd718

bcachefs: Rework allocating buckets for stripes · 6c7585b0

Kent Overstreet authored Jan 18, 2021

Allocating buckets for existing stripes was busted, in part because the
data structures were too contorted. This reworks new stripes so that we
have an array of open buckets that matches blocks in the stripe, and
it's sparse if we're reusing an existing stripe.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

6c7585b0

bcachefs: Verify transaction updates are sorted · f9ef45ad

Kent Overstreet authored Jan 18, 2021

A user reported a bug that implies they might not be correctly sorted;
this should help track that down.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
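
A check of this kind can be sketched as a single pass over adjacent pairs. The sort key used here, (btree_id, pos), is an assumption made for illustration, and struct update, update_cmp() and updates_sorted() are hypothetical names rather than the bcachefs ones.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified update entry, ordered by (btree_id, pos). */
struct update {
	unsigned btree_id;
	uint64_t pos;
};

/* Three-way comparison: negative, zero or positive, like strcmp(). */
int update_cmp(const struct update *l, const struct update *r)
{
	if (l->btree_id != r->btree_id)
		return l->btree_id < r->btree_id ? -1 : 1;
	if (l->pos != r->pos)
		return l->pos < r->pos ? -1 : 1;
	return 0;
}

/* True if every entry compares less than or equal to its successor. */
bool updates_sorted(const struct update *u, size_t nr)
{
	for (size_t i = 1; i < nr; i++)
		if (update_cmp(&u[i - 1], &u[i]) > 0)
			return false;
	return true;
}
```

In a debug build the result would typically feed an assertion just before commit, so that a mis-sorted list is caught where it is produced rather than where it later causes damage.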

f9ef45ad

bcachefs: Preserve stripe blockcounts on existing stripes · c6e658ee

Kent Overstreet authored Jan 17, 2021

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

c6e658ee

bcachefs: Kill stripe->dirty · 6e53151b

Kent Overstreet authored Jan 17, 2021

This makes bch2_stripes_write() work more like bch2_alloc_write().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

6e53151b

bcachefs: Fix gc updating stripes info · a39c74be

Kent Overstreet authored Jan 17, 2021

The primary stripes radix tree can be sparse, which was causing an
assertion to pop because the one used for gc isn't. Fix this by changing
the algorithm to copy between the two radix trees.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a39c74be

bcachefs: Fix double counting of stripe block counts by GC · 2ef220cb

Kent Overstreet authored Jan 17, 2021

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2ef220cb

bcachefs: Fix integer overflow in bch2_disk_reservation_get() · cd9f3dfe

Kent Overstreet authored Jan 17, 2021

The sectors argument shouldn't have been a u32 - it can be up to U32_MAX
(e.g. fallocate creating persistent reservations), and if replication is
enabled we'll overflow when we calculate the real number of sectors to
reserve. Oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
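
The overflow is easy to reproduce in isolation. The two helpers below are hypothetical stand-ins, not the bcachefs function: they only model multiplying a sector count by a replica count in 32-bit versus 64-bit arithmetic.

```c
#include <stdint.h>

/* Buggy shape: both operands are 32 bits wide, so the product is computed
 * modulo 2^32 and silently wraps. */
uint32_t reserve_sectors_u32(uint32_t sectors, uint32_t nr_replicas)
{
	return sectors * nr_replicas;
}

/* Fixed shape: the sector count is 64 bits wide, so the multiplication is
 * carried out in 64 bits. */
uint64_t reserve_sectors_u64(uint64_t sectors, uint32_t nr_replicas)
{
	return sectors * nr_replicas;
}
```

With sectors equal to UINT32_MAX and two replicas, the 32-bit version yields UINT32_MAX - 1 while the 64-bit version yields 0x1FFFFFFFE, so the narrow version reserves roughly half of what was asked for.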

cd9f3dfe

bcachefs: Correctly order flushes and journal writes on multi device filesystems · 280249b9

Kent Overstreet authored Jan 16, 2021

All writes prior to a journal write need to be flushed before the
journal write itself happens. On single device filesystems, it suffices
to mark the write with REQ_PREFLUSH|REQ_FUA, but on multi device
filesystems we need to issue flushes to every device - and wait for them
to complete - before issuing the journal writes. Previously, we were
issuing flushes to every device, but we weren't waiting for them to
complete before issuing the journal writes.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

280249b9

bcachefs: Run jset_validate in write path as well · ed9d58a2

Kent Overstreet authored Jan 14, 2021

This is because we had a bug where we were writing out journal entries
with garbage last_seq, and not catching it.

Also, completely ignore jset->last_seq when JSET_NO_FLUSH is true,
because of aforementioned bug, but change the write path to set last_seq
to 0 when JSET_NO_FLUSH is true.

Minor other cleanups and comments.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
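
The last_seq handling described above can be sketched with a simplified header. Everything here is illustrative: struct jset_hdr is not the on-disk struct jset, the no_flush field stands in for the JSET_NO_FLUSH flag, and the specific rule that last_seq must not exceed seq is an assumed example of a sanity check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified journal entry header. */
struct jset_hdr {
	uint64_t seq;
	uint64_t last_seq;
	bool	 no_flush;	/* stands in for the JSET_NO_FLUSH flag */
};

/* Write path: last_seq carries no meaning on a no-flush entry, so write
 * a well-defined zero instead of whatever happened to be there. */
void jset_prepare_write(struct jset_hdr *j)
{
	if (j->no_flush)
		j->last_seq = 0;
}

/* Check shared by the read and write paths. On a no-flush entry last_seq
 * is ignored entirely, so entries written by the older, buggy code still
 * pass. */
bool jset_last_seq_ok(const struct jset_hdr *j)
{
	if (j->no_flush)
		return true;
	return j->last_seq <= j->seq;	/* assumed sanity rule */
}
```

Running the same check before a write as after a read means a bad header is caught while the code that produced it is still on the stack.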

ed9d58a2

bcachefs: Factor out bch2_ec_stripes_heap_start() · ac958006

Kent Overstreet authored Jan 14, 2021

This fixes a bug where mark and sweep gc was incorrectly clearing out
the stripes heap and causing assertions to fire later - simpler to just
create the stripes heap after gc has finished.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

ac958006

bcachefs: Add btree node prefetching to bch2_btree_and_journal_walk() · edfbba58

Kent Overstreet authored Jan 11, 2021

bch2_btree_and_journal_walk() walks the btree overlaying keys from the
journal; it was introduced so that we could read in the alloc btree
prior to journal replay being done, when journalling of updates to
interior btree nodes was introduced.

But it didn't have btree node prefetching, which introduced a severe
regression with mount times, particularly on spinning rust. This patch
implements btree node prefetching for the btree + journal walk,
hopefully fixing that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

edfbba58

bcachefs: Erasure coding fixes & refactoring · 2a3731e3

Kent Overstreet authored Jan 11, 2021

 - Originally bch_extent_stripe_ptr didn't contain the block index,
   instead we'd have to search through the stripe pointers to figure out
   which pointer matched. When the block field was added to
   bch_extent_stripe_ptr, not all of the code was updated to use it.
   This patch fixes that, and we also now verify that field where it
   makes sense.

 - The ec_stripe_buf_init/exit() functions have been improved, and are
   now used by the bch2_ec_read_extent() (recovery read) path.

 - get_stripe_key() is now used by bch2_ec_read_extent().

 - We now have a getter and setter for checksums within a stripe, like
   we had previously for block sector counts, and ec_generate_checksums
   and ec_validate_checksums are now quite a bit smaller and cleaner.

ec.c still needs a lot of work, but this patch is slowly moving things
in the right direction.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2a3731e3

bcachefs: Add cannibalize lock to btree_cache_to_text() · b929bbef

Kent Overstreet authored Jan 11, 2021

More debugging info is always a good thing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

b929bbef

bcachefs: Fix .splice_write · 032ac32c

Kent Overstreet authored Apr 27, 2021

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

032ac32c

bcachefs: Fix bch2_replicas_gc2 · 53ef2c5c

Kent Overstreet authored Jan 10, 2021

This fixes a regression introduced by "bcachefs: Refactor filesystem
usage accounting". We have to include every replicas entry for which any
of the counters kept for the different journal entries is nonzero; we
can't skip an entry just because those counters sum to zero.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
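
The difference between the two tests can be shown in a few lines. This sketch assumes each replicas entry has an array of unsigned 64-bit counters, one per in-flight journal entry, and that they behave as wrapping deltas; both function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Regressed shape: decides on the sum, so counters that cancel out make
 * the entry look unused. */
bool replicas_in_use_by_sum(const uint64_t *counters, size_t nr)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < nr; i++)
		sum += counters[i];	/* unsigned, so wrapping is well defined */
	return sum != 0;
}

/* Fixed shape: the entry is kept if any individual counter is nonzero. */
bool replicas_in_use_by_any(const uint64_t *counters, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (counters[i])
			return true;
	return false;
}
```

For counters of +5 and -5 (the latter stored as a wrapped unsigned value) the sum-based test reports the entry as unused, while the per-counter test keeps it.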

53ef2c5c

bcachefs: bch2_alloc_write() should be writing for all devices · 4291a331

Kent Overstreet authored Jan 08, 2021

Alloc info isn't stored on a particular device, so it makes no sense to
only be writing it out for rw members - this was causing fsck to not fix
alloc info errors, oops.

Also, make sure we write out alloc info in other repair paths.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

4291a331

bcachefs: Fix btree node split after merge operations · dcf64dfb

Kent Overstreet authored Jan 08, 2021

A btree node merge operation deletes a key in the parent node; if when
inserting into the parent node we split the parent node, we can end up
with a whiteout in the parent node that we don't want.

The existing code drops them before doing the split, because they can
screw up picking the pivot, but we forgot about the unwritten whiteouts
area - that needs to be cleared out too.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

dcf64dfb

bcachefs: Reserve some open buckets for btree allocations · 890e3f5b

Kent Overstreet authored Jan 07, 2021

This reverts part of the change from "bcachefs: Don't use
BTREE_INSERT_USE_RESERVE so much" - it turns out we still should be
reserving open buckets for btree node allocations, because otherwise
data bucket allocations (especially with erasure coding enabled) can use
up all our open buckets and we won't be able to do the metadata update
that lets us release those open bucket references. Oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
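
The reservation idea can be sketched as a pool with a floor that only btree allocations may dip below. This is a minimal model, not the bcachefs allocator: struct open_bucket_pool, enum alloc_kind and open_bucket_get() are hypothetical names.

```c
#include <stdbool.h>

enum alloc_kind { ALLOC_DATA, ALLOC_BTREE };

/* Hypothetical pool of open bucket slots. */
struct open_bucket_pool {
	unsigned nr_free;	/* slots currently available */
	unsigned btree_reserve;	/* free slots that data allocations may not take */
};

/* Take one slot if this kind of allocation is allowed to. Data allocations
 * stop once only the reserve is left; btree allocations can use the pool
 * down to zero. */
bool open_bucket_get(struct open_bucket_pool *p, enum alloc_kind kind)
{
	unsigned floor = kind == ALLOC_BTREE ? 0 : p->btree_reserve;

	if (p->nr_free <= floor)
		return false;

	p->nr_free--;
	return true;
}
```

Because data allocations fail first, there is always a slot left for the metadata update that lets the data path release its references, which avoids the deadlock the commit message describes.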

890e3f5b

bcachefs: Work around a zstd bug · fd54c40e

Kent Overstreet authored Jan 07, 2021

The zstd compression code seems to have a bug where it will write just
past the end of the destination buffer - probably only when the
compressed output isn't going to fit in the destination buffer, which
will never happen if you're always allocating a bigger buffer than the
source buffer, which would explain other users not hitting it. But we
size the buffer according to how much contiguous space on disk we have,
so...

Generally, bugs like this don't write more than a word past the end of
the buffer, so an easy workaround is to subtract a fudge factor from the
buffer size.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
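
The workaround amounts to advertising a slightly smaller destination buffer than the one actually owned. In the sketch below the fudge value of 8 bytes is an assumption (one 64-bit word); the commit message does not state the value used, and compress_dst_capacity() is a hypothetical helper.

```c
#include <stddef.h>

/* Hypothetical fudge factor: one 64-bit word. */
#define DST_FUDGE_BYTES 8

/* Capacity to report to the compressor for a destination buffer that is
 * really dst_len bytes long. A write that overruns the reported capacity
 * by up to DST_FUDGE_BYTES still lands inside the real buffer. */
size_t compress_dst_capacity(size_t dst_len)
{
	return dst_len > DST_FUDGE_BYTES ? dst_len - DST_FUDGE_BYTES : 0;
}
```

The cost is that output which would have fit exactly into the last few bytes is now treated as not fitting, which is a small price for not corrupting adjacent memory.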

fd54c40e

bcachefs: Don't error out of recovery process on journal read error · 29d90f61

Kent Overstreet authored Jan 06, 2021

We don't want to fail the recovery/mount because of a single error
reading from the journal - the relevant journal entry may still be found
on other devices, and the case of missing journal entries, or none found
at all, is already handled later in the recovery process.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

29d90f61

bcachefs: Fix journal_buf_realloc() · c859430b

Kent Overstreet authored Jan 04, 2021

It used to be safe to reallocate a buf that the write path owns without
holding the journal lock, but now this can trigger an assertion in
journal_seq_to_buf().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

c859430b