Commit e53d03fe authored by Brian Foster's avatar Brian Foster Committed by Kent Overstreet

bcachefs: don't bump key cache journal seq on nojournal commits

fstest generic/388 occasionally reproduces corruptions where an
inode has extents beyond i_size. This is a deliberate crash and
recovery test, and the post crash+recovery characteristics are
usually the same: the inode exists on disk in an early (i.e. just
allocated) state based on the journal sequence number associated
with the inode. Subsequent inode updates exist in the journal at
higher sequence numbers, but the inode hadn't been written back
before the associated crash and the post-crash recovery processes a
set of journal sequence numbers that doesn't include updates to the
inode. In fact, the sequence with the most recent inode key update
always happens to be the sequence just before the front of the
journal processed by recovery.

This last bit is a significant hint that the problem relates to an
on-disk journal update of the front of the journal. The root cause
of this problem is basically that the inode is updated (multiple
times) in-core and in the key cache, each time bumping the key cache
sequence number used to control the cache flush. The cache flush
skips one or more times, bumping the associated key cache journal
pin to the key cache seq value. This has a side effect of holding
the inode in memory a bit longer than normal, which helps exacerbate
this problem, but is also unsafe in certain cases where the key
cache seq may have been updated by a transaction commit that didn't
journal the associated key.

For example, consider an inode that has been allocated, updated
several times in the key cache, journaled, but not yet written back.
At this stage, everything should be consistent if the fs happens to
crash because the latest update has been journal. Now consider a key
update via bch2_extent_update_i_size_sectors() that uses the
BTREE_UPDATE_NOJOURNAL flag. While this update may not change inode
state, it can have the side effect of bumping ck->seq in
bch2_btree_insert_key_cached(). In turn, if a subsequent key cache
flush skips due to seq not matching the former, the ck->journal pin
is updated to ck->seq even though the most recent key update was not
journaled. If this pin happens to reside at the front (tail) of the
journal, this means a subsequent journal write can update last_seq
to a value beyond that which includes the most recent update to the
inode. If this occurs and the fs happens to crash before the inode
happens to flush, recovery will see the latest last_seq, fail to
recover the inode and leave the inode in the inconsistent state
described above.

To avoid this problem, skip the key cache seq update on NOJOURNAL
commits, except on initial pin add. Pass the insert entry directly
to bch2_btree_insert_key_cached() to make the associated flag
available and be consistent with btree_insert_key_leaf().
Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
parent 83ec519a
......@@ -769,11 +769,11 @@ int bch2_btree_key_cache_flush(struct btree_trans *trans,
bool bch2_btree_insert_key_cached(struct btree_trans *trans,
unsigned flags,
struct btree_path *path,
struct bkey_i *insert)
struct btree_insert_entry *insert_entry)
{
struct bch_fs *c = trans->c;
struct bkey_cached *ck = (void *) path->l[0].b;
struct bkey_cached *ck = (void *) insert_entry->path->l[0].b;
struct bkey_i *insert = insert_entry->k;
bool kick_reclaim = false;
BUG_ON(insert->k.u64s > ck->u64s);
......@@ -801,9 +801,24 @@ bool bch2_btree_insert_key_cached(struct btree_trans *trans,
kick_reclaim = true;
}
/*
* To minimize lock contention, we only add the journal pin here and
* defer pin updates to the flush callback via ->seq. Be careful not to
* update ->seq on nojournal commits because we don't want to update the
* pin to a seq that doesn't include journal updates on disk. Otherwise
* we risk losing the update after a crash.
*
* The only exception is if the pin is not active in the first place. We
* have to add the pin because journal reclaim drives key cache
* flushing. The flush callback will not proceed unless ->seq matches
* the latest pin, so make sure it starts with a consistent value.
*/
if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL) ||
!journal_pin_active(&ck->journal)) {
ck->seq = trans->journal_res.seq;
}
bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
&ck->journal, bch2_btree_key_cache_journal_flush);
ck->seq = trans->journal_res.seq;
if (kick_reclaim)
journal_reclaim_kick(&c->journal);
......
......@@ -30,7 +30,7 @@ int bch2_btree_path_traverse_cached(struct btree_trans *, struct btree_path *,
unsigned);
bool bch2_btree_insert_key_cached(struct btree_trans *, unsigned,
struct btree_path *, struct bkey_i *);
struct btree_insert_entry *);
int bch2_btree_key_cache_flush(struct btree_trans *,
enum btree_id, struct bpos);
void bch2_btree_key_cache_drop(struct btree_trans *,
......
......@@ -765,7 +765,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
if (!i->cached)
btree_insert_key_leaf(trans, i);
else if (!i->key_cache_already_flushed)
bch2_btree_insert_key_cached(trans, flags, i->path, i->k);
bch2_btree_insert_key_cached(trans, flags, i);
else {
bch2_btree_key_cache_drop(trans, i->path);
btree_path_set_dirty(i->path, BTREE_ITER_NEED_TRAVERSE);
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment