• Brian Foster's avatar
    bcachefs: don't bump key cache journal seq on nojournal commits · e53d03fe
    Brian Foster authored
    fstest generic/388 occasionally reproduces corruptions where an
    inode has extents beyond i_size. This is a deliberate crash and
    recovery test, and the post crash+recovery characteristics are
    usually the same: the inode exists on disk in an early (i.e. just
    allocated) state based on the journal sequence number associated
    with the inode. Subsequent inode updates exist in the journal at
    higher sequence numbers, but the inode hadn't been written back
    before the associated crash and the post-crash recovery processes a
    set of journal sequence numbers that doesn't include updates to the
    inode. In fact, the sequence with the most recent inode key update
    always happens to be the sequence just before the front of the
    journal processed by recovery.
    
    This last bit is a significant hint that the problem relates to an
    on-disk journal update of the front of the journal. The root cause
    of this problem is basically that the inode is updated (multiple
    times) in-core and in the key cache, each time bumping the key cache
    sequence number used to control the cache flush. The cache flush
    skips one or more times, bumping the associated key cache journal
    pin to the key cache seq value. This has a side effect of holding
    the inode in memory a bit longer than normal, which helps exacerbate
    this problem, but is also unsafe in certain cases where the key
    cache seq may have been updated by a transaction commit that didn't
    journal the associated key.
    
    For example, consider an inode that has been allocated, updated
    several times in the key cache, journaled, but not yet written back.
    At this stage, everything should be consistent if the fs happens to
    crash because the latest update has been journal. Now consider a key
    update via bch2_extent_update_i_size_sectors() that uses the
    BTREE_UPDATE_NOJOURNAL flag. While this update may not change inode
    state, it can have the side effect of bumping ck->seq in
    bch2_btree_insert_key_cached(). In turn, if a subsequent key cache
    flush skips due to seq not matching the former, the ck->journal pin
    is updated to ck->seq even though the most recent key update was not
    journaled. If this pin happens to reside at the front (tail) of the
    journal, this means a subsequent journal write can update last_seq
    to a value beyond that which includes the most recent update to the
    inode. If this occurs and the fs happens to crash before the inode
    happens to flush, recovery will see the latest last_seq, fail to
    recover the inode and leave the inode in the inconsistent state
    described above.
    
    To avoid this problem, skip the key cache seq update on NOJOURNAL
    commits, except on initial pin add. Pass the insert entry directly
    to bch2_btree_insert_key_cached() to make the associated flag
    available and be consistent with btree_insert_key_leaf().
    Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
    Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
    e53d03fe
btree_key_cache.h 1.61 KB