• Brian Foster's avatar
    bcachefs: serialize on cached key in early bucket allocator · 385a82f6
    Brian Foster authored
    bcachefs had a transient bug where freespace_initialized was not
    properly being set, which lead to unexpected use of the early bucket
    allocator at runtime. This issue has been fixed, but the existence
    of it uncovered a coherency issue in the early bucket allocation
    code that is somewhat related to how uncached iterators deal with
    the key cache.
    
    The problem itself manifests as occasional failure of generic/113
    due to corruption, often seen as a duplicate backpointer or multiple
    data types per-bucket error. The immediate cause of the error is a
    racing bucket allocation along the lines of the following sequence:
    
    - Task 1 selects key A in bch2_bucket_alloc_early() and schedules.
    - Task 2 selects the same key A, but proceeds to complete the
      allocation and associated I/O, after which it releases the
      open_bucket.
    - Task 1 resumes with key A, but does not recognize the bucket is
      now allocated because the open_bucket has been removed
      from the hash when it was released in the previous step.
    
    This generally shouldn't happen because the allocating task updates
    the alloc btree key before releasing the bucket. This is not
    sufficient in this particular instance, however, because an uncached
    iterator for a cached btree doesn't actually lock the key cache slot
    when no key exists for a given slot in the cache. Thus the fact that
    the allocation side updates the cached key means that multiple
    uncached iters can stumble across the same alloc key and duplicate
    the bucket allocation as described above.
    
    This is something that probably needs a longer term fix in the
    iterator code. As a short term fix, close the race through explicit
    use of a cached iterator for likely allocation candidates. We don't
    want to scan the btree with a cached iterator because that would
    unnecessarily pollute the cache. This mitigates cache pollution by
    primarily scanning the tree with an uncached iterator, but closes
    the race by creating a key cache entry for any prospective slot
    prior to the bucket allocation attempt (also similar to how
    _alloc_freelist() works via try_alloc_bucket()). This survives many
    iterations of generic/113 on a kernel hacked to always use the early
    bucket allocator.
    Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
    Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
    385a82f6
alloc_foreground.c 39.9 KB