• Kent Overstreet's avatar
    bcachefs: Deadlock cycle detector · 33bd5d06
    Kent Overstreet authored
    We've outgrown our own deadlock avoidance strategy.
    
    The btree iterator API provides an interface where the user doesn't need
    to concern themselves with lock ordering - different btree iterators can
    be traversed in any order. Without special care, this will lead to
    deadlocks.
    
    Our previous strategy was to define a lock ordering internally, and
    whenever we attempt to take a lock and trylock() fails, we'd check if
    the current btree transaction is holding any locks that cause a lock
    ordering violation. If so, we'd issue a transaction restart, and then
    bch2_trans_begin() would re-traverse all previously used iterators, but
    in the correct order.
    
    That approach had some issues, though.
     - Sometimes we'd issue transaction restarts unnecessarily, when no
       deadlock would have actually occured. Lock ordering restarts have
       become our primary cause of transaction restarts, on some workloads
       totally 20% of actual transaction commits.
    
     - To avoid deadlock or livelock, we'd often have to take intent locks
       when we only wanted a read lock: with the lock ordering approach, it
       is actually illegal to hold _any_ read lock while blocking on an intent
       lock, and this has been causing us unnecessary lock contention.
    
     - It was getting fragile - the various lock ordering rules are not
       trivial, and we'd been seeing occasional livelock issues related to
       this machinery.
    
    So, since bcachefs is already a relational database masquerading as a
    filesystem, we're stealing the next traditional database technique and
    switching to a cycle detector for avoiding deadlocks.
    
    When we block taking a btree lock, after adding ourself to the waitlist
    but before sleeping, we do a DFS of btree transactions waiting on other
    btree transactions, starting with the current transaction and walking
    our held locks, and transactions blocking on our held locks.
    
    If we find a cycle, we emit a transaction restart. Occasionally (e.g.
    the btree split path) we can not allow the lock() operation to fail, so
    if necessary we'll tell another transaction that it has to fail.
    
    Result: trans_restart_would_deadlock events are reduced by a factor of
    10 to 100, and we'll be able to delete a whole bunch of grotty, fragile
    code.
    Signed-off-by: default avatarKent Overstreet <kent.overstreet@gmail.com>
    33bd5d06
debug.c 16.8 KB