• Coly Li's avatar
    bcache: fix race in btree_flush_write() · 50a260e8
    Coly Li authored
    There is a race between mca_reap(), btree_node_free() and journal code
    btree_flush_write(), which results very rare and strange deadlock or
    panic and are very hard to reproduce.
    
    Let me explain how the race happens. In btree_flush_write() one btree
    node with oldest journal pin is selected, then it is flushed to cache
    device, the select-and-flush is a two steps operation. Between these two
    steps, there are something may happen inside the race window,
    - The selected btree node was reaped by mca_reap() and allocated to
      other requesters for other btree node.
    - The slected btree node was selected, flushed and released by mca
      shrink callback bch_mca_scan().
    When btree_flush_write() tries to flush the selected btree node, firstly
    b->write_lock is held by mutex_lock(). If the race happens and the
    memory of selected btree node is allocated to other btree node, if that
    btree node's write_lock is held already, a deadlock very probably
    happens here. A worse case is the memory of the selected btree node is
    released, then all references to this btree node (e.g. b->write_lock)
    will trigger NULL pointer deference panic.
    
    This race was introduced in commit cafe5635 ("bcache: A block layer
    cache"), and enlarged by commit c4dc2497 ("bcache: fix high CPU
    occupancy during journal"), which selected 128 btree nodes and flushed
    them one-by-one in a quite long time period.
    
    Such race is not easy to reproduce before. On a Lenovo SR650 server with
    48 Xeon cores, and configure 1 NVMe SSD as cache device, a MD raid0
    device assembled by 3 NVMe SSDs as backing device, this race can be
    observed around every 10,000 times btree_flush_write() gets called. Both
    deadlock and kernel panic all happened as aftermath of the race.
    
    The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
    is set when selecting btree nodes, and cleared after btree nodes
    flushed. Then when mca_reap() selects a btree node with this bit set,
    this btree node will be skipped. Since mca_reap() only reaps btree node
    without BTREE_NODE_journal_flush flag, such race is avoided.
    
    Once corner case should be noticed, that is btree_node_free(). It might
    be called in some error handling code path. For example the following
    code piece from btree_split(),
            2149 err_free2:
            2150         bkey_put(b->c, &n2->key);
            2151         btree_node_free(n2);
            2152         rw_unlock(true, n2);
            2153 err_free1:
            2154         bkey_put(b->c, &n1->key);
            2155         btree_node_free(n1);
            2156         rw_unlock(true, n1);
    At line 2151 and 2155, the btree node n2 and n1 are released without
    mac_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
    If btree_node_free() is called directly in such error handling path,
    and the selected btree node has BTREE_NODE_journal_flush bit set, just
    delay for 1 us and retry again. In this case this btree node won't
    be skipped, just retry until the BTREE_NODE_journal_flush bit cleared,
    and free the btree node memory.
    
    Fixes: cafe5635 ("bcache: A block layer cache")
    Signed-off-by: default avatarColy Li <colyli@suse.de>
    Reported-and-tested-by: default avatarkbuild test robot <lkp@intel.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    50a260e8
btree.h 11.5 KB