• Dave Chinner's avatar
    xfs: fix race in inode cluster freeing failing to stale inodes · 5b257b4a
    Dave Chinner authored
    When an inode cluster is freed, it needs to mark all inodes in memory as
    XFS_ISTALE before marking the buffer as stale. This is eeded because the inodes
    have a different life cycle to the buffer, and once the buffer is torn down
    during transaction completion, we must ensure none of the inodes get written
    back (which is what XFS_ISTALE does).
    
    Unfortunately, xfs_ifree_cluster() has some bugs that lead to inodes not being
    marked with XFS_ISTALE. This shows up when xfs_iflush() is called on these
    inodes either during inode reclaim or tail pushing on the AIL.  The buffer is
    read back, but no longer contains inodes and so triggers assert failures and
    shutdowns. This was reproducable with at run.dbench10 invocation from xfstests.
    
    There are two main causes of xfs_ifree_cluster() failing. The first is simple -
    it checks in-memory inodes it finds in the per-ag icache to see if they are
    clean without holding the flush lock. if they are clean it skips them
    completely. However, If an inode is flushed delwri, it will
    appear clean, but is not guaranteed to be written back until the flush lock has
    been dropped. Hence we may have raced on the clean check and the inode may
    actually be dirty. Hence always mark inodes found in memory stale before we
    check properly if they are clean.
    
    The second is more complex, and makes the first problem easier to hit.
    Basically the in-memory inode scan is done with full knowledge it can be racing
    with inode flushing and AIl tail pushing, which means that inodes that it can't
    get the flush lock on might not be attached to the buffer after then in-memory
    inode scan due to IO completion occurring. This is actually documented in the
    code as "needs better interlocking". i.e. this is a zero-day bug.
    
    Effectively, the in-memory scan must be done while the inode buffer is locked
    and Io cannot be issued on it while we do the in-memory inode scan. This
    ensures that inodes we couldn't get the flush lock on are guaranteed to be
    attached to the cluster buffer, so we can then catch all in-memory inodes and
    mark them stale.
    
    Now that the inode cluster buffer is locked before the in-memory scan is done,
    there is no need for the two-phase update of the in-memory inodes, so simplify
    the code into two loops and remove the allocation of the temporary buffer used
    to hold locked inodes across the phases.
    Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
    Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
    5b257b4a
xfs_inode.c 121 KB