• Dave Chinner's avatar
    xfs: fix unlink vs cluster buffer instantiation race · 348a1983
    Dave Chinner authored
    Luis has been reporting an assert failure when freeing an inode
    cluster during inode inactivation for a while. The assert looks
    like:
    
     XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
     ------------[ cut here ]------------
     kernel BUG at fs/xfs/xfs_message.c:102!
     Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
     CPU: 4 PID: 73 Comm: kworker/4:1 Not tainted 6.10.0-rc1 #4
     Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
     Workqueue: xfs-inodegc/loop5 xfs_inodegc_worker [xfs]
     RIP: 0010:assfail (fs/xfs/xfs_message.c:102) xfs
     RSP: 0018:ffff88810188f7f0 EFLAGS: 00010202
     RAX: 0000000000000000 RBX: ffff88816e748250 RCX: 1ffffffff844b0e7
     RDX: 0000000000000004 RSI: ffff88810188f558 RDI: ffffffffc2431fa0
     RBP: 1ffff11020311f01 R08: 0000000042431f9f R09: ffffed1020311e9b
     R10: ffff88810188f4df R11: ffffffffac725d70 R12: ffff88817a3f4000
     R13: ffff88812182f000 R14: ffff88810188f998 R15: ffffffffc2423f80
     FS:  0000000000000000(0000) GS:ffff8881c8400000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 000055fe9d0f109c CR3: 000000014426c002 CR4: 0000000000770ef0
     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
     PKRU: 55555554
     Call Trace:
      <TASK>
     xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:241 (discriminator 1)) xfs
     xfs_imap_to_bp (fs/xfs/xfs_trans.h:210 fs/xfs/libxfs/xfs_inode_buf.c:138) xfs
     xfs_inode_item_precommit (fs/xfs/xfs_inode_item.c:145) xfs
     xfs_trans_run_precommits (fs/xfs/xfs_trans.c:931) xfs
     __xfs_trans_commit (fs/xfs/xfs_trans.c:966) xfs
     xfs_inactive_ifree (fs/xfs/xfs_inode.c:1811) xfs
     xfs_inactive (fs/xfs/xfs_inode.c:2013) xfs
     xfs_inodegc_worker (fs/xfs/xfs_icache.c:1841 fs/xfs/xfs_icache.c:1886) xfs
     process_one_work (kernel/workqueue.c:3231)
     worker_thread (kernel/workqueue.c:3306 (discriminator 2) kernel/workqueue.c:3393 (discriminator 2))
     kthread (kernel/kthread.c:389)
     ret_from_fork (arch/x86/kernel/process.c:147)
     ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
      </TASK>
    
    And occurs when the the inode precommit handlers is attempt to look
    up the inode cluster buffer to attach the inode for writeback.
    
    The trail of logic that I can reconstruct is as follows.
    
    	1. the inode is clean when inodegc runs, so it is not
    	   attached to a cluster buffer when precommit runs.
    
    	2. #1 implies the inode cluster buffer may be clean and not
    	   pinned by dirty inodes when inodegc runs.
    
    	3. #2 implies that the inode cluster buffer can be reclaimed
    	   by memory pressure at any time.
    
    	4. The assert failure implies that the cluster buffer was
    	   attached to the transaction, but not marked done. It had
    	   been accessed earlier in the transaction, but not marked
    	   done.
    
    	5. #4 implies the cluster buffer has been invalidated (i.e.
    	   marked stale).
    
    	6. #5 implies that the inode cluster buffer was instantiated
    	   uninitialised in the transaction in xfs_ifree_cluster(),
    	   which only instantiates the buffers to invalidate them
    	   and never marks them as done.
    
    Given factors 1-3, this issue is highly dependent on timing and
    environmental factors. Hence the issue can be very difficult to
    reproduce in some situations, but highly reliable in others. Luis
    has an environment where it can be reproduced easily by g/531 but,
    OTOH, I've reproduced it only once in ~2000 cycles of g/531.
    
    I think the fix is to have xfs_ifree_cluster() set the XBF_DONE flag
    on the cluster buffers, even though they may not be initialised. The
    reasons why I think this is safe are:
    
    	1. A buffer cache lookup hit on a XBF_STALE buffer will
    	   clear the XBF_DONE flag. Hence all future users of the
    	   buffer know they have to re-initialise the contents
    	   before use and mark it done themselves.
    
    	2. xfs_trans_binval() sets the XFS_BLI_STALE flag, which
    	   means the buffer remains locked until the journal commit
    	   completes and the buffer is unpinned. Hence once marked
    	   XBF_STALE/XFS_BLI_STALE by xfs_ifree_cluster(), the only
    	   context that can access the freed buffer is the currently
    	   running transaction.
    
    	3. #2 implies that future buffer lookups in the currently
    	   running transaction will hit the transaction match code
    	   and not the buffer cache. Hence XBF_STALE and
    	   XFS_BLI_STALE will not be cleared unless the transaction
    	   initialises and logs the buffer with valid contents
    	   again. At which point, the buffer will be marked marked
    	   XBF_DONE again, so having XBF_DONE already set on the
    	   stale buffer is a moot point.
    
    	4. #2 also implies that any concurrent access to that
    	   cluster buffer will block waiting on the buffer lock
    	   until the inode cluster has been fully freed and is no
    	   longer an active inode cluster buffer.
    
    	5. #4 + #1 means that any future user of the disk range of
    	   that buffer will always see the range of disk blocks
    	   covered by the cluster buffer as not done, and hence must
    	   initialise the contents themselves.
    
    	6. Setting XBF_DONE in xfs_ifree_cluster() then means the
    	   unlinked inode precommit code will see a XBF_DONE buffer
    	   from the transaction match as it expects. It can then
    	   attach the stale but newly dirtied inode to the stale
    	   but newly dirtied cluster buffer without unexpected
    	   failures. The stale buffer will then sail through the
    	   journal and do the right thing with the attached stale
    	   inode during unpin.
    
    Hence the fix is just one line of extra code. The explanation of
    why we have to set XBF_DONE in xfs_ifree_cluster, OTOH, is long and
    complex....
    
    Fixes: 82842fee ("xfs: fix AGF vs inode cluster buffer deadlock")
    Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
    Tested-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
    Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
    Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
    Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
    348a1983
xfs_inode.c 116 KB