• Dave Chinner's avatar
    xfs: fix ordering violation between cache flushes and tail updates · 0dc8f7f1
    Dave Chinner authored
    There is a race between the new CIL async data device metadata IO
    completion cache flush and the log tail in the iclog the flush
    covers being updated. This can be seen by repeating generic/482 in a
    loop and eventually log recovery fails with a failures such as this:
    
    XFS (dm-3): Starting recovery (logdev: internal)
    XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
    XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
    XFS (dm-3): Unmount and run xfs_repair
    XFS (dm-3): First 128 bytes of corrupted metadata buffer:
    00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117
    
    Analysis of the logwrite replay shows that there were no writes to
    the data device between the FUA @ write 124 and the FUA at write @
    125, but log recovery @ 125 failed. The difference was the one log
    write @ 125 moved the tail of the log forwards from (1,8) to (1,32)
    and so the inode create intent in (1,8) was not replayed and so the
    inode cluster was zero on disk when replay of the first inode item
    in (1,32) was attempted.
    
    What this meant was that the journal write that occurred at @ 125
    did not ensure that metadata completed before the iclog was written
    was correctly on stable storage. The tail of the log moved forward,
    so IO must have been completed between the two iclog writes. This
    means that there is a race condition between the unconditional async
    cache flush in the CIL push work and the tail LSN that is written to
    the iclog. This happens like so:
    
    CIL push work				AIL push work
    -------------				-------------
    Add to committing list
    start async data dev cache flush
    .....
    <flush completes>
    <all writes to old tail lsn are stable>
    xlog_write
      ....					push inode create buffer
    					<start IO>
    					.....
    xlog_write(commit record)
      ....					<IO completes>
      					log tail moves
      					  xlog_assign_tail_lsn()
    start_lsn == commit_lsn
      <no iclog preflush!>
    xlog_state_release_iclog
      __xlog_state_release_iclog()
        <writes *new* tail_lsn into iclog>
      xlog_sync()
        ....
        submit_bio()
    <tail in log moves forward without flushing written metadata>
    
    Essentially, this can only occur if the commit iclog is issued
    without a cache flush. If the iclog bio is submitted with
    REQ_PREFLUSH, then it will guarantee that all the completed IO is
    one stable storage before the iclog bio with the new tail LSN in it
    is written to the log.
    
    IOWs, the tail lsn that is written to the iclog needs to be sampled
    *before* we issue the cache flush that guarantees all IO up to that
    LSN has been completed.
    
    To fix this without giving up the performance advantage of the
    flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
    compared to 5.13), we need to ensure that we always issue a cache
    flush if the tail LSN changes between the initial async flush and
    the commit record being written. THis requires sampling the tail_lsn
    before we start the flush, and then passing the sampled tail LSN to
    xlog_state_release_iclog() so it can determine if the the tail LSN
    has changed while writing the checkpoint. If the tail LSN has
    changed, then it needs to set the NEED_FLUSH flag on the iclog and
    we'll issue another cache flush before writing the iclog.
    
    Fixes: eef983ff ("xfs: journal IO cache flush reductions")
    Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
    Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
    Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
    Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
    0dc8f7f1
xfs_log_cil.c 41.6 KB