    xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight · 01728b44
    Dave Chinner authored
    I've been chasing a recent resurgence in generic/388 recovery
    failure and/or corruption events. The events have largely been
    uninitialised inode chunks being tripped over in log recovery
    such as:
    
     XFS (pmem1): User initiated shutdown received.
     pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
     XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500).  Shutting down filesystem.
     XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
     XFS (pmem1): Unmounting Filesystem
     XFS (pmem1): Mounting V5 Filesystem
     XFS (pmem1): Starting recovery (logdev: internal)
     XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818)
     XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify
     XFS (pmem1): Unmount and run xfs_repair
     XFS (pmem1): First 128 bytes of corrupted metadata buffer:
     00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117
     XFS (pmem1): log mount/recovery failed: error -117
     XFS (pmem1): log mount failed
    
    There have been isolated random other issues, too - xfs_repair fails
    because it finds some corruption in symlink blocks, rmap
    inconsistencies, etc - but they are nowhere near as common as the
    uninitialised inode chunk failure.
    
    The problem has clearly happened at runtime before recovery has run;
    I can see the ICREATE log item in the log shortly before the
    actively recovered range of the log. This means the ICREATE was
    definitely created and written to the log, but for some reason the
    tail of the log has been moved past the ordered buffer log item that
    tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the
    log moving past the ICREATE log item before the inode chunk buffer
    is written to disk.
    
    Tracing the fsstress processes that are running when the filesystem
    shut down immediately pinpointed the problem:
    
    user shutdown marks xfs_mount as shutdown
    
             godown-213341 [008]  6398.022871: console:              [ 6397.915392] XFS (pmem1): User initiated shutdown received.
    .....
    
    aild tries to push ordered inode cluster buffer
    
      xfsaild/pmem1-213314 [001]  6398.022974: xfs_buf_trylock:      dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e
      xfsaild/pmem1-213314 [001]  6398.022976: xfs_ilock_nowait:     dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae
    
    xfs_iflush_cluster() checks xfs_is_shutdown(), returns true,
    calls xfs_iflush_abort() to kill writeback of the inode.
    Inode is removed from AIL, drops cluster buffer reference.
    
      xfsaild/pmem1-213314 [001]  6398.022977: xfs_ail_delete:       dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL
      xfsaild/pmem1-213314 [001]  6398.022978: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7
    
    .....
    
    All inodes on cluster buffer are aborted, then the cluster buffer
    itself is aborted and removed from the AIL *without writeback*:
    
    xfsaild/pmem1-213314 [001]  6398.023011: xfs_buf_error_relse:  dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33
       xfsaild/pmem1-213314 [001]  6398.023012: xfs_ail_delete:       dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL
    
    The inode buffer was at 7/20344 when it was removed from the AIL.
    
       xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_item_relse:   dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31
       xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39
    
    .....
    
    Userspace is still running, doing stuff. An fsstress process runs
    syncfs() or sync() and we end up in sync_fs_one_sb() which issues
    a log force. This pushes on the CIL:
    
            fsstress-213322 [001]  6398.024430: xfs_fs_sync_fs:       dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26
            fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82
            fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x5f caller xfs_log_force+0x7c
               <...>-194402 [001]  6398.024467: kmem_alloc:           size 176 flags 0x14 caller xlog_cil_push_work+0x9f
    
    And the CIL fills up iclogs with pending changes. This picks up
    the current tail from the AIL:
    
               <...>-194402 [001]  6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags  caller xlog_write+0x149
               <...>-194402 [001]  6398.024498: xlog_iclog_switch:    dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags  caller xlog_state_get_iclog_space+0x37e
               <...>-194402 [001]  6398.024521: xlog_iclog_release:   dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags  caller xlog_write+0x5f9
               <...>-194402 [001]  6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448
    
    And it moves the tail of the log to 7/21000 from 7/20344. This
    *moves the tail of the log beyond the ICREATE transaction* that was
    at 7/20344 and pinned by the inode cluster buffer that was cancelled
    above.
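To make the LSN arithmetic in the traces concrete, here is a minimal sketch of how a raw 64-bit LSN splits into the cycle/block pairs printed by the tracepoints (cycle in the high 32 bits, block in the low 32 bits). The helper names (`lsn_cycle`, `lsn_block`, `lsn_make`, `lsn_cmp`, `tail_has_passed`) are hypothetical and exist only for this illustration; they are not the kernel's own macros.

```c
#include <stdint.h>

/* Illustrative LSN type: cycle number in the high 32 bits, block in the low 32. */
typedef int64_t lsn_t;

/* Extract the cycle number (the "7" in 7/20344). */
static inline uint32_t lsn_cycle(lsn_t lsn)
{
	return (uint32_t)((uint64_t)lsn >> 32);
}

/* Extract the block number (the "20344" in 7/20344). */
static inline uint32_t lsn_block(lsn_t lsn)
{
	return (uint32_t)((uint64_t)lsn & 0xffffffffu);
}

/* Build an LSN from a cycle/block pair. */
static inline lsn_t lsn_make(uint32_t cycle, uint32_t block)
{
	return (lsn_t)(((uint64_t)cycle << 32) | block);
}

/* Compare two LSNs: negative, zero or positive, like memcmp(). */
static inline int lsn_cmp(lsn_t a, lsn_t b)
{
	if (lsn_cycle(a) != lsn_cycle(b))
		return lsn_cycle(a) < lsn_cycle(b) ? -1 : 1;
	if (lsn_block(a) != lsn_block(b))
		return lsn_block(a) < lsn_block(b) ? -1 : 1;
	return 0;
}

/*
 * True if a log tail at 'tail' lies strictly beyond an item logged at
 * 'item', i.e. recovery starting from 'tail' would not replay that item.
 */
static inline int tail_has_passed(lsn_t tail, lsn_t item)
{
	return lsn_cmp(tail, item) > 0;
}
```

With these helpers, the iclog LSN `0x700005408` from the trace decodes to cycle 7, block 21512, and `tail_has_passed(lsn_make(7, 21000), lsn_make(7, 20344))` is true: the new tail lies beyond the LSN the inode cluster buffer was holding.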
    
    ....
    
             godown-213341 [008]  6398.027005: xfs_force_shutdown:   dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500
              godown-213341 [008]  6398.027022: console:              [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
              godown-213341 [008]  6398.030551: console:              [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/
    
    And finally the log itself is now shut down, stopping all further
    writes to the log. But this is too late to prevent the corruption
    that moving the tail of the log forwards after we start cancelling
    writeback causes.
    
    The fundamental problem here is that we are using the wrong shutdown
    checks for log items. We've long conflated mount shutdown with log
    shutdown state, and I started separating that recently with the
    atomic shutdown state changes in commit b36d4651 ("xfs: make
    forced shutdown processing atomic"). The changes in that commit
    series are directly responsible for being able to diagnose this
    issue because it clearly separated mount shutdown from log shutdown.
    
    Essentially, once we start cancelling writeback of log items and
    removing them from the AIL because the filesystem is shut down, we
    *cannot* update the journal because we may have cancelled the items
    that pin the tail of the log. That moves the tail of the log
    forwards without having written the metadata back, hence we have
    corrupt in-memory state, and writing to the journal propagates that
    to the on-disk state.
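The failure mode described above can be sketched with a toy userspace model of the AIL. This is only an illustration under simplifying assumptions: every `toy_*` name is hypothetical, LSNs are reduced to plain block numbers within one cycle, and the real AIL is a sorted list protected by locks rather than an array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy log item: one entry in a simplified AIL. */
struct toy_item {
	uint64_t lsn;      /* log position of the item's oldest change */
	bool     in_ail;   /* still pinning the tail of the log? */
	bool     written;  /* metadata written back to its final location? */
};

/* The tail is the lowest LSN still in the AIL, or the head if the AIL is empty. */
static uint64_t toy_tail_lsn(const struct toy_item *items, size_t n,
			     uint64_t head)
{
	uint64_t tail = head;

	for (size_t i = 0; i < n; i++)
		if (items[i].in_ail && items[i].lsn < tail)
			tail = items[i].lsn;
	return tail;
}

/* Normal completion: metadata reaches disk, then the item leaves the AIL. */
static void toy_writeback_done(struct toy_item *it)
{
	it->written = true;
	it->in_ail = false;
}

/* Abort: the item leaves the AIL, but its metadata never reaches disk. */
static void toy_abort(struct toy_item *it)
{
	it->in_ail = false;
}

/*
 * Count items whose metadata is not on disk and whose log records sit
 * before 'tail'. If 'tail' is written to the journal, recovery will not
 * replay these items, so their on-disk metadata stays uninitialised.
 */
static size_t toy_lost_items(const struct toy_item *items, size_t n,
			     uint64_t tail)
{
	size_t lost = 0;

	for (size_t i = 0; i < n; i++)
		if (!items[i].written && items[i].lsn < tail)
			lost++;
	return lost;
}
```

With one item at 20344 and another at 21000, aborting the first moves `toy_tail_lsn()` from 20344 to 21000, and `toy_lost_items()` reports one item that recovery can no longer reach. Completing the first item via `toy_writeback_done()` instead moves the tail the same way but loses nothing.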
    
    What commit b36d4651 makes clear is that log item state needs to
    change relative to log shutdown, not mount shutdown. IOWs, anything
    that aborts metadata writeback needs to check log shutdown state
    because log items directly affect log consistency. Having them check
    mount shutdown state introduces the above race condition where we
    cancel metadata writeback before the log shuts down.
    
    To fix this, this patch works through all log items and converts
    shutdown checks to use xlog_is_shutdown() rather than
    xfs_is_shutdown(), so that we don't start aborting metadata
    writeback before we shut off journal writes.
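As a rough sketch of why the choice of check matters, the following toy model tracks mount shutdown and log shutdown as two separate flags and compares the two abort policies. It is not the patch itself: all `toy_*` names are hypothetical, and the real state lives in opstate bitfields on the mount and log structures rather than in plain booleans.

```c
#include <stdbool.h>

/* Toy stand-ins for the log and mount, each with its own shutdown flag. */
struct toy_log {
	bool shutdown;
};

struct toy_mount {
	bool           shutdown;
	struct toy_log log;
};

static bool toy_mount_is_shutdown(const struct toy_mount *mp)
{
	return mp->shutdown;
}

static bool toy_log_is_shutdown(const struct toy_log *log)
{
	return log->shutdown;
}

/* The journal keeps accepting writes (and tail updates) until the log shuts down. */
static bool toy_journal_writable(const struct toy_mount *mp)
{
	return !toy_log_is_shutdown(&mp->log);
}

/* Old policy: abort metadata writeback as soon as the mount is shut down. */
static bool toy_abort_writeback_old(const struct toy_mount *mp)
{
	return toy_mount_is_shutdown(mp);
}

/* New policy: abort metadata writeback only once the log is shut down. */
static bool toy_abort_writeback_new(const struct toy_mount *mp)
{
	return toy_log_is_shutdown(&mp->log);
}

/*
 * The dangerous state: writeback is being aborted while the journal can
 * still be written, so an updated tail may reach the on-disk log.
 */
static bool toy_race_window(bool aborting, const struct toy_mount *mp)
{
	return aborting && toy_journal_writable(mp);
}
```

In the window where the mount is marked shut down but the log is not, `toy_race_window(toy_abort_writeback_old(&mp), &mp)` is true while the same call with `toy_abort_writeback_new()` is false. Once the log flag is set, neither policy can hit the window, because the journal no longer accepts writes.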
    
    AFAICT, this race condition is a zero-day IO error handling bug in
    XFS that dates back to the introduction of XLOG_IO_ERROR,
    XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>