    xfs: drop xfarray sortinfo folio on error · 2b3f004d
    Darrick J. Wong authored
    Chandan Babu reports the following livelock in xfs/708:
    
     run fstests xfs/708 at 2024-05-04 15:35:29
     XFS (loop16): EXPERIMENTAL online scrub feature in use. Use at your own risk!
     XFS (loop5): Mounting V5 Filesystem e96086f0-a2f9-4424-a1d5-c75d53d823be
     XFS (loop5): Ending clean mount
     XFS (loop5): Quotacheck needed: Please wait.
     XFS (loop5): Quotacheck: Done.
     XFS (loop5): EXPERIMENTAL online scrub feature in use. Use at your own risk!
     INFO: task xfs_io:143725 blocked for more than 122 seconds.
           Not tainted 6.9.0-rc4+ #1
     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
     task:xfs_io          state:D stack:0     pid:143725 tgid:143725 ppid:117661 flags:0x00004006
     Call Trace:
      <TASK>
      __schedule+0x69c/0x17a0
      schedule+0x74/0x1b0
      io_schedule+0xc4/0x140
      folio_wait_bit_common+0x254/0x650
      shmem_undo_range+0x9d5/0xb40
      shmem_evict_inode+0x322/0x8f0
      evict+0x24e/0x560
      __dentry_kill+0x17d/0x4d0
      dput+0x263/0x430
      __fput+0x2fc/0xaa0
      task_work_run+0x132/0x210
      get_signal+0x1a8/0x1910
      arch_do_signal_or_restart+0x7b/0x2f0
      syscall_exit_to_user_mode+0x1c2/0x200
      do_syscall_64+0x72/0x170
      entry_SYSCALL_64_after_hwframe+0x76/0x7e
    
    The shmem code is trying to drop all the folios attached to a shmem
    file and gets stuck on a locked folio after a bnobt repair.  It looks
    like the process has a signal pending, so I started looking for places
    where we lock an xfile folio and then deal with a fatal signal.
    
    I found a bug in xfarray_sort_scan via code inspection.  This function
    is called to set up the scanning phase of a quicksort operation, which
    may involve grabbing a locked xfile folio.  If we exit the function with
    an error code, the caller does not call xfarray_sort_scan_done to put
    the xfile folio.  If _sort_scan returns an error code while si->folio is
    set, we leak the reference and never unlock the folio.
    
    Therefore, change xfarray_sort to call _scan_done on exit.  This is safe
    to call multiple times because it sets si->folio to NULL and ignores a
    NULL si->folio.  Also change _sort_scan to use an intermediate variable
    so that we never pollute si->folio with an errptr.
    
    Fixes: 232ea052 ("xfs: enable sorting of xfile-backed arrays")
    Reported-by: Chandan Babu R <chandanbabu@kernel.org>
    Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
xfarray.c 27.7 KB