    migrate_pages: fix deadlock in batched migration · fb3592c4
    Huang Ying authored
    Patch series "migrate_pages: fix deadlock in batched synchronous
    migration", v2.
    
    Two deadlock bugs were reported for the migrate_pages() batching series.
    Thanks Hugh and Pengfei.  Analysis shows that if we have locked folios
    other than the one we are migrating, it is in general not safe to wait
    synchronously, for example, for writeback to complete or for a buffer
    head lock.
    
    So 1/3 fixes the deadlock in a simple way by disabling batching for
    synchronous migration.  The change is straightforward and easy to
    understand.  3/3 then re-introduces batching for synchronous migration by
    first optimistically migrating the whole batch asynchronously, then
    falling back to synchronous migration, one folio at a time, for the folios
    that failed.  Testing shows that this effectively restores the TLB flush
    batching performance for synchronous migration.
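    The two-pass strategy of 3/3 can be modeled in userspace.  This is a
    minimal sketch, not the kernel implementation: sketch_folio,
    sketch_migrate_one(), sketch_migrate_pages() and the SKETCH_* modes are
    hypothetical names, and "waiting for writeback" is reduced to clearing a
    flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for migration modes; not the kernel's enum migrate_mode. */
enum sketch_mode { SKETCH_ASYNC, SKETCH_SYNC };

/* Hypothetical, heavily simplified model of a folio. */
struct sketch_folio {
	bool under_writeback;	/* writeback in progress: sync mode would wait */
	bool migrated;
};

/*
 * Migrate one folio. Async mode never blocks: a folio under writeback is
 * reported as failed. Sync mode models waiting for writeback to complete,
 * which is only safe when the caller holds no other folio locks.
 */
static bool sketch_migrate_one(struct sketch_folio *f, enum sketch_mode mode)
{
	if (f->under_writeback) {
		if (mode == SKETCH_ASYNC)
			return false;		/* would block: give up for now */
		f->under_writeback = false;	/* models the synchronous wait */
	}
	f->migrated = true;
	return true;
}

/*
 * Pass 1 tries the whole batch asynchronously (where batched TLB flushing
 * would pay off). Pass 2 retries only the failures, synchronously and one
 * folio at a time, so no other folio is locked while waiting.
 * Returns the number of folios left unmigrated.
 */
static size_t sketch_migrate_pages(struct sketch_folio *folios, size_t n)
{
	size_t failed = 0;

	for (size_t i = 0; i < n; i++)
		sketch_migrate_one(&folios[i], SKETCH_ASYNC);

	for (size_t i = 0; i < n; i++) {
		if (folios[i].migrated)
			continue;
		if (!sketch_migrate_one(&folios[i], SKETCH_SYNC))
			failed++;
	}
	return failed;
}
```

    In this model only the folio under writeback reaches the synchronous
    pass, and it is handled alone, which is the property the fallback relies
    on.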
    
    
    This patch (of 3):
    
    Two deadlock bugs were reported for the migrate_pages() batching series.
    Thanks Hugh and Pengfei!  For example, consider the following deadlock
    trace snippet:
    
     INFO: task kworker/u4:0:9 blocked for more than 147 seconds.
           Not tainted 6.2.0-rc4-kvm+ #1314
     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
     task:kworker/u4:0    state:D stack:0     pid:9     ppid:2      flags:0x00004000
     Workqueue: loop4 loop_rootcg_workfn
     Call Trace:
      <TASK>
      __schedule+0x43b/0xd00
      schedule+0x6a/0xf0
      io_schedule+0x4a/0x80
      folio_wait_bit_common+0x1b5/0x4e0
      ? __pfx_wake_page_function+0x10/0x10
      __filemap_get_folio+0x73d/0x770
      shmem_get_folio_gfp+0x1fd/0xc80
      shmem_write_begin+0x91/0x220
      generic_perform_write+0x10e/0x2e0
      __generic_file_write_iter+0x17e/0x290
      ? generic_write_checks+0x12b/0x1a0
      generic_file_write_iter+0x97/0x180
      ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
      do_iter_readv_writev+0x13c/0x210
      ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
      do_iter_write+0xf6/0x330
      vfs_iter_write+0x46/0x70
      loop_process_work+0x723/0xfe0
      loop_rootcg_workfn+0x28/0x40
      process_one_work+0x3cc/0x8d0
      worker_thread+0x66/0x630
      ? __pfx_worker_thread+0x10/0x10
      kthread+0x153/0x190
      ? __pfx_kthread+0x10/0x10
      ret_from_fork+0x29/0x50
      </TASK>
    
     INFO: task repro:1023 blocked for more than 147 seconds.
           Not tainted 6.2.0-rc4-kvm+ #1314
     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
     task:repro           state:D stack:0     pid:1023  ppid:360    flags:0x00004004
     Call Trace:
      <TASK>
      __schedule+0x43b/0xd00
      schedule+0x6a/0xf0
      io_schedule+0x4a/0x80
      folio_wait_bit_common+0x1b5/0x4e0
      ? compaction_alloc+0x77/0x1150
      ? __pfx_wake_page_function+0x10/0x10
      folio_wait_bit+0x30/0x40
      folio_wait_writeback+0x2e/0x1e0
      migrate_pages_batch+0x555/0x1ac0
      ? __pfx_compaction_alloc+0x10/0x10
      ? __pfx_compaction_free+0x10/0x10
      ? __this_cpu_preempt_check+0x17/0x20
      ? lock_is_held_type+0xe6/0x140
      migrate_pages+0x100e/0x1180
      ? __pfx_compaction_free+0x10/0x10
      ? __pfx_compaction_alloc+0x10/0x10
      compact_zone+0xe10/0x1b50
      ? lock_is_held_type+0xe6/0x140
      ? check_preemption_disabled+0x80/0xf0
      compact_node+0xa3/0x100
      ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
      ? _find_first_bit+0x7b/0x90
      sysctl_compaction_handler+0x5d/0xb0
      proc_sys_call_handler+0x29d/0x420
      proc_sys_write+0x2b/0x40
      vfs_write+0x3a3/0x780
      ksys_write+0xb7/0x180
      __x64_sys_write+0x26/0x30
      do_syscall_64+0x3b/0x90
      entry_SYSCALL_64_after_hwframe+0x72/0xdc
     RIP: 0033:0x7f3a2471f59d
     RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
     RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
     RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
     RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
     R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
     R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
      </TASK>
    
    The page migration task holds the lock of shmem folio A and is waiting
    for writeback of folio B, which belongs to the file system on the loop
    block device, to complete.  Meanwhile, the loop worker task that writes
    back folio B is waiting to lock shmem folio A, because folio A backs
    folio B in the loop device.  Thus, the two tasks deadlock.
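    The dependency above is a cycle in a wait-for graph: the migration task
    waits on the loop worker (to finish writeback of folio B), and the loop
    worker waits on the migration task (to release folio A).  A minimal
    sketch of that reasoning, with hypothetical task ids and a hypothetical
    sketch_has_deadlock() helper; each task waits on at most one other task.

```c
#include <stdbool.h>

#define SKETCH_NTASKS 2

/* Hypothetical task ids for the two tasks in the trace above. */
enum { TASK_MIGRATE = 0, TASK_LOOP_WORKER = 1 };

/*
 * waits_for[x] is the task that x is blocked on, or -1 if x can make
 * progress. Because every task waits on at most one other task, following
 * at most SKETCH_NTASKS edges from a start task is enough to find any
 * cycle through it.
 */
static bool sketch_has_deadlock(const int waits_for[SKETCH_NTASKS])
{
	for (int start = 0; start < SKETCH_NTASKS; start++) {
		int cur = start;

		for (int step = 0; step < SKETCH_NTASKS; step++) {
			cur = waits_for[cur];
			if (cur < 0)
				break;		/* chain ends at a runnable task */
			if (cur == start)
				return true;	/* back where we began: cycle */
		}
	}
	return false;
}
```

    With batched synchronous migration the table is { TASK_LOOP_WORKER,
    TASK_MIGRATE } and a cycle is found; if the migration task does not wait
    (as in MIGRATE_ASYNC mode), its entry becomes -1 and the cycle is broken.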
    
    In general, if we have locked folios other than the one we are
    migrating, it is not safe to wait synchronously, for example, for
    writeback to complete or for a buffer head lock.
    
    To fix the deadlock, this patch batches page migration only in
    MIGRATE_ASYNC mode, in which synchronous waiting is avoided.
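    The rule can be pictured as a choice of batch size by migration mode.
    This is a hypothetical sketch, not the code in this patch: the
    SKETCH_MIGRATE_* names, SKETCH_MAX_BATCH and the helper functions are
    illustrative assumptions.

```c
#include <stddef.h>

/* Hypothetical mirror of the migration modes; names are illustrative only. */
enum sketch_migrate_mode {
	SKETCH_MIGRATE_ASYNC,
	SKETCH_MIGRATE_SYNC_LIGHT,
	SKETCH_MIGRATE_SYNC,
};

/* Hypothetical upper bound on the number of folios in one batch. */
#define SKETCH_MAX_BATCH 512

/*
 * Only async mode batches: it never waits synchronously, so holding several
 * folio locks at once does not risk the wait cycle described above. Every
 * other mode uses a batch of one, so at most one folio is locked when it
 * waits.
 */
static size_t sketch_batch_size(enum sketch_migrate_mode mode)
{
	return mode == SKETCH_MIGRATE_ASYNC ? SKETCH_MAX_BATCH : 1;
}

/* Number of batches needed to process nr_folios folios in the given mode. */
static size_t sketch_nr_batches(size_t nr_folios, enum sketch_migrate_mode mode)
{
	size_t batch = sketch_batch_size(mode);

	return (nr_folios + batch - 1) / batch;
}
```

    The cost of this simple fix is visible in the model: 1000 folios take 2
    batches in async mode but 1000 in synchronous modes, which is what 3/3
    sets out to recover.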
    
    The fix can be improved further.  We will do that as soon as possible.
    
    Link: https://lkml.kernel.org/r/20230303030155.160983-1-ying.huang@intel.com
    Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/
    Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/
    Link: https://lore.kernel.org/linux-mm/20230227110614.dngdub2j3exr6dfp@quack3/
    Link: https://lkml.kernel.org/r/20230303030155.160983-2-ying.huang@intel.com
    Fixes: 5dfab109 ("migrate_pages: batch _unmap and _move")
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reported-by: Hugh Dickins <hughd@google.com>
    Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Stefan Roesch <shr@devkernel.io>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Xin Hao <xhao@linux.alibaba.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>