    scsi: iscsi: Fix deadlock on recovery path during GFP_IO reclaim · 7e7cd796
    Gabriel Krisman Bertazi authored
    iSCSI can deadlock when a management command submitted via the netlink
    socket sleeps on an allocation while holding rx_queue_mutex, and that
    allocation triggers a memory reclaim that writes back to a failed iSCSI
    device.  The recovery procedure can then never make progress to recover
    the failed disk or abort the outstanding IO operations needed to complete
    the reclaim (since rx_queue_mutex is held), thus locking up the system.
    
    Nevertheless, just marking all allocations under rx_queue_mutex as GFP_NOIO
    (or locking the userspace process with something like PF_MEMALLOC_NOIO) is
    not enough, since the iSCSI command code relies on other subsystems that
    try to grab locked mutexes, whose threads are GFP_IO, leading to the same
    deadlock. One instance where this situation can be observed is in the
    backtraces below, stitched from multiple bug reports, involving the kobj
    uevent sent when a session is created.
    
    The root of the problem is not the fact that iSCSI does GFP_IO allocations;
    that is acceptable. The actual problem is that rx_queue_mutex has a very
    large granularity, covering every unrelated netlink command execution at
    the same time as the error recovery path.
    
    The proposed fix leverages the recently added mechanism to stop failed
    connections from the kernel, by enabling it to execute even though a
    management command from the netlink socket is being run (rx_queue_mutex is
    held), provided that the command is known to be safe.  It splits the
    rx_queue_mutex into two mutexes: one protecting against concurrent command
    execution from the netlink socket, and one protecting stop_conn from racing
    with other connection management operations that might conflict with it.
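
    The lock split described above can be illustrated with a minimal
    userspace sketch using pthreads. This is not the kernel code: the name
    conn_mutex, the helper functions, and the conn_stopped flag are all
    assumptions made for illustration. It only shows that a recovery worker
    which needs just the second mutex can finish while a "safe" command is
    still holding rx_queue_mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Serializes management commands arriving from the netlink socket. */
static pthread_mutex_t rx_queue_mutex = PTHREAD_MUTEX_INITIALIZER;
/* Hypothetical name: keeps stop_conn from racing with connection ops. */
static pthread_mutex_t conn_mutex = PTHREAD_MUTEX_INITIALIZER;

static bool conn_stopped;

/* Recovery path: takes only conn_mutex, never rx_queue_mutex. */
static void *stop_conn_worker(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&conn_mutex);
    conn_stopped = true;
    pthread_mutex_unlock(&conn_mutex);
    return NULL;
}

/* A command that conflicts with stop_conn takes both locks, in a fixed order. */
static void destroy_conn_cmd(void)
{
    pthread_mutex_lock(&rx_queue_mutex);
    pthread_mutex_lock(&conn_mutex);
    /* ... connection teardown would happen here ... */
    pthread_mutex_unlock(&conn_mutex);
    pthread_mutex_unlock(&rx_queue_mutex);
}

/*
 * A "safe" command (think session creation) holds only rx_queue_mutex.
 * While it is held, the recovery worker is started and joined; the join
 * returns because the worker does not need rx_queue_mutex.
 * Returns true if recovery completed while the command still held its lock.
 */
static bool create_session_cmd_with_recovery(void)
{
    pthread_t worker;
    bool done;
    int rc;

    pthread_mutex_lock(&rx_queue_mutex);
    /* Pretend the command is stalled in an allocation at this point. */
    rc = pthread_create(&worker, NULL, stop_conn_worker, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(worker, NULL);
    done = conn_stopped;
    pthread_mutex_unlock(&rx_queue_mutex);
    return done;
}
```

    Calling create_session_cmd_with_recovery() from a small main() and
    linking with -pthread should return true. With a single coarse mutex
    shared by both paths, the join would never return.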
    
    It is not very pretty, but it is the simplest way to resolve the deadlock.
    I considered making it a lock per connection, but some external mutex would
    still be needed to deal with iscsi_if_destroy_conn.
    
    The patch was tested by forcing a memory shrinker (unrelated to iSCSI;
    bufio/dm-verity was used) to reclaim iSCSI pages every time
    ISCSI_UEVENT_CREATE_SESSION happens, which reasonably simulates the
    reclaims that might happen with GFP_KERNEL on that path.  Then, a faulty,
    hung target causes a connection to fail during intensive IO, at the same
    time as a new session is added by iscsid.
    
    The following stacktraces are stitched from several bug reports, showing a
    case where the deadlock can happen.
    
     iSCSI-write
             holding: rx_queue_mutex
             waiting: uevent_sock_mutex
    
             kobject_uevent_env+0x1bd/0x419
             kobject_uevent+0xb/0xd
             device_add+0x48a/0x678
             scsi_add_host_with_dma+0xc5/0x22d
             iscsi_host_add+0x53/0x55
             iscsi_sw_tcp_session_create+0xa6/0x129
             iscsi_if_rx+0x100/0x1247
             netlink_unicast+0x213/0x4f0
             netlink_sendmsg+0x230/0x3c0
    
     iscsi_fail iscsi_conn_failure
             waiting: rx_queue_mutex
    
             schedule_preempt_disabled+0x325/0x734
             __mutex_lock_slowpath+0x18b/0x230
             mutex_lock+0x22/0x40
             iscsi_conn_failure+0x42/0x149
             worker_thread+0x24a/0xbc0
    
     EventManager_
             holding: uevent_sock_mutex
             waiting: dm_bufio_client->lock
    
             dm_bufio_lock+0xe/0x10
             shrink+0x34/0xf7
             shrink_slab+0x177/0x5d0
             do_try_to_free_pages+0x129/0x470
             try_to_free_mem_cgroup_pages+0x14f/0x210
             memcg_kmem_newpage_charge+0xa6d/0x13b0
             __alloc_pages_nodemask+0x4a3/0x1a70
             fallback_alloc+0x1b2/0x36c
             __kmalloc_node_track_caller+0xb9/0x10d0
             __alloc_skb+0x83/0x2f0
             kobject_uevent_env+0x26b/0x419
             dm_kobject_uevent+0x70/0x79
             dev_suspend+0x1a9/0x1e7
             ctl_ioctl+0x3e9/0x411
             dm_ctl_ioctl+0x13/0x17
             do_vfs_ioctl+0xb3/0x460
             SyS_ioctl+0x5e/0x90
    
     MemcgReclaimerD
             holding: dm_bufio_client->lock
             waiting: stuck io to finish (needs iscsi_fail thread to progress)
    
             schedule at ffffffffbd603618
             io_schedule at ffffffffbd603ba4
             do_io_schedule at ffffffffbdaf0d94
             __wait_on_bit at ffffffffbd6008a6
             out_of_line_wait_on_bit at ffffffffbd600960
             wait_on_bit.constprop.10 at ffffffffbdaf0f17
             __make_buffer_clean at ffffffffbdaf18ba
             __cleanup_old_buffer at ffffffffbdaf192f
             shrink at ffffffffbdaf19fd
             do_shrink_slab at ffffffffbd6ec000
             shrink_slab at ffffffffbd6ec24a
             do_try_to_free_pages at ffffffffbd6eda09
             try_to_free_mem_cgroup_pages at ffffffffbd6ede7e
             mem_cgroup_resize_limit at ffffffffbd7024c0
             mem_cgroup_write at ffffffffbd703149
             cgroup_file_write at ffffffffbd6d9c6e
             sys_write at ffffffffbd6662ea
             system_call_fastpath at ffffffffbdbc34a2
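
    The four traces above form a circular wait. As a rough illustration,
    the sketch below encodes "who is blocked on whom" as read from the
    holding/waiting annotations and walks the chain to find the cycle. The
    enum names and the in_wait_cycle() helper are hypothetical and exist
    only for this example.

```c
#include <assert.h>
#include <stdbool.h>

enum task {
    NOBODY = -1,
    ISCSI_WRITE,      /* holds rx_queue_mutex, waits uevent_sock_mutex */
    ISCSI_FAIL,       /* waits rx_queue_mutex */
    EVENT_MANAGER,    /* holds uevent_sock_mutex, waits dm_bufio lock */
    MEMCG_RECLAIMER,  /* holds dm_bufio lock, waits for stuck IO */
    NTASKS
};

/* blocked_on[t] is the task that must make progress before t can run. */
static const int blocked_on[NTASKS] = {
    [ISCSI_WRITE]     = EVENT_MANAGER,
    [ISCSI_FAIL]      = ISCSI_WRITE,
    [EVENT_MANAGER]   = MEMCG_RECLAIMER,
    [MEMCG_RECLAIMER] = ISCSI_FAIL,
};

/* Follow the wait chain from start; true if it leads back to start. */
static bool in_wait_cycle(const int *graph, int start)
{
    int cur = graph[start];

    for (int steps = 0; steps < NTASKS && cur != NOBODY; steps++) {
        if (cur == start)
            return true;
        cur = graph[cur];
    }
    return false;
}
```

    If the recovery thread no longer waits on rx_queue_mutex (its entry
    becomes NOBODY), the chain ends there and no task is part of a cycle.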
    
    Link: https://lore.kernel.org/r/20200520022959.1912856-1-krisman@collabora.com
    Reported-by: Khazhismel Kumykov <khazhy@google.com>
    Reviewed-by: Lee Duncan <lduncan@suse.com>
    Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
scsi_transport_iscsi.c 144 KB