    net/mlx4_core: Fix reset flow when in command polling mode · e15ce4b8
    Jack Morgenstein authored
    As part of unloading a device, the driver switches from
    FW command event mode to FW command polling mode.
    
    Switching to polling mode includes freeing the command context array
    memory. Unfortunately, the code currently does not set the command
    context array pointer to NULL afterwards.
    
    In event mode, the reset flow calls "complete" on all outstanding FW
    commands. The reset flow decides whether the driver is in event mode or
    polling mode by testing whether the command context array pointer is NULL.
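    The NULL-pointer check described above can be illustrated with a small
    user-space model. This is a hypothetical sketch, not the mlx4_core code:
    struct cmd_context, struct cmd_state, wake_completions() and
    STATUS_INTERNAL_ERR are illustrative names, and a boolean flag stands in
    for the kernel's struct completion. It only shows that the context array
    pointer doubles as the event-mode indicator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of one command context. The real driver uses a
 * struct completion; here a flag records that complete() was called. */
struct cmd_context {
	bool done;   /* stands in for complete(&context->done) */
	int  status; /* stands in for the FW status handed to the waiter */
};

/* Hypothetical model of the command state. */
struct cmd_state {
	struct cmd_context *context; /* non-NULL only in event mode */
	int max_cmds;
};

#define STATUS_INTERNAL_ERR (-1) /* hypothetical error code */

/* Model of the reset flow: wake every outstanding command, but only when
 * the context array exists, i.e. when the driver is in event mode.
 * Returns the number of contexts that were completed. */
static int wake_completions(struct cmd_state *cmd)
{
	int i, woken = 0;

	if (cmd->context == NULL) /* polling mode: nothing to complete */
		return 0;

	for (i = 0; i < cmd->max_cmds; i++) {
		cmd->context[i].status = STATUS_INTERNAL_ERR;
		cmd->context[i].done = true;
		woken++;
	}
	return woken;
}
```

    Because the check relies only on the pointer value, it is correct only
    if every path that frees the array also clears the pointer.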
    
    If the reset flow is activated after the switch to polling mode, it
    incorrectly attempts to complete all the commands in the context array,
    because the pointer was not set to NULL when the driver switched over to
    polling mode.
    
    As a result, we have a use-after-free situation, which results in a
    kernel crash.
    
    For example:
    BUG: unable to handle kernel NULL pointer dereference at           (null)
    IP: [<ffffffff876c4a8e>] __wake_up_common+0x2e/0x90
    PGD 0
    Oops: 0000 [#1] SMP
    Modules linked in: netconsole nfsv3 nfs_acl nfs lockd grace ...
    CPU: 2 PID: 940 Comm: kworker/2:3 Kdump: loaded Not tainted 3.10.0-862.el7.x86_64 #1
    Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  04/28/2016
    Workqueue: events hv_eject_device_work [pci_hyperv]
    task: ffff8d1734ca0fd0 ti: ffff8d17354bc000 task.ti: ffff8d17354bc000
    RIP: 0010:[<ffffffff876c4a8e>]  [<ffffffff876c4a8e>] __wake_up_common+0x2e/0x90
    RSP: 0018:ffff8d17354bfa38  EFLAGS: 00010082
    RAX: 0000000000000000 RBX: ffff8d17362d42c8 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff8d17362d42c8
    RBP: ffff8d17354bfa70 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000298 R11: ffff8d173610e000 R12: ffff8d17362d42d0
    R13: 0000000000000246 R14: 0000000000000000 R15: 0000000000000003
    FS:  0000000000000000(0000) GS:ffff8d1802680000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 00000000f16d8000 CR4: 00000000001406e0
    Call Trace:
     [<ffffffff876c7adc>] complete+0x3c/0x50
     [<ffffffffc04242f0>] mlx4_cmd_wake_completions+0x70/0x90 [mlx4_core]
     [<ffffffffc041e7b1>] mlx4_enter_error_state+0xe1/0x380 [mlx4_core]
     [<ffffffffc041fa4b>] mlx4_comm_cmd+0x29b/0x360 [mlx4_core]
     [<ffffffffc041ff51>] __mlx4_cmd+0x441/0x920 [mlx4_core]
     [<ffffffff877f62b1>] ? __slab_free+0x81/0x2f0
     [<ffffffff87951384>] ? __radix_tree_lookup+0x84/0xf0
     [<ffffffffc043a8eb>] mlx4_free_mtt_range+0x5b/0xb0 [mlx4_core]
     [<ffffffffc043a957>] mlx4_mtt_cleanup+0x17/0x20 [mlx4_core]
     [<ffffffffc04272c7>] mlx4_free_eq+0xa7/0x1c0 [mlx4_core]
     [<ffffffffc042803e>] mlx4_cleanup_eq_table+0xde/0x130 [mlx4_core]
     [<ffffffffc0433e08>] mlx4_unload_one+0x118/0x300 [mlx4_core]
     [<ffffffffc0434191>] mlx4_remove_one+0x91/0x1f0 [mlx4_core]
    
    The fix is to set the command context array pointer to NULL after freeing
    the array.
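    The free-then-NULL pattern can be sketched as a user-space model. This is
    a hypothetical illustration, not the actual mlx4_core patch:
    enter_event_mode(), enter_polling_mode(), wake_completions() and the
    structs below are illustrative names. The point is that once the pointer
    is cleared after kfree(), a later reset-flow check (modeled on
    mlx4_cmd_wake_completions from the trace) sees NULL and skips the freed
    array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of one command context. */
struct cmd_context {
	bool done; /* stands in for the kernel's struct completion */
};

/* Hypothetical model of the command state. */
struct cmd_state {
	struct cmd_context *context; /* doubles as the event-mode indicator */
	int max_cmds;
	bool use_events;
};

/* Model of entering event mode: allocate one context per command slot.
 * Returns 0 on success, -1 if the allocation fails. */
static int enter_event_mode(struct cmd_state *cmd)
{
	cmd->context = calloc((size_t)cmd->max_cmds, sizeof(*cmd->context));
	if (cmd->context == NULL)
		return -1;
	cmd->use_events = true;
	return 0;
}

/* Model of switching to polling mode with the fix applied: the pointer is
 * cleared right after the array is freed. */
static void enter_polling_mode(struct cmd_state *cmd)
{
	cmd->use_events = false;
	free(cmd->context);
	cmd->context = NULL; /* without this line, context dangles */
}

/* Model of the reset flow: completes contexts only if the array exists. */
static int wake_completions(struct cmd_state *cmd)
{
	int i;

	if (cmd->context == NULL)
		return 0;
	for (i = 0; i < cmd->max_cmds; i++)
		cmd->context[i].done = true;
	return cmd->max_cmds;
}
```

    Usage: call enter_event_mode(), then enter_polling_mode(), then
    wake_completions(). With the NULL assignment in place, the last call
    returns 0 instead of writing into freed memory.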
    
    Fixes: f5aef5aa ("net/mlx4_core: Activate reset flow upon fatal command cases")
    Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
    Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>