    RDMA/hns: Fix the Oops during rmmod or insmod ko when reset occurs · d061effc
    Wei Hu (Xavier) authored
    During a reset, the hns3 NIC driver notifies the RoCE driver to perform
    reset-related processing by calling the .reset_notify() interface that
    the RoCE driver registers on the hip08 SoC.
    
    In the current version, if a reset occurs while rmmod or insmod of the ko
    is executing, an Oops like the following may occur:
    
     Internal error: Oops: 86000007 [#1] PREEMPT SMP
     Modules linked in: hns_roce(O) hns3(O) hclge(O) hnae3(O) [last unloaded: hns_roce_hw_v2]
     CPU: 0 PID: 14 Comm: kworker/0:1 Tainted: G           O      4.19.0-ge00d540 #1
     Hardware name: Huawei Technologies Co., Ltd.
     Workqueue: events hclge_reset_service_task [hclge]
     pstate: 60c00009 (nZCv daif +PAN +UAO)
     pc : 0xffff00000100b0b8
     lr : 0xffff00000100aea0
     sp : ffff000009afbab0
     x29: ffff000009afbab0 x28: 0000000000000800
     x27: 0000000000007ff0 x26: ffff80002f90c004
     x25: 00000000000007ff x24: ffff000008f97000
     x23: ffff80003efee0a8 x22: 0000000000001000
     x21: ffff80002f917ff0 x20: ffff8000286ea070
     x19: 0000000000000800 x18: 0000000000000400
     x17: 00000000c4d3225d x16: 00000000000021b8
     x15: 0000000000000400 x14: 0000000000000400
     x13: 0000000000000000 x12: ffff80003fac6e30
     x11: 0000800036303000 x10: 0000000000000001
     x9 : 0000000000000000 x8 : ffff80003016d000
     x7 : 0000000000000000 x6 : 000000000000003f
     x5 : 0000000000000040 x4 : 0000000000000000
     x3 : 0000000000000004 x2 : 00000000000007ff
     x1 : 0000000000000000 x0 : 0000000000000000
     Process kworker/0:1 (pid: 14, stack limit = 0x00000000af8f0ad9)
     Call trace:
      0xffff00000100b0b8
      0xffff00000100b3a0
      hns_roce_init+0x624/0xc88 [hns_roce]
      0xffff000001002df8
      0xffff000001006960
      hclge_notify_roce_client+0x74/0xe0 [hclge]
      hclge_reset_service_task+0xa58/0xbc0 [hclge]
      process_one_work+0x1e4/0x458
      worker_thread+0x40/0x450
      kthread+0x12c/0x130
      ret_from_fork+0x10/0x18
     Code: bad PC value
    
    During a reset, the driver first releases its resources; after the
    hardware reset completes, it reallocates the resources and reconfigures
    the hardware.
    
    We can solve this problem by modifying both the NIC driver and the RoCE
    driver. The NIC driver's concurrency handling is changed so that the
    .reset_notify and .uninit_instance ops are never called at the same time.
    The RoCE driver is changed to record the reset stage and the driver's
    init/uninit state, and to check that state in the .reset_notify,
    .init_instance, and .uninit_instance functions to avoid a NULL pointer
    access.
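
    The state checks described above can be illustrated with a minimal
    userspace sketch. It is not the code of this patch: every identifier
    (sketch_handle, sketch_reset_notify, and so on) is hypothetical, the
    "resources" are a single dummy pointer, and the serialization that the
    NIC driver provides is assumed rather than modeled.

```c
#include <stdbool.h>
#include <stddef.h>

/* All names here are hypothetical; they only model the idea of
 * recording state and checking it before touching resources. */
enum sketch_inst_state { SKETCH_NON_INIT, SKETCH_INITED };
enum sketch_reset_event { SKETCH_RESET_DOWN, SKETCH_RESET_UNINIT, SKETCH_RESET_INIT };

struct sketch_handle {
	int *dev;                          /* stand-in for driver resources */
	enum sketch_inst_state inst_state; /* recorded init/uninit state */
	bool reset_teardown;               /* reset released resources and owes a re-init */
};

static int sketch_dev_storage; /* dummy object the handle points at */

static void sketch_alloc(struct sketch_handle *h)
{
	h->dev = &sketch_dev_storage;
	h->inst_state = SKETCH_INITED;
}

static void sketch_release(struct sketch_handle *h)
{
	h->dev = NULL;
	h->inst_state = SKETCH_NON_INIT;
}

/* insmod path: refuse to initialize while the hardware is resetting. */
int sketch_init_instance(struct sketch_handle *h, bool hw_resetting)
{
	if (hw_resetting || h->inst_state == SKETCH_INITED)
		return -1;
	sketch_alloc(h);
	return 0;
}

/* rmmod path: cancel any pending re-init, release only if initialized. */
void sketch_uninit_instance(struct sketch_handle *h)
{
	h->reset_teardown = false;
	if (h->inst_state != SKETCH_INITED)
		return;
	sketch_release(h);
}

/* Reset path: every stage checks the recorded state first. */
int sketch_reset_notify(struct sketch_handle *h, enum sketch_reset_event ev)
{
	switch (ev) {
	case SKETCH_RESET_DOWN:
		return 0; /* a real driver would quiesce traffic here */
	case SKETCH_RESET_UNINIT:
		if (h->inst_state != SKETCH_INITED)
			return 0; /* nothing was set up, nothing to release */
		sketch_release(h);
		h->reset_teardown = true;
		return 0;
	case SKETCH_RESET_INIT:
		if (!h->reset_teardown)
			return 0; /* instance was removed meanwhile: do not re-init */
		h->reset_teardown = false;
		sketch_alloc(h);
		return 0;
	}
	return -1;
}
```

    In this sketch, an uninit that arrives between the reset's release and
    re-init stages clears the pending flag, so the later re-init notification
    returns without touching resources that no longer exist.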
    
    Fixes: cb7a94c9 ("RDMA/hns: Add reset process for RoCE in hip08")
    Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
    Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
hns_roce_hw_v2.c 186 KB