1. 08 Oct, 2020 28 commits
  2. 07 Oct, 2020 7 commits
    • scsi: hisi_sas: Recover PHY state according to the status before reset · 69f4ec1e
      Xiang Chen authored
      Currently the PHY state is set according to the state of the PHYs after
      reset. This is invalid as the PHYs are already re-initialized.
      
      Set PHY state according to the state before the reset instead of after.
      
      Link: https://lore.kernel.org/r/1601649038-25534-8-git-send-email-john.garry@huawei.com
      Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
      Signed-off-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    • scsi: hisi_sas: Filter out new PHY up events during suspend · b14a37e0
      Xiang Chen authored
      Currently sas_resume_ha() is called while resuming the controller to wait
      for all suspended PHYs to come up and all the libsas events to be
      completed.
      
      There is a scenario which will cause a hung task: for direct attach with
      two disks connected to two PHYs, disable phy0 before suspending the disk
      on phy1 and the controller, then enable phy0 and resume the controller.
      A hung task then occurs as follows:
      
      [  591.901463] hisi_sas_v3_hw 0000:b4:02.0: resuming from operating state [D0]
      [  593.113525] hisi_sas_v3_hw 0000:b4:02.0: neither _PS0 nor _PR0 is defined
      [  593.120301] hisi_sas_v3_hw 0000:b4:02.0: waiting up to 25 seconds for 1 phy to resume
      [  593.120836] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy0 link_rate=10(sata)
      [  593.134680] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy1 link_rate=10(sata)
      [  593.134733] sas: phy-2:0 added to port-2:0, phy_mask:0x1 (5000000000000200)
      [  593.148350] sas: DOING DISCOVERY on port 0, pid:948
      [  593.153227] hisi_sas_v3_hw 0000:b4:02.0: dev[3:5] found
      [  593.159840] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
      [  593.165663] sas: ata7: end_device-2:0: dev error handler
      [  593.165730] sas: ata2: end_device-2:1: dev error handler
      [  593.172532] hisi_sas_v3_hw 0000:b4:02.0: phydown: phy0 phy_state=0x2
      [  593.182570] hisi_sas_v3_hw 0000:b4:02.0: ignore flutter phy0 down
      [  593.331277] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy0 link_rate=10(sata)
      [  593.498956] ata7.00: ATA-11: SAMSUNG MZ7LH960HAJR-00005, HXT7404Q, max UDMA/133
      [  593.506235] ata7.00: 1875385008 sectors, multi 16: LBA48 NCQ (depth 32)
      [  593.514295] ata7.00: configured for UDMA/133
      [  593.518557] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
      [  593.528613] sas: ata7: end_device-2:0: model:SAMSUNG MZ7LH960HAJR-00005 serial:S45NNA0M712225
      [  593.537520] device_link_add 316: dev=2:0:2:0 supplier:2 consumer:0
      [  593.543674] device_link_add 324
      [  593.546801] device_link_add 352
      [  593.549930] device_link_add 406
      [  593.553058] device_link_add 440: dev=2:0:2:0 supplier:2 consumer:0
      [  593.559208] device_link_add 444
      [  593.562335] device_link_add 455
      [  593.565517] scsi 2:0:2:0: Direct-Access     ATA      SAMSUNG MZ7LH960 404Q PQ: 0 ANSI: 5
      [  620.057464]  phy-2:1: resume timeout
      [  738.841445] INFO: task kworker/u256:0:8 blocked for more than 120 seconds.
      [  738.848295]       Not tainted 5.8.0-rc1-76154-g0d52b59-dirty #744
      [  738.854361] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  738.862155] kworker/u256:0  D    0     8      2 0x00000028
      [  738.867626] Workqueue: 0000:b4:02.0_event_q sas_port_event_worker
      [  738.873693] Call trace:
      [  738.876133]  __switch_to+0xf4/0x148
      [  738.879613]  __schedule+0x270/0x5d8
      [  738.883091]  schedule+0x78/0x110
      [  738.886307]  schedule_timeout+0x1ac/0x280
      [  738.890299]  wait_for_completion+0x94/0x138
      [  738.894472]  flush_workqueue+0x114/0x438
      [  738.898377]  sas_porte_bytes_dmaed+0x400/0x500
      [  738.902801]  sas_port_event_worker+0x28/0x40
      [  738.907053]  process_one_work+0x1e8/0x360
      [  738.911046]  worker_thread+0x44/0x478
      [  738.914698]  kthread+0x150/0x158
      [  738.917915]  ret_from_fork+0x10/0x1c
      [  738.921534] INFO: task kworker/u256:1:948 blocked for more than 120 seconds.
      [  738.928550]       Not tainted 5.8.0-rc1-76154-g0d52b59-dirty #744
      [  738.934614] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  738.942408] kworker/u256:1  D    0   948      2 0x00000028
      [  738.947873] Workqueue: 0000:b4:02.0_disco_q sas_discover_domain
      [  738.953766] Call trace:
      [  738.956203]  __switch_to+0xf4/0x148
      [  738.959678]  __schedule+0x270/0x5d8
      [  738.963152]  schedule+0x78/0x110
      [  738.966368]  rpm_resume+0xcc/0x550
      [  738.969757]  __pm_runtime_resume+0x3c/0x88
      [  738.973836]  rpm_get_suppliers+0x50/0x148
      [  738.977829]  __pm_runtime_set_status+0x124/0x2f0
      [  738.982427]  scsi_sysfs_add_sdev+0x1a0/0x2a8
      [  738.986679]  scsi_probe_and_add_lun+0x888/0xab0
      [  738.991190]  __scsi_scan_target+0xec/0x520
      [  738.995268]  scsi_scan_target+0x11c/0x128
      [  738.999261]  sas_rphy_add+0x15c/0x1e8
      [  739.002907]  sas_probe_devices+0xe4/0x150
      [  739.006899]  sas_discover_domain+0x33c/0x588
      [  739.011150]  process_one_work+0x1e8/0x360
      [  739.015143]  worker_thread+0x44/0x478
      [  739.018789]  kthread+0x150/0x158
      [  739.022003]  ret_from_fork+0x10/0x1c
      ...
      
      If an extra phy0 up event happens during resume of the SAS controller, it
      will emit new libsas events (PORTE_BYTES_DMAED and
      DISCE_DISCOVER_DOMAIN). Handling DISCE_DISCOVER_DOMAIN calls
      scsi_sysfs_add_sdev(), which calls __pm_runtime_set_status() to resume
      the supplier (the host controller). In the runtime PM core, if a device
      is in the resuming state, a later resume request for that device waits
      synchronously for the previous resume request to complete. At that point
      the controller is still resuming, as it waits for all libsas events to
      complete, while the libsas event DISCE_DISCOVER_DOMAIN is blocked because
      the controller is resuming. This causes a deadlock.
      
      To avoid the issue, filter out new PHY up events while the controller is
      suspended.
      
      Link: https://lore.kernel.org/r/1601649038-25534-7-git-send-email-john.garry@huawei.com
      Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
      Signed-off-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
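
      The filtering described above can be sketched as a toy user-space model.
      It assumes that a "new" PHY up event means one for a PHY that was not up
      when the suspend began; that reading, and the names toy_ctrl and
      toy_handle_phy_up(), are assumptions for illustration and not the
      driver's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical controller model; field names are invented for this sketch. */
struct toy_ctrl {
	bool in_pm_transition;   /* true from suspend until resume completes */
	uint32_t suspended_phys; /* bit i set: PHY i was up when suspend began */
	unsigned forwarded;      /* PHY up events passed to the upper layer */
};

/*
 * Handle a PHY up event. Returns true if the event was forwarded,
 * false if it was filtered out.
 */
static bool toy_handle_phy_up(struct toy_ctrl *c, unsigned phy_no)
{
	bool expected;

	if (phy_no >= 32)
		return false; /* outside the 32-bit mask used by this model */

	expected = (c->suspended_phys >> phy_no) & 1u;

	/* During a PM transition, drop events for PHYs nobody is waiting on. */
	if (c->in_pm_transition && !expected)
		return false;

	c->forwarded++;
	return true;
}
```

      In this model, dropping the unexpected event means no new discovery work
      is queued while the controller is still resuming, so nothing ends up
      waiting on the resume that is itself waiting for outstanding events.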
    • scsi: hisi_sas: Add device link between SCSI devices and hisi_hba · 16fd4a7c
      Xiang Chen authored
      Runtime PM of SCSI devices is already supported in the SCSI layer, so
      every SCSI device can be suspended/resumed separately. But if there is no
      link between hisi_hba and the SCSI devices or SCSI targets, issues arise
      if the controller is suspended while SCSI devices are still resuming. The
      controller can be suspended only when all the SCSI devices under it are
      suspended. Add a device link between the SCSI devices and the
      controller.
      
      Link: https://lore.kernel.org/r/1601649038-25534-6-git-send-email-john.garry@huawei.com
      Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
      Signed-off-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
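
      The ordering constraint stated above (the controller may suspend only
      once every SCSI device under it is suspended) can be illustrated with a
      simple counting model. This is a hedged user-space sketch, not the
      kernel's device-link implementation; toy_supplier and the three helper
      functions are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical supplier (controller) tracked by its active consumers. */
struct toy_supplier {
	unsigned active_consumers; /* consumers (SCSI devices) not suspended */
	bool suspended;
};

/* A consumer resuming forces the supplier to be active first. */
static void toy_consumer_resume(struct toy_supplier *s)
{
	s->suspended = false;
	s->active_consumers++;
}

/* A consumer suspending releases its hold on the supplier. */
static void toy_consumer_suspend(struct toy_supplier *s)
{
	if (s->active_consumers > 0)
		s->active_consumers--;
}

/* The supplier may suspend only when no consumer is still active. */
static bool toy_supplier_try_suspend(struct toy_supplier *s)
{
	if (s->active_consumers > 0)
		return false;
	s->suspended = true;
	return true;
}
```

      Without such a link, nothing in this model would stop the supplier from
      being marked suspended while a consumer is still active.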
    • scsi: hisi_sas: Add check for methods _PS0 and _PR0 · e06596d5
      Xiang Chen authored
      To support system suspend/resume or runtime suspend/resume, the function
      pci_set_power_state() must be used to change the power state, which
      requires the platform to provide at least one of the methods _PS0 or
      _PR0 for v3 hw. Check whether either method is supported and, if not,
      print a warning.
      
      A Kconfig dependency is added as there is no stub for
      acpi_device_power_manageable().
      
      Link: https://lore.kernel.org/r/1601649038-25534-5-git-send-email-john.garry@huawei.com
      Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
      Signed-off-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    • scsi: hisi_sas: Switch to new framework to support suspend and resume · 6c459ea1
      Xiang Chen authored
      For v3 hw we will add support for runtime PM, which is only supported in
      the new framework. Legacy PM support and the new framework cannot be used
      together. Switch to the new framework to support suspend and resume.
      
      Link: https://lore.kernel.org/r/1601649038-25534-3-git-send-email-john.garry@huawei.com
      Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
      Signed-off-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    • scsi: hisi_sas: Use hisi_hba->cq_nvecs for calling synchronize_irq() · 7f054da7
      Luo Jiaxing authored
      A call trace is observed when running a function level reset with fewer
      than 16 online CPUs and MSI auto-affinity enabled.
      
      [16538.348038] Call trace:
      [16538.348422]  pci_irq_vector+0x98/0xc0
      [16538.348947]  disable_host_v3_hw+0x8c/0x288 [hisi_sas_v3_hw]
      [16538.349706]  hisi_sas_reset_prepare_v3_hw+0x60/0x88 [hisi_sas_v3_hw]
      [16538.350631]  pci_dev_save_and_disable+0x38/0x68
      [16538.351290]  pci_reset_function+0x44/0x88
      [16538.351846]  reset_store+0x6c/0xb8
      [16538.352429]  dev_attr_store+0x44/0x60
      [16538.353035]  sysfs_kf_write+0x58/0x80
      [16538.353558]  kernfs_fop_write+0x140/0x230
      [16538.354175]  __vfs_write+0x48/0x80
      [16538.354675]  vfs_write+0xb8/0x1d8
      [16538.355145]  ksys_write+0x74/0xf8
      [16538.355615]  __arm64_sys_write+0x24/0x30
      [16538.356240]  el0_svc_common.constprop.4+0x80/0x1f0
      [16538.356905]  do_el0_svc+0x2c/0x38
      [16538.357408]  el0_svc+0x14/0x40
      [16538.357848]  el0_sync_handler+0xbc/0x2ec
      [16538.358388]  el0_sync+0x140/0x180
      
      The reason is that if we use pci_alloc_irq_vectors_affinity() to allocate
      IRQs, the number of CQ IRQs can only be less than or equal to the number of
      online CPUs, but we use hisi_hba->queue_count (always 16) to iterate during
      interrupt_disable_v3_hw().
      
      Use hisi_hba->cq_nvecs instead of hisi_hba->queue_count to avoid
      synchronizing an IRQ on a CPU which does not exist.
      
      Link: https://lore.kernel.org/r/1601649038-25534-2-git-send-email-john.garry@huawei.com
      Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
      Signed-off-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
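
      The loop-bound mismatch can be shown with a small user-space model. It
      assumes, as a simplification, that the number of CQ vectors granted
      equals the smaller of the online CPU count and the queue count (the
      commit message only says it is less than or equal to the number of
      online CPUs). TOY_QUEUE_COUNT, toy_alloc_cq_vectors() and
      toy_count_invalid_lookups() are hypothetical names.

```c
#include <assert.h>

#define TOY_QUEUE_COUNT 16 /* fixed queue count, as in the commit message */

/* Toy allocation: vectors granted are bounded by the online CPU count. */
static int toy_alloc_cq_vectors(int online_cpus)
{
	assert(online_cpus > 0);
	return online_cpus < TOY_QUEUE_COUNT ? online_cpus : TOY_QUEUE_COUNT;
}

/*
 * Walk vector indexes 0..loop_bound-1 and count how many of them refer to
 * a vector that was never allocated (index >= cq_nvecs).
 */
static int toy_count_invalid_lookups(int loop_bound, int cq_nvecs)
{
	int invalid = 0;
	int i;

	assert(cq_nvecs >= 0);
	for (i = 0; i < loop_bound; i++) {
		if (i >= cq_nvecs)
			invalid++;
	}
	return invalid;
}
```

      With 4 online CPUs in this model, iterating up to TOY_QUEUE_COUNT touches
      12 indexes that were never allocated, while iterating up to the granted
      vector count touches none.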
  3. 03 Oct, 2020 5 commits