1. 30 Jan, 2024 1 commit
      bus: mhi: host: Add MHI_PM_SYS_ERR_FAIL state · bce3f770
      Jeffrey Hugo authored
      When processing a SYSERR, if the device does not respond to the MHI_RESET
      from the host, the host will be stuck in a difficult to recover state.
      The host will remain in MHI_PM_SYS_ERR_PROCESS and not clean up the host
      channels.  Clients will not be notified of the SYSERR via the destruction
      of their channel devices, which means clients may think that the device is
      still up.  Subsequent SYSERR events such as a device fatal error will not
      be processed as the state machine cannot transition from PROCESS back to
      DETECT.  The only way to recover from this is to unload the mhi module
      (wipe the state machine state) or for the mhi controller to initiate
      SHUTDOWN.
      
      This issue was discovered by stress testing soc_reset events on AIC100
      via the sysfs node.
      
      soc_reset is processed entirely in hardware.  When the register write
      hits the endpoint hardware, it causes the soc to reset without firmware
      involvement.  In stress testing, there is a rare race where soc_reset N
      will cause the soc to reset and PBL to signal SYSERR (fatal error).  If
      soc_reset N+1 is triggered before PBL can process the MHI_RESET from the
      host, then the soc will reset again, and re-run PBL from the beginning.
      This will cause PBL to lose all state.  PBL will be waiting for the host
      to respond to the new syserr, but host will be stuck expecting the
      previous MHI_RESET to be processed.
      
      Additionally, the AMSS EE firmware (QSM) was hacked to synthetically
      reproduce the issue by simulating a FW hang after the QSM issued a
      SYSERR.  In this case, soc_reset would not recover the device.
      
      For this failure case, to recover the device, we need a state similar to
      PROCESS, but one that can transition to DETECT.  There is not a viable existing
      state to use.  POR has the needed transitions, but assumes the device is
      in a good state and could allow the host to attempt to use the device.
      Allowing PROCESS to transition to DETECT invites the possibility of
      parallel SYSERR processing which could get the host and device out of
      sync.
      
      Thus, invent a new state - MHI_PM_SYS_ERR_FAIL
      
      This is essentially a holding state.  It allows us to clean up the host
      elements that are based on the old state of the device (channels), but
      does not allow us to directly advance back to an operational state.  It
      does allow the detection and processing of another SYSERR which may
      recover the device, or allows the controller to do a clean shutdown.
      
      Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
      Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
      Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Link: https://lore.kernel.org/r/20240112180800.536733-1-quic_jhugo@quicinc.com
      Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
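The transitions described in the commit message above can be modeled as a small table. This is only an illustrative sketch, not the code from the patch. The `pm_*` and `PM_*` names are hypothetical simplifications, the table omits the other MHI power states, and the POR and DETECT rows plus the PROCESS to POR edge are assumptions about what a successful reset looks like. The parts taken from the message are that PROCESS cannot go back to DETECT, and that the new FAIL state can reach DETECT or a shutdown but not an operational state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified state flags; not the kernel's definitions. */
enum pm_state {
    PM_POR              = 1u << 0,
    PM_SYS_ERR_DETECT   = 1u << 1,
    PM_SYS_ERR_PROCESS  = 1u << 2,
    PM_SYS_ERR_FAIL     = 1u << 3,
    PM_SHUTDOWN_PROCESS = 1u << 4,
};

struct pm_transition {
    enum pm_state from;
    unsigned int to_mask;   /* bitwise OR of the permitted next states */
};

static const struct pm_transition pm_transitions[] = {
    /* Assumed: a healthy device can report a SYSERR or be shut down. */
    { PM_POR,             PM_SYS_ERR_DETECT | PM_SHUTDOWN_PROCESS },
    { PM_SYS_ERR_DETECT,  PM_SYS_ERR_PROCESS | PM_SHUTDOWN_PROCESS },
    /* PROCESS must not return to DETECT (no parallel SYSERR handling). */
    { PM_SYS_ERR_PROCESS, PM_POR | PM_SYS_ERR_FAIL | PM_SHUTDOWN_PROCESS },
    /* FAIL is a holding state: only a new SYSERR or a shutdown leaves it. */
    { PM_SYS_ERR_FAIL,    PM_SYS_ERR_DETECT | PM_SHUTDOWN_PROCESS },
};

/* Return true if the table permits moving from 'from' to 'to'. */
static bool pm_can_transition(enum pm_state from, enum pm_state to)
{
    size_t n = sizeof(pm_transitions) / sizeof(pm_transitions[0]);

    for (size_t i = 0; i < n; i++) {
        if (pm_transitions[i].from == from)
            return (pm_transitions[i].to_mask & (unsigned int)to) != 0;
    }
    return false;
}

/*
 * Assumed outcome of SYSERR processing: if the device acknowledged the
 * reset, return to POR; otherwise park in the FAIL holding state.
 */
static enum pm_state pm_after_reset(bool reset_acked)
{
    return reset_acked ? PM_POR : PM_SYS_ERR_FAIL;
}
```

In this model a device that never acknowledges the reset ends up in `PM_SYS_ERR_FAIL`, from which `pm_can_transition()` accepts a later `PM_SYS_ERR_DETECT` or `PM_SHUTDOWN_PROCESS` but rejects a direct move to `PM_POR`.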
  2. 21 Jan, 2024 39 commits