• Lukas Wunner's avatar
    PCI: pciehp: Reduce noisiness on hot removal · 8a614499
    Lukas Wunner authored
    When a PCIe card is hot-removed, the Presence Detect State and Data Link
    Layer Link Active bits often do not clear simultaneously.  I've seen delays
    of up to 244 msec between the two events with Thunderbolt.
    
    After pciehp has brought down the slot in response to the first event, the
    other bit may still be set.  It's not discernible whether it's set because
    a new card is already in the slot or if it will soon clear.  So pciehp
    tries to bring up the slot and in the latter case fails with a bunch of
    messages, some of them at KERN_ERR severity.  If the slot is no longer
    occupied, the messages are false positives and annoy users.
    
    Stuart Hayes reports the following splat on hot removal:
    
      KERN_INFO pcieport 0000:3c:06.0: pciehp: Slot(180): Link Up
      KERN_INFO pcieport 0000:3c:06.0: pciehp: Timeout waiting for Presence Detect
      KERN_ERR  pcieport 0000:3c:06.0: pciehp: link training error: status 0x0001
      KERN_ERR  pcieport 0000:3c:06.0: pciehp: Failed to check link status
    
    Dongdong Liu complains about a similar splat:
    
      KERN_INFO pciehp 0000:80:10.0:pcie004: Slot(36): Link Down
      KERN_INFO iommu: Removing device 0000:87:00.0 from group 12
      KERN_INFO pciehp 0000:80:10.0:pcie004: Slot(36): Card present
      KERN_INFO pcieport 0000:80:10.0: Data Link Layer Link Active not set in 1000 msec
      KERN_ERR  pciehp 0000:80:10.0:pcie004: Failed to check link status
    
    Users are particularly irritated to see a bringup attempt even though the
    slot was explicitly brought down via sysfs.  In a perfect world, we could
    avoid this by setting Link Disable on slot bringdown and re-enabling it
    upon a Presence Detect State change.  In reality however, there are broken
    hotplug ports which hardwire Presence Detect to zero, see 80696f99
    ("PCI: pciehp: Tolerate Presence Detect hardwired to zero").  Conversely,
    PCIe r1.0 hotplug ports hardwire Link Active to zero because Link Active
    Reporting wasn't specified before PCIe r1.1.  On unplug, some ports first
    clear Presence then Link (see Stuart Hayes' splat) whereas others use the
    inverse order (see Dongdong Liu's splat).  To top it off, there are hotplug
    ports which flap the Presence and Link bits on slot bringup, see
    6c35a1ac ("PCI: pciehp: Tolerate initially unstable link").
    
    pciehp is designed to work with all of these variants.  Surplus attempts at
    slot bringup are a lesser evil than not being able to bring up slots at
    all.  Although we could try to perfect the behavior for specific hotplug
    controllers, we'd risk breaking others or increasing code complexity.
    
    But we can certainly minimize annoyance by emitting only a single message
    with KERN_INFO severity if bringup is unsuccessful:
    
    * Drop the "Timeout waiting for Presence Detect" message in
      pcie_wait_for_presence().  The sole caller of that function,
      pciehp_check_link_status(), ignores the timeout and carries on.  It emits
      error messages of its own and I don't think this particular message adds
      much value.
    
    * There's a single error condition in pciehp_check_link_status() which
      does not emit a message.  Adding one allows dropping the "Failed to check
      link status" message emitted by board_added() if
      pciehp_check_link_status() returns a non-zero integer.
    
    * Tone down all messages in pciehp_check_link_status() to KERN_INFO
      severity and rephrase them to look as innocuous as possible.  To this
      end, move the message emitted by pcie_wait_for_link_delay() to its
      callers.
    
    As a result, Stuart Hayes' splat becomes:
    
      KERN_INFO pcieport 0000:3c:06.0: pciehp: Slot(180): Link Up
      KERN_INFO pcieport 0000:3c:06.0: pciehp: Slot(180): Cannot train link: status 0x0001
    
    Dongdong Liu's splat becomes:
    
      KERN_INFO pciehp 0000:80:10.0:pcie004: Slot(36): Card present
      KERN_INFO pciehp 0000:80:10.0:pcie004: Slot(36): No link
    
    The messages now merely serve as information that presence or link bits
    were set a little longer than expected.  Bringup failures which are not
    false positives are still reported, albeit no longer at KERN_ERR severity.
    
    Link: https://lore.kernel.org/linux-pci/20200310182100.102987-1-stuart.w.hayes@gmail.com/
    Link: https://lore.kernel.org/linux-pci/1547649064-19019-1-git-send-email-liudongdong3@huawei.com/
    Link: https://lore.kernel.org/r/b45e46fd8a6aa6930aaac9d7718c2e4b787a4e5e.1595935071.git.lukas@wunner.deReported-by: default avatarStuart Hayes <stuart.w.hayes@gmail.com>
    Reported-by: default avatarDongdong Liu <liudongdong3@huawei.com>
    Signed-off-by: default avatarLukas Wunner <lukas@wunner.de>
    Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: default avatarMika Westerberg <mika.westerberg@linux.intel.com>
    8a614499
pciehp_ctrl.c 10.6 KB