    nfit, address-range-scrub: rework and simplify ARS state machine · bc6ba808
    Dan Williams authored
    ARS is an operation that can take 10s to 100s of seconds to find media
    errors that should rarely be present. If the platform crashes due to
    media errors in persistent memory, the expectation is that the BIOS will
    report those known errors in a 'short' ARS request.
    
    A 'short' ARS request asks platform firmware to return an ARS payload
    with all known errors, but without issuing a 'long' scrub. At driver
    init a short request is issued to all PMEM ranges before registering
    regions. Then, in the background, a long ARS is scheduled for each
    region.
    
    The ARS implementation is simplified to centralize ARS completion work
    in the ars_complete() helper. The timeout is removed since there is no
    facility to cancel ARS, and system init is now arranged so that it is
    never blocked waiting for a 'long' ARS. The ars_state flags are used
    to coordinate ARS requests from driver init, ARS requests from
    userspace, and ARS requests in response to media error notifications.
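
    As a rough illustration of how per-range state flags can coalesce
    requests from several sources, here is a minimal userspace sketch. The
    flag names, the struct, and both helpers are hypothetical and are not
    taken from the driver; C11 atomics stand in for the kernel's bit
    operations.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits for illustration; not the driver's definitions. */
enum {
    ARS_FLAG_REQ   = 1u << 0,  /* a scrub request is pending */
    ARS_FLAG_SHORT = 1u << 1,  /* the pending request is a 'short' one */
};

/* Hypothetical per-range state word. */
struct ars_range {
    atomic_uint state;
};

/*
 * Record a scrub request for a range. Returns true if this call created
 * the pending request, false if a request was already pending and this
 * one was coalesced into it. The caller's identity (driver init,
 * userspace, error notification) does not matter.
 */
static bool ars_request(struct ars_range *r, bool want_short)
{
    unsigned int want = ARS_FLAG_REQ | (want_short ? ARS_FLAG_SHORT : 0u);
    unsigned int old = atomic_load(&r->state);

    do {
        if (old & ARS_FLAG_REQ)
            return false;  /* someone already asked; nothing to add */
        /* On failure, 'old' is refreshed and the check is repeated. */
    } while (!atomic_compare_exchange_weak(&r->state, &old, old | want));

    return true;
}

/* Clear the pending bits once the scrub for this range has completed. */
static void ars_finish(struct ars_range *r)
{
    atomic_fetch_and(&r->state, ~(unsigned int)(ARS_FLAG_REQ | ARS_FLAG_SHORT));
}
```

    In this sketch a second request that arrives while one is pending is
    simply dropped, so a single completion path can service all callers.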
    
    Given that there is no notification of ARS completion, the implementation
    still needs to poll. It backs off exponentially to a maximum poll period
    of 30 minutes.
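
    The backoff described above can be sketched as a doubling interval
    clamped to the 30-minute maximum. This is an illustrative userspace
    helper, not the driver code; the function name, the use of seconds as
    the unit, and the 1-second floor are assumptions.

```c
/* 30-minute cap on the poll period, expressed in seconds. */
#define ARS_POLL_MAX_SECS (30u * 60u)

/*
 * Given the current poll interval in seconds, return the next one:
 * double it, but never exceed ARS_POLL_MAX_SECS. A zero input is
 * treated as "not yet polling" and starts at 1 second (an assumption
 * made for this sketch).
 */
static unsigned int next_poll_secs(unsigned int cur)
{
    unsigned int next;

    if (cur == 0)
        return 1;
    if (cur >= ARS_POLL_MAX_SECS)
        return ARS_POLL_MAX_SECS;  /* already at the cap; also avoids overflow */

    next = cur * 2;
    return next < ARS_POLL_MAX_SECS ? next : ARS_POLL_MAX_SECS;
}
```

    Under these assumptions, an interval that starts at 1 second reaches
    the 1800-second cap on the 11th doubling and stays there.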

    Suggested-by: Toshi Kani <toshi.kani@hpe.com>
    Co-developed-by: Dave Jiang <dave.jiang@intel.com>
    Signed-off-by: Dave Jiang <dave.jiang@intel.com>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>
core.c 91.7 KB