Commit bc6ba808 authored by Dan Williams

nfit, address-range-scrub: rework and simplify ARS state machine

ARS is an operation that can take 10s to 100s of seconds to find media
errors that should rarely be present. If the platform crashes due to
media errors in persistent memory, the expectation is that the BIOS will
report those known errors in a 'short' ARS request.

A 'short' ARS request asks platform firmware to return an ARS payload
with all known errors, but without issuing a 'long' scrub. At driver
init a short request is issued to all PMEM ranges before registering
regions. Then, in the background, a long ARS is scheduled for each
region.
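
The ordering described above can be sketched in plain userspace C. This is only an illustration of the sequence (short query on every range, then region registration, then background long scrubs); `init_log`, `log_step()`, and `init_ranges()` are hypothetical names invented for this sketch and are not taken from the driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step log so the ordering can be inspected afterwards. */
enum step { STEP_SHORT_ARS, STEP_REGISTER, STEP_QUEUE_LONG };

#define LOG_CAPACITY 32

struct init_log {
	enum step steps[LOG_CAPACITY];
	int range[LOG_CAPACITY];
	size_t n;
};

/* Append one (step, range) pair to the log. */
static void log_step(struct init_log *log, enum step s, int range)
{
	assert(log->n < LOG_CAPACITY);
	log->steps[log->n] = s;
	log->range[log->n] = range;
	log->n++;
}

/* Walk nr_ranges PMEM ranges in the order the commit message describes. */
void init_ranges(struct init_log *log, int nr_ranges)
{
	int i;

	/* 1. Collect already-known errors without starting a full scrub. */
	for (i = 0; i < nr_ranges; i++)
		log_step(log, STEP_SHORT_ARS, i);

	/* 2. Register regions; init does not wait for any long scrub. */
	for (i = 0; i < nr_ranges; i++)
		log_step(log, STEP_REGISTER, i);

	/* 3. Queue the long scrub for each range to run in the background. */
	for (i = 0; i < nr_ranges; i++)
		log_step(log, STEP_QUEUE_LONG, i);
}
```

The point of the sketch is that nothing in steps 1 and 2 depends on a long scrub finishing, which is what keeps init from blocking.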

The ARS implementation is simplified to centralize ARS completion work
in the ars_complete() helper. The timeout is removed since there is no
facility to cancel ARS, and since this change otherwise arranges for
system init to never be blocked waiting for a 'long' ARS. The ars_state
flags are used
to coordinate ARS requests from driver init, ARS requests from
userspace, and ARS requests in response to media error notifications.
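
One way per-range state flags can arbitrate between several requesters is sketched below with C11 atomics. This is a userspace illustration under stated assumptions, not the driver's code: the flag names, `struct pmem_range`, `range_request_ars()`, and `range_complete_ars()` are all hypothetical, and the kernel would use its own bitops rather than `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-range state bits; names are illustrative only. */
#define RANGE_ARS_REQ   (1u << 0) /* a scrub request is pending */
#define RANGE_ARS_SHORT (1u << 1) /* the pending request is a 'short' one */
#define RANGE_ARS_DONE  (1u << 2) /* at least one scrub has completed */

struct pmem_range {
	atomic_uint ars_state;
};

/*
 * Try to queue a scrub for this range. Returns true if this caller
 * queued it, false if another requester (init, userspace, or an error
 * notification) already has one pending.
 */
bool range_request_ars(struct pmem_range *r, bool want_short)
{
	unsigned int old = atomic_load(&r->ars_state);
	unsigned int new;

	do {
		if (old & RANGE_ARS_REQ)
			return false;
		new = (old & ~RANGE_ARS_SHORT) | RANGE_ARS_REQ;
		if (want_short)
			new |= RANGE_ARS_SHORT;
	} while (!atomic_compare_exchange_weak(&r->ars_state, &old, new));

	return true;
}

/* Mark the pending request finished so a new one may be queued. */
void range_complete_ars(struct pmem_range *r)
{
	unsigned int old = atomic_load(&r->ars_state);
	unsigned int new;

	do {
		new = (old & ~(RANGE_ARS_REQ | RANGE_ARS_SHORT)) | RANGE_ARS_DONE;
	} while (!atomic_compare_exchange_weak(&r->ars_state, &old, new));

	/* Completing a scrub that nobody requested would be a logic error. */
	assert(old & RANGE_ARS_REQ);
}
```

Because the request bit is set with a single compare-and-swap, two requesters racing on the same range cannot both believe they queued the scrub; the loser simply sees the request as already pending.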

Given that there is no notification of ARS completion, the implementation
still needs to poll. It backs off exponentially to a maximum poll period
of 30 minutes.
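
The backoff described above can be sketched as a small helper that doubles the poll interval and clamps it at 30 minutes. This is a minimal illustration, not the driver's implementation: `next_poll_secs()`, `POLL_MIN_SECS`, and the one-second starting interval are assumptions made for the example; only the 30-minute ceiling comes from the text.

```c
#include <assert.h>

/* Hypothetical starting interval; not taken from the driver. */
#define POLL_MIN_SECS 1U
/* 30-minute ceiling, as stated in the commit message. */
#define POLL_MAX_SECS (30U * 60U)

/*
 * Return the next poll interval in seconds: double the current one,
 * clamped to POLL_MAX_SECS. A current interval of zero means polling
 * has not started yet, so the minimum interval is returned.
 */
unsigned int next_poll_secs(unsigned int cur_secs)
{
	unsigned int next;

	assert(cur_secs <= POLL_MAX_SECS);

	if (cur_secs == 0)
		return POLL_MIN_SECS;

	next = cur_secs * 2;
	return next > POLL_MAX_SECS ? POLL_MAX_SECS : next;
}
```

Starting from one second, the interval doubles up to 1024 seconds and is then held at the 1800-second cap, so a scrub that runs for hours costs only a handful of status queries.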

Suggested-by: Toshi Kani <toshi.kani@hpe.com>
Co-developed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
parent 459d0ddb
@@ -197,18 +197,18 @@ struct acpi_nfit_desc {
 	struct device *dev;
 	u8 ars_start_flags;
 	struct nd_cmd_ars_status *ars_status;
-	struct work_struct work;
+	struct delayed_work dwork;
 	struct list_head list;
 	struct kernfs_node *scrub_count_state;
 	unsigned int max_ars;
 	unsigned int scrub_count;
 	unsigned int scrub_mode;
 	unsigned int cancel:1;
-	unsigned int init_complete:1;
 	unsigned long dimm_cmd_force_en;
 	unsigned long bus_cmd_force_en;
 	unsigned long bus_nfit_cmd_force_en;
 	unsigned int platform_cap;
+	unsigned int scrub_tmo;
 	int (*blk_do_io)(struct nd_blk_region *ndbr, resource_size_t dpa,
 			void *iobuf, u64 len, int rw);
 };