  1. 11 Mar, 2024 1 commit
    • Merge remote-tracking branches 'ras/edac-drivers', 'ras/edac-misc' and... · af65545a
      Borislav Petkov (AMD) authored
      Merge remote-tracking branches 'ras/edac-drivers', 'ras/edac-misc' and 'ras/edac-amd-atl' into edac-updates-for-v6.9
      
      * ras/edac-drivers:
        EDAC/i10nm: Add Intel Grand Ridge micro-server support
        EDAC/igen6: Add one more Intel Alder Lake-N SoC support
      
      * ras/edac-misc:
        EDAC/versal: Convert to platform remove callback returning void
        EDAC/versal: Make the bit position of injected errors configurable
        EDAC/synopsys: Convert to devm_platform_ioremap_resource()
      
      * ras/edac-amd-atl:
        RAS/AMD/FMPM: Fix off by one when unwinding on error
        RAS/AMD/FMPM: Add debugfs interface to print record entries
        RAS/AMD/FMPM: Save SPA values
        RAS: Export helper to get ras_debugfs_dir
        RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
        RAS: Introduce a FRU memory poison manager
        RAS/AMD/ATL: Add MI300 row retirement support
        Documentation: Move RAS section to admin-guide
        RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
        RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300()
        RAS/AMD/ATL: Add MI300 support
        Documentation: RAS: Add index and address translation section
        EDAC/amd64: Use new AMD Address Translation Library
        RAS: Introduce AMD Address Translation Library
      Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
  2. 08 Mar, 2024 1 commit
  3. 06 Mar, 2024 1 commit
  4. 01 Mar, 2024 3 commits
  5. 26 Feb, 2024 1 commit
  6. 20 Feb, 2024 1 commit
    • RAS: Introduce a FRU memory poison manager · 6f15e617
      Yazen Ghannam authored
      Memory errors are an expected occurrence on systems with high memory
      density. Generally, errors within a small number of unique physical
      locations are acceptable, based on manufacturer and/or admin policy.
      During run time, memory with errors may be retired so it is no longer
      used by the system. This is done in mm through page poisoning, and the
      effect will remain until the system is restarted.
      
      If a memory location is consistently faulty, the same run-time error
      handling may recur after the next reboot, and jobs may be terminated
      because of memory that was already known to be bad. This could be
      prevented if information from the previous boot were not lost.
      
      Some add-in cards with driver-managed memory have on-board persistent
      storage. Their driver saves memory error information to the persistent
      storage during run time. The information is then restored after reset,
      and known bad memory will be retired before the hardware is used.
      A running log of bad memory locations is kept across multiple resets.
      
      A similar solution is desirable for CPUs. However, this solution should
      leverage industry-standard components as much as possible, rather than
      a bespoke platform driver.
      
      Two components are needed: a record format and a persistent storage
      interface.
      
      Implement a new module to manage the record formats on persistent
      storage. Use the requirements for an AMD MI300-based system to start.
      Vendor- and platform-specific details can be abstracted later as needed.
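
      The save-and-restore flow described above can be sketched roughly as
      follows. This is only an illustration under stated assumptions, not the
      code from this commit: the record layout, the checksum, MAX_ENTRIES, and
      the names record_add(), record_restore() and count_retire() are all
      hypothetical. The persistent storage is stood in for by a plain struct,
      and retirement (page poisoning in the kernel) by a counting callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 8	/* hypothetical capacity of the persistent record */

/* Hypothetical record layout; the real on-storage format is not shown here. */
struct bad_mem_record {
	uint32_t nr_entries;
	uint32_t checksum;		/* simple sum, 0 for an empty record */
	uint64_t addr[MAX_ENTRIES];	/* physical addresses known to be bad */
};

static uint32_t record_checksum(const struct bad_mem_record *rec)
{
	uint32_t sum = rec->nr_entries;

	for (uint32_t i = 0; i < rec->nr_entries; i++)
		sum += (uint32_t)(rec->addr[i] ^ (rec->addr[i] >> 32));
	return sum;
}

/*
 * Run time: log a newly reported bad address.
 * Returns 1 if added, 0 if already logged, -1 if the record is full.
 */
static int record_add(struct bad_mem_record *rec, uint64_t addr)
{
	assert(rec);
	for (uint32_t i = 0; i < rec->nr_entries; i++)
		if (rec->addr[i] == addr)
			return 0;
	if (rec->nr_entries >= MAX_ENTRIES)
		return -1;
	rec->addr[rec->nr_entries++] = addr;
	rec->checksum = record_checksum(rec);
	return 1;
}

/*
 * After reset: validate the stored record, then retire every logged address
 * before the memory is used. Returns the number of addresses retired, or -1
 * if the record fails validation.
 */
static int record_restore(const struct bad_mem_record *rec,
			  void (*retire)(uint64_t addr))
{
	assert(rec && retire);
	if (rec->nr_entries > MAX_ENTRIES ||
	    rec->checksum != record_checksum(rec))
		return -1;
	for (uint32_t i = 0; i < rec->nr_entries; i++)
		retire(rec->addr[i]);
	return (int)rec->nr_entries;
}

/* Stand-in for the retirement action; it only counts calls. */
static unsigned int nr_retired;

static void count_retire(uint64_t addr)
{
	(void)addr;
	nr_retired++;
}
```

      In this sketch, duplicates are rejected on insert so that the log
      tracks unique bad locations, and the whole record is validated before
      anything is retired so that a corrupted log is ignored rather than
      acted on.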
      
        [ bp: Massage commit message and code, squash 30-ish more fixes from
          Yazen and me. ]
      Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
      Co-developed-by: <naveenkrishna.chatradhi@amd.com>
      Signed-off-by: <naveenkrishna.chatradhi@amd.com>
      Co-developed-by: <muralidhara.mk@amd.com>
      Signed-off-by: <muralidhara.mk@amd.com>
      Tested-by: <sathyapriya.k@amd.com>
      Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
      Link: https://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@amd.com
  7. 14 Feb, 2024 3 commits
  8. 01 Feb, 2024 3 commits
  9. 31 Jan, 2024 1 commit
  10. 29 Jan, 2024 1 commit
  11. 24 Jan, 2024 3 commits
  12. 23 Jan, 2024 1 commit
  13. 21 Jan, 2024 20 commits