    x86/mce: Differentiate real hardware #MCs from TDX erratum ones · 70060463
    Kai Huang authored
    The first few generations of TDX hardware have an erratum.  Triggering
    it in Linux requires some kind of kernel bug involving relatively exotic
    memory writes to TDX private memory and will manifest via
    spurious-looking machine checks when reading the affected memory.
    
    Make an effort to detect these TDX-induced machine checks and spit out
    a new blurb to dmesg so folks do not think their hardware is failing.
    
    == Background ==
    
    Virtually all kernel memory access operations happen in full
    cachelines.  In practice, writing a "byte" of memory usually reads a 64
    byte cacheline of memory, modifies it, then writes the whole line back.
    Those operations do not trigger this problem.
    
    This problem is triggered by "partial" writes where a write transaction
    of less than a cacheline lands at the memory controller.  The CPU does
    these via non-temporal write instructions (like MOVNTI), or through
    UC/WC memory mappings.  The issue can also be triggered away from the
    CPU by devices doing partial writes via DMA.
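
    As a rough illustration of what "partial" means here, the sketch below
    classifies a write transaction as it would arrive at the memory
    controller.  write_is_partial() is a hypothetical helper invented for
    this example, not a kernel API, and it assumes the 64-byte cacheline
    size mentioned above.  Ordinary cached writes never reach this check as
    partial writes, because the cache turns them into full-line writebacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64u	/* assumption: 64-byte cachelines */

/*
 * Hypothetical helper (not a kernel API): returns true if a write
 * transaction of 'len' bytes at physical address 'addr' leaves at least
 * one cacheline only partly written.
 */
static bool write_is_partial(uint64_t addr, size_t len)
{
	assert(len > 0);

	/* A transaction is "full" only if it starts and ends on line boundaries. */
	return (addr % CACHELINE_SIZE) != 0 || (len % CACHELINE_SIZE) != 0;
}
```

    Under this model an 8-byte MOVNTI store is a partial write, while a
    64-byte aligned writeback of a whole line is not.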
    
    == Problem ==
    
    A partial write to a TDX private memory cacheline will silently "poison"
    the line.  Subsequent reads will consume the poison and generate a
    machine check.  According to the TDX hardware spec, neither of these
    things should have happened.
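
    The sequence above can be pictured with a toy state model.  This is
    only a sketch of the behavior described in this changelog: struct
    toy_mem, toy_write() and toy_read_raises_mc() are made-up names, the
    model tracks four cachelines, and it does not attempt to describe how
    (or whether) poison is ever cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64u	/* assumption: 64-byte cachelines */
#define NR_LINES 4u		/* toy memory: four cachelines */

/* Hypothetical per-line state, invented for this illustration. */
struct toy_mem {
	bool tdx_private[NR_LINES];	/* line belongs to TDX private memory */
	bool poisoned[NR_LINES];	/* line was silently poisoned */
};

/* Model one write transaction, confined to a single line, reaching memory. */
static void toy_write(struct toy_mem *m, uint64_t addr, size_t len)
{
	uint64_t line = addr / CACHELINE_SIZE;

	assert(line < NR_LINES);
	assert(len > 0 && (addr % CACHELINE_SIZE) + len <= CACHELINE_SIZE);

	/* Only a partial write to a TDX private line poisons it. */
	if (len != CACHELINE_SIZE && m->tdx_private[line])
		m->poisoned[line] = true;
}

/* Returns true if a read of 'addr' would consume poison and raise a #MC. */
static bool toy_read_raises_mc(const struct toy_mem *m, uint64_t addr)
{
	uint64_t line = addr / CACHELINE_SIZE;

	assert(line < NR_LINES);
	return m->poisoned[line];
}
```

    Note that the write itself reports nothing in this model.  The problem
    only becomes visible at the later read, which is why the resulting
    machine check looks spurious.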
    
    To add insult to injury, the Linux machine check code will present
    these as a literal "Hardware error" when they were, in fact, a
    software-triggered issue.
    
    == Solution ==
    
    In the end, this issue is hard to trigger.  Rather than do something
    rash (and incomplete) like unmap TDX private memory from the direct map,
    improve the machine check handler.
    
    Currently, the #MC handler doesn't distinguish whether the memory is
    TDX private memory or not and just dumps, for instance, the message below:
    
     [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
     [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
     	...
     [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
     [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
     [...] Kernel panic - not syncing: Fatal local machine check
    
    This says "Hardware Error" and "Data load in unrecoverable area of
    kernel".
    
    Ideally, the log would say "software bug around TDX private memory"
    instead of "Hardware Error".  But real hardware memory errors can also
    happen, and sadly a software-triggered #MC cannot be distinguished
    from a real hardware error.  Also, the error message is parsed by the
    userspace tool 'mcelog', so changing the output may break userspace.
    
    So keep the "Hardware Error".  The "Data load in unrecoverable area of
    kernel" is also helpful, so keep it too.
    
    Instead of modifying the error log above, improve it by printing an
    additional TDX-related message so that the log looks like:
    
      ...
     [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
     [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.
    
    Adding this additional message requires determining whether the
    memory page is TDX private memory.  There is no existing infrastructure
    to do that.  Add an interface to query the TDX module to fill this gap.
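
    The decision described above can be sketched in plain C.  This is a
    userspace model, not the code in this patch: struct toy_mce,
    toy_paddr_is_private() and tdx_extra_message() are hypothetical names,
    the private range is arbitrary, and the two preliminary checks (that
    the event is a memory error and that it carries a usable address) are
    assumptions about what a handler would need before it can ask whether
    the page is TDX private.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a machine check record. */
struct toy_mce {
	bool is_memory_error;	/* assumption: event is a memory error */
	bool addr_usable;	/* assumption: 'addr' holds a valid address */
	uint64_t addr;		/* physical address of the failing access */
};

/* Stand-in for the query to the TDX module; signature is an assumption. */
typedef bool (*private_query_fn)(uint64_t paddr);

/* Toy query: pretends one arbitrary range is TDX private memory. */
static bool toy_paddr_is_private(uint64_t paddr)
{
	return paddr >= 0x100000 && paddr < 0x200000;
}

/*
 * Returns the extra line to print after the existing "Hardware Error"
 * output, or NULL if the existing output should stand alone.
 */
static const char *tdx_extra_message(const struct toy_mce *m,
				     private_query_fn is_private)
{
	assert(m != NULL && is_private != NULL);

	/* Without a memory error at a known address there is nothing to ask. */
	if (!m->is_memory_error || !m->addr_usable)
		return NULL;

	if (!is_private(m->addr))
		return NULL;

	return "TDX private memory error. Possible kernel bug.";
}
```

    The existing messages are never suppressed in this sketch.  The extra
    line is purely additive, which matches the goal of leaving the output
    parsed by 'mcelog' unchanged.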
    
    == Impact ==
    
    This issue requires some kind of kernel bug to trigger.
    
    TDX private memory should never be mapped UC/WC.  A partial write
    originating from these mappings would require *two* bugs, first mapping
    the wrong page, then writing the wrong memory.  It would also be
    detectable using traditional memory corruption techniques like
    DEBUG_PAGEALLOC.
    
    MOVNTI (and friends) could cause this issue with something like a simple
    buffer overrun or use-after-free on the direct map.  It should also be
    detectable with normal debug techniques.
    
    The one place where this might get nasty would be if the CPU read data
    then wrote back the same data.  That would trigger this problem but
    would not, for instance, set off mechanisms like slab redzoning because
    it doesn't actually corrupt data.
    
    With an IOMMU at least, the DMA exposure is similar to the UC/WC issue.
    TDX private memory would first need to be incorrectly mapped into the
    I/O space and then a later DMA to that mapping would actually cause the
    poisoning event.
    
    [ dhansen: changelog tweaks ]
    Signed-off-by: Kai Huang <kai.huang@intel.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Reviewed-by: Yuan Yao <yuan.yao@intel.com>
    Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reviewed-by: Tony Luck <tony.luck@intel.com>
    Link: https://lore.kernel.org/all/20231208170740.53979-18-dave.hansen%40intel.com