    KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing · d02c357e
    Sean Christopherson authored
    Retry page faults without acquiring mmu_lock, and without even faulting
    the page into the primary MMU, if the resolved gfn is covered by an active
    invalidation.  Contending for mmu_lock is especially problematic on
    preemptible kernels as the mmu_notifier invalidation task will yield
    mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and
    ultimately increase the latency of resolving the page fault.  And in the
    worst case scenario, yielding will be accompanied by a remote TLB flush,
    e.g. if the invalidation covers a large range of memory and vCPUs are
    accessing addresses that were already zapped.
    
    Faulting the page into the primary MMU is similarly problematic, as doing
    so may acquire locks that need to be taken for the invalidation to
    complete (the primary MMU has finer grained locks than KVM's MMU), and/or
    may cause unnecessary churn (getting/putting pages, marking them accessed,
    etc).
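
    As a rough illustration of where such a pre-check slots in (the helper name
    mmu_invalidate_retry_gfn_unsafe() and the simplified call site below are a
    sketch of the idea, not verbatim patch code), the fault path can bail before
    touching either the primary MMU or mmu_lock:

        static int kvm_faultin_pfn(struct kvm_vcpu *vcpu,
                                   struct kvm_page_fault *fault)
        {
                /* Snapshot the invalidation sequence before faulting in the pfn. */
                fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
                smp_rmb();

                /*
                 * Bail without faulting the page into the primary MMU and
                 * without taking mmu_lock if the gfn is covered by an
                 * in-progress invalidation; the vCPU will simply re-take the
                 * fault once the invalidation completes.
                 */
                if (fault->slot &&
                    mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq,
                                                    fault->gfn))
                        return RET_PF_RETRY;

                return __kvm_faultin_pfn(vcpu, fault);
        }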
    
    Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
    iterators to perform more work before yielding, but that wouldn't solve
    the lock contention and would negatively affect scenarios where a vCPU is
    trying to fault in an address that is NOT covered by the in-progress
    invalidation.
    
    Add a dedicated lockless version of the range-based retry check to avoid
    false positives on the start+end sanity check WARN, and so that it's
    super obvious that checking for a racing invalidation without holding
    mmu_lock is unsafe (though obviously useful).
    
    Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
    invalidation in a loop won't put KVM into an infinite loop, e.g. due to
    caching the in-progress flag and never seeing it go to '0'.
    
    Force a load of mmu_invalidate_seq as well, even though it isn't strictly
    necessary to avoid an infinite loop, as doing so improves the probability
    that KVM will detect an invalidation that already completed before
    acquiring mmu_lock and bailing anyways.
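
    A minimal sketch of what the lockless helper can look like (the name and
    field accesses mirror the existing mmu_invalidate_retry_gfn() machinery;
    details of the real helper may differ):

        static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
                                                           unsigned long mmu_seq,
                                                           gfn_t gfn)
        {
                /*
                 * Lockless, so the range fields may be transiently
                 * inconsistent; deliberately skip the locked variant's
                 * start/end sanity check WARN, and never treat this as the
                 * final word on whether a mapping can be installed.
                 */
                if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
                    gfn >= kvm->mmu_invalidate_range_start &&
                    gfn < kvm->mmu_invalidate_range_end)
                        return true;

                /*
                 * READ_ONCE() on the sequence isn't needed to avoid an
                 * infinite loop, but it helps catch invalidations that have
                 * already completed.
                 */
                if (READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq)
                        return true;

                return false;
        }

    Any fault that passes this pre-check must still go through the existing
    mmu_lock-protected retry check before a mapping is actually installed.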
    
    Do the pre-check even for non-preemptible kernels, as waiting to detect
    the invalidation until mmu_lock is held guarantees the vCPU will observe
    the worst case latency in terms of handling the fault, and can generate
    even more mmu_lock contention.  E.g. the vCPU will acquire mmu_lock,
    detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
    eventually re-acquire mmu_lock.  This behavior is also why there are no
    new starvation issues due to losing the fairness guarantees provided by
    rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
    on mmu_lock doesn't guarantee forward progress in the face of _another_
    mmu_notifier invalidation event.
    
    Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
    may generate a load into a register instead of doing a direct comparison
    (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
    is a few bytes of code and maaaaybe a cycle or three.
    Reported-by: Yan Zhao <yan.y.zhao@intel.com>
    Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
    Reported-by: Friedrich Weber <f.weber@proxmox.com>
    Cc: Kai Huang <kai.huang@intel.com>
    Cc: Yan Zhao <yan.y.zhao@intel.com>
    Cc: Yuan Yao <yuan.yao@linux.intel.com>
    Cc: Xu Yilun <yilun.xu@linux.intel.com>
    Acked-by: Kai Huang <kai.huang@intel.com>
    Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
    Link: https://lore.kernel.org/r/20240222012640.2820927-1-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>