    KVM: x86/mmu: Fix a largely theoretical race in kvm_mmu_track_write()
    Add full memory barriers in kvm_mmu_track_write() and account_shadowed()
    to plug a (very, very theoretical) race where kvm_mmu_track_write() could
    miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant,
    *stale* SPTEs.
    
    Without the barriers, because modern x86 CPUs allow (per the SDM):
    
      Reads may be reordered with older writes to different locations but not
      with older writes to the same location.
    
    it's possible that the following could happen (in terms of when values
    become visible/resolved):
    
     CPU0                          CPU1
     read memory[gfn] (=Y)
                                   memory[gfn] Y=>X
                                   read indirect_shadow_pages (=0)
     indirect_shadow_pages 0=>1
    
    or conversely:
    
     CPU0                          CPU1
     indirect_shadow_pages 0=>1
                                   read indirect_shadow_pages (=0)
     read memory[gfn] (=Y)
                                   memory[gfn] Y=>X
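    
    Below is a userspace sketch of the same store-buffering pattern, assuming
    C11 atomics as a stand-in for the kernel primitives:
    atomic_thread_fence() plays the role of smp_mb(), and the "pte" and
    "shadow_count" variables are invented names for memory[gfn] and
    indirect_shadow_pages.  Without the fences, both threads can read the
    other's old value, i.e. the "skip the zap *and* keep the stale gpte"
    outcome; with the fences, at least one side is guaranteed to see the new
    value.
    
     #include <pthread.h>
     #include <stdatomic.h>
     #include <stdio.h>
    
     static atomic_int pte;          /* stands in for memory[gfn]           */
     static atomic_int shadow_count; /* stands in for indirect_shadow_pages */
    
     static void *writer(void *ret)  /* ~ emulated write + track_write      */
     {
         atomic_store_explicit(&pte, 1, memory_order_relaxed);
         atomic_thread_fence(memory_order_seq_cst);      /* "smp_mb()" */
         *(int *)ret = atomic_load_explicit(&shadow_count, memory_order_relaxed);
         return NULL;
     }
    
     static void *faulter(void *ret) /* ~ account_shadowed + gpte_changed   */
     {
         atomic_store_explicit(&shadow_count, 1, memory_order_relaxed);
         atomic_thread_fence(memory_order_seq_cst);      /* "smp_mb()" */
         *(int *)ret = atomic_load_explicit(&pte, memory_order_relaxed);
         return NULL;
     }
    
     int main(void)
     {
         int r0 = -1, r1 = -1;
         pthread_t t0, t1;
    
         pthread_create(&t0, NULL, writer, &r0);
         pthread_create(&t1, NULL, faulter, &r1);
         pthread_join(t0, NULL);
         pthread_join(t1, NULL);
    
         /* r0 == 0 && r1 == 0 is the outcome the fences forbid */
         printf("r0=%d r1=%d\n", r0, r1);
         return 0;
     }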
    
    E.g. in the below scenario, CPU0 could fail to zap SPTEs, and CPU1 could
    fail to retry the faulting instruction, resulting in KVM entering the
    guest with a stale SPTE that maps PTE=X instead of PTE=Y.
    
    PTE = X;
    
    CPU0:
        emulator_write_phys()
        PTE = Y
        kvm_page_track_write()
          kvm_mmu_track_write()
          // memory barrier missing here
          if (indirect_shadow_pages)
              zap();
    
    CPU1:
       FNAME(page_fault)
         FNAME(walk_addr)
           FNAME(walk_addr_generic)
             gw->pte = PTE; // X
    
         FNAME(fetch)
           kvm_mmu_get_child_sp
             kvm_mmu_get_shadow_page
               __kvm_mmu_get_shadow_page
                 kvm_mmu_alloc_shadow_page
                   account_shadowed
                     indirect_shadow_pages++
                     // memory barrier missing here
           if (FNAME(gpte_changed)) // if (PTE == X)
               return RET_PF_RETRY;
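    
    I.e. the fix is a full barrier at each "memory barrier missing here"
    point above.  As a rough sketch in the same pseudocode style (barrier
    placement only, not the literal diff):
    
    CPU0:
        emulator_write_phys()
        PTE = Y
        kvm_page_track_write()
          kvm_mmu_track_write()
          smp_mb()                      // pairs with account_shadowed()
          if (indirect_shadow_pages)
              zap();
    
    CPU1:
        ...
        account_shadowed
          indirect_shadow_pages++
          smp_mb()                      // pairs with kvm_mmu_track_write()
        if (FNAME(gpte_changed))        // re-reads the guest PTE
            return RET_PF_RETRY;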
    
    In practice, this bug likely cannot be observed, as both the 0=>1
    transition and the problematic reordering are extremely rare occurrences.
    
    Note, if the cost of the barrier (which is simply a locked ADD, see commit
    450cbdd0 ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE"))
    is problematic, KVM could avoid the barrier by bailing earlier when
    kvm_memslots_have_rmaps() returns false.  But the odds of the barrier
    being problematic are extremely low, *and* the odds of the extra check
    being meaningfully faster overall are also low.
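    
    For reference, a minimal sketch of that alternative (hypothetical, not
    what this patch does; the kvm_mmu_track_write() signature is
    approximated):
    
     void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
                              const u8 *new, int bytes)
     {
         /* No rmap-based shadow pages => no SPTEs to zap, skip the barrier. */
         if (!kvm_memslots_have_rmaps(vcpu->kvm))
             return;
    
         smp_mb();
         if (!vcpu->kvm->arch.indirect_shadow_pages)
             return;
    
         /* ... existing zap logic ... */
     }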
    
    Link: https://lore.kernel.org/r/20240423193114.2887673-1-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>