• Sean Christopherson's avatar
    KVM: x86/mmu: Don't advance iterator after restart due to yielding · 3a0f64de
    Sean Christopherson authored
    After dropping mmu_lock in the TDP MMU, restart the iterator during
    tdp_iter_next() and do not advance the iterator.  Advancing the iterator
    results in skipping the top-level SPTE and all its children, which is
    fatal if any of the skipped SPTEs were not visited before yielding.
    
    When zapping all SPTEs, i.e. when min_level == root_level, restarting the
    iter and then invoking tdp_iter_next() is always fatal if the current gfn
    has as a valid SPTE, as advancing the iterator results in try_step_side()
    skipping the current gfn, which wasn't visited before yielding.
    
    Sprinkle WARNs on iter->yielded being true in various helpers that are
    often used in conjunction with yielding, and tag the helper with
    __must_check to reduce the probabily of improper usage.
    
    Failing to zap a top-level SPTE manifests in one of two ways.  If a valid
    SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
    the shadow page will be leaked and KVM will WARN accordingly.
    
      WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
      RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
      Call Trace:
       <TASK>
       kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
       kvm_destroy_vm+0x162/0x2a0 [kvm]
       kvm_vcpu_release+0x34/0x60 [kvm]
       __fput+0x82/0x240
       task_work_run+0x5c/0x90
       do_exit+0x364/0xa10
       ? futex_unqueue+0x38/0x60
       do_group_exit+0x33/0xa0
       get_signal+0x155/0x850
       arch_do_signal_or_restart+0xed/0x750
       exit_to_user_mode_prepare+0xc5/0x120
       syscall_exit_to_user_mode+0x1d/0x40
       do_syscall_64+0x48/0xc0
       entry_SYSCALL_64_after_hwframe+0x44/0xae
    
    If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
    kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
    marking a struct page as dirty/accessed after it has been put back on the
    free list.  This directly triggers a WARN due to encountering a page with
    page_count() == 0, but it can also lead to data corruption and additional
    errors in the kernel.
    
      WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
      RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
      Call Trace:
       <TASK>
       kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
       __handle_changed_spte+0x92e/0xca0 [kvm]
       __handle_changed_spte+0x63c/0xca0 [kvm]
       __handle_changed_spte+0x63c/0xca0 [kvm]
       __handle_changed_spte+0x63c/0xca0 [kvm]
       zap_gfn_range+0x549/0x620 [kvm]
       kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
       mmu_free_root_page+0x219/0x2c0 [kvm]
       kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
       kvm_mmu_unload+0x1c/0xa0 [kvm]
       kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
       kvm_put_kvm+0x3b1/0x8b0 [kvm]
       kvm_vcpu_release+0x4e/0x70 [kvm]
       __fput+0x1f7/0x8c0
       task_work_run+0xf8/0x1a0
       do_exit+0x97b/0x2230
       do_group_exit+0xda/0x2a0
       get_signal+0x3be/0x1e50
       arch_do_signal_or_restart+0x244/0x17f0
       exit_to_user_mode_prepare+0xcb/0x120
       syscall_exit_to_user_mode+0x1d/0x40
       do_syscall_64+0x4d/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
    
    Note, the underlying bug existed even before commit 1af4a960 ("KVM:
    x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
    tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
    incorrectly advance past a top-level entry when yielding on a lower-level
    entry.  But with respect to leaking shadow pages, the bug was introduced
    by yielding before processing the current gfn.
    
    Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
    callers could jump to their "retry" label.  The downside of that approach
    is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
    in the loop, and there's no easy way to enfornce that requirement.
    
    Ideally, KVM would handling the cond_resched() fully within the iterator
    macro (the code is actually quite clean) and avoid this entire class of
    bugs, but that is extremely difficult do while also supporting yielding
    after tdp_mmu_set_spte_atomic() fails.  Yielding after failing to set a
    SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
    bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
    may block operations on the SPTE for a significant amount of time.
    
    Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
    Fixes: 1af4a960 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
    Reported-by: default avatarIgnat Korchagin <ignat@cloudflare.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
    Message-Id: <20211214033528.123268-1-seanjc@google.com>
    Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
    3a0f64de
tdp_iter.h 2.25 KB