• Sean Christopherson's avatar
    KVM: x86: update %rip after emulating IO · b9733a74
    Sean Christopherson authored
    commit 45def77e upstream.
    
    Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
    OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
    vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
    that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
    
    To avoid corruping %rip after such a reset, commit 0967b7bf ("KVM:
    Skip pio instruction when it is emulated, not executed") changed the
    behavior of PIO handlers, i.e. today's "fast" PIO handling to skip the
    instruction prior to exiting to userspace.  Full emulation doesn't need
    such tricks becase re-emulating the instruction will naturally handle
    %rip being changed to point at the reset vector.
    
    Updating %rip prior to executing to userspace has several drawbacks:
    
      - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
        fails it will likely yell about the wrong address.
      - Single step exits to userspace for are effectively dropped as
        KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
      - Behavior of PIO emulation is different depending on whether it
        goes down the fast path or the slow path.
    
    Rather than skip the PIO instruction before exiting to userspace,
    snapshot the linear %rip and cancel PIO completion if the current
    value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
    common scenario, the snapshot and comparison has negligible overhead
    as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
    VMREAD in this case.
    
    All other alternatives to snapshotting the linear %rip that don't
    rely on an explicit reset announcenment suffer from one corner case
    or another.  For example, canceling PIO completion on any write to
    %rip fails if userspace does a save/restore of %rip, and attempting to
    avoid that issue by canceling PIO only if %rip changed then fails if PIO
    collides with the reset %rip.  Attempting to zero in on the exact reset
    vector won't work for APs, which means adding more hooks such as the
    vCPU's MP_STATE, and so on and so forth.
    
    Checking for a linear %rip match technically suffers from corner cases,
    e.g. userspace could theoretically rewrite the underlying code page and
    expect a different instruction to execute, or the guest hardcodes a PIO
    reset at 0xfffffff0, but those are far, far outside of what can be
    considered normal operation.
    
    Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
    Cc: <stable@vger.kernel.org>
    Reported-by: default avatarJim Mattson <jmattson@google.com>
    Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
    Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    b9733a74
x86.c 247 KB