  1. 08 Jun, 2022 16 commits
    • KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config · 1ad4e543
      Robert Hoo authored
      Check the VMX features for the tertiary execution control during VMCS
      config setup. The sub-features to be enabled in the tertiary execution
      control are adjusted according to hardware capabilities, although no
      sub-feature is enabled by this patch.
      
      EVMCSv1 doesn't support the tertiary VM-execution control, so disable
      it when EVMCSv1 is in use. Also define the auxiliary functions for the
      tertiary control field here, using the new BUILD_CONTROLS_SHADOW().
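      As a rough illustration of the adjustment step (a standalone model,
      not the actual setup_vmcs_config() code; the function name below is
      made up for this sketch):

        /*
         * Illustrative model: sub-features KVM would like to enable in the
         * tertiary VM-execution control are masked against the bits the
         * hardware reports as supported (its allow-1 settings).
         */
        typedef unsigned long long u64;

        static u64 adjust_tertiary_controls(u64 requested, u64 hw_allowed1)
        {
                /* Keep only the requested sub-features the CPU supports. */
                return requested & hw_allowed1;
        }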
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: Zeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419153400.11642-1-guang.zeng@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation · ed3905ba
      Robert Hoo authored
      The tertiary VM-execution control, unlike the previous control fields,
      is 64 bits wide. Extend BUILD_CONTROLS_SHADOW() with a 'bit' parameter
      so it can build the auxiliary functions for both 32-bit and 64-bit
      fields.
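      A simplified standalone model of the parameterized macro (the real
      BUILD_CONTROLS_SHADOW() in vmx.h also writes the VMCS field and caches
      the value; this sketch only shows how the bit-width parameter selects
      the type):

        typedef unsigned int u32;
        typedef unsigned long long u64;

        #define BUILD_CONTROLS_SHADOW(lname, bits)                            \
        static inline void lname##_controls_set(u##bits *shadow, u##bits val) \
        {                                                                     \
                *shadow = val;                                                \
        }                                                                     \
        static inline u##bits lname##_controls_get(u##bits *shadow)           \
        {                                                                     \
                return *shadow;                                               \
        }

        BUILD_CONTROLS_SHADOW(pin, 32)            /* 32-bit control field */
        BUILD_CONTROLS_SHADOW(tertiary_exec, 64)  /* 64-bit control field */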
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: Zeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419153318.11595-1-guang.zeng@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/cpu: Add new VMX feature, Tertiary VM-Execution control · 465932db
      Robert Hoo authored
      A new 64-bit control field, "tertiary processor-based VM-execution
      controls", is defined [1]. It is enabled by bit 17 of the primary
      processor-based VM-execution controls.
      
      Unlike its sibling VM-execution fields, this tertiary VM-execution
      controls field is 64 bits wide, so it occupies two vmx_feature_leafs,
      TERTIARY_CTLS_LOW and TERTIARY_CTLS_HIGH.
      
      Its companion VMX capability-reporting MSR, MSR_IA32_VMX_PROCBASED_CTLS3
      (0x492), is also semantically different from its siblings: all 64 of its
      bits are allow-1 settings, rather than a 32-bit allow-0 half plus a
      32-bit allow-1 half [1][2]. Therefore, its handling in
      init_vmx_capabilities() differs a little from the others.
      
      [1] ISE 6.2 "VMCS Changes"
      https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
      
      [2] SDM Vol3. Appendix A.3
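      A standalone model of that difference (not kernel code; names are
      illustrative):

        typedef unsigned int u32;
        typedef unsigned long long u64;

        /* Classic control MSRs: low half = allowed-0, high half = allowed-1. */
        static void decode_classic_ctls(u64 msr, u32 *must_be_on, u32 *may_be_on)
        {
                *must_be_on = (u32)msr;         /* bits that must be 1 */
                *may_be_on  = (u32)(msr >> 32); /* bits that may be 1 */
        }

        /* IA32_VMX_PROCBASED_CTLS3: every one of its 64 bits is an allowed-1 bit. */
        static u64 decode_tertiary_ctls(u64 msr)
        {
                return msr;
        }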
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: Zeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419153240.11549-1-guang.zeng@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Comment FNAME(sync_page) to document TLB flushing logic · b8b9156e
      Sean Christopherson authored
      Add a comment to FNAME(sync_page) to explain why the TLB flushing logic
      conspicuously doesn't handle the scenario of guest protections being
      reduced.  Specifically, if synchronizing a SPTE drops execute protections,
      KVM will not emit a TLB flush, whereas dropping writable or clearing A/D
      bits does trigger a flush via mmu_spte_update().  Architecturally, until
      the GPTE is implicitly or explicitly flushed from the guest's perspective,
      KVM is not required to flush any old, stale translations.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20220513195000.99371-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Drop RWX=0 SPTEs during ept_sync_page() · 9fb35657
      Sean Christopherson authored
      All of sync_page()'s existing checks filter out only !PRESENT gPTEs,
      because without execute-only, all upper levels are guaranteed to be at
      least READABLE.  However, if EPT with execute-only support is in use by
      L1, KVM can create an SPTE that is shadow-present but guest-inaccessible
      (RWX=0) if the upper level combined permissions are R (or RW) and
      the leaf EPTE is changed from R (or RW) to X.  Because the EPTE is
      considered present when viewed in isolation, and no reserved bits are set,
      FNAME(prefetch_invalid_gpte) will consider the GPTE valid, and cause a
      not-present SPTE to be created.
      
      The SPTE is "correct": the guest translation is inaccessible because
      the combined protections of all levels yield RWX=0, and KVM will just
      redirect any vmexits to the guest.  If EPT A/D bits are disabled, KVM
      can mistake the SPTE for an access-tracked SPTE, but again such confusion
      isn't fatal, as the "saved" protections are also RWX=0.  However,
      creating a useless SPTE in general means that KVM messed up something,
      even if this particular goof didn't manifest as a functional bug.
      So, drop SPTEs whose new protections will yield a RWX=0 SPTE, and
      add a WARN in make_spte() to detect creation of SPTEs that will
      result in RWX=0 protections.
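      A tiny standalone model of why the combined protections collapse to
      RWX=0 (illustrative only, not KVM code):

        #include <stdio.h>

        #define EPT_R 0x1u
        #define EPT_W 0x2u
        #define EPT_X 0x4u

        int main(void)
        {
                unsigned int upper = EPT_R | EPT_W; /* upper-level EPTEs: RW */
                unsigned int leaf  = EPT_X;         /* leaf changed to execute-only */

                /* Effective protection is the AND across all levels. */
                printf("combined RWX = %u\n", upper & leaf); /* prints 0 */
                return 0;
        }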
      
      Fixes: d95c5568 ("kvm: mmu: track read permission explicitly for shadow EPT page tables")
      Cc: David Matlack <dmatlack@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220513195000.99371-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: nSVM: Add svm_nested_soft_inject_test · d8969871
      Maciej S. Szmigiero authored
      Add a KVM self-test that checks whether an nSVM L1 is able to
      successfully inject a software interrupt, a soft exception and an NMI
      into its L2 guest.
      
      In practice, this tests both next_rip field consistency and the
      handling of an L1-injected event with an intervening L0 VMEXIT during
      its delivery: the first nested VMRUN (which also tries to inject a
      software interrupt) will immediately trigger an L0 NPF.
      This L0 NPF will have zero in its CPU-returned next_rip field; if KVM
      incorrectly reuses that value, the guest will take a #PF when the
      interrupt handler tries to return to address 0.
      
      For NMI injection, this tests that the L1 NMI state doesn't get
      incorrectly mixed with the L2 NMI state when an L1 -> L2 NMI needs to
      be re-injected.
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      [sean: check exact L2 RIP on first soft interrupt]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <d5f3d56528558ad8e28a9f1e1e4187f5a1e6770a.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: Transparently handle L1 -> L2 NMI re-injection · 159fc6fa
      Maciej S. Szmigiero authored
      An NMI that L1 wants to inject into its L2 should be directly
      re-injected, without causing L0 side effects like engaging NMI
      blocking for L1.

      It's also worth noting that in this case it is L1's responsibility
      to track the NMI window status for its L2 guest.
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <f894d13501cd48157b3069a4b4c7369575ddb60e.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Differentiate Soft vs. Hard IRQs vs. reinjected in tracepoint · 2d613912
      Sean Christopherson authored
      In the IRQ injection tracepoint, differentiate between Hard IRQs and Soft
      "IRQs", i.e. interrupts that are reinjected after incomplete delivery of
      a software interrupt from an INTn instruction.  Tag reinjected interrupts
      as such, even though the information is usually redundant since soft
      interrupts are only ever reinjected by KVM.  Though rare in practice, a
      hard IRQ can be reinjected.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      [MSS: change "kvm_inj_virq" event "reinjected" field type to bool]
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <9664d49b3bd21e227caa501cff77b0569bebffe2.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Print error code in exception injection tracepoint iff valid · 21d4c575
      Sean Christopherson authored
      Print the error code in the exception injection tracepoint if and only if
      the exception has an error code.  Define the entire error code sequence
      as a set of formatted strings, print empty strings if there's no error
      code, and abuse __print_symbolic() by passing it an empty array to coerce
      it into printing the error code as a hex string.
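      A sketch of the formatting trick only (field names are illustrative,
      not the exact trace.h definition):

        /*
         * __print_symbolic() with an empty symbol list never finds a match,
         * so it falls back to printing the raw value as hex, which is
         * exactly what is wanted for an error code; nothing is printed when
         * the exception has no error code.
         */
        TP_printk("vector %u%s%s",
                  __entry->vector,
                  !__entry->has_error ? "" : " ",
                  !__entry->has_error ? "" : __print_symbolic(__entry->error_code, { }))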
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <e8f0511733ed2a0410cbee8a0a7388eac2ee5bac.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Trace re-injected exceptions · a61d7c54
      Sean Christopherson authored
      Trace exceptions that are re-injected, not just those that KVM is
      injecting for the first time.  Debugging re-injection bugs is painful
      enough as it is; not having visibility into what KVM is doing only
      makes things worse.
      
      Delay propagating pending=>injected in the non-reinjection path so that
      the tracing can properly identify reinjected exceptions.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <25470690a38b4d2b32b6204875dd35676c65c9f2.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Re-inject INTn instead of retrying the insn on "failure" · 7e5b5ef8
      Sean Christopherson authored
      Re-inject INTn software interrupts instead of retrying the instruction if
      the CPU encountered an intercepted exception while vectoring the INTn,
      e.g. if KVM intercepted a #PF when utilizing shadow paging.  Retrying the
      instruction is architecturally wrong, e.g. it will result in a spurious
      #DB if there's a code breakpoint on the INTn, and lack of re-injection
      also
      breaks nested virtualization, e.g. if L1 injects a software interrupt and
      vectoring the injected interrupt encounters an exception that is
      intercepted by L0 but not L1.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <1654ad502f860948e4f2d57b8bd881d67301f785.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction · 6ef88d6e
      Sean Christopherson authored
      Re-inject INT3/INTO instead of retrying the instruction if the CPU
      encountered an intercepted exception while vectoring the software
      exception, e.g. if vectoring INT3 encounters a #PF and KVM is using
      shadow paging.  Retrying the instruction is architecturally wrong, e.g.
      will result in a spurious #DB if there's a code breakpoint on the INT3/INTO,
      and lack of re-injection also breaks nested virtualization, e.g. if L1
      injects a software exception and vectoring the injected exception
      encounters an exception that is intercepted by L0 but not L1.
      
      Due to, ahem, deficiencies in the SVM architecture, acquiring the next
      RIP may require flowing through the emulator even if NRIPS is supported,
      as the CPU clears next_rip if the VM-Exit is due to an exception other
      than "exceptions caused by the INT3, INTO, and BOUND instructions".  To
      deal with this, "skip" the instruction to calculate next_rip (if it's
      not already known), and then unwind the RIP write and any side effects
      (RFLAGS updates).
      
      Save the computed next_rip and use it to re-stuff next_rip if injection
      doesn't complete.  This allows KVM to do the right thing if next_rip was
      known prior to injection, e.g. if L1 injects a soft event into L2, and
      there is no backing INTn instruction, e.g. if L1 is injecting an
      arbitrary event.
      
      Note, it's impossible to guarantee architectural correctness given SVM's
      architectural flaws.  E.g. if the guest executes INTn (no KVM injection),
      an exit occurs while vectoring the INTn, and the guest modifies the code
      stream while the exit is being handled, KVM will compute the incorrect
      next_rip due to "skipping" the wrong instruction.  A future enhancement
      to make this less awful would be for KVM to detect that the decoded
      instruction is not the correct INTn and drop the to-be-injected soft
      event (retrying is a lesser evil compared to shoving the wrong RIP on the
      exception stack).
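      A sketch of the flow only (the helper names below are illustrative,
      not the actual svm.c functions):

        /*
         * When the CPU didn't provide next_rip, "skip" the instruction to
         * learn where it ends, record that as next_rip, then unwind the RIP
         * write and any side effects (e.g. RFLAGS updates) of the skip.
         */
        static bool compute_soft_event_next_rip(struct vcpu *vcpu, u64 *next_rip)
        {
                u64 old_rip    = get_rip(vcpu);
                u64 old_rflags = get_rflags(vcpu);

                if (!skip_emulated_instruction(vcpu))  /* may use the emulator */
                        return false;

                *next_rip = get_rip(vcpu);             /* RIP after INT3/INTO/INTn */

                set_rflags(vcpu, old_rflags);          /* unwind side effects */
                set_rip(vcpu, old_rip);                /* unwind the RIP write */
                return true;
        }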
      Reported-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <65cb88deab40bc1649d509194864312a89bbe02e.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Stuff next_rip on emulated INT3 injection if NRIPS is supported · 3741aec4
      Sean Christopherson authored
      If NRIPS is supported in hardware but disabled in KVM, set next_rip to
      the next RIP when advancing RIP as part of emulating INT3 injection.
      There is no flag to tell the CPU that KVM isn't using next_rip, so
      leaving next_rip as-is will result in the CPU pushing garbage onto the
      stack when vectoring the injected event.
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Fixes: 66b7138f ("KVM: SVM: Emulate nRIP feature when reinjecting INT3")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <cd328309a3b88604daa2359ad56f36cb565ce2d4.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Unwind "speculative" RIP advancement if INTn injection "fails" · cd9e6da8
      Sean Christopherson authored
      Unwind the RIP advancement done by svm_queue_exception() when injecting
      an INT3 ultimately "fails" due to the CPU encountering a VM-Exit while
      vectoring the injected event, even if the exception reported by the CPU
      isn't the same event that was injected.  If vectoring INT3 encounters an
      exception, e.g. #NP, and vectoring the #NP encounters an intercepted
      exception, e.g. #PF when KVM is using shadow paging, then the #NP will
      be reported as the event that was in-progress.
      
      Note, this is still imperfect, as it will get a false positive if the
      INT3 is cleanly injected, no VM-Exit occurs before the IRET from the INT3
      handler in the guest, the instruction following the INT3 generates an
      exception (directly or indirectly), _and_ vectoring that exception
      encounters an exception that is intercepted by KVM.  The false positives
      could theoretically be solved by further analyzing the vectoring event,
      e.g. by comparing the error code against the expected error code were an
      exception to occur when vectoring the original injected exception, but
      SVM without NRIPS is a complete disaster; trying to make it 100%
      correct is a waste of time.
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Fixes: 66b7138f ("KVM: SVM: Emulate nRIP feature when reinjecting INT3")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <450133cf0a026cb9825a2ff55d02cb136a1cb111.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Don't BUG if userspace injects an interrupt with GIF=0 · f17c31c4
      Maciej S. Szmigiero authored
      Don't BUG/WARN on interrupt injection due to GIF being cleared,
      since it's trivial for userspace to force the situation via
      KVM_SET_VCPU_EVENTS (even if having at least a WARN there would be correct
      for KVM-internally-generated injections).
      
        kernel BUG at arch/x86/kvm/svm/svm.c:3386!
        invalid opcode: 0000 [#1] SMP
        CPU: 15 PID: 926 Comm: smm_test Not tainted 5.17.0-rc3+ #264
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:svm_inject_irq+0xab/0xb0 [kvm_amd]
        Code: <0f> 0b 0f 1f 00 0f 1f 44 00 00 80 3d ac b3 01 00 00 55 48 89 f5 53
        RSP: 0018:ffffc90000b37d88 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff88810a234ac0 RCX: 0000000000000006
        RDX: 0000000000000000 RSI: ffffc90000b37df7 RDI: ffff88810a234ac0
        RBP: ffffc90000b37df7 R08: ffff88810a1fa410 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
        R13: ffff888109571000 R14: ffff88810a234ac0 R15: 0000000000000000
        FS:  0000000001821380(0000) GS:ffff88846fdc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f74fc550008 CR3: 000000010a6fe000 CR4: 0000000000350ea0
        Call Trace:
         <TASK>
         inject_pending_event+0x2f7/0x4c0 [kvm]
         kvm_arch_vcpu_ioctl_run+0x791/0x17a0 [kvm]
         kvm_vcpu_ioctl+0x26d/0x650 [kvm]
         __x64_sys_ioctl+0x82/0xb0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      Fixes: 219b65dc ("KVM: SVM: Improve nested interrupt injection")
      Cc: stable@vger.kernel.org
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <35426af6e123cbe91ec7ce5132ce72521f02b1b5.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 · 00f08d99
      Maciej S. Szmigiero authored
      The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
      This field's value (instead of the saved guest RIP) is used by the CPU
      as the return address pushed on the stack when injecting a software
      interrupt or an INT3/INTO exception.
      
      Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
      loading a nested state and NRIPS is exposed to L1.  If NRIPS is supported
      in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff
      vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which
      saves RIP on the stack as-is).
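      A sketch of the idea only (simplified; the flag names below are
      illustrative, not the actual nested.c code):

        if (nrips_exposed_to_l1)
                /* L1 controls next_rip, so propagate its value. */
                vmcb02->control.next_rip = vmcb12->control.next_rip;
        else if (nrips_supported_in_hw)
                /* Emulate a !NRIPS CPU: return to the unadvanced L2 RIP. */
                vmcb02->control.next_rip = vmcb12->save.rip;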
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <c2e0a3d78db3ae30530f11d4e9254b452a89f42b.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 07 Jun, 2022 10 commits
    • Merge tag 'kvm-s390-next-5.19-2' of... · 5552de7b
      Paolo Bonzini authored
      Merge tag 'kvm-s390-next-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
      
      KVM: s390: pvdump and selftest improvements
      
      - add an interface to provide a hypervisor dump for secure guests
      - improve selftests to show tests
    • b31455e9
      Paolo Bonzini authored
    • a280e358
      Paolo Bonzini authored
    • KVM: SVM: fix tsc scaling cache logic · 11d39e8c
      Maxim Levitsky authored
      SVM uses a per-CPU variable to cache the current value of the TSC
      scaling multiplier MSR on each CPU.
      
      Commit 1ab9287a
      ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
      broke this caching logic.
      
      Refactor the code so that all TSC scaling multiplier writes go through
      a single function which checks and updates the cache.
      
      This fixes the following scenario:
      
      1. A CPU runs a guest with some tsc scaling ratio.
      
      2. A new guest with a different TSC scaling ratio starts on this CPU
         and terminates almost immediately.

         This ensures that the short-running guest set the TSC scaling ratio
         just once, when it was set via KVM_SET_TSC_KHZ. Due to the bug, the
         per-CPU cache is not updated.

      3. The original guest continues to run; because the cached value
         appears to match, KVM doesn't restore the MSR to that guest's own
         value, and the guest thus continues to run with a wrong TSC scaling
         ratio.
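      A sketch of the cached-write idea (close to, but not claimed to be
      identical to, the svm.c fix; the per-CPU variable name is assumed):

        static void write_tsc_multiplier_cached(u64 multiplier)
        {
                /* Skip the WRMSR if this CPU already has the wanted value. */
                if (multiplier != __this_cpu_read(current_tsc_ratio)) {
                        wrmsrl(MSR_AMD64_TSC_RATIO, multiplier);
                        __this_cpu_write(current_tsc_ratio, multiplier);
                }
        }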
      
      Fixes: 1ab9287a ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220606181149.103072-1-mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Make hyperv_clock selftest more stable · eae260be
      Vitaly Kuznetsov authored
      hyperv_clock doesn't always give a stable test result, especially with
      AMD CPUs. The test compares the Hyper-V MSR clocksource (read either
      with rdmsr() from within the guest or with KVM_GET_MSRS from the host)
      against rdtsc(). To increase the accuracy, increase the measured delay
      (done with a nop loop) by two orders of magnitude and take the mean of
      the rdtsc() values read before and after rdmsr()/KVM_GET_MSRS.
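      The measurement idea, as a short sketch (selftest-style pseudocode,
      not the exact test source):

        uint64_t t1  = rdtsc();
        uint64_t ref = rdmsr(HV_X64_MSR_TIME_REF_COUNT);
        uint64_t t2  = rdtsc();

        /* Attribute the MSR read to the midpoint of the two TSC reads. */
        uint64_t tsc_at_read = t1 + (t2 - t1) / 2;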
      Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220601144322.1968742-1-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging · 5ba7c4c6
      Ben Gardon authored
      Currently disabling dirty logging with the TDP MMU is extremely slow.
      On a 96 vCPU / 96G VM backed with gigabyte pages, it takes ~200 seconds
      to disable dirty logging with the TDP MMU, as opposed to ~4 seconds with
      the shadow MMU.
      
      When disabling dirty logging, zap non-leaf parent entries to allow
      replacement with huge pages instead of recursing and zapping all of the
      child, leaf entries. This reduces the number of TLB flushes required
      and reduces the disable-dirty-log time with the TDP MMU to ~3 seconds.
      
      Opportunistically add a WARN() to catch GFNs that are mapped at a
      higher level than their max level.
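      A sketch of the approach only (the helpers named here are illustrative
      and not the actual tdp_mmu.c code):

        /*
         * If the range under this non-leaf SPTE could be mapped as a huge
         * page again, zap the parent SPTE itself; the next fault re-creates
         * the mapping as a huge page instead of KVM walking and zapping
         * every leaf child.
         */
        if (!is_leaf(iter.old_spte, iter.level) &&
            range_can_be_mapped_huge(kvm, iter.gfn, iter.level))
                zap_spte(kvm, &iter);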
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220525230904.1584480-1-bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm() · 1df931d9
      Jan Beulich authored
      As noted (and fixed) a couple of times in the past, "=@cc<cond>" outputs
      and clobbering of "cc" don't work well together. The compiler appears to
      mean to reject such combinations, but in its upstream form doesn't
      quite manage to do so for "cc" yet. Furthermore, two similar macros
      don't clobber "cc", and clobbering "cc" is pointless in asm()-s for
      x86 anyway: the compiler always assumes the status flags to be
      clobbered there.
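      A small example of a flag output for contrast (illustrative x86-64
      GCC/Clang inline asm, not the kernel macro itself):

        /*
         * "=@ccz" asks the compiler for the ZF result directly, so listing
         * "cc" as a clobber is unnecessary on x86 and conflicts with the
         * flag output.
         */
        static inline int decrement_hits_zero(unsigned long *p)
        {
                int zero;

                asm volatile("decq %[val]"
                             : [val] "+m" (*p), "=@ccz" (zero));
                return zero;
        }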
      
      Fixes: 989b5db2 ("x86/uaccess: Implement macros for CMPXCHG on user addresses")
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Message-Id: <485c0c0b-a3a7-0b7c-5264-7d00c01de032@suse.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots() · cf4a8693
      Shaoqin Huang authored
      When freeing obsolete previous roots, check prev_roots as intended, not
      the current root.
      Signed-off-by: Shaoqin Huang <shaoqin.huang@intel.com>
      Fixes: 527d5cd7 ("KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped")
      Message-Id: <20220607005905.2933378-1-shaoqin.huang@intel.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set · 3e684903
      Seth Forshee authored
      A livepatch transition may stall indefinitely when a kvm vCPU is heavily
      loaded. To the host, the vCPU task is a user thread which is spending a
      very long time in the ioctl(KVM_RUN) syscall. During livepatch
      transition, set_notify_signal() will be called on such tasks to
      interrupt the syscall so that the task can be transitioned. This
      interrupts guest execution, but when xfer_to_guest_mode_work() sees that
      TIF_NOTIFY_SIGNAL is set but not TIF_SIGPENDING it concludes that an
      exit to user mode is unnecessary, and guest execution is resumed without
      transitioning the task for the livepatch.
      
      This handling of TIF_NOTIFY_SIGNAL is incorrect, as set_notify_signal()
      is expected to break tasks out of interruptible kernel loops and cause
      them to return to userspace. Change xfer_to_guest_mode_work() to handle
      TIF_NOTIFY_SIGNAL the same as TIF_SIGPENDING, signaling to the vCPU run
      loop that an exit to userspace is needed. Any pending task_work will be
      run when get_signal() is called from exit_to_user_mode_loop(), so there
      is no longer any need to run task work from xfer_to_guest_mode_work().
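      The gist of the change (a simplified sketch, not a verbatim copy of
      kernel/entry/kvm.c):

        /* Treat TIF_NOTIFY_SIGNAL exactly like TIF_SIGPENDING: bail out to
         * the vCPU run loop so the task can return to user mode. */
        if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
                return -EINTR;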
      Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Seth Forshee <sforshee@digitalocean.com>
      Message-Id: <20220504180840.2907296-1-sforshee@digitalocean.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Don't null dereference ops->destroy · e8bc2427
      Alexey Kardashevskiy authored
      KVM device cleanup happens in either of two callbacks:
      1) destroy() which is called when the VM is being destroyed;
      2) release() which is called when a device fd is closed.
      
      Most KVM devices use 1), but Book3s's interrupt controller KVM devices
      (XICS, XIVE, XIVE-native) use 2), as they need to be closed and
      reopened during machine execution. The error handling in
      kvm_ioctl_create_device() assumes destroy() is always defined, which
      leads to a NULL dereference, as discovered by Syzkaller.
      
      This adds a check for destroy != NULL and adds a missing release()
      call.
      
      This does not change kvm_destroy_devices(), as devices with a defined
      release() should have been removed from the KVM devices list by then.
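      The shape of the fix, as a conceptual sketch (not the exact
      virt/kvm/kvm_main.c error path):

        /*
         * Never call destroy() unconditionally: devices such as the Book3s
         * interrupt controllers only define release(), so use whichever
         * cleanup callback the device actually provides.
         */
        if (dev->ops->destroy)
                dev->ops->destroy(dev);
        else if (dev->ops->release)
                dev->ops->release(dev);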
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 06 Jun, 2022 3 commits
  4. 05 Jun, 2022 11 commits