1. 18 Nov, 2022 23 commits
  2. 17 Nov, 2022 17 commits
    • Merge branch 'kvm-svm-harden' into HEAD · 771a579c
      Paolo Bonzini authored
      This fixes three issues in nested SVM:
      
      1) in the shutdown_interception() vmexit handler we call kvm_vcpu_reset().
      However, if running nested and L1 doesn't intercept shutdown, the function
      resets vcpu->arch.hflags without properly leaving the nested state.
      This leaves the vCPU in an inconsistent state and later triggers a kernel
      panic in SVM code.  The same bug can likely be triggered by sending INIT
      via the local APIC to a vCPU which runs a nested guest.
      
      On VMX we are lucky that the issue can't happen because VMX always
      intercepts triple faults, thus triple fault in L2 will always be
      redirected to L1.  Plus, handle_triple_fault() doesn't reset the vCPU.
      INIT IPI can't happen on VMX either because INIT events are masked while
      in VMX mode.
      
      Secondly, KVM doesn't honour the SHUTDOWN intercept bit of L1 on SVM.
      A normal hypervisor should always intercept SHUTDOWN; a unit test, on
      the other hand, might not want to.

      Finally, the guest can trigger a non-rate-limited kernel printk on SVM,
      which is fixed as well.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: remove exit_int_info warning in svm_handle_exit · 05311ce9
      Maxim Levitsky authored
      It is valid to receive an external interrupt while having a broken IDT
      entry, which will lead to a #GP with exit_int_info containing the index of
      the IDT entry (e.g. any value).
      
      Other exceptions can happen as well, like #NP or #SS
      (if stack switch fails).
      
      Thus this warning can be user triggered and has very little value.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-10-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: add svm part to triple_fault_test · 8357b9e1
      Maxim Levitsky authored
      Add a SVM implementation to triple_fault_test to test that
      emulated/injected shutdown works.
      
      Since SVM, unlike VMX, allows the hypervisor to avoid intercepting
      shutdown in the guest, don't intercept shutdown, in order to test that
      KVM supports this correctly.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-9-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: allow L1 to not intercept triple fault · 92e7d5c8
      Maxim Levitsky authored
      This is an SVM correctness fix: although a sane L1 would intercept the
      SHUTDOWN event, it doesn't have to, so we have to honour this.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-8-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: selftests: add svm nested shutdown test · 0bd2d3f4
      Maxim Levitsky authored
      Add a test that checks that on SVM, if L1 doesn't intercept SHUTDOWN,
      a shutdown triggered in L2 crashes L1 rather than just L2.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-7-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: move idt_entry to header · fc6392d5
      Maxim Levitsky authored
      struct idt_entry will be used for a test which will break IDT on purpose.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-6-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: forcibly leave nested mode on vCPU reset · ed129ec9
      Maxim Levitsky authored
      While not obvious, kvm_vcpu_reset() leaves the nested mode by clearing
      'vcpu->arch.hflags' but it does so without all the required housekeeping.

      On SVM, it is possible to have a vCPU reset while in guest mode because,
      unlike on VMX, INITs are not latched in SVM non-root mode. In addition,
      L1 doesn't have to intercept triple fault, which should also trigger L1's
      reset if it happens in L2 while L1 didn't intercept it.

      If one of the above conditions happens, KVM will continue to use vmcb02
      while no longer being in guest mode.
      
      Later the IA32_EFER will be cleared which will lead to freeing of the
      nested guest state which will (correctly) free the vmcb02, but since
      KVM still uses it (incorrectly) this will lead to a use after free
      and kernel crash.
      
      This issue is assigned CVE-2022-3344
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-5-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: add kvm_leave_nested · f9697df2
      Maxim Levitsky authored
      Add kvm_leave_nested(), which wraps a call to nested_ops->leave_nested
      into a function.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-4-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
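
The wrapper described in this commit, and its use on vCPU reset from the following commit in the series, can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct vcpu`, `HF_GUEST_MASK` as bit 0, `mock_leave_nested()` and `vcpu_reset()` are simplified stand-ins invented for the illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's vCPU state (hypothetical layout). */
struct vcpu {
	unsigned int hflags;   /* bit 0 models "vCPU is in nested guest mode" */
	int leave_calls;       /* counts how often the vendor hook has run */
};

#define HF_GUEST_MASK 0x1u

/* Vendor-specific nested operations, reduced to the single hook used here. */
struct nested_ops {
	void (*leave_nested)(struct vcpu *vcpu);
};

/* Mock vendor hook: a real one would also switch back to the L1 control block. */
static void mock_leave_nested(struct vcpu *vcpu)
{
	vcpu->hflags &= ~HF_GUEST_MASK;
	vcpu->leave_calls++;
}

static const struct nested_ops mock_ops = { .leave_nested = mock_leave_nested };
static const struct nested_ops *nested_ops = &mock_ops;

bool is_guest_mode(const struct vcpu *vcpu)
{
	return (vcpu->hflags & HF_GUEST_MASK) != 0;
}

/* The wrapper: callers no longer dereference nested_ops themselves. */
void kvm_leave_nested(struct vcpu *vcpu)
{
	assert(vcpu != NULL);
	nested_ops->leave_nested(vcpu);
}

/* Model of a reset path: leave nested mode properly before clearing flags. */
void vcpu_reset(struct vcpu *vcpu)
{
	if (is_guest_mode(vcpu))
		kvm_leave_nested(vcpu);
	vcpu->hflags = 0;
}
```

In this model, resetting a vCPU that is in guest mode runs the vendor hook exactly once, while resetting a vCPU that is already out of guest mode does not call it at all.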
    • KVM: x86: nSVM: harden svm_free_nested against freeing vmcb02 while still in use · 16ae56d7
      Maxim Levitsky authored
      Make sure that KVM uses vmcb01 before freeing nested state, and warn if
      that is not the case.
      
      This is a minimal fix for CVE-2022-3344, making the kernel print a warning
      instead of panicking.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-3-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
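
The hardening idea (switch back to vmcb01 and warn before freeing vmcb02) can be modelled in user space roughly as follows. This is a sketch, not the kernel code: `struct svm_state`, its fields, the `warnings` counter and `free_nested()` are hypothetical names chosen for the illustration, and the kernel's warning machinery is replaced by a plain message on stderr.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for a VM control block; the content does not matter here. */
struct vmcb {
	int id;
};

/* Hypothetical, heavily simplified per-vCPU SVM state. */
struct svm_state {
	struct vmcb *vmcb01;   /* control block used for L1 */
	struct vmcb *vmcb02;   /* heap-allocated control block used for L2 */
	struct vmcb *current;  /* control block currently in use */
	int warnings;          /* number of times the safety net fired */
};

/* Free the nested state, but never while vmcb02 is still the active block. */
void free_nested(struct svm_state *svm)
{
	assert(svm != NULL);

	if (svm->vmcb02 == NULL)
		return; /* nothing allocated, nothing to free */

	if (svm->current != svm->vmcb01) {
		/* Should not happen; warn and recover instead of crashing later. */
		fprintf(stderr, "warning: nested state freed while vmcb02 is in use\n");
		svm->warnings++;
		svm->current = svm->vmcb01;
	}

	free(svm->vmcb02);
	svm->vmcb02 = NULL;
}
```

The point of the check is that a caller bug degrades into a warning plus a forced switch to vmcb01, rather than a later use of freed memory.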
    • KVM: x86: nSVM: leave nested mode on vCPU free · 917401f2
      Maxim Levitsky authored
      If the VM was terminated while nested, we free the nested state
      while the vCPU still is in nested mode.
      
      Soon a warning will be added for this condition.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-2-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages · eb298605
      David Matlack authored
      Do not recover (i.e. zap) an NX Huge Page that is being dirty tracked,
      as it will just be faulted back in at the same 4KiB granularity when
      accessed by a vCPU. This may need to be changed if KVM ever supports
      2MiB (or larger) dirty tracking granularity, or faulting huge pages
      during dirty tracking for reads/executes. However for now, these zaps
      are entirely wasteful.
      
      In order to check if this commit increases the CPU usage of the NX
      recovery worker thread I used a modified version of execute_perf_test
      [1] that supports splitting guest memory into multiple slots and reports
      /proc/pid/schedstat:se.sum_exec_runtime for the NX recovery worker just
      before tearing down the VM. The goal was to force a large number of NX
      Huge Page recoveries and see if the recovery worker used any more CPU.
      
      Test Setup:
      
        echo 1000 > /sys/module/kvm/parameters/nx_huge_pages_recovery_period_ms
        echo 10 > /sys/module/kvm/parameters/nx_huge_pages_recovery_ratio
      
      Test Command:
      
        ./execute_perf_test -v64 -s anonymous_hugetlb_1gb -x 16 -o
      
              | kvm-nx-lpage-re:se.sum_exec_runtime      |
              | ---------------------------------------- |
      Run     | Before             | After               |
      ------- | ------------------ | ------------------- |
      1       | 730.084105         | 724.375314          |
      2       | 728.751339         | 740.581988          |
      3       | 736.264720         | 757.078163          |
      
      Comparing the median results, this commit results in about a 1% increase
      in CPU usage of the NX recovery worker when testing a VM with 16 slots.
      However, the effect is negligible with the default halving time of NX
      pages, which is 1 hour rather than the 10 seconds given by period_ms = 1000,
      ratio = 10.
      
      [1] https://lore.kernel.org/kvm/20221019234050.3919566-2-dmatlack@google.com/

      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20221103204421.1146958-1-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry · 63d28a25
      Paolo Bonzini authored
      A removed SPTE is never present, hence the "if" in kvm_tdp_mmu_map
      only fails in the exact same conditions that the earlier loop
      tested in order to issue a "break". So, instead of checking the
      condition twice (an upper-level SPTE could not be created or was
      frozen), just exit the loop with a goto, the usual poor man's C
      replacement for RAII early returns.
      
      While at it, do not use the "ret" variable for return values of
      functions that do not return a RET_PF_* enum.  This is clearer
      and also makes it possible to initialize ret to RET_PF_RETRY.
      Suggested-by: Robert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
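
The control-flow pattern this commit describes (initialize the result to "retry", jump to a shared exit label on failure inside the loop, and drop the second check after the loop) can be sketched with a toy walk. This is not kvm_tdp_mmu_map itself: `struct level`, its fields and `map_fault()` are hypothetical, and the two-value enum only mirrors the idea of the RET_PF_* return codes.

```c
#include <assert.h>
#include <stddef.h>

/* Toy result codes modelled after the idea of RET_PF_* values. */
enum ret_pf { RET_PF_RETRY, RET_PF_FIXED };

/* Hypothetical per-level state encountered while walking down the tree. */
struct level {
	int frozen;       /* entry is being modified by someone else */
	int can_install;  /* a missing upper-level entry could be created */
};

enum ret_pf map_fault(const struct level *levels, size_t nr_levels, int *mapped)
{
	enum ret_pf ret = RET_PF_RETRY; /* default outcome: the guest retries */
	size_t i;

	assert(levels != NULL && mapped != NULL);
	*mapped = 0;

	for (i = 0; i < nr_levels; i++) {
		/* Single failure check: leave through the shared exit path. */
		if (levels[i].frozen || !levels[i].can_install)
			goto retry;
	}

	/* Reached only if every level succeeded, so no second check is needed. */
	*mapped = 1;
	ret = RET_PF_FIXED;

retry:
	/* Common cleanup (for example, dropping a lock) would go here. */
	return ret;
}
```

Because `ret` starts out as the retry value, the failure path needs no assignment of its own, which is the simplification the commit message points at.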
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault · c4b33d28
      David Matlack authored
      Now that the TDP MMU has a mechanism to split huge pages, use it in the
      fault path when a huge page needs to be replaced with a mapping at a
      lower level.
      
      This change reduces the negative performance impact of NX HugePages.
      Prior to this change, if a vCPU executed from a huge page and NX
      HugePages was enabled, the vCPU would take a fault, zap the huge page,
      and map the faulting address at 4KiB with execute permissions
      enabled. The rest of the memory would be left *unmapped* and have to be
      faulted back in by the guest upon access (read, write, or execute). If
      the guest is backed by 1GiB pages, a single executed instruction can zap
      an entire GiB of its physical address space.
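
To put the cost of such a zap in numbers, the following sketch counts how many 4KiB mappings fit in one huge page. The helper name `small_pages_per_huge_page()` is made up for this illustration; only the page sizes come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Number of small pages needed to cover one huge page of the given size. */
uint64_t small_pages_per_huge_page(uint64_t huge_bytes, uint64_t small_bytes)
{
	/* The sizes are expected to be non-zero and evenly divisible. */
	assert(small_bytes != 0 && huge_bytes % small_bytes == 0);
	return huge_bytes / small_bytes;
}
```

For a 1GiB page and 4KiB mappings this gives 2^30 / 2^12 = 262144, so after zapping one 1GiB page and remapping a single 4KiB page, up to 262143 further 4KiB faults may be needed to touch the rest of that region again.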
      
      For example, it can take a VM longer to execute from its memory than to
      populate that memory in the first place:
      
      $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
      
      Populating memory             : 2.748378795s
      Executing from memory         : 2.899670885s
      
      With this change, such faults split the huge page instead of zapping it,
      which avoids the non-present faults on the rest of the huge page:
      
      $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
      
      Populating memory             : 2.729544474s
      Executing from memory         : 0.111965688s   <---
      
      This change also reduces the performance impact of dirty logging when
      eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
      be desirable for read-heavy workloads, as it avoids allocating memory to
      split huge pages that are never written and avoids increasing the TLB
      miss cost on reads of those pages.
      
                   | Config: ept=Y, tdp_mmu=Y, 5% writes           |
                   | Iteration 1 dirty memory time                 |
                   | --------------------------------------------- |
      vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
      ------------ | -------------- | ------------- | ------------ |
      2            | 0.332305091s   | 0.019615027s  | 0.006108211s |
      4            | 0.353096020s   | 0.019452131s  | 0.006214670s |
      8            | 0.453938562s   | 0.019748246s  | 0.006610997s |
      16           | 0.719095024s   | 0.019972171s  | 0.007757889s |
      32           | 1.698727124s   | 0.021361615s  | 0.012274432s |
      64           | 2.630673582s   | 0.031122014s  | 0.016994683s |
      96           | 3.016535213s   | 0.062608739s  | 0.044760838s |
      
      Eager page splitting remains beneficial for write-heavy workloads, but
      the gap is now reduced.
      
                   | Config: ept=Y, tdp_mmu=Y, 100% writes         |
                   | Iteration 1 dirty memory time                 |
                   | --------------------------------------------- |
      vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
      ------------ | -------------- | ------------- | ------------ |
      2            | 0.317710329s   | 0.296204596s  | 0.058689782s |
      4            | 0.337102375s   | 0.299841017s  | 0.060343076s |
      8            | 0.386025681s   | 0.297274460s  | 0.060399702s |
      16           | 0.791462524s   | 0.298942578s  | 0.062508699s |
      32           | 1.719646014s   | 0.313101996s  | 0.075984855s |
      64           | 2.527973150s   | 0.455779206s  | 0.079789363s |
      96           | 2.681123208s   | 0.673778787s  | 0.165386739s |
      
      Further study is needed to determine if the remaining gap is acceptable
      for customer workloads or if eager_page_split=N still requires a-priori
      knowledge of the VM workload, especially when considering these costs
      extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20221109185905.486172-3-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge tag 'kvm-selftests-6.2-1' of https://github.com/kvm-x86/linux into HEAD · 92292c1d
      Paolo Bonzini authored
      KVM selftests updates for 6.2
      
      perf_util:
       - Add support for pinning vCPUs in dirty_log_perf_test.
       - Add a lightweight pseudo RNG for guest use, and use it to randomize
         the access pattern and write vs. read percentage in the so called
         "perf_util" tests.
       - Rename the so called "perf_util" framework to "memstress".
      
      ucall:
       - Add a common pool-based ucall implementation (code dedup and pre-work
         for running SEV (and beyond) guests in selftests).
       - Fix an issue in ARM's single-step test when using the new pool-based
         implementation; LDREX/STREX don't play nice with single-step exceptions.
      
      init:
       - Provide a common constructor and arch hook, which will eventually be
         used by x86 to automatically select the right hypercall (AMD vs. Intel).
      
      x86:
       - Clean up x86's page table management.
       - Clean up and enhance the "smaller maxphyaddr" test, and add a related
         test to cover generic emulation failure.
       - Clean up the nEPT support checks.
       - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
    • KVM: selftests: Check for KVM nEPT support using "feature" MSRs · ecb89a51
      David Matlack authored
      When checking for nEPT support in KVM, use kvm_get_feature_msr() instead
      of vcpu_get_msr() to retrieve KVM's default TRUE_PROCBASED_CTLS and
      PROCBASED_CTLS2 MSR values, i.e. don't require a VM+vCPU to query nEPT
      support.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Link: https://lore.kernel.org/r/20220927165209.930904-1-dmatlack@google.com
      [sean: rebase on merged code, write changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: selftests: Assert in prepare_eptp() that nEPT is supported · 5c107f70
      David Matlack authored
      Now that a VM isn't needed to check for nEPT support, assert that KVM
      supports nEPT in prepare_eptp() instead of skipping the test, and push
      the TEST_REQUIRE() check out to individual tests.  The require+assert are
      somewhat redundant and will incur some amount of ongoing maintenance
      burden, but placing the "require" logic in the test makes it easier to
      find/understand a test's requirements and in this case, provides a very
      strong hint that the test cares about nEPT.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Link: https://lore.kernel.org/r/20220927165209.930904-1-dmatlack@google.com
      [sean: rebase on merged code, write changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: selftests: Drop helpers for getting specific KVM supported CPUID entry · b941ba23
      Sean Christopherson authored
      Drop kvm_get_supported_cpuid_entry() and its inner helper now that all
      known usage can use X86_FEATURE_*, X86_PROPERTY_*, X86_PMU_FEATURE_*, or
      the dedicated Family/Model helpers.  Providing "raw" access to CPUID
      leafs is undesirable as it encourages open coding CPUID checks, which is
      often error prone and not self-documenting.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221006005125.680782-13-seanjc@google.com