Commits · cb00a70bd4b7e42dcbd6cd80b3f1697b10cdb44e · Kirill Smelkov / linux

10 Feb, 2022 40 commits

KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG · cb00a70b

David Matlack authored Jan 19, 2022

When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not
write-protected when dirty logging is enabled on the memslot. Instead
they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for
the first time and only for the specific sub-region being cleared.

Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to
write-protecting to avoid causing write-protection faults on vCPU
threads. This also allows userspace to smear the cost of huge page
splitting across multiple ioctls, rather than splitting the entire
memslot as is the case when initially-all-set is not used.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-17-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

cb00a70b

KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled · a3fe5dbd

David Matlack authored Jan 19, 2022

When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.

Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.

Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:

 1. Splitting on the vCPU thread interrupts vCPUs execution and is
    disruptive to customers whereas splitting on VM ioctl threads can
    run in parallel with vCPU execution.

 2. Splitting all huge pages at once is more efficient because it does
    not require performing VM-exit handling or walking the page table for
    every 4KiB page in the memslot, and greatly reduces the amount of
    contention on the mmu_lock.

For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.

Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.

Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a3fe5dbd

KVM: x86/mmu: Separate TDP MMU shadow page allocation and initialization · a82070b6

David Matlack authored Jan 19, 2022

Separate the allocation of shadow pages from their initialization.  This
is in preparation for splitting huge pages outside of the vCPU fault
context, which requires a different allocation mechanism.

No functional changed intended.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-15-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a82070b6

KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent · a3aca4de

David Matlack authored Jan 19, 2022

Derive the page role from the parent shadow page, since the only thing
that changes is the level. This is in preparation for splitting huge
pages during VM-ioctls which do not have access to the vCPU MMU context.

No functional change intended.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-14-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a3aca4de

KVM: x86/mmu: Remove redundant role overrides for TDP MMU shadow pages · a81399a5

David Matlack authored Jan 19, 2022

The vCPU's mmu_role already has the correct values for direct,
has_4_byte_gpte, access, and ad_disabled. Remove the code that was
redundantly overwriting these fields with the same values.

No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-13-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a81399a5

KVM: x86/mmu: Refactor TDP MMU iterators to take kvm_mmu_page root · 77aa6075

David Matlack authored Jan 19, 2022

Instead of passing a pointer to the root page table and the root level
separately, pass in a pointer to the root kvm_mmu_page struct.  This
reduces the number of arguments by 1, cutting down on line lengths.

No functional change intended.
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-12-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

77aa6075

KVM: x86/mmu: Move restore_acc_track_spte() to spte.h · 315d86da

David Matlack authored Jan 19, 2022

restore_acc_track_spte() is pure SPTE bit manipulation, making it a good
fit for spte.h. And now that the WARN_ON_ONCE() calls have been removed,
there isn't any good reason to not inline it.

This move also prepares for a follow-up commit that will need to call
restore_acc_track_spte() from spte.c

No functional change intended.
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-11-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

315d86da

KVM: x86/mmu: Drop new_spte local variable from restore_acc_track_spte() · 77c23c77

David Matlack authored Jan 19, 2022

The new_spte local variable is unnecessary. Deleting it can save a line
of code and simplify the remaining lines a bit.

No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-10-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

77c23c77

KVM: x86/mmu: Remove unnecessary warnings from restore_acc_track_spte() · 59940e76

David Matlack authored Jan 19, 2022

The warnings in restore_acc_track_spte() can be removed because the only
caller checks is_access_track_spte(), and is_access_track_spte() checks
!spte_ad_enabled(). In other words, the warning can never be triggered.

No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-9-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

59940e76

KVM: x86/mmu: Consolidate logic to atomically install a new TDP MMU page table · 7b7e1ab6

David Matlack authored Jan 19, 2022

Consolidate the logic to atomically replace an SPTE with an SPTE that
points to a new page table into a single helper function. This will be
used in a follow-up commit to split huge pages, which involves replacing
each huge page SPTE with an SPTE that points to a page table.

Opportunistically drop the call to trace_kvm_mmu_get_page() in
kvm_tdp_mmu_map() since it is redundant with the identical tracepoint in
tdp_mmu_alloc_sp().

No functional change intended.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-8-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

7b7e1ab6

KVM: x86/mmu: Rename handle_removed_tdp_mmu_page() to handle_removed_pt() · 0f53dfa3

David Matlack authored Jan 19, 2022

First remove tdp_mmu_ from the name since it is redundant given that it
is a static function in tdp_mmu.c. There is a pattern of using tdp_mmu_
as a prefix in the names of static TDP MMU functions, but all of the
other handle_*() variants do not include such a prefix. So drop it
entirely.

Then change "page" to "pt" to convey that this is operating on a page
table rather than an struct page. Purposely use "pt" instead of "sp"
since this function takes the raw RCU-protected page table pointer as an
argument rather than  a pointer to the struct kvm_mmu_page.

No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

0f53dfa3

KVM: x86/mmu: Rename TDP MMU functions that handle shadow pages · c298a30c

David Matlack authored Jan 19, 2022

Rename 3 functions in tdp_mmu.c that handle shadow pages:

  alloc_tdp_mmu_page()  -> tdp_mmu_alloc_sp()
  tdp_mmu_link_page()   -> tdp_mmu_link_sp()
  tdp_mmu_unlink_page() -> tdp_mmu_unlink_sp()

These changed make tdp_mmu a consistent prefix before the verb in the
function name, and make it more clear that these functions deal with
kvm_mmu_page structs rather than struct pages.

One could argue that "shadow page" is the wrong term for a page table in
the TDP MMU since it never actually shadows a guest page table.
However, "shadow page" (or "sp" for short) has evolved to become the
standard term in KVM when referring to a kvm_mmu_page struct, and its
associated page table and other metadata, regardless of whether the page
table shadows a guest page table. So this commit just makes the TDP MMU
more consistent with the rest of KVM.

No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-6-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

c298a30c

KVM: x86/mmu: Change tdp_mmu_{set,zap}_spte_atomic() to return 0/-EBUSY · 3e72c791

David Matlack authored Jan 19, 2022

tdp_mmu_set_spte_atomic() and tdp_mmu_zap_spte_atomic() return a bool
with true indicating the SPTE modification was successful and false
indicating failure. Change these functions to return an int instead
since that is the common practice.

Opportunistically fix up the kernel-doc style for the Return section
above tdp_mmu_set_spte_atomic().

No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

3e72c791

KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails · 3255530a

David Matlack authored Jan 19, 2022

Consolidate a bunch of code that was manually re-reading the spte if the
cmpxchg failed. There is no extra cost of doing this because we already
have the spte value as a result of the cmpxchg (and in fact this
eliminates re-reading the spte), and none of the call sites depend on
iter->old_spte retaining the stale spte value.
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

3255530a

KVM: x86/mmu: Rename __rmap_write_protect() to rmap_write_protect() · 1346bbb6

David Matlack authored Jan 19, 2022

The function formerly known as rmap_write_protect() has been renamed to
kvm_vcpu_write_protect_gfn(), so we can get rid of the double
underscores in front of __rmap_write_protect().

No functional change intended.
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

1346bbb6

KVM: x86/mmu: Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() · cf48f9e2

David Matlack authored Jan 19, 2022

rmap_write_protect() is a poor name because it also write-protects SPTEs
in the TDP MMU, not just SPTEs in the rmap. It is also confusing that
rmap_write_protect() is not a simple wrapper around
__rmap_write_protect(), since that is the common pattern for functions
with double-underscore names.

Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() to convey
that KVM is write-protecting a specific gfn in the context of a vCPU.

No functional change intended.
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

cf48f9e2

KVM: x86: Add checks for reserved-to-zero Hyper-V hypercall fields · 413af660

Sean Christopherson authored Dec 07, 2021

Add checks for the three fields in Hyper-V's hypercall params that must
be zero. Per the TLFS, HV_STATUS_INVALID_HYPERCALL_INPUT is returned if
"A reserved bit in the specified hypercall input value is non-zero."

Note, some versions of the TLFS have an off-by-one bug for the last
reserved field, and define it as being bits 64:60. See
https://github.com/MicrosoftDocs/Virtualization-Documentation/pull/1682.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

413af660

KVM: x86: Reject fixeds-size Hyper-V hypercalls with non-zero "var_cnt" · 40421f38

Sean Christopherson authored Dec 07, 2021

Reject Hyper-V hypercalls if the guest specifies a non-zero variable size
header (var_cnt in KVM) for a hypercall that has a fixed header size.
Per the TLFS:

  It is illegal to specify a non-zero variable header size for a
  hypercall that is not explicitly documented as accepting variable sized
  input headers. In such a case the hypercall will result in a return
  code of HV_STATUS_INVALID_HYPERCALL_INPUT.

Note, at least some of the various DEBUG commands likely aren't allowed
to use variable size headers, but the TLFS documentation doesn't clearly
state what is/isn't allowed.  Omit them for now to avoid unnecessary
breakage.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

40421f38

KVM: x86: Shove vp_bitmap handling down into sparse_set_to_vcpu_mask() · 9c52f6b3

Sean Christopherson authored Dec 07, 2021

Move the vp_bitmap "allocation" that's needed to handle mismatched vp_index
values down into sparse_set_to_vcpu_mask() and drop __always_inline from
said helper. The need for an intermediate vp_bitmap is a detail that's
specific to the sparse translation with mismatched VP<=>vCPU indexes and
does not need to be exposed to the caller.

Regarding the __always_inline, prior to commit f21dd494 ("KVM: x86:
hyperv: optimize sparse VP set processing") the helper, then named
hv_vcpu_in_sparse_set(), was a tiny bit of code that effectively boiled
down to a handful of bit ops. The __always_inline was understandable, if
not justifiable. Since the aforementioned change, sparse_set_to_vcpu_mask()
is a chunky 350-450+ bytes of code without KASAN=y, and balloons to 1100+
with KASAN=y. In other words, it has no business being forcefully inlined.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

9c52f6b3

KVM: x86: Don't bother reading sparse banks that end up being ignored · 79661c37

Sean Christopherson authored Dec 07, 2021

When handling "sparse" VP_SET requests, don't read sparse banks that
can't possibly contain a legal VP index instead of ignoring such banks
later on in sparse_set_to_vcpu_mask().  This allows KVM to cap the size
of its sparse_banks arrays for VP_SET at KVM_HV_MAX_SPARSE_VCPU_SET_BITS.
Add a compile time assert that KVM_HV_MAX_SPARSE_VCPU_SET_BITS<=64, i.e.
that KVM_MAX_VCPUS<=4096, as the TLFS allows for at most 64 sparse banks,
and KVM will need to do _something_ to play nice with Hyper-V.

Reducing the size of sparse_banks fudges around a compilation warning
(that becomes error with KVM_WERROR=y) when CONFIG_KASAN_STACK=y, which
is selected (and can't be unselected) by CONFIG_KASAN=y when using gcc
(clang/LLVM is a stack hog in some cases so it's opt-in for clang).
KASAN_STACK adds a redzone around every stack variable, which pushes the
Hyper-V functions over the default limit of 1024.

Ideally, KVM would flat out reject such impossibilities, but the TLFS
explicitly allows providing empty banks, even if a bank can't possibly
contain a valid VP index due to its position exceeding KVM's max.

  Furthermore, for a bit 1 in ValidBankMask, it is valid state for the
  corresponding element in BanksContents can be all 0s, meaning no
  processors are specified in this bank.

Arguably KVM should reject and not ignore the "extra" banks, but that can
be done independently and without bloating sparse_banks, e.g. by reading
each "extra" 8-byte chunk individually.
Reported-by: Ajay Garg <ajaygargnsit@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

79661c37

KVM: x86: Add a helper to get the sparse VP_SET for IPIs and TLB flushes · a0dd008f

Sean Christopherson authored Dec 07, 2021

Add a helper, kvm_get_sparse_vp_set(), to handle sanity checks related to
the VARHEAD field and reading the sparse banks of a VP_SET.  A future
commit to reduce the memory footprint of sparse_banks will introduce more
common code to the sparse bank retrieval.

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a0dd008f

KVM: x86: Refactor kvm_hv_flush_tlb() to reduce indentation · 25af9081

Sean Christopherson authored Dec 07, 2021

Refactor the "extended" path of kvm_hv_flush_tlb() to reduce the nesting
depth for the non-fast sparse path, and to make the code more similar to
the extended path in kvm_hv_send_ipi().

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

25af9081

KVM: x86: Get the number of Hyper-V sparse banks from the VARHEAD field · bd1ba573

Sean Christopherson authored Dec 07, 2021

Get the number of sparse banks from the VARHEAD field, which the guest is
required to provide as "The size of a variable header, in QWORDS.", where
the variable header is:

  Variable Header Bytes = {Total Header Bytes - sizeof(Fixed Header)}
                          rounded up to nearest multiple of 8
  Variable HeaderSize = Variable Header Bytes / 8

In other words, the VARHEAD should match the number of sparse banks.
Keep the manual count as a sanity check, but otherwise rely on the field
so as to more closely align with the logic defined in the TLFS and to
allow for future cleanups.

Tweak the tracepoint output to use "rep_cnt" instead of simply "cnt" now
that there is also "var_cnt".
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211207220926.718794-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

bd1ba573

KVM: x86/mmu: Consolidate comments about {Host,MMU}-writable · 02844ac1

David Matlack authored Jan 25, 2022

Consolidate the large comment above DEFAULT_SPTE_HOST_WRITABLE with the
large comment above is_writable_pte() into one comment. This comment
explains the different reasons why an SPTE may be non-writable and KVM
keeps track of that with the {Host,MMU}-writable bits.

No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220125230723.1701061-1-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

02844ac1

KVM: x86/mmu: Rename DEFAULT_SPTE_MMU_WRITEABLE to DEFAULT_SPTE_MMU_WRITABLE · 1ca87e01

David Matlack authored Jan 25, 2022

Both "writeable" and "writable" are valid, but we should be consistent
about which we use. DEFAULT_SPTE_MMU_WRITEABLE was the odd one out in
the SPTE code, so rename it to DEFAULT_SPTE_MMU_WRITABLE.

No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220125230713.1700406-1-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

1ca87e01

KVM: x86/mmu: Move is_writable_pte() to spte.h · 00610021

David Matlack authored Jan 25, 2022

Move is_writable_pte() close to the other functions that check
writability information about SPTEs. While here opportunistically
replace the open-coded bit arithmetic in
check_spte_writable_invariants() with a call to is_writable_pte().

No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220125230518.1697048-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

00610021

KVM: x86/mmu: Check SPTE writable invariants when setting leaf SPTEs · 115111ef

David Matlack authored Jan 25, 2022

Check SPTE writable invariants when setting SPTEs rather than in
spte_can_locklessly_be_made_writable(). By the time KVM checks
spte_can_locklessly_be_made_writable(), the SPTE has long been since
corrupted.

Note that these invariants only apply to shadow-present leaf SPTEs (i.e.
not to MMIO SPTEs, non-leaf SPTEs, etc.). Add a comment explaining the
restriction and only instrument the code paths that set shadow-present
leaf SPTEs.

To account for access tracking, also check the SPTE writable invariants
when marking an SPTE as an access track SPTE. This also lets us remove
a redundant WARN from mark_spte_for_access_track().
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220125230518.1697048-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

115111ef

KVM: x86/mmu: Move SPTE writable invariant checks to a helper function · 932859a4

David Matlack authored Jan 25, 2022

Move the WARNs in spte_can_locklessly_be_made_writable() to a separate
helper function. This is in preparation for moving these checks to the
places where SPTEs are set.

Opportunistically add warning error messages that include the SPTE to
make future debugging of these warnings easier.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220125230518.1697048-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

932859a4

KVM: LAPIC: Enable timer posted-interrupt only when mwait/hlt is advertised · 1714a4eb

Wanpeng Li authored Jan 25, 2022

As commit 0c5f81da ("KVM: LAPIC: Inject timer interrupt via posted
interrupt") mentioned that the host admin should well tune the guest
setup, so that vCPUs are placed on isolated pCPUs, and with several pCPUs
surplus for *busy* housekeeping.  In this setup, it is preferrable to
disable mwait/hlt/pause vmexits to keep the vCPUs in non-root mode.

However, if only some guests isolated and others not, they would not
have any benefit from posted timer interrupts, and at the same time lose
VMX preemption timer fast paths because kvm_can_post_timer_interrupt()
returns true and therefore forces kvm_can_use_hv_timer() to false.

By guaranteeing that posted-interrupt timer is only used if MWAIT or
HLT are done without vmexit, KVM can make a better choice and use the
VMX preemption timer and the corresponding fast paths.
Reported-by: Aili Yao <yaoaili@kingsoft.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1643112538-36743-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

1714a4eb

KVM: VMX: Dont' send posted IRQ if vCPU == this vCPU and vCPU is IN_GUEST_MODE · 9b44423b

Wanpeng Li authored Jan 25, 2022

When delivering a virtual interrupt, don't actually send a posted interrupt
if the target vCPU is also the currently running vCPU and is IN_GUEST_MODE,
in which case the interrupt is being sent from a VM-Exit fastpath and the
core run loop in vcpu_enter_guest() will manually move the interrupt from
the PIR to vmcs.GUEST_RVI.  IRQs are disabled while IN_GUEST_MODE, thus
there's no possibility of the virtual interrupt being sent from anything
other than KVM, i.e. KVM won't suppress a wake event from an IRQ handler
(see commit fdba608f, "KVM: VMX: Wake vCPU when delivering posted IRQ
even if vCPU == this vCPU").

Eliding the posted interrupt restores the performance provided by the
combination of commits 379a3c8e ("KVM: VMX: Optimize posted-interrupt
delivery for timer fastpath") and 26efe2fd ("KVM: VMX: Handle
preemption timer fastpath").

Thanks Sean for better comments.
Suggested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1643111979-36447-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

9b44423b

KVM: SVM: Rename hook implementations to conform to kvm_x86_ops' names · 23e5092b

Sean Christopherson authored Jan 28, 2022

Massage SVM's implementation names that still diverge from kvm_x86_ops to
allow for wiring up all SVM-defined functions via kvm-x86-ops.h.

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

23e5092b

KVM: SVM: Rename SEV implemenations to conform to kvm_x86_ops hooks · 559c7c75

Sean Christopherson authored Jan 28, 2022

Rename svm_vm_copy_asid_from() and svm_vm_migrate_from() to conform to
the names used by kvm_x86_ops, and opportunistically use "sev" instead of
"svm" to more precisely identify the role of the hooks.

svm_vm_copy_asid_from() in particular was poorly named as the function
does much more than simply copy the ASID.

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

559c7c75

KVM: x86: Use more verbose names for mem encrypt kvm_x86_ops hooks · 03d004cd

Sean Christopherson authored Jan 28, 2022

Use slightly more verbose names for the so called "memory encrypt",
a.k.a. "mem enc", kvm_x86_ops hooks to bridge the gap between the current
super short kvm_x86_ops names and SVM's more verbose, but non-conforming
names.  This is a step toward using kvm-x86-ops.h with KVM_X86_CVM_OP()
to fill svm_x86_ops.

Opportunistically rename mem_enc_op() to mem_enc_ioctl() to better
reflect its true nature, as it really is a full fledged ioctl() of its
own.  Ideally, the hook would be named confidential_vm_ioctl() or so, as
the ioctl() is a gateway to more than just memory encryption, and because
its underlying purpose to support Confidential VMs, which can be provided
without memory encryption, e.g. if the TCB of the guest includes the host
kernel but not host userspace, or by isolation in hardware without
encrypting memory.  But, diverging from KVM_MEMORY_ENCRYPT_OP even
further is undeseriable, and short of creating alises for all related
ioctl()s, which introduces a different flavor of divergence, KVM is stuck
with the nomenclature.

Defer renaming SVM's functions to a future commit as there are additional
changes needed to make SVM fully conforming and to match reality (looking
at you, svm_vm_copy_asid_from()).

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

03d004cd

KVM: SVM: Remove unused MAX_INST_SIZE #define · 771eda3f

Sean Christopherson authored Jan 28, 2022

Remove SVM's MAX_INST_SIZE, which has long since been obsoleted by the
common MAX_INSN_SIZE.  Note, the latter's "insn" is also the generally
preferred abbreviation of instruction.

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

771eda3f

KVM: SVM: Rename svm_flush_tlb() to svm_flush_tlb_current() · 4d9c83f5

Sean Christopherson authored Jan 28, 2022

Rename svm_flush_tlb() to svm_flush_tlb_current() so that at least one of
the flushing operations in svm_x86_ops can be filled via kvm-x86-ops.h,
and to document the scope of the flush (specifically that it doesn't
flush "all").

Opportunistically make svm_tlb_flush_current(), was svm_flush_tlb(),
static.

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

4d9c83f5

KVM: x86: Move get_cs_db_l_bits() helper to SVM · 872e0c53

Sean Christopherson authored Jan 28, 2022

Move kvm_get_cs_db_l_bits() to SVM and rename it appropriately so that
its svm_x86_ops entry can be filled via kvm-x86-ops, and to eliminate a
superfluous export from KVM x86.

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

872e0c53

KVM: VMX: Rename VMX functions to conform to kvm_x86_ops names · 58fccda4

Sean Christopherson authored Jan 28, 2022

Massage VMX's implementation names for kvm_x86_ops to maximize use of
kvm-x86-ops.h.  Leave cpu_has_vmx_wbinvd_exit() as-is to preserve the
cpu_has_vmx_*() pattern used for querying VMCS capabilities.  Keep
pi_has_pending_interrupt() as vmx_dy_apicv_has_pending_interrupt() does
a poor job of describing exactly what is being checked in VMX land.

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

58fccda4

KVM: x86: Use static_call() for copy/move encryption context ioctls() · 7ad02ef0

Sean Christopherson authored Jan 28, 2022

Define and use static_call()s for .vm_{copy,move}_enc_context_from(),
mostly so that the op is defined in kvm-x86-ops.h.  This will allow using
KVM_X86_OP in vendor code to wire up the implementation.  Any performance
gains eeked out by using static_call() is a happy bonus and not the
primary motiviation.

Opportunistically refactor the code to reduce indentation and keep line
lengths reasonable, and to be consistent when wrapping versus running
a bit over the 80 char soft limit.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

7ad02ef0

KVM: x86: Unexport kvm_x86_ops · dfc4e6ca

Sean Christopherson authored Jan 28, 2022

Drop the export of kvm_x86_ops now it is no longer referenced by SVM or
VMX.  Disallowing access to kvm_x86_ops is very desirable as it prevents
vendor code from incorrectly modifying hooks after they have been set by
kvm_arch_hardware_setup(), and more importantly after each function's
associated static_call key has been updated.

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

dfc4e6ca

KVM: x86: Uninline and export hv_track_root_tdp() · 3d4421f8

Sean Christopherson authored Jan 28, 2022

Uninline and export Hyper-V's hv_track_root_tdp(), which is (somewhat
indirectly) the last remaining reference to kvm_x86_ops from vendor
modules, i.e. will allow unexporting kvm_x86_ops. Reloading the TDP PGD
isn't the fastest of paths, hv_track_root_tdp() isn't exactly tiny, and
disallowing vendor code from accessing kvm_x86_ops provides nice-to-have
encapsulation of common x86 code (and of Hyper-V code for that matter).

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

3d4421f8