- 10 Feb, 2022 40 commits
-
-
Vitaly Kuznetsov authored
Similar to VMX, allocate memory for MSR-Bitmap and fill in 'msrpm_base_pa' in VMCB. To use it, tests will need to set INTERCEPT_MSR_PROT interception along with the required bits in the MSR-Bitmap. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220203104620.277031-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
Introduce a test for enlightened MSR-Bitmap feature (Hyper-V on KVM): - Intercept access to MSR_FS_BASE in L1 and check that this works with enlightened MSR-Bitmap disabled. - Enabled enlightened MSR-Bitmap and check that the intercept still works as expected. - Intercept access to MSR_GS_BASE but don't clear the corresponding bit from 'hv_clean_fields', KVM is supposed to skip updating MSR-Bitmap02 and thus the consequent access to the MSR from L2 will not get intercepted. - Finally, clear the corresponding bit from 'hv_clean_fields' and check that access to MSR_GS_BASE is now intercepted. The test works with the assumption, that access to MSR_FS_BASE/MSR_GS_BASE is not intercepted for L1. If this ever becomes not true the test will fail as nested_vmx_exit_handled_msr() always checks L1's MSR-Bitmap for L2 irrespective of 'hv_clean_fields'. The behavior is correct as enlightened MSR-Bitmap feature is just an optimization, KVM is not obliged to ignore updates when the corresponding bit in 'hv_clean_fields' stays clear. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220203104620.277031-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
Instead of just resetting 'hv_clean_fields' to 0 on every enlightened vmresume, do the expected cleaning of the corresponding bit on enlightened vmwrite. Avoid direct access to 'current_evmcs' from evmcs_test to support the change. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220203104620.277031-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
CPUID 0x40000000.EAX is now always present as it has Enlightened MSR-Bitmap feature bit set. Adapt the test accordingly. Opportunistically add a check for the supported eVMCS version range. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220203104620.277031-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
Similar to nVMX commit 502d2bf5 ("KVM: nVMX: Implement Enlightened MSR Bitmap feature"), add support for the feature for nSVM (Hyper-V on KVM). Notable differences from nVMX implementation: - As the feature uses SW reserved fields in VMCB control, KVM needs to make sure it's dealing with a Hyper-V guest (kvm_hv_hypercall_enabled()). - 'msrpm_base_pa' needs to be always be overwritten in nested_svm_vmrun_msrpm(), even when the update is skipped. As an optimization, nested_vmcb02_prepare_control() copies it from VMCB01 so when MSR-Bitmap feature for L2 is disabled nothing needs to be done. - 'struct vmcb_ctrl_area_cached' needs to be extended with clean fields/sw reserved data and __nested_copy_vmcb_control_to_cache() needs to copy it so nested_svm_vmrun_msrpm() can use it later. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220202095100.129834-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
In preparation to implementing Enlightened MSR-Bitmap feature for Hyper-V on KVM, split off the required definitions into common 'svm/hyperv.h' header. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220202095100.129834-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
In preparation for using kvm_hv_hypercall_enabled() from SVM code, make it static inline to avoid the need to export it. The function is a simple check with only two call sites currently. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220202095100.129834-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
Similar to nVMX commit ed2a4800 ("KVM: nVMX: Track whether changes in L0 require MSR bitmap for L2 to be rebuilt"), introduce a flag to keep track of whether MSR bitmap for L2 needs to be rebuilt due to changes in MSR bitmap for L1 or switching to a different L2. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220202095100.129834-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Add an option to dirty_log_perf_test.c to disable KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE and KVM_DIRTY_LOG_INITIALLY_SET so the legacy dirty logging code path can be tested. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-19-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Add a tracepoint that records whenever KVM eagerly splits a huge page and the error status of the split to indicate if it succeeded or failed and why. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-18-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not write-protected when dirty logging is enabled on the memslot. Instead they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for the first time and only for the specific sub-region being cleared. Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to write-protecting to avoid causing write-protection faults on vCPU threads. This also allows userspace to smear the cost of huge page splitting across multiple ioctls, rather than splitting the entire memslot as is the case when initially-all-set is not used. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-17-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
When dirty logging is enabled without initially-all-set, try to split all huge pages in the memslot down to 4KB pages so that vCPUs do not have to take expensive write-protection faults to split huge pages. Eager page splitting is best-effort only. This commit only adds the support for the TDP MMU, and even there splitting may fail due to out of memory conditions. Failures to split a huge page is fine from a correctness standpoint because KVM will always follow up splitting by write-protecting any remaining huge pages. Eager page splitting moves the cost of splitting huge pages off of the vCPU threads and onto the thread enabling dirty logging on the memslot. This is useful because: 1. Splitting on the vCPU thread interrupts vCPUs execution and is disruptive to customers whereas splitting on VM ioctl threads can run in parallel with vCPU execution. 2. Splitting all huge pages at once is more efficient because it does not require performing VM-exit handling or walking the page table for every 4KiB page in the memslot, and greatly reduces the amount of contention on the mmu_lock. For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to all of their memory after dirty logging is enabled decreased by 95% from 2.94s to 0.14s. Eager Page Splitting is over 100x more efficient than the current implementation of splitting on fault under the read lock. For example, taking the same workload as above, Eager Page Splitting reduced the CPU required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s) * 96 vCPU threads) to only 1.55 CPU-seconds. Eager page splitting does increase the amount of time it takes to enable dirty logging since it has split all huge pages. For example, the time it took to enable dirty logging in the 96GiB region of the aforementioned test increased from 0.001s to 1.55s. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-16-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Separate the allocation of shadow pages from their initialization. This is in preparation for splitting huge pages outside of the vCPU fault context, which requires a different allocation mechanism. No functional changed intended. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-15-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Derive the page role from the parent shadow page, since the only thing that changes is the level. This is in preparation for splitting huge pages during VM-ioctls which do not have access to the vCPU MMU context. No functional change intended. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-14-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
The vCPU's mmu_role already has the correct values for direct, has_4_byte_gpte, access, and ad_disabled. Remove the code that was redundantly overwriting these fields with the same values. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-13-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Instead of passing a pointer to the root page table and the root level separately, pass in a pointer to the root kvm_mmu_page struct. This reduces the number of arguments by 1, cutting down on line lengths. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-12-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
restore_acc_track_spte() is pure SPTE bit manipulation, making it a good fit for spte.h. And now that the WARN_ON_ONCE() calls have been removed, there isn't any good reason to not inline it. This move also prepares for a follow-up commit that will need to call restore_acc_track_spte() from spte.c No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-11-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
The new_spte local variable is unnecessary. Deleting it can save a line of code and simplify the remaining lines a bit. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-10-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
The warnings in restore_acc_track_spte() can be removed because the only caller checks is_access_track_spte(), and is_access_track_spte() checks !spte_ad_enabled(). In other words, the warning can never be triggered. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-9-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Consolidate the logic to atomically replace an SPTE with an SPTE that points to a new page table into a single helper function. This will be used in a follow-up commit to split huge pages, which involves replacing each huge page SPTE with an SPTE that points to a page table. Opportunistically drop the call to trace_kvm_mmu_get_page() in kvm_tdp_mmu_map() since it is redundant with the identical tracepoint in tdp_mmu_alloc_sp(). No functional change intended. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-8-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
First remove tdp_mmu_ from the name since it is redundant given that it is a static function in tdp_mmu.c. There is a pattern of using tdp_mmu_ as a prefix in the names of static TDP MMU functions, but all of the other handle_*() variants do not include such a prefix. So drop it entirely. Then change "page" to "pt" to convey that this is operating on a page table rather than an struct page. Purposely use "pt" instead of "sp" since this function takes the raw RCU-protected page table pointer as an argument rather than a pointer to the struct kvm_mmu_page. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-7-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Rename 3 functions in tdp_mmu.c that handle shadow pages: alloc_tdp_mmu_page() -> tdp_mmu_alloc_sp() tdp_mmu_link_page() -> tdp_mmu_link_sp() tdp_mmu_unlink_page() -> tdp_mmu_unlink_sp() These changed make tdp_mmu a consistent prefix before the verb in the function name, and make it more clear that these functions deal with kvm_mmu_page structs rather than struct pages. One could argue that "shadow page" is the wrong term for a page table in the TDP MMU since it never actually shadows a guest page table. However, "shadow page" (or "sp" for short) has evolved to become the standard term in KVM when referring to a kvm_mmu_page struct, and its associated page table and other metadata, regardless of whether the page table shadows a guest page table. So this commit just makes the TDP MMU more consistent with the rest of KVM. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-6-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
tdp_mmu_set_spte_atomic() and tdp_mmu_zap_spte_atomic() return a bool with true indicating the SPTE modification was successful and false indicating failure. Change these functions to return an int instead since that is the common practice. Opportunistically fix up the kernel-doc style for the Return section above tdp_mmu_set_spte_atomic(). No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-5-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Consolidate a bunch of code that was manually re-reading the spte if the cmpxchg failed. There is no extra cost of doing this because we already have the spte value as a result of the cmpxchg (and in fact this eliminates re-reading the spte), and none of the call sites depend on iter->old_spte retaining the stale spte value. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
The function formerly known as rmap_write_protect() has been renamed to kvm_vcpu_write_protect_gfn(), so we can get rid of the double underscores in front of __rmap_write_protect(). No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
rmap_write_protect() is a poor name because it also write-protects SPTEs in the TDP MMU, not just SPTEs in the rmap. It is also confusing that rmap_write_protect() is not a simple wrapper around __rmap_write_protect(), since that is the common pattern for functions with double-underscore names. Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() to convey that KVM is write-protecting a specific gfn in the context of a vCPU. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add checks for the three fields in Hyper-V's hypercall params that must be zero. Per the TLFS, HV_STATUS_INVALID_HYPERCALL_INPUT is returned if "A reserved bit in the specified hypercall input value is non-zero." Note, some versions of the TLFS have an off-by-one bug for the last reserved field, and define it as being bits 64:60. See https://github.com/MicrosoftDocs/Virtualization-Documentation/pull/1682. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211207220926.718794-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Reject Hyper-V hypercalls if the guest specifies a non-zero variable size header (var_cnt in KVM) for a hypercall that has a fixed header size. Per the TLFS: It is illegal to specify a non-zero variable header size for a hypercall that is not explicitly documented as accepting variable sized input headers. In such a case the hypercall will result in a return code of HV_STATUS_INVALID_HYPERCALL_INPUT. Note, at least some of the various DEBUG commands likely aren't allowed to use variable size headers, but the TLFS documentation doesn't clearly state what is/isn't allowed. Omit them for now to avoid unnecessary breakage. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211207220926.718794-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Move the vp_bitmap "allocation" that's needed to handle mismatched vp_index values down into sparse_set_to_vcpu_mask() and drop __always_inline from said helper. The need for an intermediate vp_bitmap is a detail that's specific to the sparse translation with mismatched VP<=>vCPU indexes and does not need to be exposed to the caller. Regarding the __always_inline, prior to commit f21dd494 ("KVM: x86: hyperv: optimize sparse VP set processing") the helper, then named hv_vcpu_in_sparse_set(), was a tiny bit of code that effectively boiled down to a handful of bit ops. The __always_inline was understandable, if not justifiable. Since the aforementioned change, sparse_set_to_vcpu_mask() is a chunky 350-450+ bytes of code without KASAN=y, and balloons to 1100+ with KASAN=y. In other words, it has no business being forcefully inlined. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211207220926.718794-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
When handling "sparse" VP_SET requests, don't read sparse banks that can't possibly contain a legal VP index instead of ignoring such banks later on in sparse_set_to_vcpu_mask(). This allows KVM to cap the size of its sparse_banks arrays for VP_SET at KVM_HV_MAX_SPARSE_VCPU_SET_BITS. Add a compile time assert that KVM_HV_MAX_SPARSE_VCPU_SET_BITS<=64, i.e. that KVM_MAX_VCPUS<=4096, as the TLFS allows for at most 64 sparse banks, and KVM will need to do _something_ to play nice with Hyper-V. Reducing the size of sparse_banks fudges around a compilation warning (that becomes error with KVM_WERROR=y) when CONFIG_KASAN_STACK=y, which is selected (and can't be unselected) by CONFIG_KASAN=y when using gcc (clang/LLVM is a stack hog in some cases so it's opt-in for clang). KASAN_STACK adds a redzone around every stack variable, which pushes the Hyper-V functions over the default limit of 1024. Ideally, KVM would flat out reject such impossibilities, but the TLFS explicitly allows providing empty banks, even if a bank can't possibly contain a valid VP index due to its position exceeding KVM's max. Furthermore, for a bit 1 in ValidBankMask, it is valid state for the corresponding element in BanksContents can be all 0s, meaning no processors are specified in this bank. Arguably KVM should reject and not ignore the "extra" banks, but that can be done independently and without bloating sparse_banks, e.g. by reading each "extra" 8-byte chunk individually. Reported-by: Ajay Garg <ajaygargnsit@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211207220926.718794-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add a helper, kvm_get_sparse_vp_set(), to handle sanity checks related to the VARHEAD field and reading the sparse banks of a VP_SET. A future commit to reduce the memory footprint of sparse_banks will introduce more common code to the sparse bank retrieval. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211207220926.718794-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Refactor the "extended" path of kvm_hv_flush_tlb() to reduce the nesting depth for the non-fast sparse path, and to make the code more similar to the extended path in kvm_hv_send_ipi(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211207220926.718794-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Get the number of sparse banks from the VARHEAD field, which the guest is required to provide as "The size of a variable header, in QWORDS.", where the variable header is: Variable Header Bytes = {Total Header Bytes - sizeof(Fixed Header)} rounded up to nearest multiple of 8 Variable HeaderSize = Variable Header Bytes / 8 In other words, the VARHEAD should match the number of sparse banks. Keep the manual count as a sanity check, but otherwise rely on the field so as to more closely align with the logic defined in the TLFS and to allow for future cleanups. Tweak the tracepoint output to use "rep_cnt" instead of simply "cnt" now that there is also "var_cnt". Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211207220926.718794-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Consolidate the large comment above DEFAULT_SPTE_HOST_WRITABLE with the large comment above is_writable_pte() into one comment. This comment explains the different reasons why an SPTE may be non-writable and KVM keeps track of that with the {Host,MMU}-writable bits. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230723.1701061-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Both "writeable" and "writable" are valid, but we should be consistent about which we use. DEFAULT_SPTE_MMU_WRITEABLE was the odd one out in the SPTE code, so rename it to DEFAULT_SPTE_MMU_WRITABLE. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230713.1700406-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Move is_writable_pte() close to the other functions that check writability information about SPTEs. While here opportunistically replace the open-coded bit arithmetic in check_spte_writable_invariants() with a call to is_writable_pte(). No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230518.1697048-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Check SPTE writable invariants when setting SPTEs rather than in spte_can_locklessly_be_made_writable(). By the time KVM checks spte_can_locklessly_be_made_writable(), the SPTE has long been since corrupted. Note that these invariants only apply to shadow-present leaf SPTEs (i.e. not to MMIO SPTEs, non-leaf SPTEs, etc.). Add a comment explaining the restriction and only instrument the code paths that set shadow-present leaf SPTEs. To account for access tracking, also check the SPTE writable invariants when marking an SPTE as an access track SPTE. This also lets us remove a redundant WARN from mark_spte_for_access_track(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230518.1697048-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Move the WARNs in spte_can_locklessly_be_made_writable() to a separate helper function. This is in preparation for moving these checks to the places where SPTEs are set. Opportunistically add warning error messages that include the SPTE to make future debugging of these warnings easier. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230518.1697048-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Wanpeng Li authored
As commit 0c5f81da ("KVM: LAPIC: Inject timer interrupt via posted interrupt") mentioned that the host admin should well tune the guest setup, so that vCPUs are placed on isolated pCPUs, and with several pCPUs surplus for *busy* housekeeping. In this setup, it is preferrable to disable mwait/hlt/pause vmexits to keep the vCPUs in non-root mode. However, if only some guests isolated and others not, they would not have any benefit from posted timer interrupts, and at the same time lose VMX preemption timer fast paths because kvm_can_post_timer_interrupt() returns true and therefore forces kvm_can_use_hv_timer() to false. By guaranteeing that posted-interrupt timer is only used if MWAIT or HLT are done without vmexit, KVM can make a better choice and use the VMX preemption timer and the corresponding fast paths. Reported-by: Aili Yao <yaoaili@kingsoft.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1643112538-36743-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Wanpeng Li authored
When delivering a virtual interrupt, don't actually send a posted interrupt if the target vCPU is also the currently running vCPU and is IN_GUEST_MODE, in which case the interrupt is being sent from a VM-Exit fastpath and the core run loop in vcpu_enter_guest() will manually move the interrupt from the PIR to vmcs.GUEST_RVI. IRQs are disabled while IN_GUEST_MODE, thus there's no possibility of the virtual interrupt being sent from anything other than KVM, i.e. KVM won't suppress a wake event from an IRQ handler (see commit fdba608f, "KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU"). Eliding the posted interrupt restores the performance provided by the combination of commits 379a3c8e ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath") and 26efe2fd ("KVM: VMX: Handle preemption timer fastpath"). Thanks Sean for better comments. Suggested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1643111979-36447-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-