- 20 Apr, 2021 21 commits
-
-
Sean Christopherson authored
Introduce a scheme that allows KVM's CPUID magic to support features that are scattered in the kernel's feature words. To advertise and/or query guest support for CPUID-based features, KVM requires the bit number of an X86_FEATURE_* to match the bit number in its associated CPUID entry. For scattered features, this does not hold true. Add a framework to allow defining KVM-only words, stored in kvm_cpu_caps after the shared kernel caps, that can be used to gather the scattered feature bits by translating X86_FEATURE_* flags into their KVM-defined feature. Note, because reverse_cpuid_check() effectively forces kvm_cpu_caps lookups to be resolved at compile time, there is no runtime cost for translating from kernel-defined to kvm-defined features. More details here: https://lkml.kernel.org/r/X/jxCOLG+HUO4QlZ@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <16cad8d00475f67867fb36701fc7fb7c1ec86ce1.1618196135.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
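A minimal, self-contained sketch of the translation idea (the word numbers, feature values, and the kvm_translate_feature() helper below are illustrative assumptions, not KVM's actual definitions): a scattered kernel feature is redirected to a KVM-only word appended after the shared kernel caps, where its bit number once again matches the hardware CPUID bit, so lookups can remain compile-time.

  #include <stdio.h>

  /* Illustrative values: pretend the kernel scatters this feature into
   * word 7, bit 11, while CPUID actually reports it in bit 0 of a
   * leaf-specific register. */
  #define NCAPINTS            20               /* shared kernel cap words (made up) */
  #define X86_FEATURE_FOO     (7 * 32 + 11)    /* kernel-defined, scattered */

  /* KVM-only word stored after the shared kernel caps. */
  #define CPUID_12_EAX        NCAPINTS
  #define KVM_X86_FEATURE_FOO (CPUID_12_EAX * 32 + 0)

  /* Translate a kernel-defined feature to its KVM-defined twin so the bit
   * number lines up with the CPUID entry again. */
  static unsigned int kvm_translate_feature(unsigned int x86_feature)
  {
          switch (x86_feature) {
          case X86_FEATURE_FOO:
                  return KVM_X86_FEATURE_FOO;
          default:
                  return x86_feature;  /* not scattered, already lines up */
          }
  }

  int main(void)
  {
          unsigned int f = kvm_translate_feature(X86_FEATURE_FOO);

          printf("kvm_cpu_caps word %u, bit %u\n", f / 32, f % 32);
          return 0;
  }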
-
Sean Christopherson authored
Page faults that are signaled by the SGX Enclave Page Cache Map (EPCM), as opposed to the traditional IA32/EPT page tables, set an SGX bit in the error code to indicate that the #PF was induced by SGX. KVM will need to emulate this behavior as part of its trap-and-execute scheme for virtualizing SGX Launch Control, e.g. to inject SGX-induced #PFs if EINIT faults in the host, and to support live migration. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <e170c5175cb9f35f53218a7512c9e3db972b97a2.1618196135.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
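A small, self-contained sketch of the error-code detail; the PFERR_* names and the helper are placeholders, and the bit position follows the SDM's placement of the SGX flag at bit 15 of the page-fault error code.

  #include <stdbool.h>
  #include <stdio.h>

  /* Page-fault error code bits (subset).  Bit 15 is the SGX flag, set when
   * the fault is induced by the EPCM rather than the ordinary page tables. */
  #define PFERR_PRESENT_MASK (1u << 0)
  #define PFERR_WRITE_MASK   (1u << 1)
  #define PFERR_SGX_MASK     (1u << 15)

  static bool pf_is_sgx_induced(unsigned int error_code)
  {
          return error_code & PFERR_SGX_MASK;
  }

  int main(void)
  {
          unsigned int ec = PFERR_PRESENT_MASK | PFERR_WRITE_MASK | PFERR_SGX_MASK;

          printf("SGX-induced #PF: %s\n", pf_is_sgx_induced(ec) ? "yes" : "no");
          return 0;
  }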
-
Sean Christopherson authored
Export the gva_to_gpa() helpers for use by SGX virtualization when executing ENCLS[ECREATE] and ENCLS[EINIT] on behalf of the guest. To execute ECREATE and EINIT, KVM must obtain the GPA of the target Secure Enclave Control Structure (SECS) in order to get its corresponding HVA. Because the SECS must reside in the Enclave Page Cache (EPC), copying the SECS's data to a host-controlled buffer via existing exported helpers is not a viable option as the EPC is not readable or writable by the kernel. SGX virtualization will also use gva_to_gpa() to obtain HVAs for non-EPC pages in order to pass user pointers directly to ECREATE and EINIT, which avoids having to copy pages worth of data into the kernel. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <02f37708321bcdfaa2f9d41c8478affa6e84b04d.1618196135.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Yanan Wang authored
This test serves as a performance tester and a bug reproducer for kvm page table code (GPA->HPA mappings), so it gives guidance for people trying to make some improvement for kvm. The function guest_code() can cover the conditions where a single vcpu or multiple vcpus access guest pages within the same memory region, in three VM stages (before dirty logging, during dirty logging, after dirty logging). Besides, the backing src memory type (ANONYMOUS/THP/HUGETLB) of the tested memory region can be specified by users, which means normal page mappings or block mappings can be chosen by users to be created in the test. If ANONYMOUS memory is specified, kvm will create normal page mappings for the tested memory region before dirty logging, and update attributes of the page mappings from RO to RW during dirty logging. If THP/HUGETLB memory is specified, kvm will create block mappings for the tested memory region before dirty logging, split the block mappings into normal page mappings during dirty logging, and coalesce the page mappings back into block mappings after dirty logging is stopped. So in summary, as a performance tester, this test can present the performance of kvm creating/updating normal page mappings, or the performance of kvm creating/splitting/recovering block mappings, through execution time. When we need to coalesce the page mappings back to block mappings after dirty logging is stopped, we have to first invalidate *all* the TLB entries for the page mappings right before installation of the block entry, because a TLB conflict abort error could occur if we can't invalidate the TLB entries fully. We have hit this TLB conflict twice on the aarch64 software implementation and fixed it. As this test can simulate the process of a VM with block mappings going from dirty logging enabled to dirty logging stopped, it can also reproduce this TLB conflict abort due to inadequate TLB invalidation when coalescing tables. Signed-off-by: Yanan Wang <wangyanan55@huawei.com> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Message-Id: <20210330080856.14940-11-wangyanan55@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Yanan Wang authored
With VM_MEM_SRC_ANONYMOUS_THP specified in vm_userspace_mem_region_add(), we have to get the transparent hugepage size for HVA alignment. With the new helpers, we can use get_backing_src_pagesz() to check whether THP is configured and then get the exact configured hugepage size. As different architectures may have different THP page sizes configured, this can get the accurate THP page sizes on any platform. Signed-off-by: Yanan Wang <wangyanan55@huawei.com> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Message-Id: <20210330080856.14940-10-wangyanan55@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
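A tiny sketch of what the alignment looks like once the page size is known; align_up() and the 2M example value are illustrative assumptions, not the selftest's exact code.

  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Round a host virtual address up to the backing source page size, e.g.
   * the value returned by a get_backing_src_pagesz()-style helper for THP. */
  static uint64_t align_up(uint64_t hva, uint64_t pagesz)
  {
          return (hva + pagesz - 1) & ~(pagesz - 1);
  }

  int main(void)
  {
          uint64_t thp_size = 2 * 1024 * 1024;  /* e.g. 2M THP on x86 */

          printf("0x%" PRIx64 "\n", align_up(0x7f0000001000ULL, thp_size));
          return 0;
  }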
-
Yanan Wang authored
With VM_MEM_SRC_ANONYMOUS_HUGETLB, we currently can only use system default hugetlb pages to back the testing guest memory. In order to add flexibility, now list all the known hugetlb backing src types with different page sizes, so that we can specify use of hugetlb pages of the exact granularity that we want. And as all the known hugetlb page sizes are listed, it's appropriate for all architectures. Besides, the helper get_backing_src_pagesz() is added to get the granularity of different backing src types (anonymous, thp, hugetlb). Suggested-by: Ben Gardon <bgardon@google.com> Signed-off-by: Yanan Wang <wangyanan55@huawei.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Message-Id: <20210330080856.14940-9-wangyanan55@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
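A rough sketch of the idea; the table entries, names, and the get_backing_src_pagesz() stand-in below are illustrative (the real selftest list covers many more hugetlb sizes, and 4K base pages are assumed).

  #include <stddef.h>
  #include <stdio.h>

  struct backing_src {
          const char *name;
          size_t pagesz;          /* granularity used for alignment etc. */
  };

  /* One entry per known backing type, each with an explicit page size. */
  static const struct backing_src srcs[] = {
          { "anonymous",               4096      },  /* assuming 4K base pages */
          { "anonymous_thp",           2UL << 20 },
          { "anonymous_hugetlb_2mb",   2UL << 20 },
          { "anonymous_hugetlb_1gb",   1UL << 30 },
  };

  /* Minimal analogue of get_backing_src_pagesz(). */
  static size_t get_backing_src_pagesz(size_t i)
  {
          return i < sizeof(srcs) / sizeof(srcs[0]) ? srcs[i].pagesz : 0;
  }

  int main(void)
  {
          for (size_t i = 0; i < sizeof(srcs) / sizeof(srcs[0]); i++)
                  printf("%-25s %zu bytes\n", srcs[i].name, get_backing_src_pagesz(i));
          return 0;
  }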
-
Yanan Wang authored
If HUGETLB is configured in the host kernel, then we can learn the system default hugetlb page size through *cat /proc/meminfo*; otherwise, /proc/meminfo will show no hugetlb page information at all. So add a helper to determine whether HUGETLB is configured and then get the default page size by reading /proc/meminfo. This helper can be useful when a program wants to use the default hugetlb pages of the system and doesn't know the default page size. Signed-off-by: Yanan Wang <wangyanan55@huawei.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Message-Id: <20210330080856.14940-8-wangyanan55@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
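A minimal, runnable sketch of such a helper, assuming the "Hugepagesize:" line format of /proc/meminfo; a zero return stands in for "hugetlb not configured", and the function name is a placeholder.

  #include <stdio.h>

  /* Return the default hugetlb page size in KiB, or 0 if hugetlb is absent. */
  static unsigned long get_def_hugetlb_pagesz_kb(void)
  {
          char line[128];
          unsigned long kb = 0;
          FILE *f = fopen("/proc/meminfo", "r");

          if (!f)
                  return 0;

          while (fgets(line, sizeof(line), f)) {
                  if (sscanf(line, "Hugepagesize: %lu kB", &kb) == 1)
                          break;
          }

          fclose(f);
          return kb;
  }

  int main(void)
  {
          printf("default hugetlb page size: %lu kB\n", get_def_hugetlb_pagesz_kb());
          return 0;
  }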
-
Yanan Wang authored
If we want to have some tests about transparent hugepages, the tests should know the THP hugepage size configured on the system, which can be used for various kinds of alignment or for guest memory accesses by vcpus. So it makes sense to add a helper to get the transparent hugepage size. With VM_MEM_SRC_ANONYMOUS_THP specified in vm_userspace_mem_region_add(), we now stat /sys/kernel/mm/transparent_hugepage to check whether THP is configured in the host kernel before madvise(). Based on this, we can also read file /sys/kernel/mm/transparent_hugepage/hpage_pmd_size to get the THP hugepage size. Signed-off-by: Yanan Wang <wangyanan55@huawei.com> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Message-Id: <20210330080856.14940-7-wangyanan55@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
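A minimal sketch of reading the THP size from sysfs; the function name is a placeholder, and a zero return stands in for "THP not configured".

  #include <stdio.h>

  /* Report the host's THP size in bytes by reading hpage_pmd_size. */
  static unsigned long get_thp_pagesz(void)
  {
          unsigned long size = 0;
          FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");

          if (!f)
                  return 0;
          if (fscanf(f, "%lu", &size) != 1)
                  size = 0;
          fclose(f);
          return size;
  }

  int main(void)
  {
          printf("THP size: %lu bytes\n", get_thp_pagesz());
          return 0;
  }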
-
Yanan Wang authored
For generality and conciseness, make an API which can be used in all kvm libs and selftests to get vm guest mode strings. And the index i is checked in the API to guard against possible faults. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yanan Wang <wangyanan55@huawei.com> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Message-Id: <20210330080856.14940-6-wangyanan55@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
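An illustrative analogue of such an API; the mode strings below are placeholders, and plain assert() stands in for the selftest's assertion macro.

  #include <assert.h>
  #include <stdio.h>

  static const char * const guest_mode_names[] = {
          "PA-bits:52, VA-bits:48, 4K pages",
          "PA-bits:52, VA-bits:48, 64K pages",
          /* ... */
  };

  /* Validate the index before use so a bogus mode id fails loudly instead
   * of reading out of bounds. */
  static const char *vm_guest_mode_string(unsigned int i)
  {
          assert(i < sizeof(guest_mode_names) / sizeof(guest_mode_names[0]));
          return guest_mode_names[i];
  }

  int main(void)
  {
          printf("%s\n", vm_guest_mode_string(0));
          return 0;
  }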
-
Yanan Wang authored
Printing the errno besides the error string in TEST_ASSERT, in the format "errno=%d - %s", explicitly indicates that the string is error information. Besides, the errno is easier to use for debugging than the error string. Suggested-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Yanan Wang <wangyanan55@huawei.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Message-Id: <20210330080856.14940-5-wangyanan55@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
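A hypothetical macro in the spirit of the change, showing the "errno=%d - %s" format; TEST_ASSERT_ERRNO is not the selftest's real macro name, just a self-contained stand-in.

  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Print the raw errno alongside strerror() so the number is available
   * for debugging, not only the human-readable string. */
  #define TEST_ASSERT_ERRNO(cond, msg)                                       \
          do {                                                               \
                  if (!(cond)) {                                             \
                          fprintf(stderr, "%s: errno=%d - %s\n",             \
                                  (msg), errno, strerror(errno));            \
                          exit(1);                                           \
                  }                                                          \
          } while (0)

  int main(void)
  {
          int fd = open("/nonexistent", O_RDONLY);

          TEST_ASSERT_ERRNO(fd >= 0, "open() failed");
          return 0;
  }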
-
Yanan Wang authored
This patch syncs contents of tools/include/asm-generic/hugetlb_encode.h and include/uapi/asm-generic/hugetlb_encode.h. Arch powerpc supports 16KB hugepages and ARM64 supports 32MB/512MB hugepages. The corresponding mmap flags have already been added in include/uapi/asm-generic/hugetlb_encode.h, but not tools/include/asm-generic/hugetlb_encode.h. Cc: Ingo Molnar <mingo@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Yanan Wang <wangyanan55@huawei.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20210330080856.14940-2-wangyanan55@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
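For reference, a small self-contained snippet of the encoding scheme behind those flags as I understand it from asm-generic/hugetlb_encode.h: log2 of the page size stored above bit 26.

  #include <stdio.h>

  #define HUGETLB_FLAG_ENCODE_SHIFT 26
  #define HUGETLB_FLAG_ENCODE_16KB  (14U << HUGETLB_FLAG_ENCODE_SHIFT)  /* 2^14 = 16KB  */
  #define HUGETLB_FLAG_ENCODE_32MB  (25U << HUGETLB_FLAG_ENCODE_SHIFT)  /* 2^25 = 32MB  */
  #define HUGETLB_FLAG_ENCODE_512MB (29U << HUGETLB_FLAG_ENCODE_SHIFT)  /* 2^29 = 512MB */

  int main(void)
  {
          printf("16KB  -> 0x%08x\n", HUGETLB_FLAG_ENCODE_16KB);
          printf("32MB  -> 0x%08x\n", HUGETLB_FLAG_ENCODE_32MB);
          printf("512MB -> 0x%08x\n", HUGETLB_FLAG_ENCODE_512MB);
          return 0;
  }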
-
Haiwei Li authored
Add compile-time assertions in vmcs_check32() to disallow accesses to 64-bit and 64-bit high fields via vmcs_{read,write}32(). Upper level KVM code should never do partial accesses to VMCS fields. KVM handles the split accesses automatically in vmcs_{read,write}64() when running as a 32-bit kernel. Reviewed-and-tested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Haiwei Li <lihaiwei@tencent.com> Message-Id: <20210409022456.23528-1-lihaiwei.kernel@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
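A self-contained sketch of the compile-time idea, assuming VMCS field encodings keep the access width in bits 14:13; check32() and the mask names are hypothetical stand-ins, not KVM's exact helpers.

  #include <assert.h>
  #include <stdio.h>

  /* VMCS field encoding width bits 14:13:
   * 0 = 16-bit, 1 = 64-bit, 2 = 32-bit, 3 = natural width. */
  #define VMCS_WIDTH_MASK    0x6000UL
  #define VMCS_WIDTH_64BIT   0x2000UL

  /* Compile-time guard in the spirit of vmcs_check32(): a 32-bit accessor
   * applied to a 64-bit field does not build. */
  #define check32(field)                                                     \
          static_assert(((field) & VMCS_WIDTH_MASK) != VMCS_WIDTH_64BIT,     \
                        "32-bit accessor invalid for 64-bit field")

  #define VM_EXIT_REASON          0x4402UL  /* a 32-bit field: accepted   */
  #define GUEST_PHYSICAL_ADDRESS  0x2400UL  /* a 64-bit field: would fail */

  check32(VM_EXIT_REASON);
  /* check32(GUEST_PHYSICAL_ADDRESS); -- fails to compile, as intended */

  int main(void)
  {
          puts("32-bit field accepted at compile time");
          return 0;
  }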
-
Sean Christopherson authored
Convert a comment above kvm_io_bus_unregister_dev() into an actual lockdep assertion, and opportunistically add curly braces to a multi-line for-loop. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210412222050.876100-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Abort the walk of coalesced MMIO zones if kvm_io_bus_unregister_dev() fails to allocate memory for the new instance of the bus. If it can't instantiate a new bus, unregister_dev() destroys all devices _except_ the target device. But, it doesn't tell the caller that it obliterated the bus and invoked the destructor for all devices that were on the bus. In the coalesced MMIO case, this can result in a deleted list entry dereference due to attempting to continue iterating on coalesced_zones after future entries (in the walk) have been deleted. Opportunistically add curly braces to the for-loop, which encompasses many lines but sneaks by without braces due to the guts being a single if statement. Fixes: f6588660 ("KVM: fix memory leak in kvm_io_bus_unregister_dev()") Cc: stable@vger.kernel.org Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210412222050.876100-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
If allocating a new instance of an I/O bus fails when unregistering a device, wait to destroy the device until after all readers are guaranteed to see the new null bus. Destroying devices before the bus is nullified could lead to use-after-free since readers expect the devices on their reference of the bus to remain valid. Fixes: f6588660 ("KVM: fix memory leak in kvm_io_bus_unregister_dev()") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210412222050.876100-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Emanuele Giuseppe Esposito authored
KVM_CAP_PPC_MULTITCE is a capability, not an ioctl. Therefore move it from section 4.97 to the new 8.31 (other capabilities). To fill the gap, move KVM_X86_SET_MSR_FILTER (was 4.126) to 4.97, and shift the Xen-related ioctls (were 4.127 - 4.130) by one place (4.126 - 4.129). Also fix a minor typo in the KVM_GET_MSR_INDEX_LIST ioctl description (section 4.3). Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Message-Id: <20210316170814.64286-1-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Keqian Zhu authored
kvm_mmu_slot_largepage_remove_write_access() is declared but not used, just remove it. Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com> Message-Id: <20210406063504.17552-1-zhukeqian1@huawei.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Explicitly document why a vmcb must be marked dirty and assigned a new asid when it will be run on a different cpu. The "what" is relatively obvious, whereas the "why" requires reading the APM and/or KVM code. Opportunistically remove a spurious period and several unnecessary newlines in the comment. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210406171811.4043363-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add a comment above the declaration of vcpu_svm.vmcb to call out that it is simply a shorthand for current_vmcb->ptr. The myriad accesses to svm->vmcb are quite confusing without this crucial detail. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210406171811.4043363-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Remove vmcb_pa from vcpu_svm and simply read current_vmcb->pa directly in the one path where it is consumed. Unlike svm->vmcb, use of the current vmcb's address is very limited, as evidenced by the fact that its use can be trimmed to a single dereference. Opportunistically add a comment about using vmcb01 for VMLOAD/VMSAVE, as at first glance using vmcb01 instead of vmcb_pa looks wrong. No functional change intended. Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210406171811.4043363-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Do not update the new vmcb's last-run cpu when switching to a different vmcb. If the vCPU is migrated between its last run and a vmcb switch, e.g. for nested VM-Exit, then setting the cpu without marking the vmcb dirty will lead to KVM running the vCPU on a different physical cpu with stale clean bit settings.

                            vcpu->cpu    current_vmcb->cpu    hardware
    pre_svm_run()           cpu0         cpu0                 cpu0,clean
    kvm_arch_vcpu_load()    cpu1         cpu0                 cpu0,clean
    svm_switch_vmcb()       cpu1         cpu1                 cpu0,clean
    pre_svm_run()           cpu1         cpu1                 kaboom

Simply delete the offending code; unlike VMX, which needs to update the cpu at switch time due to the need to do VMPTRLD, SVM only cares about which cpu last ran the vCPU. Fixes: af18fa77 ("KVM: nSVM: Track the physical cpu of the vmcb vmrun through the vmcb") Cc: Cathy Avery <cavery@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210406171811.4043363-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 19 Apr, 2021 18 commits
-
-
Tom Lendacky authored
Access to the GHCB is mainly in the VMGEXIT path and it is known that the GHCB will be mapped. But there are two paths where it is possible the GHCB might not be mapped. The sev_vcpu_deliver_sipi_vector() routine will update the GHCB to inform the caller of the AP Reset Hold NAE event that a SIPI has been delivered. However, if a SIPI is performed without a corresponding AP Reset Hold, then the GHCB might not be mapped (depending on the previous VMEXIT), which will result in a NULL pointer dereference. The svm_complete_emulated_msr() routine will update the GHCB to inform the caller of a RDMSR/WRMSR operation about any errors. While it is likely that the GHCB will be mapped in this situation, add a safe guard in this path to be certain a NULL pointer dereference is not encountered. Fixes: f1c6366e ("KVM: SVM: Add required changes to support intercepts under SEV-ES") Fixes: 647daca2 ("KVM: SVM: Add support for booting APs in an SEV-ES guest") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: stable@vger.kernel.org Message-Id: <a5d3ebb600a91170fc88599d5a575452b3e31036.1617979121.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Wanpeng Li authored
If the target vCPU is the caller itself, there is no need to yield; this also keeps a malicious guest from playing games with self-directed yields. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1617941911-5338-3-git-send-email-wanpengli@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Wanpeng Li authored
To analyze some performance issues with lock contention and scheduling, it is nice to know when directed yields succeed or fail. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1617941911-5338-2-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Wanpeng Li authored
Enabling PV TLB shootdown when !CONFIG_SMP doesn't make sense, so move it inside CONFIG_SMP. In addition, we can avoid defining and allocating __pv_cpu_mask when !CONFIG_SMP and get rid of the 'alloc' variable in kvm_alloc_cpumask. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1617941911-5338-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
To avoid saddling a vCPU thread with the work of tearing down an entire paging structure, take a reference on each root before they become obsolete, so that the thread initiating the fast invalidation can tear down the paging structure and (most likely) release the last reference. As a bonus, this teardown can happen under the MMU lock in read mode so as not to block the progress of vCPU threads. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210401233736.638171-14-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
Provide a real mechanism for fast invalidation by marking roots as invalid so that their reference count will quickly fall to zero and they will be torn down. One negative side effect of this approach is that a vCPU thread will likely drop the last reference to a root and be saddled with the work of tearing down an entire paging structure. This issue will be resolved in a later commit. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210401233736.638171-13-bgardon@google.com> [Move the loop to tdp_mmu.c, otherwise compilation fails on 32-bit. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
To reduce lock contention and interference with page fault handlers, allow the TDP MMU functions which enable and disable dirty logging to operate under the MMU read lock. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210401233736.638171-12-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
To reduce the impact of disabling dirty logging, change the TDP MMU function which zaps collapsible SPTEs to run under the MMU read lock. This way, page faults on zapped SPTEs can proceed in parallel with kvm_mmu_zap_collapsible_sptes. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210401233736.638171-11-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
To reduce lock contention and interference with page fault handlers, allow the TDP MMU function to zap a GFN range to operate under the MMU read lock. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210401233736.638171-10-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
Protect the contents of the TDP MMU roots list with RCU in preparation for a future patch which will allow the iterator macro to be used under the MMU lock in read mode. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210401233736.638171-9-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
To reduce dependence on the MMU write lock, don't rely on the assumption that the atomic operation in kvm_tdp_mmu_get_root will always succeed. By not relying on that assumption, threads do not need to hold the MMU lock in write mode in order to take a reference on a TDP MMU root. In the root iterator, this change means that some roots might have to be skipped if they are found to have a zero refcount. This will still never happen as of this patch, but a future patch will need that flexibility to make the root iterator safe under the MMU read lock. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210401233736.638171-8-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
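A self-contained sketch of the "don't assume the get succeeds" pattern, using C11 atomics as a stand-in for the kernel's refcounting (root_try_get() and the struct are hypothetical): take a reference only if the count is still non-zero, so an iterator running without the write lock can simply skip roots that are on their way out.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>

  struct root {
          atomic_int refcount;
  };

  static bool root_try_get(struct root *r)
  {
          int old = atomic_load(&r->refcount);

          while (old != 0) {
                  if (atomic_compare_exchange_weak(&r->refcount, &old, old + 1))
                          return true;    /* reference taken */
                  /* old was reloaded by the failed CAS; retry */
          }
          return false;                   /* root is dying, caller skips it */
  }

  int main(void)
  {
          struct root live = { .refcount = 1 }, dying = { .refcount = 0 };

          printf("live: %d, dying: %d\n", root_try_get(&live), root_try_get(&dying));
          return 0;
  }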
-
Ben Gardon authored
In order to parallelize more operations for the TDP MMU, make the refcount on TDP MMU roots atomic, so that a future patch can allow multiple threads to take a reference on the root concurrently, while holding the MMU lock in read mode. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210401233736.638171-7-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
Refactor the yield safe TDP MMU root iterator to be more amenable to changes in future commits which will allow it to be used under the MMU lock in read mode. Currently the iterator requires a complicated dance between the helper functions and different parts of the for loop which makes it hard to reason about. Moving all the logic into a single function simplifies the iterator substantially. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210401233736.638171-6-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
kvm_tdp_mmu_put_root and kvm_tdp_mmu_free_root are always called together, so merge the functions to simplify TDP MMU root refcounting / freeing. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210401233736.638171-5-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
Minor cleanup to deduplicate the code used to free a struct kvm_mmu_page in the TDP MMU. No functional change intended. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210401233736.638171-4-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
The TDP MMU is almost the only user of kvm_mmu_get_root and kvm_mmu_put_root. There is only one use of put_root in mmu.c for the legacy / shadow MMU. Open code that one use and move the get / put functions to the TDP MMU so they can be extended in future commits. No functional change intended. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210401233736.638171-3-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
kvm_tdp_mmu_zap_collapsible_sptes unnecessarily removes the const qualifier from its memslot argument, leading to a compiler warning. Add the const annotation and pass it to subsequent functions. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210401233736.638171-2-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Let the TDP MMU yield when unmapping a range in response to an MMU notification, if yielding is allowed by said notification. There is no reason to disallow yielding in this case, and in theory the range being invalidated could be quite large. Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210402005658.3024832-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 17 Apr, 2021 1 commit
-
-
Sean Christopherson authored
Defer acquiring mmu_lock in the MMU notifier paths until a "hit" has been detected in the memslots, i.e. don't take the lock for notifications that don't affect the guest. For small VMs, spurious locking is a minor annoyance. And for "volatile" setups where the majority of notifications _are_ relevant, this barely qualifies as an optimization. But, for large VMs (hundreds of threads) with static setups, e.g. no page migration, no swapping, etc..., the vast majority of MMU notifier callbacks will be unrelated to the guest, e.g. will often be in response to the userspace VMM adjusting its own virtual address space. In such large VMs, acquiring mmu_lock can be painful as it blocks vCPUs from handling page faults. In some scenarios it can even be "fatal" in the sense that it causes unacceptable brownouts, e.g. when rebuilding huge pages after live migration, a significant percentage of vCPUs will be attempting to handle page faults. x86's TDP MMU implementation is especially susceptible to spurious locking due to it taking mmu_lock for read when handling page faults. Because rwlock is fair, a single writer will stall future readers, while the writer is itself stalled waiting for in-progress readers to complete. This is exacerbated by the MMU notifiers often firing multiple times in quick succession, e.g. moving a page will (always?) invoke three separate notifiers: .invalidate_range_start(), .invalidate_range_end(), and .change_pte(). Unnecessarily taking mmu_lock each time means even a single spurious sequence can be problematic. Note, this optimizes only the unpaired callbacks. Optimizing the .invalidate_range_{start,end}() pairs is more complex and will be done in a future patch. Suggested-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210402005658.3024832-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
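A simplified sketch of the check-before-locking idea, with hypothetical types (KVM's real memslot structures and lookup differ): scan the memslots for an overlap with the notifier's HVA range first, and only take mmu_lock when the range actually maps guest memory.

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  struct memslot {
          unsigned long userspace_addr;   /* HVA of the slot's start */
          unsigned long size;             /* slot size in bytes */
  };

  /* Return true only if [start, end) overlaps some slot; only then is it
   * worth acquiring mmu_lock for the notification. */
  static bool range_hits_memslots(const struct memslot *slots, size_t nr,
                                  unsigned long start, unsigned long end)
  {
          for (size_t i = 0; i < nr; i++) {
                  unsigned long s = slots[i].userspace_addr;
                  unsigned long e = s + slots[i].size;

                  if (start < e && end > s)
                          return true;    /* the range maps guest memory */
          }
          return false;                   /* spurious notification, skip the lock */
  }

  int main(void)
  {
          struct memslot slots[] = { { 0x100000, 0x200000 } };

          printf("hit: %d\n", range_hits_memslots(slots, 1, 0x180000, 0x181000));
          printf("hit: %d\n", range_hits_memslots(slots, 1, 0x900000, 0x901000));
          return 0;
  }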
-