- 08 Jun, 2021 4 commits
-
-
Lai Jiangshan authored
In record_steal_time(), st->preempted is read twice, so trace_kvm_pv_tlb_flush() might output an inconsistent result if kvm_vcpu_flush_tlb_guest() sees a different st->preempted later. It is a trivial problem that hardly causes actual harm, and it can be avoided by resetting and reading st->preempted atomically via xchg(). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210531174628.10265-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
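A minimal sketch of the fixed pattern, simplified from the patch (the surrounding guest-map handling is omitted):

    /* Read and reset st->preempted in one atomic step, so the
     * tracepoint and the TLB-flush decision see the same value. */
    u8 st_preempted = xchg(&st->preempted, 0);

    trace_kvm_pv_tlb_flush(vcpu->vcpu_id,
                           st_preempted & KVM_VCPU_FLUSH_TLB);
    if (st_preempted & KVM_VCPU_FLUSH_TLB)
            kvm_vcpu_flush_tlb_guest(vcpu);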
-
Lai Jiangshan authored
When computing the access permissions of a shadow page, use the effective permissions of the walk up to that point, i.e. the logical AND of its parents' permissions. Two guest PxE entries that point at the same table gfn need to be shadowed with different shadow pages if their parents' permissions are different. KVM currently uses the effective permissions of the last non-leaf entry for all non-leaf entries. Because all non-leaf SPTEs have full ("uwx") permissions, and the effective permissions are recorded only in role.access and merged into the leaves, this can lead to incorrect reuse of a shadow page and eventually to a missing guest protection page fault. For example, here is a shared pagetable:

    pgd[]   pud[]         pmd[]             virtual address pointers
                       /->pmd1(u--)->pte1(uw-)->page1 <- ptr1 (u--)
         /->pud1(uw-)--->pmd2(uw-)->pte2(uw-)->page2 <- ptr2 (uw-)
    pgd-|               (shared pmd[] as above)
         \->pud2(u--)--->pmd1(u--)->pte1(uw-)->page1 <- ptr3 (u--)
                       \->pmd2(uw-)->pte2(uw-)->page2 <- ptr4 (u--)

pud1 and pud2 point to the same pmd table, so:
- ptr1 and ptr3 point to the same page.
- ptr2 and ptr4 point to the same page.
(pud1 and pud2 here are pud entries, while pmd1 and pmd2 here are pmd entries.)

- First, the guest reads from ptr1 and KVM prepares a shadow page table with role.access=u--, from ptr1's pud1 and ptr1's pmd1. "u--" comes from the effective permissions of pgd, pud1 and pmd1, which are stored in pt->access. "u--" is also used to get the pagetable for pud1, instead of "uw-".
- Then the guest writes to ptr2 and KVM reuses pud1, which is present. The hypervisor sets up a shadow page for ptr2 with pt->access = "uw-", even though pud1's pmd (because of the incorrect argument to kvm_mmu_get_page in the previous step) has role.access = "u--".
- Then the guest reads from ptr3. The hypervisor reuses pud1's shadow pmd for pud2, because both use "u--" for their permissions. Thus, the shadow pmd already includes entries for both pmd1 and pmd2.
- At last, the guest writes to ptr4. This causes no vmexit or page fault, because pud1's shadow page structures included an "uw-" page even though its role.access was "u--".

Any kind of shared pagetable might have a similar problem in a virtual machine without TDP enabled, if the permissions differ between ancestors. In order to fix the problem, we change pt->access to be an array, and any access in it will not include permissions ANDed from child ptes. The test code is: https://lore.kernel.org/kvm/20210603050537.19605-1-jiangshanlai@gmail.com/ Remember to test it with TDP disabled. The problem had existed long before commit 41074d07 ("KVM: MMU: Fix inherited permissions for emulated guest pte updates"), and it is hard to find which commit is the culprit, so there is no fixes tag here. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210603052455.21023-1-jiangshanlai@gmail.com> Cc: stable@vger.kernel.org Fixes: cea0f0e7 ("[PATCH] KVM: MMU: Shadow page table caching") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
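An illustrative sketch of the invariant the fix enforces; the names and ACC_* masks below are simplified stand-ins, not the actual KVM MMU code:

    #define ACC_EXEC   1u
    #define ACC_WRITE  2u
    #define ACC_USER   4u
    #define ACC_ALL    (ACC_EXEC | ACC_WRITE | ACC_USER)   /* "uwx" */

    /* The effective access of a shadow page at a given level is the
     * logical AND of *every* ancestor entry's permissions, not just
     * the last non-leaf entry's; pte_access[] holds the guest
     * walker's per-level permissions. */
    static unsigned int effective_access(const unsigned int *pte_access,
                                         int top_level, int level)
    {
            unsigned int access = ACC_ALL;

            for (int l = top_level; l >= level; l--)
                    access &= pte_access[l];
            return access;
    }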
-
Wanpeng Li authored
According to SDM 10.5.4.1, a write of 0 to the initial-count register effectively stops the local APIC timer, in both one-shot and periodic mode. However, the LAPIC timer in one-shot/periodic mode, when emulated by the VMX preemption timer, does not stop on a write of 0 to TMICT, since vmx->hv_deadline_tsc is still programmed and the guest will receive a spurious timer interrupt later. This patch fixes it by also cancelling the VMX preemption timer when writing 0 to the initial-count register. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1623050385-100988-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
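A hedged sketch of the shape of the fix; cancel_apic_timer() is an assumed helper name standing in for "cancel both the hrtimer and the backing VMX preemption timer", and the real diff may be structured differently:

    case APIC_TMICT:
            if (apic_lvtt_tscdeadline(apic))
                    break;

            /* TMICT == 0 must stop the timer, including the backing
             * VMX preemption timer, not only the hrtimer. */
            cancel_apic_timer(apic);        /* assumed helper */
            kvm_lapic_set_reg(apic, APIC_TMICT, val);
            start_apic_timer(apic);
            break;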
-
Ashish Kalra authored
KVM: SVM: Fix SEV SEND_START session length & SEND_UPDATE_DATA query length after commit 238eca82 Commit 238eca82 ("KVM: SVM: Allocate SEV command structures on local stack") uses the local stack to allocate the structures used to communicate with the PSP, which were earlier being kzalloc'ed. This breaks SEV live migration when computing the SEND_START session length and the SEND_UPDATE_DATA query length, because the session_len, hdr_len and trans_len fields are not zeroed before issuing the SEV firmware API call for the above commands; hence the firmware returns an incorrect session length or update-data header/trans length. Also, the SEV firmware API returns the SEV_RET_INVALID_LEN firmware error for these length-query API calls, and the return value and the firmware error need to be passed to userspace as-is, so the return check in the KVM code needs to be removed. Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Message-Id: <20210607061532.27459-1-Ashish.Kalra@amd.com> Fixes: 238eca82 ("KVM: SVM: Allocate SEV command structures on local stack") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
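A minimal sketch of the length-query pattern after the fix, assuming the structure and command names from <linux/psp-sev.h> (error plumbing trimmed):

    struct sev_data_send_start data;

    /* On-stack structures are not zeroed implicitly; session_len == 0
     * turns this into a length query. */
    memset(&data, 0, sizeof(data));
    ret = sev_issue_cmd(kvm, SEV_CMD_SEND_START, &data, &argp->error);

    /* Do not filter SEV_RET_INVALID_LEN: hand both ret and the
     * firmware error back to userspace as-is. */
    params.session_len = data.session_len;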
-
- 29 May, 2021 1 commit
-
-
Paolo Bonzini authored
vm_create allocates memory and maps it close to GPA. This memory is separate from what is allocated in subsequent calls to vm_userspace_mem_region_add, so it is incorrect to pass the test memory size to vm_create_default. Just pass a small fixed amount of memory, which can be used later for page tables; otherwise GPAs are already allocated at MEM_GPA and the test aborts. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 28 May, 2021 4 commits
-
-
Paolo Bonzini authored
Merge tag 'kvmarm-fixes-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 5.13, take #2

- Another state update on exit to userspace fix
- Prevent the creation of mixed 32/64 VMs
-
Wanpeng Li authored
ctxt->ud is consumed only by x86_decode_insn(), so we can kill it off by passing emulation_type to x86_decode_insn() and dropping ctxt->ud altogether. Tracking that info in ctxt for literally one call is silly. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <1622160097-37633-2-git-send-email-wanpengli@tencent.com>
-
Wanpeng Li authored
Reported by syzkaller:

    WARNING: CPU: 7 PID: 10526 at linux/arch/x86/kvm//x86.c:7621 x86_emulate_instruction+0x41b/0x510 [kvm]
    RIP: 0010:x86_emulate_instruction+0x41b/0x510 [kvm]
    Call Trace:
     kvm_mmu_page_fault+0x126/0x8f0 [kvm]
     vmx_handle_exit+0x11e/0x680 [kvm_intel]
     vcpu_enter_guest+0xd95/0x1b40 [kvm]
     kvm_arch_vcpu_ioctl_run+0x377/0x6a0 [kvm]
     kvm_vcpu_ioctl+0x389/0x630 [kvm]
     __x64_sys_ioctl+0x8e/0xd0
     do_syscall_64+0x3c/0xb0
     entry_SYSCALL_64_after_hwframe+0x44/0xae

Commit 4a1e10d5 ("KVM: x86: handle hardware breakpoints during emulation") adds a hardware breakpoints check before emulating the instruction, together with parts of the emulation context initialization; here we don't have the EMULTYPE_NO_DECODE flag, so the emulation context will not be reused. Commit c8848cee ("KVM: x86: set ctxt->have_exception in x86_decode_insn()") triggers the warning because it catches a stale emulation context that has #UD set; however, this does not happen during instruction decoding, which should instead result in EMULATION_FAILED. This patch fixes it by moving the second part of the emulation context initialization into init_emulate_ctxt(), before the hardware breakpoints check. ctxt->ud will be dropped by a follow-up patch. syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=134683fdd00000 Reported-by: syzbot+71271244f206d17f6441@syzkaller.appspotmail.com Fixes: 4a1e10d5 (KVM: x86: handle hardware breakpoints during emulation) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <1622160097-37633-1-git-send-email-wanpengli@tencent.com>
-
Yuan Yao authored
kvm_get_linear_rip() handles the x86/long mode cases well and has better readability; __kvm_set_rflags() also uses the paired function kvm_is_linear_rip() to check vcpu->arch.singlestep_rip, which is set in kvm_arch_vcpu_ioctl_set_guest_debug(). So change the "CS.BASE + RIP" code in kvm_arch_vcpu_ioctl_set_guest_debug() and handle_exception_nmi() to use this helper. Signed-off-by: Yuan Yao <yuan.yao@intel.com> Message-Id: <20210526063828.1173-1-yuan.yao@linux.intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
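A sketch of the substitution; the "before" line is paraphrased from the description above:

    /* before (open-coded linear RIP):
     *   vcpu->arch.singlestep_rip = kvm_rip_read(vcpu) +
     *                               get_segment_base(vcpu, VCPU_SREG_CS);
     * after (handles long mode, pairs with kvm_is_linear_rip()):
     */
    vcpu->arch.singlestep_rip = kvm_get_linear_rip(vcpu);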
-
- 27 May, 2021 30 commits
-
-
David Matlack authored
This comment was left over from a previous version of the patch that introduced wrprot_gfn_range, when skip_4k was passed in instead of min_level. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210526163227.3113557-1-dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Marcelo Tosatti authored
For VMX, when a vCPU enters HLT emulation, pi_post_block will:

1) Add the vCPU to the per-CPU list of blocked vCPUs.
2) Program the posted-interrupt descriptor "notification vector" to POSTED_INTR_WAKEUP_VECTOR.

With interrupt remapping, an interrupt will set the PIR bit for the vector programmed for the device on the CPU, test-and-set the ON bit on the posted-interrupt descriptor, and if the ON bit was clear generate an interrupt for the notification vector. This way, the target CPU wakes upon a device interrupt and wakes up the target vCPU.

The problem is that pi_post_block only programs the notification vector if kvm_arch_has_assigned_device() is true. It's possible for the following to happen:

1) vCPU V HLTs on pCPU P, kvm_arch_has_assigned_device is false, the notification vector is not programmed
2) a device is assigned to the VM
3) the device interrupts vCPU V and sets the ON bit (the notification vector is not programmed, so pCPU P remains idle)
4) vCPU 0 IPIs vCPU V (in guest), but since the posted-interrupt descriptor ON bit is set, kvm_vcpu_kick is skipped
5) vCPU 0 busy spins on vCPU V's response for several seconds, until the RCU watchdog NMIs all vCPUs

To fix this, use the start_assignment kvm_x86_ops callback to kick vCPUs out of the halt loop, so the notification vector is properly reprogrammed to the wakeup vector. Reported-by: Pei Zhang <pezhang@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Message-Id: <20210526172014.GA29007@fuller.cnet> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
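A hedged sketch of what the VMX-side callback plausibly does (simplified; the exact upstream guards may differ):

    /* Kick every vCPU when a device is assigned, so blocked vCPUs
     * re-enter the block path and reprogram the wakeup vector. */
    static void vmx_pi_start_assignment(struct kvm *kvm)
    {
            struct kvm_vcpu *vcpu;
            int i;

            if (!irq_remapping_cap(IRQ_POSTING_CAP))
                    return;

            kvm_for_each_vcpu(i, vcpu, kvm) {
                    if (!kvm_vcpu_apicv_active(vcpu))
                            continue;
                    kvm_vcpu_kick(vcpu);
            }
    }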
-
Marcelo Tosatti authored
KVM_REQ_UNBLOCK will be used to exit a vcpu from its inner vcpu halt emulation loop. Rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK, switch PowerPC to arch specific request bit. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Message-Id: <20210525134321.303768132@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Marcelo Tosatti authored
Add a start_assignment hook to kvm_x86_ops, which is called when kvm_arch_start_assignment is done. The hook is required to update the wakeup vector of a sleeping vCPU when a device is assigned to the guest. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Message-Id: <20210525134321.254128742@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
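A sketch of how the arch-neutral wiring plausibly looks (the static_call name is an assumption):

    void kvm_arch_start_assignment(struct kvm *kvm)
    {
            /* Fire the hook only on the 0 -> 1 device transition. */
            if (atomic_inc_return(&kvm->arch.assigned_device_count) == 1)
                    static_call_cond(kvm_x86_start_assignment)(kvm);
    }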
-
Wanpeng Li authored
Let's treat the lapic_timer_advance_ns automatic tuning logic as hypervisor overhead and move it before wait_lapic_expire, instead of between wait_lapic_expire and the world switch; the wait duration should be calculated from the up-to-date guest_tsc after the overhead of the automatic tuning logic. This patch reduces latency by ~30+ cycles for kvm-unit-tests/tscdeadline-latency when testing busy waits. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1621339235-11131-5-git-send-email-wanpengli@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The test takes a long time with the current implementation of memslots, so cut the run time a bit. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Joe Richey authored
Replace BIT() in KVM's UAPI header with _BITUL(). BIT() is not defined in the UAPI headers and its usage may cause userspace build errors. Fixes: fb04a1ed ("KVM: X86: Implement ring-based dirty memory tracking") Signed-off-by: Joe Richey <joerichey@google.com> Message-Id: <20210521085849.37676-3-joerichey94@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
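The pattern, using the dirty-ring flags from the Fixes target as the example (values as in the upstream UAPI header):

    #include <linux/const.h>        /* _BITUL() is part of the UAPI */

    #define KVM_DIRTY_GFN_F_DIRTY           _BITUL(0)
    #define KVM_DIRTY_GFN_F_RESET           _BITUL(1)
    #define KVM_DIRTY_GFN_F_MASK            0x3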
-
Axel Rasmussen authored
This lets us run the demand paging test on top of a shared hugetlbfs-backed area. The "shared" is key, as this allows us to exercise userfaultfd minor faults on hugetlbfs. Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Message-Id: <20210519200339.829146-11-axelrasmussen@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Axel Rasmussen authored
UFFD handling of MINOR faults is a new feature whose use case is to speed up demand paging (compared to MISSING faults). So, it's interesting to let this selftest exercise this new mode. Modify the demand paging test to have the option of using UFFD minor faults, as opposed to missing faults. Now, when turning on userfaultfd with '-u', the desired mode has to be specified ("MISSING" or "MINOR"). If we're in minor mode, before registering, prefault via the *alias*. This way, the guest will trigger minor faults, instead of missing faults, and we can UFFDIO_CONTINUE to resolve them. Modify the page fault handler function to use the right ioctl depending on the mode we're running in. In MINOR mode, use UFFDIO_CONTINUE. Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Message-Id: <20210519200339.829146-10-axelrasmussen@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
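A sketch of the MINOR-mode resolution path in the handler (userspace, <linux/userfaultfd.h>; the selftest's real handler carries more bookkeeping):

    struct uffdio_continue cont;

    /* The page is already in the page cache (prefaulted via the
     * alias); UFFDIO_CONTINUE only installs the missing PTEs. */
    cont.range.start = fault_addr & ~(page_size - 1);
    cont.range.len   = page_size;
    cont.mode        = 0;

    r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
    TEST_ASSERT(r == 0, "UFFDIO_CONTINUE failed, errno = %d", errno);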
-
Axel Rasmussen authored
When a memory region is added with a src_type specifying that it should use some kind of shared memory, also create an alias mapping to the same underlying physical pages. And, add an API so tests can get access to these alias addresses. Basically, for a guest physical address, let us look up the analogous host *alias* address. In a future commit, we'll modify the demand paging test to take advantage of this to exercise UFFD minor faults. The idea is, we pre-fault the underlying pages *via the alias*. When the *guest* faults, it gets a "minor" fault (PTEs don't exist yet, but a page is already in the page cache). Then, the userfaultfd threads can handle the fault: they could potentially modify the underlying memory *via the alias* if they wanted to, and then they install the PTEs and let the guest carry on via a UFFDIO_CONTINUE ioctl. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Message-Id: <20210519200339.829146-9-axelrasmussen@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
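A sketch of the lookup-and-prefault flow; the helper name addr_gpa2alias() is assumed here for the new API:

    /* For a guest physical address, look up the host *alias* mapping
     * of the same physical pages, then populate through it. */
    void *alias = addr_gpa2alias(vm, gpa);

    memcpy(alias, src, page_size);  /* prefault via the alias */
    /* A later guest access to gpa raises a uffd MINOR fault, which
     * the handler resolves with UFFDIO_CONTINUE (sketched above). */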
-
Axel Rasmussen authored
This lets us run the demand paging test on top of a shmem-backed area. In follow-up commits, we'll 1) leverage this new capability to create an alias mapping, and then 2) use the alias mapping to exercise UFFD minor faults. Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Message-Id: <20210519200339.829146-8-axelrasmussen@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Axel Rasmussen authored
Each struct vm_mem_backing_src_alias has a flags field, which denotes the flags used to mmap() an area of that type. Previously, this field never included MAP_PRIVATE | MAP_ANONYMOUS, because vm_userspace_mem_region_add assumed that *all* types would always use those flags, and so it hardcoded them. In a follow-up commit, we'll add a new type: shmem. Areas of this type must not have MAP_PRIVATE | MAP_ANONYMOUS, and instead they must have MAP_SHARED. So, refactor things. Make it so that the flags field of struct vm_mem_backing_src_alias really is a complete set of flags, and don't add in any extras in vm_userspace_mem_region_add. This will let us easily tack on shmem. Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Message-Id: <20210519200339.829146-7-axelrasmussen@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
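A sketch of the refactored table, with field names per the selftests' struct vm_mem_backing_src_alias (the shmem entry is the follow-up addition):

    static const struct vm_mem_backing_src_alias aliases[] = {
            [VM_MEM_SRC_ANONYMOUS] = {
                    .name = "anonymous",
                    .flag = MAP_PRIVATE | MAP_ANONYMOUS, /* now explicit */
            },
            [VM_MEM_SRC_SHMEM] = {
                    .name = "shmem",
                    .flag = MAP_SHARED,  /* fd-backed: no MAP_ANONYMOUS */
            },
    };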
-
Axel Rasmussen authored
Add an argument which lets us specify a different backing memory type for the test. The default is just to use anonymous, matching existing behavior. This is in preparation for testing UFFD minor faults. For that, we'll need to use a new backing memory type which is setup with MAP_SHARED. Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Message-Id: <20210519200339.829146-6-axelrasmussen@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Axel Rasmussen authored
This is a preparatory commit needed before we can use different kinds of backing pages for guest memory. Previously, we used perf_test_args.host_page_size, which is the host's native page size (commonly 4K). For VM_MEM_SRC_ANONYMOUS this turns out to be okay, but in a follow-up commit we want to allow using different kinds of backing memory. Take VM_MEM_SRC_ANONYMOUS_HUGETLB for example. Without this change, if we used that backing page type, when we issued a UFFDIO_COPY ioctl we'd only do so with 4K, rather than the full 2M of a backing hugepage. In this case, UFFDIO_COPY returns -EINVAL (__mcopy_atomic_hugetlb checks the size). Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Message-Id: <20210519200339.829146-5-axelrasmussen@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
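A sketch of why the backing page size matters for the ioctl (userspace; the variable names are illustrative):

    struct uffdio_copy copy;

    copy.dst  = (uint64_t)huge_addr;  /* aligned to the backing page */
    copy.src  = (uint64_t)src_buf;
    copy.len  = backing_page_size;    /* 2M for ANONYMOUS_HUGETLB; a 4K
                                         len would get -EINVAL from
                                         __mcopy_atomic_hugetlb */
    copy.mode = 0;

    r = ioctl(uffd, UFFDIO_COPY, &copy);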
-
Axel Rasmussen authored
A small cleanup. Our caller writes:

    r = setup_demand_paging(...);
    if (r < 0)
            exit(-r);

Since we're just going to exit anyway, instead of returning an error we can just reuse TEST_ASSERT. This makes the caller simpler, as well as the function itself - no need to write our own branches, etc. Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Message-Id: <20210519200339.829146-3-axelrasmussen@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
If a KVM selftest is run on a machine without /dev/kvm, it will exit silently. Make it easy to tell what's happening by printing an error message. Opportunistically consolidate all codepaths that open /dev/kvm into a single function so they all print the same message. This slightly changes the semantics of vm_is_unrestricted_guest() by changing a TEST_ASSERT() to exit(KSFT_SKIP). However vm_is_unrestricted_guest() is only called in one place (x86_64/mmio_warning_test.c) and that is to determine if the test should be skipped or not. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210511202120.1371800-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
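A sketch of the consolidated open path (the helper name is assumed; print_skip() and KSFT_SKIP are existing selftest facilities):

    static int open_kvm_dev_path_or_exit(void)
    {
            int fd = open(KVM_DEV_PATH, O_RDWR);

            if (fd < 0) {
                    print_skip("%s not available, is KVM loaded? (errno: %d)",
                               KVM_DEV_PATH, errno);
                    exit(KSFT_SKIP);
            }
            return fd;
    }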
-
Axel Rasmussen authored
Some trivial fixes I found while touching related code in this series, factored out into a separate commit for easier reviewing:
- s/gor/got/ and add a newline in demand_paging_test.c
- s/backing_src/src_type/ in a comment, to be consistent with the real function signature in kvm_util.c
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Message-Id: <20210519200339.829146-2-axelrasmussen@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
If /dev/kvm is not available then hardware_disable_test will hang indefinitely, because the child process exits before posting to the semaphore for which the parent is waiting. Fix this by making the parent periodically check if the child has exited. We have to be careful to forward the child's exit status, to preserve a KSFT_SKIP status. I considered just checking for /dev/kvm before creating the child process, but there are so many other reasons why the child could exit early that it seemed better to handle that as the general case.

Tested:

    $ ./hardware_disable_test
    /dev/kvm not available, skipping test
    $ echo $?
    4
    $ modprobe kvm_intel
    $ ./hardware_disable_test
    $ echo $?
    0

Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210514230521.2608768-1-dmatlack@google.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
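A sketch of the parent-side loop (simplified relative to the real test):

    int status;

    for (;;) {
            if (sem_trywait(sem) == 0)
                    break;                      /* child posted */

            pid_t ret = waitpid(pid, &status, WNOHANG);
            TEST_ASSERT(ret != -1, "waitpid failed, errno = %d", errno);
            if (ret == pid)
                    exit(WEXITSTATUS(status));  /* forward the child's
                                                   status: keeps KSFT_SKIP */
            usleep(1000);
    }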
-
David Matlack authored
Similar to CPUID.0DH.0H, this entry depends on the vCPU's XCR0 register and IA32_XSS MSR. Since this test does not control for either before assigning the vCPU's CPUID, these entries will not necessarily match the supported CPUID exposed by KVM. This fixes get_cpuid_test on Cascade Lake CPUs. Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210519211345.3944063-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
vm_get_max_gfn() casts vm->max_gfn from a uint64_t to an unsigned int, which causes the upper 32-bits of the max_gfn to get truncated. Nobody noticed until now likely because vm_get_max_gfn() is only used as a mechanism to create a memslot in an unused region of the guest physical address space (the top), and the top of the 32-bit physical address space was always good enough. This fix reveals a bug in memslot_modification_stress_test which was trying to create a dummy memslot past the end of guest physical memory. Fix that by moving the dummy memslot lower. Fixes: 52200d0d ("KVM: selftests: Remove duplicate guest mode handling") Reviewed-by: Venkatesh Srinivas <venkateshs@chromium.org> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210521173828.1180619-1-dmatlack@google.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
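The essence of the fix is not narrowing the type on return (a sketch, not the verbatim diff):

    /* was: unsigned int vm_get_max_gfn(struct kvm_vm *vm) --
     * the implicit conversion truncated the upper 32 bits. */
    uint64_t vm_get_max_gfn(struct kvm_vm *vm)
    {
            return vm->max_gfn;
    }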
-
Maciej S. Szmigiero authored
This benchmark contains the following tests:

* Map test, where the host unmaps guest memory while the guest writes to it (maps it). The test is designed in a way to make the unmap operation on the host take a negligible amount of time in comparison with the mapping operation in the guest. The test area is actually split in two: the first half is being mapped by the guest while the second half is being unmapped by the host. Then a guest <-> host sync happens and the areas are reversed.

* Unmap test, which is broadly similar to the above map test, but designed in the opposite way: to make the mapping operation in the guest take a negligible amount of time in comparison with the unmap operation on the host. This test is available in two variants: with a per-page unmap operation, or a chunked one (using a 2 MiB chunk size).

* Move active area test, which involves moving the last (highest gfn) memslot a bit back and forth on the host while the guest concurrently writes around the area being moved (including over the moved memslot).

* Move inactive area test, which is similar to the previous move active area test, but now all guest writes happen outside of the area being moved.

* Read / write test, in which the guest writes to the beginning of each page of the test area while the host writes to the middle of each such page. Then each side checks the values the other side has written. This particular test is not expected to give different results depending on the particular memslots implementation; it is meant as a rough sanity check and to provide insight into the spread of test results to expect.

Each test performs its operation in a loop until a test period ends (this is 5 seconds by default, but it is configurable). Then the total count of loops done is divided by the actual elapsed time to give the test result. The tests have a configurable memslot cap with the "-s" test option; by default the system maximum is used. Each test is repeated a particular number of times (by default 20 times), and the best result achieved is printed. The test memory area is divided equally between memslots; the remainder is added to the last memslot. The test area size does not depend on the number of memslots in use. The tests also measure the time that it took to add all these memslots. The best result from the tests that use the whole test area is printed after all the requested tests are done. In general, these tests are designed to use as much memory as possible (within reason) while still doing 100+ loops even on high memslot counts with the default test length. Increasing the test runtime makes it increasingly more likely that some event will happen on the system during the test run, which might lower the test result. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Message-Id: <8d31bb3d92bc8fa33a9756fa802ee14266ab994e.1618253574.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maciej S. Szmigiero authored
The KVM selftest framework was using a simple list for keeping track of the memslots currently in use. This resulted in lookups and adding a single memslot being O(n), the latter due to linear scanning of the existing memslot set to check for the presence of any conflicting entries. Before this change, benchmarking a high count of memslots was more or less impossible, as pretty much all the benchmark time was spent in the selftest framework code. We can simply use an rbtree for keeping track of both gfn and hva. We don't need an interval tree for hva here, as we can't have overlapping memslots: a completely new memory chunk is allocated for each new memslot. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Message-Id: <b12749d47ee860468240cf027412c91b76dbe3db.1618253574.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
vm_vaddr_alloc() sets up the GVA to GPA mapping page by page; therefore, GPAs may not be contiguous if the same memslot is used for data and page table allocation. kvm_vm_elf_load() however expects a contiguous range of HVAs (and thus GPAs), because it does not try to read file data page by page. Fix this mismatch by allocating the memory in one step. Reported-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Zhenzhong Duan authored
The extra memory pages are not allocated during VM creation; perf_test_util and kvm_page_table_test currently use them to allocate extra memory. Fix it by adding extra_mem_pages to the total memory calculation before allocating. Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Message-Id: <20210512043107.30076-1-zhenzhong.duan@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Wanpeng Li authored
    WARNING: suspicious RCU usage
    5.13.0-rc1 #4 Not tainted
    -----------------------------
    ./include/linux/kvm_host.h:710 suspicious rcu_dereference_check() usage!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
    1 lock held by hyperv_clock/8318:
     #0: ffffb6b8cb05a7d8 (&hv->hv_lock){+.+.}-{3:3}, at: kvm_hv_invalidate_tsc_page+0x3e/0xa0 [kvm]

    stack backtrace:
    CPU: 3 PID: 8318 Comm: hyperv_clock Not tainted 5.13.0-rc1 #4
    Call Trace:
     dump_stack+0x87/0xb7
     lockdep_rcu_suspicious+0xce/0xf0
     kvm_write_guest_page+0x1c1/0x1d0 [kvm]
     kvm_write_guest+0x50/0x90 [kvm]
     kvm_hv_invalidate_tsc_page+0x79/0xa0 [kvm]
     kvm_gen_update_masterclock+0x1d/0x110 [kvm]
     kvm_arch_vm_ioctl+0x2a7/0xc50 [kvm]
     kvm_vm_ioctl+0x123/0x11d0 [kvm]
     __x64_sys_ioctl+0x3ed/0x9d0
     do_syscall_64+0x3d/0x80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

kvm_memslots() will be called by kvm_write_guest(), so we should take the srcu lock. Fixes: e880c6ea (KVM: x86: hyper-v: Prevent using not-yet-updated TSC page by secondary CPUs) Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1621339235-11131-4-git-send-email-wanpengli@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
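A minimal sketch of the fixed pattern (gpa and data stand in for the TSC-page write):

    int idx;

    /* kvm_write_guest() -> kvm_memslots() requires kvm->srcu held. */
    idx = srcu_read_lock(&kvm->srcu);
    ret = kvm_write_guest(kvm, gpa, &data, sizeof(data));
    srcu_read_unlock(&kvm->srcu, idx);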
-
Wanpeng Li authored
Commit 66570e96 ("kvm: x86: only provide PV features if enabled in guest's CPUID") avoids accessing the PV TLB shootdown host-side logic when this PV feature is not exposed to the guest. However, kvm_steal_time.preempted is not only leveraged by the PV TLB shootdown logic, it also mitigates the lock holder preemption issue. From the guest's point of view, the vCPU always looks preempted, since we lose the reset of kvm_steal_time.preempted before vmentry if the PV TLB shootdown feature is not exposed. This patch fixes it by clearing kvm_steal_time.preempted before vmentry. Fixes: 66570e96 (kvm: x86: only provide PV features if enabled in guest's CPUID) Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1621339235-11131-3-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
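A sketch of the resulting shape of record_steal_time() (simplified; guest_pv_has() is the CPUID-gating helper introduced by 66570e96):

    if (guest_pv_has(vcpu, KVM_FEATURE_PV_TLB_FLUSH)) {
            u8 st_preempted = xchg(&st->preempted, 0);

            if (st_preempted & KVM_VCPU_FLUSH_TLB)
                    kvm_vcpu_flush_tlb_guest(vcpu);
    } else {
            /* The fix: still reset the flag before vmentry, even when
             * the PV TLB-flush feature is hidden from the guest. */
            st->preempted = 0;
    }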
-
Wanpeng Li authored
In under-committed scenarios, vCPUs can be scheduled easily; kvm_vcpu_yield_to adds extra overhead, and it is also common to see cases where vcpu->ready is true but the yield later fails because p->state is TASK_RUNNING. Let's bail out in such scenarios by checking the length of the current CPU's runqueue, which can be treated as a hint of under-commitment rather than a guarantee of accuracy. 30%+ of directed-yield attempts can now avoid the expensive lookups in kvm_sched_yield() in an under-committed scenario. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1621339235-11131-2-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
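A hedged sketch of the bail-out; the exact predicate in the patch may differ, but single_task_running() is the stock scheduler hint for this:

    /* Under-commit hint: if this pCPU's runqueue only holds the
     * current task, skip the expensive target-vCPU lookup. */
    if (single_task_running())
            return;                          /* bail out of the yield */

    target = kvm_get_vcpu(kvm, vcpu_idx);    /* expensive path */
    if (target)
            kvm_vcpu_yield_to(target);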
-
Wanpeng Li authored
This is inspired by commit 262de410 ("kvm: exit halt polling on need_resched() as well"). Because PPC implements arch-specific halt polling logic, we have to add the need_resched() check there as well. This patch adds a helper function that can be shared between the book3s and generic halt-polling loops. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Venkatesh Srinivas <venkateshs@chromium.org> Cc: Ben Segall <bsegall@google.com> Cc: Venkatesh Srinivas <venkateshs@chromium.org> Cc: Jim Mattson <jmattson@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1621339235-11131-1-git-send-email-wanpengli@tencent.com> [Make the function inline. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
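The shared helper plausibly looks like this, per the description and Paolo's inline note (treat as a sketch):

    static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
    {
            return single_task_running() && !need_resched() &&
                   ktime_before(cur, stop);
    }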
-
Marc Zyngier authored
It looks like we have tolerated creating mixed-width VMs since... forever. However, that was never the intention, and we'd rather not have to support that pointless complexity. Forbid such a setup by making sure all the vcpus have the same register width. Reported-by: Steven Price <steven.price@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20210524170752.1549797-1-maz@kernel.org
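A hedged sketch of the enforced invariant (helper names assumed):

    /* Reject a vCPU whose requested register width differs from any
     * already-configured vCPU in the same VM. */
    static bool vcpu_allowed_register_width(struct kvm_vcpu *vcpu)
    {
            bool is32bit = vcpu_has_feature(vcpu, KVM_ARM_VCPU_EL1_32BIT);
            struct kvm_vcpu *tmp;
            int i;

            kvm_for_each_vcpu(i, tmp, vcpu->kvm) {
                    if (vcpu_has_feature(tmp, KVM_ARM_VCPU_EL1_32BIT) != is32bit)
                            return false;
            }
            return true;
    }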
-
Zenghui Yu authored
Commit 26778aaa ("KVM: arm64: Commit pending PC adjustemnts before returning to userspace") fixed the PC updating issue by forcing an explicit synchronisation of the exception state on vcpu exit to userspace. However, we forgot to take into account the case where immediate_exit is set by userspace and KVM_RUN will exit immediately. Fix it by resolving all pending PC updates before returning to userspace. Since __kvm_adjust_pc() relies on a loaded vcpu context, I moved the immediate_exit checking right after vcpu_load(). We will get some overhead if immediate_exit is true (which should hopefully be rare). Fixes: 26778aaa ("KVM: arm64: Commit pending PC adjustemnts before returning to userspace") Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210526141831.1662-1-yuzenghui@huawei.com Cc: stable@vger.kernel.org # 5.11
-
- 24 May, 2021 1 commit
-
-
Paolo Bonzini authored
Make it consistent with kvm_intel.enable_apicv. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-