- 02 May, 2024 1 commit
-
-
Sean Christopherson authored
Add full memory barriers in kvm_mmu_track_write() and account_shadowed() to plug a (very, very theoretical) race where kvm_mmu_track_write() could miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant, *stale* SPTEs. Without the barriers, because modern x86 CPUs allow (per the SDM):

    Reads may be reordered with older writes to different locations but
    not with older writes to the same location.

it's possible that the following could happen (in terms of values being visible/resolved):

    CPU0                                CPU1
                                        read memory[gfn] (=Y)
    memory[gfn] Y=>X
    read indirect_shadow_pages (=0)
                                        indirect_shadow_pages 0=>1

or conversely:

    CPU0                                CPU1
                                        indirect_shadow_pages 0=>1
    read indirect_shadow_pages (=0)
                                        read memory[gfn] (=Y)
    memory[gfn] Y=>X

E.g. in the below scenario, CPU0 could fail to zap SPTEs, and CPU1 could fail to retry the faulting instruction, resulting in KVM entering the guest with a stale SPTE (mapping PTE=X instead of PTE=Y):

    PTE = X;

    CPU0:
        emulator_write_phys()
          PTE = Y
          kvm_page_track_write()
            kvm_mmu_track_write()
              // memory barrier missing here
              if (indirect_shadow_pages)
                  zap();

    CPU1:
        FNAME(page_fault)
          FNAME(walk_addr)
            FNAME(walk_addr_generic)
              gw->pte = PTE; // X
          FNAME(fetch)
            kvm_mmu_get_child_sp
              kvm_mmu_get_shadow_page
                __kvm_mmu_get_shadow_page
                  kvm_mmu_alloc_shadow_page
                    account_shadowed
                      indirect_shadow_pages++
                      // memory barrier missing here
          if (FNAME(gpte_changed)) // if (PTE == X)
              return RET_PF_RETRY;

In practice, this bug likely cannot be observed, as both the 0=>1 transition and reordering of this scope are extremely rare occurrences. Note, if the cost of the barrier (which is simply a locked ADD, see commit 450cbdd0 ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE")) is problematic, KVM could avoid the barrier by bailing earlier if kvm_memslots_have_rmaps() is false. But the odds of the barrier being problematic are extremely low, *and* the odds of the extra checks being meaningfully faster overall are also low. Link: https://lore.kernel.org/r/20240423193114.2887673-1-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com>
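A minimal sketch of the barrier pairing described above; the function bodies are abbreviated and the comments paraphrase the changelog's reasoning rather than reproduce the exact upstream code.

    void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
                             int bytes)
    {
            /*
             * Order the emulated guest PTE write (done by the caller) before
             * the read of indirect_shadow_pages.  Pairs with the smp_mb() in
             * account_shadowed(): either this CPU sees a non-zero count and
             * zaps, or the faulting CPU sees the new PTE and retries.
             */
            smp_mb();

            if (!vcpu->kvm->arch.indirect_shadow_pages)
                    return;

            /* ... zap write-tracked SPTEs for the written gfn ... */
    }

    static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
    {
            /* ... */
            kvm->arch.indirect_shadow_pages++;

            /*
             * Order the increment before the page-fault path re-reads the
             * guest PTE in FNAME(gpte_changed).  Pairs with the smp_mb() in
             * kvm_mmu_track_write().
             */
            smp_mb();
            /* ... */
    }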
-
- 06 Mar, 2024 3 commits
-
-
Peter Xu authored
They're not used anymore; drop all of them. Link: https://lkml.kernel.org/r/20240305043750.93762-10-peterx@redhat.com Signed-off-by:
Peter Xu <peterx@redhat.com> Reviewed-by:
Jason Gunthorpe <jgg@nvidia.com> Reviewed-by:
Mike Rapoport (IBM) <rppt@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
pud_large() is always defined as pud_leaf(). Merge their usages. Chose pud_leaf() because pud_leaf() is a global API, while pud_large() is not. Link: https://lkml.kernel.org/r/20240305043750.93762-9-peterx@redhat.com Signed-off-by:
Peter Xu <peterx@redhat.com> Reviewed-by:
Jason Gunthorpe <jgg@nvidia.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
pmd_large() is always defined as pmd_leaf(). Merge their usages. Chose pmd_leaf() because pmd_leaf() is a global API, while pmd_large() is not. Link: https://lkml.kernel.org/r/20240305043750.93762-8-peterx@redhat.com Signed-off-by:
Peter Xu <peterx@redhat.com> Reviewed-by:
Jason Gunthorpe <jgg@nvidia.com> Reviewed-by:
Mike Rapoport (IBM) <rppt@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
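The conversion itself is mechanical; a tiny illustrative sketch (walk_is_huge() is a made-up call site, not code from the series):

    /* Before: several architectures spelled the check via a local alias,
     * roughly: #define pmd_large(pmd) pmd_leaf(pmd) */

    static bool walk_is_huge(pmd_t *pmd)
    {
            /* After: use the generic API directly. */
            return pmd_leaf(*pmd);
    }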
-
- 04 Mar, 2024 1 commit
-
-
Thomas Gleixner authored
Sparse complains rightfully about the missing declaration which has been placed sloppily into the usage site: bugs.c:2223:6: sparse: warning: symbol 'itlb_multihit_kvm_mitigation' was not declared. Should it be static? Add it to <asm/spec-ctrl.h> where it belongs and remove the one in the KVM code. Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240304005104.787173239@linutronix.de
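A sketch of the change described above (exact placement and surrounding comments in the headers may differ):

    /* arch/x86/include/asm/spec-ctrl.h: declare it where it belongs. */
    extern bool itlb_multihit_kvm_mitigation;

    /* arch/x86/kvm/mmu/mmu.c: drop the sloppy usage-site declaration. */
    /* - extern bool itlb_multihit_kvm_mitigation; */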
-
- 23 Feb, 2024 3 commits
-
-
Sean Christopherson authored
Retry page faults without acquiring mmu_lock, and without even faulting the page into the primary MMU, if the resolved gfn is covered by an active invalidation. Contending for mmu_lock is especially problematic on preemptible kernels as the mmu_notifier invalidation task will yield mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and ultimately increase the latency of resolving the page fault. And in the worst case scenario, yielding will be accompanied by a remote TLB flush, e.g. if the invalidation covers a large range of memory and vCPUs are accessing addresses that were already zapped.

Faulting the page into the primary MMU is similarly problematic, as doing so may acquire locks that need to be taken for the invalidation to complete (the primary MMU has finer grained locks than KVM's MMU), and/or may cause unnecessary churn (getting/putting pages, marking them accessed, etc).

Alternatively, the yielding issue could be mitigated by teaching KVM's MMU iterators to perform more work before yielding, but that wouldn't solve the lock contention and would negatively affect scenarios where a vCPU is trying to fault in an address that is NOT covered by the in-progress invalidation.

Add a dedicated lockless version of the range-based retry check to avoid false positives on the sanity check on start+end WARN, and so that it's super obvious that checking for a racing invalidation without holding mmu_lock is unsafe (though obviously useful).

Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking invalidation in a loop won't put KVM into an infinite loop, e.g. due to caching the in-progress flag and never seeing it go to '0'. Force a load of mmu_invalidate_seq as well, even though it isn't strictly necessary to avoid an infinite loop, as doing so improves the probability that KVM will detect an invalidation that already completed before acquiring mmu_lock and bailing anyways.

Do the pre-check even for non-preemptible kernels, as waiting to detect the invalidation until mmu_lock is held guarantees the vCPU will observe the worst case latency in terms of handling the fault, and can generate even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock, detect retry, drop mmu_lock, re-enter the guest, retake the fault, and eventually re-acquire mmu_lock. This behavior is also why there are no new starvation issues due to losing the fairness guarantees provided by rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting on mmu_lock doesn't guarantee forward progress in the face of _another_ mmu_notifier invalidation event.

Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE() may generate a load into a register instead of doing a direct comparison (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost is a few bytes of code and maaaaybe a cycle or three. Reported-by:
Yan Zhao <yan.y.zhao@intel.com> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com Reported-by:
Friedrich Weber <f.weber@proxmox.com> Cc: Kai Huang <kai.huang@intel.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Yuan Yao <yuan.yao@linux.intel.com> Cc: Xu Yilun <yilun.xu@linux.intel.com> Acked-by:
Kai Huang <kai.huang@intel.com> Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20240222012640.2820927-1-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com>
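A sketch of what the lockless pre-check could look like, based on the description above; the field names follow KVM's mmu_invalidate_* bookkeeping, but the helper name and exact placement are assumptions.

    /*
     * Check for a relevant in-progress invalidation without holding mmu_lock.
     * "Unsafe" because the result is stale the instant it's computed; callers
     * must re-check under mmu_lock before installing a SPTE.
     */
    static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
                                                       unsigned long mmu_seq,
                                                       gfn_t gfn)
    {
            /* READ_ONCE() ensures a pre-check loop can't cache a stale value. */
            if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
                gfn >= kvm->mmu_invalidate_range_start &&
                gfn < kvm->mmu_invalidate_range_end)
                    return true;

            return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
    }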
-
Sean Christopherson authored
Free TDP MMU roots from vCPU context while holding mmu_lock for read; it is completely legal to invoke kvm_tdp_mmu_put_root() as a reader. This eliminates the last mmu_lock writer in the TDP MMU's "fast zap" path after requesting vCPUs to reload roots, i.e. allows KVM to zap invalidated roots, free obsolete roots, and allocate new roots in parallel. On large VMs, e.g. 100+ vCPUs, allowing the bulk of the "fast zap" operation to run in parallel with freeing and allocating roots reduces the worst case latency for a vCPU to reload a root from 2-3ms to <100us. Link: https://lore.kernel.org/r/20240111020048.844847-9-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com>
-
Sean Christopherson authored
When allocating a new TDP MMU root, check for a usable root while holding mmu_lock for read and only acquire mmu_lock for write if a new root needs to be created. There is no need to serialize other MMU operations if a vCPU is simply grabbing a reference to an existing root; holding mmu_lock for write is "necessary" (spoiler alert, it's not strictly necessary) only to ensure KVM doesn't end up with duplicate roots. Allowing vCPUs to get "new" roots in parallel is beneficial to VM boot and to setups that frequently delete memslots, i.e. which force all vCPUs to reload all roots. Link: https://lore.kernel.org/r/20240111020048.844847-7-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com>
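A sketch of the lock dance this describes; kvm_tdp_mmu_try_get_root() and the wrapper are hypothetical names used purely for illustration.

    static struct kvm_mmu_page *tdp_mmu_get_or_alloc_root(struct kvm_vcpu *vcpu)
    {
            struct kvm *kvm = vcpu->kvm;
            struct kvm_mmu_page *root;

            /* Fast path: reuse an existing root, mmu_lock held only for read. */
            read_lock(&kvm->mmu_lock);
            root = kvm_tdp_mmu_try_get_root(vcpu);      /* hypothetical helper */
            read_unlock(&kvm->mmu_lock);
            if (root)
                    return root;

            /*
             * Slow path: creating a root is still serialized with mmu_lock
             * held for write so KVM doesn't end up with duplicate roots; the
             * allocator re-checks for a root created by a racing vCPU.
             */
            write_lock(&kvm->mmu_lock);
            root = kvm_tdp_mmu_alloc_root(vcpu);        /* hypothetical helper */
            write_unlock(&kvm->mmu_lock);
            return root;
    }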
-
- 01 Feb, 2024 1 commit
-
-
Tanzir Hasan authored
This patch creates wordpart.h and includes it in asm/word-at-a-time.h for all architectures. WORD_AT_A_TIME_CONSTANTS depends on kernel.h because of REPEAT_BYTE. Moving this to another header and including it where necessary allows us to not include the bloated kernel.h. Making this implicit dependency on REPEAT_BYTE explicit allows for later improvements in the lib/string.c inclusion list. Suggested-by:
Al Viro <viro@zeniv.linux.org.uk> Suggested-by:
Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by:
Tanzir Hasan <tanzirh@google.com> Reviewed-by:
Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lore.kernel.org/r/20231226-libstringheader-v6-1-80aa08c7652c@google.com Signed-off-by:
Kees Cook <keescook@chromium.org>
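A sketch of the new header's core, assuming the header-guard name; REPEAT_BYTE() itself is the long-standing kernel definition that previously lived in kernel.h.

    /* include/linux/wordpart.h (sketch) */
    #ifndef _LINUX_WORDPART_H
    #define _LINUX_WORDPART_H

    /* Replicate a byte value into every byte of an unsigned long. */
    #define REPEAT_BYTE(x)  ((~0ul / 0xff) * (x))

    #endif /* _LINUX_WORDPART_H */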
-
- 31 Jan, 2024 1 commit
-
-
Kunwu Chan authored
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Note, KMEM_CACHE() uses the required alignment of the struct, '8' as the alignment, whereas KVM's existing code passes '0'. In the end, the two values yield the same result as x86's minimum slab alignment is also '8' (which is not at all coincidental). Signed-off-by:
Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20240116100025.95702-1-chentao@kylinos.cn [sean: call out alignment behavior] Signed-off-by:
Sean Christopherson <seanjc@google.com>
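A sketch of the before/after for one of the KVM MMU caches (the pte_list_desc cache), with the old call shown in the comment; treat the exact flags as an assumption.

    /*
     * Before:
     *   pte_list_desc_cache = kmem_cache_create("pte_list_desc",
     *                                           sizeof(struct pte_list_desc),
     *                                           0, SLAB_ACCOUNT, NULL);
     *
     * After: KMEM_CACHE() derives the name, size, and __alignof__() of the
     * struct (8 bytes here, matching x86's minimum slab alignment).
     */
    pte_list_desc_cache = KMEM_CACHE(pte_list_desc, SLAB_ACCOUNT);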
-
- 10 Jan, 2024 1 commit
-
-
Breno Leitao authored
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options. [ mingo: Converted a few more uses in comments/messages as well. ] Suggested-by:
Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by:
Breno Leitao <leitao@debian.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Reviewed-by:
Ariel Miculas <amiculas@cisco.com> Acked-by:
Josh Poimboeuf <jpoimboe@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org
-
- 03 Jan, 2024 1 commit
-
-
Bjorn Helgaas authored
Fix typos, most reported by "codespell arch/x86". Only touches comments, no code changes. Signed-off-by:
Bjorn Helgaas <bhelgaas@google.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Reviewed-by:
Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
-
- 01 Dec, 2023 3 commits
-
-
Paolo Bonzini authored
Fix the comment about what can and cannot happen when mmu_unsync_pages_lock is not held. The comment correctly mentions "clearing sp->unsync", but then it talks about unsync going from 0 to 1. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20231125083400.1399197-5-pbonzini@redhat.com Signed-off-by:
Sean Christopherson <seanjc@google.com>
-
Paolo Bonzini authored
Neither tdp_mmu_next_root nor kvm_tdp_mmu_put_root needs to know if the lock is taken for read or write. Either way, protection is achieved via RCU and tdp_mmu_pages_lock. Remove the argument and just assert that the lock is taken. Reviewed-by:
Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20231125083400.1399197-2-pbonzini@redhat.com Signed-off-by:
Sean Christopherson <seanjc@google.com>
-
David Matlack authored
Fix an off-by-1 error when passing in the range of pages to kvm_mmu_try_split_huge_pages() during CLEAR_DIRTY_LOG. Specifically, end is the last page that needs to be split (inclusive), so pass in `end + 1` since kvm_mmu_try_split_huge_pages() expects the `end` to be non-inclusive. At worst this will cause a huge page to be write-protected instead of eagerly split, which is purely a performance issue, not a correctness issue. But even that is unlikely, as it would require userspace to pass in a bitmap where the last page is the only 4K page on a huge page that needs to be split. Reported-by:
Vipin Sharma <vipinsh@google.com> Fixes: cb00a70b ("KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG") Signed-off-by:
David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20231027172640.2335197-2-dmatlack@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com>
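The fix itself is roughly a one-liner in the CLEAR_DIRTY_LOG path described above (the surrounding caller is abbreviated and should be treated as an assumption):

    /* "end" is the last gfn to split (inclusive); the helper wants exclusive. */
    kvm_mmu_try_split_huge_pages(kvm, slot, start, end + 1, PG_LEVEL_4K);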
-
- 29 Nov, 2023 2 commits
-
-
Sean Christopherson authored
Declare the kvm_x86_ops hooks used to wire up paravirt TLB flushes when running under Hyper-V if and only if CONFIG_HYPERV!=n. Wrapping yet more code with IS_ENABLED(CONFIG_HYPERV) eliminates a handful of conditional branches, and makes it super obvious why the hooks *might* be valid. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by:
Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20231018192325.1893896-1-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com>
-
Binbin Wu authored
Drop non-PA bits when getting GFN for guest's PGD with the maximum theoretical mask for guest MAXPHYADDR. Do it unconditionally because it's harmless for 32-bit guests, querying 64-bit mode would be more expensive, and for EPT the mask isn't tied to guest mode. Using PT_BASE_ADDR_MASK would be technically wrong (PAE paging has 64-bit elements _except_ for CR3, which has only 32 valid bits), though it wouldn't matter in practice. Opportunistically use GENMASK_ULL() to define __PT_BASE_ADDR_MASK. Signed-off-by:
Binbin Wu <binbin.wu@linux.intel.com> Tested-by:
Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-6-binbin.wu@linux.intel.com Signed-off-by:
Sean Christopherson <seanjc@google.com>
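A sketch of the mask and its use; only the GENMASK_ULL()-based define follows from the description above (bits 51:12 cover the maximum theoretical 52-bit guest MAXPHYADDR minus the page offset), while the helper below is purely hypothetical.

    #define __PT_BASE_ADDR_MASK     GENMASK_ULL(51, 12)

    /* Hypothetical illustration: extract the GFN of the guest's root PGD. */
    static inline gfn_t guest_pgd_to_gfn(gpa_t root_pgd)
    {
            return (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
    }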
-
- 14 Nov, 2023 3 commits
-
-
Sean Christopherson authored
Let x86 track the number of address spaces on a per-VM basis so that KVM can disallow SMM memslots for confidential VMs. Confidential VMs are fundamentally incompatible with emulating SMM, which as the name suggests requires being able to read and write guest memory and register state. Disallowing SMM will simplify support for guest private memory, as KVM will not need to worry about tracking memory attributes for multiple address spaces (SMM is the only "non-default" address space across all architectures). Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Reviewed-by:
Fuad Tabba <tabba@google.com> Tested-by:
Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-23-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Chao Peng authored
Add support for resolving page faults on guest private memory for VMs that differentiate between "shared" and "private" memory. For such VMs, KVM_MEM_GUEST_MEMFD memslots can include both fd-based private memory and hva-based shared memory, and KVM needs to map in the "correct" variant, i.e. KVM needs to map the gfn shared/private as appropriate based on the current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag. For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request shared vs. private via a bit in the guest page tables, i.e. what the guest wants may conflict with the current memory attributes. To support such "implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT to forward the request to userspace. Add a new flag for memory faults, KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to map memory as shared vs. private. Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to exit on missing mappings when handling guest page fault VM-Exits. In that case, userspace will want to know RWX information in order to correctly/precisely resolve the fault. Note, private memory *must* be backed by guest_memfd, i.e. shared mappings always come from the host userspace page tables, and private mappings always come from a guest_memfd instance. Co-developed-by:
Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by:
Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by:
Chao Peng <chao.p.peng@linux.intel.com> Co-developed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Fuad Tabba <tabba@google.com> Tested-by:
Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-21-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
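A sketch of how the exit is filled in for an "implicit" conversion request; the helper name is hypothetical, while the flag value (bit 3) and the kvm_run::memory_fault fields follow the description above.

    #define KVM_MEMORY_EXIT_FLAG_PRIVATE    (1ULL << 3)

    /* Hypothetical helper: forward a shared<->private mismatch to userspace. */
    static int kvm_mmu_exit_memory_fault(struct kvm_vcpu *vcpu,
                                         struct kvm_page_fault *fault)
    {
            vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
            vcpu->run->memory_fault.gpa = fault->gfn << PAGE_SHIFT;
            vcpu->run->memory_fault.size = PAGE_SIZE;
            vcpu->run->memory_fault.flags =
                    fault->is_private ? KVM_MEMORY_EXIT_FLAG_PRIVATE : 0;
            return -EFAULT;
    }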
-
Chao Peng authored
Disallow creating hugepages with mixed memory attributes, e.g. shared versus private, as mapping a hugepage in this case would allow the guest to access memory with the wrong attributes, e.g. overlaying private memory with a shared hugepage. Track whether or not attributes are mixed via the existing disallow_lpage field, but use the most significant bit in 'disallow_lpage' to indicate a hugepage has mixed attributes instead of using the normal refcounting. Whether or not attributes are mixed is binary; either they are or they aren't. Attempting to squeeze that info into the refcount is unnecessarily complex as it would require knowing the previous state of the mixed count when updating attributes. Using a flag means KVM just needs to ensure the current status is reflected in the memslots. Signed-off-by:
Chao Peng <chao.p.peng@linux.intel.com> Co-developed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20231027182217.3615211-20-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
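A sketch of the flag-in-the-counter idea; lpage_info_slot() is KVM's existing per-level hugepage metadata lookup, and the flag name here should be treated as illustrative.

    /* High bit of the 32-bit disallow_lpage counter marks mixed attributes. */
    #define KVM_LPAGE_MIXED_FLAG    BIT(31)

    static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
                                   int level)
    {
            lpage_info_slot(gfn, slot, level)->disallow_lpage |=
                    KVM_LPAGE_MIXED_FLAG;
    }

    static bool hugepage_has_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
                                   int level)
    {
            return lpage_info_slot(gfn, slot, level)->disallow_lpage &
                   KVM_LPAGE_MIXED_FLAG;
    }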
-
- 13 Nov, 2023 1 commit
-
-
Chao Peng authored
Currently in the mmu_notifier invalidate path, the hva range is recorded and then checked against by mmu_invalidate_retry_hva() in the page fault handling path. However, for the soon-to-be-introduced private memory, a page fault may not have an associated hva, so checking the gfn (gpa) makes more sense. For existing hva-based shared memory, gfn is expected to also work. The only downside is that when aliasing multiple gfns to a single hva, the current algorithm of checking multiple ranges could result in a much larger range being rejected. Such aliasing should be uncommon, so the impact is expected to be small. Suggested-by:
Sean Christopherson <seanjc@google.com> Cc: Xu Yilun <yilun.xu@intel.com> Signed-off-by:
Chao Peng <chao.p.peng@linux.intel.com> Reviewed-by:
Fuad Tabba <tabba@google.com> Tested-by:
Fuad Tabba <tabba@google.com> [sean: convert vmx_set_apic_access_page_addr() to gfn-based API] Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Reviewed-by:
Xu Yilun <yilun.xu@linux.intel.com> Message-Id: <20231027182217.3615211-4-seanjc@google.com> Reviewed-by:
Kai Huang <kai.huang@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 18 Oct, 2023 1 commit
-
-
Li zeming authored
Don't initialize "spte" and "sptep" in fast_page_fault() as they are both guaranteed (for all intents and purposes) to be written at the start of every loop iteration. Add a sanity check that "sptep" is non-NULL after walking the shadow page tables, as encountering a NULL root would result in "spte" not being written, i.e. would lead to uninitialized data or the previous value being consumed. Signed-off-by:
Li zeming <zeming@nfschina.com> Link: https://lore.kernel.org/r/20230905182006.2964-1-zeming@nfschina.com [sean: rewrite changelog with --verbose] Signed-off-by:
Sean Christopherson <seanjc@google.com>
-
- 09 Oct, 2023 1 commit
-
-
Yan Zhao authored
Add helpers to check if KVM honors guest MTRRs instead of open coding the logic in kvm_tdp_page_fault(). Future fixes and cleanups will also need to determine if KVM should honor guest MTRRs, e.g. for CR0.CD toggling and non-coherent DMA transitions. Provide an inner helper, __kvm_mmu_honors_guest_mtrrs(), so that KVM can check if guest MTRRs were honored when stopping non-coherent DMA. Note, there is no need to explicitly check that TDP is enabled, as KVM clears shadow_memtype_mask when TDP is disabled, i.e. it's non-zero if and only if EPT is enabled. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230714065006.20201-1-yan.y.zhao@intel.com Link: https://lore.kernel.org/r/20230714065043.20258-1-yan.y.zhao@intel.com [sean: squash into one patch, drop explicit TDP check, massage changelog] Signed-off-by:
Sean Christopherson <seanjc@google.com>
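A sketch of the helper pair this describes; kvm_arch_has_noncoherent_dma() and shadow_memtype_mask are existing KVM symbols, and the bodies are inferred from the changelog rather than copied from the patch.

    static inline bool __kvm_mmu_honors_guest_mtrrs(bool vm_has_noncoherent_dma)
    {
            /*
             * KVM honors guest MTRRs only if the VM has non-coherent DMA and
             * KVM can program guest-controlled memtypes into its SPTEs, i.e.
             * only if shadow_memtype_mask is non-zero (EPT is enabled).
             */
            return vm_has_noncoherent_dma && shadow_memtype_mask;
    }

    static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
    {
            return __kvm_mmu_honors_guest_mtrrs(kvm_arch_has_noncoherent_dma(kvm));
    }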
-
- 04 Oct, 2023 1 commit
-
-
Qi Zheng authored
Use new APIs to dynamically allocate the x86-mmu shrinker. Link: https://lkml.kernel.org/r/20230911094444.68966-3-zhengqi.arch@bytedance.com Signed-off-by:
Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by:
Muchun Song <songmuchun@bytedance.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
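A sketch of the conversion, inside kvm_mmu_vendor_module_init(), using the dynamic shrinker API (shrinker_alloc()/shrinker_register()); the callback names match KVM's existing mmu_shrink_count()/mmu_shrink_scan(), while the shrinker name string is illustrative.

    mmu_shrinker = shrinker_alloc(0, "x86-kvm-mmu");    /* name is illustrative */
    if (!mmu_shrinker)
            goto out;

    mmu_shrinker->count_objects = mmu_shrink_count;
    mmu_shrinker->scan_objects = mmu_shrink_scan;
    mmu_shrinker->seeks = DEFAULT_SEEKS * 10;

    shrinker_register(mmu_shrinker);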
-
- 23 Sep, 2023 2 commits
-
-
Sean Christopherson authored
Stop zapping invalidated TDP MMU roots via work queue now that KVM preserves TDP MMU roots until they are explicitly invalidated. Zapping roots asynchronously was effectively a workaround to avoid stalling a vCPU for an extended duration if a vCPU unloaded a root, which at the time happened whenever the guest toggled CR0.WP (a frequent operation for some guest kernels).

While a clever hack, zapping roots via an unbound worker had subtle, unintended consequences on host scheduling, especially when zapping multiple roots, e.g. as part of a memslot deletion. Because the work of zapping a root is no longer bound to the task that initiated the zap, things like the CPU affinity and priority of the original task get lost. Losing the affinity and priority can be especially problematic if unbound workqueues aren't affined to a small number of CPUs, as zapping multiple roots can cause KVM to heavily utilize the majority of CPUs in the system, *beyond* the CPUs KVM is already using to run vCPUs.

When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root zap can result in KVM occupying all logical CPUs for ~8ms, and result in high priority tasks not being scheduled in in a timely manner. In v5.15, which doesn't preserve unloaded roots, the issues were even more noticeable as KVM would zap roots more frequently and could occupy all CPUs for 50ms+.

Consuming all CPUs for an extended duration can lead to significant jitter throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots is a semi-frequent operation as memslots are deleted and recreated with different host virtual addresses to react to host GPU drivers allocating and freeing GPU blobs. On ChromeOS, the jitter manifests as audio blips during games due to the audio server's tasks not getting scheduled in promptly, despite the tasks having a high realtime priority.

Deleting memslots isn't exactly a fast path and should be avoided when possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the memslot shenanigans, but KVM is squarely in the wrong. Not to mention that removing the async zapping eliminates a non-trivial amount of complexity.

Note, one of the subtle behaviors hidden behind the async zapping is that KVM would zap invalidated roots only once (ignoring partial zaps from things like mmu_notifier events). Preserve this behavior by adding a flag to identify roots that are scheduled to be zapped versus roots that have already been zapped but not yet freed. Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can encounter invalid roots, as it's not at all obvious why zapping invalidated roots shouldn't simply zap all invalid roots. Reported-by:
Pattara Teerapong <pteerapong@google.com> Cc: David Stevens <stevensd@google.com> Cc: Yiwei Zhang <zzyiwei@google.com> Cc: Paul Hsia <paulhsia@google.com> Cc: stable@vger.kernel.org Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20230916003916.2545000-4-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
All callers except the MMU notifier want to process all address spaces. Remove the address space ID argument of for_each_tdp_mmu_root_yield_safe() and switch the MMU notifier to use __for_each_tdp_mmu_root_yield_safe(). Extracted out of a patch by Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 21 Sep, 2023 1 commit
-
-
Sean Christopherson authored
The mmu_notifier path is a bit of a special snowflake, e.g. it zaps only a single address space (because it's per-slot), and can't always yield. Because of this, it calls kvm_tdp_mmu_zap_leafs() in ways that no one else does. Iterate manually over the leafs in response to an mmu_notifier invalidation, instead of invoking kvm_tdp_mmu_zap_leafs(). Drop the @can_yield param from kvm_tdp_mmu_zap_leafs() as its sole remaining caller unconditionally passes "true". Cc: stable@vger.kernel.org Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20230916003916.2545000-2-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 31 Aug, 2023 13 commits
-
-
Sean Christopherson authored
When attempting to allocate a shadow root for a !visible guest root gfn, e.g. one that resides in MMIO space, load a dummy root that is backed by the zero page instead of immediately synthesizing a triple fault shutdown (using the zero page ensures any attempt to translate memory will generate a !PRESENT fault and thus VM-Exit). Unless the vCPU is racing with memslot activity, KVM will inject a page fault due to not finding a visible slot in FNAME(walk_addr_generic), i.e. the end result is mostly the same, but critically KVM will inject a fault only *after* KVM runs the vCPU with the bogus root.

Waiting to inject a fault until after running the vCPU fixes a bug where KVM would bail from nested VM-Enter if L1 tried to run L2 with TDP enabled and a !visible root. Even though a bad root will *probably* lead to shutdown, (a) it's not guaranteed and (b) the CPU won't read the underlying memory until after VM-Enter succeeds. E.g. if L1 runs L2 with a VMX preemption timer value of '0', then architecturally the preemption timer VM-Exit is guaranteed to occur before the CPU executes any instruction, i.e. before the CPU needs to translate a GPA to a HPA (so long as there are no injected events with higher priority than the preemption timer).

If KVM manages to get to FNAME(fetch) with a dummy root, e.g. because userspace created a memslot between installing the dummy root and handling the page fault, simply unload the MMU to allocate a new root and retry the instruction. Use KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to drop the root, as invoking kvm_mmu_free_roots() while holding mmu_lock would deadlock, and conceptually the dummy root has indeed become obsolete. The only difference versus existing usage of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is that the root has become obsolete due to memslot *creation*, not memslot deletion or movement. Reported-by:
Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp> Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Link: https://lore.kernel.org/r/20230729005200.1057358-6-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Harden kvm_mmu_new_pgd() against NULL pointer dereference bugs by sanity checking that the target root has an associated shadow page prior to dereferencing said shadow page. The code in question is guaranteed to only see roots with shadow pages as fast_pgd_switch() explicitly frees the current root if it doesn't have a shadow page, i.e. is a PAE root, and that in turn prevents valid roots from being cached, but that's all very subtle. Link: https://lore.kernel.org/r/20230729005200.1057358-3-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add a dedicated helper for converting a root hpa to a shadow page in anticipation of using a "dummy" root to handle the scenario where KVM needs to load a valid shadow root (from hardware's perspective), but the guest doesn't have a visible root to shadow. Similar to PAE roots, the dummy root won't have an associated kvm_mmu_page and will need special handling when finding a shadow page given a root. Opportunistically retrieve the root shadow page in kvm_mmu_sync_roots() *after* verifying the root is unsync (the dummy root can never be unsync). Link: https://lore.kernel.org/r/20230729005200.1057358-2-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Refactor KVM's exported/external page-track, a.k.a. write-track, APIs to take only the gfn and do the required memslot lookup in KVM proper. Forcing users of the APIs to get the memslot unnecessarily bleeds KVM internals into KVMGT and complicates usage of the APIs. No functional change intended. Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-28-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Rename the page-track APIs to capture that they're all about tracking writes, now that the facade of supporting multiple modes is gone. Opportunistically replace "slot" with "gfn" in anticipation of removing the @slot param from the external APIs. No functional change intended. Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-25-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Drop "support" for multiple page-track modes, as there is no evidence that array-based and refcounted metadata is the optimal solution for other modes, nor is there any evidence that other use cases, e.g. for access-tracking, will be a good fit for the page-track machinery in general. E.g. one potential use case of access-tracking would be to prevent guest access to poisoned memory (from the guest's perspective). In that case, the number of poisoned pages is likely to be a very small percentage of the guest memory, and there is no need to reference count the number of access-tracking users, i.e. expanding gfn_track[] for a new mode would be grossly inefficient. And for poisoned memory, host userspace would also likely want to trap accesses, e.g. to inject #MC into the guest, and that isn't currently supported by the page-track framework. A better alternative for that poisoned page use case is likely a variation of the proposed per-gfn attributes overlay (linked), which would allow efficiently tracking the sparse set of poisoned pages, and by default would exit to userspace on access. Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com Cc: Ben Gardon <bgardon@google.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-24-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Bury the declaration of the page-track helpers that are intended only for internal KVM use in a "private" header. In addition to guarding against unwanted usage of the internal-only helpers, dropping their definitions avoids exposing other structures that should be KVM-internal, e.g. for memslots. This is a baby step toward making kvm_host.h a KVM-internal header in the very distant future. Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-22-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Yan Zhao authored
Remove ->track_remove_slot(), there are no longer any users and it's unlikely a "flush" hook will ever be the correct API to provide to an external page-track user. Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-21-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Don't use the generic page-track mechanism to handle writes to guest PTEs in KVM's MMU. KVM's MMU needs access to information that should not be exposed to external page-track users, e.g. KVM needs (for some definitions of "need") the vCPU to query the current paging mode, whereas external users, i.e. KVMGT, have no ties to the current vCPU and so should never need the vCPU. Moving away from the page-track mechanism will allow dropping use of the page-track mechanism for KVM's own MMU, and will also allow simplifying and cleaning up the page-track APIs. Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-15-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Call kvm_mmu_zap_all_fast() directly when flushing a memslot instead of bouncing through the page-track mechanism. KVM (unfortunately) needs to zap and flush all page tables on memslot DELETE/MOVE irrespective of whether KVM is shadowing guest page tables. This will allow changing KVM to register a page-track notifier on the first shadow root allocation, and will also allow deleting the misguided kvm_page_track_flush_slot() hook itself once KVM-GT also moves to a different method for reacting to memslot changes. No functional change intended. Cc: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20221110014821.1548347-2-seanjc@google.com Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-14-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Move x86's implementation of kvm_arch_flush_shadow_{all,memslot}() into mmu.c, and make kvm_mmu_zap_all() static as it was globally visible only for kvm_arch_flush_shadow_all(). This will allow refactoring kvm_arch_flush_shadow_memslot() to call kvm_mmu_zap_all() directly without having to expose kvm_mmu_zap_all_fast() outside of mmu.c. Keeping everything in mmu.c will also likely simplify supporting TDX, which intends to zap only relevant SPTEs on memslot updates. No functional change intended. Suggested-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-13-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Introduce KVM_BUG_ON_DATA_CORRUPTION() and use it in the low-level rmap helpers to convert the existing BUG()s to WARN_ON_ONCE() when the kernel is built with CONFIG_BUG_ON_DATA_CORRUPTION=n, i.e. does NOT want to BUG() on corruption of host kernel data structures. Environments that don't have infrastructure to automatically capture crash dumps, i.e. aren't likely to enable CONFIG_BUG_ON_DATA_CORRUPTION=y, are typically better served overall by WARN-and-continue behavior (for the kernel, the VM is dead regardless), as a BUG() while holding mmu_lock all but guarantees the _best_ case scenario is a panic(). Make the BUG()s conditional instead of removing/replacing them entirely as there's a non-zero chance (though by no means a guarantee) that the damage isn't contained to the target VM, e.g. if no rmap is found for a SPTE then KVM may be double-zapping the SPTE, i.e. has already freed the memory the SPTE pointed at and thus KVM is reading/writing memory that KVM no longer owns. Link: https://lore.kernel.org/all/20221129191237.31447-1-mizhang@google.com Suggested-by:
Mingwei Zhang <mizhang@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Jim Mattson <jmattson@google.com> Reviewed-by:
Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20230729004722.1056172-13-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
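A sketch of what such a macro could look like, inferred from the description (BUG() when CONFIG_BUG_ON_DATA_CORRUPTION=y, otherwise WARN once and mark only the offending VM as bugged); the real definition may differ.

    #define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm)                   \
    ({                                                              \
            bool __ret = !!(cond);                                  \
                                                                    \
            if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION))          \
                    BUG_ON(__ret);                                  \
            else if (WARN_ON_ONCE(__ret))                           \
                    kvm_vm_bugged(kvm);                             \
            unlikely(__ret);                                        \
    })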
-
Mingwei Zhang authored
Plumb "struct kvm" all the way to pte_list_remove() to allow the usage of KVM_BUG() and/or KVM_BUG_ON(). This will allow killing only the offending VM instead of doing BUG() if the kernel is built with CONFIG_BUG_ON_DATA_CORRUPTION=n, i.e. does NOT want to BUG() if KVM's data structures (rmaps) appear to be corrupted. Signed-off-by:
Mingwei Zhang <mizhang@google.com> [sean: tweak changelog] Link: https://lore.kernel.org/r/20230729004722.1056172-12-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-