1. 29 Apr, 2022 1 commit
  2. 27 Apr, 2022 3 commits
    • Marc Zyngier's avatar
      KVM: arm64: Inject exception on out-of-IPA-range translation fault · 85ea6b1e
      Marc Zyngier authored
      When taking a translation fault for an IPA that is outside of
      the range defined by the hypervisor (between the HW PARange and
      the IPA range), we stupidly treat it as an IO and forward the access
      to userspace. Of course, userspace can't do much with it, and things
      end badly.
      
      Arguably, the guest is braindead, but we should at least catch the
      case and inject an exception.
      
      Check the faulting IPA against:
      - the sanitised PARange: inject an address size fault
      - the IPA size: inject an abort
      Reported-by: default avatarChristoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      85ea6b1e
    • Alexandru Elisei's avatar
      KVM/arm64: Don't emulate a PMU for 32-bit guests if feature not set · 8f6379e2
      Alexandru Elisei authored
      kvm->arch.arm_pmu is set when userspace attempts to set the first PMU
      attribute. As certain attributes are mandatory, arm_pmu ends up always
      being set to a valid arm_pmu, otherwise KVM will refuse to run the VCPU.
      However, this only happens if the VCPU has the PMU feature. If the VCPU
      doesn't have the feature bit set, kvm->arch.arm_pmu will be left
      uninitialized and equal to NULL.
      
      KVM doesn't do ID register emulation for 32-bit guests and accesses to the
      PMU registers aren't gated by the pmu_visibility() function. This is done
      to prevent injecting unexpected undefined exceptions in guests which have
      detected the presence of a hardware PMU. But even though the VCPU feature
      is missing, KVM still attempts to emulate certain aspects of the PMU when
      PMU registers are accessed. This leads to a NULL pointer dereference like
      this one, which happens on an odroid-c4 board when running the
      kvm-unit-tests pmu-cycle-counter test with kvmtool and without the PMU
      feature being set:
      
      [  454.402699] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000150
      [  454.405865] Mem abort info:
      [  454.408596]   ESR = 0x96000004
      [  454.411638]   EC = 0x25: DABT (current EL), IL = 32 bits
      [  454.416901]   SET = 0, FnV = 0
      [  454.419909]   EA = 0, S1PTW = 0
      [  454.423010]   FSC = 0x04: level 0 translation fault
      [  454.427841] Data abort info:
      [  454.430687]   ISV = 0, ISS = 0x00000004
      [  454.434484]   CM = 0, WnR = 0
      [  454.437404] user pgtable: 4k pages, 48-bit VAs, pgdp=000000000c924000
      [  454.443800] [0000000000000150] pgd=0000000000000000, p4d=0000000000000000
      [  454.450528] Internal error: Oops: 96000004 [#1] PREEMPT SMP
      [  454.456036] Modules linked in:
      [  454.459053] CPU: 1 PID: 267 Comm: kvm-vcpu-0 Not tainted 5.18.0-rc4 #113
      [  454.465697] Hardware name: Hardkernel ODROID-C4 (DT)
      [  454.470612] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [  454.477512] pc : kvm_pmu_event_mask.isra.0+0x14/0x74
      [  454.482427] lr : kvm_pmu_set_counter_event_type+0x2c/0x80
      [  454.487775] sp : ffff80000a9839c0
      [  454.491050] x29: ffff80000a9839c0 x28: ffff000000a83a00 x27: 0000000000000000
      [  454.498127] x26: 0000000000000000 x25: 0000000000000000 x24: ffff00000a510000
      [  454.505198] x23: ffff000000a83a00 x22: ffff000003b01000 x21: 0000000000000000
      [  454.512271] x20: 000000000000001f x19: 00000000000003ff x18: 0000000000000000
      [  454.519343] x17: 000000008003fe98 x16: 0000000000000000 x15: 0000000000000000
      [  454.526416] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
      [  454.533489] x11: 000000008003fdbc x10: 0000000000009d20 x9 : 000000000000001b
      [  454.540561] x8 : 0000000000000000 x7 : 0000000000000d00 x6 : 0000000000009d00
      [  454.547633] x5 : 0000000000000037 x4 : 0000000000009d00 x3 : 0d09000000000000
      [  454.554705] x2 : 000000000000001f x1 : 0000000000000000 x0 : 0000000000000000
      [  454.561779] Call trace:
      [  454.564191]  kvm_pmu_event_mask.isra.0+0x14/0x74
      [  454.568764]  kvm_pmu_set_counter_event_type+0x2c/0x80
      [  454.573766]  access_pmu_evtyper+0x128/0x170
      [  454.577905]  perform_access+0x34/0x80
      [  454.581527]  kvm_handle_cp_32+0x13c/0x160
      [  454.585495]  kvm_handle_cp15_32+0x1c/0x30
      [  454.589462]  handle_exit+0x70/0x180
      [  454.592912]  kvm_arch_vcpu_ioctl_run+0x1c4/0x5e0
      [  454.597485]  kvm_vcpu_ioctl+0x23c/0x940
      [  454.601280]  __arm64_sys_ioctl+0xa8/0xf0
      [  454.605160]  invoke_syscall+0x48/0x114
      [  454.608869]  el0_svc_common.constprop.0+0xd4/0xfc
      [  454.613527]  do_el0_svc+0x28/0x90
      [  454.616803]  el0_svc+0x34/0xb0
      [  454.619822]  el0t_64_sync_handler+0xa4/0x130
      [  454.624049]  el0t_64_sync+0x18c/0x190
      [  454.627675] Code: a9be7bfd 910003fd f9000bf3 52807ff3 (b9415001)
      [  454.633714] ---[ end trace 0000000000000000 ]---
      
      In this particular case, Linux hasn't detected the presence of a hardware
      PMU because the PMU node is missing from the DTB, so userspace would have
      been unable to set the VCPU PMU feature even if it attempted it. What
      happens is that the 32-bit guest reads ID_DFR0, which advertises the
      presence of the PMU, and when it tries to program a counter, it triggers
      the NULL pointer dereference because kvm->arch.arm_pmu is NULL.
      
      kvm-arch.arm_pmu was introduced by commit 46b18782 ("KVM: arm64:
      Keep a per-VM pointer to the default PMU"). Until that commit, this
      error would be triggered instead:
      
      [   73.388140] ------------[ cut here ]------------
      [   73.388189] Unknown PMU version 0
      [   73.390420] WARNING: CPU: 1 PID: 264 at arch/arm64/kvm/pmu-emul.c:36 kvm_pmu_event_mask.isra.0+0x6c/0x74
      [   73.399821] Modules linked in:
      [   73.402835] CPU: 1 PID: 264 Comm: kvm-vcpu-0 Not tainted 5.17.0 #114
      [   73.409132] Hardware name: Hardkernel ODROID-C4 (DT)
      [   73.414048] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [   73.420948] pc : kvm_pmu_event_mask.isra.0+0x6c/0x74
      [   73.425863] lr : kvm_pmu_event_mask.isra.0+0x6c/0x74
      [   73.430779] sp : ffff80000a8db9b0
      [   73.434055] x29: ffff80000a8db9b0 x28: ffff000000dbaac0 x27: 0000000000000000
      [   73.441131] x26: ffff000000dbaac0 x25: 00000000c600000d x24: 0000000000180720
      [   73.448203] x23: ffff800009ffbe10 x22: ffff00000b612000 x21: 0000000000000000
      [   73.455276] x20: 000000000000001f x19: 0000000000000000 x18: ffffffffffffffff
      [   73.462348] x17: 000000008003fe98 x16: 0000000000000000 x15: 0720072007200720
      [   73.469420] x14: 0720072007200720 x13: ffff800009d32488 x12: 00000000000004e6
      [   73.476493] x11: 00000000000001a2 x10: ffff800009d32488 x9 : ffff800009d32488
      [   73.483565] x8 : 00000000ffffefff x7 : ffff800009d8a488 x6 : ffff800009d8a488
      [   73.490638] x5 : ffff0000f461a9d8 x4 : 0000000000000000 x3 : 0000000000000001
      [   73.497710] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000000dbaac0
      [   73.504784] Call trace:
      [   73.507195]  kvm_pmu_event_mask.isra.0+0x6c/0x74
      [   73.511768]  kvm_pmu_set_counter_event_type+0x2c/0x80
      [   73.516770]  access_pmu_evtyper+0x128/0x16c
      [   73.520910]  perform_access+0x34/0x80
      [   73.524532]  kvm_handle_cp_32+0x13c/0x160
      [   73.528500]  kvm_handle_cp15_32+0x1c/0x30
      [   73.532467]  handle_exit+0x70/0x180
      [   73.535917]  kvm_arch_vcpu_ioctl_run+0x20c/0x6e0
      [   73.540489]  kvm_vcpu_ioctl+0x2b8/0x9e0
      [   73.544283]  __arm64_sys_ioctl+0xa8/0xf0
      [   73.548165]  invoke_syscall+0x48/0x114
      [   73.551874]  el0_svc_common.constprop.0+0xd4/0xfc
      [   73.556531]  do_el0_svc+0x28/0x90
      [   73.559808]  el0_svc+0x28/0x80
      [   73.562826]  el0t_64_sync_handler+0xa4/0x130
      [   73.567054]  el0t_64_sync+0x1a0/0x1a4
      [   73.570676] ---[ end trace 0000000000000000 ]---
      [   73.575382] kvm: pmu event creation failed -2
      
      The root cause remains the same: kvm->arch.pmuver was never set to
      something sensible because the VCPU feature itself was never set.
      
      The odroid-c4 is somewhat of a special case, because Linux doesn't probe
      the PMU. But the above errors can easily be reproduced on any hardware,
      with or without a PMU driver, as long as userspace doesn't set the PMU
      feature.
      
      Work around the fact that KVM advertises a PMU even when the VCPU feature
      is not set by gating all PMU emulation on the feature. The guest can still
      access the registers without KVM injecting an undefined exception.
      Signed-off-by: default avatarAlexandru Elisei <alexandru.elisei@arm.com>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220425145530.723858-1-alexandru.elisei@arm.com
      8f6379e2
    • Will Deacon's avatar
      KVM: arm64: Handle host stage-2 faults from 32-bit EL0 · 2a50fc5f
      Will Deacon authored
      When pKVM is enabled, host memory accesses are translated by an identity
      mapping at stage-2, which is populated lazily in response to synchronous
      exceptions from 64-bit EL1 and EL0.
      
      Extend this handling to cover exceptions originating from 32-bit EL0 as
      well. Although these are very unlikely to occur in practice, as the
      kernel typically ensures that user pages are initialised before mapping
      them in, drivers could still map previously untouched device pages into
      userspace and expect things to work rather than panic the system.
      
      Cc: Quentin Perret <qperret@google.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Signed-off-by: default avatarWill Deacon <will@kernel.org>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220427171332.13635-1-will@kernel.org
      2a50fc5f
  3. 21 Apr, 2022 18 commits
    • Paolo Bonzini's avatar
      kvm: selftests: introduce and use more page size-related constants · e852be8b
      Paolo Bonzini authored
      Clean up code that was hardcoding masks for various fields,
      now that the masks are included in processor.h.
      
      For more cleanup, define PAGE_SIZE and PAGE_MASK just like in Linux.
      PAGE_SIZE in particular was defined by several tests.
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e852be8b
    • Paolo Bonzini's avatar
      kvm: selftests: do not use bitfields larger than 32-bits for PTEs · f18b4aeb
      Paolo Bonzini authored
      Red Hat's QE team reported test failure on access_tracking_perf_test:
      
      Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
      guest physical test memory offset: 0x3fffbffff000
      
      Populating memory             : 0.684014577s
      Writing to populated memory   : 0.006230175s
      Reading from populated memory : 0.004557805s
      ==== Test Assertion Failure ====
        lib/kvm_util.c:1411: false
        pid=125806 tid=125809 errno=4 - Interrupted system call
           1  0x0000000000402f7c: addr_gpa2hva at kvm_util.c:1411
           2   (inlined by) addr_gpa2hva at kvm_util.c:1405
           3  0x0000000000401f52: lookup_pfn at access_tracking_perf_test.c:98
           4   (inlined by) mark_vcpu_memory_idle at access_tracking_perf_test.c:152
           5   (inlined by) vcpu_thread_main at access_tracking_perf_test.c:232
           6  0x00007fefe9ff81ce: ?? ??:0
           7  0x00007fefe9c64d82: ?? ??:0
        No vm physical memory at 0xffbffff000
      
      I can easily reproduce it with a Intel(R) Xeon(R) CPU E5-2630 with 46 bits
      PA.
      
      It turns out that the address translation for clearing idle page tracking
      returned a wrong result; addr_gva2gpa()'s last step, which is based on
      "pte[index[0]].pfn", did the calculation with 40 bits length and the
      high 12 bits got truncated.  In above case the GPA address to be returned
      should be 0x3fffbffff000 for GVA 0xc0000000, but it got truncated into
      0xffbffff000 and the subsequent gpa2hva lookup failed.
      
      The width of operations on bit fields greater than 32-bit is
      implementation defined, and differs between GCC (which uses the bitfield
      precision) and clang (which uses 64-bit arithmetic), so this is a
      potential minefield.  Remove the bit fields and using manual masking
      instead.
      
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2075036Reported-by: default avatarNana Liu <nanliu@redhat.com>
      Reviewed-by: default avatarPeter Xu <peterx@redhat.com>
      Tested-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f18b4aeb
    • Mingwei Zhang's avatar
      KVM: SEV: add cache flush to solve SEV cache incoherency issues · 683412cc
      Mingwei Zhang authored
      Flush the CPU caches when memory is reclaimed from an SEV guest (where
      reclaim also includes it being unmapped from KVM's memslots).  Due to lack
      of coherency for SEV encrypted memory, failure to flush results in silent
      data corruption if userspace is malicious/broken and doesn't ensure SEV
      guest memory is properly pinned and unpinned.
      
      Cache coherency is not enforced across the VM boundary in SEV (AMD APM
      vol.2 Section 15.34.7). Confidential cachelines, generated by confidential
      VM guests have to be explicitly flushed on the host side. If a memory page
      containing dirty confidential cachelines was released by VM and reallocated
      to another user, the cachelines may corrupt the new user at a later time.
      
      KVM takes a shortcut by assuming all confidential memory remain pinned
      until the end of VM lifetime. Therefore, KVM does not flush cache at
      mmu_notifier invalidation events. Because of this incorrect assumption and
      the lack of cache flushing, malicous userspace can crash the host kernel:
      creating a malicious VM and continuously allocates/releases unpinned
      confidential memory pages when the VM is running.
      
      Add cache flush operations to mmu_notifier operations to ensure that any
      physical memory leaving the guest VM get flushed. In particular, hook
      mmu_notifier_invalidate_range_start and mmu_notifier_release events and
      flush cache accordingly. The hook after releasing the mmu lock to avoid
      contention with other vCPUs.
      
      Cc: stable@vger.kernel.org
      Suggested-by: default avatarSean Christpherson <seanjc@google.com>
      Reported-by: default avatarMingwei Zhang <mizhang@google.com>
      Signed-off-by: default avatarMingwei Zhang <mizhang@google.com>
      Message-Id: <20220421031407.2516575-4-mizhang@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      683412cc
    • Mingwei Zhang's avatar
      KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUs · d45829b3
      Mingwei Zhang authored
      Use clflush_cache_range() to flush the confidential memory when
      SME_COHERENT is supported in AMD CPU. Cache flush is still needed since
      SME_COHERENT only support cache invalidation at CPU side. All confidential
      cache lines are still incoherent with DMA devices.
      
      Cc: stable@vger.kerel.org
      
      Fixes: add5e2f0 ("KVM: SVM: Add support for the SEV-ES VMSA")
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarMingwei Zhang <mizhang@google.com>
      Message-Id: <20220421031407.2516575-3-mizhang@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d45829b3
    • Sean Christopherson's avatar
      KVM: SVM: Simplify and harden helper to flush SEV guest page(s) · 4bbef7e8
      Sean Christopherson authored
      Rework sev_flush_guest_memory() to explicitly handle only a single page,
      and harden it to fall back to WBINVD if VM_PAGE_FLUSH fails.  Per-page
      flushing is currently used only to flush the VMSA, and in its current
      form, the helper is completely broken with respect to flushing actual
      guest memory, i.e. won't work correctly for an arbitrary memory range.
      
      VM_PAGE_FLUSH takes a host virtual address, and is subject to normal page
      walks, i.e. will fault if the address is not present in the host page
      tables or does not have the correct permissions.  Current AMD CPUs also
      do not honor SMAP overrides (undocumented in kernel versions of the APM),
      so passing in a userspace address is completely out of the question.  In
      other words, KVM would need to manually walk the host page tables to get
      the pfn, ensure the pfn is stable, and then use the direct map to invoke
      VM_PAGE_FLUSH.  And the latter might not even work, e.g. if userspace is
      particularly evil/clever and backs the guest with Secret Memory (which
      unmaps memory from the direct map).
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      
      Fixes: add5e2f0 ("KVM: SVM: Add support for the SEV-ES VMSA")
      Reported-by: default avatarMingwei Zhang <mizhang@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMingwei Zhang <mizhang@google.com>
      Message-Id: <20220421031407.2516575-2-mizhang@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4bbef7e8
    • Thomas Huth's avatar
      KVM: selftests: Silence compiler warning in the kvm_page_table_test · 266a19a0
      Thomas Huth authored
      When compiling kvm_page_table_test.c, I get this compiler warning
      with gcc 11.2:
      
      kvm_page_table_test.c: In function 'pre_init_before_test':
      ../../../../tools/include/linux/kernel.h:44:24: warning: comparison of
       distinct pointer types lacks a cast
         44 |         (void) (&_max1 == &_max2);              \
            |                        ^~
      kvm_page_table_test.c:281:21: note: in expansion of macro 'max'
        281 |         alignment = max(0x100000, alignment);
            |                     ^~~
      
      Fix it by adjusting the type of the absolute value.
      Signed-off-by: default avatarThomas Huth <thuth@redhat.com>
      Reviewed-by: default avatarClaudio Imbrenda <imbrenda@linux.ibm.com>
      Message-Id: <20220414103031.565037-1-thuth@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      266a19a0
    • Like Xu's avatar
      KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog · 75189d1d
      Like Xu authored
      NMI-watchdog is one of the favorite features of kernel developers,
      but it does not work in AMD guest even with vPMU enabled and worse,
      the system misrepresents this capability via /proc.
      
      This is a PMC emulation error. KVM does not pass the latest valid
      value to perf_event in time when guest NMI-watchdog is running, thus
      the perf_event corresponding to the watchdog counter will enter the
      old state at some point after the first guest NMI injection, forcing
      the hardware register PMC0 to be constantly written to 0x800000000001.
      
      Meanwhile, the running counter should accurately reflect its new value
      based on the latest coordinated pmc->counter (from vPMC's point of view)
      rather than the value written directly by the guest.
      
      Fixes: 168d918f ("KVM: x86: Adjust counter sample period after a wrmsr")
      Reported-by: default avatarDongli Cao <caodongli@kingsoft.com>
      Signed-off-by: default avatarLike Xu <likexu@tencent.com>
      Reviewed-by: default avatarYanan Wang <wangyanan55@huawei.com>
      Tested-by: default avatarYanan Wang <wangyanan55@huawei.com>
      Reviewed-by: default avatarJim Mattson <jmattson@google.com>
      Message-Id: <20220409015226.38619-1-likexu@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      75189d1d
    • Wanpeng Li's avatar
      x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume · 0361bdfd
      Wanpeng Li authored
      MSR_KVM_POLL_CONTROL is cleared on reset, thus reverting guests to
      host-side polling after suspend/resume.  Non-bootstrap CPUs are
      restored correctly by the haltpoll driver because they are hot-unplugged
      during suspend and hot-plugged during resume; however, the BSP
      is not hotpluggable and remains in host-sde polling mode after
      the guest resume.  The makes the guest pay for the cost of vmexits
      every time the guest enters idle.
      
      Fix it by recording BSP's haltpoll state and resuming it during guest
      resume.
      
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Message-Id: <1650267752-46796-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      0361bdfd
    • Tom Rix's avatar
      KVM: SPDX style and spelling fixes · a413a625
      Tom Rix authored
      SPDX comments use use /* */ style comments in headers anad
      // style comments in .c files.  Also fix two spelling mistakes.
      Signed-off-by: default avatarTom Rix <trix@redhat.com>
      Message-Id: <20220410153840.55506-1-trix@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a413a625
    • Sean Christopherson's avatar
      KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled · 0047fb33
      Sean Christopherson authored
      Skip the APICv inhibit update for KVM_GUESTDBG_BLOCKIRQ if APICv is
      disabled at the module level to avoid having to acquire the mutex and
      potentially process all vCPUs. The DISABLE inhibit will (barring bugs)
      never be lifted, so piling on more inhibits is unnecessary.
      
      Fixes: cae72dcc ("KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active")
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220420013732.3308816-5-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      0047fb33
    • Sean Christopherson's avatar
      KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race · 423ecfea
      Sean Christopherson authored
      Make a KVM_REQ_APICV_UPDATE request when creating a vCPU with an
      in-kernel local APIC and APICv enabled at the module level.  Consuming
      kvm_apicv_activated() and stuffing vcpu->arch.apicv_active directly can
      race with __kvm_set_or_clear_apicv_inhibit(), as vCPU creation happens
      before the vCPU is fully onlined, i.e. it won't get the request made to
      "all" vCPUs.  If APICv is globally inhibited between setting apicv_active
      and onlining the vCPU, the vCPU will end up running with APICv enabled
      and trigger KVM's sanity check.
      
      Mark APICv as active during vCPU creation if APICv is enabled at the
      module level, both to be optimistic about it's final state, e.g. to avoid
      additional VMWRITEs on VMX, and because there are likely bugs lurking
      since KVM checks apicv_active in multiple vCPU creation paths.  While
      keeping the current behavior of consuming kvm_apicv_activated() is
      arguably safer from a regression perspective, force apicv_active so that
      vCPU creation runs with deterministic state and so that if there are bugs,
      they are found sooner than later, i.e. not when some crazy race condition
      is hit.
      
        WARNING: CPU: 0 PID: 484 at arch/x86/kvm/x86.c:9877 vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877
        Modules linked in:
        CPU: 0 PID: 484 Comm: syz-executor361 Not tainted 5.16.13 #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1~cloud0 04/01/2014
        RIP: 0010:vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877
        Call Trace:
         <TASK>
         vcpu_run arch/x86/kvm/x86.c:10039 [inline]
         kvm_arch_vcpu_ioctl_run+0x337/0x15e0 arch/x86/kvm/x86.c:10234
         kvm_vcpu_ioctl+0x4d2/0xc80 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3727
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:874 [inline]
         __se_sys_ioctl fs/ioctl.c:860 [inline]
         __x64_sys_ioctl+0x16d/0x1d0 fs/ioctl.c:860
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The bug was hit by a syzkaller spamming VM creation with 2 vCPUs and a
      call to KVM_SET_GUEST_DEBUG.
      
        r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x0, 0x0)
        r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
        ioctl$KVM_CAP_SPLIT_IRQCHIP(r1, 0x4068aea3, &(0x7f0000000000)) (async)
        r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0) (async)
        r3 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x400000000000002)
        ioctl$KVM_SET_GUEST_DEBUG(r3, 0x4048ae9b, &(0x7f00000000c0)={0x5dda9c14aa95f5c5})
        ioctl$KVM_RUN(r2, 0xae80, 0x0)
      Reported-by: default avatarGaoning Pan <pgn@zju.edu.cn>
      Reported-by: default avatarYongkang Jia <kangel@zju.edu.cn>
      Fixes: 8df14af4 ("kvm: x86: Add support for dynamic APICv activation")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220420013732.3308816-4-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      423ecfea
    • Sean Christopherson's avatar
      KVM: nVMX: Defer APICv updates while L2 is active until L1 is active · 7c69661e
      Sean Christopherson authored
      Defer APICv updates that occur while L2 is active until nested VM-Exit,
      i.e. until L1 regains control.  vmx_refresh_apicv_exec_ctrl() assumes L1
      is active and (a) stomps all over vmcs02 and (b) neglects to ever updated
      vmcs01.  E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
      APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv
      becomes unhibited while L2 is active, KVM will set various APICv controls
      in vmcs02 and trigger a failed VM-Entry.  The kicker is that, unless
      running with nested_early_check=1, KVM blames L1 and chaos ensues.
      
      In all cases, ignoring vmcs02 and always deferring the inhibition change
      to vmcs01 is correct (or at least acceptable).  The ABSENT and DISABLE
      inhibitions cannot truly change while L2 is active (see below).
      
      IRQ_BLOCKING can change, but it is firmly a best effort debug feature.
      Furthermore, only L2's APIC is accelerated/virtualized to the full extent
      possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR
      interception will apply to the virtual APIC managed by KVM.
      The exception is the SELF_IPI register when x2APIC is enabled, but that's
      an acceptable hole.
      
      Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the
      MSRs to L2, but for that to work in any sane capacity, L1 would need to
      pass through IRQs to L2 as well, and IRQs must be intercepted to enable
      virtual interrupt delivery.  I.e. exposing Auto EOI to L2 and enabling
      VID for L2 are, for all intents and purposes, mutually exclusive.
      
      Lack of dynamic toggling is also why this scenario is all but impossible
      to encounter in KVM's current form.  But a future patch will pend an
      APICv update request _during_ vCPU creation to plug a race where a vCPU
      that's being created doesn't get included in the "all vCPUs request"
      because it's not yet visible to other vCPUs.  If userspaces restores L2
      after VM creation (hello, KVM selftests), the first KVM_RUN will occur
      while L2 is active and thus service the APICv update request made during
      VM creation.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220420013732.3308816-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7c69661e
    • Sean Christopherson's avatar
      KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled · 80f0497c
      Sean Christopherson authored
      Set the DISABLE inhibit, not the ABSENT inhibit, if APICv is disabled via
      module param.  A recent refactoring to add a wrapper for setting/clearing
      inhibits unintentionally changed the flag, probably due to a copy+paste
      goof.
      
      Fixes: 4f4c4a3e ("KVM: x86: Trace all APICv inhibit changes and capture overall status")
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220420013732.3308816-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      80f0497c
    • Sean Christopherson's avatar
      KVM: Initialize debugfs_dentry when a VM is created to avoid NULL deref · 5c697c36
      Sean Christopherson authored
      Initialize debugfs_entry to its semi-magical -ENOENT value when the VM
      is created.  KVM's teardown when VM creation fails is kludgy and calls
      kvm_uevent_notify_change() and kvm_destroy_vm_debugfs() even if KVM never
      attempted kvm_create_vm_debugfs().  Because debugfs_entry is zero
      initialized, the IS_ERR() checks pass and KVM derefs a NULL pointer.
      
        BUG: kernel NULL pointer dereference, address: 0000000000000018
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 1068b1067 P4D 1068b1067 PUD 1068b0067 PMD 0
        Oops: 0000 [#1] SMP
        CPU: 0 PID: 871 Comm: repro Not tainted 5.18.0-rc1+ #825
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:__dentry_path+0x7b/0x130
        Call Trace:
         <TASK>
         dentry_path_raw+0x42/0x70
         kvm_uevent_notify_change.part.0+0x10c/0x200 [kvm]
         kvm_put_kvm+0x63/0x2b0 [kvm]
         kvm_dev_ioctl+0x43a/0x920 [kvm]
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x31/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        Modules linked in: kvm_intel kvm irqbypass
      
      Fixes: a44a4cc1 ("KVM: Don't create VM debugfs files outside of the VM directory")
      Cc: stable@vger.kernel.org
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Oliver Upton <oupton@google.com>
      Reported-by: syzbot+df6fbbd2ee39f21289ef@syzkaller.appspotmail.com
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarOliver Upton <oupton@google.com>
      Message-Id: <20220415004622.2207751-1-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5c697c36
    • Sean Christopherson's avatar
      KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused · 2031f287
      Sean Christopherson authored
      Add wrappers to acquire/release KVM's SRCU lock when stashing the index
      in vcpu->src_idx, along with rudimentary detection of illegal usage,
      e.g. re-acquiring SRCU and thus overwriting vcpu->src_idx.  Because the
      SRCU index is (currently) either 0 or 1, illegal nesting bugs can go
      unnoticed for quite some time and only cause problems when the nested
      lock happens to get a different index.
      
      Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will
      likely yell so loudly that it will bring the kernel to its knees.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Tested-by: default avatarFabiano Rosas <farosas@linux.ibm.com>
      Message-Id: <20220415004343.2203171-4-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2031f287
    • Sean Christopherson's avatar
      KVM: RISC-V: Use kvm_vcpu.srcu_idx, drop RISC-V's unnecessary copy · fdd6f6ac
      Sean Christopherson authored
      Use the generic kvm_vcpu's srcu_idx instead of using an indentical field
      in RISC-V's version of kvm_vcpu_arch.  Generic KVM very intentionally
      does not touch vcpu->srcu_idx, i.e. there's zero chance of running afoul
      of common code.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220415004343.2203171-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      fdd6f6ac
    • Sean Christopherson's avatar
      KVM: x86: Don't re-acquire SRCU lock in complete_emulated_io() · 2d089356
      Sean Christopherson authored
      Don't re-acquire SRCU in complete_emulated_io() now that KVM acquires the
      lock in kvm_arch_vcpu_ioctl_run().  More importantly, don't overwrite
      vcpu->srcu_idx.  If the index acquired by complete_emulated_io() differs
      from the one acquired by kvm_arch_vcpu_ioctl_run(), KVM will effectively
      leak a lock and hang if/when synchronize_srcu() is invoked for the
      relevant grace period.
      
      Fixes: 8d25b7be ("KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220415004343.2203171-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2d089356
    • Paolo Bonzini's avatar
      Merge tag 'kvm-riscv-fixes-5.18-2' of https://github.com/kvm-riscv/linux into HEAD · 012c7225
      Paolo Bonzini authored
      KVM/riscv fixes for 5.18, take #2
      
      - Remove 's' & 'u' as valid ISA extension
      
      - Do not allow disabling the base extensions 'i'/'m'/'a'/'c'
      012c7225
  4. 20 Apr, 2022 2 commits
  5. 17 Apr, 2022 10 commits
  6. 16 Apr, 2022 6 commits
    • Linus Torvalds's avatar
      Merge tag 'soc-fixes-5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · 70a0cec8
      Linus Torvalds authored
      Pull ARM SoC fixes from Arnd Bergmann:
       "There are a number of SoC bugfixes that came in since the merge
        window, and more of them are already pending.
      
        This batch includes:
      
         - A boot time regression fix for davinci that triggered on
           multi_v5_defconfig when booting any platform
      
         - Defconfig updates to address removed features, changed symbol names
           or dependencies, for gemini, ux500, and pxa
      
         - Email address changes for Krzysztof Kozlowski
      
         - Build warning fixes for ep93xx and iop32x
      
         - Devicetree warning fixes across many platforms
      
         - Minor bugfixes for the reset controller, memory controller and SCMI
           firmware subsystems plus the versatile-express board"
      
      * tag 'soc-fixes-5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (34 commits)
        ARM: config: Update Gemini defconfig
        arm64: dts: qcom/sdm845-shift-axolotl: Fix boolean properties with values
        ARM: dts: align SPI NOR node name with dtschema
        ARM: dts: Fix more boolean properties with values
        arm/arm64: dts: qcom: Fix boolean properties with values
        arm64: dts: imx: Fix imx8*-var-som touchscreen property sizes
        arm: dts: imx: Fix boolean properties with values
        arm64: dts: tegra: Fix boolean properties with values
        arm: dts: at91: Fix boolean properties with values
        arm: configs: imote2: Drop defconfig as board support dropped.
        ep93xx: clock: Don't use plain integer as NULL pointer
        ep93xx: clock: Fix UAF in ep93xx_clk_register_gate()
        ARM: vexpress/spc: Fix all the kernel-doc build warnings
        ARM: vexpress/spc: Fix kernel-doc build warning for ve_spc_cpu_in_wfi
        ARM: config: u8500: Re-enable AB8500 battery charging
        ARM: config: u8500: Add some common hardware
        memory: fsl_ifc: populate child nodes of buses and mfd devices
        ARM: config: Refresh U8500 defconfig
        firmware: arm_scmi: Fix sparse warnings in OPTEE transport driver
        firmware: arm_scmi: Replace zero-length array with flexible-array member
        ...
      70a0cec8
    • Linus Torvalds's avatar
      Merge tag 'random-5.18-rc3-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random · 92edbe32
      Linus Torvalds authored
      Pull random number generator fixes from Jason Donenfeld:
      
       - Per your suggestion, random reads now won't fail if there's a page
         fault after some non-zero amount of data has been read, which makes
         the behavior consistent with all other reads in the kernel.
      
       - Rather than an inconsistent mix of random_get_entropy() returning an
         unsigned long or a cycles_t, now it just returns an unsigned long.
      
       - A memcpy() was replaced with an memmove(), because the addresses are
         sometimes overlapping. In practice the destination is always before
         the source, so not really an issue, but better to be correct than
         not.
      
      * tag 'random-5.18-rc3-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
        random: use memmove instead of memcpy for remaining 32 bytes
        random: make random_get_entropy() return an unsigned long
        random: allow partial reads if later user copies fail
      92edbe32
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 90ea17a9
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "13 fixes, all in drivers.
      
        The most extensive changes are in the iscsi series (affecting drivers
        qedi, cxgbi and bnx2i), the next most is scsi_debug, but that's just a
        simple revert and then minor updates to pm80xx"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: iscsi: MAINTAINERS: Add Mike Christie as co-maintainer
        scsi: qedi: Fix failed disconnect handling
        scsi: iscsi: Fix NOP handling during conn recovery
        scsi: iscsi: Merge suspend fields
        scsi: iscsi: Fix unbound endpoint error handling
        scsi: iscsi: Fix conn cleanup and stop race during iscsid restart
        scsi: iscsi: Fix endpoint reuse regression
        scsi: iscsi: Release endpoint ID when its freed
        scsi: iscsi: Fix offload conn cleanup when iscsid restarts
        scsi: iscsi: Move iscsi_ep_disconnect()
        scsi: pm80xx: Enable upper inbound, outbound queues
        scsi: pm80xx: Mask and unmask upper interrupt vectors 32-63
        Revert "scsi: scsi_debug: Address races following module load"
      90ea17a9
    • Bartosz Golaszewski's avatar
      Merge tag 'intel-gpio-v5.18-2' of... · 0ebb4fbe
      Bartosz Golaszewski authored
      Merge tag 'intel-gpio-v5.18-2' of gitolite.kernel.org:pub/scm/linux/kernel/git/andy/linux-gpio-intel into gpio/for-current
      
      intel-gpio for v5.18-2
      
      * Couple of fixes related to handling unsigned value of the pin from ACPI
      
      gpiolib:
       -  acpi: Convert type for pin to be unsigned
       -  acpi: use correct format characters
      0ebb4fbe
    • Linus Torvalds's avatar
      Merge tag 'dma-mapping-5.18-2' of git://git.infradead.org/users/hch/dma-mapping · b0086839
      Linus Torvalds authored
      Pull dma-mapping fix from Christoph Hellwig:
      
       - avoid a double memory copy for swiotlb (Chao Gao)
      
      * tag 'dma-mapping-5.18-2' of git://git.infradead.org/users/hch/dma-mapping:
        dma-direct: avoid redundant memory sync for swiotlb
      b0086839
    • Jason A. Donenfeld's avatar
      random: use memmove instead of memcpy for remaining 32 bytes · 35a33ff3
      Jason A. Donenfeld authored
      In order to immediately overwrite the old key on the stack, before
      servicing a userspace request for bytes, we use the remaining 32 bytes
      of block 0 as the key. This means moving indices 8,9,a,b,c,d,e,f ->
      4,5,6,7,8,9,a,b. Since 4 < 8, for the kernel implementations of
      memcpy(), this doesn't actually appear to be a problem in practice. But
      relying on that characteristic seems a bit brittle. So let's change that
      to a proper memmove(), which is the by-the-books way of handling
      overlapping memory copies.
      Reviewed-by: default avatarDominik Brodowski <linux@dominikbrodowski.net>
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      35a33ff3