1. 27 May, 2020 4 commits
  2. 19 May, 2020 1 commit
    • KVM: x86: only do L1TF workaround on affected processors · d43e2675
      Paolo Bonzini authored
      KVM stores the gfn in MMIO SPTEs as a caching optimization.  These are split
      in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits
      in an L1TF attack.  This works as long as there are 5 free bits between
      MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO
      access triggers a reserved-bit-set page fault.
      
      The bit positions however were computed wrongly for AMD processors that have
      encryption support.  In this case, x86_phys_bits is reduced (for example
      from 48 to 43, to account for the C bit at position 47 and four bits used
      internally to store the SEV ASID and other stuff) while x86_cache_bits
      would remain set to 48, and _all_ bits between the reduced MAXPHYADDR
      and bit 51 are set.  Then low_phys_bits would also cover some of the
      bits that are set in the shadow_mmio_value, terribly confusing the gfn
      caching mechanism.
      
      To fix this, avoid splitting gfns as long as the processor does not have
      the L1TF bug (which includes all AMD processors).  When there is no
      splitting, low_phys_bits can be set to the reduced MAXPHYADDR, removing
      the overlap.  This fixes "npt=0" operation on EPYC processors.
      
      Thanks to Maxim Levitsky for bisecting this bug.
      
      Cc: stable@vger.kernel.org
      Fixes: 52918ed5 ("KVM: SVM: Override default MMIO mask if memory encryption is enabled")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
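The "[high 11111 low]" layout described above can be sketched in plain C. This is a simplified userspace model, not the kernel's MMU code: the function names, the `low_phys_bits` parameter, and the omission of the MMIO marker and access bits are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT    12
#define RSVD_MASK_LEN 5   /* the five "1" bits in "[high 11111 low]" */

/* Mask with bits lo..hi (inclusive) set. */
static uint64_t bits_mask(unsigned lo, unsigned hi)
{
    return (~0ULL >> (63 - hi)) & ~((1ULL << lo) - 1);
}

/* The five bits that are forced to 1, starting at low_phys_bits. */
static uint64_t rsvd_mask(unsigned low_phys_bits)
{
    return bits_mask(low_phys_bits, low_phys_bits + RSVD_MASK_LEN - 1);
}

/* The address bits that stay in place, from PAGE_SHIFT up to low_phys_bits - 1. */
static uint64_t low_mask(unsigned low_phys_bits)
{
    return bits_mask(PAGE_SHIFT, low_phys_bits - 1);
}

/* Split a page-aligned gpa around the five set bits. */
uint64_t mmio_spte_encode(uint64_t gpa, unsigned low_phys_bits)
{
    uint64_t rsvd = rsvd_mask(low_phys_bits);
    uint64_t low = low_mask(low_phys_bits);

    /* The relocated high part must end at bit 50, leaving bit 51 free. */
    assert(low_phys_bits > PAGE_SHIFT);
    assert(low_phys_bits + 2 * RSVD_MASK_LEN <= 51);
    /* The gpa may only use bits covered by the low part and the mask. */
    assert((gpa & ~(low | rsvd)) == 0);

    return (gpa & low) | rsvd | ((gpa & rsvd) << RSVD_MASK_LEN);
}

/* Reassemble the gpa from the low part and the relocated high part. */
uint64_t mmio_spte_decode(uint64_t spte, unsigned low_phys_bits)
{
    uint64_t rsvd = rsvd_mask(low_phys_bits);
    uint64_t low = low_mask(low_phys_bits);

    return (spte & low) | ((spte >> RSVD_MASK_LEN) & rsvd);
}
```

In this model, if the low part is made wider than the position where the set bits really start, decoding picks up bits that are always 1, which is the kind of overlap the commit message describes.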
  3. 15 May, 2020 2 commits
  4. 13 May, 2020 1 commit
    • KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c · 37486135
      Babu Moger authored
      Though rdpkru and wrpkru are contingent upon CR4.PKE, the PKRU
      resource isn't. It can be read with XSAVE and written with XRSTOR.
      So, if we don't set the guest PKRU value here (in kvm_load_guest_xsave_state),
      the guest can read the host value.
      
      In the case of kvm_load_host_xsave_state, a guest with CR4.PKE clear could
      potentially use XRSTOR to change the host PKRU value.
      
      While at it, move pkru state save/restore to common code and the
      host_pkru field to kvm_vcpu_arch.  This will let SVM support protection keys.
      
      Cc: stable@vger.kernel.org
      Reported-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Message-Id: <158932794619.44260.14508381096663848853.stgit@naples-babu.amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
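The save/restore reasoning above can be modelled with a toy in userspace C. Everything here is hypothetical: `toy_hw_pkru` stands in for the real PKRU register, and the `cr4_pke` / `xcr0_pkru` flags are an assumed way of expressing "the guest can reach PKRU through RDPKRU/WRPKRU or through XSAVE/XRSTOR". It is a sketch of the idea, not the kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy vcpu: only the fields needed for the PKRU swap. */
struct toy_pkru_vcpu {
    uint32_t guest_pkru;
    uint32_t host_pkru;
    bool cr4_pke;     /* guest CR4.PKE: RDPKRU/WRPKRU usable */
    bool xcr0_pkru;   /* assumed: PKRU state enabled for XSAVE/XRSTOR */
};

/* Stand-in for the hardware PKRU register. */
static uint32_t toy_hw_pkru;

/* PKRU is reachable either through CR4.PKE or through XSAVE/XRSTOR. */
static bool toy_guest_can_touch_pkru(const struct toy_pkru_vcpu *v)
{
    return v->cr4_pke || v->xcr0_pkru;
}

/* Before entering the guest: install the guest value so that the
 * guest cannot observe the host value. */
void toy_load_guest_state(struct toy_pkru_vcpu *v)
{
    if (toy_guest_can_touch_pkru(v) && v->guest_pkru != v->host_pkru)
        toy_hw_pkru = v->guest_pkru;
}

/* After leaving the guest: save what the guest left behind, then
 * restore the host value so that a guest write cannot leak into the host. */
void toy_load_host_state(struct toy_pkru_vcpu *v)
{
    if (!toy_guest_can_touch_pkru(v))
        return;

    v->guest_pkru = toy_hw_pkru;
    if (v->guest_pkru != v->host_pkru)
        toy_hw_pkru = v->host_pkru;
}
```

The point of the model is that the swap is keyed on more than CR4.PKE alone: a guest with CR4.PKE clear still has its own value installed and the host value restored.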
  5. 08 May, 2020 5 commits
    • KVM: SVM: Disable AVIC before setting V_IRQ · 7d611233
      Suravee Suthikulpanit authored
      Commit 64b5bd27 ("KVM: nSVM: ignore L1 interrupt window
      while running L2 with V_INTR_MASKING=1") introduced a WARN_ON
      that checks whether AVIC is enabled when trying to set V_IRQ
      in the VMCB to enable the irq window.
      
      The following warning is triggered because the requesting vcpu
      (the one asking to deactivate AVIC) does not get to process the APICv
      update request for itself until the next #vmexit.
      
      WARNING: CPU: 0 PID: 118232 at arch/x86/kvm/svm/svm.c:1372 enable_irq_window+0x6a/0xa0 [kvm_amd]
       RIP: 0010:enable_irq_window+0x6a/0xa0 [kvm_amd]
       Call Trace:
        kvm_arch_vcpu_ioctl_run+0x6e3/0x1b50 [kvm]
        ? kvm_vm_ioctl_irq_line+0x27/0x40 [kvm]
        ? _copy_to_user+0x26/0x30
        ? kvm_vm_ioctl+0xb3e/0xd90 [kvm]
        ? set_next_entity+0x78/0xc0
        kvm_vcpu_ioctl+0x236/0x610 [kvm]
        ksys_ioctl+0x8a/0xc0
        __x64_sys_ioctl+0x1a/0x20
        do_syscall_64+0x58/0x210
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fix this by sending the APICv update request to all other vcpus and
      immediately updating the APIC for the requesting vcpu itself.
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Link: https://lkml.org/lkml/2020/5/2/167
      Fixes: 64b5bd27 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1")
      Message-Id: <1588818939-54264-1-git-send-email-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Introduce kvm_make_all_cpus_request_except() · 54163a34
      Suravee Suthikulpanit authored
      This allows making a request to all vcpus except the one
      specified in the parameter.
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Message-Id: <1588771076-73790-2-git-send-email-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
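The "all vcpus except one" pattern can be illustrated with a toy model. The struct and function below are hypothetical and deliberately not the kernel API: the real helper also has to kick running vcpus, which this sketch leaves out.

```c
#include <stddef.h>

/* Toy vcpu holding a bitmap of pending requests. */
struct toy_req_vcpu {
    unsigned long requests;
};

/* Set request bit 'req' on every vcpu except 'except'.
 * Passing NULL for 'except' targets all vcpus. */
void toy_make_all_cpus_request_except(struct toy_req_vcpu *vcpus, size_t n,
                                      unsigned int req,
                                      const struct toy_req_vcpu *except)
{
    for (size_t i = 0; i < n; i++) {
        if (&vcpus[i] == except)
            continue;           /* the caller handles this one itself */
        vcpus[i].requests |= 1UL << req;
    }
}
```

A caller that needs an update to take effect on itself right away would skip itself here and then apply the change to itself synchronously, instead of waiting to process its own request.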
    • KVM: VMX: pass correct DR6 for GD userspace exit · 45981ded
      Paolo Bonzini authored
      When KVM_EXIT_DEBUG is raised for the disabled-breakpoints case (DR7.GD),
      DR6 was incorrectly copied from the value in the VM.  Instead,
      DR6.BD should be set in order to catch this case.
      
      On AMD this does not need any special code because the processor triggers
      a #DB exception that is intercepted.  However, the testcase would fail
      without the previous patch because both DR6.BS and DR6.BD would be set.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6 · d67668e9
      Paolo Bonzini authored
      There are two issues with KVM_EXIT_DEBUG on AMD, whose root cause is the
      different handling of DR6 on intercepted #DB exceptions on Intel and AMD.
      
      On Intel, #DB exceptions transmit the DR6 value via the exit qualification
      field of the VMCS, and the exit qualification only contains the description
      of the precise event that caused a vmexit.
      
      On AMD, instead the DR6 field of the VMCB is filled in as if the #DB exception
      was to be injected into the guest.  This has two effects when guest debugging
      is in use:
      
      * the guest DR6 is clobbered
      
      * the kvm_run->debug.arch.dr6 field can accumulate more debug events, rather
      than just the last one that happened (the testcase in the next patch covers
      this issue).
      
      This patch fixes both issues by emulating, so to speak, the Intel behavior
      on AMD processors.  The important observation is that (after the previous
      patches) the VMCB value of DR6 is only ever observable from the guest if
      KVM_DEBUGREG_WONT_EXIT is set.  Therefore we can actually set vmcb->save.dr6
      to any value we want as long as KVM_DEBUGREG_WONT_EXIT is clear, which it
      will be if guest debugging is enabled.
      
      Therefore it is possible to enter the guest with an all-zero DR6,
      reconstruct the #DB payload from the DR6 we get at exit time, and let
      kvm_deliver_exception_payload move the newly set bits into vcpu->arch.dr6.
      Some extra bits may be included in the payload if KVM_DEBUGREG_WONT_EXIT
      is set, but this is harmless.
      
      This may not be the most optimized way to deal with this, but it is
      simple and, being confined within SVM code, it gets rid of the set_dr6
      callback and kvm_update_dr6.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
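The "enter with an all-zero DR6, rebuild the payload at exit" idea can be sketched as follows. This is a toy model with hypothetical names: it ignores the fixed-1 bits and the active-low RTM bit, and it merges the payload with a plain OR, which is a simplification of what the real payload delivery does.

```c
#include <stdint.h>

/* DR6 event bits: B0-B3 (bits 0-3), BD (13), BS (14), BT (15). */
#define TOY_DR6_EVENT_BITS (0xfULL | (1ULL << 13) | (1ULL << 14) | (1ULL << 15))

struct toy_dbg_vcpu {
    uint64_t arch_dr6;   /* software copy, stands in for vcpu->arch.dr6 */
    uint64_t vmcb_dr6;   /* stands in for vmcb->save.dr6 */
};

/* With guest debugging enabled, the guest never sees the VMCB value,
 * so it can be cleared before entry. */
void toy_enter_guest(struct toy_dbg_vcpu *v)
{
    v->vmcb_dr6 = 0;
}

/* On an intercepted #DB, whatever is set in the VMCB field describes
 * only the event that just happened. Return it as the payload and
 * fold it into the software copy. */
uint64_t toy_handle_db_exit(struct toy_dbg_vcpu *v)
{
    uint64_t payload = v->vmcb_dr6 & TOY_DR6_EVENT_BITS;

    v->arch_dr6 |= payload;
    return payload;
}
```

Because the VMCB field starts from zero on every entry, the payload reported for one exit does not accumulate bits from earlier debug events.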
    • KVM: SVM: keep DR6 synchronized with vcpu->arch.dr6 · 5679b803
      Paolo Bonzini authored
      kvm_x86_ops.set_dr6 is only ever called with vcpu->arch.dr6 as the
      second argument.  Ensure that the VMCB value is synchronized to
      vcpu->arch.dr6 on #DB (both "normal" and nested) and nested vmentry, so
      that the current value of DR6 is always available in vcpu->arch.dr6.
      The get_dr6 callback can just access vcpu->arch.dr6 and becomes redundant.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 07 May, 2020 6 commits
  7. 06 May, 2020 6 commits
    • Merge tag 'kvm-s390-master-5.7-3' of... · 2673cb68
      Paolo Bonzini authored
      Merge tag 'kvm-s390-master-5.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
      
      KVM: s390: Fix for running nested under z/VM
      
      There are circumstances when running nested under z/VM that would trigger a
      WARN_ON_ONCE. Remove the WARN_ON_ONCE. Long term we certainly want to make this
      code more robust and flexible, but just returning instead of WARNING makes
      the guest bootable again.
    • KVM: X86: Declare KVM_CAP_SET_GUEST_DEBUG properly · 495907ec
      Peter Xu authored
      KVM_CAP_SET_GUEST_DEBUG should be supported for x86; however, it's not declared
      as supported.  My wild guess is that userspaces like QEMU are using "#ifdef
      KVM_CAP_SET_GUEST_DEBUG" to check for the capability instead, but that could be
      wrong because the compilation host may not be the runtime host.
      
      Userspace might still want to keep the old "#ifdef", though, so as not to
      break guest debug on old kernels.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20200505154750.126300-1-peterx@redhat.com>
      [Do the same for PPC and s390. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
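The difference between a compile-time "#ifdef" and a runtime check can be sketched from the userspace side. The helper name below is hypothetical; `KVM_CHECK_EXTENSION` and `KVM_CAP_SET_GUEST_DEBUG` come from `<linux/kvm.h>`. The sketch assumes the caller has already opened `/dev/kvm` and passes the file descriptor in.

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Ask the running kernel whether it supports KVM_SET_GUEST_DEBUG.
 * Returns 1 if the capability is reported, 0 otherwise (including
 * when the ioctl itself fails, e.g. on an invalid descriptor). */
int toy_kvm_has_guest_debug(int kvm_fd)
{
#ifdef KVM_CAP_SET_GUEST_DEBUG
    /* Compile-time check: the build headers know about the capability.
     * That says nothing about the kernel this binary ends up running on,
     * so the runtime query below is still needed. */
    int ret = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SET_GUEST_DEBUG);

    return ret > 0;
#else
    (void)kvm_fd;
    return 0;
#endif
}
```

A userspace program that must keep working on kernels which do not declare the capability could treat a 0 result here as "unknown" and fall back to its previous "#ifdef"-only behaviour.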
    • KVM: selftests: Fix build for evmcs.h · 8ffdaf91
      Peter Xu authored
      I got this error when building kvm selftests:
      
      /usr/bin/ld: /home/xz/git/linux/tools/testing/selftests/kvm/libkvm.a(vmx.o):/home/xz/git/linux/tools/testing/selftests/kvm/include/evmcs.h:222: multiple definition of `current_evmcs'; /tmp/cco1G48P.o:/home/xz/git/linux/tools/testing/selftests/kvm/include/evmcs.h:222: first defined here
      /usr/bin/ld: /home/xz/git/linux/tools/testing/selftests/kvm/libkvm.a(vmx.o):/home/xz/git/linux/tools/testing/selftests/kvm/include/evmcs.h:223: multiple definition of `current_vp_assist'; /tmp/cco1G48P.o:/home/xz/git/linux/tools/testing/selftests/kvm/include/evmcs.h:223: first defined here
      
      I think it's because evmcs.h is included both in a test file and a lib file, so
      these variables end up with multiple definitions when linking.  After all, it's
      not a good habit to define variables in header files.
      
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20200504220607.99627-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
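One common way to avoid this class of link error is to keep only an `extern` declaration in the header and put the single definition in one source file. The sketch below squeezes the three pieces into one listing for brevity; the `toy_` names are hypothetical and are not the selftest's actual symbols.

```c
#include <stddef.h>

/* --- shared header (hypothetical toy_evmcs.h) --- */
struct toy_evmcs {
    int revision;
};
/* Declaration only: every file including the header sees the name,
 * but no storage is created here. */
extern struct toy_evmcs *toy_current_evmcs;

/* --- exactly one .c file, e.g. the library --- */
/* The single definition that the linker resolves all users against. */
struct toy_evmcs *toy_current_evmcs;

/* --- any other .c file that includes the header --- */
int toy_evmcs_loaded(void)
{
    return toy_current_evmcs != NULL;
}
```

With this split, including the header from both a test and a library object no longer produces two definitions of the same symbol.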
    • kvm: x86: Use KVM CPU capabilities to determine CR4 reserved bits · 139f7425
      Paolo Bonzini authored
      Using CPUID data can be useful for the processor compatibility
      check, but that's it.  Using it to compute guest-reserved bits
      can have both false positives (such as LA57 and UMIP which we
      are already handling) and false negatives: in particular, with
      this patch we no longer allow a KVM guest to set CR4.PKE
      when CR4.PKE is clear on the host.
      
      Fixes: b9dd21e1 ("KVM: x86: simplify handling of PKRU")
      Reported-by: Jim Mattson <jmattson@google.com>
      Tested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
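Deriving guest-reserved CR4 bits from what the hypervisor can actually virtualize, rather than from raw CPUID, might look like the toy below. The capability struct and function names are hypothetical; only the CR4 bit positions (UMIP = 11, LA57 = 12, PKE = 22) are architectural.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_CR4_UMIP (1ULL << 11)
#define TOY_CR4_LA57 (1ULL << 12)
#define TOY_CR4_PKE  (1ULL << 22)

/* Hypothetical stand-in for "KVM CPU capabilities": what the hypervisor
 * can offer to guests, which may differ from what CPUID reports. */
struct toy_kvm_caps {
    bool umip;   /* may be true even without hardware support (emulated) */
    bool la57;
    bool pku;    /* false when the host itself runs with CR4.PKE clear */
};

/* Every feature the hypervisor cannot offer makes its CR4 bit reserved. */
uint64_t toy_cr4_reserved_bits(const struct toy_kvm_caps *caps)
{
    uint64_t reserved = 0;

    if (!caps->umip)
        reserved |= TOY_CR4_UMIP;
    if (!caps->la57)
        reserved |= TOY_CR4_LA57;
    if (!caps->pku)
        reserved |= TOY_CR4_PKE;
    return reserved;
}

/* A guest CR4 value is acceptable only if it sets no reserved bit. */
bool toy_cr4_is_valid(uint64_t cr4, const struct toy_kvm_caps *caps)
{
    return (cr4 & toy_cr4_reserved_bits(caps)) == 0;
}
```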
    • KVM: VMX: Explicitly clear RFLAGS.CF and RFLAGS.ZF in VM-Exit RSB path · c7cb2d65
      Sean Christopherson authored
      Clear CF and ZF in the VM-Exit path after doing __FILL_RETURN_BUFFER so
      that KVM doesn't interpret clobbered RFLAGS as a VM-Fail.  Filling the
      RSB has always clobbered RFLAGS; its current incarnation just happens
      to clear CF and ZF in the process.  Relying on the macro to clear CF and
      ZF is extremely fragile, e.g. commit 089dd8e5 ("x86/speculation:
      Change FILL_RETURN_BUFFER to work with objtool") tweaks the loop such
      that the ZF flag is always set.
      Reported-by: Qian Cai <cai@lca.pw>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: f2fde6a5 ("KVM: VMX: Move RSB stuffing to before the first RET after VM-Exit")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200506035355.2242-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • docs/virt/kvm: Document configuring and running nested guests · 27abe577
      Kashyap Chamarthy authored
      This is a rewrite of this[1] Wiki page with further enhancements.  The
      doc also includes a section on debugging problems in nested
      environments, among other improvements.
      
      [1] https://www.linux-kvm.org/page/Nested_Guests
      
      Signed-off-by: Kashyap Chamarthy <kchamart@redhat.com>
      Message-Id: <20200505112839.30534-1-kchamart@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 05 May, 2020 1 commit
  9. 04 May, 2020 6 commits
  10. 01 May, 2020 1 commit
    • KVM: arm64: Fix 32bit PC wrap-around · 0225fd5e
      Marc Zyngier authored
      In the unlikely event that a 32bit vcpu traps into the hypervisor
      on an instruction that is located right at the end of the 32bit
      range, the emulation of that instruction is going to increment
      PC past the 32bit range. This isn't great, as userspace can then
      observe this value and get a bit confused.
      
      Conversely, userspace can do things like (in the context of a 64bit
      guest that is capable of 32bit EL0) setting PSTATE to AArch64-EL0,
      setting PC to a 64bit value, changing PSTATE to AArch32-USR, and observing
      that PC hasn't been truncated. More confusion.
      
      Fix both by:
      - truncating PC increments for 32bit guests
      - sanitizing all 32bit regs every time a core reg is changed by
        userspace and PSTATE indicates a 32bit mode.
      
      Cc: stable@vger.kernel.org
      Acked-by: Will Deacon <will@kernel.org>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
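The first of the two fixes, truncating PC increments for 32bit guests, can be illustrated with a small model. The function name and parameters are hypothetical; the sketch only shows the wrap-around arithmetic, not the kernel's instruction-skipping code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Advance PC past an emulated instruction.
 * For a 32bit guest the instruction is 2 or 4 bytes wide and the result
 * is truncated so that it wraps inside the 32bit range. */
uint64_t toy_skip_instr(uint64_t pc, bool is_32bit, bool is_wide_instr)
{
    if (is_32bit) {
        pc += is_wide_instr ? 4 : 2;
        pc = (uint32_t)pc;      /* never leave the 32bit address space */
    } else {
        pc += 4;                /* 64bit instructions are always 4 bytes */
    }
    return pc;
}
```

For example, a 32bit guest trapping on a 4-byte instruction at 0xfffffffc ends up with PC = 0 in this model, instead of the out-of-range 0x100000000.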
  11. 30 Apr, 2020 3 commits
  12. 23 Apr, 2020 4 commits
    • Marc Zyngier's avatar
    • Marc Zyngier's avatar
    • KVM: arm64: vgic-its: Fix memory leak on the error path of vgic_add_lpi() · 57bdb436
      Zenghui Yu authored
      If we're going to fail out of vgic_add_lpi(), let's make sure the
      allocated vgic_irq memory is also freed, even though both failure
      cases seem unlikely.
      Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20200414030349.625-3-yuzenghui@huawei.com
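The shape of such a fix, freeing the allocation before bailing out, is sketched below in userspace C. All names are hypothetical and `calloc`/`free` stand in for the kernel allocator; the `setup` callback represents whatever later step can fail.

```c
#include <stdlib.h>

struct toy_irq {
    int intid;
};

/* Two stand-in setup steps: one that succeeds and one that fails. */
int toy_setup_ok(struct toy_irq *irq)   { (void)irq; return 0; }
int toy_setup_fail(struct toy_irq *irq) { (void)irq; return -1; }

/* Allocate an irq object and run a setup step on it.
 * Returns NULL on any failure, without leaking the allocation. */
struct toy_irq *toy_add_lpi(int intid, int (*setup)(struct toy_irq *))
{
    struct toy_irq *irq = calloc(1, sizeof(*irq));

    if (!irq)
        return NULL;
    irq->intid = intid;

    if (setup(irq) != 0) {
        free(irq);      /* error path: release what was allocated above */
        return NULL;
    }
    return irq;
}
```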
    • KVM: arm64: vgic-v3: Retire all pending LPIs on vcpu destroy · 969ce8b5
      Zenghui Yu authored
      It's likely that the vcpu fails to handle all virtual interrupts if
      userspace decides to destroy it, leaving the pending ones in the
      ap_list. If the unhandled one is an LPI, its vgic_irq structure will
      eventually be leaked because of an extra refcount increment in
      vgic_queue_irq_unlock().
      
      This was detected by kmemleak on almost every guest destroy, the
      backtrace is as follows:
      
      unreferenced object 0xffff80725aed5500 (size 128):
      comm "CPU 5/KVM", pid 40711, jiffies 4298024754 (age 166366.512s)
      hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 08 01 a9 73 6d 80 ff ff ...........sm...
      c8 61 ee a9 00 20 ff ff 28 1e 55 81 6c 80 ff ff .a... ..(.U.l...
      backtrace:
      [<000000004bcaa122>] kmem_cache_alloc_trace+0x2dc/0x418
      [<0000000069c7dabb>] vgic_add_lpi+0x88/0x418
      [<00000000bfefd5c5>] vgic_its_cmd_handle_mapi+0x4dc/0x588
      [<00000000cf993975>] vgic_its_process_commands.part.5+0x484/0x1198
      [<000000004bd3f8e3>] vgic_its_process_commands+0x50/0x80
      [<00000000b9a65b2b>] vgic_mmio_write_its_cwriter+0xac/0x108
      [<0000000009641ebb>] dispatch_mmio_write+0xd0/0x188
      [<000000008f79d288>] __kvm_io_bus_write+0x134/0x240
      [<00000000882f39ac>] kvm_io_bus_write+0xe0/0x150
      [<0000000078197602>] io_mem_abort+0x484/0x7b8
      [<0000000060954e3c>] kvm_handle_guest_abort+0x4cc/0xa58
      [<00000000e0d0cd65>] handle_exit+0x24c/0x770
      [<00000000b44a7fad>] kvm_arch_vcpu_ioctl_run+0x460/0x1988
      [<0000000025fb897c>] kvm_vcpu_ioctl+0x4f8/0xee0
      [<000000003271e317>] do_vfs_ioctl+0x160/0xcd8
      [<00000000e7f39607>] ksys_ioctl+0x98/0xd8
      
      Fix it by retiring all pending LPIs in the ap_list on the destroy path.
      
      p.s. I can also reproduce it on a normal guest shutdown. It is because
      userspace still sends LPIs to the vcpu (through the KVM_SIGNAL_MSI ioctl)
      while the guest is being shut down and is unable to handle them. A little
      strange though, and I haven't dug further...
      Reviewed-by: James Morse <james.morse@arm.com>
      Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
      [maz: moved the distributor deallocation down to avoid an UAF splat]
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20200414030349.625-2-yuzenghui@huawei.com