1. 24 Sep, 2020 1 commit
    • Maxim Levitsky's avatar
      KVM: x86: fix MSR_IA32_TSC read for nested migration · ee6fa053
      Maxim Levitsky authored
      MSR reads/writes should always access the L1 state, since the (nested)
      hypervisor should intercept all the msrs it wants to adjust, and these
      that it doesn't should be read by the guest as if the host had read it.
      
      However IA32_TSC is an exception. Even when not intercepted, guest still
      reads the value + TSC offset.
      The write however does not take any TSC offset into account.
      
      This is documented in Intel's SDM and seems also to happen on AMD as well.
      
      This creates a problem when userspace wants to read the IA32_TSC value and then
      write it. (e.g for migration)
      
      In this case it reads L2 value but write is interpreted as an L1 value.
      To fix this make the userspace initiated reads of IA32_TSC return L1 value
      as well.
      
      Huge thanks to Dave Gilbert for helping me understand this very confusing
      semantic of MSR writes.
      Signed-off-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20200921103805.9102-2-mlevitsk@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ee6fa053
  2. 23 Sep, 2020 2 commits
    • Yang Weijiang's avatar
      selftests: kvm: Fix assert failure in single-step test · 18391e5e
      Yang Weijiang authored
      This is a follow-up patch to fix an issue left in commit:
      98b0bf02
      selftests: kvm: Use a shorter encoding to clear RAX
      
      With the change in the commit, we also need to modify "xor" instruction
      length from 3 to 2 in array ss_size accordingly to pass below check:
      
      for (i = 0; i < (sizeof(ss_size) / sizeof(ss_size[0])); i++) {
              target_rip += ss_size[i];
              CLEAR_DEBUG();
              debug.control = KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP;
              debug.arch.debugreg[7] = 0x00000400;
              APPLY_DEBUG();
              vcpu_run(vm, VCPU_ID);
              TEST_ASSERT(run->exit_reason == KVM_EXIT_DEBUG &&
                          run->debug.arch.exception == DB_VECTOR &&
                          run->debug.arch.pc == target_rip &&
                          run->debug.arch.dr6 == target_dr6,
                          "SINGLE_STEP[%d]: exit %d exception %d rip 0x%llx "
                          "(should be 0x%llx) dr6 0x%llx (should be 0x%llx)",
                          i, run->exit_reason, run->debug.arch.exception,
                          run->debug.arch.pc, target_rip, run->debug.arch.dr6,
                          target_dr6);
      }
      Reported-by: default avatarkernel test robot <rong.a.chen@intel.com>
      Signed-off-by: default avatarYang Weijiang <weijiang.yang@intel.com>
      Message-Id: <20200826015524.13251-1-weijiang.yang@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      18391e5e
    • Mohammed Gamal's avatar
      KVM: x86: VMX: Make smaller physical guest address space support user-configurable · b96e6506
      Mohammed Gamal authored
      This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
      Since smaller physical address spaces are only supported on VMX, the
      parameter is only exposed in the kvm_intel module.
      
      For now disable support by default, and let the user decide if they want
      to enable it.
      
      Modifications to VMX page fault and EPT violation handling will depend
      on whether that parameter is enabled.
      Signed-off-by: default avatarMohammed Gamal <mgamal@redhat.com>
      Message-Id: <20200903141122.72908-1-mgamal@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b96e6506
  3. 20 Sep, 2020 3 commits
  4. 18 Sep, 2020 2 commits
  5. 14 Sep, 2020 1 commit
  6. 12 Sep, 2020 8 commits
  7. 11 Sep, 2020 9 commits
    • Vitaly Kuznetsov's avatar
      KVM: x86: always allow writing '0' to MSR_KVM_ASYNC_PF_EN · d831de17
      Vitaly Kuznetsov authored
      Even without in-kernel LAPIC we should allow writing '0' to
      MSR_KVM_ASYNC_PF_EN as we're not enabling the mechanism. In
      particular, QEMU with 'kernel-irqchip=off' fails to start
      a guest with
      
      qemu-system-x86_64: error: failed to set MSR 0x4b564d02 to 0x0
      
      Fixes: 9d3c447c ("KVM: X86: Fix async pf caused null-ptr-deref")
      Reported-by: default avatarDr. David Alan Gilbert <dgilbert@redhat.com>
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200911093147.484565-1-vkuznets@redhat.com>
      [Actually commit the version proposed by Sean Christopherson. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d831de17
    • David Rientjes's avatar
      KVM: SVM: Periodically schedule when unregistering regions on destroy · 7be74942
      David Rientjes authored
      There may be many encrypted regions that need to be unregistered when a
      SEV VM is destroyed.  This can lead to soft lockups.  For example, on a
      host running 4.15:
      
      watchdog: BUG: soft lockup - CPU#206 stuck for 11s! [t_virtual_machi:194348]
      CPU: 206 PID: 194348 Comm: t_virtual_machi
      RIP: 0010:free_unref_page_list+0x105/0x170
      ...
      Call Trace:
       [<0>] release_pages+0x159/0x3d0
       [<0>] sev_unpin_memory+0x2c/0x50 [kvm_amd]
       [<0>] __unregister_enc_region_locked+0x2f/0x70 [kvm_amd]
       [<0>] svm_vm_destroy+0xa9/0x200 [kvm_amd]
       [<0>] kvm_arch_destroy_vm+0x47/0x200
       [<0>] kvm_put_kvm+0x1a8/0x2f0
       [<0>] kvm_vm_release+0x25/0x30
       [<0>] do_exit+0x335/0xc10
       [<0>] do_group_exit+0x3f/0xa0
       [<0>] get_signal+0x1bc/0x670
       [<0>] do_signal+0x31/0x130
      
      Although the CLFLUSH is no longer issued on every encrypted region to be
      unregistered, there are no other changes that can prevent soft lockups for
      very large SEV VMs in the latest kernel.
      
      Periodically schedule if necessary.  This still holds kvm->lock across the
      resched, but since this only happens when the VM is destroyed this is
      assumed to be acceptable.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Message-Id: <alpine.DEB.2.23.453.2008251255240.2987727@chino.kir.corp.google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7be74942
    • Huacai Chen's avatar
      KVM: MIPS: Change the definition of kvm type · 15e9e35c
      Huacai Chen authored
      MIPS defines two kvm types:
      
       #define KVM_VM_MIPS_TE          0
       #define KVM_VM_MIPS_VZ          1
      
      In Documentation/virt/kvm/api.rst it is said that "You probably want to
      use 0 as machine type", which implies that type 0 be the "automatic" or
      "default" type. And, in user-space libvirt use the null-machine (with
      type 0) to detect the kvm capability, which returns "KVM not supported"
      on a VZ platform.
      
      I try to fix it in QEMU but it is ugly:
      https://lists.nongnu.org/archive/html/qemu-devel/2020-08/msg05629.html
      
      And Thomas Huth suggests me to change the definition of kvm type:
      https://lists.nongnu.org/archive/html/qemu-devel/2020-09/msg03281.html
      
      So I define like this:
      
       #define KVM_VM_MIPS_AUTO        0
       #define KVM_VM_MIPS_VZ          1
       #define KVM_VM_MIPS_TE          2
      
      Since VZ and TE cannot co-exists, using type 0 on a TE platform will
      still return success (so old user-space tools have no problems on new
      kernels); the advantage is that using type 0 on a VZ platform will not
      return failure. So, the only problem is "new user-space tools use type
      2 on old kernels", but if we treat this as a kernel bug, we can backport
      this patch to old stable kernels.
      Signed-off-by: default avatarHuacai Chen <chenhc@lemote.com>
      Message-Id: <1599734031-28746-1-git-send-email-chenhc@lemote.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      15e9e35c
    • Lai Jiangshan's avatar
      kvm x86/mmu: use KVM_REQ_MMU_SYNC to sync when needed · f6f6195b
      Lai Jiangshan authored
      When kvm_mmu_get_page() gets a page with unsynced children, the spt
      pagetable is unsynchronized with the guest pagetable. But the
      guest might not issue a "flush" operation on it when the pagetable
      entry is changed from zero or other cases. The hypervisor has the
      responsibility to synchronize the pagetables.
      
      KVM behaved as above for many years, But commit 8c8560b8
      ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes")
      inadvertently included a line of code to change it without giving any
      reason in the changelog. It is clear that the commit's intention was to
      change KVM_REQ_TLB_FLUSH -> KVM_REQ_TLB_FLUSH_CURRENT, so we don't
      needlessly flush other contexts; however, one of the hunks changed
      a nearby KVM_REQ_MMU_SYNC instead.  This patch changes it back.
      
      Link: https://lore.kernel.org/lkml/20200320212833.3507-26-sean.j.christopherson@intel.com/
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarLai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20200902135421.31158-1-jiangshanlai@gmail.com>
      fixes: 8c8560b8 ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f6f6195b
    • Chenyi Qiang's avatar
      KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL control · c6b177a3
      Chenyi Qiang authored
      A minor fix for the update of VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL field
      in exit_ctls_high.
      
      Fixes: 03a8871a ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL
      VM-{Entry,Exit} control")
      Signed-off-by: default avatarChenyi Qiang <chenyi.qiang@intel.com>
      Reviewed-by: default avatarXiaoyao Li <xiaoyao.li@intel.com>
      Message-Id: <20200828085622.8365-5-chenyi.qiang@intel.com>
      Reviewed-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      c6b177a3
    • Rustam Kovhaev's avatar
      KVM: fix memory leak in kvm_io_bus_unregister_dev() · f6588660
      Rustam Kovhaev authored
      when kmalloc() fails in kvm_io_bus_unregister_dev(), before removing
      the bus, we should iterate over all other devices linked to it and call
      kvm_iodevice_destructor() for them
      
      Fixes: 90db1043 ("KVM: kvm_io_bus_unregister_dev() should never fail")
      Cc: stable@vger.kernel.org
      Reported-and-tested-by: syzbot+f196caa45793d6374707@syzkaller.appspotmail.com
      Link: https://syzkaller.appspot.com/bug?extid=f196caa45793d6374707Signed-off-by: default avatarRustam Kovhaev <rkovhaev@gmail.com>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200907185535.233114-1-rkovhaev@gmail.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f6588660
    • Haiwei Li's avatar
      KVM: Check the allocation of pv cpu mask · 0f990222
      Haiwei Li authored
      check the allocation of per-cpu __pv_cpu_mask. Initialize ops only when
      successful.
      Signed-off-by: default avatarHaiwei Li <lihaiwei@tencent.com>
      Message-Id: <d59f05df-e6d3-3d31-a036-cc25a2b2f33f@gmail.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      0f990222
    • Peter Shier's avatar
      KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected · 43fea4e4
      Peter Shier authored
      When L2 uses PAE, L0 intercepts of L2 writes to CR0/CR3/CR4 call
      load_pdptrs to read the possibly updated PDPTEs from the guest
      physical address referenced by CR3.  It loads them into
      vcpu->arch.walk_mmu->pdptrs and sets VCPU_EXREG_PDPTR in
      vcpu->arch.regs_dirty.
      
      At the subsequent assumed reentry into L2, the mmu will call
      vmx_load_mmu_pgd which calls ept_load_pdptrs. ept_load_pdptrs sees
      VCPU_EXREG_PDPTR set in vcpu->arch.regs_dirty and loads
      VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. This all works
      if the L2 CRn write intercept always resumes L2.
      
      The resume path calls vmx_check_nested_events which checks for
      exceptions, MTF, and expired VMX preemption timers. If
      vmx_check_nested_events finds any of these conditions pending it will
      reflect the corresponding exit into L1. Live migration at this point
      would also cause a missed immediate reentry into L2.
      
      After L1 exits, vmx_vcpu_run calls vmx_register_cache_reset which
      clears VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty.  When L2 next
      resumes, ept_load_pdptrs finds VCPU_EXREG_PDPTR clear in
      vcpu->arch.regs_dirty and does not load VMCS02.GUEST_PDPTRn from
      vcpu->arch.walk_mmu->pdptrs[]. prepare_vmcs02 will then load
      VMCS02.GUEST_PDPTRn from vmcs12->pdptr0/1/2/3 which contain the stale
      values stored at last L2 exit. A repro of this bug showed L2 entering
      triple fault immediately due to the bad VMCS02.GUEST_PDPTRn values.
      
      When L2 is in PAE paging mode add a call to ept_load_pdptrs before
      leaving L2. This will update VMCS02.GUEST_PDPTRn if they are dirty in
      vcpu->arch.walk_mmu->pdptrs[].
      
      Tested:
      kvm-unit-tests with new directed test: vmx_mtf_pdpte_test.
      Verified that test fails without the fix.
      
      Also ran Google internal VMM with an Ubuntu 16.04 4.4.0-83 guest running a
      custom hypervisor with a 32-bit Windows XP L2 guest using PAE. Prior to fix
      would repro readily. Ran 14 simultaneous L2s for 140 iterations with no
      failures.
      Signed-off-by: default avatarPeter Shier <pshier@google.com>
      Reviewed-by: default avatarJim Mattson <jmattson@google.com>
      Message-Id: <20200820230545.2411347-1-pshier@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      43fea4e4
    • Paolo Bonzini's avatar
      Merge tag 'kvmarm-fixes-5.9-1' of... · 1b67fd08
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/arm64 fixes for Linux 5.9, take #1
      
      - Multiple stolen time fixes, with a new capability to match x86
      - Fix for hugetlbfs mappings when PUD and PMD are the same level
      - Fix for hugetlbfs mappings when PTE mappings are enforced
        (dirty logging, for example)
      - Fix tracing output of 64bit values
      1b67fd08
  8. 04 Sep, 2020 3 commits
  9. 21 Aug, 2020 8 commits
  10. 17 Aug, 2020 3 commits
    • Jim Mattson's avatar
      kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode · cb957adb
      Jim Mattson authored
      See the SDM, volume 3, section 4.4.1:
      
      If PAE paging would be in use following an execution of MOV to CR0 or
      MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
      CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
      the PDPTEs are loaded from the address in CR3.
      
      Fixes: b9baba86 ("KVM, pkeys: expose CPUID/CR4 to guest")
      Cc: Huaitong Han <huaitong.han@intel.com>
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Reviewed-by: default avatarPeter Shier <pshier@google.com>
      Reviewed-by: default avatarOliver Upton <oupton@google.com>
      Message-Id: <20200817181655.3716509-1-jmattson@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      cb957adb
    • Jim Mattson's avatar
      kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode · 427890af
      Jim Mattson authored
      See the SDM, volume 3, section 4.4.1:
      
      If PAE paging would be in use following an execution of MOV to CR0 or
      MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
      CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
      the PDPTEs are loaded from the address in CR3.
      
      Fixes: 0be0226f ("KVM: MMU: fix SMAP virtualization")
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Reviewed-by: default avatarPeter Shier <pshier@google.com>
      Reviewed-by: default avatarOliver Upton <oupton@google.com>
      Message-Id: <20200817181655.3716509-2-jmattson@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      427890af
    • Paolo Bonzini's avatar
      KVM: x86: fix access code passed to gva_to_gpa · 19cf4b7e
      Paolo Bonzini authored
      The PK bit of the error code is computed dynamically in permission_fault
      and therefore need not be passed to gva_to_gpa: only the access bits
      (fetch, user, write) need to be passed down.
      
      Not doing so causes a splat in the pku test:
      
         WARNING: CPU: 25 PID: 5465 at arch/x86/kvm/mmu.h:197 paging64_walk_addr_generic+0x594/0x750 [kvm]
         Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0014.D62.2001092233 01/09/2020
         RIP: 0010:paging64_walk_addr_generic+0x594/0x750 [kvm]
         Code: <0f> 0b e9 db fe ff ff 44 8b 43 04 4c 89 6c 24 30 8b 13 41 39 d0 89
         RSP: 0018:ff53778fc623fb60 EFLAGS: 00010202
         RAX: 0000000000000001 RBX: ff53778fc623fbf0 RCX: 0000000000000007
         RDX: 0000000000000001 RSI: 0000000000000002 RDI: ff4501efba818000
         RBP: 0000000000000020 R08: 0000000000000005 R09: 00000000004000e7
         R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007
         R13: ff4501efba818388 R14: 10000000004000e7 R15: 0000000000000000
         FS:  00007f2dcf31a700(0000) GS:ff4501f1c8040000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 0000000000000000 CR3: 0000001dea475005 CR4: 0000000000763ee0
         DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         PKRU: 55555554
         Call Trace:
          paging64_gva_to_gpa+0x3f/0xb0 [kvm]
          kvm_fixup_and_inject_pf_error+0x48/0xa0 [kvm]
          handle_exception_nmi+0x4fc/0x5b0 [kvm_intel]
          kvm_arch_vcpu_ioctl_run+0x911/0x1c10 [kvm]
          kvm_vcpu_ioctl+0x23e/0x5d0 [kvm]
          ksys_ioctl+0x92/0xb0
          __x64_sys_ioctl+0x16/0x20
          do_syscall_64+0x3e/0xb0
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
         ---[ end trace d17eb998aee991da ]---
      Reported-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Fixes: 89786147 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection")
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      19cf4b7e