1. 24 Sep, 2019 40 commits
    • KVM: x86/mmu: Revert "Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints"" · dd6223c7
      Sean Christopherson authored
      Now that the fast invalidate mechanism has been reintroduced, restore
      tracing of the generation number in shadow page tracepoints.
      
      This reverts commit b59c4830.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Use fast invalidate mechanism to zap MMIO sptes · 92f58b5c
      Sean Christopherson authored
      Use the fast invalidate mechanism to zap MMIO sptes on an MMIO generation
      wrap.  The fast invalidate flow was reintroduced to fix a livelock bug
      in kvm_mmu_zap_all() that can occur if kvm_mmu_zap_all() is invoked when
      the guest has live vCPUs.  I.e. using kvm_mmu_zap_all() to handle the
      MMIO generation wrap is theoretically susceptible to the livelock bug.
      
      This effectively reverts commit 4771450c ("Revert "KVM: MMU: drop
      kvm_mmu_zap_mmio_sptes""), i.e. restores the behavior of commit
      a8eca9dc ("KVM: MMU: drop kvm_mmu_zap_mmio_sptes").
      
      Note, this actually fixes commit 571c5af0 ("KVM: x86/mmu:
      Voluntarily reschedule as needed when zapping MMIO sptes"), but there
      is no need to incrementally revert back to using fast invalidate, e.g.
      doing so doesn't provide any bisection or stability benefits.
      
      Fixes: 571c5af0 ("KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Treat invalid shadow pages as obsolete · fac026da
      Sean Christopherson authored
      Treat invalid shadow pages as obsolete to fix a bug where an obsolete
      and invalid page with a non-zero root count could become non-obsolete
      due to mmu_valid_gen wrapping.  The bug is largely theoretical with the
      current code base, as an unsigned long will effectively never wrap on
      64-bit KVM, and userspace would have to deliberately stall a vCPU in
      order to keep an obsolete invalid page on the active list while
      simultaneously modifying memslots billions of times to trigger a wrap.
      
      The obvious alternative is to use a 64-bit value for mmu_valid_gen,
      but it's actually desirable to go in the opposite direction, i.e. using
      a smaller 8-bit value to reduce KVM's memory footprint by 8 bytes per
      shadow page, and relying on proper treatment of invalid pages instead of
      preventing the generation from wrapping.
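
The obsolescence rule described above can be sketched in plain C. This is a minimal userspace illustration, not the kernel code: the struct layouts and field names are simplified stand-ins, and the 8-bit generation is the smaller type the message argues for.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical stand-ins for KVM's shadow page and VM state. */
struct shadow_page {
    uint8_t mmu_valid_gen;  /* generation the page was created in */
    bool invalid;           /* set once the page has been zapped */
};

struct vm {
    uint8_t mmu_valid_gen;  /* current generation, bumped on each fast invalidate */
};

/*
 * A page is obsolete if it is invalid OR its generation is stale.
 * Checking 'invalid' first means a wrapped generation counter can
 * never make a zapped page look current again.
 */
static bool is_obsolete_sp(const struct vm *vm, const struct shadow_page *sp)
{
    return sp->invalid || sp->mmu_valid_gen != vm->mmu_valid_gen;
}
```

With only the generation comparison, an invalid page created in generation 0 would appear valid again after 256 increments of an 8-bit counter; the extra `invalid` check closes that hole.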
      
      Note, "Fixes" points at a commit that was at one point reverted, but has
      since been restored.
      
      Fixes: 5304b8d3 ("KVM: MMU: fast invalidate all pages")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: Tune lapic_timer_advance_ns smoothly · d0f5a86a
      Wanpeng Li authored
      Filter out both drastic and random fluctuations, and remove
      timer_advance_adjust_done altogether so that the adjustment is
      continuous.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit · bf653b78
      Tao Xu authored
      According to the latest Intel 64 and IA-32 Architectures Software
      Developer's Manual, the UMWAIT and TPAUSE instructions cause a VM exit if
      the "RDTSC exiting" and "enable user wait and pause" VM-execution
      controls are both 1.
      
      Because KVM never enables RDTSC exiting, the VM exit for UMWAIT and TPAUSE
      should never happen. EXIT_REASON_XSAVES and EXIT_REASON_XRSTORS are
      likewise unexpected VM exits for KVM, so introduce a common exit helper,
      handle_unexpected_vmexit(), to handle all of these unexpected VM exits.
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Tao Xu <tao3.xu@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL · 6e3ba4ab
      Tao Xu authored
      The UMWAIT and TPAUSE instructions use the 32-bit IA32_UMWAIT_CONTROL MSR
      (index E1H) to determine the maximum time, in TSC quanta, that the
      processor can reside in either C0.1 or C0.2.
      
      This patch emulates MSR IA32_UMWAIT_CONTROL in the guest and
      differentiates IA32_UMWAIT_CONTROL between host and guest. The variable
      mwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR value,
      so this patch uses it to avoid frequent rdmsr of IA32_UMWAIT_CONTROL.
      Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Tao Xu <tao3.xu@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add support for user wait instructions · e69e72fa
      Tao Xu authored
      UMONITOR, UMWAIT and TPAUSE are a set of user wait instructions.
      This patch adds support for user wait instructions in KVM. Availability
      of the user wait instructions is indicated by the presence of the CPUID
      feature flag WAITPKG CPUID.0x07.0x0:ECX[5]. User wait instructions may
      be executed at any privilege level, and use 32bit IA32_UMWAIT_CONTROL MSR
      to set the maximum time.
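
As a rough userspace illustration of the feature check described above, the sketch below tests bit 5 of ECX from CPUID leaf 0x07, subleaf 0x0. The helper name and the way ECX is passed in are assumptions made for this example; this is not the KVM implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* WAITPKG is reported in CPUID.0x07.0x0:ECX[5]. */
#define WAITPKG_ECX_BIT 5u

/* Hypothetical helper: takes the ECX output of CPUID leaf 7, subleaf 0. */
static bool cpu_has_waitpkg(uint32_t cpuid_7_0_ecx)
{
    return (cpuid_7_0_ecx >> WAITPKG_ECX_BIT) & 1u;
}
```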
      
      The behavior of user wait instructions in VMX non-root operation is
      determined first by the setting of the "enable user wait and pause"
      secondary processor-based VM-execution control bit 26.
      	If the VM-execution control is 0, UMONITOR/UMWAIT/TPAUSE cause
      an invalid-opcode exception (#UD).
      	If the VM-execution control is 1, treatment is based on the
      setting of the "RDTSC exiting" VM-execution control. Because KVM never
      enables RDTSC exiting, if the instruction causes a delay, the amount of
      time delayed is called here the physical delay. The physical delay is
      first computed by determining the virtual delay. If
      IA32_UMWAIT_CONTROL[31:2] is zero, the virtual delay is the value in
      EDX:EAX minus the value that RDTSC would return; if
      IA32_UMWAIT_CONTROL[31:2] is not zero, the virtual delay is the minimum
      of that difference and AND(IA32_UMWAIT_CONTROL,FFFFFFFCH).
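
The virtual delay rule above can be written out as a small calculation. This is a sketch under stated assumptions: the function and parameter names are invented for illustration, and the deadline in EDX:EAX is assumed to be no earlier than the current TSC value (otherwise the instruction would not cause a delay).

```c
#include <stdint.h>

/*
 * Hypothetical helper computing the virtual delay of UMWAIT/TPAUSE.
 *   deadline       - the value in EDX:EAX
 *   tsc            - the value RDTSC would return (assumed <= deadline)
 *   umwait_control - the 32-bit IA32_UMWAIT_CONTROL value
 */
static uint64_t umwait_virtual_delay(uint64_t deadline, uint64_t tsc,
                                     uint32_t umwait_control)
{
    uint64_t diff = deadline - tsc;
    /* Bits 31:2 hold the maximum time; bits 1:0 are masked off. */
    uint32_t max_time = umwait_control & 0xFFFFFFFCu;

    if (max_time == 0)
        return diff;  /* no limit configured */

    return diff < max_time ? diff : max_time;
}
```

For example, with a deadline 600 TSC ticks away and a control value of 0x101, the low bits are masked off and the delay is capped at 0x100 (256) ticks.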
      
      Because umwait and tpause can put a (physical) CPU into a power saving
      state, by default we don't expose it to kvm and enable it only when
      guest CPUID has it.
      
      Detailed information about user wait instructions can be found in the
      latest Intel 64 and IA-32 Architectures Software Developer's Manual.
      Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Tao Xu <tao3.xu@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add comments to document various emulation types · 41577ab8
      Sean Christopherson authored
      Document the intended usage of each emulation type as each exists to
      handle an edge case of one kind or another and can be easily
      misinterpreted at first glance.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Handle single-step #DB for EMULTYPE_SKIP on EPT misconfig · 1957aa63
      Sean Christopherson authored
      VMX's EPT misconfig flow to handle fast-MMIO path falls back to decoding
      the instruction to determine the instruction length when running as a
      guest (Hyper-V doesn't fill VMCS.VM_EXIT_INSTRUCTION_LEN because it's
      technically not defined for EPT misconfigs).  Rather than implement the
      slow skip in VMX's generic skip_emulated_instruction(),
      handle_ept_misconfig() directly calls kvm_emulate_instruction() with
      EMULTYPE_SKIP, which intentionally doesn't do single-step detection, and
      so handle_ept_misconfig() misses a single-step #DB.
      
      Rework the EPT misconfig fallback case to route it through
      kvm_skip_emulated_instruction() so that single-step #DBs and interrupt
      shadow updates are handled automatically.  I.e. make VMX's slow skip
      logic match SVM's and have the SVM flow not intentionally avoid the
      shadow update.
      
      Alternatively, the handle_ept_misconfig() could manually handle single-
      step detection, but that results in EMULTYPE_SKIP having split logic for
      the interrupt shadow vs. single-step #DBs, and split emulator logic is
      largely what led to this mess in the first place.
      
      Modifying SVM to mirror VMX flow isn't really an option as SVM's case
      isn't limited to a specific exit reason, i.e. handling the slow skip in
      skip_emulated_instruction() is mandatory for all intents and purposes.
      
      Drop VMX's skip_emulated_instruction() wrapper since it can now fail,
      and instead WARN if it fails unexpectedly, e.g. if exit_reason somehow
      becomes corrupted.
      
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Fixes: d391f120 ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Remove emulation_result enums, EMULATE_{DONE,FAIL,USER_EXIT} · 60fc3d02
      Sean Christopherson authored
      Deferring emulation failure handling (in some cases) to the caller of
      x86_emulate_instruction() has proven fragile, e.g. multiple instances of
      KVM not setting run->exit_reason on EMULATE_FAIL, largely due to it
      being difficult to discern what emulation types can return what result,
      and which combination of types and results are handled where.
      
      Now that x86_emulate_instruction() always handles emulation failure,
      i.e. EMULATE_FAIL is only referenced in callers, remove the
      emulation_result enums entirely.  Per KVM's existing exit handling
      conventions, return '0' and '1' for "exit to userspace" and "resume
      guest" respectively.  Doing so cleans up many callers, e.g. they can
      return kvm_emulate_instruction() directly instead of having to interpret
      its result.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Remove EMULATE_FAIL handling in handle_invalid_guest_state() · 8fff2710
      Sean Christopherson authored
      Now that EMULATE_FAIL is completely unused, remove the last remaining
      usage where KVM does something functional in response to EMULATE_FAIL.
      Leave the check in place as a WARN_ON_ONCE to provide a better paper
      trail when EMULATE_{DONE,FAIL,USER_EXIT} are completely removed.
      
      Opportunistically remove the gotos in handle_invalid_guest_state().
      With the EMULATE_FAIL handling gone there is no need to have a common
      handler for emulation failure and the gotos only complicate things,
      e.g. the signal_pending() check always returns '1', but this is far
      from obvious when glancing through the code.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move triple fault request into RM int injection · 9497e1f2
      Sean Christopherson authored
      Request triple fault in kvm_inject_realmode_interrupt() instead of
      returning EMULATE_FAIL and deferring to the caller.  All existing
      callers request triple fault and it's highly unlikely Real Mode is
      going to acquire new features.  While this consolidates a small amount
      of code, the real goal is to remove the last reference to EMULATE_FAIL.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Handle emulation failure directly in kvm_task_switch() · 1051778f
      Sean Christopherson authored
      Consolidate the reporting of emulation failure into kvm_task_switch()
      so that it can return EMULATE_USER_EXIT.  This helps pave the way for
      removing EMULATE_FAIL altogether.
      
      This also fixes a theoretical bug where task switch interception could
      suppress an EMULATE_USER_EXIT return.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Exit to userspace on emulation skip failure · 738fece4
      Sean Christopherson authored
      Kill a few birds with one stone by forcing an exit to userspace on skip
      emulation failure.  This removes a reference to EMULATE_FAIL, fixes a
      bug in handle_ept_misconfig() where it would exit to userspace without
      setting run->exit_reason, and fixes a theoretical bug in SVM's
      task_switch_interception() where it would overwrite run->exit_reason on
      a return of EMULATE_USER_EXIT.
      
      Note, this technically doesn't fully fix task_switch_interception()
      as it now incorrectly handles EMULATE_FAIL, but in practice there is no
      bug as EMULATE_FAIL will never be returned for EMULTYPE_SKIP.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move #UD injection for failed emulation into emulation code · c83fad65
      Sean Christopherson authored
      Immediately inject a #UD and return EMULATE_DONE if emulation fails when
      handling an intercepted #UD.  This helps pave the way for removing
      EMULATE_FAIL altogether.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add explicit flag for forced emulation on #UD · b4000606
      Sean Christopherson authored
      Add an explicit emulation type for forced #UD emulation and use it to
      detect that KVM should unconditionally inject a #UD instead of falling
      into its standard emulation failure handling.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move #GP injection for VMware into x86_emulate_instruction() · 42cbf068
      Sean Christopherson authored
      Immediately inject a #GP when VMware emulation fails and return
      EMULATE_DONE instead of propagating EMULATE_FAIL up the stack.  This
      helps pave the way for removing EMULATE_FAIL altogether.
      
      Rename EMULTYPE_VMWARE to EMULTYPE_VMWARE_GP to document that the x86
      emulator is called to handle VMware #GP interception, e.g. why a #GP
      is injected on emulation failure for EMULTYPE_VMWARE_GP.
      
      Drop EMULTYPE_NO_UD_ON_FAIL as a standalone type.  The "no #UD on fail"
      is used only in the VMware case and is obsoleted by having the emulator
      itself reinject #GP.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Don't attempt VMWare emulation on #GP with non-zero error code · a6c6ed1e
      Sean Christopherson authored
      The VMware backdoor hooks #GP faults on IN{S}, OUT{S}, and RDPMC, none
      of which generate a non-zero error code for their #GP.  Re-injecting #GP
      instead of attempting emulation on a non-zero error code will allow a
      future patch to move #GP injection (for emulation failure) into
      kvm_emulate_instruction() without having to plumb in the error code.
      Reviewed-and-tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Refactor kvm_vcpu_do_singlestep() to remove out param · 120c2c4f
      Sean Christopherson authored
      Return the single-step emulation result directly instead of via an out
      param.  Presumably at some point in the past kvm_vcpu_do_singlestep()
      could be called with *r==EMULATE_USER_EXIT, but that is no longer the
      case, i.e. all callers are happy to overwrite their own return variable.
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Clean up handle_emulation_failure() · 22da61c9
      Sean Christopherson authored
      When handling emulation failure, return the emulation result directly
      instead of capturing it in a local variable.  Future patches will move
      additional cases into handle_emulation_failure(), so clean up the cruft
      beforehand to avoid an ugly mix of setting a local variable and
      returning directly.
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Relocate MMIO exit stats counting · bc8a0aaf
      Sean Christopherson authored
      Move the stat.mmio_exits update into x86_emulate_instruction().  This is
      both a bug fix, e.g. the current update flows will incorrectly increment
      mmio_exits on emulation failure, and a preparatory change to set the
      stage for eliminating EMULATE_DONE and company.
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Check Host Address Space Size on vmentry of nested guests · 5845038c
      Krish Sadhukhan authored
      According to section "Checks Related to Address-Space Size" in Intel SDM
      vol 3C, the following checks are performed on vmentry of nested guests:
      
          If the logical processor is outside IA-32e mode (if IA32_EFER.LMA = 0)
          at the time of VM entry, the following must hold:
      	- The "IA-32e mode guest" VM-entry control is 0.
      	- The "host address-space size" VM-exit control is 0.
      
          If the logical processor is in IA-32e mode (if IA32_EFER.LMA = 1) at the
          time of VM entry, the "host address-space size" VM-exit control must be 1.
      
          If the "host address-space size" VM-exit control is 0, the following must
          hold:
      	- The "IA-32e mode guest" VM-entry control is 0.
      	- Bit 17 of the CR4 field (corresponding to CR4.PCIDE) is 0.
      	- Bits 63:32 in the RIP field are 0.
      
          If the "host address-space size" VM-exit control is 1, the following must
          hold:
      	- Bit 5 of the CR4 field (corresponding to CR4.PAE) is 1.
      	- The RIP field contains a canonical address.
      
          On processors that do not support Intel 64 architecture, checks are
          performed to ensure that the "IA-32e mode guest" VM-entry control and the
          "host address-space size" VM-exit control are both 0.
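
The checks quoted above can be expressed as one predicate. The following is a userspace sketch, not the nVMX code: the struct and its field names are invented for illustration, and the canonical-address test assumes 48-bit virtual addresses.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_PAE   (1ull << 5)   /* CR4.PAE, bit 5 */
#define CR4_PCIDE (1ull << 17)  /* CR4.PCIDE, bit 17 */

/* Hypothetical container for the state the checks look at. */
struct vmentry_state {
    bool efer_lma;              /* logical processor is in IA-32e mode */
    bool ia32e_mode_guest;      /* "IA-32e mode guest" VM-entry control */
    bool host_addr_space_size;  /* "host address-space size" VM-exit control */
    uint64_t host_cr4;
    uint64_t host_rip;
};

/* Canonical check, assuming 48-bit virtual addresses: bits 63:47 all equal. */
static bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;
    return top == 0 || top == 0x1FFFFull;
}

static bool host_address_space_checks_ok(const struct vmentry_state *s)
{
    if (!s->efer_lma) {
        if (s->ia32e_mode_guest || s->host_addr_space_size)
            return false;
    } else if (!s->host_addr_space_size) {
        return false;
    }

    if (!s->host_addr_space_size) {
        if (s->ia32e_mode_guest)
            return false;
        if (s->host_cr4 & CR4_PCIDE)
            return false;
        if (s->host_rip >> 32)
            return false;
    } else {
        if (!(s->host_cr4 & CR4_PAE))
            return false;
        if (!is_canonical48(s->host_rip))
            return false;
    }
    return true;
}
```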
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: hyperv_cpuid: add check for NoNonArchitecturalCoreSharing bit · e738772e
      Vitaly Kuznetsov authored
      The bit is supposed to be '1' when SMT is not supported or forcefully
      disabled and '0' otherwise.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyper-v: set NoNonArchitecturalCoreSharing CPUID bit when SMT is impossible · b2d8b167
      Vitaly Kuznetsov authored
      Hyper-V 2019 doesn't expose MD_CLEAR CPUID bit to guests when it cannot
      guarantee that two virtual processors won't end up running on sibling SMT
      threads without knowing about it. This is done as an optimization as in
      this case there is nothing the guest can do to protect itself against MDS
      and issuing additional flush requests is just pointless. On bare metal the
      topology is known, however, when Hyper-V is running nested (e.g. on top of
      KVM) it needs an additional piece of information: a confirmation that the
      exposed topology (wrt vCPU placement on different SMT threads) is
      trustworthy.
      
      NoNonArchitecturalCoreSharing (CPUID 0x40000004 EAX bit 18) is described in
      TLFS as follows: "Indicates that a virtual processor will never share a
      physical core with another virtual processor, except for virtual processors
      that are reported as sibling SMT threads." From KVM we can give such
      guarantee in two cases:
      - SMT is unsupported or forcefully disabled (just 'disabled' doesn't work
       as it can become re-enabled during the lifetime of the guest).
      - vCPUs are properly pinned so the scheduler won't put them on sibling
      SMT threads (when they're not reported as such).
      
      This patch reports the NoNonArchitecturalCoreSharing bit to userspace in the
      first case. The second case is outside of KVM's domain of responsibility
      (as vCPU pinning is actually done by someone who manages KVM's userspace -
      e.g. libvirt pinning QEMU threads).
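
The first case can be sketched as a pure function over the EAX value of CPUID leaf 0x40000004. The macro and function names below are chosen for this illustration only, and the `smt_possible` argument stands in for whatever cpu_smt_possible() would report on the host.

```c
#include <stdbool.h>
#include <stdint.h>

/* NoNonArchitecturalCoreSharing: CPUID 0x40000004, EAX bit 18. */
#define NO_NONARCH_CORESHARING (1u << 18)

/*
 * Hypothetical helper: given the EAX value built so far for leaf
 * 0x40000004, set the bit only when SMT is impossible on the host
 * (unsupported or forcefully disabled).
 */
static uint32_t hv_enlightenment_eax(uint32_t eax, bool smt_possible)
{
    if (!smt_possible)
        eax |= NO_NONARCH_CORESHARING;
    return eax;
}
```

A plain "SMT currently disabled" test would not be enough here, since SMT could be re-enabled during the lifetime of the guest.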
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • cpu/SMT: create and export cpu_smt_possible() · e1572f1d
      Vitaly Kuznetsov authored
      KVM needs to know if SMT is theoretically possible, meaning that it is
      supported and not forcefully disabled ('nosmt=force'). Create and
      export cpu_smt_possible() to answer this question.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: hyperv: Fix Direct Synthetic timers assert an interrupt w/o lapic_in_kernel · a073d7e3
      Wanpeng Li authored
      Reported by syzkaller:
      
      	kasan: GPF could be caused by NULL-ptr deref or user memory access
      	general protection fault: 0000 [#1] PREEMPT SMP KASAN
      	RIP: 0010:__apic_accept_irq+0x46/0x740 arch/x86/kvm/lapic.c:1029
      	Call Trace:
      	kvm_apic_set_irq+0xb4/0x140 arch/x86/kvm/lapic.c:558
      	stimer_notify_direct arch/x86/kvm/hyperv.c:648 [inline]
      	stimer_expiration arch/x86/kvm/hyperv.c:659 [inline]
      	kvm_hv_process_stimers+0x594/0x1650 arch/x86/kvm/hyperv.c:686
      	vcpu_enter_guest+0x2b2a/0x54b0 arch/x86/kvm/x86.c:7896
      	vcpu_run+0x393/0xd40 arch/x86/kvm/x86.c:8152
      	kvm_arch_vcpu_ioctl_run+0x636/0x900 arch/x86/kvm/x86.c:8360
      	kvm_vcpu_ioctl+0x6cf/0xaf0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2765
      
      The testcase programs HV_X64_MSR_STIMERn_CONFIG/HV_X64_MSR_STIMERn_COUNT
      while there is no lapic in the kernel. The counter values are small
      enough that kvm_hv_process_stimers() attempts to inject the
      already-expired timer interrupt into the guest through the in-kernel
      lapic, which triggers the NULL dereference. This patch fixes it by not
      advertising direct mode synthetic timers and by discarding the injection
      when the lapic is not in the kernel.
      
      syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=1752fe0a600000
      
      Reported-by: syzbot+dff25ee91f0c7d5c1695@syzkaller.appspotmail.com
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Manually flush collapsible SPTEs only when toggling flags · 319109a2
      Sean Christopherson authored
      Zapping collapsible sptes, a.k.a. 4k sptes that can be promoted into a
      large page, is only necessary when changing only the dirty logging flag
      of a memory region.  If the memslot is also being moved, then all sptes
      for the memslot are zapped when it is invalidated.  When a memslot is
      being created, it is impossible for there to be existing dirty mappings,
      e.g. KVM can have MMIO sptes, but not present, and thus dirty, sptes.
      
      Note, the comment and logic are shamelessly borrowed from MIPS's version
      of kvm_arch_commit_memory_region().
      
      Fixes: 3ea3b7fa ("kvm: mmu: lazy collapse small sptes into large sptes")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Remove duplicate guest mode handling · 52200d0d
      Peter Xu authored
      Remove the duplicated code in run_test() of dirty_log_test; after some
      reordering of functions we can now directly use the outcome of
      vm_create().
      
      Meanwhile, with the new VM_MODE_PXXV48_4K, we can also safely revert
      b442324b, which pinned the x86_64 PA width to 39 bits for
      dirty_log_test.
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Introduce VM_MODE_PXXV48_4K · 567a9f1e
      Peter Xu authored
      The name VM_MODE_P52V48_4K is explicit but unclear when used on
      x86_64 machines, because x86_64 machines have varying physical
      address widths rather than one static value.  Here are some examples:
      
        - Intel Xeon E3-1220:  36 bits
        - Intel Core i7-8650:  39 bits
        - AMD   EPYC 7251:     48 bits
      
      All of them use a 48-bit linear address width but have quite
      different physical address widths (and most older machines should
      be below 52 bits).
      
      Let's create a new guest mode called VM_MODE_PXXV48_4K for current
      x86_64 tests and make it the default, replacing the old name
      VM_MODE_P52V48_4K, because it shows more clearly that the PA width is
      not really a constant.  Meanwhile, we also stop assuming that all x86
      machines have a 52-bit PA width and instead fetch the real
      vm->pa_bits from CPUID 0x80000008 at runtime.
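
As a small illustration of the runtime lookup, CPUID leaf 0x80000008 reports the physical address width in EAX[7:0] and the linear address width in EAX[15:8]. The helpers below only decode an EAX value passed in by the caller; their names are assumptions for this sketch and are not the selftest library's API.

```c
#include <stdint.h>

/* Physical address width: CPUID 0x80000008, EAX bits 7:0. */
static unsigned int pa_bits_from_eax(uint32_t eax)
{
    return eax & 0xffu;
}

/* Linear (virtual) address width: CPUID 0x80000008, EAX bits 15:8. */
static unsigned int va_bits_from_eax(uint32_t eax)
{
    return (eax >> 8) & 0xffu;
}
```

An EAX value of 0x3027, for instance, decodes to a 39-bit physical and 48-bit linear address width, matching the Core i7-8650 entry in the list above.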
      
      This mode is currently used exclusively by x86_64 and no other arch.
      
      As a slight touch-up, move the DEBUG macro from dirty_log_test.c to
      kvm_util.h so the lib can use it too.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Create VM earlier for dirty log test · 338eb298
      Peter Xu authored
      Since we've just removed the dependency of vm type in previous patch,
      now we can create the vm much earlier.  Note that to move it earlier
      we used an approximation of number of extra pages but it should be
      fine.
      
      This prepares for the follow up patches to finally remove the
      duplication of guest mode parsings.
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Move vm type into _vm_create() internally · 12c386b2
      Peter Xu authored
      Rather than passing the vm type from the top level to the end of vm
      creation, let's simply keep that as an internal of kvm_vm struct and
      decide the type in _vm_create().  Several reasons for doing this:
      
      - The vm type is only decided by physical address width and currently
        only used in aarch64, so we've got enough information as long as
        we're passing vm_guest_mode into _vm_create(),
      
      - This removes a circular dependency between vm->type and the creation
        of vms.  That's why we currently need to parse vm_guest_mode twice
        sometimes, once in run_test() and then again in _vm_create().  The
        follow-up patches will clean that up as well so we have a single
        place to decide guest machine types and so on.
      
      Note that this patch slightly changes the behavior of aarch64
      tests: previously most vm_create() callers directly passed
      type==0 into _vm_create(), but now the type depends on
      vm_guest_mode.  However, it shouldn't affect any user because all
      vm_create() users on aarch64 use the VM_MODE_DEFAULT guest
      mode (which is VM_MODE_P40V48_4K), so in the end the type is still zero.
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: announce KVM_CAP_HYPERV_ENLIGHTENED_VMCS support only when it is available · 5a0165f6
      Vitaly Kuznetsov authored
      It was discovered that after commit 65efa61d ("selftests: kvm: provide
      common function to enable eVMCS") the hyperv_cpuid selftest fails on AMD.
      The reason is that the commit changed _vcpu_ioctl() to vcpu_ioctl() in the
      test, and the latter is not allowed to fail.
      
      Instead of fixing the test, it seems to make more sense not to announce
      KVM_CAP_HYPERV_ENLIGHTENED_VMCS support if it is definitely missing
      (on svm and in case kvm_intel.nested=0).
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5a0165f6
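      One way to picture "announce the capability only when it is
      available" is to tie the answer to the presence of the vendor hook.
      The sketch below is a self-contained model, not the x86.c code: the
      ops struct, the stub, and the helper name are hypothetical, and the
      assumption is that a NULL nested_enable_evmcs hook means eVMCS
      cannot be enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for a per-vendor ops table. */
struct vendor_ops_sketch {
	/* Assumed to be NULL when eVMCS cannot be enabled at all. */
	int (*nested_enable_evmcs)(void);
};

/* Placeholder hook for a vendor that does support eVMCS. */
static int enable_evmcs_stub(void)
{
	return 0;
}

/* Report the capability only when the vendor provides the hook. */
static bool evmcs_cap_supported(const struct vendor_ops_sketch *ops)
{
	return ops->nested_enable_evmcs != NULL;
}
```

      Under this model a vendor without the hook simply reports the
      capability as absent, so user space never tries to enable it.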
    • Vitaly Kuznetsov's avatar
      KVM: x86: svm: remove unneeded nested_enable_evmcs() hook · 956e255c
      Vitaly Kuznetsov authored
      Since commit 5158917c ("KVM: x86: nVMX: Allow nested_enable_evmcs to
      be NULL") the code in x86.c is prepared to see nested_enable_evmcs being
      NULL, and in the VMX case it actually is NULL when nesting is disabled.
      Remove the unneeded stub from the SVM code.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      956e255c
    • Vitaly Kuznetsov's avatar
      KVM/Hyper-V/VMX: Add direct tlb flush support · 6f6a657c
      Vitaly Kuznetsov authored
      Hyper-V provides a direct TLB flush function which helps the
      L1 hypervisor handle Hyper-V TLB flush requests from an
      L2 guest. Add support for this function on VMX.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6f6a657c
    • Tianyu Lan's avatar
      KVM/Hyper-V: Add new KVM capability KVM_CAP_HYPERV_DIRECT_TLBFLUSH · 344c6c80
      Tianyu Lan authored
      The Hyper-V direct TLB flush function should be enabled only
      for guests that use only Hyper-V hypercalls. A user space
      hypervisor (e.g. QEMU) can disable KVM identification in
      CPUID and expose only Hyper-V identification to satisfy
      this precondition. Add a new KVM capability,
      KVM_CAP_HYPERV_DIRECT_TLBFLUSH, for user space to enable the
      Hyper-V direct TLB flush function; it is disabled by default
      in KVM.
      Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      344c6c80
    • Tianyu Lan's avatar
      x86/Hyper-V: Fix definition of struct hv_vp_assist_page · 7a83247e
      Tianyu Lan authored
      The struct hv_vp_assist_page was defined incorrectly.
      The "vtl_control" field should be u64[3],
      "nested_enlightenments_control" should be a u64, and there are
      7 reserved bytes following "enlighten_vmentry". Fix the definition.
      Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7a83247e
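      The layout described above can be sketched as a packed struct with
      compile-time offset checks.  Only vtl_control,
      nested_enlightenments_control, enlighten_vmentry and the 7 reserved
      bytes come from the commit message; the struct name, the leading
      fields, the trailing field, and all offsets are assumptions added so
      the example is complete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: field names not quoted in the commit message are assumed. */
struct hv_vp_assist_page_sketch {
	uint32_t apic_assist;			/* assumed leading field */
	uint32_t reserved1;			/* assumed padding */
	uint64_t vtl_control[3];		/* u64[3], per the commit */
	uint64_t nested_enlightenments_control;	/* a single u64, per the commit */
	uint8_t  enlighten_vmentry;
	uint8_t  reserved2[7];			/* the 7 reserved bytes */
	uint64_t current_nested_vmcs;		/* assumed trailing field */
} __attribute__((packed));

/* Offsets that follow from the field sizes above. */
static_assert(offsetof(struct hv_vp_assist_page_sketch, vtl_control) == 8,
	      "vtl_control follows two 32-bit fields");
static_assert(offsetof(struct hv_vp_assist_page_sketch,
		       nested_enlightenments_control) == 32,
	      "three u64 entries occupy 24 bytes");
static_assert(offsetof(struct hv_vp_assist_page_sketch, enlighten_vmentry) == 40,
	      "one u64 after offset 32");
static_assert(offsetof(struct hv_vp_assist_page_sketch, current_nested_vmcs) == 48,
	      "1 byte plus 7 reserved bytes restore 8-byte alignment");
```

      The 7 reserved bytes matter because they keep the field after
      enlighten_vmentry 8-byte aligned even though the struct is packed.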
    • Jim Mattson's avatar
      kvm: x86: Add Intel PMU MSRs to msrs_to_save[] · e2ada66e
      Jim Mattson authored
      These MSRs should be enumerated by KVM_GET_MSR_INDEX_LIST, so that
      userspace knows that these MSRs may be part of the vCPU state.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Eric Hankland <ehankland@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e2ada66e
    • Linus Torvalds's avatar
      Merge tag 'mfd-next-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd · 4c07e2dd
      Linus Torvalds authored
      Pull MFD updates from Lee Jones:
       "New Drivers:
         - Add support for Merrifield Basin Cove PMIC
      
        New Device Support:
         - Add support for Intel Tiger Lake to Intel LPSS PCI
         - Add support for Intel Sky Lake to Intel LPSS PCI
         - Add support for ST-Ericsson DB8520 to DB8500 PRCMU
      
        New Functionality:
         - Add RTC and PWRC support to MT6323
      
        Fix-ups:
         - Clean-up include files; davinci_voicecodec, asic3, sm501, mt6397
         - Ignore return values from debugfs_create*(); ab3100-*, ab8500-debugfs, aat2870-core
         - Device Tree changes; rn5t618, mt6397
         - Use new I2C API; tps80031, 88pm860x-core, ab3100-core, bcm590xx,
                            da9150-core, max14577, max77693, max77843, max8907,
                            max8925-i2c, max8997, max8998, palmas, twl-core
         - Remove obsolete code; da9063, jz4740-adc
         - Simplify semantics; timberdale, htc-i2cpld
         - Add 'fall-through' tags; omap-usb-host, db8500-prcmu
         - Remove superfluous prints; ab8500-debugfs, db8500-prcmu, fsl-imx25-tsadc,
                                      intel_soc_pmic_bxtwc, qcom_rpm, sm501
         - Trivial rename/whitespace/typo fixes; mt6397-core, MAINTAINERS
         - Reorganise code structure; mt6397-*
         - Improve code consistency; intel-lpss
         - Use MODULE_SOFTDEP() helper; intel-lpss
         - Use DEFINE_RES_*() helpers; mt6397-core
      
        Bug Fixes:
         - Clean-up resources; max77620
         - Prevent input events being dropped on resume; intel-lpss-pci
         - Prevent sleeping in IRQ context; ezx-pcap"
      
      * tag 'mfd-next-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (48 commits)
        mfd: mt6323: Add MT6323 RTC and PWRC
        mfd: mt6323: Replace boilerplate resource code with DEFINE_RES_* macros
        mfd: mt6397: Add mutex include
        dt-bindings: mfd: mediatek: Add MT6323 Power Controller
        dt-bindings: mfd: mediatek: Update RTC to include MT6323
        dt-bindings: mfd: mediatek: mt6397: Change to relative paths
        mfd: db8500-prcmu: Support the higher DB8520 ARMSS
        mfd: intel-lpss: Use MODULE_SOFTDEP() instead of implicit request
        mfd: htc-i2cpld: Drop check because i2c_unregister_device() is NULL safe
        mfd: sm501: Include the GPIO driver header
        mfd: intel-lpss: Add Intel Skylake ACPI IDs
        mfd: intel-lpss: Consistently use GENMASK()
        mfd: Add support for Merrifield Basin Cove PMIC
        mfd: ezx-pcap: Replace mutex_lock with spin_lock
        mfd: asic3: Include the right header
        MAINTAINERS: altera-sysmgr: Fix typo in a filepath
        mfd: mt6397: Extract IRQ related code from core driver
        mfd: mt6397: Rename macros to something more readable
        mfd: Remove dev_err() usage after platform_get_irq()
        mfd: db8500-prcmu: Mark expected switch fall-throughs
        ...
      4c07e2dd
    • Linus Torvalds's avatar
      Merge tag 'backlight-next-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight · d0b3cfee
      Linus Torvalds authored
      Pull backlight updates from Lee Jones:
       "Core Frameworks:
         - Obtain scale type through sysfs
      
        New Functionality:
         - Provide Device Tree functionality in rave-sp-backlight
         - Calculate if scale type is (non-)linear in pwm_bl
      
        Fix-ups:
         - Simplify code in lm3630a_bl
         - Trivial rename/whitespace/typo fixes in lms283gf05
         - Remove superfluous NULL check in tosa_lcd
         - Fix power state initialisation in gpio_backlight
         - List supported file in MAINTAINERS
      
        Bug Fixes:
         - Kconfig - default to not building unless requested in
           {LED,BACKLIGHT}_CLASS_DEVICE"
      
      * tag 'backlight-next-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
        backlight: pwm_bl: Set scale type for brightness curves specified in the DT
        backlight: pwm_bl: Set scale type for CIE 1931 curves
        backlight: Expose brightness curve type through sysfs
        MAINTAINERS: Add entry for stable backlight sysfs ABI documentation
        backlight: gpio-backlight: Correct initial power state handling
        video: backlight: tosa_lcd: drop check because i2c_unregister_device() is NULL safe
        video: backlight: Drop default m for {LCD,BACKLIGHT_CLASS_DEVICE}
        backlight: lms283gf05: Fix a typo in the description passed to 'devm_gpio_request_one()'
        backlight: lm3630a: Switch to use fwnode_property_count_uXX()
        backlight: rave-sp: Leave initial state and register with correct device
      d0b3cfee
    • Linus Torvalds's avatar
      Merge tag 'pci-v5.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · 299d14d4
      Linus Torvalds authored
      Pull PCI updates from Bjorn Helgaas:
       "Enumeration:
      
         - Consolidate _HPP/_HPX stuff in pci-acpi.c and simplify it
           (Krzysztof Wilczynski)
      
         - Fix incorrect PCIe device types and remove dev->has_secondary_link
           to simplify code that deals with upstream/downstream ports (Mika
           Westerberg)
      
         - After suspend, restore Resizable BAR size bits correctly for 1MB
           BARs (Sumit Saxena)
      
         - Enable PCI_MSI_IRQ_DOMAIN support for RISC-V (Wesley Terpstra)
      
        Virtualization:
      
         - Add ACS quirks for iProc PAXB (Abhinav Ratna), Amazon Annapurna
           Labs (Ali Saidi)
      
         - Move sysfs SR-IOV functions to iov.c (Kelsey Skunberg)
      
         - Remove group write permissions from sysfs sriov_numvfs,
           sriov_drivers_autoprobe (Kelsey Skunberg)
      
        Hotplug:
      
         - Simplify pciehp indicator control (Denis Efremov)
      
        Peer-to-peer DMA:
      
         - Allow P2P DMA between root ports for whitelisted bridges (Logan
           Gunthorpe)
      
         - Whitelist some Intel host bridges for P2P DMA (Logan Gunthorpe)
      
         - DMA map P2P DMA requests that traverse host bridge (Logan
           Gunthorpe)
      
        Amazon Annapurna Labs host bridge driver:
      
         - Add DT binding and controller driver (Jonathan Chocron)
      
        Hyper-V host bridge driver:
      
         - Fix hv_pci_dev->pci_slot use-after-free (Dexuan Cui)
      
         - Fix PCI domain number collisions (Haiyang Zhang)
      
         - Use instance ID bytes 4 & 5 as PCI domain numbers (Haiyang Zhang)
      
         - Fix build errors on non-SYSFS config (Randy Dunlap)
      
        i.MX6 host bridge driver:
      
         - Limit DBI register length (Stefan Agner)
      
        Intel VMD host bridge driver:
      
         - Fix config addressing issues (Jon Derrick)
      
        Layerscape host bridge driver:
      
         - Add bar_fixed_64bit property to endpoint driver (Xiaowei Bao)
      
         - Add CONFIG_PCI_LAYERSCAPE_EP to build EP/RC drivers separately
           (Xiaowei Bao)
      
        Mediatek host bridge driver:
      
         - Add MT7629 controller support (Jianjun Wang)
      
        Mobiveil host bridge driver:
      
         - Fix CPU base address setup (Hou Zhiqiang)
      
         - Make "num-lanes" property optional (Hou Zhiqiang)
      
        Tegra host bridge driver:
      
         - Fix OF node reference leak (Nishka Dasgupta)
      
         - Disable MSI for root ports to work around design problem (Vidya
           Sagar)
      
         - Add Tegra194 DT binding and controller support (Vidya Sagar)
      
         - Add support for sideband pins and slot regulators (Vidya Sagar)
      
         - Add PIPE2UPHY support (Vidya Sagar)
      
        Misc:
      
         - Remove unused pci_block_cfg_access() et al (Kelsey Skunberg)
      
         - Unexport pci_bus_get(), etc (Kelsey Skunberg)
      
         - Hide PM, VC, link speed, ATS, ECRC, PTM constants and interfaces in
           the PCI core (Kelsey Skunberg)
      
         - Clean up sysfs DEVICE_ATTR() usage (Kelsey Skunberg)
      
         - Mark expected switch fall-through (Gustavo A. R. Silva)
      
         - Propagate errors for optional regulators and PHYs (Thierry Reding)
      
         - Fix kernel command line resource_alignment parameter issues (Logan
           Gunthorpe)"
      
      * tag 'pci-v5.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (112 commits)
        PCI: Add pci_irq_vector() and other stubs when !CONFIG_PCI
        arm64: tegra: Add PCIe slot supply information in p2972-0000 platform
        arm64: tegra: Add configuration for PCIe C5 sideband signals
        PCI: tegra: Add support to enable slot regulators
        PCI: tegra: Add support to configure sideband pins
        PCI: vmd: Fix shadow offsets to reflect spec changes
        PCI: vmd: Fix config addressing when using bus offsets
        PCI: dwc: Add validation that PCIe core is set to correct mode
        PCI: dwc: al: Add Amazon Annapurna Labs PCIe controller driver
        dt-bindings: PCI: Add Amazon's Annapurna Labs PCIe host bridge binding
        PCI: Add quirk to disable MSI-X support for Amazon's Annapurna Labs Root Port
        PCI/VPD: Prevent VPD access for Amazon's Annapurna Labs Root Port
        PCI: Add ACS quirk for Amazon Annapurna Labs root ports
        PCI: Add Amazon's Annapurna Labs vendor ID
        MAINTAINERS: Add PCI native host/endpoint controllers designated reviewer
        PCI: hv: Use bytes 4 and 5 from instance ID as the PCI domain numbers
        dt-bindings: PCI: tegra: Add PCIe slot supplies regulator entries
        dt-bindings: PCI: tegra: Add sideband pins configuration entries
        PCI: tegra: Add Tegra194 PCIe support
        PCI: Get rid of dev->has_secondary_link flag
        ...
      299d14d4