1. 11 Sep, 2019 12 commits
    • Liran Alon's avatar
      KVM: x86: Fix INIT signal handling in various CPU states · 4b9852f4
      Liran Alon authored
      Commit cd7764fe ("KVM: x86: latch INITs while in system management mode")
      changed code to latch INIT while vCPU is in SMM and process latched INIT
      when leaving SMM. It left a subtle remark in commit message that similar
      treatment should also be done while vCPU is in VMX non-root-mode.
      
      However, INIT signals should actually be latched in various vCPU states:
      (*) For both Intel and AMD, INIT signals should be latched while vCPU
      is in SMM.
      (*) For Intel, INIT should also be latched while vCPU is in VMX
      operation and later processed when vCPU leaves VMX operation by
      executing VMXOFF.
      (*) For AMD, INIT should also be latched while vCPU runs with GIF=0
      or in guest-mode with intercept defined on INIT signal.
      
      To fix this:
      1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
      can define the various CPU states in which INIT signals should be
      blocked and modify kvm_apic_accept_events() to use it.
      2) Modify vmx_check_nested_events() to check for pending INIT signal
      while vCPU in guest-mode. If so, emualte vmexit on
      EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour
      but is currently left as a TODO comment to implement in the future
      because nSVM don't yet implement svm_check_nested_events().
      
      Note: Currently KVM nVMX implementation don't support VMX wait-for-SIPI
      activity state as specified in MSR_IA32_VMX_MISC bits 6:8 exposed to
      guest (See nested_vmx_setup_ctls_msrs()).
      If and when support for this activity state will be implemented,
      kvm_check_nested_events() would need to avoid emulating vmexit on
      INIT signal in case activity-state is wait-for-SIPI. In addition,
      kvm_apic_accept_events() would need to be modified to avoid discarding
      SIPI in case VMX activity-state is wait-for-SIPI but instead delay
      SIPI processing to vmx_check_nested_events() that would clear
      pending APIC events and emulate vmexit on SIPI.
      Reviewed-by: default avatarJoao Martins <joao.m.martins@oracle.com>
      Co-developed-by: default avatarNikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: default avatarNikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: default avatarLiran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4b9852f4
    • Liran Alon's avatar
      KVM: VMX: Introduce exit reason for receiving INIT signal on guest-mode · 4a53d99d
      Liran Alon authored
      According to Intel SDM section 25.2 "Other Causes of VM Exits",
      When INIT signal is received on a CPU that is running in VMX
      non-root mode it should cause an exit with exit-reason of 3.
      (See Intel SDM Appendix C "VMX BASIC EXIT REASONS")
      
      This patch introduce the exit-reason definition.
      Reviewed-by: default avatarBhavesh Davda <bhavesh.davda@oracle.com>
      Reviewed-by: default avatarJoao Martins <joao.m.martins@oracle.com>
      Co-developed-by: default avatarNikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: default avatarNikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: default avatarLiran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4a53d99d
    • Paolo Bonzini's avatar
      Merge tag 'kvm-s390-next-5.4-1' of... · 17a81bdb
      Paolo Bonzini authored
      Merge tag 'kvm-s390-next-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
      
      * More selftests
      * Improved KVM_S390_MEM_OP ioctl input checking
      * Add kvm_valid_regs and kvm_dirty_regs invalid bit checking
      17a81bdb
    • Wanpeng Li's avatar
      KVM: VMX: Stop the preemption timer during vCPU reset · 95c06540
      Wanpeng Li authored
      The hrtimer which is used to emulate lapic timer is stopped during
      vcpu reset, preemption timer should do the same.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      95c06540
    • Wanpeng Li's avatar
      KVM: LAPIC: Micro optimize IPI latency · 2b0911d1
      Wanpeng Li authored
      This patch optimizes the virtual IPI emulation sequence:
      
      write ICR2                     write ICR2
      write ICR                      read ICR2
      read ICR            ==>        send virtual IPI
      read ICR2                      write ICR
      send virtual IPI
      
      It can reduce kvm-unit-tests/vmexit.flat IPI testing latency(from sender
      send IPI to sender receive the ACK) from 3319 cycles to 3203 cycles on
      SKylake server.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2b0911d1
    • Jiří Paleček's avatar
      kvm: Nested KVM MMUs need PAE root too · 1cfff4d9
      Jiří Paleček authored
      On AMD processors, in PAE 32bit mode, nested KVM instances don't
      work. The L0 host get a kernel OOPS, which is related to
      arch.mmu->pae_root being NULL.
      
      The reason for this is that when setting up nested KVM instance,
      arch.mmu is set to &arch.guest_mmu (while normally, it would be
      &arch.root_mmu). However, the initialization and allocation of
      pae_root only creates it in root_mmu. KVM code (ie. in
      mmu_alloc_shadow_roots) then accesses arch.mmu->pae_root, which is the
      unallocated arch.guest_mmu->pae_root.
      
      This fix just allocates (and frees) pae_root in both guest_mmu and
      root_mmu (and also lm_root if it was allocated). The allocation is
      subject to previous restrictions ie. it won't allocate anything on
      64-bit and AFAIK not on Intel.
      
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=203923
      Fixes: 14c07ad8 ("x86/kvm/mmu: introduce guest_mmu")
      Signed-off-by: default avatarJiri Palecek <jpalecek@web.de>
      Tested-by: default avatarJiri Palecek <jpalecek@web.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1cfff4d9
    • Jan Dakinevich's avatar
      KVM: x86: set ctxt->have_exception in x86_decode_insn() · c8848cee
      Jan Dakinevich authored
      x86_emulate_instruction() takes into account ctxt->have_exception flag
      during instruction decoding, but in practice this flag is never set in
      x86_decode_insn().
      
      Fixes: 6ea6e843 ("KVM: x86: inject exceptions produced by x86_decode_insn")
      Cc: stable@vger.kernel.org
      Cc: Denis Lunev <den@virtuozzo.com>
      Cc: Roman Kagan <rkagan@virtuozzo.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Signed-off-by: default avatarJan Dakinevich <jan.dakinevich@virtuozzo.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      c8848cee
    • Jan Dakinevich's avatar
      KVM: x86: always stop emulation on page fault · 8530a79c
      Jan Dakinevich authored
      inject_emulated_exception() returns true if and only if nested page
      fault happens. However, page fault can come from guest page tables
      walk, either nested or not nested. In both cases we should stop an
      attempt to read under RIP and give guest to step over its own page
      fault handler.
      
      This is also visible when an emulated instruction causes a #GP fault
      and the VMware backdoor is enabled.  To handle the VMware backdoor,
      KVM intercepts #GP faults; with only the next patch applied,
      x86_emulate_instruction() injects a #GP but returns EMULATE_FAIL
      instead of EMULATE_DONE.   EMULATE_FAIL causes handle_exception_nmi()
      (or gp_interception() for SVM) to re-inject the original #GP because it
      thinks emulation failed due to a non-VMware opcode.  This patch prevents
      the issue as x86_emulate_instruction() will return EMULATE_DONE after
      injecting the #GP.
      
      Fixes: 6ea6e843 ("KVM: x86: inject exceptions produced by x86_decode_insn")
      Cc: stable@vger.kernel.org
      Cc: Denis Lunev <den@virtuozzo.com>
      Cc: Roman Kagan <rkagan@virtuozzo.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Signed-off-by: default avatarJan Dakinevich <jan.dakinevich@virtuozzo.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      8530a79c
    • Sean Christopherson's avatar
      KVM: nVMX: trace nested VM-Enter failures detected by H/W · 380e0055
      Sean Christopherson authored
      Use the recently added tracepoint for logging nested VM-Enter failures
      instead of spamming the kernel log when hardware detects a consistency
      check failure.  Take the opportunity to print the name of the error code
      instead of dumping the raw hex number, but limit the symbol table to
      error codes that can reasonably be encountered by KVM.
      
      Add an equivalent tracepoint in nested_vmx_check_vmentry_hw(), e.g. so
      that tracing of "invalid control field" errors isn't suppressed when
      nested early checks are enabled.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      380e0055
    • Sean Christopherson's avatar
      KVM: nVMX: add tracepoint for failed nested VM-Enter · 5497b955
      Sean Christopherson authored
      Debugging a failed VM-Enter is often like searching for a needle in a
      haystack, e.g. there are over 80 consistency checks that funnel into
      the "invalid control field" error code.  One way to expedite debug is
      to run the buggy code as an L1 guest under KVM (and pray that the
      failing check is detected by KVM).  However, extracting useful debug
      information out of L0 KVM requires attaching a debugger to KVM and/or
      modifying the source, e.g. to log which check is failing.
      
      Make life a little less painful for VMM developers and add a tracepoint
      for failed VM-Enter consistency checks.  Ideally the tracepoint would
      capture both what check failed and precisely why it failed, but logging
      why a checked failed is difficult to do in a generic tracepoint without
      resorting to invasive techniques, e.g. generating a custom string on
      failure.  That being said, for the vast majority of VM-Enter failures
      the most difficult step is figuring out exactly what to look at, e.g.
      figuring out which bit was incorrectly set in a control field is usually
      not too painful once the guilty field as been identified.
      
      To reach a happy medium between precision and ease of use, simply log
      the code that detected a failed check, using a macro to execute the
      check and log the trace event on failure.  This approach enables tracing
      arbitrary code, e.g. it's not limited to function calls or specific
      formats of checks, and the changes to the existing code are minimally
      invasive.  A macro with a two-character name is desirable as usage of
      the macro doesn't result in overly long lines or confusing alignment,
      while still retaining some amount of readability.  I.e. a one-character
      name is a little too terse, and a three-character name results in the
      contents being passed to the macro aligning with an indented line when
      the macro is used an in if-statement, e.g.:
      
              if (VCC(nested_vmx_check_long_line_one(...) &&
                      nested_vmx_check_long_line_two(...)))
                      return -EINVAL;
      
      And that is the story of how the CC(), a.k.a. Consistency Check, macro
      got its name.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5497b955
    • Dan Carpenter's avatar
      x86: KVM: svm: Fix a check in nested_svm_vmrun() · a061985b
      Dan Carpenter authored
      We refactored this code a bit and accidentally deleted the "-" character
      from "-EINVAL".  The kvm_vcpu_map() function never returns positive
      EINVAL.
      
      Fixes: c8e16b78 ("x86: KVM: svm: eliminate hardcoded RIP advancement from vmrun_interception()")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a061985b
    • Liran Alon's avatar
      KVM: x86: Return to userspace with internal error on unexpected exit reason · 7396d337
      Liran Alon authored
      Receiving an unexpected exit reason from hardware should be considered
      as a severe bug in KVM. Therefore, instead of just injecting #UD to
      guest and ignore it, exit to userspace on internal error so that
      it could handle it properly (probably by terminating guest).
      
      In addition, prefer to use vcpu_unimpl() instead of WARN_ONCE()
      as handling unexpected exit reason should be a rare unexpected
      event (that was expected to never happen) and we prefer to print
      a message on it every time it occurs to guest.
      
      Furthermore, dump VMCS/VMCB to dmesg to assist diagnosing such cases.
      Reviewed-by: default avatarMihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: default avatarNikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: default avatarJoao Martins <joao.m.martins@oracle.com>
      Signed-off-by: default avatarLiran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7396d337
  2. 10 Sep, 2019 11 commits
  3. 09 Sep, 2019 1 commit
    • Marc Zyngier's avatar
      KVM: arm/arm64: vgic: Allow more than 256 vcpus for KVM_IRQ_LINE · 92f35b75
      Marc Zyngier authored
      While parts of the VGIC support a large number of vcpus (we
      bravely allow up to 512), other parts are more limited.
      
      One of these limits is visible in the KVM_IRQ_LINE ioctl, which
      only allows 256 vcpus to be signalled when using the CPU or PPI
      types. Unfortunately, we've cornered ourselves badly by allocating
      all the bits in the irq field.
      
      Since the irq_type subfield (8 bit wide) is currently only taking
      the values 0, 1 and 2 (and we have been careful not to allow anything
      else), let's reduce this field to only 4 bits, and allocate the
      remaining 4 bits to a vcpu2_index, which acts as a multiplier:
      
        vcpu_id = 256 * vcpu2_index + vcpu_index
      
      With that, and a new capability (KVM_CAP_ARM_IRQ_LINE_LAYOUT_2)
      allowing this to be discovered, it becomes possible to inject
      PPIs to up to 4096 vcpus. But please just don't.
      
      Whilst we're there, add a clarification about the use of KVM_IRQ_LINE
      on arm, which is not completely conditionned by KVM_CAP_IRQCHIP.
      Reported-by: default avatarZenghui Yu <yuzenghui@huawei.com>
      Reviewed-by: default avatarEric Auger <eric.auger@redhat.com>
      Reviewed-by: default avatarZenghui Yu <yuzenghui@huawei.com>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      92f35b75
  4. 04 Sep, 2019 2 commits
  5. 29 Aug, 2019 3 commits
  6. 27 Aug, 2019 4 commits
    • James Morse's avatar
      arm64: KVM: Device mappings should be execute-never · e8688ba3
      James Morse authored
      Since commit 2f6ea23f ("arm64: KVM: Avoid marking pages as XN in
      Stage-2 if CTR_EL0.DIC is set"), KVM has stopped marking normal memory
      as execute-never at stage2 when the system supports D->I Coherency at
      the PoU. This avoids KVM taking a trap when the page is first executed,
      in order to clean it to PoU.
      
      The patch that added this change also wrapped PAGE_S2_DEVICE mappings
      up in this too. The upshot is, if your CPU caches support DIC ...
      you can execute devices.
      
      Revert the PAGE_S2_DEVICE change so PTE_S2_XN is always used
      directly.
      
      Fixes: 2f6ea23f ("arm64: KVM: Avoid marking pages as XN in Stage-2 if CTR_EL0.DIC is set")
      Signed-off-by: default avatarJames Morse <james.morse@arm.com>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      e8688ba3
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Don't lose pending doorbell request on migration on P9 · ff42df49
      Paul Mackerras authored
      On POWER9, when userspace reads the value of the DPDES register on a
      vCPU, it is possible for 0 to be returned although there is a doorbell
      interrupt pending for the vCPU.  This can lead to a doorbell interrupt
      being lost across migration.  If the guest kernel uses doorbell
      interrupts for IPIs, then it could malfunction because of the lost
      interrupt.
      
      This happens because a newly-generated doorbell interrupt is signalled
      by setting vcpu->arch.doorbell_request to 1; the DPDES value in
      vcpu->arch.vcore->dpdes is not updated, because it can only be updated
      when holding the vcpu mutex, in order to avoid races.
      
      To fix this, we OR in vcpu->arch.doorbell_request when reading the
      DPDES value.
      
      Cc: stable@vger.kernel.org # v4.13+
      Fixes: 57900694 ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9")
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Tested-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      ff42df49
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Check for MMU ready on piggybacked virtual cores · d28eafc5
      Paul Mackerras authored
      When we are running multiple vcores on the same physical core, they
      could be from different VMs and so it is possible that one of the
      VMs could have its arch.mmu_ready flag cleared (for example by a
      concurrent HPT resize) when we go to run it on a physical core.
      We currently check the arch.mmu_ready flag for the primary vcore
      but not the flags for the other vcores that will be run alongside
      it.  This adds that check, and also a check when we select the
      secondary vcores from the preempted vcores list.
      
      Cc: stable@vger.kernel.org # v4.14+
      Fixes: 38c53af8 ("KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates")
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      d28eafc5
    • Paul Mackerras's avatar
      KVM: PPC: Book3S: Enable XIVE native capability only if OPAL has required functions · 2ad7a27d
      Paul Mackerras authored
      There are some POWER9 machines where the OPAL firmware does not support
      the OPAL_XIVE_GET_QUEUE_STATE and OPAL_XIVE_SET_QUEUE_STATE calls.
      The impact of this is that a guest using XIVE natively will not be able
      to be migrated successfully.  On the source side, the get_attr operation
      on the KVM native device for the KVM_DEV_XIVE_GRP_EQ_CONFIG attribute
      will fail; on the destination side, the set_attr operation for the same
      attribute will fail.
      
      This adds tests for the existence of the OPAL get/set queue state
      functions, and if they are not supported, the XIVE-native KVM device
      is not created and the KVM_CAP_PPC_IRQ_XIVE capability returns false.
      Userspace can then either provide a software emulation of XIVE, or
      else tell the guest that it does not have a XIVE controller available
      to it.
      
      Cc: stable@vger.kernel.org # v5.2+
      Fixes: 3fab2d10 ("KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation mode")
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: default avatarCédric Le Goater <clg@kaod.org>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      2ad7a27d
  7. 25 Aug, 2019 2 commits
  8. 23 Aug, 2019 3 commits
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Define usage types for rmap array in guest memslot · d22deab6
      Suraj Jitindar Singh authored
      The rmap array in the guest memslot is an array of size number of guest
      pages, allocated at memslot creation time. Each rmap entry in this array
      is used to store information about the guest page to which it
      corresponds. For example for a hpt guest it is used to store a lock bit,
      rc bits, a present bit and the index of a hpt entry in the guest hpt
      which maps this page. For a radix guest which is running nested guests
      it is used to store a pointer to a linked list of nested rmap entries
      which store the nested guest physical address which maps this guest
      address and for which there is a pte in the shadow page table.
      
      As there are currently two uses for the rmap array, and the potential
      for this to expand to more in the future, define a type field (being the
      top 8 bits of the rmap entry) to be used to define the type of the rmap
      entry which is currently present and define two values for this field
      for the two current uses of the rmap array.
      
      Since the nested case uses the rmap entry to store a pointer, define
      this type as having the two high bits set as is expected for a pointer.
      Define the hpt entry type as having bit 56 set (bit 7 IBM bit ordering).
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      d22deab6
    • Paul Menzel's avatar
      KVM: PPC: Book3S: Mark expected switch fall-through · ff7240cc
      Paul Menzel authored
      Fix the error below triggered by `-Wimplicit-fallthrough`, by tagging
      it as an expected fall-through.
      
          arch/powerpc/kvm/book3s_32_mmu.c: In function ‘kvmppc_mmu_book3s_32_xlate_pte’:
          arch/powerpc/kvm/book3s_32_mmu.c:241:21: error: this statement may fall through [-Werror=implicit-fallthrough=]
                pte->may_write = true;
                ~~~~~~~~~~~~~~~^~~~~~
          arch/powerpc/kvm/book3s_32_mmu.c:242:5: note: here
               case 3:
               ^~~~
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      ff7240cc
    • Paul Mackerras's avatar
      Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next · 75bf465f
      Paul Mackerras authored
      This merges in fixes for the XIVE interrupt controller which touch both
      generic powerpc and PPC KVM code.  To avoid merge conflicts, these
      commits will go upstream via the powerpc tree as well as the KVM tree.
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      75bf465f
  9. 22 Aug, 2019 2 commits