1. 11 Sep, 2019 5 commits
    • Jan Dakinevich's avatar
      KVM: x86: always stop emulation on page fault · 8530a79c
      Jan Dakinevich authored
      inject_emulated_exception() returns true if and only if nested page
      fault happens. However, page fault can come from guest page tables
      walk, either nested or not nested. In both cases we should stop an
      attempt to read under RIP and give guest to step over its own page
      fault handler.
      
      This is also visible when an emulated instruction causes a #GP fault
      and the VMware backdoor is enabled.  To handle the VMware backdoor,
      KVM intercepts #GP faults; with only the next patch applied,
      x86_emulate_instruction() injects a #GP but returns EMULATE_FAIL
      instead of EMULATE_DONE.   EMULATE_FAIL causes handle_exception_nmi()
      (or gp_interception() for SVM) to re-inject the original #GP because it
      thinks emulation failed due to a non-VMware opcode.  This patch prevents
      the issue as x86_emulate_instruction() will return EMULATE_DONE after
      injecting the #GP.
      
      Fixes: 6ea6e843 ("KVM: x86: inject exceptions produced by x86_decode_insn")
      Cc: stable@vger.kernel.org
      Cc: Denis Lunev <den@virtuozzo.com>
      Cc: Roman Kagan <rkagan@virtuozzo.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Signed-off-by: default avatarJan Dakinevich <jan.dakinevich@virtuozzo.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      8530a79c
    • Sean Christopherson's avatar
      KVM: nVMX: trace nested VM-Enter failures detected by H/W · 380e0055
      Sean Christopherson authored
      Use the recently added tracepoint for logging nested VM-Enter failures
      instead of spamming the kernel log when hardware detects a consistency
      check failure.  Take the opportunity to print the name of the error code
      instead of dumping the raw hex number, but limit the symbol table to
      error codes that can reasonably be encountered by KVM.
      
      Add an equivalent tracepoint in nested_vmx_check_vmentry_hw(), e.g. so
      that tracing of "invalid control field" errors isn't suppressed when
      nested early checks are enabled.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      380e0055
    • Sean Christopherson's avatar
      KVM: nVMX: add tracepoint for failed nested VM-Enter · 5497b955
      Sean Christopherson authored
      Debugging a failed VM-Enter is often like searching for a needle in a
      haystack, e.g. there are over 80 consistency checks that funnel into
      the "invalid control field" error code.  One way to expedite debug is
      to run the buggy code as an L1 guest under KVM (and pray that the
      failing check is detected by KVM).  However, extracting useful debug
      information out of L0 KVM requires attaching a debugger to KVM and/or
      modifying the source, e.g. to log which check is failing.
      
      Make life a little less painful for VMM developers and add a tracepoint
      for failed VM-Enter consistency checks.  Ideally the tracepoint would
      capture both what check failed and precisely why it failed, but logging
      why a checked failed is difficult to do in a generic tracepoint without
      resorting to invasive techniques, e.g. generating a custom string on
      failure.  That being said, for the vast majority of VM-Enter failures
      the most difficult step is figuring out exactly what to look at, e.g.
      figuring out which bit was incorrectly set in a control field is usually
      not too painful once the guilty field as been identified.
      
      To reach a happy medium between precision and ease of use, simply log
      the code that detected a failed check, using a macro to execute the
      check and log the trace event on failure.  This approach enables tracing
      arbitrary code, e.g. it's not limited to function calls or specific
      formats of checks, and the changes to the existing code are minimally
      invasive.  A macro with a two-character name is desirable as usage of
      the macro doesn't result in overly long lines or confusing alignment,
      while still retaining some amount of readability.  I.e. a one-character
      name is a little too terse, and a three-character name results in the
      contents being passed to the macro aligning with an indented line when
      the macro is used an in if-statement, e.g.:
      
              if (VCC(nested_vmx_check_long_line_one(...) &&
                      nested_vmx_check_long_line_two(...)))
                      return -EINVAL;
      
      And that is the story of how the CC(), a.k.a. Consistency Check, macro
      got its name.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5497b955
    • Dan Carpenter's avatar
      x86: KVM: svm: Fix a check in nested_svm_vmrun() · a061985b
      Dan Carpenter authored
      We refactored this code a bit and accidentally deleted the "-" character
      from "-EINVAL".  The kvm_vcpu_map() function never returns positive
      EINVAL.
      
      Fixes: c8e16b78 ("x86: KVM: svm: eliminate hardcoded RIP advancement from vmrun_interception()")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a061985b
    • Liran Alon's avatar
      KVM: x86: Return to userspace with internal error on unexpected exit reason · 7396d337
      Liran Alon authored
      Receiving an unexpected exit reason from hardware should be considered
      as a severe bug in KVM. Therefore, instead of just injecting #UD to
      guest and ignore it, exit to userspace on internal error so that
      it could handle it properly (probably by terminating guest).
      
      In addition, prefer to use vcpu_unimpl() instead of WARN_ONCE()
      as handling unexpected exit reason should be a rare unexpected
      event (that was expected to never happen) and we prefer to print
      a message on it every time it occurs to guest.
      
      Furthermore, dump VMCS/VMCB to dmesg to assist diagnosing such cases.
      Reviewed-by: default avatarMihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: default avatarNikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: default avatarJoao Martins <joao.m.martins@oracle.com>
      Signed-off-by: default avatarLiran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7396d337
  2. 10 Sep, 2019 11 commits
  3. 09 Sep, 2019 1 commit
    • Marc Zyngier's avatar
      KVM: arm/arm64: vgic: Allow more than 256 vcpus for KVM_IRQ_LINE · 92f35b75
      Marc Zyngier authored
      While parts of the VGIC support a large number of vcpus (we
      bravely allow up to 512), other parts are more limited.
      
      One of these limits is visible in the KVM_IRQ_LINE ioctl, which
      only allows 256 vcpus to be signalled when using the CPU or PPI
      types. Unfortunately, we've cornered ourselves badly by allocating
      all the bits in the irq field.
      
      Since the irq_type subfield (8 bit wide) is currently only taking
      the values 0, 1 and 2 (and we have been careful not to allow anything
      else), let's reduce this field to only 4 bits, and allocate the
      remaining 4 bits to a vcpu2_index, which acts as a multiplier:
      
        vcpu_id = 256 * vcpu2_index + vcpu_index
      
      With that, and a new capability (KVM_CAP_ARM_IRQ_LINE_LAYOUT_2)
      allowing this to be discovered, it becomes possible to inject
      PPIs to up to 4096 vcpus. But please just don't.
      
      Whilst we're there, add a clarification about the use of KVM_IRQ_LINE
      on arm, which is not completely conditionned by KVM_CAP_IRQCHIP.
      Reported-by: default avatarZenghui Yu <yuzenghui@huawei.com>
      Reviewed-by: default avatarEric Auger <eric.auger@redhat.com>
      Reviewed-by: default avatarZenghui Yu <yuzenghui@huawei.com>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      92f35b75
  4. 27 Aug, 2019 4 commits
    • James Morse's avatar
      arm64: KVM: Device mappings should be execute-never · e8688ba3
      James Morse authored
      Since commit 2f6ea23f ("arm64: KVM: Avoid marking pages as XN in
      Stage-2 if CTR_EL0.DIC is set"), KVM has stopped marking normal memory
      as execute-never at stage2 when the system supports D->I Coherency at
      the PoU. This avoids KVM taking a trap when the page is first executed,
      in order to clean it to PoU.
      
      The patch that added this change also wrapped PAGE_S2_DEVICE mappings
      up in this too. The upshot is, if your CPU caches support DIC ...
      you can execute devices.
      
      Revert the PAGE_S2_DEVICE change so PTE_S2_XN is always used
      directly.
      
      Fixes: 2f6ea23f ("arm64: KVM: Avoid marking pages as XN in Stage-2 if CTR_EL0.DIC is set")
      Signed-off-by: default avatarJames Morse <james.morse@arm.com>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      e8688ba3
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Don't lose pending doorbell request on migration on P9 · ff42df49
      Paul Mackerras authored
      On POWER9, when userspace reads the value of the DPDES register on a
      vCPU, it is possible for 0 to be returned although there is a doorbell
      interrupt pending for the vCPU.  This can lead to a doorbell interrupt
      being lost across migration.  If the guest kernel uses doorbell
      interrupts for IPIs, then it could malfunction because of the lost
      interrupt.
      
      This happens because a newly-generated doorbell interrupt is signalled
      by setting vcpu->arch.doorbell_request to 1; the DPDES value in
      vcpu->arch.vcore->dpdes is not updated, because it can only be updated
      when holding the vcpu mutex, in order to avoid races.
      
      To fix this, we OR in vcpu->arch.doorbell_request when reading the
      DPDES value.
      
      Cc: stable@vger.kernel.org # v4.13+
      Fixes: 57900694 ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9")
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Tested-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      ff42df49
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Check for MMU ready on piggybacked virtual cores · d28eafc5
      Paul Mackerras authored
      When we are running multiple vcores on the same physical core, they
      could be from different VMs and so it is possible that one of the
      VMs could have its arch.mmu_ready flag cleared (for example by a
      concurrent HPT resize) when we go to run it on a physical core.
      We currently check the arch.mmu_ready flag for the primary vcore
      but not the flags for the other vcores that will be run alongside
      it.  This adds that check, and also a check when we select the
      secondary vcores from the preempted vcores list.
      
      Cc: stable@vger.kernel.org # v4.14+
      Fixes: 38c53af8 ("KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates")
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      d28eafc5
    • Paul Mackerras's avatar
      KVM: PPC: Book3S: Enable XIVE native capability only if OPAL has required functions · 2ad7a27d
      Paul Mackerras authored
      There are some POWER9 machines where the OPAL firmware does not support
      the OPAL_XIVE_GET_QUEUE_STATE and OPAL_XIVE_SET_QUEUE_STATE calls.
      The impact of this is that a guest using XIVE natively will not be able
      to be migrated successfully.  On the source side, the get_attr operation
      on the KVM native device for the KVM_DEV_XIVE_GRP_EQ_CONFIG attribute
      will fail; on the destination side, the set_attr operation for the same
      attribute will fail.
      
      This adds tests for the existence of the OPAL get/set queue state
      functions, and if they are not supported, the XIVE-native KVM device
      is not created and the KVM_CAP_PPC_IRQ_XIVE capability returns false.
      Userspace can then either provide a software emulation of XIVE, or
      else tell the guest that it does not have a XIVE controller available
      to it.
      
      Cc: stable@vger.kernel.org # v5.2+
      Fixes: 3fab2d10 ("KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation mode")
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: default avatarCédric Le Goater <clg@kaod.org>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      2ad7a27d
  5. 25 Aug, 2019 2 commits
  6. 23 Aug, 2019 3 commits
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Define usage types for rmap array in guest memslot · d22deab6
      Suraj Jitindar Singh authored
      The rmap array in the guest memslot is an array of size number of guest
      pages, allocated at memslot creation time. Each rmap entry in this array
      is used to store information about the guest page to which it
      corresponds. For example for a hpt guest it is used to store a lock bit,
      rc bits, a present bit and the index of a hpt entry in the guest hpt
      which maps this page. For a radix guest which is running nested guests
      it is used to store a pointer to a linked list of nested rmap entries
      which store the nested guest physical address which maps this guest
      address and for which there is a pte in the shadow page table.
      
      As there are currently two uses for the rmap array, and the potential
      for this to expand to more in the future, define a type field (being the
      top 8 bits of the rmap entry) to be used to define the type of the rmap
      entry which is currently present and define two values for this field
      for the two current uses of the rmap array.
      
      Since the nested case uses the rmap entry to store a pointer, define
      this type as having the two high bits set as is expected for a pointer.
      Define the hpt entry type as having bit 56 set (bit 7 IBM bit ordering).
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      d22deab6
    • Paul Menzel's avatar
      KVM: PPC: Book3S: Mark expected switch fall-through · ff7240cc
      Paul Menzel authored
      Fix the error below triggered by `-Wimplicit-fallthrough`, by tagging
      it as an expected fall-through.
      
          arch/powerpc/kvm/book3s_32_mmu.c: In function ‘kvmppc_mmu_book3s_32_xlate_pte’:
          arch/powerpc/kvm/book3s_32_mmu.c:241:21: error: this statement may fall through [-Werror=implicit-fallthrough=]
                pte->may_write = true;
                ~~~~~~~~~~~~~~~^~~~~~
          arch/powerpc/kvm/book3s_32_mmu.c:242:5: note: here
               case 3:
               ^~~~
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      ff7240cc
    • Paul Mackerras's avatar
      Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next · 75bf465f
      Paul Mackerras authored
      This merges in fixes for the XIVE interrupt controller which touch both
      generic powerpc and PPC KVM code.  To avoid merge conflicts, these
      commits will go upstream via the powerpc tree as well as the KVM tree.
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      75bf465f
  7. 22 Aug, 2019 14 commits