1. 16 Oct, 2018 16 commits
    • Sean Christopherson's avatar
      KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail · bd18bffc
      Sean Christopherson authored
      A VMEnter that VMFails (as opposed to VMExits) does not touch host
      state beyond registers that are explicitly noted in the VMFail path,
      e.g. EFLAGS.  Host state does not need to be loaded because VMFail
      is only signaled for consistency checks that occur before the CPU
      starts to load guest state, i.e. there is no need to restore any
      state as nothing has been modified.  But in the case where a VMFail
      is detected by hardware and not by KVM (due to deferring consistency
      checks to hardware), KVM has already loaded some amount of guest
      state.  Luckily, "loaded" only means loaded to KVM's software model,
      i.e. vmcs01 has not been modified.  So, unwind our software model to
      the pre-VMEntry host state.
      
      Not restoring host state in this VMFail path leads to a variety of
      failures because we end up with stale data in vcpu->arch, e.g. CR0,
      CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
      significant delta in the stale data is all but guaranteed to crash
      L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
      
      An alternative to this "soft" reload would be to load host state from
      vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
      wildly inconsistent with respect to the VMX architecture, e.g. an L1
      VMM with separate VMExit and VMFail paths would explode.
      
      Note that this approach does not mean KVM is 100% accurate with
      respect to VMX hardware behavior, even at an architectural level
      (the exact order of consistency checks is microarchitecture specific).
      But 100% emulation accuracy isn't the goal (with this patch), rather
      the goal is to be consistent in the information delivered to L1, e.g.
      a VMExit should not fall-through VMENTER, and a VMFail should not jump
      to HOST_RIP.
      
      This technically reverts commit "5af41573 (KVM: nVMX: Fix mmu
      context after VMLAUNCH/VMRESUME failure)", but retains the core
      aspects of that patch, just in an open coded form due to the need to
      pull state from vmcs01 instead of vmcs12.  Restoring host state
      resolves a variety of issues introduced by commit "4f350c6d
      (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
      which remedied the incorrect behavior of treating VMFail like VMExit
      but in doing so neglected to restore arch state that had been modified
      prior to attempting nested VMEnter.
      
      A sample failure that occurs due to stale vcpu.arch state is a fault
      of some form while emulating an LGDT (due to emulated UMIP) from L1
      after a failed VMEntry to L3, in this case when running the KVM unit
      test test_tpr_threshold_values in L1.  L0 also hits a WARN in this
      case due to a stale arch.cr4.UMIP.
      
      L1:
        BUG: unable to handle kernel paging request at ffffc90000663b9e
        PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
        Oops: 0009 [#1] SMP
        CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:native_load_gdt+0x0/0x10
      
        ...
      
        Call Trace:
         load_fixmap_gdt+0x22/0x30
         __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
         vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
         nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
         vmx_handle_exit+0x246/0x15a0 [kvm_intel]
         kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
         kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
         do_vfs_ioctl+0x9f/0x600
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x4f/0x100
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      L0:
        WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
        ...
        CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
        Hardware name: Intel Corporation Kabylake Client platform/KBL S
        RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
      
        ...
      
        Call Trace:
         kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
         kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
         do_vfs_ioctl+0x9f/0x5e0
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x49/0xf0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 5af41573 (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
      Fixes: 4f350c6d (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bd18bffc
    • Jim Mattson's avatar
      KVM: nVMX: Clear reserved bits of #DB exit qualification · cfb634fe
      Jim Mattson authored
      According to volume 3 of the SDM, bits 63:15 and 12:4 of the exit
      qualification field for debug exceptions are reserved (cleared to
      0). However, the SDM is incorrect about bit 16 (corresponding to
      DR6.RTM). This bit should be set if a debug exception (#DB) or a
      breakpoint exception (#BP) occurred inside an RTM region while
      advanced debugging of RTM transactional regions was enabled. Note that
      this is the opposite of DR6.RTM, which "indicates (when clear) that a
      debug exception (#DB) or breakpoint exception (#BP) occurred inside an
      RTM region while advanced debugging of RTM transactional regions was
      enabled."
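
      As a rough illustration of the mapping described above, the sketch below
      derives a #DB exit qualification from a DR6-style value: it keeps only
      the defined bits and inverts the polarity of bit 16. The function and
      macro names are assumptions made for this example and are not taken from
      the patch; the bit positions follow the DR6 layout (B0-B3 in bits 3:0,
      BD in bit 13, BS in bit 14, RTM in bit 16).

```c
#include <stdint.h>

/* Bit positions follow the DR6 layout; the macro names are illustrative. */
#define TOY_DR6_B0_B3 0xfULL         /* breakpoint condition bits 3:0 */
#define TOY_DR6_BD    (1ULL << 13)   /* debug register access detected */
#define TOY_DR6_BS    (1ULL << 14)   /* single step */
#define TOY_DR6_RTM   (1ULL << 16)   /* clear in DR6 means "inside RTM" */

/*
 * Build the exit qualification for a #DB from a DR6-style value.
 * All reserved bits are cleared, and bit 16 is flipped because the exit
 * qualification uses the opposite polarity from DR6.RTM.
 */
uint64_t db_exit_qualification(uint64_t dr6)
{
    uint64_t qual = dr6 & (TOY_DR6_B0_B3 | TOY_DR6_BD | TOY_DR6_BS | TOY_DR6_RTM);

    qual ^= TOY_DR6_RTM;
    return qual;
}
```

      With this helper, a DR6 value that has RTM set and no condition bits
      yields an exit qualification of zero, while a DR6 value with RTM clear
      yields bit 16 set.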
      
      There is still an issue with stale DR6 bits potentially being
      misreported for the current debug exception.  DR6 should not have been
      modified before vectoring the #DB exception, and the "new DR6 bits"
      should be available somewhere, but it was and they aren't.
      
      Fixes: b96fb439 ("KVM: nVMX: fixes to nested virt interrupt injection")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cfb634fe
    • Andrew Jones's avatar
      5b8ee879
    • Andrew Jones's avatar
      kvm: selftests: stop lying to aarch64 tests about PA-bits · e28934e6
      Andrew Jones authored
      Let's add the 40 PA-bit versions of the VM modes, that AArch64
      should have been using, so we can extend the dirty log test without
      breaking things.
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e28934e6
    • Andrew Jones's avatar
      kvm: selftests: port dirty_log_test to aarch64 · fff8dcd7
      Andrew Jones authored
      While we're messing with the code for the port and to support guest
      page sizes that are less than the host page size, we also make some
      code formatting cleanups and apply sync_global_to_guest().
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fff8dcd7
    • Andrew Jones's avatar
      kvm: selftests: introduce new VM mode for 64K pages · 81d1cca0
      Andrew Jones authored
      Rename VM_MODE_FLAT48PG to be more descriptive of its config and add a
      new config that has the same parameters, except with 64K pages.
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      81d1cca0
    • Andrew Jones's avatar
      kvm: selftests: add vcpu support for aarch64 · 0bec140f
      Andrew Jones authored
      This code adds VM and VCPU setup code for the VM_MODE_FLAT48PG mode.
      The VM_MODE_FLAT48PG isn't yet fully supportable, as it defines the
      guest physical address limit as 52-bits, and KVM currently only
      supports guests with up to 40-bit physical addresses (see
      KVM_PHYS_SHIFT). VM_MODE_FLAT48PG will work fine, though, as long as
      no >= 40-bit physical addresses are used.
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0bec140f
    • Andrew Jones's avatar
      7a6629ef
    • Andrew Jones's avatar
      kvm: selftests: add vm_phy_pages_alloc · d5106539
      Andrew Jones authored
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d5106539
    • Andrew Jones's avatar
      kvm: selftests: tidy up kvm_util · eabe7881
      Andrew Jones authored
      Tidy up kvm-util code: code/comment formatting, remove unused code,
      and move x86 specific code out. We also move vcpu_dump() out of
      common code, because not all arches (AArch64) have KVM_GET_REGS.
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      eabe7881
    • Andrew Jones's avatar
      kvm: selftests: add cscope make target · eea192bf
      Andrew Jones authored
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      eea192bf
    • Andrew Jones's avatar
      kvm: selftests: introduce ucall · 14c47b75
      Andrew Jones authored
      Rework the guest exit to userspace code to generalize the concept
      into what it is, a "hypercall to userspace", and provide two
      implementations of it: the PortIO version currently used, but only
      useable by x86, and an MMIO version that other architectures (except
      s390) can use.
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      14c47b75
    • Andrew Jones's avatar
      kvm: selftests: vcpu_setup: set cr4.osfxsr · 6c930268
      Andrew Jones authored
      Guest code may want to call functions that have variable arguments.
      To do so, we either need to compile with -mno-sse or enable SSE in
      the VCPUs. As it should be pretty safe to turn on the feature, and
      -mno-sse would make linking test code with standard libraries
      difficult, we choose the feature enabling.
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6c930268
    • Wanpeng Li's avatar
      KVM: LAPIC: Tune lapic_timer_advance_ns automatically · 3b8a5df6
      Wanpeng Li authored
      In cloud environments, lapic_timer_advance_ns needs to be tuned for every
      CPU generation and every host kernel version (kvm-unit-tests/tscdeadline_latency.flat
      reports 5700 cycles for the upstream kernel and 9600 cycles for our 3.10 product
      kernel, both with preemption_timer=N on a Skylake server).
      
      This patch adds the capability to tune lapic_timer_advance_ns automatically,
      step by step. The initial value is 1000ns, as recommended by commit d0659d94
      ("KVM: x86: add option to advance tscdeadline hrtimer expiration"); it is
      reduced when the timer fires too early and increased when it fires too late.
      guest_tsc and tsc_deadline rarely match exactly, so tuning is considered
      done once the delta falls within a small window, e.g. 100 cycles. This patch
      reduces latency (kvm-unit-tests/tscdeadline_latency, busy waits,
      preemption_timer enabled) from ~2600 cycles to ~1200 cycles on our Skylake server.
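
      The feedback loop described above could be sketched roughly as follows.
      This is a simplified model, not the patch's code: the function name, the
      1/8 step limit and the cycles-to-nanoseconds helper are assumptions for
      illustration, and tsc_khz is assumed to be non-zero.

```c
#include <stdint.h>

#define ADVANCE_ADJUST_DONE_CYCLES 100 /* "close enough" window from the text */
#define ADVANCE_ADJUST_STEP        8   /* assumed: change by at most 1/8 per step */

/* Convert a TSC cycle count to nanoseconds for a TSC running at tsc_khz. */
static uint64_t cycles_to_ns(uint64_t cycles, uint32_t tsc_khz)
{
    return cycles * 1000000ULL / tsc_khz;
}

/*
 * Return the next advance value. guest_tsc is the TSC observed when the
 * timer fired, tsc_deadline is the deadline the guest programmed.
 * *done is set to 1 once the two are within the window, else to 0.
 */
uint32_t tune_advance_ns(uint32_t advance_ns, uint64_t guest_tsc,
                         uint64_t tsc_deadline, uint32_t tsc_khz, int *done)
{
    int early = guest_tsc < tsc_deadline;
    uint64_t delta = early ? tsc_deadline - guest_tsc
                           : guest_tsc - tsc_deadline;
    uint64_t ns, step;

    if (delta < ADVANCE_ADJUST_DONE_CYCLES) {
        *done = 1;
        return advance_ns;
    }
    *done = 0;

    /* Bound each adjustment so a single outlier cannot swing the value. */
    ns = cycles_to_ns(delta, tsc_khz);
    step = advance_ns / ADVANCE_ADJUST_STEP;
    if (ns < step)
        step = ns;

    /* Fired too early: advance less. Fired too late: advance more. */
    return early ? advance_ns - (uint32_t)step : advance_ns + (uint32_t)step;
}
```

      Capping the step is a design choice of this sketch; it trades slower
      convergence for stability when individual measurements are noisy.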
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3b8a5df6
  2. 13 Oct, 2018 6 commits
  3. 10 Oct, 2018 1 commit
    • Paolo Bonzini's avatar
      Merge tag 'kvm-ppc-next-4.20-1' of... · 7dd2157c
      Paolo Bonzini authored
      Merge tag 'kvm-ppc-next-4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
      
      PPC KVM update for 4.20.
      
      The major new feature here is nested HV KVM support.  This allows the
      HV KVM module to load inside a radix guest on POWER9 and run radix
      guests underneath it.  These nested guests can run in supervisor mode
      and don't require any additional instructions to be emulated, unlike
      with PR KVM, and so performance is much better than with PR KVM, and
      is very close to the performance of a non-nested guest.  A nested
      hypervisor (a guest with nested guests) can be migrated to another
      host and will bring all its nested guests along with it.  A nested
      guest can also itself run guests, and so on down to any desired depth
      of nesting.
      
      Apart from that there are a series of updates for IOMMU handling from
      Alexey Kardashevskiy, a "one VM per core" mode for HV KVM for
      security-paranoid applications, and a small fix for PR KVM.
      7dd2157c
  4. 09 Oct, 2018 17 commits
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Add NO_HASH flag to GET_SMMU_INFO ioctl result · 901f8c3f
      Paul Mackerras authored
      This adds a KVM_PPC_NO_HASH flag to the flags field of the
      kvm_ppc_smmu_info struct, and arranges for it to be set when
      running as a nested hypervisor, as an unambiguous indication
      to userspace that HPT guests are not supported.  Reporting the
      KVM_CAP_PPC_MMU_HASH_V3 capability as false could be taken as
      indicating only that the new HPT features in ISA V3.0 are not
      supported, leaving it ambiguous whether pre-V3.0 HPT features
      are supported.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      901f8c3f
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization · aa069a99
      Paul Mackerras authored
      With this, userspace can enable a KVM-HV guest to run nested guests
      under it.
      
      The administrator can control whether any nested guests can be run;
      setting the "nested" module parameter to false prevents any guests
      becoming nested hypervisors (that is, any attempt to enable the nested
      capability on a guest will fail).  Guests which are already nested
      hypervisors will continue to be so.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      aa069a99
    • Paul Mackerras's avatar
      Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next · 9d67121a
      Paul Mackerras authored
      This merges in the "ppc-kvm" topic branch of the powerpc tree to get a
      series of commits that touch both general arch/powerpc code and KVM
      code.  These commits will be merged both via the KVM tree and the
      powerpc tree.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      9d67121a
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Add nested shadow page tables to debugfs · 83a05510
      Paul Mackerras authored
      This adds a list of valid shadow PTEs for each nested guest to
      the 'radix' file for the guest in debugfs.  This can be useful for
      debugging.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      83a05510
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Allow HV module to load without hypervisor mode · de760db4
      Paul Mackerras authored
      With this, the KVM-HV module can be loaded in a guest running under
      KVM-HV, and if the hypervisor supports nested virtualization, this
      guest can now act as a nested hypervisor and run nested guests.
      
      This also adds some checks to inform userspace that HPT guests are not
      supported by nested hypervisors (by returning false for the
      KVM_CAP_PPC_MMU_HASH_V3 capability), and to prevent userspace from
      configuring a guest to use HPT mode.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      de760db4
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Handle differing endianness for H_ENTER_NESTED · 10b5022d
      Suraj Jitindar Singh authored
      The hcall H_ENTER_NESTED takes two parameters: the address in L1 guest
      memory of a hv_regs struct and the address of a pt_regs struct.  The
      hcall requests the L0 hypervisor to use the register values in these
      structs to run a L2 guest and to return the exit state of the L2 guest
      in these structs.  These are in the endianness of the L1 guest, rather
      than being always big-endian as is usually the case for PAPR
      hypercalls.
      
      This is convenient because it means that the L1 guest can pass the
      address of the regs field in its kvm_vcpu_arch struct.  This also
      improves performance slightly by avoiding the need for two copies of
      the pt_regs struct.
      
      When reading/writing these structures, this patch handles the case
      where the endianness of the L1 guest differs from that of the L0
      hypervisor, by byteswapping the structures after reading and before
      writing them back.
      
      Since all the fields of the pt_regs are of the same type, i.e.,
      unsigned long, we treat it as an array of unsigned longs.  The fields
      of struct hv_guest_state are not all the same, so its fields are
      byteswapped individually.
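
      The two byteswapping strategies mentioned above can be sketched as
      follows. This is a minimal illustration, not the patch's code: the
      helper names and the toy_hv_state struct (a stand-in for a struct with
      mixed field widths) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable 32-bit byte reversal. */
uint32_t toy_swab32(uint32_t v)
{
    v = ((v & 0x00ff00ffU) << 8) | ((v >> 8) & 0x00ff00ffU);
    return (v << 16) | (v >> 16);
}

/* Portable 64-bit byte reversal. */
uint64_t toy_swab64(uint64_t v)
{
    v = ((v & 0x00ff00ff00ff00ffULL) << 8)  | ((v >> 8)  & 0x00ff00ff00ff00ffULL);
    v = ((v & 0x0000ffff0000ffffULL) << 16) | ((v >> 16) & 0x0000ffff0000ffffULL);
    return (v << 32) | (v >> 32);
}

/* A register block whose fields all share one type can be swapped as an array. */
void byteswap_regs(uint64_t *regs, size_t nr_regs)
{
    for (size_t i = 0; i < nr_regs; i++)
        regs[i] = toy_swab64(regs[i]);
}

/* Hypothetical struct with mixed field widths. */
struct toy_hv_state {
    uint64_t version;
    uint32_t lpid;
    uint32_t vcpu_token;
};

/* Mixed-width fields must be swapped one at a time, each at its own width. */
void byteswap_toy_hv_state(struct toy_hv_state *s)
{
    s->version = toy_swab64(s->version);
    s->lpid = toy_swab32(s->lpid);
    s->vcpu_token = toy_swab32(s->vcpu_token);
}
```

      Applying either function twice restores the original values, which is
      why the same routine can serve both after reading and before writing back.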
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      10b5022d
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Sanitise hv_regs on nested guest entry · 73937deb
      Suraj Jitindar Singh authored
      restore_hv_regs() is used to copy the hv_regs L1 wants to set to run the
      nested (L2) guest into the vcpu structure. We need to sanitise these
      values to ensure we don't let the L1 guest hypervisor do things we don't
      want it to.
      
      We don't let data address watchpoints or completed instruction address
      breakpoints be set to match in hypervisor state.
      
      We also don't let L1 enable features in the hypervisor facility status
      and control register (HFSCR) for L2 which we have disabled for L1. That
      is, L2 will get only the features which are both enabled by the L0
      hypervisor for L1 and requested by L1 for L2. This could mean we give
      L1 a hypervisor facility unavailable interrupt for a facility it thinks
      it has enabled; however, it shouldn't have enabled for the L2 guest a
      facility that it doesn't itself have.
      
      We sanitise the registers when copying in the L2 hv_regs. We don't need
      to sanitise when copying back the L1 hv_regs since these shouldn't be
      able to contain invalid values as they're just what was copied out.
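
      The HFSCR rule above amounts to a bitwise intersection. A minimal
      sketch, assuming hypothetical facility bits and function names (the real
      HFSCR bit assignments are not modelled here):

```c
#include <stdint.h>

/* Hypothetical facility bits, for illustration only. */
#define TOY_FAC_A (1ULL << 0)
#define TOY_FAC_B (1ULL << 1)
#define TOY_FAC_C (1ULL << 2)

/* L2 only gets facilities that L1 asked for AND that L0 granted to L1. */
uint64_t sanitise_l2_hfscr(uint64_t l1_requested, uint64_t l0_enabled_for_l1)
{
    return l1_requested & l0_enabled_for_l1;
}

/* Facilities L1 asked for but will not get; using one of these in L2 faults. */
uint64_t hfscr_denied_facilities(uint64_t l1_requested, uint64_t l0_enabled_for_l1)
{
    return l1_requested & ~l0_enabled_for_l1;
}
```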
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      73937deb
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Add one-reg interface to virtual PTCR register · 30323418
      Paul Mackerras authored
      This adds a one-reg register identifier which can be used to read and
      set the virtual PTCR for the guest.  This register identifies the
      address and size of the virtual partition table for the guest, which
      contains information about the nested guests under this guest.
      
      Migrating this value is the only extra requirement for migrating a
      guest which has nested guests (assuming of course that the destination
      host supports nested virtualization in the kvm-hv module).
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      30323418
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Don't access HFSCR, LPIDR or LPCR when running nested · f3c99f97
      Paul Mackerras authored
      When running as a nested hypervisor, this avoids reading hypervisor
      privileged registers (specifically HFSCR, LPIDR and LPCR) at startup;
      instead reasonable default values are used.  This also avoids writing
      LPIDR in the single-vcpu entry/exit path.
      
      Also, this removes the check for CPU_FTR_HVMODE in kvmppc_mmu_hv_init()
      since its only caller already checks this.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f3c99f97
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpu · 9d0b048d
      Suraj Jitindar Singh authored
      This is only done at level 0, since only level 0 knows which physical
      CPU a vcpu is running on.  This does for nested guests what L0 already
      did for its own guests, which is to flush the TLB on a pCPU when it
      goes to run a vCPU there, and there is another vCPU in the same VM
      which previously ran on this pCPU and has now started to run on another
      pCPU.  This is to handle the situation where the other vCPU touched
      a mapping, moved to another pCPU and did a tlbiel (local-only tlbie)
      on that new pCPU and thus left behind a stale TLB entry on this pCPU.
      
      This introduces a limit on the vcpu_token values used in the
      H_ENTER_NESTED hcall -- they must now be less than NR_CPUS.
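
      A toy model of this bookkeeping is sketched below. It is an assumption
      of how such tracking could look, not the patch's implementation: the
      struct, the function names and the small TOY_NR_CPUS bound are
      hypothetical. It shows why vcpu_token needs an upper bound (it indexes a
      fixed-size array) and how a pCPU is marked stale when a vCPU leaves it.

```c
#include <stddef.h>

#define TOY_NR_CPUS 8  /* hypothetical bound standing in for NR_CPUS */

struct toy_nested_guest {
    short prev_cpu[TOY_NR_CPUS];               /* by vcpu_token; -1 = never ran */
    unsigned char need_tlb_flush[TOY_NR_CPUS]; /* by pCPU */
};

void toy_nested_guest_init(struct toy_nested_guest *g)
{
    for (size_t i = 0; i < TOY_NR_CPUS; i++) {
        g->prev_cpu[i] = -1;
        g->need_tlb_flush[i] = 0;
    }
}

/*
 * Called before running the nested vCPU identified by vcpu_token on pcpu.
 * Returns 1 if the TLB must be flushed on pcpu first, 0 if not, and -1 if
 * an argument is out of range.
 */
int toy_prepare_nested_vcpu(struct toy_nested_guest *g, int vcpu_token, int pcpu)
{
    short prev;
    int flush;

    if (vcpu_token < 0 || vcpu_token >= TOY_NR_CPUS ||
        pcpu < 0 || pcpu >= TOY_NR_CPUS)
        return -1;

    prev = g->prev_cpu[vcpu_token];
    if (prev != pcpu) {
        /* The vCPU moved: its old pCPU may hold stale entries. */
        if (prev >= 0)
            g->need_tlb_flush[prev] = 1;
        g->prev_cpu[vcpu_token] = (short)pcpu;
    }

    /* Consume any pending flush request for the pCPU we are about to use. */
    flush = g->need_tlb_flush[pcpu];
    g->need_tlb_flush[pcpu] = 0;
    return flush;
}
```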
      
      [paulus@ozlabs.org - made prev_cpu array be short[] to reduce
       memory consumption.]
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      9d0b048d
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Use hypercalls for TLB invalidation when nested · 690ed4ca
      Paul Mackerras authored
      This adds code to call the H_TLB_INVALIDATE hypercall when running as
      a guest, in the cases where we need to invalidate TLBs (or other MMU
      caches) as part of managing the mappings for a nested guest.  Calling
      H_TLB_INVALIDATE lets the nested hypervisor inform the parent
      hypervisor about changes to partition-scoped page tables or the
      partition table without needing to do hypervisor-privileged tlbie
      instructions.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      690ed4ca
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Implement H_TLB_INVALIDATE hcall · e3b6b466
      Suraj Jitindar Singh authored
      When running a nested (L2) guest the guest (L1) hypervisor will use
      the H_TLB_INVALIDATE hcall when it needs to change the partition
      scoped page tables or the partition table which it manages.  It will
      use this hcall in the situations where it would use a partition-scoped
      tlbie instruction if it were running in hypervisor mode.
      
      The H_TLB_INVALIDATE hcall can invalidate different scopes:
      
      Invalidate TLB for a given target address:
      - This invalidates a single L2 -> L1 pte
      - We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2
        address space which is being invalidated. This is because a single
        L2 -> L1 pte may have been mapped with more than one pte in the
        L2 -> L0 page tables.
      
      Invalidate the entire TLB for a given LPID or for all LPIDs:
      - Invalidate the entire shadow_pgtable for a given nested guest, or
        for all nested guests.
      
      Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs:
      - We don't cache the PWC, so nothing to do.
      
      Invalidate the entire TLB, PWC and partition table for a given/all LPIDs:
      - Here we re-read the partition table entry and remove the nested state
        for any nested guest for which the first doubleword of the partition
        table entry is now zero.
      
      The H_TLB_INVALIDATE hcall takes as parameters the tlbie instruction
      word (of which only the RIC, PRS and R fields are used), the rS value
      (giving the lpid, where required) and the rB value (giving the IS, AP
      and EPN values).
      
      [paulus@ozlabs.org - adapted to having the partition table in guest
      memory, added the H_TLB_INVALIDATE implementation, removed tlbie
      instruction emulation, reworded the commit message.]
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e3b6b466
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappings · 8cf531ed
      Suraj Jitindar Singh authored
      When a host (L0) page which is mapped into a (L1) guest is in turn
      mapped through to a nested (L2) guest we keep a reverse mapping (rmap)
      so that these mappings can be retrieved later.
      
      Whenever we create an entry in a shadow_pgtable for a nested guest we
      create a corresponding rmap entry and add it to the list for the
      L1 guest memslot at the index of the L1 guest page it maps. This means
      at the L1 guest memslot we end up with lists of rmaps.
      
      When we are notified of a host page being invalidated which has been
      mapped through to a (L1) guest, we can then walk the rmap list for that
      guest page, and find and invalidate all of the corresponding
      shadow_pgtable entries.
      
      In order to reduce memory consumption, we compress the information for
      each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits
      for the guest real page frame number -- which will fit in a single
      unsigned long.  To avoid a scenario where a guest can trigger
      unbounded memory allocations, we scan the list when adding an entry to
      see if there is already an entry with the contents we need.  This can
      occur, because we don't ever remove entries from the middle of a list.
      
      A struct nested guest rmap is a list pointer and an rmap entry;
      ----------------
      | next pointer |
      ----------------
      | rmap entry   |
      ----------------
      
      Thus the rmap pointer for each guest frame number in the memslot can be
      either NULL, a single entry, or a pointer to a list of nested rmap entries.
      
      gfn	 memslot rmap array
       	-------------------------
       0	| NULL			|	(no rmap entry)
       	-------------------------
       1	| single rmap entry	|	(rmap entry with low bit set)
       	-------------------------
       2	| list head pointer	|	(list of rmap entries)
       	-------------------------
      
      The final entry always has the lowest bit set and is stored in the next
      pointer of the last list entry, or as a single rmap entry.
      With a list of rmap entries looking like;
      
      -----------------	-----------------	-------------------------
      | list head ptr	| ----> | next pointer	| ---->	| single rmap entry	|
      -----------------	-----------------	-------------------------
      			| rmap entry	|	| rmap entry		|
      			-----------------	-------------------------
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8cf531ed
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Handle page fault for a nested guest · fd10be25
      Suraj Jitindar Singh authored
      Consider a normal (L1) guest running under the main hypervisor (L0),
      and then a nested guest (L2) running under the L1 guest which is acting
      as a nested hypervisor. L0 has page tables to map the address space for
      L1 providing the translation from L1 real address -> L0 real address;
      
      	L1
      	|
      	| (L1 -> L0)
      	|
      	----> L0
      
      There are also page tables in L1 used to map the address space for L2
      providing the translation from L2 real address -> L1 real address. Since
      the hardware can only walk a single level of page table, we need to
      maintain in L0 a "shadow_pgtable" for L2 which provides the translation
      from L2 real address -> L0 real address, which looks like:
      
      	L2				L2
      	|				|
      	| (L2 -> L1)			|
      	|				|
      	----> L1			| (L2 -> L0)
      	      |				|
      	      | (L1 -> L0)		|
      	      |				|
      	      ----> L0			--------> L0
      
      When a page fault occurs while running a nested (L2) guest we need to
      insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To
      do this we need to:
      
      1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and
         provide a page fault to L1 if this mapping doesn't exist.
      2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address,
         or try to insert a pte for that mapping if it doesn't exist.
      3. Now we have a L2 -> L0 mapping, insert this into our shadow_pgtable
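
      The three steps above can be modelled with flat arrays standing in for
      the page tables. This is a deliberately simplified sketch: the struct,
      enum and function names are hypothetical, real radix page tables are
      multi-level, and where L0 would try to fault in a missing L1 -> L0
      mapping this model only reports that it is missing.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NPAGES   16
#define TOY_UNMAPPED UINT64_MAX

enum toy_fault_result {
    TOY_SHADOW_INSERTED,    /* step 3 completed */
    TOY_REFLECT_TO_L1,      /* step 1 failed: L1 has no L2 -> L1 mapping */
    TOY_L0_MAPPING_MISSING  /* step 2 failed: L0 must first map the L1 page */
};

struct toy_mmu {
    uint64_t l2_to_l1[TOY_NPAGES]; /* lives in L1 memory, owned by L1 */
    uint64_t l1_to_l0[TOY_NPAGES]; /* L0's table for the L1 guest */
    uint64_t shadow[TOY_NPAGES];   /* L0's combined L2 -> L0 table */
};

void toy_mmu_init(struct toy_mmu *m)
{
    for (size_t i = 0; i < TOY_NPAGES; i++) {
        m->l2_to_l1[i] = TOY_UNMAPPED;
        m->l1_to_l0[i] = TOY_UNMAPPED;
        m->shadow[i] = TOY_UNMAPPED;
    }
}

enum toy_fault_result toy_handle_nested_fault(struct toy_mmu *m, uint64_t l2_pfn)
{
    uint64_t l1_pfn, l0_pfn;

    /* Step 1: walk L1's table for the L2 -> L1 translation. */
    if (l2_pfn >= TOY_NPAGES)
        return TOY_REFLECT_TO_L1;
    l1_pfn = m->l2_to_l1[l2_pfn];
    if (l1_pfn == TOY_UNMAPPED || l1_pfn >= TOY_NPAGES)
        return TOY_REFLECT_TO_L1;

    /* Step 2: translate the L1 page through L0's own table. */
    l0_pfn = m->l1_to_l0[l1_pfn];
    if (l0_pfn == TOY_UNMAPPED)
        return TOY_L0_MAPPING_MISSING;

    /* Step 3: record the combined L2 -> L0 translation. */
    m->shadow[l2_pfn] = l0_pfn;
    return TOY_SHADOW_INSERTED;
}
```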
      
      Once this mapping exists we can take rc faults when hardware is unable
      to automatically set the reference and change bits in the pte. On these
      we need to:
      
      1. Check the rc bits on the L2 -> L1 pte match, and otherwise reflect
         the fault down to L1.
      2. Set the rc bits in the L1 -> L0 pte which corresponds to the same
         host page.
      3. Set the rc bits in the L2 -> L0 pte.
      
      As we reuse a large number of functions in book3s_64_mmu_radix.c for
      this we also needed to refactor a number of these functions to take
      an lpid parameter so that the correct lpid is used for tlb invalidations.
      The functionality however has remained the same.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      fd10be25
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Handle hypercalls correctly when nested · 4bad7779
      Paul Mackerras authored
      When we are running as a nested hypervisor, we use a hypercall to
      enter the guest rather than code in book3s_hv_rmhandlers.S.  This means
      that the hypercall handlers listed in hcall_real_table never get called.
      There are some hypercalls that are handled there and not in
      kvmppc_pseries_do_hcall(), which therefore won't get processed for
      a nested guest.
      
      To fix this, we add cases to kvmppc_pseries_do_hcall() to handle those
      hypercalls, with the following exceptions:
      
      - The HPT hypercalls (H_ENTER, H_REMOVE, etc.) are not handled because
        we only support radix mode for nested guests.
      
      - H_CEDE has to be handled specially because the cede logic in
        kvmhv_run_single_vcpu assumes that it has been processed by the time
        that kvmhv_p9_guest_entry() returns.  Therefore we put a special
        case for H_CEDE in kvmhv_p9_guest_entry().
      
      For the XICS hypercalls, if real-mode processing is enabled, then the
      virtual-mode handlers assume that they are being called only to finish
      up the operation.  Therefore we turn off the real-mode flag in the XICS
      code when running as a nested hypervisor.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4bad7779
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Use XICS hypercalls when running as a nested hypervisor · f3c18e93
      Paul Mackerras authored
      This adds code to call the H_IPI and H_EOI hypercalls when we are
      running as a nested hypervisor (i.e. without the CPU_FTR_HVMODE cpu
      feature) and we would otherwise access the XICS interrupt controller
      directly or via an OPAL call.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f3c18e93
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Nested guest entry via hypercall · 360cae31
      Paul Mackerras authored
      This adds a new hypercall, H_ENTER_NESTED, which is used by a nested
      hypervisor to enter one of its nested guests.  The hypercall supplies
      register values in two structs.  Those values are copied by the level 0
      (L0) hypervisor (the one which is running in hypervisor mode) into the
      vcpu struct of the L1 guest, and then the guest is run until an
      interrupt or error occurs which needs to be reported to L1 via the
      hypercall return value.
      
      Currently this assumes that the L0 and L1 hypervisors are the same
      endianness, and the structs passed as arguments are in native
      endianness.  If they are of different endianness, the version number
      check will fail and the hcall will be rejected.
      
      Nested hypervisors do not support indep_threads_mode=N, so this adds
      code to print a warning message if the administrator has set
      indep_threads_mode=N, and treat it as Y.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      360cae31