1. 16 Oct, 2018 19 commits
  2. 13 Oct, 2018 6 commits
  3. 10 Oct, 2018 1 commit
    • Paolo Bonzini's avatar
      Merge tag 'kvm-ppc-next-4.20-1' of... · 7dd2157c
      Paolo Bonzini authored
      Merge tag 'kvm-ppc-next-4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
      
      PPC KVM update for 4.20.
      
      The major new feature here is nested HV KVM support.  This allows the
      HV KVM module to load inside a radix guest on POWER9 and run radix
      guests underneath it.  These nested guests can run in supervisor mode
      and don't require any additional instructions to be emulated, unlike
      with PR KVM, and so performance is much better than with PR KVM, and
      is very close to the performance of a non-nested guest.  A nested
      hypervisor (a guest with nested guests) can be migrated to another
      host and will bring all its nested guests along with it.  A nested
      guest can also itself run guests, and so on down to any desired depth
      of nesting.
      
      Apart from that there are a series of updates for IOMMU handling from
      Alexey Kardashevskiy, a "one VM per core" mode for HV KVM for
      security-paranoid applications, and a small fix for PR KVM.
      7dd2157c
  4. 09 Oct, 2018 14 commits
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Add NO_HASH flag to GET_SMMU_INFO ioctl result · 901f8c3f
      Paul Mackerras authored
      This adds a KVM_PPC_NO_HASH flag to the flags field of the
      kvm_ppc_smmu_info struct, and arranges for it to be set when
      running as a nested hypervisor, as an unambiguous indication
      to userspace that HPT guests are not supported.  Reporting the
      KVM_CAP_PPC_MMU_HASH_V3 capability as false could be taken as
      indicating only that the new HPT features in ISA V3.0 are not
      supported, leaving it ambiguous whether pre-V3.0 HPT features
      are supported.
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      901f8c3f
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization · aa069a99
      Paul Mackerras authored
      With this, userspace can enable a KVM-HV guest to run nested guests
      under it.
      
      The administrator can control whether any nested guests can be run;
      setting the "nested" module parameter to false prevents any guests
      becoming nested hypervisors (that is, any attempt to enable the nested
      capability on a guest will fail).  Guests which are already nested
      hypervisors will continue to be so.
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      aa069a99
    • Paul Mackerras's avatar
      Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next · 9d67121a
      Paul Mackerras authored
      This merges in the "ppc-kvm" topic branch of the powerpc tree to get a
      series of commits that touch both general arch/powerpc code and KVM
      code.  These commits will be merged both via the KVM tree and the
      powerpc tree.
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      9d67121a
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Add nested shadow page tables to debugfs · 83a05510
      Paul Mackerras authored
      This adds a list of valid shadow PTEs for each nested guest to
      the 'radix' file for the guest in debugfs.  This can be useful for
      debugging.
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      83a05510
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Allow HV module to load without hypervisor mode · de760db4
      Paul Mackerras authored
      With this, the KVM-HV module can be loaded in a guest running under
      KVM-HV, and if the hypervisor supports nested virtualization, this
      guest can now act as a nested hypervisor and run nested guests.
      
      This also adds some checks to inform userspace that HPT guests are not
      supported by nested hypervisors (by returning false for the
      KVM_CAP_PPC_MMU_HASH_V3 capability), and to prevent userspace from
      configuring a guest to use HPT mode.
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      de760db4
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Handle differing endianness for H_ENTER_NESTED · 10b5022d
      Suraj Jitindar Singh authored
      The hcall H_ENTER_NESTED takes two parameters: the address in L1 guest
      memory of a hv_regs struct and the address of a pt_regs struct.  The
      hcall requests the L0 hypervisor to use the register values in these
      structs to run a L2 guest and to return the exit state of the L2 guest
      in these structs.  These are in the endianness of the L1 guest, rather
      than being always big-endian as is usually the case for PAPR
      hypercalls.
      
      This is convenient because it means that the L1 guest can pass the
      address of the regs field in its kvm_vcpu_arch struct.  This also
      improves performance slightly by avoiding the need for two copies of
      the pt_regs struct.
      
      When reading/writing these structures, this patch handles the case
      where the endianness of the L1 guest differs from that of the L0
      hypervisor, by byteswapping the structures after reading and before
      writing them back.
      
      Since all the fields of the pt_regs are of the same type, i.e.,
      unsigned long, we treat it as an array of unsigned longs.  The fields
      of struct hv_guest_state are not all the same, so its fields are
      byteswapped individually.
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      10b5022d
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Sanitise hv_regs on nested guest entry · 73937deb
      Suraj Jitindar Singh authored
      restore_hv_regs() is used to copy the hv_regs L1 wants to set to run the
      nested (L2) guest into the vcpu structure. We need to sanitise these
      values to ensure we don't let the L1 guest hypervisor do things we don't
      want it to.
      
      We don't let data address watchpoints or completed instruction address
      breakpoints be set to match in hypervisor state.
      
      We also don't let L1 enable features in the hypervisor facility status
      and control register (HFSCR) for L2 which we have disabled for L1. That
      is L2 will get the subset of features which the L0 hypervisor has
      enabled for L1 and the features L1 wants to enable for L2. This could
      mean we give L1 a hypervisor facility unavailable interrupt for a
      facility it thinks it has enabled, however it shouldn't have enabled a
      facility it itself doesn't have for the L2 guest.
      
      We sanitise the registers when copying in the L2 hv_regs. We don't need
      to sanitise when copying back the L1 hv_regs since these shouldn't be
      able to contain invalid values as they're just what was copied out.
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      73937deb
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Add one-reg interface to virtual PTCR register · 30323418
      Paul Mackerras authored
      This adds a one-reg register identifier which can be used to read and
      set the virtual PTCR for the guest.  This register identifies the
      address and size of the virtual partition table for the guest, which
      contains information about the nested guests under this guest.
      
      Migrating this value is the only extra requirement for migrating a
      guest which has nested guests (assuming of course that the destination
      host supports nested virtualization in the kvm-hv module).
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      30323418
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Don't access HFSCR, LPIDR or LPCR when running nested · f3c99f97
      Paul Mackerras authored
      When running as a nested hypervisor, this avoids reading hypervisor
      privileged registers (specifically HFSCR, LPIDR and LPCR) at startup;
      instead reasonable default values are used.  This also avoids writing
      LPIDR in the single-vcpu entry/exit path.
      
      Also, this removes the check for CPU_FTR_HVMODE in kvmppc_mmu_hv_init()
      since its only caller already checks this.
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      f3c99f97
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpu · 9d0b048d
      Suraj Jitindar Singh authored
      This is only done at level 0, since only level 0 knows which physical
      CPU a vcpu is running on.  This does for nested guests what L0 already
      did for its own guests, which is to flush the TLB on a pCPU when it
      goes to run a vCPU there, and there is another vCPU in the same VM
      which previously ran on this pCPU and has now started to run on another
      pCPU.  This is to handle the situation where the other vCPU touched
      a mapping, moved to another pCPU and did a tlbiel (local-only tlbie)
      on that new pCPU and thus left behind a stale TLB entry on this pCPU.
      
      This introduces a limit on the the vcpu_token values used in the
      H_ENTER_NESTED hcall -- they must now be less than NR_CPUS.
      
      [paulus@ozlabs.org - made prev_cpu array be short[] to reduce
       memory consumption.]
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      9d0b048d
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Use hypercalls for TLB invalidation when nested · 690ed4ca
      Paul Mackerras authored
      This adds code to call the H_TLB_INVALIDATE hypercall when running as
      a guest, in the cases where we need to invalidate TLBs (or other MMU
      caches) as part of managing the mappings for a nested guest.  Calling
      H_TLB_INVALIDATE lets the nested hypervisor inform the parent
      hypervisor about changes to partition-scoped page tables or the
      partition table without needing to do hypervisor-privileged tlbie
      instructions.
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      690ed4ca
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Implement H_TLB_INVALIDATE hcall · e3b6b466
      Suraj Jitindar Singh authored
      When running a nested (L2) guest the guest (L1) hypervisor will use
      the H_TLB_INVALIDATE hcall when it needs to change the partition
      scoped page tables or the partition table which it manages.  It will
      use this hcall in the situations where it would use a partition-scoped
      tlbie instruction if it were running in hypervisor mode.
      
      The H_TLB_INVALIDATE hcall can invalidate different scopes:
      
      Invalidate TLB for a given target address:
      - This invalidates a single L2 -> L1 pte
      - We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2
        address space which is being invalidated. This is because a single
        L2 -> L1 pte may have been mapped with more than one pte in the
        L2 -> L0 page tables.
      
      Invalidate the entire TLB for a given LPID or for all LPIDs:
      - Invalidate the entire shadow_pgtable for a given nested guest, or
        for all nested guests.
      
      Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs:
      - We don't cache the PWC, so nothing to do.
      
      Invalidate the entire TLB, PWC and partition table for a given/all LPIDs:
      - Here we re-read the partition table entry and remove the nested state
        for any nested guest for which the first doubleword of the partition
        table entry is now zero.
      
      The H_TLB_INVALIDATE hcall takes as parameters the tlbie instruction
      word (of which only the RIC, PRS and R fields are used), the rS value
      (giving the lpid, where required) and the rB value (giving the IS, AP
      and EPN values).
      
      [paulus@ozlabs.org - adapted to having the partition table in guest
      memory, added the H_TLB_INVALIDATE implementation, removed tlbie
      instruction emulation, reworded the commit message.]
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      e3b6b466
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappings · 8cf531ed
      Suraj Jitindar Singh authored
      When a host (L0) page which is mapped into a (L1) guest is in turn
      mapped through to a nested (L2) guest we keep a reverse mapping (rmap)
      so that these mappings can be retrieved later.
      
      Whenever we create an entry in a shadow_pgtable for a nested guest we
      create a corresponding rmap entry and add it to the list for the
      L1 guest memslot at the index of the L1 guest page it maps. This means
      at the L1 guest memslot we end up with lists of rmaps.
      
      When we are notified of a host page being invalidated which has been
      mapped through to a (L1) guest, we can then walk the rmap list for that
      guest page, and find and invalidate all of the corresponding
      shadow_pgtable entries.
      
      In order to reduce memory consumption, we compress the information for
      each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits
      for the guest real page frame number -- which will fit in a single
      unsigned long.  To avoid a scenario where a guest can trigger
      unbounded memory allocations, we scan the list when adding an entry to
      see if there is already an entry with the contents we need.  This can
      occur, because we don't ever remove entries from the middle of a list.
      
      A struct nested guest rmap is a list pointer and an rmap entry;
      ----------------
      | next pointer |
      ----------------
      | rmap entry   |
      ----------------
      
      Thus the rmap pointer for each guest frame number in the memslot can be
      either NULL, a single entry, or a pointer to a list of nested rmap entries.
      
      gfn	 memslot rmap array
       	-------------------------
       0	| NULL			|	(no rmap entry)
       	-------------------------
       1	| single rmap entry	|	(rmap entry with low bit set)
       	-------------------------
       2	| list head pointer	|	(list of rmap entries)
       	-------------------------
      
      The final entry always has the lowest bit set and is stored in the next
      pointer of the last list entry, or as a single rmap entry.
      With a list of rmap entries looking like;
      
      -----------------	-----------------	-------------------------
      | list head ptr	| ----> | next pointer	| ---->	| single rmap entry	|
      -----------------	-----------------	-------------------------
      			| rmap entry	|	| rmap entry		|
      			-----------------	-------------------------
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      8cf531ed
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Handle page fault for a nested guest · fd10be25
      Suraj Jitindar Singh authored
      Consider a normal (L1) guest running under the main hypervisor (L0),
      and then a nested guest (L2) running under the L1 guest which is acting
      as a nested hypervisor. L0 has page tables to map the address space for
      L1 providing the translation from L1 real address -> L0 real address;
      
      	L1
      	|
      	| (L1 -> L0)
      	|
      	----> L0
      
      There are also page tables in L1 used to map the address space for L2
      providing the translation from L2 real address -> L1 read address. Since
      the hardware can only walk a single level of page table, we need to
      maintain in L0 a "shadow_pgtable" for L2 which provides the translation
      from L2 real address -> L0 real address. Which looks like;
      
      	L2				L2
      	|				|
      	| (L2 -> L1)			|
      	|				|
      	----> L1			| (L2 -> L0)
      	      |				|
      	      | (L1 -> L0)		|
      	      |				|
      	      ----> L0			--------> L0
      
      When a page fault occurs while running a nested (L2) guest we need to
      insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To
      do this we need to:
      
      1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and
         provide a page fault to L1 if this mapping doesn't exist.
      2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address,
         or try to insert a pte for that mapping if it doesn't exist.
      3. Now we have a L2 -> L0 mapping, insert this into our shadow_pgtable
      
      Once this mapping exists we can take rc faults when hardware is unable
      to automatically set the reference and change bits in the pte. On these
      we need to:
      
      1. Check the rc bits on the L2 -> L1 pte match, and otherwise reflect
         the fault down to L1.
      2. Set the rc bits in the L1 -> L0 pte which corresponds to the same
         host page.
      3. Set the rc bits in the L2 -> L0 pte.
      
      As we reuse a large number of functions in book3s_64_mmu_radix.c for
      this we also needed to refactor a number of these functions to take
      an lpid parameter so that the correct lpid is used for tlb invalidations.
      The functionality however has remained the same.
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      fd10be25