1. 24 Dec, 2018 2 commits
  2. 22 Dec, 2018 16 commits
  3. 21 Dec, 2018 22 commits
    • Alexey Kardashevskiy's avatar
      vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver · 7f928917
      Alexey Kardashevskiy authored
      POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
      pluggable PCIe devices but still have PCIe links which are used
      for config space and MMIO. In addition to that the GPUs have 6 NVLinks
      which are connected to other GPUs and the POWER9 CPU. POWER9 chips
      have a special unit on a die called an NPU which is an NVLink2 host bus
      adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
      These systems also support ATS (address translation services) which is
      a part of the NVLink2 protocol. Such GPUs also share on-board RAM
      (16GB or 32GB) to the system via the same NVLink2 so a CPU has
      cache-coherent access to a GPU RAM.
      
      This exports GPU RAM to the userspace as a new VFIO device region. This
      preregisters the new memory as device memory as it might be used for DMA.
      This inserts pfns from the fault handler as the GPU memory is not onlined
      until the vendor driver is loaded and trained the NVLinks so doing this
      earlier causes low level errors which we fence in the firmware so
      it does not hurt the host system but still better be avoided; for the same
      reason this does not map GPU RAM into the host kernel (usual thing for
      emulated access otherwise).
      
      This exports an ATSD (Address Translation Shootdown) register of NPU which
      allows TLB invalidations inside GPU for an operating system. The register
      conveniently occupies a single 64k page. It is also presented to
      the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
      each of them can be used for TLB invalidation in a GPU linked to this NPU.
      This allocates one ATSD register per an NVLink bridge allowing passing
      up to 6 registers. Due to the host firmware bug (just recently fixed),
      only 1 ATSD register per NPU was actually advertised to the host system
      so this passes that alone register via the first NVLink bridge device in
      the group which is still enough as QEMU collects them all back and
      presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
      
      In order to provide the userspace with the information about GPU-to-NVLink
      connections, this exports an additional capability called "tgt"
      (which is an abbreviated host system bus address). The "tgt" property
      tells the GPU its own system address and allows the guest driver to
      conglomerate the routing information so each GPU knows how to get directly
      to the other GPUs.
      
      For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
      know LPID (a logical partition ID or a KVM guest hardware ID in other
      words) and PID (a memory context ID of a userspace process, not to be
      confused with a linux pid). This assigns a GPU to LPID in the NPU and
      this is why this adds a listener for KVM on an IOMMU group. A PID comes
      via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
      
      This requires coherent memory and ATSD to be available on the host as
      the GPU vendor only supports configurations with both features enabled
      and other configurations are known not to work. Because of this and
      because of the ways the features are advertised to the host system
      (which is a device tree with very platform specific properties),
      this requires enabled POWERNV platform.
      
      The V100 GPUs do not advertise any of these capabilities via the config
      space and there are more than just one device ID so this relies on
      the platform to tell whether these GPUs have special abilities such as
      NVLinks.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      7f928917
    • Alexey Kardashevskiy's avatar
      vfio_pci: Allow regions to add own capabilities · c2c0f1cd
      Alexey Kardashevskiy authored
      VFIO regions already support region capabilities with a limited set of
      fields. However the subdriver might have to report to the userspace
      additional bits.
      
      This adds an add_capability() hook to vfio_pci_regops.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      c2c0f1cd
    • Alexey Kardashevskiy's avatar
      vfio_pci: Allow mapping extra regions · a15b1883
      Alexey Kardashevskiy authored
      So far we only allowed mapping of MMIO BARs to the userspace. However
      there are GPUs with on-board coherent RAM accessible via side
      channels which we also want to map to the userspace. The first client
      for this is NVIDIA V100 GPU with NVLink2 direct links to a POWER9
      NPU-enabled CPU; such GPUs have 16GB RAM which is coherently mapped
      to the system address space, we are going to export these as an extra
      PCI region.
      
      We already support extra PCI regions and this adds support for mapping
      them to the userspace.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Acked-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      a15b1883
    • Alexey Kardashevskiy's avatar
      powerpc/powernv/npu: Fault user page into the hypervisor's pagetable · 58629c0d
      Alexey Kardashevskiy authored
      When a page fault happens in a GPU, the GPU signals the OS and the GPU
      driver calls the fault handler which populated a page table; this allows
      the GPU to complete an ATS request.
      
      On the bare metal get_user_pages() is enough as it adds a pte to
      the kernel page table but under KVM the partition scope tree does not get
      updated so ATS will still fail.
      
      This reads a byte from an effective address which causes HV storage
      interrupt and KVM updates the partition scope tree.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      58629c0d
    • Alexey Kardashevskiy's avatar
      powerpc/powernv/npu: Check mmio_atsd array bounds when populating · 135ef954
      Alexey Kardashevskiy authored
      A broken device tree might contain more than 8 values and introduce hard
      to debug memory corruption bug. This adds the boundary check.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      135ef954
    • Alexey Kardashevskiy's avatar
      powerpc/powernv/npu: Add release_ownership hook · 1b785611
      Alexey Kardashevskiy authored
      In order to make ATS work and translate addresses for arbitrary
      LPID and PID, we need to program an NPU with LPID and allow PID wildcard
      matching with a specific MSR mask.
      
      This implements a helper to assign a GPU to LPAR and program the NPU
      with a wildcard for PID and a helper to do clean-up. The helper takes
      MSR (only DR/HV/PR/SF bits are allowed) to program them into NPU2 for
      ATS checkout requests support.
      
      This exports pnv_npu2_unmap_lpar_dev() as following patches will use it
      from the VFIO driver.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      1b785611
    • Alexey Kardashevskiy's avatar
      powerpc/powernv/npu: Add compound IOMMU groups · 0bd97167
      Alexey Kardashevskiy authored
      At the moment the powernv platform registers an IOMMU group for each
      PE. There is an exception though: an NVLink bridge which is attached
      to the corresponding GPU's IOMMU group making it a master.
      
      Now we have POWER9 systems with GPUs connected to each other directly
      bypassing PCI. At the moment we do not control state of these links so
      we have to put such interconnected GPUs to one IOMMU group which means
      that the old scheme with one GPU as a master won't work - there will
      be up to 3 GPUs in such group.
      
      This introduces a npu_comp struct which represents a compound IOMMU
      group made of multiple PEs - PCI PEs (for GPUs) and NPU PEs (for
      NVLink bridges). This converts the existing NVLink1 code to use the
      new scheme. >From now on, each PE must have a valid
      iommu_table_group_ops which will either be called directly (for a
      single PE group) or indirectly from a compound group handlers.
      
      This moves IOMMU group registration for NVLink-connected GPUs to
      npu-dma.c. For POWER8, this stores a new compound group pointer in the
      PE (so a GPU is still a master); for POWER9 the new group pointer is
      stored in an NPU (which is allocated per a PCI host controller).
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      [mpe: Initialise npdev to NULL in pnv_try_setup_npu_table_group()]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      0bd97167
    • Alexey Kardashevskiy's avatar
      powerpc/powernv/npu: Convert NPU IOMMU helpers to iommu_table_group_ops · 83fb8ccf
      Alexey Kardashevskiy authored
      At the moment NPU IOMMU is manipulated directly from the IODA2 PCI
      PE code; PCI PE acts as a master to NPU PE. Soon we will have compound
      IOMMU groups with several PEs from several different PHB (such as
      interconnected GPUs and NPUs) so there will be no single master but
      a one big IOMMU group.
      
      This makes a first step and converts an NPU PE with a set of extern
      function to a table group.
      
      This should cause no behavioral change. Note that
      pnv_npu_release_ownership() has never been implemented.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      83fb8ccf
    • Alexey Kardashevskiy's avatar
      powerpc/powernv/npu: Move single TVE handling to NPU PE · b04149c2
      Alexey Kardashevskiy authored
      Normal PCI PEs have 2 TVEs, one per a DMA window; however NPU PE has only
      one which points to one of two tables of the corresponding PCI PE.
      
      So whenever a new DMA window is programmed to PEs, the NPU PE needs to
      release old table in order to use the new one.
      
      Commit d41ce7b1 ("powerpc/powernv/npu: Do not try invalidating 32bit
      table when 64bit table is enabled") did just that but in pci-ioda.c
      while it actually belongs to npu-dma.c.
      
      This moves the single TVE handling to npu-dma.c. This does not implement
      restoring though as it is highly unlikely that we can set the table to
      PCI PE and cannot to NPU PE and if that fails, we could only set 32bit
      table to NPU PE and this configuration is not really supported or wanted.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      b04149c2
    • Alexey Kardashevskiy's avatar
      powerpc/powernv: Reference iommu_table while it is linked to a group · 847e6563
      Alexey Kardashevskiy authored
      The iommu_table pointer stored in iommu_table_group may get stale
      by accident, this adds referencing and removes a redundant comment
      about this.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      847e6563
    • Alexey Kardashevskiy's avatar
      powerpc/iommu_api: Move IOMMU groups setup to a single place · 5eada8a3
      Alexey Kardashevskiy authored
      Registering new IOMMU groups and adding devices to them are separated in
      code and the latter is dug in the DMA setup code which it does not
      really belong to.
      
      This moved IOMMU groups setup to a separate helper which registers a group
      and adds devices as before. This does not make a difference as IOMMU
      groups are not used anyway; the only dependency here is that
      iommu_add_device() requires a valid pointer to an iommu_table
      (set by set_iommu_table_base()).
      
      To keep the old behaviour, this does not add new IOMMU groups for PEs
      with no DMA weight and also skips NVLink bridges which do not have
      pci_controller_ops::setup_bridge (the normal way of adding PEs).
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      5eada8a3
    • Alexey Kardashevskiy's avatar
      powerpc/powernv/pseries: Rework device adding to IOMMU groups · c4e9d3c1
      Alexey Kardashevskiy authored
      The powernv platform registers IOMMU groups and adds devices to them
      from the pci_controller_ops::setup_bridge() hook except one case when
      virtual functions (SRIOV VFs) are added from a bus notifier.
      
      The pseries platform registers IOMMU groups from
      the pci_controller_ops::dma_bus_setup() hook and adds devices from
      the pci_controller_ops::dma_dev_setup() hook. The very same bus notifier
      used for powernv does not add devices for pseries though as
      __of_scan_bus() adds devices first, then it does the bus/dev DMA setup.
      
      Both platforms use iommu_add_device() which takes a device and expects
      it to have a valid IOMMU table struct with an iommu_table_group pointer
      which in turn points the iommu_group struct (which represents
      an IOMMU group). Although the helper seems easy to use, it relies on
      some pre-existing device configuration and associated data structures
      which it does not really need.
      
      This simplifies iommu_add_device() to take the table_group pointer
      directly. Pseries already has a table_group pointer handy and the bus
      notified is not used anyway. For powernv, this copies the existing bus
      notifier, makes it work for powernv only which means an easy way of
      getting to the table_group pointer. This was tested on VFs but should
      also support physical PCI hotplug.
      
      Since iommu_add_device() receives the table_group pointer directly,
      pseries does not do TCE cache invalidation (the hypervisor does) nor
      allow multiple groups per a VFIO container (in other words sharing
      an IOMMU table between partitionable endpoints), this removes
      iommu_table_group_link from pseries.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      c4e9d3c1
    • Alexey Kardashevskiy's avatar
      powerpc/pseries: Remove IOMMU API support for non-LPAR systems · c409c631
      Alexey Kardashevskiy authored
      The pci_dma_bus_setup_pSeries and pci_dma_dev_setup_pSeries hooks are
      registered for the pseries platform which does not have FW_FEATURE_LPAR;
      these would be pre-powernv platforms which we never supported PCI pass
      through for anyway so remove it.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      c409c631
    • Alexey Kardashevskiy's avatar
      powerpc/pseries/npu: Enable platform support · 3be2df00
      Alexey Kardashevskiy authored
      We already changed NPU API for GPUs to not to call OPAL and the remaining
      bit is initializing NPU structures.
      
      This searches for POWER9 NVLinks attached to any device on a PHB and
      initializes an NPU structure if any found.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      3be2df00
    • Alexey Kardashevskiy's avatar
      powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation · 68c0449e
      Alexey Kardashevskiy authored
      We might have memory@ nodes with "linux,usable-memory" set to zero
      (for example, to replicate powernv's behaviour for GPU coherent memory)
      which means that the memory needs an extra initialization but since
      it can be used afterwards, the pseries platform will try mapping it
      for DMA so the DMA window needs to cover those memory regions too;
      if the window cannot cover new memory regions, the memory onlining fails.
      
      This walks through the memory nodes to find the highest RAM address to
      let a huge DMA window cover that too in case this memory gets onlined
      later.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      68c0449e
    • Alexey Kardashevskiy's avatar
      powerpc/powernv/npu: Move OPAL calls away from context manipulation · 0e759bd7
      Alexey Kardashevskiy authored
      When introduced, the NPU context init/destroy helpers called OPAL which
      enabled/disabled PID (a userspace memory context ID) filtering in an NPU
      per a GPU; this was a requirement for P9 DD1.0. However newer chip
      revision added a PID wildcard support so there is no more need to
      call OPAL every time a new context is initialized. Also, since the PID
      wildcard support was added, skiboot does not clear wildcard entries
      in the NPU so these remain in the hardware till the system reboot.
      
      This moves LPID and wildcard programming to the PE setup code which
      executes once during the booting process so NPU2 context init/destroy
      won't need to do additional configuration.
      
      This replaces the check for FW_FEATURE_OPAL with a check for npu!=NULL as
      this is the way to tell if the NPU support is present and configured.
      
      This moves pnv_npu2_init() declaration as pseries should be able to use it.
      This keeps pnv_npu2_map_lpar() in powernv as pseries is not allowed to
      call that. This exports pnv_npu2_map_lpar_dev() as following patches
      will use it from the VFIO driver.
      
      While at it, replace redundant list_for_each_entry_safe() with
      a simpler list_for_each_entry().
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      0e759bd7
    • Alexey Kardashevskiy's avatar
      powerpc/powernv: Move npu struct from pnv_phb to pci_controller · 46a1449d
      Alexey Kardashevskiy authored
      The powernv PCI code stores NPU data in the pnv_phb struct. The latter
      is referenced by pci_controller::private_data. We are going to have NPU2
      support in the pseries platform as well but it does not store any
      private_data in in the pci_controller struct; and even if it did,
      it would be a different data structure.
      
      This makes npu a pointer and stores it one level higher in
      the pci_controller struct.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      46a1449d
    • Alexey Kardashevskiy's avatar
      powerpc/vfio/iommu/kvm: Do not pin device memory · c10c21ef
      Alexey Kardashevskiy authored
      This new memory does not have page structs as it is not plugged to
      the host so gup() will fail anyway.
      
      This adds 2 helpers:
      - mm_iommu_newdev() to preregister the "memory device" memory so
      the rest of API can still be used;
      - mm_iommu_is_devmem() to know if the physical address is one of thise
      new regions which we must avoid unpinning of.
      
      This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
      if the memory is device memory to avoid pfn_to_page().
      
      This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which
      does delayed pages dirtying.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      c10c21ef
    • Alexey Kardashevskiy's avatar
      powerpc/mm/iommu/vfio_spapr_tce: Change mm_iommu_get to reference a region · e0bf78b0
      Alexey Kardashevskiy authored
      Normally mm_iommu_get() should add a reference and mm_iommu_put() should
      remove it. However historically mm_iommu_find() does the referencing and
      mm_iommu_get() is doing allocation and referencing.
      
      We are going to add another helper to preregister device memory so
      instead of having mm_iommu_new() (which pre-registers the normal memory
      and references the region), we need separate helpers for pre-registering
      and referencing.
      
      This renames:
      - mm_iommu_get to mm_iommu_new;
      - mm_iommu_find to mm_iommu_get.
      
      This changes mm_iommu_get() to reference the region so the name now
      reflects what it does.
      
      This removes the check for exact match from mm_iommu_new() as we want it
      to fail on existing regions; mm_iommu_get() should be used instead.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      e0bf78b0
    • Alexey Kardashevskiy's avatar
      powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2 · ab7032e7
      Alexey Kardashevskiy authored
      The skiboot firmware has a hot reset handler which fences the NVIDIA V100
      GPU RAM on Witherspoons and makes accesses no-op instead of throwing HMIs:
      https://github.com/open-power/skiboot/commit/fca2b2b839a67
      
      Now we are going to pass V100 via VFIO which most certainly involves
      KVM guests which are often terminated without getting a chance to offline
      GPU RAM so we end up with a running machine with misconfigured memory.
      Accessing this memory produces hardware management interrupts (HMI)
      which bring the host down.
      
      To suppress HMIs, this wires up this hot reset hook to vfio_pci_disable()
      via pci_disable_device() which switches NPU2 to a safe mode and prevents
      HMIs.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: default avatarAlistair Popple <alistair@popple.id.au>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      ab7032e7
    • Christophe Leroy's avatar
      powerpc/mm: Fix reporting of kernel execute faults on the 8xx · ffca395b
      Christophe Leroy authored
      On the 8xx, no-execute is set via PPP bits in the PTE. Therefore
      a no-exec fault generates DSISR_PROTFAULT error bits,
      not DSISR_NOEXEC_OR_G.
      
      This patch adds DSISR_PROTFAULT in the test mask.
      
      Fixes: d3ca5874 ("powerpc/mm: Fix reporting of kernel execute faults")
      Signed-off-by: default avatarChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      ffca395b
    • Firoz Khan's avatar
      powerpc: generate uapi header and system call table files · ab66dcc7
      Firoz Khan authored
      System call table generation script must be run to gener-
      ate unistd_32/64.h and syscall_table_32/64/c32/spu.h files.
      This patch will have changes which will invokes the script.
      
      This patch will generate unistd_32/64.h and syscall_table-
      _32/64/c32/spu.h files by the syscall table generation
      script invoked by parisc/Makefile and the generated files
      against the removed files must be identical.
      
      The generated uapi header file will be included in uapi/-
      asm/unistd.h and generated system call table header file
      will be included by kernel/systbl.S file.
      Signed-off-by: default avatarFiroz Khan <firoz.khan@linaro.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      ab66dcc7