1. 02 Dec, 2016 9 commits
    • powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown · 4b6fad70
      Alexey Kardashevskiy authored
      At the moment the userspace tool is expected to request pinning of
      the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
      When the userspace process finishes, all the pinned pages need to
      be put; this is done as a part of the userspace memory context (MM)
      destruction which happens on the very last mmdrop().
      
      The problem with this approach is that the MM of the userspace process
      may live longer than the process itself: a kernel thread borrows the
      MM of the userspace process that was running on the CPU where the
      kernel thread was scheduled. When this happens, the MM remains
      referenced until that kernel thread wakes up again and releases the
      very last reference to the MM; on an idle system this can take hours.
      
      This moves preregistered regions tracking from MM to VFIO; instead of
      using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
      added so each container releases regions which it has pre-registered.
      
      This changes the userspace interface to return EBUSY if a memory
      region is already registered in a container. However it should not
      have any practical effect as the only userspace tool available now
      registers a memory region once per container anyway.
      
      As tce_iommu_register_pages/tce_iommu_unregister_pages are called
      under container->lock, this does not need additional locking.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
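
The per-container tracking described in 4b6fad70 can be modelled in plain userspace C. This is only a sketch under stated assumptions: `struct prereg`, `struct container`, `container_register()` and `container_release()` are hypothetical stand-ins, not the kernel's types or functions, and pinning is reduced to a list node. It shows the two behaviours the message names: a second registration of the same region fails with EBUSY, and the container releases what it registered when it shuts down.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for one preregistered (pinned) memory region. */
struct prereg {
    unsigned long vaddr;
    unsigned long size;
    struct prereg *next;
};

/* Hypothetical stand-in for tce_container: it owns its own region list. */
struct container {
    struct prereg *prereg_list;
};

/* Register a region; registering the same region twice is refused. */
static int container_register(struct container *c, unsigned long vaddr,
                              unsigned long size)
{
    struct prereg *p;

    for (p = c->prereg_list; p; p = p->next)
        if (p->vaddr == vaddr && p->size == size)
            return -EBUSY;

    p = malloc(sizeof(*p));
    if (!p)
        return -ENOMEM;

    p->vaddr = vaddr;
    p->size = size;
    p->next = c->prereg_list;
    c->prereg_list = p;
    return 0;
}

/* On shutdown, release every region this container registered. */
static int container_release(struct container *c)
{
    int released = 0;

    while (c->prereg_list) {
        struct prereg *p = c->prereg_list;

        c->prereg_list = p->next;
        free(p);
        released++;
    }
    return released;
}
```

Because the list lives in the container rather than in the MM, cleanup no longer waits for the last reference to the MM to be dropped.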
    • vfio/spapr: Reference mm in tce_container · bc82d122
      Alexey Kardashevskiy authored
      In some situations the userspace memory context may live longer than
      the userspace process itself so if we need to do proper memory context
      cleanup, we better have tce_container take a reference to mm_struct and
      use it later when the process is gone (@current or @current->mm is NULL).
      
      This references mm and stores the pointer in the container; this is done
      in a new helper - tce_iommu_mm_set() - when one of the following happens:
      - a container is enabled (IOMMU v1);
      - a first attempt to pre-register memory is made (IOMMU v2);
      - a DMA window is created (IOMMU v2).
      The @mm stays referenced till the container is destroyed.
      
      This replaces current->mm with container->mm everywhere except debug
      prints.
      
      This adds a check that current->mm is the same as the one stored in
      the container to prevent userspace from making changes to a memory
      context of other processes.
      
      DMA map/unmap ioctls() do not check for @mm as they already check
      for @enabled which is set after tce_iommu_mm_set() is called.
      
      This does not reference a task as multiple threads within the same mm
      are allowed to ioctl() to vfio and supposedly they will have same limits
      and capabilities and if they do not, we'll just fail with no harm made.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
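
The reference-and-check logic described in bc82d122 can be sketched in userspace C. The names here (`struct mm`, `struct container`, `container_mm_set()`, `container_release()`) are hypothetical, the reference counter is a plain int, and returning EPERM on a mismatch is an assumption made for illustration; the commit message does not state the error code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for mm_struct, reduced to its reference counter. */
struct mm {
    int count;
};

/* Hypothetical stand-in for tce_container. */
struct container {
    struct mm *mm;
};

/*
 * The first caller's mm is referenced and cached in the container.
 * Later callers must come from the same mm. The caller must pass a
 * valid mm pointer.
 */
static int container_mm_set(struct container *c, struct mm *current_mm)
{
    if (c->mm) {
        if (c->mm == current_mm)
            return 0;
        return -EPERM;  /* assumed error code for a foreign mm */
    }

    current_mm->count++;
    c->mm = current_mm;
    return 0;
}

/* The reference is held until the container is destroyed. */
static void container_release(struct container *c)
{
    if (c->mm) {
        c->mm->count--;
        c->mm = NULL;
    }
}
```

Caching the pointer is what lets cleanup run later without consulting the calling process, which may already be gone.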
    • vfio/spapr: Postpone default window creation · d9c72894
      Alexey Kardashevskiy authored
      We are going to allow the userspace to configure a container in
      one memory context and pass the container fd to another, so
      we are postponing memory allocations accounted against
      the locked memory limit. One of the previous patches took care of
      it_userspace.
      
      At the moment we create the default DMA window when the first group is
      attached to a container; this is done for the userspace which is not
      DDW-aware but familiar with the SPAPR TCE IOMMU v2 in the part of memory
      pre-registration - such client expects the default DMA window to exist.
      
      This postpones the default DMA window allocation till one of
      the following happens:
      1. the first map/unmap request arrives;
      2. a new window is requested.
      This adds a no-op for the case when the userspace requests removal
      of the default window which has not been created yet.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
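
The deferred creation described in d9c72894 amounts to a "pending" flag that is resolved on first use. The sketch below is a userspace model, not the driver code: `struct container`, the `def_window_pending` field and all function names are hypothetical, and a DMA window is reduced to a counter.

```c
#include <stdbool.h>

/* Hypothetical stand-in for tce_container. */
struct container {
    bool def_window_pending;  /* default window requested, not yet created */
    int windows;              /* number of DMA windows that exist */
};

/* Attaching the first group only records that a default window is due. */
static void attach_first_group(struct container *c)
{
    c->def_window_pending = true;
}

/* Create the default window now if it is still pending. */
static void create_default_window_if_pending(struct container *c)
{
    if (!c->def_window_pending)
        return;
    c->def_window_pending = false;
    c->windows++;
}

/* Trigger 1: the first map/unmap request. */
static void dma_map(struct container *c)
{
    create_default_window_if_pending(c);
}

/* Trigger 2: a new window is requested. */
static void create_window(struct container *c)
{
    create_default_window_if_pending(c);
    c->windows++;
}

/* Removing a default window that was never created is a no-op. */
static void remove_default_window(struct container *c)
{
    if (c->def_window_pending) {
        c->def_window_pending = false;
        return;
    }
    if (c->windows > 0)
        c->windows--;
}
```

In this model nothing is allocated at attach time, so whatever the window costs is charged to whoever first uses the container.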
    • vfio/spapr: Add a helper to create default DMA window · 6f01cc69
      Alexey Kardashevskiy authored
      There is already a helper to create a DMA window which allocates
      a table and programs it into the IOMMU group. However,
      tce_iommu_take_ownership_ddw() did not use it and made these 2 calls
      itself to simplify the error path.
      
      Since we are going to delay the default window creation till
      the default window is accessed/removed or new window is added,
      we need a helper to create a default window from all these cases.
      
      This adds tce_iommu_create_default_window(). Since it relies on
      a VFIO container to have at least one IOMMU group (for future use),
      this changes tce_iommu_attach_group() to add a group to the container
      first and then call the new helper.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio/spapr: Postpone allocation of userspace version of TCE table · 39701e56
      Alexey Kardashevskiy authored
      The iommu_table struct manages a hardware TCE table and a vmalloc'd
      table with corresponding userspace addresses. Both are allocated when
      the default DMA window is created and this happens when the very first
      group is attached to a container.
      
      As we are going to allow the userspace to configure a container in one
      memory context and pass the container fd to another, we have to postpone
      such allocations till the container fd is passed to the destination
      user process, so that the locked memory limit is accounted against the
      constraints of the actual container user.
      
      This postpones the it_userspace array allocation till it is first used
      for mapping. The unmapping path already checks if the array is
      allocated.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
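
The allocate-on-first-map behaviour described in 39701e56 can be illustrated with a small userspace model. `struct table` and the `table_*()` functions are hypothetical, and `calloc()` stands in for the kernel's vmalloc-based allocation; the point is only that mapping allocates the array lazily while unmapping tolerates its absence.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the userspace-address side of iommu_table. */
struct table {
    unsigned long size;        /* number of entries */
    unsigned long *userspace;  /* NULL until the first mapping */
};

/* Mapping allocates the array on first use. */
static int table_map(struct table *t, unsigned long entry, unsigned long ua)
{
    if (entry >= t->size)
        return -EINVAL;

    if (!t->userspace) {
        t->userspace = calloc(t->size, sizeof(*t->userspace));
        if (!t->userspace)
            return -ENOMEM;
    }

    t->userspace[entry] = ua;
    return 0;
}

/* Unmapping checks whether the array exists before touching it. */
static int table_unmap(struct table *t, unsigned long entry)
{
    if (entry >= t->size)
        return -EINVAL;
    if (!t->userspace)
        return 0;  /* nothing was ever mapped */

    t->userspace[entry] = 0;
    return 0;
}

/* Release the array, if any, when the table goes away. */
static void table_free(struct table *t)
{
    free(t->userspace);
    t->userspace = NULL;
}
```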
    • powerpc/iommu: Stop using @current in mm_iommu_xxx · d7baee69
      Alexey Kardashevskiy authored
      This changes mm_iommu_xxx helpers to take mm_struct as a parameter
      instead of getting it from @current which in some situations may
      not have a valid reference to mm.
      
      This changes helpers to receive @mm and moves all references to @current
      to the caller, including checks for !current and !current->mm;
      checks in mm_iommu_preregistered() are removed as there is no caller
      yet.
      
      This moves the mm_iommu_adjust_locked_vm() call to the caller as
      it receives mm_iommu_table_group_mem_t but it needs mm.
      
      This should cause no behavioral change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/iommu: Pass mm_struct to init/cleanup helpers · 88f54a35
      Alexey Kardashevskiy authored
      We are going to get rid of @current references in mmu_context_book3s64.c
      and cache mm_struct in the VFIO container. Since mm_context_t does not
      have reference counting, we will be using mm_struct which does have
      the reference counter.
      
      This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather
      than mm_context_t (which is embedded into mm).
      
      This should not cause any behavioral change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64: Define ILLEGAL_POINTER_VALUE for 64-bit · f6853eb5
      Michael Ellerman authored
      This is used in poison.h to offset poison values so that they don't
      point directly into user space.
      
      The value we choose sits roughly between user and kernel space, which
      means on their own the poison values don't point anywhere useful. If an
      attacker can cause an access at some offset from the poison value then
      we may still be in trouble, but by putting the poison values between
      user and kernel space we maximise the required size of that offset.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
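
The effect of offsetting poison values, as described in f6853eb5, can be checked with simple arithmetic. Everything numeric below is an assumption for illustration: the commit text does not quote the chosen value, so `ILLEGAL_POINTER_VALUE` is set to a plausible constant, `USER_TOP_ASSUMED` is a made-up upper bound for user addresses, and `KERNEL_BASE` is inferred from the `c000...` kernel addresses in the LKDTM log further down this page. The `0x100` offset mirrors the style of the list poison values in poison.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration only. */
#define ILLEGAL_POINTER_VALUE UINT64_C(0x5deadbeef0000000)

/* Assumed address-space bounds for illustration only. */
#define USER_TOP_ASSUMED      UINT64_C(0x0010000000000000)
#define KERNEL_BASE           UINT64_C(0xc000000000000000)

/* A poison value is a small constant plus the configured offset. */
static uint64_t list_poison1(uint64_t delta)
{
    return UINT64_C(0x100) + delta;
}

/* True if the address lies in the gap between user and kernel space. */
static bool between_user_and_kernel(uint64_t addr)
{
    return addr >= USER_TOP_ASSUMED && addr < KERNEL_BASE;
}
```

With an offset of zero the poison value is `0x100`, which falls inside user space; with the offset applied it lands in the gap, so an attacker would need a very large additional displacement to reach either side.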
    • powerpc: Don't print misleading facility name in facility unavailable exception · 93c2ec0f
      Balbir Singh authored
      The current facility_strings[] are correct when the trap address is
      0xf80 (hypervisor facility unavailable). When the trap address is
      0xf60 (facility unavailable) IC (Interruption Cause) a.k.a status in the
      code is undefined for values 0 and 1.
      
      Add a check to prevent printing the (misleading) facility name for IC 0
      and 1 when we came in via 0xf60. In all cases, print the actual IC
      value, to avoid any confusion.
      
      This hasn't been seen on real hardware, only on qemu, which was
      misreporting an exception.
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      [mpe: Fix indentation, combine printks(), massage change log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
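
The check described in 93c2ec0f can be sketched as a lookup that refuses to name a facility when the name cannot be trusted. This is an illustrative model rather than the kernel's code: the table contents, the macro names and `facility_name()` are assumptions, and a NULL return stands for "print only the numeric IC value".

```c
#include <stddef.h>

#define TRAP_FAC_UNAVAIL    0xf60  /* facility unavailable */
#define TRAP_HV_FAC_UNAVAIL 0xf80  /* hypervisor facility unavailable */

/* Illustrative table indexed by IC; the contents are assumed. */
static const char *const facility_strings[] = {
    "FPU", "VMX/VSX", "DSCR", "PMU SPRs",
};

/*
 * Return a facility name only when it is meaningful: IC 0 and 1 are
 * undefined when entering via 0xf60, so no name is returned for them.
 */
static const char *facility_name(unsigned int trap, unsigned int ic)
{
    size_t n = sizeof(facility_strings) / sizeof(facility_strings[0]);

    if (ic >= n)
        return NULL;
    if (trap == TRAP_FAC_UNAVAIL && ic < 2)
        return NULL;
    return facility_strings[ic];
}
```

A caller would always print the numeric IC and append the name only when `facility_name()` returns non-NULL.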
  2. 01 Dec, 2016 4 commits
    • powerpc: Make selects of IBM_EMAC_* depend on IBM_EMAC · 33596727
      Michael Ellerman authored
      We have a bunch of Kconfig symbols which select various IBM_EMAC_*
      symbols. These all cause warnings when IBM_EMAC is not selected.
      
      eg.
      
        warning: (PPC_CELL_NATIVE && BLUESTONE && CANYONLANDS && GLACIER &&
        EIGER && 440EPX && 440GRX && 440GX && 460SX && 405EX) selects
        IBM_EMAC_RGMII which has unmet direct dependencies (NETDEVICES &&
        ETHERNET && NET_VENDOR_IBM)
      
      So make them all depend on IBM_EMAC being enabled first.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/cell: Drop select of MEMORY_HOTPLUG · 577ec789
      Michael Ellerman authored
      SPU_FS selects MEMORY_HOTPLUG, which is problematic because
      MEMORY_HOTPLUG is user selectable, meaning we can end up with a broken
      .config where MEMORY_HOTPLUG is enabled but its dependencies are not,
      leading to build breakages.
      
      The select of MEMORY_HOTPLUG for SPU_FS was added back in 2006, in
      commit 4da30d15 ("[POWERPC] spufs: fix memory hotplug dependency").
      
      However we reworked the spufs code and removed the dependency on memory
      hotplug in 2007 in commit 78bde53e ("[POWERPC] spufs: remove need
      for struct page for SPEs").
      
      So drop the select as it's no longer needed and causes problems.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/pseries: Use lmb_is_removable() to check removability · 2db029ef
      Nathan Fontenot authored
      We should be using lmb_is_removable() to validate that enough LMBs
      are available to remove when doing a remove by count. This will check
      that the LMB is owned by the system and it is considered removable.
      This patch also adds a pr_info() notification to report that the
      requested LMB count could not be satisfied.
      
      What we do now is just check that there are enough LMBs owned by the
      system when validating there are enough LMBs to remove. This can
      lead to situations where there are enough LMBs owned by the system
      but not enough that are considered removable. This results in having
      to bail out of the remove operation instead of just failing the request
      that we should have known wouldn't succeed.
      Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
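
The validate-before-removing idea in 2db029ef can be shown with a small model. `struct lmb`, `lmb_removable()` and `remove_by_count()` are hypothetical simplifications; the real `lmb_is_removable()` inspects firmware and memory state that is not modelled here. The sketch counts LMBs that are both owned and removable up front, so an unsatisfiable request fails before anything is touched.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one logical memory block. */
struct lmb {
    bool assigned;   /* owned by the system */
    bool removable;  /* considered removable */
};

/* Stand-in for the removability check: owned and removable. */
static bool lmb_removable(const struct lmb *l)
{
    return l->assigned && l->removable;
}

/* Remove to_remove LMBs, or fail up front without changing anything. */
static int remove_by_count(struct lmb *lmbs, size_t n, size_t to_remove)
{
    size_t available = 0, removed = 0, i;

    /* Validate first: count only LMBs that can really be removed. */
    for (i = 0; i < n; i++)
        if (lmb_removable(&lmbs[i]))
            available++;

    if (available < to_remove)
        return -EINVAL;

    for (i = 0; i < n && removed < to_remove; i++) {
        if (!lmb_removable(&lmbs[i]))
            continue;
        lmbs[i].assigned = false;
        removed++;
    }
    return 0;
}
```

Counting only owned LMBs would accept a request for three in the example `{owned+removable, owned only, owned+removable}` and then have to bail out midway; counting removable ones rejects it immediately.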
    • powerpc/mm: Fix page table dump build on non-Book3S · dd5ac03e
      Michael Ellerman authored
      In the recent commit 1515ab93 ("powerpc/mm: Dump hash table") we
      added code to dump the hash page table. Currently this can be selected
      to build on any platform. However it breaks the build if we're building
      for a non-Book3S platform, because none of the hash page table related
      defines and so on exist. So restrict it to building only on Book3S.
      
      Similarly in commit 8eb07b18 ("powerpc/mm: Dump linux pagetables")
      we added code to dump the Linux page tables, which uses some constants
      which are only defined on Book3S - so guard those with an #ifdef.
      
      Fixes: 1515ab93 ("powerpc/mm: Dump hash table")
      Fixes: 8eb07b18 ("powerpc/mm: Dump linux pagetables")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  3. 30 Nov, 2016 13 commits
  4. 29 Nov, 2016 3 commits
    • powerpc/boot: Fix rebuild when changing kernel endian · f0f7fe1a
      Michael Ellerman authored
      Now that we don't set ARCH incorrectly when calling the boot Makefile,
      we can use the generic cpp_lds_S rule for converting our zImage.lds.S
      into zImage.lds.
      
      The main advantage of using the generic rule is that it correctly uses
      if_changed, which means we correctly regenerate the linker script when
      switching endian. Fixing that means we are finally able to build one
      endian and then rebuild the other endian without having to clean
      between builds.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/boot: All uses of if_changed should depend on FORCE · 42d0c932
      Michael Ellerman authored
      If we're using if_changed then we must depend on FORCE, so that
      if_changed gets a chance to check if something changed.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Stop passing ARCH=ppc64 to boot Makefile · 1196d7aa
      Michael Ellerman authored
      Back in 2005 when the ppc/ppc64 merge started, we used to build the
      kernel code in arch/powerpc but use the boot code from arch/ppc or
      arch/ppc64 depending on whether we were building for 32 or 64-bit.
      
      Originally we called the boot Makefile passing ARCH=$(OLDARCH), where
      OLDARCH was ppc or ppc64.
      
      In commit 20f62954 ("powerpc: Make building the boot image work for
      both 32-bit and 64-bit") (2005-10-11) we split the call for 32/64-bit
      using an ifeq check, because the two Makefiles took different targets,
      and explicitly passed ARCH=ppc64 for the 64-bit case and ARCH=ppc for
      the 32-bit case.
      
      Then in commit 94b212c2 ("powerpc: Move ppc64 boot wrapper code over
      to arch/powerpc") (2005-11-16) we moved the boot code into arch/powerpc
      and dropped the ppc case, but kept passing ARCH=ppc64 to
      arch/powerpc/boot/Makefile.
      
      Since then there have been several more boot targets added, all of which
      have copied the ARCH=ppc64 setting, such that now we have four targets
      using it.
      
      Currently it seems that nothing actually uses the ARCH value, but that's
      basically just luck, and in particular it prevents us from using the
      generic cpp_lds_S rule. It's also clearly wrong: ARCH=ppc64 is dead,
      buried and cremated.
      
      Fix it by dropping the setting of ARCH completely, the correct value is
      exported by the top level Makefile.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  5. 28 Nov, 2016 9 commits
  6. 26 Nov, 2016 1 commit
    • powerpc/mm/radix: Prevent kernel execution of user space · 3b10d009
      Balbir Singh authored
      ISA 3 defines new encoded access authority that allows instruction
      access prevention in privileged mode and allows normal access
      to problem state. This patch just enables IAMR (Instruction Authority
      Mask Register), enabling AMR would require more work.
      
      I've tested this with a buggy driver and a simple payload. The payload
      is specific to the build I've tested.
      
      mpe: Also tested with LKDTM:
      
        # echo EXEC_USERSPACE > /sys/kernel/debug/provoke-crash/DIRECT
        lkdtm: Performing direct entry EXEC_USERSPACE
        lkdtm: attempting ok execution at c0000000005bf560
        lkdtm: attempting bad execution at 00003fff8d940000
        Unable to handle kernel paging request for instruction fetch
        Faulting instruction address: 0x3fff8d940000
        Oops: Kernel access of bad area, sig: 11 [#1]
        NIP: 00003fff8d940000 LR: c0000000005bfa58 CTR: 00003fff8d940000
        REGS: c0000000f1fcf900 TRAP: 0400   Not tainted  (4.9.0-rc5-compiler_gcc-6.2.0-00109-g956dbc06232a)
        MSR: 9000000010009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 48002222  XER: 00000000
        ...
        Call Trace:
          lkdtm_EXEC_USERSPACE+0x104/0x120 (unreliable)
          lkdtm_do_action+0x3c/0x80
          direct_entry+0x100/0x1b0
          full_proxy_write+0x94/0x100
          __vfs_write+0x3c/0x1b0
          vfs_write+0xcc/0x230
          SyS_write+0x60/0x110
          system_call+0x38/0xfc
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  7. 25 Nov, 2016 1 commit