- 02 Dec, 2016 8 commits
-
-
Alexey Kardashevskiy authored
In some situations the userspace memory context may live longer than the userspace process itself so if we need to do proper memory context cleanup, we better have tce_container take a reference to mm_struct and use it later when the process is gone (@current or @current->mm is NULL). This references mm and stores the pointer in the container; this is done in a new helper - tce_iommu_mm_set() - when one of the following happens: - a container is enabled (IOMMU v1); - a first attempt to pre-register memory is made (IOMMU v2); - a DMA window is created (IOMMU v2). The @mm stays referenced till the container is destroyed. This replaces current->mm with container->mm everywhere except debug prints. This adds a check that current->mm is the same as the one stored in the container to prevent userspace from making changes to a memory context of other processes. DMA map/unmap ioctls() do not check for @mm as they already check for @enabled which is set after tce_iommu_mm_set() is called. This does not reference a task as multiple threads within the same mm are allowed to ioctl() to vfio and supposedly they will have same limits and capabilities and if they do not, we'll just fail with no harm made. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alexey Kardashevskiy authored
We are going to allow the userspace to configure container in one memory context and pass container fd to another so we are postponing memory allocations accounted against the locked memory limit. One of previous patches took care of it_userspace. At the moment we create the default DMA window when the first group is attached to a container; this is done for the userspace which is not DDW-aware but familiar with the SPAPR TCE IOMMU v2 in the part of memory pre-registration - such client expects the default DMA window to exist. This postpones the default DMA window allocation till one of the folliwing happens: 1. first map/unmap request arrives; 2. new window is requested; This adds noop for the case when the userspace requested removal of the default window which has not been created yet. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alexey Kardashevskiy authored
There is already a helper to create a DMA window which does allocate a table and programs it to the IOMMU group. However tce_iommu_take_ownership_ddw() did not use it and did these 2 calls itself to simplify error path. Since we are going to delay the default window creation till the default window is accessed/removed or new window is added, we need a helper to create a default window from all these cases. This adds tce_iommu_create_default_window(). Since it relies on a VFIO container to have at least one IOMMU group (for future use), this changes tce_iommu_attach_group() to add a group to the container first and then call the new helper. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alexey Kardashevskiy authored
The iommu_table struct manages a hardware TCE table and a vmalloc'd table with corresponding userspace addresses. Both are allocated when the default DMA window is created and this happens when the very first group is attached to a container. As we are going to allow the userspace to configure container in one memory context and pas container fd to another, we have to postpones such allocations till a container fd is passed to the destination user process so we would account locked memory limit against the actual container user constrainsts. This postpones the it_userspace array allocation till it is used first time for mapping. The unmapping patch already checks if the array is allocated. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alexey Kardashevskiy authored
This changes mm_iommu_xxx helpers to take mm_struct as a parameter instead of getting it from @current which in some situations may not have a valid reference to mm. This changes helpers to receive @mm and moves all references to @current to the caller, including checks for !current and !current->mm; checks in mm_iommu_preregistered() are removed as there is no caller yet. This moves the mm_iommu_adjust_locked_vm() call to the caller as it receives mm_iommu_table_group_mem_t but it needs mm. This should cause no behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alexey Kardashevskiy authored
We are going to get rid of @current references in mmu_context_boos3s64.c and cache mm_struct in the VFIO container. Since mm_context_t does not have reference counting, we will be using mm_struct which does have the reference counter. This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather than mm_context_t (which is embedded into mm). This should not cause any behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
This is used in poison.h to offset poison values so that they don't point directly into user space. The value we choose sits roughly between user and kernel space, which means on their own the poison values don't point anywhere useful. If an attacker can cause an access at some offset from the poison value then we may still be in trouble, but by putting the poison values between user and kernel space we maximise the required size of that offset. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Balbir Singh authored
The current facility_strings[] are correct when the trap address is 0xf80 (hypervisor facility unavailable). When the trap address is 0xf60 (facility unavailable) IC (Interruption Cause) a.k.a status in the code is undefined for values 0 and 1. Add a check to prevent printing the (misleading) facility name for IC 0 and 1 when we came in via 0xf60. In all cases, print the actual IC value, to avoid any confusion. This hasn't been seen on real hardware, on only qemu which was misreporting an exception. Signed-off-by: Balbir Singh <bsingharora@gmail.com> [mpe: Fix indentation, combine printks(), massage change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 01 Dec, 2016 4 commits
-
-
Michael Ellerman authored
We have a bunch of Kconfig symbols which select various IBM_EMAC_* symbols. These all cause warnings when IBM_EMAC is not selected. eg. warning: (PPC_CELL_NATIVE && BLUESTONE && CANYONLANDS && GLACIER && EIGER && 440EPX && 440GRX && 440GX && 460SX && 405EX) selects IBM_EMAC_RGMII which has unmet direct dependencies (NETDEVICES && ETHERNET && NET_VENDOR_IBM) So make them all depend on IBM_EMAC being enabled first. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
SPU_FS selects MEMORY_HOTPLUG, which is problematic because MEMORY_HOTPLUG is user selectable, meaning we can end up with a broken .config where MEMORY_HOTPLUG is enabled but its dependencies are not, leading to build breakages. The select of MEMORY_HOTPLUG for SPU_FS was added back in 2006, in commit 4da30d15 ("[POWERPC] spufs: fix memory hotplug dependency"). However we reworked the spufs code and removed the dependency on memory hotplug in 2007 in commit 78bde53e ("[POWERPC] spufs: remove need for struct page for SPEs"). So drop the select as it's no longer needed and causes problems. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nathan Fontenot authored
We should be using lmb_is_removable() to validate that enough LMBs are available to remove when doing a remove by count. This will check that the LMB is owned by the system and it is considered removable. This patch also adds a pr_info() notification to report the LMB count to remove was not satisfied. What we do now is just check that there are enough LMBs owned by the system when validating there are enough LMBs to remove. This can lead to situations where there are enough LMBs owned by the system but not enough that are considered removable. This results in having to bail out of the remove operation instead of just failing the request that we should have known wouldn't succeed. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
In the recent commit 1515ab93 ("powerpc/mm: Dump hash table") we added code to dump the hage page table. Currently this can be selected to build on any platform. However it breaks the build if we're building for a non-Book3S platform, because none of the hash page table related defines and so on exist. So restrict it to building only on Book3S. Similarly in commit 8eb07b18 ("powerpc/mm: Dump linux pagetables") we added code to dump the Linux page tables, which uses some constants which are only defined on Book3S - so guard those with an #ifdef. Fixes: 1515ab93 ("powerpc/mm: Dump hash table") Fixes: 8eb07b18 ("powerpc/mm: Dump linux pagetables") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 30 Nov, 2016 13 commits
-
-
Geoff Levand authored
GCC 5 generates different code for this bootwrapper null check that causes the PS3 to hang very early in its bootup. This check is of limited value, so just get rid of it. Cc: stable@vger.kernel.org Signed-off-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
Now that we've defined structures to describe each of the client architecture vectors, we can use those to construct the value we pass to firmware. This avoids the tricks we previously played with the W() macro, allows us to properly endian annotate fields, and should help to avoid bugs introduced by failing to have the correct number of zero pad bytes between fields. It also means we can avoid hard coding IBM_ARCH_VEC_NRCORES_OFFSET in order to update the max_cpus value and instead just set it. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
The "client architecture vectors" are a series of structures we pass to firmware to define various things, such as what processors we support and many other options. Each structure is entirely different so we have to define a different struct for each one, but that's OK. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
This has not made its way to a PAPR release yet, but we have an hcall number assigned. H_SIGNAL_SYS_RESET = 0x380 Syntax: hcall(uint64 H_SIGNAL_SYS_RESET, int64 target); Generate a system reset NMI on the threads indicated by target. Values for target: -1 = target all online threads including the caller -2 = target all online threads except for the caller All other negative values: reserved Positive values: The thread to be targeted, obtained from the value of the "ibm,ppc-interrupt-server#s" property of the CPU in the OF device tree. Semantics: - Invalid target: return H_Parameter. - Otherwise: Generate a system reset NMI on target thread(s), return H_Success. This will be used by crash/debug code to get stuck CPUs into a known state. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Thiago Jung Bauermann authored
Enable CONFIG_KEXEC_FILE in powernv_defconfig, ppc64_defconfig and pseries_defconfig. It depends on CONFIG_CRYPTO_SHA256=y, so add that as well. Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Thiago Jung Bauermann authored
Define the Kconfig symbol so that the kexec_file_load() code can be built, and wire up the syscall so that it can be called. Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Thiago Jung Bauermann authored
This purgatory implementation is based on the versions from kexec-tools and kexec-lite, with additional changes. Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Thiago Jung Bauermann authored
This patch adds the support code needed for implementing kexec_file_load() on powerpc. This consists of functions to load the ELF kernel, either big or little endian, and setup the purgatory enviroment which switches from the first kernel to the second kernel. None of this code is built yet, as it depends on CONFIG_KEXEC_FILE which we have not yet defined. Although we could define CONFIG_KEXEC_FILE in this patch, we'd then have a window in history where the kconfig symbol is present but the syscall is not, which would be awkward. Signed-off-by: Josh Sklar <sklar@linux.vnet.ibm.com> Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Thiago Jung Bauermann authored
Commit 2965faa5 ("kexec: split kexec_load syscall from kexec core code") introduced CONFIG_KEXEC_CORE so that CONFIG_KEXEC means whether the kexec_load system call should be compiled-in and CONFIG_KEXEC_FILE means whether the kexec_file_load system call should be compiled-in. These options can be set independently from each other. Since until now powerpc only supported kexec_load, CONFIG_KEXEC and CONFIG_KEXEC_CORE were synonyms. That is not the case anymore, so we need to make a distinction. Almost all places where CONFIG_KEXEC was being used should be using CONFIG_KEXEC_CORE instead, since kexec_file_load also needs that code compiled in. Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Thiago Jung Bauermann authored
kexec_locate_mem_hole will be used by the PowerPC kexec_file_load implementation to find free memory for the purgatory stack. Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Thiago Jung Bauermann authored
This is done to simplify the kexec_add_buffer argument list. Adapt all callers to set up a kexec_buf to pass to kexec_add_buffer. In addition, change the type of kexec_buf.buffer from char * to void *. There is no particular reason for it to be a char *, and the change allows us to get rid of 3 existing casts to char * in the code. Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Thiago Jung Bauermann authored
Allow architectures to specify a different memory walking function for kexec_add_buffer. x86 uses iomem to track reserved memory ranges, but PowerPC uses the memblock subsystem. Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Balbir Singh authored
Aneesh/Ben reported that the change to do_page_fault() we made in commit 1d18ad02 ("powerpc/mm: Detect instruction fetch denied and report") needs to handle the case where CPU_FTR_COHERENT_ICACHE is missing but we have CPU_FTR_NOEXECUTE. In those cases the check added for SRR1_ISI_N_OR_G might trigger a false positive. This patch adds a check for CPU_FTR_COHERENT_ICACHE in addition to the MSR value. Fixes: 1d18ad02 ("powerpc/mm: Detect instruction fetch denied and report") Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 29 Nov, 2016 3 commits
-
-
Michael Ellerman authored
Now that we don't set ARCH incorrectly when calling the boot Makefile, we can use the generic cpp_lds_S rule for converting our zImage.lds.S into zImage.lds. The main advantage of using the generic rule is that it correctly uses if_changed, which means we correctly regenerate the linker script when switching endian. Fixing that means we are finally able to build one endian and then rebuild the other endian without requiring to clean between builds. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
If we're using if_changed then we must depend on FORCE, so that if_changed gets a chance to check if something changed. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
Back in 2005 when the ppc/ppc64 merge started, we used to build the kernel code in arch/powerpc but use the boot code from arch/ppc or arch/ppc64 depending on whether we were building for 32 or 64-bit. Originally we called the boot Makefile passing ARCH=$(OLDARCH), where OLDARCH was ppc or ppc64. In commit 20f62954 ("powerpc: Make building the boot image work for both 32-bit and 64-bit") (2005-10-11) we split the call for 32/64-bit using an ifeq check, because the two Makefiles took different targets, and explicitly passed ARCH=ppc64 for the 64-bit case and ARCH=ppc for the 32-bit case. Then in commit 94b212c2 ("powerpc: Move ppc64 boot wrapper code over to arch/powerpc") (2005-11-16) we moved the boot code into arch/powerpc and dropped the ppc case, but kept passing ARCH=ppc64 to arch/powerpc/boot/Makefile. Since then there have been several more boot targets added, all of which have copied the ARCH=ppc64 setting, such that now we have four targets using it. Currently it seems that nothing actually uses the ARCH value, but that's basically just luck, and in particular it prevents us from using the generic cpp_lds_S rule. It's also clearly wrong, ARCH=ppc64 is dead, buried and cremated. Fix it by dropping the setting of ARCH completely, the correct value is exported by the top level Makefile. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 28 Nov, 2016 9 commits
-
-
Aneesh Kumar K.V authored
This will improve the task exit case, by batching tlb invalidates. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Aneesh Kumar K.V authored
When we are updating a pte, we just need to flush the tlb mapping that pte. Right now we do a full mm flush because we don't track page size. Now that we have page size details in pte use that to do the optimized flush Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Aneesh Kumar K.V authored
When we are updating a pte, we just need to flush the tlb mapping that pte. Right now we do a full mm flush because we don't track the page size. Now that we have page size details in pte use that to do the optimized flush Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Aneesh Kumar K.V authored
Now that we have page size details encoded in pte using software pte bits, use that to find the page size needed for tlb flush. This function should only be used on P9 DD1, so give it a horrible name to make that clear. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Aneesh Kumar K.V authored
This patch adds a new software defined pte bit. We use the reserved fields of ISA 3.0 pte definition since we will only be using this on DD1 code paths. We can possibly look at removing this code later. The software bit will be used to differentiate between 64K/4K and 2M ptes. This helps in finding the page size mapping by a pte so that we can do efficient tlb flush. We don't support 1G hugetlb pages yet. So we add a DEBUG WARN_ON to catch wrong usage. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Aneesh Kumar K.V authored
W.r.t hash page table config, we support 16MB and 16GB as the hugepage size. Update the hstate_get_psize to handle 16M and 16G. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Aneesh Kumar K.V authored
We will start moving some book3s specific hugetlb functions there. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
This converts one that was missed by b1576fec ("powerpc: No need to use dot symbols when branching to a function"). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
From 80f23935 ("powerpc: Convert cmp to cmpd in idle enter sequence"): PowerPC's "cmp" instruction has four operands. Normally people write "cmpw" or "cmpd" for the second cmp operand 0 or 1. But, frequently people forget, and write "cmp" with just three operands. With older binutils this is silently accepted as if this was "cmpw", while often "cmpd" is wanted. With newer binutils GAS will complain about this for 64-bit code. For 32-bit code it still silently assumes "cmpw" is what is meant. In this case, cmpwi is called for, so this is just a build fix for new toolchains. Cc: stable@vger.kernel.org # v3.0+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 26 Nov, 2016 1 commit
-
-
Balbir Singh authored
ISA 3 defines new encoded access authority that allows instruction access prevention in privileged mode and allows normal access to problem state. This patch just enables IAMR (Instruction Authority Mask Register), enabling AMR would require more work. I've tested this with a buggy driver and a simple payload. The payload is specific to the build I've tested. mpe: Also tested with LKDTM: # echo EXEC_USERSPACE > /sys/kernel/debug/provoke-crash/DIRECT lkdtm: Performing direct entry EXEC_USERSPACE lkdtm: attempting ok execution at c0000000005bf560 lkdtm: attempting bad execution at 00003fff8d940000 Unable to handle kernel paging request for instruction fetch Faulting instruction address: 0x3fff8d940000 Oops: Kernel access of bad area, sig: 11 [#1] NIP: 00003fff8d940000 LR: c0000000005bfa58 CTR: 00003fff8d940000 REGS: c0000000f1fcf900 TRAP: 0400 Not tainted (4.9.0-rc5-compiler_gcc-6.2.0-00109-g956dbc06232a) MSR: 9000000010009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 48002222 XER: 00000000 ... Call Trace: lkdtm_EXEC_USERSPACE+0x104/0x120 (unreliable) lkdtm_do_action+0x3c/0x80 direct_entry+0x100/0x1b0 full_proxy_write+0x94/0x100 __vfs_write+0x3c/0x1b0 vfs_write+0xcc/0x230 SyS_write+0x60/0x110 system_call+0x38/0xfc Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 25 Nov, 2016 2 commits
-
-
Balbir Singh authored
ISA 3 allows for prevention of instruction fetch and execution of user mode pages. If such an error occurs, SRR1 bit 35 reports the error. We catch and report the error in do_page_fault(). Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Balbir Singh authored
Setup AMOR (Authority Mask Override Register) in HV mode so that the host and guest kernel can in turn setup IAMR. This allows us to enable key 0 in a following patch. Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-