1. 04 Apr, 2018 8 commits
    • cxl: Fix possible deadlock when processing page faults from cxllib · ad7b4e80
      Frederic Barrat authored
      cxllib_handle_fault() is called by an external driver when it needs to
      have the host resolve page faults for a buffer. The buffer can cover
      several pages and VMAs. The function iterates over all the pages used
      by the buffer, based on the page size of the VMA.
      
      To ensure some stability while processing the faults, the thread T1
      grabs the mm->mmap_sem semaphore with read access (R1). However, when
      processing a page fault for a single page, one of the underlying
      functions, copro_handle_mm_fault(), also grabs the same semaphore with
      read access (R2). So the thread T1 takes the semaphore twice.
      
      If another thread T2 tries to access the semaphore in write mode W1
      (say, because it wants to allocate memory and calls 'brk'), then that
      thread T2 will have to wait because there's a reader (R1). If the
      thread T1 is processing a new page at that time, it won't get an
      automatic grant at R2, because there's now a writer thread
      waiting (T2). And we have a deadlock.
      
      The timeline is:
      1. thread T1 owns the semaphore with read access R1
      2. thread T2 requests write access W1 and waits
      3. thread T1 requests read access R2 and waits
      
      The fix is for the thread T1 to release the semaphore R1 once it got
      the information it needs from the current VMA. The address space/VMAs
      could evolve while T1 iterates over the full buffer, but in the
      unlikely case where T1 misses a page, the external driver will raise a
      new page fault when retrying the memory access.
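The three-step timeline above can be replayed with a toy model. This is only a sketch in Python, not kernel code: `FairRwSem`, `replay_timeline` and the rule that a queued writer blocks new readers are assumptions chosen to mirror the description. The real mm->mmap_sem is far more involved.

```python
class FairRwSem:
    """Toy reader/writer semaphore: new readers queue behind a waiting writer."""

    def __init__(self):
        self.readers = 0
        self.writer_active = False
        self.writer_waiting = False

    def try_read(self):
        # A reader is refused while a writer holds the semaphore or is queued.
        if self.writer_active or self.writer_waiting:
            return False
        self.readers += 1
        return True

    def release_read(self):
        self.readers -= 1

    def try_write(self):
        # A writer needs exclusive access; otherwise it queues up.
        if self.readers == 0 and not self.writer_active:
            self.writer_active = True
            self.writer_waiting = False
            return True
        self.writer_waiting = True
        return False


def replay_timeline(release_r1_first):
    """Return True if T1 and T2 end up waiting on each other."""
    sem = FairRwSem()
    sem.try_read()                   # 1. T1 takes R1
    if release_r1_first:
        sem.release_read()           # the fix: T1 drops R1 before the fault
    t2_got_w1 = sem.try_write()      # 2. T2 requests W1
    t1_got_r2 = sem.try_read()       # 3. T1 requests R2
    # Deadlock: T2 waits for R1 to be released, T1 waits behind the queued W1.
    return not t2_got_w1 and not t1_got_r2
```

In this model `replay_timeline(False)` reproduces the deadlock. With `release_r1_first=True`, T2 obtains W1 immediately, so T1 merely waits for T2 to finish instead of blocking forever.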
      
      Fixes: 3ced8d73 ("cxl: Export library to support IBM XSL")
      Cc: stable@vger.kernel.org # 4.13+
      Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/hw_breakpoint: Only disable hw breakpoint if cpu supports it · 5d6a03eb
      Naveen N. Rao authored
      We get the below warning if we try to use kexec on P9:
         kexec_core: Starting new kernel
         WARNING: CPU: 0 PID: 1223 at arch/powerpc/kernel/process.c:826 __set_breakpoint+0xb4/0x140
         [snip]
         NIP __set_breakpoint+0xb4/0x140
         LR  kexec_prepare_cpus_wait+0x58/0x150
         Call Trace:
           0xc0000000ee70fb20 (unreliable)
           0xc0000000ee70fb20
           default_machine_kexec+0x234/0x2c0
           machine_kexec+0x84/0x90
           kernel_kexec+0xd8/0xe0
           SyS_reboot+0x214/0x2c0
           system_call+0x58/0x6c
      
      This happens since we are trying to clear hw breakpoint on POWER9,
      though we don't have CPU_FTR_DAWR enabled. Guard __set_breakpoint()
      within hw_breakpoint_disable() with ppc_breakpoint_available() to
      address this.
      
      Fixes: 96541531 ("powerpc: Disable DAWR in the base POWER9 CPU features")
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/radix: Update command line parsing for disable_radix · 7a22d632
      Aneesh Kumar K.V authored
      The kernel parameter disable_radix accepts several forms:
      disable_radix=yes|no|1|0, or just disable_radix.

      The prom_init parsing does not support these forms.
      
      Fixes: 1fd6c022 ("powerpc/mm: Add a CONFIG option to choose if radix is used by default")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/radix: Parse disable_radix commandline correctly. · cec4e9b2
      Aneesh Kumar K.V authored
      The kernel parameter disable_radix accepts several forms:
      disable_radix=yes|no|1|0, or just disable_radix. When using the latter
      form we get the error below.
      
       `Malformed early option 'disable_radix'`
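The accepted forms can be illustrated with a small parser. This is a hedged sketch, not the kernel's implementation: `parse_disable_radix` is a hypothetical helper, and it recognises only the values named in the commit message (yes, no, 1, 0, or the bare option).

```python
def parse_disable_radix(cmdline):
    """Return True/False if disable_radix is given, or None if it is absent."""
    result = None
    for token in cmdline.split():
        if token == "disable_radix":
            # Bare option: treated as a request to disable radix.
            result = True
        elif token.startswith("disable_radix="):
            value = token.split("=", 1)[1]
            if value in ("yes", "1"):
                result = True
            elif value in ("no", "0"):
                result = False
            # Other values are ignored in this sketch.
    return result
```

The point of the sketch is that the bare form and the `=value` forms both need an explicit branch; handling only one of them leaves the other unparsed.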
      
      Fixes: 1fd6c022 ("powerpc/mm: Add a CONFIG option to choose if radix is used by default")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/hugetlb: initialize the pagetable cache correctly for hugetlb · 6fa50483
      Aneesh Kumar K.V authored
      With 64k page size, we have hugetlb pte entries at the pmd and pud level for
      book3s64. We don't need to create a separate page table cache for that. With 4k
      we need to make sure hugepd page table cache for 16M is placed at PUD level
      and 16G at the PGD level.
      
      Simplify all of this by not using HUGEPD_PD_SHIFT, which is confusing for book3s64.
      
      Without this patch, with 64k page size we create pagetable caches with shift
      value 10 and 7 which are not used at all.
      
      Fixes: 419df06e ("powerpc: Reduce the PTE_INDEX_SIZE")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/radix: Update pte fragment count from 16 to 256 on radix · fb4e5dbd
      Aneesh Kumar K.V authored
      With split PTL (page table lock) config, we allocate the level
      4 (leaf) page table using pte fragment framework instead of slab cache
      like other levels. This was done to enable us to have split page table
      lock at level 4 of the page table. We use the page->ptl of the page
      backing all the level 4 pte fragments for the lock.
      
      Currently with Radix, we use only 16 fragments out of the allocated
      page. In radix each fragment is 256 bytes which means we use only 4k
      out of the allocated 64K page wasting 60k of the allocated memory.
      This was done earlier to keep it closer to hash.
      
      This patch updates the pte fragment count to 256, thereby using the
      full 64K page and reducing the memory usage. Performance tests show
      really low impact even with THP disabled. With THP enabled we will be
      contending even less on the level 4 ptl, and hence the impact should be
      even lower.
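The memory figures quoted above follow from simple arithmetic, which the following Python sketch checks. The constant and function names are illustrative only; the sizes come from the commit message.

```python
PAGE_SIZE = 64 * 1024   # allocated page, in bytes
FRAG_SIZE = 256         # size of one radix pte fragment, in bytes


def used_bytes(fragment_count):
    """Bytes of the page actually handed out as pte fragments."""
    return fragment_count * FRAG_SIZE


def wasted_bytes(fragment_count):
    """Bytes of the page left unused."""
    return PAGE_SIZE - used_bytes(fragment_count)
```

With 16 fragments only 4K of the page is used and 60K is wasted; with 256 fragments the whole 64K page is used.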
      
        256 threads:
          without patch (10 runs of ./ebizzy  -m -n 1000 -s 131072 -S 100)
            median = 15678.5
            stdev = 42.1209
      
          with patch:
            median = 15354
            stdev = 194.743
      
      This is with THP disabled. With THP enabled the impact of the patch
      will be less.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/keys: Update documentation and remove unnecessary check · f2ed480f
      Aneesh Kumar K.V authored
      Adds more code comments. We also remove an unnecessary pkey check
      after we check for pkey error in this patch.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead · b9ee31e1
      Nicholas Piggin authored
      When stop is executed with EC=ESL=0, it appears to execute like a
      normal instruction (resuming from NIP when woken by interrupt). So all
      the save/restore handling can be avoided completely. In particular NV
      GPRs do not have to be saved, and MSR does not have to be switched
      back to kernel MSR.
      
      So move the test for EC=ESL=0 sleep states out to power9_idle_stop,
      and return directly to the caller after stop in that case.
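The test for EC=ESL=0 amounts to a mask check on the requested PSSCR value. The sketch below is an illustration in Python rather than the kernel's assembly; the bit values for `PSSCR_EC` and `PSSCR_ESL` are assumptions written from memory of the kernel headers, and `can_skip_save_restore` is a hypothetical name.

```python
PSSCR_EC = 0x00100000    # Exit Criterion bit (assumed value)
PSSCR_ESL = 0x00200000   # Enable State Loss bit (assumed value)


def can_skip_save_restore(psscr):
    """True when stop is requested with EC=ESL=0.

    In that case stop resumes at the next instruction like a normal
    instruction, so no register save/restore is needed around it.
    """
    return (psscr & (PSSCR_EC | PSSCR_ESL)) == 0
```

Other PSSCR fields, such as the requested stop level, do not affect the result in this sketch.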
      
      This improves performance for ping-pong benchmark with the stop0_lite
      idle state by 2.54% for 2 threads in the same core, and 2.57% for
      different cores. Performance increase with HV_POSSIBLE defined will be
      improved further by avoiding the hwsync.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  2. 03 Apr, 2018 11 commits
  3. 01 Apr, 2018 5 commits
  4. 31 Mar, 2018 16 commits
    • powerpc/64s: Remove POWER4 support · 471d7ff8
      Nicholas Piggin authored
      POWER4 has been broken since at least the change 49d09bf2
      ("powerpc/64s: Optimise MSR handling in exception handling"), which
      requires mtmsrd L=1 support. This was introduced in ISA v2.01, and
      POWER4 supports ISA v2.00.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Remove unused CPU_FTR_ARCH_201 · 3735eb85
      Nicholas Piggin authored
      The last usage was removed in c17b98cf ("KVM: PPC: Book3S HV:
      Remove code for PPC970 processors") (Dec 2014).
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: Fix POWER9 DD2.2 and above in DT CPU features · 9e9626ed
      Nicholas Piggin authored
      The CPU_FTR_POWER9_DD2_1 flag is intended to be set for DD2.1 and
      above (which is what the cputable setup does). Fix DT CPU features
      quirk setup to match.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Merge with upstream changes]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: Set assembler machine type to POWER4 · 15a3204d
      Nicholas Piggin authored
      Rather than override the machine type in .S code (which can hide wrong
      or ambiguous code generation for the target), set the type to power4
      for all assembly.
      
      This also means we need to be careful not to build power4-only code
      when we're not building for Book3S, such as the "power7" versions of
      copyuser/page/memcpy.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Fix Book3E build, don't build the "power7" variants for non-Book3S]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: Explicitly add vector features to CPU_FTRS_POSSIBLE · d50614fa
      Nicholas Piggin authored
      ALTIVEC and VSX features are not added by default to the POWERx CPU
      feature sets because they are intended to be enabled by firmware.
      Currently they end up in CPU_FTRS_POSSIBLE due to their inclusion in
      the sets for other CPUs, e.g. PPC970.
      
      But they should be added individually to the CPU_FTRS_POSSIBLE set,
      because if we reduce the set of CPUs that are built-for they may
      disappear from the possible mask.
      
      It already contains CPU_FTR_VSX, so add ALTIVEC. The _COMP features
      should be used because they won't be present if compiled out.
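The way a feature can silently drop out of the possible mask can be shown with plain bit masks. This Python sketch uses made-up bit positions and a made-up `CPU_FTRS` table; only the idea that the possible mask is the OR of the feature sets of the CPUs being built for is taken from the description above.

```python
# Hypothetical bit assignments, for illustration only.
CPU_FTR_BASE = 1 << 0
CPU_FTR_ALTIVEC = 1 << 1

# Hypothetical per-CPU feature sets: the POWERx sets leave ALTIVEC to
# firmware, while PPC970 includes it directly.
CPU_FTRS = {
    "PPC970": CPU_FTR_BASE | CPU_FTR_ALTIVEC,
    "POWER8": CPU_FTR_BASE,
    "POWER9": CPU_FTR_BASE,
}


def possible_mask(built_for, explicit=0):
    """OR together the feature sets of all CPUs built for, plus explicit bits."""
    mask = explicit
    for cpu in built_for:
        mask |= CPU_FTRS[cpu]
    return mask
```

Dropping PPC970 from the build removes ALTIVEC from the mask unless it is passed in through `explicit`, which is the role the individually added features play in the change.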
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Add detail to change log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: Add all POWER9 features to CPU_FTRS_ALWAYS · b842bd0f
      Nicholas Piggin authored
      It's not a bug to have features missing in CPU_FTRS_ALWAYS, but it is a
      missed opportunity for optimisation.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Change log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/boot: Remove duplicate typedefs from libfdt_env.h · 14770453
      Mark Greer authored
      When building a uImage or zImage using ppc6xx_defconfig and some other
      defconfigs, the following error occurs with GCC 4.5.1:
      
        /arch/powerpc/boot/libfdt_env.h:10:13: error: redefinition of typedef 'uint32_t'
        /arch/powerpc/boot/types.h:21:13: note: previous declaration of 'uint32_t' was here
        /arch/powerpc/boot/libfdt_env.h:11:13: error: redefinition of typedef 'uint64_t'
        /arch/powerpc/boot/types.h:22:13: note: previous declaration of 'uint64_t' was here
      
      The problem is that commit 656ad58e ("powerpc/boot: Add OPAL
      console to epapr wrappers") adds typedefs for uint32_t and uint64_t to
      types.h but doesn't remove the pre-existing (and now duplicate)
      typedefs from libfdt_env.h.
      
      Fix the error by removing the duplicate typedefs from libfdt_env.h.
      Signed-off-by: Mark Greer <mgreer@animalcreek.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/idle: avoid sync for KVM state when waking from idle · 8c1c7fb0
      Nicholas Piggin authored
      When waking from a CPU idle instruction (e.g., nap or stop), the sync
      for ordering the KVM secondary thread state can be avoided if the
      wakeup is coming from a kernel context rather than KVM context.
      
      This improves performance for ping-pong benchmark with the stop0 idle
      state by 0.46% for 2 threads in the same core, and 1.02% for different
      cores.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/idle: POWER9 implement a separate idle stop function for hotplug · 3d4fbffd
      Nicholas Piggin authored
      Implement a new function to invoke stop, power9_offline_stop, which is
      like power9_idle_stop but used by the cpu hotplug code.
      
      Move KVM secondary state manipulation code to the offline case.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: sreset panic if there is no debugger or crash dump handlers · d40b6768
      Nicholas Piggin authored
      system_reset_exception does most of its own crash handling now,
      invoking the debugger or crash dumps if they are registered. If not,
      then it goes through to die() to print stack traces, and then is
      supposed to panic (according to comments).
      
      However after die() prints oopses, it does its own handling which
      doesn't allow system_reset_exception to panic (e.g., it may just
      kill the current process). This patch causes sreset exceptions to
      return from die after it prints messages but before acting.
      
      This also stops die from invoking the debugger on 0x100 crashes.
      system_reset_exception similarly calls the debugger. It had been
      thought this was harmless (because if the debugger was disabled,
      neither call would fire, and if it was enabled the first call
      would return). However in some cases like xmon 'X' command, the
      debugger returns 0, which currently causes it to be entered
      again (first in system_reset_exception, then in die), which is
      confusing.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: return more carefully from sreset NMI · 15b4dd79
      Nicholas Piggin authored
      System Reset, being an NMI, must return more carefully than other
      interrupts. It has traditionally returned via the normal return
      from exception path, but that has a number of problems.
      
      - r13 does not get restored if returning to kernel. This is for
        interrupts which may cause a context switch, which sreset will
        never do. Interrupting OPAL (which uses a different r13) is one
        place where this causes breakage.
      
      - It may cause several other problems returning to kernel with
        preempt or TIF_EMULATE_STACK_STORE if it hits at the wrong time.
      
      It's safer just to have a simple restore and return, like machine
      check which is the other NMI.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/eeh: Fix race with driver un/bind · f0295e04
      Michael Neuling authored
      The current EEH callbacks can race with a driver unbind. This can
      result in backtraces like this:
      
        EEH: Frozen PHB#0-PE#1fc detected
        EEH: PE location: S000009, PHB location: N/A
        CPU: 2 PID: 2312 Comm: kworker/u258:3 Not tainted 4.15.6-openpower1 #2
        Workqueue: nvme-wq nvme_reset_work [nvme]
        Call Trace:
          dump_stack+0x9c/0xd0 (unreliable)
          eeh_dev_check_failure+0x420/0x470
          eeh_check_failure+0xa0/0xa4
          nvme_reset_work+0x138/0x1414 [nvme]
          process_one_work+0x1ec/0x328
          worker_thread+0x2e4/0x3a8
          kthread+0x14c/0x154
          ret_from_kernel_thread+0x5c/0xc8
        nvme nvme1: Removing after probe failure status: -19
        <snip>
        cpu 0x23: Vector: 300 (Data Access) at [c000000ff50f3800]
            pc: c0080000089a0eb0: nvme_error_detected+0x4c/0x90 [nvme]
            lr: c000000000026564: eeh_report_error+0xe0/0x110
            sp: c000000ff50f3a80
           msr: 9000000000009033
           dar: 400
         dsisr: 40000000
          current = 0xc000000ff507c000
          paca    = 0xc00000000fdc9d80   softe: 0        irq_happened: 0x01
            pid   = 782, comm = eehd
        Linux version 4.15.6-openpower1 (smc@smc-desktop) (gcc version 6.4.0 (Buildroot 2017.11.2-00008-g4b6188e)) #2 SMP Tue Feb 27 12:33:27 PST 2018
        enter ? for help
          eeh_report_error+0xe0/0x110
          eeh_pe_dev_traverse+0xc0/0xdc
          eeh_handle_normal_event+0x184/0x4c4
          eeh_handle_event+0x30/0x288
          eeh_event_handler+0x124/0x170
          kthread+0x14c/0x154
          ret_from_kernel_thread+0x5c/0xc8
      
      The first part is an EEH (on boot), the second half is the resulting
      crash. nvme probe starts the nvme_reset_work() worker thread. This
      worker thread starts touching the device, which sees a device error
      (EEH) and hence queues up an event in the powerpc EEH worker
      thread. nvme_reset_work() then continues and runs
      nvme_remove_dead_ctrl_work() which results in unbinding the driver
      from the device and hence releases all resources. At the same time,
      the EEH worker thread starts doing the EEH .error_detected() driver
      callback, which no longer works since the resources have been freed.
      
      This fixes the problem in the same way the generic PCIe AER code (in
      drivers/pci/pcie/aer/aerdrv_core.c) does. It makes the EEH code hold
      the device_lock() while performing the driver EEH callbacks and
      associated code. This ensures either the callbacks are no longer
      registered, or if they are registered the driver will not be removed
      from underneath us.
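The locking rule described above can be modelled in a few lines. This is a Python sketch under stated assumptions, not the EEH code: `Device`, `unbind` and `report_error` are hypothetical names, and a `threading.Lock` stands in for device_lock().

```python
import threading


class Device:
    def __init__(self, driver):
        self.lock = threading.Lock()   # stands in for device_lock()
        self.driver = driver           # dict of callbacks, or None when unbound


def unbind(dev):
    """Driver removal: takes the device lock before dropping the driver."""
    with dev.lock:
        dev.driver = None


def report_error(dev):
    """Error reporting: hold the lock across the check and the callback."""
    with dev.lock:
        if dev.driver is None:
            # Driver already unbound: no callback is registered any more.
            return None
        # The driver cannot be removed while the lock is held here.
        return dev.driver["error_detected"]()
```

Because both paths take the same lock, the callback either runs against a driver that is still bound or is skipped entirely; it never runs against freed resources.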
      
      This has been broken forever. The EEH callbacks were first introduced
      in 2005 (in 77bd7415) but it's not clear if a lock was needed back
      then.
      
      Fixes: 77bd7415 ("[PATCH] powerpc: PCI Error Recovery: PPC64 core recovery routines")
      Cc: stable@vger.kernel.org # v2.6.16+
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Reviewed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/kexec_file: Fix error code when trying to load kdump kernel · bf8a1abc
      Thiago Jung Bauermann authored
      kexec_file_load() on powerpc doesn't support kdump kernels yet, so it
      returns -ENOTSUPP in that case.
      
      I've recently learned that this errno is internal to the kernel and
      isn't supposed to be exposed to userspace. Therefore, change to
      -EOPNOTSUPP which is defined in an uapi header.
      
      This does indeed make kexec-tools happier. Before the patch, on
      ppc64le:
      
        # ~bauermann/src/kexec-tools/build/sbin/kexec -s -p /boot/vmlinuz
        kexec_file_load failed: Unknown error 524
      
      After the patch:
      
        # ~bauermann/src/kexec-tools/build/sbin/kexec -s -p /boot/vmlinuz
        kexec_file_load failed: Operation not supported
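The difference between the two error codes is visible from userspace. The Python sketch below relies on the standard `errno` and `os` modules; the value 524 is taken from the "Unknown error 524" message above, and the helper names are illustrative.

```python
import errno
import os

ENOTSUPP = 524  # kernel-internal value, as seen in "Unknown error 524"


def visible_to_userspace(code):
    """True if the platform's errno table has a name for this code."""
    return code in errno.errorcode


def describe(code):
    """Message the C library gives for this code (platform dependent)."""
    return os.strerror(code)
```

On a typical Linux system, `describe(ENOTSUPP)` and `describe(errno.EOPNOTSUPP)` would be expected to match the two kexec messages shown above, though the exact strings depend on the C library.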
      
      Fixes: a0458284 ("powerpc: Add support code for kexec_file_load()")
      Cc: stable@vger.kernel.org # v4.10+
      Reported-by: Dave Young <dyoung@redhat.com>
      Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
      Reviewed-by: Simon Horman <horms@verge.net.au>
      Reviewed-by: Dave Young <dyoung@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/32: Remove the reserved memory hack · 7e140591
      Jonathan Neuschäfer authored
      This hack, introduced in commit c5df7f77 ("powerpc: allow ioremap
      within reserved memory regions") is now unnecessary.
      Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/wii: Don't rely on the reserved memory hack · 57deb8fe
      Jonathan Neuschäfer authored
      Because the two memory blocks (usually called MEM1 and MEM2) are not
      merged anymore, __request_region in kernel/resource.c will correctly
      allow reserving regions in the physical address space between MEM1 and
      MEM2, where many important peripherals are (GPIO, MMC, USB, ...).
      
      A previous change to __ioremap_caller in arch/powerpc/mm/pgtable_32.c
      ensures that multiple memblocks are properly considered in ioremap; this
      makes it unnecessary to set __allow_ioremap_reserved.
      Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/32: Use page_is_ram to check for RAM · 2bbf6326
      Jonathan Neuschäfer authored
      On systems where there is MMIO space between different blocks of RAM in
      the physical address space, __ioremap_caller did not allow mapping these
      MMIO areas, because they were below the end of RAM and thus considered RAM
      as well.  Use the memblock-based page_is_ram function, which returns
      false for such MMIO holes.
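The difference between the two checks can be seen on a toy memory map. This Python sketch is illustrative only: the block addresses in `MEMBLOCKS`, the `MMIO_ADDR` value and the helper names are made up, and the helpers take a physical address, whereas the kernel's page_is_ram takes a page frame number.

```python
# Hypothetical physical layout: two RAM blocks with a hole between them.
MEMBLOCKS = [
    (0x00000000, 0x01800000),   # first RAM block  (start, end)
    (0x10000000, 0x14000000),   # second RAM block (start, end)
]
MMIO_ADDR = 0x0D000000          # hypothetical peripheral inside the hole


def below_end_of_ram(addr):
    """Old-style check: anything below the end of the last block counts as RAM."""
    return addr < max(end for _, end in MEMBLOCKS)


def addr_is_ram(addr):
    """Memblock-style check: RAM only if the address lies inside some block."""
    return any(start <= addr < end for start, end in MEMBLOCKS)
```

The old-style check classifies `MMIO_ADDR` as RAM and would refuse to map it, while the memblock-style check correctly reports the hole as not RAM.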
      
      v2:
        Keep the check for p < virt_to_phys(high_memory). On 32-bit systems
        with high memory (memory above physical address 4GiB), the high memory
        is expected to be available through ioremap. The high_memory variable
        marks the end of low memory; comparing against it means that only
        ioremap requests for low RAM will be denied.
        Reported by Michael Ellerman.
      Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>