1. 13 Oct, 2018 23 commits
  2. 09 Oct, 2018 1 commit
    • Michael Ellerman's avatar
      Merge branch 'fixes' into next · 9b7e4d60
      Michael Ellerman authored
      Merge our fixes branch. It has a few important fixes that are needed for
      futher testing and also some commits that will conflict with content in
      next.
      9b7e4d60
  3. 08 Oct, 2018 7 commits
  4. 05 Oct, 2018 2 commits
    • Srikar Dronamraju's avatar
      powerpc/numa: Skip onlining a offline node in kdump path · ac1788cc
      Srikar Dronamraju authored
      With commit 2ea62630 ("powerpc/topology: Get topology for shared
      processors at boot"), kdump kernel on shared LPAR may crash.
      
      The necessary conditions are
      - Shared LPAR with at least 2 nodes having memory and CPUs.
      - Memory requirement for kdump kernel must be met by the first N-1
        nodes where there are at least N nodes with memory and CPUs.
      
      Example numactl of such a machine.
        $ numactl -H
        available: 5 nodes (0,2,5-7)
        node 0 cpus:
        node 0 size: 0 MB
        node 0 free: 0 MB
        node 2 cpus:
        node 2 size: 255 MB
        node 2 free: 189 MB
        node 5 cpus: 24 25 26 27 28 29 30 31
        node 5 size: 4095 MB
        node 5 free: 4024 MB
        node 6 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
        node 6 size: 6353 MB
        node 6 free: 5998 MB
        node 7 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39
        node 7 size: 7640 MB
        node 7 free: 7164 MB
        node distances:
        node   0   2   5   6   7
          0:  10  40  40  40  40
          2:  40  10  40  40  40
          5:  40  40  10  40  40
          6:  40  40  40  10  20
          7:  40  40  40  20  10
      
      Steps to reproduce.
      1. Load / start kdump service.
      2. Trigger a kdump (for example : echo c > /proc/sysrq-trigger)
      
      When booting a kdump kernel with 2048M:
      
        kexec: Starting switchover sequence.
        I'm in purgatory
        Using 1TB segments
        hash-mmu: Initializing hash mmu with SLB
        Linux version 4.19.0-rc5-master+ (srikar@linux-xxu6) (gcc version 4.8.5 (SUSE Linux)) #1 SMP Thu Sep 27 19:45:00 IST 2018
        Found initrd at 0xc000000009e70000:0xc00000000ae554b4
        Using pSeries machine description
        -----------------------------------------------------
        ppc64_pft_size    = 0x1e
        phys_mem_size     = 0x88000000
        dcache_bsize      = 0x80
        icache_bsize      = 0x80
        cpu_features      = 0x000000ff8f5d91a7
          possible        = 0x0000fbffcf5fb1a7
          always          = 0x0000006f8b5c91a1
        cpu_user_features = 0xdc0065c2 0xef000000
        mmu_features      = 0x7c006001
        firmware_features = 0x00000007c45bfc57
        htab_hash_mask    = 0x7fffff
        physical_start    = 0x8000000
        -----------------------------------------------------
        numa:   NODE_DATA [mem 0x87d5e300-0x87d67fff]
        numa:     NODE_DATA(0) on node 6
        numa:   NODE_DATA [mem 0x87d54600-0x87d5e2ff]
        Top of RAM: 0x88000000, Total RAM: 0x88000000
        Memory hole size: 0MB
        Zone ranges:
          DMA      [mem 0x0000000000000000-0x0000000087ffffff]
          DMA32    empty
          Normal   empty
        Movable zone start for each node
        Early memory node ranges
          node   6: [mem 0x0000000000000000-0x0000000087ffffff]
        Could not find start_pfn for node 0
        Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
        On node 0 totalpages: 0
        Initmem setup node 6 [mem 0x0000000000000000-0x0000000087ffffff]
        On node 6 totalpages: 34816
      
        Unable to handle kernel paging request for data at address 0x00000060
        Faulting instruction address: 0xc000000008703a54
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in:
        CPU: 11 PID: 1 Comm: swapper/11 Not tainted 4.19.0-rc5-master+ #1
        NIP:  c000000008703a54 LR: c000000008703a38 CTR: 0000000000000000
        REGS: c00000000b673440 TRAP: 0380   Not tainted  (4.19.0-rc5-master+)
        MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 24022022  XER: 20000002
        CFAR: c0000000086fc238 IRQMASK: 0
        GPR00: c000000008703a38 c00000000b6736c0 c000000009281900 0000000000000000
        GPR04: 0000000000000000 0000000000000000 fffffffffffff001 c00000000b660080
        GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000220
        GPR12: 0000000000002200 c000000009e51400 0000000000000000 0000000000000008
        GPR16: 0000000000000000 c000000008c152e8 c000000008c152a8 0000000000000000
        GPR20: c000000009422fd8 c000000009412fd8 c000000009426040 0000000000000008
        GPR24: 0000000000000000 0000000000000000 c000000009168bc8 c000000009168c78
        GPR28: c00000000b126410 0000000000000000 c00000000916a0b8 c00000000b126400
        NIP [c000000008703a54] bus_add_device+0x84/0x1e0
        LR [c000000008703a38] bus_add_device+0x68/0x1e0
        Call Trace:
        [c00000000b6736c0] [c000000008703a38] bus_add_device+0x68/0x1e0 (unreliable)
        [c00000000b673740] [c000000008700194] device_add+0x454/0x7c0
        [c00000000b673800] [c00000000872e660] __register_one_node+0xb0/0x240
        [c00000000b673860] [c00000000839a6bc] __try_online_node+0x12c/0x180
        [c00000000b673900] [c00000000839b978] try_online_node+0x58/0x90
        [c00000000b673930] [c0000000080846d8] find_and_online_cpu_nid+0x158/0x190
        [c00000000b673a10] [c0000000080848a0] numa_update_cpu_topology+0x190/0x580
        [c00000000b673c00] [c000000008d3f2e4] smp_cpus_done+0x94/0x108
        [c00000000b673c70] [c000000008d5c00c] smp_init+0x174/0x19c
        [c00000000b673d00] [c000000008d346b8] kernel_init_freeable+0x1e0/0x450
        [c00000000b673dc0] [c0000000080102e8] kernel_init+0x28/0x160
        [c00000000b673e30] [c00000000800b65c] ret_from_kernel_thread+0x5c/0x80
        Instruction dump:
        60000000 60000000 e89e0020 7fe3fb78 4bff87d5 60000000 7c7d1b79 4082008c
        e8bf0050 e93e0098 3b9f0010 2fa50000 <e8690060> 38630018 419e0114 7f84e378
        ---[ end trace 593577668c2daa65 ]---
      
      However a regular kernel with 4096M (2048 gets reserved for crash
      kernel) boots properly.
      
      Unlike regular kernels, which mark all available nodes as online,
      kdump kernel only marks just enough nodes as online and marks the rest
      as offline at boot. However kdump kernel boots with all available
      CPUs. With Commit 2ea62630 ("powerpc/topology: Get topology for
      shared processors at boot"), all CPUs are onlined on their respective
      nodes at boot time. try_online_node() tries to online the offline
      nodes but fails as all needed subsystems are not yet initialized.
      
      As part of fix, detect and skip early onlining of a offline node.
      
      Fixes: 2ea62630 ("powerpc/topology: Get topology for shared processors at boot")
      Reported-by: default avatarPavithra Prakash <pavrampu@in.ibm.com>
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Tested-by: default avatarHari Bathini <hbathini@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      ac1788cc
    • Michael Ellerman's avatar
      powerpc: Don't print kernel instructions in show_user_instructions() · a932ed3b
      Michael Ellerman authored
      Recently we implemented show_user_instructions() which dumps the code
      around the NIP when a user space process dies with an unhandled
      signal. This was modelled on the x86 code, and we even went so far as
      to implement the exact same bug, namely that if the user process
      crashed with its NIP pointing into the kernel we will dump kernel text
      to dmesg. eg:
      
        bad-bctr[2996]: segfault (11) at c000000000010000 nip c000000000010000 lr 12d0b0894 code 1
        bad-bctr[2996]: code: fbe10068 7cbe2b78 7c7f1b78 fb610048 38a10028 38810020 fb810050 7f8802a6
        bad-bctr[2996]: code: 3860001c f8010080 48242371 60000000 <7c7b1b79> 4082002c e8010080 eb610048
      
      This was discovered on x86 by Jann Horn and fixed in commit
      342db04a ("x86/dumpstack: Don't dump kernel memory based on usermode RIP").
      
      Fix it by checking the adjusted NIP value (pc) and number of
      instructions against USER_DS, and bail if we fail the check, eg:
      
        bad-bctr[2969]: segfault (11) at c000000000010000 nip c000000000010000 lr 107930894 code 1
        bad-bctr[2969]: Bad NIP, not dumping instructions.
      
      Fixes: 88b0fe17 ("powerpc: Add show_user_instructions()")
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      a932ed3b
  5. 04 Oct, 2018 7 commits
    • Nicholas Piggin's avatar
      powerpc/64s/radix: Explicitly flush ERAT with local LPID invalidation · 053c5a75
      Nicholas Piggin authored
      Local radix TLB flush operations that operate on congruence classes
      have explicit ERAT flushes for POWER9. The process scoped LPID flush
      did not have a flush, so add it.
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      053c5a75
    • Nicholas Piggin's avatar
      powerpc/64s/hash: Do not use PPC_INVALIDATE_ERAT on CPUs before POWER9 · bc276ecb
      Nicholas Piggin authored
      PPC_INVALIDATE_ERAT is slbia IH=7 which is a new variant introduced
      with POWER9, and the result is undefined on earlier CPUs.
      
      Commits 7b9f71f9 ("powerpc/64s: POWER9 machine check handler") and
      d4748276 ("powerpc/64s: Improve local TLB flush for boot and MCE on
      POWER9") caused POWER7/8 code to use this instruction. Remove it. An
      ERAT flush can be made by invalidatig the SLB, but before POWER9 that
      requires a flush and rebolt.
      
      Fixes: 7b9f71f9 ("powerpc/64s: POWER9 machine check handler")
      Fixes: d4748276 ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
      Cc: stable@vger.kernel.org # v4.11+
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      bc276ecb
    • Anton Blanchard's avatar
      powerpc/time: Add set_state_oneshot_stopped decrementer callback · 81759360
      Anton Blanchard authored
      If CONFIG_PPC_WATCHDOG is enabled we always cap the decrementer to
      0x7fffffff:
      
             if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
                      set_dec(0x7fffffff);
              else
                      set_dec(decrementer_max);
      
      If there are no future events, we don't reprogram the decrementer
      after this and we end up with 0x7fffffff even on a large decrementer
      capable system.
      
      As suggested by Nick, add a set_state_oneshot_stopped callback
      so we program the decrementer with decrementer_max if there are
      no future events.
      Signed-off-by: default avatarAnton Blanchard <anton@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      81759360
    • Anton Blanchard's avatar
      powerpc/time: Use clockevents_register_device(), fixing an issue with large decrementer · 8b78fdb0
      Anton Blanchard authored
      We currently cap the decrementer clockevent at 4 seconds, even on systems
      with large decrementer support. Fix this by converting the code to use
      clockevents_register_device() which calculates the upper bound based on
      the max_delta passed in.
      Signed-off-by: default avatarAnton Blanchard <anton@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      8b78fdb0
    • Mark Hairgrove's avatar
      powerpc/powernv/npu: Remove atsd_threshold debugfs setting · f86ad3e0
      Mark Hairgrove authored
      This threshold is no longer used now that all invalidates issue a single
      ATSD to each active NPU.
      Signed-off-by: default avatarMark Hairgrove <mhairgrove@nvidia.com>
      Reviewed-by: default avatarAlistair Popple <alistair@popple.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      f86ad3e0
    • Mark Hairgrove's avatar
      powerpc/powernv/npu: Use size-based ATSD invalidates · 3689c37d
      Mark Hairgrove authored
      Prior to this change only two types of ATSDs were issued to the NPU:
      invalidates targeting a single page and invalidates targeting the whole
      address space. The crossover point happened at the configurable
      atsd_threshold which defaulted to 2M. Invalidates that size or smaller
      would issue per-page invalidates for the whole range.
      
      The NPU supports more invalidation sizes however: 64K, 2M, 1G, and all.
      These invalidates target addresses aligned to their size. 2M is a common
      invalidation size for GPU-enabled applications because that is a GPU
      page size, so reducing the number of invalidates by 32x in that case is a
      clear improvement.
      
      ATSD latency is high in general so now we always issue a single invalidate
      rather than multiple. This will over-invalidate in some cases, but for any
      invalidation size over 2M it matches or improves the prior behavior.
      There's also an improvement for single-page invalidates since the prior
      version issued two invalidates for that case instead of one.
      
      With this change all issued ATSDs now perform a flush, so the flush
      parameter has been removed from all the helpers.
      
      To show the benefit here are some performance numbers from a
      microbenchmark which creates a 1G allocation then uses mprotect with
      PROT_NONE to trigger invalidates in strides across the allocation.
      
      One NPU (1 GPU):
      
               mprotect rate (GB/s)
      Stride   Before      After      Speedup
      64K         5.3        5.6           5%
      1M         39.3       57.4          46%
      2M         49.7       82.6          66%
      4M        286.6      285.7           0%
      
      Two NPUs (6 GPUs):
      
               mprotect rate (GB/s)
      Stride   Before      After      Speedup
      64K         6.5        7.4          13%
      1M         33.4       67.9         103%
      2M         38.7       93.1         141%
      4M        356.7      354.6          -1%
      
      Anything over 2M is roughly the same as before since both cases issue a
      single ATSD.
      Signed-off-by: default avatarMark Hairgrove <mhairgrove@nvidia.com>
      Reviewed-By: default avatarAlistair Popple <alistair@popple.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      3689c37d
    • Mark Hairgrove's avatar
      powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates · 7ead15a1
      Mark Hairgrove authored
      There are two types of ATSDs issued to the NPU: invalidates targeting a
      specific virtual address and invalidates targeting the whole address
      space. In both cases prior to this change, the sequence was:
      
          for each NPU
              - Write the target address to the XTS_ATSD_AVA register
              - EIEIO
              - Write the launch value to issue the ATSD
      
      First, a target address is not required when invalidating the whole
      address space, so that write and the EIEIO have been removed. The AP
      (size) field in the launch is not needed either.
      
      Second, for per-address invalidates the above sequence is inefficient in
      the common case of multiple NPUs because an EIEIO is issued per NPU. This
      unnecessarily forces the launches of later ATSDs to be ordered with the
      launches of earlier ones. The new sequence only issues a single EIEIO:
      
          for each NPU
              - Write the target address to the XTS_ATSD_AVA register
          EIEIO
          for each NPU
              - Write the launch value to issue the ATSD
      
      Performance results were gathered using a microbenchmark which creates a
      1G allocation then uses mprotect with PROT_NONE to trigger invalidates in
      strides across the allocation.
      
      With only a single NPU active (one GPU) the difference is in the noise for
      both types of invalidates (+/-1%).
      
      With two NPUs active (on a 6-GPU system) the effect is more noticeable:
      
               mprotect rate (GB/s)
      Stride   Before      After      Speedup
      64K         5.9        6.5          10%
      1M         31.2       33.4           7%
      2M         36.3       38.7           7%
      4M        322.6      356.7          11%
      Signed-off-by: default avatarMark Hairgrove <mhairgrove@nvidia.com>
      Reviewed-by: default avatarAlistair Popple <alistair@popple.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      7ead15a1