1. 02 Jul, 2019 15 commits
  2. 01 Jul, 2019 15 commits
  3. 25 Jun, 2019 1 commit
    • Nicholas Piggin's avatar
      powerpc/64s/exception: Fix machine check early corrupting AMR · e13e7cd4
      Nicholas Piggin authored
      The early machine check runs in real mode, so locking is unnecessary.
      Worse, the windup does not restore AMR, so this can result in a false
      KUAP fault after a recoverable machine check hits inside a user copy
      operation.
      
      Fix this similarly to HMI by just avoiding the kuap lock in the
      early machine check handler (it will be set by the late handler that
      runs in virtual mode if that runs). If the virtual mode handler is
      reached, it will lock and restore the AMR.
      
      Fixes: 890274c2 ("powerpc/64s: Implement KUAP for Radix MMU")
      Cc: Russell Currey <ruscur@russell.cc>
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      e13e7cd4
  4. 20 Jun, 2019 4 commits
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Clear pending decrementer exceptions on nested guest entry · 3c25ab35
      Suraj Jitindar Singh authored
      If we enter an L1 guest with a pending decrementer exception then this
      is cleared on guest exit if the guest has writtien a positive value
      into the decrementer (indicating that it handled the decrementer
      exception) since there is no other way to detect that the guest has
      handled the pending exception and that it should be dequeued. In the
      event that the L1 guest tries to run a nested (L2) guest immediately
      after this and the L2 guest decrementer is negative (which is loaded
      by L1 before making the H_ENTER_NESTED hcall), then the pending
      decrementer exception isn't cleared and the L2 entry is blocked since
      L1 has a pending exception, even though L1 may have already handled
      the exception and written a positive value for it's decrementer. This
      results in a loop of L1 trying to enter the L2 guest and L0 blocking
      the entry since L1 has an interrupt pending with the outcome being
      that L2 never gets to run and hangs.
      
      Fix this by clearing any pending decrementer exceptions when L1 makes
      the H_ENTER_NESTED hcall since it won't do this if it's decrementer
      has gone negative, and anyway it's decrementer has been communicated
      to L0 in the hdec_expires field and L0 will return control to L1 when
      this goes negative by delivering an H_DECREMENTER exception.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      3c25ab35
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Signed extend decrementer value if not using large decrementer · 86953770
      Suraj Jitindar Singh authored
      On POWER9 the decrementer can operate in large decrementer mode where
      the decrementer is 56 bits and signed extended to 64 bits. When not
      operating in this mode the decrementer behaves as a 32 bit decrementer
      which is NOT signed extended (as on POWER8).
      
      Currently when reading a guest decrementer value we don't take into
      account whether the large decrementer is enabled or not, and this
      means the value will be incorrect when the guest is not using the
      large decrementer. Fix this by sign extending the value read when the
      guest isn't using the large decrementer.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      86953770
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries · 50087112
      Suraj Jitindar Singh authored
      When a guest vcpu moves from one physical thread to another it is
      necessary for the host to perform a tlb flush on the previous core if
      another vcpu from the same guest is going to run there. This is because the
      guest may use the local form of the tlb invalidation instruction meaning
      stale tlb entries would persist where it previously ran. This is handled
      on guest entry in kvmppc_check_need_tlb_flush() which calls
      flush_guest_tlb() to perform the tlb flush.
      
      Previously the generic radix__local_flush_tlb_lpid_guest() function was
      used, however the functionality was reimplemented in flush_guest_tlb()
      to avoid the trace_tlbie() call as the flushing may be done in real
      mode. The reimplementation in flush_guest_tlb() was missing an erat
      invalidation after flushing the tlb.
      
      This lead to observable memory corruption in the guest due to the
      caching of stale translations. Fix this by adding the erat invalidation.
      
      Fixes: 70ea13f6 ("KVM: PPC: Book3S HV: Flush TLB on secondary radix threads")
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      50087112
    • Alexey Kardashevskiy's avatar
      powerpc/pci/of: Fix OF flags parsing for 64bit BARs · df5be5be
      Alexey Kardashevskiy authored
      When the firmware does PCI BAR resource allocation, it passes the assigned
      addresses and flags (prefetch/64bit/...) via the "reg" property of
      a PCI device device tree node so the kernel does not need to do
      resource allocation.
      
      The flags are stored in resource::flags - the lower byte stores
      PCI_BASE_ADDRESS_SPACE/etc bits and the other bytes are IORESOURCE_IO/etc.
      Some flags from PCI_BASE_ADDRESS_xxx and IORESOURCE_xxx are duplicated,
      such as PCI_BASE_ADDRESS_MEM_PREFETCH/PCI_BASE_ADDRESS_MEM_TYPE_64/etc.
      When parsing the "reg" property, we copy the prefetch flag but we skip
      on PCI_BASE_ADDRESS_MEM_TYPE_64 which leaves the flags out of sync.
      
      The missing IORESOURCE_MEM_64 flag comes into play under 2 conditions:
      1. we remove PCI_PROBE_ONLY for pseries (by hacking pSeries_setup_arch()
      or by passing "/chosen/linux,pci-probe-only");
      2. we request resource alignment (by passing pci=resource_alignment=
      via the kernel cmd line to request PAGE_SIZE alignment or defining
      ppc_md.pcibios_default_alignment which returns anything but 0). Note that
      the alignment requests are ignored if PCI_PROBE_ONLY is enabled.
      
      With 1) and 2), the generic PCI code in the kernel unconditionally
      decides to:
      - reassign the BARs in pci_specified_resource_alignment() (works fine)
      - write new BARs to the device - this fails for 64bit BARs as the generic
      code looks at IORESOURCE_MEM_64 (not set) and writes only lower 32bits
      of the BAR and leaves the upper 32bit unmodified which breaks BAR mapping
      in the hypervisor.
      
      This fixes the issue by copying the flag. This is useful if we want to
      enforce certain BAR alignment per platform as handling subpage sized BARs
      is proven to cause problems with hotplug (SLOF already aligns BARs to 64k).
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarSam Bobroff <sbobroff@linux.ibm.com>
      Reviewed-by: default avatarOliver O'Halloran <oohall@gmail.com>
      Reviewed-by: default avatarShawn Anastasio <shawn@anastas.io>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      df5be5be
  5. 19 Jun, 2019 5 commits
    • Christoph Hellwig's avatar
      powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac · 9739ab7e
      Christoph Hellwig authored
      With the strict dma mask checking introduced with the switch to
      the generic DMA direct code common wifi chips on 32-bit powerbooks
      stopped working.  Add a 30-bit ZONE_DMA to the 32-bit pmac builds
      to allow them to reliably allocate dma coherent memory.
      
      Fixes: 65a21b71 ("powerpc/dma: remove dma_nommu_dma_supported")
      Reported-by: default avatarAaro Koskinen <aaro.koskinen@iki.fi>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarLarry Finger <Larry.Finger@lwfinger.net>
      Acked-by: default avatarLarry Finger <Larry.Finger@lwfinger.net>
      Tested-by: default avatarAaro Koskinen <aaro.koskinen@iki.fi>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      9739ab7e
    • Nicholas Piggin's avatar
      powerpc/64s/radix: Enable HAVE_ARCH_HUGE_VMAP · d909f910
      Nicholas Piggin authored
      This sets the HAVE_ARCH_HUGE_VMAP option, and defines the required
      page table functions.
      
      This enables huge (2MB and 1GB) ioremap mappings. I don't have a
      benchmark for this change, but huge vmap will be used by a later core
      kernel change to enable huge vmalloc memory mappings. This improves
      cached `git diff` performance by about 5% on a 2-node POWER9 with 32MB
      size dentry cache hash.
      
        Profiling git diff dTLB misses with a vanilla kernel:
      
        81.75%  git      [kernel.vmlinux]    [k] __d_lookup_rcu
         7.21%  git      [kernel.vmlinux]    [k] strncpy_from_user
         1.77%  git      [kernel.vmlinux]    [k] find_get_entry
         1.59%  git      [kernel.vmlinux]    [k] kmem_cache_free
      
                  40,168      dTLB-miss
             0.100342754 seconds time elapsed
      
        With powerpc huge vmalloc:
      
                   2,987      dTLB-miss
             0.095933138 seconds time elapsed
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      d909f910
    • Nicholas Piggin's avatar
      powerpc/64s/radix: ioremap use ioremap_page_range · d38153f9
      Nicholas Piggin authored
      Radix can use ioremap_page_range for ioremap, after slab is available.
      This makes it possible to enable huge ioremap mapping support.
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      d38153f9
    • Nicholas Piggin's avatar
      powerpc/64: __ioremap_at clean up in the error case · a72808a7
      Nicholas Piggin authored
      __ioremap_at error handling is wonky, it requires caller to clean up
      after it. Implement a helper that does the map and error cleanup and
      remove the requirement from the caller.
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      a72808a7
    • Anju T Sudhakar's avatar
      powerpc/perf: Use cpumask_last() to determine the designated cpu for nest/core units. · 9c9f8fb7
      Anju T Sudhakar authored
      Nest and core IMC (In-Memory Collection counters) assigns a particular
      cpu as the designated target for counter data collection. During
      system boot, the first online cpu in a chip gets assigned as the
      designated cpu for that chip(for nest-imc) and the first online cpu in
      a core gets assigned as the designated cpu for that core(for
      core-imc).
      
      If the designated cpu goes offline, the next online cpu from the same
      chip(for nest-imc)/core(for core-imc) is assigned as the next target,
      and the event context is migrated to the target cpu. Currently,
      cpumask_any_but() function is used to find the target cpu. Though this
      function is expected to return a `random` cpu, this always returns the
      next online cpu.
      
      If all cpus in a chip/core is offlined in a sequential manner,
      starting from the first cpu, the event migration has to happen for all
      the cpus which goes offline. Since the migration process involves a
      grace period, the total time taken to offline all the cpus will be
      significantly high.
      
      Example:
        In a system which has 2 sockets, with
        NUMA node0 CPU(s):     0-87
        NUMA node8 CPU(s):     88-175
      
        Time taken to offline cpu 88-175:
        real    2m56.099s
        user    0m0.191s
        sys     0m0.000s
      
      Use cpumask_last() to choose the target cpu, when the designated cpu
      goes online, so the migration will happen only when the last_cpu in
      the mask goes offline. This way the time taken to offline all cpus in
      a chip/core can be reduced.
      
      With the patch:
      
        Time taken  to offline cpu 88-175:
        real    0m12.207s
        user    0m0.171s
        sys     0m0.000s
      
      Offlining all cpus in reverse order is also taken care because,
      cpumask_any_but() is used to find the designated cpu if the last cpu
      in the mask goes offline. Since cpumask_any_but() always return the
      first cpu in the mask, that becomes the designated cpu and migration
      will happen only when the first_cpu in the mask goes offline.
      
      Example: With the patch,
      
        Time taken to offline cpu from 175-88:
        real    0m9.330s
        user    0m0.110s
        sys     0m0.000s
      Signed-off-by: default avatarAnju T Sudhakar <anju@linux.vnet.ibm.com>
      Reviewed-by: default avatarMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      9c9f8fb7