1. 08 Oct, 2020 23 commits
  2. 07 Oct, 2020 7 commits
  3. 06 Oct, 2020 10 commits
    • pseries/hotplug-memory: hot-add: skip redundant LMB lookup · 72cdd117
      Scott Cheloha authored
      During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
      to determine which node id (nid) to use when later calling __add_memory().
      
      This is wasteful.  On pseries, memory_add_physaddr_to_nid() finds an
      appropriate nid for a given address by looking up the LMB containing the
      address and then passing that LMB to of_drconf_to_nid_single() to get the
      nid.  In dlpar_add_lmb() we get this address from the LMB itself.
      
      In short, we have a pointer to an LMB and then we are searching for
      that LMB *again* in order to find its nid.
      
      If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
      can skip the redundant lookup.  The only error handling we need to
      duplicate from memory_add_physaddr_to_nid() is the fallback to the
      default nid when of_drconf_to_nid_single() returns -1 (NUMA_NO_NODE)
      or an invalid nid.
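
The difference between the two paths can be sketched in a few lines of standalone C. This is only an illustration under stated assumptions, not the kernel code: `struct lmb`, `LMB_SIZE`, `MAX_NODES`, `lmb_to_nid()`, `nid_by_addr()` and `nid_by_lmb()` are hypothetical stand-ins, and `lookup_steps` exists only to make the cost of the linear search visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_NODE   (-1)  /* stand-in for NUMA_NO_NODE */
#define MAX_NODES 4     /* hypothetical number of valid node ids */

/* Simplified stand-in for an LMB entry; not the kernel's structure. */
struct lmb {
    uint64_t base_addr;
    int nid;            /* node id recorded for this LMB, or NO_NODE */
};

static const uint64_t LMB_SIZE = 0x10000000ULL; /* assumed LMB size */
static unsigned long lookup_steps;              /* entries visited by the search */

/* Stand-in for of_drconf_to_nid_single(): map one LMB to its nid. */
static int lmb_to_nid(const struct lmb *lmb)
{
    return lmb->nid;
}

/* Fall back to the default nid when the result is NO_NODE or out of range. */
static int nid_or_default(int nid, int default_nid)
{
    if (nid < 0 || nid >= MAX_NODES)
        return default_nid;
    return nid;
}

/* Old path: search for the LMB containing addr, then map it to a nid. */
int nid_by_addr(const struct lmb *lmbs, size_t n, uint64_t addr, int default_nid)
{
    int nid = NO_NODE;

    for (size_t i = 0; i < n; i++) {
        lookup_steps++;
        if (addr >= lmbs[i].base_addr && addr < lmbs[i].base_addr + LMB_SIZE) {
            nid = lmb_to_nid(&lmbs[i]);
            break;
        }
    }
    return nid_or_default(nid, default_nid);
}

/* New path: the caller already holds the LMB, so no search is needed. */
int nid_by_lmb(const struct lmb *lmb, int default_nid)
{
    assert(lmb != NULL);
    return nid_or_default(lmb_to_nid(lmb), default_nid);
}
```

In this sketch both functions return the same nid for the same LMB, but `nid_by_addr()` visits every entry up to the match, so its cost grows with the LMB's position in the array.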
      
      Skipping the extra lookup makes hot-add operations faster, especially
      on machines with many LMBs.
      
      Consider an LPAR with 126976 LMBs.  In one test, hot-adding 126000
      LMBs on an unpatched kernel took ~3.5 hours while a patched kernel
      completed the same operation in ~2 hours:
      
      Unpatched (12450 seconds):
      Sep  9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
      Sep  9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
      [...]
      Sep  9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
      
      Patched (7065 seconds):
      Sep  8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
      Sep  8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
      [...]
      Sep  8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
      
      The speedup is larger when hot-adding LMBs near the end of the
      drconf range, because the lookup we are skipping is a linear LMB
      search.
      
      To see the distinction, consider a smaller hot-add test on the same
      LPAR.  A perf-stat run with 10 iterations showed that hot-adding 4096
      LMBs completed less than 1 second faster on a patched kernel:
      
      Unpatched:
       Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
      
              104,753.42 msec task-clock                #    0.992 CPUs utilized            ( +-  0.55% )
                   4,708      context-switches          #    0.045 K/sec                    ( +-  0.69% )
                   2,444      cpu-migrations            #    0.023 K/sec                    ( +-  1.25% )
                     394      page-faults               #    0.004 K/sec                    ( +-  0.22% )
         445,902,503,057      cycles                    #    4.257 GHz                      ( +-  0.55% )  (66.67%)
           8,558,376,740      stalled-cycles-frontend   #    1.92% frontend cycles idle     ( +-  0.88% )  (49.99%)
         300,346,181,651      stalled-cycles-backend    #   67.36% backend cycles idle      ( +-  0.76% )  (50.01%)
         258,091,488,691      instructions              #    0.58  insn per cycle
                                                        #    1.16  stalled cycles per insn  ( +-  0.22% )  (66.67%)
          70,568,169,256      branches                  #  673.660 M/sec                    ( +-  0.17% )  (50.01%)
           3,100,725,426      branch-misses             #    4.39% of all branches          ( +-  0.20% )  (49.99%)
      
                 105.583 +- 0.589 seconds time elapsed  ( +-  0.56% )
      
      Patched:
       Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
      
              104,055.69 msec task-clock                #    0.993 CPUs utilized            ( +-  0.32% )
                   4,606      context-switches          #    0.044 K/sec                    ( +-  0.20% )
                   2,463      cpu-migrations            #    0.024 K/sec                    ( +-  0.93% )
                     394      page-faults               #    0.004 K/sec                    ( +-  0.25% )
         442,951,129,921      cycles                    #    4.257 GHz                      ( +-  0.32% )  (66.66%)
           8,710,413,329      stalled-cycles-frontend   #    1.97% frontend cycles idle     ( +-  0.47% )  (50.06%)
         299,656,905,836      stalled-cycles-backend    #   67.65% backend cycles idle      ( +-  0.39% )  (50.02%)
         252,731,168,193      instructions              #    0.57  insn per cycle
                                                        #    1.19  stalled cycles per insn  ( +-  0.20% )  (66.66%)
          68,902,851,121      branches                  #  662.173 M/sec                    ( +-  0.13% )  (49.94%)
           3,100,242,882      branch-misses             #    4.50% of all branches          ( +-  0.15% )  (49.98%)
      
                 104.829 +- 0.325 seconds time elapsed  ( +-  0.31% )
      
      This is consistent with skipping a linear search.  An add-by-count
      hot-add operation adds LMBs greedily, so LMBs near the start of the
      drconf range are considered first.  On an otherwise idle LPAR with so
      many LMBs we would expect to find the LMBs we need near the start of
      the drconf range, hence the smaller speedup.
      Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200916145122.3408129-1-cheloha@linux.ibm.com
    • selftests/powerpc: Add a rtas_filter selftest · dc9af82e
      Andrew Donnellan authored
      Add a selftest to test the basic functionality of CONFIG_RTAS_FILTER.
      Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
      [mpe: Change rmo_start/end to 32-bit to avoid build errors on ppc64]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200820044512.7543-2-ajd@linux.ibm.com
    • powerpc/rtas: Restrict RTAS requests from userspace · bd59380c
      Andrew Donnellan authored
      A number of userspace utilities depend on making calls to RTAS to retrieve
      information and update various things.
      
      The existing API through which we expose RTAS to userspace exposes more
      RTAS functionality than we actually need, through the sys_rtas syscall,
      which allows root (or anyone with CAP_SYS_ADMIN) to make any RTAS call they
      want with arbitrary arguments.
      
      Many RTAS calls take the address of a buffer as an argument, and it's up to
      the caller to specify the physical address of that buffer. We
      allocate a buffer (the "RMO buffer") in the Real Memory Area that RTAS can
      access, and then expose the physical address and size of this buffer in
      /proc/powerpc/rtas/rmo_buffer. Userspace is expected to read this address,
      poke at the buffer using /dev/mem, and pass an address in the RMO buffer to
      the RTAS call.
      
      However, there's nothing stopping the caller from specifying whatever
      address they want in the RTAS call, and it's easy to construct a series of
      RTAS calls that can overwrite arbitrary bytes (even without /dev/mem
      access).
      
      Additionally, there are some RTAS calls that do potentially dangerous
      things and for which there are no legitimate userspace use cases.
      
      In the past, this would not have been a particularly big deal as it was
      assumed that root could modify all system state freely, but with Secure
      Boot and lockdown we need to care about this.
      
      We can't fundamentally change the ABI at this point; however, we can address
      this by implementing a filter that checks RTAS calls against a list
      of permitted calls and forces the caller to use addresses within the RMO
      buffer.
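
A minimal sketch of such a filter, assuming a table keyed by call name, might look like the following. The call names, the `struct filter_entry` layout, and the functions `in_rmo_buf()` and `rtas_call_permitted()` are all hypothetical and chosen for illustration; they are not the kernel's actual list or interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One permitted call. Indexes are -1 when the call takes no buffer. */
struct filter_entry {
    const char *name;   /* call name (hypothetical examples below) */
    int buf_idx;        /* position of the buffer-address argument, or -1 */
    int size_idx;       /* position of the buffer-size argument, or -1 */
};

/* Illustrative entries only; not the real list of permitted calls. */
static const struct filter_entry allowlist[] = {
    { "example,no-buffer",   -1, -1 },
    { "example,with-buffer",  1,  2 },
};

/* True if [addr, addr + len) lies entirely inside the RMO buffer. */
static bool in_rmo_buf(uint64_t rmo_base, uint64_t rmo_size,
                       uint64_t addr, uint64_t len)
{
    if (addr < rmo_base)
        return false;
    uint64_t off = addr - rmo_base;
    /* Written this way so that off + len cannot overflow. */
    return off <= rmo_size && len <= rmo_size - off;
}

/* Reject calls that are not listed or whose buffer leaves the RMO buffer. */
bool rtas_call_permitted(const char *name, const uint32_t *args, size_t nargs,
                         uint64_t rmo_base, uint64_t rmo_size)
{
    assert(name != NULL);

    for (size_t i = 0; i < sizeof(allowlist) / sizeof(allowlist[0]); i++) {
        const struct filter_entry *e = &allowlist[i];

        if (strcmp(e->name, name) != 0)
            continue;
        if (e->buf_idx < 0)
            return true;            /* listed call with no buffer argument */
        if ((size_t)e->buf_idx >= nargs || (size_t)e->size_idx >= nargs)
            return false;           /* malformed argument list */
        return in_rmo_buf(rmo_base, rmo_size,
                          args[e->buf_idx], args[e->size_idx]);
    }
    return false;                   /* not on the list */
}
```

The design choice worth noting is that the default is to deny: a call passes only if it is listed and every buffer it names fits inside the RMO buffer.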
      
      The list is based off the list of calls that are used by the librtas
      userspace library, and has been tested with a number of existing userspace
      RTAS utilities. For compatibility with any applications we are not aware of
      that require other calls, the filter can be turned off at build time.
      
      Cc: stable@vger.kernel.org
      Reported-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200820044512.7543-1-ajd@linux.ibm.com
    • powerpc/perf: Exclude pmc5/6 from the irrelevant PMU group constraints · 3b6c3adb
      Athira Rajeev authored
      PMU counter support functions enforce event constraints for a group
      of events to check whether all events in the group can be monitored.
      In case of event codes using PMC5 and PMC6 (500fa and 600f4
      respectively), not all constraints are applicable, for example the
      threshold or sample bits. But the current code includes PMC5 and PMC6
      in some group constraints (like the IC_DC Qualifier bits) that do not
      apply to them, and as a result those events are not counted when
      scheduled along with a group of other events. Fix this by excluding
      PMC5/6 from the constraints that are not relevant to them.
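
The effect can be illustrated with a deliberately simplified model in which every event in a group must agree on one shared qualifier field. This is an assumption made for the sketch, not how the powerpc PMU code encodes constraints; `struct event`, `uses_qualifier()` and `group_schedulable()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct event {
    int pmc;            /* counter the event is tied to, 1..6 */
    unsigned qualifier; /* value the event puts in the shared field */
};

/* With the fix, PMC5/6 events do not take part in the qualifier check. */
static bool uses_qualifier(const struct event *ev, bool exclude_pmc56)
{
    if (exclude_pmc56 && (ev->pmc == 5 || ev->pmc == 6))
        return false;
    return true;
}

/* A group is schedulable if all participating events agree on the field. */
bool group_schedulable(const struct event *evs, size_t n, bool exclude_pmc56)
{
    bool have_value = false;
    unsigned value = 0;

    assert(evs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!uses_qualifier(&evs[i], exclude_pmc56))
            continue;
        if (!have_value) {
            value = evs[i].qualifier;
            have_value = true;
        } else if (evs[i].qualifier != value) {
            return false;   /* conflicting requirement on the shared field */
        }
    }
    return true;
}
```

In this model a PMC5 event whose qualifier is left at zero conflicts with other events only when it is wrongly included in the check.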
      
      Fixes: 7ffd948f ("powerpc/perf: factor out power8 pmu functions")
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/1600672204-1610-1-git-send-email-atrajeev@linux.vnet.ibm.com
    • powerpc/smp: Optimize update_coregroup_mask · 70a94089
      Srikar Dronamraju authored
      All threads of an SMT4/SMT8 core are either all part of a CPU's
      coregroup mask or all outside the coregroup. Use this relation to
      reduce the number of iterations needed to find all the CPUs that share
      the same coregroup.
      
      Use a temporary mask to iterate through the CPUs that may share the
      coregroup mask. Also, instead of setting one CPU at a time into
      cpu_coregroup_mask, copy the SMT4/SMT8/submask in one shot.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-12-srikar@linux.vnet.ibm.com
    • powerpc/smp: Move coregroup mask updation to a new function · b8a97cb4
      Srikar Dronamraju authored
      Move the logic for updating the coregroup mask of a CPU to its own
      function. This will help in reworking how the coregroup mask is
      updated in a subsequent patch.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-11-srikar@linux.vnet.ibm.com
    • powerpc/smp: Optimize update_mask_by_l2 · 3ab33d6d
      Srikar Dronamraju authored
      All threads of an SMT4 core are either all part of this CPU's l2-cache
      mask or all unrelated to this CPU's l2-cache mask. Use this relation to
      reduce the number of iterations needed to find all the CPUs that share
      the same l2-cache.
      
      Use a temporary mask to iterate through the CPUs that may share the
      l2_cache mask. Also, instead of setting one CPU at a time into
      cpu_l2_cache_mask, copy the SMT4/sub mask in one shot.
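
The saving can be sketched with plain 64-bit bitmasks standing in for cpumasks. Everything here is a simplified model under assumptions: 16 CPUs, 4 threads per core, two cores per L2, and the hypothetical helpers `sibling_mask()`, `l2_id_of()`, `l2_mask_per_cpu()` and `l2_mask_per_core()`; the `checks` counter only records how many L2 comparisons each variant makes.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS          16  /* assumed system size */
#define THREADS_PER_CORE 4   /* assumed SMT4 */

static unsigned long checks; /* number of L2 comparisons performed */

/* All threads of the core that cpu belongs to. */
static uint64_t sibling_mask(int cpu)
{
    return 0xFULL << (cpu - cpu % THREADS_PER_CORE);
}

/* Hypothetical topology: every two cores (8 CPUs) share one L2. */
static int l2_id_of(int cpu)
{
    return cpu / (2 * THREADS_PER_CORE);
}

/* Lowest set bit of a non-empty mask. */
static int first_cpu(uint64_t mask)
{
    int cpu = 0;

    assert(mask != 0);
    while (!(mask & 1)) {
        mask >>= 1;
        cpu++;
    }
    return cpu;
}

/* Unoptimized: test every CPU and set one bit at a time. */
uint64_t l2_mask_per_cpu(int cpu)
{
    uint64_t mask = 0;

    for (int i = 0; i < NR_CPUS; i++) {
        checks++;
        if (l2_id_of(i) == l2_id_of(cpu))
            mask |= 1ULL << i;
    }
    return mask;
}

/* Optimized: test one thread per core, then copy or skip the whole core. */
uint64_t l2_mask_per_core(int cpu)
{
    uint64_t all = (1ULL << NR_CPUS) - 1;
    uint64_t mask = sibling_mask(cpu);      /* own core always shares */
    uint64_t candidates = all & ~mask;      /* temporary iteration mask */

    while (candidates) {
        int i = first_cpu(candidates);
        uint64_t sib = sibling_mask(i);

        checks++;
        if (l2_id_of(i) == l2_id_of(cpu))
            mask |= sib;                    /* copy the submask in one shot */
        candidates &= ~sib;                 /* never revisit this core */
    }
    return mask;
}
```

Both variants produce the same mask in this model; the per-core variant simply performs one comparison per remaining core instead of one per CPU.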
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-10-srikar@linux.vnet.ibm.com
    • powerpc/smp: Check for duplicate topologies and consolidate · 375370a1
      Srikar Dronamraju authored
      CACHE and COREGROUP domains are now part of the default topology.
      However, on systems that don't support CACHE or COREGROUP, these
      domains will eventually be degenerated. The degeneration happens per
      CPU. Note that the current fixup_topology() logic ensures that the mask
      of a domain that is not supported on the current platform is set to
      that of the previous domain.
      
      Instead of waiting for the scheduler to degenerate them, try to
      consolidate the domains based on their masks and sd_flags. This is done
      just before setting the scheduler topology.
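
One way to picture the consolidation is as removing adjacent duplicate entries from an ordered array of levels. The sketch below is a simplification under assumptions: `struct topo_level` holds a single mask value and a flags word per level, whereas the real topology levels are described per CPU, and `consolidate()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one scheduler topology level. */
struct topo_level {
    const char *name;
    uint64_t mask;      /* CPUs spanned by this level (single value here) */
    unsigned flags;     /* stand-in for the level's sd_flags */
};

/*
 * Drop every level whose mask and flags match the level kept just before
 * it. The array is compacted in place and the new length is returned.
 */
size_t consolidate(struct topo_level *lv, size_t n)
{
    size_t out = 1;

    assert(lv != NULL || n == 0);
    if (n == 0)
        return 0;

    for (size_t i = 1; i < n; i++) {
        if (lv[i].mask == lv[out - 1].mask &&
            lv[i].flags == lv[out - 1].flags)
            continue;               /* duplicate of the previous level */
        lv[out++] = lv[i];
    }
    return out;
}
```

Because an unsupported domain takes its mask from the previous domain, duplicates end up next to each other, which is why comparing each level only with the one kept before it is enough in this model.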
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-9-srikar@linux.vnet.ibm.com
    • powerpc/smp: Depend on cpu_l1_cache_map when adding CPUs · 661e3d42
      Srikar Dronamraju authored
      Currently on hotplug/hotunplug, a CPU iterates through all the CPUs in
      its core to find threads in its thread group. However, this info is
      already captured in cpu_l1_cache_map. Hence reduce the iterations and
      clean up the add_cpu_to_smallcore_masks function.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-8-srikar@linux.vnet.ibm.com
    • powerpc/smp: Stop passing mask to update_mask_by_l2 · 1f3a4181
      Srikar Dronamraju authored
      update_mask_by_l2 is called from only one place, which passes
      cpu_l2_cache_mask as a parameter. Instead of passing
      cpu_l2_cache_mask, use it directly in update_mask_by_l2.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-7-srikar@linux.vnet.ibm.com