1. 08 Oct, 2020 10 commits
  2. 07 Oct, 2020 7 commits
  3. 06 Oct, 2020 23 commits
    • pseries/hotplug-memory: hot-add: skip redundant LMB lookup · 72cdd117
      Scott Cheloha authored
      During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
      to determine which node id (nid) to use when later calling __add_memory().
      
      This is wasteful.  On pseries, memory_add_physaddr_to_nid() finds an
      appropriate nid for a given address by looking up the LMB containing the
      address and then passing that LMB to of_drconf_to_nid_single() to get the
      nid.  In dlpar_add_lmb() we get this address from the LMB itself.
      
      In short, we have a pointer to an LMB and then we are searching for
      that LMB *again* in order to find its nid.
      
      If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
      can skip the redundant lookup.  The only error handling we need to
      duplicate from memory_add_physaddr_to_nid() is the fallback to the
      default nid when of_drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
      an invalid nid.
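
      The difference between the two paths can be sketched in userspace C as
      follows. This is a minimal model, not the kernel code: struct lmb,
      lmb_to_nid(), MAX_NODES and DEFAULT_NID are hypothetical stand-ins, and
      lmb_to_nid() only plays the role of of_drconf_to_nid_single().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_NODE     (-1)  /* plays the role of NUMA_NO_NODE */
#define MAX_NODES   4     /* hypothetical node count for this sketch */
#define DEFAULT_NID 0     /* hypothetical fallback node */

/* Simplified stand-in for an LMB descriptor (not the kernel's struct). */
struct lmb {
    uint64_t base_addr;
    int nid;              /* node recorded for this LMB, or NO_NODE */
};

/* Stand-in for of_drconf_to_nid_single(): map one LMB to a node id. */
static int lmb_to_nid(const struct lmb *lmb)
{
    return lmb->nid;
}

/* The only error handling that has to be duplicated: fall back to a default. */
static int sanitize_nid(int nid)
{
    return (nid == NO_NODE || nid < 0 || nid >= MAX_NODES) ? DEFAULT_NID : nid;
}

/* Old path: search linearly for the LMB containing addr, then map it. */
static int nid_by_address(const struct lmb *lmbs, size_t n, uint64_t lmb_size,
                          uint64_t addr, size_t *steps)
{
    for (size_t i = 0; i < n; i++) {
        (*steps)++;       /* count how many LMBs were inspected */
        if (addr >= lmbs[i].base_addr && addr - lmbs[i].base_addr < lmb_size)
            return sanitize_nid(lmb_to_nid(&lmbs[i]));
    }
    return DEFAULT_NID;
}

/* New path: the caller already holds the LMB, so no search is needed. */
static int nid_direct(const struct lmb *lmb)
{
    return sanitize_nid(lmb_to_nid(lmb));
}
```

      In this model the old path inspects every LMB before the target, so its
      cost grows with the target's position in the array, while the direct
      path does constant work.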
      
      Skipping the extra lookup makes hot-add operations faster, especially
      on machines with many LMBs.
      
      Consider an LPAR with 126976 LMBs.  In one test, hot-adding 126000
      LMBs on an unpatched kernel took ~3.5 hours while a patched kernel
      completed the same operation in ~2 hours:
      
      Unpatched (12450 seconds):
      Sep  9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
      Sep  9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
      [...]
      Sep  9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
      
      Patched (7065 seconds):
      Sep  8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
      Sep  8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
      [...]
      Sep  8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
      
      It should be noted that the speedup grows more substantial when
      hot-adding LMBs at the end of the drconf range.  This is because we
      are skipping a linear LMB search.
      
      To see the distinction, consider a smaller hot-add test on the same
      LPAR.  A perf-stat run with 10 iterations showed that hot-adding 4096
      LMBs completed less than 1 second faster on a patched kernel:
      
      Unpatched:
       Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
      
              104,753.42 msec task-clock                #    0.992 CPUs utilized            ( +-  0.55% )
                   4,708      context-switches          #    0.045 K/sec                    ( +-  0.69% )
                   2,444      cpu-migrations            #    0.023 K/sec                    ( +-  1.25% )
                     394      page-faults               #    0.004 K/sec                    ( +-  0.22% )
         445,902,503,057      cycles                    #    4.257 GHz                      ( +-  0.55% )  (66.67%)
           8,558,376,740      stalled-cycles-frontend   #    1.92% frontend cycles idle     ( +-  0.88% )  (49.99%)
         300,346,181,651      stalled-cycles-backend    #   67.36% backend cycles idle      ( +-  0.76% )  (50.01%)
         258,091,488,691      instructions              #    0.58  insn per cycle
                                                        #    1.16  stalled cycles per insn  ( +-  0.22% )  (66.67%)
          70,568,169,256      branches                  #  673.660 M/sec                    ( +-  0.17% )  (50.01%)
           3,100,725,426      branch-misses             #    4.39% of all branches          ( +-  0.20% )  (49.99%)
      
                 105.583 +- 0.589 seconds time elapsed  ( +-  0.56% )
      
      Patched:
       Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
      
              104,055.69 msec task-clock                #    0.993 CPUs utilized            ( +-  0.32% )
                   4,606      context-switches          #    0.044 K/sec                    ( +-  0.20% )
                   2,463      cpu-migrations            #    0.024 K/sec                    ( +-  0.93% )
                     394      page-faults               #    0.004 K/sec                    ( +-  0.25% )
         442,951,129,921      cycles                    #    4.257 GHz                      ( +-  0.32% )  (66.66%)
           8,710,413,329      stalled-cycles-frontend   #    1.97% frontend cycles idle     ( +-  0.47% )  (50.06%)
         299,656,905,836      stalled-cycles-backend    #   67.65% backend cycles idle      ( +-  0.39% )  (50.02%)
         252,731,168,193      instructions              #    0.57  insn per cycle
                                                        #    1.19  stalled cycles per insn  ( +-  0.20% )  (66.66%)
          68,902,851,121      branches                  #  662.173 M/sec                    ( +-  0.13% )  (49.94%)
           3,100,242,882      branch-misses             #    4.50% of all branches          ( +-  0.15% )  (49.98%)
      
                 104.829 +- 0.325 seconds time elapsed  ( +-  0.31% )
      
      This is consistent.  An add-by-count hot-add operation adds LMBs
      greedily, so LMBs near the start of the drconf range are considered
      first.  On an otherwise idle LPAR with so many LMBs we would expect to
      find the LMBs we need near the start of the drconf range, hence the
      smaller speedup.
      Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200916145122.3408129-1-cheloha@linux.ibm.com
    • selftests/powerpc: Add a rtas_filter selftest · dc9af82e
      Andrew Donnellan authored
      Add a selftest to test the basic functionality of CONFIG_RTAS_FILTER.
      Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
      [mpe: Change rmo_start/end to 32-bit to avoid build errors on ppc64]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200820044512.7543-2-ajd@linux.ibm.com
    • powerpc/rtas: Restrict RTAS requests from userspace · bd59380c
      Andrew Donnellan authored
      A number of userspace utilities depend on making calls to RTAS to retrieve
      information and update various things.
      
      The existing API through which we expose RTAS to userspace exposes more
      RTAS functionality than we actually need, through the sys_rtas syscall,
      which allows root (or anyone with CAP_SYS_ADMIN) to make any RTAS call they
      want with arbitrary arguments.
      
      Many RTAS calls take the address of a buffer as an argument, and it's up to
      the caller to specify the physical address of the buffer as an argument. We
      allocate a buffer (the "RMO buffer") in the Real Memory Area that RTAS can
      access, and then expose the physical address and size of this buffer in
      /proc/powerpc/rtas/rmo_buffer. Userspace is expected to read this address,
      poke at the buffer using /dev/mem, and pass an address in the RMO buffer to
      the RTAS call.
      
      However, there's nothing stopping the caller from specifying whatever
      address they want in the RTAS call, and it's easy to construct a series of
      RTAS calls that can overwrite arbitrary bytes (even without /dev/mem
      access).
      
      Additionally, there are some RTAS calls that do potentially dangerous
      things and for which there are no legitimate userspace use cases.
      
      In the past, this would not have been a particularly big deal as it was
      assumed that root could modify all system state freely, but with Secure
      Boot and lockdown we need to care about this.
      
      We can't fundamentally change the ABI at this point; however, we can address
      this by implementing a filter that checks RTAS calls against a list
      of permitted calls and forces the caller to use addresses within the RMO
      buffer.
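
      The kind of check such a filter performs can be sketched as below. This
      is an illustrative userspace model only: the table layout, the
      "example-*" call names and the function names are hypothetical and are
      not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical allowlist entry: a call name plus the positions of its
 * buffer-address and buffer-size arguments (-1 when there is none). */
struct filter_entry {
    const char *name;
    int buf_idx;
    int size_idx;
};

static const struct filter_entry allowed[] = {
    { "example-call-with-buffer", 0, 1 },
    { "example-call-no-buffer",  -1, -1 },
};

/* True if [addr, addr + len) lies inside the RMO buffer; written so that
 * addr + len is never computed and cannot overflow. */
static bool in_rmo_buf(uint64_t base, uint64_t size, uint64_t addr, uint64_t len)
{
    return addr >= base && len <= size && addr - base <= size - len;
}

/* Permit a call only if it is listed and any buffer it names is in range. */
static bool call_permitted(const char *name, const uint64_t *args, size_t nargs,
                           uint64_t rmo_base, uint64_t rmo_size)
{
    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
        const struct filter_entry *e = &allowed[i];
        uint64_t len = 1;   /* assume a one-byte buffer if no size argument */

        if (strcmp(e->name, name) != 0)
            continue;
        if (e->buf_idx < 0)
            return true;    /* listed call that takes no buffer */
        if ((size_t)e->buf_idx >= nargs)
            return false;
        if (e->size_idx >= 0) {
            if ((size_t)e->size_idx >= nargs)
                return false;
            len = args[e->size_idx];
        }
        return in_rmo_buf(rmo_base, rmo_size, args[e->buf_idx], len);
    }
    return false;           /* unlisted calls are rejected */
}
```

      Rejecting by default keeps the model conservative: a call that is not in
      the table never reaches the range check at all.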
      
      The list is based off the list of calls that are used by the librtas
      userspace library, and has been tested with a number of existing userspace
      RTAS utilities. For compatibility with any applications we are not aware of
      that require other calls, the filter can be turned off at build time.
      
      Cc: stable@vger.kernel.org
      Reported-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200820044512.7543-1-ajd@linux.ibm.com
    • powerpc/perf: Exclude pmc5/6 from the irrelevant PMU group constraints · 3b6c3adb
      Athira Rajeev authored
      PMU counter support functions enforce event constraints for a group of
      events to check whether all events in the group can be monitored. In the
      case of event codes using PMC5 and PMC6 (500fa and 600f4 respectively),
      not all constraints are applicable, for example the threshold or sample
      bits. But the current code includes PMC5 and PMC6 in some group
      constraints (like the IC_DC Qualifier bits) which are not applicable to
      them, and hence those events do not get counted when scheduled along
      with a group of other events. This patch fixes this by excluding PMC5/6
      from constraints which are not relevant to them.
      
      Fixes: 7ffd948f ("powerpc/perf: factor out power8 pmu functions")
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/1600672204-1610-1-git-send-email-atrajeev@linux.vnet.ibm.com
    • powerpc/smp: Optimize update_coregroup_mask · 70a94089
      Srikar Dronamraju authored
      All threads of an SMT4/SMT8 core can either be part of a CPU's coregroup
      mask or outside the coregroup. Use this relation to reduce the
      number of iterations needed to find all the CPUs that share the same
      coregroup.
      
      Use a temporary mask to iterate through the CPUs that may share the
      coregroup mask. Also, instead of setting one CPU at a time in
      cpu_coregroup_mask, copy the SMT4/SMT8/submask in one shot.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-12-srikar@linux.vnet.ibm.com
    • powerpc/smp: Move coregroup mask updation to a new function · b8a97cb4
      Srikar Dronamraju authored
      Move the logic for updating the coregroup mask of a CPU to its own
      function. This will help in reworking the coregroup mask update in a
      subsequent patch.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-11-srikar@linux.vnet.ibm.com
    • powerpc/smp: Optimize update_mask_by_l2 · 3ab33d6d
      Srikar Dronamraju authored
      All threads of an SMT4 core can either be part of this CPU's l2-cache
      mask or not related to this CPU's l2-cache mask. Use this relation to
      reduce the number of iterations needed to find all the CPUs that share
      the same l2-cache.
      
      Use a temporary mask to iterate through the CPUs that may share the
      l2_cache mask. Also, instead of setting one CPU at a time in
      cpu_l2_cache_mask, copy the SMT4/sub mask in one shot.
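
      The idea can be sketched with plain bitmasks. This is a userspace model
      under stated assumptions, not the kernel implementation: CPUs are bits
      in a uint64_t, cores are assumed to be SMT4 with contiguous numbering,
      and l2_id[] is a hypothetical per-CPU cache identifier.

```c
#include <assert.h>
#include <stdint.h>

#define THREADS_PER_CORE 4  /* SMT4, assumed for this sketch */

/* Bitmask of all SMT siblings of cpu, assuming contiguous numbering. */
static uint64_t sibling_mask(int cpu)
{
    int first = cpu - (cpu % THREADS_PER_CORE);
    return ((1ULL << THREADS_PER_CORE) - 1) << first;
}

/* Index of the lowest set bit; the caller guarantees mask != 0. */
static int lowest_cpu(uint64_t mask)
{
    int i = 0;
    while (!(mask & 1ULL)) {
        mask >>= 1;
        i++;
    }
    return i;
}

/* Build the set of online CPUs sharing an L2 with cpu. One L2 comparison
 * is made per core, and a whole sibling group is copied in one step. */
static uint64_t l2_mask_for(int cpu, const int *l2_id, uint64_t online,
                            int *checks)
{
    uint64_t candidates = online;   /* temporary mask, consumed below */
    uint64_t result = 0;

    while (candidates) {
        int i = lowest_cpu(candidates);
        uint64_t sibs = sibling_mask(i);

        (*checks)++;
        if (l2_id[i] == l2_id[cpu])
            result |= sibs & online;  /* add the whole core at once */
        candidates &= ~sibs;          /* never revisit this core's threads */
    }
    return result;
}
```

      With 16 CPUs in four cores, this model makes four comparisons instead of
      sixteen, because each comparison settles the answer for a whole core.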
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-10-srikar@linux.vnet.ibm.com
    • powerpc/smp: Check for duplicate topologies and consolidate · 375370a1
      Srikar Dronamraju authored
      CACHE and COREGROUP domains are now part of default topology. However on
      systems that don't support CACHE or COREGROUP, these domains will
      eventually be degenerated. The degeneration happens per CPU. Do note the
      current fixup_topology() logic ensures that mask of a domain that is not
      supported on the current platform is set to the previous domain.
      
      Instead of waiting for the scheduler to degenerate them, try to
      consolidate the domains based on their masks and sd_flags. This is done
      just before setting the scheduler topology.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-9-srikar@linux.vnet.ibm.com
    • powerpc/smp: Depend on cpu_l1_cache_map when adding CPUs · 661e3d42
      Srikar Dronamraju authored
      Currently on hotplug/hotunplug, a CPU iterates through all the CPUs in
      its core to find threads in its thread group. However, this info is
      already captured in cpu_l1_cache_map. Hence reduce iterations and
      clean up the add_cpu_to_smallcore_masks function.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-8-srikar@linux.vnet.ibm.com
    • powerpc/smp: Stop passing mask to update_mask_by_l2 · 1f3a4181
      Srikar Dronamraju authored
      update_mask_by_l2 is called only once, but it is passed cpu_l2_cache_mask
      as a parameter. Instead of passing cpu_l2_cache_mask, use it directly in
      update_mask_by_l2.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-7-srikar@linux.vnet.ibm.com
    • powerpc/smp: Limit CPUs traversed to within a node. · 53516d4a
      Srikar Dronamraju authored
      All the arch specific topology cpumasks are within a node/DIE.
      However, when setting these per-CPU cpumasks, the system traverses
      all the online CPUs. This is redundant.
      
      Reduce the traversal to only CPUs that are online in the node to which
      the CPU belongs.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-6-srikar@linux.vnet.ibm.com
    • powerpc/smp: Optimize remove_cpu_from_masks · 70edd4a7
      Srikar Dronamraju authored
      While offlining a CPU, the system currently iterates through all the
      CPUs in the DIE to clear sibling, l2_cache and smallcore maps. However,
      if there are more cores in a DIE, the system can end up spending more
      time iterating through CPUs which are completely unrelated.
      
      Optimize this by only iterating through a smaller but relevant cpumap.
      If shared_cache is set, cpu_l2_cache_map should be relevant; otherwise
      cpu_sibling_map would be relevant.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-5-srikar@linux.vnet.ibm.com
    • powerpc/smp: Remove get_physical_package_id · e29e9ed6
      Srikar Dronamraju authored
      Now that cpu_core_mask has been removed and topology_core_cpumask has
      been updated to use cpu_cpu_mask, we no longer need
      get_physical_package_id.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-4-srikar@linux.vnet.ibm.com
    • powerpc/smp: Stop updating cpu_core_mask · 4ca234a9
      Srikar Dronamraju authored
      Anton Blanchard reported that his 4096 vcpu KVM guest took around 30
      minutes to boot. He traced the delay to the time taken to iterate while
      setting the cpu_core_mask.
      
      Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU
      would be equal on Power. However, updating cpu_core_mask took forever
      as it's a per-CPU cpumask variable, whereas cpu_cpu_mask is a per-NODE /
      per-DIE cpumask that is shared by all the respective CPUs.
      
      Also cpu_cpu_mask is needed from a scheduler perspective. However
      cpu_core_map is an exported symbol. Hence stop updating cpu_core_map
      and make it point to cpu_cpu_mask.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-3-srikar@linux.vnet.ibm.com
    • powerpc/topology: Update topology_core_cpumask · 4bce5459
      Srikar Dronamraju authored
      On Power, cpu_core_mask and cpu_cpu_mask refer to the same set of CPUs.
      cpu_cpu_mask is needed by scheduler, hence look at deprecating
      cpu_core_mask. Before deleting the cpu_core_mask, ensure its only user
      is moved to cpu_cpu_mask.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200921095653.9701-2-srikar@linux.vnet.ibm.com
    • powerpc/tm: Save and restore AMR on treclaim and trechkpt · d0ffdee8
      Gustavo Romero authored
      Although AMR is stashed in the checkpoint area, currently we don't save
      it to the per-thread checkpoint struct after a treclaim, and so we don't
      restore it from that struct when we trechkpt either. As a consequence,
      when the transaction is later rolled back, the kernel space AMR value
      that was live when the trechkpt was done appears in userspace.
      
      This commit saves and restores AMR accordingly on treclaim and trechkpt.
      Since AMR value is also used in kernel space in other functions, it also
      takes care of stashing kernel live AMR into the stack before treclaim and
      before trechkpt, restoring it later, just before returning from tm_reclaim
      and __tm_recheckpoint.
      
      It also fixes two unrelated comments about CR and MSR.
      Signed-off-by: Gustavo Romero <gromero@linux.ibm.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200919150025.9609-1-gromero@linux.ibm.com
    • powerpc/eeh: Clean up PE addressing · 35d64734
      Oliver O'Halloran authored
      When support for EEH on PowerNV was added a lot of pseries specific code
      was made "generic" and some of the quirks of pseries EEH came along for the
      ride. One of the stranger quirks is eeh_pe containing two types of PE
      address: pe->addr and pe->config_addr. The reason for this appears to be
      historical baggage rather than any real requirements.
      
      On pseries EEH PEs are manipulated using RTAS calls. Each EEH RTAS call
      takes a "PE configuration address" as an input which is used to identify
      which EEH PE is being manipulated by the call. When initialising the EEH
      state for a device the first thing we need to do is determine the
      configuration address for the PE which contains the device so we can enable
      EEH on that PE. This process is outlined in PAPR which is the modern
      (i.e post-2003) FW specification for pseries. However, EEH support was
      first described in the pSeries RISC Platform Architecture (RPA) and
      although they are mostly compatible EEH is one of the areas where they are
      not.
      
      The major difference is that RPA doesn't actually have the concept of a PE.
      On RPA systems the EEH RTAS calls are done on a per-device basis using the
      same config_addr that would be passed to the RTAS functions to access PCI
      config space (e.g. ibm,read-pci-config). The config_addr is not identical
      since the function and config register offsets of the config_addr must be
      set to zero. EEH operations being done on a per-device basis doesn't make a
      whole lot of sense when you consider how EEH was implemented on legacy PCI
      systems.
      
      For legacy PCI(-X) systems EEH was implemented using special PCI-PCI
      bridges which contained logic to detect errors and freeze the secondary
      bus when one occurred. This means that the EEH enabled state is shared
      among all devices behind that EEH bridge. As a result there's no way to
      implement the per-device control required for the semantics specified by
      RPA. It can be made to work if we assume that a separate EEH bridge exists
      for each EEH capable PCI slot and there are no bridges behind those slots.
      However, RPA also specifies the ibm,configure-bridge RTAS call for
      re-initialising bridges behind EEH capable slots after they are reset due
      to an EEH event so that is probably not a valid assumption. This
      incoherence was fixed in later PAPR, which succeeded RPA. Unfortunately,
      since Linux EEH support seems to have been implemented based on the RPA
      spec some of the legacy assumptions were carried over (probably for POWER4
      compatibility).
      
      The fix made in PAPR was the introduction of the "PE" concept and
      redefining the EEH RTAS calls (set-eeh-option, reset-slot, etc) to operate
      on a per-PE basis so all devices behind an EEH bridge would share the same
      EEH state. The "config_addr" argument to the EEH RTAS calls became the
      "PE_config_addr" and the OS was required to use the
      ibm,get-config-addr-info RTAS call to find the correct PE address for the
      device. When support for the new interfaces was added to Linux it was
      implemented using something like:
      
      At probe time:
      
      	pdn->eeh_config_addr = rtas_config_addr(pdn);
      	pdn->eeh_pe_config_addr = rtas_get_config_addr_info(pdn);
      
      When performing an RTAS call:
      
      	config_addr = pdn->eeh_config_addr;
      	if (pdn->eeh_pe_config_addr)
      		config_addr = pdn->eeh_pe_config_addr;
      
      	rtas_call(..., config_addr, ...);
      
      In other words, if the ibm,get-config-addr-info RTAS call is implemented
      and returned a valid result we'd use that as the argument to the EEH
      RTAS calls. If not, Linux would fall back to using the device's
      config_addr. Over time these addresses have moved around going from pci_dn
      to eeh_dev and finally into eeh_pe. Today the users look like this:
      
      	config_addr = pe->config_addr;
      	if (pe->addr)
      		config_addr = pe->addr;
      
      	rtas_call(..., config_addr, ...);
      
      However, considering the EEH core always operates on a per-PE basis and
      even on pseries the only per-device operation is the initial call to
      ibm,set-eeh-option I'm not sure if any of this actually works on an RPA
      system today. It doesn't make much sense to have the fallback address in
      a generic structure either since the bulk of the code which references it
      is in pseries anyway.
      
      The EEH core makes a token effort to support looking up a PE using the
      config_addr by having two arguments to eeh_pe_get(). However, a survey of
      all the callers to eeh_pe_get() shows that all bar one have the config_addr
      argument hard-coded to zero. The only caller that doesn't is in
      eeh_pe_tree_insert() which has:
      
      	if (!eeh_has_flag(EEH_VALID_PE_ZERO) && !edev->pe_config_addr)
      		return -EINVAL;
      
      	pe = eeh_pe_get(hose, edev->pe_config_addr, edev->bdfn);
      
      The third argument (config_addr) is only used if the second (pe->addr)
      argument is invalid. The preceding check ensures that the call to
      eeh_pe_get() will never happen if edev->pe_config_addr is invalid so there
      is no situation where eeh_pe_get() will search for a PE based on the 3rd
      argument. The check also means that we'll never insert a PE into the tree
      where pe_config_addr is zero since EEH_VALID_PE_ZERO is never set on
      pseries. All the users of the fallback address on pseries never actually
      use the fallback, and the only caller that supplies something for the
      config_addr argument to eeh_pe_get() never uses it either. It's all dead
      code.
      
      This patch removes the fallback address from eeh_pe since nothing uses it.
      Specifically, we do this by:
      
      1) Removing pe->config_addr
      2) Removing the EEH_VALID_PE_ZERO flag
      3) Removing the fallback address argument to eeh_pe_get().
      4) Removing all the checks for pe->addr being zero in the pseries EEH code.
      
      This leaves us with PE's only being identified by what's in their pe->addr
      field and the EEH core relying on the platform to ensure that eeh_dev's are
      only inserted into the EEH tree if they're actually inside a PE.
      
      No functional changes, I hope.
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200918093050.37344-9-oohall@gmail.com
    • powerpc/pseries/eeh: Allow zero to be a valid PE configuration address · 42de19d5
      Oliver O'Halloran authored
      There's no real reason why zero can't be a valid PE configuration address.
      Under qemu each sPAPR PHB (i.e. EEH supporting) has the passed-through
      devices on bus zero, so the PE address of bus <dddd>:00 should be zero.
      
      However, all previous versions of Linux will reject that, so Qemu at least
      goes out of its way to avoid it. The Qemu implementation of
      ibm,get-config-addr-info2 RTAS has the following comment:
      
      > /*
      >  * We always have PE address of form "00BB0001". "BB"
      >  * represents the bus number of PE's primary bus.
      >  */
      
      So qemu puts a one into the register portion of the PE's config_addr to
      avoid it being zero. The whole thing is pretty silly considering that
      RTAS will return a negative error code if it can't map the device's
      config_addr to a PE.
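
      The address form in the quoted comment, and why a non-zero test is not
      needed, can be illustrated with a small sketch. The function names are
      hypothetical, and the bit layout is only inferred from the "00BB0001"
      form in the comment (bus number in bits 16-23, a one in the low byte).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PE address in the "00BB0001" form: bus number in bits 16-23 and a one
 * in the low (register) byte so that the value is never zero. */
static uint32_t pe_addr_with_hack(uint8_t bus)
{
    return ((uint32_t)bus << 16) | 1u;
}

/* The same address without the hack; bus zero yields a PE address of zero. */
static uint32_t pe_addr_plain(uint8_t bus)
{
    return (uint32_t)bus << 16;
}

/* Old-style validity test: zero doubles as "no PE". */
static bool valid_if_nonzero(uint32_t pe_addr)
{
    return pe_addr != 0;
}

/* Validity taken from the lookup's return code instead (negative means
 * error), which leaves zero free to be a real address. */
static bool valid_if_lookup_ok(int rc)
{
    return rc >= 0;
}
```

      In this model, bus zero without the hack produces address zero, which
      the old test would reject even though the lookup itself succeeded.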
      
      This patch fixes Linux to treat zero as a valid PE address. This shouldn't
      have any real effects due to the Qemu hack mentioned above. And the fact
      that Linux EEH has worked historically on PowerVM means they never pass
      through devices on bus zero so we would never see the problem there either.
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200918093050.37344-8-oohall@gmail.com
    • powerpc/pseries/eeh: Rework device EEH PE determination · 98ba956f
      Oliver O'Halloran authored
      The process Linux uses for determining if a device supports EEH or not
      appears to be at odds with what PAPR says the OS should be doing. The
      current flow is something like:
      
      1. Assume pe_config_addr is equal to the device's config_addr.
      2. Attempt to enable EEH on that PE
      3. Verify EEH was enabled (POWER4 bug workaround)
      4. Try to find the pe_config_addr using the ibm,get-config-addr-info2 RTAS
         call.
      5. If that fails walk the pci_dn tree upwards trying to find a parent
         device with EEH support. If we find one then add the device to that PE.
      
      The first major problem with this process is that we need the PE config
      address in step 2) since it needs to be passed to the ibm,set-eeh-option
      RTAS call when enabling EEH for the PE. We hack around this requirement
      by making the assumption in 1) and delay finding the actual PE address
      until 4). This is fine if:
      
      a) The PCI device is the 0th function, and
      b) The device is on the PE's root bus.
      
      Granted, the current sequence does appear to work on most systems even when
      these conditions are false. At a guess PowerVM's RTAS has workarounds to
      accommodate Linux's quirks or the RTAS call to enable EEH is treated as
      no-op on most platforms since EEH is usually enabled by default. However,
      what is currently implemented is a bit sketchy and is downright confusing
      since it doesn't match up with what PAPR suggests we should be doing.
      
      This patch re-works how we handle EEH init so that we find the PE config
      address using the ibm,get-config-addr-info2 RTAS call first, then use the
      found address to finish the EEH init process. It also drops the Power4
      workaround since as of commit 471d7ff8 ("powerpc/64s: Remove POWER4
      support") the kernel does not support running on a Power4 CPU so there's
      no need to support the Power4 platform's quirks either. With the patch
      applied the sequence is now:
      
      1. Find the pe_config_addr from the device using the RTAS call.
      2. Enable the PE.
      3. Insert the edev into the tree and create an eeh_pe if needed.
      
      The other change made here is ignoring unsupported devices entirely.
      Currently the device's BARs are saved to the eeh_dev even if the device is
      not part of an EEH PE. Not being part of a PE means that an EEH recovery
      pass will never see that device so saving the BARs is pointless.
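
      The reworked order can be sketched as follows. This is a simplified
      userspace model, not the patch itself: struct fake_dev and
      eeh_probe_sketch() are hypothetical, and the RTAS calls are replaced by
      pre-set result fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI device during EEH init; not a kernel type. */
struct fake_dev {
    int lookup_rc;      /* simulated lookup result: negative = not in a PE */
    uint32_t pe_addr;   /* PE config address reported when lookup_rc >= 0 */
    int enable_rc;      /* simulated result of enabling EEH on the PE */
    bool in_tree;       /* set once the device is inserted into the PE tree */
    bool bars_saved;    /* set once the BARs are saved for recovery */
};

static int eeh_probe_sketch(struct fake_dev *dev, uint32_t *pe_addr_out)
{
    /* 1. Find the PE config address first. */
    if (dev->lookup_rc < 0)
        return -1;      /* unsupported device: ignore it entirely */
    *pe_addr_out = dev->pe_addr;

    /* 2. Enable EEH on the PE, using the address found in step 1. */
    if (dev->enable_rc != 0)
        return -1;

    /* 3. Insert into the tree; only now is saving the BARs useful. */
    dev->in_tree = true;
    dev->bars_saved = true;
    return 0;
}
```

      Doing the lookup first means the enable step never has to guess an
      address, and a device outside any PE leaves no state behind.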
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200918093050.37344-7-oohall@gmail.com
    • powerpc/pseries/eeh: Clean up pe_config_addr lookups · f61c859f
      Oliver O'Halloran authored
      De-duplicate, and fix up the comments, and make the prototype just take a
      pci_dn since the job of the function is to return the pe_config_addr of the
      PE which contains a given device.
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200918093050.37344-6-oohall@gmail.com
    • powerpc/eeh: Move EEH initialisation to an arch initcall · 395ee2a2
      Oliver O'Halloran authored
      The initialisation of EEH mostly happens in a core_initcall_sync initcall,
      followed by registering a bus notifier later on in an arch_initcall.
      Anything involving initcall dependencies is mostly incomprehensible unless
      you've spent a while staring at code so here's the full sequence:
      
      ppc_md.setup_arch       <-- pci_controllers are created here
      
      ...time passes...
      
      core_initcall           <-- pci_dns are created from DT nodes
      core_initcall_sync      <-- platforms call eeh_init()
      postcore_initcall       <-- PCI bus type is registered
      postcore_initcall_sync
      arch_initcall           <-- EEH pci_bus notifier registered
      subsys_initcall         <-- PHBs are scanned here
      
      There's no real requirement to do the EEH setup at the core_initcall_sync
      level. It just needs to be done after pci_dn's are created and before we
      start scanning PHBs. Simplify the flow a bit by moving the platform EEH
      initialisation to an arch_initcall so we can fold the bus notifier
      registration into eeh_init().
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200918093050.37344-5-oohall@gmail.com
    • powerpc/eeh: Delete eeh_ops->init · 5d69e46a
      Oliver O'Halloran authored
      No longer used since the platforms perform their EEH initialisation before
      calling eeh_init().
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200918093050.37344-4-oohall@gmail.com
    • powerpc/pseries: Stop using eeh_ops->init() · 1f8fa0cd
      Oliver O'Halloran authored
      Fold pseries_eeh_init() into eeh_pseries_init() rather than having
      eeh_init() call it via eeh_ops->init(). It's simpler and it'll let us
      delete eeh_ops.init.
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200918093050.37344-3-oohall@gmail.com