• Srivatsa S. Bhat's avatar
    powerpc: Fix the setup of CPU-to-Node mappings during CPU online · d4edc5b6
    Srivatsa S. Bhat authored
    On POWER platforms, the hypervisor can notify the guest kernel about dynamic
    changes in the cpu-numa associativity (VPHN topology update). Hence the
    cpu-to-node mappings that we got from the firmware during boot, may no longer
    be valid after such updates. This is handled using the arch_update_cpu_topology()
    hook in the scheduler, and the sched-domains are rebuilt according to the new
    mappings.
    
    But unfortunately, at the moment, CPU hotplug ignores these updated mappings
    and instead queries the firmware for the cpu-to-numa relationships and uses
    them during CPU online. So the kernel can end up assigning wrong NUMA nodes
    to CPUs during subsequent CPU hotplug online operations (after booting).
    
    Further, a particularly problematic scenario can result from this bug:
    On POWER platforms, the SMT mode can be switched between 1, 2, 4 (and even 8)
    threads per core. The switch to Single-Threaded (ST) mode is performed by
    offlining all except the first CPU thread in each core. Switching back to
    SMT mode involves onlining those other threads back, in each core.
    
    Now consider this scenario:
    
    1. During boot, the kernel gets the cpu-to-node mappings from the firmware
       and assigns the CPUs to NUMA nodes appropriately, during CPU online.
    
    2. Later on, the hypervisor updates the cpu-to-node mappings dynamically and
       communicates this update to the kernel. The kernel in turn updates its
       cpu-to-node associations and rebuilds its sched domains. Everything is
       fine so far.
    
    3. Now, the user switches the machine from SMT to ST mode (say, by running
       ppc64_cpu --smt=1). This involves offlining all except 1 thread in each
       core.
    
    4. The user then tries to switch back from ST to SMT mode (say, by running
       ppc64_cpu --smt=4), and this involves onlining those threads back. Since
       CPU hotplug ignores the new mappings, it queries the firmware and tries to
       associate the newly onlined sibling threads to the old NUMA nodes. This
       results in sibling threads within the same core getting associated with
       different NUMA nodes, which is incorrect.
    
       The scheduler's build-sched-domains code gets thoroughly confused with this
       and enters an infinite loop and causes soft-lockups, as explained in detail
       in commit 3be7db6a (powerpc: VPHN topology change updates all siblings).
    
    So to fix this, use the numa_cpu_lookup_table to remember the updated
    cpu-to-node mappings, and use them during CPU hotplug online operations.
    Further, we also need to ensure that all threads in a core are assigned to a
    common NUMA node, irrespective of whether all those threads were online during
    the topology update. To achieve this, we take care not to use cpu_sibling_mask()
    since it is not hotplug invariant. Instead, we use cpu_first_sibling_thread()
    and set up the mappings manually using the 'threads_per_core' value for that
    particular platform. This helps us ensure that we don't hit this bug with any
    combination of CPU hotplug and SMT mode switching.
    
    Cc: stable@vger.kernel.org
    Signed-off-by: default avatarSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
    d4edc5b6
topology.h 2.64 KB