    sched/numa: Only consider less busy nodes as numa balancing destinations · 6f9aad0b
    Rik van Riel authored
    Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks
    the preferred node") fixes an issue where workloads would never
    converge on a fully loaded (or overloaded) system.
    
    However, it introduces a regression on less than fully loaded systems,
    where workloads converge on a few NUMA nodes, instead of properly
    staying spread out across the whole system. This leads to a reduction
    in available memory bandwidth, and usable CPU cache, with predictable
    performance problems.
    
    The root cause appears to be an interaction between the load balancer
    and NUMA balancing, where the short term load represented by the load
    balancer differs from the long term load the NUMA balancing code would
    like to base its decisions on.
    
    Simply reverting a43455a1 would re-introduce the non-convergence
    of workloads on fully loaded systems, so that is not a good option. As
    an aside, the check done before a43455a1 only applied to a task's
    preferred node, not to other candidate nodes in the system, so the
    converge-on-too-few-nodes problem still happens, just to a lesser
    degree.
    
    Instead, try to compensate for the impedance mismatch between the load
    balancer and NUMA balancing by only ever considering a lesser loaded
    node as a destination for NUMA balancing, regardless of whether the
    task is trying to move to the preferred node, or to another node.
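
    The rule above can be sketched as a small standalone predicate. This
    is an illustration only, not the code from this patch: the struct,
    its field names, and the function name are hypothetical, and
    normalizing each node's load by its CPU capacity is an assumption
    about how "lesser loaded" might be compared.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-node summary; names are illustrative only. */
struct node_stats {
	unsigned long load;             /* long-term load on the node */
	unsigned long compute_capacity; /* total CPU capacity of the node */
};

/*
 * Return true only if the destination node is less busy than the
 * source node, i.e.
 *
 *        src->load                  dst->load
 *  ---------------------  >  ---------------------
 *  src->compute_capacity     dst->compute_capacity
 *
 * The comparison is cross-multiplied to avoid a division.
 */
static bool dst_is_less_busy(const struct node_stats *src,
			     const struct node_stats *dst)
{
	assert(src->compute_capacity > 0 && dst->compute_capacity > 0);

	return src->load * dst->compute_capacity >
	       dst->load * src->compute_capacity;
}
```

    In this sketch the same check would gate every candidate node,
    preferred or not. An idle destination always passes when the source
    carries any load, which matches the single runnable thread case
    described in the next paragraph.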
    
    This patch also addresses the issue that a system with a single
    runnable thread would never migrate that thread to near its memory,
    introduced by 095bebf6 ("sched/numa: Do not move past the balance
    point if unbalanced").
    
    In a test, the main thread creates a large memory area and spawns a
    worker thread to iterate over that memory; the worker is placed on
    another node by select_task_rq_fair. The main thread then goes to
    sleep and waits for the worker thread to loop over all the memory.
    With this patch, the worker thread is migrated to where the memory
    is, instead of all the memory being migrated over like before.
    
    Jirka has run a number of performance tests on several systems: single
    instance SpecJBB 2005 performance is 7-15% higher on a 4 node system,
    with higher gains on systems with more cores per socket.
    Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
    little or no changes with the revert of 095bebf6 and this patch.
    
    Reported-by: Artem Bityutskiy <dedekind1@gmail.com>
    Reported-by: Jirka Hladky <jhladky@redhat.com>
    Tested-by: Jirka Hladky <jhladky@redhat.com>
    Tested-by: Artem Bityutskiy <dedekind1@gmail.com>
    Signed-off-by: Rik van Riel <riel@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20150528095249.3083ade0@annuminas.surriel.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
fair.c 222 KB