    sched: Drop group_capacity to 1 only if local group has extra capacity · 75dd321d
    Nikhil Rao authored
    When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
    only if the local group has extra capacity. The extra check prevents the case
    where we always pull from the heaviest group even when it is already
    under-utilized (possible when a single large-weight task outweighs all the
    other tasks on the system).
    
    For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
    scheduling domains. Let's say we spawn 15 nice 0 tasks and one nice -15 task,
    and each task is running on one core. In this case, we observe the following
    events when balancing at the NUMA domain:
    
    - find_busiest_group() will always pick the sched group containing the niced
      task to be the busiest group.
    - find_busiest_queue() will then always pick one of the cpus running a
      nice 0 task (it never picks the cpu with the nice -15 task, since
      weighted_cpuload > imbalance).
    - The load balancer fails to migrate the task, since it is the currently
      running task, and increments sd->nr_balance_failed.
    - It repeats the above steps a few more times until sd->nr_balance_failed > 5,
      at which point it kicks off the active load balancer, wakes up the migration
      thread and kicks the nice 0 task off the cpu.
    
    The load balancer doesn't stop until we kick out all nice 0 tasks from
    the sched group, leaving you with 3 idle cpus and one cpu running the
    nice -15 task.
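    
    To put rough numbers on this (the weights below are approximate values from
    the kernel's prio_to_weight table): a nice 0 task has a load weight of 1024
    (NICE_0_LOAD) and a nice -15 task roughly 29154, so the socket holding the
    niced task carries a weighted load of about 3*1024 + 29154 ≈ 32K, while each
    of the other sockets carries about 4096. find_busiest_group() therefore keeps
    selecting the niced group even though every one of its cpus runs exactly one
    task.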
    
    When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
    domain (in this case MC) has SD_PREFER_SIBLING set.  Subsequent load checks are
    not relevant because the niced task has a very large weight.
    
    In this patch, we add an extra condition to the "if (prefer_sibling)" check in
    update_sd_lb_stats(): we drop the capacity of a group only if the local group
    has extra capacity, i.e. nr_running < group_capacity. This preserves the
    original intent of the prefer_sibling check (to spread tasks across the system
    in low-utilization scenarios) and fixes the case above.
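    
    A minimal standalone sketch of that condition is below. It is not the actual
    sched_fair.c hunk; the struct and field names are simplified stand-ins for the
    sched_group statistics used by update_sd_lb_stats().
    
        /*
         * Minimal standalone model (not kernel code) of the prefer_sibling
         * capacity clamp.  Field names are simplified stand-ins for the
         * sched_group statistics collected in update_sd_lb_stats().
         */
        #include <stdio.h>
    
        struct group_stats {
                unsigned long nr_running;
                unsigned long group_capacity;   /* capacity in task slots */
        };
    
        /* "Extra capacity" as defined above: nr_running < group_capacity. */
        static int has_extra_capacity(const struct group_stats *g)
        {
                return g->nr_running < g->group_capacity;
        }
    
        /*
         * Old behaviour: with SD_PREFER_SIBLING on the child domain, always
         * clamp a non-local group's capacity to 1.
         * New behaviour: clamp only when the local group can actually take
         * another task.
         */
        static unsigned long clamped_capacity(const struct group_stats *group,
                                              const struct group_stats *local,
                                              int prefer_sibling)
        {
                if (prefer_sibling && has_extra_capacity(local))
                        return group->group_capacity > 1 ? 1 : group->group_capacity;
                return group->group_capacity;
        }
    
        int main(void)
        {
                /* The NUMA-domain case above: both groups fully occupied. */
                struct group_stats local = { .nr_running = 4, .group_capacity = 4 };
                struct group_stats niced = { .nr_running = 4, .group_capacity = 4 };
    
                /* Prints 4: the niced group keeps its real capacity, so it is
                 * no longer automatically treated as over capacity. */
                printf("%lu\n", clamped_capacity(&niced, &local, 1));
                return 0;
        }
    
    With the old unconditional clamp the same call would return 1, which is what
    kept the niced group looking busiest.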
    
    It helps in the following ways:
    - In low utilization cases (where nr_tasks << nr_cpus), we still drop
      group_capacity down to 1 if we prefer siblings.
    - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
      likely be > sgs.group_capacity.
    - When balancing large weight tasks, if the local group does not have extra
      capacity, we do not pick the group with the niced task as the busiest group.
      This prevents failed balances, active migration and the under-utilization
      described above.
    Signed-off-by: Nikhil Rao <ncrao@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>