    sched/fair: Fix unfairness caused by missing load decay · 0258bdfa
    Odin Ugedal authored
    This fixes an issue where old load on a cfs_rq is not properly decayed,
    resulting in strange behavior where fairness can decrease drastically.
    Real workloads with equally weighted control groups have ended up
    getting a respective 99% and 1%(!!) of cpu time.
    
    When an idle task is attached to a cfs_rq by attaching a pid to a cgroup,
    the old load of the task is attached to the new cfs_rq and sched_entity by
    attach_entity_cfs_rq. If the task is then moved to another cpu (and
    therefore cfs_rq) before being enqueued/woken up, the load will be moved
    to cfs_rq->removed from the sched_entity. Such a move will happen when
    enforcing a cpuset on the task (e.g. via a cgroup) that forces it to move.
    
    The load will however not be removed from the task_group itself, making
    it look like there is a constant load on that cfs_rq. This causes the
    vruntime of tasks on other sibling cfs_rq's to increase faster than it
    is supposed to, causing severe fairness issues. If no other task is
    started on the given cfs_rq (and with the cpuset in place, none will be),
    this load is never removed. With this patch the load is properly
    removed inside update_blocked_averages. The same problem applies to
    tasks that are moved to the fair scheduling class and then moved to
    another cpu, and this patch fixes that case as well. For fork, the
    entity is queued right away, so this problem does not affect that path.
    
    This applies to cases where the new process is the first in the cfs_rq,
    an issue introduced in 3d30544f ("sched/fair: Apply more PELT fixes"),
    and to cases where there has previously been load on the cgroup but the
    cgroup was removed from the leaf list due to having null PELT load, an
    issue introduced in 039ae8bc ("sched/fair: Fix O(nr_cgroups) in the
    load balancing path").
    
    For a simple cgroup hierarchy (as seen below) with two equally weighted
    groups, which in theory should get 50% of cpu time each, the bug often
    leads to a cpu time split of 60/40 or 70/30.
    
    parent/
      cg-1/
        cpu.weight: 100
        cpuset.cpus: 1
      cg-2/
        cpu.weight: 100
        cpuset.cpus: 1
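
As a rough illustration of why stale load skews the split, the sketch below models a group's per-cpu weight as its shares scaled by the local cfs_rq load over the task_group's total load. This is a simplified toy loosely following the idea behind calc_group_shares, not the kernel code: the helpers group_weight and cpu_percent and all numbers are hypothetical, and real PELT averages change over time.

```c
#include <assert.h>

/*
 * Toy model (hypothetical helper, not kernel code): weight of a group's
 * entity on one cpu. stale_load stands for load that was attached on
 * another cpu's cfs_rq and never decayed from the task_group total.
 */
long group_weight(long tg_shares, long local_load, long stale_load)
{
    assert(tg_shares >= 0 && local_load >= 0 && stale_load >= 0);

    long tg_load = local_load + stale_load;

    if (tg_load == 0)
        return tg_shares;

    /* The stale term inflates the divisor and shrinks the weight. */
    return tg_shares * local_load / tg_load;
}

/* Percentage of cpu time for a group of weight w competing with other. */
long cpu_percent(long w, long other)
{
    return 100 * w / (w + other);
}
```

With 1024 shares per group and a local load of 1024 each, a stale load of 1024 stuck on cg-1 gives group_weight(1024, 1024, 1024) = 512 against 1024 for cg-2, so cpu_percent yields roughly 33 and 66 instead of 50 and 50 under these assumed numbers.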
    
    If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2
    equally weighted, they should still get a 50/50 balance of cpu time.
    This however sometimes results in a balance of 10/90 or 1/99(!!) between
    the task groups.
    
    $ ps u -C stress
    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root       18568  1.1  0.0   3684   100 pts/12   R+   13:36   0:00 stress --cpu 1
    root       18580 99.3  0.0   3684   100 pts/12   R+   13:36   0:09 stress --cpu 1
    
    parent/
      cg-1/
        cpu.weight: 100
        sub-group/
          cpu.weight: 1
          cpuset.cpus: 1
      cg-2/
        cpu.weight: 100
        sub-group/
          cpu.weight: 10000
          cpuset.cpus: 1
    
    This can be reproduced by attaching an idle process to a cgroup and
    moving it to a given cpuset before it wakes up. The issue is evident in
    many (if not most) container runtimes, and has been reproduced
    with both crun and runc (and therefore docker and all its "derivatives"),
    and with both cgroup v1 and v2.
    
    Fixes: 3d30544f ("sched/fair: Apply more PELT fixes")
    Fixes: 039ae8bc ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
    Signed-off-by: Odin Ugedal <odin@uged.al>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lkml.kernel.org/r/20210501141950.23622-2-odin@uged.al