    mm, memcg: reclaim more aggressively before high allocator throttling · b3ff9291
    Chris Down authored
    Patch series "mm, memcg: reclaim harder before high throttling", v2.
    
    This patch (of 2):
    
    In Facebook production, we've seen cases where cgroups have been put into
    allocator throttling even when they appear to have a lot of slack file
    caches which should be trivially reclaimable.
    
    Looking more closely, the problem is that we only try a single cgroup
    reclaim walk for each return to usermode before calculating whether or not
    we should throttle.  This single attempt doesn't apply enough pressure to
    shrink the file caches of cgroups that were growing rapidly prior to
    entering allocator throttling.
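    In a simplified C sketch, the pre-patch flow on return to usermode looks
    roughly like the below.  This is illustrative only, not the verbatim
    memcontrol.c code: helper names such as reclaim_high() and
    calculate_high_delay() mirror existing helpers, but their exact signatures
    and surrounding details are approximated here.

        void mem_cgroup_handle_over_high(void)
        {
            unsigned int nr_pages = current->memcg_nr_pages_over_high;
            unsigned long penalty_jiffies;
            struct mem_cgroup *memcg;

            if (likely(!nr_pages))
                return;

            memcg = get_mem_cgroup_from_mm(current->mm);
            current->memcg_nr_pages_over_high = 0;

            /* Exactly one reclaim walk, however little it manages to free. */
            reclaim_high(memcg, nr_pages, GFP_KERNEL);

            /* Throttle purely on the overage, even when another reclaim
             * pass could easily have brought us back under memory.high. */
            penalty_jiffies = calculate_high_delay(memcg, nr_pages);
            if (penalty_jiffies > HZ / 100)     /* 10ms grace period */
                schedule_timeout_killable(penalty_jiffies);

            css_put(&memcg->css);
        }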
    
    As an example, we see that threads in an affected cgroup are stuck in
    allocator throttling:
    
        # for i in $(cat cgroup.threads); do
        >     grep over_high "/proc/$i/stack"
        > done
        [<0>] mem_cgroup_handle_over_high+0x10b/0x150
        [<0>] mem_cgroup_handle_over_high+0x10b/0x150
        [<0>] mem_cgroup_handle_over_high+0x10b/0x150
    
    ...however, there is no I/O pressure reported by PSI, despite a lot of
    slack file pages:
    
        # cat memory.pressure
        some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
        full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
        # cat io.pressure
        some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
        full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
        # grep _file memory.stat
        inactive_file 1370939392
        active_file 661635072
    
    This patch changes the behaviour to retry reclaim either until the current
    task's throttling penalty falls below the 10ms grace period, or until we
    are making no reclaim progress at all.  In the latter case, we enter
    reclaim throttling as before.
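    Sketched in the same simplified C as above, the new behaviour turns the
    single pass into a retry loop (again illustrative only; the retry cap,
    batch size and helper signatures are approximations of the real code):

        unsigned long penalty_jiffies, nr_reclaimed;
        int nr_retries = MAX_RECLAIM_RETRIES;
        bool in_retry = false;

        retry_reclaim:
        /* Retries reclaim in smaller batches to limit overreclaim. */
        nr_reclaimed = reclaim_high(memcg,
                                    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
                                    GFP_KERNEL);

        penalty_jiffies = calculate_high_delay(memcg, nr_pages);

        /* Back under the 10ms grace period: no throttling needed. */
        if (penalty_jiffies <= HZ / 100)
            goto out;

        /* Reclaim is still making forward progress, so keep shrinking
         * instead of putting the task to sleep. */
        if (nr_reclaimed || nr_retries--) {
            in_retry = true;
            goto retry_reclaim;
        }

        /* No progress at all: enter allocator throttling as before. */
        schedule_timeout_killable(penalty_jiffies);
        out:
        css_put(&memcg->css);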
    
    To a user, there's no intuitive reason for the reclaim behaviour to differ
    from hitting memory.high as part of a new allocation, as opposed to
    hitting memory.high because someone lowered its value.  As such this also
    brings an added benefit: it unifies the reclaim behaviour between the two.
    
    There's precedent for this behaviour: we already do reclaim retries when
    writing to memory.{high,max}, in max reclaim, and in the page allocator
    itself.
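    For reference, a simplified sketch of the existing retry loop used when
    memory.high is lowered by a write (the memory_high_write() path in
    memcontrol.c; details and signatures approximated, and 'high' stands for
    the newly written limit):

        bool drained = false;
        unsigned int nr_retries = MAX_RECLAIM_RETRIES;

        for (;;) {
            unsigned long nr_pages = page_counter_read(&memcg->memory);
            unsigned long reclaimed;

            if (nr_pages <= high)
                break;
            if (signal_pending(current))
                break;

            if (!drained) {
                /* Flush per-cpu charge caches once before reclaiming. */
                drain_all_stock(memcg);
                drained = true;
                continue;
            }

            reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
                                                     GFP_KERNEL, true);

            /* Stop only once reclaim makes no progress and retries run out. */
            if (!reclaimed && !nr_retries--)
                break;
        }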
    Signed-off-by: Chris Down <chris@chrisdown.name>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Roman Gushchin <guro@fb.com>
    Link: http://lkml.kernel.org/r/cover.1594640214.git.chris@chrisdown.name
    Link: http://lkml.kernel.org/r/a4e23b59e9ef499b575ae73a8120ee089b7d3373.1594640214.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>