    Revert "mm: memcontrol: avoid workload stalls when lowering memory.high" · e82553c1
    Johannes Weiner authored
    This reverts commit 536d3bf2, as it can
    cause writers to memory.high to get stuck in the kernel forever,
    performing page reclaim and consuming excessive amounts of CPU cycles.
    
    Before the patch, a write to memory.high would first put the new limit
    in place for the workload, and then reclaim the requested delta.  After
    the patch, the kernel tries to reclaim the delta before putting the new
    limit into place, in order to not overwhelm the workload with a sudden,
    large excess over the limit.  However, if reclaim is actively racing
    with new allocations from the uncurbed workload, it can keep the write()
    working inside the kernel indefinitely.
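
The difference between the two orderings can be sketched with a toy model. This is not kernel code: `struct toy_cgroup`, `toy_round()`, both `toy_write_high_*()` helpers, the page counts, and the `TOY_MAX_LOOPS` cutoff are hypothetical names and numbers chosen for illustration. The model assumes, as a simplification, that a workload over its limit makes no net progress allocating, and that reclaim frees a fixed number of pages per round.

```c
#include <assert.h>

/* Toy model only; every name and number here is a hypothetical stand-in. */
struct toy_cgroup {
    long usage; /* pages currently charged to the workload */
    long high;  /* limit currently in place, in pages */
};

#define TOY_MAX_LOOPS 1000 /* demo cutoff so the "stuck" case terminates */

/* One round: the workload tries to allocate, then the writer reclaims. */
void toy_round(struct toy_cgroup *cg, long alloc, long reclaim)
{
    /* Allocations are curbed only by the limit that is in place right now. */
    if (cg->usage + alloc <= cg->high)
        cg->usage += alloc;
    cg->usage -= reclaim < cg->usage ? reclaim : cg->usage;
}

/* Reverted ordering: reclaim the delta first, install the limit afterwards. */
int toy_write_high_reclaim_first(struct toy_cgroup *cg, long new_high,
                                 long alloc, long reclaim)
{
    int loops = 0;

    assert(new_high >= 0 && reclaim > 0);
    while (cg->usage > new_high && loops < TOY_MAX_LOOPS) {
        toy_round(cg, alloc, reclaim); /* workload still sees the old limit */
        loops++;
    }
    cg->high = new_high;
    return loops;
}

/* Restored ordering: install the limit first, then reclaim the delta. */
int toy_write_high_limit_first(struct toy_cgroup *cg, long new_high,
                               long alloc, long reclaim)
{
    int loops = 0;

    assert(new_high >= 0 && reclaim > 0);
    cg->high = new_high;
    while (cg->usage > new_high && loops < TOY_MAX_LOOPS) {
        toy_round(cg, alloc, reclaim); /* workload is curbed by the new limit */
        loops++;
    }
    return loops;
}
```

Starting from 1000 pages of usage under a 1000-page limit, lowering the limit to 500 with the workload allocating as fast as the writer reclaims (10 pages per round each), the limit-first variant converges in a bounded number of rounds, while the reclaim-first variant only stops because of the artificial cutoff.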
    
    This is causing problems in Facebook production.  A privileged
    system-level daemon that adjusts memory.high for various workloads
    running on a host can get unexpectedly stuck in the kernel and
    essentially turn into a sort of involuntary kswapd for one of the
    workloads.  We've observed that daemon busy-spin in a write() for
    minutes at a time, neglecting its other duties on the system, and
    expending privileged system resources on behalf of a workload.
    
    To remedy this, we have first considered changing the reclaim logic to
    break out after a couple of loops - whether the workload has converged
    to the new limit or not - and bound the write() call this way.  However,
    the root cause that inspired the sequence change in the first place has
    been fixed through other means, and so a revert back to the proven
    limit-setting sequence, also used by memory.max, is preferable.
    
    The sequence was changed to avoid extreme latencies in the workload when
    the limit was lowered: the sudden, large excess created by the limit
    lowering would erroneously trigger the penalty sleeping code that is
    meant to throttle excessive growth from below.  Allocating threads could
    end up sleeping long after the write() had already reclaimed the delta
    for which they were being punished.
    
    However, erroneous throttling also caused problems in other scenarios at
    around the same time.  This resulted in commit b3ff9291 ("mm, memcg:
    reclaim more aggressively before high allocator throttling"), included
    in the same release as the offending commit.  When allocating threads
    now encounter large excess caused by a racing write() to memory.high,
    instead of entering punitive sleeps, they will simply be tasked with
    helping reclaim down the excess, and will be held no longer than it
    takes to accomplish that.  This is in line with regular limit
    enforcement - i.e., when the workload allocates up against or over an
    otherwise unchanged limit from below.
    
    With the patch breaking userspace, and the root cause addressed by other
    means already, revert it again.
    
    Link: https://lkml.kernel.org/r/20210122184341.292461-1-hannes@cmpxchg.org
    Fixes: 536d3bf2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Reported-by: Tejun Heo <tj@kernel.org>
    Acked-by: Chris Down <chris@chrisdown.name>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Michal Koutný <mkoutny@suse.com>
    Cc: <stable@vger.kernel.org>	[5.8+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memcontrol.c 190 KB