    mm, memcg: proportional memory.{low,min} reclaim · 9783aa99
    Chris Down authored
    cgroup v2 introduces two memory protection thresholds: memory.low
    (best-effort) and memory.min (hard protection).  While they generally do
    what they say on the tin, there is a limitation in their implementation
    that makes them difficult to use effectively: cliff behaviour often
    manifests when a cgroup becomes eligible for reclaim.  This patch implements
    more intuitive and usable behaviour, where we gradually mount more
    reclaim pressure as cgroups further and further exceed their protection
    thresholds.
    
    This cliff edge behaviour happens because we only choose whether or not
    to reclaim based on whether the memcg is within its protection limits
    (see the use of mem_cgroup_protected in shrink_node), but we don't vary
    our reclaim behaviour based on this information.  Imagine the following
    timeline, where the numbers are the lruvec size in this zone:
    
    1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
    2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
    3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
       scanned. (?!)
    
    * Of course, we won't usually scan all available pages in the zone even
      without this patch because of scan control priority, over-reclaim
      protection, etc.  However, as shown by the tests at the end, these
      techniques don't sufficiently throttle such an extreme change in input,
      so cliff-like behaviour isn't really averted by their existence alone.
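
    The timeline above can be modelled in a few lines of userspace C.  This
    is only an illustrative sketch, not kernel code: binary_scan_target() and
    within_protection() are hypothetical names, and the model ignores scan
    control priority and over-reclaim protection entirely.  It shows the
    upper bound on scannable pages jumping from 0 to the whole lruvec as
    soon as usage exceeds memory.low by a single page.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the pre-patch decision: a cgroup is either
 * fully protected or not protected at all.
 */
static bool within_protection(unsigned long usage, unsigned long low)
{
	return usage <= low;
}

/*
 * Upper bound on the pages that may be scanned under the binary model.
 * Nothing is eligible while usage is within memory.low; everything is
 * eligible once usage exceeds it, however small the overage.
 */
unsigned long binary_scan_target(unsigned long lruvec_size,
				 unsigned long usage, unsigned long low)
{
	return within_protection(usage, low) ? 0 : lruvec_size;
}
```

    Feeding in the three steps of the timeline (with the lruvec size equal to
    memory.current) gives 0, 0 and 1000001 respectively.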
    
    Here's an example of how this plays out in practice.  At Facebook, we are
    trying to protect various workloads from "system" software, like
    configuration management tools, metric collectors, etc (see this[0] case
    study).  In order to find a suitable memory.low value, we start by
    determining the expected memory range within which the workload will be
    comfortable operating.  This isn't an exact science -- memory usage deemed
    "comfortable" will vary over time due to user behaviour, differences in
    composition of work, etc, etc.  As such we need to ballpark memory.low,
    but doing this is currently problematic:
    
    1. If we end up setting it too low for the workload, it won't have
       *any* effect (see discussion above).  The group will receive the full
       weight of reclaim and won't have any priority while competing with the
       less important system software, as if we had no memory.low configured
       at all.
    
    2. Because of this behaviour, we end up erring on the side of setting
       it too high, such that the comfort range is reliably covered.  However,
       protected memory is completely unavailable to the rest of the system,
       so we might cause undue memory and IO pressure there when we *know* we
       have some elasticity in the workload.
    
    3. Even if we get the value totally right, smack in the middle of the
       comfort zone, we get extreme jumps between no pressure and full
       pressure that cause unpredictable pressure spikes in the workload due
       to the current binary reclaim behaviour.
    
    With this patch, we can set it to our ballpark estimation without too much
    worry.  Any undesirable behaviour, such as too much or too little reclaim
    pressure on the workload or system, will be proportional to how far our
    estimation is off.  This means we can set memory.low much more
    conservatively and thus waste fewer resources *without* the risk of the
    workload falling off a cliff if we overshoot.
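
    One way to picture "proportional to how far our estimation is off" is
    the userspace model below.  It is a sketch under stated assumptions, not
    the kernel implementation: proportional_scan_target() is a hypothetical
    name, and the scaling rule (scan the same fraction of the lruvec as the
    fraction by which usage exceeds the protection threshold, capped at the
    whole lruvec) is an assumption chosen to illustrate a gradual ramp.

```c
/*
 * Hypothetical proportional model: the scan target grows with the
 * overage (usage relative to protection) instead of jumping straight
 * to the full lruvec size.
 */
unsigned long proportional_scan_target(unsigned long lruvec_size,
				       unsigned long usage,
				       unsigned long protection)
{
	unsigned long long scaled;

	if (protection == 0)
		return lruvec_size;	/* no protection configured */
	if (usage <= protection)
		return 0;		/* still fully protected */

	/*
	 * lruvec_size * usage / protection - lruvec_size, i.e. a cgroup
	 * at 130% of its protection is targeted at 30% of its lruvec.
	 * The intermediate is widened in case unsigned long is 32 bits.
	 */
	scaled = (unsigned long long)lruvec_size * usage / protection;
	scaled -= lruvec_size;

	/* Never ask for more than the lruvec actually holds. */
	return scaled > lruvec_size ? lruvec_size : (unsigned long)scaled;
}
```

    Under this model, exceeding memory.low=1000000 by one page makes a
    single page eligible for scanning rather than all 1000001, and the
    target only reaches the full lruvec once usage is double the threshold.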
    
    As a more abstract technical description, this unintuitive behaviour
    results in having to give high-priority workloads a large protection
    buffer on top of their expected usage to function reliably, as otherwise
    we have abrupt periods of dramatically increased memory pressure which
    hamper performance.  Having to set these thresholds so high wastes
    resources and generally works against the principle of work conservation.
    In addition, having proportional memory reclaim behaviour has other
    benefits.  Most notably, before this patch it's basically mandatory to set
    memory.low to a higher than desirable value because otherwise as soon as
    you exceed memory.low, all protection is lost, and all pages are eligible
    to scan again.  By contrast, having a gradual ramp in reclaim pressure
    means that you now still get some protection when thresholds are exceeded,
    which means that one can now be more comfortable setting memory.low to
    lower values without worrying that all protection will be lost.  This is
    important because workingset size is really hard to know exactly,
    especially with variable workloads, so at least getting *some* protection
    if your workingset size grows larger than you expect increases user
    confidence in setting memory.low without a huge buffer on top being
    needed.
    
    Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
    assistance in thinking about how to make this work better.
    
    In testing these changes, I intended to verify that:
    
    1. Changes in page scanning become gradual and proportional instead of
       binary.
    
       To test this, I experimented stepping further and further down
       memory.low protection on a workload that floats around 19G workingset
       when under memory.low protection, watching page scan rates for the
       workload cgroup:
    
       +------------+-----------------+--------------------+--------------+
       | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
       +------------+-----------------+--------------------+--------------+
       |        21G |               0 |                  0 | N/A          |
       |        17G |             867 |               3799 | 23%          |
       |        12G |            1203 |               3543 | 34%          |
       |         8G |            2534 |               3979 | 64%          |
       |         4G |            3980 |               4147 | 96%          |
       |          0 |            3799 |               3980 | 95%          |
       +------------+-----------------+--------------------+--------------+
    
       As you can see, the test kernel (containing this patch) ramps up
       page scanning significantly more gradually than the
       control kernel (without this patch).
    
    2. More gradual ramp up in reclaim aggression doesn't result in
       premature OOMs.
    
       To test this, I wrote a script that slowly increments the number of
       pages held by stress(1)'s --vm-keep mode until a production system
       entered severe overall memory contention.  This script runs in a highly
       protected slice taking up the majority of available system memory.
       Watching vmstat revealed that page scanning continued essentially
       nominally between test and control, without causing forward reclaim
       progress to become arrested.
    
    [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
    
    [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
    [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
      Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
    Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
    Signed-off-by: Chris Down <chris@chrisdown.name>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Roman Gushchin <guro@fb.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cgroup-v2.rst 94 KB