    mm: memcontrol: fix memory.low proportional distribution · 503970e4
    Johannes Weiner authored
    Patch series "mm: memcontrol: recursive memory.low protection", v3.
    
    The current memory.low (and memory.min) semantics require protection to be
    assigned to a cgroup in an uninterrupted chain from the top-level cgroup
    all the way to the leaf.
    
    In practice, we want to protect entire cgroup subtrees from each other
    (system management software vs.  workload), but we would like the VM to
    balance memory optimally *within* each subtree, without having to make
    explicit weight allocations among individual components.  The current
    semantics make that impossible.
    
    They also introduce unmanageable complexity into more advanced resource
    trees.  For example:
    
              host root
              `- system.slice
                 `- rpm upgrades
                 `- logging
              `- workload.slice
                 `- a container
                    `- system.slice
                    `- workload.slice
                       `- job A
                          `- component 1
                          `- component 2
                       `- job B
    
    At a host-level perspective, we would like to protect the outer
    workload.slice subtree as a whole from rpm upgrades, logging etc.  But for
    that to be effective, right now we'd have to propagate it down through the
    container, the inner workload.slice, into the job cgroup and ultimately
    the component cgroups where memory is actually, physically allocated.
    This may cross several tree delegation points and namespace boundaries,
    which makes such a setup nearly impossible.
    
    CPU and IO on the other hand are already distributed recursively.  The
    user would simply configure allowances at the host level, and they would
    apply to the entire subtree without any downward propagation.
    
    To enable the above-mentioned use cases and bring memory in line with other
    resource controllers, this patch series extends memory.low/min such that
    settings apply recursively to the entire subtree.  Users can still assign
    explicit shares in subgroups, but if they don't, any ancestral protection
    will be distributed such that children compete freely amongst each other -
    as if no memory control were enabled inside the subtree - but enjoy
    protection from neighboring trees.
    
    In the above example, the user would then be able to configure shares of
    CPU, IO and memory at the host level to comprehensively protect and
    isolate the workload.slice as a whole from system.slice activity.
    
    Patch #1 fixes an existing bug that can give a cgroup tree more protection
    than it should receive as per ancestor configuration.
    
    Patch #2 simplifies and documents the existing code to make it easier to
    reason about the changes in the next patch.
    
    Patch #3 finally implements recursive memory protection semantics.
    
    Because of a risk of regressing legacy setups, the new semantics are
    hidden behind a cgroup2 mount option, 'memory_recursiveprot'.
    
    More details in patch #3.
    
    This patch (of 3):
    
    When memory.low is overcommitted - i.e.  the children claim more
    protection than their shared ancestor grants them - the allowance is
    distributed in proportion to how much each sibling uses their own declared
    protection:
    
    	low_usage = min(memory.low, memory.current)
    	elow = parent_elow * (low_usage / siblings_low_usage)
    
    However, siblings_low_usage is not the sum of all low_usages. It sums
    up the usages of *only those cgroups that are within their memory.low*.
    That means that low_usage can be *bigger* than siblings_low_usage, and
    consequently the total protection afforded to the children can be
    bigger than what the ancestor grants the subtree.
    
    Consider three groups where two are in excess of their protection:
    
      A/memory.low = 10G
      A/A1/memory.low = 10G, memory.current = 20G
      A/A2/memory.low = 10G, memory.current = 20G
      A/A3/memory.low = 10G, memory.current =  8G
      siblings_low_usage = 8G (only A3 contributes)
    
      A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
      A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
      A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G
    
      (the 12.5G are capped to the explicit memory.low setting of 10G)
    
    With that, the sum of all awarded protection below A is 30G, when A
    only grants 10G for the entire subtree.
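
    As a rough illustration, the calculation described above can be
    modelled in a few lines of Python. This is only a sketch of the
    formulas in this message, not the kernel code: the function name, the
    dict layout and the zero-divisor fallback are assumptions made for the
    example, and all values are in G.

```python
def elow_before_patch(parent_elow, children):
    """children maps name -> (memory.low, memory.current), both in G."""
    low_usage = {name: min(low, cur) for name, (low, cur) in children.items()}
    # Behaviour described above: a child only contributes to
    # siblings_low_usage while it is within its own memory.low.
    siblings_low_usage = sum(
        low_usage[name] for name, (low, cur) in children.items() if cur <= low
    )
    elow = {}
    for name, (low, cur) in children.items():
        if siblings_low_usage == 0:
            # Assumption for this sketch only: with no divisor, fall back
            # to the explicit memory.low setting.
            elow[name] = low
            continue
        share = parent_elow * low_usage[name] / siblings_low_usage
        elow[name] = min(share, low)  # capped at the explicit memory.low
    return elow


subtree = {"A1": (10, 20), "A2": (10, 20), "A3": (10, 8)}
before = elow_before_patch(10, subtree)
total_before = sum(before.values())  # exceeds the 10G that A grants
```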
    
    What does this mean in practice? A1 and A2 would still be in excess of
    their 10G allowance and would be reclaimed, whereas A3 would not. As
    they eventually drop below their protection setting, they would be
    counted in siblings_low_usage again and the error would right itself.
    
    When reclaim was applied in a binary fashion (cgroup is reclaimed when
    it's above its protection, otherwise it's skipped) this would actually
    work out just fine. However, since 1bc63fb1 ("mm, memcg: make scan
    aggression always exclude protection"), reclaim pressure is scaled to
    how much a cgroup is above its protection. As a result this
    calculation error unduly skews pressure away from A1 and A2 toward the
    rest of the system.
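
    To get a feel for the size of the skew, here is a toy model. It
    assumes that the share of a cgroup exposed to reclaim is simply
    1 - protection / usage; that exact shape is an assumption made for
    illustration and is not taken from this message. The helper name is
    hypothetical. It compares A1's exposure under the inflated elow from
    the example above with the exposure it would see if
    siblings_low_usage were the sum of all low_usages (28G).

```python
def scan_fraction(usage, protection):
    """Toy model: fraction of a cgroup's memory exposed to reclaim."""
    if usage <= protection:
        return 0.0
    return 1.0 - protection / usage


a1_usage = 20                          # A1/memory.current in G
inflated_elow = min(10 * 10 / 8, 10)   # only A3 counted in siblings_low_usage
summed_elow = min(10 * 10 / 28, 10)    # all low_usages counted

skewed = scan_fraction(a1_usage, inflated_elow)
expected = scan_fraction(a1_usage, summed_elow)
```

    In this model A1 is exposed to noticeably less pressure with the
    inflated value than with the summed one, which is the direction of
    the skew described above.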
    
    But why did we do it like this in the first place?
    
    The reasoning behind exempting groups in excess from
    siblings_low_usage was to go after them first during reclaim in an
    overcommitted subtree:
    
      A/memory.low = 2G, memory.current = 4G
      A/A1/memory.low = 3G, memory.current = 2G
      A/A2/memory.low = 1G, memory.current = 2G
    
      siblings_low_usage = 2G (only A1 contributes)
      A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G
      A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G
    
    While the children combined are overcommitting A and are technically
    both at fault, A2 is actively declaring unprotected memory and we
    would like to reclaim that first.
    
    However, while this sounds like a noble goal on the face of it, it
    doesn't make much difference in actual memory distribution: Because A
    is overcommitted, reclaim will not stop once A2 gets pushed back to
    within its allowance; we'll have to reclaim A1 either way. The end
    result is still that protection is distributed proportionally, with A1
    getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance.
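
    The numbers in this example can be checked with a short sketch. The
    first part reproduces the elow values listed above; the second part
    reads the 3/4 and 1/4 split as shares of the declared memory.low
    values, which is an interpretation made for this example. Variable
    names are illustrative and all values are in G.

```python
parent_elow = 2
children = {"A1": (3, 2), "A2": (1, 2)}  # name -> (memory.low, memory.current)

# Initial distribution: only groups within their memory.low are summed.
low_usage = {n: min(low, cur) for n, (low, cur) in children.items()}
siblings_low_usage = sum(
    low_usage[n] for n, (low, cur) in children.items() if cur <= low
)
initial = {
    n: min(parent_elow * low_usage[n] / siblings_low_usage, low)
    for n, (low, cur) in children.items()
}

# Eventual split of A's allowance, taken in proportion to declared memory.low.
declared_total = sum(low for low, _ in children.values())
eventual = {
    n: parent_elow * low / declared_total for n, (low, _) in children.items()
}
```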
    
    [ If A weren't overcommitted, it wouldn't make a difference since each
      cgroup would just get the protection it declares:
    
      A/memory.low = 2G, memory.current = 3G
      A/A1/memory.low = 1G, memory.current = 1G
      A/A2/memory.low = 1G, memory.current = 2G
    
      With the current calculation:
    
      siblings_low_usage = 1G (only A1 contributes)
      A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
      A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
    
      Including excess groups in siblings_low_usage:
    
      siblings_low_usage = 2G
      A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G
      A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G ]
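
    The equivalence in the non-overcommitted case can be checked with a
    small sketch that switches between the two ways of summing
    siblings_low_usage. The function and its include_excess flag are
    hypothetical names for this example, and the sketch assumes the
    divisor is nonzero. Values are in G.

```python
def effective_low(parent_elow, children, include_excess):
    """children maps name -> (memory.low, memory.current), both in G."""
    low_usage = {n: min(low, cur) for n, (low, cur) in children.items()}
    siblings_low_usage = sum(
        usage
        for n, usage in low_usage.items()
        # children[n] is (low, current): count excess groups only on request.
        if include_excess or children[n][1] <= children[n][0]
    )
    return {
        n: min(parent_elow * low_usage[n] / siblings_low_usage, children[n][0])
        for n in children
    }


children = {"A1": (1, 1), "A2": (1, 2)}
current_calc = effective_low(2, children, include_excess=False)
proposed_calc = effective_low(2, children, include_excess=True)
```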
    
    Simplify the calculation and fix the proportional reclaim bug by
    including excess cgroups in siblings_low_usage.
    
    After this patch, the effective memory.low distribution from the
    example above would be as follows:
    
      A/memory.low = 10G
      A/A1/memory.low = 10G, memory.current = 20G
      A/A2/memory.low = 10G, memory.current = 20G
      A/A3/memory.low = 10G, memory.current =  8G
      siblings_low_usage = 28G
    
      A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
      A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
      A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(28G) = 2.8G
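
    A minimal sketch of the calculation with every sibling's low_usage
    included, again with illustrative names and values in G. It assumes
    at least one sibling has a nonzero low_usage. Note that the figures
    above are truncated to one decimal place; the unrounded values are
    about 3.57G and 2.86G.

```python
def elow_after_patch(parent_elow, children):
    """children maps name -> (memory.low, memory.current), both in G."""
    low_usage = {n: min(low, cur) for n, (low, cur) in children.items()}
    # Every sibling contributes, whether or not it exceeds its memory.low.
    siblings_low_usage = sum(low_usage.values())
    return {
        n: min(parent_elow * low_usage[n] / siblings_low_usage, children[n][0])
        for n in children
    }


subtree = {"A1": (10, 20), "A2": (10, 20), "A3": (10, 8)}
after = elow_after_patch(10, subtree)
total_after = sum(after.values())  # stays within the 10G that A grants
```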
    
    Fixes: 1bc63fb1 ("mm, memcg: make scan aggression always exclude protection")
    Fixes: 23067153 ("mm: memory.low hierarchical behavior")
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Acked-by: Roman Gushchin <guro@fb.com>
    Acked-by: Chris Down <chris@chrisdown.name>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Michal Koutný <mkoutny@suse.com>
    Link: http://lkml.kernel.org/r/20200227195606.46212-2-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>