1. 30 Apr, 2021 10 commits
    • mm: memcontrol: slab: fix obtain a reference to a freeing memcg · 9f38f03a
      Muchun Song authored
      Patch series "Use obj_cgroup APIs to charge kmem pages", v5.
      
      Since Roman's series "The new cgroup slab memory controller" was
      applied, all slab objects are charged via the new obj_cgroup APIs.
      The new APIs introduce a struct obj_cgroup to charge slab objects,
      which prevents long-living objects from pinning the original memory
      cgroup in memory.  But there are still some corner-case objects
      (e.g.  allocations larger than order-1 pages on SLUB) which are not
      charged with the new APIs.  Those objects (including pages allocated
      directly from the buddy allocator) are charged as kmem pages, which
      still hold a reference to the memory cgroup.
      
      For example, kernel stacks are charged as kmem pages because a
      kernel stack can be larger than 2 pages (e.g.  16KB on x86_64 or
      arm64).  Suppose we create a thread whose stack is charged to memory
      cgroup A and then move the thread from memory cgroup A to memory
      cgroup B.  Because the thread's kernel stack holds a reference to
      memory cgroup A, the thread pins cgroup A in memory even after
      cgroup A has been removed.  The following script demonstrates the
      scenario: after it runs, the system has accumulated 500 dying
      cgroups.  (This is not a real-world issue, just a script showing
      that large kmallocs are charged as kmem pages, which can pin the
      memory cgroup in memory.)
      
      	#!/bin/bash
      
      	cat /proc/cgroups | grep memory
      
      	cd /sys/fs/cgroup/memory
      	echo 1 > memory.move_charge_at_immigrate
      
      	for i in {1..500}
      	do
      		mkdir kmem_test
      		echo $$ > kmem_test/cgroup.procs
      		sleep 3600 &
      		echo $$ > cgroup.procs
      		echo `cat kmem_test/cgroup.procs` > cgroup.procs
      		rmdir kmem_test
      	done
      
      	cat /proc/cgroups | grep memory
      
      This patchset makes those kmem pages drop their reference to the
      memory cgroup by using the obj_cgroup APIs.  With the series
      applied, the number of dying cgroups no longer increases when the
      above test script is run.
      
      This patch (of 7):
      
      rcu_read_lock/unlock only guarantees that the memcg will not be
      freed; it cannot guarantee that the css_get() on that memcg (done in
      refill_stock() when the cached memcg changes) is safe.
      
        rcu_read_lock()
        memcg = obj_cgroup_memcg(old)
        __memcg_kmem_uncharge(memcg)
            refill_stock(memcg)
                if (stock->cached != memcg)
                    // css_get can change the ref counter from 0 back to 1.
                    css_get(&memcg->css)
        rcu_read_unlock()
      
      This fix is similar to commit:
      
        eefbfa7f ("mm: memcg/slab: fix use after free in obj_cgroup_charge")
      
      Fix this by holding a reference to the memcg that is passed to
      __memcg_kmem_uncharge() before calling __memcg_kmem_uncharge().
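
      A minimal sketch of that approach (not necessarily the exact
      upstream diff): pin the memcg with css_tryget() under RCU before
      uncharging, and retry if its reference count has already hit zero.

        rcu_read_lock();
      retry:
        memcg = obj_cgroup_memcg(old);
        if (unlikely(!css_tryget(&memcg->css)))
                /* memcg is being freed; the objcg gets reparented, reload */
                goto retry;
        rcu_read_unlock();

        __memcg_kmem_uncharge(memcg, nr_pages);
        css_put(&memcg->css);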
      
      Link: https://lkml.kernel.org/r/20210319163821.20704-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20210319163821.20704-2-songmuchun@bytedance.com
      Fixes: 3de7d4f2 ("mm: memcg/slab: optimize objcg stock draining")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: charge before adding to swapcache on swapin · 0add0c77
      Shakeel Butt authored
      Currently the kernel adds the page allocated for swapin to the
      swapcache before charging it.  This is fine, but now we want a
      per-memcg swapcache stat, which is essential for folks who want to
      transparently migrate from cgroup v1's memsw to cgroup v2's memory
      and swap counters.  In addition, charging a page before exposing it
      to other parts of the kernel is a step in the right direction.
      
      To correctly maintain the per-memcg swapcache stat, this patch
      charges the page before adding it to the swapcache.  One challenge
      with this approach is the failure case of add_to_swap_cache(), where
      we need to undo the mem_cgroup_charge(); specifically, undoing
      mem_cgroup_uncharge_swap() is not simple.
      
      To resolve the issue, this patch decouples the charging for swapin
      pages from mem_cgroup_charge().  Two new functions are introduced:
      mem_cgroup_swapin_charge_page() for charging just the swapin page,
      and mem_cgroup_swapin_uncharge_swap() for uncharging the swap slot
      once the page has been successfully added to the swapcache.
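
      A simplified sketch of the reordered swapin path, assuming the new
      API names described above (not the exact upstream code):

        page = alloc_page_vma(gfp_mask, vma, addr);
        if (!page)
                return NULL;

        /* Charge before the page becomes visible in the swapcache. */
        if (mem_cgroup_swapin_charge_page(page, vma->vm_mm, gfp_mask, entry)) {
                put_page(page);
                return NULL;
        }

        if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL, NULL)) {
                /*
                 * The page charge is released when the page is freed;
                 * the swap slot has not been touched yet.
                 */
                put_page(page);
                return NULL;
        }

        /* Uncharge the swap slot only after the swapcache add succeeded. */
        mem_cgroup_swapin_uncharge_swap(entry);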
      
      [shakeelb@google.com: set page->private before calling swap_readpage]
        Link: https://lkml.kernel.org/r/20210318015959.2986837-1-shakeelb@google.com
      
      Link: https://lkml.kernel.org/r/20210305212639.775498-1-shakeelb@google.com
      
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Tested-by: Heiko Carstens <hca@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: consolidate lruvec stat flushing · 2cd21c89
      Johannes Weiner authored
      There are two functions that flush the per-cpu data of an lruvec
      into the rest of the cgroup tree: one for when the cgroup is being
      freed, and one for when a CPU disappears during hotplug.  The
      difference is whether all CPUs or just one is being collected, but
      the rest of the flushing code is the same.  Merge them into one
      function and share the common code.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-8-hannes@cmpxchg.org
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: switch to rstat · 2d146aa3
      Johannes Weiner authored
      Replace the memory controller's custom hierarchical stats code with the
      generic rstat infrastructure provided by the cgroup core.
      
      The current implementation does batched upward propagation from the
      write side (i.e.  as stats change).  The per-cpu batches introduce an
      error, which is multiplied by the number of subgroups in a tree.  In
      systems with many CPUs and sizable cgroup trees, the error can be large
      enough to confuse users (e.g.  32 batch pages * 32 CPUs * 32 subgroups
      results in an error of up to 128M per stat item).  This can entirely
      swallow allocation bursts inside a workload that the user is expecting
      to see reflected in the statistics.
      
      In the past, we've done read-side aggregation, where a memory.stat read
      would have to walk the entire subtree and add up per-cpu counts.  This
      became problematic with lazily-freed cgroups: we could have large
      subtrees where most cgroups were entirely idle.  Hence the switch to
      change-driven upward propagation.  Unfortunately, it needed to trade
      accuracy for speed due to the write side being so hot.
      
      Rstat combines the best of both worlds: from the write side, it cheaply
      maintains a queue of cgroups that have pending changes, so that the read
      side can do selective tree aggregation.  This way the reported stats
      will always be as precise and recent as can be, while the
      aggregation can skip over potentially large numbers of idle cgroups.
      
      The way rstat works is that it implements a tree for tracking cgroups
      with pending local changes, as well as a flush function that walks the
      tree upwards.  The controller then drives this by 1) telling rstat when
      a local cgroup stat changes (e.g.  mod_memcg_state) and 2) when a flush
      is required to get up-to-date hierarchy stats for a given subtree (e.g.
      when memory.stat is read).  The controller also provides a flush
      callback that is called during the rstat flush walk for each cgroup and
      aggregates its local per-cpu counters and propagates them upwards.
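
      A hedged outline of how the controller drives the rstat core.  Only
      the cgroup core hooks (cgroup_rstat_updated(), cgroup_rstat_flush(),
      and the .css_rstat_flush callback) are taken as given; the memcg-side
      shape shown here is illustrative:

        /* 1) Write side: after changing a local per-cpu counter. */
        cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());

        /* 2) Read side: before reporting, e.g. when memory.stat is read. */
        cgroup_rstat_flush(memcg->css.cgroup);

        /*
         * 3) Per-cgroup flush callback, registered as .css_rstat_flush in
         *    the memory cgroup_subsys; folds @cpu's deltas into the cgroup
         *    and propagates them to the parent.
         */
        static void memcg_rstat_flush_sketch(struct cgroup_subsys_state *css, int cpu)
        {
                /* aggregate local per-cpu counters and push them upwards */
        }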
      
      This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
      NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
      aggregation.  It removes 3 words from the per-cpu data.  It eliminates
      memcg_exact_page_state(), since memcg_page_state() is now exact.
      
      [akpm@linux-foundation.org: merge fix]
      [hannes@cmpxchg.org: fix a sleep in atomic section problem]
        Link: https://lkml.kernel.org/r/20210315234100.64307-1-hannes@cmpxchg.org
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-7-hannes@cmpxchg.org
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: privatize memcg_page_state query functions · a18e6e6e
      Johannes Weiner authored
      There are no users outside of the memory controller itself. The rest
      of the kernel cares either about node or lruvec stats.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-4-hannes@cmpxchg.org
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: kill mem_cgroup_nodeinfo() · a3747b53
      Johannes Weiner authored
      No need to encapsulate a simple struct member access.
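
      Illustrative before/after of the pattern removed by this patch:

        /* before */
        struct mem_cgroup_per_node *pn = mem_cgroup_nodeinfo(memcg, nid);

        /* after */
        struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];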
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-3-hannes@cmpxchg.org
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: fix cpuhotplug statistics flushing · a3d4c05a
      Johannes Weiner authored
      Patch series "mm: memcontrol: switch to rstat", v3.
      
      This series converts memcg stats tracking to the streamlined rstat
      infrastructure provided by the cgroup core code.  rstat is already used by
      the CPU controller and the IO controller.  This change is motivated by
      recent accuracy problems in memcg's custom stats code, as well as the
      benefits of sharing common infra with other controllers.
      
      The current memcg implementation does batched tree aggregation on the
      write side: local stat changes are cached in per-cpu counters, which are
      then propagated upward in batches when a threshold (32 pages) is exceeded.
      This is cheap, but the error introduced by the lazy upward propagation
      adds up: 32 pages times CPUs times cgroups in the subtree.  We've had
      complaints from service owners that the stats do not reliably track and
      react to allocation behavior as expected, sometimes swallowing the results
      of entire test applications.
      
      The original memcg stat implementation used to do tree aggregation
      exclusively on the read side: local stats would only ever be tracked in
      per-cpu counters, and a memory.stat read would iterate the entire subtree
      and sum those counters up.  This didn't keep up with the times:
      
       - Cgroup trees are much bigger now. We switched to lazily-freed
         cgroups, where deleted groups would hang around until their remaining
         page cache has been reclaimed. This can result in large subtrees that
         are expensive to walk, while most of the groups are idle and their
         statistics don't change much anymore.
      
       - Automated monitoring increased. With the proliferation of userspace
         oom killing, proactive reclaim, and higher-resolution logging of
         workload trends in general, top-level stat files are polled at least
         once a second in many deployments.
      
       - The lifetime of cgroups got shorter. Where most cgroup setups in the
         past would have a few large policy-oriented cgroups for everything
         running on the system, newer cgroup deployments tend to create one
         group per application - which gets deleted again as the processes
         exit. An aggregation scheme that doesn't retain child data inside the
         parents loses event history of the subtree.
      
      Rstat addresses all three of those concerns through intelligent,
      persistent read-side aggregation.  As statistics change at the local
      level, rstat tracks - on a per-cpu basis - only those parts of a subtree
      that have changes pending and require aggregation.  The actual
      aggregation occurs on the colder read side - which can now skip over
      (potentially large) numbers of recently idle cgroups.
      
      ===
      
      The test_kmem cgroup selftest is currently failing due to excessive
      cumulative vmstat drift from 100 subgroups:
      
          ok 1 test_kmem_basic
          memory.current = 8810496
          slab + anon + file + kernel_stack = 17074568
          slab = 6101384
          anon = 946176
          file = 0
          kernel_stack = 10027008
          not ok 2 test_kmem_memcg_deletion
          ok 3 test_kmem_proc_kpagecgroup
          ok 4 test_kmem_kernel_stacks
          ok 5 test_kmem_dead_cgroups
          ok 6 test_percpu_basic
      
      As you can see, memory.stat items far exceed memory.current.  The kernel
      stack alone is bigger than all of charged memory.  That's because the
      memory of the test has been uncharged from memory.current, but the
      negative vmstat deltas are still sitting in the percpu caches.
      
      The test at this time isn't even counting percpu, pagetables etc.  yet,
      which would further contribute to the error.  The last patch in the series
      updates the test to include them - as well as reduces the vmstat
      tolerances in general to only expect page_counter batching.
      
      With all patches applied, the (now more stringent) test succeeds:
      
          ok 1 test_kmem_basic
          ok 2 test_kmem_memcg_deletion
          ok 3 test_kmem_proc_kpagecgroup
          ok 4 test_kmem_kernel_stacks
          ok 5 test_kmem_dead_cgroups
          ok 6 test_percpu_basic
      
      ===
      
      A kernel build test confirms that overhead is comparable.  Two kernels are
      built simultaneously in a nested tree with several idle siblings:
      
      root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16)
                                                   `- build-b (defconfig, make -j16)
                                                   `- idle-1
                                                   `- ...
                                                   `- idle-9
      
      During the builds, kernelbuild/memory.stat is read once a second.
      
      A perf diff shows that the change in cycle distribution is
      minimal. Top 10 kernel symbols:
      
           0.09%     +0.08%  [kernel.kallsyms]                       [k] __mod_memcg_lruvec_state
           0.00%     +0.06%  [kernel.kallsyms]                       [k] cgroup_rstat_updated
           0.08%     -0.05%  [kernel.kallsyms]                       [k] __mod_memcg_state.part.0
           0.16%     -0.04%  [kernel.kallsyms]                       [k] release_pages
           0.00%     +0.03%  [kernel.kallsyms]                       [k] __count_memcg_events
           0.01%     +0.03%  [kernel.kallsyms]                       [k] mem_cgroup_charge_statistics.constprop.0
           0.10%     -0.02%  [kernel.kallsyms]                       [k] get_mem_cgroup_from_mm
           0.05%     -0.02%  [kernel.kallsyms]                       [k] mem_cgroup_update_lru_size
           0.57%     +0.01%  [kernel.kallsyms]                       [k] asm_exc_page_fault
      
      ===
      
      The on-demand aggregated stats are now fully accurate:
      
      $ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \
        grep -e inactive_file /sys/fs/cgroup/memory.stat
      
      vanilla:                              patched:
      nr_inactive_file 1574105088           nr_inactive_file 1027801088
         inactive_file 1577410560              inactive_file 1027801088
      
      ===
      
      This patch (of 8):
      
      The memcg hotunplug callback erroneously flushes counts on the local CPU,
      not the counts of the CPU going away; those counts will be lost.
      
      Flush the CPU that is actually going away.
      
      Also simplify the code a bit by using mod_memcg_state() and
      count_memcg_events() instead of open-coding the upward flush - this is
      comparable to how vmstat.c handles hotunplug flushing.
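
      A sketch of the resulting hotunplug flush, assuming the per-cpu
      field names of that era (vmstats_percpu with stat[] and events[]
      arrays); details may differ from the actual patch:

        static int memcg_hotplug_cpu_dead(unsigned int cpu)
        {
                struct mem_cgroup *memcg;
                int i;

                /* ... drain the per-cpu charge stock of @cpu ... */

                for_each_mem_cgroup(memcg) {
                        struct memcg_vmstats_percpu *statc =
                                per_cpu_ptr(memcg->vmstats_percpu, cpu);

                        for (i = 0; i < MEMCG_NR_STAT; i++) {
                                long x = xchg(&statc->stat[i], 0);

                                if (x)
                                        mod_memcg_state(memcg, i, x);
                        }

                        for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
                                unsigned long x = xchg(&statc->events[i], 0);

                                if (x)
                                        count_memcg_events(memcg, i, x);
                        }
                }
                return 0;
        }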
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-1-hannes@cmpxchg.org
      Link: https://lkml.kernel.org/r/20210209163304.77088-2-hannes@cmpxchg.org
      Fixes: a983b5eb ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: enable memcg oom-kill for __GFP_NOFAIL · 3d0cbb98
      Shakeel Butt authored
      In the era of the asynchronous memcg oom-killer, commit a0d8b00a ("mm:
      memcg: do not declare OOM from __GFP_NOFAIL allocations") added code
      to skip the memcg oom-killer for __GFP_NOFAIL allocations.  The
      reason was that __GFP_NOFAIL callers would not enter the async oom
      synchronization path and would keep the task marked as in memcg oom.
      At that time, tasks marked in memcg oom could bypass the memcg
      limits, and the oom synchronization would only have happened later,
      in a userspace-triggered page fault, thus letting a task marked as
      under memcg oom bypass the memcg limit for an arbitrary amount of
      time.
      
      With the synchronous memcg oom-killer (commit 29ef680a ("memcg, oom:
      move out_of_memory back to the charge path")), and with tasks marked
      under memcg oom no longer allowed to bypass the memcg limits (commit
      1f14c1ac ("mm: memcg: do not allow task about to OOM kill to bypass
      the limit")), we can again allow __GFP_NOFAIL allocations to trigger
      memcg oom-kill.  This makes memcg oom behavior closer to page
      allocator oom behavior.
      
      Link: https://lkml.kernel.org/r/20210223204337.2785120-1-shakeelb@google.com
      
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: cleanup root memcg checks · a4792030
      Shakeel Butt authored
      Replace the implicit root memcg check (!css->parent) with the
      explicit mem_cgroup_is_root() helper.
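
      Illustrative before/after of the check (exact call sites vary):

        /* before: implicit, relies on the root having no parent css */
        if (!css->parent)
                return;

        /* after: explicit and self-documenting */
        if (mem_cgroup_is_root(mem_cgroup_from_css(css)))
                return;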
      
      Link: https://lkml.kernel.org/r/20210223205625.2792891-1-shakeelb@google.com
      
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page-writeback: simplify memcg handling in test_clear_page_writeback() · 1c824a68
      Johannes Weiner authored
      Page writeback doesn't hold a page reference, which allows truncate to
      free a page the second PageWriteback is cleared.  This used to require
      special attention in test_clear_page_writeback(), where we had to be
      careful not to rely on the unstable page->memcg binding and look up all
      the necessary information before clearing the writeback flag.
      
      Since commit 073861ed ("mm: fix VM_BUG_ON(PageTail) and
      BUG_ON(PageWriteback)") test_clear_page_writeback() is called with an
      explicit reference on the page, and this dance is no longer needed.
      
      Use unlock_page_memcg() and dec_lruvec_page_state() directly.
      
      This removes the last user of the lock_page_memcg() return value, so
      change it to void and touch up the comments in there as well.  It
      also removes the last extern user of __unlock_page_memcg(), so make
      it static.  Further, it removes the last user of dec_lruvec_state();
      delete it, along with a few other unused helpers.
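
      A simplified sketch of the resulting shape of
      test_clear_page_writeback(), with the surrounding mapping and
      backing-device accounting details omitted:

        lock_page_memcg(page);          /* now returns void */
        ret = TestClearPageWriteback(page);
        /* ... writeback accounting against the backing device ... */
        if (ret) {
                dec_lruvec_page_state(page, NR_WRITEBACK);
                inc_node_page_state(page, NR_WRITTEN);
        }
        unlock_page_memcg(page);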
      
      Link: https://lkml.kernel.org/r/YCQbYAWg4nvBFL6h@cmpxchg.org
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 10 Feb, 2021 1 commit
    • Revert "mm: memcontrol: avoid workload stalls when lowering memory.high" · e82553c1
      Johannes Weiner authored
      This reverts commit 536d3bf2, as it can
      cause writers to memory.high to get stuck in the kernel forever,
      performing page reclaim and consuming excessive amounts of CPU cycles.
      
      Before the patch, a write to memory.high would first put the new limit
      in place for the workload, and then reclaim the requested delta.  After
      the patch, the kernel tries to reclaim the delta before putting the new
      limit into place, in order to not overwhelm the workload with a sudden,
      large excess over the limit.  However, if reclaim is actively racing
      with new allocations from the uncurbed workload, it can keep the write()
      working inside the kernel indefinitely.
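
      A condensed sketch of the restored (pre-536d3bf2) write-side
      sequence, using the memcontrol.c helpers of that era; the reclaim
      loop details are simplified:

        /* Put the new limit in place first ... */
        page_counter_set_high(&memcg->memory, high);

        /* ... then reclaim the delta, bounded by retries and signals. */
        for (;;) {
                unsigned long nr_pages = page_counter_read(&memcg->memory);

                if (nr_pages <= high || signal_pending(current))
                        break;

                if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
                                                  GFP_KERNEL, true) &&
                    !nr_retries--)
                        break;
        }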
      
      This is causing problems in Facebook production.  A privileged
      system-level daemon that adjusts memory.high for various workloads
      running on a host can get unexpectedly stuck in the kernel and
      essentially turn into a sort of involuntary kswapd for one of the
      workloads.  We've observed such a daemon busy-spinning in a write()
      for minutes at a time, neglecting its other duties on the system,
      and expending privileged system resources on behalf of a workload.
      
      To remedy this, we first considered changing the reclaim logic to
      break out after a couple of loops - whether the workload has
      converged to the new limit or not - and bound the write() call this
      way.  However, the root cause that inspired the sequence change in
      the first place has since been fixed through other means, so a
      revert back to the proven limit-setting sequence, also used by
      memory.max, is preferable.
      
      The sequence was changed to avoid extreme latencies in the workload when
      the limit was lowered: the sudden, large excess created by the limit
      lowering would erroneously trigger the penalty sleeping code that is
      meant to throttle excessive growth from below.  Allocating threads could
      end up sleeping long after the write() had already reclaimed the delta
      for which they were being punished.
      
      However, erroneous throttling also caused problems in other scenarios at
      around the same time.  This resulted in commit b3ff9291 ("mm, memcg:
      reclaim more aggressively before high allocator throttling"), included
      in the same release as the offending commit.  When allocating threads
      now encounter large excess caused by a racing write() to memory.high,
      instead of entering punitive sleeps, they will simply be tasked with
      helping reclaim down the excess, and will be held no longer than it
      takes to accomplish that.  This is in line with regular limit
      enforcement - i.e.  if the workload allocates up against or over an
      otherwise unchanged limit from below.
      
      With the patch breaking userspace, and the root cause addressed by other
      means already, revert it again.
      
      Link: https://lkml.kernel.org/r/20210122184341.292461-1-hannes@cmpxchg.org
      Fixes: 536d3bf2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tejun Heo <tj@kernel.org>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: <stable@vger.kernel.org>	[5.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>