1. 29 Jun, 2021 40 commits
    • mm: memcontrol: fix page charging in page replacement · 8dc87c7d
      Muchun Song authored
      Patch series "memcontrol code cleanup and simplification", v3.
      
      This patch (of 8):
      
      Pages aren't accounted at the root level, so do not charge the page to
      the root memcg in page replacement.  We do not display the value
      (mem_cgroup_usage), so there shouldn't be any actual problem, but there
      is a WARN_ON_ONCE in page_counter_cancel() that could trigger.  So it is
      better to fix it.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20210417043538.9793-2-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: fix root_mem_cgroup charging · c5c8b16b
      Muchun Song authored
      The below scenario can cause the page counters of the root_mem_cgroup to
      be out of balance.
      
      CPU0:                                   CPU1:
      
      objcg = get_obj_cgroup_from_current()
      obj_cgroup_charge_pages(objcg)
                                              memcg_reparent_objcgs()
                                                  // reparent to root_mem_cgroup
                                                  WRITE_ONCE(iter->memcg, parent)
          // memcg == root_mem_cgroup
          memcg = get_mem_cgroup_from_objcg(objcg)
          // do not charge to the root_mem_cgroup
          try_charge(memcg)
      
      obj_cgroup_uncharge_pages(objcg)
          memcg = get_mem_cgroup_from_objcg(objcg)
          // uncharge from the root_mem_cgroup
          refill_stock(memcg)
              drain_stock(memcg)
                  page_counter_uncharge(&memcg->memory)
      
      get_obj_cgroup_from_current() never returns a root_mem_cgroup's objcg, so
      we never explicitly charge the root_mem_cgroup.  And it's not going to
      change.  It's all about a race: we get an obj_cgroup pointing at some
      non-root memcg, but before we are able to charge it, the cgroup is gone
      and the objcg is reparented to the root, so we skip the charging.  Then
      we store the objcg pointer and later use it to uncharge the
      root_mem_cgroup.

      This can cause the page counter to be less than the actual value.  We do
      not display the value (mem_cgroup_usage), so there shouldn't be any
      actual problem, but there is a WARN_ON_ONCE in page_counter_cancel() that
      could trigger.  So it is better to fix it.
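
      The imbalance can be reproduced in a userspace toy model.  This is only
      a sketch: struct toy_memcg and every toy_* function are hypothetical
      stand-ins, not kernel code, and toy_charge_always() is an assumed
      alternative shown for contrast rather than the change this commit makes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a memcg with a single page counter. */
struct toy_memcg {
	bool is_root;
	long usage;	/* pages currently charged */
};

/* Charge path that skips the root group, as in the scenario above. */
static void toy_charge_skip_root(struct toy_memcg *memcg, long nr_pages)
{
	assert(nr_pages > 0);
	if (memcg->is_root)
		return;
	memcg->usage += nr_pages;
}

/* Assumed alternative: a charge path that always updates the counter. */
static void toy_charge_always(struct toy_memcg *memcg, long nr_pages)
{
	assert(nr_pages > 0);
	memcg->usage += nr_pages;
}

/* Uncharge path: updates the counter whether or not memcg is the root. */
static void toy_uncharge(struct toy_memcg *memcg, long nr_pages)
{
	assert(nr_pages > 0);
	memcg->usage -= nr_pages;
}

/*
 * One charge/uncharge pair issued after the objcg was reparented, so both
 * lookups resolve to the root.  Returns the root usage afterwards.
 */
static long toy_usage_after_race(bool skip_root_on_charge)
{
	struct toy_memcg root = { .is_root = true, .usage = 0 };

	if (skip_root_on_charge)
		toy_charge_skip_root(&root, 1);
	else
		toy_charge_always(&root, 1);
	toy_uncharge(&root, 1);
	return root.usage;
}
```

      With the skipping charge path the toy counter ends one page below zero,
      which is the kind of underflow the WARN_ON_ONCE guards against.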
      
      Link: https://lkml.kernel.org/r/20210425075410.19255-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: disable cache merging for KMALLOC_NORMAL caches · 13e680fb
      Waiman Long authored
      The KMALLOC_NORMAL (kmalloc-<n>) caches are for unaccounted objects only
      when CONFIG_MEMCG_KMEM is enabled.  To make sure that this condition
      remains true, we have to prevent KMALLOC_NORMAL caches from merging with
      other kmem caches.  This is now done by setting their refcount to -1
      right after their creation.
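
      A minimal sketch of the idea, assuming a merge check that treats a
      negative refcount as "never merge".  struct toy_cache and both helpers
      are hypothetical and far simpler than the real slab code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a slab cache. */
struct toy_cache {
	const char *name;
	size_t object_size;
	int refcount;	/* negative means "never merge" in this sketch */
};

/* Mark a freshly created cache as unmergeable. */
static void toy_disable_merging(struct toy_cache *s)
{
	assert(s != NULL);
	s->refcount = -1;
}

/* A new cache of the given size may only alias s if s allows merging. */
static bool toy_can_merge(const struct toy_cache *s, size_t size)
{
	assert(s != NULL);
	if (s->refcount < 0)
		return false;
	return s->object_size == size;
}
```

      After toy_disable_merging(), a same-sized cache can no longer be folded
      into the marked one, so its objects stay unaccounted-only.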
      
      Link: https://lkml.kernel.org/r/20210505200610.13943-4-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Suggested-by: Roman Gushchin <guro@fb.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: create a new set of kmalloc-cg-<n> caches · 494c1dfe
      Waiman Long authored
      There are currently two problems in the way the objcg pointer array
      (memcg_data) in the page structure is being allocated and freed.
      
      On its allocation, it is possible that the allocated objcg pointer
      array comes from the same slab that requires memory accounting. If this
      happens, the slab will never become empty again as there is at least
      one object left (the obj_cgroup array) in the slab.
      
      When it is freed, the objcg pointer array object may be the last one
      in its slab and hence causes kfree() to be called again. With the
      right workload, the slab cache may be set up in a way that allows the
      recursive kfree() calling loop to nest deep enough to cause a kernel
      stack overflow and panic the system.
      
      One way to solve this problem is to split the kmalloc-<n> caches
      (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n>
      (KMALLOC_NORMAL) caches for unaccounted objects only and a new set of
      kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All
      the other caches can still allow a mix of accounted and unaccounted
      objects.
      
      With this change, all the objcg pointer array objects will come from
      KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So
      both the recursive kfree() problem and non-freeable slab problem are
      gone.
      
      Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have
      mixed accounted and unaccounted objects, this will slightly reduce the
      number of objcg pointer arrays that need to be allocated and save a bit
      of memory. On the other hand, creating a new set of kmalloc caches does
      have the effect of reducing cache utilization. So it is probably a wash.
      
      The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and
      KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches()
      will include the newly added caches without change.
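
      The ordering argument can be sketched as follows.  The flag values, the
      toy_* names and the priority order used when several bits are set are
      assumptions made for illustration; only the relative position of the
      CGROUP type between NORMAL and RECLAIM comes from the text above.

```c
#include <assert.h>

/* Hypothetical flag bits; the values are not the kernel's. */
#define TOY_GFP_DMA		0x1u
#define TOY_GFP_RECLAIMABLE	0x2u
#define TOY_GFP_ACCOUNT		0x4u

enum toy_kmalloc_type {
	TOY_KMALLOC_NORMAL = 0,
	TOY_KMALLOC_CGROUP,	/* placed between NORMAL and RECLAIM */
	TOY_KMALLOC_RECLAIM,
	TOY_KMALLOC_DMA,
	TOY_NR_KMALLOC_TYPES
};

/* Assumed priority when several bits are set: DMA, reclaimable, accounted. */
static enum toy_kmalloc_type toy_pick_type(unsigned int flags)
{
	if (flags & TOY_GFP_DMA)
		return TOY_KMALLOC_DMA;
	if (flags & TOY_GFP_RECLAIMABLE)
		return TOY_KMALLOC_RECLAIM;
	if (flags & TOY_GFP_ACCOUNT)
		return TOY_KMALLOC_CGROUP;
	return TOY_KMALLOC_NORMAL;
}

/* Counts the types covered by a loop running from NORMAL through RECLAIM. */
static int toy_types_in_first_loop(void)
{
	int n = 0;

	assert(TOY_KMALLOC_NORMAL < TOY_KMALLOC_CGROUP &&
	       TOY_KMALLOC_CGROUP < TOY_KMALLOC_RECLAIM);
	for (int type = TOY_KMALLOC_NORMAL; type <= TOY_KMALLOC_RECLAIM; type++)
		n++;
	return n;
}
```

      Because the new type sits inside the existing NORMAL..RECLAIM range, a
      loop with unchanged bounds picks it up automatically.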
      
      [vbabka@suse.cz: don't create kmalloc-cg caches with cgroup.memory=nokmem]
        Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com
      [akpm@linux-foundation.org: un-fat-finger v5 delta creation]
      [longman@redhat.com: disable cache merging for KMALLOC_NORMAL caches]
        Link: https://lkml.kernel.org/r/20210505200610.13943-4-longman@redhat.com
      
      Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com
      Link: https://lkml.kernel.org/r/20210505200610.13943-3-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      [longman@redhat.com: fix for CONFIG_ZONE_DMA=n]
      Suggested-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: properly set up gfp flags for objcg pointer array · 41eb5df1
      Waiman Long authored
      Patch series "mm: memcg/slab: Fix objcg pointer array handling problem", v4.
      
      Since the merging of the new slab memory controller in v5.9, the page
      structure stores a pointer to objcg pointer array for slab pages.  When
      the slab has no used objects, it can be freed in free_slab() which will
      call kfree() to free the objcg pointer array in
      memcg_alloc_page_obj_cgroups().  If it happens that the objcg pointer
      array is the last used object in its slab, that slab may then be freed
      which may cause kfree() to be called again.
      
      With the right workload, the slab cache may be set up in a way that allows
      the recursive kfree() calling loop to nest deep enough to cause a kernel
      stack overflow and panic the system.  In fact, we have a reproducer that
      can cause kernel stack overflow on a s390 system involving kmalloc-rcl-256
      and kmalloc-rcl-128 slabs with the following kfree() loop recursively
      called 74 times:
      
        [ 285.520739] [<000000000ec432fc>] kfree+0x4bc/0x560
        [ 285.520740] [<000000000ec43466>] __free_slab+0xc6/0x228
        [ 285.520741] [<000000000ec41fc2>] __slab_free+0x3c2/0x3e0
        [ 285.520742] [<000000000ec432fc>] kfree+0x4bc/0x560
        :

      While investigating this issue, I also found an issue on the allocation
      side.  If the objcg pointer array happens to come from the same slab or a
      circular dependency linkage is formed with multiple slabs, those affected
      slabs can never be freed again.
      
      This patch series addresses these two issues by introducing a new set of
      kmalloc-cg-<n> caches split from kmalloc-<n> caches.  The new set will
      only contain non-reclaimable and non-dma objects that are accounted in
      memory cgroups whereas the old set are now for unaccounted objects only.
      By making this split, all the objcg pointer arrays will come from the
      kmalloc-<n> caches, but those caches will never hold any objcg pointer
      array.  As a result, deeply nested kfree() call and the unfreeable slab
      problems are now gone.
      
      This patch (of 4):
      
      Since the merging of the new slab memory controller in v5.9, the page
      structure may store a pointer to obj_cgroup pointer array for slab pages.
      Currently, only the __GFP_ACCOUNT bit is masked off.  However, the array
      is not readily reclaimable and doesn't need to come from the DMA buffer.
      So those GFP bits should be masked off as well.
      
      Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure
      that it is consistently applied no matter where it is called.
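
      A sketch of the flag clearing, assuming toy bit values.  The TOY_*
      constants and toy_objcg_array_gfp() are hypothetical names; only the
      set of bits being cleared (accounting, reclaimable, DMA) comes from the
      description above.

```c
#include <assert.h>

/* Hypothetical flag bits; the values are not the kernel's. */
#define TOY_GFP_DMA		0x01u
#define TOY_GFP_RECLAIMABLE	0x02u
#define TOY_GFP_ACCOUNT		0x04u
#define TOY_GFP_OTHER		0x08u	/* stands in for any unrelated bit */

/* Bits that make no sense for the objcg pointer array allocation. */
#define TOY_OBJCGS_CLEAR_MASK \
	(TOY_GFP_DMA | TOY_GFP_RECLAIMABLE | TOY_GFP_ACCOUNT)

/* Derive the flags for the array from the flags of the slab allocation. */
static unsigned int toy_objcg_array_gfp(unsigned int gfp)
{
	unsigned int out = gfp & ~TOY_OBJCGS_CLEAR_MASK;

	assert((out & TOY_OBJCGS_CLEAR_MASK) == 0);
	return out;
}
```

      Doing the masking in one helper means every caller gets the same
      treatment, which is the point of clearing the bits at allocation time.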
      
      Link: https://lkml.kernel.org/r/20210505200610.13943-1-longman@redhat.com
      Link: https://lkml.kernel.org/r/20210505200610.13943-2-longman@redhat.com
      Fixes: 286e04b8 ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: optimize user context object stock access · 55927114
      Waiman Long authored
      Most kmem_cache_alloc() calls are from user context.  With instrumentation
      enabled, the measured amount of kmem_cache_alloc() calls from non-task
      context was about 0.01% of the total.
      
      The irq disable/enable sequence used in this case to access content from
      object stock is slow.  To optimize for user context access, there are now
      two sets of object stocks (in the new obj_stock structure) for task
      context and interrupt context access respectively.
      
      The task context object stock can be accessed after disabling preemption,
      which is cheap in a non-preempt kernel.  The interrupt context object
      stock can only be accessed after disabling interrupts.  User context code
      can access the interrupt object stock, but not vice versa.

      The downside of this change is that more data is stored in local
      object stocks and not reflected in the charge counter and the vmstat
      arrays.  However, this is a small price to pay for better performance.
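
      The selection between the two stocks can be sketched in userspace as
      below.  All names are hypothetical, and the two counters merely stand
      in for the preemption-disable and interrupt-disable steps, which a
      userspace program cannot perform.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-context object stock. */
struct toy_obj_stock {
	unsigned int nr_bytes;
};

/* Hypothetical per-CPU structure holding one stock per context. */
struct toy_stock_pcp {
	struct toy_obj_stock task_obj;
	struct toy_obj_stock irq_obj;
	unsigned int preempt_disables;	/* cheap path taken */
	unsigned int irq_disables;	/* expensive path taken */
};

/* Pick the stock for the current context and record the cost paid. */
static struct toy_obj_stock *toy_get_obj_stock(struct toy_stock_pcp *pcp,
					       bool in_task)
{
	assert(pcp);
	if (in_task) {
		pcp->preempt_disables++;
		return &pcp->task_obj;
	}
	pcp->irq_disables++;
	return &pcp->irq_obj;
}
```

      If almost all callers are in task context, as the measurement above
      suggests, nearly every access takes the cheaper branch.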
      
      [longman@redhat.com: fix potential uninitialized variable warning]
        Link: https://lkml.kernel.org/r/20210526193602.8742-1-longman@redhat.com
      [akpm@linux-foundation.org: coding style fixes]
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-5-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: improve refill_obj_stock() performance · 5387c904
      Waiman Long authored
      There are two issues with the current refill_obj_stock() code.  First of
      all, when nr_bytes exceeds PAGE_SIZE, it calls drain_obj_stock() to
      atomically flush out the remaining bytes to obj_cgroup, clear cached_objcg
      and do an obj_cgroup_put().  It is likely that the same obj_cgroup will be
      used again, which leads to another call to drain_obj_stock() and
      obj_cgroup_get() as well as atomically retrieving the available bytes from
      obj_cgroup.  That is costly.  Instead, we should just uncharge the excess
      pages, reduce the stock bytes and be done with it.  The drain_obj_stock()
      function should only be called when obj_cgroup changes.
      
      Secondly, when charging an object of size not less than a page in
      obj_cgroup_charge(), it is possible that the remaining bytes to be
      refilled to the stock will overflow a page and cause refill_obj_stock() to
      uncharge 1 page.  To avoid the additional uncharge in this case, a new
      allow_uncharge flag is added to refill_obj_stock() which will be set to
      false when called from obj_cgroup_charge() so that an uncharge_pages()
      call won't be issued right after a charge_pages() call unless the objcg
      changes.
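
      A userspace sketch of the "uncharge the excess pages and keep the
      remainder" step, assuming a 4096-byte page.  The toy_* names are
      hypothetical; the function returns the number of whole pages the caller
      would uncharge instead of performing the uncharge itself.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SHIFT	12
#define TOY_PAGE_SIZE	(1u << TOY_PAGE_SHIFT)	/* assumed 4096-byte page */

/* Hypothetical per-CPU byte stock for one obj_cgroup. */
struct toy_refill_stock {
	unsigned int nr_bytes;
};

/*
 * Add nr_bytes to the stock.  When uncharging is allowed and the stock
 * exceeds one page, hand back the whole pages and keep the remainder.
 */
static unsigned int toy_refill(struct toy_refill_stock *stock,
			       unsigned int nr_bytes, bool allow_uncharge)
{
	unsigned int nr_pages = 0;

	assert(stock);
	stock->nr_bytes += nr_bytes;
	if (allow_uncharge && stock->nr_bytes > TOY_PAGE_SIZE) {
		nr_pages = stock->nr_bytes >> TOY_PAGE_SHIFT;
		stock->nr_bytes &= TOY_PAGE_SIZE - 1;
	}
	return nr_pages;
}
```

      With allow_uncharge set to false, as on the charge path described
      above, the bytes simply stay in the stock and no uncharge follows the
      charge.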
      
      A multithreaded kmalloc+kfree microbenchmark was run on a 2-socket 48-core
      96-thread x86-64 system with 96 testing threads.  Before this patch, the
      total number of kmalloc+kfree operations done for a 4k large object by
      all the testing threads was 4,304 kops/s (cgroup v1) and 8,478 kops/s
      (cgroup v2).  After applying this patch, the numbers were 4,731 kops/s
      (cgroup v1) and 418,142 kops/s (cgroup v2) respectively.  This represents
      a performance improvement of 1.10X (cgroup v1) and 49.3X (cgroup v2).
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-4-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: cache vmstat data in percpu memcg_stock_pcp · 68ac5b3c
      Waiman Long authored
      Before the new slab memory controller with per object byte charging,
      charging and vmstat data updates happened only when new slab pages were
      allocated or freed.  Now they are done with every kmem_cache_alloc() and
      kmem_cache_free().  This causes additional overhead for workloads that
      generate a lot of alloc and free calls.

      The memcg_stock_pcp is used to cache byte charge for a specific obj_cgroup
      to reduce that overhead.  To further reduce it, this patch caches the
      vmstat data in the memcg_stock_pcp structure as well until it
      accumulates a page size worth of updates or other cached data changes.
      Caching the vmstat data in the per-cpu stock replaces two writes to
      non-hot cachelines for memcg specific as well as memcg-lruvecs specific
      vmstat data with a write to a hot local stock cacheline.
      
      On a 2-socket Cascade Lake server with instrumentation enabled and this
      patch applied, it was found that about 20% (634400 out of 3243830) of the
      calls to mod_objcg_state() led to an actual call to __mod_objcg_state()
      after initial boot.  When doing a parallel kernel build, the figure was
      about 17% (24329265 out of 142512465).  So caching the vmstat data
      reduces the number of calls to __mod_objcg_state() by more than 80%.
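
      The batching behaviour can be sketched as follows.  This toy only
      models the "flush once more than a page worth of updates has
      accumulated" condition; flushing when other cached data changes is left
      out.  All names are hypothetical and the 4096-byte page is an
      assumption.

```c
#include <assert.h>

#define TOY_VMSTAT_PAGE_SIZE 4096L	/* assumed page size */

/* Hypothetical per-CPU cache of pending vmstat byte updates. */
struct toy_vmstat_cache {
	long cached;		/* bytes not yet pushed to the shared counters */
	long flushed_total;	/* bytes pushed so far */
	unsigned long flushes;	/* number of expensive shared updates */
};

/* Record one update locally and flush only past the page-size threshold. */
static void toy_mod_state(struct toy_vmstat_cache *c, long nr)
{
	assert(c);
	c->cached += nr;
	if (c->cached > TOY_VMSTAT_PAGE_SIZE ||
	    c->cached < -TOY_VMSTAT_PAGE_SIZE) {
		c->flushed_total += c->cached;
		c->flushes++;
		c->cached = 0;
	}
}

/* Apply the same update n times, as a tight alloc loop would. */
static void toy_apply_updates(struct toy_vmstat_cache *c, int n, long nr)
{
	for (int i = 0; i < n; i++)
		toy_mod_state(c, nr);
}
```

      In this model, 100 updates of 64 bytes each cause a single shared
      update rather than 100.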
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-3-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: move mod_objcg_state() to memcontrol.c · fdbcb2a6
      Waiman Long authored
      Patch series "mm/memcg: Reduce kmemcache memory accounting overhead", v6.
      
      With the recent introduction of the new slab memory controller, we
      eliminate the need for having separate kmemcaches for each memory cgroup
      and reduce overall kernel memory usage.  However, we also add additional
      memory accounting overhead to each call of kmem_cache_alloc() and
      kmem_cache_free().
      
      Workloads that require a lot of kmemcache allocations and
      de-allocations may experience performance regression as illustrated
      in [1] and [2].

      A simple kernel module that performs a repeated loop of 100,000,000
      kmem_cache_alloc() and kmem_cache_free() calls of either a small 32-byte
      object or a big 4k object at module init time with a batch size of 4 (4
      kmalloc's followed by 4 kfree's) is used for benchmarking.  The
      benchmarking tool was run on a kernel based on linux-next-20210419.  The
      test was run on a CascadeLake server with turbo-boosting disabled to
      reduce run-to-run variation.
      
      The small object test exercises mainly the object stock charging and
      vmstat update code paths.  The large object test also exercises the
      refill_obj_stock() and __memcg_kmem_charge()/__memcg_kmem_uncharge() code
      paths.
      
      With memory accounting disabled, the run time was 3.130s for both the
      small object and big object tests.
      
      With memory accounting enabled, both cgroup v1 and v2 showed similar
      results in the small object test.  The performance results of the large
      object test, however, differed between cgroup v1 and v2.
      
      The execution times with the application of various patches in the
      patchset were:
      
        Applied patches   Run time   Accounting overhead   %age 1   %age 2
        ---------------   --------   -------------------   ------   ------
      
        Small 32-byte object:
             None          11.634s         8.504s          100.0%   271.7%
              1-2           9.425s         6.295s           74.0%   201.1%
              1-3           9.708s         6.578s           77.4%   210.2%
              1-4           8.062s         4.932s           58.0%   157.6%
      
        Large 4k object (v2):
             None          22.107s        18.977s          100.0%   606.3%
              1-2          20.960s        17.830s           94.0%   569.6%
              1-3          14.238s        11.108s           58.5%   354.9%
              1-4          11.329s         8.199s           43.2%   261.9%
      
        Large 4k object (v1):
             None          36.807s        33.677s          100.0%  1075.9%
              1-2          36.648s        33.518s           99.5%  1070.9%
              1-3          22.345s        19.215s           57.1%   613.9%
              1-4          18.662s        15.532s           46.1%   496.2%
      
        N.B. %age 1 = overhead/unpatched overhead
             %age 2 = overhead/accounting disabled time
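
      The derived columns follow directly from the run times.  A small sketch
      of the arithmetic, with hypothetical function names, taking the 3.130s
      accounting-disabled run time as the baseline:

```c
#include <assert.h>

/* Accounting overhead: run time minus the accounting-disabled run time. */
static double toy_overhead(double run_time, double baseline)
{
	assert(baseline > 0.0);
	return run_time - baseline;
}

/* %age 1: overhead relative to the unpatched kernel's overhead. */
static double toy_pct_of_unpatched(double run_time, double unpatched_run_time,
				   double baseline)
{
	return 100.0 * toy_overhead(run_time, baseline) /
	       toy_overhead(unpatched_run_time, baseline);
}

/* %age 2: overhead relative to the accounting-disabled run time. */
static double toy_pct_of_baseline(double run_time, double baseline)
{
	return 100.0 * toy_overhead(run_time, baseline) / baseline;
}
```

      For the small object row with patches 1-2 applied, this gives an
      overhead of 9.425 - 3.130 = 6.295s, which is 6.295 / 8.504 = 74.0% of
      the unpatched overhead and 6.295 / 3.130 = 201.1% of the baseline.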
      
      Patch 2 (vmstat data stock caching) helps in both the small object test
      and the large v2 object test. It doesn't help much in the v1 big object
      test.

      Patch 3 (refill_obj_stock improvement) does not help the small object
      test but offers a significant performance improvement for the large
      object test (both v1 and v2).
      
      Patch 4 (eliminating irq disable/enable) helps in all test cases.
      
      To test for the extreme case, a multi-threaded kmalloc/kfree
      microbenchmark was run on the 2-socket 48-core 96-thread system with
      96 testing threads in the same memcg doing kmalloc+kfree of a 4k object
      with accounting enabled for 10s. The total number of kmalloc+kfree done
      in kilo operations per second (kops/s) were as follows:
      
        Applied patches   v1 kops/s   v1 change   v2 kops/s   v2 change
        ---------------   ---------   ---------   ---------   ---------
             None           3,520        1.00X      6,242        1.00X
              1-2           4,304        1.22X      8,478        1.36X
              1-3           4,731        1.34X    418,142       66.99X
              1-4           4,587        1.30X    438,838       70.30X
      
      With memory accounting disabled, the kmalloc/kfree rate was 1,481,291
      kops/s. This test shows how significant the memory accounting overhead
      can be in some extreme situations.
      
      For this multithreaded test, the improvement from patch 2 mainly
      comes from the conditional atomic xchg of objcg->nr_charged_bytes in
      mod_objcg_state(). By using an unconditional xchg, the operation rates
      were similar to the unpatched kernel.
      
      Patch 3 eliminates the single highly contended cacheline of
      objcg->nr_charged_bytes for cgroup v2 leading to a huge performance
      improvement. Cgroup v1, however, still has another highly contended
      cacheline in the shared page counter &memcg->kmem. So the improvement
      is only modest.
      
      Patch 4 helps in cgroup v2, but performs worse in cgroup v1 as
      eliminating the irq_disable/irq_enable overhead seems to aggravate the
      cacheline contention.
      
      [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
      [2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/
      
      This patch (of 4):
      
      mod_objcg_state() is moved from mm/slab.h to mm/memcontrol.c so that
      further optimization can be done to it in later patches without exposing
      unnecessary details to other mm components.
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-1-longman@redhat.com
      Link: https://lkml.kernel.org/r/20210506150007.16288-2-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: check mapping_empty() for swap cache before being freed · eea4a501
      Huang Ying authored
      Check that all pages and shadow entries in the swap cache have been
      removed before the swap cache is freed.
      
      Link: https://lkml.kernel.org/r/20210608005121.511140-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: free idle swap cache page after COW · f4c4a3f4
      Huang Ying authored
      With commit 09854ba9 ("mm: do_wp_page() simplification"), after COW,
      the idle swap cache page (neither the page nor the corresponding swap
      entry is mapped by any process) will be left in the LRU list, even if it's
      in the active list or at the head of the inactive list.  So, the page
      reclaimer may incur considerable overhead to reclaim these actually
      unused pages.

      To help page reclaim, this patch tries to free the idle swap cache page
      after COW.  To avoid introducing much overhead to the hot COW code
      path,
      
      a) there's almost zero overhead for the non-swap case, because
         PageSwapCache() is checked first.
      
      b) the page lock is acquired via trylock only.
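
      The two safeguards can be sketched in a userspace toy.  struct toy_page
      and the helpers below are hypothetical; the real code checks more
      conditions than this model does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced view of a page. */
struct toy_page {
	bool in_swap_cache;
	bool locked;
	bool freed_from_swap_cache;
};

/* Non-blocking lock attempt: fails instead of waiting. */
static bool toy_trylock_page(struct toy_page *page)
{
	if (page->locked)
		return false;
	page->locked = true;
	return true;
}

/* Try to drop an idle swap cache page without slowing the hot path. */
static bool toy_free_idle_swap_cache(struct toy_page *page)
{
	assert(page);
	if (!page->in_swap_cache)	/* a) cheap test first */
		return false;
	if (!toy_trylock_page(page))	/* b) never wait for the lock */
		return false;
	page->in_swap_cache = false;
	page->freed_from_swap_cache = true;
	page->locked = false;
	return true;
}
```

      A page that is not in the swap cache costs one flag test, and a page
      whose lock is contended is simply left for the reclaimer.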
      
      To test the patch, we used the pmbench memory accessing benchmark with a
      working set larger than available memory on a 2-socket Intel server with
      an NVMe SSD as the swap device.  Test results show that the pmbench score
      increases by up to 23.8%, with decreased swap cache size and swapin
      throughput.
      
      Link: https://lkml.kernel.org/r/20210601053143.1380078-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>	[use free_swap_cache()]
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, swap: remove unnecessary smp_rmb() in swap_type_to_swap_info() · a4b45114
      Huang Ying authored
      Before commit c10d38cc ("mm, swap: bounds check swap_info array
      accesses to avoid NULL derefs"), the typical code to reference the
      swap_info[] is as follows,
      
        type = swp_type(swp_entry);
        if (type >= nr_swapfiles)
                /* handle invalid swp_entry */;
        p = swap_info[type];
        /* access fields of *p.  OOPS! p may be NULL! */
      
      Because the ordering isn't guaranteed, it's possible that swap_info[type]
      is read before "nr_swapfiles".  And that may result in NULL pointer
      dereference.
      
      So after commit c10d38cc, the code becomes,
      
        struct swap_info_struct *swap_type_to_swap_info(int type)
        {
                if (type >= READ_ONCE(nr_swapfiles))
                        return NULL;
                smp_rmb();
                return READ_ONCE(swap_info[type]);
        }
      
        /* users */
        type = swp_type(swp_entry);
        p = swap_type_to_swap_info(type);
        if (!p)
                /* handle invalid swp_entry */;
        /* dereference p */
      
      Here the value of swap_info[type] (that is, "p") is checked to be
      non-zero before being dereferenced.  So a NULL dereference becomes
      impossible even if "nr_swapfiles" is read after swap_info[type].
      Therefore, the "smp_rmb()" becomes unnecessary.
      
      And we don't even need to read "nr_swapfiles" here, because the non-zero
      check for "p" is sufficient.  We just need to make sure we will not
      access beyond the bounds of the array.  With the change, nr_swapfiles
      will only be accessed with swap_lock held, except in
      swapcache_free_entries(), where the absolute correctness of the value
      isn't needed, as described in the comments.
      
      We still need to guarantee swap_info[type] is read before being
      dereferenced.  That can be satisfied via the data dependency ordering
      enforced by READ_ONCE(swap_info[type]).  This needs to be paired with
      proper write barriers.  So smp_store_release() is used in
      alloc_swap_info() to guarantee the fields of *swap_info[type] are
      initialized before swap_info[type] itself is written.  Note that the
      fields of *swap_info[type] are first initialized to 0 via kvzalloc().
      The assignment and dereferencing of swap_info[type] are like
      rcu_assign_pointer() and rcu_dereference().
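
As a rough user-space analogy (not the kernel code), the publish/read pairing described above can be sketched with C11 atomics. The names `slot_info`, `slots`, `publish_slot()` and `lookup_slot()` are hypothetical, `MAX_SLOTS` stands in for the array bound, and `memory_order_acquire` is used on the read side as a conservative substitute for the dependency ordering that READ_ONCE() relies on in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_SLOTS 4	/* hypothetical array bound */

struct slot_info {
	int prio;	/* example field; must be set up before publication */
};

/* Slots start out NULL; a non-NULL pointer means "fully initialized". */
static _Atomic(struct slot_info *) slots[MAX_SLOTS];

/* Writer: initialize the object, then publish the pointer with release. */
static int publish_slot(int type, int prio)
{
	struct slot_info *p;

	if (type < 0 || type >= MAX_SLOTS)
		return -1;
	p = calloc(1, sizeof(*p));	/* zeroed first, like kvzalloc() */
	if (!p)
		return -1;
	p->prio = prio;
	atomic_store_explicit(&slots[type], p, memory_order_release);
	return 0;
}

/* Reader: only the array bound is checked; NULL means "not valid". */
static struct slot_info *lookup_slot(int type)
{
	if (type < 0 || type >= MAX_SLOTS)
		return NULL;
	return atomic_load_explicit(&slots[type], memory_order_acquire);
}
```

The reader never consults a separate element count: a non-NULL result already implies that the fields written before the release store are visible.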
      
      Link: https://lkml.kernel.org/r/20210520073301.1676294-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Miaohe Lin's avatar
      mm/swap_slots.c: delete meaningless forward declarations · 1cfcc830
      Miaohe Lin authored
      deactivate_swap_slots_cache() and reactivate_swap_slots_cache() are only
      called below their implementations.  So these forward declarations are
      meaningless and should be removed.
      
      Link: https://lkml.kernel.org/r/20210520134022.1370406-4-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Miaohe Lin's avatar
      mm/swap: remove unused local variable nr_shadows · eb7709c5
      Miaohe Lin authored
      Since commit 55c653b71e8c ("mm: stop accounting shadow entries"),
      nr_shadows is not used anymore.
      
      Link: https://lkml.kernel.org/r/20210520134022.1370406-3-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Miaohe Lin's avatar
      mm/swapfile: move get_swap_page_of_type() under CONFIG_HIBERNATION · bb243f7d
      Miaohe Lin authored
      Patch series "Cleanups for swap", v2.
      
      This series contains just cleanups to remove some unused variables, delete
      meaningless forward declarations and so on.  More details can be found in
      the respective changelogs.
      
      This patch (of 4):
      
      We should move get_swap_page_of_type() under CONFIG_HIBERNATION since the
      only caller of this function is now the suspend routine.
      
      [linmiaohe@huawei.com: move scan_swap_map() under CONFIG_HIBERNATION]
        Link: https://lkml.kernel.org/r/20210521070855.2015094-1-linmiaohe@huawei.com
      [linmiaohe@huawei.com: fold scan_swap_map() into the only caller get_swap_page_of_type()]
        Link: https://lkml.kernel.org/r/20210527120328.3935132-1-linmiaohe@huawei.com
      
      Link: https://lkml.kernel.org/r/20210520134022.1370406-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210520134022.1370406-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Miaohe Lin's avatar
      mm/shmem: fix shmem_swapin() race with swapoff · 2efa33fc
      Miaohe Lin authored
      When I was investigating the swap code, I found the below possible race
      window:
      
      CPU 1                                         CPU 2
      -----                                         -----
      shmem_swapin
        swap_cluster_readahead
          if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
                                                    swapoff
                                                      ..
                                                      si->swap_file = NULL;
                                                      ..
          struct inode *inode = si->swap_file->f_mapping->host;[oops!]
      
      Close this race window by using get/put_swap_device() to guard against
      concurrent swapoff.
      
      Link: https://lkml.kernel.org/r/20210426123316.806267-5-linmiaohe@huawei.com
      Fixes: 8fd2e0b5 ("mm: swap: check if swap backing device is congested or not")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Miaohe Lin's avatar
      mm/swap: remove confusing checking for non_swap_entry() in swap_ra_info() · 5c046235
      Miaohe Lin authored
      The non_swap_entry() check was added for VMA based swap readahead by
      commit ec560175 ("mm, swap: VMA based swap readahead").  At that time,
      the non_swap_entry() check was necessary because the function was
      called before the corresponding check in do_swap_page().  It was then
      moved to swap_ra_info() by commit eaf649eb ("mm: swap: clean up swap
      readahead").  After that, the non_swap_entry() check is unnecessary,
      because swap_ra_info() is called after non_swap_entry() has already
      been checked.  The resulting code is confusing because the
      non_swap_entry() check now looks racy: while we released the pte lock,
      somebody else might have faulted in this pte, so it seems we should
      check whether it is a swap pte first to guard against such a race, or
      swap_type will be unexpected.  But the race isn't important because it
      will not cause a problem; we have enough checks when we actually
      operate on the PTE entries later.  So remove the non_swap_entry() check
      here to avoid confusion.
      
      Link: https://lkml.kernel.org/r/20210426123316.806267-4-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Miaohe Lin's avatar
      swap: fix do_swap_page() race with swapoff · 2799e775
      Miaohe Lin authored
      When I was investigating the swap code, I found the below possible race
      window:
      
      CPU 1                                   	CPU 2
      -----                                   	-----
      do_swap_page
        if (data_race(si->flags & SWP_SYNCHRONOUS_IO)
        swap_readpage
          if (data_race(sis->flags & SWP_FS_OPS)) {
                                              	swapoff
      					  	  ..
      					  	  p->swap_file = NULL;
      					  	  ..
          struct file *swap_file = sis->swap_file;
          struct address_space *mapping = swap_file->f_mapping;[oops!]
      
      Note that for the pages that are swapped in through swap cache, this isn't
      an issue. Because the page is locked, and the swap entry will be marked
      with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been
      unlocked.
      
      Fix this race by using get/put_swap_device() to guard against concurrent
      swapoff.
      
      Link: https://lkml.kernel.org/r/20210426123316.806267-3-linmiaohe@huawei.com
      Fixes: 0bcac06f ("mm,swap: skip swapcache for swapin of synchronous device")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Miaohe Lin's avatar
      mm/swapfile: use percpu_ref to serialize against concurrent swapoff · 63d8620e
      Miaohe Lin authored
      Patch series "close various race windows for swap", v6.
      
      When I was investigating the swap code, I found some possible race
      windows.  This series aims to fix all these races.  But using the
      current get/put_swap_device() to guard against concurrent swapoff for
      swap_readpage() looks terrible because swap_readpage() may take a really
      long time.  To reduce the performance overhead on the hot path as much
      as possible, it appears we can use percpu_ref to close this race
      window (as suggested by Huang, Ying).  Patch 1 adds percpu_ref support
      for swap, and most of the remaining patches use it to close various
      race windows.  More details can be found in the respective
      changelogs.
      
      This patch (of 4):
      
      Using the current get/put_swap_device() to guard against concurrent
      swapoff for some swap ops, e.g. swap_readpage(), looks terrible because
      they might take a really long time.  This patch adds percpu_ref support
      to serialize against concurrent swapoff (as suggested by Huang, Ying).
      We also remove the SWP_VALID flag because it is used together with the
      RCU solution.
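
The get/put idea can be illustrated with a simplified user-space sketch. It is an assumption-laden model, not the kernel's percpu_ref: a single shared atomic counter replaces the per-CPU counters, and `struct dev_ref`, `dev_tryget_live()`, `dev_put()` and `dev_kill()` are hypothetical names. Readers take a reference before touching the device, and teardown may only complete once the count reaches zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct dev_ref {
	atomic_long count;	/* starts at 1: the "initial" reference */
	atomic_bool dead;	/* set once teardown has begun */
};

static void dev_ref_init(struct dev_ref *r)
{
	atomic_init(&r->count, 1);
	atomic_init(&r->dead, false);
}

/* Take a reference unless teardown has started or the count hit zero. */
static bool dev_tryget_live(struct dev_ref *r)
{
	long old;

	if (atomic_load(&r->dead))
		return false;
	old = atomic_load(&r->count);
	while (old > 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the last one is gone. */
static bool dev_put(struct dev_ref *r)
{
	return atomic_fetch_sub(&r->count, 1) == 1;
}

/* Begin teardown: refuse new users, then drop the initial reference. */
static bool dev_kill(struct dev_ref *r)
{
	atomic_store(&r->dead, true);
	return dev_put(r);
}
```

A reader that slips in just before `dev_kill()` still holds a valid reference, so teardown has to wait for its `dev_put()` before freeing anything. That holds however long the guarded operation takes.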
      
      Link: https://lkml.kernel.org/r/20210426123316.806267-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210426123316.806267-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Christophe Leroy's avatar
      mm: pagewalk: fix walk for hugepage tables · e17eae2b
      Christophe Leroy authored
      Pagewalk ignores hugepd entries and walks down the tables as if they were
      traditional entries, leading to crazy results.
      
      Add walk_hugepd_range() and use it to walk hugepage tables.
      
      Link: https://lkml.kernel.org/r/38d04410700c8d02f28ba37e020b62c55d6f3d2c.1624597695.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Steven Price <steven.price@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Andrea Arcangeli's avatar
      mm: gup: pack has_pinned in MMF_HAS_PINNED · a458b76a
      Andrea Arcangeli authored
      has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop
      cleanup.
      
      Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast
      would reintroduce a loss of SMP scalability to pin-fast, so there's no
      future potential usefulness to keep an atomic in the mm for this.
      
      set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE
      (atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like
      atomic_set after this commit) has to be still issued only once per "mm",
      so the difference between the two will be lost in the noise.
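
The "issued only once per mm" point can be illustrated with a user-space sketch: test the bit with a plain read first, and only perform the atomic write when the bit is not yet set. This is a sketch under assumptions rather than the kernel implementation; `struct mm_like`, `FLAG_HAS_PINNED` and `set_has_pinned_once()` are hypothetical names, and C11 atomics stand in for test_bit()/set_bit().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define FLAG_HAS_PINNED (1UL << 0)	/* hypothetical bit position */

/* Hypothetical stand-in for the flags word shared by all threads. */
struct mm_like {
	atomic_ulong flags;
};

/* Returns true if this call performed the write. */
static bool set_has_pinned_once(struct mm_like *mm)
{
	/* Read-only fast path: no store, so the cacheline is not dirtied. */
	if (atomic_load_explicit(&mm->flags, memory_order_relaxed) &
	    FLAG_HAS_PINNED)
		return false;

	/* Slow path: taken on first use (racing threads may each do it once). */
	atomic_fetch_or(&mm->flags, FLAG_HAS_PINNED);
	return true;
}
```

After the first call, every later caller only reads the shared word, which is why the extra cost of an atomic read-modify-write is paid a bounded number of times.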
      
      will-it-scale "mmap2" shows no change in performance with enterprise
      config as expected.
      
      will-it-scale "pin_fast" retains the > 4000% SMP scalability performance
      improvement against upstream as expected.
      
      This is a noop as far as overall performance and SMP scalability are
      concerned.
      
      [peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED]
        Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s
      [akpm@linux-foundation.org: coding style fixes]
      [peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]
      
      Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Andrea Arcangeli's avatar
      mm: gup: allow FOLL_PIN to scale in SMP · 292648ac
      Andrea Arcangeli authored
      has_pinned cannot be written by each pin-fast or it won't scale in SMP.
      This isn't "false sharing" strictly speaking (it's more like "true
      non-sharing"), but it creates the same SMP scalability bottleneck of
      "false sharing".
      
      To verify the improvement, below test is done on 40 cpus host with
      Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (must be with
      CONFIG_GUP_TEST=y):
      
        $ sudo chrt -f 1 ./gup_test -a  -m 512 -j 40
      
      Where we can get (average value for 40 threads):
      
        Old kernel: 477729.97 (+- 3.79%)
        New kernel:  89144.65 (+-11.76%)
      
      Under similar conditions with 256 cpus, this commit increases the SMP
      scalability of pin_user_pages_fast() executed by different threads of the
      same process by more than 4000%.
      
      [peterx@redhat.com: rewrite commit message, add parentheses against "(A & B)"]
      
      Link: https://lkml.kernel.org/r/20210507150553.208763-3-peterx@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Peter Xu's avatar
      mm/gup_benchmark: support threading · f39bd853
      Peter Xu authored
      Patch series "mm/gup: Fix pin page write cache bouncing on has_pinned", v2.
      
      This series contains 3 patches, the 1st one enables threading for
      gup_benchmark in the kselftest.  The latter two patches are collected from
      Andrea's local branch which can fix write cache bouncing issue with
      pinning fast-gup.
      
      To be explicit on the latter two patches:
      
        - the 2nd patch fixes the perf degrade when introducing has_pinned, then
      
        - the last patch tries to remove the has_pinned with a bit in mm->flags
      
      For patch 3: originally I think we had a plan to reuse has_pinned as a
      counter very soon, however that hasn't happened so far, so maybe it
      shows that we can remove it until we really want such a counter for
      whatever reason.  As the commit message states, it saves 4 bytes for
      each mm without observable regressions.
      
      Regarding testing: refer to the commit message of patch 2 for some
      detailed testing with will-it-scale.  Meanwhile I did patch 1 just
      because then we can easily verify the patchset using the existing
      kselftest facilities, or even regression test it in the future with the
      repo if we want.
      
      The numbers below are extra verification tests that I did, in addition
      to those in the commit message of patch 2, using the new gup_benchmark
      and 256 cpus.  The test below was done on a 40-cpu host with Intel(R)
      Xeon(R) CPU E5-2630 v4 @ 2.20GHz, and I get a similar result (of course
      the write cache bouncing gets more severe with even more cores).
      
      After patch 1 applied (only test patch, so using old kernel):
      
        $ sudo chrt -f 1 ./gup_test -a  -m 512 -j 40
        PIN_FAST_BENCHMARK: Time: get:459632 put:5990 us
        PIN_FAST_BENCHMARK: Time: get:461967 put:5840 us
        PIN_FAST_BENCHMARK: Time: get:464521 put:6140 us
        PIN_FAST_BENCHMARK: Time: get:465176 put:7100 us
        PIN_FAST_BENCHMARK: Time: get:465960 put:6733 us
        PIN_FAST_BENCHMARK: Time: get:465324 put:6781 us
        PIN_FAST_BENCHMARK: Time: get:466018 put:7130 us
        PIN_FAST_BENCHMARK: Time: get:466362 put:7118 us
        PIN_FAST_BENCHMARK: Time: get:465118 put:6975 us
        PIN_FAST_BENCHMARK: Time: get:466422 put:6602 us
        PIN_FAST_BENCHMARK: Time: get:465791 put:6818 us
        PIN_FAST_BENCHMARK: Time: get:467091 put:6298 us
        PIN_FAST_BENCHMARK: Time: get:467694 put:5432 us
        PIN_FAST_BENCHMARK: Time: get:469575 put:5581 us
        PIN_FAST_BENCHMARK: Time: get:468124 put:6055 us
        PIN_FAST_BENCHMARK: Time: get:468877 put:6720 us
        PIN_FAST_BENCHMARK: Time: get:467212 put:4961 us
        PIN_FAST_BENCHMARK: Time: get:467834 put:6697 us
        PIN_FAST_BENCHMARK: Time: get:470778 put:6398 us
        PIN_FAST_BENCHMARK: Time: get:469788 put:6310 us
        PIN_FAST_BENCHMARK: Time: get:488277 put:7113 us
        PIN_FAST_BENCHMARK: Time: get:486613 put:7085 us
        PIN_FAST_BENCHMARK: Time: get:486940 put:7202 us
        PIN_FAST_BENCHMARK: Time: get:488728 put:7101 us
        PIN_FAST_BENCHMARK: Time: get:487570 put:7327 us
        PIN_FAST_BENCHMARK: Time: get:489260 put:7027 us
        PIN_FAST_BENCHMARK: Time: get:488846 put:6866 us
        PIN_FAST_BENCHMARK: Time: get:488521 put:6745 us
        PIN_FAST_BENCHMARK: Time: get:489950 put:6459 us
        PIN_FAST_BENCHMARK: Time: get:489777 put:6617 us
        PIN_FAST_BENCHMARK: Time: get:488224 put:6591 us
        PIN_FAST_BENCHMARK: Time: get:488644 put:6477 us
        PIN_FAST_BENCHMARK: Time: get:488754 put:6711 us
        PIN_FAST_BENCHMARK: Time: get:488875 put:6743 us
        PIN_FAST_BENCHMARK: Time: get:489290 put:6657 us
        PIN_FAST_BENCHMARK: Time: get:490264 put:6684 us
        PIN_FAST_BENCHMARK: Time: get:489631 put:6737 us
        PIN_FAST_BENCHMARK: Time: get:488434 put:6655 us
        PIN_FAST_BENCHMARK: Time: get:492213 put:6297 us
        PIN_FAST_BENCHMARK: Time: get:491124 put:6173 us
      
      After the whole series applied (new fixed kernel):
      
        $ sudo chrt -f 1 ./gup_test -a  -m 512 -j 40
        PIN_FAST_BENCHMARK: Time: get:82038 put:7041 us
        PIN_FAST_BENCHMARK: Time: get:82144 put:6817 us
        PIN_FAST_BENCHMARK: Time: get:83417 put:6674 us
        PIN_FAST_BENCHMARK: Time: get:82540 put:6594 us
        PIN_FAST_BENCHMARK: Time: get:83214 put:6681 us
        PIN_FAST_BENCHMARK: Time: get:83444 put:6889 us
        PIN_FAST_BENCHMARK: Time: get:83194 put:7499 us
        PIN_FAST_BENCHMARK: Time: get:84876 put:7369 us
        PIN_FAST_BENCHMARK: Time: get:86092 put:10289 us
        PIN_FAST_BENCHMARK: Time: get:86153 put:10415 us
        PIN_FAST_BENCHMARK: Time: get:85026 put:7751 us
        PIN_FAST_BENCHMARK: Time: get:85458 put:7944 us
        PIN_FAST_BENCHMARK: Time: get:85735 put:8154 us
        PIN_FAST_BENCHMARK: Time: get:85851 put:8299 us
        PIN_FAST_BENCHMARK: Time: get:86323 put:9617 us
        PIN_FAST_BENCHMARK: Time: get:86288 put:10496 us
        PIN_FAST_BENCHMARK: Time: get:87697 put:9346 us
        PIN_FAST_BENCHMARK: Time: get:87980 put:8382 us
        PIN_FAST_BENCHMARK: Time: get:88719 put:8400 us
        PIN_FAST_BENCHMARK: Time: get:87616 put:8588 us
        PIN_FAST_BENCHMARK: Time: get:86730 put:9563 us
        PIN_FAST_BENCHMARK: Time: get:88167 put:8673 us
        PIN_FAST_BENCHMARK: Time: get:86844 put:9777 us
        PIN_FAST_BENCHMARK: Time: get:88068 put:11774 us
        PIN_FAST_BENCHMARK: Time: get:86170 put:15676 us
        PIN_FAST_BENCHMARK: Time: get:87967 put:12827 us
        PIN_FAST_BENCHMARK: Time: get:95773 put:7652 us
        PIN_FAST_BENCHMARK: Time: get:87734 put:13650 us
        PIN_FAST_BENCHMARK: Time: get:89833 put:14237 us
        PIN_FAST_BENCHMARK: Time: get:96186 put:8029 us
        PIN_FAST_BENCHMARK: Time: get:95532 put:8886 us
        PIN_FAST_BENCHMARK: Time: get:95351 put:5826 us
        PIN_FAST_BENCHMARK: Time: get:96401 put:8407 us
        PIN_FAST_BENCHMARK: Time: get:96473 put:8287 us
        PIN_FAST_BENCHMARK: Time: get:97177 put:8430 us
        PIN_FAST_BENCHMARK: Time: get:98120 put:5263 us
        PIN_FAST_BENCHMARK: Time: get:96271 put:7757 us
        PIN_FAST_BENCHMARK: Time: get:99628 put:10467 us
        PIN_FAST_BENCHMARK: Time: get:99344 put:10045 us
        PIN_FAST_BENCHMARK: Time: get:94212 put:15485 us
      
      Summary:
      
        Old kernel: 477729.97 (+-3.79%)
        New kernel:  89144.65 (+-11.76%)
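
To put the two summary figures in proportion, the ratio can be computed directly. This is a minimal sketch assuming both figures are average "get" times in microseconds for the same workload, so lower is better; `speedup()` is a hypothetical helper name.

```c
#include <assert.h>

/* Ratio of old to new average time; above 1.0 means the new kernel is faster. */
static double speedup(double old_avg_us, double new_avg_us)
{
	return old_avg_us / new_avg_us;
}
```

With the summary values, `speedup(477729.97, 89144.65)` is about 5.36, meaning the average "get" time on this 40-cpu host dropped to roughly one fifth.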
      
      This patch (of 3):
      
      Add a new parameter "-j N" to support concurrent gup test.
      
      Link: https://lkml.kernel.org/r/20210507150553.208763-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20210507150553.208763-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Matthew Wilcox (Oracle)'s avatar
      mm: move page dirtying prototypes from mm.h · 3a6b2162
      Matthew Wilcox (Oracle) authored
      These functions implement the address_space ->set_page_dirty operation and
      should live in pagemap.h, not mm.h so that the rest of the kernel doesn't
      get funny ideas about calling them directly.
      
      Link: https://lkml.kernel.org/r/20210615162342.1669332-7-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Matthew Wilcox (Oracle)'s avatar
      fs: remove noop_set_page_dirty() · b82a96c9
      Matthew Wilcox (Oracle) authored
      Use __set_page_dirty_no_writeback() instead.  This will set the dirty bit
      on the page, which will be used to avoid calling set_page_dirty() in the
      future.  It will have no effect on actually writing the page back, as the
      pages are not on any LRU lists.
      
      [akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules]
      
      Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Matthew Wilcox (Oracle)'s avatar
      fs: remove anon_set_page_dirty() · fc50eee3
      Matthew Wilcox (Oracle) authored
      Use __set_page_dirty_no_writeback() instead.  This will set the dirty bit
      on the page, which will be used to avoid calling set_page_dirty() in the
      future.  It will have no effect on actually writing the page back, as the
      pages are not on any LRU lists.
      
      Link: https://lkml.kernel.org/r/20210615162342.1669332-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Matthew Wilcox (Oracle)'s avatar
      iomap: use __set_page_dirty_nobuffers · fd7353f8
      Matthew Wilcox (Oracle) authored
      The only difference between iomap_set_page_dirty() and
      __set_page_dirty_nobuffers() is that the latter includes a debugging check
      that a !Uptodate page has private data.
      
      Link: https://lkml.kernel.org/r/20210615162342.1669332-4-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Matthew Wilcox (Oracle)'s avatar
      mm/writeback: use __set_page_dirty in __set_page_dirty_nobuffers · 2f18be36
      Matthew Wilcox (Oracle) authored
      This is fundamentally the same code, so just call it instead of
      duplicating it.
      
      Link: https://lkml.kernel.org/r/20210615162342.1669332-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • mm/writeback: move __set_page_dirty() to core mm · 6e1cae88
      Matthew Wilcox (Oracle) authored
      Patch series "Further set_page_dirty cleanups".
      
      Prompted by Christoph's recent patches, here are some more patches to
      improve the state of set_page_dirty().  They're all from the folio tree,
      so they've been tested to a certain extent.
      
      This patch (of 6):
      
      Nothing in __set_page_dirty() is specific to buffer_head, so move it to
      mm/page-writeback.c.  That removes the only caller of
      account_page_dirtied() outside of page-writeback.c, so make it static.
      
      Link: https://lkml.kernel.org/r/20210615162342.1669332-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20210615162342.1669332-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: require ->set_page_dirty to be explicitly wired up · 0af57378
      Christoph Hellwig authored
      Remove the CONFIG_BLOCK default to __set_page_dirty_buffers and just wire
      that method up for the missing instances.
      
      [hch@lst.de: ecryptfs: add a ->set_page_dirty cludge]
        Link: https://lkml.kernel.org/r/20210624125250.536369-1-hch@lst.de
      
      Link: https://lkml.kernel.org/r/20210614061512.3966143-4-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Tyler Hicks <code@tyhicks.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: move ramfs_aops to libfs · c1e3dbe9
      Christoph Hellwig authored
      Move the ramfs aops to libfs and reuse them for kernfs and configfs.
      Those two did not wire up ->set_page_dirty before and now get
      __set_page_dirty_no_writeback, which is the right one for no-writeback
      address_space usage.
      
      Drop the now unused exports of the libfs helpers only used for ramfs-style
      pagecache usage.
      
      Link: https://lkml.kernel.org/r/20210614061512.3966143-3-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: unexport __set_page_dirty · 34ebcce7
      Christoph Hellwig authored
      Patch series "remove the implicit .set_page_dirty default".
      
      This series cleans up a few loose ends around ->set_page_dirty; most
      importantly, it removes the default to the buffer_head-based
      implementation when no method is wired up.
      
      This patch (of 3):
      
      __set_page_dirty is only used by built-in code.
      
      Link: https://lkml.kernel.org/r/20210614061512.3966143-1-hch@lst.de
      Link: https://lkml.kernel.org/r/20210614061512.3966143-2-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback, cgroup: release dying cgwbs by switching attached inodes · c22d70a1
      Roman Gushchin authored
      Asynchronously try to release dying cgwbs by switching attached inodes to
      the nearest living ancestor wb.  It helps to get rid of per-cgroup
      writeback structures themselves and of pinned memory and block cgroups,
      which are significantly larger structures (mostly due to large per-cpu
      statistics data).  This prevents memory waste and helps to avoid different
      scalability problems caused by large piles of dying cgroups.
      
      Reuse the existing mechanism of inode switching used for foreign inode
      detection.  To speed things up, batch up to 115 inode switches in a single
      operation (the maximum number is selected so that the resulting struct
      inode_switch_wbs_context can fit into 1024 bytes).  Because every
      switch consists of two steps separated by an RCU grace period, it would
      be too slow without batching.  Please note that the whole batch counts as
      a single operation (when increasing/decreasing isw_nr_in_flight).  This
      keeps umounting working (it flushes the switching queue) while
      preventing cleanups from consuming the whole switching quota and
      effectively blocking the foreign (frn) inode switching.
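
      The batch limit described above can be derived from the size budget
      rather than open-coded.  The following is a minimal userspace sketch,
      not the kernel code: struct switch_ctx_header, its placeholder members,
      CTX_BUDGET and CTX_MAX_SLOTS are hypothetical names, and the header
      size here does not match the real struct inode_switch_wbs_context, so
      the resulting count will differ from 115.

```c
#include <assert.h>
#include <stddef.h>

struct inode;	/* opaque here; only pointers to it are stored */

/*
 * Hypothetical stand-in for the fixed part of the switching context.
 * The real kernel structure has different members and a different size.
 */
struct switch_ctx_header {
	void *work[7];	/* placeholder for embedded work/RCU state */
	void *new_wb;	/* target writeback structure */
};

struct switch_ctx {
	struct switch_ctx_header hdr;
	struct inode *inodes[];	/* NULL-terminated batch of inode pointers */
};

/* Size budget for one context, taken from the commit message. */
#define CTX_BUDGET 1024UL

/* Number of pointer slots (including the NULL terminator) that fit. */
#define CTX_MAX_SLOTS \
	((CTX_BUDGET - sizeof(struct switch_ctx)) / sizeof(struct inode *))

/* Bytes needed for a context holding the given number of slots. */
static size_t ctx_alloc_size(size_t slots)
{
	assert(slots <= CTX_MAX_SLOTS);
	return sizeof(struct switch_ctx) + slots * sizeof(struct inode *);
}
```

      Deriving the limit from sizeof() keeps the batch within the budget even
      if the fixed part of the structure grows or shrinks.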
      
      A cgwb cleanup operation can fail due to different reasons (e.g.  not
      enough memory, the cgwb has an in-flight/pending io, an attached inode in
      a wrong state, etc).  In this case the next scheduled cleanup will make a
      new attempt.  An attempt is made each time a new cgwb is offlined (in
      other words a memcg and/or a blkcg is deleted by a user).  In the future
      an additional attempt scheduled by a timer can be implemented.
      
      [guro@fb.com: replace open-coded "115" with arithmetic]
        Link: https://lkml.kernel.org/r/YMEcSBcq/VXMiPPO@carbon.dhcp.thefacebook.com
      [guro@fb.com: add smp_mb() to inode_prepare_wbs_switch()]
        Link: https://lkml.kernel.org/r/YMFa+guFw7OFjf3X@carbon.dhcp.thefacebook.com
      [willy@infradead.org: fix documentation]
        Link: https://lkml.kernel.org/r/20210615200242.1716568-2-willy@infradead.org
      
      Link: https://lkml.kernel.org/r/20210608230225.2078447-9-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback, cgroup: support switching multiple inodes at once · f5fbe6b7
      Roman Gushchin authored
      Currently only a single inode can be switched to another writeback
      structure at once.  That means to switch an inode a separate
      inode_switch_wbs_context structure must be allocated, and a separate rcu
      callback and work must be scheduled.
      
      That is fine for the existing ad-hoc switching, which does not happen
      often, but it is sub-optimal for the massive switching required to release
      a writeback structure.  To prepare for it, add support for switching
      multiple inodes at once.
      
      Instead of containing a single inode pointer, inode_switch_wbs_context
      will contain a NULL-terminated array of inode pointers.
      inode_do_switch_wbs() will be called for each inode.
      
      To optimize the locking, bdi->wb_switch_rwsem and old_wb's and new_wb's
      list_locks will be acquired and released only once for all inodes.
      wb_wakeup() will also be called only once.  Instead of calling
      wb_put(old_wb) after each successful switch, wb_put_many() is introduced
      and used.
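
      The batching pattern can be illustrated with a small userspace sketch.
      It is a simplified model under stated assumptions, not the kernel
      implementation: struct wb_sketch, struct inode_sketch, the
      "switchable" field and all helper names are hypothetical, plain
      integers stand in for the kernel's reference counting, and the single
      lock acquisition is only marked by comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a writeback structure and an inode. */
struct wb_sketch {
	long refcnt;	/* one reference per attached inode in this model */
};

struct inode_sketch {
	struct wb_sketch *wb;
	int switchable;	/* 0 models an inode in a state that cannot switch */
};

/* Drop nr references in a single call instead of nr separate puts. */
static void wb_put_many_sketch(struct wb_sketch *wb, unsigned long nr)
{
	assert(wb->refcnt >= (long)nr);
	wb->refcnt -= (long)nr;
}

/* Switch one inode; returns 1 on success, 0 if it was skipped. */
static int do_switch_one(struct inode_sketch *inode,
			 struct wb_sketch *old_wb, struct wb_sketch *new_wb)
{
	if (!inode->switchable || inode->wb != old_wb)
		return 0;
	new_wb->refcnt++;	/* the inode now pins new_wb */
	inode->wb = new_wb;
	return 1;
}

/* Walk a NULL-terminated array and release old_wb references once. */
static unsigned long switch_batch(struct inode_sketch **inodes,
				  struct wb_sketch *old_wb,
				  struct wb_sketch *new_wb)
{
	unsigned long nr_switched = 0;
	struct inode_sketch **p;

	/* all locks would be taken once here */
	for (p = inodes; *p; p++)
		nr_switched += do_switch_one(*p, old_wb, new_wb);
	/* all locks would be released once here */

	if (nr_switched)
		wb_put_many_sketch(old_wb, nr_switched);
	return nr_switched;
}
```

      A caller builds an array such as { &a, &b, &c, NULL } and passes it to
      switch_batch(); inodes that cannot be switched are skipped and keep
      their reference on old_wb.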
      
      Link: https://lkml.kernel.org/r/20210608230225.2078447-8-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback, cgroup: split out the functional part of inode_switch_wbs_work_fn() · 72d4512e
      Roman Gushchin authored
      Split out the functional part of the inode_switch_wbs_work_fn() function
      as inode_do_switch_wbs() to reuse it later for switching inodes attached
      to dying cgwbs.
      
      This commit doesn't bring any functional changes.
      
      Link: https://lkml.kernel.org/r/20210608230225.2078447-7-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback, cgroup: keep list of inodes attached to bdi_writeback · f3b6a6df
      Roman Gushchin authored
      Currently there is no way to iterate over inodes attached to a specific
      cgwb structure.  It limits the ability to efficiently reclaim the
      writeback structure itself and associated memory and block cgroup
      structures without scanning all inodes belonging to a sb, which can be
      prohibitively expensive.
      
      While dirty or in active writeback, an inode belongs to one of the
      bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time.  Once
      cleaned up, it's removed from all io lists.  So inode->i_io_list can be
      reused to maintain the list of inodes attached to a bdi_writeback
      structure.
      
      This patch introduces a new wb->b_attached list, which contains all inodes
      which were dirty at least once and are attached to the given cgwb.  Inodes
      attached to the root bdi_writeback structures are never placed on such a
      list.  The following patch will use this list to try to release cgwb
      structures more efficiently.
      
      Link: https://lkml.kernel.org/r/20210608230225.2078447-6-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback, cgroup: switch to rcu_work API in inode_switch_wbs() · 29264d92
      Roman Gushchin authored
      Inode's wb switching requires two steps divided by an RCU grace period.
      It's currently implemented as an RCU callback inode_switch_wbs_rcu_fn(),
      which schedules inode_switch_wbs_work_fn() as a work.
      
      Switching to the rcu_work API achieves the same in a cleaner and
      slightly shorter form.
      
      Link: https://lkml.kernel.org/r/20210608230225.2078447-5-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback, cgroup: increment isw_nr_in_flight before grabbing an inode · 8826ee4f
      Roman Gushchin authored
      isw_nr_in_flight is used to determine whether the inode switch queue
      should be flushed from the umount path.  Currently it's increased after
      grabbing an inode and even scheduling the switch work.  It means the
      umount path can walk past cleanup_offline_cgwb() with active inode
      references, which can result in a "Busy inodes after unmount." message and
      use-after-free issues (with inode->i_sb which gets freed).
      
      Fix it by incrementing isw_nr_in_flight before doing anything with the
      inode and decrementing in the case when switching wasn't scheduled.
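
      The ordering of the fix can be sketched in userspace with C11 atomics.
      This is an illustration under assumptions, not the kernel code:
      nr_in_flight models isw_nr_in_flight, while struct inode_sketch,
      try_grab(), start_switch() and finish_switch() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Models isw_nr_in_flight: the count the umount path inspects. */
static atomic_int nr_in_flight;

/* Hypothetical inode stand-in with a reference count. */
struct inode_sketch {
	bool grabbable;	/* false models an inode that cannot be grabbed */
	int refs;
};

/* Take a reference if the inode can still be grabbed. */
static bool try_grab(struct inode_sketch *inode)
{
	if (!inode->grabbable)
		return false;
	inode->refs++;
	return true;
}

/*
 * Announce the operation first, then touch the inode.  If the switch
 * cannot be scheduled, undo the increment before returning.
 */
static bool start_switch(struct inode_sketch *inode)
{
	atomic_fetch_add(&nr_in_flight, 1);
	if (!try_grab(inode)) {
		atomic_fetch_sub(&nr_in_flight, 1);
		return false;
	}
	/* the switch work would be queued here */
	return true;
}

/* Completion: drop the inode reference, then retire the operation. */
static void finish_switch(struct inode_sketch *inode)
{
	inode->refs--;
	atomic_fetch_sub(&nr_in_flight, 1);
}
```

      With this ordering, an inode reference held for switching is always
      covered by a non-zero counter, so a reader of the counter never sees
      zero while such a reference is still held.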
      
      The problem has not yet been seen in real life; it was discovered by
      Jan Kara while reading the code.
      
      Link: https://lkml.kernel.org/r/20210608230225.2078447-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Suggested-by: Jan Kara <jack@suse.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback, cgroup: add smp_mb() to cgroup_writeback_umount() · 592fa002
      Roman Gushchin authored
      A full memory barrier is required between clearing SB_ACTIVE flag in
      generic_shutdown_super() and checking isw_nr_in_flight in
      cgroup_writeback_umount(), otherwise a new switch operation might be
      scheduled after atomic_read(&isw_nr_in_flight) returned 0.  This would
      result in a non-flushed isw_wq, and a potential crash.
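
      The required ordering is the classic "store, full barrier, load"
      pairing on both sides.  Below is a userspace sketch using C11 atomics,
      with atomic_thread_fence(memory_order_seq_cst) standing in for
      smp_mb().  All names are hypothetical; sb_active models the SB_ACTIVE
      flag and in_flight models isw_nr_in_flight.  The barrier on the
      switching side is assumed here to correspond to the smp_mb() that a
      fixup in this series adds to inode_prepare_wbs_switch().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool sb_active = true;	/* models the SB_ACTIVE flag */
static atomic_int in_flight;		/* models isw_nr_in_flight */
static int flushes;			/* counts modeled workqueue flushes */

/* Umount side: clear the flag, full barrier, then read the counter. */
static void umount_side(void)
{
	atomic_store_explicit(&sb_active, false, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* plays smp_mb() */
	if (atomic_load_explicit(&in_flight, memory_order_relaxed) != 0)
		flushes++;	/* the switch workqueue would be flushed */
}

/* Switch side: bump the counter, full barrier, then read the flag. */
static bool switch_side(void)
{
	atomic_fetch_add_explicit(&in_flight, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* pairing barrier */
	if (!atomic_load_explicit(&sb_active, memory_order_relaxed)) {
		/* superblock is shutting down: back out */
		atomic_fetch_sub_explicit(&in_flight, 1, memory_order_relaxed);
		return false;
	}
	return true;	/* the switch would be scheduled here */
}
```

      With a full fence between the store and the load on both sides, at
      least one side observes the other's store: either the umount side sees
      a non-zero counter and flushes, or the switch side sees the cleared
      flag and backs out.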
      
      The problem has not yet been seen in real life; it was discovered by
      Jan Kara while reading the code.
      
      Link: https://lkml.kernel.org/r/20210608230225.2078447-3-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Jan Kara <jack@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback, cgroup: do not switch inodes with I_WILL_FREE flag · 4ade5867
      Roman Gushchin authored
      Patch series "cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups", v9.
      
      When an inode is getting dirty for the first time it's associated with a
      wb structure (see __inode_attach_wb()).  It can later be switched to
      another wb (if e.g.  some other cgroup is writing a lot of data to the
      same inode), but otherwise stays attached to the original wb until being
      reclaimed.
      
      The problem is that the wb structure holds a reference to the original
      memory and blkcg cgroups.  So if an inode has been dirty once and later is
      actively used in read-only mode, it has a good chance to pin down the
      original memory and blkcg cgroups forever.  This is often the case with
      services bringing data for other services, e.g.  updating some rpm
      packages.
      
      In real life this becomes a problem because of the large size of the memcg
      structure, which can easily be 1000x larger than an inode.  Also, a really
      large number of dying cgroups can raise various scalability issues, e.g.
      making memory reclaim costly and less effective.

      To solve the problem, inodes should eventually be detached from the
      corresponding writeback structure.  It's inefficient to do it after every
      writeback completion.  Instead it can be done whenever the original memory
      cgroup is offlined and the writeback structure is getting killed.  Scanning
      a (potentially long) list of inodes and detaching them from the
      writeback structure can take quite some time.  To avoid scanning all
      inodes, attached inodes are kept on a new list (b_attached).  To make it
      less noticeable to a user, the scanning and switching are performed from a
      work context.
      
      Big thanks to Jan Kara, Dennis Zhou, Hillf Danton and Tejun Heo for their
      ideas and contribution to this patchset.
      
      This patch (of 8):
      
      If an inode's state has I_WILL_FREE flag set, the inode will be freed
      soon, so there is no point in trying to switch the inode to a different
      cgwb.
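
      As a rough illustration, the check amounts to testing the inode state
      against a mask of flags that rule out switching.  The sketch below is
      a userspace model: the SK_* bit values and may_switch() are
      hypothetical, and grouping I_WILL_FREE with the freeing and
      switch-in-progress states is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the kernel's i_state flags differ. */
#define SK_I_FREEING	(1u << 0)	/* inode is being freed */
#define SK_I_WILL_FREE	(1u << 1)	/* inode will be freed soon */
#define SK_I_WB_SWITCH	(1u << 2)	/* a switch is already in progress */

/* An inode is worth switching only if none of the blocking flags is set. */
static bool may_switch(unsigned int state)
{
	return !(state & (SK_I_FREEING | SK_I_WILL_FREE | SK_I_WB_SWITCH));
}
```

      A caller would skip the inode when may_switch() returns false instead
      of scheduling a switch for it.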
      
      I_WILL_FREE was ignored since the introduction of the inode switching, so
      it looks like it doesn't lead to any noticeable issues for a user.  This
      is why the patch is not intended for a stable backport.
      
      Link: https://lkml.kernel.org/r/20210608230225.2078447-1-guro@fb.com
      Link: https://lkml.kernel.org/r/20210608230225.2078447-2-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>