1. 04 Jun, 2020 40 commits
    • mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() · 6058eaec
      Johannes Weiner authored
      They're the same function, and for the purpose of all callers they are
      equivalent to lru_cache_add().
      
      [akpm@linux-foundation.org: fix it for local_lock changes]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6058eaec
    • mm: allow swappiness that prefers reclaiming anon over the file workingset · c843966c
      Johannes Weiner authored
      With the advent of fast random IO devices (SSDs, PMEM) and in-memory swap
      devices such as zswap, it's possible for swap to be much faster than
      filesystems, and for swapping to be preferable over thrashing filesystem
      caches.
      
      Allow setting swappiness - which defines the rough relative IO cost of
      cache misses between page cache and swap-backed pages - to reflect such
      situations by making the swap-preferred range configurable.
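
How a swappiness value could translate into relative scan weights can be sketched as follows. This is a minimal illustration, not the kernel's code: the function name `scan_weights` is hypothetical, and the 0..200 scale with weights `swappiness` and `200 - swappiness` is an assumption (the message above only says the swap-preferred range becomes configurable). Under that assumption, 100 means equal IO cost, and values above 100 prefer reclaiming anon.

```python
from typing import Tuple


def scan_weights(swappiness: int) -> Tuple[int, int]:
    """Return (anon_weight, file_weight) for a swappiness value.

    Assumption: swappiness lives on a 0..200 scale and the two weights
    always add up to 200, so 100 expresses equal IO cost.
    """
    if not 0 <= swappiness <= 200:
        raise ValueError("swappiness must be within 0..200")
    return swappiness, 200 - swappiness


for value in (0, 60, 100, 150, 200):
    anon, file = scan_weights(value)
    print(f"swappiness={value:3d}  anon weight={anon:3d}  file weight={file:3d}")
```
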
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-4-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c843966c
    • mm: keep separate anon and file statistics on page reclaim activity · 497a6c1b
      Johannes Weiner authored
      Having statistics on pages scanned and pages reclaimed for both anon and
      file pages makes it easier to evaluate changes to LRU balancing.
      
      While at it, clean up the stat-keeping for isolation, putback, reclaim
      stats, etc. a bit by giving it a consistent order: first the physical LRU
      operation (isolation and putback), followed by vmstats, reclaim_stats, and
      then vm events.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-3-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      497a6c1b
    • mm: fix LRU balancing effect of new transparent huge pages · 5df74196
      Johannes Weiner authored
      The reclaim code that balances between swapping and cache reclaim tries to
      predict likely reuse based on in-memory reference patterns alone.  This
      works in many cases, but when it fails it cannot detect when the cache is
      thrashing pathologically, or when we're in the middle of a swap storm.
      
      The high seek cost of rotational drives under which the algorithm evolved
      also meant that mistakes could quickly result in lockups from too
      aggressive swapping (which is predominantly random IO).  As a result, the
      balancing code has been tuned over time to a point where it mostly goes
      for page cache and defers swapping until the VM is under significant
      memory pressure.
      
      The resulting strategy doesn't make optimal caching decisions - where
      optimal is the least amount of IO required to execute the workload.
      
      The proliferation of fast random IO devices such as SSDs, in-memory
      compression such as zswap, and persistent memory technologies on the
      horizon, has made this undesirable behavior very noticeable: Even in the
      presence of large amounts of cold anonymous memory and a capable swap
      device, the VM refuses to even seriously scan these pages, and can leave
      the page cache thrashing needlessly.
      
      This series sets out to address this.  Since commit a528910e ("mm:
      thrash detection-based file cache sizing") we have exact tracking of
      refault IO - the ultimate cost of reclaiming the wrong pages.  This allows
      us to use an IO cost based balancing model that is more aggressive about
      scanning anonymous memory when the cache is thrashing, while being able to
      avoid unnecessary swap storms.
      
      These patches base the LRU balance on the rate of refaults on each list,
      times the relative IO cost between swap device and filesystem
      (swappiness), in order to optimize reclaim for least IO cost incurred.
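
The balancing idea in the paragraph above can be sketched as a toy model. This is a minimal sketch under stated assumptions, not the kernel's implementation: the name `scan_shares`, the `+ 1` smoothing, and the split of swappiness into `swappiness` versus `200 - swappiness` are illustrative choices. The inputs stand for the refault IO recently observed on each list; a list that has caused more IO receives less scan pressure.

```python
def scan_shares(anon_cost, file_cost, swappiness=60):
    """Return (anon_share, file_share) of scan pressure, summing to 1.

    anon_cost / file_cost: recently observed refault IO per list
    (hypothetical inputs for this sketch).
    swappiness: relative IO cost weight, assumed to be on a 0..200 scale.
    """
    total = anon_cost + file_cost
    # Pressure on a list shrinks as that list's observed cost grows.
    anon_pressure = swappiness * (total + 1) / (anon_cost + 1)
    file_pressure = (200 - swappiness) * (total + 1) / (file_cost + 1)
    denom = anon_pressure + file_pressure
    return anon_pressure / denom, file_pressure / denom


# Equal cost on both lists and equal IO weight: scan both equally.
print(scan_shares(1000, 1000, swappiness=100))
# Thrashing cache, no swapins: almost all pressure moves to anon.
print(scan_shares(0, 10000, swappiness=60))
```
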
      
      	History
      
      I floated these changes in 2016.  At the time they were incomplete and
      full of workarounds due to a lack of infrastructure in the reclaim code:
      We didn't have PageWorkingset, we didn't have hierarchical cgroup
      statistics, and we had problems with the cgroup swap controller.  As swapping
      wasn't too high a priority then, the patches stalled out.  With all
      dependencies in place now, here we are again with much cleaner,
      feature-complete patches.
      
      I kept the acks for patches that stayed materially the same :-)
      
      Below is a series of test results that demonstrate certain problematic
      behavior of the current code, as well as showcase the new code's more
      predictable and appropriate balancing decisions.
      
      	Test #1: No convergence
      
      This test shows an edge case where the VM currently doesn't converge at
      all on a new file workingset with a stale anon/tmpfs set.
      
      The test sets up a cold anon set the size of 3/4 RAM, then tries to
      establish a new file set half the size of RAM (flat access pattern).
      
      The vanilla kernel refuses to even scan anon pages and never converges.
      The file set is perpetually served from the filesystem.
      
      The first test kernel is with the series up to the workingset patch
      applied.  This allows thrashing page cache to challenge the anonymous
      workingset.  The VM then scans the lists based on the current
      scanned/rotated balancing algorithm.  It converges on a stable state where
      all cold anon pages are pushed out and the fileset is served entirely from
      cache:
      
      			    noconverge/5.7-rc5-mm	noconverge/5.7-rc5-mm-workingset
      Scanned			417719308.00 (    +0.00%)		64091155.00 (   -84.66%)
      Reclaimed		417711094.00 (    +0.00%)		61640308.00 (   -85.24%)
      Reclaim efficiency %	      100.00 (    +0.00%)		      96.18 (    -3.78%)
      Scanned file		417719308.00 (    +0.00%)		59211118.00 (   -85.83%)
      Scanned anon			0.00 (    +0.00%)	         4880037.00 (          )
      Swapouts			0.00 (    +0.00%)	         2439957.00 (          )
      Swapins				0.00 (    +0.00%)		     257.00 (          )
      Refaults		415246605.00 (    +0.00%)		59183722.00 (   -85.75%)
      Restore refaults		0.00 (    +0.00%)	        54988252.00 (          )
      
      The second test kernel is with the full patch series applied, which
      replaces the scanned/rotated ratios with refault/swapin rate-based
      balancing.  It evicts the cold anon pages more aggressively in the
      presence of a thrashing cache and the absence of swapins, and so converges
      with about 60% of the IO and reclaim activity:
      
      			noconverge/5.7-rc5-mm-workingset	noconverge/5.7-rc5-mm-lrubalance
      Scanned				64091155.00 (    +0.00%)		37579741.00 (   -41.37%)
      Reclaimed			61640308.00 (    +0.00%)		35129293.00 (   -43.01%)
      Reclaim efficiency %		      96.18 (    +0.00%)		      93.48 (    -2.78%)
      Scanned file			59211118.00 (    +0.00%)		32708385.00 (   -44.76%)
      Scanned anon			 4880037.00 (    +0.00%)		 4871356.00 (    -0.18%)
      Swapouts			 2439957.00 (    +0.00%)		 2435565.00 (    -0.18%)
      Swapins				     257.00 (    +0.00%)		     262.00 (    +1.94%)
      Refaults			59183722.00 (    +0.00%)		32675667.00 (   -44.79%)
      Restore refaults		54988252.00 (    +0.00%)		28480430.00 (   -48.21%)
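
The "about 60%" figure can be rechecked from the table above. The snippet below is only an arithmetic aid; the helper name `pct_change` is hypothetical, and the numbers are copied from the Scanned, Reclaimed and Refaults rows.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Values copied from the workingset vs. lrubalance table above.
workingset = {"Scanned": 64091155, "Reclaimed": 61640308, "Refaults": 59183722}
lrubalance = {"Scanned": 37579741, "Reclaimed": 35129293, "Refaults": 32675667}

for name, before in workingset.items():
    after = lrubalance[name]
    print(f"{name:10s} remaining share {after / before:.2f}  "
          f"change {pct_change(before, after):+.2f}%")
```
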
      
      We're triggering this case in host sideloading scenarios: When a host's
      primary workload is not saturating the machine (primary load is usually
      driven by user activity), we can optimistically sideload a batch job; if
      user activity picks up and the primary workload needs the whole host
      during this time, we freeze the sideload and rely on it getting pushed to
      swap.  Frequently that swapping doesn't happen and the completely inactive
      sideload simply stays resident while the expanding primary workload is
      struggling to gain ground.
      
      	Test #2: Kernel build
      
      This test is a kernel build that is slightly memory-restricted (make -j4
      inside a 400M cgroup).
      
      Despite the very aggressive swapping of cold anon pages in test #1, this
      test shows that the new kernel carefully balances swap against cache
      refaults when both the anon and the cache set are pressured.
      
      It shows the patched kernel to be slightly better at finding the coldest
      memory from the combined anon and file set to evict under pressure.  The
      result is lower aggregate reclaim and paging activity:
      
      				    5.7-rc5-mm	5.7-rc5-mm-lrubalance
      Real time		   210.60 (    +0.00%)	   210.97 (    +0.18%)
      User time		   745.42 (    +0.00%)	   746.48 (    +0.14%)
      System time		    69.78 (    +0.00%)	    69.79 (    +0.02%)
      Scanned file		354682.00 (    +0.00%)	293661.00 (   -17.20%)
      Scanned anon		465381.00 (    +0.00%)	378144.00 (   -18.75%)
      Swapouts		185920.00 (    +0.00%)	147801.00 (   -20.50%)
      Swapins			 34583.00 (    +0.00%)	 32491.00 (    -6.05%)
      Refaults		212664.00 (    +0.00%)	172409.00 (   -18.93%)
      Restore refaults	 48861.00 (    +0.00%)	 80091.00 (   +63.91%)
      Total paging IO		433167.00 (    +0.00%)	352701.00 (   -18.58%)
      
      	Test #3: Overload
      
      This next test is not about performance, but rather about the
      predictability of the algorithm.  The current balancing behavior doesn't
      always lead to comprehensible results, which makes performance analysis
      and parameter tuning (swappiness e.g.) very difficult.
      
      The test shows the balancing behavior under equivalent anon and file
      input.  Anon and file sets are created of equal size (3/4 RAM), have the
      same access patterns (a hot-cold gradient), and synchronized access rates.
      Swappiness is raised from the default of 60 to 100 to indicate equal IO
      cost between swap and cache.
      
      With the vanilla balancing code, anon scans make up around 9% of the total
      pages scanned, or a ~1:10 ratio.  This is a surprisingly skewed ratio, and
      it's an outcome that is hard to explain given the input parameters to the
      VM.
      
      The new balancing model targets a 1:2 balance: All else being equal,
      reclaiming a file page costs one page IO - the refault; reclaiming an anon
      page costs two IOs - the swapout and the swapin.  In the test we observe a
      ~1:3 balance.
      
      The scanned and paging IO numbers indicate that the anon LRU algorithm we
      have in place right now does a slightly worse job at picking the coldest
      pages compared to the file algorithm.  There is ongoing work to improve
      this, like Joonsoo's anon workingset patches; however, it's difficult to
      compare the two aging strategies when the balancing between them is
      behaving unintuitively.
      
      The slightly less efficient anon reclaim results in a deviation from the
      optimal 1:2 scan ratio we would like to see here - however, 1:3 is much
      closer to what we'd want to see in this test than the vanilla kernel's
      aging of 10+ cache pages for every anonymous one:
      
      			overload-100/5.7-rc5-mm-workingset	overload-100/5.7-rc5-mm-lrubalance-realfile
      Scanned				 533633725.00 (    +0.00%)			  595687785.00 (   +11.63%)
      Reclaimed			 494325440.00 (    +0.00%)			  518154380.00 (    +4.82%)
      Reclaim efficiency %			92.63 (    +0.00%)				 86.98 (    -6.03%)
      Scanned file			 484532894.00 (    +0.00%)			  456937722.00 (    -5.70%)
      Scanned anon			  49100831.00 (    +0.00%)			  138750063.00 (  +182.58%)
      Swapouts			   8096423.00 (    +0.00%)			   48982142.00 (  +504.98%)
      Swapins				  10027384.00 (    +0.00%)			   62325044.00 (  +521.55%)
      Refaults			 479819973.00 (    +0.00%)			  451309483.00 (    -5.94%)
      Restore refaults		 426422087.00 (    +0.00%)			  399914067.00 (    -6.22%)
      Total paging IO			 497943780.00 (    +0.00%)			  562616669.00 (   +12.99%)
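
The scan ratios quoted for this test can be derived from the table above. The snippet is a small arithmetic aid with hypothetical helper names; the numbers are copied from the table. It also relies on an observation from these figures rather than a stated definition: the "Total paging IO" row equals swapouts plus swapins plus refaults.

```python
# Values copied from the overload-100 table above.
overload = {
    "workingset": {
        "scanned_file": 484532894, "scanned_anon": 49100831,
        "swapouts": 8096423, "swapins": 10027384, "refaults": 479819973,
    },
    "lrubalance": {
        "scanned_file": 456937722, "scanned_anon": 138750063,
        "swapouts": 48982142, "swapins": 62325044, "refaults": 451309483,
    },
}


def file_pages_per_anon_page(stats):
    """File pages scanned for every anon page scanned."""
    return stats["scanned_file"] / stats["scanned_anon"]


def total_paging_io(stats):
    """Swapouts + swapins + refaults (matches the table's last row)."""
    return stats["swapouts"] + stats["swapins"] + stats["refaults"]


for kernel, stats in overload.items():
    print(f"{kernel}: anon:file scan ratio 1:{file_pages_per_anon_page(stats):.1f}, "
          f"total paging IO {total_paging_io(stats)}")
```
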
      
      	Test #4: Parallel IO
      
      It's important to note that these patches only affect the situation where
      the kernel has to reclaim workingset memory, which is usually a
      transitionary period.  The vast majority of page reclaim occurring in a
      system is from trimming the ever-expanding page cache.
      
      These patches don't affect cache trimming behavior.  We never swap as long
      as we only have use-once cache moving through the file LRU; we only
      consider swapping when the cache is actively thrashing.
      
      The following test demonstrates this.  It has an anon workingset that
      takes up half of RAM and then writes a file that is twice the size of RAM
      out to disk.
      
      As the cache is funneled through the inactive file list, no anon pages are
      scanned (aside from apparently some background noise of 10 pages):
      
      					  5.7-rc5-mm		          5.7-rc5-mm-lrubalance
      Scanned			    10714722.00 (    +0.00%)		       10723445.00 (    +0.08%)
      Reclaimed		    10703596.00 (    +0.00%)		       10712166.00 (    +0.08%)
      Reclaim efficiency %		  99.90 (    +0.00%)			     99.89 (    -0.00%)
      Scanned file		    10714722.00 (    +0.00%)		       10723435.00 (    +0.08%)
      Scanned anon			   0.00 (    +0.00%)			     10.00 (          )
      Swapouts			   0.00 (    +0.00%)			      7.00 (          )
      Swapins				   0.00 (    +0.00%)			      0.00 (    +0.00%)
      Refaults			  92.00 (    +0.00%)			     41.00 (   -54.84%)
      Restore refaults		   0.00 (    +0.00%)			      0.00 (    +0.00%)
      Total paging IO			  92.00 (    +0.00%)			     48.00 (   -47.31%)
      
      This patch (of 14):
      
      Currently, THP are counted as single pages until they are split right
      before being swapped out.  However, at that point the VM is already in the
      middle of reclaim, and adjusting the LRU balance then is useless.
      
      Always account THP by the number of basepages, and remove the fixup from
      the splitting path.
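
The effect of counting THP by base pages can be illustrated with a little arithmetic. This is a sketch under assumptions, not kernel code: 4 KiB base pages and 2 MiB huge pages are assumed (typical for x86-64 but not stated above), and the names are hypothetical.

```python
BASE_PAGE_SIZE = 4096              # assumption: 4 KiB base pages
HUGE_PAGE_SIZE = 2 * 1024 * 1024   # assumption: 2 MiB transparent huge pages

BASE_PAGES_PER_THP = HUGE_PAGE_SIZE // BASE_PAGE_SIZE


def lru_weight(regular_pages, huge_pages, count_base_pages):
    """Size of an LRU list as seen by the balancing code.

    count_base_pages=False models counting every THP as a single page;
    count_base_pages=True models accounting THP by its base pages.
    """
    per_thp = BASE_PAGES_PER_THP if count_base_pages else 1
    return regular_pages + huge_pages * per_thp


print(lru_weight(1000, 100, count_base_pages=False))
print(lru_weight(1000, 100, count_base_pages=True))
```

With these assumed sizes, a list holding mostly huge pages looks far smaller than it really is when each THP is counted once, which skews any balance computed from list sizes.
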
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-1-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20200520232525.798933-2-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5df74196
    • mm: memcontrol: update page->mem_cgroup stability rules · a0b5b414
      Johannes Weiner authored
      The previous patches have simplified the access rules around
      page->mem_cgroup somewhat:
      
      1. We never change page->mem_cgroup while the page is isolated by
         somebody else.  This was by far the biggest exception to our rules and
         it didn't stop at lock_page() or lock_page_memcg().
      
      2. We charge pages before they get put into page tables now, so the
         somewhat fishy rule about "can be in page table as long as it's still
         locked" is now gone and boiled down to having an exclusive reference to
         the page.
      
      Document the new rules.  Any of the following will stabilize the
      page->mem_cgroup association:
      
      - the page lock
      - LRU isolation
      - lock_page_memcg()
      - exclusive access to the page
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-20-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0b5b414
    • mm: memcontrol: delete unused lrucare handling · d9eb1ea2
      Johannes Weiner authored
      Swapin faults were the last event to charge pages after they had already
      been put on the LRU list.  Now that we charge directly on swapin, the
      lrucare portion of the charge code is unused.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9eb1ea2
    • mm: memcontrol: document the new swap control behavior · 0a27cae1
      Alex Shi authored
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-18-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a27cae1
    • mm: memcontrol: charge swapin pages on instantiation · 4c6355b2
      Johannes Weiner authored
      Right now, users that are otherwise memory controlled can easily escape
      their containment and allocate significant amounts of memory that they're
      not being charged for.  That's because swap readahead pages are not being
      charged until somebody actually faults them into their page table.  This
      can be exploited with MADV_WILLNEED, which triggers arbitrary readahead
      allocations without charging the pages.
      
      There are additional problems with the delayed charging of swap pages:
      
      1. To implement refault/workingset detection for anonymous pages, we
         need to have a target LRU available at swapin time, but the LRU is not
         determinable until the page has been charged.
      
      2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be
         stable when the page is isolated from the LRU; otherwise, the locks
         change under us.  But swapcache gets charged after it's already on the
         LRU, even when we cannot isolate it ourselves (since charging is not
         exactly optional).
      
      The previous patch ensured we always maintain cgroup ownership records for
      swap pages.  This patch moves the swapcache charging point from the fault
      handler to swapin time to fix all of the above problems.
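
The containment gap described above can be shown with a toy model. This is purely illustrative and not kernel code: `charged_pages` and its parameters are hypothetical, and the model only counts pages instead of tracking real cgroup state.

```python
def charged_pages(readahead_pages, faulted_pages, charge_on_swapin):
    """Pages charged to the owning cgroup in a toy swapin model.

    readahead_pages: pages instantiated in memory by swap readahead.
    faulted_pages:   subset of those pages later mapped by a page fault.
    charge_on_swapin=False models charging in the fault handler;
    charge_on_swapin=True models charging when the page is instantiated.
    """
    if faulted_pages > readahead_pages:
        raise ValueError("cannot fault more pages than were read in")
    return readahead_pages if charge_on_swapin else faulted_pages


# Readahead pulls in 10000 pages, none of which is ever faulted.
print("charged at fault time: ", charged_pages(10000, 0, charge_on_swapin=False))
print("charged at swapin time:", charged_pages(10000, 0, charge_on_swapin=True))
```

In the fault-time model the memory is resident but charged to nobody, which is the escape the message describes.
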
      
      v2: simplify swapin error checking (Joonsoo)
      
      [hughd@google.com: fix livelock in __read_swap_cache_async()]
        Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005212246080.8458@eggly.anvils
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-17-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c6355b2
    • mm: memcontrol: make swap tracking an integral part of memory control · 2d1c4980
      Johannes Weiner authored
      Without swap page tracking, users that are otherwise memory controlled can
      easily escape their containment and allocate significant amounts of memory
      that they're not being charged for.  That's because swap does readahead,
      but without the cgroup records of who owned the page at swapout, readahead
      pages don't get charged until somebody actually faults them into their
      page table and we can identify an owner task.  This can be maliciously
      exploited with MADV_WILLNEED, which triggers arbitrary readahead
      allocations without charging the pages.
      
      Make swap page tracking an integral part of memcg and remove the
      Kconfig options.  In the first place, it was only made configurable to
      allow users to save some memory.  But the overhead of tracking cgroup
      ownership per swap page is minimal - 2 bytes per page, or 512k per 1G of
      swap, or about 0.05%.  Saving that at the expense of broken containment
      semantics is not something we should present as a coequal option.
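
The overhead figure can be worked out directly. The snippet below is a small arithmetic sketch; the 2-byte record size comes from the message above, while the 4 KiB page size and the helper name are assumptions.

```python
RECORD_BYTES = 2    # per-swap-page ownership record, as stated above
PAGE_SIZE = 4096    # assumption: 4 KiB pages


def tracking_overhead(swap_bytes):
    """Bytes needed to track cgroup ownership for swap_bytes of swap."""
    swap_pages = swap_bytes // PAGE_SIZE
    return swap_pages * RECORD_BYTES


one_gib = 1 << 30
overhead = tracking_overhead(one_gib)
print(f"{overhead // 1024} KiB per GiB of swap "
      f"({overhead / one_gib * 100:.3f}% of the swap size)")
```
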
      
      The swapaccount=0 boot option will continue to exist, and it will
      eliminate the page_counter overhead and hide the swap control files, but
      it won't disable swap slot ownership tracking.
      
      This patch makes sure we always have the cgroup records at swapin time;
      the next patch will fix the actual bug by charging readahead swap pages at
      swapin time rather than at fault time.
      
      v2: fix double swap charge bug in cgroup1/cgroup2 code gating
      
      [hannes@cmpxchg.org: fix crash with cgroup_disable=memory]
        Link: http://lkml.kernel.org/r/20200521215855.GB815153@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
      Link: http://lkml.kernel.org/r/20200508183105.225460-16-hannes@cmpxchg.org
      Debugged-by: Hugh Dickins <hughd@google.com>
      Debugged-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d1c4980
    • mm: memcontrol: prepare swap controller setup for integration · eccb52e7
      Johannes Weiner authored
      A few cleanups to streamline the swap controller setup:
      
      - Replace the do_swap_account flag with cgroup_memory_noswap. This
        brings it in line with other functionality that is usually available
        unless explicitly opted out of - nosocket, nokmem.
      
      - Remove the really_do_swap_account flag that stores the boot option
        and is later used to switch the do_swap_account. It's not clear why
        this indirection is/was necessary. Use do_swap_account directly.
      
      - Minor coding style polishing
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-15-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eccb52e7
    • mm: memcontrol: drop unused try/commit/cancel charge API · f0e45fb4
      Johannes Weiner authored
      There are no more users. RIP in peace.
      
      [arnd@arndb.de: fix an unused-function warning]
        Link: http://lkml.kernel.org/r/20200528095640.151454-1-arnd@arndb.de
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-14-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0e45fb4
    • mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() API · 9d82c694
      Johannes Weiner authored
      With the page->mapping requirement gone from memcg, we can charge anon and
      file-thp pages in one single step, right after they're allocated.
      
      This removes two out of three API calls - especially the tricky commit
      step that needed to happen at just the right time between when the page is
      "set up" and when it's "published" - somewhat vague and fluid concepts
      that varied by page type.  All we need is a freshly allocated page and a
      memcg context to charge.
      
      v2: prevent double charges on pre-allocated hugepages in khugepaged
      
      [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
        Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9d82c694
    • mm: memcontrol: switch to native NR_ANON_THPS counter · 468c3982
      Johannes Weiner authored
      With rmap memcg locking already in place for NR_ANON_MAPPED, it's just a
      small step to remove the MEMCG_RSS_HUGE wart and switch memcg to the
      native NR_ANON_THPS accounting sites.
      
      [hannes@cmpxchg.org: fixes]
        Link: http://lkml.kernel.org/r/20200512121750.GA397968@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>	[build-tested]
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-12-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      468c3982
    • mm: memcontrol: switch to native NR_ANON_MAPPED counter · be5d0a74
      Johannes Weiner authored
      Memcg maintains a private MEMCG_RSS counter.  This divergence from the
      generic VM accounting means unnecessary code overhead, and creates a
      dependency for memcg that page->mapping is set up at the time of charging,
      so that page types can be told apart.
      
      Convert the generic accounting sites to mod_lruvec_page_state and friends
      to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED.  We use
      lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
      same way we do for NR_FILE_MAPPED.
      
      With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
      counter, this patch finally eliminates the need to have page->mapping set
      up at charge time.  However, we need to have page->mem_cgroup set up by
      the time rmap runs and does the accounting, so switch the commit and the
      rmap callbacks around.
      
      v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo)
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be5d0a74
    • Johannes Weiner's avatar
      mm: memcontrol: switch to native NR_FILE_PAGES and NR_SHMEM counters · 0d1c2072
      Johannes Weiner authored
      Memcg maintains private MEMCG_CACHE and NR_SHMEM counters.  This
      divergence from the generic VM accounting means unnecessary code overhead,
      and creates a dependency for memcg that page->mapping is set up at the
      time of charging, so that page types can be told apart.
      
      Convert the generic accounting sites to mod_lruvec_page_state and friends
      to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM.
      The page is already locked in these places, so page->mem_cgroup is stable;
      we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure
      it's set up in time.
      
      Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private
      NR_SHMEM accounting sites.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d1c2072
    • Johannes Weiner's avatar
      mm: memcontrol: prepare cgroup vmstat infrastructure for native anon counters · 9da7b521
      Johannes Weiner authored
      Anonymous compound pages can be mapped by ptes, which means that if we
      want to track NR_ANON_MAPPED, NR_ANON_THPS on a per-cgroup basis, we have
      to be prepared to see tail pages in our accounting functions.
      
      Make mod_lruvec_page_state() and lock_page_memcg() deal with tail pages
      correctly, namely by redirecting to the head page which has the
      page->mem_cgroup set up.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-9-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9da7b521
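The tail-page redirection described in commit 9da7b521 can be illustrated with a small userspace sketch. Everything here is an assumption made for illustration: the toy_ types and helpers are hypothetical stand-ins, not the kernel's struct page, compound_head() or mod_lruvec_page_state().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
struct toy_memcg {
	long nr_anon_mapped;
};

struct toy_page {
	struct toy_page *head;    /* NULL for a head page or an order-0 page */
	struct toy_memcg *memcg;  /* only set up on the head page */
};

/* Resolve any page of a compound page to its head. */
static struct toy_page *toy_compound_head(struct toy_page *page)
{
	return page->head ? page->head : page;
}

/* Account against the owning cgroup, even when handed a tail page. */
static void toy_mod_page_state(struct toy_page *page, long delta)
{
	struct toy_page *head = toy_compound_head(page);

	/* A head page never points at another head. */
	assert(head->head == NULL);
	if (head->memcg)
		head->memcg->nr_anon_mapped += delta;
}
```

Callers can pass whichever subpage a pte happens to map; the counter update always lands on the cgroup recorded in the head page.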
    • Johannes Weiner's avatar
      mm: memcontrol: prepare move_account for removal of private page type counters · 49e50d27
      Johannes Weiner authored
      When memcg uses the generic vmstat counters, it doesn't need to do
      anything at charging and uncharging time.  It does, however, need to
      migrate counts when pages move to a different cgroup in move_account.
      
      Prepare the move_account function for the arrival of NR_FILE_PAGES,
      NR_ANON_MAPPED, NR_ANON_THPS etc.  by having a branch for files and a
      branch for anon, which can then be divided into sub-branches.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-8-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49e50d27
    • Johannes Weiner's avatar
      mm: memcontrol: prepare uncharging for removal of private page type counters · 9f762dbe
      Johannes Weiner authored
      The uncharge batching code adds up the anon, file, kmem counts to
      determine the total number of pages to uncharge and references to drop.
      But the next patches will remove the anon and file counters.
      
      Maintain an aggregate nr_pages in the uncharge_gather struct.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-7-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f762dbe
    • Johannes Weiner's avatar
      mm: memcontrol: convert page cache to a new mem_cgroup_charge() API · 3fea5a49
      Johannes Weiner authored
      The try/commit/cancel protocol that memcg uses dates back to when pages
      used to be uncharged upon removal from the page cache, and thus couldn't
      be committed before the insertion had succeeded.  Nowadays, pages are
      uncharged when they are physically freed; it doesn't matter whether the
      insertion was successful or not.  For the page cache, the transaction
      dance has become unnecessary.
      
      Introduce a mem_cgroup_charge() function that simply charges a newly
      allocated page to a cgroup and sets up page->mem_cgroup in one single
      step.  If the insertion fails, the caller doesn't have to do anything but
      free/put the page.
      
      Then switch the page cache over to this new API.
      
      Subsequent patches will also convert anon pages, but it needs a bit more
      prep work.  Right now, memcg depends on page->mapping being already set up
      at the time of charging, so that it can maintain its own MEMCG_CACHE and
      MEMCG_RSS counters.  For anon, page->mapping is set under the same pte
      lock under which the page is published, so a single charge point that can
      block doesn't work there just yet.
      
      The following prep patches will replace the private memcg counters with
      the generic vmstat counters, thus removing the page->mapping dependency,
      then complete the transition to the new single-point charge API and delete
      the old transactional scheme.
      
      v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
      v3: rebase on preceding shmem simplification patch
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fea5a49
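The single-step charge model described in commit 3fea5a49 can be sketched in userspace as follows. This is a minimal illustration under stated assumptions: toy_charge() and toy_free_page() are hypothetical names, and the limit check is a simplification rather than the actual mem_cgroup_charge() logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature cgroup and page; not the kernel definitions. */
struct toy_cgroup {
	long usage;
	long limit;
};

struct toy_page {
	struct toy_cgroup *owner;
};

/* Single-step charge: reserve one page and record ownership together. */
static int toy_charge(struct toy_page *page, struct toy_cgroup *cg)
{
	assert(page->owner == NULL);  /* a page is charged at most once */
	if (cg->usage + 1 > cg->limit)
		return -1;            /* over limit; nothing to cancel */
	cg->usage += 1;
	page->owner = cg;
	return 0;
}

/* Uncharge happens when the page is freed, whether or not insertion worked. */
static void toy_free_page(struct toy_page *page)
{
	if (page->owner) {
		page->owner->usage -= 1;
		page->owner = NULL;
	}
}
```

Because the uncharge is tied to freeing, a caller whose page cache insertion fails after a successful charge only has to free the page; there is no separate cancel step to remember.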
    • Johannes Weiner's avatar
      mm: memcontrol: move out cgroup swaprate throttling · 6caa6a07
      Johannes Weiner authored
      The cgroup swaprate throttling is about matching new anon allocations to
      the rate of available IO when that is being throttled.  It's the io
      controller hooking into the VM, rather than a memory controller thing.
      
      Rename mem_cgroup_throttle_swaprate() to cgroup_throttle_swaprate(), and
      drop the @memcg argument which is only used to check whether the preceding
      page charge has succeeded and the fault is proceeding.
      
      We could decouple the call from mem_cgroup_try_charge() here as well, but
      that would cause unnecessary churn: the following patches convert all
      callsites to a new charge API and we'll decouple as we go along.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-5-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6caa6a07
    • Johannes Weiner's avatar
      mm: shmem: remove rare optimization when swapin races with hole punching · 14235ab3
      Johannes Weiner authored
      Commit 215c02bc ("tmpfs: fix shmem_getpage_gfp() VM_BUG_ON")
      recognized that hole punching can race with swapin and removed the
      BUG_ON() for a truncated entry from the swapin path.
      
      The patch also added a swapcache deletion to optimize this rare case:
      Since swapin has the page locked, and free_swap_and_cache() merely
      trylocks, this situation can leave the page stranded in swapcache.
      Usually, page reclaim picks up stale swapcache pages, and the race can
      happen at any other time when the page is locked.  (The same happens for
      non-shmem swapin racing with page table zapping.) The thinking here was:
      we already observed the race and we have the page locked, we may as well
      do the cleanup instead of waiting for reclaim.
      
      However, this optimization complicates the next patch which moves the
      cgroup charging code around.  As this is just a minor speedup for a race
      condition that is so rare that it required a fuzzer to trigger the
      original BUG_ON(), it's no longer worth the complications.
      Suggested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200511181056.GA339505@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14235ab3
    • Johannes Weiner's avatar
      mm: memcontrol: drop @compound parameter from memcg charging API · 3fba69a5
      Johannes Weiner authored
      The memcg charging API carries a boolean @compound parameter that tells
      whether the page we're dealing with is a hugepage.
      mem_cgroup_commit_charge() has another boolean @lrucare that indicates
      whether the page needs LRU locking or not while charging.  The majority of
      callsites know those parameters at compile time, which results in a lot of
      naked "false, false" argument lists.  This makes for cryptic code and is a
      breeding ground for subtle mistakes.
      
      Thankfully, the huge page state can be inferred from the page itself and
      doesn't need to be passed along.  This is safe because charging completes
      before the page is published and somebody may split it.
      
      Simplify the callsites by removing @compound, and let memcg infer the
      state by using hpage_nr_pages() unconditionally.  That function does
      PageTransHuge() to identify huge pages, which also helpfully asserts that
      nobody passes in tail pages by accident.
      
      The following patches will introduce a new charging API, so it is best
      not to carry over unnecessary weight.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fba69a5
    • Johannes Weiner's avatar
      mm: memcontrol: fix stat-corrupting race in charge moving · abb242f5
      Johannes Weiner authored
      The move_lock is a per-memcg lock, but the VM accounting code that needs
      to acquire it comes from the page and follows page->mem_cgroup under RCU
      protection.  That means that the page becomes unlocked not when we drop
      the move_lock, but when we update page->mem_cgroup.  And that assignment
      doesn't imply any memory ordering.  If that pointer write gets reordered
      against the reads of the page state (page_mapped, PageDirty etc.), the
      state may change while we rely on it being stable and we can end up
      corrupting the counters.
      
      Place an SMP memory barrier to make sure we're done with all page state by
      the time the new page->mem_cgroup becomes visible.
      
      Also replace the open-coded move_lock with a lock_page_memcg() to make it
      more obvious what we're serializing against.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-3-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abb242f5
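The ordering requirement in commit abb242f5 can be approximated in portable C. Note the substitution: the kernel patch places an SMP memory barrier, whereas this sketch uses a C11 release store paired with an acquire load, which gives the same guarantee that the page state reads finish before the new owner pointer becomes visible. All toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical miniature cgroup and page; not the kernel definitions. */
struct toy_memcg {
	long nr_dirty;
};

struct toy_page {
	bool dirty;
	_Atomic(struct toy_memcg *) memcg;
};

/* Reader side: the acquire load pairs with the release store below. */
static struct toy_memcg *toy_page_memcg(struct toy_page *page)
{
	return atomic_load_explicit(&page->memcg, memory_order_acquire);
}

/* Move the page's accounting from one cgroup to another. */
static void toy_move_account(struct toy_page *page, struct toy_memcg *from,
			     struct toy_memcg *to)
{
	assert(toy_page_memcg(page) == from);

	/* Read page state while the old owner is still the visible one. */
	if (page->dirty) {
		from->nr_dirty--;
		to->nr_dirty++;
	}

	/* Release store: the accesses above cannot be reordered past it. */
	atomic_store_explicit(&page->memcg, to, memory_order_release);
}
```

Once another thread observes the new pointer through toy_page_memcg(), the counter transfer is already complete, so it cannot act on a half-moved page.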
    • Johannes Weiner's avatar
      mm: fix NUMA node file count error in replace_page_cache() · f4129ea3
      Johannes Weiner authored
      Patch series "mm: memcontrol: charge swapin pages on instantiation", v2.
      
      This patch series reworks memcg to charge swapin pages directly at
      swapin time, rather than at fault time, which may be much later, or
      not happen at all.
      
      Changes in version 2:
      - prevent double charges on pre-allocated hugepages in khugepaged
      - leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
      - fix temporary accounting bug by switching rmap<->commit (Joonsoo)
      - fix double swap charge bug in cgroup1/cgroup2 code gating
      - simplify swapin error checking (Joonsoo)
      - mm: memcontrol: document the new swap control behavior (Alex)
      - review tags
      
      The delayed swapin charging scheme we have right now causes problems:
      
      - Alex's per-cgroup lru_lock patches rely on pages that have been
        isolated from the LRU to have a stable page->mem_cgroup; otherwise
        the lock may change underneath him. Swapcache pages are charged only
        after they are added to the LRU, and charging doesn't follow the LRU
        isolation protocol.
      
      - Joonsoo's anon workingset patches need a suitable LRU at the time
        the page enters the swap cache and displaces the non-resident
        info. But the correct LRU is only available after charging.
      
      - It's a containment hole / DoS vector. Users can trigger arbitrarily
        large swap readahead using MADV_WILLNEED. The memory is never
        charged unless somebody actually touches it.
      
      - It complicates the page->mem_cgroup stabilization rules
      
      In order to charge pages directly at swapin time, the memcg code base
      needs to be prepared, and several overdue cleanups become a necessity:
      
      To charge pages at swapin time, we need to always have cgroup
      ownership tracking of swap records. We also cannot rely on
      page->mapping to tell apart page types at charge time, because that's
      only set up during a page fault.
      
      To eliminate the page->mapping dependency, memcg needs to ditch its
      private page type counters (MEMCG_CACHE, MEMCG_RSS, NR_SHMEM) in favor
      of the generic vmstat counters and accounting sites, such as
      NR_FILE_PAGES, NR_ANON_MAPPED etc.
      
      To switch to generic vmstat counters, the charge sequence must be
      adjusted such that page->mem_cgroup is set up by the time these
      counters are modified.
      
      The series is structured as follows:
      
      1. Bug fixes
      2. Decoupling charging from rmap
      3. Swap controller integration into memcg
      4. Direct swapin charging
      
      This patch (of 19):
      
      When replacing one page with another one in the cache, we have to decrease
      the file count of the old page's NUMA node and increase the one of the new
      NUMA node, otherwise the old node leaks the count and the new node
      eventually underflows its counter.
      
      Fixes: 74d60958 ("page cache: Add and replace pages using the XArray")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-1-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20200508183105.225460-2-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4129ea3
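The per-node accounting fix in commit f4129ea3 boils down to adjusting two counters instead of assuming both pages sit on the same node. A minimal sketch, assuming a hypothetical fixed-size per-node array in place of the kernel's node vmstat counters:

```c
#include <assert.h>

#define TOY_MAX_NODES 4

/* Hypothetical per-node file page counters. */
static long toy_node_file_pages[TOY_MAX_NODES];

struct toy_page {
	int nid;  /* NUMA node the page lives on */
};

/* Replace oldp with newp in the cache, keeping per-node counts balanced. */
static void toy_replace_page(const struct toy_page *oldp,
			     const struct toy_page *newp)
{
	assert(oldp->nid >= 0 && oldp->nid < TOY_MAX_NODES);
	assert(newp->nid >= 0 && newp->nid < TOY_MAX_NODES);

	toy_node_file_pages[oldp->nid]--;  /* old node gives up the page */
	toy_node_file_pages[newp->nid]++;  /* new node takes it over */
}
```

When both pages are on the same node the two updates cancel out; when they differ, each node's count follows the page that actually resides there.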
    • Qiwu Chen's avatar
      mm/vmscan: update the comment of should_continue_reclaim() · df3a45f9
      Qiwu Chen authored
      try_to_compact_zone() has been replaced by try_to_compact_pages(), so
      the comment of should_continue_reclaim() needs to be updated accordingly.
      Signed-off-by: Qiwu Chen <chenqiwu@xiaomi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200501034907.22991-1-chenqiwu@xiaomi.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df3a45f9
    • Maninder Singh's avatar
      mm/vmscan.c: change prototype for shrink_page_list · 730ec8c0
      Maninder Singh authored
      commit 3c710c1a ("mm, vmscan extract shrink_page_list reclaim counters
      into a struct") changed the data type for the function, so change the
      return type of the function and its caller accordingly.
      Signed-off-by: Vaneet Narang <v.narang@samsung.com>
      Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Amit Sahrawat <a.sahrawat@samsung.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/1588168259-25604-1-git-send-email-maninder1.s@samsung.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      730ec8c0
    • Jaewon Kim's avatar
      mm/vmscan: count lazyfree pages and fix nr_isolated_* mismatch · 1f318a9b
      Jaewon Kim authored
      Fix an nr_isolated_* mismatch problem between cma and dirty lazyfree pages.
      
      If try_to_unmap_one is used for reclaim and it detects a dirty lazyfree
      page, then the lazyfree page is changed to a normal anon page having
      SwapBacked by commit 802a3a92 ("mm: reclaim MADV_FREE pages").  Even
      with the change, reclaim context correctly counts isolated files because
      it uses is_file_lru to distinguish file.  And the change to anon does not
      happen if try_to_unmap_one is used for migration.  So migration context
      like compaction also correctly counts isolated files even though it uses
      page_is_file_lru instead of is_file_lru.  Recently page_is_file_cache was
      renamed to page_is_file_lru by commit 9de4f22a ("mm: code cleanup for
      MADV_FREE").
      
      But the nr_isolated_* mismatch problem happens on cma alloc.  There is
      reclaim_clean_pages_from_list which is being used only by cma.  It was
      introduced by commit 02c6de8d ("mm: cma: discard clean pages during
      contiguous allocation instead of migration") to reclaim clean file pages
      without migration.  The cma alloc uses both reclaim_clean_pages_from_list
      and migrate_pages, and it uses page_is_file_lru to count isolated files.
      If there are dirty lazyfree pages allocated from cma memory region, the
      pages are counted as isolated file at the beginning but are counted as
      isolated anon once the operation has finished.
      
      Mem-Info:
      Node 0 active_anon:3045904kB inactive_anon:611448kB active_file:14892kB inactive_file:205636kB unevictable:10416kB isolated(anon):0kB isolated(file):37664kB mapped:630216kB dirty:384kB writeback:0kB shmem:42576kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
      
      As the log above shows, there were too many isolated file pages, 37664kB,
      which triggers too_many_isolated in reclaim even when no file pages are
      actually isolated system-wide.  It can be reproduced by running two
      programs, one writing to MADV_FREE pages and one doing cma alloc.
      Although isolated anon is 0, I found that the internal value of isolated
      anon was the negative value of isolated file.
      
      Fix this by compensating the isolated count for both LRU lists.  Count
      non-discarded lazyfree pages in shrink_page_list, then compensate the
      counted number in reclaim_clean_pages_from_list.
      Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200426011718.30246-1-jaewon31.kim@samsung.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f318a9b
    • Wei Yang's avatar
      mm/vmscan.c: use update_lru_size() in update_lru_sizes() · a892cb6b
      Wei Yang authored
      We already defined the helper update_lru_size().
      
      Let's use this to reduce code duplication.
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200331221550.1011-1-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a892cb6b
    • Matthew Wilcox (Oracle)'s avatar
      mm: simplify calling a compound page destructor · ff45fc3c
      Matthew Wilcox (Oracle) authored
      None of the three callers of get_compound_page_dtor() want to know the
      value; they just want to call the function.  Replace it with
      destroy_compound_page() which calls the dtor for them.
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200517105051.9352-1-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff45fc3c
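The change in commit ff45fc3c replaces a "fetch the destructor, then call it" pattern with one helper that does both. A rough userspace sketch of that shape, assuming a hypothetical destructor table and page type (the toy_ names are not the kernel's):

```c
#include <assert.h>

/* Hypothetical destructor identifiers stored in the page. */
enum toy_dtor_id { TOY_COUNTING_DTOR, TOY_NR_DTORS };

struct toy_page {
	enum toy_dtor_id dtor;
	int destroyed;
};

typedef void toy_dtor_t(struct toy_page *);

/* Example destructor: just records that it ran. */
static void toy_counting_dtor(struct toy_page *page)
{
	page->destroyed++;
}

static toy_dtor_t *const toy_dtors[TOY_NR_DTORS] = {
	[TOY_COUNTING_DTOR] = toy_counting_dtor,
};

/* Look up and call the destructor in one step; callers never see the pointer. */
static void toy_destroy_compound_page(struct toy_page *page)
{
	assert(page->dtor < TOY_NR_DTORS);
	toy_dtors[page->dtor](page);
}
```

Keeping the lookup private to the helper means the bounds check lives in one place instead of being repeated, or forgotten, at each call site.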
    • Anshuman Khandual's avatar
      mm/hugetlb: define a generic fallback for arch_clear_hugepage_flags() · 5be99343
      Anshuman Khandual authored
      There are multiple similar definitions for arch_clear_hugepage_flags() on
      various platforms.  Let's just add a generic fallback definition for
      platforms that do not override it.  This helps reduce code duplication.
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: http://lkml.kernel.org/r/1588907271-11920-4-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5be99343
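A generic fallback of the kind commit 5be99343 describes is commonly written with a preprocessor guard, so that an architecture can supply its own version by defining a same-named macro first. The sketch below shows only that pattern; the toy_ names and the flags field are hypothetical, and the real definition may differ in detail.

```c
#include <assert.h>

/* Hypothetical miniature page; not the kernel's struct page. */
struct toy_page {
	unsigned long flags;
};

/*
 * An architecture with special needs would define toy_arch_clear_flags()
 * and a macro of the same name before this point. Everyone else gets the
 * empty fallback.
 */
#ifndef toy_arch_clear_flags
static inline void toy_arch_clear_flags(struct toy_page *page)
{
	(void)page;  /* generic fallback: no arch-specific flags to clear */
}
#define toy_arch_clear_flags toy_arch_clear_flags
#endif

/* Generic code calls the hook unconditionally. */
static void toy_free_huge_page(struct toy_page *page)
{
	assert(page);
	toy_arch_clear_flags(page);  /* a no-op unless overridden */
	page->flags = 0;             /* generic cleanup */
}
```

With the guard in a shared header, each platform that previously carried its own empty stub can simply delete it.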
    • Anshuman Khandual's avatar
      mm/hugetlb: define a generic fallback for is_hugepage_only_range() · b0eae98c
      Anshuman Khandual authored
      There are multiple similar definitions for is_hugepage_only_range() on
      various platforms.  Let's just add a generic fallback definition for
      platforms that do not override it.  This helps reduce code duplication.
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: http://lkml.kernel.org/r/1588907271-11920-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0eae98c
    • Anshuman Khandual's avatar
      arm64/mm: drop __HAVE_ARCH_HUGE_PTEP_GET · be51e3fd
      Anshuman Khandual authored
      Patch series "mm/hugetlb: Add some new generic fallbacks", v3.
      
      This series adds the following new generic fallbacks.  Before that it
      drops __HAVE_ARCH_HUGE_PTEP_GET from arm64 platform.
      
      1. is_hugepage_only_range()
      2. arch_clear_hugepage_flags()
      
      After this, arm (32 bit) remains the sole platform defining its own
      huge_ptep_get() via __HAVE_ARCH_HUGE_PTEP_GET.
      
      This patch (of 3):
      
      Platform specific huge_ptep_get() is required only when fetching the huge
      PTE involves more than just dereferencing the page table pointer.  This is
      not the case on arm64 platform.  Hence huge_ptep_get() can be dropped
      along with its __HAVE_ARCH_HUGE_PTEP_GET subscription.  Before that, it
      updates the generic huge_ptep_get() with READ_ONCE() which will prevent
      known page table issues with THP on arm64.
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/1588907271-11920-1-git-send-email-anshuman.khandual@arm.com
      Link: http://lkml.kernel.org/r//1506527369-19535-1-git-send-email-will.deacon@arm.com/
      Link: http://lkml.kernel.org/r/1588907271-11920-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be51e3fd
    • Li Xinhai's avatar
      mm/hugetlb: avoid unnecessary check on pud and pmd entry in huge_pte_offset · 8ac0b81a
      Li Xinhai authored
      When huge_pte_offset() is called, the parameter sz can only be PUD_SIZE or
      PMD_SIZE.  If sz is PUD_SIZE and code can reach pud, then *pud must be
      none, or normal hugetlb entry, or non-present (migration or hwpoisoned)
      hugetlb entry, and we can directly return pud.  When sz is PMD_SIZE, pud
      must be none or present, and if code can reach pmd, we can directly return
      pmd.
      
      So this patch simplifies the code by checking the parameter sz first,
      which avoids the unnecessary checks in the current code.  The semantics
      of the existing code are maintained.
      
      More details about relevant commits:
      commit 9b19df29 ("mm/hugetlb.c: make huge_pte_offset() consistent
      and document behaviour") changed the code path for pud and pmd handling,
      see comments about why this patch intends to change it.
      ...
      	pud = pud_offset(p4d, addr);
      	if (sz != PUD_SIZE && pud_none(*pud)) // [1]
      		return NULL;
      	/* hugepage or swap? */
      	if (pud_huge(*pud) || !pud_present(*pud)) // [2]
      		return (pte_t *)pud;
      
      	pmd = pmd_offset(pud, addr);
      	if (sz != PMD_SIZE && pmd_none(*pmd)) // [3]
      		return NULL;
      	/* hugepage or swap? */
      	if (pmd_huge(*pmd) || !pmd_present(*pmd)) // [4]
      		return (pte_t *)pmd;
      
      	return NULL; // [5]
      ...
      [1]: this is necessary, return NULL for sz == PMD_SIZE;
      [2]: if sz == PUD_SIZE, all valid values of pud entry will cause return;
      [3]: dead code, sz != PMD_SIZE never true;
      [4]: all valid values of pmd entry will cause return;
      [5]: dead code, because of check in [4].
      
      Now, this patch combines [1] and [2] for pud, and combines [3], [4] and
      [5] for pmd, which avoids the unnecessary checks.
      
      I don't try to catch any invalid values in the page table entry, as the
      caller checks for those and this avoids an extra branch in this function.
      There is also no assert that sz must equal PUD_SIZE or PMD_SIZE, since
      this function is only called for hugetlb mappings.
      
      For commit 3c1d7e6c ("mm/hugetlb: fix a addressing exception caused by
      huge_pte_offset"), since we don't read the entry more than once now,
      the variables pud_entry and pmd_entry are no longer needed.
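
      The resulting control flow can be modelled in userspace roughly as
      follows.  This is a sketch under stated assumptions, not the kernel
      function: struct entry, huge_pte_offset_model() and the size constants
      are hypothetical, and the real page table walk above the pud is left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define PMD_SIZE (1UL << 21)	/* assumed: 2M, as on x86-64 */
#define PUD_SIZE (1UL << 30)	/* assumed: 1G, as on x86-64 */

/* Hypothetical, flattened page table slot used only for this sketch. */
struct entry {
	bool present;
	struct entry *next;	/* slot in the next lower table, if any */
};

/* Decide on sz first, then return the slot without inspecting it again. */
static void *huge_pte_offset_model(struct entry *pud, unsigned long sz)
{
	if (sz == PUD_SIZE)
		return pud;	/* huge, non-present or none: caller checks */
	if (!pud->present)
		return NULL;	/* no pmd table to descend into */
	return pud->next;	/* pmd slot: huge, non-present or none */
}
```

      In this model the only remaining branch besides the size test is the
      pud presence check needed before descending to the pmd.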
      Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Punit Agrawal <punit.agrawal@arm.com>
      Cc: Longpeng <longpeng2@huawei.com>
      Link: http://lkml.kernel.org/r/1587794313-16849-1-git-send-email-lixinhai.lxh@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ac0b81a
    • Mike Kravetz's avatar
      hugetlbfs: fix changes to command line processing · c2833a5b
      Mike Kravetz authored
      Previously, a check for hugepages_supported was added before processing
      hugetlb command line parameters.  On some architectures such as powerpc,
      hugepages_supported() is not set to true until after command line
      processing.  Therefore, no hugetlb command line parameters would be
      accepted.
      
      Remove the additional checks for hugepages_supported.  In hugetlb_init,
      print a warning if !hugepages_supported and command line parameters were
      specified.
      Reported-by: Sandipan Das <sandipan.osd@gmail.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/b1f04f9f-fa46-c2a0-7693-4a0679d2a1ee@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2833a5b
    • Mike Kravetz's avatar
      hugetlbfs: clean up command line processing · 282f4214
      Mike Kravetz authored
      With all hugetlb page processing done in a single file, clean up the code.
      
      - Make code match desired semantics
        - Update documentation with semantics
      - Make all warning and error messages start with 'HugeTLB:'.
      - Consistently name command line parsing routines.
      - Warn if !hugepages_supported() and command line parameters have
        been specified.
      - Add comments to code
        - Describe some of the subtle interactions
        - Describe semantics of command line arguments
      
      This patch also fixes issues with implicitly setting the number of
      gigantic huge pages to preallocate.  Previously, on x86, the command line
      
              hugepages=2 default_hugepagesz=1G
      
      would result in zero 1G pages being preallocated and,
      
              # grep HugePages_Total /proc/meminfo
              HugePages_Total:       0
              # sysctl -a | grep nr_hugepages
              vm.nr_hugepages = 2
              vm.nr_hugepages_mempolicy = 2
              # cat /proc/sys/vm/nr_hugepages
              2
      
      After this patch 2 gigantic pages will be preallocated and all the proc,
      sysfs, sysctl and meminfo files will accurately reflect this.
      
      To address the issue with gigantic pages, a small change in behavior was
      made to command line processing.  Previously the command line,
      
              hugepages=128 default_hugepagesz=2M hugepagesz=2M hugepages=256
      
      would result in the allocation of 256 2M huge pages.  The value 128 would
      be ignored without any warning.  After this patch, 128 2M pages will be
      allocated and a warning message will be displayed indicating the value of
      256 is ignored.  This change in behavior is required because allocation of
      implicitly specified gigantic pages must be done when the
      default_hugepagesz= is encountered for gigantic pages.  Previously the
      code waited until later in the boot process (hugetlb_init), to allocate
      pages of default size.  However the bootmem allocator required for
      gigantic allocations is not available at this time.
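
      The ordering rule described above can be sketched with a small model.
      Everything here is hypothetical (struct hugetlb_cmdline, the parse_*
      helpers and the warning texts are illustration only, not the kernel
      code), and the model tracks the default size precisely while lumping
      all other sizes together.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model state; do not copy the struct, it points into itself. */
struct hugetlb_cmdline {
	unsigned long pending;		/* hugepages= seen before any size option */
	unsigned long default_size;	/* 0 until default_hugepagesz= is parsed */
	unsigned long default_pages;	/* count bound to the default size */
	unsigned long other_pages;	/* counts for any non-default size */
	unsigned long *cur;		/* where the next hugepages= value goes */
	bool size_ok;			/* was the last size option accepted? */
};

static void cmdline_init(struct hugetlb_cmdline *c)
{
	c->pending = c->default_size = c->default_pages = c->other_pages = 0;
	c->cur = &c->pending;
	c->size_ok = true;
}

static void parse_hugepages(struct hugetlb_cmdline *c, unsigned long n)
{
	if (!c->size_ok) {
		/* The preceding size option was rejected, so drop this count. */
		fprintf(stderr, "HugeTLB: hugepages=%lu ignored\n", n);
		c->size_ok = true;
		return;
	}
	*c->cur = n;
}

static void parse_default_hugepagesz(struct hugetlb_cmdline *c,
				     unsigned long size)
{
	c->default_size = size;
	if (c->pending) {
		/* Bind the count now; gigantic pages would be allocated here. */
		c->default_pages = c->pending;
		c->pending = 0;
	}
	c->cur = &c->default_pages;
}

static void parse_hugepagesz(struct hugetlb_cmdline *c, unsigned long size)
{
	if (size == c->default_size && c->default_pages) {
		fprintf(stderr, "HugeTLB: hugepagesz= specified twice, ignoring\n");
		c->size_ok = false;
		return;
	}
	c->size_ok = true;
	c->cur = (size == c->default_size) ? &c->default_pages : &c->other_pages;
}
```

      Feeding the model the options hugepages=128, default_hugepagesz=2M,
      hugepagesz=2M and hugepages=256 in that order leaves 128 pages bound to
      the default size, with the later 256 dropped after a warning.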
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Sandipan Das <sandipan@linux.ibm.com>
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[s390]
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Longpeng <longpeng2@huawei.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Anders Roxell <anders.roxell@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200417185049.275845-5-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      282f4214
    • Mike Kravetz's avatar
      hugetlbfs: remove hugetlb_add_hstate() warning for existing hstate · 38237830
      Mike Kravetz authored
      hugetlb_add_hstate() prints a warning if the hstate already exists.  This
      was originally done as part of kernel command line parsing.  If
      'hugepagesz=' was specified more than once, the warning
      
      	pr_warn("hugepagesz= specified twice, ignoring\n");
      
      would be printed.
      
      Some architectures want to enable all huge page sizes.  They would call
      hugetlb_add_hstate for all supported sizes.  However, this was done after
      command line processing and as a result hstates could have already been
      created for some sizes.  To make sure no warnings were printed, there would
      often be code like:
      
      	if (!size_to_hstate(size))
      		hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);
      
      The only time we want to print the warning is as the result of command
      line processing.  So, remove the warning from hugetlb_add_hstate and add
      it to the single arch independent routine processing "hugepagesz=".  After
      this, calls to size_to_hstate() in arch specific code can be removed and
      hugetlb_add_hstate can be called without worrying about warning messages.
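
      A minimal userspace model of this split is shown below.  The _model
      names, the fixed-size table and the size-based signatures are
      assumptions made for the sketch; the real hugetlb_add_hstate() takes a
      page order, as in the snippet above.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_HSTATES 4	/* hypothetical small table for this sketch */

static unsigned long hstate_sizes[MAX_HSTATES];
static int nr_hstates;

static bool size_to_hstate_model(unsigned long size)
{
	for (int i = 0; i < nr_hstates; i++)
		if (hstate_sizes[i] == size)
			return true;
	return false;
}

/* Silent if the size is already registered; true only when it adds one. */
static bool add_hstate_model(unsigned long size)
{
	if (size_to_hstate_model(size) || nr_hstates == MAX_HSTATES)
		return false;
	hstate_sizes[nr_hstates++] = size;
	return true;
}

/* Only the command line path warns about a repeated hugepagesz=. */
static bool hugepagesz_setup_model(unsigned long size)
{
	if (size_to_hstate_model(size)) {
		fprintf(stderr, "hugepagesz= specified twice, ignoring\n");
		return false;
	}
	return add_hstate_model(size);
}
```

      With this shape, arch code can call the add routine for every size it
      supports without first checking whether the size is already present.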
      
      [mike.kravetz@oracle.com: fix hugetlb initialization]
        Link: http://lkml.kernel.org/r/4c36c6ce-3774-78fa-abc4-b7346bf24348@oracle.com
        Link: http://lkml.kernel.org/r/20200428205614.246260-5-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Anders Roxell <anders.roxell@linaro.org>
      Acked-by: Mina Almasry <almasrymina@google.com>
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[s390]
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Longpeng <longpeng2@huawei.com>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200417185049.275845-4-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20200428205614.246260-4-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38237830
    • Mike Kravetz's avatar
      hugetlbfs: move hugepagesz= parsing to arch independent code · 359f2544
      Mike Kravetz authored
      Now that architectures provide arch_hugetlb_valid_size(), parsing of
      "hugepagesz=" can be done in architecture independent code.  Create a
      single routine to handle hugepagesz= parsing and remove all arch specific
      routines.  We can also remove the interface hugetlb_bad_size() as this is
      no longer used outside arch independent code.
      
      This also provides consistent behavior of hugetlbfs command line options.
      The hugepagesz= option should only be specified once for a specific size,
      but some architectures allow multiple instances.  This appears to be more
      of an oversight when code was added by some architectures to set up ALL
      huge page sizes.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Sandipan Das <sandipan@linux.ibm.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Acked-by: Mina Almasry <almasrymina@google.com>
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[s390]
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Longpeng <longpeng2@huawei.com>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Anders Roxell <anders.roxell@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200417185049.275845-3-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20200428205614.246260-3-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      359f2544
    • Mike Kravetz's avatar
      hugetlbfs: add arch_hugetlb_valid_size · ae94da89
      Mike Kravetz authored
      Patch series "Clean up hugetlb boot command line processing", v4.
      
      Longpeng(Mike) reported a weird message from hugetlb command line
      processing and proposed a solution [1].  While the proposed patch does
      address the specific issue, there are other related issues in command line
      processing.  As hugetlbfs evolved, updates to command line processing have
      been made to meet immediate needs and not necessarily in a coordinated
      manner.  The result is that some processing is done in arch specific code,
      some is done in arch independent code and coordination is problematic.
      Semantics can vary between architectures.
      
      The patch series does the following:
      - Define arch specific arch_hugetlb_valid_size routine used to validate
        passed huge page sizes.
      - Move hugepagesz= command line parsing out of arch specific code and into
        an arch independent routine.
      - Clean up command line processing to follow desired semantics and
        document those semantics.
      
      [1] https://lore.kernel.org/linux-mm/20200305033014.1152-1-longpeng2@huawei.com
      
      This patch (of 3):
      
      The architecture independent routine hugetlb_default_setup sets up the
      default huge page size.  It has no way to verify if the passed value is
      valid, so it accepts it and attempts to validate it at a later time.  This
      requires undocumented cooperation between the arch specific and arch
      independent code.
      
      For architectures that support more than one huge page size, provide a
      routine arch_hugetlb_valid_size to validate a huge page size.
      hugetlb_default_setup can use this to validate passed values.
      
      arch_hugetlb_valid_size will also be used in a subsequent patch to move
      processing of the "hugepagesz=" in arch specific code to a common routine
      in arch independent code.
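
      As an illustration of the idea, the sketch below models an arch hook
      that accepts only the sizes the platform supports, and a default-size
      setup routine that consults it.  The _model names, the cpu_has_gbpages
      flag and the size constants are assumptions for this example, loosely
      patterned on a platform with 2M and optional 1G huge pages.

```c
#include <stdbool.h>

#define PMD_SIZE (1UL << 21)	/* assumed: 2M huge page */
#define PUD_SIZE (1UL << 30)	/* assumed: 1G huge page */

/* Hypothetical flag standing in for a CPU feature check for 1G pages. */
static bool cpu_has_gbpages = true;

/* Arch side: report whether this platform can back the given size. */
static bool arch_hugetlb_valid_size_model(unsigned long size)
{
	if (size == PMD_SIZE)
		return true;
	return size == PUD_SIZE && cpu_has_gbpages;
}

/* Generic side: validate up front instead of accepting and checking later. */
static bool default_hugepagesz_setup_model(unsigned long size,
					   unsigned long *default_size)
{
	if (!arch_hugetlb_valid_size_model(size))
		return false;
	*default_size = size;
	return true;
}
```

      The generic routine needs no knowledge of which sizes exist; it only
      asks the arch hook and rejects anything the hook does not accept.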
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[s390]
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Longpeng <longpeng2@huawei.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Anders Roxell <anders.roxell@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200428205614.246260-1-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20200428205614.246260-2-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20200417185049.275845-1-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20200417185049.275845-2-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae94da89
    • Kirill A. Shutemov's avatar
      khugepaged: introduce 'max_ptes_shared' tunable · 71a2c112
      Kirill A. Shutemov authored
      'max_ptes_shared' specifies how many pages can be shared across multiple
      processes.  Exceeding the number would block the collapse::
      
      	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared
      
      A higher value may increase memory footprint for some workloads.
      
      By default, at least half of the pages must not be shared.
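
      The threshold can be pictured with the following userspace sketch.  It
      assumes 512 base pages per huge page (2M over 4k) and a hypothetical
      collapse_allowed() helper that counts pages mapped more than once; it
      is a model of the rule, not the khugepaged scan code.

```c
#include <stdbool.h>
#include <stddef.h>

#define HPAGE_PMD_NR 512	/* assumed: 2M huge page / 4k base page */

/* Default from the text: at most half of the pages may be shared. */
static unsigned int max_ptes_shared = HPAGE_PMD_NR / 2;

/* mapcount[i] > 1 means base page i is mapped by more than one process. */
static bool collapse_allowed(const int *mapcount, size_t n)
{
	unsigned int shared = 0;

	for (size_t i = 0; i < n; i++)
		if (mapcount[i] > 1 && ++shared > max_ptes_shared)
			return false;	/* too many shared pages, give up */
	return true;
}
```

      Raising max_ptes_shared in this model lets ranges with more shared
      pages collapse, which matches the note above that a higher value may
      increase memory footprint.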
      
      [colin.king@canonical.com: fix several spelling mistakes]
        Link: http://lkml.kernel.org/r/20200420084241.65433-1-colin.king@canonical.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Link: http://lkml.kernel.org/r/20200416160026.16538-9-kirill.shutemov@linux.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71a2c112
    • Kirill A. Shutemov's avatar
      thp: change CoW semantics for anon-THP · 3917c802
      Kirill A. Shutemov authored
      Currently we have different copy-on-write semantics for anon- and
      file-THP.  For anon-THP we try to allocate a huge page on the write fault,
      but for file-THP we split the PMD and allocate a 4k page.
      
      Arguably, the file-THP semantics are more desirable: we don't necessarily
      want to unshare the full PMD range from the parent on the first access.
      This is the primary reason THP is unusable for some workloads, like Redis.
      
      The original THP refcounting didn't allow PTE-mapped compound pages, so
      we had no option but to allocate a huge page on CoW (with fallback to
      512 4k pages).
      
      The current refcounting doesn't have such limitations, and we can cut a
      lot of complex code out of the fault path.
      
      khugepaged is now able to recover THP from such ranges if the
      configuration allows.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Link: http://lkml.kernel.org/r/20200416160026.16538-8-kirill.shutemov@linux.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3917c802