1. 06 Nov, 2021 40 commits
    • mm/page_alloc: remove the throttling logic from the page allocator · 132b0d21
      Mel Gorman authored
      The page allocator stalls based on the number of pages that are waiting
      for writeback to start but this should now be redundant.
      shrink_inactive_list() will wake flusher threads if the LRU tail contains
      unqueued dirty pages so the flusher should be active.  If it fails to
      make progress due to pages under writeback not being completed quickly
      then it should stall on VMSCAN_THROTTLE_WRITEBACK.
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-6-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/writeback: throttle based on page writeback instead of congestion · 8d58802f
      Mel Gorman authored
      do_writepages() throttles on congestion if writepages() fails due to a
      lack of memory but congestion_wait() is partially broken as the
      congestion state is not updated for all BDIs.
      
      This patch stalls waiting for a number of pages to complete writeback
      that are located on the local node.  The main weakness is that there is no
      correlation between the location of the inode's pages and locality but
      that is still better than congestion_wait.
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: throttle reclaim when no progress is being made · 69392a40
      Mel Gorman authored
      Memcg reclaim throttles on congestion if no reclaim progress is made.
      This makes little sense as the lack of progress might be due to writeback
      or a host of other factors.
      
      For !memcg reclaim, it's messy.  Direct reclaim primarily is throttled
      in the page allocator if it is failing to make progress.  Kswapd
      throttles if too many pages are under writeback and marked for immediate
      reclaim.
      
      This patch explicitly throttles if reclaim is failing to make progress.
      
      [vbabka@suse.cz: Remove redundant code]
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: throttle reclaim and compaction when too may pages are isolated · d818fca1
      Mel Gorman authored
      Page reclaim throttles on congestion if too many parallel reclaim
      instances have isolated too many pages.  This makes no sense as excessive
      parallelisation has nothing to do with writeback or congestion.
      
      This patch creates an additional waitqueue to sleep on when too many
      pages are isolated.  The throttled tasks are woken when the number of
      isolated pages is reduced or a timeout occurs.  There may be some false
      positive wakeups for GFP_NOIO/GFP_NOFS callers but the tasks will
      throttle again if necessary.
      
      [shy828301@gmail.com: Wake up from compaction context]
      [vbabka@suse.cz: Account number of throttled tasks only for writeback]
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: throttle reclaim until some writeback completes if congested · 8cd7c588
      Mel Gorman authored
      Patch series "Remove dependency on congestion_wait in mm/", v5.
      
      This series removes all calls to congestion_wait in mm/ and deletes
      wait_iff_congested.  It's not a clever implementation but
      congestion_wait has been broken for a long time [1].
      
      Even if congestion throttling worked, it was never a great idea.  While
      excessive dirty/writeback pages at the tail of the LRU is one reason
      that reclaim may be slow, there is also the problem of too many pages
      being isolated and reclaim failing for other reasons (elevated
      references, excessive LRU contention etc).
      
      This series replaces the "congestion" throttling with 3 different types.
      
       - If there are too many dirty/writeback pages, sleep until a timeout or
         enough pages get cleaned
      
       - If too many pages are isolated, sleep until enough isolated pages are
         either reclaimed or put back on the LRU
      
       - If no progress is being made, direct reclaim tasks sleep until
         another task makes progress with acceptable efficiency.
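The three throttle types above can be modelled as a small decision function. This is a minimal sketch rather than the kernel implementation: the function name, its boolean inputs and the order in which conditions are checked are all assumptions made for illustration. The reason names and the timeouts are the ones that appear in the ftrace output quoted later in this message.

```python
from enum import Enum


class ThrottleReason(Enum):
    WRITEBACK = "VMSCAN_THROTTLE_WRITEBACK"
    ISOLATED = "VMSCAN_THROTTLE_ISOLATED"
    NOPROGRESS = "VMSCAN_THROTTLE_NOPROGRESS"


# Timeouts in microseconds, copied from the usec_timeout values in the
# ftrace output quoted in this cover letter.
TIMEOUT_USEC = {
    ThrottleReason.WRITEBACK: 100_000,
    ThrottleReason.ISOLATED: 20_000,
    ThrottleReason.NOPROGRESS: 500_000,
}


def pick_throttle(too_many_dirty, too_many_isolated, made_progress):
    """Return (reason, timeout_usec), or None if no throttling is needed.

    Hypothetical helper: the check order is an assumption, not taken
    from the series.
    """
    if too_many_isolated:
        reason = ThrottleReason.ISOLATED
    elif too_many_dirty:
        reason = ThrottleReason.WRITEBACK
    elif not made_progress:
        reason = ThrottleReason.NOPROGRESS
    else:
        return None
    return reason, TIMEOUT_USEC[reason]


print(pick_throttle(too_many_dirty=True, too_many_isolated=False, made_progress=True))
```

In each case the sleeping task is woken early if the condition it is waiting on clears, so the timeout is only an upper bound on the stall.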
      
      This was initially tested with a mix of workloads that used to trigger
      corner cases that no longer work.  A new test case was created called
      "stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
      created XFS filesystem.  Note that it may be necessary to increase the
      timeout of ssh if executing remotely as ssh itself can get throttled and
      the connection may time out.
      
      stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
      to check the impact as the number of direct reclaimers increases.  It has
      four types of worker.
      
       - One "anon latency" worker creates small mappings with mmap() and
         times how long it takes to fault the mapping reading it 4K at a time
      
       - X file writers which is fio randomly writing X files where the total
         size of the files adds up to the allowed dirty_ratio. fio is allowed
         to run for a warmup period to allow some file-backed pages to
         accumulate. The duration of the warmup is based on the best-case
         linear write speed of the storage.
      
       - Y file readers which is fio randomly reading small files
      
       - Z anon memory hogs which continually map (100-dirty_ratio)% of memory
      
       - Total estimated WSS = (100+dirty_ratio) percent of memory
      
      X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4
      
      The intent is to maximise the total WSS with a mix of file and anon
      memory where some anonymous memory must be swapped and there is a high
      likelihood of dirty/writeback pages reaching the end of the LRU.
      
      The test can be configured to have no background readers to stress
      dirty/writeback pages.  The results below are based on having zero
      readers.
      
      The short summary of the results is that the series works and stalls
      until some event occurs but the timeouts may need adjustment.
      
      The test results are not broken down by patch as the series should be
      treated as one block that replaces a broken throttling mechanism with a
      working one.
      
      Finally, three machines were tested but I'm reporting the worst set of
      results.  The other two machines had much better latencies for example.
      
      First, the results of the "anon latency" worker
      
        stutterp
                                      5.15.0-rc1             5.15.0-rc1
                                         vanilla mm-reclaimcongest-v5r4
        Amean     mmap-4      31.4003 (   0.00%)   2661.0198 (-8374.52%)
        Amean     mmap-7      38.1641 (   0.00%)    149.2891 (-291.18%)
        Amean     mmap-12     60.0981 (   0.00%)    187.8105 (-212.51%)
        Amean     mmap-21    161.2699 (   0.00%)    213.9107 ( -32.64%)
        Amean     mmap-30    174.5589 (   0.00%)    377.7548 (-116.41%)
        Amean     mmap-48   8106.8160 (   0.00%)   1070.5616 (  86.79%)
        Stddev    mmap-4      41.3455 (   0.00%)  27573.9676 (-66591.66%)
        Stddev    mmap-7      53.5556 (   0.00%)   4608.5860 (-8505.23%)
        Stddev    mmap-12    171.3897 (   0.00%)   5559.4542 (-3143.75%)
        Stddev    mmap-21   1506.6752 (   0.00%)   5746.2507 (-281.39%)
        Stddev    mmap-30    557.5806 (   0.00%)   7678.1624 (-1277.05%)
        Stddev    mmap-48  61681.5718 (   0.00%)  14507.2830 (  76.48%)
        Max-90    mmap-4      31.4243 (   0.00%)     83.1457 (-164.59%)
        Max-90    mmap-7      41.0410 (   0.00%)     41.0720 (  -0.08%)
        Max-90    mmap-12     66.5255 (   0.00%)     53.9073 (  18.97%)
        Max-90    mmap-21    146.7479 (   0.00%)    105.9540 (  27.80%)
        Max-90    mmap-30    193.9513 (   0.00%)     64.3067 (  66.84%)
        Max-90    mmap-48    277.9137 (   0.00%)    591.0594 (-112.68%)
        Max       mmap-4    1913.8009 (   0.00%) 299623.9695 (-15555.96%)
        Max       mmap-7    2423.9665 (   0.00%) 204453.1708 (-8334.65%)
        Max       mmap-12   6845.6573 (   0.00%) 221090.3366 (-3129.64%)
        Max       mmap-21  56278.6508 (   0.00%) 213877.3496 (-280.03%)
        Max       mmap-30  19716.2990 (   0.00%) 216287.6229 (-997.00%)
        Max       mmap-48 477923.9400 (   0.00%) 245414.8238 (  48.65%)
      
      For most thread counts, the time to mmap() is unfortunately increased.
      In earlier versions of the series, this was lower but a large number of
      throttling events were reaching their timeout increasing the amount of
      inefficient scanning of the LRU.  There is no prioritisation of reclaim
      tasks making progress based on each task's rate of page allocation versus
      progress of reclaim.  The variance is also impacted for high worker
      counts but in all cases, the differences in latency are not
      statistically significant due to very large maximum outliers.  Max-90
      shows that 90% of the stalls are comparable but the Max results show the
      massive outliers which are increased due to stalling.
      
      It is expected that this will be very machine dependent.  Due to the
      test design, reclaim is difficult so allocations stall and there are
      variances depending on whether THPs can be allocated or not.  The amount
      of memory will affect exactly how bad the corner cases are and how often
      they trigger.  The warmup period calculation is not ideal as it's based
      on linear writes whereas fio is randomly writing multiple files from
      multiple tasks so the start state of the test is variable.  For example,
      these are the latencies on a single-socket machine that had more memory
      
        Amean     mmap-4      42.2287 (   0.00%)     49.6838 * -17.65%*
        Amean     mmap-7     216.4326 (   0.00%)     47.4451 *  78.08%*
        Amean     mmap-12   2412.0588 (   0.00%)     51.7497 (  97.85%)
        Amean     mmap-21   5546.2548 (   0.00%)     51.8862 (  99.06%)
        Amean     mmap-30   1085.3121 (   0.00%)     72.1004 (  93.36%)
      
      The overall system CPU usage and elapsed time is as follows
      
                          5.15.0-rc3  5.15.0-rc3
                             vanilla mm-reclaimcongest-v5r4
        Duration User        6989.03      983.42
        Duration System      7308.12      799.68
        Duration Elapsed     2277.67     2092.98
      
      The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
      stalling.
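The 89% figure follows directly from the duration table. A short sketch of the arithmetic, using only the numbers reported above (the helper name is made up for illustration):

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# (vanilla, patched) durations copied from the table above.
durations = {
    "User": (6989.03, 983.42),
    "System": (7308.12, 799.68),
    "Elapsed": (2277.67, 2092.98),
}

for name, (vanilla, patched) in durations.items():
    print(f"Duration {name:8s} {percent_reduction(vanilla, patched):6.2f}% lower")
```

Elapsed time changes far less than CPU time, which is consistent with the patched kernel sleeping where the vanilla kernel kept scanning.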
      
      The high-level /proc/vmstat counters show
      
                                             5.15.0-rc1     5.15.0-rc1
                                                vanilla mm-reclaimcongest-v5r2
        Ops Direct pages scanned          1056608451.00   503594991.00
        Ops Kswapd pages scanned           109795048.00   147289810.00
        Ops Kswapd pages reclaimed          63269243.00    31036005.00
        Ops Direct pages reclaimed          10803973.00     6328887.00
        Ops Kswapd efficiency %                   57.62          21.07
        Ops Kswapd velocity                    48204.98       57572.86
        Ops Direct efficiency %                    1.02           1.26
        Ops Direct velocity                   463898.83      196845.97
      
      Kswapd scanned more pages but the detailed pattern is different.  The
      vanilla kernel scans slowly over time whereas the patched kernel exhibits
      burst patterns of scan activity.  Direct reclaim scanning is reduced by
      52% due to stalling.
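The efficiency rows and the 52% figure can be re-derived from the raw counters. The sketch below assumes that efficiency is pages reclaimed divided by pages scanned; that definition is not stated in this message but it reproduces the values in the table. The helper and dictionary names are illustrative.

```python
def efficiency(reclaimed, scanned):
    """Pages reclaimed per page scanned, in percent (assumed definition)."""
    return reclaimed / scanned * 100.0


# Counters copied from the vmstat table above.
vmstat = {
    "vanilla": {
        "direct_scanned": 1056608451, "direct_reclaimed": 10803973,
        "kswapd_scanned": 109795048, "kswapd_reclaimed": 63269243,
    },
    "patched": {
        "direct_scanned": 503594991, "direct_reclaimed": 6328887,
        "kswapd_scanned": 147289810, "kswapd_reclaimed": 31036005,
    },
}

for kernel, c in vmstat.items():
    k = efficiency(c["kswapd_reclaimed"], c["kswapd_scanned"])
    d = efficiency(c["direct_reclaimed"], c["direct_scanned"])
    print(f"{kernel:8s} kswapd efficiency {k:5.2f}%  direct efficiency {d:4.2f}%")

# Relative drop in direct reclaim scanning between the two kernels.
direct_scan_drop = (1 - vmstat["patched"]["direct_scanned"]
                    / vmstat["vanilla"]["direct_scanned"]) * 100.0
print(f"direct reclaim scanning reduced by {direct_scan_drop:.0f}%")
```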
      
      The pattern for stealing pages is also slightly different.  Both kernels
      exhibit spikes but the vanilla kernel when reclaiming shows pages being
      reclaimed over a period of time whereas the patches tend to reclaim in
      spikes.  The difference is that vanilla is not throttling and instead
      scanning constantly finding some pages over time whereas the patched
      kernel throttles and reclaims in spikes.
      
        Ops Percentage direct scans               90.59          77.37
      
      On vanilla, 90.59% of all scanned pages were scanned by direct reclaim
      whereas with the patches the figure is 77.37% due to throttling.
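The direct scan percentage is the share of all scanned pages that were scanned from direct reclaim context. A minimal check against the counters reported earlier, assuming the total is simply direct plus kswapd scans (the function name is illustrative):

```python
def direct_scan_share(direct_scanned, kswapd_scanned):
    """Share of scanned pages attributed to direct reclaim, in percent."""
    return direct_scanned / (direct_scanned + kswapd_scanned) * 100.0


# "Direct pages scanned" and "Kswapd pages scanned" from the vmstat table.
vanilla_share = direct_scan_share(1056608451, 109795048)
patched_share = direct_scan_share(503594991, 147289810)

print(f"vanilla {vanilla_share:.2f}%  patched {patched_share:.2f}%")
```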
      
        Ops Page writes by reclaim           2613590.00     1687131.00
      
      Page writes from reclaim context are reduced.
      
        Ops Page writes anon                 2932752.00     1917048.00
      
      And there is less swapping.
      
        Ops Page reclaim immediate         996248528.00   107664764.00
      
      The number of pages encountered at the tail of the LRU tagged for
      immediate reclaim but still dirty/writeback is reduced by 89%.
      
        Ops Slabs scanned                     164284.00      153608.00
      
      Slab scan activity is similar.
      
      ftrace was used to gather stall activity
      
        Vanilla
        -------
            1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
            2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
            8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
           29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
        82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
      
      The vast majority of wait_iff_congested calls do not stall at all.  What
      is likely happening is that cond_resched() reschedules the task for a
      short period when the BDI is not registering congestion (which it never
      will in this test setup).
      
            1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
            2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
            4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
          380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
          778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000
      
      congestion_wait, if called, always exceeds the timeout as there is no
      trigger to wake it up.
      
      Bottom line: Vanilla will throttle but it's not effective.
      
      Patch series
      ------------
      
      Kswapd throttle activity was always due to scanning pages tagged for
      immediate reclaim at the tail of the LRU
      
            1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
            4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
            6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
           94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
          112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The majority of events did not stall or stalled for a short period.
      Roughly 16% of events (38 of 244) stalled for longer than 4000 usec and
      11 of those ran to the full timeout.  For direct
      reclaim, the number of times stalled for each reason were
      
         6624 reason=VMSCAN_THROTTLE_ISOLATED
        93246 reason=VMSCAN_THROTTLE_NOPROGRESS
        96934 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The most common reason to stall was excessive pages tagged for
      immediate reclaim at the tail of the LRU followed by a failure to make
      forward progress.  A relatively small number were due to too many pages
      isolated from the LRU by parallel threads.
      
      For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was
      
            9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
           12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
           83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
         6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED
      
      Most did not stall at all.  A small number reached the timeout.
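Histograms in this format are easy to summarise mechanically. The sketch below parses the count-prefixed lines shown above and reports how many events did not stall and how many ran to the full timeout. The regular expression and function name are illustrative, and the input is assumed to have already been aggregated into "count, event fields" lines as in this message.

```python
import re

# The VMSCAN_THROTTLE_ISOLATED breakdown, copied from above.
TRACE = """\
      9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
     12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
     83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
   6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED
"""

# "usect_delayed" is spelled as it appears in the trace output.
LINE = re.compile(
    r"\s*(\d+)\s+usec_timeout=(\d+)\s+usect_delayed=(\d+)\s+reason=(\w+)"
)


def summarise(text):
    """Return (total, no_stall, timed_out) event counts."""
    total = no_stall = timed_out = 0
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip anything that is not a histogram line
        count, timeout, delayed = int(m[1]), int(m[2]), int(m[3])
        total += count
        if delayed == 0:
            no_stall += count
        elif delayed >= timeout:
            timed_out += count
    return total, no_stall, timed_out


total, no_stall, timed_out = summarise(TRACE)
print(f"{total} events: {no_stall / total:.1%} did not stall, "
      f"{timed_out / total:.1%} reached the timeout")
```

The total matches the 6624 VMSCAN_THROTTLE_ISOLATED events reported in the per-reason counts.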
      
      For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls was all over
      the map
      
            1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
            6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
           11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
           13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
           13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
           16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
           18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
           21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
           23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
           23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
           25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
           25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
           26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
           27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
           28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
           29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
           30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
           30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
           31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
           32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
           33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
           35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
           35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
           36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
           36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
           37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
           38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
           40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
           43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
           55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
           56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
           58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
           59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
           61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
           71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
           71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
           79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
           82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
           82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
           85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
           85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
           88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
           90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
           90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
           94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
          118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
          119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
          126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
          146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
          148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
          148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
          159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
          178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
          183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
          237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
          266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
          313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
          347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
          470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
          559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
          964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
         2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
         2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
         7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
        22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
        51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS
      
      The full timeout is often hit but a large number also do not stall at
      all.  The remainder slept a little allowing other reclaim tasks to make
      progress.
      
      While this timeout could be further increased, it could also negatively
      impact worst-case behaviour when there is no prioritisation of what task
      should make progress.
      
      For VMSCAN_THROTTLE_WRITEBACK, the breakdown was
      
            1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
            2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
            3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
            6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
            7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
           12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
           16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
           24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
           28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
           30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
           30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
           32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
           42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
           77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
           99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
          137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
          190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
          339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
          518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
          852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
         3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
         7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
        83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The majority hit the timeout in direct reclaim context although a
      sizable number did not stall at all.  This is very different to kswapd
      where only a tiny percentage of stalls due to writeback reached the
      timeout.
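
      A breakdown like the ones above can be reduced to a few headline
      numbers with a short script.  The following is only a sketch: the
      summarise() helper and the three sample lines (copied from the
      VMSCAN_THROTTLE_WRITEBACK table) are illustrative and are not part of
      the series.

```python
import re

# Three lines copied from the VMSCAN_THROTTLE_WRITEBACK breakdown:
# event count first, then the tracepoint fields.
SAMPLE = """
   3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
   7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
  83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
"""

LINE_RE = re.compile(
    r"^\s*(\d+)\s+usec_timeout=(\d+)\s+usect_delayed=(\d+)\s+reason=(\S+)\s*$"
)


def summarise(text):
    """Return (total, full_timeout, no_stall) event counts for a histogram."""
    total = full_timeout = no_stall = 0
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip blank or unrelated lines
        count, timeout, delayed = (int(m.group(i)) for i in (1, 2, 3))
        total += count
        if delayed >= timeout:
            full_timeout += count  # slept for the whole timeout
        elif delayed == 0:
            no_stall += count      # woken before any measurable delay
    return total, full_timeout, no_stall


total, full_timeout, no_stall = summarise(SAMPLE)
print(f"{full_timeout}/{total} hit the full timeout, "
      f"{no_stall}/{total} did not stall")
```

      Feeding the complete table to summarise() instead of the sample gives
      the proportions discussed in the text.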
      
      Bottom line, the throttling appears to work and the wakeup events may
      limit worst case stalls.  There might be some grounds for adjusting
      timeouts but it's likely futile as the worst-case scenarios depend on
      the workload, memory size and the speed of the storage.  A better
      approach to improve the series further would be to prioritise tasks
      based on their rate of allocation with the caveat that it may be very
      expensive to track.
      
      This patch (of 5):
      
      Page reclaim throttles on wait_iff_congested under the following
      conditions:
      
       - kswapd is encountering pages under writeback and marked for immediate
         reclaim implying that pages are cycling through the LRU faster than
         pages can be cleaned.
      
       - Direct reclaim will stall if all dirty pages are backed by congested
         inodes.
      
      wait_iff_congested is almost completely broken with few exceptions.
      This patch adds a new node-based workqueue and tracks the number of
      throttled tasks and pages written back since throttling started.  If
      enough pages belonging to the node are written back then the throttled
      tasks will wake early.  If not, the throttled tasks sleep until the
      timeout expires.
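
      The wake-early-or-time-out behaviour described here can be modelled in
      user space.  This is a toy sketch, not the kernel implementation: the
      NodeThrottle class, its method names and the wake_threshold parameter
      are all assumptions made for illustration.

```python
import threading


class NodeThrottle:
    """Toy model of a per-node throttle queue (not kernel code)."""

    def __init__(self, wake_threshold):
        self.wake_threshold = wake_threshold  # assumed: pages needed to wake early
        self.nr_throttled = 0                 # tasks currently throttled
        self.nr_written = 0                   # pages written back since throttling began
        self.cond = threading.Condition()

    def throttle(self, timeout):
        """Sleep until enough writeback completes or the timeout expires.

        Returns True if woken early, False if the timeout expired.
        """
        with self.cond:
            if self.nr_throttled == 0:
                self.nr_written = 0  # first throttled task restarts the count
            self.nr_throttled += 1
            try:
                return self.cond.wait_for(
                    lambda: self.nr_written >= self.wake_threshold, timeout
                )
            finally:
                self.nr_throttled -= 1

    def account_writeback(self, nr_pages):
        """Record that writeback of nr_pages on this node has completed."""
        with self.cond:
            if self.nr_throttled == 0:
                return  # nobody is throttled, nothing to track
            self.nr_written += nr_pages
            if self.nr_written >= self.wake_threshold:
                self.cond.notify_all()
```

      A caller that sees no writeback completes sleeps for the full timeout,
      while one whose node makes progress returns as soon as the assumed
      threshold is crossed.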
      
      [neilb@suse.de: Uninterruptible sleep and simpler wakeups]
      [hdanton@sina.com: Avoid race when reclaim starts]
      [vbabka@suse.cz: vmstat irq-safe api, clarifications]
      
      Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
      Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: NeilBrown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cd7c588
    • Kai Song's avatar
      mm/vmscan.c: fix -Wunused-but-set-variable warning · cb75463c
      Kai Song authored
      Fix the following warning when building the kernel with W=1:
      
        mm/vmscan.c:1362:6: warning: variable 'err' set but not used [-Wunused-but-set-variable]
      
      Link: https://lkml.kernel.org/r/20210924181218.21165-1-songkai01@inspur.com
      Signed-off-by: Kai Song <songkai01@inspur.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb75463c
    • Miaohe Lin's avatar
      mm/page_isolation: guard against possible putback unisolated page · a500cb34
      Miaohe Lin authored
      Isolating a free page in an isolated pageblock is expected to always
      work as watermarks don't apply here.
      
      But if __isolate_free_page() fails because conditions have changed,
      the page is left on the free list and is then put back on the free
      list again via __putback_isolated_page().  This may trigger
      VM_BUG_ON_PAGE() on page->flags checking in __free_one_page() if
      PageReported is set.  Or we will corrupt the free list because
      list_add() will be called for pages already on another list.
      
      Add a VM_WARN_ON() to complain about this change.
      
      Link: https://lkml.kernel.org/r/20210914114508.23725-1-linmiaohe@huawei.com
      Fixes: 3c605096 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a500cb34
    • Miaohe Lin's avatar
      mm/page_isolation: fix potential missing call to unset_migratetype_isolate() · e1d8c966
      Miaohe Lin authored
      In start_isolate_page_range() undo path, pfn_to_online_page() just
      checks the first pfn in a pageblock while __first_valid_page() will
      traverse the pageblock until the first online pfn is found.  So we may
      miss the call to unset_migratetype_isolate() in undo path and pages will
      remain isolated unexpectedly.
      
      Fix this by calling undo_isolate_page_range() and this will also help to
      simplify the code further.  Note we shouldn't ever trigger it because
      MAX_ORDER-1 aligned pfn ranges shouldn't contain memory holes now.
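
      The mismatch between the two lookups can be shown with a toy model.
      This is a sketch only: the helper names and the 8-page pageblock with
      offline leading pfns are hypothetical, and as noted above such holes
      should not occur in practice.

```python
def first_valid_page(online, start, nr_pages):
    """Return the first online pfn in [start, start + nr_pages), or None."""
    for pfn in range(start, start + nr_pages):
        if online[pfn]:
            return pfn
    return None


def first_pfn_online(online, start):
    """Check only the first pfn of the pageblock."""
    return start if online[start] else None


# Hypothetical 8-page pageblock whose first two pfns are offline.
online = [False, False, True, True, True, True, True, True]

isolate_sees = first_valid_page(online, 0, 8)  # scanning lookup finds pfn 2
undo_sees = first_pfn_online(online, 0)        # first-pfn check finds nothing
print(isolate_sees, undo_sees)
```

      In this model the isolation side acts on the pageblock while an undo
      path that checks only the first pfn skips it, which is the asymmetry
      the fix removes.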
      
      Link: https://lkml.kernel.org/r/20210914114348.15569-1-linmiaohe@huawei.com
      Fixes: 2ce13640 ("mm: __first_valid_page skip over offline pages")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e1d8c966
    • Axel Rasmussen's avatar
      userfaultfd/selftests: fix calculation of expected ioctls · ad0ce23e
      Axel Rasmussen authored
      Today, we assert that the ioctls the kernel reports as supported for a
      registration match a precomputed list.  We decide which ioctls are
      supported by examining the memory type.  Then, in several locations we
      "fix up" this list by adding or removing things this initial decision
      got wrong.
      
      What ioctls the kernel reports is actually a function of several things:
      - The memory type
      - Kernel feature support (e.g., no writeprotect on aarch64)
      - The registration type (e.g., CONTINUE only supported for MINOR mode)
      
      So, we can't fully compute this at the start, in set_test_type.  It
      varies per test, depending on what registration mode(s) those tests use.
      
      Instead, introduce a new function which computes the correct list.  This
      centralizes the add/remove of ioctls depending on these function inputs
      in one place, so we don't have to repeat ourselves in various tests.
      
      Not only is the resulting code a bit shorter, but it fixes a real bug in
      the existing code: previously, we would incorrectly require the
      writeprotect ioctl to be present on aarch64, where it isn't actually
      supported.
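
      The idea of computing the expected list from all three inputs in one
      place might look roughly like the sketch below.  It is not the
      selftest's code: the function name, its parameters and the string
      ioctl names are illustrative stand-ins for the UFFDIO_* constants, and
      the rule that hugetlb lacks ZEROPAGE is an assumption of this example.

```python
# Illustrative names only; the real selftest works with the UFFDIO_*
# constants from <linux/userfaultfd.h>.
ALL_RANGE_IOCTLS = {"WAKE", "COPY", "ZEROPAGE", "WRITEPROTECT", "CONTINUE"}


def expected_ioctls(mem_type, kernel_has_wp, kernel_has_minor, modes):
    """Compute the ioctls expected for one registration.

    mem_type:         "anon", "shmem" or "hugetlb"
    kernel_has_wp:    kernel feature support for writeprotect
    kernel_has_minor: kernel feature support for minor faults
    modes:            set of registration modes, e.g. {"missing", "wp"}
    """
    ioctls = set(ALL_RANGE_IOCTLS)
    if mem_type == "hugetlb":
        ioctls.discard("ZEROPAGE")      # assumed memory-type restriction
    if not (kernel_has_wp and "wp" in modes):
        ioctls.discard("WRITEPROTECT")  # e.g. no writeprotect on aarch64
    if not (kernel_has_minor and "minor" in modes):
        ioctls.discard("CONTINUE")      # CONTINUE only for MINOR mode
    return ioctls


# A kernel without writeprotect support never expects WRITEPROTECT,
# even when the test registers in wp mode.
print(sorted(expected_ioctls("anon", False, False, {"missing", "wp"})))
```

      Because every test calls the same function with its own registration
      modes, no per-test fix-ups of the list are needed.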
      
      Link: https://lkml.kernel.org/r/20210930212309.4001967-4-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad0ce23e
    • Axel Rasmussen's avatar
      userfaultfd/selftests: fix feature support detection · 1042a53d
      Axel Rasmussen authored
      Before any tests are run, in set_test_type, we decide what feature(s) we
      are going to be testing, based upon our command line arguments.
      However, the supported features are not just a function of the memory
      type being used, so this is broken.
      
      For instance, consider writeprotect support.  It is "normally" supported
      for anonymous memory, but furthermore it requires that the kernel has
      CONFIG_HAVE_ARCH_USERFAULTFD_WP.  So, it is *not* supported at all on
      aarch64, for example.
      
      Fix this by querying the kernel for the set of features it supports
      in set_test_type, by opening a userfaultfd and issuing a UFFDIO_API
      ioctl.  Based upon the reported features, we toggle what tests are
      enabled.
      
      Link: https://lkml.kernel.org/r/20210930212309.4001967-3-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1042a53d
    • Axel Rasmussen's avatar
      userfaultfd/selftests: don't rely on GNU extensions for random numbers · 1c10e674
      Axel Rasmussen authored
      Patch series "Small userfaultfd selftest fixups", v2.
      
      This patch (of 3):
      
      Two arguments for doing this:
      
      First, and maybe most importantly, the resulting code is significantly
      shorter / simpler.
      
      Second, we avoid using GNU libc extensions.  Why does this matter? It
      makes testing userfaultfd with the selftest easier on distros which
      use something other than glibc (e.g., Alpine, which uses musl);
      basically, it makes the test more portable.
      
      Link: https://lkml.kernel.org/r/20210930212309.4001967-2-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c10e674
    • Mike Kravetz's avatar
      hugetlb: remove unnecessary set_page_count in prep_compound_gigantic_page · 2c0078a7
      Mike Kravetz authored
      In commit 7118fc29 ("hugetlb: address ref count racing in
      prep_compound_gigantic_page"), page_ref_freeze is used to atomically
      zero the ref count of tail pages iff they are 1.  The unconditional call
      to set_page_count(0) was left in the code.  This call is after
      page_ref_freeze so it is really a noop.
      
      Remove redundant and unnecessary set_page_count call.
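
      Why the later call is a no-op can be seen in a small single-threaded
      model.  This is a sketch under stated assumptions: the Page class is a
      toy, and this page_ref_freeze() only imitates the "zero the count iff
      it equals the expected value" behaviour described above without any of
      the kernel's atomicity.

```python
class Page:
    """Toy page holding only a reference count (not kernel code)."""

    def __init__(self, refcount):
        self.refcount = refcount


def page_ref_freeze(page, expected):
    """Set the count to 0 only if it currently equals `expected`.

    The kernel does this atomically; a single-threaded model has no
    need to.
    """
    if page.refcount != expected:
        return False
    page.refcount = 0
    return True


tail = Page(refcount=1)
if page_ref_freeze(tail, 1):
    before = tail.refcount
    tail.refcount = 0               # stands in for the removed set_page_count()
    assert tail.refcount == before  # already zero, so nothing changed
```

      A page whose count is not the expected value is left untouched and the
      freeze reports failure, so the unconditional store is never needed on
      either path.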
      
      Link: https://lkml.kernel.org/r/20211026220635.35187-1-mike.kravetz@oracle.com
      Fixes: 7118fc29 ("hugetlb: address ref count racing in prep_compound_gigantic_page")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c0078a7
    • Baolin Wang's avatar
      hugetlb: remove redundant VM_BUG_ON() in add_reservation_in_range() · 76efc67a
      Baolin Wang authored
      When calling hugetlb_resv_map_add(), we've guaranteed that the parameter
      'to' is always larger than 'from', so hugetlb_resv_map_add() never
      returns a negative value.  Thus remove the redundant VM_BUG_ON().
      
      Link: https://lkml.kernel.org/r/2b565552f3d06753da1e8dda439c0d96d6d9a5a3.1634797639.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76efc67a
    • Baolin Wang's avatar
      hugetlb: remove redundant validation in has_same_uncharge_info() · 0739eb43
      Baolin Wang authored
      The callers of has_same_uncharge_info() have already accessed the
      original file_region and the new file_region, so neither can be NULL.
      
      So we can remove the file_region validation in has_same_uncharge_info()
      to simplify the code.
      
      Link: https://lkml.kernel.org/r/97fc68d3f8d34f63c204645e10d7a718997e50b7.1634797639.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0739eb43
    • Baolin Wang's avatar
      hugetlb: replace the obsolete hugetlb_instantiation_mutex in the comments · aa6d2e8c
      Baolin Wang authored
      After commit 8382d914 ("mm, hugetlb: improve page-fault
      scalability"), the hugetlb_instantiation_mutex lock was replaced by
      hugetlb_fault_mutex_table to serialize faults on the same logical page.
      
      Thus update the obsolete hugetlb_instantiation_mutex related comments.
      
      Link: https://lkml.kernel.org/r/4b3febeae37455ff7b74aa0aad16cc6909cf0926.1634797639.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa6d2e8c
    • Baolin Wang's avatar
      hugetlb_cgroup: remove unused hugetlb_cgroup_from_counter macro · df8931c8
      Baolin Wang authored
      Patch series "Some cleanups and improvements for hugetlb".
      
      This patchset does some cleanups and improvements for hugetlb and
      hugetlb_cgroup.
      
      This patch (of 4):
      
      Since commit 726b7bbe ("hugetlb_cgroup: fix illegal access to
      memory"), the hugetlb_cgroup_from_counter() macro is no longer used.
      Remove it.
      
      Link: https://lkml.kernel.org/r/cover.1634797639.git.baolin.wang@linux.alibaba.com
      Link: https://lkml.kernel.org/r/f03b29b801fa9942466ab15334ec09988e124ae6.1634797639.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df8931c8
    • Ran Jianping's avatar
      mm: remove duplicate include in hugepage-mremap.c · b65c23f7
      Ran Jianping authored
      Remove the duplicate include of 'unistd.h' in
      '/tools/testing/selftests/vm/hugepage-mremap.c'.  It is also
      included on line 23.
      
      Link: https://lkml.kernel.org/r/20211018102336.869726-1-ran.jianping@zte.com.cn
      Signed-off-by: Ran Jianping <ran.jianping@zte.com.cn>
      Reported-by: Zeal Robot <zealci@zte.com.cn>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b65c23f7
    • Baolin Wang's avatar
      hugetlb: support node specified when using cma for gigantic hugepages · 38e719ab
      Baolin Wang authored
      Currently the CMA area for runtime allocation of gigantic hugepages is
      balanced across all online nodes, but in some cases we also want to
      specify the CMA size per node, or for only one node, similar to
      patch [1].
      
      For example, on some multi-node systems each node may have a different
      amount of memory, so allocating the same size of CMA for each node is
      not suitable for the low-memory nodes.  Meanwhile, some workloads like
      DPDK, mentioned by Zhenguo in patch [1], only need hugepages on one node.

      On the other hand, we have some machines with multiple types of memory,
      like DRAM and PMEM (persistent memory).  On such a system, we may want
      to place all the hugepages on the DRAM node only, or specify the
      proportion between the DRAM node and the PMEM node, to tune the
      performance of the workloads.
      
      Thus this patch adds node format for 'hugetlb_cma' parameter to support
      specifying the size of CMA per-node.  An example is as follows:
      
        hugetlb_cma=0:5G,2:5G
      
      which means allocating a 5G CMA area on each of node 0 and node 2.
      Users should use the node-specific sysfs file to allocate gigantic
      hugepages if the CMA size was specified for that node.
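
      The per-node format is simple enough to illustrate with a small
      parser.  This is a user-space sketch only, not the kernel's parsing
      code: parse_size(), parse_hugetlb_cma() and the use of None as the key
      for a plain, non-per-node size are all assumptions of this example.

```python
SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Parse a size such as '5G' or '512M' into bytes."""
    text = text.strip()
    unit = text[-1].upper()
    if unit in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[unit]
    return int(text)


def parse_hugetlb_cma(value):
    """Return {node: bytes} for 'node:size,...' or {None: bytes} for a plain size."""
    if ":" not in value:
        return {None: parse_size(value)}
    per_node = {}
    for item in value.split(","):
        node, size = item.split(":", 1)
        per_node[int(node)] = parse_size(size)
    return per_node


print(parse_hugetlb_cma("0:5G,2:5G"))
```

      With the example from the text, nodes 0 and 2 each map to 5G and any
      node not listed gets no CMA area in this model.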
      
      Link: https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com [1]
      Link: https://lkml.kernel.org/r/bb790775ca60bb8f4b26956bb3f6988f74e075c7.1634261144.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38e719ab
    • Mina Almasry's avatar
      mm, hugepages: add hugetlb vma mremap() test · 12b61320
      Mina Almasry authored
      [almasrymina@google.com: v8]
        Link: https://lkml.kernel.org/r/20211014200542.4126947-2-almasrymina@google.com
      [wanjiabing@vivo.com: remove duplicated include in hugepage-mremap]
        Link: https://lkml.kernel.org/r/20211021122944.8857-1-wanjiabing@vivo.com
      
      Link: https://lkml.kernel.org/r/20211013195825.3058275-2-almasrymina@google.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      12b61320
    • Mina Almasry's avatar
      mm, hugepages: add mremap() support for hugepage backed vma · 550a7d60
      Mina Almasry authored
      Support mremap() for hugepage backed vma segment by simply repositioning
      page table entries.  The page table entries are repositioned to the new
      virtual address on mremap().
      
      Hugetlb mremap() support is of course generic; my motivating use case is
      a library (hugepage_text), which reloads the ELF text of executables in
      hugepages.  This significantly increases the execution performance of
      said executables.
      
      Restrict the mremap operation on hugepages to up to the size of the
      original mapping as the underlying hugetlb reservation is not yet
      capable of handling remapping to a larger size.
      
      During the mremap() operation we detect pmd_share'd mappings and we
      unshare those during the mremap().  On access and fault the sharing is
      established again.
      
      Link: https://lkml.kernel.org/r/20211013195825.3058275-1-almasrymina@google.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      550a7d60
    • Liangcai Fan's avatar
      mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged · bd3400ea
      Liangcai Fan authored
      When initializing transparent huge pages, min_free_kbytes would be
      calculated according to what khugepaged expected.
      
      So when transparent huge pages get disabled, min_free_kbytes should be
      recalculated instead of keeping the higher value set by khugepaged.
      
      Link: https://lkml.kernel.org/r/1633937809-16558-1-git-send-email-liangcaifan19@gmail.com
      Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com>
      Signed-off-by: Chunyan Zhang <zhang.lyra@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd3400ea
    • Mike Kravetz's avatar
      hugetlb: add hugetlb demote page support · 8531fc6f
      Mike Kravetz authored
      Demote page functionality will split a huge page into a number of huge
      pages of a smaller size.  For example, on x86 a 1GB huge page can be
      demoted into 512 2M huge pages.  Demotion is done 'in place' by simply
      splitting the huge page.
      
      Added '*_for_demote' wrappers for remove_hugetlb_page,
      destroy_compound_hugetlb_page and prep_compound_gigantic_page for use by
      demote code.
      
      [mike.kravetz@oracle.com: v4]
        Link: https://lkml.kernel.org/r/6ca29b8e-527c-d6ec-900e-e6a43e4f8b73@oracle.com
      
      Link: https://lkml.kernel.org/r/20211007181918.136982-6-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Nghia Le <nghialm78@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8531fc6f
    • Mike Kravetz's avatar
      hugetlb: add demote bool to gigantic page routines · 34d9e35b
      Mike Kravetz authored
      The routines remove_hugetlb_page and destroy_compound_gigantic_page will
      remove a gigantic page and make the set of base pages ready to be
      returned to a lower level allocator.  In the process of doing this, they
      make all base pages reference counted.
      
      The routine prep_compound_gigantic_page creates a gigantic page from a
      set of base pages.  It assumes that all these base pages are reference
      counted.
      
      During demotion, a gigantic page will be split into huge pages of a
      smaller size.  This logically involves use of the routines,
      remove_hugetlb_page, and destroy_compound_gigantic_page followed by
      prep_compound*_page for each smaller huge page.
      
      When pages are reference counted (ref count >= 0), additional
      speculative ref counts could be taken as described in previous commits
      [1] and [2].  This could result in errors while demoting a huge page.
      Quite a bit of code would need to be created to handle all possible
      issues.
      
      Instead of dealing with the possibility of speculative ref counts, avoid
      the possibility by keeping ref counts at zero during the demote process.
      Add a boolean 'demote' to the routines remove_hugetlb_page,
      destroy_compound_gigantic_page and prep_compound_gigantic_page.  If the
      boolean is set, the remove and destroy routines will not reference count
      pages and the prep routine will not expect reference counted pages.
      
      '*_for_demote' wrappers of the routines will be added in a subsequent
      patch where this functionality is used.
      
      [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/
      [2] https://lore.kernel.org/linux-mm/20210809184832.18342-3-mike.kravetz@oracle.com/
      
      Link: https://lkml.kernel.org/r/20211007181918.136982-5-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Nghia Le <nghialm78@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34d9e35b
    • Mike Kravetz's avatar
      hugetlb: be sure to free demoted CMA pages to CMA · a01f4390
      Mike Kravetz authored
      When huge page demotion is fully implemented, gigantic pages can be
      demoted to a smaller huge page size.  For example, on x86 a 1G page can
      be demoted to 512 2M pages.  However, gigantic pages can potentially be
      allocated from CMA.  If a gigantic page which was allocated from CMA is
      demoted, the corresponding demoted pages need to be returned to CMA.
      
      Use the new interface cma_pages_valid() to determine if a non-gigantic
      hugetlb page should be freed to CMA.  Also, clear mapping field of these
      pages as expected by cma_release.
      
      This also requires a change to CMA region creation for gigantic pages.
      CMA uses a per-region bit map to track allocations.  When setting up the
      region, you specify how many pages each bit represents.  Currently, only
      gigantic pages are allocated/freed from CMA so the region is set up such
      that one bit represents a gigantic page size allocation.
      
      With demote, a gigantic page (allocation) could be split into smaller
      size pages.  And, these smaller size pages will be freed to CMA.  So,
      since the per-region bit map needs to be set up to represent the
      smallest allocation/free size, it now needs to be set to the smallest
      huge page size which can be freed to CMA.
      
      Unfortunately, we set up the CMA region for huge pages before we set up
      huge pages sizes (hstates).  So, technically we do not know the smallest
      huge page size as this can change via command line options and
      architecture specific code.  Therefore, at region setup time we use
      HUGETLB_PAGE_ORDER as the smallest possible huge page size that can be
      given back to CMA.  It is possible that this value is sub-optimal for
      some architectures/config options.  If needed, this can be addressed in
      follow on work.
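
      The effect of the per-bit granularity on the bitmap size can be worked
      through numerically.  The constants below are assumptions for a typical
      x86 configuration (4 KiB base pages, 2 MiB and 1 GiB huge pages), and
      the bitmap_bits() helper and the 5 GiB region are hypothetical.

```python
PAGE_SHIFT = 12          # assumed: 4 KiB base pages (x86)
GIGANTIC_ORDER = 18      # 1 GiB = 2**18 base pages
HUGETLB_PAGE_ORDER = 9   # assumed x86 value: 2 MiB = 2**9 base pages


def bitmap_bits(region_bytes, order_per_bit):
    """Bitmap bits needed when each bit covers 2**order_per_bit base pages."""
    pages = region_bytes >> PAGE_SHIFT
    return pages >> order_per_bit


region = 5 << 30  # hypothetical 5 GiB CMA region

bits_gigantic = bitmap_bits(region, GIGANTIC_ORDER)   # one bit per 1 GiB page
bits_huge = bitmap_bits(region, HUGETLB_PAGE_ORDER)   # one bit per 2 MiB page
print(bits_gigantic, bits_huge)
```

      Under these assumptions the finer granularity costs 512 times as many
      bits, which is still small, but it lets each 2 MiB piece of a demoted
      gigantic page be freed back to CMA individually.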
      
      Link: https://lkml.kernel.org/r/20211007181918.136982-4-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Nghia Le <nghialm78@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a01f4390
    • Mike Kravetz's avatar
      mm/cma: add cma_pages_valid to determine if pages are in CMA · 9871e2de
      Mike Kravetz authored
      Add new interface cma_pages_valid() which indicates if the specified
      pages are part of a CMA region.  This interface will be used in a
      subsequent patch by hugetlb code.
      
      In order to keep the same amount of DEBUG information, a pr_debug() call
      was added to cma_pages_valid().  In the case where the page passed to
      cma_release is not in cma region, the debug message will be printed from
      cma_pages_valid as opposed to cma_release.
      
      Link: https://lkml.kernel.org/r/20211007181918.136982-3-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Nghia Le <nghialm78@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9871e2de
    • Mike Kravetz's avatar
      hugetlb: add demote hugetlb page sysfs interfaces · 79dfc695
      Mike Kravetz authored
      Patch series "hugetlb: add demote/split page functionality", v4.
      
      The concurrent use of multiple hugetlb page sizes on a single system is
      becoming more common.  One of the reasons is better TLB support for
      gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
      being used to back VMs in hosting environments.
      
      When using hugetlb pages to back VMs, it is often desirable to
      preallocate hugetlb pools.  This avoids the delay and uncertainty of
      allocating hugetlb pages at VM startup.  In addition, preallocating huge
      pages minimizes the issue of memory fragmentation that increases the
      longer the system is up and running.
      
      In such environments, a combination of larger and smaller hugetlb pages
      are preallocated in anticipation of backing VMs of various sizes.  Over
      time, the preallocated pool of smaller hugetlb pages may become depleted
      while larger hugetlb pages still remain.  In such situations, it is
      desirable to convert larger hugetlb pages to smaller hugetlb pages.
      
      Converting larger to smaller hugetlb pages can be accomplished today by
      first freeing the larger page to the buddy allocator and then allocating
      the smaller pages.  For example, to convert 50 1GB pages on x86:
      
        gb_pages=`cat .../hugepages-1048576kB/nr_hugepages`
        m2_pages=`cat .../hugepages-2048kB/nr_hugepages`
        echo $(($gb_pages - 50)) > .../hugepages-1048576kB/nr_hugepages
        echo $(($m2_pages + 25600)) > .../hugepages-2048kB/nr_hugepages
      
      On an idle system this operation is fairly reliable and results are as
      expected.  The number of 2MB pages is increased as expected and the time
      of the operation is a second or two.
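
      The figure of 25600 in the commands above follows from the page sizes:
      one 1GB page (1048576 kB) covers 512 pages of 2MB (2048 kB), so 50 of
      them correspond to 50 * 512 = 25600 smaller pages.  A minimal sketch of
      that arithmetic follows; the helper name is hypothetical and the sizes
      are taken from the sysfs directory names shown above.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of pages of size small_kb that cover
 * nr_large pages of size large_kb.
 */
static unsigned long smaller_pages_needed(unsigned long nr_large,
                                          unsigned long large_kb,
                                          unsigned long small_kb)
{
    /* The larger size must be a whole multiple of the smaller one. */
    assert(small_kb != 0 && large_kb % small_kb == 0);
    return nr_large * (large_kb / small_kb);
}
```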
      
      However, when there is activity on the system the following issues
      arise:
      
      1) This process can take quite some time, especially if allocation of
         the smaller pages is not immediate and requires migration/compaction.
      
      2) There is no guarantee that the total size of smaller pages allocated
         will match the size of the larger page which was freed. This is
         because the area freed by the larger page could quickly be
         fragmented.
      
      In a test environment with a load that continually fills the page cache
      with clean pages, results such as the following can be observed:
      
        Unexpected number of 2MB pages allocated: Expected 25600, have 19944
        real    0m42.092s
        user    0m0.008s
        sys     0m41.467s
      
      To address these issues, introduce the concept of hugetlb page demotion.
      Demotion provides a means of 'in place' splitting of a hugetlb page to
      pages of a smaller size.  This avoids freeing pages to buddy and then
      trying to allocate from buddy.
      
      Page demotion is controlled via sysfs files that reside in the per-hugetlb
      page size and per node directories.
      
       - demote_size
              Target page size for demotion, a smaller huge page size. File
              can be written to choose a smaller huge page size if multiple are
              available.
      
       - demote
              Writable number of hugetlb pages to be demoted
      
      To demote 50 1GB huge pages, one would:
      
        cat .../hugepages-1048576kB/free_hugepages   /* optional, verify free pages */
        cat .../hugepages-1048576kB/demote_size      /* optional, verify target size */
        echo 50 > .../hugepages-1048576kB/demote
      
      Only hugetlb pages which are free at the time of the request can be
      demoted.  Demotion does not add to the complexity of surplus pages and
      honors reserved huge pages.  Therefore, when a value is written to the
      sysfs demote file, that value is only the maximum number of pages which
      will be demoted.  It is possible fewer will actually be demoted.  The
      recently introduced per-hstate mutex is used to synchronize demote
      operations with other operations that modify hugetlb pools.
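
      The "maximum, not guarantee" semantics described above can be sketched
      as a small userspace model.  This is only an illustration under the
      assumption that the demotable count is the number of free pages minus
      those held for reservations; the struct and function names are
      hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one hugetlb pool; field names are illustrative only. */
struct pool_model {
    unsigned long free_pages;     /* pages currently free */
    unsigned long reserved_pages; /* free pages promised to reservations */
};

/* Upper bound on how many pages a demote request could convert. */
static unsigned long demotable_pages(const struct pool_model *p,
                                     unsigned long requested)
{
    unsigned long available;

    assert(p != NULL);
    /* Reserved pages are honored, so they are not candidates. */
    available = p->free_pages > p->reserved_pages ?
                p->free_pages - p->reserved_pages : 0;
    /* The written value is only a maximum. */
    return requested < available ? requested : available;
}
```

      With 60 free pages of which 20 are reserved, a request for 50 would
      demote at most 40 in this model.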
      
      Real world use cases
      --------------------
      The above scenario describes a real world use case where hugetlb pages
      are used to back VMs on x86.  Both issues of long allocation times and
      not necessarily getting the expected number of smaller huge pages after
      a free and allocate cycle have been experienced.  The occurrence of
      these issues is dependent on other activity within the host and can not
      be predicted.
      
      This patch (of 5):
      
      Two new sysfs files are added to demote hugetlb pages.  These files are
      both per-hugetlb page size and per node.  Files are:
      
        demote_size - The size in kB that pages are demoted to. (read-write)
        demote - The number of huge pages to demote. (write-only)
      
      By default, demote_size is the next smallest huge page size.  Any valid
      huge page size smaller than the current huge page size may be written to
      this file.  When huge pages are demoted, they are demoted to this size.
      
      Writing a value to demote will result in an attempt to demote that
      number of hugetlb pages to an appropriate number of demote_size pages.
      
      NOTE: Demote interfaces are only provided for huge page sizes if there
      is a smaller target demote huge page size.  For example, on x86 1GB huge
      pages will have demote interfaces.  2MB huge pages will not have demote
      interfaces.
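
      The default target and the "no smaller size, no interface" rule can be
      illustrated with a short sketch.  It assumes that "next smallest" means
      the largest supported size below the current one; the function name and
      the size list passed in are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the largest supported size (in kB) that is smaller than size_kb,
 * or 0 if there is none, in which case no demote files would be expected.
 */
static unsigned long default_demote_size_kb(const unsigned long *sizes_kb,
                                            size_t n, unsigned long size_kb)
{
    unsigned long best = 0;
    size_t i;

    assert(sizes_kb != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (sizes_kb[i] < size_kb && sizes_kb[i] > best)
            best = sizes_kb[i];
    return best;
}
```

      For the x86 sizes named above (2048 kB and 1048576 kB), the 1GB size
      maps to 2048 kB and the 2MB size maps to 0, i.e. no demote target.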
      
      This patch does not provide full demote functionality.  It only provides
      the sysfs interfaces.
      
      It also provides documentation for the new interfaces.
      
      [mike.kravetz@oracle.com: n_mask initialization does not need to be protected by the mutex]
        Link: https://lkml.kernel.org/r/0530e4ef-2492-5186-f919-5db68edea654@oracle.com
      
      Link: https://lkml.kernel.org/r/20211007181918.136982-2-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Nghia Le <nghialm78@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79dfc695
    • Peter Xu's avatar
      mm/hugetlb: drop __unmap_hugepage_range definition from hugetlb.h · 73c54763
      Peter Xu authored
      Remove __unmap_hugepage_range() from the header file, because it is only
      used in hugetlb.c.
      
      Link: https://lkml.kernel.org/r/20210917165108.9341-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73c54763
    • Yang Shi's avatar
      mm: hwpoison: handle non-anonymous THP correctly · 4966455d
      Yang Shi authored
      Currently hwpoison doesn't handle non-anonymous THP, even though THP
      support for tmpfs and read-only file cache has been added since v4.8.
      Such THPs can be offlined by splitting them, just like anonymous THP.
      
      Link: https://lkml.kernel.org/r/20211020210755.23964-7-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4966455d
    • Yang Shi's avatar
      mm: shmem: don't truncate page if memory failure happens · b9d02f1b
      Yang Shi authored
      The current behavior of memory failure is to truncate the page cache
      page regardless of whether it is dirty or clean.  If the page is dirty,
      a later access will get the obsolete data from disk without any
      notification to the users.  This may cause silent data loss.  It is even
      worse for shmem: since shmem is an in-memory filesystem, truncating the
      page cache means discarding the data blocks.  A later read would return
      all zeros.
      
      The right approach is to keep the corrupted page in the page cache, so
      that any later access returns an error for syscalls or SIGBUS for page
      faults, until the file is truncated, hole punched or removed.  Regular
      storage-backed filesystems would be more complicated, so this patch
      focuses on shmem.  This also unblocks support for soft offlining
      shmem THP.
      
      [arnd@arndb.de: fix uninitialized variable use in me_pagecache_clean()]
        Link: https://lkml.kernel.org/r/20211022064748.4173718-1-arnd@kernel.org
      
      Link: https://lkml.kernel.org/r/20211020210755.23964-6-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9d02f1b
    • Yang Shi's avatar
      mm: hwpoison: refactor refcount check handling · dd0f230a
      Yang Shi authored
      Memory failure will report failure if the page still has an extra pinned
      refcount other than the one from hwpoison after the handler is done.
      The check is actually not necessary for all handlers, so move it into
      the specific handlers.  This makes the following patch, which keeps
      shmem pages in the page cache, easier.
      
      An extra pin may be expected in some cases, for example when the
      page is dirty and in swapcache.
      
      Link: https://lkml.kernel.org/r/20211020210755.23964-5-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd0f230a
    • Yang Shi's avatar
      mm: filemap: coding style cleanup for filemap_map_pmd() · e0f43fa5
      Yang Shi authored
      Patch series "Solve silent data loss caused by poisoned page cache (shmem/tmpfs)", v5.
      
      When discussing the patch that splits page cache THP in order to offline
      the poisoned page, Naoya mentioned there is a bigger problem [1] that
      prevents this from working, since the page cache page will be truncated
      if uncorrectable errors happen.  Looking into this more deeply, it turns
      out that this approach (truncating the poisoned page) may incur silent
      data loss for all non-readonly filesystems if the page is dirty.  It may
      be worse for in-memory filesystems, e.g. shmem/tmpfs, since the data
      blocks are actually gone.
      
      To solve this problem we could keep the poisoned dirty page in the page
      cache and then notify the users on any later access, e.g. page fault,
      read/write, etc.  Clean pages can be truncated as is, since they can
      be reread from disk later on.
      
      The consequence is that filesystems may find a poisoned page and
      manipulate it as a healthy page, since filesystems don't check whether
      the page is poisoned in any of the relevant paths except page fault.  In
      general, we need to make filesystems aware of poisoned pages before we
      can keep the poisoned page in the page cache to solve the data loss
      problem.
      
      To make filesystems aware of poisoned pages we should consider:
      
       - The page should not be written back: clearing the dirty flag
         prevents writeback.
      
       - The page should not be dropped (it shows as a clean page) by drop
         caches or other callers: the refcount pin from hwpoison prevents
         invalidation (called by cache drop, inode cache shrinking, etc),
         but it doesn't avoid invalidation in the DIO path.
      
       - The page should be able to get truncated/hole punched/unlinked: it
         works as it is.
      
       - Notify users when the page is accessed, e.g. read/write, page fault
         and other paths (compression, encryption, etc).
      
      The scope of the last one is huge, since almost all filesystems need to
      do it once a page is returned from a page cache lookup.  There are a
      couple of options:
      
       1. Check the hwpoison flag in every path, the most straightforward way.
      
       2. Return NULL for a poisoned page from the page cache lookup.  Most
          callsites check whether NULL is returned, so this should be the
          least work, I think.  But the error handling in filesystems just
          returns -ENOMEM, an error code that will obviously confuse users.
      
       3. To improve #2, we could return an error pointer, e.g. ERR_PTR(-EIO),
          but this involves a significant amount of code change as well,
          since all the paths need to check whether the pointer is an error,
          just like option #1.
      
      I did prototypes for both #1 and #3, but it seems #3 may require more
      changes than #1.  For #3 ERR_PTR will be returned so all the callers
      need to check the return value otherwise invalid pointer may be
      dereferenced, but not all callers really care about the content of the
      page, for example, partial truncate which just sets the truncated range
      in one page to 0.  So for such paths it needs additional modification if
      ERR_PTR is returned.  And if the callers have their own way to handle
      the problematic pages we need to add a new FGP flag to tell FGP
      functions to return the pointer to the page.
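
      The burden that option #3 places on callers can be illustrated with a
      userspace sketch.  The err_ptr(), ptr_err() and is_err() helpers below
      are simplified stand-ins modelled on the kernel's ERR_PTR(), PTR_ERR()
      and IS_ERR(); the toy_page type, the lookup function and the -ENOENT
      choice for a missing page are hypothetical and only serve to show that
      every caller must check for both NULL and an error pointer before
      touching the page.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins modelled on the kernel's error-pointer helpers. */
static void *err_ptr(long error) { return (void *)(intptr_t)error; }
static long ptr_err(const void *ptr) { return (long)(intptr_t)ptr; }
static int is_err(const void *ptr)
{
    /* Error values occupy the last MAX_ERRNO addresses. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct toy_page {
    int poisoned;
    char data[16];
};

/* Hypothetical lookup: an error pointer replaces a poisoned page. */
static struct toy_page *toy_lookup(struct toy_page *page)
{
    if (page == NULL)
        return NULL;
    if (page->poisoned)
        return err_ptr(-EIO);
    return page;
}

/* A caller has to test for NULL and for an error pointer before use. */
static long toy_read_first_byte(struct toy_page *page, char *out)
{
    struct toy_page *found = toy_lookup(page);

    assert(out != NULL);
    if (found == NULL)
        return -ENOENT;
    if (is_err(found))
        return ptr_err(found);
    *out = found->data[0];
    return 0;
}
```

      A caller that skipped the is_err() test would dereference an invalid
      pointer, which is the risk described above.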
      
      It may happen very rarely, but once it happens the consequence (data
      corruption) could be very bad and very hard to debug.  It seems this
      problem was briefly discussed before, but no action was taken at that
      time.  [2]
      
      As the investigation above shows, solving the potential data loss for
      all filesystems needs a huge amount of work.  But it is much easier for
      in-memory filesystems, and such filesystems actually suffer more than
      others since even the data blocks are gone due to truncating.  So this
      patchset starts with shmem/tmpfs by taking option #1.
      
      TODO:
      * The unpoison has been broken since commit 0ed950d1 ("mm,hwpoison: make
        get_hwpoison_page() call get_any_page()"), and this patch series makes
        the refcount check for unpoisoning a shmem page fail.
      * Expand to other filesystems.  But I haven't heard feedback from filesystem
        developers yet.
      
      Patch breakdown:
      Patch #1: cleanup, a dependency of patch #2
      Patch #2: fix THP with hwpoisoned subpage(s) PMD map bug
      Patch #3: coding style cleanup
      Patch #4: refactor and preparation.
      Patch #5: keep the poisoned page in page cache and handle such cases in
                all the paths.
      Patch #6: the previous patches unblock page cache THP split, so this patch
                adds page cache THP split support.
      
      This patch (of 4):
      
      A minor cleanup to the indent.
      
      Link: https://lkml.kernel.org/r/20211020210755.23964-1-shy828301@gmail.com
      Link: https://lkml.kernel.org/r/20211020210755.23964-4-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0f43fa5
    • Rikard Falkeborn's avatar
      mm/memory_failure: constify static mm_walk_ops · ba9eb3ce
      Rikard Falkeborn authored
      The only usage of hwp_walk_ops is to pass its address to
      walk_page_range() which takes a pointer to const mm_walk_ops as
      argument.
      
      Make it const to allow the compiler to put it in read-only memory.
      
      Link: https://lkml.kernel.org/r/20211014075042.17174-3-rikard.falkeborn@gmail.com
      Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba9eb3ce
    • Marco Elver's avatar
      mm: fix data race in PagePoisoned() · 477d01fc
      Marco Elver authored
      PagePoisoned() accesses page->flags which can be updated concurrently:
      
        | BUG: KCSAN: data-race in next_uptodate_page / unlock_page
        |
        | write (marked) to 0xffffea00050f37c0 of 8 bytes by task 1872 on cpu 1:
        |  instrument_atomic_write           include/linux/instrumented.h:87 [inline]
        |  clear_bit_unlock_is_negative_byte include/asm-generic/bitops/instrumented-lock.h:74 [inline]
        |  unlock_page+0x102/0x1b0           mm/filemap.c:1465
        |  filemap_map_pages+0x6c6/0x890     mm/filemap.c:3057
        |  ...
        | read to 0xffffea00050f37c0 of 8 bytes by task 1873 on cpu 0:
        |  PagePoisoned                   include/linux/page-flags.h:204 [inline]
        |  PageReadahead                  include/linux/page-flags.h:382 [inline]
        |  next_uptodate_page+0x456/0x830 mm/filemap.c:2975
        |  ...
        | CPU: 0 PID: 1873 Comm: systemd-udevd Not tainted 5.11.0-rc4-00001-gf9ce0be7 #1
      
      To avoid the compiler tearing or otherwise optimizing the access, use
      READ_ONCE() to access flags.
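
      The idea can be sketched in userspace as follows.  The READ_ONCE()
      macro here is a simplified stand-in (a volatile-qualified load using
      the GCC/Clang __typeof__ extension), not the kernel's full definition,
      and the toy_page struct and the all-ones poison value are assumptions
      made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel macro: force a single volatile load. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Assumed value: an all-ones word marks a poisoned struct page. */
#define PAGE_POISON_PATTERN (-1L)

struct toy_page {
    unsigned long flags;
};

static int toy_page_poisoned(const struct toy_page *page)
{
    assert(page != NULL);
    /* One marked load, so the compiler cannot tear or repeat the read. */
    return READ_ONCE(page->flags) == (unsigned long)PAGE_POISON_PATTERN;
}
```

      The marked read does not order anything against the concurrent writer;
      it only makes the access itself well defined for the compiler.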
      
      Link: https://lore.kernel.org/all/20210826144157.GA26950@xsang-OptiPlex-9020/
      Link: https://lkml.kernel.org/r/20210913113542.2658064-1-elver@google.com
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Marco Elver <elver@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      477d01fc
    • Wang ShaoBo's avatar
      mm/page_alloc: use clamp() to simplify code · 59d336bd
      Wang ShaoBo authored
      This patch uses clamp() to simplify code in init_per_zone_wmark_min().
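
      For readers unfamiliar with the helper, clamping restricts a value to a
      closed range and replaces a pair of "too small" and "too large" checks.
      The sketch below is a plain-function userspace equivalent, not the
      kernel macro; the name clamp_long is hypothetical and the commit text
      does not state which bounds are used.

```c
#include <assert.h>

/* Plain-function version of clamping: restrict val to [lo, hi]. */
static long clamp_long(long val, long lo, long hi)
{
    assert(lo <= hi);
    if (val < lo)
        return lo;
    if (val > hi)
        return hi;
    return val;
}
```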
      
      Link: https://lkml.kernel.org/r/20211021034830.1049150-1-bobo.shaobowang@huawei.com
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Wei Yongjun <weiyongjun1@huawei.com>
      Cc: Li Bin <huawei.libin@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59d336bd
    • Sebastian Andrzej Siewior's avatar
      mm: page_alloc: use migrate_disable() in drain_local_pages_wq() · 9c25cbfc
      Sebastian Andrzej Siewior authored
      drain_local_pages_wq() disables preemption to avoid CPU migration during
      CPU hotplug and can't use cpus_read_lock().
      
      Using migrate_disable() works here, too.  The scheduler won't take the
      CPU offline until the task has left the migrate-disable section.  The
      problem with disabled preemption here is that drain_local_pages()
      acquires locks which are turned into sleeping locks on PREEMPT_RT and
      can't be acquired with disabled preemption.
      
      Use migrate_disable() in drain_local_pages_wq().
      
      Link: https://lkml.kernel.org/r/20211015210933.viw6rjvo64qtqxn4@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c25cbfc
    • Christophe Leroy's avatar
      s390: use generic version of arch_is_kernel_initmem_freed() · 564f6ea1
      Christophe Leroy authored
      The generic version of arch_is_kernel_initmem_freed() now does the same
      as the s390 version.
      
      Remove the s390 version.
      
      Link: https://lkml.kernel.org/r/b6feb5dfe611a322de482762fc2df3a9eece70c7.1633001016.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      564f6ea1
    • Christophe Leroy's avatar
      powerpc: use generic version of arch_is_kernel_initmem_freed() · e012a25d
      Christophe Leroy authored
      The generic version of arch_is_kernel_initmem_freed() now does the same
      as the powerpc version.
      
      Remove the powerpc version.
      
      Link: https://lkml.kernel.org/r/c53764eb45d41491e2b21da2e7812239897dbebb.1633001016.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e012a25d
    • Christophe Leroy's avatar
      mm: make generic arch_is_kernel_initmem_freed() do what it says · e5ae3728
      Christophe Leroy authored
      Commit 7a5da02d ("locking/lockdep: check for freed initmem in
      static_obj()") added arch_is_kernel_initmem_freed() which is supposed to
      report whether an object is part of already freed init memory.
      
      For the time being, the generic version of
      arch_is_kernel_initmem_freed() always reports 'false', although
      free_initmem() is generically called on all architectures.
      
      Therefore, change the generic version of arch_is_kernel_initmem_freed()
      to check whether free_initmem() has been called.  If so, then check if a
      given address falls into init memory.
      
      To ease the use of system_state, move it out of line into its only
      caller which is lockdep.c
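
      The two-step check described above (has freeing of init memory begun,
      and if so does the address lie in the init region) can be sketched as a
      userspace model.  The struct, its fields and the function name are
      hypothetical; the real code consults system_state and the kernel's init
      section bounds rather than a struct like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy description of the init region; all names are illustrative. */
struct toy_init_region {
    uintptr_t begin;     /* first byte of init memory */
    uintptr_t end;       /* one past the last byte */
    int freeing_started; /* nonzero once freeing of init memory has begun */
};

static int toy_initmem_freed(const struct toy_init_region *r, uintptr_t addr)
{
    assert(r != NULL && r->begin <= r->end);
    if (!r->freeing_started)
        return 0; /* nothing has been freed yet */
    return addr >= r->begin && addr < r->end;
}
```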
      
      Link: https://lkml.kernel.org/r/1d40783e676e07858be97d881f449ee7ea8adfb1.1633001016.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e5ae3728
    • Christophe Leroy's avatar
      mm: create a new system state and fix core_kernel_text() · d2635f20
      Christophe Leroy authored
      core_kernel_text() considers that until system_state is at least
      SYSTEM_RUNNING, init memory is valid.
      
      But init memory is freed a few lines before setting SYSTEM_RUNNING, so
      we have a small period of time when core_kernel_text() is wrong.
      
      Create an intermediate system state called SYSTEM_FREEING_INIT that is
      set before starting freeing init memory, and use it in
      core_kernel_text() to report init memory invalid earlier.
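
      The window being closed can be shown with a toy state machine.  Only
      SYSTEM_FREEING_INIT and SYSTEM_RUNNING are named in the text above; the
      enum below, its TOY_ prefix and the booting placeholder are assumptions
      for illustration, and the two predicates model the rule before and
      after the change rather than the actual core_kernel_text() code.

```c
#include <assert.h>

/* Toy ordering of boot states; the first entry is a placeholder. */
enum toy_system_state {
    TOY_SYSTEM_BOOTING,
    TOY_SYSTEM_FREEING_INIT,
    TOY_SYSTEM_RUNNING
};

/* Old rule: init memory counts as valid until SYSTEM_RUNNING is reached. */
static int init_mem_valid_old(enum toy_system_state s)
{
    assert(s <= TOY_SYSTEM_RUNNING);
    return s < TOY_SYSTEM_RUNNING;
}

/* New rule: init memory stops counting as valid once freeing starts. */
static int init_mem_valid_new(enum toy_system_state s)
{
    assert(s <= TOY_SYSTEM_RUNNING);
    return s < TOY_SYSTEM_FREEING_INIT;
}
```

      The two rules differ only in the intermediate state, which is exactly
      the period during which init memory is being freed.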
      
      Link: https://lkml.kernel.org/r/9ecfdee7dd4d741d172cb93ff1d87f1c58127c9a.1633001016.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2635f20
    • Liangcai Fan's avatar
      mm/page_alloc.c: show watermark_boost of zone in zoneinfo · a6ea8b5b
      Liangcai Fan authored
      min/low/high_wmark_pages(z) is defined as
      
        (z->_watermark[WMARK_MIN/LOW/HIGH] + z->watermark_boost)
      
      If kswapd is frequently woken up due to the increase of
      min/low/high_wmark_pages, printing watermark_boost can quickly locate
      whether watermark_boost or _watermark[WMARK_MIN/LOW/HIGH] caused
      min/low/high_wmark_pages to increase.
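
      The relationship in the definition above can be sketched with a toy
      zone structure.  The toy_ names are hypothetical; the sketch only shows
      why a reported watermark can rise either because the base value changed
      or because a boost was applied, which is what printing watermark_boost
      helps to tell apart.

```c
#include <assert.h>
#include <stddef.h>

enum toy_wmark { TOY_WMARK_MIN, TOY_WMARK_LOW, TOY_WMARK_HIGH, TOY_NR_WMARK };

struct toy_zone {
    unsigned long watermark[TOY_NR_WMARK]; /* base watermarks */
    unsigned long watermark_boost;         /* temporary boost */
};

/* Effective watermark: base value plus the current boost. */
static unsigned long toy_wmark_pages(const struct toy_zone *z, enum toy_wmark w)
{
    assert(z != NULL && w < TOY_NR_WMARK);
    return z->watermark[w] + z->watermark_boost;
}
```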
      
      Link: https://lkml.kernel.org/r/1632472566-12246-1-git-send-email-liangcaifan19@gmail.com
      Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com>
      Cc: Chunyan Zhang <zhang.lyra@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6ea8b5b