1. 04 Jul, 2024 (40 commits)
    • mm/damon/sysfs-schemes: add target_nid on sysfs-schemes · e36287c6
      Hyeongtak Ji authored
      This patch adds target_nid under
        /sys/kernel/mm/damon/admin/kdamonds/<N>/contexts/<N>/schemes/<N>/
      
      The 'target_nid' can be used as the destination node for DAMOS actions
      such as DAMOS_MIGRATE_{HOT,COLD} in the follow up patches.
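
      For illustration, a minimal user-space sketch of setting the file (once
      the follow-up patches land) might look like the following; the kdamond,
      context, and scheme indices and the destination node id are hypothetical
      and depend on the actual setup:

        /* sketch: point scheme 0 of kdamond 0 at NUMA node 2 as the target */
        #include <stdio.h>

        int main(void)
        {
                const char *path = "/sys/kernel/mm/damon/admin/kdamonds/0/"
                                   "contexts/0/schemes/0/target_nid";
                FILE *f = fopen(path, "w");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                fprintf(f, "2\n");      /* hypothetical: node 2 is the CXL node */
                return fclose(f) ? 1 : 0;
        }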
      
      [sj@kernel.org: document target_nid file]
        Link: https://lkml.kernel.org/r/20240618213630.84846-3-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240614030010.751-4-honggyu.kim@sk.com
      Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Gregory Price <gregory.price@memverge.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Rakie Kim <rakie.kim@sk.com>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e36287c6
    • mm: rename alloc_demote_folio to alloc_migrate_folio · 8f75267d
      Honggyu Kim authored
      alloc_demote_folio() can also be used for general migration, including
      both demotion and promotion, so it'd be better to rename it to
      alloc_migrate_folio().
      
      Link: https://lkml.kernel.org/r/20240614030010.751-3-honggyu.kim@sk.com
      Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Cc: Gregory Price <gregory.price@memverge.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Rakie Kim <rakie.kim@sk.com>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8f75267d
    • mm: make alloc_demote_folio externally invokable for migration · a00ce85a
      Honggyu Kim authored
      Patch series "DAMON based tiered memory management for CXL memory", v6.
      
      Introduction
      ============
      
      With the advent of CXL/PCIe-attached DRAM, which will be called simply
      CXL memory in this cover letter, some systems are becoming more
      heterogeneous, with memory subsystems that have different latency and
      bandwidth characteristics.  These are usually handled as different NUMA
      nodes in separate memory tiers, and CXL memory is used as a slow tier
      because of its protocol overhead compared to local DRAM.
      
      In this kind of system, we need to be careful to place memory pages on
      the proper NUMA nodes based on their access frequency.  Otherwise,
      frequently accessed pages might reside on slow tiers and cause
      unexpected performance degradation.  Moreover, the memory access
      patterns can change at runtime.
      
      To handle this problem, we need a way to monitor memory access patterns
      and migrate pages based on their access temperature.  The DAMON (Data
      Access MONitor) framework and its DAMOS (DAMON-based Operation Schemes)
      are useful features for monitoring and migrating pages.  DAMOS provides
      multiple actions based on DAMON monitoring results; it can be used for
      proactive reclaim, i.e. swapping cold pages out with the DAMOS_PAGEOUT
      action, but it doesn't support migration actions such as demotion and
      promotion between tiered memory nodes.
      
      This series adds two new DAMOS actions: DAMOS_MIGRATE_HOT for promotion
      from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast tiers.
      This prevents hot pages from being stuck on slow tiers, which would
      degrade performance, and proactively demotes cold pages to slow tiers
      so that the system has a better chance of allocating more hot pages on
      fast tiers.
      
      DAMON provides various tuning knobs, but we found that the proactive
      demotion of cold pages is especially useful when the system is running
      out of memory on its fast tier nodes.
      
      Our evaluation result shows that it reduces the performance slowdown
      compared to the default memory policy from 11% to 3~5% when the system
      runs under high memory pressure on its fast tier DRAM nodes.
      
      DAMON configuration
      ===================
      
      The specific DAMON configuration doesn't have to be in the scope of
      this patch series, but a rough idea is worth sharing to explain the
      evaluation results.
      
      DAMON provides many knobs for fine tuning, but its configuration file
      is generated by HMSDK[3].  HMSDK includes a gen_config.py script that
      generates a JSON file with the full config of DAMON knobs, and it
      creates multiple kdamonds, one for each NUMA node, when DAMON is
      enabled so that it can run hot/cold based migration for tiered memory.
      
      Evaluation Workload
      ===================
      
      The performance evaluation is done with redis[4], a widely used
      in-memory database, and the memory access patterns are generated via
      YCSB[5].  We measured two different workloads, with zipfian and latest
      distributions, but their configs are slightly modified to make memory
      usage higher and execution time longer for better evaluation.
      
      The evaluation using these migrate_{hot,cold} actions targets
      system-wide memory management rather than partitioning hot/cold pages
      of a single workload.  The default memory allocation policy places
      pages on the fast tier DRAM node first, then allocates newly created
      pages on the slow tier CXL node when the DRAM node has insufficient
      free space.  Once the page allocation is done, those pages never move
      between NUMA nodes.  That is not true when NUMA balancing is used, but
      NUMA balancing is outside the scope of this DAMON based tiered memory
      management support.
      
      If the working set of redis fits fully into the DRAM node, then redis
      will access the fast DRAM only.  Since DRAM-only performance is faster
      than partially accessing CXL memory on slow tiers, this environment is
      not useful for evaluating this patch series.
      
      To make redis pages be distributed across the fast DRAM node and the
      slow CXL node so we can evaluate our migrate_{hot,cold} actions, we
      pre-allocate some cold memory externally using mmap and memset before
      launching redis-server.  We assume there is a sufficient amount of cold
      memory in datacenters, as the TMO[6] and TPP[7] papers mention.
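
      A minimal sketch of that cold-memory pre-allocation, assuming an
      anonymous private mapping (the 440GB size is just one of the values
      used below):

        /* sketch: fault in a large anonymous block once, then keep it cold */
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                size_t size = 440UL << 30;      /* e.g. 440GB of cold memory */
                void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED)
                        return 1;
                memset(p, 1, size);     /* touch every page so it is allocated */
                pause();                /* hold the memory without accessing it */
                return 0;
        }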
      
      The evaluation sequence is as follows.
      
      1. Turn on DAMON with the DAMOS_MIGRATE_COLD action for the DRAM node
         and the DAMOS_MIGRATE_HOT action for the CXL node.  It demotes cold
         pages on the DRAM node and promotes hot pages on the CXL node at a
         regular interval.
      2. Allocate a huge block of cold memory by calling mmap and memset on
         the fast tier DRAM node, then make the process sleep so that the
         fast tier has insufficient space for redis-server.
      3. Launch redis-server and load a prebaked snapshot image, dump.rdb.
         The redis-server consumes 52GB of anon pages and 33GB of file pages,
         but due to the cold memory allocated in step 2, it cannot allocate
         the entire memory of redis-server on the fast tier DRAM node, so the
         remainder is allocated on the slow tier CXL node.  The DRAM:CXL
         ratio depends on the size of the pre-allocated cold memory.
      4. Run YCSB to generate a zipfian or latest distribution of memory
         accesses to redis-server, then measure its execution time when it
         completes.
      5. Repeat step 4 50 times to measure the average execution time for
         each run.
      6. Increase the cold memory size, then go back to step 2 and repeat.
      
      Each run of step 4 took about a minute, so repeating it 50 times took
      about an hour for each cold memory size, which went from 440GB to 500GB
      in 10GB increments for each evaluation.  So it took more than 10 hours
      for both the zipfian and latest workloads to get the entire evaluation
      results.  Repeating the same test set multiple times doesn't show much
      difference, so this should be enough to make the results reliable.
      
      Evaluation Results
      ==================
      
      All the result values are normalized to the DRAM-only execution time
      because the workload cannot be faster than DRAM-only unless it hits
      peak bandwidth, and our redis test doesn't go beyond the bandwidth
      limit.
      
      So the DRAM-only execution time is the ideal result, unaffected by the
      performance gap between DRAM and CXL.  The NUMA node environment is as
      follows.
      
        node0 - local DRAM, 512GB with a CPU socket (fast tier)
        node1 - disabled
        node2 - CXL DRAM, 96GB, no CPU attached (slow tier)
      
      The following is the result of generating a zipfian distribution
      against redis-server; the numbers are averaged over 50 executions.
      
        1. YCSB zipfian distribution read only workload
        memory pressure with cold memory on node0 with 512GB of local DRAM.
        ====================+================================================+=========
                            |       cold memory occupied by mmap and memset  |
                            |   0G  440G  450G  460G  470G  480G  490G  500G |
        ====================+================================================+=========
        Execution time normalized to DRAM-only values                        | GEOMEAN
        --------------------+------------------------------------------------+---------
        DRAM-only           | 1.00     -     -     -     -     -     -     - | 1.00
        CXL-only            | 1.19     -     -     -     -     -     -     - | 1.19
        default             |    -  1.00  1.05  1.08  1.12  1.14  1.18  1.18 | 1.11
        DAMON tiered        |    -  1.03  1.03  1.03  1.03  1.03  1.07 *1.05 | 1.04
        DAMON lazy          |    -  1.04  1.03  1.04  1.05  1.06  1.06 *1.06 | 1.05
        ====================+================================================+=========
        CXL usage of redis-server in GB                                      | AVERAGE
        --------------------+------------------------------------------------+---------
        DRAM-only           |  0.0     -     -     -     -     -     -     - |  0.0
        CXL-only            | 51.4     -     -     -     -     -     -     - | 51.4
        default             |    -   0.6  10.6  20.5  30.5  40.5  47.6  50.4 | 28.7
        DAMON tiered        |    -   0.6   0.5   0.4   0.7   0.8   7.1   5.6 |  2.2
        DAMON lazy          |    -   0.5   3.0   4.5   5.4   6.4   9.4   9.1 |  5.5
        ====================+================================================+=========
      
      Each test result is based on the execution environment as follows.
      
        DRAM-only:           redis-server uses only local DRAM memory.
        CXL-only:            redis-server uses only CXL memory.
        default:             default memory policy(MPOL_DEFAULT).
                             numa balancing disabled.
        DAMON tiered:        DAMON enabled with DAMOS_MIGRATE_COLD for DRAM
                             nodes and DAMOS_MIGRATE_HOT for CXL nodes.
        DAMON lazy:          same as DAMON tiered, but turn on DAMON just
                             before making memory access request via YCSB.
      
      The above result shows that the "default" execution time goes up as the
      size of cold memory increases from 440G to 500G: the more cold memory
      is used, the more CXL memory the target redis workload uses, and this
      makes the execution time increase.
      
      However, "DAMON tiered" and other DAMON results show less slowdown because
      the DAMOS_MIGRATE_COLD action at DRAM node proactively demotes
      pre-allocated cold memory to CXL node and this free space at DRAM
      increases more chance to allocate hot or warm pages of redis-server to
      fast DRAM node.  Moreover, DAMOS_MIGRATE_HOT action at CXL node also
      promotes hot pages of redis-server to DRAM node actively.
      
      As a result, it makes more memory of redis-server stay in DRAM node
      compared to "default" memory policy and this makes the performance
      improvement.
      
      Please note that the result numbers of "DAMON tiered" and "DAMON lazy" at
      500G are marked with * stars, which means their test results are replaced
      with reproduced tests that didn't have OOM issue.
      
      That was needed because the test processes sometimes hit OOM when DRAM
      has insufficient space.  DAMOS_MIGRATE_HOT doesn't kick reclaim but
      just gives up migration when there is not enough space on the DRAM
      side.  The problem happens when there is competition between normal
      allocation and migration and the migration is done before the normal
      allocation: the completely unrelated normal allocation can then trigger
      reclaim, which incurs OOM.
      
      Because of this issue, I have also tested more cases with the
      "demotion_enabled" flag enabled so that such reclaim doesn't trigger
      OOM but just demotes the reclaimed pages.  The following test results
      show those additional tests, marked with "kswapd".
      
        2. YCSB zipfian distribution read only workload (with demotion_enabled true)
        memory pressure with cold memory on node0 with 512GB of local DRAM.
        ====================+================================================+=========
                            |       cold memory occupied by mmap and memset  |
                            |   0G  440G  450G  460G  470G  480G  490G  500G |
        ====================+================================================+=========
        Execution time normalized to DRAM-only values                        | GEOMEAN
        --------------------+------------------------------------------------+---------
        DAMON tiered        |    -  1.03  1.03  1.03  1.03  1.03  1.07  1.05 | 1.04
        DAMON lazy          |    -  1.04  1.03  1.04  1.05  1.06  1.06  1.06 | 1.05
        DAMON tiered kswapd |    -  1.03  1.03  1.03  1.03  1.02  1.02  1.03 | 1.03
        DAMON lazy kswapd   |    -  1.04  1.04  1.04  1.03  1.05  1.04  1.05 | 1.04
        ====================+================================================+=========
        CXL usage of redis-server in GB                                      | AVERAGE
        --------------------+------------------------------------------------+---------
        DAMON tiered        |    -   0.6   0.5   0.4   0.7   0.8   7.1   5.6 |  2.2
        DAMON lazy          |    -   0.5   3.0   4.5   5.4   6.4   9.4   9.1 |  5.5
        DAMON tiered kswapd |    -   0.0   0.0   0.4   0.5   0.1   0.8   1.0 |  0.4
        DAMON lazy kswapd   |    -   4.2   4.6   5.3   1.7   6.8   8.1   5.8 |  5.2
        ====================+================================================+=========
      
      Each test result is based on the execution environment as follows.
      
        DAMON tiered:        same as before
        DAMON lazy:          same as before
        DAMON tiered kswapd: same as DAMON tiered, but turn on
                             /sys/kernel/mm/numa/demotion_enabled so that
                             kswapd or direct reclaim does demotion.
        DAMON lazy kswapd:   same as DAMON lazy, but turn on
                             /sys/kernel/mm/numa/demotion_enabled so that
                             kswapd or direct reclaim does demotion.
      
      The "DAMON tiered kswapd" and "DAMON lazy kswapd" didn't trigger OOM at
      all unlike other tests because kswapd and direct reclaim from DRAM node
      can demote reclaimed pages to CXL node independently from DAMON actions
      and their results are slightly better than without having
      "demotion_enabled".
      
      In summary, the evaluation results show that DAMON memory management with
      DAMOS_MIGRATE_{HOT,COLD} actions reduces the performance slowdown compared
      to the "default" memory policy from 11% to 3~5% when the system runs with
      high memory pressure on its fast tier DRAM nodes.
      
      Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make
      tiered memory systems run more efficiently under high memory pressures.
      
      
      This patch (of 7):
      
      alloc_demote_folio() can be used outside of vmscan.c, so it'd be better
      to remove the static keyword from it.
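
      As a rough sketch (the signature shown follows the migrate_pages()
      new-folio callback convention and is illustrative, not the literal
      diff):

        /* before (mm/vmscan.c): only usable inside vmscan.c */
        static struct folio *alloc_demote_folio(struct folio *src,
                                                unsigned long private);

        /* after: drop "static" and declare it in a shared mm header such as
         * mm/internal.h, so later patches in this series can call it */
        struct folio *alloc_demote_folio(struct folio *src,
                                         unsigned long private);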
      
      Link: https://lkml.kernel.org/r/20240614030010.751-1-honggyu.kim@sk.com
      Link: https://lkml.kernel.org/r/20240614030010.751-2-honggyu.kim@sk.com
      Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Gregory Price <gregory.price@memverge.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Rakie Kim <rakie.kim@sk.com>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a00ce85a
    • mm/mm_init.c: simplify logic of deferred_[init|free]_pages · 972b89c1
      Wei Yang authored
      The functions deferred_[init|free]_pages() are only used in
      deferred_init_maxorder(), which makes sure the range to init/free is
      within MAX_ORDER_NR_PAGES.
      
      With this knowledge, we can simplify these two functions:
      
        * only the first pfn could be IS_MAX_ORDER_ALIGNED()
      
        * the range passed to deferred_[init|free]_pages() always comes from
          memblock.memory, for which we have already allocated memmap, so
          pfn_valid() always returns true and the related check can be
          removed.
      
      [richard.weiyang@gmail.com: adjust function declaration indention per David]
        Link: https://lkml.kernel.org/r/20240613114525.27528-1-richard.weiyang@gmail.com
      Link: https://lkml.kernel.org/r/20240612020421.31975-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      972b89c1
    • mm/memory-failure: correct comment in me_swapcache_dirty · e5d89670
      Miaohe Lin authored
      A dirty swap cache page could live in both the page table (not the page
      cache) and the swap cache when freshly swapped in.  Correct the comment.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-14-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e5d89670
    • mm/memory-failure: remove obsolete comment in kill_proc() · d49f2366
      Miaohe Lin authored
      When the user sets SIGBUS to SIG_IGN, it won't cause a loop now.  For
      an action-required MCE error, SIGBUS cannot be blocked.  Also, when a
      hwpoisoned page is re-accessed, kill_accessing_process() will be called
      to kill the process.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-13-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d49f2366
    • mm/memory-failure: fix comment of get_hwpoison_page() · b71340ef
      Miaohe Lin authored
      When the return value is 0, it could also mean the page is a free
      hugetlb page or a free buddy page.  Fix the corresponding comment.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-12-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b71340ef
    • mm/memory-failure: move some function declarations into internal.h · 3a78f77f
      Miaohe Lin authored
      There are some functions only used inside mm.  Move them into internal.h. 
      No functional change intended.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-11-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202405251049.hxjwX7zO-lkp@intel.com/
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3a78f77f
    • mm/memory-failure: remove obsolete comment in unpoison_memory() · 28eab7d4
      Miaohe Lin authored
      Since commit 130d4df5 ("mm/sl[au]b: rearrange struct slab fields to
      allow larger rcu_head"), folio->_mapcount is not overloaded with SLAB. 
      Update corresponding comment.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-10-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      28eab7d4
    • mm/memory-failure: use helper macro task_pid_nr() · 96e13a4e
      Miaohe Lin authored
      Use helper macro task_pid_nr() to get the pid of a task.  No functional
      change intended.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-9-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      96e13a4e
    • mm/memory-failure: don't export hwpoison_filter() when !CONFIG_HWPOISON_INJECT · 5a8b01be
      Miaohe Lin authored
      When CONFIG_HWPOISON_INJECT is not enabled, there is no user of the
      hwpoison_filter() outside memory-failure.  So there is no need to export
      it in that case.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-8-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202406070136.hGQwVbsv-lkp@intel.com/
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5a8b01be
    • mm/memory-failure: remove confusing initialization to count · 4d64ab2f
      Miaohe Lin authored
      It's meaningless and confusing to init local variable count to 1.  Remove
      it.  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-7-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4d64ab2f
    • mm/memory-failure: remove unneeded empty string · 7f8de206
      Miaohe Lin authored
      Remove unneeded empty string in definition of macro pr_fmt.  No functional
      change intended.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-6-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7f8de206
    • mm/memory-failure: save some page_folio() calls · b7c3afba
      Miaohe Lin authored
      Use local variable folio directly to save a page_folio() call.  Also use
      folio_mapped() to save more page_folio() calls.  No functional change
      intended.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-5-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b7c3afba
    • mm/memory-failure: add macro GET_PAGE_MAX_RETRY_NUM · babde186
      Miaohe Lin authored
      Add helper macro GET_PAGE_MAX_RETRY_NUM to replace magic number 3.  No
      functional change intended.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-4-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      babde186
    • mm/memory-failure: remove MF_MSG_SLAB · ceb32d6a
      Miaohe Lin authored
      Since commit 46df8e73 ("mm: free up PG_slab"), MF_MSG_SLAB becomes
      unused.  Remove it.  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-3-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ceb32d6a
    • mm/memory-failure: simplify put_ref_page() · 16117532
      Miaohe Lin authored
      Patch series "Some cleanups for memory-failure", v3.
      
      This series contains a few cleanup patches to avoid exporting an unused
      function, add a helper macro, fix some obsolete comments, and so on.
      More details can be found in the respective changelogs.
      
      
      This patch (of 13):
      
      Remove unneeded page != NULL check.  pfn_to_page() won't return NULL.  No
      functional change intended.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20240612071835.157004-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: kernel test robot <lkp@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      16117532
    • mm/hugetlb: guard dequeue_hugetlb_folio_nodemask against NUMA_NO_NODE uses · 09a53362
      Oscar Salvador authored
      dequeue_hugetlb_folio_nodemask() expects a preferred node to get the
      hugetlb page from.  It does not expect, though, users to pass
      NUMA_NO_NODE; otherwise we will get trash when trying to get the
      zonelist from that node.  All current users are careful enough not to
      pass NUMA_NO_NODE, but it opens the door for new users to get this
      wrong since it is not documented [0].
      
      Guard against this by getting the local nid if NUMA_NO_NODE was passed.
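
      A minimal sketch of such a guard, assuming the local node is an
      acceptable fallback (the exact helper used to pick the fallback node
      may differ):

        /* sketch: fall back to the local node before looking up the zonelist */
        if (nid == NUMA_NO_NODE)
                nid = numa_node_id();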
      
      [0] https://lore.kernel.org/linux-mm/0000000000004f12bb061a9acf07@google.com/
      
      Closes: https://lore.kernel.org/linux-mm/0000000000004f12bb061a9acf07@google.com/
      Link: https://lkml.kernel.org/r/20240612082936.10867-1-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reported-by: syzbot+569ed13f4054f271087b@syzkaller.appspotmail.com
      Tested-by: syzbot+569ed13f4054f271087b@syzkaller.appspotmail.com
      Reviewed-by: Muchun Song <muchun.song@linux.dev>
      Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      09a53362
    • mm/hugetlb_cgroup: switch to the new cftypes · b79d715c
      Xiu Jianfeng authored
      The previous patch has already reconstructed the cftype attributes
      based on the templates and saved them in dfl_cftypes and
      legacy_cftypes.  Now remove the old procedure and switch to the new
      cftypes.
      
      Link: https://lkml.kernel.org/r/20240612092409.2027592-4-xiujianfeng@huawei.com
      Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b79d715c
    • mm/hugetlb_cgroup: prepare cftypes based on template · 47179fe0
      Xiu Jianfeng authored
      Unlike other cgroup subsystems, the hugetlb cgroup does not provide a
      static array of cftype that explicitly displays the properties,
      handling functions, etc., of each file.  Instead, it dynamically
      creates the cftype attributes based on the hstate during the startup
      procedure.  This reduces the readability of the code.
      
      To fix this, introduce two cftype templates, and rebuild the attributes
      according to the hstate so they are ready to be added to the cgroup
      framework.
      
      Link: https://lkml.kernel.org/r/20240612092409.2027592-3-xiujianfeng@huawei.com
      Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: kernel test robot <oliver.sang@intel.com>
      From: Xiu Jianfeng <xiujianfeng@huawei.com>
      Subject: mm/hugetlb_cgroup: register lockdep key for cftype
      Date: Tue, 18 Jun 2024 07:19:22 +0000
      
      When CONFIG_DEBUG_LOCK_ALLOC is enabled, the following commands can
      trigger a bug,
      
      mount -t cgroup2 none /sys/fs/cgroup
      cd /sys/fs/cgroup
      echo "+hugetlb" > cgroup.subtree_control
      
      The log is as below:
      
      BUG: key ffff8880046d88d8 has not been registered!
      ------------[ cut here ]------------
      DEBUG_LOCKS_WARN_ON(1)
      WARNING: CPU: 3 PID: 226 at kernel/locking/lockdep.c:4945 lockdep_init_map_type+0x185/0x220
      Modules linked in:
      CPU: 3 PID: 226 Comm: bash Not tainted 6.10.0-rc4-next-20240617-g76db4c64526c #544
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
      RIP: 0010:lockdep_init_map_type+0x185/0x220
      Code: 00 85 c0 0f 84 6c ff ff ff 8b 3d 6a d1 85 01 85 ff 0f 85 5e ff ff ff 48 c7 c6 21 99 4a 82 48 c7 c7 60 29 49 82 e8 3b 2e f5
      RSP: 0018:ffffc9000083fc30 EFLAGS: 00000282
      RAX: 0000000000000000 RBX: ffffffff828dd820 RCX: 0000000000000027
      RDX: ffff88803cd9cac8 RSI: 0000000000000001 RDI: ffff88803cd9cac0
      RBP: ffff88800674fbb0 R08: ffffffff828ce248 R09: 00000000ffffefff
      R10: ffffffff8285e260 R11: ffffffff828b8eb8 R12: ffff8880046d88d8
      R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880067281c0
      FS:  00007f68601ea740(0000) GS:ffff88803cd80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00005614f3ebc740 CR3: 000000000773a000 CR4: 00000000000006f0
      Call Trace:
       <TASK>
       ? __warn+0x77/0xd0
       ? lockdep_init_map_type+0x185/0x220
       ? report_bug+0x189/0x1a0
       ? handle_bug+0x3c/0x70
       ? exc_invalid_op+0x18/0x70
       ? asm_exc_invalid_op+0x1a/0x20
       ? lockdep_init_map_type+0x185/0x220
       __kernfs_create_file+0x79/0x100
       cgroup_addrm_files+0x163/0x380
       ? find_held_lock+0x2b/0x80
       ? find_held_lock+0x2b/0x80
       ? find_held_lock+0x2b/0x80
       css_populate_dir+0x73/0x180
       cgroup_apply_control_enable+0x12f/0x3a0
       cgroup_subtree_control_write+0x30b/0x440
       kernfs_fop_write_iter+0x13a/0x1f0
       vfs_write+0x341/0x450
       ksys_write+0x64/0xe0
       do_syscall_64+0x4b/0x110
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
      RIP: 0033:0x7f68602d9833
      Code: 8b 15 61 26 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 08
      RSP: 002b:00007fff9bbdf8e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f68602d9833
      RDX: 0000000000000009 RSI: 00005614f3ebc740 RDI: 0000000000000001
      RBP: 00005614f3ebc740 R08: 000000000000000a R09: 0000000000000008
      R10: 00005614f3db6ba0 R11: 0000000000000246 R12: 0000000000000009
      R13: 00007f68603bd6a0 R14: 0000000000000009 R15: 00007f68603b8880
      
      For lockdep, there is a sanity check in lockdep_init_map_type(): the
      lock-class key must either have been allocated statically or have been
      registered as a dynamic key.  However, commit e18df2889ff9
      ("mm/hugetlb_cgroup: prepare cftypes based on template") changed the
      cftypes from statically allocated objects to dynamically allocated
      ones, so the cft->lockdep_key must be registered proactively.
      
      [xiujianfeng@huawei.com: fix BUG()]
        Link: https://lkml.kernel.org/r/20240619015527.2212698-1-xiujianfeng@huawei.com
      Link: https://lkml.kernel.org/r/20240618071922.2127289-1-xiujianfeng@huawei.com
      Link: https://lore.kernel.org/all/602186b3-5ce3-41b3-90a3-134792cc2a48@samsung.com/
      Fixes: e18df2889ff9 ("mm/hugetlb_cgroup: prepare cftypes based on template")
      Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Closes: https://lore.kernel.org/oe-lkp/202406181046.8d8b2492-oliver.sang@intel.com
      Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Tested-by: SeongJae Park <sj@kernel.org>
      Closes: https://lore.kernel.org/20240618233608.400367-1-sj@kernel.org
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      47179fe0
    • mm/hugetlb_cgroup: identify the legacy using cgroup_subsys_on_dfl() · 520de595
      Xiu Jianfeng authored
      Patch series "mm/hugetlb_cgroup: rework on cftypes", v3.
      
      This patchset provides an intuitive view of the control files through
      static templates of cftypes.  This improves the readability of the code.  
      
      
      This patch (of 3):
      
      Currently the numa_stat file encodes 1 into .private using the macro
      MEMFILE_PRIVATE() to identify the legacy (cgroup v1) hierarchy.
      Actually, we can use cgroup_subsys_on_dfl() instead.  This is helpful
      for handling .private in the static templates in the next patch.
      
      Link: https://lkml.kernel.org/r/20240612092409.2027592-1-xiujianfeng@huawei.com
      Link: https://lkml.kernel.org/r/20240612092409.2027592-2-xiujianfeng@huawei.com
      Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      520de595
    • mm: report per-page metadata information · 15995a35
      Sourav Panda authored
      Today, we do not have any observability of per-page metadata and how much
      it takes away from the machine capacity.  Thus, we want to describe the
      amount of memory that is going towards per-page metadata, which can vary
      depending on build configuration, machine architecture, and system use.
      
      This patch adds 2 fields to /proc/vmstat that can be used as shown below:
      
      Accounting per-page metadata allocated by boot-allocator:
      	/proc/vmstat:nr_memmap_boot * PAGE_SIZE
      
      Accounting per-page metadata allocated by buddy-allocator:
      	/proc/vmstat:nr_memmap * PAGE_SIZE
      
      Accounting total per-page metadata allocated on the machine:
      	(/proc/vmstat:nr_memmap_boot +
      	 /proc/vmstat:nr_memmap) * PAGE_SIZE
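
      For illustration, a small user-space sketch that sums the two counters
      (counter names as introduced by this patch; output formatting is
      arbitrary):

        /* sketch: report the total per-page metadata in bytes */
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                FILE *f = fopen("/proc/vmstat", "r");
                char line[128];
                unsigned long long pages = 0, v;

                if (!f)
                        return 1;
                while (fgets(line, sizeof(line), f)) {
                        if (sscanf(line, "nr_memmap %llu", &v) == 1 ||
                            sscanf(line, "nr_memmap_boot %llu", &v) == 1)
                                pages += v;
                }
                fclose(f);
                printf("per-page metadata: %llu bytes\n",
                       pages * (unsigned long long)sysconf(_SC_PAGESIZE));
                return 0;
        }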
      
      Utility for userspace:
      
      Observability: Describe the amount of memory overhead that is going to
      per-page metadata on the system at any given time since this overhead is
      not currently observable.
      
      Debugging: Tracking the changes or absolute value in struct pages can help
      detect anomalies as they can be correlated with other metrics in the
      machine (e.g., memtotal, number of huge pages, etc).
      
      page_ext overheads: Some kernel features, such as page_owner and
      page_table_check, that use page_ext can be optionally enabled via
      kernel parameters.  Having the total per-page metadata information
      helps users precisely measure their impact.  Furthermore, the
      page-metadata metrics will reflect the amount of struct page memory
      relinquished (or overhead reduced) when hugetlbfs pages are reserved,
      which will vary depending on whether hugetlb vmemmap optimization is
      enabled or not.
      
      For background and results see:
      lore.kernel.org/all/20240220214558.3377482-1-souravpanda@google.com
      
      Link: https://lkml.kernel.org/r/20240605222751.1406125-1-souravpanda@google.com
      Signed-off-by: Sourav Panda <souravpanda@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Chen Linxuan <chenlinxuan@uniontech.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ivan Babrou <ivan@cloudflare.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tomas Mudrunka <tomas.mudrunka@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Yang Yang <yang.yang29@zte.com.cn>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      15995a35
    • selftests/mm: guard defines from shm · 8192bc03
      Edward Liaw authored
      thuge-gen.c defines SHM_HUGE_* macros that are provided by the uapi since
      4.14.  These macros get redefined when compiling with Android's bionic
      because its sys/shm.h will import the uapi definitions.
      
      However if linux/shm.h is included, with glibc, sys/shm.h will clash on
      some struct definitions:
      
        /usr/include/linux/shm.h:26:8: error: redefinition of ‘struct shmid_ds’
           26 | struct shmid_ds {
              |        ^~~~~~~~
        In file included from /usr/include/x86_64-linux-gnu/bits/shm.h:45,
                         from /usr/include/x86_64-linux-gnu/sys/shm.h:30:
        /usr/include/x86_64-linux-gnu/bits/types/struct_shmid_ds.h:24:8: note: originally defined here
           24 | struct shmid_ds
              |        ^~~~~~~~
      
      For now, guard the SHM_HUGE_* defines with ifndef to prevent redefinition
      warnings on Android bionic.
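
      A sketch of such guards, assuming values that mirror the uapi
      definitions (only a subset of the SHM_HUGE_* macros is shown):

        #ifndef SHM_HUGE_SHIFT
        #define SHM_HUGE_SHIFT  26
        #endif
        #ifndef SHM_HUGE_2MB
        #define SHM_HUGE_2MB    (21 << SHM_HUGE_SHIFT)
        #endif
        #ifndef SHM_HUGE_1GB
        #define SHM_HUGE_1GB    (30 << SHM_HUGE_SHIFT)
        #endif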
      
      Link: https://lkml.kernel.org/r/20240605223637.1374969-3-edliaw@google.com
      Signed-off-by: Edward Liaw <edliaw@google.com>
      Reviewed-by: Carlos Llamas <cmllamas@google.com>
      Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Bill Wendling <morbo@google.com>
      Cc: Carlos Llamas <cmllamas@google.com>
      Cc: Justin Stitt <justinstitt@google.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8192bc03
    • selftests/mm: include linux/mman.h · 05f3f75c
      Edward Liaw authored
      thuge-gen defines MAP_HUGE_* macros that are provided by linux/mman.h
      since 4.15.  Remove the macros and include linux/mman.h instead.
      
      Link: https://lkml.kernel.org/r/20240605223637.1374969-2-edliaw@google.com
      Signed-off-by: Edward Liaw <edliaw@google.com>
      Reviewed-by: Carlos Llamas <cmllamas@google.com>
      Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Bill Wendling <morbo@google.com>
      Cc: Justin Stitt <justinstitt@google.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      05f3f75c
    • mm/memory_hotplug: prevent accessing by index=-1 · 5958d359
      Anastasia Belova authored
      nid may be equal to NUMA_NO_NODE (-1).  Prevent accessing the node_data
      array with an invalid index by adding a check for nid.
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE.
      
      Link: https://lkml.kernel.org/r/20240606080659.18525-1-abelova@astralinux.ru
      Fixes: e83a437f ("mm/memory_hotplug: introduce "auto-movable" online policy")
      Signed-off-by: Anastasia Belova <abelova@astralinux.ru>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5958d359
    • mm/mlock: implement folio_mlock_step() using folio_pte_batch() · f742829d
      Lance Yang authored
      Let's make folio_mlock_step() simply a wrapper around folio_pte_batch(),
      which will greatly reduce the cost of ptep_get() when scanning a range of
      contptes.
      
      Link: https://lkml.kernel.org/r/20240611010418.70797-1-ioworker0@gmail.com
      Signed-off-by: Lance Yang <ioworker0@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Suggested-by: Barry Song <21cnbao@gmail.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Cc: Bang Li <libang.li@antgroup.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f742829d
    • mm: zswap: handle incorrect attempts to load large folios · c63f210d
      Yosry Ahmed authored
      Zswap does not support storing or loading large folios.  Until proper
      support is added, attempts to load large folios from zswap are a bug.
      
      For example, if a swapin fault observes that contiguous PTEs are pointing
      to contiguous swap entries and tries to swap them in as a large folio,
      swap_read_folio() will pass in a large folio to zswap_load(), but
      zswap_load() will only effectively load the first page in the folio.  If
      the first page is not in zswap, the folio will be read from disk, even
      though other pages may be in zswap.
      
      In both cases, this will lead to silent data corruption.  Proper support
      needs to be added before large folio swapins and zswap can work together.
      
      Looking at the callers of swap_read_folio(), it seems the folios are
      allocated either in __read_swap_cache_async() or in do_swap_page() in
      the SWP_SYNCHRONOUS_IO path, both of which allocate order-0 folios, so
      everything is fine for now.
      
      However, there is ongoing work to add support for large folio swapins
      [1].  To make sure new development does not break zswap (or get broken
      by zswap), add minimal handling of incorrect loads of large folios to
      zswap.  First, move the call to folio_mark_uptodate() inside
      zswap_load().
      
      If a large folio load is attempted, and zswap was ever enabled on the
      system, return 'true' without calling folio_mark_uptodate().  This will
      prevent the folio from being read from disk, and will emit an IO error
      because the folio is not uptodate (e.g.  do_swap_page() will return
      VM_FAULT_SIGBUS).  It may not be reliable recovery in all cases, but it
      is better than nothing.
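
      A minimal sketch of the described check inside zswap_load() (the exact
      placement and the use of a warning are illustrative):

        /* at the top of zswap_load(), before the xarray lookup */
        if (folio_test_large(folio) && !zswap_never_enabled())
                return true;    /* folio stays !uptodate -> caller sees an IO error */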
      
      This was tested by hacking the allocation in __read_swap_cache_async() to
      use order 2 and __GFP_COMP.
      
      In the future, to handle this correctly, the swapin code should:
      
      (a) Fall back to order-0 swapins if zswap was ever used on the
          machine, because compressed pages remain in zswap after it is
          disabled.
      
      (b) Add proper support to swapin large folios from zswap (fully or
          partially).
      
      Probably start with (a), then follow up with (b).
      
      [1] https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/
      
      Link: https://lkml.kernel.org/r/20240611024516.1375191-3-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Barry Song <baohua@kernel.org>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c63f210d
    • mm: zswap: add zswap_never_enabled() · 2d4d2b1c
      Yosry Ahmed authored
      Add zswap_never_enabled() to skip the xarray lookup in zswap_load() if
      zswap was never enabled on the system.  It is implemented using static
      branches for efficiency, as enabling zswap should be a rare event.  This
      could shave some cycles off zswap_load() when CONFIG_ZSWAP is used but
      zswap is never enabled.
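
      A minimal sketch of such a helper, assuming a static key that is
      flipped the first time zswap is enabled (the key name is illustrative):

        static DEFINE_STATIC_KEY_FALSE(zswap_ever_enabled);  /* illustrative name */

        bool zswap_never_enabled(void)
        {
                return !static_branch_likely(&zswap_ever_enabled);
        }

        /* ...and on the enable path, flipped at most once: */
        static_branch_enable(&zswap_ever_enabled);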
      
      However, the real motivation behind this patch is two-fold:
      - Incoming large folio swapin work will need to fallback to order-0
        folios if zswap was ever enabled, because any part of the folio could be
        in zswap, until proper handling of large folios with zswap is added.
      
      - A warning and recovery attempt will be added in a following change in
        case the above is not done correctly.  Zswap will fail the read if
        the folio is large and zswap was ever enabled.
      
      Expose zswap_never_enabled() in the header for the swapin work to use
      it later.
      
      [yosryahmed@google.com: expose zswap_never_enabled() in the header]
        Link: https://lkml.kernel.org/r/Zmjf0Dr8s9xSW41X@google.com
      Link: https://lkml.kernel.org/r/20240611024516.1375191-2-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2d4d2b1c
    • mm: zswap: rename is_zswap_enabled() to zswap_is_enabled() · 2b33a97c
      Yosry Ahmed authored
      In preparation for introducing a similar function, rename
      is_zswap_enabled() to use zswap_* prefix like other zswap functions.
      
      Link: https://lkml.kernel.org/r/20240611024516.1375191-1-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Barry Song <baohua@kernel.org>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2b33a97c
    • mm/mm_init.c: print mem_init info after defer_init is done · 4f66da89
      Wei Yang authored
      Current call flow looks like this:
      
      start_kernel
        mm_core_init
          mem_init
          mem_init_print_info
        rest_init
          kernel_init
            kernel_init_freeable
              page_alloc_init_late
                deferred_init_memmap
      
      With CONFIG_DEFERRED_STRUCT_PAGE_INIT, at the time
      mem_init_print_info() is called, pages are not yet fully initialized
      and freed to the buddy allocator.
      
      This has one issue:
      
        * nr_free_pages() only contains part of the free pages in the
          system, which is not what we expect.
      
      Let's print the mem info after defer_init is done.
      
      Also this would help changing totalram_pages accounting, since we plan
      to move the accounting into __free_pages_core().
      
      Link: https://lkml.kernel.org/r/20240611145223.16872-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4f66da89
    • mm/sparse: use MEMBLOCK_ALLOC_ACCESSIBLE enum instead of 0 · afb90a36
      Leesoo Ahn authored
      Setting 'limit' variable to 0 might seem like it means "no limit".  But in
      the memblock API, 0 actually means the 'MEMBLOCK_ALLOC_ACCESSIBLE' enum,
      which limits the physical address range end based on
      'memblock.current_limit'.  This could be confusing.
      
      Use the enum instead of 0 to make it clear.
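
      For illustration, a hypothetical memblock allocation written with the
      enum rather than a bare 0 (the call site and arguments are
      illustrative, not the actual mm/sparse.c code):

        /* spell out the intent of the max_addr argument */
        buf = memblock_alloc_try_nid(size, PAGE_SIZE, MEMBLOCK_LOW_LIMIT,
                                     MEMBLOCK_ALLOC_ACCESSIBLE /* was: 0 */,
                                     nid);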
      
      Link: https://lkml.kernel.org/r/20240610151528.943680-1-lsahn@wewakecorp.com
      Signed-off-by: Leesoo Ahn <lsahn@ooseel.net>
      Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      afb90a36
    • mm/vmscan: avoid split lazyfree THP during shrink_folio_list() · 735ecdfa
      Lance Yang authored
      When the user no longer requires the pages, they would use
      madvise(MADV_FREE) to mark the pages as lazy free.  Subsequently, they
      typically would not re-write to that memory again.
      
      During memory reclaim, if we detect that the large folio and its PMD
      are both still marked as clean and there are no unexpected references
      (such as GUP), we can just discard the memory lazily, improving the
      efficiency of memory reclamation in this case.
      
      On an Intel i5 CPU, reclaiming 1GiB of lazyfree THPs using
      mem_cgroup_force_empty() results in the following runtimes in seconds
      (shorter is better):
      
      --------------------------------------------
      |     Old       |      New       |  Change  |
      --------------------------------------------
      |   0.683426    |    0.049197    |  -92.80% |
      --------------------------------------------
      
      [ioworker0@gmail.com: minor changes per David]
        Link: https://lkml.kernel.org/r/20240622100057.3352-1-ioworker0@gmail.com
      Link: https://lkml.kernel.org/r/20240614015138.31461-4-ioworker0@gmail.com
      Signed-off-by: Lance Yang <ioworker0@gmail.com>
      Suggested-by: Zi Yan <ziy@nvidia.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Cc: Bang Li <libang.li@antgroup.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Fangrui Song <maskray@google.com>
      Cc: Jeff Xie <xiehuan09@gmail.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      735ecdfa
    • mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop · 29e847d2
      Lance Yang authored
      In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
      folios, start the pagewalk first, then call split_huge_pmd_address() to
      split the folio.
      
      Link: https://lkml.kernel.org/r/20240614015138.31461-3-ioworker0@gmail.com
      Signed-off-by: Lance Yang <ioworker0@gmail.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: Zi Yan <ziy@nvidia.com>
      Cc: Bang Li <libang.li@antgroup.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Fangrui Song <maskray@google.com>
      Cc: Jeff Xie <xiehuan09@gmail.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      29e847d2
    • Lance Yang's avatar
      mm/rmap: remove duplicated exit code in pagewalk loop · 26d21b18
      Lance Yang authored
      Patch series "Reclaim lazyfree THP without splitting", v8.
      
      This series adds support for reclaiming PMD-mapped THP marked as lazyfree
      without needing to first split the large folio via
      split_huge_pmd_address().
      
      When the user no longer requires the pages, they use madvise(MADV_FREE)
      to mark them as lazy free; typically the memory is not written to again
      afterwards.
      
      During memory reclaim, if we detect that the large folio and its PMD are
      both still clean and there are no unexpected references (such as GUP), we
      can simply discard the memory lazily, improving the efficiency of memory
      reclamation in this case.
      
      Performance Testing
      ===================
      
      On an Intel i5 CPU, reclaiming 1GiB of lazyfree THPs using
      mem_cgroup_force_empty() results in the following runtimes in seconds
      (shorter is better):
      
      --------------------------------------------
      |     Old       |      New       |  Change  |
      --------------------------------------------
      |   0.683426    |    0.049197    |  -92.80% |
      --------------------------------------------
      
      
      This patch (of 8):
      
      Introduce the labels walk_done and walk_abort as exit points to eliminate
      duplicated exit code in the pagewalk loop.
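
      A hedged illustration of the pattern (heavily simplified; the real loop
      in try_to_unmap_one() does far more per-entry work, and the condition
      names here are made up):

        while (page_vma_mapped_walk(&pvmw)) {
                if (cannot_unmap_entry)         /* hypothetical condition */
                        goto walk_abort;
                if (done_with_folio)            /* hypothetical condition */
                        goto walk_done;
                /* ... normal per-entry unmap work ... */
                continue;
        walk_abort:
                ret = false;
        walk_done:
                page_vma_mapped_walk_done(&pvmw);
                break;
        }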
      
      Link: https://lkml.kernel.org/r/20240614015138.31461-1-ioworker0@gmail.com
      Link: https://lkml.kernel.org/r/20240614015138.31461-2-ioworker0@gmail.comSigned-off-by: default avatarLance Yang <ioworker0@gmail.com>
      Reviewed-by: default avatarZi Yan <ziy@nvidia.com>
      Reviewed-by: default avatarBaolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarBarry Song <baohua@kernel.org>
      Cc: Bang Li <libang.li@antgroup.com>
      Cc: Fangrui Song <maskray@google.com>
      Cc: Jeff Xie <xiehuan09@gmail.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      26d21b18
    • Usama Arif's avatar
      mm: do not start/end writeback for pages stored in zswap · 9ba85f55
      Usama Arif authored
      Most of the work done in folio_start_writeback is reversed in
      folio_end_writeback.  For example, NR_WRITEBACK and NR_ZONE_WRITE_PENDING
      are incremented in start_writeback and decremented in end_writeback.
      Calling end_writeback immediately after start_writeback (separated only
      by folio_unlock) cancels the effect of most of the work done in start, so
      both calls can be removed, as sketched below.
      
      folio_end_writeback does some extra work as well, but it is either
      incorrect or not applicable to zswap:
      - folio_end_writeback increments the NR_WRITTEN counter even though the
        pages are not written to disk; this change corrects that behaviour.
      - folio_end_writeback calls folio_rotate_reclaimable, but that only makes
        sense for asynchronously written-back pages, whereas zswap pages are
        reclaimed synchronously.
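
      A hedged sketch of the swap-out path this changes (simplified from the
      sequence described above; the surrounding function is elided):

        /* Before: bracket the synchronous zswap store with writeback
         * accounting that folio_end_writeback() immediately undoes. */
        if (zswap_store(folio)) {
                folio_start_writeback(folio);
                folio_unlock(folio);
                folio_end_writeback(folio);
                return 0;
        }

        /* After: the folio is stored synchronously, so skip the writeback
         * bookkeeping entirely. */
        if (zswap_store(folio)) {
                folio_unlock(folio);
                return 0;
        }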
      
      Link: https://lkml.kernel.org/r/20240612100109.1616626-1-usamaarif642@gmail.com
      Link: https://lkml.kernel.org/r/20240610143037.812955-1-usamaarif642@gmail.comSigned-off-by: default avatarUsama Arif <usamaarif642@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarShakeel Butt <shakeel.butt@linux.dev>
      Acked-by: default avatarYosry Ahmed <yosryahmed@google.com>
      Reviewed-by: default avatarChengming Zhou <chengming.zhou@linux.dev>
      Suggested-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      9ba85f55
    • Pankaj Raghav's avatar
      selftests/mm: use asm volatile to not optimize mmap read variable · ecc1793b
      Pankaj Raghav authored
      create_pagecache_thp_and_fd() in split_huge_page_test.c used the variable
      dummy to read from the mmapped region.
      
      However, this test was skipped even on XFS, which has large folio
      support.  The issue was that the compiler (gcc 13.2.0) optimized the
      dummy variable away, and therefore no huge page was created in the page
      cache.
      
      Use the asm volatile() trick to force the compiler not to optimize out
      the loop where we read from the mmaped addr.  This is similar to what is
      done in other tests (cow.c, etc).
      
      As the variable is now used in the asm statement, remove the unused
      attribute.
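
      The trick looks roughly like this (variable names are illustrative, not
      the exact selftest code):

        /* Touch every byte of the mapping; the empty asm with a "+r"
         * constraint makes each read observable, so the compiler cannot
         * optimize the loop away. */
        for (size_t i = 0; i < fd_size; i++) {
                char dummy = *(addr + i);

                asm volatile("" : "+r" (dummy));
        }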
      
      Link: https://lkml.kernel.org/r/20240606203619.677276-1-kernel@pankajraghav.comSigned-off-by: default avatarPankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: default avatarZi Yan <ziy@nvidia.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      ecc1793b
    • Barry Song's avatar
      mm: set pte writable while pte_soft_dirty() is true in do_swap_page() · 20dfa5b7
      Barry Song authored
      This patch leverages the new pte_needs_soft_dirty_wp() helper to optimize
      the case in do_swap_page() where softdirty tracking is enabled but the
      softdirty flag has already been set.  In this situation we can use
      pte_mkwrite instead of applying write-protection, since we do not depend
      on a later write fault.
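
      Schematically (a simplified sketch, not the literal diff; the other
      conditions do_swap_page() checks before granting write access are
      elided):

        /* Soft-dirty no longer forces write-protection here: if the flag is
         * already set (or the feature is disabled for the VMA), map the PTE
         * writable right away instead of waiting for a write fault. */
        if (!pte_needs_soft_dirty_wp(vma, pte))
                pte = pte_mkwrite(pte, vma);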
      
      Link: https://lkml.kernel.org/r/20240607211358.4660-3-21cnbao@gmail.comSigned-off-by: default avatarBarry Song <v-songbaohua@oppo.com>
      Suggested-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      20dfa5b7
    • Barry Song's avatar
      mm: introduce pmd|pte_needs_soft_dirty_wp helpers for softdirty write-protect · f38ee285
      Barry Song authored
      Patch series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers and
      utilize them", v2.
      
      
      This patchset introduces the pte_needs_soft_dirty_wp and
      pmd_needs_soft_dirty_wp helpers to determine if write protection is
      required for softdirty tracking.  These helpers enhance code readability
      and improve the overall appearance.
      
      They are then utilized in gup, mprotect, swap, and other related
      functions.
      
      
      This patch (of 2): 
      
      This patch introduces the pte_needs_soft_dirty_wp and
      pmd_needs_soft_dirty_wp helpers to determine if write protection is
      required for softdirty tracking.  This can enhance code readability and
      improve its overall appearance.  These new helpers are then utilized in
      gup, huge_memory, and mprotect.
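
      A hedged sketch of what such helpers can look like (assumed definitions,
      not necessarily the exact ones merged):

        static inline bool pte_needs_soft_dirty_wp(struct vm_area_struct *vma,
                                                   pte_t pte)
        {
                /* Write-protect only if softdirty tracking is enabled for the
                 * VMA and this PTE is not marked soft-dirty yet. */
                return vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte);
        }

        static inline bool pmd_needs_soft_dirty_wp(struct vm_area_struct *vma,
                                                   pmd_t pmd)
        {
                return vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd);
        }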
      
      Link: https://lkml.kernel.org/r/20240607211358.4660-1-21cnbao@gmail.com
      Link: https://lkml.kernel.org/r/20240607211358.4660-2-21cnbao@gmail.comSigned-off-by: default avatarBarry Song <v-songbaohua@oppo.com>
      Suggested-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f38ee285
    • John Hubbard's avatar
      selftests/mm: kvm, mdwe fixes to avoid requiring "make headers" · c142850f
      John Hubbard authored
      On Ubuntu 23.04, the kvm and mdwe tests in selftests/mm fail to build
      because a few items found in prctl.h are missing.  Here is an excerpt of
      the build failures:
      
      ksm_tests.c:252:13: error: use of undeclared identifier 'PR_SET_MEMORY_MERGE'
      ...
      mdwe_test.c:26:18: error: use of undeclared identifier 'PR_SET_MDWE'
      mdwe_test.c:38:18: error: use of undeclared identifier 'PR_GET_MDWE'
      
      Fix these errors by adding a new tools/include/uapi/linux/prctl.h.  This
      file was created by running "make headers" and then copying a snapshot
      over from ./usr/include/linux/prctl.h, as per the approach we settled on
      in [1].
      
      [1] commit e076eaca ("selftests: break the dependency upon local
      header files")
      
      Link: https://lkml.kernel.org/r/20240618022422.804305-6-jhubbard@nvidia.comSigned-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrei Vagin <avagin@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c142850f
    • John Hubbard's avatar
      selftests/mm: fix vm_util.c build failures: add snapshot of fs.h · e2019472
      John Hubbard authored
      On Ubuntu 23.04, on a clean git tree, the selftests/mm build fails due to
      10 or 20 missing items, all of which are found in fs.h, which is created
      via "make headers".  However, as per [1], the idea is to stop requiring
      "make headers" and instead take a snapshot of the files and check them in.
      
      Here are a few of the build errors:
      
      vm_util.c:34:21: error: variable has incomplete type 'struct pm_scan_arg'
              struct pm_scan_arg arg;
      ...
      vm_util.c:45:28: error: use of undeclared identifier 'PAGE_IS_WPALLOWED'
      ...
      vm_util.c:55:21: error: variable has incomplete type 'struct page_region'
      ...
      vm_util.c:105:20: error: use of undeclared identifier 'PAGE_IS_SOFT_DIRTY'
      
      To fix this, add fs.h, taken from a snapshot of ./usr/include/linux/fs.h
      after running "make headers".
      
      [1] commit e076eaca ("selftests: break the dependency upon local
      header files")
      
      Link: https://lkml.kernel.org/r/20240618022422.804305-5-jhubbard@nvidia.comSigned-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrei Vagin <avagin@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e2019472