1. 04 Jul, 2024 40 commits
    • mm/page_alloc: clear PageBuddy using __ClearPageBuddy() for bad pages · e4d970ac
      David Hildenbrand authored
      Let's stop using page_mapcount_reset() and clear PageBuddy using
      __ClearPageBuddy() instead.
      
      Link: https://lkml.kernel.org/r/20240529111904.2069608-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org>	[zram/zsmalloc workloads]
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/zsmalloc: use a proper page type · 43d746dc
      David Hildenbrand authored
      Let's clean it up: use a proper page type and store our data (offset into
      a page) in the lower 16 bit as documented.
      
      We won't be able to support 256 KiB base pages, which is acceptable. 
      Teach Kconfig to handle that cleanly using a new CONFIG_HAVE_ZSMALLOC.
      
      Based on this, we should do a proper "struct zsdesc" conversion, as
      proposed in [1].
      
      This removes the last _mapcount/page_type offender.
      
      [1] https://lore.kernel.org/all/20231130101242.2590384-1-42.hyeyoo@gmail.com/
      
      Link: https://lkml.kernel.org/r/20240529111904.2069608-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org>	[zram/zsmalloc workloads]
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: allow reuse of the lower 16 bit of the page type with an actual type · 8db00ad5
      David Hildenbrand authored
      As long as the owner sets a page type first, we can allow reuse of the
      lower 16 bit: sufficient to store an offset into a 64 KiB page, which is
      the maximum base page size in *common* configurations (ignoring the 256
      KiB variant).  Restrict it to the head page.
      
      We'll use that for zsmalloc next, to set a proper type while still reusing
      that field to store information (offset into a base page) that cannot go
      elsewhere for now.
      
      Let's reserve the lower 16 bit for that purpose and for catching mapcount
      underflows, and let's reduce PAGE_TYPE_BASE to a single bit.
      
      Note that the mapcount would still have to overflow quite a lot before we
      would actually indicate a valid page type.
      
      Start handing out the type bits from highest to lowest, to make it clearer
      how many bits for types we have left.  Out of 15 bit we can use for types,
      we currently use 6.  If we run out of bits before we have better typing
      (e.g., memdesc), we can always investigate storing a value instead [1].
      
      [1] https://lore.kernel.org/all/00ba1dff-7c05-46e8-b0d9-a78ac1cfc198@redhat.com/
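      
      As a rough illustration of the bit budget described above (the constants and
      helpers below are made up for clarity; they do not match the kernel's actual
      page_type encoding, which stores the field inverted):
      
        /* Illustrative sketch only -- not the kernel's real constants. */
        #define PT_BASE        0x80000000u  /* single bit: "this is a type, not a mapcount" */
        #define PT_TYPE_MASK   0x7fff0000u  /* 15 bits for types, handed out from high to low */
        #define PT_DATA_MASK   0x0000ffffu  /* lower 16 bits reusable by the type's owner */
        #define PT_EXAMPLE     (PT_BASE | 0x40000000u)  /* hypothetical type using the highest type bit */
        
        static inline int page_type_has(unsigned int page_type, unsigned int type)
        {
                /* The page has the type iff the base bit and the type's bit are both set. */
                return (page_type & (PT_BASE | type)) == (PT_BASE | type);
        }
        
        static inline unsigned short page_type_data(unsigned int page_type)
        {
                /* Owner data (e.g. an offset into a <= 64 KiB page) in the low 16 bits. */
                return page_type & PT_DATA_MASK;
        }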
      
      [akpm@linux-foundation.org: fix PG_hugetlb typo, per David]
      Link: https://lkml.kernel.org/r/20240529111904.2069608-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org>	[zram/zsmalloc workloads]
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: update _mapcount and page_type documentation · 6d21dde7
      David Hildenbrand authored
      Patch series "mm: page_type, zsmalloc and page_mapcount_reset()", v2.
      
      Wanting to remove the remaining abuser of _mapcount/page_type along with
      page_mapcount_reset(), I stumbled over zsmalloc, which is yet to be
      converted away from "struct page" [1].
      
      Unfortunately, we cannot stop using the page_type field in zsmalloc code
      completely for its own purposes.  All other fields in "struct page" are
      used one way or the other.  Could we simply store a 2-byte offset value at
      the beginning of each page?  Likely, but that will require a bit more
      work; and once we have memdesc we might want to move the offset in there
      (struct zsalloc?) again.
      
      ...  but we can limit the abuse to 16 bit, glue it to a page type that
      must be set, and document it.  page_has_type() will always successfully
      indicate such zsmalloc pages, and such zsmalloc pages only.
      
      We lose zsmalloc support for PAGE_SIZE > 64KB, which should be tolerable. 
      We could use more bits from the page type, but 16 bit sounds like a good
      idea for now.
      
      So clarify the _mapcount/page_type documentation, use a proper page_type
      for zsmalloc, and remove page_mapcount_reset().
      
      [1] https://lore.kernel.org/all/20231130101242.2590384-1-42.hyeyoo@gmail.com/
      
      
      This patch (of 6):
      
      Let's make it clearer that _mapcount must no longer be used for own
      purposes, and how _mapcount and page_type behave nowadays (also in the
      context of hugetlb folios, which are typed folios that will be mapped to
      user space).
      
      Move the documentation regarding "-1" over from page_mapcount_reset(),
      which we will remove next.  Move "page_type" before "mapcount", to make it
      clearer what typed folios are.
      
      Link: https://lkml.kernel.org/r/20240529111904.2069608-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20240529111904.2069608-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org>	[zram/zsmalloc workloads]
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: remove local __NR_* definitions · a5c6bc59
      John Hubbard authored
      This continues the work on getting the selftests to build without
      requiring people to first run "make headers" [1].
      
      Now that the system call numbers are in the correct, checked-in locations
      in the kernel tree (./tools/include/uapi/asm/unistd*.h), make sure that
      the mm selftests include that file (indirectly).
      
      Doing so provides guaranteed definitions at build time, so remove all of
      the checks for "ifdef __NR_xxx" in the mm selftests, because they will
      always be true (defined).
      
      [1] commit e076eaca ("selftests: break the dependency upon local
      header files")
      
      Link: https://lkml.kernel.org/r/20240618022422.804305-7-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Andrei Vagin <avagin@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/huge_memory.c: fix used-uninitialized · d40f74ab
      Andrew Morton authored
      Fix used-uninitialized of `page'.
      
      Fixes: dce7d10b ("mm/madvise: optimize lazyfreeing with mTHP in madvise_free")
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202406260514.SLhNM9kQ-lkp@intel.com
      Cc: Lance Yang <ioworker0@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nilfs2: fix incorrect inode allocation from reserved inodes · f41e355f
      Ryusuke Konishi authored
      If the bitmap block that manages the inode allocation status is corrupted,
      nilfs_ifile_create_inode() may allocate a new inode from the reserved
      inode area where it should not be allocated.
      
      The previous fix, commit d325dc6e ("nilfs2: fix use-after-free bug of
      struct nilfs_root"), addressed the problem that reserved inodes with inode
      numbers less than NILFS_USER_INO (=11) were incorrectly reallocated due to
      bitmap corruption.  However, the start number of non-reserved inodes is
      read from the super block and may change, in which case inode allocation
      may still occur from the extended reserved inode area.
      
      If that happens, access to that inode will cause an IO error, causing the
      file system to degrade to an error state.
      
      Fix this potential issue by adding a wraparound option to the common
      metadata object allocation routine and by modifying
      nilfs_ifile_create_inode() to disable the option so that it only allocates
      inodes with inode numbers greater than or equal to the inode number read
      in "nilfs->ns_first_ino", regardless of the bitmap status of reserved
      inodes.
      
      Link: https://lkml.kernel.org/r/20240623051135.4180-4-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nilfs2: add missing check for inode numbers on directory entries · 49ae997f
      Ryusuke Konishi authored
      Syzbot reported that mounting and unmounting a specific pattern of
      corrupted nilfs2 filesystem images causes a use-after-free of metadata
      file inodes, which triggers a kernel bug in lru_add_fn().
      
      As Jan Kara pointed out, this is because the link count of a metadata file
      gets corrupted to 0, and nilfs_evict_inode(), which is called from iput(),
      tries to delete that inode (ifile inode in this case).
      
      The inconsistency occurs because directories containing the inode numbers
      of these metadata files, which should not be visible in the namespace, are
      read without checking.
      
      Fix this issue by treating the inode numbers of these internal files as
      errors in the sanity check helper when reading directory folios/pages.
      
      Also thanks to Hillf Danton and Matthew Wilcox for their initial mm-layer
      analysis.
      
      Link: https://lkml.kernel.org/r/20240623051135.4180-3-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: syzbot+d79afb004be235636ee8@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=d79afb004be235636ee8
      Reported-by: Jan Kara <jack@suse.cz>
      Closes: https://lkml.kernel.org/r/20240617075758.wewhukbrjod5fp5o@quack3
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nilfs2: fix inode number range checks · 1ab84250
      Ryusuke Konishi authored
      Patch series "nilfs2: fix potential issues related to reserved inodes".
      
      This series fixes one use-after-free issue reported by syzbot, caused by
      nilfs2's internal inode being exposed in the namespace on a corrupted
      filesystem, and a couple of flaws that cause problems if the starting
      number of non-reserved inodes written in the on-disk super block is
      intentionally (or corruptly) changed from its default value.  
      
      
      This patch (of 3):
      
      In the current implementation of nilfs2, "nilfs->ns_first_ino", which
      gives the first non-reserved inode number, is read from the superblock,
      but its lower limit is not checked.
      
      As a result, if a number that overlaps with the inode number range of
      reserved inodes such as the root directory or metadata files is set in the
      super block parameter, the inode number test macros (NILFS_MDT_INODE and
      NILFS_VALID_INODE) will not function properly.
      
      In addition, these test macros use left bit-shift calculations with the
      inode number as the shift count via the BIT macro, but the result of a
      shift calculation that exceeds the bit width of an integer is undefined in
      the C specification, so if "ns_first_ino" is set to a large value other
      than the default value NILFS_USER_INO (=11), the macros may potentially
      malfunction depending on the environment.
      
      Fix these issues by checking the lower bound of "nilfs->ns_first_ino" and
      by preventing bit shifts equal to or greater than the NILFS_USER_INO
      constant in the inode number test macros.
      
      Also, change the type of "ns_first_ino" from signed integer to unsigned
      integer to avoid the need for type casting in comparisons such as the
      lower bound check introduced this time.
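      
      A minimal sketch of the shift hazard and the guard described above (the
      macro names below are illustrative, not the actual NILFS_MDT_INODE /
      NILFS_VALID_INODE definitions; NILFS_USER_INO comes from the nilfs2 headers):
      
        #include <linux/bits.h>
        
        /* Buggy pattern: if ino >= BITS_PER_LONG, the shift count inside BIT()
         * is out of range and the result is undefined. */
        #define INO_IS_RESERVED_BUGGY(ino, mask)  ((BIT(ino) & (mask)) != 0)
        
        /* Guarded pattern: never shift by NILFS_USER_INO (=11) or more. */
        #define INO_IS_RESERVED(ino, mask) \
                ((ino) < NILFS_USER_INO && (BIT(ino) & (mask)) != 0)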
      
      Link: https://lkml.kernel.org/r/20240623051135.4180-1-konishi.ryusuke@gmail.com
      Link: https://lkml.kernel.org/r/20240623051135.4180-2-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: avoid overflows in dirty throttling logic · 68ed2a39
      Jan Kara authored
      The dirty throttling logic is interspersed with assumptions that dirty
      limits in PAGE_SIZE units fit into 32-bit (so that various multiplications
      fit into 64-bits).  If limits end up being larger, we will hit overflows,
      possible divisions by 0 etc.  Fix these problems by never allowing so
      large dirty limits as they have dubious practical value anyway.  For
      dirty_bytes / dirty_background_bytes interfaces we can just refuse to set
      so large limits.  For dirty_ratio / dirty_background_ratio it isn't so
      simple as the dirty limit is computed from the amount of available memory
      which can change due to memory hotplug etc.  So when converting dirty
      limits from ratios to numbers of pages, we just don't allow the result to
      exceed UINT_MAX.
      
      This is a root-only triggerable problem which occurs when the operator
      sets dirty limits to >16 TB.
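      
      A hedged sketch of the clamping described above (names are illustrative,
      not the actual page-writeback.c code):
      
        #include <linux/kernel.h>
        #include <linux/math64.h>
        #include <linux/minmax.h>
        
        static unsigned long dirty_ratio_to_pages(unsigned int ratio,
                                                  unsigned long available_pages)
        {
                u64 limit = div_u64((u64)available_pages * ratio, 100);
        
                /* Keep the page-based limit within 32 bits so later
                 * multiplications still fit into 64 bits. */
                return min_t(u64, limit, UINT_MAX);
        }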
      
      Link: https://lkml.kernel.org/r/20240621144246.11148-2-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reported-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Revert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again" · 8dfcffa3
      Jan Kara authored
      Patch series "mm: Avoid possible overflows in dirty throttling".
      
      Dirty throttling logic assumes dirty limits in page units fit into
      32-bits.  This patch series makes sure this is true (see patch 2/2 for
      more details).
      
      
      This patch (of 2):
      
      This reverts commit 9319b647.
      
      The commit is broken in several ways.  Firstly, the removed (u64) cast
      from the multiplication will introduce a multiplication overflow on 32-bit
      archs if wb_thresh * bg_thresh >= 1<<32 (which is actually common - the
      default settings with 4GB of RAM will trigger this).  Secondly, the
      div64_u64() is unnecessarily expensive on 32-bit archs.  We have
      div64_ul() in case we want to be safe & cheap.  Thirdly, if dirty
      thresholds are larger than 1<<32 pages, then dirty balancing is going to
      blow up in many other spectacular ways anyway so trying to fix one
      possible overflow is just moot.
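      
      A hedged illustration of the arithmetic issue (variable names are
      illustrative, not the actual wb_dirty_limits() code):
      
        #include <linux/math64.h>
        
        static unsigned long thresh_fraction(unsigned long wb_thresh,
                                             unsigned long bg_thresh)
        {
                /* On 32-bit, wb_thresh * bg_thresh overflows unsigned long once
                 * the product exceeds 1<<32 (common with 4GB of RAM).  Casting
                 * one operand to u64 widens the multiplication, and div64_ul()
                 * keeps the division cheap on 32-bit, unlike div64_u64(). */
                return div64_ul((u64)wb_thresh * bg_thresh, 100);
        }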
      
      Link: https://lkml.kernel.org/r/20240621144017.30993-1-jack@suse.cz
      Link: https://lkml.kernel.org/r/20240621144246.11148-1-jack@suse.cz
      Fixes: 9319b647 ("mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: optimize the redundant loop of mm_update_owner_next() · 76ba6acf
      Jinliang Zheng authored
      When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or
      /proc or ptrace or page migration (get_task_mm()), it is impossible to
      find an appropriate task_struct in the loop whose mm_struct is the same as
      the target mm_struct.
      
      If the above race condition is combined with the stress-ng-zombie and
      stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in
      write_lock_irq() for tasklist_lock.
      
      Recognize this situation in advance and exit early.
      
      Link: https://lkml.kernel.org/r/20240620122123.3877432-1-alexjlzheng@tencent.com
      Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tycho Andersen <tandersen@netflix.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • khugepaged: simplify the allocation of slab caches · 9b94b5a2
      Hongfu Li authored
      Use the new KMEM_CACHE() macro instead of calling kmem_cache_create()
      directly, to simplify the creation of slab caches.
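      
      For reference, a hedged sketch of the change pattern (the struct name here
      is illustrative, not the actual khugepaged slot type):
      
        #include <linux/errno.h>
        #include <linux/slab.h>
        #include <linux/types.h>
        
        struct mm_slot_example {
                struct hlist_node hash;
        };
        
        static struct kmem_cache *mm_slot_cache;
        
        static int slot_cache_init(void)
        {
                /* Before: kmem_cache_create("mm_slot_example",
                 *                           sizeof(struct mm_slot_example),
                 *                           __alignof__(struct mm_slot_example), 0, NULL);
                 * After: KMEM_CACHE() derives the name, size and alignment from
                 * the struct definition itself. */
                mm_slot_cache = KMEM_CACHE(mm_slot_example, 0);
                return mm_slot_cache ? 0 : -ENOMEM;
        }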
      
      Link: https://lkml.kernel.org/r/20240618014517.25954-1-lihongfu@kylinos.cn
      Signed-off-by: Hongfu Li <lihongfu@kylinos.cn>
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: ksm: drop KSM_KMEM_CACHE() · aa1b9489
      Kefeng Wang authored
      After commit 21fbd591 ("ksm: add the ksm prefix to the names of the
      ksm private structures"), we could directly use KMEM_CACHE().
      
      Link: https://lkml.kernel.org/r/20240618081201.134985-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/lru_sort: remove unnecessary online tuning handling code · d4fbcf0b
      SeongJae Park authored
      DAMON_LRU_SORT contains code for handling of online DAMON parameters
      update edge cases.  It is no longer necessary since damon_commit_ctx() takes
      care of the cases.  Remove the unnecessary code.
      
      Link: https://lkml.kernel.org/r/20240618181809.82078-13-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/lru_sort: use damon_commit_ctx() · a3096943
      SeongJae Park authored
      DAMON_LRU_SORT manually manipulates the DAMON context struct for online
      parameters update.  Since the struct contains not only input parameters
      but also internal status and operation results, it is not that simple. 
      Indeed, we found and fixed a few bugs in the code.  Now DAMON core layer
      provides a function for the usage, namely damon_commit_ctx().  Replace the
      manual manipulation logic with the function.  The core layer function
      could have its own bugs, but this change removes a source of bugs.
      
      Link: https://lkml.kernel.org/r/20240618181809.82078-12-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/reclaim: remove unnecessary code for online tuning · b94322b1
      SeongJae Park authored
      DAMON_RECLAIM contains code for handling of online DAMON parameters update
      edge cases.  It is no longer necessary since damon_commit_ctx() takes care
      of the cases.  Remove the unnecessary code.
      
      Link: https://lkml.kernel.org/r/20240618181809.82078-11-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/reclaim: use damon_commit_ctx() · 11ddcfc2
      SeongJae Park authored
      DAMON_RECLAIM manually manipulates the DAMON context struct for online
      parameters update.  Since the struct contains not only input parameters
      but also internal status and operation results, it is not that simple. 
      Indeed, we found and fixed a few bugs in the code.  Now DAMON core layer
      provides a function for the usage, namely damon_commit_ctx().  Replace the
      manual manipulation logic with the function.  The core layer function
      could have its own bugs, but this change removes a source of bugs.
      
      Link: https://lkml.kernel.org/r/20240618181809.82078-10-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: rename *_set_{schemes,scheme_filters,quota_score,schemes}() · a83364a2
      SeongJae Park authored
      The functions were for updating DAMON structs that may or may not be
      partially populated.  Hence they were not only for adding items, but also
      for removing unnecessary items and updating items in place.  A previous
      commit has changed the functions to assume the structs are not partially
      populated, and to only add items.  Make the names better explain the
      behavior.
      
      Link: https://lkml.kernel.org/r/20240618181809.82078-9-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: remove unnecessary online tuning handling code · 0fddd604
      SeongJae Park authored
      damon/sysfs-schemes.c contains code for handling of online DAMON
      parameters update edge cases.  The logic is no longer necessary since
      damon_commit_ctx() and damon_commit_quota_goals() take care of the cases.
      Remove the unnecessary code.
      
      Link: https://lkml.kernel.org/r/20240618181809.82078-8-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs: rename damon_sysfs_set_targets() to ...add_targets() · 2caef83d
      SeongJae Park authored
      The function was for updating DAMON structs that may or may not be
      partially populated.  Hence it was not only for adding items, but also for
      removing unnecessary items and updating items in place.  A previous commit
      has changed the function to assume the structs are not partially
      populated, and to only add items.  Make the function name better explain
      the behavior.
      
      Link: https://lkml.kernel.org/r/20240618181809.82078-7-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs: remove unnecessary online tuning handling code · d96727a2
      SeongJae Park authored
      damon/sysfs.c contains code for handling of online DAMON parameters update
      edge cases.  It is no longer necessary since damon_commit_ctx() takes care
      of the cases.  Remove the unnecessary code.
      
      Link: https://lkml.kernel.org/r/20240618181809.82078-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: use damos_commit_quota_goals() · 77ed1eb6
      SeongJae Park authored
      DAMON_SYSFS manually manipulates the DAMOS quota structs for online quota
      goals parameter update.  Since the struct contains not only input
      parameters but also internal status and operation results, it is not that
      simple.  Now DAMON core layer provides a function for the usage, namely
      damon_commit_quota_goals().  Replace the manual manipulation logic with
      the function.  The core layer function could have its own bugs, but this
      change removes a source of bugs.
      
      Link: https://lkml.kernel.org/r/20240618181809.82078-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs: use damon_commit_ctx() · 83dc7bba
      SeongJae Park authored
      DAMON_SYSFS manually manipulates DAMON context structs for online
      parameters update.  Since the struct contains not only input parameters
      but also internal status and operation results, it is not that simple. 
      Indeed, we found and fixed a few bugs in the code.  Now DAMON core layer
      provides a function for the usage, namely damon_commit_ctx().  Replace the
      manual manipulation logic with the function.  The core layer function
      could have its own bugs, but this change removes a source of bugs.
      
      Link: https://lkml.kernel.org/r/20240618181809.82078-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: implement DAMON context commit function · 9cb3d0b9
      SeongJae Park authored
      Implement functions for supporting online DAMON context level parameters
      update.  The function receives two DAMON context structs.  One is the
      struct that is currently being used by a kdamond and therefore to be
      updated.  The other one contains the parameters to be applied to the first
      one.  The function applies the new parameters to the destination struct
      while keeping/updating the internal status and operation results.  The
      function should be called from a DAMON context-update-safe place, like
      DAMON callbacks.
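      
      A hedged usage sketch, assuming a signature along the lines of
      damon_commit_ctx(dst, src), where dst is the running context and src holds
      only the new parameters (error handling and synchronization are omitted):
      
        #include <linux/damon.h>
        
        /* Hypothetical module-level context carrying only the new parameters. */
        static struct damon_ctx *new_params_ctx;
        
        /* Called from a context-update-safe place, e.g. a DAMON callback. */
        static int commit_new_params(struct damon_ctx *running_ctx)
        {
                return damon_commit_ctx(running_ctx, new_params_ctx);
        }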
      
      Link: https://lkml.kernel.org/r/20240618181809.82078-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: implement DAMOS quota goals online commit function · 3ad1dce6
      SeongJae Park authored
      Patch series "mm/damon: introduce DAMON parameters online commit function".
      
      DAMON context struct (damon_ctx) contains user requests (parameters),
      internal status, and operation results.  For flexible usages, DAMON API
      users are encouraged to manually manipulate the struct.  That works well
      for simple use cases.  However, it has turned out that it is not that
      simple at least for online parameters update.  It is easy to forget
      properly maintaining internal status and operation results.  Also, such
      manual manipulation for online tuning is implemented multiple times on
      DAMON API users including DAMON sysfs interface, DAMON_RECLAIM and
      DAMON_LRU_SORT.  As a result, we have multiple sources of bugs for the
      same problem.  Actually we found and fixed a few bugs from online parameter
      updating of DAMON API users.
      
      Implement a function for online DAMON parameters update in core layer, and
      replace DAMON API users' manual manipulation code for the use case.  The
      core layer function could still have bugs, but this change reduces the
      source of bugs for the problem to one place.
      
      
      This patch (of 12):
      
      Implement functions for supporting online DAMOS quota goals parameters
      update.  The function receives two DAMOS quota structs.  One is the struct
      that is currently being used by a kdamond and therefore to be updated.  The
      other one contains the parameters to be applied to the first one.  The
      function applies the new parameters to the destination struct while
      keeping/updating the internal status.  The function should be called from
      a parameters-update-safe place, like DAMON callbacks.
      
      Link: https://lkml.kernel.org/r/20240618181809.82078-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240618181809.82078-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memcontrol: add VM_BUG_ON_FOLIO() to catch lru folio in mem_cgroup_migrate() · a6ab9c82
      Baolin Wang authored
      mem_cgroup_migrate() will clear the memcg data of the old folio,
      therefore, the callers must make sure the old folio is no longer on the
      LRU list, otherwise the old folio can not get the correct lruvec object
      without the memcg data, which could lead to potential problems [1].
      
      Thus adding a VM_BUG_ON_FOLIO() to catch this issue.
      
      [1] https://lore.kernel.org/all/5ab860d8ee987955e917748f9d6da525d3b52690.1718326003.git.baolin.wang@linux.alibaba.com/
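      
      A minimal sketch of the intended calling convention (the wrapper below is
      illustrative; the actual assertion lives inside mem_cgroup_migrate()):
      
        #include <linux/memcontrol.h>
        #include <linux/mm.h>
        
        /* Callers must have isolated 'old' from the LRU before migrating the
         * memcg data, otherwise the assertion fires. */
        static void migrate_memcg_checked(struct folio *old, struct folio *new)
        {
                VM_BUG_ON_FOLIO(folio_test_lru(old), old);
                mem_cgroup_migrate(old, new);
        }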
      
      Link: https://lkml.kernel.org/r/66d181c41b7ced35dbd39ffd3f5774a11aef266a.1718327124.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
      Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/damon: document damos_migrate_{hot,cold} · 83d0d46a
      Honggyu Kim authored
      This patch adds DAMON documentation for the "migrate_hot" and
      "migrate_cold" actions to both the usage and design documents, as well as
      for the new "target_nid" knob that sets the migration target node.
      
      [sj@kernel.org: trivial fixups for DAMOS_MIGRATE_{HOT,COLD} documentation]
        Link: https://lkml.kernel.org/r/20240618213630.84846-2-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240614030010.751-8-honggyu.kim@sk.com
      Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Cc: Gregory Price <gregory.price@memverge.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Rakie Kim <rakie.kim@sk.com>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion · b696722d
      Hyeongtak Ji authored
      This patch introduces the DAMOS_MIGRATE_HOT action, which is similar to
      DAMOS_MIGRATE_COLD, but prioritizes hot pages.
      
      It migrates pages inside the given region to the 'target_nid' NUMA node
      set via sysfs.
      
      Here is an example usage of the 'migrate_hot' action.
      
        $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
        $ cat contexts/<N>/schemes/<N>/action
        migrate_hot
        $ echo 0 > contexts/<N>/schemes/<N>/target_nid
        $ echo commit > state
        $ numactl -p 2 ./hot_cold 500M 600M &
        $ numastat -c -p hot_cold
      
        Per-node process memory usage (in MBs)
        PID             Node 0 Node 1 Node 2 Total
        --------------  ------ ------ ------ -----
        701 (hot_cold)     501      0    601  1101
      
      Link: https://lkml.kernel.org/r/20240614030010.751-7-honggyu.kim@sk.com
      Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Gregory Price <gregory.price@memverge.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Rakie Kim <rakie.kim@sk.com>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion · b51820eb
      Honggyu Kim authored
      This patch introduces the DAMOS_MIGRATE_COLD action, which is similar to
      DAMOS_PAGEOUT, but migrates folios to the given 'target_nid' set via sysfs
      instead of swapping them out.
      
      The 'target_nid' sysfs knob indicates the migration target node ID.
      
      Here is an example usage of the 'migrate_cold' action.
      
        $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
        $ cat contexts/<N>/schemes/<N>/action
        migrate_cold
        $ echo 2 > contexts/<N>/schemes/<N>/target_nid
        $ echo commit > state
        $ numactl -p 0 ./hot_cold 500M 600M &
        $ numastat -c -p hot_cold
      
        Per-node process memory usage (in MBs)
        PID             Node 0 Node 1 Node 2 Total
        --------------  ------ ------ ------ -----
        701 (hot_cold)     501      0    601  1101
      
      Since it shares some common routines with pageout, many functions have
      similar logic between pageout and migrate cold.
      
      damon_pa_migrate_folio_list() is a minimized version of
      shrink_folio_list().
      
      Link: https://lkml.kernel.org/r/20240614030010.751-6-honggyu.kim@sk.com
      Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
      Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Gregory Price <gregory.price@memverge.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Rakie Kim <rakie.kim@sk.com>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/migrate: add MR_DAMON to migrate_reason · ced816a7
      Honggyu Kim authored
      The current patch series introduces DAMON based migration across NUMA
      nodes so it'd be better to have a new migrate_reason in trace events.
      
      Link: https://lkml.kernel.org/r/20240614030010.751-5-honggyu.kim@sk.com
      Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Gregory Price <gregory.price@memverge.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Rakie Kim <rakie.kim@sk.com>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: add target_nid on sysfs-schemes · e36287c6
      Hyeongtak Ji authored
      This patch adds target_nid under
        /sys/kernel/mm/damon/admin/kdamonds/<N>/contexts/<N>/schemes/<N>/
      
      The 'target_nid' can be used as the destination node for DAMOS actions
      such as DAMOS_MIGRATE_{HOT,COLD} in the follow up patches.
      
      [sj@kernel.org: document target_nid file]
        Link: https://lkml.kernel.org/r/20240618213630.84846-3-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240614030010.751-4-honggyu.kim@sk.com
      Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Gregory Price <gregory.price@memverge.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Rakie Kim <rakie.kim@sk.com>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: rename alloc_demote_folio to alloc_migrate_folio · 8f75267d
      Honggyu Kim authored
      alloc_demote_folio() can also be used for general migration, including
      both demotion and promotion, so it'd be better to rename it from
      alloc_demote_folio to alloc_migrate_folio.
      
      Link: https://lkml.kernel.org/r/20240614030010.751-3-honggyu.kim@sk.com
      Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Cc: Gregory Price <gregory.price@memverge.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Rakie Kim <rakie.kim@sk.com>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: make alloc_demote_folio externally invokable for migration · a00ce85a
      Honggyu Kim authored
      Patch series "DAMON based tiered memory management for CXL memory", v6.
      
      Introduction
      ============
      
      With the advent of CXL/PCIe attached DRAM, which will be referred to
      simply as CXL memory in this cover letter, some systems are becoming more
      heterogeneous, having memory systems with different latency and bandwidth
      characteristics.  They are usually handled as different NUMA nodes in
      separate memory tiers and CXL memory is used as slow tiers because of its
      protocol overhead compared to local DRAM.
      
      In this kind of system, we need to be careful to place memory pages on
      the proper NUMA nodes based on the memory access frequency.  Otherwise,
      some frequently accessed pages might reside on slow tiers, which causes
      unexpected performance degradation.  Moreover, the memory access patterns
      can change at runtime.
      
      To handle this problem, we need a way to monitor the memory access
      patterns and migrate pages based on their access temperature.  The
      DAMON(Data Access MONitor) framework and its DAMOS(DAMON-based Operation
      Schemes) can be useful features for monitoring and migrating pages.  DAMOS
      provides multiple actions based on DAMON monitoring results and it can be
      used for proactive reclaim, which means swapping cold pages out with
      DAMOS_PAGEOUT action, but it doesn't support migration actions such as
      demotion and promotion between tiered memory nodes.
      
      This series supports two new DAMOS actions: DAMOS_MIGRATE_HOT for
      promotion from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast
      tiers.  This prevents hot pages from being stuck on slow tiers, which
      would degrade performance, and lets cold pages be proactively demoted to
      slow tiers so that the system has a better chance of allocating more hot
      pages to fast tiers.
      
      The DAMON provides various tuning knobs but we found that the proactive
      demotion for cold pages is especially useful when the system is running
      out of memory on its fast tier nodes.
      
      Our evaluation result shows that it reduces the performance slowdown
      compared to the default memory policy from 11% to 3~5% when the system
      runs under high memory pressure on its fast tier DRAM nodes.
      
      DAMON configuration
      ===================
      
      The specific DAMON configuration doesn't have to be in the scope of this
      patch series, but some rough idea is better to be shared to explain the
      evaluation result.
      
      DAMON provides many knobs for fine tuning, but its configuration file is
      generated by HMSDK[3].  It includes a gen_config.py script that generates
      a json file with the full config of DAMON knobs, and it creates multiple
      kdamonds, one for each NUMA node, when DAMON is enabled so that it can run
      hot/cold based migration for tiered memory.
      
      Evaluation Workload
      ===================
      
      The performance evaluation is done with redis[4], which is a widely used
      in-memory database and the memory access patterns are generated via
      YCSB[5].  We have measured two different workloads with zipfian and latest
      distributions but their configs are slightly modified to make memory usage
      higher and execution time longer for better evaluation.
      
      The idea of evaluation using these migrate_{hot,cold} actions covers
      system-wide memory management rather than partitioning hot/cold pages of a
      single workload.  The default memory allocation policy allocates pages on
      the fast tier DRAM node first, then allocates newly created pages on the
      slow tier CXL node when the DRAM node has insufficient free space.  Once
      the page allocation is done then those pages never move between NUMA
      nodes.  This is not true when using NUMA balancing, but that is outside
      the scope of this DAMON based tiered memory management support.
      
      If the working set of redis fits fully into the DRAM node, then redis
      will access the fast DRAM only.  Since DRAM-only performance is better
      than partially accessing CXL memory in slow tiers, this environment is not
      useful for evaluating this patch series.
      
      To make pages of redis be distributed across fast DRAM node and slow CXL
      node to evaluate our migrate_{hot,cold} actions, we pre-allocate some cold
      memory externally using mmap and memset before launching redis-server.  We
      assumed that there are enough amount of cold memory in datacenters as
      TMO[6] and TPP[7] papers mentioned.
      
      The evaluation sequence is as follows.
      
      1. Turn on DAMON with DAMOS_MIGRATE_COLD action for DRAM node and
         DAMOS_MIGRATE_HOT action for CXL node.  It demotes cold pages on DRAM
         node and promotes hot pages on CXL node in a regular interval.
      2. Allocate a huge block of cold memory by calling mmap and memset at
         the fast tier DRAM node, then make the process sleep to make the fast
         tier has insufficient space for redis-server.
      3. Launch redis-server and load prebaked snapshot image, dump.rdb.  The
         redis-server consumes 52GB of anon pages and 33GB of file pages, but
         due to the cold memory allocated at 2, it fails allocating the entire
         memory of redis-server on the fast tier DRAM node so it partially
         allocates the remaining on the slow tier CXL node.  The ratio of
         DRAM:CXL depends on the size of the pre-allocated cold memory.
      4. Run YCSB to make zipfian or latest distribution of memory accesses to
         redis-server, then measure its execution time when it's completed.
      5. Repeat 4 over 50 times to measure the average execution time for each
         run.
      6. Increase the cold memory size, then go back to step 2.
      
      Each test at step 4 took about a minute, so repeating it 50 times took
      about an hour for each specific cold memory size, from 440GB to 500GB in
      10GB increments.  In total, it took more than 10 hours for both the
      zipfian and latest workloads to get the entire evaluation results.
      Repeating the same test set multiple times doesn't show much difference,
      so I think it might be enough to make the result reliable.
      
      Evaluation Results
      ==================
      
      All the result values are normalized to DRAM-only execution time because
      the workload cannot be faster than DRAM-only unless the workload hits the
      peak bandwidth but our redis test doesn't go beyond the bandwidth limit.
      
      So the DRAM-only execution time is the ideal result without affected by
      the gap between DRAM and CXL performance difference.  The NUMA node
      environment is as follows.
      
        node0 - local DRAM, 512GB with a CPU socket (fast tier)
        node1 - disabled
        node2 - CXL DRAM, 96GB, no CPU attached (slow tier)
      
      The following is the result of generating zipfian distribution to
      redis-server and the numbers are averaged by 50 times of execution.
      
        1. YCSB zipfian distribution read only workload
        memory pressure with cold memory on node0 with 512GB of local DRAM.
        ====================+================================================+=========
                            |       cold memory occupied by mmap and memset  |
                            |   0G  440G  450G  460G  470G  480G  490G  500G |
        ====================+================================================+=========
        Execution time normalized to DRAM-only values                        | GEOMEAN
        --------------------+------------------------------------------------+---------
        DRAM-only           | 1.00     -     -     -     -     -     -     - | 1.00
        CXL-only            | 1.19     -     -     -     -     -     -     - | 1.19
        default             |    -  1.00  1.05  1.08  1.12  1.14  1.18  1.18 | 1.11
        DAMON tiered        |    -  1.03  1.03  1.03  1.03  1.03  1.07 *1.05 | 1.04
        DAMON lazy          |    -  1.04  1.03  1.04  1.05  1.06  1.06 *1.06 | 1.05
        ====================+================================================+=========
        CXL usage of redis-server in GB                                      | AVERAGE
        --------------------+------------------------------------------------+---------
        DRAM-only           |  0.0     -     -     -     -     -     -     - |  0.0
        CXL-only            | 51.4     -     -     -     -     -     -     - | 51.4
        default             |    -   0.6  10.6  20.5  30.5  40.5  47.6  50.4 | 28.7
        DAMON tiered        |    -   0.6   0.5   0.4   0.7   0.8   7.1   5.6 |  2.2
        DAMON lazy          |    -   0.5   3.0   4.5   5.4   6.4   9.4   9.1 |  5.5
        ====================+================================================+=========
      
      Each test result is based on the execution environment as follows.
      
        DRAM-only:           redis-server uses only local DRAM memory.
        CXL-only:            redis-server uses only CXL memory.
        default:             default memory policy(MPOL_DEFAULT).
                             numa balancing disabled.
        DAMON tiered:        DAMON enabled with DAMOS_MIGRATE_COLD for DRAM
                             nodes and DAMOS_MIGRATE_HOT for CXL nodes.
        DAMON lazy:          same as DAMON tiered, but turn on DAMON just
                             before making memory access request via YCSB.
      
      The above result shows that the "default" execution time goes up as the
      size of cold memory is increased from 440G to 500G: the more cold memory
      is used, the more CXL memory is used for the target redis workload, which
      increases the execution time.
      
      However, "DAMON tiered" and the other DAMON results show less slowdown
      because the DAMOS_MIGRATE_COLD action on the DRAM node proactively demotes
      pre-allocated cold memory to the CXL node, and the freed space on DRAM
      increases the chance of allocating hot or warm pages of redis-server on
      the fast DRAM node.  Moreover, the DAMOS_MIGRATE_HOT action on the CXL
      node also actively promotes hot pages of redis-server to the DRAM node.
      
      As a result, more memory of redis-server stays on the DRAM node compared
      to the "default" memory policy, which yields the performance improvement.
      
      Please note that the result numbers of "DAMON tiered" and "DAMON lazy" at
      500G are marked with * stars, which means their test results are replaced
      with reproduced tests that didn't have OOM issue.
      
      That was needed because the test processes sometimes hit OOM when DRAM
      has insufficient space.  DAMOS_MIGRATE_HOT doesn't kick reclaim but just
      gives up migration when there is not enough space on the DRAM side.  The
      problem happens when there is competition between normal allocation and
      migration: if the migration is done before the normal allocation, the
      completely unrelated normal allocation can trigger reclaim, which incurs
      OOM.
      
      Because of this issue, I have also tested more cases with the
      "demotion_enabled" flag enabled, so that such reclaim doesn't trigger OOM
      but just demotes reclaimed pages.  The following test results show those
      additional tests marked with "kswapd".
      
        2. YCSB zipfian distribution read only workload (with demotion_enabled true)
        memory pressure with cold memory on node0 with 512GB of local DRAM.
        ====================+================================================+=========
                            |       cold memory occupied by mmap and memset  |
                            |   0G  440G  450G  460G  470G  480G  490G  500G |
        ====================+================================================+=========
        Execution time normalized to DRAM-only values                        | GEOMEAN
        --------------------+------------------------------------------------+---------
        DAMON tiered        |    -  1.03  1.03  1.03  1.03  1.03  1.07  1.05 | 1.04
        DAMON lazy          |    -  1.04  1.03  1.04  1.05  1.06  1.06  1.06 | 1.05
        DAMON tiered kswapd |    -  1.03  1.03  1.03  1.03  1.02  1.02  1.03 | 1.03
        DAMON lazy kswapd   |    -  1.04  1.04  1.04  1.03  1.05  1.04  1.05 | 1.04
        ====================+================================================+=========
        CXL usage of redis-server in GB                                      | AVERAGE
        --------------------+------------------------------------------------+---------
        DAMON tiered        |    -   0.6   0.5   0.4   0.7   0.8   7.1   5.6 |  2.2
        DAMON lazy          |    -   0.5   3.0   4.5   5.4   6.4   9.4   9.1 |  5.5
        DAMON tiered kswapd |    -   0.0   0.0   0.4   0.5   0.1   0.8   1.0 |  0.4
        DAMON lazy kswapd   |    -   4.2   4.6   5.3   1.7   6.8   8.1   5.8 |  5.2
        ====================+================================================+=========
      
      Each test result is based on the execution environment as follows.
      
        DAMON tiered:        same as before
        DAMON lazy:          same as before
        DAMON tiered kswapd: same as DAMON tiered, but turn on
                             /sys/kernel/mm/numa/demotion_enabled to make
                             kswapd or direct reclaim does demotion.
        DAMON lazy kswapd:   same as DAMON lazy, but turn on
                             /sys/kernel/mm/numa/demotion_enabled to make
                             kswapd or direct reclaim does demotion.
      
      The "DAMON tiered kswapd" and "DAMON lazy kswapd" didn't trigger OOM at
      all unlike other tests because kswapd and direct reclaim from DRAM node
      can demote reclaimed pages to CXL node independently from DAMON actions
      and their results are slightly better than without having
      "demotion_enabled".
      
      In summary, the evaluation results show that DAMON memory management with
      DAMOS_MIGRATE_{HOT,COLD} actions reduces the performance slowdown compared
      to the "default" memory policy from 11% to 3~5% when the system runs with
      high memory pressure on its fast tier DRAM nodes.
      
      Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make
      tiered memory systems run more efficiently under high memory pressures.
      
      
      This patch (of 7):
      
      alloc_demote_folio() can be used outside of vmscan.c, so it'd be better
      to remove the static keyword from it.
      
      Link: https://lkml.kernel.org/r/20240614030010.751-1-honggyu.kim@sk.com
      Link: https://lkml.kernel.org/r/20240614030010.751-2-honggyu.kim@sk.com
      Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Gregory Price <gregory.price@memverge.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Rakie Kim <rakie.kim@sk.com>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mm_init.c: simplify logic of deferred_[init|free]_pages · 972b89c1
      Wei Yang authored
      The functions deferred_[init|free]_pages are only used in
      deferred_init_maxorder(), which makes sure the range to init/free is
      within MAX_ORDER_NR_PAGES in size.
      
      With this knowledge, we can simplify these two functions, since
      
        * only the first pfn could be IS_MAX_ORDER_ALIGNED()
      
      Also, since the range passed to deferred_[init|free]_pages always comes
      from memblock.memory, for which we have already allocated memmap coverage,
      pfn_valid() always returns true.  Then we can remove the related check.
      
      [richard.weiyang@gmail.com: adjust function declaration indention per David]
        Link: https://lkml.kernel.org/r/20240613114525.27528-1-richard.weiyang@gmail.com
      Link: https://lkml.kernel.org/r/20240612020421.31975-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory-failure: correct comment in me_swapcache_dirty · e5d89670
      Miaohe Lin authored
      A dirty swap cache page could live in both the page table (not the page
      cache) and the swap cache when freshly swapped in.  Correct the comment.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-14-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory-failure: remove obsolete comment in kill_proc() · d49f2366
      Miaohe Lin authored
      When the user sets SIGBUS to SIG_IGN, it won't cause a loop now.  For an
      action-required MCE error, SIGBUS cannot be blocked.  Also, when a
      hwpoisoned page is re-accessed, kill_accessing_process() will be called to
      kill the process.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-13-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory-failure: fix comment of get_hwpoison_page() · b71340ef
      Miaohe Lin authored
      When the return value is 0, it could also mean the page is a free hugetlb
      page or a free buddy page.  Fix the corresponding comment.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-12-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory-failure: move some function declarations into internal.h · 3a78f77f
      Miaohe Lin authored
      There are some functions only used inside mm.  Move them into internal.h. 
      No functional change intended.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-11-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202405251049.hxjwX7zO-lkp@intel.com/
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory-failure: remove obsolete comment in unpoison_memory() · 28eab7d4
      Miaohe Lin authored
      Since commit 130d4df5 ("mm/sl[au]b: rearrange struct slab fields to
      allow larger rcu_head"), folio->_mapcount is not overloaded with SLAB. 
      Update corresponding comment.
      
      Link: https://lkml.kernel.org/r/20240612071835.157004-10-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>