1. 12 Jul, 2024 2 commits
  2. 10 Jul, 2024 20 commits
  3. 06 Jul, 2024 11 commits
    • Kefeng Wang's avatar
      mm: migrate: remove folio_migrate_copy() · 3f594937
      Kefeng Wang authored
      The folio_migrate_copy() is just a wrapper of folio_copy() and
      folio_migrate_flags(), it is simple and only aio use it for now, unfold it
      and remove folio_migrate_copy().
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-7-wangkefeng.wang@huawei.comSigned-off-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: default avatarJane Chu <jane.chu@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      3f594937
    • Kefeng Wang's avatar
      fs: hugetlbfs: support poisoned recover from hugetlbfs_migrate_folio() · f00b295b
      Kefeng Wang authored
      This is similar to __migrate_folio(), use folio_mc_copy() in HugeTLB folio
      migration to avoid panic when copy from poisoned folio.
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-6-wangkefeng.wang@huawei.comSigned-off-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f00b295b
    • Kefeng Wang's avatar
      mm: migrate: support poisoned recover from migrate folio · 06091399
      Kefeng Wang authored
      The folio migration is widely used in kernel, memory compaction, memory
      hotplug, soft offline page, numa balance, memory demote/promotion, etc,
      but once access a poisoned source folio when migrating, the kerenl will
      panic.
      
      There is a mechanism in the kernel to recover from uncorrectable memory
      errors, ARCH_HAS_COPY_MC, which is already used in other core-mm paths,
      eg, CoW, khugepaged, coredump, ksm copy, see copy_mc_to_{user,kernel},
      copy_mc_{user_}highpage callers.
      
      In order to support poisoned folio copy recover from migrate folio, we
      chose to make folio migration tolerant of memory failures and return error
      for folio migration, because folio migration is no guarantee of success,
      this could avoid the similar panic shown below.
      
        CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0
        pc : copy_page+0x10/0xc0
        lr : copy_highpage+0x38/0x50
        ...
        Call trace:
         copy_page+0x10/0xc0
         folio_copy+0x78/0x90
         migrate_folio_extra+0x54/0xa0
         move_to_new_folio+0xd8/0x1f0
         migrate_folio_move+0xb8/0x300
         migrate_pages_batch+0x528/0x788
         migrate_pages_sync+0x8c/0x258
         migrate_pages+0x440/0x528
         soft_offline_in_use_page+0x2ec/0x3c0
         soft_offline_page+0x238/0x310
         soft_offline_page_store+0x6c/0xc0
         dev_attr_store+0x20/0x40
         sysfs_kf_write+0x4c/0x68
         kernfs_fop_write_iter+0x130/0x1c8
         new_sync_write+0xa4/0x138
         vfs_write+0x238/0x2d8
         ksys_write+0x74/0x110
      
      Note, folio copy is moved in the begin of the __migrate_folio(), which
      could simplify the error handling since there is no turning back if
      folio_migrate_mapping() return success, the downside is the folio copied
      even though folio_migrate_mapping() return fail, an optimization is to
      check whether source folio does not have extra refs before we do folio
      copy.
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-5-wangkefeng.wang@huawei.comSigned-off-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      06091399
    • Kefeng Wang's avatar
      mm: migrate: split folio_migrate_mapping() · 52881539
      Kefeng Wang authored
      The folio refcount check is moved out for both !mapping and mapping folio,
      also update comment from page to folio for folio_migrate_mapping().
      
      No functional change intended.
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-4-wangkefeng.wang@huawei.comSigned-off-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      52881539
    • Kefeng Wang's avatar
      mm: add folio_mc_copy() · 02f4ee5a
      Kefeng Wang authored
      Add a #MC variant of folio_copy() which uses copy_mc_highpage() to support
      #MC handled during folio copy, it will be used in folio migration soon.
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-3-wangkefeng.wang@huawei.comSigned-off-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: default avatarJane Chu <jane.chu@oracle.com>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      02f4ee5a
    • Kefeng Wang's avatar
      mm: move memory_failure_queue() into copy_mc_[user]_highpage() · 28bdacbc
      Kefeng Wang authored
      Patch series "mm: migrate: support poison recover from migrate folio", v5.
      
      The folio migration is widely used in kernel, memory compaction, memory
      hotplug, soft offline page, numa balance, memory demote/promotion, etc,
      but once access a poisoned source folio when migrating, the kernel will
      panic.
      
      There is a mechanism in the kernel to recover from uncorrectable memory
      errors, ARCH_HAS_COPY_MC(eg, Machine Check Safe Memory Copy on x86), which
      is already used in NVDIMM or core-mm paths(eg, CoW, khugepaged, coredump,
      ksm copy), see copy_mc_to_{user,kernel}, copy_mc_{user_}highpage callers.
      
      This series of patches provide the recovery mechanism from folio copy for
      the widely used folio migration.  Please note, because folio migration is
      no guarantee of success, so we could chose to make folio migration
      tolerant of memory failures, adding folio_mc_copy() which is a #MC
      versions of folio_copy(), once accessing a poisoned source folio, we could
      return error and make the folio migration fail, and this could avoid the
      similar panic shown below.
      
        CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0
        pc : copy_page+0x10/0xc0
        lr : copy_highpage+0x38/0x50
        ...
        Call trace:
         copy_page+0x10/0xc0
         folio_copy+0x78/0x90
         migrate_folio_extra+0x54/0xa0
         move_to_new_folio+0xd8/0x1f0
         migrate_folio_move+0xb8/0x300
         migrate_pages_batch+0x528/0x788
         migrate_pages_sync+0x8c/0x258
         migrate_pages+0x440/0x528
         soft_offline_in_use_page+0x2ec/0x3c0
         soft_offline_page+0x238/0x310
         soft_offline_page_store+0x6c/0xc0
         dev_attr_store+0x20/0x40
         sysfs_kf_write+0x4c/0x68
         kernfs_fop_write_iter+0x130/0x1c8
         new_sync_write+0xa4/0x138
         vfs_write+0x238/0x2d8
         ksys_write+0x74/0x110
      
      
      This patch (of 5):
      
      There is a memory_failure_queue() call after copy_mc_[user]_highpage(),
      see callers, eg, CoW/KSM page copy, it is used to mark the source page as
      h/w poisoned and unmap it from other tasks, and the upcomming poison
      recover from migrate folio will do the similar thing, so let's move the
      memory_failure_queue() into the copy_mc_[user]_highpage() instead of
      adding it into each user, this should also enhance the handling of
      poisoned page in khugepaged.
      
      Link: https://lkml.kernel.org/r/20240626085328.608006-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20240626085328.608006-2-wangkefeng.wang@huawei.comSigned-off-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: default avatarJane Chu <jane.chu@oracle.com>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      28bdacbc
    • Andrew Morton's avatar
      Merge branch 'mm-hotfixes-stable' into mm-stable to pick up "mm: fix · 8ef6fd0e
      Andrew Morton authored
      crashes from deferred split racing folio migration", needed by "mm:
      migrate: split folio_migrate_mapping()".
      8ef6fd0e
    • Lorenzo Stoakes's avatar
    • Hugh Dickins's avatar
      mm: fix crashes from deferred split racing folio migration · be9581ea
      Hugh Dickins authored
      Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
      flags when freeing, yet the flags shown are not bad: PG_locked had been
      set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
      deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
      symptoms implying double free by deferred split and large folio migration.
      
      6.7 commit 9bcef597 ("mm: memcg: fix split queue list crash when large
      folio migration") was right to fix the memcg-dependent locking broken in
      85ce2c51 ("memcontrol: only transfer the memcg data for migration"),
      but missed a subtlety of deferred_split_scan(): it moves folios to its own
      local list to work on them without split_queue_lock, during which time
      folio->_deferred_list is not empty, but even the "right" lock does nothing
      to secure the folio and the list it is on.
      
      Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
      folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
      while the old folio's reference count is temporarily frozen to 0 - adding
      such a freeze in the !mapping case too (originally, folio lock and
      unmapping and no swap cache left an anon folio unreachable, so no freezing
      was needed there: but the deferred split queue offers a way to reach it).
      
      Link: https://lkml.kernel.org/r/29c83d1a-11ca-b6c9-f92e-6ccb322af510@google.com
      Fixes: 9bcef597 ("mm: memcg: fix split queue list crash when large folio migration")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarBaolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      be9581ea
    • Paul Menzel's avatar
      lib/build_OID_registry: avoid non-destructive substitution for Perl < 5.13.2 compat · 2fe29fe9
      Paul Menzel authored
      On a system with Perl 5.12.1, commit 5ef6dc08
      ("lib/build_OID_registry: don't mention the full path of the script in
      output") causes the build to fail with the error below.
      
           Bareword found where operator expected at ./lib/build_OID_registry line 41, near "s#^\Q$abs_srctree/\E##r"
           syntax error at ./lib/build_OID_registry line 41, near "s#^\Q$abs_srctree/\E##r"
           Execution of ./lib/build_OID_registry aborted due to compilation errors.
           make[3]: *** [lib/Makefile:352: lib/oid_registry_data.c] Error 255
      
      Ahmad Fatoum analyzed that non-destructive substitution is only supported since
      Perl 5.13.2. Instead of dropping `r` and having the side effect of modifying
      `$0`, introduce a dedicated variable to support older Perl versions.
      
      Link: https://lkml.kernel.org/r/20240702223512.8329-2-pmenzel@molgen.mpg.de
      Link: https://lkml.kernel.org/r/20240701155802.75152-1-pmenzel@molgen.mpg.de
      Fixes: 5ef6dc08 ("lib/build_OID_registry: don't mention the full path of the script in output")
      Link: https://lore.kernel.org/all/259f7a87-2692-480e-9073-1c1c35b52f67@molgen.mpg.de/Signed-off-by: default avatarPaul Menzel <pmenzel@molgen.mpg.de>
      Suggested-by: default avatarAhmad Fatoum <a.fatoum@pengutronix.de>
      Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Cc: Nicolas Schier <nicolas@fjasle.eu>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Ahmad Fatoum <a.fatoum@pengutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2fe29fe9
    • Yang Shi's avatar
      mm: gup: stop abusing try_grab_folio · f442fa61
      Yang Shi authored
      A kernel warning was reported when pinning folio in CMA memory when
      launching SEV virtual machine.  The splat looks like:
      
      [  464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520
      [  464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6
      [  464.325477] RIP: 0010:__get_user_pages+0x423/0x520
      [  464.325515] Call Trace:
      [  464.325520]  <TASK>
      [  464.325523]  ? __get_user_pages+0x423/0x520
      [  464.325528]  ? __warn+0x81/0x130
      [  464.325536]  ? __get_user_pages+0x423/0x520
      [  464.325541]  ? report_bug+0x171/0x1a0
      [  464.325549]  ? handle_bug+0x3c/0x70
      [  464.325554]  ? exc_invalid_op+0x17/0x70
      [  464.325558]  ? asm_exc_invalid_op+0x1a/0x20
      [  464.325567]  ? __get_user_pages+0x423/0x520
      [  464.325575]  __gup_longterm_locked+0x212/0x7a0
      [  464.325583]  internal_get_user_pages_fast+0xfb/0x190
      [  464.325590]  pin_user_pages_fast+0x47/0x60
      [  464.325598]  sev_pin_memory+0xca/0x170 [kvm_amd]
      [  464.325616]  sev_mem_enc_register_region+0x81/0x130 [kvm_amd]
      
      Per the analysis done by yangge, when starting the SEV virtual machine, it
      will call pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory. 
      But the page is in CMA area, so fast GUP will fail then fallback to the
      slow path due to the longterm pinnalbe check in try_grab_folio().
      
      The slow path will try to pin the pages then migrate them out of CMA area.
      But the slow path also uses try_grab_folio() to pin the page, it will
      also fail due to the same check then the above warning is triggered.
      
      In addition, the try_grab_folio() is supposed to be used in fast path and
      it elevates folio refcount by using add ref unless zero.  We are guaranteed
      to have at least one stable reference in slow path, so the simple atomic add
      could be used.  The performance difference should be trivial, but the
      misuse may be confusing and misleading.
      
      Redefined try_grab_folio() to try_grab_folio_fast(), and try_grab_page()
      to try_grab_folio(), and use them in the proper paths.  This solves both
      the abuse and the kernel warning.
      
      The proper naming makes their usecase more clear and should prevent from
      abusing in the future.
      
      peterx said:
      
      : The user will see the pin fails, for gpu-slow it further triggers the WARN
      : right below that failure (as in the original report):
      : 
      :         folio = try_grab_folio(page, page_increm - 1,
      :                                 foll_flags);
      :         if (WARN_ON_ONCE(!folio)) { <------------------------ here
      :                 /*
      :                         * Release the 1st page ref if the
      :                         * folio is problematic, fail hard.
      :                         */
      :                 gup_put_folio(page_folio(page), 1,
      :                                 foll_flags);
      :                 ret = -EFAULT;
      :                 goto out;
      :         }
      
      [1] https://lore.kernel.org/linux-mm/1719478388-31917-1-git-send-email-yangge1116@126.com/
      
      [shy828301@gmail.com: fix implicit declaration of function try_grab_folio_fast]
        Link: https://lkml.kernel.org/r/CAHbLzkowMSso-4Nufc9hcMehQsK9PNz3OSu-+eniU-2Mm-xjhA@mail.gmail.com
      Link: https://lkml.kernel.org/r/20240628191458.2605553-1-yang@os.amperecomputing.com
      Fixes: 57edfcfd ("mm/gup: accelerate thp gup even for "pages != NULL"")
      Signed-off-by: default avatarYang Shi <yang@os.amperecomputing.com>
      Reported-by: default avataryangge <yangge1116@126.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>	[6.6+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f442fa61
  4. 05 Jul, 2024 7 commits
    • Jiaqi Yan's avatar
      docs: mm: add enable_soft_offline sysctl · 44195d1e
      Jiaqi Yan authored
      Add the documentation for soft offline behaviors / costs, and what the new
      enable_soft_offline sysctl is for.
      
      [jiaqiyan@google.com: fix kerneldoc warnings]
        Link: https://lkml.kernel.org/r/CACw3F52=GxTCDw-PqFh3-GDM-fo3GbhGdu0hedxYXOTT4TQSTg@mail.gmail.com
      [jiaqiyan@google.com: there are more blank lines needed]
        Link: https://lkml.kernel.org/r/CACw3F52_obAB742XeDRNun4BHBYtrxtbvp5NkUincXdaob0j1g@mail.gmail.com
      Link: https://lkml.kernel.org/r/20240626050818.2277273-5-jiaqiyan@google.comSigned-off-by: default avatarJiaqi Yan <jiaqiyan@google.com>
      Acked-by: default avatarOscar Salvador <osalvador@suse.de>
      Acked-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      44195d1e
    • Jiaqi Yan's avatar
      selftest/mm: test enable_soft_offline behaviors · 72ead83d
      Jiaqi Yan authored
      Add regression and new tests when hugepage has correctable memory errors,
      and how userspace wants to deal with it:
      
      * if enable_soft_offline=1, mapped hugepage is soft offlined
      * if enable_soft_offline=0, mapped hugepage is intact
      
      Free hugepages case is not explicitly covered by the tests.
      
      Hugepage having corrected memory errors is emulated with
      MADV_SOFT_OFFLINE.
      
      [jiaqiyan@google.com: v7]
        Link: https://lkml.kernel.org/r/20240628205958.2845610-4-jiaqiyan@google.com
      Link: https://lkml.kernel.org/r/20240626050818.2277273-4-jiaqiyan@google.comSigned-off-by: default avatarJiaqi Yan <jiaqiyan@google.com>
      Acked-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      72ead83d
    • Jiaqi Yan's avatar
      mm/memory-failure: userspace controls soft-offlining pages · 56374430
      Jiaqi Yan authored
      Correctable memory errors are very common on servers with large amount of
      memory, and are corrected by ECC.  Soft offline is kernel's additional
      recovery handling for memory pages having (excessive) corrected memory
      errors.  Impacted page is migrated to a healthy page if inuse; the
      original page is discarded for any future use.
      
      The actual policy on whether (and when) to soft offline should be
      maintained by userspace, especially in case of an 1G HugeTLB page. 
      Soft-offline dissolves the HugeTLB page, either in-use or free, into
      chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage.  If
      userspace has not acknowledged such behavior, it may be surprised when
      later failed to mmap hugepages due to lack of hugepages.  In case of a
      transparent hugepage, it will be split into 4K pages as well; userspace
      will stop enjoying the transparent performance.
      
      In addition, discarding the entire 1G HugeTLB page only because of
      corrected memory errors sounds very costly and kernel better not doing
      under the hood.  But today there are at least 2 such cases doing so:
      1. when GHES driver sees both GHES_SEV_CORRECTED and
         CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
      2. RAS Correctable Errors Collector counts correctable errors per
         PFN and when the counter for a PFN reaches threshold
      In both cases, userspace has no control of the soft offline performed
      by kernel's memory failure recovery.
      
      This commit gives userspace the control of softofflining any page: kernel
      only soft offlines raw page / transparent hugepage / HugeTLB hugepage if
      userspace has agreed to.  The interface to userspace is a new sysctl at
      /proc/sys/vm/enable_soft_offline.  By default its value is set to 1 to
      preserve existing behavior in kernel.  When set to 0, soft-offline (e.g. 
      MADV_SOFT_OFFLINE) will fail with EOPNOTSUPP.
      
      [jiaqiyan@google.com: v7]
        Link: https://lkml.kernel.org/r/20240628205958.2845610-3-jiaqiyan@google.com
      Link: https://lkml.kernel.org/r/20240626050818.2277273-3-jiaqiyan@google.comSigned-off-by: default avatarJiaqi Yan <jiaqiyan@google.com>
      Acked-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      56374430
    • Jiaqi Yan's avatar
      mm/memory-failure: refactor log format in soft offline code · 865319f7
      Jiaqi Yan authored
      Patch series "Userspace controls soft-offline pages", v6.
      
      Correctable memory errors are very common on servers with large amount of
      memory, and are corrected by ECC, but with two pain points to users:
      
      1. Correction usually happens on the fly and adds latency overhead
      2. Not-fully-proved theory states excessive correctable memory
         errors can develop into uncorrectable memory error.
      
      Soft offline is kernel's additional solution for memory pages having
      (excessive) corrected memory errors.  Impacted page is migrated to healthy
      page if it is in use, then the original page is discarded for any future
      use.
      
      The actual policy on whether (and when) to soft offline should be
      maintained by userspace, especially in case of an 1G HugeTLB page. 
      Soft-offline dissolves the HugeTLB page, either in-use or free, into
      chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage.  If
      userspace has not acknowledged such behavior, it may be surprised when
      later mmap hugepages MAP_FAILED due to lack of hugepages.  In case of a
      transparent hugepage, it will be split into 4K pages as well; userspace
      will stop enjoying the transparent performance.
      
      In addition, discarding the entire 1G HugeTLB page only because of
      corrected memory errors sounds very costly and kernel better not doing
      under the hood.  But today there are at least 2 such cases:
      
      1. GHES driver sees both GHES_SEV_CORRECTED and
         CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
      2. RAS Correctable Errors Collector counts correctable errors per
         PFN and when the counter for a PFN reaches threshold
      
      In both cases, userspace has no control of the soft offline performed by
      kernel's memory failure recovery.
      
      This patch series give userspace the control of softofflining any page:
      kernel only soft offlines raw page / transparent hugepage / HugeTLB
      hugepage if userspace has agreed to.  The interface to userspace is a new
      sysctl called enable_soft_offline under /proc/sys/vm.  By default
      enable_soft_line is 1 to preserve existing behavior in kernel.
      
      
      This patch (of 4):
      
      Logs from soft_offline_page and soft_offline_in_use_page have different
      formats than majority of the memory failure code:
      
        "Memory failure: 0x${pfn}: ${lower_case_message}"
      
      Convert them to the following format:
      
        "Soft offline: 0x${pfn}: ${lower_case_message}"
      
      No functional change in this commit.
      
      Link: https://lkml.kernel.org/r/20240626050818.2277273-1-jiaqiyan@google.com
      Link: https://lkml.kernel.org/r/20240626050818.2277273-2-jiaqiyan@google.comSigned-off-by: default avatarJiaqi Yan <jiaqiyan@google.com>
      Acked-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarLance Yang <ioworker0@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      865319f7
    • Xiu Jianfeng's avatar
      mm: memcg: adjust the warning when seq_buf overflows · c2fad56b
      Xiu Jianfeng authored
      Currently it uses WARN_ON_ONCE() if seq_buf overflows when user reads
      memory.stat, the only advantage of WARN_ON_ONCE is that the splat is so
      verbose that it gets noticed.  And also it panics the system if
      panic_on_warn is enabled.  It seems like the warning is just an over
      reaction and a simple pr_warn should just achieve the similar effect.
      
      Link: https://lkml.kernel.org/r/20240628072333.2496527-1-xiujianfeng@huawei.comSigned-off-by: default avatarXiu Jianfeng <xiujianfeng@huawei.com>
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c2fad56b
    • Xiu Jianfeng's avatar
      mm: memcg: remove redundant seq_buf_has_overflowed() · 1c46cc09
      Xiu Jianfeng authored
      Both the end of memory_stat_format() and memcg_stat_format() will call
      WARN_ON_ONCE(seq_buf_has_overflowed()).  However, memory_stat_format() is
      the only caller of memcg_stat_format(), when memcg is on the default
      hierarchy, seq_buf_has_overflowed() will be executed twice, so remove the
      redundant one.
      
      Link: https://lkml.kernel.org/r/20240626094232.2432891-1-xiujianfeng@huawei.comSigned-off-by: default avatarXiu Jianfeng <xiujianfeng@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1c46cc09
    • Audra Mitchell's avatar
      mm: turn off test_uffdio_wp if CONFIG_PTE_MARKER_UFFD_WP is not configured. · a591d35c
      Audra Mitchell authored
      If CONFIG_PTE_MARKER_UFFD_WP is disabled, then we turn off three features
      in userfaultfd_api (UFFD_FEATURE_WP_HUGETLBFS_SHMEM,
      UFFD_FEATURE_WP_UNPOPULATED, and UFFD_FEATURE_WP_ASYNC).
      
      Currently this test always will call uffdio_regsiter with the flag
      UFFDIO_REGISTER_MODE_WP.  However, the kernel ensures in vma_can_userfault
      that if the feature UFFD_FEATURE_WP_HUGETLBFS_SHMEM is disabled, only
      allow the VM_UFFD_WP on anonymous vmas, meaning our call to
      uffdio_regsiter will fail.
      
      We still want to be able to run the test even if we have
      CONFIG_PTE_MARKER_UFFD_WP disabled, so check to see if the feature
      UFFD_FEATURE_WP_HUGETLBFS_SHMEM has been turned off in the test and if so,
      disable us from calling uffdio_regsiter with the flag
      UFFDIO_REGISTER_MODE_WP.
      
      Link: https://lkml.kernel.org/r/20240626130513.120193-3-audra@redhat.comSigned-off-by: default avatarAudra Mitchell <audra@redhat.com>
      Reviewed-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Rafael Aquini <raquini@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a591d35c