1. 23 Nov, 2022 18 commits
    • kfence: fix stack trace pruning · 747c0f35
      Marco Elver authored
      Commit b1405135 ("mm/sl[au]b: generalize kmalloc subsystem")
      refactored large parts of the kmalloc subsystem, which broke the stack
      trace pruning logic in KFENCE.
      
      While b1405135 attempted to fix the situation by including
      '__kmem_cache_free' in the list of functions KFENCE should skip through,
      this only works when the compiler actually optimizes the tail call from
      kfree() to __kmem_cache_free() into a jump (so that kfree() does _not_
      appear in the full stack trace to begin with).
      
      In some configurations, the compiler no longer optimizes the tail call
      into a jump, and __kmem_cache_free() appears in the stack trace.  This
      means that the pruned stack trace shown by KFENCE would include kfree(),
      which is not intended.  For example:
      
       | BUG: KFENCE: invalid free in kfree+0x7c/0x120
       |
       | Invalid free of 0xffff8883ed8fefe0 (in kfence-#126):
       |  kfree+0x7c/0x120
       |  test_double_free+0x116/0x1a9
       |  kunit_try_run_case+0x90/0xd0
       | [...]
      
      Fix it by moving __kmem_cache_free() to the list of functions that may be
      tail called by an allocator entry function, making the pruning logic work
      in both the optimized and unoptimized tail call cases.
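
      The effect of the two lists can be illustrated with a small user-space
      model.  This is only a sketch in Python, not the kernel code: the name
      frames_to_skip(), the prefix lists and the sample frame names are
      assumptions chosen for illustration.

```python
# Illustrative model only; prefix lists and names are assumptions, not kernel code.

# Functions an allocator entry point may tail-call.  The compiler may turn the
# call into a jump, so the entry point itself can be missing from the trace.
TAIL_CALLED = ("kfence_", "__kfence_", "__kmem_cache_free")

# Allocator entry points; the first frame after one of these is the caller.
ENTRY_POINTS = ("kfree", "kmem_cache_free")


def frames_to_skip(trace):
    """Return how many leading frames to prune from `trace` (innermost first)."""
    fallback = 0
    for i, name in enumerate(trace):
        if name.startswith(TAIL_CALLED):
            # Remember this position in case no entry point shows up later.
            fallback = i + 1
        if name.startswith(ENTRY_POINTS):
            skip = i + 1
            return skip if skip < len(trace) else 0
    return fallback if fallback < len(trace) else 0


# Tail call not optimized: kfree() is present and gets pruned.
unoptimized = ["kfence_guarded_free", "__kmem_cache_free", "kfree", "test_double_free"]
# Tail call optimized into a jump: kfree() never appears in the trace.
optimized = ["kfence_guarded_free", "__kmem_cache_free", "test_double_free"]

for trace in (unoptimized, optimized):
    print(trace[frames_to_skip(trace):])
```

      In this model both traces are pruned down to the caller,
      test_double_free, whether or not kfree() is present in the trace.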
      
      Link: https://lkml.kernel.org/r/20221118152216.3914899-1-elver@google.com
      Fixes: b1405135 ("mm/sl[au]b: generalize kmalloc subsystem")
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • proc/meminfo: fix spacing in SecPageTables · f850c849
      Yosry Ahmed authored
      SecPageTables has a tab after it instead of a space; this can break
      fragile parsers that depend on spaces after the stat names.
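
      As a hedged illustration of the kind of parser this affects, the
      Python sketch below compares splitting on a colon followed by a space
      with splitting on the colon alone.  The sample lines and both helper
      functions are assumptions made up for this example; they are not taken
      from any real tool.

```python
# Illustrative only: the sample lines and both parsers are assumptions.

def fragile_parse(line):
    """Split on ': ' (colon followed by a space), as a naive parser might."""
    name, sep, rest = line.partition(": ")
    if not sep:
        raise ValueError(f"unparseable line: {line!r}")
    return name, int(rest.split()[0])


def robust_parse(line):
    """Split on the colon alone, then on any run of whitespace."""
    name, _, rest = line.partition(":")
    return name, int(rest.split()[0])


with_space = "SecPageTables:         0 kB"
with_tab = "SecPageTables:\t       0 kB"

print(fragile_parse(with_space))
print(robust_parse(with_tab))
try:
    fragile_parse(with_tab)
except ValueError as err:
    print("fragile parser failed:", err)
```

      The fragile variant only handles the line once the tab is replaced by
      a space, while the whitespace-agnostic variant handles both forms.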
      
      Link: https://lkml.kernel.org/r/20221117043247.133294-1-yosryahmed@google.com
      Fixes: ebc97a52 ("mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.")
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: retry folios written back while isolated · 359a5e14
      Yu Zhao authored
      The page reclaim isolates a batch of folios from the tail of one of the
      LRU lists and works on those folios one by one.  For a suitable
      swap-backed folio, if the swap device is async, it queues that folio for
      writeback.  After the page reclaim finishes an entire batch, it puts back
      the folios it queued for writeback to the head of the original LRU list.
      
      In the meantime, the page writeback also flushes the queued folios in
      batches.  Its batching logic is independent from that of the page reclaim.
      For each of the folios it writes back, the page writeback calls
      folio_rotate_reclaimable() which tries to rotate a folio to the tail.
      
      folio_rotate_reclaimable() only works for a folio after the page reclaim
      has put it back.  If an async swap device is fast enough, the page
      writeback can finish with that folio while the page reclaim is still
      working on the rest of the batch containing it.  In this case, that folio
      will remain at the head and the page reclaim will not retry it before
      reaching there.
      
      This patch adds a retry to evict_folios().  After evict_folios() has
      finished an entire batch and before it puts back folios it cannot free
      immediately, it retries those that may have missed the rotation.
      
      Before this patch, ~60% of folios swapped to an Intel Optane missed
      folio_rotate_reclaimable().  After this patch, ~99% of missed folios were
      reclaimed upon retry.
      
      This problem affects relatively slow async swap devices like Samsung 980
      Pro much less and does not affect sync swap devices like zram or zswap at
      all.
      
      Link: https://lkml.kernel.org/r/20221116013808.3995280-1-yuzhao@google.com
      Fixes: ac35a490 ("mm: multi-gen LRU: minimal implementation")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mailmap: update email address for Satya Priya · 47123d7f
      Satya Priya authored
      Add and update the email address, since skakit@codeaurora.org is no
      longer active.
      
      Link: https://lkml.kernel.org/r/20221116105017.3018971-1-quic_c_skakit@quicinc.com
      Signed-off-by: Satya Priya <quic_c_skakit@quicinc.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/migrate_device: return number of migrating pages in args->cpages · 44af0b45
      Alistair Popple authored
      migrate_vma->cpages originally contained a count of the number of pages
      migrating including non-present pages which can be populated directly on
      the target.
      
      Commit 241f6885 ("mm/migrate_device.c: refactor migrate_vma and
      migrate_device_coherent_page()") inadvertently changed this to contain
      just the number of pages that were unmapped.  Usage of migrate_vma->cpages
      isn't documented, but most drivers use it to see if all the requested
      addresses can be migrated, so restore the original behaviour.
      
      Link: https://lkml.kernel.org/r/20221111005135.1344004-1-apopple@nvidia.com
      Fixes: 241f6885 ("mm/migrate_device.c: refactor migrate_vma and migrate_device_coherent_page()")
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reported-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible · 50c69721
      Sam James authored
      Add missing <linux/string.h> include for strcmp.
      
      Clang 16 makes -Wimplicit-function-declaration an error by default. 
      Unfortunately, out of tree modules may use this in configure scripts,
      which means failure might cause silent miscompilation or misconfiguration.
      
      For more information, see LWN.net [0] or LLVM's Discourse [1], gentoo-dev@ [2],
      or the (new) c-std-porting mailing list [3].
      
      [0] https://lwn.net/Articles/913505/
      [1] https://discourse.llvm.org/t/configure-script-breakage-with-the-new-werror-implicit-function-declaration/65213
      [2] https://archives.gentoo.org/gentoo-dev/message/dd9f2d3082b8b6f8dfbccb0639e6e240
      [3] hosted at lists.linux.dev.
      
      [akpm@linux-foundation.org: remember "linux/"]
      Link: https://lkml.kernel.org/r/20221116182634.2823136-1-sam@gentoo.org
      Signed-off-by: Sam James <sam@gentoo.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mailmap: update Alex Hung's email address · d39e2ad6
      Alex Hung authored
      I am no longer at Canonical; add an entry for my personal email address.
      
      Link: https://lkml.kernel.org/r/20221114001302.671897-1-alex.hung@amd.com
      Signed-off-by: Alex Hung <alexhung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: mmap: fix documentation for vma_mas_szero · 4a423440
      Ian Cowan authored
      When the struct mm_struct input, mm, was changed to a struct ma_state,
      mas, the documentation for the function was never updated.  This updates
      that documentation reference.
      
      Link: https://lkml.kernel.org/r/20221114003349.41235-1-ian@linux.cowan.aero
      Signed-off-by: Ian Cowan <ian@linux.cowan.aero>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed · 8468b486
      SeongJae Park authored
      A DAMON sysfs interface user can start DAMON with a scheme, remove the
      sysfs directory for the scheme, and then request an update of the
      scheme's stats.  Because the scheme stats update logic isn't aware of
      this situation, it results in an invalid memory access.  Fix the bug by
      checking if the scheme sysfs directory exists.
      
      Link: https://lkml.kernel.org/r/20221114175552.1951-1-sj@kernel.org
      Fixes: 0ac32b8a ("mm/damon/sysfs: support DAMOS stats")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>	[v5.18]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory: return vm_fault_t result from migrate_to_ram() callback · 4a955bed
      Alistair Popple authored
      The migrate_to_ram() callback should always succeed, but in rare cases it
      can fail, usually returning VM_FAULT_SIGBUS.  Commit 16ce101d
      ("mm/memory.c: fix race when faulting a device private page") incorrectly
      stopped passing the return code up the stack.  Fix this by setting the ret
      variable, restoring the previous behaviour on migrate_to_ram() failure.
      
      Link: https://lkml.kernel.org/r/20221114115537.727371-1-apopple@nvidia.com
      Fixes: 16ce101d ("mm/memory.c: fix race when faulting a device private page")
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: correctly charge compressed memory to its memcg · cd08d80e
      Li Liguang authored
      Kswapd will reclaim memory when memory pressure is high, and the anonymous
      memory will be compressed and stored in the zpool if zswap is enabled.
      The memcg_kmem_bypass() in get_obj_cgroup_from_page() will bypass the
      kernel thread and cause the compressed memory not to be charged to its
      memory cgroup.
      
      Remove the memcg_kmem_bypass() call and properly charge compressed memory
      to its corresponding memory cgroup.
      
      Link: https://lore.kernel.org/linux-mm/CALvZod4nnn8BHYqAM4xtcR0Ddo2-Wr8uKm9h_CHWUaXw7g_DCg@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20221114194828.100822-1-hannes@cmpxchg.org
      Fixes: f4840ccf ("zswap: memcg accounting")
      Signed-off-by: Li Liguang <liliguang@baidu.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: <stable@vger.kernel.org>	[5.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • ipc/shm: call underlying open/close vm_ops · b6305049
      Mike Kravetz authored
      Shared memory segments can be created that are backed by hugetlb pages. 
      When this happens, the vmas associated with any mappings (shmat) are
      marked VM_HUGETLB, yet the vm_ops for such mappings are provided by
      ipc/shm (shm_vm_ops).  There is a mechanism to call the underlying hugetlb
      vm_ops, and this is done for most operations.  However, it is not done for
      open and close.
      
      This was not an issue until the introduction of the hugetlb vma_lock. 
      This lock structure is pointed to by vm_private_data and the open/close
      vm_ops help maintain this structure.  The special hugetlb routine called
      at fork took care of structure updates at fork time.  However,
      vma_splitting is not properly handled for ipc shared memory mappings
      backed by hugetlb pages.  This can result in a "kernel NULL pointer
      dereference" BUG or use after free as two vmas point to the same lock
      structure.
      
      Update the shm open and close routines to always call the underlying open
      and close routines.
      
      Link: https://lkml.kernel.org/r/20221114210018.49346-1-mike.kravetz@oracle.com
      Fixes: 8d9bfb26 ("hugetlb: add vma based lock for pmd sharing")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Doug Nelson <doug.nelson@intel.com>
      Reported-by: <syzbot+83b4134621b7c326d950@syzkaller.appspotmail.com>
      Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • gcov: clang: fix the buffer overflow issue · a6f810ef
      Mukesh Ojha authored
      Currently, in the clang version of the gcov code, when a module is being
      removed, gcov_info_add() incorrectly adds the sfn_ptr->counter to all the
      dst->functions, which results in the kernel panic shown in the crash
      report below.  Fix this by properly handling it.
      
      [    8.899094][  T599] Unable to handle kernel write to read-only memory at virtual address ffffff80461cc000
      [    8.899100][  T599] Mem abort info:
      [    8.899102][  T599]   ESR = 0x9600004f
      [    8.899103][  T599]   EC = 0x25: DABT (current EL), IL = 32 bits
      [    8.899105][  T599]   SET = 0, FnV = 0
      [    8.899107][  T599]   EA = 0, S1PTW = 0
      [    8.899108][  T599]   FSC = 0x0f: level 3 permission fault
      [    8.899110][  T599] Data abort info:
      [    8.899111][  T599]   ISV = 0, ISS = 0x0000004f
      [    8.899113][  T599]   CM = 0, WnR = 1
      [    8.899114][  T599] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000ab8de000
      [    8.899116][  T599] [ffffff80461cc000] pgd=18000009ffcde003, p4d=18000009ffcde003, pud=18000009ffcde003, pmd=18000009ffcad003, pte=00600000c61cc787
      [    8.899124][  T599] Internal error: Oops: 9600004f [#1] PREEMPT SMP
      [    8.899265][  T599] Skip md ftrace buffer dump for: 0x1609e0
      ....
      ..,
      [    8.899544][  T599] CPU: 7 PID: 599 Comm: modprobe Tainted: G S         OE     5.15.41-android13-8-g38e9b1af6bce #1
      [    8.899547][  T599] Hardware name: XXX (DT)
      [    8.899549][  T599] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
      [    8.899551][  T599] pc : gcov_info_add+0x9c/0xb8
      [    8.899557][  T599] lr : gcov_event+0x28c/0x6b8
      [    8.899559][  T599] sp : ffffffc00e733b00
      [    8.899560][  T599] x29: ffffffc00e733b00 x28: ffffffc00e733d30 x27: ffffffe8dc297470
      [    8.899563][  T599] x26: ffffffe8dc297000 x25: ffffffe8dc297000 x24: ffffffe8dc297000
      [    8.899566][  T599] x23: ffffffe8dc0a6200 x22: ffffff880f68bf20 x21: 0000000000000000
      [    8.899569][  T599] x20: ffffff880f68bf00 x19: ffffff8801babc00 x18: ffffffc00d7f9058
      [    8.899572][  T599] x17: 0000000000088793 x16: ffffff80461cbe00 x15: 9100052952800785
      [    8.899575][  T599] x14: 0000000000000200 x13: 0000000000000041 x12: 9100052952800785
      [    8.899577][  T599] x11: ffffffe8dc297000 x10: ffffffe8dc297000 x9 : ffffff80461cbc80
      [    8.899580][  T599] x8 : ffffff8801babe80 x7 : ffffffe8dc2ec000 x6 : ffffffe8dc2ed000
      [    8.899583][  T599] x5 : 000000008020001f x4 : fffffffe2006eae0 x3 : 000000008020001f
      [    8.899586][  T599] x2 : ffffff8027c49200 x1 : ffffff8801babc20 x0 : ffffff80461cb3a0
      [    8.899589][  T599] Call trace:
      [    8.899590][  T599]  gcov_info_add+0x9c/0xb8
      [    8.899592][  T599]  gcov_module_notifier+0xbc/0x120
      [    8.899595][  T599]  blocking_notifier_call_chain+0xa0/0x11c
      [    8.899598][  T599]  do_init_module+0x2a8/0x33c
      [    8.899600][  T599]  load_module+0x23cc/0x261c
      [    8.899602][  T599]  __arm64_sys_finit_module+0x158/0x194
      [    8.899604][  T599]  invoke_syscall+0x94/0x2bc
      [    8.899607][  T599]  el0_svc_common+0x1d8/0x34c
      [    8.899609][  T599]  do_el0_svc+0x40/0x54
      [    8.899611][  T599]  el0_svc+0x94/0x2f0
      [    8.899613][  T599]  el0t_64_sync_handler+0x88/0xec
      [    8.899615][  T599]  el0t_64_sync+0x1b4/0x1b8
      [    8.899618][  T599] Code: f905f56c f86e69ec f86e6a0f 8b0c01ec (f82e6a0c)
      [    8.899620][  T599] ---[ end trace ed5218e9e5b6e2e6 ]---
      
      Link: https://lkml.kernel.org/r/1668020497-13142-1-git-send-email-quic_mojha@quicinc.com
      Fixes: e178a5be ("gcov: clang support")
      Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
      Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
      Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Tom Rix <trix@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: refactor mm_khugepaged_scan_file tracepoint to remove filename from function call · 045634ff
      Gautam Menghani authored
      Refactor the mm_khugepaged_scan_file tracepoint to move filename
      dereference to the tracepoint definition, to maintain consistency with
      other tracepoints[1].
      
      [1]:lore.kernel.org/lkml/20221024111621.3ba17e2c@gandalf.local.home/
      
      Link: https://lkml.kernel.org/r/20221026044524.54793-1-gautammenghani201@gmail.com
      Fixes: d41fd201 ("mm/khugepaged: add tracepoint to hpage_collapse_scan_file()")
      Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_exit: fix kernel doc warning in page_ext_put() · ed86b748
      Charan Teja Kalla authored
      Fix the warning below, reported with 'make W=1 mm/':
      mm/page_ext.c:178: warning: Function parameter or member 'page_ext' not
      described in 'page_ext_put'.
      
      [quic_pkondeti@quicinc.com: better patch title]
      Link: https://lkml.kernel.org/r/1667884582-2465-1-git-send-email-quic_charante@quicinc.com
      Fixes: b1d5488a ("mm: fix use-after free of page_ext after race with memory-offline")
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: khugepaged: allow page allocation fallback to eligible nodes · e031ff96
      Yang Shi authored
      Syzbot reported the below splat:
      
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Modules linked in:
      CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga7038524 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
      RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
      RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
      RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
      RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
       hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
       madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
       do_madvise mm/madvise.c:1432 [inline]
       __do_sys_madvise mm/madvise.c:1432 [inline]
       __se_sys_madvise mm/madvise.c:1430 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f6b48a4eef9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
      R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
       </TASK>
      
      The khugepaged code picks the node with the most hits as the preferred
      node, and also tries to balance if several nodes have the same hit
      count.  Conceptually, it does the following:
          * If target_node <= last_target_node, iterate from
            last_target_node + 1 to MAX_NUMNODES (1024 on the default config)
          * If max_value == node_load[nid], set target_node = nid
      
      But there is a corner case, particularly for MADV_COLLAPSE, in which a
      non-existent node may be returned as the preferred node.
      
      Assume the system has 2 nodes, target_node is 0 and last_target_node
      is 1.  If the MADV_COLLAPSE path is hit, max_value may be 0, so 2 may be
      returned for target_node even though that node does not exist (it is
      offline), and the warning is triggered.
      
      The node balance was introduced by commit 9f1b868a ("mm: thp:
      khugepaged: add policy for finding target node") to satisfy
      "numactl --interleave=all".  But interleaving is a mere hint rather than
      something that has hard requirements.
      
      So use a nodemask to record the nodes which have the same hit count, so
      that the hugepage allocation can fall back to those nodes.  Also remove
      __GFP_THISNODE, since it disallows fallback.  If the nodemask has just
      one node set, a single node has the most hits, and the nodemask
      approach behaves like __GFP_THISNODE.
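
      The corner case and the nodemask idea can be reproduced with a small
      user-space model.  This is a hedged Python sketch based only on the
      description above, not the kernel implementation: the function names,
      the ONLINE_NODES list and the two-node setup are assumptions for
      illustration.

```python
# Illustrative model only; simplified from the description above.
MAX_NUMNODES = 1024          # default-config value quoted in the text
ONLINE_NODES = [0, 1]        # assumption: a two-node system


def find_target_node_old(node_load, last_target_node):
    """Old policy: prefer the most-hit node, rotating among ties."""
    target_node, max_value = 0, 0
    for nid in range(MAX_NUMNODES):
        if node_load[nid] > max_value:
            max_value, target_node = node_load[nid], nid
    # Balance: look past the previous target for a node with the same count.
    if target_node <= last_target_node:
        for nid in range(last_target_node + 1, MAX_NUMNODES):
            if node_load[nid] == max_value:
                target_node = nid
                break
    return target_node


def eligible_nodes_new(node_load):
    """New policy: every online node that ties for the highest hit count."""
    max_value = max(node_load)
    return {nid for nid in ONLINE_NODES if node_load[nid] == max_value}


node_load = [0] * MAX_NUMNODES   # MADV_COLLAPSE path: no hits recorded
print(find_target_node_old(node_load, last_target_node=1))
print(sorted(eligible_nodes_new(node_load)))
```

      With all hit counts at zero, the old policy in this model returns node
      2, which is not among the online nodes, while the nodemask variant
      yields only nodes 0 and 1.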
      
      Link: https://lkml.kernel.org/r/20221108184357.55614-2-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Suggested-by: Zach O'Keefe <zokeefe@google.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmscan: fix extreme overreclaim and swap floods · f53af428
      Johannes Weiner authored
      During proactive reclaim, we sometimes observe severe overreclaim, with
      several thousand times more pages reclaimed than requested.
      
      This trace was obtained from shrink_lruvec() during such an instance:
      
          prio:0 anon_cost:1141521 file_cost:7767
          nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
          nr=[7161123 345 578 1111]
      
      While the reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
      by swapping.  These requests take over a minute, during which the write()
      to memory.reclaim is unkillably stuck inside the kernel.
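
      The figures in the trace can be checked with a few lines of arithmetic.
      The Python sketch below assumes a 4 KiB page size, which is not stated
      in the trace; the variable names mirror the trace fields.

```python
# Figures from the trace above; the 4 KiB page size is an assumption.
PAGE_SIZE = 4096

nr_to_reclaim = 1047
nr_reclaimed = 4387406

# How many times more pages were reclaimed than requested.
or_factor = nr_reclaimed // nr_to_reclaim
requested_mib = nr_to_reclaim * PAGE_SIZE / 2**20
reclaimed_gib = nr_reclaimed * PAGE_SIZE / 2**30

print(f"overreclaim factor: {or_factor}")
print(f"requested: {requested_mib:.1f} MiB, reclaimed: {reclaimed_gib:.1f} GiB")
```

      Under that assumption the request is about 4.1 MiB and the amount
      reclaimed about 16.7 GiB, a factor of 4190, matching or_factor in the
      trace.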
      
      Digging into the source, this is caused by the proportional reclaim
      bailout logic.  This code tries to resolve a fundamental conflict: to
      reclaim roughly what was requested, while also aging all LRUs fairly and
      in accordance with their size, swappiness, refault rates etc.  The way it
      attempts fairness is that once the reclaim goal has been reached, it stops
      scanning the LRUs with the smaller remaining scan targets, and adjusts the
      remainder of the bigger LRUs according to how much of the smaller LRUs was
      scanned.  It then finishes scanning that remainder regardless of the
      reclaim goal.
      
      This works fine if priority levels are low and the LRU lists are
      comparable in size.  However, in this instance, the cgroup that is
      targeted by proactive reclaim has almost no files left - they've already
      been squeezed out by proactive reclaim earlier - and the remaining anon
      pages are hot.  Anon rotations cause the priority level to drop to 0,
      which results in reclaim targeting all of anon (a lot) and all of file
      (almost nothing).  By the time reclaim decides to bail, it has scanned
      most or all of the file target, and therefore must also scan most or all of
      the enormous anon target.  This target is thousands of times larger than
      the reclaim goal, thus causing the overreclaim.
      
      The bailout code hasn't changed in years, so why is this failing now?  The
      most likely explanations are two other recent changes in anon reclaim:
      
      1. Before the series starting with commit 5df74196 ("mm: fix LRU
         balancing effect of new transparent huge pages"), the VM was
         overall relatively reluctant to swap at all, even if swap was
         configured. This means the LRU balancing code didn't come into play
         as often as it does now, and mostly in high pressure situations
         where pronounced swap activity wouldn't be as surprising.
      
      2. For historic reasons, shrink_lruvec() loops on the scan targets of
         all LRU lists except the active anon one, meaning it would bail if
         the only remaining pages to scan were active anon - even if there
         were a lot of them.
      
         Before the series starting with commit ccc5dc67 ("mm/vmscan:
         make active/inactive ratio as 1:1 for anon lru"), most anon pages
         would live on the active LRU; the inactive one would contain only a
         handful of preselected reclaim candidates. After the series, anon
         gets aged similarly to file, and the inactive list is the default
         for new anon pages as well, making it often the much bigger list.
      
         As a result, the VM is now more likely to actually finish large
         anon targets than before.
      
      Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
      larger LRU lists is made before bailing out on a met reclaim goal.
      
      This fixes the extreme overreclaim problem.
      
      Fairness is more subtle and harder to evaluate.  No obvious misbehavior
      was observed on the test workload, in any case.  Conceptually, fairness
      should primarily be a cumulative effect from regular, lower priority
      scans.  Once the VM is in trouble and needs to escalate scan targets to
      make forward progress, fairness needs to take a backseat.  This is also
      acknowledged by the myriad exceptions in get_scan_count().  This patch
      makes fairness decrease gradually, as it keeps fairness work static over
      increasing priority levels with growing scan targets.  This should make
      more sense - although we may have to re-visit the exact values.
      
      Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 08 Nov, 2022 22 commits