1. 23 Nov, 2022 14 commits
    • Alistair Popple's avatar
      mm/migrate_device: return number of migrating pages in args->cpages · 44af0b45
      Alistair Popple authored
      migrate_vma->cpages originally contained a count of the number of pages
      migrating including non-present pages which can be populated directly on
      the target.
      
      Commit 241f6885 ("mm/migrate_device.c: refactor migrate_vma and
      migrate_device_coherent_page()") inadvertantly changed this to contain
      just the number of pages that were unmapped.  Usage of migrate_vma->cpages
      isn't documented, but most drivers use it to see if all the requested
      addresses can be migrated so restore the original behaviour.
      
      Link: https://lkml.kernel.org/r/20221111005135.1344004-1-apopple@nvidia.com
      Fixes: 241f6885 ("mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()")
      Signed-off-by: default avatarAlistair Popple <apopple@nvidia.com>
      Reported-by: default avatarRalph Campbell <rcampbell@nvidia.com>
      Reviewed-by: default avatarRalph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      44af0b45
    • Sam James's avatar
      kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible · 50c69721
      Sam James authored
      Add missing <linux/string.h> include for strcmp.
      
      Clang 16 makes -Wimplicit-function-declaration an error by default. 
      Unfortunately, out of tree modules may use this in configure scripts,
      which means failure might cause silent miscompilation or misconfiguration.
      
      For more information, see LWN.net [0] or LLVM's Discourse [1], gentoo-dev@ [2],
      or the (new) c-std-porting mailing list [3].
      
      [0] https://lwn.net/Articles/913505/
      [1] https://discourse.llvm.org/t/configure-script-breakage-with-the-new-werror-implicit-function-declaration/65213
      [2] https://archives.gentoo.org/gentoo-dev/message/dd9f2d3082b8b6f8dfbccb0639e6e240
      [3] hosted at lists.linux.dev.
      
      [akpm@linux-foundation.org: remember "linux/"]
      Link: https://lkml.kernel.org/r/20221116182634.2823136-1-sam@gentoo.orgSigned-off-by: default avatarSam James <sam@gentoo.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      50c69721
    • Alex Hung's avatar
    • Alex Hung's avatar
      mailmap: update Alex Hung's email address · d39e2ad6
      Alex Hung authored
      I am no longer at Canonical and add entry of my personal email address.
      
      Link: https://lkml.kernel.org/r/20221114001302.671897-1-alex.hung@amd.comSigned-off-by: default avatarAlex Hung <alexhung@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d39e2ad6
    • Ian Cowan's avatar
      mm: mmap: fix documentation for vma_mas_szero · 4a423440
      Ian Cowan authored
      When the struct_mm input, mm, was changed to a struct ma_state, mas, the
      documentation for the function was never updated.  This updates that
      documentation reference.
      
      Link: https://lkml.kernel.org/r/20221114003349.41235-1-ian@linux.cowan.aeroSigned-off-by: default avatarIan Cowan <ian@linux.cowan.aero>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4a423440
    • SeongJae Park's avatar
      mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed · 8468b486
      SeongJae Park authored
      A DAMON sysfs interface user can start DAMON with a scheme, remove the
      sysfs directory for the scheme, and then ask update of the scheme's stats.
      Because the schemes stats update logic isn't aware of the situation, it
      results in an invalid memory access.  Fix the bug by checking if the
      scheme sysfs directory exists.
      
      Link: https://lkml.kernel.org/r/20221114175552.1951-1-sj@kernel.org
      Fixes: 0ac32b8a ("mm/damon/sysfs: support DAMOS stats")
      Signed-off-by: default avatarSeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>	[v5.18]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      8468b486
    • Alistair Popple's avatar
      mm/memory: return vm_fault_t result from migrate_to_ram() callback · 4a955bed
      Alistair Popple authored
      The migrate_to_ram() callback should always succeed, but in rare cases can
      fail usually returning VM_FAULT_SIGBUS.  Commit 16ce101d
      ("mm/memory.c: fix race when faulting a device private page") incorrectly
      stopped passing the return code up the stack.  Fix this by setting the ret
      variable, restoring the previous behaviour on migrate_to_ram() failure.
      
      Link: https://lkml.kernel.org/r/20221114115537.727371-1-apopple@nvidia.com
      Fixes: 16ce101d ("mm/memory.c: fix race when faulting a device private page")
      Signed-off-by: default avatarAlistair Popple <apopple@nvidia.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarFelix Kuehling <Felix.Kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4a955bed
    • Li Liguang's avatar
      mm: correctly charge compressed memory to its memcg · cd08d80e
      Li Liguang authored
      Kswapd will reclaim memory when memory pressure is high, the annonymous
      memory will be compressed and stored in the zpool if zswap is enabled. 
      The memcg_kmem_bypass() in get_obj_cgroup_from_page() will bypass the
      kernel thread and cause the compressed memory not be charged to its memory
      cgroup.
      
      Remove the memcg_kmem_bypass() call and properly charge compressed memory
      to its corresponding memory cgroup.
      
      Link: https://lore.kernel.org/linux-mm/CALvZod4nnn8BHYqAM4xtcR0Ddo2-Wr8uKm9h_CHWUaXw7g_DCg@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20221114194828.100822-1-hannes@cmpxchg.org
      Fixes: f4840ccf ("zswap: memcg accounting")
      Signed-off-by: default avatarLi Liguang <liliguang@baidu.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: <stable@vger.kernel.org>	[5.19+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      cd08d80e
    • Mike Kravetz's avatar
      ipc/shm: call underlying open/close vm_ops · b6305049
      Mike Kravetz authored
      Shared memory segments can be created that are backed by hugetlb pages. 
      When this happens, the vmas associated with any mappings (shmat) are
      marked VM_HUGETLB, yet the vm_ops for such mappings are provided by
      ipc/shm (shm_vm_ops).  There is a mechanism to call the underlying hugetlb
      vm_ops, and this is done for most operations.  However, it is not done for
      open and close.
      
      This was not an issue until the introduction of the hugetlb vma_lock. 
      This lock structure is pointed to by vm_private_data and the open/close
      vm_ops help maintain this structure.  The special hugetlb routine called
      at fork took care of structure updates at fork time.  However,
      vma_splitting is not properly handled for ipc shared memory mappings
      backed by hugetlb pages.  This can result in a "kernel NULL pointer
      dereference" BUG or use after free as two vmas point to the same lock
      structure.
      
      Update the shm open and close routines to always call the underlying open
      and close routines.
      
      Link: https://lkml.kernel.org/r/20221114210018.49346-1-mike.kravetz@oracle.com
      Fixes: 8d9bfb26 ("hugetlb: add vma based lock for pmd sharing")
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: default avatarDoug Nelson <doug.nelson@intel.com>
      Reported-by: <syzbot+83b4134621b7c326d950@syzkaller.appspotmail.com>
      Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b6305049
    • Mukesh Ojha's avatar
      gcov: clang: fix the buffer overflow issue · a6f810ef
      Mukesh Ojha authored
      Currently, in clang version of gcov code when module is getting removed
      gcov_info_add() incorrectly adds the sfn_ptr->counter to all the
      dst->functions and it result in the kernel panic in below crash report. 
      Fix this by properly handling it.
      
      [    8.899094][  T599] Unable to handle kernel write to read-only memory at virtual address ffffff80461cc000
      [    8.899100][  T599] Mem abort info:
      [    8.899102][  T599]   ESR = 0x9600004f
      [    8.899103][  T599]   EC = 0x25: DABT (current EL), IL = 32 bits
      [    8.899105][  T599]   SET = 0, FnV = 0
      [    8.899107][  T599]   EA = 0, S1PTW = 0
      [    8.899108][  T599]   FSC = 0x0f: level 3 permission fault
      [    8.899110][  T599] Data abort info:
      [    8.899111][  T599]   ISV = 0, ISS = 0x0000004f
      [    8.899113][  T599]   CM = 0, WnR = 1
      [    8.899114][  T599] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000ab8de000
      [    8.899116][  T599] [ffffff80461cc000] pgd=18000009ffcde003, p4d=18000009ffcde003, pud=18000009ffcde003, pmd=18000009ffcad003, pte=00600000c61cc787
      [    8.899124][  T599] Internal error: Oops: 9600004f [#1] PREEMPT SMP
      [    8.899265][  T599] Skip md ftrace buffer dump for: 0x1609e0
      ....
      ..,
      [    8.899544][  T599] CPU: 7 PID: 599 Comm: modprobe Tainted: G S         OE     5.15.41-android13-8-g38e9b1af6bce #1
      [    8.899547][  T599] Hardware name: XXX (DT)
      [    8.899549][  T599] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
      [    8.899551][  T599] pc : gcov_info_add+0x9c/0xb8
      [    8.899557][  T599] lr : gcov_event+0x28c/0x6b8
      [    8.899559][  T599] sp : ffffffc00e733b00
      [    8.899560][  T599] x29: ffffffc00e733b00 x28: ffffffc00e733d30 x27: ffffffe8dc297470
      [    8.899563][  T599] x26: ffffffe8dc297000 x25: ffffffe8dc297000 x24: ffffffe8dc297000
      [    8.899566][  T599] x23: ffffffe8dc0a6200 x22: ffffff880f68bf20 x21: 0000000000000000
      [    8.899569][  T599] x20: ffffff880f68bf00 x19: ffffff8801babc00 x18: ffffffc00d7f9058
      [    8.899572][  T599] x17: 0000000000088793 x16: ffffff80461cbe00 x15: 9100052952800785
      [    8.899575][  T599] x14: 0000000000000200 x13: 0000000000000041 x12: 9100052952800785
      [    8.899577][  T599] x11: ffffffe8dc297000 x10: ffffffe8dc297000 x9 : ffffff80461cbc80
      [    8.899580][  T599] x8 : ffffff8801babe80 x7 : ffffffe8dc2ec000 x6 : ffffffe8dc2ed000
      [    8.899583][  T599] x5 : 000000008020001f x4 : fffffffe2006eae0 x3 : 000000008020001f
      [    8.899586][  T599] x2 : ffffff8027c49200 x1 : ffffff8801babc20 x0 : ffffff80461cb3a0
      [    8.899589][  T599] Call trace:
      [    8.899590][  T599]  gcov_info_add+0x9c/0xb8
      [    8.899592][  T599]  gcov_module_notifier+0xbc/0x120
      [    8.899595][  T599]  blocking_notifier_call_chain+0xa0/0x11c
      [    8.899598][  T599]  do_init_module+0x2a8/0x33c
      [    8.899600][  T599]  load_module+0x23cc/0x261c
      [    8.899602][  T599]  __arm64_sys_finit_module+0x158/0x194
      [    8.899604][  T599]  invoke_syscall+0x94/0x2bc
      [    8.899607][  T599]  el0_svc_common+0x1d8/0x34c
      [    8.899609][  T599]  do_el0_svc+0x40/0x54
      [    8.899611][  T599]  el0_svc+0x94/0x2f0
      [    8.899613][  T599]  el0t_64_sync_handler+0x88/0xec
      [    8.899615][  T599]  el0t_64_sync+0x1b4/0x1b8
      [    8.899618][  T599] Code: f905f56c f86e69ec f86e6a0f 8b0c01ec (f82e6a0c)
      [    8.899620][  T599] ---[ end trace ed5218e9e5b6e2e6 ]---
      
      Link: https://lkml.kernel.org/r/1668020497-13142-1-git-send-email-quic_mojha@quicinc.com
      Fixes: e178a5be ("gcov: clang support")
      Signed-off-by: default avatarMukesh Ojha <quic_mojha@quicinc.com>
      Reviewed-by: default avatarPeter Oberparleiter <oberpar@linux.ibm.com>
      Tested-by: default avatarPeter Oberparleiter <oberpar@linux.ibm.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Tom Rix <trix@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a6f810ef
    • Gautam Menghani's avatar
      mm/khugepaged: refactor mm_khugepaged_scan_file tracepoint to remove filename from function call · 045634ff
      Gautam Menghani authored
      Refactor the mm_khugepaged_scan_file tracepoint to move filename
      dereference to the tracepoint definition, to maintain consistency with
      other tracepoints[1].
      
      [1]:lore.kernel.org/lkml/20221024111621.3ba17e2c@gandalf.local.home/
      
      Link: https://lkml.kernel.org/r/20221026044524.54793-1-gautammenghani201@gmail.com
      Fixes: d41fd201 ("mm/khugepaged: add tracepoint to hpage_collapse_scan_file()")
      Signed-off-by: default avatarGautam Menghani <gautammenghani201@gmail.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Reviewed-by: default avatarZach O'Keefe <zokeefe@google.com>
      Reviewed-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      045634ff
    • Charan Teja Kalla's avatar
      mm/page_exit: fix kernel doc warning in page_ext_put() · ed86b748
      Charan Teja Kalla authored
      Fix the below compiler warnings reported with 'make W=1 mm/'. 
      mm/page_ext.c:178: warning: Function parameter or member 'page_ext' not
      described in 'page_ext_put'.
      
      [quic_pkondeti@quicinc.com: better patch title]
      Link: https://lkml.kernel.org/r/1667884582-2465-1-git-send-email-quic_charante@quicinc.com
      Fixes: b1d5488a ("mm: fix use-after free of page_ext after race with memory-offline")
      Signed-off-by: default avatarCharan Teja Kalla <quic_charante@quicinc.com>
      Reported-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Tested-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      ed86b748
    • Yang Shi's avatar
      mm: khugepaged: allow page allocation fallback to eligible nodes · e031ff96
      Yang Shi authored
      Syzbot reported the below splat:
      
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Modules linked in:
      CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga7038524 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
      RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
      RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
      RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
      RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
       hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
       madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
       do_madvise mm/madvise.c:1432 [inline]
       __do_sys_madvise mm/madvise.c:1432 [inline]
       __se_sys_madvise mm/madvise.c:1430 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f6b48a4eef9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
      R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
       </TASK>
      
      The khugepaged code would pick up the node with the most hit as the preferred
      node, and also tries to do some balance if several nodes have the same
      hit record.  Basically it does conceptually:
          * If the target_node <= last_target_node, then iterate from
      last_target_node + 1 to MAX_NUMNODES (1024 on default config)
          * If the max_value == node_load[nid], then target_node = nid
      
      But there is a corner case, paritucularly for MADV_COLLAPSE, that the
      non-existing node may be returned as preferred node.
      
      Assuming the system has 2 nodes, the target_node is 0 and the
      last_target_node is 1, if MADV_COLLAPSE path is hit, the max_value may
      be 0, then it may return 2 for target_node, but it is actually not
      existing (offline), so the warn is triggered.
      
      The node balance was introduced by commit 9f1b868a ("mm: thp:
      khugepaged: add policy for finding target node") to satisfy
      "numactl --interleave=all".  But interleaving is a mere hint rather than
      something that has hard requirements.
      
      So use nodemask to record the nodes which have the same hit record, the
      hugepage allocation could fallback to those nodes.  And remove
      __GFP_THISNODE since it does disallow fallback.  And if the nodemask
      just has one node set, it means there is one single node has the most
      hit record, the nodemask approach actually behaves like __GFP_THISNODE.
      
      Link: https://lkml.kernel.org/r/20221108184357.55614-2-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: default avatarYang Shi <shy828301@gmail.com>
      Suggested-by: default avatarZach O'Keefe <zokeefe@google.com>
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarZach O'Keefe <zokeefe@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e031ff96
    • Johannes Weiner's avatar
      mm: vmscan: fix extreme overreclaim and swap floods · f53af428
      Johannes Weiner authored
      During proactive reclaim, we sometimes observe severe overreclaim, with
      several thousand times more pages reclaimed than requested.
      
      This trace was obtained from shrink_lruvec() during such an instance:
      
          prio:0 anon_cost:1141521 file_cost:7767
          nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
          nr=[7161123 345 578 1111]
      
      While he reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
      by swapping.  These requests take over a minute, during which the write()
      to memory.reclaim is unkillably stuck inside the kernel.
      
      Digging into the source, this is caused by the proportional reclaim
      bailout logic.  This code tries to resolve a fundamental conflict: to
      reclaim roughly what was requested, while also aging all LRUs fairly and
      in accordance to their size, swappiness, refault rates etc.  The way it
      attempts fairness is that once the reclaim goal has been reached, it stops
      scanning the LRUs with the smaller remaining scan targets, and adjusts the
      remainder of the bigger LRUs according to how much of the smaller LRUs was
      scanned.  It then finishes scanning that remainder regardless of the
      reclaim goal.
      
      This works fine if priority levels are low and the LRU lists are
      comparable in size.  However, in this instance, the cgroup that is
      targeted by proactive reclaim has almost no files left - they've already
      been squeezed out by proactive reclaim earlier - and the remaining anon
      pages are hot.  Anon rotations cause the priority level to drop to 0,
      which results in reclaim targeting all of anon (a lot) and all of file
      (almost nothing).  By the time reclaim decides to bail, it has scanned
      most or all of the file target, and therefor must also scan most or all of
      the enormous anon target.  This target is thousands of times larger than
      the reclaim goal, thus causing the overreclaim.
      
      The bailout code hasn't changed in years, why is this failing now?  The
      most likely explanations are two other recent changes in anon reclaim:
      
      1. Before the series starting with commit 5df74196 ("mm: fix LRU
         balancing effect of new transparent huge pages"), the VM was
         overall relatively reluctant to swap at all, even if swap was
         configured. This means the LRU balancing code didn't come into play
         as often as it does now, and mostly in high pressure situations
         where pronounced swap activity wouldn't be as surprising.
      
      2. For historic reasons, shrink_lruvec() loops on the scan targets of
         all LRU lists except the active anon one, meaning it would bail if
         the only remaining pages to scan were active anon - even if there
         were a lot of them.
      
         Before the series starting with commit ccc5dc67 ("mm/vmscan:
         make active/inactive ratio as 1:1 for anon lru"), most anon pages
         would live on the active LRU; the inactive one would contain only a
         handful of preselected reclaim candidates. After the series, anon
         gets aged similarly to file, and the inactive list is the default
         for new anon pages as well, making it often the much bigger list.
      
         As a result, the VM is now more likely to actually finish large
         anon targets than before.
      
      Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
      larger LRU lists is made before bailing out on a met reclaim goal.
      
      This fixes the extreme overreclaim problem.
      
      Fairness is more subtle and harder to evaluate.  No obvious misbehavior
      was observed on the test workload, in any case.  Conceptually, fairness
      should primarily be a cumulative effect from regular, lower priority
      scans.  Once the VM is in trouble and needs to escalate scan targets to
      make forward progress, fairness needs to take a backseat.  This is also
      acknowledged by the myriad exceptions in get_scan_count().  This patch
      makes fairness decrease gradually, as it keeps fairness work static over
      increasing priority levels with growing scan targets.  This should make
      more sense - although we may have to re-visit the exact values.
      
      Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@surriel.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f53af428
  2. 08 Nov, 2022 22 commits
  3. 06 Nov, 2022 4 commits
    • Linus Torvalds's avatar
      Linux 6.1-rc4 · f0c4d9fc
      Linus Torvalds authored
      f0c4d9fc
    • Linus Torvalds's avatar
      Merge tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl · 16c7a368
      Linus Torvalds authored
      Pull cxl fixes from Dan Williams:
       "Several fixes for CXL region creation crashes, leaks and failures.
      
        This is mainly fallout from the original implementation of dynamic CXL
        region creation (instantiate new physical memory pools) that arrived
        in v6.0-rc1.
      
        Given the theme of "failures in the presence of pass-through decoders"
        this also includes new regression test infrastructure for that case.
      
        Summary:
      
         - Fix region creation crash with pass-through decoders
      
         - Fix region creation crash when no decoder allocation fails
      
         - Fix region creation crash when scanning regions to enforce the
           increasing physical address order constraint that CXL mandates
      
         - Fix a memory leak for cxl_pmem_region objects, track 1:N instead of
           1:1 memory-device-to-region associations.
      
         - Fix a memory leak for cxl_region objects when regions with active
           targets are deleted
      
         - Fix assignment of NUMA nodes to CXL regions by CFMWS (CXL Window)
           emulated proximity domains.
      
         - Fix region creation failure for switch attached devices downstream
           of a single-port host-bridge
      
         - Fix false positive memory leak of cxl_region objects by recycling
           recently used region ids rather than freeing them
      
         - Add regression test infrastructure for a pass-through decoder
           configuration
      
         - Fix some mailbox payload handling corner cases"
      
      * tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
        cxl/region: Recycle region ids
        cxl/region: Fix 'distance' calculation with passthrough ports
        tools/testing/cxl: Add a single-port host-bridge regression config
        tools/testing/cxl: Fix some error exits
        cxl/pmem: Fix cxl_pmem_region and cxl_memdev leak
        cxl/region: Fix cxl_region leak, cleanup targets at region delete
        cxl/region: Fix region HPA ordering validation
        cxl/pmem: Use size_add() against integer overflow
        cxl/region: Fix decoder allocation crash
        ACPI: NUMA: Add CXL CFMWS 'nodes' to the possible nodes set
        cxl/pmem: Fix failure to account for 8 byte header for writes to the device LSA.
        cxl/region: Fix null pointer dereference due to pass through decoder commit
        cxl/mbox: Add a check on input payload size
      16c7a368
    • Linus Torvalds's avatar
      Merge tag 'hwmon-for-v6.1-rc4' of... · aa529949
      Linus Torvalds authored
      Merge tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
      
      Pull hwmon fixes from Guenter Roeck:
       "Fix two regressions:
      
         - Commit 54cc3dbf ("hwmon: (pmbus) Add regulator supply into
           macro") resulted in regulator undercount when disabling regulators.
           Revert it.
      
         - The thermal subsystem rework caused the scmi driver to no longer
           register with the thermal subsystem because index values no longer
           match. To fix the problem, the scmi driver now directly registers
           with the thermal subsystem, no longer through the hwmon core"
      
      * tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
        Revert "hwmon: (pmbus) Add regulator supply into macro"
        hwmon: (scmi) Register explicitly with Thermal Framework
      aa529949
    • Linus Torvalds's avatar
      Merge tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 727ea09e
      Linus Torvalds authored
      Pull perf fixes from Borislav Petkov:
      
       - Add Cooper Lake's stepping to the PEBS guest/host events isolation
         fixed microcode revisions checking quirk
      
       - Update Icelake and Sapphire Rapids events constraints
      
       - Use the standard energy unit for Sapphire Rapids in RAPL
      
       - Fix the hw_breakpoint test to fail more graciously on !SMP configs
      
      * tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel: Add Cooper Lake stepping to isolation_ucodes[]
        perf/x86/intel: Fix pebs event constraints for SPR
        perf/x86/intel: Fix pebs event constraints for ICL
        perf/x86/rapl: Use standard Energy Unit for SPR Dram RAPL domain
        perf/hw_breakpoint: test: Skip the test if dependencies unmet
      727ea09e