1. 06 Nov, 2021 40 commits
    • mm/memory_hotplug: restrict CONFIG_MEMORY_HOTPLUG to 64 bit · 7ec58a2b
      David Hildenbrand authored
      32 bit support is broken in various ways: for example, we can online
      memory that should actually go to ZONE_HIGHMEM to ZONE_MOVABLE or in
      some cases even to one of the other kernel zones.
      
      We marked it BROKEN in commit b59d02ed ("mm/memory_hotplug: disable
      the functionality for 32b") almost one year ago.  According to that
      commit it might be broken at least since 2017.  Further, there is hardly
      a sane use case nowadays.
      
      Let's just depend completely on 64bit, dropping the "BROKEN" dependency
      to make clear that we are not going to support it again.  Next, we'll
      remove some HIGHMEM leftovers from memory hotplug code to clean up.
      
      Link: https://lkml.kernel.org/r/20210929143600.49379-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE · 50f9481e
      David Hildenbrand authored
      CONFIG_MEMORY_HOTPLUG depends on CONFIG_SPARSEMEM, so there is no need for
      CONFIG_MEMORY_HOTPLUG_SPARSE anymore; adjust all instances to use
      CONFIG_MEMORY_HOTPLUG and remove CONFIG_MEMORY_HOTPLUG_SPARSE.
      
      Link: https://lkml.kernel.org/r/20210929143600.49379-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Shuah Khan <skhan@linuxfoundation.org>	[kselftest]
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: remove CONFIG_X86_64_ACPI_NUMA dependency from CONFIG_MEMORY_HOTPLUG · 71b6f2dd
      David Hildenbrand authored
      Patch series "mm/memory_hotplug: Kconfig and 32 bit cleanups".
      
      Some cleanups around CONFIG_MEMORY_HOTPLUG, including removing 32 bit
      leftovers of memory hotplug support.
      
      This patch (of 6):
      
      SPARSEMEM is the only possible memory model for x86-64, FLATMEM is not
      possible:
      
      	config ARCH_FLATMEM_ENABLE
      		def_bool y
      		depends on X86_32 && !NUMA
      
      And X86_64_ACPI_NUMA (obviously) only supports x86-64:
      
      	config X86_64_ACPI_NUMA
      		def_bool y
      		depends on X86_64 && NUMA && ACPI && PCI
      
      Let's just remove the CONFIG_X86_64_ACPI_NUMA dependency, as it no
      longer makes sense.
      
      Link: https://lkml.kernel.org/r/20210929143600.49379-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug.rst: document the "auto-movable" online policy · 9e122cc1
      David Hildenbrand authored
      Commit e83a437f ("mm/memory_hotplug: introduce "auto-movable" online
      policy") introduced a new memory online policy to automatically select a
      zone for memory blocks to be onlined.  It added a way to set the active
      online policy and tunables for the auto-movable online policy.
      
      Follow-up commits tweaked the "auto-movable" policy to also consider
      memory device details when selecting zones for memory blocks to be
      onlined.
      
      Let's document the new toggles and how the two online policies we have
      work.
      
      [david@redhat.com: updates]
        Link: https://lkml.kernel.org/r/20211011082058.6076-4-david@redhat.com
      
      Link: https://lkml.kernel.org/r/20210930144117.23641-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug.rst: fix wrong /sys/module/memory_hotplug/parameters/ path · a8db400f
      David Hildenbrand authored
      We accidentally added a superfluous "s".
      
      Link: https://lkml.kernel.org/r/20210930144117.23641-3-david@redhat.com
      Fixes: ac3332c4 ("memory-hotplug.rst: complete admin-guide overhaul")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug.rst: fix two instances of "movablecore" that should be "movable_node" · d83fe3c9
      David Hildenbrand authored
      Patch series "memory-hotplug.rst: document the "auto-movable" online
      policy".
      
      Now that the memory-hotplug.rst overhaul is upstream, add proper
      documentation for the "auto-movable" online policy, documenting all new
      toggles and options.  Along the way, two fixes for the original overhaul.
      
      This patch (of 3):
      
      We really want to refer to the "movable_node" kernel command line
      parameter here.
      
      Link: https://lkml.kernel.org/r/20210930144117.23641-2-david@redhat.com
      Fixes: ac3332c4 ("memory-hotplug.rst: complete admin-guide overhaul")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: add static qualifier for online_policy_to_str() · ac62554b
      Tang Yizhou authored
      online_policy_to_str is only used in memory_hotplug.c and should be
      defined as static.
      
      Link: https://lkml.kernel.org/r/20210913024534.26161-1-tangyizhou@huawei.com
      Signed-off-by: Tang Yizhou <tangyizhou@huawei.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • selftests/vm: make MADV_POPULATE_(READ|WRITE) use in-tree headers · 39b2e5ca
      David Hildenbrand authored
      The madv_populate selftest currently builds with a warning when the
      local installed headers (via the distribution) don't include
      MADV_POPULATE_READ and MADV_POPULATE_WRITE.  The warning is correct,
      because the test cannot locate the necessary header.
      
      The reason is that the in-tree installed headers (usr/include) have a
      "linux" instead of a "sys" subdirectory.
      
      Including "linux/mman.h" instead of "sys/mman.h" doesn't work (e.g.,
      mmap() and madvise() are not defined that way).  The only thing that
      seems to work is including "linux/mman.h" in addition to "sys/mman.h".
      
      We can get rid of our availability check and simplify.
      
      Link: https://lkml.kernel.org/r/20211015165758.41374-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmstat.c: make extfrag_index show more pretty · a9970586
      Lin Feng authored
      fragmentation_index may return -1000, and the corresponding formatted
      value printed by seq_printf carries a minus sign, while positive
      formatted values carry no sign, so the output becomes unaligned.
      
      before:
        Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
        Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
        Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 0.931 0.966 0.983 0.992 0.996 0.998 0.999
      
      after this patch:
        Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
        Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
        Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000  0.931  0.966  0.983  0.992  0.996  0.998  0.999
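
      The alignment shown above can be reproduced in userspace with a
      fixed-width integer field.  The following is a minimal sketch, not the
      kernel code: the helper name format_extfrag_index() and the "%2d" width
      are assumptions made for illustration, and it assumes, as the message
      states, that -1000 is the only negative value to be printed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a fragmentation index scaled by 1000 (for example -1000 or 931)
 * as a decimal with three fractional digits. The "%2d" width reserves a
 * column for the minus sign, so non-negative values are padded with a
 * leading space and line up with "-1.000".
 */
static int format_extfrag_index(char *buf, size_t len, int index)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%2d.%03d", index / 1000, index % 1000);
}
```

      With this format, -1000 and 931 both render as six characters, which is
      what keeps the columns aligned.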
      
      Link: https://lkml.kernel.org/r/20211019103241.134797-1-linf@wangsu.com
      Signed-off-by: Lin Feng <linf@wangsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmstat: annotate data race for zone->free_area[order].nr_free · af1c31ac
      Liu Shixin authored
      KCSAN reports a data-race on v5.10 which also exists on mainline:
      
        BUG: KCSAN: data-race in extfrag_for_order+0x33/0x2d0
      
        race at unknown origin, with read to 0xffff9ee9bfffab48 of 8 bytes by task 34 on cpu 1:
         extfrag_for_order+0x33/0x2d0
         kcompactd+0x5f0/0xce0
         kthread+0x1f9/0x220
         ret_from_fork+0x22/0x30
      
        Reported by Kernel Concurrency Sanitizer on:
        CPU: 1 PID: 34 Comm: kcompactd0 Not tainted 5.10.0+ #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      
      Access to zone->free_area[order].nr_free in extfrag_for_order() and
      frag_show_print() is lockless.  That's intentional and the stats are a
      rough estimate anyway.  Annotate them with data_race().
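
      As a rough userspace illustration of the idea, the sketch below reads
      per-order counters without a lock and wraps each read in a marker
      macro.  Everything prefixed toy_ is hypothetical: toy_data_race() only
      evaluates its argument and merely stands in for the kernel's
      data_race(), which tells KCSAN that the race is intentional.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for the kernel's data_race() annotation. It adds no
 * synchronization; it only documents that the racy read is deliberate.
 */
#define toy_data_race(expr) (expr)

#define TOY_MAX_ORDER 11

struct toy_free_area {
	unsigned long nr_free;	/* may be updated concurrently elsewhere */
};

/* Sum free pages over all orders without locking: a rough estimate only. */
static unsigned long toy_count_free_pages(const struct toy_free_area *area)
{
	unsigned long pages = 0;
	size_t order;

	assert(area != NULL);
	for (order = 0; order < TOY_MAX_ORDER; order++)
		pages += toy_data_race(area[order].nr_free) << order;
	return pages;
}
```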
      
      [liushixin2@huawei.com: add comments]
        Link: https://lkml.kernel.org/r/20210918084655.2696522-1-liushixin2@huawei.com
      
      Link: https://lkml.kernel.org/r/20210908015606.3999871-1-liushixin2@huawei.com
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Cc: "Paul E . McKenney" <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • selftests: vm: add KSM huge pages merging time test · 32525489
      Pedro Demarchi Gomes authored
      Add a test case that measures KSM merging time using mostly huge pages.
      
      Link: https://lkml.kernel.org/r/20211013044045.360251-1-pedrodemargomes@gmail.com
      Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Zhansaya Bagdauletkyzy <zhansayabagdaulet@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • selftest/vm: fix ksm selftest to run with different NUMA topologies · e3820ab2
      Aneesh Kumar K.V authored
      Platforms can have non-contiguous NUMA nodes like below
      
         #numactl  -H
        available: 2 nodes (0,8)
        .....
        node distances:
        node   0   8
          0:  10  40
          8:  40  10
      
         #numactl  -H
        available: 1 nodes (1)
        ....
        node distances:
        node   1
          1:  10
      
      Hence update the test to not assume the presence of nodes 0 and 1, and
      use numa_num_configured_nodes() instead of numa_max_node() to decide
      whether to skip the test.
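
      The difference between counting nodes and taking the highest node ID
      can be sketched without libnuma.  The code below is a minimal sketch
      under stated assumptions: toy_nodemask_t and the toy_* helpers are
      hypothetical stand-ins that model the topology as a bitmask, not the
      libnuma API used by the selftest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node mask: bit N set means NUMA node N exists. */
typedef unsigned long toy_nodemask_t;

#define TOY_MAX_NODES (sizeof(toy_nodemask_t) * 8)

/* Number of configured nodes, regardless of how their IDs are numbered. */
static int toy_num_configured_nodes(toy_nodemask_t mask)
{
	int count = 0;

	for (; mask; mask &= mask - 1)	/* clear the lowest set bit */
		count++;
	return count;
}

/* Highest node ID present, or -1 for an empty mask. */
static int toy_max_node(toy_nodemask_t mask)
{
	int id = -1;

	while (mask) {
		id++;
		mask >>= 1;
	}
	return id;
}

/* Store up to `want` existing node IDs in `out`; return how many were found. */
static int toy_first_nodes(toy_nodemask_t mask, int *out, int want)
{
	int found = 0;
	size_t id;

	assert(out != NULL);
	for (id = 0; id < TOY_MAX_NODES && found < want; id++)
		if (mask & (1UL << id))
			out[found++] = (int)id;
	return found;
}
```

      For the (0,8) topology the count is 2 while the highest ID is 8, and
      for the single-node (1) topology the highest ID is 1 even though only
      one node exists, so a test that needs two nodes should check the count
      and look up the actual IDs.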
      
      Link: https://lkml.kernel.org/r/20210914141414.350759-1-aneesh.kumar@linux.ibm.com
      Fixes: 82e717ad ("selftests: vm: add KSM merging across nodes test")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Zhansaya Bagdauletkyzy <zhansayabagdaulet@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: nommu: kill arch_get_unmapped_area() · 916caa12
      Kefeng Wang authored
      On nommu, arch_get_unmapped_area() is never called, so just remove
      it.
      
      Link: https://lkml.kernel.org/r/20210910061906.36299-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/readahead.c: fix incorrect comments for get_init_ra_size · fb25a77d
      Lin Feng authored
      In fact, the values returned by get_init_ra_size are not that
      intuitive.  This patch makes the comments reflect what the function
      actually returns.
      
      Link: https://lkml.kernel.org/r/20211019104812.135602-1-linf@wangsu.com
      Signed-off-by: Lin Feng <linf@wangsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: fix incorrect unmap behavior for private pages · 8468e937
      Rongwei Wang authored
      When truncating pagecache on file THP, the private pages of a process
      should not be unmapped.  This incorrect behavior on a dynamic shared
      library will cause related processes to dump core.
      
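      A simple test for a DSO (the prerequisite is that the DSO is mapped as
      file THP):
      
          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>
      
          /* Open the DSO named by argv[1] for writing, then close it again. */
          int main(int argc, char *argv[])
          {
      	int fd;
      
      	fd = open(argv[1], O_WRONLY);
      	if (fd < 0) {
      		perror("open");
      		return 1;	/* nothing to close on failure */
      	}
      
      	close(fd);
      	return 0;
          }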
      
      The test only opens a target DSO and does nothing else.  But this
      operation will cause one or more processes to dump core.  This patch
      fixes this bug.
      
      Link: https://lkml.kernel.org/r/20211025092134.18562-3-rongwei.wang@linux.alibaba.com
      Fixes: eb6ecbed ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
      Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Tested-by: Xu Yu <xuyu@linux.alibaba.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Song Liu <song@kernel.org>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Collin Fijalkovich <cfijalkovich@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: lock filemap when truncating page cache · 55fc0d91
      Rongwei Wang authored
      Patch series "fix two bugs for file THP".
      
      This patch (of 2):
      
      Transparent huge pages support read-only non-shmem files.  The
      file-backed THP is collapsed by khugepaged and truncated when written
      (for shared libraries).
      
      However, there is a race when multiple writers truncate the same page
      cache concurrently.
      
      In that case, subpage(s) of file THP can be revealed by find_get_entry
      in truncate_inode_pages_range, which will trigger PageTail BUG_ON in
      truncate_inode_page, as follows:
      
          page:000000009e420ff2 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff pfn:0x50c3ff
          head:0000000075ff816d order:9 compound_mapcount:0 compound_pincount:0
          flags: 0x37fffe0000010815(locked|uptodate|lru|arch_1|head)
          raw: 37fffe0000000000 fffffe0013108001 dead000000000122 dead000000000400
          raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
          head: 37fffe0000010815 fffffe001066bd48 ffff000404183c20 0000000000000000
          head: 0000000000000600 0000000000000000 00000001ffffffff ffff000c0345a000
          page dumped because: VM_BUG_ON_PAGE(PageTail(page))
          ------------[ cut here ]------------
          kernel BUG at mm/truncate.c:213!
          Internal error: Oops - BUG: 0 [#1] SMP
          Modules linked in: xfs(E) libcrc32c(E) rfkill(E) ...
          CPU: 14 PID: 11394 Comm: check_madvise_d Kdump: ...
          Hardware name: ECS, BIOS 0.0.0 02/06/2015
          pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
          Call trace:
           truncate_inode_page+0x64/0x70
           truncate_inode_pages_range+0x550/0x7e4
           truncate_pagecache+0x58/0x80
           do_dentry_open+0x1e4/0x3c0
           vfs_open+0x38/0x44
           do_open+0x1f0/0x310
           path_openat+0x114/0x1dc
           do_filp_open+0x84/0x134
           do_sys_openat2+0xbc/0x164
           __arm64_sys_openat+0x74/0xc0
           el0_svc_common.constprop.0+0x88/0x220
           do_el0_svc+0x30/0xa0
           el0_svc+0x20/0x30
           el0_sync_handler+0x1a4/0x1b0
           el0_sync+0x180/0x1c0
          Code: aa0103e0 900061e1 910ec021 9400d300 (d4210000)
      
      This patch locks the filemap when entering truncate_pagecache(),
      avoiding truncating the same page cache concurrently.
      
      Link: https://lkml.kernel.org/r/20211025092134.18562-1-rongwei.wang@linux.alibaba.com
      Link: https://lkml.kernel.org/r/20211025092134.18562-2-rongwei.wang@linux.alibaba.com
      Fixes: eb6ecbed ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: Song Liu <song@kernel.org>
      Cc: Collin Fijalkovich <cfijalkovich@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • selftests/vm/transhuge-stress: fix ram size thinko · 39cad887
      George G. Davis authored
      When executing transhuge-stress with an argument to specify the virtual
      memory size for testing, the ram size is reported as 0, e.g.
      
        transhuge-stress 384
        thp-mmap: allocate 192 transhuge pages, using 384 MiB virtual memory and 0 MiB of ram
        thp-mmap: 0.184 s/loop, 0.957 ms/page,   2090.265 MiB/s  192 succeed,    0 failed
      
      This appears to be due to a thinko in commit 0085d61f
      ("selftests/vm/transhuge-stress: stress test for memory compaction"),
      where, at a guess, the intent was to base "xyz MiB of ram" on `ram`
      size.
      
      Here are results after using `ram` size:
      
        thp-mmap: allocate 192 transhuge pages, using 384 MiB virtual memory and 14 MiB of ram
      
      Link: https://lkml.kernel.org/r/20210825135843.29052-1-george_davis@mentor.com
      Fixes: 0085d61f ("selftests/vm/transhuge-stress: stress test for memory compaction")
      Signed-off-by: George G. Davis <davis.george@siemens.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Eugeniu Rosca <erosca@de.adit-jv.com>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: migrate: make demotion knob depend on migration · 20f9ba4f
      Yang Shi authored
      Memory demotion needs to call migrate_pages() to do its job, and it is
      controlled by a knob.  However, the knob doesn't depend on
      CONFIG_MIGRATION, so it could be turned on even though MIGRATION is
      disabled.  This will not cause any crash since migrate_pages() would
      just return -ENOSYS, but it is definitely not optimal to go through the
      demotion path and then retry regular swap every time.
      
      And it doesn't make too much sense to have the knob visible to the users
      when !MIGRATION.  Move the related code from mempolicy.[h|c] to
      migrate.[h|c].
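
      The effect of tying the knob to the config option can be sketched in
      userspace with a compile-time switch.  This is only an illustration:
      TOY_CONFIG_MIGRATION and every toy_* function are hypothetical names,
      and the fallback logic is a simplification rather than the reclaim
      code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/*
 * Build with -DTOY_CONFIG_MIGRATION to emulate a kernel with migration
 * support. Without it, migration is stubbed out and the knob is hidden.
 */
#ifdef TOY_CONFIG_MIGRATION
static int toy_migrate_pages(unsigned int nr_pages)
{
	(void)nr_pages;		/* pretend every page was migrated */
	return 0;
}
static bool toy_demotion_knob_available(void) { return true; }
#else
static int toy_migrate_pages(unsigned int nr_pages)
{
	(void)nr_pages;
	return -ENOSYS;		/* no migration support compiled in */
}
static bool toy_demotion_knob_available(void) { return false; }
#endif

/* Try demotion only when the knob exists and is enabled; otherwise swap. */
static const char *toy_reclaim_path(bool knob_enabled, unsigned int nr_pages)
{
	assert(nr_pages > 0);
	if (toy_demotion_knob_available() && knob_enabled &&
	    toy_migrate_pages(nr_pages) == 0)
		return "demote";
	return "swap";
}
```

      Because the availability check comes first, a build without migration
      support never enters the demotion attempt, which is the wasted detour
      the message describes.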
      
      Link: https://lkml.kernel.org/r/20211015005559.246709-1-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: de-duplicate migrate_reason strings · 8eb42bea
      John Hubbard authored
      In order to remove the need to manually keep three different files in
      sync, provide a common definition of the mapping between enum
      migrate_reason and the associated strings for each enum item.
      
      1. Use the tracing system's mapping of enums to strings, by redefining
         and reusing the MIGRATE_REASON and supporting macros, and using that
         to populate the string array in mm/debug.c.
      
      2. Move enum migrate_reason to migrate_mode.h. This is not strictly
         necessary for this patch, but migrate mode and migrate reason go
         together, so this will slightly clarify things.
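
      The general technique of generating an enum and its string table from
      one list is often called an X-macro.  The sketch below shows that
      technique in plain C; it is not the kernel's tracing macros, and the
      TOY_* names and the three entries are illustrative assumptions only.

```c
#include <assert.h>
#include <string.h>

/*
 * Single source of truth: each entry pairs an enum constant with its
 * string. Adding a reason here updates both expansions below.
 */
#define TOY_MIGRATE_REASONS(X)				\
	X(TOY_MR_COMPACTION,     "compaction")		\
	X(TOY_MR_MEMORY_FAILURE, "memory_failure")	\
	X(TOY_MR_SYSCALL,        "syscall_or_cpuset")

/* First expansion: the enum constants. */
#define TOY_AS_ENUM(id, str) id,
enum toy_migrate_reason {
	TOY_MIGRATE_REASONS(TOY_AS_ENUM)
	TOY_MR_TYPES	/* number of reasons */
};

/* Second expansion: the string table, indexed by the enum. */
#define TOY_AS_STRING(id, str) [id] = str,
static const char *const toy_migrate_reason_names[TOY_MR_TYPES] = {
	TOY_MIGRATE_REASONS(TOY_AS_STRING)
};

static const char *toy_migrate_reason_name(enum toy_migrate_reason reason)
{
	assert(reason < TOY_MR_TYPES);
	return toy_migrate_reason_names[reason];
}
```

      Since both expansions come from the same list, the enum and the
      strings cannot drift apart.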
      
      Link: https://lkml.kernel.org/r/20210922041755.141817-2-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Weizhao Ouyang <o451686892@gmail.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlbfs: extend the definition of hugepages parameter to support node allocation · b5389086
      Zhenguo Yao authored
      We can specify the number of hugepages to allocate at boot, but at
      present the hugepages are balanced across all nodes.  In some
      scenarios, we only need hugepages on one node.  For example, DPDK needs
      hugepages that are on the same node as the NIC.
      
      If DPDK needs four 1G hugepages on node1 and the system has 16 NUMA
      nodes, we must reserve 64 hugepages on the kernel cmdline, although
      only four hugepages are used; the others should be freed after boot.
      If system memory is low (for example, 64G), this is impossible.
      
      So extend the hugepages parameter to support specifying hugepages on a
      specific node.  For example, add the following parameter:
      
        hugepagesz=1G hugepages=0:1,1:3
      
      It will allocate 1 hugepage in node0 and 3 hugepages in node1.
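
      To make the node:count syntax concrete, here is a minimal userspace
      sketch of a parser for the value part of the parameter.  It is not the
      kernel's parser: toy_parse_node_hugepages() and TOY_MAX_NODES are
      hypothetical, and the error handling is an assumption for this
      example.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define TOY_MAX_NODES 16

/*
 * Parse "node:count[,node:count...]" into counts[]. Returns 0 on success
 * or -EINVAL on malformed input or an out-of-range node ID. Entries for
 * nodes that are not mentioned are left untouched.
 */
static int toy_parse_node_hugepages(const char *s,
				    unsigned long counts[TOY_MAX_NODES])
{
	assert(s != NULL && counts != NULL);

	while (*s) {
		char *end;
		unsigned long node, count;

		node = strtoul(s, &end, 10);
		if (end == s || *end != ':' || node >= TOY_MAX_NODES)
			return -EINVAL;
		s = end + 1;

		count = strtoul(s, &end, 10);
		if (end == s)
			return -EINVAL;
		counts[node] = count;

		if (*end == ',')
			end++;		/* another node:count pair follows */
		else if (*end != '\0')
			return -EINVAL;
		s = end;
	}
	return 0;
}
```

      Parsing "0:1,1:3" with this sketch stores 1 for node 0 and 3 for node
      1, matching the allocation described above.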
      
      Link: https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com
      Signed-off-by: Zhenguo Yao <yaozhenguo1@gmail.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mark the OOM reaper thread as freezable · 3723929e
      Sultan Alsawaf authored
      The OOM reaper alters user address space which might theoretically alter
      the snapshot if reaping is allowed to happen after the freezer quiescent
      state.  To this end, the reaper kthread uses wait_event_freezable()
      while waiting for any work so that it cannot run while the system
      freezes.
      
      However, the current implementation doesn't respect the freezer because
      all kernel threads are created with the PF_NOFREEZE flag, so they are
      automatically excluded from freezing operations.  This means that the
      OOM reaper can race with system snapshotting if it has work to do while
      the system is being frozen.
      
      Fix this by adding a set_freezable() call which will clear the
      PF_NOFREEZE flag and thus make the OOM reaper visible to the freezer.
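
      The flag handling described above can be modelled in a few lines of
      userspace C.  This is a toy model under stated assumptions: struct
      toy_task and the toy_* helpers are hypothetical, and the flag value is
      used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bit; the toy_* names below are not kernel APIs. */
#define TOY_PF_NOFREEZE 0x00008000u

struct toy_task {
	unsigned int flags;
};

/* Model of a new kernel thread: excluded from freezing by default. */
static void toy_kthread_init(struct toy_task *t)
{
	assert(t != NULL);
	t->flags = TOY_PF_NOFREEZE;
}

/* Analogue of set_freezable(): clear the flag so the freezer sees the task. */
static void toy_set_freezable(struct toy_task *t)
{
	assert(t != NULL);
	t->flags &= ~TOY_PF_NOFREEZE;
}

/* The freezer skips any task that still carries the flag. */
static bool toy_freezer_skips(const struct toy_task *t)
{
	assert(t != NULL);
	return (t->flags & TOY_PF_NOFREEZE) != 0;
}
```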
      
      Please note that the OOM reaper altering the snapshot this way is mostly
      a theoretical concern and has not been observed in practice.
      
      Link: https://lkml.kernel.org/r/20210921165758.6154-1-sultan@kerneltoast.com
      Link: https://lkml.kernel.org/r/20210918233920.9174-1-sultan@kerneltoast.com
      Fixes: aac45363 ("mm, oom: introduce oom reaper")
      Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: use memblock_free for freeing virtual pointers · 4421cca0
      Mike Rapoport authored
      Rename memblock_free_ptr() to memblock_free() and use memblock_free()
      when freeing a virtual pointer so that memblock_free() will be a
      counterpart of memblock_alloc().
      
      The callers are updated with the below semantic patch and manual
      addition of (void *) casting to pointers that are represented by
      unsigned long variables.
      
          @@
          identifier vaddr;
          expression size;
          @@
          (
          - memblock_phys_free(__pa(vaddr), size);
          + memblock_free(vaddr, size);
          |
          - memblock_free_ptr(vaddr, size);
          + memblock_free(vaddr, size);
          )
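
      The relationship that the semantic patch relies on can be sketched in
      userspace: freeing by virtual pointer is freeing by physical address
      after a translation.  Everything below is a hypothetical model: the
      toy_* names, the fixed TOY_PAGE_OFFSET translation, and the recording
      of the last freed range exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_phys_addr_t;

/* Hypothetical linear-map offset standing in for the kernel's __pa(). */
#define TOY_PAGE_OFFSET 0x1000u

static toy_phys_addr_t toy_pa(const void *vaddr)
{
	return (toy_phys_addr_t)(uintptr_t)vaddr - TOY_PAGE_OFFSET;
}

/* Record the last freed range so the behaviour can be observed. */
static toy_phys_addr_t toy_last_freed_base;
static size_t toy_last_freed_size;

/* Physical-range variant: counterpart of a physical allocation. */
static void toy_memblock_phys_free(toy_phys_addr_t base, size_t size)
{
	toy_last_freed_base = base;
	toy_last_freed_size = size;
}

/* Pointer variant: counterpart of a pointer-returning allocation. */
static void toy_memblock_free(void *ptr, size_t size)
{
	assert(size > 0);
	if (ptr)
		toy_memblock_phys_free(toy_pa(ptr), size);
}
```

      A caller holding a pointer then writes toy_memblock_free(ptr, size)
      instead of spelling out the address translation at every call site.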
      
      [sfr@canb.auug.org.au: fixup]
        Link: https://lkml.kernel.org/r/20211018192940.3d1d532f@canb.auug.org.au
      
      Link: https://lkml.kernel.org/r/20210930185031.18648-7-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: rename memblock_free to memblock_phys_free · 3ecc6834
      Mike Rapoport authored
      Since memblock_free() operates on a physical range, make its name
      reflect it and rename it to memblock_phys_free(), so it will be a
      logical counterpart to memblock_phys_alloc().
      
      The callers are updated with the below semantic patch:
      
          @@
          expression addr;
          expression size;
          @@
          - memblock_free(addr, size);
          + memblock_phys_free(addr, size);
      
      Link: https://lkml.kernel.org/r/20210930185031.18648-6-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mike Rapoport's avatar
      memblock: stop aliasing __memblock_free_late with memblock_free_late · 621d9739
      Mike Rapoport authored
      memblock_free_late() is a NOP wrapper for __memblock_free_late(), so
      there is no point in keeping this indirection.
      
      Drop the wrapper and rename __memblock_free_late() to
      memblock_free_late().
      
      Link: https://lkml.kernel.org/r/20210930185031.18648-5-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mike Rapoport's avatar
      memblock: drop memblock_free_early_nid() and memblock_free_early() · fa277171
      Mike Rapoport authored
      memblock_free_early_nid() is unused and memblock_free_early() is an
      alias for memblock_free().
      
      Replace calls to memblock_free_early() with calls to memblock_free() and
      remove memblock_free_early() and memblock_free_early_nid().
      
      Link: https://lkml.kernel.org/r/20210930185031.18648-4-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mike Rapoport's avatar
      xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual pointer · c486514d
      Mike Rapoport authored
      free_p2m_page() wrongly passes a virtual pointer to memblock_free(),
      which treats it as a physical address.

      Call memblock_free_ptr() instead, which takes a virtual address, to
      free the memory.
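
      The class of bug fixed here can be sketched with a toy model.  Nothing
      below is kernel code: the ToyMemblock class, its method names and the
      direct-map base PAGE_OFFSET are assumptions chosen for illustration.
      The point is only that a free routine keyed on physical addresses will
      not find a region when it is handed the virtual alias of that region.

```python
# Assumed direct-map base; the real value depends on architecture and config.
PAGE_OFFSET = 0xFFFF888000000000


def to_phys(vaddr: int) -> int:
    """Toy stand-in for __pa(): virtual -> physical in a linear mapping."""
    return vaddr - PAGE_OFFSET


def to_virt(paddr: int) -> int:
    """Toy stand-in for __va(): physical -> virtual in a linear mapping."""
    return paddr + PAGE_OFFSET


class ToyMemblock:
    """Tracks reserved regions by physical base address (hypothetical model)."""

    def __init__(self):
        self.reserved = {}  # physical base -> size in bytes

    def reserve(self, paddr: int, size: int) -> int:
        """Reserve a region and hand back its virtual address."""
        self.reserved[paddr] = size
        return to_virt(paddr)

    def free_phys(self, paddr: int, size: int) -> bool:
        """Free by physical address; returns False if no such region exists."""
        if self.reserved.get(paddr) != size:
            return False
        del self.reserved[paddr]
        return True

    def free_virt(self, vaddr: int, size: int) -> bool:
        """Free by virtual address: translate first, then free."""
        return self.free_phys(to_phys(vaddr), size)


if __name__ == "__main__":
    mb = ToyMemblock()
    ptr = mb.reserve(0x100000, 4096)
    print(mb.free_phys(ptr, 4096))  # virtual pointer passed as physical
    print(mb.free_virt(ptr, 4096))  # correct variant for a virtual pointer
```

      In this model the first call frees nothing and the region stays
      reserved; only the variant that translates the pointer releases it.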
      
      Link: https://lkml.kernel.org/r/20210930185031.18648-3-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mike Rapoport's avatar
      arch_numa: simplify numa_distance allocation · 5787ea5b
      Mike Rapoport authored
      Patch series "memblock: cleanup memblock_free interface", v2.
      
      This is the fix for memblock freeing APIs mismatch [1].
      
      The first patch is a cleanup of numa_distance allocation in arch_numa
      that I spotted during the conversion.  The second patch is a fix for
      Xen memory freeing on some of the error paths.
      
      [1] https://lore.kernel.org/all/CAHk-=wj9k4LZTz+svCxLYs5Y1=+yKrbAUArH1+ghyG3OLd8VVg@mail.gmail.com
      
      This patch (of 6):
      
      Memory allocation of numa_distance uses memblock_phys_alloc_range()
      without actual range limits, converts the returned physical address to
      virtual and then only uses the virtual address for further
      initialization.
      
      Simplify this by replacing memblock_phys_alloc_range() with
      memblock_alloc().
      
      Link: https://lkml.kernel.org/r/20210930185031.18648-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20210930185031.18648-2-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Naoya Horiguchi's avatar
      tools/vm/page-types.c: print file offset in hexadecimal · 41d4613b
      Naoya Horiguchi authored
      In page list mode (with the -l and -L options), the virtual address and
      physical address are printed in hexadecimal, but the file offset is
      not, which is confusing, so let's align it.
      
      Link: https://lkml.kernel.org/r/20211004061325.1525902-4-naoya.horiguchi@linux.dev
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Bin Wang <wangbin224@huawei.com>
      Cc: Changbin Du <changbin.du@intel.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Naoya Horiguchi's avatar
      tools/vm/page-types.c: move show_file() to summary output · b76901db
      Naoya Horiguchi authored
      Currently the file info from show_file() is printed within the page
      list, as shown below.  This makes the page list a little inconvenient
      to consume from other scripts (it may need additional filtering).
      
          $ ./page-types -f page-types.c -l
          foffset offset  len     flags
          page-types.c Inode: 15108680 Size: 30953 (8 pages)
          Modify: Sat Oct  2 23:11:20 2021 (2399 seconds ago)
          Access: Sat Oct  2 23:11:28 2021 (2391 seconds ago)
          0       d9f59e  1       ___U_lA____________________________________
          1       1031eb5 1       __RU_l_____________________________________
          2       13bf717 1       __RU_l_____________________________________
          3       13ac333 1       ___U_lA____________________________________
          4       d9f59f  1       __RU_l_____________________________________
          5       183fd49 1       ___U_lA____________________________________
          6       13cbf69 1       ___U_lA____________________________________
          7       d9ef05  1       ___U_lA____________________________________
      
                       flags      page-count       MB  symbolic-flags                     long-symbolic-flags
          0x000000000000002c               3        0  __RU_l_____________________________________        referenced,uptodate,lru
          0x0000000000000068               5        0  ___U_lA____________________________________        uptodate,lru,active
                       total               8        0
      
      With this patch file info is printed out in summary part like below:
      
          $ ./page-types -f page-types.c -l
          foffset offset  len     flags
          0       d9f59e  1       ___U_lA_____________________________________
          1       1031eb5 1       __RU_l______________________________________
          2       13bf717 1       __RU_l______________________________________
          3       13ac333 1       ___U_lA_____________________________________
          4       d9f59f  1       __RU_l______________________________________
          5       183fd49 1       ___U_lA_____________________________________
          6       13cbf69 1       ___U_lA_____________________________________
      
          page-types.c Inode: 15108680 Size: 30953 (8 pages)
          Modify: Sat Oct  2 23:11:20 2021 (2435 seconds ago)
          Access: Sat Oct  2 23:11:28 2021 (2427 seconds ago)
      
                       flags      page-count       MB  symbolic-flags                     long-symbolic-flags
          0x000000000000002c               3        0  __RU_l______________________________________       referenced,uptodate,lru
          0x0000000000000068               4        0  ___U_lA_____________________________________       uptodate,lru,active
                       total               7        0
      
      Link: https://lkml.kernel.org/r/20211004061325.1525902-3-naoya.horiguchi@linux.dev
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Bin Wang <wangbin224@huawei.com>
      Cc: Changbin Du <changbin.du@intel.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Naoya Horiguchi's avatar
      tools/vm/page-types.c: make walk_file() aware of address range option · a62f5ecb
      Naoya Horiguchi authored
      Patch series "tools/vm/page-types.c: a few improvements".
      
      This patchset adds some improvements on tools/vm/page-types.c.  Patch
      1/3 makes -a option (specify address range) work with -f (file cache
      mode).  Patch 2/3 and 3/3 are to fix minor formatting issues of this
      tool.  These would make life a little easier for the users of this tool.
      
      Please see individual patches for more details about specific issues.
      
      This patch (of 3):
      
      The -a|--addr option limits the range of addresses to be scanned for
      page status.  It currently works for the physical address space
      (default mode) and for the virtual address space (with the -p option),
      but not for the file address space (with the -f option).  So make
      walk_file() aware of the -a option.
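
      The intended behaviour can be sketched as follows.  This is not the C
      code from page-types.c: the function name file_pages_to_scan() and the
      representation of each -a range as an (offset, size) pair counted in
      pages are assumptions made for illustration.

```python
def file_pages_to_scan(nr_file_pages, ranges=()):
    """Return the file page indices to scan.

    nr_file_pages: number of pages backing the file.
    ranges: iterable of (offset, size) pairs in pages; empty means "no -a
            option given", in which case the whole file is scanned.
    """
    if not ranges:
        return list(range(nr_file_pages))

    pages = []
    for offset, size in ranges:
        # Clamp each requested range to the end of the file.
        end = min(offset + size, nr_file_pages)
        pages.extend(range(offset, end))
    return pages


if __name__ == "__main__":
    print(file_pages_to_scan(8))            # whole file
    print(file_pages_to_scan(8, [(2, 3)]))  # pages 2..4 only
```

      A range that starts beyond the end of the file simply contributes no
      pages in this sketch.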
      
      Link: https://lkml.kernel.org/r/20211004061325.1525902-1-naoya.horiguchi@linux.dev
      Link: https://lkml.kernel.org/r/20211004061325.1525902-2-naoya.horiguchi@linux.dev
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Changbin Du <changbin.du@intel.com>
      Cc: Bin Wang <wangbin224@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Zhenliang Wei's avatar
      tools/vm/page_owner_sort.c: count and sort by mem · f7df2b1c
      Zhenliang Wei authored
      When viewing page owner information, we may be more concerned about the
      total memory than about the number of times a stack appears.
      Therefore, the following adjustments are made:
      
      1. Added the statistics on the total number of pages.
      
      2. Added the optional parameter "-m" to configure the program to sort by
         memory (total pages).
      
      The general output of page_owner is as follows:
      
      	Page allocated via order XXX, ...
      	PFN XXX ...
      	 // Detailed stack
      
      	Page allocated via order XXX, ...
      	PFN XXX ...
      	 // Detailed stack
      
      The original page_owner_sort ignores PFN rows, puts the remaining rows
      in buf, counts how many times each buf occurs, and finally sorts them
      by that count.  General output:
      
      	XXX times:
      	Page allocated via order XXX, ...
      	 // Detailed stack
      
      Now, we use a regexp to extract the page order value from the buf, and
      count the total pages for the buf.  General output:
      
      	XXX times, XXX pages:
      	Page allocated via order XXX, ...
      	 // Detailed stack
      
      By default, the output is still sorted by the number of times each buf
      occurs; to sort by the number of pages per buf, use the new -m
      parameter.
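
      The counting and sorting described above can be sketched in Python.
      This is not the C implementation from page_owner_sort.c: the function
      names, the exact regular expression and the handling of blocks without
      an order are assumptions.  The sketch relies on an order-N allocation
      covering 2^N pages.

```python
import re
from collections import defaultdict

# Assumed pattern for the "Page allocated via order N, ..." header line.
ORDER_RE = re.compile(r"\border\s+(\d+)")


def summarize(blocks):
    """Map each distinct block (PFN row already removed) to (times, pages)."""
    stats = defaultdict(lambda: [0, 0])
    for block in blocks:
        match = ORDER_RE.search(block)
        # Assumption: a block with no recognisable order adds zero pages.
        pages = 1 << int(match.group(1)) if match else 0
        entry = stats[block]
        entry[0] += 1      # how many times this block was seen
        entry[1] += pages  # total pages attributed to this block
    return {block: tuple(entry) for block, entry in stats.items()}


def sort_blocks(stats, by_mem=False):
    """Sort descending by times (default) or by pages (the -m behaviour)."""
    index = 1 if by_mem else 0
    return sorted(stats.items(), key=lambda item: item[1][index], reverse=True)


if __name__ == "__main__":
    small = "Page allocated via order 0, mask A\n stack_a"
    large = "Page allocated via order 4, mask B\n stack_b"
    stats = summarize([small, small, small, large])
    for block, (times, pages) in sort_blocks(stats, by_mem=True):
        print(f"{times} times, {pages} pages:\n{block}\n")
```

      With three order-0 blocks and one order-4 block, the default ordering
      lists the order-0 stack first, while sorting by memory lists the
      order-4 stack first because it accounts for 16 pages against 3.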
      
      Link: https://lkml.kernel.org/r/1631678242-41033-1-git-send-email-weizhenliang@huawei.com
      Signed-off-by: Zhenliang Wei <weizhenliang@huawei.com>
      Cc: Tang Bin <tangbin@cmss.chinamobile.com>
      Cc: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
      Cc: Zhenliang Wei <weizhenliang@huawei.com>
      Cc: Xiaoming Ni <nixiaoming@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Yuanzheng Song's avatar
      mm/vmpressure: fix data-race with memcg->socket_pressure · 7e6ec49c
      Yuanzheng Song authored
      When reading memcg->socket_pressure in mem_cgroup_under_socket_pressure()
      and writing memcg->socket_pressure in vmpressure() at the same time, the
      following data-race occurs:
      
        BUG: KCSAN: data-race in __sk_mem_reduce_allocated / vmpressure
      
        write to 0xffff8881286f4938 of 8 bytes by task 24550 on cpu 3:
         vmpressure+0x218/0x230 mm/vmpressure.c:307
         shrink_node_memcgs+0x2b9/0x410 mm/vmscan.c:2658
         shrink_node+0x9d2/0x11d0 mm/vmscan.c:2769
         shrink_zones+0x29f/0x470 mm/vmscan.c:2972
         do_try_to_free_pages+0x193/0x6e0 mm/vmscan.c:3027
         try_to_free_mem_cgroup_pages+0x1c0/0x3f0 mm/vmscan.c:3345
         reclaim_high mm/memcontrol.c:2440 [inline]
         mem_cgroup_handle_over_high+0x18b/0x4d0 mm/memcontrol.c:2624
         tracehook_notify_resume include/linux/tracehook.h:197 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:164 [inline]
         exit_to_user_mode_prepare+0x110/0x170 kernel/entry/common.c:191
         syscall_exit_to_user_mode+0x16/0x30 kernel/entry/common.c:266
         ret_from_fork+0x15/0x30 arch/x86/entry/entry_64.S:289
      
        read to 0xffff8881286f4938 of 8 bytes by interrupt on cpu 1:
         mem_cgroup_under_socket_pressure include/linux/memcontrol.h:1483 [inline]
         sk_under_memory_pressure include/net/sock.h:1314 [inline]
         __sk_mem_reduce_allocated+0x1d2/0x270 net/core/sock.c:2696
         __sk_mem_reclaim+0x44/0x50 net/core/sock.c:2711
         sk_mem_reclaim include/net/sock.h:1490 [inline]
         ......
         net_rx_action+0x17a/0x480 net/core/dev.c:6864
         __do_softirq+0x12c/0x2af kernel/softirq.c:298
         run_ksoftirqd+0x13/0x20 kernel/softirq.c:653
         smpboot_thread_fn+0x33f/0x510 kernel/smpboot.c:165
         kthread+0x1fc/0x220 kernel/kthread.c:292
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296
      
      Fix it by using READ_ONCE() and WRITE_ONCE() to read and write
      memcg->socket_pressure.
      
      Link: https://lkml.kernel.org/r/20211025082843.671690-1-songyuanzheng@huawei.com
      Signed-off-by: Yuanzheng Song <songyuanzheng@huawei.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm/vmscan: delay waking of tasks throttled on NOPROGRESS · 66ce520b
      Mel Gorman authored
      Tracing indicates that tasks throttled on NOPROGRESS are woken
      prematurely resulting in occasional massive spikes in direct reclaim
      activity.  This patch wakes tasks throttled on NOPROGRESS if reclaim
      efficiency is at least 12%.
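
      A minimal sketch of such a wake-up test, assuming that "reclaim
      efficiency" means pages reclaimed as a share of pages scanned.  The
      function names and the floating-point arithmetic are illustrative; the
      kernel may well use a cheaper integer comparison.

```python
def reclaim_efficiency_pct(nr_reclaimed: int, nr_scanned: int) -> float:
    """Pages reclaimed as a percentage of pages scanned (assumed definition)."""
    if nr_scanned == 0:
        return 0.0  # nothing scanned, so no measurable progress
    return 100.0 * nr_reclaimed / nr_scanned


def should_wake_noprogress(nr_reclaimed: int, nr_scanned: int,
                           threshold_pct: float = 12.0) -> bool:
    """Wake NOPROGRESS-throttled tasks only once efficiency is high enough."""
    return reclaim_efficiency_pct(nr_reclaimed, nr_scanned) >= threshold_pct


if __name__ == "__main__":
    print(should_wake_noprogress(12, 100))
    print(should_wake_noprogress(11, 100))
```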
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-9-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm/vmscan: increase the timeout if page reclaim is not making progress · a19594ca
      Mel Gorman authored
      Tracing of the stutterp workload showed the following delays:
      
            1 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usect_delayed=536000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usect_delayed=544000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usect_delayed=556000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usect_delayed=624000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usect_delayed=716000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usect_delayed=772000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usect_delayed=512000 reason=VMSCAN_THROTTLE_NOPROGRESS
           16 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
           53 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
          116 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
         5907 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
        71741 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
      
      All the throttling hit the full timeout and then there were wakeup
      delays, meaning that the wakeups are premature as no other reclaimer
      such as kswapd has made progress.  This patch increases the maximum
      timeout.
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-8-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm/vmscan: centralise timeout values for reclaim_throttle · c3f4a9a2
      Mel Gorman authored
      Neil Brown raised concerns about callers of reclaim_throttle specifying
      a timeout value.  The original timeout values to congestion_wait() were
      probably pulled out of thin air or copy&pasted from somewhere else.
      This patch centralises the timeout values and selects a timeout based on
      the reason for reclaim throttling.  These figures are also pulled out of
      the same thin air but better values may be derived.

      Running a workload that is throttling for inappropriate periods and
      tracing mm_vmscan_throttled can be used to pick a more appropriate
      value.  Excessive throttling would pick a lower timeout whereas
      excessive CPU usage in reclaim context would select a larger timeout.
      Ideally a large value would always be used and the wakeups would occur
      before a timeout but that requires careful testing.
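
      The idea of centralising the timeouts can be sketched as a single
      lookup keyed by throttle reason, so that callers pass only the reason.
      The Python names below are hypothetical and the millisecond values are
      placeholders chosen for the example, not the figures used by the patch.

```python
from enum import Enum


class ThrottleReason(Enum):
    """Illustrative stand-ins for the kernel's throttle reasons."""
    WRITEBACK = "writeback"
    ISOLATED = "isolated"
    NOPROGRESS = "noprogress"


# Placeholder timeouts in milliseconds. These are NOT taken from the patch;
# they exist only so that the lookup has something to return.
TIMEOUT_MS = {
    ThrottleReason.WRITEBACK: 100,
    ThrottleReason.ISOLATED: 20,
    ThrottleReason.NOPROGRESS: 100,
}


def reclaim_timeout_ms(reason: ThrottleReason, table=TIMEOUT_MS) -> int:
    """Single place where a throttle reason is turned into a timeout."""
    return table[reason]


if __name__ == "__main__":
    for reason in ThrottleReason:
        print(reason.name, reclaim_timeout_ms(reason))
```

      Tuning then means editing one table rather than every call site.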
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-7-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm/page_alloc: remove the throttling logic from the page allocator · 132b0d21
      Mel Gorman authored
      The page allocator stalls based on the number of pages that are waiting
      for writeback to start but this should now be redundant.
      shrink_inactive_list() will wake flusher threads if the tail of the LRU
      consists of unqueued dirty pages so the flusher should be active.  If
      it fails to
      make progress due to pages under writeback not being completed quickly
      then it should stall on VMSCAN_THROTTLE_WRITEBACK.
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-6-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm/writeback: throttle based on page writeback instead of congestion · 8d58802f
      Mel Gorman authored
      do_writepages throttles on congestion if writepages() fails due to a
      lack of memory but congestion_wait() is partially broken as the
      congestion state is not updated for all BDIs.

      This patch stalls waiting for a number of pages to complete writeback
      that are located on the local node.  The main weakness is that there is no
      correlation between the location of the inode's pages and locality but
      that is still better than congestion_wait.
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm/vmscan: throttle reclaim when no progress is being made · 69392a40
      Mel Gorman authored
      Memcg reclaim throttles on congestion if no reclaim progress is made.
      This makes little sense, as the lack of progress might be due to
      writeback or a host of other factors.
      
      For !memcg reclaim, it's messy.  Direct reclaim primarily is throttled
      in the page allocator if it is failing to make progress.  Kswapd
      throttles if too many pages are under writeback and marked for immediate
      reclaim.
      
      This patch explicitly throttles if reclaim is failing to make progress.
      
      [vbabka@suse.cz: Remove redundant code]
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm/vmscan: throttle reclaim and compaction when too may pages are isolated · d818fca1
      Mel Gorman authored
      Page reclaim throttles on congestion if too many parallel reclaim
      instances have isolated too many pages.  This makes no sense because
      excessive parallelisation has nothing to do with writeback or
      congestion.
      
      This patch creates an additional workqueue to sleep on when too many
      pages are isolated.  The throttled tasks are woken when the number of
      isolated pages is reduced or a timeout occurs.  There may be some false
      positive wakeups for GFP_NOIO/GFP_NOFS callers but the tasks will
      throttle again if necessary.
      
      [shy828301@gmail.com: Wake up from compaction context]
      [vbabka@suse.cz: Account number of throttled tasks only for writeback]
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm/vmscan: throttle reclaim until some writeback completes if congested · 8cd7c588
      Mel Gorman authored
      Patch series "Remove dependency on congestion_wait in mm/", v5.
      
      This series removes all calls to congestion_wait in mm/ and deletes
      wait_iff_congested.  It's not a clever implementation but
      congestion_wait has been broken for a long time [1].
      
      Even if congestion throttling worked, it was never a great idea.  While
      excessive dirty/writeback pages at the tail of the LRU is one
      possible reason that reclaim may be slow, there is also the problem of too
      many pages being isolated and reclaim failing for other reasons
      (elevated references, too many pages isolated, excessive LRU contention
      etc).
      
      This series replaces the "congestion" throttling with 3 different types.
      
       - If there are too many dirty/writeback pages, sleep until a timeout or
         enough pages get cleaned
      
       - If too many pages are isolated, sleep until enough isolated pages are
         either reclaimed or put back on the LRU
      
       - If no progress is being made, direct reclaim tasks sleep until
         another task makes progress with acceptable efficiency.
      
      This was initially tested with a mix of workloads that used to trigger
      corner cases that no longer work.  A new test case was created called
      "stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
      created XFS filesystem.  Note that it may be necessary to increase the
      timeout of ssh if executing remotely as ssh itself can get throttled and
      the connection may timeout.
      
      stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
      to check the impact as the number of direct reclaimers increases.  It
      has four types of worker.
      
       - One "anon latency" worker creates small mappings with mmap() and
         times how long it takes to fault the mapping reading it 4K at a time
      
       - X file writers which is fio randomly writing X files where the total
         size of the files adds up to the allowed dirty_ratio. fio is allowed
         to run for a warmup period to allow some file-backed pages to
         accumulate. The duration of the warmup is based on the best-case
         linear write speed of the storage.
      
       - Y file readers which is fio randomly reading small files
      
       - Z anon memory hogs which continually map (100-dirty_ratio)% of memory
      
       - Total estimated WSS = (100+dirty_ratio) percent of memory
      
      X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4
      
      The intent is to maximise the total WSS with a mix of file and anon
      memory where some anonymous memory must be swapped and there is a high
      likelihood of dirty/writeback pages reaching the end of the LRU.
      
      The test can be configured to have no background readers to stress
      dirty/writeback pages.  The results below are based on having zero
      readers.
      
      The short summary of the results is that the series works and stalls
      until some event occurs but the timeouts may need adjustment.
      
      The test results are not broken down by patch as the series should be
      treated as one block that replaces a broken throttling mechanism with a
      working one.
      
      Finally, three machines were tested but I'm reporting the worst set of
      results.  The other two machines had much better latencies for example.
      
      First, the results for the "anon latency" worker:
      
        stutterp
                                      5.15.0-rc1             5.15.0-rc1
                                         vanilla mm-reclaimcongest-v5r4
        Amean     mmap-4      31.4003 (   0.00%)   2661.0198 (-8374.52%)
        Amean     mmap-7      38.1641 (   0.00%)    149.2891 (-291.18%)
        Amean     mmap-12     60.0981 (   0.00%)    187.8105 (-212.51%)
        Amean     mmap-21    161.2699 (   0.00%)    213.9107 ( -32.64%)
        Amean     mmap-30    174.5589 (   0.00%)    377.7548 (-116.41%)
        Amean     mmap-48   8106.8160 (   0.00%)   1070.5616 (  86.79%)
        Stddev    mmap-4      41.3455 (   0.00%)  27573.9676 (-66591.66%)
        Stddev    mmap-7      53.5556 (   0.00%)   4608.5860 (-8505.23%)
        Stddev    mmap-12    171.3897 (   0.00%)   5559.4542 (-3143.75%)
        Stddev    mmap-21   1506.6752 (   0.00%)   5746.2507 (-281.39%)
        Stddev    mmap-30    557.5806 (   0.00%)   7678.1624 (-1277.05%)
        Stddev    mmap-48  61681.5718 (   0.00%)  14507.2830 (  76.48%)
        Max-90    mmap-4      31.4243 (   0.00%)     83.1457 (-164.59%)
        Max-90    mmap-7      41.0410 (   0.00%)     41.0720 (  -0.08%)
        Max-90    mmap-12     66.5255 (   0.00%)     53.9073 (  18.97%)
        Max-90    mmap-21    146.7479 (   0.00%)    105.9540 (  27.80%)
        Max-90    mmap-30    193.9513 (   0.00%)     64.3067 (  66.84%)
        Max-90    mmap-48    277.9137 (   0.00%)    591.0594 (-112.68%)
        Max       mmap-4    1913.8009 (   0.00%) 299623.9695 (-15555.96%)
        Max       mmap-7    2423.9665 (   0.00%) 204453.1708 (-8334.65%)
        Max       mmap-12   6845.6573 (   0.00%) 221090.3366 (-3129.64%)
        Max       mmap-21  56278.6508 (   0.00%) 213877.3496 (-280.03%)
        Max       mmap-30  19716.2990 (   0.00%) 216287.6229 (-997.00%)
        Max       mmap-48 477923.9400 (   0.00%) 245414.8238 (  48.65%)
      
      For most thread counts, the time to mmap() is unfortunately increased.
      In earlier versions of the series, this was lower but a large number of
      throttling events were reaching their timeout, increasing the amount of
      inefficient scanning of the LRU.  There is no prioritisation of reclaim
      tasks making progress based on each task's rate of page allocation versus
      progress of reclaim.  The variance is also impacted for high worker
      counts but in all cases, the differences in latency are not
      statistically significant due to very large maximum outliers.  Max-90
      shows that 90% of the stalls are comparable but the Max results show the
      massive outliers which are increased due to stalling.
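
      The percentages in parentheses can be reproduced from the two value
      columns.  The following is a minimal sketch, assuming the report prints
      the relative change against vanilla so that a positive value means lower
      latency.  The helper name is illustrative, and small differences from the
      table are expected because the printed means are already rounded.

```python
def relative_gain(baseline, patched):
    """Percentage change versus baseline; positive means patched is lower."""
    return (baseline - patched) / baseline * 100.0


# Two Amean rows copied from the table above: (vanilla, patched)
amean = {
    "mmap-4": (31.4003, 2661.0198),
    "mmap-48": (8106.8160, 1070.5616),
}

for name, (vanilla, patched) in amean.items():
    print(f"{name}: {relative_gain(vanilla, patched):.2f}%")
```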
      
      It is expected that this will be very machine dependent.  Due to the
      test design, reclaim is difficult so allocations stall and there are
      variances depending on whether THPs can be allocated or not.  The amount
      of memory will affect exactly how bad the corner cases are and how often
      they trigger.  The warmup period calculation is not ideal as it's based
      on linear writes whereas fio is randomly writing multiple files from
      multiple tasks so the start state of the test is variable.  For example,
      these are the latencies on a single-socket machine that had more memory
      
        Amean     mmap-4      42.2287 (   0.00%)     49.6838 * -17.65%*
        Amean     mmap-7     216.4326 (   0.00%)     47.4451 *  78.08%*
        Amean     mmap-12   2412.0588 (   0.00%)     51.7497 (  97.85%)
        Amean     mmap-21   5546.2548 (   0.00%)     51.8862 (  99.06%)
        Amean     mmap-30   1085.3121 (   0.00%)     72.1004 (  93.36%)
      
      The overall system CPU usage and elapsed time is as follows
      
                          5.15.0-rc3  5.15.0-rc3
                             vanilla mm-reclaimcongest-v5r4
        Duration User        6989.03      983.42
        Duration System      7308.12      799.68
        Duration Elapsed     2277.67     2092.98
      
      The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
      stalling.
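
      As a quick check of the 89% figure, the sketch below computes the
      reduction for each of the three duration rows.  The function and
      dictionary names are illustrative only.

```python
def percent_reduction(before, after):
    """How much smaller 'after' is than 'before', as a percentage."""
    return (before - after) / before * 100.0


# (vanilla, patched) pairs copied from the duration table above
durations = {
    "User": (6989.03, 983.42),
    "System": (7308.12, 799.68),
    "Elapsed": (2277.67, 2092.98),
}

for name, (vanilla, patched) in durations.items():
    print(f"Duration {name}: {percent_reduction(vanilla, patched):.1f}% lower")
```

      System time drops by roughly 89% while elapsed time only shrinks by
      about 8%, which is consistent with tasks sleeping instead of scanning.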
      
      The high-level /proc/vmstat counters show
      
                                             5.15.0-rc1     5.15.0-rc1
                                                vanilla mm-reclaimcongest-v5r2
        Ops Direct pages scanned          1056608451.00   503594991.00
        Ops Kswapd pages scanned           109795048.00   147289810.00
        Ops Kswapd pages reclaimed          63269243.00    31036005.00
        Ops Direct pages reclaimed          10803973.00     6328887.00
        Ops Kswapd efficiency %                   57.62          21.07
        Ops Kswapd velocity                    48204.98       57572.86
        Ops Direct efficiency %                    1.02           1.26
        Ops Direct velocity                   463898.83      196845.97
      
      Kswapd scanned more pages while reclaiming fewer, and the detailed
      pattern is different.  The vanilla kernel scans slowly over time whereas
      the patches exhibit burst patterns of scan activity.  Direct reclaim
      scanning is reduced by 52% due to stalling.
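
      The efficiency rows and the 52% figure follow from the raw counters.
      This is a sketch, assuming "efficiency" means pages reclaimed as a
      percentage of pages scanned, which matches the reported values.  The
      names used here are illustrative.

```python
def efficiency(reclaimed, scanned):
    """Pages reclaimed per 100 pages scanned."""
    return reclaimed / scanned * 100.0


def percent_reduction(before, after):
    """How much smaller 'after' is than 'before', as a percentage."""
    return (before - after) / before * 100.0


# Counters copied from the vmstat table above: (vanilla, patched)
direct_scanned = (1056608451, 503594991)
kswapd_scanned = (109795048, 147289810)
kswapd_reclaimed = (63269243, 31036005)
direct_reclaimed = (10803973, 6328887)

for i, kernel in enumerate(("vanilla", "patched")):
    print(kernel,
          f"kswapd efficiency {efficiency(kswapd_reclaimed[i], kswapd_scanned[i]):.2f}%",
          f"direct efficiency {efficiency(direct_reclaimed[i], direct_scanned[i]):.2f}%")

print(f"direct scanning reduced by {percent_reduction(*direct_scanned):.1f}%")
```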
      
      The pattern for stealing pages is also slightly different.  Both kernels
      exhibit spikes but the vanilla kernel, when reclaiming, shows pages being
      reclaimed over a period of time whereas the patches tend to reclaim in
      spikes.  The difference is that vanilla is not throttling and instead
      scans constantly, finding some pages over time, whereas the patched
      kernel throttles and reclaims in spikes.
      
        Ops Percentage direct scans               90.59          77.37
      
      With vanilla, 90.59% of all scanned pages were scanned by direct reclaim
      whereas with the patches, 77.37% were, due to throttling.
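
      The percentage is the direct reclaim share of all pages scanned.  A
      small sketch using the scan counters reported earlier (the function
      name is illustrative):

```python
def direct_scan_share(direct_scanned, kswapd_scanned):
    """Share of all scanned pages that were scanned by direct reclaim."""
    return direct_scanned / (direct_scanned + kswapd_scanned) * 100.0


# "Direct pages scanned" and "Kswapd pages scanned" from the vmstat table
print(f"vanilla: {direct_scan_share(1056608451, 109795048):.2f}%")
print(f"patched: {direct_scan_share(503594991, 147289810):.2f}%")
```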
      
        Ops Page writes by reclaim           2613590.00     1687131.00
      
      Page writes from reclaim context are reduced.
      
        Ops Page writes anon                 2932752.00     1917048.00
      
      And there is less swapping.
      
        Ops Page reclaim immediate         996248528.00   107664764.00
      
      The number of pages encountered at the tail of the LRU tagged for
      immediate reclaim but still dirty/writeback is reduced by 89%.
      
        Ops Slabs scanned                     164284.00      153608.00
      
      Slab scan activity is similar.
      
      ftrace was used to gather stall activity
      
        Vanilla
        -------
            1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
            2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
            8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
           29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
        82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
      
      The vast majority of wait_iff_congested calls do not stall at all.  What
      is likely happening is that cond_resched() reschedules the task for a
      short period when the BDI is not registering congestion (which it never
      will in this test setup).
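
      To put a number on "vast majority", the histogram above can be parsed
      and summarised.  This is a sketch that assumes the "count event: fields"
      layout shown above; the regular expression and function names are
      illustrative and not part of any tracing tool.

```python
import re

# Matches lines such as:
#   29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
HIST_LINE = re.compile(
    r"^\s*(\d+)\s+(\w+):\s+usec_timeout=(\d+)\s+usec_delayed=(\d+)\s*$"
)

VANILLA = """\
    1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
    2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
    8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
   29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
"""


def parse_histogram(text):
    """Return (count, event, timeout_us, delayed_us) tuples for matching lines."""
    rows = []
    for line in text.splitlines():
        match = HIST_LINE.match(line)
        if match:
            count, event, timeout, delayed = match.groups()
            rows.append((int(count), event, int(timeout), int(delayed)))
    return rows


def share_not_stalled(rows):
    """Percentage of calls that recorded a delay of zero."""
    total = sum(count for count, _, _, _ in rows)
    zero = sum(count for count, _, _, delayed in rows if delayed == 0)
    return zero / total * 100.0


rows = parse_histogram(VANILLA)
print(f"{share_not_stalled(rows):.2f}% of calls did not stall")
```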
      
            1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
            2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
            4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
          380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
          778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000
      
      congestion_wait, if called, always exceeds the timeout as there is no
      trigger to wake it up.
      
      Bottom line: Vanilla will throttle but it's not effective.
      
      Patch series
      ------------
      
      Kswapd throttle activity was always due to scanning pages tagged for
      immediate reclaim at the tail of the LRU
      
            1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
            4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
            6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
           94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
          112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The majority of events did not stall or stalled for a short period.
      Roughly 16% of stalls reached the timeout before expiry.  For direct
      reclaim, the number of stalls for each reason was
      
         6624 reason=VMSCAN_THROTTLE_ISOLATED
        93246 reason=VMSCAN_THROTTLE_NOPROGRESS
        96934 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The most common reason to stall was excessive pages tagged for
      immediate reclaim at the tail of the LRU, followed by a failure to make
      forward progress.  A relatively small number were due to too many pages
      isolated from the LRU by parallel threads.
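
      Expressed as shares of all direct reclaim stalls, the three counts can
      be compared directly.  A short sketch (variable names are illustrative):

```python
# Direct reclaim stall counts per reason, copied from the list above
stalls = {
    "VMSCAN_THROTTLE_ISOLATED": 6624,
    "VMSCAN_THROTTLE_NOPROGRESS": 93246,
    "VMSCAN_THROTTLE_WRITEBACK": 96934,
}

total = sum(stalls.values())
shares = {reason: count / total * 100.0 for reason, count in stalls.items()}

# Print the reasons from most to least common
for reason, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{reason:28s} {share:6.2f}%")
```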
      
      For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was
      
            9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
           12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
           83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
         6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED
      
      Most did not stall at all.  A small number reached the timeout.
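
      The same breakdown can be grouped into three buckets: no stall, a
      partial stall that ended before the timeout, and a stall for the full
      timeout.  This is a sketch with illustrative bucket names; the tuples
      are the counts and delays listed above.

```python
from collections import Counter


def classify(timeout_us, delayed_us):
    """Bucket one histogram entry by how long the task actually slept."""
    if delayed_us == 0:
        return "no stall"
    if delayed_us >= timeout_us:
        return "full timeout"
    return "partial stall"


# (count, usec_timeout, delay in usec) for VMSCAN_THROTTLE_ISOLATED
isolated = [
    (9, 20000, 4000),
    (12, 20000, 16000),
    (83, 20000, 20000),
    (6520, 20000, 0),
]

buckets = Counter()
for count, timeout_us, delayed_us in isolated:
    buckets[classify(timeout_us, delayed_us)] += count

total = sum(buckets.values())
for name, count in buckets.most_common():
    print(f"{name:14s} {count:5d} ({count / total * 100.0:.2f}%)")
```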
      
      For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls was all over
      the map
      
            1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
            6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
           11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
           13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
           13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
           16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
           18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
           21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
           23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
           23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
           25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
           25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
           26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
           27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
           28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
           29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
           30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
           30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
           31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
           32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
           33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
           35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
           35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
           36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
           36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
           37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
           38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
           40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
           43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
           55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
           56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
           58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
           59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
           61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
           71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
           71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
           79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
           82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
           82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
           85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
           85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
           88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
           90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
           90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
           94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
          118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
          119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
          126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
          146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
          148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
          148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
          159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
          178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
          183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
          237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
          266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
          313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
          347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
          470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
          559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
          964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
         2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
         2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
         7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
        22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
        51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS
      
      The full timeout is often hit but a large number also do not stall at
      all.  The remainder slept a little allowing other reclaim tasks to make
      progress.
      
      While this timeout could be further increased, it could also negatively
      impact worst-case behaviour when there is no prioritisation of what task
      should make progress.
      
      For VMSCAN_THROTTLE_WRITEBACK, the breakdown was
      
            1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
            2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
            3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
            6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
            7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
           12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
           16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
           24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
           28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
           30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
           30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
           32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
           42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
           77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
           99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
          137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
          190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
          339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
          518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
          852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
         3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
         7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
        83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The majority hit the timeout in direct reclaim context although a
      sizable number did not stall at all.  This is very different to kswapd
      where only a tiny percentage of stalls due to writeback reached the
      timeout.
      
      Bottom line, the throttling appears to work and the wakeup events may
      limit worst case stalls.  There might be some grounds for adjusting
      timeouts but it's likely futile as the worst-case scenarios depend on
      the workload, memory size and the speed of the storage.  A better
      approach to improve the series further would be to prioritise tasks
      based on their rate of allocation with the caveat that it may be very
      expensive to track.
      
      This patch (of 5):
      
      Page reclaim throttles on wait_iff_congested under the following
      conditions:
      
       - kswapd is encountering pages under writeback and marked for immediate
         reclaim implying that pages are cycling through the LRU faster than
         pages can be cleaned.
      
       - Direct reclaim will stall if all dirty pages are backed by congested
         inodes.
      
      wait_iff_congested is almost completely broken with few exceptions.
      This patch adds a new node-based waitqueue and tracks the number of
      throttled tasks and pages written back since throttling started.  If
      enough pages belonging to the node are written back then the throttled
      tasks will wake early.  If not, the throttled tasks sleep until the
      timeout expires.
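
      The behaviour described above can be modelled in userspace.  The
      following is a toy sketch only, not the kernel implementation: a
      condition variable stands in for the node's queue of throttled tasks,
      and the class, its methods and the fixed wake-up threshold are all
      hypothetical names and values chosen for illustration.

```python
import threading
import time


class NodeThrottle:
    """Toy userspace model of per-node reclaim throttling; not kernel code."""

    def __init__(self, wake_threshold):
        self._cond = threading.Condition()
        self._wake_threshold = wake_threshold  # hypothetical threshold, in pages
        self._generation = 0   # bumped each time sleepers are woken early
        self.nr_throttled = 0  # tasks currently sleeping in throttle()
        self.nr_written = 0    # pages written back since throttling started

    def throttle(self, timeout_s):
        """Sleep until writeback progress wakes us or the timeout expires.

        Returns True if woken early, False if the full timeout elapsed.
        """
        with self._cond:
            self.nr_throttled += 1
            my_generation = self._generation
            woken = self._cond.wait_for(
                lambda: self._generation != my_generation, timeout=timeout_s
            )
            self.nr_throttled -= 1
            if self.nr_throttled == 0:
                self.nr_written = 0  # the next throttling episode starts at zero
            return woken

    def account_writeback(self, nr_pages):
        """Record completed writeback; wake sleepers once enough has been done."""
        with self._cond:
            if self.nr_throttled == 0:
                return  # nobody is throttled, nothing to track
            self.nr_written += nr_pages
            if self.nr_written >= self._wake_threshold:
                self.nr_written = 0
                self._generation += 1
                self._cond.notify_all()


def demo_early_wake():
    """Throttle one worker with a long timeout, then complete enough writeback."""
    node = NodeThrottle(wake_threshold=32)
    results = []
    worker = threading.Thread(target=lambda: results.append(node.throttle(5.0)))
    worker.start()
    while node.nr_throttled == 0:  # wait until the worker counts as throttled
        time.sleep(0.001)
    node.account_writeback(16)  # below the threshold, the worker keeps sleeping
    node.account_writeback(16)  # threshold reached, the worker is woken
    worker.join()
    return results[0]
```

      Calling throttle() with no writeback in flight returns False after the
      timeout, while demo_early_wake() returns True because the second
      account_writeback() call crosses the threshold.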
      
      [neilb@suse.de: Uninterruptible sleep and simpler wakeups]
      [hdanton@sina.com: Avoid race when reclaim starts]
      [vbabka@suse.cz: vmstat irq-safe api, clarifications]
      
      Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
      Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: NeilBrown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cd7c588