15 Jan, 2022 (40 commits)
    • hugetlb: add hugetlb.*.numa_stat file · f4776199
      Mina Almasry authored
      For hugetlb backed jobs/VMs it's critical to understand the numa
      information for the memory backing these jobs to deliver optimal
      performance.
      
      Currently this technically can be queried from /proc/self/numa_maps, but
      there are significant issues with that.  Namely:
      
      1. Memory can be mapped or unmapped.
      
      2. numa_maps are per process and need to be aggregated across all
         processes in the cgroup.  For shared memory this is more involved as
         the userspace needs to make sure it doesn't double count shared
         mappings.
      
      3. I believe querying numa_maps needs to hold the mmap_lock which adds
         to the contention on this lock.
      
      For these reasons I propose simply adding a hugetlb.*.numa_stat file,
      which shows the numa information of the cgroup, similarly to
      memory.numa_stat.
      
      On cgroup-v2:
         cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat
         total=2097152 N0=2097152 N1=0
      
      On cgroup-v1:
         cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat
         total=2097152 N0=2097152 N1=0
         hierarichal_total=2097152 N0=2097152 N1=0
      
      This patch was tested manually by allocating hugetlb memory and querying
      the hugetlb.*.numa_stat file of the cgroup and its parents.
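
      For consumers of the new file, a minimal userspace parser of the
      "key=value" pairs shown above might look like this (a sketch; the
      cgroup path is hypothetical, matching the examples above):

       #include <stdio.h>

       int main(void)
       {
               /* hypothetical cgroup path, as in the cgroup-v2 example */
               FILE *f = fopen("/sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat", "r");
               char key[32];
               unsigned long long val;

               if (!f)
                       return 1;
               /* each token has the form key=value, separated by whitespace */
               while (fscanf(f, "%31[^=]=%llu ", key, &val) == 2)
                       printf("%s -> %llu bytes\n", key, val);
               fclose(f);
               return 0;
       }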
      
      [colin.i.king@googlemail.com: fix spelling mistake "hierarichal" -> "hierarchical"]
        Link: https://lkml.kernel.org/r/20211125090635.23508-1-colin.i.king@gmail.com
      [keescook@chromium.org: fix copy/paste array assignment]
        Link: https://lkml.kernel.org/r/20211203065647.2819707-1-keescook@chromium.org
      
      Link: https://lkml.kernel.org/r/20211123001020.4083653-1-almasrymina@google.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jue Wang <juew@google.com>
      Cc: Yang Yao <ygyao@google.com>
      Cc: Joanna Li <joannali@google.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: do not warn allocation failure on zone DMA if no managed pages · c4dc63f0
      Baoquan He authored
      In the kdump kernel of x86_64, a page allocation failure is observed:
      
       kworker/u2:2: page allocation failure: order:0, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
       CPU: 0 PID: 55 Comm: kworker/u2:2 Not tainted 5.16.0-rc4+ #5
       Hardware name: AMD Dinar/Dinar, BIOS RDN1505B 06/05/2013
       Workqueue: events_unbound async_run_entry_fn
       Call Trace:
        <TASK>
        dump_stack_lvl+0x48/0x5e
        warn_alloc.cold+0x72/0xd6
        __alloc_pages_slowpath.constprop.0+0xc69/0xcd0
        __alloc_pages+0x1df/0x210
        new_slab+0x389/0x4d0
        ___slab_alloc+0x58f/0x770
        __slab_alloc.constprop.0+0x4a/0x80
        kmem_cache_alloc_trace+0x24b/0x2c0
        sr_probe+0x1db/0x620
        ......
        device_add+0x405/0x920
        ......
        __scsi_add_device+0xe5/0x100
        ata_scsi_scan_host+0x97/0x1d0
        async_run_entry_fn+0x30/0x130
        process_one_work+0x1e8/0x3c0
        worker_thread+0x50/0x3b0
        ? rescuer_thread+0x350/0x350
        kthread+0x16b/0x190
        ? set_kthread_struct+0x40/0x40
        ret_from_fork+0x22/0x30
        </TASK>
       Mem-Info:
       ......
      
      The above failure happened when calling kmalloc() to allocate a buffer
      with GFP_DMA.  The request tries to allocate a slab page from the DMA
      zone while that zone has no managed pages at all.
      
       sr_probe()
       --> get_capabilities()
           --> buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
      
      This happens because, in the current kernel, the dma-kmalloc caches are
      created whenever CONFIG_ZONE_DMA is enabled.  However, the kdump kernel
      of x86_64 has had no managed pages in the DMA zone since commit
      6f599d84 ("x86/kdump: Always reserve the low 1M when the crashkernel
      option is specified").  The failure can always be reproduced.
      
      For now, let's mute the allocation-failure warning when pages are
      requested from a DMA zone that has no managed pages.
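
      The idea boils down to a check like the following before warning in the
      allocator slow path (a sketch of the approach, not the exact upstream
      diff; has_managed_dma() is the helper introduced earlier in this
      series):

       /* Skip warn_alloc() for GFP_DMA requests when the DMA zone has no
        * managed pages: such failures are expected and not actionable. */
       if (!(gfp_mask & __GFP_DMA) || has_managed_dma())
               warn_alloc(gfp_mask, ac->nodemask,
                          "page allocation failure: order:%u", order);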
      
      [akpm@linux-foundation.org: fix warning]
      
      Link: https://lkml.kernel.org/r/20211223094435.248523-4-bhe@redhat.com
      Fixes: 6f599d84 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: John Donnelly <john.p.donnelly@oracle.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dma/pool: create dma atomic pool only if dma zone has managed pages · a674e48c
      Baoquan He authored
      Currently, three DMA atomic pools are initialized as long as the
      relevant kernel code is built in.  In the kdump kernel of x86_64,
      however, creating atomic_pool_dma is wrong because there are no managed
      pages in the DMA zone.  In that case, the DMA zone only has the low 1M
      of memory present, locked down by the memblock allocator, so no pages
      are ever added to the DMA zone's buddy lists.  Please check commit
      f1d4d47c ("x86/setup: Always reserve the first 1M of RAM").

      The kdump kernel of x86_64 therefore always prints the failure message
      below:
      
       DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
       swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
       CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-0.rc5.20210611git929d931f.42.fc35.x86_64 #1
       Hardware name: Dell Inc. PowerEdge R910/0P658H, BIOS 2.12.0 06/04/2018
       Call Trace:
        dump_stack+0x7f/0xa1
        warn_alloc.cold+0x72/0xd6
        __alloc_pages_slowpath.constprop.0+0xf29/0xf50
        __alloc_pages+0x24d/0x2c0
        alloc_page_interleave+0x13/0xb0
        atomic_pool_expand+0x118/0x210
        __dma_atomic_pool_init+0x45/0x93
        dma_atomic_pool_init+0xdb/0x176
        do_one_initcall+0x67/0x320
        kernel_init_freeable+0x290/0x2dc
        kernel_init+0xa/0x111
        ret_from_fork+0x22/0x30
       Mem-Info:
       ......
       DMA: failed to allocate 128 KiB GFP_KERNEL|GFP_DMA pool for atomic allocation
       DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations
      
      Here, let's check whether the DMA zone has managed pages and create
      atomic_pool_dma only if it does; otherwise just skip it.
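
      Concretely, the init-time check could look like this (a sketch based on
      the description above; has_managed_dma() comes from the previous patch
      in this series):

       /* Only set up the GFP_DMA atomic pool when the DMA zone actually
        * has managed pages to back its allocations. */
       if (has_managed_dma())
               atomic_pool_dma = __dma_atomic_pool_init(atomic_pool_size,
                                                        GFP_KERNEL | GFP_DMA);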
      
      Link: https://lkml.kernel.org/r/20211223094435.248523-3-bhe@redhat.com
      Fixes: 6f599d84 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: John Donnelly <john.p.donnelly@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm_zone: add function to check if managed dma zone exists · 62b31070
      Baoquan He authored
      Patch series "Handle warning of allocation failure on DMA zone w/o
      managed pages", v4.
      
      ***Problem observed:
      On x86_64, when a crash is triggered and the kdump kernel is entered, a
      page allocation failure can always be seen:
      
       ---------------------------------
       DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
       swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
       CPU: 0 PID: 1 Comm: swapper/0
       Call Trace:
        dump_stack+0x7f/0xa1
        warn_alloc.cold+0x72/0xd6
        ......
        __alloc_pages+0x24d/0x2c0
        ......
        dma_atomic_pool_init+0xdb/0x176
        do_one_initcall+0x67/0x320
        ? rcu_read_lock_sched_held+0x3f/0x80
        kernel_init_freeable+0x290/0x2dc
        ? rest_init+0x24f/0x24f
        kernel_init+0xa/0x111
        ret_from_fork+0x22/0x30
       Mem-Info:
       ------------------------------------
      
      ***Root cause:
      The current kernel assumes that the DMA zone must have managed pages
      and tries to request pages from it if CONFIG_ZONE_DMA is enabled.
      This is not always true.  E.g. in the kdump kernel of x86_64, only the
      low 1M is present and locked down at a very early stage of boot, so
      the low 1M is never added to the buddy allocator and never becomes
      managed pages of the DMA zone.  This exception will always cause a page
      allocation failure whenever a page is requested from the DMA zone.
      
      ***Investigation:
      This failure has happened since the commits below were merged into
      Linus's tree:
        1a6a9044 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
        23721c8e x86/crash: Remove crash_reserve_low_1M()
        f1d4d47c x86/setup: Always reserve the first 1M of RAM
        7c321eb2 x86/kdump: Remove the backup region handling
        6f599d84 x86/kdump: Always reserve the low 1M when the crashkernel option is specified
      
      Before those commits, on x86_64, the low 640K area was reused by the
      kdump kernel.  In the kdump kernel, the content of the low 640K area
      was copied into a backup region for dumping before jumping into kdump.
      Then, except for the firmware-reserved regions in [0, 640K], the
      remaining area was added to the buddy allocator and became available
      managed pages of the DMA zone.

      However, after the above commits, in the kdump kernel of x86_64 the
      low 1M is reserved by memblock but never released to the buddy
      allocator, so any later page allocation requested from the DMA zone
      will fail.
      
      Initially, the low 1M needed to be locked down when crashkernel is
      reserved, because AMD SME encrypts memory, making the old backup-region
      mechanism impossible when switching into the kdump kernel.
      
      Later, it was also observed that some BIOSes corrupt memory under 1M.
      To solve this, commit f1d4d47c always reserves the entire low 1M
      region after the real-mode trampoline is allocated.
      
      Besides, Intel engineers recently mentioned that TDX (Trust Domain
      Extensions), which is under development in the kernel, also needs to
      lock down the low 1M.  So we can't simply revert the above commits to
      fix the page allocation failure from the DMA zone, as some have
      suggested.
      
      ***Solution:
      Currently, only the DMA atomic pool and dma-kmalloc initialize and
      request page allocations with GFP_DMA during bootup.

      So only initialize the DMA atomic pool when the DMA zone has available
      managed pages; otherwise just skip the initialization.
      
      For dma-kmalloc(), for the time being, let's mute the warning of
      allocation failure when pages are requested from a DMA zone that has
      no managed pages.  Meanwhile, change code to use the
      dma_alloc_xx/dma_map_xx APIs instead of kmalloc(GFP_DMA), or drop
      GFP_DMA from kmalloc() calls where it is not necessary.  Christoph is
      posting patches to fix those under drivers/scsi/.  Finally, we can
      remove the need for dma-kmalloc(), as people have suggested.
      
      This patch (of 3):
      
      Some places in the current kernel assume that the DMA zone must have
      managed pages if CONFIG_ZONE_DMA is enabled.  This is not always true.
      E.g. in the kdump kernel of x86_64, only the low 1M is present and
      locked down at a very early stage of boot, so there are no managed
      pages at all in the DMA zone.  This exception will always cause a page
      allocation failure whenever a page is requested from the DMA zone.
      
      Add the function has_managed_dma() and the relevant helpers to check
      whether there is a DMA zone with managed pages.  It will be used in
      later patches.
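
      The helper boils down to walking the online nodes and testing whether
      any ZONE_DMA has managed pages (a sketch along the lines of the
      upstream implementation):

       #ifdef CONFIG_ZONE_DMA
       bool has_managed_dma(void)
       {
               struct pglist_data *pgdat;

               for_each_online_pgdat(pgdat) {
                       struct zone *zone = &pgdat->node_zones[ZONE_DMA];

                       if (managed_zone(zone))
                               return true;
               }
               return false;
       }
       #endif /* CONFIG_ZONE_DMA */

      When CONFIG_ZONE_DMA is disabled, a static-inline stub returning false
      lets callers stay unconditional.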
      
      Link: https://lkml.kernel.org/r/20211223094435.248523-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20211223094435.248523-2-bhe@redhat.com
      Fixes: 6f599d84 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: John Donnelly <john.p.donnelly@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: modify the comment section for alloc_contig_pages() · eaab8e75
      Anshuman Khandual authored
      Clarify that the alloc_contig_pages() allocated range will always be
      aligned to the requested nr_pages.
      
      Link: https://lkml.kernel.org/r/1639545478-12160-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/gfp.h: further document GFP_DMA32 · 04a536bf
      Miles Chen authored
      kmalloc(..., GFP_DMA32) does not return DMA32 memory, because the
      DMA32 kmalloc cache array is not implemented.  (Reason: there is no
      such user in the kernel.)

      Put a short comment about this so people can understand it just by
      reading the comment.
      
      [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
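
      In practice this means that memory guaranteed to be below 4 GiB has to
      come from the page allocator rather than kmalloc(); a hypothetical
      caller would do something like:

       /* kmalloc() silently ignores GFP_DMA32 (no DMA32 slab caches exist),
        * so use the page allocator for memory below 4 GiB. */
       struct page *page = alloc_pages(GFP_KERNEL | GFP_DMA32, 0);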
      
      Link: https://lkml.kernel.org/r/20211207093610.6406-1-miles.chen@mediatek.com
      Signed-off-by: Miles Chen <miles.chen@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: drop node from alloc_pages_vma · be1a13eb
      Michal Hocko authored
      alloc_pages_vma is meant to allocate a page with a vma-specific memory
      policy.  The initial node parameter is always the local node, so it is
      pointless to waste a function argument on it.  Drop the parameter.
      
      Link: https://lkml.kernel.org/r/YaSnlv4QpryEpesG@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: fix building error on -Werror=array-compare · ca831f29
      Xiongwei Song authored
      Arthur Marsh reported that we hit the error below when building the
      kernel with gcc-12:
      
        CC      mm/page_alloc.o
        mm/page_alloc.c: In function `mem_init_print_info':
        mm/page_alloc.c:8173:27: error: comparison between two arrays [-Werror=array-compare]
         8173 |                 if (start <= pos && pos < end && size > adj) \
              |
      
      C++20 deprecates comparisons between two arrays, and gcc-12 now warns
      about them.
      
      Link: https://lkml.kernel.org/r/20211125130928.32465-1-sxwjean@me.com
      Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
      Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix boolreturn.cocci warning · 1611f74a
      Changcheng Deng authored
      Return statements in functions returning bool should use true/false
      instead of 1/0.
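
      A hypothetical before/after illustrating the class of fix:

       /* before */
       static bool pfn_valid_example(unsigned long pfn) { return 1; }
       /* after */
       static bool pfn_valid_example(unsigned long pfn) { return true; }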
      
      Link: https://lkml.kernel.org/r/20211126073327.74815-1-deng.changcheng@zte.com.cn
      Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
      Reported-by: Zeal Robot <zealci@zte.com.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/pagealloc: sysctl: change watermark_scale_factor max limit to 30% · 39c65a94
      Suren Baghdasaryan authored
      On embedded systems with low total memory that must run applications
      with relatively large memory requirements, the 10% max limit on
      watermark_scale_factor triggers direct reclaim every time such an
      application is started.  This results in slow application startup times
      and a bad end-user experience.
      
      By increasing watermark_scale_factor max limit we allow vendors more
      flexibility to choose the right level of kswapd aggressiveness for their
      device and workload requirements.
      
      Link: https://lkml.kernel.org/r/20211124193604.2758863-1-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Lukas Middendorf <kernel@tuxforce.de>
      Cc: Antti Palosaari <crope@iki.fi>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Zhang Yi <yi.zhang@huawei.com>
      Cc: Fengfei Xi <xi.fengfei@h3c.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce memalloc_retry_wait() · 4034247a
      NeilBrown authored
      Various places in the kernel - largely in filesystems - respond to a
      memory allocation failure by looping around and re-trying.  Some of
      these cannot conveniently use __GFP_NOFAIL, for reasons such as:
      
       - a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
       - a need to check for the process being signalled between failures
       - the possibility that other recovery actions could be performed
       - the allocation is quite deep in support code, and passing down an
         extra flag to say if __GFP_NOFAIL is wanted would be clumsy.
      
      Many of these currently use congestion_wait() which (in almost all
      cases) simply waits the given timeout - congestion isn't tracked for
      most devices.
      
      It isn't clear what the best delay is for loops, but it is clear that
      the various filesystems shouldn't be responsible for choosing a timeout.
      
      This patch introduces memalloc_retry_wait(), which takes on that
      responsibility.  Code that wants to retry a memory allocation can call
      this function, passing the GFP flags that were used.  It will wait
      however long is appropriate.
      
      For now, it only considers __GFP_NORETRY and whatever
      gfpflags_allow_blocking() tests.  If blocking is allowed without
      __GFP_NORETRY, then alloc_page() either made some reclaim progress, or
      waited for a while, before failing, so there is no need for much
      further waiting; memalloc_retry_wait() waits only until the current
      jiffy ends.  If that condition is not met, then alloc_page() won't have
      waited much, if at all; in that case memalloc_retry_wait() waits about
      200ms.  This is the delay most current loops use.
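
      A sketch consistent with that description (simplified; the timeouts
      follow the 200ms figure quoted above):

       static inline void memalloc_retry_wait(gfp_t gfp_flags)
       {
               gfp_flags = current_gfp_context(gfp_flags);
               if (gfpflags_allow_blocking(gfp_flags) &&
                   !(gfp_flags & __GFP_NORETRY))
                       /* alloc_page() probably waited already, so only
                        * sleep until the current jiffy ends. */
                       schedule_timeout_uninterruptible(1);
               else
                       /* alloc_page() likely failed quickly: back off for
                        * roughly 200ms, like most existing loops. */
                       schedule_timeout_uninterruptible(HZ / 5);
       }

      A caller's retry loop then reduces to something like
      "while (!(page = alloc_page(gfp))) memalloc_retry_wait(gfp);".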
      
      linux/sched/mm.h needs to be included in some files now,
      but linux/backing-dev.h does not.
      
      Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make slab and vmalloc allocators __GFP_NOLOCKDEP aware · 704687de
      Michal Hocko authored
      The sl?b and vmalloc allocators reduce the given gfp mask for their
      internal needs.  For that they use GFP_RECLAIM_MASK to preserve the
      reclaim behavior and constraints.
      
      __GFP_NOLOCKDEP is not a part of that mask because it doesn't really
      control the reclaim behavior strictly speaking.  On the other hand it
      tells the underlying page allocator to disable reclaim recursion
      detection so arguably it should be part of the mask.
      
      Having __GFP_NOLOCKDEP in the mask will not alter the behavior in any
      form, so this change is safe pretty much by definition.  It also adds
      support for this flag to the SL?B and vmalloc allocators, which will in
      turn allow its use with kvmalloc as well.  The lack of this support was
      noticed recently in
      
        http://lkml.kernel.org/r/20211119225435.GZ449541@dread.disaster.area
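
      The change itself amounts to widening the mask (a sketch of the
      resulting definition in mm/internal.h):

       #define GFP_RECLAIM_MASK (__GFP_RECLAIM | __GFP_HIGH | __GFP_IO | \
                                 __GFP_FS | __GFP_NOWARN | \
                                 __GFP_RETRY_MAYFAIL | __GFP_NOFAIL | \
                                 __GFP_NORETRY | __GFP_MEMALLOC | \
                                 __GFP_NOMEMALLOC | __GFP_NOLOCKDEP)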
      
      Link: https://lkml.kernel.org/r/YZ9XtLY4AEjVuiEI@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: allow !GFP_KERNEL allocations for kvmalloc · a421ef30
      Michal Hocko authored
      Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented by the
      previous patches, so we can now enable the same support in kvmalloc.
      This will allow some external users to simplify or completely remove
      their helpers.

      The GFP_NOWAIT semantic has never been supported, but that was never
      explicitly documented, so let's add a note about it.
      
      ceph_kvmalloc is the first helper to be dropped and changed to kvmalloc.
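
      A hypothetical caller after this change can pass a constrained mask
      straight through instead of open-coding a wrapper:

       /* e.g. in filesystem context that must not recurse into the fs
        * and must not fail: */
       buf = kvmalloc(size, GFP_NOFS | __GFP_NOFAIL);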
      
      Link: https://lkml.kernel.org/r/20211122153233.9924-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc: be more explicit about supported gfp flags. · 30d3f011
      Michal Hocko authored
      Commit b7d90e7a ("mm/vmalloc: be more explicit about supported gfp
      flags") has been merged prematurely without the rest of the series and
      without addressed review feedback from Neil.  Fix that up now.  Only
      wording is changed slightly.
      
      Link: https://lkml.kernel.org/r/20211122153233.9924-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc: add support for __GFP_NOFAIL · 9376130c
      Michal Hocko authored
      Dave Chinner has mentioned that some of the xfs code would benefit from
      kvmalloc support for __GFP_NOFAIL because they have allocations that
      cannot fail and they do not fit into a single page.
      
      The large part of the vmalloc implementation already complies with the
      given gfp flags so there is no work for those to be done.  The area and
      page table allocations are an exception to that.  Implement a retry loop
      for those.
      
      Add a short sleep before retrying.  1 jiffy is a completely arbitrary
      timeout.  Ideally the retry would wait for an explicit event - e.g. a
      change to the vmalloc space if the failure was caused by space
      fragmentation or depletion.  But there are multiple different reasons
      to retry, and this could become much more complex.  Keep the retry
      simple for now and just sleep to avoid hogging CPUs.
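
      The area-allocation retry then has roughly this shape (a simplified
      sketch of the idea, not the exact upstream diff):

       struct vm_struct *area;

       do {
               area = get_vm_area_caller(size, VM_ALLOC, caller);
               if (area)
                       break;
               /* 1 jiffy: an arbitrary, short back-off before retrying */
               schedule_timeout_uninterruptible(1);
       } while (gfp_mask & __GFP_NOFAIL);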
      
      Link: https://lkml.kernel.org/r/20211122153233.9924-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc · 451769eb
      Michal Hocko authored
      Patch series "extend vmalloc support for constrained allocations", v2.
      
      Based on a recent discussion with Dave and Neil [1], I have tried to
      implement NOFS, NOIO and NOFAIL support for vmalloc to make life
      easier for kvmalloc users.
      
      A requirement for NOFAIL support for kvmalloc was new to me but this
      seems to be really needed by the xfs code.
      
      NOFS/NOIO was a known, long-standing problem that was hoped to be
      handled by the scope API.  Those scopes should have been used at the
      reclaim recursion boundaries, both to document them and to remove the
      need for NOFS/NOIO constraints on all allocations within the scope.
      Instead, workarounds were developed to wrap a single allocation
      (like ceph_kvmalloc).
      
      The first patch implements NOFS/NOIO support for vmalloc.  The second
      one adds NOFAIL support, and the third one bundles it all together in
      kvmalloc and drops ceph_kvmalloc, which can now use kvmalloc directly.
      
      [1] http://lkml.kernel.org/r/163184741778.29351.16920832234899124642.stgit@noble.brown
      
      This patch (of 4):
      
      vmalloc historically hasn't supported GFP_NO{FS,IO} requests because
      page table allocations do not accept an externally provided gfp mask
      and perform GFP_KERNEL-like allocations.
      
      For a few years we have had the scope APIs
      (memalloc_no{fs,io}_{save,restore}) to enforce NOFS and NOIO
      constraints implicitly on all allocations within the scope.  There was
      a hope that those scopes would be defined at a higher level, where the
      reclaim recursion boundary starts/stops (e.g. when a lock required
      during memory reclaim is taken).  It seems that not all NOFS/NOIO
      users have adopted this approach; instead they have taken the
      workaround of wrapping a single [k]vmalloc allocation in a scope API.

      These workarounds do not serve the purpose of better reclaim recursion
      documentation and reduction of explicit GFP_NO{FS,IO} usage, so let's
      just provide them with the semantic they are asking for, without a
      need for workarounds.
      
      Add support for GFP_NOFS and GFP_NOIO to vmalloc directly.  All
      internal allocations already comply with the given gfp_mask.  The only
      current exception is vmap_pages_range, which maps kernel page tables.
      Infer the proper scope API based on the given gfp mask.
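
      The inference around the page-table mapping step looks roughly like
      this (a sketch of the approach):

       unsigned int flags = 0;

       /* GFP_NOFS request: __GFP_IO set but __GFP_FS cleared */
       if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
               flags = memalloc_nofs_save();
       /* GFP_NOIO request: both __GFP_IO and __GFP_FS cleared */
       else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
               flags = memalloc_noio_save();

       ret = vmap_pages_range(addr, addr + size, prot, pages, page_shift);

       if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
               memalloc_nofs_restore(flags);
       else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
               memalloc_noio_restore(flags);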
      
      [sfr@canb.auug.org.au: mm/vmalloc.c needs linux/sched/mm.h]
       Link: https://lkml.kernel.org/r/20211217232641.0148710c@canb.auug.org.au
      
      Link: https://lkml.kernel.org/r/20211122153233.9924-1-mhocko@kernel.org
      Link: https://lkml.kernel.org/r/20211122153233.9924-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/dmapool.c: revert "make dma pool to use kmalloc_node" · cc6266f0
      Christian König authored
      This reverts commit 2618c60b ("dma: make dma pool to use
      kmalloc_node").
      
      While working my way through the dmapool code I found this odd little
      kmalloc_node().

      What basically happens here is that we allocate the housekeeping
      structure on the NUMA node the device is attached to.  Since the device
      never does DMA to or from that memory, this doesn't seem to make sense
      at all.
      
      So while this doesn't seem to cause much harm it's probably cleaner to
      revert the change for consistency.
      
      Link: https://lkml.kernel.org/r/20211221110724.97664-1-christian.koenig@amd.com
      Signed-off-by: Christian König <christian.koenig@amd.com>
      Cc: Yinghai Lu <yinghai.lu@sun.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove the total_mapcount argument from page_trans_huge_mapcount() · d08d2b62
      Matthew Wilcox (Oracle) authored
      All callers pass NULL, so we can stop calculating the value we would
      store in it.
      
      Link: https://lkml.kernel.org/r/20211220205943.456187-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove the total_mapcount argument from page_trans_huge_map_swapcount() · 66c7f7a6
      Matthew Wilcox (Oracle) authored
      Now that we don't report it to the caller of reuse_swap_page(), we don't
      need to request it from page_trans_huge_map_swapcount().
      
      Link: https://lkml.kernel.org/r/20211220205943.456187-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: mm: add x86_64 support for page table check · d283d422
      Pasha Tatashin authored
      Add page table check hooks into routines that modify user page tables.
      
      Link: https://lkml.kernel.org/r/20211221154650.1047963-5-pasha.tatashin@soleen.com
      Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page table check · df4e817b
      Pasha Tatashin authored
      Check user page table entries at the time they are added and removed.

      This allows synchronously catching memory corruption issues related to
      double mapping.

      When a pte for an anonymous page is added into a page table, we verify
      that this pte does not already point to a file-backed page; and vice
      versa, when a file-backed page is being added, we verify that this
      page does not have an anonymous mapping.

      We also enforce that the only sharing allowed for anonymous pages is
      read-only sharing (i.e. cow after fork).  All other sharing must be
      for file pages.

      Page table check allows protecting against, and debugging, cases where
      "struct page" metadata became corrupted for some reason, for example
      when refcount or mapcount become invalid.
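
      The state tracked per physical page is roughly two counters, plus the
      invariant that at most one of them may be nonzero (a sketch modeled on
      the description above):

       /* sketch: per-page tracking used by the checker */
       struct page_table_check {
               atomic_t anon_map_count;   /* mappings as an anonymous page */
               atomic_t file_map_count;   /* mappings as a file page */
       };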
      
      Link: https://lkml.kernel.org/r/20211221154650.1047963-4-pasha.tatashin@soleen.com
      Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: ptep_clear() page table helper · 08d5b29e
      Pasha Tatashin authored
      We have the ptep_get_and_clear() and ptep_get_and_clear_full() helpers
      to clear a PTE from user page tables, but there is no variant for a
      simple clear of a present PTE from user page tables that does not go
      through the low-level pte_clear(), which can be either native or
      para-virtualized.

      Add a new ptep_clear() that can be used in common code to clear PTEs
      from page tables.  We will need this call later in order to add a hook
      for page table check.
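
      The helper is essentially a thin, hookable wrapper (a sketch along the
      lines of the generic definition):

       static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
                                     pte_t *ptep)
       {
               pte_clear(mm, addr, ptep);
               /* a page table check hook lands here in a later patch */
       }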
      
      Link: https://lkml.kernel.org/r/20211221154650.1047963-3-pasha.tatashin@soleen.com
      Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: change page type prior to adding page table entry · 1eba86c0
      Pasha Tatashin authored
      Patch series "page table check", v3.
      
      Ensure that some memory corruptions are prevented by checking at the
      time of insertion of entries into user page tables that there is no
      illegal sharing.
      
      We recently found a problem [1] that has existed in the kernel since
      4.14.  It was caused by a broken page refcount and led to memory
      leaking from one process into another.  The problem was accidentally
      detected by studying a dump of one process and noticing that one page
      contained memory that should not belong to this process.
      
      There are some other page->_refcount related problems that were
      recently fixed ([2], [3]) which could potentially also lead to illegal
      sharing.
      
      In addition to hardening refcount [4] itself, this work is an attempt to
      prevent this class of memory corruption issues.
      
      It uses a simple state machine that is independent from regular MM
      logic to check for illegal sharing at the time pages are inserted into
      and removed from page tables.
      
      [1] https://lore.kernel.org/all/xr9335nxwc5y.fsf@gthelen2.svl.corp.google.com
      [2] https://lore.kernel.org/all/1582661774-30925-2-git-send-email-akaher@vmware.com
      [3] https://lore.kernel.org/all/20210622021423.154662-3-mike.kravetz@oracle.com
      [4] https://lore.kernel.org/all/20211221150140.988298-1-pasha.tatashin@soleen.com
      
      This patch (of 4):
      
      There are a few places where we first update the entry in the user
      page table and only later change the struct page to indicate that this
      is an anonymous or file page.

      In most places, however, we first configure the page metadata and then
      insert entries into the page table.  Page table check will use the
      information from struct page to verify the type of entry being
      inserted.

      Change the order in all places to first update struct page and only
      then update the page table.
      
      This means that we first do calls that may change the type of page (anon
      or file):
      
      	page_move_anon_rmap
      	page_add_anon_rmap
      	do_page_add_anon_rmap
      	page_add_new_anon_rmap
      	page_add_file_rmap
      	hugepage_add_anon_rmap
      	hugepage_add_new_anon_rmap
      
      And after that do calls that add entries to the page table:
      
      	set_huge_pte_at
      	set_pte_at
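
      Put together, a fault path then follows this shape (a heavily
      simplified sketch):

       /* 1) mark the struct page as anonymous ... */
       page_add_new_anon_rmap(page, vma, addr, false);
       lru_cache_add_inactive_or_unevictable(page, vma);
       /* 2) ... and only then install the pte */
       set_pte_at(vma->vm_mm, addr, pte, entry);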
      
      Link: https://lkml.kernel.org/r/20211221154650.1047963-1-pasha.tatashin@soleen.com
      Link: https://lkml.kernel.org/r/20211221154650.1047963-2-pasha.tatashin@soleen.com
      Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Will Deacon <will@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • docs/vm: add vmalloced-kernel-stacks document · 4b8fec28
      Shuah Khan authored
      Add a new document to explain Virtually Mapped Kernel Stack Support.
      This is a compilation of information from the code and original patch
      series that introduced the Virtually Mapped Kernel Stacks feature.
      
      This document summarizes the feature and provides details on
      allocation, freeing, and stack overflow handling.  It also provides
      references to the available tests.
      
      Link: https://lkml.kernel.org/r/20211215002004.47981-1-skhan@linuxfoundation.org
      Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/oom_kill: allow process_mrelease to run under mmap_lock protection · ba535c1c
      Suren Baghdasaryan authored
      With exit_mmap holding mmap_write_lock during the free_pgtables call,
      process_mrelease does not need to elevate mm->mm_users in order to
      prevent exit_mmap from destroying page tables while __oom_reap_task_mm
      is walking the VMA tree.  The change prevents process_mrelease from
      calling the last mmput, which can lead to waiting for IO completion in
      exit_aio.
      
      Link: https://lkml.kernel.org/r/20211209191325.3069345-3-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Jan Engelhardt <jengelh@inai.de>
      Cc: Jann Horn <jannh@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: document locking restrictions for vm_operations_struct::close · cc6dcfee
      Suren Baghdasaryan authored
      Add comments for vm_operations_struct::close documenting locking
      requirements for this callback and its callers.
      
      Link: https://lkml.kernel.org/r/20211209191325.3069345-2-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Jan Engelhardt <jengelh@inai.de>
      Cc: Jann Horn <jannh@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: protect free_pgtables with mmap_lock write lock in exit_mmap · 64591e86
      Suren Baghdasaryan authored
      The oom-reaper and the process_mrelease system call should protect
      against races with exit_mmap, which can destroy page tables while they
      walk the VMA tree.  The oom-reaper protects against that race by
      setting MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP
      before taking and releasing mmap_write_lock.  process_mrelease has to
      elevate mm->mm_users to prevent such a race.
      
      Both oom-reaper and process_mrelease hold mmap_read_lock when walking
      the VMA tree.  The locking rules and mechanisms could be simpler if
      exit_mmap takes mmap_write_lock while executing destructive operations
      such as free_pgtables.
      
      Change exit_mmap to hold the mmap_write_lock when calling unlock_range,
      free_pgtables and remove_vma.  Note also that because the oom-reaper
      checks the VM_LOCKED flag, unlock_range() should not be allowed to
      race with it.

      Before this patch, remove_vma used to be called with no locks held;
      however, with fput being executed asynchronously and vm_ops->close not
      being allowed to hold mmap_lock (it is called from __split_vma with
      mmap_sem held for write), changing that should be fine.
      
      In most cases this lock should be uncontended.  Previously, Kirill
      reported ~4% regression caused by a similar change [1].  We reran the
      same test and although the individual results are quite noisy, the
      percentiles show lower regression with 1.6% being the worst case [2].
      The change allows oom-reaper and process_mrelease to execute safely
      under mmap_read_lock without worries that exit_mmap might destroy page
      tables from under them.
      
      [1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@node.shutemov.name/
      [2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/
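
      After the change, the tail of exit_mmap() has roughly this shape (a
      heavily simplified sketch):

       mmap_write_lock(mm);
       unlock_range(mm->mmap, ULONG_MAX);      /* munlock VM_LOCKED vmas */
       free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
       while (vma)                             /* release the vmas */
               vma = remove_vma(vma);
       mmap_write_unlock(mm);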
      
      Link: https://lkml.kernel.org/r/20211209191325.3069345-1-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Jan Engelhardt <jengelh@inai.de>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move tlb_flush_pending inline helpers to mm_inline.h · 36090def
      Arnd Bergmann authored
      linux/mm_types.h should only contain structure definitions, to make it
      cheap to include elsewhere.  The atomic_t helper function definitions
      are particularly large, so it's better to move the helpers that use
      them into the existing linux/mm_inline.h and only include that where
      needed.
      
      As a follow-up, we may want to go through all the indirect includes in
      mm_types.h and reduce them as much as possible.
      
      Link: https://lkml.kernel.org/r/20211207125710.2503446-2-arnd@kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Colin Cross <ccross@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move anon_vma declarations to linux/mm_inline.h · 17fca131
      Arnd Bergmann authored
      The patch to add anonymous vma names causes a build failure in some
      configurations:
      
        include/linux/mm_types.h: In function 'is_same_vma_anon_name':
        include/linux/mm_types.h:924:37: error: implicit declaration of function 'strcmp' [-Werror=implicit-function-declaration]
          924 |         return name && vma_name && !strcmp(name, vma_name);
              |                                     ^~~~~~
        include/linux/mm_types.h:22:1: note: 'strcmp' is defined in header '<string.h>'; did you forget to '#include <string.h>'?
      
      This should not really be part of linux/mm_types.h in the first place,
      as that header is meant to contain only structure definitions and to
      need a minimal set of indirect includes itself.
      
      While the header clearly includes more than it should at this point,
      let's not make it worse by including string.h as well, which would pull
      in the expensive (compile-speed wise) fortify-string logic.
      
      Move the new functions into a separate header that only needs to be
      included in a couple of locations.
      
      Link: https://lkml.kernel.org/r/20211207125710.2503446-1-arnd@kernel.org
      Fixes: "mm: add a field to store names for private anonymous memory"
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Colin Cross <ccross@google.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17fca131
    • mm: add anonymous vma name refcounting · 78db3412
      Suren Baghdasaryan authored
      While forking a process with a high number (64k) of named anonymous
      vmas, the overhead caused by strdup() is noticeable.  Experiments with
      an ARM64 Android device show up to a 40% performance regression when
      forking a process with 64k unpopulated anonymous vmas using the maximum
      name length, versus the same process whose anonymous vmas have no
      names.
      
      Introduce a refcounted anon_vma_name structure to avoid the overhead of
      copying vma names during fork() and when splitting named anonymous vmas.
      
      When a vma is duplicated, instead of copying the name we increment the
      refcount of this structure.  Multiple vmas can point to the same
      anon_vma_name as long as they hold a reference.  The name member of the
      anon_vma_name structure is assigned at allocation time and is never
      changed.  If a vma's name changes, the refcount of the original
      structure is dropped, a new anon_vma_name structure is allocated to
      hold the new name, and the vma pointer is updated to point to the new
      structure.
      
      With this approach the fork() performance regression is reduced 3-4x,
      and with use cases using a more reasonable number of VMAs (a few
      thousand) the regression is not measurable.
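
      A minimal sketch of the structure and the duplication path described
      above (simplified; the exact kernel code differs in details):

         struct anon_vma_name {
                 struct kref kref;
                 char name[];    /* assigned at allocation, never changed */
         };

         /* fork()/split path: take a reference instead of strdup()ing */
         static void anon_vma_name_get(struct anon_vma_name *anon_name)
         {
                 if (anon_name)
                         kref_get(&anon_name->kref);
         }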
      
      Link: https://lkml.kernel.org/r/20211019215511.3771969-3-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Colin Cross <ccross@google.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Glauber <jan.glauber@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      78db3412
    • mm: add a field to store names for private anonymous memory · 9a10064f
      Colin Cross authored
      In many userspace applications, and especially in VM-based applications
      of the kind Android uses heavily, there are multiple different
      allocators in use.  At a minimum there is libc malloc and the stack,
      and in many cases
      there are libc malloc, the stack, direct syscalls to mmap anonymous
      memory, and multiple VM heaps (one for small objects, one for big
      objects, etc.).  Each of these layers usually has its own tools to
      inspect its usage: malloc by compiling a debug version, the VM heaps
      through heap inspection tools, while for direct syscalls there is
      usually no way to track them.
      
      On Android we heavily use a set of tools that use an extended version of
      the logic covered in Documentation/vm/pagemap.txt to walk all pages
      mapped in userspace and slice their usage by process, shared (COW) vs.
      unique mappings, backing, etc.  This can account for real physical
      memory usage even in cases like fork without exec (which Android uses
      heavily to share as many private COW pages as possible between
      processes), Kernel SamePage Merging, and clean zero pages.  It produces
      a measurement of the pages that only exist in that process (USS, for
      unique), and a measurement of the physical memory usage of that process
      with the cost of shared pages being evenly split between processes that
      share them (PSS).
      
      If all anonymous memory is indistinguishable then figuring out the real
      physical memory usage (PSS) of each heap requires either a pagemap
      walking tool that can understand the heap debugging of every layer, or
      for every layer's heap debugging tools to implement the pagemap walking
      logic, in which case it is hard to get a consistent view of memory
      across the whole system.
      
      Tracking the information in userspace leads to all sorts of problems.
      It either needs to be stored inside the process, which means every
      process has to have an API to export its current heap information upon
      request, or it has to be stored externally in a filesystem that somebody
      needs to clean up on crashes.  It needs to be readable while the process
      is still running, so it has to have some sort of synchronization with
      every layer of userspace.  Efficiently tracking the ranges requires
      reimplementing something like the kernel vma trees, and linking to it
      from every layer of userspace.  It requires more memory, more syscalls,
      more runtime cost, and more complexity to separately track regions that
      the kernel is already tracking.
      
      This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
      userspace-provided name for anonymous vmas.  The names of named
      anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
      [anon:<name>].
      
      Userspace can set the name for a region of memory by calling
      
         prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
      
      Setting the name to NULL clears it.  The name length limit is 80 bytes
      including the NUL terminator, and the name is checked to contain only
      printable ASCII characters (including space), except '[', ']', '\',
      '$' and '`'.
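
      A minimal userspace example (the mapping size and name are arbitrary;
      the PR_SET_VMA constants are provided as a fallback in case the
      installed headers predate this feature):

         #include <stdio.h>
         #include <sys/mman.h>
         #include <sys/prctl.h>

         #ifndef PR_SET_VMA
         #define PR_SET_VMA              0x53564d41
         #define PR_SET_VMA_ANON_NAME    0
         #endif

         int main(void)
         {
                 size_t len = 1 << 20;
                 void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                 if (p == MAP_FAILED)
                         return 1;

                 /* shows up as [anon:my allocator] in /proc/self/maps */
                 if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                           (unsigned long)p, len,
                           (unsigned long)"my allocator"))
                         perror("prctl");

                 return 0;
         }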
      
      ASCII strings are used to provide descriptive identifiers for vmas that
      can be understood by users reading /proc/pid/maps or /proc/pid/smaps.
      Names can be standardized for a given system, and they can include
      variable parts such as the name of the allocator or a library, the tid
      of the thread using it, etc.
      
      The name is stored in a pointer in the shared union in vm_area_struct
      that points to a null-terminated string.  Anonymous vmas with the same
      name (equivalent strings) that are otherwise mergeable will be merged.
      The name pointers are not shared between vmas even if they contain the
      same name.  The name pointer is stored in a union with fields that are
      only used on file-backed mappings, so it does not increase memory usage.
      
      A CONFIG_ANON_VMA_NAME kernel configuration option is introduced to
      enable this feature.  It keeps the feature disabled by default, to
      avoid any additional memory overhead and to avoid confusing procfs
      parsers on systems which are not ready to support named anonymous vmas.
      
      The patch is based on the original patch developed by Colin Cross, more
      specifically on its latest version [1] posted upstream by Sumit Semwal.
      That version used a userspace pointer to store vma names, and name
      pointers could be shared between vmas.  However, during the last
      upstreaming attempt, Kees Cook raised concerns [2] about this approach
      and suggested copying the name into kernel memory, performing validity
      checks [3], and storing it as a string referenced from vm_area_struct.
      
      One big concern is fork() performance, which would need to strdup
      anonymous vma names.  Dave Hansen suggested experimenting with the
      worst-case scenario of forking a process with 64k vmas having the
      longest possible names [4].  I ran this experiment on an ARM64 Android
      device and recorded a worst-case regression of almost 40% when forking
      such a process.
      
      This regression is addressed in the followup patch which replaces the
      pointer to a name with a refcounted structure that allows sharing the
      name pointer between vmas of the same name.  Instead of duplicating the
      string during fork() or when splitting a vma it increments the refcount.
      
      [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
      [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
      [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
      [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
      
      Changes for prctl(2) manual page (in the options section):
      
      PR_SET_VMA
      	Sets an attribute specified in arg2 for virtual memory areas
      	starting from the address specified in arg3 and spanning the
      	size specified in arg4. arg5 specifies the value of the attribute
      	to be set. Note that assigning an attribute to a virtual memory
      	area might prevent it from being merged with adjacent virtual
      	memory areas due to the difference in that attribute's value.
      
      	Currently, arg2 must be one of:
      
      	PR_SET_VMA_ANON_NAME
      		Set a name for anonymous virtual memory areas. arg5 should
      		be a pointer to a null-terminated string containing the
      		name. The name length including the null byte cannot exceed
      		80 bytes. If arg5 is NULL, the name of the appropriate
      		anonymous virtual memory areas will be reset. The name
      		can contain only printable ascii characters (including
      		space), except '[', ']', '\', '$' and '`'.

      		This feature is available only if the kernel is built with
      		the CONFIG_ANON_VMA_NAME option enabled.
      
      [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
        Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
      [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
       added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
       work here was done by Colin Cross, therefore, with his permission, keeping
       him as the author]
      
      Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
      Signed-off-by: Colin Cross <ccross@google.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Glauber <jan.glauber@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a10064f
    • mm: rearrange madvise code to allow for reuse · ac1e9acc
      Colin Cross authored
      Patch series "mm: rearrange madvise code to allow for reuse", v11.
      
      Avoid a performance regression of the new anon vma name field by
      refcounting it.
      
      I checked the image sizes with allnoconfig builds:
      
        unpatched Linus' ToT
           text    data     bss     dec     hex filename
        1324759      32   73928 1398719 1557bf vmlinux
      
        After the first patch is applied (madvise refactoring)
           text    data     bss     dec     hex filename
        1322346      32   73928 1396306 154e52 vmlinux
        >>> 2413 bytes decrease vs ToT <<<
      
        After all patches applied with CONFIG_ANON_VMA_NAME=n
           text    data     bss     dec     hex filename
        1322337      32   73928 1396297 154e49 vmlinux
        >>> 2422 bytes decrease vs ToT <<<
      
        After all patches applied with CONFIG_ANON_VMA_NAME=y
           text    data     bss     dec     hex filename
        1325228      32   73928 1399188 155994 vmlinux
        >>> 469 bytes increase vs ToT <<<
      
      This patch (of 3):
      
      Refactor the madvise syscall to allow for parts of it to be reused by a
      prctl syscall that affects vmas.
      
      Move the code that walks vmas in a virtual address range into a function
      that takes a function pointer as a parameter.  The only caller for now
      is sys_madvise, which uses it to call madvise_vma_behavior on each vma,
      but the next patch will add an additional caller.
      
      Move handling all vma behaviors inside madvise_behavior, and rename it
      to madvise_vma_behavior.
      
      Move the code that updates the flags on a vma, including splitting or
      merging the vma as necessary, into a new function called
      madvise_update_vma.  The next patch will add support for updating a new
      anon_name field as well.
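
      The resulting walker has roughly this shape (simplified sketch; hole
      handling and locking are omitted):

         static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
                      unsigned long end, unsigned long arg,
                      int (*visit)(struct vm_area_struct *vma,
                                   struct vm_area_struct **prev,
                                   unsigned long start, unsigned long end,
                                   unsigned long arg))
         {
                 struct vm_area_struct *vma, *prev;
                 int error;

                 vma = find_vma_prev(mm, start, &prev);
                 for (;;) {
                         /* clamp to this vma, apply the behavior callback;
                          * sys_madvise() passes madvise_vma_behavior() */
                         unsigned long tmp = min(vma->vm_end, end);

                         error = visit(vma, &prev, start, tmp, arg);
                         if (error || tmp >= end)
                                 return error;
                         start = tmp;
                         vma = prev ? prev->vm_next : mm->mmap;
                 }
         }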
      
      Link: https://lkml.kernel.org/r/20211019215511.3771969-1-surenb@google.com
      Signed-off-by: Colin Cross <ccross@google.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jan Glauber <jan.glauber@gmail.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac1e9acc
    • mm: remove redundant check about FAULT_FLAG_ALLOW_RETRY bit · 36ef159f
      Qi Zheng authored
      Since commit 4064b982 ("mm: allow VM_FAULT_RETRY for multiple times"),
      the FAULT_FLAG_ALLOW_RETRY bit of the fault flags is not changed in the
      page fault path, so the following check is no longer needed:
      
      	flags & FAULT_FLAG_ALLOW_RETRY
      
      So just remove it.
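
      For an arch fault handler, the simplification looks like this
      (illustrative pattern, not a quote from a specific file):

         /* before: the inner test is always true since 4064b982 */
         if (unlikely(fault & VM_FAULT_RETRY)) {
                 if (flags & FAULT_FLAG_ALLOW_RETRY) {
                         flags |= FAULT_FLAG_TRIED;
                         goto retry;
                 }
         }

         /* after */
         if (unlikely(fault & VM_FAULT_RETRY)) {
                 flags |= FAULT_FLAG_TRIED;
                 goto retry;
         }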
      
      [akpm@linux-foundation.org: coding style fixes]
      
      Link: https://lkml.kernel.org/r/20211110123358.36511-1-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36ef159f
    • tools/testing/selftests/vm/userfaultfd.c: use swap() to make code cleaner · 2c769ed7
      chiminghao authored
      Fix the following coccicheck REVIEW:
      
       tools/testing/selftests/vm/userfaultfd.c:1531:21-22: use swap() to make code cleaner
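
      The change is of this form (variable names are illustrative of the
      pattern rather than quoted from that exact line):

         /* before: open-coded three-step swap */
         tmp_area = area_src;
         area_src = area_dst;
         area_dst = tmp_area;

         /* after: intent is explicit and the temporary disappears */
         swap(area_src, area_dst);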
      
      Link: https://lkml.kernel.org/r/20211124031632.35317-1-chi.minghao@zte.com.cn
      Signed-off-by: chiminghao <chi.minghao@zte.com.cn>
      Reported-by: Zeal Robot <zealci@zte.com.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c769ed7
    • memcg: add per-memcg vmalloc stat · 4e5aa1f4
      Shakeel Butt authored
      The kvmalloc* allocation functions can fall back to vmalloc
      allocations, and they do so more often on long-running machines.  In
      addition, the kernel does have __GFP_ACCOUNT kvmalloc* calls.  So,
      often on long-running machines, memory.stat does not give a complete
      picture of which types of memory are charged to the memcg.  So add a
      per-memcg vmalloc stat.
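
      The accounting itself is small; roughly (simplified sketch of the
      vmalloc-side hunk, charging page by page as the notes below mention):

         /* __vmalloc_area_node() (sketch): account each backing page
          * to the new MEMCG_VMALLOC counter of its memcg */
         for (i = 0; i < area->nr_pages; i++)
                 mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1);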
      
      [shakeelb@google.com: page_memcg() within rcu lock, per Muchun]
        Link: https://lkml.kernel.org/r/20211222052457.1960701-1-shakeelb@google.com
      [akpm@linux-foundation.org: remove cast, per Muchun]
      [shakeelb@google.com: remove area->page[0] checks and move to page by page accounting per Michal]
        Link: https://lkml.kernel.org/r/20220104222341.3972772-1-shakeelb@google.com
      
      Link: https://lkml.kernel.org/r/20211221215336.1922823-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e5aa1f4
    • mm/memcg: use struct_size() helper in kzalloc() · 06b2c3b0
      Wang Weiyang authored
      Make use of the struct_size() helper instead of an open-coded version,
      in order to avoid any potential type mistakes or integer overflows that,
      in the worst scenario, could lead to heap overflows.
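
      The pattern being replaced looks like this (struct and field names here
      are hypothetical):

         struct flex {
                 size_t count;
                 struct item entries[];
         };

         /* before: open-coded size, the multiplication can overflow */
         f = kzalloc(sizeof(*f) + count * sizeof(struct item), GFP_KERNEL);

         /* after: struct_size() saturates on overflow, so the allocation
          * fails cleanly instead of being undersized */
         f = kzalloc(struct_size(f, entries, count), GFP_KERNEL);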
      
      Link: https://github.com/KSPP/linux/issues/160
      Link: https://lkml.kernel.org/r/20211216022024.127375-1-wangweiyang2@huawei.com
      Signed-off-by: Wang Weiyang <wangweiyang2@huawei.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06b2c3b0
    • memcg: better bounds on the memcg stats updates · 5b3be698
      Shakeel Butt authored
      Commit 11192d9c ("memcg: flush stats only if updated") added
      tracking of memcg stats updates which is used by the readers to flush
      only if the updates are over a certain threshold.  However each
      individual update can correspond to a large value change for a given
      stat.  For example adding or removing a hugepage to an LRU changes the
      stat by thp_nr_pages (512 on x86_64).
      
      Treating a THP-related update as a single event can, in theory, leave
      the stat off by (thp_nr_pages * nr_cpus * CHARGE_BATCH) before a flush.
      
      To handle such scenarios, this patch also takes the magnitude of the
      stat update into account, instead of just the update event.  In
      addition, let the async flusher unconditionally flush the stats,
      putting a time limit on the stats skew so that, hopefully, far fewer
      readers need to flush.
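
      In sketch form, the update tracking now weighs each update by its
      magnitude rather than counting events (simplified; names follow the
      earlier commit 11192d9c):

         static void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
         {
                 unsigned int x;

                 /* a THP update now weighs ~512 page-sized events */
                 x = __this_cpu_add_return(stats_updates, abs(val));
                 if (x > MEMCG_CHARGE_BATCH) {
                         atomic_add(x / MEMCG_CHARGE_BATCH,
                                    &stats_flush_threshold);
                         __this_cpu_write(stats_updates, 0);
                 }
         }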
      
      Link: https://lkml.kernel.org/r/20211118065350.697046-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Michal Koutný" <mkoutny@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b3be698
    • mm/memcg: add oom_group_kill memory event · b6bf9abb
      Dan Schatzberg authored
      Our container agent wants to know, when a container exits, whether it
      was OOM killed, in order to report that to the user.  We use
      memory.oom.group = 1 to ensure that an OOM kill within the container's
      cgroup kills everything.  The existing memory.events are insufficient
      for knowing whether this triggered:
      
      1) Our current approach reads memory.events oom_kill and reports the
         container was killed if the value is non-zero.  This is erroneous in
         some cases where containers create their child cgroups with
         memory.oom.group=1, as such OOM kills get counted against the
         container cgroup's oom_kill counter despite not actually OOM killing
         the entire container.
      
      2) Reading memory.events.local will fail to identify OOM kills in leaf
         cgroups (that don't set memory.oom.group) within the container
         cgroup.
      
      This patch adds a new oom_group_kill event when memory.oom.group
      triggers to allow userspace to cleanly identify when an entire cgroup is
      oom killed.
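
      With the new event, the agent's check reduces to scanning
      memory.events (userspace sketch; the helper name and error handling
      are illustrative):

         #include <stdio.h>

         /* returns 1 if the cgroup whose memory.events file is at
          * @path was killed as a whole via memory.oom.group */
         static int was_oom_group_killed(const char *path)
         {
                 char line[256];
                 unsigned long long count = 0;
                 FILE *f = fopen(path, "r");

                 if (!f)
                         return -1;
                 while (fgets(line, sizeof(line), f))
                         if (sscanf(line, "oom_group_kill %llu", &count) == 1)
                                 break;
                 fclose(f);
                 return count > 0;
         }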
      
      [schatzberg.dan@gmail.com: changes from Johannes and Chris]
        Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com
      
      Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.com
      Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Chris Down <chris@chrisdown.name>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6bf9abb
    • mm/page_counter: remove an incorrect call to propagate_protected_usage() · 46a53371
      Donghai Qiao authored
      propagate_protected_usage() is called to propagate a usage change in
      the page_counter structure.  But page_counter_try_charge() calls it at
      a point where there is actually no usage change, so that call should be
      removed.
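
      Concretely, in page_counter_try_charge()'s failure path (simplified
      sketch):

         new = atomic_long_add_return(nr_pages, &counter->usage);
         if (new > counter->max) {
                 atomic_long_sub(nr_pages, &counter->usage);
                 /*
                  * usage is back to its old value here, so the
                  * propagate_protected_usage(counter, new) call that
                  * used to follow conveyed no change and is removed.
                  */
                 goto failed;
         }
         propagate_protected_usage(counter, new);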
      
      Link: https://lkml.kernel.org/r/20211118181125.3918222-1-dqiao@redhat.com
      Signed-off-by: Donghai Qiao <dqiao@redhat.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46a53371