1. 25 Oct, 2023 16 commits
    • gfs2: convert gfs2_getbuf() to folios · 0eb75179
      Matthew Wilcox (Oracle) authored
      Remove several folio->page->folio conversions.  Also use __GFP_NOFAIL
      instead of calling yield(), and use the new get_nth_bh().
      
      Link: https://lkml.kernel.org/r/20231016201114.1928083-8-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0eb75179
    • gfs2: convert inode unstuffing to use a folio · 81cb277e
      Matthew Wilcox (Oracle) authored
      Use the folio APIs, removing numerous hidden calls to compound_head(). 
      Also remove the stale comment about the page being looked up if it's NULL.
      
      Link: https://lkml.kernel.org/r/20231016201114.1928083-7-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      81cb277e
    • buffer: add get_nth_bh() · 0217fbb0
      Matthew Wilcox (Oracle) authored
      Extract this useful helper from nilfs_page_get_nth_block().
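
      A minimal sketch of what such a helper plausibly looks like once lifted
      out of nilfs2 (the exact signature and header placement are assumptions,
      not taken from the patch itself):

          /* Sketch: walk the circular buffer_head list hanging off a folio and
           * return the count'th buffer, taking an extra reference on it. */
          static inline struct buffer_head *get_nth_bh(struct buffer_head *bh,
                                                       unsigned int count)
          {
                  while (count--)
                          bh = bh->b_this_page;
                  get_bh(bh);
                  return bh;
          }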
      
      Link: https://lkml.kernel.org/r/20231016201114.1928083-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0217fbb0
    • ext4: convert to folio_create_empty_buffers · d4059993
      Matthew Wilcox (Oracle) authored
      Remove an unnecessary folio->page->folio conversion and take advantage of
      the new return value from folio_create_empty_buffers().
      
      Link: https://lkml.kernel.org/r/20231016201114.1928083-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d4059993
    • mpage: convert map_buffer_to_folio() to folio_create_empty_buffers() · 4f05f139
      Matthew Wilcox (Oracle) authored
      Saves a folio->page->folio conversion.
      
      Link: https://lkml.kernel.org/r/20231016201114.1928083-4-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4f05f139
    • buffer: make folio_create_empty_buffers() return a buffer_head · 3decb856
      Matthew Wilcox (Oracle) authored
      Patch series "Finish the create_empty_buffers() transition", v2.
      
      Pankaj recently added folio_create_empty_buffers() as the folio equivalent
      to create_empty_buffers().  This patch set finishes the conversion by
      first converting all remaining filesystems to call
      folio_create_empty_buffers(), then renaming it back to
      create_empty_buffers().  I took the opportunity to make a few
      simplifications like making folio_create_empty_buffers() return the head
      buffer and extracting get_nth_bh() from nilfs2.
      
      A few of the patches in this series aren't directly related to
      create_empty_buffers(), but I saw them while I was working on this and
      thought they'd be easy enough to add to this series.  Compile-tested only,
      other than ext4.
      
      
      This patch (of 26):
      
      Almost all callers want to know the first BH that was allocated for this
      folio.  We already have that handy, so return it.
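
      A hedged usage sketch of what callers can now do, assuming the updated
      function hands back the head buffer (the blocksize and b_state arguments
      are shown for illustration only):

          struct buffer_head *head;

          /* Before: create the buffers, then look them up again. */
          folio_create_empty_buffers(folio, blocksize, 0);
          head = folio_buffers(folio);

          /* After: the head buffer is returned directly. */
          head = folio_create_empty_buffers(folio, blocksize, 0);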
      
      Link: https://lkml.kernel.org/r/20231016201114.1928083-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20231016201114.1928083-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3decb856
    • hugetlb_vmemmap: use folio argument for hugetlb_vmemmap_* functions · c5ad3233
      Usama Arif authored
      Most function calls in hugetlb.c are made with folio arguments.  This
      brings the hugetlb_vmemmap calls in line with them by using a folio
      instead of the head struct page.  The head struct page is still needed
      within these functions.
      
      The set/clear/test functions for hugepages are also changed to folio
      versions.
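
      Illustrative (assumed) prototypes for the folio-taking variants; only the
      naming pattern comes from the description above, the exact signatures are
      a guess:

          /* Assumed folio-based entry points; the head page is derived inside. */
          void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio);
          int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio);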
      
      Link: https://lkml.kernel.org/r/20231011144557.1720481-2-usama.arif@bytedance.com
      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c5ad3233
    • hugetlb: batch TLB flushes when restoring vmemmap · c24f188b
      Mike Kravetz authored
      Update the internal hugetlb restore vmemmap code path such that TLB
      flushing can be batched.  Use the existing mechanism of passing the
      VMEMMAP_REMAP_NO_TLB_FLUSH flag to indicate flushing should not be
      performed for individual pages.  The routine
      hugetlb_vmemmap_restore_folios is the only user of this new mechanism, and
      it will perform a global flush after all vmemmap is restored.
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-9-mike.kravetz@oracle.com
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c24f188b
    • hugetlb: batch TLB flushes when freeing vmemmap · f13b83fd
      Joao Martins authored
      Now that a list of pages is deduplicated at once, the TLB flush can be
      batched for all vmemmap pages that got remapped.
      
      Expand the flags field value to pass whether to skip the TLB flush on
      remap of the PTE.
      
      The TLB flush is global because we have no guarantee from the caller that
      the set of folios is contiguous, and to avoid the complexity of composing
      a list of kVAs to flush.

      Modified by Mike Kravetz to perform a TLB flush on a single folio if an
      error is encountered.
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-8-mike.kravetz@oracle.com
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f13b83fd
    • hugetlb: batch PMD split for bulk vmemmap dedup · f4b7e3ef
      Joao Martins authored
      In an effort to minimize the number of TLB flushes, batch all PMD splits
      belonging to a range of pages in order to perform only one (global) TLB
      flush.

      Add a flags field to the walker and pass in whether it's a bulk allocation
      or just a single page, to decide whether to remap.  The first value
      (VMEMMAP_SPLIT_NO_TLB_FLUSH) designates a request not to do the TLB flush
      when we split the PMD.
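
      A hedged sketch of what the walker flags plausibly look like; only the
      VMEMMAP_SPLIT_NO_TLB_FLUSH name comes from this changelog (its remap
      counterpart appears in the other patches of this series), and the bit
      values are assumed:

          /* Assumed flag definitions controlling TLB flushing in the vmemmap
           * remap walker; names per the changelogs, values illustrative. */
          #define VMEMMAP_SPLIT_NO_TLB_FLUSH    BIT(0)  /* skip flush after PMD split */
          #define VMEMMAP_REMAP_NO_TLB_FLUSH    BIT(1)  /* skip flush after PTE remap */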
      
      Rebased and updated by Mike Kravetz
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-7-mike.kravetz@oracle.com
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f4b7e3ef
    • hugetlb: batch freeing of vmemmap pages · 91f386bf
      Mike Kravetz authored
      Now that batching of hugetlb vmemmap optimization processing is possible,
      batch the freeing of vmemmap pages.  When freeing vmemmap pages for a
      hugetlb page, we add them to a list that is freed after the entire batch
      has been processed.
      
      This enhances the ability to return contiguous ranges of memory to the low
      level allocators.
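
      A hedged sketch of how the batched freeing might look; the helper names
      are assumed rather than quoted from the patch:

          /* Sketch: free every vmemmap page accumulated on the batch list.
           * free_vmemmap_page() is assumed to hand the page back to the buddy
           * or bootmem allocator as appropriate. */
          static void free_vmemmap_page_list(struct list_head *list)
          {
                  struct page *page, *next;

                  list_for_each_entry_safe(page, next, list, lru) {
                          list_del(&page->lru);
                          free_vmemmap_page(page);
                  }
          }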
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-6-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      91f386bf
    • hugetlb: perform vmemmap restoration on a list of pages · cfb8c750
      Mike Kravetz authored
      The routine update_and_free_pages_bulk already performs vmemmap
      restoration on the list of hugetlb pages in a separate step.  In
      preparation for more functionality to be added in this step, create a new
      routine hugetlb_vmemmap_restore_folios() that will restore vmemmap for a
      list of folios.
      
      This new routine must provide sufficient feedback about errors and actual
      restoration performed so that update_and_free_pages_bulk can perform
      optimally.
      
      Special care must be taken when encountering an error from
      hugetlb_vmemmap_restore_folios.  We want to continue making as much
      forward progress as possible.  A new routine bulk_vmemmap_restore_error
      handles this specific situation.
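
      A hedged sketch of the interface this description implies; the parameter
      list and return convention are assumptions, not taken from the patch:

          /*
           * Assumed prototype: restore vmemmap for each folio on @folio_list,
           * returning how many were restored or a negative error, so that
           * update_and_free_pages_bulk() can decide how to proceed.
           */
          long hugetlb_vmemmap_restore_folios(const struct hstate *h,
                                              struct list_head *folio_list);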
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-5-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cfb8c750
    • hugetlb: perform vmemmap optimization on a list of pages · 79359d6d
      Mike Kravetz authored
      When adding hugetlb pages to the pool, we first create a list of the
      allocated pages before adding to the pool.  Pass this list of pages to a
      new routine hugetlb_vmemmap_optimize_folios() for vmemmap optimization.
      
      Due to significant differences in vmemmap initialization for bootmem
      allocated hugetlb pages, a new routine prep_and_add_bootmem_folios is
      created.
      
      We also modify the routine vmemmap_should_optimize() to check for pages
      that are already optimized.  There are code paths that might request
      vmemmap optimization twice and we want to make sure this is not attempted.
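
      The already-optimized check might be sketched along these lines (the flag
      test helper name is assumed):

          /* Sketch: never attempt to optimize the same hugetlb page twice. */
          static bool vmemmap_should_optimize(const struct hstate *h,
                                              const struct page *head)
          {
                  if (HPageVmemmapOptimized(head))
                          return false;
                  /* ... existing eligibility checks ... */
                  return true;
          }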
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-4-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      79359d6d
    • hugetlb: restructure pool allocations · d67e32f2
      Mike Kravetz authored
      Allocation of a hugetlb page for the hugetlb pool is done by the routine
      alloc_pool_huge_page.  This routine will allocate contiguous pages from a
      low level allocator, prep the pages for usage as a hugetlb page and then
      add the resulting hugetlb page to the pool.
      
      In the 'prep' stage, optional vmemmap optimization is done.  For
      performance reasons we want to perform vmemmap optimization on multiple
      hugetlb pages at once.  To do this, restructure the hugetlb pool
      allocation code such that vmemmap optimization can be isolated and later
      batched.
      
      The code to allocate hugetlb pages from bootmem was also modified to
      allow batching.
      
      No functional changes, only code restructure.
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-3-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d67e32f2
    • hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles · d2cf88c2
      Mike Kravetz authored
      Patch series "Batch hugetlb vmemmap modification operations", v8.
      
      When hugetlb vmemmap optimization was introduced, the overhead of enabling
      the option was measured as described in commit 426e5c42 [1].  The
      summary states that allocating a hugetlb page should be ~2x slower with
      optimization and freeing a hugetlb page should be ~2-3x slower.  Such
      overhead was deemed an acceptable trade off for the memory savings
      obtained by freeing vmemmap pages.
      
      It was recently reported that the overhead associated with enabling
      vmemmap optimization could be as high as 190x for hugetlb page
      allocations.  Yes, 190x!  Some actual numbers from other environments are:
      
      Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
      ------------------------------------------------
      Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
      time echo 500000 > .../hugepages-2048kB/nr_hugepages
      real    0m4.119s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m4.477s
      
      Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
      time echo 500000 > .../hugepages-2048kB/nr_hugepages
      real    0m28.973s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m36.748s
      
      VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
      -----------------------------------------------------------
      Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
      time echo 524288 > .../hugepages-2048kB/nr_hugepages
      real    0m2.463s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m2.931s
      
      Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
      time echo 524288 > .../hugepages-2048kB/nr_hugepages
      real    2m27.609s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    2m29.924s
      
      In the VM environment, the slowdown of enabling hugetlb vmemmap optimization
      resulted in allocation times being 61x slower.
      
      A quick profile showed that the vast majority of this overhead was due to
      TLB flushing.  Each time we modify the kernel pagetable we need to flush
      the TLB.  For each hugetlb that is optimized, there could be potentially
      two TLB flushes performed.  One for the vmemmap pages associated with the
      hugetlb page, and potentially another one if the vmemmap pages are mapped
      at the PMD level and must be split.  The TLB flushes required for the
      kernel pagetable result in a broadcast IPI with each CPU having to flush
      a range of pages, or do a global flush if a threshold is exceeded.  So,
      the flush time increases with the number of CPUs.  In addition, in virtual
      environments the broadcast IPI can’t be accelerated by hypervisor
      hardware and leads to traps that need to wakeup/IPI all vCPUs which is
      very expensive.  Because of this, the slowdown in virtual environments is
      even worse than on bare metal as the number of vCPUs/CPUs is increased.
      
      The following series attempts to reduce amount of time spent in TLB
      flushing.  The idea is to batch the vmemmap modification operations for
      multiple hugetlb pages.  Instead of doing one or two TLB flushes for each
      page, we do two TLB flushes for each batch of pages.  One flush after
      splitting pages mapped at the PMD level, and another after remapping
      vmemmap associated with all hugetlb pages.  Results of such batching are
      as follows:
      
      Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
      ------------------------------------------------
      next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
      time echo 500000 > .../hugepages-2048kB/nr_hugepages
      real    0m4.719s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m4.245s
      
      next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
      time echo 500000 > .../hugepages-2048kB/nr_hugepages
      real    0m7.267s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m13.199s
      
      VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
      -----------------------------------------------------------
      next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
      time echo 524288 > .../hugepages-2048kB/nr_hugepages
      real    0m2.715s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m3.186s
      
      next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
      time echo 524288 > .../hugepages-2048kB/nr_hugepages
      real    0m4.799s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m5.273s
      
      With batching, results are back in the 2-3x slowdown range.
      
      
      This patch (of 8):
      
      update_and_free_pages_bulk is designed to free a list of hugetlb pages
      back to their associated lower level allocators.  This may require
      allocating vmemmap pages associated with each hugetlb page.  The hugetlb
      page destructor must be changed before pages are freed to lower level
      allocators.  However, the destructor must be changed under the hugetlb
      lock.  This means there is potentially one lock cycle per page.
      
      Minimize the number of lock cycles in update_and_free_pages_bulk by
      (sketched below):
      1) allocating the necessary vmemmap for all hugetlb pages on the list,
      2) taking the hugetlb lock and clearing the destructor for all pages on
         the list, and
      3) freeing all pages on the list back to the low level allocators.
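
      A hedged structural sketch of the resulting function; the helper names
      are assumed for illustration and details are elided:

          /* Sketch: one hugetlb_lock cycle for the whole list instead of one
           * lock/unlock per page. */
          static void update_and_free_pages_bulk(struct hstate *h,
                                                 struct list_head *folio_list)
          {
                  struct folio *folio, *tmp;

                  /* 1) Restore vmemmap for every folio, without the lock held. */
                  list_for_each_entry(folio, folio_list, lru)
                          hugetlb_vmemmap_restore_folio(h, folio);

                  /* 2) Clear the hugetlb destructor for all folios under the lock. */
                  spin_lock_irq(&hugetlb_lock);
                  list_for_each_entry(folio, folio_list, lru)
                          __clear_hugetlb_destructor(h, folio);
                  spin_unlock_irq(&hugetlb_lock);

                  /* 3) Hand every folio back to the lower level allocator. */
                  list_for_each_entry_safe(folio, tmp, folio_list, lru) {
                          list_del(&folio->lru);
                          __update_and_free_hugetlb_folio(h, folio);
                  }
          }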
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20231019023113.345257-2-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: James Houghton <jthoughton@google.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d2cf88c2
    • mm: fix draining remote pageset · fa8c4f9a
      Huang Ying authored
      If there is no memory allocation/freeing in the PCP (Per-CPU Pageset) of a
      remote zone (zone in remote NUMA node) after some time (3 seconds for
      now), the pages of the PCP of the remote zone will be drained to avoid
      memory wastage.
      
      This behavior was introduced in commit 4ae7c039 ("[PATCH]
      Periodically drain non local pagesets") and commit 4037d452 ("Move
      remote node draining out of slab allocators").
      
      But, after the commit 7cc36bbd ("vmstat: on-demand vmstat workers
      V8"), the vmstat updater worker which is used to drain the PCP of remote
      zones may not be re-queued when we are waiting for the timeout
      (pcp->expire != 0) if there are no vmstat changes on this CPU, for
      example, when the CPU goes idle or runs user space only workloads.  This
      may cause the pages of a remote zone be kept in PCP of this CPU for long
      time.  So that, the page reclaiming of the remote zone may be triggered
      prematurely.  This isn't a severe problem in practice, because the PCP of
      the remote zone will be drained if some memory are allocated/freed again
      on this CPU.  And, the PCP will eventually be drained during the direct
      reclaiming if necessary.
      
      Anyway, the problem still deserves a fix, by guaranteeing that the vmstat
      updater worker will always be re-queued when we are waiting for the
      timeout.  In effect, this restores the original behavior before commit
      7cc36bbd.
      
      We can reproduce the bug by allocating/freeing pages from a remote zone
      and then going idle, as follows.  The patch fixes it.
      
      - Run some workloads, use `numactl` to bind CPU to node 0 and memory to
        node 1.  So the PCP of the CPU on node 0 for zone on node 1 will be
        filled.
      
      - After workloads finish, idle for 60s
      
      - Check /proc/zoneinfo
      
      With the original kernel, the number of pages in the PCP of the CPU on
      node 0 for zone on node 1 is non-zero after idle.  With the patched
      kernel, it becomes 0 after idle.  That is, we avoid keeping pages in the
      remote PCP while idle.
      
      Link: https://lkml.kernel.org/r/20231007062356.187621-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20230811090819.60845-1-ying.huang@intel.com
      Fixes: 7cc36bbd ("vmstat: on-demand vmstat workers V8")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fa8c4f9a
  2. 18 Oct, 2023 24 commits
    • mm: perform the mapping_map_writable() check after call_mmap() · 15897894
      Lorenzo Stoakes authored
      In order for a F_SEAL_WRITE sealed memfd mapping to have an opportunity to
      clear VM_MAYWRITE, we must be able to invoke the appropriate
      vm_ops->mmap() handler to do so.  We would otherwise fail the
      mapping_map_writable() check before we had the opportunity to avoid it.
      
      This patch moves this check after the call_mmap() invocation.  Only memfd
      actively denies write access causing a potential failure here (in
      memfd_add_seals()), so there should be no impact on non-memfd cases.
      
      This patch makes the userland-visible change that MAP_SHARED, PROT_READ
      mappings of an F_SEAL_WRITE sealed memfd mapping will now succeed.
      
      There is a delicate situation with cleanup paths assuming that a writable
      mapping must have occurred in circumstances where it may now not have.  In
      order to ensure we do not accidentally mark a writable file unwritable by
      mistake, we explicitly track whether we have a writable mapping and unmap
      only if we do.
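
      A hedged sketch of the reordered flow in mmap_region() with the
      writable-mapping tracking described above; variable and label names are
      assumed, not quoted from the patch:

          bool writable_file_mapping = false;

          vma->vm_file = get_file(file);
          error = call_mmap(file, vma);   /* may clear VM_MAYWRITE, e.g. memfd seals */
          if (error)
                  goto unmap_and_free_vma;

          /* Check writability only after the driver had a chance to adjust flags. */
          if (vma_is_shared_maywrite(vma)) {
                  error = mapping_map_writable(file->f_mapping);
                  if (error)
                          goto close_and_free_vma;
                  writable_file_mapping = true;   /* remember for the cleanup paths */
          }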
      
      [lstoakes@gmail.com: do not set writable_file_mapping in inappropriate case]
        Link: https://lkml.kernel.org/r/c9eb4cc6-7db4-4c2b-838d-43a0b319a4f0@lucifer.local
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=217238
      Link: https://lkml.kernel.org/r/55e413d20678a1bb4c7cce889062bbb07b0df892.1697116581.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      15897894
    • mm: update memfd seal write check to include F_SEAL_WRITE · 28464bbb
      Lorenzo Stoakes authored
      The seal_check_future_write() function is called by shmem_mmap() or
      hugetlbfs_file_mmap() to disallow any future writable mappings of a memfd
      sealed this way.
      
      The F_SEAL_WRITE flag is not checked here, as that is handled via the
      mapping->i_mmap_writable mechanism and so any attempt at a mapping would
      fail before this could be run.
      
      However we intend to change this, meaning this check can be performed for
      F_SEAL_WRITE mappings also.
      
      The logic here is equally applicable to both flags, so update this
      function to accommodate both and rename it accordingly.
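
      A hedged sketch of the renamed helper handling both seals; the name
      seal_check_write() and the exact body are assumptions based on the
      description above:

          /* Sketch: deny new writable shared mappings of a write-sealed memfd
           * and drop VM_MAYWRITE on read-only shared mappings so they cannot be
           * made writable later. */
          static inline int seal_check_write(int seals, struct vm_area_struct *vma)
          {
                  if (seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) {
                          if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
                                  return -EPERM;
                          if (vma->vm_flags & VM_SHARED)
                                  vm_flags_clear(vma, VM_MAYWRITE);
                  }
                  return 0;
          }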
      
      Link: https://lkml.kernel.org/r/913628168ce6cce77df7d13a63970bae06a526e0.1697116581.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      28464bbb
    • mm: drop the assumption that VM_SHARED always implies writable · e8e17ee9
      Lorenzo Stoakes authored
      Patch series "permit write-sealed memfd read-only shared mappings", v4.
      
      The man page for fcntl() describing memfd file seals states the following
      about F_SEAL_WRITE:-
      
          Furthermore, trying to create new shared, writable memory-mappings via
          mmap(2) will also fail with EPERM.
      
      With emphasis on 'writable'.  It turns out that currently the kernel
      simply disallows all new shared memory mappings for a memfd with
      F_SEAL_WRITE applied, rendering this documentation inaccurate.
      
      This matters because users are therefore unable to obtain a shared mapping
      to a memfd after write sealing altogether, which limits their usefulness. 
      This was reported in the discussion thread [1] originating from a bug
      report [2].
      
      This is a product of both using the struct address_space->i_mmap_writable
      atomic counter to determine whether writing may be permitted, and the
      kernel adjusting this counter when any VM_SHARED mapping is performed and
      more generally implicitly assuming VM_SHARED implies writable.
      
      It seems sensible that we should only update this counter if VM_MAYWRITE
      is specified, i.e. if it is possible that this mapping could at any point
      be written to.
      
      If we do so then all we need to do to permit write seals to function as
      documented is to clear VM_MAYWRITE when mapping read-only.  It turns out
      this functionality already exists for F_SEAL_FUTURE_WRITE - we can
      therefore simply adapt this logic to do the same for F_SEAL_WRITE.
      
      We then hit a chicken and egg situation in mmap_region() where the check
      for VM_MAYWRITE occurs before we are able to clear this flag.  To work
      around this, perform this check after we invoke call_mmap(), with careful
      consideration of error paths.
      
      Thanks to Andy Lutomirski for the suggestion!
      
      [1]:https://lore.kernel.org/all/20230324133646.16101dfa666f253c4715d965@linux-foundation.org/
      [2]:https://bugzilla.kernel.org/show_bug.cgi?id=217238
      
      
      This patch (of 3):
      
      There is a general assumption that VMAs with the VM_SHARED flag set are
      writable.  If the VM_MAYWRITE flag is not set, then this is simply not the
      case.
      
      Update those checks which affect the struct address_space->i_mmap_writable
      field to explicitly test for this by introducing
      [vma_]is_shared_maywrite() helper functions.
      
      This remains entirely conservative, as the lack of VM_MAYWRITE guarantees
      that the VMA cannot be written to.
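
      A hedged sketch of the helpers the description implies (bodies assumed):

          /* Sketch: a VMA should contribute to i_mmap_writable only if it is
           * shared AND could ever become writable. */
          static inline bool is_shared_maywrite(vm_flags_t vm_flags)
          {
                  return (vm_flags & (VM_SHARED | VM_MAYWRITE)) ==
                         (VM_SHARED | VM_MAYWRITE);
          }

          static inline bool vma_is_shared_maywrite(struct vm_area_struct *vma)
          {
                  return is_shared_maywrite(vma->vm_flags);
          }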
      
      Link: https://lkml.kernel.org/r/cover.1697116581.git.lstoakes@gmail.com
      Link: https://lkml.kernel.org/r/d978aefefa83ec42d18dfa964ad180dbcde34795.1697116581.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e8e17ee9
    • Docs/admin-guide/mm/damon/usage: update for tried regions update time interval · bc17ea26
      SeongJae Park authored
      The documentation says the DAMOS tried regions update feature of the DAMON
      sysfs interface does the update for one aggregation interval after the
      request is made.  Since the introduction of the per-scheme apply interval,
      that behavior no longer makes much sense, and the implementation has been
      changed to update the regions for each scheme only for its apply interval.
      Update the document to reflect the real behavior.
      
      Link: https://lkml.kernel.org/r/20231012192256.33556-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bc17ea26
    • mm/damon/sysfs: avoid empty scheme tried regions for large apply interval · 76126332
      SeongJae Park authored
      DAMON_SYSFS assumes all schemes will be applied for at least one DAMON
      monitoring results snapshot within one aggregation interval, or that it
      makes no sense to wait for them while DAMON is deactivated by the
      watermarks.  The assumption for the deactivated case still holds, but the
      aggregation interval based assumption is now invalid because each scheme
      can have its own apply interval.  For schemes whose apply interval is
      larger than the aggregation or watermarks check interval, a DAMOS tried
      regions update request can finish without the update.  Avoid this case by
      explicitly checking the status of the schemes' tried regions update and
      watermarks based DAMON deactivation.
      
      Link: https://lkml.kernel.org/r/20231012192256.33556-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      76126332
    • mm/damon/sysfs-schemes: do not update tried regions more than one DAMON snapshot · 4d4e41b6
      SeongJae Park authored
      Patch series "mm/damon/sysfs-schemes: Do DAMOS tried regions update for
      only one apply interval".
      
      The DAMOS tried regions update feature of the DAMON sysfs interface does
      the update for one aggregation interval after the request is made.  Now
      that per-scheme apply intervals are supported, that behavior no longer
      makes much sense.  That is, the tried regions directory will have regions
      from multiple DAMON monitoring results snapshots for apply intervals much
      shorter than the aggregation interval, or no region at all for apply
      intervals longer than it.  Update the behavior to update the regions for
      each scheme only for its apply interval, and update the document.
      
      Since the DAMOS apply interval defaults to the aggregation interval, this
      change makes no visible behavioral difference to old users who don't
      explicitly set the apply intervals.
      
      Patches Sequence
      ----------------
      
      The first two patches make schemes whose apply intervals are much shorter
      or longer than the aggregation interval keep the maximum and minimum times
      for continuing the update.  After the two patches, the update aligns with
      each scheme's apply interval.
      
      Finally, the third patch updates the document to reflect the behavior.
      
      
      This patch (of 3):
      
      DAMON_SYSFS exposes every DAMON-found region that is eligible for applying
      the scheme action for one aggregation interval.  However, each DAMON-based
      operation scheme has its own apply interval.  Hence, for a scheme whose
      apply interval is much smaller than the aggregation interval, DAMON_SYSFS
      will expose scheme regions that were applied to more than one DAMON
      monitoring results snapshot.  Since the purpose of DAMON tried regions is
      to expose a single snapshot, this does not make much sense.  Track the
      progress of each scheme's tried regions update and avoid the case.
      
      Link: https://lkml.kernel.org/r/20231012192256.33556-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20231012192256.33556-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4d4e41b6
    • tools/mm: update the usage output to be more organized · d8ea435f
      Audra Mitchell authored
      Organize the usage options alphabetically and improve the description of
      some options.  Also separate the more complicated cull options from the
      single use compare options.
      
      Link: https://lkml.kernel.org/r/20231013190350.579407-6-audra@redhat.com
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Georgi Djakov <djakov@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d8ea435f
    • tools/mm: fix the default case for page_owner_sort · c6d5e490
      Audra Mitchell authored
      With the additional commands and timestamps added to the tool, the default
      case (-t) has been broken.  Now that the allocation timestamps are saved
      outside of the txt field, we can properly sort the data by the number of
      times a record has been seen.  Furthermore, prevent misuse of the command
      line arguments so that only one compare option can be used.
      
      Link: https://lkml.kernel.org/r/20231013190350.579407-5-audra@redhat.com
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Georgi Djakov <djakov@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c6d5e490
    • tools/mm: filter out timestamps for correct collation · 63a15062
      Audra Mitchell authored
      With the introduction of allocation timestamps being included in
      page_owner output, each record becomes unique due to the timestamp
      nanosecond granularity.  Remove the check in add_list that tries to
      collate each record during processing as the memcmp() is just additional
      overhead at this point.
      
      Also keep the allocation timestamps, but allow collation to occur without
      consideration of the allocation timestamp except in the case where
      allocation timestamps are requested by the user (the -a option).
      
      Link: https://lkml.kernel.org/r/20231013190350.579407-4-audra@redhat.com
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Georgi Djakov <djakov@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      63a15062
    • tools/mm: remove references to free_ts from page_owner_sort · 0179c628
      Audra Mitchell authored
      With the removal of free timestamps from page_owner output, we no longer
      need to handle this case or the "unreleased" case.  Remove all references
      to both cases.
      
      Link: https://lkml.kernel.org/r/20231013190350.579407-3-audra@redhat.com
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Georgi Djakov <djakov@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0179c628
    • mm/page_owner: remove free_ts from page_owner output · b459f090
      Audra Mitchell authored
      Patch series "Fix page_owner's use of free timestamps".
      
      While page owner output is used to investigate memory utilization,
      typically the allocation pathway, the introduction of timestamps to the
      page owner records caused each record to become unique due to the
      granularity of the nanosecond timestamp (for example):
      
        Page allocated via order 0 ... ts 5206196026 ns, free_ts 5187156703 ns
        Page allocated via order 0 ... ts 5206198540 ns, free_ts 5187162702 ns
      
      Furthermore, the page_owner output only dumps the currently allocated
      records, so having the free timestamps is nonsensical for the typical use
      case.
      
      In addition, the introduction of timestamps was not properly handled in
      the page_owner_sort tool causing most use cases to be broken.  This series
      is meant to remove the free timestamps from the page_owner output and fix
      the page_owner_sort tool so proper collation can occur.
      
      
      This patch (of 5):
      
      When printing page_owner data via the sysfs interface, no free pages will
      ever be dumped due to the series of checks in read_page_owner():
      
          /*
           * Although we do have the info about past allocation of free
           * pages, it's not relevant for current memory usage.
           */
           if (!test_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags))
      
      The free_ts values are still used when dump_page_owner() is called, so keep
      the field for other use cases but remove it from the typical page_owner
      output.
      
      Link: https://lkml.kernel.org/r/20231013190350.579407-1-audra@redhat.com
      Link: https://lkml.kernel.org/r/20231013190350.579407-2-audra@redhat.com
      Fixes: 866b4852 ("mm/page_owner: record the timestamp of all pages during free")
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Georgi Djakov <djakov@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b459f090
    • mm: abstract VMA merge and extend into vma_merge_extend() helper · 93bf5d4a
      Lorenzo Stoakes authored
      mremap uses vma_merge() in the case where a VMA needs to be extended. This
      can be significantly simplified and abstracted.
      
      This makes it far easier to understand what the actual function is doing,
      avoids future mistakes in use of the confusing vma_merge() function and
      importantly allows us to make future changes to how vma_merge() is
      implemented by knowing explicitly which merge cases each invocation uses.
      
      Note that in the mremap() extend case, we perform this merge only when
      old_len == vma->vm_end - addr. The extension_start, i.e. the start of the
      extended portion of the VMA is equal to addr + old_len, i.e. vma->vm_end.
      
      With this refactoring, vma_merge() is no longer required anywhere except
      mm/mmap.c, so mark it static.
      
      Link: https://lkml.kernel.org/r/f16cbdc2e72d37a1a097c39dc7d1fee8919a1c93.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      93bf5d4a
    • mm: abstract merge for new VMAs into vma_merge_new_vma() · 4b5f2d20
      Lorenzo Stoakes authored
      Only in mmap_region() and copy_vma() do we attempt to merge VMAs which
      occupy entirely new regions of virtual memory.
      
      We can abstract this logic and make the intent of these invocations
      completely explicit, rather than invoking vma_merge() with an inscrutable
      wall of parameters.
      
      This also paves the way for a simplification of the core vma_merge()
      implementation, as we seek to make it entirely an implementation detail.
      
      The VMA merge call in mmap_region() occurs only for file-backed mappings,
      where each of the parameters previously specified as NULL are defaulted to
      NULL in vma_init() (called by vm_area_alloc()).
      
      This matches the previous behaviour of specifying NULL for a number of
      fields, however note that prior to this call we pass the VMA to the file
      system driver via call_mmap(), which may in theory adjust fields that we
      pass in to vma_merge_new_vma().
      
      Therefore we actually resolve an oversight here by allowing for the fact
      that the driver may have done this.
      
      Link: https://lkml.kernel.org/r/3dc71d17e307756a54781d4a4ce7315cf8b18bea.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4b5f2d20
    • mm: make vma_merge() and split_vma() internal · adb20b0c
      Lorenzo Stoakes authored
      Now that the common pattern of attempting a merge via vma_merge() and,
      should this fail, splitting VMAs via split_vma() has been abstracted, the
      former can be placed into mm/internal.h and the latter made static.
      
      In addition, the split_vma() nommu variant also need not be exported.
      
      Link: https://lkml.kernel.org/r/405f2be10e20c4e9fbcc9fe6b2dfea105f6642e0.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      adb20b0c
    • mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al. · 94d7d923
      Lorenzo Stoakes authored
      mprotect() and other functions which change VMA parameters over a range
      each employ a pattern of:-
      
      1. Attempt to merge the range with adjacent VMAs.
      2. If this fails, and the range spans a subset of the VMA, split it
         accordingly.
      
      This is open-coded and duplicated in each case. Also in each case most of
      the parameters passed to vma_merge() remain the same.
      
      Create a new function, vma_modify(), which abstracts this operation,
      accepting only those parameters which can be changed.
      
      To avoid the mess of invoking each function call with unnecessary
      parameters, create inline wrapper functions for each of the modify
      operations, parameterised only by what is required to perform the action.
      
      We can also significantly simplify the logic - by returning the VMA if we
      split (or merged VMA if we do not) we no longer need specific handling for
      merge/split cases in any of the call sites.
      
      Note that the userfaultfd_release() case works even though it does not
      split VMAs - since start is set to vma->vm_start and end is set to
      vma->vm_end, the split logic does not trigger.
      
      In addition, since we calculate pgoff to be equal to vma->vm_pgoff + (start
      - vma->vm_start) >> PAGE_SHIFT, and start - vma->vm_start will be 0 in this
      instance, this invocation will remain unchanged.
      
      We eliminate a VM_WARN_ON() in mprotect_fixup() as this simply asserts that
      vma_merge() correctly ensures that flags remain the same, something that is
      already checked in is_mergeable_vma() and elsewhere, and in any case is not
      specific to mprotect().
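
      A hedged sketch of the shape of vma_modify() and one of its inline
      wrappers; the parameter lists are inferred from the description and should
      be treated as assumptions:

          /* Sketch: try to merge [start, end) with adjacent VMAs; if that fails
           * and the range is a strict subset of @vma, split so the range stands
           * alone.  Returns the VMA covering the range or an ERR_PTR() value. */
          struct vm_area_struct *vma_modify(struct vma_iterator *vmi,
                                            struct vm_area_struct *prev,
                                            struct vm_area_struct *vma,
                                            unsigned long start, unsigned long end,
                                            vm_flags_t vm_flags,
                                            struct mempolicy *policy,
                                            struct vm_userfaultfd_ctx uffd_ctx,
                                            struct anon_vma_name *anon_name);

          /* Wrapper for mprotect()-style callers: only the flags change. */
          static inline struct vm_area_struct *
          vma_modify_flags(struct vma_iterator *vmi, struct vm_area_struct *prev,
                           struct vm_area_struct *vma, unsigned long start,
                           unsigned long end, vm_flags_t new_flags)
          {
                  return vma_modify(vmi, prev, vma, start, end, new_flags,
                                    vma_policy(vma), vma->vm_userfaultfd_ctx,
                                    anon_vma_name(vma));
          }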
      
      Link: https://lkml.kernel.org/r/0dfa9368f37199a423674bf0ee312e8ea0619044.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      94d7d923
    • mm: move vma_policy() and anon_vma_name() decls to mm_types.h · 3657fdc2
      Lorenzo Stoakes authored
      Patch series "Abstract vma_merge() and split_vma()", v4.
      
      The vma_merge() interface is very confusing and its implementation has led
      to numerous bugs as a result of that confusion.
      
      In addition there is duplication both in invocation of vma_merge(), but
      also in the common mprotect()-style pattern of attempting a merge, then if
      this fails, splitting the portion of a VMA about to have its attributes
      changed.
      
      This pattern has been copy/pasted around the kernel in each instance where
      such an operation has been required, each very slightly modified from the
      last to make it even harder to decipher what is going on.
      
      Simplify the whole thing by dividing the actual uses of vma_merge() and
      split_vma() into specific and abstracted functions and de-duplicate the
      vma_merge()/split_vma() pattern altogether.
      
      Doing so also opens the door to changing how vma_merge() is implemented -
      by knowing precisely what cases a caller is invoking rather than having a
      central interface where anything might happen we can untangle the brittle
      and confusing vma_merge() implementation into something more workable.
      
      For mprotect()-like cases we introduce vma_modify() which performs the
      vma_merge()/split_vma() pattern, returning a pointer to either the merged
      or split VMA or an ERR_PTR(err) if the splits fail.
      
      We provide a number of inline helper functions to make things even clearer:-
      
      * vma_modify_flags()      - Prepare to modify the VMA's flags.
      * vma_modify_flags_name() - Prepare to modify the VMA's flags/anon_vma_name
      * vma_modify_policy()     - Prepare to modify the VMA's mempolicy.
      * vma_modify_flags_uffd() - Prepare to modify the VMA's flags/uffd context.
      
      For cases where a new VMA is attempted to be merged with adjacent VMAs we
      add:-
      
      * vma_merge_new_vma() - Prepare to merge a new VMA.
      * vma_merge_extend()  - Prepare to extend the end of a new VMA.
      
      
      This patch (of 5):
      
      The vma_policy() define is a helper specifically for a VMA field so it
      makes sense to host it in the memory management types header.
      
      The anon_vma_name(), anon_vma_name_alloc() and anon_vma_name_free()
      functions are a little out of place in mm_inline.h as they define external
      functions, and so it makes sense to locate them in mm_types.h.
      
      The purpose of these relocations is to make it possible to abstract static
      inline wrappers which invoke both of these helpers.
      
      Link: https://lkml.kernel.org/r/cover.1697043508.git.lstoakes@gmail.com
      Link: https://lkml.kernel.org/r/24bfc6c9e382fffbcb0ea8d424392c27d56cc8ca.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3657fdc2
    • sched: remove wait bookmarks · 37acade0
      Matthew Wilcox (Oracle) authored
      There are no users of wait bookmarks left, so simplify the wait
      code by removing them.
      
      Link: https://lkml.kernel.org/r/20231010035829.544242-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Benjamin Segall <bsegall@google.com>
      Cc: Bin Lai <sclaibin@gmail.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: Valentin Schneider <vschneid@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      37acade0
    • filemap: remove use of wait bookmarks · b0b598ee
      Matthew Wilcox (Oracle) authored
      The original problem of the overly long list of waiters on a locked page
      was solved properly by commit 9a1ea439 ("mm:
      put_and_wait_on_page_locked() while page is migrated").  In the meantime,
      using bookmarks for the writeback bit can cause livelocks, so we need to
      stop using them.
      
      Link: https://lkml.kernel.org/r/20231010035829.544242-1-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Bin Lai <sclaibin@gmail.com>
      Cc: Benjamin Segall <bsegall@google.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: Valentin Schneider <vschneid@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b0b598ee
    • Lorenzo Stoakes's avatar
      mm/mprotect: allow unfaulted VMAs to be unaccounted on mprotect() · 9b914329
      Lorenzo Stoakes authored
      When mprotect() is used to make unwritable VMAs writable, they have the
      VM_ACCOUNT flag applied and memory accounted accordingly.
      
      If the VMA has had no pages faulted in and is then made unwritable once
      again, it will remain accounted for, despite not being capable of
      extending memory usage.
      
      Consider:-
      
      ptr = mmap(NULL, page_size * 3, PROT_READ, MAP_ANON | MAP_PRIVATE, -1, 0);
      mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE);
      mprotect(ptr + page_size, page_size, PROT_READ);
      
      The first mprotect() splits the range into 3 VMAs and the second fails to
      merge the three as the middle VMA has VM_ACCOUNT set and the others do
      not, rendering them unmergeable.
      
      This is unnecessary, since no pages have actually been allocated and the
      middle VMA is not capable of utilising more memory, thereby introducing
      unnecessary VMA fragmentation (and accounting for more memory than is
      necessary).
      
      Since we cannot efficiently determine which pages map to an anonymous VMA,
      we have to be very conservative: we determine only whether any pages at all
      have been faulted in, by checking whether vma->anon_vma is NULL.
      
      We can see that the lack of anon_vma implies that no anonymous pages are
      present, as evidenced by vma_needs_copy() utilising this on fork to
      determine whether page tables need to be copied.
      
      The only place where anon_vma is explicitly set to NULL is on fork with
      VM_WIPEONFORK set; however, since this flag is intended to cause the child
      process to not CoW on a given memory range, it is right to interpret this
      as indicating the VMA has no faulted-in anonymous memory mapped.
      
      If the VMA was forked without VM_WIPEONFORK set, then anon_vma_fork() will
      have ensured that a new anon_vma is assigned (and correctly related to its
      parent anon_vma) should any pages be CoW-mapped.
      
      The overall operation is safe against races as we hold a write lock against
      mm->mmap_lock.
      
      If we could efficiently look up the VMA's faulted-in pages then we would
      unaccount all those pages not yet faulted in.  However, as the original
      comment alludes, this simply isn't currently possible, so we are
      conservative and account all pages or none at all.
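      
      In outline (a sketch of the idea only; the real change also checks that
      the mapping is a private, anonymous one), the write-removal path can do
      something like:
      
      /*
       * Dropping VM_WRITE again: if no anon_vma exists, no anonymous pages
       * can have been faulted in, so the accounting can be dropped too.
       */
      if (!(newflags & VM_WRITE) && (oldflags & VM_ACCOUNT) && !vma->anon_vma)
              newflags &= ~VM_ACCOUNT;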
      
      Link: https://lkml.kernel.org/r/ad5540371a16623a069f03f4db1739f33cde1fab.1696921767.git.lstoakes@gmail.comSigned-off-by: default avatarLorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMike Rapoport (IBM) <rppt@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      9b914329
    • Lucy Mielke's avatar
      mm: add printf attribute to shrinker_debugfs_name_alloc · f04eba13
      Lucy Mielke authored
      This fixes a compiler warning when compiling an allyesconfig with W=1:
      
      mm/internal.h:1235:9: error: function might be a candidate for `gnu_printf'
      format attribute [-Werror=suggest-attribute=format]
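      
      The fix is to annotate the allocation helper with the kernel's __printf()
      attribute macro so GCC can type-check callers' format strings.
      Illustratively (a generic example, not the exact hunk; the function name
      and parameter positions are placeholders):
      
      /* __printf(2, 3): arg 2 is the format string, args from 3 are checked */
      __printf(2, 3)
      static int name_alloc(struct shrinker *shrinker, const char *fmt, ...);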
      
      [akpm@linux-foundation.org: fix shrinker_alloc() as well, per Qi Zheng]
        Link: https://lkml.kernel.org/r/822387b7-4895-4e64-5806-0f56b5d6c447@bytedance.com
      Link: https://lkml.kernel.org/r/ZSBue-3kM6gI6jCr@mainframe
      Fixes: c42d50ae ("mm: shrinker: add infrastructure for dynamically allocating shrinker")
      Signed-off-by: default avatarLucy Mielke <lucymielke@icloud.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f04eba13
    • Zach O'Keefe's avatar
      mm/thp: fix "mm: thp: kill __transhuge_page_enabled()" · 7a81751f
      Zach O'Keefe authored
      The 6.0 commits:
      
      commit 9fec5168 ("mm: thp: kill transparent_hugepage_active()")
      commit 7da4e2cb ("mm: thp: kill __transhuge_page_enabled()")
      
      merged "can we have THPs in this VMA?" logic that was previously done
      separately by fault-path, khugepaged, and smaps "THPeligible" checks.
      
      During the process, the semantics of the fault path check changed in two
      ways:
      
      1) A VM_NO_KHUGEPAGED check was introduced (also added to smaps path).
      2) We no longer checked if non-anonymous memory had a vm_ops->huge_fault
         handler that could satisfy the fault.  Previously, this check had been
         done in create_huge_pud() and create_huge_pmd() routines, but after
         the changes, we never reach those routines.
      
      During the review of the above commits, it was determined that in-tree
      users weren't affected by the change, most notably because the only
      relevant user (in terms of THP) of VM_MIXEDMAP or ->huge_fault is DAX,
      which is explicitly approved early in the approval logic.  However, this
      was a bad assumption to make, as it presumes that the only reason to
      support ->huge_fault was DAX (which is not true in general).
      
      Remove the VM_NO_KHUGEPAGED check when not in the collapse path and give
      any ->huge_fault handler a chance to handle the fault.  Note that we don't
      validate the file mode or mapping alignment, which is consistent with the
      behavior before the aforementioned commits.
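      
      In rough terms (a hedged sketch of the intent, not the literal hunk), the
      fault-path eligibility check for non-anonymous VMAs becomes:
      
      /* Fault path: defer to the driver's handler instead of rejecting. */
      if (!vma_is_anonymous(vma))
              return vma->vm_ops && vma->vm_ops->huge_fault;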
      
      Link: https://lkml.kernel.org/r/20230925200110.1979606-1-zokeefe@google.com
      Fixes: 7da4e2cb ("mm: thp: kill __transhuge_page_enabled()")
      Reported-by: default avatarSaurabh Singh Sengar <ssengar@microsoft.com>
      Signed-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7a81751f
    • Nhat Pham's avatar
      selftests: add a selftest to verify hugetlb usage in memcg · c0dddb7a
      Nhat Pham authored
      This patch adds a new kselftest to demonstrate and verify the new hugetlb
      memcg accounting behavior.
      
      Link: https://lkml.kernel.org/r/20231006184629.155543-5-nphamcs@gmail.comSigned-off-by: default avatarNhat Pham <nphamcs@gmail.com>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c0dddb7a
    • Nhat Pham's avatar
      hugetlb: memcg: account hugetlb-backed memory in memory controller · 8cba9576
      Nhat Pham authored
      Currently, hugetlb memory usage is not accounted for in the memory
      controller, which could lead to memory overprotection for cgroups with
      hugetlb-backed memory.  This has been observed in our production system.
      
      For instance, here is one of our use cases: suppose there are two 32G
      containers.  The machine is booted with hugetlb_cma=6G, and each container
      may or may not use up to 3 gigantic pages, depending on the workload within
      it.  The rest is anon, cache, slab, etc.  We can set the hugetlb cgroup
      limit of each cgroup to 3G to enforce hugetlb fairness.  But it is very
      difficult to configure memory.max to keep overall consumption, including
      anon, cache, slab, etc., fair.
      
      What we have had to resort to is constantly polling hugetlb usage and
      readjusting memory.max.  A similar procedure is applied to other memory
      limits (memory.low, for example).  However, this is rather cumbersome and
      buggy.  Furthermore, when there is a delay in correcting the memory limits
      (for example, when hugetlb usage changes between consecutive runs of the
      userspace agent), the system could be left in an over- or underprotected
      state.
      
      This patch rectifies this issue by charging the memcg when the hugetlb
      folio is utilized, and uncharging when the folio is freed (analogous to
      the hugetlb controller).  Note that we do not charge when the folio is
      allocated to the hugetlb pool, because at this point it is not owned by
      any memcg.
      
      Some caveats to consider:
        * This feature is only available on cgroup v2.
        * There is no hugetlb pool management involved in the memory
          controller. As stated above, hugetlb folios are only charged towards
          the memory controller when they are used. Host overcommit management
          has to consider this when configuring hard limits.
        * Failure to charge towards the memcg results in SIGBUS. This could
          happen even if the hugetlb pool still has pages (but the cgroup
          limit is hit and reclaim attempt fails).
        * When this feature is enabled, hugetlb pages contribute to memory
          reclaim protection. low, min limits tuning must take into account
          hugetlb memory.
        * Hugetlb pages utilized while this option is not selected will not
          be tracked by the memory controller (even if cgroup v2 is remounted
          later on).
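      
      For illustration, the user-visible effect of a failed charge can be
      reproduced with a mapping like the one below, run inside a cgroup whose
      memory.max is smaller than one hugetlb page (assuming 2MiB hugetlb pages,
      a non-empty hugetlb pool and the accounting option enabled); the first
      write then faults and receives SIGBUS:
      
      #include <sys/mman.h>
      #include <string.h>
      
      int main(void)
      {
              size_t sz = 2UL << 20;  /* one 2MiB hugetlb page */
              char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      
              if (p == MAP_FAILED)
                      return 1;
              memset(p, 0, sz);       /* faults the page; SIGBUS if the charge fails */
              return 0;
      }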
      
      Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.comSigned-off-by: default avatarNhat Pham <nphamcs@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      8cba9576
    • Nhat Pham's avatar
      memcontrol: only transfer the memcg data for migration · 85ce2c51
      Nhat Pham authored
      For most migration use cases, only transfer the memcg data from the old
      folio to the new folio, and clear the old folio's memcg data.  No charging
      and uncharging will be done.
      
      This shaves off some work on the migration path, and avoids the temporary
      double charging of a folio during its migration.
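      
      Conceptually (a simplified sketch; the real code also takes care of css
      reference counting and the flag bits stored alongside the pointer), the
      migration path now amounts to:
      
      /* Move the memcg association; no charge or uncharge is performed. */
      new_folio->memcg_data = old_folio->memcg_data;
      old_folio->memcg_data = 0;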
      
      The only exception is replace_page_cache_folio(), which will use the old
      mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio).  In that
      context, the isolation of the old page isn't quite as thorough as with
      migration, so we cannot use our new implementation directly.
      
      This patch is the result of the following discussion on the new hugetlb
      memcg accounting behavior:
      
      https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
      
      Link: https://lkml.kernel.org/r/20231006184629.155543-3-nphamcs@gmail.comSigned-off-by: default avatarNhat Pham <nphamcs@gmail.com>
      Suggested-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      85ce2c51