1. 12 Sep, 2022 10 commits
    • Zach O'Keefe's avatar
      mm/madvise: add MADV_COLLAPSE to process_madvise() · 876b4a18
      Zach O'Keefe authored
      Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
      CAP_SYS_ADMIN or is requesting collapse of it's own memory.
      
      This is useful for the development of userspace agents that seek to
      optimize THP utilization system-wide by using userspace signals to
      prioritize what memory is most deserving of being THP-backed.
      
      [zokeefe@google.com: remove CAP_SYS_ADMIN requirement for process_madvise(MADV_COLLAPSE)]
        Link: https://lkml.kernel.org/r/20220801210946.3069083-1-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220706235936.2197195-13-zokeefe@google.comSigned-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      876b4a18
    • Zach O'Keefe's avatar
      mm/khugepaged: rename prefix of shared collapse functions · 7d2c4385
      Zach O'Keefe authored
      The following functions are shared between khugepaged and madvise collapse
      contexts.  Replace the "khugepaged_" prefix with generic "hpage_collapse_"
      prefix in such cases:
      
      khugepaged_test_exit() -> hpage_collapse_test_exit()
      khugepaged_scan_abort() -> hpage_collapse_scan_abort()
      khugepaged_scan_pmd() -> hpage_collapse_scan_pmd()
      khugepaged_find_target_node() -> hpage_collapse_find_target_node()
      khugepaged_alloc_page() -> hpage_collapse_alloc_page()
      
      The kerenel ABI (e.g.  huge_memory:mm_khugepaged_scan_pmd tracepoint) is
      unaltered.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-11-zokeefe@google.comSigned-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7d2c4385
    • Zach O'Keefe's avatar
      mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse · 7d8faaf1
      Zach O'Keefe authored
      This idea was introduced by David Rientjes[1].
      
      Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request
      a synchronous collapse of memory at their own expense.
      
      The benefits of this approach are:
      
      * CPU is charged to the process that wants to spend the cycles for the
        THP
      * Avoid unpredictable timing of khugepaged collapse
      
      Semantics
      
      This call is independent of the system-wide THP sysfs settings, but will
      fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
      multiple VMAs, the semantics of the collapse over each VMA is independent
      from the others.  This implies a hugepage cannot cross a VMA boundary.  If
      collapse of a given hugepage-aligned/sized region fails, the operation may
      continue to attempt collapsing the remainder of memory specified.
      
      The memory ranges provided must be page-aligned, but are not required to
      be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
      start/end of the range will be clamped to the first/last hugepage-aligned
      address covered by said range.  The memory ranges must span at least one
      hugepage-sized region.
      
      All non-resident pages covered by the range will first be
      swapped/faulted-in, before being internally copied onto a freshly
      allocated hugepage.  Unmapped pages will have their data directly
      initialized to 0 in the new hugepage.  However, for every eligible
      hugepage aligned/sized region to-be collapsed, at least one page must
      currently be backed by memory (a PMD covering the address range must
      already exist).
      
      Allocation for the new hugepage may enter direct reclaim and/or
      compaction, regardless of VMA flags.  When the system has multiple NUMA
      nodes, the hugepage will be allocated from the node providing the most
      native pages.  This operation operates on the current state of the
      specified process and makes no persistent changes or guarantees on how
      pages will be mapped, constructed, or faulted in the future
      
      Return Value
      
      If all hugepage-sized/aligned regions covered by the provided range were
      either successfully collapsed, or were already PMD-mapped THPs, this
      operation will be deemed successful.  On success, process_madvise(2)
      returns the number of bytes advised, and madvise(2) returns 0.  Else, -1
      is returned and errno is set to indicate the error for the most-recently
      attempted hugepage collapse.  Note that many failures might have occurred,
      since the operation may continue to collapse in the event a single
      hugepage-sized/aligned region fails.
      
      	ENOMEM	Memory allocation failed or VMA not found
      	EBUSY	Memcg charging failed
      	EAGAIN	Required resource temporarily unavailable.  Try again
      		might succeed.
      	EINVAL	Other error: No PMD found, subpage doesn't have Present
      		bit set, "Special" page no backed by struct page, VMA
      		incorrectly sized, address not page-aligned, ...
      
      Most notable here is ENOMEM and EBUSY (new to madvise) which are intended
      to provide the caller with actionable feedback so they may take an
      appropriate fallback measure.
      
      Use Cases
      
      An immediate user of this new functionality are malloc() implementations
      that manage memory in hugepage-sized chunks, but sometimes subrelease
      memory back to the system in native-sized chunks via MADV_DONTNEED;
      zapping the pmd.  Later, when the memory is hot, the implementation could
      madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain hugepage
      coverage and dTLB performance.  TCMalloc is such an implementation that
      could benefit from this[2].
      
      Only privately-mapped anon memory is supported for now, but additional
      support for file, shmem, and HugeTLB high-granularity mappings[2] is
      expected.  File and tmpfs/shmem support would permit:
      
      * Backing executable text by THPs.  Current support provided by
        CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which
        might impair services from serving at their full rated load after
        (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
        immediately realize iTLB performance prevents page sharing and demand
        paging, both of which increase steady state memory footprint.  With
        MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
        and lower RAM footprints.
      * Backing guest memory by hugapages after the memory contents have been
        migrated in native-page-sized chunks to a new host, in a
        userfaultfd-based live-migration stack.
      
      [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
      [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
      
      [jrdr.linux@gmail.com: avoid possible memory leak in failure path]
        Link: https://lkml.kernel.org/r/20220713024109.62810-1-jrdr.linux@gmail.com
      [zokeefe@google.com add missing kfree() to madvise_collapse()]
        Link: https://lore.kernel.org/linux-mm/20220713024109.62810-1-jrdr.linux@gmail.com/
        Link: https://lkml.kernel.org/r/20220713161851.1879439-1-zokeefe@google.com
      [zokeefe@google.com: delay computation of hpage boundaries until use]]
        Link: https://lkml.kernel.org/r/20220720140603.1958773-4-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220706235936.2197195-10-zokeefe@google.comSigned-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Signed-off-by: default avatar"Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Suggested-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7d8faaf1
    • Zach O'Keefe's avatar
      mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage · 50722804
      Zach O'Keefe authored
      When scanning an anon pmd to see if it's eligible for collapse, return
      SCAN_PMD_MAPPED if the pmd already maps a hugepage.  Note that
      SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
      file-collapse path, since the latter might identify pte-mapped compound
      pages.  This is required by MADV_COLLAPSE which necessarily needs to know
      what hugepage-aligned/sized regions are already pmd-mapped.
      
      In order to determine if a pmd already maps a hugepage, refactor
      mm_find_pmd():
      
      Return mm_find_pmd() to it's pre-commit f72e7dcd ("mm: let mm_find_pmd
      fix buggy race with THP fault") behavior.  ksm was the only caller that
      explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic
      there (pmd_present() and pmd_trans_huge() checks).
      
      Undo revert change in commit f72e7dcd ("mm: let mm_find_pmd fix buggy
      race with THP fault") that open-coded split_huge_pmd_address() pmd lookup
      and use mm_find_pmd() instead.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-9-zokeefe@google.comSigned-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      50722804
    • Zach O'Keefe's avatar
      mm/thp: add flag to enforce sysfs THP in hugepage_vma_check() · a7f4e6e4
      Zach O'Keefe authored
      MADV_COLLAPSE is not coupled to the kernel-oriented sysfs THP settings[1].
      
      hugepage_vma_check() is the authority on determining if a VMA is eligible
      for THP allocation/collapse, and currently enforces the sysfs THP
      settings.  Add a flag to disable these checks.  For now, only apply this
      arg to anon and file, which use /sys/kernel/transparent_hugepage/enabled. 
      We can expand this to shmem, which uses
      /sys/kernel/transparent_hugepage/shmem_enabled, later.
      
      Use this flag in collapse_pte_mapped_thp() where previously the VMA flags
      passed to hugepage_vma_check() were OR'd with VM_HUGEPAGE to elide the
      VM_HUGEPAGE check in "madvise" THP mode.  Prior to "mm: khugepaged: check
      THP flag in hugepage_vma_check()", this check also didn't check "never"
      THP mode.  As such, this restores the previous behavior of
      collapse_pte_mapped_thp() where sysfs THP settings are ignored.  See
      comment in code for justification why this is OK.
      
      [1] https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-8-zokeefe@google.comSigned-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a7f4e6e4
    • Zach O'Keefe's avatar
      mm/khugepaged: add flag to predicate khugepaged-only behavior · d8ea7cc8
      Zach O'Keefe authored
      Add .is_khugepaged flag to struct collapse_control so khugepaged-specific
      behavior can be elided by MADV_COLLAPSE context.
      
      Start by protecting khugepaged-specific heuristics by this flag.  In
      MADV_COLLAPSE, the user presumably has reason to believe the collapse will
      be beneficial and khugepaged heuristics shouldn't prevent the user from
      doing so:
      
      1) sysfs-controlled knobs khugepaged_max_ptes_[none|swap|shared]
      
      2) requirement that some pages in region being collapsed be young or
         referenced
      
      [zokeefe@google.com: consistently order cc->is_khugepaged and pte_* checks]
        Link: https://lkml.kernel.org/r/20220720140603.1958773-3-zokeefe@google.com
        Link: https://lore.kernel.org/linux-mm/Ys2qJm6FaOQcxkha@google.com/
      Link: https://lkml.kernel.org/r/20220706235936.2197195-7-zokeefe@google.comSigned-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d8ea7cc8
    • Zach O'Keefe's avatar
      mm/khugepaged: propagate enum scan_result codes back to callers · 50ad2f24
      Zach O'Keefe authored
      Propagate enum scan_result codes back through return values of
      functions downstream of khugepaged_scan_file() and
      khugepaged_scan_pmd() to inform callers if the operation was
      successful, and if not, why.
      
      Since khugepaged_scan_pmd()'s return value already has a specific meaning
      (whether mmap_lock was unlocked or not), add a bool* argument to
      khugepaged_scan_pmd() to retrieve this information.
      
      Change khugepaged to take action based on the return values of
      khugepaged_scan_file() and khugepaged_scan_pmd() instead of acting deep
      within the collapsing functions themselves.
      
      hugepage_vma_revalidate() now returns SCAN_SUCCEED on success to be more
      consistent with enum scan_result propagation.
      
      Remove dependency on error pointers to communicate to khugepaged that
      allocation failed and it should sleep; instead just use the result of the
      scan (SCAN_ALLOC_HUGE_PAGE_FAIL if allocation fails).
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-6-zokeefe@google.comSigned-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      50ad2f24
    • Zach O'Keefe's avatar
      mm/khugepaged: dedup and simplify hugepage alloc and charging · 9710a78a
      Zach O'Keefe authored
      The following code is duplicated in collapse_huge_page() and
      collapse_file():
      
              gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
      
      	new_page = khugepaged_alloc_page(hpage, gfp, node);
              if (!new_page) {
                      result = SCAN_ALLOC_HUGE_PAGE_FAIL;
                      goto out;
              }
      
              if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
                      result = SCAN_CGROUP_CHARGE_FAIL;
                      goto out;
              }
              count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
      
      Also, "node" is passed as an argument to both collapse_huge_page() and
      collapse_file() and obtained the same way, via
      khugepaged_find_target_node().
      
      Move all this into a new helper, alloc_charge_hpage(), and remove the
      duplicate code from collapse_huge_page() and collapse_file().  Also,
      simplify khugepaged_alloc_page() by returning a bool indicating allocation
      success instead of a copy of the allocated struct page *.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-5-zokeefe@google.comSigned-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Suggested-by: default avatarPeter Xu <peterx@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      9710a78a
    • Zach O'Keefe's avatar
      mm/khugepaged: add struct collapse_control · 34d6b470
      Zach O'Keefe authored
      Modularize hugepage collapse by introducing struct collapse_control.  This
      structure serves to describe the properties of the requested collapse, as
      well as serve as a local scratch pad to use during the collapse itself.
      
      Start by moving global per-node khugepaged statistics into this new
      structure.  Note that this structure is still statically allocated since
      CONFIG_NODES_SHIFT might be arbitrary large, and stack-allocating a
      MAX_NUMNODES-sized array could cause -Wframe-large-than= errors.
      
      [zokeefe@google.com: use minimal bits to store num page < HPAGE_PMD_NR]
        Link: https://lkml.kernel.org/r/20220720140603.1958773-2-zokeefe@google.com
        Link: https://lore.kernel.org/linux-mm/Ys2CeIm%2FQmQwWh9a@google.com/
      [sfr@canb.auug.org.au: fix build]
        Link: https://lkml.kernel.org/r/20220721195508.15f1e07a@canb.auug.org.au
      [zokeefe@google.com: fix struct collapse_control load_node definition]
        Link: https://lore.kernel.org/linux-mm/202209021349.F73i5d6X-lkp@intel.com/
        Link: https://lkml.kernel.org/r/20220903021221.1130021-1-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220706235936.2197195-4-zokeefe@google.comSigned-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      34d6b470
    • Yang Shi's avatar
      mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA · c6a7f445
      Yang Shi authored
      Patch series "mm: userspace hugepage collapse", v7.
      
      Introduction
      --------------------------------
      
      This series provides a mechanism for userspace to induce a collapse of
      eligible ranges of memory into transparent hugepages in process context,
      thus permitting users to more tightly control their own hugepage
      utilization policy at their own expense.
      
      This idea was introduced by David Rientjes[5].
      
      Interface
      --------------------------------
      
      The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
      leverages the new process_madvise(2) call.
      
      process_madvise(2)
      
      	Performs a synchronous collapse of the native pages
      	mapped by the list of iovecs into transparent hugepages.
      
      	This operation is independent of the system THP sysfs settings,
      	but attempts to collapse VMAs marked VM_NOHUGEPAGE will still fail.
      
      	THP allocation may enter direct reclaim and/or compaction.
      
      	When a range spans multiple VMAs, the semantics of the collapse
      	over of each VMA is independent from the others.
      
      	Caller must have CAP_SYS_ADMIN if not acting on self.
      
      	Return value follows existing process_madvise(2) conventions.  A
      	“success” indicates that all hugepage-sized/aligned regions
      	covered by the provided range were either successfully
      	collapsed, or were already pmd-mapped THPs.
      
      madvise(2)
      
      	Equivalent to process_madvise(2) on self, with 0 returned on
      	“success”.
      
      Current Use-Cases
      --------------------------------
      
      (1)	Immediately back executable text by THPs.  Current support provided
      	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
      	system which might impair services from serving at their full rated
      	load after (re)starting.  Tricks like mremap(2)'ing text onto
      	anonymous memory to immediately realize iTLB performance prevents
      	page sharing and demand paging, both of which increase steady state
      	memory footprint.  With MADV_COLLAPSE, we get the best of both
      	worlds: Peak upfront performance and lower RAM footprints.  Note
      	that subsequent support for file-backed memory is required here.
      
      (2)	malloc() implementations that manage memory in hugepage-sized
      	chunks, but sometimes subrelease memory back to the system in
      	native-sized chunks via MADV_DONTNEED; zapping the pmd.  Later,
      	when the memory is hot, the implementation could
      	madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
      	hugepage coverage and dTLB performance.  TCMalloc is such an
      	implementation that could benefit from this[6].  A prior study of
      	Google internal workloads during evaluation of Temeraire, a
      	hugepage-aware enhancement to TCMalloc, showed that nearly 20% of
      	all cpu cycles were spent in dTLB stalls, and that increasing
      	hugepage coverage by even small amount can help with that[7].
      
      (3)	userfaultfd-based live migration of virtual machines satisfy UFFD
      	faults by fetching native-sized pages over the network (to avoid
      	latency of transferring an entire hugepage).  However, after guest
      	memory has been fully copied to the new host, MADV_COLLAPSE can
      	be used to immediately increase guest performance.  Note that
      	subsequent support for file/shmem-backed memory is required here.
      
      (4)	HugeTLB high-granularity mapping allows HugeTLB a HugeTLB page to
      	be mapped at different levels in the page tables[8].  As it's not
      	"transparent" like THP, HugeTLB high-granularity mappings require
      	an explicit user API. It is intended that MADV_COLLAPSE be co-opted
      	for this use case[9].  Note that subsequent support for HugeTLB
      	memory is required here.
      
      Future work
      --------------------------------
      
      Only private anonymous memory is supported by this series. File and
      shmem memory support will be added later.
      
      One possible user of this functionality is a userspace agent that
      attempts to optimize THP utilization system-wide by allocating THPs
      based on, for example, task priority, task performance requirements, or
      heatmaps.  For the latter, one idea that has already surfaced is using
      DAMON to identify hot regions, and driving THP collapse through a new
      DAMOS_COLLAPSE scheme[10].
      
      
      This patch (of 17):
      
      The khugepaged has optimization to reduce huge page allocation calls for
      !CONFIG_NUMA by carrying the allocated but failed to collapse huge page to
      the next loop.  CONFIG_NUMA doesn't do so since the next loop may try to
      collapse huge page from a different node, so it doesn't make too much
      sense to carry it.
      
      But when NUMA=n, the huge page is allocated by khugepaged_prealloc_page()
      before scanning the address space, so it means huge page may be allocated
      even though there is no suitable range for collapsing.  Then the page
      would be just freed if khugepaged already made enough progress.  This
      could make NUMA=n run have 5 times as much thp_collapse_alloc as NUMA=y
      run.  This problem actually makes things worse due to the way more
      pointless THP allocations and makes the optimization pointless.
      
      This could be fixed by carrying the huge page across scans, but it will
      complicate the code further and the huge page may be carried indefinitely.
      But if we take one step back, the optimization itself seems not worth
      keeping nowadays since:
      
        * Not too many users build NUMA=n kernel nowadays even though the kernel is
          actually running on a non-NUMA machine. Some small devices may run NUMA=n
          kernel, but I don't think they actually use THP.
        * Since commit 44042b44 ("mm/page_alloc: allow high-order pages to be
          stored on the per-cpu lists"), THP could be cached by pcp.  This actually
          somehow does the job done by the optimization.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-1-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220706235936.2197195-3-zokeefe@google.comSigned-off-by: default avatarYang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Co-developed-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c6a7f445
  2. 28 Aug, 2022 25 commits
  3. 27 Aug, 2022 5 commits