1. 01 Jul, 2021 40 commits
    • David Hildenbrand's avatar
      mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables · 4ca9b385
      David Hildenbrand authored
      I. Background: Sparse Memory Mappings
      
      When we manage sparse memory mappings dynamically in user space - also
      sometimes involving MAP_NORESERVE - we want to dynamically populate/
      discard memory inside such a sparse memory region.  Example users are
      hypervisors (especially implementing memory ballooning or similar
      technologies like virtio-mem) and memory allocators.  In addition, we want
      to fail in a nice way (instead of generating SIGBUS) if populating does
      not succeed because we are out of backend memory (which can happen easily
      with file-based mappings, especially tmpfs and hugetlbfs).
      
      While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
      reliably discarding memory for most mapping types, there is no generic
      approach to populate page tables and preallocate memory.
      
      Although mmap() supports MAP_POPULATE, it is not applicable to the concept
      of sparse memory mappings, where we want to populate/discard dynamically
      and avoid expensive/problematic remappings.  In addition, we never
      actually report errors during the final populate phase - it is best-effort
      only.
      
      fallocate() can be used to preallocate file-based memory and fail in a
      safe way.  However, it cannot really be used for any private mappings on
      anonymous files via memfd due to COW semantics.  In addition, fallocate()
      does not actually populate page tables, so we still always get pagefaults
      on first access - which is sometimes undesired (i.e., real-time workloads)
      and requires real prefaulting of page tables, not just a preallocation of
      backend storage.  There might be interesting use cases for sparse memory
      regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy
      as it does not prefault page tables.
      
      II. On preallcoation/prefaulting from user space
      
      Because we don't have a proper interface, what applications (like QEMU and
      databases) end up doing is touching (i.e., reading+writing one byte to not
      overwrite existing data) all individual pages.
      
      However, that approach
      1) Can result in wear on storage backing, because we end up reading/writing
         each page; this is especially a problem for dax/pmem.
      2) Can result in mmap_sem contention when prefaulting via multiple
         threads.
      3) Requires expensive signal handling, especially to catch SIGBUS in case
         of hugetlbfs/shmem/file-backed memory. For example, this is
         problematic in hypervisors like QEMU where SIGBUS handlers might already
         be used by other subsystems concurrently to e.g, handle hardware errors.
         "Simply" doing preallocation concurrently from other thread is not that
         easy.
      
      III. On MADV_WILLNEED
      
      Extending MADV_WILLNEED is not an option because
      1. It would change the semantics: "Expect access in the near future." and
         "might be a good idea to read some pages" vs. "Definitely populate/
         preallocate all memory and definitely fail on errors.".
      2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
         don't want populate/prealloc semantics. They treat this rather as a hint
         to give a little performance boost without too much overhead - and don't
         expect that a lot of memory might get consumed or a lot of time
         might be spent.
      
      IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE
      
      Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by
      MAP_POPULATE, with the following semantics:
      1. MADV_POPULATE_READ can be used to prefault page tables just like
         manually reading each individual page. This will not break any COW
         mappings. The shared zero page might get mapped and no backend storage
         might get preallocated -- allocation might be deferred to
         write-fault time. Especially shared file mappings require an explicit
         fallocate() upfront to actually preallocate backend memory (blocks in
         the file system) in case the file might have holes.
      2. If MADV_POPULATE_READ succeeds, all page tables have been populated
         (prefaulted) readable once.
      3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
         prefault page tables just like manually writing (or
         reading+writing) each individual page. This will break any COW
         mappings -- e.g., the shared zeropage is never populated.
      4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
         (prefaulted) writable once.
      5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
         mappings marked with VM_PFNMAP and VM_IO. Also, proper access
         permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
         mapping is encountered, madvise() fails with -EINVAL.
      6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
         might have been populated.
      7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
         when encountering a HW poisoned page in the range.
      8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
         cannot protect from the OOM (Out Of Memory) handler killing the
         process.
      
      While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
      preallocate memory and prefault page tables for VMs), one issue is that
      whenever we prefault pages writable, the pages have to be marked dirty,
      because the CPU could dirty them any time.  while not a real problem for
      hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
      page will be marked dirty and has to be written back later when evicting.
      
      MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
      mapping from backend storage without marking it dirty, such that eviction
      won't have to write it back.  As discussed above, shared file mappings
      might require an explciit fallocate() upfront to achieve
      preallcoation+prepopulation.
      
      Although sparse memory mappings are the primary use case, this will also
      be useful for other preallocate/prefault use cases where MAP_POPULATE is
      not desired or the semantics of MAP_POPULATE are not sufficient: as one
      example, QEMU users can trigger preallocation/prefaulting of guest RAM
      after the mapping was created -- and don't want errors to be silently
      suppressed.
      
      Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
      however, the main motivation back than was performance improvements --
      which should also still be the case.
      
      V. Single-threaded performance comparison
      
      I did a short experiment, prefaulting page tables on completely *empty
      mappings/files* and repeated the experiment 10 times.  The results
      correspond to the shortest execution time.  In general, the performance
      benefit for huge pages is negligible with small mappings.
      
      V.1: Private mappings
      
      POPULATE_READ and POPULATE_WRITE is fastest.  Note that
      Reading/POPULATE_READ will populate the shared zeropage where applicable
      -- which result in short population times.
      
      The fastest way to allocate backend storage (here: swap or huge pages) and
      prefault page tables is POPULATE_WRITE.
      
      V.2: Shared mappings
      
      fallocate() is fastest, however, doesn't prefault page tables.
      POPULATE_WRITE is faster than simple writes and read/writes.
      POPULATE_READ is faster than simple reads.
      
      Without a fd, the fastest way to allocate backend storage and prefault
      page tables is POPULATE_WRITE.  With an fd, the fastest way is usually
      FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one
      exception are actual files: FALLOCATE+Read is slightly faster than
      FALLOCATE+POPULATE_READ.
      
      The fastest way to allocate backend storage prefault page tables is
      FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then,
      FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as
      dirty.
      
      v.3: Detailed results
      
      ==================================================
      2 MiB MAP_PRIVATE:
      **************************************************
      Anon 4 KiB     : Read                     :     0.119 ms
      Anon 4 KiB     : Write                    :     0.222 ms
      Anon 4 KiB     : Read/Write               :     0.380 ms
      Anon 4 KiB     : POPULATE_READ            :     0.060 ms
      Anon 4 KiB     : POPULATE_WRITE           :     0.158 ms
      Memfd 4 KiB    : Read                     :     0.034 ms
      Memfd 4 KiB    : Write                    :     0.310 ms
      Memfd 4 KiB    : Read/Write               :     0.362 ms
      Memfd 4 KiB    : POPULATE_READ            :     0.039 ms
      Memfd 4 KiB    : POPULATE_WRITE           :     0.229 ms
      Memfd 2 MiB    : Read                     :     0.030 ms
      Memfd 2 MiB    : Write                    :     0.030 ms
      Memfd 2 MiB    : Read/Write               :     0.030 ms
      Memfd 2 MiB    : POPULATE_READ            :     0.030 ms
      Memfd 2 MiB    : POPULATE_WRITE           :     0.030 ms
      tmpfs          : Read                     :     0.033 ms
      tmpfs          : Write                    :     0.313 ms
      tmpfs          : Read/Write               :     0.406 ms
      tmpfs          : POPULATE_READ            :     0.039 ms
      tmpfs          : POPULATE_WRITE           :     0.285 ms
      file           : Read                     :     0.033 ms
      file           : Write                    :     0.351 ms
      file           : Read/Write               :     0.408 ms
      file           : POPULATE_READ            :     0.039 ms
      file           : POPULATE_WRITE           :     0.290 ms
      hugetlbfs      : Read                     :     0.030 ms
      hugetlbfs      : Write                    :     0.030 ms
      hugetlbfs      : Read/Write               :     0.030 ms
      hugetlbfs      : POPULATE_READ            :     0.030 ms
      hugetlbfs      : POPULATE_WRITE           :     0.030 ms
      **************************************************
      4096 MiB MAP_PRIVATE:
      **************************************************
      Anon 4 KiB     : Read                     :   237.940 ms
      Anon 4 KiB     : Write                    :   708.409 ms
      Anon 4 KiB     : Read/Write               :  1054.041 ms
      Anon 4 KiB     : POPULATE_READ            :   124.310 ms
      Anon 4 KiB     : POPULATE_WRITE           :   572.582 ms
      Memfd 4 KiB    : Read                     :   136.928 ms
      Memfd 4 KiB    : Write                    :   963.898 ms
      Memfd 4 KiB    : Read/Write               :  1106.561 ms
      Memfd 4 KiB    : POPULATE_READ            :    78.450 ms
      Memfd 4 KiB    : POPULATE_WRITE           :   805.881 ms
      Memfd 2 MiB    : Read                     :   357.116 ms
      Memfd 2 MiB    : Write                    :   357.210 ms
      Memfd 2 MiB    : Read/Write               :   357.606 ms
      Memfd 2 MiB    : POPULATE_READ            :   356.094 ms
      Memfd 2 MiB    : POPULATE_WRITE           :   356.937 ms
      tmpfs          : Read                     :   137.536 ms
      tmpfs          : Write                    :   954.362 ms
      tmpfs          : Read/Write               :  1105.954 ms
      tmpfs          : POPULATE_READ            :    80.289 ms
      tmpfs          : POPULATE_WRITE           :   822.826 ms
      file           : Read                     :   137.874 ms
      file           : Write                    :   987.025 ms
      file           : Read/Write               :  1107.439 ms
      file           : POPULATE_READ            :    80.413 ms
      file           : POPULATE_WRITE           :   857.622 ms
      hugetlbfs      : Read                     :   355.607 ms
      hugetlbfs      : Write                    :   355.729 ms
      hugetlbfs      : Read/Write               :   356.127 ms
      hugetlbfs      : POPULATE_READ            :   354.585 ms
      hugetlbfs      : POPULATE_WRITE           :   355.138 ms
      **************************************************
      2 MiB MAP_SHARED:
      **************************************************
      Anon 4 KiB     : Read                     :     0.394 ms
      Anon 4 KiB     : Write                    :     0.348 ms
      Anon 4 KiB     : Read/Write               :     0.400 ms
      Anon 4 KiB     : POPULATE_READ            :     0.326 ms
      Anon 4 KiB     : POPULATE_WRITE           :     0.273 ms
      Anon 2 MiB     : Read                     :     0.030 ms
      Anon 2 MiB     : Write                    :     0.030 ms
      Anon 2 MiB     : Read/Write               :     0.030 ms
      Anon 2 MiB     : POPULATE_READ            :     0.030 ms
      Anon 2 MiB     : POPULATE_WRITE           :     0.030 ms
      Memfd 4 KiB    : Read                     :     0.412 ms
      Memfd 4 KiB    : Write                    :     0.372 ms
      Memfd 4 KiB    : Read/Write               :     0.419 ms
      Memfd 4 KiB    : POPULATE_READ            :     0.343 ms
      Memfd 4 KiB    : POPULATE_WRITE           :     0.288 ms
      Memfd 4 KiB    : FALLOCATE                :     0.137 ms
      Memfd 4 KiB    : FALLOCATE+Read           :     0.446 ms
      Memfd 4 KiB    : FALLOCATE+Write          :     0.330 ms
      Memfd 4 KiB    : FALLOCATE+Read/Write     :     0.454 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_READ  :     0.379 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_WRITE :     0.268 ms
      Memfd 2 MiB    : Read                     :     0.030 ms
      Memfd 2 MiB    : Write                    :     0.030 ms
      Memfd 2 MiB    : Read/Write               :     0.030 ms
      Memfd 2 MiB    : POPULATE_READ            :     0.030 ms
      Memfd 2 MiB    : POPULATE_WRITE           :     0.030 ms
      Memfd 2 MiB    : FALLOCATE                :     0.030 ms
      Memfd 2 MiB    : FALLOCATE+Read           :     0.031 ms
      Memfd 2 MiB    : FALLOCATE+Write          :     0.031 ms
      Memfd 2 MiB    : FALLOCATE+Read/Write     :     0.031 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_READ  :     0.030 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_WRITE :     0.030 ms
      tmpfs          : Read                     :     0.416 ms
      tmpfs          : Write                    :     0.369 ms
      tmpfs          : Read/Write               :     0.425 ms
      tmpfs          : POPULATE_READ            :     0.346 ms
      tmpfs          : POPULATE_WRITE           :     0.295 ms
      tmpfs          : FALLOCATE                :     0.139 ms
      tmpfs          : FALLOCATE+Read           :     0.447 ms
      tmpfs          : FALLOCATE+Write          :     0.333 ms
      tmpfs          : FALLOCATE+Read/Write     :     0.454 ms
      tmpfs          : FALLOCATE+POPULATE_READ  :     0.380 ms
      tmpfs          : FALLOCATE+POPULATE_WRITE :     0.272 ms
      file           : Read                     :     0.191 ms
      file           : Write                    :     0.511 ms
      file           : Read/Write               :     0.524 ms
      file           : POPULATE_READ            :     0.196 ms
      file           : POPULATE_WRITE           :     0.434 ms
      file           : FALLOCATE                :     0.004 ms
      file           : FALLOCATE+Read           :     0.197 ms
      file           : FALLOCATE+Write          :     0.554 ms
      file           : FALLOCATE+Read/Write     :     0.480 ms
      file           : FALLOCATE+POPULATE_READ  :     0.201 ms
      file           : FALLOCATE+POPULATE_WRITE :     0.381 ms
      hugetlbfs      : Read                     :     0.030 ms
      hugetlbfs      : Write                    :     0.030 ms
      hugetlbfs      : Read/Write               :     0.030 ms
      hugetlbfs      : POPULATE_READ            :     0.030 ms
      hugetlbfs      : POPULATE_WRITE           :     0.030 ms
      hugetlbfs      : FALLOCATE                :     0.030 ms
      hugetlbfs      : FALLOCATE+Read           :     0.031 ms
      hugetlbfs      : FALLOCATE+Write          :     0.031 ms
      hugetlbfs      : FALLOCATE+Read/Write     :     0.030 ms
      hugetlbfs      : FALLOCATE+POPULATE_READ  :     0.030 ms
      hugetlbfs      : FALLOCATE+POPULATE_WRITE :     0.030 ms
      **************************************************
      4096 MiB MAP_SHARED:
      **************************************************
      Anon 4 KiB     : Read                     :  1053.090 ms
      Anon 4 KiB     : Write                    :   913.642 ms
      Anon 4 KiB     : Read/Write               :  1060.350 ms
      Anon 4 KiB     : POPULATE_READ            :   893.691 ms
      Anon 4 KiB     : POPULATE_WRITE           :   782.885 ms
      Anon 2 MiB     : Read                     :   358.553 ms
      Anon 2 MiB     : Write                    :   358.419 ms
      Anon 2 MiB     : Read/Write               :   357.992 ms
      Anon 2 MiB     : POPULATE_READ            :   357.533 ms
      Anon 2 MiB     : POPULATE_WRITE           :   357.808 ms
      Memfd 4 KiB    : Read                     :  1078.144 ms
      Memfd 4 KiB    : Write                    :   942.036 ms
      Memfd 4 KiB    : Read/Write               :  1100.391 ms
      Memfd 4 KiB    : POPULATE_READ            :   925.829 ms
      Memfd 4 KiB    : POPULATE_WRITE           :   804.394 ms
      Memfd 4 KiB    : FALLOCATE                :   304.632 ms
      Memfd 4 KiB    : FALLOCATE+Read           :  1163.359 ms
      Memfd 4 KiB    : FALLOCATE+Write          :   933.186 ms
      Memfd 4 KiB    : FALLOCATE+Read/Write     :  1187.304 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_READ  :  1013.660 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_WRITE :   794.560 ms
      Memfd 2 MiB    : Read                     :   358.131 ms
      Memfd 2 MiB    : Write                    :   358.099 ms
      Memfd 2 MiB    : Read/Write               :   358.250 ms
      Memfd 2 MiB    : POPULATE_READ            :   357.563 ms
      Memfd 2 MiB    : POPULATE_WRITE           :   357.334 ms
      Memfd 2 MiB    : FALLOCATE                :   356.735 ms
      Memfd 2 MiB    : FALLOCATE+Read           :   358.152 ms
      Memfd 2 MiB    : FALLOCATE+Write          :   358.331 ms
      Memfd 2 MiB    : FALLOCATE+Read/Write     :   358.018 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_READ  :   357.286 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_WRITE :   357.523 ms
      tmpfs          : Read                     :  1087.265 ms
      tmpfs          : Write                    :   950.840 ms
      tmpfs          : Read/Write               :  1107.567 ms
      tmpfs          : POPULATE_READ            :   922.605 ms
      tmpfs          : POPULATE_WRITE           :   810.094 ms
      tmpfs          : FALLOCATE                :   306.320 ms
      tmpfs          : FALLOCATE+Read           :  1169.796 ms
      tmpfs          : FALLOCATE+Write          :   933.730 ms
      tmpfs          : FALLOCATE+Read/Write     :  1191.610 ms
      tmpfs          : FALLOCATE+POPULATE_READ  :  1020.474 ms
      tmpfs          : FALLOCATE+POPULATE_WRITE :   798.945 ms
      file           : Read                     :   654.101 ms
      file           : Write                    :  1259.142 ms
      file           : Read/Write               :  1289.509 ms
      file           : POPULATE_READ            :   661.642 ms
      file           : POPULATE_WRITE           :  1106.816 ms
      file           : FALLOCATE                :     1.864 ms
      file           : FALLOCATE+Read           :   656.328 ms
      file           : FALLOCATE+Write          :  1153.300 ms
      file           : FALLOCATE+Read/Write     :  1180.613 ms
      file           : FALLOCATE+POPULATE_READ  :   668.347 ms
      file           : FALLOCATE+POPULATE_WRITE :   996.143 ms
      hugetlbfs      : Read                     :   357.245 ms
      hugetlbfs      : Write                    :   357.413 ms
      hugetlbfs      : Read/Write               :   357.120 ms
      hugetlbfs      : POPULATE_READ            :   356.321 ms
      hugetlbfs      : POPULATE_WRITE           :   356.693 ms
      hugetlbfs      : FALLOCATE                :   355.927 ms
      hugetlbfs      : FALLOCATE+Read           :   357.074 ms
      hugetlbfs      : FALLOCATE+Write          :   357.120 ms
      hugetlbfs      : FALLOCATE+Read/Write     :   356.983 ms
      hugetlbfs      : FALLOCATE+POPULATE_READ  :   356.413 ms
      hugetlbfs      : FALLOCATE+POPULATE_WRITE :   356.266 ms
      **************************************************
      
      [1] https://lkml.org/lkml/2013/6/27/698
      
      [akpm@linux-foundation.org: coding style fixes]
      
      Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ca9b385
    • David Hildenbrand's avatar
      mm: make variable names for populate_vma_page_range() consistent · a78f1ccd
      David Hildenbrand authored
      Patch series "mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables", v2.
      
      Excessive details on MADV_POPULATE_(READ|WRITE) can be found in patch #2.
      
      This patch (of 5):
      
      Let's make the variable names in the function declaration match the
      variable names used in the definition.
      
      Link: https://lkml.kernel.org/r/20210419135443.12822-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20210419135443.12822-2-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a78f1ccd
    • Kefeng Wang's avatar
      mm: generalize ZONE_[DMA|DMA32] · 63703f37
      Kefeng Wang authored
      ZONE_[DMA|DMA32] configs have duplicate definitions on platforms that
      subscribe to them.  Instead, just make them generic options which can be
      selected on applicable platforms.
      
      Also only x86/arm64 architectures could enable both ZONE_DMA and
      ZONE_DMA32 if EXPERT, add ARCH_HAS_ZONE_DMA_SET to make dma zone
      configurable and visible on the two architectures.
      
      Link: https://lkml.kernel.org/r/20210528074557.17768-1-wangkefeng.wang@huawei.comSigned-off-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Acked-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>	[RISC-V]
      Acked-by: Michal Simek <michal.simek@xilinx.com>	[microblaze]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      63703f37
    • Liam Howlett's avatar
      mm/nommu: unexport do_munmap() · db1d9152
      Liam Howlett authored
      do_munmap() does not take the mmap_write_lock().  vm_munmap() should be
      used instead.
      
      Link: https://lkml.kernel.org/r/20210604194002.648037-1-Liam.Howlett@Oracle.comSigned-off-by: default avatarLiam R. Howlett <Liam.Howlett@Oracle.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      db1d9152
    • Chen Li's avatar
      nommu: remove __GFP_HIGHMEM in vmalloc/vzalloc · 176056fd
      Chen Li authored
      mm/nommu.c:
      void *__vmalloc(unsigned long size, gfp_t gfp_mask)
      {
      	/*
      	 *  You can't specify __GFP_HIGHMEM with kmalloc() since kmalloc()
      	 * returns only a logical address.
      	 */
      	return kmalloc(size, (gfp_mask | __GFP_COMP) & ~__GFP_HIGHMEM);
      }
      
      nommu's __vmalloc just uses kmalloc internally and elimitates
      __GFP_HIGHMEM, so it makes no sense to add __GFP_HIGHMEM for nommu's
      vmalloc/vzalloc.
      
      [akpm@linux-foundation.org: coding style fixes]
      
      Link: https://lkml.kernel.org/r/875z00rnp8.wl-chenli@uniontech.comSigned-off-by: default avatarChen Li <chenli@uniontech.com>
      Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      176056fd
    • Matthew Wilcox (Oracle)'s avatar
      mm/thp: fix strncpy warning · 1212e00c
      Matthew Wilcox (Oracle) authored
      Using MAX_INPUT_BUF_SZ as the maximum length of the string makes fortify
      complain as it thinks the string might be longer than the buffer, and if
      it is, we will end up with a "string" that is missing a NUL terminator.
      It's trivial to show that 'tok' points to a NUL-terminated string which is
      less than MAX_INPUT_BUF_SZ in length, so we may as well just use strcpy()
      and avoid the warning.
      
      Link: https://lkml.kernel.org/r/20210615200242.1716568-4-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1212e00c
    • Hugh Dickins's avatar
      mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC · 36af6737
      Hugh Dickins authored
      TTU_SYNC prevents an unlikely race, when try_to_unmap() returns shortly
      before the page is accounted as unmapped.  It is unlikely to coincide with
      hwpoisoning, but now that we have the flag, hwpoison_user_mappings() would
      do well to use it.
      
      Link: https://lkml.kernel.org/r/329c28ed-95df-9a2c-8893-b444d8a6d340@google.comSigned-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      36af6737
    • Hugh Dickins's avatar
      mm/thp: remap_page() is only needed on anonymous THP · ab02c252
      Hugh Dickins authored
      THP splitting's unmap_page() only sets TTU_SPLIT_FREEZE when PageAnon, and
      migration entries are only inserted when TTU_MIGRATION (unused here) or
      TTU_SPLIT_FREEZE is set: so it's just a waste of time for remap_page() to
      search for migration entries to remove when !PageAnon.
      
      Link: https://lkml.kernel.org/r/f987bc44-f28e-688d-2424-b4722153ed8@google.com
      Fixes: baa355fd ("thp: file pages support for split_huge_page()")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ab02c252
    • Yang Shi's avatar
      mm: rmap: make try_to_unmap() void function · 1fb08ac6
      Yang Shi authored
      Currently try_to_unmap() return bool value by checking page_mapcount(),
      however this may return false positive since page_mapcount() doesn't check
      all subpages of compound page.  The total_mapcount() could be used
      instead, but its cost is higher since it traverses all subpages.
      
      Actually the most callers of try_to_unmap() don't care about the return
      value at all.  So just need check if page is still mapped by page_mapped()
      when necessary.  And page_mapped() does bail out early when it finds
      mapped subpage.
      
      Link: https://lkml.kernel.org/r/bb27e3fe-6036-b637-5086-272befbfe3da@google.comSuggested-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarYang Shi <shy828301@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1fb08ac6
    • Anshuman Khandual's avatar
      mm/thp: make ARCH_ENABLE_SPLIT_PMD_PTLOCK dependent on PGTABLE_LEVELS > 2 · cebc774f
      Anshuman Khandual authored
      ARCH_ENABLE_SPLIT_PMD_PTLOCK is irrelevant unless there are more than two
      page table levels including PMD (also per
      Documentation/vm/split_page_table_lock.rst).  Make this dependency
      explicit on remaining platforms i.e x86 and s390 where
      ARCH_ENABLE_SPLIT_PMD_PTLOCK is subscribed.
      
      Link: https://lkml.kernel.org/r/1622013501-20409-1-git-send-email-anshuman.khandual@arm.comSigned-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> # s390
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cebc774f
    • Yang Shi's avatar
      mm: thp: skip make PMD PROT_NONE if THP migration is not supported · e346e668
      Yang Shi authored
      A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support both
      NUMA balancing and THP.  But S390 doesn't support THP migration so NUMA
      balancing actually can't migrate any misplaced pages.
      
      Skip make PMD PROT_NONE for such case otherwise CPU cycles may be wasted
      by pointless NUMA hinting faults on S390.
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-8-shy828301@gmail.comSigned-off-by: default avatarYang Shi <shy828301@gmail.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e346e668
    • Yang Shi's avatar
      mm: migrate: check mapcount for THP instead of refcount · 662aeea7
      Yang Shi authored
      The generic migration path will check refcount, so no need check refcount
      here.  But the old code actually prevents from migrating shared THP
      (mapped by multiple processes), so bail out early if mapcount is > 1 to
      keep the behavior.
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-7-shy828301@gmail.comSigned-off-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      662aeea7
    • Yang Shi's avatar
      mm: migrate: don't split THP for misplaced NUMA page · b0b515bf
      Yang Shi authored
      The old behavior didn't split THP if migration is failed due to lack of
      memory on the target node.  But the THP migration does split THP, so keep
      the old behavior for misplaced NUMA page migration.
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-6-shy828301@gmail.comSigned-off-by: default avatarYang Shi <shy828301@gmail.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b0b515bf
    • Yang Shi's avatar
      mm: migrate: account THP NUMA migration counters correctly · c5fc5c3a
      Yang Shi authored
      Now both base page and THP NUMA migration is done via
      migrate_misplaced_page(), keep the counters correctly for THP.
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-5-shy828301@gmail.comSigned-off-by: default avatarYang Shi <shy828301@gmail.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c5fc5c3a
    • Yang Shi's avatar
      mm: thp: refactor NUMA fault handling · c5b5a3dd
      Yang Shi authored
      When the THP NUMA fault support was added THP migration was not supported
      yet.  So the ad hoc THP migration was implemented in NUMA fault handling.
      Since v4.14 THP migration has been supported so it doesn't make too much
      sense to still keep another THP migration implementation rather than using
      the generic migration code.
      
      This patch reworks the NUMA fault handling to use generic migration
      implementation to migrate misplaced page.  There is no functional change.
      
      After the refactor the flow of NUMA fault handling looks just like its
      PTE counterpart:
        Acquire ptl
        Prepare for migration (elevate page refcount)
        Release ptl
        Isolate page from lru and elevate page refcount
        Migrate the misplaced THP
      
      If migration fails just restore the old normal PMD.
      
      In the old code anon_vma lock was needed to serialize THP migration
      against THP split, but since then the THP code has been reworked a lot, it
      seems anon_vma lock is not required anymore to avoid the race.
      
      The page refcount elevation when holding ptl should prevent from THP
      split.
      
      Use migrate_misplaced_page() for both base page and THP NUMA hinting fault
      and remove all the dead and duplicate code.
      
      [dan.carpenter@oracle.com: fix a double unlock bug]
        Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-4-shy828301@gmail.comSigned-off-by: default avatarYang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c5b5a3dd
    • Yang Shi's avatar
      mm: memory: make numa_migrate_prep() non-static · f4c0d836
      Yang Shi authored
      The numa_migrate_prep() will be used by huge NUMA fault as well in the
      following patch, make it non-static.
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-3-shy828301@gmail.comSigned-off-by: default avatarYang Shi <shy828301@gmail.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f4c0d836
    • Yang Shi's avatar
      mm: memory: add orig_pmd to struct vm_fault · 5db4f15c
      Yang Shi authored
      Pach series "mm: thp: use generic THP migration for NUMA hinting fault", v3.
      
      When the THP NUMA fault support was added THP migration was not supported
      yet.  So the ad hoc THP migration was implemented in NUMA fault handling.
      Since v4.14 THP migration has been supported so it doesn't make too much
      sense to still keep another THP migration implementation rather than using
      the generic migration code.  It is definitely a maintenance burden to keep
      two THP migration implementation for different code paths and it is more
      error prone.  Using the generic THP migration implementation allows us
      remove the duplicate code and some hacks needed by the old ad hoc
      implementation.
      
      A quick grep shows x86_64, PowerPC (book3s), ARM64 ans S390 support both
      THP and NUMA balancing.  The most of them support THP migration except for
      S390.  Zi Yan tried to add THP migration support for S390 before but it
      was not accepted due to the design of S390 PMD.  For the discussion,
      please see: https://lkml.org/lkml/2018/4/27/953.
      
      Per the discussion with Gerald Schaefer in v1 it is acceptible to skip
      huge PMD for S390 for now.
      
      I saw there were some hacks about gup from git history, but I didn't
      figure out if they have been removed or not since I just found FOLL_NUMA
      code in the current gup implementation and they seems useful.
      
      Patch #1 ~ #2 are preparation patches.
      Patch #3 is the real meat.
      Patch #4 ~ #6 keep consistent counters and behaviors with before.
      Patch #7 skips change huge PMD to prot_none if thp migration is not supported.
      
      Test
      ----
      Did some tests to measure the latency of do_huge_pmd_numa_page.  The test
      VM has 80 vcpus and 64G memory.  The test would create 2 processes to
      consume 128G memory together which would incur memory pressure to cause
      THP splits.  And it also creates 80 processes to hog cpu, and the memory
      consumer processes are bound to different nodes periodically in order to
      increase NUMA faults.
      
      The below test script is used:
      
      echo 3 > /proc/sys/vm/drop_caches
      
      # Run stress-ng for 24 hours
      ./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h &
      PID=$!
      
      ./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h &
      
      # Wait for vm stressors forked
      sleep 5
      
      PID_1=`pgrep -P $PID | awk 'NR == 1'`
      PID_2=`pgrep -P $PID | awk 'NR == 2'`
      
      JOB1=`pgrep -P $PID_1`
      JOB2=`pgrep -P $PID_2`
      
      # Bind load jobs to different nodes periodically to force generate
      # cross node memory access
      while [ -d "/proc/$PID" ]
      do
              taskset -apc 8 $JOB1
              taskset -apc 8 $JOB2
              sleep 300
              taskset -apc 58 $JOB1
              taskset -apc 58 $JOB2
              sleep 300
      done
      
      With the above test the histogram of latency of do_huge_pmd_numa_page is
      as shown below.  Since the number of do_huge_pmd_numa_page varies
      drastically for each run (should be due to scheduler), so I converted the
      raw number to percentage.
      
                                   patched               base
      @us[stress-ng]:
      [0]                          3.57%                 0.16%
      [1]                          55.68%                18.36%
      [2, 4)                       10.46%                40.44%
      [4, 8)                       7.26%                 17.82%
      [8, 16)                      21.12%                13.41%
      [16, 32)                     1.06%                 4.27%
      [32, 64)                     0.56%                 4.07%
      [64, 128)                    0.16%                 0.35%
      [128, 256)                   < 0.1%                < 0.1%
      [256, 512)                   < 0.1%                < 0.1%
      [512, 1K)                    < 0.1%                < 0.1%
      [1K, 2K)                     < 0.1%                < 0.1%
      [2K, 4K)                     < 0.1%                < 0.1%
      [4K, 8K)                     < 0.1%                < 0.1%
      [8K, 16K)                    < 0.1%                < 0.1%
      [16K, 32K)                   < 0.1%                < 0.1%
      [32K, 64K)                   < 0.1%                < 0.1%
      
      Per the result, patched kernel is even slightly better than the base
      kernel.  I think this is because the lock contention against THP split is
      less than base kernel due to the refactor.
      
      To exclude the affect from THP split, I also did test w/o memory pressure.
      No obvious regression is spotted.  The below is the test result *w/o*
      memory pressure.
      
                                 patched                  base
      @us[stress-ng]:
      [0]                        7.97%                   18.4%
      [1]                        69.63%                  58.24%
      [2, 4)                     4.18%                   2.63%
      [4, 8)                     0.22%                   0.17%
      [8, 16)                    1.03%                   0.92%
      [16, 32)                   0.14%                   < 0.1%
      [32, 64)                   < 0.1%                  < 0.1%
      [64, 128)                  < 0.1%                  < 0.1%
      [128, 256)                 < 0.1%                  < 0.1%
      [256, 512)                 0.45%                   1.19%
      [512, 1K)                  15.45%                  17.27%
      [1K, 2K)                   < 0.1%                  < 0.1%
      [2K, 4K)                   < 0.1%                  < 0.1%
      [4K, 8K)                   < 0.1%                  < 0.1%
      [8K, 16K)                  0.86%                   0.88%
      [16K, 32K)                 < 0.1%                  0.15%
      [32K, 64K)                 < 0.1%                  < 0.1%
      [64K, 128K)                < 0.1%                  < 0.1%
      [128K, 256K)               < 0.1%                  < 0.1%
      
      The series also survived a series of tests that exercise NUMA balancing
      migrations by Mel.
      
      This patch (of 7):
      
      Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by huge
      page fault could be removed, just like its PTE counterpart does.
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-1-shy828301@gmail.com
      Link: https://lkml.kernel.org/r/20210518200801.7413-2-shy828301@gmail.comSigned-off-by: default avatarYang Shi <shy828301@gmail.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5db4f15c
    • Collin Fijalkovich's avatar
      mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs · eb6ecbed
      Collin Fijalkovich authored
      Transparent huge pages are supported for read-only non-shmem files, but
      are only used for vmas with VM_DENYWRITE.  This condition ensures that
      file THPs are protected from writes while an application is running
      (ETXTBSY).  Any existing file THPs are then dropped from the page cache
      when a file is opened for write in do_dentry_open().  Since sys_mmap
      ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
      produced by execve().
      
      Systems that make heavy use of shared libraries (e.g.  Android) are unable
      to apply VM_DENYWRITE through the dynamic linker, preventing them from
      benefiting from the resultant reduced contention on the TLB.
      
      This patch reduces the constraint on file THPs allowing use with any
      executable mapping from a file not opened for write (see
      inode_is_open_for_write()).  It also introduces additional conditions to
      ensure that files opened for write will never be backed by file THPs.
      
      Restricting the use of THPs to executable mappings eliminates the risk
      that a read-only file later opened for write would encounter significant
      latencies due to page cache truncation.
      
      The ld linker flag '-z max-page-size=(hugepage size)' can be used to
      produce executables with the necessary layout.  The dynamic linker must
      map these file's segments at a hugepage size aligned vma for the mapping
      to be backed with THPs.
      
      Comparison of the performance characteristics of 4KB and 2MB-backed
      libraries follows; the Android dex2oat tool was used to AOT compile an
      example application on a single ARM core.
      
      4KB Pages:
      ==========
      
      count              event_name            # count / runtime
      598,995,035,942    cpu-cycles            # 1.800861 GHz
       81,195,620,851    raw-stall-frontend    # 244.112 M/sec
      347,754,466,597    iTLB-loads            # 1.046 G/sec
        2,970,248,900    iTLB-load-misses      # 0.854122% miss rate
      
      Total test time: 332.854998 seconds.
      
      2MB Pages:
      ==========
      
      count              event_name            # count / runtime
      592,872,663,047    cpu-cycles            # 1.800358 GHz
       76,485,624,143    raw-stall-frontend    # 232.261 M/sec
      350,478,413,710    iTLB-loads            # 1.064 G/sec
          803,233,322    iTLB-load-misses      # 0.229182% miss rate
      
      Total test time: 329.826087 seconds
      
      A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:
      
      /apex/com.android.art/lib64/libart.so
      FilePmdMapped:      4096 kB
      
      /apex/com.android.art/lib64/libart-compiler.so
      FilePmdMapped:      2048 kB
      
      Link: https://lkml.kernel.org/r/20210406000930.3455850-1-cfijalkovich@google.comSigned-off-by: default avatarCollin Fijalkovich <cfijalkovich@google.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarWilliam Kucharski <william.kucharski@oracle.com>
      Acked-by: default avatarSong Liu <song@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Hridya Valsaraju <hridya@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eb6ecbed
    • Muchun Song's avatar
      mm: migrate: fix missing update page_private to hugetlb_page_subpool · 6acfb5ba
      Muchun Song authored
      Since commit d6995da3 ("hugetlb: use page.private for hugetlb specific
      page flags") converts page.private for hugetlb specific page flags.  We
      should use hugetlb_page_subpool() to get the subpool pointer instead of
      page_private().
      
      This 'could' prevent the migration of hugetlb pages.  page_private(hpage)
      is now used for hugetlb page specific flags.  At migration time, the only
      flag which could be set is HPageVmemmapOptimized.  This flag will only be
      set if the new vmemmap reduction feature is enabled.  In addition,
      !page_mapping() implies an anonymous mapping.  So, this will prevent
      migration of hugetb pages in anonymous mappings if the vmemmap reduction
      feature is enabled.
      
      In addition, that if statement checked for the rare race condition of a
      page being migrated while in the process of being freed.  Since that check
      is now wrong, we could leak hugetlb subpool usage counts.
      
      The commit forgot to update it in the page migration routine.  So fix it.
      
      [songmuchun@bytedance.com: fix compiler error when !CONFIG_HUGETLB_PAGE reported by Randy]
        Link: https://lkml.kernel.org/r/20210521022747.35736-1-songmuchun@bytedance.com
      
      Link: https://lkml.kernel.org/r/20210520025949.1866-1-songmuchun@bytedance.com
      Fixes: d6995da3 ("hugetlb: use page.private for hugetlb specific page flags")
      Signed-off-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reported-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Tested-by: Anshuman Khandual <anshuman.khandual@arm.com>	[arm64]
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6acfb5ba
    • Anshuman Khandual's avatar
      arm64/mm: drop HAVE_ARCH_PFN_VALID · 16c9afc7
      Anshuman Khandual authored
      CONFIG_SPARSEMEM_VMEMMAP is now the only available memory model on arm64
      platforms and free_unused_memmap() would just return without creating any
      holes in the memmap mapping.  There is no need for any special handling in
      pfn_valid() and HAVE_ARCH_PFN_VALID can just be dropped.  This also moves
      the pfn upper bits sanity check into generic pfn_valid().
      
      Link: https://lkml.kernel.org/r/1621947349-25421-1-git-send-email-anshuman.khandual@arm.comSigned-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      16c9afc7
    • Mike Rapoport's avatar
      arm64: drop pfn_valid_within() and simplify pfn_valid() · a7d9f306
      Mike Rapoport authored
      The arm64's version of pfn_valid() differs from the generic because of two
      reasons:
      
      * Parts of the memory map are freed during boot. This makes it necessary to
        verify that there is actual physical memory that corresponds to a pfn
        which is done by querying memblock.
      
      * There are NOMAP memory regions. These regions are not mapped in the
        linear map and until the previous commit the struct pages representing
        these areas had default values.
      
      As the consequence of absence of the special treatment of NOMAP regions in
      the memory map it was necessary to use memblock_is_map_memory() in
      pfn_valid() and to have pfn_valid_within() aliased to pfn_valid() so that
      generic mm functionality would not treat a NOMAP page as a normal page.
      
      Since the NOMAP regions are now marked as PageReserved(), pfn walkers and
      the rest of core mm will treat them as unusable memory and thus
      pfn_valid_within() is no longer required at all and can be disabled on
      arm64.
      
      pfn_valid() can be slightly simplified by replacing
      memblock_is_map_memory() with memblock_is_memory().
      
      [rppt@kernel.org: fix merge fix]
        Link: https://lkml.kernel.org/r/YJtoQhidtIJOhYsV@kernel.org
      
      Link: https://lkml.kernel.org/r/20210511100550.28178-5-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarArd Biesheuvel <ardb@kernel.org>
      Reviewed-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a7d9f306
    • Mike Rapoport's avatar
      arm64: decouple check whether pfn is in linear map from pfn_valid() · 873ba463
      Mike Rapoport authored
      The intended semantics of pfn_valid() is to verify whether there is a
      struct page for the pfn in question and nothing else.
      
      Yet, on arm64 it is used to distinguish memory areas that are mapped in
      the linear map vs those that require ioremap() to access them.
      
      Introduce a dedicated pfn_is_map_memory() wrapper for
      memblock_is_map_memory() to perform such check and use it where
      appropriate.
      
      Using a wrapper allows to avoid cyclic include dependencies.
      
      While here also update style of pfn_valid() so that both pfn_valid() and
      pfn_is_map_memory() declarations will be consistent.
      
      Link: https://lkml.kernel.org/r/20210511100550.28178-4-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarArd Biesheuvel <ardb@kernel.org>
      Reviewed-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      873ba463
    • Mike Rapoport's avatar
      memblock: update initialization of reserved pages · 9092d4f7
      Mike Rapoport authored
      The struct pages representing a reserved memory region are initialized
      using reserve_bootmem_range() function.  This function is called for each
      reserved region just before the memory is freed from memblock to the buddy
      page allocator.
      
      The struct pages for MEMBLOCK_NOMAP regions are kept with the default
      values set by the memory map initialization which makes it necessary to
      have a special treatment for such pages in pfn_valid() and
      pfn_valid_within().
      
      Split out initialization of the reserved pages to a function with a
      meaningful name and treat the MEMBLOCK_NOMAP regions the same way as the
      reserved regions and mark struct pages for the NOMAP regions as
      PageReserved.
      
      Link: https://lkml.kernel.org/r/20210511100550.28178-3-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: default avatarArd Biesheuvel <ardb@kernel.org>
      Reviewed-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9092d4f7
    • Mike Rapoport's avatar
      include/linux/mmzone.h: add documentation for pfn_valid() · 51c656ae
      Mike Rapoport authored
      Patch series "arm64: drop pfn_valid_within() and simplify pfn_valid()", v4.
      
      These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire
      pfn_valid_within() to 1.
      
      The idea is to mark NOMAP pages as reserved in the memory map and restore
      the intended semantics of pfn_valid() to designate availability of struct
      page for a pfn.
      
      With this the core mm will be able to cope with the fact that it cannot
      use NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER
      blocks will be treated correctly even without the need for
      pfn_valid_within.
      
      This patch (of 4):
      
      Add comment describing the semantics of pfn_valid() that clarifies that
      pfn_valid() only checks for availability of a memory map entry (i.e.
      struct page) for a PFN rather than availability of usable memory backing
      that PFN.
      
      The most "generic" version of pfn_valid() used by the configurations with
      SPARSEMEM enabled resides in include/linux/mmzone.h so this is the most
      suitable place for documentation about semantics of pfn_valid().
      
      Link: https://lkml.kernel.org/r/20210511100550.28178-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20210511100550.28178-2-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Suggested-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: default avatarArd Biesheuvel <ardb@kernel.org>
      Reviewed-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      51c656ae
    • Ben Widawsky's avatar
      mm/mempolicy: use unified 'nodes' for bind/interleave/prefer policies · 269fbe72
      Ben Widawsky authored
      Current structure 'mempolicy' uses a union to store the node info for
      bind/interleave/perfer policies.
      
      	union {
      		short 		 preferred_node; /* preferred */
      		nodemask_t	 nodes;		/* interleave/bind */
      		/* undefined for default */
      	} v;
      
      Since preferred node can also be represented by a nodemask_t with only ont
      bit set, unify these policies with using one nodemask_t 'nodes', which can
      remove a union, simplify the code and make it easier to support future's
      new policy's node info.
      
      Link: https://lore.kernel.org/r/20200630212517.308045-7-ben.widawsky@intel.com
      Link: https://lkml.kernel.org/r/1623399825-75651-1-git-send-email-feng.tang@intel.comCo-developed-by: default avatarFeng Tang <feng.tang@intel.com>
      Signed-off-by: default avatarBen Widawsky <ben.widawsky@intel.com>
      Signed-off-by: default avatarFeng Tang <feng.tang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      269fbe72
    • Yang Shi's avatar
      mm: mempolicy: don't have to split pmd for huge zero page · e5947d23
      Yang Shi authored
      When trying to migrate pages to obey mempolicy, the huge zero page is
      split by inserting base zero pfn to all PTEs, then the page table walk
      fallback to PTE level and just skips zero page.  Skipping zero page for
      mempolicy has been the behavior of kernel since v2.6.16 due to commit
      f4598c8b ("[PATCH] migration: make sure there is no attempt to migrate
      reserved pages.").  So it seems pointless to split huge zero page, it
      could be just skipped like base zero page.
      
      Set ACTION_CONTINUE to prevent the walk_page_range() split the pmd for
      this case.
      
      Link: https://lkml.kernel.org/r/20210609172146.3594-1-shy828301@gmail.com
      Link: https://lkml.kernel.org/r/20210604203513.240709-1-shy828301@gmail.comSigned-off-by: default avatarYang Shi <shy828301@gmail.com>
      Reviewed-by: default avatarZi Yan <ziy@nvidia.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e5947d23
    • Feng Tang's avatar
      mm/mempolicy: unify the parameter sanity check for mbind and set_mempolicy · 95837924
      Feng Tang authored
      Currently the kernel_mbind() and kernel_set_mempolicy() do almost the same
      operation for parameter sanity check.
      
      Add a helper function to unify the code to reduce the redundancy, and make
      it easier for changing the sanity check code in future.
      
      [thanks to David Rientjes for suggesting using helper function instead of
      macro].
      
      [feng.tang@intel.com: add comment]
        Link: https://lkml.kernel.org/r/1622560492-1294-4-git-send-email-feng.tang@intel.com
      
      Link: https://lkml.kernel.org/r/1622469956-82897-4-git-send-email-feng.tang@intel.comSigned-off-by: default avatarFeng Tang <feng.tang@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      95837924
    • Feng Tang's avatar
      mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy · 7858d7bc
      Feng Tang authored
      MPOL_LOCAL policy has been setup as a real policy, but it is still handled
      like a faked POL_PREFERRED policy with one internal MPOL_F_LOCAL flag bit
      set, and there are many places having to judge the real 'prefer' or the
      'local' policy, which are quite confusing.
      
      In current code, there are 4 cases that MPOL_LOCAL are used:
      
      1. user specifies 'local' policy
      
      2. user specifies 'prefer' policy, but with empty nodemask
      
      3. system 'default' policy is used
      
      4. 'prefer' policy + valid 'preferred' node with MPOL_F_STATIC_NODES
         flag set, and when it is 'rebind' to a nodemask which doesn't contains
         the 'preferred' node, it will perform as 'local' policy
      
      So make 'local' a real policy instead of a fake 'prefer' one, and kill
      MPOL_F_LOCAL bit, which can greatly reduce the confusion for code reading.
      
      For case 4, the logic of mpol_rebind_preferred() is confusing, as Michal
      Hocko pointed out:
      
      : I do believe that rebinding preferred policy is just bogus and it should
      : be dropped altogether on the ground that a preference is a mere hint from
      : userspace where to start the allocation.  Unless I am missing something
      : cpusets will be always authoritative for the final placement.  The
      : preferred node just acts as a starting point and it should be really
      : preserved when cpusets changes.  Otherwise we have a very subtle behavior
      : corner cases.
      
      So dump all the tricky transformation between 'prefer' and 'local', and
      just record the new nodemask of rebinding.
      
      [feng.tang@intel.com: fix a problem in mpol_set_nodemask(), per Michal Hocko]
        Link: https://lkml.kernel.org/r/1622560492-1294-3-git-send-email-feng.tang@intel.com
      [feng.tang@intel.com: refine code and comments of mpol_set_nodemask(), per Michal]
        Link: https://lkml.kernel.org/r/20210603081807.GE56979@shbuild999.sh.intel.com
      
      Link: https://lkml.kernel.org/r/1622469956-82897-3-git-send-email-feng.tang@intel.comSigned-off-by: default avatarFeng Tang <feng.tang@intel.com>
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7858d7bc
    • Feng Tang's avatar
      mm/mempolicy: cleanup nodemask intersection check for oom · b26e517a
      Feng Tang authored
      Patch series "mm/mempolicy: some fix and semantics cleanup", v4.
      
      Current memory policy code has some confusing and ambiguous part about
      MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and
      there are many places having to distinguish them.  Also the nodemask
      intersection check needs cleanup to be more explicit for OOM use, and
      handle MPOL_INTERLEAVE correctly.  This patchset cleans up these and
      unifies the parameter sanity check for mbind() and set_mempolicy().
      
      This patch (of 3):
      
      mempolicy_nodemask_intersects seem to be a general purpose mempolicy
      function.  In fact it is partially tailored for the OOM purpose
      instead.  The oom proper is the only existing user so rename the
      function to make that purpose explicit.
      
      While at it drop the MPOL_INTERLEAVE as those allocations never has a
      nodemask defined (see alloc_page_interleave) so this is a dead code and
      a confusing one because MPOL_INTERLEAVE is a hint rather than a hard
      requirement so it shouldn't be considered during the OOM.
      
      The final code can be reduced to a check for MPOL_BIND which is the
      only memory policy that is a hard requirement and thus relevant to a
      constrained OOM logic.
      
      [mhocko@suse.com: changelog edits]
      
      Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com
      Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com
      Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com
      Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.comSigned-off-by: default avatarFeng Tang <feng.tang@intel.com>
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b26e517a
    • Wonhyuk Yang's avatar
      mm/compaction: fix 'limit' in fast_isolate_freepages · b55ca526
      Wonhyuk Yang authored
      Because of 'min(1, ...)', fast_isolate_freepages set 'limit' to 0 or 1.
      This takes away the opportunities of find candinate pages.  So, by making
      enough scans available, increases the probability of finding the
      appropriate freepage.
      
      Tested it on the thpscale and the results are as follows.
      
                                              5.12.0                 5.12.0
                                            valnilla                patched
      Amean     fault-both-1       598.15 (   0.00%)      592.56 (   0.93%)
      Amean     fault-both-3      1494.47 (   0.00%)     1514.35 (  -1.33%)
      Amean     fault-both-5      2519.48 (   0.00%)     2471.76 (   1.89%)
      Amean     fault-both-7      3173.85 (   0.00%)     3079.19 (   2.98%)
      Amean     fault-both-12     8063.83 (   0.00%)     7858.29 (   2.55%)
      Amean     fault-both-18     8781.20 (   0.00%)     7827.70 *  10.86%*
      Amean     fault-both-24    12576.44 (   0.00%)    12250.20 (   2.59%)
      Amean     fault-both-30    18503.27 (   0.00%)    17528.11 *   5.27%*
      Amean     fault-both-32    16133.69 (   0.00%)    13874.24 *  14.00%*
      
                                                 5.12.0         5.12.0
                                                vanilla        patched
      Ops Compaction migrate scanned         6547133.00     5963901.00
      Ops Compaction free scanned           32452453.00    26609101.00
      
                              5.12        5.12
                           vanilla     patched
      Duration User          27.99       28.84
      Duration System       244.08      236.76
      Duration Elapsed       78.27       78.38
      
      Link: https://lkml.kernel.org/r/20210626082443.22547-1-vvghjk1234@gmail.com
      Fixes: 5a811889 ("mm, compaction: use free lists to quickly locate a migration target")
      Signed-off-by: default avatarWonhyuk Yang <vvghjk1234@gmail.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b55ca526
    • Liu Xiang's avatar
      mm: compaction: remove duplicate !list_empty(&sublist) check · d2155fe5
      Liu Xiang authored
      The list_splice_tail(&sublist, freelist) also do !list_empty(&sublist)
      check, so remove the duplicate call.
      
      Link: https://lkml.kernel.org/r/20210609095409.19920-1-liu.xiang@zlingsmart.comSigned-off-by: default avatarLiu Xiang <liu.xiang@zlingsmart.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d2155fe5
    • YueHaibing's avatar
      mm/compaction: use DEVICE_ATTR_WO macro · 17adb230
      YueHaibing authored
      Use DEVICE_ATTR_WO helper instead of plain DEVICE_ATTR, which makes the
      code a bit shorter and easier to read.
      
      Link: https://lkml.kernel.org/r/20210523064521.32912-1-yuehaibing@huawei.comSigned-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      17adb230
    • Miaohe Lin's avatar
      mm/zbud: don't export any zbud API · 2a03085c
      Miaohe Lin authored
      The zbud doesn't need to export any API and it is meant to be used via
      zpool API since the commit 12d79d64 ("mm/zpool: update zswap to use
      zpool").  So we can remove the unneeded zbud.h and move down zpool API to
      avoid any forward declaration.
      
      [linmiaohe@huawei.com: fix unused function warnings when CONFIG_ZPOOL is disabled]
        Link: https://lkml.kernel.org/r/20210619025508.1239386-1-linmiaohe@huawei.com
      
      Link: https://lkml.kernel.org/r/20210608114515.206992-3-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a03085c
    • Miaohe Lin's avatar
      mm/zbud: reuse unbuddied[0] as buddied in zbud_pool · f356aeac
      Miaohe Lin authored
      Patch series "Cleanups for zbud", v2.
      
      This series contains just cleanups to save some possible memory in
      zbud_pool and avoid exporting any unneeded zbud API.  More details can be
      found in the respective changelogs
      
      This patch (of 2):
      
      Since commit 9d8c5b52 ("mm: zbud: fix condition check on allocation
      size"), zbud_pool.unbuddied[0] is always unused.  We can reuse it as
      buddied field to save some possible memory.
      
      Link: https://lkml.kernel.org/r/20210608114515.206992-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210608114515.206992-2-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f356aeac
    • Miaohe Lin's avatar
      mm/z3fold: use release_z3fold_page_locked() to release locked z3fold page · 28473d91
      Miaohe Lin authored
      We should use release_z3fold_page_locked() to release z3fold page when
      it's locked, although it looks harmless to use release_z3fold_page() now.
      
      Link: https://lkml.kernel.org/r/20210619093151.1492174-7-linmiaohe@huawei.com
      Fixes: dcf5aedb ("z3fold: stricter locking and more careful reclaim")
      Signed-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarVitaly Wool <vitaly.wool@konsulko.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      28473d91
    • Miaohe Lin's avatar
      mm/z3fold: fix potential memory leak in z3fold_destroy_pool() · dac0d1cf
      Miaohe Lin authored
      There is a memory leak in z3fold_destroy_pool() as it forgets to
      free_percpu pool->unbuddied.  Call free_percpu for pool->unbuddied to fix
      this issue.
      
      Link: https://lkml.kernel.org/r/20210619093151.1492174-6-linmiaohe@huawei.com
      Fixes: d30561c5 ("z3fold: use per-cpu unbuddied lists")
      Signed-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarVitaly Wool <vitaly.wool@konsulko.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dac0d1cf
    • Miaohe Lin's avatar
      mm/z3fold: remove unused function handle_to_z3fold_header() · 767cc6c5
      Miaohe Lin authored
      handle_to_z3fold_header() is unused now.  So we can remove it.  As a
      result, get_z3fold_header() becomes the only caller of
      __get_z3fold_header() and the argument lock is always true.  Therefore we
      could further fold the __get_z3fold_header() into get_z3fold_header() with
      lock = true.
      
      Link: https://lkml.kernel.org/r/20210619093151.1492174-5-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarVitaly Wool <vitaly.wool@konsulko.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      767cc6c5
    • Miaohe Lin's avatar
      mm/z3fold: remove magic number in z3fold_create_pool() · e891f60e
      Miaohe Lin authored
      It's meaningless to pass a magic number 2 to __alloc_percpu() as there is
      a minimum alignment size of PCPU_MIN_ALLOC_SIZE (> 2) in it.  Also there
      is no special alignment requirement for unbuddied.  So we could replace
      this magic number with nature alignment, i.e.  __alignof__(struct
      list_head), to improve readability.
      
      Link: https://lkml.kernel.org/r/20210619093151.1492174-4-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarVitaly Wool <vitaly.wool@konsulko.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e891f60e
    • Miaohe Lin's avatar
      mm/z3fold: avoid possible underflow in z3fold_alloc() · 014284a0
      Miaohe Lin authored
      It is not enough to just make sure the z3fold header is not larger than
      the page size.  When z3fold header is equal to PAGE_SIZE, we would
      underflow when check alloc size against PAGE_SIZE - ZHDR_SIZE_ALIGNED -
      CHUNK_SIZE in z3fold_alloc().  Make sure there has remaining spaces for
      its buddy to fix this theoretical issue.
      
      Link: https://lkml.kernel.org/r/20210619093151.1492174-3-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarVitaly Wool <vitaly.wool@konsulko.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      014284a0
    • Miaohe Lin's avatar
      mm/z3fold: define macro NCHUNKS as TOTAL_CHUNKS - ZHDR_CHUNKS · e3c0db4f
      Miaohe Lin authored
      Patch series "Cleanup and fixup for z3fold".
      
      This series contains cleanups to remove unused function, redefine macro to
      improve readability and so on.  Also this fixes several bugs in z3fold,
      such as memory leak in z3fold_destroy_pool().  More details can be found
      in the respective changelogs.
      
      This patch (of 6):
      
      To improve code readability, we could define macro NCHUNKS as TOTAL_CHUNKS
      - ZHDR_CHUNKS.  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20210619093151.1492174-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210619093151.1492174-2-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarVitaly Wool <vitaly.wool@konsulko.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e3c0db4f