1. 24 Aug, 2023 19 commits
    • mm: memcg: use rstat for non-hierarchical stats · f82e6bf9
      Yosry Ahmed authored
      Currently, memcg uses rstat to maintain aggregated hierarchical stats. 
      Counters are maintained for hierarchical stats at each memcg.  Rstat
      tracks which cgroups have updates on which cpus to keep those counters
      fresh on the read-side.
      
      Non-hierarchical stats are currently not covered by rstat.  Their per-cpu
      counters are summed up on every read, which is expensive.  The original
      implementation did the same.  At some point before rstat, non-hierarchical
      aggregated counters were introduced by commit a983b5eb ("mm:
      memcontrol: fix excessive complexity in memory.stat reporting").  However,
      those counters were updated on the performance critical write-side, which
      caused regressions, so they were later removed by commit 815744d7
      ("mm: memcontrol: don't batch updates of local VM stats and events").  See
      [1] for more detailed history.
      
      Kernel versions between a983b5eb and 815744d7 (a span of about a year
      and a half) enjoyed cheap reads of non-hierarchical stats, specifically on
      cgroup v1.  When moving to more recent kernels, a performance regression
      in reading non-hierarchical stats is observed.
      
      Now that we have rstat, we know exactly which percpu counters have updates
      for each stat.  We can maintain non-hierarchical counters again, making
      reads much more efficient, without affecting the performance critical
      write-side.  Hence, add non-hierarchical (i.e local) counters for the
      stats, and extend rstat flushing to keep those up-to-date.
      
      A caveat is that we now need a stats flush before reading
      local/non-hierarchical stats through {memcg/lruvec}_page_state_local() or
      memcg_events_local(), where we previously only needed a flush to read
      hierarchical stats.  Most contexts reading non-hierarchical stats already
      do a flush; add a flush to the only missing context,
      count_shadow_nodes().
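
      As a rough userspace sketch of the counting scheme (not kernel code; the
      names stat_model, mod_stat and flush_stat are made up for illustration):
      the hot path only bumps a per-CPU delta, and both the hierarchical and
      the new local aggregate are folded in at flush time, so a read becomes a
      single load after a flush.

       #include <stdio.h>

       #define NR_CPUS 4

       /* Toy model of one memcg stat: per-CPU deltas plus two aggregates,
        * mirroring the hierarchical/local split described above. */
       struct stat_model {
           long percpu[NR_CPUS]; /* bumped on the hot path */
           long hierarchical;    /* aggregated at flush time (pre-existing) */
           long local;           /* aggregated at flush time (added here) */
       };

       /* Hot path: a cheap per-CPU increment, no shared counter touched. */
       static void mod_stat(struct stat_model *s, int cpu, long delta)
       {
           s->percpu[cpu] += delta;
       }

       /* Flush: fold per-CPU deltas into both aggregates and clear them.
        * In the kernel, rstat limits this to CPUs with pending updates;
        * propagation to ancestor cgroups is omitted in this model. */
       static void flush_stat(struct stat_model *s)
       {
           for (int cpu = 0; cpu < NR_CPUS; cpu++) {
               s->hierarchical += s->percpu[cpu];
               s->local += s->percpu[cpu];
               s->percpu[cpu] = 0;
           }
       }

       int main(void)
       {
           struct stat_model s = { 0 };

           mod_stat(&s, 0, 5);
           mod_stat(&s, 2, 3);
           flush_stat(&s);
           /* The read side is now one load, not a loop over all CPUs. */
           printf("local = %ld\n", s.local); /* prints 8 */
           return 0;
       }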
      
      With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
      machine with 256 cpus on cgroup v1:
      
       # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
       # time cat /sys/fs/cgroup/memory/cg*/memory.stat > /dev/null
      Before:
       real	 0m0.125s
       user	 0m0.005s
       sys	 0m0.120s
      
      After:
       real	 0m0.032s
       user	 0m0.005s
       sys	 0m0.027s
      
      To make sure there are no regressions on cgroup v2, I ran an artificial
      reclaim/refault stress test [2] that creates (NR_CPUS * 2) cgroups,
      assigns them limits, runs a worker process in each cgroup that allocates
      tmpfs memory equal to quadruple the limit (to invoke reclaim
      continuously), and then reads back the entire file (to invoke refaults). 
      All workers are run in parallel, and zram is used as a swapping backend. 
      Both reclaim and refault have conditional stats flushing.  I ran this on a
      machine with 112 cpus, once on mm-unstable, and once on mm-unstable with
      this patch reverted.
      
      (1) A few runs without this patch:
      
       # time ./stress_reclaim_refault.sh
       real 0m9.949s
       user 0m0.496s
       sys 14m44.974s
      
       # time ./stress_reclaim_refault.sh
       real 0m10.049s
       user 0m0.486s
       sys 14m55.791s
      
       # time ./stress_reclaim_refault.sh
       real 0m9.984s
       user 0m0.481s
       sys 14m53.841s
      
      (2) A few runs with this patch:
      
       # time ./stress_reclaim_refault.sh
       real 0m9.885s
       user 0m0.486s
       sys 14m48.753s
      
       # time ./stress_reclaim_refault.sh
       real 0m9.903s
       user 0m0.495s
       sys 14m48.339s
      
       # time ./stress_reclaim_refault.sh
       real 0m9.861s
       user 0m0.507s
       sys 14m49.317s
      
      No regressions are observed with this patch; there is actually a very
      slight improvement.  If I had to guess, it may be because we avoid the
      percpu loop in count_shadow_nodes() when calling
      lruvec_page_state_local(), but I could not prove this using perf; it is
      probably in the noise.
      
      [1] https://lore.kernel.org/lkml/20230725201811.GA1231514@cmpxchg.org/
      [2] https://lore.kernel.org/lkml/CAJD7tkb17x=qwoO37uxyYXLEUVp15BQKR+Xfh7Sg9Hx-wTQ_=w@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20230803185046.1385770-1-yosryahmed@google.com
      Link: https://lkml.kernel.org/r/20230726153223.821757-2-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: handle userfaults under VMA lock · 29a22b9e
      Suren Baghdasaryan authored
      Enable handle_userfault to operate under VMA lock by releasing VMA lock
      instead of mmap_lock and retrying.  Note that FAULT_FLAG_RETRY_NOWAIT
      should never be used when handling faults under per-VMA lock protection
      because that would break the assumption that the lock is dropped on retry.
      
      [surenb@google.com: fix a lockdep issue in vma_assert_write_locked]
        Link: https://lkml.kernel.org/r/20230712195652.969194-1-surenb@google.com
      Link: https://lkml.kernel.org/r/20230630211957.1341547-7-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: handle swap page faults under per-VMA lock · 1235ccd0
      Suren Baghdasaryan authored
      When a page fault is handled under per-VMA lock protection, all swap page
      faults are retried with mmap_lock because folio_lock_or_retry has to drop
      and reacquire mmap_lock if the folio could not be immediately locked.
      Follow the same pattern as mmap_lock and drop the per-VMA lock when
      waiting for the folio, retrying once the folio is available.
      
      With this obstacle removed, enable do_swap_page to operate under per-VMA
      lock protection.  Drivers implementing ops->migrate_to_ram might still
      rely on mmap_lock, therefore we have to fall back to mmap_lock in that
      particular case.
      
      Note that the only time do_swap_page calls synchronous swap_readpage is
      when SWP_SYNCHRONOUS_IO is set, which is only set for
      QUEUE_FLAG_SYNCHRONOUS devices: brd, zram and nvdimms (both btt and pmem).
      Therefore we don't sleep in this path, and there's no need to drop the
      mmap or per-VMA lock.
      
      Link: https://lkml.kernel.org/r/20230630211957.1341547-6-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Tested-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: change folio_lock_or_retry to use vm_fault directly · fdc724d6
      Suren Baghdasaryan authored
      Change folio_lock_or_retry to accept vm_fault struct and return the
      vm_fault_t directly.
      
      Link: https://lkml.kernel.org/r/20230630211957.1341547-5-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED · 4089eef0
      Suren Baghdasaryan authored
      handle_mm_fault returning VM_FAULT_RETRY or VM_FAULT_COMPLETED means
      mmap_lock has been released.  However, with per-VMA locks the behavior is
      different and the caller should still release it.  To make the rules
      consistent for the caller, drop the per-VMA lock when returning
      VM_FAULT_RETRY or VM_FAULT_COMPLETED.  Currently the only path returning
      VM_FAULT_RETRY under per-VMA locks is do_swap_page and no path returns
      VM_FAULT_COMPLETED for now.
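
      The convention can be sketched in plain userspace C (pthread-based, with
      invented names; compile with -pthread; this is an illustration, not the
      kernel code): when the handler returns the "retry" result it has already
      dropped the lock itself, so the caller must only unlock on the other
      outcomes.

       #include <pthread.h>
       #include <stdio.h>

       #define FAULT_DONE  0
       #define FAULT_RETRY 1 /* handler has already dropped the lock */

       static pthread_mutex_t vma_lock = PTHREAD_MUTEX_INITIALIZER;

       /* Toy fault handler, called with vma_lock held.  If it cannot
        * finish, it releases the lock itself and asks for a retry,
        * mirroring the VM_FAULT_RETRY rule described above. */
       static int handle_fault(int must_retry)
       {
           if (must_retry) {
               pthread_mutex_unlock(&vma_lock); /* dropped by the callee */
               return FAULT_RETRY;
           }
           return FAULT_DONE; /* caller still owns the lock */
       }

       int main(void)
       {
           for (int must_retry = 1; must_retry >= 0; must_retry--) {
               pthread_mutex_lock(&vma_lock);
               int ret = handle_fault(must_retry);
               if (ret != FAULT_RETRY)
                   pthread_mutex_unlock(&vma_lock); /* only if still held */
               printf("fault: %s\n", ret == FAULT_RETRY ? "RETRY" : "DONE");
           }
           return 0;
       }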
      
      [willy@infradead.org: fix riscv]
        Link: https://lkml.kernel.org/r/CAJuCfpE6GWEx1rPBmNpUfoD5o-gNFz9-UFywzCE2PbEGBiVz7g@mail.gmail.com
      Link: https://lkml.kernel.org/r/20230630211957.1341547-4-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Tested-by: Conor Dooley <conor.dooley@microchip.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED · 7a32b58b
      Suren Baghdasaryan authored
      VM_FAULT_RESULT_TRACE should contain an element for every vm_fault_reason
      to be used as flag_array inside trace_print_flags_seq().  The element for
      VM_FAULT_COMPLETED is missing, add it.
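
      Why the missing element matters can be seen with a small userspace model
      of such a flag-to-name table (the values and names below are illustrative,
      not the kernel's actual vm_fault_reason bits): a decoder driven by the
      table simply never prints a flag that has no entry.

       #include <stdio.h>

       /* Illustrative flag bits, not the real vm_fault_reason values. */
       #define FAULT_OOM       0x01UL
       #define FAULT_RETRY     0x02UL
       #define FAULT_COMPLETED 0x04UL

       static const struct { unsigned long flag; const char *name; } names[] = {
           { FAULT_OOM,       "OOM" },
           { FAULT_RETRY,     "RETRY" },
           { FAULT_COMPLETED, "COMPLETED" }, /* without this row the bit is
                                                silently left undecoded */
       };

       static void print_result(unsigned long result)
       {
           for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++)
               if (result & names[i].flag)
                   printf("%s ", names[i].name);
           printf("\n");
       }

       int main(void)
       {
           print_result(FAULT_RETRY | FAULT_COMPLETED); /* RETRY COMPLETED */
           return 0;
       }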
      
      Link: https://lkml.kernel.org/r/20230630211957.1341547-3-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • swap: remove remnants of polling from read_swap_cache_async · b243dcbf
      Suren Baghdasaryan authored
      Patch series "Per-VMA lock support for swap and userfaults", v7.
      
      When per-VMA locks were introduced in [1] several types of page faults
      would still fall back to mmap_lock to keep the patchset simple.  Among
      them are swap and userfault pages.  The main reason for skipping those
      cases was the fact that mmap_lock could be dropped while handling these
      faults and that required additional logic to be implemented.  Implement
      the mechanism to allow per-VMA locks to be dropped for these cases.
      
      First, change handle_mm_fault to drop per-VMA locks when returning
      VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
      mmap_lock is handled.  Then change folio_lock_or_retry to accept vm_fault
      and return vm_fault_t which simplifies later patches.  Finally allow swap
      and uffd page faults to be handled under per-VMA locks by dropping the
      per-VMA lock and retrying, the same way it's done under mmap_lock.
      Naturally, once the VMA lock is dropped, that VMA should be assumed
      unstable and can't be used.
      
      
      This patch (of 6):
      
      Commit [1] introduced IO polling support during swapin to reduce swap read
      latency for block devices that can be polled.  However, a later commit [2]
      removed polling support.  Therefore it seems safe to remove the do_poll
      parameter in read_swap_cache_async and always call swap_readpage with
      synchronous=false, waiting for IO completion in folio_lock_or_retry.
      
      [1] commit 23955622 ("swap: add block io poll in swapin path")
      [2] commit 9650b453 ("block: ignore RWF_HIPRI hint for sync dio")
      
      Link: https://lkml.kernel.org/r/20230630211957.1341547-1-surenb@google.com
      Link: https://lkml.kernel.org/r/20230630211957.1341547-2-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Suggested-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: fix potential page refcnt leak in memory_failure() · d51b6846
      Miaohe Lin authored
      put_ref_page() is not called to drop the extra refcnt when memory_failure()
      is invoked from madvise and the pfn is valid but pgmap is NULL, leading to
      a page refcnt leak.
      
      Link: https://lkml.kernel.org/r/20230701072837.1994253-1-linmiaohe@huawei.com
      Fixes: 1e8aaedb ("mm,memory_failure: always pin the page in madvise_inject_error")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory.c: fix mismerge · 08dff281
      Matthew Wilcox authored
      Fix a build issue.
      
      Link: https://lkml.kernel.org/r/ZNerqcNS4EBJA/2v@casper.infradead.org
      Fixes: 4aaa60dad4d1 ("mm: allow per-VMA locks on file-backed VMAs")
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202308121909.XNYBtqNI-lkp@intel.com/
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd · a9846049
      Hugh Dickins authored
      Jann Horn demonstrated how the userfaultfd ioctl UFFDIO_COPY into a private
      shmem mapping can add valid PTEs to a page table which
      collapse_pte_mapped_thp() thought it had emptied: the page lock on the huge
      page is enough to protect against WP faults (which find the PTE has been
      cleared), but not enough to
      protect against userfaultfd.  "BUG: Bad rss-counter state" followed.
      
      retract_page_tables() protects against this by checking !vma->anon_vma;
      but we know that MADV_COLLAPSE needs to be able to work on private shmem
      mappings, even those with an anon_vma prepared for another part of the
      mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
      mappings which are userfaultfd_armed().  Whether it needs to work on
      private shmem mappings which are userfaultfd_armed(), I'm not so sure: but
      assume that it does.
      
      Just for this case, take the pmd_lock() two steps earlier: not because it
      gives any protection against this case itself, but because ptlock nests
      inside it, and it's the dropping of ptlock which let the bug in.  In other
      cases, continue to minimize the pmd_lock() hold time.
      
      Link: https://lkml.kernel.org/r/4d31abf5-56c0-9f3d-d12f-c9317936691@google.com
      Fixes: 1043173e ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Jann Horn <jannh@google.com>
      Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: clear flags in tail pages that will be freed individually · 6c141973
      Mike Kravetz authored
      hugetlb manually creates and destroys compound pages.  As such it makes
      assumptions about struct page layout.  Commit ebc1baf5 ("mm: free up a
      word in the first tail page") breaks hugetlb.  The following will fix the
      breakage.
      
      Link: https://lkml.kernel.org/r/20230822231741.GC4509@monkey
      Fixes: ebc1baf5 ("mm: free up a word in the first tail page")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • shmem: fix smaps BUG sleeping while atomic · e5548f85
      Hugh Dickins authored
      smaps_pte_hole_lookup() is calling shmem_partial_swap_usage() with page
      table lock held: but shmem_partial_swap_usage() does cond_resched_rcu() if
      need_resched(): "BUG: sleeping function called from invalid context".
      
      Since shmem_partial_swap_usage() is designed to count across a range, but
      smaps_pte_hole_lookup() only calls it for a single page slot, just break
      out of the loop on the last or only page, before checking need_resched().
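
      The shape of the fix, as a standalone C sketch with invented names: the
      loop leaves after handling the last (or only) slot, before reaching the
      step that may sleep, so a caller holding a spinlock and asking for a
      single slot never hits that step.

       #include <stdio.h>

       /* Stand-in for cond_resched_rcu(): anything that may sleep. */
       static void maybe_sleep(void) { }

       static long count_range(const long *slots, long nr)
       {
           long total = 0;

           for (long i = 0; i < nr; i++) {
               total += slots[i];
               if (i == nr - 1)
                   break; /* last or only slot: stop before sleeping */
               maybe_sleep();
           }
           return total;
       }

       int main(void)
       {
           long one_slot[] = { 3 };

           /* A single-slot call never reaches maybe_sleep(). */
           printf("%ld\n", count_range(one_slot, 1));
           return 0;
       }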
      
      Link: https://lkml.kernel.org/r/6fe3b3ec-abdf-332f-5c23-6a3b3a3b11a9@google.com
      Fixes: 23010032 ("mm/smaps: simplify shmem handling of pte holes")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.16+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: cachestat: catch failing fsync test on tmpfs · f84f62e6
      Andre Przywara authored
      The cachestat kselftest runs a test on a normal file, which is created
      temporarily in the current directory.  Among the tests it runs there is a
      call to fsync(), which is expected to clean all dirty pages used by the
      file.
      
      However the tmpfs filesystem implements fsync() as noop_fsync(), so the
      call will not even attempt to clean anything when this test file happens
      to live on a tmpfs instance.  This happens in an initramfs, or when the
      current directory is in /dev/shm or sometimes /tmp.
      
      To avoid this test failing wrongly, use statfs() to check which filesystem
      the test file lives on.  If that is "tmpfs", we skip the fsync() test.
      
      Since the fsync check is only one part of the "normal file" test, the test
      is now run twice, skipping the fsync part on the first run.  This way, on
      tmpfs only the second run, which includes the fsync part, is skipped.
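
      A minimal standalone version of that filesystem check (the helper name is
      made up; the selftest's own code may differ): statfs() reports the
      filesystem's magic number, which can be compared against TMPFS_MAGIC.

       #include <stdio.h>
       #include <sys/vfs.h>
       #include <linux/magic.h> /* TMPFS_MAGIC */

       /* Returns 1 if 'path' lives on tmpfs, 0 if not, -1 on error. */
       static int is_on_tmpfs(const char *path)
       {
           struct statfs sfs;

           if (statfs(path, &sfs) < 0)
               return -1;
           return sfs.f_type == TMPFS_MAGIC;
       }

       int main(void)
       {
           int ret = is_on_tmpfs(".");

           if (ret < 0)
               perror("statfs");
           else
               printf("%s\n", ret ? "tmpfs: skip the fsync check"
                                  : "not tmpfs: run the full test");
           return 0;
       }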
      
      Link: https://lkml.kernel.org/r/20230821160534.3414911-3-andre.przywara@arm.com
      Signed-off-by: Andre Przywara <andre.przywara@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: cachestat: test for cachestat availability · 5e56982d
      Andre Przywara authored
      Patch series "selftests: cachestat: fix run on older kernels", v2.
      
      I ran all kernel selftests on some test machine, and stumbled upon
      cachestat failing (among others).  These patches fix the run on older
      kernels and when the current directory is on a tmpfs instance.
      
      
      This patch (of 2):
      
      As cachestat is a new syscall, it won't be available on older kernels, for
      instance those running on a development machine.  At the moment the test
      reports all tests as "not ok" in this case.
      
      Test for the cachestat syscall availability first, before doing further
      tests, and bail out early with a TAP SKIP comment.
      
      This also takes the opportunity to add the proper TAP headers and one
      check for proper error handling (an illegal file descriptor).
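
      A minimal sketch of such a probe (not the selftest's actual code): call
      the syscall with an obviously invalid file descriptor; ENOSYS means the
      kernel lacks it and the run is skipped with a TAP SKIP line, while an
      available implementation is expected to fail with EBADF instead.

       #include <errno.h>
       #include <stdio.h>
       #include <unistd.h>
       #include <sys/syscall.h>

       int main(void)
       {
           printf("TAP version 13\n");
       #ifdef __NR_cachestat
           /* Probe with an invalid fd: no pointers are dereferenced. */
           long ret = syscall(__NR_cachestat, -1, NULL, NULL, 0);

           if (ret < 0 && errno == ENOSYS) {
               printf("1..0 # SKIP cachestat syscall not available\n");
               return 0;
           }
           printf("1..1\n");
           printf("%s 1 cachestat rejects an invalid fd\n",
                  (ret < 0 && errno == EBADF) ? "ok" : "not ok");
       #else
           printf("1..0 # SKIP __NR_cachestat not defined in headers\n");
       #endif
           return 0;
       }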
      
      Link: https://lkml.kernel.org/r/20230821160534.3414911-1-andre.przywara@arm.com
      Link: https://lkml.kernel.org/r/20230821160534.3414911-2-andre.przywara@arm.com
      Signed-off-by: Andre Przywara <andre.przywara@arm.com>
      Acked-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: disable mas_wr_append() when other readers are possible · cfeb6ae8
      Liam R. Howlett authored
      The current implementation of append may cause duplicate data and/or
      incorrect ranges to be returned to a reader during an update.  Although
      this has not been reported or seen, disable the append write operation
      while the tree is in rcu mode out of an abundance of caution.
      
      During the analysis of mas_next_slot(), the following interleavings were
      artificially created by separating the writer and reader code:
      
      Writer:                                 reader:
      mas_wr_append
          set end pivot
          updates end metadata
          Detects write to last slot
          last slot write is to start of slot
          store current contents in slot
          overwrite old end pivot
                                              mas_next_slot():
                                                      read end metadata
                                                      read old end pivot
                                                      return with incorrect range
          store new value
      
      Alternatively:
      
      Writer:                                 reader:
      mas_wr_append
          set end pivot
          updates end metadata
          Detects write to last slot
          last slot write is to end of slot
          store value
                                              mas_next_slot():
                                                      read end metadata
                                                      read old end pivot
                                                      read new end pivot
                                                      return with incorrect range
          set old end pivot
      
      There may be other accesses that are not safe since we are now updating
      both metadata and pointers, so disabling append if there could be rcu
      readers is the safest action.
      
      Link: https://lkml.kernel.org/r/20230819004356.1454718-2-Liam.Howlett@oracle.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check · 0e0e9bd5
      Yin Fengwei authored
      Commit 98b211d6 ("madvise: convert madvise_free_pte_range() to use a
      folio") replaced page_mapcount() with folio_mapcount() to check whether
      the folio is shared by another mapping.
      
      This is not correct for large folios: folio_mapcount() returns the total
      mapcount of a large folio, which is not suitable for detecting whether
      the folio is shared.
      
      Use folio_estimated_sharers(), which returns an estimated number of
      sharers.  That means it's not 100% accurate, but it should be OK for the
      madvise case here.
      
      The user-visible effect is that the THP is skipped when the user calls
      madvise, whereas the correct behavior is for the THP to be split and then
      processed.
      
      NOTE: this change is a temporary fix to reduce the user-visible effects
      before the long term fix from David is ready.
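
      The distinction can be illustrated with a toy userspace model (invented
      names; not the kernel helpers): a large folio PTE-mapped by a single
      process already has a total mapcount equal to its number of subpages, so
      "total mapcount > 1" wrongly signals sharing, while looking only at the
      first subpage's mapcount gives the intended estimate.

       #include <stdio.h>

       #define NR_SUBPAGES 16 /* e.g. a 64K folio of 4K subpages */

       struct folio_model {
           int subpage_mapcount[NR_SUBPAGES];
       };

       /* Analogue of folio_mapcount(): sum of all subpage mapcounts. */
       static int total_mapcount(const struct folio_model *f)
       {
           int sum = 0;

           for (int i = 0; i < NR_SUBPAGES; i++)
               sum += f->subpage_mapcount[i];
           return sum;
       }

       /* Analogue of folio_estimated_sharers(): mapcount of the first
        * subpage only, used as a cheap estimate of the number of
        * processes mapping the folio. */
       static int estimated_sharers(const struct folio_model *f)
       {
           return f->subpage_mapcount[0];
       }

       int main(void)
       {
           struct folio_model f;

           /* One process maps the whole folio via PTEs: every subpage
            * is mapped exactly once. */
           for (int i = 0; i < NR_SUBPAGES; i++)
               f.subpage_mapcount[i] = 1;

           /* Prints "total=16 estimated=1": the total wrongly suggests
            * sharing, the estimate does not. */
           printf("total=%d estimated=%d\n",
                  total_mapcount(&f), estimated_sharers(&f));
           return 0;
       }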
      
      Link: https://lkml.kernel.org/r/20230808020917.2230692-4-fengwei.yin@intel.com
      Fixes: 98b211d6 ("madvise: convert madvise_free_pte_range() to use a folio")
      Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
      Reviewed-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check · 20b18aad
      Yin Fengwei authored
      Commit fc986a38 ("mm: huge_memory: convert madvise_free_huge_pmd to
      use a folio") replaced page_mapcount() with folio_mapcount() to check
      whether the folio is shared by another mapping.
      
      This is not correct for large folios: folio_mapcount() returns the total
      mapcount of a large folio, which is not suitable for detecting whether
      the folio is shared.
      
      Use folio_estimated_sharers(), which returns an estimated number of
      sharers.  That means it's not 100% accurate, but it should be OK for the
      madvise case here.
      
      The user-visible effect is that the THP is skipped when the user calls
      madvise, whereas the correct behavior is for the THP to be split and then
      processed.
      
      NOTE: this change is a temporary fix to reduce the user-visible effects
      before the long term fix from David is ready.
      
      Link: https://lkml.kernel.org/r/20230808020917.2230692-3-fengwei.yin@intel.com
      Fixes: fc986a38 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio")
      Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
      Reviewed-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check · 2f406263
      Yin Fengwei authored
      Patch series "don't use mapcount() to check large folio sharing", v2.
      
      In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
      folio_mapcount() is used to check whether the folio is shared.  But this
      is not correct, as folio_mapcount() returns the total mapcount of a large
      folio.
      
      Use folio_estimated_sharers() here, as the estimated number is enough.
      
      This patchset fixes the following cases: a user space application calls
      madvise() with MADV_FREE, MADV_COLD or MADV_PAGEOUT for a specific
      address range that has THPs mapped into it.  Without the patchset, the
      THPs are skipped.  With the patchset, the THPs will be split and handled
      accordingly.
      
      David reported that the cow selftest skips some cases because MADV_PAGEOUT
      skips THPs:
      https://lore.kernel.org/linux-mm/9e92e42d-488f-47db-ac9d-75b24cd0d037@intel.com/T/#mbf0f2ec7fbe45da47526de1d7036183981691e81
      and I confirmed that this patchset makes it work again.
      
      
      This patch (of 3):
      
      Commit 07e8c82b ("madvise: convert madvise_cold_or_pageout_pte_range()
      to use folios") replaced page_mapcount() with folio_mapcount() to check
      whether the folio is shared by another mapping.
      
      This is not correct for large folios: folio_mapcount() returns the total
      mapcount of a large folio, which is not suitable for detecting whether
      the folio is shared.
      
      Use folio_estimated_sharers(), which returns an estimated number of
      sharers.  That means it's not 100% accurate, but it should be OK for the
      madvise case here.
      
      The user-visible effect is that the THP is skipped when the user calls
      madvise, whereas the correct behavior is for the THP to be split and then
      processed.
      
      NOTE: this change is a temporary fix to reduce the user-visible effects
      before the long term fix from David is ready.
      
      Link: https://lkml.kernel.org/r/20230808020917.2230692-1-fengwei.yin@intel.com
      Link: https://lkml.kernel.org/r/20230808020917.2230692-2-fengwei.yin@intel.com
      Fixes: 07e8c82b ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios")
      Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
      Reviewed-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 21 Aug, 2023 21 commits