1. 09 Jun, 2023 29 commits
    • mm: memory_failure: move memory_failure_attr_group under MEMORY_FAILURE · 870388db
      Kefeng Wang authored
      memory_failure_attr_group is only used if MEMORY_FAILURE is enabled, so
      move it under this configuration.
      
      Link: https://lkml.kernel.org/r/20230508114128.37081-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hugetlb_vmemmap: provide stronger vmemmap allocation guarantees · eb83f652
      Pasha Tatashin authored
      HugeTLB pages have a struct page optimization where the struct pages for
      tail pages are freed.  However, when HugeTLB pages are destroyed, the
      memory for those struct pages (vmemmap) needs to be allocated again.
      
      Currently, the __GFP_NORETRY flag is used to allocate the memory for
      vmemmap, but given that this flag makes very little effort to actually
      reclaim memory, returning huge pages back to the system can be a problem.
      Let's use __GFP_RETRY_MAYFAIL instead.  This flag also performs graceful
      reclaim without causing OOMs, but at least it may perform a few retries,
      and will fail only when there is genuinely little unused memory in the
      system.
      
      Freeing a 1G page requires 16M of free memory.  A machine might need to be
      reconfigured from one task to another and release a large number of 1G
      pages back to the system; if allocating 16M fails, the release won't work.
      
      Link: https://lkml.kernel.org/r/20230508234059.2529638-1-pasha.tatashin@soleen.com
      Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
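
As a rough illustration of the 16M figure quoted in the vmemmap commit above, the arithmetic can be sketched as follows. The 4 KiB base page size and the 64-byte struct page are assumptions (typical for x86-64), not values stated in the commit message, and the demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: 4 KiB base pages and a 64-byte struct page. */
#define DEMO_BASE_PAGE_SIZE   4096ULL
#define DEMO_STRUCT_PAGE_SIZE 64ULL

/*
 * Bytes of vmemmap (the struct page array) needed to describe a
 * huge page of the given size, one struct page per base page.
 */
static uint64_t demo_vmemmap_bytes(uint64_t huge_page_bytes)
{
	assert(huge_page_bytes % DEMO_BASE_PAGE_SIZE == 0);
	return (huge_page_bytes / DEMO_BASE_PAGE_SIZE) * DEMO_STRUCT_PAGE_SIZE;
}
```

Under these assumptions a 1 GiB page maps to 262144 base pages, and 262144 * 64 bytes is 16 MiB, which is roughly the amount that has to be allocated again before the huge page can be returned.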
    • kasan: use internal prototypes matching gcc-13 builtins · bb6e04a1
      Arnd Bergmann authored
      gcc-13 warns about function definitions for builtin interfaces that have a
      different prototype, e.g.:
      
      In file included from kasan_test.c:31:
      kasan.h:574:6: error: conflicting types for built-in function '__asan_register_globals'; expected 'void(void *, long int)' [-Werror=builtin-declaration-mismatch]
        574 | void __asan_register_globals(struct kasan_global *globals, size_t size);
      kasan.h:577:6: error: conflicting types for built-in function '__asan_alloca_poison'; expected 'void(void *, long int)' [-Werror=builtin-declaration-mismatch]
        577 | void __asan_alloca_poison(unsigned long addr, size_t size);
      kasan.h:580:6: error: conflicting types for built-in function '__asan_load1'; expected 'void(void *)' [-Werror=builtin-declaration-mismatch]
        580 | void __asan_load1(unsigned long addr);
      kasan.h:581:6: error: conflicting types for built-in function '__asan_store1'; expected 'void(void *)' [-Werror=builtin-declaration-mismatch]
        581 | void __asan_store1(unsigned long addr);
      kasan.h:643:6: error: conflicting types for built-in function '__hwasan_tag_memory'; expected 'void(void *, unsigned char,  long int)' [-Werror=builtin-declaration-mismatch]
        643 | void __hwasan_tag_memory(unsigned long addr, u8 tag, unsigned long size);
      
      The two problems are:
      
       - Addresses are passed as 'unsigned long' in the kernel, but gcc-13
         expects a 'void *'.
      
       - Sizes are meant to use a signed ssize_t rather than size_t.
      
      Change all the prototypes to match these.  Using 'void *' consistently for
      addresses gets rid of a couple of type casts, so push that down to the
      leaf functions where possible.
      
      This now passes all randconfig builds on arm, arm64 and x86, but I have
      not tested it on the other architectures that support kasan, since they
      tend to fail randconfig builds in other ways.  This might fail if any of
      the 32-bit architectures expect a 'long' instead of 'int' for the size
      argument.
      
      The __asan_allocas_unpoison() function prototype is somewhat weird, since
      it uses a pointer for 'stack_top' and a size_t for 'stack_bottom'.  This
      looks like it is meant to be 'addr' and 'size' like the others, but the
      implementation clearly treats them as 'top' and 'bottom'.
      
      Link: https://lkml.kernel.org/r/20230509145735.9263-2-arnd@kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kasan: add kasan_tag_mismatch prototype · fb646a4c
      Arnd Bergmann authored
      The kasan sw-tags implementation contains one function that is only called
      from assembler and has no prototype in a header.  This causes a W=1
      warning:
      
      mm/kasan/sw_tags.c:171:6: warning: no previous prototype for 'kasan_tag_mismatch' [-Wmissing-prototypes]
        171 | void kasan_tag_mismatch(unsigned long addr, unsigned long access_info,
      
      Add a prototype in the local header to get a clean build.
      
      Link: https://lkml.kernel.org/r/20230509145735.9263-1-arnd@kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • migrate_pages_batch: simplify retrying and failure counting of large folios · 124abced
      Huang Ying authored
      After recent changes to the retrying and failure counting in
      migrate_pages_batch(), it was found that it's unnecessary to count
      retrying and failure for normal, large, and THP folios separately,
      because we don't use the retrying and failure numbers of large folios
      directly.  So, in this patch, we simplify the retrying and failure
      counting of large folios by counting retrying and failure of normal and
      large folios together.  This reduces the line count.
      
      Previously, in migrate_pages_batch() we needed to track whether the source
      folio was large/THP before splitting, so is_large was used to cache the
      folio_test_large() result.  Now we don't need that variable any more
      because we don't count retrying and failure of large folios (only that of
      THP folios).  So, in this patch, is_large is removed to simplify the
      code.
      
      This is just code cleanup, no functionality changes are expected.
      
      Link: https://lkml.kernel.org/r/20230510031829.11513-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory_hotplug: fix format string in warnings · 50135045
      Rick Wertenbroek authored
      The format string in __add_pages and __remove_pages has a typo and prints
      e.g., "Misaligned __add_pages start: 0xfc605 end: #fc609" instead of
      "Misaligned __add_pages start: 0xfc605 end: 0xfc609".  Fix "#%lx" => "%#lx".
      
      Link: https://lkml.kernel.org/r/20230510090758.3537242-1-rick.wertenbroek@gmail.com
      Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
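
The difference between "#%lx" and "%#lx" described in the memory_hotplug commit above can be reproduced in userspace with standard printf formatting. This is only a sketch: demo_format_pfn() is a hypothetical helper, not the kernel code that was patched.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format an end PFN the way the buggy and the fixed warning would. */
static void demo_format_pfn(char *buf, size_t len, unsigned long pfn, int fixed)
{
	if (fixed)
		/* '#' is a flag inside the conversion: emits a 0x prefix. */
		snprintf(buf, len, "end: %#lx", pfn);
	else
		/* '#' sits before the '%', so it is printed literally. */
		snprintf(buf, len, "end: #%lx", pfn);
}
```

With pfn = 0xfc609 the buggy form yields "end: #fc609" and the fixed form yields "end: 0xfc609", matching the two strings quoted in the commit message.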
    • filemap: remove page_endio() · c9639011
      Pankaj Raghav authored
      page_endio() is not used anymore. Remove it.
      
      Link: https://lkml.kernel.org/r/20230510124716.73655-1-p.raghav@samsung.com
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: fix potential out-of-bounds access in mas_wr_end_piv() · cd00dd25
      Peng Zhang authored
      Check the write offset end bounds before using it as the offset into the
      pivot array.  This avoids a possible out-of-bounds access on the pivot
      array if the write extends to the last slot in the node, in which case the
      node maximum should be used as the end pivot.
      
      akpm: this doesn't affect any current callers, but new users of mapletree
      may encounter this problem if backported into earlier kernels, so let's
      fix it in -stable kernels just in case.
      
      Link: https://lkml.kernel.org/r/20230506024752.2550-1-zhangpeng.00@bytedance.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
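
The bounds check described in the maple_tree commit above can be sketched with a simplified node layout. struct demo_node, DEMO_SLOTS and demo_end_pivot() are hypothetical stand-ins chosen for illustration; the real maple tree node layout and mas_wr_end_piv() differ in detail.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SLOTS 16	/* hypothetical node capacity */

struct demo_node {
	unsigned long pivots[DEMO_SLOTS - 1];	/* one fewer pivot than slots */
	unsigned long max;			/* upper bound of the node's range */
};

/* Return the end pivot for a write that ends at slot offset_end. */
static unsigned long demo_end_pivot(const struct demo_node *node,
				    size_t offset_end)
{
	assert(offset_end < DEMO_SLOTS);
	/* Only index the pivot array when the offset is within its bounds. */
	if (offset_end < DEMO_SLOTS - 1)
		return node->pivots[offset_end];
	/* The last slot has no pivot entry, so the node maximum applies. */
	return node->max;
}
```

The point of the sketch is the order of operations: the offset is compared against the pivot array size before it is used as an index, and the node maximum is the fallback for the last slot.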
    • mm/gup: add missing gup_must_unshare() check to gup_huge_pgd() · 31115034
      Lorenzo Stoakes authored
      All other instances of gup_huge_pXd() perform the unshare check, so update
      the PGD-specific function to do so as well.
      
      While checking pgd_write() might seem unusual, this function already
      performs such a check via pgd_access_permitted() so this is in line with
      the existing implementation.
      
      David said:
      
      : This change makes unshare handling across all GUP-fast variants
      : consistent, which is desirable as GUP-fast is complicated enough
      : already even when consistent.
      : 
      : This function was the only one I seemed to have missed (or left out and
      : forgot why -- maybe because it's really dead code for now).  The COW
      : selftest would identify the problem, so far there was no report. 
      : Either the selftest wasn't run on corresponding architectures with that
      : hugetlb size, or that code is still dead code and unused by
      : architectures.
      : 
      : The original commit(s) that added unsharing explain why we care about
      : these checks:
      : 
      : a7f22660 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page")
      : 84209e87 ("mm/gup: reliable R/O long-term pinning in COW mappings")
      
      Link: https://lkml.kernel.org/r/cb971ac8dd315df97058ea69442ecc007b9a364a.1683381545.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • dmapool: create/destroy cleanup · 9f297db3
      Keith Busch authored
      Set the 'empty' bool directly from the result of the function that
      determines its value instead of adding additional logic.
      
      Link: https://lkml.kernel.org/r/20230126215125.4069751-13-kbusch@meta.com
      Fixes: 2d55c16c ("dmapool: create/destroy cleanup")
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Tony Battersby <tonyb@cybernetics.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fs: hugetlbfs: set vma policy only when needed for allocating folio · adef0803
      Ackerley Tng authored
      Calling hugetlb_set_vma_policy() later avoids setting the vma policy
      and then dropping it on a page cache hit.
      
      Link: https://lkml.kernel.org/r/20230502235622.3652586-1-ackerleytng@google.com
      Signed-off-by: Ackerley Tng <ackerleytng@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Erdem Aktas <erdemaktas@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vishal Annapurve <vannapurve@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: add selftests for cachestat · 88537aac
      Nhat Pham authored
      Test cachestat on a newly created file, /dev/ files, /proc/ files and a
      directory.  Also test on a shmem file (which can also be tested with
      huge pages since tmpfs supports huge pages).
      
      [colin.i.king@gmail.com: fix spelling mistake "trucate" -> "truncate"]
        Link: https://lkml.kernel.org/r/20230505110855.2493457-1-colin.i.king@gmail.com
      [mpe@ellerman.id.au: avoid excessive stack allocation]
        Link: https://lkml.kernel.org/r/877ctfa6yv.fsf@mail.lhotse
      Link: https://lkml.kernel.org/r/20230503013608.2431726-4-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Colin Ian King <colin.i.king@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • cachestat: wire up cachestat for other architectures · 946e697c
      Nhat Pham authored
      cachestat was previously only wired in for x86 (and architectures using
      the generic unistd.h table):
      
      https://lore.kernel.org/lkml/20230503013608.2431726-1-nphamcs@gmail.com/
      
      This patch wires cachestat in for all the other architectures.
      
      [nphamcs@gmail.com: wire up cachestat for arm64]
        Link: https://lkml.kernel.org/r/20230511092843.3896327-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20230510195806.2902878-1-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Heiko Carstens <hca@linux.ibm.com>		[s390]
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • cachestat: implement cachestat syscall · cf264e13
      Nhat Pham authored
      There is currently no good way to query the page cache state of large file
      sets and directory trees.  There is mincore(), but it scales poorly: the
      kernel writes out a lot of bitmap data that userspace has to aggregate,
      when the user really does not care about per-page information in that
      case.  The user also needs to mmap and unmap each file as it goes along,
      which can be quite slow as well.
      
      Some use cases where this information could come in handy:
        * Allowing a database to decide whether to perform an index scan or
          direct table queries based on the in-memory cache state of the
          index.
        * Visibility into the writeback algorithm, for diagnosing performance
          issues.
        * Workload-aware writeback pacing: estimating IO fulfilled by page
          cache (and IO to be done) within a range of a file, allowing for
          more frequent syncing when and where there is IO capacity, and
          batching when there is not.
        * Computing memory usage of large files/directory trees, analogous to
          the du tool for disk usage.
      
      More information about these use cases could be found in the following
      thread:
      
      https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
      
      This patch implements a new syscall that queries cache state of a file and
      summarizes the number of cached pages, number of dirty pages, number of
      pages marked for writeback, number of (recently) evicted pages, etc.  in a
      given range.  Currently, the syscall is only wired in for x86
      architecture.
      
      NAME
          cachestat - query the page cache statistics of a file.
      
      SYNOPSIS
          #include <sys/mman.h>
      
          struct cachestat_range {
              __u64 off;
              __u64 len;
          };
      
          struct cachestat {
              __u64 nr_cache;
              __u64 nr_dirty;
              __u64 nr_writeback;
              __u64 nr_evicted;
              __u64 nr_recently_evicted;
          };
      
          int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
              struct cachestat *cstat, unsigned int flags);
      
      DESCRIPTION
          cachestat() queries the number of cached pages, number of dirty
          pages, number of pages marked for writeback, number of evicted
          pages, number of recently evicted pages, in the bytes range given by
          `off` and `len`.
      
          An evicted page is a page that was previously in the page cache but
          has been evicted since. A page is recently evicted if its last
          eviction was recent enough that its reentry to the cache would
          indicate that it is actively being used by the system, and that
          there is memory pressure on the system.
      
          These values are returned in a cachestat struct, whose address is
          given by the `cstat` argument.
      
          The `off` and `len` arguments must be non-negative integers. If
          `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
          0, we will query in the range from `off` to the end of the file.
      
          The `flags` argument is unused for now, but is included for future
          extensibility. Users should pass 0 (i.e., no flags specified).
      
          Currently, hugetlbfs is not supported.
      
          Because the status of a page can change after cachestat() checks it
          but before it returns to the application, the returned values may
          contain stale information.
      
      RETURN VALUE
          On success, cachestat returns 0. On error, -1 is returned, and errno
          is set to indicate the error.
      
      ERRORS
          EFAULT cstat or cstat_range points to an invalid address.
      
          EINVAL invalid flags.
      
          EBADF  invalid file descriptor.
      
          EOPNOTSUPP file descriptor is of a hugetlbfs file.
      
      [nphamcs@gmail.com: replace rounddown logic with the existing helper]
        Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
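
The SYNOPSIS in the cachestat commit above can be exercised from userspace through the raw syscall(2) interface, since a libc wrapper may not be available. This is a minimal sketch: the demo_ struct and function names are hypothetical local mirrors of the structures in the commit message, and the fallback syscall number 451 is an assumption that holds for x86-64 and the generic syscall table but should be checked against the target architecture's headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_cachestat
#define __NR_cachestat 451	/* assumed; verify for the target architecture */
#endif

/* Local mirrors of the structs from the SYNOPSIS (names are hypothetical). */
struct demo_cachestat_range {
	uint64_t off;
	uint64_t len;
};

struct demo_cachestat {
	uint64_t nr_cache;
	uint64_t nr_dirty;
	uint64_t nr_writeback;
	uint64_t nr_evicted;
	uint64_t nr_recently_evicted;
};

/*
 * Query the range described by off and len on fd; len == 0 asks for
 * everything from off to the end of the file. Returns 0 on success,
 * or -1 with errno set (ENOSYS on kernels without the syscall).
 */
static int demo_cachestat_query(int fd, uint64_t off, uint64_t len,
				struct demo_cachestat *out)
{
	struct demo_cachestat_range range = { .off = off, .len = len };

	assert(out != NULL);
	/* flags must be 0 for now, as the DESCRIPTION states. */
	return (int)syscall(__NR_cachestat, fd, &range, out, 0u);
}
```

A caller would typically open() a file, pass the descriptor with off = 0 and len = 0 to cover the whole file, and read nr_cache and nr_dirty from the result, treating ENOSYS as "not supported on this kernel".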
    • workingset: refactor LRU refault to expose refault recency check · ffcb5f52
      Nhat Pham authored
      Patch series "cachestat: a new syscall for page cache state of files",
      v13.
      
      There is currently no good way to query the page cache statistics of large
      files and directory trees.  There is mincore(), but it scales poorly: the
      kernel writes out a lot of bitmap data that userspace has to aggregate,
      when the user really does not care about per-page information in that
      case.  The user also needs to mmap and unmap each file as it goes along,
      which can be quite slow as well.
      
      Some use cases where this information could come in handy:
        * Allowing a database to decide whether to perform an index scan or direct
          table queries based on the in-memory cache state of the index.
        * Visibility into the writeback algorithm, for diagnosing performance
          issues.
        * Workload-aware writeback pacing: estimating IO fulfilled by page cache
          (and IO to be done) within a range of a file, allowing for more
          frequent syncing when and where there is IO capacity, and batching
          when there is not.
        * Computing memory usage of large files/directory trees, analogous to
          the du tool for disk usage.
      
      More information about these use cases could be found in this thread:
      https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
      
      This series of patches introduces a new system call, cachestat, that
      summarizes the page cache statistics (number of cached pages, dirty pages,
      pages marked for writeback, evicted pages etc.) of a file, in a specified
      range of bytes.  It also includes a selftest suite that tests some typical
      usage.  Currently, the syscall is only wired in for the x86 architecture.
      
      This interface is inspired by past discussion and concerns with fincore,
      which has a design similar to (and, as a result, the same issues as)
      mincore.  Relevant links:
      
      https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
      https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html
      
      
      I have also developed a small tool that computes the memory usage of files
      and directories, analogous to the du utility.  Users can choose between
      mincore or cachestat (with cachestat exporting more information than
      mincore).  To compare the performance of these two options, I benchmarked
      the tool on the root directory of a Meta server machine, each for five
      runs:
      
      Using cachestat:
      real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
      user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
      sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689
      
      Using mincore:
      real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
      user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
      sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046
      
      I also ran both syscalls on a 2TB sparse file:
      
      Using cachestat:
      real    0m0.009s
      user    0m0.000s
      sys     0m0.009s
      
      Using mincore:
      real    0m37.510s
      user    0m2.934s
      sys     0m34.558s
      
      Very large files like this are the pathological case for mincore.  In
      fact, to compute the stats for a single 2TB file, mincore takes as long as
      cachestat takes to compute the stats for the entire tree!  This could
      easily happen inadvertently when we run it on subdirectories.  Mincore is
      clearly not suitable for a general-purpose command line tool.
      
      Regarding security concerns, cachestat() should not pose any additional
      issues.  The caller already has read permission to the file itself (since
      they need an fd to that file to call cachestat).  This means that the
      caller can access the underlying data in its entirety, which is a much
      greater source of information (and as a result, a much greater security
      risk) than the cache status itself.
      
      The latest API change (in v13 of the patch series) was suggested by Jens
      Axboe.  It allows for a 64-bit length argument, even on 32-bit
      architectures (which was previously not possible due to the limit on the
      number of syscall arguments).  Furthermore, it eliminates the need for
      compatibility handling - every user can use the same ABI.
      
      
      This patch (of 4):
      
      In preparation for computing recently evicted pages in cachestat, refactor
      workingset_refault and lru_gen_refault to expose a helper function that
      would test if an evicted page is recently evicted.
      
      [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
        Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
      Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memcg, oom: remove explicit wakeup in mem_cgroup_oom_synchronize() · 18b1d18b
      Haifeng Xu authored
      Before commit 29ef680a ("memcg, oom: move out_of_memory back to the
      charge path"), all memcg oom killers were delayed to the page fault path,
      and the explicit wakeup was used in this case:
      
      thread A:
              ...
              if (locked) {           // complete oom-kill, hold the lock
                      mem_cgroup_oom_unlock(memcg);
                      ...
              }
              ...
      
      thread B:
              ...
      
              if (locked && !memcg->oom_kill_disable) {
                      ...
              } else {
                      schedule();     // can't acquire the lock
                      ...
              }
              ...
      
      The reason is that thread A kicks off the OOM-killer, which leads to
      wakeups from the uncharges of the exiting task.  But thread B is not
      guaranteed to see them if it enters the OOM path after the OOM kills but
      before thread A releases the lock.
      
      Now only the oom_kill_disable case is handled from the #PF path.  In that
      case it is up to userspace to trigger the wakeup, not the #PF path itself.
      All potential paths that free some charges are responsible for calling
      memcg_oom_recover(), so the explicit wakeup is not needed in the
      mem_cgroup_oom_synchronize() path, which doesn't release any memory itself.
      
      Link: https://lkml.kernel.org/r/20230419030739.115845-2-haifeng.xu@shopee.com
      Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memcg, oom: remove unnecessary check in mem_cgroup_oom_synchronize() · 857f2139
      Haifeng Xu authored
      mem_cgroup_oom_synchronize() is only used when the memcg oom handling is
      handed over to the edge of the #PF path.  Since commit 29ef680a
      ("memcg, oom: move out_of_memory back to the charge path") this is the
      case only when the kernel memcg oom killer is disabled
      (current->memcg_in_oom is only set if memcg->oom_kill_disable).  Therefore
      a check for oom_kill_disable in mem_cgroup_oom_synchronize() is not
      required.
      
      Link: https://lkml.kernel.org/r/20230419030739.115845-1-haifeng.xu@shopee.com
      Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • cgroup: remove cgroup_rstat_flush_atomic() · 0a2dc6ac
      Yosry Ahmed authored
      Previous patches removed the only caller of cgroup_rstat_flush_atomic(). 
      Remove the function and simplify the code.
      
      Link: https://lkml.kernel.org/r/20230421174020.2994750-6-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      0a2dc6ac
    • Yosry Ahmed's avatar
      memcg: remove mem_cgroup_flush_stats_atomic() · 35822fda
      Yosry Ahmed authored
      Previous patches removed all callers of mem_cgroup_flush_stats_atomic(). 
      Remove the function and simplify the code.
      
      Link: https://lkml.kernel.org/r/20230421174020.2994750-5-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      35822fda
    • Yosry Ahmed's avatar
      memcg: calculate root usage from global state · f82a7a86
      Yosry Ahmed authored
      Currently, we approximate the root usage by adding the memcg stats for
      anon, file, and conditionally swap (for memsw).  To read the memcg stats
      we need to invoke an rstat flush.  rstat flushes can be expensive: they
      scale with the number of cpus and cgroups on the system.
      
      mem_cgroup_usage() is called by memcg_events()->mem_cgroup_threshold()
      with irqs disabled, so such an expensive operation with irqs disabled can
      cause problems.
      
      Instead, approximate the root usage from global state.  This is not 100%
      accurate, but the root usage has always been ill-defined anyway.
      
      Link: https://lkml.kernel.org/r/20230421174020.2994750-4-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f82a7a86
    • Yosry Ahmed's avatar
      memcg: flush stats non-atomically in mem_cgroup_wb_stats() · 190409ca
      Yosry Ahmed authored
      The previous patch moved the wb_over_bg_thresh()->mem_cgroup_wb_stats()
      code path in wb_writeback() outside the lock section.  We no longer need
      to flush the stats atomically.  Flush the stats non-atomically.
      
      Link: https://lkml.kernel.org/r/20230421174020.2994750-3-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      190409ca
    • Yosry Ahmed's avatar
      writeback: move wb_over_bg_thresh() call outside lock section · 2816ea2a
      Yosry Ahmed authored
      Patch series "cgroup: eliminate atomic rstat flushing", v5.
      
      A previous patch series [1] changed most atomic rstat flushing contexts to
      become non-atomic.  This was done to avoid an expensive operation that
      scales with # cgroups and # cpus to happen with irqs disabled and
      scheduling not permitted.  There were two remaining atomic flushing
      contexts after that series.  This series tries to eliminate them as well,
      eliminating atomic rstat flushing completely.
      
      The two remaining atomic flushing contexts are:
      (a) wb_over_bg_thresh()->mem_cgroup_wb_stats()
      (b) mem_cgroup_threshold()->mem_cgroup_usage()
      
      For (a), flushing needs to be atomic as wb_writeback() calls
      wb_over_bg_thresh() with a spinlock held.  However, it seems like the call
      to wb_over_bg_thresh() doesn't need to be protected by that spinlock, so
      this series proposes a refactoring that moves the call outside the lock
      critical section and makes the stats flushing in mem_cgroup_wb_stats()
      non-atomic.
      
      For (b), flushing needs to be atomic as mem_cgroup_threshold() is called
      with irqs disabled.  We only flush the stats when calculating the root
      usage, as it is approximated as the sum of some memcg stats (file, anon,
      and optionally swap) instead of the conventional page counter.  This
      series proposes changing this calculation to use the global stats instead,
      eliminating the need for a memcg stat flush.
      
      After these 2 contexts are eliminated, we no longer need
      mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic().  We can
      remove them and simplify the code.
      
      [1] https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/
      
      
      This patch (of 5):
      
      wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
      flush, which can be expensive on large systems. Currently,
      wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
      have to do the rstat flush atomically. On systems with a lot of
      cpus and/or cgroups, this can cause us to disable irqs for a long time,
      potentially causing problems.
      
      Move the call to wb_over_bg_thresh() outside the lock section in
      preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
      The list_empty(&wb->work_list) check should be okay outside the lock
      section of wb->list_lock as it is protected by a separate lock
      (wb->work_lock), and wb_over_bg_thresh() doesn't seem like it is
      modifying any of wb->b_* lists the wb->list_lock is protecting.
      Also, the loop seems to be already releasing and reacquiring the
      lock, so this refactoring looks safe.
      
      Link: https://lkml.kernel.org/r/20230421174020.2994750-1-yosryahmed@google.com
      Link: https://lkml.kernel.org/r/20230421174020.2994750-2-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2816ea2a
    • Baolin Wang's avatar
      mm/page_alloc: drop the unnecessary pfn_valid() for start pfn · 3c4322c9
      Baolin Wang authored
      __pageblock_pfn_to_page() currently performs both pfn_valid check and
      pfn_to_online_page().  The former one is redundant because the latter is a
      stronger check.  Drop pfn_valid().
      
      Link: https://lkml.kernel.org/r/c3868b58c6714c09a43440d7d02c7b4eed6e03f6.1682342634.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3c4322c9
    • Wen Yang's avatar
      mm: compaction: optimize compact_memory to comply with the admin-guide · 8b9167cd
      Wen Yang authored
      For the /proc/sys/vm/compact_memory file, the admin-guide states: When 1
      is written to the file, all zones are compacted such that free memory is
      available in contiguous blocks where possible.  This can be important for
      example in the allocation of huge pages although processes will also
      directly compact memory as required.
      
      However, this was not strictly followed: writing any value would cause
      all zones to be compacted.
      
      It has been slightly optimized to comply with the admin-guide.  Enforce
      the 1 on the unlikely chance that the sysctl handler is ever extended to
      do something different.
      
      Commit ef498438 ("mm/compaction: remove unused variable
      sysctl_compact_memory") has also been optimized a bit here, as the
      declaration in the external header file has been eliminated, and
      sysctl_compact_memory also needs to be verified.
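The behaviour described above can be sketched in userspace C. This is only a model under stated assumptions, not the kernel's sysctl handler: `compact_memory_write()`, `compact_all_zones()` and the `compactions_run` counter are hypothetical names invented for illustration. The point it shows is that only the documented value 1 triggers compaction, and every other input is rejected.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real compaction work; counts invocations. */
static int compactions_run;

static void compact_all_zones(void)
{
    compactions_run++;
}

/*
 * Hypothetical model of a write to /proc/sys/vm/compact_memory.
 * Returns 0 if compaction was triggered, -EINVAL for any input whose
 * leading number is not exactly 1.
 */
static int compact_memory_write(const char *buf)
{
    char *end;
    long val;

    errno = 0;
    val = strtol(buf, &end, 10);
    if (errno != 0 || end == buf)
        return -EINVAL;     /* not a number at all */
    if (val != 1)
        return -EINVAL;     /* the admin-guide documents only the value 1 */

    compact_all_zones();
    return 0;
}
```

Rejecting unknown values up front keeps them free for future use, which matches the stated motivation of guarding against the handler being extended later.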
      
      [akpm@linux-foundation.org: add __read_mostly, per Mel]
      Link: https://lkml.kernel.org/r/tencent_DFF54DB2A60F3333F97D3F6B5441519B050A@qq.com
      Signed-off-by: Wen Yang <wenyang.linux@foxmail.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: William Lam <william.lam@bytedance.com>
      Cc: Pintu Kumar <pintu@codeaurora.org>
      Cc: Fu Wei <wefu@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8b9167cd
    • Yosry Ahmed's avatar
      memcg: dump memory.stat during cgroup OOM for v1 · dddb44ff
      Yosry Ahmed authored
      Patch series "memcg: OOM log improvements", v2.
      
      This short patch series brings back some cgroup v1 stats in OOM logs
      that were unnecessarily changed before. It also makes memcg OOM logs
      less reliant on printk() internals.
      
      
      This patch (of 2):
      
      Commit c8713d0b ("mm: memcontrol: dump memory.stat during cgroup OOM")
      made sure we dump all the stats in memory.stat during a cgroup OOM, but it
      also introduced a slight behavioral change.  The code used to print the
      non-hierarchical v1 cgroup stats for the entire cgroup subtree; now it
      only prints the v2 cgroup stats for the cgroup under OOM.
      
      For cgroup v1 users, this introduces a few problems:
      
      (a) The non-hierarchical stats of the memcg under OOM are no longer
          shown.
      
      (b) A couple of v1-only stats (e.g.  pgpgin, pgpgout) are no longer
          shown.
      
      (c) We show the list of cgroup v2 stats, even in cgroup v1.  This list
          of stats is not tracked with v1 in mind.  While most of the stats seem
          to be working on v1, there may be some stats that are not fully or
          correctly tracked.
      
      Although OOM log is not set in stone, we should not change it for no
      reason.  When upgrading the kernel version to a version including commit
      c8713d0b ("mm: memcontrol: dump memory.stat during cgroup OOM"), these
      behavioral changes are noticed in cgroup v1.
      
      The fix is simple.  Commit c8713d0b ("mm: memcontrol: dump memory.stat
      during cgroup OOM") separated stats formatting from stats display for v2,
      to reuse the stats formatting in the OOM logs.  Do the same for v1.
      
      Move the v2 specific formatting from memory_stat_format() to
      memcg_stat_format(), add memcg1_stat_format() for v1, and make
      memory_stat_format() select between them based on cgroup version.  Since
      memory_stat_show() now works for both v1 & v2, drop memcg_stat_show().
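The dispatch described above can be pictured with a small userspace sketch. The three function names come from the commit message, but everything else is an assumption made for illustration: `struct memcg_model`, its fields, the simplified signatures and the handful of counters printed are hypothetical and far smaller than the real stat lists.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, heavily reduced model of a memcg and a few of its counters. */
struct memcg_model {
    bool on_v2;                        /* true if on the cgroup v2 hierarchy */
    unsigned long anon, file;          /* printed by the v2 formatter */
    unsigned long cache, rss, pgpgin;  /* printed by the v1 formatter */
};

/* v2-specific formatting. */
static int memcg_stat_format(const struct memcg_model *m, char *buf, size_t size)
{
    return snprintf(buf, size, "anon %lu\nfile %lu\n", m->anon, m->file);
}

/* v1-specific formatting, including a v1-only counter (pgpgin). */
static int memcg1_stat_format(const struct memcg_model *m, char *buf, size_t size)
{
    return snprintf(buf, size, "cache %lu\nrss %lu\npgpgin %lu\n",
                    m->cache, m->rss, m->pgpgin);
}

/* Single entry point: picks the formatter that matches the cgroup version. */
static int memory_stat_format(const struct memcg_model *m, char *buf, size_t size)
{
    return m->on_v2 ? memcg_stat_format(m, buf, size)
                    : memcg1_stat_format(m, buf, size);
}
```

With one entry point, both the stat file and the OOM log can call the same function and get version-appropriate output.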
      
      Link: https://lkml.kernel.org/r/20230428132406.2540811-1-yosryahmed@google.com
      Link: https://lkml.kernel.org/r/20230428132406.2540811-3-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      dddb44ff
    • Yosry Ahmed's avatar
      memcg: use seq_buf_do_printk() with mem_cgroup_print_oom_meminfo() · 5b42360c
      Yosry Ahmed authored
      Currently, we format all the memcg stats into a buffer in
      mem_cgroup_print_oom_meminfo() and use pr_info() to dump it to the logs. 
      However, this buffer is large in size.  Although it is currently working
      as intended, there is a dependency between the memcg stats buffer and the
      printk record size limit.
      
      If we add more stats in the future and the buffer becomes larger than the
      printk record size limit, or if the printk record size limit is reduced,
      the logs may be truncated.
      
      It is safer to use seq_buf_do_printk(), which will automatically break up
      the buffer at line breaks and issue small printk() calls.
      
      Refactor the code to move the seq_buf from memory_stat_format() to its
      callers, and use seq_buf_do_printk() to print the seq_buf in
      mem_cgroup_print_oom_meminfo().
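The idea of breaking a large buffer at line breaks and issuing one small print per line can be sketched in userspace C. This is not the implementation of seq_buf_do_printk(); `print_buf_by_line()` and `emit_line()` are hypothetical names, and `printf()` stands in for a small printk() call.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sink: prints one line per call, like one small printk(). */
static void emit_line(const char *line, size_t len)
{
    printf("%.*s\n", (int)len, line);
}

/*
 * Walk buf[0..len) and hand each newline-terminated line to emit().
 * A trailing fragment without a newline is emitted as a final line.
 * Returns the number of emit() calls made.
 */
static size_t print_buf_by_line(const char *buf, size_t len,
                                void (*emit)(const char *, size_t))
{
    size_t calls = 0;
    const char *start = buf;
    const char *end = buf + len;

    while (start < end) {
        const char *nl = memchr(start, '\n', (size_t)(end - start));
        size_t n = nl ? (size_t)(nl - start) : (size_t)(end - start);

        emit(start, n);
        calls++;
        start += n + (nl ? 1 : 0);  /* skip past the newline, if any */
    }
    return calls;
}
```

Because each call carries a single line, the size of any one record depends on the longest line rather than on the total size of the stats buffer.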
      
      Link: https://lkml.kernel.org/r/20230428132406.2540811-2-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5b42360c
    • Douglas Anderson's avatar
      migrate_pages: avoid blocking for IO in MIGRATE_SYNC_LIGHT · 4bb6dc79
      Douglas Anderson authored
      The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
      finish quickly but not for things that will take a long time.  Exactly how
      long is too long is not well defined, but waits of tens of milliseconds are
      likely non-ideal.
      
      When putting a Chromebook under memory pressure (opening over 90 tabs on a
      4GB machine) it was fairly easy to see delays waiting for some locks in
      the kcompactd code path of > 100 ms.  While the laptop wasn't amazingly
      usable in this state, it was still limping along and this state isn't
      something artificial.  Sometimes we simply end up with a lot of memory
      pressure.
      
      Putting the same Chromebook under memory pressure while it was running
      Android apps (though not stressing them) showed a much worse result (NOTE:
      this was on an older kernel but the codepaths here are similar).  Android
      apps on ChromeOS currently run from a 128K-block, zlib-compressed,
      loopback-mounted squashfs disk.  If we get a page fault from something
      backed by the squashfs filesystem we could end up holding a folio lock
      while reading enough from disk to decompress 128K (and then decompressing
      it using the somewhat slow zlib algorithms).  That reading goes through
      the ext4 subsystem (because it's a loopback mount) before eventually
      ending up in the block subsystem.  This extra jaunt adds extra overhead. 
      Without much work I could see cases where we ended up blocked on a folio
      lock for over a second.  With more extreme memory pressure I could see up
      to 25 seconds.
      
      We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for the
      two locks that were seen to be slow [1] and that generated much
      discussion.  After discussion, it was decided that we should avoid waiting
      for the two locks during MIGRATE_SYNC_LIGHT if they were being held for
      IO.  We'll continue with the unbounded wait for the more full SYNC modes.
      
      With this change, I couldn't see any slow waits on these locks with my
      previous testcases.
      
      NOTE: The reason I started digging into this originally isn't because some
      benchmark had gone awry, but because we've received in-the-field crash
      reports where we have a hung task waiting on the page lock (which is the
      equivalent code path on old kernels).  While the root cause of those
      crashes is likely unrelated and won't be fixed by this patch, analyzing
      those crash reports did point out these very long waits seemed like
      something good to fix.  With this patch we should no longer hang waiting
      on these locks, but presumably the system will still be in a bad shape and
      hang somewhere else.
      
      [1] https://lore.kernel.org/r/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid
      
      Link: https://lkml.kernel.org/r/20230428135414.v3.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4bb6dc79
    • Roman Gushchin's avatar
      mm: memcg: use READ_ONCE()/WRITE_ONCE() to access stock->cached · f785a8f2
      Roman Gushchin authored
      A memcg pointer in the percpu stock can be accessed by drain_all_stock()
      from another cpu in a lockless way.  In theory it might lead to an issue,
      similar to the one which has been discovered with stock->cached_objcg,
      where the pointer was zeroed between the check for being NULL and
      dereferencing.  In this case the issue is unlikely to be a real problem, but
      to make it bulletproof and similar to stock->cached_objcg, let's annotate all
      accesses to stock->cached with READ_ONCE()/WRITE_ONCE().
      
      Link: https://lkml.kernel.org/r/20230502160839.361544-2-roman.gushchin@linux.dev
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f785a8f2
    • Roman Gushchin's avatar
      mm: kmem: fix a NULL pointer dereference in obj_stock_flush_required() · 3b8abb32
      Roman Gushchin authored
      KCSAN found an issue in obj_stock_flush_required():
      stock->cached_objcg can be reset between the check and dereference:
      
      ==================================================================
      BUG: KCSAN: data-race in drain_all_stock / drain_obj_stock
      
      write to 0xffff888237c2a2f8 of 8 bytes by task 19625 on cpu 0:
       drain_obj_stock+0x408/0x4e0 mm/memcontrol.c:3306
       refill_obj_stock+0x9c/0x1e0 mm/memcontrol.c:3340
       obj_cgroup_uncharge+0xe/0x10 mm/memcontrol.c:3408
       memcg_slab_free_hook mm/slab.h:587 [inline]
       __cache_free mm/slab.c:3373 [inline]
       __do_kmem_cache_free mm/slab.c:3577 [inline]
       kmem_cache_free+0x105/0x280 mm/slab.c:3602
       __d_free fs/dcache.c:298 [inline]
       dentry_free fs/dcache.c:375 [inline]
       __dentry_kill+0x422/0x4a0 fs/dcache.c:621
       dentry_kill+0x8d/0x1e0
       dput+0x118/0x1f0 fs/dcache.c:913
       __fput+0x3bf/0x570 fs/file_table.c:329
       ____fput+0x15/0x20 fs/file_table.c:349
       task_work_run+0x123/0x160 kernel/task_work.c:179
       resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
       exit_to_user_mode_loop+0xcf/0xe0 kernel/entry/common.c:171
       exit_to_user_mode_prepare+0x6a/0xa0 kernel/entry/common.c:203
       __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline]
       syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:296
       do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      read to 0xffff888237c2a2f8 of 8 bytes by task 19632 on cpu 1:
       obj_stock_flush_required mm/memcontrol.c:3319 [inline]
       drain_all_stock+0x174/0x2a0 mm/memcontrol.c:2361
       try_charge_memcg+0x6d0/0xd10 mm/memcontrol.c:2703
       try_charge mm/memcontrol.c:2837 [inline]
       mem_cgroup_charge_skmem+0x51/0x140 mm/memcontrol.c:7290
       sock_reserve_memory+0xb1/0x390 net/core/sock.c:1025
       sk_setsockopt+0x800/0x1e70 net/core/sock.c:1525
       udp_lib_setsockopt+0x99/0x6c0 net/ipv4/udp.c:2692
       udp_setsockopt+0x73/0xa0 net/ipv4/udp.c:2817
       sock_common_setsockopt+0x61/0x70 net/core/sock.c:3668
       __sys_setsockopt+0x1c3/0x230 net/socket.c:2271
       __do_sys_setsockopt net/socket.c:2282 [inline]
       __se_sys_setsockopt net/socket.c:2279 [inline]
       __x64_sys_setsockopt+0x66/0x80 net/socket.c:2279
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0xffff8881382d52c0 -> 0xffff888138893740
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 19632 Comm: syz-executor.0 Not tainted 6.3.0-rc2-syzkaller-00387-g53429336 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
      
      Fix it by using READ_ONCE()/WRITE_ONCE() for all accesses to
      stock->cached_objcg.
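The check-then-dereference hazard and the shape of the fix can be sketched in userspace C. This is a simplified model, not the kernel code: the `READ_ONCE()`/`WRITE_ONCE()` macros below are reduced stand-ins built on a volatile cast (using the GCC/Clang `__typeof__` extension), and `struct objcg`, `struct stock`, `stock_set()` and `stock_matches()` are hypothetical.

```c
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel macros (GCC/Clang __typeof__). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct objcg { int id; };                      /* hypothetical, heavily reduced */
struct stock { struct objcg *cached_objcg; };  /* hypothetical, heavily reduced */

/* Owner side: publish or clear the cached pointer with a single store. */
static void stock_set(struct stock *s, struct objcg *o)
{
    WRITE_ONCE(s->cached_objcg, o);
}

/*
 * Remote side: load the pointer exactly once into a local, then test and
 * dereference only that local copy. Reading s->cached_objcg a second time
 * after the NULL check is the pattern that can observe NULL on the reload.
 */
static int stock_matches(struct stock *s, int id)
{
    struct objcg *o = READ_ONCE(s->cached_objcg);

    return o != NULL && o->id == id;
}
```

The sketch covers only the single-load aspect; it does not model how the lifetime of the pointed-to object is guaranteed while the remote side inspects it.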
      
      Link: https://lkml.kernel.org/r/20230502160839.361544-1-roman.gushchin@linux.dev
      Fixes: bf4f0599 ("mm: memcg/slab: obj_cgroup API")
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Reported-by: syzbot+774c29891415ab0fd29d@syzkaller.appspotmail.com
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Link: https://lore.kernel.org/linux-mm/CACT4Y+ZfucZhM60YPphWiCLJr6+SGFhT+jjm8k1P-a_8Kkxsjg@mail.gmail.com/T/#t
      Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3b8abb32
  2. 28 May, 2023 8 commits
  3. 27 May, 2023 3 commits
    • Linus Torvalds's avatar
      Merge tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 4e893b5a
      Linus Torvalds authored
      Pull xen fixes from Juergen Gross:
      
       - a double free fix in the Xen pvcalls backend driver
      
       - a fix for a regression causing the MSI related sysfs entries to not
         be created in Xen PV guests
      
       - a fix in the Xen blkfront driver for handling insane input data
         better
      
      * tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        x86/pci/xen: populate MSI sysfs entries
        xen/pvcalls-back: fix double frees with pvcalls_new_active_socket()
        xen/blkfront: Only check REQ_FUA for writes
      4e893b5a
    • Linus Torvalds's avatar
      Merge tag 'char-misc-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 957f3f8e
      Linus Torvalds authored
      Pull char/misc fixes from Greg KH:
       "Here are some small driver fixes for 6.4-rc4. They are just two
        different types:
      
         - binder fixes and reverts for reported problems and regressions in
           the binder "driver".
      
         - coresight driver fixes for reported problems.
      
        All of these have been in linux-next for over a week with no reported
        problems"
      
      * tag 'char-misc-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        binder: fix UAF of alloc->vma in race with munmap()
        binder: add lockless binder_alloc_(set|get)_vma()
        Revert "android: binder: stop saving a pointer to the VMA"
        Revert "binder_alloc: add missing mmap_lock calls when using the VMA"
        binder: fix UAF caused by faulty buffer cleanup
        coresight: perf: Release Coresight path when alloc trace id failed
        coresight: Fix signedness bug in tmc_etr_buf_insert_barrier_packet()
      957f3f8e
    • Linus Torvalds's avatar
      Merge tag 'cxl-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl · 49572d53
      Linus Torvalds authored
      Pull compute express link fixes from Dan Williams:
       "The 'media ready' series prevents the driver from acting on bad
        capacity information, and it moves some checks earlier in the init
        sequence which impacts topics in the queue for 6.5.
      
        Additional hotplug testing uncovered a missing enable for memory
        decode. A debug crash fix is also included.
      
        Summary:
      
         - Stop trusting capacity data before the "media ready" indication
      
         - Add missing HDM decoder capability enable for the cold-plug case
      
         - Fix a debug message induced crash"
      
      * tag 'cxl-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
        cxl: Explicitly initialize resources when media is not ready
        cxl/port: Fix NULL pointer access in devm_cxl_add_port()
        cxl: Move cxl_await_media_ready() to before capacity info retrieval
        cxl: Wait Memory_Info_Valid before access memory related info
        cxl/port: Enable the HDM decoder capability for switch ports
      49572d53