1. 01 Feb, 2023 18 commits
    • Squashfs: fix handling and sanity checking of xattr_ids count · f65c4bbb
      Phillip Lougher authored
      A Syzbot-reported [1] corrupted filesystem exposes two flaws in the
      handling and sanity checking of the xattr_ids count in the filesystem.
      Both flaws cause computation overflow due to incorrect typing.
      
      In the corrupted filesystem the xattr_ids value is 4294967071, which,
      when stored in a signed variable, becomes the negative number -225.
      
      Flaw 1 (64-bit systems only):
      
      The signed integer xattr_ids variable causes sign extension.
      
      This causes variable overflow in the SQUASHFS_XATTR_*(A) macros.  The
      variable is first multiplied by sizeof(struct squashfs_xattr_id), where
      the sizeof operator yields an "unsigned long" value.
      
      On a 64-bit system this is 64-bits in size, and causes the negative number
      to be sign extended and widened to 64-bits and then become unsigned.  This
      produces the very large number 18446744073709548016 or 2^64 - 3600.  This
      number when rounded up by SQUASHFS_METADATA_SIZE - 1 (8191 bytes) and
      divided by SQUASHFS_METADATA_SIZE overflows and produces a length of 0
      (stored in len).
      
      Flaw 2 (32-bit systems only):
      
      On a 32-bit system the integer variable is not widened by the unsigned
      long type of the sizeof operator (32-bits), and the signedness of the
      variable has no effect due to it always being treated as unsigned.
      
      The above corrupted xattr_ids value of 4294967071, when multiplied,
      overflows and produces the number 4294963696 or 2^32 - 3600.  This number
      when rounded up by SQUASHFS_METADATA_SIZE - 1 (8191 bytes) and divided by
      SQUASHFS_METADATA_SIZE overflows again and produces a length of 0.
      
      The effect of the 0 length computation:
      
      In conjunction with the corrupted xattr_ids field, the filesystem also has
      a corrupted xattr_table_start value, where it matches the end of
      filesystem value of 850.
      
      This causes the following sanity check code to fail because the
      incorrectly computed len of 0 matches the incorrect size of the table
      reported by the superblock (0 bytes).
      
          len = SQUASHFS_XATTR_BLOCK_BYTES(*xattr_ids);
          indexes = SQUASHFS_XATTR_BLOCKS(*xattr_ids);
      
          /*
           * The computed size of the index table (len bytes) should exactly
           * match the table start and end points
           */
          start = table_start + sizeof(*id_table);
          end = msblk->bytes_used;
      
          if (len != (end - start))
                  return ERR_PTR(-EINVAL);
      
      Changing the xattr_ids variable to be "unsigned int" fixes the flaw on a
      64-bit system.  This relies on the fact that the computation is widened
      by the unsigned long type of the sizeof operator.
      
      Casting the variable to u64 in the above macro fixes this flaw on a 32-bit
      system.
      
      It also means 64-bit systems do not implicitly rely on the type of the
      sizeof operator to widen the computation.
      
      [1] https://lore.kernel.org/lkml/000000000000cd44f005f1a0f17f@google.com/
      
      Link: https://lkml.kernel.org/r/20230127061842.10965-1-phillip@squashfs.org.uk
      Fixes: 506220d2 ("squashfs: add more sanity checks in xattr id lookup")
      Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
      Reported-by: <syzbot+082fa4af80a5bb1a9843@syzkaller.appspotmail.com>
      Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Cc: Fedor Pchelkin <pchelkin@ispras.ru>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • sh: define RUNTIME_DISCARD_EXIT · c1c551be
      Tom Saeger authored
      sh vmlinux fails to link with GNU ld < 2.40 (likely < 2.36) since
      commit 99cb0d91 ("arch: fix broken BuildID for arm64 and riscv").
      
      This is similar to fixes for powerpc and s390:
      commit 4b9880db ("powerpc/vmlinux.lds: Define RUNTIME_DISCARD_EXIT").
      commit a494398b ("s390: define RUNTIME_DISCARD_EXIT to fix link error
      with GNU ld < 2.36").
      
        $ sh4-linux-gnu-ld --version | head -n1
        GNU ld (GNU Binutils for Debian) 2.35.2
      
        $ make ARCH=sh CROSS_COMPILE=sh4-linux-gnu- microdev_defconfig
        $ make ARCH=sh CROSS_COMPILE=sh4-linux-gnu-
      
        `.exit.text' referenced in section `__bug_table' of crypto/algboss.o:
        defined in discarded section `.exit.text' of crypto/algboss.o
        `.exit.text' referenced in section `__bug_table' of
        drivers/char/hw_random/core.o: defined in discarded section
        `.exit.text' of drivers/char/hw_random/core.o
        make[2]: *** [scripts/Makefile.vmlinux:34: vmlinux] Error 1
        make[1]: *** [Makefile:1252: vmlinux] Error 2
      
      arch/sh/kernel/vmlinux.lds.S keeps EXIT_TEXT:
      
      	/*
      	 * .exit.text is discarded at runtime, not link time, to deal with
      	 * references from __bug_table
      	 */
      	.exit.text : AT(ADDR(.exit.text)) { EXIT_TEXT }
      
      However, EXIT_TEXT is thrown away by
      DISCARD(include/asm-generic/vmlinux.lds.h) because
      sh does not define RUNTIME_DISCARD_EXIT.
      
      GNU ld 2.40 does not have this issue and builds fine.
      This corresponds with Masahiro's comments in a494398b:
      "Nathan [Chancellor] also found that binutils
      commit 21401fc7bf67 ("Duplicate output sections in scripts") cured this
      issue, so we cannot reproduce it with binutils 2.36+, but it is better
      to not rely on it."
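      The fix named in the subject line mirrors the powerpc and s390 commits it
      cites: define RUNTIME_DISCARD_EXIT before the generic linker script is
      included, so the DISCARD section no longer throws EXIT_TEXT away at link
      time.  A sketch of the arch/sh/kernel/vmlinux.lds.S change (exact
      placement within the file is an assumption here):

```c
/* keep .exit.text out of /DISCARD/; it is discarded at runtime instead */
#define RUNTIME_DISCARD_EXIT
#include <asm-generic/vmlinux.lds.h>
```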
      
      Link: https://lkml.kernel.org/r/9166a8abdc0f979e50377e61780a4bba1dfa2f52.1674518464.git.tom.saeger@oracle.com
      Fixes: 99cb0d91 ("arch: fix broken BuildID for arm64 and riscv")
      Link: https://lore.kernel.org/all/Y7Jal56f6UBh1abE@dev-arch.thelio-3990X/
      Link: https://lore.kernel.org/all/20230123194218.47ssfzhrpnv3xfez@oracle.com/
      Signed-off-by: Tom Saeger <tom.saeger@oracle.com>
      Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dennis Gilmore <dennis@ausil.us>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Palmer Dabbelt <palmer@rivosinc.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • highmem: round down the address passed to kunmap_flush_on_unmap() · 88d7b120
      Matthew Wilcox (Oracle) authored
      We already round down the address in kunmap_local_indexed() which is the
      other implementation of __kunmap_local().  The only implementation of
      kunmap_flush_on_unmap() is PA-RISC which is expecting a page-aligned
      address.  This may be causing PA-RISC to be flushing the wrong addresses
      currently.
      
      Link: https://lkml.kernel.org/r/20230126200727.1680362-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Fixes: 298fa1ad ("highmem: Provide generic variant of kmap_atomic*")
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • migrate: hugetlb: check for hugetlb shared PMD in node migration · 73bdf65e
      Mike Kravetz authored
      migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required to
      move pages shared with another process to a different node.  page_mapcount
      > 1 is being used to determine if a hugetlb page is shared.  However, a
      hugetlb page will have a mapcount of 1 if mapped by multiple processes via
      a shared PMD.  As a result, hugetlb pages shared by multiple processes and
      mapped with a shared PMD can be moved by a process without CAP_SYS_NICE.
      
      To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is found
      consider the page shared.
      
      Link: https://lkml.kernel.org/r/20230126222721.222195-3-mike.kravetz@oracle.com
      Fixes: e2d8cf40 ("migrate: add hugepage migration code to migrate_pages()")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps · 3489dbb6
      Mike Kravetz authored
      Patch series "Fixes for hugetlb mapcount at most 1 for shared PMDs".
      
      This issue of mapcount in hugetlb pages referenced by shared PMDs was
      discussed in [1].  The following two patches address the user-visible
      behavior caused by this issue.
      
      [1] https://lore.kernel.org/linux-mm/Y9BF+OCdWnCSilEu@monkey/
      
      
      This patch (of 2):
      
      A hugetlb page will have a mapcount of 1 if mapped by multiple processes
      via a shared PMD.  This is because only the first process increases the
      map count, and subsequent processes just add the shared PMD page to their
      page table.
      
      page_mapcount is being used to decide if a hugetlb page is shared or
      private in /proc/PID/smaps.  Pages referenced via a shared PMD were
      incorrectly being counted as private.
      
      To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is found
      count the hugetlb page as shared.  A new helper to check for a shared PMD
      is added.
      
      [akpm@linux-foundation.org: simplification, per David]
      [akpm@linux-foundation.org: hugetlb.h: include page_ref.h for page_count()]
      Link: https://lkml.kernel.org/r/20230126222721.222195-2-mike.kravetz@oracle.com
      Fixes: 25ee01a2 ("mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups · edb5d0cf
      Zach O'Keefe authored
      In commit 34488399 ("mm/madvise: add file and shmem support to
      MADV_COLLAPSE") we make the following change to find_pmd_or_thp_or_none():
      
      	-       if (!pmd_present(pmde))
      	-               return SCAN_PMD_NULL;
      	+       if (pmd_none(pmde))
      	+               return SCAN_PMD_NONE;
      
      This was for use by MADV_COLLAPSE file/shmem codepaths, where
      MADV_COLLAPSE might identify a pte-mapped hugepage, only to have
      khugepaged race in, free the pte table, and clear the pmd.  Such
      codepaths include:
      
      A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
         already in the pagecache.
      B) In retract_page_tables(), if we fail to grab mmap_lock for the target
         mm/address.
      
      In these cases, collapse_pte_mapped_thp() really does expect a none (not
      just !present) pmd, and we want to identify that case separately from the
      case where no pmd is found, or it's a bad-pmd (of course, many things
      could happen once we drop mmap_lock, and the pmd could plausibly undergo
      multiple transitions due to an intervening fault, split, etc).
      Regardless, the code is prepared to install a huge-pmd only when the
      existing pmd entry is either a genuine pte-table-mapping-pmd, or the
      none-pmd.
      
      However, the commit introduces a logical hole; namely, that we've allowed
      !none- && !huge- && !bad-pmds to be classified as genuine
      pte-table-mapping-pmds.  One such example that could leak through is a
      swap entry.  The pmd values aren't checked again before use in
      pte_offset_map_lock(), which is expecting nothing less than a genuine
      pte-table-mapping-pmd.
      
      We want to put back the !pmd_present() check (below the pmd_none()
      check), but need to be careful to deal with subtleties in pmd transitions
      and their treatment by the various architectures.
      
      The issue is that __split_huge_pmd_locked() temporarily clears the present
      bit (or otherwise marks the entry as invalid), but pmd_present() and
      pmd_trans_huge() still need to return true while the pmd is in this
      transitory state.  For example, x86's pmd_present() also checks the
      _PAGE_PSE bit, riscv's version also checks the _PAGE_LEAF bit, and arm64
      also checks a PMD_PRESENT_INVALID bit.
      
      Covering all 4 cases for x86 (all checks done on the same pmd value):
      
      1) pmd_present() && pmd_trans_huge()
         All we actually know here is that the PSE bit is set. Either:
         a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
            is set.
            => huge-pmd
         b) We are currently racing with __split_huge_page().  The danger here
            is that we proceed as-if we have a huge-pmd, but really we are
            looking at a pte-mapping-pmd.  So, what is the risk of this
            danger?
      
            The only relevant path is:
      
      	madvise_collapse() -> collapse_pte_mapped_thp()
      
            Where we might just incorrectly report back "success", when really
            the memory isn't pmd-backed.  This is fine, since split could
            happen immediately after (actually) successful madvise_collapse().
            So, it should be safe to just assume huge-pmd here.
      
      2) pmd_present() && !pmd_trans_huge()
         Either:
         a) PSE not set and either PRESENT or PROTNONE is.
            => pte-table-mapping pmd (or PROT_NONE)
         b) devmap.  This routine can be called immediately after
            unlocking/locking mmap_lock -- or called with no locks held (see
            khugepaged_scan_mm_slot()), so previous VMA checks have since been
            invalidated.
      
      3) !pmd_present() && pmd_trans_huge()
        Not possible.
      
      4) !pmd_present() && !pmd_trans_huge()
        Neither PRESENT nor PROTNONE set
        => not present
      
      I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
      powerpc, loongarch, x86, mips, s390) and this logic roughly translates
      (though devmap treatment is unique to x86 and powerpc, and (3) doesn't
      necessarily hold in general -- but that doesn't matter since
      !pmd_present() always takes the failure path).
      
      Also, add a comment above find_pmd_or_thp_or_none() to help future
      travelers reason about the validity of the code; namely, the possible
      mutations that might happen out from under us, depending on how mmap_lock
      is held (if at all).
      
      Link: https://lkml.kernel.org/r/20230125225358.2576151-1-zokeefe@google.com
      Fixes: 34488399 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Revert "mm: kmemleak: alloc gray object for reserved region with direct map" · 8ef852f1
      Isaac J. Manjarres authored
      This reverts commit 972fa3a7.
      
      Kmemleak operates by periodically scanning memory regions for pointers to
      allocated memory blocks to determine if they are leaked or not.  However,
      reserved memory regions can be used for DMA transactions between a device
      and a CPU, and thus, wouldn't contain pointers to allocated memory blocks,
      making them inappropriate for kmemleak to scan.  Thus, revert this commit.
      
      Link: https://lkml.kernel.org/r/20230124230254.295589-1-isaacmanjarres@google.com
      Fixes: 972fa3a7 ("mm: kmemleak: alloc gray object for reserved region with direct map")
      Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Calvin Zhang <calvinzhang.cool@gmail.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Saravana Kannan <saravanak@google.com>
      Cc: <stable@vger.kernel.org>	[5.17+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • freevxfs: Kconfig: fix spelling · 0d7866ea
      Randy Dunlap authored
      Fix a spello in freevxfs Kconfig.
      (reported by codespell)
      
      Link: https://lkml.kernel.org/r/20230124181638.15604-1-rdunlap@infradead.org
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: should get pivots boundary by type · ab6ef70a
      Wei Yang authored
      We should get the pivots boundary by type.  This fixes a potential
      overindexing of mt_pivots[].
      
      Link: https://lkml.kernel.org/r/20221112234308.23823-1-richard.weiyang@gmail.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm, mremap: fix mremap() expanding for vma's with vm_ops->close() · d014cd7c
      Vlastimil Babka authored
      Fabian has reported another regression in 6.1 due to ca3d76b0 ("mm:
      add merging after mremap resize").  The problem is that vma_merge() can
      fail when the vma has a vm_ops->close() method, causing the
      is_mergeable_vma() test to be negative.  This was happening for a vma
      mapping a file from fuse-overlayfs, which does have the method.  But when
      we are simply expanding the vma, we never remove it due to the "merge"
      with the added area, so the test should not prevent the expansion.
      
      As a quick fix, check for such vmas and expand them using vma_adjust()
      directly as was done before commit ca3d76b0.  For a more robust
      long-term solution we should try to limit the check for vm_ops->close
      only to cases that actually result in vma removal, so that no merge
      would be prevented unnecessarily.
      
      [akpm@linux-foundation.org: fix indenting whitespace, reflow comment]
      Link: https://lkml.kernel.org/r/20230117101939.9753-1-vbabka@suse.cz
      Fixes: ca3d76b0 ("mm: add merging after mremap resize")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Fabian Vogt <fvogt@suse.com>
        Link: https://bugzilla.suse.com/show_bug.cgi?id=1206359#c35
      Tested-by: Fabian Vogt <fvogt@suse.com>
      Cc: Jakub Matěna <matenajakub@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • squashfs: harden sanity check in squashfs_read_xattr_id_table · 72e544b1
      Fedor Pchelkin authored
      While mounting a corrupted filesystem, a signed integer '*xattr_ids' can
      become less than zero.  This leads to the incorrect computation of 'len'
      and 'indexes' values which can cause null-ptr-deref in copy_bio_to_actor()
      or out-of-bounds accesses in the next sanity checks inside
      squashfs_read_xattr_id_table().
      
      Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
      
      Link: https://lkml.kernel.org/r/20230117105226.329303-2-pchelkin@ispras.ru
      Fixes: 506220d2 ("squashfs: add more sanity checks in xattr id lookup")
      Reported-by: <syzbot+082fa4af80a5bb1a9843@syzkaller.appspotmail.com>
      Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
      Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Cc: Phillip Lougher <phillip@squashfs.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • ia64: fix build error due to switch case label appearing next to declaration · 6f28a261
      James Morse authored
      Since commit aa06a9bd ("ia64: fix clock_getres(CLOCK_MONOTONIC) to
      report ITC frequency"), gcc 10.1.0 fails to build ia64 with the gnomic:
      | ../arch/ia64/kernel/sys_ia64.c: In function 'ia64_clock_getres':
      | ../arch/ia64/kernel/sys_ia64.c:189:3: error: a label can only be part of a statement and a declaration is not a statement
      |   189 |   s64 tick_ns = DIV_ROUND_UP(NSEC_PER_SEC, local_cpu_data->itc_freq);
      
      This line appears immediately after a case label in a switch.
      
      Move the declarations out of the case, to the top of the function.
      
      Link: https://lkml.kernel.org/r/20230117151632.393836-1-james.morse@arm.com
      Fixes: aa06a9bd ("ia64: fix clock_getres(CLOCK_MONOTONIC) to report ITC frequency")
      Signed-off-by: James Morse <james.morse@arm.com>
      Reviewed-by: Sergei Trofimovich <slyich@gmail.com>
      Cc: Émeric Maschino <emeric.maschino@gmail.com>
      Cc: matoro <matoro_mailinglist_kernel@matoro.tk>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: fix crash during cgroup migration · de08eaa6
      Yu Zhao authored
      lru_gen_migrate_mm() assumes lru_gen_add_mm() runs prior to itself.  This
      isn't true for the following scenario:
      
          CPU 1                         CPU 2
      
        clone()
          cgroup_can_fork()
                                      cgroup_procs_write()
          cgroup_post_fork()
                                        task_lock()
                                        lru_gen_migrate_mm()
                                        task_unlock()
          task_lock()
          lru_gen_add_mm()
          task_unlock()
      
      And when the above happens, kernel crashes because of linked list
      corruption (mm_struct->lru_gen.list).
      
      Link: https://lore.kernel.org/r/20230115134651.30028-1-msizanoen@qtmlabs.xyz/
      Link: https://lkml.kernel.org/r/20230116034405.2960276-1-yuzhao@google.com
      Fixes: bd74fdae ("mm: multi-gen LRU: support page table walks")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reported-by: msizanoen <msizanoen@qtmlabs.xyz>
      Tested-by: msizanoen <msizanoen@qtmlabs.xyz>
      Cc: <stable@vger.kernel.org>	[6.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Revert "mm: add nodes= arg to memory.reclaim" · 55ab834a
      Michal Hocko authored
      This reverts commit 12a5d395.
      
      Although it is recognized that a finer-grained pro-active reclaim is
      something we need and want, the semantics of this implementation are
      really ambiguous.
      
      In a follow up discussion it became clear that there are two essential
      usecases here.  One is to use memory.reclaim to pro-actively reclaim
      memory and expectation is that the requested and reported amount of memory
      is uncharged from the memcg.  Another usecase focuses on pro-active
      demotion when the memory is merely shuffled around to demotion targets
      while the overall charged memory stays unchanged.
      
      The current implementation considers demoted pages as reclaimed, and
      that breaks both usecases.  [1] has tried to address the reporting part,
      but there are more issues with that, summarized in [2] and follow-up
      emails.
      
      Let's revert the nodemask based extension of the memcg pro-active
      reclaim for now until we settle with a more robust semantic.
      
      [1] http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com
      [2] http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1@dhcp22.suse.cz
      
      Link: https://lkml.kernel.org/r/Y5xASNe1x8cusiTx@dhcp22.suse.cz
      Fixes: 12a5d395 ("mm: add nodes= arg to memory.reclaim")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: zefan li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zsmalloc: fix a race with deferred_handles storing · 85b32581
      Nhat Pham authored
      Currently, there is a race between zs_free() and zs_reclaim_page():
      zs_reclaim_page() finds a handle to an allocated object, but before the
      eviction happens, an independent zs_free() call to the same handle could
      come in and overwrite the object value stored at the handle with the last
      deferred handle.  When zs_reclaim_page() finally gets to call the eviction
      handler, it will see an invalid object value (i.e. the previous deferred
      handle instead of the original object value).
      
      This race happens quite infrequently.  We only managed to produce it with
      out-of-tree developmental code that triggers zsmalloc writeback with a
      much higher frequency than usual.
      
      This patch fixes this race by storing the deferred handle in the object
      header instead.  We differentiate the deferred handle from the other two
      cases (handle for allocated object, and linkage for free object) with a
      new tag.  If zspage reclamation succeeds, we will free these deferred
      handles by walking through the zspage objects.  On the other hand, if
      zspage reclamation fails, we reconstruct the zspage freelist (with the
      deferred handle tag and allocated tag) before trying again with the
      reclamation.
      
      [arnd@arndb.de: avoid unused-function warning]
        Link: https://lkml.kernel.org/r/20230117170507.2651972-1-arnd@kernel.org
      Link: https://lkml.kernel.org/r/20230110231701.326724-1-nphamcs@gmail.com
      Fixes: 9997bc01 ("zsmalloc: implement writeback mechanism for zsmalloc")
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: fix ->anon_vma race · 023f47a8
      Jann Horn authored
      If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
      it to be locked.
      
      Page table traversal is allowed under any one of the mmap lock, the
      anon_vma lock (if the VMA is associated with an anon_vma), and the
      mapping lock (if the VMA is associated with a mapping); and so to be
      able to remove page tables, we must hold all three of them. 
      retract_page_tables() bails out if an ->anon_vma is attached, but does
      this check before holding the mmap lock (as the comment above the check
      explains).
      
      If we racily merged an existing ->anon_vma (shared with a child
      process) from a neighboring VMA, subsequent rmap traversals on pages
      belonging to the child will be able to see the page tables that we are
      concurrently removing while assuming that nothing else can access them.
      
      Repeat the ->anon_vma check once we hold the mmap lock to ensure that
      there really is no concurrent page table access.
      
      Hitting this bug causes a lockdep warning in collapse_and_free_pmd(),
      in the line "lockdep_assert_held_write(&vma->anon_vma->root->rwsem)". 
      It can also lead to use-after-free access.
      
      Link: https://lore.kernel.org/linux-mm/CAG48ez3434wZBKFFbdx4M9j6eUwSUVPd4dxhzW_k_POneSDF+A@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20230111133351.807024-1-jannh@google.com
      Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
      Signed-off-by: Jann Horn <jannh@google.com>
      Reported-by: Zach O'Keefe <zokeefe@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: fix mas_empty_area_rev() lower bound validation · 7327e811
      Liam Howlett authored
      mas_empty_area_rev() was not correctly validating the start of a gap
      against the lower limit.  This could lead to the range starting lower than
      the requested minimum.
      
      Fix the issue by better validating a gap once one is found.
      
      This commit also adds tests to the maple tree test suite for this issue
      and tests the mas_empty_area() function for similar bound checking.
      
      Link: https://lkml.kernel.org/r/20230111200136.1851322-1-Liam.Howlett@oracle.com
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=216911
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reported-by: <amanieu@gmail.com>
      Link: https://lore.kernel.org/linux-mm/0b9f5425-08d4-8013-aa4c-e620c3b10bb2@leemhuis.info/
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7327e811
  2. 20 Jan, 2023 1 commit
  3. 19 Jan, 2023 4 commits
  4. 15 Jan, 2023 4 commits
  5. 14 Jan, 2023 7 commits
    • Linus Torvalds's avatar
      Merge tag 'iommu-fixes-v6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 7c698440
      Linus Torvalds authored
      Pull iommu fixes from Joerg Roedel:
      
       - Core: Fix an iommu-group refcount leak
      
       - Fix overflow issue in IOVA alloc path
      
       - ARM-SMMU fixes from Will:
          - Fix VFIO regression on NXP SoCs by reporting IOMMU_CAP_CACHE_COHERENCY
          - Fix SMMU shutdown paths to avoid device unregistration race
      
       - Error handling fix for Mediatek IOMMU driver
      
      * tag 'iommu-fixes-v6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu/mediatek-v1: Fix an error handling path in mtk_iommu_v1_probe()
        iommu/iova: Fix alloc iova overflows issue
        iommu: Fix refcount leak in iommu_device_claim_dma_owner
        iommu/arm-smmu-v3: Don't unregister on shutdown
        iommu/arm-smmu: Don't unregister on shutdown
        iommu/arm-smmu: Report IOMMU_CAP_CACHE_COHERENCY even betterer
      7c698440
    • Linus Torvalds's avatar
      Merge tag 'fixes-2023-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock · 4f43ade4
      Linus Torvalds authored
      Pull memblock fix from Mike Rapoport:
       "memblock: always release pages to the buddy allocator in
        memblock_free_late()
      
        If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
        only releases pages to the buddy allocator if they are not in the
        deferred range. This is correct for free pages (as defined by
        for_each_free_mem_pfn_range_in_zone()) because free pages in the
        deferred range will be initialized and released as part of the
        deferred init process.
      
        memblock_free_pages() is called by memblock_free_late(), which is used
        to free reserved ranges after memblock_free_all() has run. All pages
        in reserved ranges have been initialized at that point, and
        accordingly, those pages are not touched by the deferred init process.
      
        This means that currently, if the pages that memblock_free_late()
        intends to release are in the deferred range, they will never be
        released to the buddy allocator. They will forever be reserved.
      
        In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
        which is also correct for free pages but is not correct for reserved
        pages. KMSAN metadata for reserved pages is initialized by
        kmsan_init_shadow(), which runs shortly before memblock_free_all().
      
        For both of these reasons, memblock_free_pages() should only be called
        for free pages, and memblock_free_late() should call
        __free_pages_core() directly instead.
      
        One case where this issue can occur in the wild is EFI boot on x86_64.
        The x86 EFI code reserves all EFI boot services memory ranges via
        memblock_reserve() and frees them later via memblock_free_late()
        (efi_reserve_boot_services() and efi_free_boot_services(),
        respectively).
      
        If any of those ranges happens to fall within the deferred init range,
        the pages will not be released and that memory will be unavailable.
      
        For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
      
          v6.2-rc2:
          Node 0, zone      DMA
                spanned  4095
                present  3999
                managed  3840
          Node 0, zone    DMA32
                spanned  246652
                present  245868
                managed  178867
      
          v6.2-rc2 + patch:
          Node 0, zone      DMA
                spanned  4095
                present  3999
                managed  3840
          Node 0, zone    DMA32
                spanned  246652
                present  245868
                managed  222816   # +43,949 pages"
      
      * tag 'fixes-2023-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
        mm: Always release pages to the buddy allocator in memblock_free_late().
      4f43ade4
    • Linus Torvalds's avatar
      Merge tag 'hardening-v6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 880ca43e
      Linus Torvalds authored
      Pull kernel hardening fixes from Kees Cook:
      
       - Fix CFI hash randomization with KASAN (Sami Tolvanen)
      
       - Check size of coreboot table entry and use flex-array
      
      * tag 'hardening-v6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        kbuild: Fix CFI hash randomization with KASAN
        firmware: coreboot: Check size of table entry and use flex-array
      880ca43e
    • Linus Torvalds's avatar
      Merge tag 'modules-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux · 8b7be52f
      Linus Torvalds authored
      Pull module fix from Luis Chamberlain:
       "Just one fix for modules by Nick"
      
      * tag 'modules-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
        kallsyms: Fix scheduling with interrupts disabled in self-test
      8b7be52f
    • Linus Torvalds's avatar
      Merge tag '6.2-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 · b35ad63e
      Linus Torvalds authored
      Pull cifs fixes from Steve French:
      
       - memory leak and double free fix
      
       - two symlink fixes
      
       - minor cleanup fix
      
       - two smb1 fixes
      
      * tag '6.2-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: Fix uninitialized memory read for smb311 posix symlink create
        cifs: fix potential memory leaks in session setup
        cifs: do not query ifaces on smb1 mounts
        cifs: fix double free on failed kerberos auth
        cifs: remove redundant assignment to the variable match
        cifs: fix file info setting in cifs_open_file()
        cifs: fix file info setting in cifs_query_path_info()
      b35ad63e
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 8e768130
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Two minor fixes in the hisi_sas driver which only impact enterprise
        style multi-expander and shared disk situations and no core changes"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: hisi_sas: Set a port invalid only if there are no devices attached when refreshing port id
        scsi: hisi_sas: Use abort task set to reset SAS disks when discovered
      8e768130
    • Linus Torvalds's avatar
      Merge tag 'ata-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata · 34cbf89a
      Linus Torvalds authored
      Pull ATA fix from Damien Le Moal:
       "A single fix to prevent building the pata_cs5535 driver with user mode
        linux as it uses msr operations that are not defined with UML"
      
      * tag 'ata-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
        ata: pata_cs5535: Don't build on UML
      34cbf89a
  6. 13 Jan, 2023 6 commits