- 01 Feb, 2023 18 commits
-
-
Phillip Lougher authored
A Sysbot [1] corrupted filesystem exposes two flaws in the handling and sanity checking of the xattr_ids count in the filesystem. Both of these flaws cause computation overflow due to incorrect typing. In the corrupted filesystem the xattr_ids value is 4294967071, which stored in a signed variable becomes the negative number -225. Flaw 1 (64-bit systems only): The signed integer xattr_ids variable causes sign extension. This causes variable overflow in the SQUASHFS_XATTR_*(A) macros. The variable is first multiplied by sizeof(struct squashfs_xattr_id) where the type of the sizeof operator is "unsigned long". On a 64-bit system this is 64-bits in size, and causes the negative number to be sign extended and widened to 64-bits and then become unsigned. This produces the very large number 18446744073709548016 or 2^64 - 3600. This number when rounded up by SQUASHFS_METADATA_SIZE - 1 (8191 bytes) and divided by SQUASHFS_METADATA_SIZE overflows and produces a length of 0 (stored in len). Flaw 2 (32-bit systems only): On a 32-bit system the integer variable is not widened by the unsigned long type of the sizeof operator (32-bits), and the signedness of the variable has no effect due it always being treated as unsigned. The above corrupted xattr_ids value of 4294967071, when multiplied overflows and produces the number 4294963696 or 2^32 - 3400. This number when rounded up by SQUASHFS_METADATA_SIZE - 1 (8191 bytes) and divided by SQUASHFS_METADATA_SIZE overflows again and produces a length of 0. The effect of the 0 length computation: In conjunction with the corrupted xattr_ids field, the filesystem also has a corrupted xattr_table_start value, where it matches the end of filesystem value of 850. This causes the following sanity check code to fail because the incorrectly computed len of 0 matches the incorrect size of the table reported by the superblock (0 bytes). len = SQUASHFS_XATTR_BLOCK_BYTES(*xattr_ids); indexes = SQUASHFS_XATTR_BLOCKS(*xattr_ids); /* * The computed size of the index table (len bytes) should exactly * match the table start and end points */ start = table_start + sizeof(*id_table); end = msblk->bytes_used; if (len != (end - start)) return ERR_PTR(-EINVAL); Changing the xattr_ids variable to be "usigned int" fixes the flaw on a 64-bit system. This relies on the fact the computation is widened by the unsigned long type of the sizeof operator. Casting the variable to u64 in the above macro fixes this flaw on a 32-bit system. It also means 64-bit systems do not implicitly rely on the type of the sizeof operator to widen the computation. [1] https://lore.kernel.org/lkml/000000000000cd44f005f1a0f17f@google.com/ Link: https://lkml.kernel.org/r/20230127061842.10965-1-phillip@squashfs.org.uk Fixes: 506220d2 ("squashfs: add more sanity checks in xattr id lookup") Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Reported-by: <syzbot+082fa4af80a5bb1a9843@syzkaller.appspotmail.com> Cc: Alexey Khoroshilov <khoroshilov@ispras.ru> Cc: Fedor Pchelkin <pchelkin@ispras.ru> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Tom Saeger authored
sh vmlinux fails to link with GNU ld < 2.40 (likely < 2.36) since commit 99cb0d91 ("arch: fix broken BuildID for arm64 and riscv"). This is similar to fixes for powerpc and s390: commit 4b9880db ("powerpc/vmlinux.lds: Define RUNTIME_DISCARD_EXIT"). commit a494398b ("s390: define RUNTIME_DISCARD_EXIT to fix link error with GNU ld < 2.36"). $ sh4-linux-gnu-ld --version | head -n1 GNU ld (GNU Binutils for Debian) 2.35.2 $ make ARCH=sh CROSS_COMPILE=sh4-linux-gnu- microdev_defconfig $ make ARCH=sh CROSS_COMPILE=sh4-linux-gnu- `.exit.text' referenced in section `__bug_table' of crypto/algboss.o: defined in discarded section `.exit.text' of crypto/algboss.o `.exit.text' referenced in section `__bug_table' of drivers/char/hw_random/core.o: defined in discarded section `.exit.text' of drivers/char/hw_random/core.o make[2]: *** [scripts/Makefile.vmlinux:34: vmlinux] Error 1 make[1]: *** [Makefile:1252: vmlinux] Error 2 arch/sh/kernel/vmlinux.lds.S keeps EXIT_TEXT: /* * .exit.text is discarded at runtime, not link time, to deal with * references from __bug_table */ .exit.text : AT(ADDR(.exit.text)) { EXIT_TEXT } However, EXIT_TEXT is thrown away by DISCARD(include/asm-generic/vmlinux.lds.h) because sh does not define RUNTIME_DISCARD_EXIT. GNU ld 2.40 does not have this issue and builds fine. This corresponds with Masahiro's comments in a494398b: "Nathan [Chancellor] also found that binutils commit 21401fc7bf67 ("Duplicate output sections in scripts") cured this issue, so we cannot reproduce it with binutils 2.36+, but it is better to not rely on it." Link: https://lkml.kernel.org/r/9166a8abdc0f979e50377e61780a4bba1dfa2f52.1674518464.git.tom.saeger@oracle.com Fixes: 99cb0d91 ("arch: fix broken BuildID for arm64 and riscv") Link: https://lore.kernel.org/all/Y7Jal56f6UBh1abE@dev-arch.thelio-3990X/ Link: https://lore.kernel.org/all/20230123194218.47ssfzhrpnv3xfez@oracle.com/Signed-off-by: Tom Saeger <tom.saeger@oracle.com> Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Dennis Gilmore <dennis@ausil.us> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Rich Felker <dalias@libc.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
We already round down the address in kunmap_local_indexed() which is the other implementation of __kunmap_local(). The only implementation of kunmap_flush_on_unmap() is PA-RISC which is expecting a page-aligned address. This may be causing PA-RISC to be flushing the wrong addresses currently. Link: https://lkml.kernel.org/r/20230126200727.1680362-1-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Fixes: 298fa1ad ("highmem: Provide generic variant of kmap_atomic*") Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Helge Deller <deller@gmx.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: David Sterba <dsterba@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Mike Kravetz authored
migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required to move pages shared with another process to a different node. page_mapcount > 1 is being used to determine if a hugetlb page is shared. However, a hugetlb page will have a mapcount of 1 if mapped by multiple processes via a shared PMD. As a result, hugetlb pages shared by multiple processes and mapped with a shared PMD can be moved by a process without CAP_SYS_NICE. To fix, check for a shared PMD if mapcount is 1. If a shared PMD is found consider the page shared. Link: https://lkml.kernel.org/r/20230126222721.222195-3-mike.kravetz@oracle.com Fixes: e2d8cf40 ("migrate: add hugepage migration code to migrate_pages()") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Mike Kravetz authored
Patch series "Fixes for hugetlb mapcount at most 1 for shared PMDs". This issue of mapcount in hugetlb pages referenced by shared PMDs was discussed in [1]. The following two patches address user visible behavior caused by this issue. [1] https://lore.kernel.org/linux-mm/Y9BF+OCdWnCSilEu@monkey/ This patch (of 2): A hugetlb page will have a mapcount of 1 if mapped by multiple processes via a shared PMD. This is because only the first process increases the map count, and subsequent processes just add the shared PMD page to their page table. page_mapcount is being used to decide if a hugetlb page is shared or private in /proc/PID/smaps. Pages referenced via a shared PMD were incorrectly being counted as private. To fix, check for a shared PMD if mapcount is 1. If a shared PMD is found count the hugetlb page as shared. A new helper to check for a shared PMD is added. [akpm@linux-foundation.org: simplification, per David] [akpm@linux-foundation.org: hugetlb.h: include page_ref.h for page_count()] Link: https://lkml.kernel.org/r/20230126222721.222195-2-mike.kravetz@oracle.com Fixes: 25ee01a2 ("mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Zach O'Keefe authored
In commit 34488399 ("mm/madvise: add file and shmem support to MADV_COLLAPSE") we make the following change to find_pmd_or_thp_or_none(): - if (!pmd_present(pmde)) - return SCAN_PMD_NULL; + if (pmd_none(pmde)) + return SCAN_PMD_NONE; This was for-use by MADV_COLLAPSE file/shmem codepaths, where MADV_COLLAPSE might identify a pte-mapped hugepage, only to have khugepaged race-in, free the pte table, and clear the pmd. Such codepaths include: A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER already in the pagecache. B) In retract_page_tables(), if we fail to grab mmap_lock for the target mm/address. In these cases, collapse_pte_mapped_thp() really does expect a none (not just !present) pmd, and we want to suitably identify that case separate from the case where no pmd is found, or it's a bad-pmd (of course, many things could happen once we drop mmap_lock, and the pmd could plausibly undergo multiple transitions due to intervening fault, split, etc). Regardless, the code is prepared install a huge-pmd only when the existing pmd entry is either a genuine pte-table-mapping-pmd, or the none-pmd. However, the commit introduces a logical hole; namely, that we've allowed !none- && !huge- && !bad-pmds to be classified as genuine pte-table-mapping-pmds. One such example that could leak through are swap entries. The pmd values aren't checked again before use in pte_offset_map_lock(), which is expecting nothing less than a genuine pte-table-mapping-pmd. We want to put back the !pmd_present() check (below the pmd_none() check), but need to be careful to deal with subtleties in pmd transitions and treatments by various arch. The issue is that __split_huge_pmd_locked() temporarily clears the present bit (or otherwise marks the entry as invalid), but pmd_present() and pmd_trans_huge() still need to return true while the pmd is in this transitory state. For example, x86's pmd_present() also checks the _PAGE_PSE , riscv's version also checks the _PAGE_LEAF bit, and arm64 also checks a PMD_PRESENT_INVALID bit. Covering all 4 cases for x86 (all checks done on the same pmd value): 1) pmd_present() && pmd_trans_huge() All we actually know here is that the PSE bit is set. Either: a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE is set. => huge-pmd b) We are currently racing with __split_huge_page(). The danger here is that we proceed as-if we have a huge-pmd, but really we are looking at a pte-mapping-pmd. So, what is the risk of this danger? The only relevant path is: madvise_collapse() -> collapse_pte_mapped_thp() Where we might just incorrectly report back "success", when really the memory isn't pmd-backed. This is fine, since split could happen immediately after (actually) successful madvise_collapse(). So, it should be safe to just assume huge-pmd here. 2) pmd_present() && !pmd_trans_huge() Either: a) PSE not set and either PRESENT or PROTNONE is. => pte-table-mapping pmd (or PROT_NONE) b) devmap. This routine can be called immediately after unlocking/locking mmap_lock -- or called with no locks held (see khugepaged_scan_mm_slot()), so previous VMA checks have since been invalidated. 3) !pmd_present() && pmd_trans_huge() Not possible. 4) !pmd_present() && !pmd_trans_huge() Neither PRESENT nor PROTNONE set => not present I've checked all archs that implement pmd_trans_huge() (arm64, riscv, powerpc, longarch, x86, mips, s390) and this logic roughly translates (though devmap treatment is unique to x86 and powerpc, and (3) doesn't necessarily hold in general -- but that doesn't matter since !pmd_present() always takes failure path). Also, add a comment above find_pmd_or_thp_or_none() to help future travelers reason about the validity of the code; namely, the possible mutations that might happen out from under us, depending on how mmap_lock is held (if at all). Link: https://lkml.kernel.org/r/20230125225358.2576151-1-zokeefe@google.com Fixes: 34488399 ("mm/madvise: add file and shmem support to MADV_COLLAPSE") Signed-off-by: Zach O'Keefe <zokeefe@google.com> Reported-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Isaac J. Manjarres authored
This reverts commit 972fa3a7. Kmemleak operates by periodically scanning memory regions for pointers to allocated memory blocks to determine if they are leaked or not. However, reserved memory regions can be used for DMA transactions between a device and a CPU, and thus, wouldn't contain pointers to allocated memory blocks, making them inappropriate for kmemleak to scan. Thus, revert this commit. Link: https://lkml.kernel.org/r/20230124230254.295589-1-isaacmanjarres@google.com Fixes: 972fa3a7 ("mm: kmemleak: alloc gray object for reserved region with direct map") Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Calvin Zhang <calvinzhang.cool@gmail.com> Cc: Frank Rowand <frowand.list@gmail.com> Cc: Rob Herring <robh+dt@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: <stable@vger.kernel.org> [5.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Randy Dunlap authored
Fix a spello in freevxfs Kconfig. (reported by codespell) Link: https://lkml.kernel.org/r/20230124181638.15604-1-rdunlap@infradead.orgSigned-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Wei Yang authored
We should get pivots boundary by type. Fixes a potential overindexing of mt_pivots[]. Link: https://lkml.kernel.org/r/20221112234308.23823-1-richard.weiyang@gmail.com Fixes: 54a611b6 ("Maple Tree: add new data structure") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Eugen Hristev authored
Update e-mail address. Link: https://lkml.kernel.org/r/20230119072229.99603-1-eugen.hristev@collabora.comSigned-off-by: Eugen Hristev <eugen.hristev@collabora.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Vlastimil Babka authored
Fabian has reported another regression in 6.1 due to ca3d76b0 ("mm: add merging after mremap resize"). The problem is that vma_merge() can fail when vma has a vm_ops->close() method, causing is_mergeable_vma() test to be negative. This was happening for vma mapping a file from fuse-overlayfs, which does have the method. But when we are simply expanding the vma, we never remove it due to the "merge" with the added area, so the test should not prevent the expansion. As a quick fix, check for such vmas and expand them using vma_adjust() directly as was done before commit ca3d76b0. For a more robust long term solution we should try to limit the check for vma_ops->close only to cases that actually result in vma removal, so that no merge would be prevented unnecessarily. [akpm@linux-foundation.org: fix indenting whitespace, reflow comment] Link: https://lkml.kernel.org/r/20230117101939.9753-1-vbabka@suse.cz Fixes: ca3d76b0 ("mm: add merging after mremap resize") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Fabian Vogt <fvogt@suse.com> Link: https://bugzilla.suse.com/show_bug.cgi?id=1206359#c35Tested-by: Fabian Vogt <fvogt@suse.com> Cc: Jakub Matěna <matenajakub@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Fedor Pchelkin authored
While mounting a corrupted filesystem, a signed integer '*xattr_ids' can become less than zero. This leads to the incorrect computation of 'len' and 'indexes' values which can cause null-ptr-deref in copy_bio_to_actor() or out-of-bounds accesses in the next sanity checks inside squashfs_read_xattr_id_table(). Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Link: https://lkml.kernel.org/r/20230117105226.329303-2-pchelkin@ispras.ru Fixes: 506220d2 ("squashfs: add more sanity checks in xattr id lookup") Reported-by: <syzbot+082fa4af80a5bb1a9843@syzkaller.appspotmail.com> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Cc: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
James Morse authored
Since commit aa06a9bd ("ia64: fix clock_getres(CLOCK_MONOTONIC) to report ITC frequency"), gcc 10.1.0 fails to build ia64 with the gnomic: | ../arch/ia64/kernel/sys_ia64.c: In function 'ia64_clock_getres': | ../arch/ia64/kernel/sys_ia64.c:189:3: error: a label can only be part of a statement and a declaration is not a statement | 189 | s64 tick_ns = DIV_ROUND_UP(NSEC_PER_SEC, local_cpu_data->itc_freq); This line appears immediately after a case label in a switch. Move the declarations out of the case, to the top of the function. Link: https://lkml.kernel.org/r/20230117151632.393836-1-james.morse@arm.com Fixes: aa06a9bd ("ia64: fix clock_getres(CLOCK_MONOTONIC) to report ITC frequency") Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Sergei Trofimovich <slyich@gmail.com> Cc: Émeric Maschino <emeric.maschino@gmail.com> Cc: matoro <matoro_mailinglist_kernel@matoro.tk> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yu Zhao authored
lru_gen_migrate_mm() assumes lru_gen_add_mm() runs prior to itself. This isn't true for the following scenario: CPU 1 CPU 2 clone() cgroup_can_fork() cgroup_procs_write() cgroup_post_fork() task_lock() lru_gen_migrate_mm() task_unlock() task_lock() lru_gen_add_mm() task_unlock() And when the above happens, kernel crashes because of linked list corruption (mm_struct->lru_gen.list). Link: https://lore.kernel.org/r/20230115134651.30028-1-msizanoen@qtmlabs.xyz/ Link: https://lkml.kernel.org/r/20230116034405.2960276-1-yuzhao@google.com Fixes: bd74fdae ("mm: multi-gen LRU: support page table walks") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: msizanoen <msizanoen@qtmlabs.xyz> Tested-by: msizanoen <msizanoen@qtmlabs.xyz> Cc: <stable@vger.kernel.org> [6.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Michal Hocko authored
This reverts commit 12a5d395. Although it is recognized that a finer grained pro-active reclaim is something we need and want the semantic of this implementation is really ambiguous. In a follow up discussion it became clear that there are two essential usecases here. One is to use memory.reclaim to pro-actively reclaim memory and expectation is that the requested and reported amount of memory is uncharged from the memcg. Another usecase focuses on pro-active demotion when the memory is merely shuffled around to demotion targets while the overall charged memory stays unchanged. The current implementation considers demoted pages as reclaimed and that break both usecases. [1] has tried to address the reporting part but there are more issues with that summarized in [2] and follow up emails. Let's revert the nodemask based extension of the memcg pro-active reclaim for now until we settle with a more robust semantic. [1] http://lkml.kernel.org/r/http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com [2] http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1@dhcp22.suse.cz Link: https://lkml.kernel.org/r/Y5xASNe1x8cusiTx@dhcp22.suse.cz Fixes: 12a5d395 ("mm: add nodes= arg to memory.reclaim") Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: zefan li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Nhat Pham authored
Currently, there is a race between zs_free() and zs_reclaim_page(): zs_reclaim_page() finds a handle to an allocated object, but before the eviction happens, an independent zs_free() call to the same handle could come in and overwrite the object value stored at the handle with the last deferred handle. When zs_reclaim_page() finally gets to call the eviction handler, it will see an invalid object value (i.e the previous deferred handle instead of the original object value). This race happens quite infrequently. We only managed to produce it with out-of-tree developmental code that triggers zsmalloc writeback with a much higher frequency than usual. This patch fixes this race by storing the deferred handle in the object header instead. We differentiate the deferred handle from the other two cases (handle for allocated object, and linkage for free object) with a new tag. If zspage reclamation succeeds, we will free these deferred handles by walking through the zspage objects. On the other hand, if zspage reclamation fails, we reconstruct the zspage freelist (with the deferred handle tag and allocated tag) before trying again with the reclamation. [arnd@arndb.de: avoid unused-function warning] Link: https://lkml.kernel.org/r/20230117170507.2651972-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20230110231701.326724-1-nphamcs@gmail.com Fixes: 9997bc01 ("zsmalloc: implement writeback mechanism for zsmalloc") Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jann Horn authored
If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires it to be locked. Page table traversal is allowed under any one of the mmap lock, the anon_vma lock (if the VMA is associated with an anon_vma), and the mapping lock (if the VMA is associated with a mapping); and so to be able to remove page tables, we must hold all three of them. retract_page_tables() bails out if an ->anon_vma is attached, but does this check before holding the mmap lock (as the comment above the check explains). If we racily merged an existing ->anon_vma (shared with a child process) from a neighboring VMA, subsequent rmap traversals on pages belonging to the child will be able to see the page tables that we are concurrently removing while assuming that nothing else can access them. Repeat the ->anon_vma check once we hold the mmap lock to ensure that there really is no concurrent page table access. Hitting this bug causes a lockdep warning in collapse_and_free_pmd(), in the line "lockdep_assert_held_write(&vma->anon_vma->root->rwsem)". It can also lead to use-after-free access. Link: https://lore.kernel.org/linux-mm/CAG48ez3434wZBKFFbdx4M9j6eUwSUVPd4dxhzW_k_POneSDF+A@mail.gmail.com/ Link: https://lkml.kernel.org/r/20230111133351.807024-1-jannh@google.com Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Jann Horn <jannh@google.com> Reported-by: Zach O'Keefe <zokeefe@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Liam Howlett authored
mas_empty_area_rev() was not correctly validating the start of a gap against the lower limit. This could lead to the range starting lower than the requested minimum. Fix the issue by better validating a gap once one is found. This commit also adds tests to the maple tree test suite for this issue and tests the mas_empty_area() function for similar bound checking. Link: https://lkml.kernel.org/r/20230111200136.1851322-1-Liam.Howlett@oracle.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=216911 Fixes: 54a611b6 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: <amanieu@gmail.com> Link: https://lore.kernel.org/linux-mm/0b9f5425-08d4-8013-aa4c-e620c3b10bb2@leemhuis.info/Tested-by: Holger Hoffsttte <holger@applied-asynchrony.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 20 Jan, 2023 1 commit
-
-
Pengfei Xu authored
When use tools/testing/selftests/kselftest_install.sh to make the kselftest-list.txt under tools/testing/selftests/kselftest_install. Then use tools/testing/selftests/kselftest_install/run_kselftest.sh to run all the kselftests in kselftest-list.txt, it will be blocked by case "filesystems/fat: run_fat_tests.sh" with "Warning: file run_fat_tests.sh is not executable", so grant executable permission to run_fat_tests.sh to fix this issue. Link: https://lkml.kernel.org/r/dfdbba6df8a1ab34bb1e81cd8bd7ca3f9ed5c369.1673424747.git.pengfei.xu@intel.com Fixes: dd7c9be3 ("selftests/filesystems: add a vfat RENAME_EXCHANGE test") Signed-off-by: Pengfei Xu <pengfei.xu@intel.com> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 19 Jan, 2023 4 commits
-
-
Peter Xu authored
__USE_GNU should be an internal macro only used inside glibc. Either memfd_create() or fallocate() requires _GNU_SOURCE per man page, where __USE_GNU will further be defined by glibc headers include/features.h: #ifdef _GNU_SOURCE # define __USE_GNU 1 #endif This fixes: >> hugetlb-madvise.c:20: warning: "__USE_GNU" redefined 20 | #define __USE_GNU | In file included from /usr/include/x86_64-linux-gnu/bits/libc-header-start.h:33, from /usr/include/stdlib.h:26, from hugetlb-madvise.c:16: /usr/include/features.h:407: note: this is the location of the previous definition 407 | # define __USE_GNU 1 | Link: https://lkml.kernel.org/r/Y8V9z+z6Tk7NetI3@x1nSigned-off-by: Peter Xu <peterx@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
This patch should harden commit 15520a3f ("mm: use pte markers for swap errors") on using pte markers for swapin errors on a few corner cases. 1. Propagate swapin errors across fork()s: if there're swapin errors in the parent mm, after fork()s the child should sigbus too when an error page is accessed. 2. Fix a rare condition race in pte_marker_clear() where a uffd-wp pte marker can be quickly switched to a swapin error. 3. Explicitly ignore swapin error pte markers in change_protection(). I mostly don't worry on (2) or (3) at all, but we should still have them. Case (1) is special because it can potentially cause silent data corrupt on child when parent has swapin error triggered with swapoff, but since swapin error is rare itself already it's probably not easy to trigger either. Currently there is a priority difference between the uffd-wp bit and the swapin error entry, in which the swapin error always has higher priority (e.g. we don't need to wr-protect a swapin error pte marker). If there will be a 3rd bit introduced, we'll probably need to consider a more involved approach so we may need to start operate on the bits. Let's leave that for later. This patch is tested with case (1) explicitly where we'll get corrupted data before in the child if there's existing swapin error pte markers, and after patch applied the child can be rightfully killed. We don't need to copy stable for this one since 15520a3f just landed as part of v6.2-rc1, only "Fixes" applied. Link: https://lkml.kernel.org/r/20221214200453.1772655-3-peterx@redhat.com Fixes: 15520a3f ("mm: use pte markers for swap errors") Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
Patch series "mm: Fixes on pte markers". Patch 1 resolves the syzkiller report from Pengfei. Patch 2 further harden pte markers when used with the recent swapin error markers. The major case is we should persist a swapin error marker after fork(), so child shouldn't read a corrupted page. This patch (of 2): When fork(), dst_vma is not guaranteed to have VM_UFFD_WP even if src may have it and has pte marker installed. The warning is improper along with the comment. The right thing is to inherit the pte marker when needed, or keep the dst pte empty. A vague guess is this happened by an accident when there's the prior patch to introduce src/dst vma into this helper during the uffd-wp feature got developed and I probably messed up in the rebase, since if we replace dst_vma with src_vma the warning & comment it all makes sense too. Hugetlb did exactly the right here (copy_hugetlb_page_range()). Fix the general path. Reproducer: https://github.com/xupengfe/syzkaller_logs/blob/main/221208_115556_copy_page_range/repro.c Bugzilla report: https://bugzilla.kernel.org/show_bug.cgi?id=216808 Link: https://lkml.kernel.org/r/20221214200453.1772655-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20221214200453.1772655-2-peterx@redhat.com Fixes: c56d1b62 ("mm/shmem: handle uffd-wp during fork()") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Pengfei Xu <pengfei.xu@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: <stable@vger.kernel.org> # 5.19+ Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrew Morton authored
Merge branch 'master' into mm-hotfixes-stable
-
- 15 Jan, 2023 4 commits
-
-
Linus Torvalds authored
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull x86 fixes from Borislav Petkov: - Make sure the poking PGD is pinned for Xen PV as it requires it this way - Fixes for two resctrl races when moving a task or creating a new monitoring group - Fix SEV-SNP guests running under HyperV where MTRRs are disabled to not return a UC- type mapping type on memremap() and thus cause a serious slowdown - Fix insn mnemonics in bioscall.S now that binutils is starting to fix confusing insn suffixes * tag 'x86_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: fix poking_init() for Xen PV guests x86/resctrl: Fix event counts regression in reused RMIDs x86/resctrl: Fix task CLOSID/RMID update race x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case x86/boot: Avoid using Intel mnemonics in AT&T syntax asm
-
git://git.kernel.org/pub/scm/linux/kernel/git/ras/rasLinus Torvalds authored
Pull EDAC fixes from Borislav Petkov: - Fix the EDAC device's confusion in the polling setting units - Fix a memory leak in highbank's probing function * tag 'edac_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/highbank: Fix memory leak in highbank_mc_probe() EDAC/device: Fix period calculation in edac_device_reset_delay_period()
-
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linuxLinus Torvalds authored
Pull powerpc fixes from Michael Ellerman: - Fix a build failure with some versions of ld that have an odd version string - Fix incorrect use of mutex in the IMC PMU driver Thanks to Kajol Jain, Michael Petlan, Ojaswin Mujoo, Peter Zijlstra, and Yang Yingliang. * tag 'powerpc-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/64s/hash: Make stress_hpt_timer_fn() static powerpc/imc-pmu: Fix use of mutex in IRQs disabled section powerpc/boot: Fix incorrect version calculation issue in ld_version
-
- 14 Jan, 2023 7 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommuLinus Torvalds authored
Pull iommu fixes from Joerg Roedel: - Core: Fix an iommu-group refcount leak - Fix overflow issue in IOVA alloc path - ARM-SMMU fixes from Will: - Fix VFIO regression on NXP SoCs by reporting IOMMU_CAP_CACHE_COHERENCY - Fix SMMU shutdown paths to avoid device unregistration race - Error handling fix for Mediatek IOMMU driver * tag 'iommu-fixes-v6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/mediatek-v1: Fix an error handling path in mtk_iommu_v1_probe() iommu/iova: Fix alloc iova overflows issue iommu: Fix refcount leak in iommu_device_claim_dma_owner iommu/arm-smmu-v3: Don't unregister on shutdown iommu/arm-smmu: Don't unregister on shutdown iommu/arm-smmu: Report IOMMU_CAP_CACHE_COHERENCY even betterer
-
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblockLinus Torvalds authored
Pull memblock fix from Mike Rapoport: "memblock: always release pages to the buddy allocator in memblock_free_late() If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages() only releases pages to the buddy allocator if they are not in the deferred range. This is correct for free pages (as defined by for_each_free_mem_pfn_range_in_zone()) because free pages in the deferred range will be initialized and released as part of the deferred init process. memblock_free_pages() is called by memblock_free_late(), which is used to free reserved ranges after memblock_free_all() has run. All pages in reserved ranges have been initialized at that point, and accordingly, those pages are not touched by the deferred init process. This means that currently, if the pages that memblock_free_late() intends to release are in the deferred range, they will never be released to the buddy allocator. They will forever be reserved. In addition, memblock_free_pages() calls kmsan_memblock_free_pages(), which is also correct for free pages but is not correct for reserved pages. KMSAN metadata for reserved pages is initialized by kmsan_init_shadow(), which runs shortly before memblock_free_all(). For both of these reasons, memblock_free_pages() should only be called for free pages, and memblock_free_late() should call __free_pages_core() directly instead. One case where this issue can occur in the wild is EFI boot on x86_64. The x86 EFI code reserves all EFI boot services memory ranges via memblock_reserve() and frees them later via memblock_free_late() (efi_reserve_boot_services() and efi_free_boot_services(), respectively). If any of those ranges happens to fall within the deferred init range, the pages will not be released and that memory will be unavailable. For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI: v6.2-rc2: Node 0, zone DMA spanned 4095 present 3999 managed 3840 Node 0, zone DMA32 spanned 246652 present 245868 managed 178867 v6.2-rc2 + patch: Node 0, zone DMA spanned 4095 present 3999 managed 3840 Node 0, zone DMA32 spanned 246652 present 245868 managed 222816 # +43,949 pages" * tag 'fixes-2023-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: mm: Always release pages to the buddy allocator in memblock_free_late().
-
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linuxLinus Torvalds authored
Pull kernel hardening fixes from Kees Cook: - Fix CFI hash randomization with KASAN (Sami Tolvanen) - Check size of coreboot table entry and use flex-array * tag 'hardening-v6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kbuild: Fix CFI hash randomization with KASAN firmware: coreboot: Check size of table entry and use flex-array
-
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linuxLinus Torvalds authored
Pull module fix from Luis Chamberlain: "Just one fix for modules by Nick" * tag 'modules-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: kallsyms: Fix scheduling with interrupts disabled in self-test
-
git://git.samba.org/sfrench/cifs-2.6Linus Torvalds authored
Pull cifs fixes from Steve French: - memory leak and double free fix - two symlink fixes - minor cleanup fix - two smb1 fixes * tag '6.2-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Fix uninitialized memory read for smb311 posix symlink create cifs: fix potential memory leaks in session setup cifs: do not query ifaces on smb1 mounts cifs: fix double free on failed kerberos auth cifs: remove redundant assignment to the variable match cifs: fix file info setting in cifs_open_file() cifs: fix file info setting in cifs_query_path_info()
-
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds authored
Pull SCSI fixes from James Bottomley: "Two minor fixes in the hisi_sas driver which only impact enterprise style multi-expander and shared disk situations and no core changes" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: hisi_sas: Set a port invalid only if there are no devices attached when refreshing port id scsi: hisi_sas: Use abort task set to reset SAS disks when discovered
-
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libataLinus Torvalds authored
Pull ATA fix from Damien Le Moal: "A single fix to prevent building the pata_cs5535 driver with user mode linux as it uses msr operations that are not defined with UML" * tag 'ata-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata: ata: pata_cs5535: Don't build on UML
-
- 13 Jan, 2023 6 commits
-
-
git://git.kernel.dk/linuxLinus Torvalds authored
Pull block fixes from Jens Axboe: "Nothing major in here, just a collection of NVMe fixes and dropping a wrong might_sleep() that static checkers tripped over but which isn't valid" * tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux: MAINTAINERS: stop nvme matching for nvmem files nvme: don't allow unprivileged passthrough on partitions nvme: replace the "bool vec" arguments with flags in the ioctl path nvme: remove __nvme_ioctl nvme-pci: fix error handling in nvme_pci_enable() nvme-pci: add NVME_QUIRK_IDENTIFY_CNS quirk to Apple T2 controllers nvme-apple: add NVME_QUIRK_IDENTIFY_CNS quirk to fix regression block: Drop spurious might_sleep() from blk_put_queue()
-
git://git.kernel.dk/linuxLinus Torvalds authored
Pull io_uring fixes from Jens Axboe: "A fix for a regression that happened last week, rest is fixes that will be headed to stable as well. In detail: - Fix for a regression added with the leak fix from last week (me) - In writing a test case for that leak, inadvertently discovered a case where we a poll request can race. So fix that up and mark it for stable, and also ensure that fdinfo covers both the poll tables that we have. The latter was an oversight when the split poll table were added (me) - Fix for a lockdep reported issue with IOPOLL (Pavel)" * tag 'io_uring-6.2-2023-01-13' of git://git.kernel.dk/linux: io_uring: lock overflowing for IOPOLL io_uring/poll: attempt request issue after racy poll wakeup io_uring/fdinfo: include locked hash table in fdinfo output io_uring/poll: add hash if ready poll request can't complete inline io_uring/io-wq: only free worker if it was allocated for creation
-
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pciLinus Torvalds authored
Pull pci fixes from Bjorn Helgaas: - Work around apparent firmware issue that made Linux reject MMCONFIG space, which broke PCI extended config space (Bjorn Helgaas) - Fix CONFIG_PCIE_BT1 dependency due to mid-air collision between a PCI_MSI_IRQ_DOMAIN -> PCI_MSI change and addition of PCIE_BT1 (Lukas Bulwahn) * tag 'pci-v6.2-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space x86/pci: Simplify is_mmconf_reserved() messages PCI: dwc: Adjust to recent removal of PCI_MSI_IRQ_DOMAIN
-
Sami Tolvanen authored
Clang emits a asan.module_ctor constructor to each object file when KASAN is enabled, and these functions are indirectly called in do_ctors. With CONFIG_CFI_CLANG, the compiler also emits a CFI type hash before each address-taken global function so they can pass indirect call checks. However, in commit 0c3e806e ("x86/cfi: Add boot time hash randomization"), x86 implemented boot time hash randomization, which relies on the .cfi_sites section generated by objtool. As objtool is run against vmlinux.o instead of individual object files with X86_KERNEL_IBT (enabled by default), CFI types in object files that are not part of vmlinux.o end up not being included in .cfi_sites, and thus won't get randomized and trip CFI when called. Only .vmlinux.export.o and init/version-timestamp.o are linked into vmlinux separately from vmlinux.o. As these files don't contain any functions, disable KASAN for both of them to avoid breaking hash randomization. Link: https://github.com/ClangBuiltLinux/linux/issues/1742 Fixes: 0c3e806e ("x86/cfi: Add boot time hash randomization") Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230112224948.1479453-2-samitolvanen@google.com
-
Kees Cook authored
The memcpy() of the data following a coreboot_table_entry couldn't be evaluated by the compiler under CONFIG_FORTIFY_SOURCE. To make it easier to reason about, add an explicit flexible array member to struct coreboot_device so the entire entry can be copied at once. Additionally, validate the sizes before copying. Avoids this run-time false positive warning: memcpy: detected field-spanning write (size 168) of single field "&device->entry" at drivers/firmware/google/coreboot_table.c:103 (size 8) Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Link: https://lore.kernel.org/all/03ae2704-8c30-f9f0-215b-7cdf4ad35a9a@molgen.mpg.de/ Cc: Jack Rosenthal <jrosenth@chromium.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Julius Werner <jwerner@chromium.org> Cc: Brian Norris <briannorris@chromium.org> Cc: Stephen Boyd <swboyd@chromium.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Julius Werner <jwerner@chromium.org> Reviewed-by: Guenter Roeck <groeck@chromium.org> Link: https://lore.kernel.org/r/20230107031406.gonna.761-kees@kernel.orgReviewed-by: Stephen Boyd <swboyd@chromium.org> Reviewed-by: Jack Rosenthal <jrosenth@chromium.org> Link: https://lore.kernel.org/r/20230112230312.give.446-kees@kernel.org
-
Nicholas Piggin authored
kallsyms_on_each* may schedule so must not be called with interrupts disabled. The iteration function could disable interrupts, but this also changes lookup_symbol() to match the change to the other timing code. Reported-by: Erhard F. <erhard_f@mailbox.org> Link: https://lore.kernel.org/all/bug-216902-206035@https.bugzilla.kernel.org%2F/Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/oe-lkp/202212251728.8d0872ff-oliver.sang@intel.com Fixes: 30f3bb09 ("kallsyms: Add self-test facility") Tested-by: "Erhard F." <erhard_f@mailbox.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
-