1. 26 Apr, 2024 6 commits
    • Lucas Stach's avatar
      mm: page_alloc: control latency caused by zone PCP draining · 55f77df7
      Lucas Stach authored
      Patch series "mm/treewide: Remove pXd_huge() API", v2.
      
      In previous work [1], we removed the pXd_large() API, which is arch
      specific.  This patchset further removes the hugetlb pXd_huge() API.
      
      Hugetlb was never special on creating huge mappings when compared with
      other huge mappings.  Having a standalone API just to detect such pgtable
      entries is more or less redundant, especially after the pXd_leaf() API set
      is introduced with/without CONFIG_HUGETLB_PAGE.
      
      When looking at this problem, a few issues are also exposed that we don't
      have a clear definition of the *_huge() variance API.  This patchset
      started by cleaning these issues first, then replace all *_huge() users to
      use *_leaf(), then drop all *_huge() code.
      
      On x86/sparc, swap entries will be reported "true" in pXd_huge(), while
      for all the rest archs they're reported "false" instead.  This part is
      done in patch 1-5, in which I suspect patch 1 can be seen as a bug fix,
      but I'll leave that to hmm experts to decide.
      
      Besides, there are three archs (arm, arm64, powerpc) that have slightly
      different definitions between the *_huge() v.s.  *_leaf() variances.  I
      tackled them separately so that it'll be easier for arch experts to chim
      in when necessary.  This part is done in patch 6-9.
      
      The final patches 10-14 do the rest on the final removal, since *_leaf()
      will be the ultimate API in the future, and we seem to have quite some
      confusions on how *_huge() APIs can be defined, provide a rich comment for
      *_leaf() API set to define them properly to avoid future misuse, and
      hopefully that'll also help new archs to start support huge mappings and
      avoid traps (like either swap entries, or PROT_NONE entry checks).
      
      [1] https://lore.kernel.org/r/20240305043750.93762-1-peterx@redhat.com
      
      
      This patch (of 14):
      
      When the complete PCP is drained a much larger number of pages than the
      usual batch size might be freed at once, causing large IRQ and preemption
      latency spikes, as they are all freed while holding the pcp and zone
      spinlocks.
      
      To avoid those latency spikes, limit the number of pages freed in a single
      bulk operation to common batch limits.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240318200736.2835502-1-l.stach@pengutronix.deSigned-off-by: default avatarLucas Stach <l.stach@pengutronix.de>
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      55f77df7
    • Dev Jain's avatar
      selftests/mm: virtual_address_range: Switch to ksft_exit_fail_msg · 13e86096
      Dev Jain authored
      mmap() must not succeed in validate_lower_address_hint(), for if it does,
      it is a bug in mmap() itself.  Reflect this behaviour with
      ksft_exit_fail_msg().  While at it, do some formatting changes.
      
      Link: https://lkml.kernel.org/r/20240314122250.68534-1-dev.jain@arm.comSigned-off-by: default avatarDev Jain <dev.jain@arm.com>
      Reviewed-by: default avatarMuhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      13e86096
    • David Hildenbrand's avatar
      mm/madvise: don't perform madvise VMA walk for MADV_POPULATE_(READ|WRITE) · fa9fcd8b
      David Hildenbrand authored
      We changed faultin_page_range() to no longer consume a VMA, because
      faultin_page_range() might internally release the mm lock to lookup
      the VMA again -- required to cleanly handle VM_FAULT_RETRY. But
      independent of that, __get_user_pages() will always lookup the VMA
      itself.
      
      Now that we let __get_user_pages() just handle VMA checks in a way that
      is suitable for MADV_POPULATE_(READ|WRITE), the VMA walk in madvise()
      is just overhead. So let's just call madvise_populate()
      on the full range instead.
      
      There is one change in behavior: madvise_walk_vmas() would skip any VMA
      holes, and if everything succeeded, it would return -ENOMEM after
      processing all VMAs.
      
      However, for MADV_POPULATE_(READ|WRITE) it's unlikely for the caller to
      notice any difference: -ENOMEM might either indicate that there were VMA
      holes or that populating page tables failed because there was not enough
      memory. So it's unlikely that user space will notice the difference, and
      that special handling likely only makes sense for some other madvise()
      actions.
      
      Further, we'd already fail with -ENOMEM early in the past if looking up the
      VMA after dropping the MM lock failed because of concurrent VMA
      modifications. So let's just keep it simple and avoid the madvise VMA
      walk, and consistently fail early if we find a VMA hole.
      
      Link: https://lkml.kernel.org/r/20240314161300.382526-3-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      fa9fcd8b
    • Yosry Ahmed's avatar
      mm: memcg: add NULL check to obj_cgroup_put() · 91b71e78
      Yosry Ahmed authored
      9 out of 16 callers perform a NULL check before calling obj_cgroup_put(). 
      Move the NULL check in the function, similar to mem_cgroup_put().  The
      unlikely() NULL check in current_objcg_update() was left alone to avoid
      dropping the unlikey() annotation as this a fast path.
      
      Link: https://lkml.kernel.org/r/20240316015803.2777252-1-yosryahmed@google.comSigned-off-by: default avatarYosry Ahmed <yosryahmed@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarRoman Gushchin <roman.gushchin@linux.dev>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      91b71e78
    • Christophe Leroy's avatar
      mm: remove guard around pgd_offset_k() macro · 5b0a6700
      Christophe Leroy authored
      The last architecture redefining pgd_offset_k() was IA64 and it was
      removed by commit cf8e8658 ("arch: Remove Itanium (IA-64)
      architecture")
      
      There is no need anymore to guard generic version of pgd_offset_k()
      with #ifndef pgd_offset_k
      
      Link: https://lkml.kernel.org/r/59d3f47d5615d18cca1986f269be2fcb3df34556.1710589838.git.christophe.leroy@csgroup.euSigned-off-by: default avatarChristophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5b0a6700
    • Andrew Morton's avatar
  2. 25 Apr, 2024 11 commits
  3. 16 Apr, 2024 16 commits
    • Jeongjun Park's avatar
      nilfs2: fix OOB in nilfs_set_de_type · c4a7dc95
      Jeongjun Park authored
      The size of the nilfs_type_by_mode array in the fs/nilfs2/dir.c file is
      defined as "S_IFMT >> S_SHIFT", but the nilfs_set_de_type() function,
      which uses this array, specifies the index to read from the array in the
      same way as "(mode & S_IFMT) >> S_SHIFT".
      
      static void nilfs_set_de_type(struct nilfs_dir_entry *de, struct inode
       *inode)
      {
      	umode_t mode = inode->i_mode;
      
      	de->file_type = nilfs_type_by_mode[(mode & S_IFMT)>>S_SHIFT]; // oob
      }
      
      However, when the index is determined this way, an out-of-bounds (OOB)
      error occurs by referring to an index that is 1 larger than the array size
      when the condition "mode & S_IFMT == S_IFMT" is satisfied.  Therefore, a
      patch to resize the nilfs_type_by_mode array should be applied to prevent
      OOB errors.
      
      Link: https://lkml.kernel.org/r/20240415182048.7144-1-konishi.ryusuke@gmail.com
      Reported-by: syzbot+2e22057de05b9f3b30d8@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=2e22057de05b9f3b30d8
      Fixes: 2ba466d7 ("nilfs2: directory entry operations")
      Signed-off-by: default avatarJeongjun Park <aha310510@gmail.com>
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@gmail.com>
      Tested-by: default avatarRyusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c4a7dc95
    • Naoya Horiguchi's avatar
      MAINTAINERS: update Naoya Horiguchi's email address · 8247bf1d
      Naoya Horiguchi authored
      My old NEC address has been removed, so update MAINTAINERS and .mailmap to
      map it to my gmail address.
      
      Link: https://lkml.kernel.org/r/20240412181720.18452-1-nao.horiguchi@gmail.comSigned-off-by: default avatarNaoya Horiguchi <nao.horiguchi@gmail.com>
      Acked-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      8247bf1d
    • Miaohe Lin's avatar
      fork: defer linking file vma until vma is fully initialized · 35e35178
      Miaohe Lin authored
      Thorvald reported a WARNING [1]. And the root cause is below race:
      
       CPU 1					CPU 2
       fork					hugetlbfs_fallocate
        dup_mmap				 hugetlbfs_punch_hole
         i_mmap_lock_write(mapping);
         vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
         i_mmap_unlock_write(mapping);
         hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
      					 i_mmap_lock_write(mapping);
         					 hugetlb_vmdelete_list
      					  vma_interval_tree_foreach
      					   hugetlb_vma_trylock_write -- Vma_lock is cleared.
         tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
      					   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
      					 i_mmap_unlock_write(mapping);
      
      hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside
      i_mmap_rwsem lock while vma lock can be used in the same time.  Fix this
      by deferring linking file vma until vma is fully initialized.  Those vmas
      should be initialized first before they can be used.
      
      Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com
      Fixes: 8d9bfb26 ("hugetlb: add vma based lock for pmd sharing")
      Signed-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reported-by: default avatarThorvald Natvig <thorvald@google.com>
      Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
      Reviewed-by: default avatarJane Chu <jane.chu@oracle.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peng Zhang <zhangpeng.00@bytedance.com>
      Cc: Tycho Andersen <tandersen@netflix.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      35e35178
    • Sumanth Korikkar's avatar
      mm/shmem: inline shmem_is_huge() for disabled transparent hugepages · 1f737846
      Sumanth Korikkar authored
      In order to  minimize code size (CONFIG_CC_OPTIMIZE_FOR_SIZE=y),
      compiler might choose to make a regular function call (out-of-line) for
      shmem_is_huge() instead of inlining it. When transparent hugepages are
      disabled (CONFIG_TRANSPARENT_HUGEPAGE=n), it can cause compilation
      error.
      
      mm/shmem.c: In function `shmem_getattr':
      ./include/linux/huge_mm.h:383:27: note: in expansion of macro `BUILD_BUG'
        383 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
            |                           ^~~~~~~~~
      mm/shmem.c:1148:33: note: in expansion of macro `HPAGE_PMD_SIZE'
       1148 |                 stat->blksize = HPAGE_PMD_SIZE;
      
      To prevent the possible error, always inline shmem_is_huge() when
      transparent hugepages are disabled.
      
      Link: https://lkml.kernel.org/r/20240409155407.2322714-1-sumanthk@linux.ibm.comSigned-off-by: default avatarSumanth Korikkar <sumanthk@linux.ibm.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1f737846
    • Oscar Salvador's avatar
      mm,page_owner: defer enablement of static branch · 0b2cf0a4
      Oscar Salvador authored
      Kefeng Wang reported that he was seeing some memory leaks with kmemleak
      with page_owner enabled.
      
      The reason is that we enable the page_owner_inited static branch and then
      proceed with the linking of stack_list struct to dummy_stack, which means
      that exists a race window between these two steps where we can have pages
      already being allocated calling add_stack_record_to_list(), allocating
      objects and linking them to stack_list, but then we set stack_list
      pointing to dummy_stack in init_page_owner.  Which means that the objects
      that have been allocated during that time window are unreferenced and
      lost.
      
      Fix this by deferring the enablement of the branch until we have properly
      set up the list.
      
      Link: https://lkml.kernel.org/r/20240409131715.13632-1-osalvador@suse.de
      Fixes: 4bedfb31 ("mm,page_owner: maintain own list of stack_records structs")
      Signed-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Reported-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Closes: https://lore.kernel.org/linux-mm/74b147b0-718d-4d50-be75-d6afc801cd24@huawei.com/Tested-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      0b2cf0a4
    • Phillip Lougher's avatar
      Squashfs: check the inode number is not the invalid value of zero · 9253c54e
      Phillip Lougher authored
      Syskiller has produced an out of bounds access in fill_meta_index().
      
      That out of bounds access is ultimately caused because the inode
      has an inode number with the invalid value of zero, which was not checked.
      
      The reason this causes the out of bounds access is due to following
      sequence of events:
      
      1. Fill_meta_index() is called to allocate (via empty_meta_index())
         and fill a metadata index.  It however suffers a data read error
         and aborts, invalidating the newly returned empty metadata index.
         It does this by setting the inode number of the index to zero,
         which means unused (zero is not a valid inode number).
      
      2. When fill_meta_index() is subsequently called again on another
         read operation, locate_meta_index() returns the previous index
         because it matches the inode number of 0.  Because this index
         has been returned it is expected to have been filled, and because
         it hasn't been, an out of bounds access is performed.
      
      This patch adds a sanity check which checks that the inode number
      is not zero when the inode is created and returns -EINVAL if it is.
      
      [phillip@squashfs.org.uk: whitespace fix]
        Link: https://lkml.kernel.org/r/20240409204723.446925-1-phillip@squashfs.org.uk
      Link: https://lkml.kernel.org/r/20240408220206.435788-1-phillip@squashfs.org.ukSigned-off-by: default avatarPhillip Lougher <phillip@squashfs.org.uk>
      Reported-by: default avatar"Ubisectech Sirius" <bugreport@ubisectech.com>
      Closes: https://lore.kernel.org/lkml/87f5c007-b8a5-41ae-8b57-431e924c5915.bugreport@ubisectech.com/
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      9253c54e
    • Oscar Salvador's avatar
      mm,swapops: update check in is_pfn_swap_entry for hwpoison entries · 07a57a33
      Oscar Salvador authored
      Tony reported that the Machine check recovery was broken in v6.9-rc1, as
      he was hitting a VM_BUG_ON when injecting uncorrectable memory errors to
      DRAM.
      
      After some more digging and debugging on his side, he realized that this
      went back to v6.1, with the introduction of 'commit 0d206b5d
      ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")'.  That
      commit, among other things, introduced swp_offset_pfn(), replacing
      hwpoison_entry_to_pfn() in its favour.
      
      The patch also introduced a VM_BUG_ON() check for is_pfn_swap_entry(), but
      is_pfn_swap_entry() never got updated to cover hwpoison entries, which
      means that we would hit the VM_BUG_ON whenever we would call
      swp_offset_pfn() for such entries on environments with CONFIG_DEBUG_VM
      set.  Fix this by updating the check to cover hwpoison entries as well,
      and update the comment while we are it.
      
      Link: https://lkml.kernel.org/r/20240407130537.16977-1-osalvador@suse.de
      Fixes: 0d206b5d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
      Signed-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Reported-by: default avatarTony Luck <tony.luck@intel.com>
      Closes: https://lore.kernel.org/all/Zg8kLSl2yAlA3o5D@agluck-desk3/Tested-by: default avatarTony Luck <tony.luck@intel.com>
      Reviewed-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: <stable@vger.kernel.org>	[6.1.x]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      07a57a33
    • Miaohe Lin's avatar
      mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled · 1983184c
      Miaohe Lin authored
      When I did hard offline test with hugetlb pages, below deadlock occurs:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      6.8.0-11409-gf6cef5f8 #1 Not tainted
      ------------------------------------------------------
      bash/46904 is trying to acquire lock:
      ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60
      
      but task is already holding lock:
      ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (pcp_batch_high_lock){+.+.}-{3:3}:
             __mutex_lock+0x6c/0x770
             page_alloc_cpu_online+0x3c/0x70
             cpuhp_invoke_callback+0x397/0x5f0
             __cpuhp_invoke_callback_range+0x71/0xe0
             _cpu_up+0xeb/0x210
             cpu_up+0x91/0xe0
             cpuhp_bringup_mask+0x49/0xb0
             bringup_nonboot_cpus+0xb7/0xe0
             smp_init+0x25/0xa0
             kernel_init_freeable+0x15f/0x3e0
             kernel_init+0x15/0x1b0
             ret_from_fork+0x2f/0x50
             ret_from_fork_asm+0x1a/0x30
      
      -> #0 (cpu_hotplug_lock){++++}-{0:0}:
             __lock_acquire+0x1298/0x1cd0
             lock_acquire+0xc0/0x2b0
             cpus_read_lock+0x2a/0xc0
             static_key_slow_dec+0x16/0x60
             __hugetlb_vmemmap_restore_folio+0x1b9/0x200
             dissolve_free_huge_page+0x211/0x260
             __page_handle_poison+0x45/0xc0
             memory_failure+0x65e/0xc70
             hard_offline_page_store+0x55/0xa0
             kernfs_fop_write_iter+0x12c/0x1d0
             vfs_write+0x387/0x550
             ksys_write+0x64/0xe0
             do_syscall_64+0xca/0x1e0
             entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(pcp_batch_high_lock);
                                     lock(cpu_hotplug_lock);
                                     lock(pcp_batch_high_lock);
        rlock(cpu_hotplug_lock);
      
       *** DEADLOCK ***
      
      5 locks held by bash/46904:
       #0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
       #1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
       #2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
       #3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70
       #4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40
      
      stack backtrace:
      CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x68/0xa0
       check_noncircular+0x129/0x140
       __lock_acquire+0x1298/0x1cd0
       lock_acquire+0xc0/0x2b0
       cpus_read_lock+0x2a/0xc0
       static_key_slow_dec+0x16/0x60
       __hugetlb_vmemmap_restore_folio+0x1b9/0x200
       dissolve_free_huge_page+0x211/0x260
       __page_handle_poison+0x45/0xc0
       memory_failure+0x65e/0xc70
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x387/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xca/0x1e0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      RIP: 0033:0x7fc862314887
      Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
      RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887
      RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001
      RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
      R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00
      
      In short, below scene breaks the lock dependency chain:
      
       memory_failure
        __page_handle_poison
         zone_pcp_disable -- lock(pcp_batch_high_lock)
         dissolve_free_huge_page
          __hugetlb_vmemmap_restore_folio
           static_key_slow_dec
            cpus_read_lock -- rlock(cpu_hotplug_lock)
      
      Fix this by calling drain_all_pages() instead.
      
      This issue won't occur until commit a6b40850 ("mm: hugetlb: replace
      hugetlb_free_vmemmap_enabled with a static_key").  As it introduced
      rlock(cpu_hotplug_lock) in dissolve_free_huge_page() code path while
      lock(pcp_batch_high_lock) is already in the __page_handle_poison().
      
      [linmiaohe@huawei.com: extend comment per Oscar]
      [akpm@linux-foundation.org: reflow block comment]
      Link: https://lkml.kernel.org/r/20240407085456.2798193-1-linmiaohe@huawei.com
      Fixes: a6b40850 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")
      Signed-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarJane Chu <jane.chu@oracle.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1983184c
    • Peter Xu's avatar
      mm/userfaultfd: allow hugetlb change protection upon poison entry · c5977c95
      Peter Xu authored
      After UFFDIO_POISON, there can be two kinds of hugetlb pte markers, either
      the POISON one or UFFD_WP one.
      
      Allow change protection to run on a poisoned marker just like !hugetlb
      cases, ignoring the marker irrelevant of the permission.
      
      Here the two bits are mutual exclusive.  For example, when install a
      poisoned entry it must not be UFFD_WP already (by checking pte_none()
      before such install).  And it also means if UFFD_WP is set there must have
      no POISON bit set.  It makes sense because UFFD_WP is a bit to reflect
      permission, and permissions do not apply if the pte is poisoned and
      destined to sigbus.
      
      So here we simply check uffd_wp bit set first, do nothing otherwise.
      
      Attach the Fixes to UFFDIO_POISON work, as before that it should not be
      possible to have poison entry for hugetlb (e.g., hugetlb doesn't do swap,
      so no chance of swapin errors).
      
      Link: https://lkml.kernel.org/r/20240405231920.1772199-1-peterx@redhat.com
      Link: https://lore.kernel.org/r/000000000000920d5e0615602dd1@google.com
      Fixes: fc71884a ("mm: userfaultfd: add new UFFDIO_POISON ioctl")
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reported-by: syzbot+b07c8ac8eee3d4d8440f@syzkaller.appspotmail.com
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarAxel Rasmussen <axelrasmussen@google.com>
      Cc: <stable@vger.kernel.org>	[6.6+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c5977c95
    • Oscar Salvador's avatar
      mm,page_owner: fix printing of stack records · 74017458
      Oscar Salvador authored
      When seq_* code sees that its buffer overflowed, it re-allocates a bigger
      onecand calls seq_operations->start() callback again.  stack_start()
      naively though that if it got called again, it meant that the old record
      got already printed so it returned the next object, but that is not true.
      
      The consequence of that is that every time stack_stop() -> stack_start()
      get called because we needed a bigger buffer, stack_start() will skip
      entries, and those will not be printed.
      
      Fix it by not advancing to the next object in stack_start().
      
      Link: https://lkml.kernel.org/r/20240404070702.2744-5-osalvador@suse.de
      Fixes: 765973a0 ("mm,page_owner: display all stacks and their count")
      Signed-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      74017458
    • Oscar Salvador's avatar
      mm,page_owner: fix accounting of pages when migrating · 718b1f33
      Oscar Salvador authored
      Upon migration, new allocated pages are being given the handle of the old
      pages.  This is problematic because it means that for the stack which
      allocated the old page, we will be substracting the old page + the new one
      when that page is freed, creating an accounting imbalance.
      
      There is an interest in keeping it that way, as otherwise the output will
      biased towards migration stacks should those operations occur often, but
      that is not really helpful.
      
      The link from the new page to the old stack is being performed by calling
      __update_page_owner_handle() in __folio_copy_owner().  The only thing that
      is left is to link the migrate stack to the old page, so the old page will
      be subtracted from the migrate stack, avoiding by doing so any possible
      imbalance.
      
      Link: https://lkml.kernel.org/r/20240404070702.2744-4-osalvador@suse.de
      Fixes: 217b2119 ("mm,page_owner: implement the tracking of the stacks count")
      Signed-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      718b1f33
    • Oscar Salvador's avatar
      mm,page_owner: fix refcount imbalance · f5c12105
      Oscar Salvador authored
      Current code does not contemplate scenarios were an allocation and free
      operation on the same pages do not handle it in the same amount at once. 
      To give an example, page_alloc_exact(), where we will allocate a page of
      enough order to stafisfy the size request, but we will free the remainings
      right away.
      
      In the above example, we will increment the stack_record refcount only
      once, but we will decrease it the same number of times as number of unused
      pages we have to free.  This will lead to a warning because of refcount
      imbalance.
      
      Fix this by recording the number of base pages in the refcount field.
      
      Link: https://lkml.kernel.org/r/20240404070702.2744-3-osalvador@suse.de
      Reported-by: syzbot+41bbfdb8d41003d12c0f@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/linux-mm/00000000000090e8ff0613eda0e5@google.com
      Fixes: 217b2119 ("mm,page_owner: implement the tracking of the stacks count")
      Signed-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Tested-by: default avatarAlexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f5c12105
    • Oscar Salvador's avatar
      mm,page_owner: update metadata for tail pages · ea4b5b33
      Oscar Salvador authored
      Patch series "page_owner: Fix refcount imbalance and print fixup", v4.
      
      This series consists of a refactoring/correctness of updating the metadata
      of tail pages, a couple of fixups for the refcounting part and a fixup for
      the stack_start() function.
      
      From this series on, instead of counting the stacks, we count the
      outstanding nr_base_pages each stack has, which gives us a much better
      memory overview.  The other fixup is for the migration part.
      
      A more detailed explanation can be found in the changelog of the
      respective patches.
      
      
      This patch (of 4):
      
      __set_page_owner_handle() and __reset_page_owner() update the metadata of
      all pages when the page is of a higher-order, but we miss to do the same
      when the pages are migrated.  __folio_copy_owner() only updates the
      metadata of the head page, meaning that the information stored in the
      first page and the tail pages will not match.
      
      Strictly speaking that is not a big problem because 1) we do not print
      tail pages and 2) upon splitting all tail pages will inherit the metadata
      of the head page, but it is better to have all metadata in check should
      there be any problem, so it can ease debugging.
      
      For that purpose, a couple of helpers are created
      __update_page_owner_handle() which updates the metadata on allocation, and
      __update_page_owner_free_handle() which does the same when the page is
      freed.
      
      __folio_copy_owner() will make use of both as it needs to entirely replace
      the page_owner metadata for the new page.
      
      Link: https://lkml.kernel.org/r/20240404070702.2744-1-osalvador@suse.de
      Link: https://lkml.kernel.org/r/20240404070702.2744-2-osalvador@suse.deSigned-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Tested-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      ea4b5b33
    • Lokesh Gidra's avatar
      userfaultfd: change src_folio after ensuring it's unpinned in UFFDIO_MOVE · c0205eaf
      Lokesh Gidra authored
      Commit d7a08838 ("mm: userfaultfd: fix unexpected change to src_folio
      when UFFDIO_MOVE fails") moved the src_folio->{mapping, index} changing to
      after clearing the page-table and ensuring that it's not pinned.  This
      avoids failure of swapout+migration and possibly memory corruption.
      
      However, the commit missed fixing it in the huge-page case.
      
      Link: https://lkml.kernel.org/r/20240404171726.2302435-1-lokeshgidra@google.com
      Fixes: adef4406 ("userfaultfd: UFFDIO_MOVE uABI")
      Signed-off-by: default avatarLokesh Gidra <lokeshgidra@google.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c0205eaf
    • David Hildenbrand's avatar
      mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly · 631426ba
      David Hildenbrand authored
      Darrick reports that in some cases where pread() would fail with -EIO and
      mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
      MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.
      
      While the madvise() call can be interrupted by a signal, this is not the
      desired behavior.  MADV_POPULATE_READ / MADV_POPULATE_WRITE should behave
      like page faults in that case: fail and not retry forever.
      
      A reproducer can be found at [1].
      
      The reason is that __get_user_pages(), as called by
      faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper way:
      it will simply return 0 when VM_FAULT_RETRY happened, making
      madvise_populate()->faultin_vma_page_range() retry again and again, never
      setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages().
      
      __get_user_pages_locked() does what we want, but duplicating that logic in
      faultin_vma_page_range() feels wrong.
      
      So let's use __get_user_pages_locked() instead, that will detect
      VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault handler
      return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point, propagating -EFAULT
      from faultin_page() to __get_user_pages(), all the way to
      madvise_populate().
      
      But, there is an issue: __get_user_pages_locked() will end up re-taking
      the MM lock and then __get_user_pages() will do another VMA lookup.  In
      the meantime, the VMA layout could have changed and we'd fail with
      different error codes than we'd want to.
      
      As __get_user_pages() will currently do a new VMA lookup either way, let
      it do the VMA handling in a different way, controlled by a new
      FOLL_MADV_POPULATE flag, effectively moving these checks from
      madvise_populate() + faultin_page_range() in there.
      
      With this change, Darricks reproducer properly fails with -EFAULT, as
      documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE.
      
      [1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/
      
      Link: https://lkml.kernel.org/r/20240314161300.382526-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20240314161300.382526-2-david@redhat.com
      Fixes: 4ca9b385 ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reported-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Closes: https://lore.kernel.org/all/20240311223815.GW1927156@frogsfrogsfrogs/
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      631426ba
    • Andrew Morton's avatar
      Merge branch 'master' into mm-stable · 640958fd
      Andrew Morton authored
      640958fd
  4. 14 Apr, 2024 7 commits