1. 22 Apr, 2022 18 commits
    • Linus Torvalds's avatar
      Merge tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · 279b83c6
      Linus Torvalds authored
      Pull mount_setattr fix from Christian Brauner:
       "The recent cleanup in e257039f ("mount_setattr(): clean the
        control flow and calling conventions") switched the mount attribute
        codepaths from do-while to for loops as they are more idiomatic when
        walking mounts.
      
        However, we did originally choose do-while constructs because if we
        request a mount or mount tree to be made read-only we need to hold
        writers in the following way: The mount attribute code will grab
        lock_mount_hash() and then call mnt_hold_writers() which will
        _unconditionally_ set MNT_WRITE_HOLD on the mount.
      
        Any callers that need write access have to call mnt_want_write(). They
        will immediately see that MNT_WRITE_HOLD is set on the mount and the
        caller will then either spin (on non-preempt-rt) or wait on
        lock_mount_hash() (on preempt-rt).
      
        The fact that MNT_WRITE_HOLD is set unconditionally means that once
        mnt_hold_writers() returns we need to _always_ pair it with
        mnt_unhold_writers() in both the failure and success paths.
      
        The do-while constructs did take care of this. But Al's change to a
        for loop means the failure path stops at the first mount whose
        attributes we failed to change _without_ entering the loop body to
        call mnt_unhold_writers() on it.
      
        This in turn means that once we failed to make a mount read-only via
        mount_setattr() - i.e. there are already writers on that mount - we
        will block any writers indefinitely. Fix this by ensuring that the for
        loop always unsets MNT_WRITE_HOLD including the first mount we failed
        to change to read-only. Also sprinkle a few comments into the cleanup
        code to remind people (myself included) about what is happening.
        After all, I didn't catch it during review.
      
        This is only relevant on mainline and was reported by syzbot. Details
        about the syzbot reports are all in the commit message"
      
      * tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
        fs: unset MNT_WRITE_HOLD on failure
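
      A minimal sketch of the failure-path pattern described above (function
      name and structure simplified, not the exact fs/namespace.c diff): on
      error, walk the tree again, clear MNT_WRITE_HOLD wherever it was set,
      and only stop after the mount that failed has been handled.

        /*
         * Sketch only. "last" is the mount on which changing attributes
         * failed; MNT_WRITE_HOLD may already be set on it.
         */
        static void undo_hold_writers(struct mount *mnt, struct mount *last)
        {
                struct mount *p;

                for (p = mnt; p; p = next_mnt(p, mnt)) {
                        /* If we had to hold writers, unblock them now. */
                        if (p->mnt.mnt_flags & MNT_WRITE_HOLD)
                                mnt_unhold_writers(p);
                        /*
                         * Stop only *after* undoing the failing mount,
                         * otherwise its MNT_WRITE_HOLD stays set and every
                         * later mnt_want_write() spins forever.
                         */
                        if (p == last)
                                break;
                }
        }
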
      279b83c6
    • Linus Torvalds's avatar
      Merge tag 'sound-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 2d230968
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "At this time, the majority of changes are for pending ASoC fixes while
        a few usual HD-audio and USB-audio quirks are found.
      
        Almost all patches are small device-specific fixes, and nothing
        worrisome stands out, so far"
      
      * tag 'sound-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (37 commits)
        ALSA: hda/realtek: Add quirk for Clevo NP70PNP
        ALSA: hda: intel-dsp-config: Add RaptorLake PCI IDs
        ALSA: hda/realtek: Enable mute/micmute LEDs and limit mic boost on EliteBook 845/865 G9
        ALSA: usb-audio: Clear MIDI port active flag after draining
        ALSA: usb-audio: add mapping for MSI MAG X570S Torpedo MAX.
        ALSA: hda/i915: Fix one too many pci_dev_put()
        ALSA: hda/hdmi: add HDMI codec VID for Raptorlake-P
        ALSA: hda/hdmi: fix warning about PCM count when used with SOF
        sound/oss/dmasound: fix 'dmasound_setup' defined but not used
        firmware: cs_dsp: Fix overrun of unterminated control name string
        ASoC: codecs: Fix an error handling path in (rx|tx|va)_macro_probe()
        ASoC: Intel: sof_es8336: Add a quirk for Huawei Matebook D15
        ASoC: Intel: sof_es8336: add a quirk for headset at mic1 port
        ASoC: Intel: sof_es8336: support a separate gpio to control headphone
        ASoC: Intel: sof_es8336: simplify speaker gpio naming
        ASoC: wm8731: Disable the regulator when probing fails
        ASoC: Intel: soc-acpi: correct device endpoints for max98373
        ASoC: codecs: wcd934x: do not switch off SIDO Buck when codec is in use
        ASoC: SOF: topology: Fix memory leak in sof_control_load()
        ASoC: SOF: topology: cleanup dailinks on widget unload
        ...
      2d230968
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 281b9d9a
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "13 patches.
      
        Subsystems affected by this patch series: mm (memory-failure, memcg,
        userfaultfd, hugetlbfs, mremap, oom-kill, kasan, hmm), and kcov"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
        kcov: don't generate a warning on vm_insert_page()'s failure
        MAINTAINERS: add Vincenzo Frascino to KASAN reviewers
        oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
        selftest/vm: add skip support to mremap_test
        selftest/vm: support xfail in mremap_test
        selftest/vm: verify remap destination address in mremap_test
        selftest/vm: verify mmap addr in mremap_test
        mm, hugetlb: allow for "high" userspace addresses
        userfaultfd: mark uffd_wp regardless of VM_WRITE flag
        memcg: sync flush only if periodic flush is delayed
        mm/memory-failure.c: skip huge_zero_page in memory_failure()
        mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
      281b9d9a
    • Nicholas Piggin's avatar
      mm/vmalloc: huge vmalloc backing pages should be split rather than compound · 3b8000ae
      Nicholas Piggin authored
      Huge vmalloc higher-order backing pages were allocated with __GFP_COMP
      in order to allow the sub-pages to be refcounted by callers such as
      "remap_vmalloc_page [sic]" (remap_vmalloc_range).
      
      However a similar problem exists for other struct page fields callers
      use, for example fb_deferred_io_fault() takes a vmalloc'ed page and
      not only refcounts it but uses ->lru, ->mapping, ->index.
      
      This is not compatible with compound sub-pages, and can cause bad page
      state issues like
      
        BUG: Bad page state in process swapper/0  pfn:00743
        page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743
        flags: 0x7ffff000000000(node=0|zone=0|lastcpupid=0x7ffff)
        raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000
        raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
        page dumped because: corrupted mapping in tail page
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810
        Call Trace:
          dump_stack_lvl+0x74/0xa8 (unreliable)
          bad_page+0x12c/0x170
          free_tail_pages_check+0xe8/0x190
          free_pcp_prepare+0x31c/0x4e0
          free_unref_page+0x40/0x1b0
          __vunmap+0x1d8/0x420
          ...
      
      The correct approach is to use split high-order pages for the huge
      vmalloc backing. These allow callers to treat them in exactly the same
      way as individually-allocated order-0 pages.
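
      A minimal sketch of the allocation change (illustrative; "area" is the
      vm_struct being populated, and this is not the exact mm/vmalloc.c
      diff): allocate the high-order page without __GFP_COMP and split it
      right away, so every sub-page is an independent order-0 page with its
      own refcount, ->lru, ->mapping and ->index.

        /* No __GFP_COMP: sub-pages must stay individually refcountable. */
        struct page *page = alloc_pages_node(node, gfp_mask, order);

        if (page) {
                int i;

                /* Turn the allocation into 2^order independent order-0 pages. */
                split_page(page, order);
                for (i = 0; i < (1 << order); i++)
                        area->pages[area->nr_pages + i] = page + i;
        }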
      
      Link: https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Menzel <pmenzel@molgen.mpg.de>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b8000ae
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm · d569e869
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Extra quiet after Easter, only have minor i915 and msm pulls. However
        I haven't seen a PR from our misc tree in a little while, I've cc'ed
        all the suspects. Once that unblocks I expect a bit larger bunch of
        patches to arrive.
      
        Otherwise as I said, one msm revert and two i915 fixes.
      
        msm:
      
         - revert iommu change that broke some platforms.
      
        i915:
      
         - Unset enable_psr2_sel_fetch if PSR2 detection fails
      
         - Fix to detect when VRR is turned off from panel settings"
      
      * tag 'drm-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm:
        drm/i915/display/psr: Unset enable_psr2_sel_fetch if other checks in intel_psr2_config_valid() fails
        drm/msm: Revert "drm/msm: Stop using iommu_present()"
        drm/i915/display/vrr: Reset VRR capable property on a long hpd
      d569e869
    • Alistair Popple's avatar
      mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove() · 31956166
      Alistair Popple authored
      In some cases it is possible for mmu_interval_notifier_remove() to race
      with mn_tree_inv_end() allowing it to return while the notifier data
      structure is still in use.  Consider the following sequence:
      
        CPU0 - mn_tree_inv_end()            CPU1 - mmu_interval_notifier_remove()
        ----------------------------------- ------------------------------------
                                            spin_lock(subscriptions->lock);
                                            seq = subscriptions->invalidate_seq;
        spin_lock(subscriptions->lock);     spin_unlock(subscriptions->lock);
        subscriptions->invalidate_seq++;
                                            wait_event(invalidate_seq != seq);
                                            return;
        interval_tree_remove(interval_sub); kfree(interval_sub);
        spin_unlock(subscriptions->lock);
        wake_up_all();
      
      As the wait_event() condition is true it will return immediately.  This
      can lead to use-after-free type errors if the caller frees the data
      structure containing the interval notifier subscription while it is
      still on a deferred list.  Fix this by taking the appropriate lock when
      reading invalidate_seq to ensure proper synchronisation.
      
      I observed this whilst running stress testing during some development.
      You do have to be pretty unlucky, but it leads to the usual problems of
      use-after-free (memory corruption, kernel crash, difficult to diagnose
      WARN_ON, etc).
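
      The essence of the fix (a simplified sketch; the mainline helper has a
      longer name): the wait_event() condition samples invalidate_seq under
      subscriptions->lock, so it cannot observe the bumped sequence until
      mn_tree_inv_end() has finished processing the deferred list and
      dropped the lock.

        static bool seq_released(struct mmu_notifier_subscriptions *subscriptions,
                                 unsigned long seq)
        {
                bool ret;

                /* Serialise the read against mn_tree_inv_end(). */
                spin_lock(&subscriptions->lock);
                ret = subscriptions->invalidate_seq != seq;
                spin_unlock(&subscriptions->lock);
                return ret;
        }

        /* ... in mmu_interval_notifier_remove(): */
        if (seq)
                wait_event(subscriptions->wq, seq_released(subscriptions, seq));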
      
      Link: https://lkml.kernel.org/r/20220420043734.476348-1-apopple@nvidia.com
      Fixes: 99cb252f ("mm/mmu_notifier: add an interval tree notifier")
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31956166
    • Aleksandr Nogikh's avatar
      kcov: don't generate a warning on vm_insert_page()'s failure · ecc04463
      Aleksandr Nogikh authored
      vm_insert_page()'s failure is not an unexpected condition, so don't do
      WARN_ONCE() in such a case.
      
      Instead, print a kernel message and just return an error code.
      
      This flaw has been reported under an OOM condition by syzbot [1].
      
      The message is mainly for the benefit of the test log, in this case the
      fuzzer's log so that humans inspecting the log can figure out what was
      going on.  KCOV is a testing tool, so I think being a little more chatty
      when KCOV is unexpectedly about to fail will save someone debugging
      time.
      
      We don't want the WARN, because it's not a kernel bug that syzbot should
      report, and failure can happen if the fuzzer tries hard enough (as
      above).
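
      The change itself is small; a sketch of the kcov mmap path after the
      fix (simplified, but the replacement of WARN_ONCE() with a one-off
      kernel message is the point):

        for (off = 0; off < size; off += PAGE_SIZE) {
                page = vmalloc_to_page(kcov->area + off);
                res = vm_insert_page(vma, vma->vm_start + off, page);
                if (res) {
                        pr_warn_once("kcov: vm_insert_page() failed\n");
                        return res;
                }
        }
        return 0;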
      
      Link: https://lkml.kernel.org/r/Ylkr2xrVbhQYwNLf@elver.google.com [1]
      Link: https://lkml.kernel.org/r/20220401182512.249282-1-nogikh@google.com
      Fixes: b3d7fe86 ("kcov: properly handle subsequent mmap calls")
      Signed-off-by: Aleksandr Nogikh <nogikh@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ecc04463
    • Vincenzo Frascino's avatar
      MAINTAINERS: add Vincenzo Frascino to KASAN reviewers · 415fccf8
      Vincenzo Frascino authored
      Add my email address to KASAN reviewers list to make sure that I am
      Cc'ed in all the KASAN changes that may affect arm64 MTE.
      
      Link: https://lkml.kernel.org/r/20220419170640.21404-1-vincenzo.frascino@arm.com
      Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      415fccf8
    • Nico Pache's avatar
      oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup · e4a38402
      Nico Pache authored
      The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which
      can be targeted by the oom reaper.  This mapping is used to store the
      futex robust list head; the kernel does not keep a copy of the robust
      list and instead references a userspace address to maintain the
      robustness during a process death.
      
      A race can occur between exit_mm and the oom reaper that allows the oom
      reaper to free the memory of the futex robust list before the exit path
      has handled the futex death:
      
          CPU1                               CPU2
          --------------------------------------------------------------------
          page_fault
          do_exit "signal"
          wake_oom_reaper
                                              oom_reaper
                                              oom_reap_task_mm (invalidates mm)
          exit_mm
          exit_mm_release
          futex_exit_release
          futex_cleanup
          exit_robust_list
          get_user (EFAULT- can't access memory)
      
      If the get_user() call returns -EFAULT, the kernel will be unable to
      recover the waiters on the robust_list, leaving userspace mutexes hung
      indefinitely.
      
      Delay the OOM reaper, allowing more time for the exit path to perform
      the futex cleanup.
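
      A sketch of the idea (helper and field names are assumptions based on
      the description above, not necessarily the exact mm/oom_kill.c change):
      instead of waking the reaper immediately, arm a short timer so that
      exit_mm()/futex_exit_release() get a chance to walk the robust list
      first.

        /* Give the exiting task ~2s to finish futex cleanup before the
         * OOM reaper may unmap its address space.
         */
        #define OOM_REAPER_DELAY (2 * HZ)

        static void queue_oom_reaper(struct task_struct *tsk)
        {
                /* Only queue each mm once. */
                if (test_and_set_bit(MMF_OOM_REAP_QUEUED,
                                     &tsk->signal->oom_mm->flags))
                        return;

                get_task_struct(tsk);
                /* wake_oom_reaper() hands tsk to the reaper when it fires. */
                timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
                tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
                add_timer(&tsk->oom_reaper_timer);
        }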
      
      Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
      
      Based on a patch by Michal Hocko.
      
      Link: https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370 [1]
      Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com
      Fixes: 21292580 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
      Signed-off-by: Joel Savitz <jsavitz@redhat.com>
      Signed-off-by: Nico Pache <npache@redhat.com>
      Co-developed-by: Joel Savitz <jsavitz@redhat.com>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Herton R. Krzesinski <herton@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joel Savitz <jsavitz@redhat.com>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4a38402
    • Sidhartha Kumar's avatar
      selftest/vm: add skip support to mremap_test · 80df2fb9
      Sidhartha Kumar authored
      Allow the mremap test to be skipped due to errors such as failing to
      parse the mmap_min_addr sysctl.
      
      Link: https://lkml.kernel.org/r/20220420215721.4868-4-sidhartha.kumar@oracle.com
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80df2fb9
    • Sidhartha Kumar's avatar
      selftest/vm: support xfail in mremap_test · e5508fc5
      Sidhartha Kumar authored
      e5508fc5
    • Sidhartha Kumar's avatar
      selftest/vm: verify remap destination address in mremap_test · 18d609da
      Sidhartha Kumar authored
      Because mremap does not have a MAP_FIXED_NOREPLACE flag, it can destroy
      existing mappings.  This causes a segfault when regions such as text are
      remapped and the permissions are changed.
      
      Verify the requested mremap destination address does not overlap any
      existing mappings by using mmap's MAP_FIXED_NOREPLACE flag.  Keep
      incrementing the destination address until a valid mapping is found or
      fail the current test once the max address is reached.
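
      Roughly what the destination probe looks like in userspace (a
      self-contained sketch, not the selftest source verbatim): try the
      candidate destination with MAP_FIXED_NOREPLACE and bump it until the
      kernel confirms nothing is mapped there.

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/mman.h>

        #ifndef MAP_FIXED_NOREPLACE
        #define MAP_FIXED_NOREPLACE 0x100000
        #endif

        /* Find len bytes at or above hint overlapping no existing mapping;
         * returns MAP_FAILED if nothing fits below max.
         */
        static void *probe_dest(unsigned long hint, unsigned long max, size_t len)
        {
                unsigned long addr;

                for (addr = hint; addr + len <= max; addr += len) {
                        void *p = mmap((void *)addr, len, PROT_READ | PROT_WRITE,
                                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
                                       -1, 0);
                        if (p != MAP_FAILED) {
                                /* Unmap again; mremap() will re-create it. */
                                munmap(p, len);
                                return p;
                        }
                }
                return MAP_FAILED;
        }

        int main(void)
        {
                void *dest = probe_dest(0x100000UL, 0x7f0000000000UL, 1UL << 20);

                printf("usable destination: %p\n", dest);
                return dest == MAP_FAILED;
        }

      The returned address can then be passed to mremap() with
      MREMAP_MAYMOVE | MREMAP_FIXED without clobbering an existing mapping.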
      
      Link: https://lkml.kernel.org/r/20220420215721.4868-2-sidhartha.kumar@oracle.com
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18d609da
    • Sidhartha Kumar's avatar
      selftest/vm: verify mmap addr in mremap_test · 9c85a9ba
      Sidhartha Kumar authored
      Avoid calling mmap with requested addresses that are less than the
      system's mmap_min_addr.  When run as root, mmap returns EACCES when
      trying to map addresses < mmap_min_addr.  EACCES is not one of the
      error codes that cause the test to retry the mmap.
      
      Rather than arbitrarily retrying on EACCES, don't attempt an mmap until
      addr > vm.mmap_min_addr.
      
      Add a munmap call after an alignment check as the mappings are retained
      after the retry and can reach the vm.max_map_count sysctl.
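
      A small sketch of that precheck (illustrative; the real test parses
      the sysctl once at startup): read /proc/sys/vm/mmap_min_addr and only
      attempt mmap() at or above it.

        #include <stdio.h>

        /* Returns vm.mmap_min_addr, or 0 if it cannot be parsed so the
         * caller can skip instead of retrying on EACCES.
         */
        static unsigned long read_mmap_min_addr(void)
        {
                unsigned long val = 0;
                FILE *fp = fopen("/proc/sys/vm/mmap_min_addr", "r");

                if (!fp)
                        return 0;
                if (fscanf(fp, "%lu", &val) != 1)
                        val = 0;
                fclose(fp);
                return val;
        }

        int main(void)
        {
                printf("vm.mmap_min_addr = %lu\n", read_mmap_min_addr());
                return 0;
        }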
      
      Link: https://lkml.kernel.org/r/20220420215721.4868-1-sidhartha.kumar@oracle.com
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c85a9ba
    • Christophe Leroy's avatar
      mm, hugetlb: allow for "high" userspace addresses · 5f24d5a5
      Christophe Leroy authored
      This is a fix for commit f6795053 ("mm: mmap: Allow for "high"
      userspace addresses") for hugetlb.
      
      This patch adds support for "high" userspace addresses that are
      optionally supported on the system and have to be requested via a hint
      mechanism ("high" addr parameter to mmap).
      
      Architectures such as powerpc and x86 achieve this by making changes to
      their architectural versions of hugetlb_get_unmapped_area() function.
      However, arm64 uses the generic version of that function.
      
      So take into account arch_get_mmap_base() and arch_get_mmap_end() in
      hugetlb_get_unmapped_area().  To allow that, move those two macros out
      of mm/mmap.c into include/linux/sched/mm.h
      
      If these macros are not defined in architectural code then they default
      to (TASK_SIZE) and (base) so should not introduce any behavioural
      changes to architectures that do not define them.
      
      For the time being, only ARM64 is affected by this change.
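
      The generic fallbacks (as described above, now living in
      include/linux/sched/mm.h) are roughly:

        /* Architectures like arm64 override these to honour the "high
         * address" mmap hint; everyone else keeps the old behaviour.
         */
        #ifndef arch_get_mmap_end
        #define arch_get_mmap_end(addr)         (TASK_SIZE)
        #endif

        #ifndef arch_get_mmap_base
        #define arch_get_mmap_base(addr, base)  (base)
        #endif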
      
      Catalin (ARM64) said
       "We should have fixed hugetlb_get_unmapped_area() as well when we added
        support for 52-bit VA. The reason for commit f6795053 was to
        prevent normal mmap() from returning addresses above 48-bit by default
        as some user-space had hard assumptions about this.
      
        It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
        but I doubt anyone would notice. It's more likely that the current
        behaviour would cause issues, so I'd rather have them consistent.
      
        Basically when arm64 gained support for 52-bit addresses we did not
        want user-space calling mmap() to suddenly get such high addresses,
        otherwise we could have inadvertently broken some programs (similar
        behaviour to x86 here). Hence we added commit f6795053. But we
        missed hugetlbfs which could still get such high mmap() addresses. So
        in theory that's a potential regression that should have been addressed
        at the same time as commit f6795053 (and before arm64 enabled
        52-bit addresses)"
      
      Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu
      Fixes: f6795053 ("mm: mmap: Allow for "high" userspace addresses")
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org>	[5.0.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f24d5a5
    • Nadav Amit's avatar
      userfaultfd: mark uffd_wp regardless of VM_WRITE flag · 0e88904c
      Nadav Amit authored
      When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is
      currently only marked as write-protected if the VMA has VM_WRITE flag
      set.  This seems incorrect or at least would be unexpected by the users.
      
      Consider the following sequence of operations that are being performed
      on a certain page:
      
      	mprotect(PROT_READ)
      	UFFDIO_COPY(UFFDIO_COPY_MODE_WP)
      	mprotect(PROT_READ|PROT_WRITE)
      
      At this point the user would expect to still get UFFD notification when
      the page is accessed for write, but the user would not get one, since
      the PTE was not marked as UFFD_WP during UFFDIO_COPY.
      
      Fix it by always marking PTEs as UFFD_WP regardless of the write
      permission in the VMA flags.
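
      The gist of the fix as a simplified sketch of the PTE construction in
      the UFFDIO_COPY path (not the exact mm/userfaultfd.c diff): the uffd-wp
      bit is applied independently of VM_WRITE.

        /* Build the destination PTE for UFFDIO_COPY. */
        _dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));

        if (wp_copy)                    /* UFFDIO_COPY_MODE_WP requested */
                /* Write-protected and WP-marked even if the VMA lacks VM_WRITE. */
                _dst_pte = pte_mkuffd_wp(_dst_pte);
        else if (dst_vma->vm_flags & VM_WRITE)
                _dst_pte = pte_mkwrite(_dst_pte);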
      
      Link: https://lkml.kernel.org/r/20220217211602.2769-1-namit@vmware.com
      Fixes: 292924b2 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e88904c
    • Shakeel Butt's avatar
      memcg: sync flush only if periodic flush is delayed · 9b301615
      Shakeel Butt authored
      Daniel Dao has reported [1] a regression on workloads that may trigger
      a lot of refaults (anon and file).  The underlying issue is that
      flushing rstat is expensive.  Although rstat flushes are batched with
      (nr_cpus * MEMCG_BATCH) stat updates, there are workloads which
      genuinely do more stat updates than the batch value within a short
      amount of time.  Since the rstat flush can happen in performance
      critical codepaths like page faults, such workloads can suffer greatly.
      
      This patch fixes this regression by making the rstat flushing
      conditional in the performance critical codepaths.  More specifically,
      the kernel relies on the async periodic rstat flusher to flush the stats
      and only if the periodic flusher is delayed by more than twice the
      amount of its normal time window then the kernel allows rstat flushing
      from the performance critical codepaths.
      
      Now the question: what are the side-effects of this change?  The worst
      that can happen is that the refault codepath will see lruvec stats up
      to 4 seconds old and may cause false (or missed) activations of the
      refaulted page, which may under- or overestimate the workingset size.
      That is not very concerning, though, as the kernel can already miss or
      do false activations.
      
      There are two more codepaths whose flushing behavior is not changed by
      this patch and which we may need to revisit in the future.  One is the
      writeback stats used by dirty throttling and the second is the
      deactivation heuristic in reclaim.  For now we will keep an eye on
      them, and if regressions are reported for these codepaths, we will
      reevaluate.
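
      In code, the conditional flush is roughly the following (a sketch; the
      deadline variable and period are illustrative): the hot paths only
      flush synchronously when the periodic worker has fallen a full period
      behind.

        /* jiffies_64 deadline advanced by the async flush worker each
         * period; once it has passed, the worker is considered delayed.
         */
        static u64 flush_next_time;

        /* Called from refault/fault paths instead of an unconditional flush. */
        void mem_cgroup_flush_stats_delayed(void)
        {
                if (time_after64(get_jiffies_64(), READ_ONCE(flush_next_time)))
                        mem_cgroup_flush_stats();
        }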
      
      Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
      Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
      Fixes: 1f828223 ("memcg: flush lruvec stats in the refault")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Daniel Dao <dqminh@cloudflare.com>
      Tested-by: Ivan Babrou <ivan@cloudflare.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Frank Hofmann <fhofmann@cloudflare.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b301615
    • Xu Yu's avatar
      mm/memory-failure.c: skip huge_zero_page in memory_failure() · d173d541
      Xu Yu authored
      The kernel panics when injecting a memory failure for the global
      huge_zero_page with CONFIG_DEBUG_VM enabled, as follows.
      
        Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
        page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
        head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
        flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
        raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
        raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
        page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
        ------------[ cut here ]------------
        kernel BUG at mm/huge_memory.c:2499!
        invalid opcode: 0000 [#1] PREEMPT SMP PTI
        CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
        Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
        RIP: 0010:split_huge_page_to_list+0x66a/0x880
        Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
        RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
        RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
        RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
        R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
        R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
        FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         try_to_split_thp_page+0x3a/0x130
         memory_failure+0x128/0x800
         madvise_inject_error.cold+0x8b/0xa1
         __x64_sys_madvise+0x54/0x60
         do_syscall_64+0x35/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7fc3754f8bf9
        Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
        RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
        RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
        RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
        RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
        R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
        R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
      
      This makes memory_failure() bail out explicitly on huge_zero_page
      before attempting the split, so the panic above won't happen again.
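
      The bail-out itself is small; a simplified sketch of the check added
      to memory_failure() (error handling condensed):

        /* The global huge zero page cannot be split; give up early
         * instead of tripping the VM_BUG_ON in split_huge_page_to_list().
         */
        if (PageTransHuge(hpage) && is_huge_zero_page(hpage)) {
                action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
                res = -EBUSY;
                goto unlock_mutex;
        }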
      
      Link: https://lkml.kernel.org/r/497d3835612610e370c74e697ea3c721d1d55b9c.1649775850.git.xuyu@linux.alibaba.com
      Fixes: 6a46079c ("HWPOISON: The high level memory error handler in the VM v7")
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reported-by: Abaci <abaci@linux.alibaba.com>
      Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d173d541
    • Naoya Horiguchi's avatar
      mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb() · 405ce051
      Naoya Horiguchi authored
      There is a race condition between memory_failure_hugetlb() and hugetlb
      free/demotion which causes the PageHWPoison flag to be set on the
      wrong page.  One mild consequence is that the wrong processes can be
      killed; a more serious one is that the actual error is left unhandled,
      so nothing prevents later access to it, which might lead to worse
      outcomes such as consuming corrupted data.
      
      Think about the below race window:
      
        CPU 1                                   CPU 2
        memory_failure_hugetlb
        struct page *head = compound_head(p);
                                                hugetlb page might be freed to
                                                buddy, or even changed to another
                                                compound page.
      
        get_hwpoison_page -- page is not what we want now...
      
      The current code first does rough prechecks and then reconfirms them
      after taking a refcount, but this makes the code overly complicated,
      so move the prechecks into a single hugetlb_lock section.

      A newly introduced function, try_memory_failure_hugetlb(), always
      takes hugetlb_lock (even for non-hugetlb pages).  That can be
      improved, but memory_failure() is rare in principle, so it should not
      be a big problem.
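
      A rough sketch of the new entry point (heavily simplified, with
      invented structure; the real checks in mm/memory-failure.c and
      mm/hugetlb.c are more involved): the hugetlb identity check, the
      PageHWPoison marking and the refcount grab all happen inside one
      hugetlb_lock section, so the page cannot be freed or demoted in
      between.

        static int try_memory_failure_hugetlb_sketch(unsigned long pfn)
        {
                struct page *page = pfn_to_page(pfn);
                struct page *head;
                int ret;

                spin_lock_irq(&hugetlb_lock);
                head = compound_head(page);
                if (!PageHuge(head))
                        ret = 1;                /* not hugetlb, caller continues */
                else if (TestSetPageHWPoison(head))
                        ret = -EHWPOISON;       /* already poisoned */
                else
                        ret = get_page_unless_zero(head) ? 0 : -EBUSY;
                spin_unlock_irq(&hugetlb_lock);
                return ret;
        }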
      
      Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
      Fixes: 761ad8d7 ("mm: hwpoison: introduce memory_failure_hugetlb()")
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      405ce051
  2. 21 Apr, 2022 9 commits
  3. 20 Apr, 2022 13 commits