1. 25 Apr, 2022 5 commits
  2. 22 Apr, 2022 20 commits
    • David Vernet's avatar
      cgroup: Add test_cpucg_weight_underprovisioned() testcase · 4ab93063
      David Vernet authored
      test_cpu.c includes testcases that validate the cgroup cpu controller.
      This patch adds a new testcase called test_cpucg_weight_underprovisioned()
      that verifies that processes with different cpu.weight that are all running
      on an underprovisioned system, still get roughly the same amount of cpu
      time.
      
      Because test_cpucg_weight_underprovisioned() is very similar to
      test_cpucg_weight_overprovisioned(), this patch also pulls the common logic
      into a separate helper function that is invoked from both testcases, and
      which uses function pointers to invoke the unique portions of the
      testcases.
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      4ab93063
    • David Vernet's avatar
      cgroup: Add test_cpucg_weight_overprovisioned() testcase · 6376b22c
      David Vernet authored
      test_cpu.c includes testcases that validate the cgroup cpu controller.
      This patch adds a new testcase called test_cpucg_weight_overprovisioned()
      that verifies the expected behavior of creating multiple processes with
      different cpu.weight, on a system that is overprovisioned.
      
      So as to avoid code duplication, this patch also updates cpu_hog_func_param
      to take a new hog_clock_type enum which informs how time is counted in
      hog_cpus_timed() (either process time or wall clock time).
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      6376b22c
    • David Vernet's avatar
      cgroup: Add test_cpucg_stats() testcase to cgroup cpu selftests · 3c879a1b
      David Vernet authored
      test_cpu.c includes testcases that validate the cgroup cpu controller.
      This patch adds a new testcase called test_cpucg_stats() that verifies the
      expected behavior of the cpu.stat interface. In doing so, we define a
      new hog_cpus_timed() function which takes a cpu_hog_func_param struct
      that configures how many CPUs it uses, and how long it runs. Future
      patches will also spawn threads that hog CPUs, so this function will
      eventually serve those use-cases as well.
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      3c879a1b
    • David Vernet's avatar
      cgroup: Add new test_cpu.c test suite in cgroup selftests · 820a4f88
      David Vernet authored
      The cgroup selftests suite currently contains tests that validate various
      aspects of cgroup, such as validating the expected behavior for memory
      controllers, the expected behavior of cgroup.procs, etc. There are no tests
      that validate the expected behavior of the cgroup cpu controller.
      
      This patch therefore adds a new test_cpu.c file that will contain cpu
      controller testcases. The file currently only contains a single testcase
      that validates creating nested cgroups with cgroup.subtree_control
      including cpu. Future patches will add more sophisticated testcases that
      validate functional aspects of the cpu controller.
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      820a4f88
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 281b9d9a
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "13 patches.
      
        Subsystems affected by this patch series: mm (memory-failure, memcg,
        userfaultfd, hugetlbfs, mremap, oom-kill, kasan, hmm), and kcov"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
        kcov: don't generate a warning on vm_insert_page()'s failure
        MAINTAINERS: add Vincenzo Frascino to KASAN reviewers
        oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
        selftest/vm: add skip support to mremap_test
        selftest/vm: support xfail in mremap_test
        selftest/vm: verify remap destination address in mremap_test
        selftest/vm: verify mmap addr in mremap_test
        mm, hugetlb: allow for "high" userspace addresses
        userfaultfd: mark uffd_wp regardless of VM_WRITE flag
        memcg: sync flush only if periodic flush is delayed
        mm/memory-failure.c: skip huge_zero_page in memory_failure()
        mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
      281b9d9a
    • Nicholas Piggin's avatar
      mm/vmalloc: huge vmalloc backing pages should be split rather than compound · 3b8000ae
      Nicholas Piggin authored
      Huge vmalloc higher-order backing pages were allocated with __GFP_COMP
      in order to allow the sub-pages to be refcounted by callers such as
      "remap_vmalloc_page [sic]" (remap_vmalloc_range).
      
      However a similar problem exists for other struct page fields callers
      use, for example fb_deferred_io_fault() takes a vmalloc'ed page and
      not only refcounts it but uses ->lru, ->mapping, ->index.
      
      This is not compatible with compound sub-pages, and can cause bad page
      state issues like
      
        BUG: Bad page state in process swapper/0  pfn:00743
        page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743
        flags: 0x7ffff000000000(node=0|zone=0|lastcpupid=0x7ffff)
        raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000
        raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
        page dumped because: corrupted mapping in tail page
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810
        Call Trace:
          dump_stack_lvl+0x74/0xa8 (unreliable)
          bad_page+0x12c/0x170
          free_tail_pages_check+0xe8/0x190
          free_pcp_prepare+0x31c/0x4e0
          free_unref_page+0x40/0x1b0
          __vunmap+0x1d8/0x420
          ...
      
      The correct approach is to use split high-order pages for the huge
      vmalloc backing. These allow callers to treat them in exactly the same
      way as individually-allocated order-0 pages.
      
      Link: https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Cc: Paul Menzel <pmenzel@molgen.mpg.de>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b8000ae
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm · d569e869
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Extra quiet after Easter, only have minor i915 and msm pulls. However
        I haven't seen a PR from our misc tree in a little while, I've cc'ed
        all the suspects. Once that unblocks I expect a bit larger bunch of
        patches to arrive.
      
        Otherwise as I said, one msm revert and two i915 fixes.
      
        msm:
      
         - revert iommu change that broke some platforms.
      
        i915:
      
         - Unset enable_psr2_sel_fetch if PSR2 detection fails
      
         - Fix to detect when VRR is turned off from panel settings"
      
      * tag 'drm-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm:
        drm/i915/display/psr: Unset enable_psr2_sel_fetch if other checks in intel_psr2_config_valid() fails
        drm/msm: Revert "drm/msm: Stop using iommu_present()"
        drm/i915/display/vrr: Reset VRR capable property on a long hpd
      d569e869
    • Alistair Popple's avatar
      mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove() · 31956166
      Alistair Popple authored
      In some cases it is possible for mmu_interval_notifier_remove() to race
      with mn_tree_inv_end() allowing it to return while the notifier data
      structure is still in use.  Consider the following sequence:
      
        CPU0 - mn_tree_inv_end()            CPU1 - mmu_interval_notifier_remove()
        ----------------------------------- ------------------------------------
                                            spin_lock(subscriptions->lock);
                                            seq = subscriptions->invalidate_seq;
        spin_lock(subscriptions->lock);     spin_unlock(subscriptions->lock);
        subscriptions->invalidate_seq++;
                                            wait_event(invalidate_seq != seq);
                                            return;
        interval_tree_remove(interval_sub); kfree(interval_sub);
        spin_unlock(subscriptions->lock);
        wake_up_all();
      
      As the wait_event() condition is true it will return immediately.  This
      can lead to use-after-free type errors if the caller frees the data
      structure containing the interval notifier subscription while it is
      still on a deferred list.  Fix this by taking the appropriate lock when
      reading invalidate_seq to ensure proper synchronisation.
      
      I observed this whilst running stress testing during some development.
      You do have to be pretty unlucky, but it leads to the usual problems of
      use-after-free (memory corruption, kernel crash, difficult to diagnose
      WARN_ON, etc).
      
      Link: https://lkml.kernel.org/r/20220420043734.476348-1-apopple@nvidia.com
      Fixes: 99cb252f ("mm/mmu_notifier: add an interval tree notifier")
      Signed-off-by: default avatarAlistair Popple <apopple@nvidia.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      31956166
    • Aleksandr Nogikh's avatar
      kcov: don't generate a warning on vm_insert_page()'s failure · ecc04463
      Aleksandr Nogikh authored
      vm_insert_page()'s failure is not an unexpected condition, so don't do
      WARN_ONCE() in such a case.
      
      Instead, print a kernel message and just return an error code.
      
      This flaw has been reported under an OOM condition by sysbot [1].
      
      The message is mainly for the benefit of the test log, in this case the
      fuzzer's log so that humans inspecting the log can figure out what was
      going on.  KCOV is a testing tool, so I think being a little more chatty
      when KCOV unexpectedly is about to fail will save someone debugging
      time.
      
      We don't want the WARN, because it's not a kernel bug that syzbot should
      report, and failure can happen if the fuzzer tries hard enough (as
      above).
      
      Link: https://lkml.kernel.org/r/Ylkr2xrVbhQYwNLf@elver.google.com [1]
      Link: https://lkml.kernel.org/r/20220401182512.249282-1-nogikh@google.com
      Fixes: b3d7fe86 ("kcov: properly handle subsequent mmap calls"),
      Signed-off-by: default avatarAleksandr Nogikh <nogikh@google.com>
      Acked-by: default avatarMarco Elver <elver@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ecc04463
    • Vincenzo Frascino's avatar
      MAINTAINERS: add Vincenzo Frascino to KASAN reviewers · 415fccf8
      Vincenzo Frascino authored
      Add my email address to KASAN reviewers list to make sure that I am
      Cc'ed in all the KASAN changes that may affect arm64 MTE.
      
      Link: https://lkml.kernel.org/r/20220419170640.21404-1-vincenzo.frascino@arm.comSigned-off-by: default avatarVincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      415fccf8
    • Nico Pache's avatar
      oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup · e4a38402
      Nico Pache authored
      The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which
      can be targeted by the oom reaper.  This mapping is used to store the
      futex robust list head; the kernel does not keep a copy of the robust
      list and instead references a userspace address to maintain the
      robustness during a process death.
      
      A race can occur between exit_mm and the oom reaper that allows the oom
      reaper to free the memory of the futex robust list before the exit path
      has handled the futex death:
      
          CPU1                               CPU2
          --------------------------------------------------------------------
          page_fault
          do_exit "signal"
          wake_oom_reaper
                                              oom_reaper
                                              oom_reap_task_mm (invalidates mm)
          exit_mm
          exit_mm_release
          futex_exit_release
          futex_cleanup
          exit_robust_list
          get_user (EFAULT- can't access memory)
      
      If the get_user EFAULT's, the kernel will be unable to recover the
      waiters on the robust_list, leaving userspace mutexes hung indefinitely.
      
      Delay the OOM reaper, allowing more time for the exit path to perform
      the futex cleanup.
      
      Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
      
      Based on a patch by Michal Hocko.
      
      Link: https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370 [1]
      Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com
      Fixes: 21292580 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
      Signed-off-by: default avatarJoel Savitz <jsavitz@redhat.com>
      Signed-off-by: default avatarNico Pache <npache@redhat.com>
      Co-developed-by: default avatarJoel Savitz <jsavitz@redhat.com>
      Suggested-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Herton R. Krzesinski <herton@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joel Savitz <jsavitz@redhat.com>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e4a38402
    • Sidhartha Kumar's avatar
      selftest/vm: add skip support to mremap_test · 80df2fb9
      Sidhartha Kumar authored
      Allow the mremap test to be skipped due to errors such as failing to
      parse the mmap_min_addr sysctl.
      
      Link: https://lkml.kernel.org/r/20220420215721.4868-4-sidhartha.kumar@oracle.comSigned-off-by: default avatarSidhartha Kumar <sidhartha.kumar@oracle.com>
      Reviewed-by: default avatarShuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      80df2fb9
    • Sidhartha Kumar's avatar
      e5508fc5
    • Sidhartha Kumar's avatar
      selftest/vm: verify remap destination address in mremap_test · 18d609da
      Sidhartha Kumar authored
      Because mremap does not have a MAP_FIXED_NOREPLACE flag, it can destroy
      existing mappings.  This causes a segfault when regions such as text are
      remapped and the permissions are changed.
      
      Verify the requested mremap destination address does not overlap any
      existing mappings by using mmap's MAP_FIXED_NOREPLACE flag.  Keep
      incrementing the destination address until a valid mapping is found or
      fail the current test once the max address is reached.
      
      Link: https://lkml.kernel.org/r/20220420215721.4868-2-sidhartha.kumar@oracle.comSigned-off-by: default avatarSidhartha Kumar <sidhartha.kumar@oracle.com>
      Reviewed-by: default avatarShuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      18d609da
    • Sidhartha Kumar's avatar
      selftest/vm: verify mmap addr in mremap_test · 9c85a9ba
      Sidhartha Kumar authored
      Avoid calling mmap with requested addresses that are less than the
      system's mmap_min_addr.  When run as root, mmap returns EACCES when
      trying to map addresses < mmap_min_addr.  This is not one of the error
      codes for the condition to retry the mmap in the test.
      
      Rather than arbitrarily retrying on EACCES, don't attempt an mmap until
      addr > vm.mmap_min_addr.
      
      Add a munmap call after an alignment check as the mappings are retained
      after the retry and can reach the vm.max_map_count sysctl.
      
      Link: https://lkml.kernel.org/r/20220420215721.4868-1-sidhartha.kumar@oracle.comSigned-off-by: default avatarSidhartha Kumar <sidhartha.kumar@oracle.com>
      Reviewed-by: default avatarShuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9c85a9ba
    • Christophe Leroy's avatar
      mm, hugetlb: allow for "high" userspace addresses · 5f24d5a5
      Christophe Leroy authored
      This is a fix for commit f6795053 ("mm: mmap: Allow for "high"
      userspace addresses") for hugetlb.
      
      This patch adds support for "high" userspace addresses that are
      optionally supported on the system and have to be requested via a hint
      mechanism ("high" addr parameter to mmap).
      
      Architectures such as powerpc and x86 achieve this by making changes to
      their architectural versions of hugetlb_get_unmapped_area() function.
      However, arm64 uses the generic version of that function.
      
      So take into account arch_get_mmap_base() and arch_get_mmap_end() in
      hugetlb_get_unmapped_area().  To allow that, move those two macros out
      of mm/mmap.c into include/linux/sched/mm.h
      
      If these macros are not defined in architectural code then they default
      to (TASK_SIZE) and (base) so should not introduce any behavioural
      changes to architectures that do not define them.
      
      For the time being, only ARM64 is affected by this change.
      
      Catalin (ARM64) said
       "We should have fixed hugetlb_get_unmapped_area() as well when we added
        support for 52-bit VA. The reason for commit f6795053 was to
        prevent normal mmap() from returning addresses above 48-bit by default
        as some user-space had hard assumptions about this.
      
        It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
        but I doubt anyone would notice. It's more likely that the current
        behaviour would cause issues, so I'd rather have them consistent.
      
        Basically when arm64 gained support for 52-bit addresses we did not
        want user-space calling mmap() to suddenly get such high addresses,
        otherwise we could have inadvertently broken some programs (similar
        behaviour to x86 here). Hence we added commit f6795053. But we
        missed hugetlbfs which could still get such high mmap() addresses. So
        in theory that's a potential regression that should have bee addressed
        at the same time as commit f6795053 (and before arm64 enabled
        52-bit addresses)"
      
      Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu
      Fixes: f6795053 ("mm: mmap: Allow for "high" userspace addresses")
      Signed-off-by: default avatarChristophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org>	[5.0.x]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5f24d5a5
    • Nadav Amit's avatar
      userfaultfd: mark uffd_wp regardless of VM_WRITE flag · 0e88904c
      Nadav Amit authored
      When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is
      currently only marked as write-protected if the VMA has VM_WRITE flag
      set.  This seems incorrect or at least would be unexpected by the users.
      
      Consider the following sequence of operations that are being performed
      on a certain page:
      
      	mprotect(PROT_READ)
      	UFFDIO_COPY(UFFDIO_COPY_MODE_WP)
      	mprotect(PROT_READ|PROT_WRITE)
      
      At this point the user would expect to still get UFFD notification when
      the page is accessed for write, but the user would not get one, since
      the PTE was not marked as UFFD_WP during UFFDIO_COPY.
      
      Fix it by always marking PTEs as UFFD_WP regardless on the
      write-permission in the VMA flags.
      
      Link: https://lkml.kernel.org/r/20220217211602.2769-1-namit@vmware.com
      Fixes: 292924b2 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0e88904c
    • Shakeel Butt's avatar
      memcg: sync flush only if periodic flush is delayed · 9b301615
      Shakeel Butt authored
      Daniel Dao has reported [1] a regression on workloads that may trigger a
      lot of refaults (anon and file).  The underlying issue is that flushing
      rstat is expensive.  Although rstat flush are batched with (nr_cpus *
      MEMCG_BATCH) stat updates, it seems like there are workloads which
      genuinely do stat updates larger than batch value within short amount of
      time.  Since the rstat flush can happen in the performance critical
      codepaths like page faults, such workload can suffer greatly.
      
      This patch fixes this regression by making the rstat flushing
      conditional in the performance critical codepaths.  More specifically,
      the kernel relies on the async periodic rstat flusher to flush the stats
      and only if the periodic flusher is delayed by more than twice the
      amount of its normal time window then the kernel allows rstat flushing
      from the performance critical codepaths.
      
      Now the question: what are the side-effects of this change? The worst
      that can happen is the refault codepath will see 4sec old lruvec stats
      and may cause false (or missed) activations of the refaulted page which
      may under-or-overestimate the workingset size.  Though that is not very
      concerning as the kernel can already miss or do false activations.
      
      There are two more codepaths whose flushing behavior is not changed by
      this patch and we may need to come to them in future.  One is the
      writeback stats used by dirty throttling and second is the deactivation
      heuristic in the reclaim.  For now keeping an eye on them and if there
      is report of regression due to these codepaths, we will reevaluate then.
      
      Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
      Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
      Fixes: 1f828223 ("memcg: flush lruvec stats in the refault")
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Reported-by: default avatarDaniel Dao <dqminh@cloudflare.com>
      Tested-by: default avatarIvan Babrou <ivan@cloudflare.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Frank Hofmann <fhofmann@cloudflare.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9b301615
    • Xu Yu's avatar
      mm/memory-failure.c: skip huge_zero_page in memory_failure() · d173d541
      Xu Yu authored
      Kernel panic when injecting memory_failure for the global
      huge_zero_page, when CONFIG_DEBUG_VM is enabled, as follows.
      
        Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
        page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
        head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
        flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
        raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
        raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
        page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
        ------------[ cut here ]------------
        kernel BUG at mm/huge_memory.c:2499!
        invalid opcode: 0000 [#1] PREEMPT SMP PTI
        CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
        Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
        RIP: 0010:split_huge_page_to_list+0x66a/0x880
        Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
        RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
        RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
        RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
        R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
        R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
        FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         try_to_split_thp_page+0x3a/0x130
         memory_failure+0x128/0x800
         madvise_inject_error.cold+0x8b/0xa1
         __x64_sys_madvise+0x54/0x60
         do_syscall_64+0x35/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7fc3754f8bf9
        Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
        RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
        RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
        RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
        RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
        R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
        R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
      
      This makes huge_zero_page bail out explicitly before split in
      memory_failure(), thus the panic above won't happen again.
      
      Link: https://lkml.kernel.org/r/497d3835612610e370c74e697ea3c721d1d55b9c.1649775850.git.xuyu@linux.alibaba.com
      Fixes: 6a46079c ("HWPOISON: The high level memory error handler in the VM v7")
      Signed-off-by: default avatarXu Yu <xuyu@linux.alibaba.com>
      Reported-by: default avatarAbaci <abaci@linux.alibaba.com>
      Suggested-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Acked-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d173d541
    • Naoya Horiguchi's avatar
      mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb() · 405ce051
      Naoya Horiguchi authored
      There is a race condition between memory_failure_hugetlb() and hugetlb
      free/demotion, which causes setting PageHWPoison flag on the wrong page.
      The one simple result is that wrong processes can be killed, but another
      (more serious) one is that the actual error is left unhandled, so no one
      prevents later access to it, and that might lead to more serious results
      like consuming corrupted data.
      
      Think about the below race window:
      
        CPU 1                                   CPU 2
        memory_failure_hugetlb
        struct page *head = compound_head(p);
                                                hugetlb page might be freed to
                                                buddy, or even changed to another
                                                compound page.
      
        get_hwpoison_page -- page is not what we want now...
      
      The current code first does prechecks roughly and then reconfirms after
      taking refcount, but it's found that it makes code overly complicated,
      so move the prechecks in a single hugetlb_lock range.
      
      A newly introduced function, try_memory_failure_hugetlb(), always takes
      hugetlb_lock (even for non-hugetlb pages).  That can be improved, but
      memory_failure() is rare in principle, so should not be a big problem.
      
      Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
      Fixes: 761ad8d7 ("mm: hwpoison: introduce memory_failure_hugetlb()")
      Signed-off-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Reported-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      405ce051
  3. 21 Apr, 2022 5 commits
    • Dave Airlie's avatar
    • Linus Torvalds's avatar
      Merge tag 'dmaengine-fix-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine · b05a5683
      Linus Torvalds authored
      Pull dmaengine fixes from Vinod Koul:
       "A bunch of driver fixes:
      
         - idxd device RO checks and device cleanup
      
         - dw-edma unaligned access and alignment
      
         - qcom: missing minItems in binding
      
         - mediatek pm usage fix
      
         - imx init script"
      
      * tag 'dmaengine-fix-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine:
        dt-bindings: dmaengine: qcom: gpi: Add minItems for interrupts
        dmaengine: idxd: skip clearing device context when device is read-only
        dmaengine: idxd: add RO check for wq max_transfer_size write
        dmaengine: idxd: add RO check for wq max_batch_size write
        dmaengine: idxd: fix retry value to be constant for duration of function call
        dmaengine: idxd: match type for retries var in idxd_enqcmds()
        dmaengine: dw-edma: Fix inconsistent indenting
        dmaengine: dw-edma: Fix unaligned 64bit access
        dmaengine: mediatek:Fix PM usage reference leak of mtk_uart_apdma_alloc_chan_resources
        dmaengine: imx-sdma: Fix error checking in sdma_event_remap
        dma: at_xdmac: fix a missing check on list iterator
        dmaengine: imx-sdma: fix init of uart scripts
        dmaengine: idxd: fix device cleanup on disable
      b05a5683
    • Dave Airlie's avatar
      Merge tag 'drm-intel-fixes-2022-04-20' of... · e827d149
      Dave Airlie authored
      Merge tag 'drm-intel-fixes-2022-04-20' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
      
      - Unset enable_psr2_sel_fetch if PSR2 detection fails
      - Fix to detect when VRR is turned off from panel settings
      Signed-off-by: default avatarDave Airlie <airlied@redhat.com>
      From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/YmAKuHwon7hGyIoC@jlahtine-mobl.ger.corp.intel.com
      e827d149
    • Linus Torvalds's avatar
      Merge tag 'net-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 59f0c244
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from xfrm and can.
      
        Current release - regressions:
      
         - rxrpc: restore removed timer deletion
      
        Current release - new code bugs:
      
         - gre: fix device lookup for l3mdev use-case
      
         - xfrm: fix egress device lookup for l3mdev use-case
      
        Previous releases - regressions:
      
         - sched: cls_u32: fix netns refcount changes in u32_change()
      
         - smc: fix sock leak when release after smc_shutdown()
      
         - xfrm: limit skb_page_frag_refill use to a single page
      
         - eth: atlantic: invert deep par in pm functions, preventing null
           derefs
      
         - eth: stmmac: use readl_poll_timeout_atomic() in atomic state
      
        Previous releases - always broken:
      
         - gre: fix skb_under_panic on xmit
      
         - openvswitch: fix OOB access in reserve_sfa_size()
      
         - dsa: hellcreek: calculate checksums in tagger
      
         - eth: ice: fix crash in switchdev mode
      
         - eth: igc:
            - fix infinite loop in release_swfw_sync
            - fix scheduling while atomic"
      
      * tag 'net-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
        drivers: net: hippi: Fix deadlock in rr_close()
        selftests: mlxsw: vxlan_flooding_ipv6: Prevent flooding of unwanted packets
        selftests: mlxsw: vxlan_flooding: Prevent flooding of unwanted packets
        nfc: MAINTAINERS: add Bug entry
        net: stmmac: Use readl_poll_timeout_atomic() in atomic state
        doc/ip-sysctl: add bc_forwarding
        netlink: reset network and mac headers in netlink_dump()
        net: mscc: ocelot: fix broken IP multicast flooding
        net: dsa: hellcreek: Calculate checksums in tagger
        net: atlantic: invert deep par in pm functions, preventing null derefs
        can: isotp: stop timeout monitoring when no first frame was sent
        bonding: do not discard lowest hash bit for non layer3+4 hashing
        net: lan966x: Make sure to release ptp interrupt
        ipv6: make ip6_rt_gc_expire an atomic_t
        net: Handle l3mdev in ip_tunnel_init_flow
        l3mdev: l3mdev_master_upper_ifindex_by_index_rcu should be using netdev_master_upper_dev_get_rcu
        net/sched: cls_u32: fix possible leak in u32_init_knode()
        net/sched: cls_u32: fix netns refcount changes in u32_change()
        powerpc: Update MAINTAINERS for ibmvnic and VAS
        net: restore alpha order to Ethernet devices in config
        ...
      59f0c244
    • Duoming Zhou's avatar
      drivers: net: hippi: Fix deadlock in rr_close() · bc6de287
      Duoming Zhou authored
      There is a deadlock in rr_close(), which is shown below:
      
         (Thread 1)                |      (Thread 2)
                                   | rr_open()
      rr_close()                   |  add_timer()
       spin_lock_irqsave() //(1)   |  (wait a time)
       ...                         | rr_timer()
       del_timer_sync()            |  spin_lock_irqsave() //(2)
       (wait timer to stop)        |  ...
      
      We hold rrpriv->lock in position (1) of thread 1 and
      use del_timer_sync() to wait timer to stop, but timer handler
      also need rrpriv->lock in position (2) of thread 2.
      As a result, rr_close() will block forever.
      
      This patch extracts del_timer_sync() from the protection of
      spin_lock_irqsave(), which could let timer handler to obtain
      the needed lock.
      Signed-off-by: default avatarDuoming Zhou <duoming@zju.edu.cn>
      Link: https://lore.kernel.org/r/20220417125519.82618-1-duoming@zju.edu.cnSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      bc6de287
  4. 20 Apr, 2022 10 commits