1. 02 Oct, 2023 4 commits
    • Vlastimil Babka's avatar
      mm/slub: attempt to find layouts up to 1/2 waste in calculate_order() · 5886fc82
      Vlastimil Babka authored
      The main loop in calculate_order() currently tries to find an order with
      at most 1/4 waste. If that's impossible (for particular large object
      sizes), there's a fallback that will try to place one object within
      slab_max_order.
      
      If we expand the loop boundary to also allow up to 1/2 waste as the last
      resort, we can remove the fallback and simplify the code, as the loop
      will find an order for such sizes as well. Note we don't need to allow
      more than 1/2 waste as that will never happen - calc_slab_order() would
      calculate more objects to fit, reducing waste below 1/2.
      
      Successfully finding an order in the loop (compared to the fallback)
      will also have the benefit in trying to satisfy min_objects, because the
      fallback was passing 1. Thus the resulting slab orders might be larger
      (not because it would improve waste, but to reduce pressure on shared
      locks), which is one of the goals of calculate_order().
      
      For example, with nr_cpus=1 and 4kB PAGE_SIZE, slub_max_order=3, before
      the patch we would get the following orders for these object sizes:
      
       2056 to 10920 - order-3 as selected by the loop
      10928 to 12280 - order-2 due to fallback, as <1/4 waste is not possible
      12288 to 32768 - order-3 as <1/4 waste is again possible
      
      After the patch:
      
      2056 to 32768 - order-3, because even in the range of 10928 to 12280 we
                      try to satisfy the calculated min_objects.
      
      As a result the code is simpler and gives more consistent results.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarFeng Tang <feng.tang@intel.com>
      Reviewed-and-tested-by: default avatarJay Patel <jaypatel@linux.ibm.com>
      5886fc82
    • Vlastimil Babka's avatar
      mm/slub: remove min_objects loop from calculate_order() · 0fe2735d
      Vlastimil Babka authored
      calculate_order() currently has two nested loops. The inner one that
      gradually modifies the acceptable waste from 1/16 up to 1/4, and the
      outer one that decreases min_objects down to 2.
      
      Upon closer inspection, the outer loop is unnecessary. Decreasing
      min_objects could have in theory two effects to make the inner loop and
      its call to calc_slab_order() succeed where a previous iteration with
      higher min_objects would not:
      
      - it could cause the min_objects-derived min_order to fit within
        slub_max_order. But min_objects is already pre-capped to max_objects
        that's derived from slub_max_order above the loops, so every iteration
        tries at least slub_max_order in calc_slab_order()
      
      - it could cause calc_slab_order() to be called with lower min_objects
        thus potentially lower min_order in its loop. This would make a
        difference if the lower order could cause the fractional waste test to
        succeed where a higher order has already failed with same fract_leftover
        in the previous iteration with a higher min_order. But that's not
        possible, because increasing the order can only result in lower (or
        same) fractional waste. If we increase the slab size 2 times, we will
        fit at least 2 times the number of objects (thus same fraction of
        waste), or it will allow us to fit one more object (lower fraction of
        waste).
      
      For more confidence I have tried adding a printk to notify when
      decreasing min_objects resulted in a success, and simulated calculations
      for a range of object sizes, nr_cpus and page_sizes. As expected, the
      printk never triggered.
      
      Thus remove the outer loop and adjust comments accordingly.
      
      There's almost no functional change except a weird corner case when
      slub_min_objects=1 on boot command line would cause the whole two nested
      loops to be skipped before this patch. Now it would try to find the best
      layout as usual, resulting in potentially higher orderthat minimizes
      waste. This is not wrong and will be further expanded by the next patch.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarFeng Tang <feng.tang@intel.com>
      Reviewed-and-tested-by: default avatarJay Patel <jaypatel@linux.ibm.com>
      0fe2735d
    • Vlastimil Babka's avatar
      mm/slub: simplify the last resort slab order calculation · c7355d75
      Vlastimil Babka authored
      If calculate_order() can't fit even a single large object within
      slub_max_order, it will try using the smallest necessary order that may
      exceed slub_max_order but not MAX_ORDER.
      
      Currently this is done with a call to calc_slab_order() which is
      unnecessary. We can simply use get_order(size). No functional change.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarFeng Tang <feng.tang@intel.com>
      Reviewed-and-tested-by: default avatarJay Patel <jaypatel@linux.ibm.com>
      c7355d75
    • Feng Tang's avatar
      mm/slub: add sanity check for slub_min/max_order cmdline setup · e519ce7a
      Feng Tang authored
      Currently there are 2 parameters could be setup from kernel cmdline:
      slub_min_order and slub_max_order. It's possible that the user
      configured slub_min_order is bigger than the default slub_max_order
      [1], which can still take effect, as calculate_oder() will use MAX_ORDER
      as a fallback to check against, but has some downsides:
      
      * the kernel message about SLUB will be strange in showing min/max
        orders:
      
          SLUB: HWalign=64, Order=9-3, MinObjects=0, CPUs=16, Nodes=1
      
      * in calculate_order() called by each slab, the 2 loops of
        calc_slab_order() will all be meaningless due to slub_min_order
        is bigger than slub_max_order
      
      * prevent future code cleanup like in [2].
      
      Fix it by adding some sanity check to enforce the min/max semantics.
      
      [1]. https://lore.kernel.org/lkml/21a0ba8b-bf05-0799-7c78-2a35f8c8d52a@os.amperecomputing.com/
      [2]. https://lore.kernel.org/lkml/20230908145302.30320-7-vbabka@suse.cz/Signed-off-by: default avatarFeng Tang <feng.tang@intel.com>
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      e519ce7a
  2. 01 Oct, 2023 15 commits
    • Linus Torvalds's avatar
      Linux 6.6-rc4 · 8a749fd1
      Linus Torvalds authored
      8a749fd1
    • Linus Torvalds's avatar
      Merge tag 'kbuild-fixes-v6.6-2' of... · e81a2dab
      Linus Torvalds authored
      Merge tag 'kbuild-fixes-v6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
      
      Pull Kbuild fixes from Masahiro Yamada:
      
       - Fix the module compression with xz so the in-kernel decompressor
         works
      
       - Document a kconfig idiom to express an optional dependency between
         modules
      
       - Make modpost, when W=1 is given, detect broken drivers that reference
         .exit.* sections
      
       - Remove unused code
      
      * tag 'kbuild-fixes-v6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        kbuild: remove stale code for 'source' symlink in packaging scripts
        modpost: Don't let "driver"s reference .exit.*
        vmlinux.lds.h: remove unused CPU_KEEP and CPU_DISCARD macros
        modpost: add missing else to the "of" check
        Documentation: kbuild: explain handling optional dependencies
        kbuild: Use CRC32 and a 1MiB dictionary for XZ compressed modules
      e81a2dab
    • Linus Torvalds's avatar
      Merge tag 'mm-hotfixes-stable-2023-10-01-08-34' of... · d2c52315
      Linus Torvalds authored
      Merge tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
      
      Pull misc fixes from Andrew Morton:
       "Fourteen hotfixes, eleven of which are cc:stable. The remainder
        pertain to issues which were introduced after 6.5"
      
      * tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
        Crash: add lock to serialize crash hotplug handling
        selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error
        mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified
        mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions()
        mm, memcg: reconsider kmem.limit_in_bytes deprecation
        mm: zswap: fix potential memory corruption on duplicate store
        arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries
        mm: hugetlb: add huge page size param to set_huge_pte_at()
        maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states
        maple_tree: add mas_is_active() to detect in-tree walks
        nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()
        mm: abstract moving to the next PFN
        mm: report success more often from filemap_map_folio_range()
        fs: binfmt_elf_efpic: fix personality for ELF-FDPIC
      d2c52315
    • Linus Torvalds's avatar
      Merge tag 'char-misc-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 8f633369
      Linus Torvalds authored
      Pull misc driver fix from Greg KH:
       "Here is a single, much requested, fix for a set of misc drivers to
        resolve a much reported regression in the -rc series that has also
        propagated back to the stable releases. Sorry for the delay, lots of
        conference travel for a few weeks put me very far behind in patch
        wrangling.
      
        It has been reported by many to resolve the reported problem, and has
        been in linux-next with no reported issues"
      
      * tag 'char-misc-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        misc: rtsx: Fix some platforms can not boot and move the l1ss judgment to probe
      8f633369
    • Linus Torvalds's avatar
      Merge tag 'tty-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 3abd15e2
      Linus Torvalds authored
      Pull tty / serial driver fixes from Greg KH:
       "Here are two tty/serial driver fixes for 6.6-rc4 that resolve some
        reported regressions:
      
         - revert a n_gsm change that ended up causing problems
      
         - 8250_port fix for irq data
      
        both have been in linux-next for over a week with no reported
        problems"
      
      * tag 'tty-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        Revert "tty: n_gsm: fix UAF in gsm_cleanup_mux"
        serial: 8250_port: Check IRQ data before use
      3abd15e2
    • Linus Torvalds's avatar
      Merge tag 'x86-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ec8c2981
      Linus Torvalds authored
      Pull x86 fixes from Ingo Molnar:
       "Misc fixes: a kerneldoc build warning fix, add SRSO mitigation for
        AMD-derived Hygon processors, and fix a SGX kernel crash in the page
        fault handler that can trigger when ksgxd races to reclaim the SECS
        special page, by making the SECS page unswappable"
      
      * tag 'x86-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/sgx: Resolves SECS reclaim vs. page fault for EAUG race
        x86/srso: Add SRSO mitigation for Hygon processors
        x86/kgdb: Fix a kerneldoc warning when build with W=1
      ec8c2981
    • Linus Torvalds's avatar
      Merge tag 'timers-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 373ceff2
      Linus Torvalds authored
      Pull timer fix from Ingo Molnar:
       "Fix a spurious kernel warning during CPU hotplug events that may
        trigger when timer/hrtimer softirqs are pending, which are otherwise
        hotplug-safe and don't merit a warning"
      
      * tag 'timers-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        timers: Tag (hr)timer softirq as hotplug safe
      373ceff2
    • Linus Torvalds's avatar
      Merge tag 'sched-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c5ecffe6
      Linus Torvalds authored
      Pull scheduler fix from Ingo Molnar:
       "Fix a RT tasks related lockup/live-lock during CPU offlining"
      
      * tag 'sched-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/rt: Fix live lock between select_fallback_rq() and RT push
      c5ecffe6
    • Linus Torvalds's avatar
      Merge tag 'perf-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3a38c57a
      Linus Torvalds authored
      Pull perf event fixes from Ingo Molnar:
       "Misc fixes: work around an AMD microcode bug on certain models, and
        fix kexec kernel PMI handlers on AMD systems that get loaded on older
        kernels that have an unexpected register state"
      
      * tag 'perf-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/amd: Do not WARN() on every IRQ
        perf/x86/amd/core: Fix overflow reset on hotplug
      3a38c57a
    • Masahiro Yamada's avatar
      kbuild: remove stale code for 'source' symlink in packaging scripts · 2d7d1bc1
      Masahiro Yamada authored
      Since commit d8131c29 ("kbuild: remove $(MODLIB)/source symlink"),
      modules_install does not create the 'source' symlink.
      
      Remove the stale code from builddeb and kernel.spec.
      Signed-off-by: default avatarMasahiro Yamada <masahiroy@kernel.org>
      2d7d1bc1
    • Uwe Kleine-König's avatar
      modpost: Don't let "driver"s reference .exit.* · f177cd0c
      Uwe Kleine-König authored
      Drivers must not reference functions marked with __exit as these likely
      are not available when the code is built-in.
      
      There are few creative offenders uncovered for example in ARCH=amd64
      allmodconfig builds. So only trigger the section mismatch warning for
      W=1 builds.
      
      The dual rule that drivers must not reference .init.* is implemented
      since commit 0db25245 ("modpost: don't allow *driver to reference
      .init.*") which however missed that .exit.* should be handled in the
      same way.
      
      Thanks to Masahiro Yamada and Arnd Bergmann who gave valuable hints to
      find this improvement.
      Signed-off-by: default avatarUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: default avatarMasahiro Yamada <masahiroy@kernel.org>
      f177cd0c
    • Masahiro Yamada's avatar
      vmlinux.lds.h: remove unused CPU_KEEP and CPU_DISCARD macros · 15e86643
      Masahiro Yamada authored
      Remove the left-over of commit e24f6628 ("modpost: remove all
      traces of cpuinit/cpuexit sections").
      Signed-off-by: default avatarMasahiro Yamada <masahiroy@kernel.org>
      Acked-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      15e86643
    • Mauricio Faria de Oliveira's avatar
      modpost: add missing else to the "of" check · cbc3d00c
      Mauricio Faria de Oliveira authored
      Without this 'else' statement, an "usb" name goes into two handlers:
      the first/previous 'if' statement _AND_ the for-loop over 'devtable',
      but the latter is useless as it has no 'usb' device_id entry anyway.
      
      Tested with allmodconfig before/after patch; no changes to *.mod.c:
      
          git checkout v6.6-rc3
          make -j$(nproc) allmodconfig
          make -j$(nproc) olddefconfig
      
          make -j$(nproc)
          find . -name '*.mod.c' | cpio -pd /tmp/before
      
          # apply patch
      
          make -j$(nproc)
          find . -name '*.mod.c' | cpio -pd /tmp/after
      
          diff -r /tmp/before/ /tmp/after/
          # no difference
      
      Fixes: acbef7b7 ("modpost: fix module autoloading for OF devices with generic compatible property")
      Signed-off-by: default avatarMauricio Faria de Oliveira <mfo@canonical.com>
      Signed-off-by: default avatarMasahiro Yamada <masahiroy@kernel.org>
      cbc3d00c
    • Linus Torvalds's avatar
      Merge tag 'soc-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · e402b086
      Linus Torvalds authored
      Pull ARM SoC fixes from Arnd Bergmann:
       "These are the latest bug fixes that have come up in the soc tree. Most
        of these are fairly minor. Most notably, the majority of changes this
        time are not for dts files as usual.
      
         - Updates to the addresses of the broadcom and aspeed entries in the
           MAINTAINERS file.
      
         - Defconfig updates to address a regression on samsung and a build
           warning from an unknown Kconfig symbol
      
         - Build fixes for the StrongARM and Uniphier platforms
      
         - Code fixes for SCMI and FF-A firmware drivers, both of which had a
           simple bug that resulted in invalid data, and a lesser fix for the
           optee firmware driver
      
         - Multiple fixes for the recently added loongson/loongarch "guts" soc
           driver
      
         - Devicetree fixes for RISC-V on the startfive platform, addressing
           issues with NOR flash, usb and uart.
      
         - Multiple fixes for NXP i.MX8/i.MX9 dts files, fixing problems with
           clock, gpio, hdmi settings and the Makefile
      
         - Bug fixes for i.MX firmware code and the OCOTP soc driver
      
         - Multiple fixes for the TI sysc bus driver
      
         - Minor dts updates for TI omap dts files, to address boot time
           warnings and errors"
      
      * tag 'soc-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (35 commits)
        MAINTAINERS: Fix Florian Fainelli's email address
        arm64: defconfig: enable syscon-poweroff driver
        ARM: locomo: fix locomolcd_power declaration
        soc: loongson: loongson2_guts: Remove unneeded semicolon
        soc: loongson: loongson2_guts: Convert to devm_platform_ioremap_resource()
        soc: loongson: loongson_pm2: Populate children syscon nodes
        dt-bindings: soc: loongson,ls2k-pmc: Allow syscon-reboot/syscon-poweroff as child
        soc: loongson: loongson_pm2: Drop useless of_device_id compatible
        dt-bindings: soc: loongson,ls2k-pmc: Use fallbacks for ls2k-pmc compatible
        soc: loongson: loongson_pm2: Add dependency for INPUT
        arm64: defconfig: remove CONFIG_COMMON_CLK_NPCM8XX=y
        ARM: uniphier: fix cache kernel-doc warnings
        MAINTAINERS: aspeed: Update Andrew's email address
        MAINTAINERS: aspeed: Update git tree URL
        firmware: arm_ffa: Don't set the memory region attributes for MEM_LEND
        arm64: dts: imx: Add imx8mm-prt8mm.dtb to build
        arm64: dts: imx8mm-evk: Fix hdmi@3d node
        soc: imx8m: Enable OCOTP clock for imx8mm before reading registers
        arm64: dts: imx8mp-beacon-kit: Fix audio_pll2 clock
        arm64: dts: imx8mp: Fix SDMA2/3 clocks
        ...
      e402b086
    • Linus Torvalds's avatar
      Merge tag 'trace-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 3b347e40
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Make sure 32-bit applications using user events have aligned access
         when running on a 64-bit kernel.
      
       - Add cond_resched in the loop that handles converting enums in
         print_fmt string is trace events.
      
       - Fix premature wake ups of polling processes in the tracing ring
         buffer. When a task polls waiting for a percentage of the ring buffer
         to be filled, the writer still will wake it up at every event. Add
         the polling's percentage to the "shortest_full" list to tell the
         writer when to wake it up.
      
       - For eventfs dir lookups on dynamic events, an event system's only
         event could be removed, leaving its dentry with no children. This is
         totally legitimate. But in eventfs_release() it must not access the
         children array, as it is only allocated when the dentry has children.
      
      * tag 'trace-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        eventfs: Test for dentries array allocated in eventfs_release()
        tracing/user_events: Align set_bit() address for all archs
        tracing: relax trace_event_eval_update() execution with cond_resched()
        ring-buffer: Update "shortest_full" in polling
      3b347e40
  3. 30 Sep, 2023 21 commits
    • Steven Rostedt (Google)'s avatar
      eventfs: Test for dentries array allocated in eventfs_release() · 2598bd3c
      Steven Rostedt (Google) authored
      The dcache_dir_open_wrapper() could be called when a dynamic event is
      being deleted leaving a dentry with no children. In this case the
      dlist->dentries array will never be allocated. This needs to be checked
      for in eventfs_release(), otherwise it will trigger a NULL pointer
      dereference.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230930090106.1c3164e9@rorschach.local.home
      
      Cc: Mark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarMasami Hiramatsu (Google) <mhiramat@kernel.org>
      Fixes: ef36b4f9 ("eventfs: Remember what dentries were created on dir open")
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      2598bd3c
    • Beau Belgrave's avatar
      tracing/user_events: Align set_bit() address for all archs · 2de9ee94
      Beau Belgrave authored
      All architectures should use a long aligned address passed to set_bit().
      User processes can pass either a 32-bit or 64-bit sized value to be
      updated when tracing is enabled when on a 64-bit kernel. Both cases are
      ensured to be naturally aligned, however, that is not enough. The
      address must be long aligned without affecting checks on the value
      within the user process which require different adjustments for the bit
      for little and big endian CPUs.
      
      Add a compat flag to user_event_enabler that indicates when a 32-bit
      value is being used on a 64-bit kernel. Long align addresses and correct
      the bit to be used by set_bit() to account for this alignment. Ensure
      compat flags are copied during forks and used during deletion clears.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux.microsoft.com
      Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@rivosinc.com/
      
      Cc: stable@vger.kernel.org
      Fixes: 72357590 ("tracing/user_events: Use remote writes for event enablement")
      Reported-by: default avatarClément Léger <cleger@rivosinc.com>
      Suggested-by: default avatarClément Léger <cleger@rivosinc.com>
      Signed-off-by: default avatarBeau Belgrave <beaub@linux.microsoft.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      2de9ee94
    • Clément Léger's avatar
      tracing: relax trace_event_eval_update() execution with cond_resched() · 23cce5f2
      Clément Léger authored
      When kernel is compiled without preemption, the eval_map_work_func()
      (which calls trace_event_eval_update()) will not be preempted up to its
      complete execution. This can actually cause a problem since if another
      CPU call stop_machine(), the call will have to wait for the
      eval_map_work_func() function to finish executing in the workqueue
      before being able to be scheduled. This problem was observe on a SMP
      system at boot time, when the CPU calling the initcalls executed
      clocksource_done_booting() which in the end calls stop_machine(). We
      observed a 1 second delay because one CPU was executing
      eval_map_work_func() and was not preempted by the stop_machine() task.
      
      Adding a call to cond_resched() in trace_event_eval_update() allows
      other tasks to be executed and thus continue working asynchronously
      like before without blocking any pending task at boot time.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230929191637.416931-1-cleger@rivosinc.com
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarClément Léger <cleger@rivosinc.com>
      Tested-by: default avatarAtish Patra <atishp@rivosinc.com>
      Reviewed-by: default avatarAtish Patra <atishp@rivosinc.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      23cce5f2
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Update "shortest_full" in polling · 1e0cb399
      Steven Rostedt (Google) authored
      It was discovered that the ring buffer polling was incorrectly stating
      that read would not block, but that's because polling did not take into
      account that reads will block if the "buffer-percent" was set. Instead,
      the ring buffer polling would say reads would not block if there was any
      data in the ring buffer. This was incorrect behavior from a user space
      point of view. This was fixed by commit 42fb0a1e by having the polling
      code check if the ring buffer had more data than what the user specified
      "buffer percent" had.
      
      The problem now is that the polling code did not register itself to the
      writer that it wanted to wait for a specific "full" value of the ring
      buffer. The result was that the writer would wake the polling waiter
      whenever there was a new event. The polling waiter would then wake up, see
      that there's not enough data in the ring buffer to notify user space and
      then go back to sleep. The next event would wake it up again.
      
      Before the polling fix was added, the code would wake up around 100 times
      for a hackbench 30 benchmark. After the "fix", due to the constant waking
      of the writer, it would wake up over 11,0000 times! It would never leave
      the kernel, so the user space behavior was still "correct", but this
      definitely is not the desired effect.
      
      To fix this, have the polling code add what it's waiting for to the
      "shortest_full" variable, to tell the writer not to wake it up if the
      buffer is not as full as it expects to be.
      
      Note, after this fix, it appears that the waiter is now woken up around 2x
      the times it was before (~200). This is a tremendous improvement from the
      11,000 times, but I will need to spend some time to see why polling is
      more aggressive in its wakeups than the read blocking code.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschach.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Fixes: 42fb0a1e ("tracing/ring-buffer: Have polling block on watermark")
      Reported-by: default avatarJulia Lawall <julia.lawall@inria.fr>
      Tested-by: default avatarJulia Lawall <julia.lawall@inria.fr>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      1e0cb399
    • Linus Torvalds's avatar
      Merge tag 'dma-mapping-6.6-2023-09-30' of git://git.infradead.org/users/hch/dma-mapping · 3b517966
      Linus Torvalds authored
      Pull dma-mapping fixes from Christoph Hellwig:
      
       - fix the narea calculation in swiotlb initialization (Ross Lagerwall)
      
       - fix the check whether a device has used swiotlb (Petr Tesarik)
      
      * tag 'dma-mapping-6.6-2023-09-30' of git://git.infradead.org/users/hch/dma-mapping:
        swiotlb: fix the check whether a device has used software IO TLB
        swiotlb: use the calculated number of areas
      3b517966
    • Linus Torvalds's avatar
      Merge tag 'iomap-6.6-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 25d48d57
      Linus Torvalds authored
      Pull iomap fixes from Darrick Wong:
      
       - Handle a race between writing and shrinking block devices by
         returning EIO
      
       - Fix a typo in a comment
      
      * tag 'iomap-6.6-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        iomap: Spelling s/preceeding/preceding/g
        iomap: add a workaround for racy i_size updates on block devices
      25d48d57
    • Linus Torvalds's avatar
      Merge tag 'i2c-for-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · cefc06e4
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "Usual business: a driver fix, a DT fix, a minor core fix"
      
      * tag 'i2c-for-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: npcm7xx: Fix callback completion ordering
        i2c: mux: Avoid potential false error message in i2c_mux_add_adapter
        dt-bindings: i2c: mxs: Pass ref and 'unevaluatedProperties: false'
      cefc06e4
    • Linus Torvalds's avatar
      Merge tag 'acpi-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 830380e3
      Linus Torvalds authored
      Pull ACPI fix from Rafael Wysocki:
       "Fix a possible NULL pointer dereference in the error path of
        acpi_video_bus_add() resulting from recent changes (Dinghao Liu)"
      
      * tag 'acpi-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI: video: Fix NULL pointer dereference in acpi_video_bus_add()
      830380e3
    • Linus Torvalds's avatar
      Merge tag 'powerpc-6.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 1c9d8312
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - Fix arch_stack_walk_reliable(), used by live patching
      
       - Fix powerpc selftests to work with run_kselftest.sh
      
      Thanks to Joe Lawrence and Petr Mladek.
      
      * tag 'powerpc-6.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        selftests/powerpc: Fix emit_tests to work with run_kselftest.sh
        powerpc/stacktrace: Fix arch_stack_walk_reliable()
      1c9d8312
    • Linus Torvalds's avatar
      Merge tag 'nfsd-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux · ae213639
      Linus Torvalds authored
      Pull nfsd fix from Chuck Lever:
      
       - Fix NFSv4 READ corner case
      
      * tag 'nfsd-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
        NFSD: Fix zero NFSv4 READ results when RQ_SPLICE_OK is not set
      ae213639
    • Linus Torvalds's avatar
      Merge tag '6.6-rc3-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6 · ba77f7a6
      Linus Torvalds authored
      Pull smb client fix from Steve French:
       "Fix for password freeing potential oops (also for stable)"
      
      * tag '6.6-rc3-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
        fs/smb/client: Reset password pointer to NULL
      ba77f7a6
    • Baoquan He's avatar
      Crash: add lock to serialize crash hotplug handling · e2a8f20d
      Baoquan He authored
      Eric reported that handling corresponding crash hotplug event can be
      failed easily when many memory hotplug event are notified in a short
      period.  They failed because failing to take __kexec_lock.
      
      =======
      [   78.714569] Fallback order for Node 0: 0
      [   78.714575] Built 1 zonelists, mobility grouping on.  Total pages: 1817886
      [   78.717133] Policy zone: Normal
      [   78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
      [   78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
      [   80.056643] PEFILE: Unsigned PE binary
      =======
      
      The memory hotplug events are notified very quickly and very many, while
      the handling of crash hotplug is much slower relatively.  So the atomic
      variable __kexec_lock and kexec_trylock() can't guarantee the
      serialization of crash hotplug handling.
      
      Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug
      handling specifically.  This doesn't impact the usage of __kexec_lock.
      
      Link: https://lkml.kernel.org/r/20230926120905.392903-1-bhe@redhat.com
      Fixes: 24726275 ("crash: add generic infrastructure for crash hotplug support")
      Signed-off-by: default avatarBaoquan He <bhe@redhat.com>
      Tested-by: default avatarEric DeVolder <eric.devolder@oracle.com>
      Reviewed-by: default avatarEric DeVolder <eric.devolder@oracle.com>
      Reviewed-by: default avatarValentin Schneider <vschneid@redhat.com>
      Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e2a8f20d
    • Juntong Deng's avatar
      selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and... · bbe246f8
      Juntong Deng authored
      selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error
      
      According to the awk manual, the -e option does not need to be specified
      in front of 'program' (unless you need to mix program-file).
      
      The redundant -e option can cause error when users use awk tools other
      than gawk (for example, mawk does not support the -e option).
      
      Error Example:
      awk: not an option: -e
      
      Link: https://lkml.kernel.org/r/VI1P193MB075228810591AF2FDD7D42C599C3A@VI1P193MB0752.EURP193.PROD.OUTLOOK.COMSigned-off-by: default avatarJuntong Deng <juntong.deng@outlook.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      bbe246f8
    • Yang Shi's avatar
      mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified · 24526268
      Yang Shi authored
      When calling mbind() with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT, kernel
      should attempt to migrate all existing pages, and return -EIO if there is
      misplaced or unmovable page.  Then commit 6f4576e3 ("mempolicy: apply
      page table walker on queue_pages_range()") messed up the return value and
      didn't break VMA scan early ianymore when MPOL_MF_STRICT alone.  The
      return value problem was fixed by commit a7f40cfe ("mm: mempolicy:
      make mbind() return -EIO when MPOL_MF_STRICT is specified"), but it broke
      the VMA walk early if unmovable page is met, it may cause some pages are
      not migrated as expected.
      
      The code should conceptually do:
      
       if (MPOL_MF_MOVE|MOVEALL)
           scan all vmas
           try to migrate the existing pages
           return success
       else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
           scan all vmas
           try to migrate the existing pages
           return -EIO if unmovable or migration failed
       else /* MPOL_MF_STRICT alone */
           break early if meets unmovable and don't call mbind_range() at all
       else /* none of those flags */
           check the ranges in test_walk, EFAULT without mbind_range() if discontig.
      
      Fixed the behavior.
      
      Link: https://lkml.kernel.org/r/20230920223242.3425775-1-yang@os.amperecomputing.com
      Fixes: a7f40cfe ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
      Signed-off-by: default avatarYang Shi <yang@os.amperecomputing.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[4.9+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      24526268
    • Jinjie Ruan's avatar
      mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions() · 45120b15
      Jinjie Ruan authored
      When CONFIG_DAMON_VADDR_KUNIT_TEST=y and making CONFIG_DEBUG_KMEMLEAK=y
      and CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y, the below memory leak is detected.
      
      Since commit 9f86d624 ("mm/damon/vaddr-test: remove unnecessary
      variables"), the damon_destroy_ctx() is removed, but still call
      damon_new_target() and damon_new_region(), the damon_region which is
      allocated by kmem_cache_alloc() in damon_new_region() and the damon_target
      which is allocated by kmalloc in damon_new_target() are not freed.  And
      the damon_region which is allocated in damon_new_region() in
      damon_set_regions() is also not freed.
      
      So use damon_destroy_target to free all the damon_regions and damon_target.
      
          unreferenced object 0xffff888107c9a940 (size 64):
            comm "kunit_try_catch", pid 1069, jiffies 4294670592 (age 732.761s)
            hex dump (first 32 bytes):
              00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b  ............kkkk
              60 c7 9c 07 81 88 ff ff f8 cb 9c 07 81 88 ff ff  `...............
            backtrace:
              [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
              [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
              [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
              [<ffffffff819c82be>] damon_test_apply_three_regions1+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff8881079cc740 (size 56):
            comm "kunit_try_catch", pid 1069, jiffies 4294670592 (age 732.761s)
            hex dump (first 32 bytes):
              05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00  ................
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0
              [<ffffffff819c82be>] damon_test_apply_three_regions1+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff888107c9ac40 (size 64):
            comm "kunit_try_catch", pid 1071, jiffies 4294670595 (age 732.843s)
            hex dump (first 32 bytes):
              00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b  ............kkkk
              a0 cc 9c 07 81 88 ff ff 78 a1 76 07 81 88 ff ff  ........x.v.....
            backtrace:
              [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
              [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
              [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
              [<ffffffff819c851e>] damon_test_apply_three_regions2+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff8881079ccc80 (size 56):
            comm "kunit_try_catch", pid 1071, jiffies 4294670595 (age 732.843s)
            hex dump (first 32 bytes):
              05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00  ................
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0
              [<ffffffff819c851e>] damon_test_apply_three_regions2+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff888107c9af40 (size 64):
            comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.011s)
            hex dump (first 32 bytes):
              00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b  ............kkkk
              20 a2 76 07 81 88 ff ff b8 a6 76 07 81 88 ff ff   .v.......v.....
            backtrace:
              [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
              [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
              [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
              [<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff88810776a200 (size 56):
            comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.011s)
            hex dump (first 32 bytes):
              05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00  ................
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0
              [<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff88810776a740 (size 56):
            comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.025s)
            hex dump (first 32 bytes):
              3d 00 00 00 00 00 00 00 3f 00 00 00 00 00 00 00  =.......?.......
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819bfcc2>] damon_set_regions+0x4c2/0x8e0
              [<ffffffff819c7dbb>] damon_do_test_apply_three_regions.constprop.0+0xfb/0x3e0
              [<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff888108038240 (size 64):
            comm "kunit_try_catch", pid 1075, jiffies 4294670600 (age 733.022s)
            hex dump (first 32 bytes):
              00 00 00 00 00 00 00 00 03 00 00 00 6b 6b 6b 6b  ............kkkk
              48 ad 76 07 81 88 ff ff 98 ae 76 07 81 88 ff ff  H.v.......v.....
            backtrace:
              [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
              [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
              [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
              [<ffffffff819c898d>] damon_test_apply_three_regions4+0x1cd/0x210
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff88810776ad28 (size 56):
            comm "kunit_try_catch", pid 1075, jiffies 4294670600 (age 733.022s)
            hex dump (first 32 bytes):
              05 00 00 00 00 00 00 00 07 00 00 00 00 00 00 00  ................
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819bfcc2>] damon_set_regions+0x4c2/0x8e0
              [<ffffffff819c7dbb>] damon_do_test_apply_three_regions.constprop.0+0xfb/0x3e0
              [<ffffffff819c898d>] damon_test_apply_three_regions4+0x1cd/0x210
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
      
      Link: https://lkml.kernel.org/r/20230925072100.3725620-1-ruanjinjie@huawei.com
      Fixes: 9f86d624 ("mm/damon/vaddr-test: remove unnecessary variables")
      Signed-off-by: default avatarJinjie Ruan <ruanjinjie@huawei.com>
      Reviewed-by: default avatarSeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      45120b15
    • Michal Hocko's avatar
      mm, memcg: reconsider kmem.limit_in_bytes deprecation · 4597648f
      Michal Hocko authored
      This reverts commits 86327e8e ("memcg: drop kmem.limit_in_bytes") and
      partially reverts 58056f77 ("memcg, kmem: further deprecate
      kmem.limit_in_bytes") which have incrementally removed support for the
      kernel memory accounting hard limit.  Unfortunately it has turned out that
      there is still userspace depending on the existence of
      memory.kmem.limit_in_bytes [1].  The underlying functionality is not
      really required but the non-existent file just confuses the userspace
      which fails in the result.  The patch to fix this on the userspace side
      has been submitted but it is hard to predict how it will propagate through
      the maze of 3rd party consumers of the software.
      
      Now, reverting alone 86327e8e is not an option because there is
      another set of userspace which cannot cope with ENOTSUPP returned when
      writing to the file.  Therefore we have to go and revisit 58056f77 as
      well.  There are two ways to go ahead.  Either we give up on the
      deprecation and fully revert 58056f77 as well or we can keep
      kmem.limit_in_bytes but make the write a noop and warn about the fact. 
      This should work for both known breaking workloads which depend on the
      existence but do not depend on the hard limit enforcement.
      
      Note to backporters to stable trees.  a8c49af3 ("memcg: add per-memcg
      total kernel memory stat") introduced in 4.18 has added memcg_account_kmem
      so the accounting is not done by obj_cgroup_charge_pages directly for v1
      anymore.  Prior kernels need to add it explicitly (thanks to Johannes for
      pointing this out).
      
      [akpm@linux-foundation.org: fix build - remove unused local]
      Link: http://lkml.kernel.org/r/20230920081101.GA12096@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net [1]
      Link: https://lkml.kernel.org/r/ZRE5VJozPZt9bRPy@dhcp22.suse.cz
      Fixes: 86327e8e ("memcg: drop kmem.limit_in_bytes")
      Fixes: 58056f77 ("memcg, kmem: further deprecate kmem.limit_in_bytes")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Tejun heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4597648f
    • Domenico Cerasuolo's avatar
      mm: zswap: fix potential memory corruption on duplicate store · ca56489c
      Domenico Cerasuolo authored
      While stress-testing zswap a memory corruption was happening when writing
      back pages.  __frontswap_store used to check for duplicate entries before
      attempting to store a page in zswap, this was because if the store fails
      the old entry isn't removed from the tree.  This change removes duplicate
      entries in zswap_store before the actual attempt.
      
      [cerasuolodomenico@gmail.com: add a warning and a comment, per Johannes]
        Link: https://lkml.kernel.org/r/20230925130002.1929369-1-cerasuolodomenico@gmail.com
      Link: https://lkml.kernel.org/r/20230922172211.1704917-1-cerasuolodomenico@gmail.com
      Fixes: 42c06a0e ("mm: kill frontswap")
      Signed-off-by: default avatarDomenico Cerasuolo <cerasuolodomenico@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarNhat Pham <nphamcs@gmail.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      ca56489c
    • Ryan Roberts's avatar
      arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries · 6f1bace9
      Ryan Roberts authored
      When called with a swap entry that does not embed a PFN (e.g. 
      PTE_MARKER_POISONED or PTE_MARKER_UFFD_WP), the previous implementation of
      set_huge_pte_at() would either cause a BUG() to fire (if CONFIG_DEBUG_VM
      is enabled) or cause a dereference of an invalid address and subsequent
      panic.
      
      arm64's huge pte implementation supports multiple huge page sizes, some of
      which are implemented in the page table with multiple contiguous entries. 
      So set_huge_pte_at() needs to work out how big the logical pte is, so that
      it can also work out how many physical ptes (or pmds) need to be written. 
      It previously did this by grabbing the folio out of the pte and querying
      its size.
      
      However, there are cases when the pte being set is actually a swap entry. 
      But this also used to work fine, because for huge ptes, we only ever saw
      migration entries and hwpoison entries.  And both of these types of swap
      entries have a PFN embedded, so the code would grab that and everything
      still worked out.
      
      But over time, more calls to set_huge_pte_at() have been added that set
      swap entry types that do not embed a PFN.  And this causes the code to go
      bang.  The triggering case is for the uffd poison test, commit
      99aa7721 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
      causes a PTE_MARKER_POISONED swap entry to be set, coutesey of commit
      8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
      added in v6.5-rc7.  Although review shows that there are other call sites
      that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
      on arm64 because arm64 doesn't support UFFD WP.
      
      Arguably, the root cause is really due to commit 18f39629 ("mm:
      hugetlb: kill set_huge_swap_pte_at()"), which aimed to simplify the
      interface to the core code by removing set_huge_swap_pte_at() (which took
      a page size parameter) and replacing it with calls to set_huge_pte_at()
      where the size was inferred from the folio, as descibed above.  While that
      commit didn't break anything at the time, it did break the interface
      because it couldn't handle swap entries without PFNs.  And since then new
      callers have come along which rely on this working.  But given the
      brokeness is only observable after commit 8a13897f ("mm: userfaultfd:
      support UFFDIO_POISON for hugetlbfs"), that one gets the Fixes tag.
      
      Now that we have modified the set_huge_pte_at() interface to pass the huge
      page size in the previous patch, we can trivially fix this issue.
      
      Link: https://lkml.kernel.org/r/20230922115804.2043771-3-ryan.roberts@arm.com
      Fixes: 8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
      Signed-off-by: default avatarRyan Roberts <ryan.roberts@arm.com>
      Reviewed-by: default avatarAxel Rasmussen <axelrasmussen@google.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>	[6.5+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6f1bace9
    • Ryan Roberts's avatar
      mm: hugetlb: add huge page size param to set_huge_pte_at() · 935d4f0c
      Ryan Roberts authored
      Patch series "Fix set_huge_pte_at() panic on arm64", v2.
      
      This series fixes a bug in arm64's implementation of set_huge_pte_at(),
      which can result in an unprivileged user causing a kernel panic.  The
      problem was triggered when running the new uffd poison mm selftest for
      HUGETLB memory.  This test (and the uffd poison feature) was merged for
      v6.5-rc7.
      
      Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable
      (correctly this time) to get it backported to v6.5, where the issue first
      showed up.
      
      
      Description of Bug
      ==================
      
      arm64's huge pte implementation supports multiple huge page sizes, some of
      which are implemented in the page table with multiple contiguous entries. 
      So set_huge_pte_at() needs to work out how big the logical pte is, so that
      it can also work out how many physical ptes (or pmds) need to be written. 
      It previously did this by grabbing the folio out of the pte and querying
      its size.
      
      However, there are cases when the pte being set is actually a swap entry. 
      But this also used to work fine, because for huge ptes, we only ever saw
      migration entries and hwpoison entries.  And both of these types of swap
      entries have a PFN embedded, so the code would grab that and everything
      still worked out.
      
      But over time, more calls to set_huge_pte_at() have been added that set
      swap entry types that do not embed a PFN.  And this causes the code to go
      bang.  The triggering case is for the uffd poison test, commit
      99aa7721 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
      causes a PTE_MARKER_POISONED swap entry to be set, coutesey of commit
      8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
      added in v6.5-rc7.  Although review shows that there are other call sites
      that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
      on arm64 because arm64 doesn't support UFFD WP.
      
      If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise,
      it will dereference a bad pointer in page_folio():
      
          static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
          {
              VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
      
              return page_folio(pfn_to_page(swp_offset_pfn(entry)));
          }
      
      
      Fix
      ===
      
      The simplest fix would have been to revert the dodgy cleanup commit
      18f39629 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since
      things have moved on, this would have required an audit of all the new
      set_huge_pte_at() call sites to see if they should be converted to
      set_huge_swap_pte_at().  As per the original intent of the change, it
      would also leave us open to future bugs when people invariably get it
      wrong and call the wrong helper.
      
      So instead, I've added a huge page size parameter to set_huge_pte_at(). 
      This means that the arm64 code has the size in all cases.  It's a bigger
      change, due to needing to touch the arches that implement the function,
      but it is entirely mechanical, so in my view, low risk.
      
      I've compile-tested all touched arches; arm64, parisc, powerpc, riscv,
      s390, sparc (and additionally x86_64).  I've additionally booted and run
      mm selftests against arm64, where I observe the uffd poison test is fixed,
      and there are no other regressions.
      
      
      This patch (of 2):
      
      In order to fix a bug, arm64 needs to be told the size of the huge page
      for which the pte is being set in set_huge_pte_at().  Provide for this by
      adding an `unsigned long sz` parameter to the function.  This follows the
      same pattern as huge_pte_clear().
      
      This commit makes the required interface modifications to the core mm as
      well as all arches that implement this function (arm64, parisc, powerpc,
      riscv, s390, sparc).  The actual arm64 bug will be fixed in a separate
      commit.
      
      No behavioral changes intended.
      
      Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com
      Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com
      Fixes: 8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
      Signed-off-by: default avatarRyan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>	[powerpc 8xx]
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>	[vmalloc change]
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>	[6.5+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      935d4f0c
    • Liam R. Howlett's avatar
      maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states · a8091f03
      Liam R. Howlett authored
      When updating the maple tree iterator to avoid rewalks, an issue was
      introduced when shifting beyond the limits.  This can be seen by trying to
      go to the previous address of 0, which would set the maple node to
      MAS_NONE and keep the range as the last entry.
      
      Subsequent calls to mas_find() would then search upwards from mas->last
      and skip the value at mas->index/mas->last.  This showed up as a bug in
      mprotect which skips the actual VMA at the current range after attempting
      to go to the previous VMA from 0.
      
      Since MAS_NONE may already be set when searching for a value that isn't
      contained within a node, changing the handling of MAS_NONE in mas_find()
      would make the code more complicated and error prone.  Furthermore, there
      was no way to tell which limit was hit, and thus which action to take
      (next or the entry at the current range).
      
      This solution is to add two states to track what happened with the
      previous iterator action.  This allows for the expected behaviour of the
      next command to return the correct item (either the item at the range
      requested, or the next/previous).
      
      Tests are also added and updated accordingly.
      
      Link: https://lkml.kernel.org/r/20230921181236.509072-3-Liam.Howlett@oracle.com
      Link: https://gist.github.com/heatd/85d2971fae1501b55b6ea401fbbe485b
      Link: https://lore.kernel.org/linux-mm/20230921181236.509072-1-Liam.Howlett@oracle.com/
      Fixes: 39193685 ("maple_tree: try harder to keep active node with mas_prev()")
      Signed-off-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
      Reported-by: default avatarPedro Falcato <pedro.falcato@gmail.com>
      Closes: https://gist.github.com/heatd/85d2971fae1501b55b6ea401fbbe485b
      Closes: https://bugs.archlinux.org/task/79656
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a8091f03
    • Liam R. Howlett's avatar
      maple_tree: add mas_is_active() to detect in-tree walks · 5c590804
      Liam R. Howlett authored
      Patch series "maple_tree: Fix mas_prev() state regression".
      
      Pedro Falcato retported an mprotect regression [1] which was bisected back
      to the iterator changes for maple tree.  Root cause analysis showed the
      mas_prev() running off the end of the VMA space (previous from 0) followed
      by mas_find(), would skip the first value.
      
      This patchset introduces maple state underflow/overflow so the sequence of
      calls on the maple state will return what the user expects.
      
      Users who encounter this bug may see mprotect(), userfaultfd_register(),
      and mlock() fail on VMAs mapped with address 0.
      
      
      This patch (of 2):
      
      Instead of constantly checking each possibility of the maple state,
      create a fast path that will skip over checking unlikely states.
      
      Link: https://lkml.kernel.org/r/20230921181236.509072-1-Liam.Howlett@oracle.com
      Link: https://lkml.kernel.org/r/20230921181236.509072-2-Liam.Howlett@oracle.comSigned-off-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Pedro Falcato <pedro.falcato@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5c590804