1. 09 Aug, 2024 1 commit
    • module: make waiting for a concurrent module loader interruptible · 2124d84d
      Linus Torvalds authored
      The recursive aes-arm-bs module load situation reported by Russell King
      is getting fixed in the crypto layer, but this in the meantime fixes the
      "recursive load hangs forever" by just making the waiting for the first
      module load be interruptible.
      
      This should now match the old behavior before commit 9b9879fc
      ("modules: catch concurrent module loads, treat them as idempotent"),
      which used the different "wait for module to be ready" code in
      module_patient_check_exists().
      
      End result: a recursive module load will still block, but now a signal
      will interrupt it and fail the second module load, at which point the
      first module will successfully complete loading.
      
      Fixes: 9b9879fc ("modules: catch concurrent module loads, treat them as idempotent")
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 08 Aug, 2024 39 commits
    • Merge tag 'net-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · ee9a43b7
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bluetooth.
      
        Current release - regressions:
      
         - eth: bnxt_en: fix memory out-of-bounds in bnxt_fill_hw_rss_tbl() on
           older chips
      
        Current release - new code bugs:
      
         - ethtool: fix off-by-one error / kdoc contradicting the code for max
           RSS context IDs
      
         - Bluetooth: hci_qca:
            - QCA6390: fix support on non-DT platforms
            - QCA6390: don't call pwrseq_power_off() twice
            - fix a NULL-pointer dereference at shutdown
      
         - eth: ice: fix incorrect assigns of FEC counters
      
        Previous releases - regressions:
      
         - mptcp: fix handling endpoints with both 'signal' and 'subflow'
           flags set
      
         - virtio-net: fix changing ring count when vq IRQ coalescing not
           supported
      
         - eth: gve: fix use of netif_carrier_ok() during reconfig / reset
      
        Previous releases - always broken:
      
         - eth: idpf: fix bugs in queue re-allocation on reconfig / reset
      
         - ethtool: fix context creation with no parameters
      
        Misc:
      
         - linkwatch: use system_unbound_wq to ease RTNL contention"
      
      * tag 'net-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (41 commits)
        net: dsa: microchip: disable EEE for KSZ8567/KSZ9567/KSZ9896/KSZ9897.
        ethtool: Fix context creation with no parameters
        net: ethtool: fix off-by-one error in max RSS context IDs
        net: pse-pd: tps23881: include missing bitfield.h header
        net: fec: Stop PPS on driver remove
        net: bcmgenet: Properly overlay PHY and MAC Wake-on-LAN capabilities
        l2tp: fix lockdep splat
        net: stmmac: dwmac4: fix PCS duplex mode decode
        idpf: fix UAFs when destroying the queues
        idpf: fix memleak in vport interrupt configuration
        idpf: fix memory leaks and crashes while performing a soft reset
        bnxt_en : Fix memory out-of-bounds in bnxt_fill_hw_rss_tbl()
        net: dsa: bcm_sf2: Fix a possible memory leak in bcm_sf2_mdio_register()
        net/smc: add the max value of fallback reason count
        Bluetooth: hci_sync: avoid dup filtering when passive scanning with adv monitor
        Bluetooth: l2cap: always unlock channel in l2cap_conless_channel()
        Bluetooth: hci_qca: fix a NULL-pointer derefence at shutdown
        Bluetooth: hci_qca: fix QCA6390 support on non-DT platforms
        Bluetooth: hci_qca: don't call pwrseq_power_off() twice for QCA6390
        ice: Fix incorrect assigns of FEC counts
        ...
    • Merge tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 9466b6ae
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Have reading of event format files test if the metadata still exists.
      
         When an event is freed, a flag (EVENT_FILE_FL_FREED) is set in the
         metadata to prevent any new references to it while waiting for
         existing references to close. When the last reference closes, the
         metadata is freed. But the "format" file (along with some other
         files) was missing a check of this flag, which allowed new
         references to be taken and a use-after-free bug to occur.
      
       - Have the trace event meta data use the refcount infrastructure
         instead of relying on its own atomic counters.
      
       - Have tracefs inodes use alloc_inode_sb() for allocation instead of
         using kmem_cache_alloc() directly.
      
       - Have eventfs_create_dir() return an ERR_PTR instead of NULL as the
         callers expect a real object or an ERR_PTR.
      
       - Have release_ei() use call_srcu() and not call_rcu() as all the
         protection is on SRCU and not RCU.
      
       - Fix ftrace_graph_ret_addr() to use the task passed in and not
         current.
      
       - Fix overflow bug in get_free_elt() where the counter can overflow the
         integer and cause an infinite loop.
      
       - Remove unused function ring_buffer_nr_pages()
      
       - Have tracefs freeing use the inode RCU infrastructure instead of
         creating its own.
      
         When the kernel had randomize structure fields enabled, the rcu field
         of the tracefs_inode was overlapping the rcu field of the inode
         structure, and corrupting it. Instead, use the destroy_inode()
         callback to do the initial cleanup of the code, and then have
         free_inode() free it.
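
The eventfs_create_dir() item above hinges on the kernel's error-pointer convention: a caller that tests only for an error pointer will treat NULL as a valid object. What follows is a minimal userspace sketch of that convention, not kernel code. The helpers err_ptr(), ptr_err(), and is_err() are hypothetical stand-ins modeled on the kernel's ERR_PTR(), PTR_ERR(), and IS_ERR(), and the MAX_ERRNO value of 4095 is assumed here.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed size of the errno range encoded at the top of the address space. */
#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer (stand-in for ERR_PTR()). */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer (stand-in for PTR_ERR()). */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/*
 * True only for pointers in the last MAX_ERRNO bytes of the address
 * space (stand-in for IS_ERR()). Note that NULL is NOT an error here.
 */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```

With these helpers, is_err(NULL) is false, so a caller written as "if (is_err(p)) return ptr_err(p);" would go on to dereference a NULL result. That is why a function whose callers expect "a real object or an ERR_PTR" should not return NULL.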
      
      * tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        tracefs: Use generic inode RCU for synchronizing freeing
        ring-buffer: Remove unused function ring_buffer_nr_pages()
        tracing: Fix overflow in get_free_elt()
        function_graph: Fix the ret_stack used by ftrace_graph_ret_addr()
        eventfs: Use SRCU for freeing eventfs_inodes
        eventfs: Don't return NULL in eventfs_create_dir()
        tracefs: Fix inode allocation
        tracing: Use refcount for trace_event_file reference counter
        tracing: Have format file honor EVENT_FILE_FL_FREED
    • Merge tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs · b3f5620f
      Linus Torvalds authored
      Pull bcachefs fixes from Kent Overstreet:
       "Assorted little stuff:
      
         - lockdep fixup for lockdep_set_notrack_class()
      
         - we can now remove a device when using erasure coding without
           deadlocking, though we still hit other issues
      
         - the 'allocator stuck' timeout is now configurable, and messages are
           ratelimited. The default timeout has been increased from 10 seconds
           to 30"
      
      * tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs:
        bcachefs: Use bch2_wait_on_allocator() in btree node alloc path
        bcachefs: Make allocator stuck timeout configurable, ratelimit messages
        bcachefs: Add missing path_traverse() to btree_iter_next_node()
        bcachefs: ec should not allocate from ro devs
        bcachefs: Improved allocator debugging for ec
        bcachefs: Add missing bch2_trans_begin() call
        bcachefs: Add a comment for bucket helper types
        bcachefs: Don't rely on implicit unsigned -> signed integer conversion
        lockdep: Fix lockdep_set_notrack_class() for CONFIG_LOCK_STAT
        bcachefs: Fix double free of ca->buckets_nouse
    • module: warn about excessively long module waits · cb5b81bc
      Linus Torvalds authored
      Russell King reported that the arm cbc(aes) crypto module hangs when
      loaded, and Herbert Xu bisected it to commit 9b9879fc ("modules:
      catch concurrent module loads, treat them as idempotent"), and noted:
      
       "So what's happening here is that the first modprobe tries to load a
        fallback CBC implementation, in doing so it triggers a load of the
        exact same module due to module aliases.
      
        IOW we're loading aes-arm-bs which provides cbc(aes). However, this
        needs a fallback of cbc(aes) to operate, which is made out of the
        generic cbc module + any implementation of aes, or ecb(aes). The
        latter happens to also be provided by aes-arm-cb so that's why it
        tries to load the same module again"
      
      So loading the aes-arm-bs module ends up wanting to recursively load
      itself, and the recursive load then ends up waiting for the original
      module load to complete.
      
      This is a regression, in that it used to be that we just tried to load
      the module multiple times, and then as we went on to install it the
      second time we would instead just error out because the module name
      already existed.
      
      That is actually also exactly what the original "catch concurrent loads"
      patch did in commit 9828ed3f ("module: error out early on concurrent
      load of the same module file"), but it turns out that it ends up being
      racy, in that erroring out before the module has been fully initialized
      will cause failures in dependent module loading.
      
      See commit ac2263b5 (which was the revert of that "error out early"
      commit) for details about why erroring out before the module has been
      initialized is actually fundamentally racy.
      
      Now, for the actual recursive module load (as opposed to just
      concurrently loading the same module twice), the race is not an issue.
      
      At the same time it's hard for the kernel to see that this is recursion,
      because the module load is always done from a usermode helper, so the
      recursion is not some simple callchain within the kernel.
      
      End result: this is not the real fix, but this at least adds a warning
      for the situation (admittedly much too late for all the debugging pain
      that Russell and Herbert went through) and if we can come to a
      resolution on how to detect the recursion properly, this re-organizes
      the code to make that easier.
      
      Link: https://lore.kernel.org/all/ZrFHLqvFqhzykuYw@shell.armlinux.org.uk/
      Reported-by: Russell King <linux@armlinux.org.uk>
      Debugged-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'loongarch-fixes-6.11-1' of... · cf6d429e
      Linus Torvalds authored
      Merge tag 'loongarch-fixes-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
      
      Pull LoongArch fixes from Huacai Chen:
       "Enable general EFI poweroff method to make poweroff usable on
        hardware which lacks ACPI S5, use accessors to page table entries
        instead of direct dereference to avoid potential problems, and two
        trivial kvm cleanups"
      
      * tag 'loongarch-fixes-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
        LoongArch: KVM: Remove undefined a6 argument comment for kvm_hypercall()
        LoongArch: KVM: Remove unnecessary definition of KVM_PRIVATE_MEM_SLOTS
        LoongArch: Use accessors to page table entries instead of direct dereference
        LoongArch: Enable general EFI poweroff method
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 2ff4ceb0
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2024-08-07 (ice)
      
      This series contains updates to ice driver only.
      
      Grzegorz adds IRQ synchronization call before performing reset and
      prevents writing to hardware when it is resetting.
      
      Mateusz swaps incorrect assignment of FEC statistics.
      
      * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
        ice: Fix incorrect assigns of FEC counts
        ice: Skip PTP HW writes during PTP reset procedure
        ice: Fix reset handler
      ====================
      
      Link: https://patch.msgid.link/20240807224521.3819189-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: microchip: disable EEE for KSZ8567/KSZ9567/KSZ9896/KSZ9897. · 0411f73c
      Martin Whitaker authored
      As noted in the device errata [1-8], EEE support is not fully operational
      in the KSZ8567, KSZ9477, KSZ9567, KSZ9896, and KSZ9897 devices, causing
      link drops when connected to another device that supports EEE. The patch
      series "net: add EEE support for KSZ9477 switch family" merged in commit
      9b0bf4f7 caused EEE support to be enabled in these devices. A fix for
      this regression for the KSZ9477 alone was merged in commit 08c6d8ba.
      This patch extends this fix to the other affected devices.
      
      [1] https://ww1.microchip.com/downloads/aemDocuments/documents/UNG/ProductDocuments/Errata/KSZ8567R-Errata-DS80000752.pdf
      [2] https://ww1.microchip.com/downloads/aemDocuments/documents/UNG/ProductDocuments/Errata/KSZ8567S-Errata-DS80000753.pdf
      [3] https://ww1.microchip.com/downloads/aemDocuments/documents/UNG/ProductDocuments/Errata/KSZ9477S-Errata-DS80000754.pdf
      [4] https://ww1.microchip.com/downloads/aemDocuments/documents/UNG/ProductDocuments/Errata/KSZ9567R-Errata-DS80000755.pdf
      [5] https://ww1.microchip.com/downloads/aemDocuments/documents/UNG/ProductDocuments/Errata/KSZ9567S-Errata-DS80000756.pdf
      [6] https://ww1.microchip.com/downloads/aemDocuments/documents/UNG/ProductDocuments/Errata/KSZ9896C-Errata-DS80000757.pdf
      [7] https://ww1.microchip.com/downloads/aemDocuments/documents/UNG/ProductDocuments/Errata/KSZ9897R-Errata-DS80000758.pdf
      [8] https://ww1.microchip.com/downloads/aemDocuments/documents/UNG/ProductDocuments/Errata/KSZ9897S-Errata-DS80000759.pdf
      
      Fixes: 69d3b36c ("net: dsa: microchip: enable EEE support") # for KSZ8567/KSZ9567/KSZ9896/KSZ9897
      Link: https://lore.kernel.org/netdev/137ce1ee-0b68-4c96-a717-c8164b514eec@martin-whitaker.me.uk/
      Signed-off-by: Martin Whitaker <foss@martin-whitaker.me.uk>
      Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com>
      Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Lukasz Majewski <lukma@denx.de>
      Link: https://patch.msgid.link/20240807205209.21464-1-foss@martin-whitaker.me.uk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ethtool: Fix context creation with no parameters · 4d7c3c1a
      Gal Pressman authored
      The 'at least one change' requirement is not applicable for context
      creation, skip the check in such case.
      This allows a command such as 'ethtool -X eth0 context new' to work.
      
      The command works by mistake when using older versions of userspace
      ethtool due to an incompatibility issue where rxfh.input_xfrm is passed
      as zero (unset) instead of RXH_XFRM_NO_CHANGE as done with recent
      userspace. This patch does not try to solve the incompatibility issue.
      
      Link: https://lore.kernel.org/netdev/05ae8316-d3aa-4356-98c6-55ed4253c8a7@nvidia.com/
      Fixes: 84a1d9c4 ("net: ethtool: extend RXNFC API to support RSS spreading of filter matches")
      Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
      Signed-off-by: Gal Pressman <gal@nvidia.com>
      Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
      Link: https://patch.msgid.link/20240807173352.3501746-1-gal@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ethtool: fix off-by-one error in max RSS context IDs · b54de559
      Edward Cree authored
      Both ethtool_ops.rxfh_max_context_id and the default value used when
       it's not specified are supposed to be exclusive maxima (the former
       is documented as such; the latter, U32_MAX, cannot be used as an ID
       since it equals ETH_RXFH_CONTEXT_ALLOC), but xa_alloc() expects an
       inclusive maximum.
      Subtract one from 'limit' to produce an inclusive maximum, and pass
       that to xa_alloc().
      Increase bnxt's max by one to prevent a (very minor) regression, as
       BNXT_MAX_ETH_RSS_CTX is an inclusive max.  This is safe since bnxt
       is not actually hard-limited; BNXT_MAX_ETH_RSS_CTX is just a
       leftover from old driver code that managed context IDs itself.
      Rename rxfh_max_context_id to rxfh_max_num_contexts to make its
       semantics (hopefully) more obvious.
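
The exclusive-versus-inclusive conversion described above can be sketched in plain C. This is an illustration under stated assumptions, not the ethtool core code: inclusive_max_id() and CONTEXT_ALLOC_SENTINEL are hypothetical names, and treating a driver limit of 0 as "not specified" is an assumption made for the sketch.

```c
#include <stdint.h>

/* Stand-in for ETH_RXFH_CONTEXT_ALLOC, which the text says equals U32_MAX. */
#define CONTEXT_ALLOC_SENTINEL UINT32_MAX

/*
 * Convert an exclusive upper bound on context IDs into the inclusive
 * maximum that an allocator such as xa_alloc() expects.
 * Assumption: 0 means the driver did not specify a limit, in which
 * case the default exclusive bound UINT32_MAX is used.
 */
static uint32_t inclusive_max_id(uint32_t exclusive_limit)
{
	uint32_t limit = exclusive_limit ? exclusive_limit : UINT32_MAX;

	/* Exclusive bound N allows IDs up to and including N - 1. */
	return limit - 1;
}
```

With an exclusive limit of 8 the highest ID handed out is 7, and with no limit the highest ID is one below the sentinel, so the sentinel itself is never allocated as a context ID.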
      
      Fixes: 847a8ab1 ("net: ethtool: let the core choose RSS context IDs")
      Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
      Link: https://patch.msgid.link/5a2d11a599aa5b0cc6141072c01accfb7758650c.1723045898.git.ecree.xilinx@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: pse-pd: tps23881: include missing bitfield.h header · a70b637d
      Arnd Bergmann authored
      Using FIELD_GET() fails in configurations that don't already include
      the header file indirectly:
      
      drivers/net/pse-pd/tps23881.c: In function 'tps23881_i2c_probe':
      drivers/net/pse-pd/tps23881.c:755:13: error: implicit declaration of function 'FIELD_GET' [-Wimplicit-function-declaration]
        755 |         if (FIELD_GET(TPS23881_REG_DEVID_MASK, ret) != TPS23881_DEVICE_ID) {
            |             ^~~~~~~~~
      
      Fixes: 89108cb5 ("net: pse-pd: tps23881: Fix the device ID check")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Link: https://patch.msgid.link/20240807075455.2055224-1-arnd@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: fec: Stop PPS on driver remove · 8fee6d5a
      Csókás, Bence authored
      PPS was not stopped in `fec_ptp_stop()`, called when
      the adapter was removed. Consequently, you couldn't
      safely reload the driver with the PPS signal on.
      
      Fixes: 32cba57b ("net: fec: introduce fec_ptp_stop and use in probe fail path")
      Reviewed-by: Fabio Estevam <festevam@gmail.com>
      Link: https://lore.kernel.org/netdev/CAOMZO5BzcZR8PwKKwBssQq_wAGzVgf1ffwe_nhpQJjviTdxy-w@mail.gmail.com/T/#m01dcb810bfc451a492140f6797ca77443d0cb79f
      Signed-off-by: Csókás, Bence <csokas.bence@prolan.hu>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Frank Li <Frank.Li@nxp.com>
      Link: https://patch.msgid.link/20240807080956.2556602-1-csokas.bence@prolan.hu
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: bcmgenet: Properly overlay PHY and MAC Wake-on-LAN capabilities · 9ee09edc
      Florian Fainelli authored
      Some Wake-on-LAN modes such as WAKE_FILTER may only be supported by the MAC,
      while others might be only supported by the PHY. Make sure that the .get_wol()
      returns the union of both rather than only that of the PHY if the PHY supports
      Wake-on-LAN.
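
The difference between reporting only the PHY's modes and reporting the union can be sketched as below. This is a userspace illustration, not the bcmgenet driver: the function names are hypothetical, and the two WOL_* bit values are illustrative stand-ins for the ethtool WAKE_* flags.

```c
#include <stdint.h>

/* Illustrative capability bits (stand-ins for ethtool WAKE_* flags). */
#define WOL_MAGIC  (1u << 5)
#define WOL_FILTER (1u << 7)

/*
 * Behaviour described as broken: if the PHY reports any Wake-on-LAN
 * support, only the PHY's modes are returned and MAC-only modes are lost.
 */
static uint32_t wol_supported_phy_only(uint32_t phy_modes, uint32_t mac_modes)
{
	return phy_modes ? phy_modes : mac_modes;
}

/* Behaviour described as the fix: report the union of both sets. */
static uint32_t wol_supported_union(uint32_t phy_modes, uint32_t mac_modes)
{
	return phy_modes | mac_modes;
}
```

For a PHY that supports only WOL_MAGIC and a MAC that supports only WOL_FILTER, the first function hides WOL_FILTER from the user while the second reports both modes.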
      
      Fixes: 7e400ff3 ("net: bcmgenet: Add support for PHY-based Wake-on-LAN")
      Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Link: https://patch.msgid.link/20240806175659.3232204-1-florian.fainelli@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • l2tp: fix lockdep splat · 86a41ea9
      James Chapman authored
      When l2tp tunnels use a socket provided by userspace, we can hit
      lockdep splats like the below when data is transmitted through another
      (unrelated) userspace socket which then gets routed over l2tp.
      
      This issue was previously discussed here:
      https://lore.kernel.org/netdev/87sfialu2n.fsf@cloudflare.com/
      
      The solution is to have lockdep treat socket locks of l2tp tunnel
      sockets separately from those of standard INET sockets. To do so, use
      a different lockdep subclass where lock nesting is possible.
      
        ============================================
        WARNING: possible recursive locking detected
        6.10.0+ #34 Not tainted
        --------------------------------------------
        iperf3/771 is trying to acquire lock:
        ffff8881027601d8 (slock-AF_INET/1){+.-.}-{2:2}, at: l2tp_xmit_skb+0x243/0x9d0
      
        but task is already holding lock:
        ffff888102650d98 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1848/0x1e10
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(slock-AF_INET/1);
          lock(slock-AF_INET/1);
      
         *** DEADLOCK ***
      
         May be due to missing lock nesting notation
      
        10 locks held by iperf3/771:
         #0: ffff888102650258 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_sendmsg+0x1a/0x40
         #1: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x4b/0xbc0
         #2: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x17a/0x1130
         #3: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x28b/0x9f0
         #4: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0xf9/0x260
         #5: ffff888102650d98 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1848/0x1e10
         #6: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x4b/0xbc0
         #7: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x17a/0x1130
         #8: ffffffff822ac1e0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0xcc/0x1450
         #9: ffff888101f33258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock#2){+...}-{2:2}, at: __dev_queue_xmit+0x513/0x1450
      
        stack backtrace:
        CPU: 2 UID: 0 PID: 771 Comm: iperf3 Not tainted 6.10.0+ #34
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
        Call Trace:
         <IRQ>
         dump_stack_lvl+0x69/0xa0
         dump_stack+0xc/0x20
         __lock_acquire+0x135d/0x2600
         ? srso_alias_return_thunk+0x5/0xfbef5
         lock_acquire+0xc4/0x2a0
         ? l2tp_xmit_skb+0x243/0x9d0
         ? __skb_checksum+0xa3/0x540
         _raw_spin_lock_nested+0x35/0x50
         ? l2tp_xmit_skb+0x243/0x9d0
         l2tp_xmit_skb+0x243/0x9d0
         l2tp_eth_dev_xmit+0x3c/0xc0
         dev_hard_start_xmit+0x11e/0x420
         sch_direct_xmit+0xc3/0x640
         __dev_queue_xmit+0x61c/0x1450
         ? ip_finish_output2+0xf4c/0x1130
         ip_finish_output2+0x6b6/0x1130
         ? srso_alias_return_thunk+0x5/0xfbef5
         ? __ip_finish_output+0x217/0x380
         ? srso_alias_return_thunk+0x5/0xfbef5
         __ip_finish_output+0x217/0x380
         ip_output+0x99/0x120
         __ip_queue_xmit+0xae4/0xbc0
         ? srso_alias_return_thunk+0x5/0xfbef5
         ? srso_alias_return_thunk+0x5/0xfbef5
         ? tcp_options_write.constprop.0+0xcb/0x3e0
         ip_queue_xmit+0x34/0x40
         __tcp_transmit_skb+0x1625/0x1890
         __tcp_send_ack+0x1b8/0x340
         tcp_send_ack+0x23/0x30
         __tcp_ack_snd_check+0xa8/0x530
         ? srso_alias_return_thunk+0x5/0xfbef5
         tcp_rcv_established+0x412/0xd70
         tcp_v4_do_rcv+0x299/0x420
         tcp_v4_rcv+0x1991/0x1e10
         ip_protocol_deliver_rcu+0x50/0x220
         ip_local_deliver_finish+0x158/0x260
         ip_local_deliver+0xc8/0xe0
         ip_rcv+0xe5/0x1d0
         ? __pfx_ip_rcv+0x10/0x10
         __netif_receive_skb_one_core+0xce/0xe0
         ? process_backlog+0x28b/0x9f0
         __netif_receive_skb+0x34/0xd0
         ? process_backlog+0x28b/0x9f0
         process_backlog+0x2cb/0x9f0
         __napi_poll.constprop.0+0x61/0x280
         net_rx_action+0x332/0x670
         ? srso_alias_return_thunk+0x5/0xfbef5
         ? find_held_lock+0x2b/0x80
         ? srso_alias_return_thunk+0x5/0xfbef5
         ? srso_alias_return_thunk+0x5/0xfbef5
         handle_softirqs+0xda/0x480
         ? __dev_queue_xmit+0xa2c/0x1450
         do_softirq+0xa1/0xd0
         </IRQ>
         <TASK>
         __local_bh_enable_ip+0xc8/0xe0
         ? __dev_queue_xmit+0xa2c/0x1450
         __dev_queue_xmit+0xa48/0x1450
         ? ip_finish_output2+0xf4c/0x1130
         ip_finish_output2+0x6b6/0x1130
         ? srso_alias_return_thunk+0x5/0xfbef5
         ? __ip_finish_output+0x217/0x380
         ? srso_alias_return_thunk+0x5/0xfbef5
         __ip_finish_output+0x217/0x380
         ip_output+0x99/0x120
         __ip_queue_xmit+0xae4/0xbc0
         ? srso_alias_return_thunk+0x5/0xfbef5
         ? srso_alias_return_thunk+0x5/0xfbef5
         ? tcp_options_write.constprop.0+0xcb/0x3e0
         ip_queue_xmit+0x34/0x40
         __tcp_transmit_skb+0x1625/0x1890
         tcp_write_xmit+0x766/0x2fb0
         ? __entry_text_end+0x102ba9/0x102bad
         ? srso_alias_return_thunk+0x5/0xfbef5
         ? __might_fault+0x74/0xc0
         ? srso_alias_return_thunk+0x5/0xfbef5
         __tcp_push_pending_frames+0x56/0x190
         tcp_push+0x117/0x310
         tcp_sendmsg_locked+0x14c1/0x1740
         tcp_sendmsg+0x28/0x40
         inet_sendmsg+0x5d/0x90
         sock_write_iter+0x242/0x2b0
         vfs_write+0x68d/0x800
         ? __pfx_sock_write_iter+0x10/0x10
         ksys_write+0xc8/0xf0
         __x64_sys_write+0x3d/0x50
         x64_sys_call+0xfaf/0x1f50
         do_syscall_64+0x6d/0x140
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
        RIP: 0033:0x7f4d143af992
        Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 01 cc ff ff 41 54 b8 02 00 00 0
        RSP: 002b:00007ffd65032058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f4d143af992
        RDX: 0000000000000025 RSI: 00007f4d143f3bcc RDI: 0000000000000005
        RBP: 00007f4d143f2b28 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4d143f3bcc
        R13: 0000000000000005 R14: 0000000000000000 R15: 00007ffd650323f0
         </TASK>
      
      Fixes: 0b2c5972 ("l2tp: close all race conditions in l2tp_tunnel_register()")
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot+6acef9e0a4d1f46c83d4@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=6acef9e0a4d1f46c83d4
      CC: gnault@redhat.com
      CC: cong.wang@bytedance.com
      Signed-off-by: James Chapman <jchapman@katalix.com>
      Signed-off-by: Tom Parkin <tparkin@katalix.com>
      Link: https://patch.msgid.link/20240806160626.1248317-1-jchapman@katalix.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: stmmac: dwmac4: fix PCS duplex mode decode · 85ba108a
      Russell King (Oracle) authored
      dwmac4 was decoding the duplex mode from the GMAC_PHYIF_CONTROL_STATUS
      register incorrectly, using GMAC_PHYIF_CTRLSTATUS_LNKMOD_MASK (value 1)
      rather than GMAC_PHYIF_CTRLSTATUS_LNKMOD (bit 16). Fix this.
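
The effect of testing the wrong constant can be shown with a small userspace sketch. It is not the dwmac4 driver code: LNKMOD_MASK and LNKMOD_BIT are hypothetical names carrying the two values stated above (1 and bit 16), and both functions are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the description above; names are illustrative. */
#define LNKMOD_MASK 0x1u        /* unshifted field mask, value 1 */
#define LNKMOD_BIT  (1u << 16)  /* duplex bit within the status register */

/* Buggy decode: tests bit 0 of the register instead of the duplex bit. */
static bool duplex_decode_buggy(uint32_t status)
{
	return (status & LNKMOD_MASK) != 0;
}

/* Fixed decode: tests bit 16, where the duplex mode actually lives. */
static bool duplex_decode_fixed(uint32_t status)
{
	return (status & LNKMOD_BIT) != 0;
}
```

A status value with only bit 16 set decodes as full duplex in the fixed version and as half duplex in the buggy one, while a value with only bit 0 set gives the opposite results.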
      
      Fixes: 70523e63 ("drivers: net: stmmac: reworking the PCS code.")
      Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Link: https://patch.msgid.link/E1sbJvd-001rGD-E3@rmk-PC.armlinux.org.uk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'mm-hotfixes-stable-2024-08-07-18-32' of... · 660e4b18
      Linus Torvalds authored
      Merge tag 'mm-hotfixes-stable-2024-08-07-18-32' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
      
      Pull misc fixes from Andrew Morton:
       "Nine hotfixes. Five are cc:stable, the others either pertain to
        post-6.10 material or aren't considered necessary for earlier kernels.
      
        Five are MM and four are non-MM. No identifiable theme here - please
        see the individual changelogs"
      
      * tag 'mm-hotfixes-stable-2024-08-07-18-32' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
        padata: Fix possible divide-by-0 panic in padata_mt_helper()
        mailmap: update entry for David Heidelberg
        memcg: protect concurrent access to mem_cgroup_idr
        mm: shmem: fix incorrect aligned index when checking conflicts
        mm: shmem: avoid allocating huge pages larger than MAX_PAGECACHE_ORDER for shmem
        mm: list_lru: fix UAF for memory cgroup
        kcov: properly check for softirq context
        MAINTAINERS: Update LTP members and web
        selftests: mm: add s390 to ARCH check
    • Merge tag 'for-net-2024-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth · b928e7d1
      Jakub Kicinski authored
      Luiz Augusto von Dentz says:
      
      ====================
      bluetooth pull request for net:
      
       - hci_sync: avoid dup filtering when passive scanning with adv monitor
       - hci_qca: don't call pwrseq_power_off() twice for QCA6390
       - hci_qca: fix QCA6390 support on non-DT platforms
       - hci_qca: fix a NULL-pointer dereference at shutdown
       - l2cap: always unlock channel in l2cap_conless_channel()
      
      * tag 'for-net-2024-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
        Bluetooth: hci_sync: avoid dup filtering when passive scanning with adv monitor
        Bluetooth: l2cap: always unlock channel in l2cap_conless_channel()
        Bluetooth: hci_qca: fix a NULL-pointer derefence at shutdown
        Bluetooth: hci_qca: fix QCA6390 support on non-DT platforms
        Bluetooth: hci_qca: don't call pwrseq_power_off() twice for QCA6390
      ====================
      
      Link: https://patch.msgid.link/20240807210103.142483-1-luiz.dentz@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'idpf-fix-3-bugs-revealed-by-the-chapter-i' · bc59b558
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      idpf: fix 3 bugs revealed by the Chapter I
      
      Alexander Lobakin says:
      
      The libeth conversion revealed 2 serious issues which lead to sporadic
      crashes or WARNs under certain configurations. An additional one was
      found while debugging these two with kmemleak.
      This one is targeted at stable, the rest can be backported manually later
      if needed. They can be reproduced only after the conversion is applied
      anyway.
      ====================
      
      Link: https://patch.msgid.link/20240806220923.3359860-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      bc59b558
    • Alexander Lobakin's avatar
      idpf: fix UAFs when destroying the queues · 290f1c03
      Alexander Lobakin authored
      The second tagged commit started to occasionally (very rarely, but it
      is possible) throw WARNs from
      net/core/page_pool.c:page_pool_disable_direct_recycling().
      It turned out that idpf frees interrupt vectors with embedded NAPIs
      *before* freeing the queues, so page_pools' NAPI pointers lead to freed
      memory before these pools are destroyed by libeth.
      It's not clear whether there are other accesses to the freed vectors
      when destroying the queues, but anyway, we usually free queue/interrupt
      vectors only when the queues are destroyed and the NAPIs are guaranteed
      to not be referenced anywhere.
      
      Invert the allocation and freeing logic so that queue/interrupt vectors
      are allocated first and freed last. Vectors don't require queues to be
      present, so this is safe. Additionally, this change allows removing
      that useless queue->q_vector pointer cleanup, as vectors are still
      valid when freeing the queues (+ both are freed within one function,
      so it's not clear why the pointers were nullified at all).
      
      Fixes: 1c325aac ("idpf: configure resources for TX queues")
      Fixes: 90912f9f ("idpf: convert header split mode to libeth + napi_build_skb()")
      Reported-by: Michal Kubiak <michal.kubiak@intel.com>
      Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Link: https://patch.msgid.link/20240806220923.3359860-4-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      290f1c03
    • Michal Kubiak's avatar
      idpf: fix memleak in vport interrupt configuration · 3cc88e84
      Michal Kubiak authored
      The initialization of vport interrupt consists of two functions:
       1) idpf_vport_intr_init() where a generic configuration is done
       2) idpf_vport_intr_req_irq() where the irq for each q_vector is
         requested.
      
      The first function used to create a base name for each interrupt using
      a kasprintf() call. Unfortunately, although that call allocated memory
      for a text buffer, that memory was never released.
      
      Fix this by no longer creating the interrupt base name in 1).
      Instead, always create the full interrupt name in function 2). There
      is no need to create a base name separately, because function 2) is
      never called outside of the idpf_vport_intr_init() context.
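
The idea of building the whole name in one place, with a single owner that frees it, can be sketched in userspace C. This is only an illustration under assumptions: `make_irq_name()` and its format string are hypothetical, and malloc()/snprintf() stand in for the kernel's kasprintf().

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build the full interrupt name in one allocation.
 * The caller owns the returned buffer and must free() it, so there is
 * no separately allocated "base name" that could be forgotten.
 */
static char *make_irq_name(const char *drv, const char *ifname,
                           const char *type, int idx)
{
    /* First pass only measures the formatted length. */
    int len = snprintf(NULL, 0, "%s-%s-%s-%d", drv, ifname, type, idx);
    char *name;

    if (len < 0)
        return NULL;

    name = malloc((size_t)len + 1);
    if (!name)
        return NULL;

    /* Second pass writes the name, including the terminating NUL. */
    snprintf(name, (size_t)len + 1, "%s-%s-%s-%d", drv, ifname, type, idx);
    return name;
}
```

A caller would request one name per vector and release it with free() when the interrupt is torn down.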
      
      Fixes: d4d55871 ("idpf: initialize interrupts and enable vport")
      Cc: stable@vger.kernel.org # 6.7
      Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
      Reviewed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
      Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Link: https://patch.msgid.link/20240806220923.3359860-3-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3cc88e84
    • Alexander Lobakin's avatar
      idpf: fix memory leaks and crashes while performing a soft reset · f01032a2
      Alexander Lobakin authored
      The second tagged commit introduced a UAF, as it removed restoring
      q_vector->vport pointers after reinitializing the structures.
      This happens because all queue allocation functions are performed here
      with the new temporary vport structure and those functions rewrite
      the backpointers to the vport. Then, this new struct is freed and
      the pointers start leading to nowhere.
      
      But generally speaking, the current logic is very fragile. It claims
      to be more reliable when the system is low on memory, but in fact, it
      consumes twice as much memory, because at the moment this function
      runs, there are two vports allocated with their queues and vectors.
      Moreover, it claims to prevent the driver from running into "bad state",
      but in fact, any error during the rebuild leaves the old vport in the
      partially allocated state.
      Finally, if the interface is down when the function is called, it always
      allocates a new queue set, but when the user decides to enable the
      interface later on, vport_open() allocates them once again, IOW there's
      a clear memory leak here.
      
      Just don't allocate a new queue set when performing a reset; that solves
      the crashes and memory leaks. Re-add the old queue number and reopen the
      interface on rollback; that solves limbo states where the device is left
      disabled and/or without HW queues enabled.
      
      Fixes: 02cbfba1 ("idpf: add ethtool callbacks")
      Fixes: e4891e46 ("idpf: split &idpf_queue into 4 strictly-typed queue structures")
      Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Link: https://patch.msgid.link/20240806220923.3359860-2-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f01032a2
    • Michael Chan's avatar
      bnxt_en : Fix memory out-of-bounds in bnxt_fill_hw_rss_tbl() · da03f5d1
      Michael Chan authored
      A recent commit modified the code in __bnxt_reserve_rings() to reset
      the RSS indirection table to its default only when the number of RX
      rings is changing.  While this works for newer firmware that requires
      RX ring reservations, it causes a regression on older firmware that
      does not require RX ring reservations (BNXT_NEW_RM() returns false).
      
      With older firmware, RX ring reservations are not required and so
      hw_resc->resv_rx_rings is not always set to the proper value.  The
      comparison:
      
      if (old_rx_rings != bp->hw_resc.resv_rx_rings)
      
      in __bnxt_reserve_rings() may be false even when the RX rings are
      changing.  This will cause __bnxt_reserve_rings() to skip setting
      the default RSS indirection table to default to match the current
      number of RX rings.  This may later cause bnxt_fill_hw_rss_tbl() to
      use an out-of-range index.
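
A toy model of this failure mode, not the driver's code: an indirection table filled for one ring count holds stale ring indices once the ring count shrinks, unless it is refilled. `TBL_SIZE`, `fill_default_tbl()` and `tbl_in_range()` are hypothetical names, and the round-robin fill is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TBL_SIZE 8 /* hypothetical indirection table size */

/* Spread the table entries round-robin over the available RX rings. */
static void fill_default_tbl(unsigned short *tbl, unsigned int rx_rings)
{
    assert(rx_rings > 0); /* the modulo below needs at least one ring */

    for (size_t i = 0; i < TBL_SIZE; i++)
        tbl[i] = (unsigned short)(i % rx_rings);
}

/* Return 1 if every entry names an existing ring, 0 otherwise. */
static int tbl_in_range(const unsigned short *tbl, unsigned int rx_rings)
{
    for (size_t i = 0; i < TBL_SIZE; i++)
        if (tbl[i] >= rx_rings)
            return 0;
    return 1;
}
```

In this model, a table filled for 8 rings fails the range check for 4 rings until fill_default_tbl() is called again with the new count.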
      
      We already have bnxt_check_rss_tbl_no_rmgr() to handle exactly this
      scenario.  We just need to move it up in bnxt_need_reserve_rings()
      to be called unconditionally when using older firmware.  Without the
      fix, if the TX rings are changing, we'll skip the
      bnxt_check_rss_tbl_no_rmgr() call and __bnxt_reserve_rings() may also
      skip the bnxt_set_dflt_rss_indir_tbl() call for the reason explained
      in the previous paragraph.  Not resetting the RSS indirection table
      to its default causes the regression:
      
      BUG: KASAN: slab-out-of-bounds in __bnxt_hwrm_vnic_set_rss+0xb79/0xe40
      Read of size 2 at addr ffff8881c5809618 by task ethtool/31525
      Call Trace:
      __bnxt_hwrm_vnic_set_rss+0xb79/0xe40
       bnxt_hwrm_vnic_rss_cfg_p5+0xf7/0x460
       __bnxt_setup_vnic_p5+0x12e/0x270
       __bnxt_open_nic+0x2262/0x2f30
       bnxt_open_nic+0x5d/0xf0
       ethnl_set_channels+0x5d4/0xb30
       ethnl_default_set_doit+0x2f1/0x620
      Reported-by: Breno Leitao <leitao@debian.org>
      Closes: https://lore.kernel.org/netdev/ZrC6jpghA3PWVWSB@gmail.com/
      Fixes: 98ba1d93 ("bnxt_en: Fix RSS logic in __bnxt_reserve_rings()")
      Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
      Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Tested-by: Breno Leitao <leitao@debian.org>
      Link: https://patch.msgid.link/20240806053742.140304-1-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      da03f5d1
    • Joe Hattori's avatar
      net: dsa: bcm_sf2: Fix a possible memory leak in bcm_sf2_mdio_register() · e3862093
      Joe Hattori authored
      bcm_sf2_mdio_register() calls of_phy_find_device() and then
      phy_device_remove() in a loop to remove existing PHY devices.
      of_phy_find_device() eventually calls bus_find_device(), which calls
      get_device() on the returned struct device * to increment the refcount.
      The current implementation does not decrement the refcount, which causes
      a memory leak.
      
      This commit adds the missing phy_device_free() call to decrement the
      refcount via put_device() to balance the refcount.
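
The get/put balance described above can be modelled with a toy reference counter. This is a userspace sketch under assumptions: `struct toy_dev` and the `toy_*` helpers are hypothetical stand-ins and do not mirror the real PHY or driver-core APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device with a reference count and a registration flag. */
struct toy_dev {
    int refcount;
    int registered;
};

/* Lookup takes a reference on success, like a "find" that calls get. */
static struct toy_dev *toy_find(struct toy_dev *d)
{
    if (!d->registered)
        return NULL;
    d->refcount++;
    return d;
}

/* Unregister the device; this does not drop the lookup reference. */
static void toy_remove(struct toy_dev *d)
{
    d->registered = 0;
}

/* Drop the reference taken by toy_find(). */
static void toy_put(struct toy_dev *d)
{
    d->refcount--;
}

/* Remove every registered device and balance each successful lookup. */
static void remove_all(struct toy_dev *devs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct toy_dev *d = toy_find(&devs[i]);

        if (d) {
            toy_remove(d);
            toy_put(d); /* without this, the count never returns to baseline */
        }
    }
}
```

After remove_all(), every device's count is back at its starting value, which is the property the missing put broke.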
      
      Fixes: 771089c2 ("net: dsa: bcm_sf2: Ensure that MDIO diversion is used")
      Signed-off-by: Joe Hattori <joe@pf.is.s.u-tokyo.ac.jp>
      Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Link: https://patch.msgid.link/20240806011327.3817861-1-joe@pf.is.s.u-tokyo.ac.jp
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e3862093
    • Zhengchao Shao's avatar
      net/smc: add the max value of fallback reason count · d27a835f
      Zhengchao Shao authored
      The number of fallback reasons defined in the smc_clc.h file has reached
      36. For historical reasons, some are no longer referenced, and 33 are
      actually in use. So, set the max value of the fallback reason count to 36.
      
      Fixes: 6ac1e656 ("net/smc: support smc v2.x features validate")
      Fixes: 7f0620b9 ("net/smc: support max connections per lgr negotiation")
      Fixes: 69b888e3 ("net/smc: support max links per lgr negotiation in clc handshake")
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
      Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
      Link: https://patch.msgid.link/20240805043856.565677-1-shaozhengchao@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d27a835f
    • Waiman Long's avatar
      padata: Fix possible divide-by-0 panic in padata_mt_helper() · 6d45e1c9
      Waiman Long authored
      We are hit with a not easily reproducible divide-by-0 panic in padata.c at
      bootup time.
      
        [   10.017908] Oops: divide error: 0000 1 PREEMPT SMP NOPTI
        [   10.017908] CPU: 26 PID: 2627 Comm: kworker/u1666:1 Not tainted 6.10.0-15.el10.x86_64 #1
        [   10.017908] Hardware name: Lenovo ThinkSystem SR950 [7X12CTO1WW]/[7X12CTO1WW], BIOS [PSE140J-2.30] 07/20/2021
        [   10.017908] Workqueue: events_unbound padata_mt_helper
        [   10.017908] RIP: 0010:padata_mt_helper+0x39/0xb0
          :
        [   10.017963] Call Trace:
        [   10.017968]  <TASK>
        [   10.018004]  ? padata_mt_helper+0x39/0xb0
        [   10.018084]  process_one_work+0x174/0x330
        [   10.018093]  worker_thread+0x266/0x3a0
        [   10.018111]  kthread+0xcf/0x100
        [   10.018124]  ret_from_fork+0x31/0x50
        [   10.018138]  ret_from_fork_asm+0x1a/0x30
        [   10.018147]  </TASK>
      
      Looking at the padata_mt_helper() function, the only way a divide-by-0
      panic can happen is when ps->chunk_size is 0.  The way that chunk_size is
      initialized in padata_do_multithreaded(), chunk_size can be 0 when the
      min_chunk in the passed-in padata_mt_job structure is 0.
      
      Fix this divide-by-0 panic by making sure that chunk_size will be at least
      1 no matter what the input parameters are.
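
A minimal sketch of the clamping idea, assuming a simplified sizing rule: `pick_chunk_size()` and `count_chunks()` are hypothetical helpers, not the padata code, and the real computation may involve further factors.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: choose a per-worker chunk size that is never 0,
 * even when the job is smaller than the worker count and min_chunk is 0.
 */
static size_t pick_chunk_size(size_t size, size_t nworks, size_t min_chunk)
{
    size_t chunk = nworks ? size / nworks : 0;

    if (chunk < min_chunk)
        chunk = min_chunk;
    if (chunk == 0)
        chunk = 1; /* guard the divisions that use chunk as a divisor */
    return chunk;
}

/* Number of chunks needed to cover the job, rounding up. */
static size_t count_chunks(size_t size, size_t nworks, size_t min_chunk)
{
    size_t chunk = pick_chunk_size(size, nworks, min_chunk);

    return (size + chunk - 1) / chunk; /* safe: chunk >= 1 */
}
```

With the clamp in place, a job of size 3 spread over 8 workers with a minimum chunk of 0 gets a chunk size of 1 instead of 0.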
      
      Link: https://lkml.kernel.org/r/20240806174647.1050398-1-longman@redhat.com
      Fixes: 004ed426 ("padata: add basic support for multithreaded jobs")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6d45e1c9
    • David Heidelberg's avatar
      mailmap: update entry for David Heidelberg · f2087995
      David Heidelberg authored
      Link my old gmail address to my active email.
      
      Link: https://lkml.kernel.org/r/20240804054704.859503-1-david@ixit.cz
      Signed-off-by: David Heidelberg <david@ixit.cz>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f2087995
    • Shakeel Butt's avatar
      memcg: protect concurrent access to mem_cgroup_idr · 9972605a
      Shakeel Butt authored
      Commit 73f576c0 ("mm: memcontrol: fix cgroup creation failure after
      many small jobs") decoupled the memcg IDs from the CSS ID space to fix the
      cgroup creation failures.  It introduced IDR to maintain the memcg ID
      space.  The IDR depends on external synchronization mechanisms for
      modifications.  For the mem_cgroup_idr, the idr_alloc() and idr_replace()
      happen within css callback and thus are protected through cgroup_mutex
      from concurrent modifications.  However idr_remove() for mem_cgroup_idr
      was not protected against concurrency and can be run concurrently for
      different memcgs when they hit their refcnt to zero.  Fix that.
      
      We have been seeing list_lru based kernel crashes at a low frequency in
      our fleet for a long time.  These crashes were in different parts of
      list_lru code including list_lru_add(), list_lru_del() and reparenting
      code.  Upon further inspection, it looked like for a given object (dentry
      and inode), the super_block's list_lru didn't have list_lru_one for the
      memcg of that object.  The initial suspicions were either the object is
      not allocated through kmem_cache_alloc_lru() or somehow
      memcg_list_lru_alloc() failed to allocate list_lru_one() for a memcg but
      returned success.  No evidence was found for these cases.
      
      Looking more deeply, we started seeing situations where a valid memcg's id
      is not present in mem_cgroup_idr and in some cases multiple valid memcgs
      have the same id and mem_cgroup_idr is pointing to one of them.  So, the most
      reasonable explanation is that these situations can happen due to a race
      between multiple idr_remove() calls or a race between
      idr_alloc()/idr_replace() and idr_remove().  These races are causing
      multiple memcgs to acquire the same ID and then offlining of one of them
      would clean up list_lrus on the system for all of them.  Later access from
      other memcgs to the list_lru causes crashes due to the missing list_lru_one.
      
      Link: https://lkml.kernel.org/r/20240802235822.1830976-1-shakeel.butt@linux.dev
      Fixes: 73f576c0 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
      Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
      Acked-by: Muchun Song <muchun.song@linux.dev>
      Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9972605a
    • Baolin Wang's avatar
      mm: shmem: fix incorrect aligned index when checking conflicts · 4cbf320b
      Baolin Wang authored
      In the shmem_suitable_orders() function, xa_find() is used to check for
      conflicts in the pagecache to select suitable huge orders.  However, when
      checking each huge order in every loop, the aligned index is calculated
      from the previous iteration, which may cause suitable huge orders to be
      missed.
      
      We should use the original index each time in the loop to calculate a new
      aligned index for checking conflicts to avoid this issue.
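
The effect of reusing a stale aligned index can be shown with a toy model, where a 64-bit mask stands in for the page cache and a linear scan stands in for xa_find(). All names here are hypothetical; this is a sketch of the described logic, not the shmem code.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if any index in [start, start + nr) is marked occupied. */
static int range_has_conflict(uint64_t occupied, unsigned int start,
                              unsigned int nr)
{
    for (unsigned int i = start; i < start + nr && i < 64; i++)
        if (occupied & (UINT64_C(1) << i))
            return 1;
    return 0;
}

/* Intended behaviour: align the ORIGINAL index for every order tried. */
static int suitable_order(uint64_t occupied, unsigned int index,
                          int max_order)
{
    for (int order = max_order; order > 0; order--) {
        unsigned int nr = 1u << order;
        unsigned int aligned = index & ~(nr - 1);

        if (!range_has_conflict(occupied, aligned, nr))
            return order;
    }
    return 0;
}

/* Faulty variant: the index is overwritten, so smaller orders reuse it. */
static int suitable_order_stale(uint64_t occupied, unsigned int index,
                                int max_order)
{
    for (int order = max_order; order > 0; order--) {
        unsigned int nr = 1u << order;

        index &= ~(nr - 1); /* loses the original index */
        if (!range_has_conflict(occupied, index, nr))
            return order;
    }
    return 0;
}
```

With index 7 and only index 5 occupied, the order-1 range should be [6, 7], which is free. The stale variant checks [4, 5] instead and misses it.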
      
      Link: https://lkml.kernel.org/r/07433b0f16a152bffb8cee34934a5c040e8e2ad6.1722404078.git.baolin.wang@linux.alibaba.com
      Fixes: e7a2ab7b ("mm: shmem: add mTHP support for anonymous shmem")
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4cbf320b
    • Baolin Wang's avatar
      mm: shmem: avoid allocating huge pages larger than MAX_PAGECACHE_ORDER for shmem · b66b1b71
      Baolin Wang authored
      Similar to commit d659b715 ("mm/huge_memory: avoid PMD-size page
      cache if needed"), ARM64 can support 512MB PMD-sized THP when the base
      page size is 64KB, which is larger than the maximum supported page cache
      size MAX_PAGECACHE_ORDER.
      
      This is not expected.  To fix this issue, use THP_ORDERS_ALL_FILE_DEFAULT
      for shmem to filter allowable huge orders.
      
      [baolin.wang@linux.alibaba.com: remove comment, per Barry]
        Link: https://lkml.kernel.org/r/c55d7ef7-78aa-4ed6-b897-c3e03a3f3ab7@linux.alibaba.com
      [wangkefeng.wang@huawei.com: remove local `orders']
        Link: https://lkml.kernel.org/r/87769ae8-b6c6-4454-925d-1864364af9c8@huawei.com
      Link: https://lkml.kernel.org/r/117121665254442c3c7f585248296495e5e2b45c.1722404078.git.baolin.wang@linux.alibaba.com
      Fixes: e7a2ab7b ("mm: shmem: add mTHP support for anonymous shmem")
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Barry Song <baohua@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b66b1b71
    • Muchun Song's avatar
      mm: list_lru: fix UAF for memory cgroup · 5161b487
      Muchun Song authored
      The mem_cgroup_from_slab_obj() is supposed to be called under rcu lock or
      cgroup_mutex or others which could prevent returned memcg from being
      freed.  Fix it by adding missing rcu read lock.
      
      Found by code inspection.
      
      [songmuchun@bytedance.com: only grab rcu lock when necessary, per Vlastimil]
        Link: https://lkml.kernel.org/r/20240801024603.1865-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20240718083607.42068-1-songmuchun@bytedance.com
      Fixes: 0a97c01c ("list_lru: allow explicit memcg and NUMA node selection")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5161b487
    • Andrey Konovalov's avatar
      kcov: properly check for softirq context · 7d4df2da
      Andrey Konovalov authored
      When collecting coverage from softirqs, KCOV uses in_serving_softirq() to
      check whether the code is running in the softirq context.  Unfortunately,
      in_serving_softirq() is > 0 even when the code is running in the hardirq
      or NMI context for hardirqs and NMIs that happened during a softirq.
      
      As a result, if a softirq handler contains a remote coverage collection
      section and a hardirq with another remote coverage collection section
      happens during handling the softirq, KCOV incorrectly detects a nested
      softirq coverage collection section and prints a WARNING, as reported by
      syzbot.
      
      This issue was exposed by commit a7f3813e ("usb: gadget: dummy_hcd:
      Switch to hrtimer transfer scheduler"), which switched dummy_hcd to using
      hrtimer and made the timer's callback be executed in the hardirq context.
      
      Change the related checks in KCOV to account for this behavior of
      in_serving_softirq() and make KCOV ignore remote coverage collection
      sections in the hardirq and NMI contexts.
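
The stricter check can be sketched with a toy context bitmask. The `CTX_*` bits and both helpers are hypothetical and only model the behaviour described above; they are not the kernel's preempt_count layout or its helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical context bits for this model. */
#define CTX_SOFTIRQ (1u << 0) /* a softirq handler is being served */
#define CTX_HARDIRQ (1u << 1) /* a hardirq interrupted the current code */
#define CTX_NMI     (1u << 2) /* an NMI interrupted the current code */

/* Stays true while a hardirq or NMI runs on top of the softirq. */
static bool serving_softirq(unsigned int ctx)
{
    return (ctx & CTX_SOFTIRQ) != 0;
}

/* True only when the code itself runs in softirq context. */
static bool really_in_softirq(unsigned int ctx)
{
    return serving_softirq(ctx) && !(ctx & (CTX_HARDIRQ | CTX_NMI));
}
```

In this model, a hardirq arriving during a softirq still reports serving_softirq() as true, while really_in_softirq() rejects it, so the nested section is not mistaken for a second softirq section.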
      
      This prevents the WARNING printed by syzbot but does not fix the inability
      of KCOV to collect coverage from the __usb_hcd_giveback_urb when dummy_hcd
      is in use (caused by a7f3813e); a separate patch is required for that.
      
      Link: https://lkml.kernel.org/r/20240729022158.92059-1-andrey.konovalov@linux.dev
      Fixes: 5ff3b30a ("kcov: collect coverage from interrupts")
      Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com>
      Reported-by: syzbot+2388cdaeb6b10f0c13ac@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=2388cdaeb6b10f0c13ac
      Acked-by: Marco Elver <elver@google.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Aleksandr Nogikh <nogikh@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Marcello Sylvester Bauer <sylv@sylv.io>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7d4df2da
    • Petr Vorel's avatar
      MAINTAINERS: Update LTP members and web · 37bf7fbe
      Petr Vorel authored
      The LTP project now uses a readthedocs.org instance instead of the GitHub wiki.
      
      LTP maintainers are listed in alphabetical order.
      
      Link: https://lkml.kernel.org/r/20240726072009.1021599-1-pvorel@suse.cz
      Signed-off-by: Petr Vorel <pvorel@suse.cz>
      Reviewed-by: Li Wang <liwang@redhat.com>
      Reviewed-by: Cyril Hrubis <chrubis@suse.cz>
      Cc: Jan Stancek <jstancek@redhat.com>
      Cc: Xiao Yang <yangx.jy@fujitsu.com>
      Cc: Yang Xu <xuyang2018.jy@fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      37bf7fbe
    • Nico Pache's avatar
      selftests: mm: add s390 to ARCH check · 30b651c8
      Nico Pache authored
      Commit 0518dbe9 ("selftests/mm: fix cross compilation with LLVM")
      changed the env variable for the architecture from MACHINE to ARCH.
      
      This prevents 3 required TEST_GEN_FILES from being included when
      cross compiling for s390x and causes errors when trying to run the test
      suite.  This is due to the ARCH variable already being set and the arch
      folder name being s390.
      
      Add "s390" to the filtered list to cover this case and have the 3 files
      included in the build.
      
      Link: https://lkml.kernel.org/r/20240724213517.23918-1-npache@redhat.com
      Fixes: 0518dbe9 ("selftests/mm: fix cross compilation with LLVM")
      Signed-off-by: Nico Pache <npache@redhat.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      30b651c8
    • Kent Overstreet's avatar
      bcachefs: Use bch2_wait_on_allocator() in btree node alloc path · 73dc1656
      Kent Overstreet authored
      If the allocator gets stuck, we need to know why.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      73dc1656
    • Kent Overstreet's avatar
      bcachefs: Make allocator stuck timeout configurable, ratelimit messages · cecf7279
      Kent Overstreet authored
      Limit these messages to once every 2 minutes to avoid spamming logs;
      with multiple devices the output can be quite significant.
      
      Also, up the default timeout to 30 seconds from 10 seconds.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      cecf7279
    • Kent Overstreet's avatar
      bcachefs: Add missing path_traverse() to btree_iter_next_node() · 6d496e02
      Kent Overstreet authored
      This fixes a bug exposed by the next path - we pop an assert in
      path_set_should_be_locked().
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      6d496e02
    • Steven Rostedt's avatar
      tracefs: Use generic inode RCU for synchronizing freeing · 0b6743bd
      Steven Rostedt authored
      With structure layout randomization enabled for 'struct inode' we need to
      avoid overlapping any of the RCU-used / initialized-only-once members,
      e.g. i_lru or i_sb_list to not corrupt related list traversals when making
      use of the rcu_head.
      
      For an unlucky structure layout of 'struct inode' we may end up with the
      following splat when running the ftrace selftests:
      
      [<...>] list_del corruption, ffff888103ee2cb0->next (tracefs_inode_cache+0x0/0x4e0 [slab object]) is NULL (prev is tracefs_inode_cache+0x78/0x4e0 [slab object])
      [<...>] ------------[ cut here ]------------
      [<...>] kernel BUG at lib/list_debug.c:54!
      [<...>] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      [<...>] CPU: 3 PID: 2550 Comm: mount Tainted: G                 N  6.8.12-grsec+ #122 ed2f536ca62f28b087b90e3cc906a8d25b3ddc65
      [<...>] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
      [<...>] RIP: 0010:[<ffffffff84656018>] __list_del_entry_valid_or_report+0x138/0x3e0
      [<...>] Code: 48 b8 99 fb 65 f2 ff ff ff ff e9 03 5c d9 fc cc 48 b8 99 fb 65 f2 ff ff ff ff e9 33 5a d9 fc cc 48 b8 99 fb 65 f2 ff ff ff ff <0f> 0b 4c 89 e9 48 89 ea 48 89 ee 48 c7 c7 60 8f dd 89 31 c0 e8 2f
      [<...>] RSP: 0018:fffffe80416afaf0 EFLAGS: 00010283
      [<...>] RAX: 0000000000000098 RBX: ffff888103ee2cb0 RCX: 0000000000000000
      [<...>] RDX: ffffffff84655fe8 RSI: ffffffff89dd8b60 RDI: 0000000000000001
      [<...>] RBP: ffff888103ee2cb0 R08: 0000000000000001 R09: fffffbd0082d5f25
      [<...>] R10: fffffe80416af92f R11: 0000000000000001 R12: fdf99c16731d9b6d
      [<...>] R13: 0000000000000000 R14: ffff88819ad4b8b8 R15: 0000000000000000
      [<...>] RBX: tracefs_inode_cache+0x0/0x4e0 [slab object]
      [<...>] RDX: __list_del_entry_valid_or_report+0x108/0x3e0
      [<...>] RSI: __func__.47+0x4340/0x4400
      [<...>] RBP: tracefs_inode_cache+0x0/0x4e0 [slab object]
      [<...>] RSP: process kstack fffffe80416afaf0+0x7af0/0x8000 [mount 2550 2550]
      [<...>] R09: kasan shadow of process kstack fffffe80416af928+0x7928/0x8000 [mount 2550 2550]
      [<...>] R10: process kstack fffffe80416af92f+0x792f/0x8000 [mount 2550 2550]
      [<...>] R14: tracefs_inode_cache+0x78/0x4e0 [slab object]
      [<...>] FS:  00006dcb380c1840(0000) GS:ffff8881e0600000(0000) knlGS:0000000000000000
      [<...>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [<...>] CR2: 000076ab72b30e84 CR3: 000000000b088004 CR4: 0000000000360ef0 shadow CR4: 0000000000360ef0
      [<...>] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [<...>] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [<...>] ASID: 0003
      [<...>] Stack:
      [<...>]  ffffffff818a2315 00000000f5c856ee ffffffff896f1840 ffff888103ee2cb0
      [<...>]  ffff88812b6b9750 0000000079d714b6 fffffbfff1e9280b ffffffff8f49405f
      [<...>]  0000000000000001 0000000000000000 ffff888104457280 ffffffff8248b392
      [<...>] Call Trace:
      [<...>]  <TASK>
      [<...>]  [<ffffffff818a2315>] ? lock_release+0x175/0x380 fffffe80416afaf0
      [<...>]  [<ffffffff8248b392>] list_lru_del+0x152/0x740 fffffe80416afb48
      [<...>]  [<ffffffff8248ba93>] list_lru_del_obj+0x113/0x280 fffffe80416afb88
      [<...>]  [<ffffffff8940fd19>] ? _atomic_dec_and_lock+0x119/0x200 fffffe80416afb90
      [<...>]  [<ffffffff8295b244>] iput_final+0x1c4/0x9a0 fffffe80416afbb8
      [<...>]  [<ffffffff8293a52b>] dentry_unlink_inode+0x44b/0xaa0 fffffe80416afbf8
      [<...>]  [<ffffffff8293fefc>] __dentry_kill+0x23c/0xf00 fffffe80416afc40
      [<...>]  [<ffffffff8953a85f>] ? __this_cpu_preempt_check+0x1f/0xa0 fffffe80416afc48
      [<...>]  [<ffffffff82949ce5>] ? shrink_dentry_list+0x1c5/0x760 fffffe80416afc70
      [<...>]  [<ffffffff82949b71>] ? shrink_dentry_list+0x51/0x760 fffffe80416afc78
      [<...>]  [<ffffffff82949da8>] shrink_dentry_list+0x288/0x760 fffffe80416afc80
      [<...>]  [<ffffffff8294ae75>] shrink_dcache_sb+0x155/0x420 fffffe80416afcc8
      [<...>]  [<ffffffff8953a7c3>] ? debug_smp_processor_id+0x23/0xa0 fffffe80416afce0
      [<...>]  [<ffffffff8294ad20>] ? do_one_tree+0x140/0x140 fffffe80416afcf8
      [<...>]  [<ffffffff82997349>] ? do_remount+0x329/0xa00 fffffe80416afd18
      [<...>]  [<ffffffff83ebf7a1>] ? security_sb_remount+0x81/0x1c0 fffffe80416afd38
      [<...>]  [<ffffffff82892096>] reconfigure_super+0x856/0x14e0 fffffe80416afd70
      [<...>]  [<ffffffff815d1327>] ? ns_capable_common+0xe7/0x2a0 fffffe80416afd90
      [<...>]  [<ffffffff82997436>] do_remount+0x416/0xa00 fffffe80416afdd0
      [<...>]  [<ffffffff829b2ba4>] path_mount+0x5c4/0x900 fffffe80416afe28
      [<...>]  [<ffffffff829b25e0>] ? finish_automount+0x13a0/0x13a0 fffffe80416afe60
      [<...>]  [<ffffffff82903812>] ? user_path_at_empty+0xb2/0x140 fffffe80416afe88
      [<...>]  [<ffffffff829b2ff5>] do_mount+0x115/0x1c0 fffffe80416afeb8
      [<...>]  [<ffffffff829b2ee0>] ? path_mount+0x900/0x900 fffffe80416afed8
      [<...>]  [<ffffffff8272461c>] ? __kasan_check_write+0x1c/0xa0 fffffe80416afee0
      [<...>]  [<ffffffff829b31cf>] __do_sys_mount+0x12f/0x280 fffffe80416aff30
      [<...>]  [<ffffffff829b36cd>] __x64_sys_mount+0xcd/0x2e0 fffffe80416aff70
      [<...>]  [<ffffffff819f8818>] ? syscall_trace_enter+0x218/0x380 fffffe80416aff88
      [<...>]  [<ffffffff8111655e>] x64_sys_call+0x5d5e/0x6720 fffffe80416affa8
      [<...>]  [<ffffffff8952756d>] do_syscall_64+0xcd/0x3c0 fffffe80416affb8
      [<...>]  [<ffffffff8100119b>] entry_SYSCALL_64_safe_stack+0x4c/0x87 fffffe80416affe8
      [<...>]  </TASK>
      [<...>]  <PTREGS>
      [<...>] RIP: 0033:[<00006dcb382ff66a>] vm_area_struct[mount 2550 2550 file 6dcb38225000-6dcb3837e000 22 55(read|exec|mayread|mayexec)]+0x0/0xb8 [userland map]
      [<...>] Code: 48 8b 0d 29 18 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f6 17 0d 00 f7 d8 64 89 01 48
      [<...>] RSP: 002b:0000763d68192558 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
      [<...>] RAX: ffffffffffffffda RBX: 00006dcb38433264 RCX: 00006dcb382ff66a
      [<...>] RDX: 000017c3e0d11210 RSI: 000017c3e0d1a5a0 RDI: 000017c3e0d1ae70
      [<...>] RBP: 000017c3e0d10fb0 R08: 000017c3e0d11260 R09: 00006dcb383d1be0
      [<...>] R10: 000000000020002e R11: 0000000000000246 R12: 0000000000000000
      [<...>] R13: 000017c3e0d1ae70 R14: 000017c3e0d11210 R15: 000017c3e0d10fb0
      [<...>] RBX: vm_area_struct[mount 2550 2550 file 6dcb38433000-6dcb38434000 5b 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] RCX: vm_area_struct[mount 2550 2550 file 6dcb38225000-6dcb3837e000 22 55(read|exec|mayread|mayexec)]+0x0/0xb8 [userland map]
      [<...>] RDX: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] RSI: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] RDI: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] RBP: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] RSP: vm_area_struct[mount 2550 2550 anon 763d68173000-763d68195000 7ffffffdd 100133(read|write|mayread|maywrite|growsdown|account)]+0x0/0xb8 [userland map]
      [<...>] R08: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] R09: vm_area_struct[mount 2550 2550 file 6dcb383d1000-6dcb383d3000 1cd 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] R13: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] R14: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] R15: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>]  </PTREGS>
      [<...>] Modules linked in:
      [<...>] ---[ end trace 0000000000000000 ]---
      
      The list debug message as well as RBX's symbolic value point out that the
      object in question was allocated from 'tracefs_inode_cache' and that the
      list's '->next' member is at offset 0. Dumping the layout of the relevant
      parts of 'struct tracefs_inode' gives the following:
      
        struct tracefs_inode {
          union {
            struct inode {
              struct list_head {
                struct list_head * next;                    /*     0     8 */
                struct list_head * prev;                    /*     8     8 */
              } i_lru;
              [...]
            } vfs_inode;
            struct callback_head {
              void (*func)(struct callback_head *);         /*     0     8 */
              struct callback_head * next;                  /*     8     8 */
            } rcu;
          };
          [...]
        };
      
      The above shows that 'vfs_inode.i_lru' overlaps with 'rcu', so the
      'i_lru' list is destroyed as soon as the 'rcu' member gets used, e.g. in
      call_rcu() or later when calling the RCU callback. This breaks
      concurrent list traversals as well as object reuse, both of which
      assume these list heads keep their integrity.
      
      For reproduction, the following diff manually overlays 'i_lru' with
      'rcu'; otherwise, one would need a good deal of luck to hit an unlucky
      RANDSTRUCT seed:
      
        --- a/include/linux/fs.h
        +++ b/include/linux/fs.h
        @@ -629,6 +629,7 @@ struct inode {
         	umode_t			i_mode;
         	unsigned short		i_opflags;
         	kuid_t			i_uid;
        +	struct list_head	i_lru;		/* inode LRU list */
         	kgid_t			i_gid;
         	unsigned int		i_flags;
      
        @@ -690,7 +691,6 @@ struct inode {
         	u16			i_wb_frn_avg_time;
         	u16			i_wb_frn_history;
         #endif
        -	struct list_head	i_lru;		/* inode LRU list */
         	struct list_head	i_sb_list;
         	struct list_head	i_wb_list;	/* backing dev writeback list */
         	union {
      
      The tracefs code does not need to supply its own RCU-delayed destruction
      of its inode. The inode code itself offers both a "destroy_inode()"
      callback, which gets called when the last reference to the inode is
      released, and a "free_inode()" callback, which is called after an RCU
      synchronization period following "destroy_inode()".
      
      The tracefs code can unlink the inode from its list in the destroy_inode()
      callback, and then simply free it from the free_inode() callback. This
      should provide the same protection.
      
      Link: https://lore.kernel.org/all/20240807115143.45927-3-minipli@grsecurity.net/
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Ajay Kaher <ajay.kaher@broadcom.com>
      Cc: Ilkka Naulapää <digirigawa@gmail.com>
      Link: https://lore.kernel.org/20240807185402.61410544@gandalf.local.home
      Fixes: baa23a8d ("tracefs: Reset permissions on remount if permissions are options")
      Reported-by: Mathias Krause <minipli@grsecurity.net>
      Reported-by: Brad Spengler <spender@grsecurity.net>
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      0b6743bd
    • Jianhui Zhou's avatar
      ring-buffer: Remove unused function ring_buffer_nr_pages() · 58f7e4d7
      Jianhui Zhou authored
      ring_buffer_nr_pages() is not an inline function, and its users access
      buffer->buffers[cpu]->nr_pages directly, so remove the now unused
      function.
      Signed-off-by: Jianhui Zhou <912460177@qq.com>
      Link: https://lore.kernel.org/tencent_F4A7E9AB337F44E0F4B858D07D19EF460708@qq.com
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      58f7e4d7
    • Tze-nan Wu's avatar
      tracing: Fix overflow in get_free_elt() · bcf86c01
      Tze-nan Wu authored
      "tracing_map->next_elt" in get_free_elt() is at risk of overflowing.
      
      Once it overflows, new elements can still be inserted into the tracing_map
      even though the maximum number of elements (`max_elts`) has been reached.
      Continuing to insert elements after the overflow could result in the
      tracing_map containing "tracing_map->max_size" elements, leaving no empty
      entries.
      If any attempt is made to insert an element into a full tracing_map using
      `__tracing_map_insert()`, it will cause an infinite loop with preemption
      disabled, leading to a CPU hang problem.
      
      Fix this by preventing any further increments to "tracing_map->next_elt"
      once it reaches "tracing_map->max_elts".
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Fixes: 08d43a5f ("tracing: Add lock-free tracing_map")
      Co-developed-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
      Link: https://lore.kernel.org/20240805055922.6277-1-Tze-nan.Wu@mediatek.com
      Signed-off-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
      Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      bcf86c01
    • Petr Pavlu's avatar
      function_graph: Fix the ret_stack used by ftrace_graph_ret_addr() · 604b72b3
      Petr Pavlu authored
      When ftrace_graph_ret_addr() is invoked to convert a found stack return
      address to its original value, the function can end up producing the
      following crash:
      
      [   95.442712] BUG: kernel NULL pointer dereference, address: 0000000000000028
      [   95.442720] #PF: supervisor read access in kernel mode
      [   95.442724] #PF: error_code(0x0000) - not-present page
      [   95.442727] PGD 0 P4D 0-
      [   95.442731] Oops: Oops: 0000 [#1] PREEMPT SMP PTI
      [   95.442736] CPU: 1 UID: 0 PID: 2214 Comm: insmod Kdump: loaded Tainted: G           OE K    6.11.0-rc1-default #1 67c62a3b3720562f7e7db5f11c1fdb40b7a2857c
      [   95.442747] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE, [K]=LIVEPATCH
      [   95.442750] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
      [   95.442754] RIP: 0010:ftrace_graph_ret_addr+0x42/0xc0
      [   95.442766] Code: [...]
      [   95.442773] RSP: 0018:ffff979b80ff7718 EFLAGS: 00010006
      [   95.442776] RAX: ffffffff8ca99b10 RBX: ffff979b80ff7760 RCX: ffff979b80167dc0
      [   95.442780] RDX: ffffffff8ca99b10 RSI: ffff979b80ff7790 RDI: 0000000000000005
      [   95.442783] RBP: 0000000000000001 R08: 0000000000000005 R09: 0000000000000000
      [   95.442786] R10: 0000000000000005 R11: 0000000000000000 R12: ffffffff8e9491e0
      [   95.442790] R13: ffffffff8d6f70f0 R14: ffff979b80167da8 R15: ffff979b80167dc8
      [   95.442793] FS:  00007fbf83895740(0000) GS:ffff8a0afdd00000(0000) knlGS:0000000000000000
      [   95.442797] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   95.442800] CR2: 0000000000000028 CR3: 0000000005070002 CR4: 0000000000370ef0
      [   95.442806] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   95.442809] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   95.442816] Call Trace:
      [   95.442823]  <TASK>
      [   95.442896]  unwind_next_frame+0x20d/0x830
      [   95.442905]  arch_stack_walk_reliable+0x94/0xe0
      [   95.442917]  stack_trace_save_tsk_reliable+0x7d/0xe0
      [   95.442922]  klp_check_and_switch_task+0x55/0x1a0
      [   95.442931]  task_call_func+0xd3/0xe0
      [   95.442938]  klp_try_switch_task.part.5+0x37/0x150
      [   95.442942]  klp_try_complete_transition+0x79/0x2d0
      [   95.442947]  klp_enable_patch+0x4db/0x890
      [   95.442960]  do_one_initcall+0x41/0x2e0
      [   95.442968]  do_init_module+0x60/0x220
      [   95.442975]  load_module+0x1ebf/0x1fb0
      [   95.443004]  init_module_from_file+0x88/0xc0
      [   95.443010]  idempotent_init_module+0x190/0x240
      [   95.443015]  __x64_sys_finit_module+0x5b/0xc0
      [   95.443019]  do_syscall_64+0x74/0x160
      [   95.443232]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
      [   95.443236] RIP: 0033:0x7fbf82f2c709
      [   95.443241] Code: [...]
      [   95.443247] RSP: 002b:00007fffd5ea3b88 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
      [   95.443253] RAX: ffffffffffffffda RBX: 000056359c48e750 RCX: 00007fbf82f2c709
      [   95.443257] RDX: 0000000000000000 RSI: 000056356ed4efc5 RDI: 0000000000000003
      [   95.443260] RBP: 000056356ed4efc5 R08: 0000000000000000 R09: 00007fffd5ea3c10
      [   95.443263] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
      [   95.443267] R13: 000056359c48e6f0 R14: 0000000000000000 R15: 0000000000000000
      [   95.443272]  </TASK>
      [   95.443274] Modules linked in: [...]
      [   95.443385] Unloaded tainted modules: intel_uncore_frequency(E):1 isst_if_common(E):1 skx_edac(E):1
      [   95.443414] CR2: 0000000000000028
      
      The bug can be reproduced with kselftests:
      
       cd linux/tools/testing/selftests
       make TARGETS='ftrace livepatch'
       (cd ftrace; ./ftracetest test.d/ftrace/fgraph-filter.tc)
       (cd livepatch; ./test-livepatch.sh)
      
      The problem is that ftrace_graph_ret_addr() is supposed to operate on the
      ret_stack of a selected task but wrongly accesses the ret_stack of the
      current task. Specifically, the above NULL dereference occurs when
      task->curr_ret_stack is non-zero, but current->ret_stack is NULL.
      
      Correct ftrace_graph_ret_addr() to work with the right ret_stack.
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reported-by: Miroslav Benes <mbenes@suse.cz>
      Link: https://lore.kernel.org/20240803131211.17255-1-petr.pavlu@suse.com
      Fixes: 7aa1eaef ("function_graph: Allow multiple users to attach to function graph")
      Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      604b72b3