1. 23 May, 2024 11 commits
    • Paolo Abeni's avatar
      Merge branch 'intel-interpret-set_channels-input-differently' · 3d8597d8
      Paolo Abeni authored
      Jacob Keller says:
      
      ====================
      intel: Interpret .set_channels() input differently
      
      The ice and idpf drivers can trigger a crash with AF_XDP due to incorrect
      interpretation of the asymmetric Tx and Rx parameters in their
      .set_channels() implementations:
      
      1. ethtool -l <IFNAME> -> combined: 40
      2. Attach AF_XDP to queue 30
      3. ethtool -L <IFNAME> rx 15 tx 15
         combined number is not specified, so command becomes {rx_count = 15,
         tx_count = 15, combined_count = 40}.
      4. ethnl_set_channels checks, if there are any AF_XDP of queues from the
         new (combined_count + rx_count) to the old one, so from 55 to 40, check
         does not trigger.
      5. the driver interprets `rx 15 tx 15` as 15 combined channels and deletes
         the queue that AF_XDP is attached to.
      
      This is fundamentally a problem with interpreting a request for asymmetric
      queues as symmetric combined queues.
      
      Fix the ice and idpf drivers to stop interpreting such requests as a
      request for combined queues. Due to current driver design for both ice and
      idpf, it is not possible to support requests of the same count of Tx and Rx
      queues with independent interrupts, (i.e. ethtool -L <IFNAME> rx 15 tx 15)
      so such requests are now rejected.
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      ====================
      
      Link: https://lore.kernel.org/r/20240521-iwl-net-2024-05-14-set-channels-fixes-v2-0-7aa39e2e99f1@intel.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      3d8597d8
    • Larysa Zaremba's avatar
      idpf: Interpret .set_channels() input differently · 5e7695e0
      Larysa Zaremba authored
      Unlike ice, idpf does not check, if user has requested at least 1 combined
      channel. Instead, it relies on a check in the core code. Unfortunately, the
      check does not trigger for us because of the hacky .set_channels()
      interpretation logic that is not consistent with the core code.
      
      This naturally leads to user being able to trigger a crash with an invalid
      input. This is how:
      
      1. ethtool -l <IFNAME> -> combined: 40
      2. ethtool -L <IFNAME> rx 0 tx 0
         combined number is not specified, so command becomes {rx_count = 0,
         tx_count = 0, combined_count = 40}.
      3. ethnl_set_channels checks, if there is at least 1 RX and 1 TX channel,
         comparing (combined_count + rx_count) and (combined_count + tx_count)
         to zero. Obviously, (40 + 0) is greater than zero, so the core code
         deems the input OK.
      4. idpf interprets `rx 0 tx 0` as 0 channels and tries to proceed with such
         configuration.
      
      The issue has to be solved fundamentally, as current logic is also known to
      cause AF_XDP problems in ice [0].
      
      Interpret the command in a way that is more consistent with ethtool
      manual [1] (--show-channels and --set-channels) and new ice logic.
      
      Considering that in the idpf driver only the difference between RX and TX
      queues forms dedicated channels, change the correct way to set number of
      channels to:
      
      ethtool -L <IFNAME> combined 10 /* For symmetric queues */
      ethtool -L <IFNAME> combined 8 tx 2 rx 0 /* For asymmetric queues */
      
      [0] https://lore.kernel.org/netdev/20240418095857.2827-1-larysa.zaremba@intel.com/
      [1] https://man7.org/linux/man-pages/man8/ethtool.8.html
      
      Fixes: 02cbfba1 ("idpf: add ethtool callbacks")
      Reviewed-by: default avatarPrzemek Kitszel <przemyslaw.kitszel@intel.com>
      Reviewed-by: default avatarIgor Bagnucki <igor.bagnucki@intel.com>
      Signed-off-by: default avatarLarysa Zaremba <larysa.zaremba@intel.com>
      Tested-by: default avatarKrishneil Singh <krishneil.k.singh@intel.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      5e7695e0
    • Larysa Zaremba's avatar
      ice: Interpret .set_channels() input differently · 05d6f442
      Larysa Zaremba authored
      A bug occurs because a safety check guarding AF_XDP-related queues in
      ethnl_set_channels(), does not trigger. This happens, because kernel and
      ice driver interpret the ethtool command differently.
      
      How the bug occurs:
      1. ethtool -l <IFNAME> -> combined: 40
      2. Attach AF_XDP to queue 30
      3. ethtool -L <IFNAME> rx 15 tx 15
         combined number is not specified, so command becomes {rx_count = 15,
         tx_count = 15, combined_count = 40}.
      4. ethnl_set_channels checks, if there are any AF_XDP of queues from the
         new (combined_count + rx_count) to the old one, so from 55 to 40, check
         does not trigger.
      5. ice interprets `rx 15 tx 15` as 15 combined channels and deletes the
         queue that AF_XDP is attached to.
      
      Interpret the command in a way that is more consistent with ethtool
      manual [0] (--show-channels and --set-channels).
      
      Considering that in the ice driver only the difference between RX and TX
      queues forms dedicated channels, change the correct way to set number of
      channels to:
      
      ethtool -L <IFNAME> combined 10 /* For symmetric queues */
      ethtool -L <IFNAME> combined 8 tx 2 rx 0 /* For asymmetric queues */
      
      [0] https://man7.org/linux/man-pages/man8/ethtool.8.html
      
      Fixes: 87324e74 ("ice: Implement ethtool ops for channels")
      Reviewed-by: default avatarMichal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Signed-off-by: default avatarLarysa Zaremba <larysa.zaremba@intel.com>
      Tested-by: default avatarChandan Kumar Rout <chandanx.rout@intel.com>
      Tested-by: default avatarPucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
      Acked-by: default avatarMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      05d6f442
    • Ryosuke Yasuoka's avatar
      nfc: nci: Fix handling of zero-length payload packets in nci_rx_work() · 6671e352
      Ryosuke Yasuoka authored
      When nci_rx_work() receives a zero-length payload packet, it should not
      discard the packet and exit the loop. Instead, it should continue
      processing subsequent packets.
      
      Fixes: d24b0353 ("nfc: nci: Fix uninit-value in nci_dev_up and nci_ntf_packet")
      Signed-off-by: default avatarRyosuke Yasuoka <ryasuoka@redhat.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Reviewed-by: default avatarKrzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Link: https://lore.kernel.org/r/20240521153444.535399-1-ryasuoka@redhat.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      6671e352
    • Paolo Abeni's avatar
      net: relax socket state check at accept time. · 26afda78
      Paolo Abeni authored
      Christoph reported the following splat:
      
      WARNING: CPU: 1 PID: 772 at net/ipv4/af_inet.c:761 __inet_accept+0x1f4/0x4a0
      Modules linked in:
      CPU: 1 PID: 772 Comm: syz-executor510 Not tainted 6.9.0-rc7-g7da7119fe22b #56
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      RIP: 0010:__inet_accept+0x1f4/0x4a0 net/ipv4/af_inet.c:759
      Code: 04 38 84 c0 0f 85 87 00 00 00 41 c7 04 24 03 00 00 00 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 ec b7 da fd <0f> 0b e9 7f fe ff ff e8 e0 b7 da fd 0f 0b e9 fe fe ff ff 89 d9 80
      RSP: 0018:ffffc90000c2fc58 EFLAGS: 00010293
      RAX: ffffffff836bdd14 RBX: 0000000000000000 RCX: ffff888104668000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: dffffc0000000000 R08: ffffffff836bdb89 R09: fffff52000185f64
      R10: dffffc0000000000 R11: fffff52000185f64 R12: dffffc0000000000
      R13: 1ffff92000185f98 R14: ffff88810754d880 R15: ffff8881007b7800
      FS:  000000001c772880(0000) GS:ffff88811b280000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fb9fcf2e178 CR3: 00000001045d2002 CR4: 0000000000770ef0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       inet_accept+0x138/0x1d0 net/ipv4/af_inet.c:786
       do_accept+0x435/0x620 net/socket.c:1929
       __sys_accept4_file net/socket.c:1969 [inline]
       __sys_accept4+0x9b/0x110 net/socket.c:1999
       __do_sys_accept net/socket.c:2016 [inline]
       __se_sys_accept net/socket.c:2013 [inline]
       __x64_sys_accept+0x7d/0x90 net/socket.c:2013
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x58/0x100 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
      RIP: 0033:0x4315f9
      Code: fd ff 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 ab b4 fd ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffdb26d9c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002b
      RAX: ffffffffffffffda RBX: 0000000000400300 RCX: 00000000004315f9
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
      RBP: 00000000006e1018 R08: 0000000000400300 R09: 0000000000400300
      R10: 0000000000400300 R11: 0000000000000246 R12: 0000000000000000
      R13: 000000000040cdf0 R14: 000000000040ce80 R15: 0000000000000055
       </TASK>
      
      The reproducer invokes shutdown() before entering the listener status.
      After commit 94062790 ("tcp: defer shutdown(SEND_SHUTDOWN) for
      TCP_SYN_RECV sockets"), the above causes the child to reach the accept
      syscall in FIN_WAIT1 status.
      
      Eric noted we can relax the existing assertion in __inet_accept()
      Reported-by: default avatarChristoph Paasch <cpaasch@apple.com>
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/490Suggested-by: default avatarEric Dumazet <edumazet@google.com>
      Fixes: 94062790 ("tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets")
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/23ab880a44d8cfd967e84de8b93dbf48848e3d8c.1716299669.git.pabeni@redhat.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      26afda78
    • Jason Xing's avatar
      tcp: remove 64 KByte limit for initial tp->rcv_wnd value · 378979e9
      Jason Xing authored
      Recently, we had some servers upgraded to the latest kernel and noticed
      the indicator from the user side showed worse results than before. It is
      caused by the limitation of tp->rcv_wnd.
      
      In 2018 commit a337531b ("tcp: up initial rmem to 128KB and SYN rwin
      to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most
      CDN teams would not benefit from this change because they cannot have a
      large window to receive a big packet, which will be slowed down especially
      in long RTT. Small rcv_wnd means slow transfer speed, to some extent. It's
      the side effect for the latency/time-sensitive users.
      
      To avoid future confusion, current change doesn't affect the initial
      receive window on the wire in a SYN or SYN+ACK packet which are set within
      65535 bytes according to RFC 7323 also due to the limit in
      __tcp_transmit_skb():
      
          th->window      = htons(min(tp->rcv_wnd, 65535U));
      
      In one word, __tcp_transmit_skb() already ensures that constraint is
      respected, no matter how large tp->rcv_wnd is. The change doesn't violate
      RFC.
      
      Let me provide one example if with or without the patch:
      Before:
      client   --- SYN: rwindow=65535 ---> server
      client   <--- SYN+ACK: rwindow=65535 ----  server
      client   --- ACK: rwindow=65536 ---> server
      Note: for the last ACK, the calculation is 512 << 7.
      
      After:
      client   --- SYN: rwindow=65535 ---> server
      client   <--- SYN+ACK: rwindow=65535 ----  server
      client   --- ACK: rwindow=175232 ---> server
      Note: I use the following command to make it work:
      ip route change default via [ip] dev eth0 metric 100 initrwnd 120
      For the last ACK, the calculation is 1369 << 7.
      
      When we apply such a patch, having a large rcv_wnd if the user tweak this
      knob can help transfer data more rapidly and save some rtts.
      
      Fixes: a337531b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
      Signed-off-by: default avatarJason Xing <kernelxing@tencent.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20240521134220.12510-1-kerneljasonxing@gmail.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      378979e9
    • Romain Gantois's avatar
      net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe() · b31c7e78
      Romain Gantois authored
      In the prueth_probe() function, if one of the calls to emac_phy_connect()
      fails due to of_phy_connect() returning NULL, then the subsequent call to
      phy_attached_info() will dereference a NULL pointer.
      
      Check the return code of emac_phy_connect and fail cleanly if there is an
      error.
      
      Fixes: 128d5874 ("net: ti: icssg-prueth: Add ICSSG ethernet driver")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarRomain Gantois <romain.gantois@bootlin.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Reviewed-by: default avatarMD Danish Anwar <danishanwar@ti.com>
      Link: https://lore.kernel.org/r/20240521-icssg-prueth-fix-v1-1-b4b17b1433e9@bootlin.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      b31c7e78
    • Dae R. Jeong's avatar
      tls: fix missing memory barrier in tls_init · 91e61dd7
      Dae R. Jeong authored
      In tls_init(), a write memory barrier is missing, and store-store
      reordering may cause NULL dereference in tls_{setsockopt,getsockopt}.
      
      CPU0                               CPU1
      -----                              -----
      // In tls_init()
      // In tls_ctx_create()
      ctx = kzalloc()
      ctx->sk_proto = READ_ONCE(sk->sk_prot) -(1)
      
      // In update_sk_prot()
      WRITE_ONCE(sk->sk_prot, tls_prots)     -(2)
      
                                         // In sock_common_setsockopt()
                                         READ_ONCE(sk->sk_prot)->setsockopt()
      
                                         // In tls_{setsockopt,getsockopt}()
                                         ctx->sk_proto->setsockopt()    -(3)
      
      In the above scenario, when (1) and (2) are reordered, (3) can observe
      the NULL value of ctx->sk_proto, causing NULL dereference.
      
      To fix it, we rely on rcu_assign_pointer() which implies the release
      barrier semantic. By moving rcu_assign_pointer() after ctx->sk_proto is
      initialized, we can ensure that ctx->sk_proto are visible when
      changing sk->sk_prot.
      
      Fixes: d5bee737 ("net/tls: Annotate access to sk_prot with READ_ONCE/WRITE_ONCE")
      Signed-off-by: default avatarYewon Choi <woni9911@gmail.com>
      Signed-off-by: default avatarDae R. Jeong <threeearcat@gmail.com>
      Link: https://lore.kernel.org/netdev/ZU4OJG56g2V9z_H7@dragonet/T/
      Link: https://lore.kernel.org/r/Zkx4vjSFp0mfpjQ2@libra05Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      91e61dd7
    • Wei Fang's avatar
      net: fec: avoid lock evasion when reading pps_enable · 3b1c92f8
      Wei Fang authored
      The assignment of pps_enable is protected by tmreg_lock, but the read
      operation of pps_enable is not. So the Coverity tool reports a lock
      evasion warning which may cause data race to occur when running in a
      multithread environment. Although this issue is almost impossible to
      occur, we'd better fix it, at least it seems more logically reasonable,
      and it also prevents Coverity from continuing to issue warnings.
      
      Fixes: 278d2404 ("net: fec: ptp: Enable PPS output based on ptp clock")
      Signed-off-by: default avatarWei Fang <wei.fang@nxp.com>
      Link: https://lore.kernel.org/r/20240521023800.17102-1-wei.fang@nxp.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      3b1c92f8
    • Jacob Keller's avatar
      Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI" · b35b1c0b
      Jacob Keller authored
      This reverts commit 56573604.
      
      According to the commit, it implements a manual AN-37 for some
      "troublesome" Juniper MX5 switches. This appears to be a workaround for a
      particular switch.
      
      It has been reported that this causes a severe breakage for other switches,
      including a Cisco 3560CX-12PD-S.
      
      The code appears to be a workaround for a specific switch which fails to
      link in SFI mode. It expects to see AN-37 auto negotiation in order to
      link. The Cisco switch is not expecting AN-37 auto negotiation. When the
      device starts the manual AN-37, the Cisco switch decides that the port is
      confused and stops attempting to link with it. This persists until a power
      cycle. A simple driver unload and reload does not resolve the issue, even
      if loading with a version of the driver which lacks this workaround.
      
      The authors of the workaround commit have not responded with
      clarifications, and the result of the workaround is complete failure to
      connect with other switches.
      
      This appears to be a case where the driver can either "correctly" link with
      the Juniper MX5 switch, at the cost of bricking the link with the Cisco
      switch, or it can behave properly for the Cisco switch, but fail to link
      with the Junipir MX5 switch. I do not know enough about the standards
      involved to clearly determine whether either switch is at fault or behaving
      incorrectly. Nor do I know whether there exists some alternative fix which
      corrects behavior with both switches.
      
      Revert the workaround for the Juniper switch.
      
      Fixes: 56573604 ("ixgbe: Manual AN-37 for troublesome link partners for X550 SFI")
      Link: https://lore.kernel.org/netdev/cbe874db-9ac9-42b8-afa0-88ea910e1e99@intel.com/T/
      Link: https://forum.proxmox.com/threads/intel-x553-sfp-ixgbe-no-go-on-pve8.135129/#post-612291Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Cc: Jeff Daly <jeffd@silicom-usa.com>
      Cc: kernel.org-fo5k2w@ycharbi.fr
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240520-net-2024-05-20-revert-silicom-switch-workaround-v1-1-50f80f261c94@intel.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      b35b1c0b
    • Joe Damato's avatar
      testing: net-drv: use stats64 for testing · a61a459f
      Joe Damato authored
      Testing a network device that has large numbers of bytes/packets may
      overflow. Using stats64 when comparing fixes this problem.
      
      I tripped on this while iterating on a qstats patch for mlx5. See below
      for confirmation without my added code that this is a bug.
      
      Before this patch (with added debugging output):
      
      $ NETIF=eth0 tools/testing/selftests/drivers/net/stats.py
      KTAP version 1
      1..4
      ok 1 stats.check_pause
      ok 2 stats.check_fec
      rstat: 481708634 qstat: 666201639514 key: tx-bytes
      not ok 3 stats.pkt_byte_sum
      ok 4 stats.qstat_by_ifindex
      
      Note the huge delta above ^^^ in the rtnl vs qstats.
      
      After this patch:
      
      $ NETIF=eth0 tools/testing/selftests/drivers/net/stats.py
      KTAP version 1
      1..4
      ok 1 stats.check_pause
      ok 2 stats.check_fec
      ok 3 stats.pkt_byte_sum
      ok 4 stats.qstat_by_ifindex
      
      It looks like rtnl_fill_stats in net/core/rtnetlink.c will attempt to
      copy the 64bit stats into a 32bit structure which is probably why this
      behavior is occurring.
      
      To show this is happening, you can get the underlying stats that the
      stats.py test uses like this:
      
      $ ./cli.py --spec ../../../Documentation/netlink/specs/rt_link.yaml \
                 --do getlink --json '{"ifi-index": 7}'
      
      And examine the output (heavily snipped to show relevant fields):
      
       'stats': {
                 'multicast': 3739197,
                 'rx-bytes': 1201525399,
                 'rx-packets': 56807158,
                 'tx-bytes': 492404458,
                 'tx-packets': 1200285371,
      
       'stats64': {
                   'multicast': 3739197,
                   'rx-bytes': 35561263767,
                   'rx-packets': 56807158,
                   'tx-bytes': 666212335338,
                   'tx-packets': 1200285371,
      
      The stats.py test prior to this patch was using the 'stats' structure
      above, which matches the failure output on my system.
      
      Comparing side by side, rx-bytes and tx-bytes, and getting ethtool -S
      output:
      
      rx-bytes stats:    1201525399
      rx-bytes stats64: 35561263767
      rx-bytes ethtool: 36203402638
      
      tx-bytes stats:      492404458
      tx-bytes stats64: 666212335338
      tx-bytes ethtool: 666215360113
      
      Note that the above was taken from a system with an mlx5 NIC, which only
      exposes ndo_get_stats64.
      
      Based on the ethtool output and qstat output, it appears that stats.py
      should be updated to use the 'stats64' structure for accurate
      comparisons when packet/byte counters get very large.
      
      To confirm that this was not related to the qstats code I was iterating
      on, I booted a kernel without my driver changes and re-ran the test
      which shows the qstats are skipped (as they don't exist for mlx5):
      
      NETIF=eth0 tools/testing/selftests/drivers/net/stats.py
      KTAP version 1
      1..4
      ok 1 stats.check_pause
      ok 2 stats.check_fec
      ok 3 stats.pkt_byte_sum # SKIP qstats not supported by the device
      ok 4 stats.qstat_by_ifindex # SKIP No ifindex supports qstats
      
      But, fetching the stats using the CLI
      
      $ ./cli.py --spec ../../../Documentation/netlink/specs/rt_link.yaml \
                 --do getlink --json '{"ifi-index": 7}'
      
      Shows the same issue (heavily snipped for relevant fields only):
      
       'stats': {
                 'multicast': 105489,
                 'rx-bytes': 530879526,
                 'rx-packets': 751415,
                 'tx-bytes': 2510191396,
                 'tx-packets': 27700323,
       'stats64': {
                   'multicast': 105489,
                   'rx-bytes': 530879526,
                   'rx-packets': 751415,
                   'tx-bytes': 15395093284,
                   'tx-packets': 27700323,
      
      Comparing side by side with ethtool -S on the unmodified mlx5 driver:
      
      tx-bytes stats:    2510191396
      tx-bytes stats64: 15395093284
      tx-bytes ethtool: 17718435810
      
      Fixes: f0e6c86e ("testing: net-drv: add a driver test for stats reporting")
      Signed-off-by: default avatarJoe Damato <jdamato@fastly.com>
      Link: https://lore.kernel.org/r/20240520235850.190041-1-jdamato@fastly.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      a61a459f
  2. 22 May, 2024 2 commits
  3. 21 May, 2024 9 commits
    • Aaron Conole's avatar
      openvswitch: Set the skbuff pkt_type for proper pmtud support. · 30a92c9e
      Aaron Conole authored
      Open vSwitch is originally intended to switch at layer 2, only dealing with
      Ethernet frames.  With the introduction of l3 tunnels support, it crossed
      into the realm of needing to care a bit about some routing details when
      making forwarding decisions.  If an oversized packet would need to be
      fragmented during this forwarding decision, there is a chance for pmtu
      to get involved and generate a routing exception.  This is gated by the
      skbuff->pkt_type field.
      
      When a flow is already loaded into the openvswitch module this field is
      set up and transitioned properly as a packet moves from one port to
      another.  In the case that a packet execute is invoked after a flow is
      newly installed this field is not properly initialized.  This causes the
      pmtud mechanism to omit sending the required exception messages across
      the tunnel boundary and a second attempt needs to be made to make sure
      that the routing exception is properly setup.  To fix this, we set the
      outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
      to the openvswitch module via a port device or packet command.
      
      Even for bridge ports as users, the pkt_type needs to be reset when
      doing the transmit as the packet is truly outgoing and routing needs
      to get involved post packet transformations, in the case of
      VXLAN/GENEVE/udp-tunnel packets.  In general, the pkt_type on output
      gets ignored, since we go straight to the driver, but in the case of
      tunnel ports they go through IP routing layer.
      
      This issue is periodically encountered in complex setups, such as large
      openshift deployments, where multiple sets of tunnel traversal occurs.
      A way to recreate this is with the ovn-heater project that can setup
      a networking environment which mimics such large deployments.  We need
      larger environments for this because we need to ensure that flow
      misses occur.  In these environment, without this patch, we can see:
      
        ./ovn_cluster.sh start
        podman exec ovn-chassis-1 ip r a 170.168.0.5/32 dev eth1 mtu 1200
        podman exec ovn-chassis-1 ip netns exec sw01p1 ip r flush cache
        podman exec ovn-chassis-1 ip netns exec sw01p1 \
               ping 21.0.0.3 -M do -s 1300 -c2
        PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
        From 21.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 1142)
      
        --- 21.0.0.3 ping statistics ---
        ...
      
      Using tcpdump, we can also see the expected ICMP FRAG_NEEDED message is not
      sent into the server.
      
      With this patch, setting the pkt_type, we see the following:
      
        podman exec ovn-chassis-1 ip netns exec sw01p1 \
               ping 21.0.0.3 -M do -s 1300 -c2
        PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
        From 21.0.0.3 icmp_seq=1 Frag needed and DF set (mtu = 1222)
        ping: local error: message too long, mtu=1222
      
        --- 21.0.0.3 ping statistics ---
        ...
      
      In this case, the first ping request receives the FRAG_NEEDED message and
      a local routing exception is created.
      Tested-by: default avatarJaime Caamano <jcaamano@redhat.com>
      Reported-at: https://issues.redhat.com/browse/FDP-164
      Fixes: 58264848 ("openvswitch: Add vxlan tunneling support.")
      Signed-off-by: default avatarAaron Conole <aconole@redhat.com>
      Acked-by: default avatarEelco Chaudron <echaudro@redhat.com>
      Link: https://lore.kernel.org/r/20240516200941.16152-1-aconole@redhat.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      30a92c9e
    • Paolo Abeni's avatar
      Merge branch 'af_unix-fix-gc-and-improve-selftest' · 580acf6c
      Paolo Abeni authored
      Michal Luczaj says:
      
      ====================
      af_unix: Fix GC and improve selftest
      
      Series deals with AF_UNIX garbage collector mishandling some in-flight
      graph cycles. Embryos carrying OOB packets with SCM_RIGHTS cause issues.
      
      Patch 1/2 fixes the memory leak.
      Patch 2/2 tweaks the selftest for a better OOB coverage.
      
      v3:
        - Patch 1/2: correct the commit message (Kuniyuki)
      
      v2: https://lore.kernel.org/netdev/20240516145457.1206847-1-mhal@rbox.co/
        - Patch 1/2: remove WARN_ON_ONCE() (Kuniyuki)
        - Combine both patches into a series (Kuniyuki)
      
      v1: https://lore.kernel.org/netdev/20240516103049.1132040-1-mhal@rbox.co/
      ====================
      
      Link: https://lore.kernel.org/r/20240517093138.1436323-1-mhal@rbox.coSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      580acf6c
    • Kuniyuki Iwashima's avatar
      selftest: af_unix: Make SCM_RIGHTS into OOB data. · e060e433
      Kuniyuki Iwashima authored
      scm_rights.c covers various test cases for inflight file descriptors
      and garbage collector for AF_UNIX sockets.
      
      Currently, SCM_RIGHTS messages are sent with 3-bytes string, and it's
      not good for MSG_OOB cases, as SCM_RIGTS cmsg goes with the first 2-bytes,
      which is non-OOB data.
      
      Let's send SCM_RIGHTS messages with 1-byte character to pack SCM_RIGHTS
      into OOB data.
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: default avatarMichal Luczaj <mhal@rbox.co>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      e060e433
    • Michal Luczaj's avatar
      af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS · 041933a1
      Michal Luczaj authored
      GC attempts to explicitly drop oob_skb's reference before purging the hit
      list.
      
      The problem is with embryos: kfree_skb(u->oob_skb) is never called on an
      embryo socket.
      
      The python script below [0] sends a listener's fd to its embryo as OOB
      data.  While GC does collect the embryo's queue, it fails to drop the OOB
      skb's refcount.  The skb which was in embryo's receive queue stays as
      unix_sk(sk)->oob_skb and keeps the listener's refcount [1].
      
      Tell GC to dispose embryo's oob_skb.
      
      [0]:
      from array import array
      from socket import *
      
      addr = '\x00unix-oob'
      lis = socket(AF_UNIX, SOCK_STREAM)
      lis.bind(addr)
      lis.listen(1)
      
      s = socket(AF_UNIX, SOCK_STREAM)
      s.connect(addr)
      scm = (SOL_SOCKET, SCM_RIGHTS, array('i', [lis.fileno()]))
      s.sendmsg([b'x'], [scm], MSG_OOB)
      lis.close()
      
      [1]
      $ grep unix-oob /proc/net/unix
      $ ./unix-oob.py
      $ grep unix-oob /proc/net/unix
      0000000000000000: 00000002 00000000 00000000 0001 02     0 @unix-oob
      0000000000000000: 00000002 00000000 00010000 0001 01  6072 @unix-oob
      
      Fixes: 4090fa37 ("af_unix: Replace garbage collection algorithm.")
      Signed-off-by: default avatarMichal Luczaj <mhal@rbox.co>
      Reviewed-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      041933a1
    • Kuniyuki Iwashima's avatar
      tcp: Fix shift-out-of-bounds in dctcp_update_alpha(). · 3ebc46ca
      Kuniyuki Iwashima authored
      In dctcp_update_alpha(), we use a module parameter dctcp_shift_g
      as follows:
      
        alpha -= min_not_zero(alpha, alpha >> dctcp_shift_g);
        ...
        delivered_ce <<= (10 - dctcp_shift_g);
      
      It seems syzkaller started fuzzing module parameters and triggered
      shift-out-of-bounds [0] by setting 100 to dctcp_shift_g:
      
        memcpy((void*)0x20000080,
               "/sys/module/tcp_dctcp/parameters/dctcp_shift_g\000", 47);
        res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x20000080ul,
                      /*flags=*/2ul, /*mode=*/0ul);
        memcpy((void*)0x20000000, "100\000", 4);
        syscall(__NR_write, /*fd=*/r[0], /*val=*/0x20000000ul, /*len=*/4ul);
      
      Let's limit the max value of dctcp_shift_g by param_set_uint_minmax().
      
      With this patch:
      
        # echo 10 > /sys/module/tcp_dctcp/parameters/dctcp_shift_g
        # cat /sys/module/tcp_dctcp/parameters/dctcp_shift_g
        10
        # echo 11 > /sys/module/tcp_dctcp/parameters/dctcp_shift_g
        -bash: echo: write error: Invalid argument
      
      [0]:
      UBSAN: shift-out-of-bounds in net/ipv4/tcp_dctcp.c:143:12
      shift exponent 100 is too large for 32-bit type 'u32' (aka 'unsigned int')
      CPU: 0 PID: 8083 Comm: syz-executor345 Not tainted 6.9.0-05151-g1b294a1f #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      1.13.0-1ubuntu1.1 04/01/2014
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x201/0x300 lib/dump_stack.c:114
       ubsan_epilogue lib/ubsan.c:231 [inline]
       __ubsan_handle_shift_out_of_bounds+0x346/0x3a0 lib/ubsan.c:468
       dctcp_update_alpha+0x540/0x570 net/ipv4/tcp_dctcp.c:143
       tcp_in_ack_event net/ipv4/tcp_input.c:3802 [inline]
       tcp_ack+0x17b1/0x3bc0 net/ipv4/tcp_input.c:3948
       tcp_rcv_state_process+0x57a/0x2290 net/ipv4/tcp_input.c:6711
       tcp_v4_do_rcv+0x764/0xc40 net/ipv4/tcp_ipv4.c:1937
       sk_backlog_rcv include/net/sock.h:1106 [inline]
       __release_sock+0x20f/0x350 net/core/sock.c:2983
       release_sock+0x61/0x1f0 net/core/sock.c:3549
       mptcp_subflow_shutdown+0x3d0/0x620 net/mptcp/protocol.c:2907
       mptcp_check_send_data_fin+0x225/0x410 net/mptcp/protocol.c:2976
       __mptcp_close+0x238/0xad0 net/mptcp/protocol.c:3072
       mptcp_close+0x2a/0x1a0 net/mptcp/protocol.c:3127
       inet_release+0x190/0x1f0 net/ipv4/af_inet.c:437
       __sock_release net/socket.c:659 [inline]
       sock_close+0xc0/0x240 net/socket.c:1421
       __fput+0x41b/0x890 fs/file_table.c:422
       task_work_run+0x23b/0x300 kernel/task_work.c:180
       exit_task_work include/linux/task_work.h:38 [inline]
       do_exit+0x9c8/0x2540 kernel/exit.c:878
       do_group_exit+0x201/0x2b0 kernel/exit.c:1027
       __do_sys_exit_group kernel/exit.c:1038 [inline]
       __se_sys_exit_group kernel/exit.c:1036 [inline]
       __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1036
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xe4/0x240 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x67/0x6f
      RIP: 0033:0x7f6c2b5005b6
      Code: Unable to access opcode bytes at 0x7f6c2b50058c.
      RSP: 002b:00007ffe883eb948 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
      RAX: ffffffffffffffda RBX: 00007f6c2b5862f0 RCX: 00007f6c2b5005b6
      RDX: 0000000000000001 RSI: 000000000000003c RDI: 0000000000000001
      RBP: 0000000000000001 R08: 00000000000000e7 R09: ffffffffffffffc0
      R10: 0000000000000006 R11: 0000000000000246 R12: 00007f6c2b5862f0
      R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
       </TASK>
      Reported-by: default avatarsyzkaller <syzkaller@googlegroups.com>
      Reported-by: default avatarYue Sun <samsun1006219@gmail.com>
      Reported-by: default avatarxingwei lee <xrivendell7@gmail.com>
      Closes: https://lore.kernel.org/netdev/CAEkJfYNJM=cw-8x7_Vmj1J6uYVCWMbbvD=EFmDPVBGpTsqOxEA@mail.gmail.com/
      Fixes: e3118e83 ("net: tcp: add DCTCP congestion control algorithm")
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240517091626.32772-1-kuniyu@amazon.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      3ebc46ca
    • Hangbin Liu's avatar
      selftests/net: use tc rule to filter the na packet · ea63ac14
      Hangbin Liu authored
      Test arp_ndisc_untracked_subnets use tcpdump to filter the unsolicited
      and untracked na messages. It set -e before calling tcpdump. But if
      tcpdump filters 0 packet, it will return none zero, and cause the script
      to exit.
      
      Instead of using slow tcpdump to capture packets, let's using tc rule
      to filter out the na message.
      
      At the same time, fix function setup_v6 which only needs one parameter.
      Move all the related helpers from forwarding lib.sh to net lib.sh.
      
      Fixes: 0ea7b0a4 ("selftests: net: arp_ndisc_untracked_subnets: test for arp_accept and accept_untracked_na")
      Signed-off-by: default avatarHangbin Liu <liuhangbin@gmail.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240517010327.2631319-1-liuhangbin@gmail.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      ea63ac14
    • Hangbin Liu's avatar
      ipv6: sr: fix memleak in seg6_hmac_init_algo · efb9f4f1
      Hangbin Liu authored
      seg6_hmac_init_algo returns without cleaning up the previous allocations
      if one fails, so it's going to leak all that memory and the crypto tfms.
      
      Update seg6_hmac_exit to only free the memory when allocated, so we can
      reuse the code directly.
      
      Fixes: bf355b8d ("ipv6: sr: add core files for SR HMAC support")
      Reported-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Closes: https://lore.kernel.org/netdev/Zj3bh-gE7eT6V6aH@hog/Signed-off-by: default avatarHangbin Liu <liuhangbin@gmail.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Reviewed-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/20240517005435.2600277-1-liuhangbin@gmail.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      efb9f4f1
    • Kuniyuki Iwashima's avatar
      af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock. · 9841991a
      Kuniyuki Iwashima authored
      Billy Jheng Bing-Jhong reported a race between __unix_gc() and
      queue_oob().
      
      __unix_gc() tries to garbage-collect close()d inflight sockets,
      and then if the socket has MSG_OOB in unix_sk(sk)->oob_skb, GC
      will drop the reference and set NULL to it locklessly.
      
      However, the peer socket still can send MSG_OOB message and
      queue_oob() can update unix_sk(sk)->oob_skb concurrently, leading
      NULL pointer dereference. [0]
      
      To fix the issue, let's update unix_sk(sk)->oob_skb under the
      sk_receive_queue's lock and take it everywhere we touch oob_skb.
      
      Note that we defer kfree_skb() in manage_oob() to silence lockdep
      false-positive (See [1]).
      
      [0]:
      BUG: kernel NULL pointer dereference, address: 0000000000000008
       PF: supervisor write access in kernel mode
       PF: error_code(0x0002) - not-present page
      PGD 8000000009f5e067 P4D 8000000009f5e067 PUD 9f5d067 PMD 0
      Oops: 0002 [#1] PREEMPT SMP PTI
      CPU: 3 PID: 50 Comm: kworker/3:1 Not tainted 6.9.0-rc5-00191-gd091e579 #110
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      Workqueue: events delayed_fput
      RIP: 0010:skb_dequeue (./include/linux/skbuff.h:2386 ./include/linux/skbuff.h:2402 net/core/skbuff.c:3847)
      Code: 39 e3 74 3e 8b 43 10 48 89 ef 83 e8 01 89 43 10 49 8b 44 24 08 49 c7 44 24 08 00 00 00 00 49 8b 14 24 49 c7 04 24 00 00 00 00 <48> 89 42 08 48 89 10 e8 e7 c5 42 00 4c 89 e0 5b 5d 41 5c c3 cc cc
      RSP: 0018:ffffc900001bfd48 EFLAGS: 00000002
      RAX: 0000000000000000 RBX: ffff8880088f5ae8 RCX: 00000000361289f9
      RDX: 0000000000000000 RSI: 0000000000000206 RDI: ffff8880088f5b00
      RBP: ffff8880088f5b00 R08: 0000000000080000 R09: 0000000000000001
      R10: 0000000000000003 R11: 0000000000000001 R12: ffff8880056b6a00
      R13: ffff8880088f5280 R14: 0000000000000001 R15: ffff8880088f5a80
      FS:  0000000000000000(0000) GS:ffff88807dd80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000008 CR3: 0000000006314000 CR4: 00000000007506f0
      PKRU: 55555554
      Call Trace:
       <TASK>
       unix_release_sock (net/unix/af_unix.c:654)
       unix_release (net/unix/af_unix.c:1050)
       __sock_release (net/socket.c:660)
       sock_close (net/socket.c:1423)
       __fput (fs/file_table.c:423)
       delayed_fput (fs/file_table.c:444 (discriminator 3))
       process_one_work (kernel/workqueue.c:3259)
       worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416)
       kthread (kernel/kthread.c:388)
       ret_from_fork (arch/x86/kernel/process.c:153)
       ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
       </TASK>
      Modules linked in:
      CR2: 0000000000000008
      
      Link: https://lore.kernel.org/netdev/a00d3993-c461-43f2-be6d-07259c98509a@rbox.co/ [1]
      Fixes: 1279f9d9 ("af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.")
      Reported-by: default avatarBilly Jheng Bing-Jhong <billy@starlabs.sg>
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240516134835.8332-1-kuniyu@amazon.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      9841991a
    • Heiner Kallweit's avatar
      Revert "r8169: don't try to disable interrupts if NAPI is, scheduled already" · eabb8a9b
      Heiner Kallweit authored
      This reverts commit 7274c414.
      
      Ken reported that RTL8125b can lock up if gro_flush_timeout has the
      default value of 20000 and napi_defer_hard_irqs is set to 0.
      In this scenario device interrupts aren't disabled, what seems to
      trigger some silicon bug under heavy load. I was able to reproduce this
      behavior on RTL8168h. Fix this by reverting 7274c414.
      
      Fixes: 7274c414 ("r8169: don't try to disable interrupts if NAPI is scheduled already")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarKen Milmore <ken.milmore@gmail.com>
      Signed-off-by: default avatarHeiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/9b5b6f4c-4f54-4b90-b0b3-8d8023c2e780@gmail.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      eabb8a9b
  4. 20 May, 2024 4 commits
  5. 18 May, 2024 10 commits
    • Linus Torvalds's avatar
      kprobe/ftrace: fix build error due to bad function definition · 4b377b48
      Linus Torvalds authored
      Commit 1a7d0890 ("kprobe/ftrace: bail out if ftrace was killed")
      introduced a bad K&R function definition, which we haven't accepted in a
      long long time.
      
      Gcc seems to let it slide, but clang notices with the appropriate error:
      
        kernel/kprobes.c:1140:24: error: a function declaration without a prototype is deprecated in all >
         1140 | void kprobe_ftrace_kill()
              |                        ^
              |                         void
      
      but this commit was apparently never in linux-next before it was sent
      upstream, so it didn't get the appropriate build test coverage.
      
      Fixes: 1a7d0890 kprobe/ftrace: bail out if ftrace was killed
      Cc: Stephen Brennan <stephen.s.brennan@oracle.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b377b48
    • Linus Torvalds's avatar
      Merge tag 'net-6.10-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · f08a1e91
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Current release - regressions:
      
         - virtio_net: fix missed error path rtnl_unlock after control queue
           locking rework
      
        Current release - new code bugs:
      
         - bpf: fix KASAN slab-out-of-bounds in percpu_array_map_gen_lookup,
           caused by missing nested map handling
      
         - drv: dsa: correct initialization order for KSZ88x3 ports
      
        Previous releases - regressions:
      
         - af_packet: do not call packet_read_pending() from
           tpacket_destruct_skb() fix performance regression
      
         - ipv6: fix route deleting failure when metric equals 0, don't assume
           0 means not set / default in this case
      
        Previous releases - always broken:
      
         - bridge: couple of syzbot-driven fixes"
      
      * tag 'net-6.10-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (30 commits)
        selftests: net: local_termination: annotate the expected failures
        net: dsa: microchip: Correct initialization order for KSZ88x3 ports
        MAINTAINERS: net: Update reviewers for TI's Ethernet drivers
        dt-bindings: net: ti: Update maintainers list
        l2tp: fix ICMP error handling for UDP-encap sockets
        net: txgbe: fix to control VLAN strip
        net: wangxun: match VLAN CTAG and STAG features
        net: wangxun: fix to change Rx features
        af_packet: do not call packet_read_pending() from tpacket_destruct_skb()
        virtio_net: Fix missed rtnl_unlock
        netrom: fix possible dead-lock in nr_rt_ioctl()
        idpf: don't skip over ethtool tcp-data-split setting
        dt-bindings: net: qcom: ethernet: Allow dma-coherent
        bonding: fix oops during rmmod
        net/ipv6: Fix route deleting failure when metric equals 0
        selftests/net: reduce xfrm_policy test time
        selftests/bpf: Adjust btf_dump test to reflect recent change in file_operations
        selftests/bpf: Adjust test_access_variable_array after a kernel function name change
        selftests/net/lib: no need to record ns name if it already exist
        net: qrtr: ns: Fix module refcnt
        ...
      f08a1e91
    • Linus Torvalds's avatar
      Merge tag 'trace-tools-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 26aa834f
      Linus Torvalds authored
      Pull tracing tool updates from Steven Rostedt:
       "Specific for timerlat:
      
         - Improve the output of timerlat top by adding a missing \n, and by
           avoiding printing color-formatting characters where they are
           translated to regular characters.
      
         - Improve timerlat auto-analysis output by replacing '\t' with spaces
           to avoid copy-and-paste issues when reporting problems.
      
         - Make the user-space (-u) option the default, as it is the most
           complete test. Add a -k option to use the in-kernel workload.
      
         - On timerlat top and hist, add a summary with the overall results.
           For instance, the minimum value for all CPUs, the overall average
           and the maximum value from all CPUs.
      
         - timerlat hist was printing initial values (i.e., 0 as max, and ~0
           as min) if the trace stopped before the first Ret-User event. This
           problem was fixed by printing the " - " no value string to the
           output if that was the case.
      
        For all RTLA tools:
      
         - Add a --warm-up <seconds> option, allowing the workload to run for
           <seconds> before starting to collect results.
      
         - Add a --trace-buffer-size option, allowing the user to set the
           tracing buffer size for -t option. This option is mainly useful for
           reducing the trace file. Now rtla depends on libtracefs >= 1.6.
      
         - Fix the -t [trace_file] parsing, now it does not require the '='
           before the option parameter, and better handles the multiple ways a
           user can pass the trace_file.txt"
      
      * tag 'trace-tools-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        rtla: Documentation: Fix -t, --trace
        rtla: Fix -t\--trace[=file]
        rtla/timerlat: Fix histogram report when a cpu count is 0
        rtla: Add --trace-buffer-size option
        rtla/timerlat: Make user-space threads the default
        rtla: Add the --warm-up option
        rtla/timerlat: Add a summary for hist mode
        rtla/timerlat: Add a summary for top mode
        rtla/timerlat: Use pretty formatting only on interactive tty
        rtla/auto-analysis: Replace \t with spaces
        rtla/timerlat: Simplify "no value" printing on top
      26aa834f
    • Linus Torvalds's avatar
      Merge tag 'trace-user-events-v6.10' of... · fa3889d9
      Linus Torvalds authored
      Merge tag 'trace-user-events-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
      
      Pull tracing user-event updates from Steven Rostedt:
      
       - Minor update to the user_events interface
      
        The ABI of creating a user event states that the fields are separated
        by semicolons, and spaces should be ignored.
      
        But the parsing expected at least one space to be there (which was
        incorrect). Fix the reading of the string to handle fields separated
        by semicolons but no space between them.
      
        This does extend the API sightly as now "field;field" will now be
        parsed and not cause an error. But it should not cause any regressions
        as no logic should expect it to fail.
      
        Note, that the logic that parses the event fields to create the
        trace_event works with no spaces after the semi-colon. It is
        the logic that tests against existing events that is inconsistent.
        This causes registering an event without using spaces to succeed
        if it doesn't exist, but makes the same call that tries to register
        to the same event, but doesn't use spaces, fail.
      
      * tag 'trace-user-events-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        selftests/user_events: Add non-spacing separator check
        tracing/user_events: Fix non-spaced field matching
      fa3889d9
    • Linus Torvalds's avatar
      Merge tag 'trace-ringbuffer-v6.10' of... · 53683e40
      Linus Torvalds authored
      Merge tag 'trace-ringbuffer-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
      
      Pull tracing ring buffer updates from Steven Rostedt:
       "Add ring_buffer memory mappings.
      
        The tracing ring buffer was created based on being mostly used with
        the splice system call. It is broken up into page ordered sub-buffers
        and the reader swaps a new sub-buffer with an existing sub-buffer
        that's part of the write buffer. It then has total access to the
        swapped out sub-buffer and can do copyless movements of the memory
        into other mediums (file system, network, etc).
      
        The buffer is great for passing around the ring buffer contents in the
        kernel, but is not so good for when the consumer is the user space
        task itself.
      
        A new interface is added that allows user space to memory map the ring
        buffer. It will get all the write sub-buffers as well as reader
        sub-buffer (that is not written to). It can send an ioctl to change
        which sub-buffer is the new reader sub-buffer.
      
        The ring buffer is read only to user space. It only needs to call the
        ioctl when it is finished with a sub-buffer and needs a new sub-buffer
        that the writer will not write over.
      
        A self test program was also created for testing and can be used as an
        example for the interface to user space. The libtracefs (external to
        the kernel) also has code that interacts with this, although it is
        disabled until the interface is in a official release. It can be
        enabled by compiling the library with a special flag. This was used
        for testing applications that perform better with the buffer being
        mapped.
      
        Memory mapped buffers have limitations. The main one is that it can
        not be used with the snapshot logic. If the buffer is mapped,
        snapshots will be disabled. If any logic is set to trigger snapshots
        on a buffer, that buffer will not be allowed to be mapped"
      
      * tag 'trace-ringbuffer-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        ring-buffer: Add cast to unsigned long addr passed to virt_to_page()
        ring-buffer: Have mmapped ring buffer keep track of missed events
        ring-buffer/selftest: Add ring-buffer mapping test
        Documentation: tracing: Add ring-buffer mapping
        tracing: Allow user-space mapping of the ring-buffer
        ring-buffer: Introducing ring-buffer mapping functions
        ring-buffer: Allocate sub-buffers with __GFP_COMP
      53683e40
    • Linus Torvalds's avatar
      Merge tag 'trace-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 594d2815
      Linus Torvalds authored
      Pull tracing updates from Steven Rostedt:
      
       - Remove unused ftrace_direct_funcs variables
      
       - Fix a possible NULL pointer dereference race in eventfs
      
       - Update do_div() usage in trace event benchmark test
      
       - Speedup direct function registration with asynchronous RCU callback.
      
         The synchronization was done in the registration code and this caused
         delays when registering direct callbacks. Move the freeing to a
         call_rcu() that will prevent delaying of the registering.
      
       - Replace simple_strtoul() usage with kstrtoul()
      
      * tag 'trace-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        eventfs: Fix a possible null pointer dereference in eventfs_find_events()
        ftrace: Fix possible use-after-free issue in ftrace_location()
        ftrace: Remove unused global 'ftrace_direct_func_count'
        ftrace: Remove unused list 'ftrace_direct_funcs'
        tracing: Improve benchmark test performance by using do_div()
        ftrace: Use asynchronous grace period for register_ftrace_direct()
        ftrace: Replaces simple_strtoul in ftrace
      594d2815
    • Linus Torvalds's avatar
      Merge tag 'probes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 70a66320
      Linus Torvalds authored
      Pull probes updates from Masami Hiramatsu:
      
       - tracing/probes: Add new pseudo-types %pd and %pD support for dumping
         dentry name from 'struct dentry *' and file name from 'struct file *'
      
       - uprobes performance optimizations:
          - Speed up the BPF uprobe event by delaying the fetching of the
            uprobe event arguments that are not used in BPF
          - Avoid locking by speculatively checking whether uprobe event is
            valid
          - Reduce lock contention by using read/write_lock instead of
            spinlock for uprobe list operation. This improved BPF uprobe
            benchmark result 43% on average
      
       - rethook: Remove non-fatal warning messages when tracing stack from
         BPF and skip rcu_is_watching() validation in rethook if possible
      
       - objpool: Optimize objpool (which is used by kretprobes and fprobe as
         rethook backend storage) by inlining functions and avoid caching
         nr_cpu_ids because it is a const value
      
       - fprobe: Add entry/exit callbacks types (code cleanup)
      
       - kprobes: Check ftrace was killed in kprobes if it uses ftrace
      
      * tag 'probes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        kprobe/ftrace: bail out if ftrace was killed
        selftests/ftrace: Fix required features for VFS type test case
        objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids
        objpool: enable inlining objpool_push() and objpool_pop() operations
        rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
        ftrace: make extra rcu_is_watching() validation check optional
        uprobes: reduce contention on uprobes_tree access
        rethook: Remove warning messages printed for finding return address of a frame.
        fprobe: Add entry/exit callbacks types
        selftests/ftrace: add fprobe test cases for VFS type "%pd" and "%pD"
        selftests/ftrace: add kprobe test cases for VFS type "%pd" and "%pD"
        Documentation: tracing: add new type '%pd' and '%pD' for kprobe
        tracing/probes: support '%pD' type for print struct file's name
        tracing/probes: support '%pd' type for print struct dentry's name
        uprobes: add speculative lockless system-wide uprobe filter check
        uprobes: prepare uprobe args buffer lazily
        uprobes: encapsulate preparation of uprobe args buffer
      70a66320
    • Linus Torvalds's avatar
      Merge tag 'bootconfig-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · e9d68251
      Linus Torvalds authored
      Pull bootconfig updates from Masami Hiramatsu:
      
       - Do not put unneeded quotes on the extra command line items which was
         inserted from the bootconfig.
      
       - Remove redundant spaces from the extra command line.
      
      * tag 'bootconfig-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        init/main.c: Minor cleanup for the setup_command_line() function
        init/main.c: Remove redundant space from saved_command_line
        bootconfig: do not put quotes on cmdline items unless necessary
      e9d68251
    • Linus Torvalds's avatar
      Merge tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl · 91b6163b
      Linus Torvalds authored
      Pull sysctl updates from Joel Granados:
      
       - Remove sentinel elements from ctl_table structs in kernel/*
      
         Removing sentinels in ctl_table arrays reduces the build time size
         and runtime memory consumed by ~64 bytes per array. Removals for
         net/, io_uring/, mm/, ipc/ and security/ are set to go into mainline
         through their respective subsystems making the next release the most
         likely place where the final series that removes the check for
         proc_name == NULL will land.
      
         This adds to removals already in arch/, drivers/ and fs/.
      
       - Adjust ctl_table definitions and references to allow constification
           - Remove unused ctl_table function arguments
           - Move non-const elements from ctl_table to ctl_table_header
           - Make ctl_table pointers const in ctl_table_root structure
      
         Making the static ctl_table structs const will increase safety by
         keeping the pointers to proc_handler functions in .rodata. Though no
         ctl_tables where made const in this PR, the ground work for making
         that possible has started with these changes sent by Thomas
         Weißschuh.
      
      * tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
        sysctl: drop now unnecessary out-of-bounds check
        sysctl: move sysctl type to ctl_table_header
        sysctl: drop sysctl_is_perm_empty_ctl_table
        sysctl: treewide: constify argument ctl_table_root::permissions(table)
        sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table)
        bpf: Remove the now superfluous sentinel elements from ctl_table array
        delayacct: Remove the now superfluous sentinel elements from ctl_table array
        kprobes: Remove the now superfluous sentinel elements from ctl_table array
        printk: Remove the now superfluous sentinel elements from ctl_table array
        scheduler: Remove the now superfluous sentinel elements from ctl_table array
        seccomp: Remove the now superfluous sentinel elements from ctl_table array
        timekeeping: Remove the now superfluous sentinel elements from ctl_table array
        ftrace: Remove the now superfluous sentinel elements from ctl_table array
        umh: Remove the now superfluous sentinel elements from ctl_table array
        kernel misc: Remove the now superfluous sentinel elements from ctl_table array
      91b6163b
    • Linus Torvalds's avatar
      Merge tag 'devicetree-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux · 06f054b1
      Linus Torvalds authored
      Pull devicetree updates from Rob Herring:
       "DT Bindings:
      
         - Convert samsung,exynos5-dp, atmel,lcdc, aspeed,ast2400-wdt bindings
           to schemas
      
         - Add bindings for Allwinner H616 NMI controller, Renesas r8a779g0
           irqc, Renesas R-Car V4M TMU and CMT timers, Freescale S32G3
           linflexuart, and Mediatek MT7988 XHCI
      
         - Add 'reg' constraints on DSI and SPI display panels
      
         - More dropping of unnecessary quotes in schemas
      
         - Use full paths rather than relative paths in schema $refs
      
         - Drop redundant storing of phandle for reserved memory
      
        DT Core:
      
         - Use scope based cleanups for kfree() and of_node_put()
      
         - Track interrupt-map and power-supplies for fw_devlink
      
         - Add buffer overflow check in of_modalias()
      
         - Add and use __of_prop_free() helper for freeing struct property"
      
      * tag 'devicetree-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (25 commits)
        of: property: Add fw_devlink support for interrupt-map property
        dt-bindings: display: panel: constrain 'reg' in DSI panels
        dt-bindings: display: panel: constrain 'reg' in SPI panels
        dt-bindings: display: samsung,ams495qa01: add missing SPI properties ref
        dt-bindings: Use full path to other schemas
        dt-bindings: PCI: qcom,pcie-sm8350: Drop redundant 'oneOf' sub-schema
        of: module: add buffer overflow check in of_modalias()
        dt-bindings: PCI: microchip: increase number of items in ranges property
        dt-bindings: Drop unnecessary quotes on keys
        dt-bindings: interrupt-controller: mediatek,mt6577-sysirq: Drop unnecessary quotes
        of: property: Use scope based cleanup on port_node
        of: reserved_mem: Remove the use of phandle from the reserved_mem APIs
        of: property: fw_devlink: Add support for "power-supplies" binding
        dt-bindings: watchdog: aspeed,ast2400-wdt: Convert to DT schema
        dt-bindings: irq: sun7i-nmi: Add binding for the H616 NMI controller
        dt-bindings: interrupt-controller: renesas,irqc: Add r8a779g0 support
        dt-bindings: timer: renesas,tmu: Add R-Car V4M support
        dt-bindings: timer: renesas,cmt: Add R-Car V4M support
        of: Use scope based of_node_put() cleanups
        of: Use scope based kfree() cleanups
        ...
      06f054b1
  6. 17 May, 2024 4 commits