1. 25 Sep, 2024 8 commits
    • net/mlx5e: Fix crash caused by calling __xfrm_state_delete() twice · 7b124695
      Jianbo Liu authored
      The km.state is not checked in driver's delayed work. When
      xfrm_state_check_expire() is called, the state can be reset to
      XFRM_STATE_EXPIRED, even if it is XFRM_STATE_DEAD already. This
      happens when xfrm state is deleted, but not freed yet. As
      __xfrm_state_delete() is called again in xfrm timer, the following
      crash occurs.
      
      To fix this issue, skip xfrm_state_check_expire() if km.state is not
      XFRM_STATE_VALID.
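The effect of the guard can be illustrated with a toy model. This is only a sketch: `KmState`, `check_expire` and `handle_sw_limits` are hypothetical Python stand-ins rather than the driver's code, and the model assumes the hard limit has already been reached, so the expiry check always marks the state as expired.

```python
from enum import Enum, auto


class KmState(Enum):
    VALID = auto()
    EXPIRED = auto()
    DEAD = auto()


def check_expire(state):
    """Toy stand-in for xfrm_state_check_expire(): assumes the hard
    limit was hit, so the state is always marked expired."""
    return KmState.EXPIRED


def handle_sw_limits(state, guarded):
    """Toy stand-in for the delayed work. With the guard enabled, any
    state other than VALID is left untouched."""
    if guarded and state is not KmState.VALID:
        return state
    return check_expire(state)


# Without the guard, a DEAD state is reset to EXPIRED.
unguarded = handle_sw_limits(KmState.DEAD, guarded=False)
# With the guard, a DEAD state stays DEAD.
guarded = handle_sw_limits(KmState.DEAD, guarded=True)
print(unguarded.name, guarded.name)
```

In this model only the unguarded path moves a deleted state back to EXPIRED, which is the condition the commit message identifies as leading to the second __xfrm_state_delete() call.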
      
       Oops: general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] SMP
       CPU: 5 UID: 0 PID: 7448 Comm: kworker/u102:2 Not tainted 6.11.0-rc2+ #1
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Workqueue: mlx5e_ipsec: eth%d mlx5e_ipsec_handle_sw_limits [mlx5_core]
       RIP: 0010:__xfrm_state_delete+0x3d/0x1b0
       Code: 0f 84 8b 01 00 00 48 89 fd c6 87 c8 00 00 00 05 48 8d bb 40 10 00 00 e8 11 04 1a 00 48 8b 95 b8 00 00 00 48 8b 85 c0 00 00 00 <48> 89 42 08 48 89 10 48 8b 55 10 48 b8 00 01 00 00 00 00 ad de 48
       RSP: 0018:ffff88885f945ec8 EFLAGS: 00010246
       RAX: dead000000000122 RBX: ffffffff82afa940 RCX: 0000000000000036
       RDX: dead000000000100 RSI: 0000000000000000 RDI: ffffffff82afb980
       RBP: ffff888109a20340 R08: ffff88885f945ea0 R09: 0000000000000000
       R10: 0000000000000000 R11: ffff88885f945ff8 R12: 0000000000000246
       R13: ffff888109a20340 R14: ffff88885f95f420 R15: ffff88885f95f400
       FS:  0000000000000000(0000) GS:ffff88885f940000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f2163102430 CR3: 00000001128d6001 CR4: 0000000000370eb0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        <IRQ>
        ? die_addr+0x33/0x90
        ? exc_general_protection+0x1a2/0x390
        ? asm_exc_general_protection+0x22/0x30
        ? __xfrm_state_delete+0x3d/0x1b0
        ? __xfrm_state_delete+0x2f/0x1b0
        xfrm_timer_handler+0x174/0x350
        ? __xfrm_state_delete+0x1b0/0x1b0
        __hrtimer_run_queues+0x121/0x270
        hrtimer_run_softirq+0x88/0xd0
        handle_softirqs+0xcc/0x270
        do_softirq+0x3c/0x50
        </IRQ>
        <TASK>
        __local_bh_enable_ip+0x47/0x50
        mlx5e_ipsec_handle_sw_limits+0x7d/0x90 [mlx5_core]
        process_one_work+0x137/0x2d0
        worker_thread+0x28d/0x3a0
        ? rescuer_thread+0x480/0x480
        kthread+0xb8/0xe0
        ? kthread_park+0x80/0x80
        ret_from_fork+0x2d/0x50
        ? kthread_park+0x80/0x80
        ret_from_fork_asm+0x11/0x20
        </TASK>
      
      Fixes: b2f7b01d ("net/mlx5e: Simulate missing IPsec TX limits hardware functionality")
      Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: SHAMPO, Fix overflow of hd_per_wq · 023d2a43
      Dragos Tatulea authored
      With larger RQ sizes and small MTU sizes, the hd_per_wq variable
      can overflow, as in the following case:
      
      $> ethtool --set-ring eth1 rx 8192
      $> ip link set dev eth1 mtu 144
      $> ethtool --features eth1 rx-gro-hw on
      
      ... yields in dmesg:
      
      mlx5_core 0000:08:00.1: mlx5_cmd_out_err:808:(pid 194797): CREATE_MKEY(0x200) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x3bf6f), err(-22)
      
      because hd_per_wq is 64K which overflows to 0 and makes the command
      fail.
      
      This patch increases the variable size to 32 bits.
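The truncation can be checked with a short sketch. It only demonstrates the integer-width arithmetic for the 64K value quoted above; the names `HD_PER_WQ`, `as_u16` and `as_u32` are illustrative and do not come from the driver.

```python
import ctypes

HD_PER_WQ = 64 * 1024  # the 64K value quoted in the commit message

# A 16-bit unsigned field keeps only the low 16 bits of the value.
as_u16 = ctypes.c_uint16(HD_PER_WQ).value
# A 32-bit unsigned field holds the value unchanged.
as_u32 = ctypes.c_uint32(HD_PER_WQ).value

print(as_u16, as_u32)
```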
      
      Fixes: 99be5617 ("net/mlx5e: SHAMPO, Re-enable HW-GRO")
      Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: HWS, changed E2BIG error to a negative return code · d15525f3
      Yevgeny Kliteynik authored
      Fixed all the 'E2BIG' returns in error flow of functions to
      the negative '-E2BIG' as we are using negative error codes
      everywhere in HWS code.
      
      This also fixes the following smatch warnings:
      	"warn: was negative '-E2BIG' intended?"
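Why the sign matters can be shown with a small sketch. It assumes a caller that tests for failure with a `ret < 0` style check; `caller_sees_failure` is a hypothetical helper for illustration, not a function from the HWS code.

```python
import errno


def caller_sees_failure(ret):
    """Mimic an `if (ret < 0)` style error check (an assumption about the caller)."""
    return ret < 0


positive = errno.E2BIG    # the value returned before the fix
negative = -errno.E2BIG   # the value returned after the fix

# The positive code slips past a `< 0` check; the negative one is caught.
print(caller_sees_failure(positive), caller_sees_failure(negative))
```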
      
      Fixes: 74a778b4 ("net/mlx5: HWS, added definers handling")
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Closes: https://lore.kernel.org/all/f8c77688-7d83-4937-baba-ac844dfe2e0b@stanley.mountain/
      Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: HWS, fixed double-free in error flow of creating SQ · d8c56174
      Yevgeny Kliteynik authored
      When SQ creation fails, call the appropriate mlx5_core destroy function.
      
      This fixes the following smatch warnings:
        drivers/net/ethernet/mellanox/mlx5/core/steering/hws/mlx5hws_send.c:739
          hws_send_ring_open_sq() warn: 'sq->dep_wqe' double freed
          hws_send_ring_open_sq() warn: 'sq->wq_ctrl.buf.frags' double freed
          hws_send_ring_open_sq() warn: 'sq->wr_priv' double freed
      
      Fixes: 2ca62599 ("net/mlx5: HWS, added send engine and context handling")
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Closes: https://lore.kernel.org/all/e4ebc227-4b25-49bf-9e4c-14b7ea5c6a07@stanley.mountain/
      Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Fix wrong reserved field in hca_cap_2 in mlx5_ifc · 19da1701
      Yevgeny Kliteynik authored
      Fixing the wrong size of a field in hca_cap_2.
      The bug was introduced by adding new fields for HWS
      and not fixing the reserved field size.
      
      Fixes: 34c626c3 ("net/mlx5: Added missing mlx5_ifc definition for HW Steering")
      Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Fix NULL deref in mlx5e_tir_builder_alloc() · f25389e7
      Elena Salomatkina authored
      In mlx5e_tir_builder_alloc() kvzalloc() may return NULL
      which is dereferenced on the next line in a reference
      to the modify field.
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE.
      
      Fixes: a6696735 ("net/mlx5e: Convert TIR to a dedicated object")
      Signed-off-by: Elena Salomatkina <esalomatkina@ispras.ru>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Added cond_resched() to crdump collection · ec793155
      Mohamed Khalfella authored
      Collecting crdump involves reading vsc registers from the pci config space
      of the mlx device, which can take a long time to complete. This might
      starve other threads waiting to run on the cpu.
      
      Numbers I got from testing ConnectX-5 Ex MCX516A-CDAT in the lab:
      
      - mlx5_vsc_gw_read_block_fast() was called with length = 1310716.
      - mlx5_vsc_gw_read_fast() reads 4 bytes at a time. It was not used to
        read the entire 1310716 bytes. It was called 53813 times because
        there are jumps in read_addr.
      - On average mlx5_vsc_gw_read_fast() took 35284.4ns.
      - In total mlx5_vsc_wait_on_flag() called vsc_read() 54707 times.
        The average time for each call was 17548.3ns. In some instances
        vsc_read() was called more than one time when the flag was not set.
        As expected the thread released the cpu after 16 iterations in
        mlx5_vsc_wait_on_flag().
      - Total time to read crdump was 35284.4ns * 53813 ~= 1.898s.
      
      It was seen in the field that crdump can take more than 5 seconds to
      complete. During that time mlx5_vsc_wait_on_flag() did not release the
      cpu because it did not complete 16 iterations. It is believed that pci
      config reads were slow. Adding cond_resched() every 128 register reads
      improves the situation. In the common case, where crdump takes ~1.8989s,
      the thread yields the cpu every ~4.51ms. If crdump takes ~5s, the thread
      yields the cpu every ~18.0ms.
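The common-case figures can be rechecked with a few lines of arithmetic. This sketch uses only the lab measurements listed above; the constant names are illustrative, and it assumes one cond_resched() call per 128 calls to mlx5_vsc_gw_read_fast().

```python
READS = 53813           # calls to mlx5_vsc_gw_read_fast() reported above
AVG_READ_NS = 35284.4   # average time per call, in nanoseconds
READS_PER_YIELD = 128   # assumed: one cond_resched() per 128 reads

# Total time spent reading the crdump, in seconds.
total_s = READS * AVG_READ_NS / 1e9
# Time between two consecutive yields, in milliseconds.
yield_interval_ms = READS_PER_YIELD * AVG_READ_NS / 1e6

print(f"total: {total_s:.4f} s, yield interval: {yield_interval_ms:.3f} ms")
```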
      
      Fixes: 8b9d8baa ("net/mlx5: Add Crdump support")
      Reviewed-by: Yuanyuan Zhong <yzhong@purestorage.com>
      Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Fix error path in multi-packet WQE transmit · 2bcae12c
      Gerd Bayer authored
      Remove the erroneous unmap in case no DMA mapping was established
      
      The multi-packet WQE transmit code attempts to obtain a DMA mapping for
      the skb. This could fail, e.g. under memory pressure, when the IOMMU
      driver just can't allocate more memory for page tables. While the code
      tries to handle this in the path below the err_unmap label it erroneously
      unmaps one entry from the sq's FIFO list of active mappings. Since the
      current map attempt failed this unmap is removing some random DMA mapping
      that might still be required. If the PCI function now presents that IOVA,
      the IOMMU may assume a rogue DMA access and, e.g. on s390, put the PCI
      function in error state.
      
      The erroneous behavior was seen in a stress-test environment that created
      memory pressure.
      
      Fixes: 5af75c74 ("net/mlx5e: Enhanced TX MPWQE for SKBs")
      Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
      Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
      Acked-by: Maxim Mikityanskiy <maxtram95@gmail.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
  2. 24 Sep, 2024 10 commits
  3. 23 Sep, 2024 1 commit
    • tcp: check skb is non-NULL in tcp_rto_delta_us() · c8770db2
      Josh Hunt authored
      We have some machines running stock Ubuntu 20.04.6 (5.4.0-174-generic kernel)
      with ceph that recently hit a NULL pointer dereference in tcp_rearm_rto(). We
      initially hit it from the TLP path, and later also saw it from the RACK case.
      Here are examples of the oops messages we saw in each of those cases:
      
      Jul 26 15:05:02 rx [11061395.780353] BUG: kernel NULL pointer dereference, address: 0000000000000020
      Jul 26 15:05:02 rx [11061395.787572] #PF: supervisor read access in kernel mode
      Jul 26 15:05:02 rx [11061395.792971] #PF: error_code(0x0000) - not-present page
      Jul 26 15:05:02 rx [11061395.798362] PGD 0 P4D 0
      Jul 26 15:05:02 rx [11061395.801164] Oops: 0000 [#1] SMP NOPTI
      Jul 26 15:05:02 rx [11061395.805091] CPU: 0 PID: 9180 Comm: msgr-worker-1 Tainted: G W 5.4.0-174-generic #193-Ubuntu
      Jul 26 15:05:02 rx [11061395.814996] Hardware name: Supermicro SMC 2x26 os-gen8 64C NVME-Y 256G/H12SSW-NTR, BIOS 2.5.V1.2U.NVMe.UEFI 05/09/2023
      Jul 26 15:05:02 rx [11061395.825952] RIP: 0010:tcp_rearm_rto+0xe4/0x160
      Jul 26 15:05:02 rx [11061395.830656] Code: 87 ca 04 00 00 00 5b 41 5c 41 5d 5d c3 c3 49 8b bc 24 40 06 00 00 eb 8d 48 bb cf f7 53 e3 a5 9b c4 20 4c 89 ef e8 0c fe 0e 00 <48> 8b 78 20 48 c1 ef 03 48 89 f8 41 8b bc 24 80 04 00 00 48 f7 e3
      Jul 26 15:05:02 rx [11061395.849665] RSP: 0018:ffffb75d40003e08 EFLAGS: 00010246
      Jul 26 15:05:02 rx [11061395.855149] RAX: 0000000000000000 RBX: 20c49ba5e353f7cf RCX: 0000000000000000
      Jul 26 15:05:02 rx [11061395.862542] RDX: 0000000062177c30 RSI: 000000000000231c RDI: ffff9874ad283a60
      Jul 26 15:05:02 rx [11061395.869933] RBP: ffffb75d40003e20 R08: 0000000000000000 R09: ffff987605e20aa8
      Jul 26 15:05:02 rx [11061395.877318] R10: ffffb75d40003f00 R11: ffffb75d4460f740 R12: ffff9874ad283900
      Jul 26 15:05:02 rx [11061395.884710] R13: ffff9874ad283a60 R14: ffff9874ad283980 R15: ffff9874ad283d30
      Jul 26 15:05:02 rx [11061395.892095] FS: 00007f1ef4a2e700(0000) GS:ffff987605e00000(0000) knlGS:0000000000000000
      Jul 26 15:05:02 rx [11061395.900438] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Jul 26 15:05:02 rx [11061395.906435] CR2: 0000000000000020 CR3: 0000003e450ba003 CR4: 0000000000760ef0
      Jul 26 15:05:02 rx [11061395.913822] PKRU: 55555554
      Jul 26 15:05:02 rx [11061395.916786] Call Trace:
      Jul 26 15:05:02 rx [11061395.919488]
      Jul 26 15:05:02 rx [11061395.921765] ? show_regs.cold+0x1a/0x1f
      Jul 26 15:05:02 rx [11061395.925859] ? __die+0x90/0xd9
      Jul 26 15:05:02 rx [11061395.929169] ? no_context+0x196/0x380
      Jul 26 15:05:02 rx [11061395.933088] ? ip6_protocol_deliver_rcu+0x4e0/0x4e0
      Jul 26 15:05:02 rx [11061395.938216] ? ip6_sublist_rcv_finish+0x3d/0x50
      Jul 26 15:05:02 rx [11061395.943000] ? __bad_area_nosemaphore+0x50/0x1a0
      Jul 26 15:05:02 rx [11061395.947873] ? bad_area_nosemaphore+0x16/0x20
      Jul 26 15:05:02 rx [11061395.952486] ? do_user_addr_fault+0x267/0x450
      Jul 26 15:05:02 rx [11061395.957104] ? ipv6_list_rcv+0x112/0x140
      Jul 26 15:05:02 rx [11061395.961279] ? __do_page_fault+0x58/0x90
      Jul 26 15:05:02 rx [11061395.965458] ? do_page_fault+0x2c/0xe0
      Jul 26 15:05:02 rx [11061395.969465] ? page_fault+0x34/0x40
      Jul 26 15:05:02 rx [11061395.973217] ? tcp_rearm_rto+0xe4/0x160
      Jul 26 15:05:02 rx [11061395.977313] ? tcp_rearm_rto+0xe4/0x160
      Jul 26 15:05:02 rx [11061395.981408] tcp_send_loss_probe+0x10b/0x220
      Jul 26 15:05:02 rx [11061395.985937] tcp_write_timer_handler+0x1b4/0x240
      Jul 26 15:05:02 rx [11061395.990809] tcp_write_timer+0x9e/0xe0
      Jul 26 15:05:02 rx [11061395.994814] ? tcp_write_timer_handler+0x240/0x240
      Jul 26 15:05:02 rx [11061395.999866] call_timer_fn+0x32/0x130
      Jul 26 15:05:02 rx [11061396.003782] __run_timers.part.0+0x180/0x280
      Jul 26 15:05:02 rx [11061396.008309] ? recalibrate_cpu_khz+0x10/0x10
      Jul 26 15:05:02 rx [11061396.012841] ? native_x2apic_icr_write+0x30/0x30
      Jul 26 15:05:02 rx [11061396.017718] ? lapic_next_event+0x21/0x30
      Jul 26 15:05:02 rx [11061396.021984] ? clockevents_program_event+0x8f/0xe0
      Jul 26 15:05:02 rx [11061396.027035] run_timer_softirq+0x2a/0x50
      Jul 26 15:05:02 rx [11061396.031212] __do_softirq+0xd1/0x2c1
      Jul 26 15:05:02 rx [11061396.035044] do_softirq_own_stack+0x2a/0x40
      Jul 26 15:05:02 rx [11061396.039480]
      Jul 26 15:05:02 rx [11061396.041840] do_softirq.part.0+0x46/0x50
      Jul 26 15:05:02 rx [11061396.046022] __local_bh_enable_ip+0x50/0x60
      Jul 26 15:05:02 rx [11061396.050460] _raw_spin_unlock_bh+0x1e/0x20
      Jul 26 15:05:02 rx [11061396.054817] nf_conntrack_tcp_packet+0x29e/0xbe0 [nf_conntrack]
      Jul 26 15:05:02 rx [11061396.060994] ? get_l4proto+0xe7/0x190 [nf_conntrack]
      Jul 26 15:05:02 rx [11061396.066220] nf_conntrack_in+0xe9/0x670 [nf_conntrack]
      Jul 26 15:05:02 rx [11061396.071618] ipv6_conntrack_local+0x14/0x20 [nf_conntrack]
      Jul 26 15:05:02 rx [11061396.077356] nf_hook_slow+0x45/0xb0
      Jul 26 15:05:02 rx [11061396.081098] ip6_xmit+0x3f0/0x5d0
      Jul 26 15:05:02 rx [11061396.084670] ? ipv6_anycast_cleanup+0x50/0x50
      Jul 26 15:05:02 rx [11061396.089282] ? __sk_dst_check+0x38/0x70
      Jul 26 15:05:02 rx [11061396.093381] ? inet6_csk_route_socket+0x13b/0x200
      Jul 26 15:05:02 rx [11061396.098346] inet6_csk_xmit+0xa7/0xf0
      Jul 26 15:05:02 rx [11061396.102263] __tcp_transmit_skb+0x550/0xb30
      Jul 26 15:05:02 rx [11061396.106701] tcp_write_xmit+0x3c6/0xc20
      Jul 26 15:05:02 rx [11061396.110792] ? __alloc_skb+0x98/0x1d0
      Jul 26 15:05:02 rx [11061396.114708] __tcp_push_pending_frames+0x37/0x100
      Jul 26 15:05:02 rx [11061396.119667] tcp_push+0xfd/0x100
      Jul 26 15:05:02 rx [11061396.123150] tcp_sendmsg_locked+0xc70/0xdd0
      Jul 26 15:05:02 rx [11061396.127588] tcp_sendmsg+0x2d/0x50
      Jul 26 15:05:02 rx [11061396.131245] inet6_sendmsg+0x43/0x70
      Jul 26 15:05:02 rx [11061396.135075] __sock_sendmsg+0x48/0x70
      Jul 26 15:05:02 rx [11061396.138994] ____sys_sendmsg+0x212/0x280
      Jul 26 15:05:02 rx [11061396.143172] ___sys_sendmsg+0x88/0xd0
      Jul 26 15:05:02 rx [11061396.147098] ? __seccomp_filter+0x7e/0x6b0
      Jul 26 15:05:02 rx [11061396.151446] ? __switch_to+0x39c/0x460
      Jul 26 15:05:02 rx [11061396.155453] ? __switch_to_asm+0x42/0x80
      Jul 26 15:05:02 rx [11061396.159636] ? __switch_to_asm+0x5a/0x80
      Jul 26 15:05:02 rx [11061396.163816] __sys_sendmsg+0x5c/0xa0
      Jul 26 15:05:02 rx [11061396.167647] __x64_sys_sendmsg+0x1f/0x30
      Jul 26 15:05:02 rx [11061396.171832] do_syscall_64+0x57/0x190
      Jul 26 15:05:02 rx [11061396.175748] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
      Jul 26 15:05:02 rx [11061396.181055] RIP: 0033:0x7f1ef692618d
      Jul 26 15:05:02 rx [11061396.184893] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 ca ee ff ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2f 44 89 c7 48 89 44 24 08 e8 fe ee ff ff 48
      Jul 26 15:05:02 rx [11061396.203889] RSP: 002b:00007f1ef4a26aa0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
      Jul 26 15:05:02 rx [11061396.211708] RAX: ffffffffffffffda RBX: 000000000000084b RCX: 00007f1ef692618d
      Jul 26 15:05:02 rx [11061396.219091] RDX: 0000000000004000 RSI: 00007f1ef4a26b10 RDI: 0000000000000275
      Jul 26 15:05:02 rx [11061396.226475] RBP: 0000000000004000 R08: 0000000000000000 R09: 0000000000000020
      Jul 26 15:05:02 rx [11061396.233859] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000084b
      Jul 26 15:05:02 rx [11061396.241243] R13: 00007f1ef4a26b10 R14: 0000000000000275 R15: 000055592030f1e8
      Jul 26 15:05:02 rx [11061396.248628] Modules linked in: vrf bridge stp llc vxlan ip6_udp_tunnel udp_tunnel nls_iso8859_1 amd64_edac_mod edac_mce_amd kvm_amd kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper wmi_bmof ipmi_ssif input_leds joydev rndis_host cdc_ether usbnet mii ast drm_vram_helper ttm drm_kms_helper i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt ccp mac_hid ipmi_si ipmi_devintf ipmi_msghandler nft_ct sch_fq_codel nf_tables_set nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ramoops reed_solomon efi_pstore drm ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear mlx5_ib ib_uverbs ib_core raid1 mlx5_core hid_generic pci_hyperv_intf crc32_pclmul tls usbhid ahci mlxfw bnxt_en libahci hid nvme i2c_piix4 nvme_core wmi
      Jul 26 15:05:02 rx [11061396.324334] CR2: 0000000000000020
      Jul 26 15:05:02 rx [11061396.327944] ---[ end trace 68a2b679d1cfb4f1 ]---
      Jul 26 15:05:02 rx [11061396.433435] RIP: 0010:tcp_rearm_rto+0xe4/0x160
      Jul 26 15:05:02 rx [11061396.438137] Code: 87 ca 04 00 00 00 5b 41 5c 41 5d 5d c3 c3 49 8b bc 24 40 06 00 00 eb 8d 48 bb cf f7 53 e3 a5 9b c4 20 4c 89 ef e8 0c fe 0e 00 <48> 8b 78 20 48 c1 ef 03 48 89 f8 41 8b bc 24 80 04 00 00 48 f7 e3
      Jul 26 15:05:02 rx [11061396.457144] RSP: 0018:ffffb75d40003e08 EFLAGS: 00010246
      Jul 26 15:05:02 rx [11061396.462629] RAX: 0000000000000000 RBX: 20c49ba5e353f7cf RCX: 0000000000000000
      Jul 26 15:05:02 rx [11061396.470012] RDX: 0000000062177c30 RSI: 000000000000231c RDI: ffff9874ad283a60
      Jul 26 15:05:02 rx [11061396.477396] RBP: ffffb75d40003e20 R08: 0000000000000000 R09: ffff987605e20aa8
      Jul 26 15:05:02 rx [11061396.484779] R10: ffffb75d40003f00 R11: ffffb75d4460f740 R12: ffff9874ad283900
      Jul 26 15:05:02 rx [11061396.492164] R13: ffff9874ad283a60 R14: ffff9874ad283980 R15: ffff9874ad283d30
      Jul 26 15:05:02 rx [11061396.499547] FS: 00007f1ef4a2e700(0000) GS:ffff987605e00000(0000) knlGS:0000000000000000
      Jul 26 15:05:02 rx [11061396.507886] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Jul 26 15:05:02 rx [11061396.513884] CR2: 0000000000000020 CR3: 0000003e450ba003 CR4: 0000000000760ef0
      Jul 26 15:05:02 rx [11061396.521267] PKRU: 55555554
      Jul 26 15:05:02 rx [11061396.524230] Kernel panic - not syncing: Fatal exception in interrupt
      Jul 26 15:05:02 rx [11061396.530885] Kernel Offset: 0x1b200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      Jul 26 15:05:03 rx [11061396.660181] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
      
      After we hit this we disabled TLP by setting tcp_early_retrans to 0 and then hit the crash in the RACK case:
      
      Aug 7 07:26:16 rx [1006006.265582] BUG: kernel NULL pointer dereference, address: 0000000000000020
      Aug 7 07:26:16 rx [1006006.272719] #PF: supervisor read access in kernel mode
      Aug 7 07:26:16 rx [1006006.278030] #PF: error_code(0x0000) - not-present page
      Aug 7 07:26:16 rx [1006006.283343] PGD 0 P4D 0
      Aug 7 07:26:16 rx [1006006.286057] Oops: 0000 [#1] SMP NOPTI
      Aug 7 07:26:16 rx [1006006.289896] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 5.4.0-174-generic #193-Ubuntu
      Aug 7 07:26:16 rx [1006006.299107] Hardware name: Supermicro SMC 2x26 os-gen8 64C NVME-Y 256G/H12SSW-NTR, BIOS 2.5.V1.2U.NVMe.UEFI 05/09/2023
      Aug 7 07:26:16 rx [1006006.309970] RIP: 0010:tcp_rearm_rto+0xe4/0x160
      Aug 7 07:26:16 rx [1006006.314584] Code: 87 ca 04 00 00 00 5b 41 5c 41 5d 5d c3 c3 49 8b bc 24 40 06 00 00 eb 8d 48 bb cf f7 53 e3 a5 9b c4 20 4c 89 ef e8 0c fe 0e 00 <48> 8b 78 20 48 c1 ef 03 48 89 f8 41 8b bc 24 80 04 00 00 48 f7 e3
      Aug 7 07:26:16 rx [1006006.333499] RSP: 0018:ffffb42600a50960 EFLAGS: 00010246
      Aug 7 07:26:16 rx [1006006.338895] RAX: 0000000000000000 RBX: 20c49ba5e353f7cf RCX: 0000000000000000
      Aug 7 07:26:16 rx [1006006.346193] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff92d687ed8160
      Aug 7 07:26:16 rx [1006006.353489] RBP: ffffb42600a50978 R08: 0000000000000000 R09: 00000000cd896dcc
      Aug 7 07:26:16 rx [1006006.360786] R10: ffff92dc3404f400 R11: 0000000000000001 R12: ffff92d687ed8000
      Aug 7 07:26:16 rx [1006006.368084] R13: ffff92d687ed8160 R14: 00000000cd896dcc R15: 00000000cd8fca81
      Aug 7 07:26:16 rx [1006006.375381] FS: 0000000000000000(0000) GS:ffff93158ad40000(0000) knlGS:0000000000000000
      Aug 7 07:26:16 rx [1006006.383632] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Aug 7 07:26:16 rx [1006006.389544] CR2: 0000000000000020 CR3: 0000003e775ce006 CR4: 0000000000760ee0
      Aug 7 07:26:16 rx [1006006.396839] PKRU: 55555554
      Aug 7 07:26:16 rx [1006006.399717] Call Trace:
      Aug 7 07:26:16 rx [1006006.402335]
      Aug 7 07:26:16 rx [1006006.404525] ? show_regs.cold+0x1a/0x1f
      Aug 7 07:26:16 rx [1006006.408532] ? __die+0x90/0xd9
      Aug 7 07:26:16 rx [1006006.411760] ? no_context+0x196/0x380
      Aug 7 07:26:16 rx [1006006.415599] ? __bad_area_nosemaphore+0x50/0x1a0
      Aug 7 07:26:16 rx [1006006.420392] ? _raw_spin_lock+0x1e/0x30
      Aug 7 07:26:16 rx [1006006.424401] ? bad_area_nosemaphore+0x16/0x20
      Aug 7 07:26:16 rx [1006006.428927] ? do_user_addr_fault+0x267/0x450
      Aug 7 07:26:16 rx [1006006.433450] ? __do_page_fault+0x58/0x90
      Aug 7 07:26:16 rx [1006006.437542] ? do_page_fault+0x2c/0xe0
      Aug 7 07:26:16 rx [1006006.441470] ? page_fault+0x34/0x40
      Aug 7 07:26:16 rx [1006006.445134] ? tcp_rearm_rto+0xe4/0x160
      Aug 7 07:26:16 rx [1006006.449145] tcp_ack+0xa32/0xb30
      Aug 7 07:26:16 rx [1006006.452542] tcp_rcv_established+0x13c/0x670
      Aug 7 07:26:16 rx [1006006.456981] ? sk_filter_trim_cap+0x48/0x220
      Aug 7 07:26:16 rx [1006006.461419] tcp_v6_do_rcv+0xdb/0x450
      Aug 7 07:26:16 rx [1006006.465257] tcp_v6_rcv+0xc2b/0xd10
      Aug 7 07:26:16 rx [1006006.468918] ip6_protocol_deliver_rcu+0xd3/0x4e0
      Aug 7 07:26:16 rx [1006006.473706] ip6_input_finish+0x15/0x20
      Aug 7 07:26:16 rx [1006006.477710] ip6_input+0xa2/0xb0
      Aug 7 07:26:16 rx [1006006.481109] ? ip6_protocol_deliver_rcu+0x4e0/0x4e0
      Aug 7 07:26:16 rx [1006006.486151] ip6_sublist_rcv_finish+0x3d/0x50
      Aug 7 07:26:16 rx [1006006.490679] ip6_sublist_rcv+0x1aa/0x250
      Aug 7 07:26:16 rx [1006006.494779] ? ip6_rcv_finish_core.isra.0+0xa0/0xa0
      Aug 7 07:26:16 rx [1006006.499828] ipv6_list_rcv+0x112/0x140
      Aug 7 07:26:16 rx [1006006.503748] __netif_receive_skb_list_core+0x1a4/0x250
      Aug 7 07:26:16 rx [1006006.509057] netif_receive_skb_list_internal+0x1a1/0x2b0
      Aug 7 07:26:16 rx [1006006.514538] gro_normal_list.part.0+0x1e/0x40
      Aug 7 07:26:16 rx [1006006.519068] napi_complete_done+0x91/0x130
      Aug 7 07:26:16 rx [1006006.523352] mlx5e_napi_poll+0x18e/0x610 [mlx5_core]
      Aug 7 07:26:16 rx [1006006.528481] net_rx_action+0x142/0x390
      Aug 7 07:26:16 rx [1006006.532398] __do_softirq+0xd1/0x2c1
      Aug 7 07:26:16 rx [1006006.536142] irq_exit+0xae/0xb0
      Aug 7 07:26:16 rx [1006006.539452] do_IRQ+0x5a/0xf0
      Aug 7 07:26:16 rx [1006006.542590] common_interrupt+0xf/0xf
      Aug 7 07:26:16 rx [1006006.546421]
      Aug 7 07:26:16 rx [1006006.548695] RIP: 0010:native_safe_halt+0xe/0x10
      Aug 7 07:26:16 rx [1006006.553399] Code: 7b ff ff ff eb bd 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d 36 2c 50 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 26 2c 50 00 fb f4 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 e8 dd 5e 61 ff 65
      Aug 7 07:26:16 rx [1006006.572309] RSP: 0018:ffffb42600177e70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffc2
      Aug 7 07:26:16 rx [1006006.580040] RAX: ffffffff8ed08b20 RBX: 0000000000000005 RCX: 0000000000000001
      Aug 7 07:26:16 rx [1006006.587337] RDX: 00000000f48eeca2 RSI: 0000000000000082 RDI: 0000000000000082
      Aug 7 07:26:16 rx [1006006.594635] RBP: ffffb42600177e90 R08: 0000000000000000 R09: 000000000000020f
      Aug 7 07:26:16 rx [1006006.601931] R10: 0000000000100000 R11: 0000000000000000 R12: 0000000000000005
      Aug 7 07:26:16 rx [1006006.609229] R13: ffff93157deb5f00 R14: 0000000000000000 R15: 0000000000000000
      Aug 7 07:26:16 rx [1006006.616530] ? __cpuidle_text_start+0x8/0x8
      Aug 7 07:26:16 rx [1006006.620886] ? default_idle+0x20/0x140
      Aug 7 07:26:16 rx [1006006.624804] arch_cpu_idle+0x15/0x20
      Aug 7 07:26:16 rx [1006006.628545] default_idle_call+0x23/0x30
      Aug 7 07:26:16 rx [1006006.632640] do_idle+0x1fb/0x270
      Aug 7 07:26:16 rx [1006006.636035] cpu_startup_entry+0x20/0x30
      Aug 7 07:26:16 rx [1006006.640126] start_secondary+0x178/0x1d0
      Aug 7 07:26:16 rx [1006006.644218] secondary_startup_64+0xa4/0xb0
      Aug 7 07:26:17 rx [1006006.648568] Modules linked in: vrf bridge stp llc vxlan ip6_udp_tunnel udp_tunnel nls_iso8859_1 nft_ct amd64_edac_mod edac_mce_amd kvm_amd kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper wmi_bmof ipmi_ssif input_leds joydev rndis_host cdc_ether usbnet ast mii drm_vram_helper ttm drm_kms_helper i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt ccp mac_hid ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel nf_tables_set nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ramoops reed_solomon efi_pstore drm ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear mlx5_ib ib_uverbs ib_core raid1 hid_generic mlx5_core pci_hyperv_intf crc32_pclmul usbhid ahci tls mlxfw bnxt_en hid libahci nvme i2c_piix4 nvme_core wmi [last unloaded: cpuid]
      Aug 7 07:26:17 rx [1006006.726180] CR2: 0000000000000020
      Aug 7 07:26:17 rx [1006006.729718] ---[ end trace e0e2e37e4e612984 ]---
      
      Prior to seeing the first crash and on other machines we also see the warning in
      tcp_send_loss_probe() where packets_out is non-zero, but both transmit and retrans
      queues are empty so we know the box is seeing some accounting issue in this area:
      
      Jul 26 09:15:27 kernel: ------------[ cut here ]------------
      Jul 26 09:15:27 kernel: invalid inflight: 2 state 1 cwnd 68 mss 8988
      Jul 26 09:15:27 kernel: WARNING: CPU: 16 PID: 0 at net/ipv4/tcp_output.c:2605 tcp_send_loss_probe+0x214/0x220
      Jul 26 09:15:27 kernel: Modules linked in: vrf bridge stp llc vxlan ip6_udp_tunnel udp_tunnel nls_iso8859_1 nft_ct amd64_edac_mod edac_mce_amd kvm_amd kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper wmi_bmof ipmi_ssif joydev input_leds rndis_host cdc_ether usbnet mii ast drm_vram_helper ttm drm_kms_he>
      Jul 26 09:15:27 kernel: CPU: 16 PID: 0 Comm: swapper/16 Not tainted 5.4.0-174-generic #193-Ubuntu
      Jul 26 09:15:27 kernel: Hardware name: Supermicro SMC 2x26 os-gen8 64C NVME-Y 256G/H12SSW-NTR, BIOS 2.5.V1.2U.NVMe.UEFI 05/09/2023
      Jul 26 09:15:27 kernel: RIP: 0010:tcp_send_loss_probe+0x214/0x220
      Jul 26 09:15:27 kernel: Code: 08 26 01 00 75 e2 41 0f b6 54 24 12 41 8b 8c 24 c0 06 00 00 45 89 f0 48 c7 c7 e0 b4 20 a7 c6 05 8d 08 26 01 01 e8 4a c0 0f 00 <0f> 0b eb ba 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
      Jul 26 09:15:27 kernel: RSP: 0018:ffffb7838088ce00 EFLAGS: 00010286
      Jul 26 09:15:27 kernel: RAX: 0000000000000000 RBX: ffff9b84b5630430 RCX: 0000000000000006
      Jul 26 09:15:27 kernel: RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff9b8e4621c8c0
      Jul 26 09:15:27 kernel: RBP: ffffb7838088ce18 R08: 0000000000000927 R09: 0000000000000004
      Jul 26 09:15:27 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff9b84b5630000
      Jul 26 09:15:27 kernel: R13: 0000000000000000 R14: 000000000000231c R15: ffff9b84b5630430
      Jul 26 09:15:27 kernel: FS: 0000000000000000(0000) GS:ffff9b8e46200000(0000) knlGS:0000000000000000
      Jul 26 09:15:27 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Jul 26 09:15:27 kernel: CR2: 000056238cec2380 CR3: 0000003e49ede005 CR4: 0000000000760ee0
      Jul 26 09:15:27 kernel: PKRU: 55555554
      Jul 26 09:15:27 kernel: Call Trace:
      Jul 26 09:15:27 kernel: <IRQ>
      Jul 26 09:15:27 kernel: ? show_regs.cold+0x1a/0x1f
      Jul 26 09:15:27 kernel: ? __warn+0x98/0xe0
      Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220
      Jul 26 09:15:27 kernel: ? report_bug+0xd1/0x100
      Jul 26 09:15:27 kernel: ? do_error_trap+0x9b/0xc0
      Jul 26 09:15:27 kernel: ? do_invalid_op+0x3c/0x50
      Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220
      Jul 26 09:15:27 kernel: ? invalid_op+0x1e/0x30
      Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220
      Jul 26 09:15:27 kernel: tcp_write_timer_handler+0x1b4/0x240
      Jul 26 09:15:27 kernel: tcp_write_timer+0x9e/0xe0
      Jul 26 09:15:27 kernel: ? tcp_write_timer_handler+0x240/0x240
      Jul 26 09:15:27 kernel: call_timer_fn+0x32/0x130
      Jul 26 09:15:27 kernel: __run_timers.part.0+0x180/0x280
      Jul 26 09:15:27 kernel: ? timerqueue_add+0x9b/0xb0
      Jul 26 09:15:27 kernel: ? enqueue_hrtimer+0x3d/0x90
      Jul 26 09:15:27 kernel: ? do_error_trap+0x9b/0xc0
      Jul 26 09:15:27 kernel: ? do_invalid_op+0x3c/0x50
      Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220
      Jul 26 09:15:27 kernel: ? invalid_op+0x1e/0x30
      Jul 26 09:15:27 kernel: ? tcp_send_loss_probe+0x214/0x220
      Jul 26 09:15:27 kernel: tcp_write_timer_handler+0x1b4/0x240
      Jul 26 09:15:27 kernel: tcp_write_timer+0x9e/0xe0
      Jul 26 09:15:27 kernel: ? tcp_write_timer_handler+0x240/0x240
      Jul 26 09:15:27 kernel: call_timer_fn+0x32/0x130
      Jul 26 09:15:27 kernel: __run_timers.part.0+0x180/0x280
      Jul 26 09:15:27 kernel: ? timerqueue_add+0x9b/0xb0
      Jul 26 09:15:27 kernel: ? enqueue_hrtimer+0x3d/0x90
      Jul 26 09:15:27 kernel: ? recalibrate_cpu_khz+0x10/0x10
      Jul 26 09:15:27 kernel: ? ktime_get+0x3e/0xa0
      Jul 26 09:15:27 kernel: ? native_x2apic_icr_write+0x30/0x30
      Jul 26 09:15:27 kernel: run_timer_softirq+0x2a/0x50
      Jul 26 09:15:27 kernel: __do_softirq+0xd1/0x2c1
      Jul 26 09:15:27 kernel: irq_exit+0xae/0xb0
      Jul 26 09:15:27 kernel: smp_apic_timer_interrupt+0x7b/0x140
      Jul 26 09:15:27 kernel: apic_timer_interrupt+0xf/0x20
      Jul 26 09:15:27 kernel: </IRQ>
      Jul 26 09:15:27 kernel: RIP: 0010:native_safe_halt+0xe/0x10
      Jul 26 09:15:27 kernel: Code: 7b ff ff ff eb bd 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d 36 2c 50 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 26 2c 50 00 fb f4 <c3> 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 e8 dd 5e 61 ff 65
      Jul 26 09:15:27 kernel: RSP: 0018:ffffb783801cfe70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
      Jul 26 09:15:27 kernel: RAX: ffffffffa6908b20 RBX: 0000000000000010 RCX: 0000000000000001
      Jul 26 09:15:27 kernel: RDX: 000000006fc0c97e RSI: 0000000000000082 RDI: 0000000000000082
      Jul 26 09:15:27 kernel: RBP: ffffb783801cfe90 R08: 0000000000000000 R09: 0000000000000225
      Jul 26 09:15:27 kernel: R10: 0000000000100000 R11: 0000000000000000 R12: 0000000000000010
      Jul 26 09:15:27 kernel: R13: ffff9b8e390b0000 R14: 0000000000000000 R15: 0000000000000000
      Jul 26 09:15:27 kernel: ? __cpuidle_text_start+0x8/0x8
      Jul 26 09:15:27 kernel: ? default_idle+0x20/0x140
      Jul 26 09:15:27 kernel: arch_cpu_idle+0x15/0x20
      Jul 26 09:15:27 kernel: default_idle_call+0x23/0x30
      Jul 26 09:15:27 kernel: do_idle+0x1fb/0x270
      Jul 26 09:15:27 kernel: cpu_startup_entry+0x20/0x30
      Jul 26 09:15:27 kernel: start_secondary+0x178/0x1d0
      Jul 26 09:15:27 kernel: secondary_startup_64+0xa4/0xb0
      Jul 26 09:15:27 kernel: ---[ end trace e7ac822987e33be1 ]---
      
      The NULL ptr deref is coming from tcp_rto_delta_us() attempting to pull an skb
      off the head of the retransmit queue and then dereferencing that skb to get the
      skb_mstamp_ns value via tcp_skb_timestamp_us(skb).
      
      The crash is the same one that was reported a number of years ago here:
      https://lore.kernel.org/netdev/86c0f836-9a7c-438b-d81a-839be45f1f58@gmail.com/T/#t
      
      and the kernel we're running has the fix which was added to resolve this issue.
      
      Unfortunately we've been unsuccessful so far in reproducing this problem in the
      lab and do not have the luxury of pushing out a new kernel to try and test if
      newer kernels resolve this issue at the moment. I realize this is a report
      against both an Ubuntu kernel and also an older 5.4 kernel. I have reported this
      issue to Ubuntu here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2077657
      however I feel like since this issue has possibly cropped up again it makes
      sense to build in some protection in this path (even on the latest kernel
      versions) since the code in question just blindly assumes there's a valid skb
      without testing if it's NULL before it looks at the timestamp.
      
      Given we have seen crashes in this path before and now this case it seems like
      we should protect ourselves for when packets_out accounting is incorrect.
      While we should fix that root cause we should also just make sure the skb
      is not NULL before dereferencing it. Also add a warn once here to capture
      some information if/when the problem case is hit again.
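
A minimal userspace sketch of the guard described above, for illustration only. The `model_*` types and fields are hypothetical stand-ins for the kernel's socket and skb structures, and falling back to a full RTO when the queue head is missing is an assumption of this sketch rather than something the message states.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for struct sk_buff; not the kernel type. */
struct model_skb {
    uint64_t tstamp_us;               /* time the segment was sent, in usec */
};

/* Hypothetical stand-in for the TCP socket state used by the helper. */
struct model_sock {
    const struct model_skb *rtx_head; /* head of the retransmit queue, may be NULL */
    uint64_t now_us;                  /* current TCP clock, in usec */
    uint32_t rto_us;                  /* retransmission timeout, in usec */
    uint32_t packets_out;             /* accounting that may disagree with the queue */
};

/* Time left until the RTO fires. Never dereferences a NULL queue head. */
static int64_t model_rto_delta_us(const struct model_sock *sk)
{
    static int warned;                /* models a warn-once: report only the first hit */
    const struct model_skb *skb = sk->rtx_head;

    if (skb != NULL)
        return (int64_t)(skb->tstamp_us + sk->rto_us) - (int64_t)sk->now_us;

    if (!warned) {
        warned = 1;
        fprintf(stderr, "rtx queue empty: packets_out=%u rto_us=%u\n",
                (unsigned)sk->packets_out, (unsigned)sk->rto_us);
    }
    return (int64_t)sk->rto_us;       /* assumption: fall back to a full RTO */
}
```

With a queued segment stamped at 1000 us, an RTO of 2000 us and a clock of 1500 us, the helper reports 1500 us remaining; with an empty queue it reports the full 2000 us instead of crashing.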
      
      Fixes: e1a10ef7 ("tcp: introduce tcp_rto_delta_us() helper for xmit timer fix")
      Signed-off-by: Josh Hunt <johunt@akamai.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c8770db2
  4. 22 Sep, 2024 2 commits
  5. 19 Sep, 2024 5 commits
    • Kaixin Wang's avatar
      net: seeq: Fix use after free vulnerability in ether3 Driver Due to Race Condition · b5109b60
      Kaixin Wang authored
      In the ether3_probe function, a timer is initialized with a callback
      function ether3_ledoff, bound to &priv(dev)->timer. Once the timer is
      started, there is a risk of a race condition if the module or device
      is removed, triggering the ether3_remove function to perform cleanup.
      The sequence of operations that may lead to a UAF bug is as follows:
      
      CPU0                    |  CPU1
                              |
                              |  ether3_ledoff
      ether3_remove           |
        free_netdev(dev);     |
          put_device          |
          kfree(dev);         |
                              |  ether3_outw(priv(dev)->regs.config2 |= CFG2_CTRLO, REG_CONFIG2);
                              |  // use dev
      
      Fix it by ensuring that the timer is canceled before proceeding with
      the cleanup in ether3_remove.
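
The ordering can be sketched in userspace as follows. This is a toy model, not the driver code: the single-slot "timer wheel" and every `model_*` name are hypothetical, and `MODEL_CFG2_CTRLO` is a placeholder bit value. In the kernel the cancel step would presumably be a synchronous timer cancel such as del_timer_sync(), but the message above only says that the timer is canceled.

```c
#include <stddef.h>
#include <stdlib.h>

#define MODEL_CFG2_CTRLO 0x1u          /* placeholder, not the real CFG2_CTRLO value */

struct model_timer {
    void (*fn)(void *arg);             /* callback run when the timer fires */
    void *arg;
};

struct model_ether_dev {
    struct model_timer timer;
    unsigned int config2;
};

/* Single-slot stand-in for the timer core's reference to an armed timer. */
static struct model_timer *model_wheel;

/* Timer callback: touches the device, so the device must still be alive. */
static void model_ledoff(void *arg)
{
    struct model_ether_dev *dev = arg;
    dev->config2 |= MODEL_CFG2_CTRLO;
}

static void model_timer_arm(struct model_timer *t)
{
    model_wheel = t;
}

/* Returns 1 if a pending timer was removed, 0 if it was not armed. */
static int model_timer_cancel(struct model_timer *t)
{
    if (model_wheel != t)
        return 0;
    model_wheel = NULL;
    return 1;
}

/* Fire the armed timer, if any. Returns the number of callbacks run. */
static int model_run_timers(void)
{
    struct model_timer *t = model_wheel;

    if (t == NULL)
        return 0;
    model_wheel = NULL;
    t->fn(t->arg);
    return 1;
}

static struct model_ether_dev *model_probe(void)
{
    struct model_ether_dev *dev = calloc(1, sizeof(*dev));

    if (dev == NULL)
        return NULL;
    dev->timer.fn = model_ledoff;
    dev->timer.arg = dev;
    return dev;
}

/* Cancel first, free second: no callback can run against freed memory. */
static void model_remove(struct model_ether_dev *dev)
{
    model_timer_cancel(&dev->timer);
    free(dev);
}
```

If `model_remove()` freed the device without the cancel, a later `model_run_timers()` would call `model_ledoff()` on freed memory, which is the use-after-free shown in the diagram.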
      
      Fixes: 6fd9c53f ("net: seeq: Convert timers to use timer_setup()")
      Signed-off-by: Kaixin Wang <kxwang23@m.fudan.edu.cn>
      Link: https://patch.msgid.link/20240915144045.451-1-kxwang23@m.fudan.edu.cn
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      b5109b60
    • Eric Dumazet's avatar
      netfilter: nf_reject_ipv6: fix nf_reject_ip6_tcphdr_put() · 9c778fe4
      Eric Dumazet authored
      syzbot reported that nf_reject_ip6_tcphdr_put() was possibly sending
      garbage in the four reserved TCP bits (th->res1).
      
      Use skb_put_zero() to clear the whole TCP header,
      as done in nf_reject_ip_tcphdr_put()
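
The difference between reserving header space and reserving zeroed header space can be illustrated with a small userspace model. The `model_*` helpers below are hypothetical and only mimic the idea behind skb_put() and skb_put_zero(); the header is treated as raw bytes, with byte 12 holding the data offset in its high nibble and the four reserved bits in its low nibble.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_TCP_HDR_LEN 20

/* Hypothetical stand-in for an skb data area. */
struct model_buf {
    uint8_t data[64];
    size_t len;
};

/* Reserve n bytes at the tail without touching their contents. */
static uint8_t *model_put(struct model_buf *b, size_t n)
{
    uint8_t *p;

    if (n > sizeof(b->data) - b->len)
        return NULL;
    p = b->data + b->len;
    b->len += n;
    return p;
}

/* Reserve n bytes at the tail and clear them. */
static uint8_t *model_put_zero(struct model_buf *b, size_t n)
{
    uint8_t *p = model_put(b, n);

    if (p != NULL)
        memset(p, 0, n);
    return p;
}

/*
 * Build a minimal RST header. Only the fields a reset needs are written;
 * the reserved nibble is never assigned, so its value is whatever the
 * buffer held unless the header was zeroed first.
 */
static uint8_t *model_build_rst(struct model_buf *b, uint16_t sport,
                                uint16_t dport, int zeroed)
{
    uint8_t *th = zeroed ? model_put_zero(b, MODEL_TCP_HDR_LEN)
                         : model_put(b, MODEL_TCP_HDR_LEN);

    if (th == NULL)
        return NULL;
    th[0] = (uint8_t)(sport >> 8);
    th[1] = (uint8_t)(sport & 0xff);
    th[2] = (uint8_t)(dport >> 8);
    th[3] = (uint8_t)(dport & 0xff);
    /* Set the data offset only, preserving the low nibble like a bitfield store. */
    th[12] = (uint8_t)((th[12] & 0x0f) | ((MODEL_TCP_HDR_LEN / 4) << 4));
    th[13] = 0x04;                     /* RST flag */
    return th;
}
```

When the buffer starts out filled with a 0xAA pattern, the non-zeroed variant leaves 0xA in the reserved nibble, while the zeroed variant leaves it clear.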
      
      BUG: KMSAN: uninit-value in nf_reject_ip6_tcphdr_put+0x688/0x6c0 net/ipv6/netfilter/nf_reject_ipv6.c:255
        nf_reject_ip6_tcphdr_put+0x688/0x6c0 net/ipv6/netfilter/nf_reject_ipv6.c:255
        nf_send_reset6+0xd84/0x15b0 net/ipv6/netfilter/nf_reject_ipv6.c:344
        nft_reject_inet_eval+0x3c1/0x880 net/netfilter/nft_reject_inet.c:48
        expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline]
        nft_do_chain+0x438/0x22a0 net/netfilter/nf_tables_core.c:288
        nft_do_chain_inet+0x41a/0x4f0 net/netfilter/nft_chain_filter.c:161
        nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline]
        nf_hook_slow+0xf4/0x400 net/netfilter/core.c:626
        nf_hook include/linux/netfilter.h:269 [inline]
        NF_HOOK include/linux/netfilter.h:312 [inline]
        ipv6_rcv+0x29b/0x390 net/ipv6/ip6_input.c:310
        __netif_receive_skb_one_core net/core/dev.c:5661 [inline]
        __netif_receive_skb+0x1da/0xa00 net/core/dev.c:5775
        process_backlog+0x4ad/0xa50 net/core/dev.c:6108
        __napi_poll+0xe7/0x980 net/core/dev.c:6772
        napi_poll net/core/dev.c:6841 [inline]
        net_rx_action+0xa5a/0x19b0 net/core/dev.c:6963
        handle_softirqs+0x1ce/0x800 kernel/softirq.c:554
        __do_softirq+0x14/0x1a kernel/softirq.c:588
        do_softirq+0x9a/0x100 kernel/softirq.c:455
        __local_bh_enable_ip+0x9f/0xb0 kernel/softirq.c:382
        local_bh_enable include/linux/bottom_half.h:33 [inline]
        rcu_read_unlock_bh include/linux/rcupdate.h:908 [inline]
        __dev_queue_xmit+0x2692/0x5610 net/core/dev.c:4450
        dev_queue_xmit include/linux/netdevice.h:3105 [inline]
        neigh_resolve_output+0x9ca/0xae0 net/core/neighbour.c:1565
        neigh_output include/net/neighbour.h:542 [inline]
        ip6_finish_output2+0x2347/0x2ba0 net/ipv6/ip6_output.c:141
        __ip6_finish_output net/ipv6/ip6_output.c:215 [inline]
        ip6_finish_output+0xbb8/0x14b0 net/ipv6/ip6_output.c:226
        NF_HOOK_COND include/linux/netfilter.h:303 [inline]
        ip6_output+0x356/0x620 net/ipv6/ip6_output.c:247
        dst_output include/net/dst.h:450 [inline]
        NF_HOOK include/linux/netfilter.h:314 [inline]
        ip6_xmit+0x1ba6/0x25d0 net/ipv6/ip6_output.c:366
        inet6_csk_xmit+0x442/0x530 net/ipv6/inet6_connection_sock.c:135
        __tcp_transmit_skb+0x3b07/0x4880 net/ipv4/tcp_output.c:1466
        tcp_transmit_skb net/ipv4/tcp_output.c:1484 [inline]
        tcp_connect+0x35b6/0x7130 net/ipv4/tcp_output.c:4143
        tcp_v6_connect+0x1bcc/0x1e40 net/ipv6/tcp_ipv6.c:333
        __inet_stream_connect+0x2ef/0x1730 net/ipv4/af_inet.c:679
        inet_stream_connect+0x6a/0xd0 net/ipv4/af_inet.c:750
        __sys_connect_file net/socket.c:2061 [inline]
        __sys_connect+0x606/0x690 net/socket.c:2078
        __do_sys_connect net/socket.c:2088 [inline]
        __se_sys_connect net/socket.c:2085 [inline]
        __x64_sys_connect+0x91/0xe0 net/socket.c:2085
        x64_sys_call+0x27a5/0x3ba0 arch/x86/include/generated/asm/syscalls_64.h:43
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      Uninit was stored to memory at:
        nf_reject_ip6_tcphdr_put+0x60c/0x6c0 net/ipv6/netfilter/nf_reject_ipv6.c:249
        nf_send_reset6+0xd84/0x15b0 net/ipv6/netfilter/nf_reject_ipv6.c:344
        nft_reject_inet_eval+0x3c1/0x880 net/netfilter/nft_reject_inet.c:48
        expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline]
        nft_do_chain+0x438/0x22a0 net/netfilter/nf_tables_core.c:288
        nft_do_chain_inet+0x41a/0x4f0 net/netfilter/nft_chain_filter.c:161
        nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline]
        nf_hook_slow+0xf4/0x400 net/netfilter/core.c:626
        nf_hook include/linux/netfilter.h:269 [inline]
        NF_HOOK include/linux/netfilter.h:312 [inline]
        ipv6_rcv+0x29b/0x390 net/ipv6/ip6_input.c:310
        __netif_receive_skb_one_core net/core/dev.c:5661 [inline]
        __netif_receive_skb+0x1da/0xa00 net/core/dev.c:5775
        process_backlog+0x4ad/0xa50 net/core/dev.c:6108
        __napi_poll+0xe7/0x980 net/core/dev.c:6772
        napi_poll net/core/dev.c:6841 [inline]
        net_rx_action+0xa5a/0x19b0 net/core/dev.c:6963
        handle_softirqs+0x1ce/0x800 kernel/softirq.c:554
        __do_softirq+0x14/0x1a kernel/softirq.c:588
      
      Uninit was stored to memory at:
        nf_reject_ip6_tcphdr_put+0x2ca/0x6c0 net/ipv6/netfilter/nf_reject_ipv6.c:231
        nf_send_reset6+0xd84/0x15b0 net/ipv6/netfilter/nf_reject_ipv6.c:344
        nft_reject_inet_eval+0x3c1/0x880 net/netfilter/nft_reject_inet.c:48
        expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline]
        nft_do_chain+0x438/0x22a0 net/netfilter/nf_tables_core.c:288
        nft_do_chain_inet+0x41a/0x4f0 net/netfilter/nft_chain_filter.c:161
        nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline]
        nf_hook_slow+0xf4/0x400 net/netfilter/core.c:626
        nf_hook include/linux/netfilter.h:269 [inline]
        NF_HOOK include/linux/netfilter.h:312 [inline]
        ipv6_rcv+0x29b/0x390 net/ipv6/ip6_input.c:310
        __netif_receive_skb_one_core net/core/dev.c:5661 [inline]
        __netif_receive_skb+0x1da/0xa00 net/core/dev.c:5775
        process_backlog+0x4ad/0xa50 net/core/dev.c:6108
        __napi_poll+0xe7/0x980 net/core/dev.c:6772
        napi_poll net/core/dev.c:6841 [inline]
        net_rx_action+0xa5a/0x19b0 net/core/dev.c:6963
        handle_softirqs+0x1ce/0x800 kernel/softirq.c:554
        __do_softirq+0x14/0x1a kernel/softirq.c:588
      
      Uninit was created at:
        slab_post_alloc_hook mm/slub.c:3998 [inline]
        slab_alloc_node mm/slub.c:4041 [inline]
        kmem_cache_alloc_node_noprof+0x6bf/0xb80 mm/slub.c:4084
        kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:583
        __alloc_skb+0x363/0x7b0 net/core/skbuff.c:674
        alloc_skb include/linux/skbuff.h:1320 [inline]
        nf_send_reset6+0x98d/0x15b0 net/ipv6/netfilter/nf_reject_ipv6.c:327
        nft_reject_inet_eval+0x3c1/0x880 net/netfilter/nft_reject_inet.c:48
        expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline]
        nft_do_chain+0x438/0x22a0 net/netfilter/nf_tables_core.c:288
        nft_do_chain_inet+0x41a/0x4f0 net/netfilter/nft_chain_filter.c:161
        nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline]
        nf_hook_slow+0xf4/0x400 net/netfilter/core.c:626
        nf_hook include/linux/netfilter.h:269 [inline]
        NF_HOOK include/linux/netfilter.h:312 [inline]
        ipv6_rcv+0x29b/0x390 net/ipv6/ip6_input.c:310
        __netif_receive_skb_one_core net/core/dev.c:5661 [inline]
        __netif_receive_skb+0x1da/0xa00 net/core/dev.c:5775
        process_backlog+0x4ad/0xa50 net/core/dev.c:6108
        __napi_poll+0xe7/0x980 net/core/dev.c:6772
        napi_poll net/core/dev.c:6841 [inline]
        net_rx_action+0xa5a/0x19b0 net/core/dev.c:6963
        handle_softirqs+0x1ce/0x800 kernel/softirq.c:554
        __do_softirq+0x14/0x1a kernel/softirq.c:588
      
      Fixes: c8d7b98b ("netfilter: move nf_send_resetX() code to nf_reject_ipvX modules")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Link: https://patch.msgid.link/20240913170615.3670897-1-edumazet@google.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      9c778fe4
    • Sean Anderson's avatar
      net: xilinx: axienet: Fix packet counting · 5a6caa2c
      Sean Anderson authored
      axienet_free_tx_chain returns the number of DMA descriptors it's
      handled. However, axienet_tx_poll treats the return as the number of
      packets. When scatter-gather SKBs are enabled, a single packet may use
      multiple DMA descriptors, which causes incorrect packet counts. Fix this
      by explicitly keeping track of the number of packets processed as
      separate from the DMA descriptors.
      
      Budget does not affect the number of Tx completions we can process for
      NAPI, so we use the ring size as the limit instead of budget. As we no
      longer return the number of descriptors processed to axienet_tx_poll, we
      now update tx_bd_ci in axienet_free_tx_chain.
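
A toy model of the accounting, assuming a ring in which each packet's final descriptor is flagged. All names here (`model_bd`, `model_ring`, `model_free_tx_chain`) are hypothetical and the real driver's descriptor layout differs; the point is only that the return value counts packets while the consumer index advances by descriptors.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_RING_SIZE 8

/* Hypothetical descriptor: a packet may span several of these. */
struct model_bd {
    bool completed;        /* hardware has finished with this descriptor */
    bool last_of_packet;   /* set on the final descriptor of a packet */
};

struct model_ring {
    struct model_bd bd[MODEL_RING_SIZE];
    unsigned int ci;       /* consumer index, updated by the reclaim helper */
};

/*
 * Reclaim up to max_bds completed descriptors starting at the consumer
 * index. Returns the number of packets completed, stores the number of
 * descriptors reclaimed in *bds_freed, and advances the consumer index.
 */
static int model_free_tx_chain(struct model_ring *r, unsigned int max_bds,
                               unsigned int *bds_freed)
{
    int packets = 0;
    unsigned int i;

    for (i = 0; i < max_bds; i++) {
        struct model_bd *bd = &r->bd[(r->ci + i) % MODEL_RING_SIZE];

        if (!bd->completed)
            break;
        if (bd->last_of_packet)
            packets++;
        bd->completed = false;
        bd->last_of_packet = false;
    }

    r->ci = (r->ci + i) % MODEL_RING_SIZE;
    if (bds_freed != NULL)
        *bds_freed = i;
    return packets;
}
```

A three-descriptor packet followed by a one-descriptor packet yields two packets and four reclaimed descriptors; treating the descriptor count as the packet count would report four.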
      
      Fixes: 8a3b7a25 ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver")
      Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
      Link: https://patch.msgid.link/20240913145156.2283067-1-sean.anderson@linux.dev
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      5a6caa2c
    • Sean Anderson's avatar
      net: xilinx: axienet: Schedule NAPI in two steps · ba0da2dc
      Sean Anderson authored
      As advised by Documentation/networking/napi.rst, masking IRQs after
      calling napi_schedule can be racy. Avoid this by only masking/scheduling
      if napi_schedule_prep returns true.
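
The two-step pattern can be modeled in userspace like this. The `model_*` names are hypothetical; in the kernel the corresponding calls are napi_schedule_prep() followed by __napi_schedule(), with the interrupt masked in between. The sketch only shows why the mask happens at most once per scheduling round.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a NAPI context. */
struct model_napi {
    atomic_bool scheduled;   /* true while a poll is pending or running */
    int polls_queued;        /* how many times a poll was queued */
};

/* Hypothetical stand-in for the device state touched by the IRQ handler. */
struct model_nic {
    struct model_napi napi;
    bool irq_masked;
    int mask_writes;         /* how many times the IRQ mask was written */
};

static void model_nic_init(struct model_nic *d)
{
    atomic_init(&d->napi.scheduled, false);
    d->napi.polls_queued = 0;
    d->irq_masked = false;
    d->mask_writes = 0;
}

/* Step 1: claim the right to schedule. True only for the first caller. */
static bool model_schedule_prep(struct model_napi *n)
{
    return !atomic_exchange(&n->scheduled, true);
}

/* Step 2: actually queue the poll. Only valid after a successful prep. */
static void model_schedule_commit(struct model_napi *n)
{
    n->polls_queued++;
}

/* Mask the interrupt only if this invocation is the one that schedules. */
static void model_irq_handler(struct model_nic *d)
{
    if (model_schedule_prep(&d->napi)) {
        d->irq_masked = true;
        d->mask_writes++;
        model_schedule_commit(&d->napi);
    }
}

/* Poll finished: allow rescheduling, then unmask the interrupt. */
static void model_poll_complete(struct model_nic *d)
{
    atomic_store(&d->napi.scheduled, false);
    d->irq_masked = false;
}
```

A second interrupt arriving before the poll completes neither queues another poll nor rewrites the mask, so the mask state cannot get out of step with the poll that will later unmask it.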
      
      Fixes: 9e2bc267 ("net: axienet: Use NAPI for TX completion path")
      Fixes: cc37610c ("net: axienet: implement NAPI and GRO receive")
      Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://patch.msgid.link/20240913145711.2284295-1-sean.anderson@linux.dev
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      ba0da2dc
    • Vladimir Oltean's avatar
      net: phy: aquantia: fix -ETIMEDOUT PHY probe failure when firmware not present · 194ef9d0
      Vladimir Oltean authored
      The author of the blamed commit apparently did not notice something
      about aqr_wait_reset_complete(): it polls the exact same register -
      MDIO_MMD_VEND1:VEND1_GLOBAL_FW_ID - as aqr_firmware_load().
      
      Thus, the entire logic after the introduction of aqr_wait_reset_complete() is
      now completely side-stepped, because if aqr_wait_reset_complete()
      succeeds, MDIO_MMD_VEND1:VEND1_GLOBAL_FW_ID could have only been a
      non-zero value. The handling of the case where the register reads as 0
      is dead code, due to the previous -ETIMEDOUT having stopped execution
      and returning a fatal error to the caller. We never attempt to load
      new firmware if no firmware is present.
      
      Based on static code analysis, I guess we should simply introduce a
      switch/case statement based on the return code from aqr_wait_reset_complete(),
      to determine whether to load firmware or not. I am not intending to
      change the procedure through which the driver determines whether to load
      firmware or not, as I am unaware of alternative possibilities.
      
      At the same time, Russell King suggests that if aqr_wait_reset_complete()
      is expected to return -ETIMEDOUT as part of normal operation and not
      just catastrophic failure, the use of phy_read_mmd_poll_timeout() is
      improper, since that has an embedded print inside. Just open-code a
      call to read_poll_timeout() to avoid printing -ETIMEDOUT, but continue
      printing actual read errors from the MDIO bus.
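
The decision described above can be sketched as a small userspace model. The `model_*` functions and the `model_action` enum are hypothetical; the array of readings stands in for successive polls of the firmware ID register, where a negative value models an MDIO bus error and zero models "no firmware yet".

```c
#include <errno.h>
#include <stdio.h>

enum model_action {
    MODEL_FW_PRESENT,   /* register became non-zero: firmware is running */
    MODEL_LOAD_FW,      /* poll timed out on zero: try loading firmware */
    MODEL_FAIL          /* genuine read error: give up */
};

/*
 * Poll a sequence of firmware ID readings. Returns 0 once a non-zero ID
 * is seen, the negative error code if a read fails, or -ETIMEDOUT if all
 * readings were zero. A timeout is deliberately not printed, since it is
 * part of normal operation; read errors are.
 */
static int model_wait_reset_complete(const int *readings, int count)
{
    int i;

    for (i = 0; i < count; i++) {
        if (readings[i] < 0) {
            fprintf(stderr, "firmware ID read failed: %d\n", readings[i]);
            return readings[i];
        }
        if (readings[i] != 0)
            return 0;
    }
    return -ETIMEDOUT;
}

/* Decide what to do next based on the poll result. */
static enum model_action model_firmware_decision(int wait_ret)
{
    switch (wait_ret) {
    case 0:
        return MODEL_FW_PRESENT;
    case -ETIMEDOUT:
        return MODEL_LOAD_FW;
    default:
        return MODEL_FAIL;
    }
}
```

Before the fix, the timeout case was treated like any other error, so the firmware-loading branch could never be reached; here it is the one result that leads to loading.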
      
      Fixes: ad649a1f ("net: phy: aquantia: wait for FW reset before checking the vendor ID")
      Reported-by: Clark Wang <xiaoning.wang@nxp.com>
      Reported-by: Jon Hunter <jonathanh@nvidia.com>
      Closes: https://lore.kernel.org/netdev/8ac00a45-ac61-41b4-9f74-d18157b8b6bf@nvidia.com/
      Reported-by: Hans-Frieder Vogt <hfdevel@gmx.net>
      Closes: https://lore.kernel.org/netdev/c7c1a3ae-be97-4929-8d89-04c8aa870209@gmx.net/
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
      Tested-by: Hans-Frieder Vogt <hfdevel@gmx.net>
      Link: https://patch.msgid.link/20240913121230.2620122-1-vladimir.oltean@nxp.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      194ef9d0
  6. 16 Sep, 2024 1 commit
    • Linus Torvalds's avatar
      Merge tag 'net-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next · 94106455
      Linus Torvalds authored
      Pull networking updates from Jakub Kicinski:
       "The zero-copy changes are relatively significant, but regression risk
        should be contained. The feature needs to be used to cause trouble.
      
        Also it feels like we got an order of magnitude more semi-automated
        "refactoring" chaff than usual, I wonder if it's just us.
      
        Core & protocols:
      
         - Support Device Memory TCP, ability to zero-copy receive TCP
           payloads to a DMABUF region of memory while packet headers land
           separately in normal kernel buffers, and TCP processes them as
           usual.
      
         - The ability to read the PTP PHC (Physical Hardware Clock) alongside
           MONOTONIC_RAW timestamps with PTP_SYS_OFFSET_EXTENDED. Previously
           only CLOCK_REALTIME was supported.
      
         - Allow matching on all bits of IP DSCP for routing decisions.
           Previously we only supported matching on TOS bits in IPv4, which is
           a narrower interpretation of the same header field.
      
         - Increase the range of weights used for multi-path routing from
           8 bits to 16 bits.
      
         - Add support for IPv6 PIO p flag in the Prefix Information Option
           per draft-ietf-6man-pio-pflag.
      
         - IPv6 IOAM6 support for new tunsrc encap mode for better
           performance.
      
         - Detect destinations which blackhole MPTCP traffic and avoid
           initiating MPTCP connections to them for a certain period of time,
           1h by default.
      
         - Improve IPsec control path performance by removing the inexact
           policies list.
      
         - AF_VSOCK: add support for SIOCOUTQ ioctl.
      
         - Add enum for reasons TCP reset was sent for easier tracing.
      
         - Add SMC ringbufs usage statistics.
      
        Drivers:
      
         - Handle netconsole setup failures more gracefully, don't fail
           loading, retain the specified target as disabled.
      
         - Extend bonding's IPsec offload pass thru capabilities (ESN, stats).
      
        Filtering:
      
         - Add TCP_BPF_SOCK_OPS_CB_FLAGS to bpf_*sockopt() to address the case
           when long-lived sockets miss a chance to set additional callbacks
           if a sockops program was not attached early in their lifetime.
      
         - Support using BPF skb helpers in tracepoints.
      
         - Conntrack Netlink: support CTA_FILTER for flush.
      
         - Improve SCTP support in nfnetlink_queue.
      
         - Improve performance of large nftables flush transactions.
      
        Things we sprinkled into general kernel code:
      
         - selftests: support setting an "interpreter" for script files; make
           it easy to run as separate cases tests where one "interpreter" is
           fed various test descriptions (in our case packet sequences).
      
        Driver API:
      
         - Extend core and ethtool APIs to support many PHYs connected to a
           single interface (PHY topologies).
      
         - Extend cable diagnostics to specify whether Time Domain
           Reflectometry (TDR) or Active Link Cable Diagnostic (ALCD) was
           used.
      
         - Add library for implementing MAC-PHY Ethernet drivers for SPI
           devices compatible with Open Alliance 10BASE-T1x MAC-PHY Serial
           Interface (TC6) standard.
      
         - Add helpers to the PHY framework, for PHYs following the Open
           Alliance standards:
             - 1000BaseT1 link settings
             - cable test and diagnostics
      
         - Support listing / dumping all allocated RSS contexts.
      
         - Add configuration for frequency Embedded SYNC in DPLL, which
           magically embeds sync pulses into Ethernet signaling.
      
        Device drivers:
      
         - Ethernet high-speed NICs:
            - Broadcom (bnxt):
               - use better FW APIs for queue reset
               - support QOS and TPID settings for the SR-IOV VLAN
               - support dynamic MSI-X allocation
            - Intel (100G, ice, idpf):
               - ice: support PCIe subfunctions
               - iavf: add support for TC U32 filters on VFs
               - ice: support Embedded SYNC in DPLL
            - nVidia/Mellanox (mlx5):
               - support HW managed steering tables
               - support PCIe PTM cross timestamping
            - AMD/Pensando:
               - ionic: use page_pool to increase Rx performance
            - Cisco (enic):
               - report per-queue statistics
      
         - Ethernet virtual:
            - Microsoft vNIC:
               - mana: support configuring ring length
               - netvsc: enable more channels on systems with many CPUs
            - IBM veth:
               - optimize polling to improve TCP_RR performance
               - optimize performance of Tx handling
            - VirtIO net:
               - synchronize the operstate with the admin state to allow a
                 lower virtio-net to propagate the link status to an upper
                 device like macvlan
      
         - Ethernet NICs consumer, and embedded:
            - Add driver for Realtek automotive PCIe devices (RTL9054,
              RTL9068, RTL9072, RTL9075, RTL9068, RTL9071)
            - Add driver for Microchip LAN8650/1 10BASE-T1S MAC-PHY.
            - Microchip:
               - lan743x: use phylink - support WOL, EEE, pause, link settings
               - add Wake-on-LAN support for KSZ87xx family
               - add KSZ8895/KSZ8864 switch support
               - factor out FDMA code and use it in sparx5 and lan966x
                 (including DCB support in both)
            - Synopsys (stmmac):
               - support frame preemption (configured using TC and ethtool)
               - support Loongson DWMAC (GMAC v3.73)
               - support RockChips RK3576 DWMAC
            - TI:
               - am65-cpsw: add multi queue RX support
               - icssg-prueth: HSR offload support
            - Cadence (macb):
               - enable software (hrtimer based) IRQ coalescing by default
            - Xilinx (axienet):
               - expose HW statistics
               - improve multicast filtering
               - relax Rx checksum offload constraints
            - MediaTek:
               - mt7530: add EN7581 support
            - Aspeed (ftgmac100):
               - report link speed and duplex
            - Intel:
               - igc: add mqprio offload
               - igc: report EEE configuration
            - RealTek (r8169):
               - add support for RTL8126A rev.b
            - Vitesse (vsc73xx):
               - implement FDB add/del/dump operations
            - Freescale (fs_enet):
               - use phylink
      
         - Ethernet PHYs:
            - vitesse: implement downshift and MDI-X in vsc73xx PHYs
            - microchip: support LAN887x, supporting IEEE 802.3bw (100BASE-T1)
              and IEEE 802.3bp (1000BASE-T1) specifications
            - add Applied Micro QT2025 PHY driver (in Rust)
            - add Motorcomm yt8821 2.5G Ethernet PHY driver
      
         - CAN:
            - add driver for Rockchip RK3568 CAN-FD controller
            - flexcan: add wakeup support for imx95
            - kvaser_usb: set hardware timestamp on transmitted packets
      
         - WiFi:
            - mac80211/cfg80211:
               - EHT rate support in AQL airtime fairness
               - handle DFS (radar detection) per link in Multi-Link Operation
            - RealTek (rtw89):
               - support RTL8852BT and 8852BE-VT (WiFi 6)
               - support hardware rfkill
               - support HW encryption in unicast management frames
               - support Wake-on-WLAN with supported network detection
            - RealTek (rtw88):
               - improve Rx performance by using USB frame aggregation
               - support USB 3 with RTL8822CU/RTL8822BU
            - Intel (iwlwifi/mvm):
               - offload RLC/SMPS functionality to firmware
            - Marvell (mwifiex):
               - add host based MLME to enable WPA3
      
         - Bluetooth:
            - add support for Amlogic HCI UART protocol
            - add support for ISO data/packets to Intel and NXP drivers"
      
      * tag 'net-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1303 commits)
        net/mlx5: HWS, check the correct variable in hws_send_ring_alloc_sq()
        netfilter: nft_socket: Fix a NULL vs IS_ERR() bug in nft_socket_cgroup_subtree_level()
        ice: Fix a NULL vs IS_ERR() check in probe()
        ice: Fix a couple NULL vs IS_ERR() bugs
        net: ethernet: fs_enet: Make the per clock optional
        net: ti: icssg-prueth: Add multicast filtering support in HSR mode
        net: ti: icssg-prueth: Enable HSR Tx duplication, Tx Tag and Rx Tag offload
        net: ti: icssg-prueth: Add support for HSR frame forward offload
        net: ti: icssg-prueth: Stop hardcoding def_inc
        net: ti: icss-iep: Move icss_iep structure
        net: ibm: emac: get rid of wol_irq
        net: ibm: emac: remove all waiting code
        net: ibm: emac: replace of_get_property
        net: ibm: emac: use netdev's phydev directly
        net: ibm: emac: use devm for register_netdev
        net: ibm: emac: remove mii_bus with devm
        net: ibm: emac: use devm for of_iomap
        net: ibm: emac: manage emac_irq with devm
        net: ibm: emac: use devm for alloc_etherdev
        octeontx2-af: debugfs: Add Channel info to RPM map
        ...
      94106455
  7. 15 Sep, 2024 9 commits
  8. 14 Sep, 2024 4 commits