1. 03 Feb, 2022 9 commits
    • net: stmmac: ensure PTP time register reads are consistent · 80d46090
      Yannick Vignon authored
      Even if protected from preemption and interrupts, a small time window
      remains when the 2 register reads could return inconsistent values,
      each time the "seconds" register changes. This could lead to an error
      of about 1 second in the reported time.
      
      Add logic to ensure the "seconds" and "nanoseconds" values are consistent.
      
      Fixes: 92ba6888 ("stmmac: add the support for PTP hw clock driver")
      Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
      Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Link: https://lore.kernel.org/r/20220203160025.750632-1-yannick.vignon@oss.nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
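For the stmmac PTP fix above, one common way to make a two-register read consistent is to re-read the "seconds" register after the "nanoseconds" read and retry if it changed. The sketch below models this on a simulated register pair. `fake_ptp`, `read_sec`, `read_nsec`, and `get_time_ns` are hypothetical names, not the driver's code, and the actual patch may use a different scheme.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a clock exposed as two separate registers. */
struct fake_ptp {
	uint32_t sec;
	uint32_t nsec;
	int rollover_pending; /* simulate a seconds rollover between two reads */
};

static uint32_t read_sec(struct fake_ptp *p)
{
	return p->sec;
}

static uint32_t read_nsec(struct fake_ptp *p)
{
	uint32_t ns = p->nsec;

	if (p->rollover_pending) {
		/* The clock ticks over right after nanoseconds were sampled. */
		p->rollover_pending = 0;
		p->sec += 1;
		p->nsec = 5;
	}
	return ns;
}

/* Retry until the seconds value is the same before and after the
 * nanoseconds read, so both values belong to the same second. */
static uint64_t get_time_ns(struct fake_ptp *p)
{
	uint32_t sec_before, sec_after, nsec;

	do {
		sec_before = read_sec(p);
		nsec = read_nsec(p);
		sec_after = read_sec(p);
	} while (sec_before != sec_after);

	assert(nsec < 1000000000u);
	return (uint64_t)sec_after * 1000000000ULL + nsec;
}
```

Without the retry, the simulated rollover would pair the old "seconds" value with nanoseconds sampled just before the tick, or the new "seconds" with stale nanoseconds, which is the roughly 1-second error the commit describes.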
    • Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 77b1b8b4
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2022-02-03
      
      We've added 6 non-merge commits during the last 10 day(s) which contain
      a total of 7 files changed, 11 insertions(+), 236 deletions(-).
      
      The main changes are:
      
      1) Fix BPF ringbuf to allocate its area with VM_MAP instead of VM_ALLOC
         flag which otherwise trips over KASAN, from Hou Tao.
      
      2) Fix unresolved symbol warning in resolve_btfids due to LSM callback
         rename, from Alexei Starovoitov.
      
      3) Fix a possible race in inc_misses_counter() when IRQ would trigger
         during counter update, from He Fengqing.
      
      4) Fix tooling infra for cross-building with clang upon probing whether
         gcc provides the standard libraries, from Jean-Philippe Brucker.
      
      5) Fix silent mode build for resolve_btfids, from Nathan Chancellor.
      
      6) Drop unneeded and outdated lirc.h header copy from tooling infra as
         BPF does not require it anymore, from Sean Young.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        tools/resolve_btfids: Do not print any commands when building silently
        bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
        tools: Ignore errors from `which' when searching a GCC toolchain
        tools headers UAPI: remove stale lirc.h
        bpf: Fix possible race in inc_misses_counter
        bpf: Fix renaming task_getsecid_subj->current_getsecid_subj.
      ====================
      
      Link: https://lore.kernel.org/r/20220203155815.25689-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'net-ipa-enable-register-retention' · 0166556a
      Jakub Kicinski authored
      Alex Elder says:
      
      ====================
      net: ipa: enable register retention
      
      With runtime power management in place, we sometimes need to issue
      a command to enable retention of IPA register values before power
      collapse.  This requires a new Device Tree property, whose presence
      will also be used to signal that the command is required.
      ====================
      
      Link: https://lore.kernel.org/r/20220201150205.468403-1-elder@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: request IPA register values be retained · 34a08176
      Alex Elder authored
      In some cases, the IPA hardware needs to request the always-on
      subsystem (AOSS) to coordinate with the IPA microcontroller to
      retain IPA register values at power collapse.  This is done by
      issuing a QMP request to the AOSS microcontroller.  A similar
      request undoes that request.
      
      We must get and hold the "QMP" handle early, because we might get
      back EPROBE_DEFER for that.  But the actual request should be sent
      while we know the IPA clock is active, and when we know the
      microcontroller is operational.
      
      Fixes: 1aac309d ("net: ipa: use autosuspend")
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dt-bindings: net: qcom,ipa: add optional qcom,qmp property · ac62a017
      Alex Elder authored
      For some systems, the IPA driver must make a request to ensure that
      its registers are retained across power collapse of the IPA hardware.
      On such systems, we'll use the existence of the "qcom,qmp" property
      as a signal that this request is required.
      
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tools/resolve_btfids: Do not print any commands when building silently · 7f3bdbc3
      Nathan Chancellor authored
      When building with 'make -s', there is some output from resolve_btfids:
      
      $ make -sj"$(nproc)" oldconfig prepare
        MKDIR     .../tools/bpf/resolve_btfids/libbpf/
        MKDIR     .../tools/bpf/resolve_btfids//libsubcmd
        LINK     resolve_btfids
      
      Silent mode means that no information should be emitted about what is
      currently being done. Use the $(silent) variable from Makefile.include
      to avoid defining the msg macro so that there is no information printed.
      
      Fixes: fbbb68de ("bpf: Add resolve_btfids tool to resolve BTF IDs in ELF object")
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220201212503.731732-1-nathan@kernel.org
    • bpf: Use VM_MAP instead of VM_ALLOC for ringbuf · b293dcc4
      Hou Tao authored
      After commit 2fd3fb0be1d1 ("kasan, vmalloc: unpoison VM_ALLOC pages
      after mapping"), non-VM_ALLOC mappings will be marked as accessible
      in __get_vm_area_node() when KASAN is enabled. But the flag for the
      ringbuf area is currently VM_ALLOC, so KASAN will complain about an
      out-of-bounds access after vmap() returns. Because the ringbuf area is
      created by mapping allocated pages, use VM_MAP instead.
      
      After the change, info in /proc/vmallocinfo also changes from
        [start]-[end]   24576 ringbuf_map_alloc+0x171/0x290 vmalloc user
      to
        [start]-[end]   24576 ringbuf_map_alloc+0x171/0x290 vmap user
      
      Fixes: 457f4436 ("bpf: Implement BPF ring buffer and verifier support for it")
      Reported-by: syzbot+5ad567a418794b9b5983@syzkaller.appspotmail.com
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220202060158.6260-1-houtao1@huawei.com
    • net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work · 4a81f6da
      Daniel Borkmann authored
      syzkaller was able to trigger a deadlock for NTF_MANAGED entries [0]:
      
        kworker/0:16/14617 is trying to acquire lock:
        ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
        [...]
        but task is already holding lock:
        ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: neigh_managed_work+0x35/0x250 net/core/neighbour.c:1572
      
      The neighbor entry turned to NUD_FAILED state, where __neigh_event_send()
      triggered an immediate probe as per commit cd28ca0a ("neigh: reduce
      arp latency") via neigh_probe() while the table lock was held.
      
      One option to fix this situation is to defer the neigh_probe() back to
      the neigh_timer_handler(), as was done before cd28ca0a. For the case
      of NTF_MANAGED, this deferral is acceptable given this only happens on
      actual failure state and regular / expected state is NUD_VALID with the
      entry already present.
      
      The fix adds a parameter to __neigh_event_send() in order to communicate
      whether immediate probe is allowed or disallowed. Existing call-sites
      of neigh_event_send() default as-is to immediate probe. However, the
      neigh_managed_work() disables it via use of neigh_event_send_probe().
      
      [0] <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
        print_deadlock_bug kernel/locking/lockdep.c:2956 [inline]
        check_deadlock kernel/locking/lockdep.c:2999 [inline]
        validate_chain kernel/locking/lockdep.c:3788 [inline]
        __lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5027
        lock_acquire kernel/locking/lockdep.c:5639 [inline]
        lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604
        __raw_write_lock_bh include/linux/rwlock_api_smp.h:202 [inline]
        _raw_write_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:334
        ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
        ip6_finish_output2+0x1070/0x14f0 net/ipv6/ip6_output.c:123
        __ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
        __ip6_finish_output+0x61e/0xe90 net/ipv6/ip6_output.c:170
        ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201
        NF_HOOK_COND include/linux/netfilter.h:296 [inline]
        ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224
        dst_output include/net/dst.h:451 [inline]
        NF_HOOK include/linux/netfilter.h:307 [inline]
        ndisc_send_skb+0xa99/0x17f0 net/ipv6/ndisc.c:508
        ndisc_send_ns+0x3a9/0x840 net/ipv6/ndisc.c:650
        ndisc_solicit+0x2cd/0x4f0 net/ipv6/ndisc.c:742
        neigh_probe+0xc2/0x110 net/core/neighbour.c:1040
        __neigh_event_send+0x37d/0x1570 net/core/neighbour.c:1201
        neigh_event_send include/net/neighbour.h:470 [inline]
        neigh_managed_work+0x162/0x250 net/core/neighbour.c:1574
        process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
        worker_thread+0x657/0x1110 kernel/workqueue.c:2454
        kthread+0x2e9/0x3a0 kernel/kthread.c:377
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
        </TASK>
      
      Fixes: 7482e384 ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
      Reported-by: syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Roopa Prabhu <roopa@nvidia.com>
      Tested-by: syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220201193942.5055-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data() · b67985be
      Eric Dumazet authored
      tcp_shift_skb_data() might collapse three packets into a larger one.
      
      P_A, P_B, P_C  -> P_ABC
      
      Historically, it used a single tcp_skb_can_collapse_to(P_A) call,
      because it was enough.
      
      In commit 85712484 ("tcp: coalesce/collapse must respect MPTCP extensions"),
      this call was replaced by a call to tcp_skb_can_collapse(P_A, P_B)
      
      But the now needed test over P_C has been missed.
      
      This probably broke MPTCP.
      
      Then later, commit 9b65b17d ("net: avoid double accounting for pure zerocopy skbs")
      added an extra condition to tcp_skb_can_collapse(), but the missing call
      from tcp_shift_skb_data() is also breaking TCP zerocopy, because P_A and P_C
      might have different skb_zcopy_pure() status.
      
      Fixes: 85712484 ("tcp: coalesce/collapse must respect MPTCP extensions")
      Fixes: 9b65b17d ("net: avoid double accounting for pure zerocopy skbs")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Cc: Talal Ahmad <talalahmad@google.com>
      Cc: Arjun Roy <arjunroy@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220201184640.756716-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
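The reasoning in the tcp_shift_skb_data() fix above is that a pairwise compatibility test covering only P_A and P_B says nothing about P_C. A toy model of that point follows. `pkt_model`, `can_collapse`, and `can_merge_three` are hypothetical stand-ins, and the two compared properties are simplified placeholders for the MPTCP-extension and pure-zerocopy conditions; this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified packet descriptor. */
struct pkt_model {
	int ext_id;      /* stand-in for "same MPTCP extension" */
	bool zcopy_pure; /* stand-in for the pure-zerocopy status */
};

/* Pairwise test: both properties must match for two packets to merge. */
static bool can_collapse(const struct pkt_model *to,
			 const struct pkt_model *from)
{
	assert(to && from);
	return to->ext_id == from->ext_id &&
	       to->zcopy_pure == from->zcopy_pure;
}

/* Merging A, B and C into one packet needs the test against C as well. */
static bool can_merge_three(const struct pkt_model *a,
			    const struct pkt_model *b,
			    const struct pkt_model *c)
{
	return can_collapse(a, b) && can_collapse(a, c);
}
```

With only the (P_A, P_B) check, a P_C whose zerocopy status differs from P_A would still be merged, which is the case the commit message calls out.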
  2. 02 Feb, 2022 30 commits
    • net: sparx5: do not refer to skb after passing it on · 81eb8b0b
      Steen Hegelund authored
      Do not try to use any SKB fields after the packet has been passed up in the
      receive stack.
      
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Link: https://lore.kernel.org/r/20220202083039.3774851-1-steen.hegelund@microchip.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Partially revert "net/smc: Add netlink net namespace support" · c86d8613
      Dmitry V. Levin authored
      The change of sizeof(struct smc_diag_linkinfo) by commit 79d39fc5
      ("net/smc: Add netlink net namespace support") introduced an ABI
      regression: since struct smc_diag_lgrinfo contains an object of
      type "struct smc_diag_linkinfo", offset of all subsequent members
      of struct smc_diag_lgrinfo was changed by that change.
      
      As a result, applications compiled with the old version
      of struct smc_diag_linkinfo will receive garbage in
      struct smc_diag_lgrinfo.role if the kernel implements
      this new version of struct smc_diag_linkinfo.
      
      Fix this regression by reverting the part of commit 79d39fc5 that
      changes struct smc_diag_linkinfo.  After all, there is SMC_GEN_NETLINK
      interface which is good enough, so there is probably no need to touch
      the smc_diag ABI in the first place.
      
      Fixes: 79d39fc5 ("net/smc: Add netlink net namespace support")
      Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
      Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
      Link: https://lore.kernel.org/r/20220202030904.GA9742@altlinux.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
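The ABI regression in the net/smc revert above follows from how C lays out nested structs: growing an embedded struct shifts every member that comes after it in the containing struct. A minimal sketch with hypothetical types is shown here. `inner_v1`, `inner_v2`, `outer_v1`, `outer_v2`, and their fields are illustrative only and do not reproduce the real smc_diag layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "old" and "new" versions of an embedded struct. */
struct inner_v1 { uint8_t id; uint8_t name[15]; };                   /* 16 bytes */
struct inner_v2 { uint8_t id; uint8_t name[15]; uint8_t extra[8]; }; /* 24 bytes */

/* The containing struct embeds the inner struct before `role`. */
struct outer_v1 { struct inner_v1 links[1]; uint8_t role; uint8_t pad[3]; };
struct outer_v2 { struct inner_v2 links[1]; uint8_t role; uint8_t pad[3]; };

/* An application built against the old headers reads `role` at the
 * offset it had in the old layout, whatever the producer actually sent. */
static uint8_t read_role_with_v1_layout(const void *buf)
{
	assert(buf != NULL);
	return ((const uint8_t *)buf)[offsetof(struct outer_v1, role)];
}
```

If the producer fills an `outer_v2`, the old reader lands inside `links[0].extra` instead of `role`, which corresponds to the "garbage in struct smc_diag_lgrinfo.role" effect described in the commit message.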
    • Merge tag 'mlx5-fixes-2022-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · c8ff576e
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5 fixes 2022-02-01
      
      This series provides bug fixes to mlx5 driver.
      Please pull and let me know if there is any problem.
      
      Sorry about the long series, but I had to move the top two patches from
      net-next to net to help avoid a build break when the kspp branch is merged
      into linus-next in the next merge window.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 3aa430d3
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2022-02-01
      
      This series contains updates to e1000e driver only.
      
      Sasha removes CSME handshake with TGL platform as this is not supported
      and is causing hardware unit hangs to be reported.
      
      * '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
        e1000e: Handshake with CSME starts from ADL platforms
        e1000e: Separate ADP board type from TGP
      ====================
      
      Link: https://lore.kernel.org/r/20220201173754.580305-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Avoid field-overflowing memcpy() · ad518573
      Kees Cook authored
      In preparation for FORTIFY_SOURCE performing compile-time and run-time
      field bounds checking for memcpy(), memmove(), and memset(), avoid
      intentionally writing across neighboring fields.
      
      Use flexible arrays instead of zero-element arrays (which look like they
      are always overflowing) and split the cross-field memcpy() into two halves
      that can be appropriately bounds-checked by the compiler.
      
      We were doing:
      
      	#define ETH_HLEN  14
      	#define VLAN_HLEN  4
      	...
      	#define MLX5E_XDP_MIN_INLINE (ETH_HLEN + VLAN_HLEN)
      	...
              struct mlx5e_tx_wqe      *wqe  = mlx5_wq_cyc_get_wqe(wq, pi);
      	...
              struct mlx5_wqe_eth_seg  *eseg = &wqe->eth;
              struct mlx5_wqe_data_seg *dseg = wqe->data;
      	...
      	memcpy(eseg->inline_hdr.start, xdptxd->data, MLX5E_XDP_MIN_INLINE);
      
      The target is wqe->eth.inline_hdr.start (which the compiler sees as being
      2 bytes in size), but 18 bytes are copied, intentionally writing across
      start (really vlan_tci, 2 bytes). The remaining 16 bytes get written into
      wqe->data[0], covering byte_count (4 bytes), lkey (4 bytes), and addr
      (8 bytes).
      
      struct mlx5e_tx_wqe {
              struct mlx5_wqe_ctrl_seg   ctrl;                 /*     0    16 */
              struct mlx5_wqe_eth_seg    eth;                  /*    16    16 */
              struct mlx5_wqe_data_seg   data[];               /*    32     0 */
      
              /* size: 32, cachelines: 1, members: 3 */
              /* last cacheline: 32 bytes */
      };
      
      struct mlx5_wqe_eth_seg {
              u8                         swp_outer_l4_offset;  /*     0     1 */
              u8                         swp_outer_l3_offset;  /*     1     1 */
              u8                         swp_inner_l4_offset;  /*     2     1 */
              u8                         swp_inner_l3_offset;  /*     3     1 */
              u8                         cs_flags;             /*     4     1 */
              u8                         swp_flags;            /*     5     1 */
              __be16                     mss;                  /*     6     2 */
              __be32                     flow_table_metadata;  /*     8     4 */
              union {
                      struct {
                              __be16     sz;                   /*    12     2 */
                              u8         start[2];             /*    14     2 */
                      } inline_hdr;                            /*    12     4 */
                      struct {
                              __be16     type;                 /*    12     2 */
                              __be16     vlan_tci;             /*    14     2 */
                      } insert;                                /*    12     4 */
                      __be32             trailer;              /*    12     4 */
              };                                               /*    12     4 */
      
              /* size: 16, cachelines: 1, members: 9 */
              /* last cacheline: 16 bytes */
      };
      
      struct mlx5_wqe_data_seg {
              __be32                     byte_count;           /*     0     4 */
              __be32                     lkey;                 /*     4     4 */
              __be64                     addr;                 /*     8     8 */
      
              /* size: 16, cachelines: 1, members: 3 */
              /* last cacheline: 16 bytes */
      };
      
      So, split the memcpy() so the compiler can reason about the buffer
      sizes.
      
      "pahole" shows no size nor member offset changes to struct mlx5e_tx_wqe
      nor struct mlx5e_umr_wqe. "objdump -d" shows no meaningful object
      code changes (i.e. only source line number induced differences and
      optimizations).
      
      Fixes: b5503b99 ("net/mlx5e: XDP TX forwarding support")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
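The split described in the commit above can be sketched on a simplified model. The structs below are hypothetical reductions of the layouts shown in the commit message (a fixed one-element `data` array replaces the flexible array so the model can live on the stack), and `copy_inline_hdr` is an illustrative name, not the driver function.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MIN_INLINE 18 /* ETH_HLEN (14) + VLAN_HLEN (4), as in the commit text */

/* Hypothetical, reduced versions of the segments. */
struct eth_seg_model  { uint8_t other[14]; uint8_t start[2]; }; /* 16 bytes */
struct data_seg_model { uint8_t bytes[16]; };                    /* 16 bytes */
struct wqe_model {
	struct eth_seg_model  eth;
	struct data_seg_model data[1];
};

/* Copy the 18-byte inline header as two copies, each bounded by the
 * size of the field it actually writes to. */
static void copy_inline_hdr(struct wqe_model *wqe, const uint8_t *src)
{
	assert(wqe != NULL && src != NULL);

	/* First 2 bytes go into eth.start. */
	memcpy(wqe->eth.start, src, sizeof(wqe->eth.start));

	/* Remaining 16 bytes go into the first data segment. */
	memcpy(wqe->data[0].bytes, src + sizeof(wqe->eth.start),
	       MIN_INLINE - sizeof(wqe->eth.start));
}
```

The bytes end up in the same places as with a single 18-byte copy starting at `start`, but neither `memcpy()` writes past the end of its destination field, so a bounds-checking compiler has nothing to flag.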
    • net/mlx5e: Use struct_group() for memcpy() region · 6d5c900e
      Kees Cook authored
      In preparation for FORTIFY_SOURCE performing compile-time and run-time
      field bounds checking for memcpy(), memmove(), and memset(), avoid
      intentionally writing across neighboring fields.
      
      Use struct_group() in struct vlan_ethhdr around members h_dest and
      h_source, so they can be referenced together. This will allow memcpy()
      and sizeof() to more easily reason about sizes, improve readability,
      and avoid future warnings about writing beyond the end of h_dest.
      
      "pahole" shows no size nor member offset changes to struct vlan_ethhdr.
      "objdump -d" shows no object code changes.
      
      Fixes: 34802a42 ("net/mlx5e: Do not modify the TX SKB")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
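The idea behind grouping h_dest and h_source in the commit above can be approximated in plain C11 with an anonymous union that holds the same members twice: once anonymously, so existing field accesses keep working, and once under a name, so the pair can be addressed as a single object. The sketch below hand-expands that pattern and is not the kernel's struct_group() macro. `vlan_hdr_model`, `set_addrs`, and the group name `addrs` are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a VLAN Ethernet header with a named group
 * covering the two address fields. */
struct vlan_hdr_model {
	union {
		struct { uint8_t h_dest[6]; uint8_t h_source[6]; };
		struct { uint8_t h_dest[6]; uint8_t h_source[6]; } addrs;
	};
	uint16_t h_vlan_proto;
	uint16_t h_vlan_TCI;
	uint16_t h_vlan_encapsulated_proto;
};

/* Copy both addresses in one memcpy() whose destination is the group,
 * so the copy size matches the destination object exactly. */
static void set_addrs(struct vlan_hdr_model *hdr, const uint8_t src[12])
{
	assert(hdr != NULL && src != NULL);
	memcpy(&hdr->addrs, src, sizeof(hdr->addrs));
}
```

Because the group is a union of two identically laid out structs, the overall size and member offsets stay the same, which matches the "no size nor member offset changes" observation in the commit message.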
    • net/mlx5e: Avoid implicit modify hdr for decap drop rule · 5b209d1a
      Roi Dayan authored
      Currently the driver adds an implicit modify hdr action for
      decap rules on tunnel devices if the port is an ovs port.
      This is also done when the action is drop, where the modify hdr
      is redundant; the FW also doesn't support it and will generate
      a syndrome.
      
      kernel: mlx5_core 0000:08:00.0: mlx5_cmd_check:777:(pid 102063): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x8708c3)
      
      Fix it by adding the implicit modify hdr only for fwd actions.
      
      Fixes: b16eb3c8 ("net/mlx5: Support internal port as decap route device")
      Fixes: 077cdda7 ("net/mlx5e: TC, Fix memory leak with rules with internal port")
      Signed-off-by: Roi Dayan <roid@nvidia.com>
      Reviewed-by: Ariel Levkovich <lariel@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP traffic · de47db0c
      Raed Salem authored
      The IPsec tunnel mode crypto offload software parser (SWP) setting in the
      data path currently always sets the inner L4 offset, regardless of the
      encapsulated L4 header type and whether it exists in the first place.
      This breaks non-TCP/UDP traffic.
      
      Set the SWP inner L4 offset only when the IPsec tunnel encapsulated L4
      header protocol is TCP/UDP.
      
      While at it, fix the inner IP protocol read for setting the
      MLX5_ETH_WQE_SWP_INNER_L4_UDP flag to address the case where the IP
      header protocol is IPv6.
      
      Fixes: f1267798 ("net/mlx5: Fix checksum issue of VXLAN and IPsec crypto offload")
      Signed-off-by: Raed Salem <raeds@nvidia.com>
      Reviewed-by: Maor Dickman <maord@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic · 5352859b
      Raed Salem authored
      IPsec crypto offload always sets the ethernet segment checksum flags with
      the inner L4 header checksum flag enabled for encapsulated IPsec offloaded
      packets, regardless of the encapsulated L4 header type and even if it
      doesn't exist in the first place. This breaks non-TCP/UDP traffic.
      
      Set the inner L4 checksum flag only when the encapsulated L4 header
      protocol is TCP/UDP using software parser swp_inner_l4_offset field as
      indication.
      
      Fixes: 5cfb540e ("net/mlx5e: Set IPsec WAs only in IP's non checksum partial case.")
      Signed-off-by: Raed Salem <raeds@nvidia.com>
      Reviewed-by: Maor Dickman <maord@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Don't treat small ceil values as unlimited in HTB offload · 736dfe4e
      Maxim Mikityanskiy authored
      The hardware spec defines max_average_bw == 0 as "unlimited bandwidth".
      max_average_bw is calculated as `ceil / BYTES_IN_MBIT`, which can become
      0 when ceil is small, leading to an undesired effect of having no
      bandwidth limit.
      
      This commit fixes it by rounding up small values of ceil to 1 Mbit/s.
      
      Fixes: 214baf22 ("net/mlx5e: Support HTB offload")
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
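The rounding described in the HTB commit above can be sketched as follows. `BYTES_IN_MBIT` is named in the commit message, but its value here (125000, that is 1 Mbit/s expressed in bytes per second) and the helper name `ceil_to_max_average_bw` are assumptions made for illustration; this is not the driver's code.

```c
#include <assert.h>
#include <stdint.h>

#define BYTES_IN_MBIT 125000ULL /* assumed value: 1,000,000 bits / 8 */

/* Convert a ceil rate in bytes/s to whole Mbit/s, never returning 0,
 * because 0 would be interpreted as "unlimited bandwidth". */
static uint32_t ceil_to_max_average_bw(uint64_t ceil)
{
	uint64_t mbit = ceil / BYTES_IN_MBIT;

	if (mbit == 0)
		mbit = 1; /* round small values up to 1 Mbit/s */

	assert(mbit <= UINT32_MAX);
	return (uint32_t)mbit;
}
```

In this sketch a ceil of exactly 0 is also mapped to 1 Mbit/s; whether the driver treats that input the same way is not stated in the commit message.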
    • net/mlx5: E-Switch, Fix uninitialized variable modact · d8e5883d
      Maor Dickman authored
      The variable modact is not initialized before it is used in the modify
      header allocation command, which can cause the command to fail.
      
      Fix by initializing modact with zeros.
      
      Addresses-Coverity: ("Uninitialized scalar variable")
      Fixes: 8f1e0b97 ("net/mlx5: E-Switch, Mark miss packets with new chain id mapping")
      Signed-off-by: Maor Dickman <maord@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Fix handling of wrong devices during bond netevent · ec41332e
      Maor Dickman authored
      The current implementation of the bond netevent handler only checks whether
      the handled netdev is a VF representor; it is missing a check that
      the VF representor is on the same phys device as the bond handling
      the netevent.
      
      Fix by adding the missing check and optimizing the check for whether
      the netdev is a VF representor, so it will not access uninitialized
      private data and crash.
      
      BUG: kernel NULL pointer dereference, address: 000000000000036c
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP NOPTI
      Workqueue: eth3bond0 bond_mii_monitor [bonding]
      RIP: 0010:mlx5e_is_uplink_rep+0xc/0x50 [mlx5_core]
      RSP: 0018:ffff88812d69fd60 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffff8881cf800000 RCX: 0000000000000000
      RDX: ffff88812d69fe10 RSI: 000000000000001b RDI: ffff8881cf800880
      RBP: ffff8881cf800000 R08: 00000445cabccf2b R09: 0000000000000008
      R10: 0000000000000004 R11: 0000000000000008 R12: ffff88812d69fe10
      R13: 00000000fffffffe R14: ffff88820c0f9000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88846fb00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000000036c CR3: 0000000103d80006 CR4: 0000000000370ea0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       mlx5e_eswitch_uplink_rep+0x31/0x40 [mlx5_core]
       mlx5e_rep_is_lag_netdev+0x94/0xc0 [mlx5_core]
       mlx5e_rep_esw_bond_netevent+0xeb/0x3d0 [mlx5_core]
       raw_notifier_call_chain+0x41/0x60
       call_netdevice_notifiers_info+0x34/0x80
       netdev_lower_state_changed+0x4e/0xa0
       bond_mii_monitor+0x56b/0x640 [bonding]
       process_one_work+0x1b9/0x390
       worker_thread+0x4d/0x3d0
       ? rescuer_thread+0x350/0x350
       kthread+0x124/0x150
       ? set_kthread_struct+0x40/0x40
       ret_from_fork+0x1f/0x30
      
      Fixes: 7e51891a ("net/mlx5e: Use netdev events to set/del egress acl forward-to-vport rule")
      Signed-off-by: Maor Dickman <maord@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Fix broken SKB allocation in HW-GRO · 7957837b
      Khalid Manaa authored
      In case the HW doesn't perform header-data split, it will write the whole
      packet into the data buffer in the WQ. In this case the SHAMPO CQE handler
      can't use the header entry to build the SKB; instead it should allocate
      new memory to build the SKB using the function
      mlx5e_skb_from_cqe_mpwrq_nonlinear.
      
      Fixes: f97d5c2a ("net/mlx5e: Add handle SHAMPO cqe support")
      Signed-off-by: Khalid Manaa <khalidm@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Fix wrong calculation of header index in HW_GRO · b8d91145
      Khalid Manaa authored
      The HW doesn't wrap the CQE.shampo.header_index field according to the
      headers buffer size; instead it keeps incrementing it until it overflows
      the u16 range.
      
      Thus the mlx5e_handle_rx_cqe_mpwrq_shampo handler should mask the
      CQE header_index field to find the actual header index in the headers buffer.
      
      Fixes: f97d5c2a ("net/mlx5e: Add handle SHAMPO cqe support")
      Signed-off-by: Khalid Manaa <khalidm@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
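The masking step in the HW_GRO commit above can be sketched as below, under the assumption that the number of entries in the headers buffer is a power of two (the commit message does not say so explicitly). `header_index` and its parameters are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Map the free-running 16-bit counter reported by the hardware onto an
 * index inside a headers buffer with `entries` slots.
 * Assumption: `entries` is a non-zero power of two. */
static uint16_t header_index(uint16_t cqe_header_index, uint16_t entries)
{
	assert(entries != 0 && (entries & (entries - 1)) == 0);
	return (uint16_t)(cqe_header_index & (entries - 1));
}
```

For a 1024-entry buffer, a reported counter value of 1030 maps to slot 6, while values below 1024 are returned unchanged.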
    • net/mlx5: Bridge, Fix devlink deadlock on net namespace deletion · 880b5176
      Roi Dayan authored
      When changing mode to switchdev, rep bridge init registered to netdevice
      notifier holds the devlink lock and then takes pernet_ops_rwsem.
      At that time deleting a netns holds pernet_ops_rwsem and then takes
      the devlink lock.
      
      Example sequence is:
      $ ip netns add foo
      $ devlink dev eswitch set pci/0000:00:08.0 mode switchdev &
      $ ip netns del foo
      
      deleting netns trace:
      
      [ 1185.365555]  ? devlink_pernet_pre_exit+0x74/0x1c0
      [ 1185.368331]  ? mutex_lock_io_nested+0x13f0/0x13f0
      [ 1185.370984]  ? xt_find_table+0x40/0x100
      [ 1185.373244]  ? __mutex_lock+0x24a/0x15a0
      [ 1185.375494]  ? net_generic+0xa0/0x1c0
      [ 1185.376844]  ? wait_for_completion_io+0x280/0x280
      [ 1185.377767]  ? devlink_pernet_pre_exit+0x74/0x1c0
      [ 1185.378686]  devlink_pernet_pre_exit+0x74/0x1c0
      [ 1185.379579]  ? devlink_nl_cmd_get_dumpit+0x3a0/0x3a0
      [ 1185.380557]  ? xt_find_table+0xda/0x100
      [ 1185.381367]  cleanup_net+0x372/0x8e0
      
      changing mode to switchdev trace:
      
      [ 1185.411267]  down_write+0x13a/0x150
      [ 1185.412029]  ? down_write_killable+0x180/0x180
      [ 1185.413005]  register_netdevice_notifier+0x1e/0x210
      [ 1185.414000]  mlx5e_rep_bridge_init+0x181/0x360 [mlx5_core]
      [ 1185.415243]  mlx5e_uplink_rep_enable+0x269/0x480 [mlx5_core]
      [ 1185.416464]  ? mlx5e_uplink_rep_disable+0x210/0x210 [mlx5_core]
      [ 1185.417749]  mlx5e_attach_netdev+0x232/0x400 [mlx5_core]
      [ 1185.418906]  mlx5e_netdev_attach_profile+0x15b/0x1e0 [mlx5_core]
      [ 1185.420172]  mlx5e_netdev_change_profile+0x15a/0x1d0 [mlx5_core]
      [ 1185.421459]  mlx5e_vport_rep_load+0x557/0x780 [mlx5_core]
      [ 1185.422624]  ? mlx5e_stats_grp_vport_rep_num_stats+0x10/0x10 [mlx5_core]
      [ 1185.424006]  mlx5_esw_offloads_rep_load+0xdb/0x190 [mlx5_core]
      [ 1185.425277]  esw_offloads_enable+0xd74/0x14a0 [mlx5_core]
      
      Fix this by registering rep bridges with the per-net netdev notifier
      instead of the global one. The per-net notifier operates on the net
      namespace without holding pernet_ops_rwsem.
      
      Fixes: 19e9bfa0 ("net/mlx5: Bridge, add offload infrastructure")
      Signed-off-by: Roi Dayan <roid@nvidia.com>
      Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      880b5176
    • Dima Chumak's avatar
      net/mlx5: Fix offloading with ESWITCH_IPV4_TTL_MODIFY_ENABLE · 55b2ca70
      Dima Chumak authored
      Only prio 1 is supported in nic mode when the firmware has no ignore flow
      level support. Switchdev mode, however, supports a fixed number of
      statically pre-allocated prios, so the restriction is not relevant there
      and can be relaxed.
      
      Fixes: d671e109 ("net/mlx5: Fix tc max supported prio for nic mode")
      Signed-off-by: Dima Chumak <dchumak@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      55b2ca70
    • Roi Dayan's avatar
      net/mlx5e: TC, Reject rules with forward and drop actions · 5623ef8a
      Roi Dayan authored
      Such rules are redundant but allowed and passed to the driver.
      The driver does not support offloading such rules so return an error.
      
      Fixes: 03a9d11e ("net/mlx5e: Add TC drop and mirred/redirect action parsing for SRIOV offloads")
      Signed-off-by: Roi Dayan <roid@nvidia.com>
      Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      5623ef8a
    • Maher Sanalla's avatar
      net/mlx5: Use del_timer_sync in fw reset flow of halting poll · 3c5193a8
      Maher Sanalla authored
      Substitute del_timer() with del_timer_sync() in the fw reset polling
      deactivation flow to prevent a race condition that occurs when
      del_timer() is called and the timer is deactivated while another
      process is handling the timer interrupt. This race led to the
      following call trace:
      	RIP: 0010:run_timer_softirq+0x137/0x420
      	<IRQ>
      	recalibrate_cpu_khz+0x10/0x10
      	ktime_get+0x3e/0xa0
      	? sched_clock_cpu+0xb/0xc0
      	__do_softirq+0xf5/0x2ea
      	irq_exit_rcu+0xc1/0xf0
      	sysvec_apic_timer_interrupt+0x9e/0xc0
      	asm_sysvec_apic_timer_interrupt+0x12/0x20
      	</IRQ>
      
      Fixes: 38b9f903 ("net/mlx5: Handle sync reset request event")
      Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      3c5193a8
    • Gal Pressman's avatar
      net/mlx5e: Fix module EEPROM query · 4a08a131
      Gal Pressman authored
      When querying the module EEPROM, the 'offset' variable and the
      'query.offset' field were used inconsistently.
      Fix that by always using 'offset' and assigning its value to
      'query.offset' right before the mcia register read call.
      
      While at it, the cross-pages read size adjustment was changed to be more
      intuitive.
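
A minimal sketch of a cross-page read size adjustment of the kind mentioned above. The page length of 256 bytes and the helper name clamp_to_page() are assumptions made for illustration; they are not taken from the driver.

```c
#include <assert.h>

/* Hypothetical EEPROM page length, chosen only for this sketch. */
#define PAGE_LEN 256u

/*
 * Shrink 'size' so that a read starting at 'offset' (assumed to be
 * smaller than PAGE_LEN) never runs past the end of the page.
 */
static unsigned int clamp_to_page(unsigned int offset, unsigned int size)
{
	if (offset + size > PAGE_LEN)
		size = PAGE_LEN - offset;
	return size;
}
```

A caller would then issue a second read for the remaining bytes, starting at offset 0 of the next page.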
      
      Fixes: e19b0a34 ("net/mlx5: Refactor module EEPROM query")
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Signed-off-by: Gal Pressman <gal@nvidia.com>
      Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      4a08a131
    • Roi Dayan's avatar
      net/mlx5e: TC, Reject rules with drop and modify hdr action · a2446bc7
      Roi Dayan authored
      This kind of action is not supported by firmware and generates a
      syndrome.
      
      kernel: mlx5_core 0000:08:00.0: mlx5_cmd_check:777:(pid 102063): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x8708c3)
      
      Fixes: d7e75a32 ("net/mlx5e: Add offloading of E-Switch TC pedit (header re-write) actions")
      Signed-off-by: Roi Dayan <roid@nvidia.com>
      Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
      Reviewed-by: Maor Dickman <maord@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      a2446bc7
    • Vlad Buslov's avatar
      net/mlx5: Bridge, ensure dev_name is null-terminated · 350d9a82
      Vlad Buslov authored
      Even though net_device->name is guaranteed to be a null-terminated string
      of size <= IFNAMSIZ, the test robot complains that the return value of
      netdev_name() can be larger:
      
      In file included from include/trace/define_trace.h:102,
                          from drivers/net/ethernet/mellanox/mlx5/core/esw/diag/bridge_tracepoint.h:113,
                          from drivers/net/ethernet/mellanox/mlx5/core/esw/bridge.c:12:
         drivers/net/ethernet/mellanox/mlx5/core/esw/diag/bridge_tracepoint.h: In function 'trace_event_raw_event_mlx5_esw_bridge_fdb_template':
      >> drivers/net/ethernet/mellanox/mlx5/core/esw/diag/bridge_tracepoint.h:24:29: warning: 'strncpy' output may be truncated copying 16 bytes from a string of length 20 [-Wstringop-truncation]
            24 |                             strncpy(__entry->dev_name,
               |                             ^~~~~~~~~~~~~~~~~~~~~~~~~~
            25 |                                     netdev_name(fdb->dev),
               |                                     ~~~~~~~~~~~~~~~~~~~~~~
            26 |                                     IFNAMSIZ);
               |                                     ~~~~~~~~~
      
      This is caused by the fact that the default value of IFNAMSIZ is 16, while
      the placeholder value that netdev_name() returns for unnamed net devices
      is larger than that.
      
      The offending code is in a tracing function that is only called for mlx5
      representors, so there is no straightforward way to reproduce the issue, but
      let's fix it for correctness' sake by replacing strncpy() with strscpy() to
      ensure that the resulting string is always null-terminated.
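
A minimal userspace sketch of why an always-terminating copy matters here. bounded_copy() is a hypothetical stand-in for the kernel's strscpy(), written for illustration only; NAME_SIZE mirrors the default IFNAMSIZ of 16 mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_SIZE 16 /* same value as the default IFNAMSIZ */

/*
 * Hypothetical stand-in for strscpy(): copies at most size - 1 bytes
 * and always terminates dst. Returns the number of bytes copied, or
 * -1 if src had to be truncated (the kernel helper returns -E2BIG).
 * strncpy(), by contrast, leaves dst unterminated when src is at
 * least 'size' bytes long.
 */
static long bounded_copy(char *dst, const char *src, size_t size)
{
	size_t len;

	if (size == 0)
		return -1;

	len = strlen(src);
	if (len >= size) {
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -1;
	}

	memcpy(dst, src, len + 1); /* includes the terminating NUL */
	return (long)len;
}
```

Copying a 20-byte source into a 16-byte buffer with this helper yields a 15-character, null-terminated string and a truncation indication, instead of 16 unterminated bytes.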
      
      Fixes: 9724fd5d ("net/mlx5: Bridge, add tracepoints")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      350d9a82
    • Vlad Buslov's avatar
      net/mlx5: Bridge, take rtnl lock in init error handler · 04f8c12f
      Vlad Buslov authored
      The mlx5_esw_bridge_cleanup() is expected to be called with rtnl lock
      taken, which is true for mlx5e_rep_bridge_cleanup() function but not for
      error handling code in mlx5e_rep_bridge_init(). Add missing rtnl
      lock/unlock calls and extend both mlx5_esw_bridge_cleanup() and its dual
      function mlx5_esw_bridge_init() with ASSERT_RTNL() to verify the invariant
      from now on.
      
      Fixes: 7cd6a54a ("net/mlx5: Bridge, handle FDB events")
      Fixes: 19e9bfa0 ("net/mlx5: Bridge, add offload infrastructure")
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      04f8c12f
    • Jakub Kicinski's avatar
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · c7108979
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2022-01-31
      
      This series contains updates to i40e driver only.
      
      Jedrzej fixes a condition check which would cause an error when
      resetting bandwidth when DCB is active with one TC.
      
      Karen resolves a null pointer dereference that could occur when removing
      the driver while VSI rings are being disabled.
      
      * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
        i40e: Fix reset path while removing the driver
        i40e: Fix reset bw limit when DCB enabled with 1 TC
      ====================
      
      Link: https://lore.kernel.org/r/20220201000522.505909-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c7108979
    • Lior Nahmanson's avatar
      net: macsec: Verify that send_sci is on when setting Tx sci explicitly · d0cfa548
      Lior Nahmanson authored
      When the Tx sci is set explicitly, the Rx side is expected to use this
      sci and not recalculate it from the packet. However, if the Tx sci is
      explicit and send_sci is off, the receiver wrongly recalculates the sci
      from the source MAC address, which will most likely differ from the
      explicit sci.
      
      Fix by rejecting such a configuration when a macsec newlink is established
      and returning the EINVAL error code in such cases.
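
A minimal sketch of the validation rule described above. The struct tx_config, its fields, and validate_tx_config() are hypothetical names chosen for illustration; they do not mirror the macsec driver's data structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical view of the two settings involved in the check. */
struct tx_config {
	bool explicit_sci; /* the Tx sci was supplied by the user */
	bool send_sci;     /* the sci is carried in each packet */
};

/*
 * Reject an explicit Tx sci when send_sci is off: the receiver would
 * otherwise derive a different sci from the source MAC address.
 */
static int validate_tx_config(const struct tx_config *cfg)
{
	if (cfg->explicit_sci && !cfg->send_sci)
		return -EINVAL;
	return 0;
}
```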
      
      Fixes: c09440f7 ("macsec: introduce IEEE 802.1AE driver")
      Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
      Reviewed-by: Raed Salem <raeds@nvidia.com>
      Signed-off-by: Raed Salem <raeds@nvidia.com>
      Link: https://lore.kernel.org/r/1643542672-29403-1-git-send-email-raeds@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d0cfa548
    • Georgi Valkov's avatar
      ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback · 63e4b45c
      Georgi Valkov authored
      When rx_buf is allocated we need to account for IPHETH_IP_ALIGN,
      which reduces the usable size by 2 bytes. Otherwise we have 1512
      bytes usable instead of 1514, and if we receive more than 1512
      bytes, ipheth_rcvbulk_callback is called with status -EOVERFLOW,
      after which the driver malfunctions and all communication stops.
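
The arithmetic above can be sketched as follows. FRAME_MAX and IP_ALIGN reuse the values quoted in the text (1514 and 2); the helper usable_bytes() is a hypothetical name for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define FRAME_MAX 1514 /* largest frame that must fit, per the text above */
#define IP_ALIGN  2    /* bytes consumed by the alignment padding */

/* Usable payload bytes once the padding is carved out of the buffer. */
static size_t usable_bytes(size_t alloc_size)
{
	return alloc_size - IP_ALIGN;
}
```

Allocating FRAME_MAX bytes leaves 1512 usable, which is the overflow case; allocating FRAME_MAX + IP_ALIGN leaves the full 1514.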
      
      Resolves ipheth 2-1:4.2: ipheth_rcvbulk_callback: urb status: -75
      
      Fixes: f33d9e2b ("usbnet: ipheth: fix connectivity with iOS 14")
      Signed-off-by: Georgi Valkov <gvalkov@abv.bg>
      Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
      Link: https://lore.kernel.org/all/B60B8A4B-92A0-49B3-805D-809A2433B46C@abv.bg/
      Link: https://lore.kernel.org/all/24851bd2769434a5fc24730dce8e8a984c5a4505.1643699778.git.jan.kiszka@siemens.com/
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      63e4b45c
    • Eric Dumazet's avatar
      tcp: fix mem under-charging with zerocopy sendmsg() · 479f5547
      Eric Dumazet authored
      We got reports of the following warning in inet_sock_destruct():
      
      	WARN_ON(sk_forward_alloc_get(sk));
      
      Whenever we add a non zero-copy fragment to a pure zerocopy skb,
      we have to anticipate that whole skb->truesize will be uncharged
      when skb is finally freed.
      
      skb->data_len is the payload length. But the memory truesize
      estimated by __zerocopy_sg_from_iter() is page aligned.
      
      Fixes: 9b65b17d ("net: avoid double accounting for pure zerocopy skbs")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Talal Ahmad <talalahmad@google.com>
      Cc: Arjun Roy <arjunroy@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Link: https://lore.kernel.org/r/20220201065254.680532-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      479f5547
    • Eric Dumazet's avatar
      af_packet: fix data-race in packet_setsockopt / packet_setsockopt · e42e70ad
      Eric Dumazet authored
      When packet_setsockopt( PACKET_FANOUT_DATA ) reads po->fanout,
      no lock is held, meaning that another thread can change po->fanout.
      
      Given that po->fanout can only be set once during the socket lifetime
      (it is only cleared from fanout_release()), we can use
      READ_ONCE()/WRITE_ONCE() to document the race.
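
A minimal userspace sketch of the annotation pattern named above. The READ_ONCE()/WRITE_ONCE() macros below are simplified volatile-access approximations that rely on the __typeof__ compiler extension; the structs and helper functions are hypothetical and do not mirror af_packet.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified approximations of the kernel macros (GCC/Clang extension). */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct fanout { int id; };                      /* hypothetical */
struct sock_state { struct fanout *fanout; };   /* hypothetical */

/* Writer side: publish the pointer with a single marked store. */
static void fanout_attach(struct sock_state *po, struct fanout *f)
{
	WRITE_ONCE(po->fanout, f);
}

/* Reader side: take one marked snapshot and use only that snapshot. */
static int fanout_id_or_error(struct sock_state *po)
{
	struct fanout *f = READ_ONCE(po->fanout);

	if (!f)
		return -EINVAL;
	return f->id;
}
```

The marked accesses do not add locking; they make the lockless read and write explicit so that tools and readers can tell the race is intentional.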
      
      BUG: KCSAN: data-race in packet_setsockopt / packet_setsockopt
      
      write to 0xffff88813ae8e300 of 8 bytes by task 14653 on cpu 0:
       fanout_add net/packet/af_packet.c:1791 [inline]
       packet_setsockopt+0x22fe/0x24a0 net/packet/af_packet.c:3931
       __sys_setsockopt+0x209/0x2a0 net/socket.c:2180
       __do_sys_setsockopt net/socket.c:2191 [inline]
       __se_sys_setsockopt net/socket.c:2188 [inline]
       __x64_sys_setsockopt+0x62/0x70 net/socket.c:2188
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff88813ae8e300 of 8 bytes by task 14654 on cpu 1:
       packet_setsockopt+0x691/0x24a0 net/packet/af_packet.c:3935
       __sys_setsockopt+0x209/0x2a0 net/socket.c:2180
       __do_sys_setsockopt net/socket.c:2191 [inline]
       __se_sys_setsockopt net/socket.c:2188 [inline]
       __x64_sys_setsockopt+0x62/0x70 net/socket.c:2188
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000000000000000 -> 0xffff888106f8c000
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 14654 Comm: syz-executor.3 Not tainted 5.16.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 47dceb8e ("packet: add classic BPF fanout mode")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Link: https://lore.kernel.org/r/20220201022358.330621-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e42e70ad
    • Eric Dumazet's avatar
      rtnetlink: make sure to refresh master_dev/m_ops in __rtnl_newlink() · c6f6f244
      Eric Dumazet authored
      While looking at one unrelated syzbot bug, I found the replay logic
      in __rtnl_newlink() to potentially trigger use-after-free.
      
      It is better to clear master_dev and m_ops inside the loop,
      in case we have to replay it.
      
      Fixes: ba7d49b1 ("rtnetlink: provide api for getting and setting slave info")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/r/20220201012106.216495-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c6f6f244
    • Eric Dumazet's avatar
      net: sched: fix use-after-free in tc_new_tfilter() · 04c2a47f
      Eric Dumazet authored
      Whenever tc_new_tfilter() jumps back to replay: label,
      we need to make sure @q and @chain local variables are cleared again,
      or risk use-after-free as in [1]
      
      For consistency, apply the same fix in tc_ctl_chain()
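
A minimal userspace sketch of the replay hazard described above. struct resource and handle_request() are hypothetical; the flag clear_on_replay exists only so that the buggy and fixed behaviour can be compared in one function.

```c
#include <assert.h>
#include <stddef.h>

struct resource { int id; }; /* hypothetical object looked up per pass */

/*
 * Returns 1 if a pointer from the first pass is still set when the
 * second pass runs, 0 otherwise.
 */
static int handle_request(struct resource *first, int clear_on_replay)
{
	struct resource *q = NULL;
	int attempt = 0;

replay:
	if (clear_on_replay)
		q = NULL; /* the fix: reset per-pass state at the label */
	attempt++;

	if (attempt == 1) {
		q = first; /* first pass looks the object up ... */
		/* ... releases it, and asks for a retry (think -EAGAIN) */
		goto replay;
	}

	/* Second pass: this path never assigns q, so it should be NULL. */
	return q != NULL;
}
```

Without the reset, the second pass still sees the pointer from the first pass, even though the object behind it may already have been released.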
      
      BUG: KASAN: use-after-free in mini_qdisc_pair_swap+0x1b9/0x1f0 net/sched/sch_generic.c:1581
      Write of size 8 at addr ffff8880985c4b08 by task syz-executor.4/1945
      
      CPU: 0 PID: 1945 Comm: syz-executor.4 Not tainted 5.17.0-rc1-syzkaller-00495-gff58831f #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
       mini_qdisc_pair_swap+0x1b9/0x1f0 net/sched/sch_generic.c:1581
       tcf_chain_head_change_item net/sched/cls_api.c:372 [inline]
       tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:386
       tcf_chain_tp_insert net/sched/cls_api.c:1657 [inline]
       tcf_chain_tp_insert_unique net/sched/cls_api.c:1707 [inline]
       tc_new_tfilter+0x1e67/0x2350 net/sched/cls_api.c:2086
       rtnetlink_rcv_msg+0x80d/0xb80 net/core/rtnetlink.c:5583
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:725
       ____sys_sendmsg+0x331/0x810 net/socket.c:2413
       ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
       __sys_sendmmsg+0x195/0x470 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f2647172059
      Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f2645aa5168 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 00007f2647285100 RCX: 00007f2647172059
      RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000006
      RBP: 00007f26471cc08d R08: 0000000000000000 R09: 0000000000000000
      R10: 9e00000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007fffb3f7f02f R14: 00007f2645aa5300 R15: 0000000000022000
       </TASK>
      
      Allocated by task 1944:
       kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:45 [inline]
       set_alloc_info mm/kasan/common.c:436 [inline]
       ____kasan_kmalloc mm/kasan/common.c:515 [inline]
       ____kasan_kmalloc mm/kasan/common.c:474 [inline]
       __kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:524
       kmalloc_node include/linux/slab.h:604 [inline]
       kzalloc_node include/linux/slab.h:726 [inline]
       qdisc_alloc+0xac/0xa10 net/sched/sch_generic.c:941
       qdisc_create.constprop.0+0xce/0x10f0 net/sched/sch_api.c:1211
       tc_modify_qdisc+0x4c5/0x1980 net/sched/sch_api.c:1660
       rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5592
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:725
       ____sys_sendmsg+0x331/0x810 net/socket.c:2413
       ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
       __sys_sendmmsg+0x195/0x470 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Freed by task 3609:
       kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
       kasan_set_track+0x21/0x30 mm/kasan/common.c:45
       kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
       ____kasan_slab_free mm/kasan/common.c:366 [inline]
       ____kasan_slab_free+0x130/0x160 mm/kasan/common.c:328
       kasan_slab_free include/linux/kasan.h:236 [inline]
       slab_free_hook mm/slub.c:1728 [inline]
       slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1754
       slab_free mm/slub.c:3509 [inline]
       kfree+0xcb/0x280 mm/slub.c:4562
       rcu_do_batch kernel/rcu/tree.c:2527 [inline]
       rcu_core+0x7b8/0x1540 kernel/rcu/tree.c:2778
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
      
      Last potentially related work creation:
       kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
       __kasan_record_aux_stack+0xbe/0xd0 mm/kasan/generic.c:348
       __call_rcu kernel/rcu/tree.c:3026 [inline]
       call_rcu+0xb1/0x740 kernel/rcu/tree.c:3106
       qdisc_put_unlocked+0x6f/0x90 net/sched/sch_generic.c:1109
       tcf_block_release+0x86/0x90 net/sched/cls_api.c:1238
       tc_new_tfilter+0xc0d/0x2350 net/sched/cls_api.c:2148
       rtnetlink_rcv_msg+0x80d/0xb80 net/core/rtnetlink.c:5583
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:725
       ____sys_sendmsg+0x331/0x810 net/socket.c:2413
       ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
       __sys_sendmmsg+0x195/0x470 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The buggy address belongs to the object at ffff8880985c4800
       which belongs to the cache kmalloc-1k of size 1024
      The buggy address is located 776 bytes inside of
       1024-byte region [ffff8880985c4800, ffff8880985c4c00)
      The buggy address belongs to the page:
      page:ffffea0002617000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x985c0
      head:ffffea0002617000 order:3 compound_mapcount:0 compound_pincount:0
      flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
      raw: 00fff00000010200 0000000000000000 dead000000000122 ffff888010c41dc0
      raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 3, migratetype Unmovable, gfp_mask 0x1d20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL), pid 1941, ts 1038999441284, free_ts 1033444432829
       prep_new_page mm/page_alloc.c:2434 [inline]
       get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165
       __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389
       alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271
       alloc_slab_page mm/slub.c:1799 [inline]
       allocate_slab mm/slub.c:1944 [inline]
       new_slab+0x28a/0x3b0 mm/slub.c:2004
       ___slab_alloc+0x87c/0xe90 mm/slub.c:3018
       __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105
       slab_alloc_node mm/slub.c:3196 [inline]
       slab_alloc mm/slub.c:3238 [inline]
       __kmalloc+0x2fb/0x340 mm/slub.c:4420
       kmalloc include/linux/slab.h:586 [inline]
       kzalloc include/linux/slab.h:715 [inline]
       __register_sysctl_table+0x112/0x1090 fs/proc/proc_sysctl.c:1335
       neigh_sysctl_register+0x2c8/0x5e0 net/core/neighbour.c:3787
       devinet_sysctl_register+0xb1/0x230 net/ipv4/devinet.c:2618
       inetdev_init+0x286/0x580 net/ipv4/devinet.c:278
       inetdev_event+0xa8a/0x15d0 net/ipv4/devinet.c:1532
       notifier_call_chain+0xb5/0x200 kernel/notifier.c:84
       call_netdevice_notifiers_info+0xb5/0x130 net/core/dev.c:1919
       call_netdevice_notifiers_extack net/core/dev.c:1931 [inline]
       call_netdevice_notifiers net/core/dev.c:1945 [inline]
       register_netdevice+0x1073/0x1500 net/core/dev.c:9698
       veth_newlink+0x59c/0xa90 drivers/net/veth.c:1722
      page last free stack trace:
       reset_page_owner include/linux/page_owner.h:24 [inline]
       free_pages_prepare mm/page_alloc.c:1352 [inline]
       free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404
       free_unref_page_prepare mm/page_alloc.c:3325 [inline]
       free_unref_page+0x19/0x690 mm/page_alloc.c:3404
       release_pages+0x748/0x1220 mm/swap.c:956
       tlb_batch_pages_flush mm/mmu_gather.c:50 [inline]
       tlb_flush_mmu_free mm/mmu_gather.c:243 [inline]
       tlb_flush_mmu+0xe9/0x6b0 mm/mmu_gather.c:250
       zap_pte_range mm/memory.c:1441 [inline]
       zap_pmd_range mm/memory.c:1490 [inline]
       zap_pud_range mm/memory.c:1519 [inline]
       zap_p4d_range mm/memory.c:1540 [inline]
       unmap_page_range+0x1d1d/0x2a30 mm/memory.c:1561
       unmap_single_vma+0x198/0x310 mm/memory.c:1606
       unmap_vmas+0x16b/0x2f0 mm/memory.c:1638
       exit_mmap+0x201/0x670 mm/mmap.c:3178
       __mmput+0x122/0x4b0 kernel/fork.c:1114
       mmput+0x56/0x60 kernel/fork.c:1135
       exit_mm kernel/exit.c:507 [inline]
       do_exit+0xa3c/0x2a30 kernel/exit.c:793
       do_group_exit+0xd2/0x2f0 kernel/exit.c:935
       __do_sys_exit_group kernel/exit.c:946 [inline]
       __se_sys_exit_group kernel/exit.c:944 [inline]
       __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:944
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Memory state around the buggy address:
       ffff8880985c4a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880985c4a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff8880985c4b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                            ^
       ffff8880985c4b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880985c4c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      
      Fixes: 470502de ("net: sched: unlock rules update API")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Vlad Buslov <vladbu@mellanox.com>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Link: https://lore.kernel.org/r/20220131172018.3704490-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      04c2a47f
    • Jakub Kicinski's avatar
      ethernet: smc911x: fix indentation in get/set EEPROM · 6dde7acd
      Jakub Kicinski authored
      Build bot produced a smatch indentation warning.
      The code looks correct, but it mixes spaces and tabs.

      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/r/20220131211730.3940875-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6dde7acd
  3. 01 Feb, 2022 1 commit