1. 21 Apr, 2023 40 commits
    • netfilter: nf_tables: do not send complete notification of deletions · 28339b21
      Pablo Neira Ayuso authored
      In most cases, table, name and handle are sufficient for userspace to
      identify an object that has been deleted. Skipping unneeded fields in
      the netlink attributes in the message saves bandwidth (i.e. less chance
      of hitting ENOBUFS).
      
      Rules are an exception: the existing userspace monitor code relies on
      the rule definition. This exception can be removed by implementing a
      rule cache in userspace; this is already supported by the tracing
      infrastructure.
      
      Regarding flowtables, incremental deletion of devices is possible.
      Skipping a full notification allows userspace to differentiate between
      flowtable removal and incremental removal of devices.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      28339b21
    • netfilter: nf_tables: extended netlink error reporting for netdevice · c3c060ad
      Pablo Neira Ayuso authored
      Flowtable and netdev chains are bound to one or several netdevices;
      extend netlink error reporting to specify the netdevice that
      triggers the error.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      c3c060ad
    • ipvs: Correct spelling in comments · c7d15aaa
      Simon Horman authored
      Correct some spelling errors flagged by codespell and found by inspection.
      Signed-off-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      c7d15aaa
    • ipvs: Remove {Enter,Leave}Function · 210ffe4a
      Simon Horman authored
      Remove EnterFunction and LeaveFunction.
      
      These debugging macros seem well past their use-by date and seem to
      have little value these days. Removing them allows some trivial cleanup
      of exit paths in some functions; that cleanup is included in this
      patch. There is likely scope for further cleanup of both debugging and
      unwind paths. But let's leave that for another day.
      
      Only intended to change debug output, and only when CONFIG_IP_VS_DEBUG
      is enabled. Compile tested only.
      Signed-off-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      210ffe4a
    • ipvs: Consistently use array_size() in ip_vs_conn_init() · 28065493
      Simon Horman authored
      Consistently use array_size() to calculate the size of ip_vs_conn_tab
      in bytes.
      
      Flagged by Coccinelle:
       WARNING: array_size is already used (line 1498) to compute the same size
      
      No functional change intended.
      Compile tested only.
      Signed-off-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      28065493
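The point of computing a table size through one overflow-checked helper can be sketched in userspace C. This is only an illustration: `array_size_sketch()` and `table_bytes()` are hypothetical names, not the kernel helpers, and the saturate-to-`SIZE_MAX` behaviour is modelled on what the kernel's `array_size()` is documented to do on overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace sketch of an overflow-checked size calculation.
 * Hypothetical helper: returns SIZE_MAX when a * b does not fit in size_t.
 */
size_t array_size_sketch(size_t a, size_t b)
{
	if (b != 0 && a > SIZE_MAX / b)
		return SIZE_MAX;	/* saturate instead of wrapping around */
	return a * b;
}

/* Bytes needed for a table of `buckets` entries of `entry_size` bytes. */
size_t table_bytes(size_t buckets, size_t entry_size)
{
	assert(entry_size != 0);
	/* One helper for every size computation keeps the results consistent. */
	return array_size_sketch(buckets, entry_size);
}
```

Using the same helper both when allocating and when reporting the table size means the two values cannot silently diverge.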
    • ipvs: Update width of source for ip_vs_sync_conn_options · e3478c68
      Simon Horman authored
      In ip_vs_sync_conn_v0() a copy is made to struct ip_vs_sync_conn_options.
      That structure looks like this:
      
      struct ip_vs_sync_conn_options {
              struct ip_vs_seq        in_seq;
              struct ip_vs_seq        out_seq;
      };
      
      The source of the copy is the in_seq field of struct ip_vs_conn, whose
      type is struct ip_vs_seq. Thus the source is not as wide as the amount
      of data copied, which is the width of struct ip_vs_sync_conn_options.
      
      The copy is safe because the next field is another struct ip_vs_seq.
      Make use of struct_group() to annotate this.
      
      Flagged by gcc-13 as:
      
       In file included from ./include/linux/string.h:254,
                        from ./include/linux/bitmap.h:11,
                        from ./include/linux/cpumask.h:12,
                        from ./arch/x86/include/asm/paravirt.h:17,
                        from ./arch/x86/include/asm/cpuid.h:62,
                        from ./arch/x86/include/asm/processor.h:19,
                        from ./arch/x86/include/asm/timex.h:5,
                        from ./include/linux/timex.h:67,
                        from ./include/linux/time32.h:13,
                        from ./include/linux/time.h:60,
                        from ./include/linux/stat.h:19,
                        from ./include/linux/module.h:13,
                        from net/netfilter/ipvs/ip_vs_sync.c:38:
       In function 'fortify_memcpy_chk',
           inlined from 'ip_vs_sync_conn_v0' at net/netfilter/ipvs/ip_vs_sync.c:606:3:
       ./include/linux/fortify-string.h:529:25: error: call to '__read_overflow2_field' declared with attribute warning: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Werror=attribute-warning]
         529 |                         __read_overflow2_field(q_size_field, size);
             |
      
      Compile tested only.
      Signed-off-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      e3478c68
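The grouping idea can be sketched in userspace C11. `GROUP_SKETCH` is a hypothetical stand-in for the kernel's struct_group(), and the struct layouts, field names and the group name `sync_conn_opt` are simplified assumptions rather than the real ipvs definitions. The same members stay reachable directly and also through one named struct that spans both, so the copy source is as wide as the copy.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for struct_group(): members are visible both
 * directly and via a named struct covering all of them. */
#define GROUP_SKETCH(NAME, ...) \
	union { \
		struct { __VA_ARGS__ }; \
		struct { __VA_ARGS__ } NAME; \
	}

struct seq_sketch {			/* simplified; fields are assumptions */
	unsigned int init_seq;
	unsigned int delta;
};

struct sync_opts_sketch {
	struct seq_sketch in_seq;
	struct seq_sketch out_seq;
};

struct conn_sketch {
	int flags;
	GROUP_SKETCH(sync_conn_opt,
		struct seq_sketch in_seq;
		struct seq_sketch out_seq;
	);
};

void copy_seq_pair(struct sync_opts_sketch *dst, const struct conn_sketch *src)
{
	/* The source is the whole group, so its size matches the bytes copied. */
	assert(sizeof(src->sync_conn_opt) == sizeof(*dst));
	memcpy(dst, &src->sync_conn_opt, sizeof(*dst));
}
```

Existing code that reads `conn->in_seq` keeps working unchanged; only the memcpy source changes from the first member to the group.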
    • netfilter: nf_tables: do not store rule in traceinfo structure · 46df4175
      Florian Westphal authored
      pass it as argument instead.  This reduces size of traceinfo to
      16 bytes.  Total stack usage:
      
       nf_tables_core.c:252 nft_do_chain    304     static
      
      While it's possible to also pass basechain as argument, doing so
      increases nft_do_chaininfo function size.
      
      Unlike pktinfo/verdict/rule the basechain info isn't used in
      the expression evaluation path. gcc places it on the stack, which
      results in extra push/pop when it gets passed to the trace helpers
      as argument rather than as part of the traceinfo structure.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      46df4175
    • netfilter: nf_tables: do not store verdict in traceinfo structure · 0a202145
      Florian Westphal authored
      Just pass it as argument to nft_trace_notify. Stack is reduced by 8 bytes:
      
      nf_tables_core.c:256 nft_do_chain    312     static
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      0a202145
    • netfilter: nf_tables: do not store pktinfo in traceinfo structure · 698bb828
      Florian Westphal authored
      pass it as argument.  No change in object size.
      
      stack usage decreases by 8 bytes:
       nf_tables_core.c:254  nft_do_chain       320     static
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      698bb828
    • netfilter: nf_tables: remove unneeded conditional · 2a1d6abd
      Florian Westphal authored
      This helper is inlined, so keep it as small as possible.
      
      If the static key is true, there is only a very small chance
      that info->trace is false:
      
      1. tracing was enabled at this very moment, the static key was
         updated to active right after nft_do_table was called.
      
      2. tracing was disabled at this very moment.
         info->trace is already false, the static key is about to
         be patched to false soon.
      
      In both cases, no event will be sent because info->trace
      is false (checked in noinline slowpath). info->nf_trace is irrelevant.
      
      The nf_trace update is redundant in this case, but this will only
      happen for a short duration, when the static key flips.
      
             text  data   bss   dec   hex filename
      old:   2980   192    32  3204   c84 nf_tables_core.o
      new:   2964   192    32  3188   c74 nf_tables_core.o
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      2a1d6abd
    • netfilter: nf_tables: make validation state per table · 00c320f9
      Florian Westphal authored
      We only need to validate tables that saw changes in the current
      transaction.
      
      The existing code revalidates all tables, but this isn't needed as
      cross-table jumps are not allowed (chains have table scope).
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      00c320f9
    • netfilter: nf_tables: don't write table validation state without mutex · 9a32e985
      Florian Westphal authored
      The ->cleanup callback needs to be removed: it doesn't work anymore, as
      the transaction mutex is already released in the ->abort function.
      
      Instead, write the validation state after a successful validation pass;
      this happens from either the commit or the abort phase, where the
      transaction mutex is held.
      
      Fixes: f102d66b ("netfilter: nf_tables: use dedicated mutex to guard transactions")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      9a32e985
    • netfilter: nf_tables: don't store chain address on jump · 63e9bbbc
      Florian Westphal authored
      Now that the rule trailer/end marker and the rcu head reside in the
      same structure, we no longer need to save/restore the chain pointer
      when performing/returning from a jump.
      
      We can simply let the trace infra walk the evaluated rule until it
      hits the end marker and then fetch the chain pointer from there.
      
      When the rule is NULL (policy tracing), then chain and basechain
      pointers were already identical, so just use the basechain.
      
      This cuts size of jumpstack in half, from 256 to 128 bytes in 64bit,
      scripts/stackusage says:
      
      nf_tables_core.c:251 nft_do_chain    328     static
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      63e9bbbc
    • netfilter: nf_tables: don't store address of last rule on jump · d4d89e65
      Florian Westphal authored
      Walk the rule headers until the trailer one (last_bit flag set) instead
      of stopping at last_rule address.
      
      This avoids the need to store the address when jumping to another chain.
      
      This cuts the size of the jumpstack array by one third, on 64bit from
      384 to 256 bytes.  Stack usage is still quite large:
      
      scripts/stackusage:
      nf_tables_core.c:258 nft_do_chain    496     static
      
      Next patch will also remove chain pointer.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      d4d89e65
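Walking until a flagged trailer, instead of comparing against a stored end address, can be sketched in userspace C. The header layout below is a hypothetical simplification and not the kernel's struct nft_rule_dp; `blob_put()` and `blob_count_rules()` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified rule header. */
struct rule_hdr_sketch {
	uint32_t is_last;	/* non-zero only on the trailer */
	uint32_t dlen;		/* bytes of expression data after the header */
};

/* Append a header plus dlen zeroed data bytes; returns the new offset. */
size_t blob_put(unsigned char *blob, size_t cap, size_t off,
		uint32_t is_last, uint32_t dlen)
{
	struct rule_hdr_sketch h = { .is_last = is_last, .dlen = dlen };

	assert(off + sizeof(h) + dlen <= cap);
	memcpy(blob + off, &h, sizeof(h));
	memset(blob + off + sizeof(h), 0, dlen);
	return off + sizeof(h) + dlen;
}

/* Count rules by walking headers until the trailer; no end address needed. */
unsigned int blob_count_rules(const unsigned char *blob)
{
	unsigned int n = 0;
	size_t off = 0;

	for (;;) {
		struct rule_hdr_sketch h;

		memcpy(&h, blob + off, sizeof(h));	/* avoids unaligned access */
		if (h.is_last)
			break;
		n++;
		off += sizeof(h) + h.dlen;
	}
	return n;
}
```

Because the loop terminates on the flag, a caller that jumps into another blob only has to remember the current rule position, not the end of the blob.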
    • netfilter: nf_tables: merge nft_rules_old structure and end of ruleblob marker · e38fbfa9
      Florian Westphal authored
      In order to free the rules in a chain via call_rcu, the rule array used
      to stash a rcu_head and space for a pointer at the end of the rule array.
      
      When the current nft_rule_dp blob format got added in
      2c865a8a ("netfilter: nf_tables: add rule blob layout"), this resulted
      in a double-trailer:
      
        size (unsigned long)
        struct nft_rule_dp
          struct nft_expr
               ...
          struct nft_rule_dp
           struct nft_expr
               ...
          struct nft_rule_dp (is_last=1) // Trailer
      
      The trailer, struct nft_rule_dp (is_last=1), is not accounted for in size,
      so it can be located via start_addr + size.
      
      Because the rcu_head is stored after 'start+size' as well this means the
      is_last trailer is *aliased* to the rcu_head (struct nft_rules_old).
      
      This is harmless, because at this time the nft_do_chain function never
      evaluates/accesses the trailer, it only checks the address boundary:
      
              for (; rule < last_rule; rule = nft_rule_next(rule)) {
      ...
      
      But this way the last_rule address has to be stashed in the jump
      structure to restore it after returning from a chain.
      
      nft_do_chain stack usage has become way too big, so put it on a diet.
      
      Without this patch it is impossible to use
              for (; !rule->is_last; rule = nft_rule_next(rule)) {
      
      ... because on free, the needed update of the rcu_head will clobber the
      nft_rule_dp is_last bit.
      
      Furthermore, also stash the chain pointer in the trailer; this allows
      the nf_tables_trace infra to recover the original chain structure
      without the need to place it in the jump struct.
      
      After this patch it is trivial to diet the jump stack structure,
      done in the next two patches.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      e38fbfa9
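Recovering the owning chain from the trailer can be sketched in userspace C. Everything here is a hypothetical simplification: the struct names, the fixed header, and the helpers `blob_build()` and `chain_of()` are assumptions for illustration and do not mirror the kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct chain_sketch { const char *name; };

struct hdr_sketch {
	uint32_t is_last;	/* non-zero only on the trailer */
	uint32_t dlen;		/* bytes of data following the header */
};

/* The trailer starts with a normal header, then carries the chain pointer. */
struct trailer_sketch {
	struct hdr_sketch hdr;
	const struct chain_sketch *chain;
};

/* Write n rules and the trailer; returns the offset of the trailer. */
size_t blob_build(unsigned char *blob, size_t cap, const uint32_t *dlens,
		  size_t n, const struct chain_sketch *chain)
{
	struct trailer_sketch t = { .hdr = { .is_last = 1, .dlen = 0 },
				    .chain = chain };
	size_t off = 0;

	for (size_t i = 0; i < n; i++) {
		struct hdr_sketch h = { .is_last = 0, .dlen = dlens[i] };

		assert(off + sizeof(h) + h.dlen <= cap);
		memcpy(blob + off, &h, sizeof(h));
		memset(blob + off + sizeof(h), 0, h.dlen);
		off += sizeof(h) + h.dlen;
	}
	assert(off + sizeof(t) <= cap);
	memcpy(blob + off, &t, sizeof(t));
	return off;
}

/* From any rule position, walk to the trailer and fetch the chain. */
const struct chain_sketch *chain_of(const unsigned char *rule)
{
	for (;;) {
		struct hdr_sketch h;

		memcpy(&h, rule, sizeof(h));
		if (h.is_last) {
			struct trailer_sketch t;

			memcpy(&t, rule, sizeof(t));
			return t.chain;
		}
		rule += sizeof(h) + h.dlen;
	}
}
```

With the chain reachable from any rule in the blob, a jump stack entry has no need to carry a separate chain pointer.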
    • Merge tag 'wireless-next-2023-04-21' of... · ca288965
      Jakub Kicinski authored
      Merge tag 'wireless-next-2023-04-21' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
      
      Kalle Valo says:
      
      ====================
      wireless-next patches for v6.4
      
      Most likely the last -next pull request for v6.4. We have changes all
      over. rtw88 now supports SDIO bus and iwlwifi continues to work on
      Wi-Fi 7 support. Not many stack changes this time.
      
      Major changes:
      
      cfg80211/mac80211
       - fix some Fine Time Measurement (FTM) frames not being bufferable
       - flush frames before key removal to avoid potential unencrypted
         transmission depending on the hardware design
      
      iwlwifi
       - preparation for Wi-Fi 7 EHT and multi-link support
      
      rtw88
       - SDIO bus support
       - RTL8822BS, RTL8822CS and RTL8821CS SDIO chipset support
      
      rtw89
       - framework firmware backwards compatibility
      
      brcmfmac
       - Cypress 43439 SDIO support
      
      mt76
       - mt7921 P2P support
       - mt7996 mesh A-MSDU support
       - mt7996 EHT support
       - mt7996 coredump support
      
      wcn36xx
       - support for pronto v3 hardware
      
      ath11k
       - PCIe DeviceTree bindings
       - WCN6750: enable SAR support
      
      ath10k
       - convert DeviceTree bindings to YAML
      
      * tag 'wireless-next-2023-04-21' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (261 commits)
        wifi: rtw88: Update spelling in main.h
        wifi: airo: remove ISA_DMA_API dependency
        wifi: rtl8xxxu: Simplify setting the initial gain
        wifi: rtl8xxxu: Add rtl8xxxu_write{8,16,32}_{set,clear}
        wifi: rtl8xxxu: Don't print the vendor/product/serial
        wifi: rtw88: Fix memory leak in rtw88_usb
        wifi: rtw88: call rtw8821c_switch_rf_set() according to chip variant
        wifi: rtw88: set pkg_type correctly for specific rtw8821c variants
        wifi: rtw88: rtw8821c: Fix rfe_option field width
        wifi: rtw88: usb: fix priority queue to endpoint mapping
        wifi: rtw88: 8822c: add iface combination
        wifi: rtw88: handle station mode concurrent scan with AP mode
        wifi: rtw88: prevent scan abort with other VIFs
        wifi: rtw88: refine reserved page flow for AP mode
        wifi: rtw88: disallow PS during AP mode
        wifi: rtw88: 8822c: extend reserved page number
        wifi: rtw88: add port switch for AP mode
        wifi: rtw88: add bitmap for dynamic port settings
        wifi: rtw89: mac: use regular int as return type of DLE buffer request
        wifi: mac80211: remove return value check of debugfs_create_dir()
        ...
      ====================
      
      Link: https://lore.kernel.org/r/20230421104726.800BCC433D2@smtp.kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ca288965
    • net/packet: support mergeable feature of virtio · dfc39d40
      Jianfeng Tan authored
      Packet sockets, like tap, can be used as the backend for kernel vhost.
      In packet sockets, virtio net header size is currently hardcoded to be
      the size of struct virtio_net_hdr, which is 10 bytes; however, this is
      not always the case: some virtio features, such as mrg_rxbuf, need the
      virtio net header to be 12 bytes long.
      
      Mergeable buffers, as a virtio feature, are worth supporting: packets
      that are larger than one-mbuf size will be dropped in vhost worker's
      handle_rx if the mrg_rxbuf feature is not used, but large packets
      cannot be avoided and increasing mbuf's size is not economical.
      
      With this virtio feature enabled by virtio-user, packet sockets with a
      hardcoded 10-byte virtio net header will parse the mac header
      incorrectly in packet_snd by taking the last two bytes of the virtio
      net header as part of the mac header.
      This incorrect mac header parsing will cause the packet to be dropped
      later, when the underlying device's receive path checks for an invalid
      Ethernet header.
      
      Add an extra field, vnet_hdr_sz, in existing holes in struct
      packet_sock to record the virtio net header size currently in use, and
      support an extra sockopt, PACKET_VNET_HDR_SZ, to set it. Packet sockets
      then know the exact length of the virtio net header that the virtio
      user gives.
      In packet_snd, tpacket_snd and packet_recvmsg, instead of using a
      hardcoded virtio net header size, the code can get the exact
      vnet_hdr_sz from the corresponding packet_sock and parse the mac header
      correctly based on this information, which avoids packets being
      mistakenly dropped.
      Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
      Co-developed-by: Anqi Shen <amy.saq@antgroup.com>
      Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dfc39d40
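The two-byte misparse can be sketched in plain C, without opening a socket. `eth_header_sketch()` is a hypothetical helper and the frame bytes are made up for illustration; only the 10-byte and 12-byte header sizes come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

enum {
	VNET_HDR_LEN = 10,	/* plain virtio net header */
	VNET_HDR_MRG_LEN = 12,	/* header when mergeable rx buffers are used */
};

/*
 * Return a pointer to the Ethernet header that follows a virtio net header
 * of vnet_hdr_sz bytes, or NULL if the size is unknown or the buffer is
 * too short to hold the header.
 */
const unsigned char *eth_header_sketch(const unsigned char *buf, size_t len,
				       size_t vnet_hdr_sz)
{
	assert(buf != NULL);
	if (vnet_hdr_sz != VNET_HDR_LEN && vnet_hdr_sz != VNET_HDR_MRG_LEN)
		return NULL;
	if (len < vnet_hdr_sz)
		return NULL;
	return buf + vnet_hdr_sz;
}
```

For a frame that carries a 12-byte header, passing 10 makes the returned pointer land on the last two bytes of the virtio net header, which is the misparse the commit message describes; passing the recorded per-socket size lands on the real destination MAC.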
    • Merge branch 'mlx5-ipsec-fixes' · 156c9398
      David S. Miller authored
      Leon Romanovsky says:
      
      ====================
      Fixes to mlx5 IPsec implementation
      
      This small patchset includes various fixes and one refactoring patch
      which I collected for the features sent in this cycle, with one
      exception: the first patch.
      
      The first patch fixes code which was introduced in the previous cycle;
      however, I was able to trigger the FW error only in custom debug code,
      so I don't see a need to send it to net-rc.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      156c9398
    • net/mlx5e: Refactor duplicated code in mlx5e_ipsec_init_macs · 45fd01f2
      Leon Romanovsky authored
      The ARP discovery code has the same logic for RX and TX flows, but with
      different source and destination fields. Instead of duplicating the
      same code in mlx5e_ipsec_init_macs, let's refactor.
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      45fd01f2
    • net/mlx5e: Properly release work data structure · 94edec44
      Leon Romanovsky authored
      There are some flows in which the work structure is not allocated at
      all, so it needs to be checked prior to release of the data structure.
      
       general protection fault, probably for non-canonical address 0xdffffc000000000a: 0000 [#1] SMP KASAN
       KASAN: null-ptr-deref in range [0x0000000000000050-0x0000000000000057]
       CPU: 6 PID: 3486 Comm: kworker/6:0 Not tainted 6.3.0-rc5_for_upstream_debug_2023_04_06_11_01 #1
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Workqueue: events xfrm_state_gc_task
       RIP: 0010:mlx5e_xfrm_free_state+0x177/0x260 [mlx5_core]
       Code: c1 ea 03 80 3c 02 00 0f 85 f5 00 00 00 4c 8b a5 08 01 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7c 24 50 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 b7 00 00 00 49 8b 7c 24 50 e8 85 7c 09 e0 4c 89
       RSP: 0018:ffff888137a8fc50 EFLAGS: 00010206
       RAX: dffffc0000000000 RBX: ffff888180398000 RCX: 0000000000000000
       RDX: 000000000000000a RSI: ffffffffa1878227 RDI: 0000000000000050
       RBP: ffff88812a0c8000 R08: ffff888137a8fb60 R09: 0000000000000000
       R10: fffffbfff09aba0c R11: 0000000000000001 R12: 0000000000000000
       R13: ffff88812a0c8108 R14: ffffffff84c63480 R15: ffff8881acb63118
       FS:  0000000000000000(0000) GS:ffff88881eb00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f667e8bc000 CR3: 0000000004693006 CR4: 0000000000370ea0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
      
        ___xfrm_state_destroy+0x3c8/0x5e0
        xfrm_state_gc_task+0xf6/0x140
        ? ___xfrm_state_destroy+0x5e0/0x5e0
        process_one_work+0x7c2/0x1340
        ? lockdep_hardirqs_on_prepare+0x3f0/0x3f0
        ? pwq_dec_nr_in_flight+0x230/0x230
        ? spin_bug+0x1d0/0x1d0
        worker_thread+0x59d/0xec0
        ? __kthread_parkme+0xd9/0x1d0
        ? process_one_work+0x1340/0x1340
        kthread+0x28f/0x330
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x1f/0x30
      
       Modules linked in: sch_ingress openvswitch nsh mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_umad ib_ipoib ib_cm ib_uverbs ib_core vfio_pci vfio_pci_core vfio_iommu_type1 vfio cuse overlay zram zsmalloc fuse [last unloaded: mlx5_core]
       ---[ end trace 0000000000000000 ]---
      
      Fixes: 4562116f ("net/mlx5e: Generalize IPsec work structs")
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      94edec44
    • net/mlx5e: Compare all fields in IPv6 address · 3198ae7d
      Leon Romanovsky authored
      Fix size argument in memcmp to compare whole IPv6 address.
      
      Fixes: b3beba1f ("net/mlx5e: Allow policies with reqid 0, to support IKE policy holes")
      Reviewed-by: Raed Salem <raeds@nvidia.com>
      Reviewed-by: Emeel Hakim <ehakim@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3198ae7d
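The class of bug can be sketched in userspace C. The commit message does not show the original size argument, so the 4-byte compare below is an assumed illustration of a too-short memcmp, and `struct ip6_sketch` and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct ip6_sketch {
	uint8_t b[16];	/* an IPv6 address is 16 bytes */
};

/* Too-short compare (assumed example): only the first 4 bytes are checked. */
bool ip6_equal_short(const struct ip6_sketch *a, const struct ip6_sketch *b)
{
	assert(a && b);
	return memcmp(a->b, b->b, 4) == 0;
}

/* Full compare: the size argument covers the whole address. */
bool ip6_equal(const struct ip6_sketch *a, const struct ip6_sketch *b)
{
	assert(a && b);
	return memcmp(a->b, b->b, sizeof(a->b)) == 0;
}
```

Two addresses that share a prefix but differ in a later byte compare as equal with the short variant and as different with the full one.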
    • net/mlx5e: Don't overwrite extack message returned from IPsec SA validator · 697b3518
      Leon Romanovsky authored
      The addition of the new err_xfrm label caused error messages to be
      overwritten. Fix it by using the proper NL_SET_ERR_MSG_WEAK_MOD macro
      together with a change to the default message.
      
      Fixes: aa8bd0c9 ("net/mlx5e: Support IPsec acquire default SA")
      Reviewed-by: Raed Salem <raeds@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      697b3518
    • net/mlx5e: Fix FW error while setting IPsec policy block action · e239e31a
      Leon Romanovsky authored
      When trying to set IPsec policy block action the following error is
      generated:
      
       mlx5_cmd_out_err:803:(pid 3426): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed,
      	status bad parameter(0x3), syndrome (0x8708c3), err(-22)
      
      This error means that drop action is not allowed when modify action is
      set, so update the code to skip modify header for XFRM_POLICY_BLOCK action.
      
      Fixes: 67212396 ("net/mlx5e: Skip IPsec encryption for TX path without matching policy")
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e239e31a
    • net: stmmac:fix system hang when setting up tag_8021q VLAN for DSA ports · 35226750
      Yan Wang authored
      The system hangs because of dsa_tag_8021q_port_setup() ->
      stmmac_vlan_rx_add_vid().
      
      I found that calling pm_runtime_put() in stmmac_drv_probe()
      disables the clock.
      
      First, when the kernel is compiled with CONFIG_PM=y, stmmac's
      resume/suspend is active.
      
      Second, with stmmac as the DSA master, dsa_tag_8021q_port_setup()
      calls back into stmmac_vlan_rx_add_vid() when the DSA driver starts.
      The system then hangs because stmmac_vlan_rx_add_vid() accesses its
      registers after stmmac's clock has been disabled.
      
      I would suggest adding pm_runtime_resume_and_get() to
      stmmac_vlan_rx_add_vid(). This guarantees that the clock is running
      while the registers are in use.
      
      Fixes: b3dcb312 ("net: stmmac: correct clocks enabled in stmmac_vlan_rx_kill_vid()")
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Yan Wang <rk.code@outlook.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35226750
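The get/put bracketing around a register access can be sketched with a userspace mock. None of this is driver code: `struct dev_sketch`, `rt_get()`, `rt_put()`, `reg_write()` and `vlan_add_vid_sketch()` are hypothetical names, and the mock returns an error where real hardware with a gated clock might hang instead.

```c
#include <assert.h>
#include <stdbool.h>

struct dev_sketch {
	int usage;		/* runtime-PM style usage count */
	bool clk_on;		/* clock runs only while usage > 0 */
	unsigned int vlan_reg;	/* stand-in for a hardware register */
};

/* Take a reference; the first reference turns the clock on. */
int rt_get(struct dev_sketch *d)
{
	if (d->usage++ == 0)
		d->clk_on = true;
	return 0;
}

/* Drop a reference; the last reference turns the clock off. */
void rt_put(struct dev_sketch *d)
{
	assert(d->usage > 0);
	if (--d->usage == 0)
		d->clk_on = false;
}

/* A register write only succeeds while the clock is running. */
int reg_write(struct dev_sketch *d, unsigned int val)
{
	if (!d->clk_on)
		return -1;
	d->vlan_reg = val;
	return 0;
}

/* Bracket the register access with get/put so the clock is guaranteed on. */
int vlan_add_vid_sketch(struct dev_sketch *d, unsigned int vid)
{
	int ret = rt_get(d);

	if (ret < 0)
		return ret;
	ret = reg_write(d, vid);
	rt_put(d);
	return ret;
}
```

After a probe path has dropped its last reference, a bare `reg_write()` fails in this mock, while the bracketed helper succeeds and leaves the usage count where it found it.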
    • Merge branch 'pds_core' · d8bb3824
      David S. Miller authored
      Shannon Nelson says:
      
      ====================
      pds_core driver
      
      Summary:
      --------
      This patchset implements a new driver for use with the AMD/Pensando
      Distributed Services Card (DSC), intended to provide core configuration
      services through the auxiliary_bus and through a couple of EXPORTed
      functions for use initially in VFio and vDPA feature specific drivers.
      
      To keep this patchset to a manageable size, the pds_vdpa and pds_vfio
      drivers have been split out into their own patchsets to be reviewed
      separately.
      
      Detail:
      -------
      AMD/Pensando is making available a new set of devices for supporting vDPA,
      VFio, and potentially other features in the Distributed Services Card
      (DSC).  These features are implemented through a PF that serves as a Core
      device for controlling and configuring its VF devices.  These VF devices
      have separate drivers that use the auxiliary_bus to work through the Core
      device as the control path.
      
      Currently, the DSC supports standard ethernet operations using the
      ionic driver.  This is not replaced by the Core-based devices - these
      new devices are in addition to the existing Ethernet device.  Typical DSC
      configurations will include both PDS devices and Ionic Eth devices.
      However, there is a potential future path for ethernet services to come
      through this device as well.
      
      The Core device is a new PCI PF/VF device managed by a new driver
      'pds_core'.  The PF device has access to an admin queue for configuring
      the services used by the VFs, and sets up auxiliary_bus devices for each
      vDPA VF for communicating with the drivers for the vDPA devices.  The VFs
      may be for VFio or vDPA, and other services in the future; these VF types
      are selected as part of the DSC internal FW configurations, which is out
      of the scope of this patchset.
      
      When the vDPA support set is enabled in the core PF through its devlink
      param, auxiliary_bus devices are created for each VF that supports the
      feature.  The vDPA driver then connects to and uses this auxiliary_device
      to do control path configuration through the PF device.  This can then be
      used with the vdpa kernel module to provide devices for virtio_vdpa kernel
      module for host interfaces, or vhost_vdpa kernel module for interfaces
      exported into your favorite VM.
      
      A cheap ASCII diagram of a vDPA instance looks something like this:
      
                                      ,----------.
                                      |   vdpa   |
                                      '----------'
                                        |     ||
                                       ctl   data
                                        |     ||
                                .----------.  ||
                                | pds_vdpa |  ||
                                '----------'  ||
                                     |        ||
                             pds_core.vDPA.1  ||
                                     |        ||
                          .---------------.   ||
                          |   pds_core    |   ||
                          '---------------'   ||
                              ||         ||   ||
                            09:00.0      09:00.1
              == PCI ============================================
                              ||            ||
                         .----------.   .----------.
                  ,------|    PF    |---|    VF    |-------,
                  |      '----------'   '----------'       |
                  |                  DSC                   |
                  |                                        |
                  ------------------------------------------
      
      Changes:
        v11:
       - change strncpy to strscpy
         Reported-by: kernel test robot <lkp@intel.com>
         Link: https://lore.kernel.org/oe-kbuild-all/202304181137.WaZTYyAa-lkp@intel.com/
      
        v10:
      Link: https://lore.kernel.org/netdev/20230418003228.28234-1-shannon.nelson@amd.com/
       - remove CONFIG_DEBUG_FS guard static inline stuff
       - remove unnecessary 0 and null initializations
       - verify in driver load that PDS_CORE_DRV_NAME matches KBUILD_MODNAME
       - remove debugfs irqs_show(), redundant with /proc
       - return -ENOMEM if intr_info = kcalloc() fails
       - move the status code enum into pds_core_if.h as part of API definition
       - fix up one place in pdsc_devcmd_wait() we're using the status codes where we could use the errno
       - remove redundant calls to flush_workqueue()
       - grab config_lock before testing state bits in pdsc_fw_reporter_diagnose()
       - change pdsc_color_match() to return bool
       - remove useless VIF setup loop and just setup vDPA services for now
       - remove pf pointer from struct padev and have clients use pci_physfn()
       - drop use of "vf" in auxdev.c function names, make more generic
       - remove last of client ops struct and simply export the functions
       - drop drivers@pensando.io from MAINTAINERS and add new include dir
       - include dynamic_debug.h in adminq.c to protect dynamic_hex_dump()
       - fixed fw_slot type from u8 to int for handling error returns
       - fixed comment spelling
       - changed void arg in pdsc_adminq_post() to struct pdsc *
      
        v9:
      Link: https://lore.kernel.org/netdev/20230406234143.11318-1-shannon.nelson@amd.com/
       - change pdsc field name id to uid to clarify the unique id used for aux device
       - remove unnecessary pf->state and other checks in aux device creation
       - hardcode fw slotnames for devlink info, don't use strings from FW
       - handle errors from PDS_CORE_CMD_INIT devcmd call
       - tighten up health thread use of config_lock
       - remove pdsc_queue_health_check() layer over queuing health check
       - start pds_core.rst file in first patch, add to it incrementally
       - give more user interaction info in commit messages
       - removed a few more extraneous includes
      
        v8:
      Link: https://lore.kernel.org/netdev/20230330234628.14627-1-shannon.nelson@amd.com/
       - fixed deadlock problem, use devl_health_reporter_destroy() when devlink is locked
       - don't clear client_id until after auxiliary_device_uninit()
      
        v7:
      Link: https://lore.kernel.org/netdev/20230330192313.62018-1-shannon.nelson@amd.com/
       - use explicit devlink locking and devl_* APIs
       - move some of devlink setup logic into probe and remove
       - use debugfs_create_u{type}() for state and queue head and tail
       - add include for linux/vmalloc.h
      Reported-by: kernel test robot <lkp@intel.com>
           Link: https://lore.kernel.org/oe-kbuild-all/202303260420.Tgq0qobF-lkp@intel.com/
      
        v6:
      Link: https://lore.kernel.org/netdev/20230324190243.27722-1-shannon.nelson@amd.com/
       - removed version.h include noticed by kernel test robot's version check
      Reported-by: kernel test robot <lkp@intel.com>
           Link: https://lore.kernel.org/oe-kbuild-all/202303230742.pX3ply0t-lkp@intel.com/
       - fixed up the more egregious checkpatch line length complaints
       - make sure pdsc_auxbus_dev_register() checks padev pointer errcode
      
        v5:
      Link: https://lore.kernel.org/netdev/20230322185626.38758-1-shannon.nelson@amd.com/
       - added devlink health reporter for FW issues
       - removed asic_type, asic_rev, serial_num, fw_version from debugfs as
         they are available through other means
       - trimmed OS info in pdsc_identify(), we don't need to send that much info to the FW
       - removed reg/unreg from auxbus client API, they are now in the core when VF
         is started
       - removed need for pdsc definition in client by simplifying the padev to only carry
         struct pci_dev pointers rather than full struct pdsc to the pf and vf
       - removed the unused pdsc argument in pdsc_notify()
       - moved include/linux/pds/pds_core.h to driver/../pds_core/core.h
       - restored a few pds_core_if.h interface values and structs that are shared
         with FW source
       - moved final config_lock unlock to before tear down of timer and workqueue
         to be sure there are no deadlocks while waiting for any stragglers
       - changed use of PAGE_SIZE to local PDS_PAGE_SIZE to keep with FW layout needs
         without regard to kernel PAGE_SIZE configuration
       - removed the redundant *adminqcq argument from pdsc_adminq_post()
      
        v4:
      Link: https://lore.kernel.org/netdev/20230308051310.12544-1-shannon.nelson@amd.com/
       - reworked to attach to both Core PF and vDPA VF PCI devices
       - now creates auxiliary_device as part of each VF PCI probe, removes them on PCI remove
       - auxiliary devices now use simple unique id rather than PCI address for identifier
       - replaced home-grown event publishing with kernel-based notifier service
       - dropped live_migration parameter, not needed when not creating aux device for it
       - replaced devm_* functions with traditional interfaces
       - added MAINTAINERS entry
       - removed lingering traces of set/get_vf attribute adminq commands
       - trimmed some include lists
       - cleaned a kernel test robot complaint about a stray unused variable
              Link: https://lore.kernel.org/oe-kbuild-all/202302181049.yeUQMeWY-lkp@intel.com/
      
        v3:
      Link: https://lore.kernel.org/netdev/20230217225558.19837-1-shannon.nelson@amd.com/
       - changed names from "pensando" to "amd" and updated copyright strings
       - dropped the DEVLINK_PARAM_GENERIC_ID_FW_BANK for future development
       - changed the auxiliary device creation to be triggered by the
         PCI bus event BOUND_DRIVER, and torn down at UNBIND_DRIVER in order
         to properly handle users using the sysfs bind/unbind functions
       - dropped some noisy log messages
       - rebased to current net-next
      
        RFC to v2:
      Link: https://lore.kernel.org/netdev/20221207004443.33779-1-shannon.nelson@amd.com/
       - added separate devlink param patches for DEVLINK_PARAM_GENERIC_ID_ENABLE_MIGRATION
         and DEVLINK_PARAM_GENERIC_ID_FW_BANK, and dropped the driver specific implementations
       - updated descriptions for the new devlink parameters
       - dropped netdev support
       - dropped vDPA patches, will follow up later
       - separated fw update and fw bank select into their own patches
      
        RFC:
      Link: https://lore.kernel.org/netdev/20221118225656.48309-1-snelson@pensando.io/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d8bb3824
    • pds_core: Kconfig and pds_core.rst · ddbcb220
      Shannon Nelson authored
      Remaining documentation and Kconfig hook for building the driver.
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ddbcb220
    • pds_core: publish events to the clients · d24c2827
      Shannon Nelson authored
      When the Core device gets an event from the device, or notices
      the device FW to be up or down, it needs to send those events
      on to the clients that have an event handler.  Add the code to
      pass along the events to the clients.
      
      The entry points pdsc_register_notify() and pdsc_unregister_notify()
      are EXPORTed for other drivers that want to listen for these events.
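
      The register/publish/unregister flow described above can be sketched
      in plain userspace C. This is only an analogy, not the driver code:
      the changelog notes that the driver uses the kernel notifier service,
      whereas every name below (the sketch_* functions, the fixed-size
      client table, the counting handler) is hypothetical and invented for
      illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_CLIENTS 4

/* Hypothetical handler type: receives an event code and the client's context. */
typedef void (*sketch_event_fn)(unsigned long event, void *ctx);

struct sketch_client {
	sketch_event_fn fn;
	void *ctx;
};

static struct sketch_client sketch_clients[SKETCH_MAX_CLIENTS];

/* Add a handler to the first free slot; returns 0, or -1 if the table is full. */
static int sketch_register_notify(sketch_event_fn fn, void *ctx)
{
	assert(fn != NULL);
	for (size_t i = 0; i < SKETCH_MAX_CLIENTS; i++) {
		if (!sketch_clients[i].fn) {
			sketch_clients[i].fn = fn;
			sketch_clients[i].ctx = ctx;
			return 0;
		}
	}
	return -1;
}

/* Remove a previously registered handler, if present. */
static void sketch_unregister_notify(sketch_event_fn fn, void *ctx)
{
	for (size_t i = 0; i < SKETCH_MAX_CLIENTS; i++) {
		if (sketch_clients[i].fn == fn && sketch_clients[i].ctx == ctx) {
			sketch_clients[i].fn = NULL;
			sketch_clients[i].ctx = NULL;
		}
	}
}

/* Pass one event on to every registered handler; returns how many were called. */
static int sketch_notify(unsigned long event)
{
	int called = 0;

	for (size_t i = 0; i < SKETCH_MAX_CLIENTS; i++) {
		if (sketch_clients[i].fn) {
			sketch_clients[i].fn(event, sketch_clients[i].ctx);
			called++;
		}
	}
	return called;
}

/* Example handler: counts the events it sees in the int that ctx points to. */
static void sketch_count_event(unsigned long event, void *ctx)
{
	(void)event;
	(*(int *)ctx)++;
}
```

      Clients without a registered handler are simply skipped, which
      matches the "clients that have an event handler" wording above.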
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d24c2827
    • pds_core: add the aux client API · 10659034
      Shannon Nelson authored
      Add the client API operations for running adminq commands.
      The core registers the client with the FW, then the client
      has a context for requesting adminq services.  We expect
      to add additional operations for other clients, including
      requesting additional private adminqs and IRQs, but don't have
      the need yet.
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10659034
    • pds_core: devlink params for enabling VIF support · 40ced894
      Shannon Nelson authored
      Add the devlink parameter switches so the user can enable
      the features supported by the VFs.  The only feature supported
      at the moment is vDPA.
      
      Example:
          devlink dev param set pci/0000:2b:00.0 \
      	    name enable_vnet cmode runtime value true
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      40ced894
    • pds_core: add auxiliary_bus devices · 4569cce4
      Shannon Nelson authored
      An auxiliary_bus device is created for each vDPA type VF at VF
      probe and destroyed at VF remove.  The aux device name comes
      from the driver name + VIF type + the unique id assigned at PCI
      probe.  The VFs are always removed on PF remove, so there should
      be no issues with VFs trying to access missing PF structures.
      
      The auxiliary_device names will look like "pds_core.vDPA.nn"
      where 'nn' is the VF's uid.
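
      As a rough illustration of the naming scheme, the sketch below joins
      the three parts with dots, as the example name suggests. The helper
      sketch_aux_name() is hypothetical; in the kernel the final name is
      assembled by the auxiliary bus code rather than by a function like
      this one.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: build "<driver>.<VIF type>.<uid>" into buf.
 * Returns the name length, or -1 if the name did not fit.
 */
static int sketch_aux_name(char *buf, size_t len, const char *drv_name,
			   const char *vif_type, unsigned int uid)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "%s.%s.%u", drv_name, vif_type, uid);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

      With drv_name "pds_core", vif_type "vDPA" and uid 1 this yields the
      "pds_core.vDPA.1" form used in the commit message.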
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4569cce4
    • pds_core: add initial VF device handling · f53d9311
      Shannon Nelson authored
      This is the initial VF PCI driver framework for the new
      pds_vdpa VF device, which will work in conjunction with an
      auxiliary_bus client of the pds_core driver.  This does the
      very basics of registering for the new VF device, setting
      up debugfs entries, and registering with devlink.
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f53d9311
    • pds_core: set up the VIF definitions and defaults · 65e0185a
      Shannon Nelson authored
      The Virtual Interfaces (VIFs) supported by the DSC's
      configuration (vDPA, Eth, RDMA, etc) are reported in the
      dev_ident struct and made visible in debugfs.  At this point
      only vDPA is supported in this driver so we only set up
      devices for that feature.
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      65e0185a
    • pds_core: add FW update feature to devlink · 49ce92fb
      Shannon Nelson authored
      Add in the support for doing firmware updates.  Of the two
      main banks available, a and b, this updates the one not in
      use and then selects it for the next boot.
      
      Example:
          devlink dev flash pci/0000:b2:00.0 \
      	    file pensando/dsc_fw_1.63.0-22.tar
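
      The bank choice described above (update the bank not in use, then
      select it for the next boot) can be sketched as follows. Only the
      bookkeeping is modelled and writing the image is left out; the enum,
      struct and function names are hypothetical and do not come from the
      driver or the FW interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the two main FW banks, a and b. */
enum sketch_fw_bank { SKETCH_BANK_A, SKETCH_BANK_B };

struct sketch_fw_state {
	enum sketch_fw_bank running;   /* bank the FW is currently running from */
	enum sketch_fw_bank next_boot; /* bank selected for the next boot */
};

/* The update target is whichever main bank is not in use. */
static enum sketch_fw_bank sketch_inactive_bank(enum sketch_fw_bank running)
{
	return running == SKETCH_BANK_A ? SKETCH_BANK_B : SKETCH_BANK_A;
}

/*
 * Pick the inactive bank as the update target and select it for the
 * next boot. The running bank is left untouched until a reboot.
 */
static enum sketch_fw_bank sketch_fw_update(struct sketch_fw_state *st)
{
	enum sketch_fw_bank target;

	assert(st != NULL);
	target = sketch_inactive_bank(st->running);
	st->next_boot = target;
	return target;
}
```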
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      49ce92fb
    • pds_core: Add adminq processing and commands · 01ba61b5
      Shannon Nelson authored
      Add the service routines for submitting and processing
      the adminq messages and for handling notifyq events.
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      01ba61b5
    • pds_core: set up device and adminq · 45d76f49
      Shannon Nelson authored
      Set up the basic adminq and notifyq queue structures.  These are
      used mostly by the client drivers for feature configuration.
      These are essentially the same adminq and notifyq as in the
      ionic driver.
      
      Part of this includes querying for device identity and FW
      information, so we can make that available to devlink dev info.
      
        $ devlink dev info pci/0000:b5:00.0
        pci/0000:b5:00.0:
          driver pds_core
          serial_number FLM18420073
          versions:
              fixed:
                asic.id 0x0
                asic.rev 0x0
              running:
                fw 1.51.0-73
              stored:
                fw.goldfw 1.15.9-C-22
                fw.mainfwa 1.60.0-73
                fw.mainfwb 1.60.0-57
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      45d76f49
    • pds_core: add devlink health facilities · 25b450c0
      Shannon Nelson authored
      Add devlink health reporting on top of our fw watchdog.
      
      Example:
        # devlink health show pci/0000:2b:00.0 reporter fw
        pci/0000:2b:00.0:
          reporter fw
            state healthy error 0 recover 0
        # devlink health diagnose pci/0000:2b:00.0 reporter fw
         Status: healthy State: 1 Generation: 0 Recoveries: 0
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      25b450c0
    • pds_core: health timer and workqueue · c2dbb090
      Shannon Nelson authored
      Add in the periodic health check and the related workqueue,
      as well as the handlers for when a FW reset is seen.
      
      The firmware is polled every 5 seconds to be sure that it is
      still alive and that the FW generation didn't change.
      
      The alive check looks to see that the PCI bus is still readable
      and the fw_status still has the RUNNING bit on.  If not alive,
      the driver stops activity and tears things down.  When the FW
      recovers and the alive check again succeeds, the driver sets
      back up for activity.
      
      The generation check looks at the fw_generation to see if it
      has changed, which can happen if the FW crashed and recovered
      or was updated in between health checks.  If changed, the
      driver counts that as though the alive test failed and forces
      the fw_down/fw_up cycle.
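
      The alive and generation checks above amount to a small state
      machine, sketched here in standalone C. All names are hypothetical.
      The RUNNING bit position and the use of an all-ones read to stand for
      an unreadable PCI bus are assumptions made for the sketch, not values
      taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define SKETCH_FW_STATUS_RUNNING    0x01u
#define SKETCH_FW_STATUS_UNREADABLE 0xffu /* all-ones read: bus not readable */

enum sketch_action {
	SKETCH_NO_CHANGE,  /* nothing to do this poll */
	SKETCH_FW_DOWN,    /* stop activity and tear things down */
	SKETCH_FW_UP,      /* FW recovered: set back up for activity */
	SKETCH_FW_DOWN_UP, /* generation changed: force a down/up cycle */
};

struct sketch_health {
	bool fw_up;         /* driver currently considers the FW usable */
	uint8_t generation; /* fw_generation seen at the last good check */
};

/* Alive means the status is readable and the RUNNING bit is set. */
static bool sketch_fw_alive(uint8_t fw_status)
{
	if (fw_status == SKETCH_FW_STATUS_UNREADABLE)
		return false;
	return (fw_status & SKETCH_FW_STATUS_RUNNING) != 0;
}

/* One periodic poll: decide what the driver should do next. */
static enum sketch_action sketch_health_check(struct sketch_health *h,
					      uint8_t fw_status,
					      uint8_t fw_generation)
{
	bool alive;

	assert(h != NULL);
	alive = sketch_fw_alive(fw_status);

	if (!h->fw_up) {
		if (!alive)
			return SKETCH_NO_CHANGE;
		h->fw_up = true;
		h->generation = fw_generation;
		return SKETCH_FW_UP;
	}
	if (!alive) {
		h->fw_up = false;
		return SKETCH_FW_DOWN;
	}
	if (fw_generation != h->generation) {
		/* Treated as though the alive test had failed. */
		h->generation = fw_generation;
		return SKETCH_FW_DOWN_UP;
	}
	return SKETCH_NO_CHANGE;
}
```

      A caller would run sketch_health_check() from its periodic timer
      (every 5 seconds in the description above) and act on the result.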
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c2dbb090
    • pds_core: add devcmd device interfaces · 523847df
      Shannon Nelson authored
      The devcmd interface is the basic connection to the device through the
      PCI BAR for low level identification and command services.  This does
      the early device initialization and finds the identity data, and adds
      devcmd routines to be used by later driver bits.
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      523847df
    • pds_core: initial framework for pds_core PF driver · 55435ea7
      Shannon Nelson authored
      This is the initial PCI driver framework for the new pds_core device
      driver and its family of devices.  This does the very basics of
      registering for the new PF PCI device 1dd8:100c, setting up debugfs
      entries, and registering with devlink.
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55435ea7
    • Merge branch 'bridge-neigh-suppression' · 25c800b2
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      bridge: Add per-{Port, VLAN} neighbor suppression
      
      Background
      ==========
      
      In order to minimize the flooding of ARP and ND messages in the VXLAN
      network, EVPN includes provisions [1] that allow participating VTEPs to
      suppress such messages in case they know the MAC-IP binding and can
      reply on behalf of the remote host. In Linux, the above is implemented
      in the bridge driver using a per-port option called "neigh_suppress"
      that was added in kernel version 4.15 [2].
      
      Motivation
      ==========
      
      Some applications use ARP messages as keepalives between the application
      nodes in the network. This works perfectly well when two nodes are
      connected to the same VTEP. When a node goes down it will stop
      responding to ARP requests and the other node will notice it
      immediately.
      
      However, when the two nodes are connected to different VTEPs and
      neighbor suppression is enabled, the local VTEP will reply to ARP
      requests even after the remote node went down, until certain timers
      expire and the EVPN control plane decides to withdraw the MAC/IP
      Advertisement route for the address. Therefore, some users would like to
      be able to disable neighbor suppression on VLANs where such applications
      reside and keep it enabled on the rest.
      
      Implementation
      ==============
      
      The proposed solution is to allow user space to control neighbor
      suppression on a per-{Port, VLAN} basis, in a similar fashion to other
      per-port options that gained per-{Port, VLAN} counterparts such as
      "mcast_router". This allows users to benefit from the operational
      simplicity and scalability associated with shared VXLAN devices (i.e.,
      external / collect-metadata mode), while still allowing for per-VLAN/VNI
      neighbor suppression control.
      
      The user interface is extended with a new "neigh_vlan_suppress" bridge
      port option that allows user space to enable per-{Port, VLAN} neighbor
      suppression on the bridge port. When enabled, the existing
      "neigh_suppress" option has no effect and neighbor suppression is
      controlled using a new "neigh_suppress" VLAN option. Example usage:
      
       # bridge link set dev vxlan0 neigh_vlan_suppress on
       # bridge vlan add vid 10 dev vxlan0
       # bridge vlan set vid 10 dev vxlan0 neigh_suppress on
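
      The precedence between the port and VLAN options can be summarized
      with a small standalone sketch. The struct and function names are
      hypothetical and this is not the bridge driver's code; treating a
      missing VLAN entry as "not suppressed" is an assumption of the
      sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the bridge port and VLAN option state. */
struct sketch_port {
	bool neigh_suppress;      /* existing per-port option */
	bool neigh_vlan_suppress; /* new option: defer to the per-VLAN setting */
};

struct sketch_vlan {
	bool neigh_suppress;      /* new per-{Port, VLAN} option */
};

/*
 * Decide whether neighbor suppression applies to traffic on this
 * port and VLAN. vlan may be NULL when no VLAN entry exists.
 */
static bool sketch_neigh_suppress_enabled(const struct sketch_port *port,
					  const struct sketch_vlan *vlan)
{
	assert(port != NULL);

	/* With neigh_vlan_suppress on, the port option has no effect. */
	if (port->neigh_vlan_suppress)
		return vlan != NULL && vlan->neigh_suppress;

	return port->neigh_suppress;
}
```

      In terms of the commands above, vxlan0 has neigh_vlan_suppress on,
      so only VLAN 10, where neigh_suppress was set, is suppressed.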
      
      Testing
      =======
      
      Tested using existing bridge selftests. Added a dedicated selftest in
      the last patch.
      
      Patchset overview
      =================
      
      Patches #1-#5 are preparations.
      
      Patch #6 adds per-{Port, VLAN} neighbor suppression support to the
      bridge's data path.
      
      Patches #7-#8 add the required netlink attributes to enable the feature.
      
      Patch #9 adds a selftest.
      
      iproute2 patches can be found here [3].
      
      Changelog
      =========
      
      Since RFC [4]:
      
      No changes.
      
      [1] https://www.rfc-editor.org/rfc/rfc7432#section-10
      [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a42317785c898c0ed46db45a33b0cc71b671bf29
      [3] https://github.com/idosch/iproute2/tree/submit/neigh_suppress_v1
      [4] https://lore.kernel.org/netdev/20230413095830.2182382-1-idosch@nvidia.com/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      25c800b2