1. 18 Nov, 2022 40 commits
    • mrp: introduce active flags to prevent UAF when applicant uninit · ab037780
      Schspa Shi authored
      The caller of del_timer_sync() must prevent the timer from being
      restarted. Without this synchronization, there is a small probability
      that the cancellation will not succeed.

      syzbot reported the following crash:
      ==================================================================
      BUG: KASAN: use-after-free in hlist_add_head include/linux/list.h:929 [inline]
      BUG: KASAN: use-after-free in enqueue_timer+0x18/0xa4 kernel/time/timer.c:605
      Write at addr f9ff000024df6058 by task syz-fuzzer/2256
      Pointer tag: [f9], memory tag: [fe]
      
      CPU: 1 PID: 2256 Comm: syz-fuzzer Not tainted 6.1.0-rc5-syzkaller-00008-ge01d50cb #0
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace.part.0+0xe0/0xf0 arch/arm64/kernel/stacktrace.c:156
       dump_backtrace arch/arm64/kernel/stacktrace.c:162 [inline]
       show_stack+0x18/0x40 arch/arm64/kernel/stacktrace.c:163
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x68/0x84 lib/dump_stack.c:106
       print_address_description mm/kasan/report.c:284 [inline]
       print_report+0x1a8/0x4a0 mm/kasan/report.c:395
       kasan_report+0x94/0xb4 mm/kasan/report.c:495
       __do_kernel_fault+0x164/0x1e0 arch/arm64/mm/fault.c:320
       do_bad_area arch/arm64/mm/fault.c:473 [inline]
       do_tag_check_fault+0x78/0x8c arch/arm64/mm/fault.c:749
       do_mem_abort+0x44/0x94 arch/arm64/mm/fault.c:825
       el1_abort+0x40/0x60 arch/arm64/kernel/entry-common.c:367
       el1h_64_sync_handler+0xd8/0xe4 arch/arm64/kernel/entry-common.c:427
       el1h_64_sync+0x64/0x68 arch/arm64/kernel/entry.S:576
       hlist_add_head include/linux/list.h:929 [inline]
       enqueue_timer+0x18/0xa4 kernel/time/timer.c:605
       mod_timer+0x14/0x20 kernel/time/timer.c:1161
       mrp_periodic_timer_arm net/802/mrp.c:614 [inline]
       mrp_periodic_timer+0xa0/0xc0 net/802/mrp.c:627
       call_timer_fn.constprop.0+0x24/0x80 kernel/time/timer.c:1474
       expire_timers+0x98/0xc4 kernel/time/timer.c:1519
      
      To fix it, introduce a new active flag to make sure the timer will
      not restart.
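
The idea can be sketched in plain userspace C. This is a minimal model and not the kernel patch: `struct applicant`, `periodic_timer_fire()` and `applicant_uninit()` are hypothetical names, the `armed` field stands in for the timer being queued, and the sketch ignores the locking a real implementation would need.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the MRP applicant; not the kernel struct. */
struct applicant {
	bool active;	/* cleared before the timer is cancelled */
	bool armed;	/* models "timer is queued" (mod_timer() in the kernel) */
	int fired;	/* number of times the callback ran */
};

/* Timer callback: re-arms itself only while the applicant is active. */
static void periodic_timer_fire(struct applicant *app)
{
	app->armed = false;		/* the expiring timer leaves the queue */
	app->fired++;
	if (app->active)
		app->armed = true;	/* re-arm, as a periodic timer would */
}

/* Teardown: clear the flag first, then cancel the timer. */
static void applicant_uninit(struct applicant *app)
{
	app->active = false;
	app->armed = false;		/* models del_timer_sync() */
}
```

With the flag cleared before cancellation, a callback that runs late finds `active` false and leaves the timer unqueued, so nothing references the applicant after it is freed.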
      
      Reported-by: syzbot+6fd64001c20aa99e34a4@syzkaller.appspotmail.com
      Signed-off-by: Schspa Shi <schspa@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'rxrpc-next-20221116' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · 8cf4f8c7
      David S. Miller authored
      David Howells says:
      
      ====================
      rxrpc: Fix oops and missing config conditionals
      
      The patches that were pulled into net-next previously[1] had some issues
      that this patchset fixes:
      
       (1) Fix missing IPV6 config conditionals.
      
       (2) Fix an oops caused by calling udpv6_sendmsg() directly on an AF_INET
           socket.
      
       (3) Fix the validation of network addresses on entry to socket functions
           so that we don't allow an AF_INET6 address if we've selected an
           AF_INET transport socket.
      
      Link: https://lore.kernel.org/r/166794587113.2389296.16484814996876530222.stgit@warthog.procyon.org.uk/ [1]
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix napi_disable() logic error · fd896e38
      Eric Dumazet authored
      Dan reported a new warning after my recent patch:
      
      New smatch warnings:
      net/core/dev.c:6409 napi_disable() error: uninitialized symbol 'new'.
      
      Indeed, we must first wait for STATE_SCHED and STATE_NPSVC to be cleared,
      to make sure @new variable has been initialized properly.
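
The ordering described above can be sketched with C11 atomics. This is a userspace model under stated assumptions, not the kernel code: the bit values and `model_napi_disable()` are hypothetical, `atomic_compare_exchange_weak()` stands in for `try_cmpxchg()`, and the inner loop spins where the kernel would sleep.

```c
#include <stdatomic.h>

/* Illustrative bit values; the kernel's NAPIF_STATE_* constants differ. */
#define ST_SCHED	(1UL << 0)
#define ST_NPSVC	(1UL << 1)
#define ST_THREADED	(1UL << 2)

/* Hypothetical userspace model of the disable loop. */
static unsigned long model_napi_disable(_Atomic unsigned long *state)
{
	unsigned long val = atomic_load(state);
	unsigned long new;

	do {
		/* Wait first: 'new' is computed only once both bits are clear. */
		while (val & (ST_SCHED | ST_NPSVC))
			val = atomic_load(state);

		new = val | ST_SCHED | ST_NPSVC;
		new &= ~ST_THREADED;
		/* On failure 'val' is refreshed and the loop re-checks the bits. */
	} while (!atomic_compare_exchange_weak(state, &val, new));

	return new;
}
```

Because the wait sits inside the retry loop and ahead of the assignment, every path that reaches the compare-and-exchange has already written `new`.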
      
      Fixes: 4ffa1d1c ("net: adopt try_cmpxchg() in napi_{enable|disable}()")
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <error27@gmail.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: uninitialized variable in rxrpc_send_ack_packet() · 38461894
      Dan Carpenter authored
      The "pkt" was supposed to have been deleted in a previous patch.  It
      leads to an uninitialized variable bug.
      
      Fixes: 72f0c6fb ("rxrpc: Allocate ACK records at proposal and queue for transmission")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: fix rxkad_verify_response() · 101c1bb6
      Dan Carpenter authored
      The error handling for when skb_copy_bits() fails was accidentally deleted
      so the rxkad_decrypt_ticket() function is not called.
      
      Fixes: 5d7edbc9 ("rxrpc: Get rid of the Rx ring")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: mtk_eth_soc: remove cpu_relax in mtk_pending_work · ec8cd134
      Lorenzo Bianconi authored
      Get rid of cpu_relax() in the mtk_pending_work() routine, since
      MTK_RESETTING is set only in mtk_pending_work() and it runs holding the
      rtnl lock.
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: mtk_eth_soc: do not overwrite mtu configuration running reset routine · b677d6c7
      Lorenzo Bianconi authored
      Restore the user-configured MTU when running mtk_hw_init() during the tx
      timeout routine, since it will be overwritten after a hw reset.
      Reported-by: Felix Fietkau <nbd@nbd.name>
      Fixes: 9ea4d311 ("net: ethernet: mediatek: add the whole ethernet reset into the reset process")
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: avoid a null pointer dereference · 15b4f993
      Alex Elder authored
      Dan Carpenter reported that Smatch found an instance where a pointer
      that had previously been treated as possibly null (as indicated by a
      null check) was later dereferenced without a similar check.
      
      In practice this doesn't lead to a problem because currently the
      pointers used are all non-null.  Nevertheless this patch addresses
      the reported problem.
      
      In addition, I spotted another bug that arose in the same commit.
      When the command to initialize a routing table memory region was
      added, the number of entries computed for the non-hashed table
      was wrong (it ended up being a Boolean rather than the count
      intended).  This bug is fixed here as well.
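
A generic illustration of how a count can collapse into a Boolean in C follows. It is not the IPA driver code: `struct region`, `entry_count_buggy()` and `entry_count_fixed()` are hypothetical, and the 8-byte entry size is an assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory-region descriptor; not the IPA driver's struct. */
struct region {
	uint32_t size;	/* region size in bytes */
};

/* Buggy pattern: '&&' yields 0 or 1, so the division result is lost. */
static uint32_t entry_count_buggy(const struct region *mem)
{
	return mem && mem->size / sizeof(uint64_t);
}

/* Intended pattern: a null region has zero entries, else size / entry size. */
static uint32_t entry_count_fixed(const struct region *mem)
{
	return mem ? mem->size / sizeof(uint64_t) : 0;
}
```

For a 64-byte region the buggy form returns 1 while the fixed form returns 8, and both handle a null region without dereferencing it.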
      Reported-by: Dan Carpenter <error27@gmail.com>
      Link: https://lore.kernel.org/kernel-janitors/Y3OOP9dXK6oEydkf@kili
      Tested-by: Caleb Connolly <caleb.connolly@linaro.com>
      Fixes: 5cb76899 ("net: ipa: reduce arguments to ipa_table_init_add()")
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'wireless-next-2022-11-18' of... · c609d739
      David S. Miller authored
      Merge tag 'wireless-next-2022-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
      
      Kalle Valo says:
      
      ====================
      wireless-next patches for v6.2
      
      Second set of patches for v6.2. Only driver patches this time, nothing
      really special. Unused platform data support was removed from wl1251
      and rtw89 got WoWLAN support.
      
      Major changes:
      
      ath11k
      
      * support configuring channel dwell time during scan
      
      rtw89
      
      * new dynamic header firmware format support
      
      * Wake-over-WLAN support
      
      rtl8xxxu
      
      * enable IEEE80211_HW_SUPPORT_FAST_XMIT
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sctp-vrf' · 22700706
      David S. Miller authored
      Xin Long says:
      
      ====================
      sctp: support vrf processing
      
      This patchset adds the VRF processing in SCTP. Similar to TCP/UDP,
      it includes socket bind and socket/association lookup changes.
      
      For socket bind change, it allows sockets to bind to a VRF device
      and allows multiple sockets with the same IP and PORT to bind to
      different interfaces in patch 1-3.
      
      For socket/association lookup change, it adds dif and sdif check
      in both asoc and ep lookup in patch 4 and 5, and when binding to
      nodev, users can decide whether to accept the packets received from
      one l3mdev by setting a sysctl option in patch 6.
      
      Note with VRF support, in a netns, an association will be decided
      by src ip + src port + dst ip + dst port + bound_dev_if, and it's
      possible for ss to have:
      
        State       Local Address:Port      Peer Address:Port
         ESTAB     192.168.1.2%vrf-s1:1234
         `- ESTAB   192.168.1.2%veth1:1234   192.168.1.1:1234
         ESTAB     192.168.1.2%vrf-s2:1234
         `- ESTAB   192.168.1.2%veth2:1234   192.168.1.1:1234
      
      See the selftest in patch 7 for more usage.
      
      Also, thanks to Carlo for testing this patch series in their use case.
      
      v1->v2:
        - In Patch 5, move sctp_sk_bound_dev_eq() definition to net/sctp/
          input.c to avoid a build error when IP_SCTP is disabled, as Paolo
          suggested.
        - In Patch 7, avoid one sleep by disabling the IPv6 dad, and remove
          another sleep by using ss to check if the server's ready, and also
          delete two unnecessary sleeps in sctp_hello.c, as Paolo suggested.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: add a selftest for sctp vrf · a61bd7b9
      Xin Long authored
      This patch adds 12 small test cases: 01-04 test the sysctl
      net.sctp.l3mdev_accept; 05-10 test that the connection can be
      created only when binding to the right l3mdev device; 11-12 test
      that two socks bound to different l3mdev devices at the same time
      can each process the packets from the corresponding peer. The
      tests run for both IPv4 and IPv6 SCTP.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add sysctl net.sctp.l3mdev_accept · b712d032
      Xin Long authored
      This patch is to add sysctl net.sctp.l3mdev_accept to allow
      users to change the pernet global l3mdev_accept.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add dif and sdif check in asoc and ep lookup · 0af03170
      Xin Long authored
      This patch first adds a pernet global l3mdev_accept to decide whether
      to accept packets from a l3mdev when a SCTP socket isn't bound to
      any interface. It's set to 1 to avoid any possible incompatibility,
      and in the next patch, a sysctl will be introduced to allow changing it.

      Then, similar to inet/udp_sk_bound_dev_eq(), sctp_sk_bound_dev_eq() is
      added to check whether either dif or sdif is equal to sk_bound_dev_if,
      and, if sk_bound_dev_if is not set, whether sdif is 0 or l3mdev_accept
      is 1. This function is used to match an association or an endpoint,
      and is called by sctp_addrs_lookup_transport() and
      sctp_endpoint_is_match(). All functions that need updating are:
      
      sctp_rcv():
        asoc:
        __sctp_rcv_lookup()
          __sctp_lookup_association() -> sctp_addrs_lookup_transport()
          __sctp_rcv_lookup_harder()
            __sctp_rcv_init_lookup()
               __sctp_lookup_association() -> sctp_addrs_lookup_transport()
            __sctp_rcv_walk_lookup()
               __sctp_rcv_asconf_lookup()
                 __sctp_lookup_association() -> sctp_addrs_lookup_transport()
      
        ep:
        __sctp_rcv_lookup_endpoint() -> sctp_endpoint_is_match()
      
      sctp_connect():
        sctp_endpoint_is_peeled_off()
          __sctp_lookup_association()
            sctp_has_association()
              sctp_lookup_association()
                __sctp_lookup_association() -> sctp_addrs_lookup_transport()
      
      sctp_diag_dump_one():
        sctp_transport_lookup_process() -> sctp_addrs_lookup_transport()
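
The matching rule described in this commit message can be modelled in userspace C. This is a sketch, not the kernel helper: `bound_dev_eq()` is a hypothetical name, and `l3mdev_accept` is passed in directly where the kernel would read the pernet setting.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model of the matching rule.
 *   bound_dev_if: interface index the socket is bound to (0 = unbound)
 *   dif:          interface index the packet is seen on
 *   sdif:         index of the enslaved device when the packet came
 *                 through an l3mdev, 0 otherwise
 */
static bool bound_dev_eq(bool l3mdev_accept, int bound_dev_if, int dif, int sdif)
{
	/* Unbound socket: accept non-l3mdev traffic, or anything if allowed. */
	if (!bound_dev_if)
		return !sdif || l3mdev_accept;

	/* Bound socket: either index must match the bound device. */
	return bound_dev_if == dif || bound_dev_if == sdif;
}
```

For example, an unbound socket with `l3mdev_accept` set to 0 rejects a packet whose `sdif` is non-zero, while a socket bound to either of the two indexes accepts it.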
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add skb_sdif in struct sctp_af · 33e93ed2
      Xin Long authored
      Add skb_sdif function in struct sctp_af to get the enslaved device
      for both ipv4 and ipv6 when adding SCTP VRF support in sctp_rcv in
      the next patch.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: check sk_bound_dev_if when matching ep in get_port · f87b1ac0
      Xin Long authored
      In sctp_get_port_local(), when binding to IP and PORT, it should
      also check sk_bound_dev_if to match listening sk if it's set by
      SO_BINDTOIFINDEX, so that multiple sockets with the same IP and
      PORT, but different sk_bound_dev_if, can listen at the same
      time.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: check ipv6 addr with sk_bound_dev if set · 6fe1e524
      Xin Long authored
      When binding to an ipv6 address, it calls ipv6_chk_addr() to check if
      this address is on any dev. If a socket binds to a l3mdev but no dev
      is passed to do this check, all l3mdev and slaves will be skipped and
      the check will fail.
      
      This patch passes the bound_dev to make sure the devices under the
      same l3mdev can be returned in ipv6_chk_addr(). When the bound_dev is
      not a l3mdev or l3slave, l3mdev_master_dev_rcu() will return NULL in
      __ipv6_chk_addr_and_flags(), so the behavior stays compatible with
      before, when a NULL dev was passed.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: verify the bind address with the tb_id from l3mdev · 26943aef
      Xin Long authored
      After binding to a l3mdev, it should use the route table from the
      corresponding VRF to verify the addr when binding to an address.
      
      Note ipv6 doesn't need it, as binding to ipv6 address does not
      verify the addr with route lookup.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: libwx: Fix dead code for duplicate check · 0b6ffefb
      Jiawen Wu authored
      Fix duplicate check on polling timeout.
      
      Fixes: 1efa9bfe ("net: libwx: Implement interaction with firmware")
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: mscc: macsec: do not copy encryption keys · 0dc33c65
      Antoine Tenart authored
      Following 1b16b3fd ("net: phy: mscc: macsec: clear encryption keys when freeing a flow"),
      go one step further: instead of calling memzero_explicit() on the key
      when freeing a flow, simply do not copy the key in the first place, as
      it's only used when a new flow is set up.
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-ipa-change-gsi-firmware-load-specification' · a452d30f
      Jakub Kicinski authored
      Alex Elder says:
      
      ====================
      net: ipa: change GSI firmware load specification
      
      Currently, GSI firmware must be loaded for IPA before it can be
      used--either by the modem, or by the AP.  New hardware supports a
      third option, with the bootloader taking responsibility for loading
      GSI firmware.  In that case, neither the AP nor the modem needs to
      do that.
      
      The first patch in this series deprecates the "modem-init" Device
      Tree property in the IPA binding, using a new "qcom,gsi-loader"
      property instead.  The second and third implement logic in the code
      to support either the "old" or the "new" way of specifying how GSI
      firmware is loaded.
      
      The last two patches implement a new value for the "qcom,gsi-loader"
      property.  If the value is "skip", neither the AP nor modem needs to
      load the GSI firmware.  The first of these patches implements the
      change in the IPA binding; the second implements it in the code.
      ====================
      
      Link: https://lore.kernel.org/r/20221116073257.34010-1-elder@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: permit GSI firmware loading to be skipped · 7569805e
      Alex Elder authored
      Define a new value "skip" for the "qcom,gsi-loader" Device Tree
      property.  If used, it indicates that neither the AP nor the modem
      need to load GSI firmware (because it has already been loaded--for
      example by the boot loader).
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dt-bindings: net: qcom,ipa: support skipping GSI firmware load · a49c3ab7
      Alex Elder authored
      Add a new enumerated value to those defined for the qcom,gsi-loader
      property.  If the qcom,gsi-loader is "skip", the GSI firmware will
      already be loaded, so neither the AP nor modem is required to load
      GSI firmware.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: introduce "qcom,gsi-loader" property · 07f2f8e1
      Alex Elder authored
      Introduce a new way of specifying how the GSI firmware gets loaded
      for IPA.  Currently, this is indicated by the presence or absence of
      the Boolean "modem-init" Device Tree property.  The new property
      must have a value--either "self" or "modem"--which indicates whether
      the AP or modem is the GSI firmware loader, respectively.
      
      For legacy systems, the new property will not exist, and the
      "modem-init" property will be used.  For newer systems, the
      "qcom,gsi-loader" property *must* exist, and must have one of the
      two prescribed values.  It is an error to have both properties
      defined, and it is an error for the new property to have an
      unrecognized value.
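
The rules in this commit message can be summarized as a small decision function. This is a userspace sketch and not the driver code: `enum loader` and `pick_loader()` are hypothetical, the property values arrive as plain arguments where the driver would read them from the Device Tree, and it assumes a present "modem-init" means the modem loads the firmware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum loader { LOADER_INVALID, LOADER_SELF, LOADER_MODEM };

/*
 * gsi_loader: value of "qcom,gsi-loader", or NULL if the property is absent.
 * modem_init: true if the legacy Boolean "modem-init" property is present.
 */
static enum loader pick_loader(const char *gsi_loader, bool modem_init)
{
	if (!gsi_loader)	/* legacy system: fall back to "modem-init" */
		return modem_init ? LOADER_MODEM : LOADER_SELF;
	if (modem_init)		/* both properties defined: error */
		return LOADER_INVALID;
	if (!strcmp(gsi_loader, "self"))
		return LOADER_SELF;
	if (!strcmp(gsi_loader, "modem"))
		return LOADER_MODEM;
	return LOADER_INVALID;	/* unrecognized value */
}
```

Only the two values named in this commit are handled; any other string is treated as unrecognized.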
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: encapsulate decision about firmware load · 50f803d4
      Alex Elder authored
      The GSI layer used for IPA requires firmware to be loaded.
      
      Currently either the AP or the modem loads the firmware,
      distinguished by whether the "modem-init" Device Tree
      property is defined.
      
      Some newer systems implement a third option.  In preparation for
      that, encapsulate the code that determines how the GSI firmware
      gets loaded in a new function, ipa_firmware_loader().
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dt-bindings: net: qcom,ipa: deprecate modem-init · 4ca0c647
      Alex Elder authored
      GSI firmware for IPA must be loaded during initialization, either by
      the AP or by the modem.  The loader is currently specified based on
      whether the Boolean modem-init property is present.
      
      Instead, use a new property with an enumerated value to indicate
      explicitly how GSI firmware gets loaded.  With this in place, a
      third approach can be added in an upcoming patch.
      
      The new qcom,gsi-loader property has two defined values:
        - self:   The AP loads GSI firmware
        - modem:  The modem loads GSI firmware
      The modem-init property must still be supported, but is now marked
      deprecated.
      
      Update the example so it represents the SC7180 SoC, and provide
      examples for the qcom,gsi-loader, memory-region, and firmware-name
      properties.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • sctp: move SCTP_PAD4 and SCTP_TRUNC4 to linux/sctp.h · 647541ea
      Xin Long authored
      Move these two macros from net/sctp/sctp.h to linux/sctp.h, so that
      it will be enough to include only linux/sctp.h in nft_exthdr.c and
      xt_sctp.c. It should not include "net/sctp/sctp.h" if a module does
      not have a dependence on SCTP module.
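
For readers unfamiliar with the two macros, here is a sketch of what they compute. The definitions below reflect my understanding of the kernel header and should be treated as an assumption; `padded_len()` is a hypothetical helper added only for illustration.

```c
/* Round s up to the next multiple of 4. */
#define SCTP_PAD4(s)	(((s) + 3) & ~3)
/* Round s down to a multiple of 4. */
#define SCTP_TRUNC4(s)	((s) & ~3)

/* Hypothetical helper: on-wire length of a chunk padded to 4 bytes. */
static unsigned int padded_len(unsigned int chunk_len)
{
	return SCTP_PAD4(chunk_len);
}
```

Both macros only mask the low two bits, so they depend on nothing else in the SCTP module, which fits the goal of moving them into the lighter header.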
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Saeed Mahameed <saeed@kernel.org>
      Link: https://lore.kernel.org/r/ef6468a687f36da06f575c2131cd4612f6b7be88.1668526821.git.lucien.xin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • sctp: change to include linux/sctp.h in net/sctp/checksum.h · b78c4162
      Xin Long authored
      Currently "net/sctp/checksum.h", which includes "net/sctp/sctp.h", is
      included in quite a few places in netfilter, openvswitch and
      net/sched. It's not necessary to include "net/sctp/sctp.h" if
      a module does not depend on SCTP; "linux/sctp.h" is
      the right one to include.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Saeed Mahameed <saeed@kernel.org>
      Link: https://lore.kernel.org/r/ca7ea96d62a26732f0491153c3979dc1c0d8d34a.1668526793.git.lucien.xin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'implement-devlink-rate-api-and-extend-it' · 24f627a3
      Jakub Kicinski authored
      Michal Wilczynski says:
      
      ====================
      Implement devlink-rate API and extend it
      
      This patch series implements devlink-rate for the ice driver. Unfortunately,
      the current API isn't flexible enough for our use case, so there is a need to
      extend it. Some functions have been introduced to enable the driver to
      export the current Tx scheduling configuration.

      Pasting the justification for this series from the commit implementing
      devlink-rate in the ice driver (which is part of this series):
      
      There is a need to support modification of Tx scheduler tree, in the
      ice driver. This will allow user to control Tx settings of each node in
      the internal hierarchy of nodes. As a result user will be able to use
      Hierarchy QoS implemented entirely in the hardware.
      
      This patch implements devlink-rate API. It also exports initial
      default hierarchy. It's mostly dictated by the fact that the tree
      can't be removed entirely, all we can do is enable the user to modify
      it. For example root node shouldn't ever be removed, also nodes that
      have children are off-limits.
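
The removal constraints described above can be sketched as a small tree model. This is illustrative only and not the ice driver's data structure: `struct sched_node`, `node_can_remove()` and `node_set_parent()` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scheduler node; not the ice driver's data structure. */
struct sched_node {
	struct sched_node *parent;	/* NULL for the root */
	unsigned int num_children;
};

/* A node may be removed only if it is not the root and has no children. */
static bool node_can_remove(const struct sched_node *n)
{
	return n->parent != NULL && n->num_children == 0;
}

/* Move a node under a new parent, keeping the child counts in sync. */
static void node_set_parent(struct sched_node *n, struct sched_node *new_parent)
{
	if (n->parent)
		n->parent->num_children--;
	n->parent = new_parent;
	if (new_parent)
		new_parent->num_children++;
}
```

Reassigning a leaf to another branch, as the `rate set ... parent` command does later in this message, can leave its old parent childless, at which point that parent becomes removable under this rule.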
      
      Example initial tree with 2 VF's:
      
      [root@fedora ~]# devlink port function rate show
      pci/0000:4b:00.0/node_27: type node parent node_26
      pci/0000:4b:00.0/node_26: type node parent node_0
      pci/0000:4b:00.0/node_34: type node parent node_33
      pci/0000:4b:00.0/node_33: type node parent node_32
      pci/0000:4b:00.0/node_32: type node parent node_16
      pci/0000:4b:00.0/node_19: type node parent node_18
      pci/0000:4b:00.0/node_18: type node parent node_17
      pci/0000:4b:00.0/node_17: type node parent node_16
      pci/0000:4b:00.0/node_21: type node parent node_20
      pci/0000:4b:00.0/node_20: type node parent node_3
      pci/0000:4b:00.0/node_14: type node parent node_5
      pci/0000:4b:00.0/node_5: type node parent node_3
      pci/0000:4b:00.0/node_13: type node parent node_4
      pci/0000:4b:00.0/node_12: type node parent node_4
      pci/0000:4b:00.0/node_11: type node parent node_4
      pci/0000:4b:00.0/node_10: type node parent node_4
      pci/0000:4b:00.0/node_9: type node parent node_4
      pci/0000:4b:00.0/node_8: type node parent node_4
      pci/0000:4b:00.0/node_7: type node parent node_4
      pci/0000:4b:00.0/node_6: type node parent node_4
      pci/0000:4b:00.0/node_4: type node parent node_3
      pci/0000:4b:00.0/node_3: type node parent node_16
      pci/0000:4b:00.0/node_16: type node parent node_15
      pci/0000:4b:00.0/node_15: type node parent node_0
      pci/0000:4b:00.0/node_2: type node parent node_1
      pci/0000:4b:00.0/node_1: type node parent node_0
      pci/0000:4b:00.0/node_0: type node
      pci/0000:4b:00.0/1: type leaf parent node_27
      pci/0000:4b:00.0/2: type leaf parent node_27
      
      Let me visualize part of the tree:
      
                              +---------+
                              |  node_0 |
                              +---------+
                                   |
                              +----v----+
                              | node_26 |
                              +----+----+
                                   |
                              +----v----+
                              | node_27 |
                              +----+----+
                                   |
                          |-----------------|
                     +----v----+       +----v----+
                     |   VF 1  |       |   VF 2  |
                     +----+----+       +----+----+
      
      So at this point there are a couple of things that can be done.
      For example we could only assign parameters to VF's.
      
      [root@fedora ~]# devlink port function rate set pci/0000:4b:00.0/1 \
                       tx_max 5Gbps
      
      This would cap the VF 1 BW to 5Gbps.
      
      But let's say you would like to create a completely new branch.
      This can be done like this:
      
      [root@fedora ~]# devlink port function rate add \
                       pci/0000:4b:00.0/node_custom parent node_0
      [root@fedora ~]# devlink port function rate add \
                       pci/0000:4b:00.0/node_custom_1 parent node_custom
      [root@fedora ~]# devlink port function rate set \
                       pci/0000:4b:00.0/1 parent node_custom_1
      
      This creates a completely new branch and reassigns VF 1 to it.
      
      A number of parameters are supported for each node: tx_max, tx_share,
      tx_priority and tx_weight.
      ====================
      
      Link: https://lore.kernel.org/r/20221115104825.172668-1-michal.wilczynski@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Documentation: Add documentation for new devlink-rate attributes · 242dd643
      Michal Wilczynski authored
      Provide documentation for newly introduced netlink attributes for
      devlink-rate: tx_priority and tx_weight.
      
      Mention the possibility to export tree from the driver.
      Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ice: Add documentation for devlink-rate implementation · 16eb4afc
      Michal Wilczynski authored
      Add documentation to a newly added devlink-rate feature. Provide some
      examples on how to use the commands, which netlink attributes are
      supported and descriptions of the attributes.
      Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ice: Prevent ADQ, DCB coexistence with Custom Tx scheduler · 80fe30a8
      Michal Wilczynski authored
      ADQ and DCB might interfere with Custom Tx Scheduler changes that the
      user might introduce using the devlink-rate API.

      Check whether ADQ or DCB is active when the user tries to change any
      setting in the exported Tx scheduler tree. If either is active, block
      the user from doing so and log an appropriate message.

      Remove the exported hierarchy if the user enables ADQ or DCB.
      Prevent ADQ or DCB from getting configured if the user has already made
      changes using the devlink-rate API.
      Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ice: Implement devlink-rate API · 42c2eb6b
      Michal Wilczynski authored
      There is a need to support modification of Tx scheduler tree, in the
      ice driver. This will allow user to control Tx settings of each node in
      the internal hierarchy of nodes. As a result user will be able to use
      Hierarchy QoS implemented entirely in the hardware.
      
      This patch implements devlink-rate API. It also exports initial
      default hierarchy. It's mostly dictated by the fact that the tree
      can't be removed entirely, all we can do is enable the user to modify
      it. For example root node shouldn't ever be removed, also nodes that
      have children are off-limits.
      
      Example initial tree with 2 VF's:
      
      [root@fedora ~]# devlink port function rate show
      
      pci/0000:4b:00.0/node_27: type node parent node_26
      pci/0000:4b:00.0/node_26: type node parent node_0
      pci/0000:4b:00.0/node_34: type node parent node_33
      pci/0000:4b:00.0/node_33: type node parent node_32
      pci/0000:4b:00.0/node_32: type node parent node_16
      pci/0000:4b:00.0/node_19: type node parent node_18
      pci/0000:4b:00.0/node_18: type node parent node_17
      pci/0000:4b:00.0/node_17: type node parent node_16
      pci/0000:4b:00.0/node_21: type node parent node_20
      pci/0000:4b:00.0/node_20: type node parent node_3
      pci/0000:4b:00.0/node_14: type node parent node_5
      pci/0000:4b:00.0/node_5: type node parent node_3
      pci/0000:4b:00.0/node_13: type node parent node_4
      pci/0000:4b:00.0/node_12: type node parent node_4
      pci/0000:4b:00.0/node_11: type node parent node_4
      pci/0000:4b:00.0/node_10: type node parent node_4
      pci/0000:4b:00.0/node_9: type node parent node_4
      pci/0000:4b:00.0/node_8: type node parent node_4
      pci/0000:4b:00.0/node_7: type node parent node_4
      pci/0000:4b:00.0/node_6: type node parent node_4
      pci/0000:4b:00.0/node_4: type node parent node_3
      pci/0000:4b:00.0/node_3: type node parent node_16
      pci/0000:4b:00.0/node_16: type node parent node_15
      pci/0000:4b:00.0/node_15: type node parent node_0
      pci/0000:4b:00.0/node_2: type node parent node_1
      pci/0000:4b:00.0/node_1: type node parent node_0
      pci/0000:4b:00.0/node_0: type node
      pci/0000:4b:00.0/1: type leaf parent node_27
      pci/0000:4b:00.0/2: type leaf parent node_27
      
      Let me visualize part of the tree:
      
                          +---------+
                          |  node_0 |
                          +---------+
                               |
                          +----v----+
                          | node_26 |
                          +----+----+
                               |
                          +----v----+
                          | node_27 |
                          +----+----+
                               |
                      |-----------------|
                 +----v----+       +----v----+
                 |   VF 1  |       |   VF 2  |
                 +----+----+       +----+----+
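
The branch drawn above can be recovered mechanically from the `devlink port function rate show` listing. The following is a minimal Python sketch that assumes the exact line format shown in this message; the helper names `parse_parents` and `path_to_root` are made up for illustration, and only the five lines relevant to the VFs are included.

```python
SHOW_OUTPUT = """\
pci/0000:4b:00.0/node_27: type node parent node_26
pci/0000:4b:00.0/node_26: type node parent node_0
pci/0000:4b:00.0/node_0: type node
pci/0000:4b:00.0/1: type leaf parent node_27
pci/0000:4b:00.0/2: type leaf parent node_27
"""


def parse_parents(text):
    """Map each rate object name to its parent name (None for the root)."""
    parents = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The handle ends at the first ": "; the rest is attribute text.
        handle, _, attrs = line.partition(": ")
        name = handle.rsplit("/", 1)[-1]
        fields = attrs.split()
        if "parent" in fields:
            parents[name] = fields[fields.index("parent") + 1]
        else:
            parents[name] = None
    return parents


def path_to_root(parents, name):
    """Follow parent links from a node or leaf up to the root."""
    path = [name]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return path


parents = parse_parents(SHOW_OUTPUT)
print(" -> ".join(path_to_root(parents, "1")))
```

For leaf 1 this prints `1 -> node_27 -> node_26 -> node_0`, which is the chain shown in the diagram.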
      
      At this point there are a couple of things that can be done.
      For example, we could simply assign parameters to the VFs:
      
      [root@fedora ~]# devlink port function rate set pci/0000:4b:00.0/1 \
                       tx_max 5Gbps
      
      This would cap the bandwidth of VF 1 at 5Gbps.
      
      But let's say you would like to create a completely new branch.
      This can be done like this:
      
      [root@fedora ~]# devlink port function rate add \
                       pci/0000:4b:00.0/node_custom parent node_0
      [root@fedora ~]# devlink port function rate add \
                       pci/0000:4b:00.0/node_custom_1 parent node_custom
      [root@fedora ~]# devlink port function rate set \
                       pci/0000:4b:00.0/1 parent node_custom_1
      
      This creates a completely new branch and reassigns VF 1 to it.
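
The effect of the three commands above, together with the rule that the root and nodes with children cannot be removed, can be sketched with a small in-memory model. This is an illustrative Python sketch only; the `RateTree` class and its methods are hypothetical and do not correspond to driver or devlink code.

```python
class RateTree:
    """Toy model of a rate hierarchy: every name maps to its parent."""

    def __init__(self, root):
        self.root = root
        self.parent = {root: None}

    def add(self, name, parent):
        """Create a node or leaf under an existing parent."""
        if parent not in self.parent:
            raise KeyError(f"unknown parent: {parent}")
        if name in self.parent:
            raise ValueError(f"already exists: {name}")
        self.parent[name] = parent

    def set_parent(self, name, new_parent):
        """Reassign an existing node or leaf to another existing node."""
        if name not in self.parent or new_parent not in self.parent:
            raise KeyError("unknown node")
        if name == self.root or name == new_parent:
            raise ValueError("invalid reassignment")
        self.parent[name] = new_parent

    def children(self, name):
        return [n for n, p in self.parent.items() if p == name]

    def delete(self, name):
        """Refuse to delete the root or any node that still has children."""
        if name == self.root:
            raise ValueError("root node cannot be removed")
        if self.children(name):
            raise ValueError("node has children")
        del self.parent[name]


tree = RateTree("node_0")
tree.add("node_26", "node_0")
tree.add("node_27", "node_26")
tree.add("1", "node_27")
tree.add("2", "node_27")

# Equivalent of the three devlink commands in this message.
tree.add("node_custom", "node_0")
tree.add("node_custom_1", "node_custom")
tree.set_parent("1", "node_custom_1")
```

After these calls, `node_27` has only VF 2 as a child, and `tree.delete("node_custom")` raises because `node_custom` still has a child.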
      
      Several parameters are supported for each node: tx_max, tx_share,
      tx_priority and tx_weight.
      Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      42c2eb6b
    • Michal Wilczynski's avatar
      ice: Add an option to pre-allocate memory for ice_sched_node · bdf96d96
      Michal Wilczynski authored
      The devlink-rate API requires a priv object to be allocated while the
      node still doesn't have a parent. This is problematic, because an
      ice_sched_node currently can't be created without a parent.
      
      Add an option to pre-allocate memory for the ice_sched_node struct. Add
      new arguments to ice_sched_add() and ice_sched_add_elems() that allow
      for this pre-allocation.
      Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      bdf96d96
    • Michal Wilczynski's avatar
      ice: Introduce new parameters in ice_sched_node · 16dfa494
      Michal Wilczynski authored
      To support the new devlink-rate API, the ice_sched_node struct needs to
      store a number of additional parameters: tx_max, tx_share, tx_weight,
      and tx_priority.
      
      Add new fields to the ice_sched_node struct. Add new functions to
      configure the hardware with the new parameters. Introduce a new xarray
      to identify nodes uniquely.
      Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      16dfa494
    • Michal Wilczynski's avatar
      devlink: Allow to set up parent in devl_rate_leaf_create() · f2fc15e2
      Michal Wilczynski authored
      Currently the driver is able to create leaf nodes for devlink-rate,
      but is unable to set a parent for them. This wasn't an issue before it
      became possible to export the hierarchy from the driver. With the export
      feature added, the driver must be able to supply a parent name to
      devl_rate_leaf_create() in order to supply the correct hierarchy.
      
      Introduce a new parameter 'parent_name' in devl_rate_leaf_create().
      Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f2fc15e2
    • Michal Wilczynski's avatar
      devlink: Allow for devlink-rate nodes parent reassignment · 04d674f0
      Michal Wilczynski authored
      Currently it's not possible to reassign the parent of a node using one
      command. As the previous commit introduced a way to export the entire
      hierarchy from the driver, being able to modify and reassign parents
      becomes important. This way the user can easily change QoS settings
      without interrupting traffic.
      
      Example command:
      devlink port function rate set pci/0000:4b:00.0/1 parent node_custom_1
      
      This reassigns the parent of the leaf node to node_custom_1.
      Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      04d674f0
    • Michal Wilczynski's avatar
      devlink: Enable creation of the devlink-rate nodes from the driver · caba177d
      Michal Wilczynski authored
      The internal firmware hierarchy for Hierarchical QoS on Intel 100G cards
      is very rigid and can't be easily removed. This requires the ability to
      export the default hierarchy so that the user can modify it. Currently
      the driver is only able to create 'leaf' nodes, which usually represent
      the vport. This is not enough for HQoS as implemented in Intel hardware.
      
      Introduce a new function devl_rate_node_create() that allows creation
      of devlink-rate nodes from the driver.
      Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      caba177d
    • Michal Wilczynski's avatar
      devlink: Introduce new attribute 'tx_weight' to devlink-rate · 6e2d7e84
      Michal Wilczynski authored
      To fully utilize the QoS offload capabilities of Intel 100G cards, a
      new attribute 'tx_weight' needs to be introduced. This attribute allows
      the Weighted Fair Queuing arbitration scheme to be used among siblings.
      This arbitration scheme can be used simultaneously with strict
      priority.
      
      Introduce a new attribute in devlink-rate that allows configuration
      of Weighted Fair Queuing. The new attribute is optional.
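
To give a feel for what a relative weight means, here is a minimal Python sketch of idealized weighted sharing among siblings. It assumes every sibling always has traffic to send and that bandwidth is split exactly in proportion to the weights; the function name and units are hypothetical, and real hardware arbitration will not behave this precisely.

```python
def wfq_shares(parent_bw_gbps, weights):
    """Split the parent's bandwidth among siblings in proportion to weight.

    weights maps a sibling name to its (illustrative) tx_weight value.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: parent_bw_gbps * w / total for name, w in weights.items()}


shares = wfq_shares(10, {"vf1": 1, "vf2": 3})
print(shares)
```

With weights 1 and 3 under a 10 Gbps parent, the idealized shares are 2.5 and 7.5 Gbps.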
      Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6e2d7e84
    • Michal Wilczynski's avatar
      devlink: Introduce new attribute 'tx_priority' to devlink-rate · cd502236
      Michal Wilczynski authored
      To fully utilize the QoS offload capabilities of Intel 100G cards, a
      new attribute 'tx_priority' needs to be introduced. This attribute
      allows a strict priority arbiter to be used among siblings. This
      arbitration scheme attempts to schedule nodes based on their priority
      as long as the nodes remain within their bandwidth limit.
      
      Introduce a new attribute in devlink-rate that allows configuration
      of strict priority. The new attribute is optional.
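
The selection rule described above can be sketched as follows. This is a simplified Python model, not the hardware algorithm: it assumes that a larger `tx_priority` value means higher priority and that "within the bandwidth limit" can be expressed as current usage below `tx_max`; the dictionary keys are hypothetical.

```python
def pick_next(siblings):
    """Return the name of the sibling to schedule next, or None.

    Each sibling is a dict with illustrative keys:
    name, tx_priority, used_mbps, tx_max_mbps.
    """
    # Only siblings still within their bandwidth limit are eligible.
    eligible = [s for s in siblings if s["used_mbps"] < s["tx_max_mbps"]]
    if not eligible:
        return None
    # Assumption: a larger tx_priority value wins.
    return max(eligible, key=lambda s: s["tx_priority"])["name"]


siblings = [
    {"name": "vf1", "tx_priority": 7, "used_mbps": 5000, "tx_max_mbps": 5000},
    {"name": "vf2", "tx_priority": 3, "used_mbps": 100, "tx_max_mbps": 5000},
]
print(pick_next(siblings))
```

Here `vf1` has the higher priority but has reached its limit, so the lower-priority `vf2` is chosen.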
      Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      cd502236
    • Jakub Kicinski's avatar
      Merge branch 'autoload-dsa-tagging-driver-when-dynamically-changing-protocol' · 4ab45e97
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      Autoload DSA tagging driver when dynamically changing protocol
      
      This patch set solves the issue reported by Michael and Heiko here:
      https://lore.kernel.org/lkml/20221027113248.420216-1-michael@walle.cc/
      making full use of Michael's suggestion of having two modaliases: one
      gets used for loading the tagging protocol when it's the default one
      reported by the switch driver, the other gets loaded at user's request,
      by name.
      
        # modinfo tag_ocelot
        filename:       /lib/modules/6.1.0-rc4+/kernel/net/dsa/tag_ocelot.ko
        license:        GPL v2
        alias:          dsa_tag:seville
        alias:          dsa_tag:id-21
        alias:          dsa_tag:ocelot
        alias:          dsa_tag:id-15
        depends:        dsa_core
        intree:         Y
        name:           tag_ocelot
        vermagic:       6.1.0-rc4+ SMP preempt mod_unload modversions aarch64
      
      Tested on NXP LS1028A-RDB with the following device tree addition:
      
      &mscc_felix_port4 {
      	dsa-tag-protocol = "ocelot-8021q";
      };
      
      &mscc_felix_port5 {
      	dsa-tag-protocol = "ocelot-8021q";
      };
      
      CONFIG_NET_DSA and everything that depends on it are built as modules.
      Everything auto-loads, and "cat /sys/class/net/eno2/dsa/tagging" shows
      "ocelot-8021q". Traffic works as well. Furthermore, "echo ocelot-8021q"
      into the aforementioned sysfs file now auto-loads the driver for it.
      ====================
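
The two alias forms visible in the `modinfo` output above (`dsa_tag:<name>` and `dsa_tag:id-<number>`) can be illustrated with a short Python sketch. The helper names are hypothetical, and the choice of which form is requested in which situation follows the cover letter's description rather than the kernel source.

```python
DSA_TAG_PREFIX = "dsa_tag:"


def alias_by_name(name):
    """Alias used when the user requests a tagging protocol by name."""
    return f"{DSA_TAG_PREFIX}{name}"


def alias_by_id(proto_id):
    """Alias used for the default protocol reported by the switch driver."""
    return f"{DSA_TAG_PREFIX}id-{proto_id}"


def request_alias(proto):
    """Pick the alias form: an int selects by id, a str selects by name."""
    if isinstance(proto, int):
        return alias_by_id(proto)
    return alias_by_name(proto)


print(request_alias("ocelot"))
print(request_alias(15))
```

This prints `dsa_tag:ocelot` and `dsa_tag:id-15`, both of which appear in the alias list of `tag_ocelot` above.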
      
      Link: https://lore.kernel.org/r/20221115011847.2843127-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4ab45e97