1. 06 Feb, 2023 40 commits
    • ice: introduce clear_reset_state operation · fa4a15c8
      Jacob Keller authored
      When hardware is reset, the VF relies on the VFGEN_RSTAT register to detect
      when the VF is finished resetting. This is a tri-state register where 0
      indicates a reset is in progress, 1 indicates the hardware is done
      resetting, and 2 indicates that the software is done resetting.
      
      Currently the PF driver relies on the device hardware resetting VFGEN_RSTAT
      when a global reset occurs. This works ok, but it does mean that the VF
      might not immediately notice a reset when the driver first detects that the
      global reset is occurring.
      
      This is also problematic for Scalable IOV, because there is no read/write
      equivalent VFGEN_RSTAT register for the Scalable VSI type. Instead, the
      Scalable IOV VFs will need to emulate this register.
      
      To support this, introduce a new VF operation, clear_reset_state, which is
      called when the PF driver first detects a global reset. The Single Root IOV
      implementation can just write to VFGEN_RSTAT to ensure it's cleared
      immediately, without waiting for the actual hardware reset to begin. The
      Scalable IOV implementation will use this as part of its tracking of the
      reset status to allow properly reporting the emulated VFGEN_RSTAT to the VF
      driver.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Tested-by: Marek Szlosek <marek.szlosek@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
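The tri-state register and the per-type clear_reset_state hook described above can be sketched roughly as follows. This is a minimal user-space illustration, not the driver code: every `toy_` name, the struct layout, and the idea of keeping the emulated value in a plain field are assumptions made for the example. Only the three state values and the operation name come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Tri-state reset status, using the values given in the commit message. */
enum toy_rstat {
	TOY_RSTAT_IN_PROGRESS = 0,	/* a reset is in progress */
	TOY_RSTAT_HW_DONE = 1,		/* hardware is done resetting */
	TOY_RSTAT_SW_DONE = 2,		/* software is done resetting */
};

struct toy_vf;

/* Hypothetical per-VF-type operations table with the single hook shown here. */
struct toy_vf_ops {
	void (*clear_reset_state)(struct toy_vf *vf);
};

struct toy_vf {
	const struct toy_vf_ops *ops;
	uint32_t hw_rstat;		/* stands in for a real VFGEN_RSTAT register */
	uint32_t emulated_rstat;	/* software copy for a type without the register */
};

/* Single Root IOV flavour: clear the (simulated) register right away. */
static void toy_sriov_clear_reset_state(struct toy_vf *vf)
{
	vf->hw_rstat = TOY_RSTAT_IN_PROGRESS;
}

/* Scalable IOV flavour: no register exists, so track the state in software. */
static void toy_siov_clear_reset_state(struct toy_vf *vf)
{
	vf->emulated_rstat = TOY_RSTAT_IN_PROGRESS;
}

const struct toy_vf_ops toy_sriov_ops = {
	.clear_reset_state = toy_sriov_clear_reset_state,
};

const struct toy_vf_ops toy_siov_ops = {
	.clear_reset_state = toy_siov_clear_reset_state,
};

/* Called when the PF first detects a global reset. */
void toy_handle_global_reset(struct toy_vf *vfs, int num_vfs)
{
	for (int i = 0; i < num_vfs; i++) {
		assert(vfs[i].ops && vfs[i].ops->clear_reset_state);
		vfs[i].ops->clear_reset_state(&vfs[i]);
	}
}
```

The point of the indirection is that the caller does not need to know whether the status lives in a register or in memory. Each VF type decides how "reset in progress" is recorded.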
    • ice: convert vf_ops .vsi_rebuild to .create_vsi · 5531bb85
      Jacob Keller authored
      The .vsi_rebuild function exists for ice_reset_vf. It is used to release
      and re-create the VSI during a single-VF reset.
      
      This function is only called when we need to re-create the VSI, and not
      when rebuilding an existing VSI. This makes the single-VF reset process
      different from the process used to restore functionality after a
      hardware reset such as the PF reset or EMP reset.
      
      When we add support for Scalable IOV VFs, the implementation will be very
      similar. The primary difference will be in the fact that each VF type uses
      a different underlying VSI type in hardware.
      
      Move the common functionality into a new ice_vf_recreate_vsi function. This
      will allow the two IOV paths to share this functionality. Rework the
      .vsi_rebuild vf_op into .create_vsi, only performing the task of creating a
      new VSI.
      
      This creates a nice dichotomy between the ice_vf_rebuild_vsi and
      ice_vf_recreate_vsi, and should make it more clear why the two flows are
      distinct.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Marek Szlosek <marek.szlosek@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: introduce ice_vf_init_host_cfg function · b1b56942
      Jacob Keller authored
      Introduce a new generic helper ice_vf_init_host_cfg which performs common
      host configuration initialization tasks that will need to be done for both
      Single Root IOV and the new Scalable IOV implementation.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Marek Szlosek <marek.szlosek@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: add a function to initialize vf entry · b5dcff1f
      Jacob Keller authored
      Some of the initialization code for Single Root IOV VFs will need to be
      reused when we introduce Scalable IOV. Pull this code out into a new
      ice_initialize_vf_entry helper function.
      Co-developed-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
      Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Marek Szlosek <marek.szlosek@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: Pull common tasks into ice_vf_post_vsi_rebuild · aeead3d0
      Jacob Keller authored
      The Single Root IOV implementation of .post_vsi_rebuild performs some tasks
      that will ultimately need to be shared with the Scalable IOV implementation
      such as rebuilding the host configuration.
      
      Refactor by introducing a new wrapper function, ice_vf_post_vsi_rebuild
      which performs the tasks that will be shared between SR-IOV and Scalable
      IOV. Move the ice_vf_rebuild_host_cfg and ice_vf_set_initialized calls into
      this wrapper. Then call the implementation specific post_vsi_rebuild
      handler afterwards.
      
      This ensures that we will properly re-initialize filters and expected
      settings for both SR-IOV and Scalable IOV.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Marek Szlosek <marek.szlosek@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
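The wrapper pattern from the ice_vf_post_vsi_rebuild commit (common tasks first, then the implementation-specific handler) might look roughly like this. It is a user-space sketch under stated assumptions: the `demo_` names, the boolean fields, and the helper bodies are invented for illustration and do not reflect the real driver structures.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_vf;

/* Hypothetical operations table; only the hook discussed above is modelled. */
struct demo_vf_ops {
	void (*post_vsi_rebuild)(struct demo_vf *vf);
};

struct demo_vf {
	const struct demo_vf_ops *ops;
	bool host_cfg_rebuilt;		/* set by the shared host-config step */
	bool initialized;		/* set by the shared "mark initialized" step */
	bool hook_saw_common_done;	/* recorded by the type-specific hook */
};

/* Stand-ins for the two shared tasks named in the commit message. */
static void demo_rebuild_host_cfg(struct demo_vf *vf)
{
	vf->host_cfg_rebuilt = true;
}

static void demo_set_initialized(struct demo_vf *vf)
{
	vf->initialized = true;
}

/* Shared wrapper: run the common tasks, then the type-specific handler. */
void demo_vf_post_vsi_rebuild(struct demo_vf *vf)
{
	assert(vf && vf->ops && vf->ops->post_vsi_rebuild);
	demo_rebuild_host_cfg(vf);
	demo_set_initialized(vf);
	vf->ops->post_vsi_rebuild(vf);
}

/* One implementation's handler; it only records what it observed. */
void demo_sriov_post_vsi_rebuild(struct demo_vf *vf)
{
	vf->hook_saw_common_done = vf->host_cfg_rebuilt && vf->initialized;
}

const struct demo_vf_ops demo_sriov_ops = {
	.post_vsi_rebuild = demo_sriov_post_vsi_rebuild,
};
```

Because the shared steps live in the wrapper, a second implementation only has to supply its own handler and still gets the host configuration rebuilt.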
    • ice: move ice_vf_vsi_release into ice_vf_lib.c · 1efee073
      Jacob Keller authored
      The ice_vf_vsi_release function will be used in a future change to
      refactor the .vsi_rebuild function. Move this over to ice_vf_lib.c so
      that it can be used there.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Marek Szlosek <marek.szlosek@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: move vsi_type assignment from ice_vsi_alloc to ice_vsi_cfg · e1588197
      Jacob Keller authored
      The ice_vsi_alloc and ice_vsi_cfg functions are used together to allocate
      and configure a new VSI, called as part of the ice_vsi_setup function.
      
      In the future with the addition of the subfunction code the ice driver
      will want to be able to allocate a VSI while delaying the configuration to
      a later point of the port activation.
      
      Currently this requires that the port code know what type of VSI should
      be allocated. This is required because ice_vsi_alloc assigns the VSI type.
      
      Refactor the ice_vsi_alloc and ice_vsi_cfg functions so that VSI type
      assignment isn't done until the configuration stage. This will allow the
      devlink port addition logic to reserve a VSI as early as possible before
      the type of the port is known. In this way, the port add can fail in the
      event that all hardware VSI resources are exhausted.
      
      Since the ice_vsi_cfg function already takes the ice_vsi_cfg_params
      structure, this is relatively straightforward.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: refactor VSI setup to use parameter structure · 5e509ab2
      Jacob Keller authored
      The ice_vsi_setup function, ice_vsi_alloc, and ice_vsi_cfg functions have
      grown a large number of parameters. These parameters are used to initialize
      a new VSI, as well as re-configure an existing VSI.
      
      Any time we want to add a new parameter to this function chain, even if it
      will usually be unset, we have to change many call sites due to changing
      the function signature.
      
      A future change is going to refactor ice_vsi_alloc and ice_vsi_cfg to move
      the VSI configuration and initialization all into ice_vsi_cfg.
      
      Before this, refactor the VSI setup flow to use a new ice_vsi_cfg_params
      structure. This will contain the configuration (mainly pointers) used to
      initialize a VSI.
      
      Pass this from ice_vsi_setup into the related functions such as
      ice_vsi_alloc, ice_vsi_cfg, and ice_vsi_cfg_def.
      
      Introduce a helper, ice_vsi_to_params to convert an existing VSI to the
      parameters used to initialize it. This will aid in the flows where we
      rebuild an existing VSI.
      
      Since we also pass the ICE_VSI_FLAG_INIT to more functions which do not
      need (or cannot yet have) the VSI parameters, let's make this clear by
      renaming the function parameter to vsi_flags and using a u32 instead of a
      signed integer. The name vsi_flags also makes it clear that we may extend
      the flags in the future.
      
      This change will make it easier to refactor the setup flow in the future,
      and will reduce the complexity required to add a new parameter for
      configuration in the future.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
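To illustrate the parameter-structure refactor described above, here is a small sketch of the general pattern: a params struct passed down the setup chain, plus a helper that recovers the params from an existing object for rebuild flows. The `demo_` types, fields, and flag value are assumptions chosen for the example; the real ice_vsi_cfg_params layout is not shown in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_vsi_type { DEMO_VSI_PF, DEMO_VSI_VF };

/* Hypothetical flag, passed as an unsigned bit mask so more flags can be added. */
#define DEMO_VSI_FLAG_INIT (UINT32_C(1) << 0)

struct demo_vf { int id; };

/* Everything needed to configure a VSI travels in one structure. */
struct demo_vsi_cfg_params {
	enum demo_vsi_type type;
	struct demo_vf *vf;	/* NULL for non-VF types */
	uint32_t flags;
};

struct demo_vsi {
	enum demo_vsi_type type;
	struct demo_vf *vf;
};

/* Configuration stage: type and VF pointer are assigned here, not at allocation. */
int demo_vsi_cfg(struct demo_vsi *vsi, const struct demo_vsi_cfg_params *params)
{
	if (!vsi || !params)
		return -1;
	vsi->type = params->type;
	vsi->vf = params->vf;
	return 0;
}

/* Recover the parameters of an existing VSI, e.g. before rebuilding it. */
struct demo_vsi_cfg_params demo_vsi_to_params(const struct demo_vsi *vsi)
{
	struct demo_vsi_cfg_params params = { 0 };

	assert(vsi);
	params.type = vsi->type;
	params.vf = vsi->vf;
	return params;
}
```

Adding a new optional setting then means adding one field to the structure. Call sites that do not care about it stay unchanged.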
    • ice: drop unnecessary VF parameter from several VSI functions · 157acda5
      Jacob Keller authored
      The vsi->vf pointer gets assigned early on during ice_vsi_alloc. Several
      functions currently take a VF pointer, but they can just use the existing
      vsi->vf pointer as needed. Modify these functions to drop the unnecessary
      VF parameter.
      
      Note that ice_vsi_cfg is not changed as a following change will refactor so
      that the VF pointer is assigned during ice_vsi_cfg rather than
      ice_vsi_alloc.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Tested-by: Marek Szlosek <marek.szlosek@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: fix function comment referring to ice_vsi_alloc · a2ca73ea
      Jacob Keller authored
      Since commit 1d2e32275de7 ("ice: split ice_vsi_setup into smaller
      functions") ice_vsi_alloc has not been responsible for all of the behavior
      implied by the comment for ice_vsi_setup_vector_base.
      
      Fix the comment to refer to the new function ice_vsi_alloc_def().
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: Add more usage of existing function ice_get_vf_vsi(vf) · 772dec64
      Brett Creeley authored
      Use ice_get_vf_vsi(vf) in more places instead of reaching the VF's VSI
      through a long string of dereferences
      (i.e. vf->pf->vsi[vf->lan_vsi_idx]).
      Signed-off-by: Brett Creeley <brett.creeley@intel.com>
      Signed-off-by: Kalyan Kodamagula <kalyan.kodamagula@intel.com>
      Tested-by: Piotr Tyda <piotr.tyda@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • Merge branch 'tuntap-socket-uid' · c21adf25
      David S. Miller authored
      Pietro Borrello says:
      
      ====================
      tuntap: correctly initialize socket uid
      
      sock_init_data() assumes that the `struct socket` passed in input is
      contained in a `struct socket_alloc` allocated with sock_alloc().
      However, tap_open() and tun_chr_open() pass a `struct socket` embedded
      in a `struct tap_queue` and `struct tun_file` respectively, both
      allocated with sk_alloc().
      This causes a type confusion when issuing a container_of() with
      SOCK_INODE() in sock_init_data() which results in assigning a wrong
      sk_uid to the `struct sock` in input.
      
      Due to the type confusion, both sockets happen to have their uid set
      to 0, i.e. root.
      While this will often be correct, as tuntap devices require
      CAP_NET_ADMIN, it may not always be the case.
      It is unclear how widespread the impact is, but the socket uid
      may be used for network filtering and routing, so tuntap sockets may
      be incorrectly managed.
      Additionally, it seems a socket with an incorrect uid may be returned
      to the vhost driver when issuing a get_socket() on a tuntap device in
      vhost_net_set_backend().
      
      Fix the bugs by adding and using sock_init_data_uid(), which
      explicitly takes a uid as argument.
      Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
      ---
      Changes in v3:
      - Fix the bug by defining and using sock_init_data_uid()
      - Link to v2: https://lore.kernel.org/r/20230131-tuntap-sk-uid-v2-0-29ec15592813@diag.uniroma1.it
      
      Changes in v2:
      - Shorten and format comments
      - Link to v1: https://lore.kernel.org/r/20230131-tuntap-sk-uid-v1-0-af4f9f40979d@diag.uniroma1.it
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tap: tap_open(): correctly initialize socket uid · 66b2c338
      Pietro Borrello authored
      sock_init_data() assumes that the `struct socket` passed in input is
      contained in a `struct socket_alloc` allocated with sock_alloc().
      However, tap_open() passes a `struct socket` embedded in a `struct
      tap_queue` allocated with sk_alloc().
      This causes a type confusion when issuing a container_of() with
      SOCK_INODE() in sock_init_data() which results in assigning a wrong
      sk_uid to the `struct sock` in input.
      On default configuration, the type confused field overlaps with
      padding bytes between `int vnet_hdr_sz` and `struct tap_dev __rcu
      *tap` in `struct tap_queue`, which makes the uid of all tap sockets 0,
      i.e., the root one.
      Fix the assignment by using sock_init_data_uid().
      
      Fixes: 86741ec2 ("net: core: Add a UID field to struct sock.")
      Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tun: tun_chr_open(): correctly initialize socket uid · a096ccca
      Pietro Borrello authored
      sock_init_data() assumes that the `struct socket` passed in input is
      contained in a `struct socket_alloc` allocated with sock_alloc().
      However, tun_chr_open() passes a `struct socket` embedded in a `struct
      tun_file` allocated with sk_alloc().
      This causes a type confusion when issuing a container_of() with
      SOCK_INODE() in sock_init_data() which results in assigning a wrong
      sk_uid to the `struct sock` in input.
      On default configuration, the type confused field overlaps with the
      high 4 bytes of `struct tun_struct __rcu *tun` of `struct tun_file`,
      NULL at the time of call, which makes the uid of all tun sockets 0,
      i.e., the root one.
      Fix the assignment by using sock_init_data_uid().
      
      Fixes: 86741ec2 ("net: core: Add a UID field to struct sock.")
      Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add sock_init_data_uid() · 584f3742
      Pietro Borrello authored
      Add sock_init_data_uid() to explicitly initialize the socket uid.
      To initialise the socket uid, sock_init_data() assumes that the struct
      socket *sock is always embedded in a struct socket_alloc, used to
      access the corresponding inode uid. This may not be true.
      Examples are sockets created in tun_chr_open() and tap_open().
      
      Fixes: 86741ec2 ("net: core: Add a UID field to struct sock.")
      Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
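The shape of the fix in this series (an initializer that takes the uid explicitly, with the old entry point kept as a wrapper that derives it from the enclosing allocation) can be modelled in user space as below. All `toy_` types and functions are hypothetical stand-ins; the real kernel structures are far larger and the real helpers differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t toy_uid_t;

struct toy_socket { int state; };
struct toy_inode { toy_uid_t i_uid; };
struct toy_sock { toy_uid_t sk_uid; };

/* The layout the plain initializer assumes: a socket next to its inode. */
struct toy_socket_alloc {
	struct toy_socket socket;
	struct toy_inode vfs_inode;
};

/* A device queue that embeds a socket with no inode beside it. */
struct toy_queue {
	struct toy_sock sk;
	struct toy_socket sock;
	toy_uid_t owner;
};

#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Explicit variant: the caller states which uid the sock should carry. */
void toy_sock_init_data_uid(struct toy_socket *sock, struct toy_sock *sk,
			    toy_uid_t uid)
{
	(void)sock;
	assert(sk);
	sk->sk_uid = uid;
}

/*
 * Implicit variant: only valid when sock really lives inside a
 * toy_socket_alloc, because it walks back to the enclosing structure.
 */
void toy_sock_init_data(struct toy_socket *sock, struct toy_sock *sk)
{
	toy_uid_t uid = 0;

	if (sock)
		uid = toy_container_of(sock, struct toy_socket_alloc,
				       socket)->vfs_inode.i_uid;
	toy_sock_init_data_uid(sock, sk, uid);
}
```

A caller holding a `struct toy_queue` would use the explicit variant, since applying the implicit one to `&queue.sock` would read whatever bytes happen to follow the socket inside the queue.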
    • Merge branch 'ENETC-mqprio-taprio-cleanup' · b601135e
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      net: ENETC mqprio/taprio cleanup
      
      Please excuse the increased patch set size compared to v4's 15 patches,
      but Claudiu stirred up the pot :) when he pointed out that the mqprio
      TXQ validation procedure is still incorrect, so I had to fix that, and
      then do some consolidation work so that taprio doesn't duplicate
      mqprio's bugs. Compared to v4, 3 patches are new and 1 was dropped for now
      ("net/sched: taprio: mask off bits in gate mask that exceed number of TCs"),
      since there's not really much to gain from it. Since the previous patch
      set has largely been reviewed, I hope that a delta overview will help
      and make up for the large size.
      
      v4->v5:
      - new patches:
        "[08/17] net/sched: mqprio: allow reverse TC:TXQ mappings"
        "[11/17] net/sched: taprio: centralize mqprio qopt validation"
        "[12/17] net/sched: refactor mqprio qopt reconstruction to a library function"
      - changed patches worth revisiting:
        "[09/17] net/sched: mqprio: allow offloading drivers to request queue
        count validation"
      v4 at:
      https://patchwork.kernel.org/project/netdevbpf/cover/20230130173145.475943-1-vladimir.oltean@nxp.com/
      
      v3->v4:
      - adjusted patch 07/15 to not remove "#include <net/pkt_sched.h>" from
        ti cpsw
      https://patchwork.kernel.org/project/netdevbpf/cover/20230127001516.592984-1-vladimir.oltean@nxp.com/
      
      v2->v3:
      - move min_num_stack_tx_queues definition so it doesn't conflict with
        the ethtool mm patches I haven't submitted yet for enetc (and also to
        make use of a 4 byte hole)
      - warn and mask off excess TCs in gate mask instead of failing
      - finally CC qdisc maintainers
      v2 at:
      https://patchwork.kernel.org/project/netdevbpf/patch/20230126125308.1199404-16-vladimir.oltean@nxp.com/
      
      v1->v2:
      - patches 1->4 are new
      - update some header inclusions in drivers
      - fix typo (said "taprio" instead of "mqprio")
      - better enetc mqprio error handling
      - dynamically reconstruct mqprio configuration in taprio offload
      - also let stmmac and tsnep use per-TXQ gate_mask
      v1 (RFC) at:
      https://patchwork.kernel.org/project/netdevbpf/cover/20230120141537.1350744-1-vladimir.oltean@nxp.com/
      
      The main goal of this patch set is to make taprio pass the mqprio queue
      configuration structure down to ndo_setup_tc() - patch 13/17. But mqprio
      itself is not in the best shape currently, so there are some
      consolidation patches on that as well.
      
      Next, there are some consolidation patches in the enetc driver's
      handling of TX queues and their traffic class assignment. Then, there is
      a consolidation between the TX queue configuration for mqprio and
      taprio.
      
      Finally, there is a change in the meaning of the gate_mask passed by
      taprio through ndo_setup_tc(). We introduce a capability through which
      drivers can request the gate mask to be per TXQ. The default is changed
      so that it is per TC.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: act upon mqprio queue config in taprio offload · 06b1c911
      Vladimir Oltean authored
      We assume that the mqprio queue configuration from taprio has a simple
      1:1 mapping between prio and traffic class, and one TX queue per TC.
      That might not be the case. Actually parse and act upon the mqprio
      config.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: act upon the requested mqprio queue configuration · 1a353111
      Vladimir Oltean authored
      Regardless of the requested queue count per traffic class, the enetc
      driver allocates a number of TX rings equal to the number of TCs, and
      hardcodes a queue configuration of "1@0 1@1 ... 1@max-tc". Other
      configurations are silently ignored and treated the same.
      
      Improve that by allowing what the user requests to be actually
      fulfilled. This allows more than one TX ring per traffic class.
      For example:
      
      $ tc qdisc add dev eno0 root handle 1: mqprio num_tc 4 \
      	map 0 0 1 1 2 2 3 3 queues 2@0 2@2 2@4 2@6
      [  146.267648] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0
      [  146.273451] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 0
      [  146.283280] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 1
      [  146.293987] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 1
      [  146.300467] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 2
      [  146.306866] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 2
      [  146.313261] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 3
      [  146.319622] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 3
      $ tc qdisc del dev eno0 root
      [  178.238418] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0
      [  178.244369] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 0
      [  178.251486] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 0
      [  178.258006] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 0
      [  178.265038] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 0
      [  178.271557] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 0
      [  178.277910] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 0
      [  178.284281] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 0
      $ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1
      [  186.113162] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0
      [  186.118764] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 1
      [  186.124374] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 2
      [  186.130765] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 3
      [  186.136404] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 4
      [  186.142049] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 5
      [  186.147674] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 6
      [  186.153305] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 7
      
      The driver used to set TC_MQPRIO_HW_OFFLOAD_TCS, near which there is
      this comment in the UAPI header:
      
              TC_MQPRIO_HW_OFFLOAD_TCS,       /* offload TCs, no queue counts */
      
      That is what enetc was doing up until now, and no longer is, since we
      offload queue counts too. Remove that assignment.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
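The mapping shown in the log above (each "count@offset" range gives its rings the index of the owning traffic class) can be reproduced with a short sketch. This is not the enetc code: the `toy_` names, the fixed array sizes, and the choice to default unclaimed rings to priority 0 are assumptions for the example.

```c
#include <assert.h>

#define TOY_MAX_TC 8

/* One "count@offset" pair per traffic class, as in the tc command line. */
struct toy_mqprio {
	int num_tc;
	int count[TOY_MAX_TC];
	int offset[TOY_MAX_TC];
};

/*
 * Fill ring_prio[ring] with the traffic class whose range contains the ring.
 * Returns 0 on success, -1 if a range falls outside the available rings.
 */
int toy_assign_ring_prio(const struct toy_mqprio *q, int num_rings,
			 int *ring_prio)
{
	assert(q && ring_prio);

	if (q->num_tc < 0 || q->num_tc > TOY_MAX_TC)
		return -1;

	/* Rings not claimed by any traffic class fall back to priority 0. */
	for (int ring = 0; ring < num_rings; ring++)
		ring_prio[ring] = 0;

	for (int tc = 0; tc < q->num_tc; tc++) {
		if (q->offset[tc] < 0 || q->count[tc] < 0 ||
		    q->offset[tc] + q->count[tc] > num_rings)
			return -1;
		for (int i = 0; i < q->count[tc]; i++)
			ring_prio[q->offset[tc] + i] = tc;
	}
	return 0;
}
```

With `num_tc 4` and `queues 2@0 2@2 2@4 2@6` on 8 rings, this yields priorities 0 0 1 1 2 2 3 3, matching the first block of log lines in the commit message.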
    • net: enetc: request mqprio to validate the queue counts · 735ef62c
      Vladimir Oltean authored
      The enetc driver does not validate the mqprio queue configuration, so it
      currently allows things like this:
      
      $ tc qdisc add dev swp0 root handle 1: mqprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 queues 3@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1
      
      But also things like this, completely omitting the queue configuration:
      
      $ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 hw 1
      
      By requesting validation via the mqprio capability structure, this is no
      longer allowed, and we bring what is accepted by hardware in line with
      what is accepted by software.
      
      The check that num_tc <= real_num_tx_queues also becomes superfluous and
      can be dropped, because mqprio_validate_queue_counts() validates that no
      TXQ range exceeds real_num_tx_queues. That is a stronger check, because
      there is at least 1 TXQ per TC, so there are at least as many TXQs as TCs.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
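The kind of validation this commit asks mqprio to perform can be approximated as follows. This is a sketch of the idea only, not mqprio_validate_queue_counts() itself: the `toy_` names and the exact set of checks (non-empty range, range within the real TXQ count, no overlap between traffic classes) are assumptions drawn from the two rejected examples in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* One "count@offset" TXQ range per traffic class. */
struct toy_txq_range {
	int count;
	int offset;
};

bool toy_validate_queue_counts(const struct toy_txq_range *r, int num_tc,
			       int real_num_tx_queues)
{
	assert(r);

	for (int i = 0; i < num_tc; i++) {
		int last = r[i].offset + r[i].count;	/* one past the range */

		/* Every TC needs at least one TXQ, inside the real TXQ space. */
		if (r[i].count <= 0 || r[i].offset < 0 ||
		    last > real_num_tx_queues)
			return false;

		/* Half-open ranges [offset, last) must not overlap. */
		for (int j = i + 1; j < num_tc; j++) {
			int other_last = r[j].offset + r[j].count;

			if (r[i].offset < other_last && r[j].offset < last)
				return false;
		}
	}
	return true;
}
```

Under these checks, `3@0 1@1 ...` fails because TC 0 and TC 1 both claim TXQ 1, and a configuration with the queues omitted fails because every count is zero. Since each TC must own at least one TXQ, a passing configuration also implies num_tc <= real_num_tx_queues, which is why the commit can drop that separate check.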
    • net/sched: taprio: only pass gate mask per TXQ for igc, stmmac, tsnep, am65_cpsw · 522d15ea
      Vladimir Oltean authored
      There are 2 classes of in-tree drivers currently:
      
      - those who act upon struct tc_taprio_sched_entry :: gate_mask as if it
        holds a bit mask of TXQs
      
      - those who act upon the gate_mask as if it holds a bit mask of TCs
      
      When it comes to the standard, IEEE 802.1Q-2018 does say this in the
      second paragraph of section 8.6.8.4 Enhancements for scheduled traffic:
      
      | A gate control list associated with each Port contains an ordered list
      | of gate operations. Each gate operation changes the transmission gate
      | state for the gate associated with each of the Port's traffic class
      | queues and allows associated control operations to be scheduled.
      
      In typically obtuse language, it refers to a "traffic class queue"
      rather than a "traffic class" or a "queue". But careful reading of
      802.1Q clarifies that "traffic class" and "queue" are in fact
      synonymous (see 8.6.6 Queuing frames):
      
      | A queue in this context is not necessarily a single FIFO data structure.
      | A queue is a record of all frames of a given traffic class awaiting
      | transmission on a given Bridge Port. The structure of this record is not
      | specified.
      
      i.o.w. their definition of "queue" isn't the Linux TX queue.
      
      The gate_mask really is input into taprio via its UAPI as a mask of
      traffic classes, but taprio_sched_to_offload() converts it into a TXQ
      mask.
      
      The breakdown of drivers which handle TC_SETUP_QDISC_TAPRIO is:
      
      - hellcreek, felix, sja1105: these are DSA switches, it's not even very
        clear what TXQs correspond to, other than purely software constructs.
        Only the mqprio configuration with 8 TCs and 1 TXQ per TC makes sense.
        So it's fine to convert these to a gate mask per TC.
      
      - enetc: I have the hardware and can confirm that the gate mask is per
        TC, and affects all TXQs (BD rings) configured for that priority.
      
      - igc: in igc_save_qbv_schedule(), the gate_mask is clearly interpreted
        to be per-TXQ.
      
      - tsnep: Gerhard Engleder clarifies that even though this hardware
        supports at most 1 TXQ per TC, the TXQ indices may be different from
        the TC values themselves, and it is the TXQ indices that matter to
        this hardware. So keep it per-TXQ as well.
      
      - stmmac: I have a GMAC datasheet, and in the EST section it does
        specify that the gate events are per TXQ rather than per TC.
      
      - lan966x: again, this is a switch, and while not a DSA one, the way in
        which it implements lan966x_mqprio_add() - by only allowing num_tc ==
        NUM_PRIO_QUEUES (8) - makes it clear to me that TXQs are a purely
        software construct here as well. They seem to map 1:1 with TCs.
      
      - am65_cpsw: from looking at am65_cpsw_est_set_sched_cmds(), I get the
        impression that the fetch_allow variable is treated like a prio_mask.
        This definitely sounds closer to a per-TC gate mask rather than a
        per-TXQ one, and TI documentation does seem to recommend an identity
        mapping between TCs and TXQs. However, Roger Quadros would like to do
        some testing before making changes, so I'm leaving this driver to
        operate as it did before, for now. Link with more details at the end.
      
      Based on this breakdown, we have 5 drivers with a gate mask per TC and
      4 with a gate mask per TXQ. So let's make the gate mask per TXQ the
      opt-in and the gate mask per TC the default.
      
      Benefit from the TC_QUERY_CAPS feature that Jakub suggested we add, and
      query the device driver before calling the proper ndo_setup_tc(), and
      figure out if it expects one or the other format.
      
      Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230202003621.2679603-15-vladimir.oltean@nxp.com/#25193204
      Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
      Cc: Siddharth Vadapalli <s-vadapalli@ti.com>
      Cc: Roger Quadros <rogerq@kernel.org>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
      Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
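The difference between a per-TC and a per-TXQ gate mask comes down to expanding each set TC bit into the TXQ range that the mqprio configuration assigns to that TC. A rough sketch of that conversion, with hypothetical `toy_` names and a 32-bit mask width chosen for the example:

```c
#include <assert.h>
#include <stdint.h>

/* TXQ range owned by one traffic class ("count@offset"). */
struct toy_tc_txq {
	unsigned int count;
	unsigned int offset;
};

/*
 * Expand a bit mask of traffic classes into a bit mask of TX queues.
 * Queues beyond bit 31 are ignored in this sketch.
 */
uint32_t toy_tc_mask_to_txq_mask(const struct toy_tc_txq *map,
				 unsigned int num_tc, uint32_t tc_mask)
{
	uint32_t txq_mask = 0;

	assert(map);

	for (unsigned int tc = 0; tc < num_tc && tc < 32; tc++) {
		if (!(tc_mask & (UINT32_C(1) << tc)))
			continue;

		for (unsigned int i = 0; i < map[tc].count; i++) {
			unsigned int txq = map[tc].offset + i;

			if (txq < 32)
				txq_mask |= UINT32_C(1) << txq;
		}
	}
	return txq_mask;
}
```

With `queues 2@0 2@2 2@4 2@6`, a TC mask of 0x5 (TC 0 and TC 2 open) expands to a TXQ mask of 0x33 (TXQs 0, 1, 4 and 5). With one TXQ per TC at the same index, the two masks are identical, which is why the distinction only matters for drivers that support other mappings.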
    • net/sched: taprio: pass mqprio queue configuration to ndo_setup_tc() · 09c794c0
      Vladimir Oltean authored
      The taprio qdisc does not currently pass the mqprio queue configuration
      down to the offloading device driver. So the driver cannot act upon the
      TXQ counts/offsets per TC, or upon the prio->tc map. It was probably
      assumed that the driver only wants to offload num_tc (see
      TC_MQPRIO_HW_OFFLOAD_TCS), which it can get from netdev_get_num_tc(),
      but there's clearly more to the mqprio configuration than that.
      
      I've considered 2 mechanisms to remedy that. First is to pass a struct
      tc_mqprio_qopt_offload as part of the tc_taprio_qopt_offload. The second
      is to make taprio actually call TC_SETUP_QDISC_MQPRIO, *in addition to*
      TC_SETUP_QDISC_TAPRIO.
      
      The difference is that in the first case, existing drivers (offloading
      or not) all ignore taprio's mqprio portion currently, whereas in the
      second case, we could control whether to call TC_SETUP_QDISC_MQPRIO,
      based on a new capability. The question is which approach would be
      better.
      
      I'm afraid that calling TC_SETUP_QDISC_MQPRIO unconditionally (not based
      on a taprio capability bit) would risk introducing regressions. For
      example, taprio doesn't populate (or validate) qopt->hw, as well as
      mqprio.flags, mqprio.shaper, mqprio.min_rate, mqprio.max_rate.
      
      In comparison, adding a capability is functionally equivalent to just
      passing the mqprio in a way that drivers can ignore it, except it's
      slightly more complicated to use it (need to set the capability).
      
      Ultimately, what made me go for the "mqprio in taprio" variant was that
      it's easier for offloading drivers to interpret the mqprio qopt slightly
      differently when it comes from taprio vs when it comes from mqprio,
      should that ever become necessary.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      09c794c0
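      As a rough illustration of the "mqprio in taprio" variant chosen
      above, the sketch below embeds a simplified mqprio portion inside a
      taprio offload structure and lets a driver-side helper act on the
      per-TC TXQ counts and offsets. All type and function names are
      hypothetical userspace stand-ins, not the kernel's actual
      struct tc_taprio_qopt_offload or struct tc_mqprio_qopt_offload.

```c
#include <stdint.h>

#define SKETCH_MAX_TC 16

/* Simplified stand-in for the mqprio queue configuration. */
struct mqprio_sketch {
	uint8_t num_tc;
	uint8_t prio_tc_map[SKETCH_MAX_TC];	/* priority -> TC */
	uint16_t count[SKETCH_MAX_TC];		/* number of TXQs per TC */
	uint16_t offset[SKETCH_MAX_TC];		/* first TXQ of each TC */
};

/* Simplified stand-in for the taprio offload request. */
struct taprio_offload_sketch {
	struct mqprio_sketch mqprio;	/* drivers that don't care can ignore it */
	uint64_t cycle_time_ns;
};

/* Hypothetical driver helper: map a TXQ index to its traffic class.
 * Returns -1 when the TXQ is not covered by any TC.
 */
static int sketch_txq_to_tc(const struct taprio_offload_sketch *t,
			    unsigned int txq)
{
	for (int tc = 0; tc < t->mqprio.num_tc; tc++) {
		unsigned int first = t->mqprio.offset[tc];
		unsigned int end = first + t->mqprio.count[tc];

		if (txq >= first && txq < end)
			return tc;
	}
	return -1;
}
```

      Because the mqprio portion is just an extra member, a driver that
      never reads it behaves exactly as before, which is the property the
      commit message relies on.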
    • Vladimir Oltean's avatar
      net/sched: refactor mqprio qopt reconstruction to a library function · 9dd6ad67
      Vladimir Oltean authored
      The taprio qdisc will need to reconstruct a struct tc_mqprio_qopt from
      netdev settings once more in a future patch, but this code was already
      written twice, once in taprio and once in mqprio.
      
      Refactor the code to a helper in the common mqprio library.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9dd6ad67
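      A minimal sketch of what "reconstructing a qopt from netdev settings"
      could look like, assuming the device keeps a TC count, a prio->tc map
      and a per-TC TXQ count/offset table. The structures and the function
      name are hypothetical stand-ins for illustration, not the helper the
      patch adds.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_TC 16

/* Hypothetical view of the settings a net device keeps. */
struct netdev_sketch {
	int num_tc;
	uint8_t prio_tc_map[SKETCH_MAX_TC];
	struct { uint16_t count, offset; } tc_to_txq[SKETCH_MAX_TC];
};

/* Hypothetical flattened qopt, as a qdisc would hand it to a driver. */
struct qopt_sketch {
	uint8_t num_tc;
	uint8_t prio_tc_map[SKETCH_MAX_TC];
	uint16_t count[SKETCH_MAX_TC];
	uint16_t offset[SKETCH_MAX_TC];
};

/* Rebuild the qopt from the device state, so that taprio and mqprio
 * can share one copy of this logic instead of open-coding it twice.
 */
static void sketch_qopt_reconstruct(const struct netdev_sketch *dev,
				    struct qopt_sketch *qopt)
{
	int num_tc = dev->num_tc;

	/* Clamp defensively; this bound is an assumption of the sketch. */
	if (num_tc < 0)
		num_tc = 0;
	if (num_tc > SKETCH_MAX_TC)
		num_tc = SKETCH_MAX_TC;

	memset(qopt, 0, sizeof(*qopt));
	qopt->num_tc = (uint8_t)num_tc;
	memcpy(qopt->prio_tc_map, dev->prio_tc_map, sizeof(qopt->prio_tc_map));

	for (int tc = 0; tc < num_tc; tc++) {
		qopt->count[tc] = dev->tc_to_txq[tc].count;
		qopt->offset[tc] = dev->tc_to_txq[tc].offset;
	}
}
```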
    • Vladimir Oltean's avatar
      net/sched: taprio: centralize mqprio qopt validation · 1dfe086d
      Vladimir Oltean authored
      There is a lot of code in taprio which is "borrowed" from mqprio.
      It makes sense to put a stop to the "borrowing" and start actually
      reusing code.
      
      Because taprio and mqprio are built as part of different kernel modules,
      code reuse can only take place either by writing it as static inline
      (limiting), putting it in sch_generic.o (not generic enough), or
      creating a third auto-selectable kernel module which only holds library
      code. I opted for the third variant.
      
      In a previous change, mqprio gained support for reverse TC:TXQ mappings,
      something which taprio still denies. Make taprio use the same validation
      logic so that it supports this configuration as well.
      
      The taprio code didn't reject TXQ overlaps in txtime-assist mode and
      that looks intentional, even if I've no idea why that might be. Preserve
      that, but add a comment.
      
      There isn't any dedicated MAINTAINERS entry for mqprio, so nothing to
      update there.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1dfe086d
    • Vladimir Oltean's avatar
      net/sched: mqprio: add extack messages for queue count validation · d404959f
      Vladimir Oltean authored
      To make mqprio more user-friendly, create netlink extended ack messages
      which say exactly what is wrong about the queue counts. This uses the
      new support for printf-formatted extack messages.
      
      Example:
      
      $ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 queues 3@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
      Error: sch_mqprio: TC 0 queues 3@0 overlap with TC 1 queues 1@1.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d404959f
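      To show how a printf-formatted message like the one in the example
      above can be produced, here is a small userspace sketch that finds the
      first pair of overlapping TXQ ranges and formats a description of it.
      The struct, the function name and the use of snprintf() are
      assumptions made for illustration; the kernel reports such text
      through netlink extended ack, which is not modelled here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* One TC's TXQ range, written "count@offset" on the tc command line. */
struct txq_range {
	unsigned int count;
	unsigned int offset;
};

/* Find the first overlapping pair of ranges and describe it in msg.
 * Returns true when an overlap was found.
 */
static bool sketch_report_overlap(const struct txq_range *r, size_t num_tc,
				  char *msg, size_t len)
{
	for (size_t i = 0; i < num_tc; i++) {
		unsigned int i_end = r[i].offset + r[i].count;

		for (size_t j = i + 1; j < num_tc; j++) {
			unsigned int j_end = r[j].offset + r[j].count;

			/* Half-open intervals intersect when each one
			 * starts before the other one ends.
			 */
			if (r[i].offset < j_end && r[j].offset < i_end) {
				snprintf(msg, len,
					 "TC %zu queues %u@%u overlap with TC %zu queues %u@%u",
					 i, r[i].count, r[i].offset,
					 j, r[j].count, r[j].offset);
				return true;
			}
		}
	}
	return false;
}
```

      Fed the queue layout from the tc command above (3@0 followed by 1@1
      through 1@7), the sketch reports TC 0 and TC 1 as the first
      conflicting pair.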
    • Vladimir Oltean's avatar
      net/sched: mqprio: allow offloading drivers to request queue count validation · 19278d76
      Vladimir Oltean authored
      mqprio_parse_opt() proudly has a comment:
      
      	/* If hardware offload is requested we will leave it to the device
      	 * to either populate the queue counts itself or to validate the
      	 * provided queue counts.
      	 */
      
      Unfortunately some device drivers did not get this memo, and don't
      validate the queue counts, or populate them.
      
      For drivers that don't want to populate the queue counts themselves,
      but just act upon the requested configuration, it makes sense to
      introduce a tc capability and make mqprio query it, so that they don't
      have to do the validation themselves.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      19278d76
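      The capability idea can be sketched as follows: the qdisc asks the
      driver for its capabilities and validates the queue counts itself
      whenever offload is off or the driver asked for validation. This is a
      userspace model under stated assumptions; the caps structure, the
      callback type and every function name below are hypothetical and do
      not reproduce the kernel interfaces.

```c
#include <stdbool.h>
#include <stddef.h>

struct txq_range {
	unsigned int count;
	unsigned int offset;
};

/* Hypothetical capability block filled in by the driver. */
struct mqprio_caps_sketch {
	bool validate_queue_counts;
};

typedef void (*sketch_query_caps_fn)(struct mqprio_caps_sketch *caps);

/* A driver that wants the core to validate on its behalf. */
static void sketch_driver_wants_validation(struct mqprio_caps_sketch *caps)
{
	caps->validate_queue_counts = true;
}

/* A driver that populates or validates the queue counts itself. */
static void sketch_driver_self_validates(struct mqprio_caps_sketch *caps)
{
	(void)caps;	/* leaves every capability at its zero default */
}

static bool sketch_overlap_free(const struct txq_range *r, size_t num_tc)
{
	for (size_t i = 0; i < num_tc; i++)
		for (size_t j = i + 1; j < num_tc; j++)
			if (r[i].offset < r[j].offset + r[j].count &&
			    r[j].offset < r[i].offset + r[i].count)
				return false;
	return true;
}

/* Returns 0 when the configuration is accepted, -1 when it is rejected. */
static int sketch_parse_queue_counts(sketch_query_caps_fn query, bool hw,
				     const struct txq_range *r, size_t num_tc)
{
	struct mqprio_caps_sketch caps = { 0 };

	if (query)
		query(&caps);

	/* Software mode always validates; offload only on request. */
	if (!hw || caps.validate_queue_counts)
		return sketch_overlap_free(r, num_tc) ? 0 : -1;

	return 0;
}
```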
    • Vladimir Oltean's avatar
      net/sched: mqprio: allow reverse TC:TXQ mappings · d7045f52
      Vladimir Oltean authored
      By imposing that the last TXQ of TC i is smaller than the first TXQ of
      any TC j (j := i+1 .. n), mqprio imposes a strict ordering condition for
      the TXQ indices (they must increase as TCs increase).
      
      Claudiu points out that the complexity of the TXQ count validation is
      too high for this logic, i.e. instead of iterating over j, it is
      sufficient that the TXQ indices of TC i and i + 1 are ordered, and that
      will eventually ensure global ordering.
      
      This is true, however it doesn't appear to me that is what the code
      really intended to do. Instead, based on the comments, it just wanted to
      check for overlaps (and this isn't how one does that).
      
      So the following mqprio configuration, which I had recommended to
      Vinicius more than once for igb/igc (to account for the fact that on
      this hardware, lower numbered TXQs have higher dequeue priority than
      higher ones):
      
      num_tc 4 map 0 1 2 3 queues 1@3 1@2 1@1 1@0
      
      is in fact denied today by mqprio.
      
      The full story is that in fact, it's only denied with "hw 0"; if
      hardware offloading is requested, mqprio defers TXQ range overlap
      validation to the device driver (a strange decision in itself).
      
      This is most certainly a bug, but it's not one that has any merit for
      being fixed on "stable" as far as I can tell. This is because mqprio
      always rejected a configuration which was in fact valid, and this has
      shaped the way in which mqprio configuration scripts got built for
      various hardware (see igb/igc in the link below). Therefore, one could
      consider it to be merely an improvement for mqprio to allow reverse
      TC:TXQ mappings.
      
      Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230130173145.475943-9-vladimir.oltean@nxp.com/#25188310
      Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230128010719.2182346-6-vladimir.oltean@nxp.com/#25186442
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d7045f52
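      The difference between the ordering condition and a plain overlap
      check, as discussed above, can be made concrete with a small userspace
      sketch. Both functions and the struct are hypothetical and only model
      the two conditions; they assume every TC has at least one TXQ.

```c
#include <stdbool.h>
#include <stddef.h>

/* One TC's TXQ range, written "count@offset". */
struct txq_range {
	unsigned int count;
	unsigned int offset;
};

/* Ordering condition: the last TXQ of TC i must be smaller than the
 * first TXQ of every later TC j, so TXQ indices increase with the TC.
 */
static bool sketch_strictly_ordered(const struct txq_range *r, size_t num_tc)
{
	for (size_t i = 0; i + 1 < num_tc; i++) {
		unsigned int i_end = r[i].offset + r[i].count;

		for (size_t j = i + 1; j < num_tc; j++)
			if (i_end > r[j].offset)
				return false;
	}
	return true;
}

/* Overlap check: only reject ranges that actually share a TXQ,
 * regardless of the order in which the TCs use the TXQ space.
 */
static bool sketch_overlap_free(const struct txq_range *r, size_t num_tc)
{
	for (size_t i = 0; i < num_tc; i++) {
		unsigned int i_end = r[i].offset + r[i].count;

		for (size_t j = i + 1; j < num_tc; j++) {
			unsigned int j_end = r[j].offset + r[j].count;

			if (r[i].offset < j_end && r[j].offset < i_end)
				return false;
		}
	}
	return true;
}
```

      For "queues 1@3 1@2 1@1 1@0" the ordering condition fails although no
      two ranges share a TXQ, so only the overlap check accepts the reverse
      mapping.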
    • Vladimir Oltean's avatar
      net/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to pkt_sched.h · 9adafe2b
      Vladimir Oltean authored
      Since mqprio is a scheduler and not a classifier, move its offload
      structure to pkt_sched.h, where struct tc_taprio_qopt_offload also lies.
      
      Also update some header inclusions in drivers that access this
      structure, to the best of my abilities.
      
      Cc: Igor Russkikh <irusskikh@marvell.com>
      Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
      Cc: Salil Mehta <salil.mehta@huawei.com>
      Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
      Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
      Cc: Saeed Mahameed <saeedm@nvidia.com>
      Cc: Leon Romanovsky <leon@kernel.org>
      Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
      Cc: Lars Povlsen <lars.povlsen@microchip.com>
      Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
      Cc: Daniel Machon <daniel.machon@microchip.com>
      Cc: UNGLinuxDriver@microchip.com
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9adafe2b
    • Vladimir Oltean's avatar
      net/sched: mqprio: refactor offloading and unoffloading to dedicated functions · 5cfb45e2
      Vladimir Oltean authored
      Some more logic will be added to mqprio offloading, so split that code
      up from mqprio_init(), which is already large, and create a new
      function, mqprio_enable_offload(), similar to taprio_enable_offload().
      Also create the opposite function mqprio_disable_offload().
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5cfb45e2
    • Vladimir Oltean's avatar
      net/sched: mqprio: refactor nlattr parsing to a separate function · feb2cf3d
      Vladimir Oltean authored
      mqprio_init() is quite large and unwieldy to add more code to.
      Split the netlink attribute parsing to a dedicated function.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      feb2cf3d
    • Praveen Kaligineedi's avatar
      gve: Fix gve interrupt names · 84371145
      Praveen Kaligineedi authored
      IRQs are currently requested before the netdevice is registered
      and a proper name is assigned to the device. Change the interrupt
      name so that it no longer relies on the eth%d format string.
      
      Interrupt name before change: eth%d-ntfy-block.<blk_id>
      Interrupt name after change: gve-ntfy-blk<blk_id>@pci:<pci_name>
      Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
      Reviewed-by: Jeroen de Borst <jeroendb@google.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      84371145
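      The new naming scheme can be illustrated with a short sketch that
      builds a name from the block id and the PCI device name instead of the
      not-yet-assigned netdev name. The function name, the buffer handling
      and the sample PCI name used with it are assumptions for illustration
      only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Build "gve-ntfy-blk<blk_id>@pci:<pci_name>" into buf.
 * Returns true when the whole name fit into the buffer.
 */
static bool sketch_irq_name(char *buf, size_t len, int blk_id,
			    const char *pci_name)
{
	int n = snprintf(buf, len, "gve-ntfy-blk%d@pci:%s", blk_id, pci_name);

	return n > 0 && (size_t)n < len;
}
```

      Since the PCI name is known at probe time, the resulting string
      contains no unexpanded %d, unlike a name derived from the netdev
      before registration.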
    • David S. Miller's avatar
      Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · d78f8d83
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      net: implement devlink reload in ice
      
      Michal Swiatkowski says:
      
      This is part of the changes made in patchset [0]. Resource management is
      a somewhat controversial part, so I split the work into two patchsets.
      
      This is the first one, covering the refactor and the implementation of
      the reload API call. The refactor will unblock some of the patches
      needed by SIOV or subfunctions.
      
      Most of this patchset is about implementing the driver reload mechanism.
      Parts of the probe and rebuild code are reused to avoid duplication.
      To allow this reuse, the probe and rebuild paths are split into smaller
      functions.
      
      Patch "ice: split ice_vsi_setup into smaller functions" changes a
      boolean argument in a function call to an integer and adds defines
      for it. Instead of being called with true/false, the function can now
      be called with the readable defines ICE_VSI_FLAG_INIT or
      ICE_VSI_FLAG_NO_INIT. This was suggested by Jacob Keller, and the
      mechanism will probably be adopted across the ice driver in a follow-up
      patchset.
      
      Previously the code was reviewed here [0].
      
      [0] https://lore.kernel.org/netdev/Y3ckRWtAtZU1BdXm@unreal/T/#m3bb8feba0a62f9b4cd54cd94917b7e2143fc2ecd
      
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d78f8d83
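      The readability gain from replacing a bare true/false argument with
      named defines can be sketched as below. Only the two define names come
      from the cover letter; their numeric values, the struct and the
      function are assumptions made for this illustration and may differ
      from the driver.

```c
#include <stdbool.h>

/* Names taken from the cover letter; the values are assumed. */
#define ICE_VSI_FLAG_NO_INIT	0u
#define ICE_VSI_FLAG_INIT	(1u << 0)

/* Hypothetical, heavily reduced VSI state. */
struct vsi_sketch {
	bool configured;
	bool initialized;
};

/* Hypothetical configuration step taking flags instead of a bool. */
static void sketch_vsi_cfg(struct vsi_sketch *vsi, unsigned int vsi_flags)
{
	vsi->configured = true;
	vsi->initialized = (vsi_flags & ICE_VSI_FLAG_INIT) != 0;
}
```

      A call such as sketch_vsi_cfg(&vsi, ICE_VSI_FLAG_NO_INIT) states its
      intent at the call site, where a bare "false" would not.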
    • Jesper Dangaard Brouer's avatar
      net: introduce skb_poison_list and use in kfree_skb_list · 9dde0cd3
      Jesper Dangaard Brouer authored
      The first user of skb_poison_list is kfree_skb_list_reason, to catch
      earlier bugs like the one introduced in commit eedade12 ("net:
      kfree_skb_list use kmem_cache_free_bulk"). For completeness, that bug
      has been fixed in
      commit f72ff8b8 ("net: fix kfree_skb_list use of skb_mark_not_on_list").
      
      With a bug like the one in that commit, we would have seen an OOPS with:
       general protection fault, probably for non-canonical address 0xdead000000000870
      and one of the registers holding the poison value, e.g. R13: dead000000000800
      
      In this case skb->len is at offset 112 bytes (0x70), which is why the
      fault happens at
       0x800+0x70 = 0x870
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9dde0cd3
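      The address arithmetic above can be checked with a few lines of C. The
      constants are taken from the values quoted in the commit message (the
      register content and the 112-byte offset); the macro and function
      names are hypothetical, and the split into a base and a 0x800 part is
      an assumption chosen to match that register value.

```c
#include <stdint.h>

/* Assumed split of the poison value seen in R13 above. */
#define SKETCH_POISON_BASE	0xdead000000000000ULL
#define SKETCH_LIST_POISON	0x800ULL

/* Offset of skb->len quoted in the commit message (112 == 0x70). */
#define SKETCH_SKB_LEN_OFFSET	112ULL

/* Value stored in the poisoned list pointer. */
static uint64_t sketch_poisoned_next(void)
{
	return SKETCH_POISON_BASE + SKETCH_LIST_POISON;
}

/* Address touched when a field is read through a poisoned pointer. */
static uint64_t sketch_fault_address(uint64_t poisoned_ptr,
				     uint64_t field_offset)
{
	return poisoned_ptr + field_offset;
}
```

      Dereferencing a poisoned pointer therefore faults at a recognisable
      non-canonical address, which makes a use-after-free on the list easy
      to spot in an oops.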
    • David S. Miller's avatar
      Merge branch 'wangxun-interrupts' · 149e8fb0
      David S. Miller authored
      Jiawen Wu says:
      
      ====================
      Wangxun interrupt and RxTx support
      
      Configure interrupts, set up the Rx/Tx rings, and add support for
      receiving and transmitting packets.
      
      change log:
      v3:
      - Use upper_32_bits() to avoid a compile warning.
      - Remove useless code.
      v2:
      - Andrew Lunn: https://lore.kernel.org/netdev/Y86kDphvyHj21IxK@lunn.ch/
      - Add a check when allocating DMA memory for descriptors.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      149e8fb0
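      The v3 note about upper_32_bits() can be illustrated with a sketch of
      splitting a descriptor ring address into two 32-bit register values.
      The macros below are userspace imitations written for this example,
      and the register structure and function are hypothetical. The likely
      reason for the warning (shifting a 32-bit wide address by 32) is an
      assumption, since the change log does not state it.

```c
#include <stdint.h>

/* Shift in two steps so the macro stays well defined even when the
 * argument is only 32 bits wide.
 */
#define SKETCH_UPPER_32_BITS(n)	((uint32_t)(((n) >> 16) >> 16))
#define SKETCH_LOWER_32_BITS(n)	((uint32_t)((n) & 0xffffffffu))

/* Hypothetical pair of ring base address registers. */
struct sketch_ring_base {
	uint32_t lo;
	uint32_t hi;
};

/* Split a DMA address into the low and high register halves. */
static void sketch_set_ring_base(struct sketch_ring_base *regs, uint64_t dma)
{
	regs->lo = SKETCH_LOWER_32_BITS(dma);
	regs->hi = SKETCH_UPPER_32_BITS(dma);
}
```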
    • Mengyuan Lou's avatar
      net: ngbe: Support Rx and Tx process path · b97f955e
      Mengyuan Lou authored
      Add the enable and disable procedures for ngbe open/close.
      Clean Rx and Tx ring interrupts, and process packets in the data path.
      Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b97f955e
    • Jiawen Wu's avatar
      net: txgbe: Support Rx and Tx process path · 0d22be52
      Jiawen Wu authored
      Clean Rx and Tx ring interrupts, and process packets in the data path.
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d22be52
    • Mengyuan Lou's avatar
      net: libwx: Add tx path to process packets · 09a50880
      Mengyuan Lou authored
      Support transmitting packets without hardware features.
      Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      09a50880
    • Jiawen Wu's avatar
      net: libwx: Support to receive packets in NAPI · 3c47e8ae
      Jiawen Wu authored
      Clean all queues associated with a q_vector, to simply receive packets
      without hardware features.
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3c47e8ae
    • Jiawen Wu's avatar
      net: txgbe: Setup Rx and Tx ring · 0ef7e159
      Jiawen Wu authored
      Improve the configuration of the Rx and Tx rings, set Rx flags and
      implement the ndo_set_rx_mode op.
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ef7e159
    • Jiawen Wu's avatar
      net: libwx: Allocate Rx and Tx resources · 850b9711
      Jiawen Wu authored
      Set up Rx and Tx descriptors for specific rings.
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      850b9711
    • Jiawen Wu's avatar
      net: libwx: Configure Rx and Tx unit on hardware · 18b5b8a9
      Jiawen Wu authored
      Configure the hardware to prepare for packet processing. This includes
      configuring the receive and transmit units of the MAC layer and setting
      up the specific rings.
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      18b5b8a9