1. 13 Oct, 2021 20 commits
  2. 12 Oct, 2021 20 commits
    • Merge branch 'devlink-reload-simplification' · 0e258cec
      Jakub Kicinski authored
      Leon Romanovsky says:
      
      ====================
      devlink reload simplification
      
      Simplify devlink reload APIs.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1634044267.git.leonro@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: Delete reload enable/disable interface · 82465bec
      Leon Romanovsky authored
      Commit a0c76345 ("devlink: disallow reload operation during device
      cleanup") added devlink_reload_{enable,disable}() APIs to prevent reload
      operation from racing with device probe/dismantle.
      
      After recent changes to move devlink_register() to the end of device
      probe and devlink_unregister() to the beginning of device dismantle,
      these races can no longer happen. Reload operations will be denied if
      the devlink instance is unregistered and devlink_unregister() will block
      until all in-flight operations are done.
      
      Therefore, remove these devlink_reload_{enable,disable}() APIs.
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: Set devlink reload feature bit for supported devices only · 96869f19
      Leon Romanovsky authored
      Multiport slave device doesn't support devlink reload, so instead of
      complicating initialization flow with devlink_reload_enable() which
      will be removed in next patch, don't set DEVLINK_F_RELOAD feature bit
      for such devices.
      
      This fixes an error where reload counters are exposed (and equal to zero)
      for a mode that is not supported at all.
      
      Fixes: d89ddaae ("net/mlx5: Disable devlink reload for multi port slave device")
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: Allow control devlink ops behavior through feature mask · bd032e35
      Leon Romanovsky authored
      Introduce new devlink call to set feature mask to control devlink
      behavior during device initialization phase after devlink_alloc()
      is already called.
      
      This allows us to set reload ops based on device property which
      is not known at the beginning of driver initialization.
      
      For the sake of simplicity, this API lacks any type of locking and
      needs to be called before devlink_register() to make sure that no
      parallel access to the ops is possible at this stage.
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
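The ordering rule described in this commit (set the feature mask after allocation but before registration, with no locking) can be illustrated with a small userspace sketch. Everything here is hypothetical: `struct demo_devlink`, `DEMO_F_RELOAD` and the `demo_*` functions are stand-ins invented for illustration and are not the kernel's devlink API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a feature bit; not the kernel definition. */
#define DEMO_F_RELOAD (1u << 0)

/* Hypothetical userspace model of a devlink instance. */
struct demo_devlink {
    uint32_t features;   /* feature mask chosen during probe */
    bool registered;     /* true once the instance is visible to users */
};

/*
 * The mask may only change before registration, because nothing else can
 * reach the instance yet and no locking is needed.
 * Returns 0 on success, -1 if the instance is already registered.
 */
static int demo_set_features(struct demo_devlink *dl, uint32_t features)
{
    if (dl->registered)
        return -1;
    dl->features = features;
    return 0;
}

static void demo_register(struct demo_devlink *dl)
{
    dl->registered = true;
}

/* Reload is allowed only on a registered instance with the feature bit. */
static bool demo_reload_allowed(const struct demo_devlink *dl)
{
    return dl->registered && (dl->features & DEMO_F_RELOAD) != 0;
}

/*
 * Probe flow: the device property is only known mid-probe, so the mask
 * is set late, but still before registration.
 */
static struct demo_devlink demo_probe(bool is_multiport_slave)
{
    struct demo_devlink dl = { 0 };

    if (!is_multiport_slave)
        demo_set_features(&dl, DEMO_F_RELOAD);
    demo_register(&dl);
    return dl;
}
```

In this model a device probed with `is_multiport_slave == true` never gets the reload bit, and any later attempt to change the mask is refused.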
    • devlink: Annotate devlink API calls · b88f7b12
      Leon Romanovsky authored
      Initial annotation patch to separate calls that need to be executed
      before or after devlink_register().
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: Move netdev_to_devlink helpers to devlink.c · 2bc50987
      Leon Romanovsky authored
      Both netdev_to_devlink and netdev_to_devlink_port are used in devlink.c
      only, so move them in order to reduce their scope.
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: Reduce struct devlink exposure · 21314638
      Leon Romanovsky authored
      The declaration of struct devlink in general header provokes the
      situation where internal fields can be accidentally used by the driver
      authors. In order to reduce such possible situations, let's reduce the
      namespace exposure of struct devlink.
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ethernet: tulip: avoid duplicate variable name on sparc · 177c9235
      Jakub Kicinski authored
      I recently added a variable called addr to tulip_init_one()
      but for sparc there's already a variable called that half
      way thru the function. Rename it to fix build.
      
      Fixes: ca879317 ("ethernet: tulip: remove direct netdev->dev_addr writes")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: debugfs add support dumping page pool info · 850bfb91
      Hao Chen authored
      Add a file node "page_pool_info" for debugfs, then cat this
      file node to dump page pool info as below:
      
      QUEUE_ID  ALLOCATE_CNT  FREE_CNT      POOL_SIZE(PAGE_NUM)  ORDER  NUMA_ID  MAX_LEN
      0         512           0             512                  0      2        4K
      1         512           0             512                  0      2        4K
      2         512           0             512                  0      2        4K
      3         512           0             512                  0      2        4K
      4         512           0             512                  0      2        4K
      Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tulip: fix setting device address from rom · 25b90c19
      Jakub Kicinski authored
      I missed removing i from the array index when converting
      from a loop to a direct copy.
      
      Fixes: ca879317 ("ethernet: tulip: remove direct netdev->dev_addr writes")
      Reported-by: Joe Perches <joe@perches.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'Managed-Neighbor-Entries' · 2ed08b5e
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      Managed Neighbor Entries
      
      This series adds a couple of fixes related to NTF_EXT_LEARNED and NTF_USE
      neighbor flags, extends the UAPI with a new NDA_FLAGS_EXT netlink attribute
      in order to be able to add new neighbor flags from user space given all
      current struct ndmsg / ndm_flags bits are used up. Finally, the core of this
      series adds a new NTF_EXT_MANAGED flag to neighbors, which allows user space
      control planes to add 'managed' neighbor entries. Meaning, user space may
      either transition existing entries or can push down new L3 entries without
      lladdr into the kernel where the latter will periodically try to keep such
      NTF_EXT_MANAGED managed entries in reachable state. Main use case for this
      series are XDP / tc BPF load-balancers which make use of the bpf_fib_lookup()
      helper for backends. For more details, please see individual patches. Thanks!
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net, neigh: Add NTF_MANAGED flag for managed neighbor entries · 7482e384
      Daniel Borkmann authored
      Allow a user space control plane to insert entries with a new NTF_EXT_MANAGED
      flag. The flag then indicates to the kernel that the neighbor entry should be
      periodically probed for keeping the entry in NUD_REACHABLE state iff possible.
      
      The use case for this is targeting XDP or tc BPF load-balancers which use
      the bpf_fib_lookup() BPF helper in order to piggyback on neighbor resolution
      for their backends. Given they cannot be resolved in fast-path, a control
      plane inserts the L3 (without L2) entries manually into the neighbor table
      and lets the kernel do the neighbor resolution either on the gateway or on
      the backend directly in case the latter resides in the same L2. This avoids
      having to deal with L2 in the control plane and rebuilding what the kernel
      already does best anyway.
      
      NTF_EXT_MANAGED can be combined with NTF_EXT_LEARNED in order to avoid GC
      eviction. The kernel then adds NTF_MANAGED flagged entries to a per-neighbor
      table which gets triggered by the system work queue to periodically call
      neigh_event_send() for performing the resolution. The implementation allows
      migration from/to NTF_MANAGED neighbor entries, so that already existing
      entries can be converted by the control plane if needed. Potentially, we could
      make the interval for periodically calling neigh_event_send() configurable;
      right now it's set to DELAY_PROBE_TIME which is also in line with mlxsw which
      has similar driver-internal infrastructure c723c735 ("mlxsw: spectrum_router:
      Periodically update the kernel's neigh table"). In future, the latter could
      possibly reuse the NTF_MANAGED neighbors as well.
      
      Example:
      
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 managed extern_learn
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a managed extern_learn REACHABLE
        [...]
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Roopa Prabhu <roopa@nvidia.com>
      Link: https://linuxplumbersconf.org/event/11/contributions/953/
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net, neigh: Extend neigh->flags to 32 bit to allow for extensions · 2c611ad9
      Roopa Prabhu authored
      Currently, all bits in struct ndmsg's ndm_flags are used up with the most
      recent addition of 435f2e7c ("net: bridge: add support for sticky fdb
      entries"). This makes it impossible to extend the neighboring subsystem
      with new NTF_* flags:
      
        struct ndmsg {
          __u8   ndm_family;
          __u8   ndm_pad1;
          __u16  ndm_pad2;
          __s32  ndm_ifindex;
          __u16  ndm_state;
          __u8   ndm_flags;
          __u8   ndm_type;
        };
      
      There are ndm_pad{1,2} attributes which are not used. However, due to
      careless design, the kernel does not enforce them to be zero upon new
      neighbor entry addition, and given they've been around forever, it is
      not possible to reuse them today due to risk of breakage. One option to
      overcome this limitation is to add a new NDA_FLAGS_EXT attribute for
      extended flags.
      
      In struct neighbour, there is a 3 byte hole between protocol and ha_lock,
      which allows neigh->flags to be extended from 8 to 32 bits while still
      being on the same cacheline as before. This also allows for all future
      NTF_* flags being in neigh->flags rather than yet another flags field.
      Unknown flags in NDA_FLAGS_EXT will be rejected by the kernel.
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
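One way to picture the widening described above is a single 32-bit flags word whose low 8 bits hold the legacy `ndm_flags` value and whose upper bits hold the extended flags, with unknown extended bits rejected. The sketch below is a minimal userspace illustration under that assumption; the shift of 8, the `DEMO_*` constants and `demo_merge_flags()` are hypothetical names chosen here, not taken from the commit message or guaranteed to match the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for illustration: extended flags sit above the 8 legacy bits. */
#define DEMO_EXT_SHIFT      8
/* Hypothetical extended flag bit as it would arrive in the extended attribute. */
#define DEMO_EXT_MANAGED    (1u << 0)
/* Set of extended bits this sketch knows about. */
#define DEMO_EXT_KNOWN_MASK (DEMO_EXT_MANAGED)

/*
 * Merge the legacy 8-bit flags and the 32-bit extended flags into one
 * 32-bit word. Returns 0 on success, or -1 if ext_flags carries a bit
 * that is not in the known mask (unknown flags are rejected).
 */
static int demo_merge_flags(uint8_t ndm_flags, uint32_t ext_flags,
                            uint32_t *out)
{
    if (ext_flags & ~DEMO_EXT_KNOWN_MASK)
        return -1;
    *out = (uint32_t)ndm_flags | (ext_flags << DEMO_EXT_SHIFT);
    return 0;
}
```

Rejecting unknown bits up front is what keeps the remaining bits reusable later, which is the property the unused `ndm_pad{1,2}` fields lack.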
    • net, neigh: Enable state migration between NUD_PERMANENT and NTF_USE · 3dc20f47
      Daniel Borkmann authored
      Currently, it is not possible to migrate a neighbor entry between NUD_PERMANENT
      state and NTF_USE flag with a dynamic NUD state from a user space control plane.
      Similarly, it is not possible to add/remove NTF_EXT_LEARNED flag from an existing
      neighbor entry in combination with NTF_USE flag.
      
      This is due to the latter directly calling into neigh_event_send() without any
      meta data updates as happening in __neigh_update(). Thus, to enable this use
      case, extend the latter with a NEIGH_UPDATE_F_USE flag where we break the
      NUD_PERMANENT state in particular so that a later neigh_event_send() is able
      to re-resolve a neighbor entry.
      
      Before fix, NUD_PERMANENT -> NUD_* & NTF_USE:
      
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT
        [...]
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT
        [...]
      
      As can be seen, despite the admin-triggered replace, the entry remains in the
      NUD_PERMANENT state.
      
      After fix, NUD_PERMANENT -> NUD_* & NTF_USE:
      
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT
        [...]
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE
        [...]
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn STALE
        [...]
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT
        [...]
      
      After the fix, the admin-triggered replace switches to a dynamic state from
      the NTF_USE flag which triggered a new neighbor resolution. Likewise, we can
      transition back from there, if needed, into NUD_PERMANENT.
      
      Similar before/after behavior can be observed for below transitions:
      
      Before fix, NTF_USE -> NTF_USE | NTF_EXT_LEARNED -> NTF_USE:
      
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE
        [...]
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE
        [...]
      
      After fix, NTF_USE -> NTF_USE | NTF_EXT_LEARNED -> NTF_USE:
      
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE
        [...]
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE
        [...]
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE
        [...]
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net, neigh: Fix NTF_EXT_LEARNED in combination with NTF_USE · e4400bbf
      Daniel Borkmann authored
      The NTF_EXT_LEARNED neigh flag is usually propagated back to user space
      upon dump of the neighbor table. However, when used in combination with
      NTF_USE flag this is not the case despite exempting the entry from the
      garbage collector. This results in inconsistent state since entries are
      typically marked in neigh->flags with NTF_EXT_LEARNED, but here they are
      not. Fix it by propagating the creation flag to ___neigh_create().
      
      Before fix:
      
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE
        [...]
      
      After fix:
      
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE
        [...]
      
      Fixes: 9ce33e46 ("neighbour: support for NTF_EXT_LEARNED flag")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns: Prefer struct_size over open coded arithmetic · 7bb39a39
      Len Baker authored
      As noted in the "Deprecated Interfaces, Language Features, Attributes,
      and Conventions" documentation [1], size calculations (especially
      multiplication) should not be performed in memory allocator (or similar)
      function arguments due to the risk of them overflowing. This could lead
      to values wrapping around and a smaller allocation being made than the
      caller was expecting. Using those allocations could lead to linear
      overflows of heap memory and other misbehaviors.
      
      So, take the opportunity to refactor the hnae_handle structure to switch
      the last member to flexible array, changing the code accordingly. Also,
      fix the comment in the hnae_vf_cb structure to inform that the ae_handle
      member must be the last member.
      
      Then, use the struct_size() helper to do the arithmetic instead of the
      argument "size + count * size" in the kzalloc() function.
      
      This code was detected with the help of Coccinelle and audited and fixed
      manually.
      
      [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments

      Signed-off-by: Len Baker <len.baker@gmx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
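The idea behind replacing "size + count * size" with a checked helper can be sketched in userspace C. This is not the kernel's struct_size() macro: `demo_struct_size()`, `struct demo_handle` and `demo_handle_alloc()` are hypothetical names, and the sketch assumes a GCC or Clang compiler that provides the `__builtin_mul_overflow` and `__builtin_add_overflow` builtins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical structure ending in a flexible array member. */
struct demo_handle {
    size_t q_num;
    void *qs[];          /* flexible array: must be the last member */
};

/*
 * Overflow-checked "header + elem * count".
 * Saturates to SIZE_MAX on overflow so that a following allocation
 * fails instead of silently returning a too-small buffer.
 */
static size_t demo_struct_size(size_t header, size_t elem, size_t count)
{
    size_t bytes;

    if (__builtin_mul_overflow(elem, count, &bytes) ||
        __builtin_add_overflow(bytes, header, &bytes))
        return SIZE_MAX;
    return bytes;
}

/* Allocate a zeroed handle with room for q_num queue pointers. */
static struct demo_handle *demo_handle_alloc(size_t q_num)
{
    size_t bytes = demo_struct_size(sizeof(struct demo_handle),
                                    sizeof(void *), q_num);
    struct demo_handle *h;

    if (bytes == SIZE_MAX)
        return NULL;     /* size computation overflowed */
    h = calloc(1, bytes);
    if (h)
        h->q_num = q_num;
    return h;
}
```

With open-coded arithmetic, a huge `q_num` could wrap the product and yield a small allocation; here the same input makes the allocation fail.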
    • Merge branch 'mlxsw-ECN-mirroring' · 249ae949
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Add support for ECN mirroring
      
      Petr says:
      
      Patches in this set have been floating around for some time now together
      with trap_fwd support. That will however need more work, time for which is
      nowhere to be found, apparently. Instead, this patchset enables offload of
      only packet mirroring on RED mark qevent, enabling mirroring of ECN-marked
      packets.
      
      Formally it enables offload of filters added to blocks bound to the RED
      qevent mark if:
      
      - The switch ASIC is Spectrum-2 or above.
      - Only a single filter is attached at the block, at chain 0 (the default),
        and its classifier is matchall.
      - The filter has hw_stats set to disabled.
      - The filter has a single action, which is mirror.
      
      This differs from early_drop qevent offload, which supports mirroring and
      trapping. However trapping in context of ECN-marked packets is not
      suitable, because the HW does not drop the packet, as the trap action
      implies. And there is as of now no way to express only the part of trapping
      that transfers the packet to the SW datapath, sans the HW-datapath drop.
      
      The patchset progresses as follows:
      
      Patch #1 is an extack propagation.
      
      Mirroring of ECN-marked packets is configured in the ASIC through an ECN
      trigger, which is considered "egress", unlike the EARLY_DROP trigger.
      In patch #2, add a helper to classify triggers as ingress.
      
      As clarified above, traps cannot be offloaded on mark qevent. Similarly,
      given a trap_fwd action, it would not be offloadable on early_drop qevent.
      In patch #3, introduce support for tracking actions permissible on a given
      block.
      
      Patch #4 actually adds the mark qevent offload.
      
      In patch #5, fix a small style issue in one of the selftests, and in
      patch #6 add mark offload selftests.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
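The four offload conditions listed in the cover letter above amount to a single predicate over the block's state. The following is a minimal sketch of that check only; `struct demo_block`, `enum demo_action` and `demo_mark_offloadable()` are hypothetical and do not reflect how the mlxsw driver actually represents blocks or filters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical action kinds for illustration. */
enum demo_action { DEMO_ACT_MIRROR, DEMO_ACT_TRAP };

/* Hypothetical summary of a block bound to the RED mark qevent. */
struct demo_block {
    int spectrum_gen;        /* 1 = Spectrum-1, 2 = Spectrum-2, ... */
    int filter_count;        /* filters attached to the block */
    int chain;               /* chain of the single filter */
    bool is_matchall;        /* classifier is matchall */
    bool hw_stats_disabled;  /* filter has hw_stats set to disabled */
    int action_count;        /* actions on the single filter */
    enum demo_action action; /* the single action, if action_count == 1 */
};

/* True only if every condition from the cover letter holds. */
static bool demo_mark_offloadable(const struct demo_block *b)
{
    return b->spectrum_gen >= 2 &&
           b->filter_count == 1 && b->chain == 0 && b->is_matchall &&
           b->hw_stats_disabled &&
           b->action_count == 1 && b->action == DEMO_ACT_MIRROR;
}
```

A trap action fails the last clause, which matches the cover letter's point that traps cannot be offloaded on the mark qevent.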
    • selftests: mlxsw: RED: Add selftests for the mark qevent · 0cd6fa99
      Petr Machata authored
      Add do_mark_test(), which is to do_ecn_test() like do_drop_test() is to
      do_red_test(): meant to test that actions on the RED mark qevent block are
      offloaded, and executed on ECN-marked packets.
      
      The test splits install_qdisc() into its constituents, install_root_qdisc()
      and install_qdisc_tcX(). This is in order to test that when mirroring is
      enabled on one TC, the other TC does not mirror.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mlxsw: sch_red_core: Drop two unused variables · a703b517
      Petr Machata authored
      These variables are cut'n'pasted from other functions in the file and not
      actually used.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_qdisc: Offload RED qevent mark · 9c18eaf2
      Petr Machata authored
      The RED "mark" qevent can be offloaded under similar conditions as the RED
      "early_drop" qevent. Therefore recognize its binding type in the
      TC_SETUP_BLOCK handler and translate to the right SPAN trigger, with the
      right set of supported actions.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>