Commits · 150e8f8a1bae1dea75f4217f483ac95ca48c979a · nexedi / linux

12 Aug, 2019 1 commit

netdevsim: register couple of devlink params · 150e8f8a

Jiri Pirko authored Aug 09, 2019

Register couple of devlink params, one generic, one driver-specific.
Make the values available over debugfs.

Example:
$ echo "111" > /sys/bus/netdevsim/new_device
$ devlink dev param
netdevsim/netdevsim111:
  name max_macs type generic
    values:
      cmode driverinit value 32
  name test1 type driver-specific
    values:
      cmode driverinit value true
$ cat /sys/kernel/debug/netdevsim/netdevsim111/max_macs
32
$ cat /sys/kernel/debug/netdevsim/netdevsim111/test1
Y
$ devlink dev param set netdevsim/netdevsim111 name max_macs cmode driverinit value 16
$ devlink dev param set netdevsim/netdevsim111 name test1 cmode driverinit value false
$ devlink dev reload netdevsim/netdevsim111
$ cat /sys/kernel/debug/netdevsim/netdevsim111/max_macs
16
$ cat /sys/kernel/debug/netdevsim/netdevsim111/test1
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

150e8f8a

11 Aug, 2019 11 commits

Merge branch 'drop_monitor-Capture-dropped-packets-and-metadata' · 6e5ee483

David S. Miller authored Aug 11, 2019

Ido Schimmel says:

====================
drop_monitor: Capture dropped packets and metadata

So far drop monitor supported only one mode of operation in which a
summary of recent packet drops is periodically sent to user space as a
netlink event. The event only includes the drop location (program
counter) and number of drops in the last interval.

While this mode of operation allows one to understand if the system is
dropping packets, it is not sufficient if a more detailed analysis is
required. Both the packet itself and related metadata are missing.

This patchset extends drop monitor with another mode of operation where
the packet - potentially truncated - and metadata (e.g., drop location,
timestamp, netdev) are sent to user space as a netlink event. Thanks to
the extensible nature of netlink, more metadata can be added in the
future.

To avoid performing expensive operations in the context in which
kfree_skb() is called, the dropped skbs are cloned and queued on per-CPU
skb drop list. The list is then processed in process context (using a
workqueue), where the netlink messages are allocated, prepared and
finally sent to user space.

A follow-up patchset will integrate drop monitor with devlink and allow
the latter to call into drop monitor to report hardware drops. In the
future, XDP drops can be added as well, thereby making drop monitor the
go-to netlink channel for diagnosing all packet drops.

Example usage with patched dropwatch [1] can be found here [2]. Example
dissection of drop monitor netlink events with patched wireshark [3] can
be found here [4]. I will submit both changes upstream after the kernel
changes are accepted. Another change worth making is adding a dropmon
pseudo interface to libpcap, similar to the nflog interface [5]. This
will allow users to specifically listen on dropmon traffic instead of
capturing all netlink packets via the nlmon netdev.

Patches #1-#5 prepare the code towards the actual changes in later
patches.

Patch #6 adds another mode of operation to drop monitor in which the
dropped packet itself is notified to user space along with metadata.

Patch #7 allows users to truncate reported packets to a specific length,
in case only the headers are of interest. The original length of the
packet is added as metadata to the netlink notification.

Patch #8 allows user to query the current configuration of drop monitor
(e.g., alert mode, truncation length).

Patches #9-#10 allow users to tune the length of the per-CPU skb drop
list according to their needs.

Changes since v1 [6]:
* Add skb protocol as metadata. This allows user space to correctly
  dissect the packet instead of blindly assuming it is an Ethernet
  packet

Changes since RFC [7]:
* Limit the length of the per-CPU skb drop list and make it configurable
* Do not use the hysteresis timer in packet alert mode
* Introduce alert mode operations in a separate patch and only then
  introduce the new alert mode
* Use 'skb->skb_iif' instead of 'skb->dev' because the latter is inside
  a union with 'dev_scratch' and therefore not guaranteed to point to a
  valid netdev
* Return '-EBUSY' instead of '-EOPNOTSUPP' when trying to configure drop
  monitor while it is monitoring
* Did not change schedule_work() in favor of schedule_work_on() as I did
  not observe a change in number of tail drops

[1] https://github.com/idosch/dropwatch/tree/packet-mode
[2] https://gist.github.com/idosch/3d524b887e16bc11b4b19e25c23dcc23#file-gistfile1-txt
[3] https://github.com/idosch/wireshark/tree/drop-monitor-v2
[4] https://gist.github.com/idosch/3d524b887e16bc11b4b19e25c23dcc23#file-gistfile2-txt
[5] https://github.com/the-tcpdump-group/libpcap/blob/master/pcap-netfilter-linux.c
[6] https://patchwork.ozlabs.org/cover/1143443/
[7] https://patchwork.ozlabs.org/cover/1135226/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6e5ee483

drop_monitor: Expose tail drop counter · e9feb580

Ido Schimmel authored Aug 11, 2019

Previous patch made the length of the per-CPU skb drop list
configurable. Expose a counter that shows how many packets could not be
enqueued to this list.

This allows users determine the desired queue length.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e9feb580

drop_monitor: Make drop queue length configurable · 30328d46

Ido Schimmel authored Aug 11, 2019

In packet alert mode, each CPU holds a list of dropped skbs that need to
be processed in process context and sent to user space. To avoid
exhausting the system's memory the maximum length of this queue is
currently set to 1000.

Allow users to tune the length of this queue according to their needs.
The configured length is reported to user space when drop monitor
configuration is queried.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

30328d46

drop_monitor: Add a command to query current configuration · 444be061

Ido Schimmel authored Aug 11, 2019

Users should be able to query the current configuration of drop monitor
before they start using it. Add a command to query the existing
configuration which currently consists of alert mode and packet
truncation length.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

444be061

drop_monitor: Allow truncation of dropped packets · 57986617

Ido Schimmel authored Aug 11, 2019

When sending dropped packets to user space it is not always necessary to
copy the entire packet as usually only the headers are of interest.

Allow user to specify the truncation length and add the original length
of the packet as additional metadata to the netlink message.

By default no truncation is performed.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

57986617

drop_monitor: Add packet alert mode · ca30707d

Ido Schimmel authored Aug 11, 2019

So far drop monitor supported only one alert mode in which a summary of
locations in which packets were recently dropped was sent to user space.

This alert mode is sufficient in order to understand that packets were
dropped, but lacks information to perform a more detailed analysis.

Add a new alert mode in which the dropped packet itself is passed to
user space along with metadata: The drop location (as program counter
and resolved symbol), ingress netdevice and drop timestamp. More
metadata can be added in the future.

To avoid performing expensive operations in the context in which
kfree_skb() is invoked (can be hard IRQ), the dropped skb is cloned and
queued on per-CPU skb drop list. Then, in process context the netlink
message is allocated, prepared and finally sent to user space.

The per-CPU skb drop list is limited to 1000 skbs to prevent exhausting
the system's memory. Subsequent patches will make this limit
configurable and also add a counter that indicates how many skbs were
tail dropped.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ca30707d

drop_monitor: Add alert mode operations · 28315f79

Ido Schimmel authored Aug 11, 2019

The next patch is going to add another alert mode in which the dropped
packet is notified to user space, instead of only a summary of recent
drops.

Abstract the differences between the modes by adding alert mode
operations. The operations are selected based on the currently
configured mode and associated with the probes and the work item just
before tracing starts.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

28315f79

drop_monitor: Require CAP_NET_ADMIN for drop monitor configuration · c5ab9b1c

Ido Schimmel authored Aug 11, 2019

Currently, the configure command does not do anything but return an
error. Subsequent patches will enable the command to change various
configuration options such as alert mode and packet truncation.

Similar to other netlink-based configuration channels, make sure only
users with the CAP_NET_ADMIN capability set can execute this command.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c5ab9b1c

drop_monitor: Reset per-CPU data before starting to trace · 44075f56

Ido Schimmel authored Aug 11, 2019

The function reset_per_cpu_data() allocates and prepares a new skb for
the summary netlink alert message ('NET_DM_CMD_ALERT'). The new skb is
stored in the per-CPU 'data' variable and the old is returned.

The function is invoked during module initialization and from the
workqueue, before an alert is sent. This means that it is possible to
receive an alert with stale data, if we stopped tracing when the
hysteresis timer ('data->send_timer') was pending.

Instead of invoking the function during module initialization, invoke it
just before we start tracing and ensure we get a fresh skb.

This also allows us to remove the calls to initialize the timer and the
work item from the module initialization path, since both could have
been triggered by the error paths of reset_per_cpu_data().
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

44075f56

drop_monitor: Initialize timer and work item upon tracing enable · 70c69274

Ido Schimmel authored Aug 11, 2019

The timer and work item are currently initialized once during module
init, but subsequent patches will need to associate different functions
with the work item, based on the configured alert mode.

Allow subsequent patches to make that change by initializing and
de-initializing these objects during tracing enable and disable.

This also guarantees that once the request to disable tracing returns,
no more netlink notifications will be generated.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

70c69274

drop_monitor: Split tracing enable / disable to different functions · 7c747838

Ido Schimmel authored Aug 11, 2019

Subsequent patches will need to enable / disable tracing based on the
configured alerting mode.

Reduce the nesting level and prepare for the introduction of this
functionality by splitting the tracing enable / disable operations into
two different functions.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7c747838

10 Aug, 2019 21 commits

Merge branch 'Networking-driver-debugfs-cleanups' · 2cc2743d

David S. Miller authored Aug 10, 2019

Greg Kroah-Hartman says:

====================
Networking driver debugfs cleanups

There is no need to test the result of any debugfs call anymore.  The
debugfs core warns the user if something fails, and the return value of
a debugfs call can always be fed back into another debugfs call with no
problems.

Also, debugfs is for debugging, so if there are problems with debugfs
(i.e. the system is out of memory) the rest of the kernel should not
change behavior, so testing for debugfs calls is pointless and not the
goal of debugfs at all.

This series cleans up a lot of networking drivers and some wimax code
that was calling debugfs and trying to do something with the return
value that it didn't need to.  Removing this logic makes the code
smaller, easier to understand, and use less run-time memory in some
cases, all good things.

The series is against net-next, and have no dependancies between any of
them if they want to go through any random tree/order.  Or, if wanted,
I can take them through my driver-core tree where other debugfs cleanups
are being slowly fed during major merge windows.

v3: fix build warning in i2400m, I thought I had caught them all :(
    add acks from some reviewers

v2: fix up build warnings, it's as if I never even built these.  Ugh, so
    sorry for wasting people's time with the v1 series.  I need to stop
    relying on 0-day as it isn't working well anymore :(
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

2cc2743d

ieee802154: no need to check return value of debugfs_create functions · 7e174a49

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Alexander Aring <alex.aring@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Harry Morris <h.morris@cascoda.com>
Cc: linux-wpan@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Michael Hennerich <michael.hennerich@analog.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

7e174a49

ixgbe: no need to check return value of debugfs_create functions · 35dc61eb

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: intel-wired-lan@lists.osuosl.org
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

35dc61eb

i40e: no need to check return value of debugfs_create functions · 43c4eb03

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: intel-wired-lan@lists.osuosl.org
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

43c4eb03

fm10k: no need to check return value of debugfs_create functions · ecc55707

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: intel-wired-lan@lists.osuosl.org
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ecc55707

mvpp2: no need to check return value of debugfs_create functions · e6882aa6

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Maxime Chevallier <maxime.chevallier@bootlin.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nathan Huckleberry <nhuck@google.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

e6882aa6

skge: no need to check return value of debugfs_create functions · 2f62f8e6

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2f62f8e6

qca: no need to check return value of debugfs_create functions · 687236b0

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stefan Wahren <stefan.wahren@i2se.com>
Cc: Michael Heimpold <michael.heimpold@i2se.com>
Cc: Yangtao Li <tiny.windzz@gmail.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

687236b0

dpaa2: no need to check return value of debugfs_create functions · 92aff5b4

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Because we don't care about the individual files, we can remove the
stored dentry for the files, as they are not needed to be kept track of
at all.

Cc: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

92aff5b4

stmmac: no need to check return value of debugfs_create functions · 8d72ab11

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Because we don't care about the individual files, we can remove the
stored dentry for the files, as they are not needed to be kept track of
at all.

Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: Jose Abreu <joabreu@synopsys.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: netdev@vger.kernel.org
Cc: linux-stm32@st-md-mailman.stormreply.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

8d72ab11

nfp: no need to check return value of debugfs_create functions · 16e9b481

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Edwin Peer <edwin.peer@netronome.com>
Cc: Yangtao Li <tiny.windzz@gmail.com>
Cc: Simon Horman <simon.horman@netronome.com>
Cc: oss-drivers@netronome.com
Cc: netdev@vger.kernel.org
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

16e9b481

hns3: no need to check return value of debugfs_create functions · 11ab11e6

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
Cc: Salil Mehta <salil.mehta@huawei.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

11ab11e6

cxgb4: no need to check return value of debugfs_create functions · 9dac1e8e

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

If a debugfs call fails, it will properly warn in the syslog, there's no
need for all individual drivers to also print a message, so that is one
more reason to not care about checking the return values.

Cc: Vishal Kulkarni <vishal@chelsio.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Casey Leedom <leedom@chelsio.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9dac1e8e

bnxt: no need to check return value of debugfs_create functions · 3a131e85

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

This cleans up a lot of unneeded code and logic around the debugfs
files, making all of this much simpler and easier to understand.

Cc: Michael Chan <michael.chan@broadcom.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

3a131e85

xgbe: no need to check return value of debugfs_create functions · 9e3926df

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

This cleans up a lot of unneeded code and logic around the debugfs
files, making all of this much simpler and easier to understand.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9e3926df

mlx5: no need to check return value of debugfs_create functions · 9f818c8a

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

This cleans up a lot of unneeded code and logic around the debugfs
files, making all of this much simpler and easier to understand as we
don't need to keep the dentries saved anymore.

Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9f818c8a

bonding: no need to print a message if debugfs_create_dir() fails · fedcc6da

Greg Kroah-Hartman authored Aug 10, 2019

The debugfs core now will print a message if this function fails, so
don't duplicate that logic.  Also, no need to change the code logic if
the call fails either, as no debugfs calls should interrupt normal
kernel code for any reason.

Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

fedcc6da

wimax: no need to check return value of debugfs_create functions · a62052ba

Greg Kroah-Hartman authored Aug 10, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

This cleans up a lot of unneeded code and logic around the debugfs wimax
files, making all of this much simpler and easier to understand.

Cc: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>
Cc: linux-wimax@intel.com
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a62052ba

Merge tag 'mlx5-updates-2019-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 38b9e0f6

David S. Miller authored Aug 09, 2019

Saeed Mahameed says:

====================
mlx5-updates-2019-08-09

This series includes update to mlx5 ethernet and core driver:

In first #11 patches, Vlad submits part 2 of 3 part series to allow
TC flow handling for concurrent execution.

1) TC flow handling for concurrent execution (part 2)

Vald Says:
==========

Refactor data structures that are shared between flows in tc.
Currently, all cls API hardware offloads driver callbacks require caller
to hold rtnl lock when calling them. Cls API has already been updated to
update software filters in parallel (on classifiers that support
unlocked execution), however hardware offloads code still obtains rtnl
lock before calling driver tc callbacks. This set implements support for
unlocked execution of tc hairpin, mod_hdr and encap subsystem. The
changed implemented in these subsystems are very similar in general.

The main difference is that hairpin is accessed through mlx5e_tc_table
(legacy mode), mod_hdr is accessed through both mlx5e_tc_table and
mlx5_esw_offload (legacy and switchdev modes) and encap is only accessed
through mlx5_esw_offload (switchdev mode).

1.1) Hairpin handling and structure mlx5e_hairpin_entry refactored in
following way:

- Hairpin structure is extended with atomic reference counter. This
  approach allows to lookup of hairpin entry and obtain reference to it
  with hairpin_tbl_lock protection and then continue using the entry
  unlocked (including provisioning to hardware).

- To support unlocked provisioning of hairpin entry to hardware, the entry
  is extended with 'res_ready' completion and is inserted to hairpin_tbl
  before calling the firmware. With this approach any concurrent users that
  attempt to use the same hairpin entry wait for completion first to
  prevent access to entries that are not fully initialized.

- Hairpin entry is extended with new flows_lock spinlock to protect the
  list when multiple concurrent tc instances update flows attached to
  the same hairpin entry.

1.2) Modify header handling code and structure mlx5e_mod_hdr_entry
are refactored in the following way:

- Mod_hdr structure is extended with atomic reference counter. This
  approach allows to lookup of mod_hdr entry and obtain reference to it
  with mod_hdr_tbl_lock protection and then continue using the entry
  unlocked (including provisioning to hardware).

- To support unlocked provisioning of mod_hdr entry to hardware, the entry
  is extended with 'res_ready' completion and is inserted to mod_hdr_tbl
  before calling the firmware. With this approach any concurrent users that
  attempt to use the same mod_hdr entry wait for completion first to
  prevent access to entries that are not fully initialized.

- Mod_Hdr entry is extended with new flows_lock spinlock to protect the
  list when multiple concurrent tc instances update flows attached to
  the same mod_hdr entry.

1.3) Encapsulation handling code and Structure mlx5e_encap_entry
are refactored in the following way:

- encap structure is extended with atomic reference counter. This
  approach allows to lookup of encap entry and obtain reference to it
  with encap_tbl_lock protection and then continue using the entry
  unlocked (including provisioning to hardware).

- To support unlocked provisioning of encap entry to hardware, the entry is
  extended with 'res_ready' completion and is inserted to encap_tbl before
  calling the firmware. With this approach any concurrent users that
  attempt to use the same encap entry wait for completion first to prevent
  access to entries that are not fully initialized.

- As a difference from approach used to refactor hairpin and mod_hdr,
  encap entry is not extended with any per-entry fine-grained lock.
  Instead, encap_table_lock is used to synchronize all operations on
  encap table and instances of mlx5e_encap_entry. This is necessary
  because single flow can be attached to multiple encap entries
  simultaneously. During new flow creation or neigh update event all of
  encaps that flow is attached to must be accessed together as in atomic
  manner, which makes usage of per-entry lock infeasible.

- Encap entry is extended with new flows_lock spinlock to protect the
  list when multiple concurrent tc instances update flows attached to
  the same encap entry.

==========

3) Parav improves the way port representors report their parent ID and
port index.

4) Use refcount_t for refcount in vxlan data base from  Chuhong Yuan
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

38b9e0f6

tc-testing: added tdc tests for matchall filter · 62ad42ec

Roman Mashak authored Aug 09, 2019

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

62ad42ec

selftests: Fix detection of nettest command in fcnal-test · f887427b

David Ahern authored Aug 09, 2019

Most of the tests run by fcnal-test.sh relies on the nettest command.
Rather than trying to cover all of the individual tests, check for the
binary only at the beginning.

Also removes the need for log_error which is undefined.

Fixes: 6f9d5cac ("selftests: Setup for functional tests for fib and socket lookups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f887427b

09 Aug, 2019 7 commits

net/mlx5e: Use refcount_t for refcount · b51c225e

Chuhong Yuan authored Aug 03, 2019

refcount_t is better for reference counters since its
implementation can prevent overflows.
So convert atomic_t ref counters to refcount_t.
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

b51c225e

net/mlx5e: Use vhca_id in generating representor port_index · c938451f

Parav Pandit authored Jul 20, 2019

It is desired to use unique port indices when multiple pci devices'
devlink instance have the same switch-id.

Make use of vhca-id to generate such unique devlink port indices.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

c938451f

net/mlx5e: Simplify querying port representor parent id · 724ee179

Parav Pandit authored Jul 26, 2019

System image GUID doesn't depend on eswitch switchdev mode.

Hence, remove the check which simplifies the code.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Vu Pham <vuhuong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

724ee179

net/mlx5: E-switch, Removed unused hwid · ef2e4094

Parav Pandit authored Jul 26, 2019

Currently mlx5_eswitch_rep stores same hw ID for all representors.
However it is never used from this structure.
It is always used from mlx5_vport.

Hence, remove unused field.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Vu Pham <vuhuong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

ef2e4094

net/mlx5e: Allow concurrent creation of encap entries · d589e785

Vlad Buslov authored Aug 08, 2019

Encap entries creation is fully synchronized by encap_tbl_lock. In order to
allow concurrent allocation of hardware resources used to offload
encapsulation, extend mlx5e_encap_entry with 'res_ready' completion. Move
call to mlx5e_tc_tun_create_header_ipv{4|6}() out of encap_tbl_lock
critical section. Modify code that attaches new flows to existing encap to
wait for 'res_ready' completion before using the entry. Insert encap entry
to table before provisioning it to hardware and modify all users of the
encap table to verify that encap was fully initialized by checking
completion result for non-zero value (and to wait for 'res_ready'
completion, if necessary).
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

d589e785

net/mlx5e: Protect encap hash table with mutex · 61086f39

Vlad Buslov authored Aug 02, 2019

To remove dependency on rtnl lock, protect encap hash table from concurrent
modifications with new "encap_tbl_lock" mutex. Use the mutex to protect
internal encap entry state from concurrent modification. This is necessary
because a flow can be attached to multiple encap entries simultaneously,
which significantly complicates using finer grained per-entry lock.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

61086f39

net/mlx5e: Extend encap entry with reference counter · 948993f2

Vlad Buslov authored Jun 03, 2018

List of flows attached to encap entry is used as implicit reference
counter (encap entry is deallocated when list becomes free) and as a
mechanism to obtain encap entry that flow is attached to (through list
head). This is not safe when concurrent modification of list of flows
attached to encap entry is possible. Proper atomic reference counter is
required to support concurrent access.

As a preparation for extending encap with reference counting, extract code
that lookups and deletes encap entry into standalone put/get helpers. In
order to remove this dependency on external locking, extend encap entry
with reference counter to manage its lifetime and extend flow structure
with direct pointer to encap entry that flow is attached to.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

948993f2