Commits · a842fe1425cb20f457abd3f8ef98b468f83ca98b · Kirill Smelkov / linux

12 Jun, 2019 19 commits

tcp: add optional per socket transmit delay · a842fe14

Eric Dumazet authored Jun 12, 2019

Adding delays to TCP flows is crucial for studying behavior
of TCP stacks, including congestion control modules.

Linux offers netem module, but it has unpractical constraints :
- Need root access to change qdisc
- Hard to setup on egress if combined with non trivial qdisc like FQ
- Single delay for all flows.

EDT (Earliest Departure Time) adoption in TCP stack allows us
to enable a per socket delay at a very small cost.

Networking tools can now establish thousands of flows, each of them
with a different delay, simulating real world conditions.

This requires FQ packet scheduler or a EDT-enabled NIC.

This patchs adds TCP_TX_DELAY socket option, to set a delay in
usec units.

  unsigned int tx_delay = 10000; /* 10 msec */

  setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));

Note that FQ packet scheduler limits might need some tweaking :

man tc-fq

PARAMETERS
   limit
       Hard  limit  on  the  real  queue  size. When this limit is
       reached, new packets are dropped. If the value is  lowered,
       packets  are  dropped so that the new limit is met. Default
       is 10000 packets.

   flow_limit
       Hard limit on the maximum  number  of  packets  queued  per
       flow.  Default value is 100.

Use of TCP_TX_DELAY option will increase number of skbs in FQ qdisc,
so packets would be dropped if any of the previous limit is hit.

Use of a jump label makes this support runtime-free, for hosts
never using the option.

Also note that TSQ (TCP Small Queues) limits are slightly changed
with this patch : we need to account that skbs artificially delayed
wont stop us providind more skbs to feed the pipe (netem uses
skb_orphan_partial() for this purpose, but FQ can not use this trick)

Because of that, using big delays might very well trigger
old bugs in TSO auto defer logic and/or sndbuf limited detection.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a842fe14

Merge branch 'ena-dynamic-queue-sizes' · e0ffbd37

David S. Miller authored Jun 12, 2019

Sameeh Jubran says:

====================
Support for dynamic queue size changes

This patchset introduces the following:
* add new admin command for supporting different queue size for Tx/Rx
* add support for Tx/Rx queues size modification through ethtool
* allow queues allocation backoff when low on memory
* update driver version

Difference from v2:
* Dropped superfluous range checks which are already done in ethtool. [patch 5/7]
* Dropped inline keyword from function. [patch 4/7]
* Added a new patch which drops inline keyword all *.c files. [patch 6/7]

Difference from v1:
* Changed ena_update_queue_sizes() signature to use u32 instead of int
  type for the size arguments. [patch 5/7]
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e0ffbd37

net: ena: update driver version from 2.0.3 to 2.1.0 · dbbc6e68

Sameeh Jubran authored Jun 11, 2019

Update driver version to match device specification.
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dbbc6e68

net: ena: remove inline keyword from functions in *.c · c2b54204

Sameeh Jubran authored Jun 11, 2019

Let the compiler decide if the function should be inline in *.c files
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c2b54204

net: ena: add ethtool function for changing io queue sizes · eece4d2a

Sameeh Jubran authored Jun 11, 2019

Implement the set_ringparam() function of the ethtool interface
to enable the changing of io queue sizes.
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

eece4d2a

net: ena: allow queue allocation backoff when low on memory · 13ca32a6

Sameeh Jubran authored Jun 11, 2019

If there is not enough memory to allocate io queues the driver will
try to allocate smaller queues.

The backoff algorithm is as follows:

1. Try to allocate TX and RX and if successful.
1.1. return success

2. Divide by 2 the size of the larger of RX and TX queues (or both if their size is the same).

3. If TX or RX is smaller than 256
3.1. return failure.
4. else
4.1. go back to 1.

Also change the tx_queue_size, rx_queue_size field names in struct
adapter to requested_tx_queue_size and requested_rx_queue_size, and
use RX and TX queue 0 for actual queue sizes.
Explanation:
The original fields were useless as they were simply used to assign
values once from them to each of the queues in the adapter in ena_probe().
They could simply be deleted. However now that we have a backoff
feature, we have use for them. In case of backoff there is a difference
between the requested queue sizes and the actual sizes. Therefore there
is a need to save the requested queue size for future retries of queue
allocation (for example if allocation failed and then ifdown + ifup was
called we want to start the allocation from the original requested size of
the queues).
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

13ca32a6

net: ena: make ethtool show correct current and max queue sizes · 9f9ae3f9

Sameeh Jubran authored Jun 11, 2019

Currently ethtool -g shows the same size for current and max queue
sizes.
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9f9ae3f9

net: ena: enable negotiating larger Rx ring size · 31aa9857

Sameeh Jubran authored Jun 11, 2019

Use MAX_QUEUES_EXT get feature capability to query the device.
Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

31aa9857

net: ena: add MAX_QUEUES_EXT get feature admin command · ba8ef506

Arthur Kiyanovski authored Jun 11, 2019

Add a new admin command to support different queue size for Tx/Rx
queues (the change also support different SQ/CQ sizes)
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ba8ef506

Merge branch 'dpaa2-eth-Add-support-for-MQPRIO-offloading' · f2dec9a2

David S. Miller authored Jun 12, 2019

Ioana Radulescu says:

====================
dpaa2-eth: Add support for MQPRIO offloading

Add support for adding multiple TX traffic classes with mqprio. We can have
up to one netdev queue and hardware frame queue per TC per core.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

f2dec9a2

dpaa2-eth: Add mqprio support · ab1e6de2

Ioana Radulescu authored Jun 11, 2019

Implement mqprio qdisc support by mapping traffic classes to
different hardware enqueue priorities. The maximum number of
supported traffic classes is an attribute of each DPNI object.

The traffic classes map to hardware priorities from highest (0)
to lowest (highest prio number). The skb priority information
received from the stack is used to select the hardware Tx queue
on which to enqueue the frame.
Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ab1e6de2

dpaa2-eth: Support multiple traffic classes on Tx · 15c87f6b

Ioana Radulescu authored Jun 11, 2019

DPNI objects can have multiple traffic classes, as reflected by
the num_tc attribute. Until now we ignored its value and only
used traffic class 0.

This patch adds support for multiple Tx traffic classes; we have
num_queues x num_tcs hardware queues available for each interface.
Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

15c87f6b

dpaa2-eth: Refactor xps code · 06d5b179

Ioana Radulescu authored Jun 11, 2019

Move the code configuring xps on the netdev TX queues to a
separate function. A subsequent patch will need to call
this in another context as well.
Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

06d5b179

net: ethernet: ti: cpts: fix build failure for powerpc · a41efedf

Grygorii Strashko authored Jun 11, 2019

Add dependency to TI CPTS from Common CLK framework COMMON_CLK to fix
allyesconfig build for Powerpc:

drivers/net/ethernet/ti/cpts.c: In function 'cpts_of_mux_clk_setup':
drivers/net/ethernet/ti/cpts.c:567:2: error: implicit declaration of function 'of_clk_parent_fill'; did you mean 'of_clk_get_parent_name'? [-Werror=implicit-function-declaration]
  of_clk_parent_fill(refclk_np, parent_names, num_parents);
  ^~~~~~~~~~~~~~~~~~
  of_clk_get_parent_name

Fixes: a3047a81 ("net: ethernet: ti: cpts: add support for ext rftclk selection")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a41efedf

net: dsa: Deal with non-existing PHY/fixed-link · 2131fba5

Florian Fainelli authored Jun 10, 2019

We need to specifically deal with phylink_of_phy_connect() returning
-ENODEV, because this can happen when a CPU/DSA port does connect
neither to a PHY, nor has a fixed-link property. This is a valid use
case that is permitted by the binding and indicates to the switch:
auto-configure port with maximum capabilities.

Fixes: 0e279218 ("net: dsa: Use PHYLINK for the CPU/DSA ports")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

2131fba5

net: dsa: mv88e6xxx: lock mutex in port_fdb_dump · fcf15367

Vivien Didelot authored Jun 12, 2019

During a port FDB dump operation, the mutex protecting the concurrent
access to the switch registers is currently held by the internal
mv88e6xxx_port_db_dump and mv88e6xxx_port_db_dump_fid helpers.

It must be held at the higher level in mv88e6xxx_port_fdb_dump which
is called directly by DSA through ds->ops->port_fdb_dump. Fix this.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fcf15367

dt-bindings: net: wiznet: add w5x00 support · 0114214e

Nicolas Saenz Julienne authored Jun 12, 2019

Add bindings for Wiznet's w5x00 series of SPI interfaced Ethernet chips.

Based on the bindings for microchip,enc28j60.
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

0114214e

net: ethernet: wiznet: w5X00 add device tree support · b9dd694e

Nicolas Saenz Julienne authored Jun 12, 2019

The w5X00 chip provides an SPI to Ethernet inteface. This patch allows
platform devices to be defined through the device tree.
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

b9dd694e

net: sched: ingress: set 'unlocked' flag for Qdisc ops · 7a096d57

Vlad Buslov authored Jun 12, 2019

To remove rtnl lock dependency in tc filter update API when using ingress
Qdisc, set QDISC_CLASS_OPS_DOIT_UNLOCKED flag in ingress Qdisc_class_ops.

Ingress Qdisc ops don't require any modifications to be used without rtnl
lock on tc filter update path. Ingress implementation never changes its
q->block and only releases it when Qdisc is being destroyed. This means it
is enough for RTM_{NEWTFILTER|DELTFILTER|GETTFILTER} message handlers to
hold ingress Qdisc reference while using it without relying on rtnl lock
protection. Unlocked Qdisc ops support is already implemented in filter
update path by unlocked cls API patch set.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7a096d57

11 Jun, 2019 17 commits

Merge branch 'tls-add-support-for-kernel-driven-resync-and-nfp-RX-offload' · 758a0a4d

David S. Miller authored Jun 11, 2019

Jakub Kicinski says:

====================
tls: add support for kernel-driven resync and nfp RX offload

This series adds TLS RX offload for NFP and completes the offload
by providing resync strategies.  When TLS data stream looses segments
or experiences reorder NIC can no longer perform in line offload.
Resyncs provide information about placement of records in the
stream so that offload can resume.

Existing TLS resync mechanisms are not a great fit for the NFP.
In particular the TX resync is hard to implement for packet-centric
NICs.  This patchset adds an ability to perform TX resync in a way
similar to the way initial sync is done - by calling down to the
driver when new record is created after driver indicated sync had
been lost.

Similarly on the RX side, we try to wait for a gap in the stream
and send record information for the next record.  This works very
well for RPC workloads which are the primary focus at this time.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

758a0a4d

nfp: tls: make use of kernel-driven TX resync · 9ed431c1

Jakub Kicinski authored Jun 10, 2019

When TCP stream gets out of sync (driver stops receiving skbs
with expected TCP sequence numbers) request a TX resync from
the kernel.

We try to distinguish retransmissions from missed transmissions
by comparing the sequence number to expected - if it's further
than the expected one - we probably missed packets.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9ed431c1

net/tls: add kernel-driven resync mechanism for TX · 50180074

Jakub Kicinski authored Jun 10, 2019

TLS offload drivers keep track of TCP seq numbers to make sure
the packets are fed into the HW in order.

When packets get dropped on the way through the stack, the driver
will get out of sync and have to use fallback encryption, but unless
TCP seq number is resynced it will never match the packets correctly
(or even worse - use incorrect record sequence number after TCP seq
wraps).

Existing drivers (mlx5) feed the entire record on every out-of-order
event, allowing FW/HW to always be in sync.

This patch adds an alternative, more akin to the RX resync.  When
driver sees a frame which is past its expected sequence number the
stream must have gotten out of order (if the sequence number is
smaller than expected its likely a retransmission which doesn't
require resync).  Driver will ask the stack to perform TX sync
before it submits the next full record, and fall back to software
crypto until stack has performed the sync.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

50180074

net/tls: generalize the resync callback · eeb2efaf

Jakub Kicinski authored Jun 10, 2019

Currently only RX direction is ever resynced, however, TX may
also get out of sequence if packets get dropped on the way to
the driver.  Rename the resync callback and add a direction
parameter.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

eeb2efaf

nfp: tls: enable TLS RX offload · c0a4948e

Jakub Kicinski authored Jun 10, 2019

Set ethtool TLS RX feature based on NIC capabilities, and enable
TLS RX when connections are added for decryption.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c0a4948e

nfp: tls: implement RX TLS resync · cad228a3

Dirk van der Merwe authored Jun 10, 2019

Enable kernel-controlled RX resync and propagate TLS connection
RX resync from kernel TLS to firmware.
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cad228a3

nfp: add async version of mailbox communication · e2c7114a

Jakub Kicinski authored Jun 10, 2019

Some control messages must be sent from atomic context.  The mailbox
takes sleeping locks and uses a waitqueue so add a "posted" version
of communication.

Trylock the semaphore and if that's successful kick of the device
communication.  The device communication will be completed from
a workqueue, which will also release the semaphore.

If locks are taken queue the message and return.  Schedule a
different workqueue to take the semaphore and run the communication.
Note that the there are currently no atomic users which would actually
need the return value, so all replies to posted messages are just
freed.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e2c7114a

nfp: rename nfp_ccm_mbox_alloc() · d7053e04

Jakub Kicinski authored Jun 10, 2019

We need the name nfp_ccm_mbox_alloc() for allocating the mailbox
communication channel itself.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d7053e04

nfp: tls: set skb decrypted flag · 5bcb5c7e

Dirk van der Merwe authored Jun 10, 2019

Firmware indicates when a packet has been decrypted by reusing the
currently unused BPF flag.  Transfer this information into the skb
and provide a statistic of all decrypted segments.
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5bcb5c7e

net/tls: add kernel-driven TLS RX resync · f953d33b

Jakub Kicinski authored Jun 10, 2019

TLS offload device may lose sync with the TCP stream if packets
arrive out of order.  Drivers can currently request a resync at
a specific TCP sequence number.  When a record is found starting
at that sequence number kernel will inform the device of the
corresponding record number.

This requires the device to constantly scan the stream for a
known pattern (constant bytes of the header) after sync is lost.

This patch adds an alternative approach which is entirely under
the control of the kernel.  Kernel tracks records it had to fully
decrypt, even though TLS socket is in TLS_HW mode.  If multiple
records did not have any decrypted parts - it's a pretty strong
indication that the device is out of sync.

We choose the min number of fully encrypted records to be 2,
which should hopefully be more than will get retransmitted at
a time.

After kernel decides the device is out of sync it schedules a
resync request.  If the TCP socket is empty the resync gets
performed immediately.  If socket is not empty we leave the
record parser to resync when next record comes.

Before resync in message parser we peek at the TCP socket and
don't attempt the sync if the socket already has some of the
next record queued.

On resync failure (encrypted data continues to flow in) we
retry with exponential backoff, up to once every 128 records
(with a 16k record thats at most once every 2M of data).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f953d33b

net/tls: rename handle_device_resync() · fe58a5a0

Jakub Kicinski authored Jun 10, 2019

handle_device_resync() doesn't describe the function very well.
The function checks if resync should be issued upon parsing of
a new record.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fe58a5a0

net/tls: pass record number as a byte array · 89fec474

Jakub Kicinski authored Jun 10, 2019

TLS offload code casts record number to a u64.  The buffer
should be aligned to 8 bytes, but its actually a __be64, and
the rest of the TLS code treats it as big int.  Make the
offload callbacks take a byte array, drivers can make the
choice to do the ugly cast if they want to.

Prepare for copying the record number onto the stack by
defining a constant for max size of the byte array.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

89fec474

net/tls: simplify seq calculation in handle_device_resync() · 49673739

Jakub Kicinski authored Jun 10, 2019

We subtract "TLS_HEADER_SIZE - 1" from req_seq, then if they
match we add the same constant to seq.  Just add it to seq,
and we don't have to touch req_seq.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

49673739

packet: remove unused variable 'status' in __packet_lookup_frame_in_block · 46088059

Mao Wenan authored Jun 11, 2019

The variable 'status' in  __packet_lookup_frame_in_block() is never used since
introduction in commit f6fb8f10 ("af-packet: TPACKET_V3 flexible buffer
implementation."), we can remove it.
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

46088059

net: openvswitch: remove unnecessary ASSERT_OVSL in ovs_vport_del() · f7a8fb1f

Taehee Yoo authored Jun 10, 2019

ASSERT_OVSL() in ovs_vport_del() is unnecessary because
ovs_vport_del() is only called by ovs_dp_detach_port() and
ovs_dp_detach_port() calls ASSERT_OVSL() too.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f7a8fb1f

net: netlink: make netlink_walk_start() void return type · abf9979f

Taehee Yoo authored Jun 10, 2019

netlink_walk_start() needed to return an error code because of
rhashtable_walk_init(). but that was converted to rhashtable_walk_enter()
and it is a void type function. so now netlink_walk_start() doesn't need
any return value.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

abf9979f

selftests: pmtu: Introduce list_flush_ipv6_exception test case · e28799e5

Stefano Brivio authored Jun 06, 2019

This test checks that route exceptions can be successfully listed and
flushed using ip -6 route {list,flush} cache.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e28799e5

10 Jun, 2019 4 commits

Merge branch 'net-Enable-nexthop-objects-with-IPv4-and-IPv6-routes' · 48debfd7

David S. Miller authored Jun 10, 2019

David Ahern says:

====================
net: Enable nexthop objects with IPv4 and IPv6 routes

This is the final set of the initial nexthop object work. When I
started this idea almost 2 years ago, it took 18 seconds to inject
700k+ IPv4 routes with 1 hop and about 28 seconds for 4-paths. Some
of that time was due to inefficiencies in 'ip', but most of it was
kernel side with excessive synchronize_rcu calls in ipv4, and redundant
processing validating a nexthop spec (device, gateway, encap). Worse,
the time increased dramatically as the number of legs in the routes
increased; for example, taking over 72 seconds for 16-path routes.

After this set, with increased dirty memory limits (fib_sync_mem sysctl),
an improved ip and nexthop objects a full internet fib (743,799 routes
based on a pull in January 2019) can be pushed to the kernel in 4.3
seconds. Even better, the time to insert is "almost" constant with
increasing number of paths. The 'almost constant' time is due to
expanding the nexthop definitions when generating notifications. A
follow on patch will be sent adding a sysctl that allows an admin to
avoid the nexthop expansion and truly get constant route insert time
regardless of the number of paths in a route! (Useful once all programs
used for a deployment that care about routes understand nexthop objects).

To be clear, 'ip' is used for benchmarking for no other reason than
'ip -batch' is a trivial to use for the tests. FRR, for example, better
manages nexthops and route changes and the way those are pushed to the
kernel and thus will have less userspace processing times than 'ip -batch'.

Patches 1-10 iterate over fib6_nh with a nexthop invoke a processing
function per fib6_nh. Prior to nexthop objects, a fib6_info referenced
a single fib6_nh. Multipath routes were added as separate fib6_info for
each leg of the route and linked as siblings:

    f6i -> sibling -> sibling ... -> sibling
     |                                   |
     +--------- multipath route ---------+

With nexthop objects a single fib6_info references an external
nexthop which may have a series of fib6_nh:

     f6i ---> nexthop ---> fib6_nh
                           ...
                           fib6_nh

making IPv6 routes similar to IPv4. The side effect is that a single
fib6_info now indirectly references a series of fib6_nh so the code
needs to walk each entry and call the local, per-fib6_nh processing
function.

Patches 11 and 13 wire up use of nexthops with fib entries for IPv4
and IPv6. With these commits you can actually use nexthops with routes.

Patch 12 is an optimization for IPv4 when using nexthops in the most
predominant use case (no metrics).

Patches 14 handles replace of a nexthop config.

Patches 15-18 add update pmtu and redirect tests to use both old and
new routing.

Patches 19 and 20 add new tests for the nexthop infrastructure. The first
is single nexthop is used by multiple prefixes to communicate with remote
hosts. This is on top of the functional tests already committed. The
second verifies multipath selection.

v4
- changed return to 'goto out' in patch 9 since the rcu_read_lock is
  held (noticed by Wei)

v3
- removed found arg in patch 7 and changed rt6_nh_remove_exception_rt
  to return 1 when a match is found for an exception

v2
- changed ++i to i++ in patches 1 and 14 as noticed by DaveM
- improved commit message for patch 14 (nexthop replace)
- removed the skip_fib argument to remove_nexthop; vestige of an
  older design
====================
Reviewed-By: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

48debfd7

selftests: Add version of router_multipath.sh using nexthop objects · cab14d10

David Ahern authored Jun 08, 2019

Add a version of router_multipath.sh that uses nexthop objects for
routes.

Ido requested a version that does not cause regressions with mlxsw
testing since it does not support nexthop objects yet.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cab14d10

selftests: Add test with multiple prefixes using single nexthop · 735ab2f6

David Ahern authored Jun 08, 2019

Add tests where multiple FIB entries use the same nexthop object. Generate
per-cpu cached routes for each by running ping on each cpu, and then
generate exceptions unique to each prefix (remote host) with different
mtus.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

735ab2f6

selftests: icmp_redirect: Add support for routing via nexthop objects · 622946d9

David Ahern authored Jun 08, 2019

Add a second pass to icmp_redirect.sh to use nexthop objects for
routes.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

622946d9