Commits · 8c873e2199700c2de7dbd5eedb9d90d5f109462b · Kirill Smelkov / linux

08 Jan, 2018 32 commits

netfilter: core: free hooks with call_rcu · 8c873e21

Florian Westphal authored Dec 01, 2017

Giuseppe Scrivano says:
  "SELinux, if enabled, registers for each new network namespace 6
    netfilter hooks."

Cost for this is high.  With synchronize_net() removed:
   "The net benefit on an SMP machine with two cores is that creating a
   new network namespace takes -40% of the original time."

This patch replaces synchronize_net+kvfree with call_rcu().
We store rcu_head at the tail of a structure that has no fixed layout,
i.e. we cannot use offsetof() to compute the start of the original
allocation.  Thus store this information right after the rcu head.

We could simplify this by just placing the rcu_head at the start
of struct nf_hook_entries.  However, this structure is used in
packet processing hotpath, so only place what is needed for that
at the beginning of the struct.
Reported-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

8c873e21

netfilter: core: remove synchronize_net call if nfqueue is used · 26888dfd

Florian Westphal authored Dec 01, 2017

since commit 960632ec ("netfilter: convert hook list to an array")
nfqueue no longer stores a pointer to the hook that caused the packet
to be queued.  Therefore no extra synchronize_net() call is needed after
dropping the packets enqueued by the old rule blob.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

26888dfd

netfilter: core: make nf_unregister_net_hooks simple wrapper again · 4e645b47

Florian Westphal authored Dec 01, 2017

This reverts commit d3ad2c17
("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls").

Nothing wrong with it.  However, followup patch will delay freeing of hooks
with call_rcu, so all synchronize_net() calls become obsolete and there
is no need anymore for this batching.

This revert causes a temporary performance degradation when destroying
network namespace, but its resolved with the upcoming call_rcu conversion.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

4e645b47

netfilter: nf_conntrack_h323: Remove unwanted comments. · ca9b0147

Varsha Rao authored Nov 30, 2017

Change old multi-line comment style to kernel comment style and
remove unwanted comments.
Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

ca9b0147

netfilter: ipset: add resched points during set listing · a778a15f

Florian Westphal authored Nov 30, 2017

When sets are extremely large we can get softlockup during ipset -L.
We could fix this by adding cond_resched_rcu() at the right location
during iteration, but this only works if RCU nesting depth is 1.

At this time entire variant->list() is called under under rcu_read_lock_bh.
This used to be a read_lock_bh() but as rcu doesn't really lock anything,
it does not appear to be needed, so remove it (ipset increments set
reference count before this, so a set deletion should not be possible).
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

a778a15f

netfilter: ipset: use nfnl_mutex_is_locked · 49971b88

Florian Westphal authored Nov 30, 2017

Check that we really hold nfnl mutex here instead of relying on correct
usage alone.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

49971b88

netfilter: ipvs: Remove useless ipvsh param of frag_safe_skb_hp · 6b3d9330

Gao Feng authored Nov 13, 2017

The param of frag_safe_skb_hp, ipvsh, isn't used now. So remove it and
update the callers' codes too.
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

6b3d9330

netfilter: conntrack: timeouts can be const · 2c9e8637

Florian Westphal authored Nov 12, 2017

Nowadays this is just the default template that is used when setting up
the net namespace, so nothing writes to these locations.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2c9e8637

netfilter: mark expected switch fall-throughs · e8542dce

Gustavo A. R. Silva authored Nov 07, 2017

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

e8542dce

netfilter: conntrack: l4 protocol trackers can be const · 9dae47ab

Florian Westphal authored Nov 07, 2017

previous patches removed all writes to these structs so we can
now mark them as const.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

9dae47ab

netfilter: conntrack: constify list of builtin trackers · cd9ceafc
Florian Westphal authored Nov 07, 2017
```
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
```
cd9ceafc

netfilter: conntrack: remove nlattr_size pointer from l4proto trackers · 39215846

Florian Westphal authored Nov 07, 2017

similar to previous commit, but instead compute this at compile time
and turn nlattr_size into an u16.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

39215846

Merge branch 'ipv6-ipv4-nexthop-align' · f66faae2

David S. Miller authored Jan 07, 2018

Ido Schimmel says:

====================
ipv6: Align nexthop behaviour with IPv4

This set tries to eliminate some differences between IPv4's and IPv6's
treatment of nexthops. These differences are most likely a side effect
of IPv6's data structures (specifically 'rt6_info') that incorporate
both the route and the nexthop and the late addition of ECMP support in
commit 51ebd318 ("ipv6: add support of equal cost multipath
(ECMP)").

IPv4 and IPv6 do not react the same to certain netdev events. For
example, upon carrier change affected IPv4 nexthops are marked using the
RTNH_F_LINKDOWN flag and the nexthop group is rebalanced accordingly.
IPv6 on the other hand, does nothing which forces us to perform a
carrier check during route lookup and dump. This makes it difficult to
introduce features such as non-equal-cost multipath that are built on
top of this set [1].

In addition, when a netdev is put administratively down IPv4 nexthops
are marked using the RTNH_F_DEAD flag, whereas IPv6 simply flushes all
the routes using these nexthops. To be consistent with IPv4, multipath
routes should only be flushed when all nexthops in the group are
considered dead.

The first 12 patches introduce non-functional changes that store the
RTNH_F_DEAD and RTNH_F_LINKDOWN flags in IPv6 routes based on netdev
events, in a similar fashion to IPv4. This allows us to remove the
carrier check performed during route lookup and dump.

The next three patches make sure we only flush a multipath route when
all of its nexthops are dead.

Last three patches add test cases for IPv4/IPv6 FIB. These verify that
both address families react similarly to netdev events.

Finally, this series also serves as a good first step towards David
Ahern's goal of treating nexthops as standalone objects [2], as it makes
the code more in line with IPv4 where the nexthop and the nexthop group
are separate objects from the route itself.

1. https://github.com/idosch/linux/tree/ipv6-nexthops
2. http://vger.kernel.org/netconf2017_files/nexthop-objects.pdf

Changes since RFC (feedback from David Ahern):
* Remove redundant declaration of rt6_ifdown() in patch 4 and adjust
comment referencing it accordingly
* Drop patch to flush multipath routes upon NETDEV_UNREGISTER. Reword
cover letter accordingly
* Use a temporary variable to make code more readable in patch 15
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

f66faae2

selftests: fib_tests: Add test cases for netdev carrier change · 82e45b6f

Ido Schimmel authored Jan 07, 2018

Check that IPv4 and IPv6 react the same when the carrier of a netdev is
toggled. Local routes should not be affected by this, whereas unicast
routes should.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

82e45b6f

selftests: fib_tests: Add test cases for netdev down · 5adb7683

Ido Schimmel authored Jan 07, 2018

Check that IPv4 and IPv6 react the same when a netdev is being put
administratively down.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5adb7683

selftests: fib_tests: Add test cases for IPv4/IPv6 FIB · 607bd2e5

Ido Schimmel authored Jan 07, 2018

Add test cases to check that IPv4 and IPv6 react to a netdev being
unregistered as expected.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

607bd2e5

ipv6: Flush multipath routes when all siblings are dead · 1de178ed

Ido Schimmel authored Jan 07, 2018

By default, IPv6 deletes nexthops from a multipath route when the
nexthop device is put administratively down. This differs from IPv4
where the nexthops are kept, but marked with the RTNH_F_DEAD flag. A
multipath route is flushed when all of its nexthops become dead.

Align IPv6 with IPv4 and have it conform to the same guidelines.

In case the multipath route needs to be flushed, its siblings are
flushed one by one. Otherwise, the nexthops are marked with the
appropriate flags and the tree walker is instructed to skip all the
siblings.

As explained in previous patches, care is taken to update the sernum of
the affected tree nodes, so as to prevent the use of wrong dst entries.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1de178ed

ipv6: Take table lock outside of sernum update function · 922c2ac8

Ido Schimmel authored Jan 07, 2018

The next patch is going to allow dead routes to remain in the FIB tree
in certain situations.

When this happens we need to be sure to bump the sernum of the nodes
where these are stored so that potential copies cached in sockets are
invalidated.

The function that performs this update assumes the table lock is not
taken when it is invoked, but that will not be the case when it is
invoked by the tree walker.

Have the function assume the lock is taken and make the single caller
take the lock itself.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

922c2ac8

ipv6: Export sernum update function · 4a8e56ee

Ido Schimmel authored Jan 07, 2018

We are going to allow dead routes to stay in the FIB tree (e.g., when
they are part of a multipath route, directly connected route with no
carrier) and revive them when their nexthop device gains carrier or when
it is put administratively up.

This is equivalent to the addition of the route to the FIB tree and we
should therefore take care of updating the sernum of all the parent
nodes of the node where the route is stored. Otherwise, we risk sockets
caching and using sub-optimal dst entries.

Export the function that performs the above, so that it could be invoked
from fib6_ifup() later on.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4a8e56ee

ipv6: Teach tree walker to skip multipath routes · b5cb5a75

Ido Schimmel authored Jan 07, 2018

As explained in previous patch, fib6_ifdown() needs to consider the
state of all the sibling routes when a multipath route is traversed.

This is done by evaluating all the siblings when the first sibling in a
multipath route is traversed. If the multipath route does not need to be
flushed (e.g., not all siblings are dead), then we should just skip the
multipath route as our work is done.

Have the tree walker jump to the last sibling when it is determined that
the multipath route needs to be skipped.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b5cb5a75

ipv6: Add explicit flush indication to routes · a2c554d3

Ido Schimmel authored Jan 07, 2018

When routes that are a part of a multipath route are evaluated by
fib6_ifdown() in response to NETDEV_DOWN and NETDEV_UNREGISTER events
the state of their sibling routes is not considered.

This will change in subsequent patches in order to align IPv6 with
IPv4's behavior. For example, when the last sibling in a multipath route
becomes dead, the entire multipath route needs to be removed.

To prevent the tree walker from re-evaluating all the sibling routes
each time, we can simply evaluate them once - when the first sibling is
traversed.

If we determine the entire multipath route needs to be removed, then the
'should_flush' bit is set in all the siblings, which will cause the
walker to flush them when it traverses them.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a2c554d3

ipv6: Report dead flag during route dump · f9d882ea

Ido Schimmel authored Jan 07, 2018

Up until now the RTNH_F_DEAD flag was only reported in route dump when
the 'ignore_routes_with_linkdown' sysctl was set. This is expected as
dead routes were flushed otherwise.

The reliance on this sysctl is going to be removed, so we need to report
the flag regardless of the sysctl's value.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f9d882ea

ipv6: Ignore dead routes during lookup · 8067bb8c

Ido Schimmel authored Jan 07, 2018

Currently, dead routes are only present in the routing tables in case
the 'ignore_routes_with_linkdown' sysctl is set. Otherwise, they are
flushed.

Subsequent patches are going to remove the reliance on this sysctl and
make IPv6 more consistent with IPv4.

Before this is done, we need to make sure dead routes are skipped during
route lookup, so as to not cause packet loss.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8067bb8c

ipv6: Check nexthop flags in route dump instead of carrier · 44c9f2f2

Ido Schimmel authored Jan 07, 2018

Similar to previous patch, there is no need to check for the carrier of
the nexthop device when dumping the route and we can instead check for
the presence of the RTNH_F_LINKDOWN flag.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

44c9f2f2

ipv6: Check nexthop flags during route lookup instead of carrier · 14c5206c

Ido Schimmel authored Jan 07, 2018

Now that the RTNH_F_LINKDOWN flag is set in nexthops, we can avoid the
need to dereference the nexthop device and check its carrier and instead
check for the presence of the flag.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

14c5206c

ipv6: Set nexthop flags during route creation · 5609b80a

Ido Schimmel authored Jan 07, 2018

It is valid to install routes with a nexthop device that does not have a
carrier, so we need to make sure they're marked accordingly.

As explained in the previous patch, host and anycast routes are never
marked with the 'linkdown' flag.

Note that reject routes are unaffected, as these use the loopback device
which always has a carrier.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5609b80a

ipv6: Set nexthop flags upon carrier change · 27c6fa73

Ido Schimmel authored Jan 07, 2018

Similar to IPv4, when the carrier of a netdev changes we should toggle
the 'linkdown' flag on all the nexthops using it as their nexthop
device.

This will later allow us to test for the presence of this flag during
route lookup and dump.

Up until commit 4832c30d ("net: ipv6: put host and anycast routes on
device with address") host and anycast routes used the loopback netdev
as their nexthop device and thus were not marked with the 'linkdown'
flag. The patch preserves this behavior and allows one to ping the local
address even when the nexthop device does not have a carrier and the
'ignore_routes_with_linkdown' sysctl is set.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

27c6fa73

ipv6: Prepare to handle multiple netdev events · 4c981e28

Ido Schimmel authored Jan 07, 2018

To make IPv6 more in line with IPv4 we need to be able to respond
differently to different netdev events. For example, when a netdev is
unregistered all the routes using it as their nexthop device should be
flushed, whereas when the netdev's carrier changes only the 'linkdown'
flag should be toggled.

Currently, this is not possible, as the function that traverses the
routing tables is not aware of the triggering event.

Propagate the triggering event down, so that it could be used in later
patches.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4c981e28

ipv6: Clear nexthop flags upon netdev up · 2127d95a

Ido Schimmel authored Jan 07, 2018

Previous patch marked nexthops with the 'dead' and 'linkdown' flags.
Clear these flags when the netdev comes back up.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2127d95a

ipv6: Mark dead nexthops with appropriate flags · 2b241361

Ido Schimmel authored Jan 07, 2018

When a netdev is put administratively down or unregistered all the
nexthops using it as their nexthop device should be marked with the
'dead' and 'linkdown' flags.

Currently, when a route is dumped its nexthop device is tested and the
flags are set accordingly. A similar check is performed during route
lookup.

Instead, we can simply mark the nexthops based on netdev events and
avoid checking the netdev's state during route dump and lookup.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2b241361

ipv6: Remove redundant route flushing during namespace dismantle · 9fcb0714

Ido Schimmel authored Jan 07, 2018

By the time fib6_net_exit() is executed all the netdevs in the namespace
have been either unregistered or pushed back to the default namespace.
That is because pernet subsys operations are always ordered before
pernet device operations and therefore invoked after them during
namespace dismantle.

Thus, all the routing tables in the namespace are empty by the time
fib6_net_exit() is invoked and the call to rt6_ifdown() can be removed.

This allows us to simplify the condition in fib6_ifdown() as it's only
ever called with an actual netdev.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9fcb0714

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 7f0b8000

David S. Miller authored Jan 07, 2018

Daniel Borkmann says:

====================
pull-request: bpf-next 2018-01-07

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add a start of a framework for extending struct xdp_buff without
   having the overhead of populating every data at runtime. Idea
   is to have a new per-queue struct xdp_rxq_info that holds read
   mostly data (currently that is, queue number and a pointer to
   the corresponding netdev) which is set up during rxqueue config
   time. When a XDP program is invoked, struct xdp_buff holds a
   pointer to struct xdp_rxq_info that the BPF program can then
   walk. The user facing BPF program that uses struct xdp_md for
   context can use these members directly, and the verifier rewrites
   context access transparently by walking the xdp_rxq_info and
   net_device pointers to load the data, from Jesper.

2) Redo the reporting of offload device information to user space
   such that it works in combination with network namespaces. The
   latter is reported through a device/inode tuple as similarly
   done in other subsystems as well (e.g. perf) in order to identify
   the namespace. For this to work, ns_get_path() has been generalized
   such that the namespace can be retrieved not only from a specific
   task (perf case), but also from a callback where we deduce the
   netns (ns_common) from a netdevice. bpftool support using the new
   uapi info and extensive test cases for test_offload.py in BPF
   selftests have been added as well, from Jakub.

3) Add two bpftool improvements: i) properly report the bpftool
   version such that it corresponds to the version from the kernel
   source tree. So pick the right linux/version.h from the source
   tree instead of the installed one. ii) fix bpftool and also
   bpf_jit_disasm build with bintutils >= 2.9. The reason for the
   build breakage is that binutils library changed the function
   signature to select the disassembler. Given this is needed in
   multiple tools, add a proper feature detection to the
   tools/build/features infrastructure, from Roman.

4) Implement the BPF syscall command BPF_MAP_GET_NEXT_KEY for the
   stacktrace map. It is currently unimplemented, but there are
   use cases where user space needs to walk all stacktrace map
   entries e.g. for dumping or deleting map entries w/o having to
   close and recreate the map. Add BPF selftests along with it,
   from Yonghong.

5) Few follow-up cleanups for the bpftool cgroup code: i) rename
   the cgroup 'list' command into 'show' as we have it for other
   subcommands as well, ii) then alias the 'show' command such that
   'list' is accepted which is also common practice in iproute2,
   and iii) remove couple of newlines from error messages using
   p_err(), from Jakub.

6) Two follow-up cleanups to sockmap code: i) remove the unused
   bpf_compute_data_end_sk_skb() function and ii) only build the
   sockmap infrastructure when CONFIG_INET is enabled since it's
   only aware of TCP sockets at this time, from John.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

7f0b8000

06 Jan, 2018 3 commits

Merge branch 'bpf-stacktrace-map-next-key-support' · 9be99bad

Daniel Borkmann authored Jan 06, 2018

Yonghong Song says:

====================
The patch set implements bpf syscall command BPF_MAP_GET_NEXT_KEY
for stacktrace map. Patch #1 is the core implementation
and Patch #2 implements a bpf test at tools/testing/selftests/bpf
directory. Please see individual patch comments for details.

Changelog:
  v1 -> v2:
   - For invalid key (key pointer is non-NULL), sets next_key to be the first valid key.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

9be99bad

tools/bpf: add a bpf selftest for stacktrace · 3ced9b60

Yonghong Song authored Jan 04, 2018

Added a bpf selftest in test_progs at tools directory for stacktrace.
The test will populate a hashtable map and a stacktrace map
at the same time with the same key, stackid.
The user space will compare both maps, using BPF_MAP_LOOKUP_ELEM
command and BPF_MAP_GET_NEXT_KEY command, to ensure that both have
the same set of keys.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

3ced9b60

bpf: implement syscall command BPF_MAP_GET_NEXT_KEY for stacktrace map · 16f07c55

Yonghong Song authored Jan 04, 2018

Currently, bpf syscall command BPF_MAP_GET_NEXT_KEY is not
supported for stacktrace map. However, there are use cases where
user space wants to enumerate all stacktrace map entries where
BPF_MAP_GET_NEXT_KEY command will be really helpful.
In addition, if user space wants to delete all map entries
in order to save memory and does not want to close the
map file descriptor, BPF_MAP_GET_NEXT_KEY may help improve
performance if map entries are sparsely populated.

The implementation has similar behavior for
BPF_MAP_GET_NEXT_KEY implementation in hashtab. If user provides
a NULL key pointer or an invalid key, the first key is returned.
Otherwise, the first valid key after the input parameter "key"
is returned, or -ENOENT if no valid key can be found.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

16f07c55

05 Jan, 2018 5 commits

Merge branch 'xdp_rxq_info' · 11d16edb

Alexei Starovoitov authored Jan 05, 2018

Jesper Dangaard Brouer says:

====================
V4:
* Added reviewers/acks to patches
* Fix patch desc in i40e that got out-of-sync with code
* Add SPDX license headers for the two new files added in patch 14

V3:
* Fixed bug in virtio_net driver
* Removed export of xdp_rxq_info_init()

V2:
* Changed API exposed to drivers
  - Removed invocation of "init" in drivers, and only call "reg"
    (Suggested by Saeed)
  - Allow "reg" to fail and handle this in drivers
    (Suggested by David Ahern)
* Removed the SINKQ qtype, instead allow to register as "unused"
* Also fixed some drivers during testing on actual HW (noted in patches)

There is a need for XDP to know more about the RX-queue a given XDP
frames have arrived on.  For both the XDP bpf-prog and kernel side.

Instead of extending struct xdp_buff each time new info is needed,
this patchset takes a different approach.  Struct xdp_buff is only
extended with a pointer to a struct xdp_rxq_info (allowing for easier
extending this later).  This xdp_rxq_info contains information related
to how the driver have setup the individual RX-queue's.  This is
read-mostly information, and all xdp_buff frames (in drivers
napi_poll) point to the same xdp_rxq_info (per RX-queue).

We stress this data/cache-line is for read-mostly info.  This is NOT
for dynamic per packet info, use the data_meta for such use-cases.

This patchset start out small, and only expose ingress_ifindex and the
RX-queue index to the XDP/BPF program. Access to tangible info like
the ingress ifindex and RX queue index, is fairly easy to comprehent.
The other future use-cases could allow XDP frames to be recycled back
to the originating device driver, by providing info on RX device and
queue number.

As XDP doesn't have driver feature flags, and eBPF code due to
bpf-tail-calls cannot determine that XDP driver invoke it, this
patchset have to update every driver that support XDP.

For driver developers (review individual driver patches!):

The xdp_rxq_info is tied to the drivers RX-ring(s). Whenever a RX-ring
modification require (temporary) stopping RX frames, then the
xdp_rxq_info should (likely) also be unregistred and re-registered,
especially if reallocating the pages in the ring. Make sure ethtool
set_channels does the right thing. When replacing XDP prog, if and
only if RX-ring need to be changed, then also re-register the
xdp_rxq_info.

I'm Cc'ing the individual driver patches to the registered maintainers.

Testing:

I've only tested the NIC drivers I have hardware for.  The general
test procedure is to (DUT = Device Under Test):
 (1) run pktgen script pktgen_sample04_many_flows.sh       (against DUT)
 (2) run samples/bpf program xdp_rxq_info --dev $DEV       (on DUT)
 (3) runtime modify number of NIC queues via ethtool -L    (on DUT)
 (4) runtime modify number of NIC ring-size via ethtool -G (on DUT)

Patch based on git tree bpf-next (at commit fb982666):
 https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

11d16edb

samples/bpf: program demonstrating access to xdp_rxq_info · 0fca931a

Jesper Dangaard Brouer authored Jan 03, 2018

This sample program can be used for monitoring and reporting how many
packets per sec (pps) are received per NIC RX queue index and which
CPU processed the packet. In itself it is a useful tool for quickly
identifying RSS imbalance issues, see below.

The default XDP action is XDP_PASS in-order to provide a monitor
mode. For benchmarking purposes it is possible to specify other XDP
actions on the cmdline --action.

Output below shows an imbalance RSS case where most RXQ's deliver to
CPU-0 while CPU-2 only get packets from a single RXQ.  Looking at
things from a CPU level the two CPUs are processing approx the same
amount, BUT looking at the rx_queue_index levels it is clear that
RXQ-2 receive much better service, than other RXQs which all share CPU-0.

Running XDP on dev:i40e1 (ifindex:3) action:XDP_PASS
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       900,473     0
XDP-RX CPU      2       906,921     0
XDP-RX CPU      total   1,807,395

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   180,098     0
rx_queue_index    0:sum 180,098
rx_queue_index    1:0   180,098     0
rx_queue_index    1:sum 180,098
rx_queue_index    2:2   906,921     0
rx_queue_index    2:sum 906,921
rx_queue_index    3:0   180,098     0
rx_queue_index    3:sum 180,098
rx_queue_index    4:0   180,082     0
rx_queue_index    4:sum 180,082
rx_queue_index    5:0   180,093     0
rx_queue_index    5:sum 180,093
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

0fca931a

bpf: finally expose xdp_rxq_info to XDP bpf-programs · 02dd3291

Jesper Dangaard Brouer authored Jan 03, 2018

Now all XDP driver have been updated to setup xdp_rxq_info and assign
this to xdp_buff->rxq.  Thus, it is now safe to enable access to some
of the xdp_rxq_info struct members.

This patch extend xdp_md and expose UAPI to userspace for
ingress_ifindex and rx_queue_index.  Access happens via bpf
instruction rewrite, that load data directly from struct xdp_rxq_info.

* ingress_ifindex map to xdp_rxq_info->dev->ifindex
* rx_queue_index  map to xdp_rxq_info->queue_index
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

02dd3291

xdp: generic XDP handling of xdp_rxq_info · e817f856

Jesper Dangaard Brouer authored Jan 03, 2018

Hook points for xdp_rxq_info:
 * reg  : netif_alloc_rx_queues
 * unreg: netif_free_rx_queues

The net_device have some members (num_rx_queues + real_num_rx_queues)
and data-area (dev->_rx with struct netdev_rx_queue's) that were
primarily used for exporting information about RPS (CONFIG_RPS) queues
to sysfs (CONFIG_SYSFS).

For generic XDP extend struct netdev_rx_queue with the xdp_rxq_info,
and remove some of the CONFIG_SYSFS ifdefs.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

e817f856

virtio_net: setup xdp_rxq_info · 754b8a21

Jesper Dangaard Brouer authored Jan 03, 2018

The virtio_net driver doesn't dynamically change the RX-ring queue
layout and backing pages, but instead reject XDP setup if all the
conditions for XDP is not meet.  Thus, the xdp_rxq_info also remains
fairly static.  This allow us to simply add the reg/unreg to
net_device open/close functions.

Driver hook points for xdp_rxq_info:
 * reg  : virtnet_open
 * unreg: virtnet_close

V3:
 - bugfix, also setup xdp.rxq in receive_mergeable()
 - Tested bpf-sample prog inside guest on a virtio_net device

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

754b8a21