Commits · 9c270af37bb62e708e3e4415d653ce73e713df02 · nexedi / linux

18 Oct, 2017 3 commits

bpf: XDP_REDIRECT enable use of cpumap · 9c270af3

Jesper Dangaard Brouer authored Oct 16, 2017

This patch connects cpumap to the xdp_do_redirect_map infrastructure.

Still no SKB allocation are done yet.  The XDP frames are transferred
to the other CPU, but they are simply refcnt decremented on the remote
CPU.  This served as a good benchmark for measuring the overhead of
remote refcnt decrement.  If driver page recycle cache is not
efficient then this, exposes a bottleneck in the page allocator.

A shout-out to MST's ptr_ring, which is the secret behind is being so
efficient to transfer memory pointers between CPUs, without constantly
bouncing cache-lines between CPUs.

V3: Handle !CONFIG_BPF_SYSCALL pointed out by kbuild test robot.

V4: Make Generic-XDP aware of cpumap type, but don't allow redirect yet,
 as implementation require a separate upstream discussion.

V5:
 - Fix a maybe-uninitialized pointed out by kbuild test robot.
 - Restrict bpf-prog side access to cpumap, open when use-cases appear
 - Implement cpu_map_enqueue() as a more simple void pointer enqueue

V6:
 - Allow cpumap type for usage in helper bpf_redirect_map,
   general bpf-prog side restriction moved to earlier patch.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9c270af3

bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP · 6710e112

Jesper Dangaard Brouer authored Oct 16, 2017

The 'cpumap' is primarily used as a backend map for XDP BPF helper
call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.

This patch implement the main part of the map.  It is not connected to
the XDP redirect system yet, and no SKB allocation are done yet.

The main concern in this patch is to ensure the datapath can run
without any locking.  This adds complexity to the setup and tear-down
procedure, which assumptions are extra carefully documented in the
code comments.

V2:
 - make sure array isn't larger than NR_CPUS
 - make sure CPUs added is a valid possible CPU

V3: fix nitpicks from Jakub Kicinski <kubakici@wp.pl>

V5:
 - Restrict map allocation to root / CAP_SYS_ADMIN
 - WARN_ON_ONCE if queue is not empty on tear-down
 - Return -EPERM on memlock limit instead of -ENOMEM
 - Error code in __cpu_map_entry_alloc() also handle ptr_ring_cleanup()
 - Moved cpu_map_enqueue() to next patch

V6: all notice by Daniel Borkmann
 - Fix err return code in cpu_map_alloc() introduced in V5
 - Move cpu_possible() check after max_entries boundary check
 - Forbid usage initially in check_map_func_compatibility()

V7:
 - Fix alloc error path spotted by Daniel Borkmann
 - Did stress test adding+removing CPUs from the map concurrently
 - Fixed refcnt issue on cpu_map_entry, kthread started too soon
 - Make sure packets are flushed during tear-down, involved use of
   rcu_barrier() and kthread_run only exit after queue is empty
 - Fix alloc error path in __cpu_map_entry_alloc() for ptr_ring

V8:
 - Nitpicking comments and gramma by Edward Cree
 - Fix missing semi-colon introduced in V7 due to rebasing
 - Move struct bpf_cpu_map_entry members cpu+map_id to tracepoint patch
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

6710e112

net: ftgmac100: Request clock and set speed · 4b70c62b

Joel Stanley authored Oct 13, 2017

According to the ASPEED datasheet, gigabit speeds require a clock of
100MHz or higher. Other speeds require 25MHz or higher. This patch
configures a 100MHz clock if the system has a direct-attached
PHY, or 25MHz if the system is running NC-SI which is limited to 100MHz.

There appear to be no other upstream users of the FTGMAC100 driver it is
hard to know the clocking requirements of other platforms. Therefore a
conservative approach was taken with enabling clocks. If the platform is
not ASPEED, both requesting the clock and configuring the speed is
skipped.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Tested-by: Andrew Jeffery <andrew@aj.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

4b70c62b

17 Oct, 2017 1 commit

net: export netdev_txq_to_tc to allow sch_mqprio to compile as module · 8a5f2166

Henrik Austad authored Oct 17, 2017

In commit 32302902 ("mqprio: Reserve last 32 classid values for HW
traffic classes and misc IDs") sch_mqprio started using netdev_txq_to_tc
to find the correct tc instead of dev->tc_to_txq[]

However, when mqprio is compiled as a module, it cannot resolve the
symbol, leading to this error:

     ERROR: "netdev_txq_to_tc" [net/sched/sch_mqprio.ko] undefined!

This adds an EXPORT_SYMBOL() since the other user in the kernel
(netif_set_xps_queue) is also EXPORT_SYMBOL() (and not _GPL) or in a
sysfs-callback.

Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Henrik Austad <haustad@cisco.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8a5f2166

16 Oct, 2017 31 commits

Merge branch 'mlxsw-GRE-Offload-decap-without-encap' · e4467f2e

David S. Miller authored Oct 16, 2017

Jiri Pirko says:

====================
mlxsw: GRE: Offload decap without encap

Petr says:

The current code doesn't offload GRE decapsulation unless there's a
corresponding encapsulation route as well. While not strictly incorrect (when
encap route is absent, the decap route traps traffic to CPU and the kernel
handles it), it's a missed optimization opportunity.

With this patchset, IPIP entries are created as soon as offloadable tunneling
netdevice is created. This then leads to offloading of decap route, if one
exists, or is added afterwards, even when no encap route is present.

In Linux, when there is a decap route, matching IP-in-IP packets are always
decapsulated. However, with IPv4 overlays in particular, whether the inner
packet is then forwarded depends on setting of net.ipv4.conf.*.rp_filter. When
RP filtering is turned on, inner packets aren't forwarded unless there's a
matching encap route. The mlxsw driver doesn't reflect this behavior in other
router interfaces, and thus it's not implemented for tunnel types either. A
better support for this will be subject of follow-up work.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e4467f2e

mlxsw: spectrum: Drop refcounting of IPIP entries · 4cccb737

Petr Machata authored Oct 16, 2017

Formerly, IPIP entries were created lazily by next hops that referenced
an offloadable IP-in-IP netdevice. However now that they are created
eagerly as a reaction to events on such netdevices, the reference
counting is useless. Hence drop it.

The routes whose next hops reference an offloaded IP-in-IP netdevice
actually linger around a bit after their device is unregistered.
However, mlxsw_sp_ipip_entry_destroy() also destroys the backing
loopback, and mlxsw_sp_rif_destroy() transitively (via
mlxsw_sp_nexthop_rif_gone_sync()) calls mlxsw_sp_nexthop_ipip_fini(),
which unlinks the IPIP entry from a next hop. Thus no dangling pointers
are left behind for the brief window after netdevice is gone, but routes
not yet.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4cccb737

mlxsw: spectrum: Support IPIP overlay VRF migration · f63ce4e5

Petr Machata authored Oct 16, 2017

IPIP entries are created as soon as an offloadable device is created.
That means that when such a device is later moved to a different VRF,
the loopback device that backs the tunnel is wrong.

Thus when an offloadable encapsulating netdevice moves from one VRF to
another, make sure that the loopback is updated as necessary.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f63ce4e5

mlxsw: spectrum: Support decap-only IP-in-IP tunnels · 0063587d

Petr Machata authored Oct 16, 2017

Current code for offloading IP-in-IP tunneling assumes that there is no
decap without encap. But that's never true for IPv6 overlays, and is not
true for IPv4 ones either, if net.ipv4.conf.*.rp_filter is unset.

To support decap-only tunnels, an IPIP entry is now created as soon as
an offloadable tunneling device is created. When that netdevice is up'd,
a decap route is looked up and possibly offloaded. Thus decap is not
handled implicitly as part of mlxsw_sp_ipip_entry_get() call anymore,
but needs to be done explicitly after the get, if desired.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0063587d

mlxsw: spectrum_router: Move mlxsw_sp_netdev_ipip_type() · 6698c168

Petr Machata authored Oct 16, 2017

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6698c168

mlxsw: spectrum: Move netdevice NB to struct mlxsw_sp · c30f5d01

Petr Machata authored Oct 16, 2017

So far, all netdevice notifications that the driver cared about were
related to its own ports, and mlxsw_sp could be retrieved from the
netdevice's private data. For IP-in-IP offloading however, the driver
cares about events on foreign netdevices, and getting at mlxsw_sp or
router data structures from the handler is inconvenient.

Therefore move the netdevice notifier blocks from global scope to struct
mlxsw_sp to allow retrieval from the notifier block pointer itself.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c30f5d01

tipc: fix rebasing error · 36c0a9df

Jon Maloy authored Oct 16, 2017

In commit 2f487712 ("tipc: guarantee that group broadcast doesn't
bypass group unicast") there was introduced a last-minute rebasing
error that broke non-group communication.

We fix this here.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

36c0a9df

Merge branch 'net-core-rcuify-rtnl-af_ops' · 45880485

David S. Miller authored Oct 16, 2017

Florian Westphal says:

====================
net: core: rcuify rtnl af_ops

None of the rtnl af_ops callbacks sleep, so they can be called while
holding rcu read lock.

Switch handling of af_ops to rcu.

This would allow to later call af_ops functions without holding
the rtnl mutex anymore.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

45880485

net: core: rcu-ify rtnl af_ops · 5fa85a09

Florian Westphal authored Oct 16, 2017

rtnl af_ops currently rely on rtnl mutex: unregister (called from module
exit functions) takes the rtnl mutex and all users that do af_ops lookup
also take the rtnl mutex. IOW, parallel rmmod will block until doit()
callback is done.

As none of the af_ops implementation sleep we can use rcu instead.

doit functions that need the af_ops can now use rcu instead of the
rtnl mutex provided the mutex isn't needed for other reasons.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

5fa85a09

rtnetlink: place link af dump into own helper · 070cbf5b

Florian Westphal authored Oct 16, 2017

next patch will rcu-ify rtnl af_ops, i.e. allow af_ops
lookup and function calls with rcu read lock held instead
of rtnl mutex.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

070cbf5b

tcp: cdg: make struct tcp_cdg static · d85969f1

Colin Ian King authored Oct 16, 2017

The structure tcp_cdg is local to the source and
does not need to be in global scope, so make it static.

Cleans up sparse warning:
symbol 'tcp_cdg' was not declared. Should it be static?
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d85969f1

net: systemport: add NET_DSA dependency · 00fb3a7c

Arnd Bergmann authored Oct 16, 2017

The notifier cause a link error when NET_DSA is a loadable
module:

drivers/net/ethernet/broadcom/bcmsysport.o: In function `bcm_sysport_remove':
bcmsysport.c:(.text+0x1582): undefined reference to `unregister_dsa_notifier'
drivers/net/ethernet/broadcom/bcmsysport.o: In function `bcm_sysport_probe':
bcmsysport.c:(.text+0x278d): undefined reference to `register_dsa_notifier'

This adds a dependency that forces the systemport driver to be
a loadable module as well when that happens, but otherwise
allows it to be built normally when DSA is either built-in or
completely disabled.

Fixes: d1565763 ("net: systemport: Establish lower/upper queue mapping")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

00fb3a7c

hamradio: baycom_par: use new parport device model · 92c43fca

Sudip Mukherjee authored Oct 15, 2017

Modify baycom driver to use the new parallel port device model.
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

92c43fca

net: dccp: mark expected switch fall-throughs · bc28df6e

Gustavo A. R. Silva authored Oct 15, 2017

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Notice that for options.c file, I placed the "fall through" comment
on its own line, which is what GCC is expecting to find.
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bc28df6e

pch_gbe: Switch to new PCI IRQ allocation API · 2a600d97

Andy Shevchenko authored Oct 14, 2017

This removes custom flag handling.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2a600d97

tracing: bpf: Hide bpf trace events when they are not used · 9185a610

Steven Rostedt (VMware) authored Oct 12, 2017

All the trace events defined in include/trace/events/bpf.h are only
used when CONFIG_BPF_SYSCALL is defined. But this file gets included by
include/linux/bpf_trace.h which is included by the networking code with
CREATE_TRACE_POINTS defined.

If a trace event is created but not used it still has data structures
and functions created for its use, even though nothing is using them.
To not waste space, do not define the BPF trace events in bpf.h unless
CONFIG_BPF_SYSCALL is defined.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9185a610

ipv6: only update __use and lastusetime once per jiffy at most · 0da4af00

Wei Wang authored Oct 13, 2017

In order to not dirty the cacheline too often, we try to only update
dst->__use and dst->lastusetime at most once per jiffy.
As dst->lastusetime is only used by ipv6 garbage collector, it should
be good enough time resolution.
And __use is only used in ipv6_route_seq_show() to show how many times a
dst has been used. And as __use is not atomic_t right now, it does not
show the precise number of usage times anyway. So we think it should be
OK to only update it at most once per jiffy.

According to my latest syn flood test on a machine with intel Xeon 6th
gen processor and 2 10G mlx nics bonded together, each with 8 rx queues
on 2 NUMA nodes:
With this patch, the packet process rate increases from ~3.49Mpps to
~3.75Mpps with a 7% increase rate.

Note: dst_use() is being renamed to dst_hold_and_use() to better specify
the purpose of the function.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@googl.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0da4af00

ipv6: check fn before doing FIB6_SUBTREE(fn) · 0e80193b

Wei Wang authored Oct 13, 2017

In fib6_locate(), we need to first make sure fn is not NULL before doing
FIB6_SUBTREE(fn) to avoid crash.

This fixes the following static checker warning:
net/ipv6/ip6_fib.c:1462 fib6_locate()
         warn: variable dereferenced before check 'fn' (see line 1459)

net/ipv6/ip6_fib.c
  1458          if (src_len) {
  1459                  struct fib6_node *subtree = FIB6_SUBTREE(fn);
                                                    ^^^^^^^^^^^^^^^^
We shifted this dereference

  1460
  1461                  WARN_ON(saddr == NULL);
  1462                  if (fn && subtree)
                            ^^
before the check for NULL.

  1463                          fn = fib6_locate_1(subtree, saddr, src_len,
  1464                                             offsetof(struct rt6_info, rt6i_src)

Fixes: 66f5d6ce ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0e80193b

bpf: Add -target to clang switch while cross compiling. · 9db95838

Abhijit Ayarekar authored Oct 13, 2017

Update to llvm excludes assembly instructions.
llvm git revision is below

commit 65fad7c26569 ("bpf: add inline-asm support")

This change will be part of llvm  release 6.0

__ASM_SYSREG_H define is not required for native compile.
-target switch includes appropriate target specific files
while cross compiling

Tested on x86 and arm64.
Signed-off-by: Abhijit Ayarekar <abhijit.ayarekar@caviumnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

9db95838

Merge branch 'sched-tp_q-remove' · 745482e0

David S. Miller authored Oct 16, 2017

Jiri Pirko <jiri@mellanox.com>

====================
net: sched: remove some tp->q usage

In order to prepare for block sharing, tcf_proto instances need to be
independent on particular qdisc instances. This patchset takes care of
removal of couple occurrences of tp->q usage.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

745482e0