Commits · 0de22baabc97f7fc05e31d82cec8049947946887 · Kirill Smelkov / linux

20 Sep, 2018 10 commits

netfilter: nf_tables: use rhashtable_walk_enter instead of rhashtable_walk_init · 0de22baa

Taehee Yoo authored Sep 14, 2018

rhashtable_walk_init() is deprecated and rhashtable_walk_enter() can be
used instead. rhashtable_walk_init() is wrapper function of
rhashtable_walk_enter() so that logic is actually same.
But rhashtable_walk_enter() doesn't return error hence error path
code can be removed.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

0de22baa

netfilter: nat: remove duplicate skb_is_nonlinear() in __nf_nat_mangle_tcp_packet() · f8b0a3ab

Taehee Yoo authored Sep 12, 2018

__nf_nat_mangle_tcp_packet() and nf_nat_mangle_udp_packet() call
mangle_contents(). and __nf_nat_mangle_tcp_packet()
and mangle_contents() call skb_is_nonlinear(). so that
skb_is_nonlinear() in __nf_nat_mangle_tcp_packet() is unnecessary.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

f8b0a3ab

netfilter: conntrack: clamp l4proto array size at largers supported protocol · 93185c80

Florian Westphal authored Sep 12, 2018

All higher l4proto numbers are handled by the generic tracker; the
l4proto lookup function already returns generic one in case the l4proto
number exceeds max size.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

93185c80

netfilter: conntrack: remove l3->l4 mapping information · dd2934a9

Florian Westphal authored Sep 17, 2018

l4 protocols are demuxed by l3num, l4num pair.

However, almost all l4 trackers are l3 agnostic.

Only exceptions are:
 - gre, icmp (ipv4 only)
 - icmpv6 (ipv6 only)

This commit gets rid of the l3 mapping, l4 trackers can now be looked up
by their IPPROTO_XXX value alone, which gets rid of the additional l3
indirection.

For icmp, ipcmp6 and gre, add a check on state->pf and
return -NF_ACCEPT in case we're asked to track e.g. icmpv6-in-ipv4,
this seems more fitting than using the generic tracker.

Additionally we can kill the 2nd l4proto definitions that were needed
for v4/v6 split -- they are now the same so we can use single l4proto
struct for each protocol, rather than two.

The EXPORT_SYMBOLs can be removed as all these object files are
part of nf_conntrack with no external references.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

dd2934a9

netfilter: conntrack: remove unused proto arg from netns init functions · ca2ca6e1

Florian Westphal authored Sep 12, 2018

Its unused, next patch will remove l4proto->l3proto number to simplify
l4 protocol demuxer lookup.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

ca2ca6e1

netfilter: conntrack: remove error callback and handle icmp from core · 6fe78fa4

Florian Westphal authored Sep 12, 2018

icmp(v6) are the only two layer four protocols that need the error()
callback (to handle icmp errors that are related to an established
connections, e.g. packet too big, port unreachable and the like).

Remove the error callback and handle these two special cases from the core.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

6fe78fa4

netfilter: conntrack: avoid using ->error callback if possible · 0150ffba

Florian Westphal authored Sep 12, 2018

The error() handler gets called before allocating or looking up a
connection tracking entry.

We can instead use direct calls from the ->packet() handlers which get
invoked for every packet anyway.

Only exceptions are icmp and icmpv6, these two special cases will be
handled in the next patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

0150ffba

netfilter: conntrack: deconstify packet callback skb pointer · 83d213fd

Florian Westphal authored Sep 12, 2018

Only two protocols need the ->error() function: icmp and icmpv6.
This is because icmp error mssages might be RELATED to an existing
connection (e.g. PMTUD, port unreachable and the like), and their
->error() handlers do this.

The error callback is already optional, so remove it for
udp and call them from ->packet() instead.

As the error() callback can call checksum functions that write to
skb->csum*, the const qualifier has to be removed as well.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

83d213fd

netfilter: conntrack: remove the l4proto->new() function · 9976fc6e

Florian Westphal authored Sep 12, 2018

->new() gets invoked after ->error() and before ->packet() if
a conntrack lookup has found no result for the tuple.

We can fold it into ->packet() -- the packet() implementations
can check if the conntrack is confirmed (new) or not
(already in hash).

If its unconfirmed, the conntrack isn't in the hash yet so current
skb created a new conntrack entry.

Only relevant side effect -- if packet() doesn't return NF_ACCEPT
but -NF_ACCEPT (or drop), while the conntrack was just created,
then the newly allocated conntrack is freed right away, rather than not
created in the first place.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

9976fc6e

netfilter: conntrack: pass nf_hook_state to packet and error handlers · 93e66024

Florian Westphal authored Sep 12, 2018

nf_hook_state contains all the hook meta-information: netns, protocol family,
hook location, and so on.

Instead of only passing selected information, pass a pointer to entire
structure.

This will allow to merge the error and the packet handlers and remove
the ->new() function in followup patches.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

93e66024

17 Sep, 2018 13 commits

netfilter: nat: remove unnecessary rcu_read_lock in nf_nat_redirect_ipv{4/6} · c8204cab

Taehee Yoo authored Sep 12, 2018

nf_nat_redirect_ipv4() and nf_nat_redirect_ipv6() are only called by
netfilter hook point. so that rcu_read_lock and rcu_read_unlock() are
unnecessary.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

c8204cab

netfilter: cttimeout: remove superfluous check on layer 4 netlink functions · 4430b897

Pablo Neira Ayuso authored Sep 11, 2018

We assume they are always set accordingly since a874752a
("netfilter: conntrack: timeout interface depend on
CONFIG_NF_CONNTRACK_TIMEOUT"), so we can get rid of this checks.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

4430b897

netfilter: nf_nat_ipv4: remove obsolete EXPORT_SYMBOL · 7052ba40

Florian Westphal authored Sep 07, 2018

There are no external callers anymore, previous change just
forgot to also remove the EXPORT_SYMBOL().

Fixes: 9971a514 ("netfilter: nf_nat: add nat type hooks to nat core")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

7052ba40

netfilter: xtables: avoid BUG_ON · 70c0eb1c

Florian Westphal authored Sep 04, 2018

I see no reason for them, label or timer cannot be NULL, and if they
were, we'll crash with null deref anyway.

For skb_header_pointer failure, just set hotdrop to true and toss
such packet.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

70c0eb1c

netfilter: nf_tables: avoid BUG_ON usage · fa5950e4

Florian Westphal authored Sep 04, 2018

None of these spots really needs to crash the kernel.
In one two cases we can jsut report error to userspace, in the other
cases we can just use WARN_ON (and leak memory instead).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

fa5950e4

netfilter: xt_cgroup: shrink size of v2 path · 0d704967

Pablo Neira Ayuso authored Sep 04, 2018

cgroup v2 path field is PATH_MAX which is too large, this is placing too
much pressure on memory allocation for people with many rules doing
cgroup v1 classid matching, side effects of this are bug reports like:

https://bugzilla.kernel.org/show_bug.cgi?id=200639

This patch registers a new revision that shrinks the cgroup path to 512
bytes, which is the same approach we follow in similar extensions that
have a path field.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Tejun Heo <tj@kernel.org>

0d704967

netfilter: ctnetlink: Support L3 protocol-filter on flush · 59c08c69

Kristian Evensen authored Sep 03, 2018

The same connection mark can be set on flows belonging to different
address families. This commit adds support for filtering on the L3
protocol when flushing connection track entries. If no protocol is
specified, then all L3 protocols match.

In order to avoid code duplication and a redundant check, the protocol
comparison in ctnetlink_dump_table() has been removed. Instead, a filter
is created if the GET-message triggering the dump contains an address
family. ctnetlink_filter_match() is then used to compare the L3
protocols.
Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

59c08c69

netfilter: nf_tables: add xfrm expression · 6c472602

Florian Westphal authored Sep 03, 2018

supports fetching saddr/daddr of tunnel mode states, request id and spi.
If direction is 'in', use inbound skb secpath, else dst->xfrm.

Joint work with Máté Eckl.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

6c472602

netfilter: remove obsolete need_conntrack stub · 2953d80f

Florian Westphal authored Aug 31, 2018

as of a0ae2562 ("netfilter: conntrack: remove l3proto
abstraction") there are no users anymore.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2953d80f

netfilter: nf_tables: asynchronous release · 0935d558

Florian Westphal authored Aug 29, 2018

Release the committed transaction log from a work queue, moving
expensive synchronize_rcu out of the locked section and providing
opportunity to batch this.

On my test machine this cuts runtime of nft-test.py in half.
Based on earlier patch from Pablo Neira Ayuso.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

0935d558

netfilter: nf_tables: warn when expr implements only one of activate/deactivate · 0ef235c7

Florian Westphal authored Aug 30, 2018

->destroy is only allowed to free data, or do other cleanups that do not
have side effects on other state, such as visibility to other netlink
requests.

Such things need to be done in ->deactivate.
As a transaction can fail, we need to make sure we can undo such
operations, therefore ->activate() has to be provided too.

So print a warning and refuse registration if expr->ops provides
only one of the two operations.

v2: fix nft_expr_check_ops to not repeat same check twice (Jones Desougi)
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

0ef235c7

netfilter: nf_tables: split set destruction in deactivate and destroy phase · cd5125d8

Florian Westphal authored Aug 29, 2018

Splits unbind_set into destroy_set and unbinding operation.

Unbinding removes set from lists (so new transaction would not
find it anymore) but keeps memory allocated (so packet path continues
to work).

Rebind function is added to allow unrolling in case transaction
that wants to remove set is aborted.

Destroy function is added to free the memory, but this could occur
outside of transaction in the future.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

cd5125d8

netfilter: nf_tables: rt: allow checking if dst has xfrm attached · 02b408fa

Florian Westphal authored Aug 29, 2018

Useful e.g. to avoid NATting inner headers of to-be-encrypted packets.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

02b408fa

16 Sep, 2018 6 commits

ip6_gre: simplify gre header parsing in ip6gre_err · a82738ad

Haishuang Yan authored Sep 14, 2018

Same as ip_gre, use gre_parse_header to parse gre header in gre error
handler code.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a82738ad

ip_gre: fix parsing gre header in ipgre_err · b0350d51

Haishuang Yan authored Sep 14, 2018

gre_parse_header stops parsing when csum_err is encountered, which means
tpi->key is undefined and ip_tunnel_lookup will return NULL improperly.

This patch introduce a NULL pointer as csum_err parameter. Even when
csum_err is encountered, it won't return error and continue parsing gre
header as expected.

Fixes: 9f57c67c ("gre: Remove support for sharing GRE protocol hook.")
Reported-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b0350d51

net: phy: et011c: Remove incorrect PHY_POLL flags · 21e65923

Florian Fainelli authored Sep 13, 2018

PHY_POLL is defined as -1 which means that we would be setting all flags of the
PHY driver, this is also not a valid flag to tell PHYLIB about, just remove it.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

21e65923

Merge branch 'act_police-lockless-data-path' · 50676de4

David S. Miller authored Sep 16, 2018

Davide Caratti says:

====================
net/sched: act_police: lockless data path

the data path of 'police' action can be faster if we avoid using spinlocks:
 - patch 1 converts act_police to use per-cpu counters
 - patch 2 lets act_police use RCU to access its configuration data.

test procedure (using pktgen from https://github.com/netoptimizer):
 # ip link add name eth1 type dummy
 # ip link set dev eth1 up
 # tc qdisc add dev eth1 clsact
 # tc filter add dev eth1 egress matchall action police \
 > rate 2gbit burst 100k conform-exceed pass/pass index 100
 # for c in 1 2 4; do
 > ./pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64 -t $c -n 5000000 -i eth1
 > done

test results (avg. pps/thread):

  $c | before patch |  after patch | improvement
 ----+--------------+--------------+-------------
   1 |      3518448 |      3591240 |  irrelevant
   2 |      3070065 |      3383393 |         10%
   4 |      1540969 |      3238385 |        110%
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

50676de4

net/sched: act_police: don't use spinlock in the data path · 2d550dba

Davide Caratti authored Sep 13, 2018

use RCU instead of spinlocks, to protect concurrent read/write on
act_police configuration. This reduces the effects of contention in the
data path, in case multiple readers are present.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2d550dba

net/sched: act_police: use per-cpu counters · 93be42f9

Davide Caratti authored Sep 13, 2018

use per-CPU counters, instead of sharing a single set of stats with all
cores. This removes the need of using spinlock when statistics are read
or updated.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

93be42f9

14 Sep, 2018 4 commits

cxgb4: update supported DCB version · c3ec8bcc

Ganesh Goudar authored Sep 14, 2018

- In CXGB4_DCB_STATE_FW_INCOMPLETE state check if the dcb
  version is changed and update the dcb supported version.

- Also, fill the priority code point value for priority
  based flow control.
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c3ec8bcc

cxgb4: add per rx-queue counter for packet errors · 992bea8e

Ganesh Goudar authored Sep 14, 2018

print per rx-queue packet errors in sge_qinfo
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

992bea8e

cxgb4: Fix endianness issue in t4_fwcache() · 0dc235af

Ganesh Goudar authored Sep 14, 2018

Do not put host-endian 0 or 1 into big endian feild.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0dc235af

net: move definition of pcpu_lstats to header file · 52bb6677

Li RongQing authored Sep 14, 2018

pcpu_lstats is defined in several files, so unify them as one
and move to header file
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

52bb6677

13 Sep, 2018 7 commits

net/ibm/emac: Remove VLA usage · ee4fccbe

Kees Cook authored Sep 13, 2018

In the quest to remove all stack VLA usage from the kernel[1], this
removes the VLA used for the emac xaht registers size. Since the size
of registers can only ever be 4 or 8, as detected in emac_init_config(),
the max can be hardcoded and a runtime test added for robustness.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Christian Lamparter <chunkeey@gmail.com>
Cc: Ivan Mikhaylov <ivan@de.ibm.com>
Cc: netdev@vger.kernel.org
Co-developed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee4fccbe

pktgen: Fix fall-through annotation · f91845da

Gustavo A. R. Silva authored Sep 13, 2018

Replace "fallthru" with a proper "fall through" annotation.

This fix is part of the ongoing efforts to enabling
-Wimplicit-fallthrough
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f91845da

tg3: Fix fall-through annotations · 310fc051

Gustavo A. R. Silva authored Sep 13, 2018

Replace "fallthru" with a proper "fall through" annotation.

This fix is part of the ongoing efforts to enabling
-Wimplicit-fallthrough
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

310fc051

gso_segment: Reset skb->mac_len after modifying network header · 50c12f74

Toke Høiland-Jørgensen authored Sep 13, 2018

When splitting a GSO segment that consists of encapsulated packets, the
skb->mac_len of the segments can end up being set wrong, causing packet
drops in particular when using act_mirred and ifb interfaces in
combination with a qdisc that splits GSO packets.

This happens because at the time skb_segment() is called, network_header
will point to the inner header, throwing off the calculation in
skb_reset_mac_len(). The network_header is subsequently adjust by the
outer IP gso_segment handlers, but they don't set the mac_len.

Fix this by adding skb_reset_mac_len() calls to both the IPv4 and IPv6
gso_segment handlers, after they modify the network_header.

Many thanks to Eric Dumazet for his help in identifying the cause of
the bug.
Acked-by: Dave Taht <dave.taht@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>

50c12f74

vxlan: Remove duplicated include from vxlan.h · 293681f1

YueHaibing authored Sep 13, 2018

Remove duplicated include.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

293681f1

net: dsa: b53: Do not fail when IRQ are not initialized · b2ddc48a

Florian Fainelli authored Sep 13, 2018

When the Device Tree is not providing the per-port interrupts, do not fail
during b53_srab_irq_enable() but instead bail out gracefully. The SRAB driver
is used on the BCM5301X (Northstar) platforms which do not yet have the SRAB
interrupts wired up.

Fixes: 16994374 ("net: dsa: b53: Make SRAB driver manage port interrupts")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b2ddc48a

Merge branch 'vhost_net-TX-batching' · 8bb83b78

David S. Miller authored Sep 13, 2018

Jason Wang says:

====================
vhost_net TX batching

This series tries to batch submitting packets to underlayer socket
through msg_control during sendmsg(). This is done by:

1) Doing userspace copy inside vhost_net
2) Build XDP buff
3) Batch at most 64 (VHOST_NET_BATCH) XDP buffs and submit them once
   through msg_control during sendmsg().
4) Underlayer sockets can use XDP buffs directly when XDP is enalbed,
   or build skb based on XDP buff.

For the packet that can not be built easily with XDP or for the case
that batch submission is hard (e.g sndbuf is limited). We will go for
the previous slow path, passing iov iterator to underlayer socket
through sendmsg() once per packet.

This can help to improve cache utilization and avoid lots of indirect
calls with sendmsg(). It can also co-operate with the batching support
of the underlayer sockets (e.g the case of XDP redirection through
maps).

Testpmd(txonly) in guest shows obvious improvements:

Test                /+pps%
XDP_DROP on TAP     /+44.8%
XDP_REDIRECT on TAP /+29%
macvtap (skb)       /+26%

Netperf TCP_STREAM TX from guest shows obvious improvements on small
packet:

    size/session/+thu%/+normalize%
       64/     1/   +2%/    0%
       64/     2/   +3%/   +1%
       64/     4/   +7%/   +5%
       64/     8/   +8%/   +6%
      256/     1/   +3%/    0%
      256/     2/  +10%/   +7%
      256/     4/  +26%/  +22%
      256/     8/  +27%/  +23%
      512/     1/   +3%/   +2%
      512/     2/  +19%/  +14%
      512/     4/  +43%/  +40%
      512/     8/  +45%/  +41%
     1024/     1/   +4%/    0%
     1024/     2/  +27%/  +21%
     1024/     4/  +38%/  +73%
     1024/     8/  +15%/  +24%
     2048/     1/  +10%/   +7%
     2048/     2/  +16%/  +12%
     2048/     4/    0%/   +2%
     2048/     8/    0%/   +2%
     4096/     1/  +36%/  +60%
     4096/     2/  -11%/  -26%
     4096/     4/    0%/  +14%
     4096/     8/    0%/   +4%
    16384/     1/   -1%/   +5%
    16384/     2/    0%/   +2%
    16384/     4/    0%/   -3%
    16384/     8/    0%/   +4%
    65535/     1/    0%/  +10%
    65535/     2/    0%/   +8%
    65535/     4/    0%/   +1%
    65535/     8/    0%/   +3%

Please review.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8bb83b78