Commits · 96e5ae4e76f1ea950d493f510399b49308bea731 · nexedi / linux

05 Sep, 2017 1 commit

bpf: fix numa_node validation · 96e5ae4e

Eric Dumazet authored Sep 04, 2017

syzkaller reported crashes in bpf map creation or map update [1]

Problem is that nr_node_ids is a signed integer,
NUMA_NO_NODE is also an integer, so it is very tempting
to declare numa_node as a signed integer.

This means the typical test to validate a user provided value :

        if (numa_node != NUMA_NO_NODE &&
            (numa_node >= nr_node_ids ||
             !node_online(numa_node)))

must be written :

        if (numa_node != NUMA_NO_NODE &&
            ((unsigned int)numa_node >= nr_node_ids ||
             !node_online(numa_node)))

[1]
kernel BUG at mm/slab.c:3256!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 2946 Comm: syzkaller916108 Not tainted 4.13.0-rc7+ #35
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
task: ffff8801d2bc60c0 task.stack: ffff8801c0c90000
RIP: 0010:____cache_alloc_node+0x1d4/0x1e0 mm/slab.c:3292
RSP: 0018:ffff8801c0c97638 EFLAGS: 00010096
RAX: ffffffffffff8b7b RBX: 0000000001080220 RCX: 0000000000000000
RDX: 00000000ffff8b7b RSI: 0000000001080220 RDI: ffff8801dac00040
RBP: ffff8801c0c976c0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8801c0c97620 R11: 0000000000000001 R12: ffff8801dac00040
R13: ffff8801dac00040 R14: 0000000000000000 R15: 00000000ffff8b7b
FS:  0000000002119940(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020001fec CR3: 00000001d2980000 CR4: 00000000001406f0
Call Trace:
 __do_kmalloc_node mm/slab.c:3688 [inline]
 __kmalloc_node+0x33/0x70 mm/slab.c:3696
 kmalloc_node include/linux/slab.h:535 [inline]
 alloc_htab_elem+0x2a8/0x480 kernel/bpf/hashtab.c:740
 htab_map_update_elem+0x740/0xb80 kernel/bpf/hashtab.c:820
 map_update_elem kernel/bpf/syscall.c:587 [inline]
 SYSC_bpf kernel/bpf/syscall.c:1468 [inline]
 SyS_bpf+0x20c5/0x4c40 kernel/bpf/syscall.c:1443
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x440409
RSP: 002b:00007ffd1f1792b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440409
RDX: 0000000000000020 RSI: 0000000020006000 RDI: 0000000000000002
RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401d70
R13: 0000000000401e00 R14: 0000000000000000 R15: 0000000000000000
Code: 83 c2 01 89 50 18 4c 03 70 08 e8 38 f4 ff ff 4d 85 f6 0f 85 3e ff ff ff 44 89 fe 4c 89 ef e8 94 fb ff ff 49 89 c6 e9 2b ff ff ff <0f> 0b 0f 0b 0f 0b 66 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41
RIP: ____cache_alloc_node+0x1d4/0x1e0 mm/slab.c:3292 RSP: ffff8801c0c97638
---[ end trace d745f355da2e33ce ]---
Kernel panic - not syncing: Fatal exception

Fixes: 96eabe7a ("bpf: Allow selecting numa node during map creation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

96e5ae4e

04 Sep, 2017 39 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 2ff81cd3

David S. Miller authored Sep 04, 2017

Pablo Neira Ayuso says:

====================
Netfilter updates for next-net (part 2)

The following patchset contains Netfilter updates for net-next. This
patchset includes updates for nf_tables, removal of
CONFIG_NETFILTER_DEBUG and a new mode for xt_hashlimit. More
specifically, they:

1) Add new rate match mode for hashlimit, this introduces a new revision
   for this match. The idea is to stop matching packets until ratelimit
   criteria stands true. Patch from Vishwanath Pai.

2) Add ->select_ops indirection to nf_tables named objects, so we can
   choose between different flavours of the same object type, patch from
   Pablo M. Bermudo.

3) Shorter function names in nft_limit, basically:
   s/nft_limit_pkt_bytes/nft_limit_bytes, also from Pablo M. Bermudo.

4) Add new stateful limit named object type, this allows us to create
   limit policies that you can identify via name, also from Pablo.

5) Remove unused hooknum parameter in conntrack ->packet indirection.
   From Florian Westphal.

6) Patches to remove CONFIG_NETFILTER_DEBUG and macros such as
   IP_NF_ASSERT and IP_NF_ASSERT. From Varsha Rao.

7) Add nf_tables_updchain() helper function and use it from
   nf_tables_newchain() to make it more maintainable. Similarly,
   add nf_tables_addchain() and use it too.

8) Add new netlink NLM_F_NONREC flag, this flag should only be used for
   deletion requests, specifically, to support non-recursive deletion.
   Based on what we discussed during NFWS'17 in Faro.

9) Use NLM_F_NONREC from table and sets in nf_tables.

10) Support for recursive chain deletion. Table and set deletion
    commands come with an implicit content flush on deletion, while
    chains do not. This patch addresses this inconsistency by adding
    the code to perform recursive chain deletions. This also comes with
    the bits to deal with the new NLM_F_NONREC netlink flag.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

2ff81cd3

netfilter: nf_tables: support for recursive chain deletion · 9dee1474

Pablo Neira Ayuso authored Sep 03, 2017

This patch sorts out an asymmetry in deletions. Currently, table and set
deletion commands come with an implicit content flush on deletion.
However, chain deletion results in -EBUSY if there is content in this
chain, so no implicit flush happens. So you have to send a flush command
in first place to delete chains, this is inconsistent and it can be
annoying in terms of user experience.

This patch uses the new NLM_F_NONREC flag to request non-recursive chain
deletion, ie. if the chain to be removed contains rules, then this
returns EBUSY. This problem was discussed during the NFWS'17 in Faro,
Portugal. In iptables, you hit -EBUSY if you try to delete a chain that
contains rules, so you have to flush first before you can remove
anything. Since iptables-compat uses the nf_tables netlink interface, it
has to use the NLM_F_NONREC flag from userspace to retain the original
iptables semantics, ie. bail out on removing chains that contain rules.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

9dee1474

netfilter: nf_tables: use NLM_F_NONREC for deletion requests · a8278400

Pablo Neira Ayuso authored Sep 03, 2017

Bail out if user requests non-recursive deletion for tables and sets.
This new flags tells nf_tables netlink interface to reject deletions if
tables and sets have content.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

a8278400

netlink: add NLM_F_NONREC flag for deletion requests · 2335ba70

Pablo Neira Ayuso authored Sep 03, 2017

In the last NFWS in Faro, Portugal, we discussed that netlink is lacking
the semantics to request non recursive deletions, ie. do not delete an
object iff it has child objects that hang from this parent object that
the user requests to be deleted.

We need this new flag to solve a problem for the iptables-compat
backward compatibility utility, that runs iptables commands using the
existing nf_tables netlink interface. Specifically, custom chains in
iptables cannot be deleted if there are rules in it, however, nf_tables
allows to remove any chain that is populated with content. To sort out
this asymmetry, iptables-compat userspace sets this new NLM_F_NONREC
flag to obtain the same semantics that iptables provides.

This new flag should only be used for deletion requests. Note this new
flag value overlaps with the existing:

* NLM_F_ROOT for get requests.
* NLM_F_REPLACE for new requests.

However, those flags should not ever be used in deletion requests.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2335ba70

netfilter: nf_tables: add nf_tables_addchain() · 4035285f

Pablo Neira Ayuso authored Sep 03, 2017

Wrap the chain addition path in a function to make it more maintainable.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

4035285f

netfilter: nf_tables: add nf_tables_updchain() · 2c4a488a

Pablo Neira Ayuso authored Sep 03, 2017

nf_tables_newchain() is too large, wrap the chain update path in a
function to make it more maintainable.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2c4a488a

net: Remove CONFIG_NETFILTER_DEBUG and _ASSERT() macros. · 9efdb14f

Varsha Rao authored Aug 30, 2017

This patch removes CONFIG_NETFILTER_DEBUG and _ASSERT() macros as they
are no longer required. Replace _ASSERT() macros with WARN_ON().
Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

9efdb14f

net: Replace NF_CT_ASSERT() with WARN_ON(). · 44d6e2f2

Varsha Rao authored Aug 30, 2017

This patch removes NF_CT_ASSERT() and instead uses WARN_ON().
Signed-off-by: Varsha Rao <rvarsha016@gmail.com>

44d6e2f2

netfilter: remove unused hooknum arg from packet functions · d1c1e39d
Florian Westphal authored Aug 29, 2017
```
tested with allmodconfig build.
Signed-off-by: Florian Westphal <fw@strlen.de>
```
d1c1e39d

netfilter: nft_limit: add stateful object type · a6912055

Pablo M. Bermudo Garay authored Aug 23, 2017

Register a new limit stateful object type into the stateful object
infrastructure.
Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

a6912055

netfilter: nft_limit: replace pkt_bytes with bytes · 6e323887

Pablo M. Bermudo Garay authored Aug 23, 2017

Just a small refactor patch in order to improve the code readability.
Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

6e323887

netfilter: nf_tables: add select_ops for stateful objects · dfc46034

Pablo M. Bermudo Garay authored Aug 23, 2017

This patch adds support for overloading stateful objects operations
through the select_ops() callback, just as it is implemented for
expressions.

This change is needed for upcoming additions to the stateful objects
infrastructure.
Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

dfc46034

netfilter: xt_hashlimit: add rate match mode · bea74641

Vishwanath Pai authored Aug 18, 2017

This patch adds a new feature to hashlimit that allows matching on the
current packet/byte rate without rate limiting. This can be enabled
with a new flag --hashlimit-rate-match. The match returns true if the
current rate of packets is above/below the user specified value.

The main difference between the existing algorithm and the new one is
that the existing algorithm rate-limits the flow whereas the new
algorithm does not. Instead it *classifies* the flow based on whether
it is above or below a certain rate. I will demonstrate this with an
example below. Let us assume this rule:

iptables -A INPUT -m hashlimit --hashlimit-above 10/s -j new_chain

If the packet rate is 15/s, the existing algorithm would ACCEPT 10
packets every second and send 5 packets to "new_chain".

But with the new algorithm, as long as the rate of 15/s is sustained,
all packets will continue to match and every packet is sent to new_chain.

This new functionality will let us classify different flows based on
their current rate, so that further decisions can be made on them based on
what the current rate is.

This is how the new algorithm works:
We divide time into intervals of 1 (sec/min/hour) as specified by
the user. We keep track of the number of packets/bytes processed in the
current interval. After each interval we reset the counter to 0.

When we receive a packet for match, we look at the packet rate
during the current interval and the previous interval to make a
decision:

if [ prev_rate < user and cur_rate < user ]
        return Below
else
        return Above

Where cur_rate is the number of packets/bytes seen in the current
interval, prev is the number of packets/bytes seen in the previous
interval and 'user' is the rate specified by the user.

We also provide flexibility to the user for choosing the time
interval using the option --hashilmit-interval. For example the user can
keep a low rate like x/hour but still keep the interval as small as 1
second.

To preserve backwards compatibility we have to add this feature in a new
revision, so I've created revision 3 for hashlimit. The two new options
we add are:

--hashlimit-rate-match
--hashlimit-rate-interval

I have updated the help text to add these new options. Also added a few
tests for the new options.
Suggested-by: Igor Lubashev <ilubashe@akamai.com>
Reviewed-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

bea74641

Merge branch 'for-upstream' of... · 45865dab

David S. Miller authored Sep 03, 2017

Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Johan Hedberg says:

====================
pull request: bluetooth-next 2017-09-03

Here's one last bluetooth-next pull request for the 4.14 kernel:

 - NULL pointer fix in ca8210 802.15.4 driver
 - A few "const" fixes
 - New Kconfig option for disabling legacy interfaces

Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

45865dab

Merge branch 'qualcomm-rmnet-Fix-comments-on-initial-patchset' · f98ce389

David S. Miller authored Sep 03, 2017

Subash Abhinov Kasiviswanathan says:

====================
net: qualcomm: rmnet: Fix comments on initial patchset

This series fixes the comments from Dan on the first patch series.

Fixes a memory corruption which could occur if mux_id was higher than 32.
Remove the RMNET_LOCAL_LOGICAL_ENDPOINT which is no longer used.
Make a log message more useful.
Combine __rmnet_set_endpoint_config() with rmnet_set_endpoint_config().
Set the mux_id in rmnet_vnd_newlink().
Set the ingress and egress data format directly in newlink.
Implement ndo_get_iflink to find the real_dev.
Rename the real_dev_info to port to make it similar to other drivers.

The conversion of rmnet_devices to a list and hash lookup will be sent
as part of a seperate patch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

f98ce389

net: qualcomm: rmnet: Rename real_dev_info to port · b665f4f8

Subash Abhinov Kasiviswanathan authored Sep 02, 2017

Make it similar to drivers like ipvlan / macvlan so it is easier to read.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Dan Williams <dcbw@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b665f4f8

net: qualcomm: rmnet: Implement ndo_get_iflink · b752eff5

Subash Abhinov Kasiviswanathan authored Sep 02, 2017

This makes it easier to find out the parent dev.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Dan Williams <dcbw@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b752eff5

net: qualcomm: rmnet: Refactor the new rmnet dev creation · 032ee468

Subash Abhinov Kasiviswanathan authored Sep 02, 2017

Data format can be directly set from rmnet_newlink() since the
rmnet real dev info is already available.

Since __rmnet_get_real_dev_info() is no longer used in rmnet_config.c
after removal of those functions, move content to
rmnet_get_real_dev_info().

__rmnet_set_endpoint_config() is collapsed into
rmnet_set_endpoint_config() since only mux_id was being set additionally
within it. Remove an unnecessary mux_id check.

Set the mux_id for the rmnet_dev within rmnet_vnd_newlink() itself.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Dan Williams <dcbw@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

032ee468

net: qualcomm: rmnet: Move the device creation log · 2d516c0d

Subash Abhinov Kasiviswanathan authored Sep 02, 2017

The current log is not very useful as it does not log the device
name since it it is prior to registration -

(unnamed net_device) (uninitialized): Setting up device

Modify to log after the device registration -

rmnet1: rmnet dev created
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2d516c0d

net: qualcomm: rmnet: Remove the unused endpoint -1 · 61bf5490

Subash Abhinov Kasiviswanathan authored Sep 02, 2017

This was used only in the original patch series where the IOCTLs were
present and is no longer in use.

Fixes: ceed73a2 ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Dan Williams <dcbw@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

61bf5490

net: qualcomm: rmnet: Fix memory corruption if mux_id is greater than 32 · 009e1b2b

Subash Abhinov Kasiviswanathan authored Sep 02, 2017

rmnet_rtnl_validate() was checking for upto mux_id 254, however the
rmnet_devices devices could hold upto 32 entries only. Fix this by
increasing the size of the rmnet_devices.

Fixes: ceed73a2 ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Dan Williams <dcbw@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

009e1b2b

Merge branch 'nfp-refactor-app-init-and-minor-flower-fixes' · 3cf2e08f

David S. Miller authored Sep 03, 2017

Jakub Kicinski says:

====================
nfp: refactor app init, and minor flower fixes

This series is a part 2 to what went into net as a simpler fix.
In net we simply moved when existing callbacks are invoked to
ensure flower app does not still use representors when lower
netdev has already been destroyed.  In this series we add a
callback to notify apps when vNIC netdevs are fully initialized
and they are about to be destroyed.  This allows flower to spawn
representors at the right time, while keeping the start/stop
callbacks for what they are intended to be used - FW initialization
over control channel.

Patch 4 improves drop monitor interaction and patch 5 changes
the default Kconfig selection of flower offload.  Patch 6 fixes
locking around representor updates which got lost in net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

3cf2e08f

nfp: flower: restore RTNL locking around representor updates · 9ce4fa54

Jakub Kicinski authored Sep 02, 2017

When we moved to updating representors from a workqueue grabbing
the RTNL somehow got lost in the process.  Restore it, and make
sure RCU lock is not held while we are grabbing the RTNL.  RCU
protects the representor table, so since we will be under RTNL
we can drop RCU lock as soon as we find the netdev pointer.
RTNL is needed for the dev_set_mtu() call.

Fixes: 2dff1962 ("nfp: process MTU updates from firmware flower app")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9ce4fa54

nfp: build the flower offload by default · 7c8a2d8b

Jakub Kicinski authored Sep 02, 2017

It's reasonable to assume that if user selects to build the NFP
driver all offload capabilities will be enabled by default.
Change the CONFIG_NFP_APP_FLOWER to default to enabled.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7c8a2d8b

nfp: be drop monitor friendly · 023a9284

Jakub Kicinski authored Sep 02, 2017

Use dev_consume_skb_any() in place of dev_kfree_skb_any()
when control frame has been successfully processed in flower
and on the driver's main TX completion path.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

023a9284

nfp: move the start/stop app callbacks back · 9d8b17be

Jakub Kicinski authored Sep 02, 2017

Since representors are now created with a separate callback
start/stop app callbacks can be moved again to their original
location.  They are intended to app-specific init/clean up
over the control channel.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9d8b17be

nfp: flower: base lifetime of representors on existence of lower vNIC · 192e6851

Jakub Kicinski authored Sep 02, 2017

Create representors after lower vNIC is registered and destroy
them before it is destroyed.  Move the code out of start/stop
callbacks directly into vnic_init/clean callbacks.  Make sure
SR-IOV callbacks don't try to create representors when lower
device does not exist.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

192e6851

nfp: separate app vNIC init/clean from alloc/free · c496291c

Jakub Kicinski authored Sep 02, 2017

We currently only have one app callback for vNIC creation
and destruction.  This is insufficient, because some actions
have to be taken before netdev is registered, after it's
registered and after it's unregistered.  Old callbacks
were really corresponding to alloc/free actions.  Rename
them and add proper init/clean.  Apps using representors
will be able to use new callbacks to manage lifetime of
upper devices.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c496291c

Merge tag 'mlx5-updates-2017-09-03' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 18a4ded9

David S. Miller authored Sep 03, 2017

Saeed Mahameed says:

====================
mlx5-updates-2017-09-03

This series from Tariq includes micro data path optimization for mlx5e
netdevice driver.

Mainly Tariq introduces the following changes to NAPI and RX handling
path of the driver:
 - RX ring structure reorganizing
 - Trivial code refactoring and optimization
 - NAPI busy-poll for when fast UMR is in progress
 - Non-atomic state operations in NAPI context
 - Remove unnecessary fields from fast path structures
 - page-cache micro optimization
 - Rely on NAPI to avoid missing an IRQ for RX/TX shared NAPI contexts
 - Stop NAPI when irq changes affinity
 - Distribute RSS table among all RX rings
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

18a4ded9

Merge branch 'mlxsw-Offloading-GRE-tunnels' · ccfdf21b

David S. Miller authored Sep 03, 2017

Jiri Pirko says:

====================
mlxsw: Offloading GRE tunnels

Petr says:

This patch series introduces to mlxsw driver support for offloading
IP-in-IP tunnels in general, and for (subset of) GRE in particular.

This patchset supports two ways of configuring GRE:

- So called "hierarchical configuration", where the GRE device has a bound
  dummy device, which is in a different VRF. The VRF with host traffic is
  called "overlay", the one with encapsulated traffic is called "underlay".

- So called "flat configuration", where the GRE device doesn't have a bound
  device, and overlay and underlay are both in the same VRF (possibly the
  default one).

Two routes are then interesting: a route that directs traffic to a GRE
device (which would typically be in overlay VRF, but could be in another
one), and a local route for the tunnel's local address (in underlay).
Handling of these two route types is then introduced as patches to support,
respectively, IPv4 and IPv6 encapsulation and IPv4 decapsulation.

The encap and decap routes then reference a loopback device, a new type of
RIF introduced by this patchset for the specific use of offloading tunnels.

The encap and decap code is abstract with respect to the particulars of
individual L3 tunnel types. This patchset introduces support for GRE
tunnels in particular.

Limitations:

- Each tunnel needs to have a different local address (within a given VRF).
  When two tunnels are used that are in conflict, FIB abort is triggered
  and the driver ceases offloading FIBs. Full handling of such
  configurations needs special setup in the hardware, such that the tunnels
  that share an address are dispatched correctly according to their key (or
  lack thereof). That's currently not implemented, and to keep things
  deterministic, the driver triggers FIB abort.

- A next hop that uses an incompletely-specified tunnel (e.g. such that are
  used for LWT) is not offloaded, but doesn't trigger FIB abort like the
  above. If such routes end up being in a de facto conflict with other
  tunnels, then if there already is an offload for that address, the
  traffic for the conflicting tunnel will end up mismatching the
  configuration of the offloaded tunnel, and thus gets to slow path through
  an error trap.

- GRE checksumming and sequence numbers are not supported and TTL and TOS
  need to be set to inherit. Tunnels with a different configuration are not
  offloaded and their traffic is trapping to slow path.

  Note in particular that TOS of inherit is not the default configuration
  and needs to be explicitly specified when the tunnel is created.

- The only feature that is not graciously handled is that if a change is
  made to the tunnel, e.g. through "ip tunnel change", such changes are not
  reflected in the driver. There is currently no notification mechanism for
  these changes. Introduction of this mechanism and its leverage in the
  driver will be subject of follow-up work. For now this limitation can be
  worked around by removing and re-adding the encap route.

---
v1->v2:
-fix order of patch 5
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

ccfdf21b

mlxsw: spectrum_router: Support GRE tunnels · ee954d1a

Petr Machata authored Sep 02, 2017

This patch introduces callbacks and tunnel type to offload GRE tunnels.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee954d1a

mlxsw: spectrum_router: Add loopback accessors · 92107cfb

Petr Machata authored Sep 02, 2017

struct mlxsw_sp_rif is a router-private structure, and therefore
everything related to it is as well: parameters, and derived RIF types
including loopbacks. IPIP module needs access to some details of
loopback interfaces, but exporting all the RIF shebang would create too
large an interface.

So instead export just the bare minimum necessary: accessors for RIF
index and underlay VRF ID.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

92107cfb

mlxsw: spectrum: Register for IPIP_DECAP_ERROR trap · 86484de2

Petr Machata authored Sep 02, 2017

These traps are generated for packets that fail checks for source IP,
encapsulation type, or GRE key. Trap these packets to CPU for follow-up
handling by the kernel, which will send ICMP destination unreachable
responses.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

86484de2

mlxsw: spectrum_router: Use existing decap route · 1cc38fb1

Petr Machata authored Sep 02, 2017

The local route that points at IPIP's underlay device (decap route) can
be present long before the GRE device. Thus when an encap route is
added, it's necessary to look inside the underlay FIB if the decap route
is already present. If so, the current trap offload needs to be
withdrawn and replaced with a decap offload.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1cc38fb1

mlxsw: spectrum_router: Support IPv4 underlay decap · 4607f6d2

Petr Machata authored Sep 02, 2017

Unlike encapsulation, which is represented by a next hop forwarding to
an IPIP tunnel, decapsulation is a type of local route. It is created
for local routes whose prefix corresponds to the local address of one of
offloaded IPIP tunnels. When the tunnel is removed (i.e. all the encap
next hops are removed), the decap offload is migrated back to a trap for
resolution in slow path.

This patch assumes that decap route is already present when encap route
is added. A follow-up patch will fix this issue.

Note that this patch only supports IPv4 underlay. Support for IPv6
underlay will be subject to follow-up work apart from this patchset.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4607f6d2

mlxsw: spectrum_router: Support IPv6 overlay encap · 8f28a309

Petr Machata authored Sep 02, 2017

Add the missing bits to recognize IPv6 next hops as IPIP ones to enable
offloading of IPv6 overlay encapsulation.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8f28a309

mlxsw: spectrum_router: Support IPv4 overlay encap · 1012b9ac

Petr Machata authored Sep 02, 2017

This introduces some common code for tracking of offloaded IP-in-IP
tunnels, and support for offloading IPv4 overlay encapsulating routes in
particular. A follow-up patch will introduce IPv6 overlay as well.

Offloaded tunnels are kept in a linked list of mlxsw_sp_ipip_entry
objects hooked up in mlxsw_sp_router. A network device that represents
the tunnel is used as a key to look up the corresponding IPIP entry.
Note that in the future, more general keying mechanism will be needed,
because parts of the tunnel information can be provided by the route.

IPIP entries are reference counted, because several next hops may end up
using the same tunnel, and we only want to offload it once.

Encapsulation path hooks into next hop handling. Routes that forward to
a tunnel are now considered gateway routes, thus giving them the same
treatment that other remote routes get. An IPIP next hop type is
introduced.

Details of individual tunnel types are kept in an array of
mlxsw_sp_ipip_ops objects. If a tunnel type doesn't match any of the
known tunnel types, the next-hop is not considered an IPIP next hop.

The list of IPIP tunnel types is currently empty, follow-up patches will
add support for GRE. Traffic to IPIP tunnel types that are not
explicitly recognized by the driver traps and is handled in slow path.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1012b9ac

mlxsw: spectrum_router: Make nexthops typed · 35225e47

Petr Machata authored Sep 02, 2017

In the router, some next hops may reference an encapsulating netdevice,
such as GRE or IPIP. To properly offload these next hops, mlxsw needs to
keep track of whether a given next hop is a regular Ethernet entry, or
an IP-in-IP tunneling entry.

To facilitate this book-keeping, add a type field to struct
mlxsw_sp_nexthop. There is, as of this patch, only one next hop type:
MLXSW_SP_NEXTHOP_TYPE_ETH. Follow-up patches will introduce the IP-in-IP
variant.

There are several places where next hops are initialized in the IPv4
path. Instead of replicating the logic at every one of them, factor it
out to a function mlxsw_sp_nexthop4_type_init(). The corresponding fini
is actually protocol-neutral, so put it to mlxsw_sp_nexthop_type_fini(),
but create a corresponding protocoled _fini function that dispatches to
the protocol-neutral one.

The IPv6 path is simpler, but for symmetry with IPv4, create the same
suite of functions with corresponding logic.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

35225e47

mlxsw: spectrum_router: Extract mlxsw_sp_rt6_is_gateway() · f6050ee6

Petr Machata authored Sep 02, 2017

IPv6 counterpart of the previous patch: introduce a function to
determine whether a given route is a gateway route.

The new function takes a mlxsw_sp argument which follow-up patches will
use. Thus mlxsw_sp_fib6_entry_type_set() got that argument as well.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f6050ee6