1. 15 Oct, 2020 23 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 2295cddf
      Jakub Kicinski authored
      Minor conflicts in net/mptcp/protocol.h and
      tools/testing/selftests/net/Makefile.
      
      In both cases code was added on both sides in the same place
      so just keep both.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH" · 2ecbc1f6
      Jakub Kicinski authored
      This reverts commit 1d273fcc.
      
      Alexei points out there's nothing implying headers will be built
      and therefore exist under usr/include, so this fix doesn't make
      much sense.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements · 346e320c
      Davide Caratti authored
      nftables payload statements are used to mangle SCTP headers, but they can
      only replace the Internet Checksum. As a consequence, nftables rules that
      mangle sport/dport/vtag in SCTP headers potentially generate packets that
      are discarded by the receiver, unless the CRC-32C is "offloaded" (e.g. the
      rule mangles an skb having 'ip_summed' equal to 'CHECKSUM_PARTIAL').
      
      Fix this by extending the uAPI definitions and the L4 checksum update
      function so that userspace programs (e.g. nft) can instruct the kernel to
      compute the CRC-32C in SCTP headers. Also ensure that LIBCRC32C is built if
      NF_TABLES is 'y' or 'm' in the kernel build configuration.
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
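To illustrate why mangling an SCTP port or verification tag invalidates the packet, here is a minimal user-space sketch in Python, not the kernel code. It computes CRC-32C bit by bit over a 12-byte SCTP common header whose checksum field is zeroed. The helper names `crc32c`, `sctp_common_header` and `sctp_checksum` are made up for this example, and the byte order in which the value is stored into the header is deliberately left out.

```python
import struct

CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C: slow, but needs no external library."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def sctp_common_header(sport: int, dport: int, vtag: int) -> bytes:
    """Hypothetical helper: SCTP common header with the checksum field zeroed."""
    return struct.pack("!HHII", sport, dport, vtag, 0)


def sctp_checksum(sport: int, dport: int, vtag: int, chunks: bytes = b"") -> int:
    """CRC-32C over the whole packet, computed with a zeroed checksum field."""
    return crc32c(sctp_common_header(sport, dport, vtag) + chunks)


# Rewriting the destination port changes the CRC, so a rule that mangles
# dport without recomputing the CRC-32C emits a packet the receiver drops.
before = sctp_checksum(5000, 5001, 0x1234ABCD)
after = sctp_checksum(5000, 6001, 0x1234ABCD)
print(hex(before), hex(after), before != after)
```

Unlike the Internet Checksum, a CRC is not usually patched from just the old and new field values, which is presumably why the payload statement needs a way to request a full recomputation.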
    • Merge tag 'rxrpc-next-20201015' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · 54086c5a
      Jakub Kicinski authored
      David Howells says:
      
      ====================
      rxrpc fixes
      
      Here are a couple of fixes that need to be applied on top of rxrpc patches
      in net-next:
      
       (1) Fix a bug in the connection bundle changes in the net-next tree.
      
       (2) Fix the loss of final ACK on socket shutdown.
      ====================
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: fix pos incrementment in ipv6_route_seq_next · 6617dfd4
      Yonghong Song authored
      Commit 4fc427e0 ("ipv6_route_seq_next should increase position index")
      tried to fix the issue where seq_file pos is not increased
      if a NULL element is returned with seq_ops->next(). See bug
        https://bugzilla.kernel.org/show_bug.cgi?id=206283
      The commit effectively does:
        - increase pos for all seq_ops->start()
        - increase pos for all seq_ops->next()
      
      For ipv6_route, increasing pos for all seq_ops->next() is correct.
      But increasing pos for seq_ops->start() is not correct
      since pos is used to determine how many items to skip during
      seq_ops->start():
        iter->skip = *pos;
      seq_ops->start() just fetches the *current* pos item.
      The item can be skipped only after seq_ops->show() which essentially
      is the beginning of seq_ops->next().
      
      For example, I have 7 ipv6 route entries,
        root@arch-fb-vm1:~/net-next dd if=/proc/net/ipv6_route bs=4096
        00000000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000400 00000001 00000000 00000001     eth0
        fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001     eth0
        00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
        00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001       lo
        fe800000000000002050e3fffebd3be8 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001     eth0
        ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000004 00000000 00000001     eth0
        00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
        0+1 records in
        0+1 records out
        1050 bytes (1.0 kB, 1.0 KiB) copied, 0.00707908 s, 148 kB/s
        root@arch-fb-vm1:~/net-next
      
      In the above, I specify buffer size 4096, so all records can be returned
      to user space with a single trip to the kernel.
      
      If I use buffer size 128, since each record size is 149, internally
      kernel seq_read() will read 149 into its internal buffer and return the data
      to user space in two read() syscalls. Then user read() syscall will trigger
      next seq_ops->start(). Since the current implementation increased pos even
      for seq_ops->start(), it will skip record #2, #4 and #6, assuming the first
      record is #1.
      
        root@arch-fb-vm1:~/net-next dd if=/proc/net/ipv6_route bs=128
        00000000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000400 00000001 00000000 00000001     eth0
        00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
        fe800000000000002050e3fffebd3be8 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001     eth0
        00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
        4+1 records in
        4+1 records out
        600 bytes copied, 0.00127758 s, 470 kB/s
      
      To fix the problem, create a fake pos pointer so seq_ops->start()
      won't actually increase seq_file pos. With this fix, the
      above `dd` command with `bs=128` will show correct result.
      
      Fixes: 4fc427e0 ("ipv6_route_seq_next should increase position index")
      Cc: Alexei Starovoitov <ast@kernel.org>
      Suggested-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
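The bs=128 behaviour described above can be reproduced with a small Python model. This is a simplified sketch, not the seq_file code: it assumes every read trip shows exactly one record (the user buffer is smaller than a record), that start() skips `pos` entries, and that next() always advances `pos`. The function name `read_all` and the flag `start_bumps_pos` are invented for the example.

```python
def read_all(num_records: int, start_bumps_pos: bool) -> list:
    """Return the 1-based record numbers a reader would see.

    start_bumps_pos=True models the behaviour being fixed, where
    seq_ops->start() also increments the seq_file position.
    """
    pos = 0
    seen = []
    while True:
        # seq_ops->start(): fetch the *current* item by skipping `pos` entries.
        skip = pos
        if start_bumps_pos:
            pos += 1  # the unwanted increment
        if skip >= num_records:
            break  # nothing left, the read returns EOF
        seen.append(skip + 1)  # seq_ops->show()
        # seq_ops->next(): step past the record that was just shown.
        pos += 1
        # seq_ops->stop(): the trip ends and the next read() starts over.
    return seen


print(read_all(7, start_bumps_pos=True))   # records #2, #4 and #6 are lost
print(read_all(7, start_bumps_pos=False))  # every record is returned once
```

With the increment in start() removed, which is the effect of handing it a throwaway copy of the position, the model returns all seven records.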
    • Merge branch 'net-smc-fixes-2020-10-14' · 0c124aa5
      Jakub Kicinski authored
      Karsten Graul says:
      
      ====================
      net/smc: fixes 2020-10-14
      
      The first patch fixes a possible use-after-free of delayed llc events.
      Patch 2 corrects the number of DMB buffer sizes. And patch 3 ensures
      a correctly formatted return code when smc_ism_register_dmb() fails to
      create a new DMB.
      ====================
      
      Link: https://lore.kernel.org/r/20201014174329.35791-1-kgraul@linux.ibm.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/smc: fix invalid return code in smcd_new_buf_create() · 6b1bbf94
      Karsten Graul authored
      smc_ism_register_dmb() returns error codes set by the ISM driver which
      are not guaranteed to be negative or in the errno range. Such values
      would not be handled by ERR_PTR() and finally the return code will be
      used as a memory address.
      Fix that by using a valid negative errno value with ERR_PTR().
      
      Fixes: 72b7f6c4 ("net/smc: unique reason code for exceeded max dmb count")
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
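The failure mode can be modelled in Python with plain integer arithmetic. This is a sketch under assumptions: a 64-bit `unsigned long`, the kernel's `MAX_ERRNO` of 4095, and a hypothetical `to_errno` helper that maps out-of-range driver codes to `-EIO`. The errno value chosen by the actual patch is not stated in the commit message.

```python
import errno

MAX_ERRNO = 4095
WORD_MASK = (1 << 64) - 1  # assumption: 64-bit unsigned long


def err_ptr(code: int) -> int:
    """Model of ERR_PTR(): reinterpret a signed code as a pointer value."""
    return code & WORD_MASK


def is_err(ptr: int) -> bool:
    """Model of IS_ERR(): only the top MAX_ERRNO values count as errors."""
    return ptr >= ((-MAX_ERRNO) & WORD_MASK)


def to_errno(rc: int) -> int:
    """Hypothetical helper: keep valid negative errnos, map the rest to -EIO."""
    return rc if -MAX_ERRNO <= rc < 0 else -errno.EIO


driver_rc = 0x1234  # a positive, driver-specific reason code
print(is_err(err_ptr(driver_rc)))            # looks like a valid address
print(is_err(err_ptr(to_errno(driver_rc))))  # recognised as an error
```

A caller that trusts `IS_ERR()` would treat the first value as a usable buffer pointer, which is the "return code used as a memory address" problem the commit describes.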
    • net/smc: fix valid DMBE buffer sizes · ef12ad45
      Karsten Graul authored
      SMCD_DMBE_SIZES should cover all valid DMBE buffer sizes, so the
      correct value is 6, which corresponds to 1MB. With a value of 7, the
      registration of an ISM buffer would always fail because an invalid size
      is requested. Fix that and set the value to 6.
      
      Fixes: c6ba7c9b ("net/smc: add base infrastructure for SMC-D and ISM")
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
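The arithmetic behind "6 means 1MB" can be checked with a short Python sketch. It assumes that a compressed size index n stands for a buffer of 16 KiB shifted left by n, which agrees with the 1MB figure in the commit message but is an assumption here. The names `BASE_BUFSIZE`, `MAX_VALID_INDEX` and `uncompress_bufsize` are illustrative.

```python
BASE_BUFSIZE = 16 * 1024  # assumed size for compressed index 0
MAX_VALID_INDEX = 6       # the corrected value from the commit message


def uncompress_bufsize(index: int) -> int:
    """Buffer size in bytes for a compressed size index (assumed mapping)."""
    return BASE_BUFSIZE << index


for index in range(8):
    size_kib = uncompress_bufsize(index) // 1024
    status = "valid" if index <= MAX_VALID_INDEX else "rejected"
    print(index, size_kib, "KiB", status)
```

Under this mapping index 7 would request 2 MiB, beyond the 1MB that the commit message gives as the largest valid size.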
    • net/smc: fix use-after-free of delayed events · d535ca13
      Karsten Graul authored
      When a delayed event is enqueued, the event worker sends it the next
      time it runs while no other flow is active. The event handler is then
      called for the delayed event, but the pointer to the event remains set
      in lgr->delayed_event. This pointer is only cleared later in the
      processing, by smc_llc_flow_start().
      This can lead to a use-after-free when the processing does not reach
      smc_llc_flow_start() but frees the event because of an error. The
      delayed_event pointer is then still set although the event has been
      freed.
      Fix this by always clearing the delayed event pointer when the event is
      handed to the event handler for processing, and remove the code that
      clears it in smc_llc_flow_start().
      
      Fixes: 555da9af ("net/smc: add event-based llc_flow framework")
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
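The ownership problem can be modelled in Python, where a `freed` flag stands in for kfree(). Everything here is a hypothetical sketch: the classes `Event` and `LinkGroup` and the functions `failing_handler`, `run_worker` and `has_dangling_event` only mirror the roles named in the commit message, not the real SMC data structures.

```python
class Event:
    def __init__(self, name: str):
        self.name = name
        self.freed = False  # stands in for memory returned to the allocator


class LinkGroup:
    def __init__(self):
        self.delayed_event = None


def failing_handler(lgr: LinkGroup, event: Event) -> None:
    """Error path: frees the event and never reaches the flow-start step."""
    event.freed = True


def run_worker(lgr: LinkGroup, handler, clear_before_handler: bool) -> None:
    event = lgr.delayed_event
    if event is None:
        return
    if clear_before_handler:
        lgr.delayed_event = None  # the fix: drop the reference up front
    handler(lgr, event)


def has_dangling_event(lgr: LinkGroup) -> bool:
    return lgr.delayed_event is not None and lgr.delayed_event.freed


old = LinkGroup()
old.delayed_event = Event("delayed")
run_worker(old, failing_handler, clear_before_handler=False)

new = LinkGroup()
new.delayed_event = Event("delayed")
run_worker(new, failing_handler, clear_before_handler=True)

print(has_dangling_event(old), has_dangling_event(new))
```

Clearing the pointer at hand-off makes the handler the sole owner of the event, so no later code path can pick up the stale reference.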
    • bpfilter: Fix build error with CONFIG_BPFILTER_UMH · 1d273fcc
      YueHaibing authored
      If CONFIG_BPFILTER_UMH is set, building fails:
      
      In file included from /usr/include/sys/socket.h:33:0,
                       from net/bpfilter/main.c:6:
      /usr/include/bits/socket.h:390:10: fatal error: asm/socket.h: No such file or directory
       #include <asm/socket.h>
                ^~~~~~~~~~~~~~
      compilation terminated.
      scripts/Makefile.userprogs:43: recipe for target 'net/bpfilter/main.o' failed
      make[2]: *** [net/bpfilter/main.o] Error 1
      
      Add missing include path to fix this.
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr · 0ec78cdb
      Ayush Sawal authored
      This patch changes the module name to "ch_ipsec" and prepends
      "ch_ipsec" string instead of "chcr" in all debug messages and
      function names.
      
      V1->V2:
      -Removed inline keyword from functions.
      -Removed CH_IPSEC prefix from pr_debug.
      -Used proper indentation for the continuation line of the function
      arguments.
      
      V2->V3:
      Fix the checkpatch.pl warnings.
      
      Fixes: 1b77be46 ("crypto/chcr: Moving chelsio's inline ipsec functionality to /drivers/net")
      Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info · d086a1c6
      Leon Romanovsky authored
      Accessing tcf_tunnel_info() produces the following splat. Fix it by
      dereferencing the tcf_tunnel_key_params pointer with a marker that the
      internal tcfa_lock is held.
      
       =============================
       WARNING: suspicious RCU usage
       5.9.0+ #1 Not tainted
       -----------------------------
       include/net/tc_act/tc_tunnel_key.h:59 suspicious rcu_dereference_protected() usage!
       other info that might help us debug this:
      
       rcu_scheduler_active = 2, debug_locks = 1
       1 lock held by tc/34839:
        #0: ffff88828572c2a0 (&p->tcfa_lock){+...}-{2:2}, at: tc_setup_flow_action+0xb3/0x48b5
       stack backtrace:
       CPU: 1 PID: 34839 Comm: tc Not tainted 5.9.0+ #1
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
       Call Trace:
        dump_stack+0x9a/0xd0
        tc_setup_flow_action+0x14cb/0x48b5
        fl_hw_replace_filter+0x347/0x690 [cls_flower]
        fl_change+0x2bad/0x4875 [cls_flower]
        tc_new_tfilter+0xf6f/0x1ba0
        rtnetlink_rcv_msg+0x5f2/0x870
        netlink_rcv_skb+0x124/0x350
        netlink_unicast+0x433/0x700
        netlink_sendmsg+0x6f1/0xbd0
        sock_sendmsg+0xb0/0xe0
        ____sys_sendmsg+0x4fa/0x6d0
        ___sys_sendmsg+0x12e/0x1b0
        __sys_sendmsg+0xa4/0x120
        do_syscall_64+0x2d/0x40
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f1f8cd4fe57
       Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
       RSP: 002b:00007ffdc1e193b8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
       RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1f8cd4fe57
       RDX: 0000000000000000 RSI: 00007ffdc1e19420 RDI: 0000000000000003
       RBP: 000000005f85aafa R08: 0000000000000001 R09: 00007ffdc1e1936c
       R10: 000000000040522d R11: 0000000000000246 R12: 0000000000000001
       R13: 0000000000000000 R14: 00007ffdc1e1d6f0 R15: 0000000000482420
      
      Fixes: 3ebaf6da ("net: sched: Do not assume RTNL is held in tunnel key action helpers")
      Fixes: 7a472814 ("net: sched: lock action when translating it to flow_action infra")
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • rxrpc: Fix loss of final ack on shutdown · ddc7834a
      David Howells authored
      Fix the loss of transmission of a call's final ack when a socket gets shut
      down.  This means that the server will retransmit the last data packet or
      send a ping ack and then get an ICMP indicating the port got closed.  The
      server will then view this as a failure.
      
      Fixes: 3136ef49 ("rxrpc: Delay terminal ACK transmission on a client call")
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Fix bundle counting for exclusive connections · f3af4ad1
      David Howells authored
      Fix rxrpc_unbundle_conn() to not drop the bundle usage count when cleaning
      up an exclusive connection.
      
      Based on the suggested fix from Hillf Danton.
      
      Fixes: 245500d8 ("rxrpc: Rewrite the client connection manager")
      Reported-by: syzbot+d57aaf84dd8a550e6d91@syzkaller.appspotmail.com
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Hillf Danton <hdanton@sina.com>
    • netfilter: restore NF_INET_NUMHOOKS · d25e2e93
      Pablo Neira Ayuso authored
      This definition is used by the iptables legacy UAPI, restore it.
      
      Fixes: d3519cb8 ("netfilter: nf_tables: add inet ingress support")
      Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'ibmveth-gso-fix' · 15f0d292
      Jakub Kicinski authored
      David Wilder says:
      
      ====================
      ibmveth gso fix
      
      The ibmveth driver is a virtual Ethernet driver used on IBM pSeries systems.
      GSO packets can be sent between LPARs (virtual hosts) without segmentation,
      by flagging GSO packets using one of two methods depending on the firmware
      version. Some GSO packets were not correctly identified by the receiver.
      This patch set corrects this issue.
      
      V2:
      - Added fix tags.
      - Byteswap the constant at compilation time.
      - Updated the commit message to clarify what frame validation is performed
        by the hypervisor.
      ====================
      
      Link: https://lore.kernel.org/r/20201013232014.26044-1-dwilder@us.ibm.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ibmveth: Identify ingress large send packets. · 413f142c
      David Wilder authored
      Ingress large send packets are identified by either
      the IBMVETH_RXQ_LRG_PKT flag in the receive buffer
      or a -1 placed in the IP header checksum.
      The method used depends on the firmware version. Frame
      geometry and sufficient header validation are performed by the
      hypervisor, eliminating the need for further header checks here.
      
      Fixes: 7b596738 ("ibmveth: set correct gso_size and gso_type")
      Signed-off-by: David Wilder <dwilder@us.ibm.com>
      Reviewed-by: Thomas Falcon <tlfalcon@linux.ibm.com>
      Reviewed-by: Cristobal Forno <cris.forno@ibm.com>
      Reviewed-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
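The two detection methods amount to a simple predicate, sketched here in Python. The bit value given to `IBMVETH_RXQ_LRG_PKT` is a placeholder chosen for the example, not taken from the driver header, and `is_large_send` is a hypothetical name. A -1 in the 16-bit IP checksum field is 0xFFFF.

```python
IBMVETH_RXQ_LRG_PKT = 0x04000000  # placeholder bit value, an assumption
IP_CHECK_LARGE_SEND = 0xFFFF      # -1 as a 16-bit IP header checksum


def is_large_send(rx_flags: int, ip_check: int) -> bool:
    """True if either firmware method marks the frame as a large send packet."""
    flagged = bool(rx_flags & IBMVETH_RXQ_LRG_PKT)
    marked_in_header = ip_check == IP_CHECK_LARGE_SEND
    return flagged or marked_in_header


print(is_large_send(IBMVETH_RXQ_LRG_PKT, 0x1C46))  # flag in the receive buffer
print(is_large_send(0, 0xFFFF))                    # marker in the IP checksum
print(is_large_send(0, 0x1C46))                    # ordinary frame
```

Because the second method reads the IP checksum field, the check has to run before anything rewrites that field.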
    • ibmveth: Switch order of ibmveth_helper calls. · 5ce9ad81
      David Wilder authored
      ibmveth_rx_csum_helper() must be called after ibmveth_rx_mss_helper()
      as ibmveth_rx_csum_helper() may alter ip and tcp checksum values.
      
      Fixes: 66aa0678 ("ibmveth: Support to enable LSO/CSO for Trunk VEA.")
      Signed-off-by: David Wilder <dwilder@us.ibm.com>
      Reviewed-by: Thomas Falcon <tlfalcon@linux.ibm.com>
      Reviewed-by: Cristobal Forno <cris.forno@ibm.com>
      Reviewed-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • cxgb4: handle 4-tuple PEDIT to NAT mode translation · 2ef813b8
      Herat Ramani authored
      The 4-tuple NAT offload via PEDIT always overwrites all the 4-tuple
      fields even if they had not been explicitly enabled. If any fields in
      the 4-tuple are not enabled, then the hardware overwrites the
      disabled fields with zeros, instead of ignoring them.
      
      So, add a parser that can translate the enabled 4-tuple PEDIT fields
      to one of the NAT mode combinations supported by the hardware and
      hence avoid overwriting disabled fields to 0. Any rule with
      unsupported NAT mode combination is rejected.
      Signed-off-by: Herat Ramani <herat@chelsio.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
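The parser described above is in essence a lookup from the set of enabled 4-tuple fields to a NAT mode, with a rejection for combinations the hardware cannot express. The Python sketch below shows that shape only. The field bits, the mode names and the contents of `SUPPORTED_MODES` are illustrative assumptions, so the driver source remains the reference for the real list.

```python
SIP, DIP, SPORT, DPORT = 1, 2, 4, 8  # illustrative bits for the enabled fields

# Assumed table of combinations the hardware can rewrite as a unit.
SUPPORTED_MODES = {
    0: "NONE",
    DIP: "DIP",
    DIP | DPORT: "DIP_DP",
    DIP | DPORT | SIP: "DIP_DP_SIP",
    DIP | DPORT | SPORT: "DIP_DP_SP",
    SIP | SPORT: "SIP_SP",
    DIP | SIP | SPORT: "DIP_SIP_SP",
    SIP | DIP | SPORT | DPORT: "ALL",
}


def nat_mode(enabled_fields: int) -> str:
    """Translate enabled PEDIT fields to a NAT mode or reject the rule."""
    try:
        return SUPPORTED_MODES[enabled_fields]
    except KeyError:
        raise ValueError("unsupported NAT mode combination") from None


print(nat_mode(DIP | DPORT))  # only these two fields get rewritten
try:
    nat_mode(SPORT)
except ValueError as exc:
    print("rejected:", exc)
```

Rejecting an unknown combination, rather than rounding it up to a larger mode, is what keeps disabled fields from being overwritten with zeros.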
    • Merge branch 'l3mdev-icmp-error-route-lookup-fixes' · f8ea4a19
      Jakub Kicinski authored
      Mathieu Desnoyers says:
      
      ====================
      l3mdev icmp error route lookup fixes
      
      Here is a series of fixes for ipv4 and ipv6 which ensure the route
      lookup is performed on the right routing table in VRF configurations
      when sending TTL expired icmp errors (useful for traceroute).
      
      It includes tests for both ipv4 and ipv6.
      
      These fixes specifically address the code paths involved in
      sending TTL expired icmp errors. As detailed in the individual commit
      messages, those fixes do not address similar icmp errors related to
      network namespaces and unreachable / fragmentation needed messages,
      which appear to use different code paths.
      ====================
      
      Link: https://lore.kernel.org/r/20201012145016.2023-1-mathieu.desnoyers@efficios.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: Add VRF route leaking tests · 1a017276
      Michael Jeanson authored
      The objective of the tests is to check that ICMP errors generated while
      crossing between VRFs are properly routed back to the source host.
      
      The first ttl test sends a ping with a ttl of 1 from h1 to h2 and parses the
      output of the command to check that a ttl expired error is received.
      
      The second ttl test runs traceroute from h1 to h2 and parses the output to
      check for a hop on r1.
      
      The mtu test sends a ping with a payload of 1450 from h1 to h2, through
      r1 which has an interface with a mtu of 1400 and parses the output of the
      command to check that a fragmentation needed error is received.
      
      [ The IPv6 MTU test still fails with the symmetric routing setup. It
        appears to be caused by source address selection picking ::1.  Fixing
        this is beyond the scope of this series. ]
      Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv6/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2) · 272928d1
      Mathieu Desnoyers authored
      As per RFC4443, the destination address field for ICMPv6 error messages
      is copied from the source address field of the invoking packet.
      
      In configurations with Virtual Routing and Forwarding tables, looking up
      which routing table to use for sending ICMPv6 error messages is
      currently done by using the destination net_device.
      
      If the source and destination interfaces are within separate VRFs, or
      one in the global routing table and the other in a VRF, looking up the
      source address of the invoking packet in the destination interface's
      routing table will fail if the destination interface's routing table
      contains no route to the invoking packet's source address.
      
      One observable effect of this issue is that traceroute6 does not work in
      the following cases:
      
      - Route leaking between global routing table and VRF
      - Route leaking between VRFs
      
      Use the source device routing table when sending ICMPv6 error
      messages.
      
      [ In the context of ipv4, it has been pointed out that a similar issue
        may exist with ICMP errors triggered when forwarding between network
        namespaces. It would be worthwhile to investigate whether ipv6 has
        similar issues, but is outside of the scope of this investigation. ]
      
      [ Testing shows that similar issues exist with ipv6 unreachable /
        fragmentation needed messages.  However, investigation of this
        additional failure mode is beyond this investigation's scope. ]
      
      Link: https://tools.ietf.org/html/rfc4443
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
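The failing lookup can be pictured with two toy routing tables in Python. This is a sketch of the scenario only. The prefixes come from the IPv6 documentation range, and the table contents and the `has_route` helper are assumptions made for the example.

```python
import ipaddress


def has_route(table: list, addr: str) -> bool:
    """True if any prefix in the table covers the address."""
    ip = ipaddress.ip_address(addr)
    return any(ip in ipaddress.ip_network(prefix) for prefix in table)


# Hypothetical setup: the source VRF has a leaked route towards the
# destination network, but the destination VRF has no route back.
source_vrf_table = ["2001:db8:1::/64", "2001:db8:2::/64"]
dest_vrf_table = ["2001:db8:2::/64"]

invoking_src = "2001:db8:1::10"  # ICMPv6 errors are addressed to this host

print(has_route(dest_vrf_table, invoking_src))    # lookup via destination device
print(has_route(source_vrf_table, invoking_src))  # lookup via source device
```

Choosing the table from the source device is what lets the hop limit exceeded error, and with it traceroute6, find its way back.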
    • ipv4/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2) · e1e84eb5
      Mathieu Desnoyers authored
      As per RFC792, ICMP errors should be sent to the source host.
      
      However, in configurations with Virtual Routing and Forwarding tables,
      looking up which routing table to use is currently done by using the
      destination net_device.
      
      commit 9d1a6c4e ("net: icmp_route_lookup should use rt dev to
      determine L3 domain") changes the interface passed to
      l3mdev_master_ifindex() and inet_addr_type_dev_table() from skb_in->dev
      to skb_dst(skb_in)->dev. This effectively uses the destination device
      rather than the source device for choosing which routing table should be
      used to lookup where to send the ICMP error.
      
      Therefore, if the source and destination interfaces are within separate
      VRFs, or one in the global routing table and the other in a VRF, looking
      up the source host in the destination interface's routing table will
      fail if the destination interface's routing table contains no route to
      the source host.
      
      One observable effect of this issue is that traceroute does not work in
      the following cases:
      
      - Route leaking between global routing table and VRF
      - Route leaking between VRFs
      
      Preferably use the source device routing table when sending ICMP error
      messages. If no source device is set, fall-back on the destination
      device routing table. Else, use the main routing table (index 0).
      
      [ It has been pointed out that a similar issue may exist with ICMP
        errors triggered when forwarding between network namespaces. It would
        be worthwhile to investigate, but is outside of the scope of this
        investigation. ]
      
      [ It has also been pointed out that a similar issue exists with
        unreachable / fragmentation needed messages, which can be triggered by
        changing the MTU of eth1 in r1 to 1400 and running:
      
        ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2
      
        Some investigation points to raw_icmp_error() and raw_err() as being
        involved in this last scenario. The focus of this patch is TTL expired
        ICMP messages, which go through icmp_route_lookup.
        Investigation of failure modes related to raw_icmp_error() is beyond
        this investigation's scope. ]
      
      Fixes: 9d1a6c4e ("net: icmp_route_lookup should use rt dev to determine L3 domain")
      Link: https://tools.ietf.org/html/rfc792
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
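The preference order stated in the commit message (source device, then destination device, then the main table) can be written as a short Python sketch. The `NetDevice` class, its `l3mdev_master_ifindex` attribute and the function `icmp_lookup_table_index` are hypothetical stand-ins, not the kernel's structures or helpers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NetDevice:
    name: str
    l3mdev_master_ifindex: int  # 0 means the device is not enslaved to a VRF


def icmp_lookup_table_index(src_dev: Optional[NetDevice],
                            dst_dev: Optional[NetDevice]) -> int:
    """Pick the L3 domain used to route an ICMP error back to the sender."""
    if src_dev is not None:
        return src_dev.l3mdev_master_ifindex  # preferred: source device
    if dst_dev is not None:
        return dst_dev.l3mdev_master_ifindex  # fall back: destination device
    return 0                                  # else: main routing table


eth_in = NetDevice("eth0", l3mdev_master_ifindex=10)   # enslaved to one VRF
eth_out = NetDevice("eth1", l3mdev_master_ifindex=20)  # enslaved to another

print(icmp_lookup_table_index(eth_in, eth_out))
print(icmp_lookup_table_index(None, eth_out))
print(icmp_lookup_table_index(None, None))
```

The first call corresponds to the forwarding case that the TTL expired fix targets, where the error must leave through the VRF the packet arrived on.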
  2. 14 Oct, 2020 17 commits