An error occurred fetching the project authors.
  1. 31 Aug, 2020 1 commit
  2. 25 Aug, 2020 2 commits
  3. 04 Aug, 2020 1 commit
    • Stefano Brivio's avatar
      ipv4: route: Ignore output interface in FIB lookup for PMTU route · df23bb18
      Stefano Brivio authored
      Currently, processes sending traffic to a local bridge with an
      encapsulation device as a port don't get ICMP errors if they exceed
      the PMTU of the encapsulated link.
      
      David Ahern suggested this as a hack, but it actually looks like
      the correct solution: when we update the PMTU for a given destination
      by means of updating or creating a route exception, the encapsulation
      might trigger this because of PMTU discovery happening either on the
      encapsulation device itself, or its lower layer. This happens on
      bridged encapsulations only.
      
      The output interface shouldn't matter, because we already have a
      valid destination. Drop the output interface restriction from the
      associated route lookup.
      
      For UDP tunnels, we will now have a route exception created for the
      encapsulation itself, with a MTU value reflecting its headroom, which
      allows a bridge forwarding IP packets originated locally to deliver
      errors back to the sending socket.
      
      The behaviour is now consistent with IPv6 and verified with selftests
      pmtu_ipv{4,6}_br_{geneve,vxlan}{4,6}_exception introduced later in
      this series.
      
      v2:
      - reset output interface only for bridge ports (David Ahern)
      - add and use netif_is_any_bridge_port() helper (David Ahern)
      Suggested-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      df23bb18
  4. 28 Jun, 2020 1 commit
  5. 17 May, 2020 1 commit
  6. 09 May, 2020 1 commit
    • Paolo Abeni's avatar
      net: ipv4: really enforce backoff for redirects · 57644431
      Paolo Abeni authored
      In commit b406472b ("net: ipv4: avoid mixed n_redirects and
      rate_tokens usage") I missed the fact that a 0 'rate_tokens' will
      bypass the backoff algorithm.
      
      Since rate_tokens is cleared after a redirect silence, and never
      incremented on redirects, if the host keeps receiving packets
      requiring redirect it will reply ignoring the backoff.
      
      Additionally, the 'rate_last' field will be updated with the
      cadence of the ingress packet requiring redirect. If that rate is
      high enough, that will prevent the host from generating any
      other kind of ICMP messages
      
      The check for a zero 'rate_tokens' value was likely a shortcut
      to avoid the more complex backoff algorithm after a redirect
      silence period. Address the issue checking for 'n_redirects'
      instead, which is incremented on successful redirect, and
      does not interfere with other ICMP replies.
      
      Fixes: b406472b ("net: ipv4: avoid mixed n_redirects and rate_tokens usage")
      Reported-and-tested-by: default avatarColin Walters <walters@redhat.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      57644431
  7. 27 Apr, 2020 1 commit
  8. 24 Mar, 2020 1 commit
  9. 24 Feb, 2020 1 commit
  10. 04 Feb, 2020 1 commit
  11. 24 Jan, 2020 1 commit
  12. 15 Jan, 2020 2 commits
    • Ido Schimmel's avatar
      ipv4: Add "offload" and "trap" indications to routes · 90b93f1b
      Ido Schimmel authored
      When performing L3 offload, routes and nexthops are usually programmed
      into two different tables in the underlying device. Therefore, the fact
      that a nexthop resides in hardware does not necessarily mean that all
      the associated routes also reside in hardware and vice-versa.
      
      While the kernel can signal to user space the presence of a nexthop in
      hardware (via 'RTNH_F_OFFLOAD'), it does not have a corresponding flag
      for routes. In addition, the fact that a route resides in hardware does
      not necessarily mean that the traffic is offloaded. For example,
      unreachable routes (i.e., 'RTN_UNREACHABLE') are programmed to trap
      packets to the CPU so that the kernel will be able to generate the
      appropriate ICMP error packet.
      
      This patch adds an "offload" and "trap" indications to IPv4 routes, so
      that users will have better visibility into the offload process.
      
      'struct fib_alias' is extended with two new fields that indicate if the
      route resides in hardware or not and if it is offloading traffic from
      the kernel or trapping packets to it. Note that the new fields are added
      in the 6 bytes hole and therefore the struct still fits in a single
      cache line [1].
      
      Capable drivers are expected to invoke fib_alias_hw_flags_set() with the
      route's key in order to set the flags.
      
      The indications are dumped to user space via a new flags (i.e.,
      'RTM_F_OFFLOAD' and 'RTM_F_TRAP') in the 'rtm_flags' field in the
      ancillary header.
      
      v2:
      * Make use of 'struct fib_rt_info' in fib_alias_hw_flags_set()
      
      [1]
      struct fib_alias {
              struct hlist_node  fa_list;                      /*     0    16 */
              struct fib_info *          fa_info;              /*    16     8 */
              u8                         fa_tos;               /*    24     1 */
              u8                         fa_type;              /*    25     1 */
              u8                         fa_state;             /*    26     1 */
              u8                         fa_slen;              /*    27     1 */
              u32                        tb_id;                /*    28     4 */
              s16                        fa_default;           /*    32     2 */
              u8                         offload:1;            /*    34: 0  1 */
              u8                         trap:1;               /*    34: 1  1 */
              u8                         unused:6;             /*    34: 2  1 */
      
              /* XXX 5 bytes hole, try to pack */
      
              struct callback_head rcu __attribute__((__aligned__(8))); /*    40    16 */
      
              /* size: 56, cachelines: 1, members: 12 */
              /* sum members: 50, holes: 1, sum holes: 5 */
              /* sum bitfield members: 8 bits (1 bytes) */
              /* forced alignments: 1, forced holes: 1, sum forced holes: 5 */
              /* last cacheline: 56 bytes */
      } __attribute__((__aligned__(8)));
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Reviewed-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      90b93f1b
    • Ido Schimmel's avatar
      ipv4: Encapsulate function arguments in a struct · 1e301fd0
      Ido Schimmel authored
      fib_dump_info() is used to prepare RTM_{NEW,DEL}ROUTE netlink messages
      using the passed arguments. Currently, the function takes 11 arguments,
      6 of which are attributes of the route being dumped (e.g., prefix, TOS).
      
      The next patch will need the function to also dump to user space an
      indication if the route is present in hardware or not. Instead of
      passing yet another argument, change the function to take a struct
      containing the different route attributes.
      
      v2:
      * Name last argument of fib_dump_info()
      * Move 'struct fib_rt_info' to include/net/ip_fib.h so that it could
        later be passed to fib_alias_hw_flags_set()
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Reviewed-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1e301fd0
  13. 25 Dec, 2019 1 commit
    • Hangbin Liu's avatar
      net: add bool confirm_neigh parameter for dst_ops.update_pmtu · bd085ef6
      Hangbin Liu authored
      The MTU update code is supposed to be invoked in response to real
      networking events that update the PMTU. In IPv6 PMTU update function
      __ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor
      confirmed time.
      
      But for tunnel code, it will call pmtu before xmit, like:
        - tnl_update_pmtu()
          - skb_dst_update_pmtu()
            - ip6_rt_update_pmtu()
              - __ip6_rt_update_pmtu()
                - dst_confirm_neigh()
      
      If the tunnel remote dst mac address changed and we still do the neigh
      confirm, we will not be able to update neigh cache and ping6 remote
      will failed.
      
      So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we
      should not be invoking dst_confirm_neigh() as we have no evidence
      of successful two-way communication at this point.
      
      On the other hand it is also important to keep the neigh reachability fresh
      for TCP flows, so we cannot remove this dst_confirm_neigh() call.
      
      To fix the issue, we have to add a new bool parameter for dst_ops.update_pmtu
      to choose whether we should do neigh update or not. I will add the parameter
      in this patch and set all the callers to true to comply with the previous
      way, and fix the tunnel code one by one on later patches.
      
      v5: No change.
      v4: No change.
      v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
          dst_ops.update_pmtu to control whether we should do neighbor confirm.
          Also split the big patch to small ones for each area.
      v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.
      Suggested-by: default avatarDavid Miller <davem@davemloft.net>
      Reviewed-by: default avatarGuillaume Nault <gnault@redhat.com>
      Acked-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bd085ef6
  14. 21 Nov, 2019 1 commit
    • Paolo Abeni's avatar
      ipv4: use dst hint for ipv4 list receive · 02b24941
      Paolo Abeni authored
      This is alike the previous change, with some additional ipv4 specific
      quirk. Even when using the route hint we still have to do perform
      additional per packet checks about source address validity: a new
      helper is added to wrap them.
      
      Hints are explicitly disabled if the destination is a local broadcast,
      that keeps the code simple and local broadcast are a slower path anyway.
      
      UDP flood performances vs recvmmsg() receiver:
      
      vanilla		patched		delta
      Kpps		Kpps		%
      1683		1871		+11
      
      In the worst case scenario - each packet has a different
      destination address - the performance delta is within noise
      range.
      
      v3 -> v4:
       - re-enable hints for forward
      
      v2 -> v3:
       - really fix build (sic) and hint usage check
       - use fib4_has_custom_rules() helpers (David A.)
       - add ip_extract_route_hint() helper (Edward C.)
       - use prev skb as hint instead of copying data (Willem)
      
      v1 -> v2:
       - fix build issue with !CONFIG_IP_MULTIPLE_TABLES
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      02b24941
  15. 05 Nov, 2019 1 commit
  16. 17 Oct, 2019 2 commits
    • Wei Wang's avatar
      ipv4: fix race condition between route lookup and invalidation · 5018c596
      Wei Wang authored
      Jesse and Ido reported the following race condition:
      <CPU A, t0> - Received packet A is forwarded and cached dst entry is
      taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()
      
      <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
      from multiple ISPs"), route is added / deleted and rt_cache_flush() is
      called
      
      <CPU B, t2> - Received packet B tries to use the same cached dst entry
      from t0, but rt_cache_valid() is no longer true and it is replaced in
      rt_cache_route() by the newer one. This calls dst_dev_put() on the
      original dst entry which assigns the blackhole netdev to 'dst->dev'
      
      <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
      to 'dst->dev' being the blackhole netdev
      
      There are 2 issues in the v4 routing code:
      1. A per-netns counter is used to do the validation of the route. That
      means whenever a route is changed in the netns, users of all routes in
      the netns needs to redo lookup. v6 has an implementation of only
      updating fn_sernum for routes that are affected.
      2. When rt_cache_valid() returns false, rt_cache_route() is called to
      throw away the current cache, and create a new one. This seems
      unnecessary because as long as this route does not change, the route
      cache does not need to be recreated.
      
      To fully solve the above 2 issues, it probably needs quite some code
      changes and requires careful testing, and does not suite for net branch.
      
      So this patch only tries to add the deleted cached rt into the uncached
      list, so user could still be able to use it to receive packets until
      it's done.
      
      Fixes: 95c47f9c ("ipv4: call dst_dev_put() properly")
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Reported-by: default avatarIdo Schimmel <idosch@idosch.org>
      Reported-by: default avatarJesse Hathaway <jesse@mbuki-mvuki.org>
      Tested-by: default avatarJesse Hathaway <jesse@mbuki-mvuki.org>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Cc: David Ahern <dsahern@gmail.com>
      Reviewed-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5018c596
    • Stefano Brivio's avatar
      ipv4: Return -ENETUNREACH if we can't create route but saddr is valid · 595e0651
      Stefano Brivio authored
      ...instead of -EINVAL. An issue was found with older kernel versions
      while unplugging a NFS client with pending RPCs, and the wrong error
      code here prevented it from recovering once link is back up with a
      configured address.
      
      Incidentally, this is not an issue anymore since commit 4f8943f8
      ("SUNRPC: Replace direct task wakeups from softirq context"), included
      in 5.2-rc7, had the effect of decoupling the forwarding of this error
      by using SO_ERROR in xs_wake_error(), as pointed out by Benjamin
      Coddington.
      
      To the best of my knowledge, this isn't currently causing any further
      issue, but the error code doesn't look appropriate anyway, and we
      might hit this in other paths as well.
      
      In detail, as analysed by Gonzalo Siero, once the route is deleted
      because the interface is down, and can't be resolved and we return
      -EINVAL here, this ends up, courtesy of inet_sk_rebuild_header(),
      as the socket error seen by tcp_write_err(), called by
      tcp_retransmit_timer().
      
      In turn, tcp_write_err() indirectly calls xs_error_report(), which
      wakes up the RPC pending tasks with a status of -EINVAL. This is then
      seen by call_status() in the SUN RPC implementation, which aborts the
      RPC call calling rpc_exit(), instead of handling this as a
      potentially temporary condition, i.e. as a timeout.
      
      Return -EINVAL only if the input parameters passed to
      ip_route_output_key_hash_rcu() are actually invalid (this is the case
      if the specified source address is multicast, limited broadcast or
      all zeroes), but return -ENETUNREACH in all cases where, at the given
      moment, the given source address doesn't allow resolving the route.
      
      While at it, drop the initialisation of err to -ENETUNREACH, which
      was added to __ip_route_output_key() back then by commit
      0315e382 ("net: Fix behaviour of unreachable, blackhole and
      prohibit routes"), but actually had no effect, as it was, and is,
      overwritten by the fib_lookup() return code assignment, and anyway
      ignored in all other branches, including the if (fl4->saddr) one:
      I find this rather confusing, as it would look like -ENETUNREACH is
      the "default" error, while that statement has no effect.
      
      Also note that after commit fc75fc83 ("ipv4: dont create routes
      on down devices"), we would get -ENETUNREACH if the device is down,
      but -EINVAL if the source address is specified and we can't resolve
      the route, and this appears to be rather inconsistent.
      Reported-by: default avatarStefan Walter <walteste@inf.ethz.ch>
      Analysed-by: default avatarBenjamin Coddington <bcodding@redhat.com>
      Analysed-by: default avatarGonzalo Siero <gsierohu@redhat.com>
      Signed-off-by: default avatarStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      595e0651
  17. 05 Oct, 2019 1 commit
    • Paolo Abeni's avatar
      net: ipv4: avoid mixed n_redirects and rate_tokens usage · b406472b
      Paolo Abeni authored
      Since commit c09551c6 ("net: ipv4: use a dedicated counter
      for icmp_v4 redirect packets") we use 'n_redirects' to account
      for redirect packets, but we still use 'rate_tokens' to compute
      the redirect packets exponential backoff.
      
      If the device sent to the relevant peer any ICMP error packet
      after sending a redirect, it will also update 'rate_token' according
      to the leaking bucket schema; typically 'rate_token' will raise
      above BITS_PER_LONG and the redirect packets backoff algorithm
      will produce undefined behavior.
      
      Fix the issue using 'n_redirects' to compute the exponential backoff
      in ip_rt_send_redirect().
      
      Note that we still clear rate_tokens after a redirect silence period,
      to avoid changing an established behaviour.
      
      The root cause predates git history; before the mentioned commit in
      the critical scenario, the kernel stopped sending redirects, after
      the mentioned commit the behavior more randomic.
      Reported-by: default avatarXiumei Mu <xmu@redhat.com>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Fixes: c09551c6 ("net: ipv4: use a dedicated counter for icmp_v4 redirect packets")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b406472b
  18. 21 Sep, 2019 1 commit
  19. 24 Aug, 2019 1 commit
    • John Fastabend's avatar
      net: route dump netlink NLM_F_MULTI flag missing · e93fb3e9
      John Fastabend authored
      An excerpt from netlink(7) man page,
      
        In multipart messages (multiple nlmsghdr headers with associated payload
        in one byte stream) the first and all following headers have the
        NLM_F_MULTI flag set, except for the last  header  which  has the type
        NLMSG_DONE.
      
      but, after (ee28906f) there is a missing NLM_F_MULTI flag in the middle of a
      FIB dump. The result is user space applications following above man page
      excerpt may get confused and may stop parsing msg believing something went
      wrong.
      
      In the golang netlink lib [0] the library logic stops parsing believing the
      message is not a multipart message. Found this running Cilium[1] against
      net-next while adding a feature to auto-detect routes. I noticed with
      multiple route tables we no longer could detect the default routes on net
      tree kernels because the library logic was not returning them.
      
      Fix this by handling the fib_dump_info_fnhe() case the same way the
      fib_dump_info() handles it by passing the flags argument through the
      call chain and adding a flags argument to rt_fill_info().
      
      Tested with Cilium stack and auto-detection of routes works again. Also
      annotated libs to dump netlink msgs and inspected NLM_F_MULTI and
      NLMSG_DONE flags look correct after this.
      
      Note: In inet_rtm_getroute() pass rt_fill_info() '0' for flags the same
      as is done for fib_dump_info() so this looks correct to me.
      
      [0] https://github.com/vishvananda/netlink/
      [1] https://github.com/cilium/
      
      Fixes: ee28906f ("ipv4: Dump route exceptions if requested")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Reviewed-by: default avatarStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e93fb3e9
  20. 08 Jul, 2019 1 commit
  21. 05 Jul, 2019 1 commit
    • Ido Schimmel's avatar
      ipv4: Fix NULL pointer dereference in ipv4_neigh_lookup() · 537de0c8
      Ido Schimmel authored
      Both ip_neigh_gw4() and ip_neigh_gw6() can return either a valid pointer
      or an error pointer, but the code currently checks that the pointer is
      not NULL.
      
      Fix this by checking that the pointer is not an error pointer, as this
      can result in a NULL pointer dereference [1]. Specifically, I believe
      that what happened is that ip_neigh_gw4() returned '-EINVAL'
      (0xffffffffffffffea) to which the offset of 'refcnt' (0x70) was added,
      which resulted in the address 0x000000000000005a.
      
      [1]
       BUG: KASAN: null-ptr-deref in refcount_inc_not_zero_checked+0x6e/0x180
       Read of size 4 at addr 000000000000005a by task swapper/2/0
      
       CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.2.0-rc6-custom-reg-179657-gaa32d89 #396
       Hardware name: Mellanox Technologies Ltd. MSN2010/SA002610, BIOS 5.6.5 08/24/2017
       Call Trace:
       <IRQ>
       dump_stack+0x73/0xbb
       __kasan_report+0x188/0x1ea
       kasan_report+0xe/0x20
       refcount_inc_not_zero_checked+0x6e/0x180
       ipv4_neigh_lookup+0x365/0x12c0
       __neigh_update+0x1467/0x22f0
       arp_process.constprop.6+0x82e/0x1f00
       __netif_receive_skb_one_core+0xee/0x170
       process_backlog+0xe3/0x640
       net_rx_action+0x755/0xd90
       __do_softirq+0x29b/0xae7
       irq_exit+0x177/0x1c0
       smp_apic_timer_interrupt+0x164/0x5e0
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      
      Fixes: 5c9f7c1d ("ipv4: Add helpers for neigh lookup for nexthop")
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Reported-by: default avatarShalom Toledo <shalomt@mellanox.com>
      Reviewed-by: default avatarJiri Pirko <jiri@mellanox.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      537de0c8
  22. 02 Jul, 2019 1 commit
  23. 28 Jun, 2019 1 commit
  24. 26 Jun, 2019 2 commits
    • Stephen Suryaputra's avatar
      ipv4: reset rt_iif for recirculated mcast/bcast out pkts · 5b18f128
      Stephen Suryaputra authored
      Multicast or broadcast egress packets have rt_iif set to the oif. These
      packets might be recirculated back as input and lookup to the raw
      sockets may fail because they are bound to the incoming interface
      (skb_iif). If rt_iif is not zero, during the lookup, inet_iif() function
      returns rt_iif instead of skb_iif. Hence, the lookup fails.
      
      v2: Make it non vrf specific (David Ahern). Reword the changelog to
          reflect it.
      Signed-off-by: default avatarStephen Suryaputra <ssuryaextr@gmail.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5b18f128
    • Eric Dumazet's avatar
      ipv4: fix suspicious RCU usage in fib_dump_info_fnhe() · 93ed54b1
      Eric Dumazet authored
      sysbot reported that we lack appropriate rcu_read_lock()
      protection in fib_dump_info_fnhe()
      
      net/ipv4/route.c:2875 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by syz-executor609/8966:
       #0: 00000000b7dbe288 (rtnl_mutex){+.+.}, at: netlink_dump+0xe7/0xfb0 net/netlink/af_netlink.c:2199
      
      stack backtrace:
      CPU: 0 PID: 8966 Comm: syz-executor609 Not tainted 5.2.0-rc5+ #43
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5250
       fib_dump_info_fnhe+0x9d9/0x1080 net/ipv4/route.c:2875
       fn_trie_dump_leaf net/ipv4/fib_trie.c:2141 [inline]
       fib_table_dump+0x64a/0xd00 net/ipv4/fib_trie.c:2175
       inet_dump_fib+0x83c/0xa90 net/ipv4/fib_frontend.c:1004
       rtnl_dump_all+0x295/0x490 net/core/rtnetlink.c:3445
       netlink_dump+0x558/0xfb0 net/netlink/af_netlink.c:2244
       __netlink_dump_start+0x5b1/0x7d0 net/netlink/af_netlink.c:2352
       netlink_dump_start include/linux/netlink.h:226 [inline]
       rtnetlink_rcv_msg+0x73d/0xb00 net/core/rtnetlink.c:5182
       netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
       rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5237
       netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
       netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
       netlink_sendmsg+0x8ae/0xd70 net/netlink/af_netlink.c:1917
       sock_sendmsg_nosec net/socket.c:646 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:665
       sock_write_iter+0x27c/0x3e0 net/socket.c:994
       call_write_iter include/linux/fs.h:1872 [inline]
       new_sync_write+0x4d3/0x770 fs/read_write.c:483
       __vfs_write+0xe1/0x110 fs/read_write.c:496
       vfs_write+0x20c/0x580 fs/read_write.c:558
       ksys_write+0x14f/0x290 fs/read_write.c:611
       __do_sys_write fs/read_write.c:623 [inline]
       __se_sys_write fs/read_write.c:620 [inline]
       __x64_sys_write+0x73/0xb0 fs/read_write.c:620
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4401b9
      Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffc8e134978 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401b9
      RDX: 000000000000001c RSI: 0000000020000000 RDI: 0000000000000003
      RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
      R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000401a40
      R13: 0000000000401ad0 R14: 0000000000000000 R15: 0000000000000000
      
      Fixes: ee28906f ("ipv4: Dump route exceptions if requested")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Stefano Brivio <sbrivio@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Reviewed-by: default avatarStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      93ed54b1
  25. 24 Jun, 2019 2 commits
    • Stefano Brivio's avatar
      ipv4: Dump route exceptions if requested · ee28906f
      Stefano Brivio authored
      Since commit 4895c771 ("ipv4: Add FIB nexthop exceptions."), cached
      exception routes are stored as a separate entity, so they are not dumped
      on a FIB dump, even if the RTM_F_CLONED flag is passed.
      
      This implies that the command 'ip route list cache' doesn't return any
      result anymore.
      
      If the RTM_F_CLONED is passed, and strict checking requested, retrieve
      nexthop exception routes and dump them. If no strict checking is
      requested, filtering can't be performed consistently: dump everything in
      that case.
      
      With this, we need to add an argument to the netlink callback in order to
      track how many entries were already dumped for the last leaf included in
      a partial netlink dump.
      
      A single additional argument is sufficient, even if we traverse logically
      nested structures (nexthop objects, hash table buckets, bucket chains): it
      doesn't matter if we stop in the middle of any of those, because they are
      always traversed the same way. As an example, s_i values in [], s_fa
      values in ():
      
        node (fa) #1 [1]
          nexthop #1
          bucket #1 -> #0 in chain (1)
          bucket #2 -> #0 in chain (2) -> #1 in chain (3) -> #2 in chain (4)
          bucket #3 -> #0 in chain (5) -> #1 in chain (6)
      
          nexthop #2
          bucket #1 -> #0 in chain (7) -> #1 in chain (8)
          bucket #2 -> #0 in chain (9)
        --
        node (fa) #2 [2]
          nexthop #1
          bucket #1 -> #0 in chain (1) -> #1 in chain (2)
          bucket #2 -> #0 in chain (3)
      
      it doesn't matter if we stop at (3), (4), (7) for "node #1", or at (2)
      for "node #2": walking flattens all that.
      
      It would even be possible to drop the distinction between the in-tree
      (s_i) and in-node (s_fa) counter, but a further improvement might
      advise against this. This is only as accurate as the existing tracking
      mechanism for leaves: if a partial dump is restarted after exceptions
      are removed or expired, we might skip some non-dumped entries.
      
      To improve this, we could attach a 'sernum' attribute (similar to the
      one used for IPv6) to nexthop entities, and bump this counter whenever
      exceptions change: having a distinction between the two counters would
      make this more convenient.
      
      Listing of exception routes (modified routes pre-3.5) was tested against
      these versions of kernel and iproute2:
      
                          iproute2
      kernel         4.14.0   4.15.0   4.19.0   5.0.0   5.1.0
       3.5-rc4         +        +        +        +       +
       4.4
       4.9
       4.14
       4.15
       4.19
       5.0
       5.1
       fixed           +        +        +        +       +
      
      v7:
         - Move loop over nexthop objects to route.c, and pass struct fib_info
           and table ID to it, not a struct fib_alias (suggested by David Ahern)
         - While at it, note that the NULL check on fa->fa_info is redundant,
           and the check on RTNH_F_DEAD is also not consistent with what's done
           with regular route listing: just keep it for nhc_flags
         - Rename entry point function for dumping exceptions to
           fib_dump_info_fnhe(), and rearrange arguments for consistency with
           fib_dump_info()
         - Rename fnhe_dump_buckets() to fnhe_dump_bucket() and make it handle
           one bucket at a time
         - Expand commit message to describe why we can have a single "skip"
           counter for all exceptions stored in bucket chains in nexthop objects
           (suggested by David Ahern)
      
      v6:
         - Rebased onto net-next
         - Loop over nexthop paths too. Move loop over fnhe buckets to route.c,
           avoids need to export rt_fill_info() and to touch exceptions from
           fib_trie.c. Pass NULL as flow to rt_fill_info(), it now allows that
           (suggested by David Ahern)
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions.")
      Signed-off-by: default avatarStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ee28906f
    • Stefano Brivio's avatar
      ipv4/route: Allow NULL flowinfo in rt_fill_info() · d948974c
      Stefano Brivio authored
      In the next patch, we're going to use rt_fill_info() to dump exception
      routes upon RTM_GETROUTE with NLM_F_ROOT, meaning userspace is requesting
      a dump and not a specific route selection, which in turn implies the input
      interface is not relevant. Update rt_fill_info() to handle a NULL
      flowinfo.
      
      v7: If fl4 is NULL, explicitly set r->rtm_tos to 0: it's not initialised
          otherwise (spotted by David Ahern)
      
      v6: New patch
      Suggested-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d948974c
  26. 15 Jun, 2019 1 commit
  27. 05 Jun, 2019 3 commits
    • Xin Long's avatar
      ipv4: not do cache for local delivery if bc_forwarding is enabled · 0a90478b
      Xin Long authored
      With the topo:
      
          h1 ---| rp1            |
                |     route  rp3 |--- h3 (192.168.200.1)
          h2 ---| rp2            |
      
      If rp1 bc_forwarding is set while rp2 bc_forwarding is not, after
      doing "ping 192.168.200.255" on h1, then ping 192.168.200.255 on
      h2, and the packets can still be forwared.
      
      This issue was caused by the input route cache. It should only do
      the cache for either bc forwarding or local delivery. Otherwise,
      local delivery can use the route cache for bc forwarding of other
      interfaces.
      
      This patch is to fix it by not doing cache for local delivery if
      all.bc_forwarding is enabled.
      
      Note that we don't fix it by checking route cache local flag after
      rt_cache_valid() in "local_input:" and "ip_mkroute_input", as the
      common route code shouldn't be touched for bc_forwarding.
      
      Fixes: 5cbf777c ("route: add support for directed broadcast forwarding")
      Reported-by: default avatarJianlin Shi <jishi@redhat.com>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0a90478b
    • David Ahern's avatar
      ipv4: Prepare for fib6_nh from a nexthop object · dcb1ecb5
      David Ahern authored
      Convert more IPv4 code to use fib_nh_common over fib_nh to enable routes
      to use a fib6_nh based nexthop. In the end, only code not using a
      nexthop object in a fib_info should directly access fib_nh in a fib_info
      without checking the famiy and going through fib_nh_common. Those
      functions will be marked when it is not directly evident.
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      dcb1ecb5
    • David Ahern's avatar
      ipv4: Use accessors for fib_info nexthop data · 5481d73f
      David Ahern authored
      Use helpers to access fib_nh and fib_nhs fields of a fib_info. Drop the
      fib_dev macro which is an alias for the first nexthop. Replacements:
      
        fi->fib_dev    --> fib_info_nh(fi, 0)->fib_nh_dev
        fi->fib_nh     --> fib_info_nh(fi, 0)
        fi->fib_nh[i]  --> fib_info_nh(fi, i)
        fi->fib_nhs    --> fib_info_num_path(fi)
      
      where fib_info_nh(fi, i) returns fi->fib_nh[nhsel] and fib_info_num_path
      returns fi->fib_nhs.
      
      Move the existing fib_info_nhc to nexthop.h and define the new ones
      there. A later patch adds a check if a fib_info uses a nexthop object,
      and defining the helpers in nexthop.h avoid circular header
      dependencies.
      
      After this all remaining open coded references to fi->fib_nhs and
      fi->fib_nh are in:
      - fib_create_info and helpers used to lookup an existing fib_info
        entry, and
      - the netdev event functions fib_sync_down_dev and fib_sync_up.
      
      The latter two will not be reused for nexthops, and the fib_create_info
      will be updated to handle a nexthop in a fib_info.
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5481d73f
  28. 30 May, 2019 1 commit
  29. 05 May, 2019 3 commits
  30. 27 Apr, 2019 1 commit
    • Johannes Berg's avatar
      netlink: make validation more configurable for future strictness · 8cb08174
      Johannes Berg authored
      We currently have two levels of strict validation:
      
       1) liberal (default)
           - undefined (type >= max) & NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
           - garbage at end of message accepted
       2) strict (opt-in)
           - NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
      
      Split out parsing strictness into four different options:
       * TRAILING     - check that there's no trailing data after parsing
                        attributes (in message or nested)
       * MAXTYPE      - reject attrs > max known type
       * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
       * STRICT_ATTRS - strictly validate attribute size
      
      The default for future things should be *everything*.
      The current *_strict() is a combination of TRAILING and MAXTYPE,
      and is renamed to _deprecated_strict().
      The current regular parsing has none of this, and is renamed to
      *_parse_deprecated().
      
      Additionally it allows us to selectively set one of the new flags
      even on old policies. Notably, the UNSPEC flag could be useful in
      this case, since it can be arranged (by filling in the policy) to
      not be an incompatible userspace ABI change, but would then going
      forward prevent forgetting attribute entries. Similar can apply
      to the POLICY flag.
      
      We end up with the following renames:
       * nla_parse           -> nla_parse_deprecated
       * nla_parse_strict    -> nla_parse_deprecated_strict
       * nlmsg_parse         -> nlmsg_parse_deprecated
       * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
       * nla_parse_nested    -> nla_parse_nested_deprecated
       * nla_validate_nested -> nla_validate_nested_deprecated
      
      Using spatch, of course:
          @@
          expression TB, MAX, HEAD, LEN, POL, EXT;
          @@
          -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
          +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression TB, MAX, NLA, POL, EXT;
          @@
          -nla_parse_nested(TB, MAX, NLA, POL, EXT)
          +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)
      
          @@
          expression START, MAX, POL, EXT;
          @@
          -nla_validate_nested(START, MAX, POL, EXT)
          +nla_validate_nested_deprecated(START, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, MAX, POL, EXT;
          @@
          -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
          +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)
      
      For this patch, don't actually add the strict, non-renamed versions
      yet so that it breaks compile if I get it wrong.
      
      Also, while at it, make nla_validate and nla_parse go down to a
      common __nla_validate_parse() function to avoid code duplication.
      
      Ultimately, this allows us to have very strict validation for every
      new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
      next patch, while existing things will continue to work as is.
      
      In effect then, this adds fully strict validation for any new command.
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8cb08174
  31. 24 Apr, 2019 1 commit
    • Eric Dumazet's avatar
      ipv4: add sanity checks in ipv4_link_failure() · 20ff83f1
      Eric Dumazet authored
      Before calling __ip_options_compile(), we need to ensure the network
      header is a an IPv4 one, and that it is already pulled in skb->head.
      
      RAW sockets going through a tunnel can end up calling ipv4_link_failure()
      with total garbage in the skb, or arbitrary lengthes.
      
      syzbot report :
      
      BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:355 [inline]
      BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123
      Write of size 69 at addr ffff888096abf068 by task syz-executor.4/9204
      
      CPU: 0 PID: 9204 Comm: syz-executor.4 Not tainted 5.1.0-rc5+ #77
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
       kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       check_memory_region_inline mm/kasan/generic.c:185 [inline]
       check_memory_region+0x123/0x190 mm/kasan/generic.c:191
       memcpy+0x38/0x50 mm/kasan/common.c:133
       memcpy include/linux/string.h:355 [inline]
       __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123
       __icmp_send+0x725/0x1400 net/ipv4/icmp.c:695
       ipv4_link_failure+0x29f/0x550 net/ipv4/route.c:1204
       dst_link_failure include/net/dst.h:427 [inline]
       vti6_xmit net/ipv6/ip6_vti.c:514 [inline]
       vti6_tnl_xmit+0x10d4/0x1c0c net/ipv6/ip6_vti.c:553
       __netdev_start_xmit include/linux/netdevice.h:4414 [inline]
       netdev_start_xmit include/linux/netdevice.h:4423 [inline]
       xmit_one net/core/dev.c:3292 [inline]
       dev_hard_start_xmit+0x1b2/0x980 net/core/dev.c:3308
       __dev_queue_xmit+0x271d/0x3060 net/core/dev.c:3878
       dev_queue_xmit+0x18/0x20 net/core/dev.c:3911
       neigh_direct_output+0x16/0x20 net/core/neighbour.c:1527
       neigh_output include/net/neighbour.h:508 [inline]
       ip_finish_output2+0x949/0x1740 net/ipv4/ip_output.c:229
       ip_finish_output+0x73c/0xd50 net/ipv4/ip_output.c:317
       NF_HOOK_COND include/linux/netfilter.h:278 [inline]
       ip_output+0x21f/0x670 net/ipv4/ip_output.c:405
       dst_output include/net/dst.h:444 [inline]
       NF_HOOK include/linux/netfilter.h:289 [inline]
       raw_send_hdrinc net/ipv4/raw.c:432 [inline]
       raw_sendmsg+0x1d2b/0x2f20 net/ipv4/raw.c:663
       inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:651 [inline]
       sock_sendmsg+0xdd/0x130 net/socket.c:661
       sock_write_iter+0x27c/0x3e0 net/socket.c:988
       call_write_iter include/linux/fs.h:1866 [inline]
       new_sync_write+0x4c7/0x760 fs/read_write.c:474
       __vfs_write+0xe4/0x110 fs/read_write.c:487
       vfs_write+0x20c/0x580 fs/read_write.c:549
       ksys_write+0x14f/0x2d0 fs/read_write.c:599
       __do_sys_write fs/read_write.c:611 [inline]
       __se_sys_write fs/read_write.c:608 [inline]
       __x64_sys_write+0x73/0xb0 fs/read_write.c:608
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x458c29
      Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f293b44bc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458c29
      RDX: 0000000000000014 RSI: 00000000200002c0 RDI: 0000000000000003
      RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f293b44c6d4
      R13: 00000000004c8623 R14: 00000000004ded68 R15: 00000000ffffffff
      
      The buggy address belongs to the page:
      page:ffffea00025aafc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
      flags: 0x1fffc0000000000()
      raw: 01fffc0000000000 0000000000000000 ffffffff025a0101 0000000000000000
      raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff888096abef80: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2
       ffff888096abf000: f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00
      >ffff888096abf080: 00 00 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
                               ^
       ffff888096abf100: 00 00 00 00 f1 f1 f1 f1 00 00 f3 f3 00 00 00 00
       ffff888096abf180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      
      Fixes: ed0de45a ("ipv4: recompile ip options in ipv4_link_failure")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      20ff83f1