1. 10 Jan, 2018 2 commits
  2. 08 Jan, 2018 21 commits
    • bpf: fix verifier GPF in kmalloc failure path · 5896351e
      Alexei Starovoitov authored
      syzbot reported the following panic in the verifier triggered
      by kmalloc error injection:
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      RIP: 0010:copy_func_state kernel/bpf/verifier.c:403 [inline]
      RIP: 0010:copy_verifier_state+0x364/0x590 kernel/bpf/verifier.c:431
      Call Trace:
       pop_stack+0x8c/0x270 kernel/bpf/verifier.c:449
       push_stack kernel/bpf/verifier.c:491 [inline]
       check_cond_jmp_op kernel/bpf/verifier.c:3598 [inline]
       do_check+0x4b60/0xa050 kernel/bpf/verifier.c:4731
       bpf_check+0x3296/0x58c0 kernel/bpf/verifier.c:5489
       bpf_prog_load+0xa2a/0x1b00 kernel/bpf/syscall.c:1198
       SYSC_bpf kernel/bpf/syscall.c:1807 [inline]
       SyS_bpf+0x1044/0x4420 kernel/bpf/syscall.c:1769
      
      When copy_verifier_state() aborts in the middle due to a kmalloc failure,
      some of the frames could have been only partially copied, while the
      current free_verifier_state() loop
      for (i = 0; i <= state->curframe; i++)
      assumed that all frames are non-null.
      Simply fix it by adding 'if (!state)' to free_func_state().
      Also, to avoid stressing the copy frame logic further, if kzalloc fails
      in push_stack(), free env->cur_state right away.
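
The failure mode described above can be sketched in user space as follows. This is a minimal illustration, not the kernel code: the struct layouts, MAX_FRAMES, and the return value of free_verifier_state() are assumptions made for the sketch; only the 'if (!state)' check and the loop bound come from the commit message.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_FRAMES 8                 /* hypothetical limit for this sketch */

struct func_state {                  /* stand-in for one call frame's state */
    int *stack;                      /* per-frame allocation to release */
};

struct verifier_state {              /* stand-in for the whole verifier state */
    struct func_state *frame[MAX_FRAMES];
    int curframe;
};

/* Tolerates a NULL frame, mirroring the 'if (!state)' check in the fix. */
static void free_func_state(struct func_state *state)
{
    if (!state)
        return;
    free(state->stack);
    free(state);
}

/*
 * Frees frames 0..curframe. Slots that an aborted copy never filled are
 * still NULL and are skipped safely. Returns how many frames were freed
 * (the return value exists only to make the sketch easy to check).
 */
static int free_verifier_state(struct verifier_state *state)
{
    int freed = 0;

    assert(state != NULL);
    for (int i = 0; i <= state->curframe; i++) {
        if (state->frame[i])
            freed++;
        free_func_state(state->frame[i]);
        state->frame[i] = NULL;
    }
    return freed;
}
```

With curframe set to 2 but only frame[0] allocated, the loop visits two NULL slots and frees a single frame instead of dereferencing a NULL pointer.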
      
      Fixes: f4d7e40a ("bpf: introduce function calls (verification)")
      Reported-by: syzbot+32ac5a3e473f2e01cfc7@syzkaller.appspotmail.com
      Reported-by: syzbot+fa99e24f3c29d269a7d5@syzkaller.appspotmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • Merge branch 'ipv6-ipv4-nexthop-align' · f66faae2
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      ipv6: Align nexthop behaviour with IPv4
      
      This set tries to eliminate some differences between IPv4's and IPv6's
      treatment of nexthops. These differences are most likely a side effect
      of IPv6's data structures (specifically 'rt6_info') that incorporate
      both the route and the nexthop and the late addition of ECMP support in
      commit 51ebd318 ("ipv6: add support of equal cost multipath
      (ECMP)").
      
      IPv4 and IPv6 do not react the same to certain netdev events. For
      example, upon carrier change affected IPv4 nexthops are marked using the
      RTNH_F_LINKDOWN flag and the nexthop group is rebalanced accordingly.
      IPv6, on the other hand, does nothing, which forces us to perform a
      carrier check during route lookup and dump. This makes it difficult to
      introduce features such as non-equal-cost multipath that are built on
      top of this set [1].
      
      In addition, when a netdev is put administratively down IPv4 nexthops
      are marked using the RTNH_F_DEAD flag, whereas IPv6 simply flushes all
      the routes using these nexthops. To be consistent with IPv4, multipath
      routes should only be flushed when all nexthops in the group are
      considered dead.
      
      The first 12 patches introduce non-functional changes that store the
      RTNH_F_DEAD and RTNH_F_LINKDOWN flags in IPv6 routes based on netdev
      events, in a similar fashion to IPv4. This allows us to remove the
      carrier check performed during route lookup and dump.
      
      The next three patches make sure we only flush a multipath route when
      all of its nexthops are dead.
      
      The last three patches add test cases for IPv4/IPv6 FIB. These verify that
      both address families react similarly to netdev events.
      
      Finally, this series also serves as a good first step towards David
      Ahern's goal of treating nexthops as standalone objects [2], as it makes
      the code more in line with IPv4 where the nexthop and the nexthop group
      are separate objects from the route itself.
      
      1. https://github.com/idosch/linux/tree/ipv6-nexthops
      2. http://vger.kernel.org/netconf2017_files/nexthop-objects.pdf
      
      Changes since RFC (feedback from David Ahern):
      * Remove redundant declaration of rt6_ifdown() in patch 4 and adjust
      comment referencing it accordingly
      * Drop patch to flush multipath routes upon NETDEV_UNREGISTER. Reword
      cover letter accordingly
      * Use a temporary variable to make code more readable in patch 15
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: fib_tests: Add test cases for netdev carrier change · 82e45b6f
      Ido Schimmel authored
      Check that IPv4 and IPv6 react the same when the carrier of a netdev is
      toggled. Local routes should not be affected by this, whereas unicast
      routes should.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: fib_tests: Add test cases for netdev down · 5adb7683
      Ido Schimmel authored
      Check that IPv4 and IPv6 react the same when a netdev is being put
      administratively down.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: fib_tests: Add test cases for IPv4/IPv6 FIB · 607bd2e5
      Ido Schimmel authored
      Add test cases to check that IPv4 and IPv6 react to a netdev being
      unregistered as expected.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Flush multipath routes when all siblings are dead · 1de178ed
      Ido Schimmel authored
      By default, IPv6 deletes nexthops from a multipath route when the
      nexthop device is put administratively down. This differs from IPv4
      where the nexthops are kept, but marked with the RTNH_F_DEAD flag. A
      multipath route is flushed when all of its nexthops become dead.
      
      Align IPv6 with IPv4 and have it conform to the same guidelines.
      
      In case the multipath route needs to be flushed, its siblings are
      flushed one by one. Otherwise, the nexthops are marked with the
      appropriate flags and the tree walker is instructed to skip all the
      siblings.
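
The decision described above can be sketched as follows. This is a user-space illustration under stated assumptions: struct nexthop and multipath_dev_down() are hypothetical names, and the two flag macros are local copies whose values mirror RTNH_F_DEAD and RTNH_F_LINKDOWN from <linux/rtnetlink.h>. For simplicity the sketch marks the nexthops first and then reports whether the whole route should be flushed; the kernel patch may order these steps differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local copies; values mirror RTNH_F_DEAD and RTNH_F_LINKDOWN. */
#define NH_F_DEAD     1u
#define NH_F_LINKDOWN 16u

struct nexthop {                 /* hypothetical, not a kernel type */
    int ifindex;                 /* nexthop device */
    unsigned int flags;
};

/*
 * Handles "device down_ifindex was put administratively down" for one
 * multipath route. Nexthops using that device are marked dead; the
 * function returns true only when every nexthop in the group is dead,
 * which is the point at which the route should be flushed.
 */
static bool multipath_dev_down(struct nexthop *nh, size_t n, int down_ifindex)
{
    size_t dead = 0;

    assert(nh != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        if (nh[i].ifindex == down_ifindex)
            nh[i].flags |= NH_F_DEAD | NH_F_LINKDOWN;
        if (nh[i].flags & NH_F_DEAD)
            dead++;
    }
    return dead == n;
}
```

For a two-nexthop route, taking the first device down only marks that nexthop and keeps the route; taking the second device down as well makes the function return true.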
      
      As explained in previous patches, care is taken to update the sernum of
      the affected tree nodes, so as to prevent the use of wrong dst entries.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Take table lock outside of sernum update function · 922c2ac8
      Ido Schimmel authored
      The next patch is going to allow dead routes to remain in the FIB tree
      in certain situations.
      
      When this happens we need to be sure to bump the sernum of the nodes
      where these are stored so that potential copies cached in sockets are
      invalidated.
      
      The function that performs this update assumes the table lock is not
      taken when it is invoked, but that will not be the case when it is
      invoked by the tree walker.
      
      Have the function assume the lock is taken and make the single caller
      take the lock itself.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Export sernum update function · 4a8e56ee
      Ido Schimmel authored
      We are going to allow dead routes to stay in the FIB tree (e.g., when
      they are part of a multipath route, directly connected route with no
      carrier) and revive them when their nexthop device gains carrier or when
      it is put administratively up.
      
      This is equivalent to the addition of the route to the FIB tree and we
      should therefore take care of updating the sernum of all the parent
      nodes of the node where the route is stored. Otherwise, we risk sockets
      caching and using sub-optimal dst entries.
      
      Export the function that performs the above, so that it could be invoked
      from fib6_ifup() later on.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Teach tree walker to skip multipath routes · b5cb5a75
      Ido Schimmel authored
      As explained in previous patch, fib6_ifdown() needs to consider the
      state of all the sibling routes when a multipath route is traversed.
      
      This is done by evaluating all the siblings when the first sibling in a
      multipath route is traversed. If the multipath route does not need to be
      flushed (e.g., not all siblings are dead), then we should just skip the
      multipath route as our work is done.
      
      Have the tree walker jump to the last sibling when it is determined that
      the multipath route needs to be skipped.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Add explicit flush indication to routes · a2c554d3
      Ido Schimmel authored
      When routes that are a part of a multipath route are evaluated by
      fib6_ifdown() in response to NETDEV_DOWN and NETDEV_UNREGISTER events
      the state of their sibling routes is not considered.
      
      This will change in subsequent patches in order to align IPv6 with
      IPv4's behavior. For example, when the last sibling in a multipath route
      becomes dead, the entire multipath route needs to be removed.
      
      To prevent the tree walker from re-evaluating all the sibling routes
      each time, we can simply evaluate them once - when the first sibling is
      traversed.
      
      If we determine the entire multipath route needs to be removed, then the
      'should_flush' bit is set in all the siblings, which will cause the
      walker to flush them when it traverses them.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Report dead flag during route dump · f9d882ea
      Ido Schimmel authored
      Up until now the RTNH_F_DEAD flag was only reported in route dump when
      the 'ignore_routes_with_linkdown' sysctl was set. This is expected as
      dead routes were flushed otherwise.
      
      The reliance on this sysctl is going to be removed, so we need to report
      the flag regardless of the sysctl's value.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Ignore dead routes during lookup · 8067bb8c
      Ido Schimmel authored
      Currently, dead routes are only present in the routing tables in case
      the 'ignore_routes_with_linkdown' sysctl is set. Otherwise, they are
      flushed.
      
      Subsequent patches are going to remove the reliance on this sysctl and
      make IPv6 more consistent with IPv4.
      
      Before this is done, we need to make sure dead routes are skipped during
      route lookup, so as to not cause packet loss.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Check nexthop flags in route dump instead of carrier · 44c9f2f2
      Ido Schimmel authored
      Similar to previous patch, there is no need to check for the carrier of
      the nexthop device when dumping the route and we can instead check for
      the presence of the RTNH_F_LINKDOWN flag.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Check nexthop flags during route lookup instead of carrier · 14c5206c
      Ido Schimmel authored
      Now that the RTNH_F_LINKDOWN flag is set in nexthops, we can avoid the
      need to dereference the nexthop device and check its carrier and instead
      check for the presence of the flag.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Set nexthop flags during route creation · 5609b80a
      Ido Schimmel authored
      It is valid to install routes with a nexthop device that does not have a
      carrier, so we need to make sure they're marked accordingly.
      
      As explained in the previous patch, host and anycast routes are never
      marked with the 'linkdown' flag.
      
      Note that reject routes are unaffected, as these use the loopback device
      which always has a carrier.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Set nexthop flags upon carrier change · 27c6fa73
      Ido Schimmel authored
      Similar to IPv4, when the carrier of a netdev changes we should toggle
      the 'linkdown' flag on all the nexthops using it as their nexthop
      device.
      
      This will later allow us to test for the presence of this flag during
      route lookup and dump.
      
      Up until commit 4832c30d ("net: ipv6: put host and anycast routes on
      device with address") host and anycast routes used the loopback netdev
      as their nexthop device and thus were not marked with the 'linkdown'
      flag. The patch preserves this behavior and allows one to ping the local
      address even when the nexthop device does not have a carrier and the
      'ignore_routes_with_linkdown' sysctl is set.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Prepare to handle multiple netdev events · 4c981e28
      Ido Schimmel authored
      To make IPv6 more in line with IPv4 we need to be able to respond
      differently to different netdev events. For example, when a netdev is
      unregistered all the routes using it as their nexthop device should be
      flushed, whereas when the netdev's carrier changes only the 'linkdown'
      flag should be toggled.
      
      Currently, this is not possible, as the function that traverses the
      routing tables is not aware of the triggering event.
      
      Propagate the triggering event down, so that it could be used in later
      patches.
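
Why the walker needs the triggering event can be illustrated with a small dispatch sketch. All names here (enum nh_event, enum nh_action, action_for_event()) are hypothetical stand-ins, not kernel identifiers, and only the two events named in this message are modelled; everything else falls through to "no action".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the netdev events named in the message. */
enum nh_event {
    EV_UNREGISTER = 1,       /* netdev is being unregistered */
    EV_CHANGE,               /* netdev carrier changed */
    EV_OTHER                 /* any event this sketch does not model */
};

enum nh_action {
    ACT_NONE,
    ACT_FLUSH,               /* remove the route from the table */
    ACT_TOGGLE_LINKDOWN      /* only flip the 'linkdown' flag */
};

/*
 * Picks what to do with one route given the event that triggered the
 * table walk. Without the event the walker could not tell the two
 * cases apart, which is what this patch prepares for.
 */
static enum nh_action action_for_event(enum nh_event ev, bool uses_dev)
{
    assert(ev >= EV_UNREGISTER && ev <= EV_OTHER);

    if (!uses_dev)
        return ACT_NONE;     /* route does not use the affected netdev */

    switch (ev) {
    case EV_UNREGISTER:
        return ACT_FLUSH;
    case EV_CHANGE:
        return ACT_TOGGLE_LINKDOWN;
    default:
        return ACT_NONE;
    }
}
```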
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Clear nexthop flags upon netdev up · 2127d95a
      Ido Schimmel authored
      Previous patch marked nexthops with the 'dead' and 'linkdown' flags.
      Clear these flags when the netdev comes back up.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Mark dead nexthops with appropriate flags · 2b241361
      Ido Schimmel authored
      When a netdev is put administratively down or unregistered all the
      nexthops using it as their nexthop device should be marked with the
      'dead' and 'linkdown' flags.
      
      Currently, when a route is dumped its nexthop device is tested and the
      flags are set accordingly. A similar check is performed during route
      lookup.
      
      Instead, we can simply mark the nexthops based on netdev events and
      avoid checking the netdev's state during route dump and lookup.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Remove redundant route flushing during namespace dismantle · 9fcb0714
      Ido Schimmel authored
      By the time fib6_net_exit() is executed all the netdevs in the namespace
      have been either unregistered or pushed back to the default namespace.
      That is because pernet subsys operations are always ordered before
      pernet device operations and therefore invoked after them during
      namespace dismantle.
      
      Thus, all the routing tables in the namespace are empty by the time
      fib6_net_exit() is invoked and the call to rt6_ifdown() can be removed.
      
      This allows us to simplify the condition in fib6_ifdown() as it's only
      ever called with an actual netdev.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 7f0b8000
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2018-01-07
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) Add a start of a framework for extending struct xdp_buff without
         having the overhead of populating every data at runtime. Idea
         is to have a new per-queue struct xdp_rxq_info that holds read
         mostly data (currently that is, queue number and a pointer to
         the corresponding netdev) which is set up during rxqueue config
         time. When an XDP program is invoked, struct xdp_buff holds a
         pointer to struct xdp_rxq_info that the BPF program can then
         walk. The user facing BPF program that uses struct xdp_md for
         context can use these members directly, and the verifier rewrites
         context access transparently by walking the xdp_rxq_info and
         net_device pointers to load the data, from Jesper.
      
      2) Redo the reporting of offload device information to user space
         such that it works in combination with network namespaces. The
         latter is reported through a device/inode tuple as similarly
         done in other subsystems as well (e.g. perf) in order to identify
         the namespace. For this to work, ns_get_path() has been generalized
         such that the namespace can be retrieved not only from a specific
         task (perf case), but also from a callback where we deduce the
         netns (ns_common) from a netdevice. bpftool support using the new
         uapi info and extensive test cases for test_offload.py in BPF
         selftests have been added as well, from Jakub.
      
      3) Add two bpftool improvements: i) properly report the bpftool
         version such that it corresponds to the version from the kernel
         source tree. So pick the right linux/version.h from the source
         tree instead of the installed one. ii) fix bpftool and also
         bpf_jit_disasm build with binutils >= 2.9. The reason for the
         build breakage is that binutils library changed the function
         signature to select the disassembler. Given this is needed in
         multiple tools, add a proper feature detection to the
         tools/build/features infrastructure, from Roman.
      
      4) Implement the BPF syscall command BPF_MAP_GET_NEXT_KEY for the
         stacktrace map. It is currently unimplemented, but there are
         use cases where user space needs to walk all stacktrace map
         entries e.g. for dumping or deleting map entries w/o having to
         close and recreate the map. Add BPF selftests along with it,
         from Yonghong.
      
      5) Few follow-up cleanups for the bpftool cgroup code: i) rename
         the cgroup 'list' command into 'show' as we have it for other
         subcommands as well, ii) then alias the 'show' command such that
         'list' is accepted which is also common practice in iproute2,
         and iii) remove couple of newlines from error messages using
         p_err(), from Jakub.
      
      6) Two follow-up cleanups to sockmap code: i) remove the unused
         bpf_compute_data_end_sk_skb() function and ii) only build the
         sockmap infrastructure when CONFIG_INET is enabled since it's
         only aware of TCP sockets at this time, from John.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 06 Jan, 2018 3 commits
    • Merge branch 'bpf-stacktrace-map-next-key-support' · 9be99bad
      Daniel Borkmann authored
      Yonghong Song says:
      
      ====================
      The patch set implements bpf syscall command BPF_MAP_GET_NEXT_KEY
      for stacktrace map. Patch #1 is the core implementation
      and Patch #2 implements a bpf test at tools/testing/selftests/bpf
      directory. Please see individual patch comments for details.
      
      Changelog:
        v1 -> v2:
         - For invalid key (key pointer is non-NULL), sets next_key to be the first valid key.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • tools/bpf: add a bpf selftest for stacktrace · 3ced9b60
      Yonghong Song authored
      Added a bpf selftest in test_progs at tools directory for stacktrace.
      The test will populate a hashtable map and a stacktrace map
      at the same time with the same key, stackid.
      The user space will compare both maps, using BPF_MAP_LOOKUP_ELEM
      command and BPF_MAP_GET_NEXT_KEY command, to ensure that both have
      the same set of keys.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: implement syscall command BPF_MAP_GET_NEXT_KEY for stacktrace map · 16f07c55
      Yonghong Song authored
      Currently, bpf syscall command BPF_MAP_GET_NEXT_KEY is not
      supported for stacktrace map. However, there are use cases where
      user space wants to enumerate all stacktrace map entries where
      BPF_MAP_GET_NEXT_KEY command will be really helpful.
      In addition, if user space wants to delete all map entries
      in order to save memory and does not want to close the
      map file descriptor, BPF_MAP_GET_NEXT_KEY may help improve
      performance if map entries are sparsely populated.
      
      The implementation has similar behavior for
      BPF_MAP_GET_NEXT_KEY implementation in hashtab. If user provides
      a NULL key pointer or an invalid key, the first key is returned.
      Otherwise, the first valid key after the input parameter "key"
      is returned, or -ENOENT if no valid key can be found.
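
The lookup rule in the paragraph above can be sketched in user space as follows. This is an illustration, not the kernel implementation: the map is modelled as an array of "bucket in use" booleans with the bucket index as the key, and the function name and signature are assumptions made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * used[i] is true when bucket i holds an entry; the key is the bucket
 * index. Returns 0 and stores the result in *next_key, or -ENOENT when
 * no valid key can be found.
 */
static int stackmap_next_key(const bool *used, uint32_t n_buckets,
                             const uint32_t *key, uint32_t *next_key)
{
    uint32_t id;

    assert(used != NULL && next_key != NULL);

    if (!key || *key >= n_buckets || !used[*key])
        id = 0;              /* NULL or invalid key: start from the first bucket */
    else
        id = *key + 1;       /* valid key: continue with the bucket after it */

    while (id < n_buckets && !used[id])
        id++;                /* skip empty buckets */

    if (id >= n_buckets)
        return -ENOENT;

    *next_key = id;
    return 0;
}
```

A caller that wants to enumerate every entry starts with a NULL key and then feeds each returned key back in until -ENOENT is returned.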
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  4. 05 Jan, 2018 14 commits
    • Merge branch 'xdp_rxq_info' · 11d16edb
      Alexei Starovoitov authored
      Jesper Dangaard Brouer says:
      
      ====================
      V4:
      * Added reviewers/acks to patches
      * Fix patch desc in i40e that got out-of-sync with code
      * Add SPDX license headers for the two new files added in patch 14
      
      V3:
      * Fixed bug in virtio_net driver
      * Removed export of xdp_rxq_info_init()
      
      V2:
      * Changed API exposed to drivers
        - Removed invocation of "init" in drivers, and only call "reg"
          (Suggested by Saeed)
        - Allow "reg" to fail and handle this in drivers
          (Suggested by David Ahern)
      * Removed the SINKQ qtype, instead allow to register as "unused"
      * Also fixed some drivers during testing on actual HW (noted in patches)
      
      XDP needs to know more about the RX-queue that a given XDP frame
      arrived on, both for the XDP bpf-prog and for the kernel side.

      Instead of extending struct xdp_buff each time new info is needed,
      this patchset takes a different approach.  Struct xdp_buff is only
      extended with a pointer to a struct xdp_rxq_info (which makes it easier
      to extend later).  This xdp_rxq_info contains information related
      to how the driver has set up the individual RX-queues.  This is
      read-mostly information, and all xdp_buff frames (in the driver's
      napi_poll) point to the same xdp_rxq_info (per RX-queue).
      
      We stress this data/cache-line is for read-mostly info.  This is NOT
      for dynamic per packet info, use the data_meta for such use-cases.
      
      This patchset starts out small and only exposes ingress_ifindex and the
      RX-queue index to the XDP/BPF program. Access to tangible info like
      the ingress ifindex and RX queue index is fairly easy to comprehend.
      Other future use-cases could allow XDP frames to be recycled back
      to the originating device driver, by providing info on the RX device and
      queue number.
      
      Because XDP doesn't have driver feature flags, and eBPF code (due to
      bpf-tail-calls) cannot determine which XDP driver invokes it, this
      patchset has to update every driver that supports XDP.
      
      For driver developers (review individual driver patches!):
      
      The xdp_rxq_info is tied to the driver's RX-ring(s). Whenever an RX-ring
      modification requires (temporarily) stopping RX frames, the
      xdp_rxq_info should (likely) also be unregistered and re-registered,
      especially if reallocating the pages in the ring. Make sure ethtool
      set_channels does the right thing. When replacing the XDP prog, if and
      only if the RX-ring needs to be changed, also re-register the
      xdp_rxq_info.
      
      I'm Cc'ing the individual driver patches to the registered maintainers.
      
      Testing:
      
      I've only tested the NIC drivers I have hardware for.  The general
      test procedure is to (DUT = Device Under Test):
       (1) run pktgen script pktgen_sample04_many_flows.sh       (against DUT)
       (2) run samples/bpf program xdp_rxq_info --dev $DEV       (on DUT)
       (3) runtime modify number of NIC queues via ethtool -L    (on DUT)
       (4) runtime modify number of NIC ring-size via ethtool -G (on DUT)
      
      Patch based on git tree bpf-next (at commit fb982666):
       https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • samples/bpf: program demonstrating access to xdp_rxq_info · 0fca931a
      Jesper Dangaard Brouer authored
      This sample program can be used for monitoring and reporting how many
      packets per sec (pps) are received per NIC RX queue index and which
      CPU processed the packet. In itself it is a useful tool for quickly
      identifying RSS imbalance issues, see below.
      
      The default XDP action is XDP_PASS in order to provide a monitor
      mode. For benchmarking purposes it is possible to specify other XDP
      actions on the cmdline with --action.

      The output below shows an imbalanced RSS case where most RXQs deliver to
      CPU-0 while CPU-2 only gets packets from a single RXQ.  Looking at
      things from a CPU level, the two CPUs are processing approximately the same
      amount, BUT looking at the rx_queue_index level it is clear that
      RXQ-2 receives much better service than the other RXQs, which all share CPU-0.
      
      Running XDP on dev:i40e1 (ifindex:3) action:XDP_PASS
      XDP stats       CPU     pps         issue-pps
      XDP-RX CPU      0       900,473     0
      XDP-RX CPU      2       906,921     0
      XDP-RX CPU      total   1,807,395
      
      RXQ stats       RXQ:CPU pps         issue-pps
      rx_queue_index    0:0   180,098     0
      rx_queue_index    0:sum 180,098
      rx_queue_index    1:0   180,098     0
      rx_queue_index    1:sum 180,098
      rx_queue_index    2:2   906,921     0
      rx_queue_index    2:sum 906,921
      rx_queue_index    3:0   180,098     0
      rx_queue_index    3:sum 180,098
      rx_queue_index    4:0   180,082     0
      rx_queue_index    4:sum 180,082
      rx_queue_index    5:0   180,093     0
      rx_queue_index    5:sum 180,093
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: finally expose xdp_rxq_info to XDP bpf-programs · 02dd3291
      Jesper Dangaard Brouer authored
      Now that all XDP drivers have been updated to set up xdp_rxq_info and
      assign it to xdp_buff->rxq, it is safe to enable access to some
      of the xdp_rxq_info struct members.

      This patch extends xdp_md and exposes UAPI to userspace for
      ingress_ifindex and rx_queue_index.  Access happens via a bpf
      instruction rewrite that loads data directly from struct xdp_rxq_info.

      * ingress_ifindex maps to xdp_rxq_info->dev->ifindex
      * rx_queue_index  maps to xdp_rxq_info->queue_index
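
The two mappings listed above amount to a short pointer walk. The sketch below models it in plain user-space C; the *_sketch struct names and the two load_*() helpers are hypothetical and carry only the fields named in this message, so this shows the data layout idea rather than the verifier's actual instruction rewrite.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for the kernel structures. */
struct net_device_sketch {
    int ifindex;
};

struct xdp_rxq_info_sketch {             /* one per RX-queue, read-mostly */
    struct net_device_sketch *dev;
    uint32_t queue_index;
};

struct xdp_buff_sketch {                 /* one per frame */
    struct xdp_rxq_info_sketch *rxq;     /* shared by all frames of a queue */
};

/* ingress_ifindex: xdp_buff -> rxq -> dev -> ifindex */
static uint32_t load_ingress_ifindex(const struct xdp_buff_sketch *xdp)
{
    assert(xdp && xdp->rxq && xdp->rxq->dev);
    return (uint32_t)xdp->rxq->dev->ifindex;
}

/* rx_queue_index: xdp_buff -> rxq -> queue_index */
static uint32_t load_rx_queue_index(const struct xdp_buff_sketch *xdp)
{
    assert(xdp && xdp->rxq);
    return xdp->rxq->queue_index;
}
```

Two frames received on the same queue point at the same rxq object, so both loads return identical values for them without any per-packet setup.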
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xdp: generic XDP handling of xdp_rxq_info · e817f856
      Jesper Dangaard Brouer authored
      Hook points for xdp_rxq_info:
       * reg  : netif_alloc_rx_queues
       * unreg: netif_free_rx_queues
      
      The net_device have some members (num_rx_queues + real_num_rx_queues)
      and data-area (dev->_rx with struct netdev_rx_queue's) that were
      primarily used for exporting information about RPS (CONFIG_RPS) queues
      to sysfs (CONFIG_SYSFS).
      
      For generic XDP, extend struct netdev_rx_queue with the xdp_rxq_info,
      and remove some of the CONFIG_SYSFS ifdefs.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e817f856
    • Jesper Dangaard Brouer's avatar
      virtio_net: setup xdp_rxq_info · 754b8a21
      Jesper Dangaard Brouer authored
      The virtio_net driver doesn't dynamically change the RX-ring queue
      layout and backing pages, but instead rejects XDP setup if not all the
      conditions for XDP are met.  Thus, the xdp_rxq_info also remains
      fairly static.  This allows us to simply add the reg/unreg to
      net_device open/close functions.
      
      Driver hook points for xdp_rxq_info:
       * reg  : virtnet_open
       * unreg: virtnet_close
      
      V3:
       - bugfix, also setup xdp.rxq in receive_mergeable()
       - Tested bpf-sample prog inside guest on a virtio_net device
      
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: virtualization@lists.linux-foundation.org
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      754b8a21
    • Jesper Dangaard Brouer's avatar
      tun: setup xdp_rxq_info · 8bf5c4ee
      Jesper Dangaard Brouer authored
      Driver hook points for xdp_rxq_info:
       * reg  : tun_attach
       * unreg: __tun_detach
      
      I've done some manual testing of this tun driver, but I would
      appreciate a good review and someone else running their use-case tests,
      as I'm not 100% sure I understand the tfile->detached semantics.
      
      V2: Removed the skb_array_cleanup() call from V1 by request from Jason Wang.
      
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      8bf5c4ee
    • Jesper Dangaard Brouer's avatar
      thunderx: setup xdp_rxq_info · 27e95e36
      Jesper Dangaard Brouer authored
      This driver uses a bool scheme for "enable"/"disable" when setting up
      different resources.  Thus, the hook points for xdp_rxq_info are both
      placed in the same function, nicvf_rcv_queue_config().  This is activated
      through enable/disable via nicvf_config_data_transfer(), which is tied
      into nicvf_stop()/nicvf_open().
      
      Extend the driver packet handler call path nicvf_rcv_pkt_handler() with
      a pointer to the given struct rcv_queue, in order to access the
      xdp_rxq_info data area (in nicvf_xdp_rx()).
      
      V2: Driver has no proper error path for failed XDP RX-queue info reg,
      as nicvf_rcv_queue_config() is a void function.
      
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Sunil Goutham <sgoutham@cavium.com>
      Cc: Robert Richter <rric@kernel.org>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      27e95e36
    • Jesper Dangaard Brouer's avatar
      nfp: setup xdp_rxq_info · 7f1c684a
      Jesper Dangaard Brouer authored
      Driver hook points for xdp_rxq_info:
       * reg  : nfp_net_rx_ring_alloc
       * unreg: nfp_net_rx_ring_free
      
      In struct nfp_net_rx_ring, moved member @size into a hole on 64-bit.
      Thus, the size remains the same after adding member @xdp_rxq.
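
      The hole-filling idea can be illustrated with a small layout sketch.
      The two structs below are hypothetical and far smaller than the real
      struct nfp_net_rx_ring; member names other than size and xdp_rxq are
      made up, and xdp_rxq is reduced to an 8-byte placeholder so the
      arithmetic is easy to follow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "before" layout: a 4-byte hole follows cnt because the
 * next member is 8-byte aligned, and 4 bytes of tail padding follow size. */
struct ring_before {
    uint32_t cnt;
    _Alignas(8) uint64_t bufs;
    uint32_t size;
};

/* Hypothetical "after" layout: size now sits in the former hole, which
 * leaves room for a new 8-byte member without growing the struct. */
struct ring_after {
    uint32_t cnt;
    uint32_t size;
    _Alignas(8) uint64_t bufs;
    _Alignas(8) uint64_t xdp_rxq; /* placeholder for the new member */
};

/* Compile-time check that the reordering paid for the new member. */
static_assert(sizeof(struct ring_before) == sizeof(struct ring_after),
              "reordering should keep the struct size unchanged");

/* Number of padding bytes between cnt and bufs in the "before" layout. */
static size_t hole_after_cnt(void)
{
    return offsetof(struct ring_before, bufs)
         - (offsetof(struct ring_before, cnt) + sizeof(uint32_t));
}
```

      On kernel structs the usual way to find such holes is a layout dump
      from a tool like pahole, rather than reasoning by hand as done here.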
      
      Cc: oss-drivers@netronome.com
      Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
      Cc: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7f1c684a
    • Jesper Dangaard Brouer's avatar
      bnxt_en: setup xdp_rxq_info · 96a8604f
      Jesper Dangaard Brouer authored
      Driver hook points for xdp_rxq_info:
       * reg  : bnxt_alloc_rx_rings
       * unreg: bnxt_free_rx_rings
      
      This driver should be updated to re-register when changing
      allocation mode of RX rings.
      
      Tested on actual hardware.
      
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      96a8604f
    • Jesper Dangaard Brouer's avatar
      mlx4: setup xdp_rxq_info · ae75415d
      Jesper Dangaard Brouer authored
      Driver hook points for xdp_rxq_info:
       * reg  : mlx4_en_create_rx_ring
       * unreg: mlx4_en_destroy_rx_ring
      
      Tested on actual hardware.
      
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ae75415d
    • Jesper Dangaard Brouer's avatar
      xdp/qede: setup xdp_rxq_info and intro xdp_rxq_info_is_reg · c0124f32
      Jesper Dangaard Brouer authored
      The driver code qede_free_fp_array() depends on kfree() being callable
      with a NULL pointer. This stems from the qede_alloc_fp_array()
      function, which (kz)allocs memory for either fp->txq or fp->rxq.
      This also simplifies error handling code in case of memory allocation
      failures, but xdp_rxq_info_unreg() needs to know the difference.
      
      Introduce xdp_rxq_info_is_reg() to handle the case where a memory
      allocation fails: the failure path is detected by seeing that
      xdp_rxq_info was not registered yet, since registration first happens
      after successful allocation in qede_init_fp().
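
      A minimal userspace sketch of that guard follows.  It is not the qede
      code: the *_model types and the rxq_info_* helpers are hypothetical
      stand-ins for xdp_rxq_info and its reg/unreg/is_reg calls, and free()
      plays the role of kfree().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

enum reg_state_model { REG_STATE_NEW = 0, REG_STATE_REGISTERED };

/* Hypothetical stand-in for the RX queue object holding xdp_rxq_info. */
struct rxq_info_model {
    enum reg_state_model reg_state;
};

/* Hypothetical fastpath: rxq is NULL for TX-only entries or when the
 * allocation failed. */
struct fastpath_model {
    struct rxq_info_model *rxq;
};

static bool rxq_info_is_reg(const struct rxq_info_model *info)
{
    return info->reg_state == REG_STATE_REGISTERED;
}

static void rxq_info_reg(struct rxq_info_model *info)
{
    info->reg_state = REG_STATE_REGISTERED;
}

static void rxq_info_unreg(struct rxq_info_model *info)
{
    assert(rxq_info_is_reg(info)); /* unreg of a never-registered queue is a bug */
    info->reg_state = REG_STATE_NEW;
}

/* Shared free path: returns 1 if an unregistration was performed.
 * Works for NULL, allocated-but-unregistered, and registered queues. */
static int free_fastpath(struct fastpath_model *fp)
{
    int unregistered = 0;

    if (fp->rxq && rxq_info_is_reg(fp->rxq)) {
        rxq_info_unreg(fp->rxq);
        unregistered = 1;
    }
    free(fp->rxq); /* free(NULL) is a no-op, like kfree(NULL) */
    fp->rxq = NULL;
    return unregistered;
}
```

      The point of the is_reg check is that one free function can serve both
      the normal teardown and the allocation-failure path without tripping
      the "unreg before reg" assertion.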
      
      Driver hook points for xdp_rxq_info:
       * reg  : qede_init_fp
       * unreg: qede_free_fp_array
      
      Tested on actual hardware with samples/bpf program.
      
      V2: Driver has no proper error path for failed XDP RX-queue info reg, as
      qede_init_fp() is a void function.
      
      Cc: everest-linux-l2@cavium.com
      Cc: Ariel Elior <Ariel.Elior@cavium.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c0124f32
    • Jesper Dangaard Brouer's avatar
      ixgbe: setup xdp_rxq_info · 99ffc5ad
      Jesper Dangaard Brouer authored
      Driver hook points for xdp_rxq_info:
       * reg  : ixgbe_setup_rx_resources()
       * unreg: ixgbe_free_rx_resources()
      
      Tested on actual hardware.
      
      V2: Fix ixgbe_set_ringparam, clear xdp_rxq_info in temp_ring
      
      Cc: intel-wired-lan@lists.osuosl.org
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      99ffc5ad
    • Jesper Dangaard Brouer's avatar
      i40e: setup xdp_rxq_info · 87128824
      Jesper Dangaard Brouer authored
      The i40e driver has a special "FDIR" RX-ring (I40E_VSI_FDIR) which is
      a sideband channel for configuring/updating the flow director tables.
      This (i40e_vsi_)type does not invoke XDP-ebpf code.
      
      As suggested by Björn (V2): Instead of marking this I40E_VSI_FDIR RX-ring
      a special case, reverse the logic and only select RX-rings of type
      I40E_VSI_MAIN to register xdp_rxq_info's for.
      
      Driver hook points for xdp_rxq_info:
       * reg  : i40e_setup_rx_descriptors (via i40e_vsi_setup_rx_resources)
       * unreg: i40e_free_rx_resources    (via i40e_vsi_free_rx_resources)
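
      The "only register for the main VSI type" logic can be sketched as
      below.  This is a simplified model rather than the i40e code: the enum,
      the ring struct and both functions are hypothetical, and a bool stands
      in for the registration state of xdp_rxq_info.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical VSI types; only the "main" type runs XDP programs. */
enum vsi_type_model { VSI_MODEL_MAIN, VSI_MODEL_FDIR };

struct rx_ring_model {
    enum vsi_type_model vsi_type;
    bool xdp_rxq_registered;
};

/* Setup path: register only for rings of the main VSI type, so the
 * FDIR sideband ring needs no special-case marking. */
static void ring_setup(struct rx_ring_model *ring)
{
    assert(!ring->xdp_rxq_registered);
    if (ring->vsi_type == VSI_MODEL_MAIN)
        ring->xdp_rxq_registered = true;
}

/* Free path: mirror the same condition so reg and unreg stay paired. */
static void ring_free(struct rx_ring_model *ring)
{
    if (ring->vsi_type == VSI_MODEL_MAIN)
        ring->xdp_rxq_registered = false;
}
```

      Selecting on the one type that does run XDP means any further
      non-XDP ring types are excluded by default.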
      
      Tested on actual hardware with samples/bpf program.
      
      V2: Fixed bug in i40e_set_ringparam (memset zero) + match on I40E_VSI_MAIN.
      V4: Update patch desc that got out-of-sync with code.
      
      Cc: intel-wired-lan@lists.osuosl.org
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Paul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      87128824
    • Jesper Dangaard Brouer's avatar
      xdp/mlx5: setup xdp_rxq_info · 0ddf5432
      Jesper Dangaard Brouer authored
      The mlx5 driver has a special drop-RQ queue (one per interface) that
      simply drops all incoming traffic. It helps the driver keep other HW
      objects (flow steering) alive upon down/up operations.  It is
      temporarily pointed to by flow steering objects during interface
      setup, and when the interface is down. It lacks many fields that are set
      in a regular RQ (for example its state is never switched to
      MLX5_RQC_STATE_RDY). (Thanks to Tariq Toukan for the explanation.)
      
      The XDP RX-queue info for this drop-RQ is marked as unused, which
      allows us to use the same takedown/free code path as other RX-queues.
      
      Driver hook points for xdp_rxq_info:
       * reg   : mlx5e_alloc_rq()
       * unused: mlx5e_alloc_drop_rq()
       * unreg : mlx5e_free_rq()
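
      How an "unused" marking lets one free path serve both queue kinds can
      be sketched as a small state machine.  This is a model under stated
      assumptions, not the mlx5 or core XDP code: the state names, structs
      and functions are hypothetical, and unreg returns a bool here only to
      make its behaviour observable.

```c
#include <assert.h>
#include <stdbool.h>

enum reg_state_model {
    REG_STATE_NEW = 0,
    REG_STATE_REGISTERED,
    REG_STATE_UNREGISTERED,
    REG_STATE_UNUSED,
};

struct rxq_info_model {
    enum reg_state_model reg_state;
};

struct rq_model {
    struct rxq_info_model xdp_rxq;
};

/* Regular RQ allocation: the queue info is registered. */
static void alloc_rq(struct rq_model *rq)
{
    rq->xdp_rxq.reg_state = REG_STATE_REGISTERED;
}

/* Drop-RQ allocation: never runs XDP, so only mark the info as unused. */
static void alloc_drop_rq(struct rq_model *rq)
{
    rq->xdp_rxq.reg_state = REG_STATE_UNUSED;
}

/* Unreg tolerates the unused state; returns true if it did real work. */
static bool rxq_info_unreg(struct rxq_info_model *info)
{
    if (info->reg_state == REG_STATE_UNUSED)
        return false;
    assert(info->reg_state == REG_STATE_REGISTERED);
    info->reg_state = REG_STATE_UNREGISTERED;
    return true;
}

/* Single free path shared by regular RQs and the drop-RQ. */
static bool free_rq(struct rq_model *rq)
{
    return rxq_info_unreg(&rq->xdp_rxq);
}
```

      Without the unused state, the drop-RQ would need its own free function
      or a flag checked by every caller of the shared one.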
      
      Tested on actual hardware with samples/bpf program.
      
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Matan Barak <matanb@mellanox.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0ddf5432