    net, bonding: Add XDP support to the bonding driver · 9e2ee5c7
    Jussi Maki authored
    XDP is implemented in the bonding driver by transparently delegating
    the XDP program loading, removal and xmit operations to the bonding
    slave devices. The overall goal of this work is that XDP programs
    can be attached to a bond device *without* any further changes (or
    awareness) necessary to the program itself, meaning the same XDP
    program can be attached to a native device but also a bonding device.
    
    Semantics of XDP_TX when attached to a bond are equivalent in such
    setting to the case when a tc/BPF program would be attached to the
    bond, meaning transmitting the packet out of the bond itself using one
    of the bond's configured xmit methods to select a slave device (rather
    than XDP_TX on the slave itself). Handling of XDP_TX to transmit
    using the configured bonding mechanism is therefore implemented by
    rewriting the BPF program return value in bpf_prog_run_xdp. To avoid
    performance impact this check is guarded by a static key, which is
    incremented when an XDP program is loaded onto a bond device. This
    approach was chosen to avoid changes to drivers implementing XDP. If
    the slave device does not match the receive device, then XDP_REDIRECT
    is transparently used to perform the redirection in order to have
    the network driver release the packet from its RX ring. The bonding
    driver hashing functions have been refactored to allow reuse with
    xdp_buffs to avoid code duplication.
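
    The verdict rewriting described above can be illustrated with a small
    user-space model. This is only a sketch under stated assumptions, not
    the kernel code: every model_* name, the bond_xdp_users counter that
    stands in for the static key, and the "flow hash modulo slave count"
    selection are hypothetical and chosen for illustration. Only the
    verdict names and values follow enum xdp_action.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Verdict values mirror enum xdp_action from the kernel UAPI. */
enum xdp_verdict { XDP_ABORTED = 0, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT };

/* Hypothetical stand-in for the static key: number of bonds with XDP attached. */
static int bond_xdp_users;

struct model_dev {
	int ifindex;
	bool is_bond_slave;
};

struct model_bond {
	struct model_dev *slaves;
	size_t n_slaves;
};

struct model_result {
	enum xdp_verdict verdict;
	int tx_ifindex;	/* -1 when the packet is not transmitted */
};

/* Assumed xmit policy for the sketch: flow hash modulo slave count. */
static const struct model_dev *model_select_slave(const struct model_bond *bond,
						   uint32_t flow_hash)
{
	return &bond->slaves[flow_hash % bond->n_slaves];
}

/* Post-process the verdict returned by the XDP program. */
static struct model_result model_fixup_verdict(enum xdp_verdict prog_verdict,
						const struct model_dev *rx_dev,
						const struct model_bond *bond,
						uint32_t flow_hash)
{
	struct model_result res = { prog_verdict, -1 };
	const struct model_dev *slave;

	/* Only XDP_TX is ever rewritten; other verdicts pass through. */
	if (prog_verdict != XDP_TX)
		return res;

	/* Plain XDP_TX sends the packet back out of the receive device. */
	res.tx_ifindex = rx_dev->ifindex;

	/* Fast path: no bond has an XDP program, or rx_dev is not a slave. */
	if (bond_xdp_users == 0 || !rx_dev->is_bond_slave || bond->n_slaves == 0)
		return res;

	/* The bond picks the egress slave; redirect if it is not rx_dev. */
	slave = model_select_slave(bond, flow_hash);
	if (slave->ifindex != rx_dev->ifindex) {
		res.verdict = XDP_REDIRECT;
		res.tx_ifindex = slave->ifindex;
	}
	return res;
}
```

    In this model, a packet received on the slave that the bond also
    selects for transmission keeps its XDP_TX verdict, while any other
    selection turns into XDP_REDIRECT towards the chosen slave.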
    
    The motivation for this change is to enable use of bonding (and
    802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
    XDP and also to transparently support bond devices for projects that
    use XDP given most modern NICs have dual port adapters. An alternative
    to this approach would be to implement 802.3ad in user-space and
    implement the bonding load-balancing in the XDP program itself, but
    this is a rather cumbersome endeavor in terms of slave device management
    (e.g. by watching netlink) and requires separate programs for native
    vs bond cases for the orchestrator. A native in-kernel implementation
    overcomes these issues and provides more flexibility.
    
    Below are benchmark results from two machines with 100Gbit Intel
    E810 (ice) NICs, with a 32-core 3970X on the sending machine and a
    16-core 3950X on the receiving machine. 64 byte packets were sent with
    pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
    ice driver, so the tests were performed with iommu=off and patch [2]
    applied. Additionally, the bonding round robin algorithm was modified
    to use per-cpu tx counters, as high CPU load (50% vs 10%) and a high
    rate of cache misses were caused by the shared rr_tx_counter (see patch
    2/3). The statistics were collected using "sar -n dev -u 1 10". On top
    of that, for ice, further work is in progress on improving the XDP_TX
    numbers [4].
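
    The per-cpu tx counter idea can be sketched in user space as follows.
    This is a simplified model and not the code from patch 2/3: the
    MODEL_* constants, the padded struct and model_rr_next_slave() are
    hypothetical, and the caller is assumed to pass a valid CPU index and
    a non-zero slave count. Each CPU increments its own cache-line-sized
    counter, so CPUs no longer contend on one shared counter.

```c
#include <stdint.h>

#define MODEL_NR_CPUS		4	/* assumed CPU count for the sketch */
#define MODEL_CACHE_LINE	64	/* assumed cache line size in bytes */

/* One counter per CPU, padded so that counters never share a cache line. */
struct model_rr_counter {
	uint32_t value;
	char pad[MODEL_CACHE_LINE - sizeof(uint32_t)];
};

static struct model_rr_counter model_rr_tx_counter[MODEL_NR_CPUS];

/*
 * Pick the next slave index for a packet sent from the given CPU.
 * Only the local counter is touched, so no cross-CPU cache line
 * bouncing occurs. Round robin order is kept per CPU, not globally.
 */
static unsigned int model_rr_next_slave(unsigned int cpu, unsigned int n_slaves)
{
	uint32_t id = model_rr_tx_counter[cpu].value++;

	return id % n_slaves;
}
```

    The trade-off in this model is that the slave sequence is only
    strictly alternating per CPU; packets sent from different CPUs may
    pick the same slave back to back.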
    
     -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
     without patch (1 dev):
       XDP_DROP:              3.15%      48.6Mpps
       XDP_TX:                3.12%      18.3Mpps     18.3Mpps
       XDP_DROP (RSS):        9.47%      116.5Mpps
       XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
     -----------------------
     with patch, bond (1 dev):
       XDP_DROP:              3.14%      46.7Mpps
       XDP_TX:                3.15%      13.9Mpps     13.9Mpps
       XDP_DROP (RSS):        10.33%     117.2Mpps
       XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
     -----------------------
     with patch, bond (2 devs):
       XDP_DROP:              6.27%      92.7Mpps
       XDP_TX:                6.26%      17.6Mpps     17.5Mpps
       XDP_DROP (RSS):       11.38%      117.2Mpps
       XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
     --------------------------------------------------------------
    
    RSS: Receive Side Scaling, i.e. the packets were sent to a range of
    destination IPs.
    
      [1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
      [2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
      [3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/
      [4]: https://lore.kernel.org/bpf/20210805230046.28715-1-maciej.fijalkowski@intel.com/T/#t
    
    Signed-off-by: Jussi Maki <joamaki@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Jay Vosburgh <j.vosburgh@gmail.com>
    Cc: Veaceslav Falico <vfalico@gmail.com>
    Cc: Andy Gospodarek <andy@greyhouse.net>
    Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    Cc: Magnus Karlsson <magnus.karlsson@intel.com>
    Link: https://lore.kernel.org/bpf/20210731055738.16820-4-joamaki@gmail.com