1. 17 Mar, 2023 40 commits
    • net/smc: Use percpu ref for wr tx reference · 79a22238
      Kai Shen authored
      The refcount wr_tx_refcnt may cause cache thrashing problems among
      cores and we can use percpu ref to mitigate this issue here. We
      gain some performance improvement with percpu ref here on our
      customized smc-r version. Applying cache alignment may also mitigate
      this problem but it seems more reasonable to use percpu ref here.
      We can also replace wr_reg_refcnt with one percpu reference like
      wr_tx_refcnt.
      
      redis-benchmark on smc-r with atomic wr_tx_refcnt:
      SET: 525707.06 requests per second, p50=0.087 msec
      GET: 554877.38 requests per second, p50=0.087 msec
      
      redis-benchmark on the percpu_ref version:
      SET: 540482.06 requests per second, p50=0.087 msec
      GET: 570711.12 requests per second, p50=0.079 msec
      
      Cases are like "redis-benchmark -h x.x.x.x -q -t set,get -P 1 -n
      5000000 -c 50 -d 10 --threads 4".
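
      A rough userspace sketch of the idea behind this change, not the
      kernel's percpu_ref API: a single shared atomic counter makes every
      core write to the same cache line, while giving each core (here: each
      slot) its own cache-line-aligned counter keeps the common get/put path
      local. All demo_* names, the slot count and the 64-byte line size are
      assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_NR_SLOTS  4   /* assumed number of cores */
#define DEMO_CACHELINE 64  /* assumed cache line size in bytes */

/* One counter per slot, each on its own cache line. */
struct demo_slot {
	_Alignas(DEMO_CACHELINE) atomic_long count;
};

struct demo_sharded_ref {
	struct demo_slot slots[DEMO_NR_SLOTS];
};

static inline void demo_ref_init(struct demo_sharded_ref *ref)
{
	for (unsigned int i = 0; i < DEMO_NR_SLOTS; i++)
		atomic_init(&ref->slots[i].count, 0);
}

/* Take a reference: only the caller's own slot is written. */
static inline void demo_ref_get(struct demo_sharded_ref *ref, unsigned int slot)
{
	atomic_fetch_add_explicit(&ref->slots[slot % DEMO_NR_SLOTS].count, 1,
				  memory_order_relaxed);
}

/* Drop a reference; it may land on a different slot than the get. */
static inline void demo_ref_put(struct demo_sharded_ref *ref, unsigned int slot)
{
	atomic_fetch_sub_explicit(&ref->slots[slot % DEMO_NR_SLOTS].count, 1,
				  memory_order_relaxed);
}

/* Slow path: the true count is the sum over all slots. */
static inline long demo_ref_sum(struct demo_sharded_ref *ref)
{
	long total = 0;

	for (unsigned int i = 0; i < DEMO_NR_SLOTS; i++)
		total += atomic_load_explicit(&ref->slots[i].count,
					      memory_order_relaxed);
	return total;
}

static inline void demo_ref_example(void)
{
	struct demo_sharded_ref ref;

	demo_ref_init(&ref);
	demo_ref_get(&ref, 0);
	demo_ref_get(&ref, 1);
	demo_ref_put(&ref, 2);
	assert(demo_ref_sum(&ref) == 1);
}
```

      An individual slot can go negative when a put lands on a different
      slot than the matching get; only the sum is meaningful, which is why
      reading the total is the slow path in this kind of design.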
      Signed-off-by: Kai Shen <KaiShen@linux.alibaba.com>
      Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'inet-const' · d27d367d
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      inet: better const qualifier awareness
      
      inet_sk() can be changed to propagate const qualifier,
      thanks to container_of_const()
      
      Following patches in this series add more const qualifiers.
      
      Other helpers like tcp_sk(), udp_sk(), raw_sk(), ... will be handled
      in following series.
      ====================
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet_diag: constify raw_lookup() socket argument · 736c8b52
      Eric Dumazet authored
      Now both raw_v4_match() and raw_v6_match() accept a const socket,
      raw_lookup() can do the same to clarify its role.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: raw: constify raw_v4_match() socket argument · 0a8c2568
      Eric Dumazet authored
      This clarifies raw_v4_match() intent.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: raw: constify raw_v6_match() socket argument · db6af4fd
      Eric Dumazet authored
      This clarifies raw_v6_match() intent.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp6: constify __udp_v6_is_mcast_sock() socket argument · dc3731ba
      Eric Dumazet authored
      This clarifies __udp_v6_is_mcast_sock() intent.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: constify inet6_mc_check() · 66eb554c
      Eric Dumazet authored
      inet6_mc_check() is essentially a read-only function.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: constify __udp_is_mcast_sock() socket argument · a0a989d3
      Eric Dumazet authored
      This clarifies __udp_is_mcast_sock() intent.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: constify ip_mc_sf_allow() socket argument · 33e972bd
      Eric Dumazet authored
      This clarifies ip_mc_sf_allow() intent.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: preserve const qualifier in inet_sk() · abc17a11
      Eric Dumazet authored
      We can change inet_sk() to propagate const qualifier of its argument.
      
      This should avoid some potential errors caused by accidental
      (const -> not_const) promotion.
      
      Other helpers like tcp_sk(), udp_sk(), raw_sk() will be handled
      in separate patch series.
      
      v2: use container_of_const() as advised by Jakub and Linus
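
      The const propagation described above can be sketched in userspace C
      with a C11 _Generic selection. This is a simplified illustration
      rather than the kernel's actual definitions: the demo_* structs and
      macros are hypothetical stand-ins, and __typeof__ is a GCC/Clang
      extension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct sock and struct inet_sock. */
struct demo_sock {
	int sk_err;
};

struct demo_inet_sock {
	struct demo_sock sk;      /* embedded generic socket */
	unsigned short inet_num;
};

/* Plain container_of(): always yields a non-const pointer. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Const-preserving variant: a pointer-to-const argument selects the
 * first branch and yields "const type *"; anything else falls through
 * to the default branch and yields "type *".
 */
#define demo_container_of_const(ptr, type, member)                           \
	_Generic((ptr),                                                       \
		const __typeof__(*(ptr)) *:                                   \
			((const type *)demo_container_of(ptr, type, member)), \
		default:                                                      \
			((type *)demo_container_of(ptr, type, member)))

#define demo_inet_sk(ptr) \
	demo_container_of_const(ptr, struct demo_inet_sock, sk)

/* Read-only helper: the result is const, so a write here would not compile. */
static inline unsigned short demo_read_port(const struct demo_sock *sk)
{
	return demo_inet_sk(sk)->inet_num;
}

/* Mutable helper: a non-const argument yields a writable pointer. */
static inline void demo_set_port(struct demo_sock *sk, unsigned short port)
{
	demo_inet_sk(sk)->inet_num = port;
}

static inline void demo_inet_sk_example(void)
{
	struct demo_inet_sock isk = { .sk = { .sk_err = 0 }, .inet_num = 80 };

	demo_set_port(&isk.sk, 8080);
	assert(demo_read_port(&isk.sk) == 8080);
}
```

      With the plain macro, demo_read_port() could silently obtain a
      writable pointer from a const argument; with the _Generic variant the
      compiler rejects such a write.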
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/netdev/20230315142841.3a2ac99a@kernel.org/
      Link: https://lore.kernel.org/netdev/CAHk-=wiOf12nrYEF2vJMcucKjWPN-Ns_SW9fA7LwST_2Dzp7rw@mail.gmail.com/
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: specs: allow uapi-header in genetlink · 82b32970
      Jakub Kicinski authored
      Chuck wanted to put the UAPI header in linux/net/ which seems
      reasonable, allow genetlink families to choose the location.
      It doesn't really matter for non-C-like languages.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink-specs: add partial specification for devlink · 74bf6477
      Jakub Kicinski authored
      Devlink is quite complex but put in the very basics so we can
      incrementally fill in the commands as needed.
      
      $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
          --dump get
      
      [{'bus-name': 'netdevsim',
        'dev-name': 'netdevsim1',
        'dev-stats': {'reload-stats': {'reload-action-info': {'reload-action': 1,
                                                              'reload-action-stats': {'reload-stats-entry': [{'reload-stats-limit': 0,
                                                                                                              'reload-stats-value': 0}]}}},
                      'remote-reload-stats': {'reload-action-info': {'reload-action': 2,
                                                                     'reload-action-stats': {'reload-stats-entry': [{'reload-stats-limit': 0,
                                                                                                                     'reload-stats-value': 0},
                                                                                                                    {'reload-stats-limit': 1,
                                                                                                                     'reload-stats-value': 0}]}}}},
        'reload-failed': 0}]
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-packet-KCSAN' · 19a9fbc0
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      net/packet: KCSAN awareness
      
      This series is based on one syzbot report [1]
      
      Seven 'flags/booleans' are converted to atomic bit variant.
      
      po->xmit and po->tp_tstamp accesses get annotations.
      
      [1]
      BUG: KCSAN: data-race in packet_rcv / packet_setsockopt
      
      read-write to 0xffff88813dbe84e4 of 1 bytes by task 12312 on cpu 0:
      packet_setsockopt+0xb77/0xe60 net/packet/af_packet.c:3900
      __sys_setsockopt+0x212/0x2b0 net/socket.c:2252
      __do_sys_setsockopt net/socket.c:2263 [inline]
      __se_sys_setsockopt net/socket.c:2260 [inline]
      __x64_sys_setsockopt+0x62/0x70 net/socket.c:2260
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      read to 0xffff88813dbe84e4 of 1 bytes by task 1911 on cpu 1:
      packet_rcv+0x4b1/0xa40 net/packet/af_packet.c:2187
      deliver_skb net/core/dev.c:2189 [inline]
      dev_queue_xmit_nit+0x3a9/0x620 net/core/dev.c:2259
      xmit_one+0x71/0x2a0 net/core/dev.c:3586
      dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606
      __dev_queue_xmit+0x91c/0x11c0 net/core/dev.c:4256
      dev_queue_xmit include/linux/netdevice.h:3008 [inline]
      neigh_hh_output include/net/neighbour.h:530 [inline]
      neigh_output include/net/neighbour.h:544 [inline]
      ip6_finish_output2+0x9e9/0xc30 net/ipv6/ip6_output.c:134
      __ip6_finish_output net/ipv6/ip6_output.c:195 [inline]
      ip6_finish_output+0x395/0x4f0 net/ipv6/ip6_output.c:206
      NF_HOOK_COND include/linux/netfilter.h:291 [inline]
      ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:227
      dst_output include/net/dst.h:445 [inline]
      ip6_local_out+0x60/0x80 net/ipv6/output_core.c:161
      ip6tunnel_xmit include/net/ip6_tunnel.h:161 [inline]
      udp_tunnel6_xmit_skb+0x321/0x4a0 net/ipv6/ip6_udp_tunnel.c:109
      send6+0x2ed/0x3b0 drivers/net/wireguard/socket.c:152
      wg_socket_send_skb_to_peer+0xbb/0x120 drivers/net/wireguard/socket.c:178
      wg_packet_create_data_done drivers/net/wireguard/send.c:251 [inline]
      wg_packet_tx_worker+0x142/0x360 drivers/net/wireguard/send.c:276
      process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
      worker_thread+0x618/0xa70 kernel/workqueue.c:2436
      kthread+0x1a9/0x1e0 kernel/kthread.c:376
      ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
      ====================
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/packet: convert po->pressure to an atomic flag · 791a3e9f
      Eric Dumazet authored
      Not only does this remove some READ_ONCE()/WRITE_ONCE(),
      it also removes one integer.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/packet: convert po->running to an atomic flag · 61edf479
      Eric Dumazet authored
      Instead of consuming 32 bits for po->running, use
      one available bit in po->flags.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/packet: convert po->has_vnet_hdr to an atomic flag · 50d935ea
      Eric Dumazet authored
      po->has_vnet_hdr can be read locklessly.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/packet: convert po->tp_loss to an atomic flag · 164bddac
      Eric Dumazet authored
      tp_loss can be read locklessly.
      
      Convert it to an atomic flag to avoid races.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/packet: convert po->tp_tx_has_off to an atomic flag · 74383446
      Eric Dumazet authored
      This is to use existing space in po->flags, and reclaim
      the storage used by the non atomic bit fields.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/packet: annotate accesses to po->tp_tstamp · 1051ce4a
      Eric Dumazet authored
      tp_tstamp is read locklessly.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/packet: convert po->auxdata to an atomic flag · fd53c297
      Eric Dumazet authored
      po->auxdata can be read while another thread
      is changing its value, potentially raising a KCSAN splat.
      
      Convert it to PACKET_SOCK_AUXDATA flag.
      
      Fixes: 8dc41944 ("[PACKET]: Add optional checksum computation for recvmsg")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/packet: convert po->origdev to an atomic flag · ee5675ec
      Eric Dumazet authored
      syzbot/KCSAN reported that po->origdev can be read
      while another thread is changing its value.
      
      We can avoid this splat by converting this field
      to an actual bit.
      
      Following patches will convert remaining 1bit fields.
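
      To illustrate why an atomic bit avoids the reported race: adjacent C
      bit-fields share one storage unit, so writing one of them is a
      non-atomic read-modify-write of its neighbours, whereas an atomic
      OR/AND on a flags word touches only the requested bit. The sketch
      below uses C11 atomics in userspace; the demo_* names and the flag
      enum are hypothetical and are not the helpers added by this series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag numbers, one bit each in a shared flags word. */
enum demo_sock_flag {
	DEMO_SOCK_ORIGDEV,
	DEMO_SOCK_AUXDATA,
};

struct demo_packet_sock {
	atomic_ulong flags;
};

/* Atomically set or clear one flag without disturbing the others. */
static inline void demo_flag_set(struct demo_packet_sock *po,
				 enum demo_sock_flag flag, bool val)
{
	unsigned long mask = 1UL << flag;

	if (val)
		atomic_fetch_or(&po->flags, mask);
	else
		atomic_fetch_and(&po->flags, ~mask);
}

/* Lockless read of a single flag. */
static inline bool demo_flag_test(struct demo_packet_sock *po,
				  enum demo_sock_flag flag)
{
	return (atomic_load(&po->flags) & (1UL << flag)) != 0;
}

static inline void demo_flag_example(void)
{
	struct demo_packet_sock po;

	atomic_init(&po.flags, 0);
	demo_flag_set(&po, DEMO_SOCK_ORIGDEV, true);
	assert(demo_flag_test(&po, DEMO_SOCK_ORIGDEV));
	assert(!demo_flag_test(&po, DEMO_SOCK_AUXDATA));
}
```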
      
      Fixes: 80feaacb ("[AF_PACKET]: Add option to return orig_dev to userspace.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/packet: annotate accesses to po->xmit · b9d83ab8
      Eric Dumazet authored
      po->xmit can be set from setsockopt(PACKET_QDISC_BYPASS),
      while read locklessly.
      
      Use READ_ONCE()/WRITE_ONCE() to avoid potential load/store
      tearing issues.
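
      A minimal sketch of the annotation pattern, assuming the classic
      volatile-cast form of these macros (the kernel's real definitions are
      more elaborate). The DEMO_* macros, the struct and the two transmit
      functions are hypothetical; __typeof__ is a GCC/Clang extension.

```c
#include <assert.h>

/* Simplified stand-ins: force exactly one load or store of x. */
#define DEMO_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define DEMO_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

typedef int (*demo_xmit_fn)(int len);

/* Hypothetical transmit paths, distinguished by their return value. */
static inline int demo_xmit_qdisc(int len)  { (void)len; return 0; }
static inline int demo_xmit_direct(int len) { (void)len; return 1; }

struct demo_packet_sock {
	demo_xmit_fn xmit;
};

/* Writer side, as a setsockopt()-style handler would do. */
static inline void demo_set_bypass(struct demo_packet_sock *po, int bypass)
{
	DEMO_WRITE_ONCE(po->xmit, bypass ? demo_xmit_direct : demo_xmit_qdisc);
}

/* Lockless reader: load the pointer once, then call through the copy. */
static inline int demo_send(struct demo_packet_sock *po, int len)
{
	demo_xmit_fn fn = DEMO_READ_ONCE(po->xmit);

	return fn(len);
}

static inline void demo_xmit_example(void)
{
	struct demo_packet_sock po = { .xmit = demo_xmit_qdisc };

	assert(demo_send(&po, 64) == 0);
	demo_set_bypass(&po, 1);
	assert(demo_send(&po, 64) == 1);
}
```

      Taking a local copy matters: without the annotation the compiler is
      free to reload po->xmit between a check and the call, or to split the
      access.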
      
      Fixes: d346a3fa ("packet: introduce PACKET_QDISC_BYPASS socket option")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'gve-xdp-support' · dc021e6c
      David S. Miller authored
      Praveen Kaligineedi says:
      
      ====================
      gve: Add XDP support for GQI-QPL format
      
      Adding support for XDP DROP, PASS, TX, REDIRECT for GQI QPL format.
      Add AF_XDP zero-copy support.
      
      When an XDP program is installed, dedicated TX queues are created to
      handle XDP traffic. The user needs to ensure that the number of
      configured TX queues is equal to the number of configured RX queues; and
      the number of TX/RX queues is less than or equal to half the maximum
      number of TX/RX queues.
      
      The XDP traffic from AF_XDP sockets and from other NICs (arriving via
      XDP_REDIRECT) will also egress through the dedicated XDP TX queues.
      
      Although these changes support AF_XDP socket in zero-copy mode, there is
      still a copy happening within the driver between XSK buffer pool and QPL
      bounce buffers in GQI-QPL format.
      
      The following example demonstrates how the XDP packets are mapped to
      TX queues:
      
      Example configuration:
      Max RX queues : 2N, Max TX queues : 2N
      Configured RX queues : N, Configured TX queues : N
      
      TX queue mapping:
      TX queues with queue id 0,...,N-1 will handle traffic from the stack.
      TX queues with queue id N,...,2N-1 will handle XDP traffic.
      
      For the XDP packets transmitted using XDP_TX action:
      <Egress TX queue id> = N + <Ingress RX queue id>
      
      For the XDP packets that arrive from other NICs via XDP_REDIRECT action:
      <Egress TX queue id> = N + ( smp_processor_id % N )
      
      For AF_XDP zero-copy mode:
      <Egress TX queue id> = N + <AF_XDP TX queue id>
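
      The mapping rules and the queue-count constraint above can be written
      out as small helpers. This is only a worked restatement of the
      formulas in this cover letter, not code from the driver; all demo_*
      names are hypothetical, and n is the number of configured RX (and TX)
      queues.

```c
#include <assert.h>
#include <stdbool.h>

/* XDP_TX: egress queue = N + ingress RX queue id. */
static inline unsigned int demo_xdp_tx_qid(unsigned int n, unsigned int rx_qid)
{
	return n + rx_qid;
}

/* XDP_REDIRECT from another NIC: egress queue = N + (cpu % N). */
static inline unsigned int demo_xdp_redirect_qid(unsigned int n, unsigned int cpu)
{
	return n + (cpu % n);
}

/* AF_XDP zero-copy: egress queue = N + AF_XDP TX queue id. */
static inline unsigned int demo_xsk_tx_qid(unsigned int n, unsigned int xsk_qid)
{
	return n + xsk_qid;
}

/*
 * Constraint stated above: configured TX == configured RX, and each
 * count is at most half of the corresponding maximum.
 */
static inline bool demo_xdp_queue_config_ok(unsigned int num_rx, unsigned int num_tx,
					    unsigned int max_rx, unsigned int max_tx)
{
	return num_rx == num_tx &&
	       2 * num_rx <= max_rx &&
	       2 * num_tx <= max_tx;
}

static inline void demo_queue_example(void)
{
	/* N = 4 configured queues out of a maximum of 8. */
	assert(demo_xdp_queue_config_ok(4, 4, 8, 8));
	assert(demo_xdp_tx_qid(4, 1) == 5);
	assert(demo_xdp_redirect_qid(4, 6) == 6);
	assert(demo_xsk_tx_qid(4, 3) == 7);
}
```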
      
      Changes in v2:
      - Removed gve_close/gve_open when adding XDP dedicated queues. Instead
      we add and register additional TX queues when the XDP program is
      installed. If the allocation/registration fails we return error and do
      not install the XDP program. Added a new patch to enable adding TX queues
      without gve_close/gve_open
      - Removed xdp tx spin lock from this patch. It is needed for XDP_REDIRECT
      support as both XDP_REDIRECT and XDP_TX traffic share the dedicated XDP
      queues. Moved the code to add xdp tx spinlock to the subsequent patch
      that adds XDP_REDIRECT support.
      - Added netdev_err when the user tries to set rx/tx queues to the values
      not supported when XDP is enabled.
      - Removed rcu annotation for xdp_prog. We disable the napi prior to
      adding/removing the xdp_prog and reenable it after the program has
      been installed for all the queues.
      - Ring the tx doorbell once for napi instead of every XDP TX packet.
      - Added a new helper function for freeing the FIFO buffer
      - Unregister xdp rxq for all the queues when the registration
      fails during XDP program installation
      - Register xsk rxq only when XSK buff pool is enabled
      - Removed code accessing internal xsk_buff_pool fields
      - Removed sleep driven code when disabling XSK buff pool. Disable
      napi and re-enable it after disabling XSK pool.
      - Make sure that we clean up dma mappings on XSK pool disable
      - Use napi_if_scheduled_mark_missed to avoid unnecessary napi move
      to the CPU calling ndo_xsk_wakeup()
      
      Changes in v3:
      - Padding bytes are used if the XDP TX packet headers do not
      fit at tail of TX FIFO. Taking these padding bytes into account
      while checking if enough space is available in TX FIFO.
      
      Changes in v4:
      - Turn on the carrier based on the link status synchronously rather
      than asynchronously when XDP is installed/uninstalled
      - Set the supported flags in net_device.xdp_features
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Add AF_XDP zero-copy support for GQI-QPL format · fd8e4032
      Praveen Kaligineedi authored
      Adding AF_XDP zero-copy support.
      
      Note: Although these changes support AF_XDP socket in zero-copy
      mode, there is still a copy happening within the driver between
      XSK buffer pool and QPL bounce buffers in GQI-QPL format.
      In GQI-QPL queue format, the driver needs to allocate a fixed size
      memory, the size specified by vNIC device, for RX/TX and register this
      memory as a bounce buffer with the vNIC device when a queue is
      created. The number of pages in the bounce buffer is limited and the
      pages need to be made available to the vNIC by copying the RX data out
      to prevent head-of-line blocking. Therefore, we cannot pass the XSK
      buffer pool to the vNIC.
      
      The number of copies on RX path from the bounce buffer to XSK buffer is 2
      for AF_XDP copy mode (bounce buffer -> allocated page frag -> XSK buffer)
      and 1 for AF_XDP zero-copy mode (bounce buffer -> XSK buffer).
      
      This patch contains the following changes:
      1) Enable and disable XSK buffer pool
      2) Copy XDP packets from QPL bounce buffers to XSK buffer on rx
      3) Copy XDP packets from XSK buffer to QPL bounce buffers and
         ring the doorbell as part of XDP TX napi poll
      4) ndo_xsk_wakeup callback support
      Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
      Reviewed-by: Jeroen de Borst <jeroendb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Add XDP REDIRECT support for GQI-QPL format · 39a7f4aa
      Praveen Kaligineedi authored
      This patch contains the following changes:
      1) Support for XDP REDIRECT action on rx
      2) ndo_xdp_xmit callback support
      
      In GQI-QPL queue format, the driver needs to allocate a fixed size
      memory, the size specified by vNIC device, for RX/TX and register this
      memory as a bounce buffer with the vNIC device when a queue is created.
      The number of pages in the bounce buffer is limited and the pages need to
      be made available to the vNIC by copying the RX data out to prevent
      head-of-line blocking. The XDP_REDIRECT packets are therefore immediately
      copied to a newly allocated page.
      Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
      Reviewed-by: Jeroen de Borst <jeroendb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Add XDP DROP and TX support for GQI-QPL format · 75eaae15
      Praveen Kaligineedi authored
      Add support for XDP PASS, DROP and TX actions.
      
      This patch contains the following changes:
      1) Support installing/uninstalling XDP program
      2) Add dedicated XDP TX queues
      3) Add support for XDP DROP action
      4) Add support for XDP TX action
      Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
      Reviewed-by: Jeroen de Borst <jeroendb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Changes to add new TX queues · 7fc2bf78
      Praveen Kaligineedi authored
      Changes to enable adding and removing TX queues without calling
      gve_close() and gve_open().
      
      Made the following changes:
      1) priv->tx, priv->rx and priv->qpls arrays are allocated based on
         max tx queues and max rx queues
      2) Changed gve_adminq_create_tx_queues(), gve_adminq_destroy_tx_queues(),
      gve_tx_alloc_rings() and gve_tx_free_rings() functions to add/remove a
      subset of TX queues rather than all the TX queues.
      Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
      Reviewed-by: Jeroen de Borst <jeroendb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: XDP support GQI-QPL: helper function changes · 2e80aeae
      Praveen Kaligineedi authored
      This patch adds/modifies helper functions needed to add XDP
      support.
      Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
      Reviewed-by: Jeroen de Borst <jeroendb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-sk_err-lockless-annotate' · ec4040ae
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      net: annotate lockless accesses to sk_err[_soft]
      
      This patch series is inspired by yet another syzbot report.
      
      Most poll() handlers are lockless and read sk->sk_err
      while other cpus can change it.
      
      Add READ_ONCE/WRITE_ONCE() to major/usual offenders.
      
      More to come later.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_unix: annotate lockless accesses to sk->sk_err · cc04410a
      Eric Dumazet authored
      unix_poll() and unix_dgram_poll() read sk->sk_err
      without any lock held.
      
      Add relevant READ_ONCE()/WRITE_ONCE() annotations.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: annotate lockless accesses to sk->sk_err · 9ae8e5ad
      Eric Dumazet authored
      mptcp_poll() reads sk->sk_err without socket lock held/owned.
      
      Add READ_ONCE() and WRITE_ONCE() to avoid load/store tearing.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: annotate lockless access to sk->sk_err · e13ec3da
      Eric Dumazet authored
      tcp_poll() reads sk->sk_err without socket lock held/owned.
      
      We should use READ_ONCE() here, and update writers
      to use WRITE_ONCE().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: annotate lockless accesses to sk->sk_err_soft · 2f2d9972
      Eric Dumazet authored
      This field can be read/written without lock synchronization.
      
      tcp and dccp have been handled in different patches.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dccp: annotate lockless accesses to sk->sk_err_soft · 9a25f0cb
      Eric Dumazet authored
      This field can be read/written without lock synchronization.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: annotate lockless accesses to sk->sk_err_soft · cee1af82
      Eric Dumazet authored
      This field can be read/written without lock synchronization.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vlan: partially enable SIOCSHWTSTAMP in container · 731b73db
      Vadim Fedorenko authored
      Setting timestamp filter was explicitly disabled on vlan devices in
      containers because it might affect other processes on the host. But it's
      perfectly legitimate when the real device is in the same namespace.
      
      Fixes: 873017af ("vlan: disable SIOCSHWTSTAMP in container")
      Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'pcs_get_state-fixes' · e05c5181
      David S. Miller authored
      Russell King (Oracle) says:
      
      ====================
      Minor fixes for pcs_get_state() implementations
      
      This series contains a number of fixes for minor issues with some
      pcs_get_state() implementations, particularly for the phylink
      state->an_enabled member. As they are minor, I'm suggesting we
      queue them in net-next as there is follow-on work for these, and
      there is no urgency for them to be in -rc.
      
      Just like phylib, state->advertising's Autoneg bit is a copy of
      state->an_enabled, and thus it is my intention to remove
      state->an_enabled from phylink to simplify things.
      
      This series gets rid of state->an_enabled assignments or
      reporting that should never have been there.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: pcs: lynx: don't print an_enabled in pcs_get_state() · ecec0ebb
      Russell King (Oracle) authored
      an_enabled will be going away, and in any case, pcs_get_state() should
      not be updating this member. Remove the print.
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: pcs: xpcs: remove double-read of link state when using AN · ef63461c
      Russell King (Oracle) authored
      Phylink does not want the current state of the link when reading the
      PCS link state - it wants the latched state. Don't double-read the
      MII status register. Phylink will re-read as necessary to capture
      transient link-down events as of dbae3388 ("net: phylink: Force
      retrigger in case of latched link-fail indicator").
      
      The above referenced commit is a dependency for this change, and thus
      this change should not be backported to any kernel that does not
      contain the above referenced commit.
      
      Fixes: fcb26bd2 ("net: phy: Add Synopsys DesignWare XPCS MDIO module")
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'vxlan-MDB-support' · abf36703
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      vxlan: Add MDB support
      
      tl;dr
      =====
      
      This patchset implements MDB support in the VXLAN driver, allowing it to
      selectively forward IP multicast traffic to VTEPs with interested
      receivers instead of flooding it to all the VTEPs as BUM. The motivating
      use case is intra and inter subnet multicast forwarding using EVPN
      [1][2], which means that MDB entries are only installed by the user
      space control plane and no snooping is implemented, thereby avoiding a
      lot of unnecessary complexity in the kernel.
      
      Background
      ==========
      
      Both the bridge and VXLAN drivers have an FDB that allows them to
      forward Ethernet frames based on their destination MAC addresses and
      VLAN/VNI. These FDBs are managed using the same PF_BRIDGE/RTM_*NEIGH
      netlink messages and bridge(8) utility.
      
      However, only the bridge driver has an MDB that allows it to selectively
      forward IP multicast packets to bridge ports with interested receivers
      behind them, based on (S, G) and (*, G) MDB entries. When these packets
      reach the VXLAN driver they are flooded using the "all-zeros" FDB entry
      (00:00:00:00:00:00). The entry either includes the list of all the VTEPs
      in the tenant domain (when ingress replication is used) or the multicast
      address of the BUM tunnel (when P2MP tunnels are used), to which all the
      VTEPs join.
      
      Networks that make heavy use of multicast in the overlay can benefit
      from a solution that allows them to selectively forward IP multicast
      traffic only to VTEPs with interested receivers. Such a solution is
      described in the next section.
      
      Motivation
      ==========
      
      RFC 7432 [3] defines a "MAC/IP Advertisement route" (type 2) [4] that
      allows VTEPs in the EVPN network to advertise and learn reachability
      information for unicast MAC addresses. Traffic destined to a unicast MAC
      address can therefore be selectively forwarded to a single VTEP behind
      which the MAC is located.
      
      The same is not true for IP multicast traffic. Such traffic is simply
      flooded as BUM to all VTEPs in the broadcast domain (BD) / subnet,
      regardless of whether a VTEP has interested receivers for the multicast stream
      or not. This is especially problematic for overlay networks that make
      heavy use of multicast.
      
      The issue is addressed by RFC 9251 [1], which defines a "Selective
      Multicast Ethernet Tag Route" (type 6) [5] that allows VTEPs in the
      EVPN network to advertise multicast streams that they are interested in.
      This is done by having each VTEP suppress IGMP/MLD packets from being
      transmitted to the NVE network and instead communicate the information
      over BGP to other VTEPs.
      
      The draft in [2] further extends RFC 9251 with procedures to allow
      efficient forwarding of IP multicast traffic not only in a given subnet,
      but also between different subnets in a tenant domain.
      
      The required changes in the bridge driver to support the above were
      already merged in merge commit 8150f0cf ("Merge branch
      'bridge-mcast-extensions-for-evpn'"). However, full support entails MDB
      support in the VXLAN driver so that it will be able to selectively
      forward IP multicast traffic only to VTEPs with interested receivers.
      The implementation of this MDB is described in the next section.
      
      Implementation
      ==============
      
      The user interface is extended to allow user space to specify the
      destination VTEP(s) and related parameters. Example usage:
      
       # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 198.51.100.1
       # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 192.0.2.1
      
       $ bridge -d -s mdb show
       dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 192.0.2.1    0.00
       dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 198.51.100.1    0.00
      
      Since the MDB is fully managed by user space and since snooping is not
      implemented, only permanent entries can be installed and temporary
      entries are rejected by the kernel.
      
      The netlink interface is extended with a few new attributes in the
      RTM_NEWMDB / RTM_DELMDB request messages:
      
      [ struct nlmsghdr ]
      [ struct br_port_msg ]
      [ MDBA_SET_ENTRY ]
      	struct br_mdb_entry
      [ MDBA_SET_ENTRY_ATTRS ]
      	[ MDBE_ATTR_SOURCE ]
      		struct in_addr / struct in6_addr
      	[ MDBE_ATTR_SRC_LIST ]
      		[ MDBE_SRC_LIST_ENTRY ]
      			[ MDBE_SRCATTR_ADDRESS ]
      				struct in_addr / struct in6_addr
      		[ ...]
      	[ MDBE_ATTR_GROUP_MODE ]
      		u8
      	[ MDBE_ATTR_RTPROT ]
      		u8
      	[ MDBE_ATTR_DST ]	// new
      		struct in_addr / struct in6_addr
      	[ MDBE_ATTR_DST_PORT ]	// new
      		u16
      	[ MDBE_ATTR_VNI ]	// new
      		u32
      	[ MDBE_ATTR_IFINDEX ]	// new
      		s32
      	[ MDBE_ATTR_SRC_VNI ]	// new
      		u32
      
      RTM_NEWMDB / RTM_DELMDB responses and notifications are extended with
      corresponding attributes.
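
      As a rough illustration of the layout above, the following Python
      sketch encodes the new destination attributes as nested netlink TLVs
      and parses them back. It is only a model of the wire format: the
      numeric attribute types are assumptions that should be checked
      against include/uapi/linux/if_bridge.h, setting NLA_F_NESTED on the
      container is likewise an assumption, and the helper names
      (nla_put, nla_parse, build_vxlan_mdb_attrs) are hypothetical.

```python
import socket
import struct

# Attribute type numbers are assumptions for illustration only; verify them
# against include/uapi/linux/if_bridge.h before relying on them.
MDBA_SET_ENTRY_ATTRS = 2
MDBE_ATTR_DST = 5
MDBE_ATTR_DST_PORT = 6
MDBE_ATTR_VNI = 7

NLA_HDRLEN = 4          # u16 nla_len followed by u16 nla_type
NLA_F_NESTED = 0x8000   # flag bit marking an attribute that holds attributes


def nla_align(length):
    """Round a length up to the 4-byte netlink attribute alignment."""
    return (length + 3) & ~3


def nla_put(attr_type, payload):
    """Encode one attribute: header, payload, then zero padding."""
    length = NLA_HDRLEN + len(payload)  # nla_len excludes trailing padding
    pad = nla_align(length) - length
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * pad


def nla_parse(buf):
    """Decode a flat run of attributes into {type: payload}."""
    attrs, off = {}, 0
    while off + NLA_HDRLEN <= len(buf):
        length, attr_type = struct.unpack_from("=HH", buf, off)
        if length < NLA_HDRLEN or off + length > len(buf):
            break  # malformed attribute, stop parsing
        attrs[attr_type & ~NLA_F_NESTED] = buf[off + NLA_HDRLEN:off + length]
        off += nla_align(length)
    return attrs


def build_vxlan_mdb_attrs(dst_ip, dst_port, vni):
    """Build an MDBA_SET_ENTRY_ATTRS nest with an IPv4 DST, DST_PORT and VNI."""
    inner = (
        nla_put(MDBE_ATTR_DST, socket.inet_aton(dst_ip))
        + nla_put(MDBE_ATTR_DST_PORT, struct.pack("=H", dst_port))
        + nla_put(MDBE_ATTR_VNI, struct.pack("=I", vni))
    )
    return nla_put(MDBA_SET_ENTRY_ATTRS | NLA_F_NESTED, inner)
```

      The u16 port attribute occupies six bytes and is padded to eight,
      which is why each attribute is aligned separately instead of
      concatenating the raw payloads.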
      
      One MDB entry that can be installed in the VXLAN MDB, but not in the
      bridge MDB is the catchall entry (0.0.0.0 / ::). It is used to transmit
      unregistered multicast traffic that is not link-local and is especially
      useful when inter-subnet multicast forwarding is required. See patch #12
      for a detailed explanation and motivation. It is similar to the
      "all-zeros" FDB entry that can be installed in the VXLAN FDB, but not
      the bridge FDB.
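
      The role of the catchall entry can be pictured with a small Python
      model of remote selection. This is a simplified sketch, not the
      driver's code: the lookup order ((S, G), then (*, G), then catchall,
      then the "all-zeros" FDB flood list) is an assumption of the model,
      filter modes are ignored, only IPv4 is handled, and all names are
      hypothetical.

```python
import ipaddress

CATCHALL = "0.0.0.0"  # IPv4 catchall group from the text ("::" for IPv6)


def is_link_local_mcast(group):
    """True for 224.0.0.0/24, the IPv4 link-local multicast block."""
    return ipaddress.ip_address(group) in ipaddress.ip_network("224.0.0.0/24")


def select_remotes(mdb, flood_list, src, group):
    """Return the remote VTEPs a multicast packet would be sent to.

    mdb maps (source, group) to a list of remotes; a source of None
    stands for (*, G). flood_list models the "all-zeros" FDB entry.
    The lookup order below is an assumption of this sketch.
    """
    # Most specific match first: (S, G), then (*, G).
    for key in ((src, group), (None, group)):
        if key in mdb:
            return mdb[key]
    # Unregistered, non-link-local multicast goes to the catchall remotes.
    if not is_link_local_mcast(group) and (None, CATCHALL) in mdb:
        return mdb[(None, CATCHALL)]
    # Everything else is flooded as before.
    return flood_list
```

      In this model a link-local group such as 224.0.0.5 never matches the
      catchall entry and is still flooded to every VTEP in the flood list.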
      
      "added_by_star_ex" entries
      --------------------------
      
      The bridge driver automatically installs (S, G) MDB port group entries
      marked as "added_by_star_ex" whenever it detects that an (S, G) entry
      can prevent traffic from being forwarded via a port associated with an
      EXCLUDE (*, G) entry. The bridge will add the port to the port group of
      the (S, G) entry, thereby creating a new port group entry. The
      complexity associated with these entries is not trivial, but it needs to
      reside in the bridge driver because it automatically installs MDB
      entries in response to snooped IGMP / MLD packets.
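
      The bridge behaviour described above can be modelled in a few lines
      of Python. This is a hedged sketch of the idea only: the data
      structures and the function name are hypothetical and do not mirror
      the bridge driver's implementation.

```python
def add_star_ex_ports(sg_entries, star_g_exclude_ports):
    """Add EXCLUDE (*, G) ports to the matching (S, G) port groups.

    sg_entries: {(source, group): {port: set_of_flags}}
    star_g_exclude_ports: {group: set of ports in EXCLUDE mode for (*, G)}
    """
    for (_source, group), ports in sg_entries.items():
        for port in star_g_exclude_ports.get(group, ()):
            if port not in ports:
                # Without this entry the (S, G) match would stop traffic
                # from reaching a port that asked for the whole group.
                ports[port] = {"added_by_star_ex"}
    return sg_entries
```

      Ports that already belong to an (S, G) port group are left untouched,
      so only the automatically created entries carry the flag.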
      
      The same is not true for the VXLAN MDB, which is entirely managed by
      user space that is fully capable of forming the correct replication
      lists on its own. In addition, the complexity associated with the
      "added_by_star_ex" entries in the VXLAN driver is higher compared to the
      bridge: Whenever a remote VTEP is added to the catchall entry, it needs
      to be added to all the existing MDB entries, since such a remote
      requested that all the multicast traffic be forwarded to it. Similarly,
      whenever a (*, G) or (S, G) entry is added, all the remotes associated
      with the catchall entry need to be added to it.
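
      To make the bookkeeping concrete, here is a Python sketch of what
      such automatic replication would involve if it were implemented.
      It is purely hypothetical: the class and its methods are invented
      for illustration, only additions are handled, and removal (which
      would need to tell automatic remotes from user-installed ones) is
      left out entirely.

```python
CATCHALL = (None, "0.0.0.0")  # (*, 0.0.0.0), the IPv4 catchall entry


class ReplicatingMdb:
    """Toy MDB that mirrors catchall remotes into every other entry."""

    def __init__(self):
        self.entries = {}  # (source, group) -> set of remote VTEP addresses

    def add_remote(self, key, remote):
        self.entries.setdefault(key, set()).add(remote)
        if key == CATCHALL:
            # A catchall remote asked for all multicast traffic, so it
            # has to be copied into every existing entry.
            for remotes in self.entries.values():
                remotes.add(remote)
        else:
            # A (*, G) or (S, G) entry inherits every catchall remote.
            self.entries[key] |= self.entries.get(CATCHALL, set())
```

      Every catchall update touches all entries in this model, which is
      the kind of work user space can do itself when it builds the
      replication lists.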
      
      Given the above, this patchset does not implement support for such
      entries.  One argument against this decision can be that in the future
      someone might want to populate the VXLAN MDB in response to decapsulated
      IGMP / MLD packets and not according to EVPN routes. Regardless of my
      doubts regarding this possibility, it can be implemented using a new
      VXLAN device knob that will also enable the "added_by_star_ex"
      functionality.
      
      Testing
      =======
      
      Tested using existing VXLAN and MDB selftests under "net/" and
      "net/forwarding/". Added a dedicated selftest in the last patch.
      
      Patchset overview
      =================
      
      Patches #1-#3 are small preparations in the bridge driver. I plan to
      submit them separately together with an MDB dump test case.
      
      Patches #4-#6 are additional preparations centered around the extraction
      of the MDB netlink handlers from the bridge driver to the common
      rtnetlink code. This allows reusing the existing MDB netlink messages
      for the configuration of the VXLAN MDB.
      
      Patches #7-#9 include more small preparations in the common rtnetlink
      code and the VXLAN driver.
      
      Patch #10 implements the MDB control path in the VXLAN driver, which
      will allow user space to create, delete, replace and dump MDB entries.
      
      Patches #11-#12 implement the MDB data path in the VXLAN driver,
      allowing it to selectively forward IP multicast traffic according to the
      matched MDB entry.
      
      Patch #13 finally enables MDB support in the VXLAN driver.
      
      iproute2 patches can be found here [6].
      
      Note that in order to fully support the specifications in [1] and [2],
      additional functionality is required from the data path. However, it can
      be achieved using existing kernel interfaces, which is why it is not
      described here.
      
      Changelog
      =========
      
      Since v1 [7]:
      
      Patch #9: Use htons() in 'case' instead of ntohs() in 'switch'.
      
      Since RFC [8]:
      
      Patch #3: Use NL_ASSERT_DUMP_CTX_FITS().
      Patch #3: memset the entire context when moving to the next device.
      Patch #3: Reset sequence counters when moving to the next device.
      Patch #3: Use NL_SET_ERR_MSG_ATTR() in rtnl_validate_mdb_entry().
      Patch #7: Remove restrictions regarding mixing of multicast and unicast
      remote destination IPs in an MDB entry. While such configuration does
      not make sense to me, it is not forbidden by the VXLAN FDB code and does
      not crash the kernel.
      Patch #7: Fix check regarding all-zeros MDB entry and source.
      Patch #11: New patch.
      
      [1] https://datatracker.ietf.org/doc/html/rfc9251
      [2] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast
      [3] https://datatracker.ietf.org/doc/html/rfc7432
      [4] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2
      [5] https://datatracker.ietf.org/doc/html/rfc9251#section-9.1
      [6] https://github.com/idosch/iproute2/commits/submit/mdb_vxlan_rfc_v1
      [7] https://lore.kernel.org/netdev/20230313145349.3557231-1-idosch@nvidia.com/
      [8] https://lore.kernel.org/netdev/20230204170801.3897900-1-idosch@nvidia.com/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      abf36703