1. 26 Oct, 2021 31 commits
    • net: annotate data-race in neigh_output() · d18785e2
      Eric Dumazet authored
      neigh_output() reads n->nud_state and hh->hh_len locklessly.
      
      This is fine, but we need to add annotations and document this.
      
      We evaluate skip_cache first to avoid reading these fields
      if the cache has to be bypassed.
      
      syzbot report:
      
      BUG: KCSAN: data-race in __neigh_event_send / ip_finish_output2
      
      write to 0xffff88810798a885 of 1 bytes by interrupt on cpu 1:
       __neigh_event_send+0x40d/0xac0 net/core/neighbour.c:1128
       neigh_event_send include/net/neighbour.h:444 [inline]
       neigh_resolve_output+0x104/0x410 net/core/neighbour.c:1476
       neigh_output include/net/neighbour.h:510 [inline]
       ip_finish_output2+0x80a/0xaa0 net/ipv4/ip_output.c:221
       ip_finish_output+0x3b5/0x510 net/ipv4/ip_output.c:309
       NF_HOOK_COND include/linux/netfilter.h:296 [inline]
       ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:423
       dst_output include/net/dst.h:450 [inline]
       ip_local_out+0x164/0x220 net/ipv4/ip_output.c:126
       __ip_queue_xmit+0x9d3/0xa20 net/ipv4/ip_output.c:525
       ip_queue_xmit+0x34/0x40 net/ipv4/ip_output.c:539
       __tcp_transmit_skb+0x142a/0x1a00 net/ipv4/tcp_output.c:1405
       tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline]
       tcp_xmit_probe_skb net/ipv4/tcp_output.c:4011 [inline]
       tcp_write_wakeup+0x4a9/0x810 net/ipv4/tcp_output.c:4064
       tcp_send_probe0+0x2c/0x2b0 net/ipv4/tcp_output.c:4079
       tcp_probe_timer net/ipv4/tcp_timer.c:398 [inline]
       tcp_write_timer_handler+0x394/0x520 net/ipv4/tcp_timer.c:626
       tcp_write_timer+0xb9/0x180 net/ipv4/tcp_timer.c:642
       call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1421
       expire_timers+0x135/0x240 kernel/time/timer.c:1466
       __run_timers+0x368/0x430 kernel/time/timer.c:1734
       run_timer_softirq+0x19/0x30 kernel/time/timer.c:1747
       __do_softirq+0x12c/0x26e kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu kernel/softirq.c:636 [inline]
       irq_exit_rcu+0x4e/0xa0 kernel/softirq.c:648
       sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1097
       asm_sysvec_apic_timer_interrupt+0x12/0x20
       native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline]
       arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline]
       acpi_safe_halt drivers/acpi/processor_idle.c:109 [inline]
       acpi_idle_do_entry drivers/acpi/processor_idle.c:553 [inline]
       acpi_idle_enter+0x258/0x2e0 drivers/acpi/processor_idle.c:688
       cpuidle_enter_state+0x2b4/0x760 drivers/cpuidle/cpuidle.c:237
       cpuidle_enter+0x3c/0x60 drivers/cpuidle/cpuidle.c:351
       call_cpuidle kernel/sched/idle.c:158 [inline]
       cpuidle_idle_call kernel/sched/idle.c:239 [inline]
       do_idle+0x1a3/0x250 kernel/sched/idle.c:306
       cpu_startup_entry+0x15/0x20 kernel/sched/idle.c:403
       secondary_startup_64_no_verify+0xb1/0xbb
      
      read to 0xffff88810798a885 of 1 bytes by interrupt on cpu 0:
       neigh_output include/net/neighbour.h:507 [inline]
       ip_finish_output2+0x79a/0xaa0 net/ipv4/ip_output.c:221
       ip_finish_output+0x3b5/0x510 net/ipv4/ip_output.c:309
       NF_HOOK_COND include/linux/netfilter.h:296 [inline]
       ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:423
       dst_output include/net/dst.h:450 [inline]
       ip_local_out+0x164/0x220 net/ipv4/ip_output.c:126
       __ip_queue_xmit+0x9d3/0xa20 net/ipv4/ip_output.c:525
       ip_queue_xmit+0x34/0x40 net/ipv4/ip_output.c:539
       __tcp_transmit_skb+0x142a/0x1a00 net/ipv4/tcp_output.c:1405
       tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline]
       tcp_xmit_probe_skb net/ipv4/tcp_output.c:4011 [inline]
       tcp_write_wakeup+0x4a9/0x810 net/ipv4/tcp_output.c:4064
       tcp_send_probe0+0x2c/0x2b0 net/ipv4/tcp_output.c:4079
       tcp_probe_timer net/ipv4/tcp_timer.c:398 [inline]
       tcp_write_timer_handler+0x394/0x520 net/ipv4/tcp_timer.c:626
       tcp_write_timer+0xb9/0x180 net/ipv4/tcp_timer.c:642
       call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1421
       expire_timers+0x135/0x240 kernel/time/timer.c:1466
       __run_timers+0x368/0x430 kernel/time/timer.c:1734
       run_timer_softirq+0x19/0x30 kernel/time/timer.c:1747
       __do_softirq+0x12c/0x26e kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu kernel/softirq.c:636 [inline]
       irq_exit_rcu+0x4e/0xa0 kernel/softirq.c:648
       sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1097
       asm_sysvec_apic_timer_interrupt+0x12/0x20
       native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline]
       arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline]
       acpi_safe_halt drivers/acpi/processor_idle.c:109 [inline]
       acpi_idle_do_entry drivers/acpi/processor_idle.c:553 [inline]
       acpi_idle_enter+0x258/0x2e0 drivers/acpi/processor_idle.c:688
       cpuidle_enter_state+0x2b4/0x760 drivers/cpuidle/cpuidle.c:237
       cpuidle_enter+0x3c/0x60 drivers/cpuidle/cpuidle.c:351
       call_cpuidle kernel/sched/idle.c:158 [inline]
       cpuidle_idle_call kernel/sched/idle.c:239 [inline]
       do_idle+0x1a3/0x250 kernel/sched/idle.c:306
       cpu_startup_entry+0x15/0x20 kernel/sched/idle.c:403
       rest_init+0xee/0x100 init/main.c:734
       arch_call_rest_init+0xa/0xb
       start_kernel+0x5e4/0x669 init/main.c:1142
       secondary_startup_64_no_verify+0xb1/0xbb
      
      value changed: 0x20 -> 0x01
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
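The lockless reads in the neigh_output() commit can be modelled in user space. The following is a minimal sketch, not the kernel code: `sketch_neigh`, `read_once_u8()`, `read_once_u32()` and `sketch_use_cached_header()` are hypothetical names, the volatile casts only approximate the kernel's READ_ONCE(), and the `SKETCH_NUD_CONNECTED` mask value is an assumption chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel structures; only the two
 * fields that are read locklessly are modelled. */
struct sketch_neigh {
	unsigned char nud_state;  /* may be rewritten by another CPU */
	unsigned int hh_len;      /* cached hardware header length */
};

/* Assumed value for the "connected" state mask (illustrative only). */
#define SKETCH_NUD_CONNECTED 0xC2u

/* Userspace approximation of READ_ONCE(): force exactly one load. */
static inline unsigned char read_once_u8(const unsigned char *p)
{
	return *(const volatile unsigned char *)p;
}

static inline unsigned int read_once_u32(const unsigned int *p)
{
	return *(const volatile unsigned int *)p;
}

/* Returns true when the cached header may be used. skip_cache is
 * evaluated first, so the racy fields are not read at all when the
 * cache has to be bypassed. */
static bool sketch_use_cached_header(const struct sketch_neigh *n,
				     bool skip_cache)
{
	return !skip_cache &&
	       (read_once_u8(&n->nud_state) & SKETCH_NUD_CONNECTED) &&
	       read_once_u32(&n->hh_len);
}
```

Because `&&` short-circuits, putting `skip_cache` first means the annotated loads are skipped entirely on the bypass path, which is the ordering the commit message describes.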
    • Merge branch 'mlxsw-rif-mac-prefixes' · 72b93a86
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Support multiple RIF MAC prefixes
      
      Currently, mlxsw enforces that all the netdevs used as router interfaces
      (RIFs) have the same MAC prefix (e.g., same 38 MSBs in Spectrum-1).
      Otherwise, an error is returned to user space with extack. This patchset
      relaxes the limitation through the use of RIF MAC profiles.
      
      A RIF MAC profile is a hardware entity that represents a particular MAC
      prefix which multiple RIFs can reference. Therefore, the number of
      possible MAC prefixes is no longer one, but the number of profiles
      supported by the device.
      
      The ability to change the MAC of a particular netdev is useful, for
      example, for users who use the netdev to connect to an upstream provider
      that performs MAC filtering. Currently, such users are either forced to
      negotiate with the provider or change the MAC address of all other
      netdevs so that they share the same prefix.
      
      Patchset overview:
      
      Patches #1-#3 are preparations.
      
      Patch #4 adds actual support for RIF MAC profiles.
      
      Patch #5 exposes RIF MAC profiles as a devlink resource, so that user
      space has visibility into the maximum number of profiles and current
      occupancy. Useful for debugging and testing (next 3 patches).
      
      Patches #6-#8 add both scale and functional tests.
      
      Patch #9 removes tests that validated the previous limitation. It is now
      covered by patch #6 for devices that support a single profile.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mlxsw: Remove deprecated test cases · c24dbf3d
      Danielle Ratson authored
      After adding the previous patches, the constraint that all the router
      interface MAC addresses have the same prefix is no longer relevant.
      
      Remove the test cases that validated that this constraint is honored.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: Add an occupancy test for RIF MAC profiles · 20d446db
      Danielle Ratson authored
      When all the RIF MAC profiles are in use, test that it is possible to
      change the MAC of a netdev (i.e., a RIF) when its MAC profile is not
      shared with other RIFs. Test that replacement fails when the MAC profile
      is shared.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mlxsw: Add forwarding test for RIF MAC profiles · a10b7bac
      Danielle Ratson authored
      Verify that MAC profile changes are indeed applied and that packets are
      forwarded with the correct source MAC.
      
      Output example:
      
      $ ./rif_mac_profiles.sh
      TEST: h1->h2: new mac profile                                       [ OK ]
      TEST: h2->h1: new mac profile                                       [ OK ]
      TEST: h1->h2: edit mac profile                                      [ OK ]
      TEST: h2->h1: edit mac profile                                      [ OK ]
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mlxsw: Add a scale test for RIF MAC profiles · 152f98e7
      Danielle Ratson authored
      Query the maximum number of supported RIF MAC profiles using
      devlink-resource and verify that all available MAC profiles can be utilized
      and that an error is generated when user space tries to exceed this number.
      
      Output example in Spectrum-2:
      
      $ TESTS='rif_mac_profile' ./resource_scale.sh
      TEST: 'rif_mac_profile' 4                                           [ OK ]
      TEST: 'rif_mac_profile' overflow 5                                  [ OK ]
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_router: Expose RIF MAC profiles to devlink resource · 1c375ffb
      Danielle Ratson authored
      Expose via devlink-resource the maximum number of RIF MAC profiles and
      their current occupancy, so it can be used for debug and writing generic
      tests, like in the next patch.
      
      Example for Spectrum-2 output:
      
      $ devlink resource show pci/0000:06:00.0
      ...
        name rif_mac_profiles size 4 occ 0 unit entry dpipe_tables none
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_router: Add RIF MAC profiles support · 605d25cd
      Danielle Ratson authored
      Currently, mlxsw enforces that all the router interfaces (RIFs) have the
      same MAC prefix.
      
      Relax this limitation by using RIF MAC profiles. Each profile is
      associated with a particular MAC prefix and multiple RIFs can use the
      same profile. Therefore, the number of possible MAC prefixes is no
      longer one, but the number of profiles supported by the device.
      
      Store the profiles in an IDR and reference count them according to the
      number of RIFs using them.
      
      Associate a RIF with a profile when the RIF is created and remove the
      association when the RIF is deleted.
      
      Change the association following 'NETDEV_CHANGEADDR' events, except when
      only one RIF is using the profile. In which case, change the MAC prefix
      of the profile itself instead of associating the RIF with a new profile.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
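The reference-counting scheme described in the RIF MAC profiles commit can be sketched as follows. This is an illustration under stated assumptions, not the driver's code: it uses a fixed array instead of an IDR, every `sketch_*` name is hypothetical, the limit of 4 profiles is taken from the Spectrum-2 example output, and the rule that an in-place edit is skipped when another profile already holds the new prefix is an assumption added to keep prefixes unique.

```c
#include <stdint.h>

#define SKETCH_MAX_PROFILES 4   /* assumed device limit, for illustration */

struct sketch_profile {
	uint64_t prefix;     /* MAC prefix this profile represents */
	unsigned int refs;   /* number of RIFs using it; 0 means free */
};

struct sketch_router {
	struct sketch_profile profiles[SKETCH_MAX_PROFILES];
};

/* Return the index of the in-use profile for 'prefix', or -1. */
static int sketch_profile_find(const struct sketch_router *r, uint64_t prefix)
{
	for (int i = 0; i < SKETCH_MAX_PROFILES; i++)
		if (r->profiles[i].refs && r->profiles[i].prefix == prefix)
			return i;
	return -1;
}

/* Take a reference on the profile for 'prefix', claiming a free slot
 * if none exists yet. Returns the index, or -1 when all profiles are
 * in use by other prefixes. */
static int sketch_profile_get(struct sketch_router *r, uint64_t prefix)
{
	int id = sketch_profile_find(r, prefix);

	if (id >= 0) {
		r->profiles[id].refs++;
		return id;
	}
	for (int i = 0; i < SKETCH_MAX_PROFILES; i++) {
		if (!r->profiles[i].refs) {
			r->profiles[i].prefix = prefix;
			r->profiles[i].refs = 1;
			return i;
		}
	}
	return -1;
}

/* Drop a reference; the slot becomes free when the count reaches 0. */
static void sketch_profile_put(struct sketch_router *r, int id)
{
	if (r->profiles[id].refs)
		r->profiles[id].refs--;
}

/* Handle a MAC change for a RIF currently using profile 'id'. A sole
 * user edits the profile in place; otherwise the RIF moves to another
 * profile. Returns the new index, or -1 on failure, in which case the
 * old association is kept. */
static int sketch_rif_change_prefix(struct sketch_router *r, int id,
				    uint64_t prefix)
{
	int new_id;

	if (r->profiles[id].prefix == prefix)
		return id;
	if (r->profiles[id].refs == 1 && sketch_profile_find(r, prefix) < 0) {
		r->profiles[id].prefix = prefix;
		return id;
	}
	new_id = sketch_profile_get(r, prefix);
	if (new_id < 0)
		return -1;
	sketch_profile_put(r, id);
	return new_id;
}
```

With all four slots occupied, a RIF that is the only user of its profile can still change its prefix, while a RIF on a shared profile cannot. This is the behaviour the occupancy selftest in this series checks.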
    • mlxsw: spectrum_router: Propagate extack further · 26029225
      Danielle Ratson authored
      The next patch will set the MAC profile of a router interface (RIF) as
      part of its configure() callback. The operation can fail in case the
      maximum number of profiles was exceeded.
      
      Add extack to mlxsw_sp_rif_ops::configure() in order to communicate such
      failures to user space.
      
      In addition, the MAC profile of a RIF can change following a
      'NETDEV_CHANGEADDR' notification. Propagate extack to
      mlxsw_sp_router_port_change_event() so that failures could be
      communicated in this path as well.
      
      No functional changes intended.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: resources: Add resource identifier for RIF MAC profiles · a8428e50
      Danielle Ratson authored
      Add a resource identifier for maximum RIF MAC profiles so that it could
      be later used to query the information from firmware.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Add MAC profile ID field to RITR register · d25d7fc3
      Danielle Ratson authored
      Add MAC profile ID field to RITR register so that it could be used for
      associating a RIF with a MAC profile ID by a later patch.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'netfilter-vrf-rework' · be348926
      David S. Miller authored
      Florian Westphal says:
      
      ====================
      vrf: rework interaction with netfilter/conntrack
      
      V2:
      - fix 'plain integer as null pointer' warning
      - reword commit message in patch 2 to clarify loss of 'ct set untracked'
      
      This patch series aims to solve the to-be-reverted change 09e856d5
      ("vrf: Reset skb conntrack connection on VRF rcv") in a different way.
      
      Rather than have skbs pass through conntrack and nat hooks twice, suppress
      conntrack invocation if the conntrack/nat hook is called from the vrf driver.
      
      First patch deals with 'incoming connection' case:
      1. suppress NAT transformations
      2. skip conntrack confirmation
      
      NAT and conntrack confirmation is done when ip/ipv6 stack calls
      the postrouting hook.
      
      Second patch deals with local packets:
      in vrf driver, mark the skbs as 'untracked', so conntrack output
      hook ignores them.  This skips all nat hooks as well.
      
      Afterwards, remove the untracked state again so the second
      round will pick them up.
      
      One alternative to the chosen implementation would be to add a 'caller
      id' field to 'struct nf_hook_state' and then use that, these patches
      use the more straightforward check of VRF flag on the state->out device.
      
      The two patches apply to both net and net-next. I am targeting -next
      because I think that, since SNAT did not work correctly for so long,
      we can take the longer route.  If you disagree, apply to net at your
      discretion.
      
      The patches apply both with 09e856d5 reverted or still
      in-place, but only with the revert in place ingress conntrack settings
      (zone, notrack etc) start working again.
      
      I've already submitted selftests for vrf+nfqueue and conntrack+vrf.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vrf: run conntrack only in context of lower/physdev for locally generated packets · 8c9c296a
      Florian Westphal authored
      The VRF driver invokes netfilter for output+postrouting hooks so that users
      can create rules that check for 'oif $vrf' rather than lower device name.
      
      This is a problem when NAT rules are configured.
      
      To avoid any conntrack involvement in round 1, tag skbs as 'untracked'
      to prevent conntrack from picking them up.
      
      This gets cleared before the packet gets handed to the ip stack so
      conntrack will be active on the second iteration.
      
      One remaining issue is that a rule like
      
        output ... oif $vrfname notrack
      
      won't propagate to the second round because we can't tell
      'notrack set via ruleset' and 'notrack set by vrf driver' apart.
      However, this isn't a regression: the 'notrack' removal happens
      instead of unconditional nf_reset_ct().
      I'd also like to avoid leaking more vrf specific conditionals into the
      netfilter infra.
      
      For ingress, conntrack has already been done before the packet makes it
      to the vrf driver, with this patch egress does connection tracking with
      lower/physical device as well.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: conntrack: skip confirmation and nat hooks in postrouting for vrf · 8e0538d8
      Florian Westphal authored
      The VRF driver invokes netfilter for output+postrouting hooks so that users
      can create rules that check for 'oif $vrf' rather than lower device name.
      
      Afterwards, ip stack calls those hooks again.
      
      This is a problem when conntrack is used with IP masquerading.
      masquerading has an internal check that re-validates the output
      interface to account for route changes.
      
      This check will trigger in the vrf case.
      
      If the -j MASQUERADE rule matched on the first iteration, then round 2
      finds state->out->ifindex != nat->masq_index: the latter is the vrf
      index, but out->ifindex is the lower device.
      
      The packet gets dropped and the conntrack entry is invalidated.
      
      This change makes conntrack postrouting skip the nat hooks.
      Also skip confirmation.  This allows the second round
      (postrouting invocation from ipv4/ipv6) to create nat bindings.
      
      This also prevents the second round from seeing packets that had their
      source address changed by the nat hook.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'mlx5-updates-2021-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 4900a769
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2021-10-25
      
      Misc updates for mlx5 driver:
      
      1) Misc updates and cleanups:
       - Don't write directly to netdev->dev_addr, From Jakub Kicinski
       - Remove unnecessary checks for slow path flag in tc module
       - Fix unused function warning of mlx5i_flow_type_mask
       - Bridge, support replacing existing FDB entry
      
      2) Sub Functions, Reduction in memory usage:
       - Reduce flow counters bulk query buffer size
       - Implement max_macs devlink parameter
       - Add devlink vendor params to control Event Queue sizes
       - Added SF life cycle trace points by Parav/
      
      3) From Aya, Firmware health buffer reporting improvements
       - Print health buffer by log level and more missing information
       - Periodic update of host time to firmware
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: don't free a FIN sk_buff in tcp_remove_empty_skb() · cf12e6f9
      Jon Maxwell authored
      v1: Implement a more general statement as recommended by Eric Dumazet. The
      sequence number will be advanced, so this check will fix the FIN case and
      other cases.
      
      A customer reported sockets stuck in the CLOSING state. A Vmcore revealed that
      the write_queue was not empty as determined by tcp_write_queue_empty() but the
      sk_buff containing the FIN flag had been freed and the socket was zombied in
      that state. Corresponding pcaps show no FIN from the Linux kernel on the wire.
      
      Some instrumentation was added to the kernel and it was found that there is a
      timing window where tcp_sendmsg() can run after tcp_send_fin().
      
      tcp_sendmsg() will hit an error, for example:
      
              if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
                      goto do_error;
      
      tcp_remove_empty_skb() will then free the FIN sk_buff as "skb->len == 0". The
      TCP socket is now wedged in the FIN-WAIT-1 state because the FIN is never sent.
      
      If the other side sends a FIN packet the socket will transition to CLOSING and
      remain that way until the system is rebooted.
      
      Fix this by checking for the FIN flag in the sk_buff and not freeing it if
      the flag is set. Testing confirmed that this fixed the issue.
      
      Fixes: fdfc5c85 ("tcp: remove empty skb from write queue in error cases")
      Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
      Reported-by: Monir Zouaoui <Monir.Zouaoui@mail.schwarz>
      Reported-by: Simon Stier <simon.stier@mail.schwarz>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
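The difference between the length-based check and the "more general statement" mentioned in the tcp_remove_empty_skb() commit can be sketched as below. The commit message does not quote the final code, so this assumes the general check compares the start and end sequence numbers of the segment; `sketch_skb` and both helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of a queued segment. */
struct sketch_skb {
	uint32_t seq;      /* first sequence number covered */
	uint32_t end_seq;  /* one past the last sequence number covered */
	uint32_t len;      /* payload bytes */
};

/* Length-based check: an skb without payload is considered empty.
 * A bare FIN has len == 0, so it would be freed by mistake. */
static bool sketch_empty_by_len(const struct sketch_skb *skb)
{
	return skb->len == 0;
}

/* Sequence-based check: an skb is empty only if it covers no
 * sequence space. A FIN occupies one sequence number, so it is kept. */
static bool sketch_empty_by_seq(const struct sketch_skb *skb)
{
	return skb->seq == skb->end_seq;
}
```

A bare FIN carries no payload but still advances the sequence number by one, so only the sequence-based test tells it apart from a truly empty skb.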
    • Merge branch 'small-fixes-for-true-expression-checks' · 36d935a0
      Jakub Kicinski authored
      Jean Sacren says:
      
      ====================
      Small fixes for true expression checks
      
      This series fixes checks of true !rc expression.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1634974124.git.sakiwit@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: qed_dev: fix check of true !rc expression · 036f590f
      Jean Sacren authored
      Remove the check of !rc in (!rc && !resc_lock_params.b_granted) since it
      is always true.
      Signed-off-by: Jean Sacren <sakiwit@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: qed_ptp: fix check of true !rc expression · 165f8e82
      Jean Sacren authored
      Remove the check of !rc in (!rc && !params.b_granted) since it is always
      true.
      
      We should also use constant 0 for return.
      Signed-off-by: Jean Sacren <sakiwit@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'tcp-receive-path-optimizations' · e43b76ab
      Jakub Kicinski authored
      Eric Dumazet says:
      
      ====================
      tcp: receive path optimizations
      
      This series aims to reduce cache line misses in RX path.
      
      I am still working on better cache locality in tcp_sock but
      this will wait a few more weeks.
      ====================
      
      Link: https://lore.kernel.org/r/20211025164825.259415-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv6/tcp: small drop monitor changes · 12c8691d
      Eric Dumazet authored
      Two kfree_skb() calls must be replaced by consume_skb()
      for skbs that are not technically dropped.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv4: guard IP_MINTTL with a static key · 020e71a3
      Eric Dumazet authored
      RFC 5082 IP_MINTTL option is rarely used on hosts.
      
      Add a static key to remove from TCP fast path useless code,
      and potential cache line miss to fetch inet_sk(sk)->min_ttl
      
      Note that once ip4_min_ttl static key has been enabled,
      it stays enabled until next boot.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
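The logic of guarding the IP_MINTTL check can be modelled in user space with a plain global flag. This is only a sketch of the control flow: a real static key patches the branch at run time, which a boolean cannot reproduce, and `sketch_sock`, `sketch_min_ttl_enabled`, `sketch_set_min_ttl()` and `sketch_ttl_ok()` are hypothetical names.

```c
#include <stdbool.h>

/* Userspace stand-in for a static key: a plain global flag. The real
 * mechanism patches code at run time; this only models the logic. */
static bool sketch_min_ttl_enabled;

struct sketch_sock {
	unsigned char min_ttl;  /* 0 means no minimum configured */
};

/* Setting a non-zero minimum flips the global flag on. The flag is
 * never cleared, matching "stays enabled until next boot". */
static void sketch_set_min_ttl(struct sketch_sock *sk, unsigned char ttl)
{
	if (ttl)
		sketch_min_ttl_enabled = true;
	sk->min_ttl = ttl;
}

/* Fast-path check: the per-socket field is only read once some
 * socket has enabled the option. Returns true if the packet passes. */
static bool sketch_ttl_ok(const struct sketch_sock *sk, unsigned char pkt_ttl)
{
	if (!sketch_min_ttl_enabled)
		return true;
	return pkt_ttl >= sk->min_ttl;
}
```

As long as no socket has ever set the option, the fast path returns before touching `sk->min_ttl`, which is how the potential cache line miss is avoided.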
    • ipv4: annotate data races arount inet->min_ttl · 14834c4f
      Eric Dumazet authored
      No report yet from KCSAN, yet worth documenting the races.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv6: guard IPV6_MINHOPCOUNT with a static key · 790eb673
      Eric Dumazet authored
      RFC 5082 IPV6_MINHOPCOUNT is rarely used on hosts.
      
      Add a static key to remove from TCP fast path useless code,
      and potential cache line miss to fetch tcp_inet6_sk(sk)->min_hopcount
      
      Note that once ip6_min_hopcount static key has been enabled,
      it stays enabled until next boot.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv6: annotate data races around np->min_hopcount · cc17c3c8
      Eric Dumazet authored
      No report yet from KCSAN, yet worth documenting the races.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: annotate accesses to sk->sk_rx_queue_mapping · 09b89846
      Eric Dumazet authored
      sk->sk_rx_queue_mapping can be modified locklessly,
      add a couple of READ_ONCE()/WRITE_ONCE() to document this fact.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: avoid dirtying sk->sk_rx_queue_mapping · 342159ee
      Eric Dumazet authored
      sk_rx_queue_mapping is located in a cache line that should be kept read mostly.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
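Keeping a cache line "read mostly" usually comes down to skipping stores that would not change the value. A minimal sketch of that pattern follows, assuming a 16-bit field and volatile casts as a userspace approximation of READ_ONCE()/WRITE_ONCE(); `sketch_update_if_changed()` is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>

/* Store 'val' only when it differs from the current value, so that a
 * cache line that is read far more often than it changes is not
 * dirtied by redundant writes. Returns true if a store happened. */
static bool sketch_update_if_changed(unsigned short *field, unsigned short val)
{
	if (*(const volatile unsigned short *)field == val)
		return false;
	*(volatile unsigned short *)field = val;
	return true;
}
```

The extra comparison costs a load on a line that is likely already shared, whereas an unconditional store would force the line into exclusive state on the writing CPU.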
    • net: avoid dirtying sk->sk_napi_id · 2b13af8a
      Eric Dumazet authored
      sk_napi_id is located in a cache line that can be kept read mostly.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv6: move inet6_sk(sk)->rx_dst_cookie to sk->sk_rx_dst_cookie · ef57c161
      Eric Dumazet authored
      Increase cache locality by moving rx_dst_cookie next to sk->sk_rx_dst
      
      This removes one or two cache line misses in IPv6 early demux (TCP/UDP)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: move inet->rx_dst_ifindex to sk->sk_rx_dst_ifindex · 0c0a5ef8
      Eric Dumazet authored
      Increase cache locality by moving rx_dst_ifindex next to sk->sk_rx_dst
      
      This is part of an effort to reduce cache line misses in TCP fast path.
      
      This removes one cache line miss in early demux.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ax88796c: fix fetching error stats from percpu containers · fd559a94
      Alexander Lobakin authored
      rx_dropped, tx_dropped, rx_frame_errors and rx_crc_errors are being
      wrongly fetched from the target container rather than source percpu
      ones.
      No idea if that goes from the vendor driver or was brainoed during
      the refactoring, but fix it either way.
      
      Fixes: a97c69ba ("net: ax88796c: ASIX AX88796C SPI Ethernet Adapter Driver")
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Acked-by: Łukasz Stelmach <l.stelmach@samsung.com>
      Link: https://lore.kernel.org/r/20211023121148.113466-1-alobakin@pm.me
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
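The ax88796c fix concerns which container the counters are read from when per-CPU statistics are folded into a total. A minimal sketch of the intended pattern, assuming a fixed CPU count and only two counters; `sketch_stats`, `SKETCH_NR_CPUS` and `sketch_fold_stats()` are hypothetical and do not mirror the driver's structures.

```c
#include <stdint.h>

#define SKETCH_NR_CPUS 4  /* assumed CPU count, for illustration */

struct sketch_stats {
	uint64_t rx_dropped;
	uint64_t tx_dropped;
};

/* Fold per-CPU counters into 'total'. Each value is read from the
 * per-CPU source container, never from 'total' itself. */
static void sketch_fold_stats(struct sketch_stats *total,
			      const struct sketch_stats percpu[SKETCH_NR_CPUS])
{
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		total->rx_dropped += percpu[cpu].rx_dropped;
		total->tx_dropped += percpu[cpu].tx_dropped;
	}
}
```

Reading the right-hand side from `total` instead of `percpu[cpu]` would compile without complaint but would never pick up the per-CPU values, which is the class of mistake the commit describes.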
  2. 25 Oct, 2021 9 commits