1. 03 Feb, 2021 21 commits
    • Jakub Kicinski's avatar
      Merge branch 'add-notifications-when-route-hardware-flags-change' · 389cb1ec
      Jakub Kicinski authored
      Ido Schimmel says:
      
      ====================
      Add notifications when route hardware flags change
      
      Routes installed to the kernel can be programmed to capable devices, in
      which case they are marked with one of two flags. RTM_F_OFFLOAD for
      routes that offload traffic from the kernel and RTM_F_TRAP for routes
      that trap packets to the kernel for processing (e.g., host routes).
      
      These flags are of interest to routing daemons since they would like to
      delay advertisement of routes until they are installed in hardware. This
      allows them to avoid packet loss or misrouted packets. Currently,
      routing daemons do not receive any notifications when these flags are
      changed, requiring them to poll the kernel tables for changes which is
      inefficient.
      
      This series addresses the issue by having the kernel emit RTM_NEWROUTE
      notifications whenever these flags change. The behavior is controlled by
      two sysctls (net.ipv4.fib_notify_on_flag_change and
      net.ipv6.fib_notify_on_flag_change) that default to 0 (no
      notifications).
      
      Note that even if route installation in hardware is improved to be more
      synchronous, these notifications are still of interest. For example, a
      multipath route can change from RTM_F_OFFLOAD to RTM_F_TRAP if its
      neighbours become invalid. A routing daemon can choose to withdraw /
      replace the route in that case. In addition, the deletion of a route
      from the kernel can prompt the installation of an identical route
      (already in kernel, with an higher metric) to hardware.
      
      For testing purposes, netdevsim is aligned to simulate a "real" driver
      that programs routes to hardware.
      
      Series overview:
      
      Patches #1-#2 align netdevsim to perform route programming in a
      non-atomic context
      
      Patches #3-#5 add sysctl to control IPv4 notifications
      
      Patches #6-#8 add sysctl to control IPv6 notifications
      
      Patch #9 extends existing fib tests to set sysctls before running tests
      
      Patch #10 adds test for fib notifications over netdevsim
      ====================
      
      Link: https://lore.kernel.org/r/20210201194757.3463461-1-idosch@idosch.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      389cb1ec
    • Amit Cohen's avatar
      selftests: netdevsim: Add fib_notifications test · 19d36d29
      Amit Cohen authored
      Add test to check fib notifications behavior.
      
      The test checks route addition, route deletion and route replacement for
      both IPv4 and IPv6.
      
      When fib_notify_on_flag_change=0, expect single notification for route
      addition/deletion/replacement.
      
      When fib_notify_on_flag_change=1, expect:
      - two notification for route addition/replacement, first without RTM_F_TRAP
        and second with RTM_F_TRAP.
      - single notification for route deletion.
      
      $ ./fib_notifications.sh
      TEST: IPv4 route addition                                           [ OK ]
      TEST: IPv4 route deletion                                           [ OK ]
      TEST: IPv4 route replacement                                        [ OK ]
      TEST: IPv6 route addition                                           [ OK ]
      TEST: IPv6 route deletion                                           [ OK ]
      TEST: IPv6 route replacement                                        [ OK ]
      Signed-off-by: default avatarAmit Cohen <amcohen@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      19d36d29
    • Amit Cohen's avatar
      selftests: Extend fib tests to run with and without flags notifications · d1a7a489
      Amit Cohen authored
      Run the test cases with both `fib_notify_on_flag_change` sysctls set to
      '1', and then with both sysctls set to '0' to verify there are no
      regressions in the test when notifications are added.
      Signed-off-by: default avatarAmit Cohen <amcohen@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      d1a7a489
    • Amit Cohen's avatar
      net: ipv6: Emit notification when fib hardware flags are changed · 907eea48
      Amit Cohen authored
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel,
      but not necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead
      to a routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      It is also possible for a route already installed in hardware to change
      its action and therefore its flags. For example, a host route that is
      trapping packets can be "promoted" to perform decapsulation following
      the installation of an IPinIP/VXLAN tunnel.
      
      Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed. The aim is to provide an indication to user-space
      (e.g., routing daemons) about the state of the route in hardware.
      
      Introduce a sysctl that controls this behavior.
      
      Keep the default value at 0 (i.e., do not emit notifications) for several
      reasons:
      - Multiple RTM_NEWROUTE notification per-route might confuse existing
        routing daemons.
      - Convergence reasons in routing daemons.
      - The extra notifications will negatively impact the insertion rate.
      - Not all users are interested in these notifications.
      
      Move fib6_info_hw_flags_set() to C file because it is no longer a short
      function.
      Signed-off-by: default avatarAmit Cohen <amcohen@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      907eea48
    • Amit Cohen's avatar
      net: Do not call fib6_info_hw_flags_set() when IPv6 is disabled · efc42879
      Amit Cohen authored
      With the next patch mlxsw and netdevsim will fail in compilation if
      CONFIG_IPV6 is disabled.
      
      Do not call fib6_info_hw_flags_set() when IPv6 is disabled.
      Signed-off-by: default avatarAmit Cohen <amcohen@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      efc42879
    • Amit Cohen's avatar
      net: Pass 'net' struct as first argument to fib6_info_hw_flags_set() · fbaca8f8
      Amit Cohen authored
      The next patch will emit notification when hardware flags are changed,
      in case that fib_notify_on_flag_change sysctl is set to 1.
      
      To know sysctl values, net struct is needed.
      This change is consistent with the IPv4 version, which gets 'net' struct
      as its first argument.
      
      Currently, the only callers of this function are mlxsw and netdevsim.
      Patch the callers to pass net.
      Signed-off-by: default avatarAmit Cohen <amcohen@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      fbaca8f8
    • Amit Cohen's avatar
      net: ipv4: Emit notification when fib hardware flags are changed · 680aea08
      Amit Cohen authored
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel,
      but not necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead to a
      routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      It is also possible for a route already installed in hardware to change
      its action and therefore its flags. For example, a host route that is
      trapping packets can be "promoted" to perform decapsulation following
      the installation of an IPinIP/VXLAN tunnel.
      
      Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed. The aim is to provide an indication to user-space
      (e.g., routing daemons) about the state of the route in hardware.
      
      Introduce a sysctl that controls this behavior.
      
      Keep the default value at 0 (i.e., do not emit notifications) for several
      reasons:
      - Multiple RTM_NEWROUTE notification per-route might confuse existing
        routing daemons.
      - Convergence reasons in routing daemons.
      - The extra notifications will negatively impact the insertion rate.
      - Not all users are interested in these notifications.
      Signed-off-by: default avatarAmit Cohen <amcohen@nvidia.com>
      Acked-by: default avatarRoopa Prabhu <roopa@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      680aea08
    • Amit Cohen's avatar
      net: ipv4: Publish fib_nlmsg_size() · 1e7bdec6
      Amit Cohen authored
      Publish fib_nlmsg_size() to allow it to be used later on from
      fib_alias_hw_flags_set().
      
      Remove the inline keyword since it shouldn't be used inside C files.
      Signed-off-by: default avatarAmit Cohen <amcohen@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      1e7bdec6
    • Amit Cohen's avatar
      net: ipv4: Pass fib_rt_info as const to fib_dump_info() · 08554789
      Amit Cohen authored
      fib_dump_info() does not change 'fri', so pass it as 'const'.
      It will later allow us to invoke fib_dump_info() from
      fib_alias_hw_flags_set().
      Signed-off-by: default avatarAmit Cohen <amcohen@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      08554789
    • Amit Cohen's avatar
      netdevsim: fib: Perform the route programming in a non-atomic context · 0ae3eb7b
      Amit Cohen authored
      Currently, netdevsim implements dummy FIB offload and marks notified
      routes with RTM_F_TRAP flag. netdevsim does not defer route notifications
      to a work queue because it does not need to program any hardware.
      
      Given that netdevsim's purpose is to both give an example implementation
      and allow developers to test their code, align netdevsim to a "real"
      hardware device driver like mlxsw and have it also perform the route
      "programming" in a non-atomic context.
      
      It will be used to test route flags notifications which will be added in
      the next patches.
      
      The following changes are needed when route handling is performed in WQ:
      - Handle the accounting in the main context, to be able to return an
        error for adding route when all the routes are used.
        For FIB_EVENT_ENTRY_REPLACE increase the counter before scheduling
        the delayed work, and in case that this event replaces an existing route,
        decrease the counter as part of the delayed work.
      
      - For IPv6, cannot use fen6_info->rt->fib6_siblings list because it
        might be changed during handling the delayed work.
        Save an array with the nexthops as part of fib6_event struct, and take
        a reference for each nexthop to prevent them from being freed while
        event is queued.
      
      - Change GFP_ATOMIC allocations to GFP_KERNEL.
      
      - Use single work item that is handling a list of ordered routes.
        Handling routes must be processed in the order they were submitted to
        avoid logical errors that could lead to unexpected failures.
      Signed-off-by: default avatarAmit Cohen <amcohen@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Acked-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      0ae3eb7b
    • Amit Cohen's avatar
      netdevsim: fib: Convert the current occupancy to an atomic variable · 9e635a21
      Amit Cohen authored
      When route is added/deleted, the appropriate counter is increased/decreased
      to maintain number of routes.
      
      User can limit the number of routes and then according to the appropriate
      counter, adding more routes than the limitation is forbidden.
      
      Currently, there is one lock which protects hashtable, list and accounting.
      
      Handling the counters will be performed from both atomic context and
      non-atomic context, while the hashtable and the list will be used only from
      non-atomic context and therefore will be protected by a separate lock.
      
      Protect accounting by using an atomic variable, so lock is not needed.
      
      v2:
      * Use atomic64_sub() in nsim_nexthop_account()'s error path
      Signed-off-by: default avatarAmit Cohen <amcohen@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      9e635a21
    • Jakub Kicinski's avatar
      Merge branch 'net-ipa-don-t-disable-napi-in-suspend' · 64b268e1
      Jakub Kicinski authored
      Alex Elder says:
      
      ====================
      net: ipa: don't disable NAPI in suspend
      
      This is version 2 of a series that reworks the order in which things
      happen during channel stop and suspend (and start and resume), in
      order to address a hang that has been observed during suspend.
      The introductory message on the first version of the series gave
      some history which is omitted here.
      
      The end result of this series is that we only enable NAPI and the
      I/O completion interrupt on a channel when we start the channel for
      the first time.  And we only disable them when stopping the channel
      "for good."  In other words, NAPI and the completion interrupt
      remain enabled while a channel is stopped for suspend.
      
      One comment on version 1 of the series suggested *not* returning
      early on success in a function, instead having both success and
      error paths return from the same point at the end of the function
      block.  This has been addressed in this version.
      
      In addition, this version consolidates things a little bit, but the
      net result of the series is exactly the same as version 1 (with the
      exception of the return fix mentioned above).
      
      First, patch 6 in the first version was a small step to make patch 7
      easier to understand.  The two have been combined now.
      
      Second, previous version moved (and for suspend/resume, eliminated)
      I/O completion interrupt and NAPI disable/enable control in separate
      steps (patches).  Now both are moved around together in patch 5 and
      6, which eliminates the need for the final (NAPI-only) patch.
      
      I won't repeat the patch summaries provided in v1:
        https://lore.kernel.org/netdev/20210129202019.2099259-1-elder@linaro.org/
      
      Many thanks to Willem de Bruijn for his thoughtful input.
      ====================
      
      Link: https://lore.kernel.org/r/20210201172850.2221624-1-elder@linaro.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      64b268e1
    • Alex Elder's avatar
      net: ipa: expand last transaction check · e6316920
      Alex Elder authored
      Transactions to send data for a network device can be allocated at
      any time up until the point the TX queue is stopped.  It is possible
      for ipa_start_xmit() to be called in one context just before a
      the transmit queue is stopped in another.
      
      Update gsi_channel_trans_last() so that for TX channels the
      allocated and pending transaction lists are checked--in addition
      to the completed and polled lists--to determine the "last"
      transaction.  This means any transaction that has been allocated
      before the TX queue is stopped will be allowed to complete before
      we conclude the channel is quiesced.
      
      Rework the function a bit to use a list pointer and gotos.
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e6316920
    • Alex Elder's avatar
      net: ipa: don't disable interrupt on suspend · a65c0288
      Alex Elder authored
      No completion interrupts will occur while an endpoint is suspended,
      nor when a channel has been stopped for suspend.  So there's no need
      to disable the interrupt during suspend and re-enable it when
      resuming.  Without any interrupts occurring, there is no need to
      disable/re-enable NAPI for channel suspend/resume either.
      
      We'll only enable NAPI and the interrupt when we first start the
      channel, and disable it again only when it's "really" stopped.
      
      To accomplish this, move the enable/disable calls out of
      __gsi_channel_start() and __gsi_channel_stop(), and into
      gsi_channel_start() and gsi_channel_stop() instead.
      
      Add a call to napi_synchronize() to gsi_channel_suspend(), to ensure
      NAPI polling is done before moving on.
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      a65c0288
    • Alex Elder's avatar
      net: ipa: disable interrupt and NAPI after channel stop · 4fef691c
      Alex Elder authored
      Disable both the I/O completion interrupt and NAPI polling on a
      channel *after* we successfully stop it rather than before.  This
      ensures a completion occurring just before the channel is stopped
      gets processed.
      
      Enable NAPI polling and the interrupt *before* starting a channel
      rather than after, to be symmetric.  A stopped channel won't
      generate any completion interrupts anyway.
      
      Enable NAPI before the interrupt and disable it afterward.
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4fef691c
    • Alex Elder's avatar
      net: ipa: kill gsi_channel_freeze() and gsi_channel_thaw() · bd1ea1e4
      Alex Elder authored
      Open-code gsi_channel_freeze() and gsi_channel_thaw() in all callers
      and get rid of these two functions.  This is part of reworking the
      sequence of things done during channel suspend/resume and start/stop.
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      bd1ea1e4
    • Alex Elder's avatar
      net: ipa: introduce __gsi_channel_start() · 893b838e
      Alex Elder authored
      Create a new function that does most of the work of starting a
      channel.  What's different is that it takes a flag indicating
      whether the channel should really be started or not.  Create
      another new function __gsi_channel_stop() that behaves similarly.
      
      IPA v3.5.1 implements suspend using a special SUSPEND endpoint
      setting.  If the endpoint is suspended when an I/O completes on the
      underlying GSI channel, a SUSPEND interrupt is generated.
      
      Newer versions of IPA do not implement the SUSPEND endpoint mode.
      Instead, endpoint suspend is implemented by simply stopping the
      underlying GSI channel.  In this case, a completing I/O on a
      *stopped* channel causes the SUSPEND interrupt condition.
      
      These new functions put all activity related to starting or stopping
      a channel (including "thawing/freezing" the channel) in one place,
      whether or not the channel is actually started or stopped.
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      893b838e
    • Alex Elder's avatar
      net: ipa: introduce gsi_channel_stop_retry() · 697e834e
      Alex Elder authored
      Create a new helper function that encapsulates issuing a set of
      channel stop commands, retrying if appropriate, with a short delay
      between attempts.
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      697e834e
    • Alex Elder's avatar
      net: ipa: don't thaw channel if error starting · 6b00a76a
      Alex Elder authored
      If an error occurs starting a channel, don't "thaw" it.
      We should assume the channel remains in a non-started state.
      
      Update the comment in gsi_channel_stop(); calls to this function
      are no longer retried.
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6b00a76a
    • Marco Elver's avatar
      net: fix up truesize of cloned skb in skb_prepare_for_shift() · 097b9146
      Marco Elver authored
      Avoid the assumption that ksize(kmalloc(S)) == ksize(kmalloc(S)): when
      cloning an skb, save and restore truesize after pskb_expand_head(). This
      can occur if the allocator decides to service an allocation of the same
      size differently (e.g. use a different size class, or pass the
      allocation on to KFENCE).
      
      Because truesize is used for bookkeeping (such as sk_wmem_queued), a
      modified truesize of a cloned skb may result in corrupt bookkeeping and
      relevant warnings (such as in sk_stream_kill_queues()).
      
      Link: https://lkml.kernel.org/r/X9JR/J6dMMOy1obu@elver.google.com
      Reported-by: syzbot+7b99aafdcc2eedea6178@syzkaller.appspotmail.com
      Suggested-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarMarco Elver <elver@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20210201160420.2826895-1-elver@google.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      097b9146
    • Davide Caratti's avatar
      mptcp: fix length of MP_PRIO suboption · ec99a470
      Davide Caratti authored
      With version 0 of the protocol it was legal to encode the 'Subflow Id' in
      the MP_PRIO suboption, to specify which subflow would change its 'Backup'
      flag. This has been removed from v1 specification: thus, according to RFC
      8684 §3.3.8, the resulting 'Length' for MP_PRIO changed from 4 to 3 byte.
      
      Current Linux generates / parses MP_PRIO according to the old spec, using
      'Length' equal to 4, and hardcoding 1 as 'Subflow Id'; RFC compliance can
      improve if we change 'Length' in other to become 3, leaving a 'Nop' after
      the MP_PRIO suboption. In this way the kernel will emit and accept *only*
      MP_PRIO suboptions that are compliant to version 1 of the MPTCP protocol.
      
       unpatched 5.11-rc kernel:
       [root@bottarga ~]# tcpdump -tnnr unpatched.pcap | grep prio
       reading from file unpatched.pcap, link-type LINUX_SLL (Linux cooked v1)
       dropped privs to tcpdump
       IP 10.0.3.2.48433 > 10.0.1.1.10006: Flags [.], ack 1, win 502, options [nop,nop,TS val 4032325513 ecr 1876514270,mptcp prio non-backup id 1,mptcp dss ack 14084896651682217737], length 0
      
       patched 5.11-rc kernel:
       [root@bottarga ~]# tcpdump -tnnr patched.pcap | grep prio
       reading from file patched.pcap, link-type LINUX_SLL (Linux cooked v1)
       dropped privs to tcpdump
       IP 10.0.3.2.49735 > 10.0.1.1.10006: Flags [.], ack 1, win 502, options [nop,nop,TS val 1276737699 ecr 2686399734,mptcp prio non-backup,nop,mptcp dss ack 18433038869082491686], length 0
      
      Changes since v2:
       - when accounting for option space, don't increment 'TCPOLEN_MPTCP_PRIO'
         and use 'TCPOLEN_MPTCP_PRIO_ALIGN' instead, thanks to Matthieu Baerts.
      Changes since v1:
       - refactor patch to avoid using 'TCPOLEN_MPTCP_PRIO' with its old value,
         thanks to Geliang Tang.
      
      Fixes: 06706542 ("mptcp: add the outgoing MP_PRIO support")
      Reviewed-by: default avatarMat Martineau <mathew.j.martineau@linux.intel.com>
      Reviewed-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarDavide Caratti <dcaratti@redhat.com>
      Reviewed-by: default avatarMatteo Croce <mcroce@linux.microsoft.com>
      Link: https://lore.kernel.org/r/846cdd41e6ad6ec88ef23fee1552ab39c2f5a3d1.1612184361.git.dcaratti@redhat.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ec99a470
  2. 02 Feb, 2021 19 commits