1. 29 Apr, 2024 8 commits
    • mlxsw: pci: Reorganize 'mlxsw_pci_queue' structure · c0d92678
      Amit Cohen authored
      The next patch will convert the driver to use NAPI for event processing,
      after which the tasklet mechanism will be used only for EQs. Reorganize
      'mlxsw_pci_queue' to hold EQ and CQ attributes in a union. For now, add
      a tasklet for both EQ and CQ. This will be changed in the next patch,
      when the CQ 'tasklet_struct' is replaced with a NAPI instance.
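
      A rough sketch of the reorganized structure (field names and layout
      here are assumptions for illustration, not the exact driver code):

      	struct mlxsw_pci_queue {
      		/* ... common queue fields ... */
      		union {
      			struct {
      				struct tasklet_struct tasklet;
      			} eq;
      			struct {
      				/* Becomes a NAPI instance in the
      				 * next patch.
      				 */
      				struct tasklet_struct tasklet;
      			} cq;
      		} u;
      	};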
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: pci: Initialize dummy net devices for NAPI · 5d01ed2e
      Amit Cohen authored
      mlxsw will use NAPI for event processing in an upcoming patch. In
      preparation, add two dummy net devices and initialize them.
      
      A NAPI instance should be attached to a net device. In most network
      drivers each queue is used by a single net device, so the mapping
      between net devices and NAPI instances is intuitive. In our case, Rx
      queues are not per port, they are per trap group. Tx queues are mapped
      to net devices, but we do not have a separate queue for each local
      port; several ports share the same queue.
      
      Use init_dummy_netdev() to initialize dummy net devices for NAPI.
      
      To run the NAPI poll method in a kernel thread, the net device to
      which the NAPI instance is attached should be marked as 'threaded'.
      It is recommended to handle Tx packets in softIRQ context, as this is
      usually a short task - just freeing the Tx packet which has been
      transmitted. Handling Rx packets is a more complicated task, so
      drivers can use a dedicated kernel thread to process them. This
      allows processing packets from different Rx queues in parallel. We
      would like to handle only Rx packets in kernel threads, which means
      that we will use two dummy net devices (one for Rx and one for Tx)
      and mark only the Rx one as 'threaded'. Do not fail in case setting
      'threaded' fails; it is better to fall back to regular softIRQ NAPI
      than to prevent the driver from loading.
      
      Note that the net devices are initialized with init_dummy_netdev(),
      so they are not registered, which means they are not visible to the
      user. It will therefore not be possible to change the 'threaded'
      configuration from user space, but this is reasonable in our case:
      no other configuration makes sense, considering that the user has no
      influence over the usage of each queue.
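
      A minimal sketch of the initialization, assuming 'napi_dev_rx' and
      'napi_dev_tx' net_device members embedded in the driver's PCI
      structure (names and error handling are illustrative):

      	static void mlxsw_pci_napi_devs_init(struct mlxsw_pci *mlxsw_pci)
      	{
      		init_dummy_netdev(&mlxsw_pci->napi_dev_rx);
      		init_dummy_netdev(&mlxsw_pci->napi_dev_tx);

      		/* Only the Rx device is threaded; on failure, fall
      		 * back to softIRQ NAPI instead of failing the probe.
      		 */
      		if (dev_set_threaded(&mlxsw_pci->napi_dev_rx, true))
      			pr_warn("mlxsw: using softIRQ NAPI for Rx\n");
      	}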
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: pci: Ring RDQ and CQ doorbells once per several completions · 6b3d015c
      Amit Cohen authored
      Currently, for each CQE in a CQ, we ring the CQ doorbell, then handle
      the RDQ and ring the RDQ doorbell. Finally, we ring the CQ arm
      doorbell - once per CQ tasklet.
      
      The idea of ringing the CQ doorbell before the RDQ doorbell is to
      ensure that when we post a new WQE (after the RDQ is handled), there
      is an available CQE. This was done to work around a hardware bug, as
      part of commit c9ebea04 ("mlxsw: pci: Ring CQ's doorbell before
      RDQ's").
      
      There is no real reason to ring the RDQ and CQ doorbells for each
      completion; it is better to handle several completions and reduce the
      number of doorbell rings, as access to the hardware is expensive
      (time wise) and might take time because of memory barriers.
      
      A previous patch changed the CQ tasklet to handle up to 64 Rx
      packets. With this limitation, we can ring the CQ and RDQ doorbells
      once per CQ tasklet. The doorbell counters are increased by the
      number of packets that were handled, so the device knows for which
      completion to send an additional event.
      
      To avoid reordering the CQ and RDQ doorbell rings, let the tasklet
      ring the RDQ doorbell as well; mlxsw_pci_cqe_rdq_handle() updates the
      counter but does not ring the doorbell.
      
      Note that with this change there is no need to copy the CQE, as we
      ring the CQ doorbell only after Rx packet processing (which uses the
      CQE) is done.
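
      Schematically, the CQ tasklet flow after this change looks roughly
      as follows (helper names are illustrative, not the exact mlxsw
      symbols):

      	/* One doorbell ring per batch instead of per completion. */
      	while (items < budget && (cqe = next_sw_cqe(q)) != NULL) {
      		/* Handles the packet and bumps the RDQ consumer
      		 * counter, but no longer rings the RDQ doorbell.
      		 */
      		rdq_handle(mlxsw_pci, rdq, cqe);
      		items++;
      	}

      	/* CQ doorbell first, preserving the ordering required by
      	 * the hardware bug handled in commit c9ebea04.
      	 */
      	ring_cq_doorbell(mlxsw_pci, q);
      	ring_rdq_doorbell(mlxsw_pci, rdq);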
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: pci: Handle up to 64 Rx completions in tasklet · e28d8aba
      Amit Cohen authored
      We can get many completions in one interrupt. Currently, the CQ
      tasklet handles completions up to half the queue size and then arms
      the hardware to generate additional events, which means that if there
      were additional completions we did not handle, we will immediately
      get an additional interrupt to handle the rest.
      
      The decision to handle up to half of the queue size is arbitrary and
      was made in 2015, when the mlxsw driver was added to the kernel. An
      additional fact that should be taken into account is that while WQEs
      from the RDQ are handled, the CPU that runs the tasklet is dedicated
      to this task, which means that we might hold the CPU for a long time.
      
      Handle WQEs in smaller chunks, then ring the CQ arm doorbell to ask
      the hardware to send additional notifications. Set the chunk size to
      64, as this is the budget customarily used with NAPI, and the driver
      will move to NAPI in an upcoming patch. Note that for now we use the
      arm doorbell to retrigger the CQ tasklet, but with NAPI it will be
      more efficient, as software will reschedule the poll method and we
      will not involve the hardware for that.
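
      In pseudo-C, the bounded loop looks something like this (the define
      name is hypothetical; 64 matches the customary NAPI budget):

      	#define MLXSW_PCI_CQ_BUDGET 64	/* hypothetical name */

      	while (items < MLXSW_PCI_CQ_BUDGET && (cqe = next_sw_cqe(q))) {
      		handle_completion(q, cqe);
      		items++;
      	}
      	/* Re-arm the CQ; the device raises a new event if
      	 * completions are still pending, retriggering the tasklet.
      	 */
      	ring_cq_arm_doorbell(q);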
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: use call_rcu_hurry() in fib6_info_release() · b5327b9a
      Eric Dumazet authored
      This is a follow-up to commit c4e86b43 ("net: add two more
      call_rcu_hurry()").
      
      fib6_info_destroy_rcu() calls nexthop_put() or fib6_nh_release().
      
      We must not delay it too much, or we risk unregister_netdevice/ref_tracker
      traces because references to the netdev are not released in time.
      
      This should speed up device/netns dismantles when CONFIG_RCU_LAZY=y.
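
      The change itself is small; fib6_info_release() now reads roughly
      (a sketch, not the verbatim diff):

      	static inline void fib6_info_release(struct fib6_info *f6i)
      	{
      		if (f6i && refcount_dec_and_test(&f6i->fib6_ref))
      			/* Was call_rcu(); the hurried variant skips
      			 * CONFIG_RCU_LAZY batching so netdev references
      			 * are dropped promptly.
      			 */
      			call_rcu_hurry(&f6i->rcu, fib6_info_destroy_rcu);
      	}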
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: use call_rcu_hurry() in inet_free_ifa() · 61f5338d
      Eric Dumazet authored
      This is a follow-up to commit c4e86b43 ("net: add two more
      call_rcu_hurry()").
      
      Our reference to ifa->ifa_dev must be freed ASAP to release the
      reference to the netdev in the same way, via the following call
      chain:
      
      inet_rcu_free_ifa()
      
      	in_dev_put()
      	 -> in_dev_finish_destroy()
      	   -> netdev_put()
      
      This should speed up device/netns dismantles when CONFIG_RCU_LAZY=y.
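
      The corresponding one-line change, sketched:

      	static void inet_free_ifa(struct in_ifaddr *ifa)
      	{
      		/* Was call_rcu(); the hurried variant avoids the
      		 * CONFIG_RCU_LAZY delay on the netdev reference.
      		 */
      		call_rcu_hurry(&ifa->rcu_head, inet_rcu_free_ifa);
      	}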
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: give more chances to rcu in netdev_wait_allrefs_any() · cd42ba1c
      Eric Dumazet authored
      This came up while reviewing commit c4e86b43 ("net: add two more
      call_rcu_hurry()").
      
      Paolo asked if adding one synchronize_rcu() would help.
      
      While synchronize_rcu() does not help, making sure to call
      rcu_barrier() before msleep(wait) definitely helps to ensure that
      lazy call_rcu() callbacks are completed.
      
      Instead of waiting ~100 seconds in my tests, the ref_tracker splats
      occur only once, and netdev_wait_allrefs_any() latency is reduced to
      the strict minimum.
      
      Ideally, we should audit our call_rcu() users to make sure no
      refcount (or cascading call_rcu()) is held for too long, because
      rcu_barrier() is quite expensive.
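
      A sketch of the resulting wait loop (simplified; identifiers are
      illustrative, not the exact code in net/core/dev.c):

      	while (!netdev_refs_gone(dev)) {
      		/* Complete pending (possibly lazy) call_rcu()
      		 * callbacks before sleeping, so references are
      		 * dropped promptly.
      		 */
      		rcu_barrier();
      		msleep(wait);
      		wait = min(wait << 1, max_wait); /* exponential backoff */
      	}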
      
      Fixes: 0e4be9e5 ("net: use exponential backoff in netdev_wait_allrefs")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/all/28bbf698-befb-42f6-b561-851c67f464aa@kernel.org/T/#m76d73ed6b03cd930778ac4d20a777f22a08d6824
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: ti: am65-cpsw-qos: Add support to taprio for past base_time · d63394ab
      Tanmay Patil authored
      If the base-time for taprio is in the past, start the schedule at a
      time of the form "base_time + N * cycle_time", where N is the
      smallest integer such that this time is in the future.
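
      Sketched in C, the adjustment could look like this (names are
      illustrative; the driver's actual helpers may differ):

      	u64 now = ktime_get_real_ns();

      	if (base_time < now) {
      		u64 n = div64_u64(now - base_time, cycle_time) + 1;

      		/* First occurrence of the schedule start that is
      		 * still in the future.
      		 */
      		base_time += n * cycle_time;
      	}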
      Signed-off-by: Tanmay Patil <t-patil@ti.com>
      Signed-off-by: Chintan Vankar <c-vankar@ti.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 27 Apr, 2024 1 commit
  3. 26 Apr, 2024 31 commits