1. 01 Oct, 2023 8 commits
    • net_sched: sch_fq: add fast path for mostly idle qdisc · 076433bd
      Eric Dumazet authored
      TCQ_F_CAN_BYPASS can be used by few qdiscs.
      
      Idea is that if we queue a packet to an empty qdisc,
      following dequeue() would pick it immediately.
      
      FQ can not use the generic TCQ_F_CAN_BYPASS code,
      because some additional checks need to be performed.
      
      This patch adds a similar fast path to FQ.
      
      Most of the time, qdisc is not throttled,
      and many packets can avoid bringing/touching
      at least four cache lines, and consuming 128 bytes
      of memory to store the state of a flow.
      
      After this patch, netperf can send UDP packets about 13 % faster,
      and pktgen goes 30 % faster (when FQ is in the way), on a fast NIC.
      
      TCP traffic is also improved, thanks to a reduction of cache line misses.
      I have measured a 5 % increase of throughput on a tcp_rr intensive workload.
      
      tc -s -d qd sh dev eth1
      ...
      qdisc fq 8004: parent 1:2 limit 10000p flow_limit 100p buckets 1024
         orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit
         refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
       Sent 5646784384 bytes 1985161 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
        flows 122 (inactive 122 throttled 0)
        gc 0 highprio 0 fastpath 659990 throttled 27762 latency 8.57us
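
The kind of check such a fast path needs can be sketched with a toy model. This is not the kernel code: the struct `toy_fq`, its fields, and the three conditions below are assumptions chosen for illustration, since the commit message only says that FQ needs additional checks beyond the generic TCQ_F_CAN_BYPASS test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the FQ private data. */
struct toy_fq {
	uint32_t flows;           /* all flows known to the qdisc */
	uint32_t inactive_flows;  /* flows with no packet queued */
	uint32_t throttled_flows; /* flows waiting for a future send time */
	bool     rate_limited;    /* a per-qdisc max rate is configured */
};

/* Returns true when a packet could skip per-flow state in this toy model. */
static bool toy_fq_fastpath_ok(const struct toy_fq *q, uint64_t now_ns,
			       uint64_t time_to_send_ns)
{
	assert(q->inactive_flows + q->throttled_flows <= q->flows);

	/* Assumption: a packet scheduled in the future must be paced. */
	if (time_to_send_ns > now_ns)
		return false;

	/* Assumption: rate limiting needs per-flow bookkeeping. */
	if (q->rate_limited)
		return false;

	/* Assumption: if any flow is eligible to send right now,
	 * the new packet must wait its turn in fair-queue mode.
	 */
	if (q->flows != q->inactive_flows + q->throttled_flows)
		return false;

	return true;
}
```

With the counters from the tc output above (flows 122, inactive 122, throttled 0) and a packet due now, this toy check returns true. It returns false as soon as one flow is neither inactive nor throttled.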
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_fq: change how @inactive is tracked · ee9af4e1
      Eric Dumazet authored
      Currently, when one fq qdisc has no more packets to send, it can still
      have some flows stored in its RR lists (q->new_flows & q->old_flows)
      
      This was a design choice, but what is a bit disturbing is that
      the inactive_flows counter does not include the count of empty flows
      in RR lists.
      
      As next patch needs to know better if there are active flows,
      this change makes inactive_flows exact.
      
      Before the patch, following command on an empty qdisc could have returned:
      
      lpaa17:~# tc -s -d qd sh dev eth1 | grep inactive
        flows 1322 (inactive 1316 throttled 0)
        flows 1330 (inactive 1325 throttled 0)
        flows 1193 (inactive 1190 throttled 0)
        flows 1208 (inactive 1202 throttled 0)
      
      After the patch, we now have:
      
      lpaa17:~# tc -s -d qd sh dev eth1 | grep inactive
        flows 1322 (inactive 1322 throttled 0)
        flows 1330 (inactive 1330 throttled 0)
        flows 1193 (inactive 1193 throttled 0)
        flows 1208 (inactive 1208 throttled 0)
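
Why the exact counter matters can be shown with a little arithmetic on the numbers above. The helper name `toy_active_flows` is hypothetical; it simply subtracts the two counters reported by tc from the flow total.

```c
#include <assert.h>
#include <stdint.h>

/* Flows that are neither inactive nor throttled, derived from the
 * three counters printed by "tc -s -d qd sh". Illustrative helper only.
 */
static uint32_t toy_active_flows(uint32_t flows, uint32_t inactive,
				 uint32_t throttled)
{
	assert(inactive + throttled <= flows);
	return flows - inactive - throttled;
}
```

For the first sample, the counters before the patch give 1322 - 1316 - 0 = 6 apparently active flows on an empty qdisc. After the patch the same computation gives 1322 - 1322 - 0 = 0, so a zero result can be trusted to mean that no flow is active.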
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_fq: struct sched_data reorg · 54ff8ad6
      Eric Dumazet authored
      q->flows can be often modified, and q->timer_slack is read mostly.
      
      Exchange the two fields, so that the cache line containing
      quantum, initial_quantum, and other critical parameters
      stays clean (read-mostly).
      
      Move q->watchdog next to q->stat_throttled
      
      Add comments explaining how the structure is split in
      three different parts.
      
      pahole output before the patch:
      
      struct fq_sched_data {
      	struct fq_flow_head        new_flows;            /*     0  0x10 */
      	struct fq_flow_head        old_flows;            /*  0x10  0x10 */
      	struct rb_root             delayed;              /*  0x20   0x8 */
      	u64                        time_next_delayed_flow; /*  0x28   0x8 */
      	u64                        ktime_cache;          /*  0x30   0x8 */
      	unsigned long              unthrottle_latency_ns; /*  0x38   0x8 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	struct fq_flow             internal __attribute__((__aligned__(64))); /*  0x40  0x80 */
      
      	/* XXX last struct has 16 bytes of padding */
      
      	/* --- cacheline 3 boundary (192 bytes) --- */
      	u32                        quantum;              /*  0xc0   0x4 */
      	u32                        initial_quantum;      /*  0xc4   0x4 */
      	u32                        flow_refill_delay;    /*  0xc8   0x4 */
      	u32                        flow_plimit;          /*  0xcc   0x4 */
      	unsigned long              flow_max_rate;        /*  0xd0   0x8 */
      	u64                        ce_threshold;         /*  0xd8   0x8 */
      	u64                        horizon;              /*  0xe0   0x8 */
      	u32                        orphan_mask;          /*  0xe8   0x4 */
      	u32                        low_rate_threshold;   /*  0xec   0x4 */
      	struct rb_root *           fq_root;              /*  0xf0   0x8 */
      	u8                         rate_enable;          /*  0xf8   0x1 */
      	u8                         fq_trees_log;         /*  0xf9   0x1 */
      	u8                         horizon_drop;         /*  0xfa   0x1 */
      
      	/* XXX 1 byte hole, try to pack */
      
      <bad>	u32                        flows;                /*  0xfc   0x4 */
      	/* --- cacheline 4 boundary (256 bytes) --- */
      	u32                        inactive_flows;       /* 0x100   0x4 */
      	u32                        throttled_flows;      /* 0x104   0x4 */
      	u64                        stat_gc_flows;        /* 0x108   0x8 */
      	u64                        stat_internal_packets; /* 0x110   0x8 */
      	u64                        stat_throttled;       /* 0x118   0x8 */
      	u64                        stat_ce_mark;         /* 0x120   0x8 */
      	u64                        stat_horizon_drops;   /* 0x128   0x8 */
      	u64                        stat_horizon_caps;    /* 0x130   0x8 */
      	u64                        stat_flows_plimit;    /* 0x138   0x8 */
      	/* --- cacheline 5 boundary (320 bytes) --- */
      	u64                        stat_pkts_too_long;   /* 0x140   0x8 */
      	u64                        stat_allocation_errors; /* 0x148   0x8 */
      <bad>	u32                        timer_slack;          /* 0x150   0x4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	struct qdisc_watchdog      watchdog;             /* 0x158  0x48 */
      
      	/* size: 448, cachelines: 7, members: 34 */
      	/* sum members: 411, holes: 2, sum holes: 5 */
      	/* padding: 32 */
      	/* paddings: 1, sum paddings: 16 */
      	/* forced alignments: 1 */
      };
      
      pahole output after the patch:
      
      struct fq_sched_data {
      	struct fq_flow_head        new_flows;            /*     0  0x10 */
      	struct fq_flow_head        old_flows;            /*  0x10  0x10 */
      	struct rb_root             delayed;              /*  0x20   0x8 */
      	u64                        time_next_delayed_flow; /*  0x28   0x8 */
      	u64                        ktime_cache;          /*  0x30   0x8 */
      	unsigned long              unthrottle_latency_ns; /*  0x38   0x8 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	struct fq_flow             internal __attribute__((__aligned__(64))); /*  0x40  0x80 */
      
      	/* XXX last struct has 16 bytes of padding */
      
      	/* --- cacheline 3 boundary (192 bytes) --- */
      	u32                        quantum;              /*  0xc0   0x4 */
      	u32                        initial_quantum;      /*  0xc4   0x4 */
      	u32                        flow_refill_delay;    /*  0xc8   0x4 */
      	u32                        flow_plimit;          /*  0xcc   0x4 */
      	unsigned long              flow_max_rate;        /*  0xd0   0x8 */
      	u64                        ce_threshold;         /*  0xd8   0x8 */
      	u64                        horizon;              /*  0xe0   0x8 */
      	u32                        orphan_mask;          /*  0xe8   0x4 */
      	u32                        low_rate_threshold;   /*  0xec   0x4 */
      	struct rb_root *           fq_root;              /*  0xf0   0x8 */
      	u8                         rate_enable;          /*  0xf8   0x1 */
      	u8                         fq_trees_log;         /*  0xf9   0x1 */
      	u8                         horizon_drop;         /*  0xfa   0x1 */
      
      	/* XXX 1 byte hole, try to pack */
      
      <good>	u32                        timer_slack;          /*  0xfc   0x4 */
      	/* --- cacheline 4 boundary (256 bytes) --- */
      <good>	u32                        flows;                /* 0x100   0x4 */
      	u32                        inactive_flows;       /* 0x104   0x4 */
      	u32                        throttled_flows;      /* 0x108   0x4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	u64                        stat_throttled;       /* 0x110   0x8 */
      <better> struct qdisc_watchdog     watchdog;             /* 0x118  0x48 */
      	/* --- cacheline 5 boundary (320 bytes) was 32 bytes ago --- */
      	u64                        stat_gc_flows;        /* 0x160   0x8 */
      	u64                        stat_internal_packets; /* 0x168   0x8 */
      	u64                        stat_ce_mark;         /* 0x170   0x8 */
      	u64                        stat_horizon_drops;   /* 0x178   0x8 */
      	/* --- cacheline 6 boundary (384 bytes) --- */
      	u64                        stat_horizon_caps;    /* 0x180   0x8 */
      	u64                        stat_flows_plimit;    /* 0x188   0x8 */
      	u64                        stat_pkts_too_long;   /* 0x190   0x8 */
      	u64                        stat_allocation_errors; /* 0x198   0x8 */
      
      	/* Force padding: */
      	u64                        :64;
      	u64                        :64;
      	u64                        :64;
      	u64                        :64;
      
      	/* size: 448, cachelines: 7, members: 34 */
      	/* sum members: 411, holes: 2, sum holes: 5 */
      	/* padding: 32 */
      	/* paddings: 1, sum paddings: 16 */
      	/* forced alignments: 1 */
      };
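
The cache line placement in the two pahole listings can be checked by hand. The sketch below assumes the 64-byte cache lines shown in the pahole boundary comments; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHELINE_BYTES 64u

/* Index of the cache line holding the first byte of a field. */
static size_t toy_first_line(size_t offset)
{
	return offset / TOY_CACHELINE_BYTES;
}

/* Index of the cache line holding the last byte of a field. */
static size_t toy_last_line(size_t offset, size_t size)
{
	assert(size > 0);
	return (offset + size - 1) / TOY_CACHELINE_BYTES;
}
```

Before the patch, `flows` at 0xfc falls on line 3 together with `quantum` at 0xc0, so every update of `flows` dirties the line holding the read-mostly parameters. After the patch, `timer_slack` takes the 0xfc slot and `flows` moves to 0x100 on line 4. The watchdog at 0x118 with size 0x48 starts on line 4 and ends on line 5, which matches the "was 32 bytes ago" boundary comment.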
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: constify qdisc_priv() · 1add9073
      Eric Dumazet authored
      In order to propagate const qualifiers, we change qdisc_priv()
      to accept a possibly const argument.
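
A minimal sketch of the idea, using toy types rather than the kernel's struct Qdisc. The names `toy_qdisc`, `toy_priv`, `toy_qdisc_priv()` and `toy_flows()` are all hypothetical, and the embedded private struct is an assumption made to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical private area and qdisc; not the kernel's struct Qdisc. */
struct toy_priv {
	unsigned int flows;
};

struct toy_qdisc {
	unsigned int flags;
	struct toy_priv priv;
};

/* Accepts a possibly const qdisc, so read-only callers need no cast. */
static inline void *toy_qdisc_priv(const struct toy_qdisc *sch)
{
	return (void *)&sch->priv;
}

/* A read-only helper can now take a const pointer all the way down. */
static unsigned int toy_flows(const struct toy_qdisc *sch)
{
	const struct toy_priv *q;

	assert(sch != NULL);
	q = toy_qdisc_priv(sch);
	return q->flows;
}
```

Callers that need to modify the private data still get a plain `void *` and can assign it to a non-const pointer, as before.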
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp_delack_max' · 66ac08a7
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: add tcp_delack_max()
      
      First patches are adding const qualifiers to four existing helpers.
      
      Third patch adds a much needed companion feature to RTAX_RTO_MIN.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: derive delack_max from rto_min · bbf80d71
      Eric Dumazet authored
      While BPF allows setting icsk->icsk_delack_max
      and/or icsk->icsk_rto_min, we have an ip route
      attribute (RTAX_RTO_MIN) to be able to tune rto_min,
      but nothing to consequently adjust max delayed ack,
      which varies from 40 ms to 200 ms (TCP_DELACK_{MIN|MAX}).
      
      This makes RTAX_RTO_MIN of almost no practical use,
      unless customers are in big trouble.
      
      Modern datacenter communications want to set
      rto_min to ~5 ms, and the max delayed ack one jiffy
      smaller to avoid spurious retransmits.
      
      After this patch, an "rto_min 5" route attribute will
      effectively lower max delayed ack timers to 4 ms.
      
      Note in the following ss output, "rto:6 ... ato:4"
      
      $ ss -temoi dst XXXXXX
      State Recv-Q Send-Q           Local Address:Port       Peer Address:Port  Process
      ESTAB 0      0        [2002:a05:6608:295::]:52950   [2002:a05:6608:297::]:41597
           ino:255134 sk:1001 <->
               skmem:(r0,rb1707063,t872,tb262144,f0,w0,o0,bl0,d0) ts sack
       cubic wscale:8,8 rto:6 rtt:0.02/0.002 ato:4 mss:4096 pmtu:4500
       rcvmss:536 advmss:4096 cwnd:10 bytes_sent:54823160 bytes_acked:54823121
       bytes_received:54823120 segs_out:1370582 segs_in:1370580
       data_segs_out:1370579 data_segs_in:1370578 send 16.4Gbps
       pacing_rate 32.6Gbps delivery_rate 1.72Gbps delivered:1370579
       busy:26920ms unacked:1 rcv_rtt:34.615 rcv_space:65920
       rcv_ssthresh:65535 minrtt:0.015 snd_wnd:65536
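
The relationship described above can be sketched as a small helper. This is a simplified model, not the kernel function: the name `toy_delack_max()` is hypothetical, all values are in jiffies, and the floor of one jiffy is an assumption added so the result never reaches zero.

```c
#include <assert.h>

/* Cap the maximum delayed-ACK timer one jiffy below rto_min.
 * delack_max: configured upper bound (jiffies)
 * rto_min:    minimum RTO from the route attribute (jiffies)
 */
static unsigned int toy_delack_max(unsigned int delack_max,
				   unsigned int rto_min)
{
	unsigned int from_rto_min;

	assert(delack_max > 0);
	/* Assumption: never go below one jiffy. */
	from_rto_min = (rto_min > 2 ? rto_min : 2) - 1;
	return delack_max < from_rto_min ? delack_max : from_rto_min;
}
```

Assuming one jiffy is 1 ms (HZ=1000), `toy_delack_max(200, 5)` yields 4, matching the "rto_min 5" example where the max delayed ack timer drops to 4 ms.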
      
      While we could argue this patch fixes a bug with RTAX_RTO_MIN,
      I do not add a Fixes: tag, so that we can soak it a bit before
      asking backports to stable branches.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: constify tcp_rto_min() and tcp_rto_min_us() argument · f68a181f
      Eric Dumazet authored
      Make clear these functions do not change any field from TCP socket.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: constify sk_dst_get() and __sk_dst_get() argument · 5033f58d
      Eric Dumazet authored
      Both helpers only read fields from their socket argument.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 30 Sep, 2023 1 commit
  3. 28 Sep, 2023 11 commits
  4. 22 Sep, 2023 5 commits
    • Merge branch 'mlxsw-multicast' · 5a1b322c
      David S. Miller authored
      Petr Machata says:
      
      ====================
      mlxsw: Improve blocks selection for IPv6 multicast forwarding
      
      Amit Cohen writes:
      
      The driver configures two ACL regions during initialization, these regions
      are used for IPv4 and IPv6 multicast forwarding. Entries residing in these
      two regions match on the {SIP, DIP, VRID} key elements.
      
      Currently for IPv6 region, 9 key blocks are used. This can be improved by
      reducing the number of key blocks needed for the IPv6 region to 8. It is
      possible to use key blocks that mix subsets of the VRID element with
      subsets of the DIP element.
      
      To make this happen, we have to take into account the algorithm that chooses
      which key blocks will be used. It is lazy and not the optimal one, as finding
      the optimum is a complex task. It searches for the block that contains the
      most required elements, chooses it, removes the elements that appear in the
      chosen block and starts again searching for the block that contains the most
      elements.
      
      To optimize the number of blocks for IPv6 multicast forwarding, handle
      the following:
      
      1. Add support for key blocks that mix subsets of the VRID element with
      subsets of the DIP element.
      
      2. Prevent the algorithm from choosing other blocks for VRID.
      Currently, we have the block 'ipv4_4' which contains 2 sub-elements of
      VRID. With the existing algorithm, this block might be chosen, then 8
      blocks must be chosen for SIP and DIP and we will get 9 blocks to match on
      {SIP, DIP, VRID}. Therefore, replace this block with a new block 'ipv4_5'
      that contains 1 element for VRID, this will not be chosen for IPv6 as VRID
      element will be broken to several sub-elements. In this way we can get 8
      blocks for IPv6 multicast forwarding.
      
      This improvement was tested and indeed 8 blocks are used instead of 9.
      
      v2:
      - Resending without changes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: Edit IPv6 key blocks to use one less block for multicast forwarding · 92953e7a
      Amit Cohen authored
      Two ACL regions that are configured by the driver during initialization are
      the ones used for IPv4 and IPv6 multicast forwarding. Entries residing
      in these two regions match on the {SIP, DIP, VRID} key elements.
      
      Currently for IPv6 region, 9 key blocks are used:
      * 4 for SIP - 'ipv4_1', 'ipv6_{3,4,5}'
      * 4 for DIP - 'ipv4_0', 'ipv6_{0,1,2/2b}'
      * 1 for VRID - 'ipv4_4b'
      
      This can be improved by reducing the number of key blocks needed for
      the IPv6 region to 8. It is possible to use key blocks that mix subsets of
      the VRID element with subsets of the DIP element.
      The following key blocks can be used:
      * 4 for SIP - 'ipv4_1', 'ipv6_{3,4,5}'
      * 1 for subset of DIP - 'ipv4_0'
      * 3 for the rest of DIP and subsets of VRID - 'ipv6_{0,1,2/2b}'
      
      To make this happen, add VRID sub-elements as part of existing keys -
      'ipv6_{0,1,2/2b}'. Note that one of the sub-elements is called
      VRID_ROUTER_MSB and does not contain bit numbers like the rest, as for
      Spectrum < 4 this element represents bits 8-10 and for Spectrum-4 it
      represents bits 8-11.
      
      Breaking VRID into 3 sub-elements makes the driver use one less block in
      IPv6 region for multicast forwarding. The sub-elements can be filled in
      blocks that are used for destination IP.
      
      The algorithm in the driver that chooses which key blocks will be used is
      lazy and not the optimal one. It searches the block that contains the most
      elements that are required, chooses it, removes the elements that appear
      in the chosen block and starts again searching the block that contains the
      most elements.
      
      When key block 'ipv4_4' is defined, the algorithm might choose it, as it
      contains 2 sub-elements of VRID, then 8 blocks must be chosen for SIP and
      DIP and we get 9 blocks to match on {SIP, DIP, VRID}. That is why we had to
      remove key block 'ipv4_4' in a previous patch and use key block that
      contains one field for VRID.
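
The effect of the lazy selection can be reproduced with a toy greedy loop over bitmasks. This is a sketch, not the driver code: the element names, the block contents and the function `toy_greedy_blocks()` are hypothetical and much smaller than the real key blocks, and ties are broken by taking the first candidate.

```c
#include <assert.h>

/* Toy key elements: three DIP parts and three VRID sub-elements. */
enum {
	TOY_D0 = 1 << 0, TOY_D1 = 1 << 1, TOY_D2 = 1 << 2,
	TOY_V0 = 1 << 3, TOY_V1 = 1 << 4, TOY_V2 = 1 << 5,
};

static int toy_popcount(unsigned int x)
{
	int n = 0;

	for (; x; x &= x - 1)
		n++;
	return n;
}

/* Repeatedly pick the block covering the most still-wanted elements.
 * Returns the number of blocks used, or -1 if some element is uncoverable.
 */
static int toy_greedy_blocks(const unsigned int *blocks, int nblocks,
			     unsigned int wanted)
{
	int used = 0;

	assert(nblocks > 0);
	while (wanted) {
		int best = -1, best_hits = 0;

		for (int i = 0; i < nblocks; i++) {
			int hits = toy_popcount(blocks[i] & wanted);

			if (hits > best_hits) {
				best_hits = hits;
				best = i;
			}
		}
		if (best < 0)
			return -1;
		wanted &= ~blocks[best];
		used++;
	}
	return used;
}
```

With blocks {V0|V1|V2}, {D0|V0}, {D1|V1}, {D2|V2}, the loop first grabs the VRID-only block because it covers three elements, then still needs all three mixed blocks for the DIP parts, for 4 blocks in total. Without the VRID-only block, the three mixed blocks cover everything. This mirrors, at a smaller scale, why removing 'ipv4_4' lets the driver reach 8 blocks instead of 9.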
      
      This improvement was tested and indeed 8 blocks are used instead of 9.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_acl_flex_keys: Add 'ipv4_5b' flex key · c6caabdf
      Amit Cohen authored
      The previous patch replaced the key block 'ipv4_4' with 'ipv4_5'. The
      corresponding block for Spectrum-4 is 'ipv4_4b'. To be consistent, replace
      key block 'ipv4_4b' with 'ipv4_5b'.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: Add 'ipv4_5' flex key · c2f3e10a
      Amit Cohen authored
      Currently virtual router ID element is broken to two sub-elements -
      'VIRT_ROUTER_LSB' and 'VIRT_ROUTER_MSB'. It was broken as this field is
      broken in 'ipv4_4' flex key which is used for IPv4 in Spectrum < 4.
      For Spectrum-4, we use 'ipv4_4b' flex key which contains one field for
      virtual router, this key is not supported in older ASICs.
      
      Add 'ipv4_5' flex key which is supported in all ASICs and contains one
      field for virtual router. Then there is no reason to use 'VIRT_ROUTER_LSB'
      and 'VIRT_ROUTER_MSB', remove them and add one element 'VIRT_ROUTER' for
      this field.
      
      The motivation is to get rid of 'ipv4_4' flex key, as it might be chosen
      for IPv6 multicast forwarding region. This will not allow the improvement
      in a following patch. See more details in the cover letter and in a
      following patch.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • hamradio: baycom: remove useless link in Kconfig · 84c19e65
      Peter Lafreniere authored
      The Kconfig help text for baycom drivers suggests that more information
      on the hardware can be found at <https://www.baycom.de>. The website now
      includes no information on their ham radio products other than a mention
      that they were once produced by the company, saying:
      "The amateur radio equipment is now no longer part and business of BayCom GmbH"
      
      As there is no information relevant to the baycom driver on the site,
      remove the link.
      Signed-off-by: Peter Lafreniere <peter@n8pjl.ca>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 21 Sep, 2023 15 commits