1. 12 Jul, 2018 27 commits
  2. 11 Jul, 2018 10 commits
    • selftests: forwarding: mirror_lib: Tighten up VLAN capture · db560d16
      Petr Machata authored
      The function do_test_span_vlan_dir_ips() is used for testing whether
      mirrored packets are VLAN-encapsulated. But since it only considers
      VLAN encapsulation, it may end up matching unmirrored ARP traffic as
      well. One consequence is a rare failure of mirror_gre_vlan_bridge_1q's
      test_gretap_untagged_egress. Decreasing ping cadence in mirror_test()
      makes the problem easily reproducible.
      
      Therefore tighten up the match criterion to only count those 802.1q
      packets where the next header is IP.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
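The tightened match criterion from the commit above can be illustrated with a minimal sketch. It is not the selftest's actual filter (which is not shown here); it only demonstrates the idea of counting a frame when the outer EtherType is 802.1Q and the encapsulated EtherType is IP. The helper names `is_vlan_ip` and `make_frame` are hypothetical, and "IP" is assumed to mean IPv4.

```python
import struct

ETH_P_8021Q = 0x8100  # 802.1Q VLAN tag
ETH_P_IP = 0x0800     # IPv4
ETH_P_ARP = 0x0806    # ARP


def is_vlan_ip(frame: bytes) -> bool:
    """Return True only for 802.1Q frames whose inner EtherType is IPv4."""
    # 12 bytes of MAC addresses + 4-byte VLAN tag + 2-byte inner EtherType
    if len(frame) < 18:
        return False
    outer_type, _tci, inner_type = struct.unpack_from("!HHH", frame, 12)
    return outer_type == ETH_P_8021Q and inner_type == ETH_P_IP


def make_frame(vid: int, inner_type: int) -> bytes:
    """Build a dummy VLAN-tagged frame (hypothetical helper for the example)."""
    macs = bytes(12)  # zeroed destination and source MAC addresses
    return macs + struct.pack("!HHH", ETH_P_8021Q, vid, inner_type) + bytes(46)
```

With this check, a VLAN-tagged ARP frame no longer counts as a mirrored packet, which is the false match the commit describes.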
    • Merge branch 'cake-qdisc' · 5025b99c
      David S. Miller authored
      Toke Høiland-Jørgensen says:
      
      ====================
      sched: Add Common Applications Kept Enhanced (cake) qdisc
      
      This patch series adds the CAKE qdisc, and has been split up to ease
      review.
      
      I have attempted to split out each configurable feature into its own patch.
      The first commit adds the base shaper and packet scheduler, while
      subsequent commits add the optional features. The full userspace API and
      most data structures are included in the first commit, but options not
      understood in the base version will be ignored.
      
      The result of applying the entire series is identical to the out-of-tree
      version that has seen extensive testing in previous deployments, most
      notably as an out-of-tree patch to OpenWrt. However, note that I have only
      compile-tested the individual patches, so the whole series should be
      considered as a unit.
      
      ---
      Changelog
      
      v19:
        - Rebase to current net-next.
        - Don't rely on the value of sch->q.qlen to break loops; fixes possible
          infinite loop on multi-queue devices.
        - Don't overwrite NAT flag when setting flow mode.
      
      v18:
        - Rework classification logic in the diffserv case to always hash if
          filter doesn't select a queue, and to run TC filters before
          selecting the diffserv tin (allowing filter to influence this).
        - Make sure we always call qdisc_watchdog_init() in cake_init(), so we
          don't crash in cake_destroy().
      
      v17:
        - Rebase to newest net-next and move the conntrack callback to
          nf_ct_hook
        - Fix a compile error when NF_CONNTRACK is unset.
      
      v16:
        - Move conntrack lookup function into conntrack core and read it via
          RCU so it is only active when the nf_conntrack module is loaded.
          This avoids the module dependency on conntrack for NAT mode. Thanks
          to Pablo for the idea.
      
      v15:
        - Handle ECN flags in ACK filter
      
      v14:
        - Handle seqno wraps and DSACKs in ACK filter
      
      v13:
        - Avoid ktime_t to scalar compares
        - Add class dumping and basic stats
        - Fail with ENOTSUPP when requesting NAT mode and conntrack is not
          available.
        - Parse all TCP options in ACK filter and make sure to only drop safe
          ones. Also handle SACK ranges properly.
      
      v12:
        - Get rid of custom time typedefs. Use ktime_t for time and u64 for
          duration instead.
      
      v11:
        - Fix overhead compensation calculation for GSO packets
        - Change configured rate to be u64 (I ran out of bits before I ran out
          of CPU when testing the effects of the above)
      
      v10:
        - Christmas tree gardening (fix variable declarations to be in reverse
          line length order)
      
      v9:
        - Remove duplicated checks around kvfree() and just call it
          unconditionally.
        - Don't pass __GFP_NOWARN when allocating memory
        - Move options in cake_dump() that are related to optional features to
          later patches implementing the features.
        - Support attaching filters to the qdisc and use the classification
          result to select flow queue.
        - Support overriding diffserv priority tin from skb->priority
      
      v8:
        - Remove inline keyword from function definitions
        - Simplify ACK filter; remove the complex state handling to make the
          logic easier to follow. This will potentially be a bit less efficient,
          but I have not been able to measure a difference.
      
      v7:
        - Split up patch into a series to ease review.
        - Constify the ACK filter.
      
      v6:
        - Fix 6in4 encapsulation checks in ACK filter code
        - Checkpatch fixes
      
      v5:
        - Refactor ACK filter code and hopefully fix the safety issues
          properly this time.
      
      v4:
        - Only split GSO packets if shaping at speeds <= 1Gbps
        - Fix overhead calculation code to also work for GSO packets
        - Don't re-implement kvzalloc()
        - Remove local header include from out-of-tree build (fixes kbuild-bot
          complaint).
        - Several fixes to the ACK filter:
          - Check pskb_may_pull() before deref of transport headers.
          - Don't run ACK filter logic on split GSO packets
          - Fix TCP sequence number compare to deal with wraparounds
      
      v3:
        - Use IS_REACHABLE() macro to fix compilation when sch_cake is
          built-in and conntrack is a module.
        - Switch the stats output to use nested netlink attributes instead
          of a versioned struct.
        - Remove GPL boilerplate.
        - Fix array initialisation style.
      
      v2:
        - Fix kbuild test bot complaint
        - Clean up the netlink ABI
        - Fix checkpatch complaints
        - A few tweaks to the behaviour of cake based on testing carried out
          while writing the paper.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sch_cake: Conditionally split GSO segments · 0c850344
      Toke Høiland-Jørgensen authored
      At lower bandwidths, the transmission time of a single GSO segment can add
      an unacceptable amount of latency due to HOL blocking. Furthermore, with a
      software shaper, any tuning mechanism employed by the kernel to control the
      maximum size of GSO segments is thrown off by the artificial limit on
      bandwidth. For this reason, we split GSO segments into their individual
      packets iff the shaper is active and configured to a bandwidth <= 1 Gbps.
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
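To make the latency argument in the commit above concrete, here is a small arithmetic sketch. It computes how long a single unsplit GSO segment occupies the link at a given rate, and encodes the stated rule (split iff the shaper is active and configured at or below 1 Gbps). The function names are hypothetical, and a rate of 0 is assumed to stand for "shaper inactive".

```python
GBPS = 1_000_000_000  # 1 Gbps in bits per second


def serialization_delay_ms(size_bytes: int, rate_bps: int) -> float:
    """Time in milliseconds that one packet of size_bytes occupies the link."""
    return size_bytes * 8 / rate_bps * 1000.0


def should_split_gso(rate_bps: int) -> bool:
    """Split only when the shaper is active (rate > 0) and at most 1 Gbps."""
    return 0 < rate_bps <= GBPS
```

For example, a 64 KiB segment at 10 Mbit/s blocks the queue for roughly 52 ms, whereas a 1500-byte packet at the same rate takes about 1.2 ms.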
    • sch_cake: Add overhead compensation support to the rate shaper · a729b7f0
      Toke Høiland-Jørgensen authored
      This commit adds configurable overhead compensation support to the rate
      shaper. With this feature, userspace can configure the actual bottleneck
      link overhead and encapsulation mode used, which will be used by the shaper
      to calculate the precise duration of each packet on the wire.
      
      This feature is needed because CAKE is often deployed one or two hops
      upstream of the actual bottleneck (which can be, e.g., inside a DSL or
      cable modem). In this case, the link layer characteristics and overhead
      reported by the kernel do not match the actual bottleneck. Being able to
      set the actual values in use makes it possible to configure the shaper rate
      much closer to the actual bottleneck rate (our experience shows it is
      possible to get within 0.1% of the actual physical bottleneck rate), thus
      keeping latency low without sacrificing bandwidth.
      
      The overhead compensation has three tunables: A fixed per-packet overhead
      size (which, if set, will be accounted from the IP packet header), a
      minimum packet size (MPU) and a framing mode supporting either ATM or PTM
      framing. We include a set of common keywords in TC to help users configure
      the right parameters. If no overhead value is set, the value reported by
      the kernel is used.
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
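The three tunables described in the commit above (per-packet overhead, MPU, and framing mode) can be combined roughly as in the following sketch. This is an illustration under stated assumptions, not the kernel code: it assumes the input is the IP packet length, that ATM framing carries 48 payload bytes per 53-byte cell, and that PTM framing adds one byte per 64 bytes (64b/65b encoding). The names `wire_len` and `wire_time_us` are hypothetical.

```python
def wire_len(ip_len: int, overhead: int = 0, mpu: int = 0,
             framing: str = "none") -> int:
    """Estimate the on-wire size of a packet after overhead compensation."""
    length = ip_len + overhead   # fixed per-packet overhead
    length = max(length, mpu)    # enforce the minimum packet size
    if framing == "atm":
        cells = -(-length // 48)  # ceiling division: number of ATM cells
        length = cells * 53       # each cell is 53 bytes on the wire
    elif framing == "ptm":
        length += -(-length // 64)  # one extra byte per 64 bytes, rounded up
    elif framing != "none":
        raise ValueError(f"unknown framing mode: {framing}")
    return length


def wire_time_us(length_bytes: int, rate_bps: int) -> float:
    """Duration of the packet on the wire, in microseconds."""
    return length_bytes * 8 / rate_bps * 1e6
```

With `overhead=38` and `mpu=84`, a 28-byte packet is charged as 84 bytes and a 1500-byte packet as 1538 bytes, which agrees with the "min/max overhead-adjusted size" figures in the sample `tc -s` output of the base commit.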
    • sch_cake: Add DiffServ handling · 83f8fd69
      Toke Høiland-Jørgensen authored
      This adds support for DiffServ-based priority queueing to CAKE. If the
      shaper is in use, each priority tier gets its own virtual clock, which
      limits that tier's rate to a fraction of the overall shaped rate, to
      discourage trying to game the priority mechanism.
      
      CAKE defaults to a simple, three-tier mode that interprets most code points
      as "best effort", but places CS1 traffic into a low-priority "bulk" tier
      which is assigned 1/16 of the total rate, and a few code points indicating
      latency-sensitive or control traffic (specifically TOS4, VA, EF, CS6, CS7)
      into a "latency sensitive" high-priority tier, which is assigned 1/4 rate.
      The other supported DiffServ modes are a 4-tier mode matching the 802.11e
      precedence rules, as well as two 8-tier modes, one of which implements
      strict precedence of the eight priority levels.
      
      This commit also adds an optional DiffServ 'wash' mode, which will zero out
      the DSCP fields of any packet passing through CAKE. While this can
      technically be done with other mechanisms in the kernel, having the feature
      available in CAKE significantly decreases configuration complexity; and the
      implementation cost is low on top of the other DiffServ-handling code.
      
      Filters and applications can set the skb->priority field to override the
      DSCP-based classification into tiers. If TC_H_MAJ(skb->priority) matches
      CAKE's qdisc handle, the minor number will be interpreted as a priority
      tier if it is less than or equal to the number of configured priority
      tiers.
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
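The default three-tier mode and the skb->priority override from the commit above can be sketched as follows. This is a simplified model, not the kernel's lookup table. The DSCP numbers are the standard values for CS1, VA, EF, CS6 and CS7; treating TOS4 as DSCP 4, and mapping minor number N to tier index N-1, are assumptions. The tier names follow the sample `tc -s` output in the base commit.

```python
BULK, BEST_EFFORT, VOICE = 0, 1, 2  # tier indices (assumed ordering)

DSCP_CS1 = 8
# TOS4 (assumed DSCP 4), VA, EF, CS6, CS7
LATENCY_SENSITIVE = {4, 44, 46, 48, 56}


def tier_for_dscp(dscp: int) -> int:
    """Map a DSCP code point to one of the three default tiers."""
    if dscp == DSCP_CS1:
        return BULK
    if dscp in LATENCY_SENSITIVE:
        return VOICE
    return BEST_EFFORT


def tier_rates(base_bps: int) -> dict:
    """Per-tier rate limits: bulk gets 1/16, latency-sensitive gets 1/4."""
    return {BULK: base_bps // 16, BEST_EFFORT: base_bps, VOICE: base_bps // 4}


def tc_h_maj(handle: int) -> int:
    return handle & 0xFFFF0000  # major part of a TC handle


def tc_h_min(handle: int) -> int:
    return handle & 0x0000FFFF  # minor part of a TC handle


def select_tier(dscp: int, priority: int, qdisc_handle: int,
                n_tiers: int = 3) -> int:
    """Let skb->priority override the DSCP-based choice when it matches."""
    minor = tc_h_min(priority)
    if tc_h_maj(priority) == qdisc_handle and 0 < minor <= n_tiers:
        return minor - 1  # assumption: minor N selects tier index N-1
    return tier_for_dscp(dscp)
```

At a 1 Gbit/s base rate, `tier_rates` yields 62.5 Mbit/s for the bulk tier and 250 Mbit/s for the latency-sensitive tier, matching the `thresh` row of the sample statistics.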
    • sch_cake: Add NAT awareness to packet classifier · ea825115
      Toke Høiland-Jørgensen authored
      When CAKE is deployed on a gateway that also performs NAT (which is a
      common deployment mode), the host fairness mechanism cannot distinguish
      internal hosts from each other, and so fails to work correctly.
      
      To fix this, we add an optional NAT awareness mode, which will query the
      kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
      and use that in the flow and host hashing.
      
      When the shaper is enabled and the host is already performing NAT, the cost
      of this lookup is negligible. However, in unlimited mode with no NAT being
      performed, there is a significant CPU cost at higher bandwidths. For this
      reason, the feature is turned off by default.
      
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
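A toy model may help show why host fairness breaks behind NAT, as the commit above explains. In this sketch a plain dictionary stands in for the conntrack table, mapping a post-NAT tuple back to the original one. All names (`Tuple5`, `host_key`) and the example addresses are hypothetical.

```python
from typing import Dict, NamedTuple, Optional


class Tuple5(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int
    proto: int


def host_key(pkt: Tuple5,
             conntrack: Optional[Dict[Tuple5, Tuple5]],
             nat_mode: bool) -> str:
    """Return the source address used for per-host fairness."""
    if nat_mode and conntrack is not None:
        # Fall back to the packet's own tuple when no entry exists.
        pkt = conntrack.get(pkt, pkt)
    return pkt.src


# Two internal hosts that both appear as 203.0.113.1 after NAT.
post_a = Tuple5("203.0.113.1", 40001, "198.51.100.7", 443, 6)
post_b = Tuple5("203.0.113.1", 40002, "198.51.100.7", 443, 6)
table = {
    post_a: Tuple5("192.168.1.10", 51000, "198.51.100.7", 443, 6),
    post_b: Tuple5("192.168.1.11", 51000, "198.51.100.7", 443, 6),
}
```

Without NAT mode both flows hash to the same host; with it, the pre-NAT addresses keep them apart.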
    • netfilter: Add nf_ct_get_tuple_skb global lookup function · b60a6040
      Toke Høiland-Jørgensen authored
      This adds a global netfilter function to extract a conntrack tuple from an
      skb. The function uses a new function added to nf_ct_hook, which will try
      to get the tuple from skb->_nfct, and do a full lookup if that fails. This
      makes it possible to use the lookup function before the skb has passed
      through the conntrack init hooks (e.g., in an ingress qdisc). The tuple is
      copied to the caller to avoid issues with reference counting.
      
      The function returns false if conntrack is not loaded, allowing it to be
      used without incurring a module dependency on conntrack. This is used by
      the NAT mode in sch_cake.
      
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
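The optional-hook pattern described in the commit above (return false when conntrack is not loaded, try the cached entry first, then fall back to a full lookup, and hand the caller a copy) can be modelled loosely as below. This is only an analogy in Python: the real function operates on kernel structures under RCU, and every name here (`register_hook`, `get_tuple_skb`, `conntrack_lookup`, `CONNTRACK_TABLE`) is hypothetical.

```python
from typing import Callable, Optional, Tuple

# Stand-in for the hook pointer that a loaded conntrack module would set.
_ct_hook: Optional[Callable[[dict], Optional[tuple]]] = None

# Stand-in for the full conntrack table, keyed by a flow identifier.
CONNTRACK_TABLE = {"flow-a": ("192.168.1.10", 40000, "198.51.100.7", 443)}


def register_hook(fn: Optional[Callable[[dict], Optional[tuple]]]) -> None:
    """Install (or clear, with None) the lookup hook."""
    global _ct_hook
    _ct_hook = fn


def conntrack_lookup(skb: dict) -> Optional[tuple]:
    """Use the entry cached on the packet if present, else do a full lookup."""
    cached = skb.get("_nfct")
    if cached is not None:
        return cached
    return CONNTRACK_TABLE.get(skb.get("key"))


def get_tuple_skb(skb: dict) -> Tuple[bool, Optional[tuple]]:
    """Return (found, tuple_copy); found is False when no hook is installed."""
    hook = _ct_hook
    if hook is None:
        return False, None
    found = hook(skb)
    if found is None:
        return False, None
    return True, tuple(found)  # copy, so the caller holds no shared reference
```

Because the caller only ever checks the hook pointer, it needs no hard dependency on the module that provides the lookup.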
    • sch_cake: Add optional ACK filter · 8b713881
      Toke Høiland-Jørgensen authored
      The ACK filter is an optional feature of CAKE which is designed to improve
      performance on links with very asymmetrical rate limits. On such links
      (which are unfortunately quite prevalent, especially for DSL and cable
      subscribers), the downstream throughput can be limited by the number of
      ACKs capable of being transmitted in the *upstream* direction.
      
      Filtering ACKs can, in general, have adverse effects on TCP performance
      because it interferes with ACK clocking (especially in slow start), and it
      reduces the flow's resiliency to ACKs being dropped further along the path.
      To alleviate these drawbacks, the ACK filter in CAKE tries its best to
      always keep enough ACKs queued to ensure forward progress in the TCP flow
      being filtered. It does this by only filtering redundant ACKs. In its
      default 'conservative' mode, the filter will always keep at least two
      redundant ACKs in the queue, while in 'aggressive' mode, it will filter
      down to a single ACK.
      
      The ACK filter works by inspecting the per-flow queue on every packet
      enqueue. Starting at the head of the queue, the filter looks for another
      eligible packet to drop (so the ACK being dropped is always closer to the
      head of the queue than the packet being enqueued). An ACK is eligible only
      if it ACKs *fewer* bytes than the new packet being enqueued, including any
      SACK options. This prevents duplicate ACKs from being filtered, to avoid
      interfering with retransmission logic. In addition, we check TCP header
      options and only drop those that are known to not interfere with sender
      state. In particular, packets with unknown option codes are never dropped.
      
      In aggressive mode, an eligible packet is always dropped, while in
      conservative mode, at least two ACKs are kept in the queue. Only pure ACKs
      (with no data segments) are considered eligible for dropping, but when an
      ACK with data segments is enqueued, this can cause another pure ACK to
      become eligible for dropping.
      
      The approach described above ensures that this ACK filter avoids most of
      the drawbacks of a naive filtering mechanism that only keeps flow state but
      does not inspect the queue. This is the rationale for including the ACK
      filter in CAKE itself rather than as a separate module (such as a TC
      filter, for instance).
      
      Our performance evaluation has shown that on a 30/1 Mbps link with a
      bidirectional traffic test (RRUL), turning on the ACK filter on the
      upstream link improves downstream throughput by ~20% (both modes) and
      upstream throughput by ~12% in conservative mode and ~40% in aggressive
      mode, at the cost of ~5ms of inter-flow latency due to the increased
      congestion.
      
      In *really* pathological cases, the effect can be much larger; for instance,
      the ACK filter increases the achievable downstream throughput on a link
      with 100 Kbps in the upstream direction by an order of magnitude (from ~2.5
      Mbps to ~25 Mbps).
      
      Finally, even though we consider the ACK filter to be safer than most, we
      do not recommend turning it on everywhere: on more symmetrical link
      bandwidths the effect is negligible at best.
      
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
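The queue-inspection logic described in the commit above can be approximated with a heavily simplified sketch. It ignores SACK, sequence-number wraparound, ECN and TCP option checks, all of which the real filter handles. It also assumes one particular reading of "at least two ACKs are kept": in conservative mode an ACK is dropped only when at least two eligible ACKs are queued. `Segment` and `ack_filter` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    ack: int          # cumulative ACK number carried by the packet
    payload: int = 0  # bytes of data carried; 0 means a pure ACK


def ack_filter(queue: List[Segment], new: Segment,
               aggressive: bool = False) -> Optional[Segment]:
    """Enqueue `new` on a per-flow queue; return the dropped ACK, if any."""
    # Eligible: pure ACKs that acknowledge strictly fewer bytes than `new`.
    # Duplicate ACKs (same ack number) are therefore never filtered.
    eligible = [s for s in queue if s.payload == 0 and s.ack < new.ack]
    needed = 1 if aggressive else 2  # conservative-mode threshold (assumed)
    dropped = None
    if len(eligible) >= needed:
        dropped = eligible[0]  # the eligible ACK closest to the queue head
        queue.remove(dropped)
    queue.append(new)
    return dropped
```

In aggressive mode a queue holding ACKs 100 and 200 loses ACK 100 when ACK 300 arrives; in conservative mode a queue holding only ACK 100 is left untouched by an arriving ACK 200.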
    • sch_cake: Add ingress mode · 7298de9c
      Toke Høiland-Jørgensen authored
      The ingress mode is meant to be enabled when CAKE runs downlink of the
      actual bottleneck (such as on an IFB device). The mode changes the shaper
      to also account dropped packets to the shaped rate, as these have already
      traversed the bottleneck.
      
      Enabling ingress mode will also tune the AQM to always keep at least two
      packets queued *for each flow*. This is done by scaling the minimum queue
      occupancy level that will disable the AQM by the number of active bulk
      flows. The rationale for this is that retransmits are more expensive in
      ingress mode, since dropped packets have to traverse the bottleneck again
      when they are retransmitted; thus, being more lenient and keeping a minimum
      number of packets queued will improve throughput in cases where the number
      of active flows is so large that they saturate the bottleneck even at
      their minimum window size.
      
      This commit also adds a separate switch to enable ingress mode rate
      autoscaling. If enabled, the autoscaling code will observe the actual
      traffic rate and adjust the shaper rate to match it. This can help avoid
      latency increases in the case where the actual bottleneck rate decreases
      below the shaped rate. The scaling filters out spikes by an EWMA filter.
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
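Two ideas from the commit above lend themselves to a short sketch: charging dropped packets to the shaped rate in ingress mode, and smoothing rate samples with an EWMA. The model below is a simplification with a single virtual clock; the class `Shaper`, the function `ewma`, and the smoothing weight of 0.125 are all assumptions, not values taken from the patch.

```python
class Shaper:
    """Toy rate shaper with a single virtual clock, in seconds."""

    def __init__(self, rate_bps: int, ingress: bool = False):
        self.rate_bps = rate_bps
        self.ingress = ingress
        self.time_next = 0.0  # earliest time the next packet may be sent

    def _charge(self, size_bytes: int) -> None:
        self.time_next += size_bytes * 8 / self.rate_bps

    def on_dequeue(self, size_bytes: int) -> None:
        self._charge(size_bytes)

    def on_drop(self, size_bytes: int) -> None:
        # In ingress mode the dropped packet already crossed the bottleneck,
        # so it still consumes part of the shaped rate.
        if self.ingress:
            self._charge(size_bytes)


def ewma(avg: float, sample: float, weight: float = 0.125) -> float:
    """Exponentially weighted moving average; `weight` is a made-up value."""
    return avg + weight * (sample - avg)
```

A spike from 100 to 200 moves the smoothed estimate only to 112.5 with this weight, which is the kind of damping the autoscaling description refers to.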
    • sched: Add Common Applications Kept Enhanced (cake) qdisc · 046f6fd5
      Toke Høiland-Jørgensen authored
      sch_cake targets the home router use case and is intended to squeeze the
      most bandwidth and latency out of even the slowest ISP links and routers,
      while presenting an API simple enough that even an ISP can configure it.
      
      Example of use on a cable ISP uplink:
      
      tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
      
      To shape a cable download link (ifb and tc-mirred setup elided):
      
      tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
      
      CAKE is filled with:
      
      * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel-derived
        flow queuing system, which autoconfigures based on the bandwidth.
      * A novel "triple-isolate" mode (the default) which balances per-host
        and per-flow FQ even through NAT.
      * A deficit-based shaper that can also be used in an unlimited mode.
      * 8-way set-associative hashing to reduce flow collisions to a minimum.
      * A reasonable interpretation of various diffserv latency/loss tradeoffs.
      * Support for zeroing diffserv markings for entering and exiting traffic.
      * Support for interacting well with DOCSIS 3.0 shaper framing.
      * Extensive support for DSL framing types.
      * Support for ACK filtering.
      * Extensive statistics for measuring loss, ECN markings, and latency
        variation.
      
      A paper describing the design of CAKE is available at
      https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE
      International Symposium on Local and Metropolitan Area Networks (LANMAN).
      
      This patch adds the base shaper and packet scheduler, while subsequent
      commits add the optional (configurable) features. The full userspace API
      and most data structures are included in this commit, but options not
      understood in the base version will be ignored.
      
      Various versions have been baking as an out-of-tree build for kernel
      versions going back to 3.10, as the embedded router world has been
      running a few years behind mainline Linux. A stable version has been
      generally available on lede-17.01 and later.
      
      sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
      in the sqm-scripts, with sane defaults and vastly simpler configuration.
      
      CAKE's principal author is Jonathan Morton, with contributions from
      Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
      Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht,
      and Loganaden Velvindron.
      
      Testing from Pete Heist, Georgios Amanakis, and the many other members of
      the cake@lists.bufferbloat.net mailing list.
      
      tc -s qdisc show dev eth2
       qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84
       Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12)
        backlog 0b 0p requeues 12
        memory used: 1053008b of 15140Kb
        capacity estimate: 970Mbit
        min/max network layer size:           28 /    1500
        min/max overhead-adjusted size:       84 /    1538
        average network hdr offset:           14
                          Bulk  Best Effort        Voice
         thresh      62500Kbit        1Gbit      250Mbit
         target          5.0ms        5.0ms        5.0ms
         interval      100.0ms      100.0ms      100.0ms
         pk_delay          5us          5us          6us
         av_delay          3us          2us          2us
         sp_delay          2us          1us          1us
         backlog            0b           0b           0b
         pkts          3164050     25030267      9530280
         bytes      3227519915  35396974782  12879808898
         way_inds            0            8            0
         way_miss           21          366           25
         way_cols            0            0            0
         drops               5            0            1
         marks               0            0            0
         ack_drop            0            0            0
         sp_flows            1            3            0
         bk_flows            0            1            1
         un_flows            0            0            0
         max_len         68130        68130        68130
      Tested-by: Pete Heist <peteheist@gmail.com>
      Tested-by: Georgios Amanakis <gamanakis@gmail.com>
      Signed-off-by: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
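The 8-way set-associative hashing mentioned in the feature list of the commit above, together with the `way_inds`, `way_miss` and `way_cols` counters in its sample output, can be illustrated with the following sketch. It is a conceptual model only: the queue count of 1024, the use of `None` to mark an empty slot, and the choice to overwrite the tag on collision are assumptions, and the kernel's notion of an "empty" queue is more involved.

```python
WAYS = 8       # associativity, as stated in the commit message
QUEUES = 1024  # assumed number of flow queues


class FlowTable:
    def __init__(self):
        self.tags = [None] * QUEUES
        self.way_inds = 0  # hits found elsewhere in the set
        self.way_miss = 0  # new flows placed in an empty slot
        self.way_cols = 0  # sets with no free slot: flows share a queue

    def lookup(self, flow_hash: int) -> int:
        """Return the queue index for a flow, updating the counters."""
        idx = flow_hash % QUEUES
        if self.tags[idx] == flow_hash:
            return idx  # direct hit
        base = idx - idx % WAYS  # first slot of this 8-entry set
        slots = [base + (idx + i) % WAYS for i in range(WAYS)]
        for slot in slots:  # search the rest of the set for the flow
            if self.tags[slot] == flow_hash:
                self.way_inds += 1
                return slot
        for slot in slots:  # otherwise claim an empty slot
            if self.tags[slot] is None:
                self.way_miss += 1
                self.tags[slot] = flow_hash
                return slot
        self.way_cols += 1  # set is full: share the directly indexed queue
        self.tags[idx] = flow_hash
        return idx
```

Eight flows whose hashes land in the same set each get their own queue; only a ninth flow in that set is counted as a collision.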
  3. 09 Jul, 2018 3 commits