    tcp_bbr: add BBR congestion control · 0f8782ea
    Neal Cardwell authored
    This commit implements a new TCP congestion control algorithm: BBR
    (Bottleneck Bandwidth and RTT). A detailed description of BBR will be
    published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
    "BBR: Congestion-Based Congestion Control".
    
    BBR has significantly increased throughput and reduced latency for
    connections on Google's internal backbone networks and on the
    google.com and YouTube Web servers.
    
    BBR requires changes only on the sender side, not in the network or
    on the receiver side. Thus it can be deployed incrementally on today's
    Internet, or in datacenters.
    
    The Internet has predominantly used loss-based congestion control
    (largely Reno or CUBIC) since the 1980s, relying on packet loss as the
    signal to slow down. While this worked well for many years, loss-based
    congestion control is unfortunately outdated in today's networks. On
    today's Internet, loss-based congestion control causes the infamous
    bufferbloat problem, often causing seconds of needless queuing delay,
    since it fills the bloated buffers in many last-mile links. On today's
    high-speed long-haul links using commodity switches with shallow
    buffers, loss-based congestion control has abysmal throughput because
    it over-reacts to losses caused by transient traffic bursts.
    
    In 1981 Kleinrock and Gail showed that the optimal operating point for
    a network maximizes delivered bandwidth while minimizing delay and
    loss, not only for single connections but for the network as a
    whole. Finding that optimal operating point has been elusive, since
    any single network measurement is ambiguous: network measurements are
    the result of both bandwidth and propagation delay, and those two
    cannot be measured simultaneously.
    
    While it is impossible to disambiguate any single bandwidth or RTT
    measurement, a connection's behavior over time tells a clearer
    story. BBR uses a measurement strategy designed to resolve this
    ambiguity. It combines these measurements with a robust servo loop
    using recent control systems advances to implement a distributed
    congestion control algorithm that reacts to actual congestion, not
    packet loss or transient queue delay, and is designed to converge with
    high probability to a point near the optimal operating point.
    
    In a nutshell, BBR creates an explicit model of the network pipe by
    sequentially probing the bottleneck bandwidth and RTT. On the arrival
    of each ACK, BBR derives the current delivery rate of the last round
    trip, and feeds it through a windowed max-filter to estimate the
    bottleneck bandwidth. Conversely, it uses a windowed min-filter to
    estimate the round-trip propagation delay. The max-filtered bandwidth
    and min-filtered RTT estimates form BBR's model of the network pipe.
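
    To make the model concrete, here is a small, stand-alone C sketch of
    the two filters. It is illustrative only and is not the kernel code;
    the window lengths (10 round trips for bandwidth, 10 seconds for RTT)
    and all names are assumptions for this example, and the filter is
    simplified to adopt the newest sample whenever the old extreme ages out:

      #include <stdio.h>

      #define BW_WIN_RTTS   10     /* assumed max-bw window: ~10 round trips */
      #define RTT_WIN_SECS  10.0   /* assumed min-RTT window: ~10 seconds */

      struct pipe_model {
          double max_bw_bps;       /* windowed max of delivery-rate samples */
          unsigned bw_round;       /* round-trip count of that max sample */
          double min_rtt_s;        /* windowed min of RTT samples */
          double min_rtt_stamp_s;  /* time the min sample was taken */
      };

      /* On each ACK: feed the latest delivery-rate sample into the max filter. */
      void update_max_bw(struct pipe_model *m, double bw_bps, unsigned round)
      {
          if (bw_bps >= m->max_bw_bps || round - m->bw_round > BW_WIN_RTTS) {
              m->max_bw_bps = bw_bps;
              m->bw_round = round;
          }
      }

      /* On each ACK: feed the latest RTT sample into the min filter. */
      void update_min_rtt(struct pipe_model *m, double rtt_s, double now_s)
      {
          if (rtt_s <= m->min_rtt_s || now_s - m->min_rtt_stamp_s > RTT_WIN_SECS) {
              m->min_rtt_s = rtt_s;
              m->min_rtt_stamp_s = now_s;
          }
      }

      int main(void)
      {
          struct pipe_model m = { .min_rtt_s = 1e9 };
          update_max_bw(&m, 5e6, 1);        /* 5 Mbit/s sample in round 1 */
          update_max_bw(&m, 8e6, 2);        /* 8 Mbit/s sample in round 2 */
          update_min_rtt(&m, 0.045, 0.1);   /* 45 ms RTT sample at t=0.1s */
          update_min_rtt(&m, 0.040, 0.2);   /* 40 ms RTT sample at t=0.2s */
          printf("model: max_bw %.1f Mbit/s, min_rtt %.0f ms\n",
                 m.max_bw_bps / 1e6, m.min_rtt_s * 1e3);
          return 0;
      }

    The intuition behind using the extremes is that queuing can only push
    measured RTTs above the true propagation delay, and delivery-rate
    samples generally cannot exceed the true bottleneck bandwidth, so the
    windowed min and max respectively are the samples least distorted by
    queuing.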
    
    Using its model, BBR sets control parameters to govern sending
    behavior. The primary control is the pacing rate: BBR applies a gain
    multiplier to transmit faster or slower than the observed bottleneck
    bandwidth. The conventional congestion window (cwnd) is now the
    secondary control; the cwnd is set to a small multiple of the
    estimated BDP (bandwidth-delay product) in order to allow full
    utilization and bandwidth probing while bounding the potential amount
    of queue at the bottleneck.
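
    A rough, hypothetical arithmetic sketch of how such a model could be
    turned into control parameters (the gain values, the cwnd multiple of
    2, and the variable names are assumptions for illustration, not values
    taken from the kernel source):

      #include <stdio.h>

      int main(void)
      {
          double max_bw_bps  = 10e6;    /* windowed-max bandwidth estimate */
          double min_rtt_s   = 0.040;   /* windowed-min RTT estimate */
          double pacing_gain = 1.0;     /* >1 probes, <1 drains, 1.0 cruises */
          double cwnd_gain   = 2.0;     /* assumed small multiple of the BDP */
          int    mss_bytes   = 1500;

          /* Primary control: pace at gain * estimated bottleneck bandwidth. */
          double pacing_rate_bps = pacing_gain * max_bw_bps;

          /* Secondary control: cap inflight at a small multiple of the BDP. */
          double bdp_bytes = max_bw_bps / 8.0 * min_rtt_s;
          double cwnd_pkts = cwnd_gain * bdp_bytes / mss_bytes;

          printf("pacing rate: %.1f Mbit/s, cwnd: ~%.0f packets\n",
                 pacing_rate_bps / 1e6, cwnd_pkts);
          return 0;
      }

    With a 10 Mbit/s bandwidth estimate and a 40 ms RTT estimate, the BDP
    is 50 KB (about 33 full-sized packets), so this example prints a cwnd
    of roughly 67 packets and a pacing rate equal to the estimated
    bandwidth.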
    
    When a BBR connection starts, it enters STARTUP mode and applies a
    high gain to perform an exponential search to quickly probe the
    bottleneck bandwidth (doubling its sending rate each round trip, like
    slow start). However, instead of continuing until it fills up the
    buffer (i.e. a loss), or until delay or ACK spacing reaches some
    threshold (like Hystart), it uses its model of the pipe to estimate
    when that pipe is full: it judges the pipe full once the estimated
    bandwidth stops growing. At that point it exits
    STARTUP and enters DRAIN mode, where it reduces its pacing rate to
    drain the queue it estimates it has created.
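
    A minimal, stand-alone sketch of that "pipe is full" test: exit STARTUP
    once the bandwidth estimate has stopped growing for several consecutive
    rounds. The 25% growth threshold and the 3-round count below are
    illustrative assumptions, as are the names:

      #include <stdbool.h>
      #include <stdio.h>

      struct startup_state {
          double full_bw;        /* best bandwidth estimate seen in STARTUP */
          int    full_bw_cnt;    /* consecutive rounds without ~25% growth */
          bool   full_pipe;      /* set once growth has plateaued */
      };

      /* Call once per round trip with the current windowed-max bandwidth. */
      void check_full_pipe(struct startup_state *s, double max_bw)
      {
          if (s->full_pipe)
              return;
          if (max_bw >= s->full_bw * 1.25) {   /* still growing: new baseline */
              s->full_bw = max_bw;
              s->full_bw_cnt = 0;
              return;
          }
          if (++s->full_bw_cnt >= 3)           /* flat for 3 rounds in a row */
              s->full_pipe = true;             /* time to leave STARTUP and DRAIN */
      }

      int main(void)
      {
          struct startup_state s = { .full_bw = 1.0 };
          double samples[] = { 2.0, 4.0, 8.0, 8.1, 8.2, 8.1 };
          for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
              check_full_pipe(&s, samples[i]);
              printf("round %u: bw %.1f, full_pipe=%d\n",
                     i, samples[i], s.full_pipe);
          }
          return 0;
      }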
    
    Then BBR enters steady state. In steady state, PROBE_BW mode cycles
    between first pacing faster to probe for more bandwidth, then pacing
    slower to drain any queue it created if no more bandwidth was
    available, and then cruising at the estimated bandwidth to utilize the
    pipe without creating excess queue. Occasionally, on an as-needed
    basis, it sends significantly slower to probe for RTT (PROBE_RTT
    mode).
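
    As a toy illustration of such a gain cycle (the particular gains and
    the eight-phase cycle length are assumptions chosen to mirror the
    description above, not values quoted from the source):

      #include <stdio.h>

      int main(void)
      {
          /* One phase probing above the estimate, one draining below it,
           * then several phases cruising at the estimated bandwidth. */
          const double cycle_gains[] = { 1.25, 0.75, 1, 1, 1, 1, 1, 1 };
          const int cycle_len = sizeof(cycle_gains) / sizeof(cycle_gains[0]);
          double max_bw_mbps = 10.0;   /* windowed-max bandwidth estimate */

          /* Advance one phase per estimated round trip. */
          for (int phase = 0; phase < cycle_len; phase++)
              printf("phase %d: pace at %.2f Mbit/s\n",
                     phase, cycle_gains[phase] * max_bw_mbps);
          return 0;
      }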
    
    BBR has been fully deployed on Google's wide-area backbone networks
    and we're experimenting with BBR on google.com and YouTube on a global
    scale.  Replacing CUBIC with BBR has resulted in significant
    improvements in network latency and application (RPC, browser, and
    video) metrics. For more details please refer to our upcoming ACM
    Queue publication.
    
    Example performance results, to illustrate the difference between BBR
    and CUBIC:
    
    Resilience to random loss (e.g. from shallow buffers):
      Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
      path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
      rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).
    
    Low latency with the bloated buffers common in today's last-mile links:
      Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
      path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
      buffer. Both fully utilize the bottleneck bandwidth, but BBR
      achieves this with a median RTT 25x lower (43 ms instead of 1.09
      secs).
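
    For intuition on these numbers: in the second test, assuming full-sized
    ~1500-byte packets, a 1000-packet buffer holds about 1.5 MB (~12 Mbit),
    which takes roughly 1.2 seconds to drain at 10 Mbps, so a loss-based
    sender that keeps the buffer nearly full sees RTTs on the order of a
    second, consistent with the 1.09 sec figure above, even though the
    path's BDP is only 10 Mbit/s * 40 ms = 50 KB (about 33 packets). In the
    first test, the classic back-of-the-envelope bound for loss-based
    throughput, roughly MSS/(RTT*sqrt(p)), works out to only a few Mbit/s
    at a 100 ms RTT and 1% loss regardless of the 10 Gbps link speed, while
    a sender that does not treat such random losses as congestion can stay
    near link rate.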
    
    Our long-term goal is to improve the congestion control algorithms
    used on the Internet. We are hopeful that BBR can help advance the
    efforts toward this goal, and motivate the community to do further
    research.
    
    Test results, performance evaluations, feedback, and BBR-related
    discussions are very welcome in the public e-mail list for BBR:
    
      https://groups.google.com/forum/#!forum/bbr-dev
    
    NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
    enabled, since pacing is integral to the BBR design and
    implementation. BBR without pacing would not function properly, and
    may incur unnecessarily high packet loss rates.
    Signed-off-by: Van Jacobson <vanj@google.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Nandita Dukkipati <nanditad@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>