• Neal Cardwell's avatar
    tcp: when scheduling TLP, time of RTO should account for current ACK · ed66dfaf
    Neal Cardwell authored
    Fix the TLP scheduling logic so that when scheduling a TLP probe, we
    ensure that the estimated time at which an RTO would fire accounts for
    the fact that ACKs indicating forward progress should push back RTO
    times.
    
    After the following fix:
    
    df92c839 ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
    
    we had an unintentional behavior change in the following kind of
    scenario: suppose the RTT variance has been very low recently. Then
    suppose we send out a flight of N packets and our RTT is 100ms:
    
    t=0: send a flight of N packets
    t=100ms: receive an ACK for N-1 packets
    
    The response before df92c839 that was:
      -> schedule a TLP for now + RTO_interval
    
    The response after df92c839 is:
      -> schedule a TLP for t=0 + RTO_interval
    
    Since RTO_interval = srtt + RTT_variance, this means that we have
    scheduled a TLP timer at a point in the future that only accounts for
    RTT_variance. If the RTT_variance term is small, this means that the
    timer fires soon.
    
    Before df92c839 this would not happen, because in that code, when
    we receive an ACK for a prefix of flight, we did:
    
        1) Near the top of tcp_ack(), switch from TLP timer to RTO
           at write_queue_head->paket_tx_time + RTO_interval:
                if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                       tcp_rearm_rto(sk);
    
        2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
                if (flag & FLAG_ACKED) {
                       tcp_rearm_rto(sk);
    
        3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
           to TLP at now + RTO_interval:
                if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                       tcp_schedule_loss_probe(sk);
    
    In df92c839 we removed that 3-phase dance, and instead directly
    set the TLP timer once: we set the TLP timer in cases like this to
    write_queue_head->packet_tx_time + RTO_interval. So if the RTT
    variance is small, then this means that this is setting the TLP timer
    to fire quite soon. This means if the ACK for the tail of the flight
    takes longer than an RTT to arrive (often due to delayed ACKs), then
    the TLP timer fires too quickly.
    
    Fixes: df92c839 ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
    Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
    Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
    Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
    Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    ed66dfaf
tcp_input.c 180 KB