• Eric Dumazet's avatar
    tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT · 4bfe744f
    Eric Dumazet authored
    I had this bug sitting for too long in my pile, it is time to fix it.
    
    Thanks to Doug Porter for reminding me of it!
    
    We had various attempts in the past, including commit
    0cbe6a8f ("tcp: remove SOCK_QUEUE_SHRUNK"),
    but the issue is that TCP stack currently only generates
    EPOLLOUT from input path, when tp->snd_una has advanced
    and skb(s) cleaned from rtx queue.
    
    If a flow has a big RTT, and/or receives SACKs, it is possible
    that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
    and no more data can be sent until tp->snd_una finally advances.
    
    What is needed is to also check if POLLOUT needs to be generated
    whenever tp->snd_nxt is advanced, from output path.
    
    This bug triggers more often after an idle period, as
    we do not receive ACK for at least one RTT. tcp_notsent_lowat
    could be a fraction of what CWND and pacing rate would allow to
    send during this RTT.
    
    In a followup patch, I will remove the bogus call
    to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
    from tcp_check_space(). Fact that we have decided to generate
    an EPOLLOUT does not mean the application has immediately
    refilled the transmit queue. This optimistic call
    might have been the reason the bug seemed not too serious.
    
    Tested:
    
    200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]
    
    $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
    $ cat bench_rr.sh
    SUM=0
    for i in {1..10}
    do
     V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
     echo $V
     SUM=$(($SUM + $V))
    done
    echo SUM=$SUM
    
    Before patch:
    $ bench_rr.sh
    130000000
    80000000
    140000000
    140000000
    140000000
    140000000
    130000000
    40000000
    90000000
    110000000
    SUM=1140000000
    
    After patch:
    $ bench_rr.sh
    430000000
    590000000
    530000000
    450000000
    450000000
    350000000
    450000000
    490000000
    480000000
    460000000
    SUM=4680000000  # This is 410 % of the value before patch.
    
    Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
    Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
    Reported-by: default avatarDoug Porter <dsp@fb.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    4bfe744f
tcp_input.c 200 KB