• Eric Dumazet's avatar
    tcp: implement rb-tree based retransmit queue · 75c119af
    Eric Dumazet authored
    Using a linear list to store all skbs in write queue has been okay
    for quite a while : O(N) is not too bad when N < 500.
    
    Things get messy when N is the order of 100,000 : Modern TCP stacks
    want 10Gbit+ of throughput even with 200 ms RTT flows.
    
    40 ns per cache line miss means a full scan can use 4 ms,
    blowing away CPU caches.
    
    SACK processing often can use various hints to avoid parsing
    whole retransmit queue. But with high packet losses and/or high
    reordering, hints no longer work.
    
    Sender has to process thousands of unfriendly SACK, accumulating
    a huge socket backlog, burning a cpu and massively dropping packets.
    
    Using an rb-tree for retransmit queue has been avoided for years
    because it added complexity and overhead, but now is the time
    to be more resistant and say no to quadratic behavior.
    
    1) RTX queue is no longer part of the write queue : already sent skbs
    are stored in one rb-tree.
    
    2) Since reaching the head of write queue no longer needs
    sk->sk_send_head, we added an union of sk_send_head and tcp_rtx_queue
    
    Tested:
    
     On receiver :
     netem on ingress : delay 150ms 200us loss 1
     GRO disabled to force stress and SACK storms.
    
    for f in `seq 1 10`
    do
     ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
    done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'
    
    Before patch :
    
    323.87
    351.48
    339.59
    338.62
    306.72
    204.07
    304.93
    291.88
    202.47
    176.88
       2840
    
    After patch:
    
    1700.83
    2207.98
    2070.17
    1544.26
    2114.76
    2124.89
    1693.14
    1080.91
    2216.82
    1299.94
      18053
    Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    75c119af
tcp_ipv4.c 63.8 KB