• Eric Dumazet's avatar
    tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT · a74f0fa0
    Eric Dumazet authored
    TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12
    as a step to enable bigger tcp sndbuf limits.
    
    It works reasonably well, but the following happens :
    
    Once the limit is reached, TCP stack generates
    an [E]POLLOUT event for every incoming ACK packet.
    
    This causes a high number of context switches.
    
    This patch implements the strategy David Miller added
    in sock_def_write_space() :
    
     - If TCP socket has a notsent_lowat constraint of X bytes,
       allow sendmsg() to fill up to X bytes, but send [E]POLLOUT
       only if number of notsent bytes is below X/2
    
    This considerably reduces TCP_NOTSENT_LOWAT overhead,
    while allowing to keep the pipe full.
    
    Tested:
     100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM
    
    A:/# cat /proc/sys/net/ipv4/tcp_wmem
    4096	262144	64000000
    A:/# super_netperf 100 -H B -l 1000 -- -K bbr &
    
    A:/# grep TCP /proc/net/sockstat
    TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/
    
    A:/# vmstat 5 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 256220672  13532 694976    0    0    10     0   28   14  0  1 99  0  0
     2  0      0 256320016  13532 698480    0    0   512     0 715901 5927  0 10 90  0  0
     0  0      0 256197232  13532 700992    0    0   735    13 771161 5849  0 11 89  0  0
     1  0      0 256233824  13532 703320    0    0   512    23 719650 6635  0 11 89  0  0
     2  0      0 256226880  13532 705780    0    0   642     4 775650 6009  0 12 88  0  0
    
    A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat
    
    A:/# grep TCP /proc/net/sockstat
    TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow
    
    A:/# vmstat 5 5  # check that context switches have not inflated too much.
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     2  0      0 260386512  13592 662148    0    0    10     0   17   14  0  1 99  0  0
     0  0      0 260519680  13592 604184    0    0   512    13 726843 12424  0 10 90  0  0
     1  1      0 260435424  13592 598360    0    0   512    25 764645 12925  0 10 90  0  0
     1  0      0 260855392  13592 578380    0    0   512     7 722943 13624  0 11 88  0  0
     1  0      0 260445008  13592 601176    0    0   614    34 772288 14317  0 10 90  0  0
    Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
    Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    a74f0fa0
tcp.h 66.1 KB