    tcp: send in-queue bytes in cmsg upon read · b75eba76
    Soheil Hassas Yeganeh authored
    Applications with many concurrent connections, high variance
    in receive queue length, and tight memory bounds cannot
    allocate a worst-case buffer size to drain sockets. Knowing
    the receive queue length, applications can optimize how they
    allocate buffers to read from the socket.
    
    The number of bytes pending on the socket is directly
    available through ioctl(FIONREAD/SIOCINQ) and can be
    approximated using getsockopt(MEMINFO) (rmem_alloc includes
    skb overheads in addition to application data). However, both of
    these options add an extra syscall per recvmsg(). Moreover,
    ioctl(FIONREAD/SIOCINQ) takes the socket lock.
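For comparison, the ioctl-based approach described above might look like the following minimal sketch. The helper name `pending_bytes` is hypothetical and not part of this patch; the point is only that the query costs one additional syscall before each read.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: return the number of bytes pending on fd,
 * or -1 on error. This is the extra syscall the commit wants to avoid.
 */
int pending_bytes(int fd)
{
    int n = 0;

    assert(fd >= 0);
    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}
```

A caller would invoke `pending_bytes()` and then `recv()` with a buffer of that size, which is two syscalls per read.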
    
    Add the TCP_INQ socket option to TCP. When this socket
    option is set, recvmsg() relays the number of bytes available
    on the socket for reading to the application via the
    TCP_CM_INQ control message.
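From user space, using the option could look roughly like the sketch below. The function names (`enable_tcp_inq`, `find_inq`, `recv_with_inq`) are illustrative and not defined by this patch. The fallback `#define`s assume the values in include/uapi/linux/tcp.h and only apply when the libc headers predate the option. The cmsg payload is assumed to be an `int`.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Fallback for older libc headers (assumed uapi values). */
#ifndef TCP_INQ
#define TCP_INQ 36
#endif
#ifndef TCP_CM_INQ
#define TCP_CM_INQ TCP_INQ
#endif

/* Ask the kernel to attach a TCP_CM_INQ cmsg to recvmsg() results. */
int enable_tcp_inq(int fd)
{
    int one = 1;

    return setsockopt(fd, IPPROTO_TCP, TCP_INQ, &one, sizeof(one));
}

/* Walk the control messages; return the in-queue hint, or -1 if absent. */
int find_inq(struct msghdr *msg)
{
    struct cmsghdr *cm;
    int inq = -1;

    assert(msg != NULL);
    for (cm = CMSG_FIRSTHDR(msg); cm != NULL; cm = CMSG_NXTHDR(msg, cm)) {
        if (cm->cmsg_level == IPPROTO_TCP &&
            cm->cmsg_type == TCP_CM_INQ &&
            cm->cmsg_len >= CMSG_LEN(sizeof(int)))
            memcpy(&inq, CMSG_DATA(cm), sizeof(inq));
    }
    return inq;
}

/* Read up to len bytes; *inq gets the bytes still queued (a hint) or -1. */
ssize_t recv_with_inq(int fd, void *buf, size_t len, int *inq)
{
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;   /* forces correct alignment */
    } ctl;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;
    ssize_t n;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    n = recvmsg(fd, &msg, 0);
    *inq = (n >= 0) ? find_inq(&msg) : -1;
    return n;
}
```

Call `enable_tcp_inq()` once after the connection is established; every later `recv_with_inq()` then returns both data and the queue hint in a single syscall.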
    
    Calculate the number of bytes after releasing the socket lock
    to include the processed backlog, if any. To avoid an extra
    branch in the hot path of recvmsg() for this new control
    message, move all cmsg processing inside an existing branch for
    processing receive timestamps. Since the socket lock is not held
    when calculating the size of the receive queue, TCP_INQ is only
    a hint. For example, it can overestimate the queue size by one
    byte if a FIN is received.
    
    With this method, applications can start reading from the socket
    using a small buffer, and then use larger buffers based on the
    remaining data when needed.
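One possible buffer-sizing policy built on that hint is sketched here. Nothing in it comes from the patch itself: `next_read_size`, `small`, and `cap` are hypothetical application-side names, and the clamping rules are just one reasonable choice given that the hint can be stale or off by one.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pick the size of the next read buffer from the last in-queue hint.
 * inq_hint < 0 means "no hint available". small and cap are
 * application-chosen bounds (hypothetical), not kernel parameters.
 */
size_t next_read_size(int inq_hint, size_t small, size_t cap)
{
    assert(small > 0 && small <= cap);

    if (inq_hint <= 0)
        return small;            /* unknown or empty queue: stay small */
    if ((size_t)inq_hint > cap)
        return cap;              /* never exceed the memory budget */
    if ((size_t)inq_hint < small)
        return small;            /* keep a floor; the hint may be stale */
    return (size_t)inq_hint;     /* size the buffer to the queued data */
}
```

An application would start each connection with the `small` buffer and call this function after every read to decide whether a larger allocation is worthwhile.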
    
    V3 change-log:
    	As suggested by David Miller, added loads with barrier
    	to check whether we have multiple threads calling recvmsg
    	in parallel. When that happens we lock the socket to
    	calculate inq.
    V4 change-log:
    	Removed inline from a static function.
    Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Willem de Bruijn <willemb@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Neal Cardwell <ncardwell@google.com>
    Suggested-by: David Miller <davem@davemloft.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>