    tcp: add support for SO_PEEK_OFF socket option · 05ea4916
    Jon Maloy authored
    When reading received messages from a socket with MSG_PEEK, we may want
    to read the contents with an offset, like we can do with pread/preadv()
    when reading files. Currently, it is not possible to do that.
    
    In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
    similar to how it is done for Unix domain sockets.
    
    In the iperf3 log examples shown below, we can observe a throughput
    improvement of 15-20% in the direction host->namespace when using the
    protocol splicer 'pasta' (https://passt.top).
    This is a consistent result.
    
    pasta(1) and passt(1) implement user-mode networking for network
    namespaces (containers) and virtual machines by means of a translation
    layer between a Layer-2 network interface and native Layer-4 sockets
    (TCP, UDP, ICMP/ICMPv6 echo).
    
    Received, pending TCP data to the container/guest is kept in kernel
    buffers until acknowledged, so the tool routinely needs to fetch new
    data from the socket, skipping data that was already sent.
    
    At the moment, this is implemented using a dummy buffer passed to
    recvmsg(). With this change, we no longer need the dummy buffer or the
    related buffer copy (copy_to_user()).
    
    passt and pasta are supported in KubeVirt and libvirt/qemu.
    
    jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
    SO_PEEK_OFF not supported by kernel.
    
    jmaloy@freyr:~/passt# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 192.168.122.1, port 44822
    [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
    [  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
    [  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
    [  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
    [  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
    [  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
    [  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
    [  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
    [  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
    [  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
    [  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec   receiver
    -----------------------------------------------------------
    Server listening on 5201 (test #2)
    -----------------------------------------------------------
    ^Ciperf3: interrupt - the server has terminated
    jmaloy@freyr:~/passt#
    logout
    [ perf record: Woken up 23 times to write data ]
    [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
    jmaloy@freyr:~/passt$
    
    jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
    SO_PEEK_OFF supported by kernel.
    
    jmaloy@freyr:~/passt# iperf3 -s
    -----------------------------------------------------------
    Server listening on 5201 (test #1)
    -----------------------------------------------------------
    Accepted connection from 192.168.122.1, port 52084
    [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-1.00   sec  1.32 GBytes  11.3 Gbits/sec
    [  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
    [  5]   2.00-3.00   sec  1.26 GBytes  10.8 Gbits/sec
    [  5]   3.00-4.00   sec  1.36 GBytes  11.7 Gbits/sec
    [  5]   4.00-5.00   sec  1.33 GBytes  11.4 Gbits/sec
    [  5]   5.00-6.00   sec  1.21 GBytes  10.4 Gbits/sec
    [  5]   6.00-7.00   sec  1.31 GBytes  11.2 Gbits/sec
    [  5]   7.00-8.00   sec  1.25 GBytes  10.7 Gbits/sec
    [  5]   8.00-9.00   sec  1.33 GBytes  11.5 Gbits/sec
    [  5]   9.00-10.00  sec  1.24 GBytes  10.7 Gbits/sec
    [  5]  10.00-10.04  sec  56.0 MBytes  12.1 Gbits/sec
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-10.04  sec  12.9 GBytes  11.0 Gbits/sec  receiver
    -----------------------------------------------------------
    Server listening on 5201 (test #2)
    -----------------------------------------------------------
    ^Ciperf3: interrupt - the server has terminated
    logout
    [ perf record: Woken up 20 times to write data ]
    [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
    jmaloy@freyr:~/passt$
    
    The perf record confirms this result. Below, we can observe that the
    CPU spends significantly less time in the function ____sys_recvmsg()
    when we have offset support.
    
    Without offset support:
    ----------------------
    jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
                           -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
    46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
    
    With offset support:
    ----------------------
    jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
                           -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
    28.12%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

    Suggested-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
    Signed-off-by: Jon Maloy <jmaloy@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240409152805.913891-1-jmaloy@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>