    Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux · 7f9eee19
    Jakub Kicinski authored
    Pavel Begunkov says:
    
    ====================
    io_uring zerocopy send
    
    The patchset implements io_uring zerocopy send. It works with both registered
    and normal buffers; mixing them is allowed but not recommended. Apart from the
    usual request completions, just as with MSG_ZEROCOPY, io_uring separately
    notifies userspace when buffers are freed and can be reused (see API design
    below). These notifications are delivered into io_uring's Completion Queue.
    The "buffer-free" notifications are not necessarily per request: userspace
    controls this and should explicitly attach a number of requests to a single
    notification. The series also adds some internal optimisations for registered
    buffers, such as removing page referencing.
    
    From the kernel networking perspective there are two main changes. The first
    one is passing ubuf_info into the network layer from io_uring (inside an
    in-kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
    caching on the io_uring side, but also helps to avoid cross-referencing
    and synchronisation problems. The second part is an optional optimisation
    removing page referencing for requests with registered buffers.
    
    UDP was benchmarked with an optimised version of the selftest (see [1]), which
    sends a batch of requests, waits for completions and repeats. The "zc + flush"
    column posts one additional "buffer-free" notification per request, while
    plain "zc" doesn't post buffer notifications at all.
    
    NIC (requests / second):
    IO size | non-zc    | zc             | zc + flush
    4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
    1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
    1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
    600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)
    
    dummy (requests / second):
    IO size | non-zc    | zc             | zc + flush
    8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
    4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
    1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
    600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)
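
    The percentages in parentheses appear to be throughput relative to the non-zc
    column of the same row. A minimal sketch of that calculation follows; the
    helper name relative_gain is made up for illustration, and the figures are
    copied from the tables above. The results come out close to the quoted
    values, which seem to be rounded loosely.

```python
def relative_gain(baseline, measured):
    """Percentage change of measured throughput relative to baseline."""
    return (measured / baseline - 1.0) * 100.0


if __name__ == "__main__":
    # Requests per second, taken from the "non-zc" and "zc" columns.
    print(f"NIC, 4000 bytes:   {relative_gain(495134, 606420):+.1f}%")
    print(f"dummy, 8000 bytes: {relative_gain(1299916, 2396600):+.1f}%")
    # A negative result means zerocopy was slower than the copying path.
    print(f"NIC, 1000 bytes, zc + flush: {relative_gain(584677, 560885):+.1f}%")
```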
    
    Previously it also brought a massive performance speedup compared to the
    msg_zerocopy tool (see [3]), which is probably not super interesting. A
    further set of refcounting optimisations was omitted from the series for
    simplicity, as they don't change the picture drastically. They will be sent
    as a follow-up, along with flushing optimisations that close the performance
    gap between the last two columns.
    
    For TCP on localhost (with hacks enabling localhost zerocopy) and including
    additional overhead for receive:
    
    IO size | non-zc    | zc
    1200    | 4174      | 4148
    4096    | 7597      | 11228
    
    With a real NIC and 1200-byte sends, zc is ~5-10% worse than non-zc. The
    omitted optimisations may help somewhat, and it should look better for 4000
    bytes, but that couldn't be tested properly because of setup problems.
    
    Links:
    
      liburing (benchmark + tests):
      [1] https://github.com/isilence/liburing/tree/zc_v4
    
      kernel repo:
      [2] https://github.com/isilence/linux/tree/zc_v4
    
      RFC v1:
      [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/
    
      RFC v2:
      https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/
    
      Net patches based:
      git@github.com:isilence/linux.git zc_v4-net-base
      or
      https://github.com/isilence/linux/tree/zc_v4-net-base
    
    API design overview:
    
      The series introduces an io_uring concept of notifiers. From the userspace
      perspective, a notifier is an entity to which it can bind one or more
      requests and then request that it be flushed. Flushing a notifier makes it
      impossible to attach new requests to it, and instructs the notifier to post
      a completion once all requests attached to it are completed and the kernel
      doesn't need the buffers anymore.
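
      The lifecycle described above can be sketched as a small userspace model.
      This is only an illustration of the semantics, not the kernel
      implementation: the class and method names (Notifier, attach, flush,
      request_done) are hypothetical and do not correspond to any io_uring API.

```python
class Notifier:
    """Toy userspace model of a notifier; not the kernel implementation."""

    def __init__(self):
        self.inflight = 0      # attached requests still holding their buffers
        self.flushed = False   # once set, no new requests may be attached
        self.posted = False    # set when the "buffer-free" completion is posted

    def attach(self):
        """Bind one more request to this notifier."""
        if self.flushed:
            raise RuntimeError("cannot attach to a flushed notifier")
        self.inflight += 1

    def flush(self):
        """Stop accepting requests; post right away if nothing is in flight."""
        self.flushed = True
        self._maybe_post()

    def request_done(self):
        """Called when one attached request no longer needs its buffer."""
        self.inflight -= 1
        self._maybe_post()

    def _maybe_post(self):
        # The completion fires only after a flush and once every attached
        # request has released its buffer.
        if self.flushed and self.inflight == 0:
            self.posted = True


if __name__ == "__main__":
    n = Notifier()
    n.attach()
    n.attach()
    n.request_done()   # first request finishes before the flush
    n.flush()
    print(n.posted)    # one request still holds its buffer
    n.request_done()
    print(n.posted)    # all buffers released after the flush
```

      In this model a flush alone does not post anything while requests are
      still in flight; the completion appears only when the last attached
      request is done.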
    
      Notifiers are stored in notification slots, which should be registered as
      an array in io_uring. Each slot stores only one notifier at any particular
      moment. Flushing removes the notifier from its slot, and the slot
      automatically replaces it with a new one. All operations with notifiers are
      done by specifying the index of the slot they are currently in.
    
      When registering notification slots, userspace specifies a u64 tag for each
      slot, which is copied into notification completion entries as
      cqe::user_data. cqe::res is 0, and cqe::flags holds a wrap-around u32
      sequence number counting the notifiers of the slot.
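
      A rough sketch of how the completion fields could be derived from a slot,
      under stated assumptions: the class name NotificationSlots is hypothetical,
      the sequence number is assumed to start at 0, and the model ignores
      in-flight requests (it returns the entry a flushed notifier would post
      rather than waiting for buffers to be released).

```python
U32_MASK = 0xFFFFFFFF


class NotificationSlots:
    """Toy model of registered notification slots (names are hypothetical)."""

    def __init__(self, tags):
        self.tags = list(tags)           # one u64 tag per slot, set at registration
        self.seq = [0] * len(self.tags)  # per-slot notifier counter (assumed to start at 0)

    def flush(self, idx):
        """Return the completion entry the notifier in slot idx would post.

        The slot then holds a fresh notifier with the next sequence number.
        """
        cqe = {"user_data": self.tags[idx], "res": 0, "flags": self.seq[idx]}
        self.seq[idx] = (self.seq[idx] + 1) & U32_MASK  # u32 wrap-around
        return cqe


if __name__ == "__main__":
    slots = NotificationSlots(tags=[0xA, 0xB])
    print(slots.flush(1))  # tag of slot 1, first notifier of that slot
    print(slots.flush(1))  # same tag, next sequence number
    print(slots.flush(0))  # slot 0 keeps its own independent counter
```

      Because the counter is per slot, userspace can pair cqe::user_data (which
      slot) with cqe::flags (which notifier of that slot) to tell notifications
      apart.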
    
    ====================
    
    Link: https://lore.kernel.org/r/cover.1657643355.git.asml.silence@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>