Commits · 63809137ebb58f0aa2ce359117422686e3304f45 · Kirill Smelkov / linux

25 Jul, 2022 40 commits

io_uring: flush notifiers after sendzc · 63809137

Pavel Begunkov authored Jul 12, 2022

Allow flushing notifiers as part of a sendzc request by setting the
IORING_SENDZC_FLUSH flag. When the sendzc request succeeds, it will
flush the used [active] notifier.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e0b4d9a6797e2fd6092824fe42953db7a519bbc8.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: sendzc with fixed buffers · 10c7d33e

Pavel Begunkov authored Jul 12, 2022

Allow zerocopy sends to use fixed buffers. There is an optimisation for
this case: the network layer doesn't need to reference the pages (see
SKBFL_MANAGED_FRAG_REFS), so io_uring has to ensure the fixed buffers
stay valid until the notifier is released.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e1d8bd1b5934e541d90c1824eb4020ae3f5f43f3.1657643355.git.asml.silence@gmail.com
[axboe: fold in 32-bit pointer cast warning fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: allow to pass addr into sendzc · 092aeedb

Pavel Begunkov authored Jul 12, 2022

Allow specifying a destination address for zerocopy sends, making them
more like sendto(2).
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/70417a8f7c5b51ab454690bae08adc0c187f89e8.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: account locked pages for non-fixed zc · e29e3bd4

Pavel Begunkov authored Jul 12, 2022

Fixed buffers are accounted against RLIMIT_MEMLOCK; however, that doesn't
cover iovec-based zerocopy sends. Do the accounting on the io_uring side.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/19b6e3975440f59f1f6199c7ee7acf977b4eecdc.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: wire send zc request type · 06a5464b

Pavel Begunkov authored Jul 12, 2022

Add a new io_uring opcode IORING_OP_SENDZC. The main distinction from
IORING_OP_SEND is that the user should specify a notification slot
index in sqe::notification_idx, and that the buffers are safe to reuse
only when the used notification is flushed and completes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a80387c6a68ce9cf99b3b6ef6f71068468761fb7.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add notification slot registration · bc24d6bd

Pavel Begunkov authored Jul 12, 2022

Let userspace register and unregister notification slots.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a0aa8161fe3ebb2a4cc6e5dbd0cffb96e6881cf5.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add rsrc referencing for notifiers · 68ef5578

Pavel Begunkov authored Jul 12, 2022

In preparation for zerocopy sends with fixed buffers, make notifiers
reference the rsrc node to protect the fixed buffers in use. We can't just
grab it for a send request, as notifiers are likely to outlive the
requests that used them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3cd7a01d26837945b6982fa9cf15a63230f2ed4f.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: complete notifiers in tw · e58d498e

Pavel Begunkov authored Jul 12, 2022

We need a task context to post CQEs, but using wq is too expensive.
Try to complete notifiers using task_work and fall back to wq if that fails.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/089799ab665b10b78fdc614ae6d59fa7ef0d5f91.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: cache struct io_notif · eb4a299b

Pavel Begunkov authored Jul 12, 2022

kmalloc'ing struct io_notif is too expensive when done frequently, so cache
them like many other resources in io_uring. Keep two lists: the first one,
from which we take notifiers, is protected by ->uring_lock. The second one,
onto which we queue released notifiers, is protected by ->completion_lock.
When needed, we splice the second list into the first.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9dec18f7fcbab9f4bd40b96e5ae158b119945230.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
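
The two-list scheme can be sketched in user space as below. This is an
illustration, not the io_uring code: pthread mutexes stand in for
->uring_lock and ->completion_lock (the latter is a spinlock in the kernel),
and struct notif_cache, cache_get() and cache_put() are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical cached object; only the link field matters here. */
struct notif_entry {
	struct notif_entry *next;
};

struct notif_cache {
	pthread_mutex_t uring_lock;      /* protects alloc_list */
	pthread_mutex_t completion_lock; /* protects free_list */
	struct notif_entry *alloc_list;  /* objects are taken from here */
	struct notif_entry *free_list;   /* released objects are queued here */
};

static void cache_init(struct notif_cache *c)
{
	pthread_mutex_init(&c->uring_lock, NULL);
	pthread_mutex_init(&c->completion_lock, NULL);
	c->alloc_list = NULL;
	c->free_list = NULL;
}

/* Submission side; the caller must hold c->uring_lock. */
static struct notif_entry *cache_get(struct notif_cache *c)
{
	struct notif_entry *e;

	if (!c->alloc_list) {
		/* Own list is empty: splice all released objects over at once. */
		pthread_mutex_lock(&c->completion_lock);
		c->alloc_list = c->free_list;
		c->free_list = NULL;
		pthread_mutex_unlock(&c->completion_lock);
	}
	e = c->alloc_list;
	if (e) {
		c->alloc_list = e->next;
		return e;
	}
	/* Both lists empty: fall back to the allocator. */
	return malloc(sizeof(*e));
}

/* Completion side; only completion_lock is taken. */
static void cache_put(struct notif_cache *c, struct notif_entry *e)
{
	assert(e != NULL);
	pthread_mutex_lock(&c->completion_lock);
	e->next = c->free_list;
	c->free_list = e;
	pthread_mutex_unlock(&c->completion_lock);
}
```

In this model the completion path never touches alloc_list, and the
submission path takes completion_lock only when its own list runs dry, so
the two sides meet only at the splice.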

io_uring: add zc notification infrastructure · eb42cebb

Pavel Begunkov authored Jul 12, 2022

Add the internal part of send zerocopy notifications. There are two main
concepts. The first one is struct io_notif, which embeds a struct ubuf_info
and maps 1:1 to it. io_uring will bind a number of zerocopy send requests
to it and then ask to complete (aka flush) it. Once it is flushed and all
attached requests and skbs complete, it generates one and only one CQE.
They are intended to be passed into the network layer as
struct msghdr::msg_ubuf.

The second concept is notification slots. Userspace will be able to
register an array of slots and subsequently address them by their index
in the array. Slots are independent of each other. Each slot has only
one notifier at a time (called the active notifier) but many notifiers
over its lifetime. While active, a notifier won't post any completion,
but userspace can attach requests to it by specifying the corresponding
slot when issuing send zc requests. Eventually, userspace will want to
"flush" the notifier, after which no new requests can be attached to it;
however, it can use the next, automatically added notifier of this slot
or of any other slot.

When the network layer is done with all enqueued skbs attached to a
notifier and no longer needs the user data they reference, the flushed
notifier will post a CQE.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3ecf54c31a85762bf679b0a432c9f43ecf7e61cc.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
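
A minimal user-space model of the notifier lifetime described above. It
assumes, as a simplification of the real ubuf_info refcounting, that the
slot holds one reference on its active notifier and every attached send
holds another; flushing drops the slot's reference and installs a fresh
notifier. All names (struct notif, slot_attach(), slot_flush(), and so on)
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NOTIFS 16

/* Hypothetical notifier: posts exactly one CQE after flush + last put. */
struct notif {
	unsigned int refs; /* slot reference (while active) + attached sends */
	bool flushed;      /* set once no more sends may attach */
	bool cqe_posted;
};

/* Hypothetical slot: owns exactly one active notifier at a time. */
struct slot {
	struct notif *active;
};

static struct notif pool[MAX_NOTIFS];
static size_t pool_used;
static unsigned int cqes_posted;

static struct notif *notif_alloc(void)
{
	assert(pool_used < MAX_NOTIFS);
	struct notif *n = &pool[pool_used++];
	*n = (struct notif){ .refs = 1 }; /* the slot's reference */
	return n;
}

/* Called when a send (and all of its skbs) is done with the user buffers. */
static void notif_put(struct notif *n)
{
	assert(n->refs > 0);
	if (--n->refs == 0) {
		/* Only reachable after a flush: the slot held a ref until then. */
		assert(n->flushed && !n->cqe_posted);
		n->cqe_posted = true;
		cqes_posted++;
	}
}

static void slot_init(struct slot *s)
{
	s->active = notif_alloc();
}

/* A send zc request naming this slot attaches to the active notifier. */
static struct notif *slot_attach(struct slot *s)
{
	s->active->refs++;
	return s->active;
}

/* Flush: close the old notifier to new sends and install a new one. */
static struct notif *slot_flush(struct slot *s)
{
	struct notif *old = s->active;

	old->flushed = true;
	s->active = notif_alloc();
	notif_put(old); /* drop the slot's reference */
	return old;
}
```

In this model, flushing a notifier that has no sends attached posts its
CQE immediately, while a notifier with sends in flight posts it only when
the last one is put.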

io_uring: export io_put_task() · e70cb608

Pavel Begunkov authored Jul 12, 2022

Make io_put_task() available to non-core parts of io_uring, as we'll need
it for the notification infrastructure.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3686807d4c03b72e389947b0e8692d4d44334ef0.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: initialise msghdr::msg_ubuf · e02b6651

Pavel Begunkov authored Jul 12, 2022

Initialise newly added ->msg_ubuf in io_recv() and io_send().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b8f9f263875a4a36e7f26cc5d55ebe315308f57d.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'for-5.20/io_uring' into for-5.20/io_uring-zerocopy-send · 4effe18f

Jens Axboe authored Jul 24, 2022

* for-5.20/io_uring: (716 commits)
  io_uring: ensure REQ_F_ISREG is set async offload
  net: fix compat pointer in get_compat_msghdr()
  io_uring: Don't require reinitable percpu_ref
  io_uring: fix types in io_recvmsg_multishot_overflow
  io_uring: Use atomic_long_try_cmpxchg in __io_account_mem
  io_uring: support multishot in recvmsg
  net: copy from user before calling __get_compat_msghdr
  net: copy from user before calling __copy_msghdr
  io_uring: support 0 length iov in buffer select in compat
  io_uring: fix multishot ending when not polled
  io_uring: add netmsg cache
  io_uring: impose max limit on apoll cache
  io_uring: add abstraction around apoll cache
  io_uring: move apoll cache to poll.c
  io_uring: consolidate hash_locked io-wq handling
  io_uring: clear REQ_F_HASH_LOCKED on hash removal
  io_uring: don't race double poll setting REQ_F_ASYNC_DATA
  io_uring: don't miss setting REQ_F_DOUBLE_POLL
  io_uring: disable multishot recvmsg
  io_uring: only trace one of complete or overflow
  ...
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'io_uring-zerocopy-send' of... · 32e09298

Jens Axboe authored Jul 24, 2022

Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux into for-5.20/io_uring-zerocopy-send

Merge prep net series for io_uring tx zc from Jakub's tree.

* 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux:
  net: fix uninitialised msghdr->sg_from_iter
  tcp: support externally provided ubufs
  ipv6/udp: support externally provided ubufs
  ipv4/udp: support externally provided ubufs
  net: introduce __skb_fill_page_desc_noacc
  net: introduce managed frags infrastructure
  net: Allow custom iter handler in msghdr
  skbuff: carry external ubuf_info in msghdr
  skbuff: add SKBFL_DONT_ORPHAN flag
  skbuff: don't mix ubuf_info from different sources
  ipv6: avoid partial copy for zc
  ipv4: avoid partial copy for zc

io_uring: ensure REQ_F_ISREG is set async offload · f6b543fd

Jens Axboe authored Jul 21, 2022

If we're offloading requests directly to io-wq because IOSQE_ASYNC was
set in the sqe, we can miss hashing writes appropriately because we
haven't set REQ_F_ISREG yet. This can cause a performance regression
with buffered writes, as io-wq then no longer correctly serializes writes
to that file.

Ensure that we set the flags in io_prep_async_work(), which will cause
the io-wq work item to be hashed appropriately.

Fixes: 584b0180 ("io_uring: move read/write file prep state into actual opcode handler")
Link: https://lore.kernel.org/io-uring/20220608080054.GB22428@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

net: fix compat pointer in get_compat_msghdr() · 4f6a94d3

Jens Axboe authored Jul 15, 2022

A previous change enabled external users to copy the data before
calling __get_compat_msghdr(), but didn't modify get_compat_msghdr() or
__io_compat_recvmsg_copy_hdr() to take that into account. They are both
still passing in the __user pointer rather than the copied version.

Ensure we pass in the kernel struct, not the pointer to the user data.

Link: https://lore.kernel.org/all/46439555-644d-08a1-7d66-16f8f9a320f0@samsung.com/
Fixes: 1a3e4e94a1b9 ("net: copy from user before calling __get_compat_msghdr")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: Don't require reinitable percpu_ref · 48904229

Michal Koutný authored Jul 15, 2022

Commit 8bb649ee ("io_uring: remove ring quiesce for
io_uring_register") removed the workflow relying on reinit/resurrection
of the percpu_ref, hence initialization with that flag requested is a relic.

This is based on code review; it causes no real bug (and theoretically
can't). Technically it's a revert of commit 21482896 ("io_uring:
initialize percpu refcounters using PERCU_REF_ALLOW_REINIT"), but since
the flag omission is now justified, I'm not making this a revert.

Fixes: 8bb649ee ("io_uring: remove ring quiesce for io_uring_register")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: fix types in io_recvmsg_multishot_overflow · 9b0fc3c0

Dylan Yudaken authored Jul 15, 2022

io_recvmsg_multishot_overflow had incorrect types on non-x64 systems.
It also had an unnecessary INT_MAX check, which can instead be done
implicitly by changing the type of the accumulator to int (which also
simplifies the casts).
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: a8b38c4ce724 ("io_uring: support multishot in recvmsg")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220715130252.610639-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
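
The idea of letting an int accumulator catch the overflow can be sketched
with GCC/Clang's __builtin_add_overflow(), which the kernel's
check_add_overflow() builds on. The function below is a hypothetical
user-space analogue, not the kernel function; the 16-byte struct stands in
for struct io_uring_recvmsg_out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for struct io_uring_recvmsg_out: four 32-bit fields, 16 bytes. */
struct recvmsg_out_hdr {
	uint32_t namelen, controllen, payloadlen, flags;
};

/*
 * Returns true if header + name + control cannot be represented as an int.
 * __builtin_add_overflow() computes the exact sum and reports whether it fits
 * in the type of the result (int here), so no explicit INT_MAX test is needed.
 */
static bool recvmsg_hdr_overflows(int namelen, unsigned long controllen)
{
	int hdr;

	if (namelen < 0)
		return true;
	if (__builtin_add_overflow((int)sizeof(struct recvmsg_out_hdr), namelen, &hdr))
		return true;
	if (__builtin_add_overflow(hdr, controllen, &hdr))
		return true;
	return false;
}
```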

io_uring: Use atomic_long_try_cmpxchg in __io_account_mem · 4ccc6db0

Uros Bizjak authored Jul 14, 2022

Use atomic_long_try_cmpxchg instead of
atomic_long_cmpxchg (*ptr, old, new) == old in __io_account_mem.
The x86 CMPXCHG instruction reports success in the ZF flag, so this
change saves a compare after cmpxchg (and the related move
instruction in front of cmpxchg).

Also, atomic_long_try_cmpxchg implicitly assigns the current *ptr value
to "old" when cmpxchg fails, enabling further code simplifications.

No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
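
In portable C11, atomic_compare_exchange_weak() has the same "refresh old
on failure" behaviour as atomic_long_try_cmpxchg(), so the accounting loop
can be sketched in user space as below. account_mem() and its parameters
are hypothetical stand-ins for __io_account_mem(), user->locked_vm and the
RLIMIT_MEMLOCK page limit; this is not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/*
 * Hypothetical analogue of __io_account_mem(): charge nr_pages to *locked_vm
 * unless that would exceed page_limit.
 */
static int account_mem(atomic_ulong *locked_vm, unsigned long nr_pages,
		       unsigned long page_limit)
{
	unsigned long cur_pages, new_pages;

	if (!nr_pages)
		return 0;

	cur_pages = atomic_load(locked_vm);
	do {
		new_pages = cur_pages + nr_pages;
		if (new_pages > page_limit)
			return -ENOMEM;
		/*
		 * On failure, cur_pages is overwritten with the value currently
		 * stored, so the next iteration needs no separate re-read.
		 */
	} while (!atomic_compare_exchange_weak(locked_vm, &cur_pages, new_pages));
	return 0;
}
```

The weak variant may fail spuriously, which is harmless here because the
loop simply retries with the refreshed value.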

io_uring: support multishot in recvmsg · 9bb66906

Dylan Yudaken authored Jul 14, 2022

Similar to multishot recv, this will require provided buffers to be
used. However, recvmsg is much more complex than recv, as it has multiple
outputs: specifically flags, name, and control messages.

Support this by introducing a new struct io_uring_recvmsg_out with 4
fields. namelen, controllen and flags match the similar output fields in
msghdr from standard recvmsg(2); payloadlen is the length of the payload
following the header.
This struct is placed at the start of the returned buffer. Based on what
the user specifies in struct msghdr, the next bytes of the buffer will be
name (the next msg_namelen bytes), and then control (the next
msg_controllen bytes). The payload will come at the end. The return value
in the CQE is the total used size of the provided buffer.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220714110258.1336200-4-dylany@fb.com
[axboe: style fixups, see link]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
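
Based on the layout described above, a consumer could locate the pieces of
a returned buffer as sketched below. struct recvmsg_out mirrors the four
fields named in the commit (their order is assumed here), recvmsg_parse()
is a hypothetical helper, msg_namelen and msg_controllen are the values the
application put in its struct msghdr, and used is the CQE result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct recvmsg_out {
	uint32_t namelen;
	uint32_t controllen;
	uint32_t payloadlen;
	uint32_t flags;
};

struct recvmsg_view {
	struct recvmsg_out hdr;       /* copied out to avoid alignment assumptions */
	const unsigned char *name;    /* msg_namelen bytes reserved */
	const unsigned char *control; /* msg_controllen bytes reserved */
	const unsigned char *payload;
	size_t payload_len;           /* payload bytes present in the buffer */
};

/* Returns 0 on success, -1 if the buffer is too short for the fixed areas. */
static int recvmsg_parse(const unsigned char *buf, size_t used,
			 size_t msg_namelen, size_t msg_controllen,
			 struct recvmsg_view *v)
{
	size_t payload_off = sizeof(struct recvmsg_out) + msg_namelen + msg_controllen;

	if (used < payload_off)
		return -1;
	memcpy(&v->hdr, buf, sizeof(v->hdr));
	v->name = buf + sizeof(struct recvmsg_out);
	v->control = v->name + msg_namelen;
	v->payload = v->control + msg_controllen;
	v->payload_len = used - payload_off;
	return 0;
}
```

Note that the name and control areas are sized by what the application
requested in its msghdr, not by the lengths reported in the header, so the
payload offset is known before the header is read.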

net: copy from user before calling __get_compat_msghdr · 72c531f8

Dylan Yudaken authored Jul 14, 2022

This is in preparation for multishot receive from io_uring, where it needs
to have access to the original struct user_msghdr.

Functionally this should be a no-op.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220714110258.1336200-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

net: copy from user before calling __copy_msghdr · 7fa875b8

Dylan Yudaken authored Jul 14, 2022

This is in preparation for multishot receive from io_uring, where it needs
to have access to the original struct user_msghdr.

Functionally this should be a no-op.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220714110258.1336200-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: support 0 length iov in buffer select in compat · 6d2f75a0

Dylan Yudaken authored Jul 08, 2022

Match up work done in "io_uring: allow iov_len = 0 for recvmsg and buffer
select", but for compat code path.

Fixes: a68caad69ce5 ("io_uring: allow iov_len = 0 for recvmsg and buffer select")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220708181838.1495428-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: fix multishot ending when not polled · e2df2ccb

Dylan Yudaken authored Jul 08, 2022

If multishot is not actually polling, return IOU_OK rather than the
result.
A result > 0 would confuse callers further up the call stack, which
expect a return value <= 0.

Fixes: 1300ebb20286 ("io_uring: multishot recv")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220708181838.1495428-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add netmsg cache · 43e0bbbd

Jens Axboe authored Jul 07, 2022

For recvmsg/sendmsg, if they don't complete inline, we currently need
to allocate a struct io_async_msghdr for each request. This is a
somewhat large struct.

Hook up sendmsg/recvmsg to use the io_alloc_cache. This reduces the
alloc + free overhead considerably, yielding 4-5% extra performance
when running netbench.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: impose max limit on apoll cache · 9731bc98

Jens Axboe authored Jul 07, 2022

Caches like this tend to grow to the peak size, and then never get any
smaller. Impose a max limit on the size, to prevent it from growing too
big.

A somewhat randomly chosen 512 is the max size we'll allow the cache
to reach. If a batch of frees comes in and would bring it over that, we
simply kfree the surplus.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
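
A minimal sketch of a size-capped free-list cache in the spirit of this
change. The names are hypothetical and are not the io_alloc_cache API; only
the cap of 512 comes from the commit. alloc_cache_put() reports whether it
kept the object, so the caller frees the surplus itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALLOC_CACHE_MAX 512 /* cap taken from the commit message */

struct cache_entry {
	struct cache_entry *next;
};

struct alloc_cache {
	struct cache_entry *list;
	unsigned int nr_cached;
};

/* Returns a cached object, or NULL so the caller falls back to malloc(). */
static struct cache_entry *alloc_cache_get(struct alloc_cache *c)
{
	struct cache_entry *e = c->list;

	if (e) {
		c->list = e->next;
		c->nr_cached--;
	}
	return e;
}

/* Returns false once the cap is reached; the caller must then free() e. */
static bool alloc_cache_put(struct alloc_cache *c, struct cache_entry *e)
{
	if (c->nr_cached >= ALLOC_CACHE_MAX)
		return false;
	e->next = c->list;
	c->list = e;
	c->nr_cached++;
	return true;
}

/* Teardown: release everything still cached. */
static void alloc_cache_free(struct alloc_cache *c)
{
	struct cache_entry *e;

	while ((e = alloc_cache_get(c)) != NULL)
		free(e);
}
```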

io_uring: add abstraction around apoll cache · 9b797a37

Jens Axboe authored Jul 07, 2022

In preparation for adding limits, and one more user, abstract out the
core bits of the allocation+free cache.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: move apoll cache to poll.c · 9da7471e

Jens Axboe authored Jul 07, 2022

This is where it's used, so move the flush handler there.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: consolidate hash_locked io-wq handling · e8375e43

Pavel Begunkov authored Jul 07, 2022

Don't duplicate code disabling REQ_F_HASH_LOCKED for IO_URING_F_UNLOCKED
(i.e. io-wq), move the handling into __io_arm_poll_handler().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0ff0ffdfaa65b3d536131535c3dad3c63d9b7bb0.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: clear REQ_F_HASH_LOCKED on hash removal · b21a51e2

Pavel Begunkov authored Jul 07, 2022

Instead of clearing REQ_F_HASH_LOCKED while arming a poll, unset the bit
when we're removing the entry from the table in io_poll_tw_hash_eject().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/02e48bb88d6f1480c94ac2924c43ad1fbd48e92a.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't race double poll setting REQ_F_ASYNC_DATA · ceff5017

Pavel Begunkov authored Jul 07, 2022

Just as with io_poll_double_prepare() setting REQ_F_DOUBLE_POLL, we can
race with the first poll entry when setting REQ_F_ASYNC_DATA. Move it
under io_poll_double_prepare().

Fixes: a18427bb2d9b ("io_uring: optimise submission side poll_refs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/df6920f509c11115aa2bce8b34dc5fdb0eb98920.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't miss setting REQ_F_DOUBLE_POLL · 7a121ced

Pavel Begunkov authored Jul 07, 2022

When adding a second poll entry we should set REQ_F_DOUBLE_POLL
unconditionally. We might race with the first entry removal but that
doesn't change the rule.

Fixes: a18427bb2d9b ("io_uring: optimise submission side poll_refs")
Reported-and-tested-by: syzbot+49950ba66096b1f0209b@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b680d83ded07424db83e8745585e7a6d72826ef.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: disable multishot recvmsg · cf0dd952

Dylan Yudaken authored Jul 04, 2022

recvmsg has semantics that do not make it trivial to extend to
multishot. Specifically, it has user pointers and returns data in the
original parameter. To make this API useful, these will somehow need to
be included with the provided buffers.

For now, remove multishot for recvmsg, as it is not useful.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220704140106.200167-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: only trace one of complete or overflow · e0486f3f

Dylan Yudaken authored Jun 30, 2022

On overflow we see a duplicate line in the trace, and in some cases 3
lines (if the initial io_post_aux_cqe fails).
Instead, trace just once for each CQE.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-13-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: fix io_uring_cqe_overflow trace format · 9b26e811

Dylan Yudaken authored Jun 30, 2022

Make the trace format consistent with io_uring_complete for cflags
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-12-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: multishot recv · b3fdea6e

Dylan Yudaken authored Jun 30, 2022

Support multishot receive for io_uring.
Typical server applications will run a loop where, for each recv CQE,
they requeue another recv/recvmsg.

This can be simplified by using the existing multishot functionality
combined with io_uring's provided buffers.
The API is to add the IORING_RECV_MULTISHOT flag to the SQE. CQEs will
then be posted (with IORING_CQE_F_MORE flag set) when data is available
and is read. Once an error occurs or the socket ends, the multishot will
be removed and a completion without IORING_CQE_F_MORE will be posted.

The benefit is that recv becomes much more performant:
 * Subsequent receives are queued up straight away without requiring the
   application to finish a processing loop.
 * If there is more data in the socket (say the provided buffer size is
   smaller than the socket buffer), then the data is returned immediately,
   improving batching.
 * Poll is only armed once and reused, saving CPU cycles.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-11-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
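
On the consumer side, each multishot recv CQE says whether more
completions will follow and which provided buffer holds the data. The
sketch below redeclares IORING_CQE_F_BUFFER, IORING_CQE_F_MORE and
IORING_CQE_BUFFER_SHIFT with the values from include/uapi/linux/io_uring.h
so that it builds without kernel headers; struct cqe mirrors the fixed part
of struct io_uring_cqe, and decode_recv_cqe() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CQE_F_BUFFER     (1U << 0) /* IORING_CQE_F_BUFFER */
#define CQE_F_MORE       (1U << 1) /* IORING_CQE_F_MORE */
#define CQE_BUFFER_SHIFT 16        /* IORING_CQE_BUFFER_SHIFT */

/* Mirrors the fixed part of struct io_uring_cqe. */
struct cqe {
	uint64_t user_data;
	int32_t res;
	uint32_t flags;
};

struct recv_event {
	int32_t res;     /* bytes received, 0 on EOF, or -errno */
	int buf_id;      /* provided buffer ID, or -1 if none was used */
	bool terminated; /* no F_MORE: the multishot recv is finished */
};

static struct recv_event decode_recv_cqe(const struct cqe *c)
{
	struct recv_event ev = { .res = c->res, .buf_id = -1, .terminated = false };

	if (c->flags & CQE_F_BUFFER)
		ev.buf_id = (int)(c->flags >> CQE_BUFFER_SHIFT);
	/* Without F_MORE the request is gone; resubmit it if still needed. */
	if (!(c->flags & CQE_F_MORE))
		ev.terminated = true;
	return ev;
}
```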

io_uring: fix multishot accept ordering · cbd25748

Dylan Yudaken authored Jun 30, 2022

Similar to multishot poll, drop multishot accept when CQE overflow occurs.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-10-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: fix multishot poll on overflow · a2da6763

Dylan Yudaken authored Jun 30, 2022

On overflow, multishot poll can still complete with the IORING_CQE_F_MORE
flag set.
If, in the meantime, the user clears a CQE and the poll is cancelled, then
the poll will post a CQE without IORING_CQE_F_MORE (likely with result
-ECANCELED).

However, when processing, the application will encounter the non-overflow
CQE first, which indicates that no more events will be posted. Typical
userspace applications free the memory associated with the poll in this
case.
The application will then receive the earlier CQE that overflowed,
which breaks the contract given by the IORING_CQE_F_MORE flag.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-9-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add allow_overflow to io_post_aux_cqe · 52120f0f

Dylan Yudaken authored Jun 30, 2022

Some use cases of io_post_aux_cqe would not want to overflow as is, but
might want to change the flags/result. For example, multishot receive
requires in-order CQEs, so if there is an overflow it would need to
stop receiving until the overflow is taken care of.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
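
The choice that the allow_overflow argument offers can be modelled with a
toy completion ring: when the CQ is full, the caller either lets the CQE go
to the overflow list or gets a failure and can stop the multishot request
instead. struct toy_ring and post_aux_cqe() are hypothetical and omit
everything except the fullness check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CQ_ENTRIES 4

struct toy_ring {
	int32_t cq[CQ_ENTRIES];
	unsigned int head, tail; /* consumer / producer indices */
	unsigned int overflowed; /* CQEs parked for later delivery */
};

static bool post_aux_cqe(struct toy_ring *r, int32_t res, bool allow_overflow)
{
	if (r->tail - r->head < CQ_ENTRIES) {
		r->cq[r->tail % CQ_ENTRIES] = res;
		r->tail++;
		return true;
	}
	if (!allow_overflow)
		return false; /* caller must stop, e.g. end the multishot recv */
	r->overflowed++;      /* parked; delivered once the CQ has room again */
	return true;
}
```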

io_uring: add IOU_STOP_MULTISHOT return code · 114eccdf

Dylan Yudaken authored Jun 30, 2022

For multishot we want a way to signal the caller that multishot has ended,
even though this might not be an error return.

For example, sockets return 0 when closed, which should end a multishot
recv but still produce a CQE with result 0.

Introduce IOU_STOP_MULTISHOT, which does this and indicates that the return
code is stored inside req->cqe.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-7-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
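
A hypothetical sketch of why a dedicated return code helps: an issue
handler that returned 0 for both "keep going" and "socket closed" would be
ambiguous, so the stop case returns a distinct code and leaves the real
result in the request. The numeric value given to IOU_STOP_MULTISHOT below
is an assumption, as are struct req and both functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Sketch values: IOU_STOP_MULTISHOT only needs to differ from IOU_OK here. */
enum { IOU_OK = 0, IOU_STOP_MULTISHOT = -ECANCELED };

struct req {
	int cqe_res;             /* result carried by the final CQE */
	bool armed;              /* multishot still active */
	unsigned int aux_cqes;   /* CQEs posted with IORING_CQE_F_MORE */
	unsigned int final_cqes; /* terminating CQEs (no F_MORE) */
};

/* One multishot recv attempt that read nbytes (<= 0 means EOF or error). */
static int recv_issue(struct req *req, int nbytes)
{
	if (nbytes > 0) {
		req->aux_cqes++; /* data CQE with F_MORE; stay armed */
		return IOU_OK;
	}
	req->cqe_res = nbytes; /* 0 on close is a result, not an error */
	return IOU_STOP_MULTISHOT;
}

/* What the poll side does with the return code. */
static void poll_task_run(struct req *req, int nbytes)
{
	if (recv_issue(req, nbytes) == IOU_STOP_MULTISHOT) {
		req->armed = false;
		req->final_cqes++; /* final CQE carries req->cqe_res */
	}
}
```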