Commits · 0fdb9a196c6728b51e0e7a4f6fa292d9fd5793de · Kirill Smelkov / linux

23 Jun, 2023 10 commits

io_uring: make io_cq_unlock_post static · 0fdb9a19

Pavel Begunkov authored Jun 23, 2023

io_cq_unlock_post() is exclusively used in io_uring/io_uring.c, mark it
static and don't expose to other files.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3dc8127dda4514e1dd24bb32035faac887c5fa37.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: inline __io_cq_unlock · ff126177

Pavel Begunkov authored Jun 23, 2023

__io_cq_unlock is not very helpful, and users should be calling flush
variants anyway. Open code the function.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d875c4cfb69f38ccecb58a57111446c77a614caa.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: fix acquire/release annotations · 55b6a69f

Pavel Begunkov authored Jun 23, 2023

We do conditional locking, so __io_cq_lock() and friends do not always
actually grab/release the lock. Kill the misleading annotations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a098f9144c24cab622f8bf90b39f44da5d0401e.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: kill io_cq_unlock() · f432b76b

Pavel Begunkov authored Jun 23, 2023

We're abusing ->completion_lock helpers. io_cq_unlock() is neither
locking conditionally nor doing CQE flushing, which means that callers
must have some side reason for taking the lock and should do it directly.

Open code io_cq_unlock() into io_cqring_overflow_kill() and clean it up.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7dabb36856db2b562e78780480396c52c29b2bf4.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: remove IOU_F_TWQ_FORCE_NORMAL · 91c7884a

Pavel Begunkov authored Jun 23, 2023

Extract a function for non-local task_work_add, and use it directly from
io_move_task_work_from_local(). Now we don't use IOU_F_TWQ_FORCE_NORMAL
and it can be killed.

As a small positive side effect we don't grab task->io_uring in
io_req_normal_work_add anymore, which is not needed for
io_req_local_work_add().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2e55571e8ff2927ae3cc12da606d204e2485525b.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: don't batch task put on reqs free · 2fdd6fb5

Pavel Begunkov authored Jun 23, 2023

We're trying to batch io_put_task() in io_free_batch_list(), but
considering that the hot path is a simple inc, it's most likely
faster to just do io_put_task() instead of task tracking.

We don't care about io_put_task_remote() as it's only for IOPOLL
where polling/waiting is not done by the submitter task.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4a7ef7dce845fe2bd35507bf389d6bd2d5c1edf0.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: move io_clean_op() · 5a754dea

Pavel Begunkov authored Jun 23, 2023

Move io_clean_op() up in the source file and remove the forward
declaration, as the function doesn't have tricky dependencies
anymore.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1b7163b2ba7c3a8322d972c79c1b0a9301b3057e.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: inline io_dismantle_req() · 3b7a612f

Pavel Begunkov authored Jun 23, 2023

io_dismantle_req() is only used in __io_req_complete_post(), open code
it there.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ba8f20cb2c914eefa2e7d120a104a198552050db.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: remove io_free_req_tw · 6ec9afc7

Pavel Begunkov authored Jun 23, 2023

Request completion is a very hot path in general, but there are 3 places
that can be doing it: io_free_batch_list(), io_req_complete_post() and
io_free_req_tw().

io_free_req_tw() is used rather marginally and we don't care about it.
Killing it can help to clean up and optimise the remaining two, so do that
by replacing it with io_req_task_complete().

There are two things to consider:
1) io_free_req() is called when all refs are put, so we need to reinit
references. The easiest way to do that is to clear REQ_F_REFCOUNT.
2) We also don't need a cqe from it, so silence it with REQ_F_CQE_SKIP.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/434a2be8f33d474ad888ce1c17fe5ea7bbcb2a55.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: open code io_put_req_find_next · 247f97a5

Pavel Begunkov authored Jun 23, 2023

There is only one user of io_put_req_find_next() and it doesn't make
much sense to have it. Open code the function.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/38b5c5e48e4adc8e6a0cd16fdd5c1531d7ff81a9.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


20 Jun, 2023 8 commits

io_uring: add helpers to decode the fixed file file_ptr · 4bfb0c9a

Christoph Hellwig authored Jun 20, 2023

Remove all the open coded magic on slot->file_ptr by introducing two
helpers that return the file pointer and the flags instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620113235.920399-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
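
As a rough illustration of the idea in this commit, the sketch below packs flag bits into the low bits of an aligned pointer and hides the masking behind two small helpers. It is a userspace analogy only: the names `struct slot`, `slot_set()`, `slot_file()`, `slot_flags()` and the `SLOT_FLAG_*` values are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits stored in the two low bits of the pointer. */
#define SLOT_FLAG_A ((uintptr_t)0x1)
#define SLOT_FLAG_B ((uintptr_t)0x2)
#define SLOT_MASK   (~(SLOT_FLAG_A | SLOT_FLAG_B))

/* Hypothetical table slot: one word holding pointer and flags together. */
struct slot {
    uintptr_t file_ptr;
};

/* Store a pointer plus flags; the pointer must be at least 4-byte aligned. */
static void slot_set(struct slot *s, void *file, uintptr_t flags)
{
    assert(((uintptr_t)file & ~SLOT_MASK) == 0);
    assert((flags & SLOT_MASK) == 0);
    s->file_ptr = (uintptr_t)file | flags;
}

/* Helper 1: return only the pointer part. */
static void *slot_file(const struct slot *s)
{
    return (void *)(s->file_ptr & SLOT_MASK);
}

/* Helper 2: return only the flag part. */
static uintptr_t slot_flags(const struct slot *s)
{
    return s->file_ptr & ~SLOT_MASK;
}
```

Callers then never touch the mask directly, which is the readability gain the commit message describes.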


io_uring: use io_file_from_index in io_msg_grab_file · f432c8c8

Christoph Hellwig authored Jun 20, 2023

Use io_file_from_index instead of open coding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620113235.920399-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: use io_file_from_index in __io_sync_cancel · 60a666f0

Christoph Hellwig authored Jun 20, 2023

Use io_file_from_index instead of open coding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620113235.920399-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: return REQ_F_ flags from io_file_get_flags · 8487f083

Christoph Hellwig authored Jun 20, 2023

Two of the three callers want them, so return the more usual format,
and shift into the FFS_ form only for the fixed file table.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620113235.920399-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: remove io_req_ffs_set · 3beed235

Christoph Hellwig authored Jun 20, 2023

Just checking the flag directly makes it a lot more obvious what is
going on here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620113235.920399-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: remove a confusing comment above io_file_get_flags · b57c7cd1

Christoph Hellwig authored Jun 20, 2023

The SCM inflight mechanism has nothing to do with the fact that a file
might be a regular file or not and if it supports non-blocking
operations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620113235.920399-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: remove the mode variable in io_file_get_flags · 53cfd5ce

Christoph Hellwig authored Jun 20, 2023

The variable is only used once now, so don't bother with it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620113235.920399-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: remove __io_file_supports_nowait · b9a6c945

Christoph Hellwig authored Jun 20, 2023

Now that this only checks O_NONBLOCK and FMODE_NOWAIT, the helper is
complete overkill, and the comments are confusing bordering on wrong.
Just inline the check into the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620113235.920399-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


12 Jun, 2023 1 commit

io_uring: wait interruptibly for request completions on exit · 4826c594

Jens Axboe authored Jun 11, 2023

When the ring exits, cleanup is done and the final cancelation and
waiting on completions is done by io_ring_exit_work. That function is
invoked by kworker, which doesn't take any signals. Because of that, it
doesn't really matter if we wait for completions in TASK_INTERRUPTIBLE
or TASK_UNINTERRUPTIBLE state. However, it does matter to the hung task
detection checker!

Normally we expect cancelations and completions to happen rather
quickly. Some test cases, however, will exit the ring and park the
owning task stopped (eg via SIGSTOP). If the owning task needs to run
task_work to complete requests, then io_ring_exit_work won't make any
progress until the task is runnable again. Hence io_ring_exit_work can
trigger the hung task detection, which is particularly problematic if
panic-on-hung-task is enabled.

As the ring exit doesn't take signals to begin with, have it wait
interruptibly rather than uninterruptibly. io_uring has a separate
stuck-exit warning that triggers independently anyway, so we're not
really missing anything by making this switch.

Cc: stable@vger.kernel.org # 5.10+
Link: https://lore.kernel.org/r/b0e4aaef-7088-56ce-244c-976edeac0e66@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>


07 Jun, 2023 2 commits

io_uring: get rid of unnecessary 'length' variable · 003f242b

Jens Axboe authored Jun 07, 2023

Just use the ARRAY_SIZE directly, we don't use length for anything else
in this function.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: cleanup io_aux_cqe() API · d86eaed1

Jens Axboe authored Jun 07, 2023

Everybody is passing in the request, so get rid of the io_ring_ctx and
explicit user_data pass-in. Both the ctx and user_data can be deduced
from the request at hand.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


02 Jun, 2023 1 commit

io_uring: avoid indirect function calls for the hottest task_work · c92fcfc2

Jens Axboe authored Jun 02, 2023

We use task_work for a variety of reasons, but doing completions or
triggering retry after poll are by far the hottest two. Use the indirect
function call wrappers to avoid the indirect function call if
CONFIG_RETPOLINE is set.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
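
The kernel's indirect call wrappers (the `INDIRECT_CALL_1()` / `INDIRECT_CALL_2()` family) compare a function pointer against the most likely targets and call those directly, falling back to the indirect call otherwise. A minimal userspace sketch of that pattern follows; the macro `CALL_LIKELY_2()` and the three handler functions are hypothetical names made up for this illustration, not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*work_fn)(int);

/* Hypothetical handlers standing in for the two hottest callbacks. */
static int complete_work(int x)   { return x + 1; }
static int poll_retry_work(int x) { return x * 2; }
/* A rarely used handler that still takes the indirect path. */
static int other_work(int x)      { return -x; }

/*
 * Compare against the two likely targets first so the compiler can emit
 * direct calls for them; only unknown targets go through the pointer.
 */
#define CALL_LIKELY_2(f, f1, f2, arg) \
    ((f) == (f1) ? f1(arg) : (f) == (f2) ? f2(arg) : (f)(arg))

static int run_work(work_fn fn, int arg)
{
    assert(fn != NULL);
    return CALL_LIKELY_2(fn, complete_work, poll_retry_work, arg);
}
```

The result is the same whichever branch is taken; only the call mechanism differs, which matters when retpolines make indirect calls expensive.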


25 May, 2023 2 commits

nvme: optimise io_uring passthrough completion · f026be0e

Pavel Begunkov authored May 15, 2023

Use IOU_F_TWQ_LAZY_WAKE via iou_cmd_exec_in_task_lazy() for passthrough
commands completion. It further delays the execution of task_work for
DEFER_TASKRUN until there are enough task_work items queued to meet
the waiting criteria, which reduces the number of wake ups we issue.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ecdfacd0967a22d88b7779e2efd09e040825d0f8.1684154817.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/cmd: add cmd lazy tw wake helper · 5f3139fc

Pavel Begunkov authored May 15, 2023

We want to use IOU_F_TWQ_LAZY_WAKE in commands. First, introduce a new
cmd tw helper accepting TWQ flags, and then add
io_uring_cmd_do_in_task_lazy() that will pass IOU_F_TWQ_LAZY_WAKE and
imply the "lazy" semantics, i.e. it posts no more than 1 CQE and
delaying execution of this tw should not prevent forward progress.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5b9f6716006df7e817f18bd555aee2f8f9c8b0c3.1684154817.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


20 May, 2023 1 commit

io_uring: annotate offset timeout races · 5498bf28

Pavel Begunkov authored May 19, 2023

It's racy to read ->cached_cq_tail without taking proper measures
(usually grabbing ->completion_lock), as timeout requests with CQE
offsets do. However, they have never had good semantics for when
they start counting. Annotate racy reads with data_race().

Reported-by: syzbot+cb265db2f3f3468ef436@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4de3685e185832a92a572df2be2c735d2e21a83d.1684506056.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


19 May, 2023 1 commit

io_uring: maintain ordering for DEFER_TASKRUN tw list · 3af0356c

Jens Axboe authored May 19, 2023

We use lockless lists for the local and deferred task_work, which means
that when we queue up events for processing, we ultimately process them
in reverse order to how they were received. This usually doesn't matter,
but for some cases, it does seem to make a big difference. Do the right
thing and reverse the list before processing it, so that we know it's
processed in the same order in which it was received.

This makes a rather big difference for some medium load network tests,
where consistency of performance was a bit all over the place. Here's
a case that has 4 connections each doing two sends and receives:

io_uring port=10002: rps:161.13k Bps:  1.45M idle=256ms
io_uring port=10002: rps:107.27k Bps:  0.97M idle=413ms
io_uring port=10002: rps:136.98k Bps:  1.23M idle=321ms
io_uring port=10002: rps:155.58k Bps:  1.40M idle=268ms

and after the change:

io_uring port=10002: rps:205.48k Bps:  1.85M idle=140ms user=40ms
io_uring port=10002: rps:203.57k Bps:  1.83M idle=139ms user=20ms
io_uring port=10002: rps:218.79k Bps:  1.97M idle=106ms user=30ms
io_uring port=10002: rps:217.88k Bps:  1.96M idle=110ms user=20ms
io_uring port=10002: rps:222.31k Bps:  2.00M idle=101ms user=0ms
io_uring port=10002: rps:218.74k Bps:  1.97M idle=102ms user=20ms
io_uring port=10002: rps:208.43k Bps:  1.88M idle=125ms user=40ms

using more of the time to actually process work rather than sitting
idle.

No effects have been observed at the peak end of the spectrum, where
performance is still the same even with deep batch depths (and hence
more items to sort).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
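
To make the ordering problem concrete: pushing onto the head of a lock-free singly linked list leaves the newest entry first, so a consumer that detaches the list sees items in reverse arrival order unless it reverses the list first. The sketch below is a userspace analogy using C11 atomics; `struct node`, `push()`, `take_all()` and `reverse_order()` are hypothetical names for this example (the kernel has its own llist helpers for these steps).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type carrying an arrival sequence number. */
struct node {
    int seq;
    struct node *next;
};

/* Lock-free push to the head: cheap, but the list ends up newest-first. */
static void push(_Atomic(struct node *) *head, struct node *n)
{
    assert(n != NULL);
    struct node *old = atomic_load(head);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(head, &old, n));
}

/* Detach the whole list in one atomic step. */
static struct node *take_all(_Atomic(struct node *) *head)
{
    return atomic_exchange(head, (struct node *)NULL);
}

/* Reverse a detached list so it runs oldest-first again. */
static struct node *reverse_order(struct node *list)
{
    struct node *prev = NULL;

    while (list) {
        struct node *next = list->next;
        list->next = prev;
        prev = list;
        list = next;
    }
    return prev;
}
```

The reversal is a single linear pass over an already detached list, so it needs no further synchronization.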


17 May, 2023 4 commits

io_uring/net: don't retry recvmsg() unnecessarily · a2741c58

Jens Axboe authored May 17, 2023

If we're doing multishot receives, then we always end up doing two trips
through sock_recvmsg(). For protocols that sanely set msghdr->msg_inq,
then we don't need to waste time picking a new buffer and attempting a
new receive if there's nothing there.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/net: push IORING_CQE_F_SOCK_NONEMPTY into io_recv_finish() · 7d41bcb7

Jens Axboe authored May 17, 2023

Rather than have this logic in both io_recv() and io_recvmsg_multishot(),
push it into the handler they both call when finishing a receive
operation.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring/net: initialize msghdr->msg_inq to known value · 88fc8b84

Jens Axboe authored May 17, 2023

We can't currently tell if ->msg_inq was set when we ask for msg_get_inq.
Initialize it to -1U so we can tell apart if it was set and there's
no data left, or if it just wasn't set at all by the protocol.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
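
The sentinel trick from this commit can be sketched in a few lines: pre-load the field with a value the protocol would never report, then treat "still the sentinel" as "not set" and zero as "set, nothing left". The struct and helper names below are hypothetical stand-ins, and the field is modelled as `unsigned int` purely to keep the `-1U` comparison simple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the single msghdr field of interest. */
struct fake_msghdr {
    unsigned int msg_inq;
};

#define INQ_UNSET (-1U)

/* Before the receive: mark the field as "not reported yet". */
static void prep_recv(struct fake_msghdr *msg)
{
    assert(msg != NULL);
    msg->msg_inq = INQ_UNSET;
}

/* Did the protocol fill in the field at all? */
static bool inq_was_reported(const struct fake_msghdr *msg)
{
    return msg->msg_inq != INQ_UNSET;
}

/* True only when the protocol reported a value and that value is zero. */
static bool socket_known_empty(const struct fake_msghdr *msg)
{
    return inq_was_reported(msg) && msg->msg_inq == 0;
}
```

Without the sentinel, a zero left over from initialization would be indistinguishable from a protocol reporting an empty socket.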


io_uring/net: initialize struct msghdr more sanely for io_recv() · bf34e697

Jens Axboe authored May 17, 2023

We only need to clear the input fields on the first invocation, not
when potentially doing a retry.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


16 May, 2023 5 commits

io_uring: Add io_uring_setup flag to pre-register ring fd and never install it · 6e76ac59

Josh Triplett authored Apr 29, 2023

With IORING_REGISTER_USE_REGISTERED_RING, an application can register
the ring fd and use it via registered index rather than installed fd.
This allows using a registered ring for everything *except* the initial
mmap.

With IORING_SETUP_NO_MMAP, io_uring_setup uses buffers allocated by the
user, rather than requiring a subsequent mmap.

The combination of the two allows a user to operate *entirely* via a
registered ring fd, making it unnecessary to ever install the fd in the
first place. So, add a flag IORING_SETUP_REGISTERED_FD_ONLY to make
io_uring_setup register the fd and return a registered index, without
installing the fd.

This allows an application to avoid touching the fd table at all, and
allows a library to never even momentarily install a file descriptor.

This splits out an io_ring_add_registered_file helper from
io_ring_add_registered_fd, for use by io_uring_setup.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Link: https://lore.kernel.org/r/bc8f431bada371c183b95a83399628b605e978a3.1682699803.git.josh@joshtriplett.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: support for user allocated memory for rings/sqes · 03d89a2d

Jens Axboe authored Nov 05, 2021

Currently io_uring applications must call mmap(2) twice to map the rings
themselves, and the sqes array. This works fine, but it does not support
using huge pages to back the rings/sqes.

Provide a way for the application to pass in pre-allocated memory for
the rings/sqes, which can then suitably be allocated from shmfs or
via mmap to get huge page support.

Particularly for larger rings, this reduces the TLBs needed.

If an application wishes to take advantage of that, it must pre-allocate
the memory needed for the sq/cq ring, and the sqes. The former must
be passed in via the io_uring_params->cq_off.user_data field, while the
latter is passed in via the io_uring_params->sq_off.user_data field. Then
it must set IORING_SETUP_NO_MMAP in the io_uring_params->flags field,
and io_uring will then map the existing memory into the kernel for shared
use. The application must not call mmap(2) to map rings as it otherwise
would have, that will now fail with -EINVAL if this setup flag was used.

The pages used for the rings and sqes must be contiguous. The intent here
is clearly that huge pages should be used, otherwise the normal setup
procedure works fine as-is. The application may use one huge page for
both the rings and sqes.

Outside of those initialization changes, everything works like it did
before.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: add ring freeing helper · 9c189eee

Jens Axboe authored Nov 05, 2021

We do rings and sqes separately, move them into a helper that does both
the freeing and clearing of the memory.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: return error pointer from io_mem_alloc() · e27cef86

Jens Axboe authored Nov 05, 2021

In preparation for having more than one type of ring allocator, make the
existing one return valid/error-pointer rather than just NULL.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


io_uring: remove sq/cq_off memset · 9b1b58ca

Jens Axboe authored Nov 05, 2021

We only have two reserved members we're not clearing, do so manually
instead. This is in preparation for using one of these members for
a new feature.
Signed-off-by: Jens Axboe <axboe@kernel.dk>


15 May, 2023 3 commits

io_uring: rely solely on FMODE_NOWAIT · caec5ebe

Jens Axboe authored May 09, 2023

Now that we have both sockets and block devices setting FMODE_NOWAIT
appropriately, we can get rid of all the odd special casing in
__io_file_supports_nowait() and rely solely on FMODE_NOWAIT and
O_NONBLOCK rather than special case sockets and (in particular) bdevs.

Link: https://lore.kernel.org/r/20230509151910.183637-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>


block: mark bdev files as FMODE_NOWAIT if underlying device supports it · e9833d87

Jens Axboe authored May 09, 2023

We set this unconditionally, but it really should be dependent on if
the underlying device is nowait compliant.

Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230509151910.183637-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>


net: set FMODE_NOWAIT for sockets · fe34db06

Jens Axboe authored May 09, 2023

The socket read/write functions deal with O_NONBLOCK and IOCB_NOWAIT
just fine, so we can flag them as being FMODE_NOWAIT compliant. With
this, we can remove socket special casing in io_uring when checking
if a file type is sane for nonblocking IO, and it's also the defined
way to flag file types as such in the kernel.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20230509151910.183637-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>


14 May, 2023 2 commits

Linux 6.4-rc2 · f1fcbaa1

Linus Torvalds authored May 14, 2023


Merge tag 'cxl-fixes-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl · 533c5454

Linus Torvalds authored May 14, 2023

Pull compute express link fixes from Dan Williams:

 - Fix a compilation issue with DEFINE_STATIC_SRCU() in the unit tests

 - Fix leaking kernel memory to a root-only sysfs attribute

* tag 'cxl-fixes-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl: Add missing return to cdat read error path
  tools/testing/cxl: Use DEFINE_STATIC_SRCU()
