Commits · 686b56cbeedc9f4c72f9bb781918194a9a3e8334 · Kirill Smelkov / linux

15 Apr, 2024 35 commits

io_uring: ensure overflow entries are dropped when ring is exiting · 686b56cb

Jens Axboe authored 10 months ago

A previous consolidation cleanup missed handling the case where the ring
is dying, and __io_cqring_overflow_flush() doesn't flush entries if the
CQ ring is already full. This is fine for the normal CQE overflow
flushing, but if the ring is going away, we need to flush everything,
even if it means simply freeing the overflown entries.

Fixes: 6c948ec44b29 ("io_uring: consolidate overflow flushing")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

686b56cb
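The behaviour described in the message above can be sketched in userspace C. This is a minimal illustration, not the kernel implementation: `struct ring`, `struct overflow_cqe`, `CQ_SIZE`, `overflow_add()` and `overflow_flush()` are hypothetical stand-ins, and the overflow list here is LIFO for brevity. It only shows the distinction the fix relies on: a normal flush stops when the CQ is full, while a flush on a dying ring frees everything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define CQ_SIZE 4 /* hypothetical, tiny CQ ring for illustration */

/* Hypothetical stand-in for an overflown completion entry. */
struct overflow_cqe {
	struct overflow_cqe *next;
	int res;
};

/* Hypothetical stand-in for the ring context. */
struct ring {
	int cq[CQ_SIZE];
	size_t cq_used;
	struct overflow_cqe *overflow; /* entries that did not fit in the CQ */
};

/* Queue a completion on the overflow list; false on allocation failure. */
static bool overflow_add(struct ring *r, int res)
{
	struct overflow_cqe *e = malloc(sizeof(*e));

	if (!e)
		return false;
	e->res = res;
	e->next = r->overflow;
	r->overflow = e;
	return true;
}

/*
 * Flush overflown entries. Normally stop once the CQ is full; if the ring
 * is dying, free every entry without posting it.
 * Returns the number of entries still queued afterwards.
 */
static size_t overflow_flush(struct ring *r, bool dying)
{
	size_t left = 0;

	assert(r->cq_used <= CQ_SIZE);
	while (r->overflow) {
		struct overflow_cqe *e = r->overflow;

		if (!dying) {
			if (r->cq_used == CQ_SIZE)
				break; /* no room: keep the rest for later */
			r->cq[r->cq_used++] = e->res;
		}
		r->overflow = e->next;
		free(e);
	}
	for (struct overflow_cqe *e = r->overflow; e; e = e->next)
		left++;
	return left;
}
```

With a full CQ, `overflow_flush(&r, false)` leaves the list untouched, whereas `overflow_flush(&r, true)` always returns 0, which is the property the exit path needs.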

io_uring: consolidate overflow flushing · 6b231248

Pavel Begunkov authored 10 months ago

Consolidate __io_cqring_overflow_flush and io_cqring_overflow_kill()
into a single function as it once was, it's easier to work with it this
way.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/986b42c35e76a6be7aa0cdcda0a236a2222da3a7.1712708261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6b231248

io_uring: always lock __io_cqring_overflow_flush · 8d09a88e

Pavel Begunkov authored 10 months ago

Conditional locking is never great; in the case of
__io_cqring_overflow_flush(), which is a slow path, it's not justified.
Don't handle IOPOLL separately, always grab uring_lock for overflow
flushing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/162947df299aa12693ac4b305dacedab32ec7976.1712708261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8d09a88e

io_uring: open code io_cqring_overflow_flush() · 408024b9

Pavel Begunkov authored 10 months ago

There is only one caller of io_cqring_overflow_flush(), open code it
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a1fecd56d9dba923ed8d4d159727fa939d3baa2a.1712708261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

408024b9

io_uring: remove extra SQPOLL overflow flush · e45ec969

Pavel Begunkov authored 10 months ago

c1edbf5f ("io_uring: flag SQPOLL busy condition to userspace")
added an extra overflowed CQE flush in the SQPOLL submission path due to
backpressure, which was later removed. Remove the flush and let
io_cqring_wait() / iopoll handle it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a83b0724ca6ca9d16c7d79a51b77c81876b2e39.1712708261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e45ec969

io_uring: unexport io_req_cqe_overflow() · a5bff518

Pavel Begunkov authored 10 months ago

There are no users of io_req_cqe_overflow() apart from io_uring.c, make
it static.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f4295eb2f9eb98d5db38c0578f57f0b86bfe0d8c.1712708261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a5bff518

io_uring: return void from io_put_kbuf_comp() · bbbef3e9

Ming Lei authored 11 months ago

The only caller doesn't handle the return value of io_put_kbuf_comp(), so
change its return type into void.

Also follow Jens's suggestion to rename it as io_put_kbuf_drop().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240407132759.4056167-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bbbef3e9

io_uring: remove io_req_put_rsrc_locked() · c29006a2

Pavel Begunkov authored 11 months ago

io_req_put_rsrc_locked() is a weird shim function around
io_req_put_rsrc(). All calls to io_req_put_rsrc() require holding
->uring_lock, so we can just use it directly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a195bc78ac3d2c6fbaea72976e982fe51e50ecdd.1712331455.git.asml.silence@gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c29006a2

io_uring: remove async request cache · d9713ad3

Pavel Begunkov authored 11 months ago

io_req_complete_post() was a sole user of ->locked_free_list, but
since we just gutted the function, the cache is not used anymore and
can be removed.

->locked_free_list served as an asynchronous counterpart of the main
request (i.e. struct io_kiocb) cache for all unlocked cases like io-wq.
Now they're all forced to be completed into the main cache directly,
off of the normal completion path or via io_free_req().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7bffccd213e370abd4de480e739d8b08ab6c1326.1712331455.git.asml.silence@gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d9713ad3

io_uring: turn implicit assumptions into a warning · de96e9ae

Pavel Begunkov authored 11 months ago

io_req_complete_post() is now io-wq only and shouldn't be used outside
of it, i.e. it relies on io-wq holding a ref for the request as
explained in a comment below. Let's add a warning to enforce the
assumption and make sure nobody would try to do anything weird.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1013b60c35d431d0698cafbc53c06f5917348c20.1712331455.git.asml.silence@gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

de96e9ae

io_uring: kill dead code in io_req_complete_post · f3913000

Ming Lei authored 11 months ago

Since commit 8f6c829491fe ("io_uring: remove struct io_tw_state::locked"),
io_req_complete_post() is only called from io-wq submit work, where the
request reference is guaranteed to be grabbed and won't drop to zero
in io_req_complete_post().

Kill the dead code, meantime add req_ref_put() to put the reference.

Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1d8297e2046553153e763a52574f0e0f4d512f86.1712331455.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f3913000

io_uring: move mapping/allocation helpers to a separate file · f15ed8b4

Jens Axboe authored 11 months ago

Move the related code from io_uring.c into memmap.c. No functional
changes in this patch, just cleaning it up a bit now that the full
transition is done.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f15ed8b4

io_uring: use unpin_user_pages() where appropriate · 18595c0a

Jens Axboe authored 11 months ago

There are a few cases of open-rolled loops around unpin_user_page(), use
the generic helper instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

18595c0a

io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring · 87585b05

Jens Axboe authored 11 months ago

Rather than use remap_pfn_range() for this and manually free later,
switch to using vm_insert_pages() and have it Just Work.

This requires a bit of effort on the mmap lookup side, as the ctx
uring_lock isn't held, which otherwise protects buffer_lists from being
torn down, and it's not safe to grab from mmap context that would
introduce an ABBA deadlock between the mmap lock and the ctx uring_lock.
Instead, lookup the buffer_list under RCU, as the list is RCU freed
already. Use the existing reference count to determine whether it's
possible to safely grab a reference to it (eg if it's not zero already),
and drop that reference when done with the mapping. If the mmap
reference is the last one, the buffer_list and the associated memory can
go away, since the vma insertion has references to the inserted pages at
that point.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

87585b05
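The "grab a reference only if the count is not already zero" step described above can be sketched with C11 atomics. This is a userspace illustration under stated assumptions, not kernel code: `struct buf_list`, `buf_list_tryget()` and `buf_list_put()` are hypothetical names, and the RCU read-side protection that keeps the object's memory valid during the lookup is not modelled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted buffer list. */
struct buf_list {
	atomic_int refs;
};

/* Take a reference only if the object is still live (count not zero). */
static bool buf_list_tryget(struct buf_list *bl)
{
	int old = atomic_load(&bl->refs);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&bl->refs, &old, old + 1))
			return true;
	}
	return false; /* already on its way to being freed */
}

/* Drop a reference; returns true if it was the last one. */
static bool buf_list_put(struct buf_list *bl)
{
	int prev = atomic_fetch_sub(&bl->refs, 1);

	assert(prev > 0);
	return prev == 1;
}
```

A caller that gets `true` from `buf_list_tryget()` may use the object until its matching `buf_list_put()`; whoever sees `buf_list_put()` return `true` is responsible for freeing it.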

io_uring: unify io_pin_pages() · 1943f96b

Jens Axboe authored 11 months ago

Move it into io_uring.c where it belongs, and use it in there as well
rather than have two implementations of this.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1943f96b

io_uring: use vmap() for ring mapping · 09fc75e0

Jens Axboe authored 11 months ago

This is the last holdout which does odd page checking, convert it to
vmap just like what is done for the non-mmap path.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

09fc75e0

io_uring: get rid of remap_pfn_range() for mapping rings/sqes · 3ab1db3c

Jens Axboe authored 11 months ago

Rather than use remap_pfn_range() for this and manually free later,
switch to using vm_insert_pages() and have it Just Work.

If possible, allocate a single compound page that covers the range that
is needed. If that works, then we can just use page_address() on that
page. If we fail to get a compound page, allocate single pages and use
vmap() to map them into the kernel virtual address space.

This just covers the rings/sqes, the other remaining mmap user of
remap_pfn_range() will be converted separately. Once that is done,
we can kill the old alloc/free code.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3ab1db3c

io_uring: Remove the now superfluous sentinel elements from ctl_table array · a80929d1

Joel Granados authored 11 months ago

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove sentinel element from kernel_io_uring_disabled_table
Signed-off-by: Joel Granados <j.granados@samsung.com>
Link: https://lore.kernel.org/r/20240328-jag-sysctl_remset_misc-v1-6-47c1463b3af2@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a80929d1
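As a rough illustration of what removing a sentinel means, the sketch below contrasts a table terminated by an empty element with one whose length is taken from the array itself. The `struct entry` type, the table contents and the helper names are made up for this example and are not the kernel's `ctl_table` definitions.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical table entry; not the kernel's struct ctl_table. */
struct entry {
	const char *name;
	int value;
};

/* Old style: an empty trailing element marks the end of the table. */
static const struct entry with_sentinel[] = {
	{ "alpha", 1 },
	{ "beta", 2 },
	{ NULL, 0 }, /* sentinel */
};

/* New style: no sentinel, the size is passed along explicitly. */
static const struct entry without_sentinel[] = {
	{ "alpha", 1 },
	{ "beta", 2 },
};

/* Walk a sentinel-terminated table and count its real entries. */
static size_t count_by_sentinel(const struct entry *table)
{
	size_t n = 0;

	while (table[n].name)
		n++;
	return n;
}

/* Walk a table whose length is supplied by the caller. */
static int sum_values(const struct entry *table, size_t n)
{
	int sum = 0;

	for (size_t i = 0; i < n; i++) {
		assert(table[i].name);
		sum += table[i].value;
	}
	return sum;
}
```

The sentinel-free table is exactly one `struct entry` smaller, which is the per-table saving the message refers to.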

io_uring: Remove unused function · 4e9706c6

Jiapeng Chong authored 11 months ago

The function is defined in the io_uring.c file but not called
anywhere, so delete the unused function.

io_uring/io_uring.c:646:20: warning: unused function '__io_cq_unlock'.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8660
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240328022324.78029-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4e9706c6

io_uring: refill request cache in memory order · 05eb5fe2

Jens Axboe authored 11 months ago

The allocator will generally return memory in order, but
__io_alloc_req_refill() then adds them to a stack and we'll extract them
in the opposite order. This obviously isn't a huge deal, but:

1) it makes debugging easier when they are in order
2) keeping them in-order is the right thing to do
3) reduces the code for adding them to the stack

Just add them in reverse to the stack.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

05eb5fe2
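The ordering argument above can be shown with a small userspace sketch. The names `struct req`, `struct req_cache`, `cache_push()`, `cache_pop()` and `cache_refill()` are hypothetical; the only point is that pushing a batch onto a LIFO stack in reverse makes later pops come out in ascending memory order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request object with an intrusive stack link. */
struct req {
	struct req *next;
};

/* Hypothetical LIFO cache of free requests. */
struct req_cache {
	struct req *head;
};

static void cache_push(struct req_cache *c, struct req *r)
{
	r->next = c->head;
	c->head = r;
}

static struct req *cache_pop(struct req_cache *c)
{
	struct req *r = c->head;

	if (r)
		c->head = r->next;
	return r;
}

/*
 * Add a batch of n requests, laid out in ascending memory order, to the
 * stack. Pushing from the last element down to the first means the first
 * element ends up on top and is popped first.
 */
static void cache_refill(struct req_cache *c, struct req *batch, size_t n)
{
	assert(batch || n == 0);
	while (n--)
		cache_push(c, &batch[n]);
}
```

After `cache_refill(&c, batch, 3)`, successive `cache_pop()` calls return `&batch[0]`, `&batch[1]`, then `&batch[2]`.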

io_uring/poll: shrink alloc cache size to 32 · da22bdf3

Jens Axboe authored 11 months ago

This should be plenty, rather than the default of 128, and matches what
we have on the rsrc and futex side as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

da22bdf3

io_uring/alloc_cache: switch to array based caching · 414d0f45

Jens Axboe authored 11 months ago

Currently lists are being used to manage this, but best practice is
usually to have these in an array instead as that is cheaper to manage.

Outside of that detail, games are also played with KASAN as the list
is inside the cached entry itself.

Finally, all users of this need a struct io_cache_entry embedded in
their struct, which is union'ized with something else in there that
isn't used across the free -> realloc cycle.

Get rid of all of that, and simply have it be an array. This will not
change the memory used, as we're just trading an 8-byte member entry
for the per-elem array size.

This reduces the overhead of the recycled allocations, and it reduces
the amount of code needed to support recycling to about half of
what it currently is.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

414d0f45
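A minimal userspace sketch of an array-based recycling cache follows. It is an illustration of the idea only: the struct layout and the `alloc_cache_*()` helper names are assumptions made for this example and may not match the kernel's definitions. Cached objects are held as plain pointers in an array, so nothing needs to be embedded in the cached object itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical array-backed cache of recycled allocations. */
struct alloc_cache {
	void **entries;          /* array of cached object pointers */
	unsigned int nr_cached;  /* number of valid slots in entries[] */
	unsigned int max_cached; /* capacity of entries[] */
	size_t elem_size;        /* size of each object */
};

static bool alloc_cache_init(struct alloc_cache *c, unsigned int max,
			     size_t elem_size)
{
	c->entries = calloc(max, sizeof(void *));
	if (!c->entries)
		return false;
	c->nr_cached = 0;
	c->max_cached = max;
	c->elem_size = elem_size;
	return true;
}

/* Recycle an object; returns false if the cache is full. */
static bool alloc_cache_put(struct alloc_cache *c, void *obj)
{
	assert(obj);
	if (c->nr_cached == c->max_cached)
		return false; /* caller must free obj itself */
	c->entries[c->nr_cached++] = obj;
	return true;
}

/* Take a recycled object, or NULL if the cache is empty. */
static void *alloc_cache_get(struct alloc_cache *c)
{
	if (!c->nr_cached)
		return NULL;
	return c->entries[--c->nr_cached];
}

/* Prefer a recycled object, fall back to a fresh allocation. */
static void *alloc_cache_alloc(struct alloc_cache *c)
{
	void *obj = alloc_cache_get(c);

	return obj ? obj : malloc(c->elem_size);
}

/* Free all cached objects and the array itself. */
static void alloc_cache_destroy(struct alloc_cache *c)
{
	void *obj;

	while ((obj = alloc_cache_get(c)) != NULL)
		free(obj);
	free(c->entries);
	c->entries = NULL;
}
```

Both put and get are a bounds check plus one array access, which is the management cost the message contrasts with list handling.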

io_uring: drop ->prep_async() · e10677a8

Jens Axboe authored 11 months ago

It's now unused, drop the code related to it. This includes the
io_issue_defs->manual alloc field.

While in there, and since ->async_size is now being used a bit more
frequently and in the issue path, move it to io_issue_defs[].
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e10677a8

io_uring/uring_cmd: switch to always allocating async data · d10f19df

Jens Axboe authored 11 months ago

Basic conversion ensuring async_data is allocated off the prep path. Adds
a basic alloc cache as well, as passthrough IO can be quite high in rate.
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d10f19df

io_uring/rw: always setup io_async_rw for read/write requests · a9165b83

Jens Axboe authored 11 months ago

read/write requests try to put everything on the stack, and then alloc
and copy if a retry is needed. This necessitates a bunch of nasty code
that deals with intermediate state.

Get rid of this, and have the prep side setup everything that is needed
upfront, which greatly simplifies the opcode handlers.

This includes adding an alloc cache for io_async_rw, to make it cheap
to handle.

In terms of cost, this should be basically free and transparent. For
the worst case of {READ,WRITE}_FIXED which didn't need it before,
performance is unaffected in the normal peak workload that is being
used to test that. Still runs at 122M IOPS.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a9165b83

io_uring: remove timeout/poll specific cancelations · 29f858a7

Jens Axboe authored 11 months ago

For historical reasons these were special cased, as they were the only
ones that needed cancelation. But now we handle cancelations generally,
and hence there's no need to check for these in
io_ring_ctx_wait_and_kill() when io_uring_try_cancel_requests() handles
both these and the rest as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

29f858a7

io_uring: flush delayed fallback task_work in cancelation · 25417623

Jens Axboe authored 11 months ago

Just like we run the inline task_work, ensure we also factor in and
run the fallback task_work.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

25417623

io_uring: refactor io_req_complete_post() · 0667db14

Pavel Begunkov authored 11 months ago

Make io_req_complete_post() push all IORING_SETUP_IOPOLL requests
to task_work, it's much cleaner and should normally happen. We couldn't
do it before because there was a possibility of looping in

complete_post() -> tw -> complete_post() -> ...

Also, unexport the function and inline __io_req_complete_post().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/ea19c032ace3e0dd96ac4d991a063b0188037014.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0667db14

io_uring: remove current check from complete_post · 23fbdde6

Pavel Begunkov authored 11 months ago

task_work execution is now always locked, and we shouldn't get into
io_req_complete_post() from them. That means that complete_post() is
always called out of the original task context and we don't even need to
check current.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/24ec27f27db0d8f58c974d8118dca1d345314ddc.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

23fbdde6

io_uring: get rid of intermediate aux cqe caches · 902ce82c

Pavel Begunkov authored 11 months ago

io_post_aux_cqe(), which is used for multishot requests, delays
completions by putting CQEs into a temporary array for the purpose of
completion lock/flush batching.

DEFER_TASKRUN doesn't need any locking, so for it we can put completions
directly into the CQ and defer post completion handling with a flag.
That leaves !DEFER_TASKRUN, which is not that interesting / hot for
multishot requests, so have conditional locking with deferred flush
for them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/b1d05a81fd27aaa2a07f9860af13059e7ad7a890.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

902ce82c

io_uring: refactor io_fill_cqe_req_aux · e5c12945

Pavel Begunkov authored 11 months ago

The restriction on multishot execution context disallowing io-wq is
driven by rules of io_fill_cqe_req_aux(), it should only be called in
the master task context, either from the syscall path or in task_work.
Since task_work now always takes the ctx lock implying
IO_URING_F_COMPLETE_DEFER, we can just assume that the function is
always called with its defer argument set to true.

Kill the argument. Also rename the function for more consistency as
"fill" in CQE related functions was usually meant for raw interfaces
only copying data into the CQ without any locking, waking the user
and other accounting "post" functions take care of.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/93423d106c33116c7d06bf277f651aa68b427328.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e5c12945

io_uring: remove struct io_tw_state::locked · 8e5b3b89

Pavel Begunkov authored 11 months ago

ctx is always locked for task_work now, so get rid of struct
io_tw_state::locked. Note I'm stopping one step before removing
io_tw_state altogether, which is not empty, because it still serves the
purpose of indicating which function is a tw callback and forcing users
not to invoke them carelessly out of a wrong context. The removal can
always be done later.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/e95e1ea116d0bfa54b656076e6a977bc221392a4.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8e5b3b89

io_uring: force tw ctx locking · 92219afb

Pavel Begunkov authored 11 months ago

We can run normal task_work without locking the ctx, however we try to
lock anyway and most handlers prefer or require it locked. It might have
been interesting for a multi-submitter ring with high contention completing
async read/write requests via task_work, however that will still need to
go through io_req_complete_post() and potentially take the lock for
rsrc node putting or some other case.

In other words, it's hard to care about it, so always force the locking.
The case described would also because of various io_uring caches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/6ae858f2ef562e6ed9f13c60978c0d48926954ba.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

92219afb

io_uring/rw: avoid punting to io-wq directly · 6e6b8c62

Pavel Begunkov authored 11 months ago

kiocb_done() should not care about specifically redirecting requests to io-wq.
Remove the hopping to tw to then queue an io-wq, return -EAGAIN and let
the core code io_uring handle offloading.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/413564e550fe23744a970e1783dfa566291b0e6f.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6e6b8c62

io_uring/cmd: move io_uring_try_cancel_uring_cmd() · da12d9ab

Pavel Begunkov authored 11 months ago

io_uring_try_cancel_uring_cmd() is a part of the cmd handling so let's
move it closer to all cmd bits into uring_cmd.c
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/43a3937af4933655f0fd9362c381802f804f43de.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

da12d9ab

06 Apr, 2024 1 commit

io_uring: Fix io_cqring_wait() not restoring sigmask on get_timespec64() failure · 978e5c19

Alexey Izbyshev authored 11 months ago

This bug was introduced in commit 950e79dd ("io_uring: minor
io_cqring_wait() optimization"), which was made in preparation for
adc8682e ("io_uring: Add support for napi_busy_poll"). The latter
got reverted in cb318216 ("Revert "io_uring: Add support for
napi_busy_poll""), so simply undo the former as well.

Cc: stable@vger.kernel.org
Fixes: 950e79dd ("io_uring: minor io_cqring_wait() optimization")
Signed-off-by: Alexey Izbyshev <izbyshev@ispras.ru>
Link: https://lore.kernel.org/r/20240405125551.237142-1-izbyshev@ispras.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>

978e5c19

03 Apr, 2024 2 commits

io_uring/kbuf: hold io_buffer_list reference over mmap · 561e4f94

Jens Axboe authored 11 months ago

If we look up the kbuf, ensure that it doesn't get unregistered until
after we're done with it. Since we're inside mmap, we cannot safely use
the io_uring lock. Rely on the fact that we can lookup the buffer list
under RCU now and grab a reference to it, preventing it from being
unregistered until we're done with it. The lookup returns the
io_buffer_list directly with it referenced.

Cc: stable@vger.kernel.org # v6.4+
Fixes: 5cf4f52e ("io_uring: free io_buffer_list entries via RCU")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

561e4f94

io_uring/kbuf: get rid of lower BGID lists · 09ab7eff

Jens Axboe authored 11 months ago

Just rely on the xarray for any kind of bgid. This simplifies things, and
the lower BGID lists really don't bring us much, if anything.

Cc: stable@vger.kernel.org # v6.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

09ab7eff

02 Apr, 2024 1 commit

io_uring: use private workqueue for exit work · 73eaa2b5

Jens Axboe authored 11 months ago

Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.

Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: https://github.com/axboe/liburing/issues/1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>

73eaa2b5

01 Apr, 2024 1 commit

io_uring: disable io-wq execution of multishot NOWAIT requests · bee1d5be

Jens Axboe authored 11 months ago

Do the same check for direct io-wq execution for multishot requests that
commit 2a975d42 did for the inline execution, and disable multishot
mode (and revert to single shot) if the file type doesn't support NOWAIT,
and isn't opened in O_NONBLOCK mode. For multishot to work properly, it's
a requirement that nonblocking read attempts can be done.

Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bee1d5be