Commits · 3d200242a6c968af321913b635fc4014b238cba4 · Kirill Smelkov / linux

18 May, 2022 2 commits

io_uring: add buffer selection support to IORING_OP_NOP · 3d200242

Jens Axboe authored May 05, 2022

Obviously not really useful since it's not transferring data, but it
is helpful in benchmarking overhead of provided buffers.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: fix locking state for empty buffer group · e7637a49

Jens Axboe authored May 15, 2022

io_provided_buffer_select() must drop the submit lock, if needed, even
in the error handling case. Failure to do so will leave us with the
ctx->uring_lock held, causing spew like:

====================================
WARNING: iou-wrk-366/368 still has locks held!
5.18.0-rc6-00294-gdf8dc7004331 #994 Not tainted
------------------------------------
1 lock held by iou-wrk-366/368:
 #0: ffff0000c72598a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_ring_submit_lock+0x20/0x48

stack backtrace:
CPU: 4 PID: 368 Comm: iou-wrk-366 Not tainted 5.18.0-rc6-00294-gdf8dc7004331 #994
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace.part.0+0xa4/0xd4
 show_stack+0x14/0x5c
 dump_stack_lvl+0x88/0xb0
 dump_stack+0x14/0x2c
 debug_check_no_locks_held+0x84/0x90
 try_to_freeze.isra.0+0x18/0x44
 get_signal+0x94/0x6ec
 io_wqe_worker+0x1d8/0x2b4
 ret_from_fork+0x10/0x20

and triggering later hangs off get_signal() because we attempt to
re-grab the lock.

Reported-by: syzbot+987d7bb19195ae45208c@syzkaller.appspotmail.com
Fixes: 149c69b0 ("io_uring: abstract out provided buffer list selection")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
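
The rule this fix enforces, that the error path must release the lock exactly as the success path does, can be modelled in a few lines. This is a minimal userspace sketch, not the kernel code: struct ring_ctx, select_buffer() and the boolean standing in for ctx->uring_lock are hypothetical, and -ENOBUFS is assumed as the error for an empty group.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a bool stands in for ctx->uring_lock. */
struct ring_ctx {
    bool lock_held;
    size_t nr_bufs;         /* buffers left in the group */
};

/* Take the lock only when the caller does not already hold it. */
static void submit_lock(struct ring_ctx *ctx, bool needs_lock)
{
    if (needs_lock) {
        assert(!ctx->lock_held);
        ctx->lock_held = true;
    }
}

static void submit_unlock(struct ring_ctx *ctx, bool needs_lock)
{
    if (needs_lock)
        ctx->lock_held = false;
}

/* Returns 0 on success, -ENOBUFS (assumed) when the group is empty. */
static int select_buffer(struct ring_ctx *ctx, bool needs_lock)
{
    int ret = 0;

    submit_lock(ctx, needs_lock);
    if (ctx->nr_bufs == 0)
        ret = -ENOBUFS;     /* error path falls through to the unlock */
    else
        ctx->nr_bufs--;
    submit_unlock(ctx, needs_lock);
    return ret;
}
```

Routing both outcomes through a single unlock call, rather than returning early on error, is what keeps the lock state balanced.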

14 May, 2022 4 commits

io_uring: implement multishot mode for accept · 4e86a2c9

Hao Xu authored May 14, 2022

Refactor io_accept() to support multishot mode.

theoretical analysis:
  1) when connections come in fast
    - singleshot:
              add accept sqe(userspace) --> accept inline
                              ^                 |
                              |-----------------|
    - multishot:
             add accept sqe(userspace) --> accept inline
                                              ^     |
                                              |--*--|

    We accept repeatedly at the * point until we get EAGAIN.

  2) when connections come in at low pressure
    Similar to 1): we avoid a lot of userspace-kernel context
    switches and useless vfs_poll() calls.
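
The multishot loop in the diagram above can be modelled in userspace as follows. This is a minimal sketch, not the kernel code: struct listener, try_accept() and accept_multishot() are hypothetical names, and a counter stands in for the socket's queue of pending connections.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a listening socket with queued connections. */
struct listener {
    int pending;    /* connections waiting to be accepted */
    int next_fd;    /* fd number handed out for the next accept */
};

/* Mimics a non-blocking accept: a new fd, or -EAGAIN when nothing is queued. */
static int try_accept(struct listener *l)
{
    if (l->pending == 0)
        return -EAGAIN;
    l->pending--;
    return l->next_fd++;
}

/*
 * Multishot: keep accepting inline until -EAGAIN, recording one completion
 * per accepted connection. Returns the number of completions posted.
 */
static int accept_multishot(struct listener *l, int *completions, int max)
{
    int n = 0;

    while (n < max) {
        int fd = try_accept(l);

        if (fd < 0)
            break;  /* queue drained: this is where poll would be re-armed */
        completions[n++] = fd;
    }
    return n;
}
```

A single submission drains every queued connection here, whereas the singleshot flow would need one new SQE from userspace per connection.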

tests:
Ran some tests, which go as follows:

  server    client(multiple)
  accept    connect
  read      write
  write     read
  close     close

Basically, start a number of clients (on the same machine as the server)
that connect to the server and write some data to it. The server writes
the data back to the client after receiving it, and closes the connection
once the write returns. The client then reads the data and closes the
connection. Here I test 10000 clients connecting to one server, with a
data size of 128 bytes. Each client has its own goroutine, so they all
reach the server within a short time.
Tested 20 times before/after this patchset; time spent (unit: cycles,
the return value of clock()):
before:
  (1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
  +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
  +1934226+1914385)/20.0 = 1927633.75
after:
  (1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
  +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
  +1871324+1940803)/20.0 = 1894750.45

(1927633.75 - 1894750.45) / 1927633.75 = 1.71%
Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-5-haoxu.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
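
For anyone who wants to re-check the benchmark arithmetic, the sketch below recomputes the two means and the relative reduction from the clock() samples listed in the message. The helper names mean() and reduction_pct() are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* clock() samples copied from the commit message. */
static const double before[20] = {
    1930136, 1940725, 1907981, 1947601, 1923812, 1928226, 1911087,
    1905897, 1941075, 1934374, 1906614, 1912504, 1949110, 1908790,
    1909951, 1941672, 1969525, 1934984, 1934226, 1914385,
};

static const double after[20] = {
    1858905, 1917104, 1895455, 1963963, 1892706, 1889208, 1874175,
    1904753, 1874112, 1874985, 1882706, 1884642, 1864694, 1906508,
    1916150, 1924250, 1869060, 1889506, 1871324, 1940803,
};

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Relative reduction of new_val against old_val, in percent. */
static double reduction_pct(double old_val, double new_val)
{
    return (old_val - new_val) / old_val * 100.0;
}
```

Calling reduction_pct(mean(before, 20), mean(after, 20)) yields the improvement figure for this run.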

io_uring: let fast poll support multishot · dbc2564c

Hao Xu authored May 14, 2022

For operations like accept, multishot is a useful feature, since it
reduces the number of accept SQEs needed. Integrate it into fast poll;
it may be useful for other operations in the future.
Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-4-haoxu.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add REQ_F_APOLL_MULTISHOT for requests · 227685eb

Hao Xu authored May 14, 2022

Add a flag to indicate multishot mode for fast poll. Currently only
accept uses it, but more operations may leverage it in the future.
Also add a mask IO_APOLL_MULTI_POLLED, which stands for
REQ_F_APOLL_MULTISHOT | REQ_F_POLLED, to make the code shorter and cleaner.
Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-3-haoxu.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add IORING_ACCEPT_MULTISHOT for accept · 390ed29b

Hao Xu authored May 14, 2022

Add an accept flag, IORING_ACCEPT_MULTISHOT, which requests multishot
mode for accept.
Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-2-haoxu.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

13 May, 2022 8 commits

io_uring: only wake when the correct events are set · 1b1d7b4b

Dylan Yudaken authored May 12, 2022

The check for waking up a request compares the poll_t bits. However,
these always contain some common flags, so the check always wakes up.

For files with single wait queues, such as sockets, this can cause the
request to be sent to the async worker unnecessarily. Further, if the
file is non-blocking, the request completes with EAGAIN, which is not
desired.

Exclude these common events here, making sure not to exclude POLLERR,
which might be important.

Fixes: d7718a9d ("io_uring: use poll driven retry for files that support it")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220512091834.728610-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
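
The masking idea can be illustrated with the userspace poll constants. This is a sketch under stated assumptions, not the kernel check: should_wake() and COMMON_EVENTS are hypothetical, and POLLPRI is assumed here to represent the bits that are set on every request.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>

/*
 * Assumption for illustration: POLLPRI stands in for the "common" bits
 * that every request carries. POLLERR is deliberately not part of the mask.
 */
#define COMMON_EVENTS (POLLPRI)

/* Wake only if a triggered bit matches something the request asked for. */
static bool should_wake(short requested, short triggered)
{
    return (triggered & (requested & ~COMMON_EVENTS)) != 0;
}
```

With a request for POLLIN | POLLERR | POLLPRI, a wakeup carrying only POLLPRI is ignored, while POLLIN or POLLERR still wakes the request.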

io_uring: avoid io-wq -EAGAIN looping for !IOPOLL · e0deb6a0

Pavel Begunkov authored May 13, 2022

If an opcode handler semi-reliably returns -EAGAIN, io_wq_submit_work()
might continue to busily hammer the same handler over and over again, which
is not ideal. The -EAGAIN handling in question was put there only for
IOPOLL, so restrict it to IOPOLL mode only where there is no other
recourse than to retry as we cannot wait.

Fixes: def596e9 ("io_uring: support for IO polling")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f168b4f24181942f3614dd8ff648221736f572e6.1652433740.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
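
The retry policy described above can be sketched as a small loop. This is a simplified model rather than io_wq_submit_work() itself: worker_issue(), issue_fn and the flaky() test handler are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical handler type: returns 0 or a negative errno such as -EAGAIN. */
typedef int (*issue_fn)(void *data);

/* Issue one request from a worker; iopoll mirrors the ring's IOPOLL mode. */
static int worker_issue(issue_fn issue, void *data, bool iopoll)
{
    for (;;) {
        int ret = issue(data);

        if (ret != -EAGAIN)
            return ret;
        if (!iopoll)
            return ret;     /* do not spin: hand -EAGAIN back to the caller */
        /* IOPOLL cannot wait for readiness, so retrying is the only option */
    }
}

/* Test handler: fails with -EAGAIN a fixed number of times, then succeeds. */
static int flaky(void *data)
{
    int *eagain_left = data;

    if (*eagain_left > 0) {
        (*eagain_left)--;
        return -EAGAIN;
    }
    return 0;
}
```

In non-IOPOLL mode the handler is invoked once and the -EAGAIN is passed up; in IOPOLL mode the loop keeps retrying until the handler returns something else.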

io_uring: add flag for allocating a fully sparse direct descriptor space · a8da73a3

Jens Axboe authored May 09, 2022

Currently, to set up a fully sparse descriptor space upfront, the app needs
to allocate an array of the full size, memset it to -1, and then pass
that in. Make this a bit easier by allowing a flag that simply does
this internally rather than needing to copy each slot separately.

This works with IORING_REGISTER_FILES2 as the flag is set in struct
io_uring_rsrc_register, and is only allowed when the type is
IORING_RSRC_FILE as this doesn't make sense for registered buffers.
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
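
For reference, the manual setup that the new flag replaces looks roughly like this on the application side. The helper name make_sparse_fds() is hypothetical, and the registration call itself is left out of the sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build the all-empty fd array an application has to pass in when it
 * wants a fully sparse table without using the flag. The caller frees it.
 */
static int *make_sparse_fds(size_t nr)
{
    int *fds = malloc(nr * sizeof(*fds));

    if (!fds)
        return NULL;
    /* Every byte becomes 0xff, so every int reads back as -1. */
    memset(fds, -1, nr * sizeof(*fds));
    return fds;
}
```

With the flag, the application can skip both the allocation and the copy of each slot into the kernel.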

io_uring: bump max direct descriptor count to 1M · 09893e15

Jens Axboe authored May 09, 2022

We currently limit these to 32K, but since we're now backing the table
space with vmalloc when needed, there's no reason why we can't make it
bigger. The total space is limited by RLIMIT_NOFILE as well.
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: allow allocated fixed files for accept · c30c3e00

Jens Axboe authored May 07, 2022

If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot,
then that's a hint to allocate a fixed file descriptor rather than have
one be passed in directly.

This can be useful for having io_uring manage the direct descriptor space,
and also allows multi-shot support to work with fixed files.

Normal accept direct requests will complete with 0 for success, and < 0
in case of error. If io_uring is asked to allocate the direct descriptor,
then the direct descriptor is returned in case of success.
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: allow allocated fixed files for openat/openat2 · 1339f24b

Jens Axboe authored May 07, 2022

If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot,
then that's a hint to allocate a fixed file descriptor rather than have
one be passed in directly.

This can be useful for having io_uring manage the direct descriptor space.

Normal open direct requests will complete with 0 for success, and < 0
in case of error. If io_uring is asked to allocate the direct descriptor,
then the direct descriptor is returned in case of success.
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add basic fixed file allocator · b70b8e33

Jens Axboe authored May 07, 2022

Applications currently always pick where they want fixed files to go.
In preparation for allowing these types of commands with multishot
support, add a basic allocator in the fixed file table.
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
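
A basic slot allocator on top of a bitmap can be sketched as below. This is a userspace model under assumptions, not the kernel implementation: struct file_table, the slot_*() helpers, the table size and the -ENFILE return for a full table are all illustrative choices.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

#define NR_SLOTS 128u
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical fixed file table: one bit per slot plus a search hint. */
struct file_table {
    unsigned long bitmap[(NR_SLOTS + BITS_PER_WORD - 1) / BITS_PER_WORD];
    unsigned int alloc_hint;
};

static bool slot_used(const struct file_table *t, unsigned int slot)
{
    return (t->bitmap[slot / BITS_PER_WORD] >> (slot % BITS_PER_WORD)) & 1UL;
}

static void slot_set(struct file_table *t, unsigned int slot)
{
    t->bitmap[slot / BITS_PER_WORD] |= 1UL << (slot % BITS_PER_WORD);
}

static void slot_clear(struct file_table *t, unsigned int slot)
{
    t->bitmap[slot / BITS_PER_WORD] &= ~(1UL << (slot % BITS_PER_WORD));
    t->alloc_hint = slot;   /* a freed slot is a good place to look next */
}

/* Find a free slot starting at the hint, wrapping once; -ENFILE if full. */
static int slot_alloc(struct file_table *t)
{
    for (unsigned int i = 0; i < NR_SLOTS; i++) {
        unsigned int slot = (t->alloc_hint + i) % NR_SLOTS;

        if (!slot_used(t, slot)) {
            slot_set(t, slot);
            t->alloc_hint = (slot + 1) % NR_SLOTS;
            return (int)slot;
        }
    }
    return -ENFILE;
}
```

The hint keeps the common case cheap: consecutive allocations walk forward through the table, and a freed slot is reused first.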

io_uring: track fixed files with a bitmap · d78bd8ad

Jens Axboe authored May 07, 2022

In preparation for adding a basic allocator for direct descriptors,
add helpers that set/clear whether a file slot is used.
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

09 May, 2022 11 commits

io_uring: don't clear req->kbuf when buffer selection is done · 7ccba24d

Jens Axboe authored May 01, 2022

It's not needed as the REQ_F_BUFFER_SELECTED flag tracks the state of
whether or not kbuf is valid, so just drop it.
Suggested-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: eliminate the need to track provided buffer ID separately · 1dbd023e

Jens Axboe authored May 01, 2022

We have io_kiocb->buf_index which is used for either fixed buffers, or
for provided buffers. For the latter, it's used to hold the buffer group
ID for buffer selection. Post selection, req->kbuf->bid is used to get
the buffer ID.

Store the buffer ID, when selected, in req->buf_index. If we do end up
recycling the buffer, reset it back to the buffer group ID.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
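
The double duty of buf_index can be shown with a reduced model. The struct and helper names below are hypothetical and keep only the two pieces of state the message talks about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified request: only the fields this sketch needs. */
struct request {
    uint16_t buf_index;     /* group ID before selection, buffer ID after */
    bool buffer_selected;   /* stands in for REQ_F_BUFFER_SELECTED */
};

/* A buffer was picked: remember its ID in place of the group ID. */
static void select_done(struct request *req, uint16_t bid)
{
    req->buf_index = bid;
    req->buffer_selected = true;
}

/* The buffer goes back to its group: restore the group ID. */
static void recycle(struct request *req, uint16_t bgid)
{
    req->buf_index = bgid;
    req->buffer_selected = false;
}
```

Because the flag says which meaning is current, one field is enough and no separate buffer ID needs to be tracked.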

io_uring: move provided buffer state closer to submit state · 660cbfa2

Jens Axboe authored May 01, 2022

The timeout and other items that follow are less hot, so let's move the
provided buffer state above that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: move provided and fixed buffers into the same io_kiocb area · a4f8d94c

Jens Axboe authored Apr 30, 2022

These are mutually exclusive - if you use provided buffers, then you
cannot use fixed buffers and vice versa. Move them into the same spot
in the io_kiocb, which is also advantageous for provided buffers as
they get near the submit side hot cacheline.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: abstract out provided buffer list selection · 149c69b0

Jens Axboe authored Apr 30, 2022

In preparation for providing another way to select a buffer, move the
existing logic into a helper.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: never call io_buffer_select() for a buffer re-select · b66e65f4

Jens Axboe authored Apr 30, 2022

Callers already have room to store the addr and length information.
Clean it up by having the caller just assign the previously provided
data.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: get rid of hashed provided buffer groups · 9cfc7e94

Jens Axboe authored May 01, 2022

Use a plain array for any group ID that's less than 64, and punt
anything beyond that to an xarray. 64 fits in a page even for 4KB
page sizes and with the planned additions.

This makes the expected group usage faster by avoiding a hash and lookup
to find our list, and it uses less memory upfront by not allocating any
memory for provided buffers unless it's actually being used.
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
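
The two-tier lookup can be sketched as follows. This is a userspace model, not the kernel code: a singly linked list stands in for the xarray, and struct ring_ctx, struct buffer_list and lookup_group() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NR_FAST_GROUPS 64u

struct buffer_list {
    unsigned int bgid;
    struct buffer_list *next;   /* only used on the slow path */
};

/* Hypothetical model: a linked list stands in for the kernel xarray. */
struct ring_ctx {
    struct buffer_list fast[NR_FAST_GROUPS];
    struct buffer_list *slow;
};

static struct buffer_list *lookup_group(struct ring_ctx *ctx, unsigned int bgid)
{
    if (bgid < NR_FAST_GROUPS)
        return &ctx->fast[bgid];    /* direct index, no hashing */

    for (struct buffer_list *bl = ctx->slow; bl; bl = bl->next)
        if (bl->bgid == bgid)
            return bl;
    return NULL;
}
```

Group IDs below 64 resolve with a single array index, and only the larger IDs pay for a search.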

io_uring: always use req->buf_index for the provided buffer group · 4e906702

Jens Axboe authored Apr 28, 2022

The read/write opcodes use it already, but the recv/recvmsg do not. If
we switch them over and read and validate this at init time while we're
checking if the opcode supports it anyway, then we can do it in one spot
and we don't have to pass in a separate group ID for io_buffer_select().
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: ignore ->buf_index if REQ_F_BUFFER_SELECT isn't set · bb68d504

Jens Axboe authored Apr 29, 2022

There's no point in validity checking buf_index if the request doesn't
have REQ_F_BUFFER_SELECT set, as we will never use it for that case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: kill io_rw_buffer_select() wrapper · e5b00349

Jens Axboe authored Apr 28, 2022

After the recent changes, this is a direct call to io_buffer_select()
anyway. With this change, there are no wrappers left for provided
buffer selection.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: make io_buffer_select() return the user address directly · c54d52c2

Jens Axboe authored Apr 28, 2022

There's no point in having callers provide a kbuf, as we're just returning
the address anyway.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

05 May, 2022 4 commits

io_uring: kill io_recv_buffer_select() wrapper · 9396ed85

Jens Axboe authored Apr 28, 2022

It's just a thin wrapper around io_buffer_select(), get rid of it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: use 'sr' vs 'req->sr_msg' consistently · 0a352aaa

Jens Axboe authored Apr 30, 2022

For all of send/sendmsg and recv/recvmsg we have the local 'sr' variable,
yet some cases still use req->sr_msg which sr points to. Use 'sr'
consistently.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg · 0455d4cc

Jens Axboe authored Apr 26, 2022

If IORING_RECVSEND_POLL_FIRST is set for recv/recvmsg or send/sendmsg,
then we arm poll first rather than attempt a receive or send upfront.
This can be useful if we expect there to be no data (or space) available
for the request, as we can then avoid wasting time on the initial
issue attempt.
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: check IOPOLL/ioprio support upfront · 73911426

Jens Axboe authored Apr 26, 2022

Don't punt this check to the op prep handlers, add the support to
io_op_defs and we can check them while setting up the request.

This reduces the text size by 500 bytes on aarch64, and makes this less
fragile by having the check in one spot and needing opcodes to opt in
to IOPOLL or ioprio support.
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
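
The opt-in table approach can be sketched like this. It is a reduced model, not the real io_op_defs: the opcodes, the struct op_def fields, init_req() and which opcodes opt in are all made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical opcodes for the sketch. */
enum { OP_NOP, OP_READ, OP_ACCEPT, OP_LAST };

struct op_def {
    bool iopoll;    /* opcode may be used on an IOPOLL ring */
    bool ioprio;    /* opcode accepts a non-zero ioprio */
};

/* Opcodes opt in; anything not listed defaults to "unsupported". */
static const struct op_def op_defs[OP_LAST] = {
    [OP_READ] = { .iopoll = true, .ioprio = true },
    [OP_NOP]  = { .iopoll = true },
};

/* One central check at request setup instead of one per prep handler. */
static int init_req(int opcode, bool ring_iopoll, unsigned short ioprio)
{
    if (opcode < 0 || opcode >= OP_LAST)
        return -EINVAL;
    if (ring_iopoll && !op_defs[opcode].iopoll)
        return -EINVAL;
    if (ioprio && !op_defs[opcode].ioprio)
        return -EINVAL;
    return 0;
}
```

Since the table defaults to false, a new opcode is rejected on IOPOLL rings until someone explicitly marks it as supported.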

30 Apr, 2022 7 commits

io_uring: replace smp_mb() with smp_mb__after_atomic() in io_sq_thread() · f2e030dd

Almog Khaikin authored Apr 26, 2022

The IORING_SQ_NEED_WAKEUP flag is now set using atomic_or() which
implies a full barrier on some architectures but it is not required to
do so. Use the more appropriate smp_mb__after_atomic() which avoids the
extra barrier on those architectures.
Signed-off-by: Almog Khaikin <almogkh@gmail.com>
Link: https://lore.kernel.org/r/20220426163403.112692-1-almogkh@gmail.com
Fixes: 8018823e6987 ("io_uring: serialize ctx->rings->sq_flags with atomic_or/and")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add IORING_SETUP_TASKRUN_FLAG · ef060ea9

Jens Axboe authored Apr 25, 2022

If IORING_SETUP_COOP_TASKRUN is set to use cooperative scheduling for
running task_work, then IORING_SETUP_TASKRUN_FLAG can be set so the
application can tell if task_work is pending in the kernel for this
ring. This allows use cases like io_uring_peek_cqe() to still function
appropriately, or for the task to know when it would be useful to
call io_uring_wait_cqe() to run pending events.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-7-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: use TWA_SIGNAL_NO_IPI if IORING_SETUP_COOP_TASKRUN is used · e1169f06

Jens Axboe authored Apr 25, 2022

If this is set, io_uring will never use an IPI to deliver a task_work
notification. This can be used in the common case where a single task or
thread communicates with the ring, and doesn't rely on
io_uring_peek_cqe().

This provides a noticeable win in performance, both from eliminating
the IPI itself and from avoiding interrupting the submitting
task unnecessarily.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-6-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: set task_work notify method at init time · 9f010507

Jens Axboe authored Apr 25, 2022

While doing so, switch SQPOLL to TWA_SIGNAL_NO_IPI as well, as that
just does a task wakeup and then we can remove the special wakeup we
have in task_work_add.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-5-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io-wq: use __set_notify_signal() to wake workers · 6cf5862e

Jens Axboe authored Apr 25, 2022

The only difference between set_notify_signal() and __set_notify_signal()
is that the former checks if it needs to deliver an IPI to force a
reschedule. As the io-wq workers never leave the kernel, an IPI is never
needed; they simply need a wakeup.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: serialize ctx->rings->sq_flags with atomic_or/and · 3a4b89a2

Jens Axboe authored Apr 25, 2022

Rather than require ctx->completion_lock for ensuring that we don't
clobber the flags, use the atomic bitop helpers instead. This removes
the need to grab the completion_lock, in preparation for needing to set
or clear sq_flags when we don't know the status of this lock.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
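
The lock-free flag update can be illustrated with C11 atomics in userspace. This is only an analogy for the kernel's atomic_or()/atomic_andnot() helpers: the function names, the second flag and the bit positions are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative flag bits; the names echo the text, the values are made up. */
#define SQ_NEED_WAKEUP (1u << 0)
#define SQ_OTHER_FLAG  (1u << 1)    /* hypothetical second flag */

/* Set a bit without disturbing concurrent updates to the other bits. */
static void sq_flag_set(atomic_uint *flags, unsigned int bit)
{
    atomic_fetch_or(flags, bit);
}

/* Clear a bit, again as a single atomic read-modify-write. */
static void sq_flag_clear(atomic_uint *flags, unsigned int bit)
{
    atomic_fetch_and(flags, ~bit);
}
```

Each update is one indivisible operation on the word, so two writers touching different bits cannot clobber each other and no lock is required.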

task_work: allow TWA_SIGNAL without a rescheduling IPI · e788be95

Jens Axboe authored Apr 28, 2022

Some use cases don't always need an IPI when sending a TWA_SIGNAL
notification. Add TWA_SIGNAL_NO_IPI, which is just like TWA_SIGNAL, except
it doesn't send an IPI to the target task. It merely sets
TIF_NOTIFY_SIGNAL and wakes up the task.

This can be useful in avoiding a forceful transition to the kernel if the
task is running in userspace. Depending on the task_work in question, it
may be quite fine waiting for the next reschedule or kernel enter anyway,
or the use case may even have other mechanisms for hinting to the task
that a transition may be useful. This can drive more cooperative
scheduling of task_work.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/821f42b6-7d91-8074-8212-d34998097de4@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>

25 Apr, 2022 4 commits

io_uring: fix compile warning for 32-bit builds · 69cc1b6f

Jens Axboe authored Apr 25, 2022

If IO_URING_SCM_ALL isn't set, as it would not be on 32-bit builds,
then we trigger a warning:

fs/io_uring.c: In function '__io_sqe_files_unregister':
fs/io_uring.c:8992:13: warning: unused variable 'i' [-Wunused-variable]
 8992 |         int i;
      |             ^

Move the ifdef up to include the 'i' variable declaration.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 5e45690a ("io_uring: store SCM state in io_fixed_file->file_ptr")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
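
The shape of the fix, with the preprocessor guard covering the declaration as well as the loop that uses it, looks roughly like this. FEATURE_CLEANUP_LOOP, struct table and files_unregister() are hypothetical names for the sketch.

```c
#include <assert.h>

/* Hypothetical table; released counts entries handled by the loop. */
struct table {
    int nr;
    int released;
};

static void files_unregister(struct table *t)
{
#if defined(FEATURE_CLEANUP_LOOP)
    int i;  /* declared inside the guard, so it is never left unused */

    for (i = 0; i < t->nr; i++)
        t->released++;
#endif
    t->nr = 0;
}
```

When the macro is not defined, neither the variable nor the loop is compiled, so -Wunused-variable has nothing to report.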

io_uring: return an error when cqe is dropped · 155bc950

Dylan Yudaken authored Apr 21, 2022

Right now io_uring will not actively inform userspace if a CQE is
dropped. This is extremely rare, requiring a CQ ring overflow as well as
a GFP_ATOMIC kmalloc failure. However, the consequences could, for
example, leave applications in an undefined state, possibly waiting for a
CQE that never arrives.

Return an error code (EBADR) in these cases. Since this is expected to be
incredibly rare, avoid affecting the hot code paths as much as possible:
the error is only returned lazily, when there are no other CQEs
available.

Once the error is returned, reset the error condition assuming the user is
either ok with it or will clean up appropriately.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
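
The lazy reporting described above can be modelled as a small state machine. This is a sketch of the behaviour as the message states it, not the kernel code: struct cq_state and wait_cqes() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef EBADR
#define EBADR 53    /* Linux value, for platforms that lack the definition */
#endif

/* Hypothetical model of the completion side of a ring. */
struct cq_state {
    unsigned int ready;     /* CQEs available to reap */
    bool dropped;           /* set when a CQE could not be queued at all */
};

/* Returns the number of CQEs available, or -EBADR once if one was lost. */
static int wait_cqes(struct cq_state *cq)
{
    if (cq->ready)
        return (int)cq->ready;  /* hot path: never looks at the error bit */
    if (cq->dropped) {
        cq->dropped = false;    /* report once, then reset the condition */
        return -EBADR;
    }
    return 0;
}
```

The error only surfaces when the caller would otherwise find nothing to reap, and it is cleared as soon as it has been reported.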

io_uring: use constants for cq_overflow bitfield · 10988a0a

Dylan Yudaken authored Apr 21, 2022

Prepare to use this bitfield for more flags by using constants instead of
the magic value 0.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: rework io_uring_enter to simplify return value · 3e813c90

Dylan Yudaken authored Apr 21, 2022

io_uring_enter returns the submitted count in preference to an error
code. In some code paths this check is not required, so reorganise the
code so that the check is only done as needed.
This is also preparation for returning error codes only in waiting scenarios.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
