Commits · f237c30a5610d35a584f3296d397b93d80ce374e · Kirill Smelkov / linux

23 Aug, 2021 40 commits

io_uring: batch task work locking · f237c30a

Pavel Begunkov authored Aug 18, 2021

Many task_work handlers either grab ->uring_lock, or may benefit from
having it. Move locking logic out of individual handlers to a lazy
approach controlled by tctx_task_work(), so we don't keep doing
tons of mutex lock/unlock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6a34e147f2507a2f3e2fa1e38a9c541dcad3929.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
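
A minimal user-space sketch of the lazy locking idea in this commit message, not the kernel code itself. It assumes a pthread mutex standing in for ->uring_lock; `struct ctx`, `tw_lock()`, `run_batch()` and both handlers are hypothetical names made up for illustration. Handlers that need the lock take it through a shared `locked` flag, and the batch loop drops it once at the end.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the ring context. */
struct ctx {
    pthread_mutex_t lock;   /* plays the role of ->uring_lock */
    int lock_ops;           /* how many times the mutex was acquired */
    int completed;          /* state that is only touched under the lock */
};

typedef void (*work_fn)(struct ctx *ctx, bool *locked);

/* Take the lock only if no earlier handler in this batch already did. */
static void tw_lock(struct ctx *ctx, bool *locked)
{
    if (!*locked) {
        pthread_mutex_lock(&ctx->lock);
        ctx->lock_ops++;
        *locked = true;
    }
}

/* A handler that needs the lock. */
static void handler_needs_lock(struct ctx *ctx, bool *locked)
{
    tw_lock(ctx, locked);
    ctx->completed++;
}

/* A handler that never touches locked state and so never locks. */
static void handler_no_lock(struct ctx *ctx, bool *locked)
{
    (void)ctx;
    (void)locked;
}

/* Run the whole batch; unlock at most once, after the last handler. */
static void run_batch(struct ctx *ctx, work_fn *items, size_t n)
{
    bool locked = false;

    for (size_t i = 0; i < n; i++)
        items[i](ctx, &locked);
    if (locked)
        pthread_mutex_unlock(&ctx->lock);
}
```

In this sketch, a batch with three lock-taking handlers acquires the mutex once instead of three times, and a batch with none never acquires it.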

io_uring: flush completions for fallbacks · 5636c00d

Pavel Begunkov authored Aug 18, 2021

io_fallback_req_func() doesn't expect anyone creating inline
completions, and no one currently does that. Teach the function to flush
completions in preparation for further changes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b941516921f72e1a64d58932d671736892d7fff.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add ->splice_fd_in checks · 26578cda

Pavel Begunkov authored Aug 20, 2021

->splice_fd_in is used only by splice/tee, but no other request checks
it for validity. Add the check for most request types, excluding
reads/writes/sends/recvs: we don't want the overhead for them and can
leave them as is until the field is actually used.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f44bc2acd6777d932de3d71a5692235b5b2b7397.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add clarifying comment for io_cqring_ev_posted() · 2c5d763c

Jens Axboe authored Aug 21, 2021

We've previously had an issue where overflow flush unconditionally calls
io_cqring_ev_posted() even if it didn't flush any events to the ring,
causing wake and eventfd increment where no new events are available.
Some applications don't like that, see commit b18032bb for details.

This came up in discussion for another patch recently, hence add a
comment detailing what the relationship between calling the events
posted helper and CQ ring entries is.

Link: https://lore.kernel.org/io-uring/77a44fce-c831-16a6-8e80-9aee77f496a2@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: place fixed tables under memcg limits · 0bea96f5

Pavel Begunkov authored Aug 20, 2021

Fixed tables may be quite large, so place all of them, together with
the allocated tags, under memcg limits.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b3ac9f5da9821bb59837b5fe25e8ef4be982218c.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: limit fixed table size by RLIMIT_NOFILE · 3a1b8a4e

Pavel Begunkov authored Aug 20, 2021

Limit the number of files in io_uring fixed tables by RLIMIT_NOFILE;
that's the first and the simplest restriction that we should impose.

Cc: stable@vger.kernel.org
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2756c340aed7d6c0b302c26dab50c6c5907f4ce.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
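
The restriction can be illustrated from user space with a short sketch. It is not the kernel change itself: `check_fixed_table_size()` and `check_against_current_limit()` are hypothetical helpers, and returning -EMFILE for an oversized table is an assumption of this example. Only `getrlimit()` and `RLIMIT_NOFILE` are real interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>

/*
 * Hypothetical check: reject a fixed file table that would hold more
 * entries than the given open-file limit allows. The -EMFILE error
 * code is an assumption made for this sketch.
 */
static int check_fixed_table_size(unsigned long nr_files, rlim_t limit)
{
    if (nr_files > limit)
        return -EMFILE;
    return 0;
}

/* Same check against the calling process's soft RLIMIT_NOFILE. */
static int check_against_current_limit(unsigned long nr_files)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -errno;
    return check_fixed_table_size(nr_files, rl.rlim_cur);
}
```

An application could call `check_against_current_limit()` before registering a table to predict whether the request would exceed its own limit.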

io_uring: fix lack of protection for compl_nr · 99c8bc52

Hao Xu authored Aug 21, 2021

compl_nr in ctx_flush_and_put() is not protected by uring_lock, which
may cause problems when it is accessed in parallel:

say compl_nr > 0

  ctx_flush_and_put                  other context
   if (compl_nr)                      get mutex
                                      compl_nr > 0
                                      do flush
                                          compl_nr = 0
                                      release mutex
        get mutex
           do flush (*)
        release mutex

At the (*) point we call io_cqring_ev_posted(), and users likely get
no events there. To avoid spurious events, re-check the value when
under the lock.

Fixes: 2c32395d ("io_uring: fix __tctx_task_work() ctx race")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210820221954.61815-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
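
The fix follows the usual "peek without the lock, re-check under the lock" pattern. The sketch below shows that pattern in user space, assuming a pthread mutex in place of uring_lock and a C11 atomic counter in place of compl_nr. All type and function names are hypothetical, and `ev_posted` merely counts how often the "events posted" step would run.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ctx {
    pthread_mutex_t lock;   /* stands in for uring_lock */
    atomic_int compl_nr;    /* pending inline completions */
    int ev_posted;          /* times the "events posted" step ran */
};

/* Flush pending completions. Must be called with ctx->lock held. */
static void flush_locked(struct ctx *ctx)
{
    atomic_store(&ctx->compl_nr, 0);
    ctx->ev_posted++;
}

/* Unlocked peek: only a hint, the value may change before we lock. */
static bool may_need_flush(struct ctx *ctx)
{
    return atomic_load(&ctx->compl_nr) != 0;
}

/* Re-check under the lock so an already flushed batch posts nothing. */
static bool flush_rechecked(struct ctx *ctx)
{
    bool flushed = false;

    pthread_mutex_lock(&ctx->lock);
    if (atomic_load(&ctx->compl_nr) != 0) {
        flush_locked(ctx);
        flushed = true;
    }
    pthread_mutex_unlock(&ctx->lock);
    return flushed;
}

/* Cheap unlocked test first, authoritative test under the lock. */
static void ctx_flush(struct ctx *ctx)
{
    if (may_need_flush(ctx))
        flush_rechecked(ctx);
}
```

If another context flushes between the peek and the lock acquisition, `flush_rechecked()` finds nothing to do and no spurious event is posted.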

io_uring: Add register support for non-4k PAGE_SIZE · 187f08c1

wangyangbo authored Aug 19, 2021

The allocated rsrc table currently uses PAGE_SIZE as the size of each
2nd-level table, while indexing into it relies on a fixed TABLE_SHIFT of
(12 - 3), which is only correct for 4k pages. To work correctly with
non-4k pages, define TABLE_SHIFT as (PAGE_SHIFT - shift of data), so
that it matches the number of entries in a 2nd-level table.
Signed-off-by: wangyangbo <wangyangbo@uniontech.com>
Link: https://lore.kernel.org/r/20210819055657.27327-1-wangyangbo@uniontech.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
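
The index arithmetic can be sketched as follows. This is an illustration only: it assumes 8-byte table entries (hence the shift of 3 taken from the "(12 - 3)" above) and passes the page shift as a runtime parameter, whereas the kernel uses the compile-time PAGE_SHIFT. `struct slot`, `table_shift()` and `locate()` are hypothetical names.

```c
#include <assert.h>

/* Position of an entry in a two-level table. */
struct slot {
    unsigned table;     /* which 2nd-level table */
    unsigned offset;    /* index inside that table */
};

/* Entries per 2nd-level table = page size / 8 bytes = 1 << (page_shift - 3). */
static unsigned table_shift(unsigned page_shift)
{
    return page_shift - 3;
}

/* Split a flat index into (table, offset) for the given page size. */
static struct slot locate(unsigned idx, unsigned page_shift)
{
    unsigned shift = table_shift(page_shift);
    unsigned mask = (1U << shift) - 1;
    struct slot s = { idx >> shift, idx & mask };

    return s;
}
```

With 4k pages (page shift 12) each 2nd-level table holds 512 entries, so index 1000 lands in table 1 at offset 488. With 64k pages (page shift 16) a table holds 8192 entries and the same index lands in table 0 at offset 1000, which a hard-coded shift of 9 would get wrong.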

io_uring: extend task put optimisations · e98e49b2

Pavel Begunkov authored Aug 18, 2021

Now with IRQ completions done via task_work, almost all requests freeing
are done from the context of submitter task, so it makes sense to
extend task_put optimisation from io_req_free_batch_finish() to cover
all the cases including task_work by moving it into io_put_task().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/824a7cbd745ddeee4a0f3ff85c558a24fd005872.1629302453.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add comments on why PF_EXITING checking is safe · 316319e8

Jens Axboe authored Aug 19, 2021

We have two checks of task->flags & PF_EXITING left:

1) In io_req_task_submit(), which is called in task_work and hence always
   in the context of the original task. That means that
   req->task == current, and hence checking ->flags is totally fine.

2) In io_poll_rewait(), where we need to stop re-arming poll to prevent
   it interfering with cancelation. This is only run from task_work as
   well, and hence for this case too req->task == current.

Add a comment to both spots detailing that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io-wq: move nr_running and worker_refs out of wqe->lock protection · 79dca184

Hao Xu authored Aug 10, 2021

We don't need to protect nr_running and worker_refs by wqe->lock, so
narrow the range of raw_spin_lock_irq - raw_spin_unlock_irq
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210810125554.99229-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: fix io_timeout_remove locking · ec3c3d0f

Pavel Begunkov authored Aug 18, 2021

io_timeout_cancel() posts CQEs so needs ->completion_lock to be held,
so grab it in io_timeout_remove().

Fixes: 48ecb6369f1f2 ("io_uring: run timeouts from task_work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6f03d653a4d7bf693ef6f39b6a426b6d97fd96f.1629280204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: improve same wq polling · 23a65db8

Pavel Begunkov authored Aug 17, 2021

Move the check for whether __io_queue_proc() tries to poll an already
polled waitqueue earlier, and do the same for the second poll entry, if
any. It shouldn't really matter, but at least the behaviour becomes more
predictable.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8cb428cfe8ade0fd055859fabb878db8777d4c2f.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: reuse io_req_complete_post() · 505657bc

Pavel Begunkov authored Aug 17, 2021

We have io_req_complete_post() to post a CQE and put the request. It
takes care of all synchronisation and is more concise and efficient, so
replace all hand-coded occurrences of
"lock; post CQE; unlock; + put_req()" with io_req_complete_post().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2c83463458a613f9d870e5147eb134da2aa70779.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: better encapsulate buffer select for rw · ae421d93

Pavel Begunkov authored Aug 17, 2021

Make io_put_rw_kbuf() do the REQ_F_BUFFER_SELECTED check, so the
callers don't need to hand-code it. The number of places where we call
io_put_rw_kbuf() is growing, so this saves some pain.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3df3919e5e7efe03420c44ab4d9317a81a9cf398.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: optimise io_prep_linked_timeout() · 906c6caa

Pavel Begunkov authored Aug 15, 2021

Linked timeout handling during issuing is heavy: it adds extra
instructions and forces us to save the next linked timeout before
io_issue_sqe().

Following the same reasoning as in the refcounting patches, a request
can't be freed by the time it returns from io_issue_sqe(), so now we
don't need to do io_prep_linked_timeout() in advance, and it can be
delayed to colder paths, optimising the generic path.

It should also save quite a lot on timeout spinlocking, hrtimer_start(),
hrtimer_try_to_cancel() and so on for requests that have linked timeouts
and complete inline.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/19bfc9a0d26c5c5f1e359f7650afe807ca8ef879.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: cancel not-armed linked touts separately · 0756a869

Pavel Begunkov authored Aug 15, 2021

Adjust io_disarm_next(), so it can detect if there is a linked but
not-yet-armed timeout and complete/cancel it separately. Will be used in
the following patch.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ae228cde2c0df3d92d29d5e4852ed9fa8a2a97db.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: simplify io_prep_linked_timeout · 4d13d1a4

Pavel Begunkov authored Aug 15, 2021

The link test in io_prep_linked_timeout() is pretty bulky, replace it
with a flag. It's better for normal path and linked requests, and also
will be used further for request failing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3703770bfae8bc1ff370e43ef5767940202cab42.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: kill REQ_F_LTIMEOUT_ACTIVE · b97e736a

Pavel Begunkov authored Aug 15, 2021

Instead of handling double consecutive linked timeouts through tricky
flag combinations, just check the submit_state.link during timeout_prep
and fail that case in advance.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/04150760b0dc739522264b8abd309409f7421a06.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: deduplicate cancellation code · 8cb01fac

Pavel Begunkov authored Aug 15, 2021

IORING_OP_ASYNC_CANCEL and IORING_OP_LINK_TIMEOUT have enough
overlap, so extract a helper for request cancellation and use it in both.
This also removes some of the ugliness caused by success_ret.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/900122b588e65b637e71bfec80a260726c6a54d6.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: kill not necessary resubmit switch · a8576af9

Pavel Begunkov authored Aug 15, 2021

773af691 ("io_uring: always reissue from task_work context") makes
all resubmissions happen from task_work, so we don't need that hack
with the resubmit/not-resubmit switch anymore.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/47fa177cca04e5ffd308a35227966c8e15d8525b.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: optimise initial ltimeout refcounting · fb682099

Pavel Begunkov authored Aug 15, 2021

Linked timeouts are never refcounted when it comes to the first call to
__io_prep_linked_timeout(), so save an io_ref_get() and set the desired
value directly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/177b24cc62ffbb42d915d6eb9e8876266e4c0d5a.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't inflight-track linked timeouts · 761bcac1

Pavel Begunkov authored Aug 15, 2021

Tracking linked timeouts as inflight was needed to make sure that io-wq
is not destroyed by io_uring_cancel_generic() racing with
io_async_cancel_one() accessing it. Now, cancellations issued by linked
timeouts are done in the task context, so it's already synchronised.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e1b05cf47cb69df2305efdbee8cf7ba36f46c1a3.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: optimise iowq refcounting · 48dcd38d

Pavel Begunkov authored Aug 15, 2021

If a request is forwarded into io-wq, there is a good chance it hasn't
been refcounted yet and we can save one req_ref_get() by setting the
refcount number to the right value directly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2d53f4449faaf73b4a4c5de667fc3c176d974860.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: correct __must_hold annotation · a141dd89

Jens Axboe authored Aug 12, 2021

io_req_free_batch() has a __must_hold annotation referencing a
request being passed in, but we're passing in the context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: code clean for completion_lock in io_arm_poll_handler() · 41a5169c

Hao Xu authored Aug 12, 2021

We can merge two spin_unlock() operations to one since we removed some
code not long ago.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: remove files pointer in cancellation functions · f552a27a

Hao Xu authored Aug 12, 2021

When doing cancellation, we use a parameter to indicate whether it's from
do_exit or exec. So a boolean value is good enough for this; remove the
struct files * as it is not necessary.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
[axboe: fixup io_uring_files_cancel for !CONFIG_IO_URING]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: extract io_uring_files_cancel() in io_uring_task_cancel() · a4aadd11

Hao Xu authored Aug 12, 2021

Extract io_uring_files_cancel() call in io_uring_task_cancel() to make
io_uring_files_cancel() and io_uring_task_cancel() coherent and easy to
read.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: optimise hot path of ltimeout prep · fd08e530

Pavel Begunkov authored Aug 11, 2021

io_prep_linked_timeout() grew too heavy and the compiler now refuses to
inline the function. Help it by splitting the function in two and
annotating it with inline.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/560636717a32e9513724f09b9ecaace942dde4d4.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: skip request refcounting · 20e60a38

Pavel Begunkov authored Aug 11, 2021

As submission references are gone, there is only one initial reference
left. Instead of actually doing atomic refcounting, add a flag
indicating whether we're going to take more refs or do any other sync
magic. The flag should be set before the request may get used in
parallel.

Together with the previous patch it saves 2 refcount atomics per request
for IOPOLL and IRQ completions, and 1 atomic per req for inline
completions, with some exceptions. In particular, there are currently
three cases where refcounting has to be enabled:
- Polling, including apoll, because double poll entries take a ref.
  Might get relaxed in the near future.
- Link timeouts, enabled for both the timeout and the request it's
  bound to, because they work in parallel and we need to synchronise
  to cancel one of them on completion.
- When a request gets into io-wq, because it doesn't hold uring_lock and
  we need the guarantees of submission references.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b204b6c5f6643062270a1913d6d3a7f8f795fd9.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
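
A rough user-space sketch of the flag-gated refcounting idea, using C11 atomics. The flag name, the struct layout and the helper names are illustrative assumptions and are not copied from the kernel source. A request starts with one implicit reference and no atomic traffic; refcounting is switched on only when someone needs additional references.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define REQ_F_REFCOUNT 0x1u     /* illustrative flag value */

struct req {
    unsigned flags;
    atomic_int refs;            /* meaningful only once the flag is set */
};

/*
 * Enable refcounting with nr references. Must happen before the
 * request can be seen by another context.
 */
static void req_set_refcount(struct req *r, int nr)
{
    if (!(r->flags & REQ_F_REFCOUNT)) {
        r->flags |= REQ_F_REFCOUNT;
        atomic_store(&r->refs, nr);
    }
}

/* Take an extra reference; valid only after refcounting is enabled. */
static void req_ref_get(struct req *r)
{
    atomic_fetch_add(&r->refs, 1);
}

/* Drop a reference; returns true when the caller may free the request. */
static bool req_ref_put_and_test(struct req *r)
{
    if (!(r->flags & REQ_F_REFCOUNT))
        return true;            /* single implicit ref, no atomics */
    return atomic_fetch_sub(&r->refs, 1) == 1;
}
```

A request that never enables the flag is released on its first put without touching the atomic counter, which is where the saved atomics in the common path come from.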

io_uring: remove submission references · 5d5901a3

Pavel Begunkov authored Aug 11, 2021

Requests are by default given two references, submission and
completion. Completion references are straightforward: they represent
request ownership and are put when a request is completed or so.
Submission references are a bit trickier. They're needed because, once
io_issue_sqe() has gone deep into the submission stack (e.g. into fs,
block, drivers, etc.), the request may have been given away for
concurrent execution or may already have completed, while the code
unwinding back to io_issue_sqe() may still be accessing some pieces of
the request, e.g. file or iov.

Now, we prevent such async/in-depth completions by pushing requests
through task_work. Punting to io-wq is also done through task_works,
apart from a couple of cases with a pretty well known context. So,
there are two cases:
1) io_issue_sqe() from the task context and protected by ->uring_lock.
Requests either return back to io_uring or are handed to task_work, which
won't be executed because we're currently controlling that task. So,
we can be sure that requests stay alive all the time and we don't
need submission references to pin them.

2) io_issue_sqe() from io-wq, which doesn't hold the mutex. The role of
the submission reference is played by the io-wq reference, which is put
by io_wq_submit_work(). Hence, it should be fine.

Considering that, we can carefully kill the submission reference.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b68f1c763229a590f2a27148aee77767a8d7750.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: remove req_ref_sub_and_test() · 91c2f697

Pavel Begunkov authored Aug 11, 2021

Soon, we won't need to put several references at once, remove
req_ref_sub_and_test() and @nr argument from io_put_req_deferred(),
and put the rest of the references by hand.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1868c7554108bff9194fb5757e77be23fadf7fc0.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: move req_ref_get() and friends · 21c843d5

Pavel Begunkov authored Aug 11, 2021

Move all request refcount helpers to avoid forward declarations in the
future.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/89fd36f6f3fe5b733dfe4546c24725eee40df605.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: remove IRQ aspect of io_ring_ctx completion lock · 79ebeaee

Jens Axboe authored Aug 10, 2021

We have no hard/soft IRQ users of this lock left, remove any IRQ
disabling/saving and restoring when grabbing this lock.

This is straightforward with no users entering with IRQs disabled
anymore, the only thing to look out for is the waitqueue poll head
lock which nests inside the completion lock. That needs IRQs disabled,
and hence we have to do that now instead of relying on the outer lock
doing so.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: run regular file completions from task_work · 8ef12efe

Jens Axboe authored Aug 10, 2021

This is in preparation for making the completion lock work outside of
hard/soft IRQ context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: run linked timeouts from task_work · 89b263f6

Jens Axboe authored Aug 10, 2021

This is in preparation for making the completion lock work outside of
hard/soft IRQ context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: run timeouts from task_work · 89850fce

Jens Axboe authored Aug 10, 2021

This is in preparation for making the completion lock work outside of
hard/soft IRQ context.

Add a timeout_lock to handle the ordering of timeout completions or
cancelations with the timeouts actually triggering.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: remove file batch-get optimisation · 62906e89

Pavel Begunkov authored Aug 10, 2021

For requests with non-fixed files, instead of grabbing just one
reference, we grab as many as there are requests left, so the following
requests using the same file can take it without atomics.

However, it's not all win. If there is one request in the middle
not using files or having a fixed file, we'll need to put back the
remaining references. Even worse, if an application submits requests
dealing with different files, it will do a put for each new request,
doubling the number of atomics needed. Also, even if not used, it still
takes some cycles in the submission path.

If a file is used many times, it makes more sense to pre-register it; if
not, we may fall into the described pitfall. So, this optimisation is a
matter of use case. Go with the simplest way code-wise and remove it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: clean up tctx_task_work() · 6294f368

Pavel Begunkov authored Aug 10, 2021

After recent fixes, tctx_task_work() always does proper spinlocking
before looking into ->task_list, so now we don't need atomics for
->task_state, replace it with non-atomic task_running using the critical
section.

Tidy it up, combine the two separate blocks with spinlocking, and always try
to splice in there, so we do less locking when new requests are arriving
during the function execution.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: fix missing ->task_running reset on task_work_add() failure]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: inline io_poll_remove_waitqs · 5d709043

Pavel Begunkov authored Aug 09, 2021

Inline io_poll_remove_waitqs() into its only user and clean it up.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f1a91a19ffcd591531dc4c61e2f11c64a2d6a6d.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
