Commits · 1e16c2f917a59d27fb6b540c44d66978c8ad29ef · nexedi / linux

27 Jun, 2020 1 commit

io_uring: fix function args for !CONFIG_NET · 1e16c2f9

Randy Dunlap authored Jun 26, 2020

Fix build errors when CONFIG_NET is not set/enabled:

../fs/io_uring.c:5472:10: error: too many arguments to function ‘io_sendmsg’
../fs/io_uring.c:5474:10: error: too many arguments to function ‘io_send’
../fs/io_uring.c:5484:10: error: too many arguments to function ‘io_recvmsg’
../fs/io_uring.c:5486:10: error: too many arguments to function ‘io_recv’
../fs/io_uring.c:5510:9: error: too many arguments to function ‘io_accept’
../fs/io_uring.c:5518:9: error: too many arguments to function ‘io_connect’
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1e16c2f9

26 Jun, 2020 4 commits

Merge branch 'io_uring-5.8' into for-5.9/io_uring · 2237d765

Jens Axboe authored Jun 26, 2020

Merge in changes that went into 5.8-rc3. GIT will silently do the
merge, but we still need a tweak on top of that since
io_complete_rw_common() was modified to take a io_comp_state pointer.
The auto-merge fails on that, and we end up with something that
doesn't compile.

* io_uring-5.8:
  io_uring: fix current->mm NULL dereference on exit
  io_uring: fix hanging iopoll in case of -EAGAIN
  io_uring: fix io_sq_thread no schedule when busy
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2237d765

io-wq: return next work from ->do_work() directly · f4db7182

Pavel Begunkov authored Jun 25, 2020

It's easier to return next work from ->do_work() than
having an in-out argument. Looks nicer and easier to compile.
Also, merge io_wq_assign_next() into its only user.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f4db7182

io-wq: compact io-wq flags numbers · e883a79d

Pavel Begunkov authored Jun 25, 2020

Renumerate IO_WQ flags, so they take adjacent bits
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e883a79d

io_uring: use task_work for links if possible · c40f6379

Jens Axboe authored Jun 25, 2020

Currently links are always done in an async fashion, unless we catch them
inline after we successfully complete a request without having to resort
to blocking. This isn't necessarily the most efficient approach, it'd be
more ideal if we could just use the task_work handling for this.

Outside of saving an async jump, we can also do less prep work for these
kinds of requests.

Running dependent links from the task_work handler yields some nice
performance benefits. As an example, examples/link-cp from the liburing
repository uses read+write links to implement a copy operation. Without
this patch, the a cache fold 4G file read from a VM runs in about 3
seconds:

$ time examples/link-cp /data/file /dev/null

real	0m2.986s
user	0m0.051s
sys	0m2.843s

and a subsequent cache hot run looks like this:

$ time examples/link-cp /data/file /dev/null

real	0m0.898s
user	0m0.069s
sys	0m0.797s

With this patch in place, the cold case takes about 2.4 seconds:

$ time examples/link-cp /data/file /dev/null

real	0m2.400s
user	0m0.020s
sys	0m2.366s

and the cache hot case looks like this:

$ time examples/link-cp /data/file /dev/null

real	0m0.676s
user	0m0.010s
sys	0m0.665s

As expected, the (mostly) cache hot case yields the biggest improvement,
running about 25% faster with this change, while the cache cold case
yields about a 20% increase in performance. Outside of the performance
increase, we're using less CPU as well, as we're not using the async
offload threads at all for this anymore.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c40f6379

25 Jun, 2020 8 commits

io_uring: enable READ/WRITE to use deferred completions · a1d7c393

Jens Axboe authored Jun 22, 2020

A bit more surgery required here, as completions are generally done
through the kiocb->ki_complete() callback, even if they complete inline.
This enables the regular read/write path to use the io_comp_state
logic to batch inline completions.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a1d7c393

io_uring: pass in completion state to appropriate issue side handlers · 229a7b63

Jens Axboe authored Jun 22, 2020

Provide the completion state to the handlers that we know can complete
inline, so they can utilize this for batching completions.

Cap the max batch count at 32. This should be enough to provide a good
amortization of the cost of the lock+commit dance for completions, while
still being low enough not to cause any real latency issues for SQPOLL
applications.

Xuan Zhuo <xuanzhuo@linux.alibaba.com> reports that this changes his
profile from:

17.97% [kernel] [k] copy_user_generic_unrolled
13.92% [kernel] [k] io_commit_cqring
11.04% [kernel] [k] __io_cqring_fill_event
10.33% [kernel] [k] udp_recvmsg
 5.94% [kernel] [k] skb_release_data
 4.31% [kernel] [k] udp_rmem_release
 2.68% [kernel] [k] __check_object_size
 2.24% [kernel] [k] __slab_free
 2.22% [kernel] [k] _raw_spin_lock_bh
 2.21% [kernel] [k] kmem_cache_free
 2.13% [kernel] [k] free_pcppages_bulk
 1.83% [kernel] [k] io_submit_sqes
 1.38% [kernel] [k] page_frag_free
 1.31% [kernel] [k] inet_recvmsg

to

19.99% [kernel] [k] copy_user_generic_unrolled
11.63% [kernel] [k] skb_release_data
 9.36% [kernel] [k] udp_rmem_release
 8.64% [kernel] [k] udp_recvmsg
 6.21% [kernel] [k] __slab_free
 4.39% [kernel] [k] __check_object_size
 3.64% [kernel] [k] free_pcppages_bulk
 2.41% [kernel] [k] kmem_cache_free
 2.00% [kernel] [k] io_submit_sqes
 1.95% [kernel] [k] page_frag_free
 1.54% [kernel] [k] io_put_req
[...]
 0.07% [kernel] [k] io_commit_cqring
 0.44% [kernel] [k] __io_cqring_fill_event
Signed-off-by: Jens Axboe <axboe@kernel.dk>

229a7b63

io_uring: pass down completion state on the issue side · f13fad7b

Jens Axboe authored Jun 22, 2020

No functional changes in this patch, just in preparation for having the
completion state be available on the issue side. Later on, this will
allow requests that complete inline to be completed in batches.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f13fad7b

io_uring: add 'io_comp_state' to struct io_submit_state · 013538bd

Jens Axboe authored Jun 22, 2020

No functional changes in this patch, just in preparation for passing back
pending completions to the caller and completing them in a batched
fashion.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

013538bd

io_uring: provide generic io_req_complete() helper · e1e16097

Jens Axboe authored Jun 22, 2020

We have lots of callers of:

io_cqring_add_event(req, result);
io_put_req(req);

Provide a helper that does this for us. It helps clean up the code, and
also provides a more convenient location for us to change the completion
handling.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e1e16097

io_uring: fix NULL-mm for linked reqs · d3cac64c

Pavel Begunkov authored Jun 25, 2020

__io_queue_sqe() tries to handle all request of a link,
so it's not enough to grab mm in io_sq_thread_acquire_mm()
based just on the head.

Don't check req->needs_mm and do it always.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

d3cac64c

io_uring: fix current->mm NULL dereference on exit · d60b5fbc

Pavel Begunkov authored Jun 25, 2020

Don't reissue requests from io_iopoll_reap_events(), the task may not
have mm, which ends up with NULL. It's better to kill everything off on
exit anyway.

[  677.734670] RIP: 0010:io_iopoll_complete+0x27e/0x630
...
[  677.734679] Call Trace:
[  677.734695]  ? __send_signal+0x1f2/0x420
[  677.734698]  ? _raw_spin_unlock_irqrestore+0x24/0x40
[  677.734699]  ? send_signal+0xf5/0x140
[  677.734700]  io_iopoll_getevents+0x12f/0x1a0
[  677.734702]  io_iopoll_reap_events.part.0+0x5e/0xa0
[  677.734703]  io_ring_ctx_wait_and_kill+0x132/0x1c0
[  677.734704]  io_uring_release+0x20/0x30
[  677.734706]  __fput+0xcd/0x230
[  677.734707]  ____fput+0xe/0x10
[  677.734709]  task_work_run+0x67/0xa0
[  677.734710]  do_exit+0x35d/0xb70
[  677.734712]  do_group_exit+0x43/0xa0
[  677.734713]  get_signal+0x140/0x900
[  677.734715]  do_signal+0x37/0x780
[  677.734717]  ? enqueue_hrtimer+0x41/0xb0
[  677.734718]  ? recalibrate_cpu_khz+0x10/0x10
[  677.734720]  ? ktime_get+0x3e/0xa0
[  677.734721]  ? lapic_next_deadline+0x26/0x30
[  677.734723]  ? tick_program_event+0x4d/0x90
[  677.734724]  ? __hrtimer_get_next_event+0x4d/0x80
[  677.734726]  __prepare_exit_to_usermode+0x126/0x1c0
[  677.734741]  prepare_exit_to_usermode+0x9/0x40
[  677.734742]  idtentry_exit_cond_rcu+0x4c/0x60
[  677.734743]  sysvec_reschedule_ipi+0x92/0x160
[  677.734744]  ? asm_sysvec_reschedule_ipi+0xa/0x20
[  677.734745]  asm_sysvec_reschedule_ipi+0x12/0x20
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d60b5fbc

io_uring: fix hanging iopoll in case of -EAGAIN · cd664b0e

Pavel Begunkov authored Jun 25, 2020

io_do_iopoll() won't do anything with a request unless
req->iopoll_completed is set. So io_complete_rw_iopoll() has to set
it, otherwise io_do_iopoll() will poll a file again and again even
though the request of interest was completed long time ago.

Also, remove -EAGAIN check from io_issue_sqe() as it races with
the changed lines. The request will take the long way and be
resubmitted from io_iopoll*().

io_kiocb's result and iopoll_completed")

Fixes: bbde017a ("io_uring: add memory barrier to synchronize
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cd664b0e

23 Jun, 2020 1 commit

io_uring: fix io_sq_thread no schedule when busy · b772f07a

Xuan Zhuo authored Jun 23, 2020

When the user consumes and generates sqe at a fast rate,
io_sqring_entries can always get sqe, and ret will not be equal to -EBUSY,
so that io_sq_thread will never call cond_resched or schedule, and then
we will get the following system error prompt:

rcu: INFO: rcu_sched self-detected stall on CPU
or
watchdog: BUG: soft lockup-CPU#23 stuck for 112s! [io_uring-sq:1863]

This patch checks whether need to call cond_resched() by checking
the need_resched() function every cycle.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b772f07a

22 Jun, 2020 25 commits

io_uring: kill NULL checks for submit state · f6b6c7d6

Pavel Begunkov authored Jun 21, 2020

After recent changes, io_submit_sqes() always passes valid submit state,
so kill leftovers checking it for NULL.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f6b6c7d6

io_uring: set @poll->file after @poll init · b90cd197

Pavel Begunkov authored Jun 21, 2020

It's a good practice to modify fields of a struct after but not before
it was initialised. Even though io_init_poll_iocb() doesn't touch
poll->file, call it first.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b90cd197

io_uring: remove REQ_F_MUST_PUNT · 24c74678

Pavel Begunkov authored Jun 21, 2020

REQ_F_MUST_PUNT may seem looking good and clear, but it's the same
as not having REQ_F_NOWAIT set. That rather creates more confusion.
Moreover, it doesn't even affect any behaviour (e.g. see the patch
removing it from io_{read,write}).

Kill theg flag and update already outdated comments.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

24c74678

io_uring: remove setting REQ_F_MUST_PUNT in rw · 62ef7316

Pavel Begunkov authored Jun 21, 2020

io_{read,write}() {
	...
copy_iov: // prep async
  	if (!(flags & REQ_F_NOWAIT) && !file_can_poll(file))
		flags |= REQ_F_MUST_PUNT;
}

REQ_F_MUST_PUNT there is pointless, because if it happens then
REQ_F_NOWAIT is known to be _not_ set, and the request will go
async path in __io_queue_sqe() anyway. file_can_poll() check
is also repeated in arm_poll*(), so don't need it.

Remove the mentioned assignment REQ_F_MUST_PUNT in preparation
for killing the flag.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

62ef7316

Merge branch 'async-buffered.8' into for-5.9/io_uring · 895aa7b1

Jens Axboe authored Jun 21, 2020

Pull in async buffered reads branch.

* async-buffered.8:
  io_uring: support true async buffered reads, if file provides it
  mm: add kiocb_wait_page_queue_init() helper
  btrfs: flag files as supporting buffered async reads
  xfs: flag files as supporting buffered async reads
  block: flag block devices as supporting IOCB_WAITQ
  fs: add FMODE_BUF_RASYNC
  mm: support async buffered reads in generic_file_buffered_read()
  mm: add support for async page locking
  mm: abstract out wake_page_match() from wake_page_function()
  mm: allow read-ahead with IOCB_NOWAIT set
  io_uring: re-issue block requests that failed because of resources
  io_uring: catch -EIO from buffered issue request failure
  io_uring: always plug for any number of IOs
  block: provide plug based way of signaling forced no-wait semantics

895aa7b1

io_uring: support true async buffered reads, if file provides it · bcf5a063

Jens Axboe authored May 22, 2020

If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
the buffered read to an io-wq worker. Instead we can rely on page
unlocking callbacks to support retry based async IO. This is a lot more
efficient than doing async thread offload.

The retry is done similarly to how we handle poll based retry. From
the unlock callback, we simply queue the retry to a task_work based
handler.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcf5a063

mm: add kiocb_wait_page_queue_init() helper · d1932dc3

Jens Axboe authored May 22, 2020

Checks if the file supports it, and initializes the values that we need.
Caller passes in 'data' pointer, if any, and the callback function to
be used.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d1932dc3

btrfs: flag files as supporting buffered async reads · 8730f12b

Jens Axboe authored May 22, 2020

btrfs uses generic_file_read_iter(), which already supports this.
Acked-by: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8730f12b

xfs: flag files as supporting buffered async reads · f89fb730

Jens Axboe authored May 22, 2020

XFS uses generic_file_read_iter(), which already supports this.
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f89fb730

block: flag block devices as supporting IOCB_WAITQ · a304f074
Jens Axboe authored May 22, 2020
```
Signed-off-by: Jens Axboe <axboe@kernel.dk>
```
a304f074

fs: add FMODE_BUF_RASYNC · c2a25ec0

Jens Axboe authored May 22, 2020

If set, this indicates that the file system supports IOCB_WAITQ for
buffered reads.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c2a25ec0

mm: support async buffered reads in generic_file_buffered_read() · 1a0a7853

Jens Axboe authored May 22, 2020

Use the async page locking infrastructure, if IOCB_WAITQ is set in the
passed in iocb. The caller must expect an -EIOCBQUEUED return value,
which means that IO is started but not done yet. This is similar to how
O_DIRECT signals the same operation. Once the callback is received by
the caller for IO completion, the caller must retry the operation.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1a0a7853

mm: add support for async page locking · dd3e6d50

Jens Axboe authored May 22, 2020

Normally waiting for a page to become unlocked, or locking the page,
requires waiting for IO to complete. Add support for lock_page_async()
and wait_on_page_locked_async(), which are callback based instead. This
allows a caller to get notified when a page becomes unlocked, rather
than wait for it.

We add a new iocb field, ki_waitq, to pass in the necessary data for this
to happen. We can unionize this with ki_cookie, since that is only used
for polled IO. Polled IO can never co-exist with async callbacks, as it is
(by definition) polled completions. struct wait_page_key is made public,
and we define struct wait_page_async as the interface between the caller
and the core.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dd3e6d50

mm: abstract out wake_page_match() from wake_page_function() · c7510ab2

Jens Axboe authored May 23, 2020

No functional changes in this patch, just in preparation for allowing
more callers.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c7510ab2

mm: allow read-ahead with IOCB_NOWAIT set · 2e85abf0

Jens Axboe authored May 22, 2020

The read-ahead shouldn't block, so allow it to be done even if
IOCB_NOWAIT is set in the kiocb.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2e85abf0

io_uring: re-issue block requests that failed because of resources · b63534c4

Jens Axboe authored Jun 04, 2020

Mark the plug with nowait == true, which will cause requests to avoid
blocking on request allocation. If they do, we catch them and reissue
them from a task_work based handler.

Normally we can catch -EAGAIN directly, but the hard case is for split
requests. As an example, the application issues a 512KB request. The
block core will split this into 128KB if that's the max size for the
device. The first request issues just fine, but we run into -EAGAIN for
some latter splits for the same request. As the bio is split, we don't
get to see the -EAGAIN until one of the actual reads complete, and hence
we cannot handle it inline as part of submission.

This does potentially cause re-reads of parts of the range, as the whole
request is reissued. There's currently no better way to handle this.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b63534c4

io_uring: catch -EIO from buffered issue request failure · 4503b767

Jens Axboe authored Jun 01, 2020

-EIO bubbles up like -EAGAIN if we fail to allocate a request at the
lower level. Play it safe and treat it like -EAGAIN in terms of sync
retry, to avoid passing back an errant -EIO.

Catch some of these early for block based file, as non-mq devices
generally do not support NOWAIT. That saves us some overhead by
not first trying, then retrying from async context. We can go straight
to async punt instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4503b767

io_uring: always plug for any number of IOs · ac8691c4

Jens Axboe authored Jun 01, 2020

Currently we only plug if we're doing more than two request. We're going
to be relying on always having the plug there to pass down information,
so plug unconditionally.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ac8691c4

block: provide plug based way of signaling forced no-wait semantics · 5a473e83

Jens Axboe authored Jun 04, 2020

Provide a way for the caller to specify that IO should be marked
with REQ_NOWAIT to avoid blocking on allocation.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5a473e83

io_uring: separate reporting of ring pages from registered pages · 2e0464d4

Bijan Mottahedeh authored Jun 16, 2020

Ring pages are not pinned so it is more appropriate to report them
as locked.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2e0464d4

io_uring: report pinned memory usage · 30975825

Bijan Mottahedeh authored Jun 16, 2020

Report pinned memory usage always, regardless of whether locked memory
limit is enforced.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

30975825

io_uring: rename ctx->account_mem field · aad5d8da

Bijan Mottahedeh authored Jun 16, 2020

Rename account_mem to limit_name to clarify its purpose.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

aad5d8da

io_uring: add wrappers for memory accounting · a087e2b5

Bijan Mottahedeh authored Jun 16, 2020

Facilitate separation of locked memory usage reporting vs. limiting for
upcoming patches.  No functional changes.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[axboe: kill unnecessary () around return in io_account_mem()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a087e2b5

io_uring: use EPOLLEXCLUSIVE flag to aoid thundering herd type behavior · a31eb4a2

Jiufei Xue authored Jun 17, 2020

Applications can pass this flag in to avoid accept thundering herd.
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a31eb4a2

io_uring: change the poll type to be 32-bits · 5769a351

Jiufei Xue authored Jun 17, 2020

poll events should be 32-bits to cover EPOLLEXCLUSIVE.

Explicit word-swap the poll32_events for big endian to make sure the ABI
is not changed.  We call this feature IORING_FEAT_POLL_32BITS,
applications who want to use EPOLLEXCLUSIVE should check the feature bit
first.
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5769a351

21 Jun, 2020 1 commit
- Linux 5.8-rc2 · 48778464
  Linus Torvalds authored Jun 21, 2020
  
  48778464