- 29 Jan, 2023 (40 commits)
-
Dylan Yudaken authored
REQ_F_FORCE_ASYNC was being ignored when re-queueing linked requests. Obey that flag instead.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20230127135227.3646353-2-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
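As an illustration, a hedged liburing sketch of the case the fix covers: a link whose second request carries IOSQE_ASYNC (which sets REQ_F_FORCE_ASYNC in the kernel) and which should stay forced-async when the link is re-queued. fd and buf are placeholders.

    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 512, 0);
    sqe->flags |= IOSQE_IO_LINK;   /* chain to the next SQE */

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, 512, 0);
    sqe->flags |= IOSQE_ASYNC;     /* force async punt, even on re-queue */

    io_uring_submit(&ring);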
-
Jens Axboe authored
If CONFIG_PREEMPT_NONE is set and the task_work chains are long, we could be running into issues blocking others for too long. Add a reschedule check in handle_tw_list(), and flush the ctx if we need to reschedule.
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
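A hedged sketch of the pattern; the loop shape follows handle_tw_list() but the body is illustrative rather than the exact upstream diff:

    /* Run queued task_work items, yielding when the scheduler asks so
     * that long chains cannot stall other tasks on PREEMPT_NONE. */
    while (node) {
            struct llist_node *next = node->next;
            struct io_kiocb *req = container_of(node, struct io_kiocb,
                                                io_task_work.node);

            req->io_task_work.func(req, &locked);
            node = next;

            if (need_resched()) {
                    /* flush ctx state (completions, locks) before yielding */
                    cond_resched();
            }
    }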
-
Jens Axboe authored
If the kernel is configured with CONFIG_PREEMPT_NONE, we could be sitting in a tight loop reaping events but not giving them a chance to finish. This results in a trace ala:

rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 2-...!: (5249 ticks this GP) idle=935c/1/0x4000000000000000 softirq=4265/4274 fqs=1
(t=5251 jiffies g=465 q=4135 ncpus=4)
rcu: rcu_sched kthread starved for 5249 jiffies! g465 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_sched state:R running task stack:0 pid:12 ppid:2 flags:0x00000008
Call trace:
 __switch_to+0xb0/0xc8
 __schedule+0x43c/0x520
 schedule+0x4c/0x98
 schedule_timeout+0xbc/0xdc
 rcu_gp_fqs_loop+0x308/0x344
 rcu_gp_kthread+0xd8/0xf0
 kthread+0xb8/0xc8
 ret_from_fork+0x10/0x20
rcu: Stack dump where RCU GP kthread last ran:
Task dump for CPU 0:
task:kworker/u8:10 state:R running task stack:0 pid:89 ppid:2 flags:0x0000000a
Workqueue: events_unbound io_ring_exit_work
Call trace:
 __switch_to+0xb0/0xc8
 0xffff0000c8fefd28
CPU: 2 PID: 95 Comm: kworker/u8:13 Not tainted 6.2.0-rc5-00042-g40316e337c80-dirty #2759
Hardware name: linux,dummy-virt (DT)
Workqueue: events_unbound io_ring_exit_work
pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
pc : io_do_iopoll+0x344/0x360
lr : io_do_iopoll+0xb8/0x360
sp : ffff800009bebc60
x29: ffff800009bebc60 x28: 0000000000000000 x27: 0000000000000000
x26: ffff0000c0f67d48 x25: ffff0000c0f67840 x24: ffff800008950024
x23: 0000000000000001 x22: 0000000000000000 x21: ffff0000c27d3200
x20: ffff0000c0f67840 x19: ffff0000c0f67800 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000001 x13: 0000000000000001 x12: 0000000000000000
x11: 0000000000000179 x10: 0000000000000870 x9 : ffff800009bebd60
x8 : ffff0000c27d3ad0 x7 : fefefefefefefeff x6 : 0000646e756f626e
x5 : ffff0000c0f67840 x4 : 0000000000000000 x3 : ffff0000c2398000
x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
 io_do_iopoll+0x344/0x360
 io_uring_try_cancel_requests+0x21c/0x334
 io_ring_exit_work+0x90/0x40c
 process_one_work+0x1a4/0x254
 worker_thread+0x1ec/0x258
 kthread+0xb8/0xc8
 ret_from_fork+0x10/0x20

Add a cond_resched() in the cancelation IOPOLL loop to fix this.
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
io_submit_flush_completions() may produce new task_work items, so it's a good idea to recheck the task_work list after flushing completions. The optimisation is not new and was accidentally removed by f88262e6 ("io_uring: lockless task list").
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a7ed5ede84de190832cc33ebbcdd6e91cd90f5b6.1674484266.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Merge almost identical sections of tctx_task_work(); this will make later code modifications easier and also inlines handle_tw_list().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d06592d91e3e7559e7a4dbb8907d110863008dc7.1674484266.git.asml.silence@gmail.com
[axboe: fold in setting count to zero patch from Tom Rix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Add a helper for putting refs from the target task context, rename __io_put_task() and add a couple of comments around it. Use the remote version for __io_req_complete_post(); the local one is only needed for __io_submit_flush_completions().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3bf92ebd594769d8a5d648472a8e335f2031d542.1674484266.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Follow the io_get_sqe pattern of returning the result via a pointer, and hide request cache refill inside io_alloc_req().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8c37c2e8a3cb5e4cd6a8ae3b91371227a92708a6.1674484266.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Return the SQE from io_get_sqe() via an out parameter and use the function's return value to indicate whether it failed. This enables the compiler to compile out the sqe NULL check when we know the returned SQE is valid.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9cceb11329240ea097dffef6bf0a675bca14cf42.1674484266.git.asml.silence@gmail.com
[axboe: remove bogus const modifier on return value]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
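The shape of the resulting API, as a hedged sketch; sqe_available() and next_sqe() are hypothetical helpers standing in for the real ring arithmetic:

    static bool io_get_sqe(struct io_ring_ctx *ctx, const struct io_uring_sqe **sqe)
    {
            if (unlikely(!sqe_available(ctx)))  /* hypothetical helper */
                    return false;
            *sqe = next_sqe(ctx);               /* hypothetical helper */
            return true;
    }

    /* Caller: after a successful return the compiler knows *sqe is valid
     * and can drop the NULL check on it. */
    const struct io_uring_sqe *sqe;
    if (unlikely(!io_get_sqe(ctx, &sqe)))
            break;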
-
Pavel Begunkov authored
__io_cqring_overflow_flush() doesn't return anything anymore; remove the outdated comment.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4ce2bcbb17eac80cdf883fd1459d5ee6586e238c.1674484266.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
We return POLLIN from io_uring_poll() depending on whether there are CQEs for userspace, so we should use the user-visible tail pointer instead of a transient cached value.
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/228ffcbf30ba98856f66ffdb9a6a60ead1dd96c0.1674484266.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
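A hedged sketch of the distinction; the shared-ring field names follow the io_uring uAPI layout, while cqes_pending() is a hypothetical helper:

    /* The user-visible tail lives in the shared ring pages and is what
     * userspace actually observes; ctx->cached_cq_tail is a transient
     * kernel-side value that may not match it while completions are
     * being batched. */
    static bool cqes_pending(struct io_ring_ctx *ctx)
    {
            return READ_ONCE(ctx->rings->cq.tail) != READ_ONCE(ctx->rings->cq.head);
    }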
-
Jens Axboe authored
This generates better code for me, avoiding an extra load on arm64, and both call sites already have this variable available for easy passing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Breno Leitao authored
Every io_uring request is represented by struct io_kiocb, which is cached locally by io_uring (not SLAB/SLUB) in the list called submit_state.freelist. This patch simply enables KASAN for this free list. The list is initially created by KMEM_CACHE but later managed by io_uring. The patch poisons objects that are not in use (i.e., sitting on the free list) and unpoisons an object when it is allocated/removed from the list. Touching a poisoned object while it is on the freelist will cause a KASAN warning.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
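A hedged sketch of the poison/unpoison pairing. kasan_poison_object_data() and kasan_unpoison_object_data() are the kernel's KASAN helpers; req_cachep is the KMEM_CACHE the requests come from, and freelist_push()/freelist_pop() are hypothetical stand-ins for the real freelist ops:

    #include <linux/kasan.h>

    /* Freeing into the io_uring-managed cache: poison the object so any
     * stray access while it sits on the free list trips KASAN. */
    static void req_cache_free(struct io_ring_ctx *ctx, struct io_kiocb *req)
    {
            kasan_poison_object_data(req_cachep, req);
            freelist_push(&ctx->submit_state.freelist, req);   /* hypothetical */
    }

    /* Allocating from the cache: unpoison before handing the object out. */
    static struct io_kiocb *req_cache_alloc(struct io_ring_ctx *ctx)
    {
            struct io_kiocb *req = freelist_pop(&ctx->submit_state.freelist); /* hypothetical */

            kasan_unpoison_object_data(req_cachep, req);
            return req;
    }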
-
Jens Axboe authored
If TIF_NOTIFY_RESUME is set, then we need to call resume_user_mode_work() for PF_IO_WORKER threads. They never return to usermode, hence never get a chance to process any items that are marked by this flag. Most notably this includes the final put of files, but also any throttling markers set by block cgroups.
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
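A hedged sketch of the check an io_run_task_work()-style helper can make; resume_user_mode_work(NULL) is the standard kernel helper for processing these items:

    #include <linux/resume_user_mode.h>

    /* PF_IO_WORKER threads never return to userspace, so handle
     * TIF_NOTIFY_RESUME work (file puts, cgroup throttling) in-kernel. */
    if (test_thread_flag(TIF_NOTIFY_RESUME)) {
            __set_current_state(TASK_RUNNING);
            resume_user_mode_work(NULL);
    }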
-
Jens Axboe authored
If the target ring is using IORING_SETUP_SINGLE_ISSUER and we're posting a message from a different thread, then we need to ensure that the fallback task_work that posts the CQE knows about the flags being passed as well. If not, we'll always be posting 0 as the flags.
Fixes: 3563d7ed58a5 ("io_uring/msg_ring: Pass custom flags to the cqe")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Breno Leitao authored
This patch removes some "cold" fields from `struct io_issue_def`. The plan is to keep only highly used fields in `struct io_issue_def`, so it stays hot in the cache. The hot fields are basically all the bitfields and the callback functions for .issue and .prep. The less frequently used fields now live in a secondary, cold struct called `io_cold_def`. The struct sizes:

Before: io_issue_def = 56 bytes
After:  io_issue_def = 24 bytes; io_cold_def = 40 bytes

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20230112144411.2624698-2-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
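A hedged sketch of the split; the field choice follows the description above, but the exact member list is illustrative rather than the upstream layout:

    /* Hot: consulted on every prep/issue, kept small so per-opcode
     * entries pack tightly into cache lines. */
    struct io_issue_def {
            unsigned needs_file : 1;
            unsigned pollin : 1;
            unsigned pollout : 1;
            /* ...more feature bits... */

            int (*issue)(struct io_kiocb *req, unsigned int issue_flags);
            int (*prep)(struct io_kiocb *req, const struct io_uring_sqe *sqe);
    };

    /* Cold: rarely touched per request, moved out of the hot table. */
    struct io_cold_def {
            const char *name;
            unsigned short async_size;
            int (*prep_async)(struct io_kiocb *req);
            void (*cleanup)(struct io_kiocb *req);
            void (*fail)(struct io_kiocb *req);
    };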
-
Breno Leitao authored
The current io_op_def struct is becoming huge and the name is a bit generic. The goal of this patch is to rename it to `io_issue_def`. This struct will contain the hot functions associated with the issue code path. For now, this patch only renames the structure; an upcoming patch will break it up in two, moving the non-issue fields to a secondary struct.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20230112144411.2624698-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Keep the parts of __io_req_complete_post() relying on req->flags together so the value can be cached.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2b4fbb42f404a0e75c4d9f0a5b16f314a839d0a9.1673887636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
There may be a different cost for reading just one byte versus more, so it's beneficial to keep ctx flag bits that we access together in a single byte. That affected code generation of __io_cq_unlock_post_flush() and removed one memory load.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bbe8ca4705704690319d65e45845f9fc9d35f420.1673887636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
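A hedged illustration of the idea; the reduced struct is a sketch, not the real io_ring_ctx layout, and the flag pairing is an example:

    struct example_ctx {
            /* Declared adjacently, these 1-bit flags share a single byte,
             * so code testing both needs only one narrow load. */
            unsigned int task_complete : 1;
            unsigned int syscall_iopoll : 1;
    };

    static bool both_set(struct example_ctx *ctx)
    {
            return ctx->task_complete && ctx->syscall_iopoll;
    }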
-
Pavel Begunkov authored
Lock the ring with uring_lock in io_fallback_req_func(), which should make it a bit safer and easier. With that we also don't need refs pinning, as io_ring_exit_work() will wait until uring_lock is released.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/56170e6a0cbfc8edee2794c6613e8f6f1d76d276.1673887636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
io_put_task() is only used in io_uring.c, so enclose it there together with __io_put_task().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/43c7f9227e2ab215f1a6069dadbc5382bed346fe.1673887636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
io_submit_flush_completions() may queue new requests for tw execution; this is especially true for linked requests. Recheck the tw list for emptiness after flushing completions. Note that this doesn't really fix the commit referenced below, but it does reinstate an optimisation that existed before that commit was merged.
Fixes: f88262e6 ("io_uring: lockless task list")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6328acdbb5e60efc762b18003382de077e6e1367.1673887636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Quanfa Fu authored
Change the return type to void since it always returns 0, and there is no need to do the checking in the io_uring_enter syscall.
Signed-off-by: Quanfa Fu <quanfafu@gmail.com>
Link: https://lore.kernel.org/r/20230115071519.554282-1-quanfafu@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
We needed fake nodes in __io_run_local_work() to avoid unnecessary wake ups while the task is already running task_works, but we don't need them anymore since wake ups are protected by cq_waiting, which is always cleared by the time we're executing deferred task_work items. Note that because of the loose sync around cq_waiting clearing, io_req_local_work_add() may wake the task more than once, but that's fine and should be rare enough not to hurt performance.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8839534891f0a2f1076e78554a31ea7e099f7de5.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Don't wake the master task after queueing a deferred tw unless it's currently waiting in io_cqring_wait().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/717702d772825a6647e6c315b4690277ba84c3fc.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
With DEFER_TASKRUN only ctx->submitter_task might be waiting for CQEs, and we can use this to optimise io_cqring_wait(). Replace the ->cq_wait waitqueue with waking the task directly. It works, but misses an important optimisation covered by the following patch, so this patch without the follow-ups might hurt performance.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/103d174d35d919d4cb0922d8a9c93a8f0c35f74a.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
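A hedged sketch of the direct wakeup; wake_up_state() is the stock kernel primitive, and the surrounding enqueue logic is elided:

    /* Instead of wake_up(&ctx->cq_wait), wake the one known waiter.
     * TASK_INTERRUPTIBLE matches the state io_cqring_wait() sleeps in. */
    if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
            wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);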
-
Pavel Begunkov authored
Flushing completions is done either from the submit syscall or by task_work; both run in the context of the submitter task, and for single-threaded rings, as implied by ->task_complete, there won't be any waiters on ->cq_wait other than the master task. That means no tasks can be sleeping on cq_wait while we run __io_submit_flush_completions(), and so the wake up can be skipped.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/60ad9768ec74435a0ddaa6eec0ffa7729474f69f.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Even though io_poll_wq_wake()'s waitqueue_active check reuses a barrier we do for another waitqueue, that's not going to be the case in the future, so we want a fast path for when the ring has never been polled. Move poll_wq wake ups into __io_commit_cqring_flush() using a new flag called ->poll_activated. The idea behind the flag is to set it when the ring is polled for the first time. This requires additional synchronisation to not miss events, which is done here by using task_work for ->task_complete rings, and by enabling the flag by default for all other ring types.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/060785e8e9137a920b232c0c7f575b131af19cac.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Don't use ->cq_wait for ring polling; add a separate wait queue for it instead. We need it for the following patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dea0be0bf990503443c5c6c337fc66824af7d590.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
io_run_local_work_locked() is only used in io_uring.c; move it there. With that we can also make __io_run_local_work() static.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/91757bcb33e5774e49fed6f2b6e058630608119b.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
io_run_local_work() is enclosed in io_uring.c; we don't need to export it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b477fb81f5e77044f724a06fe245d5c078659364.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
The CQ waiting loop sets TASK_RUNNING before trying to execute task_work; there is no need to repeat it in io_run_local_work().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9d9422c429ef3f9457b4f4b8288bf4789564f33b.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Remove the local variable ctx in io_wake_function(); we don't need it if io_should_wake() triggers the wake up.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e60eb1008aebe286aab7d34c772ed01c447bddb1.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
->submitter_task is used somewhat more frequently now than before, i.e. for local tw enqueue and run, so let's move it from the end of ctx, which is full of cold data, to the first cacheline together with the mostly-constant fields.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/415ca91dc5ad1dec612b892e489cda98e1069542.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Dmitrii Bundin authored
The IS_ERR() function uses the IS_ERR_VALUE() macro under the hood, which already wraps the condition in unlikely().
Signed-off-by: Dmitrii Bundin <dmitrii.bundin.a@gmail.com>
Link: https://lore.kernel.org/r/20230109185854.25698-1-dmitrii.bundin.a@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
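For reference, the relevant definitions in include/linux/err.h look like this (quoted from the kernel tree from memory; treat as indicative):

    #define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)

    static inline bool __must_check IS_ERR(__force const void *ptr)
    {
            return IS_ERR_VALUE((unsigned long)ptr);
    }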
-
Breno Leitao authored
This patch adds a new flag (IORING_MSG_RING_FLAGS_PASS) to the message ring operation (IORING_OP_MSG_RING). The new flag enables the sender to specify custom flags, which will be copied over to cqe->flags in the receiving ring. These custom flags should be specified using the sqe->file_index field. This mechanism provides additional flexibility when sending messages between rings.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230103160507.617416-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
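A hedged userspace sketch of a sender using the new flag; target_ring_fd, the user data, and the flag value are placeholders, and the sqe fields are set per the description above (assuming an already-initialized struct io_uring ring):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    io_uring_prep_msg_ring(sqe, target_ring_fd, /*len=*/0, /*data=*/0x1234, 0);
    sqe->msg_ring_flags = IORING_MSG_RING_FLAGS_PASS; /* enable flag passing */
    sqe->file_index = 0x1;  /* placeholder value copied into cqe->flags on the target */
    io_uring_submit(&ring);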
-
Pavel Begunkov authored
Move the waiting timeout into io_wait_queue.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e4b48a9e26a3b1cf97c80121e62d4b5ab873d28d.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Unlike the jiffy-based scheduling version, schedule_hrtimeout() jumps through a few functions before getting into schedule(), even when there is no actual timeout needed. Some tests showed that it takes up to 1% of CPU cycles.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/89f880574eceee6f4899783377ead234df7b3d04.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
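A hedged sketch of the resulting fast path; timeout_set and timeout are hypothetical io_wait_queue fields standing in for the real bookkeeping:

    /* With no timeout requested, plain schedule() skips the hrtimer
     * setup/teardown that schedule_hrtimeout() always performs. */
    if (!iowq->timeout_set) {          /* hypothetical field */
            schedule();
    } else {
            schedule_hrtimeout(&iowq->timeout, HRTIMER_MODE_ABS);  /* hypothetical field */
    }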
-
Pavel Begunkov authored
Instead of constantly checking whether the task state is back to running before executing tw or taking locks in io_cqring_wait(), switch it back to TASK_RUNNING immediately.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/246dddee247d89fd52023f785ed17cc34962a008.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
->work_llist should never be non-empty for a non-DEFER_TASKRUN ring, so we can safely skip checking the flag.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/26af9f73c09a56c9a035f94db56127358688f3aa.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
io_cqring_wait_schedule() is called after we start waiting on the cq wq and set the state to TASK_INTERRUPTIBLE; for that reason we have to constantly worry about whether we have returned the state back to running or not. Leave only quick checks in io_cqring_wait_schedule() and move the rest, including running task work, to the callers. Note that we run tw in the loop after the sched checks because of the fast path at the beginning of the function.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2814fabe75e2e019e7ca43ea07daa94564349805.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-