1. 09 Oct, 2020 3 commits
  2. 08 Oct, 2020 1 commit
    • io_uring: no need to call xa_destroy() on empty xarray · ca6484cd
      Jens Axboe authored
      The kernel test robot reports this lockdep issue:
      
      [child1:659] mbind (274) returned ENOSYS, marking as inactive.
      [child1:659] mq_timedsend (279) returned ENOSYS, marking as inactive.
      [main] 10175 iterations. [F:7781 S:2344 HI:2397]
      [   24.610601]
      [   24.610743] ================================
      [   24.611083] WARNING: inconsistent lock state
      [   24.611437] 5.9.0-rc7-00017-g0f212204 #5 Not tainted
      [   24.611861] --------------------------------
      [   24.612193] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      [   24.612660] ksoftirqd/0/7 [HC0[0]:SC1[3]:HE0:SE0] takes:
      [   24.613086] f00ed998 (&xa->xa_lock#4){+.?.}-{2:2}, at: xa_destroy+0x43/0xc1
      [   24.613642] {SOFTIRQ-ON-W} state was registered at:
      [   24.614024]   lock_acquire+0x20c/0x29b
      [   24.614341]   _raw_spin_lock+0x21/0x30
      [   24.614636]   io_uring_add_task_file+0xe8/0x13a
      [   24.614987]   io_uring_create+0x535/0x6bd
      [   24.615297]   io_uring_setup+0x11d/0x136
      [   24.615606]   __ia32_sys_io_uring_setup+0xd/0xf
      [   24.615977]   do_int80_syscall_32+0x53/0x6c
      [   24.616306]   restore_all_switch_stack+0x0/0xb1
      [   24.616677] irq event stamp: 939881
      [   24.616968] hardirqs last  enabled at (939880): [<8105592d>] __local_bh_enable_ip+0x13c/0x145
      [   24.617642] hardirqs last disabled at (939881): [<81b6ace3>] _raw_spin_lock_irqsave+0x1b/0x4e
      [   24.618321] softirqs last  enabled at (939738): [<81b6c7c8>] __do_softirq+0x3f0/0x45a
      [   24.618924] softirqs last disabled at (939743): [<81055741>] run_ksoftirqd+0x35/0x61
      [   24.619521]
      [   24.619521] other info that might help us debug this:
      [   24.620028]  Possible unsafe locking scenario:
      [   24.620028]
      [   24.620492]        CPU0
      [   24.620685]        ----
      [   24.620894]   lock(&xa->xa_lock#4);
      [   24.621168]   <Interrupt>
      [   24.621381]     lock(&xa->xa_lock#4);
      [   24.621695]
      [   24.621695]  *** DEADLOCK ***
      [   24.621695]
      [   24.622154] 1 lock held by ksoftirqd/0/7:
      [   24.622468]  #0: 823bfb94 (rcu_callback){....}-{0:0}, at: rcu_process_callbacks+0xc0/0x155
      [   24.623106]
      [   24.623106] stack backtrace:
      [   24.623454] CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 5.9.0-rc7-00017-g0f212204 #5
      [   24.624090] Call Trace:
      [   24.624284]  ? show_stack+0x40/0x46
      [   24.624551]  dump_stack+0x1b/0x1d
      [   24.624809]  print_usage_bug+0x17a/0x185
      [   24.625142]  mark_lock+0x11d/0x1db
      [   24.625474]  ? print_shortest_lock_dependencies+0x121/0x121
      [   24.625905]  __lock_acquire+0x41e/0x7bf
      [   24.626206]  lock_acquire+0x20c/0x29b
      [   24.626517]  ? xa_destroy+0x43/0xc1
      [   24.626810]  ? lock_acquire+0x20c/0x29b
      [   24.627110]  _raw_spin_lock_irqsave+0x3e/0x4e
      [   24.627450]  ? xa_destroy+0x43/0xc1
      [   24.627725]  xa_destroy+0x43/0xc1
      [   24.627989]  __io_uring_free+0x57/0x71
      [   24.628286]  ? get_pid+0x22/0x22
      [   24.628544]  __put_task_struct+0xf2/0x163
      [   24.628865]  put_task_struct+0x1f/0x2a
      [   24.629161]  delayed_put_task_struct+0xe2/0xe9
      [   24.629509]  rcu_process_callbacks+0x128/0x155
      [   24.629860]  __do_softirq+0x1a3/0x45a
      [   24.630151]  run_ksoftirqd+0x35/0x61
      [   24.630443]  smpboot_thread_fn+0x304/0x31a
      [   24.630763]  kthread+0x124/0x139
      [   24.631016]  ? sort_range+0x18/0x18
      [   24.631290]  ? kthread_create_worker_on_cpu+0x17/0x17
      [   24.631682]  ret_from_fork+0x1c/0x28
      
      which is complaining about xa_destroy() grabbing the xa lock in an
      IRQ disabling fashion, whereas the io_uring use cases aren't interrupt
      safe. This is really an xarray issue, since it should not assume the
      lock type. But for our use case, since we know the xarray is empty at
      this point, there's no need to actually call xa_destroy(). So just get
      rid of it.
      
      Fixes: 0f212204 ("io_uring: don't rely on weak ->files references")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 07 Oct, 2020 1 commit
  4. 01 Oct, 2020 35 commits
    • io_uring: kill callback_head argument for io_req_task_work_add() · 87c4311f
      Jens Axboe authored
      We always use &req->task_work anyway, no point in passing it in.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: move req preps out of io_issue_sqe() · c1379e24
      Pavel Begunkov authored
      All request preparations are done only during submission, reflect it in
      the code by moving io_req_prep() much earlier into io_queue_sqe().

      That's much cleaner, because it doesn't expose bits to async code which
      it won't ever use. Also it makes the interface harder to misuse, and
      there are potential places for bugs.

      For instance, __io_queue() doesn't clear @sqe before proceeding to a
      next linked request, that could have been disastrous, but hopefully
      there are linked requests IFF sqe==NULL, so not actually a bug.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: decouple issuing and req preparation · bfe76559
      Pavel Begunkov authored
      io_issue_sqe() does two things at once: preparing requests and
      issuing them. Split it in two and deduplicate with io_defer_prep().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove nonblock arg from io_{rw}_prep() · 73debe68
      Pavel Begunkov authored
      All io_*_prep() functions including io_{read,write}_prep() are called
      only during submission where @force_nonblock is always true. Don't keep
      propagating it and instead remove the @force_nonblock argument
      from prep() altogether.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: set/clear IOCB_NOWAIT into io_read/write · a88fc400
      Pavel Begunkov authored
      Move setting IOCB_NOWAIT from io_prep_rw() into io_read()/io_write(), so
      it's set/cleared in a single place. Also remove @force_nonblock
      parameter from io_prep_rw().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove F_NEED_CLEANUP check in *prep() · 2d199895
      Pavel Begunkov authored
      REQ_F_NEED_CLEANUP is set only by io_*_prep() and they're guaranteed to
      be called only once, so there is no one who may have set the flag
      before. Kill REQ_F_NEED_CLEANUP check in these *prep() handlers.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: io_kiocb_ppos() style change · 5b09e37e
      Pavel Begunkov authored
      Put brackets around bitwise ops in a complex expression.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: simplify io_alloc_req() · 291b2821
      Pavel Begunkov authored
      Extract common code from if/else branches. That is cleaner and optimised
      even better.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: kill unused IO_WORKER_F_EXITING · 145cc8c6
      Jens Axboe authored
      This flag is no longer used, remove it.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: fix use-after-free in io_wq_worker_running · c4068bf8
      Hillf Danton authored
      The smart syzbot has found a reproducer for the following issue:
      
       ==================================================================
       BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
       BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
       BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline]
       BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
       Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771
      
       CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0
       Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
       Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x198/0x1fd lib/dump_stack.c:118
        print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
        __kasan_report mm/kasan/report.c:513 [inline]
        kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
        check_memory_region_inline mm/kasan/generic.c:186 [inline]
        check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
        instrument_atomic_write include/linux/instrumented.h:71 [inline]
        atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
        io_wqe_inc_running fs/io-wq.c:301 [inline]
        io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
        schedule_timeout+0x148/0x250 kernel/time/timer.c:1879
        io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580
        kthread+0x3b5/0x4a0 kernel/kthread.c:292
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
       Allocated by task 7768:
        kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
        kasan_set_track mm/kasan/common.c:56 [inline]
        __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
        kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594
        kmalloc_node include/linux/slab.h:572 [inline]
        kzalloc_node include/linux/slab.h:677 [inline]
        io_wq_create+0x57b/0xa10 fs/io-wq.c:1064
        io_init_wq_offload fs/io_uring.c:7432 [inline]
        io_sq_offload_start fs/io_uring.c:7504 [inline]
        io_uring_create fs/io_uring.c:8625 [inline]
        io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694
        do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       Freed by task 21:
        kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
        kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
        kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
        __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
        __cache_free mm/slab.c:3418 [inline]
        kfree+0x10e/0x2b0 mm/slab.c:3756
        __io_wq_destroy fs/io-wq.c:1138 [inline]
        io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146
        io_finish_async fs/io_uring.c:6836 [inline]
        io_ring_ctx_free fs/io_uring.c:7870 [inline]
        io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954
        process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
        worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
        kthread+0x3b5/0x4a0 kernel/kthread.c:292
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
       The buggy address belongs to the object at ffff8882183db000
        which belongs to the cache kmalloc-1k of size 1024
       The buggy address is located 140 bytes inside of
        1024-byte region [ffff8882183db000, ffff8882183db400)
       The buggy address belongs to the page:
       page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db
       flags: 0x57ffe0000000200(slab)
       raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700
       raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       >ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
        ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ==================================================================
      
      which is down to the comment below,
      
      	/* all workers gone, wq exit can proceed */
      	if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
      		complete(&wqe->wq->done);
      
      because there might be multiple instances of wqe in a wq and we need to
      wait for every worker in every wqe to go home before releasing wq's
      resources on destroying.

      To that end, rework wq's refcount by making it independent of the tracking
      of workers because after all they are two different things, and keeping
      it balanced when workers come and go. Note the manager kthread, like
      other workers, now holds a reference to wq during its lifetime.

      Finally, to help destroy wq, check IO_WQ_BIT_EXIT upon creating a worker
      and do nothing for an exiting wq.
      
      Cc: stable@vger.kernel.org # v5.5+
      Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com
      Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Hillf Danton <hdanton@sina.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: show sqthread pid and cpu in fdinfo · dbbe9c64
      Joseph Qi authored
      In most cases we'll specify IORING_SETUP_SQPOLL and run multiple
      io_uring instances on a host. Since all sqthreads are named
      "io_uring-sq", it's hard to tell which sqthread belongs to which
      application process.
      With this patch, an application can get its corresponding sqthread pid
      and cpu through show_fdinfo.
      Steps:
      1. Get io_uring fd first.
      $ ls -l /proc/<pid>/fd | grep -w io_uring
      2. Then get io_uring instance related info, including corresponding
      sqthread pid and cpu.
      $ cat /proc/<pid>/fdinfo/<io_uring_fd>
      
      pos:	0
      flags:	02000002
      mnt_id:	13
      SqThread:	6929
      SqThreadCpu:	2
      UserFiles:	1
          0: testfile
      UserBufs:	0
      PollList:
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      [axboe: fixed for new shared SQPOLL]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
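      The fdinfo text shown above could be consumed by a monitoring tool
      roughly as follows. This is a minimal userspace sketch, assuming the
      two fields keep the "SqThread:" and "SqThreadCpu:" labels from the
      sample output; parse_sq_fdinfo() is a hypothetical helper, not part
      of the patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan fdinfo text line by line for the
 * "SqThread:" and "SqThreadCpu:" fields shown in the sample output.
 * Returns 0 when both fields were found, -1 otherwise.
 */
static int parse_sq_fdinfo(const char *text, int *pid, int *cpu)
{
	int have_pid = 0, have_cpu = 0;
	const char *line = text;

	while (line && *line) {
		/* A space in the format matches any whitespace, including the tab. */
		if (sscanf(line, "SqThreadCpu: %d", cpu) == 1)
			have_cpu = 1;
		else if (sscanf(line, "SqThread: %d", pid) == 1)
			have_pid = 1;

		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline to the next line */
	}
	return (have_pid && have_cpu) ? 0 : -1;
}
```

      In practice the text would be read from /proc/<pid>/fdinfo/<io_uring_fd>
      first; the sketch only covers the parsing step.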
    • io_uring: process task work in io_uring_register() · af9c1a44
      Jens Axboe authored
      We do this for CQ ring wait, in case task_work completions come in. We
      should do the same in io_uring_register(), to avoid spurious -EINTR
      if the ring quiescing ends up having to process task_work to complete
      the operation.
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add blkcg accounting to offloaded operations · 91d8f519
      Dennis Zhou authored
      There are a few operations that are offloaded to the worker threads. In
      this case, we lose process context and end up in kthread context. This
      results in I/Os not being accounted to the issuing cgroup, so they
      end up as issued by root. Just like others, adopt the
      personality of the blkcg too when issuing via the workqueues.

      For the SQPOLL thread, it will live and attach in the inited cgroup's
      context.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: improve registered buffer accounting for huge pages · de293938
      Jens Axboe authored
      io_uring does account any registered buffer as pinned/locked memory, and
      checks limit and fails if the given user doesn't have a big enough limit
      to register the ranges specified. However, if huge pages are used, we
      are potentially under-accounting the memory in terms of what gets pinned
      on the vm side.

      This patch rectifies that, by ensuring that we account the full size of
      a compound page, regardless of how much of it is being registered. Huge
      pages are not accounted multiple times - if multiple sections of a huge
      page are registered, then the page is only accounted once.
      Reported-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
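      The accounting rule described above (charge the full compound page,
      but only once) can be sketched in userspace as follows. This is an
      illustration under stated assumptions, not the kernel code: struct
      acct, account_page() and the fixed-size table are hypothetical.

```c
#include <stddef.h>

#define MAX_HEADS 64	/* assumption: small fixed table, enough for a sketch */

struct acct {
	unsigned long heads[MAX_HEADS];	/* compound head pages already charged */
	size_t nr_heads;
	unsigned long pinned_pages;	/* total base pages charged so far */
};

/*
 * Charge one registered page. head_pfn identifies the compound page it
 * belongs to, compound_nr is that compound page's size in base pages
 * (1 for a normal page).
 */
static void account_page(struct acct *a, unsigned long head_pfn,
			 unsigned long compound_nr)
{
	size_t i;

	if (compound_nr == 1) {
		a->pinned_pages++;
		return;
	}

	for (i = 0; i < a->nr_heads; i++)
		if (a->heads[i] == head_pfn)
			return;	/* this huge page was charged earlier */

	/* If the table is full the page is still charged, just not remembered. */
	if (a->nr_heads < MAX_HEADS)
		a->heads[a->nr_heads++] = head_pfn;
	a->pinned_pages += compound_nr;	/* charge the full compound page */
}
```

      Registering two sections of the same 512-page huge page therefore
      adds 512 pages to the total once, not twice.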
    • io_uring: remove unneeded semicolon · 14db8411
      Zheng Bin authored
      Fixes coccicheck warning:

      fs/io_uring.c:4242:13-14: Unneeded semicolon
      Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: cap SQ submit size for SQPOLL with multiple rings · e95eee2d
      Jens Axboe authored
      In the spirit of fairness, cap the max number of SQ entries we'll submit
      for SQPOLL if we have multiple rings. If we don't do that, we could be
      submitting tons of entries for one ring, while others are waiting to get
      service.

      The value of 8 is somewhat arbitrarily chosen as something that allows
      a fair bit of batching, without using an excessive time per ring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
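      The capping rule can be expressed in a few lines. A minimal sketch,
      assuming the cap only applies when more than one ring shares the
      thread; SQ_CAP and sq_submit_count() are hypothetical names chosen
      for illustration.

```c
#include <stddef.h>

#define SQ_CAP 8	/* the value named in the commit message */

/* How many pending SQ entries to submit for one ring on this pass. */
static unsigned int sq_submit_count(unsigned int pending, size_t nr_rings)
{
	/* A single ring has nobody to be fair to, so it is not capped. */
	if (nr_rings > 1 && pending > SQ_CAP)
		return SQ_CAP;
	return pending;
}
```

      With two rings and 100 pending entries on one of them, each pass
      submits at most 8 before the other ring gets its turn.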
    • io_uring: get rid of req->io/io_async_ctx union · e8c2bc1f
      Jens Axboe authored
      There's really no point in having this union, it just means that we're
      always allocating enough room to cater to any command. But that's
      pointless, as the ->io field is request type private anyway.

      This gets rid of the io_async_ctx structure, and fills in the required
      size in the io_op_defs[] instead.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
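      The idea of sizing async data per opcode instead of through a union
      can be sketched like this. Everything here is an assumption made for
      illustration: the opcode names, the two payload structs and the
      async_size field are hypothetical stand-ins, not the kernel's
      definitions.

```c
#include <stdlib.h>

/* Hypothetical per-request-type async payloads of different sizes. */
struct rw_async { char iov_storage[64]; };
struct timeout_async { long deadline; };

enum { OP_READ, OP_TIMEOUT, OP_NOP, OP_LAST };

struct op_def {
	size_t async_size;	/* bytes of async data this opcode needs */
};

/* Table indexed by opcode, in the spirit of io_op_defs[]. */
static const struct op_def op_defs[OP_LAST] = {
	[OP_READ]    = { .async_size = sizeof(struct rw_async) },
	[OP_TIMEOUT] = { .async_size = sizeof(struct timeout_async) },
	[OP_NOP]     = { .async_size = 0 },
};

/* Allocate only what the opcode needs; NULL when it needs nothing. */
static void *alloc_async_data(unsigned int op)
{
	if (op >= OP_LAST || !op_defs[op].async_size)
		return NULL;
	return calloc(1, op_defs[op].async_size);
}
```

      Compared with a union, a timeout request in this sketch allocates
      sizeof(long) bytes rather than the size of the largest member.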
    • io_uring: kill extra user_bufs check · 4be1c615
      Pavel Begunkov authored
      Testing ctx->user_bufs for NULL in io_import_fixed() is not necessary,
      because in that case ctx->nr_user_bufs would be zero, and the following
      check would fail.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix overlapped memcpy in io_req_map_rw() · ab0b196c
      Pavel Begunkov authored
      When io_req_map_rw() is called from io_rw_prep_async(), it memcpy()s
      iorw->iter into itself. Even though it doesn't lead to an error, such a
      violation of memcpy()'s aliasing rules is considered bad practice.

      Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any
      remapping there, so it's much simpler than the generic implementation.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: refactor io_req_map_rw() · afb87658
      Pavel Begunkov authored
      Set rw->free_iovec to @iovec, that gives an identical result and stresses
      that the @iovec param and rw->free_iovec play the same role.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: simplify io_rw_prep_async() · f4bff104
      Pavel Begunkov authored
      Don't touch iter->iov and iov in between __io_import_iovec() and
      io_req_map_rw(). The former function already sets them correctly, and
      touching them creates one more case with NULL'ed iov to consider in
      io_req_map_rw().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: provide IORING_ENTER_SQ_WAIT for SQPOLL SQ ring waits · 90554200
      Jens Axboe authored
      When using SQPOLL, applications can run into the issue of running out of
      SQ ring entries because the thread hasn't consumed them yet. The only
      option for dealing with that is checking later, or busy checking for the
      condition.

      Provide IORING_ENTER_SQ_WAIT if applications want to wait on this
      condition.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: mark io_uring_fops/io_op_defs as __read_mostly · 738277ad
      Jens Axboe authored
      These structures are never written, move them appropriately.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too · aa06165d
      Jens Axboe authored
      We support using IORING_SETUP_ATTACH_WQ to share async backends between
      rings created by the same process, this now also allows the same to
      happen with SQPOLL. The setup procedure remains the same, the caller
      sets io_uring_params->wq_fd to the 'parent' context, and then the newly
      created ring will attach to that async backend.

      This means that multiple rings can share the same SQPOLL thread, saving
      resources.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: base SQPOLL handling off io_sq_data · 69fb2131
      Jens Axboe authored
      Remove the SQPOLL thread from the ctx, and use the io_sq_data as the
      data structure we pass in. io_sq_data has a list of ctx's that we can
      then iterate over and handle.

      As of now we're ready to handle multiple ctx's, though we're still just
      handling a single one after this patch.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: split SQPOLL data into separate structure · 534ca6d6
      Jens Axboe authored
      Move all the necessary state out of io_ring_ctx, and into a new
      structure, io_sq_data. The latter now deals with any state or
      variables associated with the SQPOLL thread itself.

      In preparation for supporting more than one io_ring_ctx per SQPOLL
      thread.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: split work handling part of SQPOLL into helper · c8d1ba58
      Jens Axboe authored
      This is done in preparation for handling more than one ctx, but it also
      cleans up the code a bit since io_sq_thread() was a bit too unwieldy to
      get a good overview of.

      __io_sq_thread() is now the main handler, and it returns an enum sq_ret
      that tells io_sq_thread() what it ended up doing. The parent then makes
      a decision on idle, spinning, or work handling based on that.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: move SQPOLL post-wakeup ring need wakeup flag into wake handler · 3f0e64d0
      Jens Axboe authored
      We need to decouple the clearing on wakeup from the inline schedule,
      as that is going to be required for handling multiple rings in one
      thread.

      Wrap our wakeup handler so we can clear it when we get the wakeup, by
      definition that is when we no longer need the flag set.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: use private ctx wait queue entries for SQPOLL · 6a779382
      Jens Axboe authored
      This is in preparation for sharing the poller thread between rings. For
      that we need per-ring wait_queue_entry storage, and we can't easily put
      that on the stack if one thread is managing multiple rings.

      We'll also be sharing the wait_queue_head across rings for the purposes
      of wakeups. Provide the usual private ring wait_queue_head for now, but
      make it a pointer so we can easily override it when sharing.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • fs: align IOCB_* flags with RWF_* flags · ce71bfea
      Jens Axboe authored
      We have a set of flags that are shared between the two and inherited
      in kiocb_set_rw_flags(), but we check and set these individually.
      Reorder the IOCB flags so that the bottom part of the space is synced
      with the RWF flag space, and then we can do them all in one mask and
      set operation.

      The only exception is RWF_SYNC, which needs to mark IOCB_SYNC and
      IOCB_DSYNC. Do that one separately.

      This shaves 15 bytes of text from kiocb_set_rw_flags() for me.
      Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
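      The "one mask and set" idea, including the RWF_SYNC exception, can be
      sketched in userspace as below. The X_* constants are hypothetical
      values picked for the illustration, not the kernel's definitions, and
      set_rw_flags() is a simplified stand-in for kiocb_set_rw_flags().

```c
#include <errno.h>

/* Hypothetical request flag values, for illustration only. */
#define X_RWF_HIPRI	0x01
#define X_RWF_DSYNC	0x02
#define X_RWF_SYNC	0x04
#define X_RWF_NOWAIT	0x08
#define X_RWF_APPEND	0x10
#define X_RWF_SUPPORTED	0x1f

/* The low bits of the kiocb flag space mirror the request flags. */
#define X_IOCB_HIPRI	X_RWF_HIPRI
#define X_IOCB_DSYNC	X_RWF_DSYNC
#define X_IOCB_SYNC	X_RWF_SYNC
#define X_IOCB_NOWAIT	X_RWF_NOWAIT
#define X_IOCB_APPEND	X_RWF_APPEND

static int set_rw_flags(unsigned int *ki_flags, unsigned int rwf)
{
	unsigned int kiocb_flags;

	if (rwf & ~X_RWF_SUPPORTED)
		return -EOPNOTSUPP;

	/* Shared bits line up, so a single assignment copies them all. */
	kiocb_flags = rwf;

	/* The one exception: a sync request also implies data sync. */
	if (kiocb_flags & X_IOCB_SYNC)
		kiocb_flags |= X_IOCB_DSYNC;

	*ki_flags |= kiocb_flags;
	return 0;
}
```

      Without the aligned layout, each flag would need its own test and
      set, which is the per-flag code the commit removes.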
    • io_uring: io_sq_thread() doesn't need to flush signals · e35afcf9
      Jens Axboe authored
      We're not handling signals by default in kernel threads, and we never
      use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never
      have a signal pending, and we don't need to check for it (nor flush it).
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_wq: Make io_wqe::lock a raw_spinlock_t · 95da8465
      Sebastian Andrzej Siewior authored
      During a context switch the scheduler invokes wq_worker_sleeping() with
      disabled preemption. Disabling preemption is needed because it protects
      access to `worker->sleeping'. As an optimisation it avoids invoking
      schedule() within the schedule path as part of possible wake up (thus
      preempt_enable_no_resched() afterwards).
      
      The io-wq has been added to the mix in the same section with disabled
      preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
      acquires a spinlock_t. Also within the schedule() the spinlock_t must be
      acquired after tsk_is_pi_blocked() otherwise it will block on the
      sleeping lock again while scheduling out.
      
      While playing with `io_uring-bench' I didn't notice a significant
      latency spike after converting io_wqe::lock to a raw_spinlock_t. The
      latency was more or less the same.
      
      In order to keep the spinlock_t it would have to be moved after the
      tsk_is_pi_blocked() check which would introduce a branch instruction
      into the hot path.
      
      The lock is used to maintain the `work_list' and wakes one task up at
      most.
      Should io_wqe_cancel_pending_work() cause latency spikes, while
      searching for a specific item, then it would need to drop the lock
      during iterations.
      revert_creds() is also invoked under the lock. According to debug
      cred::non_rcu is 0. Otherwise it should be moved outside of the locked
      section because put_cred_rcu()->free_uid() acquires a sleeping lock.
      
      Convert io_wqe::lock to a raw_spinlock_t.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: allow disabling rings during the creation · 7e84e1c7
      Stefano Garzarella authored
      This patch adds a new IORING_SETUP_R_DISABLED flag to start the
      rings disabled, allowing the user to register restrictions,
      buffers, and files before starting to process SQEs.

      When IORING_SETUP_R_DISABLED is set, SQEs are not processed and
      the SQPOLL kthread is not started.

      Registering restrictions is allowed only while the rings are
      disabled, to prevent concurrency issues while processing SQEs.

      The rings can be enabled using the IORING_REGISTER_ENABLE_RINGS
      opcode with io_uring_register(2).
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IOURING_REGISTER_RESTRICTIONS opcode · 21b55dbc
      Stefano Garzarella authored
      The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode
      permanently installs a feature allowlist on an io_ring_ctx.
      The io_ring_ctx can then be passed to untrusted code with the
      knowledge that only operations present in the allowlist can be
      executed.

      The allowlist approach ensures that new features added to io_uring
      do not accidentally become available when an existing application
      is launched on a newer kernel version.

      Currently it is possible to restrict sqe opcodes, sqe flags, and
      register opcodes.

      IOURING_REGISTER_RESTRICTIONS can only be made once. Afterwards
      it is not possible to change restrictions anymore.
      This prevents untrusted code from removing restrictions.
      Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
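      The register-once allowlist semantics can be modelled in a few lines
      of userspace C. This is a sketch under stated assumptions: it tracks
      sqe opcodes only, limits them to 64 bits, and picks -EBUSY for the
      second registration; struct ring_restrictions and both functions are
      hypothetical names, not the kernel interface.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct ring_restrictions {
	uint64_t sqe_op_allowed;	/* bit N set: sqe opcode N may be used */
	bool registered;		/* true once an allowlist is installed */
};

/* Install the allowlist. A second attempt is refused. */
static int register_restrictions(struct ring_restrictions *r, uint64_t allowed)
{
	if (r->registered)
		return -EBUSY;	/* assumption: error code chosen for the sketch */
	r->sqe_op_allowed = allowed;
	r->registered = true;
	return 0;
}

/* Check an opcode against the allowlist at submission time. */
static bool sqe_op_permitted(const struct ring_restrictions *r, unsigned int op)
{
	if (!r->registered)
		return true;	/* no allowlist installed: nothing is restricted */
	return op < 64 && (r->sqe_op_allowed & (UINT64_C(1) << op));
}
```

      Because opcodes are denied unless their bit is set, an opcode added
      by a newer kernel stays unavailable to a ring restricted this way.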
    • io_uring: use an enumeration for io_uring_register(2) opcodes · 9d4a75ef
      Stefano Garzarella authored
      The enumeration allows us to keep track of the last
      io_uring_register(2) opcode available.

      Behaviour and opcode names don't change.
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
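      Tracking the last opcode through an enumeration usually means ending
      the list with a sentinel entry. A minimal sketch of that pattern,
      with hypothetical X_* names standing in for the real opcode list:

```c
/* Hypothetical opcode names; the real list lives in the uapi header. */
enum {
	X_REGISTER_BUFFERS,
	X_UNREGISTER_BUFFERS,
	X_REGISTER_FILES,

	/* keep this entry last: it is one past the highest valid opcode */
	X_REGISTER_LAST
};

/* A range check that stays correct when new opcodes are added above. */
static int opcode_is_valid(unsigned int opcode)
{
	return opcode < X_REGISTER_LAST;
}
```

      Adding an opcode before the sentinel moves the sentinel up by one,
      so bounds checks and per-opcode tables sized by it follow along.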