1. 10 Oct, 2020 10 commits
  2. 09 Oct, 2020 4 commits
  3. 08 Oct, 2020 1 commit
    • Jens Axboe's avatar
      io_uring: no need to call xa_destroy() on empty xarray · ca6484cd
      Jens Axboe authored
      The kernel test robot reports this lockdep issue:
      
      [child1:659] mbind (274) returned ENOSYS, marking as inactive.
      [child1:659] mq_timedsend (279) returned ENOSYS, marking as inactive.
      [main] 10175 iterations. [F:7781 S:2344 HI:2397]
      [   24.610601]
      [   24.610743] ================================
      [   24.611083] WARNING: inconsistent lock state
      [   24.611437] 5.9.0-rc7-00017-g0f212204 #5 Not tainted
      [   24.611861] --------------------------------
      [   24.612193] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      [   24.612660] ksoftirqd/0/7 [HC0[0]:SC1[3]:HE0:SE0] takes:
      [   24.613086] f00ed998 (&xa->xa_lock#4){+.?.}-{2:2}, at: xa_destroy+0x43/0xc1
      [   24.613642] {SOFTIRQ-ON-W} state was registered at:
      [   24.614024]   lock_acquire+0x20c/0x29b
      [   24.614341]   _raw_spin_lock+0x21/0x30
      [   24.614636]   io_uring_add_task_file+0xe8/0x13a
      [   24.614987]   io_uring_create+0x535/0x6bd
      [   24.615297]   io_uring_setup+0x11d/0x136
      [   24.615606]   __ia32_sys_io_uring_setup+0xd/0xf
      [   24.615977]   do_int80_syscall_32+0x53/0x6c
      [   24.616306]   restore_all_switch_stack+0x0/0xb1
      [   24.616677] irq event stamp: 939881
      [   24.616968] hardirqs last  enabled at (939880): [<8105592d>] __local_bh_enable_ip+0x13c/0x145
      [   24.617642] hardirqs last disabled at (939881): [<81b6ace3>] _raw_spin_lock_irqsave+0x1b/0x4e
      [   24.618321] softirqs last  enabled at (939738): [<81b6c7c8>] __do_softirq+0x3f0/0x45a
      [   24.618924] softirqs last disabled at (939743): [<81055741>] run_ksoftirqd+0x35/0x61
      [   24.619521]
      [   24.619521] other info that might help us debug this:
      [   24.620028]  Possible unsafe locking scenario:
      [   24.620028]
      [   24.620492]        CPU0
      [   24.620685]        ----
      [   24.620894]   lock(&xa->xa_lock#4);
      [   24.621168]   <Interrupt>
      [   24.621381]     lock(&xa->xa_lock#4);
      [   24.621695]
      [   24.621695]  *** DEADLOCK ***
      [   24.621695]
      [   24.622154] 1 lock held by ksoftirqd/0/7:
      [   24.622468]  #0: 823bfb94 (rcu_callback){....}-{0:0}, at: rcu_process_callbacks+0xc0/0x155
      [   24.623106]
      [   24.623106] stack backtrace:
      [   24.623454] CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 5.9.0-rc7-00017-g0f212204 #5
      [   24.624090] Call Trace:
      [   24.624284]  ? show_stack+0x40/0x46
      [   24.624551]  dump_stack+0x1b/0x1d
      [   24.624809]  print_usage_bug+0x17a/0x185
      [   24.625142]  mark_lock+0x11d/0x1db
      [   24.625474]  ? print_shortest_lock_dependencies+0x121/0x121
      [   24.625905]  __lock_acquire+0x41e/0x7bf
      [   24.626206]  lock_acquire+0x20c/0x29b
      [   24.626517]  ? xa_destroy+0x43/0xc1
      [   24.626810]  ? lock_acquire+0x20c/0x29b
      [   24.627110]  _raw_spin_lock_irqsave+0x3e/0x4e
      [   24.627450]  ? xa_destroy+0x43/0xc1
      [   24.627725]  xa_destroy+0x43/0xc1
      [   24.627989]  __io_uring_free+0x57/0x71
      [   24.628286]  ? get_pid+0x22/0x22
      [   24.628544]  __put_task_struct+0xf2/0x163
      [   24.628865]  put_task_struct+0x1f/0x2a
      [   24.629161]  delayed_put_task_struct+0xe2/0xe9
      [   24.629509]  rcu_process_callbacks+0x128/0x155
      [   24.629860]  __do_softirq+0x1a3/0x45a
      [   24.630151]  run_ksoftirqd+0x35/0x61
      [   24.630443]  smpboot_thread_fn+0x304/0x31a
      [   24.630763]  kthread+0x124/0x139
      [   24.631016]  ? sort_range+0x18/0x18
      [   24.631290]  ? kthread_create_worker_on_cpu+0x17/0x17
      [   24.631682]  ret_from_fork+0x1c/0x28
      
      which is complaining about xa_destroy() grabbing the xa lock in an
      IRQ disabling fashion, whereas the io_uring uses cases aren't interrupt
      safe. This is really an xarray issue, since it should not assume the
      lock type. But for our use case, since we know the xarray is empty at
      this point, there's no need to actually call xa_destroy(). So just get
      rid of it.
      
      Fixes: 0f212204 ("io_uring: don't rely on weak ->files references")
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ca6484cd
  4. 07 Oct, 2020 1 commit
  5. 01 Oct, 2020 24 commits
    • Jens Axboe's avatar
      io_uring: kill callback_head argument for io_req_task_work_add() · 87c4311f
      Jens Axboe authored
      We always use &req->task_work anyway, no point in passing it in.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      87c4311f
    • Pavel Begunkov's avatar
      io_uring: move req preps out of io_issue_sqe() · c1379e24
      Pavel Begunkov authored
      All request preparations are done only during submission, reflect it in
      the code by moving io_req_prep() much earlier into io_queue_sqe().
      
      That's much cleaner, because it doen't expose bits to async code which
      it won't ever use. Also it makes the interface harder to misuse, and
      there are potential places for bugs.
      
      For instance, __io_queue() doesn't clear @sqe before proceeding to a
      next linked request, that could have been disastrous, but hopefully
      there are linked requests IFF sqe==NULL, so not actually a bug.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c1379e24
    • Pavel Begunkov's avatar
      io_uring: decouple issuing and req preparation · bfe76559
      Pavel Begunkov authored
      io_issue_sqe() does two things at once, trying to prepare request and
      issuing them. Split it in two and deduplicate with io_defer_prep().
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bfe76559
    • Pavel Begunkov's avatar
      io_uring: remove nonblock arg from io_{rw}_prep() · 73debe68
      Pavel Begunkov authored
      All io_*_prep() functions including io_{read,write}_prep() are called
      only during submission where @force_nonblock is always true. Don't keep
      propagating it and instead remove the @force_nonblock argument
      from prep() altogether.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      73debe68
    • Pavel Begunkov's avatar
      io_uring: set/clear IOCB_NOWAIT into io_read/write · a88fc400
      Pavel Begunkov authored
      Move setting IOCB_NOWAIT from io_prep_rw() into io_read()/io_write(), so
      it's set/cleared in a single place. Also remove @force_nonblock
      parameter from io_prep_rw().
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a88fc400
    • Pavel Begunkov's avatar
      io_uring: remove F_NEED_CLEANUP check in *prep() · 2d199895
      Pavel Begunkov authored
      REQ_F_NEED_CLEANUP is set only by io_*_prep() and they're guaranteed to
      be called only once, so there is no one who may have set the flag
      before. Kill REQ_F_NEED_CLEANUP check in these *prep() handlers.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      2d199895
    • Pavel Begunkov's avatar
      io_uring: io_kiocb_ppos() style change · 5b09e37e
      Pavel Begunkov authored
      Put brackets around bitwise ops in a complex expression
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5b09e37e
    • Pavel Begunkov's avatar
      io_uring: simplify io_alloc_req() · 291b2821
      Pavel Begunkov authored
      Extract common code from if/else branches. That is cleaner and optimised
      even better.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      291b2821
    • Jens Axboe's avatar
      io-wq: kill unused IO_WORKER_F_EXITING · 145cc8c6
      Jens Axboe authored
      This flag is no longer used, remove it.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      145cc8c6
    • Hillf Danton's avatar
      io-wq: fix use-after-free in io_wq_worker_running · c4068bf8
      Hillf Danton authored
      The smart syzbot has found a reproducer for the following issue:
      
       ==================================================================
       BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
       BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
       BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline]
       BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
       Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771
      
       CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0
       Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
       Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x198/0x1fd lib/dump_stack.c:118
        print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
        __kasan_report mm/kasan/report.c:513 [inline]
        kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
        check_memory_region_inline mm/kasan/generic.c:186 [inline]
        check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
        instrument_atomic_write include/linux/instrumented.h:71 [inline]
        atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
        io_wqe_inc_running fs/io-wq.c:301 [inline]
        io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
        schedule_timeout+0x148/0x250 kernel/time/timer.c:1879
        io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580
        kthread+0x3b5/0x4a0 kernel/kthread.c:292
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
       Allocated by task 7768:
        kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
        kasan_set_track mm/kasan/common.c:56 [inline]
        __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
        kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594
        kmalloc_node include/linux/slab.h:572 [inline]
        kzalloc_node include/linux/slab.h:677 [inline]
        io_wq_create+0x57b/0xa10 fs/io-wq.c:1064
        io_init_wq_offload fs/io_uring.c:7432 [inline]
        io_sq_offload_start fs/io_uring.c:7504 [inline]
        io_uring_create fs/io_uring.c:8625 [inline]
        io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694
        do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       Freed by task 21:
        kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
        kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
        kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
        __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
        __cache_free mm/slab.c:3418 [inline]
        kfree+0x10e/0x2b0 mm/slab.c:3756
        __io_wq_destroy fs/io-wq.c:1138 [inline]
        io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146
        io_finish_async fs/io_uring.c:6836 [inline]
        io_ring_ctx_free fs/io_uring.c:7870 [inline]
        io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954
        process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
        worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
        kthread+0x3b5/0x4a0 kernel/kthread.c:292
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
       The buggy address belongs to the object at ffff8882183db000
        which belongs to the cache kmalloc-1k of size 1024
       The buggy address is located 140 bytes inside of
        1024-byte region [ffff8882183db000, ffff8882183db400)
       The buggy address belongs to the page:
       page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db
       flags: 0x57ffe0000000200(slab)
       raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700
       raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       >ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
        ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ==================================================================
      
      which is down to the comment below,
      
      	/* all workers gone, wq exit can proceed */
      	if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
      		complete(&wqe->wq->done);
      
      because there might be multiple cases of wqe in a wq and we would wait
      for every worker in every wqe to go home before releasing wq's resources
      on destroying.
      
      To that end, rework wq's refcount by making it independent of the tracking
      of workers because after all they are two different things, and keeping
      it balanced when workers come and go. Note the manager kthread, like
      other workers, now holds a grab to wq during its lifetime.
      
      Finally to help destroy wq, check IO_WQ_BIT_EXIT upon creating worker
      and do nothing for exiting wq.
      
      Cc: stable@vger.kernel.org # v5.5+
      Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com
      Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarHillf Danton <hdanton@sina.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c4068bf8
    • Joseph Qi's avatar
      io_uring: show sqthread pid and cpu in fdinfo · dbbe9c64
      Joseph Qi authored
      In most cases we'll specify IORING_SETUP_SQPOLL and run multiple
      io_uring instances in a host. Since all sqthreads are named
      "io_uring-sq", it's hard to distinguish the relations between
      application process and its io_uring sqthread.
      With this patch, application can get its corresponding sqthread pid
      and cpu through show_fdinfo.
      Steps:
      1. Get io_uring fd first.
      $ ls -l /proc/<pid>/fd | grep -w io_uring
      2. Then get io_uring instance related info, including corresponding
      sqthread pid and cpu.
      $ cat /proc/<pid>/fdinfo/<io_uring_fd>
      
      pos:	0
      flags:	02000002
      mnt_id:	13
      SqThread:	6929
      SqThreadCpu:	2
      UserFiles:	1
          0: testfile
      UserBufs:	0
      PollList:
      Signed-off-by: default avatarJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: default avatarStefano Garzarella <sgarzare@redhat.com>
      [axboe: fixed for new shared SQPOLL]
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      dbbe9c64
    • Jens Axboe's avatar
      io_uring: process task work in io_uring_register() · af9c1a44
      Jens Axboe authored
      We do this for CQ ring wait, in case task_work completions come in. We
      should do the same in io_uring_register(), to avoid spurious -EINTR
      if the ring quiescing ends up having to process task_work to complete
      the operation
      Reported-by: default avatarDan Melnic <dmm@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      af9c1a44
    • Dennis Zhou's avatar
      io_uring: add blkcg accounting to offloaded operations · 91d8f519
      Dennis Zhou authored
      There are a few operations that are offloaded to the worker threads. In
      this case, we lose process context and end up in kthread context. This
      results in ios to be not accounted to the issuing cgroup and
      consequently end up as issued by root. Just like others, adopt the
      personality of the blkcg too when issuing via the workqueues.
      
      For the SQPOLL thread, it will live and attach in the inited cgroup's
      context.
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      91d8f519
    • Jens Axboe's avatar
      io_uring: improve registered buffer accounting for huge pages · de293938
      Jens Axboe authored
      io_uring does account any registered buffer as pinned/locked memory, and
      checks limit and fails if the given user doesn't have a big enough limit
      to register the ranges specified. However, if huge pages are used, we
      are potentially under-accounting the memory in terms of what gets pinned
      on the vm side.
      
      This patch rectifies that, by ensuring that we account the full size of
      a compound page, regardless of how much of it is being registered. Huge
      pages are not accounted mulitple times - if multiple sections of a huge
      page is registered, then the page is only accounted once.
      Reported-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      de293938
    • Zheng Bin's avatar
      io_uring: remove unneeded semicolon · 14db8411
      Zheng Bin authored
      Fixes coccicheck warning:
      
      fs/io_uring.c:4242:13-14: Unneeded semicolon
      Signed-off-by: default avatarZheng Bin <zhengbin13@huawei.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      14db8411
    • Jens Axboe's avatar
      io_uring: cap SQ submit size for SQPOLL with multiple rings · e95eee2d
      Jens Axboe authored
      In the spirit of fairness, cap the max number of SQ entries we'll submit
      for SQPOLL if we have multiple rings. If we don't do that, we could be
      submitting tons of entries for one ring, while others are waiting to get
      service.
      
      The value of 8 is somewhat arbitrarily chosen as something that allows
      a fair bit of batching, without using an excessive time per ring.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e95eee2d
    • Jens Axboe's avatar
      io_uring: get rid of req->io/io_async_ctx union · e8c2bc1f
      Jens Axboe authored
      There's really no point in having this union, it just means that we're
      always allocating enough room to cater to any command. But that's
      pointless, as the ->io field is request type private anyway.
      
      This gets rid of the io_async_ctx structure, and fills in the required
      size in the io_op_defs[] instead.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e8c2bc1f
    • Pavel Begunkov's avatar
      io_uring: kill extra user_bufs check · 4be1c615
      Pavel Begunkov authored
      Testing ctx->user_bufs for NULL in io_import_fixed() is not neccessary,
      because in that case ctx->nr_user_bufs would be zero, and the following
      check would fail.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4be1c615
    • Pavel Begunkov's avatar
      io_uring: fix overlapped memcpy in io_req_map_rw() · ab0b196c
      Pavel Begunkov authored
      When io_req_map_rw() is called from io_rw_prep_async(), it memcpy()
      iorw->iter into itself. Even though it doesn't lead to an error, such a
      memcpy()'s aliasing rules violation is considered to be a bad practise.
      
      Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any
      remapping there, so it's much simpler than the generic implementation.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ab0b196c
    • Pavel Begunkov's avatar
      io_uring: refactor io_req_map_rw() · afb87658
      Pavel Begunkov authored
      Set rw->free_iovec to @iovec, that gives an identical result and stresses
      that @iovec param rw->free_iovec play the same role.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      afb87658
    • Pavel Begunkov's avatar
      io_uring: simplify io_rw_prep_async() · f4bff104
      Pavel Begunkov authored
      Don't touch iter->iov and iov in between __io_import_iovec() and
      io_req_map_rw(), the former function aleady sets it correctly, because it
      creates one more case with NULL'ed iov to consider in io_req_map_rw().
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f4bff104
    • Jens Axboe's avatar
      io_uring: provide IORING_ENTER_SQ_WAIT for SQPOLL SQ ring waits · 90554200
      Jens Axboe authored
      When using SQPOLL, applications can run into the issue of running out of
      SQ ring entries because the thread hasn't consumed them yet. The only
      option for dealing with that is checking later, or busy checking for the
      condition.
      
      Provide IORING_ENTER_SQ_WAIT if applications want to wait on this
      condition.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      90554200
    • Jens Axboe's avatar
      io_uring: mark io_uring_fops/io_op_defs as __read_mostly · 738277ad
      Jens Axboe authored
      These structures are never written, move them appropriately.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      738277ad
    • Jens Axboe's avatar
      io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too · aa06165d
      Jens Axboe authored
      We support using IORING_SETUP_ATTACH_WQ to share async backends between
      rings created by the same process, this now also allows the same to
      happen with SQPOLL. The setup procedure remains the same, the caller
      sets io_uring_params->wq_fd to the 'parent' context, and then the newly
      created ring will attach to that async backend.
      
      This means that multiple rings can share the same SQPOLL thread, saving
      resources.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      aa06165d