1. 01 Sep, 2021 1 commit
    • io-wq: split bounded and unbounded work into separate lists · f95dc207
      Jens Axboe authored
      We've got a few issues that all boil down to the fact that we have one
      list of pending work items, yet two different types of workers to
      serve them. This causes some oddities around workers switching type and
      even hashed work vs regular work on the same bounded list.
      
      Just separate them out cleanly, similarly to how we already do
      accounting of what is running. That provides a clean separation and
      removes some corner cases that can cause stalls when handling IO
      that is punted to io-wq.
      
      Fixes: ecc53c48 ("io-wq: check max_worker limits if a worker transitions bound state")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
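
      A rough sketch of the resulting layout, assuming the per-type accounting
      struct io-wq already uses for counting workers (field names here are
      illustrative rather than the exact ones in io-wq.c): each accounting
      bucket carries its own pending list and its own stalled state, so bounded
      and unbounded work never share a queue.

          /* illustrative: one pending list per accounting type */
          struct io_wqe_acct {
                  unsigned nr_workers;
                  unsigned max_workers;
                  int index;                          /* bound or unbound bucket */
                  atomic_t nr_running;
                  struct io_wq_work_list work_list;   /* pending work of this type only */
                  unsigned long flags;                /* per-list stalled bit */
          };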
  2. 31 Aug, 2021 7 commits
    • io-wq: fix queue stalling race · 0242f642
      Jens Axboe authored
      We need to set the stalled bit early, before we drop the lock for adding
      us to the stall hash queue. If not, then we can race with new work being
      queued between adding us to the stall hash and io_worker_handle_work()
      marking us stalled.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
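
      Illustrative ordering only (bit and helper names follow the surrounding
      messages and may not match the tree exactly): the stalled bit is set
      while the lock is still held, before the worker parks itself on the
      stall hash, so a concurrent enqueue cannot miss it.

          /* worker side, sketch */
          set_bit(IO_ACCT_STALLED_BIT, &acct->flags);   /* mark stalled first */
          raw_spin_unlock(&wqe->lock);
          io_wait_on_hash(wqe, hash);                   /* then wait on the stall hash */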
    • io_uring: don't submit half-prepared drain request · b8ce1b9d
      Pavel Begunkov authored
      [ 3784.910888] BUG: kernel NULL pointer dereference, address: 0000000000000020
      [ 3784.910904] RIP: 0010:__io_file_supports_nowait+0x5/0xc0
      [ 3784.910926] Call Trace:
      [ 3784.910928]  ? io_read+0x17c/0x480
      [ 3784.910945]  io_issue_sqe+0xcb/0x1840
      [ 3784.910953]  __io_queue_sqe+0x44/0x300
      [ 3784.910959]  io_req_task_submit+0x27/0x70
      [ 3784.910962]  tctx_task_work+0xeb/0x1d0
      [ 3784.910966]  task_work_run+0x61/0xa0
      [ 3784.910968]  io_run_task_work_sig+0x53/0xa0
      [ 3784.910975]  __x64_sys_io_uring_enter+0x22/0x30
      [ 3784.910977]  do_syscall_64+0x3d/0x90
      [ 3784.910981]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      io_drain_req() goes before the checks for REQ_F_FAIL, which protect us
      from submitting an under-prepared request (e.g. one that failed in
      io_init_req()). Fail such drained requests as well.
      
      Fixes: a8295b98 ("io_uring: fix failed linkchain code logic")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/e411eb9924d47a131b1e200b26b675df0c2b7627.1630415423.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
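
      The intended ordering in the submit path, sketched (the exact call site
      and failure helper may differ from the real diff): a request already
      marked REQ_F_FAIL is completed as failed before the drain logic gets a
      chance to defer or issue it.

          /* sketch: check REQ_F_FAIL before draining, not after */
          if (unlikely(req->flags & REQ_F_FAIL)) {
                  io_req_complete_failed(req, req->result);   /* never issue it */
                  return;
          }
          if (unlikely(ctx->drain_active) && io_drain_req(req))
                  return;                                     /* deferred, handled later */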
    • io_uring: fix queueing half-created requests · c6d3d9cb
      Pavel Begunkov authored
      [   27.259845] general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] SMP KASAN PTI
      [   27.261043] KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
      [   27.263730] RIP: 0010:sock_from_file+0x20/0x90
      [   27.272444] Call Trace:
      [   27.272736]  io_sendmsg+0x98/0x600
      [   27.279216]  io_issue_sqe+0x498/0x68d0
      [   27.281142]  __io_queue_sqe+0xab/0xb50
      [   27.285830]  io_req_task_submit+0xbf/0x1b0
      [   27.286306]  tctx_task_work+0x178/0xad0
      [   27.288211]  task_work_run+0xe2/0x190
      [   27.288571]  exit_to_user_mode_prepare+0x1a1/0x1b0
      [   27.289041]  syscall_exit_to_user_mode+0x19/0x50
      [   27.289521]  do_syscall_64+0x48/0x90
      [   27.289871]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      io_req_complete_failed() -> io_req_complete_post() ->
      io_req_task_queue() would still try to enqueue a hard-linked request,
      which can be half-prepared (e.g. its init failed), so we can't allow
      that to happen.
      
      Fixes: a8295b98 ("io_uring: fix failed linkchain code logic")
      Reported-by: syzbot+f9704d1878e290eddf73@syzkaller.appspotmail.com
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/70b513848c1000f88bd75965504649c6bb1415c0.1630415423.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
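
      A hedged sketch of the idea (the helper below is hypothetical, not the
      actual fix): when a request fails, its hard-linked followers have to be
      failed as well instead of being queued for execution, because they may
      never have been fully prepared.

          /* hypothetical helper illustrating the rule */
          static void io_fail_hard_links(struct io_kiocb *req)
          {
                  struct io_kiocb *link = req->link;

                  while (link && (link->flags & REQ_F_HARDLINK)) {
                          req_set_fail(link);     /* complete as failed, never issue */
                          link = link->link;
                  }
          }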
    • io-wq: ensure that hash wait lock is IRQ disabling · 08bdbd39
      Jens Axboe authored
      A previous commit removed the IRQ safety of the worker and wqe locks,
      but that left one spot where the hash wait lock is now taken without
      IRQs already being disabled.
      
      Ensure that we use the right locking variant for the hashed waitqueue
      lock.
      
      Fixes: a9a4aa9f ("io-wq: wqe and worker locks no longer need to be IRQ safe")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
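
      The shape of the fix, as a sketch (the field path follows the description
      above; treat it as an assumption): this call site must now use the _irq
      variant itself, since the surrounding code no longer runs with IRQs
      disabled and a waitqueue lock can also be taken from IRQ context via
      wake_up().

          spin_lock_irq(&wq->hash->wait.lock);    /* plain spin_lock() no longer suffices */
          /* ... hashed waitqueue manipulation ... */
          spin_unlock_irq(&wq->hash->wait.lock);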
    • io_uring: retry in case of short read on block device · 7db30437
      Ming Lei authored
      When a short read happens during buffered reading from a block device,
      we should retry and read more; otherwise the IO completes only
      partially. For example, the following fio job expects to read 2MB, but
      may read only 1MB or less:
      
          fio --name=onessd --filename=/dev/nvme0n1 --filesize=2M \
      	--rw=randread --bs=2M --direct=0 --overwrite=0 --numjobs=1 \
      	--iodepth=1 --time_based=0 --runtime=2 --ioengine=io_uring \
      	--registerfiles --fixedbufs --gtod_reduce=1 --group_reporting
      
      Fix the issue by allowing short-read retries for block devices, which
      do set FMODE_BUF_RASYNC.
      
      Fixes: 9a173346 ("io_uring: fix short read retries for non-reg files")
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/20210821150751.1290434-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
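
      A sketch of the kind of check implied by the message (helper name and
      placement are assumptions): the short-read retry path should treat block
      devices like regular files, since both set FMODE_BUF_RASYNC.

          /* assumed helper: must this read be retried until fully done? */
          static bool need_read_all(struct io_kiocb *req)
          {
                  return (req->flags & REQ_F_ISREG) ||
                          S_ISBLK(file_inode(req->file)->i_mode);
          }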
    • io_uring: IORING_OP_WRITE needs hash_reg_file set · 7b3188e7
      Jens Axboe authored
      During some testing, it became evident that using IORING_OP_WRITE
      doesn't hash buffered writes like the other write commands do. That's
      simply an oversight, and it can cause performance regressions when
      doing buffered writes with this command.
      
      Correct that and add the flag, so that buffered writes are correctly
      hashed when using the non-iovec based write command.
      
      Cc: stable@vger.kernel.org
      Fixes: 3a6820f2 ("io_uring: add non-vectored read/write commands")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
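
      The flag lives in the io_op_defs table; a sketch of the entry with it
      set (the neighbouring fields are shown for context from memory and are
      not verified against the tree):

          [IORING_OP_WRITE] = {
                  .needs_file             = 1,
                  .hash_reg_file          = 1,    /* added: hash buffered writes too */
                  .unbound_nonreg_file    = 1,
                  .pollout                = 1,
                  .plug                   = 1,
                  .async_size             = sizeof(struct io_async_rw),
          },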
    • io-wq: fix race between adding work and activating a free worker · 94ffb0a2
      Jens Axboe authored
      The attempt to find and activate a free worker for new work is currently
      combined with creating a new one if we don't find one, but that opens
      io-wq up to a race where the worker that is found and activated can
      put itself to sleep without knowing that it has been selected to perform
      this new work.
      
      Fix this by moving the activation into where we add the new work item.
      Then we can retain it within the wqe->lock scope and eliminate the race
      with the worker itself checking inside the lock, but sleeping outside
      of it.
      
      Cc: stable@vger.kernel.org
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
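
      Sketched control flow (function names follow the io-wq naming in these
      messages; details are assumptions): the free-worker activation now
      happens while io_wqe_enqueue() still holds wqe->lock, and only if nobody
      could be activated do we consider creating a new worker afterwards.

          raw_spin_lock(&wqe->lock);
          io_wqe_insert_work(wqe, work);
          rcu_read_lock();
          do_create = !io_wqe_activate_free_worker(wqe);   /* wake a sleeper under the lock */
          rcu_read_unlock();
          raw_spin_unlock(&wqe->lock);

          if (do_create)
                  io_wqe_create_worker(wqe, acct);         /* no free worker found */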
  3. 30 Aug, 2021 3 commits
    • io-wq: fix wakeup race when adding new work · 87df7fb9
      Jens Axboe authored
      When new work is added, io_wqe_enqueue() checks if we need to wake or
      create a new worker. But that check is done outside the lock that
      otherwise synchronizes us with a worker going to sleep, so we can end
      up in the following situation:
      
      CPU0				CPU1
      lock
      insert work
      unlock
      atomic_read(nr_running) != 0
      				lock
      				atomic_dec(nr_running)
      no wakeup needed
      
      Hold the wqe lock around the "need to wakeup" check. Then we can also get
      rid of the temporary work_flags variable, as we know the work will remain
      valid as long as we hold the lock.
      
      Cc: stable@vger.kernel.org
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
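
      A sketch of the fixed ordering (the wake helper name is a placeholder):
      the "need to wake" decision is made under wqe->lock, and work->flags can
      be read directly there because the item cannot be consumed while the
      lock is held.

          raw_spin_lock(&wqe->lock);
          io_wqe_insert_work(wqe, work);
          do_wake = (work->flags & IO_WQ_WORK_CONCURRENT) ||
                    !atomic_read(&acct->nr_running);       /* checked under the lock now */
          raw_spin_unlock(&wqe->lock);

          if (do_wake)
                  io_wqe_wake_worker(wqe, acct);            /* placeholder for wake/create */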
    • io-wq: wqe and worker locks no longer need to be IRQ safe · a9a4aa9f
      Jens Axboe authored
      io_uring no longer queues async work off completion handlers that run in
      hard or soft interrupt context, and that use case was the only reason that
      io-wq had to use IRQ safe locks for wqe and worker locks.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
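
      The mechanical consequence, sketched on the wqe lock (the worker lock is
      analogous; this is illustrative rather than a specific hunk):

          /* before: callers could be in hard/soft IRQ context */
          raw_spin_lock_irq(&wqe->lock);
          /* ... */
          raw_spin_unlock_irq(&wqe->lock);

          /* after: all callers run in process context */
          raw_spin_lock(&wqe->lock);
          /* ... */
          raw_spin_unlock(&wqe->lock);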
    • io-wq: check max_worker limits if a worker transitions bound state · ecc53c48
      Jens Axboe authored
      For the two places where new workers are created, we diligently check if
      we are allowed to create a new worker. If we're currently at the limit
      of how many workers of a given type we can have, then we don't create
      any new ones.
      
      If you have a mixed workload with various types of bounded and unbounded
      work, then it can happen that a worker finishes one type of work and
      is then transitioned to the other type. For this case, we don't check
      if we are actually allowed to do so. This can cause io-wq to temporarily
      exceed the allowed number of workers for a given type.
      
      When retrieving work, check that the types match. If they don't, check
      if we are allowed to transition to the other type. If not, then don't
      handle the new work.
      
      Cc: stable@vger.kernel.org
      Reported-by: Johannes Lundberg <johalun0@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
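
      A sketch of the check described above (the helper is hypothetical;
      io_work_get_acct() maps a work item to its bounded or unbounded
      accounting bucket): a worker only transitions to the other type if that
      type is still below its max_workers limit.

          /* hypothetical helper illustrating the transition check */
          static bool io_worker_can_switch(struct io_wqe *wqe, struct io_wq_work *work)
          {
                  struct io_wqe_acct *acct = io_work_get_acct(wqe, work);

                  return acct->nr_workers < acct->max_workers;
          }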
  4. 29 Aug, 2021 4 commits
  5. 27 Aug, 2021 5 commits
  6. 25 Aug, 2021 5 commits
  7. 23 Aug, 2021 15 commits