1. 03 Sep, 2021 5 commits
  2. 02 Sep, 2021 2 commits
    • io-wq: make worker creation resilient against signals · 3146cba9
      Jens Axboe authored
      If a task is queueing async work and also handling signals, then we can
      run into the case where create_io_thread() is interrupted and returns
      failure because of that. If this happens for creating the first worker
      in a group, then that worker will never get created and we can hang the
      ring.
      
      If we do get a fork failure, retry from task_work. With signals we have
      to be a bit careful as we cannot simply queue as task_work, as we'll
      still have signals pending at that point. Punt over a normal workqueue
      first and then create from task_work after that.
      
      Lastly, ensure that we handle fatal worker creation failures. Worker
      creation failures are normally not fatal; only if we fail to create one
      in an empty worker group can we not make progress. Right now that case
      is ignored, so handle it and run cancel on the work item.
      
      There are two paths that create new workers - one is the "existing worker
      going to sleep", and the other is "no workers found for this work, create
      one". The former is never fatal, as workers do exist in the group. Only
      the latter needs to be carefully handled.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
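      A minimal sketch of the two-stage retry described above; the helper names
      (io_init_new_worker, io_workqueue_create, io_worker_fork_failed) are
      assumptions for illustration, not the literal fs/io-wq.c code:

          static bool io_try_create_worker(struct io_wqe *wqe,
                                           struct io_worker *worker)
          {
                  struct task_struct *tsk;

                  tsk = create_io_thread(io_wqe_worker, worker, wqe->node);
                  if (!IS_ERR(tsk)) {
                          io_init_new_worker(wqe, worker, tsk);  /* assumed */
                          return true;
                  }
                  if (PTR_ERR(tsk) == -ERESTARTNOINTR &&
                      !fatal_signal_pending(current)) {
                          /* interrupted by a signal: bounce through a normal
                           * workqueue, where no signal is pending for us, and
                           * retry the fork from task_work there */
                          INIT_WORK(&worker->work, io_workqueue_create);
                          schedule_work(&worker->work);
                          return true;
                  }
                  /* hard failure: if the group is empty, cancel pending work */
                  io_worker_fork_failed(wqe, worker);            /* assumed */
                  return false;
          }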
    • io-wq: get rid of FIXED worker flag · 05c5f4ee
      Jens Axboe authored
      It makes the logic easier to follow if we just get rid of the fixed worker
      flag, and simply ensure that we never exit the last worker in the group.
      This also means that no particular worker is special.
      
      Just track the last timeout state, and if we have hit it and no work
      is pending, check if there are other workers. If yes, then we can exit
      this one safely.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
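      A sketch of the resulting exit rule, with assumed field names; the point
      is only that the timeout path now counts its siblings instead of
      consulting a FIXED flag:

          /* on idle timeout: exit only if another worker remains to serve
           * any work that might still arrive for this group */
          static bool io_worker_may_exit(struct io_wqe_acct *acct)
          {
                  return acct->nr_workers > 1;
          }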
  3. 01 Sep, 2021 2 commits
    • io-wq: only exit on fatal signals · 15e20db2
      Jens Axboe authored
      If the application uses io_uring and also relies heavily on signals
      for communication, that can cause io-wq workers to spuriously exit
      just because the parent has a signal pending. Just ignore signals
      unless they are fatal.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
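      The shape of the fix in the worker loop, simplified: get_signal() drains
      pending signals for the kernel thread and, roughly, only returns true
      when one actually needs handling (i.e. is fatal for an io worker):

          /* inside the io_wqe_worker() main loop (simplified sketch) */
          if (signal_pending(current)) {
                  struct ksignal ksig;

                  if (!get_signal(&ksig))
                          continue;       /* non-fatal: ignore, keep running */
                  break;                  /* fatal signal: exit the worker */
          }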
    • io-wq: split bounded and unbounded work into separate lists · f95dc207
      Jens Axboe authored
      We've got a few issues that all boil down to the fact that we have one
      list of pending work items, yet two different types of workers to
      serve them. This causes some oddities around workers switching types, and
      around hashed work vs regular work sharing the same bounded list.
      
      Just separate them out cleanly, similarly to how we already do
      accounting of what is running. That provides a clean separation and
      removes some corner cases that can cause stalls when handling IO
      that is punted to io-wq.
      
      Fixes: ecc53c48 ("io-wq: check max_worker limits if a worker transitions bound state")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
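      A sketch of the data-structure change, with an assumed field layout: the
      pending list moves from the wqe into the per-type accounting struct, so
      bounded and unbounded work can no longer interleave on one list:

          struct io_wqe_acct {
                  unsigned nr_workers;
                  unsigned max_workers;
                  int index;                         /* bound or unbound */
                  atomic_t nr_running;
                  struct io_wq_work_list work_list;  /* was shared on the wqe */
                  unsigned long flags;
          };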
  4. 31 Aug, 2021 7 commits
    • io-wq: fix queue stalling race · 0242f642
      Jens Axboe authored
      We need to set the stalled bit early, before we drop the lock for adding
      us to the stall hash queue. If not, then we can race with new work being
      queued between adding us to the stall hash and io_worker_handle_work()
      marking us stalled.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
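      Roughly, the reordering looks like the following (flag and helper names
      assumed): the stalled bit is set before the lock is dropped, so work
      queued in that window cannot miss it:

          /* no runnable work found, and the next hashed work is busy */
          wqe->flags |= IO_WQE_FLAG_STALLED;    /* mark stalled first... */
          raw_spin_unlock(&wqe->lock);
          io_wait_on_hash(wqe, stall_hash);     /* ...then join the hash wait */
          raw_spin_lock(&wqe->lock);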
    • io_uring: don't submit half-prepared drain request · b8ce1b9d
      Pavel Begunkov authored
      [ 3784.910888] BUG: kernel NULL pointer dereference, address: 0000000000000020
      [ 3784.910904] RIP: 0010:__io_file_supports_nowait+0x5/0xc0
      [ 3784.910926] Call Trace:
      [ 3784.910928]  ? io_read+0x17c/0x480
      [ 3784.910945]  io_issue_sqe+0xcb/0x1840
      [ 3784.910953]  __io_queue_sqe+0x44/0x300
      [ 3784.910959]  io_req_task_submit+0x27/0x70
      [ 3784.910962]  tctx_task_work+0xeb/0x1d0
      [ 3784.910966]  task_work_run+0x61/0xa0
      [ 3784.910968]  io_run_task_work_sig+0x53/0xa0
      [ 3784.910975]  __x64_sys_io_uring_enter+0x22/0x30
      [ 3784.910977]  do_syscall_64+0x3d/0x90
      [ 3784.910981]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      io_drain_req() runs before the checks for REQ_F_FAIL, which protect us
      from submitting an under-prepared request (e.g. one that failed in
      io_init_req()). Fail such drained requests as well.
      
      Fixes: a8295b98 ("io_uring: fix failed linkchain code logic")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/e411eb9924d47a131b1e200b26b675df0c2b7627.1630415423.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
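      A sketch of the guard this adds on the drain path; the helper mirrors the
      era's io_req_complete_failed(), but treat the exact call and return
      convention as assumptions:

          if (unlikely(req->flags & REQ_F_FAIL)) {
                  /* half-prepared, e.g. io_init_req() failed: complete with
                   * an error instead of letting drain logic submit it */
                  io_req_complete_failed(req, -ECANCELED);
                  return true;    /* treated as consumed by the caller */
          }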
    • io_uring: fix queueing half-created requests · c6d3d9cb
      Pavel Begunkov authored
      [   27.259845] general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] SMP KASAN PTI
      [   27.261043] KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
      [   27.263730] RIP: 0010:sock_from_file+0x20/0x90
      [   27.272444] Call Trace:
      [   27.272736]  io_sendmsg+0x98/0x600
      [   27.279216]  io_issue_sqe+0x498/0x68d0
      [   27.281142]  __io_queue_sqe+0xab/0xb50
      [   27.285830]  io_req_task_submit+0xbf/0x1b0
      [   27.286306]  tctx_task_work+0x178/0xad0
      [   27.288211]  task_work_run+0xe2/0x190
      [   27.288571]  exit_to_user_mode_prepare+0x1a1/0x1b0
      [   27.289041]  syscall_exit_to_user_mode+0x19/0x50
      [   27.289521]  do_syscall_64+0x48/0x90
      [   27.289871]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      io_req_complete_failed() -> io_req_complete_post() ->
      io_req_task_queue() would still try to enqueue a hard linked request,
      which can be half prepared (e.g. its init failed), so we can't allow
      that to happen.
      
      Fixes: a8295b98 ("io_uring: fix failed linkchain code logic")
      Reported-by: syzbot+f9704d1878e290eddf73@syzkaller.appspotmail.com
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/70b513848c1000f88bd75965504649c6bb1415c0.1630415423.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: ensure that hash wait lock is IRQ disabling · 08bdbd39
      Jens Axboe authored
      A previous commit removed the IRQ safety of the worker and wqe locks,
      but that left one spot where the hash wait lock is now taken without
      IRQs already being disabled.
      
      Ensure that we use the right locking variant for the hashed waitqueue
      lock.
      
      Fixes: a9a4aa9f ("io-wq: wqe and worker locks no longer need to be IRQ safe")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
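      The change amounts to picking the IRQ-disabling lock variant for the
      hashed waitqueue, since the caller no longer has IRQs off; a sketch:

          spin_lock_irq(&wq->hash->wait.lock);    /* was: spin_lock() */
          list_del_init(&wqe->wait.entry);
          spin_unlock_irq(&wq->hash->wait.lock);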
    • io_uring: retry in case of short read on block device · 7db30437
      Ming Lei authored
      In the case of a buffered read from a block device, when a short read
      happens we should retry and read more; otherwise the IO will be
      completed only partially. For example, the following fio job expects to
      read 2MB, but it can read only 1MB or fewer bytes:
      
          fio --name=onessd --filename=/dev/nvme0n1 --filesize=2M \
      	--rw=randread --bs=2M --direct=0 --overwrite=0 --numjobs=1 \
      	--iodepth=1 --time_based=0 --runtime=2 --ioengine=io_uring \
      	--registerfiles --fixedbufs --gtod_reduce=1 --group_reporting
      
      Fix the issue by allowing short-read retries for block devices, which
      do in fact set FMODE_BUF_RASYNC.
      
      Fixes: 9a173346 ("io_uring: fix short read retries for non-reg files")
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/20210821150751.1290434-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
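      A sketch of the retry test this implies, with a hypothetical helper name:
      block devices are treated like regular files for short-read retry, since
      both set FMODE_BUF_RASYNC:

          static bool io_file_supports_retry(struct file *file)
          {
                  umode_t mode = file_inode(file)->i_mode;

                  /* retry short buffered reads for regular files and bdevs */
                  return S_ISREG(mode) || S_ISBLK(mode);
          }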
    • io_uring: IORING_OP_WRITE needs hash_reg_file set · 7b3188e7
      Jens Axboe authored
      During some testing, it became evident that using IORING_OP_WRITE
      doesn't hash buffered writes like the other write commands do. That's
      simply an oversight, and it can cause performance regressions when
      doing buffered writes with this command.
      
      Correct that and add the flag, so that buffered writes are correctly
      hashed when using the non-iovec based write command.
      
      Cc: stable@vger.kernel.org
      Fixes: 3a6820f2 ("io_uring: add non-vectored read/write commands")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
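      The fix is a one-liner in the opcode table; a sketch of the entry with
      plausible neighbouring fields (the exact io_op_defs layout varies by
      kernel version, so treat everything but .hash_reg_file as assumed):

          [IORING_OP_WRITE] = {
                  .needs_file             = 1,
                  .hash_reg_file          = 1,    /* newly set: hash buffered
                                                     writes to regular files */
                  .unbound_nonreg_file    = 1,
                  .plug                   = 1,
          },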
    • io-wq: fix race between adding work and activating a free worker · 94ffb0a2
      Jens Axboe authored
      The attempt to find and activate a free worker for new work is currently
      combined with creating a new one if we don't find one, but that opens
      io-wq up to a race where the worker that is found and activated can
      put itself to sleep without knowing that it has been selected to perform
      this new work.
      
      Fix this by moving the activation into where we add the new work item;
      then we can retain it within the wqe->lock scope and eliminate the race
      with the worker itself checking inside the lock, but sleeping outside
      of it.
      
      Cc: stable@vger.kernel.org
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
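      A sketch of the reworked enqueue path (helper names illustrative, not
      the literal fs/io-wq.c code): the free-worker activation now happens
      inside the wqe->lock scope, and a new worker is only created once the
      lock has been dropped:

          bool do_create;

          raw_spin_lock(&wqe->lock);
          io_wqe_insert_work(wqe, work);
          wqe->flags &= ~IO_WQE_FLAG_STALLED;
          rcu_read_lock();
          do_create = !io_wqe_activate_free_worker(wqe);  /* under the lock */
          rcu_read_unlock();
          raw_spin_unlock(&wqe->lock);

          if (do_create)
                  io_wqe_create_worker(wqe, acct);        /* none was free */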
  5. 30 Aug, 2021 3 commits
    • io-wq: fix wakeup race when adding new work · 87df7fb9
      Jens Axboe authored
      When new work is added, io_wqe_enqueue() checks if we need to wake or
      create a new worker. But that check is done outside the lock that
      otherwise synchronizes us with a worker going to sleep, so we can end
      up in the following situation:
      
      CPU0				CPU1
      lock
      insert work
      unlock
      atomic_read(nr_running) != 0
      				lock
      				atomic_dec(nr_running)
      no wakeup needed
      
      Hold the wqe lock around the "need to wakeup" check. Then we can also get
      rid of the temporary work_flags variable, as we know the work will remain
      valid as long as we hold the lock.
      
      Cc: stable@vger.kernel.org
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
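      In sketch form (names assumed), the wakeup decision moves inside the
      lock, which also lets the code read work->flags directly instead of
      caching it in a temporary:

          raw_spin_lock(&wqe->lock);
          io_wqe_insert_work(wqe, work);
          wqe->flags &= ~IO_WQE_FLAG_STALLED;
          if ((work->flags & IO_WQ_WORK_CONCURRENT) ||
              !atomic_read(&acct->nr_running))
                  io_wqe_wake_worker(wqe, acct);  /* decided while locked */
          raw_spin_unlock(&wqe->lock);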
    • io-wq: wqe and worker locks no longer need to be IRQ safe · a9a4aa9f
      Jens Axboe authored
      io_uring no longer queues async work off completion handlers that run in
      hard or soft interrupt context, and that use case was the only reason that
      io-wq had to use IRQ safe locks for wqe and worker locks.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: check max_worker limits if a worker transitions bound state · ecc53c48
      Jens Axboe authored
      For the two places where new workers are created, we diligently check if
      we are allowed to create a new worker. If we're currently at the limit
      of how many workers of a given type we can have, then we don't create
      any new ones.
      
      If you have a mixed workload with various types of bounded and
      unbounded work, then it can happen that a worker finishes one type of
      work and is then transitioned to the other type. For this case, we
      don't check if we are actually allowed to make that transition. This
      can cause io-wq to temporarily exceed the allowed number of workers
      for a given type.
      
      When retrieving work, check that the types match. If they don't, check
      if we are allowed to transition to the other type. If not, then don't
      handle the new work.
      
      Cc: stable@vger.kernel.org
      Reported-by: Johannes Lundberg <johalun0@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
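      A sketch of the new check, with assumed field names: before a worker
      picks up work of the other type, the destination type's limit is
      consulted, and the work is left alone if that type is already full:

          /* worker is bound, next work is unbound (or vice versa) */
          static bool io_worker_can_transition(struct io_wqe_acct *dest)
          {
                  /* refuse the switch if the other type is at its cap */
                  return dest->nr_workers < dest->max_workers;
          }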
  6. 29 Aug, 2021 4 commits
  7. 27 Aug, 2021 5 commits
  8. 25 Aug, 2021 5 commits
  9. 23 Aug, 2021 7 commits