1. 01 Oct, 2020 36 commits
    • io-wq: fix use-after-free in io_wq_worker_running · c4068bf8
      Hillf Danton authored
      The smart syzbot has found a reproducer for the following issue:
      
       ==================================================================
       BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
       BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
       BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline]
       BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
       Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771
      
       CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0
       Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
       Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x198/0x1fd lib/dump_stack.c:118
        print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
        __kasan_report mm/kasan/report.c:513 [inline]
        kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
        check_memory_region_inline mm/kasan/generic.c:186 [inline]
        check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
        instrument_atomic_write include/linux/instrumented.h:71 [inline]
        atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
        io_wqe_inc_running fs/io-wq.c:301 [inline]
        io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
        schedule_timeout+0x148/0x250 kernel/time/timer.c:1879
        io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580
        kthread+0x3b5/0x4a0 kernel/kthread.c:292
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
       Allocated by task 7768:
        kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
        kasan_set_track mm/kasan/common.c:56 [inline]
        __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
        kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594
        kmalloc_node include/linux/slab.h:572 [inline]
        kzalloc_node include/linux/slab.h:677 [inline]
        io_wq_create+0x57b/0xa10 fs/io-wq.c:1064
        io_init_wq_offload fs/io_uring.c:7432 [inline]
        io_sq_offload_start fs/io_uring.c:7504 [inline]
        io_uring_create fs/io_uring.c:8625 [inline]
        io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694
        do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       Freed by task 21:
        kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
        kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
        kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
        __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
        __cache_free mm/slab.c:3418 [inline]
        kfree+0x10e/0x2b0 mm/slab.c:3756
        __io_wq_destroy fs/io-wq.c:1138 [inline]
        io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146
        io_finish_async fs/io_uring.c:6836 [inline]
        io_ring_ctx_free fs/io_uring.c:7870 [inline]
        io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954
        process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
        worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
        kthread+0x3b5/0x4a0 kernel/kthread.c:292
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
       The buggy address belongs to the object at ffff8882183db000
        which belongs to the cache kmalloc-1k of size 1024
       The buggy address is located 140 bytes inside of
        1024-byte region [ffff8882183db000, ffff8882183db400)
       The buggy address belongs to the page:
       page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db
       flags: 0x57ffe0000000200(slab)
       raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700
       raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       >ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
        ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ==================================================================
      
      which comes down to the code at the comment below,
      
      	/* all workers gone, wq exit can proceed */
      	if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
      		complete(&wqe->wq->done);
      
      because there may be multiple wqe instances in a wq, and we would wait
      for every worker in every wqe to go home before releasing the wq's
      resources on destroy.
      
      To that end, rework the wq's refcount by making it independent of worker
      tracking, because they are after all two different things, and keep it
      balanced as workers come and go. Note that the manager kthread, like the
      other workers, now holds a reference to the wq for its lifetime.
      
      Finally, to help destroy the wq, check IO_WQ_BIT_EXIT when creating a
      worker and do nothing if the wq is exiting.
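      
      A minimal sketch of the reworked lifetime handling (the exact hunk is
      an assumption; only the wq/wqe names come from fs/io-wq.c):
      
      	/* worker (and manager) startup: each one pins the wq */
      	refcount_inc(&wq->refs);
      
      	/* worker exit: the last reference out completes wq->done,
      	 * letting wq destruction proceed */
      	if (refcount_dec_and_test(&wq->refs))
      		complete(&wq->done);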
      
      Cc: stable@vger.kernel.org # v5.5+
      Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com
      Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Hillf Danton <hdanton@sina.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: show sqthread pid and cpu in fdinfo · dbbe9c64
      Joseph Qi authored
      In most cases we'll specify IORING_SETUP_SQPOLL and run multiple
      io_uring instances on a host. Since all sqthreads are named
      "io_uring-sq", it's hard to tell which application process an
      io_uring sqthread belongs to.
      With this patch, an application can get its corresponding sqthread pid
      and cpu through show_fdinfo.
      Steps:
      1. Get io_uring fd first.
      $ ls -l /proc/<pid>/fd | grep -w io_uring
      2. Then get io_uring instance related info, including corresponding
      sqthread pid and cpu.
      $ cat /proc/<pid>/fdinfo/<io_uring_fd>
      
      pos:	0
      flags:	02000002
      mnt_id:	13
      SqThread:	6929
      SqThreadCpu:	2
      UserFiles:	1
          0: testfile
      UserBufs:	0
      PollList:
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      [axboe: fixed for new shared SQPOLL]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: process task work in io_uring_register() · af9c1a44
      Jens Axboe authored
      We do this for the CQ ring wait, in case task_work completions come in.
      We should do the same in io_uring_register(), to avoid a spurious -EINTR
      if quiescing the ring ends up having to process task_work to complete
      the operation.
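      
      A rough sketch of the quiesce loop, assuming a helper along the lines
      of the internal io_run_task_work() (the exact flow is an assumption):
      
      	do {
      		ret = wait_for_completion_interruptible(&ctx->ref_comp);
      		if (!ret)
      			break;
      		/* interrupted: try to make progress on task_work
      		 * instead of returning -EINTR to userspace */
      		if (io_run_task_work())
      			continue;
      		ret = -EINTR;
      		break;
      	} while (1);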
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add blkcg accounting to offloaded operations · 91d8f519
      Dennis Zhou authored
      There are a few operations that are offloaded to the worker threads. In
      this case, we lose process context and end up in kthread context. This
      results in IOs not being accounted to the issuing cgroup, so they
      consequently end up attributed to root. Just like the others, adopt the
      personality of the blkcg too when issuing via the workqueues.
      
      The SQPOLL thread will live in, and attach to, the context of the
      cgroup it was initialized in.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: improve registered buffer accounting for huge pages · de293938
      Jens Axboe authored
      io_uring accounts any registered buffer as pinned/locked memory, checks
      that limit, and fails if the given user doesn't have a large enough
      limit to register the specified ranges. However, if huge pages are
      used, we are potentially under-accounting the memory in terms of what
      gets pinned on the vm side.
      
      This patch rectifies that by ensuring that we account the full size of
      a compound page, regardless of how much of it is being registered. Huge
      pages are not accounted multiple times - if multiple sections of a huge
      page are registered, then the page is only accounted once.
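      
      Conceptually the accounting change looks like this (a sketch, not the
      exact patch; page_already_accounted() is a hypothetical helper):
      
      	/* charge the entire compound (huge) page the first time any
      	 * part of it is registered; later slices are not re-charged */
      	struct page *hpage = compound_head(page);
      
      	if (!page_already_accounted(hpage))	/* hypothetical */
      		acct_len += page_size(hpage);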
      Reported-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove unneeded semicolon · 14db8411
      Zheng Bin authored
      Fixes coccicheck warning:
      
      fs/io_uring.c:4242:13-14: Unneeded semicolon
      Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: cap SQ submit size for SQPOLL with multiple rings · e95eee2d
      Jens Axboe authored
      In the spirit of fairness, cap the max number of SQ entries we'll submit
      for SQPOLL if we have multiple rings. If we don't do that, we could be
      submitting tons of entries for one ring while others are waiting to be
      serviced.
      
      The value of 8 is somewhat arbitrarily chosen as something that allows
      a fair bit of batching, without using an excessive time per ring.
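      
      The cap itself is a one-liner in the SQPOLL submit path, roughly
      (a sketch; cap_entries signals that more than one ring shares the
      thread):
      
      	to_submit = io_sqring_entries(ctx);
      	/* with multiple rings, bound the per-ring batch for fairness */
      	if (cap_entries && to_submit > 8)
      		to_submit = 8;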
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: get rid of req->io/io_async_ctx union · e8c2bc1f
      Jens Axboe authored
      There's really no point in having this union; it just means that we're
      always allocating enough room to cater to any command. That's wasteful,
      as the ->io field is request-type private anyway.
      
      This gets rid of the io_async_ctx structure, and fills in the required
      size in the io_op_defs[] instead.
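      
      A sketch of what a table entry might look like after the change (the
      async_size field is from this patch; the other fields are
      illustrative):
      
      	[IORING_OP_READV] = {
      		.needs_file		= 1,
      		/* async context is now sized per opcode */
      		.async_size		= sizeof(struct io_async_rw),
      	},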
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: kill extra user_bufs check · 4be1c615
      Pavel Begunkov authored
      Testing ctx->user_bufs for NULL in io_import_fixed() is not necessary,
      because in that case ctx->nr_user_bufs would be zero, and the following
      check would fail.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix overlapped memcpy in io_req_map_rw() · ab0b196c
      Pavel Begunkov authored
      When io_req_map_rw() is called from io_rw_prep_async(), it memcpy()s
      iorw->iter into itself. Even though it doesn't lead to an error, such a
      memcpy() violates aliasing rules and is considered bad practice.
      
      Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any
      remapping there, so it's much simpler than the generic implementation.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: refactor io_req_map_rw() · afb87658
      Pavel Begunkov authored
      Set rw->free_iovec to @iovec; that gives an identical result and
      stresses that the @iovec param and rw->free_iovec play the same role.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: simplify io_rw_prep_async() · f4bff104
      Pavel Begunkov authored
      Don't touch iter->iov and iov in between __io_import_iovec() and
      io_req_map_rw(); the former function already sets it correctly, and
      touching it in between creates one more case, with a NULL'ed iov, to
      consider in io_req_map_rw().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: provide IORING_ENTER_SQ_WAIT for SQPOLL SQ ring waits · 90554200
      Jens Axboe authored
      When using SQPOLL, applications can run into the issue of running out of
      SQ ring entries because the thread hasn't consumed them yet. The only
      option for dealing with that is checking later, or busy checking for the
      condition.
      
      Provide IORING_ENTER_SQ_WAIT if applications want to wait on this
      condition.
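      
      From userspace this might be used as follows (raw syscall sketch):
      
      	/* block until the SQPOLL thread has freed up SQ ring space */
      	ret = syscall(__NR_io_uring_enter, ring_fd, 0, 0,
      		      IORING_ENTER_SQ_WAIT, NULL, 0);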
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: mark io_uring_fops/io_op_defs as __read_mostly · 738277ad
      Jens Axboe authored
      These structures are never written, move them appropriately.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too · aa06165d
      Jens Axboe authored
      We support using IORING_SETUP_ATTACH_WQ to share async backends between
      rings created by the same process, this now also allows the same to
      happen with SQPOLL. The setup procedure remains the same, the caller
      sets io_uring_params->wq_fd to the 'parent' context, and then the newly
      created ring will attach to that async backend.
      
      This means that multiple rings can share the same SQPOLL thread, saving
      resources.
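      
      Setup might look like this (a sketch; parent_ring_fd is the fd of the
      already-created SQPOLL ring):
      
      	struct io_uring_params p = {
      		.flags	= IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ,
      		.wq_fd	= parent_ring_fd,
      	};
      	/* the new ring shares the parent's SQPOLL thread */
      	int fd = syscall(__NR_io_uring_setup, 64, &p);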
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: base SQPOLL handling off io_sq_data · 69fb2131
      Jens Axboe authored
      Remove the SQPOLL thread from the ctx, and use the io_sq_data as the
      data structure we pass in. io_sq_data has a list of ctx's that we can
      then iterate over and handle.
      
      As of now we're ready to handle multiple ctx's, though we're still just
      handling a single one after this patch.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: split SQPOLL data into separate structure · 534ca6d6
      Jens Axboe authored
      Move all the necessary state out of io_ring_ctx, and into a new
      structure, io_sq_data. The latter now deals with any state or
      variables associated with the SQPOLL thread itself.
      
      In preparation for supporting more than one io_ring_ctx per SQPOLL
      thread.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: split work handling part of SQPOLL into helper · c8d1ba58
      Jens Axboe authored
      This is done in preparation for handling more than one ctx, but it also
      cleans up the code a bit, since io_sq_thread() was a bit too unwieldy to
      get a good overview of.
      
      __io_sq_thread() is now the main handler, and it returns an enum sq_ret
      that tells io_sq_thread() what it ended up doing. The parent then makes
      a decision on idle, spinning, or work handling based on that.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: move SQPOLL post-wakeup ring need wakeup flag into wake handler · 3f0e64d0
      Jens Axboe authored
      We need to decouple the clearing on wakeup from the inline schedule,
      as that is going to be required for handling multiple rings in one
      thread.
      
      Wrap our wakeup handler so we can clear it when we get the wakeup, by
      definition that is when we no longer need the flag set.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: use private ctx wait queue entries for SQPOLL · 6a779382
      Jens Axboe authored
      This is in preparation to sharing the poller thread between rings. For
      that we need per-ring wait_queue_entry storage, and we can't easily put
      that on the stack if one thread is managing multiple rings.
      
      We'll also be sharing the wait_queue_head across rings for the purposes
      of wakeups, provide the usual private ring wait_queue_head for now but
      make it a pointer so we can easily override it when sharing.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • fs: align IOCB_* flags with RWF_* flags · ce71bfea
      Jens Axboe authored
      We have a set of flags that are shared between the two and inherited
      in kiocb_set_rw_flags(), but we check and set these individually.
      Reorder the IOCB flags so that the bottom part of the space is synced
      with the RWF flag space, and then we can do them all in one mask and
      set operation.
      
      The only exception is RWF_SYNC, which needs to mark IOCB_SYNC and
      IOCB_DSYNC. Do that one separately.
      
      This shaves 15 bytes of text from kiocb_set_rw_flags() for me.
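      
      With the spaces aligned, the body can collapse to roughly this
      (a sketch of the idea, not the verbatim patch):
      
      	/* one mask-and-set covers all shared flags ... */
      	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
      	/* ... except RWF_SYNC, which must set IOCB_DSYNC in addition
      	 * to the aligned IOCB_SYNC bit */
      	if (flags & RWF_SYNC)
      		kiocb_flags |= IOCB_DSYNC;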
      Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: io_sq_thread() doesn't need to flush signals · e35afcf9
      Jens Axboe authored
      We're not handling signals by default in kernel threads, and we never
      use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never
      have a signal pending, and we don't need to check for it (nor flush it).
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_wq: Make io_wqe::lock a raw_spinlock_t · 95da8465
      Sebastian Andrzej Siewior authored
      During a context switch the scheduler invokes wq_worker_sleeping() with
      disabled preemption. Disabling preemption is needed because it protects
      access to `worker->sleeping'. As an optimisation it avoids invoking
      schedule() within the schedule path as part of possible wake up (thus
      preempt_enable_no_resched() afterwards).
      
      The io-wq has been added to the mix in the same section with disabled
      preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
      acquires a spinlock_t. Also within the schedule() the spinlock_t must be
      acquired after tsk_is_pi_blocked() otherwise it will block on the
      sleeping lock again while scheduling out.
      
      While playing with `io_uring-bench' I didn't notice a significant
      latency spike after converting io_wqe::lock to a raw_spinlock_t. The
      latency was more or less the same.
      
      In order to keep the spinlock_t it would have to be moved after the
      tsk_is_pi_blocked() check which would introduce a branch instruction
      into the hot path.
      
      The lock is used to maintain the `work_list' and wakes one task up at
      most.
      Should io_wqe_cancel_pending_work() cause latency spikes, while
      searching for a specific item, then it would need to drop the lock
      during iterations.
      revert_creds() is also invoked under the lock. According to debugging,
      cred::non_rcu is 0. Otherwise it would have to be moved outside of the
      locked section because put_cred_rcu()->free_uid() acquires a sleeping
      lock.
      
      Convert io_wqe::lock to a raw_spinlock_t.
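      
      The conversion is mechanical, along these lines (sketch):
      
      	-	spinlock_t lock;
      	+	raw_spinlock_t lock;
      
      	-	spin_lock_irq(&wqe->lock);
      	+	raw_spin_lock_irq(&wqe->lock);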
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: allow disabling rings during the creation · 7e84e1c7
      Stefano Garzarella authored
      This patch adds a new IORING_SETUP_R_DISABLED flag that starts the
      rings disabled, allowing the user to register restrictions,
      buffers, and files before SQE processing starts.
      
      When IORING_SETUP_R_DISABLED is set, SQEs are not processed and the
      SQPOLL kthread is not started.
      
      Registering restrictions is only allowed while the rings are
      disabled, to prevent concurrency issues while processing SQEs.
      
      The rings can be enabled using IORING_REGISTER_ENABLE_RINGS
      opcode with io_uring_register(2).
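      
      Typical usage might look like this (raw syscall sketch):
      
      	struct io_uring_params p = { .flags = IORING_SETUP_R_DISABLED };
      	int fd = syscall(__NR_io_uring_setup, 8, &p);
      
      	/* ... register restrictions, buffers, files here ... */
      
      	/* rings stay dead until explicitly enabled */
      	syscall(__NR_io_uring_register, fd,
      		IORING_REGISTER_ENABLE_RINGS, NULL, 0);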
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IOURING_REGISTER_RESTRICTIONS opcode · 21b55dbc
      Stefano Garzarella authored
      The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode
      permanently installs a feature allowlist on an io_ring_ctx.
      The io_ring_ctx can then be passed to untrusted code with the
      knowledge that only operations present in the allowlist can be
      executed.
      
      The allowlist approach ensures that new features added to io_uring
      do not accidentally become available when an existing application
      is launched on a newer kernel version.
      
      Currently it is possible to restrict sqe opcodes, sqe flags, and
      register opcodes.
      
      The IOURING_REGISTER_RESTRICTIONS call can only be made once.
      Afterwards it is not possible to change restrictions anymore.
      This prevents untrusted code from removing restrictions.
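      
      For example, a ring set up with IORING_SETUP_R_DISABLED could be
      limited to readv/writev before being handed to untrusted code
      (a sketch):
      
      	struct io_uring_restriction res[] = {
      		{ .opcode = IORING_RESTRICTION_SQE_OP,
      		  .sqe_op = IORING_OP_READV },
      		{ .opcode = IORING_RESTRICTION_SQE_OP,
      		  .sqe_op = IORING_OP_WRITEV },
      	};
      
      	syscall(__NR_io_uring_register, fd,
      		IORING_REGISTER_RESTRICTIONS, res, 2);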
      Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: use an enumeration for io_uring_register(2) opcodes · 9d4a75ef
      Stefano Garzarella authored
      The enumeration allows us to keep track of the last
      io_uring_register(2) opcode available.
      
      Behaviour and opcode names don't change.
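      
      An abridged view of the resulting enum (values stay identical to the
      old defines):
      
      	enum {
      		IORING_REGISTER_BUFFERS		= 0,
      		IORING_UNREGISTER_BUFFERS	= 1,
      		IORING_REGISTER_FILES		= 2,
      		/* ... */
      
      		/* this goes last */
      		IORING_REGISTER_LAST
      	};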
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: move io_uring_get_socket() into io_uring.h · a3ec6005
      Jens Axboe authored
      Now that we have an io_uring kernel header, move this definition out of
      fs.h and into io_uring.h where it belongs.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: reference ->nsproxy for file table commands · 9b828492
      Jens Axboe authored
      If we don't get and assign the namespace for the async work, then certain
      paths just don't work properly (like /dev/stdin, /proc/mounts, etc).
      Anything that references the current namespace of the given task should
      be assigned for async work on behalf of that task.
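      
      The grab itself is small, roughly (a sketch; the exact field placement
      in the async work item is an assumption):
      
      	/* when queueing async work, pin the issuing task's namespaces */
      	req->work.nsproxy = current->nsproxy;
      	get_nsproxy(req->work.nsproxy);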
      
      Cc: stable@vger.kernel.org # v5.5+
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: don't rely on weak ->files references · 0f212204
      Jens Axboe authored
      Grab actual references to the files_struct. To avoid circular reference
      issues due to this, we add a per-task note that keeps track of what
      io_uring contexts a task has used. When the task execs or exits its
      assigned files, we cancel requests based on this tracking.
      
      With that, we can grab proper references to the files table, and no
      longer need to rely on stashing away ring_fd and ring_file to check
      if the ring_fd may have been closed.
      
      Cc: stable@vger.kernel.org # v5.5+
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: enable task/files specific overflow flushing · e6c8aa9a
      Jens Axboe authored
      This allows us to selectively flush out pending overflows, depending on
      the task and/or files_struct being passed in.
      
      No intended functional changes in this patch.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: return cancelation status from poll/timeout/files handlers · 76e1b642
      Jens Axboe authored
      Return whether we found and canceled requests or not. This is in
      preparation for using this information, no functional changes in this
      patch.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: unconditionally grab req->task · e3bc8e9d
      Jens Axboe authored
      Sometimes we assign a weak reference to it, sometimes we grab a
      reference to it. Clean this up and make it unconditional, and drop the
      flag related to tracking this state.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: stash ctx task reference for SQPOLL · 2aede0e4
      Jens Axboe authored
      We can grab a reference to the task instead of stashing away the task
      files_struct. This is doable without creating a circular reference
      between the ring fd and the task itself.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: move dropping of files into separate helper · f573d384
      Jens Axboe authored
      No functional changes in this patch, prep patch for grabbing references
      to the files_struct.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: allow timeout/poll/files killing to take task into account · f3606e3a
      Jens Axboe authored
      We currently cancel these when the ring exits, and we cancel all of
      them. This is in preparation for killing only the ones associated
      with a given task.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge branch 'io_uring-5.9' into for-5.10/io_uring · 0f078896
      Jens Axboe authored
      * io_uring-5.9:
        io_uring: fix async buffered reads when readahead is disabled
        io_uring: fix potential ABBA deadlock in ->show_fdinfo()
        io_uring: always delete double poll wait entry on match
  2. 29 Sep, 2020 1 commit
    • io_uring: fix async buffered reads when readahead is disabled · c8d317aa
      Hao Xu authored
      The async buffered reads feature is not working when readahead is
      turned off. There are two things of concern:
      
      - when retrying in io_read(), not only the IOCB_WAITQ flag but also
        the IOCB_NOWAIT flag is still set, which makes it go down the
        would_block path in generic_file_buffered_read() and return -EAGAIN.
        After that, the io-wq thread work is queued, and the async read is
        later done the old way.
      
      - even if we remove IOCB_NOWAIT when retrying, the feature still does
        not run properly, since generic_file_buffered_read() goes to
        lock_page_killable() after calling mapping->a_ops->readpage() to do
        the IO, thus causing the process to sleep.
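      
      The retry path therefore needs to arm the waitqueue callback and let
      the retry actually block, roughly (a sketch):
      
      	/* in io_read()'s retry: wait via the page waitqueue, and
      	 * allow the retry to sleep instead of returning -EAGAIN */
      	kiocb->ki_flags |= IOCB_WAITQ;
      	kiocb->ki_flags &= ~IOCB_NOWAIT;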
      
      Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()")
      Fixes: 3b2a4439 ("io_uring: get rid of kiocb_wait_page_queue_init()")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 28 Sep, 2020 3 commits
    • Merge tag 'nfs-for-5.9-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · fb0155a0
      Linus Torvalds authored
      Pull NFS client bugfixes from Trond Myklebust:
       "Highlights include:
      
         - NFSv4.2: copy_file_range needs to invalidate caches on success
      
         - NFSv4.2: Fix security label length not being reset
      
         - pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly
           on read
      
         - pNFS/flexfiles: Fix signed/unsigned type issues with mirror
           indices"
      
      * tag 'nfs-for-5.9-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        pNFS/flexfiles: Be consistent about mirror index types
        pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly on read
        NFSv4.2: fix client's attribute cache management for copy_file_range
        nfs: Fix security label length not being reset
    • mm: do not rely on mm == current->mm in __get_user_pages_locked · a4d63c37
      Jason A. Donenfeld authored
      It seems likely this block was pasted from internal_get_user_pages_fast,
      which is not passed an mm struct and therefore uses current's.  But
      __get_user_pages_locked is passed an explicit mm, and current->mm is not
      always valid. This was hit when being called from i915, which uses:
      
        pin_user_pages_remote->
          __get_user_pages_remote->
            __gup_longterm_locked->
              __get_user_pages_locked
      
      Before, this would lead to an OOPS:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000064
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        CPU: 10 PID: 1431 Comm: kworker/u33:1 Tainted: P S   U     O      5.9.0-rc7+ #140
        Hardware name: LENOVO 20QTCTO1WW/20QTCTO1WW, BIOS N2OET47W (1.34 ) 08/06/2020
        Workqueue: i915-userptr-acquire __i915_gem_userptr_get_pages_worker [i915]
        RIP: 0010:__get_user_pages_remote+0xd7/0x310
        Call Trace:
         __i915_gem_userptr_get_pages_worker+0xc8/0x260 [i915]
         process_one_work+0x1ca/0x390
         worker_thread+0x48/0x3c0
         kthread+0x114/0x130
         ret_from_fork+0x1f/0x30
        CR2: 0000000000000064
      
      This commit fixes the problem by using the mm pointer passed to the
      function rather than the bogus one in current.
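      
      The essence of the fix (a sketch):
      
      	/* use the mm passed to the function, not current->mm */
      	if (flags & FOLL_PIN)
      		atomic_set(&mm->has_pinned, 1);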
      
      Fixes: 008cfe44 ("mm: Introduce mm_struct.has_pinned")
      Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reported-by: Harald Arnesen <harald@skogtun.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • io_uring: fix potential ABBA deadlock in ->show_fdinfo() · fad8e0de
      Jens Axboe authored
      syzbot reports a potential lock deadlock between the normal IO path and
      ->show_fdinfo():
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.9.0-rc6-syzkaller #0 Not tainted
      ------------------------------------------------------
      syz-executor.2/19710 is trying to acquire lock:
      ffff888098ddc450 (sb_writers#4){.+.+}-{0:0}, at: io_write+0x6b5/0xb30 fs/io_uring.c:3296
      
      but task is already holding lock:
      ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #2 (&ctx->uring_lock){+.+.}-{3:3}:
             __mutex_lock_common kernel/locking/mutex.c:956 [inline]
             __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
             __io_uring_show_fdinfo fs/io_uring.c:8417 [inline]
             io_uring_show_fdinfo+0x194/0xc70 fs/io_uring.c:8460
             seq_show+0x4a8/0x700 fs/proc/fd.c:65
             seq_read+0x432/0x1070 fs/seq_file.c:208
             do_loop_readv_writev fs/read_write.c:734 [inline]
             do_loop_readv_writev fs/read_write.c:721 [inline]
             do_iter_read+0x48e/0x6e0 fs/read_write.c:955
             vfs_readv+0xe5/0x150 fs/read_write.c:1073
             kernel_readv fs/splice.c:355 [inline]
             default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
             do_splice_to+0x137/0x170 fs/splice.c:871
             splice_direct_to_actor+0x307/0x980 fs/splice.c:950
             do_splice_direct+0x1b3/0x280 fs/splice.c:1059
             do_sendfile+0x55f/0xd40 fs/read_write.c:1540
             __do_sys_sendfile64 fs/read_write.c:1601 [inline]
             __se_sys_sendfile64 fs/read_write.c:1587 [inline]
             __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
             do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #1 (&p->lock){+.+.}-{3:3}:
             __mutex_lock_common kernel/locking/mutex.c:956 [inline]
             __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
             seq_read+0x61/0x1070 fs/seq_file.c:155
             pde_read fs/proc/inode.c:306 [inline]
             proc_reg_read+0x221/0x300 fs/proc/inode.c:318
             do_loop_readv_writev fs/read_write.c:734 [inline]
             do_loop_readv_writev fs/read_write.c:721 [inline]
             do_iter_read+0x48e/0x6e0 fs/read_write.c:955
             vfs_readv+0xe5/0x150 fs/read_write.c:1073
             kernel_readv fs/splice.c:355 [inline]
             default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
             do_splice_to+0x137/0x170 fs/splice.c:871
             splice_direct_to_actor+0x307/0x980 fs/splice.c:950
             do_splice_direct+0x1b3/0x280 fs/splice.c:1059
             do_sendfile+0x55f/0xd40 fs/read_write.c:1540
             __do_sys_sendfile64 fs/read_write.c:1601 [inline]
             __se_sys_sendfile64 fs/read_write.c:1587 [inline]
             __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
             do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #0 (sb_writers#4){.+.+}-{0:0}:
             check_prev_add kernel/locking/lockdep.c:2496 [inline]
             check_prevs_add kernel/locking/lockdep.c:2601 [inline]
             validate_chain kernel/locking/lockdep.c:3218 [inline]
             __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
             lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
             percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
             __sb_start_write+0x228/0x450 fs/super.c:1672
             io_write+0x6b5/0xb30 fs/io_uring.c:3296
             io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
             __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
             io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
             io_submit_sqe fs/io_uring.c:6324 [inline]
             io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
             __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
             do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      other info that might help us debug this:
      
      Chain exists of:
        sb_writers#4 --> &p->lock --> &ctx->uring_lock
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&ctx->uring_lock);
                                     lock(&p->lock);
                                     lock(&ctx->uring_lock);
        lock(sb_writers#4);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor.2/19710:
       #0: ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348
      
      stack backtrace:
      CPU: 0 PID: 19710 Comm: syz-executor.2 Not tainted 5.9.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x198/0x1fd lib/dump_stack.c:118
       check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
       check_prev_add kernel/locking/lockdep.c:2496 [inline]
       check_prevs_add kernel/locking/lockdep.c:2601 [inline]
       validate_chain kernel/locking/lockdep.c:3218 [inline]
       __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
       lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
       percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
       __sb_start_write+0x228/0x450 fs/super.c:1672
       io_write+0x6b5/0xb30 fs/io_uring.c:3296
       io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
       __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
       io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
       io_submit_sqe fs/io_uring.c:6324 [inline]
       io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
       __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x45e179
      Code: 3d b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 0b b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f1194e74c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
      RAX: ffffffffffffffda RBX: 00000000000082c0 RCX: 000000000045e179
      RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004
      RBP: 000000000118cf98 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118cf4c
      R13: 00007ffd1aa5756f R14: 00007f1194e759c0 R15: 000000000118cf4c
      
      Fix this by just not diving into details if we fail to trylock the
      io_uring mutex. We know the ctx isn't going away during this operation,
      but we cannot safely iterate buffers/files/personalities if we don't
      hold the io_uring mutex.
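      
      In essence (a sketch; the real patch threads a has_lock flag through
      the fdinfo helpers):
      
      	/* never block on uring_lock from ->show_fdinfo(); skip the
      	 * detailed dump if the lock is contended */
      	if (mutex_trylock(&ctx->uring_lock)) {
      		__io_uring_show_fdinfo(ctx, m);
      		mutex_unlock(&ctx->uring_lock);
      	}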
      
      Reported-by: syzbot+2f8fa4e860edc3066aba@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>