1. 06 Mar, 2019 1 commit
    • io_uring: add support for IORING_OP_POLL · 221c5eb2
      Jens Axboe authored
      This is basically a direct port of bfe4037e, which implements a
      one-shot poll command through aio. Description below is based on that
      commit as well. However, instead of adding a POLL command and relying
      on io_cancel(2) to remove it, we mimic the epoll(2) interface of
      having a command to add a poll notification, IORING_OP_POLL_ADD,
      and one to remove it again, IORING_OP_POLL_REMOVE.
      
      To poll for a file descriptor the application should submit an sqe of
      type IORING_OP_POLL_ADD. It will poll the fd for the events specified in the
      poll_events field.
      
      Unlike poll or epoll without EPOLLONESHOT, this interface always works in
      one-shot mode: once the sqe is completed, it will have to be
      resubmitted.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Based-on-code-from: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
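      The one-shot behaviour described in this commit can be illustrated with a
      small userspace model. This is only a sketch of the semantics, not the
      kernel implementation: struct oneshot_poll, oneshot_arm() and
      oneshot_deliver() are hypothetical names invented for the example.

```c
#include <stdbool.h>
#include <poll.h>

/* Hypothetical userspace model of a one-shot poll request. */
struct oneshot_poll {
	int fd;        /* descriptor being watched */
	short events;  /* requested events, e.g. POLLIN */
	bool armed;    /* true until the request has completed once */
};

/* Arm (or re-arm) the request, much as submitting a new sqe would. */
static void oneshot_arm(struct oneshot_poll *p, int fd, short events)
{
	p->fd = fd;
	p->events = events;
	p->armed = true;
}

/*
 * Report 'revents' to the request. If the request is armed and at least one
 * requested event fired, disarm it and return the matching events.
 * Otherwise return 0 and leave the request unchanged.
 */
static short oneshot_deliver(struct oneshot_poll *p, short revents)
{
	short hit = (short)(revents & p->events);

	if (!p->armed || !hit)
		return 0;
	p->armed = false;
	return hit;
}
```

      After oneshot_deliver() has reported a hit, later events are ignored
      until oneshot_arm() is called again, which mirrors having to resubmit
      the sqe.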
  2. 28 Feb, 2019 13 commits
    • io_uring: add io_kiocb ref count · c16361c1
      Jens Axboe authored
      We'll use this for the POLL implementation. Regular requests will
      NOT be using references, so initialize it to 0. Any real use of
      the io_kiocb ref will initialize it to at least 2.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add submission polling · 6c271ce2
      Jens Axboe authored
      This enables an application to do IO, without ever entering the kernel.
      By using the SQ ring to fill in new sqes and watching for completions
      on the CQ ring, we can submit and reap IOs without doing a single system
      call. The kernel side thread will poll for new submissions, and in case
      of HIPRI/polled IO, it'll also poll for completions.
      
      By default, we allow 1 second of active spinning. This can be changed
      by passing in a different grace period at io_uring_setup(2) time.
      If the thread exceeds this idle time without having any work to do, it
      will set:
      
      sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
      
      The application will have to call io_uring_enter() to start things back
      up again. If IO is kept busy, that will never be needed. Basically an
      application that has this feature enabled will guard its
      io_uring_enter(2) call with:
      
      read_barrier();
      if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
      	io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
      
      instead of calling it unconditionally.
      
      It's mandatory to use fixed files with this feature. Failure to do so
      will result in the application getting an -EBADF CQ entry when
      submitting IO.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
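      The wakeup guard from this commit message could be written with C11
      atomics roughly as follows. This is a minimal sketch: sq_needs_wakeup()
      is a hypothetical helper, the flag value is copied from the io_uring
      uapi header so the example stands alone, and treating the shared flags
      word as an _Atomic object is an assumption of the sketch rather than how
      the ring memory is actually declared.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Value from the io_uring uapi header, repeated so the sketch stands alone. */
#define IORING_SQ_NEED_WAKEUP (1U << 0)

/*
 * Returns true when the kernel-side SQ thread has gone idle, meaning the
 * application has to call io_uring_enter() with IORING_ENTER_SQ_WAKEUP.
 * The acquire load plays the role of read_barrier() in the commit message.
 */
static bool sq_needs_wakeup(_Atomic unsigned int *sq_flags)
{
	unsigned int flags = atomic_load_explicit(sq_flags, memory_order_acquire);

	return (flags & IORING_SQ_NEED_WAKEUP) != 0;
}
```

      An application would check sq_needs_wakeup() after filling in new sqes
      and only enter the kernel when it returns true.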
    • io_uring: add file set registration · 6b06314c
      Jens Axboe authored
      We normally have to fget/fput for each IO we do on a file. Even with
      the batching we do, the cost of the atomic inc/dec of the file usage
      count adds up.
      
      This adds the IORING_REGISTER_FILES and IORING_UNREGISTER_FILES opcodes
      for the io_uring_register(2) system call. The arguments passed in must
      be an array of __s32 holding file descriptors, and nr_args should hold
      the number of file descriptors the application wishes to pin for the
      duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
      called).
      
      When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
      member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
      to the index in the array passed in to IORING_REGISTER_FILES.
      
      Files are automatically unregistered when the io_uring instance is torn
      down. An application need only unregister if it wishes to register a new
      set of fds.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
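      The fd-to-index substitution described in this commit can be sketched as
      below. struct mini_sqe is a hypothetical, trimmed-down stand-in for the
      two sqe fields involved, sqe_use_registered_fd() is an invented helper,
      and the IOSQE_FIXED_FILE value is copied from the io_uring uapi header
      so the example stands alone.

```c
#include <stdint.h>

/* Value from the io_uring uapi header, repeated so the sketch stands alone. */
#define IOSQE_FIXED_FILE (1U << 0)

/* Hypothetical stand-in for the sqe fields that matter here. */
struct mini_sqe {
	uint8_t flags;
	int32_t fd;
};

/*
 * Look up 'real_fd' in the array that was passed to IORING_REGISTER_FILES
 * and prepare the sqe to use the registered slot instead of the real fd.
 * Returns the slot index, or -1 if the fd is not registered (the sqe is
 * left untouched in that case).
 */
static int sqe_use_registered_fd(struct mini_sqe *sqe, const int32_t *registered,
				 unsigned int nr_args, int32_t real_fd)
{
	for (unsigned int i = 0; i < nr_args; i++) {
		if (registered[i] == real_fd) {
			sqe->flags |= IOSQE_FIXED_FILE;
			sqe->fd = (int32_t)i;
			return (int)i;
		}
	}
	return -1;
}
```

      A real application would normally remember the index at registration
      time rather than searching for it on every submission; the linear
      lookup only keeps the sketch short.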
    • net: split out functions related to registering inflight socket files · f4e65870
      Jens Axboe authored
      We need this functionality for the io_uring file registration, but
      we cannot rely on it since CONFIG_UNIX can be modular. Move the helpers
      to a separate file, that's always builtin to the kernel if CONFIG_UNIX is
      m/y.
      
      No functional changes in this patch, just moving code around.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for pre-mapped user IO buffers · edafccee
      Jens Axboe authored
      If we have fixed user buffers, we can map them into the kernel when we
      set up the io_uring. That avoids the need to do get_user_pages() for
      each and every IO.
      
      To utilize this feature, the application must call io_uring_register()
      after having set up an io_uring instance, passing in
      IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
      an iovec array, and nr_args should contain how many iovecs the
      application wishes to map.
      
      If successful, these buffers are now mapped into the kernel, eligible
      for IO. To use these fixed buffers, the application must use the
      IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
      set sqe->buf_index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
      must point to somewhere inside the indexed buffer.
      
      The application may register buffers throughout the lifetime of the
      io_uring instance. It can call io_uring_register() with
      IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
      buffers, and then register a new set. The application need not
      unregister buffers explicitly before shutting down the io_uring
      instance.
      
      It's perfectly valid to set up a larger buffer, and then sometimes only
      use parts of it for an IO. As long as the range is within the originally
      mapped region, it will work just fine.
      
      For now, buffers must not be file backed. If file backed buffers are
      passed in, the registration will fail with -1/EOPNOTSUPP. This
      restriction may be relaxed in the future.
      
      RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
      arbitrary 1G per buffer size is also imposed.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
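      The rule that an IO range must lie inside the indexed registered buffer
      can be expressed as a small check. This is a userspace sketch of the
      constraint, not the kernel's validation code: fixed_range_ok() is a
      hypothetical helper, and its parameters simply mirror the buffer index,
      address and length named in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Returns true if [addr, addr + len) lies entirely inside bufs[index],
 * where 'bufs' is the iovec array passed to IORING_REGISTER_BUFFERS.
 */
static bool fixed_range_ok(const struct iovec *bufs, unsigned int nr_bufs,
			   unsigned int index, uint64_t addr, uint32_t len)
{
	if (index >= nr_bufs)
		return false;

	uint64_t start = (uint64_t)(uintptr_t)bufs[index].iov_base;
	uint64_t size = bufs[index].iov_len;

	if (addr < start)
		return false;

	/* Written with subtractions so that addr + len cannot overflow. */
	uint64_t offset = addr - start;

	return offset <= size && len <= size - offset;
}
```

      Using only part of a larger registered buffer passes this check, which
      matches the statement that any range within the originally mapped
      region is valid.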
    • block: implement bio helper to add iter bvec pages to bio · 6d0c48ae
      Jens Axboe authored
      For an ITER_BVEC, we can just iterate the iov and add the pages
      to the bio directly. For now, we grab a reference to those pages,
      and release them normally on IO completion. This isn't really needed
      for the normal case of O_DIRECT from/to a file, but some of the more
      esoteric use cases (like splice(2)) will unconditionally put the
      pipe buffer pages when the buffers are released. Until we can manage
      that case properly, ITER_BVEC pages are treated like normal pages
      in terms of reference counting.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: batch io_kiocb allocation · 2579f913
      Jens Axboe authored
      Similarly to how we use the state->ios_left to know how many references
      to get to a file, we can use it to allocate the io_kiocb's we need in
      bulk.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: use fget/fput_many() for file references · 9a56a232
      Jens Axboe authored
      Add a separate io_submit_state structure, to cache some of the things
      we need for IO submission.
      
      One such example is file reference batching: with io_submit_state, we get
      as many references as the number of sqes we are submitting, and drop
      unused ones if we end up switching files. The assumption here is that
      we're usually only dealing with one fd, and if there are multiple,
      hopefully they are at least somewhat ordered. Could trivially be extended
      to cover multiple fds, if needed.
      
      On the completion side we do the same thing, except this is trivially
      done just locally in io_iopoll_reap().
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • fs: add fget_many() and fput_many() · 091141a4
      Jens Axboe authored
      Some use cases repeatedly get and put references to the same file, but
      the only exposed interface is doing these one at a time. As each of
      these entails an atomic inc or dec on a shared structure, that cost can
      add up.
      
      Add fget_many(), which works just like fget(), except it takes an
      argument for how many references to get on the file. Ditto fput_many(),
      which can drop an arbitrary number of references to a file.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
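      The idea of taking or dropping several references with one atomic
      operation can be shown with a userspace analogue. This is not the
      kernel's fget_many()/fput_many() code: struct counted, get_many() and
      put_many() are hypothetical names, and C11 atomics stand in for the
      kernel's file usage count.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a shared object with an atomic usage count. */
struct counted {
	atomic_long refs;
};

/* Take 'n' references with one atomic add instead of n separate ones. */
static void get_many(struct counted *c, long n)
{
	atomic_fetch_add_explicit(&c->refs, n, memory_order_relaxed);
}

/*
 * Drop 'n' references with one atomic subtract. Returns 1 if this dropped
 * the last reference (the caller would then free the object), 0 otherwise.
 */
static int put_many(struct counted *c, long n)
{
	long before = atomic_fetch_sub_explicit(&c->refs, n, memory_order_acq_rel);

	return before == n;
}
```

      Submitting a batch of eight IOs against one file then costs a single
      get_many(&c, 8) rather than eight separate increments on the shared
      counter.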
    • io_uring: support for IO polling · def596e9
      Jens Axboe authored
      Add support for a polled io_uring instance. When a read or write is
      submitted to a polled io_uring, the application must poll for
      completions on the CQ ring through io_uring_enter(2). Polled IO may not
      generate IRQ completions, hence they need to be actively found by the
      application itself.
      
      To use polling, io_uring_setup() must be used with the
      IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
      polled and non-polled IO on an io_uring.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add fsync support · c992fe29
      Christoph Hellwig authored
      Add a new fsync opcode, which either syncs a range if one is passed,
      or the whole file if the offset and length fields are both cleared
      to zero.  A flag is provided to use fdatasync semantics, that is, only
      force out the metadata which is required to retrieve the file data, but
      not other metadata.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Add io_uring IO interface · 2b188cc1
      Jens Axboe authored
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time, this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
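      The indirection between the SQ ring and the io_uring_sqe array can be
      modelled in a few lines. This is a single-threaded sketch of the layout
      idea only: struct demo_sq, struct demo_sqe, demo_sq_push() and
      demo_sq_pop() are hypothetical names, the ring size is arbitrary, and
      the memory barriers a real shared ring needs are left out.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_ENTRIES 8U                 /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1)

/* Hypothetical stand-in for a submission entry. */
struct demo_sqe {
	uint64_t user_data;
};

struct demo_sq {
	uint32_t head;                      /* consumer position (kernel side) */
	uint32_t tail;                      /* producer position (application side) */
	uint32_t array[RING_ENTRIES];       /* the ring: indexes into sqes[] */
	struct demo_sqe sqes[RING_ENTRIES]; /* the submission entries themselves */
};

/* Queue sqes[sqe_index] for submission. Returns 0, or -1 if the ring is full. */
static int demo_sq_push(struct demo_sq *sq, uint32_t sqe_index)
{
	if (sq->tail - sq->head == RING_ENTRIES)
		return -1;
	sq->array[sq->tail & RING_MASK] = sqe_index;
	sq->tail++;
	return 0;
}

/* Consume the next queued entry, as the kernel side would. NULL if empty. */
static struct demo_sqe *demo_sq_pop(struct demo_sq *sq)
{
	if (sq->head == sq->tail)
		return NULL;

	uint32_t idx = sq->array[sq->head & RING_MASK];

	sq->head++;
	/* Mask the stored index so it always stays inside sqes[]. */
	return &sq->sqes[idx & RING_MASK];
}
```

      Because the ring stores indexes rather than the entries themselves, an
      application can fill sqes[5] and sqes[2] and still submit them as one
      batch, in that order.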
    • block: introduce mp_bvec_for_each_page() for iterating over page · 594b9a89
      Ming Lei authored
      mp_bvec_for_each_segment() is a bit big for the iteration, so introduce
      a light-weight helper for iterating over pages, which saves 32 bytes of
      stack space.
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 27 Feb, 2019 3 commits
  4. 24 Feb, 2019 4 commits
  5. 22 Feb, 2019 2 commits
  6. 21 Feb, 2019 3 commits
    • block: bounce: make sure that bvec table is updated · 8f4e80da
      Ming Lei authored
      Block bounce needs to allocate a new page for doing IO, and the
      new page has to be stored in the bvec table.
      
      Commit 6dc4f100 switches __blk_queue_bounce() to use the new
      bio_for_each_segment_all() interface. Unfortunately the new
      bio_for_each_segment_all() can't be used to update the bvec table.
      
      This patch fixes the issue by retrieving the bvec from the table
      directly, so that the newly allocated page can be stored in the bio.
      This is safe because the cloned bio uses single-page bvecs.
      
      Fixes: 6dc4f100 ("block: allow bio_for_each_segment_all() to iterate over multi-page bvec")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge branch 'nvme-5.1' of git://git.infradead.org/nvme into for-5.1/block · 037b2625
      Jens Axboe authored
      Pull NVMe changes for 5.1 from Christoph
      
      * 'nvme-5.1' of git://git.infradead.org/nvme: (22 commits)
        nvme-rdma: use nr_phys_segments when map rq to sgl
        nvmet: convert to SPDX identifiers
        nvmet-rdma: convert to SPDX identifiers
        nvme-loop: convert to SPDX identifiers
        nvmet-fcloop: convert to SPDX identifiers
        nvmet-fc: convert to SPDX identifiers
        nvme: convert to SPDX identifiers
        nvme-pci: convert to SPDX identifiers
        nvme-lightnvm: convert to SPDX identifiers
        nvme-rdma: convert to SPDX identifiers
        nvme-fc: convert to SPDX identifiers
        nvme-fabrics: convert to SPDX identifiers
        nvme-tcp.h: fix SPDX header
        nvme_ioctl.h: remove duplicate GPL boilerplate
        nvme: return error from nvme_alloc_ns()
        nvme: avoid that deleting a controller triggers a circular locking complaint
        nvme: introduce a helper function for controller deletion
        nvme: unexport nvme_delete_ctrl_sync()
        nvme-pci: check kstrtoint() return value in queue_count_set()
        nvme-fabrics: document the poll function argument
        ...
    • nvme-rdma: use nr_phys_segments when map rq to sgl · 34e08191
      Chaitanya Kulkarni authored
      Use blk_rq_nr_phys_segments() instead of blk_rq_payload_bytes() to check
      if a command contains data to be mapped.  This fixes the case where
      a struct request contains LBAs, but it has no payload, such as
      Write Zeroes support.
      
      Fixes: 6e02318e ("nvme: add support for the Write Zeroes command")
      Reported-by: Ming Lei <tom.leiming@gmail.com>
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Tested-by: Ming Lei <tom.leiming@gmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  7. 20 Feb, 2019 14 commits