1. 28 Feb, 2019 2 commits
    • Add io_uring IO interface · 2b188cc1
      Jens Axboe authored
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
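
      The indirection described above can be modelled in a few lines of
      user-space C. This is only a sketch under stated assumptions: the
      demo_* names, the ring size, and the one-field descriptor are
      hypothetical stand-ins, not the kernel's layout, and the memory
      barriers a real shared ring needs are left out because the model is
      single-threaded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ENTRIES 8u /* hypothetical ring size, must be a power of two */

/* Hypothetical, heavily simplified stand-in for io_uring_sqe. */
struct demo_sqe {
    uint64_t user_data;
};

struct demo_sq {
    uint32_t head;                      /* consumer position */
    uint32_t tail;                      /* producer position */
    uint32_t array[DEMO_ENTRIES];       /* the ring: indices into sqes[] */
    struct demo_sqe sqes[DEMO_ENTRIES]; /* the descriptors themselves */
};

/* Producer side: publish the descriptor stored in slot sqe_index. */
static int demo_sq_push(struct demo_sq *sq, uint32_t sqe_index)
{
    assert(sqe_index < DEMO_ENTRIES);
    if (sq->tail - sq->head == DEMO_ENTRIES)
        return -1; /* ring is full */
    sq->array[sq->tail & (DEMO_ENTRIES - 1)] = sqe_index;
    sq->tail++;
    return 0;
}

/* Consumer side: return the next published descriptor, or NULL if empty. */
static const struct demo_sqe *demo_sq_pop(struct demo_sq *sq)
{
    if (sq->head == sq->tail)
        return NULL;
    uint32_t idx = sq->array[sq->head & (DEMO_ENTRIES - 1)];
    sq->head++;
    return &sq->sqes[idx];
}

/* Usage: slots 5 and 1 are not adjacent, yet they form one batch. */
static void demo_sq_usage(void)
{
    struct demo_sq sq = {0};
    sq.sqes[5].user_data = 500;
    sq.sqes[1].user_data = 100;
    assert(demo_sq_push(&sq, 5) == 0);
    assert(demo_sq_push(&sq, 1) == 0);
    assert(demo_sq_pop(&sq)->user_data == 500);
    assert(demo_sq_pop(&sq)->user_data == 100);
    assert(demo_sq_pop(&sq) == NULL);
}
```

      Because the ring stores indices rather than descriptors, the producer
      can fill descriptor slots in any order and still publish them as one
      consecutive run of ring entries.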
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time, this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
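
      The parameter rules above can be restated as a small decision
      function. This is a sketch of the behaviour as the text describes it,
      not the kernel implementation: the demo_* names, the cq_ready
      parameter (completions already in the CQ ring), and the flag value
      are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the IORING_ENTER_GETEVENTS flag named in the text;
 * the numeric value here is illustrative only. */
#define DEMO_ENTER_GETEVENTS (1u << 0)

struct demo_enter_plan {
    bool submit; /* try to submit new IO */
    bool wait;   /* block until enough completions are available */
};

/* Decide what one call does, given its arguments and the number of
 * completions that are already available. */
static struct demo_enter_plan demo_plan_enter(unsigned to_submit,
                                              unsigned min_complete,
                                              unsigned flags,
                                              unsigned cq_ready)
{
    struct demo_enter_plan plan;

    plan.submit = to_submit != 0;
    plan.wait = (flags & DEMO_ENTER_GETEVENTS) != 0 &&
                cq_ready < min_complete;
    return plan;
}

/* Usage: the three combinations the text calls out. */
static void demo_enter_usage(void)
{
    struct demo_enter_plan p;

    /* Submit only. */
    p = demo_plan_enter(4, 0, 0, 0);
    assert(p.submit && !p.wait);

    /* Wait only: two events wanted, none available yet. */
    p = demo_plan_enter(0, 2, DEMO_ENTER_GETEVENTS, 0);
    assert(!p.submit && p.wait);

    /* GETEVENTS with min_complete == 0 never waits. */
    p = demo_plan_enter(0, 0, DEMO_ENTER_GETEVENTS, 0);
    assert(!p.submit && !p.wait);
}
```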
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
      
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: introduce mp_bvec_for_each_page() for iterating over page · 594b9a89
      Ming Lei authored
      mp_bvec_for_each_segment() is a bit heavy for this iteration, so
      introduce a lightweight helper for iterating over pages, which saves
      32 bytes of stack space.
      
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 27 Feb, 2019 3 commits
  3. 24 Feb, 2019 4 commits
  4. 22 Feb, 2019 2 commits
  5. 21 Feb, 2019 3 commits
    • block: bounce: make sure that bvec table is updated · 8f4e80da
      Ming Lei authored
      Block bounce needs to allocate a new page for doing IO, and the
      new page has to be stored in the bvec table.
      
      Commit 6dc4f100 switches __blk_queue_bounce() to use the new
      bio_for_each_segment_all() interface. Unfortunately the new
      bio_for_each_segment_all() can't be used to update bvec table.
      
      This patch fixes the issue by retrieving the bvec from the table
      directly, so that the newly allocated page can be stored in the bio.
      This is safe because the cloned bio uses single-page bvecs.
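
      The difference between iterating over copies and iterating over the
      table itself can be shown with a small user-space model. The demo_*
      types and functions are hypothetical and far simpler than the block
      layer's; the sketch only illustrates why a write through the
      iteration variable can miss the table, and does not reproduce the
      actual iterator.

```c
#include <assert.h>

/* Hypothetical simplified stand-ins for struct page, bio_vec and bio. */
struct demo_page { int id; };
struct demo_bvec { struct demo_page *page; unsigned len; unsigned offset; };
struct demo_bio  { struct demo_bvec *table; unsigned short vcnt; };

/* Iterates over a local copy of each entry: the new page is written to
 * the copy, so the table still points at the original pages afterwards. */
static void demo_bounce_via_copy(struct demo_bio *bio, struct demo_page *bounce)
{
    for (unsigned short i = 0; i < bio->vcnt; i++) {
        struct demo_bvec bv = bio->table[i];
        bv.page = &bounce[i]; /* lost when bv goes out of scope */
        (void)bv;
    }
}

/* Takes each entry from the table directly: the update is visible. */
static void demo_bounce_via_table(struct demo_bio *bio, struct demo_page *bounce)
{
    for (unsigned short i = 0; i < bio->vcnt; i++) {
        struct demo_bvec *bv = &bio->table[i];
        bv->page = &bounce[i];
    }
}

/* Usage: only the table-based variant installs the bounce pages. */
static void demo_bounce_usage(void)
{
    struct demo_page orig[2] = { {1}, {2} };
    struct demo_page bounce[2] = { {101}, {102} };
    struct demo_bvec table[2] = {
        { &orig[0], 4096, 0 },
        { &orig[1], 4096, 0 },
    };
    struct demo_bio bio = { table, 2 };

    demo_bounce_via_copy(&bio, bounce);
    assert(table[0].page == &orig[0] && table[1].page == &orig[1]);

    demo_bounce_via_table(&bio, bounce);
    assert(table[0].page == &bounce[0] && table[1].page == &bounce[1]);
}
```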
      
      Fixes: 6dc4f100 ("block: allow bio_for_each_segment_all() to iterate over multi-page bvec")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge branch 'nvme-5.1' of git://git.infradead.org/nvme into for-5.1/block · 037b2625
      Jens Axboe authored
      Pull NVMe changes for 5.1 from Christoph
      
      * 'nvme-5.1' of git://git.infradead.org/nvme: (22 commits)
        nvme-rdma: use nr_phys_segments when map rq to sgl
        nvmet: convert to SPDX identifiers
        nvmet-rdma: convert to SPDX identifiers
        nvme-loop: convert to SPDX identifiers
        nvmet-fcloop: convert to SPDX identifiers
        nvmet-fc: convert to SPDX identifiers
        nvme: convert to SPDX identifiers
        nvme-pci: convert to SPDX identifiers
        nvme-lightnvm: convert to SPDX identifiers
        nvme-rdma: convert to SPDX identifiers
        nvme-fc: convert to SPDX identifiers
        nvme-fabrics: convert to SPDX identifiers
        nvme-tcp.h: fix SPDX header
        nvme_ioctl.h: remove duplicate GPL boilerplate
        nvme: return error from nvme_alloc_ns()
        nvme: avoid that deleting a controller triggers a circular locking complaint
        nvme: introduce a helper function for controller deletion
        nvme: unexport nvme_delete_ctrl_sync()
        nvme-pci: check kstrtoint() return value in queue_count_set()
        nvme-fabrics: document the poll function argument
        ...
    • nvme-rdma: use nr_phys_segments when map rq to sgl · 34e08191
      Chaitanya Kulkarni authored
      Use blk_rq_nr_phys_segments() instead of blk_rq_payload_bytes() to check
      if a command contains data to be mapped.  This fixes the case where
      a struct request covers LBAs but carries no payload, such as a
      Write Zeroes command.
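
      A minimal model of the two checks, assuming that the byte count of a
      request reflects the LBA range it covers while the segment count
      reflects the data buffers actually attached. The demo_* names are
      hypothetical and do not correspond to the real struct request or to
      the driver code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simplified request: a byte count derived from the LBA
 * range, and the number of physical segments holding payload data. */
struct demo_rq {
    unsigned int data_len;
    unsigned short nr_phys_segments;
};

/* Byte-based check: any request that covers LBAs looks like it has data. */
static bool demo_has_data_by_bytes(const struct demo_rq *rq)
{
    return rq->data_len != 0;
}

/* Segment-based check: only requests with attached buffers have data. */
static bool demo_has_data_by_segments(const struct demo_rq *rq)
{
    return rq->nr_phys_segments != 0;
}

/* Usage: a write-zeroes-like request covers 8 sectors but has no buffer. */
static void demo_rq_usage(void)
{
    struct demo_rq write_zeroes = { 8 * 512, 0 };
    struct demo_rq normal_write = { 8 * 512, 1 };

    assert(demo_has_data_by_bytes(&write_zeroes));     /* false positive */
    assert(!demo_has_data_by_segments(&write_zeroes)); /* nothing to map */
    assert(demo_has_data_by_bytes(&normal_write));
    assert(demo_has_data_by_segments(&normal_write));
}
```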
      
      Fixes: 6e02318e ("nvme: add support for the Write Zeroes command")
      Reported-by: Ming Lei <tom.leiming@gmail.com>
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Tested-by: Ming Lei <tom.leiming@gmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  6. 20 Feb, 2019 21 commits
  7. 19 Feb, 2019 1 commit
  8. 15 Feb, 2019 4 commits
    • Merge tag 'v5.0-rc6' into for-5.1/block · 6fb845f0
      Jens Axboe authored
      Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
      This is needed since io_uring is now based on the block branch,
      to avoid a conflict between the multi-page bvecs and the bits
      of io_uring that touch the core block parts.
      
      * tag 'v5.0-rc6': (525 commits)
        Linux 5.0-rc6
        x86/mm: Make set_pmd_at() paravirt aware
        MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
        blk-mq: remove duplicated definition of blk_mq_freeze_queue
        Blk-iolatency: warn on negative inflight IO counter
        blk-iolatency: fix IO hang due to negative inflight counter
        MAINTAINERS: unify reference to xen-devel list
        x86/mm/cpa: Fix set_mce_nospec()
        futex: Handle early deadlock return correctly
        futex: Fix barrier comment
        net: dsa: b53: Fix for failure when irq is not defined in dt
        blktrace: Show requests without sector
        mips: cm: reprime error cause
        mips: loongson64: remove unreachable(), fix loongson_poweroff().
        sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
        geneve: should not call rt6_lookup() when ipv6 was disabled
        KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
        KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
        kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
        signal: Better detection of synchronous signals
        ...
    • block: kill BLK_MQ_F_SG_MERGE · 56d18f62
      Ming Lei authored
      QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: kill QUEUE_FLAG_NO_SG_MERGE · 2705c937
      Ming Lei authored
      Since bdced438 ("block: setup bi_phys_segments after splitting"),
      physical segment number is mainly figured out in blk_queue_split() for
      fast path, and the flag of BIO_SEG_VALID is set there too.
      
      Now only blk_recount_segments() and blk_recalc_rq_segments() use this
      flag.
      
      Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID
      is set in blk_queue_split().
      
      For another user of blk_recalc_rq_segments():
      
      - run in partial completion branch of blk_update_request, which is an unusual case
      
      - run in blk_cloned_rq_check_limits(), still not a big problem if the flag is killed
      since dm-rq is the only user.
      
      Multi-page bvec is enabled now, so not doing S/G merging is rather pointless
      with the current setup of the I/O path, as it isn't going to save a significant
      number of cycles.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: document usage of bio iterator helpers · ac4fa1d1
      Ming Lei authored
      Now that multi-page bvec is supported, some helpers return data page
      by page while others return it segment by segment. This patch
      documents the usage.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>