Commit 38e7571c authored by Linus Torvalds

Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block

Pull io_uring IO interface from Jens Axboe:
 "Second attempt at adding the io_uring interface.

  Since the first one, we've added basic unit testing of the three
  system calls, which resides in liburing like the other unit tests that
  we have so far. It'll take a while to get full coverage of it, but
  we're working towards it. I've also added two basic test programs to
  tools/io_uring. One uses the raw interface and has support for all the
  various features that io_uring supports outside of standard IO, like
  fixed files, fixed IO buffers, and polled IO. The other uses the
  liburing API, and is a simplified version of cp(1).
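
  To make the "fixed buffers" feature concrete: the application
  registers its IO buffers once, up front, and then refers to them by
  index in each submission instead of passing in iovecs that the kernel
  must map on every request. A minimal, hypothetical sketch using the
  raw syscall added by this series (the wrapper name and buffer size
  are made up, and error handling is trimmed):

      #include <linux/io_uring.h>    /* IORING_REGISTER_BUFFERS */
      #include <sys/syscall.h>
      #include <sys/uio.h>
      #include <stdlib.h>
      #include <unistd.h>

      #define BUF_LEN (64 * 1024)

      /* thin wrapper; __NR_io_uring_register is 427 in this series */
      static int io_uring_register(int fd, unsigned int op, void *arg,
                                   unsigned int nr_args)
      {
              return syscall(__NR_io_uring_register, fd, op, arg, nr_args);
      }

      static int register_one_buffer(int ring_fd, void **bufp)
      {
              struct iovec iov;

              if (posix_memalign(bufp, 4096, BUF_LEN))
                      return -1;
              iov.iov_base = *bufp;
              iov.iov_len = BUF_LEN;

              /* buffer 0 can now be used by IORING_OP_READ_FIXED and
               * IORING_OP_WRITE_FIXED submissions with sqe->buf_index = 0 */
              return io_uring_register(ring_fd, IORING_REGISTER_BUFFERS, &iov, 1);
      }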

  This adds support for a new IO interface, io_uring.

  io_uring allows an application to communicate with the kernel through
  two rings, the submission queue (SQ) and completion queue (CQ) rings.
  This allows for very efficient handling of IOs, see the v5 posting for
  some basic numbers:

    https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/

  Outside of just efficiency, the interface is also flexible and
  extendable, and allows for future use cases like the upcoming NVMe
  key-value store API, networked IO, and so on. It also supports async
  buffered IO, something that we've always failed to support in the
  kernel.

  Outside of basic IO features, it supports async polled IO as well.
  This particular feature was already tested at Facebook months ago on
  flash storage boxes, with 25-33% improvements. It makes polled IO
  actually useful for real world use cases, where even basic flash sees
  a nice win in terms of efficiency, latency, and performance. These
  boxes were IOPS bound before, now they are not.

  This series adds three new system calls. One for setting up an
  io_uring instance (io_uring_setup(2)), one for submitting/completing
  IO (io_uring_enter(2)), and one for aux functions like registering
  file sets, buffers, etc (io_uring_register(2)). Through the help of
  Arnd, I've coordinated the syscall numbers so the merge on that front
  should be painless.
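
  For orientation, here is roughly what raw setup looks like (a minimal
  sketch, not part of the series itself; it assumes the new uapi header
  and syscall number are installed, maps only the SQ ring, and skips
  the CQ ring, the SQE array, and error handling):

      #include <linux/io_uring.h>    /* struct io_uring_params, IORING_OFF_* */
      #include <sys/syscall.h>
      #include <sys/mman.h>
      #include <string.h>
      #include <unistd.h>

      static int io_uring_setup(unsigned int entries, struct io_uring_params *p)
      {
              return syscall(__NR_io_uring_setup, entries, p);
      }

      static int setup_sq_ring(void)
      {
              struct io_uring_params p;
              void *sq_ring;
              size_t sq_sz;
              int fd;

              memset(&p, 0, sizeof(p));
              fd = io_uring_setup(4, &p);    /* 4-entry SQ, kernel fills in p */
              if (fd < 0)
                      return -1;

              /* head, tail, mask etc live at the p.sq_off offsets; the ring
               * ends with the array of sqe indices */
              sq_sz = p.sq_off.array + p.sq_entries * sizeof(__u32);
              sq_ring = mmap(NULL, sq_sz, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);
              if (sq_ring == MAP_FAILED)
                      return -1;
              return fd;
      }

  Submissions are then written into the mmap'ed SQE array and published
  by advancing the SQ tail, after which io_uring_enter(2) tells the
  kernel to process them; the liburing code included in this merge does
  exactly that.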

  Jon did a writeup of the interface a while back, which (except for
  minor details that have been tweaked) is still accurate. Find that
  here:

    https://lwn.net/Articles/776703/

  Huge thanks to Al Viro for helping get the reference cycle code
  correct, and to Jann Horn for his extensive reviews focused on both
  security and bugs in general.

  There's a userspace library that provides basic functionality for
  applications that don't need or want to care about how to fiddle with
  the rings directly. It has helpers to allow applications to easily set
  up an io_uring instance, and submit/complete IO through it without
  knowing about the intricacies of the rings. It also includes man pages
  (thanks to Jeff Moyer), and will continue to grow helper functions
  and features as time progresses. Find it here:

    git://git.kernel.dk/liburing
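
  As a flavor of the library API in this tree, a single buffered read
  could be issued like this (a rough sketch against the helpers in
  liburing.h, with most error handling left out):

      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/uio.h>
      #include <unistd.h>
      #include "liburing.h"

      static char buf[4096];

      static int read_first_block(const char *path)
      {
              struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
              struct io_uring_sqe *sqe;
              struct io_uring_cqe *cqe;
              struct io_uring ring;
              int fd, ret;

              if (io_uring_queue_init(8, &ring, 0) < 0)
                      return -1;
              fd = open(path, O_RDONLY);

              /* grab an sqe, prep a readv at offset 0, and submit it */
              sqe = io_uring_get_sqe(&ring);
              io_uring_prep_readv(sqe, fd, &iov, 1, 0);
              io_uring_submit(&ring);

              /* wait for the completion; cqe->res holds the read result */
              ret = io_uring_wait_completion(&ring, &cqe);
              if (!ret && cqe->res >= 0)
                      printf("read %d bytes\n", cqe->res);

              close(fd);
              io_uring_queue_exit(&ring);
              return ret;
      }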

  Fio has full support for the raw interface, both in the form of an IO
  engine (io_uring) and a small test application (t/io_uring) that can
  exercise and benchmark the interface"

* tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
  io_uring: add a few test tools
  io_uring: allow workqueue item to handle multiple buffered requests
  io_uring: add support for IORING_OP_POLL
  io_uring: add io_kiocb ref count
  io_uring: add submission polling
  io_uring: add file set registration
  net: split out functions related to registering inflight socket files
  io_uring: add support for pre-mapped user IO buffers
  block: implement bio helper to add iter bvec pages to bio
  io_uring: batch io_kiocb allocation
  io_uring: use fget/fput_many() for file references
  fs: add fget_many() and fput_many()
  io_uring: support for IO polling
  io_uring: add fsync support
  Add io_uring IO interface
parents 80201fe1 21b4aa5d
@@ -429,3 +429,6 @@
421 i386 rt_sigtimedwait_time64 sys_rt_sigtimedwait __ia32_compat_sys_rt_sigtimedwait_time64
422 i386 futex_time64 sys_futex __ia32_sys_futex
423 i386 sched_rr_get_interval_time64 sys_sched_rr_get_interval __ia32_sys_sched_rr_get_interval
425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup
426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter
427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register
@@ -345,6 +345,9 @@
334 common rseq __x64_sys_rseq
# don't use numbers 387 through 423, add new calls after the last
# 'common' entry
425 common io_uring_setup __x64_sys_io_uring_setup
426 common io_uring_enter __x64_sys_io_uring_enter
427 common io_uring_register __x64_sys_io_uring_register
#
# x32-specific system call numbers start at 512 to avoid cache impact
......
@@ -836,6 +836,40 @@ int bio_add_page(struct bio *bio, struct page *page,
}
EXPORT_SYMBOL(bio_add_page);
static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
{
const struct bio_vec *bv = iter->bvec;
unsigned int len;
size_t size;
if (WARN_ON_ONCE(iter->iov_offset > bv->bv_len))
return -EINVAL;
len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
size = bio_add_page(bio, bv->bv_page, len,
bv->bv_offset + iter->iov_offset);
if (size == len) {
struct page *page;
int i;
/*
* For the normal O_DIRECT case, we could skip grabbing this
* reference and then not have to put them again when IO
* completes. But this breaks some in-kernel users, like
* splicing to/from a loop device, where we release the pipe
* pages unconditionally. If we can fix that case, we can
* get rid of the get here and the need to call
* bio_release_pages() at IO completion time.
*/
mp_bvec_for_each_page(page, bv, i)
get_page(page);
iov_iter_advance(iter, size);
return 0;
}
return -EINVAL;
}
#define PAGE_PTRS_PER_BVEC (sizeof(struct bio_vec) / sizeof(struct page *))
/**
@@ -884,23 +918,35 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
}
/**
* bio_iov_iter_get_pages - add user or kernel pages to a bio
* @bio: bio to add pages to
* @iter: iov iterator describing the region to be added
*
* This takes either an iterator pointing to user memory, or one pointing to
* kernel pages (BVEC iterator). If we're adding user pages, we pin them and
* map them into the kernel. On IO completion, the caller should put those
* pages. For now, when adding kernel pages, we still grab a reference to the
* page. This isn't strictly needed for the common case, but some call paths
* end up releasing pages from eg a pipe and we can't easily control these.
* See comment in __bio_iov_bvec_add_pages().
*
* Pins pages from *iter and appends them to @bio's bvec array. The
* pages will have to be released using put_page() when done.
* The function tries, but does not guarantee, to pin as many pages as
* fit into the bio, or are requested in *iter, whatever is smaller. If
* MM encounters an error pinning the requested pages, it stops. Error
* is returned only if 0 pages could be pinned.
*/
int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
const bool is_bvec = iov_iter_is_bvec(iter);
unsigned short orig_vcnt = bio->bi_vcnt;
do {
int ret;
if (is_bvec)
ret = __bio_iov_bvec_add_pages(bio, iter);
else
ret = __bio_iov_iter_get_pages(bio, iter);
if (unlikely(ret))
return bio->bi_vcnt > orig_vcnt ? 0 : ret;
......
@@ -31,6 +31,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
obj-$(CONFIG_AIO) += aio.o
obj-$(CONFIG_IO_URING) += io_uring.o
obj-$(CONFIG_FS_DAX) += dax.o
obj-$(CONFIG_FS_ENCRYPTION) += crypto/
obj-$(CONFIG_FILE_LOCKING) += locks.o
......
@@ -706,7 +706,7 @@ void do_close_on_exec(struct files_struct *files)
spin_unlock(&files->file_lock);
}
static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
{
struct files_struct *files = current->files;
struct file *file;
@@ -721,7 +721,7 @@ static struct file *__fget(unsigned int fd, fmode_t mask)
*/
if (file->f_mode & mask)
file = NULL;
else if (!get_file_rcu_many(file, refs))
goto loop;
}
rcu_read_unlock();
@@ -729,15 +729,20 @@ static struct file *__fget(unsigned int fd, fmode_t mask)
return file;
}
struct file *fget_many(unsigned int fd, unsigned int refs)
{
return __fget(fd, FMODE_PATH, refs);
}
struct file *fget(unsigned int fd)
{
return __fget(fd, FMODE_PATH, 1);
}
EXPORT_SYMBOL(fget);
struct file *fget_raw(unsigned int fd)
{
return __fget(fd, 0, 1);
}
EXPORT_SYMBOL(fget_raw);
@@ -768,7 +773,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask)
return 0;
return (unsigned long)file;
} else {
file = __fget(fd, mask, 1);
if (!file)
return 0;
return FDPUT_FPUT | (unsigned long)file;
......
@@ -326,9 +326,9 @@ void flush_delayed_fput(void)
static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);
void fput_many(struct file *file, unsigned int refs)
{
if (atomic_long_sub_and_test(refs, &file->f_count)) {
struct task_struct *task = current;
if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
@@ -347,6 +347,11 @@ void fput(struct file *file)
}
}
void fput(struct file *file)
{
fput_many(file, 1);
}
/*
* synchronous analog of fput(); for kernel threads that might be needed
* in some umount() (and thus can't use flush_delayed_fput() without
......
@@ -13,6 +13,7 @@
struct file;
extern void fput(struct file *);
extern void fput_many(struct file *, unsigned int);
struct file_operations;
struct vfsmount;
@@ -44,6 +45,7 @@ static inline void fdput(struct fd fd)
}
extern struct file *fget(unsigned int fd);
extern struct file *fget_many(unsigned int fd, unsigned int refs);
extern struct file *fget_raw(unsigned int fd);
extern unsigned long __fdget(unsigned int fd);
extern unsigned long __fdget_raw(unsigned int fd);
......
@@ -961,7 +961,9 @@ static inline struct file *get_file(struct file *f)
atomic_long_inc(&f->f_count);
return f;
}
#define get_file_rcu_many(x, cnt) \
atomic_long_add_unless(&(x)->f_count, (cnt), 0)
#define get_file_rcu(x) get_file_rcu_many((x), 1)
#define fput_atomic(x) atomic_long_add_unless(&(x)->f_count, -1, 1)
#define file_count(x) atomic_long_read(&(x)->f_count)
@@ -3511,4 +3513,13 @@ extern void inode_nohighmem(struct inode *inode);
extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
int advice);
#if defined(CONFIG_IO_URING)
extern struct sock *io_uring_get_socket(struct file *file);
#else
static inline struct sock *io_uring_get_socket(struct file *file)
{
return NULL;
}
#endif
#endif /* _LINUX_FS_H */
@@ -40,7 +40,7 @@ struct user_struct {
kuid_t uid;
#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
defined(CONFIG_NET) || defined(CONFIG_IO_URING)
atomic_long_t locked_vm;
#endif
......
@@ -69,6 +69,7 @@ struct file_handle;
struct sigaltstack;
struct rseq;
union bpf_attr;
struct io_uring_params;
#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -314,6 +315,13 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id,
struct io_event __user *events,
struct old_timespec32 __user *timeout,
const struct __aio_sigset *sig);
asmlinkage long sys_io_uring_setup(u32 entries,
struct io_uring_params __user *p);
asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
u32 min_complete, u32 flags,
const sigset_t __user *sig, size_t sigsz);
asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op,
void __user *arg, unsigned int nr_args);
/* fs/xattr.c */
asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
......
@@ -10,6 +10,7 @@
void unix_inflight(struct user_struct *user, struct file *fp);
void unix_notinflight(struct user_struct *user, struct file *fp);
void unix_destruct_scm(struct sk_buff *skb);
void unix_gc(void);
void wait_for_unix_gc(void);
struct sock *unix_get_socket(struct file *filp);
......
@@ -824,8 +824,15 @@ __SYSCALL(__NR_futex_time64, sys_futex)
__SYSCALL(__NR_sched_rr_get_interval_time64, sys_sched_rr_get_interval)
#endif
#define __NR_io_uring_setup 425
__SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
#define __NR_io_uring_enter 426
__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
#define __NR_io_uring_register 427
__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
#undef __NR_syscalls
#define __NR_syscalls 428
/*
* 32 bit systems traditionally used different
......
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
/*
* Header file for the io_uring interface.
*
* Copyright (C) 2019 Jens Axboe
* Copyright (C) 2019 Christoph Hellwig
*/
#ifndef LINUX_IO_URING_H
#define LINUX_IO_URING_H
#include <linux/fs.h>
#include <linux/types.h>
/*
* IO submission data structure (Submission Queue Entry)
*/
struct io_uring_sqe {
__u8 opcode; /* type of operation for this sqe */
__u8 flags; /* IOSQE_ flags */
__u16 ioprio; /* ioprio for the request */
__s32 fd; /* file descriptor to do IO on */
__u64 off; /* offset into file */
__u64 addr; /* pointer to buffer or iovecs */
__u32 len; /* buffer size or number of iovecs */
union {
__kernel_rwf_t rw_flags;
__u32 fsync_flags;
__u16 poll_events;
};
__u64 user_data; /* data to be passed back at completion time */
union {
__u16 buf_index; /* index into fixed buffers, if used */
__u64 __pad2[3];
};
};
/*
* sqe->flags
*/
#define IOSQE_FIXED_FILE (1U << 0) /* use fixed fileset */
/*
* io_uring_setup() flags
*/
#define IORING_SETUP_IOPOLL (1U << 0) /* io_context is polled */
#define IORING_SETUP_SQPOLL (1U << 1) /* SQ poll thread */
#define IORING_SETUP_SQ_AFF (1U << 2) /* sq_thread_cpu is valid */
#define IORING_OP_NOP 0
#define IORING_OP_READV 1
#define IORING_OP_WRITEV 2
#define IORING_OP_FSYNC 3
#define IORING_OP_READ_FIXED 4
#define IORING_OP_WRITE_FIXED 5
#define IORING_OP_POLL_ADD 6
#define IORING_OP_POLL_REMOVE 7
/*
* sqe->fsync_flags
*/
#define IORING_FSYNC_DATASYNC (1U << 0)
/*
* IO completion data structure (Completion Queue Entry)
*/
struct io_uring_cqe {
__u64 user_data; /* sqe->data submission passed back */
__s32 res; /* result code for this event */
__u32 flags;
};
/*
* Magic offsets for the application to mmap the data it needs
*/
#define IORING_OFF_SQ_RING 0ULL
#define IORING_OFF_CQ_RING 0x8000000ULL
#define IORING_OFF_SQES 0x10000000ULL
/*
* Filled with the offset for mmap(2)
*/
struct io_sqring_offsets {
__u32 head;
__u32 tail;
__u32 ring_mask;
__u32 ring_entries;
__u32 flags;
__u32 dropped;
__u32 array;
__u32 resv1;
__u64 resv2;
};
/*
* sq_ring->flags
*/
#define IORING_SQ_NEED_WAKEUP (1U << 0) /* needs io_uring_enter wakeup */
struct io_cqring_offsets {
__u32 head;
__u32 tail;
__u32 ring_mask;
__u32 ring_entries;
__u32 overflow;
__u32 cqes;
__u64 resv[2];
};
/*
* io_uring_enter(2) flags
*/
#define IORING_ENTER_GETEVENTS (1U << 0)
#define IORING_ENTER_SQ_WAKEUP (1U << 1)
/*
* Passed in for io_uring_setup(2). Copied back with updated info on success
*/
struct io_uring_params {
__u32 sq_entries;
__u32 cq_entries;
__u32 flags;
__u32 sq_thread_cpu;
__u32 sq_thread_idle;
__u32 resv[5];
struct io_sqring_offsets sq_off;
struct io_cqring_offsets cq_off;
};
/*
* io_uring_register(2) opcodes and arguments
*/
#define IORING_REGISTER_BUFFERS 0
#define IORING_UNREGISTER_BUFFERS 1
#define IORING_REGISTER_FILES 2
#define IORING_UNREGISTER_FILES 3
#endif
@@ -1414,6 +1414,15 @@ config AIO
by some high performance threaded applications. Disabling
this option saves about 7k.
config IO_URING
bool "Enable IO uring support" if EXPERT
select ANON_INODES
default y
help
This option enables support for the io_uring interface, enabling
applications to submit and complete IO through submission and
completion rings that are shared between the kernel and application.
config ADVISE_SYSCALLS
bool "Enable madvise/fadvise syscalls" if EXPERT
default y
......
@@ -48,6 +48,9 @@ COND_SYSCALL(io_pgetevents_time32);
COND_SYSCALL(io_pgetevents);
COND_SYSCALL_COMPAT(io_pgetevents_time32);
COND_SYSCALL_COMPAT(io_pgetevents);
COND_SYSCALL(io_uring_setup);
COND_SYSCALL(io_uring_enter);
COND_SYSCALL(io_uring_register);
/* fs/xattr.c */
......
@@ -18,7 +18,7 @@ obj-$(CONFIG_NETFILTER) += netfilter/
obj-$(CONFIG_INET) += ipv4/
obj-$(CONFIG_TLS) += tls/
obj-$(CONFIG_XFRM) += xfrm/
obj-$(CONFIG_UNIX_SCM) += unix/
obj-$(CONFIG_NET) += ipv6/
obj-$(CONFIG_BPFILTER) += bpfilter/
obj-$(CONFIG_PACKET) += packet/
......
@@ -19,6 +19,11 @@ config UNIX
Say Y unless you know what you are doing.
config UNIX_SCM
bool
depends on UNIX
default y
config UNIX_DIAG
tristate "UNIX: socket monitoring interface"
depends on UNIX
......
@@ -10,3 +10,5 @@ unix-$(CONFIG_SYSCTL) += sysctl_net_unix.o
obj-$(CONFIG_UNIX_DIAG) += unix_diag.o
unix_diag-y := diag.o
obj-$(CONFIG_UNIX_SCM) += scm.o
@@ -119,6 +119,8 @@
#include <linux/freezer.h>
#include <linux/file.h>
#include "scm.h"
struct hlist_head unix_socket_table[2 * UNIX_HASH_SIZE];
EXPORT_SYMBOL_GPL(unix_socket_table);
DEFINE_SPINLOCK(unix_table_lock);
@@ -1496,67 +1498,6 @@ static int unix_getname(struct socket *sock, struct sockaddr *uaddr, int peer)
return err;
}
static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
{
int i;
scm->fp = UNIXCB(skb).fp;
UNIXCB(skb).fp = NULL;
for (i = scm->fp->count-1; i >= 0; i--)
unix_notinflight(scm->fp->user, scm->fp->fp[i]);
}
static void unix_destruct_scm(struct sk_buff *skb)
{
struct scm_cookie scm;
memset(&scm, 0, sizeof(scm));
scm.pid = UNIXCB(skb).pid;
if (UNIXCB(skb).fp)
unix_detach_fds(&scm, skb);
/* Alas, it calls VFS */
/* So fscking what? fput() had been SMP-safe since the last Summer */
scm_destroy(&scm);
sock_wfree(skb);
}
/*
* The "user->unix_inflight" variable is protected by the garbage
* collection lock, and we just read it locklessly here. If you go
* over the limit, there might be a tiny race in actually noticing
* it across threads. Tough.
*/
static inline bool too_many_unix_fds(struct task_struct *p)
{
struct user_struct *user = current_user();
if (unlikely(user->unix_inflight > task_rlimit(p, RLIMIT_NOFILE)))
return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN);
return false;
}
static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
{
int i;
if (too_many_unix_fds(current))
return -ETOOMANYREFS;
/*
* Need to duplicate file references for the sake of garbage
* collection. Otherwise a socket in the fps might become a
* candidate for GC while the skb is not yet queued.
*/
UNIXCB(skb).fp = scm_fp_dup(scm->fp);
if (!UNIXCB(skb).fp)
return -ENOMEM;
for (i = scm->fp->count - 1; i >= 0; i--)
unix_inflight(scm->fp->user, scm->fp->fp[i]);
return 0;
}
static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool send_fds)
{
int err = 0;
......
@@ -86,77 +86,13 @@
#include <net/scm.h>
#include <net/tcp_states.h>
#include "scm.h"
/* Internal data structures and random procedures: */
static LIST_HEAD(gc_inflight_list);
static LIST_HEAD(gc_candidates);
static DEFINE_SPINLOCK(unix_gc_lock);
static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);
unsigned int unix_tot_inflight;
struct sock *unix_get_socket(struct file *filp)
{
struct sock *u_sock = NULL;
struct inode *inode = file_inode(filp);
/* Socket ? */
if (S_ISSOCK(inode->i_mode) && !(filp->f_mode & FMODE_PATH)) {
struct socket *sock = SOCKET_I(inode);
struct sock *s = sock->sk;
/* PF_UNIX ? */
if (s && sock->ops && sock->ops->family == PF_UNIX)
u_sock = s;
}
return u_sock;
}
/* Keep the number of times in flight count for the file
* descriptor if it is for an AF_UNIX socket.
*/
void unix_inflight(struct user_struct *user, struct file *fp)
{
struct sock *s = unix_get_socket(fp);
spin_lock(&unix_gc_lock);
if (s) {
struct unix_sock *u = unix_sk(s);
if (atomic_long_inc_return(&u->inflight) == 1) {
BUG_ON(!list_empty(&u->link));
list_add_tail(&u->link, &gc_inflight_list);
} else {
BUG_ON(list_empty(&u->link));
}
unix_tot_inflight++;
}
user->unix_inflight++;
spin_unlock(&unix_gc_lock);
}
void unix_notinflight(struct user_struct *user, struct file *fp)
{
struct sock *s = unix_get_socket(fp);
spin_lock(&unix_gc_lock);
if (s) {
struct unix_sock *u = unix_sk(s);
BUG_ON(!atomic_long_read(&u->inflight));
BUG_ON(list_empty(&u->link));
if (atomic_long_dec_and_test(&u->inflight))
list_del_init(&u->link);
unix_tot_inflight--;
}
user->unix_inflight--;
spin_unlock(&unix_gc_lock);
}
static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
struct sk_buff_head *hitlist)
{
......
// SPDX-License-Identifier: GPL-2.0
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/socket.h>
#include <linux/net.h>
#include <linux/fs.h>
#include <net/af_unix.h>
#include <net/scm.h>
#include <linux/init.h>
#include "scm.h"
unsigned int unix_tot_inflight;
EXPORT_SYMBOL(unix_tot_inflight);
LIST_HEAD(gc_inflight_list);
EXPORT_SYMBOL(gc_inflight_list);
DEFINE_SPINLOCK(unix_gc_lock);
EXPORT_SYMBOL(unix_gc_lock);
struct sock *unix_get_socket(struct file *filp)
{
struct sock *u_sock = NULL;
struct inode *inode = file_inode(filp);
/* Socket ? */
if (S_ISSOCK(inode->i_mode) && !(filp->f_mode & FMODE_PATH)) {
struct socket *sock = SOCKET_I(inode);
struct sock *s = sock->sk;
/* PF_UNIX ? */
if (s && sock->ops && sock->ops->family == PF_UNIX)
u_sock = s;
} else {
/* Could be an io_uring instance */
u_sock = io_uring_get_socket(filp);
}
return u_sock;
}
EXPORT_SYMBOL(unix_get_socket);
/* Keep the number of times in flight count for the file
* descriptor if it is for an AF_UNIX socket.
*/
void unix_inflight(struct user_struct *user, struct file *fp)
{
struct sock *s = unix_get_socket(fp);
spin_lock(&unix_gc_lock);
if (s) {
struct unix_sock *u = unix_sk(s);
if (atomic_long_inc_return(&u->inflight) == 1) {
BUG_ON(!list_empty(&u->link));
list_add_tail(&u->link, &gc_inflight_list);
} else {
BUG_ON(list_empty(&u->link));
}
unix_tot_inflight++;
}
user->unix_inflight++;
spin_unlock(&unix_gc_lock);
}
void unix_notinflight(struct user_struct *user, struct file *fp)
{
struct sock *s = unix_get_socket(fp);
spin_lock(&unix_gc_lock);
if (s) {
struct unix_sock *u = unix_sk(s);
BUG_ON(!atomic_long_read(&u->inflight));
BUG_ON(list_empty(&u->link));
if (atomic_long_dec_and_test(&u->inflight))
list_del_init(&u->link);
unix_tot_inflight--;
}
user->unix_inflight--;
spin_unlock(&unix_gc_lock);
}
/*
* The "user->unix_inflight" variable is protected by the garbage
* collection lock, and we just read it locklessly here. If you go
* over the limit, there might be a tiny race in actually noticing
* it across threads. Tough.
*/
static inline bool too_many_unix_fds(struct task_struct *p)
{
struct user_struct *user = current_user();
if (unlikely(user->unix_inflight > task_rlimit(p, RLIMIT_NOFILE)))
return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN);
return false;
}
int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
{
int i;
if (too_many_unix_fds(current))
return -ETOOMANYREFS;
/*
* Need to duplicate file references for the sake of garbage
* collection. Otherwise a socket in the fps might become a
* candidate for GC while the skb is not yet queued.
*/
UNIXCB(skb).fp = scm_fp_dup(scm->fp);
if (!UNIXCB(skb).fp)
return -ENOMEM;
for (i = scm->fp->count - 1; i >= 0; i--)
unix_inflight(scm->fp->user, scm->fp->fp[i]);
return 0;
}
EXPORT_SYMBOL(unix_attach_fds);
void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
{
int i;
scm->fp = UNIXCB(skb).fp;
UNIXCB(skb).fp = NULL;
for (i = scm->fp->count-1; i >= 0; i--)
unix_notinflight(scm->fp->user, scm->fp->fp[i]);
}
EXPORT_SYMBOL(unix_detach_fds);
void unix_destruct_scm(struct sk_buff *skb)
{
struct scm_cookie scm;
memset(&scm, 0, sizeof(scm));
scm.pid = UNIXCB(skb).pid;
if (UNIXCB(skb).fp)
unix_detach_fds(&scm, skb);
/* Alas, it calls VFS */
/* So fscking what? fput() had been SMP-safe since the last Summer */
scm_destroy(&scm);
sock_wfree(skb);
}
EXPORT_SYMBOL(unix_destruct_scm);
#ifndef NET_UNIX_SCM_H
#define NET_UNIX_SCM_H
extern struct list_head gc_inflight_list;
extern spinlock_t unix_gc_lock;
int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb);
void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb);
#endif
# SPDX-License-Identifier: GPL-2.0
# Makefile for io_uring test tools
CFLAGS += -Wall -Wextra -g -D_GNU_SOURCE
LDLIBS += -lpthread
all: io_uring-cp io_uring-bench
%: %.c
$(CC) $(CFLAGS) -o $@ $^
io_uring-bench: syscall.o io_uring-bench.o
$(CC) $(CFLAGS) $(LDLIBS) -o $@ $^
io_uring-cp: setup.o syscall.o queue.o
clean:
$(RM) io_uring-cp io_uring-bench *.o
.PHONY: all clean
This directory includes a few programs that demonstrate how to use io_uring
in an application. The examples are:
io_uring-cp
A very basic io_uring implementation of cp(1). It takes two
arguments and copies the first argument to the second. This example
is part of liburing, and hence uses the simplified liburing API
for setting up an io_uring instance, submitting IO, completing IO,
etc. The support functions in queue.c and setup.c are straight
out of liburing.
io_uring-bench
Benchmark program that does random reads on a number of files. This
app demonstrates the various features of io_uring, like fixed files,
fixed buffers, and polled IO. There are options in the program to
control which features to use. The arguments are the file (or files)
that io_uring-bench should operate on. This uses the raw io_uring
interface.
liburing can be cloned with git here:
git://git.kernel.dk/liburing
and contains a number of unit tests as well for testing io_uring. It also
comes with man pages for the three system calls.
Fio includes an io_uring engine; you can clone fio here:
git://git.kernel.dk/fio
#ifndef LIBURING_BARRIER_H
#define LIBURING_BARRIER_H
#if defined(__x86_64) || defined(__i386__)
#define read_barrier() __asm__ __volatile__("":::"memory")
#define write_barrier() __asm__ __volatile__("":::"memory")
#else
/*
* Add arch appropriate definitions. Be safe and use full barriers for
* archs we don't have support for.
*/
#define read_barrier() __sync_synchronize()
#define write_barrier() __sync_synchronize()
#endif
#endif
// SPDX-License-Identifier: GPL-2.0
/*
* Simple test program that demonstrates a file copy through io_uring. This
* uses the API exposed by liburing.
*
* Copyright (C) 2018-2019 Jens Axboe
*/
#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include "liburing.h"
#define QD 64
#define BS (32*1024)
static int infd, outfd;
struct io_data {
int read;
off_t first_offset, offset;
size_t first_len;
struct iovec iov;
};
static int setup_context(unsigned entries, struct io_uring *ring)
{
int ret;
ret = io_uring_queue_init(entries, ring, 0);
if (ret < 0) {
fprintf(stderr, "queue_init: %s\n", strerror(-ret));
return -1;
}
return 0;
}
static int get_file_size(int fd, off_t *size)
{
struct stat st;
if (fstat(fd, &st) < 0)
return -1;
if (S_ISREG(st.st_mode)) {
*size = st.st_size;
return 0;
} else if (S_ISBLK(st.st_mode)) {
unsigned long long bytes;
if (ioctl(fd, BLKGETSIZE64, &bytes) != 0)
return -1;
*size = bytes;
return 0;
}
return -1;
}
static void queue_prepped(struct io_uring *ring, struct io_data *data)
{
struct io_uring_sqe *sqe;
sqe = io_uring_get_sqe(ring);
assert(sqe);
if (data->read)
io_uring_prep_readv(sqe, infd, &data->iov, 1, data->offset);
else
io_uring_prep_writev(sqe, outfd, &data->iov, 1, data->offset);
io_uring_sqe_set_data(sqe, data);
}
static int queue_read(struct io_uring *ring, off_t size, off_t offset)
{
struct io_uring_sqe *sqe;
struct io_data *data;
sqe = io_uring_get_sqe(ring);
if (!sqe)
return 1;
data = malloc(size + sizeof(*data));
if (!data)
return 1;
data->read = 1;
data->offset = data->first_offset = offset;
data->iov.iov_base = data + 1;
data->iov.iov_len = size;
data->first_len = size;
io_uring_prep_readv(sqe, infd, &data->iov, 1, offset);
io_uring_sqe_set_data(sqe, data);
return 0;
}
static void queue_write(struct io_uring *ring, struct io_data *data)
{
data->read = 0;
data->offset = data->first_offset;
data->iov.iov_base = data + 1;
data->iov.iov_len = data->first_len;
queue_prepped(ring, data);
io_uring_submit(ring);
}
static int copy_file(struct io_uring *ring, off_t insize)
{
unsigned long reads, writes;
struct io_uring_cqe *cqe;
off_t write_left, offset;
int ret;
write_left = insize;
writes = reads = offset = 0;
while (insize || write_left) {
unsigned long had_reads;
int got_comp;
/*
* Queue up as many reads as we can
*/
had_reads = reads;
while (insize) {
off_t this_size = insize;
if (reads + writes >= QD)
break;
if (this_size > BS)
this_size = BS;
else if (!this_size)
break;
if (queue_read(ring, this_size, offset))
break;
insize -= this_size;
offset += this_size;
reads++;
}
if (had_reads != reads) {
ret = io_uring_submit(ring);
if (ret < 0) {
fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
break;
}
}
/*
* Queue is full at this point. Find at least one completion.
*/
got_comp = 0;
while (write_left) {
struct io_data *data;
if (!got_comp) {
ret = io_uring_wait_completion(ring, &cqe);
got_comp = 1;
} else
ret = io_uring_get_completion(ring, &cqe);
if (ret < 0) {
fprintf(stderr, "io_uring_get_completion: %s\n",
strerror(-ret));
return 1;
}
if (!cqe)
break;
data = (struct io_data *) (uintptr_t) cqe->user_data;
if (cqe->res < 0) {
if (cqe->res == -EAGAIN) {
queue_prepped(ring, data);
continue;
}
fprintf(stderr, "cqe failed: %s\n",
strerror(-cqe->res));
return 1;
} else if ((size_t) cqe->res != data->iov.iov_len) {
/* Short read/write, adjust and requeue */
data->iov.iov_base += cqe->res;
data->iov.iov_len -= cqe->res;
data->offset += cqe->res;
queue_prepped(ring, data);
continue;
}
/*
* All done. if write, nothing else to do. if read,
* queue up corresponding write.
*/
if (data->read) {
queue_write(ring, data);
write_left -= data->first_len;
reads--;
writes++;
} else {
free(data);
writes--;
}
}
}
return 0;
}
int main(int argc, char *argv[])
{
struct io_uring ring;
off_t insize;
int ret;
if (argc < 3) {
printf("%s: infile outfile\n", argv[0]);
return 1;
}
infd = open(argv[1], O_RDONLY);
if (infd < 0) {
perror("open infile");
return 1;
}
outfd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
if (outfd < 0) {
perror("open outfile");
return 1;
}
if (setup_context(QD, &ring))
return 1;
if (get_file_size(infd, &insize))
return 1;
ret = copy_file(&ring, insize);
close(infd);
close(outfd);
io_uring_queue_exit(&ring);
return ret;
}
#ifndef LIB_URING_H
#define LIB_URING_H
#include <sys/uio.h>
#include <signal.h>
#include <string.h>
#include "../../include/uapi/linux/io_uring.h"
/*
* Library interface to io_uring
*/
struct io_uring_sq {
unsigned *khead;
unsigned *ktail;
unsigned *kring_mask;
unsigned *kring_entries;
unsigned *kflags;
unsigned *kdropped;
unsigned *array;
struct io_uring_sqe *sqes;
unsigned sqe_head;
unsigned sqe_tail;
size_t ring_sz;
};
struct io_uring_cq {
unsigned *khead;
unsigned *ktail;
unsigned *kring_mask;
unsigned *kring_entries;
unsigned *koverflow;
struct io_uring_cqe *cqes;
size_t ring_sz;
};
struct io_uring {
struct io_uring_sq sq;
struct io_uring_cq cq;
int ring_fd;
};
/*
* System calls
*/
extern int io_uring_setup(unsigned entries, struct io_uring_params *p);
extern int io_uring_enter(unsigned fd, unsigned to_submit,
unsigned min_complete, unsigned flags, sigset_t *sig);
extern int io_uring_register(int fd, unsigned int opcode, void *arg,
unsigned int nr_args);
/*
* Library interface
*/
extern int io_uring_queue_init(unsigned entries, struct io_uring *ring,
unsigned flags);
extern int io_uring_queue_mmap(int fd, struct io_uring_params *p,
struct io_uring *ring);
extern void io_uring_queue_exit(struct io_uring *ring);
extern int io_uring_get_completion(struct io_uring *ring,
struct io_uring_cqe **cqe_ptr);
extern int io_uring_wait_completion(struct io_uring *ring,
struct io_uring_cqe **cqe_ptr);
extern int io_uring_submit(struct io_uring *ring);
extern struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring);
/*
* Command prep helpers
*/
static inline void io_uring_sqe_set_data(struct io_uring_sqe *sqe, void *data)
{
sqe->user_data = (unsigned long) data;
}
static inline void io_uring_prep_rw(int op, struct io_uring_sqe *sqe, int fd,
void *addr, unsigned len, off_t offset)
{
memset(sqe, 0, sizeof(*sqe));
sqe->opcode = op;
sqe->fd = fd;
sqe->off = offset;
sqe->addr = (unsigned long) addr;
sqe->len = len;
}
static inline void io_uring_prep_readv(struct io_uring_sqe *sqe, int fd,
struct iovec *iovecs, unsigned nr_vecs,
off_t offset)
{
io_uring_prep_rw(IORING_OP_READV, sqe, fd, iovecs, nr_vecs, offset);
}
static inline void io_uring_prep_read_fixed(struct io_uring_sqe *sqe, int fd,
void *buf, unsigned nbytes,
off_t offset)
{
io_uring_prep_rw(IORING_OP_READ_FIXED, sqe, fd, buf, nbytes, offset);
}
static inline void io_uring_prep_writev(struct io_uring_sqe *sqe, int fd,
struct iovec *iovecs, unsigned nr_vecs,
off_t offset)
{
io_uring_prep_rw(IORING_OP_WRITEV, sqe, fd, iovecs, nr_vecs, offset);
}
static inline void io_uring_prep_write_fixed(struct io_uring_sqe *sqe, int fd,
void *buf, unsigned nbytes,
off_t offset)
{
io_uring_prep_rw(IORING_OP_WRITE_FIXED, sqe, fd, buf, nbytes, offset);
}
static inline void io_uring_prep_poll_add(struct io_uring_sqe *sqe, int fd,
short poll_mask)
{
memset(sqe, 0, sizeof(*sqe));
sqe->opcode = IORING_OP_POLL_ADD;
sqe->fd = fd;
sqe->poll_events = poll_mask;
}
static inline void io_uring_prep_poll_remove(struct io_uring_sqe *sqe,
void *user_data)
{
memset(sqe, 0, sizeof(*sqe));
sqe->opcode = IORING_OP_POLL_REMOVE;
sqe->addr = (unsigned long) user_data;
}
static inline void io_uring_prep_fsync(struct io_uring_sqe *sqe, int fd,
int datasync)
{
memset(sqe, 0, sizeof(*sqe));
sqe->opcode = IORING_OP_FSYNC;
sqe->fd = fd;
if (datasync)
sqe->fsync_flags = IORING_FSYNC_DATASYNC;
}
#endif
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include "liburing.h"
#include "barrier.h"
static int __io_uring_get_completion(struct io_uring *ring,
struct io_uring_cqe **cqe_ptr, int wait)
{
struct io_uring_cq *cq = &ring->cq;
const unsigned mask = *cq->kring_mask;
unsigned head;
int ret;
*cqe_ptr = NULL;
head = *cq->khead;
do {
/*
* It's necessary to use a read_barrier() before reading
* the CQ tail, since the kernel updates it locklessly. The
* kernel has the matching store barrier for the update. The
* kernel also ensures that previous stores to CQEs are ordered
* with the tail update.
*/
read_barrier();
if (head != *cq->ktail) {
*cqe_ptr = &cq->cqes[head & mask];
break;
}
if (!wait)
break;
ret = io_uring_enter(ring->ring_fd, 0, 1,
IORING_ENTER_GETEVENTS, NULL);
if (ret < 0)
return -errno;
} while (1);
if (*cqe_ptr) {
*cq->khead = head + 1;
/*
* Ensure that the kernel sees our new head, the kernel has
* the matching read barrier.
*/
write_barrier();
}
return 0;
}
/*
* Return an IO completion, if one is readily available
*/
int io_uring_get_completion(struct io_uring *ring,
struct io_uring_cqe **cqe_ptr)
{
return __io_uring_get_completion(ring, cqe_ptr, 0);
}
/*
* Return an IO completion, waiting for it if necessary
*/
int io_uring_wait_completion(struct io_uring *ring,
struct io_uring_cqe **cqe_ptr)
{
return __io_uring_get_completion(ring, cqe_ptr, 1);
}
/*
* Submit sqes acquired from io_uring_get_sqe() to the kernel.
*
* Returns number of sqes submitted
*/
int io_uring_submit(struct io_uring *ring)
{
struct io_uring_sq *sq = &ring->sq;
const unsigned mask = *sq->kring_mask;
unsigned ktail, ktail_next, submitted;
int ret;
/*
* If we have pending IO in the kring, submit it first. We need a
* read barrier here to match the kernel's store barrier when updating
* the SQ head.
*/
read_barrier();
if (*sq->khead != *sq->ktail) {
submitted = *sq->kring_entries;
goto submit;
}
if (sq->sqe_head == sq->sqe_tail)
return 0;
/*
* Fill in sqes that we have queued up, adding them to the kernel ring
*/
submitted = 0;
ktail = ktail_next = *sq->ktail;
while (sq->sqe_head < sq->sqe_tail) {
ktail_next++;
read_barrier();
sq->array[ktail & mask] = sq->sqe_head & mask;
ktail = ktail_next;
sq->sqe_head++;
submitted++;
}
if (!submitted)
return 0;
if (*sq->ktail != ktail) {
/*
* First write barrier ensures that the SQE stores are updated
* with the tail update. This is needed so that the kernel
* will never see a tail update without the preceding SQE
* stores being done.
*/
write_barrier();
*sq->ktail = ktail;
/*
* The kernel has the matching read barrier for reading the
* SQ tail.
*/
write_barrier();
}
submit:
ret = io_uring_enter(ring->ring_fd, submitted, 0,
IORING_ENTER_GETEVENTS, NULL);
if (ret < 0)
return -errno;
return 0;
}
/*
* Return an sqe to fill. Application must later call io_uring_submit()
* when it's ready to tell the kernel about it. The caller may call this
* function multiple times before calling io_uring_submit().
*
* Returns a vacant sqe, or NULL if we're full.
*/
struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring)
{
struct io_uring_sq *sq = &ring->sq;
unsigned next = sq->sqe_tail + 1;
struct io_uring_sqe *sqe;
/*
* All sqes are used
*/
if (next - sq->sqe_head > *sq->kring_entries)
return NULL;
sqe = &sq->sqes[sq->sqe_tail & *sq->kring_mask];
sq->sqe_tail = next;
return sqe;
}