[PATCH] readv/writev speedup
This is Janet Morgan's patch which converts the readv/writev code to submit all segments for IO before waiting on them, rather than submitting each segment separately. This is a critical performance fix for O_DIRECT reads and writes. Prior to this change, O_DIRECT vectored IO was forced to wait for completion against each segment of the iovec rather than submitting all segments and waiting on the lot. ie: for ten segments, this code will be ten times faster. There will also be moderate improvements for buffered IO - smaller code paths, plus writev() only takes i_sem once. The patch ended up quite large unfortunately - turned out that the only sane way to implement this without duplicating significant amounts of code (the generic_file_write() bounds checking, all the O_DIRECT handling, etc) was to redo generic_file_read() and generic_file_write() to take an iovec/nr_segs pair rather than `buf, count'. New exported functions generic_file_readv() and generic_file_writev() have been added: ssize_t generic_file_readv(struct file *filp, const struct iovec *iov, unsigned long nr_segs, loff_t *ppos); ssize_t generic_file_writev(struct file *file, const struct iovec *iov, unsigned long nr_segs, loff_t * ppos); If a driver does not use these in their file_operations then they will continue to use the old readv/writev code, which sits in a loop calling calls fops->read() or fops->write(). ext2, ext3, JFS and the blockdev driver are currently using this capability. Some coding cleanups were made in fs/read_write.c. Mainly: - pass "READ" or "WRITE" around to indicate the diretion of the operation, rather than the (confusing, inverted) VERIFY_READ/VERIFY_WRITE. - Use the identifier `nr_segs' everywhere to indicate the iovec length rather than `count', which is often used to indicate the number of bytes in the syscall. It was confusing the heck out of me. - Some cleanups to the raw driver. - Some additional generality in fs/direct_io.c: the core `struct dio' used to be a "populate-and-go" thing. Janet has broken that up so you can initialise a struct dio once, then loop around feeding it more file segments, then wait on completion against everything. - In a couple of places we needed to handle the situation where we knew, a-priori, that the user was going to get a short read or write. File size limit exceeded, read past i_size, etc. We handled that by shortening the iovec in-place with iov_shorten(). Which is not particularly pretty, but neither were the alternatives.
Showing
Please register or sign in to comment