[PATCH] direct IO updates

This patch is a performance and correctness update to the direct-IO
code: O_DIRECT and the raw driver.  It mainly affects IO against
blockdevs.

The direct_io code was returning -EINVAL for a filesystem hole.  Change
it to clear the userspace page instead.

There were a few restrictions and weirdnesses wrt blocksize and
alignments.  The code has been reworked so we now lay out maximum-sized
BIOs at any sector alignment.

Because of this, the raw driver has been altered to set the blockdev's
soft blocksize to the minimum possible at open() time, typically 512
bytes.  There are now no performance disadvantages to using small
blocksizes, and this gives the finest possible alignment.

There is no API here for setting or querying the soft blocksize of the
raw driver (there never was, really), which could conceivably be a
problem.  If it is, we can permit BLKBSZSET and BLKBSZGET against the
fd which /dev/raw/rawN returned, but that would require that
blk_ioctl() be exported to modules again.

This code is wickedly quick.  Here's an oprofile of a single 500MHz
PIII reading from four (old) scsi disks (two aic7xxx controllers) via
the raw driver.  Aggregate throughput is 72 megabytes/second:

c013363c 24       0.0896492   __set_page_dirty_buffers
c021b8cc 24       0.0896492   ahc_linux_isr
c012b5dc 25       0.0933846   kmem_cache_free
c014d894 26       0.09712     dio_bio_complete
c01cc78c 26       0.09712     number
c0123bd4 40       0.149415    follow_page
c01eed8c 46       0.171828    end_that_request_first
c01ed410 49       0.183034    blk_recount_segments
c01ed574 65       0.2428      blk_rq_map_sg
c014db38 85       0.317508    do_direct_IO
c021b090 90       0.336185    ahc_linux_run_device_queue
c010bb78 236      0.881551    timer_interrupt
c01052d8 25354    94.707      poll_idle

A testament to the efficiency of the 2.5 block layer.

And against four IDE disks on an HPT374 controller.  Throughput is 120
megabytes/sec:

c01eed8c 80       0.292462    end_that_request_first
c01fe850 87       0.318052    hpt3xx_intrproc
c01ed574 123      0.44966     blk_rq_map_sg
c01f8f10 141      0.515464    ata_select
c014db38 153      0.559333    do_direct_IO
c010bb78 235      0.859107    timer_interrupt
c01f9144 281      1.02727     ata_irq_enable
c01ff990 290      1.06017     udma_pci_init
c01fe878 308      1.12598     hpt3xx_maskproc
c02006f8 379      1.38554     idedisk_do_request
c02356a0 609      2.22637     pci_conf1_read
c01ff8dc 611      2.23368     udma_pci_start
c01ff950 922      3.37062     udma_pci_irq_status
c01f8fac 1002     3.66308     ata_status
c01ff26c 1059     3.87146     ata_start_dma
c01feb70 1141     4.17124     hpt374_udma_stop
c01f9228 3072     11.2305     ata_out_regfile
c01052d8 15193    55.5422     poll_idle

Not so good.

One problem which has been identified with O_DIRECT is the cost of
repeated calls into the mapping's get_block() callback.  Not a big
problem with ext2 but other filesystems have more complex get_block
implementations.

So what I have done is to require that callers of generic_direct_IO()
implement the new `get_blocks()' interface.  This is a small extension
to get_block().  It gets passed another argument which indicates the
maximum number of blocks which should be mapped, and it returns the
number of blocks which it did map in bh_result->b_size.  This allows
the fs to map up to 4G of disk (or of hole) in a single get_blocks()
invocation.

There are some other caveats and requirements of get_blocks() which are
documented in the comment block over fs/direct-io.c:get_more_blocks().

Possibly, get_blocks() will be the 2.6 kernel's way of doing gang block
mapping.  It certainly allows good speedups.  But it doesn't allow the
fs to return a scatter list of blocks - it only understands linear
chunks of disk.  I think that's really all it _should_ do.

I'll let get_blocks() sit for a while and wait for some feedback.  If
it is sufficient and nobody objects too much, I shall convert all
get_block() instances in the kernel to be get_blocks() instances.  And
I'll teach readahead (at least) to use the get_blocks() extension.

Delayed allocate writeback could use get_blocks().  As could
block_prepare_write() for blocksize < PAGE_CACHE_SIZE.  There's no
mileage using it in mpage_writepages() because all our filesystems are
syncalloc, and nobody uses MAP_SHARED for much.

It will be tricky to use get_blocks() for writes, because if a ton of
blocks have been mapped into the file and then something goes wrong,
the kernel needs to either remove those blocks from the file or zero
them out.  The direct_io code zeroes them out.

btw, some time ago you mentioned that some drivers and/or hardware may
get upset if there are multiple simultaneous IOs in progress against
the same block.  Well, the raw driver has always allowed that to
happen.  O_DIRECT writes to blockdevs do as well now.

todo:

1) The driver will probably explode if someone runs BLKBSZSET while IO
   is in progress.  Need to use bdclaim() somewhere.

2) readv() and writev() need to become direct_io-aware.  At present
   we're doing stop-and-wait for each segment when performing
   readv/writev against the raw driver and O_DIRECT blockdevs.
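
The get_blocks() contract described above can be illustrated outside the kernel.  What follows is only a user-space sketch: `toy_extent`, `toy_bh`, `toy_get_blocks()`, `toy_count_calls()` and the 4096-byte block size are all hypothetical stand-ins, not kernel definitions.  It shows a mapper that returns one linear run per call, clamps the run to max_blocks, and reports the run length (mapped or hole) in b_size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
struct toy_extent {
	unsigned long file_block;	/* first file-relative block */
	unsigned long disk_block;	/* first on-disk block */
	unsigned long nr_blocks;	/* length of the contiguous run */
};

struct toy_bh {
	unsigned long b_blocknr;	/* disk block at the start of the run */
	unsigned long b_size;		/* bytes mapped, or bytes of hole */
	int mapped;			/* 0 means the run is a hole */
};

#define TOY_BLKBITS 12			/* assumed 4096-byte fs blocks */

/*
 * Map one linear run starting at iblock, never more than max_blocks.
 * Blocks not covered by any extent are reported as a hole.  Extents
 * are assumed not to overlap.
 */
static int toy_get_blocks(const struct toy_extent *ext, size_t nr_ext,
		unsigned long iblock, unsigned long max_blocks,
		struct toy_bh *bh)
{
	unsigned long run = max_blocks;
	size_t i;

	assert(max_blocks > 0);
	bh->mapped = 0;
	bh->b_blocknr = 0;
	for (i = 0; i < nr_ext; i++) {
		unsigned long start = ext[i].file_block;
		unsigned long end = start + ext[i].nr_blocks;

		if (iblock >= start && iblock < end) {
			/* Inside an extent: map up to its end. */
			bh->mapped = 1;
			bh->b_blocknr = ext[i].disk_block + (iblock - start);
			if (end - iblock < run)
				run = end - iblock;
			break;
		}
		/* In a hole: it ends where the next extent begins. */
		if (start > iblock && start - iblock < run)
			run = start - iblock;
	}
	bh->b_size = run << TOY_BLKBITS;
	return 0;
}

/* Count the mapping calls needed to walk blocks [first, first + nr). */
static unsigned long toy_count_calls(const struct toy_extent *ext,
		size_t nr_ext, unsigned long first, unsigned long nr)
{
	unsigned long calls = 0;

	while (nr) {
		struct toy_bh bh;
		unsigned long got;

		toy_get_blocks(ext, nr_ext, first, nr, &bh);
		got = bh.b_size >> TOY_BLKBITS;
		first += got;
		nr -= got;
		calls++;
	}
	return calls;
}
```

With two 8-block extents separated by an 8-block hole, walking all 24 blocks takes three calls in this sketch, where a one-block-at-a-time mapper would be called 24 times.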

Commit 0d85f8bf, authored Jul 28, 2002 by Andrew Morton; committed by Linus Torvalds, Jul 28, 2002
6 changed files
--- a/drivers/char/raw.c
+++ b/drivers/char/raw.c
@@ -17,11 +17,9 @@
 #include <linux/smp_lock.h>
 #include <asm/uaccess.h>

-#define dprintk(x...) 
-
 typedef struct raw_device_data_s {
 	struct block_device *binding;
-	int inuse, sector_size, sector_bits;
+	int inuse;
 	struct semaphore mutex;
 } raw_device_data_t;

@@ -65,15 +63,15 @@ __initcall(raw_init);

 /* 
 * Open/close code for raw IO.
+ *
+ * Set the device's soft blocksize to the minimum possible.  This gives the 
+ * finest possible alignment and has no adverse impact on performance.
 */
-
 int raw_open(struct inode *inode, struct file *filp)
 {
 	int minor;
 	struct block_device * bdev;
 	int err;
-	int sector_size;
-	int sector_bits;

 	minor = minor(inode->i_rdev);
 	
@@ -87,12 +85,11 @@ int raw_open(struct inode *inode, struct file *filp)
 	}
 	
 	down(&raw_devices[minor].mutex);
+
 	/*
 	 * No, it is a normal raw device.  All we need to do on open is
-	 * to check that the device is bound, and force the underlying
-	 * block device to a sector-size blocksize. 
+	 * to check that the device is bound.
 	 */
-
 	bdev = raw_devices[minor].binding;
 	err = -ENODEV;
 	if (!bdev)
@@ -100,23 +97,19 @@ int raw_open(struct inode *inode, struct file *filp)

 	atomic_inc(&bdev->bd_count);
 	err = blkdev_get(bdev, filp->f_mode, 0, BDEV_RAW);
-	if (err)
-		goto out;
+	if (!err) {
+		int minsize = bdev_hardsect_size(bdev);

-	/*
-	 * Don't change the blocksize if we already have users using
-	 * this device 
-	 */
-
-	if (raw_devices[minor].inuse++)
-		goto out;
-
-	sector_size = bdev_hardsect_size(bdev);
-	raw_devices[minor].sector_size = sector_size;
-	for (sector_bits = 0; !(sector_size & 1); )
-		sector_size>>=1, sector_bits++;
-	raw_devices[minor].sector_bits = sector_bits;
+		if (bdev) {
+			int ret;

+			ret = set_blocksize(bdev, minsize);
+			if (ret)
+				printk("%s: set_blocksize() failed: %d\n",
+					__FUNCTION__, ret);
+		}
+		raw_devices[minor].inuse++;
+	}
 out:
 	up(&raw_devices[minor].mutex);
 	
@@ -137,24 +130,27 @@ int raw_release(struct inode *inode, struct file *filp)
 	return 0;
 }

-
-
 /* Forward ioctls to the underlying block device. */ 
 int raw_ioctl(struct inode *inode, 
-		  struct file *flip,
+		  struct file *filp,
 		  unsigned int command, 
 		  unsigned long arg)
 {
-	int minor = minor(inode->i_rdev), err; 
+	int minor = minor(inode->i_rdev);
+	int err; 
 	struct block_device *b; 
+
+	err = -ENODEV;
 	if (minor < 1 && minor > 255)
-		return -ENODEV;
+		goto out;

 	b = raw_devices[minor].binding;
 	err = -EINVAL;
-	if (b && b->bd_inode && b->bd_op && b->bd_op->ioctl) { 
+	if (b == NULL)
+		goto out;
+	if (b->bd_inode && b->bd_op && b->bd_op->ioctl)
 		err = b->bd_op->ioctl(b->bd_inode, NULL, command, arg); 
-	} 
+out:
 	return err;
 }

@@ -164,12 +160,12 @@ int raw_ioctl(struct inode *inode,
 */

 int raw_ctl_ioctl(struct inode *inode, 
-		  struct file *flip,
+		  struct file *filp,
 		  unsigned int command, 
 		  unsigned long arg)
 {
 	struct raw_config_request rq;
-	int err = 0;
+	int err;
 	int minor;
 	
 	switch (command) {
@@ -178,26 +174,23 @@ int raw_ctl_ioctl(struct inode *inode,

 		/* First, find out which raw minor we want */

-		if (copy_from_user(&rq, (void *) arg, sizeof(rq))) {
 		err = -EFAULT;
-			break;
-		}
+		if (copy_from_user(&rq, (void *) arg, sizeof(rq)))
+			goto out;
 		
 		minor = rq.raw_minor;
-		if (minor <= 0 || minor > MINORMASK) {
 		err = -EINVAL;
-			break;
-		}
+		if (minor <= 0 || minor > MINORMASK)
+			goto out;

 		if (command == RAW_SETBIND) {
 			/*
 			 * This is like making block devices, so demand the
 			 * same capability
 			 */
-			if (!capable(CAP_SYS_ADMIN)) {
 			err = -EPERM;
-				break;
-			}
+			if (!capable(CAP_SYS_ADMIN))
+				goto out;

 			/* 
 			 * For now, we don't need to check that the underlying
@@ -206,24 +199,23 @@ int raw_ctl_ioctl(struct inode *inode,
 			 * major/minor numbers make sense. 
 			 */

-			if ((rq.block_major == 0 && 
-			     rq.block_minor != 0) ||
-			    rq.block_major > MAX_BLKDEV ||
-			    rq.block_minor > MINORMASK) {
 			err = -EINVAL;
-				break;
-			}
+			if ((rq.block_major == 0 && rq.block_minor != 0) ||
+					rq.block_major > MAX_BLKDEV ||
+					rq.block_minor > MINORMASK)
+				goto out;
 			
 			down(&raw_devices[minor].mutex);
+			err = -EBUSY;
 			if (raw_devices[minor].inuse) {
 				up(&raw_devices[minor].mutex);
-				err = -EBUSY;
-				break;
+				goto out;
 			}
 			if (raw_devices[minor].binding)
 				bdput(raw_devices[minor].binding);
 			raw_devices[minor].binding = 
-				bdget(kdev_t_to_nr(mk_kdev(rq.block_major, rq.block_minor)));
+				bdget(kdev_t_to_nr(mk_kdev(rq.block_major,
+							rq.block_minor)));
 			up(&raw_devices[minor].mutex);
 		} else {
 			struct block_device *bdev;
@@ -237,16 +229,18 @@ int raw_ctl_ioctl(struct inode *inode,
 			} else {
 				rq.block_major = rq.block_minor = 0;
 			}
-			err = copy_to_user((void *) arg, &rq, sizeof(rq));
-			if (err)
 			err = -EFAULT;
+			if (copy_to_user((void *) arg, &rq, sizeof(rq)))
+				goto out;
 		}
+		err = 0;
 		break;
 		
 	default:
 		err = -EINVAL;
+		break;
 	}
-	
+out:
 	return err;
 }

@@ -257,7 +251,7 @@ ssize_t raw_read(struct file *filp, char * buf, size_t size, loff_t *offp)

 ssize_t	raw_write(struct file *filp, const char *buf, size_t size, loff_t *offp)
 {
-	return rw_raw_dev(WRITE, filp, (char *) buf, size, offp);
+	return rw_raw_dev(WRITE, filp, (char *)buf, size, offp);
 }

 ssize_t
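
Why the smallest soft blocksize gives the finest alignment can be seen from the blocksize_mask arithmetic in generic_direct_IO().  The sketch below assumes that file offset, length and user address must all be multiples of (1 << blkbits); the diff only shows the mask being computed, so that rule and the name `toy_dio_aligned()` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 if a request can be expressed in whole blocks of size
 * (1 << blkbits), 0 otherwise.  Hypothetical helper; it only mirrors
 * the "(1 << blkbits) - 1" mask computation seen in the patch.
 */
static int toy_dio_aligned(uint64_t offset, uint64_t count,
		uintptr_t user_addr, unsigned blkbits)
{
	uint64_t mask;

	assert(blkbits < 64);
	mask = ((uint64_t)1 << blkbits) - 1;
	return ((offset | count | (uint64_t)user_addr) & mask) == 0;
}
```

Under that assumption, a request at offset 512 for 1024 bytes into a buffer at 0x1200 passes with blkbits == 9 but fails with blkbits == 12, which is the case the raw driver avoids by selecting the hardware sector size at open() time.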

--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -24,14 +24,14 @@

 #include <asm/uaccess.h>

-static unsigned long max_block(struct block_device *bdev)
+static sector_t max_block(struct block_device *bdev)
 {
-	unsigned int retval = ~0U;
+	sector_t retval = ~0U;
 	loff_t sz = bdev->bd_inode->i_size;

 	if (sz) {
-		unsigned int size = block_size(bdev);
-		unsigned int sizebits = blksize_bits(size);
+		sector_t size = block_size(bdev);
+		unsigned sizebits = blksize_bits(size);
 		retval = (sz >> sizebits);
 	}
 	return retval;
@@ -88,7 +88,9 @@ int sb_min_blocksize(struct super_block *sb, int size)
 	return sb_set_blocksize(sb, size);
 }

-static int blkdev_get_block(struct inode * inode, sector_t iblock, struct buffer_head * bh, int create)
+static int
+blkdev_get_block(struct inode *inode, sector_t iblock,
+		struct buffer_head *bh, int create)
 {
 	if (iblock >= max_block(inode->i_bdev))
 		return -EIO;
@@ -99,12 +101,26 @@ static int blkdev_get_block(struct inode * inode, sector_t iblock, struct buffer
 	return 0;
 }

+static int
+blkdev_get_blocks(struct inode *inode, sector_t iblock,
+		unsigned long max_blocks, struct buffer_head *bh, int create)
+{
+	if ((iblock + max_blocks) >= max_block(inode->i_bdev))
+		return -EIO;
+
+	bh->b_bdev = inode->i_bdev;
+	bh->b_blocknr = iblock;
+	bh->b_size = max_blocks << inode->i_blkbits;
+	set_buffer_mapped(bh);
+	return 0;
+}
+
 static int
 blkdev_direct_IO(int rw, struct inode *inode, char *buf,
 			loff_t offset, size_t count)
 {
 	return generic_direct_IO(rw, inode, buf, offset,
-				count, blkdev_get_block);
+				count, blkdev_get_blocks);
 }

 static int blkdev_writepage(struct page * page)

--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -13,6 +13,7 @@
 #include <linux/types.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/bio.h>
 #include <linux/wait.h>
@@ -39,13 +40,17 @@ struct dio {
 	struct bio_vec *bvec;		/* current bvec in that bio */
 	struct inode *inode;
 	int rw;
+	unsigned blkbits;		/* doesn't change */
 	sector_t block_in_file;		/* changes */
+	unsigned blocks_available;	/* At block_in_file.  changes */
 	sector_t final_block_in_request;/* doesn't change */
-	unsigned first_block_in_page;	/* doesn't change */
+	unsigned first_block_in_page;	/* doesn't change, Used only once */
 	int boundary;			/* prev block is at a boundary */
 	int reap_counter;		/* rate limit reaping */
-	get_block_t *get_block;
-	sector_t last_block_in_bio;
+	get_blocks_t *get_blocks;	/* block mapping function */
+	sector_t last_block_in_bio;	/* current final block in bio */
+	sector_t next_block_in_bio;	/* next block to be added to bio */
+	struct buffer_head map_bh;	/* last get_blocks() result */

 	/* Page fetching state */
 	int curr_page;			/* changes */
@@ -53,15 +58,16 @@ struct dio {
 	unsigned long curr_user_address;/* changes */

 	/* Page queue */
-	struct page *pages[DIO_PAGES];
-	unsigned head;
-	unsigned tail;
+	struct page *pages[DIO_PAGES];	/* page buffer */
+	unsigned head;			/* next page to process */
+	unsigned tail;			/* last valid page + 1 */
+	int page_errors;		/* errno from get_user_pages() */

 	/* BIO completion state */
-	atomic_t bio_count;
-	spinlock_t bio_list_lock;
+	atomic_t bio_count;		/* nr bios in flight */
+	spinlock_t bio_list_lock;	/* protects bio_list */
 	struct bio *bio_list;		/* singly linked via bi_private */
-	struct task_struct *waiter;
+	struct task_struct *waiter;	/* waiting task (NULL if none) */
 };

 /*
@@ -93,6 +99,21 @@ static int dio_refill_pages(struct dio *dio)
 		NULL);				/* vmas */
 	up_read(&current->mm->mmap_sem);

+	if (ret < 0 && dio->blocks_available && (dio->rw == WRITE)) {
+		/*
+		 * A memory fault, but the filesystem has some outstanding
+		 * mapped blocks.  We need to use those blocks up to avoid
+		 * leaking stale data in the file.
+		 */
+		if (dio->page_errors == 0)
+			dio->page_errors = ret;
+		dio->pages[0] = ZERO_PAGE(dio->curr_user_address);
+		dio->head = 0;
+		dio->tail = 1;
+		ret = 0;
+		goto out;
+	}
+
 	if (ret >= 0) {
 		dio->curr_user_address += ret * PAGE_SIZE;
 		dio->curr_page += ret;
@@ -100,6 +121,7 @@ static int dio_refill_pages(struct dio *dio)
 		dio->tail = ret;
 		ret = 0;
 	}
+out:
 	return ret;	
 }

@@ -115,11 +137,8 @@ static struct page *dio_get_page(struct dio *dio)
 		int ret;

 		ret = dio_refill_pages(dio);
-		if (ret) {
-			printk("%s: dio_refill_pages returns %d\n",
-				__FUNCTION__, ret);
+		if (ret)
 			return ERR_PTR(ret);
-		}
 		BUG_ON(dio_pages_present(dio) == 0);
 	}
 	return dio->pages[dio->head++];
@@ -140,8 +159,9 @@ static void dio_bio_end_io(struct bio *bio)
 	spin_lock_irqsave(&dio->bio_list_lock, flags);
 	bio->bi_private = dio->bio_list;
 	dio->bio_list = bio;
-	spin_unlock_irqrestore(&dio->bio_list_lock, flags);
+	if (dio->waiter)
 		wake_up_process(dio->waiter);
+	spin_unlock_irqrestore(&dio->bio_list_lock, flags);
 }

 static int
@@ -179,6 +199,7 @@ static void dio_bio_submit(struct dio *dio)

 	dio->bio = NULL;
 	dio->bvec = NULL;
+	dio->boundary = 0;
 }

 /*
@@ -202,10 +223,12 @@ static struct bio *dio_await_one(struct dio *dio)
 	while (dio->bio_list == NULL) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		if (dio->bio_list == NULL) {
+			dio->waiter = current;
 			spin_unlock_irqrestore(&dio->bio_list_lock, flags);
 			blk_run_queues();
 			schedule();
 			spin_lock_irqsave(&dio->bio_list_lock, flags);
+			dio->waiter = NULL;
 		}
 		set_current_state(TASK_RUNNING);
 	}
@@ -268,29 +291,142 @@ static int dio_bio_reap(struct dio *dio)
 		while (dio->bio_list) {
 			unsigned long flags;
 			struct bio *bio;
-			int ret2;

 			spin_lock_irqsave(&dio->bio_list_lock, flags);
 			bio = dio->bio_list;
 			dio->bio_list = bio->bi_private;
 			spin_unlock_irqrestore(&dio->bio_list_lock, flags);
-			ret2 = dio_bio_complete(dio, bio);
-			if (ret == 0)
-				ret = ret2;
+			ret = dio_bio_complete(dio, bio);
 		}
 		dio->reap_counter = 0;
 	}
 	return ret;
 }

+/*
+ * Call into the fs to map some more disk blocks.  We record the current number
+ * of available blocks at dio->blocks_available.  These are in units of the
+ * fs blocksize, (1 << inode->i_blkbits).
+ *
+ * The fs is allowed to map lots of blocks at once.  If it wants to do that,
+ * it uses the passed inode-relative block number as the file offset, as usual.
+ *
+ * get_blocks() is passed the number of i_blkbits-sized blocks which direct_io
+ * has remaining to do.  The fs should not map more than this number of blocks.
+ *
+ * If the fs has mapped a lot of blocks, it should populate bh->b_size to
+ * indicate how much contiguous disk space has been made available at
+ * bh->b_blocknr.
+ *
+ * If *any* of the mapped blocks are new, then the fs must set buffer_new().
+ * This isn't very efficient...
+ *
+ * In the case of filesystem holes: the fs may return an arbitrarily-large
+ * hole by returning an appropriate value in b_size and by clearing
+ * buffer_mapped().  This code _should_ handle that case correctly, but it has
+ * only been tested against single-block holes (b_size == blocksize).
+ */
+static int get_more_blocks(struct dio *dio)
+{
+	int ret;
+	struct buffer_head *map_bh = &dio->map_bh;
+
+	if (dio->blocks_available)
+		return 0;
+
+	/*
+	 * If there was a memory error and we've overwritten all the
+	 * mapped blocks then we can now return that memory error
+	 */
+	if (dio->page_errors) {
+		ret = dio->page_errors;
+		goto out;
+	}
+
+	map_bh->b_state = 0;
+	map_bh->b_size = 0;
+	BUG_ON(dio->block_in_file >= dio->final_block_in_request);
+	ret = (*dio->get_blocks)(dio->inode, dio->block_in_file,
+			dio->final_block_in_request - dio->block_in_file,
+			map_bh, dio->rw == WRITE);
+	if (ret)
+		goto out;
+
+	if (buffer_mapped(map_bh)) {
+		BUG_ON(map_bh->b_size == 0);
+		BUG_ON((map_bh->b_size & ((1 << dio->blkbits) - 1)) != 0);
+
+		dio->blocks_available = map_bh->b_size >> dio->blkbits;
+
+		/* blockdevs do not set buffer_new */
+		if (buffer_new(map_bh)) {
+			sector_t block = map_bh->b_blocknr;
+			unsigned i;
+
+			for (i = 0; i < dio->blocks_available; i++)
+				unmap_underlying_metadata(map_bh->b_bdev,
+							block++);
+		}
+	} else {
+		BUG_ON(dio->rw != READ);
+		if (dio->bio)
+			dio_bio_submit(dio);
+	}
+	dio->next_block_in_bio = map_bh->b_blocknr;
+out:
+	return ret;
+}
+
+/*
+ * Check to see if we can continue to grow the BIO. If not, then send it.
+ */
+static void dio_prep_bio(struct dio *dio)
+{
+	if (dio->bio == NULL)
+		return;
+
+	if (dio->bio->bi_idx == dio->bio->bi_vcnt ||
+			dio->boundary ||
+			dio->last_block_in_bio != dio->next_block_in_bio - 1)
+		dio_bio_submit(dio);
+}
+
+/*
+ * There is no bio.  Make one now.
+ */
+static int dio_new_bio(struct dio *dio)
+{
+	sector_t sector;
+	int ret;
+
+	ret = dio_bio_reap(dio);
+	if (ret)
+		goto out;
+	sector = dio->next_block_in_bio << (dio->blkbits - 9);
+	ret = dio_bio_alloc(dio, dio->map_bh.b_bdev, sector,
+				DIO_BIO_MAX_SIZE / PAGE_SIZE);
+	dio->boundary = 0;
+out:
+	return ret;
+}
+
 /*
 * Walk the user pages, and the file, mapping blocks to disk and emitting BIOs.
+ *
+ * Direct IO against a blockdev is different from a file.  Because we can
+ * happily perform page-sized but 512-byte aligned IOs.  It is important that
+ * blockdev IO be able to have fine alignment and large sizes.
+ *
+ * So what we do is to permit the ->get_blocks function to populate bh.b_size
+ * with the size of IO which is permitted at this offset and this i_blkbits.
+ *
+ * For best results, the blockdev should be set up with 512-byte i_blkbits and
+ * it should set b_size to PAGE_SIZE or more inside get_blocks().  This gives
+ * fine alignment but still allows this function to work in PAGE_SIZE units.
 */
 int do_direct_IO(struct dio *dio)
 {
-	struct inode * const inode = dio->inode;
-	const unsigned blkbits = inode->i_blkbits;
-	const unsigned blocksize = 1 << blkbits;
+	const unsigned blkbits = dio->blkbits;
 	const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
 	struct page *page;
 	unsigned block_in_page;
@@ -309,46 +445,35 @@ int do_direct_IO(struct dio *dio)
 		}

 		new_page = 1;
-		for ( ; block_in_page < blocks_per_page; block_in_page++) {
-			struct buffer_head map_bh;
+		while (block_in_page < blocks_per_page) {
 			struct bio *bio;
+			unsigned this_chunk_bytes;	/* # of bytes mapped */
+			unsigned this_chunk_blocks;	/* # of blocks */
+			unsigned u;

-			map_bh.b_state = 0;
-			ret = (*dio->get_block)(inode, dio->block_in_file,
-						&map_bh, dio->rw == WRITE);
-			if (ret) {
-				printk("%s: get_block returns %d\n",
-					__FUNCTION__, ret);
-				goto fail_release;
-			}
-			/* blockdevs do not set buffer_new */
-			if (buffer_new(&map_bh))
-				unmap_underlying_metadata(map_bh.b_bdev,
-							map_bh.b_blocknr);
-			if (!buffer_mapped(&map_bh)) {
-				ret = -EINVAL;		/* A hole */
+			ret = get_more_blocks(dio);
+			if (ret)
 				goto fail_release;
+
+			/* Handle holes */
+			if (!buffer_mapped(&dio->map_bh)) {
+				char *kaddr = kmap_atomic(page, KM_USER0);
+				memset(kaddr + (block_in_page << blkbits),
+						0, 1 << blkbits);
+				flush_dcache_page(page);
+				kunmap_atomic(kaddr, KM_USER0);
+				dio->block_in_file++;
+				dio->next_block_in_bio++;
+				block_in_page++;
+				goto next_block;
 			}
-			if (dio->bio) {
-				if (dio->bio->bi_idx == dio->bio->bi_vcnt ||
-						dio->boundary ||
-						dio->last_block_in_bio !=
-							map_bh.b_blocknr - 1) {
-					dio_bio_submit(dio);
-					dio->boundary = 0;
-				}
-			}
+
+			dio_prep_bio(dio);
 			if (dio->bio == NULL) {
-				ret = dio_bio_reap(dio);
-				if (ret)
-					goto fail_release;
-				ret = dio_bio_alloc(dio, map_bh.b_bdev,
-					map_bh.b_blocknr << (blkbits - 9),
-					DIO_BIO_MAX_SIZE / PAGE_SIZE);
+				ret = dio_new_bio(dio);
 				if (ret)
 					goto fail_release;
 				new_page = 1;
-				dio->boundary = 0;
 			}

 			bio = dio->bio;
@@ -357,17 +482,34 @@ int do_direct_IO(struct dio *dio)
 				page_cache_get(page);
 				dio->bvec->bv_page = page;
 				dio->bvec->bv_len = 0;
-				dio->bvec->bv_offset = block_in_page*blocksize;
+				dio->bvec->bv_offset = block_in_page << blkbits;
 				bio->bi_idx++;
-			}
 				new_page = 0;
-			dio->bvec->bv_len += blocksize;
-			bio->bi_size += blocksize;
-			dio->last_block_in_bio = map_bh.b_blocknr;
-			dio->boundary = buffer_boundary(&map_bh);
+			}

-			dio->block_in_file++;
-			if (dio->block_in_file >= dio->final_block_in_request)
+			/* Work out how much disk we can add to this page */
+			this_chunk_blocks = dio->blocks_available;
+			u = (PAGE_SIZE - dio->bvec->bv_len) >> blkbits;
+			if (this_chunk_blocks > u)
+				this_chunk_blocks = u;
+			u = dio->final_block_in_request - dio->block_in_file;
+			if (this_chunk_blocks > u)
+				this_chunk_blocks = u;
+			this_chunk_bytes = this_chunk_blocks << blkbits;
+			BUG_ON(this_chunk_bytes == 0);
+
+			dio->bvec->bv_len += this_chunk_bytes;
+			bio->bi_size += this_chunk_bytes;
+			dio->next_block_in_bio += this_chunk_blocks;
+			dio->last_block_in_bio = dio->next_block_in_bio - 1;
+			dio->boundary = buffer_boundary(&dio->map_bh);
+			dio->block_in_file += this_chunk_blocks;
+			block_in_page += this_chunk_blocks;
+			dio->blocks_available -= this_chunk_blocks;
+next_block:
+			if (dio->block_in_file > dio->final_block_in_request)
+				BUG();
+			if (dio->block_in_file == dio->final_block_in_request)
 				break;
 		}
 		block_in_page = 0;
@@ -381,11 +523,16 @@ int do_direct_IO(struct dio *dio)
 	return ret;
 }

+/*
+ * The main direct-IO function.  This is a library function for use by
+ * filesystem drivers.
+ */
 int
 generic_direct_IO(int rw, struct inode *inode, char *buf, loff_t offset,
-			size_t count, get_block_t get_block)
+			size_t count, get_blocks_t get_blocks)
 {
-	const unsigned blocksize_mask = (1 << inode->i_blkbits) - 1;
+	const unsigned blkbits = inode->i_blkbits;
+	const unsigned blocksize_mask = (1 << blkbits) - 1;
 	const unsigned long user_addr = (unsigned long)buf;
 	int ret;
 	int ret2;
@@ -403,16 +550,18 @@ generic_direct_IO(int rw, struct inode *inode, char *buf, loff_t offset,
 	dio.bvec = NULL;
 	dio.inode = inode;
 	dio.rw = rw;
-	dio.block_in_file = offset >> inode->i_blkbits;
-	dio.final_block_in_request = (offset + count) >> inode->i_blkbits;
+	dio.blkbits = blkbits;
+	dio.block_in_file = offset >> blkbits;
+	dio.blocks_available = 0;
+	dio.final_block_in_request = (offset + count) >> blkbits;

 	/* Index into the first page of the first block */
-	dio.first_block_in_page = (user_addr & (PAGE_SIZE - 1))
-						>> inode->i_blkbits;
+	dio.first_block_in_page = (user_addr & (PAGE_SIZE - 1)) >> blkbits;
 	dio.boundary = 0;
 	dio.reap_counter = 0;
-	dio.get_block = get_block;
+	dio.get_blocks = get_blocks;
 	dio.last_block_in_bio = -1;
+	dio.next_block_in_bio = -1;

 	/* Page fetching state */
 	dio.curr_page = 0;
@@ -428,12 +577,13 @@ generic_direct_IO(int rw, struct inode *inode, char *buf, loff_t offset,
 	/* Page queue */
 	dio.head = 0;
 	dio.tail = 0;
+	dio.page_errors = 0;

 	/* BIO completion state */
 	atomic_set(&dio.bio_count, 0);
 	spin_lock_init(&dio.bio_list_lock);
 	dio.bio_list = NULL;
-	dio.waiter = current;
+	dio.waiter = NULL;

 	ret = do_direct_IO(&dio);

@@ -444,9 +594,11 @@ generic_direct_IO(int rw, struct inode *inode, char *buf, loff_t offset,
 	ret2 = dio_await_completion(&dio);
 	if (ret == 0)
 		ret = ret2;
+	if (ret == 0)
+		ret = dio.page_errors;
 	if (ret == 0)
 		ret = count - ((dio.final_block_in_request -
-				dio.block_in_file) << inode->i_blkbits);
+				dio.block_in_file) << blkbits);
 out:
 	return ret;
 }
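
The "work out how much disk we can add to this page" step in do_direct_IO() is a three-way clamp.  Here is a stand-alone sketch of that arithmetic; `toy_chunk_blocks()` and `TOY_PAGE_SIZE` are hypothetical names, and a 4096-byte page is assumed.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u		/* assumed page size */

/*
 * How many blocks may be added to the current bvec: limited by what
 * the fs mapped, by the room left in the page, and by what remains
 * of the request.
 */
static unsigned toy_chunk_blocks(unsigned blocks_available,
		unsigned bytes_in_bvec,
		unsigned long blocks_left_in_request, unsigned blkbits)
{
	unsigned chunk = blocks_available;
	unsigned long u;

	assert(bytes_in_bvec <= TOY_PAGE_SIZE);

	/* Room left in this page, in blocks. */
	u = (TOY_PAGE_SIZE - bytes_in_bvec) >> blkbits;
	if (chunk > u)
		chunk = (unsigned)u;

	/* Never run past the end of the request. */
	u = blocks_left_in_request;
	if (chunk > u)
		chunk = (unsigned)u;

	return chunk;
}
```

With 512-byte blocks and an empty bvec, a large mapping is consumed eight blocks (one page) at a time, so a single get_blocks() result can feed many pages without another call into the fs.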

--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -606,11 +606,24 @@ static int ext2_bmap(struct address_space *mapping, long block)
 	return generic_block_bmap(mapping,block,ext2_get_block);
 }

+static int
+ext2_get_blocks(struct inode *inode, sector_t iblock, unsigned long max_blocks,
+			struct buffer_head *bh_result, int create)
+{
+	int ret;
+
+	ret = ext2_get_block(inode, iblock, bh_result, create);
+	if (ret == 0)
+		bh_result->b_size = (1 << inode->i_blkbits);
+	return ret;
+}
+
 static int
 ext2_direct_IO(int rw, struct inode *inode, char *buf,
 			loff_t offset, size_t count)
 {
-	return generic_direct_IO(rw, inode, buf, offset, count, ext2_get_block);
+	return generic_direct_IO(rw, inode, buf,
+				offset, count, ext2_get_blocks);
 }

 static int

--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -293,10 +293,23 @@ static int jfs_bmap(struct address_space *mapping, long block)
 	return generic_block_bmap(mapping, block, jfs_get_block);
 }

+static int
+jfs_get_blocks(struct inode *inode, sector_t iblock, unsigned long max_blocks,
+			struct buffer_head *bh_result, int create)
+{
+	int ret;
+
+	ret = jfs_get_block(inode, iblock, bh_result, create);
+	if (ret == 0)
+		bh_result->b_size = (1 << inode->i_blkbits);
+	return ret;
+}
+
 static int jfs_direct_IO(int rw, struct inode *inode, char *buf,
 			loff_t offset, size_t count)
 {
-	return generic_direct_IO(rw, inode, buf, offset, count, jfs_get_block);
+	return generic_direct_IO(rw, inode, buf,
+				offset, count, jfs_get_blocks);
 }

 struct address_space_operations jfs_aops = {
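
The ext2 and jfs conversions above are compatibility shims: they ignore max_blocks and always report a single block.  A user-space sketch of that pattern follows; every name in it (`toy_get_block_t`, `toy_map`, `toy_wrap_get_block()`, `toy_linear_get_block()`) is hypothetical.

```c
#include <assert.h>

/* Hypothetical single-block mapper: returns 0 and fills *disk_block. */
typedef int (toy_get_block_t)(unsigned long iblock,
		unsigned long *disk_block);

struct toy_map {
	unsigned long b_blocknr;	/* disk block that was mapped */
	unsigned long b_size;		/* bytes mapped */
};

/*
 * Present a single-block mapper through a multi-block interface.
 * max_blocks is accepted but only one block is ever mapped.
 */
static int toy_wrap_get_block(toy_get_block_t *get_block,
		unsigned long iblock, unsigned long max_blocks,
		unsigned blkbits, struct toy_map *map)
{
	int ret;

	assert(max_blocks > 0);
	ret = get_block(iblock, &map->b_blocknr);
	if (ret == 0)
		map->b_size = 1UL << blkbits;	/* exactly one block */
	return ret;
}

/* Example mapper: file block n lives at disk block n + 1000. */
static int toy_linear_get_block(unsigned long iblock,
		unsigned long *disk_block)
{
	*disk_block = iblock + 1000;
	return 0;
}
```

A shim like this keeps the filesystem working against the new interface but still costs one mapping call per block; the reduction in calls only arrives once the fs returns longer runs in b_size.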

--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -211,7 +211,11 @@ extern void mnt_init(unsigned long);
 extern void files_init(unsigned long);

 struct buffer_head;
-typedef int (get_block_t)(struct inode*,sector_t,struct buffer_head*,int);
+typedef int (get_block_t)(struct inode *inode, sector_t iblock,
+			struct buffer_head *bh_result, int create);
+typedef int (get_blocks_t)(struct inode *inode, sector_t iblock,
+			unsigned long max_blocks,
+			struct buffer_head *bh_result, int create);

 #include <linux/pipe_fs_i.h>
 /* #include <linux/umsdos_fs_i.h> */
@@ -1238,7 +1242,7 @@ extern void do_generic_file_read(struct file *, loff_t *, read_descriptor_t *, r
 ssize_t generic_file_direct_IO(int rw, struct inode *inode, char *buf,
 				loff_t offset, size_t count);
 int generic_direct_IO(int rw, struct inode *inode, char *buf,
-			loff_t offset, size_t count, get_block_t *get_block);
+			loff_t offset, size_t count, get_blocks_t *get_blocks);

 extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
 extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin);