[PATCH] direct-to-BIO I/O for swapcache pages

This patch changes the swap I/O handling.  The objectives are:

- Remove swap special-casing
- Stop using buffer_heads -> direct-to-BIO
- Make S_ISREG swapfiles more robust.

I've spent quite some time with swap.  The first patches converted swap to
use block_read/write_full_page().  These were discarded because they are
still using buffer_heads, and a reasonable amount of otherwise unnecessary
infrastructure had to be added to the swap code just to make it look like a
regular fs.  So this code just has a custom direct-to-BIO path for swap,
which seems to be the most comfortable approach.

A significant thing here is the introduction of "swap extents".  A swap
extent is a simple data structure which maps a range of swap pages onto a
range of disk sectors.  It is simply:

	struct swap_extent {
		struct list_head list;
		pgoff_t start_page;
		pgoff_t nr_pages;
		sector_t start_block;
	};

At swapon time (for an S_ISREG swapfile), each block in the file is
bmap()ed and the block numbers are parsed to generate the device's swap
extent list.  This extent list is quite compact - a 512 megabyte swapfile
generates about 130 nodes in the list.  That's about 4 kbytes of storage.
The conversion from filesystem blocksize blocks into PAGE_SIZE blocks is
performed at swapon time.

At swapon time (for an S_ISBLK swapfile), we install a single swap extent
which describes the entire device.

The advantages of the swap extents are:

1: We never have to run bmap() (ie: read from disk) at swapout time.  So
   S_ISREG swapfiles are now just as robust as S_ISBLK swapfiles.
2: All the differences between S_ISBLK swapfiles and S_ISREG swapfiles are
   handled at swapon time.  During normal operation, we just don't care.
   Both types of swapfiles are handled the same way.
3: The extent lists always operate in PAGE_SIZE units.  So the problems of
   going from fs blocksize to PAGE_SIZE are handled at swapon time and
   normal operating code doesn't need to care.
4: Because we don't have to fiddle with different blocksizes, we can go
   direct-to-BIO for swap_readpage() and swap_writepage().  This introduces
   the kernel-wide invariant "anonymous pages never have buffers attached",
   which cleans some things up nicely.  All those block_flushpage() calls
   in the swap code simply go away.
5: The kernel no longer has to allocate both buffer_heads and BIOs to
   perform swapout.  Just a BIO.
6: It permits us to perform swapcache writeout and throttling for GFP_NOFS
   allocations (a later patch).

(Well, there is one sort of anon page which can have buffers: the pages
which are cast adrift in truncate_complete_page() because
do_invalidatepage() failed.  But these pages are never added to swapcache,
and nobody except the VM LRU has to deal with them).

The swapfile parser in setup_swap_extents() will attempt to extract the
largest possible number of PAGE_SIZE-sized and PAGE_SIZE-aligned chunks of
disk from the S_ISREG swapfile.  Any stray blocks (due to file
discontiguities) are simply discarded - we never swap to those.

If an S_ISREG swapfile is found to have any unmapped blocks (file holes)
then the swapon attempt will fail.

The extent list can be quite large (hundreds of nodes for a gigabyte
S_ISREG swapfile).  It needs to be consulted once for each page within
swap_readpage() and swap_writepage().  Hence there is a risk that we could
blow significant amounts of CPU walking that list.  However I have
implemented a "where we found the last block" cache, which is used as the
starting point for the next search.  Empirical testing indicates that this
is wildly effective - the average length of the list walk in
map_swap_page() is 0.3 iterations per page, with a 130-element list.

It _could_ be that some workloads do start suffering long walks in that
code, and perhaps a tree would be needed there.  But I doubt that, and if
this is happening then it means that we're seeking all over the disk for
swap I/O, and the list walk is the least of our problems.
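
The "where we found the last block" cache can be illustrated with a small
userspace sketch.  This is not the patch's map_swap_page(): it assumes a
sorted array of extents instead of a list_head chain, and the names
extent_map, extent_map_lookup() and the steps counter are made up for
illustration.  What it shows is that each search starts at the extent which
satisfied the previous one, so sequential swap I/O rarely walks at all.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the patch's struct swap_extent (no list_head). */
struct extent {
	unsigned long start_page;   /* first swap page covered */
	unsigned long nr_pages;     /* number of pages covered */
	unsigned long start_block;  /* first PAGE_SIZE-sized disk block */
};

/* Hypothetical container: the extents plus the "last hit" cursor. */
struct extent_map {
	const struct extent *ext;
	size_t nr_extents;
	size_t curr;                /* index of the extent that matched last time */
	unsigned long steps;        /* extents skipped so far, for measurement only */
};

/*
 * Map a swap page offset to a PAGE_SIZE-sized block number.
 * Returns 0 on success, -1 if no extent covers the offset.
 */
static int extent_map_lookup(struct extent_map *m, unsigned long offset,
			     unsigned long *block)
{
	size_t i = m->curr;         /* start where the last search ended */
	size_t tried;

	assert(m->nr_extents > 0);
	for (tried = 0; tried < m->nr_extents; tried++) {
		const struct extent *e = &m->ext[i];

		if (offset >= e->start_page &&
		    offset - e->start_page < e->nr_pages) {
			m->curr = i;    /* remember the hit for the next caller */
			*block = e->start_block + (offset - e->start_page);
			return 0;
		}
		m->steps++;
		i = (i + 1) % m->nr_extents;    /* wrap around the array */
	}
	return -1;
}
```

With extents {0, 4, 100}, {4, 2, 200} and {6, 10, 50} and the cursor at the
first extent, looking up page 4 skips one extent, and a following lookup of
page 5 hits the cached extent without any walk.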

rw_swap_page_nolock() now takes a page*, not a kernel virtual address.  It
has been renamed to rw_swap_page_sync() and it takes care of locking and
unlocking the page itself.  Which is all a much better interface.

Support for type 0 swap has been removed.  Current versions of mkswap(8)
seem to never produce v0 swap unless you explicitly ask for it, so I doubt
if this will affect anyone.  If you _do_ have a type 0 swapfile, swapon
will fail and the message

	version 0 swap is no longer supported. Use mkswap -v1 /dev/sdb3

is printed.  We can remove that code for real later on.  Really, all that
swapfile header parsing should be pushed out to userspace.

This code always uses single-page BIOs for swapin and swapout.  I have an
additional patch which converts swap to use mpage_writepages(), so we swap
out in 16-page BIOs.  It works fine, but I don't intend to submit that.
There just doesn't seem to be any significant advantage to it.

I can't see anything in sys_swapon()/sys_swapoff() which needs the
lock_kernel() calls, so I deleted them.

If you ftruncate an S_ISREG swapfile to a shorter size while it is in use,
subsequent swapout will destroy the filesystem.  It was always thus, but it
is much, much easier to do now.  Not really a kernel problem, but swapon(8)
should not be allowing the kernel to use swapfiles which are modifiable by
unprivileged users.
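
The swapon-time parsing for an S_ISREG swapfile might look roughly like the
following userspace sketch.  It is an illustration under stated
assumptions, not the code in setup_swap_extents(): the bmap() results are
assumed to be already collected in an array (0 meaning a hole),
blocks_per_page is assumed to be a power of two, and build_extents() and
its extent struct are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the patch's struct swap_extent (no list_head). */
struct extent {
	unsigned long start_page;   /* first swap page covered */
	unsigned long nr_pages;     /* number of pages covered */
	unsigned long start_block;  /* first disk block, in PAGE_SIZE units */
};

/*
 * blocks[i] is the disk block number of file block i, in fs-blocksize
 * units; 0 marks a hole.  Writes at most max_out extents to out[].
 * Returns the number of extents, or -1 on a hole or if out[] is too small.
 */
static long build_extents(const unsigned long *blocks, size_t nr_blocks,
			  unsigned int blocks_per_page,
			  struct extent *out, size_t max_out)
{
	size_t probe = 0, nr_out = 0, i;
	unsigned long page_no = 0;

	assert(blocks_per_page && !(blocks_per_page & (blocks_per_page - 1)));

	/* Any unmapped block fails the whole setup. */
	for (i = 0; i < nr_blocks; i++)
		if (blocks[i] == 0)
			return -1;

	while (probe + blocks_per_page <= nr_blocks) {
		unsigned long first = blocks[probe];
		unsigned long page_block;
		unsigned int k;

		/* The run must start PAGE_SIZE-aligned on disk... */
		if (first % blocks_per_page) {
			probe++;        /* stray block: discard it */
			continue;
		}
		/* ...and be contiguous for a whole page. */
		for (k = 1; k < blocks_per_page; k++)
			if (blocks[probe + k] != first + k)
				break;
		if (k < blocks_per_page) {
			probe++;        /* discontiguity: discard and reprobe */
			continue;
		}

		page_block = first / blocks_per_page;
		if (nr_out > 0 && out[nr_out - 1].start_block +
				  out[nr_out - 1].nr_pages == page_block) {
			out[nr_out - 1].nr_pages++;     /* extend the last extent */
		} else {
			if (nr_out == max_out)
				return -1;
			out[nr_out].start_page = page_no;
			out[nr_out].nr_pages = 1;
			out[nr_out].start_block = page_block;
			nr_out++;
		}
		page_no++;
		probe += blocks_per_page;
	}
	return (long)nr_out;
}
```

Swap page numbers stay dense even when disk blocks are skipped, which is why
normal operating code only ever sees PAGE_SIZE units.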

Commit 88c4650a authored Jun 17, 2002 by Andrew Morton Committed by Linus Torvalds Jun 17, 2002
8 changed files
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -492,7 +492,7 @@ static void free_more_memory(void)
 }

 /*
- * I/O completion handler for block_read_full_page() and brw_page() - pages
+ * I/O completion handler for block_read_full_page() - pages
 * which come unlocked at the end of I/O.
 */
 static void end_buffer_async_read(struct buffer_head *bh, int uptodate)
@@ -551,9 +551,8 @@ static void end_buffer_async_read(struct buffer_head *bh, int uptodate)
 }

 /*
- * Completion handler for block_write_full_page() and for brw_page() - pages
- * which are unlocked during I/O, and which have PageWriteback cleared
- * upon I/O completion.
+ * Completion handler for block_write_full_page() - pages which are unlocked
+ * during I/O, and which have PageWriteback cleared upon I/O completion.
 */
 static void end_buffer_async_write(struct buffer_head *bh, int uptodate)
 {
@@ -1360,11 +1359,11 @@ int block_invalidatepage(struct page *page, unsigned long offset)
 {
 	struct buffer_head *head, *bh, *next;
 	unsigned int curr_off = 0;
+	int ret = 1;

-	if (!PageLocked(page))
-		BUG();
+	BUG_ON(!PageLocked(page));
 	if (!page_has_buffers(page))
-		return 1;
+		goto out;

 	head = page_buffers(page);
 	bh = head;
@@ -1386,12 +1385,10 @@ int block_invalidatepage(struct page *page, unsigned long offset)
 	 * The get_block cached value has been unconditionally invalidated,
 	 * so real IO is not possible anymore.
 	 */
-	if (offset == 0) {
-		if (!try_to_release_page(page, 0))
-			return 0;
-	}
-
-	return 1;
+	if (offset == 0)
+		ret = try_to_release_page(page, 0);
+out:
+	return ret;
 }
 EXPORT_SYMBOL(block_invalidatepage);

@@ -2266,57 +2263,6 @@ int brw_kiovec(int rw, int nr, struct kiobuf *iovec[],
 	return err ? err : transferred;
 }

-/*
- * Start I/O on a page.
- * This function expects the page to be locked and may return
- * before I/O is complete. You then have to check page->locked
- * and page->uptodate.
- *
- * FIXME: we need a swapper_inode->get_block function to remove
- *        some of the bmap kludges and interface ugliness here.
- */
-int brw_page(int rw, struct page *page,
-		struct block_device *bdev, sector_t b[], int size)
-{
-	struct buffer_head *head, *bh;
-
-	BUG_ON(!PageLocked(page));
-
-	if (!page_has_buffers(page))
-		create_empty_buffers(page, size, 0);
-	head = bh = page_buffers(page);
-
-	/* Stage 1: lock all the buffers */
-	do {
-		lock_buffer(bh);
-		bh->b_blocknr = *(b++);
-		bh->b_bdev = bdev;
-		set_buffer_mapped(bh);
-		if (rw == WRITE) {
-			set_buffer_uptodate(bh);
-			clear_buffer_dirty(bh);
-			mark_buffer_async_write(bh);
-		} else {
-			mark_buffer_async_read(bh);
-		}
-		bh = bh->b_this_page;
-	} while (bh != head);
-
-	if (rw == WRITE) {
-		BUG_ON(PageWriteback(page));
-		SetPageWriteback(page);
-		unlock_page(page);
-	}
-
-	/* Stage 2: start the IO */
-	do {
-		struct buffer_head *next = bh->b_this_page;
-		submit_bh(rw, bh);
-		bh = next;
-	} while (bh != head);
-	return 0;
-}
-
 /*
 * Sanity checks for try_to_free_buffers.
 */

--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -183,7 +183,6 @@ struct buffer_head * __bread(struct block_device *, int, int);
 void wakeup_bdflush(void);
 struct buffer_head *alloc_buffer_head(int async);
 void free_buffer_head(struct buffer_head * bh);
-int brw_page(int, struct page *, struct block_device *, sector_t [], int);
 void FASTCALL(unlock_buffer(struct buffer_head *bh));

 /*

--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -5,6 +5,7 @@
 #include <linux/kdev_t.h>
 #include <linux/linkage.h>
 #include <linux/mmzone.h>
+#include <linux/list.h>
 #include <asm/page.h>

 #define SWAP_FLAG_PREFER	0x8000	/* set if swap priority specified */
@@ -61,6 +62,21 @@ typedef struct {

 #ifdef __KERNEL__

+/*
+ * A swap extent maps a range of a swapfile's PAGE_SIZE pages onto a range of
+ * disk blocks.  A list of swap extents maps the entire swapfile.  (Where the
+ * term `swapfile' refers to either a blockdevice or an IS_REG file.  Apart
+ * from setup, they're handled identically.
+ *
+ * We always assume that blocks are of size PAGE_SIZE.
+ */
+struct swap_extent {
+	struct list_head list;
+	pgoff_t start_page;
+	pgoff_t nr_pages;
+	sector_t start_block;
+};
+
 /*
 * Max bad pages in the new format..
 */
@@ -83,11 +99,17 @@ enum {

 /*
 * The in-memory structure used to track swap areas.
+ * extent_list.prev points at the lowest-index extent.  That list is
+ * sorted.
 */
 struct swap_info_struct {
 	unsigned int flags;
 	spinlock_t sdev_lock;
 	struct file *swap_file;
+	struct block_device *bdev;
+	struct list_head extent_list;
+	int nr_extents;
+	struct swap_extent *curr_swap_extent;
 	unsigned old_block_size;
 	unsigned short * swap_map;
 	unsigned int lowest_bit;
@@ -134,8 +156,9 @@ extern wait_queue_head_t kswapd_wait;
 extern int FASTCALL(try_to_free_pages(zone_t *, unsigned int, unsigned int));

 /* linux/mm/page_io.c */
-extern void rw_swap_page(int, struct page *);
-extern void rw_swap_page_nolock(int, swp_entry_t, char *);
+int swap_readpage(struct file *file, struct page *page);
+int swap_writepage(struct page *page);
+int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page);

 /* linux/mm/page_alloc.c */

@@ -163,12 +186,13 @@ extern unsigned int nr_swapfiles;
 extern struct swap_info_struct swap_info[];
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(void);
-extern void get_swaphandle_info(swp_entry_t, unsigned long *, struct inode **);
 extern int swap_duplicate(swp_entry_t);
-extern int swap_count(struct page *);
 extern int valid_swaphandles(swp_entry_t, unsigned long *);
 extern void swap_free(swp_entry_t);
 extern void free_swap_and_cache(swp_entry_t);
+sector_t map_swap_page(struct swap_info_struct *p, pgoff_t offset);
+struct swap_info_struct *get_swap_info_struct(unsigned type);
+
 struct swap_list_t {
 	int head;	/* head of priority-ordered swapfile list */
 	int next;	/* swapfile to be used next */

--- a/kernel/ksyms.c
+++ b/kernel/ksyms.c
@@ -559,7 +559,6 @@ EXPORT_SYMBOL(buffer_insert_list);
 EXPORT_SYMBOL(make_bad_inode);
 EXPORT_SYMBOL(is_bad_inode);
 EXPORT_SYMBOL(event);
-EXPORT_SYMBOL(brw_page);

 #ifdef CONFIG_UID16
 EXPORT_SYMBOL(overflowuid);

--- a/kernel/suspend.c
+++ b/kernel/suspend.c
@@ -320,14 +320,15 @@ static void mark_swapfiles(swp_entry_t prev, int mode)
 {
 	swp_entry_t entry;
 	union diskpage *cur;
-	
-	cur = (union diskpage *)get_free_page(GFP_ATOMIC);
-	if (!cur)
+	struct page *page;
+
+	page = alloc_page(GFP_ATOMIC);
+	if (!page)
 		panic("Out of memory in mark_swapfiles");
+	cur = page_address(page);
 	/* XXX: this is dirty hack to get first page of swap file */
 	entry = swp_entry(root_swap, 0);
-	lock_page(virt_to_page((unsigned long)cur));
-	rw_swap_page_nolock(READ, entry, (char *) cur);
+	rw_swap_page_sync(READ, entry, page);

 	if (mode == MARK_SWAP_RESUME) {
 	  	if (!memcmp("SUSP1R",cur->swh.magic.magic,6))
@@ -345,10 +346,8 @@ static void mark_swapfiles(swp_entry_t prev, int mode)
 		cur->link.next = prev; /* prev is the first/last swap page of the resume area */
 		/* link.next lies *no more* in last 4 bytes of magic */
 	}
-	lock_page(virt_to_page((unsigned long)cur));
-	rw_swap_page_nolock(WRITE, entry, (char *)cur);
-	
-	free_page((unsigned long)cur);
+	rw_swap_page_sync(WRITE, entry, page);
+	__free_page(page);
 }

 static void read_swapfiles(void) /* This is called before saving image */
@@ -409,6 +408,7 @@ static int write_suspend_image(void)
 	int nr_pgdir_pages = SUSPEND_PD_PAGES(nr_copy_pages);
 	union diskpage *cur,  *buffer = (union diskpage *)get_free_page(GFP_ATOMIC);
 	unsigned long address;
+	struct page *page;

 	PRINTS( "Writing data to swap (%d pages): ", nr_copy_pages );
 	for (i=0; i<nr_copy_pages; i++) {
@@ -421,13 +421,8 @@ static int write_suspend_image(void)
 			panic("\nPage %d: not enough swapspace on suspend device", i );
 	    
 		address = (pagedir_nosave+i)->address;
-		lock_page(virt_to_page(address));
-		{
-			long dummy1;
-			struct inode *suspend_file;
-			get_swaphandle_info(entry, &dummy1, &suspend_file);
-		}
-		rw_swap_page_nolock(WRITE, entry, (char *) address);
+		page = virt_to_page(address);
+		rw_swap_page_sync(WRITE, entry, page);
 		(pagedir_nosave+i)->swap_address = entry;
 	}
 	PRINTK(" done\n");
@@ -452,8 +447,8 @@ static int write_suspend_image(void)
 		if (PAGE_SIZE % sizeof(struct pbe))
 			panic("I need PAGE_SIZE to be integer multiple of struct pbe, otherwise next assignment could damage pagedir");
 		cur->link.next = prev;				
-		lock_page(virt_to_page((unsigned long)cur));
-		rw_swap_page_nolock(WRITE, entry, (char *) cur);
+		page = virt_to_page((unsigned long)cur);
+		rw_swap_page_sync(WRITE, entry, page);
 		prev = entry;
 	}
 	PRINTK(", header");
@@ -473,8 +468,8 @@ static int write_suspend_image(void)
 		
 	cur->link.next = prev;

-	lock_page(virt_to_page((unsigned long)cur));
-	rw_swap_page_nolock(WRITE, entry, (char *) cur);
+	page = virt_to_page((unsigned long)cur);
+	rw_swap_page_sync(WRITE, entry, page);
 	prev = entry;

 	PRINTK( ", signature" );

--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -14,112 +14,163 @@
 #include <linux/kernel_stat.h>
 #include <linux/pagemap.h>
 #include <linux/swap.h>
-#include <linux/swapctl.h>
-#include <linux/buffer_head.h>		/* for brw_page() */
-
+#include <linux/bio.h>
+#include <linux/buffer_head.h>
 #include <asm/pgtable.h>
+#include <linux/swapops.h>

-/*
- * Reads or writes a swap page.
- * wait=1: start I/O and wait for completion. wait=0: start asynchronous I/O.
- *
- * Important prevention of race condition: the caller *must* atomically 
- * create a unique swap cache entry for this swap page before calling
- * rw_swap_page, and must lock that page.  By ensuring that there is a
- * single page of memory reserved for the swap entry, the normal VM page
- * lock on that page also doubles as a lock on swap entries.  Having only
- * one lock to deal with per swap entry (rather than locking swap and memory
- * independently) also makes it easier to make certain swapping operations
- * atomic, which is particularly important when we are trying to ensure 
- * that shared pages stay shared while being swapped.
- */
+static int
+swap_get_block(struct inode *inode, sector_t iblock,
+		struct buffer_head *bh_result, int create)
+{
+	struct swap_info_struct *sis;
+	swp_entry_t entry;

-static int rw_swap_page_base(int rw, swp_entry_t entry, struct page *page)
+	entry.val = iblock;
+	sis = get_swap_info_struct(swp_type(entry));
+	bh_result->b_bdev = sis->bdev;
+	bh_result->b_blocknr = map_swap_page(sis, swp_offset(entry));
+	bh_result->b_size = PAGE_SIZE;
+	set_buffer_mapped(bh_result);
+	return 0;
+}
+
+static struct bio *
+get_swap_bio(int gfp_flags, struct page *page, bio_end_io_t end_io)
 {
-	unsigned long offset;
-	sector_t zones[PAGE_SIZE/512];
-	int zones_used;
-	int block_size;
-	struct inode *swapf = 0;
-	struct block_device *bdev;
+	struct bio *bio;
+	struct buffer_head bh;

-	if (rw == READ) {
+	bio = bio_alloc(gfp_flags, 1);
+	if (bio) {
+		swap_get_block(NULL, page->index, &bh, 1);
+		bio->bi_sector = bh.b_blocknr * (PAGE_SIZE >> 9);
+		bio->bi_bdev = bh.b_bdev;
+		bio->bi_io_vec[0].bv_page = page;
+		bio->bi_io_vec[0].bv_len = PAGE_SIZE;
+		bio->bi_io_vec[0].bv_offset = 0;
+		bio->bi_vcnt = 1;
+		bio->bi_idx = 0;
+		bio->bi_size = PAGE_SIZE;
+		bio->bi_end_io = end_io;
+	}
+	return bio;
+}
+
+static void end_swap_bio_write(struct bio *bio)
+{
+	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	struct page *page = bio->bi_io_vec[0].bv_page;
+
+	if (!uptodate)
+		SetPageError(page);
+	end_page_writeback(page);
+	bio_put(bio);
+}
+
+static void end_swap_bio_read(struct bio *bio)
+{
+	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	struct page *page = bio->bi_io_vec[0].bv_page;
+
+	if (!uptodate) {
+		SetPageError(page);
 		ClearPageUptodate(page);
-		kstat.pswpin++;
-	} else
-		kstat.pswpout++;
-
-	get_swaphandle_info(entry, &offset, &swapf);
-	bdev = swapf->i_bdev;
-	if (bdev) {
-		zones[0] = offset;
-		zones_used = 1;
-		block_size = PAGE_SIZE;
 	} else {
-		int i, j;
-		unsigned int block = offset
-			<< (PAGE_SHIFT - swapf->i_sb->s_blocksize_bits);
-
-		block_size = swapf->i_sb->s_blocksize;
-		for (i=0, j=0; j< PAGE_SIZE ; i++, j += block_size)
-			if (!(zones[i] = bmap(swapf,block++))) {
-				printk("rw_swap_page: bad swap file\n");
-				return 0;
-			}
-		zones_used = i;
-		bdev = swapf->i_sb->s_bdev;
+		SetPageUptodate(page);
 	}
+	unlock_page(page);
+	bio_put(bio);
+}

- 	/* block_size == PAGE_SIZE/zones_used */
- 	brw_page(rw, page, bdev, zones, block_size);
+/*
+ * We may have stale swap cache pages in memory: notice
+ * them here and get rid of the unnecessary final write.
+ */
+int swap_writepage(struct page *page)
+{
+	struct bio *bio;
+	int ret = 0;

- 	/* Note! For consistency we do all of the logic,
- 	 * decrementing the page count, and unlocking the page in the
- 	 * swap lock map - in the IO completion handler.
- 	 */
-	return 1;
+	if (remove_exclusive_swap_page(page)) {
+		unlock_page(page);
+		goto out;
+	}
+	bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
+	if (bio == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	kstat.pswpout++;
+	SetPageWriteback(page);
+	unlock_page(page);
+	submit_bio(WRITE, bio);
+out:
+	return ret;
 }

+int swap_readpage(struct file *file, struct page *page)
+{
+	struct bio *bio;
+	int ret = 0;
+
+	ClearPageUptodate(page);
+	bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
+	if (bio == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	kstat.pswpin++;
+	submit_bio(READ, bio);
+out:
+	return ret;
+}
 /*
- * A simple wrapper so the base function doesn't need to enforce
- * that all swap pages go through the swap cache! We verify that:
- *  - the page is locked
- *  - it's marked as being swap-cache
- *  - it's associated with the swap inode
+ * swapper_space doesn't have a real inode, so it gets a special vm_writeback()
+ * so we don't need swap special cases in generic_vm_writeback().
+ *
+ * Swap pages are PageLocked and PageWriteback while under writeout so that
+ * memory allocators will throttle against them.
 */
-void rw_swap_page(int rw, struct page *page)
+static int swap_vm_writeback(struct page *page, int *nr_to_write)
 {
-	swp_entry_t entry;
+	struct address_space *mapping = page->mapping;

-	entry.val = page->index;
-
-	if (!PageLocked(page))
-		PAGE_BUG(page);
-	if (!PageSwapCache(page))
-		PAGE_BUG(page);
-	if (!rw_swap_page_base(rw, entry, page))
-		unlock_page(page);
+	unlock_page(page);
+	return generic_writepages(mapping, nr_to_write);
 }

+struct address_space_operations swap_aops = {
+	vm_writeback:	swap_vm_writeback,
+	writepage:	swap_writepage,
+	readpage:	swap_readpage,
+	sync_page:	block_sync_page,
+	set_page_dirty:	__set_page_dirty_nobuffers,
+};
+
 /*
- * The swap lock map insists that pages be in the page cache!
- * Therefore we can't use it.  Later when we can remove the need for the
- * lock map and we can reduce the number of functions exported.
+ * A scruffy utility function to read or write an arbitrary swap page
+ * and wait on the I/O.
 */
-void rw_swap_page_nolock(int rw, swp_entry_t entry, char *buf)
+int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page)
 {
-	struct page *page = virt_to_page(buf);
-	
-	if (!PageLocked(page))
-		PAGE_BUG(page);
-	if (page->mapping)
-		PAGE_BUG(page);
-	/* needs sync_page to wait I/O completation */
+	int ret;
+
+	lock_page(page);
+
+	BUG_ON(page->mapping);
 	page->mapping = &swapper_space;
-	if (rw_swap_page_base(rw, entry, page))
-		lock_page(page);
-	if (page_has_buffers(page) && !try_to_free_buffers(page))
-		PAGE_BUG(page);
+	page->index = entry.val;
+
+	if (rw == READ) {
+		ret = swap_readpage(NULL, page);
+		wait_on_page_locked(page);
+	} else {
+		ret = swap_writepage(page);
+		wait_on_page_writeback(page);
+	}
 	page->mapping = NULL;
-	unlock_page(page);
+	if (ret == 0 && (!PageUptodate(page) || PageError(page)))
+		ret = -EIO;
+	return ret;
 }
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -14,54 +14,27 @@
 #include <linux/init.h>
 #include <linux/pagemap.h>
 #include <linux/smp_lock.h>
-#include <linux/buffer_head.h>	/* block_sync_page()/try_to_free_buffers() */
+#include <linux/buffer_head.h>	/* block_sync_page() */

 #include <asm/pgtable.h>

-/*
- * We may have stale swap cache pages in memory: notice
- * them here and get rid of the unnecessary final write.
- */
-static int swap_writepage(struct page *page)
-{
-	if (remove_exclusive_swap_page(page)) {
-		unlock_page(page);
-		return 0;
-	}
-	rw_swap_page(WRITE, page);
-	return 0;
-}
-
-/*
- * swapper_space doesn't have a real inode, so it gets a special vm_writeback()
- * so we don't need swap special cases in generic_vm_writeback().
- *
- * Swap pages are PageLocked and PageWriteback while under writeout so that
- * memory allocators will throttle against them.
- */
-static int swap_vm_writeback(struct page *page, int *nr_to_write)
-{
-	struct address_space *mapping = page->mapping;
-
-	unlock_page(page);
-	return generic_writepages(mapping, nr_to_write);
-}
-
-static struct address_space_operations swap_aops = {
-	vm_writeback:	swap_vm_writeback,
-	writepage:	swap_writepage,
-	sync_page:	block_sync_page,
-	set_page_dirty:	__set_page_dirty_nobuffers,
-};
-
 /*
 * swapper_inode doesn't do anything much.  It is really only here to
 * avoid some special-casing in other parts of the kernel.
+ *
+ * We set i_size to "infinity" to keep the page I/O functions happy.  The swap
+ * block allocator makes sure that allocations are in-range.  A strange
+ * number is chosen to prevent various arith overflows elsewhere.  For example,
+ * `lblock' in block_read_full_page().
 */
 static struct inode swapper_inode = {
-	i_mapping:		&swapper_space,
+	i_mapping:	&swapper_space,
+	i_size:		PAGE_SIZE * 0xffffffffLL,
+	i_blkbits:	PAGE_SHIFT,
 };

+extern struct address_space_operations swap_aops;
+
 struct address_space swapper_space = {
 	page_tree:	RADIX_TREE_INIT(GFP_ATOMIC),
 	page_lock:	RW_LOCK_UNLOCKED,
@@ -149,14 +122,9 @@ void delete_from_swap_cache(struct page *page)
 {
 	swp_entry_t entry;

-	/*
-	 * I/O should have completed and nobody can have a ref against the
-	 * page's buffers
-	 */
 	BUG_ON(!PageLocked(page));
 	BUG_ON(PageWriteback(page));
-	if (page_has_buffers(page) && !try_to_free_buffers(page))
-		BUG();
+	BUG_ON(page_has_buffers(page));
  
 	entry.val = page->index;

@@ -222,16 +190,9 @@ int move_from_swap_cache(struct page *page, unsigned long index,
 	void **pslot;
 	int err;

-	/*
-	 * Drop the buffers now, before taking the page_lock.  Because
-	 * mapping->private_lock nests outside mapping->page_lock.
-	 * This "must" succeed.  The page is locked and all I/O has completed
-	 * and nobody else has a ref against its buffers.
-	 */
 	BUG_ON(!PageLocked(page));
 	BUG_ON(PageWriteback(page));
-	if (page_has_buffers(page) && !try_to_free_buffers(page))
-		BUG();
+	BUG_ON(page_has_buffers(page));

 	write_lock(&swapper_space.page_lock);
 	write_lock(&mapping->page_lock);
@@ -361,7 +322,7 @@ struct page * read_swap_cache_async(swp_entry_t entry)
 			/*
 			 * Initiate read into locked page and return.
 			 */
-			rw_swap_page(READ, new_page);
+			swap_readpage(NULL, new_page);
 			return new_page;
 		}
 	} while (err != -ENOENT && err != -ENOMEM);

--- a/mm/swapfile.c
+++ b/mm/swapfile.c