[PATCH] writeback from address spaces

[ I reversed the order in which writeback walks the superblock's
  dirty inodes.  It sped up dbench's unlink phase greatly.  I'm
  such a sleaze ]

The core writeback patch.  Switches file writeback from the dirty
buffer LRU over to address_space.dirty_pages.

- The buffer LRU is removed

- The buffer hash is removed (uses blockdev pagecache lookups)

- The bdflush and kupdate functions are implemented against
  address_spaces, via pdflush.

- The relationship between pages and buffers is changed.

  - If a page has dirty buffers, it is marked dirty
  - If a page is marked dirty, it *may* have dirty buffers.
  - A dirty page may be "partially dirty".  block_write_full_page
    discovers this.

- A bunch of consistency checks of the form

	if (!something_which_should_be_true())
		buffer_error();

  have been introduced.  These fog the code up but are important for
  ensuring that the new buffer/page code is working correctly.

- New locking (inode.i_bufferlist_lock) is introduced for exclusion
  from try_to_free_buffers().  This is needed because set_page_dirty
  is called under spinlock, so it cannot lock the page.  But it needs
  access to page->buffers to set them all dirty.

  i_bufferlist_lock is also used to protect inode.i_dirty_buffers.

- fs/inode.c has been split: all the code related to file data
  writeback has been moved into fs/fs-writeback.c

- Code related to file data writeback at the address_space level is in
  the new mm/page-writeback.c

- try_to_free_buffers() is now non-blocking

- Switches vmscan.c over to understand that all pages with dirty data
  are now marked dirty.

- Introduces a new a_op for VM writeback:

	->vm_writeback(struct page *page, int *nr_to_write)

  This is a bit half-baked at present.  The intent is that the
  address_space is given the opportunity to perform clustered
  writeback: to allow it to opportunistically write out
  disk-contiguous dirty data which may be in other zones, and to allow
  delayed-allocate filesystems to get good disk layout.

- Added address_space.io_pages.  Pages which are being prepared for
  writeback.  This is here for two reasons:

  1: It will be needed later, when BIOs are assembled direct against
     pagecache, bypassing the buffer layer.  It avoids a deadlock
     which would occur if someone moved the page back onto the
     dirty_pages list after it was added to the BIO, but before it was
     submitted.  (hmm.  This may not be a problem with PG_writeback
     logic).

  2: Avoids a livelock which would occur if some other thread is
     continually redirtying pages.

- There are two known performance problems in this code:

  1: Pages which are locked for writeback cause undesirable blocking
     when they are being overwritten.  A patch which leaves pages
     unlocked during writeback comes later in the series.

  2: While inodes are under writeback, they are locked.  This causes
     namespace lookups against the file to get unnecessarily blocked
     in wait_on_inode().  This is a fairly minor problem.

     I don't have a fix for this at present - I'll fix this when I
     attach dirty address_spaces direct to super_blocks.

- The patch vastly increases the amount of dirty data which the kernel
  permits highmem machines to maintain.  This is because the balancing
  decisions are made against the amount of memory in the machine, not
  against the amount of buffercache-allocatable memory.

  This may be very wrong, although it works fine for me (2.5 gigs).

  We can trivially go back to the old-style throttling with
  s/nr_free_pagecache_pages/nr_free_buffer_pages/ in
  balance_dirty_pages().  But better would be to allow blockdev
  mappings to use highmem (I'm thinking about this one, slowly).  And
  to move writer-throttling and writeback decisions into the VM (modulo
  the file-overwriting problem).

- Drops 24 bytes from struct buffer_head.  More to come.

- There's some gunk like super_block.flags:MS_FLUSHING which needs to
  be killed.  Need a better way of providing collision avoidance
  between pdflush threads, to prevent more than one pdflush thread
  working a disk at the same time.

  The correct way to do that is to put a flag in the request queue to
  say "there's a pdflush thread working this disk".  This is easy to
  do: just generalise the "ra_pages" pointer to point at a struct which
  includes ra_pages and the new collision-avoidance flag.
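
The collision-avoidance idea in the last item can be sketched in userspace C. This is only an illustration under stated assumptions, not code from the patch: `struct queue_wb_info`, `queue_wb_init()`, `flusher_claim()` and `flusher_release()` are hypothetical names, and C11 atomics stand in for whatever bit operations the kernel would actually use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical per-queue structure: the bare "ra_pages" value is
 * generalised into a struct which also carries the collision-avoidance
 * flag.  Struct and field names are illustrative, not from the patch.
 */
struct queue_wb_info {
	unsigned long ra_pages;		/* readahead window, in pages */
	atomic_bool flusher_active;	/* true while a flusher works this disk */
};

static void queue_wb_init(struct queue_wb_info *info, unsigned long ra_pages)
{
	info->ra_pages = ra_pages;
	atomic_init(&info->flusher_active, false);
}

/* Returns true if the caller claimed the disk, false if someone else holds it. */
static bool flusher_claim(struct queue_wb_info *info)
{
	/* atomic_exchange returns the previous value: false means we won. */
	return !atomic_exchange(&info->flusher_active, true);
}

static void flusher_release(struct queue_wb_info *info)
{
	assert(atomic_load(&info->flusher_active));	/* caller must hold the claim */
	atomic_store(&info->flusher_active, false);
}
```

A flusher thread would call `flusher_claim()` before starting writeback against a queue and skip that disk if the call returns false. Keeping the flag next to `ra_pages` means anything that can already reach the readahead setting can also reach the flag.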

Commit 090da372 authored Apr 29, 2002 by Andrew Morton Committed by Linus Torvalds Apr 29, 2002
35 changed files
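
The page/buffer rules stated in the commit message (dirty buffers imply a dirty page; a dirty page may have only some, or none, of its buffers dirty) can be modelled with a toy structure. This is a minimal sketch, assuming four buffers per page. Every `toy_` name is hypothetical, and `toy_write_full_page()` only imitates the "discover what is really dirty" step that the message attributes to block_write_full_page.

```c
#include <assert.h>
#include <stdbool.h>

#define BUFFERS_PER_PAGE 4	/* assumption: e.g. a 4k page with 1k blocks */

struct toy_page {
	bool dirty;				/* page-level dirty flag */
	bool buffer_dirty[BUFFERS_PER_PAGE];	/* per-buffer dirty flags */
};

/* Rule 1: dirtying a buffer always marks its page dirty. */
static void toy_mark_buffer_dirty(struct toy_page *page, int i)
{
	assert(i >= 0 && i < BUFFERS_PER_PAGE);
	page->buffer_dirty[i] = true;
	page->dirty = true;
}

/* Cleaning one buffer leaves the page flag alone (rule 2: "may"). */
static void toy_clean_buffer(struct toy_page *page, int i)
{
	assert(i >= 0 && i < BUFFERS_PER_PAGE);
	page->buffer_dirty[i] = false;
}

/*
 * Write out a dirty page: walk the buffers, "write" only the dirty
 * ones, and return how many were written.  A partially dirty page
 * yields fewer than BUFFERS_PER_PAGE writes, possibly zero.
 */
static int toy_write_full_page(struct toy_page *page)
{
	int written = 0;

	page->dirty = false;
	for (int i = 0; i < BUFFERS_PER_PAGE; i++) {
		if (page->buffer_dirty[i]) {
			page->buffer_dirty[i] = false;
			written++;
		}
	}
	return written;
}
```

The point of the model is that the page flag is a conservative hint: writeback can find dirty data by looking at pages alone, and the per-buffer state is consulted only once a page is actually being written.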
--- a/drivers/block/ll_rw_blk.c
+++ b/drivers/block/ll_rw_blk.c
@@ -1409,6 +1409,11 @@ int submit_bh(int rw, struct buffer_head * bh)
 	BUG_ON(!buffer_mapped(bh));
 	BUG_ON(!bh->b_end_io);

+	if ((rw == READ || rw == READA) && buffer_uptodate(bh))
+		printk("%s: read of uptodate buffer\n", __FUNCTION__);
+	if (rw == WRITE && !buffer_uptodate(bh))
+		printk("%s: write of non-uptodate buffer\n", __FUNCTION__);
+		
 	set_bit(BH_Req, &bh->b_state);

 	/*
@@ -1465,6 +1470,7 @@ int submit_bh(int rw, struct buffer_head * bh)
 *  a multiple of the current approved size for the device.
 *
 **/
+
 void ll_rw_block(int rw, int nr, struct buffer_head * bhs[])
 {
 	unsigned int major;
@@ -1513,7 +1519,6 @@ void ll_rw_block(int rw, int nr, struct buffer_head * bhs[])
 			if (!atomic_set_buffer_clean(bh))
 				/* Hmmph! Nothing to write */
 				goto end_io;
-			__mark_buffer_clean(bh);
 			break;

 		case READA:

--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -548,8 +548,6 @@ static int loop_thread(void *data)
 	atomic_inc(&lo->lo_pending);
 	spin_unlock_irq(&lo->lo_lock);

-	current->flags |= PF_NOIO;
-
 	/*
 	 * up sem, we are running
 	 */

--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -474,7 +474,6 @@ static struct buffer_head *raid5_build_block (struct stripe_head *sh, int i)

 	bh->b_state	= (1 << BH_Req) | (1 << BH_Mapped);
 	bh->b_size	= sh->size;
-	bh->b_list	= BUF_LOCKED;
 	return bh;
 }


--- a/fs/Makefile
+++ b/fs/Makefile
@@ -14,7 +14,8 @@ obj-y :=	open.o read_write.o devices.o file_table.o buffer.o \
 		bio.o super.o block_dev.o char_dev.o stat.o exec.o pipe.o \
 		namei.o fcntl.o ioctl.o readdir.o select.o fifo.o locks.o \
 		dcache.o inode.o attr.o bad_inode.o file.o iobuf.o dnotify.o \
-		filesystems.o namespace.o seq_file.o xattr.o libfs.o
+		filesystems.o namespace.o seq_file.o xattr.o libfs.o \
+		fs-writeback.o

 ifneq ($(CONFIG_NFSD),n)
 ifneq ($(CONFIG_NFSD),)

--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -180,7 +180,9 @@ static loff_t block_llseek(struct file *file, loff_t offset, int origin)
 	return retval;
 }
 	
-
+/*
+ * AKPM: fixme.  unneeded stuff here.
+ */
 static int __block_fsync(struct inode * inode)
 {
 	int ret, err;
@@ -759,6 +761,8 @@ struct address_space_operations def_blk_aops = {
 	sync_page: block_sync_page,
 	prepare_write: blkdev_prepare_write,
 	commit_write: blkdev_commit_write,
+	writeback_mapping: generic_writeback_mapping,
+	vm_writeback: generic_vm_writeback,
 	direct_IO: blkdev_direct_IO,
 };


--- a/fs/buffer.c
+++ b/fs/buffer.c
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1290,8 +1290,13 @@ static int ext3_writepage(struct page *page)

 	/* bget() all the buffers */
 	if (order_data) {
-		if (!page_has_buffers(page))
-			create_empty_buffers(page, inode->i_sb->s_blocksize);
+		if (!page_has_buffers(page)) {
+			if (!Page_Uptodate(page))
+				buffer_error();
+			create_empty_buffers(page,
+				inode->i_sb->s_blocksize,
+				(1 << BH_Dirty)|(1 << BH_Uptodate));
+		}
 		page_bufs = page_buffers(page);
 		walk_page_buffers(handle, page_bufs, 0,
 				PAGE_CACHE_SIZE, NULL, bget_one);
@@ -1394,7 +1399,7 @@ static int ext3_block_truncate_page(handle_t *handle,
 		goto out;

 	if (!page_has_buffers(page))
-		create_empty_buffers(page, blocksize);
+		create_empty_buffers(page, blocksize, 0);

 	/* Find the buffer that contains "offset" */
 	bh = page_buffers(page);
@@ -1448,7 +1453,7 @@ static int ext3_block_truncate_page(handle_t *handle,
 	} else {
 		if (ext3_should_order_data(inode))
 			err = ext3_journal_dirty_data(handle, bh, 0);
-		__mark_buffer_dirty(bh);
+		mark_buffer_dirty(bh);
 	}

 unlock:

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
--- a/fs/inode.c
+++ b/fs/inode.c
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -62,8 +62,6 @@ static int __try_to_free_cp_buf(struct journal_head *jh)
 		__journal_remove_checkpoint(jh);
 		__journal_remove_journal_head(bh);
 		BUFFER_TRACE(bh, "release");
-		/* BUF_LOCKED -> BUF_CLEAN (fwiw) */
-		refile_buffer(bh);
 		__brelse(bh);
 		ret = 1;
 	}
@@ -149,8 +147,7 @@ static int __cleanup_transaction(journal_t *journal, transaction_t *transaction)
 		/*
 		 * We used to test for (jh->b_list != BUF_CLEAN) here.
 		 * But unmap_underlying_metadata() can place buffer onto
-		 * BUF_CLEAN. Since refile_buffer() no longer takes buffers
-		 * off checkpoint lists, we cope with it here
+		 * BUF_CLEAN.
 		 */
 		/*
 		 * AKPM: I think the buffer_jdirty test is redundant - it
@@ -161,7 +158,6 @@ static int __cleanup_transaction(journal_t *journal, transaction_t *transaction)
 			BUFFER_TRACE(bh, "remove from checkpoint");
 			__journal_remove_checkpoint(jh);
 			__journal_remove_journal_head(bh);
-			refile_buffer(bh);
 			__brelse(bh);
 			ret = 1;
 		}

--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -210,7 +210,6 @@ void journal_commit_transaction(journal_t *journal)
 				__journal_unfile_buffer(jh);
 				jh->b_transaction = NULL;
 				__journal_remove_journal_head(bh);
-				refile_buffer(bh);
 				__brelse(bh);
 			}
 		}
@@ -291,10 +290,6 @@ void journal_commit_transaction(journal_t *journal)
 			jh->b_transaction = NULL;
 			__journal_remove_journal_head(bh);
 			BUFFER_TRACE(bh, "finished async writeout: refile");
-			/* It can sometimes be on BUF_LOCKED due to migration
-			 * from syncdata to asyncdata */
-			if (bh->b_list != BUF_CLEAN)
-				refile_buffer(bh);
 			__brelse(bh);
 		}
 	}
@@ -454,6 +449,7 @@ void journal_commit_transaction(journal_t *journal)
 				struct buffer_head *bh = wbuf[i];
 				set_bit(BH_Lock, &bh->b_state);
 				clear_bit(BH_Dirty, &bh->b_state);
+				mark_buffer_uptodate(bh, 1);
 				bh->b_end_io = journal_end_buffer_io_sync;
 				submit_bh(WRITE, bh);
 			}
@@ -592,6 +588,7 @@ void journal_commit_transaction(journal_t *journal)
 	JBUFFER_TRACE(descriptor, "write commit block");
 	{
 		struct buffer_head *bh = jh2bh(descriptor);
+		mark_buffer_uptodate(bh, 1);
 		ll_rw_block(WRITE, 1, &bh);
 		wait_on_buffer(bh);
 		__brelse(bh);		/* One for getblk() */

--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -328,7 +328,6 @@ void __journal_clean_data_list(transaction_t *transaction)
 			__journal_unfile_buffer(jh);
 			jh->b_transaction = NULL;
 			__journal_remove_journal_head(bh);
-			refile_buffer(bh);
 			__brelse(bh);
 			goto restart;
 		}
@@ -464,8 +463,6 @@ int journal_write_metadata_buffer(transaction_t *transaction,
 		}
 	} while (!new_bh);
 	/* keep subsequent assertions sane */
-	new_bh->b_prev_free = 0;
-	new_bh->b_next_free = 0;
 	new_bh->b_state = 0;
 	init_buffer(new_bh, NULL, NULL);
 	atomic_set(&new_bh->b_count, 1);

--- a/fs/jbd/revoke.c
+++ b/fs/jbd/revoke.c
@@ -406,11 +406,12 @@ int journal_cancel_revoke(handle_t *handle, struct journal_head *jh)
 	 * buffer_head?  If so, we'd better make sure we clear the
 	 * revoked status on any hashed alias too, otherwise the revoke
 	 * state machine will get very upset later on. */
-	if (need_cancel && !bh->b_pprev) {
+	if (need_cancel) {
 		struct buffer_head *bh2;
 		bh2 = __get_hash_table(bh->b_bdev, bh->b_blocknr, bh->b_size);
 		if (bh2) {
-			clear_bit(BH_Revoked, &bh2->b_state);
+			if (bh2 != bh)
+				clear_bit(BH_Revoked, &bh2->b_state);
 			__brelse(bh2);
 		}
 	}
@@ -540,6 +541,7 @@ static void flush_descriptor(journal_t *journal,
 	{
 		struct buffer_head *bh = jh2bh(descriptor);
 		BUFFER_TRACE(bh, "write");
+		mark_buffer_uptodate(bh, 1);
 		ll_rw_block (WRITE, 1, &bh);
 	}
 }

--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -592,9 +592,6 @@ do_get_write_access(handle_t *handle, struct journal_head *jh, int force_copy)
 			JBUFFER_TRACE(jh, "file as BJ_Reserved");
 			__journal_file_buffer(jh, transaction, BJ_Reserved);

-			/* And pull it off BUF_DIRTY, onto BUF_CLEAN */
-			refile_buffer(jh2bh(jh));
-
 			/*
 			 * The buffer is now hidden from bdflush.   It is
 			 * metadata against the current transaction.
@@ -812,8 +809,6 @@ int journal_get_create_access (handle_t *handle, struct buffer_head *bh)
 		jh->b_transaction = transaction;
 		JBUFFER_TRACE(jh, "file as BJ_Reserved");
 		__journal_file_buffer(jh, transaction, BJ_Reserved);
-		JBUFFER_TRACE(jh, "refile");
-		refile_buffer(jh2bh(jh));
 	} else if (jh->b_transaction == journal->j_committing_transaction) {
 		JBUFFER_TRACE(jh, "set next transaction");
 		jh->b_next_transaction = transaction;
@@ -1099,7 +1094,6 @@ int journal_dirty_metadata (handle_t *handle, struct buffer_head *bh)
 	
 	spin_lock(&journal_datalist_lock);
 	set_bit(BH_JBDDirty, &bh->b_state);
-	set_buffer_flushtime(bh);

 	J_ASSERT_JH(jh, jh->b_transaction != NULL);
 	
@@ -1691,7 +1685,7 @@ int journal_try_to_free_buffers(journal_t *journal,
 out:
 	ret = 0;
 	if (call_ttfb)
-		ret = try_to_free_buffers(page, gfp_mask);
+		ret = try_to_free_buffers(page);
 	return ret;
 }

@@ -1864,7 +1858,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh)
 	if (buffer_dirty(bh))
 		mark_buffer_clean(bh);
 	J_ASSERT_BH(bh, !buffer_jdirty(bh));
-	clear_bit(BH_Uptodate, &bh->b_state);
+//	clear_bit(BH_Uptodate, &bh->b_state);
 	clear_bit(BH_Mapped, &bh->b_state);
 	clear_bit(BH_Req, &bh->b_state);
 	clear_bit(BH_New, &bh->b_state);
@@ -1913,7 +1907,7 @@ int journal_flushpage(journal_t *journal,
 	unlock_journal(journal);

 	if (!offset) {
-		if (!may_free || !try_to_free_buffers(page, 0))
+		if (!may_free || !try_to_free_buffers(page))
 			return 0;
 		J_ASSERT(!page_has_buffers(page));
 	}
@@ -2021,9 +2015,6 @@ void __journal_refile_buffer(struct journal_head *jh)
 	if (jh->b_transaction != NULL) {
 		__journal_file_buffer(jh, jh->b_transaction, BJ_Metadata);
 		J_ASSERT_JH(jh, jh->b_transaction->t_state == T_RUNNING);
-	} else {
-		/* Onto BUF_DIRTY for writeback */
-		refile_buffer(jh2bh(jh));
 	}
 }


--- a/fs/ntfs/aops.c
+++ b/fs/ntfs/aops.c
@@ -120,7 +120,7 @@ static int ntfs_file_read_block(struct page *page)
 	blocksize = 1 << blocksize_bits;

 	if (!page_has_buffers(page))
-		create_empty_buffers(page, blocksize);
+		create_empty_buffers(page, blocksize, 0);
 	bh = head = page_buffers(page);
 	if (!bh)
 		return -ENOMEM;
@@ -417,7 +417,7 @@ static int ntfs_mftbmp_readpage(ntfs_volume *vol, struct page *page)
 	blocksize_bits = vol->sb->s_blocksize_bits;

 	if (!page_has_buffers(page))
-		create_empty_buffers(page, blocksize);
+		create_empty_buffers(page, blocksize, 0);
 	bh = head = page_buffers(page);
 	if (!bh)
 		return -ENOMEM;
@@ -656,7 +656,7 @@ int ntfs_mst_readpage(struct file *dir, struct page *page)
 	blocksize = 1 << blocksize_bits;

 	if (!page_has_buffers(page))
-		create_empty_buffers(page, blocksize);
+		create_empty_buffers(page, blocksize, 0);
 	bh = head = page_buffers(page);
 	if (!bh)
 		return -ENOMEM;

--- a/fs/reiserfs/do_balan.c
+++ b/fs/reiserfs/do_balan.c
@@ -29,13 +29,27 @@ struct tree_balance * cur_tb = NULL; /* detects whether more than one
                                        is interrupting do_balance */
 #endif

+/*
+ * AKPM: The __mark_buffer_dirty() call here will not
+ * put the buffer on the dirty buffer LRU because we've just
+ * set BH_Dirty.  That's a thinko in reiserfs.
+ *
+ * I'm reluctant to "fix" this bug because that would change
+ * behaviour.  Using mark_buffer_dirty() here would make the
+ * buffer eligible for VM and periodic writeback, which may
+ * violate ordering constraints.  I'll just leave the code
+ * as-is by removing the __mark_buffer_dirty call altogether.
+ *
+ * Chris says this code has "probably never been run" anyway.
+ * It is due to go away.
+ */

 inline void do_balance_mark_leaf_dirty (struct tree_balance * tb, 
 					struct buffer_head * bh, int flag)
 {
    if (reiserfs_dont_log(tb->tb_sb)) {
 	if (!test_and_set_bit(BH_Dirty, &bh->b_state)) {
-	    __mark_buffer_dirty(bh) ;
+//	    __mark_buffer_dirty(bh) ;
 	    tb->need_balance_dirty = 1;
 	}
    } else {

--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -107,7 +107,7 @@ inline void make_le_item_head (struct item_head * ih, const struct cpu_key * key
 static void add_to_flushlist(struct inode *inode, struct buffer_head *bh) {
    struct list_head *list = &(SB_JOURNAL(inode->i_sb)->j_dirty_buffers) ;

-    buffer_insert_list(bh, list) ;
+    buffer_insert_list(NULL, bh, list) ;
 }

 //
@@ -779,7 +779,13 @@ int reiserfs_get_block (struct inode * inode, sector_t block,
 	    /* mark it dirty now to prevent commit_write from adding
 	    ** this buffer to the inode's dirty buffer list
 	    */
-	    __mark_buffer_dirty(unbh) ;
+		/*
+		 * AKPM: changed __mark_buffer_dirty to mark_buffer_dirty().
+		 * It's still atomic, but it sets the page dirty too,
+		 * which makes it eligible for writeback at any time by the
+		 * VM (which was also the case with __mark_buffer_dirty())
+		 */
+	    mark_buffer_dirty(unbh) ;
 		  
 	    //inode->i_blocks += inode->i_sb->s_blocksize / 512;
 	    //mark_tail_converted (inode);

--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -123,10 +123,8 @@ static void init_journal_hash(struct super_block *p_s_sb) {
 ** more details.
 */
 static int reiserfs_clean_and_file_buffer(struct buffer_head *bh) {
-  if (bh) {
-    clear_bit(BH_Dirty, &bh->b_state) ;
-    refile_buffer(bh) ;
-  }
+  if (bh)
+    mark_buffer_clean(bh);
  return 0 ;
 }

@@ -1079,7 +1077,6 @@ printk("journal-813: BAD! buffer %lu %cdirty %cjwait, not in a newer tranasction
 	if (!buffer_uptodate(cn->bh)) {
 	  reiserfs_panic(s, "journal-949: buffer write failed\n") ;
 	}
-	refile_buffer(cn->bh) ;
        brelse(cn->bh) ;
      }
      cn = cn->next ;
@@ -3125,7 +3122,7 @@ printk("journal-2020: do_journal_end: BAD desc->j_len is ZERO\n") ;
  SB_JOURNAL_LIST_INDEX(p_s_sb) = jindex ;

  /* write any buffers that must hit disk before this commit is done */
-  fsync_buffers_list(&(SB_JOURNAL(p_s_sb)->j_dirty_buffers)) ;
+  fsync_buffers_list(NULL, &(SB_JOURNAL(p_s_sb)->j_dirty_buffers)) ;

  /* honor the flush and async wishes from the caller */
  if (flush) {

--- a/fs/reiserfs/prints.c
+++ b/fs/reiserfs/prints.c
@@ -138,8 +138,9 @@ static void sprintf_block_head (char * buf, struct buffer_head * bh)

 static void sprintf_buffer_head (char * buf, struct buffer_head * bh) 
 {
-  sprintf (buf, "dev %s, size %d, blocknr %ld, count %d, list %d, state 0x%lx, page %p, (%s, %s, %s)",
-	   bdevname (bh->b_bdev), bh->b_size, bh->b_blocknr, atomic_read (&(bh->b_count)), bh->b_list,
+  sprintf (buf, "dev %s, size %d, blocknr %ld, count %d, state 0x%lx, page %p, (%s, %s, %s)",
+	   bdevname (bh->b_bdev), bh->b_size, bh->b_blocknr,
+	   atomic_read (&(bh->b_count)),
 	   bh->b_state, bh->b_page,
 	   buffer_uptodate (bh) ? "UPTODATE" : "!UPTODATE",
 	   buffer_dirty (bh) ? "DIRTY" : "CLEAN",

--- a/include/linux/fs.h
+++ b/include/linux/fs.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -361,8 +361,6 @@ static inline void set_page_zone(struct page *page, unsigned long zone_num)

 #endif /* CONFIG_HIGHMEM || WANT_PAGE_VIRTUAL */

-extern void FASTCALL(set_page_dirty(struct page *));
-
 /*
 * Error return values for the *_nopage functions
 */
@@ -405,6 +403,26 @@ extern int ptrace_check_attach(struct task_struct *task, int kill);
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
 		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);

+int __set_page_dirty_buffers(struct page *page);
+int __set_page_dirty_nobuffers(struct page *page);
+
+/*
+ * If the mapping doesn't provide a set_page_dirty a_op, then
+ * just fall through and assume that it wants buffer_heads.
+ * FIXME: make the method unconditional.
+ */
+static inline int set_page_dirty(struct page *page)
+{
+	if (page->mapping) {
+		int (*spd)(struct page *);
+
+		spd = page->mapping->a_ops->set_page_dirty;
+		if (spd)
+			return (*spd)(page);
+	}
+	return __set_page_dirty_buffers(page);
+}
+
 /*
 * On a two-level page table, this ends up being trivial. Thus the
 * inlining and the symmetry break with pte_alloc_map() that does all
@@ -496,6 +514,9 @@ extern void truncate_inode_pages(struct address_space *, loff_t);
 extern int filemap_sync(struct vm_area_struct *, unsigned long,	size_t, unsigned int);
 extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int);

+/* mm/page-writeback.c */
+int generic_writeback_mapping(struct address_space *mapping, int *nr_to_write);
+
 /* readahead.c */
 #define VM_MAX_READAHEAD	128	/* kbytes */
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
@@ -550,9 +571,6 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m

 extern struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr);

-extern int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
-extern int pdflush_flush(unsigned long nr_pages);
-
 extern struct page * vmalloc_to_page(void *addr);
 extern unsigned long get_page_cache_size(void);
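
The new inline set_page_dirty() above dispatches through the mapping's a_op when one is provided and otherwise falls back to the buffer_head path. The same control flow can be exercised outside the kernel with stand-in types. This is a sketch under that assumption: all `toy_` types and functions are hypothetical, and the `dirtied_by` field exists only to record which path ran.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page;

struct toy_aops {
	int (*set_page_dirty)(struct toy_page *page);	/* may be NULL */
};

struct toy_mapping {
	const struct toy_aops *a_ops;
};

enum { PATH_NONE, PATH_BUFFERS, PATH_CUSTOM };

struct toy_page {
	struct toy_mapping *mapping;	/* NULL for a page with no mapping */
	int dirtied_by;			/* which path handled the dirtying */
};

/* Stand-in for the default buffer_head-based implementation. */
static int toy_set_page_dirty_buffers(struct toy_page *page)
{
	page->dirtied_by = PATH_BUFFERS;
	return 0;
}

/* Stand-in for a mapping that supplies its own method. */
static int toy_custom_set_page_dirty(struct toy_page *page)
{
	page->dirtied_by = PATH_CUSTOM;
	return 0;
}

/* Same shape as the inline in the hunk above: a_op first, then fallback. */
static int toy_set_page_dirty(struct toy_page *page)
{
	if (page->mapping) {
		int (*spd)(struct toy_page *) = page->mapping->a_ops->set_page_dirty;

		if (spd)
			return spd(page);
	}
	return toy_set_page_dirty_buffers(page);
}
```

Three cases fall out of this shape: a mapping with the method uses it, a mapping without the method gets the buffer path, and a page with no mapping at all also gets the buffer path.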


--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -368,8 +368,7 @@ do { if (atomic_dec_and_test(&(tsk)->usage)) __put_task_struct(tsk); } while(0)
 #define PF_MEMALLOC	0x00000800	/* Allocating memory */
 #define PF_MEMDIE	0x00001000	/* Killed for out-of-memory */
 #define PF_FREE_PAGES	0x00002000	/* per process page freeing */
-#define PF_NOIO		0x00004000	/* avoid generating further I/O */
-#define PF_FLUSHER	0x00008000	/* responsible for disk writeback */
+#define PF_FLUSHER	0x00004000	/* responsible for disk writeback */

 /*
 * Ptrace flags

--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -97,6 +97,7 @@ extern int nr_swap_pages;

 extern unsigned int nr_free_pages(void);
 extern unsigned int nr_free_buffer_pages(void);
+extern unsigned int nr_free_pagecache_pages(void);
 extern int nr_active_pages;
 extern int nr_inactive_pages;
 extern atomic_t nr_async_pages;

--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -133,7 +133,7 @@ enum
 	VM_SWAPCTL=1,		/* struct: Set vm swapping control */
 	VM_SWAPOUT=2,		/* int: Linear or sqrt() swapout for hogs */
 	VM_FREEPG=3,		/* struct: Set free page thresholds */
-	VM_BDFLUSH=4,		/* struct: Control buffer cache flushing */
+	VM_BDFLUSH_UNUSED=4,	/* Spare */
 	VM_OVERCOMMIT_MEMORY=5,	/* Turn off the virtual memory safety limit */
 	VM_BUFFERMEM=6,		/* struct: Set buffer memory thresholds */
 	VM_PAGECACHE=7,		/* struct: Set cache memory thresholds */

--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
+/*
+ * include/linux/writeback.h.
+ *
+ * These declarations are private to fs/ and mm/.
+ * Declarations which are exported to filesystems do not
+ * get placed here.
+ */
+#ifndef WRITEBACK_H
+#define WRITEBACK_H
+
+extern spinlock_t inode_lock;
+extern struct list_head inode_in_use;
+extern struct list_head inode_unused;
+
+/*
+ * fs/fs-writeback.c
+ */
+#define WB_SYNC_NONE	0	/* Don't wait on anything */
+#define WB_SYNC_LAST	1	/* Wait on the last-written mapping */
+#define WB_SYNC_ALL	2	/* Wait on every mapping */
+
+void try_to_writeback_unused_inodes(unsigned long pexclusive);
+void writeback_single_inode(struct inode *inode,
+				int sync, int *nr_to_write);
+void writeback_unlocked_inodes(int *nr_to_write, int sync_mode,
+				unsigned long *older_than_this);
+void writeback_inodes_sb(struct super_block *);
+void __wait_on_inode(struct inode * inode);
+void sync_inodes(void);
+
+static inline void wait_on_inode(struct inode *inode)
+{
+	if (inode->i_state & I_LOCK)
+		__wait_on_inode(inode);
+}
+
+/*
+ * mm/page-writeback.c
+ */
+/*
+ * How much data to write out at a time in various places.  This isn't
+ * really very important - it's just here to prevent any thread from
+ * locking an inode for too long and blocking other threads which wish
+ * to write the same file for allocation throttling purposes.
+ */
+#define WRITEOUT_PAGES	((4096 * 1024) / PAGE_CACHE_SIZE)
+
+void balance_dirty_pages(struct address_space *mapping);
+void balance_dirty_pages_ratelimited(struct address_space *mapping);
+int pdflush_flush(unsigned long nr_pages);
+int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
+
+#endif		/* WRITEBACK_H */
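
As a worked example of the WRITEOUT_PAGES definition in the header above: the macro expresses a fixed 4 MiB chunk as a page count, so the result depends on the page cache size. The helper below is a hypothetical userspace restatement of that arithmetic, not kernel code, with the page size passed as a parameter instead of coming from PAGE_CACHE_SIZE.

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring ((4096 * 1024) / PAGE_CACHE_SIZE):
 * the number of pages which make up one 4 MiB writeout chunk.
 */
static unsigned long writeout_pages(unsigned long page_cache_size)
{
	assert(page_cache_size != 0);	/* avoid dividing by zero */
	return (4096UL * 1024UL) / page_cache_size;
}
```

With 4096-byte pages this gives 1024 pages per chunk, and with 8192-byte pages it gives 512.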
--- a/init/main.c
+++ b/init/main.c
@@ -390,7 +390,6 @@ asmlinkage void __init start_kernel(void)
 	fork_init(mempages);
 	proc_caches_init();
 	vfs_caches_init(mempages);
-	buffer_init(mempages);
 	radix_tree_init();
 #if defined(CONFIG_ARCH_S390)
 	ccwcache_init();

--- a/kernel/ksyms.c
+++ b/kernel/ksyms.c
@@ -169,7 +169,6 @@ EXPORT_SYMBOL(__d_path);
 EXPORT_SYMBOL(mark_buffer_dirty);
 EXPORT_SYMBOL(end_buffer_io_sync);
 EXPORT_SYMBOL(set_buffer_async_io);
-EXPORT_SYMBOL(__mark_buffer_dirty);
 EXPORT_SYMBOL(__mark_inode_dirty);
 EXPORT_SYMBOL(get_empty_filp);
 EXPORT_SYMBOL(init_private_file);
@@ -212,7 +211,6 @@ EXPORT_SYMBOL(unlock_buffer);
 EXPORT_SYMBOL(__wait_on_buffer);
 EXPORT_SYMBOL(___wait_on_page);
 EXPORT_SYMBOL(generic_direct_IO);
-EXPORT_SYMBOL(discard_bh_page);
 EXPORT_SYMBOL(block_write_full_page);
 EXPORT_SYMBOL(block_read_full_page);
 EXPORT_SYMBOL(block_prepare_write);
@@ -339,7 +337,6 @@ EXPORT_SYMBOL(register_disk);
 EXPORT_SYMBOL(read_dev_sector);
 EXPORT_SYMBOL(tq_disk);
 EXPORT_SYMBOL(init_buffer);
-EXPORT_SYMBOL(refile_buffer);
 EXPORT_SYMBOL(wipe_partitions);

 /* tty routines */

--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -43,7 +43,6 @@
 /* External variables not in a header file. */
 extern int panic_timeout;
 extern int C_A_D;
-extern int bdf_prm[], bdflush_min[], bdflush_max[];
 extern int sysctl_overcommit_memory;
 extern int max_threads;
 extern atomic_t nr_queued_signals;
@@ -259,9 +258,6 @@ static ctl_table kern_table[] = {
 };

 static ctl_table vm_table[] = {
-	{VM_BDFLUSH, "bdflush", &bdf_prm, 9*sizeof(int), 0644, NULL,
-	 &proc_dointvec_minmax, &sysctl_intvec, NULL,
-	 &bdflush_min, &bdflush_max},
 	{VM_OVERCOMMIT_MEMORY, "overcommit_memory", &sysctl_overcommit_memory,
 	 sizeof(sysctl_overcommit_memory), 0644, NULL, &proc_dointvec},
 	{VM_PAGERDAEMON, "kswapd",

--- a/mm/Makefile
+++ b/mm/Makefile
@@ -9,12 +9,13 @@

 O_TARGET := mm.o

-export-objs := shmem.o filemap.o mempool.o page_alloc.o
+export-objs := shmem.o filemap.o mempool.o page_alloc.o \
+		page-writeback.o

 obj-y	 := memory.o mmap.o filemap.o mprotect.o mlock.o mremap.o \
 	    vmalloc.o slab.o bootmem.o swap.o vmscan.o page_io.o \
 	    page_alloc.o swap_state.o swapfile.o numa.o oom_kill.o \
 	    shmem.o highmem.o mempool.o msync.o mincore.o readahead.o \
-	    pdflush.o
+	    pdflush.o page-writeback.o

 include $(TOPDIR)/Rules.make
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -11,29 +11,19 @@
 */
 #include <linux/module.h>
 #include <linux/slab.h>
-#include <linux/shm.h>
+#include <linux/compiler.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
 #include <linux/mman.h>
-#include <linux/locks.h>
 #include <linux/pagemap.h>
-#include <linux/swap.h>
-#include <linux/smp_lock.h>
-#include <linux/blkdev.h>
 #include <linux/file.h>
-#include <linux/swapctl.h>
-#include <linux/init.h>
-#include <linux/mm.h>
 #include <linux/iobuf.h>
-#include <linux/compiler.h>
-#include <linux/fs.h>
 #include <linux/hash.h>
-#include <linux/blkdev.h>
+#include <linux/writeback.h>

-#include <asm/pgalloc.h>
 #include <asm/uaccess.h>
 #include <asm/mman.h>

-#include <linux/highmem.h>
-
 /*
 * Shared mappings implemented 30.11.1994. It's not fully working yet,
 * though.
@@ -49,13 +39,17 @@

 /*
 * Lock ordering:
- *	pagemap_lru_lock ==> page_lock ==> i_shared_lock
+ *
+ *  pagemap_lru_lock
+ *  ->i_shared_lock		(vmtruncate)
+ *    ->i_bufferlist_lock	(__free_pte->__set_page_dirty_buffers)
+ *      ->unused_list_lock	(try_to_free_buffers)
+ *        ->mapping->page_lock
+ *      ->inode_lock		(__mark_inode_dirty)
+ *        ->sb_lock		(fs/fs-writeback.c)
 */
 spinlock_t pagemap_lru_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;

-#define CLUSTER_PAGES		(1 << page_cluster)
-#define CLUSTER_OFFSET(x)	(((x) >> page_cluster) << page_cluster)
-
 /*
 * Remove a page from the page cache and free it. Caller has to make
 * sure the page is locked and that nobody else uses it - or that usage
@@ -97,26 +91,6 @@ static inline int sync_page(struct page *page)
 	return 0;
 }

-/*
- * Add a page to the dirty page list.
- */
-void set_page_dirty(struct page *page)
-{
-	if (!TestSetPageDirty(page)) {
-		struct address_space *mapping = page->mapping;
-
-		if (mapping) {
-			write_lock(&mapping->page_lock);
-			list_del(&page->list);
-			list_add(&page->list, &mapping->dirty_pages);
-			write_unlock(&mapping->page_lock);
-
-			if (mapping->host)
-				mark_inode_dirty_pages(mapping->host);
-		}
-	}
-}
-
 /**
 * invalidate_inode_pages - Invalidate all the unlocked pages of one inode
 * @inode: the inode which pages we want to invalidate
@@ -194,20 +168,19 @@ static void truncate_complete_page(struct page *page)
 	/* Leave it on the LRU if it gets converted into anonymous buffers */
 	if (!PagePrivate(page) || do_flushpage(page, 0))
 		lru_cache_del(page);
-
-	/*
-	 * We remove the page from the page cache _after_ we have
-	 * destroyed all buffer-cache references to it. Otherwise some
-	 * other process might think this inode page is not in the
-	 * page cache and creates a buffer-cache alias to it causing
-	 * all sorts of fun problems ...  
-	 */
 	ClearPageDirty(page);
 	ClearPageUptodate(page);
 	remove_inode_page(page);
 	page_cache_release(page);
 }

+/*
+ * Writeback walks the page list in ->prev order, which is low-to-high file
+ * offsets in the common case where the file was written linearly. So truncate
+ * walks the page list in the opposite (->next) direction, to avoid getting
+ * into lockstep with writeback's cursor.  This lets truncate prune as many
+ * pages as possible before its cursor collides with the writeback cursor.
+ */
 static int truncate_list_pages(struct address_space *mapping,
 	struct list_head *head, unsigned long start, unsigned *partial)
 {
@@ -216,7 +189,7 @@ static int truncate_list_pages(struct address_space *mapping,
 	int unlocked = 0;

 restart:
-	curr = head->prev;
+	curr = head->next;
 	while (curr != head) {
 		unsigned long offset;

@@ -233,10 +206,10 @@ static int truncate_list_pages(struct address_space *mapping,
 			list_del(head);
 			if (!failed)
 				/* Restart after this page */
-				list_add_tail(head, curr);
+				list_add(head, curr);
 			else
 				/* Restart on this page */
-				list_add(head, curr);
+				list_add_tail(head, curr);

 			write_unlock(&mapping->page_lock);
 			unlocked = 1;
@@ -262,7 +235,7 @@ static int truncate_list_pages(struct address_space *mapping,
 			write_lock(&mapping->page_lock);
 			goto restart;
 		}
-		curr = curr->prev;
+		curr = curr->next;
 	}
 	return unlocked;
 }
@@ -284,10 +257,12 @@ void truncate_inode_pages(struct address_space * mapping, loff_t lstart)

 	write_lock(&mapping->page_lock);
 	do {
-		unlocked = truncate_list_pages(mapping,
-				&mapping->clean_pages, start, &partial);
+		unlocked = truncate_list_pages(mapping,
+				&mapping->io_pages, start, &partial);
 		unlocked |= truncate_list_pages(mapping,
 				&mapping->dirty_pages, start, &partial);
+		unlocked |= truncate_list_pages(mapping,
+				&mapping->clean_pages, start, &partial);
 		unlocked |= truncate_list_pages(mapping,
 				&mapping->locked_pages, start, &partial);
 	} while (unlocked);
@@ -305,6 +280,7 @@ static inline int invalidate_this_page2(struct address_space * mapping,
 	/*
 	 * The page is locked and we hold the mapping lock as well
 	 * so both page_count(page) and page_buffers stays constant here.
+	 * AKPM: fixme: No global lock any more.  Is this still OK?
 	 */
 	if (page_count(page) == 1 + !!page_has_buffers(page)) {
 		/* Restart after this page */
@@ -322,7 +298,7 @@ static inline int invalidate_this_page2(struct address_space * mapping,

 			page_cache_get(page);
 			write_unlock(&mapping->page_lock);
-			block_invalidate_page(page);
+			block_flushpage(page, 0);
 		} else
 			unlocked = 0;

@@ -393,6 +369,8 @@ void invalidate_inode_pages2(struct address_space * mapping)
 				&mapping->clean_pages);
 		unlocked |= invalidate_list_pages2(mapping,
 				&mapping->dirty_pages);
+		unlocked |= invalidate_list_pages2(mapping,
+				&mapping->io_pages);
 		unlocked |= invalidate_list_pages2(mapping,
 				&mapping->locked_pages);
 	} while (unlocked);
@@ -449,6 +427,8 @@ int generic_buffer_fdatasync(struct inode *inode, unsigned long start_idx, unsig
 	/* writeout dirty buffers on pages from both clean and dirty lists */
 	retval = do_buffer_fdatasync(mapping, &mapping->dirty_pages,
 			start_idx, end_idx, writeout_one_page);
+	retval |= do_buffer_fdatasync(mapping, &mapping->io_pages,
+			start_idx, end_idx, writeout_one_page);
 	retval |= do_buffer_fdatasync(mapping, &mapping->clean_pages,
 			start_idx, end_idx, writeout_one_page);
 	retval |= do_buffer_fdatasync(mapping, &mapping->locked_pages,
@@ -457,6 +437,8 @@ int generic_buffer_fdatasync(struct inode *inode, unsigned long start_idx, unsig
 	/* now wait for locked buffers on pages from both clean and dirty lists */
 	retval |= do_buffer_fdatasync(mapping, &mapping->dirty_pages,
 			start_idx, end_idx, waitfor_one_page);
+	retval |= do_buffer_fdatasync(mapping, &mapping->io_pages,
+			start_idx, end_idx, waitfor_one_page);
 	retval |= do_buffer_fdatasync(mapping, &mapping->clean_pages,
 			start_idx, end_idx, waitfor_one_page);
 	retval |= do_buffer_fdatasync(mapping, &mapping->locked_pages,
@@ -495,47 +477,17 @@ int fail_writepage(struct page *page)
 EXPORT_SYMBOL(fail_writepage);

 /**
- *      filemap_fdatasync - walk the list of dirty pages of the given address space
- *     	and writepage() all of them.
- * 
- *      @mapping: address space structure to write
+ *  filemap_fdatasync - walk the list of dirty pages of the given address space
+ *                      and writepage() all of them.
+ *
+ *  @mapping: address space structure to write
 *
 */
-int filemap_fdatasync(struct address_space * mapping)
+int filemap_fdatasync(struct address_space *mapping)
 {
-	int ret = 0;
-	int (*writepage)(struct page *) = mapping->a_ops->writepage;
-
-	write_lock(&mapping->page_lock);
-
-        while (!list_empty(&mapping->dirty_pages)) {
-		struct page *page = list_entry(mapping->dirty_pages.prev, struct page, list);
-
-		list_del(&page->list);
-		list_add(&page->list, &mapping->locked_pages);
-
-		if (!PageDirty(page))
-			continue;
-
-		page_cache_get(page);
-		write_unlock(&mapping->page_lock);
-
-		lock_page(page);
-
-		if (PageDirty(page)) {
-			int err;
-			ClearPageDirty(page);
-			err = writepage(page);
-			if (err && !ret)
-				ret = err;
-		} else
-			UnlockPage(page);
-
-		page_cache_release(page);
-		write_lock(&mapping->page_lock);
-	}
-	write_unlock(&mapping->page_lock);
-	return ret;
+	if (mapping->a_ops->writeback_mapping)
+		return mapping->a_ops->writeback_mapping(mapping, NULL);
+	return generic_writeback_mapping(mapping, NULL);
 }

 /**
@@ -2324,6 +2276,7 @@ generic_file_write(struct file *file,const char *buf,size_t count, loff_t *ppos)

 		if (status < 0)
 			break;
+		balance_dirty_pages_ratelimited(mapping);
 	} while (count);
 done:
 	*ppos = pos;
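
Note on the new filemap_fdatasync(): it now only dispatches, calling the
address_space's ->writeback_mapping hook when one is set and falling back to
generic_writeback_mapping() otherwise. A minimal userspace sketch of that
dispatch pattern follows. Every name in it (struct mapping, struct mapping_ops,
generic_writeback, custom_writeback, fdatasync_sketch) is a hypothetical
stand-in, not a kernel definition, and the return values are chosen only so
the two paths can be told apart.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel's types. */
struct mapping;

struct mapping_ops {
	/* Optional per-mapping writeback hook; may be NULL. */
	int (*writeback_mapping)(struct mapping *m, int *nr_to_write);
};

struct mapping {
	const struct mapping_ops *ops;
	int nr_dirty;		/* pretend count of dirty pages */
};

/* Generic fallback: "writes" every dirty page and reports success. */
static int generic_writeback(struct mapping *m, int *nr_to_write)
{
	(void)nr_to_write;	/* this sketch treats NULL as "write everything" */
	m->nr_dirty = 0;
	return 0;
}

/* A mapping-specific hook, e.g. one that could cluster its writes. */
static int custom_writeback(struct mapping *m, int *nr_to_write)
{
	(void)nr_to_write;
	m->nr_dirty = 0;
	return 1;		/* distinct value so the caller can see this path ran */
}

/* Prefer the mapping's own hook; otherwise use the generic routine. */
static int fdatasync_sketch(struct mapping *m)
{
	if (m->ops->writeback_mapping)
		return m->ops->writeback_mapping(m, NULL);
	return generic_writeback(m, NULL);
}
```

A mapping whose ops table leaves the hook NULL takes the generic path, so
callers never need to know which kind of mapping they hold.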

--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
--- a/mm/pdflush.c
+++ b/mm/pdflush.c
@@ -56,7 +56,7 @@ static unsigned long last_empty_jifs;
 *
 * Thread pool management algorithm:
 * 
- * - The minumum and maximum number of pdflush instances are bound
+ * - The minimum and maximum number of pdflush instances are bound
 *   by MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS.
 * 
 * - If there have been no idle pdflush instances for 1 second, create
@@ -155,8 +155,8 @@ static int __pdflush(struct pdflush_work *my_work)
 /*
 * Of course, my_work wants to be just a local in __pdflush().  It is
 * separated out in this manner to hopefully prevent the compiler from
- * performing unfortunate optimisations agains the auto variables.  Because
- * there are visible to other tasks and CPUs.  (No problem has actually
+ * performing unfortunate optimisations against the auto variables.  Because
+ * these are visible to other tasks and CPUs.  (No problem has actually
 * been observed.  This is just paranoia).
 */
 static int pdflush(void *dummy)

--- a/mm/swap_state.c
+++ b/mm/swap_state.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c