[PATCH] make the pagecache lock irq-safe.

Intro to these patches:

- Major surgery against the pagecache, radix-tree and writeback code.  This work is to address the O_DIRECT-vs-buffered data exposure horrors which we've been struggling with for months.

As a side-effect, 32 bytes are saved from struct inode and eight bytes are removed from struct page, at a cost of approximately 2.5 bits per page in the radix tree nodes on 4k pagesize, assuming the pagecache is densely populated.  Not all pages are pagecache; other pages gain the full 8 byte saving.

This change will break any arch code which is using page->list and will also break any arch code which is using page->lru of memory which was obtained from slab.

The basic problem which we (mainly Daniel McNeil) have been struggling with is in getting a really reliable fsync() across the page lists while other processes are performing writeback against the same file.  It's like juggling four bars of wet soap with your eyes shut while someone is whacking you with a baseball bat.  Daniel pretty much has the problem plugged but I suspect that's just because we don't have testcases to trigger the remaining problems.  The complexity and additional locking which those patches add is worrisome.

So the approach taken here is to remove the page lists altogether and replace the list-based writeback and wait operations with in-order radix-tree walks.

The radix-tree code has been enhanced to support "tagging" of pages, for later searches for pages which have a particular tag set.  This means that we can ask the radix tree code "find me the next 16 dirty pages starting at pagecache index N" and it will do that in O(log64(N)) time.

This affects I/O scheduling potentially quite significantly.  It is no longer the case that the kernel will submit pages for I/O in the order in which the application dirtied them.  We instead submit them in file-offset order all the time.

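The tagged lookup described above can be illustrated with a small userspace sketch.  This is not the kernel's radix-tree code: the fixed tree height, struct node, insert_dirty(), gang_lookup_dirty() and demo_tagged_lookup() are hypothetical names chosen for illustration.  The only ideas taken from the text are the 64-way fanout and a per-slot "dirty" tag, which lets the walk skip whole subtrees that hold no dirty pages and hand results back in ascending index order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SHIFT  6			/* 64 slots per node */
#define FANOUT (1u << SHIFT)
#define HEIGHT 3			/* assumption: fixed height, indices 0 .. 64^3 - 1 */

struct node {
	void *slots[FANOUT];		/* child nodes, or items at the leaf level */
	uint64_t dirty;			/* bit s set: something under slot s is dirty */
};

static unsigned slot_of(unsigned long index, int level)
{
	return (index >> (SHIFT * level)) & (FANOUT - 1);
}

/* Store item at index and tag it dirty on every level of the path. */
static void insert_dirty(struct node *root, unsigned long index, void *item)
{
	struct node *n = root;
	unsigned s;

	assert(index < (1UL << (SHIFT * HEIGHT)));
	for (int level = HEIGHT - 1; level > 0; level--) {
		s = slot_of(index, level);
		if (!n->slots[s]) {
			n->slots[s] = calloc(1, sizeof(struct node));
			assert(n->slots[s]);
		}
		n->dirty |= 1ULL << s;
		n = n->slots[s];
	}
	s = slot_of(index, 0);
	n->slots[s] = item;
	n->dirty |= 1ULL << s;
}

/* In-order walk: visit only slots whose dirty bit is set and which can
 * contain an index >= start. */
static unsigned walk(struct node *n, int level, unsigned long base,
		     unsigned long start, void **out, unsigned max)
{
	unsigned long span = 1UL << (SHIFT * level);	/* indices per slot */
	unsigned found = 0;

	for (unsigned s = 0; s < FANOUT && found < max; s++) {
		unsigned long lo = base + s * span;

		if (!(n->dirty & (1ULL << s)))
			continue;	/* nothing dirty below this slot */
		if (lo + span - 1 < start)
			continue;	/* whole slot lies before start */
		if (level == 0)
			out[found++] = n->slots[s];
		else
			found += walk(n->slots[s], level - 1, lo, start,
				      out + found, max - found);
	}
	return found;
}

/* "Find me the next max dirty pages starting at index start." */
static unsigned gang_lookup_dirty(struct node *root, unsigned long start,
				  void **out, unsigned max)
{
	return walk(root, HEIGHT - 1, 0, start, out, max);
}

/* Pages are dirtied out of order; the lookup still returns them by index. */
static unsigned long pages[] = { 4100, 5, 70, 3 };

unsigned demo_tagged_lookup(unsigned long start, unsigned long *out, unsigned max)
{
	struct node *root = calloc(1, sizeof(*root));
	void *hits[16];
	unsigned n, i;

	assert(root && max <= 16);
	for (i = 0; i < 4; i++)
		insert_dirty(root, pages[i], &pages[i]);
	n = gang_lookup_dirty(root, start, hits, max);
	for (i = 0; i < n; i++)
		out[i] = *(unsigned long *)hits[i];
	return n;	/* tree nodes are leaked; acceptable for a short demo */
}
```

Calling demo_tagged_lookup(4, out, 16) on this data yields the pages at indices 5, 70 and 4100, in that order, even though they were dirtied in a different order.  Clearing a tag (and propagating the clear upwards when a node has no dirty slots left) is omitted to keep the sketch short.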
This is likely to be advantageous when applications are seeking all over a large file randomly writing small amounts of data.  I haven't performed much benchmarking, but tiobench random write throughput seems to be increased by 30%.  Other tests appear to be unaltered.  dbench may have got 10-20% quicker, but it's variable.

There is one large file which everyone seeks all over randomly writing small amounts of data: the blockdev mapping which caches filesystem metadata.  The kernel's IO submission patterns for this are now ideal.

Because writeback and wait-for-writeback use a tree walk instead of a list walk they are no longer livelockable.  This probably means that we no longer need to hold i_sem across O_SYNC writes and perhaps fsync() and fdatasync().  This may be beneficial for databases: multiple processes writing and syncing different parts of the same file at the same time can now all submit and wait upon writes to just their own little bit of the file, so we can get a lot more data into the queues.

It is trivial to implement a part-file-fdatasync() as well, so applications can say "sync the file from byte N to byte M", and multiple applications can do this concurrently.  This is easy for ext2 filesystems, but probably needs lots of work for data-journalled filesystems and XFS and it probably doesn't offer much benefit over an i_semless O_SYNC write.

These patches can end up making ext3 (even) slower:

	for i in 1 2 3 4
	do
		dd if=/dev/zero of=$i bs=1M count=2000 &
	done

runs awfully slow on SMP.  This is, yet again, because all the file blocks are jumbled up and the per-file linear writeout causes tons of seeking.  The above test runs sweetly on UP because on UP we don't allocate blocks to different files in parallel.

Mingming and Badari are working on getting block reservation working for ext3 (preallocation on steroids).  That should fix ext3 up.

This patch:

- Later, we'll need to access the radix trees from inside disk I/O completion handlers.  So make mapping->page_lock irq-safe.  And rename it to tree_lock to reliably break any missed conversions.
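Why the lock has to become irq-safe can be shown with a userspace analogy.  This is only a sketch under loud assumptions: a POSIX signal stands in for the disk interrupt, blocking the signal stands in for disabling interrupts, and tree_lock_irq(), tree_unlock_irq(), completion_handler() and demo_irq_safe() are hypothetical helpers, not kernel functions.  If the handler could run while the interrupted code holds the lock, a spinning handler would wait forever for a holder that cannot make progress.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

static atomic_flag tree_lock = ATOMIC_FLAG_INIT;	/* stand-in for mapping->tree_lock */
static volatile sig_atomic_t handler_saw_lock_held;
static volatile sig_atomic_t completions;

/* Stand-in for an I/O completion handler running in interrupt context. */
static void completion_handler(int sig)
{
	(void)sig;
	if (atomic_flag_test_and_set(&tree_lock)) {
		/* A real spinlock would spin here forever: the holder is
		 * the code we interrupted, so it can never release. */
		handler_saw_lock_held = 1;
		return;
	}
	completions++;			/* "touch the radix tree" */
	atomic_flag_clear(&tree_lock);
}

/* Analogue of spin_lock_irq(): mask the "interrupt", then take the lock. */
static void tree_lock_irq(void)
{
	sigset_t set;

	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	sigprocmask(SIG_BLOCK, &set, NULL);
	while (atomic_flag_test_and_set(&tree_lock))
		;			/* spin */
}

/* Analogue of spin_unlock_irq(): drop the lock, then unmask. */
static void tree_unlock_irq(void)
{
	sigset_t set;

	atomic_flag_clear(&tree_lock);
	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	sigprocmask(SIG_UNBLOCK, &set, NULL);
}

/* Returns 1 if the handler ran exactly once and never found the lock held. */
int demo_irq_safe(void)
{
	struct sigaction sa;
	sig_atomic_t before = completions;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = completion_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGUSR1, &sa, NULL);

	tree_lock_irq();
	raise(SIGUSR1);		/* arrives while the lock is held: stays pending */
	tree_unlock_irq();	/* delivered here, after the lock is free */

	return !handler_saw_lock_held && completions == before + 1;
}
```

In a single-threaded process the pending signal is delivered when it is unblocked, so the handler always finds the lock free.  The sketch also hints at why the nested acquisitions in mm/swap_state.c use plain spin_lock() for the inner lock: interrupts are already off once the outer spin_lock_irq() has been taken.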

Commit 89261aab, authored Apr 11, 2004 by Andrew Morton; committed by Linus Torvalds Apr 11, 2004
14 changed files
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -396,7 +396,7 @@ asmlinkage long sys_fdatasync(unsigned int fd)
 * Hack idea: for the blockdev mapping, i_bufferlist_lock contention
 * may be quite high.  This code could TryLock the page, and if that
 * succeeds, there is no need to take private_lock. (But if
- * private_lock is contended then so is mapping->page_lock).
+ * private_lock is contended then so is mapping->tree_lock).
 */
 static struct buffer_head *
 __find_get_block_slow(struct block_device *bdev, sector_t block, int unused)
@@ -867,14 +867,14 @@ int __set_page_dirty_buffers(struct page *page)
 	spin_unlock(&mapping->private_lock);

 	if (!TestSetPageDirty(page)) {
-		spin_lock(&mapping->page_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		if (page->mapping) {	/* Race with truncate? */
 			if (!mapping->backing_dev_info->memory_backed)
 				inc_page_state(nr_dirty);
 			list_del(&page->list);
 			list_add(&page->list, &mapping->dirty_pages);
 		}
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	}
 	
@@ -1254,7 +1254,7 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
 * inode to its superblock's dirty inode list.
 *
 * mark_buffer_dirty() is atomic.  It takes bh->b_page->mapping->private_lock,
- * mapping->page_lock and the global inode_lock.
+ * mapping->tree_lock and the global inode_lock.
 */
 void fastcall mark_buffer_dirty(struct buffer_head *bh)
 {

--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -898,11 +898,9 @@ static void cifs_copy_cache_pages(struct address_space *mapping,
 		if(list_empty(pages))
 			break;

-		spin_lock(&mapping->page_lock);
 		page = list_entry(pages->prev, struct page, list);

 		list_del(&page->list);
-		spin_unlock(&mapping->page_lock);

 		if (add_to_page_cache(page, mapping, page->index, GFP_KERNEL)) {
 			page_cache_release(page);
@@ -962,14 +960,10 @@ cifs_readpages(struct file *file, struct address_space *mapping,
 	pagevec_init(&lru_pvec, 0);

 	for(i = 0;i<num_pages;) {
-		spin_lock(&mapping->page_lock);
-		if(list_empty(page_list)) {
-			spin_unlock(&mapping->page_lock);
+		if(list_empty(page_list))
 			break;
-		}
 		page = list_entry(page_list->prev, struct page, list);
 		offset = (loff_t)page->index << PAGE_CACHE_SHIFT;
-	        spin_unlock(&mapping->page_lock);

 		/* for reads over a certain size could initiate async read ahead */

@@ -989,12 +983,10 @@ cifs_readpages(struct file *file, struct address_space *mapping,
 			cFYI(1,("Read error in readpages: %d",rc));
 			/* clean up remaing pages off list */
            
-			spin_lock(&mapping->page_lock);
 			while (!list_empty(page_list) && (i < num_pages)) {
 				page = list_entry(page_list->prev, struct page, list);
 				list_del(&page->list);
 			}
-			spin_unlock(&mapping->page_lock);
 			break;
 		} else if (bytes_read > 0) {
 			pSMBr = (struct smb_com_read_rsp *)smb_read_data;

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -159,10 +159,10 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
 	 * read speculatively by this cpu before &= ~I_DIRTY  -- mikulas
 	 */

-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	if (wait || !wbc->for_kupdate || list_empty(&mapping->io_pages))
 		list_splice_init(&mapping->dirty_pages, &mapping->io_pages);
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	spin_unlock(&inode_lock);

 	ret = do_writepages(mapping, wbc);

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -187,7 +187,7 @@ void inode_init_once(struct inode *inode)
 	sema_init(&inode->i_sem, 1);
 	init_rwsem(&inode->i_alloc_sem);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
-	spin_lock_init(&inode->i_data.page_lock);
+	spin_lock_init(&inode->i_data.tree_lock);
 	init_MUTEX(&inode->i_data.i_shared_sem);
 	atomic_set(&inode->i_data.truncate_count, 0);
 	INIT_LIST_HEAD(&inode->i_data.private_list);

--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -635,7 +635,7 @@ mpage_writepages(struct address_space *mapping,
 	if (get_block == NULL)
 		writepage = mapping->a_ops->writepage;

-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	while (!list_empty(&mapping->io_pages) && !done) {
 		struct page *page = list_entry(mapping->io_pages.prev,
 					struct page, list);
@@ -655,10 +655,10 @@ mpage_writepages(struct address_space *mapping,
 		list_add(&page->list, &mapping->locked_pages);

 		page_cache_get(page);
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);

 		/*
-		 * At this point we hold neither mapping->page_lock nor
+		 * At this point we hold neither mapping->tree_lock nor
 		 * lock on the page itself: the page may be truncated or
 		 * invalidated (changing page->mapping to NULL), or even
 		 * swizzled back from swapper_space to tmpfs file mapping.
@@ -695,12 +695,12 @@ mpage_writepages(struct address_space *mapping,
 			unlock_page(page);
 		}
 		page_cache_release(page);
-		spin_lock(&mapping->page_lock);
+		spin_lock_irq(&mapping->tree_lock);
 	}
 	/*
 	 * Leave any remaining dirty pages on ->io_pages
 	 */
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	if (bio)
 		mpage_bio_submit(WRITE, bio);
 	return ret;

--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -322,7 +322,7 @@ struct backing_dev_info;
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
-	spinlock_t		page_lock;	/* and spinlock protecting it */
+	spinlock_t		tree_lock;	/* and spinlock protecting it */
 	struct list_head	clean_pages;	/* list of clean pages */
 	struct list_head	dirty_pages;	/* list of dirty pages */
 	struct list_head	locked_pages;	/* list of locked pages */

--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -380,9 +380,7 @@ static void shm_get_stat(unsigned long *rss, unsigned long *swp)

 		if (is_file_hugepages(shp->shm_file)) {
 			struct address_space *mapping = inode->i_mapping;
-			spin_lock(&mapping->page_lock);
 			*rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages;
-			spin_unlock(&mapping->page_lock);
 		} else {
 			struct shmem_inode_info *info = SHMEM_I(inode);
 			spin_lock(&info->lock);

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -59,7 +59,7 @@
 *    ->private_lock		(__free_pte->__set_page_dirty_buffers)
 *      ->swap_list_lock
 *        ->swap_device_lock	(exclusive_swap_page, others)
- *          ->mapping->page_lock
+ *          ->mapping->tree_lock
 *
 *  ->i_sem
 *    ->i_shared_sem		(truncate->invalidate_mmap_range)
@@ -78,12 +78,12 @@
 *
 *  ->inode_lock
 *    ->sb_lock			(fs/fs-writeback.c)
- *    ->mapping->page_lock	(__sync_single_inode)
+ *    ->mapping->tree_lock	(__sync_single_inode)
 *
 *  ->page_table_lock
 *    ->swap_device_lock	(try_to_unmap_one)
 *    ->private_lock		(try_to_unmap_one)
- *    ->page_lock		(try_to_unmap_one)
+ *    ->tree_lock		(try_to_unmap_one)
 *    ->zone.lru_lock		(follow_page->mark_page_accessed)
 *
 *  ->task->proc_lock
@@ -93,7 +93,7 @@
 /*
 * Remove a page from the page cache and free it. Caller has to make
 * sure the page is locked and that nobody else uses it - or that usage
- * is safe.  The caller must hold a write_lock on the mapping's page_lock.
+ * is safe.  The caller must hold a write_lock on the mapping's tree_lock.
 */
 void __remove_from_page_cache(struct page *page)
 {
@@ -114,9 +114,9 @@ void remove_from_page_cache(struct page *page)
 	if (unlikely(!PageLocked(page)))
 		PAGE_BUG(page);

-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	__remove_from_page_cache(page);
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 }

 static inline int sync_page(struct page *page)
@@ -148,9 +148,9 @@ static int __filemap_fdatawrite(struct address_space *mapping, int sync_mode)
 	if (mapping->backing_dev_info->memory_backed)
 		return 0;

-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	list_splice_init(&mapping->dirty_pages, &mapping->io_pages);
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	ret = do_writepages(mapping, &wbc);
 	return ret;
 }
@@ -185,7 +185,7 @@ int filemap_fdatawait(struct address_space * mapping)

 restart:
 	progress = 0;
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
        while (!list_empty(&mapping->locked_pages)) {
 		struct page *page;

@@ -199,7 +199,7 @@ int filemap_fdatawait(struct address_space * mapping)
 		if (!PageWriteback(page)) {
 			if (++progress > 32) {
 				if (need_resched()) {
-					spin_unlock(&mapping->page_lock);
+					spin_unlock_irq(&mapping->tree_lock);
 					__cond_resched();
 					goto restart;
 				}
@@ -209,16 +209,16 @@ int filemap_fdatawait(struct address_space * mapping)

 		progress = 0;
 		page_cache_get(page);
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);

 		wait_on_page_writeback(page);
 		if (PageError(page))
 			ret = -EIO;

 		page_cache_release(page);
-		spin_lock(&mapping->page_lock);
+		spin_lock_irq(&mapping->tree_lock);
 	}
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);

 	/* Check for outstanding write errors */
 	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
@@ -267,7 +267,7 @@ int add_to_page_cache(struct page *page, struct address_space *mapping,

 	if (error == 0) {
 		page_cache_get(page);
-		spin_lock(&mapping->page_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
 			SetPageLocked(page);
@@ -275,7 +275,7 @@ int add_to_page_cache(struct page *page, struct address_space *mapping,
 		} else {
 			page_cache_release(page);
 		}
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		radix_tree_preload_end();
 	}
 	return error;
@@ -411,11 +411,11 @@ struct page * find_get_page(struct address_space *mapping, unsigned long offset)
 	 * We scan the hash list read-only. Addition to and removal from
 	 * the hash-list needs a held write-lock.
 	 */
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page)
 		page_cache_get(page);
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	return page;
 }

@@ -428,11 +428,11 @@ struct page *find_trylock_page(struct address_space *mapping, unsigned long offs
 {
 	struct page *page;

-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page && TestSetPageLocked(page))
 		page = NULL;
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	return page;
 }

@@ -454,15 +454,15 @@ struct page *find_lock_page(struct address_space *mapping,
 {
 	struct page *page;

-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 repeat:
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page) {
 		page_cache_get(page);
 		if (TestSetPageLocked(page)) {
-			spin_unlock(&mapping->page_lock);
+			spin_unlock_irq(&mapping->tree_lock);
 			lock_page(page);
-			spin_lock(&mapping->page_lock);
+			spin_lock_irq(&mapping->tree_lock);

 			/* Has the page been truncated while we slept? */
 			if (page->mapping != mapping || page->index != offset) {
@@ -472,7 +472,7 @@ struct page *find_lock_page(struct address_space *mapping,
 			}
 		}
 	}
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	return page;
 }

@@ -546,12 +546,12 @@ unsigned int find_get_pages(struct address_space *mapping, pgoff_t start,
 	unsigned int i;
 	unsigned int ret;

-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, start, nr_pages);
 	for (i = 0; i < ret; i++)
 		page_cache_get(pages[i]);
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	return ret;
 }


--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -472,12 +472,12 @@ int write_one_page(struct page *page, int wait)
 	if (wait)
 		wait_on_page_writeback(page);

-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	list_del(&page->list);
 	if (test_clear_page_dirty(page)) {
 		list_add(&page->list, &mapping->locked_pages);
 		page_cache_get(page);
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		ret = mapping->a_ops->writepage(page, &wbc);
 		if (ret == 0 && wait) {
 			wait_on_page_writeback(page);
@@ -487,7 +487,7 @@ int write_one_page(struct page *page, int wait)
 		page_cache_release(page);
 	} else {
 		list_add(&page->list, &mapping->clean_pages);
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		unlock_page(page);
 	}
 	return ret;
@@ -515,7 +515,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 		struct address_space *mapping = page->mapping;

 		if (mapping) {
-			spin_lock(&mapping->page_lock);
+			spin_lock_irq(&mapping->tree_lock);
 			if (page->mapping) {	/* Race with truncate? */
 				BUG_ON(page->mapping != mapping);
 				if (!mapping->backing_dev_info->memory_backed)
@@ -523,7 +523,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 				list_del(&page->list);
 				list_add(&page->list, &mapping->dirty_pages);
 			}
-			spin_unlock(&mapping->page_lock);
+			spin_unlock_irq(&mapping->tree_lock);
 			if (!PageSwapCache(page))
 				__mark_inode_dirty(mapping->host,
 							I_DIRTY_PAGES);

--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -230,7 +230,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
 		unsigned long page_offset = offset + page_idx;
 		
@@ -241,16 +241,16 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		if (page)
 			continue;

-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
-		spin_lock(&mapping->page_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
 		page->index = page_offset;
 		list_add(&page->list, &page_pool);
 		ret++;
 	}
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);

 	/*
 	 * Now start the IO.  We ignore I/O errors - if the page is not

--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -25,7 +25,7 @@ extern struct address_space_operations swap_aops;

 struct address_space swapper_space = {
 	.page_tree	= RADIX_TREE_INIT(GFP_ATOMIC),
-	.page_lock	= SPIN_LOCK_UNLOCKED,
+	.tree_lock	= SPIN_LOCK_UNLOCKED,
 	.clean_pages	= LIST_HEAD_INIT(swapper_space.clean_pages),
 	.dirty_pages	= LIST_HEAD_INIT(swapper_space.dirty_pages),
 	.io_pages	= LIST_HEAD_INIT(swapper_space.io_pages),
@@ -182,9 +182,9 @@ void delete_from_swap_cache(struct page *page)
  
 	entry.val = page->index;

-	spin_lock(&swapper_space.page_lock);
+	spin_lock_irq(&swapper_space.tree_lock);
 	__delete_from_swap_cache(page);
-	spin_unlock(&swapper_space.page_lock);
+	spin_unlock_irq(&swapper_space.tree_lock);

 	swap_free(entry);
 	page_cache_release(page);
@@ -195,8 +195,8 @@ int move_to_swap_cache(struct page *page, swp_entry_t entry)
 	struct address_space *mapping = page->mapping;
 	int err;

-	spin_lock(&swapper_space.page_lock);
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&swapper_space.tree_lock);
+	spin_lock(&mapping->tree_lock);

 	err = radix_tree_insert(&swapper_space.page_tree, entry.val, page);
 	if (!err) {
@@ -204,8 +204,8 @@ int move_to_swap_cache(struct page *page, swp_entry_t entry)
 		___add_to_page_cache(page, &swapper_space, entry.val);
 	}

-	spin_unlock(&mapping->page_lock);
-	spin_unlock(&swapper_space.page_lock);
+	spin_unlock(&mapping->tree_lock);
+	spin_unlock_irq(&swapper_space.tree_lock);

 	if (!err) {
 		if (!swap_duplicate(entry))
@@ -231,8 +231,8 @@ int move_from_swap_cache(struct page *page, unsigned long index,

 	entry.val = page->index;

-	spin_lock(&swapper_space.page_lock);
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&swapper_space.tree_lock);
+	spin_lock(&mapping->tree_lock);

 	err = radix_tree_insert(&mapping->page_tree, index, page);
 	if (!err) {
@@ -240,8 +240,8 @@ int move_from_swap_cache(struct page *page, unsigned long index,
 		___add_to_page_cache(page, mapping, index);
 	}

-	spin_unlock(&mapping->page_lock);
-	spin_unlock(&swapper_space.page_lock);
+	spin_unlock(&mapping->tree_lock);
+	spin_unlock_irq(&swapper_space.tree_lock);

 	if (!err) {
 		swap_free(entry);

--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -253,10 +253,10 @@ static int exclusive_swap_page(struct page *page)
 		/* Is the only swap cache user the cache itself? */
 		if (p->swap_map[swp_offset(entry)] == 1) {
 			/* Recheck the page count with the pagecache lock held.. */
-			spin_lock(&swapper_space.page_lock);
+			spin_lock_irq(&swapper_space.tree_lock);
 			if (page_count(page) - !!PagePrivate(page) == 2)
 				retval = 1;
-			spin_unlock(&swapper_space.page_lock);
+			spin_unlock_irq(&swapper_space.tree_lock);
 		}
 		swap_info_put(p);
 	}
@@ -324,13 +324,13 @@ int remove_exclusive_swap_page(struct page *page)
 	retval = 0;
 	if (p->swap_map[swp_offset(entry)] == 1) {
 		/* Recheck the page count with the pagecache lock held.. */
-		spin_lock(&swapper_space.page_lock);
+		spin_lock_irq(&swapper_space.tree_lock);
 		if ((page_count(page) == 2) && !PageWriteback(page)) {
 			__delete_from_swap_cache(page);
 			SetPageDirty(page);
 			retval = 1;
 		}
-		spin_unlock(&swapper_space.page_lock);
+		spin_unlock_irq(&swapper_space.tree_lock);
 	}
 	swap_info_put(p);


--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -62,7 +62,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 * This is for invalidate_inode_pages().  That function can be called at
 * any time, and is not supposed to throw away dirty pages.  But pages can
 * be marked dirty at any time too.  So we re-check the dirtiness inside
- * ->page_lock.  That provides exclusion against the __set_page_dirty
+ * ->tree_lock.  That provides exclusion against the __set_page_dirty
 * functions.
 */
 static int
@@ -74,13 +74,13 @@ invalidate_complete_page(struct address_space *mapping, struct page *page)
 	if (PagePrivate(page) && !try_to_release_page(page, 0))
 		return 0;

-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	if (PageDirty(page)) {
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		return 0;
 	}
 	__remove_from_page_cache(page);
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	ClearPageUptodate(page);
 	page_cache_release(page);	/* pagecache ref */
 	return 1;

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -354,7 +354,6 @@ shrink_list(struct list_head *page_list, unsigned int gfp_mask, int *nr_scanned)
 				goto keep_locked;
 			if (!may_write_to_queue(mapping->backing_dev_info))
 				goto keep_locked;
-			spin_lock(&mapping->page_lock);
 			if (test_clear_page_dirty(page)) {
 				int res;
 				struct writeback_control wbc = {
@@ -364,9 +363,6 @@ shrink_list(struct list_head *page_list, unsigned int gfp_mask, int *nr_scanned)
 					.for_reclaim = 1,
 				};

-				list_move(&page->list, &mapping->locked_pages);
-				spin_unlock(&mapping->page_lock);
-
 				SetPageReclaim(page);
 				res = mapping->a_ops->writepage(page, &wbc);
 				if (res < 0)
@@ -381,7 +377,6 @@ shrink_list(struct list_head *page_list, unsigned int gfp_mask, int *nr_scanned)
 				}
 				goto keep;
 			}
-			spin_unlock(&mapping->page_lock);
 		}

 		/*
@@ -415,7 +410,7 @@ shrink_list(struct list_head *page_list, unsigned int gfp_mask, int *nr_scanned)
 		if (!mapping)
 			goto keep_locked;	/* truncate got there first */

-		spin_lock(&mapping->page_lock);
+		spin_lock_irq(&mapping->tree_lock);

 		/*
 		 * The non-racy check for busy page.  It is critical to check
@@ -423,7 +418,7 @@ shrink_list(struct list_head *page_list, unsigned int gfp_mask, int *nr_scanned)
 		 * not in use by anybody. 	(pagecache + us == 2)
 		 */
 		if (page_count(page) != 2 || PageDirty(page)) {
-			spin_unlock(&mapping->page_lock);
+			spin_unlock_irq(&mapping->tree_lock);
 			goto keep_locked;
 		}

@@ -431,7 +426,7 @@ shrink_list(struct list_head *page_list, unsigned int gfp_mask, int *nr_scanned)
 		if (PageSwapCache(page)) {
 			swp_entry_t swap = { .val = page->index };
 			__delete_from_swap_cache(page);
-			spin_unlock(&mapping->page_lock);
+			spin_unlock_irq(&mapping->tree_lock);
 			swap_free(swap);
 			__put_page(page);	/* The pagecache ref */
 			goto free_it;
@@ -439,7 +434,7 @@ shrink_list(struct list_head *page_list, unsigned int gfp_mask, int *nr_scanned)
 #endif /* CONFIG_SWAP */

 		__remove_from_page_cache(page);
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		__put_page(page);

 free_it: