Commit 20b96b52 authored by Andrew Morton's avatar Andrew Morton Committed by Jaroslav Kysela

[PATCH] fs-writeback rework.

I've revisited all the superblock->inode->page writeback paths.  There
were several silly things in there, and things were not as clear as they
could be.

scenario 1: create and dirty a MAP_SHARED segment over a sparse file,
then exit.

  All the memory turns into dirty pagecache, but the kupdate function
  only writes it out at a trickle - 4 megabytes every thirty seconds.
  We should sync it all within 30 seconds.

  What's happening is that when writeback tries to write those pages,
  the filesystem needs to instantiate new blocks for them (they're over
  holes).  The filesystem runs mark_inode_dirty() within the writeback
  function.

  This redirtying of the inode while we're writing it out triggers
  some livelock avoidance code in __sync_single_inode().  That function
  says "ah, someone redirtied the file while I was writing it.  Let's
  move the file to the new end of the superblock dirty list and write
  it out later." Problem is, writeback dirtied the inode itself.

  (It is rather silly that mark_inode_dirty() sets I_DIRTY_PAGES when
  clearly no pages have been dirtied.  Fixing that up would be a
  largish work, so work around it here).

  So this patch just removes the livelock avoidance from
  __sync_single_inode().  It is no longer needed anyway - writeback
  livelock is now avoided (in all writeback paths) by writing a finite
  number of pages.

scenario 2: an application is continuously dirtying a 200 megabyte
file, and your disk has a bandwidth of less than 40 megabytes/sec.

  What happens is that once 30 seconds passes, pdflush starts writing
  out the file.  And because that writeout will take more than five
  seconds (a `kupdate' interval), pdflush just keeps writing it out
  forever - continuous I/O.

  What we _want_ to happen is that the 200 megabytes gets written,
  and then IO stops for thirty seconds (minus the writeout period).  So
  the file is fully synced every thirty seconds.

The patch solves this by using mapping->io_pages more intelligently.
When the time comes to write the file out, move all the dirty pages
onto io_pages.  That is a "batch of pages for this kupdate round".
When io_pages is empty, we know we're done.

The address_space_operations.writepages() API is changed!  It now only
needs to write the pages which the caller placed on mapping->io_pages.

This conceptually cleans things up a bit, by more clearly defining the
role of ->io_pages, and the motion between the various mapping lists.

The treatment of sb->s_dirty and sb->s_io is now conceptually identical
to mapping->dirty_pages and mapping->io_pages: move the items-to-be
written onto ->s_io/io_pages, alk walk that list.  As inodes (or pages)
are written, move them over to the clean/locked/dirty lists.

Oh, scenario 3: start an app whcih continuously overwrites a 5 meg
file.  Wait five seconds, start another, wait 5 seconds, start another.
 What we _should_ see is three 5-meg writes, five seconds apart, every
thirty seconds.  That did all sorts of odd things.  It now does the
right thing.
parent 21c2baef
...@@ -211,6 +211,9 @@ written. The address_space implementation may write more (or less) pages ...@@ -211,6 +211,9 @@ written. The address_space implementation may write more (or less) pages
than *nr_to_write asks for, but it should try to be reasonably close. If than *nr_to_write asks for, but it should try to be reasonably close. If
nr_to_write is NULL, all dirty pages must be written. nr_to_write is NULL, all dirty pages must be written.
writepages should _only_ write pages which are present on
mapping->io_pages.
->set_page_dirty() is called from various places in the kernel ->set_page_dirty() is called from various places in the kernel
when the target page is marked as needing writeback. It may be called when the target page is marked as needing writeback. It may be called
under spinlock (it cannot block) and is sometimes called with the page under spinlock (it cannot block) and is sometimes called with the page
......
...@@ -67,14 +67,10 @@ void __mark_inode_dirty(struct inode *inode, int flags) ...@@ -67,14 +67,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
spin_lock(&inode_lock); spin_lock(&inode_lock);
if ((inode->i_state & flags) != flags) { if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;
struct address_space *mapping = inode->i_mapping; struct address_space *mapping = inode->i_mapping;
inode->i_state |= flags; inode->i_state |= flags;
if (!was_dirty)
mapping->dirtied_when = jiffies;
/* /*
* If the inode is locked, just update its dirty state. * If the inode is locked, just update its dirty state.
* The unlocker will place the inode on the appropriate * The unlocker will place the inode on the appropriate
...@@ -84,19 +80,21 @@ void __mark_inode_dirty(struct inode *inode, int flags) ...@@ -84,19 +80,21 @@ void __mark_inode_dirty(struct inode *inode, int flags)
goto out; goto out;
/* /*
* Only add valid (hashed) inode to the superblock's * Only add valid (hashed) inodes to the superblock's
* dirty list. Add blockdev inodes as well. * dirty list. Add blockdev inodes as well.
*/ */
if (list_empty(&inode->i_hash) && !S_ISBLK(inode->i_mode)) if (list_empty(&inode->i_hash) && !S_ISBLK(inode->i_mode))
goto out; goto out;
/* /*
* If the inode was already on s_dirty, don't reposition * If the inode was already on s_dirty or s_io, don't
* it (that would break s_dirty time-ordering). * reposition it (that would break s_dirty time-ordering).
*/ */
if (!was_dirty) if (!mapping->dirtied_when) {
mapping->dirtied_when = jiffies|1; /* 0 is special */
list_move(&inode->i_list, &sb->s_dirty); list_move(&inode->i_list, &sb->s_dirty);
} }
}
out: out:
spin_unlock(&inode_lock); spin_unlock(&inode_lock);
} }
...@@ -110,19 +108,17 @@ static void write_inode(struct inode *inode, int sync) ...@@ -110,19 +108,17 @@ static void write_inode(struct inode *inode, int sync)
/* /*
* Write a single inode's dirty pages and inode data out to disk. * Write a single inode's dirty pages and inode data out to disk.
* If `sync' is set, wait on the writeout. * If `wait' is set, wait on the writeout.
* Subtract the number of written pages from nr_to_write.
* *
* Normally it is not legal for a single process to lock more than one * The whole writeout design is quite complex and fragile. We want to avoid
* page at a time, due to ab/ba deadlock problems. But writepages() * starvation of particular inodes when others are being redirtied, prevent
* does want to lock a large number of pages, without immediately submitting * livelocks, etc.
* I/O against them (starting I/O is a "deferred unlock_page").
* *
* However it *is* legal to lock multiple pages, if this is only ever performed * So what we do is to move all pages which are to be written from dirty_pages
* by a single process. We provide that exclusion via locking in the * onto io_pages. And keep on writing io_pages until it's empty. Refusing to
* filesystem's ->writepages a_op. This ensures that only a single * move more pages onto io_pages until io_pages is empty. Once that point has
* process is locking multiple pages against this inode. And as I/O is * been reached, we are ready to take another pass across the inode's dirty
* submitted against all those locked pages, there is no deadlock. * pages.
* *
* Called under inode_lock. * Called under inode_lock.
*/ */
...@@ -131,7 +127,6 @@ __sync_single_inode(struct inode *inode, int wait, ...@@ -131,7 +127,6 @@ __sync_single_inode(struct inode *inode, int wait,
struct writeback_control *wbc) struct writeback_control *wbc)
{ {
unsigned dirty; unsigned dirty;
unsigned long orig_dirtied_when;
struct address_space *mapping = inode->i_mapping; struct address_space *mapping = inode->i_mapping;
struct super_block *sb = inode->i_sb; struct super_block *sb = inode->i_sb;
...@@ -141,8 +136,11 @@ __sync_single_inode(struct inode *inode, int wait, ...@@ -141,8 +136,11 @@ __sync_single_inode(struct inode *inode, int wait,
dirty = inode->i_state & I_DIRTY; dirty = inode->i_state & I_DIRTY;
inode->i_state |= I_LOCK; inode->i_state |= I_LOCK;
inode->i_state &= ~I_DIRTY; inode->i_state &= ~I_DIRTY;
orig_dirtied_when = mapping->dirtied_when;
mapping->dirtied_when = 0; /* assume it's whole-file writeback */ write_lock(&mapping->page_lock);
if (wait || !wbc->for_kupdate || list_empty(&mapping->io_pages))
list_splice_init(&mapping->dirty_pages, &mapping->io_pages);
write_unlock(&mapping->page_lock);
spin_unlock(&inode_lock); spin_unlock(&inode_lock);
do_writepages(mapping, wbc); do_writepages(mapping, wbc);
...@@ -155,24 +153,26 @@ __sync_single_inode(struct inode *inode, int wait, ...@@ -155,24 +153,26 @@ __sync_single_inode(struct inode *inode, int wait,
filemap_fdatawait(mapping); filemap_fdatawait(mapping);
spin_lock(&inode_lock); spin_lock(&inode_lock);
inode->i_state &= ~I_LOCK; inode->i_state &= ~I_LOCK;
if (!(inode->i_state & I_FREEING)) { if (!(inode->i_state & I_FREEING)) {
list_del(&inode->i_list); if (!list_empty(&mapping->io_pages)) {
if (inode->i_state & I_DIRTY) { /* Redirtied */ /* Needs more writeback */
list_add(&inode->i_list, &sb->s_dirty); inode->i_state |= I_DIRTY_PAGES;
} else { } else if (!list_empty(&mapping->dirty_pages)) {
if (!list_empty(&mapping->dirty_pages) || /* Redirtied */
!list_empty(&mapping->io_pages)) {
/* Not a whole-file writeback */
mapping->dirtied_when = orig_dirtied_when;
inode->i_state |= I_DIRTY_PAGES; inode->i_state |= I_DIRTY_PAGES;
list_add_tail(&inode->i_list, &sb->s_dirty); mapping->dirtied_when = jiffies|1;
list_move(&inode->i_list, &sb->s_dirty);
} else if (inode->i_state & I_DIRTY) {
/* Redirtied */
mapping->dirtied_when = jiffies|1;
list_move(&inode->i_list, &sb->s_dirty);
} else if (atomic_read(&inode->i_count)) { } else if (atomic_read(&inode->i_count)) {
list_add(&inode->i_list, &inode_in_use); mapping->dirtied_when = 0;
list_move(&inode->i_list, &inode_in_use);
} else { } else {
list_add(&inode->i_list, &inode_unused); mapping->dirtied_when = 0;
} list_move(&inode->i_list, &inode_unused);
} }
} }
wake_up_inode(inode); wake_up_inode(inode);
...@@ -185,8 +185,10 @@ static void ...@@ -185,8 +185,10 @@ static void
__writeback_single_inode(struct inode *inode, int sync, __writeback_single_inode(struct inode *inode, int sync,
struct writeback_control *wbc) struct writeback_control *wbc)
{ {
if (current_is_pdflush() && (inode->i_state & I_LOCK)) if (current_is_pdflush() && (inode->i_state & I_LOCK)) {
list_move(&inode->i_list, &inode->i_sb->s_dirty);
return; return;
}
while (inode->i_state & I_LOCK) { while (inode->i_state & I_LOCK) {
__iget(inode); __iget(inode);
...@@ -233,7 +235,9 @@ sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc) ...@@ -233,7 +235,9 @@ sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc)
{ {
const unsigned long start = jiffies; /* livelock avoidance */ const unsigned long start = jiffies; /* livelock avoidance */
if (!wbc->for_kupdate || list_empty(&sb->s_io))
list_splice_init(&sb->s_dirty, &sb->s_io); list_splice_init(&sb->s_dirty, &sb->s_io);
while (!list_empty(&sb->s_io)) { while (!list_empty(&sb->s_io)) {
struct inode *inode = list_entry(sb->s_io.prev, struct inode *inode = list_entry(sb->s_io.prev,
struct inode, i_list); struct inode, i_list);
...@@ -275,7 +279,6 @@ sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc) ...@@ -275,7 +279,6 @@ sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc)
really_sync = (wbc->sync_mode == WB_SYNC_ALL); really_sync = (wbc->sync_mode == WB_SYNC_ALL);
BUG_ON(inode->i_state & I_FREEING); BUG_ON(inode->i_state & I_FREEING);
__iget(inode); __iget(inode);
list_move(&inode->i_list, &sb->s_dirty);
__writeback_single_inode(inode, really_sync, wbc); __writeback_single_inode(inode, really_sync, wbc);
if (wbc->sync_mode == WB_SYNC_HOLD) { if (wbc->sync_mode == WB_SYNC_HOLD) {
mapping->dirtied_when = jiffies; mapping->dirtied_when = jiffies;
......
...@@ -525,12 +525,12 @@ mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block, ...@@ -525,12 +525,12 @@ mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block,
* Pages can be moved from clean_pages or locked_pages onto dirty_pages * Pages can be moved from clean_pages or locked_pages onto dirty_pages
* at any time - it's not possible to lock against that. So pages which * at any time - it's not possible to lock against that. So pages which
* have already been added to a BIO may magically reappear on the dirty_pages * have already been added to a BIO may magically reappear on the dirty_pages
* list. And generic_writepages() will again try to lock those pages. * list. And mpage_writepages() will again try to lock those pages.
* But I/O has not yet been started against the page. Thus deadlock. * But I/O has not yet been started against the page. Thus deadlock.
* *
* To avoid this, the entire contents of the dirty_pages list are moved * To avoid this, mpage_writepages() will only write pages from io_pages. The
* onto io_pages up-front. We then walk io_pages, locking the * caller must place them there. We walk io_pages, locking the pages and
* pages and submitting them for I/O, moving them to locked_pages. * submitting them for I/O, moving them to locked_pages.
* *
* This has the added benefit of preventing a livelock which would otherwise * This has the added benefit of preventing a livelock which would otherwise
* occur if pages are being dirtied faster than we can write them out. * occur if pages are being dirtied faster than we can write them out.
...@@ -539,8 +539,8 @@ mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block, ...@@ -539,8 +539,8 @@ mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block,
* if it's dirty. This is desirable behaviour for memory-cleaning writeback, * if it's dirty. This is desirable behaviour for memory-cleaning writeback,
* but it is INCORRECT for data-integrity system calls such as fsync(). fsync() * but it is INCORRECT for data-integrity system calls such as fsync(). fsync()
* and msync() need to guarentee that all the data which was dirty at the time * and msync() need to guarentee that all the data which was dirty at the time
* the call was made get new I/O started against them. The way to do this is * the call was made get new I/O started against them. So if called_for_sync()
* to run filemap_fdatawait() before calling filemap_fdatawrite(). * is true, we must wait for existing IO to complete.
* *
* It's fairly rare for PageWriteback pages to be on ->dirty_pages. It * It's fairly rare for PageWriteback pages to be on ->dirty_pages. It
* means that someone redirtied the page while it was under I/O. * means that someone redirtied the page while it was under I/O.
...@@ -570,9 +570,6 @@ mpage_writepages(struct address_space *mapping, ...@@ -570,9 +570,6 @@ mpage_writepages(struct address_space *mapping,
pagevec_init(&pvec, 0); pagevec_init(&pvec, 0);
write_lock(&mapping->page_lock); write_lock(&mapping->page_lock);
list_splice_init(&mapping->dirty_pages, &mapping->io_pages);
while (!list_empty(&mapping->io_pages) && !done) { while (!list_empty(&mapping->io_pages) && !done) {
struct page *page = list_entry(mapping->io_pages.prev, struct page *page = list_entry(mapping->io_pages.prev,
struct page, list); struct page, list);
......
...@@ -41,6 +41,7 @@ struct writeback_control { ...@@ -41,6 +41,7 @@ struct writeback_control {
this for each page written */ this for each page written */
int nonblocking; /* Don't get stuck on request queues */ int nonblocking; /* Don't get stuck on request queues */
int encountered_congestion; /* An output: a queue is full */ int encountered_congestion; /* An output: a queue is full */
int for_kupdate; /* A kupdate writeback */
}; };
/* /*
......
...@@ -62,6 +62,7 @@ ...@@ -62,6 +62,7 @@
* ->mapping->page_lock * ->mapping->page_lock
* ->inode_lock * ->inode_lock
* ->sb_lock (fs/fs-writeback.c) * ->sb_lock (fs/fs-writeback.c)
* ->mapping->page_lock (__sync_single_inode)
* ->page_table_lock * ->page_table_lock
* ->swap_device_lock (try_to_unmap_one) * ->swap_device_lock (try_to_unmap_one)
* ->private_lock (try_to_unmap_one) * ->private_lock (try_to_unmap_one)
...@@ -133,6 +134,9 @@ int filemap_fdatawrite(struct address_space *mapping) ...@@ -133,6 +134,9 @@ int filemap_fdatawrite(struct address_space *mapping)
return 0; return 0;
current->flags |= PF_SYNC; current->flags |= PF_SYNC;
write_lock(&mapping->page_lock);
list_splice_init(&mapping->dirty_pages, &mapping->io_pages);
write_unlock(&mapping->page_lock);
ret = do_writepages(mapping, &wbc); ret = do_writepages(mapping, &wbc);
current->flags &= ~PF_SYNC; current->flags &= ~PF_SYNC;
return ret; return ret;
......
...@@ -286,6 +286,7 @@ static void wb_kupdate(unsigned long arg) ...@@ -286,6 +286,7 @@ static void wb_kupdate(unsigned long arg)
.older_than_this = &oldest_jif, .older_than_this = &oldest_jif,
.nr_to_write = 0, .nr_to_write = 0,
.nonblocking = 1, .nonblocking = 1,
.for_kupdate = 1,
}; };
sync_supers(); sync_supers();
...@@ -299,7 +300,7 @@ static void wb_kupdate(unsigned long arg) ...@@ -299,7 +300,7 @@ static void wb_kupdate(unsigned long arg)
wbc.encountered_congestion = 0; wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES; wbc.nr_to_write = MAX_WRITEBACK_PAGES;
writeback_inodes(&wbc); writeback_inodes(&wbc);
if (wbc.nr_to_write == MAX_WRITEBACK_PAGES) { if (wbc.nr_to_write > 0) {
if (wbc.encountered_congestion) if (wbc.encountered_congestion)
blk_congestion_wait(WRITE, HZ); blk_congestion_wait(WRITE, HZ);
else else
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment