    [PATCH] page writeback locking update · a2bcb3a0
    Andrew Morton authored
    - Fixes a performance problem - callers of
      prepare_write/commit_write, etc are locking pages, which synchronises
      them behind writeback, which also locks these pages.  Significant
      slowdowns for some workloads.
    
    - So pages are no longer locked while under writeout.  Introduce a
      new PG_writeback and associated infrastructure to support this design
      change.
    
    - Pages which are under read I/O still use PageLocked.  Pages which
      are under write I/O have PageWriteback() true.
    
      I considered creating Page_IO instead of PageWriteback, and marking
      both readin and writeout pages as PageIO().  So pages are unlocked
      during both read and write.  There just doesn't seem a need to do
      this - nobody ever needs non-blocking access to a page which is under
      read I/O.
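
      The split between the two states can be sketched in userspace C.
      This is only an illustrative model: struct model_page, the MODEL_PG_*
      bits and the model_* helpers are hypothetical names, not the kernel's
      struct page or its flag accessors.  It shows that a page under write
      I/O can still be locked without blocking, while a page under read I/O
      cannot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are not the kernel's definitions. */
enum {
    MODEL_PG_LOCKED    = 1u << 0,  /* held by read I/O or by a page user */
    MODEL_PG_WRITEBACK = 1u << 1   /* write I/O is in flight */
};

struct model_page {
    unsigned flags;
};

static bool model_page_locked(const struct model_page *p)
{
    return (p->flags & MODEL_PG_LOCKED) != 0;
}

static bool model_under_writeback(const struct model_page *p)
{
    return (p->flags & MODEL_PG_WRITEBACK) != 0;
}

/* Read I/O keeps the page locked until the data has arrived. */
static void model_start_read(struct model_page *p)
{
    assert(!model_page_locked(p));
    p->flags |= MODEL_PG_LOCKED;
}

/* Write I/O marks the page as under writeback and leaves it unlocked. */
static void model_start_write(struct model_page *p)
{
    assert(!model_under_writeback(p));
    p->flags |= MODEL_PG_WRITEBACK;
}

/* Take the page lock if it is free; return false instead of sleeping. */
static bool model_trylock(struct model_page *p)
{
    if (model_page_locked(p))
        return false;
    p->flags |= MODEL_PG_LOCKED;
    return true;
}
```

      In this model a caller on the prepare_write/commit_write path would
      succeed with model_trylock() on a page that is being written out,
      which is the behaviour the change is after.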
    
    - Pages under swapout (brw_page) are PageLocked, not PageWriteback.
      So their treatment is unchanged.
    
      It's not obvious that pages which are under swapout actually need
      the more asynchronous behaviour of PageWriteback.
    
      I was setting the swapout pages PageWriteback and unlocking them
      prior to submitting the buffers in brw_page().  This led to deadlocks
      on the exit_mmap->zap_page_range->free_swap_and_cache path.  These
      functions call block_flushpage under spinlock.  If the page is
      unlocked but has locked buffers, block_flushpage->discard_buffer()
      sleeps.  Under spinlock.  So that will need fixing if for some reason
      we want swapout to use PageWriteback.
    
      The kernel has called block_flushpage() under spinlock for a long time.
      It assumes that a locked page will never have locked buffers.
      This appears to be true, but it's ugly.
    
    - Adds new function wait_on_page_writeback().  Renames wait_on_page()
      to wait_on_page_locked() to remind people that they need to call the
      appropriate one.
    
    - Renames filemap_fdatasync() to filemap_fdatawrite().  It's more
      accurate - "sync" implies, if anything, writeout and wait (fsync,
      msync).  Or just writeout.  It's not clear.
    
    - Subtly changes the filemap_fdatawrite() internals - this function
      used to do a lock_page() - it waited for any other user of the page
      to let go before submitting new I/O against a page.  It has been
      changed to simply skip over any pages which are currently under
      writeback.
    
      This is the right thing to do for memory-cleansing reasons.
    
      But it's the wrong thing to do for data consistency operations (eg,
      fsync()).  For those operations we must ensure that all data which
      was dirty *at the time of the system call* is safely on disk before
      the call returns.
    
      So all places which care about this have been converted to do:
    
    	filemap_fdatawait(mapping);	/* Wait for current writeback */
    	filemap_fdatawrite(mapping);	/* Write all dirty pages */
    	filemap_fdatawait(mapping);	/* Wait for I/O to complete */
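
      Why the leading wait matters can be sketched with a small userspace
      model.  Everything here is hypothetical (struct model_page, struct
      model_mapping and the model_* functions); it only mimics the rule
      stated above that the write pass skips pages already under
      writeback.  The scenario is an assumption for illustration: a page
      which was redirtied while its earlier write was still in flight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a mapping; not kernel code. */
struct model_page {
    bool dirty;      /* holds data not yet submitted for write */
    bool writeback;  /* an earlier write is still in flight */
};

struct model_mapping {
    struct model_page *pages;
    size_t nr_pages;
};

/* Start I/O on dirty pages, skipping any page already under writeback. */
static void model_fdatawrite(struct model_mapping *m)
{
    for (size_t i = 0; i < m->nr_pages; i++) {
        struct model_page *p = &m->pages[i];

        if (!p->dirty || p->writeback)
            continue;
        p->dirty = false;
        p->writeback = true;
    }
}

/* Wait for in-flight writes; in this model they complete immediately. */
static void model_fdatawait(struct model_mapping *m)
{
    for (size_t i = 0; i < m->nr_pages; i++)
        m->pages[i].writeback = false;
}

/* Count pages whose data has still not been submitted for write. */
static size_t model_count_dirty(const struct model_mapping *m)
{
    size_t n = 0;

    for (size_t i = 0; i < m->nr_pages; i++)
        if (m->pages[i].dirty)
            n++;
    return n;
}

/* The three-step sequence used for data consistency operations. */
static void model_sync(struct model_mapping *m)
{
    model_fdatawait(m);   /* wait for current writeback */
    model_fdatawrite(m);  /* write all dirty pages */
    model_fdatawait(m);   /* wait for I/O to complete */
}
```

      In the model, write-then-wait alone leaves the redirtied page dirty
      because the write pass skipped it, whereas model_sync() leaves no
      dirty pages behind.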
    
    - Fixes a truncate_inode_pages problem - truncate currently will
      block when it hits a locked page, so it ends up getting into lockstep
      behind writeback and all of the file is pointlessly written back.
    
      One fix for this is for truncate to simply walk the page list in the
      opposite direction from writeback.
    
      I chose to use a separate cleansing pass.  It is more
      CPU-intensive, but it is surer and clearer.  This is because there is
      no reason why the per-address_space ->vm_writeback and
      ->writeback_mapping functions *have* to perform writeout in
      ->dirty_pages order.  They may choose to do something totally
      different.
    
      (set_page_dirty() is an a_op now, so address_spaces could almost
      privatise the whole dirty-page handling thing.  Except
      truncate_inode_pages and invalidate_inode_pages assume that the pages
      are on the address_space lists.  hmm.  So making truncate_inode_pages
      and invalidate_inode_pages a_ops would make some sense).