• Andrew Morton's avatar
    [PATCH] fix ext3 memory leak · 2a6cb303
    Andrew Morton authored
    This is the leak which Con found.  Long story...
    
    - If a dirty page is fed into ext3_writepage() during truncate,
      block_write_full_page() will reutrn -EIO (it's outside i_size) and will
      leave the buffers dirty.  In the expectation that discard_buffer() will
      clean them.
    
    - ext3_writepage() then adds the still-dirty buffers to the journal's
      "async data list".  These are buffers which are known to have had IO
      started.  All we need to do is to wait on them in commit.
    
    - meanwhile, truncate will chop the pages off the address_space.  But
      truncate cannot invalidate the buffers (in journal_unmap_buffer()) because
      the buffers are attached to the committing transaction.  (hm.  This
      behaviour in journal_unmap_buffer() is bogus.  We just never need to write
      these buffers.)
    
    - ext3 commit will "wait on writeout" of these writepage buffers (even
      though it was never started) and will then release them from the
      journalling system.
    
    So we end up with pages which are attached to no mapping, which are clean and
    which have dirty buffers.  These are unreclaimable.
    
    
    Aside:
    
      ext3-ordered has two buffer lists: the "sync data list" and the "async
      data list".
    
      The sync list consists of probably-dirty buffers which were dirtied in
      commit_write().  Transaction commit must write all thee out and wait on
      them.
    
      The async list supposedly consists of clean buffers which were attached
      to the journal in ->writepage.  These have had IO started (by writepage) so
      commit merely needs to wait on them.
    
      This is all designed for the 2.4 VM really.  In 2.5, tons of writeback
      goes via writepage (instead of the buffer lru) and these buffers end up
      madly hpooing between the async and sync lists.
    
      Plus it's arguably incorrect to just wait on the writes in commit - if
      the buffers were set dirty again (say, by zap_pte_range()) then perhaps we
      should write them again before committing.
    
    
    So what the patch does is to remove the async list.  All ordered-data buffers
    are now attached to the single "sync data list".  So when we come to commit,
    those buffers which are dirty will have IO started and all buffers are waited
    upon.
    
    This means that the dirty buffers against a clean page which came about from
    block_write_full_page()'s -EIO will be written to disk in commit - this
    cleans them, and the page is now reclaimable.  No leak.
    
    It seems bogus to write these buffers in commit, and indeed it is.  But ext3
    will not allow those blocks to be reused until the commit has ended so there
    is no corruption risk.  And the amount of data involved is low - it only
    comes about as a race between truncate and writepage().
    2a6cb303
inode.c 83.7 KB