    [PATCH] improved I/O scheduling for indirect blocks · 799391cc
    Andrew Morton authored
    Fixes a performance problem with many-small-file writeout.
    
    At present, files are written out via their mapping and their indirect
    blocks are written out via the blockdev mapping.  As we know that
    indirects are disk-adjacent to the data, it is better to start I/O
    against the indirects at the same time as the data.
    
    The delalloc paths have code in ext2_writepage() which recognises when
    the target page->index is at an indirect boundary and does an explicit
    hunt-and-write against the neighbouring indirect block.  Which is
    ideal (unless the file was dirtied seekily and the page next to the
    indirect was not dirtied).
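
    As a rough illustration of that boundary test (this is a sketch, not
    the delalloc code itself; find_indirect_bh() is a hypothetical helper,
    not a real kernel function):

	/*
	 * Sketch only: with 4K blocks an ext2 indirect maps 1024 data
	 * blocks, so the first data block of each indirect chunk is
	 * disk-adjacent to its indirect.  If the page being written
	 * covers such a block, queue the indirect for I/O as well.
	 */
	static void write_neighbouring_indirect(struct inode *inode,
						unsigned long index)
	{
		unsigned long addr_per_block = inode->i_sb->s_blocksize / 4;
		unsigned long block;

		block = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
		if (block >= EXT2_NDIR_BLOCKS &&
		    (block - EXT2_NDIR_BLOCKS) % addr_per_block == 0) {
			/* hypothetical helper: look the indirect's
			 * buffer_head up in the blockdev mapping */
			struct buffer_head *bh = find_indirect_bh(inode, block);

			if (bh != NULL) {
				if (buffer_dirty(bh))
					ll_rw_block(WRITE, 1, &bh);
				brelse(bh);
			}
		}
	}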
    
    This patch does it the other way: when we start writeback against a
    mapping, also start writeback against any dirty buffers which are
    attached to mapping->private_list.  Let the elevator take care of the
    rest.
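
    Schematically, the idea looks like the sketch below.  It is a
    simplified illustration, not the actual buffer.c change: it ignores
    buffer refcounting and the re-attachment that the data-integrity sync
    paths need, and the field names (private_lock, private_list,
    b_assoc_buffers) follow mainline buffer.c.

	static void start_mapping_buffer_io(struct address_space *mapping)
	{
		LIST_HEAD(batch);
		struct buffer_head *bh;

		/* Detach the associated buffers (the indirects) under
		 * the lock */
		spin_lock(&mapping->private_lock);
		list_splice_init(&mapping->private_list, &batch);
		spin_unlock(&mapping->private_lock);

		/* Start I/O against each dirty one; the elevator merges
		 * it with the adjacent data I/O */
		while (!list_empty(&batch)) {
			bh = list_entry(batch.next, struct buffer_head,
					b_assoc_buffers);
			list_del_init(&bh->b_assoc_buffers);
			if (buffer_dirty(bh))
				ll_rw_block(WRITE, 1, &bh);
		}
	}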
    
    The patch makes a number of tuning changes to the writeback path in
    fs-writeback.c.  This is very fiddly code: getting the throughput
    tuned, getting the data-integrity "sync" operations right, avoiding
    most of the livelock opportunities, getting the `kupdate' function
    working efficiently, and keeping it all at least somewhat comprehensible.
    
    An important intent here is to ensure that metadata blocks for inodes
    are marked dirty before writeback starts working the blockdev mapping,
    so all the inode blocks are efficiently written back.
    
    The patch removes try_to_writeback_unused_inodes(), which became
    unreferenced in vm-writeback.patch.
    
    The patch has a tweak in ext2_put_inode() to prevent ext2 from
    incorrectly dropping its preallocation window in response to a random
    iput().
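
    The shape of that tweak is roughly as follows.  This is a guess at
    the check based on the intent described above; the condition the
    patch actually uses may differ.

	void ext2_put_inode(struct inode *inode)
	{
		/* Only discard the preallocation when the last reference
		 * is going away, not on every iput() */
		if (atomic_read(&inode->i_count) == 1)
			ext2_discard_prealloc(inode);
	}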
    
    
    Generally, many-small-file writeout is a lot faster than 2.5.7 (which
    is linux-before-I-futzed-with-it).  The workload which was optimised was
    
    	tar xfz /nfs/mountpoint/linux-2.4.18.tar.gz ; sync
    
    on mem=128M and mem=2048M.
    
    With these patches, 2.5.15 is completing in about 2/3 of the time of
    2.5.7.  But it is only a shade faster than 2.4.19-pre7.  Why is 2.5.7
    so much slower than 2.4.19?  Not sure yet.
    
    Heavy dbench loads (dbench 32 on mem=128M) are slightly faster than
    2.5.7 and significantly slower than 2.4.19.  It appears that the cause
    is poor read throughput at the later stages of the run, because
    background writeback threads are operating at the same time.
    
    The 2.4.19-pre8 write scheduling manages to stop writeback during the
    latter stages of the dbench run in a way which I haven't been able to
    sanely emulate yet.  It may not be desirable to do this anyway - it's
    optimising for the case where the files are about to be deleted.  But
    it would be good to find a way of "pausing" the writeback for a few
    seconds to allow readers to get an interval of decent bandwidth.
    
    tiobench throughput is basically the same across all recent kernels.
    CPU load on writes is down maybe 30% in 2.5.15.