    [PATCH] ext3: smarter block allocation startup · d2562c9d
    Andrew Morton authored
    When an ext3 (or ext2) file is first created the filesystem has to
    choose the initial starting block for its data allocations.  In the
    usual (new-file) case, that initial goal block is the zeroth block of
    a particular blockgroup.
    
    This is the worst possible choice, because it _guarantees_ that this
    file's blocks will be pessimally intermingled with the blocks of
    any other file which is growing within the same blockgroup.
    
    We've always had this problem with files in the same directory.  With
    the introduction of the Orlov allocator we now have the problem with
    files in different directories.  And it got noticed.  This is the cause
    of the post-Orlov 50% slowdown in dbench throughput on ext3 on
    write-through caching SCSI on SMP, and of the 25% slowdown on ext2.
    
    It doesn't happen on uniprocessor because a single CPU will not exhibit
    sufficient concurrency in allocation against two or more files.
    
    It will happen on uniprocessor if the files are growing slowly.
    
    It has always happened if the files are in the same directory.
    
    ext2 has the same problem but it is significantly less damaging there
    because of ext2's eight-block per-inode preallocation window.
    
    The patch largely solves this problem by not always starting the
    allocation goal at the zeroth block of the blockgroup.  We instead
    chop the blockgroup into sixteen starting points and select one of those
    based on the lower four bits of the calling process's PID.
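
    As a rough illustration of the selection described above, here is a
    minimal userspace sketch.  It is not the patch itself: the names
    initial_goal(), GOAL_COLOURS, group_first_block and blocks_per_group
    are hypothetical and stand in for whatever the filesystem code
    actually uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: number of starting points per blockgroup. */
#define GOAL_COLOURS 16u

/*
 * Sketch only: pick an initial allocation goal inside a blockgroup.
 *
 * group_first_block - block number of the group's first block
 * blocks_per_group  - size of the group, in blocks
 * pid               - PID of the calling process
 */
static uint64_t initial_goal(uint64_t group_first_block,
                             uint32_t blocks_per_group,
                             uint32_t pid)
{
    /* The group must be large enough to be split into 16 regions. */
    assert(blocks_per_group >= GOAL_COLOURS);

    /* Lower four bits of the PID select one of the sixteen regions. */
    uint32_t colour = pid & (GOAL_COLOURS - 1);

    /* Each region covers one sixteenth of the group. */
    uint32_t stride = blocks_per_group / GOAL_COLOURS;

    return group_first_block + (uint64_t)colour * stride;
}
```

    Assuming a group of 32768 blocks, each region is 2048 blocks long, so
    two processes whose PIDs differ in their lower four bits start their
    allocations at least 2048 blocks apart, while a single process always
    gets the same offset within a given blockgroup.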
    
    The PID was chosen as the index because this will help to ensure that
    related files have the same starting goal.  If one process is slowly
    writing two files in the same directory, we still lose.
    
    
    Using the PID in the heuristic is a bit weird.  As an alternative I
    tried using the file's directory's i_ino.  That fixed the dbench
    problem OK but caused a 15% slowdown in the fast-growth `untar a kernel
    tree' workload, because that approach causes files which are in
    different directories to spread out more.  Suppressing that behaviour
    when the files are all being created by the same process is a
    reasonable heuristic.
    
    
    I changed dbench to never unlink its files, and used e2fsck to
    determine how many fragmented files were present after a `dbench 32'
    run.  With this patch and the next couple, ext2's fragmentation went
    from 22% to 13% and ext3's from 25% to 10.4%.