Commit d2562c9d authored by Andrew Morton, committed by Linus Torvalds

[PATCH] ext3: smarter block allocation startup

When an ext3 (or ext2) file is first created, the filesystem has to
choose the initial starting block for its data allocations.  In the
usual (new-file) case, that initial goal block is the zeroth block of
a particular blockgroup.

This is the worst possible choice, because it _guarantees_ that this
file's blocks will be pessimally intermingled with the blocks of
another file which is growing within the same blockgroup.

We've always had this problem with files in the same directory.  With
the introduction of the Orlov allocator we now have the problem with
files in different directories, and it got noticed.  This is the cause
of the post-Orlov 50% slowdown in dbench throughput on ext3 on
write-through-caching SCSI on SMP, and of the 25% slowdown in ext2.

It doesn't happen on uniprocessor because a single CPU will not exhibit
sufficient concurrency in allocation against two or more files.

It will happen on uniprocessor if the files are growing slowly.

It has always happened if the files are in the same directory.

ext2 has the same problem, but it is significantly less damaging there
because of ext2's eight-block per-inode preallocation window.

The patch largely solves this problem by not always starting the
allocation goal at the zeroth block of the blockgroup.  We instead
chop the blockgroup into sixteen starting points and select one of those
based on the lower four bits of the calling process's PID.
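
A small userspace sketch of that arithmetic (illustrative only: the
function and variable names here are made up, and the real code uses
current->pid and EXT3_BLOCKS_PER_GROUP(), as in the patch hunk below):

	#include <stdio.h>

	/*
	 * Illustrative only: pick a per-process "colour" offset within a
	 * block group so that concurrent writers do not all start
	 * allocating at block 0 of the group.
	 */
	static unsigned long colour_goal(unsigned long bg_start,
					 unsigned long blocks_per_group,
					 unsigned long pid)
	{
		/* The lower four bits of the PID select one of 16 slices. */
		unsigned long colour = (pid % 16) * (blocks_per_group / 16);
		return bg_start + colour;
	}

	int main(void)
	{
		/*
		 * e.g. 32768 blocks per group and PID 1234:
		 * 1234 % 16 = 2, so the goal lands 2 * 2048 = 4096 blocks
		 * into the group rather than at block 0.
		 */
		printf("%lu\n", colour_goal(0, 32768, 1234));
		return 0;
	}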

The PID was chosen as the index because this will help to ensure that
related files have the same starting goal.  If one process is slowly
writing two files in the same directory, we still lose.


Using the PID in the heuristic is a bit weird.  As an alternative I
tried using the file's directory's i_ino.  That fixed the dbench
problem OK but caused a 15% slowdown in the fast-growth `untar a kernel
tree' workload, because this approach causes files which are in
different directories to spread out more.  Suppressing that behaviour
when the files are all being created by the same process is a
reasonable heuristic.


I changed dbench to never unlink its files, and used e2fsck to
determine how many fragmented files were present after a `dbench 32'
run.  With this patch and the next couple, ext2's fragmentation went
from 22% to 13% and ext3's from 25% to 10.4%.
@@ -455,14 +455,22 @@ static Indirect *ext3_get_branch(struct inode *inode, int depth, int *offsets,
  *	+ if pointer will live in indirect block - allocate near that block.
  *	+ if pointer will live in inode - allocate in the same
  *	  cylinder group.
+ *
+ * In the latter case we colour the starting block by the callers PID to
+ * prevent it from clashing with concurrent allocations for a different inode
+ * in the same block group.  The PID is used here so that functionally related
+ * files will be close-by on-disk.
+ *
  *	Caller must make sure that @ind is valid and will stay that way.
  */
-static inline unsigned long ext3_find_near(struct inode *inode, Indirect *ind)
+static unsigned long ext3_find_near(struct inode *inode, Indirect *ind)
 {
 	struct ext3_inode_info *ei = EXT3_I(inode);
 	u32 *start = ind->bh ? (u32*) ind->bh->b_data : ei->i_data;
 	u32 *p;
+	unsigned long bg_start;
+	unsigned long colour;
 
 	/* Try to find previous block */
 	for (p = ind->p - 1; p >= start; p--)
@@ -477,8 +485,11 @@ static inline unsigned long ext3_find_near(struct inode *inode, Indirect *ind)
 	 * It is going to be refered from inode itself? OK, just put it into
 	 * the same cylinder group then.
 	 */
-	return (ei->i_block_group * EXT3_BLOCKS_PER_GROUP(inode->i_sb)) +
-	       le32_to_cpu(EXT3_SB(inode->i_sb)->s_es->s_first_data_block);
+	bg_start = (ei->i_block_group * EXT3_BLOCKS_PER_GROUP(inode->i_sb)) +
+		le32_to_cpu(EXT3_SB(inode->i_sb)->s_es->s_first_data_block);
+	colour = (current->pid % 16) *
+			(EXT3_BLOCKS_PER_GROUP(inode->i_sb) / 16);
+	return bg_start + colour;
 }
 
 /**