[PATCH] ext3: smarter block allocation startup

When an ext3 (or ext2) file is first created the filesystem has to choose the initial starting block for its data allocations. In the usual (new-file) case, that initial goal block is the zeroeth block of a particular blockgroup.

This is the worst possible choice, because it _guarantees_ that this file's blocks will be pessimally intermingled with the blocks of another file which is growing within the same blockgroup.

We've always had this problem with files in the same directory. With the introduction of the Orlov allocator we now have the problem with files in different directories. And it got noticed: this is the cause of the post-Orlov 50% slowdown in dbench throughput on ext3 on write-through caching SCSI on SMP, and of the 25% slowdown on ext2.

It doesn't happen on uniprocessor because a single CPU will not exhibit sufficient concurrency in allocation against two or more files.

It will happen on uniprocessor if the files are growing slowly.

It has always happened if the files are in the same directory.

ext2 has the same problem but it is significantly less damaging there because of ext2's eight-block per-inode preallocation window.

The patch largely solves this problem by not always starting the allocation goal at the zeroeth block of the blockgroup. We instead chop the blockgroup into sixteen starting points and select one of those based on the lower four bits of the calling process's PID.

The PID was chosen as the index because this will help to ensure that related files have the same starting goal. If one process is slowly writing two files in the same directory, we still lose.

Using the PID in the heuristic is a bit weird. As an alternative I tried using the file's directory's i_ino. That fixed the dbench problem OK but caused a 15% slowdown in the fast-growth `untar a kernel tree' workload, because this approach causes files which are in different directories to spread out more. Suppressing that behaviour when the files are all being created by the same process is a reasonable heuristic.

I changed dbench to never unlink its files, and used e2fsck to determine how many fragmented files were present after a `dbench 32' run. With this patch and the next couple, ext2's fragmentation went from 22% to 13% and ext3's from 25% to 10.4%.
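The starting-point selection described above can be sketched in userspace C as follows. This is only an illustration of the arithmetic, not the kernel code: `goal_block()` and `NR_COLOURS` are hypothetical names, and the block group number, blocks-per-group count, first data block and PID are passed in as plain integers rather than read from the inode, superblock and `current`.

```c
/* Hypothetical constant: the number of starting points per block group. */
#define NR_COLOURS 16UL

/*
 * Illustrative stand-in for the goal computation: start of the block
 * group, plus an offset ("colour") chosen by the low four bits of the PID.
 * All parameters are assumptions standing in for kernel state.
 */
static unsigned long goal_block(unsigned long block_group,
				unsigned long blocks_per_group,
				unsigned long first_data_block,
				unsigned long pid)
{
	/* First block of the inode's block group. */
	unsigned long bg_start = block_group * blocks_per_group +
				 first_data_block;

	/* pid % 16 picks one of sixteen equally spaced starting points. */
	unsigned long colour = (pid % NR_COLOURS) *
			       (blocks_per_group / NR_COLOURS);

	return bg_start + colour;
}
```

As a worked case under assumed geometry (32768 blocks per group, first data block 0), block group 3 starts at block 98304 and the sixteen starting points are 2048 blocks apart. A process with PID 1201 (1201 % 16 == 1) would get a goal of 100352, while PIDs 1200 and 1216 both map to colour 0 and share the goal 98304.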

Commit d2562c9d, authored Dec 21, 2002 by Andrew Morton; committed by Linus Torvalds, Dec 21, 2002

Showing with 14 additions and 3 deletions

fs/ext3/inode.c +14 -3

--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -455,14 +455,22 @@ static Indirect *ext3_get_branch(struct inode *inode, int depth, int *offsets,
 *	  + if pointer will live in indirect block - allocate near that block.
 *	  + if pointer will live in inode - allocate in the same
 *	    cylinder group. 
+ *
+ * In the latter case we colour the starting block by the caller's PID to
+ * prevent it from clashing with concurrent allocations for a different inode
+ * in the same block group.   The PID is used here so that functionally related
+ * files will be close-by on-disk.
+ *
 *	Caller must make sure that @ind is valid and will stay that way.
 */

-static inline unsigned long ext3_find_near(struct inode *inode, Indirect *ind)
+static unsigned long ext3_find_near(struct inode *inode, Indirect *ind)
 {
 	struct ext3_inode_info *ei = EXT3_I(inode);
 	u32 *start = ind->bh ? (u32*) ind->bh->b_data : ei->i_data;
 	u32 *p;
+	unsigned long bg_start;
+	unsigned long colour;

 	/* Try to find previous block */
 	for (p = ind->p - 1; p >= start; p--)
@@ -477,8 +485,11 @@ static inline unsigned long ext3_find_near(struct inode *inode, Indirect *ind)
 	 * It is going to be refered from inode itself? OK, just put it into
 	 * the same cylinder group then.
 	 */
-	return (ei->i_block_group * EXT3_BLOCKS_PER_GROUP(inode->i_sb)) +
+	bg_start = (ei->i_block_group * EXT3_BLOCKS_PER_GROUP(inode->i_sb)) +
 		le32_to_cpu(EXT3_SB(inode->i_sb)->s_es->s_first_data_block);
+	colour = (current->pid % 16) *
+			(EXT3_BLOCKS_PER_GROUP(inode->i_sb) / 16);
+	return bg_start + colour;
 }

 /**