Commits · 838cd0cf9af52e034ee81513642083bbe8e4ddb1 · nexedi / linux

24 Sep, 2012 2 commits

ext4: check free block counters in ext4_mb_find_by_goal · 838cd0cf

Yongqiang Yang authored Sep 23, 2012

Free block counters should be checked before doing allocation.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

838cd0cf

ext4: fix crash when accessing /proc/mounts concurrently · 50df9fd5

Herton Ronaldo Krzesinski authored Sep 23, 2012

The crash was caused by a variable being erronously declared static in
token2str().

In addition to /proc/mounts, the problem can also be easily replicated
by accessing /proc/fs/ext4/<partition>/options in parallel:

$ cat /proc/fs/ext4/<partition>/options > options.txt

... and then running the following command in two different terminals:

$ while diff /proc/fs/ext4/<partition>/options options.txt; do true; done

This is also the cause of the following a crash while running xfstests
#234, as reported in the following bug reports:

	https://bugs.launchpad.net/bugs/1053019
	https://bugzilla.kernel.org/show_bug.cgi?id=47731Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Figg <brad.figg@canonical.com>
Cc: stable@vger.kernel.org

50df9fd5

20 Sep, 2012 2 commits

ext4: remove erroneous ext4_superblock_csum_set() in update_backups() · bef53b01

Tao Ma authored Sep 20, 2012

The update_backups() function is used to backup all the metadata
blocks, so we should not take it for granted that 'data' is pointed to
a super block and use ext4_superblock_csum_set to calculate the
checksum there.  In case where the data is a group descriptor block,
it will corrupt the last group descriptor, and then e2fsck will
complain about it it.

As all the metadata checksums should already be OK when we do the
backup, remove the wrong ext4_superblock_csum_set and it should be
just fine.
Reported-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

bef53b01

ext4: fix potential deadlock in ext4_nonda_switch() · 00d4e736

Theodore Ts'o authored Sep 19, 2012

In ext4_nonda_switch(), if the file system is getting full we used to
call writeback_inodes_sb_if_idle().  The problem is that we can be
holding i_mutex already, and this causes a potential deadlock when
writeback_inodes_sb_if_idle() when it tries to take s_umount.  (See
lockdep output below).

As it turns out we don't need need to hold s_umount; the fact that we
are in the middle of the write(2) system call will keep the superblock
pinned.  Unfortunately writeback_inodes_sb() checks to make sure
s_umount is taken, and the VFS uses a different mechanism for making
sure the file system doesn't get unmounted out from under us.  The
simplest way of dealing with this is to just simply grab s_umount
using a trylock, and skip kicking the writeback flusher thread in the
very unlikely case that we can't take a read lock on s_umount without
blocking.

Also, we now check the cirteria for kicking the writeback thread
before we decide to whether to fall back to non-delayed writeback, so
if there are any outstanding delayed allocation writes, we try to get
them resolved as soon as possible.

   [ INFO: possible circular locking dependency detected ]
   3.6.0-rc1-00042-gce894ca #367 Not tainted
   -------------------------------------------------------
   dd/8298 is trying to acquire lock:
    (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46

   but task is already holding lock:
    (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3

   which lock already depends on the new lock.

   2 locks held by dd/8298:
    #0:  (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3
    #1:  (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3

   stack backtrace:
   Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367
   Call Trace:
    [<c015b79c>] ? console_unlock+0x345/0x372
    [<c06d62a1>] print_circular_bug+0x190/0x19d
    [<c019906c>] __lock_acquire+0x86d/0xb6c
    [<c01999db>] ? mark_held_locks+0x5c/0x7b
    [<c0199724>] lock_acquire+0x66/0xb9
    [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
    [<c06db935>] down_read+0x28/0x58
    [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
    [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
    [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4
    [<c0271ece>] ext4_da_write_begin+0x27/0x193
    [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb
    [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205
    [<c01ddce7>] generic_file_aio_write+0x78/0xd3
    [<c026d336>] ext4_file_write+0x480/0x4a6
    [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c
    [<c0180944>] ? sched_clock_cpu+0x11a/0x13e
    [<c01967e9>] ? trace_hardirqs_off+0xb/0xd
    [<c018099f>] ? local_clock+0x37/0x4e
    [<c0209f2c>] do_sync_write+0x67/0x9d
    [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44
    [<c020a7b9>] vfs_write+0x7b/0xe6
    [<c020a9a6>] sys_write+0x3b/0x64
    [<c06dd4bd>] syscall_call+0x7/0xb
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

00d4e736

19 Sep, 2012 2 commits

ext4: speed up truncate/unlink by not using bforget() unless needed · 18888cf0

Andrey Sidorov authored Sep 19, 2012

Do not iterate over data blocks scanning for bh's to forget as they're
never exist. This improves time taken by unlink / truncate syscall.
Tested by continuously truncating file that is being written by dd.
Another test is rm -rf of linux tree while tar unpacks it. With
ordered data mode condition unlikely(!tbh) was always met in
ext4_free_blocks. With journal data mode tbh was found only few times,
so optimisation is also possible.

Unlinking fallocated 60G file after doing sync && echo 3 >
/proc/sys/vm/drop_caches && time rm --help

X86 before (linux 3.6-rc4):
# time rm -f test1
real    0m2.710s
user    0m0.000s
sys     0m1.530s

X86 after:
# time rm -f test1
real    0m0.644s
user    0m0.003s
sys     0m0.060s

MIPS before (linux 2.6.37):
# time rm -f test1
real    0m 4.93s
user    0m 0.00s
sys     0m 4.61s

MIPS after:
# time rm -f test1
real    0m 0.16s
user    0m 0.00s
sys     0m 0.06s
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrey Sidorov <qrxd43@motorola.com>

18888cf0

ext4: fix online resizing when the # of block groups is constant · 59e31c15

Theodore Ts'o authored Sep 19, 2012

Commit 1c6bd717 introduced a regression where an online resize
operation which did not change the number of block groups would fail,
i.e:

	mke2fs -t /dev/vdc 60000
	mount /dev/vdc
	resize2fs /dev/vdc 60001

This was due to a bug in the logic regarding when to try converting
the filesystem to use meta_bg.

Also fix up a number of other minor issues with the online resizing
code: (a) Fix a sparse warning; (b) only check to make sure the device
is large enough once, instead of multiple times through the resize
loop.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

59e31c15

18 Sep, 2012 4 commits

ext4: make orphan functions be no-op in no-journal mode · c9b92530

Anatol Pomozov authored Sep 18, 2012

Instead of checking whether the handle is valid, we check if journal
is enabled. This avoids taking the s_orphan_lock mutex in all cases
when there is no journal in use, including the error paths where
ext4_orphan_del() is called with a handle set to NULL.
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

c9b92530

ext4: re-enable -o discard functionality in no-journal mode · b5e2368b

Theodore Ts'o authored Sep 18, 2012

This is a revert of commit b56ff9d3, which removed the call to
ext4_issue_discard() to fix a BUG reported because
ext4_issue_discard() was being called from inside a block group
spinlock.  As it turns out this bug had already been fixed by Lukas
Czerner in commit 53fdcf99 by the simple expedient of moving when
we call ext4_issue_discard() outside the spinlock.

So it should be safe to re-enable this functionality, which I tested
by putting an BUG_ON(in_atomic) just after the restored callsite to
ext4_issue_discard().

Addresses-Google-Bug: #6750518
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Anatol Pomozov <anatol.pomozov@gmail.com>

b5e2368b

ext4: fix possible non-initialized variable in htree_dirblock_to_tree() · 90b0a973

Carlos Maiolino authored Sep 17, 2012

htree_dirblock_to_tree() declares a non-initialized 'err' variable,
which is passed as a reference to another functions expecting them to
set this variable with their error codes.

It's passed to ext4_bread(), which then passes it to ext4_getblk(). If
ext4_map_blocks() returns 0 due to a lookup failure, leaving the
ext4_getblk() buffer_head uninitialized, it will make ext4_getblk()
return to ext4_bread() without initialize the 'err' variable, and
ext4_bread() will return to htree_dirblock_to_tree() with this variable
still uninitialized. htree_dirblock_to_tree() will pass this variable
with garbage back to ext4_htree_fill_tree(), which expects a number of
directory entries added to the rb-tree. which, in case, might return a
fake non-zero value due the garbage left in the 'err' variable, leading
the kernel to an Oops in ext4_dx_readdir(), once this is expecting a
filled rb-tree node, when in turn it will have a NULL-ed one, causing an
invalid page request when trying to get a fname struct from this NULL-ed
rb-tree node in this line:

fname = rb_entry(info->curr_node, struct fname, rb_hash);

The patch itself initializes the err variable in
htree_dirblock_to_tree() to avoid usage mistakes by the called
functions, and also fix ext4_getblk() to return a initialized 'err'
variable when ext4_map_blocks() fails a lookup.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

90b0a973

ext4: do not enable delalloc by default for ext2 · bc0b75f7

Theodore Ts'o authored Sep 17, 2012

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

bc0b75f7

13 Sep, 2012 3 commits

ext4: advertise the fact that the kernel supports meta_bg resizing · 5e7bbef1
Theodore Ts'o authored Sep 13, 2012
```
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
```
5e7bbef1

ext4: log a resize update to the console every 10 seconds · 4da4a56e

Theodore Ts'o authored Sep 13, 2012

For very long online resizes, a periodic update to the console log is
helpful for debugging and for progress reporting.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

4da4a56e

ext4: convert file system to meta_bg if needed during resizing · 1c6bd717

Theodore Ts'o authored Sep 13, 2012

If we have run out of reserved gdt blocks, then clear the resize_inode
feature and enable the meta_bg feature, so that we can continue
resizing the file system seamlessly.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

1c6bd717

12 Sep, 2012 1 commit

ext4: set bg_itable_unused when resizing · 93f90526

Theodore Ts'o authored Sep 12, 2012

Set bg_itable_unused for file systems that have uninit_bg enabled.
This will speed up the first e2fsck run after the file system is
resized.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

93f90526

05 Sep, 2012 7 commits

ext4: add online resizing support for meta_bg and 64-bit file systems · 01f795f9

Yongqiang Yang authored Sep 05, 2012

This patch adds support for resizing file systems with the meta_bg and
64bit features.

[ Added a fix by tytso to fix a divide by zero when resizing a
  filesystem from 14 TB to 18TB.  Also fixed overhead accounting for
  meta_bg file systems.]
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

01f795f9

ext4: grow the s_group_info array as needed · 28623c2f

Theodore Ts'o authored Sep 05, 2012

Previously we allocated the s_group_info array with enough space for
any future possible growth of the file system via online resize.  This
is unfortunate because it wastes memory, and it doesn't work for the
meta_bg scheme, since there is no limit based on the number of
reserved gdt blocks.  So add the code to grow the s_group_info array
as needed.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

28623c2f

ext4: grow the s_flex_groups array as needed when resizing · 117fff10

Theodore Ts'o authored Sep 05, 2012

Previously, we allocated the s_flex_groups array to the maximum size
that the file system could be resized.  There was two problems with
this approach.  First, it wasted memory in the common case where the
file system was not resized.  Secondly, once we start allowing online
resizing using the meta_bg scheme, there is no maximum size that the
file system can be resized.  So instead, we need to grow the
s_flex_groups at inline resize time.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

117fff10

ext4: avoid duplicate writes of the backup bg descriptor blocks · 2ebd1704

Yongqiang Yang authored Sep 05, 2012

The resize code was needlessly writing the backup block group
descriptor blocks multiple times (once per block group) during an
online resize.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

2ebd1704

ext4: don't copy non-existent gdt blocks when resizing · 6df935ad

Yongqiang Yang authored Sep 05, 2012

The resize code was copying blocks at the beginning of each block
group in order to copy the superblock and block group descriptor table
(gdt) blocks.  This was, unfortunately, being done even for block
groups that did not have super blocks or gdt blocks.  This is a
complete waste of perfectly good I/O bandwidth, to skip writing those
blocks for sparse bg's.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

6df935ad

ext4: report the original old blocks count in a debug message when resizing · d7574ad0

Yongqiang Yang authored Sep 05, 2012

Avoid changing o_blocks_count, since it is used later when reporting
old blocks count in debug mode.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

d7574ad0

ext4: ignore last group w/o enough space when resizing instead of BUG'ing · 03c1c290

Yongqiang Yang authored Sep 05, 2012

If the last group does not have enough space for group tables, ignore
it instead of calling BUG_ON().
Reported-by: Daniel Drake <dsd@laptop.org>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

03c1c290

19 Aug, 2012 5 commits

ext4: remove duplicated declarations in inode.c · 8a2f8460

Zheng Liu authored Aug 19, 2012

In patch cb20d518, ext4_set_bh_endio
and ext4_end_io_buffer_write are declared at the beginning of inode.c,
and again later on in the middle of the file.  Remove the second set
of duplicated function declarations.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

8a2f8460

ext4: fix trivial typo in comment · 30cb27d6

Wang Sheng-Hui authored Aug 18, 2012

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

30cb27d6

ext4: no need to add inode to orphan list during hole punch · e3d2e433

Ashish Sangwan authored Aug 18, 2012

While performing punch hole for an inode, i_disksize is not changed.
So, there is no need to add the inode to orphan list.
Signed-off-by: Ashish Sangwan <ashish.sangwan2@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Acked-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

e3d2e433

jbd2: don't write superblock when if its empty · eeecef0a

Eric Sandeen authored Aug 18, 2012

This sequence:

# truncate --size=1g fsfile
# mkfs.ext4 -F fsfile
# mount -o loop,ro fsfile /mnt
# umount /mnt
# dmesg | tail

results in an IO error when unmounting the RO filesystem:

[  318.020828] Buffer I/O error on device loop1, logical block 196608
[  318.027024] lost page write due to I/O error on loop1
[  318.032088] JBD2: Error -5 detected when updating journal superblock for loop1-8.

This was a regression introduced by commit 24bcc89c: "jbd2: split
updating of journal superblock and marking journal empty".
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

eeecef0a

ext4: replace plain integer with NULL in super.c · caecd0af

Sachin Kamat authored Aug 18, 2012

Fixes the following sparse warning:
fs/ext4/super.c:1672:45: warning: Using plain integer as NULL pointer
Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

caecd0af

17 Aug, 2012 14 commits

ext4: drop lock_super()/unlock_super() · 07724f98

Theodore Ts'o authored Aug 17, 2012

We don't need lock_super()/unlock_super() any more, since the places
where it is used, we are protected by the s_umount r/w semaphore.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Marco Stornelli <marco.stornelli@gmail.com>

07724f98

ext4: return an error if kset_create_and_add fails in ext4_init_fs() · 0e376b1e

Theodore Ts'o authored Aug 17, 2012

In the very unlikely case that kset_create_and_add() fails when the
ext4.ko module is being loaded (or during kernel startup) set err so
that it's clear that the module load failed.

https://bugzilla.kernel.org/show_bug.cgi?id=27912Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

0e376b1e

ext4: remove unused function argument 'order' in mb_find_extent() · 15c006a2

Robin Dong authored Aug 17, 2012

All the routines call mb_find_extent are setting argument 'order' to 0
just like:

	mb_find_extent(e4b, 0, ex.fe_start, ex.fe_len, &ex);

therefore the useless argument should be removed.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

15c006a2

ext4: remove unused macro MB_DEFAULT_MAX_GROUPS_TO_SCAN · cc6eb18d
Robin Dong authored Aug 17, 2012
```
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
```
cc6eb18d

ext4: check return value of blkdev_issue_flush() · a4a39040

Theodore Ts'o authored Aug 17, 2012

    
blkdev_issue_flush() can fail; make sure the error gets properly
propagated.
    
This is a port of the equivalent ext3 patch from commit 44f4f729.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

a4a39040

jbd2: check return value of blkdev_issue_flush() · 316e4cfd

Theodore Ts'o authored Aug 17, 2012

blkdev_issue_flush() can fail; make sure the error gets properly
propagated.

This is a port of the equivalent jbd patch from commit 349ecd6a.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

316e4cfd

ext4: make the zero-out chunk size tunable · 67a5da56

Zheng Liu authored Aug 17, 2012

Currently in ext4 the length of zero-out chunk is set to 7 file system
blocks.  But if an inode has uninitailized extents from using
fallocate to preallocate space, and the workload issues many random
writes, this can cause a fragmented extent tree that will
unnecessarily grow the extent tree.

So create a new sysfs tunable, extent_max_zeroout_kb, which controls
the maximum size where blocks will be zeroed out instead of creating a
new uninitialized extent.  The default of this has been sent to 32kb.

CC: Zach Brown <zab@zabbo.net>
CC: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

67a5da56

ext4: add missing space to trace message · 81370291

Anatol Pomozov authored Aug 17, 2012

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

81370291

ext4: realign trace events structs to make it smaller · 210c0526

Anatol Pomozov authored Aug 17, 2012

Most hardware architectures require that data (including struct fields)
have to be aligned in memory. To make it happen compiler inserts padding
between struct fields if they are not aligned correctly.

Reorder fields to remove paddings and make structures denser. Making data
smaller saves some memory that is very important for trace events.
Tracing buffer has limited size and making objects smaller we can put more
of them without overflowing the tracing buffer.

To find data struct holes I used 'pahole -H 1 -E -I vmlinux.o' from
'dwarves' package.
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

210c0526

ext4: add max_dir_size_kb mount option · df981d03

Theodore Ts'o authored Aug 17, 2012

Very large directories can cause significant performance problems, or
perhaps even invoke the OOM killer, if the process is running in a
highly constrained memory environment (whether it is VM's with a small
amount of memory or in a small memory cgroup).

So it is useful, in cloud server/data center environments, to be able
to set a filesystem-wide cap on the maximum size of a directory, to
ensure that directories never get larger than a sane size.  We do this
via a new mount option, max_dir_size_kb.  If there is an attempt to
grow the directory larger than max_dir_size_kb, the system call will
return ENOSPC instead.

Google-Bug-Id: 6863013
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

df981d03

ext4: don't load the block bitmap for block groups which have no space · 01fc48e8

Theodore Ts'o authored Aug 17, 2012

Add a short circuit check to ext4_mb_group_group() so that we don't
bother to load the block bitmap for a block group which does not have
any space available.  (Or which does not have enough space until we
are in desperation mode, i.e., when cr == 3.)

Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=45741
Reported-by: mirek@me.com
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

01fc48e8

ext4: collapse a single extent tree block into the inode if possible · ecb94f5f

Theodore Ts'o authored Aug 17, 2012

If an inode has more than 4 extents, but then later some of the
extents are merged together, we can optimize the file system by moving
the extents up into the inode, and discarding the extent tree block.
This is important, because if there are a large number of inodes with
an external extent tree blocks where the contents could fit in the
inode, this can significantly increase the fsck time of the file
system.

Google-Bug-Id: 6801242
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ecb94f5f

ext4: fix kernel BUG on large-scale rm -rf commands · 89a4e48f

Theodore Ts'o authored Aug 17, 2012

Commit 968dee77: "ext4: fix hole punch failure when depth is greater
than 0" introduced a regression in v3.5.1/v3.6-rc1 which caused kernel
crashes when users ran run "rm -rf" on large directory hierarchy on
ext4 filesystems on RAID devices:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028

    Process rm (pid: 18229, threadinfo ffff8801276bc000, task ffff880123631710)
    Call Trace:
     [<ffffffff81236483>] ? __ext4_handle_dirty_metadata+0x83/0x110
     [<ffffffff812353d3>] ext4_ext_truncate+0x193/0x1d0
     [<ffffffff8120a8cf>] ? ext4_mark_inode_dirty+0x7f/0x1f0
     [<ffffffff81207e05>] ext4_truncate+0xf5/0x100
     [<ffffffff8120cd51>] ext4_evict_inode+0x461/0x490
     [<ffffffff811a1312>] evict+0xa2/0x1a0
     [<ffffffff811a1513>] iput+0x103/0x1f0
     [<ffffffff81196d84>] do_unlinkat+0x154/0x1c0
     [<ffffffff8118cc3a>] ? sys_newfstatat+0x2a/0x40
     [<ffffffff81197b0b>] sys_unlinkat+0x1b/0x50
     [<ffffffff816135e9>] system_call_fastpath+0x16/0x1b
    Code: 8b 4d 20 0f b7 41 02 48 8d 04 40 48 8d 04 81 49 89 45 18 0f b7 49 02 48 83 c1 01 49 89 4d 00 e9 ae f8 ff ff 0f 1f 00 49 8b 45 28 <48> 8b 40 28 49 89 45 20 e9 85 f8 ff ff 0f 1f 80 00 00 00

    RIP  [<ffffffff81233164>] ext4_ext_remove_space+0xa34/0xdf0

This could be reproduced as follows:

The problem in commit 968dee77 was that caused the variable 'i' to
be left uninitialized if the truncate required more space than was
available in the journal.  This resulted in the function
ext4_ext_truncate_extend_restart() returning -EAGAIN, which caused
ext4_ext_remove_space() to restart the truncate operation after
starting a new jbd2 handle.
Reported-by: Maciej Żenczykowski <maze@google.com>
Reported-by: Marti Raudsepp <marti@juffo.org>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

89a4e48f

ext4: fix long mount times on very big file systems · 0548bbb8

Theodore Ts'o authored Aug 16, 2012

Commit 8aeb00ff: "ext4: fix overhead calculation used by
ext4_statfs()" introduced a O(n**2) calculation which makes very large
file systems take forever to mount.  Fix this with an optimization for
non-bigalloc file systems.  (For bigalloc file systems the overhead
needs to be set in the the superblock.)
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

0548bbb8