Commits · c254c9ec14d5c418c8f36ea7573edae2470a1dc1 · nexedi / linux

14 Mar, 2012 4 commits

jbd2: remove always true condition in __journal_try_to_free_buffer() · c254c9ec

Jan Kara authored Mar 13, 2012

The check b_jlist == BJ_None in __journal_try_to_free_buffer() is
always true (__jbd2_journal_temp_unlink_buffer() also checks this in
an assertion) so just remove it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

c254c9ec

jbd2: declare __jbd2_journal_temp_unlink_buffer() static · 5bebccf9

Jan Kara authored Mar 13, 2012

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

5bebccf9

jbd2: fix BH_JWrite setting in checkpointing code · 96c86678

Jan Kara authored Mar 13, 2012

The BH_JWrite bit should be set when a buffer is written to the journal, so
checkpointing shouldn't set this bit when writing out a buffer. This didn't
cause any observable bug since the BH_JWrite bit is used only for debugging
purposes, but it's good to keep this consistent.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

96c86678

jbd2: issue cache flush after checkpointing even with internal journal · 79feb521

Jan Kara authored Mar 13, 2012

When we reach jbd2_cleanup_journal_tail(), there is no guarantee that
checkpointed buffers are on stable storage; especially if buffers were
written out by jbd2_log_do_checkpoint(), they are likely to be only in the
disk's caches. Thus when we update the journal superblock, effectively removing
the old transaction from the journal, this superblock write can reach stable
storage before those checkpointed buffers, which can result in filesystem
corruption after a crash. Thus we must unconditionally issue a cache flush
before we update the journal superblock in these cases.

A similar problem can also occur if the journal superblock is written only to
the disk's caches, another transaction starts reusing the space of the
transaction cleaned from the log, and a power failure happens. Subsequent
journal replay would still try to replay the old transaction, but some of its
blocks may already be overwritten by the new transaction. For this reason we
must use WRITE_FUA when updating the log tail, and we must first write the new
log tail to disk and update in-memory information only after that.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

79feb521
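
A toy model may make the ordering argument concrete. Everything below is hypothetical (a two-block disk with a volatile write cache, not jbd2 code): the checkpoint writes a data block, then the superblock records that the old transaction has left the log. Without a flush, a power failure can leave the superblock on stable storage but not the data; with a cache flush followed by a FUA-style superblock write, no crash outcome breaks the invariant.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical two-block disk: a data block and the journal superblock. */
enum { DATA_BLK = 0, SB_BLK = 1, NBLK = 2 };

struct toy_disk {
    int stable[NBLK];   /* survives power loss */
    int cache[NBLK];    /* volatile write cache */
    bool dirty[NBLK];   /* cached but not yet on stable storage */
};

static void cached_write(struct toy_disk *d, int blk, int val)
{
    assert(blk >= 0 && blk < NBLK);
    d->cache[blk] = val;
    d->dirty[blk] = true;
}

/* Cache flush: every dirty block reaches stable storage. */
static void flush_cache(struct toy_disk *d)
{
    for (int b = 0; b < NBLK; b++) {
        if (d->dirty[b]) {
            d->stable[b] = d->cache[b];
            d->dirty[b] = false;
        }
    }
}

/* FUA-style write: this one block goes straight to stable storage. */
static void fua_write(struct toy_disk *d, int blk, int val)
{
    d->cache[blk] = val;
    d->stable[blk] = val;
    d->dirty[blk] = false;
}

/* Power loss: only the dirty blocks whose bit is set in 'destaged'
 * were written back by the drive, in whatever order it chose. */
static void crash(struct toy_disk *d, unsigned destaged)
{
    for (int b = 0; b < NBLK; b++)
        if (d->dirty[b] && (destaged & (1u << b)))
            d->stable[b] = d->cache[b];
}

/* Checkpoint DATA_BLK, then advance the log tail in the superblock.
 * Returns true if the post-crash state is safe: the tail may only be
 * advanced on disk if the checkpointed data is on disk too. */
static bool checkpoint_then_crash(bool flush_and_fua, unsigned destaged)
{
    struct toy_disk d;
    memset(&d, 0, sizeof(d));

    cached_write(&d, DATA_BLK, 1);        /* checkpoint writes the buffer */
    if (flush_and_fua) {
        flush_cache(&d);                  /* data is stable first */
        fua_write(&d, SB_BLK, 1);         /* then the tail update */
    } else {
        cached_write(&d, SB_BLK, 1);      /* both only in the cache */
    }
    crash(&d, destaged);

    bool tail_advanced = d.stable[SB_BLK] == 1;
    bool data_stable = d.stable[DATA_BLK] == 1;
    return !tail_advanced || data_stable;
}
```

Calling checkpoint_then_crash(false, 1u << SB_BLK) models the drive destaging only the superblock and returns false; with flush_and_fua set, every destage pattern returns true.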

13 Mar, 2012 2 commits

jbd2: protect all log tail updates with j_checkpoint_mutex · a78bb11d

Jan Kara authored Mar 13, 2012

There are some log tail updates that are not protected by j_checkpoint_mutex.
Some of these are harmless because they happen during startup or shutdown but
updates in jbd2_journal_commit_transaction() and jbd2_journal_flush() can
really race with other log tail updates (e.g. someone doing
jbd2_journal_flush() with someone running jbd2_cleanup_journal_tail()). So
protect all log tail updates with j_checkpoint_mutex.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

a78bb11d
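
A minimal sketch of the locking rule, assuming a simplified journal structure (toy_journal, toy_update_log_tail and its fields are hypothetical; real jbd2 compares transaction IDs with wraparound-safe helpers, which this sketch ignores): every path that moves the tail takes the same mutex, so the sequence number and block are updated as a pair and a caller with stale information cannot move the tail backwards.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for journal_t with only the fields needed here. */
struct toy_journal {
    pthread_mutex_t checkpoint_mutex;   /* role of j_checkpoint_mutex */
    unsigned long tail_sequence;        /* oldest transaction still in the log */
    unsigned long tail_block;           /* where that transaction starts */
};

/* Commit, flush and checkpoint cleanup would all go through this helper.
 * Returns 1 if the tail moved, 0 if the request was stale. */
static int toy_update_log_tail(struct toy_journal *j, unsigned long seq,
                               unsigned long block)
{
    int moved = 0;

    pthread_mutex_lock(&j->checkpoint_mutex);
    if (seq > j->tail_sequence) {       /* never move the tail backwards */
        j->tail_sequence = seq;
        j->tail_block = block;
        moved = 1;
    }
    pthread_mutex_unlock(&j->checkpoint_mutex);
    return moved;
}
```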

jbd2: split updating of journal superblock and marking journal empty · 24bcc89c

Jan Kara authored Mar 13, 2012

There are three cases of updating the journal superblock. In the first case, we want
to mark journal as empty (setting s_sequence to 0), in the second case we want
to update log tail, in the third case we want to update s_errno. Split these
cases into separate functions. It makes the code slightly more straightforward
and later patches will make the distinction even more important.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

24bcc89c

12 Mar, 2012 1 commit

ext4: check for zero length extent · 31d4f3a2

Theodore Ts'o authored Mar 11, 2012

Explicitly test for an extent whose length is zero, and flag that as a
corrupted extent.

This avoids a kernel BUG_ON assertion failure.

Tested: Without this patch, the file system image found in
tests/f_ext_zero_len/image.gz in the latest e2fsprogs sources causes a
kernel panic.  With this patch, an ext4 file system error is noted
instead, and the file system is marked as being corrupted.

https://bugzilla.kernel.org/show_bug.cgi?id=42859
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org

31d4f3a2
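
The check amounts to rejecting a length of zero before any code that assumes at least one block can run. A sketch with a hypothetical in-memory extent type (not ext4's on-disk struct ext4_extent, and ignoring the encoding ext4 uses for uninitialized extent lengths):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded extent. */
struct toy_extent {
    uint32_t lblk;   /* first logical block covered */
    uint16_t len;    /* number of blocks */
    uint64_t pblk;   /* first physical block */
};

/* Returns false for an extent that should be reported as corrupted. */
static bool toy_extent_valid(const struct toy_extent *ex, uint64_t fs_blocks)
{
    if (ex->len == 0)
        return false;                     /* zero-length extent */
    if (ex->pblk >= fs_blocks || ex->len > fs_blocks - ex->pblk)
        return false;                     /* runs past the end of the fs */
    return true;
}
```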

05 Mar, 2012 8 commits

ext4: add comments to definition of ext4_io_end_t · 4188188b

Curt Wohlgemuth authored Mar 05, 2012

This should make it more clear what this structure is used
for, and how some of the (mutually exclusive) fields are
used to keep page cache references.
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

4188188b

ext4: don't release page refs in ext4_end_bio() · b43d17f3

Curt Wohlgemuth authored Mar 05, 2012

We can clear PageWriteback on each page when the IO
completes, but we can't release the references on the page
until we convert any uninitialized extents.

Without this patch, the use of the dioread_nolock mount
option can break buffered writes, because extents may
not be converted by the time a subsequent buffered read
comes in; if the page is not in the page cache, a read
will return zeros if the extent is still uninitialized.

I tested this with a (temporary) patch that adds a call
to msleep(1000) at the start of ext4_end_io_work(), to delay
processing of each DIO-unwritten work queue item.  With this
msleep(), a simple workload of

  fallocate
  write
  fadvise
  read

will fail without this patch and succeed with it.
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

b43d17f3

ext4: fix race between sync and completed io work · 491caa43

Jeff Moyer authored Mar 05, 2012

The following command line will leave the aio-stress process unkillable
on an ext4 file system (in my case, mounted on /mnt/test):

aio-stress -t 20 -s 10 -O -S -o 2 -I 1000 /mnt/test/aiostress.3561.4 /mnt/test/aiostress.3561.4.20 /mnt/test/aiostress.3561.4.19 /mnt/test/aiostress.3561.4.18 /mnt/test/aiostress.3561.4.17 /mnt/test/aiostress.3561.4.16 /mnt/test/aiostress.3561.4.15 /mnt/test/aiostress.3561.4.14 /mnt/test/aiostress.3561.4.13 /mnt/test/aiostress.3561.4.12 /mnt/test/aiostress.3561.4.11 /mnt/test/aiostress.3561.4.10 /mnt/test/aiostress.3561.4.9 /mnt/test/aiostress.3561.4.8 /mnt/test/aiostress.3561.4.7 /mnt/test/aiostress.3561.4.6 /mnt/test/aiostress.3561.4.5 /mnt/test/aiostress.3561.4.4 /mnt/test/aiostress.3561.4.3 /mnt/test/aiostress.3561.4.2

This is using the aio-stress program from the xfstests test suite.
That particular command line tells aio-stress to do random writes to
20 files from 20 threads (one thread per file). The files are NOT
preallocated, so you will get writes to random offsets within the
file, thus creating holes and extending i_size. It also opens the
file with O_DIRECT and O_SYNC.

On to the problem. When an I/O requires unwritten extent conversion,
it is queued onto the completed_io_list for the ext4 inode. Two code
paths will pull work items from this list. The first is the
ext4_end_io_work routine, and the second is ext4_flush_completed_IO,
which is called via the fsync path (and O_SYNC handling, as well).
There are two issues I've found in these code paths. First, if the
fsync path beats the work routine to a particular I/O, the work
routine will free the io_end structure! It does not take into account
the fact that the io_end may still be in use by the fsync path. I've
fixed this issue by adding yet another IO_END flag, indicating that
the io_end is being processed by the fsync path.

The second problem is that the work routine will make an assignment to
io->flag outside of the lock. I have witnessed this result in a hang
at umount. Moving the flag setting inside the lock resolved that
problem.

The problem was introduced by commit b82e384c ("ext4: optimize
locking for end_io extent conversion"), which first appeared in 3.2.
As such, the fix should be backported to that release (probably along
with the unwritten extent conversion race fix).
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
CC: stable@kernel.org

491caa43
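
A sketch of the two rules from the fix, using hypothetical names (toy_io_end, TOY_IO_END_IN_FSYNC and the one-slot pending list stand in for ext4's io_end, its flags and completed_io_list): the fsync path marks an io_end it is using, the work routine leaves a marked io_end alone instead of freeing it, and io->flag is only touched with the lock held. The extent conversion itself is elided.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical flags, standing in for ext4's IO_END flags. */
#define TOY_IO_END_UNWRITTEN 0x1u   /* extent conversion still pending */
#define TOY_IO_END_IN_FSYNC  0x2u   /* the fsync path has claimed it */

struct toy_io_end {
    unsigned int flag;
};

struct toy_inode {
    pthread_mutex_t completed_io_lock;
    struct toy_io_end *pending;     /* one-slot "completed_io_list" */
};

/* Work-queue side: flag updates happen under the lock, and an io_end
 * claimed by fsync is left for fsync to free. Returns true if freed. */
static bool toy_end_io_work(struct toy_inode *inode)
{
    bool freed = false;

    pthread_mutex_lock(&inode->completed_io_lock);
    struct toy_io_end *io = inode->pending;
    if (io && !(io->flag & TOY_IO_END_IN_FSYNC)) {
        io->flag &= ~TOY_IO_END_UNWRITTEN;   /* conversion elided */
        inode->pending = NULL;
        free(io);
        freed = true;
    }
    pthread_mutex_unlock(&inode->completed_io_lock);
    return freed;
}

/* fsync side, step 1: claim the io_end before dropping the lock to
 * perform the (elided) conversion. */
static struct toy_io_end *toy_fsync_claim(struct toy_inode *inode)
{
    pthread_mutex_lock(&inode->completed_io_lock);
    struct toy_io_end *io = inode->pending;
    if (io)
        io->flag |= TOY_IO_END_IN_FSYNC;
    pthread_mutex_unlock(&inode->completed_io_lock);
    return io;
}

/* fsync side, step 2: conversion done; unlink and free it. */
static void toy_fsync_finish(struct toy_inode *inode, struct toy_io_end *io)
{
    pthread_mutex_lock(&inode->completed_io_lock);
    io->flag &= ~TOY_IO_END_UNWRITTEN;
    if (inode->pending == io)
        inode->pending = NULL;
    pthread_mutex_unlock(&inode->completed_io_lock);
    free(io);
}
```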

ext4: clean up the flags passed to __blockdev_direct_IO · 93ef8541

Jeff Moyer authored Mar 05, 2012

For extent-based files, you can perform DIO to holes, as mentioned in
the comments in ext4_ext_direct_IO.  However, that function passes
DIO_SKIP_HOLES to __blockdev_direct_IO, which is *really* confusing to
the uninitiated reader.  The key, here, is that the get_block function
passed in, ext4_get_block_write, completely ignores the create flag
that is passed to it (the create flag is passed in from the direct I/O
code, which uses the DIO_SKIP_HOLES flag to determine whether or not
it should be cleared).

This is a long-winded way of saying that the DIO_SKIP_HOLES flag is
ultimately ignored.  So let's remove it.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

93ef8541
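
A model of why the flag ends up having no effect, with hypothetical names throughout. The condition under which the generic code clears create (here, a block inside i_size) is a simplifying assumption; the point is only that a get_block callback which ignores create makes the flag irrelevant.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag; not the real DIO_SKIP_HOLES value. */
#define TOY_DIO_SKIP_HOLES 0x1

typedef long (*toy_get_block_t)(long lblk, bool create);

/* Generic direct-I/O side: for a write, SKIP_HOLES clears create when the
 * block lies inside i_size (simplified condition). */
static long toy_dio_map_block(long lblk, long isize_blocks, int flags,
                              toy_get_block_t get_block)
{
    bool create = true;                   /* this is a write */
    if ((flags & TOY_DIO_SKIP_HOLES) && lblk < isize_blocks)
        create = false;
    return get_block(lblk, create);
}

/* File system side, modeled on the behaviour described above for
 * ext4_get_block_write(): it maps/allocates regardless of create. */
static long toy_get_block_write(long lblk, bool create)
{
    (void)create;                         /* ignored */
    return 1000 + lblk;                   /* pretend physical block */
}
```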

ext4: try to deprecate noacl and noxattr_user mount options · f7048605

Theodore Ts'o authored Mar 04, 2012

No other file system allows ACLs and extended attributes to be
enabled or disabled via a mount option.  So let's try to deprecate
these options from ext4.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

f7048605

ext4: ignore mount options supported by ext2/3 (but have since been removed) · c7198b9c

Theodore Ts'o authored Mar 04, 2012

Users who use the ext4 file system driver for the ext2 or ext3 file
systems (via the CONFIG_EXT4_USE_FOR_EXT23 option) could have failed
mounts if their /etc/fstab contains options recognized by ext2 or ext3
but which have since been removed in ext4.

So teach ext4 to recognize them and give a warning that the mount
option was removed.

Report: https://bbs.archlinux.org/profile.php?id=33804
Signed-off-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Baechler <thomas@archlinux.org>
Cc: Tobias Powalowski <tobias.powalowski@googlemail.com>
Cc: Dave Reisner <d@falconindy.com>

c7198b9c
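
A sketch of the idea, assuming a simple lookup: a known option is accepted, an option on the removed list is accepted with a warning, and anything else fails. The table contents are illustrative (two options this log shows being removed), not ext4's actual lists, and all names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum toy_opt_result { TOY_OPT_OK, TOY_OPT_REMOVED, TOY_OPT_UNKNOWN };

/* Illustrative tables only. */
static const char *const toy_known_opts[] = { "data=ordered", "errors=remount-ro" };
static const char *const toy_removed_opts[] = { "resize", "journal=update" };

#define TOY_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static int toy_in_list(const char *opt, const char *const *list, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(opt, list[i]) == 0)
            return 1;
    return 0;
}

/* A removed option is accepted with a warning instead of failing the mount. */
static enum toy_opt_result toy_parse_opt(const char *opt)
{
    if (toy_in_list(opt, toy_known_opts, TOY_ARRAY_SIZE(toy_known_opts)))
        return TOY_OPT_OK;
    if (toy_in_list(opt, toy_removed_opts, TOY_ARRAY_SIZE(toy_removed_opts))) {
        fprintf(stderr, "warning: ignoring removed mount option \"%s\"\n", opt);
        return TOY_OPT_REMOVED;
    }
    return TOY_OPT_UNKNOWN;               /* caller would fail the mount */
}
```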

ext4: add debugging /proc file showing file system options · 66acdcf4

Theodore Ts'o authored Mar 04, 2012

Now that /proc/mounts is consistently showing only those mount options
which need to be specified in /etc/fstab or on the mount command line,
it is useful to have a file which shows exactly which file system
options are enabled.  This can be useful when debugging a user
problem.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

66acdcf4

ext4: make ext4_show_options() be table-driven · 5a916be1

Theodore Ts'o authored Mar 04, 2012

Consistently show mount options which are not the default, so that
/proc/mounts accurately shows the mount options that would be
necessary to mount the file system in its current mode of operation.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

5a916be1
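
A sketch of table-driven option display, with hypothetical flag bits and names: each table entry knows the option's "on" and "off" spellings, and only options whose state differs from the default are printed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bits; not ext4's mount flag values. */
#define TOY_MOUNT_BARRIER  0x1u
#define TOY_MOUNT_DELALLOC 0x2u
#define TOY_MOUNT_DISCARD  0x4u

struct toy_opt_spec {
    const char *set_name;     /* printed if set while the default is clear */
    const char *clear_name;   /* printed if clear while the default is set */
    unsigned int flag;
};

static const struct toy_opt_spec toy_specs[] = {
    { "barrier",  "nobarrier",  TOY_MOUNT_BARRIER },
    { "delalloc", "nodelalloc", TOY_MOUNT_DELALLOC },
    { "discard",  "nodiscard",  TOY_MOUNT_DISCARD },
};

/* Write a comma-separated list of the non-default options into buf
 * (len must be at least 1). */
static void toy_show_options(unsigned int opts, unsigned int defaults,
                             char *buf, size_t len)
{
    size_t used = 0;

    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(toy_specs) / sizeof(toy_specs[0]); i++) {
        const struct toy_opt_spec *s = &toy_specs[i];
        if (!((opts ^ defaults) & s->flag))
            continue;                     /* same as default: omit */
        const char *name = (opts & s->flag) ? s->set_name : s->clear_name;
        int n = snprintf(buf + used, len - used, "%s%s", used ? "," : "", name);
        if (n < 0 || (size_t)n >= len - used)
            break;                        /* out of space */
        used += (size_t)n;
    }
}
```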

04 Mar, 2012 2 commits

ext4: move ext4_show_options() after parse_options() · 2adf6da8

Theodore Ts'o authored Mar 03, 2012

This commit is strictly code movement, in preparation for changing
ext4_show_options to be table-driven.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

2adf6da8

ext4: use a table-driven handler for mount options · 26092bf5

Theodore Ts'o authored Mar 03, 2012

By using a table-driven approach, we shave about 100 lines of code from
ext4, and make the code a bit more regular and factored out. This
will also make it possible in a future patch to use this table for
displaying the mount options that were specified in /proc/mounts.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

26092bf5
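
A sketch of what a table-driven handler can look like, assuming hypothetical flag bits and option names: one generic loop sets or clears a bit according to the table entry, instead of one switch case per option.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical bits and table; ext4's own table differs. */
#define TOY_MOUNT_BARRIER  0x1u
#define TOY_MOUNT_DELALLOC 0x2u

enum { TOY_SET, TOY_CLEAR };

struct toy_mount_opt {
    const char *name;
    unsigned int flag;
    int action;                           /* TOY_SET or TOY_CLEAR */
};

static const struct toy_mount_opt toy_mount_opts[] = {
    { "barrier",    TOY_MOUNT_BARRIER,  TOY_SET },
    { "nobarrier",  TOY_MOUNT_BARRIER,  TOY_CLEAR },
    { "delalloc",   TOY_MOUNT_DELALLOC, TOY_SET },
    { "nodelalloc", TOY_MOUNT_DELALLOC, TOY_CLEAR },
};

/* Returns 0 on success, -1 for an unknown option. */
static int toy_handle_mount_opt(const char *name, unsigned int *opts)
{
    for (size_t i = 0; i < sizeof(toy_mount_opts) / sizeof(toy_mount_opts[0]); i++) {
        const struct toy_mount_opt *m = &toy_mount_opts[i];
        if (strcmp(name, m->name) != 0)
            continue;
        if (m->action == TOY_SET)
            *opts |= m->flag;
        else
            *opts &= ~m->flag;
        return 0;
    }
    return -1;
}
```

Adding an option then means adding a table row, which is also what lets the same table drive the display side.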

03 Mar, 2012 2 commits

ext4: unify handling of mount options which have been removed · 72578c33

Theodore Ts'o authored Mar 03, 2012

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

72578c33

ext4: simplify handling of the errors=* mount options · 39ef17f1

Theodore Ts'o authored Mar 03, 2012

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

39ef17f1

02 Mar, 2012 3 commits

ext4: remove the I_VERSION mount flag and use the super_block flag instead · c64db50e

Theodore Ts'o authored Mar 02, 2012

There's no point to have two bits that are set in parallel; so use the
MS_I_VERSION flag that is needed by the VFS anyway, and that way we
free up a bit in sbi->s_mount_opts.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

c64db50e

ext4: remove Opt_ignore · ee4a3fcd

Theodore Ts'o authored Mar 02, 2012

This is completely unused so let's just get rid of it.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ee4a3fcd

ext4: remove deprecation warnings for minix_df and grpid · 87f26807

Theodore Ts'o authored Mar 02, 2012

People complained about removing both of these features, so per
Linus's dictate, we won't be able to remove them.  Sigh...
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

87f26807

27 Feb, 2012 1 commit

ext4: Fix endianness bug when reading the MMP block · 85d21650

Santosh Nayak authored Feb 27, 2012

Sparse complained about this endian bug in fs/ext4/mmp.c.
Signed-off-by: Santosh Nayak <santoshprasadnayak@gmail.com>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

85d21650
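
The general rule sparse enforces is that on-disk little-endian fields must be converted (in the kernel, with le32_to_cpu() and friends) before use. A portable userspace sketch of that conversion, with hypothetical function names, assembles the value byte by byte so it gives the same answer on any host:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace equivalent of le32_to_cpu() for a field read
 * straight from a disk buffer: byte 0 is the least significant byte. */
static uint32_t toy_le32_to_cpu(const unsigned char b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* The reverse direction, for writing the field back. */
static void toy_cpu_to_le32(uint32_t v, unsigned char b[4])
{
    b[0] = (unsigned char)(v & 0xff);
    b[1] = (unsigned char)((v >> 8) & 0xff);
    b[2] = (unsigned char)((v >> 16) & 0xff);
    b[3] = (unsigned char)((v >> 24) & 0xff);
}
```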

21 Feb, 2012 3 commits

ext4: format flag in dx_probe() · 9ee49302

Zheng Liu authored Feb 20, 2012

Fix ext4_warning format flag in dx_probe().

CC: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

9ee49302

ext4: avoid deadlock on sync-mounted FS w/o journal · c1bb05a6

Eric Sandeen authored Feb 20, 2012

Processes hang forever on a sync-mounted ext2 file system that
is mounted with the ext4 module (default in Fedora 16).

I can reproduce this reliably by mounting an ext2 partition with
"-o sync" and opening a new file on that partition with vim. vim
will hang in "D" state forever.  The same happens on ext4 without
a journal.

I am attaching a small patch here that solves this issue for me.
In the sync mounted case without a journal,
ext4_handle_dirty_metadata() may call sync_dirty_buffer(), which
can't be called with buffer lock held.

Also move mb_cache_entry_release inside the lock to avoid the race
fixed previously by 8a2bfdcb ("ext[34]: EA block reference count racing fix").
Note too that ext2 fixed this same problem in 2006 with
b2f49033 ("[PATCH] fix deadlock in ext2").

Signed-off-by: Martin.Wilck@ts.fujitsu.com
[sandeen@redhat.com: move mb_cache_entry_release before unlock, edit commit msg]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

c1bb05a6

ext4: fix resize when resizing within single group · a0ade1de

Lukas Czerner authored Feb 20, 2012

When resizing a file system such that the new size of the file
system is still in the same group (no new groups are added), we can
hit a BUG_ON in ext4_alloc_group_tables()

BUG_ON(flex_gd->count == 0 || group_data == NULL);

because flex_gd->count is zero. The reason is the missing check for this
case, so the code always extends the last group fully and then attempts to
add more groups, but at that time n_blocks_count is actually smaller
than o_blocks_count.

It can be easily reproduced like this:

mkfs.ext4 -b 4096 /dev/sda 30M
mount /dev/sda /mnt/test
resize2fs /dev/sda 50M

Fix this by checking whether the resize happens within a single group
and only adding that many blocks into the last group to satisfy the user's
request. Then o_blocks_count == n_blocks_count and the resize will exit
successfully without an attempt to add more groups into the fs.

Also fix the mixing together of block number and blocks count, which might be
confusing and can easily lead to off-by-one errors (though it is actually
not the case here, since the two occurrences of this mix-up cancel
each other out).
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Milan Broz <mbroz@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

a0ade1de
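
The within-one-group check can be written as group arithmetic on block numbers (last block number = count - 1). A sketch with hypothetical helper names, assuming the default of 32768 blocks per group for 4 KiB blocks and a first data block of 0; the reproducer's 30M and 50M sizes are then 7680 and 12800 blocks, both in group 0:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Group containing block number blk (a block number, not a count). */
static uint64_t toy_group_of(uint64_t blk, uint64_t first_data_block,
                             uint64_t blocks_per_group)
{
    return (blk - first_data_block) / blocks_per_group;
}

/* Hypothetical helper: true if growing from o_count to n_count blocks
 * keeps the last block in the same group; *add is then how many blocks
 * to append to that group, and no new groups are needed. */
static bool toy_resize_in_last_group(uint64_t o_count, uint64_t n_count,
                                     uint64_t first_data_block,
                                     uint64_t blocks_per_group, uint64_t *add)
{
    assert(n_count > o_count && o_count > first_data_block);
    /* Convert counts to last block numbers before mapping to groups. */
    uint64_t o_last = o_count - 1, n_last = n_count - 1;

    if (toy_group_of(o_last, first_data_block, blocks_per_group) !=
        toy_group_of(n_last, first_data_block, blocks_per_group))
        return false;
    *add = n_count - o_count;
    return true;
}
```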

20 Feb, 2012 14 commits

ext4: fix race between unwritten extent conversion and truncate · 266991b1

Jeff Moyer authored Feb 20, 2012

The following comment in ext4_end_io_dio caught my attention:

	/* XXX: probably should move into the real I/O completion handler */
	inode_dio_done(inode);

The truncate code takes i_mutex, then calls inode_dio_wait.  Because the
ext4 code path above will end up dropping the mutex before it is
reacquired by the worker thread that does the extent conversion, it
seems to me that the truncate can happen out of order.  Jan Kara
mentioned that this might result in error messages in the system logs,
but that should be the extent of the "damage."

The fix is pretty straightforward: don't call inode_dio_done until the
extent conversion is complete.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

266991b1

ext4: fix balloc.c printk-format-warning · d4dc462f

Heiko Carstens authored Feb 20, 2012

Get rid of this one:

fs/ext4/balloc.c: In function 'ext4_wait_block_bitmap':
fs/ext4/balloc.c:405:3: warning: format '%llu' expects argument of
  type 'long long unsigned int', but argument 6 has type 'sector_t' [-Wformat]

Happens because sector_t is u64 (unsigned long long) or unsigned long,
depending on CONFIG_64BIT.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

d4dc462f
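
The usual portable fix for this class of warning is to cast the value to the type the format expects. A sketch with a hypothetical typedef standing in for sector_t:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for sector_t: its underlying type can vary
 * with the configuration, so no single length modifier always fits. */
typedef unsigned long toy_sector_t;

/* Cast to unsigned long long so "%llu" matches on every configuration. */
static int toy_format_block(char *buf, size_t len, toy_sector_t blk)
{
    return snprintf(buf, len, "block %llu", (unsigned long long)blk);
}
```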

ext4: remove EXT4_MB_{BITMAP,BUDDY} macros · c5e8f3f3

Theodore Ts'o authored Feb 20, 2012

The EXT4_MB_BITMAP and EXT4_MB_BUDDY macros obfuscate more than they
abstract, so remove them.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

c5e8f3f3

ext4: using PTR_ERR() on the wrong variable in ext4_ext_migrate() · a0cc910f

Dan Carpenter authored Feb 20, 2012

"inode" is a valid pointer here.  "tmp_inode" was intended.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

a0cc910f
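
A userspace sketch of the ERR_PTR convention and of the fixed pattern. The helpers re-create the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() idea under hypothetical names (4095 matches the kernel's MAX_ERRNO), and toy_new_inode() is a made-up allocator:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Small negative errno values are encoded as pointers in the last page
 * of the address space. */
static inline void *toy_err_ptr(long err) { return (void *)err; }
static inline long toy_ptr_err(const void *p) { return (long)p; }
static inline int toy_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_inode { int ino; };

/* Hypothetical allocator that fails with -ENOSPC on request. */
static struct toy_inode *toy_new_inode(int fail)
{
    static struct toy_inode tmp = { 42 };
    return fail ? toy_err_ptr(-ENOSPC) : &tmp;
}

/* Test and decode the pointer that was just returned, not another
 * (valid) pointer that happens to be in scope. */
static long toy_migrate(struct toy_inode *inode, int fail)
{
    struct toy_inode *tmp_inode = toy_new_inode(fail);

    (void)inode;
    if (toy_is_err(tmp_inode))
        return toy_ptr_err(tmp_inode);    /* not toy_ptr_err(inode) */
    return 0;
}
```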

ext4: remove an unneeded NULL check in __ext4_check_dir_entry() · 4fda4003

Dan Carpenter authored Feb 20, 2012

We dereference "bh" unconditionally a couple lines down to find
"bh->b_size".  This function is never called with a NULL "bh" so I have
removed the check.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

4fda4003

ext4: remove unneeded variable in ext4_xattr_check_block() · f1b3a2a7

Zheng Liu authored Feb 20, 2012

We could return directly from ext4_xattr_check_block(). Thus, we
don't need to define an 'error' variable.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

f1b3a2a7

ext4: remove the resize mount option · 661aa520

Eric Sandeen authored Feb 20, 2012

The resize mount option seems to be of limited value,
especially in the age of online resize2fs.  Nuke it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

661aa520

ext4: remove the journal=update mount option · 43e625d8

Eric Sandeen authored Feb 20, 2012

The V2 journal format was introduced around ten years ago,
for ext3. It seems highly unlikely that anyone will need this
migration option for ext4.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

43e625d8

ext4: mark possibly unused variable in ext4_mb_normalize_request() · 1592d2c5

Curt Wohlgemuth authored Feb 20, 2012

The 'orig_size' local variable is only used in a call to
mb_debug().  Mark it with '__maybe_unused'.
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

1592d2c5
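
A sketch of the pattern, assuming GCC or Clang: the kernel's __maybe_unused expands to __attribute__((unused)), spelled out here under a hypothetical name, and toy_debug() is a made-up macro that, like mb_debug() when debugging is not configured, compiles to nothing.

```c
#include <assert.h>
#include <stdio.h>

#define toy_maybe_unused __attribute__((unused))

/* Hypothetical debug macro: compiled out unless TOY_DEBUG is defined. */
#ifdef TOY_DEBUG
#define toy_debug(fmt, ...) fprintf(stderr, fmt, __VA_ARGS__)
#else
#define toy_debug(fmt, ...) do { } while (0)
#endif

static long toy_normalize(long size)
{
    long orig_size toy_maybe_unused = size;  /* only read by toy_debug() */

    size = (size + 15) & ~15L;               /* round up to a multiple of 16 */
    toy_debug("normalized %ld -> %ld\n", orig_size, size);
    return size;
}
```

Without the attribute, a non-debug build would warn that orig_size is set but never used.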

jbd2: use KMEM_CACHE instead of kmem_cache_create() · 9c0e00e5

Yongqiang Yang authored Feb 20, 2012

Use the KMEM_CACHE helper macro instead of kmem_cache_create().
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

9c0e00e5

jbd2: rename functions which initialize slab caches · 4185a2ac

Yongqiang Yang authored Feb 20, 2012

This patch renames the functions initializing the slab caches for the
journal head and handle structures so they are consistent with the
names of the corresponding functions which destroy those slab caches.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

4185a2ac

jbd2: allocate transaction from separate slab cache · 0c2022ec

Yongqiang Yang authored Feb 20, 2012

There is normally only a handful of these active at any one time, but
putting them in a separate slab cache makes debugging memory
corruption problems easier.  Manish Katiyar also wanted this to make it
easier to test memory failure scenarios in the jbd2 layer.

Cc: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

0c2022ec

ext4: expand commit callback and · 18aadd47

Bobi Jam authored Feb 20, 2012

The per-commit callback was used by mballoc code to manage free space
bitmaps after deleted blocks have been released.  This patch expands
it to support multiple different callbacks, to allow other things to
be done after the commit has been completed.
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

18aadd47
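
A sketch of a multi-callback design, with hypothetical names (not ext4's actual structures): each entry carries its own function pointer, entries are chained on the transaction, and all of them run once the commit completes. Users embed an entry in their own structure and recover it with offsetof().

```c
#include <assert.h>
#include <stddef.h>

struct toy_cb_entry {
    struct toy_cb_entry *next;
    void (*func)(struct toy_cb_entry *entry, int error);
};

struct toy_transaction {
    struct toy_cb_entry *callbacks;     /* run after the commit completes */
};

static void toy_add_callback(struct toy_transaction *t, struct toy_cb_entry *e,
                             void (*func)(struct toy_cb_entry *, int))
{
    assert(func != NULL);
    e->func = func;
    e->next = t->callbacks;
    t->callbacks = e;
}

static void toy_run_callbacks(struct toy_transaction *t, int error)
{
    struct toy_cb_entry *e = t->callbacks;

    t->callbacks = NULL;
    while (e) {
        struct toy_cb_entry *next = e->next;  /* read before func may free e */
        e->func(e, error);
        e = next;
    }
}

/* Two unrelated users of the same mechanism. */
struct toy_freed_extent {
    struct toy_cb_entry cb;
    int released;
};

static void toy_release_blocks(struct toy_cb_entry *e, int error)
{
    struct toy_freed_extent *fe = (struct toy_freed_extent *)
        ((char *)e - offsetof(struct toy_freed_extent, cb));
    if (!error)
        fe->released = 1;               /* blocks may now be reused */
}

static int toy_commits_seen;

static void toy_count_commit(struct toy_cb_entry *e, int error)
{
    (void)e;
    if (!error)
        toy_commits_seen++;
}
```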

jbd2: clear BH_Delay & BH_Unwritten in journal_unmap_buffer · 15291164

Eric Sandeen authored Feb 20, 2012

journal_unmap_buffer()'s zap_buffer: code clears a lot of buffer head
state à la discard_buffer(), but does not touch _Delay or _Unwritten as
discard_buffer() does.

This can be problematic in some areas of the ext4 code which assume
that if they have found a buffer marked unwritten or delay, then it's
a live one.  Perhaps those spots should check whether it is mapped
as well, but if jbd2 is going to tear down a buffer, let's really
tear it down completely.

Without this I get some fsx failures on sub-page-block filesystems
up until v3.2, at which point 4e96b2db
and 189e868f make the failures go
away, because buried within that large change is some more flag
clearing.  I still think it's worth doing in jbd2, since
->invalidatepage leads here directly, and it's the right place
to clear away these flags.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

15291164
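
A sketch of the "tear it down completely" rule using hypothetical bit names and positions (the kernel's enum bh_state_bits differs, and the first mask is only a representative subset of what zap_buffer clears): the zap step clears the mapping state and, with the fix, Delay and Unwritten as well.

```c
#include <assert.h>

/* Hypothetical bit positions. */
enum toy_bh_state_bits {
    TOY_BH_Uptodate, TOY_BH_Dirty, TOY_BH_Mapped, TOY_BH_Req,
    TOY_BH_New, TOY_BH_Delay, TOY_BH_Unwritten,
};

#define TOY_BIT(nr) (1UL << (nr))

/* Returns the buffer state after teardown. */
static unsigned long toy_zap_buffer(unsigned long state)
{
    /* representative subset of state that was already being cleared */
    state &= ~(TOY_BIT(TOY_BH_Dirty) | TOY_BIT(TOY_BH_Mapped) |
               TOY_BIT(TOY_BH_Req) | TOY_BIT(TOY_BH_New));
    /* the fix: also drop Delay and Unwritten, as discard_buffer() does */
    state &= ~(TOY_BIT(TOY_BH_Delay) | TOY_BIT(TOY_BH_Unwritten));
    return state;
}
```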