1. 01 Jul, 2013 2 commits
    • ext4: implement error handling of ext4_mb_new_preallocation() · 2c00ef3e
      Alexey Khoroshilov authored
      If memory allocation in ext4_mb_new_group_pa() fails, it returns an
      error code and ext4_mb_new_preallocation() propagates it, but
      ext4_mb_new_blocks() ignores it.
      
      An observed result was:
      
      - the allocation failure means ext4_mb_new_group_pa() does not update
        ext4_allocation_context;

      - ext4_mb_new_blocks() sets ext4_allocation_request->len (ar->len =
        ac->ac_b_ex.fe_len;) to the number of blocks preallocated (512)
        instead of the number of blocks requested (1);

      - that triggers the update loop in ext4_splice_branch():
          for (i = 1; i < blks; i++) <-- blks is 512 instead of 1 here
            *(where->p + i) = cpu_to_le32(current_block++);
      
      - it iterates 511 times and corrupts a chunk of memory, including the
        inode structure;

      - a page fault happens at EXT4_SB(inode->i_sb) in ext4_mark_inode_dirty();

      - the system hangs with a 'scheduling while atomic' BUG.
      
      The patch adds a check of the ext4_mb_new_preallocation() error
      code and handles its failure as if ext4_mb_regular_allocator() had failed.
      
      Found by Linux File System Verification project (linuxtesting.org).
      
      [ Patch restructured by tytso to make the flow of control easier to follow. ]
      Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
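
The failure mode described in this commit can be sketched in user-space C. This is a toy model under stated assumptions, not the kernel code: `toy_ctx`, `toy_new_preallocation()`, and the two `toy_new_blocks_*()` variants are hypothetical names, and the values 512 and 1 are the ones quoted in the commit message.

```c
#include <errno.h>

/* Hypothetical stand-in for ext4_allocation_context. */
struct toy_ctx {
    int best_len;   /* length of the best extent found so far */
    int goal_len;   /* length the caller actually asked for */
};

/* Models a preallocation step that can fail before it trims
 * best_len down to the requested length. */
static int toy_new_preallocation(struct toy_ctx *ctx, int alloc_fails)
{
    if (alloc_fails)
        return -ENOMEM;          /* context is left untouched */
    ctx->best_len = ctx->goal_len;
    return 0;
}

/* Buggy caller: ignores the error and reports the stale length. */
static int toy_new_blocks_buggy(int requested, int alloc_fails, int *len)
{
    struct toy_ctx ctx = { .best_len = 512, .goal_len = requested };

    toy_new_preallocation(&ctx, alloc_fails);   /* return value dropped */
    *len = ctx.best_len;
    return 0;
}

/* Fixed caller: treats the failure like a failed allocation. */
static int toy_new_blocks_fixed(int requested, int alloc_fails, int *len)
{
    struct toy_ctx ctx = { .best_len = 512, .goal_len = requested };
    int err = toy_new_preallocation(&ctx, alloc_fails);

    if (err) {
        *len = 0;                /* nothing was allocated */
        return err;
    }
    *len = ctx.best_len;
    return 0;
}
```

With a simulated allocation failure, the buggy variant reports success with a length of 512 for a one-block request, while the fixed variant returns `-ENOMEM` and a length of 0.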
    • ext4: fix corruption when online resizing a fs with 1K block size · 6ca792ed
      Maarten ter Huurne authored
      Subtracting the number of the first data block places the superblock
      backups one block too early, corrupting the file system. When the block
      size is larger than 1K, the first data block is 0, so the subtraction
      has no effect and no corruption occurs.
      Signed-off-by: Maarten ter Huurne <maarten@treewalker.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      CC: stable@vger.kernel.org
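
The off-by-one described in this commit can be checked with a small arithmetic sketch. It assumes a simplified model in which a backup location is computed as `group * blocks_per_group + offset`; that formula, the `toy_*` function names, and the sample geometry (8192 blocks per group for 1K blocks, 32768 for 4K blocks) are illustrative assumptions, not taken from the commit message.

```c
#include <stdint.h>

/* Toy model: block where a group's superblock backup is written,
 * assuming it is computed as group * blocks_per_group + offset. */
static uint64_t toy_backup_block(uint64_t group, uint64_t blocks_per_group,
                                 uint64_t offset)
{
    return group * blocks_per_group + offset;
}

/* Buggy offset: primary superblock block minus the first data block. */
static uint64_t toy_offset_buggy(uint64_t sb_block, uint64_t first_data_block)
{
    return sb_block - first_data_block;
}

/* Fixed offset: the primary superblock block, with no subtraction. */
static uint64_t toy_offset_fixed(uint64_t sb_block)
{
    return sb_block;
}
```

With a 1K block size the primary superblock and the first data block are both block 1, so the buggy offset is 0 and the backup for group 1 lands on block 8192 instead of 8193. With a 4K block size both values are 0 and the two variants agree.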
  2. 17 Jun, 2013 1 commit
  3. 13 Jun, 2013 10 commits
    • ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents · 72dac95d
      Jie Liu authored
      Return the FIEMAP_EXTENT_UNKNOWN flag in addition to
      FIEMAP_EXTENT_DELALLOC, because the data location of a
      delayed allocation extent is unknown.
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
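
A minimal sketch of why the extra flag matters to a FIEMAP consumer. The two constants mirror the values in `<linux/fiemap.h>` but are redefined locally under hypothetical `TOY_` names so the sketch builds anywhere; the helper functions are illustrative and are not part of ext4.

```c
#include <stdint.h>

/* Local copies of two flag values from <linux/fiemap.h>. */
#define TOY_FIEMAP_EXTENT_UNKNOWN  0x00000002u
#define TOY_FIEMAP_EXTENT_DELALLOC 0x00000004u

/* Flags reported for a delayed-allocation extent before the change. */
static uint32_t toy_delalloc_flags_old(void)
{
    return TOY_FIEMAP_EXTENT_DELALLOC;
}

/* Flags reported after the change: DELALLOC plus UNKNOWN. */
static uint32_t toy_delalloc_flags_new(void)
{
    return TOY_FIEMAP_EXTENT_DELALLOC | TOY_FIEMAP_EXTENT_UNKNOWN;
}

/* A consumer that only checks UNKNOWN before trusting the
 * physical location of an extent. */
static int toy_physical_is_valid(uint32_t fe_flags)
{
    return (fe_flags & TOY_FIEMAP_EXTENT_UNKNOWN) == 0;
}
```

A consumer written this way would wrongly trust the physical offset of a delalloc extent under the old flags and correctly ignore it under the new ones.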
    • jbd2: remove debug dependency on debug_fs and update Kconfig help text · 75497d06
      Paul Gortmaker authored
      Commit b6e96d00 ("jbd2: use module parameters instead of debugfs
      for jbd_debug") removed any need for a dependency on DEBUG_FS.  It
      also moved the /sys variables out from underneath the typical debugfs
      mount point.  Delete the dependency and update the /sys path to where
      the debug settings are currently.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • jbd2: use a single printk for jbd_debug() · 169f1a2a
      Paul Gortmaker authored
      Since jbd_debug() is implemented with two separate printk()
      calls, it can lead to corrupted and misleading debug output like
      the following (see lines marked with "*"):
      
      [  290.339362] (fs/jbd2/journal.c, 203): kjournald2: kjournald2 wakes
      [  290.339365] (fs/jbd2/journal.c, 155): kjournald2: commit_sequence=42103, commit_request=42104
      [  290.339369] (fs/jbd2/journal.c, 158): kjournald2: OK, requests differ
      [* 290.339376] (fs/jbd2/journal.c, 648): jbd2_log_wait_commit:
      [* 290.339379] (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: want 42104, j_commit_sequence=42103
      [* 290.339382] JBD2: starting commit of transaction 42104
      [  290.339410] (fs/jbd2/revoke.c, 566): jbd2_journal_write_revoke_records: Wrote 0 revoke records
      [  290.376555] (fs/jbd2/commit.c, 1088): jbd2_journal_commit_transaction: JBD2: commit 42104 complete, head 42079
      
      i.e. the debug output from log_wait_commit and journal_commit_transaction
      have become interleaved.  The output should have been:
      
      (fs/jbd2/journal.c, 648): jbd2_log_wait_commit: JBD2: want 42104, j_commit_sequence=42103
      (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: starting commit of transaction 42104
      
      It is expected that this is not easy to replicate -- I was only able
      to cause it on preempt-rt kernels, and even then only under heavy
      I/O load.
      Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Suggested-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
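
The idea behind this commit, building the location prefix and the message into one buffer so that a single output call emits the whole line, can be sketched in user-space C. `toy_format_debug()` is a hypothetical helper that uses `snprintf()`/`vsnprintf()`; it is not the kernel's jbd_debug() implementation.

```c
#include <stdarg.h>
#include <stdio.h>

/* Format "(file, line): func: message" into buf in one piece.
 * Returns the length written, or -1 on error or truncation. */
static int toy_format_debug(char *buf, size_t size, const char *file,
                            int line, const char *func,
                            const char *fmt, ...)
{
    va_list args;
    int prefix, body;

    prefix = snprintf(buf, size, "(%s, %d): %s: ", file, line, func);
    if (prefix < 0 || (size_t)prefix >= size)
        return -1;

    /* Append the caller's message directly after the prefix. */
    va_start(args, fmt);
    body = vsnprintf(buf + prefix, size - (size_t)prefix, fmt, args);
    va_end(args);
    if (body < 0 || (size_t)body >= size - (size_t)prefix)
        return -1;

    return prefix + body;
}
```

The caller would then pass the finished buffer to one output call, so another thread's output cannot land between the prefix and the message.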
    • jbd/jbd2: relocate bit_spinlock header to jbd_common · c9b3a8cc
      Paul Gortmaker authored
      The bit_spinlock functions are only used for the jbd_lock_bh_state
      functions (and friends) in jbd_common.h and are not directly used
      by either of jbd.h or jbd2.h content.
      
      The jbd_common file is new as of commit 44606672 ("jdb/jbd2: factor
      out common functions from the jbd[2] header files") but common
      (and isolated) headers were not considered for factoring at that time.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • jbd2: fix duplicate debug label for phase 2 · cfc7bc89
      Paul Gortmaker authored
      Currently we see this output:
      
        $ git grep phase fs/jbd2
        fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 1\n");
        fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 2\n");
        fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 2\n");
        fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 3\n");
        fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 4\n");
        [...]
      
      There is clearly a duplicate label for phase 2, and they are
      both active (i.e. not in #if ... #else block).  Rename them to
      be "2a" and "2b" so the debug output is unambiguous.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • jbd2: drop checkpoint mutex when waiting in __jbd2_log_wait_for_space() · 0ef54180
      Paul Gortmaker authored
      While trying to debug an issue under extreme I/O loading
      on preempt-rt kernels, the following backtrace was observed
      via SysRQ output:
      
      rm              D ffff8802203afbc0  4600  4878   4748 0x00000000
       ffff8802217bfb78 0000000000000082 ffff88021fc2bb80 ffff88021fc2bb80
       ffff88021fc2bb80 ffff8802217bffd8 ffff8802217bffd8 ffff8802217bffd8
       ffff88021f1d4c80 ffff88021fc2bb80 ffff8802217bfb88 ffff88022437b000
      Call Trace:
       [<ffffffff8172dc34>] schedule+0x24/0x70
       [<ffffffff81225b5d>] jbd2_log_wait_commit+0xbd/0x140
       [<ffffffff81060390>] ? __init_waitqueue_head+0x50/0x50
       [<ffffffff81223635>] jbd2_log_do_checkpoint+0xf5/0x520
       [<ffffffff81223b09>] __jbd2_log_wait_for_space+0xa9/0x1f0
       [<ffffffff8121dc40>] start_this_handle.isra.10+0x2e0/0x530
       [<ffffffff81060390>] ? __init_waitqueue_head+0x50/0x50
       [<ffffffff8121e0a3>] jbd2__journal_start+0xc3/0x110
       [<ffffffff811de7ce>] ? ext4_rmdir+0x6e/0x230
       [<ffffffff8121e0fe>] jbd2_journal_start+0xe/0x10
       [<ffffffff811f308b>] ext4_journal_start_sb+0x5b/0x160
       [<ffffffff811de7ce>] ext4_rmdir+0x6e/0x230
       [<ffffffff811435c5>] vfs_rmdir+0xd5/0x140
       [<ffffffff8114370f>] do_rmdir+0xdf/0x120
       [<ffffffff8105c6b4>] ? task_work_run+0x44/0x80
       [<ffffffff81002889>] ? do_notify_resume+0x89/0x100
       [<ffffffff817361ae>] ? int_signal+0x12/0x17
       [<ffffffff81145d85>] sys_unlinkat+0x25/0x40
       [<ffffffff81735f22>] system_call_fastpath+0x16/0x1b
      
      What is interesting here is that we call log_wait_commit from
      within wait_for_space while still holding the checkpoint_mutex,
      which surrounds most of wait_for_space.  And then, as we
      are waiting, journal_commit_transaction can run, and if the JBD2_FLUSHED
      bit is set, it will also try to take the same checkpoint_mutex.

      It seems that we need to drop the checkpoint_mutex while sitting in
      jbd2_log_wait_commit if we want to guarantee that progress can be made
      by jbd2_journal_commit_transaction().  There does not seem to be
      anything preempt-rt specific about this, other than perhaps increasing
      the odds of it happening.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
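
The locking pattern in this commit can be illustrated with a single-threaded POSIX sketch. All `toy_*` names are hypothetical. The commit path is modelled by a function that uses `pthread_mutex_trylock()`, so a held mutex shows up as `EBUSY` instead of an actual hang. This is an analogy under those assumptions, not the jbd2 code.

```c
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t toy_checkpoint_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the commit path, which needs the checkpoint mutex to make
 * progress. Returns 0 if it could take the mutex, EBUSY otherwise. */
static int toy_commit_needs_mutex(void)
{
    int rc = pthread_mutex_trylock(&toy_checkpoint_mutex);

    if (rc == 0)
        pthread_mutex_unlock(&toy_checkpoint_mutex);
    return rc;
}

/* Waits for the commit while still holding the mutex: the commit path
 * cannot take it, so no forward progress is possible. */
static int toy_wait_for_space_buggy(void)
{
    int rc;

    pthread_mutex_lock(&toy_checkpoint_mutex);
    rc = toy_commit_needs_mutex();
    pthread_mutex_unlock(&toy_checkpoint_mutex);
    return rc;
}

/* Drops the mutex around the wait and retakes it afterwards. */
static int toy_wait_for_space_fixed(void)
{
    int rc;

    pthread_mutex_lock(&toy_checkpoint_mutex);
    /* ... work that needs the mutex ... */
    pthread_mutex_unlock(&toy_checkpoint_mutex);

    rc = toy_commit_needs_mutex();      /* commit can now proceed */

    pthread_mutex_lock(&toy_checkpoint_mutex);
    /* ... re-check state, which may have changed while unlocked ... */
    pthread_mutex_unlock(&toy_checkpoint_mutex);
    return rc;
}
```

The buggy variant returns `EBUSY` from the modelled commit path, and the fixed variant returns 0. Dropping a lock in the middle of a function means any state read before the drop has to be re-validated after the lock is retaken.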
    • jbd2: relocate assert after state lock in journal_commit_transaction() · 3ca841c1
      Paul Gortmaker authored
      The state lock is taken after we are doing an assert on the state
      value, not before.  So we might in fact be doing an assert on a
      transient value.  Ensure the state check is within the scope of
      the state lock being taken.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Fix fsync error handling after filesystem abort · 4418e141
      Dmitry Monakhov authored
      If the filesystem was aborted after an inode's writeback completed
      but before its metadata was updated, we may return success, which
      results in data loss.
      In order to handle an fs abort correctly, we have to check the
      fs state once we discover that it is in the MS_RDONLY state.

      Test case: http://patchwork.ozlabs.org/patch/244297
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix data integrity for ext4_sync_fs · 06a407f1
      Dmitry Monakhov authored
      An inode's data or non-journaled quota may be written without the
      journal, so we _must_ send a barrier at the end of ext4_sync_fs. But it
      can be skipped if the journal commit will do it for us.

      Also fix data integrity for nojournal mode.
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • jbd2: optimize jbd2_journal_force_commit · 9ff86446
      Dmitry Monakhov authored
      The current implementation of jbd2_journal_force_commit() is suboptimal
      because it results in empty and useless commits. But callers just want to
      force and wait for any unfinished commits. We already have
      jbd2_journal_force_commit_nested(), which does exactly what we want, except
      that here we are guaranteed not to hold a journal transaction open.
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  4. 12 Jun, 2013 2 commits
    • ext4: don't use EXT4_FREE_BLOCKS_FORGET unnecessarily · 981250ca
      Theodore Ts'o authored
      Commit 18888cf0: "ext4: speed up truncate/unlink by not using
      bforget() unless needed" removed the use of EXT4_FREE_BLOCKS_FORGET in
      the most important codepath for file systems using extents, but a
      similar optimization also can be done for file systems using indirect
      blocks, and for the two special cases in the ext4 extents code.
      
      Cc: Andrey Sidorov <qrxd43@motorola.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: add cond_resched() to ext4_free_blocks() & ext4_mb_regular_allocator() · 2ed5724d
      Theodore Ts'o authored
      For a file system with a very large number of block groups, if all of
      the block group bitmaps are in memory and the file system is
      relatively badly fragmented, it's possible for ext4_mb_regular_allocator()
      to take a long time trying to find a good match.  This is especially
      true if the tuning parameter mb_max_to_scan has been set to a very
      large number.  So add a cond_resched() to avoid soft lockup warnings
      and to provide better system responsiveness.

      For ext4_free_blocks(), if we are deleting a large range of blocks,
      and data=journal is enabled so that EXT4_FREE_BLOCKS_FORGET is passed,
      the loop to call sb_find_get_block() and to call ext4_forget() can
      take 10-15 milliseconds or more.  So it's better to add a
      cond_resched() here as well.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
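
A user-space analogue of the pattern in this commit: a long-running loop that periodically offers the CPU to other runnable threads. `sched_yield()` stands in for the kernel's cond_resched() here, and the two are not equivalent. The `toy_scan_with_yield()` helper and its `interval` parameter are hypothetical and do not reflect how the ext4 loops are structured.

```c
#include <sched.h>
#include <stddef.h>

/* Sum a long array, yielding the CPU every `interval` items.
 * Returns the number of times the loop yielded. */
static size_t toy_scan_with_yield(const int *items, size_t count,
                                  size_t interval, long *sum)
{
    size_t yields = 0;

    *sum = 0;
    for (size_t i = 0; i < count; i++) {
        *sum += items[i];

        /* Offer the CPU after every `interval` processed items. */
        if (interval != 0 && (i + 1) % interval == 0) {
            sched_yield();
            yields++;
        }
    }
    return yields;
}
```

Yielding does not change the result of the scan. It only bounds how long the loop runs before other work gets a chance to be scheduled.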
  5. 06 Jun, 2013 5 commits
  6. 04 Jun, 2013 20 commits