An error occurred fetching the project authors.
  1. 09 Apr, 2013 1 commit
    • Eric Whitney's avatar
      ext4: fix free space estimate in ext4_nonda_switch() · 5c1ff336
      Eric Whitney authored
      Values stored in s_freeclusters_counter and s_dirtyclusters_counter
      are both in cluster units.  Remove the cluster to block conversion
      applied to s_freeclusters_counter causing an inflated estimate of
      free space because s_dirtyclusters_counter is not similarly
      converted.  Rename free_blocks and dirty_blocks to better reflect
      the units these variables contain to avoid future confusion.  This
      fix corrects ENOSPC failures for xfstests 127 and 231 on bigalloc
      file systems.
      Signed-off-by: default avatarEric Whitney <enwlinux@gmail.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      5c1ff336
  2. 08 Apr, 2013 1 commit
    • Dr. Tilmann Bubeck's avatar
      ext4: implementation of a new ioctl called EXT4_IOC_SWAP_BOOT · 393d1d1d
      Dr. Tilmann Bubeck authored
      Add a new ioctl, EXT4_IOC_SWAP_BOOT which swaps i_blocks and
      associated attributes (like i_blocks, i_size, i_flags, ...) from the
      specified inode with inode EXT4_BOOT_LOADER_INO (#5). This is
      typically used to store a boot loader in a secure part of the
      filesystem, where it can't be changed by a normal user by accident.
      The data blocks of the previous boot loader will be associated with
      the given inode.
      
      This usercode program is a simple example of the usage:
      
      int main(int argc, char *argv[])
      {
        int fd;
        int err;
      
        if ( argc != 2 ) {
          printf("usage: ext4-swap-boot-inode FILE-TO-SWAP\n");
          exit(1);
        }
      
        fd = open(argv[1], O_WRONLY);
        if ( fd < 0 ) {
          perror("open");
          exit(1);
        }
      
        err = ioctl(fd, EXT4_IOC_SWAP_BOOT);
        if ( err < 0 ) {
          perror("ioctl");
          exit(1);
        }
      
        close(fd);
        exit(0);
      }
      
      [ Modified by Theodore Ts'o to fix a number of bugs in the original code.]
      Signed-off-by: default avatarDr. Tilmann Bubeck <t.bubeck@reinform.de>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      393d1d1d
  3. 04 Apr, 2013 3 commits
    • Lukas Czerner's avatar
      ext4: print more info in ext4_print_free_blocks() · f78ee70d
      Lukas Czerner authored
      Additionally print i_allocated_meta_blocks information as well.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      f78ee70d
    • Theodore Ts'o's avatar
      ext4/jbd2: don't wait (forever) for stale tid caused by wraparound · d76a3a77
      Theodore Ts'o authored
      In the case where an inode has a very stale transaction id (tid) in
      i_datasync_tid or i_sync_tid, it's possible that after a very large
      (2**31) number of transactions, that the tid number space might wrap,
      causing tid_geq()'s calculations to fail.
      
      Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
      by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
      attempted to fix this problem, but it only avoided kjournald spinning
      forever by fixing the logic in jbd2_log_start_commit().
      
      Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
      that might call jbd2_log_start_commit() with a stale tid, those
      functions will subsequently call jbd2_log_wait_commit() with the same
      stale tid, and then wait for a very long time.  To fix this, we
      replace the calls to jbd2_log_start_commit() and
      jbd2_log_wait_commit() with a call to a new function,
      jbd2_complete_transaction(), which will correctly handle stale tid's.
      
      As a bonus, jbd2_complete_transaction() will avoid locking
      j_state_lock for writing unless a commit needs to be started.  This
      should have a small (but probably not measurable) improvement for
      ext4's scalability.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reported-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Reported-by: default avatarGeorge Barnett <gbarnett@atlassian.com>
      Cc: stable@vger.kernel.org
      
      d76a3a77
    • Theodore Ts'o's avatar
      ext4: add mutex_is_locked() assertion to ext4_truncate() · 19b5ef61
      Theodore Ts'o authored
      [ Added fixup from Lukáš Czerner which only checks the assertion when
        the inode is not new and is not being freed. ]
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      19b5ef61
  4. 03 Apr, 2013 4 commits
  5. 20 Mar, 2013 2 commits
    • Theodore Ts'o's avatar
      ext4: fix data=journal fast mount/umount hang · 2b405bfa
      Theodore Ts'o authored
      In data=journal mode, if we unmount the file system before a
      transaction has a chance to complete, when the journal inode is being
      evicted, we can end up calling into jbd2_log_wait_commit() for the
      last transaction, after the journalling machinery has been shut down.
      
      Arguably we should adjust ext4_should_journal_data() to return FALSE
      for the journal inode, but the only place it matters is
      ext4_evict_inode(), and so to save a bit of CPU time, and to make the
      patch much more obviously correct by inspection(tm), we'll fix it by
      explicitly not trying to waiting for a journal commit when we are
      evicting the journal inode, since it's guaranteed to never succeed in
      this case.
      
      This can be easily replicated via: 
      
           mount -t ext4 -o data=journal /dev/vdb /vdb ; umount /vdb
      
      ------------[ cut here ]------------
      WARNING: at /usr/projects/linux/ext4/fs/jbd2/journal.c:542 __jbd2_log_start_commit+0xba/0xcd()
      Hardware name: Bochs
      JBD2: bad log_start_commit: 3005630206 3005630206 0 0
      Modules linked in:
      Pid: 2909, comm: umount Not tainted 3.8.0-rc3 #1020
      Call Trace:
       [<c015c0ef>] warn_slowpath_common+0x68/0x7d
       [<c02b7e7d>] ? __jbd2_log_start_commit+0xba/0xcd
       [<c015c177>] warn_slowpath_fmt+0x2b/0x2f
       [<c02b7e7d>] __jbd2_log_start_commit+0xba/0xcd
       [<c02b8075>] jbd2_log_start_commit+0x24/0x34
       [<c0279ed5>] ext4_evict_inode+0x71/0x2e3
       [<c021f0ec>] evict+0x94/0x135
       [<c021f9aa>] iput+0x10a/0x110
       [<c02b7836>] jbd2_journal_destroy+0x190/0x1ce
       [<c0175284>] ? bit_waitqueue+0x50/0x50
       [<c028d23f>] ext4_put_super+0x52/0x294
       [<c020efe3>] generic_shutdown_super+0x48/0xb4
       [<c020f071>] kill_block_super+0x22/0x60
       [<c020f3e0>] deactivate_locked_super+0x22/0x49
       [<c020f5d6>] deactivate_super+0x30/0x33
       [<c0222795>] mntput_no_expire+0x107/0x10c
       [<c02233a7>] sys_umount+0x2cf/0x2e0
       [<c02233ca>] sys_oldumount+0x12/0x14
       [<c08096b8>] syscall_call+0x7/0xb
      ---[ end trace 6a954cc790501c1f ]---
      jbd2_log_wait_commit: error: j_commit_request=-1289337090, tid=0
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      2b405bfa
    • Theodore Ts'o's avatar
      ext4: fix ext4_evict_inode() racing against workqueue processing code · 1ada47d9
      Theodore Ts'o authored
      Commit 84c17543 (ext4: move work from io_end to inode) triggered a
      regression when running xfstest #270 when the file system is mounted
      with dioread_nolock.
      
      The problem is that after ext4_evict_inode() calls ext4_ioend_wait(),
      this guarantees that last io_end structure has been freed, but it does
      not guarantee that the workqueue structure, which was moved into the
      inode by commit 84c17543, is actually finished.  Once
      ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this
      will allow ext4_ioend_wait() to return on CPU #2, at which point the
      evict_inode() codepath can race against the workqueue code on CPU #1
      accessing EXT4_I(inode)->i_unwritten_work to find the next item of
      work to do.
      
      Fix this by calling cancel_work_sync() in ext4_ioend_wait(), which
      will be renamed ext4_ioend_shutdown(), since it is only used by
      ext4_evict_inode().  Also, move the call to ext4_ioend_shutdown()
      until after truncate_inode_pages() and filemap_write_and_wait() are
      called, to make sure all dirty pages have been written back and
      flushed from the page cache first.
      
      BUG: unable to handle kernel NULL pointer dereference at   (null)
      IP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e
      *pdpt = 0000000030bc3001 *pde = 0000000000000000 
      Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      Modules linked in:
      Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c17543-dirty #91 Bochs Bochs
      EIP: 0060:[<c01dda6a>] EFLAGS: 00010046 CPU: 0
      EIP is at cwq_activate_delayed_work+0x3b/0x7e
      EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000
      ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84
       DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
      CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0
      DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
      DR6: ffff0ff0 DR7: 00000400
      Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000)
      Stack:
       f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d
       f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0
       f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000
      Call Trace:
       [<c01de3d7>] cwq_dec_nr_in_flight+0x71/0xfb
       [<c01def1d>] process_one_work+0x5d8/0x637
       [<c040a10b>] ? ext4_end_bio+0x300/0x300
       [<c01e3105>] worker_thread+0x249/0x3ef
       [<c01ea317>] kthread+0xd8/0xeb
       [<c01e2ebc>] ? manage_workers+0x4bb/0x4bb
       [<c023a370>] ? trace_hardirqs_on+0x27/0x37
       [<c0f1b4b7>] ret_from_kernel_thread+0x1b/0x28
       [<c01ea23f>] ? __init_kthread_worker+0x71/0x71
      Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff <8b> 13 89 f0 83 05 b8 ff 6c c1
       6c c1 00 31 c9 83
      EIP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84
      CR2: 0000000000000000
      ---[ end trace a1923229da53d8a4 ]---
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
      1ada47d9
  6. 11 Mar, 2013 5 commits
  7. 23 Feb, 2013 1 commit
  8. 22 Feb, 2013 1 commit
    • Darrick J. Wong's avatar
      mm: only enforce stable page writes if the backing device requires it · 1d1d1a76
      Darrick J. Wong authored
      Create a helper function to check if a backing device requires stable
      page writes and, if so, performs the necessary wait.  Then, make it so
      that all points in the memory manager that handle making pages writable
      use the helper function.  This should provide stable page write support
      to most filesystems, while eliminating unnecessary waiting for devices
      that don't require the feature.
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own stable page guarantees or they don't block at all.
      The blocking behavior is back to what it was before 3.0 if you don't
      have a disk requiring stable page writes.
      
      Here's the result of using dbench to test latency on ext2:
      
      3.8.0-rc3:
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       WriteX        109347     0.028    59.817
       ReadX         347180     0.004     3.391
       Flush          15514    29.828   287.283
      
      Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
       WriteX        105556     0.029     4.273
       ReadX         335004     0.005     4.112
       Flush          14982    30.540   298.634
      
      Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, the maximum write latency drops considerably with this
      patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
      similarly, but see the cover letter for those results.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Acked-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1d1d1a76
  9. 18 Feb, 2013 4 commits
    • Zheng Liu's avatar
      ext4: lookup block mapping in extent status tree · d100eef2
      Zheng Liu authored
      After tracking all extent status, we already have a extent cache in
      memory.  Every time we want to lookup a block mapping, we can first
      try to lookup it in extent status tree to avoid a potential disk I/O.
      
      A new function called ext4_es_lookup_extent is defined to finish this
      work.  When we try to lookup a block mapping, we always call
      ext4_map_blocks and/or ext4_da_map_blocks.  So in these functions we
      first try to lookup a block mapping in extent status tree.
      
      A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
      in order not to put a hole into extent status tree because this hole
      will be converted to delayed extent in the tree immediately.
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      d100eef2
    • Zheng Liu's avatar
      ext4: track all extent status in extent status tree · f7fec032
      Zheng Liu authored
      By recording the phycisal block and status, extent status tree is able
      to track the status of every extents.  When we call _map_blocks
      functions to lookup an extent or create a new written/unwritten/delayed
      extent, this extent will be inserted into extent status tree.
      
      We don't load all extents from disk in alloc_inode() because it costs
      too much memory, and if a file is opened and closed frequently it will
      takes too much time to load all extent information.  So currently when
      we create/lookup an extent, this extent will be inserted into extent
      status tree.  Hence, the extent status tree may not comprehensively
      contain all of the extents found in the file.
      
      Here a condition we need to take care is that an extent might contains
      unwritten and delayed status simultaneously because an extent is delayed
      allocated and could be allocated by fallocate.  At this time we need to
      keep delayed status because later we need to update delayed reservation
      space using it.
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      f7fec032
    • Zheng Liu's avatar
      ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag · a25a4e1a
      Zheng Liu authored
      This commit lets ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
      because in later commit ext4_map_blocks needs to use this flag to
      determine the extent status.
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      a25a4e1a
    • Zheng Liu's avatar
      ext4: add physical block and status member into extent status tree · fdc0212e
      Zheng Liu authored
      This commit adds two members in extent_status structure to let it record
      physical block and extent status.  Here es_pblk is used to record both
      of them because physical block only has 48 bits.  So extent status could
      be stashed into it so that we can save some memory.  Now written,
      unwritten, delayed and hole are defined as status.
      
      Due to new member is added into extent status tree, all interfaces need
      to be adjusted.
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      fdc0212e
  10. 15 Feb, 2013 1 commit
  11. 14 Feb, 2013 2 commits
  12. 09 Feb, 2013 2 commits
    • Theodore Ts'o's avatar
      ext4: grab page before starting transaction handle in write_begin() · 47564bfb
      Theodore Ts'o authored
      The grab_cache_page_write_begin() function can potentially sleep for a
      long time, since it may need to do memory allocation which can block
      if the system is under significant memory pressure, and because it may
      be blocked on page writeback.  If it does take a long time to grab the
      page, it's better that we not hold an active jbd2 handle.
      
      So grab a handle on the page first, and _then_ start the transaction
      handle.
      
      This commit fixes the following long transaction handle hold time:
      
      postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32
         tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
         dirtied_blocks 0
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      47564bfb
    • Theodore Ts'o's avatar
      ext4: pass context information to jbd2__journal_start() · 9924a92a
      Theodore Ts'o authored
      So we can better understand what bits of ext4 are responsible for
      long-running jbd2 handles, use jbd2__journal_start() so we can pass
      context information for logging purposes.
      
      The recommended way for finding the longer-running handles is:
      
         T=/sys/kernel/debug/tracing
         EVENT=$T/events/jbd2/jbd2_handle_stats
         echo "interval > 5" > $EVENT/filter
         echo 1 > $EVENT/enable
      
         ./run-my-fs-benchmark
      
         cat $T/trace > /tmp/problem-handles
      
      This will list handles that were active for longer than 20ms.  Having
      longer-running handles is bad, because a commit started at the wrong
      time could stall for those 20+ milliseconds, which could delay an
      fsync() or an O_SYNC operation.  Here is an example line from the
      trace file describing a handle which lived on for 311 jiffies, or over
      1.2 seconds:
      
      postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
         tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
         dirtied_blocks 0
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      9924a92a
  13. 30 Jan, 2013 1 commit
  14. 29 Jan, 2013 1 commit
    • Jan Kara's avatar
      ext4: fix ext4_writepage() to achieve data=ordered guarantees · fe386132
      Jan Kara authored
      So far ext4_writepage() skipped writing pages that had any delayed or
      unwritten buffers attached. When blocksize < pagesize this breaks
      data=ordered mode guarantees as we can have a page with one freshly
      allocated buffer whose allocation is part of the committing
      transaction and another buffer in the page which is delayed or
      unwritten. So fix this problem by calling ext4_bio_writepage()
      anyway. It will submit mapped buffers and leave others alone.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      fe386132
  15. 28 Jan, 2013 5 commits
  16. 17 Jan, 2013 1 commit
  17. 12 Jan, 2013 3 commits
  18. 25 Dec, 2012 2 commits
    • Jan Kara's avatar
      ext4: fix deadlock in journal_unmap_buffer() · 53e87268
      Jan Kara authored
      We cannot wait for transaction commit in journal_unmap_buffer()
      because we hold page lock which ranks below transaction start.  We
      solve the issue by bailing out of journal_unmap_buffer() and
      jbd2_journal_invalidatepage() with -EBUSY.  Caller is then responsible
      for waiting for transaction commit to finish and try invalidation
      again. Since the issue can happen only for page stradding i_size, it
      is simple enough to manually call jbd2_journal_invalidatepage() for
      such page from ext4_setattr(), check the return value and wait if
      necessary.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      53e87268
    • Jan Kara's avatar
      ext4: split off ext4_journalled_invalidatepage() · 4520fb3c
      Jan Kara authored
      In data=journal mode we don't need delalloc or DIO handling in invalidatepage
      and similarly in other modes we don't need the journal handling. So split
      invalidatepage implementations.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      4520fb3c