1. 20 Mar, 2012 3 commits
    • Lukas Czerner's avatar
      ext4: give more helpful error message in ext4_ext_rm_leaf() · dc1841d6
      Lukas Czerner authored
      The error message produced by the ext4_ext_rm_leaf() when we are
      removing blocks which accidentally ends up inside the existing extent,
      is not very helpful, because we would like to also know which extent did
      we collide with.
      
      This commit changes the error message to get us also the information
      about the extent we are colliding with.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      dc1841d6
    • Lukas Czerner's avatar
      ext4: remove unused code from ext4_ext_map_blocks() · 7877191c
      Lukas Czerner authored
      Since the commit 'Rewrite punch hole to use ext4_ext_remove_space()'
      reworked the punch hole implementation to use ext4_ext_remove_space()
      instead of ext4_ext_map_blocks(), we can remove the code which is no
      longer needed from the ext4_ext_map_blocks().
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      7877191c
    • Lukas Czerner's avatar
      ext4: rewrite punch hole to use ext4_ext_remove_space() · 5f95d21f
      Lukas Czerner authored
      This commit rewrites ext4 punch hole implementation to use
      ext4_ext_remove_space() instead of its home gown way of doing this via
      ext4_ext_map_blocks(). There are several reasons for changing this.
      
      Firstly it is quite non obvious that punching hole needs to
      ext4_ext_map_blocks() to punch a hole, especially given that this
      function should map blocks, not unmap it. It also required a lot of new
      code in ext4_ext_map_blocks().
      
      Secondly the design of it is not very effective. The reason is that we
      are trying to punch out blocks in ext4_ext_punch_hole() in opposite
      direction than in ext4_ext_rm_leaf() which causes the ext4_ext_rm_leaf()
      to iterate through the whole tree from the end to the start to find the
      requested extent for every extent we are going to punch out.
      
      And finally the current implementation does not use the existing code,
      but bring a lot of new code, which is IMO unnecessary since there
      already is some infrastructure we can use. Specifically
      ext4_ext_remove_space().
      
      This commit changes ext4_ext_remove_space() to accept 'end' parameter so
      we can not only truncate to the end of file, but also remove the space
      in the middle of the file (punch a hole). Moreover, because the last
      block to punch out, might be in the middle of the extent, we have to
      split the extent at 'end + 1' so ext4_ext_rm_leaf() can easily either
      remove the whole fist part of split extent, or change its size.
      
      ext4_ext_remove_space() is then used to actually remove the space
      (extents) from within the hole, instead of ext4_ext_map_blocks().
      
      Note that this also fix the issue with punch hole, where we would forget
      to remove empty index blocks from the extent tree, resulting in double
      free block error and file system corruption. This is simply because we
      now use different code path, where this problem does not exist.
      
      This has been tested with fsx running for several days and xfstests,
      plus xfstest #251 with '-o discard' run on the loop image (which
      converts discard requestes into punch hole to the backing file). All of
      it on 1K and 4K file system block size.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      5f95d21f
  2. 14 Mar, 2012 6 commits
    • Jan Kara's avatar
      jbd2: cleanup journal tail after transaction commit · 3339578f
      Jan Kara authored
      Normally, we have to issue a cache flush before we can update journal tail in
      journal superblock, effectively wiping out old transactions from the journal.
      So use the fact that during transaction commit we issue cache flush anyway and
      opportunistically push journal tail as far as we can. Since update of journal
      superblock is still costly (we have to use WRITE_FUA), we update log tail only
      if we can free significant amount of space.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      3339578f
    • Jan Kara's avatar
      jbd2: remove bh_state lock from checkpointing code · 932bb305
      Jan Kara authored
      All accesses to checkpointing entries in journal_head are protected
      by j_list_lock. Thus __jbd2_journal_remove_checkpoint() doesn't really
      need bh_state lock.
      
      Also the only part of journal head that the rest of checkpointing code
      needs to check is jh->b_transaction which is safe to read under
      j_list_lock.
      
      So we can safely remove bh_state lock from all of checkpointing code which
      makes it considerably prettier.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      932bb305
    • Jan Kara's avatar
      jbd2: remove always true condition in __journal_try_to_free_buffer() · c254c9ec
      Jan Kara authored
      The check b_jlist == BJ_None in __journal_try_to_free_buffer() is
      always true (__jbd2_journal_temp_unlink_buffer() also checks this in
      an assertion) so just remove it.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      c254c9ec
    • Jan Kara's avatar
      jbd2: declare __jbd2_journal_temp_unlink_buffer() static · 5bebccf9
      Jan Kara authored
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      5bebccf9
    • Jan Kara's avatar
      jbd2: fix BH_JWrite setting in checkpointing code · 96c86678
      Jan Kara authored
      BH_JWrite bit should be set when buffer is written to the journal. So
      checkpointing shouldn't set this bit when writing out buffer. This didn't
      cause any observable bug since BH_JWrite bit is used only for debugging
      purposes but it's good to have this consistent.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      96c86678
    • Jan Kara's avatar
      jbd2: issue cache flush after checkpointing even with internal journal · 79feb521
      Jan Kara authored
      When we reach jbd2_cleanup_journal_tail(), there is no guarantee that
      checkpointed buffers are on a stable storage - especially if buffers were
      written out by jbd2_log_do_checkpoint(), they are likely to be only in disk's
      caches. Thus when we update journal superblock effectively removing old
      transaction from journal, this write of superblock can get to stable storage
      before those checkpointed buffers which can result in filesystem corruption
      after a crash. Thus we must unconditionally issue a cache flush before we
      update journal superblock in these cases.
      
      A similar problem can also occur if journal superblock is written only in
      disk's caches, other transaction starts reusing space of the transaction
      cleaned from the log and power failure happens. Subsequent journal replay would
      still try to replay the old transaction but some of it's blocks may be already
      overwritten by the new transaction. For this reason we must use WRITE_FUA when
      updating log tail and we must first write new log tail to disk and update
      in-memory information only after that.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      79feb521
  3. 13 Mar, 2012 2 commits
    • Jan Kara's avatar
      jbd2: protect all log tail updates with j_checkpoint_mutex · a78bb11d
      Jan Kara authored
      There are some log tail updates that are not protected by j_checkpoint_mutex.
      Some of these are harmless because they happen during startup or shutdown but
      updates in jbd2_journal_commit_transaction() and jbd2_journal_flush() can
      really race with other log tail updates (e.g. someone doing
      jbd2_journal_flush() with someone running jbd2_cleanup_journal_tail()). So
      protect all log tail updates with j_checkpoint_mutex.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      a78bb11d
    • Jan Kara's avatar
      jbd2: split updating of journal superblock and marking journal empty · 24bcc89c
      Jan Kara authored
      There are three case of updating journal superblock. In the first case, we want
      to mark journal as empty (setting s_sequence to 0), in the second case we want
      to update log tail, in the third case we want to update s_errno. Split these
      cases into separate functions. It makes the code slightly more straightforward
      and later patches will make the distinction even more important.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      24bcc89c
  4. 12 Mar, 2012 1 commit
  5. 05 Mar, 2012 8 commits
    • Curt Wohlgemuth's avatar
      ext4: add comments to definition of ext4_io_end_t · 4188188b
      Curt Wohlgemuth authored
      This should make it more clear what this structure is used
      for, and how some of the (mutually exclusive) fields are
      used to keep page cache references.
      Signed-off-by: default avatarCurt Wohlgemuth <curtw@google.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      4188188b
    • Curt Wohlgemuth's avatar
      ext4: don't release page refs in ext4_end_bio() · b43d17f3
      Curt Wohlgemuth authored
      We can clear PageWriteback on each page when the IO
      completes, but we can't release the references on the page
      until we convert any uninitialized extents.
      
      Without this patch, the use of the dioread_nolock mount
      option can break buffered writes, because extents may
      not be converted by the time a subsequent buffered read
      comes in; if the page is not in the page cache, a read
      will return zeros if the extent is still uninitialized.
      
      I tested this with a (temporary) patch that adds a call
      to msleep(1000) at the start of ext4_end_io_work(), to delay
      processing of each DIO-unwritten work queue item.  With this
      msleep(), a simple workload of
      
        fallocate
        write
        fadvise
        read
      
      will fail without this patch, succeeds with it.
      Signed-off-by: default avatarCurt Wohlgemuth <curtw@google.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      b43d17f3
    • Jeff Moyer's avatar
      ext4: fix race between sync and completed io work · 491caa43
      Jeff Moyer authored
      The following command line will leave the aio-stress process unkillable
      on an ext4 file system (in my case, mounted on /mnt/test):
      
      aio-stress -t 20 -s 10 -O -S -o 2 -I 1000 /mnt/test/aiostress.3561.4 /mnt/test/aiostress.3561.4.20 /mnt/test/aiostress.3561.4.19 /mnt/test/aiostress.3561.4.18 /mnt/test/aiostress.3561.4.17 /mnt/test/aiostress.3561.4.16 /mnt/test/aiostress.3561.4.15 /mnt/test/aiostress.3561.4.14 /mnt/test/aiostress.3561.4.13 /mnt/test/aiostress.3561.4.12 /mnt/test/aiostress.3561.4.11 /mnt/test/aiostress.3561.4.10 /mnt/test/aiostress.3561.4.9 /mnt/test/aiostress.3561.4.8 /mnt/test/aiostress.3561.4.7 /mnt/test/aiostress.3561.4.6 /mnt/test/aiostress.3561.4.5 /mnt/test/aiostress.3561.4.4 /mnt/test/aiostress.3561.4.3 /mnt/test/aiostress.3561.4.2
      
      This is using the aio-stress program from the xfstests test suite.
      That particular command line tells aio-stress to do random writes to
      20 files from 20 threads (one thread per file).  The files are NOT
      preallocated, so you will get writes to random offsets within the
      file, thus creating holes and extending i_size.  It also opens the
      file with O_DIRECT and O_SYNC.
      
      On to the problem.  When an I/O requires unwritten extent conversion,
      it is queued onto the completed_io_list for the ext4 inode.  Two code
      paths will pull work items from this list.  The first is the
      ext4_end_io_work routine, and the second is ext4_flush_completed_IO,
      which is called via the fsync path (and O_SYNC handling, as well).
      There are two issues I've found in these code paths.  First, if the
      fsync path beats the work routine to a particular I/O, the work
      routine will free the io_end structure!  It does not take into account
      the fact that the io_end may still be in use by the fsync path.  I've
      fixed this issue by adding yet another IO_END flag, indicating that
      the io_end is being processed by the fsync path.
      
      The second problem is that the work routine will make an assignment to
      io->flag outside of the lock.  I have witnessed this result in a hang
      at umount.  Moving the flag setting inside the lock resolved that
      problem.
      
      The problem was introduced by commit b82e384c ("ext4: optimize
      locking for end_io extent conversion"), which first appeared in 3.2.
      As such, the fix should be backported to that release (probably along
      with the unwritten extent conversion race fix).
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      CC: stable@kernel.org
      491caa43
    • Jeff Moyer's avatar
      ext4: clean up the flags passed to __blockdev_direct_IO · 93ef8541
      Jeff Moyer authored
      For extent-based files, you can perform DIO to holes, as mentioned in
      the comments in ext4_ext_direct_IO.  However, that function passes
      DIO_SKIP_HOLES to __blockdev_direct_IO, which is *really* confusing to
      the uninitiated reader.  The key, here, is that the get_block function
      passed in, ext4_get_block_write, completely ignores the create flag
      that is passed to it (the create flag is passed in from the direct I/O
      code, which uses the DIO_SKIP_HOLES flag to determine whether or not
      it should be cleared).
      
      This is a long-winded way of saying that the DIO_SKIP_HOLES flag is
      ultimately ignored.  So let's remove it.
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      93ef8541
    • Theodore Ts'o's avatar
      ext4: try to deprecate noacl and noxattr_user mount options · f7048605
      Theodore Ts'o authored
      No other file system allows ACL's and extended attributes to be
      enabled or disabled via a mount option.  So let's try to deprecate
      these options from ext4.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      f7048605
    • Theodore Ts'o's avatar
      ext4: ignore mount options supported by ext2/3 (but have since been removed) · c7198b9c
      Theodore Ts'o authored
      Users who tried to use the ext4 file system driver is being used for
      the ext2 or ext3 file systems (via the CONFIG_EXT4_USE_FOR_EXT23
      option) could have failed mounts if their /etc/fstab contains options
      recognized by ext2 or ext3 but which have since been removed in ext4.
      
      So teach ext4 to recognize them and give a warning that the mount
      option was removed.
      
      Report: https://bbs.archlinux.org/profile.php?id=33804Signed-off-by: default avatarTom Gundersen <teg@jklm.no>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Baechler <thomas@archlinux.org>
      Cc: Tobias Powalowski <tobias.powalowski@googlemail.com>
      Cc: Dave Reisner <d@falconindy.com>
      c7198b9c
    • Theodore Ts'o's avatar
      ext4: add debugging /proc file showing file system options · 66acdcf4
      Theodore Ts'o authored
      Now that /proc/mounts is consistently showing only those mount options
      which need to be specified in /etc/fstab or on the mount command line,
      it is useful to have file which shows exactly which file system
      options are enabled.  This can be useful when debugging a user
      problem.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      66acdcf4
    • Theodore Ts'o's avatar
      ext4: make ext4_show_options() be table-driven · 5a916be1
      Theodore Ts'o authored
      Consistently show mount options which are the non-default, so that
      /proc/mounts accurately shows the mount options that would be
      necessary to mount the file system in its current mode of operation.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      5a916be1
  6. 04 Mar, 2012 2 commits
  7. 03 Mar, 2012 2 commits
  8. 02 Mar, 2012 3 commits
  9. 27 Feb, 2012 1 commit
  10. 21 Feb, 2012 3 commits
    • Zheng Liu's avatar
      ext4: format flag in dx_probe() · 9ee49302
      Zheng Liu authored
      Fix ext4_warning format flag in dx_probe().
      
      CC: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      9ee49302
    • Eric Sandeen's avatar
      ext4: avoid deadlock on sync-mounted FS w/o journal · c1bb05a6
      Eric Sandeen authored
      Processes hang forever on a sync-mounted ext2 file system that
      is mounted with the ext4 module (default in Fedora 16).
      
      I can reproduce this reliably by mounting an ext2 partition with
      "-o sync" and opening a new file an that partition with vim. vim
      will hang in "D" state forever.  The same happens on ext4 without
      a journal.
      
      I am attaching a small patch here that solves this issue for me.
      In the sync mounted case without a journal,
      ext4_handle_dirty_metadata() may call sync_dirty_buffer(), which
      can't be called with buffer lock held.
      
      Also move mb_cache_entry_release inside lock to avoid race
      fixed previously by 8a2bfdcb ext[34]: EA block reference count racing fix
      Note too that ext2 fixed this same problem in 2006 with
      b2f49033 [PATCH] fix deadlock in ext2
      
      Signed-off-by: Martin.Wilck@ts.fujitsu.com
      [sandeen@redhat.com: move mb_cache_entry_release before unlock, edit commit msg]
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      c1bb05a6
    • Lukas Czerner's avatar
      ext4: fix resize when resizing within single group · a0ade1de
      Lukas Czerner authored
      When resizing file system in the way that the new size of the file
      system is still in the same group (no new groups are added), then we can
      hit a BUG_ON in ext4_alloc_group_tables()
      
      BUG_ON(flex_gd->count == 0 || group_data == NULL);
      
      because flex_gd->count is zero. The reason is the missing check for such
      case, so the code always extend the last group fully and then attempt to
      add more groups, but at that time n_blocks_count is actually smaller
      than o_blocks_count.
      
      It can be easily reproduced like this:
      
      mkfs.ext4 -b 4096 /dev/sda 30M
      mount /dev/sda /mnt/test
      resize2fs /dev/sda 50M
      
      Fix this by checking whether the resize happens within the singe group
      and only add that many blocks into the last group to satisfy user
      request. Then o_blocks_count == n_blocks_count and the resize will exit
      successfully without and attempt to add more groups into the fs.
      
      Also fix mixing together block number and blocks count which might be
      confusing and can easily lead to off-by-one errors (but it is actually
      not the case here since the two occurrence of this mix-up will cancel
      each other).
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Reported-by: default avatarMilan Broz <mbroz@redhat.com>
      Reviewed-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      a0ade1de
  11. 20 Feb, 2012 9 commits