1. 09 Mar, 2014 6 commits
  2. 08 Mar, 2014 1 commit
    • jbd2: don't unplug after writing revoke records · df3c1e9a
      Theodore Ts'o authored
      During commit process, keep the block device plugged after we are done
      writing the revoke records, until we are finished writing the rest of
      the commit records in the journal.  This will allow most of the
      journal blocks to be written in a single I/O operation, instead of
      separating the revoke blocks from the rest of the journal blocks.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  3. 04 Mar, 2014 1 commit
    • ext4: Speedup WB_SYNC_ALL pass called from sync(2) · 10542c22
      Jan Kara authored
      When doing filesystem wide sync, there's no need to force transaction
      commit (or synchronously write inode buffer) separately for each inode
      because ext4_sync_fs() takes care of forcing commit at the end (VFS
      takes care of flushing buffer cache, respectively). Most of the time
      this slowness doesn't manifest because previous WB_SYNC_NONE writeback
      doesn't leave much to write but when there are processes aggressively
      creating new files and several filesystems to sync, the sync slowness
      can be noticeable. In the following test script sync(1) takes around 6
      minutes when there are two ext4 filesystems mounted on a standard SATA
      drive. After this patch sync takes a couple of seconds so we have about
      two orders of magnitude improvement.
      
            function run_writers
            {
              for (( i = 0; i < 10; i++ )); do
                mkdir $1/dir$i
                for (( j = 0; j < 40000; j++ )); do
                  dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
                done &
              done
            }
      
            for dir in "$@"; do
              run_writers $dir
            done
      
            sleep 40
            time sync
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  4. 23 Feb, 2014 1 commit
  5. 22 Feb, 2014 1 commit
  6. 21 Feb, 2014 1 commit
  7. 20 Feb, 2014 7 commits
    • ext4: avoid exposure of stale data in ext4_punch_hole() · e251f9bc
      Maxim Patlasov authored
      While handling punch-hole fallocate, it's useless to truncate page cache
      before removing the range from extent tree (or block map in indirect case)
      because page cache can be re-populated (by read-ahead or read(2) or mmap-ed
      read) immediately after truncating page cache, but before updating extent
      tree (or block map). In that case the user will see stale data even after
      fallocate is completed.
      
      Until the problem of data corruption resulting from pages backed by
      already freed blocks is fully resolved, the simple thing we can do now
      is to add another truncation of pagecache after punch hole is done.
      Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
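
A toy model of the race described in this commit message, as a rough sketch only. The class ToyFile, its methods, and the punch_hole() helper are hypothetical names invented for illustration and are not ext4 code; the model assumes a read fills the page cache from whatever the extent map says at that moment.

```python
ZERO = b"\x00" * 4  # contents of a hole in this toy model


class ToyFile:
    """Toy model: an extent map plus a page cache (not the ext4 structures)."""

    def __init__(self, blocks):
        self.extents = dict(blocks)   # logical block -> data stored on disk
        self.page_cache = {}          # logical block -> cached data

    def read(self, block):
        # A read fills the page cache from the current extent map.
        if block not in self.page_cache:
            self.page_cache[block] = self.extents.get(block, ZERO)
        return self.page_cache[block]

    def truncate_page_cache(self, start, end):
        for block in range(start, end):
            self.page_cache.pop(block, None)

    def remove_extents(self, start, end):
        for block in range(start, end):
            self.extents.pop(block, None)


def punch_hole(f, start, end, racing_read=None, truncate_again=True):
    """Punch [start, end); optionally let one read slip in mid-operation."""
    f.truncate_page_cache(start, end)
    if racing_read is not None:
        f.read(racing_read)           # read lands before the extent map update
    f.remove_extents(start, end)
    if truncate_again:
        f.truncate_page_cache(start, end)  # the extra pass the commit adds


if __name__ == "__main__":
    old = ToyFile({0: b"DATA"})
    punch_hole(old, 0, 1, racing_read=0, truncate_again=False)
    print(old.read(0))   # stale data is still served from the page cache

    new = ToyFile({0: b"DATA"})
    punch_hole(new, 0, 1, racing_read=0, truncate_again=True)
    print(new.read(0))   # the hole now reads back as zeroes
```

With truncate_again=False the racing read leaves the old block contents cached, so later reads see stale data; the second truncation drops that page so the next read goes back to the updated extent map.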
    • ext4: silence warnings in extent status tree debugging code · ce140cdd
      Eric Whitney authored
      Adjust the conversion specifications in a few optionally compiled debug
      messages to match the return type of ext4_es_status().  Also, make a
      couple of minor grammatical message edits while we're at it.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: remove unused ac_ex_scanned · dc9ddd98
      Eric Sandeen authored
      When looking at a bug report with:
      
      > kernel: EXT4-fs: 0 scanned, 0 found
      
      I thought wow, 0 scanned, that's odd?  But it's not odd; it's printing
      a variable that is initialized to 0 and never touched again.
      
      It's never been used since the original merge, so I don't really even
      know what the original intent was, either.
      
      If anyone knows how to hook it up, speak now via patch, otherwise just
      yank it so it's not making a confusing situation more confusing in
      kernel logs.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: avoid possible overflow in ext4_map_blocks() · e861b5e9
      Theodore Ts'o authored
      The ext4_map_blocks() function returns the number of blocks which
      satisfy the caller's request.  The number of blocks requested by
      the caller is specified by an unsigned integer, but the return value
      of ext4_map_blocks() is a signed integer (to accommodate error codes
      per the kernel's standard error signalling convention).
      
      Historically, overflows could never happen since mballoc() will refuse
      to allocate more than 2048 blocks at a time (which is something we
      should fix), and if the blocks were already allocated, the fact that
      there would be some number of intervening metadata blocks pretty much
      guaranteed that there could never be a contiguous region of data
      blocks that was greater than 2**31 blocks.
      
      However, this is now possible if there is a file system which is a bit
      bigger than 8TB, and is created using the new mke2fs hugeblock
      feature, which can create a perfectly contiguous file.  In that case,
      if a userspace program attempted to call fallocate() on this already
      fully allocated file, it's possible that ext4_map_blocks() could
      return a number large enough that it would overflow a signed integer,
      resulting in ext4 thinking that the ext4_map_blocks() call had
      failed with some strange error code.
      
      Since ext4_map_blocks() is always free to return a smaller number of
      blocks than what was requested by the caller, fix this by capping the
      number of blocks that ext4_map_blocks() will ever try to map to 2**31
      - 1.  In practice this should never get hit, except by someone
      deliberately trying to provoke the above-described bug.

      Thanks to the PaX team for asking whether this could possibly happen
      in some off-line discussions about using some static code checking
      technology they are developing to find bugs in kernel code.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
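
The overflow described above can be sketched in a few lines. This is an illustration only: as_int32(), mapped_uncapped() and mapped_capped() are hypothetical helpers that emulate a signed 32-bit return value, not kernel functions.

```python
INT_MAX = 2**31 - 1  # largest value a signed 32-bit int can hold


def as_int32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


def mapped_uncapped(requested_len: int) -> int:
    """Block count squeezed into a signed int with no cap applied."""
    return as_int32(requested_len)


def mapped_capped(requested_len: int) -> int:
    """Cap the request at 2**31 - 1 first, as the commit message describes."""
    return as_int32(min(requested_len, INT_MAX))


if __name__ == "__main__":
    request = 2**31 + 5  # contiguous run slightly over 2**31 blocks
    print(mapped_uncapped(request))  # negative, so it looks like an error code
    print(mapped_capped(request))    # stays positive
```

Because the caller is always allowed to receive fewer blocks than it asked for, returning the capped count is safe: the caller simply asks again for the remainder.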
    • ext4: make sure ex.fe_logical is initialized · ab0c00fc
      Theodore Ts'o authored
      The lowest levels of mballoc set all of the fields of struct
      ext4_free_extent except for fe_logical, since they are just trying to
      find the requested free set of blocks, and the logical block hasn't
      been set yet.  This makes some static code checkers sad.  Set it to
      various different debug values, which would be useful when
      debugging mballoc if these values were to ever show up due to
      parts of mballoc trying to use ac->ac_b_ex.fe_logical before it is
      properly set by the upper layers of mballoc, usually by
      ext4_mb_use_best_found().
      
      Addresses-Coverity-Id: #139697
      Addresses-Coverity-Id: #139698
      Addresses-Coverity-Id: #139699
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: don't calculate total xattr header size unless needed · 7b1b2c1b
      Theodore Ts'o authored
      The function ext4_expand_extra_isize_ea() doesn't need the size of all
      of the extended attribute headers.  So if we don't calculate it when
      it is unneeded, we can skip some unneeded memory references, and as
      a bonus, we eliminate some kvetching by static code analysis tools.
      
      Addresses-Coverity-Id: #741291
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: add ext4_es_store_pblock_status() · 9a6633b1
      Theodore Ts'o authored
      Avoid false positives by static code analysis tools such as sparse and
      coverity caused by the fact that we set the physical block, and then
      the status in the extent_status structure.  It is also more efficient
      to set both of these values at once.
      
      Addresses-Coverity-Id: #989077
      Addresses-Coverity-Id: #989078
      Addresses-Coverity-Id: #1080722
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
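
A rough sketch of why storing the physical block and the status together is both cleaner and cheaper. The layout used here is an assumption made for illustration (status flags in the top 4 bits of a 64-bit word, block number in the low 60 bits), and STATUS_SHIFT, store_pblock_status(), pblock_of() and status_of() are hypothetical names, not the kernel's definitions.

```python
STATUS_SHIFT = 60                        # assumed: status lives in the top 4 bits
PBLOCK_MASK = (1 << STATUS_SHIFT) - 1    # assumed: low 60 bits hold the block


def store_pblock_status(pblock: int, status: int) -> int:
    """Build the packed word in one step instead of two partial updates."""
    return ((status & 0xF) << STATUS_SHIFT) | (pblock & PBLOCK_MASK)


def pblock_of(word: int) -> int:
    """Extract the physical block number from the packed word."""
    return word & PBLOCK_MASK


def status_of(word: int) -> int:
    """Extract the status flags from the packed word."""
    return word >> STATUS_SHIFT


if __name__ == "__main__":
    word = store_pblock_status(pblock=1234, status=0b1000)
    print(pblock_of(word), status_of(word))
```

With a single combined store there is no intermediate state in which the word holds a block number but no status, which is the kind of state a static checker tends to flag when the two halves are written separately.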
  8. 19 Feb, 2014 1 commit
  9. 18 Feb, 2014 6 commits
  10. 17 Feb, 2014 1 commit
  11. 16 Feb, 2014 2 commits
    • ext4: fix online resize with a non-standard blocks per group setting · 3d2660d0
      Theodore Ts'o authored
      The set_flexbg_block_bitmap() function assumed that the number of
      blocks in a blockgroup was sb->blocksize * 8, which is normally true,
      but not always!  Use EXT4_BLOCKS_PER_GROUP(sb) instead, to fix block
      bitmap corruption after:
      
      mke2fs -t ext4 -g 3072 -i 4096 /dev/vdd 1G
      mount -t ext4 /dev/vdd /vdd
      resize2fs /dev/vdd 8G
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reported-by: Jon Bernard <jbernard@tuxion.com>
      Cc: stable@vger.kernel.org
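
To see how far off the "block size * 8" assumption is for the reproducer above, here is a small arithmetic sketch. It assumes a 4096-byte block size and a first data block of 0 (neither is stated in the commit message), and group_of_block() is a hypothetical helper that does not reproduce set_flexbg_block_bitmap().

```python
BLOCK_SIZE = 4096                        # assumption: 4 KiB file system blocks
ASSUMED_BLOCKS_PER_GROUP = BLOCK_SIZE * 8   # what the old code took for granted
ACTUAL_BLOCKS_PER_GROUP = 3072           # from "mke2fs -g 3072" in the reproducer


def group_of_block(block: int, blocks_per_group: int,
                   first_data_block: int = 0) -> int:
    """Index of the block group that contains the given block number."""
    return (block - first_data_block) // blocks_per_group


if __name__ == "__main__":
    block = 10000
    print(group_of_block(block, ASSUMED_BLOCKS_PER_GROUP))  # wrong group
    print(group_of_block(block, ACTUAL_BLOCKS_PER_GROUP))   # real group
```

Any bitmap update that derives group boundaries from the assumed value marks bits in the wrong group's bitmap, which is consistent with the bitmap corruption the commit reports.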
    • ext4: fix online resize with very large inode tables · b93c9535
      Theodore Ts'o authored
      If a file system has a large number of inodes per block group, all of
      the metadata blocks in a flex_bg may be larger than what can fit in a
      single block group.  Unfortunately, ext4_alloc_group_tables() in
      resize.c was never tested to see if it would handle this case
      correctly, and there were a large number of bugs which caused the
      following sequence to result in a BUG_ON:
      
      kernel bug at fs/ext4/resize.c:409!
         ...
      call trace:
       [<ffffffff81256768>] ext4_flex_group_add+0x1448/0x1830
       [<ffffffff81257de2>] ext4_resize_fs+0x7b2/0xe80
       [<ffffffff8123ac50>] ext4_ioctl+0xbf0/0xf00
       [<ffffffff811c111d>] do_vfs_ioctl+0x2dd/0x4b0
       [<ffffffff811b9df2>] ? final_putname+0x22/0x50
       [<ffffffff811c1371>] sys_ioctl+0x81/0xa0
       [<ffffffff81676aa9>] system_call_fastpath+0x16/0x1b
      code: c8 4c 89 df e8 41 96 f8 ff 44 89 e8 49 01 c4 44 29 6d d4 0
      rip  [<ffffffff81254fa1>] set_flexbg_block_bitmap+0x171/0x180
      
      
      This can be reproduced with the following command sequence:
      
         mke2fs -t ext4 -i 4096 /dev/vdd 1G
         mount -t ext4 /dev/vdd /vdd
         resize2fs /dev/vdd 8G
      
      To fix this, we need to make sure the right thing happens when a block
      group's inode table straddles two block groups, which means the
      following bugs had to be fixed:
      
      1) Not clearing the BLOCK_UNINIT flag in the second block group in
         ext4_alloc_group_tables --- this was the proximate cause of the BUG_ON.
      
      2) Incorrectly determining how many block groups contained contiguous
         free blocks in ext4_alloc_group_tables().
      
      3) Incorrectly setting the start of the next block range to be marked
         in use after a discontinuity in setup_new_flex_group_blocks().
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
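
A back-of-the-envelope check of why the reproducer's flex_bg metadata no longer fits in one block group. The figures below are assumptions, not values stated in the commit: 4096-byte blocks, 256-byte inodes, 16 groups per flex_bg, and an inodes-per-group count derived from "-i 4096" and capped by the size of one bitmap block. The helper names are hypothetical.

```python
BLOCK_SIZE = 4096                    # assumption: 4 KiB blocks
INODE_SIZE = 256                     # assumption: 256-byte on-disk inodes
FLEX_BG_SIZE = 16                    # assumption: 16 groups per flex_bg
BYTES_PER_INODE = 4096               # from "mke2fs -i 4096" in the reproducer
BLOCKS_PER_GROUP = BLOCK_SIZE * 8    # one bitmap block covers this many blocks


def inode_table_blocks_per_group() -> int:
    """Blocks needed for one group's inode table under the assumptions above."""
    inodes = min(BLOCKS_PER_GROUP * BLOCK_SIZE // BYTES_PER_INODE,
                 BLOCK_SIZE * 8)     # an inode bitmap block caps the count
    return (inodes * INODE_SIZE + BLOCK_SIZE - 1) // BLOCK_SIZE


def flex_bg_metadata_blocks() -> int:
    """Inode table plus block bitmap plus inode bitmap for every group."""
    return FLEX_BG_SIZE * (inode_table_blocks_per_group() + 2)


if __name__ == "__main__":
    print(inode_table_blocks_per_group())
    print(flex_bg_metadata_blocks(), "vs", BLOCKS_PER_GROUP)
```

Under these assumptions the packed metadata needs slightly more blocks than a single group holds, so it has to spill into the next group, which is the straddling case the three fixes address.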
  12. 12 Feb, 2014 3 commits
    • ext4: don't try to modify s_flags if the file system is read-only · 23301410
      Theodore Ts'o authored
      If an ext4 file system is created by some tool other than mke2fs
      (perhaps by someone who has a pathological fear of the GPL) that
      doesn't set one or the other of the EXT2_FLAGS_{UN}SIGNED_HASH flags,
      and that file system is then mounted read-only, don't try to modify
      the s_flags field.  Otherwise, if dm_verity is in use, the superblock
      will change, causing a dm_verity failure.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4: fix error paths in swap_inode_boot_loader() · 30d29b11
      Zheng Liu authored
      In swap_inode_boot_loader() we forgot to release ->i_mutex and resume
      unlocked dio for inode and inode_bl if there is an error starting the
      journal handle.  This commit fixes this issue.
      Reported-by: Ahmed Tamrawi <ahmedtamrawi@gmail.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dr. Tilmann Bubeck <t.bubeck@reinform.de>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org  # v3.10+
    • ext4: fix xfstest generic/299 block validity failures · 15cc1767
      Eric Whitney authored
      Commit a115f749 (ext4: remove wait for unwritten extent conversion from
      ext4_truncate) exposed a bug in ext4_ext_handle_uninitialized_extents().
      It can be triggered by xfstest generic/299 when run on a test file
      system created without a journal.  This test continuously fallocates and
      truncates files to which random dio/aio writes are simultaneously
      performed by a separate process.  The test completes successfully, but
      if the test filesystem is mounted with the block_validity option, a
      warning message stating that a logical block has been mapped to an
      illegal physical block is posted in the kernel log.
      
      The bug occurs when an extent is being converted to the written state
      by ext4_end_io_dio() and ext4_ext_handle_uninitialized_extents()
      discovers a mapping for an existing uninitialized extent. Although it
      sets EXT4_MAP_MAPPED in map->m_flags, it fails to set map->m_pblk to
      the discovered physical block number.  Because map->m_pblk is not
      otherwise initialized or set by this function or its callers, its
      uninitialized value is returned to ext4_map_blocks(), where it is
      stored as a bogus mapping in the extent status tree.
      
      Since map->m_pblk can accidentally contain illegal values that are
      larger than the physical size of the file system, calls to
      check_block_validity() in ext4_map_blocks() that are enabled if the
      block_validity mount option is used can fail, resulting in the logged
      warning message.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org  # 3.11+
  13. 10 Feb, 2014 4 commits
  14. 09 Feb, 2014 5 commits
    • fix a kmap leak in virtio_console · c9efe511
      Al Viro authored
      While we are at it, don't do kmap() under kmap_atomic(), *especially*
      for a page we'd allocated with GFP_KERNEL.  It's spelled "page_address",
      and had that been more than that, we'd have real trouble - kmap_high()
      can block, and doing that while holding kmap_atomic() is a Bad Idea(tm).
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fix O_SYNC|O_APPEND syncing the wrong range on write() · d311d79d
      Al Viro authored
      It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support)
      when sync_page_range() had been introduced; generic_file_write{,v}() correctly
      synced
      	pos_after_write - written .. pos_after_write - 1
      but generic_file_aio_write() synced
      	pos_before_write .. pos_before_write + written - 1
      instead.  Which is not the same thing with O_APPEND, obviously.
      A couple of years later correct variant had been killed off when
      everything switched to use of generic_file_aio_write().
      
      All users of generic_file_aio_write() are affected, and the same bug
      has been copied into other instances of ->aio_write().
      
      The fix is trivial; the only subtle point is that generic_write_sync()
      ought to be inlined to avoid calculations useless for the majority of
      calls.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
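
The two range calculations quoted in this commit message can be compared directly. The sketch below is illustrative only: buggy_sync_range() and correct_sync_range() are hypothetical helpers, and the example assumes an O_APPEND write, where the data lands at end of file regardless of the position the caller passed in.

```python
from typing import Tuple


def buggy_sync_range(pos_before_write: int, written: int) -> Tuple[int, int]:
    """Range derived from the position before the write (the buggy variant)."""
    return pos_before_write, pos_before_write + written - 1


def correct_sync_range(pos_after_write: int, written: int) -> Tuple[int, int]:
    """Range derived from the position after the write (the correct variant)."""
    return pos_after_write - written, pos_after_write - 1


if __name__ == "__main__":
    file_size = 1000        # current end of file
    pos_before_write = 0    # position the caller passed in
    written = 100           # bytes written; O_APPEND puts them at end of file
    pos_after_write = file_size + written

    print(buggy_sync_range(pos_before_write, written))    # syncs old data
    print(correct_sync_range(pos_after_write, written))   # syncs the new bytes
```

Without O_APPEND the two calculations agree, because the position after the write is simply the position before it plus the byte count; they diverge only when the write is redirected to end of file.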
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · 9c1db779
      Linus Torvalds authored
      Pull btrfs fixes from Chris Mason:
       "This is a small collection of fixes"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Btrfs: fix data corruption when reading/updating compressed extents
        Btrfs: don't loop forever if we can't run because of the tree mod log
        btrfs: reserve no transaction units in btrfs_ioctl_set_features
        btrfs: commit transaction after setting label and features
        Btrfs: fix assert screwup for the pending move stuff
    • Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 6f2a1c1e
      Linus Torvalds authored
      Pull perf fixes from Ingo Molnar:
       "Tooling fixes, mostly related to the KASLR fallout, but also other
        fixes"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf buildid-cache: Check relocation when checking for existing kcore
        perf tools: Adjust kallsyms for relocated kernel
        perf tests: No need to set up ref_reloc_sym
        perf symbols: Prevent the use of kcore if the kernel has moved
        perf record: Get ref_reloc_sym from kernel map
        perf machine: Set up ref_reloc_sym in machine__create_kernel_maps()
        perf machine: Add machine__get_kallsyms_filename()
        perf tools: Add kallsyms__get_function_start()
        perf symbols: Fix symbol annotation for relocated kernel
        perf tools: Fix include for non x86 architectures
        perf tools: Fix AAAAARGH64 memory barriers
        perf tools: Demangle kernel and kernel module symbols too
        perf/doc: Remove mention of non-existent set_perf_event_pending() from design.txt
    • Btrfs: fix data corruption when reading/updating compressed extents · a2aa75e1
      Filipe David Borba Manana authored
      When using a mix of compressed file extents and prealloc extents, it
      is possible to fill a page of a file with random, garbage data from
      some unrelated previous use of the page, instead of a sequence of zeroes.
      
      A simple sequence of steps to get into such a case, taken from the test
      case I made for xfstests, is:
      
         _scratch_mkfs
         _scratch_mount "-o compress-force=lzo"
         $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar
         $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar
         $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar
         $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
      
      This results in the following file items in the fs tree:
      
         item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
             inode generation 6 transid 6 size 542872 block group 0 mode 100600
         item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16
             inode ref index 2 namelen 6 name: foobar
         item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
             extent data disk byte 0 nr 0 gen 6
             extent data offset 0 nr 24576 ram 266240
             extent compression 0
         item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53
             prealloc data disk byte 12849152 nr 241664 gen 6
             prealloc data offset 0 nr 241664
         item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53
             extent data disk byte 12845056 nr 4096 gen 6
             extent data offset 0 nr 20480 ram 20480
             extent compression 2
         item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53
             prealloc data disk byte 13090816 nr 405504 gen 6
             prealloc data offset 0 nr 258048
      
      The on disk extent at offset 266240 (which corresponds to 1 single disk block),
      contains 5 compressed chunks of file data. Each of the first 4 compress 4096
      bytes of file data, while the last one only compresses 3024 bytes of file data.
      Therefore a read into the file region [285648 ; 286720[ (length = 4096 - 3024 =
      1072 bytes) should always return zeroes (our next extent is a prealloc one).
      
      The solution here is for the compression code path to zero the remaining (untouched)
      bytes of the last page it uncompressed data into, as the information about how
      much space the file data consumes in the last page is not known in the upper layer
      fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly zeroing
      the remainder of the page but only if it corresponds to the last page of the inode
      and if the inode's size is not a multiple of the page size.
      
      This would cause not only returning random data on reads, but also permanently
      storing random data when updating parts of the region that should be zeroed.
      For the example above, it means updating a single byte in the region [285648 ; 286720[
      would store that byte correctly but also store random data on disk.
      
      A test case for xfstests follows soon.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
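
The arithmetic and the fix described in this commit message can be sketched as follows. This is a simplified illustration, not the btrfs decompression code: fill_last_page() is a hypothetical helper, and the page is modelled as a plain byte buffer that still holds garbage from a previous use.

```python
PAGE_SIZE = 4096
EXTENT_START = 266240     # file offset of the compressed extent (item 8)
FULL_CHUNKS = 4           # chunks that each decompress to a full 4096 bytes
LAST_CHUNK_BYTES = 3024   # bytes produced by the fifth, final chunk


def fill_last_page(page: bytearray, data: bytes, zero_tail: bool = True) -> None:
    """Copy decompressed data into the page; optionally zero what is left."""
    page[:len(data)] = data
    if zero_tail:
        # The remainder was never written by decompression, so clear it.
        page[len(data):] = bytes(len(page) - len(data))


if __name__ == "__main__":
    data_end = EXTENT_START + FULL_CHUNKS * PAGE_SIZE + LAST_CHUNK_BYTES
    page_end = EXTENT_START + (FULL_CHUNKS + 1) * PAGE_SIZE
    print(data_end, page_end, page_end - data_end)  # bounds of the zero region

    page = bytearray(b"\xaa" * PAGE_SIZE)           # stale bytes from earlier use
    fill_last_page(page, b"\x06" * LAST_CHUNK_BYTES)
    print(page[LAST_CHUNK_BYTES:] == bytes(PAGE_SIZE - LAST_CHUNK_BYTES))
```

Calling fill_last_page() with zero_tail=False leaves the stale bytes in place, which corresponds to the garbage a reader of the region [285648 ; 286720[ could see before the fix.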