1. 25 Apr, 2017 17 commits
    • Christoph Hellwig's avatar
    • Christoph Hellwig's avatar
    • Christoph Hellwig's avatar
      xfs: simplify validation of the unwritten extent bit · 0c1d9e4a
      Christoph Hellwig authored
      XFS only supports the unwritten extent bit in the data fork, and only if
      the file system has a version 5 superblock or the unwritten extent
      feature bit.
      
      We currently have two routines that validate the invariant:
      xfs_check_nostate_extents which return -EFSCORRUPTED when it's not met,
      and xfs_validate_extent that triggers and assert in debug build.
      
      Both of them iterate over all extents of an inode fork when called,
      which isn't very efficient.
      
      This patch instead adds a new helper that verifies the invariant one
      extent at a time, and calls it from the places where we iterate over
      all extents to converted them from or two the in-memory format.  The
      callers then return -EFSCORRUPTED when reading invalid extents from
      disk, or trigger an assert when writing them to disk.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      0c1d9e4a
    • Christoph Hellwig's avatar
      xfs: remove unused values from xfs_exntst_t · 37f7f9bb
      Christoph Hellwig authored
      We only ever use the normal and unwritten states.  And the actual
      ondisk format (this enum isn't despite being in xfs_format.h) only
      has space for the unwritten bit anyway.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      37f7f9bb
    • Christoph Hellwig's avatar
    • Eric Sandeen's avatar
      xfs: more do_div cleanups · 4f1adf33
      Eric Sandeen authored
      On some architectures do_div does the pointer compare
      trick to make sure that we've sent it an unsigned 64-bit
      number.  (Why unsigned?  I don't know.)
      
      Fix up the few places that squawk about this; in
      xfs_bmap_wants_extents() we just used a bare int64_t so change
      that to unsigned.
      
      In xfs_adjust_extent_unmap_boundaries() all we wanted was the
      mod, and we have an xfs-specific function to handle that w/o
      side effects, which includes proper casting for do_div.
      
      In xfs_daddr_to_ag[b]no, we were using the wrong type anyway;
      XFS_BB_TO_FSBT returns a block in the filesystem, so use
      xfs_rfsblock_t not xfs_daddr_t, and gain the unsignedness
      from that type as a bonus.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      4f1adf33
    • Eric Sandeen's avatar
      xfs: remove use of do_div with 32-bit dividend in quota · 90115407
      Eric Sandeen authored
      The kbuild test robot caught this; in debug code we have another
      caller of do_div with a 32-bit dividend (j) which is caught now
      that we are using the kernel-supplied do_div.
      
      None of the values used here are 64-bit; just use simple division.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      90115407
    • Hou Tao's avatar
      xfs: remove the trailing newline used in the fmt parameter of TP_printk · 42bf9dba
      Hou Tao authored
      The trailing newlines wil lead to extra newlines in the trace file
      which looks like the following output, so remove them.
      >kworker/4:1H-1508  [004] .... 47879.101608: xfs_discard_extent: dev 8:0
      >
      >kworker/u16:2-238  [004] .... 47879.101725: xfs_extent_busy_clear: dev 8:0
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      [darrick: fix the getfsmap tracepoints too]
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      42bf9dba
    • Brian Foster's avatar
      xfs: prevent multi-fsb dir readahead from reading random blocks · cb52ee33
      Brian Foster authored
      Directory block readahead uses a complex iteration mechanism to map
      between high-level directory blocks and underlying physical extents.
      This mechanism attempts to traverse the higher-level dir blocks in a
      manner that handles multi-fsb directory blocks and simultaneously
      maintains a reference to the corresponding physical blocks.
      
      This logic doesn't handle certain (discontiguous) physical extent
      layouts correctly with multi-fsb directory blocks. For example,
      consider the case of a 4k FSB filesystem with a 2 FSB (8k) directory
      block size and a directory with the following extent layout:
      
       EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
         0: [0..7]:          88..95            0 (88..95)             8
         1: [8..15]:         80..87            0 (80..87)             8
         2: [16..39]:        168..191          0 (168..191)          24
         3: [40..63]:        5242952..5242975  1 (72..95)            24
      
      Directory block 0 spans physical extents 0 and 1, dirblk 1 lies
      entirely within extent 2 and dirblk 2 spans extents 2 and 3. Because
      extent 2 is larger than the directory block size, the readahead code
      erroneously assumes the block is contiguous and issues a readahead
      based on the physical mapping of the first fsb of the dirblk. This
      results in read verifier failure and a spurious corruption or crc
      failure, depending on the filesystem format.
      
      Further, the subsequent readahead code responsible for walking
      through the physical table doesn't correctly advance the physical
      block reference for dirblk 2. Instead of advancing two physical
      filesystem blocks, the first iteration of the loop advances 1 block
      (correctly), but the subsequent iteration advances 2 more physical
      blocks because the next physical extent (extent 3, above) happens to
      cover more than dirblk 2. At this point, the higher-level directory
      block walking is completely off the rails of the actual physical
      layout of the directory for the respective mapping table.
      
      Update the contiguous dirblock logic to consider the current offset
      in the physical extent to avoid issuing directory readahead to
      unrelated blocks. Also, update the mapping table advancing code to
      consider the current offset within the current dirblock to avoid
      advancing the mapping reference too far beyond the dirblock.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      cb52ee33
    • Eric Sandeen's avatar
      xfs: handle array index overrun in xfs_dir2_leaf_readbuf() · 023cc840
      Eric Sandeen authored
      Carlos had a case where "find" seemed to start spinning
      forever and never return.
      
      This was on a filesystem with non-default multi-fsb (8k)
      directory blocks, and a fragmented directory with extents
      like this:
      
      0:[0,133646,2,0]
      1:[2,195888,1,0]
      2:[3,195890,1,0]
      3:[4,195892,1,0]
      4:[5,195894,1,0]
      5:[6,195896,1,0]
      6:[7,195898,1,0]
      7:[8,195900,1,0]
      8:[9,195902,1,0]
      9:[10,195908,1,0]
      10:[11,195910,1,0]
      11:[12,195912,1,0]
      12:[13,195914,1,0]
      ...
      
      i.e. the first extent is a contiguous 2-fsb dir block, but
      after that it is fragmented into 1 block extents.
      
      At the top of the readdir path, we allocate a mapping array
      which (for this filesystem geometry) can hold 10 extents; see
      the assignment to map_info->map_size.  During readdir, we are
      therefore able to map extents 0 through 9 above into the array
      for readahead purposes.  If we count by 2, we see that the last
      mapped index (9) is the first block of a 2-fsb directory block.
      
      At the end of xfs_dir2_leaf_readbuf() we have 2 loops to fill
      more readahead; the outer loop assumes one full dir block is
      processed each loop iteration, and an inner loop that ensures
      that this is so by advancing to the next extent until a full
      directory block is mapped.
      
      The problem is that this inner loop may step past the last
      extent in the mapping array as it tries to reach the end of
      the directory block.  This will read garbage for the extent
      length, and as a result the loop control variable 'j' may
      become corrupted and never fail the loop conditional.
      
      The number of valid mappings we have in our array is stored
      in map->map_valid, so stop this inner loop based on that limit.
      
      There is an ASSERT at the top of the outer loop for this
      same condition, but we never made it out of the inner loop,
      so the ASSERT never fired.
      
      Huge appreciation for Carlos for debugging and isolating
      the problem.
      Debugged-and-analyzed-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Tested-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: default avatarBill O'Donnell <billodo@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      023cc840
    • Chandan Rajendra's avatar
      iomap_dio_rw: Prevent reading file data beyond iomap_dio->i_size · a008c31c
      Chandan Rajendra authored
      On a ppc64 machine executing overlayfs/019 with xfs as the lower and
      upper filesystem causes the following call trace,
      
      WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420
      Modules linked in:
      CPU: 2 PID: 8034 Comm: fsstress Tainted: G             L  4.11.0-rc5-next-20170405 #100
      task: c000000631314880 task.stack: c0000003915d4000
      NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660
      REGS: c0000003915d7570 TRAP: 0700   Tainted: G             L   (4.11.0-rc5-next-20170405)
      MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>
        CR: 24004284  XER: 00000000
      CFAR: c0000000006f7190 SOFTE: 1
      GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600
      GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960
      GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0
      GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff
      GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8
      GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff
      GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000
      GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff
      NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420
      LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420
      Call Trace:
      [c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable)
      [c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0
      [c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420
      [c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160
      [c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130
      [c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0
      [c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0
      [c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100
      [c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc
      Instruction dump:
      78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288
      2f890004 419e00a0 2f890001 419e02a8 <0fe00000> 3b80fffb 38210100 7f83e378
      
      The above problem can also be recreated on a regular xfs filesystem
      using the command,
      
      $ fsstress -d /mnt -l 1000 -n 1000 -p 1000
      
      The reason for the call trace is,
      1. When 'reserving' blocks for delayed allocation , XFS reserves more
         blocks (i.e. past file's current EOF) than required. This is done
         because XFS assumes that userspace might write more data and hence
         'reserving' more blocks might lead to the file's new data being
         stored contiguously on disk.
      2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would
         then cover the prealloc-ed EOF blocks in addition to the regular blocks.
      3. When flushing the dirty blocks to disk, we only flush data till the
         file's EOF. But before writing out the dirty data, we allocate blocks
         on the disk for holding the file's new data. This allocation includes
         the blocks that are part of the 'prealloc EOF blocks'.
      4. Later, when the last reference to the inode is being closed, XFS frees the
         unused 'prealloc EOF blocks' in xfs_inactive().
      
      In step 3 above, When allocating space on disk for the delayed allocation
      range, the space allocator might sometimes allocate less blocks than
      required. If such an allocation ends right at the current EOF of the
      file, We will not be able to clear the "delayed allocation" flag for the
      'prealloc EOF blocks', since we won't have dirty buffer heads associated
      with that range of the file.
      
      In such a situation if a Direct I/O read operation is performed on file
      range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the
      range [X, Y] and invalidate page cache for that range (Refer to
      iomap_dio_rw()). Later for performing the Direct I/O read, XFS obtains
      the extent items (which are still cached in memory) for the file
      range. When doing so we are not supposed to get an extent item with
      IOMAP_DELALLOC flag set, since the previous "flush" operation should
      have converted any delayed allocation data in the range [X, Y]. Hence we
      end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor().
      
      This commit fixes the bug by preventing the read operation from going
      beyond iomap_dio->i_size.
      Reported-by: default avatarSanthosh G <santhog4@linux.vnet.ibm.com>
      Signed-off-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      a008c31c
    • Christoph Hellwig's avatar
      xfs: remove bmap block allocation retries · 7590632a
      Christoph Hellwig authored
      Now that reflink operations don't set the firstblock value we don't
      need the workarounds for non-NULL firstblock values without a prior
      allocation.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      7590632a
    • Christoph Hellwig's avatar
      xfs: remove xfs_bmap_remap_alloc · bf8eadba
      Christoph Hellwig authored
      The main thing that xfs_bmap_remap_alloc does is fixing the AGFL, similar
      to what we do in the space allocator.  But the reflink code doesn't touch
      the allocation btree unlike the normal space allocator, so we couldn't
      care less about the state of the AGFL.
      
      So remove xfs_bmap_remap_alloc and just handle the di_nblocks update in
      the caller.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      bf8eadba
    • Christoph Hellwig's avatar
      xfs: introduce xfs_bmapi_remap · 6ebd5a44
      Christoph Hellwig authored
      Add a new helper to be used for reflink extent list additions instead of
      funneling them through xfs_bmapi_write and overloading the firstblock
      member in struct xfs_bmalloca and struct xfs_alloc_args.
      
      With some small changes to xfs_bmap_remap_alloc this also means we do
      not need a xfs_bmalloca structure for this case at all.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      6ebd5a44
    • Christoph Hellwig's avatar
      xfs: pass individual arguments to xfs_bmap_add_extent_hole_real · 6d04558f
      Christoph Hellwig authored
      For the reflink case we'd much rather pass the required arguments than
      faking up a struct xfs_bmalloca.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      6d04558f
    • Christoph Hellwig's avatar
      xfs: remove attr fork handling in xfs_bmap_finish_one · 39e07daa
      Christoph Hellwig authored
      We never do COW operations for the attr fork, so don't pretend we handle
      them.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      39e07daa
    • Christoph Hellwig's avatar
      xfs: fix integer truncation in xfs_bmap_remap_alloc · 52813fb1
      Christoph Hellwig authored
      bno should be a xfs_fsblock_t, which is 64-bit wides instead of a
      xfs_aglock_t, which truncates the value to 32 bits.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      52813fb1
  2. 12 Apr, 2017 3 commits
  3. 06 Apr, 2017 2 commits
  4. 03 Apr, 2017 16 commits
  5. 28 Mar, 2017 1 commit
    • Darrick J. Wong's avatar
      xfs: rework the inline directory verifiers · 005c5db8
      Darrick J. Wong authored
      The inline directory verifiers should be called on the inode fork data,
      which means after iformat_local on the read side, and prior to
      ifork_flush on the write side.  This makes the fork verifier more
      consistent with the way buffer verifiers work -- i.e. they will operate
      on the memory buffer that the code will be reading and writing directly.
      
      Furthermore, revise the verifier function to return -EFSCORRUPTED so
      that we don't flood the logs with corruption messages and assert
      notices.  This has been a particular problem with xfs/348, which
      triggers the XFS_WANT_CORRUPTED_RETURN assertions, which halts the
      kernel when CONFIG_XFS_DEBUG=y.  Disk corruption isn't supposed to do
      that, at least not in a verifier.
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      ---
      v2: get the inode d_ops the proper way
      v3: describe the bug that this patch fixes; no code changes
      005c5db8
  6. 26 Mar, 2017 1 commit