1. 14 Jun, 2017 24 commits
  2. 07 Jun, 2017 16 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.9.31 · f1aa865a
      Greg Kroah-Hartman authored
      f1aa865a
    • Jan Kara's avatar
      xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff() · 11214bd2
      Jan Kara authored
      commit d7fd2425 upstream.
      
      There is an off-by-one error in loop termination conditions in
      xfs_find_get_desired_pgoff() since 'end' may index a page beyond end of
      desired range if 'endoff' is page aligned. It doesn't have any visible
      effects but still it is good to fix it.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      11214bd2
    • Eric Sandeen's avatar
      xfs: fix unaligned access in xfs_btree_visit_blocks · 75c5afd5
      Eric Sandeen authored
      commit a4d768e7 upstream.
      
      This structure copy was throwing unaligned access warnings on sparc64:
      
      Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs]
      
      xfs_btree_copy_ptrs does a memcpy, which avoids it.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      75c5afd5
    • Darrick J. Wong's avatar
      xfs: avoid mount-time deadlock in CoW extent recovery · 7fb8ab8f
      Darrick J. Wong authored
      commit 3ecb3ac7 upstream.
      
      If a malicious user corrupts the refcount btree to cause a cycle between
      different levels of the tree, the next mount attempt will deadlock in
      the CoW recovery routine while grabbing buffer locks.  We can use the
      ability to re-grab a buffer that was previous locked to a transaction to
      avoid deadlocks, so do that here.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7fb8ab8f
    • Christoph Hellwig's avatar
      xfs: xfs_trans_alloc_empty · e40c145c
      Christoph Hellwig authored
      This is a partial cherry-pick of commit e89c0413
      ("xfs: implement the GETFSMAP ioctl"), which also adds this helper, and
      a great example of why feature patches should be properly split into
      their parts.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      [hch: split from the larger patch for -stable]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      e40c145c
    • Zorro Lang's avatar
      xfs: bad assertion for delalloc an extent that start at i_size · 0e542792
      Zorro Lang authored
      commit 892d2a5f upstream.
      
      By run fsstress long enough time enough in RHEL-7, I find an
      assertion failure (harder to reproduce on linux-4.11, but problem
      is still there):
      
        XFS: Assertion failed: (iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c
      
      The assertion is in xfs_getbmap() funciton:
      
        if (map[i].br_startblock == DELAYSTARTBLOCK &&
      -->   map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip)))
                ASSERT((iflags & BMV_IF_DELALLOC) != 0);
      
      When map[i].br_startoff == XFS_B_TO_FSB(mp, XFS_ISIZE(ip)), the
      startoff is just at EOF. But we only need to make sure delalloc
      extents that are within EOF, not include EOF.
      Signed-off-by: default avatarZorro Lang <zlang@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0e542792
    • Darrick J. Wong's avatar
      xfs: BMAPX shouldn't barf on inline-format directories · f60d76ef
      Darrick J. Wong authored
      commit 6eadbf4c upstream.
      
      When we're fulfilling a BMAPX request, jump out early if the data fork
      is in local format.  This prevents us from hitting a debugging check in
      bmapi_read and barfing errors back to userspace.  The on-disk extent
      count check later isn't sufficient for IF_DELALLOC mode because da
      extents are in memory and not on disk.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f60d76ef
    • Brian Foster's avatar
      xfs: fix indlen accounting error on partial delalloc conversion · 53c44c23
      Brian Foster authored
      commit 0daaecac upstream.
      
      The delalloc -> real block conversion path uses an incorrect
      calculation in the case where the middle part of a delalloc extent
      is being converted. This is documented as a rare situation because
      XFS generally attempts to maximize contiguity by converting as much
      of a delalloc extent as possible.
      
      If this situation does occur, the indlen reservation for the two new
      delalloc extents left behind by the conversion of the middle range
      is calculated and compared with the original reservation. If more
      blocks are required, the delta is allocated from the global block
      pool. This delta value can be characterized as the difference
      between the new total requirement (temp + temp2) and the currently
      available reservation minus those blocks that have already been
      allocated (startblockval(PREV.br_startblock) - allocated).
      
      The problem is that the current code does not account for previously
      allocated blocks correctly. It subtracts the current allocation
      count from the (new - old) delta rather than the old indlen
      reservation. This means that more indlen blocks than have been
      allocated end up stashed in the remaining extents and free space
      accounting is broken as a result.
      
      Fix up the calculation to subtract the allocated block count from
      the original extent indlen and thus correctly allocate the
      reservation delta based on the difference between the new total
      requirement and the unused blocks from the original reservation.
      Also remove a bogus assert that contradicts the fact that the new
      indlen reservation can be larger than the original indlen
      reservation.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      53c44c23
    • Eryu Guan's avatar
      xfs: fix use-after-free in xfs_finish_page_writeback · 54894ea3
      Eryu Guan authored
      commit 161f55ef upstream.
      
      Commit 28b783e4 ("xfs: bufferhead chains are invalid after
      end_page_writeback") fixed one use-after-free issue by
      pre-calculating the loop conditionals before calling bh->b_end_io()
      in the end_io processing loop, but it assigned 'next' pointer before
      checking end offset boundary & breaking the loop, at which point the
      bh might be freed already, and caused use-after-free.
      
      This is caught by KASAN when running fstests generic/127 on sub-page
      block size XFS.
      
      [ 2517.244502] run fstests generic/127 at 2017-04-27 07:30:50
      [ 2747.868840] ==================================================================
      [ 2747.876949] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3d3/0x4e0 [xfs] at addr ffff8801395ae698
      ...
      [ 2747.918245] Call Trace:
      [ 2747.920975]  dump_stack+0x63/0x84
      [ 2747.924673]  kasan_object_err+0x21/0x70
      [ 2747.928950]  kasan_report+0x271/0x530
      [ 2747.933064]  ? xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
      [ 2747.938409]  ? end_page_writeback+0xce/0x110
      [ 2747.943171]  __asan_report_load8_noabort+0x19/0x20
      [ 2747.948545]  xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
      [ 2747.953724]  xfs_end_io+0x1af/0x2b0 [xfs]
      [ 2747.958197]  process_one_work+0x5ff/0x1000
      [ 2747.962766]  worker_thread+0xe4/0x10e0
      [ 2747.966946]  kthread+0x2d3/0x3d0
      [ 2747.970546]  ? process_one_work+0x1000/0x1000
      [ 2747.975405]  ? kthread_create_on_node+0xc0/0xc0
      [ 2747.980457]  ? syscall_return_slowpath+0xe6/0x140
      [ 2747.985706]  ? do_page_fault+0x30/0x80
      [ 2747.989887]  ret_from_fork+0x2c/0x40
      [ 2747.993874] Object at ffff8801395ae690, in cache buffer_head size: 104
      [ 2748.001155] Allocated:
      [ 2748.003782] PID = 8327
      [ 2748.006411]  save_stack_trace+0x1b/0x20
      [ 2748.010688]  save_stack+0x46/0xd0
      [ 2748.014383]  kasan_kmalloc+0xad/0xe0
      [ 2748.018370]  kasan_slab_alloc+0x12/0x20
      [ 2748.022648]  kmem_cache_alloc+0xb8/0x1b0
      [ 2748.027024]  alloc_buffer_head+0x22/0xc0
      [ 2748.031399]  alloc_page_buffers+0xd1/0x250
      [ 2748.035968]  create_empty_buffers+0x30/0x410
      [ 2748.040730]  create_page_buffers+0x120/0x1b0
      [ 2748.045493]  __block_write_begin_int+0x17a/0x1800
      [ 2748.050740]  iomap_write_begin+0x100/0x2f0
      [ 2748.055308]  iomap_zero_range_actor+0x253/0x5c0
      [ 2748.060362]  iomap_apply+0x157/0x270
      [ 2748.064347]  iomap_zero_range+0x5a/0x80
      [ 2748.068624]  iomap_truncate_page+0x6b/0xa0
      [ 2748.073227]  xfs_setattr_size+0x1f7/0xa10 [xfs]
      [ 2748.078312]  xfs_vn_setattr_size+0x68/0x140 [xfs]
      [ 2748.083589]  xfs_file_fallocate+0x4ac/0x820 [xfs]
      [ 2748.088838]  vfs_fallocate+0x2cf/0x780
      [ 2748.093021]  SyS_fallocate+0x48/0x80
      [ 2748.097006]  do_syscall_64+0x18a/0x430
      [ 2748.101186]  return_from_SYSCALL_64+0x0/0x6a
      [ 2748.105948] Freed:
      [ 2748.108189] PID = 8327
      [ 2748.110816]  save_stack_trace+0x1b/0x20
      [ 2748.115093]  save_stack+0x46/0xd0
      [ 2748.118788]  kasan_slab_free+0x73/0xc0
      [ 2748.122969]  kmem_cache_free+0x7a/0x200
      [ 2748.127247]  free_buffer_head+0x41/0x80
      [ 2748.131524]  try_to_free_buffers+0x178/0x250
      [ 2748.136316]  xfs_vm_releasepage+0x2e9/0x3d0 [xfs]
      [ 2748.141563]  try_to_release_page+0x100/0x180
      [ 2748.146325]  invalidate_inode_pages2_range+0x7da/0xcf0
      [ 2748.152087]  xfs_shift_file_space+0x37d/0x6e0 [xfs]
      [ 2748.157557]  xfs_collapse_file_space+0x49/0x120 [xfs]
      [ 2748.163223]  xfs_file_fallocate+0x2a7/0x820 [xfs]
      [ 2748.168462]  vfs_fallocate+0x2cf/0x780
      [ 2748.172642]  SyS_fallocate+0x48/0x80
      [ 2748.176629]  do_syscall_64+0x18a/0x430
      [ 2748.180810]  return_from_SYSCALL_64+0x0/0x6a
      
      Fixed it by checking on offset against end & breaking out first,
      dereference bh only if there're still bufferheads to process.
      Signed-off-by: default avatarEryu Guan <eguan@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      54894ea3
    • Darrick J. Wong's avatar
      xfs: reserve enough blocks to handle btree splits when remapping · d457f822
      Darrick J. Wong authored
      commit fe0be23e upstream.
      
      In xfs_reflink_end_cow, we erroneously reserve only enough blocks to
      handle adding 1 extent.  This is problematic if we fragment free space,
      have to do CoW, and then have to perform multiple bmap btree expansions.
      Furthermore, the BUI recovery routine doesn't reserve /any/ blocks to
      handle btree splits, so log recovery fails after our first error causes
      the filesystem to go down.
      
      Therefore, refactor the transaction block reservation macros until we
      have a macro that works for our deferred (re)mapping activities, and fix
      both problems by using that macro.
      
      With 1k blocks we can hit this fairly often in g/187 if the scratch fs
      is big enough.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d457f822
    • Brian Foster's avatar
      xfs: wait on new inodes during quotaoff dquot release · 0ba833fe
      Brian Foster authored
      commit e20c8a51 upstream.
      
      The quotaoff operation has a race with inode allocation that results
      in a livelock. An inode allocation that occurs before the quota
      status flags are updated acquires the appropriate dquots for the
      inode via xfs_qm_vop_dqalloc(). It then inserts the XFS_INEW inode
      into the perag radix tree, sometime later attaches the dquots to the
      inode and finally clears the XFS_INEW flag. Quotaoff expects to
      release the dquots from all inodes in the filesystem via
      xfs_qm_dqrele_all_inodes(). This invokes the AG inode iterator,
      which skips inodes in the XFS_INEW state because they are not fully
      constructed. If the scan occurs after dquots have been attached to
      an inode, but before XFS_INEW is cleared, the newly allocated inode
      will continue to hold a reference to the applicable dquots. When
      quotaoff invokes xfs_qm_dqpurge_all(), the reference count of those
      dquot(s) remain elevated and the dqpurge scan spins indefinitely.
      
      To address this problem, update the xfs_qm_dqrele_all_inodes() scan
      to wait on inodes marked on the XFS_INEW state. We wait on the
      inodes explicitly rather than skip and retry to avoid continuous
      retry loops due to a parallel inode allocation workload. Since
      quotaoff updates the quota state flags and uses a synchronous
      transaction before the dqrele scan, and dquots are attached to
      inodes after radix tree insertion iff quota is enabled, one INEW
      waiting pass through the AG guarantees that the scan has processed
      all inodes that could possibly hold dquot references.
      Reported-by: default avatarEryu Guan <eguan@redhat.com>
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0ba833fe
    • Brian Foster's avatar
      xfs: update ag iterator to support wait on new inodes · 2ea882d8
      Brian Foster authored
      commit ae2c4ac2 upstream.
      
      The AG inode iterator currently skips new inodes as such inodes are
      inserted into the inode radix tree before they are fully
      constructed. Certain contexts require the ability to wait on the
      construction of new inodes, however. The fs-wide dquot release from
      the quotaoff sequence is an example of this.
      
      Update the AG inode iterator to support the ability to wait on
      inodes flagged with XFS_INEW upon request. Create a new
      xfs_inode_ag_iterator_flags() interface and support a set of
      iteration flags to modify the iteration behavior. When the
      XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW flags in the
      radix tree inode lookup and wait on them before the callback is
      executed.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2ea882d8
    • Brian Foster's avatar
      xfs: support ability to wait on new inodes · e86b616b
      Brian Foster authored
      commit 756baca2 upstream.
      
      Inodes that are inserted into the perag tree but still under
      construction are flagged with the XFS_INEW bit. Most contexts either
      skip such inodes when they are encountered or have the ability to
      handle them.
      
      The runtime quotaoff sequence introduces a context that must wait
      for construction of such inodes to correctly ensure that all dquots
      in the fs are released. In anticipation of this, support the ability
      to wait on new inodes. Wake the appropriate bit when XFS_INEW is
      cleared.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e86b616b
    • Brian Foster's avatar
      xfs: fix up quotacheck buffer list error handling · 10f0b2c3
      Brian Foster authored
      commit 20e8a063 upstream.
      
      The quotacheck error handling of the delwri buffer list assumes the
      resident buffers are locked and doesn't clear the _XBF_DELWRI_Q flag
      on the buffers that are dequeued. This can lead to assert failures
      on buffer release and possibly other locking problems.
      
      Move this code to a delwri queue cancel helper function to
      encapsulate the logic required to properly release buffers from a
      delwri queue. Update the helper to clear the delwri queue flag and
      call it from quotacheck.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      10f0b2c3
    • Brian Foster's avatar
      xfs: prevent multi-fsb dir readahead from reading random blocks · 95487d4b
      Brian Foster authored
      commit cb52ee33 upstream.
      
      Directory block readahead uses a complex iteration mechanism to map
      between high-level directory blocks and underlying physical extents.
      This mechanism attempts to traverse the higher-level dir blocks in a
      manner that handles multi-fsb directory blocks and simultaneously
      maintains a reference to the corresponding physical blocks.
      
      This logic doesn't handle certain (discontiguous) physical extent
      layouts correctly with multi-fsb directory blocks. For example,
      consider the case of a 4k FSB filesystem with a 2 FSB (8k) directory
      block size and a directory with the following extent layout:
      
       EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
         0: [0..7]:          88..95            0 (88..95)             8
         1: [8..15]:         80..87            0 (80..87)             8
         2: [16..39]:        168..191          0 (168..191)          24
         3: [40..63]:        5242952..5242975  1 (72..95)            24
      
      Directory block 0 spans physical extents 0 and 1, dirblk 1 lies
      entirely within extent 2 and dirblk 2 spans extents 2 and 3. Because
      extent 2 is larger than the directory block size, the readahead code
      erroneously assumes the block is contiguous and issues a readahead
      based on the physical mapping of the first fsb of the dirblk. This
      results in read verifier failure and a spurious corruption or crc
      failure, depending on the filesystem format.
      
      Further, the subsequent readahead code responsible for walking
      through the physical table doesn't correctly advance the physical
      block reference for dirblk 2. Instead of advancing two physical
      filesystem blocks, the first iteration of the loop advances 1 block
      (correctly), but the subsequent iteration advances 2 more physical
      blocks because the next physical extent (extent 3, above) happens to
      cover more than dirblk 2. At this point, the higher-level directory
      block walking is completely off the rails of the actual physical
      layout of the directory for the respective mapping table.
      
      Update the contiguous dirblock logic to consider the current offset
      in the physical extent to avoid issuing directory readahead to
      unrelated blocks. Also, update the mapping table advancing code to
      consider the current offset within the current dirblock to avoid
      advancing the mapping reference too far beyond the dirblock.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      95487d4b
    • Eric Sandeen's avatar
      xfs: handle array index overrun in xfs_dir2_leaf_readbuf() · 93bd1698
      Eric Sandeen authored
      commit 023cc840 upstream.
      
      Carlos had a case where "find" seemed to start spinning
      forever and never return.
      
      This was on a filesystem with non-default multi-fsb (8k)
      directory blocks, and a fragmented directory with extents
      like this:
      
      0:[0,133646,2,0]
      1:[2,195888,1,0]
      2:[3,195890,1,0]
      3:[4,195892,1,0]
      4:[5,195894,1,0]
      5:[6,195896,1,0]
      6:[7,195898,1,0]
      7:[8,195900,1,0]
      8:[9,195902,1,0]
      9:[10,195908,1,0]
      10:[11,195910,1,0]
      11:[12,195912,1,0]
      12:[13,195914,1,0]
      ...
      
      i.e. the first extent is a contiguous 2-fsb dir block, but
      after that it is fragmented into 1 block extents.
      
      At the top of the readdir path, we allocate a mapping array
      which (for this filesystem geometry) can hold 10 extents; see
      the assignment to map_info->map_size.  During readdir, we are
      therefore able to map extents 0 through 9 above into the array
      for readahead purposes.  If we count by 2, we see that the last
      mapped index (9) is the first block of a 2-fsb directory block.
      
      At the end of xfs_dir2_leaf_readbuf() we have 2 loops to fill
      more readahead; the outer loop assumes one full dir block is
      processed each loop iteration, and an inner loop that ensures
      that this is so by advancing to the next extent until a full
      directory block is mapped.
      
      The problem is that this inner loop may step past the last
      extent in the mapping array as it tries to reach the end of
      the directory block.  This will read garbage for the extent
      length, and as a result the loop control variable 'j' may
      become corrupted and never fail the loop conditional.
      
      The number of valid mappings we have in our array is stored
      in map->map_valid, so stop this inner loop based on that limit.
      
      There is an ASSERT at the top of the outer loop for this
      same condition, but we never made it out of the inner loop,
      so the ASSERT never fired.
      
      Huge appreciation for Carlos for debugging and isolating
      the problem.
      Debugged-and-analyzed-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Tested-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: default avatarBill O'Donnell <billodo@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      93bd1698