1. 03 Aug, 2016 7 commits
    • Darrick J. Wong's avatar
      xfs: support btrees with overlapping intervals for keys · 2c813ad6
      Darrick J. Wong authored
      On a filesystem with both reflink and reverse mapping enabled, it's
      possible to have multiple rmap records referring to the same blocks on
      disk.  When overlapping intervals are possible, querying a classic
      btree to find all records intersecting a given interval is inefficient
      because we cannot use the left side of the search interval to filter
      out non-matching records the same way that we can use the existing
      btree key to filter out records coming after the right side of the
      search interval.  This will become important once we want to use the
      rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.
      
      (For the non-overlapping case, we can perform such queries trivially
      by starting at the left side of the interval and walking the tree
      until we pass the right side.)
      
      Therefore, extend the btree code to come closer to supporting
      intervals as a first-class record attribute.  This involves widening
      the btree node's key space to store both the lowest key reachable via
      the node pointer (as the btree does now) and the highest key reachable
      via the same pointer and teaching the btree modifying functions to
      keep the highest-key records up to date.
      
      This behavior can be turned on via a new btree ops flag so that btrees
      that cannot store overlapping intervals don't pay the overhead costs
      in terms of extra code and disk format changes.
      
      When we're deleting a record in a btree that supports overlapped
      interval records and the deletion results in two btree blocks being
      joined, we defer updating the high/low keys until after all possible
      joining (at higher levels in the tree) have finished.  At this point,
      the btree pointers at all levels have been updated to remove the empty
      blocks and we can update the low and high keys.
      
      When we're doing this, we must be careful to update the keys of all
      node pointers up to the root instead of stopping at the first set of
      keys that don't need updating.  This is because it's possible for a
      single deletion to cause joining of multiple levels of tree, and so
      we need to update everything going back to the root.
      
      The diff_two_keys functions return < 0, 0, or > 0 if key1 is less than,
      equal to, or greater than key2, respectively.  This is consistent
      with the rest of the kernel and the C library.
      
      In btree_updkeys(), we need to evaluate the force_all parameter before
      running the key diff to avoid reading uninitialized memory when we're
      forcing a key update.  This happens when we've allocated an empty slot
      at level N + 1 to point to a new block at level N and we're in the
      process of filling out the new keys.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      2c813ad6
    • Darrick J. Wong's avatar
      xfs: add function pointers for get/update keys to the btree · 70b22659
      Darrick J. Wong authored
      Add some function pointers to bc_ops to get the btree keys for
      leaf and node blocks, and to update parent keys of a block.
      Convert the _btree_updkey calls to use our new pointer, and
      modify the tree shape changing code to call the appropriate
      get_*_keys pointer instead of _btree_copy_keys because the
      overlapping btree has to calculate high key values.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      70b22659
    • Darrick J. Wong's avatar
      xfs: during btree split, save new block key & ptr for future insertion · e5821e57
      Darrick J. Wong authored
      When a btree block has to be split, we pass the new block's ptr from
      xfs_btree_split() back to xfs_btree_insert() via a pointer parameter;
      however, we pass the block's key through the cursor's record.  It is a
      little weird to "initialize" a record from a key since the non-key
      attributes will have garbage values.
      
      When we go to add support for interval queries, we have to be able to
      pass the lowest and highest keys accessible via a pointer.  There's no
      clean way to pass this back through the cursor's record field.
      Therefore, pass the key directly back to xfs_btree_insert() the same
      way that we pass the btree_ptr.
      
      As a bonus, we no longer need init_rec_from_key and can drop it from the
      codebase.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      e5821e57
    • Darrick J. Wong's avatar
      xfs: set *stat=1 after iroot realloc · 0d309791
      Darrick J. Wong authored
      If we make the inode root block of a btree unfull by expanding the
      root, we must set *stat to 1 to signal success, rather than leaving
      it uninitialized.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      0d309791
    • Darrick J. Wong's avatar
      xfs: fix locking of the rt bitmap/summary inodes · f4a0660d
      Darrick J. Wong authored
      When we're deleting realtime extents, we need to lock the summary
      inode in case we need to update the summary info to prevent an assert
      on the rsumip inode lock on a debug kernel.  While we're at it, fix
      the locking annotations so that we avoid triggering lockdep warnings.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      f4a0660d
    • Darrick J. Wong's avatar
      xfs: fix attr shortform structure alignment on cris · 3dadf901
      Darrick J. Wong authored
      Apparently cris doesn't require structure stride to align with the
      largest type in the struct, so list[0] isn't at offset 4 like it is
      everywhere else.  Fix this... insofar as existing XFSes on cris are
      screwed.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      3dadf901
    • Darrick J. Wong's avatar
      xfs: in _attrlist_by_handle, copy the cursor back to userspace · 0facef7f
      Darrick J. Wong authored
      When we're iterating inode xattrs by handle, we have to copy the
      cursor back to userspace so that a subsequent invocation actually
      retrieves subsequent contents.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      0facef7f
  2. 22 Jul, 2016 2 commits
  3. 21 Jul, 2016 5 commits
    • Dave Chinner's avatar
      xfs: bufferhead chains are invalid after end_page_writeback · 28b783e4
      Dave Chinner authored
      In xfs_finish_page_writeback(), we have a loop that looks like this:
      
              do {
                      if (off < bvec->bv_offset)
                              goto next_bh;
                      if (off > end)
                              break;
                      bh->b_end_io(bh, !error);
      next_bh:
                      off += bh->b_size;
              } while ((bh = bh->b_this_page) != head);
      
      The b_end_io function is end_buffer_async_write(), which will call
      end_page_writeback() once all the buffers have marked as no longer
      under IO.  This issue here is that the only thing currently
      protecting both the bufferhead chain and the page from being
      reclaimed is the PageWriteback state held on the page.
      
      While we attempt to limit the loop to just the buffers covered by
      the IO, we still read from the buffer size and follow the next
      pointer in the bufferhead chain. There is no guarantee that either
      of these are valid after the PageWriteback flag has been cleared.
      Hence, loops like this are completely unsafe, and result in
      use-after-free issues. One such problem was caught by Calvin Owens
      with KASAN:
      
      .....
       INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
        free_buffer_head+0x41/0x90
        __slab_free+0x1ed/0x340
        kmem_cache_free+0x270/0x300
        free_buffer_head+0x41/0x90
        try_to_free_buffers+0x171/0x240
        xfs_vm_releasepage+0xcb/0x3b0
        try_to_release_page+0x106/0x190
        shrink_page_list+0x118e/0x1a10
        shrink_inactive_list+0x42c/0xdf0
        shrink_zone_memcg+0xa09/0xfa0
        shrink_zone+0x2c3/0xbc0
      .....
       Call Trace:
        <IRQ>  [<ffffffff81e8b8e4>] dump_stack+0x68/0x94
        [<ffffffff8153a995>] print_trailer+0x115/0x1a0
        [<ffffffff81541174>] object_err+0x34/0x40
        [<ffffffff815436e7>] kasan_report_error+0x217/0x530
        [<ffffffff81543b33>] __asan_report_load8_noabort+0x43/0x50
        [<ffffffff819d651f>] xfs_destroy_ioend+0x3bf/0x4c0
        [<ffffffff819d69d4>] xfs_end_bio+0x154/0x220
        [<ffffffff81de0c58>] bio_endio+0x158/0x1b0
        [<ffffffff81dff61b>] blk_update_request+0x18b/0xb80
        [<ffffffff821baf57>] scsi_end_request+0x97/0x5a0
        [<ffffffff821c5558>] scsi_io_completion+0x438/0x1690
        [<ffffffff821a8d95>] scsi_finish_command+0x375/0x4e0
        [<ffffffff821c3940>] scsi_softirq_done+0x280/0x340
      
      
      Where the access is occuring during IO completion after the buffer
      had been freed from direct memory reclaim.
      
      Prevent use-after-free accidents in this end_io processing loop by
      pre-calculating the loop conditionals before calling bh->b_end_io().
      The loop is already limited to just the bufferheads covered by the
      IO in progress, so the offset checks are sufficient to prevent
      accessing buffers in the chain after end_page_writeback() has been
      called by the the bh->b_end_io() callout.
      
      Yet another example of why Bufferheads Must Die.
      
      cc: <stable@vger.kernel.org> # 4.7
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reported-and-Tested-by: default avatarCalvin Owens <calvinowens@fb.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      28b783e4
    • Dave Chinner's avatar
      xfs: allocate log vector buffers outside CIL context lock · b1c5ebb2
      Dave Chinner authored
      One of the problems we currently have with delayed logging is that
      under serious memory pressure we can deadlock memory reclaim. THis
      occurs when memory reclaim (such as run by kswapd) is reclaiming XFS
      inodes and issues a log force to unpin inodes that are dirty in the
      CIL.
      
      The CIL is pushed, but this will only occur once it gets the CIL
      context lock to ensure that all committing transactions are complete
      and no new transactions start being committed to the CIL while the
      push switches to a new context.
      
      The deadlock occurs when the CIL context lock is held by a
      committing process that is doing memory allocation for log vector
      buffers, and that allocation is then blocked on memory reclaim
      making progress. Memory reclaim, however, is blocked waiting for
      a log force to make progress, and so we effectively deadlock at this
      point.
      
      To solve this problem, we have to move the CIL log vector buffer
      allocation outside of the context lock so that memory reclaim can
      always make progress when it needs to force the log. The problem
      with doing this is that a CIL push can take place while we are
      determining if we need to allocate a new log vector buffer for
      an item and hence the current log vector may go away without
      warning. That means we canot rely on the existing log vector being
      present when we finally grab the context lock and so we must have a
      replacement buffer ready to go at all times.
      
      To ensure this, introduce a "shadow log vector" buffer that is
      always guaranteed to be present when we gain the CIL context lock
      and format the item. This shadow buffer may or may not be used
      during the formatting, but if the log item does not have an existing
      log vector buffer or that buffer is too small for the new
      modifications, we swap it for the new shadow buffer and format
      the modifications into that new log vector buffer.
      
      The result of this is that for any object we modify more than once
      in a given CIL checkpoint, we double the memory required
      to track dirty regions in the log. For single modifications then
      we consume the shadow log vectorwe allocate on commit, and that gets
      consumed by the checkpoint. However, if we make multiple
      modifications, then the second transaction commit will allocate a
      shadow log vector and hence we will end up with double the memory
      usage as only one of the log vectors is consumed by the CIL
      checkpoint. The remaining shadow vector will be freed when th elog
      item is freed.
      
      This can probably be optimised in future - access to the shadow log
      vector is serialised by the object lock (as opposited to the active
      log vector, which is controlled by the CIL context lock) and so we
      can probably free shadow log vector from some objects when the log
      item is marked clean on removal from the AIL.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      b1c5ebb2
    • Dave Chinner's avatar
      libxfs: directory node splitting does not have an extra block · 160ae76f
      Dave Chinner authored
      xfsprogs source commit 4280e59dcbc4cd8e01585efe788a68eb378048e8
      
      xfs_da3_split() has to handle all three versions of the
      directory/attribute btree structure. The attr tree is v1, the dir
      tre is v2 or v3. The main difference between the v1 and v2/3 trees
      is the way tree nodes are split - in the v1 tree we can require a
      double split to occur because the object to be inserted may be
      larger than the space made by splitting a leaf. In this case we need
      to do a double split - one to split the full leaf, then another to
      allocate an empty leaf block in the correct location for the new
      entry.  This does not happen with dir (v2/v3) formats as the objects
      being inserted are always guaranteed to fit into the new space in
      the split blocks.
      
      Indeed, for directories they *may* be an extra block on this buffer
      pointer. However, it's guaranteed not to be a leaf block (i.e. a
      directory data block) - the directory code only ever places hash
      index or free space blocks in this pointer (as a cursor of
      sorts), and so to use it as a directory data block will immediately
      corrupt the directory.
      
      The problem is that the code assumes that there may be extra blocks
      that we need to link into the tree once we've split the root, but
      this is not true for either dir or attr trees, because the extra
      attr block is always consumed by the last node split before we split
      the root. Hence the linking in an extra block is always wrong at the
      root split level, and this manifests itself in repair as a directory
      corruption in a repaired directory, leaving the directory rebuild
      incomplete.
      
      This is a dir v2 zero-day bug - it was in the initial dir v2 commit
      that was made back in February 1998.
      
      Fix this by ensuring the linking of the blocks after the root split
      never tries to make use of the extra blocks that may be held in the
      cursor. They are held there for other purposes and should never be
      touched by the root splitting code.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      160ae76f
    • Arnd Bergmann's avatar
      xfs: remove dax code from object file when disabled · f021bd07
      Arnd Bergmann authored
      We check IS_DAX(inode) before calling either xfs_file_dax_read or
      xfs_file_dax_write, and this will lead the call being optimized out at
      compile time when CONFIG_FS_DAX is disabled.
      
      However, the two functions are marked STATIC, so they become global
      symbols when CONFIG_XFS_DEBUG is set, leaving us with two unused global
      functions that call into an undefined function and a broken "allmodconfig"
      build:
      
      fs/built-in.o: In function `xfs_file_dax_read':
      fs/xfs/xfs_file.c:348: undefined reference to `dax_do_io'
      fs/built-in.o: In function `xfs_file_dax_write':
      fs/xfs/xfs_file.c:758: undefined reference to `dax_do_io'
      
      Marking the two functions 'static noinline' instead of 'STATIC' will let
      the compiler drop the symbols when there are no callers but avoid the
      implicit inlining.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Fixes: 16d4d435 ("xfs: split direct I/O and DAX path")
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      f021bd07
    • Brian Foster's avatar
      xfs: skip dirty pages in ->releasepage() · 99579cce
      Brian Foster authored
      XFS has had scattered reports of delalloc blocks present at
      ->releasepage() time. This results in a warning with a stack trace
      similar to the following:
      
       ...
       Call Trace:
        [<ffffffffa23c5b8f>] dump_stack+0x63/0x84
        [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
        [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
        [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
        [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
        [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
        [<ffffffffa21521c2>] try_to_release_page+0x32/0x50
        [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
        [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
        [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
        [<ffffffffa2168539>] kswapd+0x4f9/0x970
        [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
        [<ffffffffa20a0d99>] kthread+0xc9/0xe0
        [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
        [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
        [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
      
      This occurs because it is possible for shrink_active_list() to send
      pages marked dirty to ->releasepage() when certain buffer_head threshold
      conditions are met. shrink_active_list() doesn't check the page dirty
      state apparently to handle an old ext3 corner case where in some cases
      clean pages would not have the dirty bit cleared, thus it is up to the
      filesystem to determine how to handle the page.
      
      XFS currently handles the delalloc case properly, but this behavior
      makes the warning spurious. Update the XFS ->releasepage() handler to
      explicitly skip dirty pages. Retain the existing delalloc/unwritten
      checks so we continue to warn if such buffers exist on clean pages when
      they shouldn't.
      Diagnosed-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      99579cce
  4. 20 Jul, 2016 23 commits
  5. 21 Jun, 2016 3 commits