1. 13 Oct, 2020 3 commits
    • Darrick J. Wong's avatar
      xfs: annotate grabbing the realtime bitmap/summary locks in growfs · ace74e79
      Darrick J. Wong authored
      Use XFS_ILOCK_RT{BITMAP,SUM} to annotate grabbing the rt bitmap and
      summary locks when we grow the realtime volume, just like we do most
      everywhere else.  This shuts up lockdep warnings about grabbing the
      ILOCK class of locks recursively:
      
      ============================================
      WARNING: possible recursive locking detected
      5.9.0-rc4-djw #rc4 Tainted: G           O
      --------------------------------------------
      xfs_growfs/4841 is trying to acquire lock:
      ffff888035acc230 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xac/0x1a0 [xfs]
      
      but task is already holding lock:
      ffff888035acedb0 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xac/0x1a0 [xfs]
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&xfs_nondir_ilock_class);
        lock(&xfs_nondir_ilock_class);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      ace74e79
    • Darrick J. Wong's avatar
      xfs: make xfs_growfs_rt update secondary superblocks · 7249c95a
      Darrick J. Wong authored
      When we call growfs on the data device, we update the secondary
      superblocks to reflect the updated filesystem geometry.  We need to do
      this for growfs on the realtime volume too, because a future xfs_repair
      run could try to fix the filesystem using a backup superblock.
      
      This was observed by the online superblock scrubbers while running
      xfs/233.  One can also trigger this by growing an rt volume, cycling the
      mount, and creating new rt files.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      7249c95a
    • Darrick J. Wong's avatar
      xfs: fix realtime bitmap/summary file truncation when growing rt volume · f4c32e87
      Darrick J. Wong authored
      The realtime bitmap and summary files are regular files that are hidden
      away from the directory tree.  Since they're regular files, inode
      inactivation will try to purge what it thinks are speculative
      preallocations beyond the incore size of the file.  Unfortunately,
      xfs_growfs_rt forgets to update the incore size when it resizes the
      inodes, with the result that inactivating the rt inodes at unmount time
      will cause their contents to be truncated.
      
      Fix this by updating the incore size when we change the ondisk size as
      part of updating the superblock.  Note that we don't do this when we're
      allocating blocks to the rt inodes because we actually want those blocks
      to get purged if the growfs fails.
      
      This fixes corruption complaints from the online rtsummary checker when
      running xfs/233.  Since that test requires rmap, one can also trigger
      this by growing an rt volume, cycling the mount, and creating rt files.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      f4c32e87
  2. 07 Oct, 2020 16 commits
    • Kaixu Xia's avatar
      xfs: fix the indent in xfs_trans_mod_dquot · e5b23740
      Kaixu Xia authored
      The formatting is strange in xfs_trans_mod_dquot, so do a reindent.
      Signed-off-by: default avatarKaixu Xia <kaixuxia@tencent.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      e5b23740
    • Kaixu Xia's avatar
      xfs: do the ASSERT for the arguments O_{u,g,p}dqpp · 97611f93
      Kaixu Xia authored
      If we pass in XFS_QMOPT_{U,G,P}QUOTA flags and different uid/gid/prid
      than them currently associated with the inode, the arguments
      O_{u,g,p}dqpp shouldn't be NULL, so add the ASSERT for them.
      Signed-off-by: default avatarKaixu Xia <kaixuxia@tencent.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      97611f93
    • Darrick J. Wong's avatar
      xfs: fix deadlock and streamline xfs_getfsmap performance · 8ffa90e1
      Darrick J. Wong authored
      Refactor xfs_getfsmap to improve its performance: instead of indirectly
      calling a function that copies one record to userspace at a time, create
      a shadow buffer in the kernel and copy the whole array once at the end.
      On the author's computer, this reduces the runtime on his /home by ~20%.
      
      This also eliminates a deadlock when running GETFSMAP against the
      realtime device.  The current code locks the rtbitmap to create
      fsmappings and copies them into userspace, having not released the
      rtbitmap lock.  If the userspace buffer is an mmap of a sparse file that
      itself resides on the realtime device, the write page fault will recurse
      into the fs for allocation, which will deadlock on the rtbitmap lock.
      
      Fixes: 4c934c7d ("xfs: report realtime space information via the rtbitmap")
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      8ffa90e1
    • Darrick J. Wong's avatar
      xfs: limit entries returned when counting fsmap records · acd1ac3a
      Darrick J. Wong authored
      If userspace asked fsmap to count the number of entries, we cannot
      return more than UINT_MAX entries because fmh_entries is u32.
      Therefore, stop counting if we hit this limit or else we will waste time
      to return truncated results.
      
      Fixes: e89c0413 ("xfs: implement the GETFSMAP ioctl")
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      acd1ac3a
    • Darrick J. Wong's avatar
      xfs: only relog deferred intent items if free space in the log gets low · 74f4d6a1
      Darrick J. Wong authored
      Now that we have the ability to ask the log how far the tail needs to be
      pushed to maintain its free space targets, augment the decision to relog
      an intent item so that we only do it if the log has hit the 75% full
      threshold.  There's no point in relogging an intent into the same
      checkpoint, and there's no need to relog if there's plenty of free space
      in the log.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      74f4d6a1
    • Darrick J. Wong's avatar
      xfs: expose the log push threshold · ed1575da
      Darrick J. Wong authored
      Separate the computation of the log push threshold and the push logic in
      xlog_grant_push_ail.  This enables higher level code to determine (for
      example) that it is holding on to a logged intent item and the log is so
      busy that it is more than 75% full.  In that case, it would be desirable
      to move the log item towards the head to release the tail, which we will
      cover in the next patch.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      ed1575da
    • Darrick J. Wong's avatar
      xfs: periodically relog deferred intent items · 4e919af7
      Darrick J. Wong authored
      There's a subtle design flaw in the deferred log item code that can lead
      to pinning the log tail.  Taking up the defer ops chain examples from
      the previous commit, we can get trapped in sequences like this:
      
      Caller hands us a transaction t0 with D0-D3 attached.  The defer ops
      chain will look like the following if the transaction rolls succeed:
      
      t1: D0(t0), D1(t0), D2(t0), D3(t0)
      t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
      t3: d5(t1), D1(t0), D2(t0), D3(t0)
      ...
      t9: d9(t7), D3(t0)
      t10: D3(t0)
      t11: d10(t10), d11(t10)
      t12: d11(t10)
      
      In transaction 9, we finish d9 and try to roll to t10 while holding onto
      an intent item for D3 that we logged in t0.
      
      The previous commit changed the order in which we place new defer ops in
      the defer ops processing chain to reduce the maximum chain length.  Now
      make xfs_defer_finish_noroll capable of relogging the entire chain
      periodically so that we can always move the log tail forward.  Most
      chains will never get relogged, except for operations that generate very
      long chains (large extents containing many blocks with different sharing
      levels) or are on filesystems with small logs and a lot of ongoing
      metadata updates.
      
      Callers are now required to ensure that the transaction reservation is
      large enough to handle logging done items and new intent items for the
      maximum possible chain length.  Most callers are careful to keep the
      chain lengths low, so the overhead should be minimal.
      
      The decision to relog an intent item is made based on whether the intent
      was logged in a previous checkpoint, since there's no point in relogging
      an intent into the same checkpoint.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      4e919af7
    • Darrick J. Wong's avatar
      xfs: change the order in which child and parent defer ops are finished · 27dada07
      Darrick J. Wong authored
      The defer ops code has been finishing items in the wrong order -- if a
      top level defer op creates items A and B, and finishing item A creates
      more defer ops A1 and A2, we'll put the new items on the end of the
      chain and process them in the order A B A1 A2.  This is kind of weird,
      since it's convenient for programmers to be able to think of A and B as
      an ordered sequence where all the sub-tasks for A must finish before we
      move on to B, e.g. A A1 A2 D.
      
      Right now, our log intent items are not so complex that this matters,
      but this will become important for the atomic extent swapping patchset.
      In order to maintain correct reference counting of extents, we have to
      unmap and remap extents in that order, and we want to complete that work
      before moving on to the next range that the user wants to swap.  This
      patch fixes defer ops to satsify that requirement.
      
      The primary symptom of the incorrect order was noticed in an early
      performance analysis of the atomic extent swap code.  An astonishingly
      large number of deferred work items accumulated when userspace requested
      an atomic update of two very fragmented files.  The cause of this was
      traced to the same ordering bug in the inner loop of
      xfs_defer_finish_noroll.
      
      If the ->finish_item method of a deferred operation queues new deferred
      operations, those new deferred ops are appended to the tail of the
      pending work list.  To illustrate, say that a caller creates a
      transaction t0 with four deferred operations D0-D3.  The first thing
      defer ops does is roll the transaction to t1, leaving us with:
      
      t1: D0(t0), D1(t0), D2(t0), D3(t0)
      
      Let's say that finishing each of D0-D3 will create two new deferred ops.
      After finish D0 and roll, we'll have the following chain:
      
      t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1)
      
      d4 and d5 were logged to t1.  Notice that while we're about to start
      work on D1, we haven't actually completed all the work implied by D0
      being finished.  So far we've been careful (or lucky) to structure the
      dfops callers such that D1 doesn't depend on d4 or d5 being finished,
      but this is a potential logic bomb.
      
      There's a second problem lurking.  Let's see what happens as we finish
      D1-D3:
      
      t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2)
      t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3)
      t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
      
      Let's say that d4-d11 are simple work items that don't queue any other
      operations, which means that we can complete each d4 and roll to t6:
      
      t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
      t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
      ...
      t11: d10(t4), d11(t4)
      t12: d11(t4)
      <done>
      
      When we try to roll to transaction #12, we're holding defer op d11,
      which we logged way back in t4.  This means that the tail of the log is
      pinned at t4.  If the log is very small or there are a lot of other
      threads updating metadata, this means that we might have wrapped the log
      and cannot get roll to t11 because there isn't enough space left before
      we'd run into t4.
      
      Let's shift back to the original failure.  I mentioned before that I
      discovered this flaw while developing the atomic file update code.  In
      that scenario, we have a defer op (D0) that finds a range of file blocks
      to remap, creates a handful of new defer ops to do that, and then asks
      to be continued with however much work remains.
      
      So, D0 is the original swapext deferred op.  The first thing defer ops
      does is rolls to t1:
      
      t1: D0(t0)
      
      We try to finish D0, logging d1 and d2 in the process, but can't get all
      the work done.  We log a done item and a new intent item for the work
      that D0 still has to do, and roll to t2:
      
      t2: D0'(t1), d1(t1), d2(t1)
      
      We roll and try to finish D0', but still can't get all the work done, so
      we log a done item and a new intent item for it, requeue D0 a second
      time, and roll to t3:
      
      t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2)
      
      If it takes 48 more rolls to complete D0, then we'll finally dispense
      with D0 in t50:
      
      t50: D<fifty primes>(t49), d1(t1), ..., d102(t50)
      
      We then try to roll again to get a chain like this:
      
      t51: d1(t1), d2(t1), ..., d101(t50), d102(t50)
      ...
      t152: d102(t50)
      <done>
      
      Notice that in rolling to transaction #51, we're holding on to a log
      intent item for d1 that was logged in transaction #1.  This means that
      the tail of the log is pinned at t1.  If the log is very small or there
      are a lot of other threads updating metadata, this means that we might
      have wrapped the log and cannot roll to t51 because there isn't enough
      space left before we'd run into t1.  This is of course problem #2 again.
      
      But notice the third problem with this scenario: we have 102 defer ops
      tied to this transaction!  Each of these items are backed by pinned
      kernel memory, which means that we risk OOM if the chains get too long.
      
      Yikes.  Problem #1 is a subtle logic bomb that could hit someone in the
      future; problem #2 applies (rarely) to the current upstream, and problem
      #3 applies to work under development.
      
      This is not how incremental deferred operations were supposed to work.
      The dfops design of logging in the same transaction an intent-done item
      and a new intent item for the work remaining was to make it so that we
      only have to juggle enough deferred work items to finish that one small
      piece of work.  Deferred log item recovery will find that first
      unfinished work item and restart it, no matter how many other intent
      items might follow it in the log.  Therefore, it's ok to put the new
      intents at the start of the dfops chain.
      
      For the first example, the chains look like this:
      
      t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
      t3: d5(t1), D1(t0), D2(t0), D3(t0)
      ...
      t9: d9(t7), D3(t0)
      t10: D3(t0)
      t11: d10(t10), d11(t10)
      t12: d11(t10)
      
      For the second example, the chains look like this:
      
      t1: D0(t0)
      t2: d1(t1), d2(t1), D0'(t1)
      t3: d2(t1), D0'(t1)
      t4: D0'(t1)
      t5: d1(t4), d2(t4), D0''(t4)
      ...
      t148: D0<50 primes>(t147)
      t149: d101(t148), d102(t148)
      t150: d102(t148)
      <done>
      
      This actually sucks more for pinning the log tail (we try to roll to t10
      while holding an intent item that was logged in t1) but we've solved
      problem #1.  We've also reduced the maximum chain length from:
      
          sum(all the new items) + nr_original_items
      
      to:
      
          max(new items that each original item creates) + nr_original_items
      
      This solves problem #3 by sharply reducing the number of defer ops that
      can be attached to a transaction at any given time.  The change makes
      the problem of log tail pinning worse, but is improvement we need to
      solve problem #2.  Actually solving #2, however, is left to the next
      patch.
      
      Note that a subsequent analysis of some hard-to-trigger reflink and COW
      livelocks on extremely fragmented filesystems (or systems running a lot
      of IO threads) showed the same symptoms -- uncomfortably large numbers
      of incore deferred work items and occasional stalls in the transaction
      grant code while waiting for log reservations.  I think this patch and
      the next one will also solve these problems.
      
      As originally written, the code used list_splice_tail_init instead of
      list_splice_init, so change that, and leave a short comment explaining
      our actions.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      27dada07
    • Darrick J. Wong's avatar
      xfs: fix an incore inode UAF in xfs_bui_recover · ff4ab5e0
      Darrick J. Wong authored
      In xfs_bui_item_recover, there exists a use-after-free bug with regards
      to the inode that is involved in the bmap replay operation.  If the
      mapping operation does not complete, we call xfs_bmap_unmap_extent to
      create a deferred op to finish the unmapping work, and we retain a
      pointer to the incore inode.
      
      Unfortunately, the very next thing we do is commit the transaction and
      drop the inode.  If reclaim tears down the inode before we try to finish
      the defer ops, we dereference garbage and blow up.  Therefore, create a
      way to join inodes to the defer ops freezer so that we can maintain the
      xfs_inode reference until we're done with the inode.
      
      Note: This imposes the requirement that there be enough memory to keep
      every incore inode in memory throughout recovery.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      ff4ab5e0
    • Darrick J. Wong's avatar
      xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering · 64a3f331
      Darrick J. Wong authored
      In most places in XFS, we have a specific order in which we gather
      resources: grab the inode, allocate a transaction, then lock the inode.
      xfs_bui_item_recover doesn't do it in that order, so fix it to be more
      consistent.  This also makes the error bailout code a bit less weird.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      64a3f331
    • Darrick J. Wong's avatar
      xfs: clean up bmap intent item recovery checking · 919522e8
      Darrick J. Wong authored
      The bmap intent item checking code in xfs_bui_item_recover is spread all
      over the function.  We should check the recovered log item at the top
      before we allocate any resources or do anything else, so do that.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      919522e8
    • Darrick J. Wong's avatar
      xfs: xfs_defer_capture should absorb remaining transaction reservation · 929b92f6
      Darrick J. Wong authored
      When xfs_defer_capture extracts the deferred ops and transaction state
      from a transaction, it should record the transaction reservation type
      from the old transaction so that when we continue the dfops chain, we
      still use the same reservation parameters.
      
      Doing this means that the log item recovery functions get to determine
      the transaction reservation instead of abusing tr_itruncate in yet
      another part of xfs.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      929b92f6
    • Darrick J. Wong's avatar
      xfs: xfs_defer_capture should absorb remaining block reservations · 4f9a60c4
      Darrick J. Wong authored
      When xfs_defer_capture extracts the deferred ops and transaction state
      from a transaction, it should record the remaining block reservations so
      that when we continue the dfops chain, we can reserve the same number of
      blocks to use.  We capture the reservations for both data and realtime
      volumes.
      
      This adds the requirement that every log intent item recovery function
      must be careful to reserve enough blocks to handle both itself and all
      defer ops that it can queue.  On the other hand, this enables us to do
      away with the handwaving block estimation nonsense that was going on in
      xlog_finish_defer_ops.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      4f9a60c4
    • Darrick J. Wong's avatar
      xfs: proper replay of deferred ops queued during log recovery · e6fff81e
      Darrick J. Wong authored
      When we replay unfinished intent items that have been recovered from the
      log, it's possible that the replay will cause the creation of more
      deferred work items.  As outlined in commit 50995582 ("xfs: log
      recovery should replay deferred ops in order"), later work items have an
      implicit ordering dependency on earlier work items.  Therefore, recovery
      must replay the items (both recovered and created) in the same order
      that they would have been during normal operation.
      
      For log recovery, we enforce this ordering by using an empty transaction
      to collect deferred ops that get created in the process of recovering a
      log intent item to prevent them from being committed before the rest of
      the recovered intent items.  After we finish committing all the
      recovered log items, we allocate a transaction with an enormous block
      reservation, splice our huge list of created deferred ops into that
      transaction, and commit it, thereby finishing all those ops.
      
      This is /really/ hokey -- it's the one place in XFS where we allow
      nested transactions; the splicing of the defer ops list is is inelegant
      and has to be done twice per recovery function; and the broken way we
      handle inode pointers and block reservations cause subtle use-after-free
      and allocator problems that will be fixed by this patch and the two
      patches after it.
      
      Therefore, replace the hokey empty transaction with a structure designed
      to capture each chain of deferred ops that are created as part of
      recovering a single unfinished log intent.  Finally, refactor the loop
      that replays those chains to do so using one transaction per chain.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      e6fff81e
    • Darrick J. Wong's avatar
      xfs: remove XFS_LI_RECOVERED · 901219bb
      Darrick J. Wong authored
      The ->iop_recover method of a log intent item removes the recovered
      intent item from the AIL by logging an intent done item and committing
      the transaction, so it's superfluous to have this flag check.  Nothing
      else uses it, so get rid of the flag entirely.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      901219bb
    • Darrick J. Wong's avatar
      xfs: remove xfs_defer_reset · b80b29d6
      Darrick J. Wong authored
      Remove this one-line helper since the assert is trivially true in one
      call site and the rest obscures a bitmask operation.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      b80b29d6
  3. 30 Sep, 2020 1 commit
  4. 25 Sep, 2020 13 commits
  5. 23 Sep, 2020 7 commits