1. 12 Feb, 2023 9 commits
    • Dave Chinner's avatar
      xfs: rework xfs_alloc_vextent() · ecd788a9
      Dave Chinner authored
      It's a multiplexing mess that can be greatly simplified, and really
      needs to be simplified to allow active per-ag references to
      propagate from initial AG selection code the the bmapi code.
      
      This splits the code out into separate a parameter checking
      function, an iterator function, and allocation completion functions
      and then implements the individual policies using these functions.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      ecd788a9
    • Dave Chinner's avatar
      xfs: introduce xfs_for_each_perag_wrap() · 76257a15
      Dave Chinner authored
      In several places we iterate every AG from a specific start agno and
      wrap back to the first AG when we reach the end of the filesystem to
      continue searching. We don't have a primitive for this iteration
      yet, so add one for conversion of these algorithms to per-ag based
      iteration.
      
      The filestream AG select code is a mess, and this initially makes it
      worse. The per-ag selection needs to be driven completely into the
      filestream code to clean this up and it will be done in a future
      patch that makes the filestream allocator use active per-ag
      references correctly.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      76257a15
    • Dave Chinner's avatar
      xfs: perags need atomic operational state · 7ac2ff8b
      Dave Chinner authored
      We currently don't have any flags or operational state in the
      xfs_perag except for the pagf_init and pagi_init flags. And the
      agflreset flag. Oh, there's also the pagf_metadata and pagi_inodeok
      flags, too.
      
      For controlling per-ag operations, we are going to need some atomic
      state flags. Hence add an opstate field similar to what we already
      have in the mount and log, and convert all these state flags across
      to atomic bit operations.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      7ac2ff8b
    • Dave Chinner's avatar
      xfs: convert xfs_ialloc_next_ag() to an atomic · 20a5eab4
      Dave Chinner authored
      This is currently a spinlock lock protected rotor which can be
      implemented with a single atomic operation. Change it to be more
      efficient and get rid of the m_agirotor_lock. Noticed while
      converting the inode allocation AG selection loop to active perag
      references.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      20a5eab4
    • Dave Chinner's avatar
      xfs: inobt can use perags in many more places than it does · bab8b795
      Dave Chinner authored
      Lots of code in the inobt infrastructure is passed both xfs_mount
      and perags. We only need perags for the per-ag inode allocation
      code, so reduce the duplication by passing only the perags as the
      primary object.
      
      This ends up reducing the code size by a bit:
      
      	   text    data     bss     dec     hex filename
      orig	1138878  323979     548 1463405  16546d (TOTALS)
      patched	1138709  323979     548 1463236  1653c4 (TOTALS)
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      bab8b795
    • Dave Chinner's avatar
      xfs: use active perag references for inode allocation · dedab3e4
      Dave Chinner authored
      Convert the inode allocation routines to use active perag references
      or references held by callers rather than grab their own. Also drive
      the perag further inwards to replace xfs_mounts when doing
      operations on a specific AG.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      dedab3e4
    • Dave Chinner's avatar
      xfs: convert xfs_imap() to take a perag · 498f0adb
      Dave Chinner authored
      Callers have referenced perags but they don't pass it into
      xfs_imap() so it takes it's own reference. Fix that so we can change
      inode allocation over to using active references.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      498f0adb
    • Dave Chinner's avatar
      xfs: rework the perag trace points to be perag centric · 368e2d09
      Dave Chinner authored
      So that they all output the same information in the traces to make
      debugging refcount issues easier.
      
      This means that all the lookup/drop functions no longer need to use
      the full memory barrier atomic operations (atomic*_return()) so
      will have less overhead when tracing is off. The set/clear tag
      tracepoints no longer abuse the reference count to pass the tag -
      the tag being cleared is obvious from the _RET_IP_ that is recorded
      in the trace point.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      368e2d09
    • Dave Chinner's avatar
      xfs: active perag reference counting · c4d5660a
      Dave Chinner authored
      We need to be able to dynamically remove instantiated AGs from
      memory safely, either for shrinking the filesystem or paging AG
      state in and out of memory (e.g. supporting millions of AGs). This
      means we need to be able to safely exclude operations from accessing
      perags while dynamic removal is in progress.
      
      To do this, introduce the concept of active and passive references.
      Active references are required for high level operations that make
      use of an AG for a given operation (e.g. allocation) and pin the
      perag in memory for the duration of the operation that is operating
      on the perag (e.g. transaction scope). This means we can fail to get
      an active reference to an AG, hence callers of the new active
      reference API must be able to handle lookup failure gracefully.
      
      Passive references are used in low level code, where we might need
      to access the perag structure for the purposes of completing high
      level operations. For example, buffers need to use passive
      references because:
      - we need to be able to do metadata IO during operations like grow
        and shrink transactions where high level active references to the
        AG have already been blocked
      - buffers need to pin the perag until they are reclaimed from
        memory, something that high level code has no direct control over.
      - unused cached buffers should not prevent a shrink from being
        started.
      
      Hence we have active references that will form exclusion barriers
      for operations to be performed on an AG, and passive references that
      will prevent reclaim of the perag until all objects with passive
      references have been reclaimed themselves.
      
      This patch introduce xfs_perag_grab()/xfs_perag_rele() as the API
      for active AG reference functionality. We also need to convert the
      for_each_perag*() iterators to use active references, which will
      start the process of converting high level code over to using active
      references. Conversion of non-iterator based code to active
      references will be done in followup patches.
      
      Note that the implementation using reference counting is really just
      a development vehicle for the API to ensure we don't have any leaks
      in the callers. Once we need to remove perag structures from memory
      dyanmically, we will need a much more robust per-ag state transition
      mechanism for preventing new references from being taken while we
      wait for existing references to drain before removal from memory can
      occur....
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      c4d5660a
  2. 10 Feb, 2023 9 commits
    • Dave Chinner's avatar
      xfs: don't assert fail on transaction cancel with deferred ops · 55d5c3a3
      Dave Chinner authored
      We can error out of an allocation transaction when updating BMBT
      blocks when things go wrong. This can be a btree corruption, and
      unexpected ENOSPC, etc. In these cases, we already have deferred ops
      queued for the first allocation that has been done, and we just want
      to cancel out the transaction and shut down the filesystem on error.
      
      In fact, we do just that for production systems - the assert that we
      can't have a transaction with defer ops attached unless we are
      already shut down is bogus and gets in the way of debugging
      whatever issue is actually causing the transaction to be cancelled.
      
      Remove the assert because it is causing spurious test failures to
      hang test machines.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      55d5c3a3
    • Dave Chinner's avatar
      xfs: t_firstblock is tracking AGs not blocks · 692b6cdd
      Dave Chinner authored
      The tp->t_firstblock field is now raelly tracking the highest AG we
      have locked, not the block number of the highest allocation we've
      made. It's purpose is to prevent AGF locking deadlocks, so rename it
      to "highest AG" and simplify the implementation to just track the
      agno rather than a fsbno.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      692b6cdd
    • Dave Chinner's avatar
      xfs: drop firstblock constraints from allocation setup · 36b6ad2d
      Dave Chinner authored
      Now that xfs_alloc_vextent() does all the AGF deadlock prevention
      filtering for multiple allocations in a single transaction, we no
      longer need the allocation setup code to care about what AGs we
      might already have locked.
      
      Hence we can remove all the "nullfb" conditional logic in places
      like xfs_bmap_btalloc() and instead have them focus simply on
      setting up locality constraints. If the allocation fails due to
      AGF lock filtering in xfs_alloc_vextent, then we just fall back as
      we normally do to more relaxed allocation constraints.
      
      As a result, any allocation that allows AG scanning (i.e. not
      confined to a single AG) and does not force a worst case full
      filesystem scan will now be able to attempt allocation from AGs
      lower than that defined by tp->t_firstblock. This is because
      xfs_alloc_vextent() allows try-locking of the AGFs and hence enables
      low space algorithms to at least -try- to get space from AGs lower
      than the one that we have currently locked and allocated from. This
      is a significant improvement in the low space allocation algorithm.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      36b6ad2d
    • Dave Chinner's avatar
      xfs: block reservation too large for minleft allocation · d5753847
      Dave Chinner authored
      When we enter xfs_bmbt_alloc_block() without having first allocated
      a data extent (i.e. tp->t_firstblock == NULLFSBLOCK) because we
      are doing something like unwritten extent conversion, the transaction
      block reservation is used as the minleft value.
      
      This works for operations like unwritten extent conversion, but it
      assumes that the block reservation is only for a BMBT split. THis is
      not always true, and sometimes results in larger than necessary
      minleft values being set. We only actually need enough space for a
      btree split, something we already handle correctly in
      xfs_bmapi_write() via the xfs_bmapi_minleft() calculation.
      
      We should use xfs_bmapi_minleft() in xfs_bmbt_alloc_block() to
      calculate the number of blocks a BMBT split on this inode is going to
      require, not use the transaction block reservation that contains the
      maximum number of blocks this transaction may consume in it...
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      d5753847
    • Dave Chinner's avatar
      xfs: prefer free inodes at ENOSPC over chunk allocation · f08f984c
      Dave Chinner authored
      When an XFS filesystem has free inodes in chunks already allocated
      on disk, it will still allocate new inode chunks if the target AG
      has no free inodes in it. Normally, this is a good idea as it
      preserves locality of all the inodes in a given directory.
      
      However, at ENOSPC this can lead to using the last few remaining
      free filesystem blocks to allocate a new chunk when there are many,
      many free inodes that could be allocated without consuming free
      space. This results in speeding up the consumption of the last few
      blocks and inode create operations then returning ENOSPC when there
      free inodes available because we don't have enough block left in the
      filesystem for directory creation reservations to proceed.
      
      Hence when we are near ENOSPC, we should be attempting to preserve
      the remaining blocks for directory block allocation rather than
      using them for unnecessary inode chunk creation.
      
      This particular behaviour is exposed by xfs/294, when it drives to
      ENOSPC on empty file creation whilst there are still thousands of
      free inodes available for allocation in other AGs in the filesystem.
      
      Hence, when we are within 1% of ENOSPC, change the inode allocation
      behaviour to prefer to use existing free inodes over allocating new
      inode chunks, even though it results is poorer locality of the data
      set. It is more important for the allocations to be space efficient
      near ENOSPC than to have optimal locality for performance, so lets
      modify the inode AG selection code to reflect that fact.
      
      This allows generic/294 to not only pass with this allocator rework
      patchset, but to increase the number of post-ENOSPC empty inode
      allocations to from ~600 to ~9080 before we hit ENOSPC on the
      directory create transaction reservation.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      f08f984c
    • Dave Chinner's avatar
      xfs: fix low space alloc deadlock · 1dd0510f
      Dave Chinner authored
      I've recently encountered an ABBA deadlock with g/476. The upcoming
      changes seem to make this much easier to hit, but the underlying
      problem is a pre-existing one.
      
      Essentially, if we select an AG for allocation, then lock the AGF
      and then fail to allocate for some reason (e.g. minimum length
      requirements cannot be satisfied), then we drop out of the
      allocation with the AGF still locked.
      
      The caller then modifies the allocation constraints - usually
      loosening them up - and tries again. This can result in trying to
      access AGFs that are lower than the AGF we already have locked from
      the failed attempt. e.g. the failed attempt skipped several AGs
      before failing, so we have locks an AG higher than the start AG.
      Retrying the allocation from the start AG then causes us to violate
      AGF lock ordering and this can lead to deadlocks.
      
      The deadlock exists even if allocation succeeds - we can do a
      followup allocations in the same transaction for BMBT blocks that
      aren't guaranteed to be in the same AG as the original, and can move
      into higher AGs. Hence we really need to move the tp->t_firstblock
      tracking down into xfs_alloc_vextent() where it can be set when we
      exit with a locked AG.
      
      xfs_alloc_vextent() can also check there if the requested
      allocation falls within the allow range of AGs set by
      tp->t_firstblock. If we can't allocate within the range set, we have
      to fail the allocation. If we are allowed to to non-blocking AGF
      locking, we can ignore the AG locking order limitations as we can
      use try-locks for the first iteration over requested AG range.
      
      This invalidates a set of post allocation asserts that check that
      the allocation is always above tp->t_firstblock if it is set.
      Because we can use try-locks to avoid the deadlock in some
      circumstances, having a pre-existing locked AGF doesn't always
      prevent allocation from lower order AGFs. Hence those ASSERTs need
      to be removed.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      1dd0510f
    • Darrick J. Wong's avatar
      xfs: revert commit 8954c44f · dd07bb8b
      Darrick J. Wong authored
      The name passed into __xfs_xattr_put_listent is exactly namelen bytes
      long and not null-terminated.  Passing namelen+1 to the strscpy function
      
          strscpy(offset, (char *)name, namelen + 1);
      
      is therefore wrong.  Go back to the old code, which works fine because
      strncpy won't find a null in @name and stops after namelen bytes.  It
      really could be a memcpy call, but it worked for years.
      
      Reported-by: syzbot+898115bc6d7140437215@syzkaller.appspotmail.com
      Fixes: 8954c44f ("xfs: use strscpy() to instead of strncpy()")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      dd07bb8b
    • Thomas Weißschuh's avatar
      xfs: make kobj_type structures constant · 2ee83335
      Thomas Weißschuh authored
      Since commit ee6d3dd4 ("driver core: make kobj_type constant.")
      the driver core allows the usage of const struct kobj_type.
      
      Take advantage of this to constify the structure definitions to prevent
      modification at runtime.
      Signed-off-by: default avatarThomas Weißschuh <linux@weissschuh.net>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      2ee83335
    • Donald Douwsma's avatar
      xfs: allow setting full range of panic tags · 167ce4cb
      Donald Douwsma authored
      xfs will not allow combining other panic masks with
      XFS_PTAG_VERIFIER_ERROR.
      
       # sysctl fs.xfs.panic_mask=511
       sysctl: setting key "fs.xfs.panic_mask": Invalid argument
       fs.xfs.panic_mask = 511
      
      Update to the maximum value that can be set to allow the full range of
      masks. Do this using a mask of possible values to prevent this happening
      again as suggested by Darrick.
      
      Fixes: d519da41 ("xfs: Introduce XFS_PTAG_VERIFIER_ERROR panic mask")
      Signed-off-by: default avatarDonald Douwsma <ddouwsma@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      167ce4cb
  3. 05 Feb, 2023 10 commits
    • Dave Chinner's avatar
      xfs: don't use BMBT btree split workers for IO completion · c85007e2
      Dave Chinner authored
      When we split a BMBT due to record insertion, we offload it to a
      worker thread because we can be deep in the stack when we try to
      allocate a new block for the BMBT. Allocation can use several
      kilobytes of stack (full memory reclaim, swap and/or IO path can
      end up on the stack during allocation) and we can already be several
      kilobytes deep in the stack when we need to split the BMBT.
      
      A recent workload demonstrated a deadlock in this BMBT split
      offload. It requires several things to happen at once:
      
      1. two inodes need a BMBT split at the same time, one must be
      unwritten extent conversion from IO completion, the other must be
      from extent allocation.
      
      2. there must be a no available xfs_alloc_wq worker threads
      available in the worker pool.
      
      3. There must be sustained severe memory shortages such that new
      kworker threads cannot be allocated to the xfs_alloc_wq pool for
      both threads that need split work to be run
      
      4. The split work from the unwritten extent conversion must run
      first.
      
      5. when the BMBT block allocation runs from the split work, it must
      loop over all AGs and not be able to either trylock an AGF
      successfully, or each AGF is is able to lock has no space available
      for a single block allocation.
      
      6. The BMBT allocation must then attempt to lock the AGF that the
      second task queued to the rescuer thread already has locked before
      it finds an AGF it can allocate from.
      
      At this point, we have an ABBA deadlock between tasks queued on the
      xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task
      holding the AGF lock can't be run by the rescuer thread until the
      task the rescuer thread is runing gets the AGF lock....
      
      This is a highly improbably series of events, but there it is.
      
      There's a couple of ways to fix this, but the easiest way to ensure
      that we only punt tasks with a locked AGF that holds enough space
      for the BMBT block allocations to the worker thread.
      
      This works for unwritten extent conversion in IO completion (which
      doesn't have a locked AGF and space reservations) because we have
      tight control over the IO completion stack. It is typically only 6
      functions deep when xfs_btree_split() is called because we've
      already offloaded the IO completion work to a worker thread and
      hence we don't need to worry about stack overruns here.
      
      The other place we can be called for a BMBT split without a
      preceeding allocation is __xfs_bunmapi() when punching out the
      center of an existing extent. We don't remove extents in the IO
      path, so these operations don't tend to be called with a lot of
      stack consumed. Hence we don't really need to ship the split off to
      a worker thread in these cases, either.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      c85007e2
    • Darrick J. Wong's avatar
      xfs: fix confusing variable names in xfs_refcount_item.c · 01a3af22
      Darrick J. Wong authored
      Variable names in this code module are inconsistent and confusing.
      xfs_phys_extent describe physical mappings, so rename them "pmap".
      xfs_refcount_intents describe refcount intents, so rename them "ri".
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      01a3af22
    • Darrick J. Wong's avatar
      xfs: pass refcount intent directly through the log intent code · 0b11553e
      Darrick J. Wong authored
      Pass the incore refcount intent through the CUI logging code instead of
      repeatedly boxing and unboxing parameters.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      0b11553e
    • Darrick J. Wong's avatar
      xfs: fix confusing variable names in xfs_rmap_item.c · ffaa196f
      Darrick J. Wong authored
      Variable names in this code module are inconsistent and confusing.
      xfs_map_extent describe file mappings, so rename them "map".
      xfs_rmap_intents describe block mapping intents, so rename them "ri".
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      ffaa196f
    • Darrick J. Wong's avatar
      xfs: pass rmap space mapping directly through the log intent code · 1534328b
      Darrick J. Wong authored
      Pass the incore rmap space mapping through the RUI logging code instead
      of repeatedly boxing and unboxing parameters.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      1534328b
    • Darrick J. Wong's avatar
      xfs: fix confusing xfs_extent_item variable names · 578c714b
      Darrick J. Wong authored
      Change the name of all pointers to xfs_extent_item structures to "xefi"
      to make the name consistent and because the current selections ("new"
      and "free") mean other things in C.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      578c714b
    • Darrick J. Wong's avatar
      xfs: pass xfs_extent_free_item directly through the log intent code · 72ba4555
      Darrick J. Wong authored
      Pass the incore xfs_extent_free_item through the EFI logging code
      instead of repeatedly boxing and unboxing parameters.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      72ba4555
    • Darrick J. Wong's avatar
      xfs: fix confusing variable names in xfs_bmap_item.c · f3ebac4c
      Darrick J. Wong authored
      Variable names in this code module are inconsistent and confusing.
      xfs_map_extent describe file mappings, so rename them "map".
      xfs_bmap_intents describe block mapping intents, so rename them "bi".
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      f3ebac4c
    • Darrick J. Wong's avatar
      xfs: pass the xfs_bmbt_irec directly through the log intent code · ddccb81b
      Darrick J. Wong authored
      Instead of repeatedly boxing and unboxing the incore extent mapping
      structure as it passes through the BUI code, pass the pointer directly
      through.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      ddccb81b
    • Xu Panda's avatar
      xfs: use strscpy() to instead of strncpy() · 8954c44f
      Xu Panda authored
      The implementation of strscpy() is more robust and safer.
      That's now the recommended way to copy NUL-terminated strings.
      Signed-off-by: default avatarXu Panda <xu.panda@zte.com.cn>
      Signed-off-by: default avatarYang Yang <yang.yang29@zte.com.cn>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      8954c44f
  4. 29 Jan, 2023 6 commits
  5. 28 Jan, 2023 6 commits