1. 12 Feb, 2023 30 commits
    • Dave Chinner's avatar
      xfs: remove xfs_filestream_select_ag() longest extent check · 3e43877a
      Dave Chinner authored
      Picking a new AG checks the longest free extent in the AG is valid,
      so there's no need to repeat the check in
      xfs_filestream_select_ag(). Remove it.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      3e43877a
    • Dave Chinner's avatar
      xfs: merge new filestream AG selection into xfs_filestream_select_ag() · ba34de8d
      Dave Chinner authored
      This is largely a wrapper around xfs_filestream_pick_ag() that
      repeats a lot of the lookups that we just merged back into
      xfs_filestream_select_ag() from the lookup code. Merge the
      xfs_filestream_new_ag() code back into _select_ag() to get rid
      of all the unnecessary logic.
      
      Indeed, this makes it obvious that if we have no parent inode,
      the filestreams allocator always selects AG 0 regardless of whether
      it is fit for purpose or not.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      ba34de8d
    • Dave Chinner's avatar
      xfs: merge filestream AG lookup into xfs_filestream_select_ag() · a52dc2ad
      Dave Chinner authored
      The lookup currently either returns the cached filestream AG or it
      calls xfs_filestreams_select_lengths() to looks up a new AG. This
      has verify the AG that is selected, so we end up doing "select a new
      AG loop in a couple of places when only one really is needed.  Merge
      the initial lookup functionality with the length selection so that
      we only need to do a single pick loop on lookup or verification
      failure.
      
      This undoes a lot of the factoring that enabled the selection to be
      moved over to the filestreams code. It makes
      xfs_filestream_select_ag() an awful messier, but it has to be made
      worse before it can get better in future patches...
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      a52dc2ad
    • Dave Chinner's avatar
      xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c · 8f7747ad
      Dave Chinner authored
      xfs_bmap_btalloc_filestreams() calls two filestreams functions to
      select the AG to allocate from. Both those functions end up in
      the same selection function that iterates all AGs multiple times.
      Worst case, xfs_bmap_btalloc_filestreams() can iterate all AGs 4
      times just to select the initial AG to allocate in.
      
      Move the AG selection to fs/xfs/xfs_filestreams.c as a single
      interface so that the inefficient AG interation is contained
      entirely within the filestreams code. This will allow the
      implementation to be simplified and made more efficient in future
      patches.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      8f7747ad
    • Dave Chinner's avatar
      xfs: use xfs_bmap_longest_free_extent() in filestreams · 05cf492a
      Dave Chinner authored
      The code in xfs_bmap_longest_free_extent() is open coded in
      xfs_filestream_pick_ag(). Export xfs_bmap_longest_free_extent and
      call it from the filestreams code instead.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      05cf492a
    • Dave Chinner's avatar
      xfs: get rid of notinit from xfs_bmap_longest_free_extent · 6b637ad0
      Dave Chinner authored
      It is only set if reading the AGF gets a EAGAIN error. Just return
      the EAGAIN error and handle that error in the callers.
      
      This means we can remove the not_init parameter from
      xfs_bmap_select_minlen(), too, because the use of not_init there is
      pessimistic. If we can't read the agf, it won't increase blen.
      
      The only time we actually care whether we checked all the AGFs for
      contiguous free space is when the best length is less than the
      minimum allocation length. If not_init is set, then we ignore blen
      and set the minimum alloc length to the absolute minimum, not the
      best length we know already is present.
      
      However, if blen is less than the minimum we're going to ignore it
      anyway, regardless of whether we scanned all the AGFs or not.  Hence
      not_init can go away, because we only use if blen is good from
      the scanned AGs otherwise we ignore it altogether and use minlen.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      6b637ad0
    • Dave Chinner's avatar
      xfs: factor out filestreams from xfs_bmap_btalloc_nullfb · 89563e7d
      Dave Chinner authored
      There's many if (filestreams) {} else {} branches in this function.
      Split it out into a filestreams specific function so that we can
      then work directly on cleaning up the filestreams code without
      impacting the rest of the allocation algorithms.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      89563e7d
    • Dave Chinner's avatar
      xfs: convert trim to use for_each_perag_range · 35bf2b1a
      Dave Chinner authored
      To convert it to using active perag references and hence make it
      shrink safe.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      35bf2b1a
    • Dave Chinner's avatar
      xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker · 3432ef61
      Dave Chinner authored
      Now that the AG iteration code in the core allocation code has been
      cleaned up, we can easily convert it to use a for_each_perag..()
      variant to use active references and skip AGs that it can't get
      active references on.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      3432ef61
    • Dave Chinner's avatar
      xfs: move the minimum agno checks into xfs_alloc_vextent_check_args · 8b813568
      Dave Chinner authored
      All of the allocation functions now extract the minimum allowed AG
      from the transaction and then use it in some way. The allocation
      functions that are restricted to a single AG all check if the
      AG requested can be allocated from and return an error if so. These
      all set args->agno appropriately.
      
      All the allocation functions that iterate AGs use it to calculate
      the scan start AG. args->agno is not set until the iterator starts
      walking AGs.
      
      Hence we can easily set up a conditional check against the minimum
      AG allowed in xfs_alloc_vextent_check_args() based on whether
      args->agno contains NULLAGNUMBER or not and move all the repeated
      setup code to xfs_alloc_vextent_check_args(), further simplifying
      the allocation functions.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      8b813568
    • Dave Chinner's avatar
      xfs: fold xfs_alloc_ag_vextent() into callers · 230e8fe8
      Dave Chinner authored
      We don't need the multiplexing xfs_alloc_ag_vextent() provided
      anymore - we can just call the exact/near/size variants directly.
      This allows us to remove args->type completely and stop using
      args->fsbno as an input to the allocator algorithms.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      230e8fe8
    • Dave Chinner's avatar
      xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno() · e4d17426
      Dave Chinner authored
      Move it from xfs_alloc_ag_vextent() so we can get rid of that layer.
      Rename xfs_alloc_vextent_set_fsbno() to xfs_alloc_vextent_finish()
      to indicate that it's function is finishing off the allocation that
      we've run now that it contains much more functionality.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      e4d17426
    • Dave Chinner's avatar
      xfs: introduce xfs_alloc_vextent_prepare() · 74b9aa63
      Dave Chinner authored
      Now that we have wrapper functions for each type of allocation we
      can ask for, we can start unravelling xfs_alloc_ag_vextent(). That
      is essentially just a prepare stage, the allocation multiplexer
      and a post-allocation accounting step is the allocation proceeded.
      
      The current xfs_alloc_vextent*() wrappers all have a prepare stage,
      the allocation operation and a post-allocation accounting step.
      
      We can consolidate this by moving the AG alloc prep code into the
      wrapper functions, the accounting code in the wrapper accounting
      functions, and cut out the multiplexer layer entirely.
      
      This patch consolidates the AG preparation stage.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      74b9aa63
    • Dave Chinner's avatar
      xfs: introduce xfs_alloc_vextent_exact_bno() · 5f36b2ce
      Dave Chinner authored
      Two of the callers to xfs_alloc_vextent_this_ag() actually want
      exact block number allocation, not anywhere-in-ag allocation. Split
      this out from _this_ag() as a first class citizen so no external
      extent allocation code needs to care about args->type anymore.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      5f36b2ce
    • Dave Chinner's avatar
      xfs: introduce xfs_alloc_vextent_near_bno() · db4710fd
      Dave Chinner authored
      The remaining callers of xfs_alloc_vextent() are all doing NEAR_BNO
      allocations. We can replace that function with a new
      xfs_alloc_vextent_near_bno() function that does this explicitly.
      
      We also multiplex NEAR_BNO allocations through
      xfs_alloc_vextent_this_ag via args->type. Replace all of these with
      direct calls to xfs_alloc_vextent_near_bno(), too.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      db4710fd
    • Dave Chinner's avatar
      xfs: use xfs_alloc_vextent_start_bno() where appropriate · 2a7f6d41
      Dave Chinner authored
      Change obvious callers of single AG allocation to use
      xfs_alloc_vextent_start_bno(). Callers no long need to specify
      XFS_ALLOCTYPE_START_BNO, and so the type can be driven inward and
      removed.
      
      While doing this, also pass the allocation target fsb as a parameter
      rather than encoding it in args->fsbno.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      2a7f6d41
    • Dave Chinner's avatar
      xfs: use xfs_alloc_vextent_first_ag() where appropriate · 319c9e87
      Dave Chinner authored
      Change obvious callers of single AG allocation to use
      xfs_alloc_vextent_first_ag(). This gets rid of
      XFS_ALLOCTYPE_FIRST_AG as the type used within
      xfs_alloc_vextent_first_ag() during iteration is _THIS_AG. Hence we
      can remove the setting of args->type from all the callers of
      _first_ag() and remove the alloctype.
      
      While doing this, pass the allocation target fsb as a parameter
      rather than encoding it in args->fsbno. This starts the process
      of making args->fsbno an output only variable rather than
      input/output.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      319c9e87
    • Dave Chinner's avatar
      xfs: factor xfs_bmap_btalloc() · 85843327
      Dave Chinner authored
      There are several different contexts xfs_bmap_btalloc() handles, and
      large chunks of the code execute independent allocation contexts.
      Try to untangle this mess a bit.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      85843327
    • Dave Chinner's avatar
      xfs: use xfs_alloc_vextent_this_ag() where appropriate · 74c36a86
      Dave Chinner authored
      Change obvious callers of single AG allocation to use
      xfs_alloc_vextent_this_ag(). Drive the per-ag grabbing out to the
      callers, too, so that callers with active references don't need
      to do new lookups just for an allocation in a context that already
      has a perag reference.
      
      The only remaining caller that does single AG allocation through
      xfs_alloc_vextent() is xfs_bmap_btalloc() with
      XFS_ALLOCTYPE_NEAR_BNO. That is going to need more untangling before
      it can be converted cleanly.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      74c36a86
    • Dave Chinner's avatar
      xfs: combine __xfs_alloc_vextent_this_ag and xfs_alloc_ag_vextent · 4811c933
      Dave Chinner authored
      There's a bit of a recursive conundrum around
      xfs_alloc_ag_vextent(). We can't first call xfs_alloc_ag_vextent()
      without preparing the AGFL for the allocation, and preparing the
      AGFL calls xfs_alloc_ag_vextent() to prepare the AGFL for the
      allocation. This "double allocation" requirement is not really clear
      from the current xfs_alloc_fix_freelist() calls that are sprinkled
      through the allocation code.
      
      It's not helped that xfs_alloc_ag_vextent() can actually allocate
      from the AGFL itself, but there's special code to prevent AGFL prep
      allocations from allocating from the free list it's trying to prep.
      The naming is also not consistent: args->wasfromfl is true when we
      allocated _from_ the free list, but the indication that we are
      allocating _for_ the free list is via checking that (args->resv ==
      XFS_AG_RESV_AGFL).
      
      So, lets make this "allocation required for allocation" situation
      clear by moving it all inside xfs_alloc_ag_vextent(). The freelist
      allocation is a specific XFS_ALLOCTYPE_THIS_AG allocation, which
      translated directly to xfs_alloc_ag_vextent_size() allocation.
      
      This enables us to replace __xfs_alloc_vextent_this_ag() with a call
      to xfs_alloc_ag_vextent(), and we drive the freelist fixing further
      into the per-ag allocation algorithm.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      4811c933
    • Dave Chinner's avatar
      xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags() · 2edf06a5
      Dave Chinner authored
      The core of the per-ag iteration is effectively doing a "this ag"
      allocation on one AG at a time. Use the same code to implement the
      core "this ag" allocation in both xfs_alloc_vextent_this_ag()
      and xfs_alloc_vextent_iterate_ags().
      
      This means we only call xfs_alloc_ag_vextent() from one place so we
      can easily collapse the call stack in future patches.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      2edf06a5
    • Dave Chinner's avatar
      xfs: rework xfs_alloc_vextent() · ecd788a9
      Dave Chinner authored
      It's a multiplexing mess that can be greatly simplified, and really
      needs to be simplified to allow active per-ag references to
      propagate from initial AG selection code the the bmapi code.
      
      This splits the code out into separate a parameter checking
      function, an iterator function, and allocation completion functions
      and then implements the individual policies using these functions.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      ecd788a9
    • Dave Chinner's avatar
      xfs: introduce xfs_for_each_perag_wrap() · 76257a15
      Dave Chinner authored
      In several places we iterate every AG from a specific start agno and
      wrap back to the first AG when we reach the end of the filesystem to
      continue searching. We don't have a primitive for this iteration
      yet, so add one for conversion of these algorithms to per-ag based
      iteration.
      
      The filestream AG select code is a mess, and this initially makes it
      worse. The per-ag selection needs to be driven completely into the
      filestream code to clean this up and it will be done in a future
      patch that makes the filestream allocator use active per-ag
      references correctly.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      76257a15
    • Dave Chinner's avatar
      xfs: perags need atomic operational state · 7ac2ff8b
      Dave Chinner authored
      We currently don't have any flags or operational state in the
      xfs_perag except for the pagf_init and pagi_init flags. And the
      agflreset flag. Oh, there's also the pagf_metadata and pagi_inodeok
      flags, too.
      
      For controlling per-ag operations, we are going to need some atomic
      state flags. Hence add an opstate field similar to what we already
      have in the mount and log, and convert all these state flags across
      to atomic bit operations.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      7ac2ff8b
    • Dave Chinner's avatar
      xfs: convert xfs_ialloc_next_ag() to an atomic · 20a5eab4
      Dave Chinner authored
      This is currently a spinlock lock protected rotor which can be
      implemented with a single atomic operation. Change it to be more
      efficient and get rid of the m_agirotor_lock. Noticed while
      converting the inode allocation AG selection loop to active perag
      references.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      20a5eab4
    • Dave Chinner's avatar
      xfs: inobt can use perags in many more places than it does · bab8b795
      Dave Chinner authored
      Lots of code in the inobt infrastructure is passed both xfs_mount
      and perags. We only need perags for the per-ag inode allocation
      code, so reduce the duplication by passing only the perags as the
      primary object.
      
      This ends up reducing the code size by a bit:
      
      	   text    data     bss     dec     hex filename
      orig	1138878  323979     548 1463405  16546d (TOTALS)
      patched	1138709  323979     548 1463236  1653c4 (TOTALS)
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      bab8b795
    • Dave Chinner's avatar
      xfs: use active perag references for inode allocation · dedab3e4
      Dave Chinner authored
      Convert the inode allocation routines to use active perag references
      or references held by callers rather than grab their own. Also drive
      the perag further inwards to replace xfs_mounts when doing
      operations on a specific AG.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      dedab3e4
    • Dave Chinner's avatar
      xfs: convert xfs_imap() to take a perag · 498f0adb
      Dave Chinner authored
      Callers have referenced perags but they don't pass it into
      xfs_imap() so it takes it's own reference. Fix that so we can change
      inode allocation over to using active references.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      498f0adb
    • Dave Chinner's avatar
      xfs: rework the perag trace points to be perag centric · 368e2d09
      Dave Chinner authored
      So that they all output the same information in the traces to make
      debugging refcount issues easier.
      
      This means that all the lookup/drop functions no longer need to use
      the full memory barrier atomic operations (atomic*_return()) so
      will have less overhead when tracing is off. The set/clear tag
      tracepoints no longer abuse the reference count to pass the tag -
      the tag being cleared is obvious from the _RET_IP_ that is recorded
      in the trace point.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      368e2d09
    • Dave Chinner's avatar
      xfs: active perag reference counting · c4d5660a
      Dave Chinner authored
      We need to be able to dynamically remove instantiated AGs from
      memory safely, either for shrinking the filesystem or paging AG
      state in and out of memory (e.g. supporting millions of AGs). This
      means we need to be able to safely exclude operations from accessing
      perags while dynamic removal is in progress.
      
      To do this, introduce the concept of active and passive references.
      Active references are required for high level operations that make
      use of an AG for a given operation (e.g. allocation) and pin the
      perag in memory for the duration of the operation that is operating
      on the perag (e.g. transaction scope). This means we can fail to get
      an active reference to an AG, hence callers of the new active
      reference API must be able to handle lookup failure gracefully.
      
      Passive references are used in low level code, where we might need
      to access the perag structure for the purposes of completing high
      level operations. For example, buffers need to use passive
      references because:
      - we need to be able to do metadata IO during operations like grow
        and shrink transactions where high level active references to the
        AG have already been blocked
      - buffers need to pin the perag until they are reclaimed from
        memory, something that high level code has no direct control over.
      - unused cached buffers should not prevent a shrink from being
        started.
      
      Hence we have active references that will form exclusion barriers
      for operations to be performed on an AG, and passive references that
      will prevent reclaim of the perag until all objects with passive
      references have been reclaimed themselves.
      
      This patch introduce xfs_perag_grab()/xfs_perag_rele() as the API
      for active AG reference functionality. We also need to convert the
      for_each_perag*() iterators to use active references, which will
      start the process of converting high level code over to using active
      references. Conversion of non-iterator based code to active
      references will be done in followup patches.
      
      Note that the implementation using reference counting is really just
      a development vehicle for the API to ensure we don't have any leaks
      in the callers. Once we need to remove perag structures from memory
      dyanmically, we will need a much more robust per-ag state transition
      mechanism for preventing new references from being taken while we
      wait for existing references to drain before removal from memory can
      occur....
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      c4d5660a
  2. 10 Feb, 2023 9 commits
    • Dave Chinner's avatar
      xfs: don't assert fail on transaction cancel with deferred ops · 55d5c3a3
      Dave Chinner authored
      We can error out of an allocation transaction when updating BMBT
      blocks when things go wrong. This can be a btree corruption, and
      unexpected ENOSPC, etc. In these cases, we already have deferred ops
      queued for the first allocation that has been done, and we just want
      to cancel out the transaction and shut down the filesystem on error.
      
      In fact, we do just that for production systems - the assert that we
      can't have a transaction with defer ops attached unless we are
      already shut down is bogus and gets in the way of debugging
      whatever issue is actually causing the transaction to be cancelled.
      
      Remove the assert because it is causing spurious test failures to
      hang test machines.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      55d5c3a3
    • Dave Chinner's avatar
      xfs: t_firstblock is tracking AGs not blocks · 692b6cdd
      Dave Chinner authored
      The tp->t_firstblock field is now raelly tracking the highest AG we
      have locked, not the block number of the highest allocation we've
      made. It's purpose is to prevent AGF locking deadlocks, so rename it
      to "highest AG" and simplify the implementation to just track the
      agno rather than a fsbno.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      692b6cdd
    • Dave Chinner's avatar
      xfs: drop firstblock constraints from allocation setup · 36b6ad2d
      Dave Chinner authored
      Now that xfs_alloc_vextent() does all the AGF deadlock prevention
      filtering for multiple allocations in a single transaction, we no
      longer need the allocation setup code to care about what AGs we
      might already have locked.
      
      Hence we can remove all the "nullfb" conditional logic in places
      like xfs_bmap_btalloc() and instead have them focus simply on
      setting up locality constraints. If the allocation fails due to
      AGF lock filtering in xfs_alloc_vextent, then we just fall back as
      we normally do to more relaxed allocation constraints.
      
      As a result, any allocation that allows AG scanning (i.e. not
      confined to a single AG) and does not force a worst case full
      filesystem scan will now be able to attempt allocation from AGs
      lower than that defined by tp->t_firstblock. This is because
      xfs_alloc_vextent() allows try-locking of the AGFs and hence enables
      low space algorithms to at least -try- to get space from AGs lower
      than the one that we have currently locked and allocated from. This
      is a significant improvement in the low space allocation algorithm.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      36b6ad2d
    • Dave Chinner's avatar
      xfs: block reservation too large for minleft allocation · d5753847
      Dave Chinner authored
      When we enter xfs_bmbt_alloc_block() without having first allocated
      a data extent (i.e. tp->t_firstblock == NULLFSBLOCK) because we
      are doing something like unwritten extent conversion, the transaction
      block reservation is used as the minleft value.
      
      This works for operations like unwritten extent conversion, but it
      assumes that the block reservation is only for a BMBT split. THis is
      not always true, and sometimes results in larger than necessary
      minleft values being set. We only actually need enough space for a
      btree split, something we already handle correctly in
      xfs_bmapi_write() via the xfs_bmapi_minleft() calculation.
      
      We should use xfs_bmapi_minleft() in xfs_bmbt_alloc_block() to
      calculate the number of blocks a BMBT split on this inode is going to
      require, not use the transaction block reservation that contains the
      maximum number of blocks this transaction may consume in it...
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      d5753847
    • Dave Chinner's avatar
      xfs: prefer free inodes at ENOSPC over chunk allocation · f08f984c
      Dave Chinner authored
      When an XFS filesystem has free inodes in chunks already allocated
      on disk, it will still allocate new inode chunks if the target AG
      has no free inodes in it. Normally, this is a good idea as it
      preserves locality of all the inodes in a given directory.
      
      However, at ENOSPC this can lead to using the last few remaining
      free filesystem blocks to allocate a new chunk when there are many,
      many free inodes that could be allocated without consuming free
      space. This results in speeding up the consumption of the last few
      blocks and inode create operations then returning ENOSPC when there
      free inodes available because we don't have enough block left in the
      filesystem for directory creation reservations to proceed.
      
      Hence when we are near ENOSPC, we should be attempting to preserve
      the remaining blocks for directory block allocation rather than
      using them for unnecessary inode chunk creation.
      
      This particular behaviour is exposed by xfs/294, when it drives to
      ENOSPC on empty file creation whilst there are still thousands of
      free inodes available for allocation in other AGs in the filesystem.
      
      Hence, when we are within 1% of ENOSPC, change the inode allocation
      behaviour to prefer to use existing free inodes over allocating new
      inode chunks, even though it results is poorer locality of the data
      set. It is more important for the allocations to be space efficient
      near ENOSPC than to have optimal locality for performance, so lets
      modify the inode AG selection code to reflect that fact.
      
      This allows generic/294 to not only pass with this allocator rework
      patchset, but to increase the number of post-ENOSPC empty inode
      allocations to from ~600 to ~9080 before we hit ENOSPC on the
      directory create transaction reservation.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      f08f984c
    • Dave Chinner's avatar
      xfs: fix low space alloc deadlock · 1dd0510f
      Dave Chinner authored
      I've recently encountered an ABBA deadlock with g/476. The upcoming
      changes seem to make this much easier to hit, but the underlying
      problem is a pre-existing one.
      
      Essentially, if we select an AG for allocation, then lock the AGF
      and then fail to allocate for some reason (e.g. minimum length
      requirements cannot be satisfied), then we drop out of the
      allocation with the AGF still locked.
      
      The caller then modifies the allocation constraints - usually
      loosening them up - and tries again. This can result in trying to
      access AGFs that are lower than the AGF we already have locked from
      the failed attempt. e.g. the failed attempt skipped several AGs
      before failing, so we have locks an AG higher than the start AG.
      Retrying the allocation from the start AG then causes us to violate
      AGF lock ordering and this can lead to deadlocks.
      
      The deadlock exists even if allocation succeeds - we can do a
      followup allocations in the same transaction for BMBT blocks that
      aren't guaranteed to be in the same AG as the original, and can move
      into higher AGs. Hence we really need to move the tp->t_firstblock
      tracking down into xfs_alloc_vextent() where it can be set when we
      exit with a locked AG.
      
      xfs_alloc_vextent() can also check there if the requested
      allocation falls within the allow range of AGs set by
      tp->t_firstblock. If we can't allocate within the range set, we have
      to fail the allocation. If we are allowed to to non-blocking AGF
      locking, we can ignore the AG locking order limitations as we can
      use try-locks for the first iteration over requested AG range.
      
      This invalidates a set of post allocation asserts that check that
      the allocation is always above tp->t_firstblock if it is set.
      Because we can use try-locks to avoid the deadlock in some
      circumstances, having a pre-existing locked AGF doesn't always
      prevent allocation from lower order AGFs. Hence those ASSERTs need
      to be removed.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      1dd0510f
    • Darrick J. Wong's avatar
      xfs: revert commit 8954c44f · dd07bb8b
      Darrick J. Wong authored
      The name passed into __xfs_xattr_put_listent is exactly namelen bytes
      long and not null-terminated.  Passing namelen+1 to the strscpy function
      
          strscpy(offset, (char *)name, namelen + 1);
      
      is therefore wrong.  Go back to the old code, which works fine because
      strncpy won't find a null in @name and stops after namelen bytes.  It
      really could be a memcpy call, but it worked for years.
      
      Reported-by: syzbot+898115bc6d7140437215@syzkaller.appspotmail.com
      Fixes: 8954c44f ("xfs: use strscpy() to instead of strncpy()")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      dd07bb8b
    • Thomas Weißschuh's avatar
      xfs: make kobj_type structures constant · 2ee83335
      Thomas Weißschuh authored
      Since commit ee6d3dd4 ("driver core: make kobj_type constant.")
      the driver core allows the usage of const struct kobj_type.
      
      Take advantage of this to constify the structure definitions to prevent
      modification at runtime.
      Signed-off-by: default avatarThomas Weißschuh <linux@weissschuh.net>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      2ee83335
    • Donald Douwsma's avatar
      xfs: allow setting full range of panic tags · 167ce4cb
      Donald Douwsma authored
      xfs will not allow combining other panic masks with
      XFS_PTAG_VERIFIER_ERROR.
      
       # sysctl fs.xfs.panic_mask=511
       sysctl: setting key "fs.xfs.panic_mask": Invalid argument
       fs.xfs.panic_mask = 511
      
      Update to the maximum value that can be set to allow the full range of
      masks. Do this using a mask of possible values to prevent this happening
      again as suggested by Darrick.
      
      Fixes: d519da41 ("xfs: Introduce XFS_PTAG_VERIFIER_ERROR panic mask")
      Signed-off-by: default avatarDonald Douwsma <ddouwsma@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      167ce4cb
  3. 05 Feb, 2023 1 commit
    • Dave Chinner's avatar
      xfs: don't use BMBT btree split workers for IO completion · c85007e2
      Dave Chinner authored
      When we split a BMBT due to record insertion, we offload it to a
      worker thread because we can be deep in the stack when we try to
      allocate a new block for the BMBT. Allocation can use several
      kilobytes of stack (full memory reclaim, swap and/or IO path can
      end up on the stack during allocation) and we can already be several
      kilobytes deep in the stack when we need to split the BMBT.
      
      A recent workload demonstrated a deadlock in this BMBT split
      offload. It requires several things to happen at once:
      
      1. two inodes need a BMBT split at the same time, one must be
      unwritten extent conversion from IO completion, the other must be
      from extent allocation.
      
      2. there must be a no available xfs_alloc_wq worker threads
      available in the worker pool.
      
      3. There must be sustained severe memory shortages such that new
      kworker threads cannot be allocated to the xfs_alloc_wq pool for
      both threads that need split work to be run
      
      4. The split work from the unwritten extent conversion must run
      first.
      
      5. when the BMBT block allocation runs from the split work, it must
      loop over all AGs and not be able to either trylock an AGF
      successfully, or each AGF is is able to lock has no space available
      for a single block allocation.
      
      6. The BMBT allocation must then attempt to lock the AGF that the
      second task queued to the rescuer thread already has locked before
      it finds an AGF it can allocate from.
      
      At this point, we have an ABBA deadlock between tasks queued on the
      xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task
      holding the AGF lock can't be run by the rescuer thread until the
      task the rescuer thread is runing gets the AGF lock....
      
      This is a highly improbably series of events, but there it is.
      
      There's a couple of ways to fix this, but the easiest way to ensure
      that we only punt tasks with a locked AGF that holds enough space
      for the BMBT block allocations to the worker thread.
      
      This works for unwritten extent conversion in IO completion (which
      doesn't have a locked AGF and space reservations) because we have
      tight control over the IO completion stack. It is typically only 6
      functions deep when xfs_btree_split() is called because we've
      already offloaded the IO completion work to a worker thread and
      hence we don't need to worry about stack overruns here.
      
      The other place we can be called for a BMBT split without a
      preceeding allocation is __xfs_bunmapi() when punching out the
      center of an existing extent. We don't remove extents in the IO
      path, so these operations don't tend to be called with a lot of
      stack consumed. Hence we don't really need to ship the split off to
      a worker thread in these cases, either.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      c85007e2