1. 27 Feb, 2023 1 commit
  2. 15 Feb, 2023 2 commits
    • xfs: fix uninitialized variable access · 60b730a4
      Darrick J. Wong authored
      If the end position of a GETFSMAP query overlaps an allocated space and
      we're using the free space info to generate fsmap info, the akeys
      information gets fed into the fsmap formatter with bad results.
      Zero-init the space.
      
      Reported-by: syzbot+090ae72d552e6bd93cfe@syzkaller.appspotmail.com
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • Merge tag 'xfs-alloc-perag-conversion' of... · 571dc9ae
      Darrick J. Wong authored
      Merge tag 'xfs-alloc-perag-conversion' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-6.3-merge-A
      
      xfs: per-ag centric allocation algorithms
      
      This series continues the work towards making shrinking a filesystem
      possible.  We need to be able to stop operations from taking place
      on AGs that need to be removed by a shrink, so before shrink can be
      implemented we need to have the infrastructure in place to prevent
      incursion into AGs that are going to be, or are in the process of
      being, removed from active duty.
      
      The focus of this is making operations that depend on access to AGs
      use the perag to access and pin the AG in active use, thereby
      creating a barrier we can use to delay shrink until all active uses
      of an AG have been drained and new uses are prevented.
      
      This series starts by fixing some existing issues that are exposed
      by changes later in the series. They stand alone, so can be picked
      up independently of the rest of this patchset.
      
      The most complex of these fixes is cleaning up the mess that is the
      AGF deadlock avoidance algorithm. This algorithm stores the first
      block that is allocated in a transaction in tp->t_firstblock, then
      uses this to try to limit future allocations within the transaction
      to AGs at or higher than the filesystem block stored in
      tp->t_firstblock. This depends on one of the initial bug fixes in
      the series to move the deadlock avoidance checks to
      xfs_alloc_vextent(), and then builds on it to relax the constraints
      of the avoidance algorithm to only be active when a deadlock is
      possible.
      
      We also update the algorithm to record the higher AGs that are
      allocated from, because when we need to lock more than two AGs we
      still have to ensure lock order is correct. Therefore we can't lock
      AGs in the order 1, 3, 2, even though tp->t_firstblock indicates
      that we've allocated from AG 1 and so AG 2 is valid to lock. It's
      not valid, because we already hold AG 3 locked, and so
      tp->t_firstblock should actually point at AG 3, not AG 1 in this
      situation.
      
      It should now be obvious that the deadlock avoidance algorithm
      should record AGs, not filesystem blocks. So the series then changes
      the transaction to store the highest AG we've allocated in rather
      than a filesystem block we allocated.  This makes it obvious what
      the constraints are, and trivial to update as we lock and allocate
      from various AGs.
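
      The lock-ordering rule described above can be sketched as follows.
      This is a minimal userspace illustration, not the kernel code: the
      names struct txn, highest_agno, NO_AG, txn_init(), txn_may_lock_ag()
      and txn_lock_ag() are all hypothetical and chosen only for this
      example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker: the transaction has not locked any AG yet. */
#define NO_AG UINT32_MAX

/* Hypothetical transaction that tracks the highest AG it has locked. */
struct txn {
	uint32_t highest_agno;
};

static void txn_init(struct txn *tp)
{
	tp->highest_agno = NO_AG;
}

/* An AG may be locked only if it is at or above the highest AG held. */
static bool txn_may_lock_ag(const struct txn *tp, uint32_t agno)
{
	return tp->highest_agno == NO_AG || agno >= tp->highest_agno;
}

/* Lock an AG if ordering permits, and record it as the new highest. */
static bool txn_lock_ag(struct txn *tp, uint32_t agno)
{
	assert(agno != NO_AG);
	if (!txn_may_lock_ag(tp, agno))
		return false;
	tp->highest_agno = agno;
	return true;
}
```

      With this sketch, locking AGs in the order 1, 3 succeeds and a
      subsequent attempt on AG 2 is refused, because the recorded highest
      AG is 3 rather than the first AG allocated from.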
      
      With all the bug fixes out of the way, the series then starts
      converting the code to use active references. Active reference
      counts are used by high level code that needs to prevent the AG from
      being taken out from under it by a shrink operation. The high level
      code needs to be able to handle not getting an active reference
      gracefully, and the shrink code will need to wait for active
      references to drain before continuing.
      
      Active references are implemented just as reference counts right now
      - an active reference is taken at perag init during mount, and all
      other active references are dependent on the active reference count
      being greater than zero. This gives us an initial method of stopping
      new active references without needing other infrastructure; just
      drop the reference taken at filesystem mount time and when the
      refcount then falls to zero no new references can be taken.
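
      A rough userspace sketch of this scheme, using C11 atomics, is
      shown below. It assumes a simplified hypothetical struct perag with
      a single active_ref counter; perag_init(), perag_grab() and
      perag_rele() are illustrative names, not the kernel's functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified per-AG structure. */
struct perag {
	atomic_int active_ref;
};

/* Mount time: take the initial active reference. */
static void perag_init(struct perag *pag)
{
	atomic_init(&pag->active_ref, 1);
}

/* Take an active reference only while the count is still above zero. */
static bool perag_grab(struct perag *pag)
{
	int old = atomic_load(&pag->active_ref);

	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&pag->active_ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop an active reference; returns true if it was the last one. */
static bool perag_rele(struct perag *pag)
{
	int old = atomic_fetch_sub(&pag->active_ref, 1);

	assert(old > 0);
	return old == 1;
}
```

      Dropping the reference taken in perag_init() is what closes the
      gate in this sketch: once the count reaches zero, perag_grab()
      fails for every later caller.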
      
      In future, this will need to take into account AG control state
      (e.g. offline, no alloc, etc) as well as the reference count, but
      right now we can implement a basic barrier for shrink with just
      reference count manipulations. As such, patches to convert the perag
      state to atomic opstate fields similar to the xfs_mount and xlog
      opstate fields follow the initial active perag reference counting
      patches.
      
      The first target for active reference conversion is the
      for_each_perag*() iterators. This captures a lot of high level code
      that should skip offline AGs, and introduces the ability to
      differentiate between a lookup that didn't have an online AG and the
      end of the AG iteration range.
      
      From there, the inode allocation AG selection is converted to active
      references, and the perag is driven deeper into the inode allocation
      and btree code to replace the xfs_mount. Most of the inode
      allocation code operates on a single AG once it is selected, hence
      it should pass the perag as the primary referenced object around for
      allocation, not the xfs_mount. There is a bit of churn here, but it
      emphasises that inode allocation is inherently an allocation group
      based operation.
      
      Next the bmap/alloc interface undergoes a major untangling,
      reworking xfs_bmap_btalloc() into separate allocation operations for
      different contexts and failure handling behaviours. This then allows
      us to completely remove the xfs_alloc_vextent() layer via
      restructuring the xfs_alloc_vextent/xfs_alloc_ag_vextent() into a
      set of relatively simple helper functions that describe the
      allocation that they are doing, e.g. xfs_alloc_vextent_exact_bno().
      
      This allows the requirements for accessing AGs to be allocation
      context dependent. The allocations that require operation on a
      single AG generally can't tolerate failure after the allocation
      method and AG has been decided on, and hence the caller needs to
      manage the active references to ensure the allocation does not race
      with shrink removing the selected AG for the duration of the
      operation that requires access to that allocation group.
      
      Other allocations iterate AGs and so the first AG is just a hint -
      these do not need to pin a perag first as they can tolerate not
      being able to access an AG by simply skipping over it. These require
      new perag iteration functions that can start at arbitrary AGs and
      wrap around at arbitrary AGs, hence a new set of
      for_each_perag_wrap*() helpers to do this.
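
      The wrapping walk can be illustrated with a small standalone
      sketch. It is not the for_each_perag_wrap*() implementation: the
      function walk_ags_wrap(), the online[] array standing in for "an
      active reference could be taken", and the visited[] output buffer
      are all assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Visit every AG once, starting at start_agno, wrapping from the last AG
 * back to AG 0, and stopping just before start_agno. AGs that are not
 * online are skipped rather than treated as an error. The visit order is
 * written to visited[] (which must hold agcount entries) and the number
 * of AGs visited is returned.
 */
static size_t walk_ags_wrap(uint32_t start_agno, uint32_t agcount,
			    const bool *online, uint32_t *visited)
{
	size_t n = 0;

	if (agcount == 0)
		return 0;
	assert(start_agno < agcount);

	for (uint32_t i = 0; i < agcount; i++) {
		uint32_t agno = (start_agno + i) % agcount;

		if (!online[agno])
			continue;	/* tolerate an inaccessible AG */
		visited[n++] = agno;
	}
	return n;
}
```

      For a four-AG filesystem with AG 1 unavailable, a walk starting at
      AG 2 would visit AGs 2, 3 and 0 in this sketch.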
      
      Next is the rework of the filestreams allocator. This doesn't change
      any functionality, but gets rid of the unnecessary multi-pass
      selection algorithm when the selected AG is not available. It
      currently does a lookup pass which might iterate all AGs to select
      an AG, then checks if the AG is acceptable and if not does a "new
      AG" pass that is essentially identical to the lookup pass. Both of
      these scans also do the same "longest extent in AG" check before
      selecting an AG as is done after the AG is selected.
      
      IOWs, the filestreams algorithm can be greatly simplified into a
      single new AG selection pass if there is no current association
      or the currently associated AG doesn't have enough contiguous free
      space for the allocation to proceed.  With this simplification of
      the filestreams allocator, it's then trivial to convert it to use
      for_each_perag_wrap() for the AG scan algorithm.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      
      * tag 'xfs-alloc-perag-conversion' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (42 commits)
        xfs: refactor the filestreams allocator pick functions
        xfs: return a referenced perag from filestreams allocator
        xfs: pass perag to filestreams tracing
        xfs: use for_each_perag_wrap in xfs_filestream_pick_ag
        xfs: track an active perag reference in filestreams
        xfs: factor out MRU hit case in xfs_filestream_select_ag
        xfs: remove xfs_filestream_select_ag() longest extent check
        xfs: merge new filestream AG selection into xfs_filestream_select_ag()
        xfs: merge filestream AG lookup into xfs_filestream_select_ag()
        xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c
        xfs: use xfs_bmap_longest_free_extent() in filestreams
        xfs: get rid of notinit from xfs_bmap_longest_free_extent
        xfs: factor out filestreams from xfs_bmap_btalloc_nullfb
        xfs: convert trim to use for_each_perag_range
        xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker
        xfs: move the minimum agno checks into xfs_alloc_vextent_check_args
        xfs: fold xfs_alloc_ag_vextent() into callers
        xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno()
        xfs: introduce xfs_alloc_vextent_prepare()
        xfs: introduce xfs_alloc_vextent_exact_bno()
        ...
  3. 12 Feb, 2023 36 commits
  4. 10 Feb, 2023 1 commit
    • xfs: don't assert fail on transaction cancel with deferred ops · 55d5c3a3
      Dave Chinner authored
      We can error out of an allocation transaction when updating BMBT
      blocks when things go wrong. This can be a btree corruption, an
      unexpected ENOSPC, etc. In these cases, we already have deferred ops
      queued for the first allocation that has been done, and we just want
      to cancel out the transaction and shut down the filesystem on error.
      
      In fact, we do just that for production systems - the assert that we
      can't have a transaction with defer ops attached unless we are
      already shut down is bogus and gets in the way of debugging
      whatever issue is actually causing the transaction to be cancelled.
      
      Remove the assert because it is causing spurious test failures to
      hang test machines.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>