1. 12 Apr, 2023 40 commits
    • Darrick J. Wong's avatar
      xfs: fix an inode lookup race in xchk_get_inode · 302436c2
      Darrick J. Wong authored
      In commit d658e, we tried to improve the robustnes of xchk_get_inode in
      the face of EINVAL returns from iget by calling xfs_imap to see if the
      inobt itself thinks that the inode is allocated.  Unfortunately, that
      commit didn't consider the possibility that the inode gets allocated
      after iget but before imap.  In this case, the imap call will succeed,
      but we turn that into a corruption error and tell userspace the inode is
      corrupt.
      
      Avoid this false corruption report by grabbing the AGI header and
      retrying the iget before calling imap.  If the iget succeeds, we can
      proceed with the usual scrub-by-handle code.  Fix all the incorrect
      comments too, since unreadable/corrupt inodes no longer result in EINVAL
      returns.
      
      Fixes: d658e72b ("xfs: distinguish between corrupt inode and invalid inum in xfs_scrub_get_inode")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      302436c2
    • Darrick J. Wong's avatar
      xfs: manage inode DONTCACHE status at irele time · a03297a0
      Darrick J. Wong authored
      Right now, there are statements scattered all over the online fsck
      codebase about how we can't use XFS_IGET_DONTCACHE because of concerns
      about scrub's unusual practice of releasing inodes with transactions
      held.
      
      However, iget is the wrong place to handle this -- the DONTCACHE state
      doesn't matter at all until we try to *release* the inode, and here we
      get things wrong in multiple ways:
      
      First, if we /do/ have a transaction, we must NOT drop the inode,
      because the inode could have dirty pages, dropping the inode will
      trigger writeback, and writeback can trigger a nested transaction.
      
      Second, if the inode already had an active reference and the DONTCACHE
      flag set, the icache hit when scrub grabs another ref will not clear
      DONTCACHE.  This is sort of by design, since DONTCACHE is now used to
      initiate cache drops so that sysadmins can change a file's access mode
      between pagecache and DAX.
      
      Third, if we do actually have the last active reference to the inode, we
      can set DONTCACHE to avoid polluting the cache.  This is the /one/ case
      where we actually want that flag.
      
      Create an xchk_irele helper to encode all that logic and switch the
      online fsck code to use it.  Since this now means that nearly all
      scrubbers use the same xfs_iget flags, we can wrap them too.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      a03297a0
    • Darrick J. Wong's avatar
      xfs: fix parent pointer scrub racing with subdirectory reparenting · 0916056e
      Darrick J. Wong authored
      Jan Kara pointed out that rename() doesn't lock a subdirectory that is
      being moved from one parent to another, even though the move requires an
      update to the subdirectory's dotdot entry.  This means that it's *not*
      sufficient to hold a directory's IOLOCK to stabilize the dotdot entry.
      We must hold the ILOCK of both the child and the alleged parent, and
      there's no use in holding the parent's IOLOCK.
      
      With that in mind, we can get rid of all the messy code that tries to
      grab the parent's IOLOCK, which means we don't need to let go of the
      ILOCK of the directory whose parent we are checking.  We still have to
      use nonblocking mode to take the ILOCK of the alleged parent, so the
      revalidation loop has to stay.
      
      However, we can remove the retry counter, since threads aren't supposed
      to hold the ILOCK for long periods of time.  Remove the inverted ilock
      helper from the common code since nobody uses it.  Remove the entire
      source of -EDEADLOCK-based "retry harder" scrub executions.
      
      Link: https://lore.kernel.org/linux-xfs/20230117123735.un7wbamlbdihninm@quack3/Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      0916056e
    • Darrick J. Wong's avatar
      xfs: simplify xchk_parent_validate · b049962c
      Darrick J. Wong authored
      This function is unnecessarily long because it contains code to
      revalidate a dotdot entry after cycling locks to try to confirm a
      subdirectory parent pointer.  Shorten the codebase by making the
      parent's lookup call do double duty as the revalidation code.
      
      This weakeans the efficacy of this scrub function temporarily, but the
      next patch will resolve this as part of fixing an unhandled race that is
      the result of the VFS rename locking model not working the way Darrick
      thought it did.
      
      Rename this stupid 'dnum' variable too.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      b049962c
    • Darrick J. Wong's avatar
      xfs: remove xchk_parent_count_parent_dentries · cbab28f4
      Darrick J. Wong authored
      This helper is now trivial, so get rid of it.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      cbab28f4
    • Darrick J. Wong's avatar
      xfs: always check the existence of a dirent's child inode · 6bb9209c
      Darrick J. Wong authored
      When we're scrubbing directory entries, we always need to iget the child
      inode to make sure that the inode pointer points to a valid inode.  The
      original directory scrub code (commit a5c4) only set us up to do this
      for ftype=1 filesystems, which is not sufficient; and then commit 4b80
      made it worse by exempting the dot and dotdot entries.
      
      Sorta-fixes: a5c46e5e ("xfs: scrub directory metadata")
      Sorta-fixes: 4b80ac64 ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      6bb9209c
    • Darrick J. Wong's avatar
      xfs: xfs_iget in the directory scrubber needs to use UNTRUSTED · d9a94480
      Darrick J. Wong authored
      In commit 4b80ac64, we tried to strengthen the directory scrubber by
      using the iget call to detect directory entries that point to
      unallocated inodes.  Unfortunately, that commit neglected to pass
      XFS_IGET_UNTRUSTED to xfs_iget, so we don't check the inode btree first.
      If the inode number points to something that isn't even an inode
      cluster, iget will throw corruption errors and return -EFSCORRUPTED,
      which means that we fail to mark the directory corrupt.
      
      Fixes: 4b80ac64 ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      d9a94480
    • Darrick J. Wong's avatar
      xfs: streamline the directory iteration code for scrub · 4c233b5c
      Darrick J. Wong authored
      Currently, online scrub reuses the xfs_readdir code to walk every entry
      in a directory.  This isn't awesome for performance, since we end up
      cycling the directory ILOCK needlessly and coding around the particular
      quirks of the VFS dir_context interface.
      
      Create a streamlined version of readdir that keeps the ILOCK (since the
      walk function isn't going to copy stuff to userspace), skips a whole lot
      of directory walk cursor checks (since we start at 0 and walk to the
      end) and has a sane way to return error codes.
      
      Note: Porting the dotdot checking code is left for a subsequent patch.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      4c233b5c
    • Darrick J. Wong's avatar
      xfs: use the directory name hash function for dir scrubbing · 9dceccc5
      Darrick J. Wong authored
      The directory code has a directory-specific hash computation function
      that includes a modified hash function for case-insensitive lookups.
      Hence we must use that function (and not the raw da_hashname) when
      checking the dabtree structure.
      
      Found by accidentally breaking xfs/188 to create an abnormally huge
      case-insensitive directory and watching scrub break.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      9dceccc5
    • Darrick J. Wong's avatar
      xfs: ensure that single-owner file blocks are not owned by others · 30f8ee5e
      Darrick J. Wong authored
      For any file fork mapping that can only have a single owner, make sure
      that there are no other rmap owners for that mapping.  This patch
      requires the more detailed checking provided by xfs_rmap_count_owners so
      that we can know how many rmap records for a given range of space had a
      matching owner, how many had a non-matching owner, and how many
      conflicted with the records that have a matching owner.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      30f8ee5e
    • Darrick J. Wong's avatar
      xfs: teach scrub to check for sole ownership of metadata objects · 69115f77
      Darrick J. Wong authored
      Strengthen online scrub's checking even further by enabling us to check
      that a range of blocks are owned solely by a given owner.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      69115f77
    • Darrick J. Wong's avatar
      xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan results · efc0845f
      Darrick J. Wong authored
      Convert the xfs_ialloc_has_inodes_at_extent function to return keyfill
      scan results because for a given range of inode numbers, we might have
      no indexed inodes at all; the entire region might be allocated ondisk
      inodes; or there might be a mix of the two.
      
      Unfortunately, sparse inodes adds to the complexity, because each inode
      record can have holes, which means that we cannot use the generic btree
      _scan_keyfill function because we must look for holes in individual
      records to decide the result.  On the plus side, online fsck can now
      detect sub-chunk discrepancies in the inobt.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      efc0845f
    • Darrick J. Wong's avatar
      xfs: directly cross-reference the inode btrees with each other · bc0f3b55
      Darrick J. Wong authored
      Improve the cross-referencing of the two inode btrees by directly
      checking the free and hole state of each inode with the other btree.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      bc0f3b55
    • Darrick J. Wong's avatar
      xfs: clean up broken eearly-exit code in the inode btree scrubber · c01868b6
      Darrick J. Wong authored
      Corrupt inode chunks should cause us to exit early after setting the
      CORRUPT flag on the scrub state.  While we're at it, collapse trivial
      helpers.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      c01868b6
    • Darrick J. Wong's avatar
      xfs: remove pointless shadow variable from xfs_difree_inobt · cc120766
      Darrick J. Wong authored
      In xfs_difree_inobt, the pag passed in was previously used to look up
      the AGI buffer.  There's no need to extract it again, so remove the
      shadow variable and shut up -Wshadow.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      cc120766
    • Darrick J. Wong's avatar
      xfs: ensure that all metadata and data blocks are not cow staging extents · 7ac14fa2
      Darrick J. Wong authored
      Make sure that all filesystem metadata blocks and file data blocks are
      not also marked as CoW staging extents.  The extra checking added here
      was inspired by an actual VM host filesystem corruption incident due to
      bugs in the CoW handling of 4.x kernels.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      7ac14fa2
    • Darrick J. Wong's avatar
      xfs: check the reference counts of gaps in the refcount btree · 7ad9ea63
      Darrick J. Wong authored
      Gaps in the reference count btree are also significant -- for these
      regions, there must not be any overlapping reverse mappings.  We don't
      currently check this, so make the refcount scrubber more complete.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      7ad9ea63
    • Darrick J. Wong's avatar
      xfs: implement masked btree key comparisons for _has_records scans · 4a200a09
      Darrick J. Wong authored
      For keyspace fullness scans, we want to be able to mask off the parts of
      the key that we don't care about.  For most btree types we /do/ want the
      full keyspace, but for checking that a given space usage also has a full
      complement of rmapbt records (even if different/multiple owners) we need
      this masking so that we only track sparseness of rm_startblock, not the
      whole keyspace (which is extremely sparse).
      
      Augment the ->diff_two_keys and ->keys_contiguous helpers to take a
      third union xfs_btree_key argument, and wire up xfs_rmap_has_records to
      pass this through.  This third "mask" argument should contain a nonzero
      value in each structure field that should be used in the key comparisons
      done during the scan.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      4a200a09
    • Darrick J. Wong's avatar
      xfs: replace xfs_btree_has_record with a general keyspace scanner · 6abc7aef
      Darrick J. Wong authored
      The current implementation of xfs_btree_has_record returns true if it
      finds /any/ record within the given range.  Unfortunately, that's not
      sufficient for scrub.  We want to be able to tell if a range of keyspace
      for a btree is devoid of records, is totally mapped to records, or is
      somewhere in between.  By forcing this to be a boolean, we conflated
      sparseness and fullness, which caused scrub to return incorrect results.
      Fix the API so that we can tell the caller which of those three is the
      current state.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      6abc7aef
    • Darrick J. Wong's avatar
      xfs: refactor ->diff_two_keys callsites · bd7e7951
      Darrick J. Wong authored
      Create wrapper functions around ->diff_two_keys so that we don't have to
      remember what the return values mean, and adjust some of the code
      comments to reflect the longtime code behavior.  We're going to
      introduce more uses of ->diff_two_keys in the next patch, so reduce the
      cognitive load for readers by doing this refactoring now.
      Suggested-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      bd7e7951
    • Darrick J. Wong's avatar
      xfs: refactor converting btree irec to btree key · ee5fe8ff
      Darrick J. Wong authored
      We keep doing these conversions to support btree queries, so refactor
      this into a helper.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      ee5fe8ff
    • Darrick J. Wong's avatar
      xfs: always scrub record/key order of interior records · 2bea8df0
      Darrick J. Wong authored
      In commit d47fef93, we removed the firstrec and firstkey fields of
      struct xchk_btree because Christoph thought they were unnecessary
      because we could use the record index in the btree cursor.  This is
      incorrect because bc_ptrs (now bc_levels[].ptr) tracks the cursor
      position within a specific btree block, not within the entire level.
      
      The end result is that scrub no longer detects situations where the
      rightmost record of a block is identical to the leftmost record of that
      block's right sibling.  Fix this regression by reintroducing record
      validity booleans so that order checking skips *only* the leftmost
      record/key in each level.
      
      Fixes: d47fef93 ("xfs: don't track firstrec/firstkey separately in xchk_btree")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      2bea8df0
    • Darrick J. Wong's avatar
      xfs: check btree keys reflect the child block · c99f99fa
      Darrick J. Wong authored
      When scrub is checking a non-root btree block, it should make sure that
      the keys in the parent btree block accurately capture the keyspace that
      the child block stores.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      c99f99fa
    • Darrick J. Wong's avatar
      xfs: detect unwritten bit set in rmapbt node block keys · 38384569
      Darrick J. Wong authored
      In the last patch, we changed the rmapbt code to remove the UNWRITTEN
      bit when creating an rmapbt key from an rmapbt record, and we changed
      the rmapbt key comparison code to start considering the ATTR and BMBT
      flags during lookup.  This brought the behavior of the rmapbt
      implementation in line with its specification.
      
      However, there may exist filesystems that have the unwritten bit still
      set in the rmapbt keys.  We should detect these situations and flag the
      rmapbt as one that would benefit from optimization.  Eventually, online
      repair will be able to do something in response to this.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      38384569
    • Darrick J. Wong's avatar
      xfs: fix rm_offset flag handling in rmap keys · 08c987de
      Darrick J. Wong authored
      Keys for extent interval records in the reverse mapping btree are
      supposed to be computed as follows:
      
      (physical block, owner, fork, is_btree, offset)
      
      This provides users the ability to look up a reverse mapping from a file
      block mapping record -- start with the physical block; then if there are
      multiple records for the same block, move on to the owner; then the
      inode fork type; and so on to the file offset.
      
      Unfortunately, the code that creates rmap lookup keys from rmap records
      forgot to mask off the record attribute flags, leading to ondisk keys
      that look like this:
      
      (physical block, owner, fork, is_btree, unwritten state, offset)
      
      Fortunately, this has all worked ok for the past six years because the
      key comparison functions incorrectly ignore the fork/bmbt/unwritten
      information that's encoded in the on-disk offset.  This means that
      lookup comparisons are only done with:
      
      (physical block, owner, offset)
      
      Queries can (theoretically) return incorrect results because of this
      omission.  On consistent filesystems this isn't an issue because xattr
      and bmbt blocks cannot be shared and hence the comparisons succeed
      purely on the contents of the rm_startblock field.  For the one case
      where we support sharing (written data fork blocks) all flag bits are
      zero, so the omission in the comparison has no ill effects.
      
      Unfortunately, this bug prevents scrub from detecting incorrect fork and
      bmbt flag bits in the rmap btree, so we really do need to fix the
      compare code.  Old filesystems with the unwritten bit erroneously set in
      the rmap key struct will work fine on new kernels since we still ignore
      the unwritten bit.  New filesystems on older kernels will work fine
      since the old kernels never paid attention to the unwritten bit.
      
      A previous version of this patch forgot to keep the (un)written state
      flag masked during the comparison and caused a major regression in
      5.9.x since unwritten extent conversion can update an rmap record
      without requiring key updates.
      
      Note that blocks cannot go directly from data fork to attr fork without
      being deallocated and reallocated, nor can they be added to or removed
      from a bmbt without a free/alloc cycle, so this should not cause any
      regressions.
      
      Found by fuzzing keys[1].attrfork = ones on xfs/371.
      
      Fixes: 4b8ed677 ("xfs: add rmap btree operations")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      08c987de
    • Darrick J. Wong's avatar
      xfs: hoist inode record alignment checks from scrub · de1a9ce2
      Darrick J. Wong authored
      Move the inobt record alignment checks from xchk_iallocbt_rec into
      xfs_inobt_check_irec so that they are applied everywhere.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      de1a9ce2
    • Darrick J. Wong's avatar
      xfs: hoist rmap record flag checks from scrub · e774b2ea
      Darrick J. Wong authored
      Move the rmap record flag checks from xchk_rmapbt_rec into
      xfs_rmap_check_irec so that they are applied everywhere.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      e774b2ea
    • Darrick J. Wong's avatar
      xfs: hoist rmap record flag checks from scrub · 7d7d6d2f
      Darrick J. Wong authored
      Move the rmap record flag checks from xchk_rmapbt_rec into
      xfs_rmap_check_irec so that they are applied everywhere.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      7d7d6d2f
    • Darrick J. Wong's avatar
      xfs: complain about bad file mapping records in the ondisk bmbt · 6a3bd8fc
      Darrick J. Wong authored
      Similar to what we've just done for the other btrees, create a function
      to log corrupt bmbt records and call it whenever we encounter a bad
      record in the ondisk btree.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      6a3bd8fc
    • Darrick J. Wong's avatar
      xfs: complain about bad records in query_range helpers · ee12eaaa
      Darrick J. Wong authored
      For every btree type except for the bmbt, refactor the code that
      complains about bad records into a helper and make the ->query_range
      helpers call it so that corruptions found via that avenue are logged.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      ee12eaaa
    • Darrick J. Wong's avatar
      xfs: standardize ondisk to incore conversion for bmap btrees · 69010fe3
      Darrick J. Wong authored
      Fix all xfs_bmbt_disk_get_all callsites to call xfs_bmap_validate_extent
      and bubble up corruption reports.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      69010fe3
    • Darrick J. Wong's avatar
      xfs: standardize ondisk to incore conversion for rmap btrees · c4e34172
      Darrick J. Wong authored
      Create a xfs_rmap_check_irec function to detect corruption in btree
      records.  Fix all xfs_rmap_btrec_to_irec callsites to call the new
      helper and bubble up corruption reports.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      c4e34172
    • Darrick J. Wong's avatar
      xfs: return a failure address from xfs_rmap_irec_offset_unpack · 39ab26d5
      Darrick J. Wong authored
      Currently, xfs_rmap_irec_offset_unpack returns only 0 or -EFSCORRUPTED.
      Change this function to return the code address of a failed conversion
      in preparation for the next patch, which standardizes localized record
      checking and reporting code.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      39ab26d5
    • Darrick J. Wong's avatar
      xfs: standardize ondisk to incore conversion for refcount btrees · 2b30cc0b
      Darrick J. Wong authored
      Create a xfs_refcount_check_irec function to detect corruption in btree
      records.  Fix all xfs_refcount_btrec_to_irec callsites to call the new
      helper and bubble up corruption reports.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      2b30cc0b
    • Darrick J. Wong's avatar
      xfs: standardize ondisk to incore conversion for inode btrees · 366a0b8d
      Darrick J. Wong authored
      Create a xfs_inobt_check_irec function to detect corruption in btree
      records.  Fix all xfs_inobt_btrec_to_irec callsites to call the new
      helper and bubble up corruption reports.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      366a0b8d
    • Darrick J. Wong's avatar
      xfs: standardize ondisk to incore conversion for free space btrees · 35e3b9a1
      Darrick J. Wong authored
      Create a xfs_alloc_btrec_to_irec function to convert an ondisk record to
      an incore record, and a xfs_alloc_check_irec function to detect
      corruption.  Replace all the open-coded logic with calls to the new
      helpers and bubble up corruption reports.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      35e3b9a1
    • Darrick J. Wong's avatar
      xfs: scrub should use ECHRNG to signal that the drain is needed · 88accf17
      Darrick J. Wong authored
      In the previous patch, we added jump labels to the intent drain code so
      that regular filesystem operations need not pay the price of checking
      for someone (scrub) waiting on intents to drain from some part of the
      filesystem when that someone isn't running.
      
      However, I observed that xfs/285 now spends a lot more time pushing the
      AIL from the inode btree scrubber than it used to.  This is because the
      inobt scrubber will try push the AIL to try to get logged inode cores
      written to the filesystem when it sees a weird discrepancy between the
      ondisk inode and the inobt records.  This AIL push is triggered when the
      setup function sees TRY_HARDER is set; and the requisite EDEADLOCK
      return is initiated when the discrepancy is seen.
      
      The solution to this performance slow down is to use a different result
      code (ECHRNG) for scrub code to signal that it needs to wait for
      deferred intent work items to drain out of some part of the filesystem.
      When this happens, set a new scrub state flag (XCHK_NEED_DRAIN) so that
      setup functions will activate the jump label.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      88accf17
    • Darrick J. Wong's avatar
      xfs: minimize overhead of drain wakeups by using jump labels · 466c525d
      Darrick J. Wong authored
      To reduce the runtime overhead even further when online fsck isn't
      running, use a static branch key to decide if we call wake_up on the
      drain.  For compilers that support jump labels, the call to wake_up is
      replaced by a nop sled when nobody is waiting for intents to drain.
      
      From my initial microbenchmarking, every transition of the static key
      between the on and off states takes about 22000ns to complete; this is
      paid entirely by the xfs_scrub process.  When the static key is off
      (which it should be when fsck isn't running), the nop sled adds an
      overhead of approximately 0.36ns to runtime code.  The post-atomic
      lockless waiter check adds about 0.03ns, which is basically free.
      
      For the few compilers that don't support jump labels, runtime code pays
      the cost of calling wake_up on an empty waitqueue, which was observed to
      be about 30ns.  However, most architectures that have sufficient memory
      and CPU capacity to run XFS also support jump labels, so this is not
      much of a worry.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      466c525d
    • Darrick J. Wong's avatar
      xfs: clean up scrub context if scrub setup returns -EDEADLOCK · 3f64c718
      Darrick J. Wong authored
      It has been a longstanding convention that online scrub and repair
      functions can return -EDEADLOCK to signal that they weren't able to
      obtain some necessary resource.  When this happens, the scrub framework
      is supposed to release all resources attached to the scrub context, set
      the TRY_HARDER flag in the scrub context flags, and try again.  In this
      context, individual scrub functions are supposed to take all the
      resources they (incorrectly) speculated were not necessary.
      
      We're about to make it so that the functions that lock and wait for a
      filesystem AG can also return EDEADLOCK to signal that we need to try
      again with the drain waiters enabled.  Therefore, refactor
      xfs_scrub_metadata to support this behavior for ->setup() functions.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      3f64c718
    • Darrick J. Wong's avatar
      xfs: allow queued AG intents to drain before scrubbing · d5c88131
      Darrick J. Wong authored
      When a writer thread executes a chain of log intent items, the AG header
      buffer locks will cycle during a transaction roll to get from one intent
      item to the next in a chain.  Although scrub takes all AG header buffer
      locks, this isn't sufficient to guard against scrub checking an AG while
      that writer thread is in the middle of finishing a chain because there's
      no higher level locking primitive guarding allocation groups.
      
      When there's a collision, cross-referencing between data structures
      (e.g. rmapbt and refcountbt) yields false corruption events; if repair
      is running, this results in incorrect repairs, which is catastrophic.
      
      Fix this by adding to the perag structure the count of active intents
      and make scrub wait until it has both AG header buffer locks and the
      intent counter reaches zero.
      
      One quirk of the drain code is that deferred bmap updates also bump and
      drop the intent counter.  A fundamental decision made during the design
      phase of the reverse mapping feature is that updates to the rmapbt
      records are always made by the same code that updates the primary
      metadata.  In other words, callers of bmapi functions expect that the
      bmapi functions will queue deferred rmap updates.
      
      Some parts of the reflink code queue deferred refcount (CUI) and bmap
      (BUI) updates in the same head transaction, but the deferred work
      manager completely finishes the CUI before the BUI work is started.  As
      a result, the CUI drops the intent count long before the deferred rmap
      (RUI) update even has a chance to bump the intent count.  The only way
      to keep the intent count elevated between the CUI and RUI is for the BUI
      to bump the counter until the RUI has been created.
      
      A second quirk of the intent drain code is that deferred work items must
      increment the intent counter as soon as the work item is added to the
      transaction.  When a BUI completes and queues an RUI, the RUI must
      increment the counter before the BUI decrements it.  The only way to
      accomplish this is to require that the counter be bumped as soon as the
      deferred work item is created in memory.
      
      In the next patches we'll improve on this facility, but this patch
      provides the basic functionality.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      d5c88131