1. 15 Dec, 2023 40 commits
    • Darrick J. Wong's avatar
      xfs: repair quotas · a5b91555
      Darrick J. Wong authored
      Fix anything that causes the quota verifiers to fail.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      a5b91555
    • Darrick J. Wong's avatar
      xfs: improve dquot iteration for scrub · 21d75009
      Darrick J. Wong authored
      Upon a closer inspection of the quota record scrubber, I noticed that
      dqiterate wasn't actually walking all possible dquots for the mapped
      blocks in the quota file.  This is due to xfs_qm_dqget_next skipping all
      XFS_IS_DQUOT_UNINITIALIZED dquots.
      
      For a fsck program, we really want to look at all the dquots, even if
      all counters and limits in the dquot record are zero.  Rewrite the
      implementation to do this, as well as switching to an iterator paradigm
      to reduce the number of indirect calls.
      
      This enables removal of the old broken dqiterate code from xfs_dquot.c.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      21d75009
    • Darrick J. Wong's avatar
      xfs: check dquot resource timers · 774b5c0a
      Darrick J. Wong authored
      For each dquot resource, ensure either (a) the resource usage is over
      the soft limit and there is a nonzero timer; or (b) usage is at or under
      the soft limit and the timer is unset.  (a) is redundant with the dquot
      buffer verifier, but (b) isn't checked anywhere.
      
      Found by fuzzing xfs/426 and noticing that diskdq.btimer = add didn't
      trip any kind of warning for having a timer set even with no limits.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      774b5c0a
    • Darrick J. Wong's avatar
      xfs: check the ondisk space mapping behind a dquot · 7d1f0e16
      Darrick J. Wong authored
      Each xfs_dquot object caches the file offset and daddr of the ondisk
      block that backs the dquot.  Make sure these cached values are the same
      as the bmapi data, and that the block state is written.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      7d1f0e16
    • Darrick J. Wong's avatar
      xfs: online repair of realtime bitmaps · ffd37b22
      Darrick J. Wong authored
      Fix all the file metadata surrounding the realtime bitmap file, which
      includes the rt geometry, file size, forks, and space mappings.  The
      bitmap contents themselves cannot be fixed without rt rmap, so that will
      come later.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      ffd37b22
    • Darrick J. Wong's avatar
      xfs: create a new inode fork block unmap helper · a59eb5fc
      Darrick J. Wong authored
      Create a new helper to unmap blocks from an inode's fork.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      a59eb5fc
    • Darrick J. Wong's avatar
      xfs: repair the inode core and forks of a metadata inode · 5a8e07e7
      Darrick J. Wong authored
      Add a helper function to repair the core and forks of a metadata inode,
      so that we can get move onto the task of repairing higher level metadata
      that lives in an inode.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      5a8e07e7
    • Darrick J. Wong's avatar
      xfs: always check the rtbitmap and rtsummary files · 20cc0d39
      Darrick J. Wong authored
      XFS filesystems always have a realtime bitmap and summary file, even if
      there has never been a realtime volume attached.  Always check them.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      20cc0d39
    • Darrick J. Wong's avatar
      xfs: check rt summary file geometry more thoroughly · 04f0c326
      Darrick J. Wong authored
      I forgot that the xfs_mount tracks the size and number of levels in the
      realtime summary file, and that the rt summary file can have more blocks
      mapped to the data fork than m_rsumsize implies if growfsrt fails.
      
      So.  Add to the rtsummary scrubber an explicit check that all the
      summary geometry values are correct, then adjust the rtsummary i_size
      checks to allow for the growfsrt failure case.  Finally, flag post-eof
      blocks in the summary file.
      
      While we're at it, split the extent map checking so that we only call
      xfs_bmapi_read once per extent instead of once per rtsummary block.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      04f0c326
    • Darrick J. Wong's avatar
      xfs: check rt bitmap file geometry more thoroughly · 41991cf2
      Darrick J. Wong authored
      I forgot that the superblock tracks the number of blocks that are in the
      realtime bitmap, and that the rt bitmap file can have more blocks mapped
      to the data fork than sb_rbmblocks if growfsrt fails.
      
      So.  Add to the rtbitmap scrubber an explicit check that sb_rextents and
      sb_rbmblocks are correct, then adjust the rtbitmap i_size checks to
      allow for the growfsrt failure case.  Finally, flag post-eof blocks in
      the rtbitmap.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      41991cf2
    • Darrick J. Wong's avatar
      xfs: repair problems in CoW forks · dbbdbd00
      Darrick J. Wong authored
      Try to repair errors that we see in file CoW forks so that we don't do
      stupid things like remap garbage into a file.  There's not a lot we can
      do with the COW fork -- the ondisk metadata record only that the COW
      staging extents are owned by the refcount btree, which effectively means
      that we can't reconstruct this incore structure from scratch.
      
      Actually, this is even worse -- we can't touch written extents, because
      those map space that are actively under writeback, and there's not much
      to do with delalloc reservations.  Hence we can only detect crosslinked
      unwritten extents and fix them by punching out the problematic parts and
      replacing them with delalloc extents.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      dbbdbd00
    • Darrick J. Wong's avatar
      xfs: create a ranged query function for refcount btrees · d12bf8ba
      Darrick J. Wong authored
      Implement ranged queries for refcount records.  The next patch will use
      this to scan refcount data.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      d12bf8ba
    • Darrick J. Wong's avatar
      xfs: refactor repair forcing tests into a repair.c helper · 48a72f60
      Darrick J. Wong authored
      There are a couple of conditions that userspace can set to force repairs
      of metadata.  These really belong in the repair code and not open-coded
      into the check code, so refactor them into a helper.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      48a72f60
    • Darrick J. Wong's avatar
      xfs: repair inode fork block mapping data structures · 8f71bede
      Darrick J. Wong authored
      Use the reverse-mapping btree information to rebuild an inode block map.
      Update the btree bulk loading code as necessary to support inode rooted
      btrees and fix some bitrot problems.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      8f71bede
    • Darrick J. Wong's avatar
      xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents · 66da1128
      Darrick J. Wong authored
      Back in commit a55e0730 ("xfs: only allow reaping of per-AG
      blocks in xrep_reap_extents"), we removed from the reaping code the
      ability to handle bmbt blocks.  At the time, the reaping code only
      walked single blocks, didn't correctly detect crosslinked blocks, and
      the special casing made the function hard to understand.  It was easier
      to remove unneeded functionality prior to fixing all the bugs.
      
      Now that we've fixed the problems, we want again the ability to reap
      file metadata blocks.  Reintroduce the per-file reaping functionality
      atop the current implementation.  We require that sc->sa is
      uninitialized, so that we can use it to hold all the per-AG context for
      a given extent.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      66da1128
    • Darrick J. Wong's avatar
      xfs: skip the rmapbt search on an empty attr fork unless we know it was zapped · c3a22c2e
      Darrick J. Wong authored
      The attribute fork scrubber can optionally scan the reverse mapping
      records of the filesystem to determine if the fork is missing mappings
      that it should have.  However, this is a very expensive operation, so we
      only want to do this if we suspect that the fork is missing records.
      For attribute forks the criteria for suspicion is that the attr fork is
      in EXTENTS format and has zero extents.
      
      However, there are several ways that a file can end up in this state
      through regular filesystem usage.  For example, an LSM can set a
      s_security hook but then decide not to set an ACL; or an attr set can
      create the attr fork but then the actual set operation fails with
      ENOSPC; or we can delete all the attrs on a file whose data fork is in
      btree format, in which case we do not delete the attr fork.  We don't
      want to run the expensive check for any case that can be arrived at
      through regular operations.
      
      However.
      
      When online inode repair decides to zap an attribute fork, it cannot
      determine if it is zapping ACL information.  As a precaution it removes
      all the discretionary access control permissions and sets the user and
      group ids to zero.  Check these three additional conditions to decide if
      we want to scan the rmap records.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      c3a22c2e
    • Darrick J. Wong's avatar
      xfs: abort directory parent scrub scans if we encounter a zapped directory · 6c728952
      Darrick J. Wong authored
      In a previous patch, we added some code to perform sufficient repairs
      to an ondisk inode record such that the inode cache would be willing to
      load the inode.  If the broken inode was a shortform directory, it will
      reset the directory to something plausible, which is to say an empty
      subdirectory of the root.  The telltale signs that something is
      seriously wrong is the broken link count.
      
      Such directories look clean, but they shouldn't participate in a
      filesystem scan to find or confirm a directory parent pointer.  Create a
      predicate that identifies such directories and abort the scrub.
      
      Found by fuzzing xfs/1554 with multithreaded xfs_scrub enabled and
      u3.bmx[0].startblock = zeroes.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      6c728952
    • Darrick J. Wong's avatar
      xfs: zap broken inode forks · e744cef2
      Darrick J. Wong authored
      Determine if inode fork damage is responsible for the inode being unable
      to pass the ifork verifiers in xfs_iget and zap the fork contents if
      this is true.  Once this is done the fork will be empty but we'll be
      able to construct an in-core inode, and a subsequent call to the inode
      fork repair ioctl will search the rmapbt to rebuild the records that
      were in the fork.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      e744cef2
    • Darrick J. Wong's avatar
      xfs: repair inode records · 2d295fe6
      Darrick J. Wong authored
      If an inode is so badly damaged that it cannot be loaded into the cache,
      fix the ondisk metadata and try again.  If there /is/ a cached inode,
      fix any problems and apply any optimizations that can be solved incore.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      2d295fe6
    • Darrick J. Wong's avatar
      xfs: set inode sick state flags when we zap either ondisk fork · d9041681
      Darrick J. Wong authored
      In a few patches, we'll add some online repair code that tries to
      massage the ondisk inode record just enough to get it to pass the inode
      verifiers so that we can continue with more file repairs.  Part of that
      massaging can include zapping the ondisk forks to clear errors.  After
      that point, the bmap fork repair functions will rebuild the zapped
      forks.
      
      Christoph asked for stronger protections against online repair zapping a
      fork to get the inode to load vs. other threads trying to access the
      partially repaired file.  Do this by adding a special "[DA]FORK_ZAPPED"
      inode health flag whenever repair zaps a fork, and sprinkling checks for
      that flag into the various file operations for things that don't like
      handling an unexpected zero-extents fork.
      
      In practice xfs_scrub will scrub and fix the forks almost immediately
      after zapping them, so the window is very small.  However, if a crash or
      unmount should occur, we can still detect these zapped inode forks by
      looking for a zero-extents fork when data was expected.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      d9041681
    • Darrick J. Wong's avatar
      xfs: dont cast to char * for XFS_DFORK_*PTR macros · 6b5d9177
      Darrick J. Wong authored
      Code in the next patch will assign the return value of XFS_DFORK_*PTR
      macros to a struct pointer.  gcc complains about casting char* strings
      to struct pointers, so let's fix the macro's cast to void* to shut up
      the warnings.
      
      While we're at it, fix one of the scrub tests that uses PTR to use BOFF
      instead for a simpler integer comparison, since other linters whine
      about char* and void* comparisons.
      
      Can't satisfy all these dman bots.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      6b5d9177
    • Darrick J. Wong's avatar
      xfs: add missing nrext64 inode flag check to scrub · 576d30ec
      Darrick J. Wong authored
      Add this missing check that the superblock nrext64 flag is set if the
      inode flag is set.
      
      Fixes: 9b7d16e3 ("xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      576d30ec
    • Darrick J. Wong's avatar
      xfs: try to attach dquots to files before repairing them · 259ba1d3
      Darrick J. Wong authored
      Inode resource usage is tracked in the quota metadata.  Repairing a file
      might change the resources used by that file, which means that we need
      to attach dquots to the file that we're examining before accessing
      anything in the file protected by the ILOCK.
      
      However, there's a twist: a dquot cache miss requires the dquot to be
      read in from the quota file, during which we drop the ILOCK on the file
      being examined.  This means that we *must* try to attach the dquots
      before taking the ILOCK.
      
      Therefore, dquots must be attached to files in the scrub setup function.
      If doing so yields corruption errors (or unknown dquot errors), we
      instead clear the quotachecked status, which will cause a quotacheck on
      next mount.  A future series will make this trigger live quotacheck.
      
      While we're here, change the xrep_ino_dqattach function to use the
      unlocked dqattach functions so that we avoid cycling the ILOCK if the
      inode already has dquots attached.  This makes the naming and locking
      requirements consistent with the rest of the filesystem.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      259ba1d3
    • Darrick J. Wong's avatar
      xfs: disable online repair quota helpers when quota not enabled · d5aa62de
      Darrick J. Wong authored
      Don't compile the quota helper functions if quota isn't being built into
      the XFS module.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      d5aa62de
    • Darrick J. Wong's avatar
    • Darrick J. Wong's avatar
      xfs: repair inode btrees · dbfbf3bd
      Darrick J. Wong authored
      Use the rmapbt to find inode chunks, query the chunks to compute hole
      and free masks, and with that information rebuild the inobt and finobt.
      Refer to the case study in
      Documentation/filesystems/xfs-online-fsck-design.rst for more details.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      dbfbf3bd
    • Darrick J. Wong's avatar
      xfs: repair free space btrees · 4bdfd7d1
      Darrick J. Wong authored
      Rebuild the free space btrees from the gaps in the rmap btree.  Refer to
      the case study in Documentation/filesystems/xfs-online-fsck-design.rst
      for more details.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      4bdfd7d1
    • Darrick J. Wong's avatar
      xfs: remove trivial bnobt/inobt scrub helpers · 8bd0bf57
      Darrick J. Wong authored
      Christoph Hellwig complained about awkward code in the next two repair
      patches such as:
      
      	sc->sm->sm_type = XFS_SCRUB_TYPE_BNOBT;
      	error = xchk_bnobt(sc);
      
      This is a little silly, so let's export the xchk_{,i}allocbt functions
      to the dispatch table in scrub.c directly and get rid of the helpers.
      Originally I had planned each btree gets its own separate entry point,
      but since repair doesn't work that way, it no longer makes sense to
      complicate the call chain that way.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      8bd0bf57
    • Darrick J. Wong's avatar
      xfs: roll the scrub transaction after completing a repair · efb43b35
      Darrick J. Wong authored
      When we've finished repairing an AG header, roll the scrub transaction.
      This ensure that any failures caused by defer ops failing are captured
      by the xrep_done tracepoint and that any stacktraces that occur will
      point to the repair code that caused it, instead of xchk_teardown.
      
      Going forward, repair functions should commit the transaction if they're
      going to return success.  Usually the space reaping functions that run
      after a successful atomic commit of the new metadata will take care of
      that for us.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      efb43b35
    • Darrick J. Wong's avatar
      xfs: move the per-AG datatype bitmaps to separate files · 0f08af0f
      Darrick J. Wong authored
      Move struct xagb_bitmap to its own pair of C and header files per
      request of Christoph.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      0f08af0f
    • Darrick J. Wong's avatar
      xfs: create separate structures and code for u32 bitmaps · 6ece924b
      Darrick J. Wong authored
      Create a version of the xbitmap that handles 32-bit integer intervals
      and adapt the xfs_agblock_t bitmap to use it.  This reduces the size of
      the interval tree nodes from 48 to 36 bytes and enables us to use a more
      efficient slab (:0000040 instead of :0000048) which allows us to pack
      more nodes into a single slab page (102 vs 85).
      
      As a side effect, the users of these bitmaps no longer have to convert
      between u32 and u64 quantities just to use the bitmap; and the hairy
      overflow checking code in xagb_bitmap_test goes away.
      
      Later in this patchset we're going to add bitmaps for xfs_agino_t,
      xfs_rgblock_t, and xfs_dablk_t, so the increase in code size (5622 vs.
      9959 bytes) seems worth it.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      6ece924b
    • Darrick J. Wong's avatar
      xfs: constrain dirty buffers while formatting a staged btree · e069d549
      Darrick J. Wong authored
      Constrain the number of dirty buffers that are locked by the btree
      staging code at any given time by establishing a threshold at which we
      put them all on the delwri queue and push them to disk.  This limits
      memory consumption while writing out new btrees.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      e069d549
    • Darrick J. Wong's avatar
      xfs: move btree bulkload record initialization to ->get_record implementations · 6dfeb0c2
      Darrick J. Wong authored
      When we're performing a bulk load of a btree, move the code that
      actually stores the btree record in the new btree block out of the
      generic code and into the individual ->get_record implementations.
      This is preparation for being able to store multiple records with a
      single indirect call.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      6dfeb0c2
    • Darrick J. Wong's avatar
      xfs: add debug knobs to control btree bulk load slack factors · a20ffa7d
      Darrick J. Wong authored
      Add some debug knobs so that we can control the leaf and node block
      slack when rebuilding btrees.
      
      For developers, it might be useful to construct btrees of various
      heights by crafting a filesystem with a certain number of records and
      then using repair+knobs to rebuild the index with a certain shape.
      Practically speaking, you'd only ever do that for extreme stress
      testing of the runtime code or the btree generator.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      a20ffa7d
    • Darrick J. Wong's avatar
      xfs: read leaf blocks when computing keys for bulkloading into node blocks · 26de6462
      Darrick J. Wong authored
      When constructing a new btree, xfs_btree_bload_node needs to read the
      btree blocks for level N to compute the keyptrs for the blocks that will
      be loaded into level N+1.  The level N blocks must be formatted at that
      point.
      
      A subsequent patch will change the btree bulkloader to write new btree
      blocks in 256K chunks to moderate memory consumption if the new btree is
      very large.  As a consequence of that, it's possible that the buffers
      for lower level blocks might have been reclaimed by the time the node
      builder comes back to the block.
      
      Therefore, change xfs_btree_bload_node to read the lower level blocks
      to handle the reclaimed buffer case.  As a side effect, the read will
      increase the LRU refs, which will bias towards keeping new btree buffers
      in memory after the new btree commits.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      26de6462
    • Darrick J. Wong's avatar
      xfs: set XBF_DONE on newly formatted btree block that are ready for writing · c1e0f8e6
      Darrick J. Wong authored
      The btree bulkloading code calls xfs_buf_delwri_queue_here when it has
      finished formatting a new btree block and wants to queue it to be
      written to disk.  Once the new btree root has been committed, the blocks
      (and hence the buffers) will be accessible to the rest of the
      filesystem.  Mark each new buffer as DONE when adding it to the delwri
      list so that the next btree traversal can skip reloading the contents
      from disk.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      c1e0f8e6
    • Darrick J. Wong's avatar
      xfs: force all buffers to be written during btree bulk load · 13ae04d8
      Darrick J. Wong authored
      While stress-testing online repair of btrees, I noticed periodic
      assertion failures from the buffer cache about buffers with incorrect
      DELWRI_Q state.  Looking further, I observed this race between the AIL
      trying to write out a btree block and repair zapping a btree block after
      the fact:
      
      AIL:    Repair0:
      
      pin buffer X
      delwri_queue:
      set DELWRI_Q
      add to delwri list
      
              stale buf X:
              clear DELWRI_Q
              does not clear b_list
              free space X
              commit
      
      delwri_submit   # oops
      
      Worse yet, I discovered that running the same repair over and over in a
      tight loop can result in a second race that cause data integrity
      problems with the repair:
      
      AIL:    Repair0:        Repair1:
      
      pin buffer X
      delwri_queue:
      set DELWRI_Q
      add to delwri list
      
              stale buf X:
              clear DELWRI_Q
              does not clear b_list
              free space X
              commit
      
                              find free space X
                              get buffer
                              rewrite buffer
                              delwri_queue:
                              set DELWRI_Q
                              already on a list, do not add
                              commit
      
                              BAD: committed tree root before all blocks written
      
      delwri_submit   # too late now
      
      I traced this to my own misunderstanding of how the delwri lists work,
      particularly with regards to the AIL's buffer list.  If a buffer is
      logged and committed, the buffer can end up on that AIL buffer list.  If
      btree repairs are run twice in rapid succession, it's possible that the
      first repair will invalidate the buffer and free it before the next time
      the AIL wakes up.  Marking the buffer stale clears DELWRI_Q from the
      buffer state without removing the buffer from its delwri list.  The
      buffer doesn't know which list it's on, so it cannot know which lock to
      take to protect the list for a removal.
      
      If the second repair allocates the same block, it will then recycle the
      buffer to start writing the new btree block.  Meanwhile, if the AIL
      wakes up and walks the buffer list, it will ignore the buffer because it
      can't lock it, and go back to sleep.
      
      When the second repair calls delwri_queue to put the buffer on the
      list of buffers to write before committing the new btree, it will set
      DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
      buffer list, it won't add it to the bulkload buffer's list.
      
      This is incorrect, because the bulkload caller relies on delwri_submit
      to ensure that all the buffers have been sent to disk /before/
      committing the new btree root pointer.  This ordering requirement is
      required for data consistency.
      
      Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
      drop it, so the next thread to walk through the btree will trip over a
      debug assertion on that flag.
      
      To fix this, create a new function that waits for the buffer to be
      removed from any other delwri lists before adding the buffer to the
      caller's delwri list.  By waiting for the buffer to clear both the
      delwri list and any potential delwri wait list, we can be sure that
      repair will initiate writes of all buffers and report all write errors
      back to userspace instead of committing the new structure.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      13ae04d8
    • Dave Chinner's avatar
      xfs: initialise di_crc in xfs_log_dinode · 0573676f
      Dave Chinner authored
      Alexander Potapenko report that KMSAN was issuing these warnings:
      
      kmalloc-ed xlog buffer of size 512 : ffff88802fc26200
      kmalloc-ed xlog buffer of size 368 : ffff88802fc24a00
      kmalloc-ed xlog buffer of size 648 : ffff88802b631000
      kmalloc-ed xlog buffer of size 648 : ffff88802b632800
      kmalloc-ed xlog buffer of size 648 : ffff88802b631c00
      xlog_write_iovec: copying 12 bytes from ffff888017ddbbd8 to ffff88802c300400
      xlog_write_iovec: copying 28 bytes from ffff888017ddbbe4 to ffff88802c30040c
      xlog_write_iovec: copying 68 bytes from ffff88802fc26274 to ffff88802c300428
      xlog_write_iovec: copying 188 bytes from ffff88802fc262bc to ffff88802c30046c
      =====================================================
      BUG: KMSAN: uninit-value in xlog_write_iovec fs/xfs/xfs_log.c:2227
      BUG: KMSAN: uninit-value in xlog_write_full fs/xfs/xfs_log.c:2263
      BUG: KMSAN: uninit-value in xlog_write+0x1fac/0x2600 fs/xfs/xfs_log.c:2532
       xlog_write_iovec fs/xfs/xfs_log.c:2227
       xlog_write_full fs/xfs/xfs_log.c:2263
       xlog_write+0x1fac/0x2600 fs/xfs/xfs_log.c:2532
       xlog_cil_write_chain fs/xfs/xfs_log_cil.c:918
       xlog_cil_push_work+0x30f2/0x44e0 fs/xfs/xfs_log_cil.c:1263
       process_one_work kernel/workqueue.c:2630
       process_scheduled_works+0x1188/0x1e30 kernel/workqueue.c:2703
       worker_thread+0xee5/0x14f0 kernel/workqueue.c:2784
       kthread+0x391/0x500 kernel/kthread.c:388
       ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
      
      Uninit was created at:
       slab_post_alloc_hook+0x101/0xac0 mm/slab.h:768
       slab_alloc_node mm/slub.c:3482
       __kmem_cache_alloc_node+0x612/0xae0 mm/slub.c:3521
       __do_kmalloc_node mm/slab_common.c:1006
       __kmalloc+0x11a/0x410 mm/slab_common.c:1020
       kmalloc ./include/linux/slab.h:604
       xlog_kvmalloc fs/xfs/xfs_log_priv.h:704
       xlog_cil_alloc_shadow_bufs fs/xfs/xfs_log_cil.c:343
       xlog_cil_commit+0x487/0x4dc0 fs/xfs/xfs_log_cil.c:1574
       __xfs_trans_commit+0x8df/0x1930 fs/xfs/xfs_trans.c:1017
       xfs_trans_commit+0x30/0x40 fs/xfs/xfs_trans.c:1061
       xfs_create+0x15af/0x2150 fs/xfs/xfs_inode.c:1076
       xfs_generic_create+0x4cd/0x1550 fs/xfs/xfs_iops.c:199
       xfs_vn_create+0x4a/0x60 fs/xfs/xfs_iops.c:275
       lookup_open fs/namei.c:3477
       open_last_lookups fs/namei.c:3546
       path_openat+0x29ac/0x6180 fs/namei.c:3776
       do_filp_open+0x24d/0x680 fs/namei.c:3809
       do_sys_openat2+0x1bc/0x330 fs/open.c:1440
       do_sys_open fs/open.c:1455
       __do_sys_openat fs/open.c:1471
       __se_sys_openat fs/open.c:1466
       __x64_sys_openat+0x253/0x330 fs/open.c:1466
       do_syscall_x64 arch/x86/entry/common.c:51
       do_syscall_64+0x4f/0x140 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x63/0x6b arch/x86/entry/entry_64.S:120
      
      Bytes 112-115 of 188 are uninitialized
      Memory access of size 188 starts at ffff88802fc262bc
      
      This is caused by the struct xfs_log_dinode not having the di_crc
      field initialised. Log recovery never uses this field (it is only
      present these days for on-disk format compatibility reasons) and so
      it's value is never checked so nothing in XFS has caught this.
      
      Further, none of the uninitialised memory access warning tools have
      caught this (despite catching other uninit memory accesses in the
      struct xfs_log_dinode back in 2017!) until recently. Alexander
      annotated the XFS code to get the dump of the actual bytes that were
      detected as uninitialised, and from that report it took me about 30s
      to realise what the issue was.
      
      The issue was introduced back in 2016 and every inode that is logged
      fails to initialise this field. This is no actual bad behaviour
      caused by this issue - I find it hard to even classify it as a
      bug...
      Reported-and-tested-by: default avatarAlexander Potapenko <glider@google.com>
      Fixes: f8d55aa0 ("xfs: introduce inode log format object")
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      0573676f
    • Darrick J. Wong's avatar
      xfs: fix an off-by-one error in xreap_agextent_binval · c0e37f07
      Darrick J. Wong authored
      Overall, this function tries to find and invalidate all buffers for a
      given extent of space on the data device.  The inner for loop in this
      function tries to find all xfs_bufs for a given daddr.  The lengths of
      all possible cached buffers range from 1 fsblock to the largest needed
      to contain a 64k xattr value (~17fsb).  The scan is capped to avoid
      looking at anything buffer going past the given extent.
      
      Unfortunately, the loop continuation test is wrong -- max_fsbs is the
      largest size we want to scan, not one past that.  Put another way, this
      loop is actually 1-indexed, not 0-indexed.  Therefore, the continuation
      test should use <=, not <.
      
      As a result, online repairs of btree blocks fails to stale any buffers
      for btrees that are being torn down, which causes later assertions in
      the buffer cache when another thread creates a different-sized buffer.
      This happens in xfs/709 when allocating an inode cluster buffer:
      
       ------------[ cut here ]------------
       WARNING: CPU: 0 PID: 3346128 at fs/xfs/xfs_message.c:104 assfail+0x3a/0x40 [xfs]
       CPU: 0 PID: 3346128 Comm: fsstress Not tainted 6.7.0-rc4-djwx #rc4
       RIP: 0010:assfail+0x3a/0x40 [xfs]
       Call Trace:
        <TASK>
        _xfs_buf_obj_cmp+0x4a/0x50
        xfs_buf_get_map+0x191/0xba0
        xfs_trans_get_buf_map+0x136/0x280
        xfs_ialloc_inode_init+0x186/0x340
        xfs_ialloc_ag_alloc+0x254/0x720
        xfs_dialloc+0x21f/0x870
        xfs_create_tmpfile+0x1a9/0x2f0
        xfs_rename+0x369/0xfd0
        xfs_vn_rename+0xfa/0x170
        vfs_rename+0x5fb/0xc30
        do_renameat2+0x52d/0x6e0
        __x64_sys_renameat2+0x4b/0x60
        do_syscall_64+0x3b/0xe0
        entry_SYSCALL_64_after_hwframe+0x46/0x4e
      
      A later refactoring patch in the online repair series fixed this by
      accident, which is why I didn't notice this until I started testing only
      the patches that are likely to end up in 6.8.
      
      Fixes: 1c7ce115 ("xfs: reap large AG metadata extents when possible")
      Signed-off-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      c0e37f07
    • Eric Sandeen's avatar
      xfs: short circuit xfs_growfs_data_private() if delta is zero · 84712492
      Eric Sandeen authored
      Although xfs_growfs_data() doesn't call xfs_growfs_data_private()
      if in->newblocks == mp->m_sb.sb_dblocks, xfs_growfs_data_private()
      further massages the new block count so that we don't i.e. try
      to create a too-small new AG.
      
      This may lead to a delta of "0" in xfs_growfs_data_private(), so
      we end up in the shrink case and emit the EXPERIMENTAL warning
      even if we're not changing anything at all.
      
      Fix this by returning straightaway if the block delta is zero.
      
      (nb: in older kernels, the result of entering the shrink case
      with delta == 0 may actually let an -ENOSPC escape to userspace,
      which is confusing for users.)
      
      Fixes: fb2fc172 ("xfs: support shrinking unused space in the last AG")
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      84712492