Commits · 6cee51e6d02bac7ee72969aa23e32c9bdcd7bb6e · Kirill Smelkov / linux

12 Apr, 2023 40 commits

xfs: clean up xattr scrub initialization · 6cee51e6

Darrick J. Wong authored Apr 11, 2023

Clean up local variable initialization and error returns in xchk_xattr.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

6cee51e6

xfs: check used space of shortform xattr structures · ae0506eb

Darrick J. Wong authored Apr 11, 2023

Make sure that the records used inside a shortform xattr structure do
not overlap.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

ae0506eb

xfs: move xattr scrub buffer allocation to top level function · 5b02a3e8

Darrick J. Wong authored Apr 11, 2023

Move the xchk_setup_xattr_buf call from xchk_xattr_block to xchk_xattr,
since we only need to set up the leaf block bitmaps once.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

5b02a3e8

xfs: remove flags argument from xchk_setup_xattr_buf · f58977ed

Darrick J. Wong authored Apr 11, 2023

All callers pass XCHK_GFP_FLAGS as the flags argument to
xchk_setup_xattr_buf, so get rid of the argument.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

f58977ed

xfs: split valuebuf from xchk_xattr_buf.buf · b996c9a8

Darrick J. Wong authored Apr 11, 2023

Move the xattr value buffer from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

b996c9a8

xfs: split usedmap from xchk_xattr_buf.buf · 80069284

Darrick J. Wong authored Apr 11, 2023

Move the used space bitmap from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

80069284

xfs: split freemap from xchk_xattr_buf.buf · 91781ff5

Darrick J. Wong authored Apr 11, 2023

Move the free space bitmap from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer.  This is the start of removing the complex overloaded
memory buffer that is the source of weird memory misuse bugs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

91781ff5

xfs: remove unnecessary dstmap in xattr scrubber · 4cb76025

Darrick J. Wong authored Apr 11, 2023

Replace bitmap_and with bitmap_intersects in the xattr leaf block
scrubber, since we only care if there's overlap between the used space
bitmap and the free space bitmap.  This means we don't need dstmap any
more, and can thus reduce the memory requirements.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

4cb76025

xfs: don't shadow @leaf in xchk_xattr_block · ee366fe4

Darrick J. Wong authored Apr 11, 2023

Don't shadow the leaf variable here, because it's misleading to have one
place in the codebase where two variables with different types have the
same name.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

ee366fe4

xfs: xattr scrub should ensure one namespace bit per name · c12ad414

Darrick J. Wong authored Apr 11, 2023

Check that each extended attribute exists in only one namespace.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

c12ad414

xfs: check for reverse mapping records that could be merged · 1c1646af

Darrick J. Wong authored Apr 11, 2023

Enhance the rmap scrubber to flag adjacent records that could be merged.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

1c1646af

xfs: check overlapping rmap btree records · 29ab991b

Darrick J. Wong authored Apr 11, 2023

The rmap btree scrubber doesn't contain sufficient checking for records
that cannot overlap but do anyway.  For the other btrees, this is
enforced by the inorder checks in xchk_btree_rec, but the rmap btree is
special because it allows overlapping records to handle shared data
extents.

Therefore, enhance the rmap btree record check function to compare each
record against the previous one so that we can detect overlapping rmap
records for space allocations that do not allow sharing.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

29ab991b

xfs: flag refcount btree records that could be merged · db0502b3

Darrick J. Wong authored Apr 11, 2023

Complain if we encounter refcount btree records that could be merged.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

db0502b3

xfs: flag free space btree records that could be merged · d5784ae8

Darrick J. Wong authored Apr 11, 2023

Complain if we encounter free space btree records that could be merged.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

d5784ae8

xfs: don't call xchk_bmap_check_rmaps for btree-format file forks · 1e59fdb7

Darrick J. Wong authored Apr 11, 2023

The logic at the end of xchk_bmap_want_check_rmaps tries to detect a
file fork that has been zapped by what will become the online inode
repair code.  Zapped forks are in FMT_EXTENTS with zero extents, and
some sort of hint that there's supposed to be data somewhere in the
filesystem.

Unfortunately, the inverted logic here is confusing and has the effect
that we always call xchk_bmap_check_rmaps for FMT_BTREE forks.  This is
horribly inefficient and unnecessary, so invert the logic to get rid of
this performance problem.  This has caused 8h delays in generic/333 and
generic/334.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

1e59fdb7

xfs: split the xchk_bmap_check_rmaps into a predicate · e8882f69

Darrick J. Wong authored Apr 11, 2023

This function has two parts: the second part scans every reverse mapping
record for this file fork to make sure that there's a corresponding
mapping in the fork, and the first part decides if we even want to do
that.

Split the first part into a separate predicate so that we can make more
changes to it in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

e8882f69

xfs: alert the user about data/attr fork mappings that could be merged · 336642f7

Darrick J. Wong authored Apr 11, 2023

If the data or attr forks have mappings that could be merged, let the
user know that the structure could be optimized.  This isn't a
filesystem corruption since the regular filesystem does not try to be
smart about merging bmbt records.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

336642f7

xfs: split xchk_bmap_xref_rmap into two functions · c0d5a92f

Darrick J. Wong authored Apr 11, 2023

There's more special-cased functionality than not in this function.
Split it into two so that each can be far more cohesive.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

c0d5a92f

xfs: accumulate iextent records when checking bmap · 634d4a79

Darrick J. Wong authored Apr 11, 2023

Currently, the bmap scrubber checks file fork mappings individually. In
the case that the file uses multiple mappings to a single contiguous
piece of space, the scrubber repeatedly locks the AG to check the
existence of a reverse mapping that overlaps this file mapping. If the
reverse mapping starts before or ends after the mapping we're checking,
it will also crawl around in the bmbt checking correspondence for
adjacent extents.

This is not very time efficient because it does the crawling while
holding the AGF buffer, and checks the middle mappings multiple times.
Instead, create a custom iextent record iterator function that combines
multiple adjacent allocated mappings into one large incore bmbt record.
This is feasible because the incore bmbt record length is 64-bits wide.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

634d4a79

xfs: change bmap scrubber to store the previous mapping · 971ee3a6

Darrick J. Wong authored Apr 11, 2023

Convert the inode data/attr/cow fork scrubber to remember the entire
previous mapping, not just the next expected offset.  No behavior
changes here, but this will enable some better checking in subsequent
patches.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

971ee3a6

xfs: don't take the MMAPLOCK when scrubbing file metadata · 1fc7a059

Darrick J. Wong authored Apr 11, 2023

The MMAPLOCK stabilizes mappings in a file's pagecache. Therefore, we
do not need it to check directories, symlinks, extended attributes, or
file-based metadata. Reduce its usage to the one case that requires it,
which is when we want to scrub the data fork of a regular file.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

1fc7a059

xfs: retain the AGI when we can't iget an inode to scrub the core · 38bb1310

Darrick J. Wong authored Apr 11, 2023

xchk_get_inode is not quite the right function to be calling from the
inode scrubber setup function. The common get_inode function either
gets an inode and installs it in the scrub context, or it returns an
error code explaining what happened. This is acceptable for most file
scrubbers because it is not in their scope to fix corruptions in the
inode core and fork areas that cause iget to fail.

Dealing with these problems is within the scope of the inode scrubber,
however. If iget fails with EFSCORRUPTED, we need to xchk_inode to flag
that as corruption. Since we can't get our hands on an incore inode, we
need to hold the AGI to prevent inode allocation activity so that
nothing changes in the inode metadata.

Looking ahead to the inode core repair patches, we will also need to
hold the AGI buffer into xrep_inode so that we can make modifications to
the xfs_dinode structure without any other thread swooping in to
allocate or free the inode.

Adapt the xchk_get_inode into xchk_setup_inode since this is a one-off
use case where the error codes we check for are a little different, and
the return state is much different from the common function.

xchk_setup_inode prepares to check or repair an inode record, so it must
continue the scrub operation even if the inode/inobt verifiers cause
xfs_iget to return EFSCORRUPTED. This is done by attaching the locked
AGI buffer to the scrub transaction and returning 0 to move on to the
actual scrub. (Later, the online inode repair code will also want the
xfs_imap structure so that it can reset the ondisk xfs_dinode
structure.)

xchk_get_inode retrieves an inode on behalf of a scrubber that operates
on an incore inode -- data/attr/cow forks, directories, xattrs,
symlinks, parent pointers, etc. If the inode/inobt verifiers fail and
xfs_iget returns EFSCORRUPTED, we want to exit to userspace (because the
caller should be fix the inode first) and drop everything we acquired
along the way.

A behavior common to both functions is that it's possible that xfs_scrub
asked for a scrub-by-handle concurrent with the inode being freed or the
passed-in inumber is invalid. In this case, we call xfs_imap to see if
the inobt index thinks the inode is allocated, and return ENOENT
("nothing to check here") to userspace if this is not the case. The
imap lookup is why both functions call xchk_iget_agi.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

38bb1310

xfs: rename xchk_get_inode -> xchk_iget_for_scrubbing · 46e0dd89

Darrick J. Wong authored Apr 11, 2023

Dave Chinner suggested renaming this function to make more obvious what
it does.  The function returns an incore inode to callers that want to
scrub a metadata structure that hangs off an inode.  If the iget fails
with EINVAL, it will single-step the loading process to distinguish
between actually free inodes or impossible inumbers (ENOENT);
discrepancies between the inobt freemask and the free status in the
inode record (EFSCORRUPTED).  Any other negative errno is returned
unchanged.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

46e0dd89

xfs: fix an inode lookup race in xchk_get_inode · 302436c2

Darrick J. Wong authored Apr 11, 2023

In commit d658e, we tried to improve the robustnes of xchk_get_inode in
the face of EINVAL returns from iget by calling xfs_imap to see if the
inobt itself thinks that the inode is allocated.  Unfortunately, that
commit didn't consider the possibility that the inode gets allocated
after iget but before imap.  In this case, the imap call will succeed,
but we turn that into a corruption error and tell userspace the inode is
corrupt.

Avoid this false corruption report by grabbing the AGI header and
retrying the iget before calling imap.  If the iget succeeds, we can
proceed with the usual scrub-by-handle code.  Fix all the incorrect
comments too, since unreadable/corrupt inodes no longer result in EINVAL
returns.

Fixes: d658e72b ("xfs: distinguish between corrupt inode and invalid inum in xfs_scrub_get_inode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

302436c2

xfs: manage inode DONTCACHE status at irele time · a03297a0

Darrick J. Wong authored Apr 11, 2023

Right now, there are statements scattered all over the online fsck
codebase about how we can't use XFS_IGET_DONTCACHE because of concerns
about scrub's unusual practice of releasing inodes with transactions
held.

However, iget is the wrong place to handle this -- the DONTCACHE state
doesn't matter at all until we try to *release* the inode, and here we
get things wrong in multiple ways:

First, if we /do/ have a transaction, we must NOT drop the inode,
because the inode could have dirty pages, dropping the inode will
trigger writeback, and writeback can trigger a nested transaction.

Second, if the inode already had an active reference and the DONTCACHE
flag set, the icache hit when scrub grabs another ref will not clear
DONTCACHE.  This is sort of by design, since DONTCACHE is now used to
initiate cache drops so that sysadmins can change a file's access mode
between pagecache and DAX.

Third, if we do actually have the last active reference to the inode, we
can set DONTCACHE to avoid polluting the cache.  This is the /one/ case
where we actually want that flag.

Create an xchk_irele helper to encode all that logic and switch the
online fsck code to use it.  Since this now means that nearly all
scrubbers use the same xfs_iget flags, we can wrap them too.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

a03297a0

xfs: fix parent pointer scrub racing with subdirectory reparenting · 0916056e

Darrick J. Wong authored Apr 11, 2023

Jan Kara pointed out that rename() doesn't lock a subdirectory that is
being moved from one parent to another, even though the move requires an
update to the subdirectory's dotdot entry. This means that it's *not*
sufficient to hold a directory's IOLOCK to stabilize the dotdot entry.
We must hold the ILOCK of both the child and the alleged parent, and
there's no use in holding the parent's IOLOCK.

With that in mind, we can get rid of all the messy code that tries to
grab the parent's IOLOCK, which means we don't need to let go of the
ILOCK of the directory whose parent we are checking. We still have to
use nonblocking mode to take the ILOCK of the alleged parent, so the
revalidation loop has to stay.

However, we can remove the retry counter, since threads aren't supposed
to hold the ILOCK for long periods of time. Remove the inverted ilock
helper from the common code since nobody uses it. Remove the entire
source of -EDEADLOCK-based "retry harder" scrub executions.

Link: https://lore.kernel.org/linux-xfs/20230117123735.un7wbamlbdihninm@quack3/Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

0916056e

xfs: simplify xchk_parent_validate · b049962c

Darrick J. Wong authored Apr 11, 2023

This function is unnecessarily long because it contains code to
revalidate a dotdot entry after cycling locks to try to confirm a
subdirectory parent pointer.  Shorten the codebase by making the
parent's lookup call do double duty as the revalidation code.

This weakeans the efficacy of this scrub function temporarily, but the
next patch will resolve this as part of fixing an unhandled race that is
the result of the VFS rename locking model not working the way Darrick
thought it did.

Rename this stupid 'dnum' variable too.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

b049962c

xfs: remove xchk_parent_count_parent_dentries · cbab28f4

Darrick J. Wong authored Apr 11, 2023

This helper is now trivial, so get rid of it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

cbab28f4

xfs: always check the existence of a dirent's child inode · 6bb9209c

Darrick J. Wong authored Apr 11, 2023

When we're scrubbing directory entries, we always need to iget the child
inode to make sure that the inode pointer points to a valid inode. The
original directory scrub code (commit a5c4) only set us up to do this
for ftype=1 filesystems, which is not sufficient; and then commit 4b80
made it worse by exempting the dot and dotdot entries.

Sorta-fixes: a5c46e5e ("xfs: scrub directory metadata")
Sorta-fixes: 4b80ac64 ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

6bb9209c

xfs: xfs_iget in the directory scrubber needs to use UNTRUSTED · d9a94480

Darrick J. Wong authored Apr 11, 2023

In commit 4b80ac64, we tried to strengthen the directory scrubber by
using the iget call to detect directory entries that point to
unallocated inodes. Unfortunately, that commit neglected to pass
XFS_IGET_UNTRUSTED to xfs_iget, so we don't check the inode btree first.
If the inode number points to something that isn't even an inode
cluster, iget will throw corruption errors and return -EFSCORRUPTED,
which means that we fail to mark the directory corrupt.

Fixes: 4b80ac64 ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

d9a94480

xfs: streamline the directory iteration code for scrub · 4c233b5c

Darrick J. Wong authored Apr 11, 2023

Currently, online scrub reuses the xfs_readdir code to walk every entry
in a directory. This isn't awesome for performance, since we end up
cycling the directory ILOCK needlessly and coding around the particular
quirks of the VFS dir_context interface.

Create a streamlined version of readdir that keeps the ILOCK (since the
walk function isn't going to copy stuff to userspace), skips a whole lot
of directory walk cursor checks (since we start at 0 and walk to the
end) and has a sane way to return error codes.

Note: Porting the dotdot checking code is left for a subsequent patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

4c233b5c

xfs: use the directory name hash function for dir scrubbing · 9dceccc5

Darrick J. Wong authored Apr 11, 2023

The directory code has a directory-specific hash computation function
that includes a modified hash function for case-insensitive lookups.
Hence we must use that function (and not the raw da_hashname) when
checking the dabtree structure.

Found by accidentally breaking xfs/188 to create an abnormally huge
case-insensitive directory and watching scrub break.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

9dceccc5

xfs: ensure that single-owner file blocks are not owned by others · 30f8ee5e

Darrick J. Wong authored Apr 11, 2023

For any file fork mapping that can only have a single owner, make sure
that there are no other rmap owners for that mapping. This patch
requires the more detailed checking provided by xfs_rmap_count_owners so
that we can know how many rmap records for a given range of space had a
matching owner, how many had a non-matching owner, and how many
conflicted with the records that have a matching owner.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

30f8ee5e

xfs: teach scrub to check for sole ownership of metadata objects · 69115f77

Darrick J. Wong authored Apr 11, 2023

Strengthen online scrub's checking even further by enabling us to check
that a range of blocks are owned solely by a given owner.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

69115f77

xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan results · efc0845f

Darrick J. Wong authored Apr 11, 2023

Convert the xfs_ialloc_has_inodes_at_extent function to return keyfill
scan results because for a given range of inode numbers, we might have
no indexed inodes at all; the entire region might be allocated ondisk
inodes; or there might be a mix of the two.

Unfortunately, sparse inodes adds to the complexity, because each inode
record can have holes, which means that we cannot use the generic btree
_scan_keyfill function because we must look for holes in individual
records to decide the result.  On the plus side, online fsck can now
detect sub-chunk discrepancies in the inobt.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

efc0845f

xfs: directly cross-reference the inode btrees with each other · bc0f3b55

Darrick J. Wong authored Apr 11, 2023

Improve the cross-referencing of the two inode btrees by directly
checking the free and hole state of each inode with the other btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

bc0f3b55

xfs: clean up broken eearly-exit code in the inode btree scrubber · c01868b6

Darrick J. Wong authored Apr 11, 2023

Corrupt inode chunks should cause us to exit early after setting the
CORRUPT flag on the scrub state.  While we're at it, collapse trivial
helpers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

c01868b6

xfs: remove pointless shadow variable from xfs_difree_inobt · cc120766

Darrick J. Wong authored Apr 11, 2023

In xfs_difree_inobt, the pag passed in was previously used to look up
the AGI buffer.  There's no need to extract it again, so remove the
shadow variable and shut up -Wshadow.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

cc120766

xfs: ensure that all metadata and data blocks are not cow staging extents · 7ac14fa2

Darrick J. Wong authored Apr 11, 2023

Make sure that all filesystem metadata blocks and file data blocks are
not also marked as CoW staging extents.  The extra checking added here
was inspired by an actual VM host filesystem corruption incident due to
bugs in the CoW handling of 4.x kernels.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

7ac14fa2

xfs: check the reference counts of gaps in the refcount btree · 7ad9ea63

Darrick J. Wong authored Apr 11, 2023

Gaps in the reference count btree are also significant -- for these
regions, there must not be any overlapping reverse mappings.  We don't
currently check this, so make the refcount scrubber more complete.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

7ad9ea63