Commits · be51f8119c2f5e27437d2c4271f6419f3b8e609f · Kirill Smelkov / linux

05 Oct, 2016 6 commits

xfs: support bmapping delalloc extents in the CoW fork · be51f811

Darrick J. Wong authored Oct 03, 2016

Allow the creation of delayed allocation extents in the CoW fork.  In
a subsequent patch we'll wire up iomap_begin to actually do this via
reflink helper functions.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: introduce the CoW fork · 3993baeb

Darrick J. Wong authored Oct 03, 2016

Introduce a new in-core fork for storing copy-on-write delalloc
reservations and allocated extents that are in the process of being
written out.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: don't allow reflinked dir/dev/fifo/socket/pipe files · 11715a21

Darrick J. Wong authored Oct 03, 2016

Only regular, non-rt files can be reflinked, so check that when we load an
inode.  Also, don't leak the attr fork if there's a failure.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: add reflink feature flag to geometry · f0ec1b8e

Darrick J. Wong authored Oct 03, 2016

Report the reflink feature in the XFS geometry so that xfs_info and
friends know the filesystem has this feature.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: define tracepoints for reflink activities · 53aa1c34

Darrick J. Wong authored Oct 03, 2016

Define all the tracepoints we need to inspect the runtime operation
of reflink/dedupe/copy-on-write.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: return work remaining at the end of a bunmapi operation · 4453593b

Darrick J. Wong authored Oct 03, 2016

Return the range of file blocks that bunmapi didn't free.  This hint
is used by CoW and reflink to figure out what part of an extent
actually got freed so that it can set up the appropriate atomic
remapping of just the freed range.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


04 Oct, 2016 6 commits

xfs: when replaying bmap operations, don't let unlinked inodes get reaped · 17c12bcd

Darrick J. Wong authored Oct 03, 2016

Log recovery will iget an inode to replay BUI items and iput the inode
when it's done.  Unfortunately, if the inode was unlinked, the iput
will see that i_nlink == 0 and decide to truncate & free the inode,
which prevents us from replaying subsequent BUIs.  We can't skip the
BUIs because we have to replay all the redo items to ensure that
atomic operations complete.

Since unlinked inode recovery will reap the inode anyway, we can
safely introduce a new inode flag to indicate that an inode is in this
'unlinked recovery' state and should not be auto-reaped in the
drop_inode path.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
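
The reaping decision described in this commit can be pictured with a toy model. This is only a sketch: `struct toy_inode`, `TOY_IRECOVERY`, and `toy_drop_inode` are hypothetical names invented for illustration and are not the kernel's structures or its drop_inode implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the new "unlinked recovery" inode flag. */
#define TOY_IRECOVERY 0x1u

struct toy_inode {
	unsigned int nlink;   /* link count, 0 means unlinked */
	unsigned int flags;   /* in-core state flags */
};

/*
 * Returns true if the final iput should truncate and free the inode.
 * An unlinked inode is normally reaped, unless log recovery has marked
 * it so that later intent items can still be replayed against it.
 */
bool toy_drop_inode(const struct toy_inode *ip)
{
	assert(ip != NULL);
	if (ip->flags & TOY_IRECOVERY)
		return false;
	return ip->nlink == 0;
}
```

In this model, unlinked inode recovery would be the step that finally frees a flagged inode, which is why skipping the automatic reap is safe.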


xfs: implement deferred bmbt map/unmap operations · 9f3afb57

Darrick J. Wong authored Oct 03, 2016

Implement deferred versions of the inode block map/unmap functions.
These will be used in subsequent patches to make reflink operations
atomic.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: pass bmapi flags through to bmap_del_extent · 4847acf8

Darrick J. Wong authored Oct 03, 2016

Pass BMAPI_ flags from bunmapi into bmap_del_extent and extend
BMAPI_REMAP (which means "don't touch the allocator or the quota
accounting") to apply to bunmapi as well.  This will be used to
implement the unmap operation, which will be used by swapext.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: map an inode's offset to an exact physical block · f65306ea

Darrick J. Wong authored Oct 03, 2016

Teach the bmap routine to know how to map a range of file blocks to a
specific range of physical blocks, instead of simply allocating fresh
blocks.  This enables reflink to map a file to blocks that are already
in use.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: log bmap intent items · 77d61fe4

Darrick J. Wong authored Oct 03, 2016

Provide a mechanism for higher levels to create BUI/BUD items, submit
them to the log, and a stub function to deal with recovered BUI items.
These parts will be connected to the rmapbt in a later patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: create bmbt update intent log items · 6413a014

Darrick J. Wong authored Oct 03, 2016

Create bmbt update intent/done log items to record redo information in
the log.  Because we roll transactions multiple times for reflink
operations, we also have to track the status of the metadata updates
that will be recorded in the post-roll transactions in case we crash
before committing the final transaction.  This mechanism enables log
recovery to finish what was already started.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
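
The intent/done pairing can be illustrated with a small stand-alone model of recovery. It is a sketch under assumptions: the record layout, `struct toy_rec`, and `toy_find_unfinished` are invented here and do not reflect the on-disk BUI/BUD format.

```c
#include <assert.h>
#include <stddef.h>

enum toy_rec_type { TOY_INTENT, TOY_DONE };

/* One log record: an intent to do work, or a note that it finished. */
struct toy_rec {
	enum toy_rec_type type;
	unsigned int id;      /* ties a done record to its intent */
};

/*
 * Scan the log and collect the ids of intents that have no matching
 * done record after them. These are the operations recovery must redo.
 * Returns the number of ids written to out (at most out_cap).
 */
size_t toy_find_unfinished(const struct toy_rec *log, size_t n,
			   unsigned int *out, size_t out_cap)
{
	size_t found = 0;

	assert(log != NULL && out != NULL);
	for (size_t i = 0; i < n; i++) {
		int done = 0;

		if (log[i].type != TOY_INTENT)
			continue;
		for (size_t j = i + 1; j < n; j++) {
			if (log[j].type == TOY_DONE && log[j].id == log[i].id) {
				done = 1;
				break;
			}
		}
		if (!done && found < out_cap)
			out[found++] = log[i].id;
	}
	return found;
}
```

A crash between an intent and its done record leaves the intent unmatched, so recovery in this model knows exactly which updates to finish.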


03 Oct, 2016 18 commits

xfs: introduce reflink utility functions · 350a27a6

Darrick J. Wong authored Oct 03, 2016

These functions will be used by the other reflink functions to find
the maximum length of a range of shared blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
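
As a rough illustration of "the maximum length of a range of shared blocks", the sketch below walks a flat array of per-block reference counts. The flat array and the name `toy_shared_run_len` are assumptions made for the example; the real helpers consult the refcount btree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Length of the run of shared blocks (reference count of 2 or more)
 * that begins at block 'start', capped at 'maxlen' blocks.
 * Returns 0 if the block at 'start' is not shared.
 */
size_t toy_shared_run_len(const uint32_t *refcount, size_t nblocks,
			  size_t start, size_t maxlen)
{
	size_t len = 0;

	assert(refcount != NULL);
	while (start + len < nblocks && len < maxlen &&
	       refcount[start + len] >= 2)
		len++;
	return len;
}
```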


xfs: reserve AG space for the refcount btree root · d0e853f3

Darrick J. Wong authored Oct 03, 2016

Reduce the max AG usable space size so that we always have space for
the refcount btree root.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: add refcount btree block detection to log recovery · a90c00f0

Darrick J. Wong authored Oct 03, 2016

Identify refcountbt blocks in the log correctly so that we can
validate them during log recovery.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: adjust refcount when unmapping file blocks · 62aab20f

Darrick J. Wong authored Oct 03, 2016

When we're unmapping blocks from a reflinked file, decrease the
refcount of the affected blocks and free the extents that are no
longer in use.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
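
The effect of unmapping on reference counts can be sketched with a toy model. It assumes a flat array holding one count per block and a hypothetical function `toy_unmap_range`; the kernel tracks counts per extent in the refcount btree instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Drop one reference from every block in [start, start + len).
 * Returns how many blocks reached a count of zero, i.e. how many
 * blocks are no longer in use and could be freed.
 */
size_t toy_unmap_range(uint32_t *refcount, size_t start, size_t len)
{
	size_t freed = 0;

	assert(refcount != NULL);
	for (size_t i = start; i < start + len; i++) {
		assert(refcount[i] > 0);   /* cannot unmap an unowned block */
		if (--refcount[i] == 0)
			freed++;
	}
	return freed;
}
```

Blocks that other files still reference keep a nonzero count and stay allocated; only the blocks that drop to zero are released.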


xfs: connect refcount adjust functions to upper layers · 33ba6129

Darrick J. Wong authored Oct 03, 2016

Plumb in the upper level interface to schedule and finish deferred
refcount operations via the deferred ops mechanism.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: adjust refcount of an extent of blocks in refcount btree · 31727258

Darrick J. Wong authored Oct 03, 2016

Provide functions to adjust the reference counts for an extent of
physical blocks stored in the refcount btree.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>


xfs: log refcount intent items · f997ee21

Darrick J. Wong authored Oct 03, 2016

Provide a mechanism for higher levels to create CUI/CUD items, submit
them to the log, and a stub function to deal with recovered CUI items.
These parts will be connected to the refcountbt in a later patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: create refcount update intent log items · baf4bcac

Darrick J. Wong authored Oct 03, 2016

Create refcount update intent/done log items to record redo
information in the log.  Because we need to roll transactions between
updating the bmbt mapping and updating the reverse mapping, we also
have to track the status of the metadata updates that will be recorded
in the post-roll transactions, just in case we crash before committing
the final transaction.  This mechanism enables log recovery to finish
what was already started.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: add refcount btree operations · bdf28630

Darrick J. Wong authored Oct 03, 2016

Implement the generic btree operations required to manipulate refcount
btree blocks.  The implementation is similar to the bmapbt, though it
will only allocate and free blocks from the AG.

Since the refcount root and level fields are separate from the
existing roots and levels array, they need a separate logging flag.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: fix logging of AGF refcount btree fields]
Signed-off-by: Christoph Hellwig <hch@lst.de>


xfs: account for the refcount btree in the alloc/free log reservation · f310bd2e

Darrick J. Wong authored Oct 03, 2016

Every time we allocate or free a data extent, we might need to split
the refcount btree.  Reserve some blocks in the transaction to handle
this possibility.  Even though the deferred refcount code can roll a
transaction to avoid overloading the transaction, we can still exceed
the reservation.

Certain pathological workloads (1k blocks, no cowextsize hint, random
directio writes) cause a perfect storm wherein a refcount adjustment
of a large range of blocks causes full tree splits in two separate
extents in two separate refcount tree blocks; allocating new refcount
tree blocks causes rmap btree splits; and all the allocation activity
causes the freespace btrees to split, blowing the reservation.

(Reproduced by generic/167 over NFS atop XFS)
Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick.wong@oracle.com: add commit message]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


xfs: add refcount btree support to growfs · ac4fef69

Darrick J. Wong authored Oct 03, 2016

Modify the growfs code to initialize new refcount btree blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: define the on-disk refcount btree format · 1946b91c

Darrick J. Wong authored Oct 03, 2016

Start constructing the refcount btree implementation by establishing
the on-disk format and everything needed to read, write, and
manipulate the refcount btree blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: refcount btree add more reserved blocks · af30dfa1

Darrick J. Wong authored Oct 03, 2016

Since XFS reserves a small amount of space in each AG as the minimum
free space needed for an operation, save some more space in case we
touch the refcount btree.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: introduce refcount btree definitions · 46eeb521

Darrick J. Wong authored Oct 03, 2016

Add new per-AG refcount btree definitions to the per-AG structures.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: define tracepoints for refcount btree activities · c75c752d

Darrick J. Wong authored Oct 03, 2016

Define all the tracepoints we need to inspect the refcount btree
runtime operation.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


xfs: return an error when an inline directory is too small · 9cdafd8a

Darrick J. Wong authored Oct 03, 2016

If the size of an inline directory is so small that it doesn't
even cover the required header size, return an error to userspace
instead of ASSERTing and returning 0 like everything's ok.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


vfs: add a FALLOC_FL_UNSHARE mode to fallocate to unshare a range of blocks · 71be6b49

Darrick J. Wong authored Oct 03, 2016

Add a new fallocate mode flag that explicitly unshares blocks on
filesystems that support such features.  The new flag can only
be used with an allocate-mode fallocate call.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
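
From userspace, the new mode might be exercised roughly as follows. This is a sketch, not code from the patch: `unshare_range` is a hypothetical wrapper, and it assumes the flag is exposed by the uapi header as `FALLOC_FL_UNSHARE_RANGE` (a fallback definition is supplied for older headers).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

/* Assumed value of the flag, for headers that predate it. */
#ifndef FALLOC_FL_UNSHARE_RANGE
#define FALLOC_FL_UNSHARE_RANGE 0x40
#endif

/*
 * Ask the filesystem to unshare the blocks backing [offset, offset + len).
 * The flag is passed on its own, so this is an allocate-mode call.
 * Returns 0 on success or a negative errno value on failure, for
 * example when the filesystem does not support the mode.
 */
int unshare_range(int fd, off_t offset, off_t len)
{
	if (fallocate(fd, FALLOC_FL_UNSHARE_RANGE, offset, len) == -1)
		return -errno;
	return 0;
}
```

Callers should expect failure on filesystems without shared-block support and treat that as "nothing to unshare" or as an error, depending on their needs.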


vfs: support FS_XFLAG_COWEXTSIZE and get/set of CoW extent size hint · 0a6eab8b

Darrick J. Wong authored Oct 03, 2016

Introduce XFLAGs for the new XFS CoW extent size hint, and actually
plumb the CoW extent size hint into the fsxattr structure.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
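
A userspace caller could set the hint through the fsxattr ioctls along these lines. This is a sketch under assumptions: `set_cowextsize` is a hypothetical helper, the hint is assumed to be expressed in bytes, and the installed kernel headers are assumed to already carry `FS_XFLAG_COWEXTSIZE` and the `fsx_cowextsize` field.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Read the current extended attributes of an open file, turn on the
 * CoW extent size flag, store the requested hint, and write them back.
 * Returns 0 on success or a negative errno value on failure.
 */
int set_cowextsize(int fd, unsigned int hint)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) == -1)
		return -errno;

	fsx.fsx_xflags |= FS_XFLAG_COWEXTSIZE;
	fsx.fsx_cowextsize = hint;

	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) == -1)
		return -errno;
	return 0;
}
```

Reading the structure first preserves the other flags and hints already set on the file.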


02 Oct, 2016 8 commits

Merge branch 'xfs-4.9-log-recovery-fixes' into for-next · 155cd433

Dave Chinner authored Oct 03, 2016

Merge branch 'iomap-4.9-dax' into for-next · a1f45e66

Dave Chinner authored Oct 03, 2016

Merge branch 'xfs-4.9-delalloc-rework' into for-next · a89b3f97

Dave Chinner authored Oct 03, 2016

Merge branch 'xfs-4.9-reflink-prep' into for-next · 79ad5761

Dave Chinner authored Oct 03, 2016

Merge branch 'iomap-4.9-misc-fixes-1' into for-next · b036b970

Dave Chinner authored Oct 03, 2016

fs: update atime before I/O in generic_file_read_iter · 0d5b0cf2

Christoph Hellwig authored Oct 03, 2016

After the call to ->direct_IO the final reference to the file might have
been dropped by aio_complete already, and the call to file_accessed might
cause a use after free.

Instead update the access time before the I/O, similar to how we
update the time stamps before writes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


xfs: update atime before I/O in xfs_file_dio_aio_read · a447d7cd

Christoph Hellwig authored Oct 03, 2016

After the call to __blkdev_direct_IO the final reference to the file
might have been dropped by aio_complete already, and the call to
file_accessed might cause a use after free.

Instead update the access time before the I/O, similar to how we
update the time stamps before writes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-and-tested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


ext2: fix possible integer truncation in ext2_iomap_begin · d5bfccdf

Christoph Hellwig authored Oct 03, 2016

For 32-bit architectures we need to cast first_block to u64 before
shifting it left.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
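
The truncation can be reproduced in isolation. The sketch below models a 32-bit `unsigned long` with `uint32_t`; the type `block32_t` and both function names are hypothetical and are not taken from the ext2 source.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a block number held in a 32-bit unsigned long. */
typedef uint32_t block32_t;

/*
 * Buggy form: the shift is evaluated in 32 bits, so the high bits are
 * lost before the result is widened to 64 bits.
 */
uint64_t block_to_bytes_truncating(block32_t first_block, unsigned int blkbits)
{
	assert(blkbits < 32);
	return (block32_t)(first_block << blkbits);
}

/* Fixed form: widen to 64 bits first, then shift. */
uint64_t block_to_bytes_widened(block32_t first_block, unsigned int blkbits)
{
	assert(blkbits < 32);
	return (uint64_t)first_block << blkbits;
}
```

With 4 KiB blocks (a shift of 12), any block number at or above 2^20 yields a byte offset that no longer fits in 32 bits, so the first form silently wraps while the second does not.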


25 Sep, 2016 2 commits

xfs: log recovery tracepoints to track current lsn and buffer submission · 5cd9cee9

Brian Foster authored Sep 26, 2016

Log recovery has particular rules around buffer submission along with
tricky corner cases where independent transactions can share an LSN. As
such, it can be difficult to follow when/why buffers are submitted
during recovery.

Add a couple of tracepoints to post the current LSN of a record when a new
record is being processed and when a buffer is being skipped due to LSN
ordering. Also, update the recover item class to include the LSN of the
current transaction for the item being processed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


xfs: update metadata LSN in buffers during log recovery · 60a4a222

Brian Foster authored Sep 26, 2016

Log recovery is currently broken for v5 superblocks in that it never
updates the metadata LSN of buffers written out during recovery. The
metadata LSN is recorded in various bits of metadata to provide recovery
ordering criteria that prevent transient corruption states reported by
buffer write verifiers. Without such ordering logic, buffer updates can
be replayed out of order and lead to false positive transient corruption
states. This is generally not a corruption vector on its own, but
corruption detection shuts down the filesystem and ultimately prevents a
mount if it occurs during log recovery. This requires an xfs_repair run
that clears the log and potentially loses filesystem updates.

This problem is avoided in most cases as metadata writes during normal
filesystem operation update the metadata LSN appropriately. The problem
with log recovery not updating metadata LSNs manifests if the system
happens to crash shortly after log recovery itself. In this scenario, it
is possible for log recovery to complete all metadata I/O such that the
filesystem is consistent. If a crash occurs after that point but before
the log tail is pushed forward by subsequent operations, however, the
next mount performs the same log recovery over again. If a buffer is
updated multiple times in the dirty range of the log, an earlier update
in the log might not be valid based on the current state of the
associated buffer after all of the updates in the log had been replayed
(before the previous crash). If a verifier happens to detect such a
problem, the filesystem claims corruption and immediately shuts down.

This commonly manifests in practice as directory block verifier failures
such as the following, likely due to directory verifiers being
particularly detailed in their checks as compared to most others:

  ...
  Mounting V5 Filesystem
  XFS (dm-0): Starting recovery (logdev: internal)
  XFS (dm-0): Internal error XFS_WANT_CORRUPTED_RETURN at line ... of \
    file fs/xfs/libxfs/xfs_dir2_data.c.  Caller xfs_dir3_data_verify ...
  ...

Update log recovery to update the metadata LSN of recovered buffers.
Since metadata LSNs are already updated by write verifier functions via
attached log items, attach a dummy log item to the buffer during
validation and explicitly set the LSN of the current transaction. This
ensures that the metadata LSN of a buffer is updated based on whether
the recovery I/O actually completes, and if so, that subsequent recovery
attempts identify that the buffer is already up to date with respect to
the current transaction.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
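
The ordering rule and the effect of stamping the LSN during recovery can be shown with a toy model. Everything here is an assumption made for illustration: `struct toy_buf`, `struct toy_log_item`, and the two functions are invented names, and the model ignores the case of independent transactions sharing an LSN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_buf {
	uint64_t lsn;     /* LSN of the last update applied to the buffer */
	int value;        /* stand-in for the buffer contents */
};

struct toy_log_item {
	uint64_t lsn;     /* LSN of the transaction that logged the update */
	int new_value;
};

/*
 * Replay one item unless the buffer already reflects it or something
 * newer. Returns true if the update was applied.
 */
bool toy_recover_item(struct toy_buf *bp, const struct toy_log_item *item)
{
	assert(bp != NULL && item != NULL);
	if (bp->lsn >= item->lsn)
		return false;          /* buffer is already up to date */
	bp->value = item->new_value;
	bp->lsn = item->lsn;           /* stamp the LSN during recovery too */
	return true;
}

/* Replay a whole log against one buffer; returns the number applied. */
int toy_recover_log(struct toy_buf *bp, const struct toy_log_item *items,
		    size_t n)
{
	int replayed = 0;

	for (size_t i = 0; i < n; i++)
		if (toy_recover_item(bp, &items[i]))
			replayed++;
	return replayed;
}
```

Because the buffer's LSN is stamped on the first pass, running the same log a second time in this model applies nothing, so stale intermediate states are never written back over newer contents.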
