- 05 Oct, 2016 16 commits
-
-
Darrick J. Wong authored
Due to the way the CoW algorithm in XFS works, there's an interval during which blocks allocated to handle a CoW can be lost -- if the FS goes down after the blocks are allocated but before the block remapping takes place. This is exacerbated by the cowextsz hint -- allocated reservations can sit around for a while, waiting to get used. Since the refcount btree doesn't normally store records with refcount of 1, we can use it to record these in-progress extents. In-progress blocks cannot be shared because they're not user-visible, so there shouldn't be any conflicts with other programs. This is a better solution than holding EFIs during writeback because (a) EFIs can't be relogged currently, (b) even if they could, EFIs are bound by available log space, which puts an unnecessary upper bound on how much CoW we can have in flight, and (c) we already have a mechanism to track blocks. At mount time, read the refcount records and free anything we find with a refcount of 1 because those were in-progress when the FS went down. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
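The mount-time cleanup described above can be sketched in a few lines of C. This is a toy model, not the kernel code: the `toy_refc_rec` structure and `toy_recover_cow_leftovers()` are hypothetical names, and a flat array stands in for the refcount btree.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified refcount record; not the XFS on-disk format. */
struct toy_refc_rec {
    uint32_t start;     /* first block of the extent */
    uint32_t len;       /* extent length in blocks */
    uint32_t refcount;  /* 1 marks an in-progress CoW extent */
};

/*
 * Drop every record with refcount == 1 (a CoW extent that was still in
 * progress when the filesystem went down) and return the number of blocks
 * released. Surviving records are compacted in place and *nrecs is updated.
 */
uint64_t toy_recover_cow_leftovers(struct toy_refc_rec *recs, size_t *nrecs)
{
    uint64_t freed = 0;
    size_t keep = 0;

    for (size_t i = 0; i < *nrecs; i++) {
        if (recs[i].refcount == 1) {
            freed += recs[i].len;   /* blocks go back to the free pile */
            continue;
        }
        recs[keep++] = recs[i];     /* genuinely shared extent: keep it */
    }
    *nrecs = keep;
    return freed;
}
```

Records with a refcount of 2 or more describe blocks that really are shared, so the sketch leaves them alone.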
-
Darrick J. Wong authored
When destroying the inode, cancel all pending reservations in the CoW fork so that all the reserved blocks go back to the free pile. In theory this sort of cleanup is only needed to clean up after write errors. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
When we're freeing blocks (truncate, punch, etc.), clear all CoW reservations in the range being freed. If the file block count drops to zero, also clear the inode reflink flag. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
For O_DIRECT writes to shared blocks, we have to CoW them just like we would with buffered writes. For writes that are not block-aligned, just bounce them to the page cache. For block-aligned writes, however, we can do better than that. Use the same mechanisms that we employ for buffered CoW to set up a delalloc reservation, allocate all the blocks at once, issue the writes against the new blocks and use the same ioend functions to remap the blocks after the write. This should be fairly performant. Christoph discovered that xfs_reflink_allocate_cow_range may stumble over invalid entries in the extent array given that it drops the ilock but still expects the index to be stable. Simply fixing it to do a new lookup for every iteration still isn't correct given that xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and there is nothing preventing an xfs_bunmapi_cow call from removing extents once we dropped the ilock either. This patch duplicates the inner loop of xfs_bmapi_allocate into a helper for xfs_reflink_allocate_cow_range so that it can be done under the same ilock critical section as our CoW fork delayed allocation. The directio CoW warts will be revisited in a later patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Report shared extents through the iomap interface so that FIEMAP flags shared blocks accurately. Have xfs_vm_bmap return zero for reflinked files because the bmap-based swap code requires static block mappings, which is incompatible with copy on write. NOTE: Existing userspace bmap users such as lilo will have the same problem with reflink files. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Darrick J. Wong authored
After the write component of a copy-write operation finishes, clean up the bookkeeping left behind. On error, we simply free the new blocks and pass the error up. If we succeed, however, then we must remove the old data fork mapping and move the cow fork mapping to the data fork. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: Call the CoW failure function during xfs_cancel_ioend] Signed-off-by: Christoph Hellwig <hch@lst.de>
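A minimal sketch of that completion step, assuming a per-block toy model in which each fork is just an array of physical block numbers. The `toy_cow_file` structure, `TOY_NO_BLOCK`, and `toy_end_cow()` are hypothetical names for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_FILE_BLOCKS 8
#define TOY_NO_BLOCK    (-1)   /* no mapping for this file block */

/* Hypothetical in-core file: one physical block number per file block. */
struct toy_cow_file {
    int64_t data_fork[TOY_FILE_BLOCKS];
    int64_t cow_fork[TOY_FILE_BLOCKS];
};

/*
 * Finish a CoW write over [first, first + count). On error the new blocks
 * are dropped and the error is passed up; on success the CoW fork mapping
 * replaces the old data fork mapping.
 */
int toy_end_cow(struct toy_cow_file *f, size_t first, size_t count, int error)
{
    for (size_t b = first; b < first + count && b < TOY_FILE_BLOCKS; b++) {
        if (f->cow_fork[b] == TOY_NO_BLOCK)
            continue;                          /* nothing staged here */
        if (!error)
            f->data_fork[b] = f->cow_fork[b];  /* remap to the new block */
        f->cow_fork[b] = TOY_NO_BLOCK;         /* staging entry is gone */
    }
    return error;
}
```

Either way the CoW fork ends up empty for the range; only the success path changes what the data fork points at.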
-
Darrick J. Wong authored
Create a helper method to remove extents from the CoW fork without any of the side effects (rmapbt/bmbt updates) of the regular extent deletion routine. We'll eventually use this to clear out the CoW fork during ioend processing. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Modify the writepage handler to find and convert pending delalloc extents to real allocations. Furthermore, when we're doing non-cow writes to a part of a file that already has a CoW reservation (the cowextsz hint that we set up in a subsequent patch facilitates this), promote the write to copy-on-write so that the entire extent can get written out as a single extent on disk, thereby reducing post-CoW fragmentation. Christoph moved the CoW support code in _map_blocks to a separate helper function, refactored other functions, and reduced the number of CoW fork lookups, so I merged those changes here to reduce churn. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed allocation extents in the CoW fork to real allocations, and wire this up all the way back to xfs_iomap_write_allocate(). In a subsequent patch, we'll modify the writepage handler to call this. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Wire up iomap_begin to detect shared extents and create delayed allocation extents in the CoW fork: 1) Check if we already have an extent in the CoW fork for the area. If so, there is nothing to do and we can move along. 2) Look up the block number for the current extent; if there is none, it's not shared, so move along. 3) Unshare the current extent as far as we are going to write into it. For this we avoid an additional CoW fork lookup and use the information we set aside in step 1) above. 4) Goto 1) unless we've covered the whole range. Last but not least, this updates the xfs_reflink_reserve_cow_range calling convention to pass a byte offset and length, as that is what both callers expect anyway. This patch has been refactored considerably as part of the iomap transition. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
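The four steps above can be sketched as a loop. This is a simplification that works block by block rather than extent by extent, and every name in it (`toy_inode`, `toy_reserve_cow_range()`, the block size) is an assumption made for illustration, not the XFS implementation. Like the updated calling convention, it takes a byte offset and length.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLOCK_SIZE 4096u
#define TOY_NBLOCKS    16u
#define TOY_HOLE       (-1)

/* Hypothetical in-core inode with per-block state instead of extent lists. */
struct toy_inode {
    int64_t data_fork[TOY_NBLOCKS];  /* physical block, or TOY_HOLE */
    bool    shared[TOY_NBLOCKS];     /* block is referenced more than once */
    bool    cow_fork[TOY_NBLOCKS];   /* CoW reservation already exists */
};

/* Reserve CoW space for a byte range; returns the number of new reservations. */
size_t toy_reserve_cow_range(struct toy_inode *ip, uint64_t offset, uint64_t len)
{
    size_t reserved = 0;

    if (len == 0)
        return 0;

    uint64_t first = offset / TOY_BLOCK_SIZE;
    uint64_t last = (offset + len - 1) / TOY_BLOCK_SIZE;

    for (uint64_t b = first; b <= last && b < TOY_NBLOCKS; b++) {
        /* Step 1: already covered by the CoW fork, nothing to do. */
        if (ip->cow_fork[b])
            continue;
        /* Step 2: a hole or an unshared block needs no CoW. */
        if (ip->data_fork[b] == TOY_HOLE || !ip->shared[b])
            continue;
        /* Step 3: create the reservation for the shared block. */
        ip->cow_fork[b] = true;
        reserved++;
        /* Step 4: the loop continues until the range is covered. */
    }
    return reserved;
}
```

Calling the function twice over the same range reserves nothing the second time, which mirrors the early exit in step 1.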
-
Darrick J. Wong authored
Allow the creation of delayed allocation extents in the CoW fork. In a subsequent patch we'll wire up iomap_begin to actually do this via reflink helper functions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Introduce a new in-core fork for storing copy-on-write delalloc reservations and allocated extents that are in the process of being written out. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Only non-rt files can be reflinked, so check that when we load an inode. Also, don't leak the attr fork if there's a failure. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Report the reflink feature in the XFS geometry so that xfs_info and friends know the filesystem has this feature. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Define all the tracepoints we need to inspect the runtime operation of reflink/dedupe/copy-on-write. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Return the range of file blocks that bunmapi didn't free. This hint is used by CoW and reflink to figure out what part of an extent actually got freed so that it can set up the appropriate atomic remapping of just the freed range. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
- 04 Oct, 2016 6 commits
-
-
Darrick J. Wong authored
Log recovery will iget an inode to replay BUI items and iput the inode when it's done. Unfortunately, if the inode was unlinked, the iput will see that i_nlink == 0 and decide to truncate & free the inode, which prevents us from replaying subsequent BUIs. We can't skip the BUIs because we have to replay all the redo items to ensure that atomic operations complete. Since unlinked inode recovery will reap the inode anyway, we can safely introduce a new inode flag to indicate that an inode is in this 'unlinked recovery' state and should not be auto-reaped in the drop_inode path. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
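A rough sketch of the drop_inode decision with such a flag, using hypothetical names (`TOY_IRECOVERY`, `toy_rinode`, `toy_drop_inode()`) rather than the real kernel structures:

```c
#include <stdbool.h>

/* Hypothetical flag: inode is held by log recovery and must not be reaped. */
#define TOY_IRECOVERY 0x1u

struct toy_rinode {
    unsigned int nlink;  /* link count; 0 means unlinked */
    unsigned int flags;
};

/* Return true if the inode should be truncated and freed on the last iput. */
bool toy_drop_inode(const struct toy_rinode *ip)
{
    /* Recovery still has BUIs to replay against this inode: keep it. */
    if (ip->flags & TOY_IRECOVERY)
        return false;
    return ip->nlink == 0;
}
```

Once recovery clears the flag, an unlinked inode follows the normal path and is reaped as before.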
-
Darrick J. Wong authored
Implement deferred versions of the inode block map/unmap functions. These will be used in subsequent patches to make reflink operations atomic. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Pass BMAPI_ flags from bunmapi into bmap_del_extent and extend BMAPI_REMAP (which means "don't touch the allocator or the quota accounting") to apply to bunmapi as well. This will be used to implement the unmap operation, which will be used by swapext. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Teach the bmap routine to know how to map a range of file blocks to a specific range of physical blocks, instead of simply allocating fresh blocks. This enables reflink to map a file to blocks that are already in use. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Provide a mechanism for higher levels to create BUI/BUD items, submit them to the log, and a stub function to deal with recovered BUI items. These parts will be connected to the rmapbt in a later patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Create bmbt update intent/done log items to record redo information in the log. Because we roll transactions multiple times for reflink operations, we also have to track the status of the metadata updates that will be recorded in the post-roll transactions in case we crash before committing the final transaction. This mechanism enables log recovery to finish what was already started. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
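The intent/done pairing can be illustrated with a toy log scan: an intent with no matching done item is work that recovery still has to finish. The item layout and the `toy_find_unfinished()` helper below are assumptions for illustration, not the XFS log format.

```c
#include <stddef.h>
#include <stdint.h>

enum toy_item_type { TOY_INTENT, TOY_DONE };

/* Hypothetical log item: a done item carries the id of its intent. */
struct toy_log_item {
    enum toy_item_type type;
    uint32_t id;
};

/*
 * Collect the ids of intents that have no later done item. Returns the
 * number of ids written to out (at most out_cap).
 */
size_t toy_find_unfinished(const struct toy_log_item *log, size_t n,
                           uint32_t *out, size_t out_cap)
{
    size_t found = 0;

    for (size_t i = 0; i < n; i++) {
        if (log[i].type != TOY_INTENT)
            continue;

        int done = 0;
        for (size_t j = i + 1; j < n; j++) {
            if (log[j].type == TOY_DONE && log[j].id == log[i].id) {
                done = 1;
                break;
            }
        }
        if (!done && found < out_cap)
            out[found++] = log[i].id;  /* recovery must redo this one */
    }
    return found;
}
```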
-
- 03 Oct, 2016 18 commits
-
-
Darrick J. Wong authored
These functions will be used by the other reflink functions to find the maximum length of a range of shared blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Reduce the max AG usable space size so that we always have space for the refcount btree root. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Identify refcountbt blocks in the log correctly so that we can validate them during log recovery. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
When we're unmapping blocks from a reflinked file, decrease the refcount of the affected blocks and free the extents that are no longer in use. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
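As a minimal sketch, assume one reference count per block held in a plain array. The real refcount btree stores extents and normally omits records with a refcount of 1, so this model and the `toy_refcount_decrease()` name are purely illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Drop one reference from each block in [start, start + len) and return
 * how many blocks reached zero references, i.e. are no longer in use and
 * can be freed.
 */
size_t toy_refcount_decrease(uint32_t *refcount, size_t nblocks,
                             size_t start, size_t len)
{
    size_t freed = 0;

    for (size_t b = start; b < start + len && b < nblocks; b++) {
        if (refcount[b] == 0)
            continue;               /* already free, nothing to drop */
        if (--refcount[b] == 0)
            freed++;                /* last owner gone: free the block */
    }
    return freed;
}
```

Blocks that are still referenced by another file keep a nonzero count and stay allocated.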
-
Darrick J. Wong authored
Plumb in the upper level interface to schedule and finish deferred refcount operations via the deferred ops mechanism. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Provide functions to adjust the reference counts for an extent of physical blocks stored in the refcount btree. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Provide a mechanism for higher levels to create CUI/CUD items, submit them to the log, and a stub function to deal with recovered CUI items. These parts will be connected to the refcountbt in a later patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Create refcount update intent/done log items to record redo information in the log. Because we need to roll transactions between updating the bmbt mapping and updating the reverse mapping, we also have to track the status of the metadata updates that will be recorded in the post-roll transactions, just in case we crash before committing the final transaction. This mechanism enables log recovery to finish what was already started. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Implement the generic btree operations required to manipulate refcount btree blocks. The implementation is similar to the bmapbt, though it will only allocate and free blocks from the AG. Since the refcount root and level fields are separate from the existing roots and levels array, they need a separate logging flag. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: fix logging of AGF refcount btree fields] Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Every time we allocate or free a data extent, we might need to split the refcount btree. Reserve some blocks in the transaction to handle this possibility. Even though the deferred refcount code can roll a transaction to avoid overloading the transaction, we can still exceed the reservation. Certain pathological workloads (1k blocks, no cowextsize hint, random directio writes) cause a perfect storm wherein a refcount adjustment of a large range of blocks causes full tree splits in two separate extents in two separate refcount tree blocks; allocating new refcount tree blocks causes rmap btree splits; and all the allocation activity causes the freespace btrees to split, blowing the reservation. (Reproduced by generic/167 over NFS atop XFS.) Signed-off-by: Christoph Hellwig <hch@lst.de> [darrick.wong@oracle.com: add commit message] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Darrick J. Wong authored
Modify the growfs code to initialize new refcount btree blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Start constructing the refcount btree implementation by establishing the on-disk format and everything needed to read, write, and manipulate the refcount btree blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Since XFS reserves a small amount of space in each AG as the minimum free space needed for an operation, save some more space in case we touch the refcount btree. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Add new per-AG refcount btree definitions to the per-AG structures. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Define all the tracepoints we need to inspect the refcount btree runtime operation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
If the size of an inline directory is so small that it doesn't even cover the required header size, return an error to userspace instead of ASSERTing and returning 0 like everything's ok. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
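The shape of that check, sketched with assumed values: the header size constant and the choice of `-EIO` as the error code are illustrative assumptions, since the message only says that an error is returned.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical minimum header size for an inline (shortform) directory. */
#define TOY_SF_HDR_SIZE 6u

/*
 * Reject an inline directory that is too small to hold its own header,
 * instead of asserting and then reporting success.
 */
int toy_check_inline_dir(size_t dir_size)
{
    if (dir_size < TOY_SF_HDR_SIZE)
        return -EIO;   /* corrupt directory: report it to the caller */
    return 0;
}
```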
-
Darrick J. Wong authored
Add a new fallocate mode flag that explicitly unshares blocks on filesystems that support such features. The new flag can only be used with an allocate-mode fallocate call. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
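From userspace, a call using such a flag might look like the sketch below. It assumes the flag is the one exposed by Linux as `FALLOC_FL_UNSHARE_RANGE` (the message above does not name it); the fallback definition is only there for older headers, and `unshare_range()` is a hypothetical wrapper.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for fallocate() in glibc */
#endif
#include <fcntl.h>
#include <sys/types.h>

/* Assumption: fallback value for headers that predate the flag. */
#ifndef FALLOC_FL_UNSHARE_RANGE
#define FALLOC_FL_UNSHARE_RANGE 0x40
#endif

/*
 * Ask the filesystem to give this file private copies of any shared
 * blocks in [offset, offset + len). Returns 0 on success, or -1 with
 * errno set (for example EOPNOTSUPP if the filesystem lacks support).
 */
int unshare_range(int fd, off_t offset, off_t len)
{
    /* Plain allocate mode plus the unshare flag; no punch or zero flags. */
    return fallocate(fd, FALLOC_FL_UNSHARE_RANGE, offset, len);
}
```

Callers should expect the call to fail on filesystems without shared-block support and treat that as "nothing to unshare".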
-
Darrick J. Wong authored
Introduce XFLAGs for the new XFS CoW extent size hint, and actually plumb the CoW extent size hint into the fsxattr structure. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-