- 29 Sep, 2014 1 commit
-
-
Dave Chinner authored
-
- 28 Sep, 2014 5 commits
-
-
Dave Chinner authored
Rework the transaction lookup and allocation code in xlog_recovery_process_ophdr() to fold two related call-once helper functions into a single helper. Then fold in all the XLOG_START_TRANS logic to that helper to clean up the remaining logic in xlog_recovery_process_ophdr(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Dave Chinner authored
The code for managing transactions anf the items for recovery is spread across 3 different locations in the file. Move them all together so that it is easy to read the code without needing to jump long distances in the file. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Dave Chinner authored
When an error occurs during buffer submission in xlog_recover_commit_trans(), we free the trans structure twice. Fix it by only freeing the structure in the caller regardless of the success or failure of the function. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Dave Chinner authored
The XLOG_UNMOUNT_TRANS case skips the transaction, despite the fact an unmount record is always in a standalone transaction. Hence whenever we come across one of these we need to free the transaction structure associated with it as there is no commit record that follows it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Dave Chinner authored
Clean up xlog_recover_process_data() structure in preparation for fixing the allocation and freeing context of the transaction being recovered. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
- 23 Sep, 2014 15 commits
-
-
Dave Chinner authored
-
Dave Chinner authored
On a sub-page sized filesystem, truncating a mapped region down leaves us in a world of hurt. We truncate the pagecache, zeroing the newly unused tail, then punch blocks out from under the page. If we then truncate the file back up immediately, we expose that unmapped hole to a dirty page mapped into the user application, and that's where it all goes wrong. In truncating the page cache, we avoid unmapping the tail page of the cache because it still contains valid data. The problem is that it also contains a hole after the truncate, but nobody told the mm subsystem that. Therefore, if the page is dirty before the truncate, we'll never get a .page_mkwrite callout after we extend the file and the application writes data into the hole on the page. Hence when we come to writing that region of the page, it has no blocks and no delayed allocation reservation and hence we toss the data away. This patch adds code to the truncate up case to solve it, by ensuring the partial page at the old EOF is always cleaned after we do any zeroing and move the EOF upwards. We can't actually serialise the page writeback and truncate against page faults (yes, that problem AGAIN) so this is really just a best effort and assumes it is extremely unlikely that someone is concurrently writing to the page at the EOF while extending the file. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Dave Chinner authored
Fix sparse warning introduced by commit 4ef897a2 ("xfs: flush both inodes in xfs_swap_extents"). Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Dave Chinner authored
Fix sparse warning introduced by commit ac8809f9 ("xfs: abort metadata writeback on permanent errors"). Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Fengguang Wu authored
Fix sparse warning introduced by commit afabfd30 ("xfs: combine xfs_rtmodify_summary and xfs_rtget_summary"). Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Fabian Frederick authored
xfs_quota.h was included twice. Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Eric Sandeen authored
xfs_dir3_data_get_ftype() gets the file type off disk, but ASSERTs if it's invalid: ASSERT(type < XFS_DIR3_FT_MAX); We shouldn't ASSERT on bad values read from disk. V3 dirs are CRC-protected, but V2 dirs + ftype are not. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Dave Chinner authored
When running a tight mount/unmount loop on an older kernel, RedHat QE found that unmount would occasionally hang in xfs_buf_unpin_wait() on the superblock buffer. Tracing and other debug work by Eric Sandeen indicated that it was hanging on the writing of the superblock during unmount immediately after logging the superblock counters in a synchronous transaction. Further debug indicated that the synchronous transaction was not waiting for completion correctly, and we narrowed it down to xlog_cil_force_lsn() returning NULLCOMMITLSN and hence not pushing the transaction in the iclog buffer to disk correctly. While this unmount superblock write code is now very different in mainline kernels, the xlog_cil_force_lsn() code is identical, and it was bisected to the backport of commit f876e446 ("xfs: always do log forces via the workqueue"). This commit made the CIL push asynchronous for log forces and hence exposed a race condition that couldn't occur on a synchronous push. Essentially, the xlog_cil_force_lsn() relied implicitly on the fact that the sequence push would be complete by the time xlog_cil_push_now() returned, resulting in the context being pushed being in the committing list. When it was made asynchronous, it was recognised that there was a race condition in detecting whether an asynchronous push has started or not and code was added to handle it. Unfortunately, the fix was not quite right and left a race condition where it it would detect an empty CIL while a push was in progress before the context had been added to the committing list. This was incorrectly seen as a "nothing to do" condition and so would tell xfs_log_force_lsn() that there is nothing to wait for, and hence it would push the iclogbufs in memory. The fix is simple, but explaining the logic and the race condition is a lot more complex. The fix is to add the context to the committing list before we start emptying the CIL. This allows us to detect the difference between an empty "do nothing" push and a push that has not started by adding a discrete "emptying the CIL" state to avoid the transient, incorrect "empty" condition that the (unchanged) waiting code was seeing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Dave Chinner authored
-
Brian Foster authored
xfs_free_file_space() only affects the range of the file for which space is being freed. It currently writes and truncates the page cache from the start offset of the free to EOF. Modify xfs_free_file_space() to write back and truncate page cache of just the range being freed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Brian Foster authored
The collapse range operation currently writes the entire file before starting the collapse to avoid changes in the in-core extent list due to writeback causing the extent count to change. Now that collapse range is fsb based rather than extent index based it can sustain changes in the extent list during the shift sequence without disruption. Modify xfs_collapse_file_space() to writeback and invalidate pages associated with the range of the file to be shifted. xfs_free_file_space() currently has similar behavior, but the space free need only affect the region of the file that is freed and this could change in the future. Also update the comments to reflect the current implementation. We retain the eofblocks trim permanently as a best option for dealing with delalloc extents. We don't shift delalloc extents because this scenario only occurs with post-eof preallocation (since data must be flushed such that the cache can be invalidated and data can be shifted). That means said space must also be initialized before being shifted into the accessible region of the file only to be immediately truncated off as the last part of the collapse. In other words, the eofblocks trim will happen anyways, we just run it first to ensure the file remains in a consistent state throughout the collapse. Finally, detect and fail explicitly in the event of a delalloc extent during the extent shift. The implementation does not support delalloc extents and the caller is expected to prevent this scenario in advance as is done by collapse. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Brian Foster authored
xfs_bmap_shift_extents() has a variety of conditions and error checks that make the logic difficult to follow and indent heavy. Refactor the loop body of this function into a new xfs_bmse_shift_one() helper. This simplifies the error checks, eliminates index decrement on merge hack by pushing the index increment down into the helper, and makes the code more readable by reducing multiple levels of indentation. This is a code refactor only. The behavior of extent shift and collapse range is not modified. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Brian Foster authored
The extent shift mechanism in xfs_bmap_shift_extents() is complicated and handles several different, non-deterministic scenarios. These include extent shifts, extent merges and potential btree updates in either of the former scenarios. Refactor the code to be more linear and readable. The loop logic in xfs_bmap_shift_extents() and some initial error checking is adjusted slightly. The associated btree lookup and update/delete operations are condensed into single blocks of code. This reduces the number of btree-specific blocks and facilitates the separation of the merge operation into a new xfs_bmse_merge() and xfs_bmse_can_merge() helpers. This is a code refactor only. The behavior of extent shift and collapse range is not modified. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Brian Foster authored
The collapse range implementation uses a transaction per extent shift. The progress of the overall operation is tracked via the current extent index of the in-core extent list. This is racy because the ilock must be dropped and reacquired for each transaction according to locking and log reservation rules. Therefore, writeback to prior regions of the file is possible and can change the extent count. This changes the extent to which the current index refers and causes the collapse to fail mid operation. To avoid this problem, the entire file is currently written back before the collapse operation starts. To eliminate the need to flush the entire file, use the file offset (fsb) to track the progress of the overall extent shift operation rather than the extent index. Modify xfs_bmap_shift_extents() to unconditionally convert the start_fsb parameter to an extent index and return the file offset of the extent where the shift left off, if further extents exist. The bulk of ths function can remain based on extent index as ilock is held by the caller. xfs_collapse_file_space() now uses the fsb output as the starting point for the subsequent shift. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Dave Chinner authored
XFS has been having trouble with stray delayed allocation extents beyond EOF for a long time. Recent changes to the collapse range code has triggered erroneous EBUSY errors on page invalidtion for block size smaller than page size filesystems. These have been caused by dirty buffers beyond EOF on a partial page which do not get written to disk during a sync. The issue is that write-ahead in xfs_cluster_write() finds such a partial page and handles it by leaving the page dirty but pushing it into a writeback state. This used to work just fine, as the write_cache_pages() code would then find the dirty partial page in the next mapping tree lookup as the dirty tag is still set. Unfortunately, when we moved to a mark and sweep approach to writeback to fix other writeback sync issues, we broken this. THe act of marking the page as under writeback now clears the TOWRITE tag in the radix tree, even though the page is still dirty. This causes the TOWRITE tag to be cleared, and hence the next lookup on the mapping tree does not find the dirty partial page and so doesn't try to write it again. This same writeback bug was found recently in ext4 and fixed in commit 1c8349a1 ("ext4: fix data integrity sync in ordered mode") without communication to the wider filesystem community. We can use exactly the same fix here so the TOWRITE flag is not cleared on partial page writes. cc: stable@vger.kernel.org # dependent on 1c8349a1Root-cause-found-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
- 09 Sep, 2014 12 commits
-
-
Dave Chinner authored
-
Eric Sandeen authored
rbpp is always passed into xfs_rtmodify_summary and xfs_rtget_summary, so there is no need to test for it in xfs_rtmodify_summary_int. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Eric Sandeen authored
xfs_rtmodify_summary and xfs_rtget_summary are almost identical; fold them into xfs_rtmodify_summary_int(), with wrappers for each of the original calls. The _int function modifies if a delta is passed, and returns a summary pointer if *sum is passed. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Eric Sandeen authored
xfs_dir_canenter and xfs_dir_createname are almost identical. Fold the former into the latter, with a helpful wrapper for the former. If createname is called without an inode number, it now only checks for space, and does not actually add the entry. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Eric Sandeen authored
Move the resblks test out of the xfs_dir_canenter, and into the caller. This makes a little more sense on the face of it; xfs_dir_canenter immediately returns if resblks !=0; and given some of the comments preceding the calls: * Check for ability to enter directory entry, if no space reserved. even more so. It also facilitates the next patch. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Eric Sandeen authored
In xlog_do_recovery_pass(), there are 2 distinct cases: non-wrapped and wrapped log recovery. If we find a wrapped log, we recover around the end of the log, and then handle the rest of recovery exactly as in the non-wrapped case - using exactly the same (duplicated) code. Rather than having the same code in both cases, we can get the wrapped portion out of the way first if needed, and then recover the non-wrapped portion of the log. There should be no functional change here, just code reorganization & deduplication. The patch looks a bit bigger than it really is; the last hunk is whitespace changes (un-indenting). Tested with xfstests "check -g log" on a stock configuration. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Eric Sandeen authored
For some reason, the older commit: 965c8e59 lseek: the "whence" argument is called "whence" lseek: the "whence" argument is called "whence" But the kernel decided to call it "origin" instead. Fix most of the sites. left out xfs. So fix xfs. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Eric Sandeen authored
xfs_seek_hole & xfs_seek_data are remarkably similar; so much so that they can be combined, saving a fair bit of semi-complex code duplication. The following patch passes generic/285 and generic/286, which specifically test seek behavior. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Brian Foster authored
XFS log recovery has been discovered to have race conditions with buffers when I/O errors occur. External tools are available to simulate I/O errors to XFS, but this alone is not sufficient for testing log recovery. XFS unconditionally resets the inactive region of the log prior to log recovery to avoid confusion over processing any partially written log records that might have been written before an unclean shutdown. Therefore, unconditional write I/O failures at mount time are caught by the reset sequence rather than log recovery and hinder the ability to test the latter. The device-mapper dm-flakey module uses an up/down timer to define a cycle for when to fail I/Os. Create a pre log recovery delay tunable that can be used to coordinate XFS log recovery with I/O errors simulated by dm-flakey. This facilitates coordination in userspace that allows the reset of stale log blocks to succeed and writes due to log recovery to fail. For example, define a dm-flakey instance with an uptime long enough to allow log reset to succeed and a log recovery delay long enough to allow the dm-flakey uptime to expire. The 'log_recovery_delay' sysfs tunable is exported under /sys/fs/xfs/debug and is only enabled for kernels compiled in XFS debug mode. The value is exported in units of seconds and allows for a delay of up to 60 seconds. Note that this is for XFS debug and test instrumentation purposes only and should not be used by applications. No delay is enabled by default. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Brian Foster authored
Create a top-level debug directory for global debug sysfs attributes. This directory is added and removed on XFS module initialization and removal respectively for DEBUG mode kernels only. It typically resides at /sys/fs/xfs/debug. It is located at the top level of the xfs sysfs hierarchy as attributes might define global behavior or behavior that must be configured before an xfs mount is available (e.g., log recovery behavior). Define the global debug kobject that represents the debug sysfs directory and add generic attribute show/store helpers to support future attributes. No debug attributes are exported as of yet. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Eric Sandeen authored
These were exposed by fsfuzzer runs; without them we fail in various exciting and sometimes convoluted ways when we encounter disk corruption. Without the MAXLEVELS tests we tend to walk off the end of an array in a loop like this: for (i = 0; i < cur->bc_nlevels; i++) { if (cur->bc_bufs[i]) Without the dirblklog test we try to allocate more memory than we could possibly hope for and loop forever: xfs_dabuf_map() nfsb = mp->m_dir_geo->fsbcount; irecs = kmem_zalloc(sizeof(irec) * nfsb, KM_SLEEP... As for the logbsize check, that's the convoluted one. If logbsize is specified at mount time, it's sanitized in xfs_parseargs; in particular it makes sure that it's not > XLOG_MAX_RECORD_BSIZE. If not specified at mount time, it comes from the superblock via sb_logsunit; this is limited to 256k at mkfs time as well; it's copied into m_logbsize in xfs_finish_flags(). However, if for some reason the on-disk value is corrupt and too large, nothing catches it. It's a circuitous path, but that size eventually finds its way to places that make the kernel very unhappy, leading to oopses in xlog_pack_data() because we use the size as an index into iclog->ic_data, but the array is not necessarily that big. Anyway - bounds checking when we read from disk is a good thing! Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Brian Foster authored
Workqueues must be explicitly set as freezable to ensure they are frozen in the assocated part of the hibernation/suspend sequence. Freezing of workqueues and kernel threads is important to ensure that modifications are not made on-disk after the hibernation image has been created. Otherwise, the in-memory state can become inconsistent with what is on disk and eventually lead to filesystem corruption. We have reports of free space btree corruptions that occur immediately after restore from hibernate that suggest the xfs-eofblocks workqueue could be causing such problems if it races with hibernation. Mark all of the internal XFS workqueues as freezable to ensure nothing changes on-disk once the freezer infrastructure freezes kernel threads and creates the hibernation image. Signed-off-by: Brian Foster <bfoster@redhat.com> Reported-by: Carlos E. R. <carlos.e.r@opensuse.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
- 02 Sep, 2014 7 commits
-
-
Brian Foster authored
xfs_collapse_file_space() currently writes back the entire file undergoing collapse range to settle things down for the extent shift algorithm. While this prevents changes to the extent list during the collapse operation, the writeback itself is not enough to prevent unnecessary collapse failures. The current shift algorithm uses the extent index to iterate the in-core extent list. If a post-eof delalloc extent persists after the writeback (e.g., a prior zero range op where the end of the range aligns with eof can separate the post-eof blocks such that they are not written back and converted), xfs_bmap_shift_extents() becomes confused over the encoded br_startblock value and fails the collapse. As with the full writeback, this is a temporary fix until the algorithm is improved to cope with a volatile extent list and avoid attempts to shift post-eof extents. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Dave Chinner authored
If we have delalloc extents on a file before we run a collapse range opertaion, we sync the range that we are going to collapse to convert delalloc extents in that region to real extents to simplify the shift operation. However, the shift operation then assumes that the extent list is not going to change as it iterates over the extent list moving things about. Unfortunately, this isn't true because we can't hold the ILOCK over all the operations. We can prevent new IO from modifying the extent list by holding the IOLOCK, but that doesn't prevent writeback from running.... And when writeback runs, it can convert delalloc extents is the range of the file prior to the region being collapsed, and this changes the indexes of all the extents in the file. That causes the collapse range operation to Go Bad. The right fix is to rewrite the extent shift operation not to be dependent on the extent list not changing across the entire operation, but this is a fairly significant piece of work to do. Hence, as a short-term workaround for the problem, sync the entire file before starting a collapse operation to remove all delalloc ranges from the file and so avoid the problem of concurrent writeback changing the extent list. Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Brian Foster authored
The file collapse mechanism uses xfs_bmap_shift_extents() to collapse all subsequent extents down into the specified, previously punched out, region. This function performs some validation, such as whether a sufficient hole exists in the target region of the collapse, then shifts the remaining exents downward. The exit path of the function currently logs the inode unconditionally. While we must log the inode (and abort) if an error occurs and the transaction is dirty, the initial validation paths can generate errors before the transaction has been dirtied. This creates an unnecessary filesystem shutdown scenario, as the caller will cancel a transaction that has been marked dirty. Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications are made to the inode bmap. Only log the inode in the exit path if logflags has been set. This ensures we only have to cancel a dirty transaction if modifications have been made and prevents an unnecessary filesystem shutdown otherwise. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Dave Chinner authored
Now we are not doing silly things with dirtying buffers beyond EOF and using invalidation correctly, we can finally reduce the ranges of writeback and invalidation used by direct IO to match that of the IO being issued. Bring the writeback and invalidation ranges back to match the generic direct IO code - this will greatly reduce the perturbation of cached data when direct IO and buffered IO are mixed, but still provide the same buffered vs direct IO coherency behaviour we currently have. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Dave Chinner authored
Similar to direct IO reads, direct IO writes are using truncate_pagecache_range to invalidate the page cache. This is incorrect due to the sub-block zeroing in the page cache that truncate_pagecache_range() triggers. This patch fixes things by using invalidate_inode_pages2_range instead. It preserves the page cache invalidation, but won't zero any pages. cc: stable@vger.kernel.org Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Chris Mason authored
xfs is using truncate_pagecache_range to invalidate the page cache during DIO reads. This is different from the other filesystems who only invalidate pages during DIO writes. truncate_pagecache_range is meant to be used when we are freeing the underlying data structs from disk, so it will zero any partial ranges in the page. This means a DIO read can zero out part of the page cache page, and it is possible the page will stay in cache. buffered reads will find an up to date page with zeros instead of the data actually on disk. This patch fixes things by using invalidate_inode_pages2_range instead. It preserves the page cache invalidation, but won't zero any pages. [dchinner: catch error and warn if it fails. Comment.] cc: stable@vger.kernel.org Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Dave Chinner authored
generic/263 is failing fsx at this point with a page spanning EOF that cannot be invalidated. The operations are: 1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes) 1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes) 1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes) where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO write attempts to invalidate the cached page over this range, it fails with -EBUSY and so any attempt to do page invalidation fails. The real question is this: Why can't that page be invalidated after it has been written to disk and cleaned? Well, there's data on the first two buffers in the page (1k block size, 4k page), but the third buffer on the page (i.e. beyond EOF) is failing drop_buffers because it's bh->b_state == 0x3, which is BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say what? OK, set_buffer_dirty() is called on all buffers from __set_page_buffers_dirty(), regardless of whether the buffer is beyond EOF or not, which means that when we get to ->writepage, we have buffers marked dirty beyond EOF that we need to clean. So, we need to implement our own .set_page_dirty method that doesn't dirty buffers beyond EOF. This is messy because the buffer code is not meant to be shared and it has interesting locking issues on the buffer dirty bits. So just copy and paste it and then modify it to suit what we need. Note: the solutions the other filesystems and generic block code use of marking the buffers clean in ->writepage does not work for XFS. It still leaves dirty buffers beyond EOF and invalidations still fail. Hence rather than play whack-a-mole, this patch simply prevents those buffers from being dirtied in the first place. cc: <stable@kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-