- 06 Aug, 2018 40 commits
-
-
Qu Wenruo authored
If a crafted image has missing block group items, it could cause unexpected behavior and breaks the assumption of 1:1 chunk<->block group mapping. Although we have the block group -> chunk mapping check, we still need chunk -> block group mapping check. This patch will do extra check to ensure each chunk has its corresponding block group. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847Reported-by: Xu Wen <wen.xu@gatech.edu> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
A crafted btrfs image with incorrect chunk<->block group mapping will trigger a lot of unexpected things as the mapping is essential. Although the problem can be caught by block group item checker added in "btrfs: tree-checker: Verify block_group_item", it's still not sufficient. A sufficiently valid block group item can pass the check added by the mentioned patch but could fail to match the existing chunk. This patch will add extra block group -> chunk mapping check, to ensure we have a completely matching (start, len, flags) chunk for each block group at mount time. Here we reuse the original helper find_first_block_group(), which is already doing the basic bg -> chunk checks, adding further checks of the start/len and type flags. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199837Reported-by: Xu Wen <wen.xu@gatech.edu> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When doing an incremental send, if we have a file in the parent snapshot that has prealloc extents beyond EOF and in the send snapshot it got a hole punch that partially covers the prealloc extents, the send stream, when replayed by a receiver, can result in a file that has a size bigger than it should and filled with zeroes past the correct EOF. For example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "falloc -k 0 4M" /mnt/foobar $ xfs_io -c "pwrite -S 0xea 0 1M" /mnt/foobar $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send -f /tmp/1.send /mnt/snap1 $ xfs_io -c "fpunch 1M 2M" /mnt/foobar $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -f /tmp/2.send -p /mnt/snap1 /mnt/snap2 $ stat --format %s /mnt/snap2/foobar 1048576 $ md5sum /mnt/snap2/foobar d31659e82e87798acd4669a1e0a19d4f /mnt/snap2/foobar $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive -f /mnt/1.snap /mnt $ btrfs receive -f /mnt/2.snap /mnt $ stat --format %s /mnt/snap2/foobar 3145728 # --> should be 1Mb and not 3Mb (which was the end offset of hole # punch operation) $ md5sum /mnt/snap2/foobar 117baf295297c2a995f92da725b0b651 /mnt/snap2/foobar # --> should be d31659e82e87798acd4669a1e0a19d4f as in the original fs This issue actually happens only since commit ffa7c429 ("Btrfs: send, do not issue unnecessary truncate operations"), but before that commit we were issuing a write operation full of zeroes (to "punch" a hole) which was extending the file size beyond the correct value and then immediately issue a truncate operation to the correct size and undoing the previous write operation. Since the send protocol does not support fallocate, for extent preallocation and hole punching, fix this by not even attempting to send a "hole" (regular write full of zeroes) if it starts at an offset greater then or equals to the file's size. This approach, besides being much more simple then making send issue the truncate operation, adds the benefit of avoiding the useless pair of write of zeroes and truncate operations, saving time and IO at the receiver and reducing the size of the send stream. A test case for fstests follows soon. Fixes: ffa7c429 ("Btrfs: send, do not issue unnecessary truncate operations") CC: stable@vger.kernel.org # 4.17+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Misono Tomohiro authored
Cleanup patch and no functional changes. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Al Viro authored
Don't open-code iget_failed(), don't bother with btrfs_free_path(NULL), move handling of positive return values of btrfs_lookup_inode() from btrfs_read_locked_inode() to btrfs_iget() and kill now obviously pointless ASSERT() in there. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Al Viro authored
We don't need to check is_bad_inode() after the call of btrfs_read_locked_inode() - it's exactly the same as checking return value for being non-zero. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Al Viro authored
IS_ERR(p) && PTR_ERR(p) == n is a weird way to spell p == ERR_PTR(n). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Al Viro authored
Just get rid of pointless checks. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Misono Tomohiro authored
on-disk devs stats value is updated in btrfs_run_dev_stats(), which is called during commit transaction, if device->dev_stats_ccnt is not zero. Since current replace operation does not touch dev_stats_ccnt, on-disk dev stats value is not updated. Therefore "btrfs device stats" may return old device's value after umount/mount (Example: See "btrfs ins dump-t -t DEV $DEV" after btrfs/100 finish). Fix this by just incrementing dev_stats_ccnt in btrfs_dev_replace_finishing() when replace is succeeded and this will update the values. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Misono Tomohiro authored
There is no user of this function anymore. This was forgotten to be removed in commit a575ceeb ("Btrfs: get rid of unused orphan infrastructure"). Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Misono Tomohiro authored
Use ERR_CAST() instead of void * to make meaning clear. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
Although it is safe to call this on already released paths with no locks held or extent buffers, removing the redundant btrfs_release_path is reasonable. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
All callers pass the root tree of dir, we can push that down to the function itself. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be referenced from the passed transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be referenced from the passed transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be referenced from the passed transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be referenced from the passed transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be referenced from the passed transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Leftover after fix e339a6b0 ("Btrfs: __btrfs_mod_ref should always use no_quota"), that removed it from the function calls but not the structure. Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The more common use case of send involves creating a RO snapshot and then use it for a send operation. In this case it's not possible to have inodes in the snapshot that have a link count of zero (inode with an orphan item) since during snapshot creation we do the orphan cleanup. However, other less common use cases for send can end up seeing inodes with a link count of zero and in this case the send operation fails with a ENOENT error because any attempt to generate a path for the inode, with the purpose of creating it or updating it at the receiver, fails since there are no inode reference items. One use case it to use a regular subvolume for a send operation after turning it to RO mode or turning a RW snapshot into RO mode and then using it for a send operation. In both cases, if a file gets all its hard links deleted while there is an open file descriptor before turning the subvolume/snapshot into RO mode, the send operation will encounter an inode with a link count of zero and then fail with errno ENOENT. Example using a full send with a subvolume: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ btrfs subvolume create /mnt/sv1 $ touch /mnt/sv1/foo $ touch /mnt/sv1/bar # keep an open file descriptor on file bar $ exec 73</mnt/sv1/bar $ unlink /mnt/sv1/bar # Turn the subvolume to RO mode and use it for a full send, while # holding the open file descriptor. $ btrfs property set /mnt/sv1 ro true $ btrfs send -f /tmp/full.send /mnt/sv1 At subvol /mnt/sv1 ERROR: send ioctl failed with -2: No such file or directory Example using an incremental send with snapshots: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ btrfs subvolume create /mnt/sv1 $ touch /mnt/sv1/foo $ touch /mnt/sv1/bar $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1 $ echo "hello world" >> /mnt/sv1/bar $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap2 # Turn the second snapshot to RW mode and delete file foo while # holding an open file descriptor on it. $ btrfs property set /mnt/snap2 ro false $ exec 73</mnt/snap2/foo $ unlink /mnt/snap2/foo # Set the second snapshot back to RO mode and do an incremental send. $ btrfs property set /mnt/snap2 ro true $ btrfs send -f /tmp/inc.send -p /mnt/snap1 /mnt/snap2 At subvol /mnt/snap2 ERROR: send ioctl failed with -2: No such file or directory So fix this by ignoring inodes with a link count of zero if we are either doing a full send or if they do not exist in the parent snapshot (they are new in the send snapshot), and unlink all paths found in the parent snapshot when doing an incremental send (and ignoring all other inode items, such as xattrs and extents). A test case for fstests follows soon. CC: stable@vger.kernel.org # 4.4+ Reported-by: Martin Wilck <martin.wilck@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
If we end up with logging an inode reference item which has the same name but different index from the one we have persisted, we end up failing when replaying the log with an errno value of -EEXIST. The error comes from btrfs_add_link(), which is called from add_inode_ref(), when we are replaying an inode reference item. Example scenario where this happens: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/foo $ ln /mnt/foo /mnt/bar $ sync # Rename the first hard link (foo) to a new name and rename the second # hard link (bar) to the old name of the first hard link (foo). $ mv /mnt/foo /mnt/qwerty $ mv /mnt/bar /mnt/foo # Create a new file, in the same parent directory, with the old name of # the second hard link (bar) and fsync this new file. # We do this instead of calling fsync on foo/qwerty because if we did # that the fsync resulted in a full transaction commit, not triggering # the problem. $ touch /mnt/bar $ xfs_io -c "fsync" /mnt/bar <power fail> $ mount /dev/sdb /mnt mount: mount /dev/sdb on /mnt failed: File exists So fix this by checking if a conflicting inode reference exists (same name, same parent but different index), removing it (and the associated dir index entries from the parent inode) if it exists, before attempting to add the new reference. A test case for fstests follows soon. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
If we're trying to make a data reservation and we have to allocate a data chunk we could leak ret == 1, as do_chunk_alloc() will return 1 if it allocated a chunk. Since the end of the function is the success path just return 0. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The exported helper just calls the static one. There's no obvious reason to have them separate eg. for performance reasons where the static one could be better optimized in the same unit. There's a slight decrease in code size and stack consumption. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Lock owner and nesting level have been unused since day 1, probably copy&pasted from the extent_buffer locking scheme without much thinking. The locking of device replace is simpler and does not need any lock nesting. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Added in 58176a96 ("Btrfs: Add per-root block accounting and sysfs entries") in 2007, the roots had names exported in sysfs. The code was commented out in 4df27c4d ("Btrfs: change how subvolumes are organized") and cleaned by 182608c8 ("btrfs: remove old unused commented out code"). Signed-off-by: David Sterba <dsterba@suse.com>
-
Adam Borowski authored
Requiring a read-write descriptor conflicts both ways with exec, returning ETXTBSY whenever you try to defrag a program that's currently being run, or causing intermittent exec failures on a live system being defragged. As defrag doesn't change the file's contents in any way, there's no reason to consider it a rw operation. Thus, let's check only whether the file could have been opened rw. Such access control is still needed as currently defrag can use extra disk space, and might trigger bugs. We return EINVAL when the request is invalid; here it's ok but merely the user has insufficient privileges. Thus, the EPERM return value reflects the error better -- as discussed in the identical case for dedupe. According to codesearch.debian.net, no userspace program distinguishes these values beyond strerror(). Signed-off-by: Adam Borowski <kilobyte@angband.pl> Reviewed-by: David Sterba <dsterba@suse.com> [ fold the EPERM patch from Adam ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We recently ran into the following deadlock involving btrfs_write_inode(): [ +0.005066] __schedule+0x38e/0x8c0 [ +0.007144] schedule+0x36/0x80 [ +0.006447] bit_wait+0x11/0x60 [ +0.006446] __wait_on_bit+0xbe/0x110 [ +0.007487] ? bit_wait_io+0x60/0x60 [ +0.007319] __inode_wait_for_writeback+0x96/0xc0 [ +0.009568] ? autoremove_wake_function+0x40/0x40 [ +0.009565] inode_wait_for_writeback+0x21/0x30 [ +0.009224] evict+0xb0/0x190 [ +0.006099] iput+0x1a8/0x210 [ +0.006103] btrfs_run_delayed_iputs+0x73/0xc0 [ +0.009047] btrfs_commit_transaction+0x799/0x8c0 [ +0.009567] btrfs_write_inode+0x81/0xb0 [ +0.008008] __writeback_single_inode+0x267/0x320 [ +0.009569] writeback_sb_inodes+0x25b/0x4e0 [ +0.008702] wb_writeback+0x102/0x2d0 [ +0.007487] wb_workfn+0xa4/0x310 [ +0.006794] ? wb_workfn+0xa4/0x310 [ +0.007143] process_one_work+0x150/0x410 [ +0.008179] worker_thread+0x6d/0x520 [ +0.007490] kthread+0x12c/0x160 [ +0.006620] ? put_pwq_unlocked+0x80/0x80 [ +0.008185] ? kthread_park+0xa0/0xa0 [ +0.007484] ? do_syscall_64+0x53/0x150 [ +0.007837] ret_from_fork+0x29/0x40 Writeback calls: btrfs_write_inode btrfs_commit_transaction btrfs_run_delayed_iputs If iput() is called on that same inode, evict() will wait for writeback forever. btrfs_write_inode() was originally added way back in 4730a4bc ("btrfs_dirty_inode") to support O_SYNC writes. However, ->write_inode() hasn't been used for O_SYNC since 148f948b ("vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode"), so btrfs_write_inode() is actually unnecessary (and leads to a bunch of unnecessary commits). Get rid of it, which also gets rid of the deadlock. CC: stable@vger.kernel.org # 3.2+ Signed-off-by: Josef Bacik <jbacik@fb.com> [Omar: new commit message] Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
It can be referenced from the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
It can be referenced from the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
This function is always passed a well-formed tgtdevice so the fs_info can be referenced from there. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
It can be referenced from the passed 'device' argument which is always a well-formed device. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
It can be referenced from the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
It can be referenced from the passed srcdev argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
It can be referenced form the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
In find_free_extent() under checks: label, we have the following code: search_start = ALIGN(offset, fs_info->stripesize); /* move on to the next group */ if (search_start + num_bytes > block_group->key.objectid + block_group->key.offset) { btrfs_add_free_space(block_group, offset, num_bytes); goto loop; } if (offset < search_start) btrfs_add_free_space(block_group, offset, search_start - offset); BUG_ON(offset > search_start); However ALIGN() is rounding up, thus @search_start >= @offset and that BUG_ON() will never be triggered. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At send.c:full_send_tree() we were setting the 'key' variable in the loop while never using it later. We were also using two btrfs_key variables to store the initial key for search and the key found in every iteration of the loop. So remove this useless key assignment and use the same btrfs_key variable to store the initial search key and the key found in each iteration. This was introduced in the initial send commit but was never used (commit 31db9f7c ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The data and metadata callback implementation both use the same function. We can remove the call indirection and intermediate helper completely. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The data and metadata callback implementation both use the same function. We can remove the call indirection completely. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
All implementations of the callback are trivial and do the same and there's only one user. Merge everything together. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-