- 16 Jan, 2012 33 commits
-
-
Josef Bacik authored
This in addition to a script in my btrfs-tracing tree will help track down space leaks when we're getting space left over in block groups on umount. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
-
Josef Bacik authored
We've been seeing warnings coming out of the orphan commit stuff forever from ceph. Turns out it's because we're racing with checking if the orphan block reserve is set, because we clear it outside of the spin_lock. So leave the normal fastpath checks where they are, but take the spin_lock and _recheck_ to make sure we haven't had an orphan block rsv added in the meantime. Then clear the root's orphan block rsv and release the lock. With this patch a user said the warnings went away and they usually showed up pretty soon after he started ceph. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
-
Josef Bacik authored
I used these tracepoints when figuring out what the cluster stuff was doing, so add them to mainline in case we need to profile this stuff again. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
-
Josef Bacik authored
Btrfs_throttle will make us wait if there is a currently committing transaction until we can open new transactions, which is ridiculous since we don't actually start any transactions within the file write path anyway, so all this does is introduce big latencies if we have a sync/fsync heavy workload going on while somebody else is trying to do work. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Josef Bacik authored
If updating the inode gave us an ENOSPC we were just returning in page_mkwrite, which is a problem since we make our reservation right before trying to update the inode, so fix the out label so that we actually free our reservation. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Miao Xie authored
Reproduce steps: # mkfs.btrfs /dev/sdb5 # mount /dev/sdb5 -o compress=lzo /mnt # dd if=/dev/zero of=/mnt/tmpfile bs=128K count=1 # sync # truncate -s 64K /mnt/tmpfile root 5 inode 257 errors 400 This is because of the wrong if condition, which is used to check if we should subtract the bytes of the dropped range from i_blocks/i_bytes of i-node or not. When we truncate a compressed extent, btrfs substracts the bytes of the whole extent, it's wrong. We should substract the real size that we truncate, no matter it is a compressed extent or not. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Josef Bacik authored
A user reported a problem where things like open with O_CREAT would take up to 30 seconds when he had nfs activity on the same mount. This is because all of our quick metadata operations, like create, symlink etc all do btrfs_end_transaction_throttle, which if the transaction is blocked will wait for the commit to complete before it returns. This adds a ridiculous amount of latency and isn't really needed. The normal btrfs_end_transaction will mark the transaction as blocked and wake the transaction kthread up if it thinks the transaction needs to end (this being in the running out of global reserve space scenario), and this is all that is really needed since we've already done everything we're going to do, we just need to return. This should help people with the latency they were seeing when using synchronous heavy workloads. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
git://btrfs.giantdisaster.de/git/btrfsChris Mason authored
Conflicts: fs/btrfs/ctree.h fs/btrfs/super.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
-
git://repo.or.cz/linux-btrfs-develChris Mason authored
Conflicts: fs/btrfs/volumes.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
-
Chris Mason authored
-
Ilya Dryomov authored
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Recognize BTRFS_BALANCE_RESUME flag passed from userspace. We use the same heuristics used when recovering balance after a crash to try to start where we left off last time. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Implement an ioctl for canceling restriper. Currently we wait until relocation of the current block group is finished, in future this can be done by triggering a commit. Balance item is deleted and no memory about the interrupted balance is kept. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Implement an ioctl for pausing restriper. This pauses the relocation, but balance is still considered to be "in progress": balance item is not deleted, other volume operations cannot be started, etc. If paused in the middle of profile changing operation we will continue making allocations with the target profile. Add a hook to close_ctree() to pause restriper and free its data structures on unmount. (It's safe to unmount when restriper is in "paused" state, we will resume with the same parameters on the next mount) Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Since restriper kthread starts involuntarily on mount and can suck cpu and memory bandwidth add a mount option to forcefully skip it. The restriper in that case hangs around in paused state and can be resumed from userspace when it's convenient. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
On mount, if balance item is found, resume balance in a separate kernel thread. Try to be smart to continue roughly where previous balance (or convert) was interrupted. For chunk types that were being converted to some profile we turn on soft convert, in case of a simple balance we turn on usage filter and relocate only less-than-90%-full chunks of that type. These are just heuristics but they help quite a bit, and can be improved in future. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Introduce a new btree objectid for storing balance item. The reason is to be able to resume restriper after a crash with the same parameters. Balance item has a very high objectid and goes into tree of tree roots. The key for the new item is as follows: [ BTRFS_BALANCE_OBJECTID ; BTRFS_BALANCE_ITEM_KEY ; 0 ] Older kernels simply ignore it so it's safe to mount with an older kernel and then go back to the newer one. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
When doing convert from one profile to another if soft mode is on restriper won't touch chunks that already have the profile we are converting to. This is useful if e.g. half of the FS was converted earlier. The soft mode switch is (like every other filter) per-type. This means that we can convert for example meta chunks the "hard" way while converting data chunks selectively with soft switch. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Profile changing is done by launching a balance with BTRFS_BALANCE_CONVERT bits set and target fields of respective btrfs_balance_args structs initialized. Profile reducing code in this case will pick restriper's target profile if it's available instead of doing a blind reduce. If target profile is not yet available it goes back to a plain reduce. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Every caller of do_chunk_alloc() feeds it the reduced allocation profile, so stop trying to reduce it one more time. Instead check the validity of the passed profile. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Select chunks which have at least one byte located inside a given [vstart, vend) virtual address space range. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Select chunks which have at least one byte of at least one stripe located on a device with devid X in a given [pstart,pend) physical address range. This filter only works when devid filter is turned on. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Relocate chunks which have at least one stripe located on a device with devid X. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Select chunks that are less than X percent full. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Select chunks based on a given profile mask. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
This allows to have a separate set of filters for each chunk type (data,meta,sys). The code however is generic and switch on chunk type is only done once. This commit also adds a type filter: it allows to balance for example meta and system chunks w/o touching data ones. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Add basic restriper infrastructure: extended balancing ioctl and all related ioctl data structures, add data structure for tracking restriper's state to fs_info, etc. The semantics of the old balancing ioctl are fully preserved. Explicitly disallow any volume operations when balance is in progress. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Currently when new chunks are created respective avail_alloc_bits field is updated to reflect profiles of all chunks present in the system. However when chunks are removed profile bits are never cleared. This patch clears profile bit of respective avail_alloc_bits field when the last chunk with that profile is removed. Restriper needs this to properly operate when "downgrading". Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Right now on-disk BTRFS_BLOCK_GROUP_* profile bits are used for avail_{data,metadata,system}_alloc_bits fields, which gather info about available allocation profiles in the FS. When chunk is created or read from disk, its profile is OR'ed with the corresponding avail_alloc_bits field. Since SINGLE is denoted by 0 in the on-disk format, currently there is no way to tell when such chunks become avaialble. Restriper needs that information, so add a separate bit for SINGLE profile. This bit is going to be in-memory only, it should never be written out to disk, so it's not a disk format change. However to avoid remappings in future, reserve corresponding on-disk bit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Chunk's type and profile are encoded in u64 flags field. Introduce masks to easily access them. Also fix the type of BTRFS_BLOCK_GROUP_* constants, it should be ULL. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
{data,metadata,system}_alloc_profile fields have been unused for a long time now. Get rid of them. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
- 11 Jan, 2012 7 commits
-
-
Li Zefan authored
The correct lock order is uuid_mutex -> volume_mutex -> chunk_mutex, but when we mount a filesystem which has backing seed devices, we have this lock chain: open_ctree() lock(chunk_mutex); read_chunk_tree(); read_one_dev(); open_seed_devices(); lock(uuid_mutex); and then we hit a lockdep splat. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
-
Li Zefan authored
A bug was triggered while using seed device: # mkfs.btrfs /dev/loop1 # btrfstune -S 1 /dev/loop1 # mount -o /dev/loop1 /mnt # btrfs dev add /dev/loop2 /mnt btrfs: block rsv returned -28 ------------[ cut here ]------------ WARNING: at fs/btrfs/extent-tree.c:5969 btrfs_alloc_free_block+0x166/0x396 [btrfs]() ... Call Trace: ... [<f7b7c31c>] btrfs_cow_block+0x101/0x147 [btrfs] [<f7b7eaa6>] btrfs_search_slot+0x1b8/0x55f [btrfs] [<f7b7f844>] btrfs_insert_empty_items+0x42/0x7f [btrfs] [<f7b7f8c1>] btrfs_insert_item+0x40/0x7e [btrfs] [<f7b8ac02>] btrfs_make_block_group+0x243/0x2aa [btrfs] [<f7bb3f53>] __btrfs_alloc_chunk+0x672/0x70e [btrfs] [<f7bb41ff>] init_first_rw_device+0x77/0x13c [btrfs] [<f7bb5a62>] btrfs_init_new_device+0x664/0x9fd [btrfs] [<f7bbb65a>] btrfs_ioctl+0x694/0xdbe [btrfs] [<c04f55f7>] do_vfs_ioctl+0x496/0x4cc [<c04f5660>] sys_ioctl+0x33/0x4f [<c07b9edf>] sysenter_do_call+0x12/0x38 ---[ end trace 906adac595facc7d ]--- Since seed device is readonly, there's no usable space in the filesystem. Afterwards we add a sprout device to it, and the kernel creates a METADATA block group and a SYSTEM block group where comes free space we can reserve, but we still get revervation failure because the global block_rsv hasn't been updated accordingly. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
-
Li Zefan authored
There are various bugs in block group trimming: - It may trim from offset smaller than user-specified offset. - It may trim beyond user-specified range. - It may leak free space for extents smaller than specified minlen. - It may truncate the last trimmed extent thus leak free space. - With mixed extents+bitmaps, some extents may not be trimmed. - With mixed extents+bitmaps, some bitmaps may not be trimmed (even none will be trimmed). Even for those trimmed, not all the free space in the bitmaps will be trimmed. I rewrite btrfs_trim_block_group() and break it into two functions. One is to trim extents only, and the other is to trim bitmaps only. Before patching: # fstrim -v /mnt/ /mnt/: 1496465408 bytes were trimmed After patching: # fstrim -v /mnt/ /mnt/: 2193768448 bytes were trimmed And this matches the total free space: # btrfs fi df /mnt Data: total=3.58GB, used=1.79GB System, DUP: total=8.00MB, used=4.00KB System: total=4.00MB, used=0.00 Metadata, DUP: total=205.12MB, used=97.14MB Metadata: total=8.00MB, used=0.00 Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
-
Li Zefan authored
For btrfs raid, while discarding a range of space, we'll need to know the start offset and length to discard for each device, and it's done in btrfs_map_block(). However the calculation is a bit complex for raid0 and raid10, so I reimplement it based on a fact that: dev1 dev2 dev3 (raid0) ----------------------------------- s0 s3 s6 s1 s4 s7 s2 s5 Each device has (total_stripes / nr_dev) stripes, or plus one. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
-
Li Zefan authored
We pre-allocate a btrfs bio with fixed size, and then may re-allocate memory if we find stripes are bigger than the fixed size. But this pre-allocation is not necessary. Also we don't have to calcuate the stripe number twice. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
-
Li Zefan authored
Some functions never use the transaction handle passed to them. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
-
Li Zefan authored
Check and reserve space for btrfs_update_inode(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
-