Commits · f421950f86bf96a11fef932e167ab2e70d4c43a0 · Kirill Smelkov / linux

25 Sep, 2008 40 commits

Btrfs: Fix some data=ordered related data corruptions · f421950f

Chris Mason authored Jul 22, 2008

Stress testing was showing data checksum errors, most of which were caused
by a lookup bug in the extent_map tree.  The tree was caching the last
pointer returned, and searches would check the last pointer first.

But, search callers also expect the search to return the very first
matching extent in the range, which wasn't always true with the last
pointer usage.

For now, the code to cache the last return value is just removed.  It is
easy to fix, but I think lookups are rare enough that it isn't required anymore.

This commit also replaces do_sync_mapping_range with a local copy of the
related functions.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f421950f
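
The lookup rule described in this commit (a search must return the very first extent that overlaps the requested range, which a cached "last result" pointer does not guarantee) can be illustrated with a small userspace sketch. It uses a sorted array rather than the kernel's tree, and `struct extent` and `first_overlapping` are hypothetical names chosen for illustration, not the btrfs API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record covering bytes [start, start + len). */
struct extent {
    uint64_t start;
    uint64_t len;
};

/*
 * Return the lowest-offset extent that overlaps [start, start + len),
 * or NULL if none does. 'map' must be sorted by start and must not
 * contain overlapping extents.
 */
static const struct extent *first_overlapping(const struct extent *map,
                                              size_t n, uint64_t start,
                                              uint64_t len)
{
    size_t lo = 0, hi = n;

    assert(len > 0);

    /* Binary search for the first extent that ends after 'start'. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (map[mid].start + map[mid].len <= start)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* That extent only matches if it begins before the range ends. */
    if (lo < n && map[lo].start < start + len)
        return &map[lo];
    return NULL;
}
```

A cached pointer to a later extent could also overlap the range, which is why checking it first can return a match that is not the first one.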

Btrfs: Use a mutex in the extent buffer for tree block locking · a61e6f29

Chris Mason authored Jul 22, 2008

This replaces the use of the page cache lock bit for locking, which wasn't
suitable for block size < page size and couldn't be used recursively.

The mutexes alone don't fix either problem, but they are the first step.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

a61e6f29

Btrfs: Index extent buffers in an rbtree · 6af118ce

Chris Mason authored Jul 22, 2008

Before, extent buffers were a temporary object, meant to map a number of pages
at once and collect operations on them.

But, a few extra fields have crept in, and they are also the best place to
store a per-tree block lock field as well.  This commit puts the extent
buffers into an rbtree, and ensures a single extent buffer for each
tree block.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

6af118ce

Btrfs: Data ordered fixes · 4a096752

Chris Mason authored Jul 21, 2008

* In btrfs_delete_inode, wait for ordered extents after calling
truncate_inode_pages.  This is much faster and more correct.

* Properly clear out the PageChecked bit everywhere we redirty the page.

* Change the writepage fixup handler to lock the page range and check to
see if an ordered extent had been inserted since the improperly dirtied
page was discovered

* Wait for ordered extents outside the transaction.  This isn't required
for locking rules but does improve transaction latencies

* Reduce contention on the alloc_mutex by dropping it while incrementing
refs on a node/leaf and while dropping refs on a leaf.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

4a096752

Fix btrfs_wait_ordered_extent_range to properly wait · e5a2217e

Chris Mason authored Jul 18, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

e5a2217e

Btrfs: Keep extent mappings in ram until pending ordered extents are done · 7f3c74fb

Chris Mason authored Jul 18, 2008

It was possible for stale mappings from disk to be used instead of the
new pending ordered extent. This adds a flag to the extent map struct
to keep it pinned until the pending ordered extent is actually on disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

7f3c74fb

Btrfs: Don't allow releasepage to succeed if EXTENT_ORDERED is set · 211f90e6

Chris Mason authored Jul 18, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

211f90e6

Btrfs: Handle data checksumming on bios that span multiple ordered extents · 3edf7d33

Chris Mason authored Jul 18, 2008

Data checksumming is done right before the bio is sent down the IO stack,
which means a single bio might span more than one ordered extent. In
this case, the checksumming data is split between two ordered extents.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3edf7d33
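
As a rough illustration of the splitting described in this commit, the sketch below computes how many bytes of one bio fall inside each ordered extent it touches. The `struct range` type and `overlap_bytes` helper are hypothetical and only model the byte arithmetic, not the actual checksum bookkeeping in btrfs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical byte range covering [start, start + len). */
struct range {
    uint64_t start;
    uint64_t len;
};

/* Number of bytes of 'bio' that fall inside 'ordered' (0 if disjoint). */
static uint64_t overlap_bytes(struct range bio, struct range ordered)
{
    uint64_t bio_end = bio.start + bio.len;
    uint64_t ord_end = ordered.start + ordered.len;
    uint64_t lo = bio.start > ordered.start ? bio.start : ordered.start;
    uint64_t hi = bio_end < ord_end ? bio_end : ord_end;

    return hi > lo ? hi - lo : 0;
}
```

Calling `overlap_bytes` once per ordered extent gives the share of the bio's data, and therefore of its checksums, that belongs to each extent.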

Btrfs: Cleanup and comment ordered-data.c · eb84ae03

Chris Mason authored Jul 17, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

eb84ae03

Btrfs: Force caching of metadata block groups on mount to avoid deadlock · 54641bd1

Chris Mason authored Jul 17, 2008

This is a temporary change to avoid deadlocks until the extent tree locking
is fixed up.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

54641bd1

btrfs_next_leaf: do readahead when skip_locking is turned on · 0bd40a71

Chris Mason authored Jul 17, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

0bd40a71

Add a per-inode lock around btrfs_drop_extents · ee6e6504

Chris Mason authored Jul 17, 2008

btrfs_drop_extents is always called with a range lock held on the inode.
But, it may operate on extents outside that range as it drops and splits
them.

This patch adds a per-inode mutex that is held while calling
btrfs_drop_extents and while inserting new extents into the tree.  It
prevents races from two procs working against adjacent ranges in the tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ee6e6504

Btrfs: Don't pin pages in ram until the entire ordered extent is on disk. · ba1da2f4

Chris Mason authored Jul 17, 2008

Checksum items are not inserted until the entire ordered extent is on disk,
but individual pages might be clean and available for reclaim long before
the whole extent is on disk.

In order to allow those pages to be freed, we need to be able to search
the list of ordered extents to find the checksum that is going to be inserted
in the tree.  This way if the page needs to be read back in before
the checksums are in the btree, we'll be able to verify the checksum on
the page.

This commit adds the ability to search the pending ordered extents for
a given offset in the file, and changes btrfs_releasepage to allow
ordered pages to be freed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ba1da2f4

btrfs_start_transaction: wait for commits in progress to finish · f9295749

Chris Mason authored Jul 17, 2008

btrfs_commit_transaction has to loop waiting for any writers in the
transaction to finish before it can proceed.  btrfs_start_transaction
should be polite and not join a transaction that is in the process
of being finished off.

There are a few places that can't wait, basically the ones doing IO that
might be needed to finish the transaction.  For them, btrfs_join_transaction
is added.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f9295749

Btrfs: Update on disk i_size only after pending ordered extents are done · dbe674a9

Chris Mason authored Jul 17, 2008

This changes the ordered data code to update i_size after the extent
is on disk.  An on disk i_size is maintained in the in-memory btrfs inode
structures, and this is updated as extents finish.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

dbe674a9

Btrfs: Use async helpers to deal with pages that have been improperly dirtied · 247e743c

Chris Mason authored Jul 17, 2008

Higher layers sometimes call set_page_dirty without asking the filesystem
to help. This causes many problems for the data=ordered and cow code.
This commit detects pages that haven't been properly setup for IO and
kicks off an async helper to deal with them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

247e743c

Btrfs: New data=ordered implementation · e6dcd2dc

Chris Mason authored Jul 17, 2008

The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk.  This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
  the extent bit EXTENT_ORDERED is set on the entire range of the extent.
  A struct btrfs_ordered_extent is allocated and inserted into a per-inode
  rbtree to track the pending extents.

* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
  to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
  are written, the previously reserved extent is inserted into the FS
  btree and into the extent allocation trees.  The checksums for the file
  data are also updated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e6dcd2dc
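
The completion rule in the list above can be sketched in userspace C. This is a simplification under stated assumptions: the commit tracks progress by clearing EXTENT_ORDERED bits on byte ranges, while the sketch uses a plain `bytes_left` counter, and `struct ordered_extent` with its helpers is a hypothetical stand-in for struct btrfs_ordered_extent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct btrfs_ordered_extent. */
struct ordered_extent {
    uint64_t file_offset;  /* start of the extent in the file */
    uint64_t len;          /* total bytes reserved for the extent */
    uint64_t bytes_left;   /* bytes still waiting to reach disk */
};

/* Set up tracking when the delayed allocation is filled. */
static void ordered_extent_init(struct ordered_extent *oe,
                                uint64_t file_offset, uint64_t len)
{
    oe->file_offset = file_offset;
    oe->len = len;
    oe->bytes_left = len;
}

/*
 * Record that 'bytes' of the extent were written. Returns true once the
 * whole extent is on disk, which is the point where the reserved extent
 * and its checksums would be inserted into the btrees.
 */
static bool ordered_extent_written(struct ordered_extent *oe, uint64_t bytes)
{
    assert(bytes <= oe->bytes_left);
    oe->bytes_left -= bytes;
    return oe->bytes_left == 0;
}
```

The caller invokes `ordered_extent_written` once per page as writeback completes; only the call that drains `bytes_left` triggers the metadata update, so the commit no longer has to wait for all data.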

Btrfs: Drop some verbose printks · 77a41afb

Chris Mason authored Jul 08, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

77a41afb

Btrfs: Add locking around volume management (device add/remove/balance) · 7d9eb12c

Chris Mason authored Jul 08, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

7d9eb12c

Btrfs: Fix deadlock while searching for dead roots on mount · a7a16fd7

Chris Mason authored Jun 26, 2008

btrfs_find_dead_roots called btrfs_read_fs_root_no_radix, which
means we end up calling btrfs_search_slot with a path already held.

The fix is to remember the key inside btrfs_find_dead_roots and drop
the path.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

a7a16fd7

Btrfs: Reduce contention on the root node · f9efa9c7

Chris Mason authored Jun 25, 2008

This calls unlock_up sooner in btrfs_search_slot in order to decrease the
amount of work done with the higher level tree locks held.

Also, it changes btrfs_tree_lock to spin for a bit against the page lock
before scheduling.  This makes a big difference in context switch rate under
highly contended workloads.

Longer term, a better locking structure is needed than the page lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f9efa9c7
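
The "spin for a bit before scheduling" idea can be sketched with C11 atomics. This is only an illustration: `struct tree_lock` and its helpers are hypothetical, the spin count is left to the caller, and the real code spins against the page lock rather than an atomic flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock: 'held' is true while some thread owns it. */
struct tree_lock {
    atomic_bool held;
};

/* Single non-blocking attempt; true if the lock was acquired. */
static bool tree_trylock(struct tree_lock *l)
{
    return !atomic_exchange_explicit(&l->held, true, memory_order_acquire);
}

static void tree_unlock(struct tree_lock *l)
{
    atomic_store_explicit(&l->held, false, memory_order_release);
}

/*
 * Spin phase: retry up to max_spins times. Returns true if the lock was
 * taken while spinning; on false the caller would fall back to sleeping
 * until the lock is released, which costs a context switch.
 */
static bool tree_lock_spin(struct tree_lock *l, int max_spins)
{
    assert(max_spins > 0);
    for (int i = 0; i < max_spins; i++) {
        if (tree_trylock(l))
            return true;
    }
    return false;
}
```

When locks are held only briefly, most acquisitions succeed during the spin phase, which is the effect on context switch rate the commit message describes.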

Btrfs: Online btree defragmentation fixes · 3f157a2f

Chris Mason authored Jun 25, 2008

The btree defragger wasn't making forward progress because the new key wasn't
being saved by the btrfs_search_forward function.

This also disables the automatic btree defrag, which wasn't scaling well to
huge filesystems. The auto-defrag needs to be done differently.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3f157a2f

Btrfs: Add a per-inode csum mutex to avoid races creating csum items · 1b1e2135

Chris Mason authored Jun 25, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

1b1e2135

Btrfs: Change find_extent_buffer to use TestSetPageLocked · 079899c2

Chris Mason authored Jun 25, 2008

This makes it possible for callers to check for extent_buffers in cache
without deadlocking against any btree locks held.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

079899c2

Btrfs: Add btree locking to the tree defragmentation code · e7a84565

Chris Mason authored Jun 25, 2008

The online btree defragger is simplified and rewritten to use
standard btree searches instead of a walk up / down mechanism.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e7a84565

Btrfs: Replace the transaction work queue with kthreads · a74a4b97

Chris Mason authored Jun 25, 2008

This creates one kthread for commits and one kthread for
deleting old snapshots.  All the work queues are removed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

a74a4b97

Add btrfs_end_transaction_throttle to force writers to wait for pending commits · 89ce8a63

Chris Mason authored Jun 25, 2008

The existing throttle mechanism was often not sufficient to prevent
new writers from coming in and making a given transaction run forever.
This adds an explicit wait at the end of most operations so they will
allow the current transaction to close.

There is no wait inside file_write, inode updates, or cow filling, all of which
have different deadlock possibilities.

This is a temporary measure until better asynchronous commit support is
added.  This code leads to stalls as it waits for data=ordered
writeback, and it really needs to be fixed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

89ce8a63

Btrfs: Fix snapshot deletion to release the alloc_mutex much more often. · 333db94c

Chris Mason authored Jun 25, 2008

This lowers the impact of snapshot deletion on the rest of the FS.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

333db94c

Btrfs: Add a skip_locking parameter to struct path, and make various funcs honor it · 5cd57b2c

Chris Mason authored Jun 25, 2008

Allocations may need to read in block groups from the extent allocation tree,
which will require a tree search and take locks on the extent allocation
tree.  But, those locks might already be held in other places, leading
to deadlocks.

Since the alloc_mutex serializes everything right now, it is safe to
skip the btree locking while caching block groups.  A better fix will be
to either create a recursive lock or find a way to back off existing
locks while caching block groups.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

5cd57b2c

Fix btrfs_next_leaf to check for new items after dropping locks · 168fd7d2

Chris Mason authored Jun 25, 2008

Signed-off-by: Chris Mason <chris.mason@oracle.com>

168fd7d2

Fix btrfs_del_ordered_inode to allow forcing the drop during unlinks · 594a24eb

Chris Mason authored Jun 25, 2008

This allows us to delete an unlinked inode with dirty pages from the list
instead of forcing commit to write these out before deleting the inode.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

594a24eb

Drop locks in btrfs_search_slot when reading a tree block. · 051e1b9f

Chris Mason authored Jun 25, 2008

One lock per btree block can make for significant congestion if everyone
has to wait for IO at the high levels of the btree. This drops
locks held by a path when doing reads during a tree search.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

051e1b9f

Btrfs: Replace the big fs_mutex with a collection of other locks · a2135011

Chris Mason authored Jun 25, 2008

Extent allocations are still protected by a large alloc_mutex.
Objectid allocations are covered by an objectid mutex.
Other btree operations are protected by a lock on individual btree nodes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

a2135011

Btrfs: Start btree concurrency work. · 925baedd

Chris Mason authored Jun 25, 2008

The allocation trees and the chunk trees are serialized via their own
dedicated mutexes.  This means allocation locking is still not very
fine grained.

The main FS btree is protected by locks on each block in the btree.  Locks
are taken top / down, and as processing finishes on a given level of the
tree, the lock is released after locking the lower level.

The end result of a search is now a path where only the lowest level
is locked.  Releasing or freeing the path drops any locks held.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

925baedd
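
The top-down locking order described in this commit (lock the lower level, then release the level above, so that only the lowest level stays locked) can be sketched as follows. The lock here is a toy flag rather than a real mutex, and `struct block`, `struct path`, `search_down`, and `release_path` are hypothetical names, not the btrfs functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LEVEL 8

/* Toy tree block: 'locked' is 1 while the block lock is held. */
struct block {
    int locked;
};

/* Hypothetical path: nodes[0] is the leaf, higher indexes are parents. */
struct path {
    struct block *nodes[MAX_LEVEL];
    int locks[MAX_LEVEL];  /* 1 if this path holds the lock at that level */
};

static void block_lock(struct block *b)
{
    assert(!b->locked);
    b->locked = 1;
}

static void block_unlock(struct block *b)
{
    assert(b->locked);
    b->locked = 0;
}

/* Walk from the root down to the leaf; chain[i] is the block at level i. */
static void search_down(struct path *p, struct block **chain, int root_level)
{
    int level = root_level;

    assert(root_level >= 0 && root_level < MAX_LEVEL);
    memset(p, 0, sizeof(*p));
    block_lock(chain[level]);
    p->nodes[level] = chain[level];
    p->locks[level] = 1;

    while (level > 0) {
        int lower = level - 1;

        block_lock(chain[lower]);    /* take the lower level first */
        p->nodes[lower] = chain[lower];
        p->locks[lower] = 1;
        block_unlock(chain[level]);  /* then release the level above */
        p->locks[level] = 0;
        level = lower;
    }
}

/* Releasing the path drops any locks it still holds. */
static void release_path(struct path *p)
{
    for (int i = 0; i < MAX_LEVEL; i++) {
        if (p->locks[i]) {
            block_unlock(p->nodes[i]);
            p->locks[i] = 0;
        }
        p->nodes[i] = NULL;
    }
}
```

Holding the parent until the child is locked keeps the descent safe against concurrent changes, while releasing it immediately afterwards keeps contention on the upper levels low.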

Btrfs: Add a thread pool just for submit_bio · 1cc127b5

Chris Mason authored Jun 12, 2008

If a bio submission is queued on the work queue behind a lock holder that
is waiting for that bio, it is possible to deadlock.  Move the bios
into their own pool.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

1cc127b5

BTRFS_IOC_TRANS_START should be privileged · df5b5520

Christoph Hellwig authored Jun 11, 2008

As mentioned in the comment next to it btrfs_ioctl_trans_start can
do bad damage to filesystems and thus should be limited to privileged
users.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

df5b5520

Btrfs: split out ioctl.c · f46b5a66

Christoph Hellwig authored Jun 11, 2008

Split the ioctl handling out of inode.c into a file of its own.
Also fix up checkpatch.pl warnings for the moved code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

f46b5a66

Btrfs: kerneldoc comments for extent_map.c · 9d2423c5

Christoph Hellwig authored Jun 11, 2008

Add kerneldoc comments for all exported functions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

9d2423c5

Btrfs: Add a mount option to control worker thread pool size · 4543df7e

Chris Mason authored Jun 11, 2008

mount -o thread_pool_size changes the default, which is
min(num_cpus + 2, 8).  Larger thread pools would make more sense on
very large disk arrays.

This mount option controls the max size of each thread pool.  There
are multiple thread pools, so the total worker count will be larger
than the mount option.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

4543df7e
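
The default stated in this commit, min(num_cpus + 2, 8), and the fact that the limit applies to each pool separately can be written out as a short sketch. The function names and the idea of passing the pool count as a parameter are assumptions made for illustration.

```c
#include <assert.h>

/* Default per-pool size from the commit message: min(num_cpus + 2, 8). */
static int default_thread_pool_size(int num_cpus)
{
    int size = num_cpus + 2;

    assert(num_cpus > 0);
    return size < 8 ? size : 8;
}

/*
 * Upper bound on workers across all pools. Each pool is capped at
 * pool_size, so the total can exceed the mount option value.
 */
static int max_total_workers(int pool_size, int num_pools)
{
    assert(pool_size > 0 && num_pools > 0);
    return pool_size * num_pools;
}
```

On a 4-CPU machine the default per-pool size is 6, while anything with 6 or more CPUs gets the cap of 8 unless thread_pool_size raises it.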

Btrfs: Worker thread optimizations · 35d8ba66

Chris Mason authored Jun 11, 2008

This changes the worker thread pool to maintain a list of idle threads,
avoiding a complex search for a good thread to wake up.

Threads have two states:

idle - we try to reuse the last thread used in hopes of improving the batching
ratios

busy - each time a new work item is added to a busy task, the task is
rotated to the end of the line.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

35d8ba66
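
One possible reading of the idle/busy scheme above is sketched below with fixed-size arrays. It assumes that "reuse the last thread used" means taking the most recently idled worker (a stack), and that busy workers form a queue whose front is rotated to the back on each new work item. `struct pool` and `pick_worker` are hypothetical and do not mirror the actual btrfs worker code.

```c
#include <assert.h>
#include <string.h>

#define MAX_WORKERS 8

/* Hypothetical pool; worker ids are small integers. */
struct pool {
    int idle[MAX_WORKERS];  /* stack: the top is the most recently idled */
    int n_idle;
    int busy[MAX_WORKERS];  /* queue: index 0 is the front of the line */
    int n_busy;
};

/* Choose the worker that should receive the next work item. */
static int pick_worker(struct pool *p)
{
    int id;

    if (p->n_idle > 0) {
        /* Prefer an idle worker, most recently used first. */
        id = p->idle[--p->n_idle];
        assert(p->n_busy < MAX_WORKERS);
        p->busy[p->n_busy++] = id;
        return id;
    }

    /* No idle worker: use the front busy one and rotate it to the end. */
    assert(p->n_busy > 0);
    id = p->busy[0];
    memmove(&p->busy[0], &p->busy[1], (p->n_busy - 1) * sizeof(p->busy[0]));
    p->busy[p->n_busy - 1] = id;
    return id;
}
```

Both paths are constant-time decisions apart from the array shift, which reflects the stated goal of avoiding a complex search for a good thread to wake up.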