Commits · 64f26f745084872b916cd1bef6054e21b15c5784 · Kirill Smelkov / linux

25 Sep, 2008 40 commits

Btrfs: Use assert_spin_locked instead of spin_trylock · 64f26f74
David Woodhouse authored Jul 24, 2008
```
On UP systems spin_trylock always succeeds, so it cannot verify that a lock is held.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Add version strings on module load · b3c3da71
Chris Mason authored Jul 23, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Fix some build problems on 2.6.18 based enterprise kernels · 4881ee5a
Chris Mason authored Jul 24, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Search data ordered extents first for checksums on read · 89642229

Chris Mason authored Jul 24, 2008

Checksum items are not inserted into the tree until all of the io from a
given extent is complete. This means one dirty page from an extent may
be written, freed, and then read again before the entire extent is on disk
and the checksum item is inserted.

The checksums themselves are stored in the ordered extent so they can
be inserted in bulk when IO is complete. On read, if a checksum item isn't
found, the ordered extents were being searched for a checksum record.

This all worked most of the time, but the checksum insertion code tries
to reduce the number of tree operations by pre-inserting checksum items
based on i_size and a few other factors. This means the read code might
find a checksum item that hasn't yet really been filled in.

This commit changes things to check the ordered extents first and only
dive into the btree if nothing was found. This removes the need for
extra locking and is more reliable.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
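
The lookup order described in this commit can be sketched in userspace C. This is a minimal model under stated assumptions, not the kernel code: `struct ordered_csum`, `struct tree_csum`, and `lookup_csum` are hypothetical names, and plain arrays stand in for the per-inode ordered extents and the checksum btree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of a checksum still held by a pending ordered extent. */
struct ordered_csum {
    uint64_t offset;   /* file offset covered by this checksum */
    uint32_t csum;
};

/* Hypothetical btree checksum item; filled == 0 marks a pre-inserted
 * placeholder whose csum field is not valid yet. */
struct tree_csum {
    uint64_t offset;
    uint32_t csum;
    int filled;
};

/*
 * Look up the checksum for `offset`. Pending ordered extents are searched
 * first; the tree is consulted only when they hold nothing for the offset.
 * Returns 0 and stores the checksum in *out, or -1 if none is found.
 */
int lookup_csum(const struct ordered_csum *ordered, size_t n_ordered,
                const struct tree_csum *tree, size_t n_tree,
                uint64_t offset, uint32_t *out)
{
    assert(out != NULL);

    for (size_t i = 0; i < n_ordered; i++) {
        if (ordered[i].offset == offset) {
            *out = ordered[i].csum;
            return 0;
        }
    }
    /* Reached only when no pending ordered extent covers the offset. */
    for (size_t i = 0; i < n_tree; i++) {
        if (tree[i].offset == offset) {
            *out = tree[i].csum;
            return 0;
        }
    }
    return -1;
}
```

In this model, an offset whose tree item is still an unfilled placeholder is answered from the ordered data, so the placeholder value is never returned while the extent is pending.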


Btrfs: Fix 32 bit compiles by using an unsigned long byte count in the ordered extent · 9ba4611a
Chris Mason authored Jul 23, 2008
```
The ordered extents have to fit in memory, so an unsigned long is sufficient.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Take the csum mutex while reading checksums · ed98b56a
Chris Mason authored Jul 22, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: alloc_mutex latency reduction · c286ac48

Chris Mason authored Jul 22, 2008

This releases the alloc_mutex in a few places that hold it over long
operations.  btrfs_lookup_block_group is changed so that it doesn't need
the mutex at all.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Add some conditional schedules near the alloc_mutex · e34a5b4f

Chris Mason authored Jul 22, 2008

This helps prevent stalls, especially while the snapshot cleaner is
running hard.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Use mutex_lock_nested for tree locking · 6dddcbeb

Chris Mason authored Jul 22, 2008

Lockdep has the notion of locking subclasses so that you can identify
locks you expect to be taken after other locks of the same class. This
changes the per-extent buffer btree locking routines to use a subclass based
on the level in the tree.

Unfortunately, lockdep can only handle 8 total subclasses, and the btrfs
max level is also 8. So when lockdep is on, use a lower max level.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
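
The subclass limit can be illustrated with a small userspace C sketch. Everything here is an assumption made for illustration: the mapping `subclass = max_level - level`, the lowered maximum of 7, and the names `level_to_subclass` and `all_subclasses_valid` are hypothetical and are not taken from the commit. Only the limit of 8 subclasses comes from the message above.

```c
#include <assert.h>
#include <stdbool.h>

/* Limit stated in the commit message: valid subclasses are 0..7. */
#define LOCKDEP_MAX_SUBCLASSES 8

/* Assumed values: 8 levels normally, 7 when lockdep is enabled. */
static int model_max_level(bool lockdep_on)
{
    return lockdep_on ? 7 : 8;
}

/*
 * Assumed mapping from tree level to lock subclass. Leaves (level 0)
 * get the largest subclass, so locking top-down moves through
 * increasing subclasses.
 */
int level_to_subclass(int level, bool lockdep_on)
{
    int max = model_max_level(lockdep_on);

    assert(level >= 0 && level < max);
    return max - level;
}

/* True when every level maps to a subclass lockdep can represent. */
bool all_subclasses_valid(bool lockdep_on)
{
    int max = model_max_level(lockdep_on);

    for (int level = 0; level < max; level++) {
        if (level_to_subclass(level, lockdep_on) >= LOCKDEP_MAX_SUBCLASSES)
            return false;
    }
    return true;
}
```

Under these assumptions, a maximum of 8 levels would hand level 0 the subclass 8, which is out of range, while a maximum of 7 keeps every subclass within 1..7.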


Btrfs: Fix some data=ordered related data corruptions · f421950f

Chris Mason authored Jul 22, 2008

Stress testing was showing data checksum errors, most of which were caused
by a lookup bug in the extent_map tree.  The tree was caching the last
pointer returned, and searches would check the last pointer first.

But, search callers also expect the search to return the very first
matching extent in the range, which wasn't always true with the last
pointer usage.

For now, the code to cache the last return value is just removed.  It is
easy to fix, but I think lookups are rare enough that it isn't required anymore.

This commit also replaces do_sync_mapping_range with a local copy of the
related functions.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
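
The lookup bug can be reproduced in a small userspace model. This is a sketch under assumptions: a sorted array replaces the extent_map tree, and `lookup_first` and `lookup_cached` are hypothetical names, with `lookup_cached` imitating the removed "check the last pointer first" behaviour.

```c
#include <stddef.h>
#include <stdint.h>

struct extent {
    uint64_t start;
    uint64_t len;
};

/* True when extent e intersects the byte range [start, start + len). */
static int overlaps(const struct extent *e, uint64_t start, uint64_t len)
{
    return e->start < start + len && start < e->start + e->len;
}

/* Correct search: the first extent (lowest start) overlapping the range.
 * `map` must be sorted by start. */
const struct extent *lookup_first(const struct extent *map, size_t n,
                                  uint64_t start, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        if (overlaps(&map[i], start, len))
            return &map[i];
    }
    return NULL;
}

/* Buggy variant: trusts the cached last result whenever it overlaps,
 * even if an earlier extent in the range also matches. */
const struct extent *lookup_cached(const struct extent *map, size_t n,
                                   const struct extent **last,
                                   uint64_t start, uint64_t len)
{
    const struct extent *hit;

    if (*last != NULL && overlaps(*last, start, len))
        return *last;

    hit = lookup_first(map, n, start, len);
    if (hit != NULL)
        *last = hit;
    return hit;
}
```

With three adjacent 4 KiB extents, a cached hit on the third extent followed by a search over the whole 12 KiB range makes `lookup_cached` return the third extent, whereas `lookup_first` returns the first one, which is what callers expect.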


Btrfs: Use a mutex in the extent buffer for tree block locking · a61e6f29

Chris Mason authored Jul 22, 2008

This replaces the use of the page cache lock bit for locking, which wasn't
suitable for block size < page size and couldn't be used recursively.

The mutexes alone don't fix either problem, but they are the first step.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Index extent buffers in an rbtree · 6af118ce

Chris Mason authored Jul 22, 2008

Before, extent buffers were a temporary object, meant to map a number of pages
at once and collect operations on them.

But a few extra fields have crept in, and they are also the best place to
store a per-tree-block lock field.  This commit puts the extent
buffers into an rbtree, and ensures a single extent buffer for each
tree block.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Data ordered fixes · 4a096752

Chris Mason authored Jul 21, 2008

* In btrfs_delete_inode, wait for ordered extents after calling
  truncate_inode_pages.  This is much faster and more correct.

* Properly clear out the PageChecked bit everywhere we redirty the page.

* Change the writepage fixup handler to lock the page range and check to
  see if an ordered extent had been inserted since the improperly dirtied
  page was discovered.

* Wait for ordered extents outside the transaction.  This isn't required
  for locking rules but does improve transaction latencies.

* Reduce contention on the alloc_mutex by dropping it while incrementing
  refs on a node/leaf and while dropping refs on a leaf.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Fix btrfs_wait_ordered_extent_range to properly wait · e5a2217e
Chris Mason authored Jul 18, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Keep extent mappings in ram until pending ordered extents are done · 7f3c74fb

Chris Mason authored Jul 18, 2008

It was possible for stale mappings from disk to be used instead of the
new pending ordered extent. This adds a flag to the extent map struct
to keep it pinned until the pending ordered extent is actually on disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Don't allow releasepage to succeed if EXTENT_ORDERED is set · 211f90e6
Chris Mason authored Jul 18, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Handle data checksumming on bios that span multiple ordered extents · 3edf7d33

Chris Mason authored Jul 18, 2008

Data checksumming is done right before the bio is sent down the IO stack,
which means a single bio might span more than one ordered extent. In
this case, the checksumming data is split between two ordered extents.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
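
A rough userspace sketch of the splitting: each sector of a bio is attributed to whichever ordered extent contains it, so one bio can feed checksums to two extents. The 4096-byte sector size, the `csums` counter, and the name `attach_bio_csums` are assumptions made for this illustration, not the kernel's data structures.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 4096u  /* assumed checksum granularity */

/* Hypothetical ordered extent covering [start, start + len). */
struct ordered_extent {
    uint64_t start;
    uint64_t len;
    unsigned csums;  /* number of sector checksums attached so far */
};

/*
 * Attach one checksum per sector of the bio range to the ordered extent
 * that contains the sector. Returns the number of sectors for which no
 * ordered extent was found.
 */
unsigned attach_bio_csums(struct ordered_extent *oe, size_t n,
                          uint64_t bio_start, uint64_t bio_len)
{
    unsigned missed = 0;

    for (uint64_t off = bio_start; off < bio_start + bio_len;
         off += SECTOR_SIZE) {
        size_t i;

        for (i = 0; i < n; i++) {
            if (off >= oe[i].start && off < oe[i].start + oe[i].len) {
                oe[i].csums++;
                break;
            }
        }
        if (i == n)
            missed++;
    }
    return missed;
}
```

For two adjacent 8 KiB ordered extents, an 8 KiB bio starting at offset 4096 places one checksum in each extent.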


Btrfs: Cleanup and comment ordered-data.c · eb84ae03
Chris Mason authored Jul 17, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Force caching of metadata block groups on mount to avoid deadlock · 54641bd1

Chris Mason authored Jul 17, 2008

This is a temporary change to avoid deadlocks until the extent tree locking
is fixed up.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


btrfs_next_leaf: do readahead when skip_locking is turned on · 0bd40a71
Chris Mason authored Jul 17, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Add a per-inode lock around btrfs_drop_extents · ee6e6504

Chris Mason authored Jul 17, 2008

btrfs_drop_extents is always called with a range lock held on the inode.
But, it may operate on extents outside that range as it drops and splits
them.

This patch adds a per-inode mutex that is held while calling
btrfs_drop_extents and while inserting new extents into the tree.  It
prevents races between two processes working on adjacent ranges in the tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Don't pin pages in ram until the entire ordered extent is on disk. · ba1da2f4

Chris Mason authored Jul 17, 2008

Checksum items are not inserted until the entire ordered extent is on disk,
but individual pages might be clean and available for reclaim long before
the whole extent is on disk.

In order to allow those pages to be freed, we need to be able to search
the list of ordered extents to find the checksum that is going to be inserted
in the tree.  This way if the page needs to be read back in before
the checksums are in the btree, we'll be able to verify the checksum on
the page.

This commit adds the ability to search the pending ordered extents for
a given offset in the file, and changes btrfs_releasepage to allow
ordered pages to be freed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


btrfs_start_transaction: wait for commits in progress to finish · f9295749

Chris Mason authored Jul 17, 2008

btrfs_commit_transaction has to loop waiting for any writers in the
transaction to finish before it can proceed.  btrfs_start_transaction
should be polite and not join a transaction that is in the process
of being finished off.

There are a few places that can't wait, basically the ones doing IO that
might be needed to finish the transaction.  For them, btrfs_join_transaction
is added.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
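
The difference between the two entry points can be modelled without threads. In this sketch the names `struct transaction`, `model_start_transaction`, and `model_join_transaction` are hypothetical, and a return code replaces the sleep that the real btrfs_start_transaction would perform while a commit is in progress.

```c
#include <assert.h>
#include <stddef.h>

enum trans_result {
    TRANS_JOINED = 0,     /* caller is now a writer in the transaction */
    TRANS_MUST_WAIT = 1   /* a commit is in progress; caller should wait */
};

/* Hypothetical, heavily simplified transaction state. */
struct transaction {
    int in_commit;    /* nonzero once the commit has started */
    int num_writers;  /* writers the commit must wait for */
};

/* Polite variant: refuses to add a writer to a committing transaction. */
enum trans_result model_start_transaction(struct transaction *t)
{
    assert(t != NULL);
    if (t->in_commit)
        return TRANS_MUST_WAIT;
    t->num_writers++;
    return TRANS_JOINED;
}

/* Variant for callers whose IO may be needed to finish the commit:
 * always joins, even while the commit is in progress. */
enum trans_result model_join_transaction(struct transaction *t)
{
    assert(t != NULL);
    t->num_writers++;
    return TRANS_JOINED;
}
```

The split keeps ordinary writers from prolonging a commit while still letting the IO paths that the commit depends on make progress.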


Btrfs: Update on disk i_size only after pending ordered extents are done · dbe674a9

Chris Mason authored Jul 17, 2008

This changes the ordered data code to update i_size after the extent
is on disk.  An on disk i_size is maintained in the in-memory btrfs inode
structures, and this is updated as extents finish.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Use async helpers to deal with pages that have been improperly dirtied · 247e743c

Chris Mason authored Jul 17, 2008

Higher layers sometimes call set_page_dirty without asking the filesystem
to help. This causes many problems for the data=ordered and cow code.
This commit detects pages that haven't been properly set up for IO and
kicks off an async helper to deal with them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: New data=ordered implementation · e6dcd2dc

Chris Mason authored Jul 17, 2008

The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk.  This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
  the extent bit EXTENT_ORDERED is set on the entire range of the extent.
  A struct btrfs_ordered_extent is allocated and inserted into a per-inode
  rbtree to track the pending extents.

* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
  to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
  are written, the previously reserved extent is inserted into the FS
  btree and into the extent allocation trees.  The checksums for the file
  data are also updated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
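
The lifecycle in the list above can be sketched as a byte counter per ordered extent. This is a simplified userspace model: the `bytes_left` and `inserted` fields and the function names are assumptions for illustration and do not reproduce the real struct btrfs_ordered_extent; setting `inserted` stands in for adding the extent to the FS btree and updating checksums.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one pending ordered extent. */
struct model_ordered_extent {
    uint64_t start;       /* file offset of the extent */
    uint64_t len;         /* total bytes reserved */
    uint64_t bytes_left;  /* bytes not yet written */
    int inserted;         /* nonzero once the extent is in the btree */
};

void model_ordered_init(struct model_ordered_extent *oe,
                        uint64_t start, uint64_t len)
{
    assert(oe != NULL);
    oe->start = start;
    oe->len = len;
    oe->bytes_left = len;
    oe->inserted = 0;
}

/*
 * Account for `bytes` of the extent reaching disk. Returns 1 exactly
 * once, when the last byte is written and the extent is inserted;
 * returns 0 otherwise.
 */
int model_ordered_written(struct model_ordered_extent *oe, uint64_t bytes)
{
    assert(oe != NULL);
    assert(bytes <= oe->bytes_left);

    oe->bytes_left -= bytes;
    if (oe->bytes_left == 0 && !oe->inserted) {
        oe->inserted = 1;
        return 1;
    }
    return 0;
}
```

Because completion is tracked per extent, a commit in this model never has to wait for every data extent of the transaction, only for metadata that describes extents already on disk.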


Btrfs: Drop some verbose printks · 77a41afb
Chris Mason authored Jul 08, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Add locking around volume management (device add/remove/balance) · 7d9eb12c
Chris Mason authored Jul 08, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Fix deadlock while searching for dead roots on mount · a7a16fd7

Chris Mason authored Jun 26, 2008

btrfs_find_dead_roots called btrfs_read_fs_root_no_radix, which
means we end up calling btrfs_search_slot with a path already held.

The fix is to remember the key inside btrfs_find_dead_roots and drop
the path.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Reduce contention on the root node · f9efa9c7

Chris Mason authored Jun 25, 2008

This calls unlock_up sooner in btrfs_search_slot in order to decrease the
amount of work done with the higher level tree locks held.

Also, it changes btrfs_tree_lock to spin for a bit against the page lock
before scheduling.  This makes a big difference in context switch rate under
highly contended workloads.

Longer term, a better locking structure is needed than the page lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Online btree defragmentation fixes · 3f157a2f

Chris Mason authored Jun 25, 2008

The btree defragger wasn't making forward progress because the new key wasn't
being saved by the btrfs_search_forward function.

This also disables the automatic btree defrag, which wasn't scaling well to
huge filesystems. The auto-defrag needs to be done differently.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Add a per-inode csum mutex to avoid races creating csum items · 1b1e2135
Chris Mason authored Jun 25, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Change find_extent_buffer to use TestSetPageLocked · 079899c2

Chris Mason authored Jun 25, 2008

This makes it possible for callers to check for extent_buffers in cache
without deadlocking against any btree locks held.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Add btree locking to the tree defragmentation code · e7a84565

Chris Mason authored Jun 25, 2008

The online btree defragger is simplified and rewritten to use
standard btree searches instead of a walk up / down mechanism.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Replace the transaction work queue with kthreads · a74a4b97

Chris Mason authored Jun 25, 2008

This creates one kthread for commits and one kthread for
deleting old snapshots.  All the work queues are removed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Add btrfs_end_transaction_throttle to force writers to wait for pending commits · 89ce8a63

Chris Mason authored Jun 25, 2008

The existing throttle mechanism was often not sufficient to prevent
new writers from coming in and making a given transaction run forever.
This adds an explicit wait at the end of most operations so they will
allow the current transaction to close.

There is no wait inside file_write, inode updates, or cow filling, all of which
have different deadlock possibilities.

This is a temporary measure until better asynchronous commit support is
added.  This code leads to stalls as it waits for data=ordered
writeback, and it really needs to be fixed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Fix snapshot deletion to release the alloc_mutex much more often. · 333db94c
Chris Mason authored Jun 25, 2008
```
This lowers the impact of snapshot deletion on the rest of the FS.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Add a skip_locking parameter to struct path, and make various funcs honor it · 5cd57b2c

Chris Mason authored Jun 25, 2008

Allocations may need to read in block groups from the extent allocation tree,
which will require a tree search and take locks on the extent allocation
tree.  But, those locks might already be held in other places, leading
to deadlocks.

Since the alloc_mutex serializes everything right now, it is safe to
skip the btree locking while caching block groups.  A better fix will be
to either create a recursive lock or find a way to back off existing
locks while caching block groups.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Fix btrfs_next_leaf to check for new items after dropping locks · 168fd7d2
Chris Mason authored Jun 25, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Fix btrfs_del_ordered_inode to allow forcing the drop during unlinks · 594a24eb

Chris Mason authored Jun 25, 2008

This allows us to delete an unlinked inode with dirty pages from the list
instead of forcing commit to write these out before deleting the inode.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
