- 26 Feb, 2016 5 commits
-
-
David Sterba authored
-
David Sterba authored
-
David Sterba authored
-
David Sterba authored
-
David Sterba authored
# Conflicts:
#   fs/btrfs/file.c
-
- 18 Feb, 2016 13 commits
-
-
Zhao Lei authored
For a non-existent device, the old code bypasses adding it to the device's reada queue. To solve the problem of unfinished waiting on raid5/6, commit 5fbc7c59 ("Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting") added an exception for the first stripe; in short, the first stripe is always processed whether the device exists or not. There is a better way to handle this: simply bypass creation of the reada_extent for a non-existent device, which makes the code simpler and effective. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Zhao Lei authored
The reada background work is not designed to finish all jobs completely; it will stop in the following cases: 1) a device reaches its workload limit (MAX_IN_FLIGHT), 2) the total number of reads reaches the maximum limit (10000), 3) no device has more jobs queued, which often happens in the DUP case. If all background works exit while jobs remain, btrfs_reada_wait() will wait indefinitely. This problem rarely happened in the old code, because: 1) every work queued 2x new works, so the large number of works reduced the chance of undone jobs; 2) a work would loop 10000 times when it had no jobs, which reduced the window with no running thread. But after the above cases were fixed, "undone reada extents" started happening frequently. Fix: in btrfs_reada_wait(), check that at least one worker is running while there are undone jobs. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
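A minimal user-space sketch of the wait-side idea described above (the names `worker_pass`, `jobs_remaining` and `MAX_IN_FLIGHT_MODEL` are invented for the sketch; this models the logic, it is not the kernel code): a worker may legitimately exit with jobs left over, so the waiter re-kicks a worker whenever jobs remain and nothing is running, instead of blocking forever.

```c
/*
 * Simplified model: a background "worker" may stop before all jobs are done
 * (it hits an artificial in-flight limit), so the wait side must re-kick a
 * worker whenever jobs remain and no worker is running.
 */
#include <stdio.h>

#define MAX_IN_FLIGHT_MODEL 3   /* worker stops after this many jobs */

static int jobs_remaining = 10; /* pending readahead jobs */
static int workers_running;     /* background workers currently active */

/* One background worker pass: handles a few jobs, then exits early. */
static void worker_pass(void)
{
    workers_running++;
    for (int done = 0; done < MAX_IN_FLIGHT_MODEL && jobs_remaining > 0; done++)
        jobs_remaining--;
    workers_running--;          /* may leave jobs behind */
}

/* Model of the fixed wait: never block while jobs exist but no worker runs. */
static void reada_wait_model(void)
{
    while (jobs_remaining > 0) {
        if (workers_running == 0) {
            printf("no worker but %d jobs left, kicking a new one\n",
                   jobs_remaining);
            worker_pass();
        }
        /* in the real code this path would wait for the running worker */
    }
    printf("all jobs done\n");
}

int main(void)
{
    reada_wait_model();
    return 0;
}
```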
-
Zhao Lei authored
Reada creates 2 works for each level of the tree recursively. For a tree with many levels, the number of created works is 2^level_of_tree. We don't actually need so many works in parallel; this patch limits the maximum number of works to BTRFS_MAX_MIRRORS * 2. The per-fs works_counter will also be used by btrfs_reada_wait() to check whether there are background workers. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
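A rough stand-alone illustration of the capping idea, using assumed names (`reada_works_cnt`, `BTRFS_MAX_MIRRORS_MODEL`, `queue_reada_work`); it sketches the counter check only and is not the btrfs implementation.

```c
/*
 * Model of bounding the number of queued readahead works with a shared
 * counter, instead of letting the count grow as 2^tree_level.
 */
#include <stdio.h>
#include <stdatomic.h>

#define BTRFS_MAX_MIRRORS_MODEL 3
#define MAX_WORKS (BTRFS_MAX_MIRRORS_MODEL * 2)

static atomic_int reada_works_cnt;   /* per-fs counter in the real code */

/* Try to queue one more background work; refuse once the cap is reached. */
static int queue_reada_work(void)
{
    if (atomic_load(&reada_works_cnt) >= MAX_WORKS)
        return 0;                    /* enough workers already queued */
    atomic_fetch_add(&reada_works_cnt, 1);
    return 1;
}

/* Called when a background work finishes. */
static void reada_work_done(void)
{
    atomic_fetch_sub(&reada_works_cnt, 1);
}

int main(void)
{
    int queued = 0;
    for (int i = 0; i < 100; i++)    /* recursion would try many works */
        queued += queue_reada_work();
    printf("queued %d of 100 requested works (cap %d)\n", queued, MAX_WORKS);
    while (queued--)
        reada_work_done();
    return 0;
}
```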
-
Zhao Lei authored
There is no need to decrease dev->reada_in_flight inside __readahead_hook() and in reada_extent_put(). reada_extent_put() has no chance to decrease dev->reada_in_flight on the free path, because a reada_extent holds an additional refcnt while it is scheduled to a device. Put the increment and decrement of dev->reada_in_flight in one place instead, to keep the logic simple and safe, and turn the now-useless reada_extent->scheduled_for into a bool flag. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
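A small stand-alone model of this refactoring (the struct and function names below are invented for the sketch): the in-flight counter is touched in exactly two places, paired with a bool `scheduled` flag, so it cannot be decremented twice or missed.

```c
/* Model: pair the counter with a flag so inc/dec stay in one place. */
#include <stdbool.h>
#include <stdio.h>

struct model_dev {
    int reada_in_flight;
};

struct model_extent {
    bool scheduled;             /* replaces the old scheduled_for pointer */
};

static void schedule_extent(struct model_dev *dev, struct model_extent *re)
{
    re->scheduled = true;
    dev->reada_in_flight++;     /* the only increment site */
}

static void extent_done(struct model_dev *dev, struct model_extent *re)
{
    if (re->scheduled) {
        re->scheduled = false;
        dev->reada_in_flight--; /* the only decrement site */
    }
}

int main(void)
{
    struct model_dev dev = { 0 };
    struct model_extent re = { false };

    schedule_extent(&dev, &re);
    extent_done(&dev, &re);
    extent_done(&dev, &re);     /* a second completion stays balanced */
    printf("in_flight = %d\n", dev.reada_in_flight);
    return 0;
}
```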
-
Zhao Lei authored
Remove one copy of the loop, fixing the typo in iterating over the zones. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Zhao Lei authored
The current code sets nritems to 0 so that the for loop does nothing, and sets generation's value, which is not necessary. Jumping directly to the cleanup is the better choice. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Zhao Lei authored
What __readahead_hook() actually needs is fs_info; there is no need to convert fs_info to a root in the caller and convert it back in __readahead_hook(). Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Zhao Lei authored
reada_start_machine_dev() already has the reada_extent pointer; passing it into __readahead_hook() directly instead of searching the radix tree makes the code run faster. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Zhao Lei authored
We can't release the reada_extent before __readahead_hook() runs, because __readahead_hook() still needs to use it; it is necessary to hold a refcnt so that it isn't freed. This is actually not a problem after the patch named "Avoid many times of empty loop", which ensures the reada_extent above includes at least one reada_extctl and therefore holds one additional refcnt on the reada_extent. But we still want this patch to keep the logic clean. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Zhao Lei authored
level is not used in several functions; remove it from their arguments, along with the related code that obtains its value. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Zhao Lei authored
When adding all dev_zones for a reada_extent fails, the extent has no chance to be selected to run and stays in memory forever. We should bypass such an extent to avoid the above case. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Zhao Lei authored
If some device is not reachable, we should bypass it and continue adding the next one, instead of breaking out on the bad device. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
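An illustrative snippet of this control-flow change (not the actual btrfs loop; `model_dev` and its fields are made up): a missing device is skipped with `continue` so the remaining devices still get their zones added, where the old code would `break` out of the loop.

```c
/* Model: skip an unreachable device instead of aborting the whole loop. */
#include <stdio.h>
#include <stddef.h>

struct model_dev {
    const char *name;
    int missing;            /* device not reachable */
};

int main(void)
{
    struct model_dev devs[] = {
        { "devA", 0 }, { "devB", 1 }, { "devC", 0 },
    };
    int added = 0;

    for (size_t i = 0; i < sizeof(devs) / sizeof(devs[0]); i++) {
        if (devs[i].missing) {
            /* old behavior: break; -- would also skip devC */
            continue;       /* new behavior: bypass and try the next one */
        }
        printf("added zone for %s\n", devs[i].name);
        added++;
    }
    printf("%d zones added\n", added);
    return 0;
}
```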
-
Zhao Lei authored
Move the is_need_to_readahead condition earlier to avoid a useless loop gathering the related data for readahead. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
- 16 Feb, 2016 4 commits
-
-
Zhao Lei authored
We can see the following loop (10000 times) in the trace log:
[ 75.416137] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 75.417413] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
[ 75.418611] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 75.419793] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
[ 75.421016] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 75.422324] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
[ 75.423661] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 75.424882] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
...(10000 times)
[ 124.101672] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 124.102850] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
[ 124.104008] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 124.105121] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
Reason: if more than one user triggers reada on the same extent, the first task finishes setting up the reada data structures and calls reada_start_machine() to start, while the second task has only taken a ref_count and has not yet completely added the reada_extctl struct; the reada_extent then cannot finish all jobs and keeps getting selected in __reada_start_machine() for 10000 iterations (the total loop count in __reada_start_machine()). Fix: for a reada_extent without a job, we don't need to run it; just return 0 to let the caller break. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
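A simplified stand-alone model of the fix (the names `model_extent`, `nr_jobs` and `start_one_extent` are invented for the sketch): an extent whose job list is still empty reports that nothing was dispatched, so the outer loop stops immediately instead of re-selecting it thousands of times.

```c
/* Model: a jobless extent returns 0 so the dispatcher loop can break. */
#include <stdio.h>

struct model_extent {
    int nr_jobs;            /* number of attached reada_extctl entries */
};

/* Returns how many jobs were started; 0 lets the caller break its loop. */
static int start_one_extent(struct model_extent *re)
{
    if (re->nr_jobs == 0)
        return 0;           /* no job attached yet, nothing to run */
    re->nr_jobs--;
    return 1;
}

int main(void)
{
    struct model_extent re = { .nr_jobs = 0 };
    int loops = 0;

    /* outer loop: stops as soon as the extent reports no work */
    while (start_one_extent(&re))
        loops++;
    printf("dispatched %d times for a jobless extent (was 10000 before)\n",
           loops);
    return 0;
}
```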
-
Zhao Lei authored
When rechecking that the zone is in the tree, we still need to check that the zone includes our logical address. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Zhao Lei authored
We can avoid an additional lock acquisition and one pair of kref_get/put by combining the two conditions. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Zhao Lei authored
reada_zone->end is the end position of the segment: end = start + cache->key.offset - 1; so we need to use "<=" in the condition to judge whether a position is inside the segment. The problem happened rarely, because the logical position rarely points to the last 4k of a block group, but we need to fix it to make the code logically correct. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
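A tiny self-contained example of this boundary condition (using an invented `model_zone` struct, not the kernel structures): with an inclusive end offset, membership has to be tested with `<=`, otherwise the last block of the zone is treated as outside it.

```c
/* Model: end = start + len - 1 is inclusive, so membership needs "<=". */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct model_zone {
    uint64_t start;
    uint64_t end;           /* inclusive: start + len - 1 */
};

static int pos_in_zone(const struct model_zone *z, uint64_t logical)
{
    return logical >= z->start && logical <= z->end;   /* "<" would miss z->end */
}

int main(void)
{
    struct model_zone z = { .start = 0, .end = 1024 * 1024 - 1 };

    assert(pos_in_zone(&z, z.end));        /* last position belongs to the zone */
    assert(!pos_in_zone(&z, z.end + 1));   /* first position of the next zone */
    printf("inclusive end check ok\n");
    return 0;
}
```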
-
- 12 Feb, 2016 3 commits
-
-
Qu Wenruo authored
Introduce the new mount option alias "norecovery" for nologreplay, to keep the behavior of "norecovery" consistent with other filesystems. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Introduce a new mount option "nologreplay" that cooperates with the "ro" mount option to get a truly read-only mount, like "norecovery" in ext* and xfs. Since the new parse_options() needs to check the new flags at remount time, add a new parameter to parse_options(). Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Current "recovery" mount option will only try to use backup root. However the word "recovery" is too generic and may be confusing for some users. Here introduce a new and more specific mount option, "usebackuproot" to replace "recovery" mount option. "Recovery" will be kept for compatibility reason, but will be deprecated. Also, since "usebackuproot" will only affect mount behavior and after open_ctree() it has nothing to do with the filesystem, so clear the flag after mount succeeded. This provides the basis for later unified "norecovery" mount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> [ dropped usebackuproot from show_mount, added note about 'recovery' to docs ] Signed-off-by: David Sterba <dsterba@suse.com>
-
- 11 Feb, 2016 15 commits
-
-
David Sterba authored
Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The number of distinct key types is not so big that we can waste one for every new thing we want to store in the tree. Similar to the temporary items, we introduce a new name for an existing key value and use the objectid for further extension. The victim is the BTRFS_DEV_STATS_KEY (249). The device stats are an example of a permanent item. Signed-off-by: David Sterba <dsterba@suse.com>
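A sketch of the key-sharing scheme this describes, using an invented `model_key` struct and made-up objectid constants (only the shared type value comes from the message above): one key type is reused and the objectid selects the concrete permanent item, so new items don't consume new type values.

```c
/* Model: one shared key type, objectid picks the actual item. */
#include <stdint.h>
#include <stdio.h>

struct model_key {
    uint64_t objectid;      /* selects the concrete item stored under the key */
    uint8_t  type;          /* shared key type value */
    uint64_t offset;
};

#define MODEL_PERSISTENT_ITEM_KEY  249  /* shared type value, as in the message above */
#define MODEL_DEV_STATS_OBJECTID     0  /* illustrative objectid choices */
#define MODEL_FUTURE_ITEM_OBJECTID   1

int main(void)
{
    struct model_key dev_stats = {
        .objectid = MODEL_DEV_STATS_OBJECTID,
        .type     = MODEL_PERSISTENT_ITEM_KEY,
        .offset   = 0,
    };
    struct model_key future_item = {
        .objectid = MODEL_FUTURE_ITEM_OBJECTID,
        .type     = MODEL_PERSISTENT_ITEM_KEY,  /* same type, new objectid */
        .offset   = 0,
    };

    printf("dev stats key:   (%llu, %u, %llu)\n",
           (unsigned long long)dev_stats.objectid,
           (unsigned)dev_stats.type,
           (unsigned long long)dev_stats.offset);
    printf("future item key: (%llu, %u, %llu)\n",
           (unsigned long long)future_item.objectid,
           (unsigned)future_item.type,
           (unsigned long long)future_item.offset);
    return 0;
}
```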
-
David Sterba authored
No visible change. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The number of distinct key types is not so big that we can waste one for every new thing we want to store in the tree. We introduce a new name for an existing key value and use the objectid for further extension. The victim is the BTRFS_BALANCE_ITEM_KEY (248). The balance status item is a good example of a temporary item: it exists from the beginning of the balance and keeps the status until the balance finishes. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Kcalloc is functionally equivalent and does overflow checks. Signed-off-by: David Sterba <dsterba@suse.com>
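A user-space analogue of why this conversion helps (the `checked_alloc` helper below is an illustration of the overflow check that kcalloc performs, not kernel code): multiplying count and size by hand can overflow and silently under-allocate, while the checked call refuses such requests.

```c
/* Model: a calloc-style helper rejects count * size overflow. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative checked allocator: returns NULL on multiplication overflow. */
static void *checked_alloc(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;                 /* n * size would overflow */
    return calloc(n, size);          /* calloc also zeroes, like kcalloc */
}

int main(void)
{
    size_t huge = SIZE_MAX / 2 + 1;

    /* an overflowing request is refused instead of under-allocating */
    if (!checked_alloc(huge, 4))
        printf("overflowing request rejected\n");

    void *ok = checked_alloc(16, sizeof(uint64_t));
    if (ok)
        printf("normal request succeeded\n");
    free(ok);
    return 0;
}
```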
-
David Sterba authored
We can safely use GFP_KERNEL in the functions called from the ioctl handlers. Here we can allocate up to 32k, so less pressure on the allocator could help. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We can safely use GFP_KERNEL in the functions called from the ioctl handlers. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Readdir is initiated from userspace and is not on the critical writeback path, so we don't need to use GFP_NOFS for allocations. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Fallocate is initiated from userspace and is not on the critical writeback path, so we don't need to use GFP_NOFS for allocations. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We don't need to use GFP_NOFS in all contexts, eg. during mount or for the dummy root tree, but we might for the log tree creation. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Scrub is not on the critical writeback path, so we don't need to use GFP_NOFS for all allocations. The failures are handled and stats are passed back to userspace. Let's use GFP_KERNEL on the paths where everything is ok, ie. setting up the global structures and the IO submission paths. Functions that do the repair and fixups still use GFP_NOFS, as we might want to skip any other filesystem activity if we encounter an error. This could turn out to be unnecessary, but requires more review compared to the easy cases in this patch. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The readahead framework is not on the critical writeback path, so we don't need to use GFP_NOFS for allocations. All error paths are handled and the readahead failures are not fatal. The actual users (scrub, dev-replace) will trigger reads if the blocks are not found in cache. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The send operation is not on the critical writeback path, so we don't need to use GFP_NOFS for allocations. All error paths are handled and the whole operation is restartable. Signed-off-by: David Sterba <dsterba@suse.com>
-