1. 15 Oct, 2018 40 commits
    • Filipe Manana's avatar
      Btrfs: fix wrong dentries after fsync of file that got its parent replaced · 0f375eed
      Filipe Manana authored
      In a scenario like the following:
      
        mkdir /mnt/A               # inode 258
        mkdir /mnt/B               # inode 259
        touch /mnt/B/bar           # inode 260
      
        sync
      
        mv /mnt/B/bar /mnt/A/bar
        mv -T /mnt/A /mnt/B
        fsync /mnt/B/bar
      
        <power fail>
      
      After replaying the log we end up with file bar having 2 hard links, both
      with the name 'bar' and one in the directory with inode number 258 and the
      other in the directory with inode number 259. Also, we end up with the
      directory inode 259 still existing and with the directory inode 258 still
      named as 'A', instead of 'B'. In this scenario, file 'bar' should only
      have one hard link, located at directory inode 258, the directory inode
      259 should not exist anymore and the name for directory inode 258 should
      be 'B'.
      
      This incorrect behaviour happens because when attempting to log the old
      parents of an inode, we skip any parents that no longer exist. Fix this
      by forcing a full commit if an old parent no longer exists.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0f375eed
    • Filipe Manana's avatar
      Btrfs: fix warning when replaying log after fsync of a tmpfile · f2d72f42
      Filipe Manana authored
      When replaying a log which contains a tmpfile (which necessarily has a
      link count of 0) we end up calling inc_nlink(), at
      fs/btrfs/tree-log.c:replay_one_buffer(), which produces a warning like
      the following:
      
        [195191.943673] WARNING: CPU: 0 PID: 6924 at fs/inode.c:342 inc_nlink+0x33/0x40
        [195191.943723] CPU: 0 PID: 6924 Comm: mount Not tainted 4.19.0-rc6-btrfs-next-38 #1
        [195191.943724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [195191.943726] RIP: 0010:inc_nlink+0x33/0x40
        [195191.943728] RSP: 0018:ffffb96e425e3870 EFLAGS: 00010246
        [195191.943730] RAX: 0000000000000000 RBX: ffff8c0d1e6af4f0 RCX: 0000000000000006
        [195191.943731] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c0d1e6af4f0
        [195191.943731] RBP: 0000000000000097 R08: 0000000000000001 R09: 0000000000000000
        [195191.943732] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb96e425e3a60
        [195191.943733] R13: ffff8c0d10cff0c8 R14: ffff8c0d0d515348 R15: ffff8c0d78a1b3f8
        [195191.943735] FS:  00007f570ee24480(0000) GS:ffff8c0dfb200000(0000) knlGS:0000000000000000
        [195191.943736] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [195191.943737] CR2: 00005593286277c8 CR3: 00000000bb8f2006 CR4: 00000000003606f0
        [195191.943739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [195191.943740] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [195191.943741] Call Trace:
        [195191.943778]  replay_one_buffer+0x797/0x7d0 [btrfs]
        [195191.943802]  walk_up_log_tree+0x1c1/0x250 [btrfs]
        [195191.943809]  ? rcu_read_lock_sched_held+0x3f/0x70
        [195191.943825]  walk_log_tree+0xae/0x1d0 [btrfs]
        [195191.943840]  btrfs_recover_log_trees+0x1d7/0x4d0 [btrfs]
        [195191.943856]  ? replay_dir_deletes+0x280/0x280 [btrfs]
        [195191.943870]  open_ctree+0x1c3b/0x22a0 [btrfs]
        [195191.943887]  btrfs_mount_root+0x6b4/0x800 [btrfs]
        [195191.943894]  ? rcu_read_lock_sched_held+0x3f/0x70
        [195191.943899]  ? pcpu_alloc+0x55b/0x7c0
        [195191.943906]  ? mount_fs+0x3b/0x140
        [195191.943908]  mount_fs+0x3b/0x140
        [195191.943912]  ? __init_waitqueue_head+0x36/0x50
        [195191.943916]  vfs_kern_mount+0x62/0x160
        [195191.943927]  btrfs_mount+0x134/0x890 [btrfs]
        [195191.943936]  ? rcu_read_lock_sched_held+0x3f/0x70
        [195191.943938]  ? pcpu_alloc+0x55b/0x7c0
        [195191.943943]  ? mount_fs+0x3b/0x140
        [195191.943952]  ? btrfs_remount+0x570/0x570 [btrfs]
        [195191.943954]  mount_fs+0x3b/0x140
        [195191.943956]  ? __init_waitqueue_head+0x36/0x50
        [195191.943960]  vfs_kern_mount+0x62/0x160
        [195191.943963]  do_mount+0x1f9/0xd40
        [195191.943967]  ? memdup_user+0x4b/0x70
        [195191.943971]  ksys_mount+0x7e/0xd0
        [195191.943974]  __x64_sys_mount+0x21/0x30
        [195191.943977]  do_syscall_64+0x60/0x1b0
        [195191.943980]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [195191.943983] RIP: 0033:0x7f570e4e524a
        [195191.943986] RSP: 002b:00007ffd83589478 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
        [195191.943989] RAX: ffffffffffffffda RBX: 0000563f335b2060 RCX: 00007f570e4e524a
        [195191.943990] RDX: 0000563f335b2240 RSI: 0000563f335b2280 RDI: 0000563f335b2260
        [195191.943992] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
        [195191.943993] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000563f335b2260
        [195191.943994] R13: 0000563f335b2240 R14: 0000000000000000 R15: 00000000ffffffff
        [195191.944002] irq event stamp: 8688
        [195191.944010] hardirqs last  enabled at (8687): [<ffffffff9cb004c3>] console_unlock+0x503/0x640
        [195191.944012] hardirqs last disabled at (8688): [<ffffffff9ca037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
        [195191.944018] softirqs last  enabled at (8638): [<ffffffff9cc0a5d1>] __set_page_dirty_nobuffers+0x101/0x150
        [195191.944020] softirqs last disabled at (8634): [<ffffffff9cc26bbe>] wb_wakeup_delayed+0x2e/0x60
        [195191.944022] ---[ end trace 5d6e873a9a0b811a ]---
      
      This happens because the inode does not have the flag I_LINKABLE set,
      which is a runtime only flag, not meant to be persisted, set when the
      inode is created through open(2) if the flag O_EXCL is not passed to it.
      Except for the warning, there are no other consequences (like corruptions
      or metadata inconsistencies).
      
      Since it's pointless to replay a tmpfile as it would be deleted in a
      later phase of the log replay procedure (it has a link count of 0), fix
      this by not logging tmpfiles and if a tmpfile is found in a log (created
      by a kernel without this change), skip the replay of the inode.
      
      A test case for fstests follows soon.
      
      Fixes: 471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
      CC: stable@vger.kernel.org # 4.18+
      Reported-by: default avatarMartin Steigerwald <martin@lichtvoll.de>
      Link: https://lore.kernel.org/linux-btrfs/3666619.NTnn27ZJZE@merkaba/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f2d72f42
    • Josef Bacik's avatar
      btrfs: drop min_size from evict_refill_and_join · ad80cf50
      Josef Bacik authored
      We don't need it, rsv->size is set once and never changes throughout
      its lifetime, so just use that for the reserve size.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad80cf50
    • Josef Bacik's avatar
      btrfs: assert on non-empty delayed iputs · e187831e
      Josef Bacik authored
      I ran into an issue where there was some reference being held on an
      inode that I couldn't track.  This assert wasn't triggered, but it at
      least rules out we're doing something stupid.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e187831e
    • Josef Bacik's avatar
      btrfs: make sure we create all new block groups · 545e3366
      Josef Bacik authored
      Allocating new chunks modifies both the extent and chunk tree, which can
      trigger new chunk allocations.  So instead of doing list_for_each_safe,
      just do while (!list_empty()) so we make sure we don't exit with other
      pending bg's still on our list.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      545e3366
    • Josef Bacik's avatar
      btrfs: reset max_extent_size on clear in a bitmap · 553cceb4
      Josef Bacik authored
      We need to clear the max_extent_size when we clear bits from a bitmap
      since it could have been from the range that contains the
      max_extent_size.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      553cceb4
    • Josef Bacik's avatar
      btrfs: protect space cache inode alloc with GFP_NOFS · 84de76a2
      Josef Bacik authored
      If we're allocating a new space cache inode it's likely going to be
      under a transaction handle, so we need to use memalloc_nofs_save() in
      order to avoid deadlocks, and more importantly lockdep messages that
      make xfstests fail.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      84de76a2
    • Josef Bacik's avatar
      btrfs: release metadata before running delayed refs · f45c752b
      Josef Bacik authored
      We want to release the unused reservation we have since it refills the
      delayed refs reserve, which will make everything go smoother when
      running the delayed refs if we're short on our reservation.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f45c752b
    • Liu Bo's avatar
      Btrfs: kill btrfs_clear_path_blocking · 52398340
      Liu Bo authored
      Btrfs's btree locking has two modes, spinning mode and blocking mode,
      while searching btree, locking is always acquired in spinning mode and
      then converted to blocking mode if necessary, and in some hot paths we may
      switch the locking back to spinning mode by btrfs_clear_path_blocking().
      
      When acquiring locks, both of reader and writer need to wait for blocking
      readers and writers to complete before doing read_lock()/write_lock().
      
      The problem is that btrfs_clear_path_blocking() needs to switch nodes
      in the path to blocking mode at first (by btrfs_set_path_blocking) to
      make lockdep happy before doing its actual clearing blocking job.
      
      When switching to blocking mode from spinning mode, it consists of
      
      step 1) bumping up blocking readers counter and
      step 2) read_unlock()/write_unlock(),
      
      this has caused serious ping-pong effect if there're a great amount of
      concurrent readers/writers, as waiters will be woken up and go to
      sleep immediately.
      
      1) Killing this kind of ping-pong results in a big improvement in my 1600k
      files creation script,
      
      MNT=/mnt/btrfs
      mkfs.btrfs -f /dev/sdf
      mount /dev/def $MNT
      time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
              -d  $MNT/0  -d  $MNT/1 \
              -d  $MNT/2  -d  $MNT/3 \
              -d  $MNT/4  -d  $MNT/5 \
              -d  $MNT/6  -d  $MNT/7 \
              -d  $MNT/8  -d  $MNT/9 \
              -d  $MNT/10  -d  $MNT/11 \
              -d  $MNT/12  -d  $MNT/13 \
              -d  $MNT/14  -d  $MNT/15
      
      w/o patch:
      real    2m27.307s
      user    0m12.839s
      sys     13m42.831s
      
      w/ patch:
      real    1m2.273s
      user    0m15.802s
      sys     8m16.495s
      
      1.1) latency histogram from funclatency[1]
      
      Overall with the patch, there're ~50% less write lock acquisition and
      the 95% max latency that write lock takes also reduces to ~100ms from
      >500ms.
      
      --------------------------------------------
      w/o patch:
      --------------------------------------------
      Function = btrfs_tree_lock
           msecs               : count     distribution
               0 -> 1          : 2385222  |****************************************|
               2 -> 3          : 37147    |                                        |
               4 -> 7          : 20452    |                                        |
               8 -> 15         : 13131    |                                        |
              16 -> 31         : 3877     |                                        |
              32 -> 63         : 3900     |                                        |
              64 -> 127        : 2612     |                                        |
             128 -> 255        : 974      |                                        |
             256 -> 511        : 165      |                                        |
             512 -> 1023       : 13       |                                        |
      
      Function = btrfs_tree_read_lock
           msecs               : count     distribution
               0 -> 1          : 6743860  |****************************************|
               2 -> 3          : 2146     |                                        |
               4 -> 7          : 190      |                                        |
               8 -> 15         : 38       |                                        |
              16 -> 31         : 4        |                                        |
      
      --------------------------------------------
      w/ patch:
      --------------------------------------------
      Function = btrfs_tree_lock
           msecs               : count     distribution
               0 -> 1          : 1318454  |****************************************|
               2 -> 3          : 6800     |                                        |
               4 -> 7          : 3664     |                                        |
               8 -> 15         : 2145     |                                        |
              16 -> 31         : 809      |                                        |
              32 -> 63         : 219      |                                        |
              64 -> 127        : 10       |                                        |
      
      Function = btrfs_tree_read_lock
           msecs               : count     distribution
               0 -> 1          : 6854317  |****************************************|
               2 -> 3          : 2383     |                                        |
               4 -> 7          : 601      |                                        |
               8 -> 15         : 92       |                                        |
      
      2) dbench also proves the improvement,
      dbench -t 120 -D /mnt/btrfs 16
      
      w/o patch:
      Throughput 158.363 MB/sec
      
      w/ patch:
      Throughput 449.52 MB/sec
      
      3) xfstests didn't show any additional failures.
      
      One thing to note is that callers may set path->leave_spinning to have
      all nodes in the path stay in spinning mode, which means callers are
      ready to not sleep before releasing the path, but it won't cause
      problems if they don't want to sleep in blocking mode.
      
      [1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.pySigned-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      52398340
    • David Sterba's avatar
      btrfs: dev-replace: remove pointless assert in write unlock · 9b142115
      David Sterba authored
      The value of blocking_readers is increased only when the lock is taken
      for read, no way we can fail the condition with the write lock.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9b142115
    • David Sterba's avatar
      btrfs: dev-replace: move replace members out of fs_info · 7f8d236a
      David Sterba authored
      The replace_wait and bio_counter were mistakenly added to fs_info in
      commit c404e0dc ("Btrfs: fix use-after-free in the finishing
      procedure of the device replace"), but they logically belong to
      fs_info::dev_replace. Besides, bio_counter is a very generic name and is
      confusing in bare fs_info context.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f8d236a
    • David Sterba's avatar
      btrfs: dev-replace: avoid useless lock on error handling path · aa144bfe
      David Sterba authored
      The exit sequence in btrfs_dev_replace_start does not allow to simply
      add a label to the right place so the error handling after starting
      transaction failure jumps there. Currently there's a lock that pairs
      with the unlock in the section, which is unnecessary and only raises
      questions.  Add a variable to track the locking status and avoid the
      extra locking.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      aa144bfe
    • David Sterba's avatar
      btrfs: open code btrfs_after_dev_replace_commit · 9f6cbcbb
      David Sterba authored
      Too trivial, the purpose can be simply documented in a comment.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9f6cbcbb
    • David Sterba's avatar
      btrfs: open code btrfs_dev_replace_stats_inc · e37abe97
      David Sterba authored
      The wrapper is too trivial, open coding does not make it less readable.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e37abe97
    • David Sterba's avatar
      btrfs: open code btrfs_dev_replace_clear_lock_blocking · 7fb2eced
      David Sterba authored
      There's a single caller and the function name does not say it's actually
      taking the lock, so open coding makes it more explicit.
      
      For now, btrfs_dev_replace_read_lock is used instead of read_lock so
      it's paired with the unlocking wrapper in the same block.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7fb2eced
    • David Sterba's avatar
      btrfs: remove btrfs_dev_replace::read_locks · 3280f874
      David Sterba authored
      This member seems to be copied from the extent_buffer locking scheme and
      is at least used to assert that the read lock/unlock is properly nested.
      In some way. While the _inc/_dec are called inside the read lock
      section, the asserts are both inside and outside, so the ordering is not
      guaranteed and we can see read/inc/dec ordered in any way
      (theoretically).
      
      A missing call of btrfs_dev_replace_clear_lock_blocking could cause
      unexpected read_locks count, so this at least looks like a valid
      assertion, but this will become unnecessary with later updates.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3280f874
    • Qu Wenruo's avatar
      btrfs: tree-checker: Check level for leaves and nodes · f556faa4
      Qu Wenruo authored
      Although we have tree level check at tree read runtime, it's completely
      based on its parent level.
      We still need to do accurate level check to avoid invalid tree blocks
      sneak into kernel space.
      
      The check itself is simple, for leaf its level should always be 0.
      For nodes its level should be in range [1, BTRFS_MAX_LEVEL - 1].
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f556faa4
    • Qu Wenruo's avatar
      btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group · 3d0174f7
      Qu Wenruo authored
      For qgroup_trace_extent_swap(), if we find one leaf that needs to be
      traced, we will also iterate all file extents and trace them.
      
      This is OK if we're relocating data block groups, but if we're
      relocating metadata block groups, balance code itself has ensured that
      both subtree of file tree and reloc tree contain the same contents.
      
      That's to say, if we're relocating metadata block groups, all file
      extents in reloc and file tree should match, thus no need to trace them.
      This should reduce the total number of dirty extents processed in metadata
      block group balance.
      
      [[Benchmark]] (with all previous enhancement)
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
                           | v4.19-rc1    | w/ patchset    | diff (*)
      ---------------------------------------------------------------
      relocated extents    | 22929        | 22851          | -0.3%
      qgroup dirty extents | 227757       | 140886         | -38.1%
      time (sys)           | 65.253s      | 37.464s        | -42.6%
      time (real)          | 74.032s      | 44.722s        | -39.6%
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3d0174f7
    • Qu Wenruo's avatar
      btrfs: qgroup: Don't trace subtree if we're dropping reloc tree · 2cd86d30
      Qu Wenruo authored
      Reloc tree doesn't contribute to qgroup numbers, as we have accounted
      them at balance time (see replace_path()).
      
      Skipping the unneeded subtree tracing should reduce the overhead.
      
      [[Benchmark]]
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
                           | v4.19-rc1    | w/ patchset    | diff (*)
      ---------------------------------------------------------------
      relocated extents    | 22929        | 22900          | -0.1%
      qgroup dirty extents | 227757       | 167139         | -26.6%
      time (sys)           | 65.253s      | 50.123s        | -23.2%
      time (real)          | 74.032s      | 52.551s        | -29.0%
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2cd86d30
    • Qu Wenruo's avatar
      btrfs: qgroup: Use generation-aware subtree swap to mark dirty extents · 5f527822
      Qu Wenruo authored
      Before this patch, with quota enabled during balance, we need to mark
      the whole subtree dirty for quota.
      
      E.g.
      OO = Old tree blocks (from file tree)
      NN = New tree blocks (from reloc tree)
      
              File tree (src)		          Reloc tree (dst)
                  OO (a)                              NN (a)
                 /  \                                /  \
           (b) OO    OO (c)                    (b) NN    NN (c)
              /  \  /  \                          /  \  /  \
             OO  OO OO OO (d)                    OO  OO OO NN (d)
      
      For old balance + quota case, quota will mark the whole src and dst tree
      dirty, including all the 3 old tree blocks in reloc tree.
      
      It's doable for small file tree or new tree blocks are all located at
      lower level.
      
      But for large file tree or new tree blocks are all located at higher
      level, this will lead to mark the whole tree dirty, and be unbelievably
      slow.
      
      This patch will change how we handle such balance with quota enabled
      case.
      
      Now we will search from (b) and (c) for any new tree blocks whose
      generation is equal to @last_snapshot, and only mark them dirty.
      
      In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
      (NN(a) will be traced when COW happens for nodeptr modification).  And
      also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW
      happens for nodeptr modification.)
      
      For above case, we could skip 3 tree blocks, but for larger tree, we can
      skip tons of unmodified tree blocks, and hugely speed up balance.
      
      This patch will introduce a new function,
      btrfs_qgroup_trace_subtree_swap(), which will do the following main
      work:
      
      1) Read out real root eb
         And setup basic dst_path for later calls
      2) Call qgroup_trace_new_subtree_blocks()
         To trace all new tree blocks in reloc tree and their counter
         parts in the file tree.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5f527822
    • Qu Wenruo's avatar
      btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree · ea49f3e7
      Qu Wenruo authored
      Introduce new function, qgroup_trace_new_subtree_blocks(), to iterate
      all new tree blocks in a reloc tree.
      So that qgroup could skip unrelated tree blocks during balance, which
      should hugely speedup balance speed when quota is enabled.
      
      The function qgroup_trace_new_subtree_blocks() itself only cares about
      new tree blocks in reloc tree.
      
      All its main works are:
      
      1) Read out tree blocks according to parent pointers
      
      2) Do recursive depth-first search
         Will call the same function on all its children tree blocks, with
         search level set to current level -1.
         And will also skip all children whose generation is smaller than
         @last_snapshot.
      
      3) Call qgroup_trace_extent_swap() to trace tree blocks
      
      So although we have parameter list related to source file tree, it's not
      used at all, but only passed to qgroup_trace_extent_swap().
      Thus despite the tree read code, the core should be pretty short and all
      about recursive depth-first search.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ea49f3e7
    • Qu Wenruo's avatar
      btrfs: qgroup: Introduce function to trace two swaped extents · 25982561
      Qu Wenruo authored
      Introduce a new function, qgroup_trace_extent_swap(), which will be used
      later for balance qgroup speedup.
      
      The basis idea of balance is swapping tree blocks between reloc tree and
      the real file tree.
      
      The swap will happen in highest tree block, but there may be a lot of
      tree blocks involved.
      
      For example:
       OO = Old tree blocks
       NN = New tree blocks allocated during balance
      
                File tree (257)                  Reloc tree for 257
      L2              OO                                NN
                    /    \                            /    \
      L1          OO      OO (a)                    OO      NN (a)
                 / \     / \                       / \     / \
      L0       OO   OO OO   OO                   OO   OO NN   NN
                       (b)  (c)                          (b)  (c)
      
      When calling qgroup_trace_extent_swap(), we will pass:
      @src_eb = OO(a)
      @dst_path = [ nodes[1] = NN(a), nodes[0] = NN(c) ]
      @dst_level = 0
      @root_level = 1
      
      In that case, qgroup_trace_extent_swap() will search from OO(a) to
      reach OO(c), then mark both OO(c) and NN(c) as qgroup dirty.
      
      The main work of qgroup_trace_extent_swap() can be split into 3 parts:
      
      1) Tree search from @src_eb
         It should acts as a simplified btrfs_search_slot().
         The key for search can be extracted from @dst_path->nodes[dst_level]
         (first key).
      
      2) Mark the final tree blocks in @src_path and @dst_path qgroup dirty
         NOTE: In above case, OO(a) and NN(a) won't be marked qgroup dirty.
         They should be marked during preivous (@dst_level = 1) iteration.
      
      3) Mark file extents in leaves dirty
         We don't have good way to pick out new file extents only.
         So we still follow the old method by scanning all file extents in
         the leave.
      
      This function can free us from keeping two pathes, thus later we only need
      to care about how to iterate all new tree blocks in reloc tree.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      [ copy changelog to function comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      25982561
    • Qu Wenruo's avatar
      btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted · c337e7b0
      Qu Wenruo authored
      Number of qgroup dirty extents is directly linked to the performance
      overhead, so add a new trace event, trace_qgroup_num_dirty_extents(), to
      record how many dirty extents is processed in
      btrfs_qgroup_account_extents().
      
      This will be pretty handy to analyze later balance performance
      improvement.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c337e7b0
    • Qu Wenruo's avatar
      btrfs: relocation: Add basic extent backref related comments for build_backref_tree · fa6ac715
      Qu Wenruo authored
      fs/btrfs/relocation.c:build_backref_tree() is some code from 2009 era,
      although it works pretty fine, it's not that easy to understand.
      Especially combined with the complex btrfs backref format.
      
      This patch adds some basic comment for the backref build part of the
      code, making it less hard to read, at least for backref searching part.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa6ac715
    • Omar Sandoval's avatar
      Btrfs: get rid of btrfs_symlink_aops · 4779cc04
      Omar Sandoval authored
      The only aops we define for symlinks are identical to the aops for
      regular files. This has been the case since symlink support was added in
      commit 2b8d99a7 ("Btrfs: symlinks and hard links"). As far as I can
      tell, there wasn't a good reason to have separate aops then, and there
      isn't now, so let's just do what most other filesystems do and reuse the
      same structure.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4779cc04
    • Chris Mason's avatar
      Btrfs: don't clean dirty pages during buffered writes · 7703bdd8
      Chris Mason authored
      During buffered writes, we follow this basic series of steps:
      
      again:
      	lock all the pages
      	wait for writeback on all the pages
      	Take the extent range lock
      	wait for ordered extents on the whole range
      	clean all the pages
      
      	if (copy_from_user_in_atomic() hits a fault) {
      		drop our locks
      		goto again;
      	}
      
      	dirty all the pages
      	release all the locks
      
      The extra waiting, cleaning and locking are there to make sure we don't
      modify pages in flight to the drive, after they've been crc'd.
      
      If some of the pages in the range were already dirty when the write
      began, and we need to goto again, we create a window where a dirty page
      has been cleaned and unlocked.  It may be reclaimed before we're able to
      lock it again, which means we'll read the old contents off the drive and
      lose any modifications that had been pending writeback.
      
      We don't actually need to clean the pages.  All of the other locking in
      place makes sure we don't start IO on the pages, so we can just leave
      them dirty for the duration of the write.
      
      Fixes: 73d59314 (the original btrfs merge)
      CC: stable@vger.kernel.org # v4.4+
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7703bdd8
    • David Sterba's avatar
      btrfs: use common helper instead of open coding a bit test · 818255fe
      David Sterba authored
      The helper does the same math and we take care about the special case
      when flags is 0 too.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      818255fe
    • Nikolay Borisov's avatar
      btrfs: refactor __btrfs_run_delayed_refs loop · 0110a4c4
      Nikolay Borisov authored
      Refactor the delayed refs loop by using the newly introduced
      btrfs_run_delayed_refs_for_head function. This greatly simplifies
      __btrfs_run_delayed_refs and makes it more obvious what is happening.
      
      We now have 1 loop which iterates the existing delayed_heads and then
      each selected ref head is processed by the new helper. All existing
      semantics of the code are preserved so no functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0110a4c4
    • Nikolay Borisov's avatar
      btrfs: Factor out loop processing all refs of a head · e7261386
      Nikolay Borisov authored
      This patch introduces a new helper encompassing the implicit inner loop
      in __btrfs_run_delayed_refs which processes all the refs for a given
      head. The code is mostly copy/paste, the only difference is that if we
      detect a newer reference then -EAGAIN is returned so that callers can
      react correctly.
      
      Also, at the end of the loop the head is relocked and
      btrfs_merge_delayed_refs is run again to retain the pre-refactoring
      semantics.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e7261386
    • Nikolay Borisov's avatar
      btrfs: Factor out ref head locking code in __btrfs_run_delayed_refs · b1cdbcb5
      Nikolay Borisov authored
      This is in preparation to refactor the giant loop in
      __btrfs_run_delayed_refs. As a first step define a new function
      which implements acquiring a reference to a btrfs_delayed_refs_head and
      use it. No functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b1cdbcb5
    • David Sterba's avatar
      btrfs: tests: polish ifdefs around testing helper · b2fa1154
      David Sterba authored
      Avoid the inline ifdefs and use two sections for self-tests enabled and
      disabled.
      
      Though there could be no ifdef and unconditional test_bit of
      BTRFS_FS_STATE_DUMMY_FS_INFO, the static inline can help to optimize out
      any code that would depend on conditions using btrfs_is_testing.
      
      As this is only for the testing code, drop unlikely().
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b2fa1154
    • David Sterba's avatar
      a654666a
    • David Sterba's avatar
      btrfs: tests: move testing members of struct btrfs_root to the end · 57ec5fb4
      David Sterba authored
      The data used only for tests are better placed at the end of the
      structure so that they don't change the structure layout. All new
      members of btrfs_root should be placed before.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      57ec5fb4
    • David Sterba's avatar
      btrfs: tests: add separate stub for find_lock_delalloc_range · 9c36396c
      David Sterba authored
      The helper find_lock_delalloc_range is now conditionally built static,
      dpending on whether the self-tests are enabled or not. There's a macro
      that is supposed to hide the export, used only once. To discourage
      further use, drop it an add a public wrapper for the helper needed by
      tests.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c36396c
    • Liu Bo's avatar
      Btrfs: preftree: use rb_first_cached · ecf160b4
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      While resolving indirect refs and missing refs, it always looks for the
      first rb entry in a while loop, it's helpful to use rb_first_cached
      instead.
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ecf160b4
    • Liu Bo's avatar
      Btrfs: extent_map: use rb_first_cached · 07e1ce09
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the
      same job as rb_first() but in O(1).
      
      As evict_inode_truncate_pages() removes all extent mapping by always
      looking for the first rb entry, it's helpful to use rb_first_cached
      instead.
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      07e1ce09
    • Liu Bo's avatar
      Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root · 03a1d4c8
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the same job as
      rb_first() but in O(1).
      
      Functions manipulating delayed_item need to get the first entry, this converts
      it to use rb_first_cached().
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      03a1d4c8
    • Liu Bo's avatar
      Btrfs: delayed-refs: use rb_first_cached for ref_tree · e3d03965
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      Functions manipulating href->ref_tree need to get the first entry, this
      converts href->ref_tree to use rb_first_cached().
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e3d03965
    • Liu Bo's avatar
      Btrfs: delayed-refs: use rb_first_cached for href_root · 5c9d028b
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      Functions manipulating href_root need to get the first entry, this
      converts href_root to use rb_first_cached().
      
      This patch is first in the sequenct of similar updates to other rbtrees
      and this is analysis of the expected behaviour and improvements.
      
      There's a common pattern:
      
      while (node = rb_first) {
              entry = rb_entry(node)
              next = rb_next(node)
              rb_erase(node)
              cleanup(entry)
      }
      
      rb_first needs to traverse the tree up to logN depth, rb_erase can
      completely reshuffle the tree. With the caching we'll skip the traversal
      in rb_first.  That's a cached memory access vs looped pointer
      dereference trade-off that IMHO has a clear winner.
      
      Measurements show there's not much difference in a sample tree with
      10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of
      caching and pointer chasing are unpredictable though.
      
      Further optimzations can be done to avoid the expensive rb_erase step.
      In some cases it's ok to process the nodes in any order, so the tree can
      be traversed in post-order, not rebalancing the children nodes and just
      calling free. Care must be taken regarding the next node.
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog from mail discussions ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5c9d028b
    • Josef Bacik's avatar
      btrfs: wait on caching when putting the bg cache · 3aa7c7a3
      Josef Bacik authored
      While testing my backport I noticed there was a panic if I ran
      generic/416 generic/417 generic/418 all in a row.  This just happened to
      uncover a race where we had outstanding IO after we destroy all of our
      workqueues, and then we'd go to queue the endio work on those free'd
      workqueues.
      
      This is because we aren't waiting for the caching threads to be done
      before freeing everything up, so to fix this make sure we wait on any
      outstanding caching that's being done before we free up the block group,
      so we're sure to be done with all IO by the time we get to
      btrfs_stop_all_workers().  This fixes the panic I was seeing
      consistently in testing.
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/volumes.c:6112!
      SMP PTI
      Modules linked in:
      CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      Workqueue: btrfs-cache btrfs_cache_helper
      RIP: 0010:btrfs_map_bio+0x346/0x370
      RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202
      RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000
      RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000
      RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000
      R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00
      FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       btree_submit_bio_hook+0x8a/0xd0
       submit_one_bio+0x5d/0x80
       read_extent_buffer_pages+0x18a/0x320
       btree_read_extent_buffer_pages+0xbc/0x200
       ? alloc_extent_buffer+0x359/0x3e0
       read_tree_block+0x3d/0x60
       read_block_for_search.isra.30+0x1a5/0x360
       btrfs_search_slot+0x41b/0xa10
       btrfs_next_old_leaf+0x212/0x470
       caching_thread+0x323/0x490
       normal_work_helper+0xc5/0x310
       process_one_work+0x141/0x340
       worker_thread+0x44/0x3c0
       kthread+0xf8/0x130
       ? process_one_work+0x340/0x340
       ? kthread_bind+0x10/0x10
       ret_from_fork+0x35/0x40
      RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0
      ---[ end trace 827eb13e50846033 ]---
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: disabled
      ---[ end Kernel panic - not syncing: Fatal exception
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3aa7c7a3