1. 17 Oct, 2018 2 commits
    • Filipe Manana's avatar
      Btrfs: fix assertion on fsync of regular file when using no-holes feature · 7ed586d0
      Filipe Manana authored
      When using the NO_HOLES feature and logging a regular file, we were
      expecting that if we find an inline extent, that either its size in RAM
      (uncompressed and unenconded) matches the size of the file or if it does
      not, that it matches the sector size and it represents compressed data.
      This assertion does not cover a case where the length of the inline extent
      is smaller than the sector size and also smaller the file's size, such
      case is possible through fallocate. Example:
      
        $ mkfs.btrfs -f -O no-holes /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "pwrite -S 0xb60 0 21" /mnt/foobar
        $ xfs_io -c "falloc 40 40" /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
      
      In the above example we trigger the assertion because the inline extent's
      length is 21 bytes while the file size is 80 bytes. The fallocate() call
      merely updated the file's size and did not touch the existing inline
      extent, as expected.
      
      So fix this by adjusting the assertion so that an inline extent length
      smaller than the file size is valid if the file size is smaller than the
      filesystem's sector size.
      
      A test case for fstests follows soon.
      Reported-by: default avatarAnatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Fixes: a89ca6f2 ("Btrfs: fix fsync after truncate when no_holes feature is enabled")
      CC: stable@vger.kernel.org # 4.14+
      Link: https://lore.kernel.org/linux-btrfs/CAE5jQCfRSBC7n4pUTFJcmHh109=gwyT9mFkCOL+NKfzswmR=_Q@mail.gmail.com/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7ed586d0
    • Filipe Manana's avatar
      Btrfs: fix null pointer dereference on compressed write path error · 3527a018
      Filipe Manana authored
      At inode.c:compress_file_range(), under the "free_pages_out" label, we can
      end up dereferencing the "pages" pointer when it has a NULL value. This
      case happens when "start" has a value of 0 and we fail to allocate memory
      for the "pages" pointer. When that happens we jump to the "cont" label and
      then enter the "if (start == 0)" branch where we immediately call the
      cow_file_range_inline() function. If that function returns 0 (success
      creating an inline extent) or an error (like -ENOMEM for example) we jump
      to the "free_pages_out" label and then access "pages[i]" leading to a NULL
      pointer dereference, since "nr_pages" has a value greater than zero at
      that point.
      
      Fix this by setting "nr_pages" to 0 when we fail to allocate memory for
      the "pages" pointer.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119
      Fixes: 771ed689 ("Btrfs: Optimize compressed writeback and reads")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3527a018
  2. 15 Oct, 2018 38 commits
    • Lu Fengqi's avatar
      btrfs: switch return_bigger to bool in find_ref_head · d9352794
      Lu Fengqi authored
      Using bool is more suitable than int here, and add the comment about the
      return_bigger.
      Signed-off-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d9352794
    • Lu Fengqi's avatar
      btrfs: remove fs_info from btrfs_should_throttle_delayed_refs · 7c861627
      Lu Fengqi authored
      The avg_delayed_ref_runtime can be referenced from the transaction
      handle.
      Signed-off-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7c861627
    • Lu Fengqi's avatar
      btrfs: remove fs_info from btrfs_check_space_for_delayed_refs · af9b8a0e
      Lu Fengqi authored
      It can be referenced from the transaction handle.
      Signed-off-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      af9b8a0e
    • Lu Fengqi's avatar
      btrfs: delayed-ref: pass delayed_refs directly to btrfs_delayed_ref_lock · 9e920a6f
      Lu Fengqi authored
      Since trans is only used for referring to delayed_refs, there is no need
      to pass it instead of delayed_refs to btrfs_delayed_ref_lock().
      
      No functional change.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9e920a6f
    • Lu Fengqi's avatar
      btrfs: delayed-ref: pass delayed_refs directly to btrfs_select_ref_head · 5637c74b
      Lu Fengqi authored
      Since trans is only used for referring to delayed_refs, there is no need
      to pass it instead of delayed_refs to btrfs_select_ref_head().  No
      functional change.
      Signed-off-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5637c74b
    • Lu Fengqi's avatar
      btrfs: qgroup: move the qgroup->members check out from (!qgroup)'s else branch · b90e22ba
      Lu Fengqi authored
      There is no reason to put this check in (!qgroup)'s else branch because
      if qgroup is null, it will goto out directly. So move it out to reduce
      indentation level.  No functional change.
      Signed-off-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b90e22ba
    • Qu Wenruo's avatar
      btrfs: relocation: Remove redundant tree level check · 06bbf672
      Qu Wenruo authored
      Commit 581c1760 ("btrfs: Validate child tree block's level and first
      key") has made tree block level check mandatory.
      
      So if tree block level doesn't match, we won't get a valid extent
      buffer.  The extra WARN_ON() check can be removed completely.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      06bbf672
    • Qu Wenruo's avatar
      btrfs: relocation: Cleanup while loop using rbtree_postorder_for_each_entry_safe · 98ff7b94
      Qu Wenruo authored
      And add one line comment explaining what we're doing for each loop.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      98ff7b94
    • Qu Wenruo's avatar
      btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabled · 3628b4ca
      Qu Wenruo authored
      Some qgroup trace events like btrfs_qgroup_release_data() and
      btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is
      not enabled.
      
      This is caused by the lack of qgroup status check before calling some
      qgroup functions.  Thankfully the functions can handle quota disabled
      case well and just do nothing for qgroup disabled case.
      
      This patch will do earlier check before triggering related trace events.
      
      And for enabled <-> disabled race case:
      
      1) For enabled->disabled case
         Disable will wipe out all qgroups data including reservation and
         excl/rfer. Even if we leak some reservation or numbers, it will
         still be cleared, so nothing will go wrong.
      
      2) For disabled -> enabled case
         Current btrfs_qgroup_release_data() will use extent_io tree to ensure
         we won't underflow reservation. And for delayed_ref we use
         head->qgroup_reserved to record the reserved space, so in that case
         head->qgroup_reserved should be 0 and we won't underflow.
      
      CC: stable@vger.kernel.org # 4.14+
      Reported-by: default avatarChris Murphy <lists@colorremedies.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3628b4ca
    • Filipe Manana's avatar
      Btrfs: fix wrong dentries after fsync of file that got its parent replaced · 0f375eed
      Filipe Manana authored
      In a scenario like the following:
      
        mkdir /mnt/A               # inode 258
        mkdir /mnt/B               # inode 259
        touch /mnt/B/bar           # inode 260
      
        sync
      
        mv /mnt/B/bar /mnt/A/bar
        mv -T /mnt/A /mnt/B
        fsync /mnt/B/bar
      
        <power fail>
      
      After replaying the log we end up with file bar having 2 hard links, both
      with the name 'bar' and one in the directory with inode number 258 and the
      other in the directory with inode number 259. Also, we end up with the
      directory inode 259 still existing and with the directory inode 258 still
      named as 'A', instead of 'B'. In this scenario, file 'bar' should only
      have one hard link, located at directory inode 258, the directory inode
      259 should not exist anymore and the name for directory inode 258 should
      be 'B'.
      
      This incorrect behaviour happens because when attempting to log the old
      parents of an inode, we skip any parents that no longer exist. Fix this
      by forcing a full commit if an old parent no longer exists.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0f375eed
    • Filipe Manana's avatar
      Btrfs: fix warning when replaying log after fsync of a tmpfile · f2d72f42
      Filipe Manana authored
      When replaying a log which contains a tmpfile (which necessarily has a
      link count of 0) we end up calling inc_nlink(), at
      fs/btrfs/tree-log.c:replay_one_buffer(), which produces a warning like
      the following:
      
        [195191.943673] WARNING: CPU: 0 PID: 6924 at fs/inode.c:342 inc_nlink+0x33/0x40
        [195191.943723] CPU: 0 PID: 6924 Comm: mount Not tainted 4.19.0-rc6-btrfs-next-38 #1
        [195191.943724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [195191.943726] RIP: 0010:inc_nlink+0x33/0x40
        [195191.943728] RSP: 0018:ffffb96e425e3870 EFLAGS: 00010246
        [195191.943730] RAX: 0000000000000000 RBX: ffff8c0d1e6af4f0 RCX: 0000000000000006
        [195191.943731] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c0d1e6af4f0
        [195191.943731] RBP: 0000000000000097 R08: 0000000000000001 R09: 0000000000000000
        [195191.943732] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb96e425e3a60
        [195191.943733] R13: ffff8c0d10cff0c8 R14: ffff8c0d0d515348 R15: ffff8c0d78a1b3f8
        [195191.943735] FS:  00007f570ee24480(0000) GS:ffff8c0dfb200000(0000) knlGS:0000000000000000
        [195191.943736] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [195191.943737] CR2: 00005593286277c8 CR3: 00000000bb8f2006 CR4: 00000000003606f0
        [195191.943739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [195191.943740] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [195191.943741] Call Trace:
        [195191.943778]  replay_one_buffer+0x797/0x7d0 [btrfs]
        [195191.943802]  walk_up_log_tree+0x1c1/0x250 [btrfs]
        [195191.943809]  ? rcu_read_lock_sched_held+0x3f/0x70
        [195191.943825]  walk_log_tree+0xae/0x1d0 [btrfs]
        [195191.943840]  btrfs_recover_log_trees+0x1d7/0x4d0 [btrfs]
        [195191.943856]  ? replay_dir_deletes+0x280/0x280 [btrfs]
        [195191.943870]  open_ctree+0x1c3b/0x22a0 [btrfs]
        [195191.943887]  btrfs_mount_root+0x6b4/0x800 [btrfs]
        [195191.943894]  ? rcu_read_lock_sched_held+0x3f/0x70
        [195191.943899]  ? pcpu_alloc+0x55b/0x7c0
        [195191.943906]  ? mount_fs+0x3b/0x140
        [195191.943908]  mount_fs+0x3b/0x140
        [195191.943912]  ? __init_waitqueue_head+0x36/0x50
        [195191.943916]  vfs_kern_mount+0x62/0x160
        [195191.943927]  btrfs_mount+0x134/0x890 [btrfs]
        [195191.943936]  ? rcu_read_lock_sched_held+0x3f/0x70
        [195191.943938]  ? pcpu_alloc+0x55b/0x7c0
        [195191.943943]  ? mount_fs+0x3b/0x140
        [195191.943952]  ? btrfs_remount+0x570/0x570 [btrfs]
        [195191.943954]  mount_fs+0x3b/0x140
        [195191.943956]  ? __init_waitqueue_head+0x36/0x50
        [195191.943960]  vfs_kern_mount+0x62/0x160
        [195191.943963]  do_mount+0x1f9/0xd40
        [195191.943967]  ? memdup_user+0x4b/0x70
        [195191.943971]  ksys_mount+0x7e/0xd0
        [195191.943974]  __x64_sys_mount+0x21/0x30
        [195191.943977]  do_syscall_64+0x60/0x1b0
        [195191.943980]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [195191.943983] RIP: 0033:0x7f570e4e524a
        [195191.943986] RSP: 002b:00007ffd83589478 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
        [195191.943989] RAX: ffffffffffffffda RBX: 0000563f335b2060 RCX: 00007f570e4e524a
        [195191.943990] RDX: 0000563f335b2240 RSI: 0000563f335b2280 RDI: 0000563f335b2260
        [195191.943992] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
        [195191.943993] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000563f335b2260
        [195191.943994] R13: 0000563f335b2240 R14: 0000000000000000 R15: 00000000ffffffff
        [195191.944002] irq event stamp: 8688
        [195191.944010] hardirqs last  enabled at (8687): [<ffffffff9cb004c3>] console_unlock+0x503/0x640
        [195191.944012] hardirqs last disabled at (8688): [<ffffffff9ca037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
        [195191.944018] softirqs last  enabled at (8638): [<ffffffff9cc0a5d1>] __set_page_dirty_nobuffers+0x101/0x150
        [195191.944020] softirqs last disabled at (8634): [<ffffffff9cc26bbe>] wb_wakeup_delayed+0x2e/0x60
        [195191.944022] ---[ end trace 5d6e873a9a0b811a ]---
      
      This happens because the inode does not have the flag I_LINKABLE set,
      which is a runtime only flag, not meant to be persisted, set when the
      inode is created through open(2) if the flag O_EXCL is not passed to it.
      Except for the warning, there are no other consequences (like corruptions
      or metadata inconsistencies).
      
      Since it's pointless to replay a tmpfile as it would be deleted in a
      later phase of the log replay procedure (it has a link count of 0), fix
      this by not logging tmpfiles and if a tmpfile is found in a log (created
      by a kernel without this change), skip the replay of the inode.
      
      A test case for fstests follows soon.
      
      Fixes: 471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
      CC: stable@vger.kernel.org # 4.18+
      Reported-by: default avatarMartin Steigerwald <martin@lichtvoll.de>
      Link: https://lore.kernel.org/linux-btrfs/3666619.NTnn27ZJZE@merkaba/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f2d72f42
    • Josef Bacik's avatar
      btrfs: drop min_size from evict_refill_and_join · ad80cf50
      Josef Bacik authored
      We don't need it, rsv->size is set once and never changes throughout
      its lifetime, so just use that for the reserve size.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad80cf50
    • Josef Bacik's avatar
      btrfs: assert on non-empty delayed iputs · e187831e
      Josef Bacik authored
      I ran into an issue where there was some reference being held on an
      inode that I couldn't track.  This assert wasn't triggered, but it at
      least rules out we're doing something stupid.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e187831e
    • Josef Bacik's avatar
      btrfs: make sure we create all new block groups · 545e3366
      Josef Bacik authored
      Allocating new chunks modifies both the extent and chunk tree, which can
      trigger new chunk allocations.  So instead of doing list_for_each_safe,
      just do while (!list_empty()) so we make sure we don't exit with other
      pending bg's still on our list.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      545e3366
    • Josef Bacik's avatar
      btrfs: reset max_extent_size on clear in a bitmap · 553cceb4
      Josef Bacik authored
      We need to clear the max_extent_size when we clear bits from a bitmap
      since it could have been from the range that contains the
      max_extent_size.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      553cceb4
    • Josef Bacik's avatar
      btrfs: protect space cache inode alloc with GFP_NOFS · 84de76a2
      Josef Bacik authored
      If we're allocating a new space cache inode it's likely going to be
      under a transaction handle, so we need to use memalloc_nofs_save() in
      order to avoid deadlocks, and more importantly lockdep messages that
      make xfstests fail.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      84de76a2
    • Josef Bacik's avatar
      btrfs: release metadata before running delayed refs · f45c752b
      Josef Bacik authored
      We want to release the unused reservation we have since it refills the
      delayed refs reserve, which will make everything go smoother when
      running the delayed refs if we're short on our reservation.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f45c752b
    • Liu Bo's avatar
      Btrfs: kill btrfs_clear_path_blocking · 52398340
      Liu Bo authored
      Btrfs's btree locking has two modes, spinning mode and blocking mode,
      while searching btree, locking is always acquired in spinning mode and
      then converted to blocking mode if necessary, and in some hot paths we may
      switch the locking back to spinning mode by btrfs_clear_path_blocking().
      
      When acquiring locks, both of reader and writer need to wait for blocking
      readers and writers to complete before doing read_lock()/write_lock().
      
      The problem is that btrfs_clear_path_blocking() needs to switch nodes
      in the path to blocking mode at first (by btrfs_set_path_blocking) to
      make lockdep happy before doing its actual clearing blocking job.
      
      When switching to blocking mode from spinning mode, it consists of
      
      step 1) bumping up blocking readers counter and
      step 2) read_unlock()/write_unlock(),
      
      this has caused serious ping-pong effect if there're a great amount of
      concurrent readers/writers, as waiters will be woken up and go to
      sleep immediately.
      
      1) Killing this kind of ping-pong results in a big improvement in my 1600k
      files creation script,
      
      MNT=/mnt/btrfs
      mkfs.btrfs -f /dev/sdf
      mount /dev/def $MNT
      time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
              -d  $MNT/0  -d  $MNT/1 \
              -d  $MNT/2  -d  $MNT/3 \
              -d  $MNT/4  -d  $MNT/5 \
              -d  $MNT/6  -d  $MNT/7 \
              -d  $MNT/8  -d  $MNT/9 \
              -d  $MNT/10  -d  $MNT/11 \
              -d  $MNT/12  -d  $MNT/13 \
              -d  $MNT/14  -d  $MNT/15
      
      w/o patch:
      real    2m27.307s
      user    0m12.839s
      sys     13m42.831s
      
      w/ patch:
      real    1m2.273s
      user    0m15.802s
      sys     8m16.495s
      
      1.1) latency histogram from funclatency[1]
      
      Overall with the patch, there're ~50% less write lock acquisition and
      the 95% max latency that write lock takes also reduces to ~100ms from
      >500ms.
      
      --------------------------------------------
      w/o patch:
      --------------------------------------------
      Function = btrfs_tree_lock
           msecs               : count     distribution
               0 -> 1          : 2385222  |****************************************|
               2 -> 3          : 37147    |                                        |
               4 -> 7          : 20452    |                                        |
               8 -> 15         : 13131    |                                        |
              16 -> 31         : 3877     |                                        |
              32 -> 63         : 3900     |                                        |
              64 -> 127        : 2612     |                                        |
             128 -> 255        : 974      |                                        |
             256 -> 511        : 165      |                                        |
             512 -> 1023       : 13       |                                        |
      
      Function = btrfs_tree_read_lock
           msecs               : count     distribution
               0 -> 1          : 6743860  |****************************************|
               2 -> 3          : 2146     |                                        |
               4 -> 7          : 190      |                                        |
               8 -> 15         : 38       |                                        |
              16 -> 31         : 4        |                                        |
      
      --------------------------------------------
      w/ patch:
      --------------------------------------------
      Function = btrfs_tree_lock
           msecs               : count     distribution
               0 -> 1          : 1318454  |****************************************|
               2 -> 3          : 6800     |                                        |
               4 -> 7          : 3664     |                                        |
               8 -> 15         : 2145     |                                        |
              16 -> 31         : 809      |                                        |
              32 -> 63         : 219      |                                        |
              64 -> 127        : 10       |                                        |
      
      Function = btrfs_tree_read_lock
           msecs               : count     distribution
               0 -> 1          : 6854317  |****************************************|
               2 -> 3          : 2383     |                                        |
               4 -> 7          : 601      |                                        |
               8 -> 15         : 92       |                                        |
      
      2) dbench also proves the improvement,
      dbench -t 120 -D /mnt/btrfs 16
      
      w/o patch:
      Throughput 158.363 MB/sec
      
      w/ patch:
      Throughput 449.52 MB/sec
      
      3) xfstests didn't show any additional failures.
      
      One thing to note is that callers may set path->leave_spinning to have
      all nodes in the path stay in spinning mode, which means callers are
      ready to not sleep before releasing the path, but it won't cause
      problems if they don't want to sleep in blocking mode.
      
      [1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.pySigned-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      52398340
    • David Sterba's avatar
      btrfs: dev-replace: remove pointless assert in write unlock · 9b142115
      David Sterba authored
      The value of blocking_readers is increased only when the lock is taken
      for read, no way we can fail the condition with the write lock.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9b142115
    • David Sterba's avatar
      btrfs: dev-replace: move replace members out of fs_info · 7f8d236a
      David Sterba authored
      The replace_wait and bio_counter were mistakenly added to fs_info in
      commit c404e0dc ("Btrfs: fix use-after-free in the finishing
      procedure of the device replace"), but they logically belong to
      fs_info::dev_replace. Besides, bio_counter is a very generic name and is
      confusing in bare fs_info context.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f8d236a
    • David Sterba's avatar
      btrfs: dev-replace: avoid useless lock on error handling path · aa144bfe
      David Sterba authored
      The exit sequence in btrfs_dev_replace_start does not allow to simply
      add a label to the right place so the error handling after starting
      transaction failure jumps there. Currently there's a lock that pairs
      with the unlock in the section, which is unnecessary and only raises
      questions.  Add a variable to track the locking status and avoid the
      extra locking.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      aa144bfe
    • David Sterba's avatar
      btrfs: open code btrfs_after_dev_replace_commit · 9f6cbcbb
      David Sterba authored
      Too trivial, the purpose can be simply documented in a comment.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9f6cbcbb
    • David Sterba's avatar
      btrfs: open code btrfs_dev_replace_stats_inc · e37abe97
      David Sterba authored
      The wrapper is too trivial, open coding does not make it less readable.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e37abe97
    • David Sterba's avatar
      btrfs: open code btrfs_dev_replace_clear_lock_blocking · 7fb2eced
      David Sterba authored
      There's a single caller and the function name does not say it's actually
      taking the lock, so open coding makes it more explicit.
      
      For now, btrfs_dev_replace_read_lock is used instead of read_lock so
      it's paired with the unlocking wrapper in the same block.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7fb2eced
    • David Sterba's avatar
      btrfs: remove btrfs_dev_replace::read_locks · 3280f874
      David Sterba authored
      This member seems to be copied from the extent_buffer locking scheme and
      is at least used to assert that the read lock/unlock is properly nested.
      In some way. While the _inc/_dec are called inside the read lock
      section, the asserts are both inside and outside, so the ordering is not
      guaranteed and we can see read/inc/dec ordered in any way
      (theoretically).
      
      A missing call of btrfs_dev_replace_clear_lock_blocking could cause
      unexpected read_locks count, so this at least looks like a valid
      assertion, but this will become unnecessary with later updates.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3280f874
    • Qu Wenruo's avatar
      btrfs: tree-checker: Check level for leaves and nodes · f556faa4
      Qu Wenruo authored
      Although we have tree level check at tree read runtime, it's completely
      based on its parent level.
      We still need to do accurate level check to avoid invalid tree blocks
      sneak into kernel space.
      
      The check itself is simple, for leaf its level should always be 0.
      For nodes its level should be in range [1, BTRFS_MAX_LEVEL - 1].
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f556faa4
    • Qu Wenruo's avatar
      btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group · 3d0174f7
      Qu Wenruo authored
      For qgroup_trace_extent_swap(), if we find one leaf that needs to be
      traced, we will also iterate all file extents and trace them.
      
      This is OK if we're relocating data block groups, but if we're
      relocating metadata block groups, balance code itself has ensured that
      both subtree of file tree and reloc tree contain the same contents.
      
      That's to say, if we're relocating metadata block groups, all file
      extents in reloc and file tree should match, thus no need to trace them.
      This should reduce the total number of dirty extents processed in metadata
      block group balance.
      
      [[Benchmark]] (with all previous enhancement)
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
                           | v4.19-rc1    | w/ patchset    | diff (*)
      ---------------------------------------------------------------
      relocated extents    | 22929        | 22851          | -0.3%
      qgroup dirty extents | 227757       | 140886         | -38.1%
      time (sys)           | 65.253s      | 37.464s        | -42.6%
      time (real)          | 74.032s      | 44.722s        | -39.6%
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3d0174f7
    • Qu Wenruo's avatar
      btrfs: qgroup: Don't trace subtree if we're dropping reloc tree · 2cd86d30
      Qu Wenruo authored
      Reloc tree doesn't contribute to qgroup numbers, as we have accounted
      them at balance time (see replace_path()).
      
      Skipping the unneeded subtree tracing should reduce the overhead.
      
      [[Benchmark]]
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
                           | v4.19-rc1    | w/ patchset    | diff (*)
      ---------------------------------------------------------------
      relocated extents    | 22929        | 22900          | -0.1%
      qgroup dirty extents | 227757       | 167139         | -26.6%
      time (sys)           | 65.253s      | 50.123s        | -23.2%
      time (real)          | 74.032s      | 52.551s        | -29.0%
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2cd86d30
    • Qu Wenruo's avatar
      btrfs: qgroup: Use generation-aware subtree swap to mark dirty extents · 5f527822
      Qu Wenruo authored
      Before this patch, with quota enabled during balance, we need to mark
      the whole subtree dirty for quota.
      
      E.g.
      OO = Old tree blocks (from file tree)
      NN = New tree blocks (from reloc tree)
      
              File tree (src)		          Reloc tree (dst)
                  OO (a)                              NN (a)
                 /  \                                /  \
           (b) OO    OO (c)                    (b) NN    NN (c)
              /  \  /  \                          /  \  /  \
             OO  OO OO OO (d)                    OO  OO OO NN (d)
      
      For old balance + quota case, quota will mark the whole src and dst tree
      dirty, including all the 3 old tree blocks in reloc tree.
      
      It's doable for small file tree or new tree blocks are all located at
      lower level.
      
      But for large file tree or new tree blocks are all located at higher
      level, this will lead to mark the whole tree dirty, and be unbelievably
      slow.
      
      This patch will change how we handle such balance with quota enabled
      case.
      
      Now we will search from (b) and (c) for any new tree blocks whose
      generation is equal to @last_snapshot, and only mark them dirty.
      
      In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
      (NN(a) will be traced when COW happens for nodeptr modification).  And
      also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW
      happens for nodeptr modification.)
      
      For above case, we could skip 3 tree blocks, but for larger tree, we can
      skip tons of unmodified tree blocks, and hugely speed up balance.
      
      This patch will introduce a new function,
      btrfs_qgroup_trace_subtree_swap(), which will do the following main
      work:
      
      1) Read out real root eb
         And setup basic dst_path for later calls
      2) Call qgroup_trace_new_subtree_blocks()
         To trace all new tree blocks in reloc tree and their counter
         parts in the file tree.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5f527822
    • Qu Wenruo's avatar
      btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree · ea49f3e7
      Qu Wenruo authored
      Introduce new function, qgroup_trace_new_subtree_blocks(), to iterate
      all new tree blocks in a reloc tree.
      So that qgroup could skip unrelated tree blocks during balance, which
      should hugely speedup balance speed when quota is enabled.
      
      The function qgroup_trace_new_subtree_blocks() itself only cares about
      new tree blocks in reloc tree.
      
      All its main works are:
      
      1) Read out tree blocks according to parent pointers
      
      2) Do recursive depth-first search
         Will call the same function on all its children tree blocks, with
         search level set to current level -1.
         And will also skip all children whose generation is smaller than
         @last_snapshot.
      
      3) Call qgroup_trace_extent_swap() to trace tree blocks
      
      So although we have parameter list related to source file tree, it's not
      used at all, but only passed to qgroup_trace_extent_swap().
      Thus despite the tree read code, the core should be pretty short and all
      about recursive depth-first search.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ea49f3e7
    • Qu Wenruo's avatar
      btrfs: qgroup: Introduce function to trace two swaped extents · 25982561
      Qu Wenruo authored
      Introduce a new function, qgroup_trace_extent_swap(), which will be used
      later for balance qgroup speedup.
      
      The basis idea of balance is swapping tree blocks between reloc tree and
      the real file tree.
      
      The swap will happen in highest tree block, but there may be a lot of
      tree blocks involved.
      
      For example:
       OO = Old tree blocks
       NN = New tree blocks allocated during balance
      
                File tree (257)                  Reloc tree for 257
      L2              OO                                NN
                    /    \                            /    \
      L1          OO      OO (a)                    OO      NN (a)
                 / \     / \                       / \     / \
      L0       OO   OO OO   OO                   OO   OO NN   NN
                       (b)  (c)                          (b)  (c)
      
      When calling qgroup_trace_extent_swap(), we will pass:
      @src_eb = OO(a)
      @dst_path = [ nodes[1] = NN(a), nodes[0] = NN(c) ]
      @dst_level = 0
      @root_level = 1
      
      In that case, qgroup_trace_extent_swap() will search from OO(a) to
      reach OO(c), then mark both OO(c) and NN(c) as qgroup dirty.
      
      The main work of qgroup_trace_extent_swap() can be split into 3 parts:
      
      1) Tree search from @src_eb
         It should acts as a simplified btrfs_search_slot().
         The key for search can be extracted from @dst_path->nodes[dst_level]
         (first key).
      
      2) Mark the final tree blocks in @src_path and @dst_path qgroup dirty
         NOTE: In above case, OO(a) and NN(a) won't be marked qgroup dirty.
         They should be marked during preivous (@dst_level = 1) iteration.
      
      3) Mark file extents in leaves dirty
         We don't have good way to pick out new file extents only.
         So we still follow the old method by scanning all file extents in
         the leave.
      
      This function can free us from keeping two pathes, thus later we only need
      to care about how to iterate all new tree blocks in reloc tree.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      [ copy changelog to function comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      25982561
    • Qu Wenruo's avatar
      btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted · c337e7b0
      Qu Wenruo authored
      Number of qgroup dirty extents is directly linked to the performance
      overhead, so add a new trace event, trace_qgroup_num_dirty_extents(), to
      record how many dirty extents is processed in
      btrfs_qgroup_account_extents().
      
      This will be pretty handy to analyze later balance performance
      improvement.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c337e7b0
    • Qu Wenruo's avatar
      btrfs: relocation: Add basic extent backref related comments for build_backref_tree · fa6ac715
      Qu Wenruo authored
      fs/btrfs/relocation.c:build_backref_tree() is some code from 2009 era,
      although it works pretty fine, it's not that easy to understand.
      Especially combined with the complex btrfs backref format.
      
      This patch adds some basic comment for the backref build part of the
      code, making it less hard to read, at least for backref searching part.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa6ac715
    • Omar Sandoval's avatar
      Btrfs: get rid of btrfs_symlink_aops · 4779cc04
      Omar Sandoval authored
      The only aops we define for symlinks are identical to the aops for
      regular files. This has been the case since symlink support was added in
      commit 2b8d99a7 ("Btrfs: symlinks and hard links"). As far as I can
      tell, there wasn't a good reason to have separate aops then, and there
      isn't now, so let's just do what most other filesystems do and reuse the
      same structure.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4779cc04
    • Chris Mason's avatar
      Btrfs: don't clean dirty pages during buffered writes · 7703bdd8
      Chris Mason authored
      During buffered writes, we follow this basic series of steps:
      
      again:
      	lock all the pages
      	wait for writeback on all the pages
      	Take the extent range lock
      	wait for ordered extents on the whole range
      	clean all the pages
      
      	if (copy_from_user_in_atomic() hits a fault) {
      		drop our locks
      		goto again;
      	}
      
      	dirty all the pages
      	release all the locks
      
      The extra waiting, cleaning and locking are there to make sure we don't
      modify pages in flight to the drive, after they've been crc'd.
      
      If some of the pages in the range were already dirty when the write
      began, and we need to goto again, we create a window where a dirty page
      has been cleaned and unlocked.  It may be reclaimed before we're able to
      lock it again, which means we'll read the old contents off the drive and
      lose any modifications that had been pending writeback.
      
      We don't actually need to clean the pages.  All of the other locking in
      place makes sure we don't start IO on the pages, so we can just leave
      them dirty for the duration of the write.
      
      Fixes: 73d59314 (the original btrfs merge)
      CC: stable@vger.kernel.org # v4.4+
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7703bdd8
    • David Sterba's avatar
      btrfs: use common helper instead of open coding a bit test · 818255fe
      David Sterba authored
      The helper does the same math and we take care about the special case
      when flags is 0 too.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      818255fe
    • Nikolay Borisov's avatar
      btrfs: refactor __btrfs_run_delayed_refs loop · 0110a4c4
      Nikolay Borisov authored
      Refactor the delayed refs loop by using the newly introduced
      btrfs_run_delayed_refs_for_head function. This greatly simplifies
      __btrfs_run_delayed_refs and makes it more obvious what is happening.
      
      We now have 1 loop which iterates the existing delayed_heads and then
      each selected ref head is processed by the new helper. All existing
      semantics of the code are preserved so no functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0110a4c4
    • Nikolay Borisov's avatar
      btrfs: Factor out loop processing all refs of a head · e7261386
      Nikolay Borisov authored
      This patch introduces a new helper encompassing the implicit inner loop
      in __btrfs_run_delayed_refs which processes all the refs for a given
      head. The code is mostly copy/paste, the only difference is that if we
      detect a newer reference then -EAGAIN is returned so that callers can
      react correctly.
      
      Also, at the end of the loop the head is relocked and
      btrfs_merge_delayed_refs is run again to retain the pre-refactoring
      semantics.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e7261386