1. 13 Mar, 2020 1 commit
    • btrfs: fix log context list corruption after rename whiteout error · 236ebc20
      Filipe Manana authored
      During a rename whiteout, if btrfs_whiteout_for_rename() returns an error
      we can end up returning from btrfs_rename() with the log context object
      still in the root's log context list - this happens if 'sync_log' was
      set to true before we called btrfs_whiteout_for_rename() and it is
      dangerous because we end up with a corrupt linked list (root->log_ctxs)
      as the log context object was allocated on the stack.
      
      After btrfs_rename() returns, any task that is running btrfs_sync_log()
      concurrently can end up crashing because that linked list is traversed by
      btrfs_sync_log() (through btrfs_remove_all_log_ctxs()). That results in
      the same issue that commit e6c61710 ("Btrfs: fix log context list
      corruption after rename exchange operation") fixed.
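
The hazard described above can be sketched in plain C. This is a minimal illustration, not the btrfs code: the list helpers, `struct log_ctx_sketch` and `rename_sketch()` are hypothetical stand-ins. The point is only that an object on the stack has to be unlinked from a long-lived list on every exit path, including the error path.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node {
    struct list_node *prev, *next;
};

static void list_init(struct list_node *head)
{
    head->prev = head->next = head;
}

static int list_is_empty(const struct list_node *head)
{
    return head->next == head;
}

static void list_add_node(struct list_node *node, struct list_node *head)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

static void list_del_node(struct list_node *node)
{
    assert(node->prev != NULL && node->next != NULL);
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->prev = node->next = node;
}

/* Hypothetical stand-in for a log context; the real structure differs. */
struct log_ctx_sketch {
    int log_ret;
    struct list_node list;
};

/*
 * Hypothetical rename path. 'ctx' lives on this function's stack, so it
 * must be removed from the root's list before returning, even when the
 * simulated whiteout step fails ('whiteout_ret' != 0).
 */
static int rename_sketch(struct list_node *root_log_ctxs, int whiteout_ret)
{
    struct log_ctx_sketch ctx = { 0 };
    int ret;

    list_init(&ctx.list);
    list_add_node(&ctx.list, root_log_ctxs); /* the 'sync_log' case */

    ret = whiteout_ret;
    if (ret)
        goto out;

    /* ... the remaining rename work would happen here ... */
out:
    list_del_node(&ctx.list); /* never leave a stack object linked */
    return ret;
}
```

Returning early without the `list_del_node()` call would leave the root's list pointing into a dead stack frame, which is the corruption a concurrent list traversal would then trip over.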
      
      Fixes: d4682ba0 ("Btrfs: sync log after logging new name")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 03 Mar, 2020 1 commit
    • btrfs: fix RAID direct I/O reads with alternate csums · e7a04894
      Omar Sandoval authored
      btrfs_lookup_and_bind_dio_csum() does pointer arithmetic which assumes
      32-bit checksums. If using a larger checksum, this leads to spurious
      failures when a direct I/O read crosses a stripe. This is easy
      to reproduce:
      
        # mkfs.btrfs -f --checksum blake2 -d raid0 /dev/vdc /dev/vdd
        ...
        # mount /dev/vdc /mnt
        # cd /mnt
        # dd if=/dev/urandom of=foo bs=1M count=1 status=none
        # dd if=foo of=/dev/null bs=1M iflag=direct status=none
        dd: error reading 'foo': Input/output error
        # dmesg | tail -1
        [  135.821568] BTRFS warning (device vdc): csum failed root 5 ino 257 off 421888 ...
      
      Fix it by using the actual checksum size.
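
The difference between the two kinds of pointer arithmetic can be sketched as follows. Both helpers are hypothetical and only compute a byte offset into a buffer that holds one checksum per sector. The 4KiB sector size and 32-byte checksum used in the note after the code are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Byte offset of the checksum for the sector at 'file_offset', in a
 * buffer holding one checksum per sector starting at 'first_offset'.
 * Uses the actual checksum size.
 */
static size_t csum_byte_offset(uint64_t file_offset, uint64_t first_offset,
                               uint32_t sectorsize, uint32_t csum_size)
{
    assert(sectorsize != 0 && file_offset >= first_offset);
    uint64_t sector_index = (file_offset - first_offset) / sectorsize;
    return (size_t)(sector_index * csum_size);
}

/* The buggy variant: treats the buffer as an array of 32-bit values. */
static size_t csum_byte_offset_u32(uint64_t file_offset, uint64_t first_offset,
                                   uint32_t sectorsize)
{
    assert(sectorsize != 0 && file_offset >= first_offset);
    uint64_t sector_index = (file_offset - first_offset) / sectorsize;
    return (size_t)(sector_index * sizeof(uint32_t));
}
```

With 4-byte checksums the two helpers agree. With a 32-byte checksum the third sector's checksum starts at byte 64, while the 32-bit variant would point at byte 8, so any lookup past the first sector compares against the wrong bytes.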
      
      Fixes: 1e25a2e3 ("btrfs: don't assume ordered sums to be 4 bytes")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 21 Feb, 2020 1 commit
    • Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof · a5ae50de
      Filipe Manana authored
      While logging the prealloc extents of an inode during a fast fsync we call
      btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
      holding a read lock on a leaf of the inode's root (not the log root, the
      fs/subvol root), and then that function locks the file range in the inode's
      iotree. This can lead to a deadlock when:
      
      * the fsync is ranged
      
      * the file has prealloc extents beyond eof
      
      * writeback for a range different from the fsync range starts
        during the fsync
      
      * the size of the file is not sector size aligned
      
      Because when finishing an ordered extent we lock first a file range and
      then try to COW the fs/subvol tree to insert an extent item.
      
      The following diagram shows how the deadlock can happen.
      
                 CPU 1                                        CPU 2
      
        btrfs_sync_file()
          --> for range [0, 1MiB)
      
          --> inode has a size of
              1MiB and has 1 prealloc
              extent beyond the
              i_size, starting at offset
              4MiB
      
          flushes all delalloc for the
          range [0MiB, 1MiB) and waits
          for the respective ordered
          extents to complete
      
                                                    --> before task at CPU 1 locks the
                                                        inode, a write into file range
                                                        [1MiB, 2MiB + 1KiB) is made
      
                                                    --> i_size is updated to 2MiB + 1KiB
      
                                                    --> writeback is started for that
                                                        range, [1MiB, 2MiB + 4KiB)
                                                        --> end offset rounded up to
                                                            be sector size aligned
      
          btrfs_log_dentry_safe()
            btrfs_log_inode_parent()
              btrfs_log_inode()
      
                btrfs_log_changed_extents()
                  btrfs_log_prealloc_extents()
                    --> does a search on the
                        inode's root
                    --> holds a read lock on
                        leaf X
      
                                                    btrfs_finish_ordered_io()
                                                      --> locks range [1MiB, 2MiB + 4KiB)
                                                          --> end offset rounded up
                                                              to be sector size aligned
      
                                                      --> tries to cow leaf X, through
                                                          insert_reserved_file_extent()
                                                          --> already locked by the
                                                              task at CPU 1
      
                    btrfs_truncate_inode_items()
      
                      --> gets an i_size of
                          2MiB + 1KiB, which is
                          not sector size
                          aligned
      
                      --> tries to lock file
                          range [2MiB, (u64)-1)
                          --> the start range
                              is rounded down
                              from 2MiB + 1K
                              to 2MiB to be sector
                              size aligned
      
                          --> but the subrange
                              [2MiB, 2MiB + 4KiB) is
                              already locked by
                              task at CPU 2 which
                              is waiting to get a
                              write lock on leaf X
                              for which we are
                              holding a read lock
      
                                      *** deadlock ***
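
The range arithmetic in the diagram can be checked with a small sketch. The helpers are hypothetical, and the 4KiB sector size is an assumption taken from the offsets in the diagram. It shows why an unaligned i_size makes the two locked file ranges overlap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KIB 1024ULL
#define MIB (1024ULL * KIB)

/* Round 'x' down to a multiple of 'align' (which must be a power of two). */
static uint64_t round_down_u64(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return x & ~(align - 1);
}

/* Round 'x' up to a multiple of 'align' (which must be a power of two). */
static uint64_t round_up_u64(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* Do the half-open ranges [a_start, a_end) and [b_start, b_end) overlap? */
static bool ranges_overlap(uint64_t a_start, uint64_t a_end,
                           uint64_t b_start, uint64_t b_end)
{
    return a_start < b_end && b_start < a_end;
}
```

For an i_size of 2MiB + 1KiB, writeback locks up to `round_up_u64(i_size, 4 * KIB)`, which is 2MiB + 4KiB, while the truncate path locks from `round_down_u64(i_size, 4 * KIB)`, which is 2MiB, so the ranges share [2MiB, 2MiB + 4KiB). With a sector-aligned i_size of exactly 2MiB both roundings give 2MiB and the ranges do not overlap.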
      
      This results in a stack trace like the following, triggered by test case
      generic/561 from fstests:
      
        [ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
        [ 2779.979536]       Not tainted 5.6.0-rc2-btrfs-next-53 #1
        [ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 2779.990136] kworker/u8:6    D    0   247      2 0x80004000
        [ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
        [ 2779.990466] Call Trace:
        [ 2779.990491]  ? __schedule+0x384/0xa30
        [ 2779.990521]  schedule+0x33/0xe0
        [ 2779.990616]  btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
        [ 2779.990632]  ? remove_wait_queue+0x60/0x60
        [ 2779.990730]  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
        [ 2779.990782]  btrfs_search_slot+0x510/0x1000 [btrfs]
        [ 2779.990869]  btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
        [ 2779.990944]  __btrfs_drop_extents+0x161/0x1060 [btrfs]
        [ 2779.990987]  ? mark_held_locks+0x6d/0xc0
        [ 2779.990994]  ? __slab_alloc.isra.49+0x99/0x100
        [ 2779.991060]  ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
        [ 2779.991145]  insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
        [ 2779.991222]  ? start_transaction+0xdd/0x5c0 [btrfs]
        [ 2779.991291]  btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
        [ 2779.991405]  btrfs_work_helper+0xaa/0x720 [btrfs]
        [ 2779.991432]  process_one_work+0x26d/0x6a0
        [ 2779.991460]  worker_thread+0x4f/0x3e0
        [ 2779.991481]  ? process_one_work+0x6a0/0x6a0
        [ 2779.991489]  kthread+0x103/0x140
        [ 2779.991499]  ? kthread_create_worker_on_cpu+0x70/0x70
        [ 2779.991515]  ret_from_fork+0x3a/0x50
        (...)
        [ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
        [ 2780.027480]       Not tainted 5.6.0-rc2-btrfs-next-53 #1
        [ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 2780.030035] fsstress        D    0 17375  17373 0x00004000
        [ 2780.030038] Call Trace:
        [ 2780.030044]  ? __schedule+0x384/0xa30
        [ 2780.030052]  schedule+0x33/0xe0
        [ 2780.030075]  lock_extent_bits+0x20c/0x320 [btrfs]
        [ 2780.030094]  ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
        [ 2780.030098]  ? rcu_read_lock_sched_held+0x59/0xa0
        [ 2780.030102]  ? remove_wait_queue+0x60/0x60
        [ 2780.030122]  btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
        [ 2780.030151]  ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
        [ 2780.030165]  ? btrfs_search_slot+0x379/0x1000 [btrfs]
        [ 2780.030195]  btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
        [ 2780.030202]  ? do_raw_spin_unlock+0x49/0xc0
        [ 2780.030215]  ? btrfs_get_num_csums+0x10/0x10 [btrfs]
        [ 2780.030239]  btrfs_log_inode+0xf83/0x1124 [btrfs]
        [ 2780.030251]  ? __mutex_unlock_slowpath+0x45/0x2a0
        [ 2780.030275]  btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
        [ 2780.030282]  ? dget_parent+0xa1/0x370
        [ 2780.030309]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        [ 2780.030329]  btrfs_sync_file+0x3f3/0x490 [btrfs]
        [ 2780.030339]  do_fsync+0x38/0x60
        [ 2780.030343]  __x64_sys_fdatasync+0x13/0x20
        [ 2780.030345]  do_syscall_64+0x5c/0x280
        [ 2780.030348]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
        [ 2780.030361] Code: Bad RIP value.
        [ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
        [ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
        [ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
        [ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
        [ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
        [ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90
      
      So fix this by making btrfs_truncate_inode_items() not lock the range in
      the inode's iotree when the target root is a log root. Locking the range
      is unnecessary for log roots because the inode's lock and log_mutex
      already provide all the protection that's needed.
      
      Fixes: 28553fa9 ("Btrfs: fix race between shrinking truncate and fiemap")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 18 Feb, 2020 6 commits
  5. 17 Feb, 2020 1 commit
    • btrfs: don't set path->leave_spinning for truncate · 52e29e33
      Josef Bacik authored
      The only time we actually leave the path spinning is if we're truncating
      a small amount and don't actually free an extent, which is not a common
      occurrence.  We have to set the path blocking in order to add the
      delayed ref anyway, so the first extent we find we set the path to
      blocking and stay blocking for the duration of the operation.  With the
      upcoming file extent map stuff there will be another case that we have
      to have the path blocking, so just swap to blocking always.
      
      Note: this patch also fixes a warning after 28553fa9 ("Btrfs: fix
      race between shrinking truncate and fiemap") got merged that inserts
      extent locks around truncation so the path must not leave spinning locks
      after btrfs_search_slot.
      
        [70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565
        [70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync
        [70.794863] 5 locks held by rsync/1141:
        [70.794876]  #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50
        [70.795030]  #1: ffff888428de28e8 (&type->i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100
        [70.795051]  #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560
        [70.795124]  #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
        [70.795203]  #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
        [70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2
        [70.795362] Call Trace:
        [70.795374]  dump_stack+0x71/0xa0
        [70.795445]  ___might_sleep.part.96.cold.106+0xa6/0xb6
        [70.795459]  kmem_cache_alloc+0x1d3/0x290
        [70.795471]  alloc_extent_state+0x22/0x1c0
        [70.795544]  __clear_extent_bit+0x3ba/0x580
        [70.795557]  ? _raw_spin_unlock_irq+0x24/0x30
        [70.795569]  btrfs_truncate_inode_items+0x339/0xe50
        [70.795647]  btrfs_evict_inode+0x269/0x540
        [70.795659]  ? dput.part.38+0x29/0x460
        [70.795671]  evict+0xcd/0x190
        [70.795682]  __dentry_kill+0xd6/0x180
        [70.795754]  dput.part.38+0x2ad/0x460
        [70.795765]  do_renameat2+0x3cb/0x540
        [70.795777]  __x64_sys_rename+0x1c/0x20

      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Fixes: 28553fa9 ("Btrfs: fix race between shrinking truncate and fiemap")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 12 Feb, 2020 7 commits
    • btrfs: sysfs, move device id directories to UUID/devinfo · 1b9867eb
      Anand Jain authored
      Originally it was planned to create device id directories under
      UUID/devinfo, but they ended up under UUID/devices by mistake. We really
      want them under devinfo so the bare device node names are not mixed with
      device ids and are easy to enumerate.
      
      Fixes: 668e48af ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs, add UUID/devinfo kobject · a013d141
      Anand Jain authored
      Create directory /sys/fs/btrfs/UUID/devinfo to hold per-device
      directories named by device id (unlike /devices, which uses the device
      node names).

      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix race between shrinking truncate and fiemap · 28553fa9
      Filipe Manana authored
      When there is a fiemap executing in parallel with a shrinking truncate
      we can end up in a situation where we have extent maps for which we no
      longer have corresponding file extent items. This is generally harmless
      and at the moment the only consequences are missing file extent items
      representing holes after we expand the file size again after the
      truncate operation removed the prealloc extent items, and stale
      information for future fiemap calls (reporting extents that no longer
      exist or may have been reallocated to other files for example).
      
      Consider the following example:
      
      1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
         and a 1MiB prealloc extent at file offset 128KiB;
      
      2) Task A starts doing a shrinking truncate of our inode to reduce it to
         a size of 64KiB. Before it searches the subvolume tree for file
         extent items to delete, it drops all the extent maps in the range
         from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();
      
      3) Task B starts doing a fiemap against our inode. When looking up
         the inode's extent maps in the range from 128KiB to (u64)-1, it
         doesn't find any in the inode's extent map tree, since they were
         removed by task A.  Because it didn't find any in the extent map
         tree, it scans the inode's subvolume tree for file extent items, and
         it finds the 1MiB prealloc extent at file offset 128KiB, then it
         creates an extent map based on that file extent item and adds it to
         inode's extent map tree (this ends up being done by
         btrfs_get_extent() <- btrfs_get_extent_fiemap() <-
         get_extent_skip_holes());
      
      4) Task A then drops the prealloc extent at file offset 128KiB and
         shrinks the 128KiB extent at file offset 0 to a length of 64KiB. The
         truncation operation finishes and we end up with an extent map
         representing a 1MiB prealloc extent at file offset 128KiB, even
         though that extent no longer exists;
      
      After this the two types of problems we have are:
      
      1) Future calls to fiemap always report that a 1MiB prealloc extent
         exists at file offset 128KiB. This is stale information, no longer
         correct;
      
      2) If the size of the file is increased, by a truncate operation that
         increases the file size or by a write into a file offset > 64KiB for
         example, we end up not inserting file extent items to represent holes
         for any range between 128KiB and 128KiB + 1MiB, since the hole
         expansion function, btrfs_cont_expand() will skip hole insertion for
         any range for which an extent map exists that represents a prealloc
         extent. This causes fsck to complain about missing file extent items
         when not using the NO_HOLES feature.
      
      The second issue could be often triggered by test case generic/561 from
      fstests, which runs fsstress and duperemove in parallel, and duperemove
      does frequent fiemap calls.
      
      Essentially the problem happens because fiemap does not acquire the
      inode's lock while truncate does, and fiemap locks the file range in the
      inode's iotree while truncate does not. So fix the issue by making
      btrfs_truncate_inode_items() lock the file range from the new file size
      to (u64)-1, so that it serializes with fiemap.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: log message when rw remount is attempted with unclean tree-log · 10a3a3ed
      David Sterba authored
      A remount to a read-write filesystem is not safe when there's a tree-log
      to be replayed. Files that could be opened until now might be affected
      by the changes in the tree-log.

      A regular mount is needed to replay the log so the filesystem presents
      a consistent view with the pending changes included.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: print message when tree-log replay starts · e8294f2f
      David Sterba authored
      There's no logged information about tree-log replay although this is
      something that points to a previous unclean unmount. Other filesystems
      report that as well.

      Suggested-by: Chris Murphy <lists@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix race between using extent maps and merging them · ac05ca91
      Filipe Manana authored
      We have a few cases where we allow an extent map that is in an extent map
      tree to be merged with other extents in the tree. Such cases include the
      unpinning of an extent after the respective ordered extent completed or
      after logging an extent during a fast fsync. This can lead to subtle and
      dangerous problems because when doing the merge some other task might be
      using the same extent map and as a consequence see an inconsistent state
      of the extent map - for example it sees the new length but has seen the
      old start offset.
      
      With luck this triggers a BUG_ON(), and not some silent bug, such as the
      following one in __do_readpage():
      
        $ cat -n fs/btrfs/extent_io.c
        3061  static int __do_readpage(struct extent_io_tree *tree,
        3062                           struct page *page,
        (...)
        3127                  em = __get_extent_map(inode, page, pg_offset, cur,
        3128                                        end - cur + 1, get_extent, em_cached);
        3129                  if (IS_ERR_OR_NULL(em)) {
        3130                          SetPageError(page);
        3131                          unlock_extent(tree, cur, end);
        3132                          break;
        3133                  }
        3134                  extent_offset = cur - em->start;
        3135                  BUG_ON(extent_map_end(em) <= cur);
        (...)
      
      Consider the following example scenario, where we end up hitting the
      BUG_ON() in __do_readpage().
      
      We have an inode with a size of 8KiB and 2 extent maps:
      
        extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
                  a previous transaction
      
        extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
                  persisted but writeback started for it already. The extent map
                  is pinned since there's writeback and an ordered extent in
                  progress, so it can not be merged with extent map A yet
      
      The following sequence of steps leads to the BUG_ON():
      
      1) The ordered extent for extent B completes, the respective page gets its
         writeback bit cleared and the extent map is unpinned, at that point it
         is not yet merged with extent map A because it's in the list of modified
         extents;
      
      2) Due to memory pressure, or some other reason, the MM subsystem releases
         the page corresponding to extent B - btrfs_releasepage() is called and
         returns 1, meaning the page can be released as it's not dirty, not under
         writeback anymore and the extent range is not locked in the inode's
         iotree. However the extent map is not released, either because we are
         not in a context that allows memory allocations to block or because the
         inode's size is smaller than 16MiB - in this case our inode has a size
         of 8KiB;
      
      3) Task B needs to read extent B and ends up in __do_readpage() through
         the btrfs_readpage() callback. At __do_readpage() it gets a reference
         to extent map B;
      
      4) Task A, doing a fast fsync, calls clear_em_logging() against extent map B
         while holding the write lock on the inode's extent map tree - this
         results in try_merge_map() being called and since it's possible to merge
         extent map B with extent map A now (the extent map B was removed from
         the list of modified extents), the merging begins - it sets extent map
         B's start offset to 0 (was 4KiB), but before it increments the map's
         length to 8KiB (4KiB + 4KiB), task B is at:
      
         BUG_ON(extent_map_end(em) <= cur);
      
         The call to extent_map_end() sees the extent map has a start of 0
         and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
         the BUG_ON() is triggered.
      
      So it's dangerous to modify an extent map that is in the tree, because some
      other task might have got a reference to it before and still be using it,
      and it needs to see a consistent map while doing so. Generally this is very rare
      since most paths that lookup and use extent maps also have the file range
      locked in the inode's iotree. The fsync path is pretty much the only
      exception where we don't do it to avoid serialization with concurrent
      reads.
      
      Fix this by not allowing an extent map to be merged if it's being used
      by tasks other than the one attempting to merge the extent map (when the
      reference count of the extent map is greater than 2).

      Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>
      Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: ref-verify: fix memory leaks · f311ade3
      Wenwen Wang authored
      In btrfs_ref_tree_mod(), 'ref' and 'ra' are allocated through kzalloc() and
      kmalloc(), respectively. In the following code, if an error occurs, the
      execution will be redirected to 'out' or 'out_unlock' and the function will
      be exited. However, on some of the paths, 'ref' and 'ra' are not
      deallocated, leading to memory leaks. For example, if 'action' is
      BTRFS_ADD_DELAYED_EXTENT, add_block_entry() will be invoked. If the return
      value indicates an error, the execution will be redirected to 'out'. But,
      'ref' is not deallocated on this path, causing a memory leak.
      
      To fix the above issues, deallocate both 'ref' and 'ra' before exiting from
      the function when an error is encountered.
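
The cleanup pattern can be sketched in userspace C. Everything here is hypothetical: `ref_mod_sketch()` stands in for a function that makes two allocations, `step_ret` simulates the result of a later step, and the counting wrappers exist only so the leak, or its absence, can be observed.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int live_allocs; /* outstanding allocations, for the demo only */

static void *counted_alloc(size_t size)
{
    void *p = calloc(1, size);

    if (p)
        live_allocs++;
    return p;
}

static void counted_free(void *p)
{
    if (p) {
        live_allocs--;
        free(p);
    }
}

/*
 * On success, ownership of both objects passes to the caller. On any
 * error, both are released on the shared 'out' path before returning.
 */
static int ref_mod_sketch(int step_ret, void **ref_out, void **ra_out)
{
    void *ref, *ra;
    int ret = -ENOMEM;

    assert(ref_out != NULL && ra_out != NULL);
    ref = counted_alloc(64);
    ra = counted_alloc(64);
    if (!ref || !ra)
        goto out;

    ret = step_ret; /* simulated result of a later step */
    if (ret)
        goto out;

    *ref_out = ref;
    *ra_out = ra;
    return 0;
out:
    counted_free(ref); /* freeing NULL is a no-op here */
    counted_free(ra);
    return ret;
}
```

Routing every failure through one label that frees both pointers keeps the error paths from drifting apart as more steps are added.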
      
      CC: stable@vger.kernel.org # 4.15+
      Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 02 Feb, 2020 1 commit
    • btrfs: do not zero f_bavail if we have available space · d55966c4
      Josef Bacik authored
      There was some logic added a while ago to clear out f_bavail in statfs()
      if we did not have enough free metadata space to satisfy our global
      reserve.  This was incorrect at the time, however didn't really pose a
      problem for normal file systems because we would often allocate chunks
      if we got this low on free metadata space, and thus wouldn't really hit
      this case unless we were actually full.
      
      Fast forward to today and now we are much better about not allocating
      metadata chunks all of the time.  Couple this with d792b0f1 ("btrfs:
      always reserve our entire size for the global reserve") which now means
      we'll easily have a larger global reserve than our free space, we are
      now more likely to trip over this while still having plenty of space.
      
      Fix this by skipping this logic if the global rsv's space_info is not
      full.  space_info->full is 0 unless we've attempted to allocate a chunk
      for that space_info and that has failed.  If this happens then the space
      for the global reserve is definitely sacred and we need to report
      b_avail == 0, but before then we can just use our calculated b_avail.

      Reported-by: Martin Steigerwald <martin@lichtvoll.de>
      Fixes: ca8a51b3 ("btrfs: statfs: report zero available if metadata are exhausted")
      CC: stable@vger.kernel.org # 4.5+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Tested-By: Martin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 31 Jan, 2020 9 commits
    • Btrfs: send, fix emission of invalid clone operations within the same file · 9722b101
      Filipe Manana authored
      When doing an incremental send and a file has extents shared with itself
      at different file offsets, it's possible for send to emit clone operations
      that will fail at the destination because the source range goes beyond the
      file's current size. This happens when the file size has increased in the
      send snapshot, there is a hole between the shared extents and both shared
      extents are at file offsets which are greater than the file's size in the
      parent snapshot.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt/sdb
      
        $ xfs_io -f -c "pwrite -S 0xf1 0 64K" /mnt/sdb/foobar
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
        $ btrfs send -f /tmp/1.snap /mnt/sdb/base
      
        # Create a 320K extent at file offset 512K.
        $ xfs_io -c "pwrite -S 0xab 512K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0xcd 576K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0xef 640K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0x64 704K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0x73 768K 64K" /mnt/sdb/foobar
      
        # Clone part of that 320K extent into a lower file offset (192K).
        # This file offset is greater than the file's size in the parent
        # snapshot (64K). Also the clone range is a bit behind the offset of
        # the 320K extent so that we leave a hole between the shared extents.
        $ xfs_io -c "reflink /mnt/sdb/foobar 448K 192K 192K" /mnt/sdb/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
        $ btrfs send -p /mnt/sdb/base -f /tmp/2.snap /mnt/sdb/incr
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ btrfs receive -f /tmp/1.snap /mnt/sdc
        $ btrfs receive -f /tmp/2.snap /mnt/sdc
        ERROR: failed to clone extents to foobar: Invalid argument
      
      The problem is that after processing the extent at file offset 256K, which
      refers to the first 128K of the 320K extent created by the buffered write
      operations, we have 'cur_inode_next_write_offset' set to 384K, which
      corresponds to the end offset of the partially shared extent (256K + 128K)
      and to the current file size in the receiver. Then when we process the
      extent at offset 512K, we do extent backreference iteration to figure out
      if we can clone the extent from some other inode or from the same inode,
      and we consider the extent at offset 256K of the same inode as a valid
      source for a clone operation, which is not correct because at that point
      the current file size in the receiver is 384K, which corresponds to the
      end of last processed extent (at file offset 256K), so using a clone
      source range from 256K to 256K + 320K is invalid because that goes past
      the current size of the file (384K) - this makes the receiver get an
      -EINVAL error when attempting the clone operation.
      
      So fix this by excluding clone sources that have a range that goes beyond
      the current file size in the receiver when iterating extent backreferences.
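
      As an illustration only (a simplified model, not the send code itself),
      the range check described above can be sketched in Python. The helper
      name clone_source_fits is hypothetical; the offsets are the ones from
      the example.

```python
K = 1024


def clone_source_fits(src_offset: int, length: int, cur_file_size: int) -> bool:
    """True if [src_offset, src_offset + length) ends within the current file size."""
    return src_offset + length <= cur_file_size


# After the extent at file offset 256K (128K long) has been processed,
# the file in the receiver is 384K long.
cur_file_size = 256 * K + 128 * K

# Cloning 320K starting at offset 256K of the same file would read past 384K,
# so this source has to be excluded.
rejected = clone_source_fits(256 * K, 320 * K, cur_file_size)

# A source range that lies fully inside the file, e.g. the first 64K, is fine.
accepted = clone_source_fits(0, 64 * K, cur_file_size)

print(rejected, accepted)
```

      With the values from the example, the 256K source is rejected while a
      range that ends at or before 384K is still usable.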
      
      A test case for fstests follows soon.
      
      Fixes: 11f2069c ("Btrfs: send, allow clone operations within the same file")
      CC: stable@vger.kernel.org # 5.5+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9722b101
    • btrfs: do not do delalloc reservation under page lock · f4b1363c
      Josef Bacik authored
      We ran into a deadlock in production with the fixup worker.  The stack
      traces were as follows:
      
      Thread responsible for the writeout, waiting on the page lock
      
        [<0>] io_schedule+0x12/0x40
        [<0>] __lock_page+0x109/0x1e0
        [<0>] extent_write_cache_pages+0x206/0x360
        [<0>] extent_writepages+0x40/0x60
        [<0>] do_writepages+0x31/0xb0
        [<0>] __writeback_single_inode+0x3d/0x350
        [<0>] writeback_sb_inodes+0x19d/0x3c0
        [<0>] __writeback_inodes_wb+0x5d/0xb0
        [<0>] wb_writeback+0x231/0x2c0
        [<0>] wb_workfn+0x308/0x3c0
        [<0>] process_one_work+0x1e0/0x390
        [<0>] worker_thread+0x2b/0x3c0
        [<0>] kthread+0x113/0x130
        [<0>] ret_from_fork+0x35/0x40
        [<0>] 0xffffffffffffffff
      
      Thread of the fixup worker, which is holding the page lock
      
        [<0>] start_delalloc_inodes+0x241/0x2d0
        [<0>] btrfs_start_delalloc_roots+0x179/0x230
        [<0>] btrfs_alloc_data_chunk_ondemand+0x11b/0x2e0
        [<0>] btrfs_check_data_free_space+0x53/0xa0
        [<0>] btrfs_delalloc_reserve_space+0x20/0x70
        [<0>] btrfs_writepage_fixup_worker+0x1fc/0x2a0
        [<0>] normal_work_helper+0x11c/0x360
        [<0>] process_one_work+0x1e0/0x390
        [<0>] worker_thread+0x2b/0x3c0
        [<0>] kthread+0x113/0x130
        [<0>] ret_from_fork+0x35/0x40
        [<0>] 0xffffffffffffffff
      
      Thankfully the stars have to align just right to hit this.  First you
      have to end up in the fixup worker, which is tricky by itself (my
      reproducer does DIO reads into a MMAP'ed region, so not a common
      operation).  Then you have to have less than a page size of free data
      space and 0 unallocated space so you go down the "commit the transaction
      to free up pinned space" path.  This was accomplished by a random
      balance that was running on the host.  Then you get this deadlock.
      
      I'm still in the process of trying to force the deadlock to happen on
      demand, but I've hit other issues.  I can still trigger the fixup worker
      path itself so this patch has been tested in that regard, so the normal
      case is fine.
      
      Fixes: 87826df0 ("btrfs: delalloc for page dirtied out-of-band in fixup worker")
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f4b1363c
    • btrfs: drop the -EBUSY case in __extent_writepage_io · 5ab58055
      Josef Bacik authored
      Now that we only return 0 or -EAGAIN from btrfs_writepage_cow_fixup, we
      do not need this -EBUSY case.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5ab58055
    • Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker · 25f3c502
      Chris Mason authored
      For COW, btrfs expects dirty pages to have been through a few setup
      steps.  This includes reserving space for the new block allocations and marking
      the range in the state tree for delayed allocation.
      
      A few places outside btrfs will dirty pages directly, especially when unmapping
      mmap'd pages.  In order for these to properly go through COW, we run them
      through a fixup worker to wait for stable pages, and do the delalloc prep.
      
      87826df0 added a window where the dirty pages were cleaned, but pending
      more action from the fixup worker.  We clear_page_dirty_for_io() before
      we call into writepage, so the page is no longer dirty.  The commit
      changed it so now we leave the page clean between unlocking it here and
      the fixup worker starting at some point in the future.
      
      During this window, page migration can jump in and relocate the page.  Once our
      fixup work actually starts, it finds page->mapping is NULL and we end up
      freeing the page without ever writing it.
      
      This leads to crc errors and other exciting problems, since it screws up the
      whole state machine for waiting for ordered extents.  The fix here is to keep
      the page dirty while we're waiting for the fixup worker to get to work.
      This is accomplished by returning -EAGAIN from btrfs_writepage_cow_fixup
      if we queued the page up for fixup, which will cause the writepage
      function to redirty the page.
      
      Because we now expect the page to be dirty once it gets to the fixup
      worker we must adjust the error cases to call clear_page_dirty_for_io()
      on the page.  That is the bulk of the patch, but it is not the fix; the
      fix is the -EAGAIN from btrfs_writepage_cow_fixup.  We cannot separate
      these two changes out because the error conditions change with the new
      expectations.
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      25f3c502
    • btrfs: take overcommit into account in inc_block_group_ro · a30a3d20
      Josef Bacik authored
      inc_block_group_ro does a calculation to see if we would have enough
      room left over after marking this block group read only, and only
      marks it read only if so.
      
      The problem is this calculation _only_ works for data, where our used is
      always less than our total.  For metadata we will overcommit, so this
      will almost always fail for metadata.
      
      Fix this by exporting btrfs_can_overcommit, and then see if we have
      enough space to remove the remaining free space in the block group we
      are trying to mark read only.  If we do then we can mark this block
      group as read only.
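
      The difference between the two checks can be sketched with a small
      Python model. This is an assumption-laden simplification: SpaceInfo and
      both helper functions are hypothetical names, the minimum allocable
      bytes term is left out, and the stand-in for btrfs_can_overcommit simply
      counts all unallocated device space, whereas the kernel is more
      conservative.

```python
from dataclasses import dataclass

MiB = 1 << 20
GiB = 1 << 30


@dataclass
class SpaceInfo:
    total: int        # bytes covered by all block groups of this type
    used: int         # used + reserved bytes; may exceed total for metadata
    unallocated: int  # device space not yet assigned to any chunk


def fits_without_overcommit(info: SpaceInfo, bg_free: int) -> bool:
    """Data-style check: usage plus the free space we lose must fit in total."""
    return info.used + bg_free <= info.total


def fits_with_overcommit(info: SpaceInfo, bg_free: int) -> bool:
    """Hypothetical stand-in for an overcommit-aware check."""
    return info.used + bg_free <= info.total + info.unallocated


# Metadata that is already overcommitted: reservations exceed the total.
metadata = SpaceInfo(total=1 * GiB, used=1 * GiB + 200 * MiB, unallocated=10 * GiB)
# Data never overcommits, so the plain check is enough.
data = SpaceInfo(total=10 * GiB, used=4 * GiB, unallocated=0)

metadata_plain = fits_without_overcommit(metadata, 100 * MiB)
metadata_overcommit = fits_with_overcommit(metadata, 100 * MiB)
data_plain = fits_without_overcommit(data, 512 * MiB)
print(metadata_plain, metadata_overcommit, data_plain)
```

      In this model the plain check refuses to mark the metadata block group
      read only even though plenty of unallocated space remains, which is the
      failure mode described above.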
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a30a3d20
    • btrfs: fix force usage in inc_block_group_ro · a7a63acc
      Josef Bacik authored
      For some reason we've translated the do_chunk_alloc that goes into
      btrfs_inc_block_group_ro to force in inc_block_group_ro, but these are
      two different things.
      
      force for inc_block_group_ro is used when we are forcing the block group
      read only no matter what, for example when the underlying chunk is
      marked read only.  We need to not do the space check here as this block
      group needs to be read only.
      
      btrfs_inc_block_group_ro() has a do_chunk_alloc flag that indicates that
      we need to pre-allocate a chunk before marking the block group read
      only.  This has nothing to do with forcing, and in fact we _always_ want
      to do the space check in this case, so unconditionally pass false for
      force in this case.
      
      Then fix up inc_block_group_ro to honor force as it's expected and
      documented to do.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a7a63acc
    • btrfs: Correctly handle empty trees in find_first_clear_extent_bit · 5750c375
      Nikolay Borisov authored
      Raviu reported that running his regular fs_trim segfaulted with the
      following backtrace:
      
      [  237.525947] assertion failed: prev, in ../fs/btrfs/extent_io.c:1595
      [  237.525984] ------------[ cut here ]------------
      [  237.525985] kernel BUG at ../fs/btrfs/ctree.h:3117!
      [  237.525992] invalid opcode: 0000 [#1] SMP PTI
      [  237.525998] CPU: 4 PID: 4423 Comm: fstrim Tainted: G     U     OE     5.4.14-8-vanilla #1
      [  237.526001] Hardware name: ASUSTeK COMPUTER INC.
      [  237.526044] RIP: 0010:assfail.constprop.58+0x18/0x1a [btrfs]
      [  237.526079] Call Trace:
      [  237.526120]  find_first_clear_extent_bit+0x13d/0x150 [btrfs]
      [  237.526148]  btrfs_trim_fs+0x211/0x3f0 [btrfs]
      [  237.526184]  btrfs_ioctl_fitrim+0x103/0x170 [btrfs]
      [  237.526219]  btrfs_ioctl+0x129a/0x2ed0 [btrfs]
      [  237.526227]  ? filemap_map_pages+0x190/0x3d0
      [  237.526232]  ? do_filp_open+0xaf/0x110
      [  237.526238]  ? _copy_to_user+0x22/0x30
      [  237.526242]  ? cp_new_stat+0x150/0x180
      [  237.526247]  ? do_vfs_ioctl+0xa4/0x640
      [  237.526278]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      [  237.526283]  do_vfs_ioctl+0xa4/0x640
      [  237.526288]  ? __do_sys_newfstat+0x3c/0x60
      [  237.526292]  ksys_ioctl+0x70/0x80
      [  237.526297]  __x64_sys_ioctl+0x16/0x20
      [  237.526303]  do_syscall_64+0x5a/0x1c0
      [  237.526310]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      That was due to btrfs_fs_device::aloc_tree being empty. Initially I
      thought this wasn't possible and as a precaution put the assert in
      find_first_clear_extent_bit. Turns out this is indeed possible and could
      happen when a file system with SINGLE data/metadata profile has a 2nd
      device added. Until balance is run or a new chunk is allocated on this
      device it will be completely empty.
      
      In this case find_first_clear_extent_bit should return the full range
      [0, -1ULL] and let the caller handle this, i.e. for trim the end will be
      capped at the size of the actual device.
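
      The intended behaviour can be modelled in Python. This is a sketch under
      stated assumptions, not the kernel implementation: the extent state tree
      is replaced by a sorted list of inclusive (start, end) ranges that have
      the bit set, the last set range is assumed to end below U64_MAX, and the
      function names are hypothetical.

```python
U64_MAX = 2**64 - 1


def find_first_clear_range(set_ranges, start):
    """Return the first clear (start, end) range that contains or follows `start`.

    set_ranges: sorted, non-overlapping, inclusive (start, end) pairs
    where the bit is set.
    """
    if not set_ranges:
        # Empty tree: nothing is set, so the whole address space is clear.
        return 0, U64_MAX
    clear_start = 0
    for s, e in set_ranges:
        # A real gap exists before this set range and `start` is not past it.
        if start < s and clear_start < s:
            return clear_start, s - 1
        clear_start = e + 1
    # Everything after the last set range is clear.
    return clear_start, U64_MAX


def trim_range(set_ranges, start, device_size):
    """Caller-side handling: cap the clear range at the end of the device."""
    clear_start, clear_end = find_first_clear_range(set_ranges, start)
    return clear_start, min(clear_end, device_size - 1)


# A freshly added, completely empty device of 100 bytes (toy size).
print(trim_range([], 0, 100))
```

      For the empty device the model returns the full range, and the trim
      caller reduces it to the device size instead of tripping an assertion.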
      
      Link: https://lore.kernel.org/linux-btrfs/izW2WNyvy1dEDweBICizKnd2KDwDiDyY2EYQr4YCwk7pkuIpthx-JRn65MPBde00ND6V0_Lh8mW0kZwzDiLDv25pUYWxkskWNJnVP0kgdMA=@protonmail.com/
      Fixes: 45bfcfc1 ("btrfs: Implement find_first_clear_extent_bit")
      CC: stable@vger.kernel.org # 5.2+
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5750c375
    • btrfs: flush write bio if we loop in extent_write_cache_pages · 42ffb0bf
      Josef Bacik authored
      There exists a deadlock with range_cyclic that has existed forever.  If
      we loop around with a bio already built we could deadlock with a writer
      who has the page locked that we're attempting to write but is waiting on
      a page in our bio to be written out.  The task traces are as follows
      
        PID: 1329874  TASK: ffff889ebcdf3800  CPU: 33  COMMAND: "kworker/u113:5"
         #0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
         #1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
         #2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
         #3 [ffffc900297bb708] __lock_page at ffffffff811f145b
         #4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
         #5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
         #6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
         #7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
         #8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
         #9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd
      
        PID: 2167901  TASK: ffff889dc6a59c00  CPU: 14  COMMAND: "aio-dio-invalid"
         #0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
         #1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
         #2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
         #3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
         #4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
         #5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
         #6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
         #7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
         #8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
         #9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032
      
      I used drgn to find the respective pages we were stuck on
      
      page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
      page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874
      
      As you can see the kworker is waiting for bit 0 (PG_locked) on index
      7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
      8148.  aio-dio-invalid has 7680, and the kworker epd looks like the
      following
      
        crash> struct extent_page_data ffffc900297bbbb0
        struct extent_page_data {
          bio = 0xffff889f747ed830,
          tree = 0xffff889eed6ba448,
          extent_locked = 0,
          sync_io = 0
        }
      
      Probably worth mentioning as well that it waits for writeback of the
      page to complete while holding a lock on it (at prepare_pages()).
      
      Using drgn I walked the bio pages looking for page
      0xffffea00fbfc7500 which is the one we're waiting for writeback on
      
        bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
        for i in range(0, bio.bi_vcnt.value_()):
            bv = bio.bi_io_vec[i]
            if bv.bv_page.value_() == 0xffffea00fbfc7500:
                print("FOUND IT")
      
      which validated what I suspected.
      
      The fix for this is simple, flush the epd before we loop back around to
      the beginning of the file during writeout.
      
      Fixes: b293f02e ("Btrfs: Add writepages support")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      42ffb0bf
    • Btrfs: fix race between adding and putting tree mod seq elements and nodes · 7227ff4d
      Filipe Manana authored
      There is a race between adding and removing elements to the tree mod log
      list and rbtree that can lead to use-after-free problems.
      
      Consider the following example that explains how/why the problem happens:
      
      1) Task A has a mod log element with sequence number 200. It is currently
         the only element in the mod log list;
      
      2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
         access the tree mod log. When it enters the function, it initializes
         'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
         before checking if there are other elements in the mod seq list.
         Since the list is empty, 'min_seq' remains set to (u64)-1. Then it
         unlocks the lock 'tree_mod_seq_lock';
      
      3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
         itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
         sequence number of 201;
      
      4) Some other task, name it task C, modifies a btree and because there
         are elements in the mod seq list, it adds a tree mod elem to the tree
         mod log rbtree. That node added to the mod log rbtree is assigned
         a sequence number of 202;
      
      5) Task B, which is doing fiemap and resolving indirect back references,
         calls get_old_root(), with 'time_seq' == 201, which in turn
         calls tree_mod_log_search() - the search returns the mod log node
         from the rbtree with sequence number 202, created by task C;
      
      6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
         the mod log rbtree and finds the node with sequence number 202. Since
         202 is less than the previously computed 'min_seq', (u64)-1, it
         removes the node and frees it;
      
      7) Task B still has a pointer to the node with sequence number 202, and
         it dereferences the pointer itself and through the call to
         __tree_mod_log_rewind(), resulting in a use-after-free problem.
      
      This issue can be triggered sporadically with the test case generic/561
      from fstests, and it happens more frequently with a higher number of
      duperemove processes. When it happens to me, it either freezes the VM or
      it produces a trace like the following before crashing:
      
        [ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        [ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
        [ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        [ 1245.321287] RIP: 0010:rb_next+0x16/0x50
        [ 1245.321307] Code: ....
        [ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
        [ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
        [ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
        [ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
        [ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
        [ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
        [ 1245.321539] FS:  00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
        [ 1245.321591] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
        [ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [ 1245.321706] Call Trace:
        [ 1245.321798]  __tree_mod_log_rewind+0xbf/0x280 [btrfs]
        [ 1245.321841]  btrfs_search_old_slot+0x105/0xd00 [btrfs]
        [ 1245.321877]  resolve_indirect_refs+0x1eb/0xc60 [btrfs]
        [ 1245.321912]  find_parent_nodes+0x3dc/0x11b0 [btrfs]
        [ 1245.321947]  btrfs_check_shared+0x115/0x1c0 [btrfs]
        [ 1245.321980]  ? extent_fiemap+0x59d/0x6d0 [btrfs]
        [ 1245.322029]  extent_fiemap+0x59d/0x6d0 [btrfs]
        [ 1245.322066]  do_vfs_ioctl+0x45a/0x750
        [ 1245.322081]  ksys_ioctl+0x70/0x80
        [ 1245.322092]  ? trace_hardirqs_off_thunk+0x1a/0x1c
        [ 1245.322113]  __x64_sys_ioctl+0x16/0x20
        [ 1245.322126]  do_syscall_64+0x5c/0x280
        [ 1245.322139]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [ 1245.322155] RIP: 0033:0x7fdee3942dd7
        [ 1245.322177] Code: ....
        [ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
        [ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
        [ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
        [ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
        [ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
        [ 1245.322423] Modules linked in: ....
        [ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---
      
      Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
      sequence number and iterates the rbtree while holding the lock
      'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
      lock, since it is now redundant.
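
      The idea of the fix can be modelled in Python as follows. This is only a
      sketch: the TreeModLog class and its methods are hypothetical, a
      threading.Lock stands in for 'tree_mod_log_lock', and plain lists replace
      the kernel list and rbtree. The point it shows is that the minimum
      sequence number is computed and the stale nodes are pruned inside one
      critical section, so no other task can register itself in between.

```python
import threading


class TreeModLog:
    def __init__(self):
        self.lock = threading.Lock()  # plays the role of tree_mod_log_lock
        self.seq_list = []            # sequence numbers of active users
        self.nodes = []               # sequence numbers of logged modifications
        self.next_seq = 200

    def get_seq(self):
        """Register a user of the log and return its sequence number."""
        with self.lock:
            seq = self.next_seq
            self.next_seq += 1
            self.seq_list.append(seq)
            return seq

    def log_modification(self):
        """Record a modification if anybody may need to rewind it."""
        with self.lock:
            if not self.seq_list:
                return None
            seq = self.next_seq
            self.next_seq += 1
            self.nodes.append(seq)
            return seq

    def put_seq(self, seq):
        """Unregister a user and prune nodes nobody can need any more."""
        with self.lock:
            self.seq_list.remove(seq)
            if not self.seq_list:
                self.nodes.clear()
            else:
                min_seq = min(self.seq_list)
                self.nodes = [n for n in self.nodes if n >= min_seq]


log = TreeModLog()
a = log.get_seq()             # task A gets 200
log.put_seq(a)                # task A leaves; pruning is part of the same step
b = log.get_seq()             # task B gets 201
c = log.log_modification()    # task C logs a node with sequence 202
nodes_seen_by_b = list(log.nodes)
log.put_seq(b)                # last user gone, the node can be dropped
```

      Because task A finishes pruning before task B can register, the node
      with sequence number 202 stays alive for as long as task B may use it.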
      
      Fixes: bd989ba3 ("Btrfs: add tree modification log functions")
      Fixes: 097b8a7c ("Btrfs: join tree mod log code with the code holding back delayed refs")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7227ff4d
  9. 23 Jan, 2020 13 commits
    • btrfs: free block groups after free'ing fs trees · 4e19443d
      Josef Bacik authored
      Sometimes when running generic/475 we would trip the
      WARN_ON(cache->reserved) check when free'ing the block groups on umount.
      This is because sometimes we don't commit the transaction because of IO
      errors and thus do not clean up the tree logs until umount time.
      
      These blocks are still reserved until they are cleaned up, but they
      aren't cleaned up until _after_ we do the free block groups work.  Fix
      this by moving the free after free'ing the fs roots, that way all of the
      tree logs are cleaned up and we have a properly cleaned fs.  A bunch of
      loops of generic/475 confirmed this fixes the problem.
      
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4e19443d
    • btrfs: Fix split-brain handling when changing FSID to metadata uuid · 1362089d
      Nikolay Borisov authored
      Current code doesn't correctly handle the situation which arises when
      a file system that has the METADATA_UUID_INCOMPAT flag set has its FSID
      changed to the one in metadata uuid. This causes the incompat flag to
      disappear.
      
      In case of a power failure we could end up in a situation where part of
      the disks in a multi-disk filesystem are correctly reverted to
      METADATA_UUID_INCOMPAT flag unset state, while others have
      METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS.
      
      This patch corrects the behavior required to handle the case where a
      disk of the second type is scanned first, creating the necessary
      btrfs_fs_devices. Subsequently, when a disk which has already completed
      the transition is scanned it should overwrite the data in
      btrfs_fs_devices.
      Reported-by: Su Yue <Damenly_Su@gmx.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1362089d
    • btrfs: Handle another split brain scenario with metadata uuid feature · 05840710
      Nikolay Borisov authored
      There is one more case which isn't handled by the original metadata
      uuid work. Namely, when a filesystem has the METADATA_UUID incompat bit
      and the user decides to change the FSID to the original one, i.e. have
      metadata_uuid and fsid match. In case of power failure while this
      operation is in progress we could end up in a situation where some of
      the disks have the incompat bit removed and the rest have both
      METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags.
      
      This patch handles the case where a disk that has successfully changed
      its FSID such that it equals METADATA_UUID is scanned first.
      Subsequently when a disk with both
      METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned
      find_fsid_changed won't be able to find an appropriate btrfs_fs_devices.
      This is fixed by extending find_fsid_changed to correctly find
      btrfs_fs_devices whose metadata_uuid/fsid are the same and they match
      the metadata_uuid of the currently scanned device.
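
      The additional match described above can be sketched in Python. This
      models only the new case and rests on assumptions: FsDevices and
      find_unchanged_match are hypothetical names, and the UUIDs are
      placeholder strings rather than real 16-byte values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FsDevices:
    fsid: str
    metadata_uuid: str


def find_unchanged_match(fs_devices_list, disk_metadata_uuid) -> Optional[FsDevices]:
    """Find an entry whose fsid equals its metadata_uuid and also equals
    the metadata_uuid of the device being scanned."""
    for fs_devices in fs_devices_list:
        if (fs_devices.fsid == fs_devices.metadata_uuid
                and fs_devices.fsid == disk_metadata_uuid):
            return fs_devices
    return None


# A disk that completed the change was scanned first: fsid == metadata_uuid.
finished = FsDevices(fsid="uuid-A", metadata_uuid="uuid-A")
unrelated = FsDevices(fsid="uuid-X", metadata_uuid="uuid-Y")
known = [unrelated, finished]

# The disk scanned next still has the old fsid ("uuid-B") and both flags set;
# only its metadata_uuid ("uuid-A") can be used to locate the right entry.
match = find_unchanged_match(known, disk_metadata_uuid="uuid-A")
no_match = find_unchanged_match(known, disk_metadata_uuid="uuid-Z")
```

      Here the lagging disk is paired with the entry created by the disk that
      already finished the transition.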
      
      Fixes: cc5de4e7 ("btrfs: Handle final split-brain possibility during fsid change")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reported-by: Su Yue <Damenly_Su@gmx.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      05840710
    • btrfs: Factor out metadata_uuid code from find_fsid. · c6730a0e
      Su Yue authored
      find_fsid became rather hairy with the introduction of the metadata uuid
      changing feature. Alleviate this by factoring out the metadata uuid
      specific code into a dedicated function which deals with finding the
      correct fsid for a device with changed uuid.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Su Yue <Damenly_Su@gmx.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c6730a0e
    • btrfs: Call find_fsid from find_fsid_inprogress · c0d81c7c
      Su Yue authored
      Since find_fsid_inprogress should also handle the case in which an fs
      didn't change its FSID make it call find_fsid directly. This makes the
      code in device_list_add simpler by eliminating a conditional call of
      find_fsid. No functional changes.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Su Yue <Damenly_Su@gmx.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c0d81c7c
    • Btrfs: fix infinite loop during fsync after rename operations · b5e4ff9d
      Filipe Manana authored
      Recently fsstress (from fstests) sporadically started to trigger an
      infinite loop during fsync operations. This turned out to be because
      support for the rename exchange and whiteout operations was added to
      fsstress in fstests. These operations, unlike any others in fsstress,
      cause file names to be reused, hence triggering this issue. However
      it's not necessary to use rename exchange and rename whiteout operations
      to trigger this issue; simple rename operations and file creations are
      enough.
      
      The issue boils down to when we are logging inodes that conflict (that
      had the name of any inode we need to log during the fsync operation), we
      keep logging them even if they were already logged before, and after
      that we check if there's any other inode that conflicts with them and
      then add it again to the list of inodes to log. Skipping already logged
      inodes fixes the issue.
      
      Consider the following example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/testdir                           # inode 257
      
        $ touch /mnt/testdir/zz                        # inode 258
        $ ln /mnt/testdir/zz /mnt/testdir/zz_link
      
        $ touch /mnt/testdir/a                         # inode 259
      
        $ sync
      
        # The following 3 renames achieve the same result as a rename exchange
        # operation (<rename_exchange> /mnt/testdir/zz_link to /mnt/testdir/a).
      
        $ mv /mnt/testdir/a /mnt/testdir/tmp
        $ mv /mnt/testdir/zz_link /mnt/testdir/a
        $ mv /mnt/testdir/tmp /mnt/testdir/zz_link
      
        # The following rename and file creation give the same result as a
        # rename whiteout operation (<rename_whiteout> zz to a2).
      
        $ mv /mnt/testdir/zz /mnt/testdir/a2
        $ touch /mnt/testdir/zz                        # inode 260
      
        $ xfs_io -c fsync /mnt/testdir/zz
          --> results in the infinite loop
      
      The following steps happen:
      
      1) When logging inode 260, we find that its reference named "zz" was
         used by inode 258 in the previous transaction (through the commit
         root), so inode 258 is added to the list of conflicting inodes that
         need to be logged;
      
      2) After logging inode 258, we find that its reference named "a" was
         used by inode 259 in the previous transaction, and therefore we add
         inode 259 to the list of conflicting inodes to be logged;
      
      3) After logging inode 259, we find that its reference named "zz_link"
         was used by inode 258 in the previous transaction - we add inode 258
         to the list of conflicting inodes to log, again - we had already
         logged it before at step 2. After logging it again, we find again
         that inode 259 conflicts with it, and we add 259 to the list again,
         etc - we end up repeating all the previous steps.
      
      So fix this by skipping logging of conflicting inodes that were already
      logged.
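
      The loop and its fix can be reproduced with a small Python model. It is
      a sketch, not the tree-log code: inodes_to_log is a hypothetical
      function, the conflict relation from the steps above is written out as a
      dictionary, and max_steps exists only so that the unfixed variant
      terminates in the demonstration.

```python
from collections import deque


def inodes_to_log(start, conflicts, skip_logged=True, max_steps=20):
    """Walk conflicting inodes starting from `start`.

    conflicts maps an inode number to the inodes that owned one of its
    names in the previous transaction. Returns the inodes logged, in order,
    and whether work was still pending when the step limit was reached.
    """
    logged = []
    seen = set()
    queue = deque([start])
    steps = 0
    while queue and steps < max_steps:
        ino = queue.popleft()
        steps += 1
        if skip_logged and ino in seen:
            continue  # the fix: never log the same conflicting inode twice
        seen.add(ino)
        logged.append(ino)
        queue.extend(conflicts.get(ino, []))
    return logged, bool(queue)


# 260 took the name "zz" from 258, 258 took "a" from 259,
# and 259 took "zz_link" from 258.
conflicts = {260: [258], 258: [259], 259: [258]}

fixed = inodes_to_log(260, conflicts, skip_logged=True)
unfixed = inodes_to_log(260, conflicts, skip_logged=False)
print(fixed)
```

      With the skip in place each inode is logged once and the walk ends;
      without it the model keeps alternating between inodes 258 and 259 until
      the artificial step limit stops it.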
      
      Fixes: 6b5fc433 ("Btrfs: fix fsync after succession of renames of different files")
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b5e4ff9d
    • btrfs: set trans->drity in btrfs_commit_transaction · d62b23c9
      Josef Bacik authored
      If we abort a transaction we have the following sequence
      
      if (!trans->dirty && list_empty(&trans->new_bgs))
      	return;
      WRITE_ONCE(trans->transaction->aborted, err);
      
      The idea being if we didn't modify anything with our trans handle then
      we don't really need to abort the whole transaction, maybe the other
      trans handles are fine and we can carry on.
      
      However in the case of create_snapshot we add a pending_snapshot object
      to our transaction and then commit the transaction.  We don't actually
      modify anything.  sync() behaves the same way: it attaches to an existing
      transaction and commits it.  This means that if we have an IO error in
      the right places we could abort the committing transaction with our
      trans->dirty being not set and thus not set transaction->aborted.
      
      This is a problem because in the create_snapshot() case we depend on
      pending->error being set to something, or btrfs_commit_transaction
      returning an error.
      
      If we are not the trans handle that gets to commit the transaction, and
      we're waiting on the commit to happen we get our return value from
      cur_trans->aborted.  If this was not set to anything because sync() hit
      an error in the transaction commit before it could modify anything then
      cur_trans->aborted would be 0.  Thus we'd return 0 from
      btrfs_commit_transaction() in create_snapshot.
      
      This is a problem because we then try to do things with
      pending_snapshot->snap, which will be NULL because we didn't create the
      snapshot, and then we'll get a NULL pointer dereference like the
      following:
      
      "BUG: kernel NULL pointer dereference, address: 00000000000001f0"
      RIP: 0010:btrfs_orphan_cleanup+0x2d/0x330
      Call Trace:
       ? btrfs_mksubvol.isra.31+0x3f2/0x510
       btrfs_mksubvol.isra.31+0x4bc/0x510
       ? __sb_start_write+0xfa/0x200
       ? mnt_want_write_file+0x24/0x50
       btrfs_ioctl_snap_create_transid+0x16c/0x1a0
       btrfs_ioctl_snap_create_v2+0x11e/0x1a0
       btrfs_ioctl+0x1534/0x2c10
       ? free_debug_processing+0x262/0x2a3
       do_vfs_ioctl+0xa6/0x6b0
       ? do_sys_open+0x188/0x220
       ? syscall_trace_enter+0x1f8/0x330
       ksys_ioctl+0x60/0x90
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x4a/0x1b0
      
      In order to fix this we need to make sure anybody who calls
      commit_transaction has trans->dirty set so that they set the
      trans->transaction->aborted value properly and any waiters know bad
      things happened.
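
The interaction between trans->dirty and the abort check can be modelled in userspace. This is a minimal sketch, not the kernel code: the toy_* structures and functions, the has_new_bgs flag (standing in for !list_empty(&trans->new_bgs)) and the mark_dirty switch are all hypothetical names made up for the illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct toy_transaction {
	int aborted;		/* 0, or the error that waiters will read */
};

struct toy_trans_handle {
	struct toy_transaction *transaction;
	bool dirty;
	bool has_new_bgs;	/* stands in for !list_empty(&trans->new_bgs) */
};

/* Mirrors the abort sequence quoted at the top of this message. */
void toy_abort(struct toy_trans_handle *trans, int err)
{
	if (!trans->dirty && !trans->has_new_bgs)
		return;
	trans->transaction->aborted = err;
}

/*
 * Toy commit. 'mark_dirty' selects the behaviour described as the fix,
 * 'io_err' is the error hit during the commit (0 for none).
 * Returns what a waiter would read from transaction->aborted.
 */
int toy_commit(struct toy_trans_handle *trans, bool mark_dirty, int io_err)
{
	if (mark_dirty)
		trans->dirty = true;
	if (io_err)
		toy_abort(trans, io_err);
	return trans->transaction->aborted;
}
```

In this model a handle that modified nothing and hits -EIO during commit leaves aborted at 0 when mark_dirty is false, so a waiter sees success. With mark_dirty set, the same error is recorded and the waiter reads -EIO.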
      
      This was found while I was running generic/475 with my modified
      fsstress, it reproduced within a few runs.  I ran with this patch all
      night and didn't see the problem again.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d62b23c9
    • Josef Bacik's avatar
      btrfs: drop log root for dropped roots · 889bfa39
      Josef Bacik authored
      If we fsync on a subvolume and create a log root for that volume, and
      then later delete that subvolume we'll never clean up its log root.  Fix
      this by making switch_commit_roots free the log for any dropped roots we
      encounter.  The extra churn is because we need a btrfs_trans_handle, not
      the btrfs_transaction.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      889bfa39
    • Anand Jain's avatar
      btrfs: sysfs, add devid/dev_state kobject and device attributes · 668e48af
      Anand Jain authored
      New sysfs attributes that track the filesystem status of devices, stored
      in the per-filesystem directory in /sys/fs/btrfs/FSID/devinfo . There's
      a directory for each device, with name corresponding to the numerical
      device id.
      
        in_fs_metadata    - device is in the list of fs metadata
        missing           - device is missing (no device node or block device)
        replace_target    - device is target of replace
        writeable         - writes from fs are allowed
      
      These attributes reflect the state of device::dev_state and are created
      at mount time.
      
      Sample output:
        $ pwd
         /sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo/1/
        $ ls
          in_fs_metadata  missing  replace_target  writeable
        $ cat missing
          0
      
      The output of each of these attributes is 0 or 1. 0 indicates unset and 1
      indicates set.  These attributes are read-only.
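
A userspace reader for one of these attributes could look roughly like the following. This is a sketch only: read_dev_state_attr() is a hypothetical helper name, and the code assumes nothing beyond what is stated above, namely that the file content starts with 0 or 1.

```c
#include <stdio.h>

/*
 * Read a 0/1 attribute file.
 * Returns 0 or 1 on success, and -1 if the file cannot be opened or its
 * first character is neither '0' nor '1'.
 */
int read_dev_state_attr(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (!f)
		return -1;
	c = fgetc(f);
	fclose(f);

	if (c == '0')
		return 0;
	if (c == '1')
		return 1;
	return -1;
}
```

A caller would pass a path such as /sys/fs/btrfs/FSID/devinfo/1/missing, with FSID replaced by the filesystem UUID, and treat -1 as "attribute not available".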
      
      It is observed that the device delete thread and sysfs read thread will
      not race because the delete thread calls sysfs kobject_put() which in
      turn waits for existing sysfs read to complete.
      
      Note for device replace devid swap:
      
      During the replace the target device temporarily assumes devid 0 before
      being assigned the devid of the source device.
      
      In btrfs_dev_replace_finishing() we remove the source device's sysfs devid
      entry using btrfs_sysfs_remove_devices_attr(), and after that call
      kobject_rename() to update the devid in sysfs.  This adds and calls the
      btrfs_sysfs_update_devid() helper function to update the device id.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      668e48af
    • Nikolay Borisov's avatar
      btrfs: Refactor btrfs_rmap_block to improve readability · 1776ad17
      Nikolay Borisov authored
      Move variables to appropriate scope. Remove the last BUG_ON in the
      function and rework error handling accordingly. Make the duplicate
      detection code more straightforward. Use the in_range macro. Give
      variables more descriptive names by explicitly distinguishing between IO
      stripe size (size recorded in the chunk item) and data stripe size (the
      size of an actual stripe, constituting a logical chunk/block group).
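
As an illustration of the kind of check the in_range macro expresses, here is a small standalone sketch. The macro body is written out under the assumption that it tests a half-open interval [first, first + len). physical_in_stripe() and its parameter names are hypothetical and are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed semantics: true when first <= b < first + len. */
#define in_range(b, first, len) ((b) >= (first) && (b) < (first) + (len))

/*
 * Hypothetical helper: does 'physical' fall inside a device stripe that
 * starts at 'stripe_start' and covers 'data_stripe_length' bytes?
 */
bool physical_in_stripe(uint64_t physical, uint64_t stripe_start,
			uint64_t data_stripe_length)
{
	return in_range(physical, stripe_start, data_stripe_length);
}
```

Compared with two open-coded comparisons, the macro keeps the start and the length together, which makes it harder to mix up an IO stripe size with a data stripe size at the call site.
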
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1776ad17
    • Nikolay Borisov's avatar
      btrfs: Add self-tests for btrfs_rmap_block · bf2e2eb0
      Nikolay Borisov authored
      Add RAID1 and single testcases to verify that data stripes are excluded
      from super block locations and that the address mapping is valid.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      bf2e2eb0
    • Nikolay Borisov's avatar
      btrfs: selftests: Add support for dummy devices · b3ad2c17
      Nikolay Borisov authored
      Add basic infrastructure to create and link dummy btrfs_devices. This
      will be used in the pending btrfs_rmap_block test which deals with
      the block groups.
      
      Calling btrfs_alloc_dummy_device will link the newly created device to
      the passed fs_info, and the test framework will free such devices once
      the test is finished.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b3ad2c17
    • Nikolay Borisov's avatar
      btrfs: Move and unexport btrfs_rmap_block · 96a14336
      Nikolay Borisov authored
      It's used only during initial block group reading to map the physical
      address of a super block to a list of logical ones. Make it private to
      block-group.c, add proper kernel doc and ensure it's exported only for
      tests.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      96a14336