1. 10 Mar, 2014 29 commits
    • Btrfs: avoid warning bomb of btrfs_invalidate_inodes · 7813b3db
      Liu Bo authored
      After a transaction is aborted, we need to clean up inode resources by
      calling btrfs_invalidate_inodes(). btrfs_invalidate_inodes() used to
      expect the root's refs to be zero and has a WARN_ON() for that. However,
      this is not always true while cleaning up a transaction, so detect the
      transaction abort and do not warn in that case.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix possible deadlock in btrfs_cleanup_transaction · 2a85d9ca
      Liu Bo authored
      [13654.480669] ======================================================
      [13654.480905] [ INFO: possible circular locking dependency detected ]
      [13654.481003] 3.12.0+ #4 Tainted: G        W  O
      [13654.481060] -------------------------------------------------------
      [13654.481060] btrfs-transacti/9347 is trying to acquire lock:
      [13654.481060]  (&(&root->ordered_extent_lock)->rlock){+.+...}, at: [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
      [13654.481060] but task is already holding lock:
      [13654.481060]  (&(&fs_info->ordered_root_lock)->rlock){+.+...}, at: [<ffffffffa02d3015>] btrfs_cleanup_transaction+0x1e5/0x570 [btrfs]
      [13654.481060] which lock already depends on the new lock.
      
      [13654.481060] the existing dependency chain (in reverse order) is:
      [13654.481060] -> #1 (&(&fs_info->ordered_root_lock)->rlock){+.+...}:
      [13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
      [13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
      [13654.481060]        [<ffffffffa02f011b>] __btrfs_add_ordered_extent+0x39b/0x450 [btrfs]
      [13654.481060]        [<ffffffffa02f0202>] btrfs_add_ordered_extent+0x32/0x40 [btrfs]
      [13654.481060]        [<ffffffffa02df6aa>] run_delalloc_nocow+0x78a/0x9d0 [btrfs]
      [13654.481060]        [<ffffffffa02dfc0d>] run_delalloc_range+0x31d/0x390 [btrfs]
      [13654.481060]        [<ffffffffa02f7c00>] __extent_writepage+0x310/0x780 [btrfs]
      [13654.481060]        [<ffffffffa02f830a>] extent_write_cache_pages.isra.29.constprop.48+0x29a/0x410 [btrfs]
      [13654.481060]        [<ffffffffa02f879d>] extent_writepages+0x4d/0x70 [btrfs]
      [13654.481060]        [<ffffffffa02d9f68>] btrfs_writepages+0x28/0x30 [btrfs]
      [13654.481060]        [<ffffffff8114be91>] do_writepages+0x21/0x50
      [13654.481060]        [<ffffffff81140d49>] __filemap_fdatawrite_range+0x59/0x60
      [13654.481060]        [<ffffffff81140e13>] filemap_fdatawrite_range+0x13/0x20
      [13654.481060]        [<ffffffffa02f1db9>] btrfs_wait_ordered_range+0x49/0x140 [btrfs]
      [13654.481060]        [<ffffffffa0318fe2>] __btrfs_write_out_cache+0x682/0x8b0 [btrfs]
      [13654.481060]        [<ffffffffa031952d>] btrfs_write_out_cache+0x8d/0xe0 [btrfs]
      [13654.481060]        [<ffffffffa02c7083>] btrfs_write_dirty_block_groups+0x593/0x680 [btrfs]
      [13654.481060]        [<ffffffffa0345307>] commit_cowonly_roots+0x14b/0x20d [btrfs]
      [13654.481060]        [<ffffffffa02d7c1a>] btrfs_commit_transaction+0x43a/0x9d0 [btrfs]
      [13654.481060]        [<ffffffffa030061a>] btrfs_create_uuid_tree+0x5a/0x100 [btrfs]
      [13654.481060]        [<ffffffffa02d5a8a>] open_ctree+0x21da/0x2210 [btrfs]
      [13654.481060]        [<ffffffffa02ab6fe>] btrfs_mount+0x68e/0x870 [btrfs]
      [13654.481060]        [<ffffffff811b2409>] mount_fs+0x39/0x1b0
      [13654.481060]        [<ffffffff811cd653>] vfs_kern_mount+0x63/0xf0
      [13654.481060]        [<ffffffff811cfcce>] do_mount+0x23e/0xa90
      [13654.481060]        [<ffffffff811d05a3>] SyS_mount+0x83/0xc0
      [13654.481060]        [<ffffffff81692b52>] system_call_fastpath+0x16/0x1b
      [13654.481060] -> #0 (&(&root->ordered_extent_lock)->rlock){+.+...}:
      [13654.481060]        [<ffffffff810c340a>] __lock_acquire+0x150a/0x1a70
      [13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
      [13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
      [13654.481060]        [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
      [13654.481060]        [<ffffffffa02d35ce>] transaction_kthread+0x22e/0x270 [btrfs]
      [13654.481060]        [<ffffffff81079efa>] kthread+0xea/0xf0
      [13654.481060]        [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
      [13654.481060] other info that might help us debug this:
      
      [13654.481060]  Possible unsafe locking scenario:
      
      [13654.481060]        CPU0                    CPU1
      [13654.481060]        ----                    ----
      [13654.481060]   lock(&(&fs_info->ordered_root_lock)->rlock);
      [13654.481060]                                lock(&(&root->ordered_extent_lock)->rlock);
      [13654.481060]                                lock(&(&fs_info->ordered_root_lock)->rlock);
      [13654.481060]   lock(&(&root->ordered_extent_lock)->rlock);
      [13654.481060]
       *** DEADLOCK ***
      [...]
      
      ======================================================
      
      btrfs_destroy_all_ordered_extents()
      gets &fs_info->ordered_root_lock __BEFORE__ acquiring &root->ordered_extent_lock,
      while btrfs_[add,remove]_ordered_extent()
      acquires &fs_info->ordered_root_lock __AFTER__ getting &root->ordered_extent_lock.
      
      This patch fixes the above problem.
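
The inversion described above can be illustrated with a toy ordering check. This is a minimal sketch and not how lockdep is implemented: it only remembers which lock class was taken while another was held and flags the direct opposite order. The names `lock_class` and `record_nested_acquire` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes, named after the two locks in the report. */
enum lock_class { ORDERED_ROOT_LOCK, ORDERED_EXTENT_LOCK, NR_LOCK_CLASSES };

/* before[a][b] is true once some path has taken b while holding a. */
static bool before[NR_LOCK_CLASSES][NR_LOCK_CLASSES];

/*
 * Record that `inner` is acquired while `outer` is held.
 * Returns false if the opposite order was recorded earlier, i.e. the two
 * code paths together form an AB-BA inversion. Only direct two-lock
 * inversions are detected; longer dependency chains are not followed.
 */
bool record_nested_acquire(enum lock_class outer, enum lock_class inner)
{
    if (before[inner][outer])
        return false;
    before[outer][inner] = true;
    return true;
}
```

Recording the add/remove path first (extent lock held, then root lock) and the cleanup path second (root lock held, then extent lock) makes the second call return false, which corresponds to the situation in the report.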
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: faster/more efficient insertion of file extent items · d5f37527
      Filipe David Borba Manana authored
      This is an extension to my previous commit titled:
      
        "Btrfs: faster file extent item replace operations"
        (hash 1acae57b)
      
      Instead of inserting the new file extent item only if we deleted existing
      file extent items covering our target file range, also allow inserting
      the new file extent item if we didn't find any existing items to delete
      and replace_extent != 0. In this case our caller would do another
      tree search to insert the new file extent item anyway, so just
      combine the two tree searches into a single one, saving CPU time, reducing
      lock contention and reducing btree node/leaf COW operations.
      
      This covers the case where applications keep doing tail append writes to
      files, which for example is the case of Apache CouchDB (its database and
      view index files are always open with O_APPEND).
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: always choose work from prio_head first · 51b98eff
      Stanislaw Gruszka authored
      If we do not refill, the cur pointer taken from prio_head can be
      overwritten by one from the non-prioritized head, which does not look
      intended.

      With this change we always take work from prio_head first, until it is
      empty.
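
The selection rule described above can be sketched as follows. This is a simplified illustration, not the kernel's worker code: `struct work`, `struct pending`, `pop_front` and `pick_next_work` are hypothetical names, and plain singly linked lists stand in for the kernel lists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the worker's two pending lists. */
struct work {
    int id;
    struct work *next;
};

struct pending {
    struct work *prio_head; /* prioritized work */
    struct work *head;      /* regular work */
};

/* Detach and return the first entry of a singly linked list, or NULL. */
struct work *pop_front(struct work **list)
{
    struct work *w = *list;

    if (w) {
        *list = w->next;
        w->next = NULL;
    }
    return w;
}

/* Regular work is considered only once the prioritized list is empty. */
struct work *pick_next_work(struct pending *p)
{
    if (p->prio_head)
        return pop_front(&p->prio_head);
    return pop_front(&p->head);
}
```

The point of the early return is that a candidate taken from the prioritized list can never be replaced by one from the regular list.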
      Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Revert "Btrfs: remove transaction from btrfs send" · dcfd5ad2
      Wang Shilong authored
      This reverts commit 41ce9970.
      Previously I thought we could safely use a read-only root's commit root,
      but that is not true: a read-only root may be COWed in the following
      cases.

      1. Taking a snapshot of the send root will COW the source root.
      2. Balance and device operations will also COW the read-only send root
         to relocate it.

      So I have two ideas to make it safe to use the commit root.

      --> approach 1:
      Protect it with a transaction, end the transaction properly, and search
      again for the next item from the root node (see
      btrfs_search_slot_for_read()).

      --> approach 2:
      Add another counter to the local root structure to sync snapshot with
      send, and add a global counter to sync send with exclusive device
      operations.

      With approach 2, send can use the commit root safely, because we make sure
      the send root cannot be COWed during send. Unfortunately, it makes the
      code *ugly* and more complex to maintain.

      Making snapshot and send mutually exclusive, and device operations and
      send mutually exclusive, is also a little confusing for common users.

      So why not go back to the previous way.
      
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: skip readonly root for snapshot-aware defragment · bcbba5e6
      Wang Shilong authored
      Btrfs send assumes that a read-only root won't change, so skip read-only roots.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: switch to btrfs_previous_extent_item() · 850a8cdf
      Wang Shilong authored
      Since we have introduced btrfs_previous_extent_item() to search for the
      previous extent item, just switch to it.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: Filipe Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: skip submitting barrier for missing device · f88ba6a2
      Hidetoshi Seto authored
      I got an error on v3.13:
       BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.)
      
      how to reproduce:
        > mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2
        > wipefs -a /dev/sdf2
        > mount -o degraded /dev/sdf1 /mnt
        > btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt
      
      The cause of the error is that barrier_all_devices() failed to submit a
      barrier to the missing device.  However, it is clear that we cannot do
      anything on a missing device, and there is no need to care about chunks
      on the missing device either.

      This patch stops sending and waiting for barriers if the device is missing.
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: unlock extent and pages on error in cow_file_range · 29bce2f3
      Josef Bacik authored
      When I converted the BUG_ON() for the free_space_cache_inode in cow_file_range I
      made it so we just return an error instead of unlocking all of our various
      stuff.  This is a mistake and causes us to hang when we run into this.  This
      patch fixes this problem.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: balance delayed inode updates · c581afc8
      Josef Bacik authored
      While trying to reproduce a delayed ref problem I noticed the box kept falling
      over using all 80gb of my ram with btrfs_inode's and btrfs_delayed_node's.
      Turns out this is because we only throttle delayed inode updates in
      btrfs_dirty_inode, which doesn't actually get called that often, especially when
      all you are doing is creating a bunch of files.  So balance delayed inode
      updates every time we create a new inode.  With this patch we no longer use up
      all of our ram with delayed inode updates.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: add simple debugfs interface · 1bae3098
      David Sterba authored
      Help during debugging by exporting various interesting information and
      tunables without the need for extra mount options or ioctls.
      
      Usage:
      * declare your variable in sysfs.h, and include where you need it
      * define the variable in sysfs.c and make it visible via
        debugfs_create_TYPE
      
      Depends on CONFIG_DEBUG_FS.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: send: lower memory requirements in common case · ace01050
      David Sterba authored
      The fs_path structure uses an inline buffer and falls back to a chain of
      allocations, but vmalloc is not necessary because PATH_MAX fits into
      PAGE_SIZE.
      
      The size of fs_path has been reduced to 256 bytes from PAGE_SIZE,
      usually 4k. Experimental measurements show that most paths on a single
      filesystem do not exceed 200 bytes, and these get stored into the inline
      buffer directly, which is now 230 bytes. Longer paths are kmalloced when
      needed.
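
The inline-buffer-with-fallback idea can be sketched in userspace C. This is an illustration under stated assumptions, not the actual `fs_path` code: `struct path_buf` and its helpers are hypothetical, `malloc()` stands in for `kmalloc()`, and the 230-byte capacity is taken from the message above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_LEN 230 /* inline capacity quoted in the commit message */

/* Hypothetical userspace analogue; malloc() stands in for kmalloc(). */
struct path_buf {
    char *start;                 /* inline_buf or a heap allocation */
    size_t len;                  /* string length, excluding the NUL */
    char inline_buf[INLINE_LEN];
};

void path_buf_init(struct path_buf *p)
{
    p->start = p->inline_buf;
    p->len = 0;
    p->inline_buf[0] = '\0';
}

int path_buf_is_inline(const struct path_buf *p)
{
    return p->start == p->inline_buf;
}

/* Store a copy of s; returns 0 on success, -1 if allocation fails. */
int path_buf_set(struct path_buf *p, const char *s)
{
    size_t n = strlen(s);
    char *old = p->start;
    char *dst = p->inline_buf;

    if (n + 1 > INLINE_LEN) {
        dst = malloc(n + 1);     /* long path: fall back to the heap */
        if (!dst)
            return -1;
    }
    memmove(dst, s, n + 1);      /* memmove: s may alias the old buffer */
    if (old != p->inline_buf)
        free(old);
    p->start = dst;
    p->len = n;
    return 0;
}

void path_buf_free(struct path_buf *p)
{
    if (!path_buf_is_inline(p))
        free(p->start);
    path_buf_init(p);
}
```

Short paths never touch the allocator, which is the common case the message describes; only paths that do not fit in the inline buffer pay for an allocation.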
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: make some tree searches in send.c more efficient · dff6d0ad
      Filipe David Borba Manana authored
      We have this pattern where we search for a contiguous group of
      items in a tree and every time we find an item, we process it, then
      we release our path, increment the offset of the search key, do
      another full tree search and repeat these steps until a tree search
      can't find more items we're interested in.
      
      Instead of doing these full tree searches after processing each item,
      just process the next item/slot in our leaf and don't release the path.
      Since all these trees are read only and we always use the commit root
      for a search and skip node/leaf locks, we're not affecting concurrency
      on the trees.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: use right extent item position in send when finding extent clones · a0859c09
      Filipe David Borba Manana authored
      This was a leftover from the commit:
      
         74dd17fb
         (Btrfs: fix btrfs send for inline items and compression)
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: send: remove BUG_ON from name_cache_delete · 57fb8910
      David Sterba authored
      If cleaning the name cache fails, we could try to proceed at the cost of
      some memory leak. This is not expected to happen often.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: send: remove BUG from process_all_refs · 4d1a63b2
      David Sterba authored
      There are only 2 static callers, the BUG would normally be never
      reached, but let's be nice.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: send: squeeze bitfields in fs_path · 1f5a7ff9
      David Sterba authored
      We know that buf_len is at most PATH_MAX, 4k, and can merge it with the
      reversed member. This saves 3 bytes in favor of inline_buf.
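
A rough sketch of the bitfield merge follows. The two layouts are hypothetical and much smaller than the real `fs_path`; the exact bytes saved depend on the surrounding members and the compiler, so the sketch only shows that a 15-bit length field shares storage with a 1-bit flag and still holds a 4k length. Bit-fields of type `unsigned short` are a common compiler extension (accepted by GCC and Clang).

```c
#include <assert.h>

/* Hypothetical "before" layout: a full int plus a separate flag bit. */
struct flags_separate {
    int buf_len;
    unsigned int reversed : 1;
};

/* Hypothetical "after" layout: both fields share one 16-bit unit. */
struct flags_merged {
    unsigned short buf_len : 15;  /* up to 32767, enough for 4096 */
    unsigned short reversed : 1;
};

/* Returns 1 if len survives a round trip through the 15-bit field. */
int fits_in_buf_len(unsigned int len)
{
    struct flags_merged f = { 0, 0 };

    f.buf_len = len;              /* values above 32767 are truncated */
    return (unsigned int)f.buf_len == len;
}
```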
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: send: remove virtual_mem member from fs_path · e25a8122
      David Sterba authored
      We don't need to keep track of that, it's available via is_vmalloc_addr.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: send: remove prepared member from fs_path · b23ab57d
      David Sterba authored
      The member is used only to return value back from
      fs_path_prepare_for_add, we can do it locally and save 8 bytes for the
      inline_buf path.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: send: replace check with an assert in gen_unique_name · 64792f25
      David Sterba authored
      The buffer passed to snprintf can hold the fully expanded format string,
      64 = 3x largest ULL + 3x char + trailing null.  I don't think that removing the
      check entirely is a good idea, hence the ASSERT.
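
The size arithmetic above can be checked with a small userspace sketch. The format string `"o%llu-%llu-%llu"` is an assumption chosen to match the description (three numbers plus three extra characters); the helper names are hypothetical, and the standard `assert()` stands in for the kernel's ASSERT. The largest 64-bit value has 20 decimal digits, so 3 x 20 + 3 + 1 = 64.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

#define NAME_BUF_LEN 64 /* 3 x 20 digits + 3 chars + trailing NUL */

/* Format a name and assert that it was not truncated. */
int format_unique_name(char *buf, size_t size, unsigned long long ino,
                       unsigned long long gen, unsigned long long idx)
{
    /* Assumed format string; it adds 'o' and two '-' characters. */
    int len = snprintf(buf, size, "o%llu-%llu-%llu", ino, gen, idx);

    assert(len > 0 && (size_t)len < size);
    return len;
}

/* Length of the longest possible name, excluding the trailing NUL. */
int longest_unique_name_len(void)
{
    char buf[NAME_BUF_LEN];

    return format_unique_name(buf, sizeof(buf), ULLONG_MAX, ULLONG_MAX,
                              ULLONG_MAX);
}
```

With a 64-bit `unsigned long long` the worst case is 63 characters, so the assertion documents an invariant rather than guarding a path that is expected to trigger.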
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: more send support for parent/child dir relationship inversion · 5ed7f9ff
      Filipe David Borba Manana authored
      The commit titled "Btrfs: fix infinite path build loops in incremental send"
      didn't cover a particular case where the parent-child relationship inversion
      of directories doesn't imply a rename of the new parent directory. This was
      due to a simple logic mistake, a logical and instead of a logical or.
      
      Steps to reproduce:
      
        $ mkfs.btrfs -f /dev/sdb3
        $ mount /dev/sdb3 /mnt/btrfs
        $ mkdir -p /mnt/btrfs/a/b/bar1/bar2/bar3/bar4
        $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
        $ mv /mnt/btrfs/a/b/bar1/bar2/bar3/bar4 /mnt/btrfs/a/b/k44
        $ mv /mnt/btrfs/a/b/bar1/bar2/bar3 /mnt/btrfs/a/b/k44
        $ mv /mnt/btrfs/a/b/bar1/bar2 /mnt/btrfs/a/b/k44/bar3
        $ mv /mnt/btrfs/a/b/bar1 /mnt/btrfs/a/b/k44/bar3/bar2/k11
        $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
        $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
      
      A patch to update the test btrfs/030 from xfstests, so that it covers
      this case, will be submitted soon.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix send dealing with file renames and directory moves · 03cb4fb9
      Filipe David Borba Manana authored
      This fixes a case that the commit titled:
      
         Btrfs: fix infinite path build loops in incremental send
      
      didn't cover. If the parent-child relationship between 2 directories
      is inverted, both get renamed, and the former parent has a file that
      got renamed too (but remains a child of that directory), the incremental
      send operation would use the file's old path after sending an unlink
      operation for that old path, causing receive to fail on future operations
      like changing owner, permissions or utimes of the corresponding inode.
      
      This is not a regression from the commit mentioned before, as without
      that commit we would fall into the issues that commit fixed, so it's
      just one case that wasn't covered before.
      
      Simple steps to reproduce this issue are:
      
            $ mkfs.btrfs -f /dev/sdb3
            $ mount /dev/sdb3 /mnt/btrfs
            $ mkdir -p /mnt/btrfs/a/b/c/d
            $ touch /mnt/btrfs/a/b/c/d/file
            $ mkdir -p /mnt/btrfs/a/b/x
            $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
            $ mv /mnt/btrfs/a/b/x /mnt/btrfs/a/b/c/x2
            $ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c/x2/d2
            $ mv /mnt/btrfs/a/b/c/x2/d2/file /mnt/btrfs/a/b/c/x2/d2/file2
            $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
            $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
      
      A patch to update the test btrfs/030 from xfstests, so that it covers
      this case, will be submitted soon.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: only add roots if necessary in find_parent_nodes() · 98cfee21
      Wang Shilong authored
      find_all_leafs() doesn't actually need to add all roots; add roots only
      when needed, which avoids an unnecessary ulist dance.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl · abccd00f
      Hugo Mills authored
      The structure for BTRFS_SET_RECEIVED_IOCTL packs differently on 32-bit
      and 64-bit systems. This means that it is impossible to use btrfs
      receive on a system with a 64-bit kernel and 32-bit userspace, because
      the structure size (and hence the ioctl number) is different.
      
      This patch adds a compatibility structure and ioctl to deal with the
      above case.
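
How a structure can pack differently is easiest to see on a small example. The sketch below is an assumption for illustration only: it uses a hypothetical timestamp-like member (a 64-bit field followed by a 32-bit field), not the real ioctl argument structure. With natural alignment such a struct is typically 16 bytes on x86-64 and 12 bytes on 32-bit x86, while a packed variant has the same size on both, which is the kind of property a compatibility structure needs. `__attribute__((packed))` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical timestamp member, laid out with natural alignment. */
struct ts_natural {
    uint64_t sec;
    uint32_t nsec;
};

/* Same members with padding disabled (GCC/Clang attribute). */
struct ts_packed {
    uint64_t sec;
    uint32_t nsec;
} __attribute__((packed));

/* Size depends on the ABI's alignment rules for 64-bit integers. */
size_t ts_natural_size(void)
{
    return sizeof(struct ts_natural);
}

/* Size is the plain sum of the member sizes on every ABI. */
size_t ts_packed_size(void)
{
    return sizeof(struct ts_packed);
}
```

Because the argument size is part of the ioctl number, a size that depends on the ABI means the two userspace flavours end up issuing different ioctl numbers for what is logically the same call.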
      Signed-off-by: Hugo Mills <hugo@carfax.org.uk>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: add missing error check in incremental send · d86477b3
      Filipe David Borba Manana authored
      Function wait_for_parent_move() returns negative value if an error
      happened, 0 if we don't need to wait for the parent's move, and
      1 if the wait is needed.
      Before this change an error return value was being treated like the
      return value 1, which was not correct.
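
The three-way return convention can be sketched as below. The stub and the `classify` helper are hypothetical; only the meaning of the return values (negative, 0, 1) comes from the message above. A plain truthiness test such as `if (ret)` would be one way to end up treating an error like the value 1, which is an assumption about the old code rather than a quote from it.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in: <0 is an error, 0 means no wait, 1 means wait. */
int wait_for_parent_move_stub(int simulated)
{
    return simulated;
}

enum action { ACT_ERROR, ACT_PROCESS_NOW, ACT_DEFER };

/* Check the error case before interpreting the positive value. */
enum action classify(int simulated, int *err)
{
    int ret = wait_for_parent_move_stub(simulated);

    *err = 0;
    if (ret < 0) {
        *err = ret;             /* propagate the error code */
        return ACT_ERROR;
    }
    return ret == 1 ? ACT_DEFER : ACT_PROCESS_NOW;
}
```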
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix use-after-free in the finishing procedure of the device replace · c404e0dc
      Miao Xie authored
      During a device replace test, we hit a null pointer dereference (it was very
      easy to reproduce by running xfstests' btrfs/011 on devices with the virtio
      scsi driver). There were two bugs that caused this problem:
      - We might allocate new chunks on the replaced device after we updated
        the mapping tree, and we forgot to replace the source device in the
        mappings of those new chunks.
      - We might get mapping information that includes the source device
        before the mapping information is updated, and then submit a bio based
        on that mapping information after we freed the source device.
      
      For the first bug, we can fix it by doing the mapping tree update and the
      source device removal in the same chunk mutex context. The chunk mutex
      protects the allocable device list, so this prevents new chunk
      allocation in between, and after we remove the source device all new
      chunks will be allocated on the new device.
      
      For the second bug, we need to make sure all in-flight bios are finished and
      no new bios are produced while we are removing the source device. To fix
      this problem, we introduced a global @bio_counter. We not only inc/dec
      @bio_counter outside of map_blocks, but also inc it before submitting a bio
      and dec it when the bio ends.
      
      Since RAID56 is a little different and device replace doesn't support RAID56
      yet, it is not addressed in this patch, and I added comments to make sure we
      will fix it in the future.
      Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace · 391cd9df
      Miao Xie authored
      The alloc list of the filesystem is protected by ->chunk_mutex, so we need
      to take that mutex when we insert the new device into the list.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Return EXDEV for cross file system snapshot · 23ad5b17
      Kusanagi Kouichi authored
      EXDEV seems an appropriate error if an operation fails because it
      crosses file system boundaries.
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: don't mix the ordered extents of all files together during logging the inodes · 827463c4
      Miao Xie authored
      There was a problem in the old code:
      If we failed to log the csum, we would free all the ordered extents in the log
      list, including those that had been logged successfully, which would make the
      log committer not wait for the completion of those ordered extents.

      With this patch we no longer insert the ordered extents that are about to be
      logged into a global list; instead, we insert them into a local list. If we log
      the ordered extents successfully, we splice them onto the global list; otherwise
      we throw them away and then do a full sync. This also reduces lock contention
      and list traversal time.
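
The collect-locally-then-publish pattern described above can be sketched as follows. This is a simplified illustration, not the tree-log code: `struct id_list`, `list_splice_all` and `log_batch` are hypothetical, a fixed-size array stands in for the kernel's linked lists, and a `fail_at` index simulates a logging failure.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 16

/* Hypothetical fixed-size list standing in for the kernel's list_head. */
struct id_list {
    int ids[MAX_ITEMS];
    size_t count;
};

/* Append every entry of src to dst and empty src; -1 if dst is too small. */
int list_splice_all(struct id_list *dst, struct id_list *src)
{
    size_t i;

    if (dst->count + src->count > MAX_ITEMS)
        return -1;
    for (i = 0; i < src->count; i++)
        dst->ids[dst->count++] = src->ids[i];
    src->count = 0;
    return 0;
}

/*
 * Collect a batch on a local list. Publish it to the global list only if
 * every item was logged; otherwise drop the batch and report failure so
 * the caller can fall back to a full sync. fail_at < 0 means no failure.
 */
int log_batch(struct id_list *global, const int *ids, size_t n, long fail_at)
{
    struct id_list local = { {0}, 0 };
    size_t i;

    if (n > MAX_ITEMS)
        return -1;
    for (i = 0; i < n; i++) {
        if ((long)i == fail_at)
            return -1;          /* local batch is discarded */
        local.ids[local.count++] = ids[i];
    }
    return list_splice_all(global, &local);
}
```

A failed batch never touches the global list, so entries published by earlier successful batches stay in place.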
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
  2. 15 Feb, 2014 2 commits
    • Btrfs: use right clone root offset for compressed extents · 93de4ba8
      Filipe David Borba Manana authored
      For non compressed extents, iterate_extent_inodes() gives us offsets
      that take into account the data offset from the file extent items, while
      for compressed extents it doesn't. Therefore we have to adjust them before
      placing them in a send clone instruction. Not doing this adjustment leads to
      the receiving end passing a wrong file range to the clone ioctl,
      which results in different file content from the one in the original send
      root.
      
      Issue reproducible with the following excerpt from the test I made for
      xfstests:
      
        _scratch_mkfs
        _scratch_mount "-o compress-force=lzo"
      
        $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo
      
        $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1
      
        $XFS_IO_PROG -c "pwrite -S 0x3e -b 80000 200000 80000" $SCRATCH_MNT/foo
        $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT
        $XFS_IO_PROG -c "pwrite -S 0xdc -b 10000 250000 10000" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "pwrite -S 0xff -b 10000 300000 10000" $SCRATCH_MNT/foo
      
        # will be used for incremental send to be able to issue clone operations
        $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/clones_snap
      
        $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2
      
        $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1
        $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \
            -x $SCRATCH_MNT/mysnap2/clones_snap $SCRATCH_MNT/mysnap2
        $FSSUM_PROG -A -f -w $tmp/clones.fssum $SCRATCH_MNT/clones_snap \
            -x $SCRATCH_MNT/clones_snap/mysnap1 -x $SCRATCH_MNT/clones_snap/mysnap2
      
        $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap
        $BTRFS_UTIL_PROG send $SCRATCH_MNT/clones_snap -f $tmp/clones.snap
        $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 \
            -c $SCRATCH_MNT/clones_snap $SCRATCH_MNT/mysnap2 -f $tmp/2.snap
      
        _scratch_unmount
        _scratch_mkfs
        _scratch_mount
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
        $FSSUM_PROG -r $tmp/1.fssum $SCRATCH_MNT/mysnap1 2>> $seqres.full
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/clones.snap
        $FSSUM_PROG -r $tmp/clones.fssum $SCRATCH_MNT/clones_snap 2>> $seqres.full
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
        $FSSUM_PROG -r $tmp/2.fssum $SCRATCH_MNT/mysnap2 2>> $seqres.full
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: fix null pointer dereference at btrfs_sysfs_add_one+0x105 · f085381e
      Anand Jain authored
      bdev is NULL when a disk has disappeared and the filesystem is mounted
      with the degraded option
      
      stack trace
      ---------
      btrfs_sysfs_add_one+0x105/0x1c0 [btrfs]
      open_ctree+0x15f3/0x1fe0 [btrfs]
      btrfs_mount+0x5db/0x790 [btrfs]
      ? alloc_pages_current+0xa4/0x160
      mount_fs+0x34/0x1b0
      vfs_kern_mount+0x62/0xf0
      do_mount+0x22e/0xa80
      ? __get_free_pages+0x9/0x40
      ? copy_mount_options+0x31/0x170
      SyS_mount+0x7e/0xc0
      system_call_fastpath+0x16/0x1b
      ---------
      
      reproducer:
      -------
      mkfs.btrfs -draid1 -mraid1 /dev/sdc /dev/sdd
      (detach a disk)
      devmgt detach /dev/sdc [1]
      mount -o degraded /dev/sdd /btrfs
      -------
      
      [1] github.com/anajain/devmgt.git
      Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
      Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  3. 14 Feb, 2014 4 commits
    • Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol · 3a0dfa6a
      Josef Bacik authored
      A user was running into errors from an NFS export of a subvolume that had a
      default subvol set.  When we mount a default subvol we will use d_obtain_alias()
      to find an existing dentry for the subvolume in the case that the root subvol
      has already been mounted, or a dummy one is allocated in the case that the root
      subvol has not already been mounted.  This allows us to connect the dentry later
      on if we wander into the path.  However if we don't ever wander into the path we
      will keep DCACHE_DISCONNECTED set for a long time, which angers NFS.  It doesn't
      appear to cause any problems but it is annoying nonetheless, so simply unset
      DCACHE_DISCONNECTED in the get_default_root case and switch btrfs_lookup() to
      use d_materialise_unique() instead which will make everything play nicely
      together and reconnect stuff if we wander into the default subvol path from a
      different way.  With this patch I'm no longer getting the NFS errors when
      exporting a volume that has been mounted with a default subvol set.  Thanks,
      
      cc: bfields@fieldses.org
      cc: ebiederm@xmission.com
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix max_inline mount option · feb5f965
      Mitch Harder authored
      Currently, the only mount option for max_inline that has any effect is
      max_inline=0.  Any other value that is supplied to max_inline will be
      adjusted to a minimum of 4k.  Since max_inline has an effective maximum
      of ~3900 bytes due to page size limitations, the current behaviour
      only has meaning for max_inline=0.
      
      This patch will allow the max_inline mount option to accept non-zero
      values as indicated in the documentation.
      Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: Chris Mason <clm@fb.com>
      feb5f965
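The clamping behaviour described above can be sketched in user-space C. This is a minimal model, not the kernel code: the function names are hypothetical, the 4096-byte sector size is taken from the "4k" in the message, and capping a non-zero request at the sector size after the fix is an assumption about how "accept non-zero values" is realised.

```c
#include <assert.h>
#include <stdint.h>

/* Old behaviour as described in the message: any non-zero request is
 * raised to at least one sector, so only max_inline=0 had an effect. */
uint64_t max_inline_old(uint64_t requested, uint64_t sectorsize)
{
    assert(sectorsize > 0);
    if (requested == 0)
        return 0;
    return requested > sectorsize ? requested : sectorsize;
}

/* Assumed behaviour after the fix: a non-zero request is kept as given,
 * capped at the sector size (the cap itself is an assumption here). */
uint64_t max_inline_new(uint64_t requested, uint64_t sectorsize)
{
    assert(sectorsize > 0);
    if (requested == 0)
        return 0;
    return requested < sectorsize ? requested : sectorsize;
}
```

With a 4096-byte sector, a request of 2048 comes back as 4096 from the old model and as 2048 from the new one, which is the difference the commit message describes.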
    • Btrfs: fix a lockdep warning when cleaning up aborted transaction · a9d2d4ad
      Liu Bo authored
      Now that we have two spinlocks for managing delayed refs,
      CONFIG_DEBUG_SPINLOCK=y helped me find this:
      
      [ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258
      [ 4723.414882]  lock: 0xffff880048377670, .magic: dead4ead, .owner: btrfs-transacti/2258, .owner_cpu: 2
      [ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: G        W  O 3.12.0+ #4
      [ 4723.421321] Call Trace:
      [ 4723.421872]  [<ffffffff81680fe7>] dump_stack+0x54/0x74
      [ 4723.422753]  [<ffffffff81681093>] spin_dump+0x8c/0x91
      [ 4723.424979]  [<ffffffff816810b9>] spin_bug+0x21/0x26
      [ 4723.425846]  [<ffffffff81323956>] do_raw_spin_unlock+0x66/0x90
      [ 4723.434424]  [<ffffffff81689bf7>] _raw_spin_unlock+0x27/0x40
      [ 4723.438747]  [<ffffffffa015da9e>] btrfs_cleanup_one_transaction+0x35e/0x710 [btrfs]
      [ 4723.443321]  [<ffffffffa015df54>] btrfs_cleanup_transaction+0x104/0x570 [btrfs]
      [ 4723.444692]  [<ffffffff810c1b5d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
      [ 4723.450336]  [<ffffffff810c1c2d>] ? trace_hardirqs_on+0xd/0x10
      [ 4723.451332]  [<ffffffffa015e5ee>] transaction_kthread+0x22e/0x270 [btrfs]
      [ 4723.452543]  [<ffffffffa015e3c0>] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs]
      [ 4723.457833]  [<ffffffff81079efa>] kthread+0xea/0xf0
      [ 4723.458990]  [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
      [ 4723.460133]  [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
      [ 4723.460865]  [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
      [ 4723.496521] ------------[ cut here ]------------
      
      ----------------------------------------------------------------------
      
      The reason is that we call cond_resched_lock(&head_ref->lock) while
      still holding @delayed_refs->lock.
      
      This differs from __btrfs_run_delayed_refs(), where we do the drop-acquire
      dance before and after actually processing delayed refs.
      
      Here we don't drop the lock, so others are not able to add new delayed refs to
      head_ref, and cond_resched_lock(&head_ref->lock) is not necessary.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      a9d2d4ad
    • Revert "btrfs: add ioctl to export size of global metadata reservation" · 11bcac89
      Chris Mason authored
      This reverts commit 01e219e8.
      
      David Sterba found a different way to provide these features without adding a new
      ioctl.  We haven't released any progs with this ioctl yet, so I'm taking this out
      for now until we finalize things.
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.cz>
      CC: Jeff Mahoney <jeffm@suse.com>
      11bcac89
  4. 09 Feb, 2014 5 commits
    • Btrfs: fix data corruption when reading/updating compressed extents · a2aa75e1
      Filipe David Borba Manana authored
      When using a mix of compressed file extents and prealloc extents, it
      is possible to fill a page of a file with random, garbage data from
      some unrelated previous use of the page, instead of a sequence of zeroes.
      
      A simple sequence of steps to get into such a case, taken from the test
      case I made for xfstests, is:
      
         _scratch_mkfs
         _scratch_mount "-o compress-force=lzo"
         $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar
         $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar
         $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar
         $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
      
      This results in the following file items in the fs tree:
      
         item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
             inode generation 6 transid 6 size 542872 block group 0 mode 100600
         item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16
             inode ref index 2 namelen 6 name: foobar
         item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
             extent data disk byte 0 nr 0 gen 6
             extent data offset 0 nr 24576 ram 266240
             extent compression 0
         item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53
             prealloc data disk byte 12849152 nr 241664 gen 6
             prealloc data offset 0 nr 241664
         item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53
             extent data disk byte 12845056 nr 4096 gen 6
             extent data offset 0 nr 20480 ram 20480
             extent compression 2
         item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53
             prealloc data disk byte 13090816 nr 405504 gen 6
             prealloc data offset 0 nr 258048
      
      The on disk extent at offset 266240 (which corresponds to 1 single disk block),
      contains 5 compressed chunks of file data. Each of the first 4 compress 4096
      bytes of file data, while the last one only compresses 3024 bytes of file data.
      Therefore a read into the file region [285648 ; 286720[ (length = 4096 - 3024 =
      1072 bytes) should always return zeroes (our next extent is a prealloc one).
      
      The solution here is for the compression code path to zero the remaining (untouched)
      bytes of the last page it uncompressed data into, as the information about how
      much space the file data consumes in the last page is not known in the upper layer
      fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly zeroing
      the remainder of the page but only if it corresponds to the last page of the inode
      and if the inode's size is not a multiple of the page size.
      
      This would cause not only returning random data on reads, but also permanently
      storing random data when updating parts of the region that should be zeroed.
      For the example above, it means updating a single byte in the region [285648 ; 286720[
      would store that byte correctly but also store random data on disk.
      
      A test case for xfstests follows soon.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      a2aa75e1
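The arithmetic and the zeroing step from the message above can be illustrated with a small user-space sketch. It is not the btrfs decompression code: `TOY_PAGE_SIZE`, `data_end()` and `zero_page_tail()` are hypothetical names, and a 4096-byte page is assumed to match the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u  /* assumed page size, as in the example above */

/* File offset at which the uncompressed data of an extent ends, given
 * the number of full 4096-byte chunks and the size of the last chunk. */
unsigned long data_end(unsigned long extent_start,
                       unsigned long full_chunks,
                       unsigned long last_chunk_bytes)
{
    return extent_start + full_chunks * TOY_PAGE_SIZE + last_chunk_bytes;
}

/* Zero the part of the last page that decompression did not write, so
 * stale contents from a previous use of the page never reach the reader. */
void zero_page_tail(unsigned char *page, size_t bytes_used)
{
    assert(bytes_used <= TOY_PAGE_SIZE);
    memset(page + bytes_used, 0, TOY_PAGE_SIZE - bytes_used);
}
```

For the extent at offset 266240 with four full chunks and a final chunk of 3024 bytes, `data_end()` gives 285648, and the 1072 bytes up to 286720 are the ones `zero_page_tail()` clears.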
    • Btrfs: don't loop forever if we can't run because of the tree mod log · 27a377db
      Josef Bacik authored
      A user reported a 100% cpu hang with my new delayed ref code.  Turns out I
      forgot to increase the count check when we can't run a delayed ref because of
      the tree mod log.  If we can't run any delayed refs during this there is no
      point in continuing to look, and we need to break out.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      27a377db
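A toy model of the loop described above shows why the missing increment matters. Everything here is hypothetical (`struct toy_ref`, `run_refs()`); it only mirrors the idea that a ref which cannot run yet must still count toward the loop's limit, otherwise a set of permanently blocked refs keeps the loop spinning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a delayed ref. */
struct toy_ref {
    bool blocked;  /* cannot run yet, e.g. held back by the tree mod log */
    bool done;     /* already processed */
};

/* Make up to nr attempts over the refs, cycling through the array.
 * Returns how many refs were actually run. */
size_t run_refs(struct toy_ref *refs, size_t n, size_t nr)
{
    size_t count = 0, ran = 0, i = 0;

    assert(refs != NULL || n == 0);
    if (n == 0)
        return 0;

    while (count < nr) {
        struct toy_ref *ref = &refs[i];

        i = (i + 1) % n;
        if (ref->blocked) {
            count++;  /* without this, all-blocked input never terminates */
            continue;
        }
        if (!ref->done) {
            ref->done = true;
            ran++;
        }
        count++;
    }
    return ran;
}
```

If every ref is blocked, `run_refs()` still returns after `nr` attempts with zero refs run, which is the "break out" behaviour the message asks for.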
    • btrfs: reserve no transaction units in btrfs_ioctl_set_features · 8051aa1a
      David Sterba authored
      The superblock modifications added in the patch "btrfs: add ioctls to
      query/change feature bits online" don't need to reserve metadata blocks
      when starting a transaction.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      8051aa1a
    • btrfs: commit transaction after setting label and features · d0270aca
      Jeff Mahoney authored
      The set_fslabel ioctl uses btrfs_end_transaction, which means it's
      possible that the change will be lost if the system crashes, same for
      the newly set features. Let's use btrfs_commit_transaction instead.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      d0270aca
    • Btrfs: fix assert screwup for the pending move stuff · 6cc98d90
      Josef Bacik authored
      Wang noticed that he was failing btrfs/030 even though Filipe and I couldn't
      reproduce it.  Turns out this is because Wang didn't have CONFIG_BTRFS_ASSERT set,
      which meant that a key part of Filipe's original patch was not being built in.
      This appears to be a mistake made while merging Filipe's patch, as it does not exist in
      his original patch.  Fix this by changing how we assert that del_waiting_dir_move
      did not error and by taking the function out of the ifdef check.
      This makes btrfs/030 pass with the assert on or off.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Filipe Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      6cc98d90
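The class of bug described above, where required work disappears when assertions are configured out, can be sketched as follows. This is an illustration under assumptions, not the send code: `TOY_ASSERT`, `pending`, `del_waiting()` and the two cleanup functions are hypothetical, and the `ASSERT` macro here simply expands to nothing when the option is off.

```c
/* Hypothetical debug macro: checks only when TOY_ASSERT is defined. */
#ifdef TOY_ASSERT
#include <stdlib.h>
#define ASSERT(expr) do { if (!(expr)) abort(); } while (0)
#else
#define ASSERT(expr) ((void)0)
#endif

int pending = 1;  /* one entry waiting to be removed */

/* Remove the pending entry; returns 0 on success, -1 if none exists. */
int del_waiting(void)
{
    if (!pending)
        return -1;
    pending = 0;
    return 0;
}

/* Buggy shape: the call lives inside the macro argument, so with
 * assertions off the removal itself is never executed. */
void cleanup_buggy(void)
{
    ASSERT(del_waiting() == 0);
}

/* Fixed shape: the call always runs; only the check is optional. */
void cleanup_fixed(void)
{
    int ret = del_waiting();

    ASSERT(ret == 0);
    (void)ret;  /* silence unused-variable warnings when ASSERT is empty */
}
```

Built without `TOY_ASSERT`, `cleanup_buggy()` leaves `pending` set while `cleanup_fixed()` clears it, so a test depending on the removal behaves differently depending on the build option, as happened with btrfs/030.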