1. 10 Mar, 2014 40 commits
    • Qu Wenruo's avatar
      btrfs: Cleanup the unused struct async_sched. · f5961d41
      Qu Wenruo authored
      The struct async_sched is not used by any codes and can be removed.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      f5961d41
    • Liu Bo's avatar
      Btrfs: skip search tree for REG files · 644d1940
      Liu Bo authored
      It is really unnecessary to search tree again for @gen, @mode and @rdev
      in the case of REG inodes' creation, as we've got btrfs_inode_item in sctx,
      and @gen, @mode and @rdev can easily be fetched.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      644d1940
    • Miao Xie's avatar
      Btrfs: fix preallocate vs double nocow write · 7b2b7085
      Miao Xie authored
      We can not release the reserved metadata space for the first write if we
      find the write position is pre-allocated. Because the kernel might write
      the data on the disk before we do the second write but after the can-nocow
      check, if we release the space for the first write, we might fail to update
      the metadata because of no space.
      
      Fix this problem by end nocow write if there is dirty data in the range whose
      space is pre-allocated.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      7b2b7085
    • Miao Xie's avatar
      Btrfs: fix wrong lock range and write size in check_can_nocow() · c933956d
      Miao Xie authored
      The write range may not be sector-aligned, for example:
      
             |--------|--------|	<- write range, sector-unaligned, size: 2blocks
        |--------|--------|--------|  <- correct lock range, size: 3blocks
      
      But according to the old code, we used the size of write range to calculate
      the lock range directly, not considered the offset, we would get a wrong lock
      range:
      
             |--------|--------|	<- write range, sector-unaligned, size: 2blocks
        |--------|--------|		<- wrong lock range, size: 2blocks
      
      And besides that, the old code also had the same problem when calculating
      the real write size. Correct them.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      c933956d
    • David Sterba's avatar
      9c9ca00b
    • David Sterba's avatar
      btrfs: send: fix old buffer length in fs_path_ensure_buf · 1b2782c8
      David Sterba authored
      In "btrfs: send: lower memory requirements in common case" the code to
      save the old_buf_len was incorrectly moved to a wrong place and broke
      the original logic.
      Reported-by: default avatarFilipe David Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Reviewed-by: default avatarFilipe David Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      1b2782c8
    • Filipe Manana's avatar
      Btrfs: more efficient btrfs_drop_extent_cache · 176840b3
      Filipe Manana authored
      While droping extent map structures from the extent cache that cover our
      target range, we would remove each extent map structure from the red black
      tree and then add either 1 or 2 new extent map structures if the former
      extent map covered sections outside our target range.
      
      This change simply attempts to replace the existing extent map structure
      with a new one that covers the subsection we're not interested in, instead
      of doing a red black remove operation followed by an insertion operation.
      
      The number of elements in an inode's extent map tree can get very high for large
      files under random writes. For example, while running the following test:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G \
              --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
              --max-requests=500000 --file-rw-ratio=2 [prepare|run]
      
      I captured the following histogram capturing the number of extent_map items
      in the red black tree while that test was running:
      
          Count: 122462
          Range:  1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981
          Percentiles:  90th: 160120.000; 95th: 166335.000; 99th: 171070.000
             1.000 -    5.231:   452 |
             5.231 -  187.392:    87 |
           187.392 -  585.911:   206 |
           585.911 - 1827.438:   623 |
          1827.438 - 5695.245:  1962 #
          5695.245 - 17744.861:  6204 ####
         17744.861 - 55283.764: 21115 ############
         55283.764 - 172231.000: 91813 #####################################################
      
      Benchmark:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \
              --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \
              --file-io-mode=sync --file-fsync-freq=0 [prepare|run]
      
      Before this change: 122.1Mb/sec
      After this change:  125.07Mb/sec
      (averages of 5 test runs)
      
      Test machine: quad core intel i5-3570K, 32Gb of ram, SSD
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      176840b3
    • Filipe Manana's avatar
      Btrfs: more efficient split extent state insertion · f2071b21
      Filipe Manana authored
      When we split an extent state there's no need to start the rbtree search
      from the root node - we can start it from the original extent state node,
      since we would end up in its subtree if we do the search starting at the
      root node anyway.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      f2071b21
    • Filipe Manana's avatar
      Btrfs: remove unneeded field / smaller extent_map structure · cbc0e928
      Filipe Manana authored
      We don't need to have an unsigned int field in the extent_map struct
      to tell us whether the extent map is in the inode's extent_map tree or
      not. We can use the rb_node struct field and the RB_CLEAR_NODE and
      RB_EMPTY_NODE macros to achieve the same task.
      
      This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a
      64 bits system).
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      cbc0e928
    • Wang Shilong's avatar
      Btrfs: skip locking when searching commit root · e84752d4
      Wang Shilong authored
      We won't change commit root, skip locking dance with commit root
      when walking backrefs, this can speed up btrfs send operations.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      e84752d4
    • Wang Shilong's avatar
      Btrfs: wake up @scrub_pause_wait as much as we can · 32a44789
      Wang Shilong authored
      check if @scrubs_running=@scrubs_paused condition inside wait_event()
      is not an atomic operation which means we may inc/dec @scrub_running/
      paused at any time. Let's wake up @scrub_pause_wait as much as we can
      to let commit transaction blocked less.
      
      An example below:
      
      Thread1				Thread2
      |->scrub_blocked_if_needed()	|->scrub_pending_trans_workers_inc
        |->increase @scrub_paused
                                             |->increase @scrub_running
        |->wake up scrub_pause_wait list
                                             |->scrub blocked
                                             |->increase @scrub_paused
      
      Thread3 is commiting transaction which is blocked at btrfs_scrub_pause().
      So after Thread2 increase @scrub_paused, we meet the condition
      @scrub_paused=@scrub_running, but transaction will be still blocked until
      another calling to wake up @scrub_pause_wait.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      32a44789
    • Wang Shilong's avatar
      Btrfs: cancel scrub on transaction abortion · c0af8f0b
      Wang Shilong authored
      If we fail to commit transaction, we'd better
      cancel scrub operations.
      Suggested-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      c0af8f0b
    • Wang Shilong's avatar
      Btrfs: device_replace: fix deadlock for nocow case · 12cf9372
      Wang Shilong authored
      commit cb7ab021 cause a following deadlock found by
      xfstests,btrfs/011:
      
      Thread1 is commiting transaction which is blocked at
      btrfs_scrub_pause().
      
      Thread2 is calling btrfs_file_aio_write() which has held
      inode's @i_mutex and commit transaction(blocked because
      Thread1 is committing transaction).
      
      Thread3 is copy_nocow_page worker which will also try to
      hold inode @i_mutex, so thread3 will wait Thread1 finished.
      
      Thread4 is waiting pending workers finished which will wait
      Thread3 finished. So the problem is like this:
      
      Thread1--->Thread4--->Thread3--->Thread2---->Thread1
      
      Deadlock happens! we fix it by letting Thread1 go firstly,
      which means we won't block transaction commit while we are
      waiting pending workers finished.
      Reported-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      12cf9372
    • Wang Shilong's avatar
      Btrfs: fix a possible deadlock between scrub and transaction committing · 6cf7f77e
      Wang Shilong authored
      btrfs_scrub_continue() will be called when cleaning up transaction.However,
      this can only be called if btrfs_scrub_pause() is called before.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      6cf7f77e
    • Sachin Kamat's avatar
      btrfs: Use PTR_ERR_OR_ZERO · 886322e8
      Sachin Kamat authored
      PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it
      also include missing err.h header.
      Signed-off-by: default avatarSachin Kamat <sachin.kamat@linaro.org>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      886322e8
    • Filipe Manana's avatar
      Btrfs: fix send issuing outdated paths for utimes, chown and chmod · bf0d1f44
      Filipe Manana authored
      When doing an incremental send, if we had a directory pending a move/rename
      operation and none of its parents, except for the immediate parent, were
      pending a move/rename, after processing the directory's references, we would
      be issuing utimes, chown and chmod intructions against am outdated path - a
      path which matched the one in the parent root.
      
      This change also simplifies a bit the code that deals with building a path
      for a directory which has a move/rename operation delayed.
      
      Steps to reproduce:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b/c/d/e
          $ mkdir /mnt/btrfs/a/b/c/f
          $ chmod 0777 /mnt/btrfs/a/b/c/d/e
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ mv /mnt/btrfs/a/b/c/f /mnt/btrfs/a/b/f2
          $ mv /mnt/btrfs/a/b/c/d/e /mnt/btrfs/a/b/f2/e2
          $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2
          $ mv /mnt/btrfs/a/b/c2/d /mnt/btrfs/a/b/c2/d2
          $ chmod 0700 /mnt/btrfs/a/b/f2/e2
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umount /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
      
          ERROR: chmod a/b/c/d/e failed. No such file or directory
      
      A test case for xfstests follows.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      bf0d1f44
    • Filipe Manana's avatar
      Btrfs: correctly determine if blocks are shared in btrfs_compare_trees · 6baa4293
      Filipe Manana authored
      Just comparing the pointers (logical disk addresses) of the btree nodes is
      not completely bullet proof, we have to check if their generation numbers
      match too.
      
      It is guaranteed that a COW operation will result in a block with a different
      logical disk address than the original block's address, but over time we can
      reuse that former logical disk address.
      
      For example, creating a 2Gb filesystem on a loop device, and having a script
      running in a loop always updating the access timestamp of a file, resulted in
      the same logical disk address being reused for the same fs btree block in about
      only 4 minutes.
      
      This could make us skip entire subtrees when doing an incremental send (which
      is currently the only user of btrfs_compare_trees). However the odds of getting
      2 blocks at the same tree level, with the same logical disk address, equal first
      slot keys and different generations, should hopefully be very low.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      6baa4293
    • Filipe Manana's avatar
      Btrfs: fix send attempting to rmdir non-empty directories · 9dc44214
      Filipe Manana authored
      The incremental send algorithm assumed that it was possible to issue
      a directory remove (rmdir) if the the inode number it was currently
      processing was greater than (or equal) to any inode that referenced
      the directory's inode. This wasn't a valid assumption because any such
      inode might be a child directory that is pending a move/rename operation,
      because it was moved into a directory that has a higher inode number and
      was moved/renamed too - in other words, the case the following commit
      addressed:
      
          9f03740a
          (Btrfs: fix infinite path build loops in incremental send)
      
      This made an incremental send issue an rmdir operation before the
      target directory was actually empty, which made btrfs receive fail.
      Therefore it needs to wait for all pending child directory inodes to
      be moved/renamed before sending an rmdir operation.
      
      Simple steps to reproduce this issue:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b/c/x
          $ mkdir /mnt/btrfs/a/b/y
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ mv /mnt/btrfs/a/b/y /mnt/btrfs/a/b/YY
          $ mv /mnt/btrfs/a/b/c/x /mnt/btrfs/a/b/YY
          $ rmdir /mnt/btrfs/a/b/c
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umount /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
      
          ERROR: rmdir o259-6-0 failed. Directory not empty
      
      A test case for xfstests follows.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      9dc44214
    • Filipe Manana's avatar
      Btrfs: send, don't send rmdir for same target multiple times · 29d6d30f
      Filipe Manana authored
      When doing an incremental send, if we delete a directory that has N > 1
      hardlinks for the same file and that file has the highest inode number
      inside the directory contents, an incremental send would send N times an
      rmdir operation against the directory. This made the btrfs receive command
      fail on the second rmdir instruction, as the target directory didn't exist
      anymore.
      
      Steps to reproduce the issue:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b/c
          $ echo 'ola mundo' > /mnt/btrfs/a/b/c/foo.txt
          $ ln /mnt/btrfs/a/b/c/foo.txt /mnt/btrfs/a/b/c/bar.txt
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ rm -f /mnt/btrfs/a/b/c/foo.txt
          $ rm -f /mnt/btrfs/a/b/c/bar.txt
          $ rmdir /mnt/btrfs/a/b/c
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umount /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
      
          ERROR: rmdir o259-6-0 failed. No such file or directory
      
      A test case for xfstests follows.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      29d6d30f
    • Filipe Manana's avatar
      Btrfs: incremental send, fix invalid path after dir rename · 2b863a13
      Filipe Manana authored
      This fixes yet one more case not caught by the commit titled:
      
         Btrfs: fix infinite path build loops in incremental send
      
      In this case, even before the initial full send, we have a directory
      which is a child of a directory with a higher inode number. Then we
      perform the initial send, and after we rename both the child and the
      parent, without moving them around. After doing these 2 renames, an
      incremental send sent a rename instruction for the child directory
      which contained an invalid "from" path (referenced the parent's old
      name, not the new one), which made the btrfs receive command fail.
      
      Steps to reproduce:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b
          $ mkdir /mnt/btrfs/d
          $ mkdir /mnt/btrfs/a/b/c
          $ mv /mnt/btrfs/d /mnt/btrfs/a/b/c
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/x
          $ mv /mnt/btrfs/a/b/x/d /mnt/btrfs/a/b/x/y
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umout /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
        "ERROR: rename a/b/c/d -> a/b/x/y failed. No such file or directory"
      
      A test case for xfstests follows.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      2b863a13
    • Filipe Manana's avatar
      Btrfs: don't insert useless holes when punching beyond the inode's size · 12870f1c
      Filipe Manana authored
      If we punch beyond the size of an inode, we'll correctly remove any prealloc extents,
      but we'll also insert file extent items representing holes (disk bytenr == 0) that start
      with a key offset that lies beyond the inode's size and are not contiguous with the last
      file extent item.
      
      Example:
      
        $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "fpunch 582007 864596" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo
      
      btrfs-debug-tree output:
      
        item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
      	inode generation 6 transid 6 size 132254 block group 0 mode 100600 links 1
        item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13
      	inode ref index 2 namelen 3 name: foo
        item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
      	extent data disk byte 0 nr 0 gen 6
      	extent data offset 0 nr 90112 ram 122880
      	extent compression 0
        item 7 key (257 EXTENT_DATA 90112) itemoff 15766 itemsize 53
      	extent data disk byte 12845056 nr 4096 gen 6
      	extent data offset 0 nr 45056 ram 45056
      	extent compression 2
        item 8 key (257 EXTENT_DATA 585728) itemoff 15713 itemsize 53
      	extent data disk byte 0 nr 0 gen 6
      	extent data offset 0 nr 860160 ram 860160
      	extent compression 0
      
      The last extent item, which represents a hole, is useless as it lies beyond the inode's
      size.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      12870f1c
    • Filipe Manana's avatar
      Btrfs: cleanup delayed-ref.c:find_ref_head() · 85fdfdf6
      Filipe Manana authored
      The argument last wasn't used, all callers supplied a NULL value
      for it. Also removed unnecessary intermediate storage of the result
      of key comparisons.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      85fdfdf6
    • Filipe Manana's avatar
      Btrfs: remove unnecessary ref heads rb tree search · 6103fb43
      Filipe Manana authored
      When we didn't find the exact ref head we were looking for, if
      return_bigger != 0 we set a new search key to match either the
      next node after the last one we found or the first one in the
      ref heads rb tree, and then did another full tree search. For both
      cases this ended up being pointless as we would end up returning
      an entry we already had before repeating the search.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      6103fb43
    • Justin Maggard's avatar
      btrfs: wake up transaction thread upon remount · 2c6a92b0
      Justin Maggard authored
      Now that we can adjust the commit interval with a remount, we need
      to wake up the transaction thread or else he will continue to sleep
      until the previous transaction interval has elapsed before waking
      up.  So, if we go from a large commit interval to something smaller,
      the transaction thread will not wake up until the large interval has
      expired.  This also causes the cleaner thread to stay sleeping, since
      it gets woken up by the transaction thread.
      
      Fix it by simply waking up the transaction thread during a remount.
      Signed-off-by: default avatarJustin Maggard <jmaggard10@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      2c6a92b0
    • Miao Xie's avatar
      Btrfs: stop joining the log transaction if sync log fails · 50471a38
      Miao Xie authored
      If the log sync fails, there is something wrong in the log tree, we
      should not continue to join the log transaction and log the metadata.
      What we should do is to do a full commit.
      
      This patch fixes this problem by setting ->last_trans_log_full_commit
      to the current transaction id, it will tell the tasks not to join
      the log transaction, and do a full commit.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      50471a38
    • Miao Xie's avatar
      Btrfs: just wait or commit our own log sub-transaction · d1433deb
      Miao Xie authored
      We might commit the log sub-transaction which didn't contain the metadata we
      logged. It was because we didn't record the log transid and just select
      the current log sub-transaction to commit, but the right one might be
      committed by the other task already. Actually, we needn't do anything
      and it is safe that we go back directly in this case.
      
      This patch improves the log sync by the above idea. We record the transid
      of the log sub-transaction in which we log the metadata, and the transid
      of the log sub-transaction we have committed. If the committed transid
      is >= the transid we record when logging the metadata, we just go back.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      d1433deb
    • Miao Xie's avatar
      Btrfs: fix skipped error handle when log sync failed · 8b050d35
      Miao Xie authored
      It is possible that many tasks sync the log tree at the same time, but
      only one task can do the sync work, the others will wait for it. But those
      wait tasks didn't get the result of the log sync, and returned 0 when they
      ended the wait. It caused those tasks skipped the error handle, and the
      serious problem was they told the users the file sync succeeded but in
      fact they failed.
      
      This patch fixes this problem by introducing a log context structure,
      we insert it into the a global list. When the sync fails, we will set
      the error number of every log context in the list, then the waiting tasks
      get the error number of the log context and handle the error if need.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      8b050d35
    • Miao Xie's avatar
      Btrfs: use signed integer instead of unsigned long integer for log transid · bb14a59b
      Miao Xie authored
      The log trans id is initialized to be 0 every time we create a log tree,
      and the log tree need be re-created after a new transaction is started,
      it means the log trans id is unlikely to be a huge number, so we can use
      signed integer instead of unsigned long integer to save a bit space.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      bb14a59b
    • Miao Xie's avatar
      Btrfs: remove unnecessary memory barrier in btrfs_sync_log() · 7483e1a4
      Miao Xie authored
      Mutex unlock implies certain memory barriers to make sure all the memory
      operation completes before the unlock, and the next mutex lock implies memory
      barriers to make sure the all the memory happens after the lock. So it is
      a full memory barrier(smp_mb), we needn't add memory barriers. Remove them.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      7483e1a4
    • Miao Xie's avatar
      Btrfs: don't start the log transaction if the log tree init fails · e87ac136
      Miao Xie authored
      The old code would start the log transaction even the log tree init
      failed, it was unnecessary. Fix it.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      e87ac136
    • Miao Xie's avatar
      Btrfs: fix the skipped transaction commit during the file sync · 48cab2e0
      Miao Xie authored
      We may abort the wait earlier if ->last_trans_log_full_commit was set to
      the current transaction id, at this case, we need commit the current
      transaction instead of the log sub-transaction. But the current code
      didn't tell the caller to do it (return 0, not -EAGAIN). Fix it.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      48cab2e0
    • Miao Xie's avatar
      Btrfs: use ACCESS_ONCE to prevent the optimize accesses to ->last_trans_log_full_commit · 5c902ba6
      Miao Xie authored
      ->last_trans_log_full_commit may be changed by the other tasks without lock,
      so we need prevent the compiler from the optimize access just like
      	tmp = fs_info->last_trans_log_full_commit
      	if (tmp == ...)
      		...
      
      	<do something>
      
      	if (tmp == ...)
      		...
      
      In fact, we need get the new value of ->last_trans_log_full_commit during
      the second access. Fix it by ACCESS_ONCE().
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      5c902ba6
    • Liu Bo's avatar
      Btrfs: avoid warning bomb of btrfs_invalidate_inodes · 7813b3db
      Liu Bo authored
      So after transaction is aborted, we need to cleanup inode resources by
      calling btrfs_invalidate_inodes(), and btrfs_invalidate_inodes() hopes
      roots' refs to be zero in old times and sets a WARN_ON(), however, this
      is not always true within cleaning up transaction, so we get to detect
      transaction abortion and not warn at all.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      7813b3db
    • Liu Bo's avatar
      Btrfs: fix possible deadlock in btrfs_cleanup_transaction · 2a85d9ca
      Liu Bo authored
      [13654.480669] ======================================================
      [13654.480905] [ INFO: possible circular locking dependency detected ]
      [13654.481003] 3.12.0+ #4 Tainted: G        W  O
      [13654.481060] -------------------------------------------------------
      [13654.481060] btrfs-transacti/9347 is trying to acquire lock:
      [13654.481060]  (&(&root->ordered_extent_lock)->rlock){+.+...}, at: [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
      [13654.481060] but task is already holding lock:
      [13654.481060]  (&(&fs_info->ordered_root_lock)->rlock){+.+...}, at: [<ffffffffa02d3015>] btrfs_cleanup_transaction+0x1e5/0x570 [btrfs]
      [13654.481060] which lock already depends on the new lock.
      
      [13654.481060] the existing dependency chain (in reverse order) is:
      [13654.481060] -> #1 (&(&fs_info->ordered_root_lock)->rlock){+.+...}:
      [13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
      [13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
      [13654.481060]        [<ffffffffa02f011b>] __btrfs_add_ordered_extent+0x39b/0x450 [btrfs]
      [13654.481060]        [<ffffffffa02f0202>] btrfs_add_ordered_extent+0x32/0x40 [btrfs]
      [13654.481060]        [<ffffffffa02df6aa>] run_delalloc_nocow+0x78a/0x9d0 [btrfs]
      [13654.481060]        [<ffffffffa02dfc0d>] run_delalloc_range+0x31d/0x390 [btrfs]
      [13654.481060]        [<ffffffffa02f7c00>] __extent_writepage+0x310/0x780 [btrfs]
      [13654.481060]        [<ffffffffa02f830a>] extent_write_cache_pages.isra.29.constprop.48+0x29a/0x410 [btrfs]
      [13654.481060]        [<ffffffffa02f879d>] extent_writepages+0x4d/0x70 [btrfs]
      [13654.481060]        [<ffffffffa02d9f68>] btrfs_writepages+0x28/0x30 [btrfs]
      [13654.481060]        [<ffffffff8114be91>] do_writepages+0x21/0x50
      [13654.481060]        [<ffffffff81140d49>] __filemap_fdatawrite_range+0x59/0x60
      [13654.481060]        [<ffffffff81140e13>] filemap_fdatawrite_range+0x13/0x20
      [13654.481060]        [<ffffffffa02f1db9>] btrfs_wait_ordered_range+0x49/0x140 [btrfs]
      [13654.481060]        [<ffffffffa0318fe2>] __btrfs_write_out_cache+0x682/0x8b0 [btrfs]
      [13654.481060]        [<ffffffffa031952d>] btrfs_write_out_cache+0x8d/0xe0 [btrfs]
      [13654.481060]        [<ffffffffa02c7083>] btrfs_write_dirty_block_groups+0x593/0x680 [btrfs]
      [13654.481060]        [<ffffffffa0345307>] commit_cowonly_roots+0x14b/0x20d [btrfs]
      [13654.481060]        [<ffffffffa02d7c1a>] btrfs_commit_transaction+0x43a/0x9d0 [btrfs]
      [13654.481060]        [<ffffffffa030061a>] btrfs_create_uuid_tree+0x5a/0x100 [btrfs]
      [13654.481060]        [<ffffffffa02d5a8a>] open_ctree+0x21da/0x2210 [btrfs]
      [13654.481060]        [<ffffffffa02ab6fe>] btrfs_mount+0x68e/0x870 [btrfs]
      [13654.481060]        [<ffffffff811b2409>] mount_fs+0x39/0x1b0
      [13654.481060]        [<ffffffff811cd653>] vfs_kern_mount+0x63/0xf0
      [13654.481060]        [<ffffffff811cfcce>] do_mount+0x23e/0xa90
      [13654.481060]        [<ffffffff811d05a3>] SyS_mount+0x83/0xc0
      [13654.481060]        [<ffffffff81692b52>] system_call_fastpath+0x16/0x1b
      [13654.481060] -> #0 (&(&root->ordered_extent_lock)->rlock){+.+...}:
      [13654.481060]        [<ffffffff810c340a>] __lock_acquire+0x150a/0x1a70
      [13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
      [13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
      [13654.481060]        [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
      [13654.481060]        [<ffffffffa02d35ce>] transaction_kthread+0x22e/0x270 [btrfs]
      [13654.481060]        [<ffffffff81079efa>] kthread+0xea/0xf0
      [13654.481060]        [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
      [13654.481060] other info that might help us debug this:
      
      [13654.481060]  Possible unsafe locking scenario:
      
      [13654.481060]        CPU0                    CPU1
      [13654.481060]        ----                    ----
      [13654.481060]   lock(&(&fs_info->ordered_root_lock)->rlock);
      [13654.481060]				 lock(&(&root->ordered_extent_lock)->rlock);
      [13654.481060]				 lock(&(&fs_info->ordered_root_lock)->rlock);
      [13654.481060]   lock(&(&root->ordered_extent_lock)->rlock);
      [13654.481060]
       *** DEADLOCK ***
      [...]
      
      ======================================================
      
      btrfs_destroy_all_ordered_extents()
      gets &fs_info->ordered_root_lock __BEFORE__ acquiring &root->ordered_extent_lock,
      while btrfs_[add,remove]_ordered_extent()
      acquires &fs_info->ordered_root_lock __AFTER__ getting &root->ordered_extent_lock.
      
      This patch fixes the above problem.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      2a85d9ca
    • Filipe David Borba Manana's avatar
      Btrfs: faster/more efficient insertion of file extent items · d5f37527
      Filipe David Borba Manana authored
      This is an extension to my previous commit titled:
      
        "Btrfs: faster file extent item replace operations"
        (hash 1acae57b)
      
      Instead of inserting the new file extent item if we deleted existing
      file extent items covering our target file range, also allow to insert
      the new file extent item if we didn't find any existing items to delete
      and replace_extent != 0, since in this case our caller would do another
      tree search to insert the new file extent item anyway, therefore just
      combine the two tree searches into a single one, saving cpu time, reducing
      lock contention and reducing btree node/leaf COW operations.
      
      This covers the case where applications keep doing tail append writes to
      files, which for example is the case of Apache CouchDB (its database and
      view index files are always open with O_APPEND).
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      d5f37527
    • Stanislaw Gruszka's avatar
      btrfs: always choose work from prio_head first · 51b98eff
      Stanislaw Gruszka authored
      In case we do not refill, we can overwrite cur pointer from prio_head
      by one from not prioritized head, what looks as something that was
      not intended.
      
      This change make we always take works from prio_head first until it's
      not empty.
      Signed-off-by: default avatarStanislaw Gruszka <stf_xl@wp.pl>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      51b98eff
    • Wang Shilong's avatar
      Revert "Btrfs: remove transaction from btrfs send" · dcfd5ad2
      Wang Shilong authored
      This reverts commit 41ce9970.
      Previously i was thinking we can use readonly root's commit root
      safely while it is not true, readonly root may be cowed with the
      following cases.
      
      1.snapshot send root will cow source root.
      2.balance,device operations will also cow readonly send root
      to relocate.
      
      So i have two ideas to make us safe to use commit root.
      
      -->approach 1:
      make it protected by transaction and end transaction properly and we research
      next item from root node(see btrfs_search_slot_for_read()).
      
      -->approach 2:
      add another counter to local root structure to sync snapshot with send.
      and add a global counter to sync send with exclusive device operations.
      
      So with approach 2, send can use commit root safely, because we make sure
      send root can not be cowed during send. Unfortunately, it make codes *ugly*
      and more complex to maintain.
      
      To make snapshot and send exclusively, device operations and send operation
      exclusively with each other is a little confusing for common users.
      
      So why not drop into previous way.
      
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      dcfd5ad2
    • Wang Shilong's avatar
      Btrfs: skip readonly root for snapshot-aware defragment · bcbba5e6
      Wang Shilong authored
      Btrfs send is assuming readonly root won't change, let's skip readonly root.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      bcbba5e6
    • Wang Shilong's avatar
      Btrfs: switch to btrfs_previous_extent_item() · 850a8cdf
      Wang Shilong authored
      Since we have introduced btrfs_previous_extent_item() to search previous
      extent item, just switch into it.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      850a8cdf
    • Hidetoshi Seto's avatar
      Btrfs: skip submitting barrier for missing device · f88ba6a2
      Hidetoshi Seto authored
      I got an error on v3.13:
       BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.)
      
      how to reproduce:
        > mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2
        > wipefs -a /dev/sdf2
        > mount -o degraded /dev/sdf1 /mnt
        > btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt
      
      The reason of the error is that barrier_all_devices() failed to submit
      barrier to the missing device.  However it is clear that we cannot do
      anything on missing device, and also it is not necessary to care chunks
      on the missing device.
      
      This patch stops sending/waiting barrier if device is missing.
      Signed-off-by: default avatarHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      f88ba6a2