1. 07 Apr, 2014 10 commits
    • Btrfs: kmalloc() doesn't return an ERR_PTR · 84dbeb87
      Dan Carpenter authored
      The error handling was copied and pasted from memdup_user().  It should
      obviously be checking for NULL.
      
      Fixes: abccd00f ('btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl')
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix snapshot vs nocow writing · e9894fd3
      Wang Shilong authored
      While running fsstress and snapshots concurrently, we hit something
      like the following:
      
      Thread 1			Thread 2
      
      |->fallocate
        |->write pages
          |->join transaction
             |->add ordered extent
          |->end transaction
      				|->flushing data
      				  |->creating pending snapshots
      |->write data into src root's
         fallocated space
      
      After the above workflow finishes, the source and snapshot roots share
      the same space, but the source root has written data into the
      fallocated space. This makes fsck fail to verify checksums for the
      snapshot root's preallocated file extent data. Nocow writing has
      the same problem.
      
      Fix this problem by synchronizing snapshots with nocow writing:
      
       1. For nocow writing, if there are pending snapshots, we fall
          back to the COW path.
      
       2. If there are pending nocow writes, snapshots for this root
          are blocked until the nocow writing finishes.
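
The two rules above can be sketched as a small user-space model. This is only an illustration under stated assumptions: the class `SubvolumeRoot` and its method names are hypothetical and are not the kernel's API, and a real implementation would also need locking and memory barriers.

```python
class SubvolumeRoot:
    """Toy model of one subvolume root (hypothetical, not kernel code)."""

    def __init__(self):
        self.pending_snapshots = 0  # snapshots waiting to be created
        self.nocow_writers = 0      # nocow writes currently in flight

    def start_nocow_write(self):
        # Rule 1: with a snapshot pending, refuse nocow so the caller uses COW.
        if self.pending_snapshots > 0:
            return False
        self.nocow_writers += 1
        return True

    def end_nocow_write(self):
        self.nocow_writers -= 1

    def request_snapshot(self):
        # From now on, new nocow writes on this root are refused.
        self.pending_snapshots += 1

    def can_create_snapshot(self):
        # Rule 2: the snapshot waits until in-flight nocow writes finish.
        return self.nocow_writers == 0
```

In this model a writer that started before the snapshot request keeps the snapshot blocked until it calls `end_nocow_write()`, while a writer arriving after the request falls back to COW.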
      Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Change the expanding write sequence to fix snapshot related bug. · 3ac0d7b9
      Qu Wenruo authored
      When testing fsstress with snapshot creation running in the background,
      some snapshots show the following problem.
      
      Snapshot 270:
      inode 323: size 0
      
      Snapshot 271:
      inode 323: size 349145
      |-------Hole---|---------Empty gap-------|-------Hole-----|
      0	    122880			172032	      349145
      
      Snapshot 272:
      inode 323: size 349145
      |-------Hole---|------------Data---------|-------Hole-----|
      0	    122880			172032	      349145
      
      The fsstress operation on inode 323 is the following:
      write: 		offset 	126832 	len 43124
      truncate: 	size 	349145
      
      The write at this offset consists of 2 operations:
      1. punch hole
      2. write data
      Hole punching is faster than writing data, so the hole punching for the
      write and the truncate is done first, followed by the buffered write.
      As a result snapshot 271 got an empty gap, which does not pass btrfsck.
      
      To fix the bug, this patch changes the write sequence so that it first
      punches a hole covering the write end if a hole is needed.
      Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: make device scan less noisy · 60999ca4
      David Sterba authored
      Print the message only when the device is seen for the first time.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: fix lockdep warning with reclaim lock inversion · ed55b6ac
      Jeff Mahoney authored
      When encountering memory pressure, testers have run into the following
      lockdep warning. It was caused by __link_block_group calling kobject_add
      with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL,
      which gets us into reclaim context. The kobject doesn't actually need
      to be added under the lock -- it just needs to ensure that it's only
      added for the first block group to be linked.
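
The pattern described above (decide under the lock, do the allocating call after dropping it) can be sketched in user space as follows. The names `BlockGroupList`, `link` and `register` are hypothetical; `register` merely stands in for a call such as kobject_add that may allocate memory.

```python
import threading


class BlockGroupList:
    """Toy stand-in for a list of linked block groups (hypothetical)."""

    def __init__(self, register):
        self.lock = threading.Lock()  # plays the role of groups_sem
        self.groups = []
        self.register = register      # may allocate, so keep it out of the lock

    def link(self, group):
        with self.lock:
            first = not self.groups   # decide under the lock...
            self.groups.append(group)
        if first:
            self.register()           # ...but register after dropping it
```

Only the first linked group triggers `register`, and the callback never runs while the lock is held.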
      
      =========================================================
      [ INFO: possible irq lock inversion dependency detected ]
      3.14.0-rc8-default #1 Not tainted
      ---------------------------------------------------------
      kswapd0/169 just changed the state of lock:
       (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa018baea>] __btrfs_release_delayed_node+0x3a/0x200 [btrfs]
      but this lock took another, RECLAIM_FS-unsafe lock in the past:
       (&found->groups_sem){+++++.}
      
      and interrupts could create inverse lock ordering between them.
      
      other info that might help us debug this:
       Possible interrupt unsafe locking scenario:
             CPU0                    CPU1
             ----                    ----
        lock(&found->groups_sem);
                                     local_irq_disable();
                                     lock(&delayed_node->mutex);
                                     lock(&found->groups_sem);
        <Interrupt>
          lock(&delayed_node->mutex);
      
       *** DEADLOCK ***
      2 locks held by kswapd0/169:
       #0:  (shrinker_rwsem){++++..}, at: [<ffffffff81159e8a>] shrink_slab+0x3a/0x160
       #1:  (&type->s_umount_key#27){++++..}, at: [<ffffffff811bac6f>] grab_super_passive+0x3f/0x90
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: hold the commit_root_sem when getting the commit root during send · 3f8a18cc
      Josef Bacik authored
      We currently rely too heavily on roots being read-only to save us from just
      accessing root->commit_root.  We can easily balance blocks out from underneath a
      read only root, so to save us from getting screwed make sure we only access
      root->commit_root under the commit root sem.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: remove transaction from send · 9e351cc8
      Josef Bacik authored
      Let's try this again.  We can deadlock the box if we send on a box and try to
      write onto the same fs with the app that is trying to listen to the send pipe.
      This is because the writer could get stuck waiting for a transaction commit
      which is being blocked by the send.  So fix this by making sure looking at the
      commit roots is always going to be consistent.  We do this by keeping track of
      which roots need to have their commit roots swapped during commit, and then
      taking the commit_root_sem and swapping them all at once.  Then make sure we
      take a read lock on the commit_root_sem in cases where we search the commit root
      to make sure we're always looking at a consistent view of the commit roots.
      Previously we had problems with this because we would swap a fs tree commit root
      and then swap the extent tree commit root independently which would cause the
      backref walking code to screw up sometimes.  With this patch we no longer
      deadlock and pass all the weird send/receive corner cases.  Thanks,
      Reported-by: Hugo Mills <hugo@carfax.org.uk>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: don't clear uptodate if the eb is under IO · a26e8c9f
      Josef Bacik authored
      So I have an awful exercise script that will run snapshot, balance and
      send/receive in parallel.  This sometimes would crash spectacularly and when it
      came back up the fs would be completely hosed.  Turns out this is because of a
      bad interaction of balance and send/receive.  Send will hold onto its entire
      path for the whole send, but its blocks could get relocated out from underneath
      it, and because it doesn't hold tree locks there's nothing to keep this from
      happening.  So it will go to read in a slot with an old transid, and we could
      have re-allocated this block for something else and it could have a completely
      different transid.  But because we think it is invalid we clear uptodate and
      re-read in the block.  If we do this before we actually write out the new block
      we could write back stale data to the fs, and boom we're screwed.
      
      Now we definitely need to fix this disconnect between send and balance, but we
      really really need to not allow ourselves to accidentally read in stale data over
      new data.  So make sure we check if the extent buffer is not under io before
      clearing uptodate, this will kick back EIO to the caller instead of reading in
      stale data and keep us from corrupting the fs.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: check for an extent_op on the locked ref · 573a0755
      Josef Bacik authored
      We could have possibly added an extent_op to the locked_ref while we dropped
      locked_ref->lock, so check for this case as well and loop around.  Otherwise we
      could lose flag updates which would lead to extent tree corruption.  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: do not reset last_snapshot after relocation · ba8b0289
      Josef Bacik authored
      This was done to allow NO_COW to continue to be NO_COW after relocation but it
      is not right.  When relocating we will convert blocks to FULL_BACKREF that we
      relocate.  We can leave some of these full backref blocks behind if they are not
      cow'ed out during the relocation, like if we fail the relocation with ENOSPC and
      then just drop the reloc tree.  Then when we go to cow the block again we won't
      lookup the extent flags because we won't think there has been a snapshot
      recently which means we will do our normal ref drop thing instead of adding back
      a tree ref and dropping the shared ref.  This will cause btrfs_free_extent to
      blow up because it can't find the ref we are trying to free.  This was found
      with my ref verifying tool.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  2. 22 Mar, 2014 1 commit
    • Btrfs: fix a crash of clone with inline extents' split · 00fdf13a
      Liu Bo authored
      xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split
      of inline extents in __btrfs_drop_extents().
      
      For inline extents, we cannot duplicate another EXTENT_DATA item, because
      it breaks the rule of inline extents, that is, 'start offset' needs to be 0.
      
      We have set limitations for the source inode's compressed inline extents,
      because it needs to decompress and recompress.  Now the destination inode's
      inline extents also need similar limitations.
      
      With this, xfstests btrfs/035 no longer runs into a panic.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  3. 21 Mar, 2014 12 commits
    • btrfs: fix uninit variable warning · 73b802f4
      Chris Mason authored
      fs/btrfs/send.c:2926: warning: ‘entry’ may be used uninitialized in this
      function
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: take into account total references when doing backref lookup · 44853868
      Josef Bacik authored
      I added an optimization for large files where we would stop searching for
      backrefs once we had looked at the number of references we currently had for
      this extent.  This works great most of the time, but for snapshots that point to
      this extent and have changes in the original root this assumption falls on its
      face.  So keep track of any delayed ref mods made and add in the actual ref
      count as reported by the extent item and use that to limit how far down an inode
      we'll search for extents.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reported-by: Hugo Mills <hugo@carfax.org.uk>
      Tested-by: Hugo Mills <hugo@carfax.org.uk>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: part 2, fix incremental send's decision to delay a dir move/rename · bfa7e1f8
      Filipe Manana authored
      For an incremental send, fix the process of determining whether the directory
      inode we're currently processing needs to have its move/rename operation delayed.
      
      We were ignoring the fact that if the inode's new immediate ancestor has a higher
      inode number than ours but wasn't renamed/moved, we might still need to delay our
      move/rename, because some other ancestor directory higher in the hierarchy might
      have an inode number higher than ours *and* was renamed/moved too - in this case
      we have to wait for rename/move of that ancestor to happen before our current
      directory's rename/move operation.
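
The ancestor check described above can be sketched as a walk up the new directory hierarchy. This is a simplified model, not the code in send.c: `must_delay_move`, `new_parent` and `moved` are hypothetical names, the inode numbers are assumed for illustration (they follow the creation order of the simple reproducer, starting at 257 below a root of 256), and the real decision involves more conditions.

```python
def must_delay_move(ino, new_parent, moved, root_ino=256):
    """Return True if the move/rename of directory `ino` should be delayed.

    new_parent maps each directory inode to its parent in the new snapshot;
    moved is the set of directory inodes renamed/moved between snapshots.
    """
    ancestor = new_parent[ino]
    while ancestor != root_ino:
        # Any ancestor with a higher inode number that was itself moved
        # has not been processed yet, so we have to wait for it.
        if ancestor > ino and ancestor in moved:
            return True
        ancestor = new_parent[ancestor]
    return False


# Assumed numbering: a=257 x1=258 x2=259 Z=260 x3=261 x4=262 x5=263
new_parent = {257: 256, 258: 257, 260: 257, 261: 260, 262: 261, 263: 262, 259: 263}
moved = {259, 261}  # x2 and x3 were moved after the first snapshot
```

With this data, x2 (259) has an immediate parent x5 (263) that was not moved, but the walk reaches x3 (261), which has a higher inode number and was moved, so the move of x2 is delayed. For x3 itself no such ancestor exists.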
      
      Simple steps to reproduce this issue:
      
            $ mkfs.btrfs -f /dev/sdd
            $ mount /dev/sdd /mnt
      
            $ mkdir -p /mnt/a/x1/x2
            $ mkdir /mnt/a/Z
            $ mkdir -p /mnt/a/x1/x2/x3/x4/x5
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap1
            $ btrfs send /mnt/snap1 -f /tmp/base.send
      
            $ mv /mnt/a/x1/x2/x3 /mnt/a/Z/X33
            $ mv /mnt/a/x1/x2 /mnt/a/Z/X33/x4/x5/X22
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap2
            $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
      
      The incremental send caused the kernel code to enter an infinite loop when
      building the path string for directory Z after its references are processed.
      
      A more complex scenario:
      
            $ mkfs.btrfs -f /dev/sdd
            $ mount /dev/sdd /mnt
      
            $ mkdir -p /mnt/a/b/c/d
            $ mkdir /mnt/a/b/c/d/e
            $ mkdir /mnt/a/b/c/d/f
            $ mv /mnt/a/b/c/d/e /mnt/a/b/c/d/f/E2
            $ mkdir /mnt/a/b/c/g
            $ mv /mnt/a/b/c/d /mnt/a/b/D2
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap1
            $ btrfs send /mnt/snap1 -f /tmp/base.send
      
            $ mkdir /mnt/a/o
            $ mv /mnt/a/b/c/g /mnt/a/b/D2/f/G2
            $ mv /mnt/a/b/D2 /mnt/a/b/dd
            $ mv /mnt/a/b/c /mnt/a/C2
            $ mv /mnt/a/b/dd/f /mnt/a/o/FF
            $ mv /mnt/a/b /mnt/a/o/FF/E2/BB
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap2
            $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
      
      A test case for xfstests follows.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix incremental send's decision to delay a dir move/rename · 7b119a8b
      Filipe Manana authored
      It's possible to change the parent/child relationship between directories
      in such a way that if a child directory has a higher inode number than
      its parent, it doesn't necessarily mean the child rename/move operation
      can be performed immediately. The parent might have its own rename/move
      operation delayed, therefore in this case the child needs to have its
      rename/move operation delayed too, and be performed after its new parent's
      rename/move.
      
      Steps to reproduce the issue:
      
            $ umount /mnt
            $ mkfs.btrfs -f /dev/sdd
            $ mount /dev/sdd /mnt
      
            $ mkdir /mnt/A
            $ mkdir /mnt/B
            $ mkdir /mnt/C
            $ mv /mnt/C /mnt/A
            $ mv /mnt/B /mnt/A/C
            $ mkdir /mnt/A/C/D
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap1
            $ btrfs send /mnt/snap1 -f /tmp/base.send
      
            $ mv /mnt/A/C/D /mnt/A/D2
            $ mv /mnt/A/C/B /mnt/A/D2/B2
            $ mv /mnt/A/C /mnt/A/D2/B2/C2
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap2
            $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
      
      The incremental send caused the kernel code to enter an infinite loop when
      building the path string for directory C after its references are processed.
      
      The necessary conditions here are that C has an inode number higher than both
      A and B, and B has a higher inode number than A, and D has the highest
      inode number, that is:
          inode_number(A) < inode_number(B) < inode_number(C) < inode_number(D)
      
      The same issue could happen if after the first snapshot there's any number
      of intermediary parent directories between A2 and B2, and between B2 and C2.
      
      A test case for xfstests follows, covering this simple case and more advanced
      ones, with files and hard links created inside the directories.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: remove unnecessary inode generation lookup in send · 425b5daf
      Filipe Manana authored
      No need to search in the send tree for the generation number of the inode,
      we already have it in the recorded_ref structure passed to us.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix race when updating existing ref head · 21543bad
      Filipe Manana authored
      While we update an existing ref head's extent_op, we're not holding
      its spinlock, so while we're updating its extent_op contents (key,
      flags) we can have a task running __btrfs_run_delayed_refs() that
      holds the ref head's lock and sets its extent_op to NULL right after
      the task updating the ref head just checked its extent_op was not NULL.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Add trace for btrfs_workqueue alloc/destroy · c3a46891
      Qu Wenruo authored
      Since a btrfs_workqueue is mostly printed as a pointer address,
      add trace events for btrfs_workqueue alloc/destroy to make analysis easier.
      This makes it possible to determine the workqueue that a given work belongs
      to (by comparing the wq pointer address with the alloc trace event).
      Signed-off-by: Qu Wenruo <quenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: less fs tree lock contention when using autodefrag · f094c9bd
      Filipe Manana authored
      When finding new extents during an autodefrag, don't do so many fs tree
      lookups to find an extent with a size smaller than the target threshold.
      Instead, after each fs tree forward search immediately unlock upper
      levels and process the entire leaf while holding a read lock on the leaf,
      since our leaf processing is very fast.
      This reduces lock contention, allowing for higher concurrency when other
      tasks want to write/update items related to other inodes in the fs tree,
      as we're not holding read locks on upper tree levels while processing the
      leaf and we do less tree searches.
      
      Test:
      
          sysbench --test=fileio --file-num=512 --file-total-size=16G \
             --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
             --file-rw-ratio=3 --file-io-mode=sync --max-time=1800 \
             --max-requests=10000000000 [prepare|run]
      
      (filesystem mounted with -o autodefrag, averages of 5 runs)
      
      Before this change: 58.852Mb/sec throughput, read 77.589Gb, written 25.863Gb
      After this change:  63.034Mb/sec throughput, read 83.102Gb, written 27.701Gb
      
      Test machine: quad core intel i5-3570K, 32Gb of RAM, SSD.
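
As a quick check of what the numbers above amount to, the relative gain can be computed from the reported averages. The helper `relative_gain` is a hypothetical name used only for this illustration; the input values are copied from the results above.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


results = {
    "throughput": (58.852, 63.034),
    "read": (77.589, 83.102),
    "written": (25.863, 27.701),
}
gains = {name: relative_gain(b, a) for name, (b, a) in results.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
```

All three metrics come out at roughly a 7.1% improvement.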
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: return EPERM when deleting a default subvolume · 72de6b53
      Guangyu Sun authored
      The error message is confusing:
      
       # btrfs sub delete /mnt/mysub/
       Delete subvolume '/mnt/mysub'
       ERROR: cannot delete '/mnt/mysub' - Directory not empty
      
      The error message does not make sense to me: It's not about deleting a
      directory but it's a subvolume, and it doesn't matter if the subvolume is
      empty or not.
      
      Maybe EPERM is more appropriate in this case, combined with an explanatory
      kernel log message. (e.g. "subvolume with ID 123 cannot be deleted because
      it is configured as default subvolume.")
      Reported-by: Koen De Wit <koen.de.wit@oracle.com>
      Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Filipe Manana's avatar
    • Btrfs: cache extent states in defrag code path · 308d9800
      Filipe Manana authored
      When locking file ranges in the inode's io_tree, cache the first
      extent state that belongs to the target range, so that when unlocking
      the range we don't need to search in the io_tree again, reducing CPU
      time and therefore holding the io_tree's lock for a shorter
      period.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix deadlock with nested trans handles · 3bbb24b2
      Josef Bacik authored
      Zach found this deadlock that would happen like this:
      
      btrfs_end_transaction <- reduce trans->use_count to 0
        btrfs_run_delayed_refs
          btrfs_cow_block
            find_free_extent
      	btrfs_start_transaction <- increase trans->use_count to 1
                allocate chunk
      	btrfs_end_transaction <- decrease trans->use_count to 0
      	  btrfs_run_delayed_refs
      	    lock tree block we are cowing above ^^
      
      We need to only decrease trans->use_count if it is above 1, otherwise leave it
      alone.  This will make nested trans be the only ones who decrease their added
      ref, and will let us get rid of the trans->use_count++ hack if we have to commit
      the transaction.  Thanks,
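
The use_count rule above can be illustrated with a small user-space model. This is a sketch under assumptions, not the kernel code: `TransHandle`, `end_transaction` and `simulate` are hypothetical names, and `delayed_ref_runs` simply counts how often the "run delayed refs" step is entered for one handle.

```python
class TransHandle:
    def __init__(self):
        self.use_count = 1         # the outermost holder
        self.delayed_ref_runs = 0  # times the delayed-ref step was entered


def end_transaction(trans, fixed, nested_work=None):
    if fixed:
        # Fixed rule: only a nested handle drops the reference it added.
        if trans.use_count > 1:
            trans.use_count -= 1
            return
    else:
        # Old rule: always decrement and continue once the count hits 0.
        trans.use_count -= 1
        if trans.use_count > 0:
            return
    trans.delayed_ref_runs += 1
    if nested_work is not None:
        nested_work(trans)


def simulate(fixed):
    def chunk_allocation(trans):
        trans.use_count += 1           # nested start of the same handle
        end_transaction(trans, fixed)  # nested end

    trans = TransHandle()
    end_transaction(trans, fixed, chunk_allocation)
    return trans.delayed_ref_runs
```

Under the old rule the nested end re-enters the delayed-ref step (two entries in this model, which is where the real code would block on the tree block it already holds); with the fixed rule it is entered once.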
      
      cc: stable@vger.kernel.org
      Reported-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Tested-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  4. 10 Mar, 2014 17 commits
    • Btrfs: fix possible empty list access when flushing the delalloc inodes · 573bfb72
      Miao Xie authored
      We didn't have a lock to protect access to the delalloc inodes list, so
      we might access an empty delalloc inodes list if someone else had started
      flushing delalloc inodes, because the delalloc inodes were temporarily moved
      onto another list. Fix it by wrapping the access with a lock.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: split the global ordered extents mutex · 31f3d255
      Miao Xie authored
      When we create a snapshot, we only need to wait for the ordered extents in
      the source fs/file root. But because we use a global mutex to protect
      the ordered extents list of the source fs/file root against accessing
      an empty list, we had to wait whenever someone else held the mutex to
      access the ordered extents list of another fs/file root.
      
      This patch splits the above global mutex, so that every fs/file root has
      its own mutex to protect its own list.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: don't flush all delalloc inodes when we don't get s_umount lock · 6c255e67
      Miao Xie authored
      We needn't flush all delalloc inodes when we don't get the s_umount lock,
      or we would make the tasks wait for a long time.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: reclaim delalloc metadata more aggressively · 24af7dd1
      Miao Xie authored
      generic/074 in xfstests sometimes failed with an ENOSPC error. The
      reason is that we reclaimed only the space we needed from the space
      reserved for delalloc, and then tried to reserve the space. But if some
      task did a no-flush reservation between the above reclamation and
      reservation,
      	Task1			Task2
      	shrink_delalloc()
      	reclaim 1 block
      	(The space that can
      	 be reserved now is 1
      	 block)
      				do no-flush reservation
      				reserve 1 block
      				(The space that can
      				 be reserved now is 0
      				 block)
      	reserving 1 block failed
      the reservation of Task1 failed, although there would have been enough
      space to reserve if we had reclaimed more space beforehand.
      
      Fix this problem by reclaiming the reserved delalloc metadata space
      more aggressively.
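
The interleaving in the diagram above can be replayed deterministically in a toy model. `SpaceInfo`, `task1_reserve` and the `intruder` callback are hypothetical names for this sketch only; how much extra space the real code reclaims is not modeled here, the example just reclaims one block more than needed.

```python
class SpaceInfo:
    def __init__(self, free_blocks, reclaimable_blocks):
        self.free = free_blocks
        self.reclaimable = reclaimable_blocks  # held by delalloc reservations

    def reclaim(self, blocks):
        got = min(blocks, self.reclaimable)
        self.reclaimable -= got
        self.free += got

    def reserve(self, blocks):
        if self.free >= blocks:
            self.free -= blocks
            return True
        return False


def task1_reserve(space, needed, reclaim_target, intruder=None):
    space.reclaim(reclaim_target)
    if intruder is not None:
        intruder(space)  # Task2's no-flush reservation slips in here
    return space.reserve(needed)


def task2(space):
    space.reserve(1)
```

Reclaiming exactly the one block Task1 needs lets Task2 take it first, so Task1 fails; reclaiming two blocks leaves enough for both.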
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: remove unnecessary lock in may_commit_transaction() · 0424c548
      Miao Xie authored
      The reasons are:
      - The per-cpu counter has its own lock to protect itself.
      - Here we needn't get an exact value.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Miao Xie's avatar
      b88935bf
    • Btrfs: just do dirty page flush for the inode with compression before direct IO · 41bd9ca4
      Miao Xie authored
      As the comment in btrfs_direct_IO says, only the compressed pages need to be
      flushed again to make sure they are on disk; the ordinary pages need not be.
      So we add an if statement to check whether the inode has compressed pages,
      and skip the flush if it does not.
      
      And in order to prevent the write ranges from intersecting, we need to wait
      for the running ordered extents. But the current code waits for them twice:
      once before the direct IO starts (in btrfs_wait_ordered_range()), and once
      before we get the blocks. The first wait is unnecessary: because we can do
      direct IO without holding i_mutex, intersecting ordered extents may appear
      during the direct IO, and the first wait cannot prevent this. So we
      use filemap_fdatawrite_range() instead of btrfs_wait_ordered_range() to remove
      the first wait.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: wake up the tasks that wait for the io earlier · af7a6509
      Miao Xie authored
      The tasks that wait for the IO_DONE flag only care about the io of the dirty
      pages, so it is better to wake them up immediately after all the pages are
      written, rather than after the whole io process completes.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix early enospc due to the race of the two ordered extent wait · 8b9d83cd
      Miao Xie authored
      btrfs_wait_ordered_roots() moves all the list entries to a new list,
      and then deals with them one by one. But if another task invokes this
      function at that time, it gets an empty list, which makes the ENOSPC
      error happen earlier. Fix it.
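
The window described above can be shown with a toy model of the splice-then-process pattern. `OrderedRoots` and its methods are hypothetical names; serializing callers with a mutex is one way to close the window and is assumed here for illustration, since the message does not spell out the exact fix.

```python
import threading


class OrderedRoots:
    def __init__(self, roots):
        self.roots = list(roots)
        self.mutex = threading.Lock()

    def wait_racy(self, concurrent_caller=None):
        local, self.roots = self.roots, []       # splice onto a private list
        seen_by_other = None
        if concurrent_caller is not None:
            seen_by_other = concurrent_caller()  # another task arrives now
        self.roots = local                       # entries are put back
        return len(local), seen_by_other

    def wait_serialized(self):
        with self.mutex:                         # callers take turns
            local, self.roots = self.roots, []
            count = len(local)
            self.roots = local
            return count
```

In the racy variant a caller that arrives mid-processing sees zero entries and returns immediately; with the mutex every caller sees the full list.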
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume · 8257b2dc
      Miao Xie authored
      If the snapshot creation happened after the nocow write but before the dirty
      data flush, we would fail to flush the dirty data because of no space.
      
      So we must keep track of when those nocow write operations start and when they
      end. If there are nocow writers, the snapshot creators must wait. In order
      to implement this, I introduce btrfs_{start, end}_nocow_write(),
      which are similar to mnt_{want,drop}_write().
      
      These two functions are only used for nocow file write operations.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Add ftrace for btrfs_workqueue · 52483bc2
      Qu Wenruo authored
      Add ftrace for btrfs_workqueue for further workqueue tuning.
      This patch needs to be applied after the workqueue replace patchset.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Cleanup the btrfs_workqueue related function type · 6db8914f
      Qu Wenruo authored
      The new btrfs_workqueue still uses open-coded function definitions;
      this patch changes them to the btrfs_func_t type, which is much the
      same as the kernel workqueue.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: add readahead for send_write · 2131bcd3
      Liu Bo authored
      Btrfs send reads data from disk and then writes to a stream via pipe or
      a file via flush.
      
      Currently we read one page at a time, so every page results
      in a disk read, which is not friendly to disks, esp. HDD.  Given that,
      performance can be gained by adding readahead for those pages.
      
      Here is a quick test:
      $ btrfs subvolume create send
      $ xfs_io -f -c "pwrite 0 1G" send/foobar
      $ btrfs subvolume snap -r send ro
      $ time btrfs send ro -f /dev/null
      
              w/o readahead   with readahead
      real    1m37.527s       0m9.097s
      user    0m0.122s        0m0.086s
      sys     0m53.191s       0m12.857s
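
To put the timings above in perspective, the wall-clock speedup can be computed from the two `real` values. The helper `to_seconds` is a hypothetical name for this illustration; the inputs are copied from the table.

```python
import re


def to_seconds(stamp):
    """Convert a time(1)-style value such as '1m37.527s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


speedup = to_seconds("1m37.527s") / to_seconds("0m9.097s")
print(f"real-time speedup: {speedup:.1f}x")
```

For this test the send finishes roughly 10.7 times faster in wall-clock time.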
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      2131bcd3
    • Btrfs: share the same code for __record_{new,deleted}_ref · a4d96d62
      Liu Bo authored
      No functional change: this only factors out the code common to the two
      functions so that it is shared.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      a4d96d62
    • Btrfs: avoid unnecessary utimes update in incremental send · fcbd2154
      Filipe Manana authored
      When we're finishing processing of an inode, if we're dealing with a
      directory inode that has a pending move/rename operation, we don't
      need to send a utimes update instruction to the send stream, as we'll
      do it later after doing the move/rename operation. Therefore we save
      some time here building paths and doing btree lookups.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      fcbd2154
    • Btrfs: make defrag not fragment files when using prealloc extents · e2127cf0
      Filipe Manana authored
      When using prealloc extents, a file defragment operation may actually
      fragment the file and increase the amount of data space used by the file.
      This change fixes that behaviour.
      
      Example:
      
      $ mkfs.btrfs -f /dev/sdb3
      $ mount /dev/sdb3 /mnt
      $ cd /mnt
      $ xfs_io -f -c 'falloc 0 1048576' foobar && sync
      $ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar
      $ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar
      $ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync
      
      Before defragmenting the file:
      
      $ btrfs filesystem df /mnt
      Data, single: total=8.00MiB, used=1.25MiB
      System, DUP: total=8.00MiB, used=16.00KiB
      System, single: total=4.00MiB, used=0.00
      Metadata, DUP: total=1.00GiB, used=112.00KiB
      Metadata, single: total=8.00MiB, used=0.00
      
      $ btrfs-debug-tree /dev/sdb3
      (...)
      	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 0 nr 4096
      	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 4096 nr 102400 ram 1048576
      		extent compression 0
      	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 106496 nr 90112
      	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 196608 nr 106496 ram 1048576
      		extent compression 0
      	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 303104 nr 593920
      	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 897024 nr 106496 ram 1048576
      		extent compression 0
      	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 1003520 nr 45056
      (...)
      
      Now defragmenting the file results in more data space used than before:
      
      $ btrfs filesystem defragment -f foobar && sync
      $ btrfs filesystem df /mnt
      Data, single: total=8.00MiB, used=1.55MiB
      System, DUP: total=8.00MiB, used=16.00KiB
      System, single: total=4.00MiB, used=0.00
      Metadata, DUP: total=1.00GiB, used=112.00KiB
      Metadata, single: total=8.00MiB, used=0.00
      
      And the corresponding file extent items are now no longer perfectly sequential
      as before, and we're now needlessly using more space from data block groups:
      
      $ btrfs-debug-tree /dev/sdb3
      (...)
      	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 0 nr 4096 ram 1048576
      		extent compression 0
      	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
      		extent data disk byte 13893632 nr 102400
      		extent data offset 0 nr 102400 ram 102400
      		extent compression 0
      	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 106496 nr 90112 ram 1048576
      		extent compression 0
      	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
      		extent data disk byte 13996032 nr 106496
      		extent data offset 0 nr 106496 ram 106496
      		extent compression 0
      	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 303104 nr 593920
      	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
      		extent data disk byte 14102528 nr 106496
      		extent data offset 0 nr 106496 ram 106496
      		extent compression 0
      	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 1003520 nr 45056 ram 1048576
      		extent compression 0
      (...)
      
      With this change, the above example no longer causes allocation of new data
      space nor changes the sequentiality of the file extents; that is, defragment
      has no effect, leaving all extent items pointing to the extent starting at
      disk byte 12845056.
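
The growth from 1.25MiB to 1.55MiB of used data space in the example can be cross-checked against the post-defrag extent items. The sketch below sums the sizes of the extents that do not belong to the original 1MiB extent at disk byte 12845056. The struct and function names are hypothetical, the table is copied from the btrfs-debug-tree output above, and the sum assumes each newly allocated extent is referenced by exactly one item, as is the case here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a file extent item. */
struct file_extent {
    uint64_t file_off;    /* key offset (position in the file) */
    uint64_t disk_bytenr; /* start of the on-disk extent */
    uint64_t disk_len;    /* size of the on-disk extent ("nr") */
};

#define ORIG_EXTENT 12845056ULL /* the single 1MiB prealloc extent */

/* Items 6..12 as dumped after the defragment run. */
static const struct file_extent after_defrag[] = {
    {       0, 12845056, 1048576 },
    {    4096, 13893632,  102400 },
    {  106496, 12845056, 1048576 },
    {  196608, 13996032,  106496 },
    {  303104, 12845056, 1048576 },
    {  897024, 14102528,  106496 },
    { 1003520, 12845056, 1048576 },
};

/* Sum the sizes of all extents other than `orig`. */
static uint64_t extra_bytes(const struct file_extent *items, size_t n,
                            uint64_t orig)
{
    uint64_t sum = 0;

    assert(items != NULL);
    for (size_t i = 0; i < n; i++) {
        if (items[i].disk_bytenr != orig)
            sum += items[i].disk_len;
    }
    return sum;
}
```

For this table the three new extents add up to 102400 + 106496 + 106496 = 315392 bytes, about 0.30MiB, which matches the difference between the two `btrfs filesystem df` readings.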
      
      On a 20GB filesystem I had, mounted with the autodefrag option and holding
      20 files of 400MB each (each initially a single 400MB prealloc extent),
      random writes happening at a low rate led to over 17GB of data space
      being used, not far from eventually reaching an ENOSPC state.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      e2127cf0
    • Btrfs: correctly flush data on defrag when compression is enabled · dec8ef90
      Filipe Manana authored
      When the defrag flag BTRFS_DEFRAG_RANGE_START_IO is set and compression
      is enabled, we weren't flushing completely, because writing compressed
      extents is a two-step process: one step to compress the data and another
      to write the compressed data to disk.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      dec8ef90