1. 07 Apr, 2014 13 commits
    • Filipe Manana's avatar
      Btrfs: send, build path string only once in send_hole · c715e155
      Filipe Manana authored
      There's no point in building the path string in each iteration of the
      send_hole loop, as it always produces the same string.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      c715e155
    • Gui Hecheng's avatar
      btrfs: filter invalid arg for btrfs resize · 9a40f122
      Gui Hecheng authored
      Originally, the following invalid commands were accepted:
      	# btrfs fi resize -10A  <mnt>
      	# btrfs fi resize -10Gaha <mnt>
      Filter out such arguments by checking the end pointer that memparse returns.
      Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9a40f122
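The check described in this commit can be sketched in user space. This is only an illustration, not the kernel patch: `parse_size_strict` is a hypothetical helper, `strtoull` stands in for the kernel's `memparse`, only the K/M/G suffixes are handled, and the leading `+`/`-` of a resize argument is assumed to have been stripped by the caller.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse "<digits>[K|M|G]" and reject anything that
 * trails the number or its suffix. Returns 0 on success, -EINVAL otherwise.
 */
int parse_size_strict(const char *s, unsigned long long *out)
{
    char *end;
    unsigned long long val = strtoull(s, &end, 10);

    if (end == s)
        return -EINVAL;         /* no digits were consumed */

    switch (tolower((unsigned char)*end)) {
    case 'g':
        val <<= 10;             /* fall through */
    case 'm':
        val <<= 10;             /* fall through */
    case 'k':
        val <<= 10;
        end++;                  /* step over the suffix character */
        break;
    default:
        break;
    }

    if (*end != '\0')
        return -EINVAL;         /* trailing junk such as "A" or "aha" */

    *out = val;
    return 0;
}
```

With this sketch, `10G` is accepted, while `10A` and `10Gaha` are rejected because the end pointer does not land on the terminating NUL.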
    • Filipe Manana's avatar
      Btrfs: send, fix data corruption due to incorrect hole detection · 766b5e5a
      Filipe Manana authored
      During an incremental send, when we finish processing an inode (corresponding to
      a regular file) we would assume the gap between the end of the last processed file
      extent and the file's size corresponded to a file hole, and therefore incorrectly
      send a bunch of zero bytes to overwrite that region in the file.
      
      This affects only kernel 3.14.
      
      Reproducer:
      
          mkfs.btrfs -f /dev/sdc
          mount /dev/sdc /mnt
      
          xfs_io -f -c "falloc -k 0 268435456" /mnt/foo
      
          btrfs subvolume snapshot -r /mnt /mnt/mysnap0
      
          xfs_io -c "pwrite -S 0x01 -b 9216 16190218 9216" /mnt/foo
          xfs_io -c "pwrite -S 0x02 -b 1121 198720104 1121" /mnt/foo
          xfs_io -c "pwrite -S 0x05 -b 9216 107887439 9216" /mnt/foo
          xfs_io -c "pwrite -S 0x06 -b 9216 225520207 9216" /mnt/foo
          xfs_io -c "pwrite -S 0x07 -b 67584 102138300 67584" /mnt/foo
          xfs_io -c "pwrite -S 0x08 -b 7000 94897484 7000" /mnt/foo
          xfs_io -c "pwrite -S 0x09 -b 113664 245083212 113664" /mnt/foo
          xfs_io -c "pwrite -S 0x10 -b 123 17937788 123" /mnt/foo
          xfs_io -c "pwrite -S 0x11 -b 39936 229573311 39936" /mnt/foo
          xfs_io -c "pwrite -S 0x12 -b 67584 174792222 67584" /mnt/foo
          xfs_io -c "pwrite -S 0x13 -b 9216 249253213 9216" /mnt/foo
          xfs_io -c "pwrite -S 0x16 -b 67584 150046083 67584" /mnt/foo
          xfs_io -c "pwrite -S 0x17 -b 39936 118246040 39936" /mnt/foo
          xfs_io -c "pwrite -S 0x18 -b 67584 215965442 67584" /mnt/foo
          xfs_io -c "pwrite -S 0x19 -b 33792 97096725 33792" /mnt/foo
          xfs_io -c "pwrite -S 0x20 -b 125952 166300596 125952" /mnt/foo
          xfs_io -c "pwrite -S 0x21 -b 123 1078957 123" /mnt/foo
          xfs_io -c "pwrite -S 0x25 -b 9216 212044492 9216" /mnt/foo
          xfs_io -c "pwrite -S 0x26 -b 7000 265037146 7000" /mnt/foo
          xfs_io -c "pwrite -S 0x27 -b 42757 215922685 42757" /mnt/foo
          xfs_io -c "pwrite -S 0x28 -b 7000 69865411 7000" /mnt/foo
          xfs_io -c "pwrite -S 0x29 -b 67584 67948958 67584" /mnt/foo
          xfs_io -c "pwrite -S 0x30 -b 39936 266967019 39936" /mnt/foo
          xfs_io -c "pwrite -S 0x31 -b 1121 19582453 1121" /mnt/foo
          xfs_io -c "pwrite -S 0x32 -b 17408 257710255 17408" /mnt/foo
          xfs_io -c "pwrite -S 0x33 -b 39936 3895518 39936" /mnt/foo
          xfs_io -c "pwrite -S 0x34 -b 125952 12045847 125952" /mnt/foo
          xfs_io -c "pwrite -S 0x35 -b 17408 19156379 17408" /mnt/foo
          xfs_io -c "pwrite -S 0x36 -b 39936 50160066 39936" /mnt/foo
          xfs_io -c "pwrite -S 0x37 -b 113664 9549793 113664" /mnt/foo
          xfs_io -c "pwrite -S 0x38 -b 105472 94391506 105472" /mnt/foo
          xfs_io -c "pwrite -S 0x39 -b 23552 143632863 23552" /mnt/foo
          xfs_io -c "pwrite -S 0x40 -b 39936 241283845 39936" /mnt/foo
          xfs_io -c "pwrite -S 0x41 -b 113664 199937606 113664" /mnt/foo
          xfs_io -c "pwrite -S 0x42 -b 67584 67380093 67584" /mnt/foo
          xfs_io -c "pwrite -S 0x43 -b 67584 26793129 67584" /mnt/foo
          xfs_io -c "pwrite -S 0x44 -b 39936 14421913 39936" /mnt/foo
          xfs_io -c "pwrite -S 0x45 -b 123 253097405 123" /mnt/foo
          xfs_io -c "pwrite -S 0x46 -b 1121 128233424 1121" /mnt/foo
          xfs_io -c "pwrite -S 0x47 -b 105472 91577959 105472" /mnt/foo
          xfs_io -c "pwrite -S 0x48 -b 1121 7245381 1121" /mnt/foo
          xfs_io -c "pwrite -S 0x49 -b 113664 182414694 113664" /mnt/foo
          xfs_io -c "pwrite -S 0x50 -b 9216 32750608 9216" /mnt/foo
          xfs_io -c "pwrite -S 0x51 -b 67584 266546049 67584" /mnt/foo
          xfs_io -c "pwrite -S 0x52 -b 67584 87969398 67584" /mnt/foo
          xfs_io -c "pwrite -S 0x53 -b 9216 260848797 9216" /mnt/foo
          xfs_io -c "pwrite -S 0x54 -b 39936 119461243 39936" /mnt/foo
          xfs_io -c "pwrite -S 0x55 -b 7000 200178693 7000" /mnt/foo
          xfs_io -c "pwrite -S 0x56 -b 9216 243316029 9216" /mnt/foo
          xfs_io -c "pwrite -S 0x57 -b 7000 209658229 7000" /mnt/foo
          xfs_io -c "pwrite -S 0x58 -b 101376 179745192 101376" /mnt/foo
          xfs_io -c "pwrite -S 0x59 -b 9216 64012300 9216" /mnt/foo
          xfs_io -c "pwrite -S 0x60 -b 125952 181705139 125952" /mnt/foo
          xfs_io -c "pwrite -S 0x61 -b 23552 235737348 23552" /mnt/foo
          xfs_io -c "pwrite -S 0x62 -b 113664 106021355 113664" /mnt/foo
          xfs_io -c "pwrite -S 0x63 -b 67584 135753552 67584" /mnt/foo
          xfs_io -c "pwrite -S 0x64 -b 23552 95730888 23552" /mnt/foo
          xfs_io -c "pwrite -S 0x65 -b 11 17311415 11" /mnt/foo
          xfs_io -c "pwrite -S 0x66 -b 33792 120695553 33792" /mnt/foo
          xfs_io -c "pwrite -S 0x67 -b 9216 17164631 9216" /mnt/foo
          xfs_io -c "pwrite -S 0x68 -b 9216 136065853 9216" /mnt/foo
          xfs_io -c "pwrite -S 0x69 -b 67584 37752198 67584" /mnt/foo
          xfs_io -c "pwrite -S 0x70 -b 101376 189717473 101376" /mnt/foo
          xfs_io -c "pwrite -S 0x71 -b 7000 227463698 7000" /mnt/foo
          xfs_io -c "pwrite -S 0x72 -b 9216 12655137 9216" /mnt/foo
          xfs_io -c "pwrite -S 0x73 -b 7000 7488866 7000" /mnt/foo
          xfs_io -c "pwrite -S 0x74 -b 113664 87813649 113664" /mnt/foo
          xfs_io -c "pwrite -S 0x75 -b 33792 25802183 33792" /mnt/foo
          xfs_io -c "pwrite -S 0x76 -b 39936 93524024 39936" /mnt/foo
          xfs_io -c "pwrite -S 0x77 -b 33792 113336388 33792" /mnt/foo
          xfs_io -c "pwrite -S 0x78 -b 105472 184955320 105472" /mnt/foo
          xfs_io -c "pwrite -S 0x79 -b 101376 225691598 101376" /mnt/foo
          xfs_io -c "pwrite -S 0x80 -b 23552 77023155 23552" /mnt/foo
          xfs_io -c "pwrite -S 0x81 -b 11 201888192 11" /mnt/foo
          xfs_io -c "pwrite -S 0x82 -b 11 115332492 11" /mnt/foo
          xfs_io -c "pwrite -S 0x83 -b 67584 230278015 67584" /mnt/foo
          xfs_io -c "pwrite -S 0x84 -b 11 120589073 11" /mnt/foo
          xfs_io -c "pwrite -S 0x85 -b 125952 202207819 125952" /mnt/foo
          xfs_io -c "pwrite -S 0x86 -b 113664 86672080 113664" /mnt/foo
          xfs_io -c "pwrite -S 0x87 -b 17408 208459603 17408" /mnt/foo
          xfs_io -c "pwrite -S 0x88 -b 7000 73372211 7000" /mnt/foo
          xfs_io -c "pwrite -S 0x89 -b 7000 42252122 7000" /mnt/foo
          xfs_io -c "pwrite -S 0x90 -b 23552 46784881 23552" /mnt/foo
          xfs_io -c "pwrite -S 0x91 -b 101376 63172351 101376" /mnt/foo
          xfs_io -c "pwrite -S 0x92 -b 23552 59341931 23552" /mnt/foo
          xfs_io -c "pwrite -S 0x93 -b 39936 239599283 39936" /mnt/foo
          xfs_io -c "pwrite -S 0x94 -b 67584 175643105 67584" /mnt/foo
          xfs_io -c "pwrite -S 0x97 -b 23552 105534880 23552" /mnt/foo
          xfs_io -c "pwrite -S 0x98 -b 113664 8236844 113664" /mnt/foo
          xfs_io -c "pwrite -S 0x99 -b 125952 144489686 125952" /mnt/foo
          xfs_io -c "pwrite -S 0xa0 -b 7000 73273112 7000" /mnt/foo
          xfs_io -c "pwrite -S 0xa1 -b 125952 194580243 125952" /mnt/foo
          xfs_io -c "pwrite -S 0xa2 -b 123 56296779 123" /mnt/foo
          xfs_io -c "pwrite -S 0xa3 -b 11 233066845 11" /mnt/foo
          xfs_io -c "pwrite -S 0xa4 -b 39936 197727090 39936" /mnt/foo
          xfs_io -c "pwrite -S 0xa5 -b 101376 53579812 101376" /mnt/foo
          xfs_io -c "pwrite -S 0xa6 -b 9216 85669738 9216" /mnt/foo
          xfs_io -c "pwrite -S 0xa7 -b 125952 21266322 125952" /mnt/foo
          xfs_io -c "pwrite -S 0xa8 -b 23552 125726568 23552" /mnt/foo
          xfs_io -c "pwrite -S 0xa9 -b 9216 18423680 9216" /mnt/foo
          xfs_io -c "pwrite -S 0xb0 -b 1121 165901483 1121" /mnt/foo
      
          btrfs subvolume snapshot -r /mnt /mnt/mysnap1
      
          xfs_io -c "pwrite -S 0xff -b 10 16190218 10" /mnt/foo
      
          btrfs subvolume snapshot -r /mnt /mnt/mysnap2
      
          md5sum /mnt/foo          # returns 79e53f1466bfc09fd82b450689e6119e
          md5sum /mnt/mysnap2/foo  # returns 79e53f1466bfc09fd82b450689e6119e too
      
          btrfs send /mnt/mysnap1 -f /tmp/1.snap
          btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap
      
          mkfs.btrfs -f /dev/sdc
          mount /dev/sdc /mnt
      
          btrfs receive /mnt -f /tmp/1.snap
          btrfs receive /mnt -f /tmp/2.snap
      
          md5sum /mnt/mysnap2/foo  # returns 2bb414c5155767cedccd7063e51beabd !!
      
      A test case for xfstests will follow soon.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      766b5e5a
    • Dan Carpenter's avatar
      Btrfs: kmalloc() doesn't return an ERR_PTR · 84dbeb87
      Dan Carpenter authored
      The error handling was copy and pasted from memdup_user().  It should be
      checking for NULL obviously.
      
      Fixes: abccd00f ('btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl')
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      84dbeb87
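Why the copied check was wrong can be shown with a small user-space model. The names `err_ptr`, `is_err`, `alloc_or_null` and `copy_args` are made up for this sketch. The range test mirrors how the kernel's error-pointer encoding is commonly described: the top 4095 address values are reserved for error codes.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Model of ERR_PTR(): encode a negative errno in a pointer value. */
void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Model of IS_ERR(): true only for the top MAX_ERRNO address values. */
int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Stand-in for an allocator that, like kmalloc(), signals failure with NULL. */
void *alloc_or_null(size_t size, int simulate_failure)
{
    return simulate_failure ? NULL : malloc(size);
}

/* Correct error handling for a NULL-returning allocator. */
int copy_args(size_t size, int simulate_failure)
{
    void *buf = alloc_or_null(size, simulate_failure);

    if (!buf)               /* an is_err(buf) test would miss this failure */
        return -ENOMEM;
    free(buf);
    return 0;
}
```

In this model `is_err(NULL)` is false, so an `IS_ERR()`-style test on an allocation that failed with NULL would let the NULL pointer through.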
    • Wang Shilong's avatar
      Btrfs: fix snapshot vs nocow writting · e9894fd3
      Wang Shilong authored
      While running fsstress and snapshots concurrently, we can hit something
      like the following:
      
      Thread 1			Thread 2
      
      |->fallocate
        |->write pages
          |->join transaction
             |->add ordered extent
          |->end transaction
      				|->flushing data
      				  |->creating pending snapshots
      |->write data into src root's
         fallocated space
      
      After the above workflow finishes, the source root and the snapshot
      root share the same space, but the source root has written data into the
      fallocated space. This makes fsck fail to verify checksums for the
      snapshot root's preallocated file extent data. Nocow writing has
      the same problem.
      
      Fix this problem by synchronizing snapshots with nocow writing:
      
       1. For nocow writes, if there are pending snapshots, fall back
       to the COW path.
      
       2. If there are pending nocow writes, snapshots for this root
       are blocked until the nocow writes finish.
      Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e9894fd3
    • Qu Wenruo's avatar
      btrfs: Change the expanding write sequence to fix snapshot related bug. · 3ac0d7b9
      Qu Wenruo authored
      When testing fsstress with snapshots being created in the background, some
      snapshots show the following problem.
      
      Snapshot 270:
      inode 323: size 0
      
      Snapshot 271:
      inode 323: size 349145
      |-------Hole---|---------Empty gap-------|-------Hole-----|
      0	    122880			172032	      349145
      
      Snapshot 272:
      inode 323: size 349145
      |-------Hole---|------------Data---------|-------Hole-----|
      0	    122880			172032	      349145
      
      The fsstress operation on inode 323 is the following:
      write: 		offset 	126832 	len 43124
      truncate: 	size 	349145
      
      A write at an offset past the current file size consists of 2 operations:
      1. punch hole
      2. write data
      Hole punching is faster than the data write, so the hole punching for the write
      and the truncate is done first and the buffered write comes later. Snapshot 271
      was taken in between and got an empty gap, which will not pass btrfsck.
      
      To fix the bug, this patch changes the write sequence so that it
      first punches a hole covering the write end if a hole is needed.
      Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3ac0d7b9
    • David Sterba's avatar
      btrfs: make device scan less noisy · 60999ca4
      David Sterba authored
      Print the message only when the device is seen for the first time.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      60999ca4
    • Jeff Mahoney's avatar
      btrfs: fix lockdep warning with reclaim lock inversion · ed55b6ac
      Jeff Mahoney authored
      When encountering memory pressure, testers have run into the following
      lockdep warning. It was caused by __link_block_group calling kobject_add
      with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL,
      which gets us into reclaim context. The kobject doesn't actually need
      to be added under the lock -- it just needs to ensure that it's only
      added for the first block group to be linked.
      
      =========================================================
      [ INFO: possible irq lock inversion dependency detected ]
      3.14.0-rc8-default #1 Not tainted
      ---------------------------------------------------------
      kswapd0/169 just changed the state of lock:
       (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa018baea>] __btrfs_release_delayed_node+0x3a/0x200 [btrfs]
      but this lock took another, RECLAIM_FS-unsafe lock in the past:
       (&found->groups_sem){+++++.}
      
      and interrupts could create inverse lock ordering between them.
      
      other info that might help us debug this:
       Possible interrupt unsafe locking scenario:
             CPU0                    CPU1
             ----                    ----
        lock(&found->groups_sem);
                                     local_irq_disable();
                                     lock(&delayed_node->mutex);
                                     lock(&found->groups_sem);
        <Interrupt>
          lock(&delayed_node->mutex);
      
       *** DEADLOCK ***
      2 locks held by kswapd0/169:
       #0:  (shrinker_rwsem){++++..}, at: [<ffffffff81159e8a>] shrink_slab+0x3a/0x160
       #1:  (&type->s_umount_key#27){++++..}, at: [<ffffffff811bac6f>] grab_super_passive+0x3f/0x90
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      ed55b6ac
    • Josef Bacik's avatar
      Btrfs: hold the commit_root_sem when getting the commit root during send · 3f8a18cc
      Josef Bacik authored
      We currently rely too heavily on roots being read-only to save us from just
      accessing root->commit_root.  We can easily balance blocks out from underneath a
      read only root, so to save us from getting screwed make sure we only access
      root->commit_root under the commit root sem.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3f8a18cc
    • Josef Bacik's avatar
      Btrfs: remove transaction from send · 9e351cc8
      Josef Bacik authored
      Let's try this again.  We can deadlock the box if we send on a box and try to
      write onto the same fs with the app that is trying to listen to the send pipe.
      This is because the writer could get stuck waiting for a transaction commit
      which is being blocked by the send.  So fix this by making sure looking at the
      commit roots is always going to be consistent.  We do this by keeping track of
      which roots need to have their commit roots swapped during commit, and then
      taking the commit_root_sem and swapping them all at once.  Then make sure we
      take a read lock on the commit_root_sem in cases where we search the commit root
      to make sure we're always looking at a consistent view of the commit roots.
      Previously we had problems with this because we would swap a fs tree commit root
      and then swap the extent tree commit root independently which would cause the
      backref walking code to screw up sometimes.  With this patch we no longer
      deadlock and pass all the weird send/receive corner cases.  Thanks,
      Reported-by: Hugo Mills <hugo@carfax.org.uk>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9e351cc8
    • Josef Bacik's avatar
      Btrfs: don't clear uptodate if the eb is under IO · a26e8c9f
      Josef Bacik authored
      So I have an awful exercise script that will run snapshot, balance and
      send/receive in parallel.  This sometimes would crash spectacularly and when it
      came back up the fs would be completely hosed.  Turns out this is because of a
      bad interaction of balance and send/receive.  Send will hold onto its entire
      path for the whole send, but its blocks could get relocated out from underneath
      it, and because it doesn't hold tree locks there's nothing to keep this from
      happening.  So it will go to read in a slot with an old transid, and we could
      have re-allocated this block for something else and it could have a completely
      different transid.  But because we think it is invalid we clear uptodate and
      re-read in the block.  If we do this before we actually write out the new block
      we could write back stale data to the fs, and boom we're screwed.
      
      Now we definitely need to fix this disconnect between send and balance, but we
      really really need to not allow ourselves to accidentally read in stale data over
      new data.  So make sure we check if the extent buffer is not under io before
      clearing uptodate, this will kick back EIO to the caller instead of reading in
      stale data and keep us from corrupting the fs.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      a26e8c9f
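The rule described in this commit can be modeled as a toy. `struct eb_model` and `verify_transid` are hypothetical names invented for this sketch. They do not reproduce the kernel's extent buffer or its real flag handling, only the decision "return EIO, and clear uptodate only when the buffer is not under IO".

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for an extent buffer; the fields are assumptions. */
struct eb_model {
    unsigned long long generation; /* transid stored in the block */
    bool uptodate;                 /* cached contents considered valid */
    bool under_io;                 /* new contents not yet written out */
};

/*
 * Returns 0 if the buffer carries the expected transid, -EIO otherwise.
 * On a mismatch the uptodate flag is cleared only when the buffer is not
 * under IO, so a forced re-read can never replace newer in-memory data
 * with stale on-disk data.
 */
int verify_transid(struct eb_model *eb, unsigned long long expected)
{
    if (eb->generation == expected)
        return 0;

    if (!eb->under_io)
        eb->uptodate = false;

    return -EIO;
}
```

In both mismatch cases the caller sees `-EIO`. The difference is whether the in-memory copy stays marked valid.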
    • Josef Bacik's avatar
      Btrfs: check for an extent_op on the locked ref · 573a0755
      Josef Bacik authored
      We could have possibly added an extent_op to the locked_ref while we dropped
      locked_ref->lock, so check for this case as well and loop around.  Otherwise we
      could lose flag updates which would lead to extent tree corruption.  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      573a0755
    • Josef Bacik's avatar
      Btrfs: do not reset last_snapshot after relocation · ba8b0289
      Josef Bacik authored
      This was done to allow NO_COW to continue to be NO_COW after relocation but it
      is not right.  When relocating we will convert blocks to FULL_BACKREF that we
      relocate.  We can leave some of these full backref blocks behind if they are not
      cow'ed out during the relocation, like if we fail the relocation with ENOSPC and
      then just drop the reloc tree.  Then when we go to cow the block again we won't
      lookup the extent flags because we won't think there has been a snapshot
      recently which means we will do our normal ref drop thing instead of adding back
      a tree ref and dropping the shared ref.  This will cause btrfs_free_extent to
      blow up because it can't find the ref we are trying to free.  This was found
      with my ref verifying tool.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      ba8b0289
  2. 22 Mar, 2014 1 commit
    • Liu Bo's avatar
      Btrfs: fix a crash of clone with inline extents's split · 00fdf13a
      Liu Bo authored
      xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split
      of inline extents in __btrfs_drop_extents().
      
      For inline extents, we cannot duplicate another EXTENT_DATA item, because
      it breaks the rule of inline extents, that is, 'start offset' needs to be 0.
      
      We have set limitations for the source inode's compressed inline extents,
      because it needs to decompress and recompress.  Now the destination inode's
      inline extents also need similar limitations.
      
      With this, xfstests btrfs/035 doesn't run into panic.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      00fdf13a
  3. 21 Mar, 2014 12 commits
    • Chris Mason's avatar
      btrfs: fix uninit variable warning · 73b802f4
      Chris Mason authored
      fs/btrfs/send.c:2926: warning: ‘entry’ may be used uninitialized in this
      function
      Signed-off-by: Chris Mason <clm@fb.com>
      73b802f4
    • Josef Bacik's avatar
      Btrfs: take into account total references when doing backref lookup · 44853868
      Josef Bacik authored
      I added an optimization for large files where we would stop searching for
      backrefs once we had looked at the number of references we currently had for
      this extent.  This works great most of the time, but for snapshots that point to
      this extent while the original root has changes, this assumption falls on its
      face.  So keep track of any delayed ref mods made and add in the actual ref
      count as reported by the extent item and use that to limit how far down an inode
      we'll search for extents.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reported-by: Hugo Mills <hugo@carfax.org.uk>
      Tested-by: Hugo Mills <hugo@carfax.org.uk>
      Signed-off-by: Chris Mason <clm@fb.com>
      44853868
    • Filipe Manana's avatar
      Btrfs: part 2, fix incremental send's decision to delay a dir move/rename · bfa7e1f8
      Filipe Manana authored
      For an incremental send, fix the process of determining whether the directory
      inode we're currently processing needs to have its move/rename operation delayed.
      
      We were ignoring the fact that if the inode's new immediate ancestor has a higher
      inode number than ours but wasn't renamed/moved, we might still need to delay our
      move/rename, because some other ancestor directory higher in the hierarchy might
      have an inode number higher than ours *and* was renamed/moved too - in this case
      we have to wait for rename/move of that ancestor to happen before our current
      directory's rename/move operation.
      
      Simple steps to reproduce this issue:
      
            $ mkfs.btrfs -f /dev/sdd
            $ mount /dev/sdd /mnt
      
            $ mkdir -p /mnt/a/x1/x2
            $ mkdir /mnt/a/Z
            $ mkdir -p /mnt/a/x1/x2/x3/x4/x5
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap1
            $ btrfs send /mnt/snap1 -f /tmp/base.send
      
            $ mv /mnt/a/x1/x2/x3 /mnt/a/Z/X33
            $ mv /mnt/a/x1/x2 /mnt/a/Z/X33/x4/x5/X22
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap2
            $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
      
      The incremental send caused the kernel code to enter an infinite loop when
      building the path string for directory Z after its references are processed.
      
      A more complex scenario:
      
            $ mkfs.btrfs -f /dev/sdd
            $ mount /dev/sdd /mnt
      
            $ mkdir -p /mnt/a/b/c/d
            $ mkdir /mnt/a/b/c/d/e
            $ mkdir /mnt/a/b/c/d/f
            $ mv /mnt/a/b/c/d/e /mnt/a/b/c/d/f/E2
            $ mkdir /mnt/a/b/c/g
            $ mv /mnt/a/b/c/d /mnt/a/b/D2
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap1
            $ btrfs send /mnt/snap1 -f /tmp/base.send
      
            $ mkdir /mnt/a/o
            $ mv /mnt/a/b/c/g /mnt/a/b/D2/f/G2
            $ mv /mnt/a/b/D2 /mnt/a/b/dd
            $ mv /mnt/a/b/c /mnt/a/C2
            $ mv /mnt/a/b/dd/f /mnt/a/o/FF
            $ mv /mnt/a/b /mnt/a/o/FF/E2/BB
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap2
            $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
      
      A test case for xfstests follows.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      bfa7e1f8
    • Filipe Manana's avatar
      Btrfs: fix incremental send's decision to delay a dir move/rename · 7b119a8b
      Filipe Manana authored
      It's possible to change the parent/child relationship between directories
      in such a way that if a child directory has a higher inode number than
      its parent, it doesn't necessarily mean the child rename/move operation
      can be performed immediately. The parent might have its own rename/move
      operation delayed, therefore in this case the child needs to have its
      rename/move operation delayed too, and be performed after its new parent's
      rename/move.
      
      Steps to reproduce the issue:
      
            $ umount /mnt
            $ mkfs.btrfs -f /dev/sdd
            $ mount /dev/sdd /mnt
      
            $ mkdir /mnt/A
            $ mkdir /mnt/B
            $ mkdir /mnt/C
            $ mv /mnt/C /mnt/A
            $ mv /mnt/B /mnt/A/C
            $ mkdir /mnt/A/C/D
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap1
            $ btrfs send /mnt/snap1 -f /tmp/base.send
      
            $ mv /mnt/A/C/D /mnt/A/D2
            $ mv /mnt/A/C/B /mnt/A/D2/B2
            $ mv /mnt/A/C /mnt/A/D2/B2/C2
      
            $ btrfs subvolume snapshot -r /mnt /mnt/snap2
            $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
      
      The incremental send caused the kernel code to enter an infinite loop when
      building the path string for directory C after its references are processed.
      
      The necessary conditions here are that C has an inode number higher than both
      A and B, B has a higher inode number than A, and D has the highest
      inode number, that is:
          inode_number(A) < inode_number(B) < inode_number(C) < inode_number(D)
      
      The same issue could happen if after the first snapshot there's any number
      of intermediary parent directories between A2 and B2, and between B2 and C2.
      
      A test case for xfstests follows, covering this simple case and more advanced
      ones, with files and hard links created inside the directories.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      7b119a8b
    • Filipe Manana's avatar
      Btrfs: remove unnecessary inode generation lookup in send · 425b5daf
      Filipe Manana authored
      No need to search in the send tree for the generation number of the inode,
      we already have it in the recorded_ref structure passed to us.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      425b5daf
    • Filipe Manana's avatar
      Btrfs: fix race when updating existing ref head · 21543bad
      Filipe Manana authored
      While we update an existing ref head's extent_op, we're not holding
      its spinlock, so while we're updating its extent_op contents (key,
      flags) we can have a task running __btrfs_run_delayed_refs() that
      holds the ref head's lock and sets its extent_op to NULL right after
      the task updating the ref head just checked its extent_op was not NULL.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      21543bad
    • Qu Wenruo's avatar
      btrfs: Add trace for btrfs_workqueue alloc/destroy · c3a46891
      Qu Wenruo authored
      Since most btrfs_workqueue information is printed as pointer addresses,
      add trace events for btrfs_workqueue alloc/destroy for easier analysis.
      This makes it possible to determine the workqueue that a given work belongs
      to (by comparing the wq pointer address with the alloc trace event).
      Signed-off-by: Qu Wenruo <quenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      c3a46891
    • Filipe Manana's avatar
      Btrfs: less fs tree lock contention when using autodefrag · f094c9bd
      Filipe Manana authored
      When finding new extents during an autodefrag, don't do so many fs tree
      lookups to find an extent with a size smaller than the target threshold.
      Instead, after each fs tree forward search immediately unlock upper
      levels and process the entire leaf while holding a read lock on the leaf,
      since our leaf processing is very fast.
      This reduces lock contention, allowing for higher concurrency when other
      tasks want to write/update items related to other inodes in the fs tree,
      as we're not holding read locks on upper tree levels while processing the
      leaf and we do less tree searches.
      
      Test:
      
          sysbench --test=fileio --file-num=512 --file-total-size=16G \
             --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
             --file-rw-ratio=3 --file-io-mode=sync --max-time=1800 \
             --max-requests=10000000000 [prepare|run]
      
      (filesystem mounted with -o autodefrag, averages of 5 runs)
      
      Before this change: 58.852Mb/sec throughput, read 77.589Gb, written 25.863Gb
      After this change:  63.034Mb/sec throughput, read 83.102Gb, written 27.701Gb
      
      Test machine: quad core Intel i5-3570K, 32Gb of RAM, SSD.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      f094c9bd
    • Guangyu Sun's avatar
      Btrfs: return EPERM when deleting a default subvolume · 72de6b53
      Guangyu Sun authored
      The error message is confusing:
      
       # btrfs sub delete /mnt/mysub/
       Delete subvolume '/mnt/mysub'
       ERROR: cannot delete '/mnt/mysub' - Directory not empty
      
      The error message does not make sense to me: this is not about deleting a
      directory but a subvolume, and it doesn't matter whether the subvolume is
      empty or not.
      
      Maybe EPERM is more appropriate in this case, combined with an explanatory
      kernel log message. (e.g. "subvolume with ID 123 cannot be deleted because
      it is configured as default subvolume.")
      Reported-by: Koen De Wit <koen.de.wit@oracle.com>
      Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      72de6b53
    • Filipe Manana's avatar
      Btrfs: cache extent states in defrag code path · 308d9800
      Filipe Manana authored
      When locking file ranges in the inode's io_tree, cache the first
      extent state that belongs to the target range, so that when unlocking
      the range we don't need to search in the io_tree again, reducing CPU
      time and therefore holding the io_tree's lock for a shorter
      period.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      308d9800
    • Josef Bacik's avatar
      Btrfs: fix deadlock with nested trans handles · 3bbb24b2
      Josef Bacik authored
      Zach found this deadlock, which would happen like this:
      
      btrfs_end_transaction <- reduce trans->use_count to 0
        btrfs_run_delayed_refs
          btrfs_cow_block
            find_free_extent
      	btrfs_start_transaction <- increase trans->use_count to 1
                allocate chunk
      	btrfs_end_transaction <- decrease trans->use_count to 0
      	  btrfs_run_delayed_refs
      	    lock tree block we are cowing above ^^
      
      We need to only decrease trans->use_count if it is above 1, otherwise leave it
      alone.  This will make nested trans be the only ones who decrease their added
      ref, and will let us get rid of the trans->use_count++ hack if we have to commit
      the transaction.  Thanks,
      
      cc: stable@vger.kernel.org
      Reported-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Tested-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3bbb24b2
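The use_count rule from this commit message can be illustrated with a toy model. Everything here is hypothetical (`struct trans_model`, `trans_end`, `run_end_work`). It does not reproduce the kernel's transaction code, only the idea that a nested end drops its own reference and returns, while the outermost end leaves the count alone and does the end-of-transaction work.

```c
/* Toy transaction handle; the fields are assumptions for illustration. */
struct trans_model {
    int use_count;      /* 1 for the outermost holder, +1 per nested start */
    int end_work_runs;  /* how often the end-of-transaction work ran */
};

int trans_end(struct trans_model *t);

/*
 * Stand-in for end-of-transaction work (such as running delayed refs)
 * that itself starts and ends a nested handle on the same transaction.
 */
static void run_end_work(struct trans_model *t)
{
    t->end_work_runs++;
    t->use_count++;     /* nested start */
    trans_end(t);       /* nested end: must only drop its own reference */
}

/* Returns 1 for a nested end, 0 for the outermost end. */
int trans_end(struct trans_model *t)
{
    if (t->use_count > 1) {
        t->use_count--; /* nested handle: drop the added ref and return */
        return 1;
    }
    /* Outermost handle: leave use_count at 1 while the end work runs. */
    run_end_work(t);
    return 0;
}
```

Because the outermost end never lowers the count to 0, the nested end inside `run_end_work` sees a count of 2, decrements it and returns, so the end work runs exactly once instead of re-entering.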
  4. 10 Mar, 2014 14 commits