1. 28 Aug, 2012 24 commits
    • Liu Bo's avatar
      Btrfs: fix ordered extent leak when failing to start a transaction · d280e5be
      Liu Bo authored
      We cannot just return an error before freeing the ordered extent and releasing
      the reserved space when we fail to start a transaction.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      d280e5be
    • Liu Bo's avatar
      Btrfs: fix a dio write regression · 24c03fa5
      Liu Bo authored
      This bug is introduced by commit 3b8bde746f6f9bd36a9f05f5f3b6e334318176a9
      (Btrfs: lock extents as we map them in DIO).
      
      In a dio write, we should unlock the section we didn't do IO on in case
      we fall back to a buffered write.  But we need to not only unlock the section
      but also clean up the reserved space for it.

      This bug was found while running xfstests 133; with this patch, 133 no longer complains.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      24c03fa5
    • Josef Bacik's avatar
      Btrfs: fix deadlock with freeze and sync V2 · bd7de2c9
      Josef Bacik authored
      We can deadlock with freeze right now because we unconditionally start a
      transaction in our ->sync_fs() call.  To fix this just check and see if we
      have a running transaction to commit.  This saves us from the deadlock
      because at this point we'll have the umount sem for the sb so we're safe
      from freezes coming in after we've done our check.  With this patch the
      freeze xfstests no longer deadlocks.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      bd7de2c9
    • Stefan Behrens's avatar
      Btrfs: revert checksum error statistic which can cause a BUG() · 5ee0844d
      Stefan Behrens authored
      Commit 442a4f63 added btrfs device
      statistic counters for detected IO and checksum errors to Linux 3.5.
      The statistic part that counts checksum errors in
      end_bio_extent_readpage() can cause a BUG() in a subfunction:
      "kernel BUG at fs/btrfs/volumes.c:3762!"
      That part is reverted with the current patch.
      However, the counting of checksum errors in the scrub context remains
      active, and the counting of detected IO errors (read, write or flush
      errors) in all contexts remains active.
      
      Cc: stable <stable@vger.kernel.org> # 3.5
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      5ee0844d
    • Stefan Behrens's avatar
      Btrfs: remove superblock writing after fatal error · 68ce9682
      Stefan Behrens authored
      With commit acce952b, btrfs was changed to flag the filesystem with
      BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
      error, such as write I/O errors on all mirrors.
      In such situations, on unmount, the superblock is written in
      btrfs_error_commit_super(). This is done with the intention to be able
      to evaluate the error flag on the next mount. A warning is printed
      in this case during the next mount and the log tree is ignored.
      
      The issue is that it is possible that the superblock points to a root
      that was not written (due to write I/O errors).
      The result is that the filesystem cannot be mounted. btrfsck also does
      not start and all the other btrfs-progs tools fail to start as well.
      However, mount -o recovery is working well and does the right things
      to recover the filesystem (i.e., don't use the log root, clear the
      free space cache and use the next mountable root that is stored in the
      root backup array).
      
      This patch removes the writing of the superblock when
      BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
      flag in the mount function.
      
      These lines can be used to reproduce the issue (using /dev/sdm):
      SCRATCH_DEV=/dev/sdm
      SCRATCH_MNT=/mnt
      echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup create foo
      ls -alLF /dev/mapper/foo
      mkfs.btrfs /dev/mapper/foo
      mount /dev/mapper/foo $SCRATCH_MNT
      echo bar > $SCRATCH_MNT/foo
      sync
      echo 0 25165824 error | dmsetup reload foo
      dmsetup resume foo
      ls -alF $SCRATCH_MNT
      touch $SCRATCH_MNT/1
      ls -alF $SCRATCH_MNT
      sleep 35
      echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup reload foo
      dmsetup resume foo
      sleep 1
      umount $SCRATCH_MNT
      btrfsck /dev/mapper/foo
      dmsetup remove foo
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      68ce9682
    • Josef Bacik's avatar
      Btrfs: allow delayed refs to be merged · ae1e206b
      Josef Bacik authored
      Daniel Blueman reported a bug with fio+balance on a ramdisk setup.
      Basically what happens is the balance relocates a tree block which will drop
      the implicit refs for all of its children and adds a full backref.  Once the
      block is relocated we have to add the implicit refs back, so when we cow the
      block again we add the implicit refs for its children back.  The problem
      comes when the original drop ref doesn't get run before we add the implicit
      refs back.  The delayed ref stuff will specifically prefer ADD operations
      over DROP to keep us from freeing up an extent that will have references to
      it, so we try to add the implicit ref before it is actually removed and we
      panic.  This worked fine before because the add would have just canceled the
      drop out and we would have been fine.  But the backref walking work needs to
      be able to freeze the delayed ref stuff in time so we have this ever
      increasing sequence number that gets attached to all new delayed ref updates
      which makes us not merge refs and we run into this issue.
      
      To fix this we need to merge delayed refs, so every time we run a
      clustered ref we need to try and merge all of its delayed refs.  The backref
      walking stuff locks the delayed ref head before processing, so if we have it
      locked we are safe to merge any refs inside of the sequence number.  If
      there is no sequence number we can merge all refs.  Doing this not only
      fixes our bug but also keeps the delayed ref code from adding and removing
      useless refs, and batches multiple refs into one search instead of
      one search per delayed ref, which will really help our commit times.  I ran
      this with Daniel's test and 276 and I haven't seen any problems.  Thanks,
      Reported-by: Daniel J Blueman <daniel@quora.org>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      ae1e206b
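The merging idea in the commit above can be modelled in a few lines of user-space C. This is only a toy sketch, not the btrfs implementation: `struct toy_ref` and `merge_refs()` are hypothetical names, and the rule that only refs with a sequence number below the oldest held one may be merged is an assumption made for illustration. With no held sequence number (`hold_seq == 0`) everything merges, so an ADD and a DROP on the same extent cancel out instead of being run one after the other.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of a delayed ref: hypothetical, not the kernel structure. */
struct toy_ref {
	int action;    /* +1 for an ADD, -1 for a DROP */
	uint64_t seq;  /* sequence number attached when the ref was queued */
};

/*
 * Fold every mergeable ref into a single net count stored in *net.
 * hold_seq == 0 means nobody holds a sequence number, so all refs merge.
 * Otherwise (an assumption of this sketch) refs with seq >= hold_seq are
 * left alone. Returns the number of refs that were not merged.
 */
static size_t merge_refs(const struct toy_ref *refs, size_t n,
			 uint64_t hold_seq, int *net)
{
	size_t unmerged = 0;

	*net = 0;
	for (size_t i = 0; i < n; i++) {
		if (hold_seq && refs[i].seq >= hold_seq) {
			unmerged++;
			continue;
		}
		*net += refs[i].action;
	}
	return unmerged;
}
```

For a DROP at sequence 1 followed by an ADD at sequence 2, calling `merge_refs()` with `hold_seq` of 0 leaves a net count of 0, so neither operation has to touch the extent tree in this model.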
    • Josef Bacik's avatar
      Btrfs: fix enospc problems when deleting a subvol · 5a24e84c
      Josef Bacik authored
      Subvol delete is a special kind of awful where we use the global reserve to
      cover the ENOSPC requirements.  The problem is once we're done removing
      everything we do a btrfs_update_inode(), which by default will try to do the
      delayed update stuff which will use its own reserve.  There will be no
      space in this reserve and we'll return ENOSPC.  So instead use
      btrfs_update_inode_fallback() which will just fall back to updating the inode
      item in the case of ENOSPC.  This is fine because the global reserve covers
      the space requirements for this.  With this patch I can now delete a subvol
      on a problem image Dave Sterba sent me.  Thanks,
      Reported-by: David Sterba <dave@jikos.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      5a24e84c
    • Miao Xie's avatar
      Btrfs: fix wrong mtime and ctime when creating snapshots · c0f62ded
      Miao Xie authored
      When we created a new snapshot, the mtime and ctime of its parent directory
      were not updated. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      c0f62ded
    • Arne Jansen's avatar
      Btrfs: fix race in run_clustered_refs · 22cd2e7d
      Arne Jansen authored
      With commit
      
      commit d1270cd9
      Author: Arne Jansen <sensille@gmx.net>
      Date:   Tue Sep 13 15:16:43 2011 +0200
      
           Btrfs: put back delayed refs that are too new
      
      I added a window where the delayed_ref's head->ref_mod code can diverge
      from the sum of the remaining refs, because we release the head->mutex
      in the middle. This leads to btrfs_lookup_extent_info returning wrong
      numbers. This patch fixes this by adjusting the head's ref_mod with each
      delayed ref we run.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      22cd2e7d
    • Chris Mason's avatar
      Btrfs: don't run __tree_mod_log_free_eb on leaves · b12a3b1e
      Chris Mason authored
      When we split a leaf, we may end up inserting a new root on top of that
      leaf.  The reflog code was incorrectly assuming the old root was always
      a node.  This makes sure we skip over leaves.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      b12a3b1e
    • Josef Bacik's avatar
      Btrfs: increase the size of the free space cache · 6fc823b1
      Josef Bacik authored
      Arne was complaining about the space cache having mismatching generation
      numbers when debugging a deadlock.  This is because we can run out of space
      in our preallocated range for our space cache if you have a pretty
      fragmented amount of space in your pinned space.  So just increase the
      amount of space we preallocate for space cache so we can be sure to have
      enough space.  This will only really affect data ranges since they're the only
      chunks that end up larger than 256MB.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      6fc823b1
    • Josef Bacik's avatar
      Btrfs: barrier before waitqueue_active · 66657b31
      Josef Bacik authored
      We need a barrier before calling waitqueue_active(), otherwise we will miss
      wakeups.  So in places that do atomic_dec() and then atomic_read(), use
      atomic_dec_return(), which implies a memory barrier (see memory-barriers.txt),
      and add an explicit memory barrier everywhere else that needs one.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      66657b31
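A rough user-space analogue of the "decrement and test in one ordered step" pattern from the commit above, written with C11 atomics. This is a sketch under assumptions: `dec_and_test_last()` is a hypothetical helper, it does not model the kernel wait queue at all, and C11 `atomic_fetch_sub()` is only comparable to the kernel's `atomic_dec_return()` in that both return a value from a single read-modify-write with full ordering by default.

```c
#include <stdatomic.h>

/*
 * Hypothetical helper: drop one pending count and report whether the caller
 * was the last one. atomic_fetch_sub() returns the value held before the
 * decrement and uses memory_order_seq_cst by default, so the decrement and
 * the test cannot be split or reordered the way a separate
 * "decrement, then read" pair could be.
 */
static int dec_and_test_last(atomic_int *pending)
{
	return atomic_fetch_sub(pending, 1) == 1;
}
```

A caller would check for waiters only when `dec_and_test_last()` returns nonzero, rather than decrementing and then re-reading the counter in a second, unordered step.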
    • Arne Jansen's avatar
      Btrfs: fix deadlock in wait_for_more_refs · 1fa11e26
      Arne Jansen authored
      Commit a168650c introduced a waiting mechanism to prevent busy waiting in
      btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations,
      where a tree_mod_seq is held while waiting for the io to complete, while
      the end_io calls btrfs_run_delayed_refs.
      This whole mechanism is unnecessary. If not enough runnable refs are
      available to satisfy count, just return as count is more like a guideline
      than a strict requirement.
      In case we have to run all refs, commit transaction makes sure that no
      other threads are working in the transaction anymore, so we just assert
      here that no refs are blocked.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      1fa11e26
    • Fengguang Wu's avatar
      btrfs: fix second lock in btrfs_delete_delayed_items() · 62095265
      Fengguang Wu authored
      Fix a real bug caught by coccinelle.
      
      fs/btrfs/delayed-inode.c:1013:1-11: second lock on line 1013
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      62095265
    • Josef Bacik's avatar
      Btrfs: don't allocate a seperate csums array for direct reads · c329861d
      Josef Bacik authored
      We've been allocating a big array for csums instead of storing them in the
      io_tree like we do for buffered reads because previously we were locking the
      entire range, so we didn't have an extent state for each sector of the
      range.  But now that we do the range locking as we map the buffers we can
      limit the mapping length to sectorsize and use the private part of the
      io_tree for our csums.  This allows us to avoid an extra memory allocation
      for direct reads which could incur latency.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      c329861d
    • Josef Bacik's avatar
      Btrfs: do not strdup non existent strings · 99f5944b
      Josef Bacik authored
      When we close devices we add back empty devices for some reason that escapes
      me.  In the case of a missing dev we don't allocate an rcu_string for its
      name, so check whether the device has a name and, if it doesn't, don't
      bother strdup()'ing it.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      99f5944b
    • Josef Bacik's avatar
      Btrfs: do not use missing devices when showing devname · aa9ddcd4
      Josef Bacik authored
      If you do the following
      
      mkfs.btrfs /dev/sdb /dev/sdc
      rmmod btrfs
      dd if=/dev/zero of=/dev/sdb bs=1M count=1
      mount -o degraded /dev/sdc /mnt/btrfs-test
      
      the box will panic trying to deref the name for the missing dev since it is
      the lower numbered devid.  So fix show_devname to not use missing devices.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      aa9ddcd4
    • Stefan Behrens's avatar
      Btrfs: fix that error value is changed by mistake · 3627bf45
      Stefan Behrens authored
      In iterate_inodes_from_logical() the error result from
      extent_from_logical() is patched by mistake. Typically ENOENT is
      patched to EINVAL because (-ENOENT & BTRFS_EXTENT_FLAG_TREE_BLOCK)
      evaluates to true.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      3627bf45
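The masking mistake described in the commit above is easy to reproduce in user space. The following is a minimal illustration, not the btrfs code: `FLAG_TREE_BLOCK`, `classify_buggy()` and `classify_fixed()` are hypothetical stand-ins, and the flag is assumed to be bit 1 (value 2). Because `-ENOENT` is `-2`, which has bit 1 set in two's complement, testing the flag before checking for a negative return value turns the error into `-EINVAL`.

```c
#include <errno.h>

/* Hypothetical stand-in for the tree-block flag; bit 1 is an assumption. */
#define FLAG_TREE_BLOCK (1 << 1)

/* Buggy: inspects flag bits even when ret is a negative errno. */
static int classify_buggy(int ret)
{
	if (ret & FLAG_TREE_BLOCK)
		return -EINVAL;
	return ret;
}

/* Fixed: propagate errors first, then look at the flag bits. */
static int classify_fixed(int ret)
{
	if (ret < 0)
		return ret;
	if (ret & FLAG_TREE_BLOCK)
		return -EINVAL;
	return ret;
}
```

With these definitions, `classify_buggy(-ENOENT)` yields `-EINVAL`, while `classify_fixed(-ENOENT)` passes `-ENOENT` through unchanged.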
    • Josef Bacik's avatar
      Btrfs: lock extents as we map them in DIO · eb838e73
      Josef Bacik authored
      A deadlock in xfstests 113 was uncovered by commit
      
      d187663e
      
      This is because we would not return EIOCBQUEUED for short AIO reads, instead
      we'd wait for the DIO to complete and then return the amount of data we
      transferred, which would allow our stuff to unlock the remaining amount.  But
      with this change this no longer happens, so if we have a short AIO read (for
      example if we try to read past EOF), we could leave the section from EOF to
      the end of where we tried to read locked.  Fixing this is tricky since there
      is no clear way to know exactly how much data DIO truly submitted for IO, so
      to make this less hard on ourselves and less cumbersome we need to lock the
      extents as we try to map them, and then we unlock any areas we didn't
      actually map.  This makes us completely safe from deadlocks and reliance on
      a particular behavior of the DIO code.  This also lays the groundwork for
      allowing us to use the normal csum storage method for reads which means we
      can remove an allocation.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      eb838e73
    • Dan Carpenter's avatar
      Btrfs: fix some endian bugs handling the root times · dadd1105
      Dan Carpenter authored
      "trans->transid" is cpu endian but we want to store the data as little
      endian.  "item->ctime.nsec" is only 32 bits, not 64.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      dadd1105
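To illustrate the first point of the commit above: an on-disk little-endian field has to be written byte by byte in a fixed order, whatever the CPU's native byte order is. In the kernel this is done with helpers such as `cpu_to_le64()`; the portable user-space sketch below uses hypothetical helpers `store_le64()` and `load_le64()` instead and is not taken from the patch.

```c
#include <stdint.h>

/* Write a CPU-endian 64-bit value as 8 little-endian bytes. */
static void store_le64(uint8_t out[8], uint64_t v)
{
	for (int i = 0; i < 8; i++)
		out[i] = (uint8_t)(v >> (8 * i));  /* least significant byte first */
}

/* Read 8 little-endian bytes back into a CPU-endian value. */
static uint64_t load_le64(const uint8_t in[8])
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v |= (uint64_t)in[i] << (8 * i);
	return v;
}
```

The same care applies to field width: a 32-bit field such as a nanosecond count needs a 4-byte variant of these helpers, not the 8-byte one.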
    • Dan Carpenter's avatar
      Btrfs: unlock on error in btrfs_delalloc_reserve_metadata() · 55e591ff
      Dan Carpenter authored
      We should release this mutex before returning the error code.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      55e591ff
    • Dan Carpenter's avatar
      Btrfs: checking for NULL instead of IS_ERR · 57a5a882
      Dan Carpenter authored
      add_qgroup_rb() never returns NULL, only error pointers.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      57a5a882
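Why a NULL check misses this kind of failure can be shown with a user-space imitation of the kernel's error-pointer convention. The helpers below (`err_ptr()`, `ptr_err()`, `is_err()`) are simplified, hypothetical re-implementations written for this sketch, and `lookup_or_error()` is an invented function that, like the one named in the commit above, reports failure through an error pointer rather than NULL.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in a pointer value (user-space imitation). */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if ptr lies in the top MAX_ERRNO addresses, i.e. encodes an errno. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical function that never returns NULL, only error pointers. */
static void *lookup_or_error(int fail)
{
	static int object;

	return fail ? err_ptr(-ENOMEM) : &object;
}
```

A caller that only tests `lookup_or_error(1) == NULL` treats the failure as success, because the returned error pointer is non-NULL; testing with `is_err()` catches it.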
    • Dan Carpenter's avatar
      Btrfs: fix some error codes in btrfs_qgroup_inherit() · 5986802c
      Dan Carpenter authored
      These are returning zero when they should be returning a negative error
      code.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      5986802c
    • Stefan Behrens's avatar
      Btrfs: fix a misplaced address operator in a condition · aa2ffd06
      Stefan Behrens authored
      This should obviously not be "if (&flag)" but "if (flag)".
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      aa2ffd06
  2. 25 Jul, 2012 12 commits
  3. 23 Jul, 2012 4 commits
    • Liu Bo's avatar
      Btrfs: improve multi-thread buffer read · 67c9684f
      Liu Bo authored
      While testing with my buffer read fio jobs[1], I find that btrfs does not
      perform well enough.
      
      Here is a scenario in fio jobs:
      
      We have 4 threads, "t1 t2 t3 t4", starting to buffer read the same file,
      and all of them will race on add_to_page_cache_lru(), and if one thread
      successfully puts its page into the page cache, it takes the responsibility
      to read the page's data.
      
      What's more, reading a page takes a while to finish, during which
      other threads can slide in and process the remaining pages:
      
           t1          t2          t3          t4
         add Page1
         read Page1  add Page2
           |         read Page2  add Page3
           |            |        read Page3  add Page4
           |            |           |        read Page4
      -----|------------|-----------|-----------|--------
           v            v           v           v
          bio          bio         bio         bio
      
      Now we have four bios, each of which holds only one page since we need to
      maintain consecutive pages in bio.  Thus, we can end up with far more bios
      than we need.
      
      Here we're going to
      a) delay the real read-page section and
      b) try to put more pages into page cache.
      
      With that said, we can make each bio hold more pages and reduce the number
      of bios we need.
      
      Here are some numbers taken from fio results:
               w/o patch                 w patch
             -------------  --------  ---------------
      READ:    745MB/s        +25%       934MB/s
      
      [1]:
      [global]
      group_reporting
      thread
      numjobs=4
      bs=32k
      rw=read
      ioengine=sync
      directory=/mnt/btrfs/
      
      [READ]
      filename=foobar
      size=2000M
      invalidate=1
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      67c9684f
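The bio-count argument in the commit above can be checked with a small model. This is a hypothetical sketch, not the btrfs read path: `count_bios()` simply counts runs of consecutive page indices, on the stated premise that a bio can only hold consecutive pages. When each of the four threads submits its single page on its own, every call sees one page and needs one bio, four in total; when one submitter collects pages 1 to 4 first, a single bio suffices.

```c
#include <stddef.h>

/*
 * Count how many bios are needed for a sequence of page indices when each
 * bio may only hold consecutive pages. A new bio is started whenever the
 * current page does not directly follow the previous one.
 */
static size_t count_bios(const unsigned long *pages, size_t n)
{
	size_t bios = 0;

	for (size_t i = 0; i < n; i++) {
		if (i == 0 || pages[i] != pages[i - 1] + 1)
			bios++;
	}
	return bios;
}
```

In this model, delaying the read and adding more pages to the page cache first is what lets a single caller present a longer consecutive run to `count_bios()`.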
    • Liu Bo's avatar
      Btrfs: make btrfs's allocation smoothly with preallocation · df57dbe6
      Liu Bo authored
      For backref walking, we've introduced a sequence number for delayed refs.
      However, it changes our preallocation behavior.
      
      The story is that when we preallocate an extent and then mark it written
      piece by piece, the ideal case should be that we don't need to COW the
      extent, which is why we use 'preallocate'.
      
      But we may not make use of preallocation, since when we check for cross refs on
      the extent, we may have two ref entries which have the same content except
      the sequence value, and we recognize them as cross refs and do COW to allocate
      another extent.
      
      So we end up with several pieces of space instead of a whole extent.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      df57dbe6
    • Josef Bacik's avatar
      Btrfs: lock the transition from dirty to writeback for an eb · 51561ffe
      Josef Bacik authored
      There is a small window where an eb can have no IO bits set on it, which
      could potentially result in extent_buffer_under_io() returning false when we
      want it to return true, which could result in not fun things happening.  So
      in order to protect this case we need to hold the refs_lock when we make
      this transition to make sure we get reliable results out of
      extent_buffer_under_io().  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      51561ffe
    • Josef Bacik's avatar
      Btrfs: fix potential race in extent buffer freeing · 594831c4
      Josef Bacik authored
      This sounds sort of impossible but it is the only thing I can think of and
      at the very least it is theoretically possible so here it goes.
      
      If we are in try_release_extent_buffer we will check that the ref count on
      the extent buffer is 1 and not under IO, and then go down and clear the tree
      ref.  If between this check and clearing the tree ref somebody else comes in
      and grabs a ref on the eb and then marks it dirty before
      try_release_extent_buffer() does its tree ref clear we can end up with a
      dirty eb that will be freed while it is still dirty which will result in a
      panic.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      594831c4