1. 25 Jul, 2012 12 commits
  2. 23 Jul, 2012 28 commits
    • Btrfs: improve multi-thread buffer read · 67c9684f
      Liu Bo authored
      While testing with my buffer read fio jobs[1], I find that btrfs does not
      perform well enough.
      
      Here is a scenario in fio jobs:
      
      We have 4 threads, "t1 t2 t3 t4", starting to buffer read the same file,
      and all of them will race on add_to_page_cache_lru(), and if one thread
      successfully puts its page into the page cache, it takes the responsibility
      to read the page's data.
      
      What's more, reading a page takes some time to finish, during which
      other threads can slide in and process the remaining pages:
      
           t1          t2          t3          t4
         add Page1
         read Page1  add Page2
           |         read Page2  add Page3
           |            |        read Page3  add Page4
           |            |           |        read Page4
      -----|------------|-----------|-----------|--------
           v            v           v           v
          bio          bio         bio         bio
      
      Now we have four bios, each of which holds only one page since we need to
      maintain consecutive pages in bio.  Thus, we can end up with far more bios
      than we need.
      
      Here we're going to
      a) delay the real read-page section and
      b) try to put more pages into page cache.
      
      With that said, we can make each bio hold more pages and reduce the number
      of bios we need.
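
      The effect described above can be illustrated with a small userspace
      sketch. It is not the btrfs code; bios_for_batch() and the two scenario
      helpers are hypothetical names, and the only rule modelled is the one
      stated in this message: a bio holds consecutive pages only.

```c
#include <stddef.h>

/*
 * Count how many bios one submitter needs for the pages it queued,
 * assuming a bio can only hold consecutive pages. Every break in the
 * page index sequence starts a new bio.
 */
static size_t bios_for_batch(const unsigned long *pages, size_t n)
{
	size_t bios = 0;

	for (size_t i = 0; i < n; i++) {
		if (i == 0 || pages[i] != pages[i - 1] + 1)
			bios++;
	}
	return bios;
}

/* Without batching: four threads, each adds and reads exactly one page. */
static size_t bios_without_batching(void)
{
	const unsigned long t1[] = {1}, t2[] = {2}, t3[] = {3}, t4[] = {4};

	return bios_for_batch(t1, 1) + bios_for_batch(t2, 1) +
	       bios_for_batch(t3, 1) + bios_for_batch(t4, 1);
}

/* With batching: one thread adds all four pages first, then reads them. */
static size_t bios_with_batching(void)
{
	const unsigned long t1[] = {1, 2, 3, 4};

	return bios_for_batch(t1, 4);
}
```

      In this model the first scenario needs four bios and the second needs
      one, which is the reduction the patch aims for.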
      
      Here are some numbers taken from fio results:
               w/o patch                 w patch
             -------------  --------  ---------------
      READ:    745MB/s        +25%       934MB/s
      
      [1]:
      [global]
      group_reporting
      thread
      numjobs=4
      bs=32k
      rw=read
      ioengine=sync
      directory=/mnt/btrfs/
      
      [READ]
      filename=foobar
      size=2000M
      invalidate=1
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: make btrfs's allocation smoothly with preallocation · df57dbe6
      Liu Bo authored
      For backref walking, we've introduced delayed ref's sequence.  However,
      it changes our preallocation behavior.
      
      The story is that when we preallocate an extent and then mark it written
      piece by piece, the ideal case should be that we don't need to COW the
      extent, which is why we use 'preallocate'.
      
      But we may not make use of preallocation, since when we check for cross refs on
      the extent, we may have two ref entries which have the same content except
      the sequence value, and we recognize them as cross refs and do COW to allocate
      another extent.
      
      So we end up with several pieces of space instead of a whole extent.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: lock the transition from dirty to writeback for an eb · 51561ffe
      Josef Bacik authored
      There is a small window where an eb can have no IO bits set on it, which
      could potentially result in extent_buffer_under_io() returning false when we
      want it to return true, which could result in not fun things happening.  So
      in order to protect this case we need to hold the refs_lock when we make
      this transition to make sure we get reliable results out of
      extent_buffer_under_io().  Thanks,
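
      A minimal userspace sketch of the idea, under stated assumptions: the
      struct, flag names and helpers below are hypothetical stand-ins, the
      lock is a toy spinlock built on C11 atomic_flag, and the reader is
      assumed to take the same lock. It only shows why doing the
      dirty-to-writeback transition under the lock hides the moment where
      neither bit is set.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define EB_DIRTY     (1u << 0)
#define EB_WRITEBACK (1u << 1)

struct mock_eb {
	atomic_flag refs_lock;  /* stand-in for the eb's refs_lock */
	unsigned int bflags;    /* protected by refs_lock in this sketch */
};

static void mock_lock(struct mock_eb *eb)
{
	while (atomic_flag_test_and_set_explicit(&eb->refs_lock,
						 memory_order_acquire))
		; /* spin until the lock is free */
}

static void mock_unlock(struct mock_eb *eb)
{
	atomic_flag_clear_explicit(&eb->refs_lock, memory_order_release);
}

/* Reader: reports whether any IO bit is set, checked under the lock. */
static bool mock_under_io(struct mock_eb *eb)
{
	bool busy;

	mock_lock(eb);
	busy = (eb->bflags & (EB_DIRTY | EB_WRITEBACK)) != 0;
	mock_unlock(eb);
	return busy;
}

/* Writer: the whole dirty -> writeback transition happens under the lock. */
static void mock_start_writeback(struct mock_eb *eb)
{
	mock_lock(eb);
	eb->bflags &= ~EB_DIRTY;
	/* No IO bit is set here, but no locked reader can observe it. */
	eb->bflags |= EB_WRITEBACK;
	mock_unlock(eb);
}
```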
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix potential race in extent buffer freeing · 594831c4
      Josef Bacik authored
      This sounds sort of impossible but it is the only thing I can think of and
      at the very least it is theoretically possible so here it goes.
      
      If we are in try_release_extent_buffer we will check that the ref count on
      the extent buffer is 1 and not under IO, and then go down and clear the tree
      ref.  If between this check and clearing the tree ref somebody else comes in
      and grabs a ref on the eb and then marks it dirty before
      try_release_extent_buffer() does its tree ref clear we can end up with a
      dirty eb that will be freed while it is still dirty which will result in a
      panic.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: don't return true in releasepage unless we actually freed the eb · e64860aa
      Josef Bacik authored
      I noticed while looking at an extent_buffer race that we will
      unconditionally return 1 if we get down to release_extent_buffer after
      clearing the tree ref.  However we can easily race in here and get a ref on
      the eb and not actually free the eb.  So make release_extent_buffer return 1
      if it free'd the eb and 0 if not so we can be a little kinder to the vm.
      Thanks,
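
      The return-value contract can be sketched in userspace as follows. This
      is an illustration only: mock_buf and its helpers are hypothetical, and
      a plain C11 atomic counter stands in for the eb reference count.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct mock_buf {
	atomic_int refs;
};

/* Allocate a buffer holding a single reference. */
static struct mock_buf *mock_buf_alloc(void)
{
	struct mock_buf *b = malloc(sizeof(*b));

	if (b)
		atomic_init(&b->refs, 1);
	return b;
}

/*
 * Drop one reference. Return 1 only if this call dropped the last
 * reference and freed the buffer, 0 if somebody else still holds a ref.
 */
static int mock_buf_release(struct mock_buf *b)
{
	if (atomic_fetch_sub(&b->refs, 1) == 1) {
		free(b);
		return 1;
	}
	return 0;
}
```

      A caller that raced with another ref holder gets 0 back and knows the
      memory was not actually released.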
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: suppress printk() if all device I/O stats are zero · a98cdb85
      Stefan Behrens authored
      Code is added to suppress the I/O stats printing at mount time if all
      statistic values are zero.
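
      A minimal sketch of such a check, assuming the counters are available
      as a flat array. The function names and the output format are
      hypothetical, and printf() stands in for printk().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* True if every counter in the array is zero. */
static bool stats_all_zero(const unsigned long long *stats, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (stats[i] != 0)
			return false;
	}
	return true;
}

/* Print one line of counters; return 1 if printed, 0 if suppressed. */
static int report_stats(const char *dev, const unsigned long long *stats,
			size_t n)
{
	if (stats_all_zero(stats, n))
		return 0;

	printf("%s: stats:", dev);
	for (size_t i = 0; i < n; i++)
		printf(" %llu", stats[i]);
	printf("\n");
	return 1;
}
```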
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
    • Btrfs: remove unwanted printk() for btrfs device I/O stats · 5021976d
      Stefan Behrens authored
      People complained about the annoying kernel log message
      "btrfs: no dev_stats entry found ... (OK on first mount after mkfs)"
      every time a filesystem is mounted for the first time after running
      mkfs. Since btrfs-progs releases are not synchronized with kernel
      versions, mkfs in its current form will remain in use for some time.
      The message therefore does not help to find errors; it is just
      annoying. This commit removes the printk().
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
    • Btrfs: rewrite BTRFS_SETGET_FUNCS · 18077bb4
      Li Zefan authored
      BTRFS_SETGET_FUNCS macro is used to generate btrfs_set_foo() and
      btrfs_foo() functions, which read and write specific fields in the
      extent buffer.
      
      The total number of set/get functions is ~200, but in fact we only
      need 8 functions: 2 for u8 field, 2 for u16, 2 for u32 and 2 for u64.
      
      This reduces the text size by ~37K bytes (629661 - 592637 = 37024).
      
         text    data     bss     dec     hex filename
       629661   12489     216  642366   9cd3e fs/btrfs/btrfs.o.orig
       592637   12489     216  605342   93c9e fs/btrfs/btrfs.o
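
      The shape of the change can be sketched in userspace as below. This is
      a simplified illustration, not the kernel macro: all mock_* names are
      hypothetical, only the 32-bit pair is shown, and endianness conversion
      and extent buffer page handling are left out. The point is that the
      per-field functions become thin inline wrappers around one shared pair
      of helpers per width.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One shared pair of out-of-line accessors per field width (u32 shown). */
static uint32_t mock_get_32(const void *buf, size_t off)
{
	uint32_t v;

	memcpy(&v, (const uint8_t *)buf + off, sizeof(v));
	return v;
}

static void mock_set_32(void *buf, size_t off, uint32_t v)
{
	memcpy((uint8_t *)buf + off, &v, sizeof(v));
}

/* Per-field wrappers only supply the field offset to the shared helper. */
#define MOCK_SETGET_FUNCS(name, type, member, bits)                     \
static inline uint##bits##_t mock_##name(const type *s)                 \
{                                                                       \
	return mock_get_##bits(s, offsetof(type, member));              \
}                                                                       \
static inline void mock_set_##name(type *s, uint##bits##_t v)           \
{                                                                       \
	mock_set_##bits(s, offsetof(type, member), v);                  \
}

struct mock_item {
	uint32_t flags;
	uint32_t nlink;
};

MOCK_SETGET_FUNCS(item_flags, struct mock_item, flags, 32)
MOCK_SETGET_FUNCS(item_nlink, struct mock_item, nlink, 32)
```

      Adding another 32-bit field then costs one macro line and no new
      out-of-line function body.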
      Signed-off-by: Li Zefan <lizefan@huawei.com>
    • Btrfs: zero unused bytes in inode item · 293f7e07
      Li Zefan authored
      The otime field is not zeroed, so users will see random otime in an old
      filesystem with a new kernel which has otime support in the future.
      
      The reserved bytes are also not zeroed, and we'll have compatibility
      issues if we make use of those bytes.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
    • Btrfs: kill free_space pointer from inode structure · b4d7c3c9
      Li Zefan authored
      Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type,
      which means every inode has the same BTRFS_I(inode)->free_space pointer.
      
      This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits).
      Signed-off-by: Li Zefan <lizefan@huawei.com>
    • btrfs read error corrected message floods the console during recovery · d5b025d5
      Anand Jain authored
      Changing printk_in_rcu to printk_ratelimited_in_rcu will suffice
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix buffer leak in btrfs_next_old_leaf · e6466e35
      Jan Schmidt authored
      When calling btrfs_next_old_leaf, we were leaking an extent buffer in the
      rare case of using the deadlock avoidance code needed for the tree mod log.
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: do not count in readonly bytes · f6175efa
      Liu Bo authored
      If a block group is ro, do not count its entries in when we dump space info.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: add ro notification to dump_space_info · 799ffc3c
      Liu Bo authored
      A block group has an ro attribute; make dump_space_info show it.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix a bug of writting free space cache during balance · cf7c1ef6
      Liu Bo authored
      Here is the whole story:
      1)
      A free space cache consists of two parts:
      o  free space cache inode, which is special because it's stored in root tree.
      o  free space info, which is stored as the above inode's file data.
      
      But we only build up another new inode and do not flush its free space info
      onto disk when we _clear and setup_ free space cache, so the block group
      cache's cache_state remains DC_SETUP instead of DC_WRITTEN.
      
      And holding DC_SETUP means that we will not truncate this free space cache inode,
      which means the disk offset of its file extent will remain _unchanged_ at least
      until next transaction finishes committing itself.
      
      2)
      We can set a block group readonly when we relocate the block group.
      
      However,
      if the readonly block group covers the disk offset where our free space cache
      inode is going to write, it will force the free space cache inode into
      cow_file_range() and it'll end up hitting a BUG_ON.
      
      3)
      Due to the above analysis, we fix this bug by adding the missing dirty flag.
      
      4)
      However, it's not over, there is still another case, nospace_cache.
      
      With nospace_cache, we do not want to set the dirty flag; instead we just
      truncate the free space cache inode and bail out after setting the cache
      state to DC_WRITTEN.

      We can benefit from it since it saves us another 'pre-allocation' part which
      usually costs a lot.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: do not abort transaction in prealloc case · 06789384
      Liu Bo authored
      During disk balance, we prealloc new file extent for file data relocation,
      but we may fail in 'no available space' case, and it leads to flipping btrfs
      into readonly.
      
      It is not necessary to bail out and abort transaction since we do have several
      ways to rescue ourselves from ENOSPC case.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: kill root from btrfs_is_free_space_inode · 83eea1f1
      Liu Bo authored
      Since root can be fetched via BTRFS_I macro directly, we can save an argument
      for btrfs_is_free_space_inode().
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix btrfs_is_free_space_inode to recognize btree inode · 51a8cf9d
      Liu Bo authored
      For btree inode, its root is also 'tree root', so btree inode can be
      misunderstood as a free space inode.
      
      We should add one more check for btree inode.
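
      The extra check can be sketched with mock types. Everything named
      mock_* or MOCK_* is hypothetical, including the inode number used for
      the btree inode, and the tree root is passed in explicitly because the
      sketch has no fs_info to fetch it from.

```c
#include <stdbool.h>
#include <stdint.h>

struct mock_root {
	int id;
};

struct mock_inode {
	const struct mock_root *root;
	uint64_t ino;
};

/* Hypothetical inode number reserved for the btree inode in this sketch. */
#define MOCK_BTREE_INODE_INO 1ULL

/*
 * An inode in the tree root is treated as a free space inode unless it
 * is the btree inode, which also lives in the tree root.
 */
static bool mock_is_free_space_inode(const struct mock_inode *inode,
				     const struct mock_root *tree_root)
{
	return inode->root == tree_root &&
	       inode->ino != MOCK_BTREE_INODE_INO;
}
```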
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: avoid I/O repair BUG() from btree_read_extent_buffer_pages() · c0901581
      Stefan Behrens authored
      From btree_read_extent_buffer_pages(), currently repair_io_failure()
      can be called with mirror_num being zero when submit_one_bio() returned
      an error before. This used to cause a BUG_ON(!mirror_num) in
      repair_io_failure() and indeed this is not a case that needs the I/O
      repair code to rewrite disk blocks.
      This commit prevents calling repair_io_failure() in this case and thus
      avoids the BUG_ON() and malfunction.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: rework shrink_delalloc · f4c738c2
      Josef Bacik authored
      So shrink_delalloc has grown all sorts of cruft over the years thanks to
      many reworkings of how we track enospc.  What happens now as we fill up the
      disk is we will loop for freaking ever hoping to reclaim an arbitrary amount
      of metadata space; this was from when everybody flushed at the same time.
      Now we only have people flushing one at a time.  So instead of trying to
      reclaim a huge amount of space, just try to flush a decent chunk of space,
      and stop looping as soon as we have enough free space to satisfy our
      reservation.  This makes xfstests 224 go much faster.  Thanks,
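
      The new loop shape can be sketched as follows. This is a toy model, not
      the kernel function: mock_space, mock_flush() and mock_shrink() are
      hypothetical, and "flushing" simply moves bytes from a delalloc counter
      to a free counter. The loop flushes one chunk at a time and stops as
      soon as the reservation can be satisfied.

```c
#include <stdint.h>

struct mock_space {
	uint64_t free_bytes;      /* space currently available */
	uint64_t delalloc_bytes;  /* space that flushing could reclaim */
};

/* Flush up to chunk bytes of delalloc; return how much was reclaimed. */
static uint64_t mock_flush(struct mock_space *s, uint64_t chunk)
{
	uint64_t got = chunk < s->delalloc_bytes ? chunk : s->delalloc_bytes;

	s->delalloc_bytes -= got;
	s->free_bytes += got;
	return got;
}

/*
 * Flush in chunks until the reservation of `needed` bytes fits, nothing
 * is left to reclaim, or max_loops passes were made. Returns the number
 * of flush passes that reclaimed something.
 */
static int mock_shrink(struct mock_space *s, uint64_t needed,
		       uint64_t chunk, int max_loops)
{
	int loops = 0;

	while (loops < max_loops && s->free_bytes < needed) {
		if (mock_flush(s, chunk) == 0)
			break; /* nothing left to reclaim */
		loops++;
	}
	return loops;
}
```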
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: do not set subvolume flags in readonly mode · b9ca0664
      Liu Bo authored
      $ mkfs.btrfs /dev/sdb7
      $ btrfstune -S1 /dev/sdb7
      $ mount /dev/sdb7 /mnt/btrfs
      mount: block device /dev/sdb7 is write-protected, mounting read-only
      $ btrfs dev add /dev/sdb8 /mnt/btrfs/
      
      Now we get a btrfs in which the mnt flags have readonly set but the sb
      flags do not.  So for those ioctls that only check sb flags with
      MS_RDONLY, it is going to be a problem.
      Setting subvolume flags is such an ioctl; we should use
      mnt_want_write_file() to check RO flags.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
    • Btrfs: use mnt_want_write_file instead of mnt_want_write · e54bfa31
      Liu Bo authored
      mnt_want_write_file is faster when file has been opened for write.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
    • Btrfs: remove redundant r/o check for superblock · 768e9dfe
      Liu Bo authored
      mnt_want_write() and mnt_want_write_file() will check sb->s_flags with
      MS_RDONLY, and we don't need to do it ourselves.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
    • Btrfs: check write access to mount earlier while creating snapshots · a874a63e
      Liu Bo authored
      Move check of write access to mount into upper functions so that we can
      use mnt_want_write_file instead, which is faster than mnt_want_write.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
    • Btrfs: fix typo in cow_file_range_async and async_cow_submit · 287082b0
      Liu Bo authored
      It should be 10 * 1024 * 1024.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
    • Btrfs: change how we indicate we're adding csums · 0e721106
      Josef Bacik authored
      There is weird logic I had to put in place to make sure that when we were
      adding csums that we'd use the delalloc block rsv instead of the global
      block rsv.  Part of this meant that we had to free up our transaction
      reservation before we ran the delayed refs since csum deletion happens
      during the delayed ref work.  The problem with this is that when we release
      a reservation we will add it to the global reserve if it is not full in
      order to keep us going along longer before we have to force a transaction
      commit.  By releasing our reservation before we run delayed refs we don't
      get the opportunity to drain down the global reserve for the work we did, so
      we won't refill it as often.  This isn't a problem per-se, it just results
      in us possibly committing transactions more and more often, and in rare
      cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran
      out of space in our block rsv.
      
      This also helps us by holding onto space while the delayed refs run so we
      don't end up with as many people trying to do things at the same time, which
      again will help us not force commits or hit the use_block_rsv warnings.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: return error of btrfs_update_inode() to caller · b9959295
      Tsutomu Itoh authored
      We didn't check the error returned by btrfs_update_inode(), but that
      error looks easy to bubble back up.
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix error handling in __add_reloc_root() · 23291a04
      Dan Carpenter authored
      We dereferenced "node" in the error message after freeing it.  Also
      btrfs_panic() can return so we should return an error code instead of
      continuing.
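
      Both fixes follow a common pattern that can be sketched in userspace.
      The names below are hypothetical: mock_panic() stands in for a panic
      helper that may return, and a single slot stands in for the tree the
      node is inserted into. The node is read before it is freed, and the
      error path returns instead of falling through.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct mock_node {
	unsigned long long bytenr;
};

/* Stand-in for a panic helper that may return instead of halting. */
static void mock_panic(int err, unsigned long long bytenr)
{
	fprintf(stderr, "error %d: duplicate root found for start=%llu\n",
		err, bytenr);
}

/* Insert a new node into *slot; fail with -EEXIST if it is occupied. */
static int mock_add_node(struct mock_node **slot, unsigned long long bytenr)
{
	struct mock_node *node = malloc(sizeof(*node));

	if (!node)
		return -ENOMEM;
	node->bytenr = bytenr;

	if (*slot) {
		/* Read node->bytenr for the message before freeing node. */
		mock_panic(-EEXIST, node->bytenr);
		free(node);
		return -EEXIST; /* do not continue after the panic helper */
	}

	*slot = node;
	return 0;
}
```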
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>