1. 11 Sep, 2009 15 commits
    • Btrfs: fix btrfs page_mkwrite to return locked page · 50a9b214
      Chris Mason authored
      This closes a hole where the page could be written back before
      the page_mkwrite caller has a chance to dirty it
      
      (thanks to Nick Piggin)
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
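      A hedged sketch of the pattern behind this fix, assuming the
      ~2.6.31-era kernel fault API (illustrative, not btrfs's exact
      code): a page_mkwrite handler returns VM_FAULT_LOCKED so the page
      stays locked until the caller re-dirties it, closing the window in
      which writeback could clean it first.

          /* Sketch only: approximate 2.6.31-era API, not btrfs's code. */
          static int example_page_mkwrite(struct vm_area_struct *vma,
                                          struct vm_fault *vmf)
          {
                  struct page *page = vmf->page;

                  lock_page(page);
                  if (page->mapping != vma->vm_file->f_mapping) {
                          unlock_page(page);
                          return VM_FAULT_NOPAGE; /* truncated under us */
                  }

                  /* ... reserve space for the coming write here ... */

                  /* keep the page locked; the fault code will dirty it
                   * before anyone can write it back */
                  return VM_FAULT_LOCKED;
          }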
    • Btrfs: Fix extent replacement race · a1ed835e
      Chris Mason authored
      Data COW means that whenever we write to a file, we replace any old
      extent pointers with new ones.  There was a window where a readpage
      might find the old extent pointers on disk and cache them in the
      extent_map tree in RAM while a write was in the middle of replacing them.
      
      Even though both the readpage and the write had their respective bytes
      in the file locked, the extents that readpage inserts may cover more
      bytes than readpage had locked down.
      
      This commit closes the race by keeping the new extent pinned in the extent
      map tree until after the on-disk btree is properly setup with the new
      extent pointers.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Use PagePrivate2 to track pages in the data=ordered code. · 8b62b72b
      Chris Mason authored
      Btrfs writes go through delalloc to the data=ordered code.  This
      makes sure that all of the data is on disk before the metadata
      that references it.  The tracking means that we have to make sure
      each page in an extent is fully written before we add that extent into
      the on-disk btree.
      
      This was done in the past by setting the EXTENT_ORDERED bit for the
      range of an extent when it was added to the data=ordered code, and then
      clearing the EXTENT_ORDERED bit in the extent state tree as each page
      finished IO.
      
      One of the reasons we had to do this was because sometimes pages are
      magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
      bit is checked at writepage time, and if it isn't there, the page became
      dirty without going through the proper path.
      
      These bit operations make for a number of rbtree searches for each page,
      and can cause considerable lock contention.
      
      This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
      As pages go into the ordered code, PagePrivate2 is set on each one.
      This is a cheap operation because we already have all the pages locked
      and ready to go.
      
      As IO finishes, the PagePrivate2 bit is cleared and the ordered
      accounting is updated for each page.
      
      At writepage time, if the PagePrivate2 bit is missing, we go into the
      writepage fixup code to handle improperly dirtied pages.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
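      A hedged sketch of the PagePrivate2 scheme described above; the
      page-flag accessors are real kernel ones, while the helper names
      here are hypothetical.

          /* Sketch; helper names hypothetical, flag accessors real. */
          #include <linux/page-flags.h>

          static void ordered_add_page(struct page *page)
          {
                  /* cheap: the page is already locked when it enters
                   * the data=ordered code */
                  SetPagePrivate2(page);
          }

          static void ordered_page_io_done(struct page *page)
          {
                  if (TestClearPagePrivate2(page)) {
                          /* update per-extent ordered accounting here */
                  }
          }

          static int ordered_page_needs_fixup(struct page *page)
          {
                  /* a dirty page missing Private2 was dirtied behind
                   * our back and must take the writepage fixup path */
                  return !PagePrivate2(page);
          }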
    • Btrfs: use a cached state for extent state operations during delalloc · 9655d298
      Chris Mason authored
      This changes the btrfs code that finds delalloc ranges in the extent
      state tree to use the new state caching code from set/test bit.  It
      reduces one of the biggest causes of rbtree searches in the writeback
      path.
      
      test_range_bit is also modified to take the cached state as a starting
      point while searching.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: don't lock bits in the extent tree during writepage · d5550c63
      Chris Mason authored
      At writepage time, we have the page locked and we have the
      extent_map entry for this extent pinned in the extent_map tree.
      So, the page can't go away and its mapping can't change.
      
      There is no need for the extra extent_state lock bits during writepage.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: cache values for locking extents · 2c64c53d
      Chris Mason authored
      Many of the btrfs extent state tree users follow the same pattern.
      They lock an extent range in the tree, do some operation and then
      unlock.
      
      This translates to at least 2 rbtree searches, and maybe more if they
      are doing operations on the extent state tree.  A locked extent
      in the tree isn't going to be merged or changed, and so we can
      safely return the extent state structure as a cached handle.
      
      This changes set_extent_bit to give back a cached handle, and also
      changes both set_extent_bit and clear_extent_bit to use the cached
      handle if it is available.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
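      A hedged sketch of the lock/operate/unlock pattern with a cached
      handle; lock_extent_bits and unlock_extent_cached are the real
      btrfs helpers, but the signatures shown are approximations for
      this era of the code.

          /* Sketch; signatures approximate. */
          static void locked_range_op(struct extent_io_tree *tree,
                                      u64 start, u64 end)
          {
                  struct extent_state *cached = NULL;

                  lock_extent_bits(tree, start, end, 0, &cached, GFP_NOFS);

                  /* a locked extent_state won't be merged or freed, so
                   * 'cached' stays valid and the unlock below can skip
                   * the rbtree search entirely */

                  unlock_extent_cached(tree, start, end, &cached, GFP_NOFS);
          }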
    • Btrfs: reduce CPU usage in the extent_state tree · 1edbb734
      Chris Mason authored
      Btrfs is currently mirroring some of the page state bits into
      its extent state tree.  The goal behind this was to use it in supporting
      blocksizes other than the page size.
      
      But, we don't currently support that, and we're using quite a lot of CPU
      on the rb tree and its spin lock.  This commit starts a series of
      cleanups to reduce the amount of work done in the extent state tree as
      part of each IO.
      
      This commit:
      
      * Adds the ability to lock an extent in the state tree and also set
      other bits.  The idea is to do locking and delalloc in one call
      
      * Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits.  Btrfs is using
      a combination of the page bits and the ordered write code for this
      instead.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
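      As a hedged illustration of "locking and delalloc in one call":
      a single tree pass can set both bits at once.  The set_extent_bit
      argument list shown is an approximation for this era of the code,
      and tree/start/end/cached_state are assumed to be in scope.

          /* Sketch; argument list approximate. */
          err = set_extent_bit(tree, start, end,
                               EXTENT_LOCKED | EXTENT_DELALLOC,
                               EXTENT_LOCKED /* exclusive */, &failed_start,
                               &cached_state, GFP_NOFS);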
    • Btrfs: Fix new state initialization order · e48c465b
      Chris Mason authored
      As the extent state tree is manipulated, there are callbacks
      that are used to take extra actions when different state bits are set
      or cleared.  One example of this is a counter for the total number
      of delayed allocation bytes in a single inode and in the whole FS.

      When new states are inserted, this callback is being done before we
      properly set up the new state.  This hasn't caused problems before
      because the lock bit was always done first, and the existing callbacks
      don't care about the lock bit.

      This patch makes sure the state is properly set up before using the
      callback, which is important for later optimizations that do more work
      without using the lock bit.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: switch extent_map to a rw lock · 890871be
      Chris Mason authored
      There are two main users of the extent_map tree.  The
      first is regular file inodes, where it is evenly spread
      between readers and writers.
      
      The second is the chunk allocation tree, which maps blocks from
      logical addresses to physical ones, and it is 99.99% reads.
      
      The mapping tree is a point of lock contention during heavy IO
      workloads, so this commit switches things to a rw lock.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
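      A hedged sketch of a read-mostly lookup under the new rw lock;
      lookup_extent_mapping is the real helper, the wrapper is
      illustrative.

          /* Sketch: readers may now proceed in parallel. */
          static struct extent_map *em_lookup(struct extent_map_tree *tree,
                                              u64 start, u64 len)
          {
                  struct extent_map *em;

                  read_lock(&tree->lock);   /* was a plain spinlock */
                  em = lookup_extent_mapping(tree, start, len);
                  read_unlock(&tree->lock);
                  return em;
          }
          /* writers, e.g. add_extent_mapping callers, take write_lock() */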
    • Btrfs: tweak congestion backoff · 57fd5a5f
      Chris Mason authored
      The btrfs io submission thread tries to back off congested devices in
      favor of rotating off to another disk.
      
      But, it tries to make sure it submits at least some IO before rotating
      on (the others may be congested too), and so it has a magic number of
      requests it tries to write before it hops.
      
      This makes the magic number smaller.  Testing shows that we're spending
      too much time on congested devices and leaving the other devices idle.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: use larger nr_to_write for larger extents · a97adc9f
      Chris Mason authored
      When btrfs fills a large delayed allocation extent, it is a good idea
      to try and convince the write_cache_pages caller to go ahead and
      write a good chunk of that extent.  The extra IO is basically free
      because we know it is contiguous.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: reduce worker thread spin_lock_irq hold times · 4f878e84
      Chris Mason authored
      This changes the btrfs worker threads to batch work items
      into a local list.  It allows us to pull work items in
      large chunks and significantly reduces the number of times we
      need to take the worker thread spinlock.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
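      A hedged sketch of the batching idea: splice the whole pending
      list onto a local one under a single lock acquisition, then
      process it unlocked.  Names are illustrative, not btrfs's exact
      ones.

          /* Sketch; struct and list names illustrative. */
          static void grab_batch(spinlock_t *lock, struct list_head *pending)
          {
                  LIST_HEAD(batch);

                  spin_lock_irq(lock);
                  list_splice_init(pending, &batch); /* take all at once */
                  spin_unlock_irq(lock);

                  /* run each work item on 'batch' without the lock */
          }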
    • Btrfs: keep irqs on more often in the worker threads · 4e3f9c50
      Chris Mason authored
      The btrfs worker thread spinlock was being used both for the
      queueing of IO and for the processing of ordered events.
      
      The ordered events never happen from end_io handlers, and so they
      don't need to use the _irq version of spinlocks.  This adds a
      dedicated lock to the ordered lists so they don't have to run
      with irqs off.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
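      A hedged sketch of the split; field names are illustrative.
      Queueing can race with end_io, so it keeps the _irq variant, while
      ordered events never run from irq context and get a plain lock.

          /* Sketch; field names illustrative. */
          spin_lock_irq(&workers->lock);    /* queueing: end_io may race */
          /* ... enqueue IO work ... */
          spin_unlock_irq(&workers->lock);

          spin_lock(&workers->order_lock);  /* ordered events: irqs on */
          /* ... run ordered completions ... */
          spin_unlock(&workers->order_lock);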
    • Btrfs: optimize set extent bit · 40431d6c
      Chris Mason authored
      The Btrfs set_extent_bit call currently searches the rbtree
      every time it needs to find more extent_state objects to fill
      the requested operation.
      
      This adds a simple test with rb_next to see if the next object
      in the tree was adjacent to the one we just found.  If so,
      we skip the search and just use the next object.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
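      A hedged sketch of the adjacency test; extent_state's start/end
      fields are real, the helper itself is illustrative.

          /* Sketch: skip the rbtree search when the next node is
           * adjacent in the keyspace. */
          static struct extent_state *next_if_adjacent(struct extent_state *state)
          {
                  struct rb_node *next = rb_next(&state->rb_node);
                  struct extent_state *n;

                  if (!next)
                          return NULL;
                  n = rb_entry(next, struct extent_state, rb_node);
                  return n->start == state->end + 1 ? n : NULL;
          }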
    • Btrfs: Allow worker threads to exit when idle · 9042846b
      Chris Mason authored
      The Btrfs worker threads don't currently die off after they have
      been idle for a while, leading to a lot of threads sitting around
      doing nothing for each mount.
      
      Also, they are unable to start atomically (from end_io handlers).
      
      This commit reworks the worker threads so they can be started
      from end_io handlers (just setting a flag that asks for a thread
      to be added at a later date) and so they can exit if they
      have been idle for a long time.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
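      A hedged sketch of deferring thread creation out of atomic
      context; the counter field and helpers here are hypothetical.

          /* Sketch; field and helper names hypothetical. */
          static void request_new_worker(struct btrfs_workers *workers)
          {
                  /* end_io context: cannot kthread_run() here, so
                   * just record that a thread is wanted */
                  atomic_inc(&workers->start_pending);
          }

          static void start_pending_workers(struct btrfs_workers *workers)
          {
                  /* process context: safe to create threads now */
                  while (atomic_add_unless(&workers->start_pending, -1, 0)) {
                          /* allocate a worker and kthread_run() it */
                  }
          }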
  2. 07 Aug, 2009 3 commits
  3. 31 Jul, 2009 2 commits
    • Btrfs: make sure the async caching thread advances the key · 013f1b12
      Chris Mason authored
      The async caching thread can end up looping forever if a given
      search puts it at the last key in a leaf.  It will end up calling
      btrfs_next_leaf and then checking if it needs to politely drop
      the read semaphore.
      
      Most of the time this looping isn't noticed because it is able to
      make progress the next time around.  But, during log replay,
      we wait on the async caching thread to finish, and the async thread
      is waiting on the commit, and no progress is really made.
      
      The fix used here is to copy the key out of the next leaf, so that
      our search lands there properly.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
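      The fix boils down to one call, shown here as a hedged sketch:
      after btrfs_next_leaf(), copy the first key of the new leaf so the
      next search resumes there instead of re-landing on the old last
      key.

          /* Sketch: remember where the next leaf starts before the
           * semaphore is dropped and the search is redone. */
          btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);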
    • Btrfs: fix btrfs_remove_from_free_space corner case · 6606bb97
      Josef Bacik authored
      Yan Zheng hit a problem where we tried to remove some free space but
      failed because we couldn't find the free space entry.  This is because
      the free space was held within a bitmap whose starting offset was well
      before the actual offset of the free space, and there were free space
      extents in the same range as that offset, so tree_search_offset
      returned NULL because it couldn't find a free space extent with that
      exact offset.  This is fixed by making sure that if we fail to find the
      entry, we re-search with bitmap_only set to 1 and do an
      offset_to_bitmap so we can get the appropriate bitmap.  A similar
      problem happens in btrfs_alloc_from_bitmap for the clustering code,
      but that is not as bad since we will just go and redo our cluster
      allocation.
      
      Also this adds some debugging checks to make sure that the free space we are
      trying to remove from the bitmap is in fact there.  This can probably go away
      after a while, but since this code is only used by the tree-logging stuff it
      would be nice to run with it for a while to make sure there are no problems.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
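      A hedged sketch of the re-search fallback; tree_search_offset and
      offset_to_bitmap are the real helpers, shown with approximate
      signatures, and entry/block_group/offset are assumed in scope.

          /* Sketch; signatures approximate. */
          entry = tree_search_offset(block_group, offset, 0, 0);
          if (!entry) {
                  /* the space may live in a bitmap whose key (the start
                   * of the region it covers) is well before 'offset':
                   * round down and search bitmap entries only */
                  entry = tree_search_offset(block_group,
                                 offset_to_bitmap(block_group, offset),
                                 1 /* bitmap_only */, 0);
          }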
  4. 30 Jul, 2009 2 commits
    • Btrfs: be more polite in the async caching threads · f36f3042
      Chris Mason authored
      The semaphore used by the async caching threads can prevent a
      transaction commit, which can make the FS appear to stall.  This
      releases the semaphore more often when a transaction commit is
      in progress.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: preserve commit_root for async caching · 276e680d
      Yan Zheng authored
      The async block group caching code uses the commit_root pointer
      to get a stable version of the extent allocation tree for scanning.
      This copy of the tree root isn't going to change and it significantly
      reduces the complexity of the scanning code.
      
      During a commit, we have a loop where we update the extent allocation
      tree root.  We need to loop because updating the root pointer in
      the tree of tree roots may allocate blocks which may change the
      extent allocation tree.
      
      Right now the commit_root pointer is changed inside this loop.  It
      is more correct to change the commit_root pointer only after all the
      looping is done.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  5. 28 Jul, 2009 1 commit
  6. 27 Jul, 2009 2 commits
    • Btrfs: change how we unpin extents · 68b38550
      Josef Bacik authored
      Unpinning extents was racy with the async block group caching.  This
      patch makes things much less complicated by only unpinning the extent
      if the block group is cached.  We check the block_group->cached var
      under the block_group->lock spin lock.  If it is set to
      BTRFS_CACHE_FINISHED then we update the pinned counters, unpin the
      extent and add the free space back.  If it is not, we start caching
      the block group so that the next pass can unpin the extent.  This
      keeps us from racing with the async caching threads, lets us kill the
      fs-wide async thread counter, and keeps us from having to set DELALLOC
      bits for every extent we hit if there are caching kthreads going.
      
      One thing that needed to be changed was btrfs_free_super_mirror_extents.  Now
      instead of just looking for LOCKED extents, we also look for DIRTY extents,
      since we could have left some extents pinned in the previous transaction that
      will never get freed now that we are unmounting, which would cause us to leak
      memory.  So btrfs_free_super_mirror_extents has been changed to
      btrfs_free_pinned_extents, and it will clear the extents locked for the super
      mirror, and any remaining pinned extents that may be present.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
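      A hedged sketch of the check described above; BTRFS_CACHE_FINISHED
      is the real cache state, the surrounding code is illustrative.

          /* Sketch; surrounding code illustrative. */
          spin_lock(&block_group->lock);
          if (block_group->cached == BTRFS_CACHE_FINISHED) {
                  spin_unlock(&block_group->lock);
                  /* update pinned counters, unpin and re-add the
                   * range to the free space cache */
          } else {
                  spin_unlock(&block_group->lock);
                  /* start caching so a later unpin of this block
                   * group can succeed */
          }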
    • Btrfs: Correct redundant test in add_inode_ref · 631c07c8
      Julia Lawall authored
      dir has already been tested.  It seems that this test should instead
      be on the recently returned value, inode.
      
      A simplified version of the semantic match that finds this problem is as
      follows: (http://www.emn.fr/x-info/coccinelle/)
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  7. 24 Jul, 2009 9 commits
    • Btrfs: find smallest available device extent during chunk allocation · 9779b72f
      Chris Mason authored
      Allocating a new block group is easy when the disk has plenty of space.
      But things get difficult as the disk fills up, especially if
      the FS has been run through btrfs-vol -b.  The balance operation
      is likely to make the total bytes available on the device greater
      than the largest extent we'll actually be able to allocate.
      
      But the device extent allocation code incorrectly assumes that a device
      with 5G free will be able to allocate a 5G extent.  It isn't normally a
      problem because device extents don't get freed unless btrfs-vol -b
      is run.
      
      This fixes the device extent allocator to remember the largest free
      extent it can find, and to use that value as a fallback.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
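      A hedged sketch of the fallback: while walking the gaps between
      device extents, remember the largest one so the allocation can
      shrink to it rather than fail outright.  Illustrative, not the
      exact btrfs code.

          /* Sketch; illustrative only. */
          u64 max_hole_start = 0, max_hole_size = 0;

          /* for each hole [hole_start, hole_start + hole_size): */
          if (hole_size > max_hole_size) {
                  max_hole_start = hole_start;
                  max_hole_size = hole_size;
          }

          /* after the scan: if no hole fits the requested num_bytes,
           * fall back to max_hole_size at max_hole_start */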
    • Btrfs: clear all space_info->full after removing a block group · 283bb197
      Chris Mason authored
      Btrfs allocates individual extents from block groups, and each
      block group has a specific type.  It may hold metadata, data
      mirrored or striped etc.
      
      When we balance space (btrfs-vol -b) or remove a drive (btrfs-vol -r)
      we free block groups.  Once a block group is freed, the space it was
      using on the device may be available for use by new block groups.
      
      btrfs_remove_block_group was clearing the flag that said
      'our devices are full, don't even try to allocate new block groups',
      but it was only clearing that flag for a specific type of block group.
      
      This commit clears the full flag for all of the types of block groups,
      making it much more likely that we'll be able to balance space when
      the drive is close to full.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
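      A hedged sketch of clearing the flag for every type; the
      space_info list walk is illustrative and locking is elided.

          /* Sketch; locking elided. */
          struct btrfs_space_info *info;

          list_for_each_entry(info, &fs_info->space_info, list)
                  info->full = 0;  /* was: only the freed group's type */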
    • Btrfs: make flushoncommit mount option correctly wait on ordered_extents · ebecd3d9
      Sage Weil authored
      When snap_pending is set, the commit_transaction call to
      wait_ordered_extents passes nocow_only=1 to process only NOCOW or
      PREALLOC extents.  This isn't correct for the 'flushoncommit' mode, as
      it skips extents we just started IO on in start_delalloc_inodes.
      
      So, in the flushoncommit case, wait on all ordered extents.  Otherwise,
      only pass the nocow_only flag to wait_ordered_extents if snap_pending.
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
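      A hedged sketch of the resulting logic; btrfs_wait_ordered_extents
      took a nocow_only flag in this era, though the surrounding flag
      plumbing shown here is approximate.

          /* Sketch; flag plumbing approximate. */
          if (flushoncommit)
                  btrfs_wait_ordered_extents(root, 0); /* wait on all */
          else if (snap_pending)
                  btrfs_wait_ordered_extents(root, 1); /* NOCOW/PREALLOC */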
    • Btrfs: Avoid delayed reference update looping · d717aa1d
      Yan Zheng authored
      btrfs_split_leaf and btrfs_del_items can end up in a loop
      where one is constantly splitting a given leaf and the other
      is constantly merging it back with the adjacent nodes.
      
      There is a better fix for this, but in the interest of something
      small, this patch just changes btrfs_del_items back to balancing less
      often.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Fix ordering of key field checks in btrfs_previous_item · 0a4eefbb
      Yan Zheng authored
      Check objectid of item before checking the item type, otherwise we may return
      zero for a key that is actually too low.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: find_free_dev_extent doesn't handle holes at the start of the device · 1fcbac58
      Yan Zheng authored
      find_free_dev_extent does not properly handle the case where
      the device is not completely free, and there is a free extent
      at the beginning of the device.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Remove code duplication in comp_keys · 20736aba
      Diego Calleja authored
      comp_keys is duplicating what is done in btrfs_comp_cpu_keys, so just
      call it.
      Signed-off-by: Diego Calleja <diegocg@gmail.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: async block group caching · 817d52f8
      Josef Bacik authored
      This patch moves the caching of the block group off to a kthread in
      order to allow people to allocate sooner.  Instead of blocking up
      behind the caching mutex, we instead kick off the caching kthread, and
      then attempt to make an allocation.  If we cannot, we wait on the
      block group's caching waitqueue, which the caching kthread wakes every
      time it finds 2 meg worth of space, and then again when it's finished
      caching.  This is how I tested the speedup from this:
      
      mkfs the disk
      mount the disk
      fill the disk up with fs_mark
      unmount the disk
      mount the disk
      time touch /mnt/foo
      
      Without my changes this took 11 seconds on my box, with these changes it now
      takes 1 second.
      
      Another change that's been put in place is that we lock the super
      mirrors in the pinned extent map in order to keep us from adding that
      stuff as free space when caching the block group.  This doesn't really
      change anything else as far as the pinned extent map is concerned,
      since for actual pinned extents we use EXTENT_DIRTY, but it does mean
      that when we unmount we have to go in and unlock those extents to keep
      from leaking memory.
      
      I've also added a check so that when we are reading block groups from
      disk, if the amount of space used equals the size of the block group,
      we go ahead and mark the block group as cached.  This drastically
      reduces the amount of time it takes to cache the block groups.  Using
      the same test as above, except doing a dd to a file and then
      unmounting: it used to take 33 seconds to umount, now it takes 3
      seconds.
      
      This version uses the commit_root in the caching kthread, and then
      keeps track of how many async caching threads are running at any given
      time, so if one of the async threads is still running as we cross
      transactions we can wait until it's finished before handling the
      pinned extents.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
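      A hedged sketch of the allocate-or-wait loop; the waitqueue field
      and the helpers named here are hypothetical stand-ins for the real
      ones.

          /* Sketch; field and helper names hypothetical. */
          ret = try_alloc(block_group, num_bytes);
          if (ret == -ENOSPC && !cache_done(block_group)) {
                  /* the caching kthread wakes this queue every ~2MB of
                   * free space found, and once more when it finishes */
                  wait_event(block_group->caching_q,
                             cache_done(block_group) ||
                             enough_space(block_group, num_bytes));
                  /* ... retry the allocation ... */
          }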
    • Btrfs: use hybrid extents+bitmap rb tree for free space · 96303081
      Josef Bacik authored
      Currently btrfs has a problem where it can use a ridiculous amount of
      RAM simply tracking free space.  As free space gets fragmented, we end
      up with thousands of entries on an rb-tree per block group, which
      usually spans 1 gig of area.  Since we currently don't ever flush the
      free space cache back to disk, this gets to be a bit unwieldy on large
      fs's with lots of fragmentation.
      
      This patch solves this problem by using PAGE_SIZE bitmaps for parts of
      the free space cache.  Initially we calculate a threshold of extent
      entries we can handle, which is however many extent entries we can
      cram into 16k of RAM.  The maximum amount of RAM that should ever be
      used to track 1 gigabyte of diskspace will be 32k, which scales much
      better than before.
      
      Once we pass the extent threshold, we start adding bitmaps and using
      those instead for tracking the free space.  This patch also makes it so
      that any free space that's less than 4 * sectorsize goes into a
      bitmap.  This is nice since we try to allocate out of the front of a
      block group, so if the front of a block group is heavily fragmented and
      then has a huge chunk of free space at the end, we go ahead and add the
      fragmented areas to bitmaps and use a normal extent entry to track the
      big chunk at the back of the block group.
      
      I've also taken the opportunity to revamp how we search for free
      space.  Previously we indexed free space via an offset-indexed rb tree
      and a bytes-indexed rb tree.  I've dropped the bytes-indexed rb tree
      and use only the offset-indexed rb tree.  This cuts the number of tree
      operations we were doing in half, and gives us a slightly better
      allocation pattern since we will always start from a specific offset
      and search forward from there, instead of searching for the size we
      need and trying to get it as close as possible to the offset we want.
      
      I've given this a healthy amount of testing pre-new format stuff, as well as
      post-new format stuff.  I've booted up my fedora box which is installed on btrfs
      with this patch and ran with it for a few days without issues.  I've not seen
      any performance regressions in any of my tests.
      
      Since the last patch Yan Zheng fixed a problem where we could have overlapping
      entries, so updating their offset inline would cause problems.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
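      The RAM claim above checks out with a little arithmetic; this
      runnable sketch assumes 4K pages, 4K sectors and 1G block groups.

          /* Runnable user-space sketch of the math above. */
          #include <stdio.h>

          int main(void)
          {
                  unsigned long page_size  = 4096;      /* PAGE_SIZE */
                  unsigned long sectorsize = 4096;
                  unsigned long group_size = 1UL << 30; /* 1G group */

                  /* one bitmap page: 8 bits/byte, one bit per sector */
                  unsigned long per_bitmap = page_size * 8 * sectorsize;
                  unsigned long bitmaps    = group_size / per_bitmap;

                  printf("one bitmap covers %lu MB\n", per_bitmap >> 20);
                  printf("%lu bitmaps per 1G group = %lu KB of RAM\n",
                         bitmaps, (bitmaps * page_size) >> 10);
                  return 0;
          }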
  8. 22 Jul, 2009 6 commits