1. 17 Aug, 2011 2 commits
    • Btrfs: fix an oops of log replay · 34f3e4f2
      liubo authored
      When btrfs recovers from a crash, it may hit the oops below:
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/inode.c:4580!
      [...]
      RIP: 0010:[<ffffffffa03df251>]  [<ffffffffa03df251>] btrfs_add_link+0x161/0x1c0 [btrfs]
      [...]
      Call Trace:
       [<ffffffffa03e7b31>] ? btrfs_inode_ref_index+0x31/0x80 [btrfs]
       [<ffffffffa04054e9>] add_inode_ref+0x319/0x3f0 [btrfs]
       [<ffffffffa0407087>] replay_one_buffer+0x2c7/0x390 [btrfs]
       [<ffffffffa040444a>] walk_down_log_tree+0x32a/0x480 [btrfs]
       [<ffffffffa0404695>] walk_log_tree+0xf5/0x240 [btrfs]
       [<ffffffffa0406cc0>] btrfs_recover_log_trees+0x250/0x350 [btrfs]
       [<ffffffffa0406dc0>] ? btrfs_recover_log_trees+0x350/0x350 [btrfs]
       [<ffffffffa03d18b2>] open_ctree+0x1442/0x17d0 [btrfs]
      [...]
      
      This happens because, while replaying an inode ref item, we forget to
      check for old conflicting DIR_ITEM and DIR_INDEX items in the fs/file tree,
      so we run into conflicting corner cases that trigger the BUG_ON().
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Tested-by: Andy Lutomirski <luto@mit.edu>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: detect wether a device supports discard · d5e2003c
      Josef Bacik authored
      We have a problem where, if a user specifies discard but the device doesn't
      actually support it, we return EOPNOTSUPP from btrfs_discard_extent.  This is
      a problem because this gets called (in a fashion) from the tree log recovery
      code, which has a nice little BUG_ON(ret) after it, which causes us to fail the
      tree log replay.  So instead detect whether our devices support discard when
      we're adding them, and then don't issue discards if we know that the device
      doesn't support it.  And just for good measure set ret = 0 in
      btrfs_issue_discard in case we still get EOPNOTSUPP, so we don't screw anybody
      up like this again.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
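The two safeguards described in this commit message can be sketched in userspace C as follows. This is only an illustration under stated assumptions: every `sketch_` name and both struct fields are hypothetical and do not correspond to the actual btrfs structures or functions. It shows the capability being cached when the device is added, the discard being skipped when the cached flag says it is unsupported, and a late EOPNOTSUPP being turned into success.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a block device; not the kernel's structure. */
struct sketch_device {
    bool supports_discard; /* what the hardware can really do */
    bool can_discard;      /* capability cached when the device is added */
};

/* Probe once, at the time the device is added to the filesystem. */
static void sketch_add_device(struct sketch_device *dev)
{
    dev->can_discard = dev->supports_discard;
}

/* Stand-in for the low-level discard call; fails on unsupported devices. */
static int sketch_raw_discard(const struct sketch_device *dev)
{
    return dev->supports_discard ? 0 : -EOPNOTSUPP;
}

/* Issue a discard without ever leaking EOPNOTSUPP to the caller. */
static int sketch_issue_discard(const struct sketch_device *dev)
{
    int ret;

    if (!dev->can_discard)
        return 0; /* known to be unsupported: silently skip */

    ret = sketch_raw_discard(dev);
    if (ret == -EOPNOTSUPP)
        ret = 0; /* belt and braces: the cached flag may be stale */
    return ret;
}
```

With this shape, a caller that treats any nonzero return as fatal (as the BUG_ON(ret) in the message does) no longer trips over a device that simply cannot discard.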
  2. 05 Aug, 2011 1 commit
  3. 01 Aug, 2011 25 commits
  4. 27 Jul, 2011 12 commits
    • Merge branch 'integration' into for-linus · ff95acb6
      Chris Mason authored
    • Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors · 75c195a2
      Chris Mason authored
      The btrfs transaction code will return any errors that come from
      reserve_metadata_bytes.  We need to make sure we don't return funny
      things like 1 or EAGAIN.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
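One way to keep internal return values from leaking out is a small normalizing step at the function boundary. The sketch below is a guess at the general shape, not the actual patch: the helper name is hypothetical, and mapping the stray values to -ENOSPC is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: collapse internal "retry" style results into a
 * single real error before they reach the transaction code.
 * Assumption: -ENOSPC is the error callers should see in that case.
 */
static int sketch_normalize_reserve_error(int ret)
{
    if (ret == 0)
        return 0;          /* success passes through untouched */
    if (ret > 0 || ret == -EAGAIN)
        return -ENOSPC;    /* positive values and EAGAIN are internal only */
    return ret;            /* genuine negative errnos are kept as they are */
}
```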
    • Btrfs: use the commit_root for reading free_space_inode crcs · 2cf8572d
      Chris Mason authored
      Now that we are using regular file crcs for the free space cache,
      we can deadlock if we try to read the free_space_inode while we are
      updating the crc tree.
      
      This commit fixes things by using the commit_root to read the crcs.  This is
      safe because the free space cache file would already be loaded if
      that block group had been changed in the current transaction.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: reduce extent_state lock contention for metadata · 19b6caf4
      Chris Mason authored
      For metadata buffers that don't straddle pages (all of them), btrfs
      can safely use the page uptodate bits and extent_buffer uptodate bit
      instead of needing to use the extent_state tree.
      
      This greatly reduces contention on the state tree lock.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: remove lockdep magic from btrfs_next_leaf · 31533fb2
      Chris Mason authored
      Before the reader/writer locks, btrfs_next_leaf needed to keep
      the path blocking to avoid making lockdep upset.
      
      Now that btrfs_next_leaf only takes read locks, this isn't required.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: make a lockdep class for each root · 85d4e461
      Chris Mason authored
      This patch was originally from Tejun Heo.  lockdep complains about the btrfs
      locking because we sometimes take btree locks from two different trees at the
      same time.  The current classes are based only on level in the btree, which
      isn't enough information for lockdep to figure out if the lock is safe.
      
      This patch makes a class for each type of tree, and lumps all the FS trees that
      actually have files and directories into the same class.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: switch the btrfs tree locks to reader/writer · bd681513
      Chris Mason authored
      The btrfs metadata btree is the source of significant
      lock contention, especially in the root node.   This
      commit changes our locking to use a reader/writer
      lock.
      
      The lock is built on top of rw spinlocks, and it
      extends the lock tracking to remember if we have a
      read lock or a write lock when we go to blocking.  Atomics
      count the number of blocking readers or writers at any
      given time.
      
      It removes all of the adaptive spinning from the old code
      and uses only the spinning/blocking hints inside of btrfs
      to decide when it should continue spinning.
      
      In read heavy workloads this is dramatically faster.  In write
      heavy workloads we're still faster because of less contention
      on the root node lock.
      
      We suffer slightly in dbench because we schedule more often
      during write locks, but all other benchmarks so far are improved.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
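The bookkeeping described above (a spinning reader/writer state plus atomic counts of holders that have gone blocking) can be modelled in userspace with C11 atomics. This is a rough sketch, not the btrfs implementation: all `sketch_` names and the encoding of `spin_state` are assumptions, and only try-lock operations are shown, so there is no real spinning or sleeping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock: spin_state is 0 (free), N > 0 (N spinning readers)
 * or -1 (one spinning writer). Blocking holders are counted separately. */
struct sketch_tree_lock {
    atomic_int spin_state;
    atomic_int blocking_readers;
    atomic_int blocking_writers;
};

static void sketch_lock_init(struct sketch_tree_lock *l)
{
    atomic_init(&l->spin_state, 0);
    atomic_init(&l->blocking_readers, 0);
    atomic_init(&l->blocking_writers, 0);
}

static void sketch_read_unlock(struct sketch_tree_lock *l)
{
    atomic_fetch_sub(&l->spin_state, 1);
}

static bool sketch_try_read_lock(struct sketch_tree_lock *l)
{
    int cur;

    if (atomic_load(&l->blocking_writers) > 0)
        return false;
    cur = atomic_load(&l->spin_state);
    while (cur >= 0) {
        if (atomic_compare_exchange_weak(&l->spin_state, &cur, cur + 1)) {
            /* A writer may have gone blocking in between: recheck. */
            if (atomic_load(&l->blocking_writers) > 0) {
                sketch_read_unlock(l);
                return false;
            }
            return true;
        }
    }
    return false; /* a spinning writer holds the lock */
}

static bool sketch_try_write_lock(struct sketch_tree_lock *l)
{
    int expected = 0;

    if (atomic_load(&l->blocking_readers) > 0 ||
        atomic_load(&l->blocking_writers) > 0)
        return false;
    if (!atomic_compare_exchange_strong(&l->spin_state, &expected, -1))
        return false;
    /* Somebody may have gone blocking in between: recheck. */
    if (atomic_load(&l->blocking_readers) > 0 ||
        atomic_load(&l->blocking_writers) > 0) {
        atomic_store(&l->spin_state, 0);
        return false;
    }
    return true;
}

/* Going blocking: record the holder in a counter, then drop the spin state. */
static void sketch_set_blocking_read(struct sketch_tree_lock *l)
{
    atomic_fetch_add(&l->blocking_readers, 1);
    atomic_fetch_sub(&l->spin_state, 1);
}

static void sketch_set_blocking_write(struct sketch_tree_lock *l)
{
    atomic_fetch_add(&l->blocking_writers, 1);
    atomic_store(&l->spin_state, 0);
}

static void sketch_blocking_read_unlock(struct sketch_tree_lock *l)
{
    atomic_fetch_sub(&l->blocking_readers, 1);
}

static void sketch_blocking_write_unlock(struct sketch_tree_lock *l)
{
    atomic_fetch_sub(&l->blocking_writers, 1);
}
```

The point of the separate counters is that a holder who needs to schedule can release the spinning state while still excluding conflicting lockers: a blocking reader keeps writers out but lets other readers in, and a blocking writer keeps everyone out.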
    • Btrfs: fix deadlock when throttling transactions · 81317fde
      Josef Bacik authored
      Hit this nice little deadlock.  What happens is this:
      
      __btrfs_end_transaction with throttle set, --use_count so it equals 0
        btrfs_commit_transaction
          <somebody else actually manages to start the commit>
          btrfs_end_transaction --use_count so now it's -1 <== BAD
            we just return and wait on the transaction
      
      This is bad because we just return after our use_count is -1 and don't let go
      of our num_writer count on the transaction, so the guy committing the
      transaction just sits there forever.  Fix this by incrementing our use_count if
      we're going to call commit_transaction, so that if we call
      btrfs_end_transaction it's valid.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
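The reference-count problem in this message can be replayed in a small single-threaded model. The structs and `sketch_` functions below are hypothetical simplifications (the real handle and transaction carry far more state, and the commit path really waits); the model only tracks `use_count` and `num_writers` to show why the extra increment matters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced versions of the two objects involved. */
struct sketch_transaction {
    int num_writers;
};

struct sketch_handle {
    int use_count;
    struct sketch_transaction *trans;
};

/* Plain end: only the final reference releases the writer slot. */
static void sketch_end_transaction(struct sketch_handle *h)
{
    if (--h->use_count) /* any nonzero value, including -1, returns early */
        return;
    h->trans->num_writers--;
}

/* Commit path when somebody else already started the commit:
 * end our own handle, then (in real code) wait for that commit. */
static void sketch_commit_transaction(struct sketch_handle *h)
{
    sketch_end_transaction(h);
}

/* End with throttling; apply_fix selects the behaviour after the patch. */
static void sketch_end_transaction_throttle(struct sketch_handle *h,
                                            bool apply_fix)
{
    if (--h->use_count)
        return;
    if (apply_fix)
        h->use_count++; /* retake the reference so the nested end is valid */
    sketch_commit_transaction(h);
}
```

Without the fix the handle ends at a use_count of -1 and the transaction keeps a num_writers of 1, which is what leaves the committer waiting forever; with the fix both drop to 0.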
    • Btrfs: stop using highmem for extent_buffers · a6591715
      Chris Mason authored
      The extent_buffers have a very complex interface where
      we use HIGHMEM for metadata and try to cache a kmap mapping
      to access the memory.
      
      The next commit adds reader/writer locks, and concurrent use
      of this kmap cache would make it even more complex.
      
      This commit drops the ability to use HIGHMEM with extent buffers,
      and rips out all of the related code.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix BUG_ON() caused by ENOSPC when relocating space · 199c36ea
      Miao Xie authored
      When we balanced the chunks across the devices, BUG_ON() in
      __finish_chunk_alloc() was triggered.
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/volumes.c:2568!
      [SNIP]
      Call Trace:
       [<ffffffffa049525e>] btrfs_alloc_chunk+0x8e/0xa0 [btrfs]
       [<ffffffffa04546b0>] do_chunk_alloc+0x330/0x3a0 [btrfs]
       [<ffffffffa045c654>] btrfs_reserve_extent+0xb4/0x1f0 [btrfs]
       [<ffffffffa045c86b>] btrfs_alloc_free_block+0xdb/0x350 [btrfs]
       [<ffffffffa048a8d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
       [<ffffffffa04476fd>] __btrfs_cow_block+0x14d/0x5e0 [btrfs]
       [<ffffffffa044660d>] ? read_block_for_search+0x14d/0x4d0 [btrfs]
       [<ffffffffa0447c9b>] btrfs_cow_block+0x10b/0x240 [btrfs]
       [<ffffffffa044dd5e>] btrfs_search_slot+0x49e/0x7a0 [btrfs]
       [<ffffffffa044f07d>] btrfs_insert_empty_items+0x8d/0xf0 [btrfs]
       [<ffffffffa045e973>] insert_with_overflow+0x43/0x110 [btrfs]
       [<ffffffffa045eb0d>] btrfs_insert_dir_item+0xcd/0x1f0 [btrfs]
       [<ffffffffa0489bd0>] ? map_extent_buffer+0xb0/0xc0 [btrfs]
       [<ffffffff812276ad>] ? rb_insert_color+0x9d/0x160
       [<ffffffffa046cc40>] ? inode_tree_add+0xf0/0x150 [btrfs]
       [<ffffffffa0474801>] btrfs_add_link+0xc1/0x1c0 [btrfs]
       [<ffffffff811dacac>] ? security_inode_init_security+0x1c/0x30
       [<ffffffffa04a28aa>] ? btrfs_init_acl+0x4a/0x180 [btrfs]
       [<ffffffffa047492f>] btrfs_add_nondir+0x2f/0x70 [btrfs]
       [<ffffffffa046af16>] ? btrfs_init_inode_security+0x46/0x60 [btrfs]
       [<ffffffffa0474ac0>] btrfs_create+0x150/0x1d0 [btrfs]
       [<ffffffff81159c63>] ? generic_permission+0x23/0xb0
       [<ffffffff8115b415>] vfs_create+0xa5/0xc0
       [<ffffffff8115ce6e>] do_last+0x5fe/0x880
       [<ffffffff8115dc0d>] path_openat+0xcd/0x3d0
       [<ffffffff8115e029>] do_filp_open+0x49/0xa0
       [<ffffffff8116a965>] ? alloc_fd+0x95/0x160
       [<ffffffff8114f0c7>] do_sys_open+0x107/0x1e0
       [<ffffffff810bcc3f>] ? audit_syscall_entry+0x1bf/0x1f0
       [<ffffffff8114f1e0>] sys_open+0x20/0x30
       [<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b
      [SNIP]
      RIP  [<ffffffffa049444a>] __finish_chunk_alloc+0x20a/0x220 [btrfs]
      
      The reason is:
      Task1					Space balance task
      do_chunk_alloc()
        __finish_chunk_alloc()
          update device info
          in the chunk tree
            alloc system metadata block
      					relocate system metadata block group
      					  set system metadata block group
      					  read-only. This block group is the
      					  only one that can allocate space. So
      					  there is no free space that can be
      					  allocated now.
              find no space and don't try
              to alloc new chunk, and then
              return ENOSPC
        BUG_ON() in __finish_chunk_alloc()
        was triggered.
      
      Fix this bug by allocating a new system metadata chunk before relocating the
      old one if we find there is no free space which can be allocated after setting
      the old block group to be read-only.
      Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: tag pages for writeback in sync · f7aaa06b
      Josef Bacik authored
      Everybody else does this, so we need to do it too.  If we're syncing, we need
      to tag the pages we're going to write for writeback so we don't end up writing
      the same stuff over and over again if somebody is constantly redirtying our
      file.  This will keep us from having latencies with heavy sync workloads.
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix enospc problems with delalloc · 9e0baf60
      Josef Bacik authored
      So I had this brilliant idea to use atomic counters for outstanding and reserved
      extents, but this turned out to be a bad idea.  Consider this case, where we
      have 1 outstanding extent and 1 reserved extent:
      
      Reserver				Releaser
      					atomic_dec(outstanding) now 0
      atomic_read(outstanding)+1 get 1
      atomic_read(reserved) get 1
      don't actually reserve anything because
      they are the same
      					atomic_cmpxchg(reserved, 1, 0)
      atomic_inc(outstanding)
      atomic_add(0, reserved)
      					free reserved space for 1 extent
      
      Then the reserver has no actual space reserved for it, and when it goes to
      finish the ordered IO it won't have enough space to do its allocation and you
      get those lovely warnings.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
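The interleaving in the diagram can be replayed step by step in one thread to check the end state. The sketch below uses C11 atomics in place of the kernel's atomic_t helpers; the struct and function names are hypothetical, and the function simply executes the steps in the order the diagram gives them, starting from 1 outstanding and 1 reserved extent.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical pair of per-inode counters from the commit message. */
struct sketch_counters {
    atomic_int outstanding;
    atomic_int reserved;
};

/* Replays the diagram's interleaving; returns how many extents the
 * reserver decided to reserve. Expects both counters to start at 1. */
static int sketch_replay_race(struct sketch_counters *c)
{
    int want, have, to_reserve, expected;

    /* Releaser: atomic_dec(outstanding), now 0 */
    atomic_fetch_sub(&c->outstanding, 1);

    /* Reserver: sees outstanding + 1 == reserved, so reserves nothing */
    want = atomic_load(&c->outstanding) + 1;
    have = atomic_load(&c->reserved);
    to_reserve = want > have ? want - have : 0;

    /* Releaser: atomic_cmpxchg(reserved, 1, 0) */
    expected = 1;
    atomic_compare_exchange_strong(&c->reserved, &expected, 0);

    /* Reserver: atomic_inc(outstanding); atomic_add(0, reserved) */
    atomic_fetch_add(&c->outstanding, 1);
    atomic_fetch_add(&c->reserved, to_reserve);

    /* Releaser: frees the reserved space for 1 extent (not modelled) */
    return to_reserve;
}
```

Each individual operation is atomic, yet the reserver's decision is based on two separate reads that the releaser can slip between, so the counters end up with 1 outstanding extent and 0 reserved.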