1. 29 Jun, 2017 16 commits
    • Jan Kara's avatar
      btrfs: Don't clear SGID when inheriting ACLs · b7f8a09f
      Jan Kara authored
      When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
      set, DIR1 is expected to have SGID bit set (and owning group equal to
      the owning group of 'DIR0'). However when 'DIR0' also has some default
      ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
      'DIR1' to get cleared if user is not member of the owning group.
      
      Fix the problem by moving posix_acl_update_mode() out of
      __btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
      called when inheriting ACLs which is what we want as it prevents SGID
      bit clearing and the mode has been properly set by posix_acl_create()
      anyway.
      
      Fixes: 07393101
      CC: stable@vger.kernel.org
      CC: linux-btrfs@vger.kernel.org
      CC: David Sterba <dsterba@suse.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b7f8a09f
    • Chris Mason's avatar
      btrfs: fix integer overflow in calc_reclaim_items_nr · 6374e57a
      Chris Mason authored
      Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
      v4.12-rc6.  This was because commit 70e7af24 made it possible for
      calc_reclaim_items_nr() to return a negative number.  It's not really a
      bug in that commit, it just didn't go far enough down the stack to find
      all the possible 64->32 bit overflows.
      
      This switches calc_reclaim_items_nr() to return a u64 and changes everyone
      that uses the results of that math to u64 as well.
      Reported-by: default avatarDave Jones <davej@codemonkey.org.uk>
      Fixes: 70e7af24 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6374e57a
    • David Sterba's avatar
      btrfs: scrub: fix target device intialization while setting up scrub context · ded56184
      David Sterba authored
      The commit "btrfs: scrub: inline helper scrub_setup_wr_ctx" inlined a
      helper but wrongly sets up the target device. Incidentally there's a
      local variable with the same name as a parameter in the previous
      function, so this got caught during runtime as crash in test btrfs/027.
      Reported-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ded56184
    • Qu Wenruo's avatar
      btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges · bc42bda2
      Qu Wenruo authored
      [BUG]
      For the following case, btrfs can underflow qgroup reserved space
      at an error path:
      (Page size 4K, function name without "btrfs_" prefix)
      
               Task A                  |             Task B
      ----------------------------------------------------------------------
      Buffered_write [0, 2K)           |
      |- check_data_free_space()       |
      |  |- qgroup_reserve_data()      |
      |     Range aligned to page      |
      |     range [0, 4K)          <<< |
      |     4K bytes reserved      <<< |
      |- copy pages to page cache      |
                                       | Buffered_write [2K, 4K)
                                       | |- check_data_free_space()
                                       | |  |- qgroup_reserved_data()
                                       | |     Range alinged to page
                                       | |     range [0, 4K)
                                       | |     Already reserved by A <<<
                                       | |     0 bytes reserved      <<<
                                       | |- delalloc_reserve_metadata()
                                       | |  And it *FAILED* (Maybe EQUOTA)
                                       | |- free_reserved_data_space()
                                            |- qgroup_free_data()
                                               Range aligned to page range
                                               [0, 4K)
                                               Freeing 4K
      (Special thanks to Chandan for the detailed report and analyse)
      
      [CAUSE]
      Above Task B is freeing reserved data range [0, 4K) which is actually
      reserved by Task A.
      
      And at writeback time, page dirty by Task A will go through writeback
      routine, which will free 4K reserved data space at file extent insert
      time, causing the qgroup underflow.
      
      [FIX]
      For btrfs_qgroup_free_data(), add @reserved parameter to only free
      data ranges reserved by previous btrfs_qgroup_reserve_data().
      So in above case, Task B will try to free 0 byte, so no underflow.
      Reported-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Tested-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bc42bda2
    • Qu Wenruo's avatar
      btrfs: qgroup: Introduce extent changeset for qgroup reserve functions · 364ecf36
      Qu Wenruo authored
      Introduce a new parameter, struct extent_changeset for
      btrfs_qgroup_reserved_data() and its callers.
      
      Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
      which range it reserved in current reserve, so it can free it in error
      paths.
      
      The reason we need to export it to callers is, at buffered write error
      path, without knowing what exactly which range we reserved in current
      allocation, we can free space which is not reserved by us.
      
      This will lead to qgroup reserved space underflow.
      Reviewed-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      364ecf36
    • Qu Wenruo's avatar
      btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write... · a12b877b
      Qu Wenruo authored
      btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
      
      [BUG]
      Under the following case, we can underflow qgroup reserved space.
      
                  Task A                |            Task B
      ---------------------------------------------------------------
       Quota disabled                   |
       Buffered write                   |
       |- btrfs_check_data_free_space() |
       |  *NO* qgroup space is reserved |
       |  since quota is *DISABLED*     |
       |- All pages are copied to page  |
          cache                         |
                                        | Enable quota
                                        | Quota scan finished
                                        |
                                        | Sync_fs
                                        | |- run_delalloc_range
                                        | |- Write pages
                                        | |- btrfs_finish_ordered_io
                                        |    |- insert_reserved_file_extent
                                        |       |- btrfs_qgroup_release_data()
                                        |          Since no qgroup space is
                                                   reserved in Task A, we
                                                   underflow qgroup reserved
                                                   space
      This can be detected by fstest btrfs/104.
      
      [CAUSE]
      In insert_reserved_file_extent() we tell qgroup to release the @ram_bytes
      size of qgroup reserved_space in all cases.
      And btrfs_qgroup_release_data() will check if quotas are enabled.
      
      However in the above case, the buffered write happens before quota is
      enabled, so we don't have the reserved space for that range.
      
      [FIX]
      In insert_reserved_file_extent(), we tell qgroup to release the acctual
      byte number it released.
      In the above case, since we don't have the reserved space, we tell
      qgroups to release 0 byte, so the problem can be fixed.
      
      And thanks to the @reserved parameter introduced by the qgroup rework,
      and previous patch to return released bytes, the fix can be as small as
      10 lines.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog updates ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a12b877b
    • Qu Wenruo's avatar
      btrfs: qgroup: Return actually freed bytes for qgroup release or free data · 7bc329c1
      Qu Wenruo authored
      btrfs_qgroup_release/free_data() only returns 0 or a negative error
      number (ENOMEM is the only possible error).
      
      This is normally good enough, but sometimes we need the exact byte
      count it freed/released.
      
      Change it to return actually released/freed bytenr number instead of 0
      for success.
      And slightly modify related extent_changeset structure, since in btrfs
      one no-hole data extent won't be larger than 128M, so "unsigned int"
      is large enough for the use case.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7bc329c1
    • Qu Wenruo's avatar
      btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function · d1b8b94a
      Qu Wenruo authored
      Quite a lot of qgroup corruption happens due to wrong time of calling
      btrfs_qgroup_prepare_account_extents().
      
      Since the safest time is to call it just before
      btrfs_qgroup_account_extents(), there is no need to separate these 2
      functions.
      
      Merging them will make code cleaner and less bug prone.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog and comment adjustments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d1b8b94a
    • Qu Wenruo's avatar
      btrfs: qgroup: Add quick exit for non-fs extents · 5edfd9fd
      Qu Wenruo authored
      Modify btrfs_qgroup_account_extent() to exit quicker for non-fs extents.
      
      The quick exit condition is:
      1) The extent belongs to a non-fs tree
         Only fs-tree extents can affect qgroup numbers and is the only case
         where extent can be shared between different trees.
      
         Although strictly speaking extent in data-reloc or tree-reloc tree
         can be shared, data/tree-reloc root won't appear in the result of
         btrfs_find_all_roots(), so we can ignore such case.
      
         So we can check the first root in old_roots/new_roots ulist.
         - if we find the 1st root is a not a fs/subvol root, then we can skip
           the extent
         - if we find the 1st root is a fs/subvol root, then we must continue
           calculation
      
      OR
      
      2) both 'nr_old_roots' and 'nr_new_roots' are 0
         This means either such extent got allocated then freed in current
         transaction or it's a new reloc tree extent, whose nr_new_roots is 0.
         Either way it won't affect qgroup accounting and can be skipped
         safely.
      
      Such quick exit can make trace output more quite and less confusing:
      (example with fs uuid and time stamp removed)
      
      Before:
      ------
      add_delayed_tree_ref: bytenr=29556736 num_bytes=16384 action=ADD_DELAYED_REF parent=0(-) ref_root=2(EXTENT_TREE) level=0 type=TREE_BLOCK_REF seq=0
      btrfs_qgroup_account_extent: bytenr=29556736 num_bytes=16384 nr_old_roots=0 nr_new_roots=1
      ------
      Extent tree block will trigger btrfs_qgroup_account_extent() trace point
      while no qgroup number is changed, as extent tree won't affect qgroup
      accounting.
      
      After:
      ------
      add_delayed_tree_ref: bytenr=29556736 num_bytes=16384 action=ADD_DELAYED_REF parent=0(-) ref_root=2(EXTENT_TREE) level=0 type=TREE_BLOCK_REF seq=0
      ------
      Now such unrelated extent won't trigger btrfs_qgroup_account_extent()
      trace point, making the trace less noisy.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog and comment adjustments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5edfd9fd
    • Omar Sandoval's avatar
      Btrfs: rework delayed ref total_bytes_pinned accounting · d7eae340
      Omar Sandoval authored
      The total_bytes_pinned counter is completely broken when accounting
      delayed refs:
      
      - If two drops for the same extent are merged, we will decrement
        total_bytes_pinned twice but only increment it once.
      - If an add is merged into a drop or vice versa, we will decrement the
        total_bytes_pinned counter but never increment it.
      - If multiple references to an extent are dropped, we will account it
        multiple times, potentially vastly over-estimating the number of bytes
        that will be freed by a commit and doing unnecessary work when we're
        close to ENOSPC.
      
      The last issue is relatively minor, but the first two make the
      total_bytes_pinned counter leak or underflow very often. These
      accounting issues were introduced in b150a4f1 ("Btrfs: use a percpu
      to keep track of possibly pinned bytes"), but they were papered over by
      zeroing out the counter on every commit until d288db5d ("Btrfs: fix
      race of using total_bytes_pinned").
      
      We need to make sure that an extent is accounted as pinned exactly once
      if and only if we will drop references to it when when the transaction
      is committed. Ideally we would only add to total_bytes_pinned when the
      *last* reference is dropped, but this information isn't readily
      available for data extents. Again, this over-estimation can lead to
      extra commits when we're close to ENOSPC, but it's not as bad as before.
      
      The fix implemented here is to increment total_bytes_pinned when the
      total refmod count for an extent goes negative and decrement it if the
      refmod count goes back to non-negative or after we've run all of the
      delayed refs for that extent.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d7eae340
    • Omar Sandoval's avatar
      Btrfs: return old and new total ref mods when adding delayed refs · 7be07912
      Omar Sandoval authored
      We need this to decide when to account pinned bytes.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7be07912
    • Omar Sandoval's avatar
      Btrfs: always account pinned bytes when dropping a tree block ref · 0a16c7d7
      Omar Sandoval authored
      Currently, we only increment total_bytes_pinned in
      btrfs_free_tree_block() when dropping the last reference on the block.
      However, when the delayed ref is run later, we will decrement
      total_bytes_pinned regardless of whether it was the last reference or
      not. This causes the counter to underflow when the reference we dropped
      was not the last reference. Fix it by incrementing the counter
      unconditionally, which is what btrfs_free_extent() does. This makes
      total_bytes_pinned an overestimate when references to shared extents are
      dropped, but in the worst case this will just make us try to commit the
      transaction to try to free up space and find we didn't free enough.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0a16c7d7
    • Omar Sandoval's avatar
      Btrfs: update total_bytes_pinned when pinning down extents · 4da8b76d
      Omar Sandoval authored
      The extents marked in pin_down_extent() will be unpinned later in
      unpin_extent_range(), which decrements total_bytes_pinned.
      pin_down_extent() must increment the counter to avoid underflowing it.
      Also adjust btrfs_free_tree_block() to avoid accounting for the same
      extent twice.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4da8b76d
    • Omar Sandoval's avatar
      Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT() · 55e8196a
      Omar Sandoval authored
      The value of flags is one of DATA/METADATA/SYSTEM, they must exist at
      when add_pinned_bytes is called.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ added changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      55e8196a
    • Omar Sandoval's avatar
      Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64 · 0d9f824d
      Omar Sandoval authored
      There are a few places where we pass in a negative num_bytes, so make it
      signed for clarity. Also move it up in the file since later patches will
      need it there.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d9f824d
    • David Sterba's avatar
      btrfs: fix validation of XATTR_ITEM dir items · 1164a9fb
      David Sterba authored
      The XATTR_ITEM is a type of a directory item so we use the common
      validator helper. Unlike other dir items, it can have data. The way the
      name len validation is currently implemented does not reflect that. We'd
      have to adjust by the data_len when comparing the read and item limits.
      
      However, this will not work for multi-item xattr dir items.
      
      Example from tree dump of generic/337:
      
              item 7 key (257 XATTR_ITEM 751495445) itemoff 15667 itemsize 147
                      location key (0 UNKNOWN.0 0) type XATTR
                      transid 8 data_len 3 name_len 11
                      name: user.foobar
                      data 123
                      location key (0 UNKNOWN.0 0) type XATTR
                      transid 8 data_len 6 name_len 13
                      name: user.WvG1c1Td
                      data qwerty
                      location key (0 UNKNOWN.0 0) type XATTR
                      transid 8 data_len 5 name_len 19
                      name: user.J3__T_Km3dVsW_
                      data hello
      
      At the point of btrfs_is_name_len_valid call we don't have access to the
      data_len value of the 2nd and 3rd sub-item. So simple btrfs_dir_data_len(leaf,
      di) would always return 3, although we'd need to get 6 and 5 respectively to
      get the claculations right. (read_end + name_len + data_len vs item_end)
      
      We'd have to also pass data_len externally, which is not point of the
      name validation. The last check is supposed to test if there's at least
      one dir item space after the one we're processing. I don't think this is
      particularly useful, validation of the next item would catch that too.
      So the check is removed and we don't weaken the validation. Now tests
      btrfs/048, btrfs/053, generic/273 and generic/337 pass.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1164a9fb
  2. 21 Jun, 2017 16 commits
    • Su Yue's avatar
      btrfs: Verify dir_item in iterate_object_props · fbc32615
      Su Yue authored
      Call verify_dir_item before memcmp_extent_buffer reading name from
      dir_item.
      Signed-off-by: default avatarSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fbc32615
    • Su Yue's avatar
      btrfs: Check name_len before in btrfs_del_root_ref · 64c7b014
      Su Yue authored
      btrfs_del_root_ref calls btrfs_search_slot and reads name from root_ref.
      Call btrfs_is_name_len_valid before memcmp.
      Signed-off-by: default avatarSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      64c7b014
    • Su Yue's avatar
      btrfs: Check name_len before reading btrfs_get_name · 488d7c45
      Su Yue authored
      In btrfs_get_name, there's btrfs_search_slot and reads name from
      inode_ref/root_ref.
      
      Call btrfs_is_name_len_valid in btrfs_get_name.
      Signed-off-by: default avatarSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      488d7c45
    • Su Yue's avatar
      btrfs: Check name_len before read in iterate_dir_item · 59b0a7f2
      Su Yue authored
      Since iterate_dir_item checks name_len in its own way,
      so use btrfs_is_name_len_valid not 'verify_dir_item' to make more strict
      name_len check.
      Signed-off-by: default avatarSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ switched ENAMETOOLONG to EIO ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      59b0a7f2
    • Su Yue's avatar
      btrfs: Check name_len in btrfs_check_ref_name_override · 3c1d4184
      Su Yue authored
      In btrfs_log_inode, btrfs_search_forward gets the buffer and then
      btrfs_check_ref_name_override will read name from ref/extref for the
      first time.
      
      Call btrfs_is_name_len_valid before reading name.
      Signed-off-by: default avatarSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c1d4184
    • Su Yue's avatar
      btrfs: Verify dir_item in replay_xattr_deletes · 8ee8c2d6
      Su Yue authored
      replay_xattr_deletes calls btrfs_search_slot to get buffer and reads
      name.
      
      Call verify_dir_item to check name_len in replay_xattr_deletes to avoid
      reading out of boundary.
      Signed-off-by: default avatarSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8ee8c2d6
    • Su Yue's avatar
      btrfs: Check name_len on add_inode_ref call path · 26a836ce
      Su Yue authored
      replay_one_buffer first reads buffers and dispatches items accroding to
      the item type.
      In this patch, add_inode_ref handles inode_ref and inode_extref.
      Then add_inode_ref calls ref_get_fields and extref_get_fields to read
      ref/extref name for the first time.
      So checking name_len before reading those two is fine.
      
      add_inode_ref also calls inode_in_dir to match ref/extref in parent_dir.
      The call graph includes btrfs_match_dir_item_name to read dir_item name
      in the parent dir.
      Checking first dir_item is not enough. Change it to verify every
      dir_item while doing matches.
      Signed-off-by: default avatarSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      26a836ce
    • Su Yue's avatar
      btrfs: Check name_len with boundary in verify dir_item · e79a3327
      Su Yue authored
      Originally, verify_dir_item verifies name_len of dir_item with fixed
      values but not item boundary.
      If corrupted name_len was not bigger than the fixed value, for example
      255, the function will think the dir_item is fine. And then reading
      beyond boundary will cause crash.
      
      Example:
      	1. Corrupt one dir_item name_len to be 255.
              2. Run 'ls -lar /mnt/test/ > /dev/null'
      dmesg:
      [   48.451449] BTRFS info (device vdb1): disk space caching is enabled
      [   48.451453] BTRFS info (device vdb1): has skinny extents
      [   48.489420] general protection fault: 0000 [#1] SMP
      [   48.489571] Modules linked in: ext4 jbd2 mbcache btrfs xor raid6_pq
      [   48.489716] CPU: 1 PID: 2710 Comm: ls Not tainted 4.10.0-rc1 #5
      [   48.489853] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
      [   48.490008] task: ffff880035df1bc0 task.stack: ffffc90004800000
      [   48.490008] RIP: 0010:read_extent_buffer+0xd2/0x190 [btrfs]
      [   48.490008] RSP: 0018:ffffc90004803d98 EFLAGS: 00010202
      [   48.490008] RAX: 000000000000001b RBX: 000000000000001b RCX: 0000000000000000
      [   48.490008] RDX: ffff880079dbf36c RSI: 0005080000000000 RDI: ffff880079dbf368
      [   48.490008] RBP: ffffc90004803dc8 R08: ffff880078e8cc48 R09: ffff880000000000
      [   48.490008] R10: 0000160000000000 R11: 0000000000001000 R12: ffff880079dbf288
      [   48.490008] R13: ffff880078e8ca88 R14: 0000000000000003 R15: ffffc90004803e20
      [   48.490008] FS:  00007fef50c60800(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000
      [   48.490008] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   48.490008] CR2: 000055f335ac2ff8 CR3: 000000007356d000 CR4: 00000000001406e0
      [   48.490008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   48.490008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   48.490008] Call Trace:
      [   48.490008]  btrfs_real_readdir+0x3b7/0x4a0 [btrfs]
      [   48.490008]  iterate_dir+0x181/0x1b0
      [   48.490008]  SyS_getdents+0xa7/0x150
      [   48.490008]  ? fillonedir+0x150/0x150
      [   48.490008]  entry_SYSCALL_64_fastpath+0x18/0xad
      [   48.490008] RIP: 0033:0x7fef5032546b
      [   48.490008] RSP: 002b:00007ffeafcdb830 EFLAGS: 00000206 ORIG_RAX: 000000000000004e
      [   48.490008] RAX: ffffffffffffffda RBX: 00007fef5061db38 RCX: 00007fef5032546b
      [   48.490008] RDX: 0000000000008000 RSI: 000055f335abaff0 RDI: 0000000000000003
      [   48.490008] RBP: 00007fef5061dae0 R08: 00007fef5061db48 R09: 0000000000000000
      [   48.490008] R10: 000055f335abafc0 R11: 0000000000000206 R12: 00007fef5061db38
      [   48.490008] R13: 0000000000008040 R14: 00007fef5061db38 R15: 000000000000270e
      [   48.490008] RIP: read_extent_buffer+0xd2/0x190 [btrfs] RSP: ffffc90004803d98
      [   48.499455] ---[ end trace 321920d8e8339505 ]---
      
      Fix it by adding a parameter @slot and check name_len with item boundary
      by calling btrfs_is_name_len_valid.
      Signed-off-by: default avatarSu Yue <suy.fnst@cn.fujitsu.com>
      rev
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e79a3327
    • Su Yue's avatar
      btrfs: Introduce btrfs_is_name_len_valid to avoid reading beyond boundary · 19c6dcbf
      Su Yue authored
      Introduce function btrfs_is_name_len_valid.
      
      The function compares parameter @name_len with item boundary then
      returns true if name_len is valid.
      Signed-off-by: default avatarSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ s/btrfs_leaf_data/BTRFS_LEAF_DATA_OFFSET/ ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      19c6dcbf
    • David Sterba's avatar
      btrfs: move dev stats accounting out of wait_dev_flush · 66b4993e
      David Sterba authored
      We should really just wait in wait_dev_flush and let the caller decide
      what to do with the error value.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      66b4993e
    • David Sterba's avatar
      btrfs: account as waiting for IO, while waiting fot the flush bio completion · 2980d574
      David Sterba authored
      Similar to what submit_bio_wait does, we should account for IO while
      waiting for a bio completion. This has marginal visible effects, flush
      bio is short-lived.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2980d574
    • David Sterba's avatar
      btrfs: preallocate device flush bio · e0ae9994
      David Sterba authored
      For devices that support flushing, we allocate a bio, submit, wait for
      it and then free it. The bio allocation does not fail so ENOMEM is not a
      problem but we still may unnecessarily stress the allocation subsystem.
      
      Instead, we can allocate the bio at the same time we allocate the device
      and reuse it each time we need to flush the barriers. The bio is reset
      before each use. Reference counting is simplified to just device
      allocation (get) and freeing (put).
      
      The bio used to be submitted through the integrity checker which will
      find out that bio has no data attached and call submit_bio.
      
      Status of the bio in flight needs to be tracked separately in case the
      device caches get switched off between write and wait.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e0ae9994
    • Filipe Manana's avatar
      Btrfs: incremental send, fix invalid path for unlink commands · fdb13889
      Filipe Manana authored
      An incremental send can contain unlink operations with an invalid target
      path when we rename some directory inode A, then rename some file inode B
      to the old name of inode A and directory inode A is an ancestor of inode B
      in the parent snapshot (but not anymore in the send snapshot).
      
      Consider the following example scenario where this issue happens.
      
      Parent snapshot:
      
       .                                                      (ino 256)
       |
       |--- dir1/                                             (ino 257)
             |--- dir2/                                       (ino 258)
             |     |--- file1                                 (ino 259)
             |     |--- file3                                 (ino 261)
             |
             |--- dir3/                                       (ino 262)
                   |--- file22                                (ino 260)
                   |--- dir4/                                 (ino 263)
      
      Send snapshot:
      
       .                                                      (ino 256)
       |
       |--- dir1/                                             (ino 257)
             |--- dir2/                                       (ino 258)
             |--- dir3                                        (ino 260)
             |--- file3/                                      (ino 262)
                   |--- dir4/                                 (ino 263)
                         |--- file11                          (ino 269)
                         |--- file33                          (ino 261)
      
      When attempting to apply the corresponding incremental send stream, an
      unlink operation contains an invalid path which makes the receiver fail.
      The following is verbose output of the btrfs receive command:
      
       receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
       utimes
       utimes dir1
       utimes dir1/dir2
       link dir1/dir3/dir4/file11 -> dir1/dir2/file1
       unlink dir1/dir2/file1
       utimes dir1/dir2
       truncate dir1/dir3/dir4/file11 size=0
       utimes dir1/dir3/dir4/file11
       rename dir1/dir3 -> o262-7-0
       link dir1/dir3 -> o262-7-0/file22
       unlink dir1/dir3/file22
       ERROR: unlink dir1/dir3/file22 failed. Not a directory
      
      The following steps happen during the computation of the incremental send
      stream the lead to this issue:
      
      1) Before we start processing the new and deleted references for inode
         260, we compute the full path of the deleted reference
         ("dir1/dir3/file22") and cache it in the list of deleted references
         for our inode.
      
      2) We then start processing the new references for inode 260, for which
         there is only one new, located at "dir1/dir3". When processing this
         new reference, we check that inode 262, which was not yet processed,
         collides with the new reference and because of that we orphanize
         inode 262 so its new full path becomes "o262-7-0".
      
      3) After the orphanization of inode 262, we create the new reference for
         inode 260 by issuing a link command with a target path of "dir1/dir3"
         and a source path of "o262-7-0/file22".
      
      4) We then start processing the deleted references for inode 260, for
         which there is only one with the base name of "file22", and issue
         an unlink operation containing the target path computed at step 1,
         which is wrong because that path no longer exists and should be
         replaced with "o262-7-0/file22".
      
      So fix this issue by recomputing the full path of deleted references if
      when we processed the new references for an inode we ended up orphanizing
      any other inode that is an ancestor of our inode in the parent snapshot.
      
      A test case for fstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      [ adjusted after prev patch removed fs_path::dir_path and dir_path_len ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fdb13889
    • Filipe Manana's avatar
      Btrfs: send, fix invalid path after renaming and linking file · 72c3668f
      Filipe Manana authored
      Currently an incremental snapshot can generate link operations which
      contain an invalid target path. Such case happens when in the send
      snapshot a file was renamed, a new hard link added for it and some
      other inode (with a lower number) got renamed to the former name of
      that file. Example:
      
      Parent snapshot
      
       .                  (ino 256)
       |
       |--- f1            (ino 257)
       |--- f2            (ino 258)
       |--- f3            (ino 259)
      
      Send snapshot
      
       .                  (ino 256)
       |
       |--- f2            (ino 257)
       |--- f3            (ino 258)
       |--- f4            (ino 259)
       |--- f5            (ino 258)
      
      The following steps happen when computing the incremental send stream:
      
      1) When processing inode 257, inode 258 is orphanized (renamed to
         "o258-7-0"), because its current reference has the same name as the
         new reference for inode 257;
      
      2) When processing inode 258, we iterate over all its new references,
         which have the names "f3" and "f5". The first iteration sees name
         "f5" and renames the inode from its orphan name ("o258-7-0") to
         "f5", while the second iteration sees the name "f3" and, incorrectly,
         issues a link operation with a target name matching the orphan name,
         which no longer exists. The first iteration had reset the current
         valid path of the inode to "f5", but in the second iteration we lost
         it because we found another inode, with a higher number of 259, which
         has a reference named "f3" as well, so we orphanized inode 259 and
         recomputed the current valid path of inode 258 to its old orphan
         name because inode 259 could be an ancestor of inode 258 and therefore
         the current valid path could contain the pre-orphanization name of
         inode 259. However in this case inode 259 is not an ancestor of inode
         258 so the current valid path should not be recomputed.
         This makes the receiver fail with the following error:
      
         ERROR: link f3 -> o258-7-0 failed: No such file or directory
      
      So fix this by not recomputing the current valid path for an inode
      whenever we find a colliding reference from some not yet processed inode
      (inode number higher then the one currently being processed), unless
      that other inode is an ancestor of the one we are currently processing.
      
      A test case for fstests will follow soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      72c3668f
    • Filipe Manana's avatar
      Btrfs: fix invalid extent maps due to hole punching · 609805d8
      Filipe Manana authored
      While punching a hole in a range that is not aligned with the sector size
      (currently the same as the page size) we can end up leaving an extent map
      in memory with a length that is smaller then the sector size or with a
      start offset that is not aligned to the sector size. Both cases are not
      expected and can lead to problems. This issue is easily detected
      after the patch from commit a7e3b975 ("Btrfs: fix reported number of
      inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the
      following for example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ xfs_io -c "pwrite -S 0xaa -b 100K 0 100K" /mnt/foo
        $ xfs_io -c "fpunch 60K 90K" /mnt/foo
        $ xfs_io -c "pwrite -S 0xbb -b 100K 50K 100K" /mnt/foo
        $ xfs_io -c "pwrite -S 0xcc -b 50K 100K 50K" /mnt/foo
        $ umount /mnt
      
      After the unmount operation we can see several warnings emmitted due to
      underflows related to space reservation counters:
      
      [ 2837.443299] ------------[ cut here ]------------
      [ 2837.447395] WARNING: CPU: 8 PID: 2474 at fs/btrfs/inode.c:9444 btrfs_destroy_inode+0xe8/0x27e [btrfs]
      [ 2837.452108] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button se
      rio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_gene
      ric raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.458389] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.459754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.462379] Call Trace:
      [ 2837.462379]  dump_stack+0x68/0x92
      [ 2837.462379]  __warn+0xc2/0xdd
      [ 2837.462379]  warn_slowpath_null+0x1d/0x1f
      [ 2837.462379]  btrfs_destroy_inode+0xe8/0x27e [btrfs]
      [ 2837.462379]  destroy_inode+0x3d/0x55
      [ 2837.462379]  evict+0x177/0x17e
      [ 2837.462379]  dispose_list+0x50/0x71
      [ 2837.462379]  evict_inodes+0x132/0x141
      [ 2837.462379]  generic_shutdown_super+0x3f/0xeb
      [ 2837.462379]  kill_anon_super+0x12/0x1c
      [ 2837.462379]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.462379]  deactivate_locked_super+0x30/0x68
      [ 2837.462379]  deactivate_super+0x36/0x39
      [ 2837.462379]  cleanup_mnt+0x58/0x76
      [ 2837.462379]  __cleanup_mnt+0x12/0x14
      [ 2837.462379]  task_work_run+0x77/0x9b
      [ 2837.462379]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.462379]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.462379]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.462379] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.462379] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.462379] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.462379] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.462379] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.462379] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.462379] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.519355] ---[ end trace e79345fe24b30b8d ]---
      [ 2837.596256] ------------[ cut here ]------------
      [ 2837.597625] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5699 btrfs_free_block_groups+0x246/0x3eb [btrfs]
      [ 2837.603547] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.659372] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.663359] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.663359] Call Trace:
      [ 2837.663359]  dump_stack+0x68/0x92
      [ 2837.663359]  __warn+0xc2/0xdd
      [ 2837.663359]  warn_slowpath_null+0x1d/0x1f
      [ 2837.663359]  btrfs_free_block_groups+0x246/0x3eb [btrfs]
      [ 2837.663359]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.663359]  ? evict_inodes+0x132/0x141
      [ 2837.663359]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.663359]  generic_shutdown_super+0x6a/0xeb
      [ 2837.663359]  kill_anon_super+0x12/0x1c
      [ 2837.663359]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.663359]  deactivate_locked_super+0x30/0x68
      [ 2837.663359]  deactivate_super+0x36/0x39
      [ 2837.663359]  cleanup_mnt+0x58/0x76
      [ 2837.663359]  __cleanup_mnt+0x12/0x14
      [ 2837.663359]  task_work_run+0x77/0x9b
      [ 2837.663359]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.663359]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.663359]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.663359] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.663359] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.663359] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.663359] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.663359] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.663359] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.663359] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.739445] ---[ end trace e79345fe24b30b8e ]---
      [ 2837.745595] ------------[ cut here ]------------
      [ 2837.746412] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5700 btrfs_free_block_groups+0x261/0x3eb [btrfs]
      [ 2837.747955] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.755395] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.756769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.758526] Call Trace:
      [ 2837.758925]  dump_stack+0x68/0x92
      [ 2837.759383]  __warn+0xc2/0xdd
      [ 2837.759383]  warn_slowpath_null+0x1d/0x1f
      [ 2837.759383]  btrfs_free_block_groups+0x261/0x3eb [btrfs]
      [ 2837.759383]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.759383]  ? evict_inodes+0x132/0x141
      [ 2837.759383]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.759383]  generic_shutdown_super+0x6a/0xeb
      [ 2837.759383]  kill_anon_super+0x12/0x1c
      [ 2837.759383]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.759383]  deactivate_locked_super+0x30/0x68
      [ 2837.759383]  deactivate_super+0x36/0x39
      [ 2837.759383]  cleanup_mnt+0x58/0x76
      [ 2837.759383]  __cleanup_mnt+0x12/0x14
      [ 2837.759383]  task_work_run+0x77/0x9b
      [ 2837.759383]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.759383]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.759383]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.759383] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.759383] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.759383] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.759383] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.759383] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.759383] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.759383] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.777063] ---[ end trace e79345fe24b30b8f ]---
      [ 2837.778235] ------------[ cut here ]------------
      [ 2837.778856] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:9825 btrfs_free_block_groups+0x348/0x3eb [btrfs]
      [ 2837.791385] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.797711] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.798594] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.800118] Call Trace:
      [ 2837.800515]  dump_stack+0x68/0x92
      [ 2837.801015]  __warn+0xc2/0xdd
      [ 2837.801471]  warn_slowpath_null+0x1d/0x1f
      [ 2837.801698]  btrfs_free_block_groups+0x348/0x3eb [btrfs]
      [ 2837.801698]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.801698]  ? evict_inodes+0x132/0x141
      [ 2837.801698]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.801698]  generic_shutdown_super+0x6a/0xeb
      [ 2837.801698]  kill_anon_super+0x12/0x1c
      [ 2837.801698]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.801698]  deactivate_locked_super+0x30/0x68
      [ 2837.801698]  deactivate_super+0x36/0x39
      [ 2837.801698]  cleanup_mnt+0x58/0x76
      [ 2837.801698]  __cleanup_mnt+0x12/0x14
      [ 2837.801698]  task_work_run+0x77/0x9b
      [ 2837.801698]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.801698]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.801698]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.801698] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.801698] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.801698] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.801698] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.801698] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.801698] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.801698] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.818441] ---[ end trace e79345fe24b30b90 ]---
      [ 2837.818991] BTRFS info (device sdc): space_info 1 has 7974912 free, is not full
      [ 2837.819830] BTRFS info (device sdc): space_info total=8388608, used=417792, pinned=0, reserved=0, may_use=18446744073709547520, readonly=0
      
      What happens in the above example is the following:
      
      1) When punching the hole, at btrfs_punch_hole(), the variable tail_len
         is set to 2048 (as tail_start is 148Kb + 1 and offset + len is 150Kb).
         This results in the creation of an extent map with a length of 2Kb
         starting at file offset 148Kb, through find_first_non_hole() ->
         btrfs_get_extent().
      
      2) The second write (first write after the hole punch operation), sets
         the range [50Kb, 152Kb[ to delalloc.
      
      3) The third write, at btrfs_find_new_delalloc_bytes(), sees the extent
         map covering the range [148Kb, 150Kb[ and ends up calling
         set_extent_bit() for the same range, which results in splitting an
         existing extent state record, covering the range [148Kb, 152Kb[ into
         two 2Kb extent state records, covering the ranges [148Kb, 150Kb[ and
         [150Kb, 152Kb[.
      
      4) Finally at lock_and_cleanup_extent_if_need(), immediately after calling
         btrfs_find_new_delalloc_bytes() we clear the delalloc bit from the
         range [100Kb, 152Kb[ which results in the btrfs_clear_bit_hook()
         callback being invoked against the two 2Kb extent state records that
         cover the ranges [148Kb, 150Kb[ and [150Kb, 152Kb[. When called against
         the first 2Kb extent state, it calls btrfs_delalloc_release_metadata()
         with a length argument of 2048 bytes. That function rounds up the length
         to a sector size aligned length, so it ends up considering a length of
         4096 bytes, and then calls calc_csum_metadata_size() which results in
         decrementing the inode's csum_bytes counter by 4096 bytes, so after
         it stays a value of 0 bytes. Then the same happens when
         btrfs_clear_bit_hook() is called against the second extent state that
         has a length of 2Kb, covering the range [150Kb, 152Kb[, the length is
         rounded up to 4096 and calc_csum_metadata_size() ends up being called
         to decrement 4096 bytes from the inode's csum_bytes counter, which
         at that time has a value of 0, leading to an underflow, which is
         exactly what triggers the first warning, at btrfs_destroy_inode().
         All the other warnings relate to several space accounting counters
         that underflow as well due to similar reasons.
      
      A similar case but where the hole punching operation creates an extent map
      with a start offset not aligned to the sector size is the following:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ xfs_io -f -c "fpunch 695K 820K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xaa 1008K 307K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xbb -b 630K 1073K 630K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xcc -b 459K 1068K 459K" $SCRATCH_MNT/bar
        $ umount /mnt
      
      During the unmount operation we get similar traces for the same reasons as
      in the first example.
      
      So fix the hole punching operation to make sure it never creates extent
      maps with a length that is not aligned to the sector size nor with a start
      offset that is not aligned to the sector size, as this breaks all
      assumptions and it's a land mine.
      
      Fixes: d7781546 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      609805d8
    • Jeff Mahoney's avatar
      btrfs: add cond_resched to btrfs_qgroup_trace_leaf_items · cddf3b2c
      Jeff Mahoney authored
      On an uncontended system, we can end up hitting soft lockups while
      doing replace_path.  At the core, and frequently called is
      btrfs_qgroup_trace_leaf_items, so it makes sense to add a cond_resched
      there.
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cddf3b2c
  3. 20 Jun, 2017 5 commits
    • Nikolay Borisov's avatar
      btrfs: Round down values which are written for total_bytes_size · 7dfb8be1
      Nikolay Borisov authored
      We got an internal report about a file system not wanting to mount
      following 99e3ecfc ("Btrfs: add more validation checks for
      superblock").
      
      BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with
      fs_devices total_rw_bytes 1000203820544
      
      Subtracting the numbers we get a difference of less than a 4kb. Upon
      closer inspection it became apparent that mkfs actually rounds down the
      size of the device to a multiple of sector size. However, the same
      cannot be said for various functions which modify the total size and are
      called from btrfs_balance as well as when adding a new device. So this
      patch ensures that values being saved into on-disk data structures are
      always rounded down to a multiple of sectorsize.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7dfb8be1
    • Nikolay Borisov's avatar
      btrfs: Manually implement device_total_bytes getter/setter · eca152ed
      Nikolay Borisov authored
      The device->total_bytes member needs to always be rounded down to sectorsize
      so that it corresponds to the value of super->total_bytes. However, there are
      multiple places where the setter is fed a value which is not rounded which
      can cause a fs to be unmountable due to the check introduced in
      99e3ecfc ("Btrfs: add more validation checks for superblock"). This patch
      implements the getter/setter manually so that in a later patch I can add
      necessary code to catch offenders.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eca152ed
    • David Sterba's avatar
      btrfs: obsolete and remove mount option alloc_start · 0d0c71b3
      David Sterba authored
      The mount option alloc_start was used in the past for debugging and
      stressing the chunk allocator. Not meant to be used by users, so we're
      not breaking anybody's setup.
      
      There was some added complexity handling changes of the value and when
      it was not same as default. Such code has likely been untested and I
      think it's better to remove it.
      
      This patch kills all use of alloc_start, and by doing that also fixes
      a bug when alloc_size is set, potentially called from statfs:
      
      in btrfs_calc_avail_data_space, traversing the list in RCU, the RCU
      protection is temporarily dropped so btrfs_account_dev_extents_size can
      be called and then RCU is locked again! Doing that inside
      list_for_each_entry_rcu is just asking for trouble, but unlikely to be
      observed in practice.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d0c71b3
    • David Sterba's avatar
      btrfs: move fs_info::fs_frozen to the flags · fac03c8d
      David Sterba authored
      We can keep the state among the other fs_info flags, there's no reason
      why fs_frozen would need to be separate.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fac03c8d
    • David Sterba's avatar
      btrfs: cleanup duplicate return value in insert_inline_extent · 79b4f4c6
      David Sterba authored
      The pattern when err is used for function exit and ret is used for
      return values of callees is not used here.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      79b4f4c6
  4. 19 Jun, 2017 3 commits