1. 27 Jul, 2020 25 commits
    • btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak · 7dbeaad0
      Qu Wenruo authored
      [BUG]
      The following simple workload from fsstress can lead to qgroup reserved
      data space leak:
        0/0: creat f0 x:0 0 0
        0/0: creat add id=0,parent=-1
        0/1: write f0[259 1 0 0 0 0] [600030,27288] 0
        0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat()
        0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0
      
      This would cause btrfs qgroup to leak 20480 bytes for data reserved
      space.  If btrfs qgroup limit is enabled, such leak can lead to
      unexpected early EDQUOT and unusable space.
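
      For reference, the same overlapping pattern can be driven without
      fsstress; a minimal reproducer sketch (offsets taken from the log
      above; the mount point and file name are placeholders, error handling
      is omitted):

        /* Hedged reproducer sketch (not from the original report): a
         * buffered write followed by an overlapping O_DIRECT write.
         * Build: gcc -o repro repro.c */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
                static char buf1[27288];
                void *buf2;
                int fd = open("/mnt/btrfs/f0", O_CREAT | O_RDWR, 0644);

                /* 0/1: buffered write [600030, 27288) */
                memset(buf1, 0xaa, sizeof(buf1));
                pwrite(fd, buf1, sizeof(buf1), 600030);

                /* 0/4: direct write [610304, 106496), overlapping the
                 * range above; offset and length are 4K aligned, as
                 * O_DIRECT requires */
                posix_memalign(&buf2, 4096, 106496);
                memset(buf2, 0xbb, 106496);
                fcntl(fd, F_SETFL, O_DIRECT);
                pwrite(fd, buf2, 106496, 610304);

                close(fd);
                free(buf2);
                return 0;
        }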
      
      [CAUSE]
      When doing direct IO, the kernel first tries to write back the
      existing buffered page cache, then invalidates those pages:
        generic_file_direct_write()
        |- filemap_write_and_wait_range();
        |- invalidate_inode_pages2_range();
      
      However for btrfs, the bi_end_io hook doesn't finish all its heavy
      work right after the bio ends.  In fact, it defers the work even
      further:
      
        submit_extent_page(end_io_func=end_bio_extent_writepage);
        end_bio_extent_writepage()
        |- btrfs_writepage_endio_finish_ordered()
           |- btrfs_init_work(finish_ordered_fn);
      
        <<< Work queue execution >>>
        finish_ordered_fn()
        |- btrfs_finish_ordered_io();
           |- Clear qgroup bits
      
      This means, when filemap_write_and_wait_range() returns,
      btrfs_finish_ordered_io() is not guaranteed to be executed, thus the
      qgroup bits for related range are not cleared.
      
      Now on to how the leak happens.  The focus here is only on the part
      where the buffered and direct IO ranges overlap.
      
      1. After buffered write
         The inode had the following range with QGROUP_RESERVED bit:
   	596K		616K
      	|///////////////|
         Qgroup reserved data space: 20K
      
      2. Writeback for range [596K, 616K)
         Writeback has finished, but btrfs_finish_ordered_io() has not
         been called yet.
         So we still have:
         	596K		616K
      	|///////////////|
         Qgroup reserved data space: 20K
      
      3. Pages for range [596K, 616K) get released
         This clears all qgroup bits, but doesn't update the reserved data
         space.
         So we have:
         	596K		616K
      	|		|
         Qgroup reserved data space: 20K
         That number doesn't match the qgroup bit range anymore.
      
      4. Direct IO prepares space for range [596K, 700K)
         Qgroup reserves data space for that range, so we get:
         	596K		616K			700K
      	|///////////////|///////////////////////|
         Qgroup reserved data space: 20K + 104K = 124K
      
      5. btrfs_finish_ordered_io() gets executed for range [596K, 616K)
         Qgroup frees the reserved space for that range, so we get:
         	596K		616K			700K
      	|		|///////////////////////|
         We need to free that range of reserved space.
         Qgroup reserved data space: 124K - 20K = 104K
      
      6. btrfs_finish_ordered_io() gets executed for range [596K, 700K)
         However the qgroup bits for range [596K, 616K) were already
         cleared in the previous step, so we only free 84K of qgroup
         reserved space.
   	596K		616K			700K
	|		|			|
         Qgroup reserved data space: 104K - 84K = 20K
      
         Now there is no way to release that 20K other than disabling
         qgroups or unmounting the fs.
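
      The bookkeeping mismatch can be replayed in a few lines; a toy
      userspace model (not kernel code) where the RESERVED bits and the
      byte counter are updated by different events, as in the steps above:

        /* Toy model of the leak (values in KiB, steps as above): the
         * RESERVED bits and the byte counter drift apart once pages are
         * released between writeback and btrfs_finish_ordered_io(). */
        #include <stdio.h>

        int main(void)
        {
                long reserved = 0; /* qgroup reserved data space counter */
                long bits = 0;     /* KiB still covered by RESERVED bits */

                reserved += 20;  bits += 20;  /* 1. buffered write [596K, 616K) */
                                 bits -= 20;  /* 3. page release clears bits only */
                reserved += 104; bits += 104; /* 4. dio reserves [596K, 700K) */
                reserved -= 20;  bits -= 20;  /* 5. ordered IO frees [596K, 616K) */
                reserved -= 84;  bits -= 84;  /* 6. only 84K of bits remain */

                printf("reserved: %ldK, bits: %ldK\n", reserved, bits);
                /* prints "reserved: 20K, bits: 0K" -- the leaked 20K */
                return 0;
        }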
      
      [FIX]
      This patch changes the timing of the btrfs_qgroup_release/free_data()
      calls, using buffered COW write as the example here.
      
      	The new timing			|	The old timing
      ----------------------------------------+---------------------------------------
       btrfs_buffered_write()			| btrfs_buffered_write()
       |- btrfs_qgroup_reserve_data() 	| |- btrfs_qgroup_reserve_data()
      					|
       btrfs_run_delalloc_range()		| btrfs_run_delalloc_range()
       |- btrfs_add_ordered_extent()  	|
          |- btrfs_qgroup_release_data()	|
             The reserved is passed into	|
             btrfs_ordered_extent structure	|
      					|
       btrfs_finish_ordered_io()		| btrfs_finish_ordered_io()
       |- The reserved space is passed to 	| |- btrfs_qgroup_release_data()
    btrfs_qgroup_record			|    The reserved space is passed
					|    to btrfs_qgroup_record
      					|
       btrfs_qgroup_account_extents()		| btrfs_qgroup_account_extents()
       |- btrfs_qgroup_free_refroot()		| |- btrfs_qgroup_free_refroot()
      
      The point of this change is to ensure that, when ordered extents are
      submitted, the qgroup reserved space has already been released,
      keeping the timing aligned with filemap_write_and_wait_range().

      This way all qgroup data reserved space is bound to a
      btrfs_ordered_extent, which solves the timing mismatch.
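
      In essence the reservation now travels with the ordered extent; a
      simplified sketch with toy types (not the actual kernel structures):

        /* Simplified sketch of the new flow: the ordered extent takes
         * ownership of the reserved bytes when it is created, so the
         * reservation is released before writeback completes. */
        #include <stdio.h>

        typedef unsigned long long u64;

        struct toy_ordered_extent {
                u64 file_offset;
                u64 num_bytes;
                u64 qgroup_rsv; /* reserved bytes carried along */
        };

        /* stands in for btrfs_qgroup_release_data(): clears the RESERVED
         * marks and returns how many reserved bytes the range held */
        static u64 toy_qgroup_release_data(u64 start, u64 len)
        {
                (void)start;
                return len; /* toy model: the whole range was reserved */
        }

        /* stands in for btrfs_add_ordered_extent() with the new timing */
        static void toy_add_ordered_extent(struct toy_ordered_extent *oe,
                                           u64 start, u64 len)
        {
                oe->file_offset = start;
                oe->num_bytes = len;
                oe->qgroup_rsv = toy_qgroup_release_data(start, len);
        }

        int main(void)
        {
                struct toy_ordered_extent oe;

                toy_add_ordered_extent(&oe, 596ULL << 10, 20ULL << 10);
                printf("ordered extent owns %lluK\n", oe.qgroup_rsv >> 10);
                return 0;
        }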
      
      Fixes: f695fdce ("btrfs: qgroup: Introduce functions to release/free qgroup reserve data space")
      Suggested-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: file: reserve qgroup space after the hole punch range is locked · a7f8b1c2
      Qu Wenruo authored
      An upcoming change to the qgroup reserved space timing will move the
      data reservation to the ordered extent completely.
      
      However, btrfs_punch_hole_lock_range() calls btrfs_invalidate_page(),
      which clears the QGROUP_RESERVED bit for the range.
      
      At the current stage this is OK, but once ordered extents handle the
      reserved space, btrfs_punch_hole_lock_range() can clear the
      QGROUP_RESERVED bit before we submit the ordered extent, leading to a
      qgroup reserved space leak.
      
      So change the timing to reserve the data space after
      btrfs_punch_hole_lock_range(), as sketched below.  The new timing
      works with both the current code and the new code.
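
      The reordering itself is small; a sketch with stubbed helpers (the
      function names only mirror the ones above, the bodies are
      placeholders, not the kernel implementation):

        static void punch_hole_lock_range(void)
        {
                /* may call btrfs_invalidate_page(), clearing RESERVED */
        }

        static void qgroup_reserve_data(void)
        {
                /* sets the QGROUP_RESERVED bit for the range */
        }

        int main(void)
        {
                /* old order: reserve, then lock -- the bit can be cleared
                 * by the invalidation before the ordered extent exists */
                qgroup_reserve_data();
                punch_hole_lock_range();

                /* new order: lock first, so the reservation survives */
                punch_hole_lock_range();
                qgroup_reserve_data();
                return 0;
        }
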
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: inode: move qgroup reserved space release to the callers of insert_reserved_file_extent() · 9729f10a
      Qu Wenruo authored
      This is to prepare for the incoming timing change of qgroup reserved
      data space and ordered extent.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: inode: refactor the parameters of insert_reserved_file_extent() · 203f44c5
      Qu Wenruo authored
      Function insert_reserved_file_extent() takes a long list of parameters,
      which are all for btrfs_file_extent_item, even including two reserved
      members, encryption and other_encoding.
      
      This makes the parameter list unnecessarily long for a function that
      only gets called twice.
      
      Refactor the parameter list by passing btrfs_file_extent_item
      directly, which greatly reduces the number of parameters.
      
      Also, there are only two callers: one in btrfs_finish_ordered_io(),
      which inserts the file extent for an ordered extent, and one in
      __btrfs_prealloc_file_range().

      These two call sites have completely different contexts: the ordered
      extent can be compressed, but will always be a regular extent, while
      the preallocated one is never compressed and always has the PREALLOC
      type.

      So use two small wrappers for these two call sites to improve
      readability, as sketched below.
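
      The pattern is the usual one of replacing a long parameter list with
      a filled-in struct plus one thin wrapper per call site; a generic
      sketch with toy names (not the btrfs code):

        /* Generic sketch of the refactor: callers fill a struct, and one
         * thin wrapper per call site encodes its invariants. */
        #include <stdio.h>

        struct file_extent {    /* stands in for btrfs_file_extent_item */
                unsigned long long disk_bytenr;
                unsigned long long num_bytes;
                unsigned char compression;      /* 0 = none */
                unsigned char type;             /* 1 = REG, 2 = PREALLOC */
        };

        static int insert_file_extent(unsigned long long pos,
                                      const struct file_extent *fe)
        {
                printf("insert at %llu: type %u comp %u\n",
                       pos, fe->type, fe->compression);
                return 0;
        }

        /* ordered extent path: may be compressed, always a regular extent */
        static int insert_ordered_extent(unsigned long long pos,
                                         unsigned long long bytenr,
                                         unsigned long long len,
                                         unsigned char compression)
        {
                struct file_extent fe = {
                        .disk_bytenr = bytenr, .num_bytes = len,
                        .compression = compression, .type = 1,
                };
                return insert_file_extent(pos, &fe);
        }

        /* preallocation path: never compressed, always PREALLOC */
        static int insert_prealloc_extent(unsigned long long pos,
                                          unsigned long long bytenr,
                                          unsigned long long len)
        {
                struct file_extent fe = {
                        .disk_bytenr = bytenr, .num_bytes = len,
                        .compression = 0, .type = 2,
                };
                return insert_file_extent(pos, &fe);
        }

        int main(void)
        {
                insert_ordered_extent(0, 1 << 20, 4096, 0);
                insert_prealloc_extent(4096, 2 << 20, 8192);
                return 0;
        }

      The win is that each wrapper documents its call site's invariants in
      one place instead of spreading them across a dozen arguments.
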
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: clean up temporary page variables in scrub_checksum_tree_block · 100aa5d9
      David Sterba authored
      Add proper variable for the scrub page and use it instead of repeatedly
      dereferencing the other structures.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: simplify tree block checksum calculation · 521e1022
      David Sterba authored
      Use a simpler iteration over tree block pages, same what csum_tree_block
      does: first page always exists, loop over the rest.
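
      The shape of the iteration, sketched with a placeholder checksum
      (not the kernel's crypto API) and toy sizes:

        /* Sketch of the "first page, then loop over the rest" pattern:
         * hash the first page past the embedded checksum bytes, then
         * full pages for the rest. Placeholder hash, toy sizes. */
        #include <stddef.h>
        #include <stdio.h>

        #define PAGE_SIZE 4096
        #define CSUM_SIZE 32    /* leading bytes that hold the csum */

        static unsigned long csum_update(unsigned long c,
                                         const unsigned char *p, size_t len)
        {
                while (len--)   /* placeholder for the real hash update */
                        c = c * 31 + *p++;
                return c;
        }

        static unsigned long csum_tree_block(unsigned char *pages[],
                                             int nr_pages)
        {
                unsigned long c = 0;
                int i;

                /* first page always exists: skip the checksum bytes */
                c = csum_update(c, pages[0] + CSUM_SIZE,
                                PAGE_SIZE - CSUM_SIZE);
                /* loop over the rest */
                for (i = 1; i < nr_pages; i++)
                        c = csum_update(c, pages[i], PAGE_SIZE);
                return c;
        }

        int main(void)
        {
                static unsigned char p0[PAGE_SIZE], p1[PAGE_SIZE];
                unsigned char *pages[] = { p0, p1 };

                printf("csum: %lx\n", csum_tree_block(pages, 2));
                return 0;
        }
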
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: clean up temporary page variables in scrub_checksum_data · d41ebef2
      David Sterba authored
      Add proper variable for the scrub page and use it instead of repeatedly
      dereferencing the other structures.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: simplify data block checksum calculation · 771aba0d
      David Sterba authored
      We have sectorsize same as PAGE_SIZE, the checksum can be calculated in
      one go.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: clean up temporary page variables in scrub_checksum_super · c7460541
      David Sterba authored
      Add proper variable for the scrub page and use it instead of repeatedly
      dereferencing the other structures.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove temporary csum array in scrub_checksum_super · 74710cf1
      David Sterba authored
      The page contents with the checksum is available during the entire
      function so we don't need to make a copy.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: simplify superblock checksum calculation · 83cf6d5e
      David Sterba authored
      BTRFS_SUPER_INFO_SIZE is 4096, and fits to a page on all supported
      architectures, so we can calculate the checksum in one go.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: unify naming of page address variables · b0485252
      David Sterba authored
      As the page mapping has been removed, rename the variables to 'kaddr',
      the name we use everywhere else. The type is changed to 'char *' so
      pointer arithmetic works without casts.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove kmap/kunmap of pages · a8b3a890
      David Sterba authored
      All pages that scrub uses in the scrub_block::pagev array are
      allocated with GFP_KERNEL and are never part of any mapping, so kmap
      is not necessary; we only need to know the page address.

      In scrub_write_page_to_dev_replace we don't even need to call
      flush_dcache_page, for the same reason as above.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce "rescue=" mount option · 74ef0018
      Qu Wenruo authored
      This patch introduces a new "rescue=" mount option group for all mount
      options for data recovery.
      
      Different rescue sub-options are separated by ':', e.g.
      "ro,rescue=nologreplay:usebackuproot".
      
      The original plan was to use ';', but ';' needs to be escaped/quoted,
      or it will be interpreted by bash, similar to '|'.
      
      Obviously, the user can also specify rescue options one by one, like:
      "ro,rescue=nologreplay,rescue=usebackuproot".
      
      The following mount options are converted to "rescue="; the old mount
      options are deprecated but still available for compatibility purposes:
      
      - usebackuproot
        Now it's "rescue=usebackuproot"
      
      - nologreplay
        Now it's "rescue=nologreplay"
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_alloc_data_chunk_ondemand() when allocating space for relocation · a89ef455
      Filipe Manana authored
      We currently use btrfs_check_data_free_space() when allocating space for
      relocating data extents, but that is not necessary because that function
      combines btrfs_alloc_data_chunk_ondemand(), which does the actual space
      reservation, and btrfs_qgroup_reserve_data().
      
      We can use btrfs_alloc_data_chunk_ondemand() directly because we know we
      do not need to reserve qgroup space since we are dealing with a relocation
      tree, which can never have qgroups (btrfs_qgroup_reserve_data() does
      nothing as is_fstree() returns false for a relocation tree).
      
      Conversely we can use btrfs_free_reserved_data_space_noquota() directly
      instead of btrfs_free_reserved_data_space(), since we had no qgroup
      reservation when allocating space.
      
      This change is preparatory work for another patch in this series that
      makes relocation reserve the exact amount of space it needs to relocate
      a data block group. The function btrfs_check_data_free_space() has
      the inconvenience of requiring a start offset argument, and we want
      to be able to allocate space for multiple ranges, which are not
      consecutive, at once.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the start argument from btrfs_free_reserved_data_space_noquota() · 46d4dac8
      Filipe Manana authored
      The start argument for btrfs_free_reserved_data_space_noquota() is only
      used to make sure the amount of bytes we decrement from the bytes_may_use
      counter of the data space_info object is aligned to the filesystem's
      sector size. It serves no other purpose.
      
      All its current callers always pass a length argument that is already
      aligned to the sector size, so we can make the start argument go away.
      In fact its presence makes it impossible to use it in a context where we
      just want to free a number of bytes for a range for which either we do
      not know its start offset or for freeing multiple ranges at once (which
      are not contiguous).
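
      For context, the rounding that the start argument enabled looks
      conceptually like this (a sketch of the round-to-sector arithmetic,
      not the actual helper):

        /* Sketch of why 'start' was needed: an unaligned range must be
         * rounded out to sector boundaries before decrementing
         * bytes_may_use. Once every caller passes sector-aligned lengths,
         * the rounding (and with it 'start') can go away. */
        #include <stdio.h>

        #define SECTORSIZE 4096ULL

        static unsigned long long aligned_len(unsigned long long start,
                                              unsigned long long len)
        {
                unsigned long long end = start + len;

                start &= ~(SECTORSIZE - 1);                       /* round down */
                end = (end + SECTORSIZE - 1) & ~(SECTORSIZE - 1); /* round up */
                return end - start;
        }

        int main(void)
        {
                /* unaligned range: the start offset matters */
                printf("%llu\n", aligned_len(600030, 27288));   /* 32768 */
                /* already aligned: the start offset is irrelevant */
                printf("%llu\n", aligned_len(610304, 106496));  /* 106496 */
                return 0;
        }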
      
      This change is preparatory work for a patch (the third in this
      series) that makes relocation reserve less data space for data block
      groups that are not full.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: check-integrity: remove unnecessary failure messages during memory allocation · ab483009
      Liao Pingfang authored
      As a stack dump is already printed on memory allocation failures,
      these messages are redundant and can be deleted.
      Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor tweaks ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use helper btrfs_get_block_group · b5790d51
      Anand Jain authored
      Use the helper function where the reference count increment of a
      block_group is open coded.  As btrfs_get_block_group() is a one-liner
      we could have open coded it, but its counterpart
      btrfs_put_block_group(), which does the freeing, isn't a one-liner.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: let btrfs_return_cluster_to_free_space() return void · 69b0e093
      Anand Jain authored
      __btrfs_return_cluster_to_free_space() only ever returns 0, and none
      of its callers needs the return value either, so make it a void
      function.
      
      Further, as none of the callers of btrfs_return_cluster_to_free_space()
      actually uses its return value, make that function return void as
      well.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove no longer necessary chunk mutex locking cases · f22f457a
      Filipe Manana authored
      Initially when the 'removed' flag was added to a block group to avoid
      races between block group removal and fitrim, by commit 04216820
      ("Btrfs: fix race between fs trimming and block group remove/allocation"),
      we had to lock the chunks mutex because we could be moving the block
      group from its current list, the pending chunks list, into the pinned
      chunks list, or we could just be adding it to the pinned chunks if it was
      not in the pending chunks list. Both lists were protected by the chunk
      mutex.
      
      However we no longer have those lists since commit 1c11b63e
      ("btrfs: replace pending/pinned chunks lists with io tree"), and locking
      the chunk mutex is no longer necessary because of that. The same happens
      at btrfs_unfreeze_block_group(), we lock the chunk mutex because the block
      group's extent map could be part of the pinned chunks list and the call
      to remove_extent_mapping() could be deleting it from that list, which
      used to be protected by that mutex.
      
      So just remove those lock and unlock calls as they are not needed anymore.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: factor out reading of bg from find_first_block_group · e3ba67a1
      Johannes Thumshirn authored
      When find_first_block_group() finds a block group item in the extent-tree,
      it does a lookup of the object in the extent mapping tree and does further
      checks on the item.
      
      Factor out this step from find_first_block_group() so we can further
      simplify the code.
      
      While we're at it, we can also just return early in
      find_first_block_group(), if the tree slot isn't found.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: get mapping tree directly from fsinfo in find_first_block_group · 89d7da9b
      Johannes Thumshirn authored
      We already have an fs_info in our function parameters, there's no need
      to do the maths again and get fs_info from the extent_root just to get
      the mapping_tree.
      
      Instead directly grab the mapping_tree from fs_info.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify checks when adding excluded ranges · 96f9b0f2
      Nikolay Borisov authored
      Addresses held in the 'logical' array are always guaranteed to fall
      within the boundaries of the block group; that is, 'start' can never
      be smaller than cache->start. This invariant follows from the way the
      addresses are calculated in btrfs_rmap_block:
      
          stripe_nr = physical - map->stripes[i].physical;
          stripe_nr = div64_u64(stripe_nr, map->stripe_len);
          bytenr = chunk_start + stripe_nr * io_stripe_size;
      
      I.e. it's always some IO stripe within the given chunk.
      
      Exploit this invariant to simplify the body of the loop by removing the
      unnecessary 'if' since its 'else' part is the one always executed.
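
      The invariant is easy to check numerically; a toy model of the
      quoted calculation (made-up values, div64_u64 replaced by plain
      division):

        /* Toy check of the invariant: bytenr is chunk_start plus a
         * non-negative multiple of io_stripe_size, so it can never fall
         * below chunk_start, i.e. outside the block group. */
        #include <stdio.h>

        typedef unsigned long long u64;

        int main(void)
        {
                u64 chunk_start = 1ULL << 30;   /* toy chunk logical start */
                u64 stripe_len = 64 << 10;      /* 64K stripes */
                u64 io_stripe_size = 64 << 10;
                u64 stripe_phys = 2ULL << 30;   /* toy stripe physical start */
                u64 physical;

                for (physical = stripe_phys;
                     physical < stripe_phys + 4 * stripe_len;
                     physical += stripe_len) {
                        u64 stripe_nr = (physical - stripe_phys) / stripe_len;
                        u64 bytenr = chunk_start + stripe_nr * io_stripe_size;

                        /* always >= chunk_start */
                        printf("bytenr %llu (chunk_start %llu)\n",
                               bytenr, chunk_start);
                }
                return 0;
        }
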
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: read stripe len directly in btrfs_rmap_block · 9e22b925
      Nikolay Borisov authored
      extent_map::orig_block_len contains the size of a physical stripe when
      it's used to describe block groups (calculated in read_one_chunk via
      calc_stripe_length or calculated in decide_stripe_size and then assigned
      to extent_map::orig_block_len in create_chunk). Exploit this fact to get
      the size directly rather than opencoding the calculations. No functional
      changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't balance btree inode pages from buffered write path · 6a3c7f5c
      Nikolay Borisov authored
      The call to btrfs_btree_balance_dirty has been there since the early
      days of BTRFS, when the btree was directly modified from the write path,
      hence dirtied btree inode pages. With the implementation of b888db2b
      ("Btrfs: Add delayed allocation to the extent based page tree code")
      13 years ago the btree is no longer modified from the write path, hence
      there is no point in calling this function. Just remove it.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 26 Jul, 2020 9 commits
  3. 25 Jul, 2020 6 commits