1. 18 Nov, 2019 40 commits
    • David Sterba's avatar
      btrfs: remove extent_map::bdev · a019e9e1
      David Sterba authored
      We can now remove the bdev from extent_map. Previous patches made sure
      that bio_set_dev is correctly in all places and that we don't need to
      grab it from latest_bdev or pass it around inside the extent map.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a019e9e1
    • David Sterba's avatar
      btrfs: drop bio_set_dev where not needed · 1a418027
      David Sterba authored
      bio_set_dev sets a bdev to a bio and is not only setting a pointer bug
      also changing some state bits if there was a different bdev set before.
      This is one thing that's not needed.
      
      Another thing is that setting a bdev at bio allocation time is too early
      and actually does not work with plain redundancy profiles, where each
      time we submit a bio to a device, the bdev is set correctly.
      
      In many places the bio bdev is set to latest_bdev that seems to serve as
      a stub pointer "just to put something to bio". But we don't have to do
      that.
      
      Where do we know which bdev to set:
      
      * for regular IO: submit_stripe_bio that's called by btrfs_map_bio
      
      * repair IO: repair_io_failure, read or write from specific device
      
      * super block write (using buffer_heads but uses raw bdev) and barriers
      
      * scrub: this does not use all regular IO paths as it needs to reach all
        copies, verify and fixup eventually, and for that all bdev management
        is independent
      
      * raid56: rbio_add_io_page, for the RMW write
      
      * integrity-checker: does it's own low-level block tracking
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1a418027
    • David Sterba's avatar
      btrfs: get bdev directly from fs_devices in submit_extent_page · 429aebc0
      David Sterba authored
      This is preparatory patch to remove @bdev parameter from
      submit_extent_page. It can't be removed completely, because the cgroups
      need it for wbc when initializing the bio
      
      wbc_init_bio
        bio_associate_blkg_from_css
          dereference bdev->bi_disk->queue
      
      The bdev pointer is the same as latest_bdev, thus no functional change.
      We can retrieve it from fs_devices that's reachable through several
      dereferences. The local variable shadows the parameter, but that's only
      temporary.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      429aebc0
    • Josef Bacik's avatar
      btrfs: record all roots for rename exchange on a subvol · 3e174099
      Josef Bacik authored
      Testing with the new fsstress support for subvolumes uncovered a pretty
      bad problem with rename exchange on subvolumes.  We're modifying two
      different subvolumes, but we only start the transaction on one of them,
      so the other one is not added to the dirty root list.  This is caught by
      btrfs_cow_block() with a warning because the root has not been updated,
      however if we do not modify this root again we'll end up pointing at an
      invalid root because the root item is never updated.
      
      Fix this by making sure we add the destination root to the trans list,
      the same as we do with normal renames.  This fixes the corruption.
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3e174099
    • Filipe Manana's avatar
      Btrfs: fix block group remaining RO forever after error during device replace · 042528f8
      Filipe Manana authored
      When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
      set the block group to RO mode and then wait for any ongoing writes into
      extents of the block group to complete. While doing that wait we overwrite
      the value of the variable 'ret' and can break out of the loop if an error
      happens without turning the block group back into RW mode. So what happens
      is the following:
      
      1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
         to RO mode (its ->ro field set to 1 or incremented to some value > 1);
      
      2) Then btrfs_wait_ordered_roots() returns a value > 0;
      
      3) Then if either joining or committing the transaction fails, we break
         out of the loop wihtout calling btrfs_dec_block_group_ro(), leaving
         the block group in RO mode forever.
      
      To fix this, just remove the code that waits for ongoing writes to extents
      of the block group, since it's not needed because in the initial setup
      phase of a device replace operation, before starting to find all chunks
      and their extents, we set the target device for replace while holding
      fs_info->dev_replace->rwsem, which ensures that after releasing that
      semaphore, any writes into the source device are made to the target device
      as well (__btrfs_map_block() guarantees that). So while at
      scrub_enumerate_chunks() we only need to worry about finding and copying
      extents (from the source device to the target device) that were written
      before we started the device replace operation.
      
      Fixes: f0e9b7d6 ("Btrfs: fix race setting block group readonly during device replace")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      042528f8
    • Qu Wenruo's avatar
      btrfs: scrub: Don't check free space before marking a block group RO · b12de528
      Qu Wenruo authored
      [BUG]
      When running btrfs/072 with only one online CPU, it has a pretty high
      chance to fail:
      
        btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
        - output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
            --- tests/btrfs/072.out     2019-10-22 15:18:14.008965340 +0800
            +++ /xfstests-dev/results//btrfs/072.out.bad      2019-11-14 15:56:45.877152240 +0800
            @@ -1,2 +1,3 @@
             QA output created by 072
             Silence is golden
            +Scrub find errors in "-m dup -d single" test
            ...
      
      And with the following call trace:
      
        BTRFS info (device dm-5): scrub: started on devid 1
        ------------[ cut here ]------------
        BTRFS: Transaction aborted (error -27)
        WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        CPU: 0 PID: 55087 Comm: btrfs Tainted: G        W  O      5.4.0-rc1-custom+ #13
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        Call Trace:
         __btrfs_end_transaction+0xdb/0x310 [btrfs]
         btrfs_end_transaction+0x10/0x20 [btrfs]
         btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
         scrub_enumerate_chunks+0x264/0x940 [btrfs]
         btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
         btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
         do_vfs_ioctl+0x636/0xaa0
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x43/0x50
         do_syscall_64+0x79/0xe0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        ---[ end trace 166c865cec7688e7 ]---
      
      [CAUSE]
      The error number -27 is -EFBIG, returned from the following call chain:
      btrfs_end_transaction()
      |- __btrfs_end_transaction()
         |- btrfs_create_pending_block_groups()
            |- btrfs_finish_chunk_alloc()
               |- btrfs_add_system_chunk()
      
      This happens because we have used up all space of
      btrfs_super_block::sys_chunk_array.
      
      The root cause is, we have the following bad loop of creating tons of
      system chunks:
      
      1. The only SYSTEM chunk is being scrubbed
         It's very common to have only one SYSTEM chunk.
      2. New SYSTEM bg will be allocated
         As btrfs_inc_block_group_ro() will check if we have enough space
         after marking current bg RO. If not, then allocate a new chunk.
      3. New SYSTEM bg is still empty, will be reclaimed
         During the reclaim, we will mark it RO again.
      4. That newly allocated empty SYSTEM bg get scrubbed
         We go back to step 2, as the bg is already mark RO but still not
         cleaned up yet.
      
      If the cleaner kthread doesn't get executed fast enough (e.g. only one
      CPU), then we will get more and more empty SYSTEM chunks, using up all
      the space of btrfs_super_block::sys_chunk_array.
      
      [FIX]
      Since scrub/dev-replace doesn't always need to allocate new extent,
      especially chunk tree extent, so we don't really need to do chunk
      pre-allocation.
      
      To break above spiral, here we introduce a new parameter to
      btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
      need extra chunk pre-allocation.
      
      For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
      @do_chunk_alloc=false.
      This should keep unnecessary empty chunks from popping up for scrub.
      
      Also, since there are two parameters for btrfs_inc_block_group_ro(),
      add more comment for it.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b12de528
    • Johannes Thumshirn's avatar
      btrfs: change btrfs_fs_devices::rotating to bool · 7f0432d0
      Johannes Thumshirn authored
      struct btrfs_fs_devices::rotating currently is declared as an integer
      variable but only used as a boolean.
      
      Change the variable definition to bool and update to code touching it to
      set 'true' and 'false'.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f0432d0
    • Johannes Thumshirn's avatar
      btrfs: change btrfs_fs_devices::seeding to bool · 0395d84f
      Johannes Thumshirn authored
      struct btrfs_fs_devices::seeding currently is declared as an integer
      variable but only used as a boolean.
      
      Change the variable definition to bool and update to code touching it to
      set 'true' and 'false'.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0395d84f
    • David Sterba's avatar
      btrfs: rename btrfs_block_group_cache · 32da5386
      David Sterba authored
      The type name is misleading, a single entry is named 'cache' while this
      normally means a collection of objects. Rename that everywhere. Also the
      identifier was quite long, making function prototypes harder to format.
      Suggested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      32da5386
    • Qu Wenruo's avatar
      btrfs: block-group: Reuse the item key from caller of read_one_block_group() · d49a2ddb
      Qu Wenruo authored
      For read_one_block_group(), its only caller has already got the item key
      to search next block group item.
      
      So we can use that key directly without doing our own convertion on
      stack.
      
      Also, since that key used in btrfs_read_block_groups() is vital for
      block group item search, add 'const' keyword for that parameter to
      prevent read_one_block_group() to modify it.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d49a2ddb
    • Qu Wenruo's avatar
      btrfs: block-group: Refactor btrfs_read_block_groups() · ffb9e0f0
      Qu Wenruo authored
      Refactor the work inside the loop of btrfs_read_block_groups() into one
      separate function, read_one_block_group().
      
      This allows read_one_block_group to be reused for later BG_TREE feature.
      
      The refactor does the following extra fix:
      - Use btrfs_fs_incompat() to replace open-coded feature check
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ffb9e0f0
    • David Sterba's avatar
      btrfs: document extent buffer locking · d4e253bb
      David Sterba authored
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d4e253bb
    • David Sterba's avatar
      btrfs: access eb::blocking_writers according to ACCESS_ONCE policies · a4477988
      David Sterba authored
      A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access
      once policies can be found here
      https://lwn.net/Articles/799218/#Access-Marking%20Policies .
      
      The locked and unlocked access to eb::blocking_writers should be
      annotated accordingly, following this:
      
      Writes:
      
      - locked write must use ONCE, may use plain read
      - unlocked write must use ONCE
      
      Reads:
      
      - unlocked read must use ONCE
      - locked read may use plain read iff not mixed with unlocked read
      - unlocked read then locked must use ONCE
      
      There's one difference on the assembly level, where
      btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached
      value and did not reevaluate it after taking the lock. This could have
      missed some opportunities to take the lock in case blocking writers
      changed between the calls, but the window is just a few instructions
      long. As this is in try-lock, the callers handle that.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a4477988
    • David Sterba's avatar
      btrfs: set blocking_writers directly, no increment or decrement · 40d38f53
      David Sterba authored
      The increment and decrement was inherited from previous version that
      used atomics, switched in commit 06297d8c ("btrfs: switch
      extent_buffer blocking_writers from atomic to int"). The only possible
      values are 0 and 1 so we can set them directly.
      
      The generated assembly (gcc 9.x) did the direct value assignment in
      btrfs_set_lock_blocking_write (asm diff after change in 06297d8c):
      
           5d:   test   %eax,%eax
           5f:   je     62 <btrfs_set_lock_blocking_write+0x22>
           61:   retq
      
        -  62:   lock incl 0x44(%rdi)
        -  66:   add    $0x50,%rdi
        -  6a:   jmpq   6f <btrfs_set_lock_blocking_write+0x2f>
      
        +  62:   movl   $0x1,0x44(%rdi)
        +  69:   add    $0x50,%rdi
        +  6d:   jmpq   72 <btrfs_set_lock_blocking_write+0x32>
      
      The part in btrfs_tree_unlock did a decrement because
      BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
      otherwise the output looks safe:
      
        - lock decl 0x44(%rdi)
      
        + sub    $0x1,%eax
        + mov    %eax,0x44(%rdi)
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      40d38f53
    • David Sterba's avatar
      btrfs: merge blocking_writers branches in btrfs_tree_read_lock · f5c2a525
      David Sterba authored
      There are two ifs that use eb::blocking_writers. As this is a variable
      modified inside and outside of locks, we could minimize number of
      accesses to avoid problems with getting different results at different
      times.
      
      The access here is locked so this can only race with btrfs_tree_unlock
      that sets blocking_writers to 0 without lock and unsets the lock owner.
      
      The first branch is taken only if the same thread already holds the
      lock, the second if checks for blocking writers. Here we'd either unlock
      and wait, or proceed. Both are valid states of the locking protocol.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f5c2a525
    • David Sterba's avatar
      btrfs: drop incompat bit for raid1c34 after last block group is gone · 9c907446
      David Sterba authored
      When there are no raid1c3 or raid1c4 block groups left after balance
      (either convert or with other filters applied), remove the incompat bit.
      This is already done for RAID56, do the same for RAID1C34.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c907446
    • David Sterba's avatar
      btrfs: add incompat for raid1 with 3, 4 copies · cfbb825c
      David Sterba authored
      The new raid1c3 and raid1c4 profiles are backward incompatible and the
      name shall be 'raid1c34', the status can be found in the global
      supported features in /sys/fs/btrfs/features or in the per-filesystem
      directory.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cfbb825c
    • David Sterba's avatar
      btrfs: add support for 4-copy replication (raid1c4) · 8d6fac00
      David Sterba authored
      Add new block group profile to store 4 copies in a simliar way that
      current RAID1 does.  The profile attributes and constraints are defined
      in the raid table and used by the same code that already handles the 2-
      and 3-copy RAID1.
      
      The minimum number of devices is 4, the maximum number of devices/chunks
      that can be lost/damaged is 3. There is no comparable traditional RAID
      level, the profile is added for future needs to accompany triple-parity
      and beyond.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8d6fac00
    • David Sterba's avatar
      btrfs: add support for 3-copy replication (raid1c3) · 47e6f742
      David Sterba authored
      Add new block group profile to store 3 copies in a simliar way that
      current RAID1 does. The profile attributes and constraints are defined
      in the raid table and used by the same code that already handles the
      2-copy RAID1.
      
      The minimum number of devices is 3, the maximum number of devices/chunks
      that can be lost/damaged is 2. Like RAID6 but with 33% space
      utilization.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      47e6f742
    • David Sterba's avatar
      btrfs: sink write flags to cow_file_range_async · fac07d2b
      David Sterba authored
      In commit "Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios",
      cow_file_range_async gained wbc as a parameter and this makes passing
      write flags redundant. Set it inside the function and remove the
      parameter.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fac07d2b
    • David Sterba's avatar
      btrfs: sink write_flags to __extent_writepage_io · 57e5ffeb
      David Sterba authored
      __extent_writepage reads write flags from wbc and passes both to
      __extent_writepage_io. This makes write_flags redundant and we can
      remove it.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      57e5ffeb
    • Filipe Manana's avatar
      Btrfs: send, skip backreference walking for extents with many references · fd0ddbe2
      Filipe Manana authored
      Backreference walking, which is used by send to figure if it can issue
      clone operations instead of write operations, can be very slow and use
      too much memory when extents have many references. This change simply
      skips backreference walking when an extent has more than 64 references,
      in which case we fallback to a write operation instead of a clone
      operation. This limit is conservative and in practice I observed no
      signicant slowdown with up to 100 references and still low memory usage
      up to that limit.
      
      This is a temporary workaround until there are speedups in the backref
      walking code, and as such it does not attempt to add extra interfaces or
      knobs to tweak the threshold.
      Reported-by: default avatarAtemu <atemu.main@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fd0ddbe2
    • Filipe Manana's avatar
      Btrfs: send, allow clone operations within the same file · 11f2069c
      Filipe Manana authored
      For send we currently skip clone operations when the source and
      destination files are the same. This is so because clone didn't support
      this case in its early days, but support for it was added back in May
      2013 by commit a96fbc72 ("Btrfs: allow file data clone within a
      file"). This change adds support for it.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdd
        $ mount /dev/sdd /mnt/sdd
      
        $ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar
        $ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap
      
        $ mkfs.btrfs -f /dev/sde
        $ mount /dev/sde /mnt/sde
      
        $ btrfs send /mnt/sdd/snap | btrfs receive /mnt/sde
      
      Without this change file foobar at the destination has a single 128Kb
      extent:
      
        $ filefrag -v /mnt/sde/snap/foobar
        Filesystem type is: 9123683e
        File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
         ext:     logical_offset:        physical_offset: length:   expected: flags:
           0:        0..      31:          0..        31:     32:             last,unknown_loc,delalloc,eof
        /mnt/sde/snap/foobar: 1 extent found
      
      With this we get a single 64Kb extent that is shared at file offsets 0
      and 64K, just like in the source filesystem:
      
        $ filefrag -v /mnt/sde/snap/foobar
        Filesystem type is: 9123683e
        File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
         ext:     logical_offset:        physical_offset: length:   expected: flags:
           0:        0..      15:       3328..      3343:     16:             shared
           1:       16..      31:       3328..      3343:     16:       3344: last,shared,eof
        /mnt/sde/snap/foobar: 2 extents found
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      11f2069c
    • Qu Wenruo's avatar
      btrfs: Ensure we trim ranges across block group boundary · 6b7faadd
      Qu Wenruo authored
      [BUG]
      When deleting large files (which cross block group boundary) with
      discard mount option, we find some btrfs_discard_extent() calls only
      trimmed part of its space, not the whole range:
      
        btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
      
      type:		bbio->map_type, in above case, it's SINGLE DATA.
      start:		Logical address of this trim
      len:		Logical length of this trim
      trimmed:	Physically trimmed bytes
      ratio:		trimmed / len
      
      Thus leaving some unused space not discarded.
      
      [CAUSE]
      When discard mount option is specified, after a transaction is fully
      committed (super block written to disk), we begin to cleanup pinned
      extents in the following call chain:
      
      btrfs_commit_transaction()
      |- btrfs_finish_extent_commit()
         |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
         |- btrfs_discard_extent()
      
      However, pinned extents are recorded in an extent_io_tree, which can
      merge adjacent extent states.
      
      When a large file gets deleted and it has adjacent file extents across
      block group boundary, we will get a large merged range like this:
      
            |<---    BG1    --->|<---      BG2     --->|
            |//////|<--   Range to discard   --->|/////|
      
      To discard that range, we have the following calls:
      
        btrfs_discard_extent()
        |- btrfs_map_block()
        |  Returned bbio will end at BG1's end. As btrfs_map_block()
        |  never returns result across block group boundary.
        |- btrfs_issuse_discard()
           Issue discard for each stripe.
      
      So we will only discard the range in BG1, not the remaining part in BG2.
      
      Furthermore, this bug is not that reliably observed, for above case, if
      there is no other extent in BG2, BG2 will be empty and btrfs will trim
      all space of BG2, covering up the bug.
      
      [FIX]
      - Allow __btrfs_map_block_for_discard() to modify @length parameter
        btrfs_map_block() uses its @length paramter to notify the caller how
        many bytes are mapped in current call.
        With __btrfs_map_block_for_discard() also modifing the @length,
        btrfs_discard_extent() now understands when to do extra trim.
      
      - Call btrfs_map_block() in a loop until we hit the range end Since we
        now know how many bytes are mapped each time, we can iterate through
        each block group boundary and issue correct trim for each range.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b7faadd
    • Qu Wenruo's avatar
      btrfs: volumes: Use more straightforward way to calculate map length · 2d974619
      Qu Wenruo authored
      The old code goes:
      
       	offset = logical - em->start;
      	length = min_t(u64, em->len - offset, length);
      
      Where @length calculation is dependent on offset, it can take reader
      several more seconds to find it's just the same code as:
      
       	offset = logical - em->start;
      	length = min_t(u64, em->start + em->len - logical, length);
      
      Use above code to make the length calculate independent from other
      variable, thus slightly increase the readability.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2d974619
    • Qu Wenruo's avatar
      btrfs: tree-checker: Check item size before reading file extent type · 153a6d29
      Qu Wenruo authored
      In check_extent_data_item(), we read file extent type without verifying
      if the item size is valid.
      
      Add such check to ensure the file extent type we read is correct.
      
      The check is not as accurate as we need to cover both inline and regular
      extents, so it only checks if the item size is larger or equal to inline
      header.
      So the existing size checks on inline/regular extents are still needed.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      153a6d29
    • Dan Carpenter's avatar
      btrfs: clean up locking name in scrub_enumerate_chunks() · 3ec17a67
      Dan Carpenter authored
      The "&fs_info->dev_replace.rwsem" and "&dev_replace->rwsem" refer to
      the same lock but Smatch is not clever enough to figure that out so it
      leads to static checker warnings.  It's better to use it consistently
      anyway.
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3ec17a67
    • Nikolay Borisov's avatar
      btrfs: Streamline btrfs_fs_info::backup_root_index semantics · 6ef108dd
      Nikolay Borisov authored
      The backup_root_index member stores the index at which the backup root
      should be saved upon next transaction commit. However, there is a
      small deviation from this behavior in the form of a check in
      backup_super_roots which checks if current root generation equals to the
      generation of the previous root. This can trigger in the following
      scenario:
      
      slot0: gen-2
      slot1: gen-1
      slot2: gen
      slot3: unused
      
      Now suppose slot3 (which is also the root specified in the super block)
      is corrupted hence init_tree_roots chooses to use the backup root at
      slot2, meaning read_backup_root will read slot2 and assign the
      superblock generation to gen-1. Despite this backup_root_index will
      point at slot3 because its init happens in init_backup_root_slot, long
      before any parsing of the backup roots occur. Then on next transaction
      start, gen-1 will be incremented by 1 making the root's generation
      equal gen. Subsequently, on transaction commit the following check
      triggers:
      
        if (btrfs_backup_tree_root_gen(root_backup) ==
                 btrfs_header_generation(info->tree_root->node))
      
      This causes the 'next_backup', which is the index at which the backup is
      going to be written to, to set to last_backup, which will be slot2.
      
      All of this is a very confusing way of expressing the following
      invariant:
      
       Always write a backup root at the index following the last used backup
       root.
      
      This commit streamlines this logic by setting backup_root_index to the
      next index after the one used for mount.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6ef108dd
    • Nikolay Borisov's avatar
      btrfs: Rename find_oldest_super_backup to init_backup_root_slot · 4ac039ad
      Nikolay Borisov authored
      The old name name was an awful misnomer because it didn't really find
      the oldest super backup per-se but rather its slot. For example if we
      have:
      
      slot0: gen - 2
      slot1: gen - 1
      slot2: gen
      slot3: empty
      
      init_backup_root_slot will return slot3 and not slot0.
      
      The new name is more appropriate since the function doesn't care whether
      there is a valid backup in the returned slot or not.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4ac039ad
    • Nikolay Borisov's avatar
      btrfs: Remove unused next_root_backup function · 260eb11b
      Nikolay Borisov authored
      This function has been superseded by previous commits and is no longer
      used so just remove it.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      260eb11b
    • Nikolay Borisov's avatar
      btrfs: Don't use objectid_mutex during mount · 336a0d8d
      Nikolay Borisov authored
      Since the filesystem is not well formed and no trees are loaded it's
      pointless holding the objectid_mutex. Just remove its usage.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      336a0d8d
    • Nikolay Borisov's avatar
      btrfs: Factor out tree roots initialization during mount · b8522a1e
      Nikolay Borisov authored
      The code responsible for reading and initializing tree roots is
      scattered in open_ctree among 2 labels, emulating a loop. This is rather
      confusing to reason about. Instead, factor the code to a new function,
      init_tree_roots which implements the same logical flow.
      
      There are a couple of notable differences, namely:
      
      * Instead of using next_backup_root it's using the newly introduced
        read_backup_root.
      
      * If read_backup_root returns an error init_tree_roots propagates the
        error and there is no special handling of that case e.g. the code jumps
        straight to 'fail_tree_roots' label. The old code, however, was
        (erroneously) jumping to 'fail_block_groups' label if next_backup_root
        did fail, this was unnecessary since the tree roots init logic doesn't
        modify the state of block groups.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b8522a1e
    • Nikolay Borisov's avatar
      btrfs: Add read_backup_root · bd2336b2
      Nikolay Borisov authored
      This function will replace next_root_backup with a much saner/cleaner
      interface.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bd2336b2
    • Nikolay Borisov's avatar
      btrfs: Remove newest_gen argument from find_oldest_super_backup · fc2e4c5b
      Nikolay Borisov authored
      It's no longer needed following cleanups around find_newest_backup_root
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fc2e4c5b
    • Nikolay Borisov's avatar
      btrfs: Cleanup and simplify find_newest_super_backup · 01f0f9da
      Nikolay Borisov authored
      Backup roots are always written in a circular manner. By definition we
      can only ever have 1 backup root whose generation equals to that of the
      superblock. Hence, the 'if' in the for loop will trigger at most once.
      This is sufficient to return the newest backup root.
      
      Furthermore the newest_gen parameter is always set to the generation of
      the superblock. This value can be obtained from the fs_info.
      
      This patch removes the unnecessary code dealing with the wraparound
      case and makes 'newest_gen' a local variable.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      01f0f9da
    • Filipe Manana's avatar
      Btrfs: remove unnecessary delalloc mutex for inodes · 16ad3be1
      Filipe Manana authored
      The inode delalloc mutex was added a long time ago by commit f248679e
      ("Btrfs: add a delalloc mutex to inodes for delalloc reservations"), and
      the reason for its introduction is not very clear from the change log. It
      claims it solves bogus warnings from lockdep, however it lacks an example
      report/warning from lockdep, or any explanation.
      
      Since we have enough concurrentcy protection from the locks of the space
      info and block reserve objects, and such lockdep warnings don't seem to
      exist anymore (at least on a 5.3 kernel I couldn't get them with fstests,
      ltp, fs_mark, etc), remove it, simplifying things a bit and decreasing
      the size of the btrfs_inode structure. With some quick fio tests doing
      direct IO and mmap writes I couldn't observe any significant performance
      increase either (direct IO writes that don't increase the file's size
      don't hold the inode's lock for their entire duration and mmap writes
      don't hold the inode's lock at all), which are the only type of writes
      that could see any performance gain due to less serialization.
      
      Review feedback from Josef:
      
      The problem was taking the i_mutex in mmap, which is how I was
      protecting delalloc reservations originally.  The delalloc mutex didn't
      come with all of the other dependencies.  That's what the lockdep
      messages were about, removing the lock isn't going to make them appear
      again.
      
      We _had_ to lock around this because we used to do tricks to keep from
      over-reserving, and if we didn't serialize delalloc reservations we'd
      end up with ugly accounting problems when we tried to clean things up.
      
      However with my recentish changes this isn't the case anymore.  Every
      operation is responsible for reserving its space, and then adding it to
      the inode.  Then cleaning up is straightforward and can't be mucked up
      by other users.  So we no longer need the delalloc mutex to safe us from
      ourselves.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      16ad3be1
    • Filipe Manana's avatar
      Btrfs: remove wait queue from space_info structure · bf2df5ae
      Filipe Manana authored
      It is not used anymore since commit 957780eb ("Btrfs: introduce
      ticketed enospc infrastructure"), so just remove it.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bf2df5ae
    • Johannes Thumshirn's avatar
      btrfs: remove cached space_info in btrfs_statfs() · f5389f33
      Johannes Thumshirn authored
      In btrfs_statfs() we cache fs_info::space_info in a local variable only
      to use it once in a list_for_each_rcu() statement.
      
      Not only is the local variable unnecessary it even makes the code harder
      to follow as it's not clear which list it is iterating.
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f5389f33
    • David Sterba's avatar
      btrfs: add dedicated members for start and length of a block group · b3470b5d
      David Sterba authored
      The on-disk format of block group item makes use of the key that stores
      the offset and length. This is further used in the code, although this
      makes thing harder to understand. The key is also packed so the
      offset/length is not properly aligned as u64.
      
      Add start (key.objectid) and length (key.offset) members to block group
      and remove the embedded key.  When the item is searched or written, a
      local variable for key is used.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b3470b5d
    • David Sterba's avatar
      btrfs: rename extent buffer block group item accessors · 0222dfdd
      David Sterba authored
      Accessors defined by BTRFS_SETGET_FUNCS take a raw extent buffer and
      manipulate the items there, there's no special prefix required. The
      block group accessors had _disk_ because previously the names were
      occupied by the on-stack accessors. As this has been addressed in the
      previous patch, we can now unify the naming.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0222dfdd