17 Dec, 2018 (40 commits)
    • btrfs: harden against duplicate fsid on scanned devices · a9261d41
      Anand Jain authored
      It's not that hard to imagine that a device or a btrfs image is
      copied just by using the dd or the cp command, in which case both
      copies of the btrfs filesystem will have the same fsid. On a system
      with automount enabled, the copied FS gets scanned.
      
      We have a known bug in btrfs where we let the device path be changed
      after the device has been mounted. Using this loophole, the newly
      copied device appears as if it were mounted immediately after it's
      been copied.
      
      For example:
      
      Initially, /dev/mmcblk0p4 is mounted as /:
      
        $ lsblk
        NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
        mmcblk0     179:0    0 29.2G  0 disk
        |-mmcblk0p4 179:4    0    4G  0 part /
        |-mmcblk0p2 179:2    0  500M  0 part /boot
        |-mmcblk0p3 179:3    0  256M  0 part [SWAP]
        `-mmcblk0p1 179:1    0  256M  0 part /boot/efi
      
        $ btrfs fi show
           Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
           Total devices 1 FS bytes used 1.40GiB
           devid    1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4
      
      Copy mmcblk0 to sda
      
        $ dd if=/dev/mmcblk0 of=/dev/sda
      
      Immediately after the copy completes, the change in the device
      superblock is noticed and automount scans it using btrfs device scan,
      at which point the new device sda becomes the mounted root device.
      
        $ lsblk
        NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
        sda           8:0    1 14.9G  0 disk
        |-sda4        8:4    1    4G  0 part /
        |-sda2        8:2    1  500M  0 part
        |-sda3        8:3    1  256M  0 part
        `-sda1        8:1    1  256M  0 part
        mmcblk0     179:0    0 29.2G  0 disk
        |-mmcblk0p4 179:4    0    4G  0 part
        |-mmcblk0p2 179:2    0  500M  0 part /boot
        |-mmcblk0p3 179:3    0  256M  0 part [SWAP]
        `-mmcblk0p1 179:1    0  256M  0 part /boot/efi
      
        $ btrfs fi show /
          Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
          Total devices 1 FS bytes used 1.40GiB
          devid    1 size 4.00GiB used 3.00GiB path /dev/sda4
      
      The bug is quite nasty in that you can't unmount either /dev/sda4 or
      /dev/mmcblk0p4. The problem does not get solved until you take sda
      out of the system on to another system and change its fsid using the
      'btrfstune -u' command.
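
      The fix is to reject the scan in this situation rather than silently
      updating the stored path. A minimal userspace sketch of the idea (the
      struct and helper below are illustrative stand-ins, not the kernel's
      types):

        #include <stdbool.h>
        #include <string.h>

        /* Stand-in for the kernel's per-device record. */
        struct scanned_device {
                unsigned long long devid;
                const char *path;
                bool mounted;   /* device belongs to an opened filesystem */
        };

        /* A scan that matches an existing fsid+devid but reports a new
         * path is refused while the original device is mounted, closing
         * the "silent path change" loophole described above. */
        static int check_duplicate_scan(const struct scanned_device *known,
                                        const char *new_path)
        {
                if (known->mounted && strcmp(known->path, new_path) != 0)
                        return -1;      /* the kernel returns an error here */
                return 0;               /* safe to update the stored path */
        }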
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce nparity raid_attr · b50836ed
      Hans van Kranenburg authored
      Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an
      nparity field in raid_attr.
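
      For illustration, a trimmed-down sketch of such a table (modeled on
      the btrfs raid_attr table but simplified here; the ncopies values
      shown are those after the fix in the following commit):

        struct raid_attr {
                int ncopies;    /* how many copies of the data are stored */
                int nparity;    /* stripes reserved for parity (RAID5/6) */
        };

        static const struct raid_attr raid_table[] = {
                /* RAID1 */ { .ncopies = 2, .nparity = 0 },
                /* DUP   */ { .ncopies = 2, .nparity = 0 },
                /* RAID5 */ { .ncopies = 1, .nparity = 1 },
                /* RAID6 */ { .ncopies = 1, .nparity = 2 },
        };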
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix ncopies raid_attr for RAID56 · da612e31
      Hans van Kranenburg authored
      The RAID5 and RAID6 profiles store one copy of the data, not 2 or 3.
      These values are not yet used anywhere, so there's no functional change.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: alloc_chunk: fix more DUP stripe size handling · baf92114
      Hans van Kranenburg authored
      Commit 92e222df ("btrfs: alloc_chunk: fix DUP stripe size handling")
      fixed the calculation of stripe_size for a new DUP chunk.
      
      However, the same calculation reappears a bit later, and that instance
      was not changed yet. The bug this exposes is that the newly allocated
      device extents ('stripes') can have a few MiB of overlap with the next
      thing stored after them, which is another device extent or the end of
      the disk.
      
      The scenario in which this can happen is:
      * The block device for the filesystem is less than 10GiB in size.
      * The amount of contiguous free unallocated disk space chosen to use for
        chunk allocation is 20% of the total device size, or a few MiB more or
        less.
      
      An example:
      - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
      - There's 1578MiB unallocated raw disk space left in one contiguous
        piece.
      
      In this case stripe_size is first calculated as 789MiB (half of
      1578MiB).
      
      Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
      enter the if block. There the stripe_size value is immediately
      overwritten while calculating an adjusted value based on max_chunk_size,
      which ends up as 788MiB.
      
      Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
      actually more than the value we had before. However, the last comparison
      fails to detect this, because it's comparing the value with the total
      amount of free space, which is about twice the size of stripe_size.
      
      In the example above, this means that the resulting raw disk space being
      allocated is 1600MiB, while only a gap of 1578MiB has been found. The
      second device extent object for this DUP chunk will overlap for 22MiB
      with whatever comes next.
      
      The underlying problem here is that stripe_size is reused all the
      time for different things. When entering the code in the if block,
      stripe_size is immediately overwritten with something else, and if we
      later decide we want the previous value back, the logic to compute it
      has to be copy-pasted in again.
      
      With this change, the value in stripe_size is not unnecessarily
      destroyed, so the duplicated calculation is not needed any more.
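
      The buggy sequence can be reproduced with plain integer arithmetic; a
      standalone sketch mirroring the steps above (values in MiB, with
      data_stripes = 1 and dev_stripes = 2 as for DUP):

        #include <stdio.h>

        int main(void)
        {
                unsigned long long max_chunk_size = 788; /* 10% of 7880 */
                unsigned long long max_avail = 1578;     /* free space gap */
                unsigned long long dev_stripes = 2, data_stripes = 1;

                unsigned long long stripe_size = max_avail / dev_stripes; /* 789 */

                if (stripe_size * data_stripes > max_chunk_size)
                        stripe_size = max_chunk_size / data_stripes; /* 788 */

                /* bump up to a 16MiB boundary: 800, more than before */
                stripe_size = (stripe_size + 15) / 16 * 16;

                /* the final clamp compares against max_avail (1578), about
                 * twice the per-extent budget, so 800 passes unchecked */
                if (stripe_size > max_avail)
                        stripe_size = max_avail;

                printf("raw space used: %llu MiB, gap found: %llu MiB\n",
                       stripe_size * dev_stripes, max_avail); /* 1600 vs 1578 */
                return 0;
        }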
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: alloc_chunk: improve chunk size variable name · 23f0ff1e
      Hans van Kranenburg authored
      The variable num_bytes is far too generic a name for a variable in
      this function; there are a dozen other variables that hold a number
      of bytes as their value.
      
      Give it a name that actually describes what it holds: the size of
      the chunk that we're allocating.
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: alloc_chunk: do not refurbish num_bytes · 2f29df4f
      Hans van Kranenburg authored
      The variable num_bytes is used to store the length of the chunk
      that we're allocating. Do not reuse it for something entirely
      different in the same function.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use tagged writepage to mitigate livelock of snapshot · 3cd24c69
      Ethan Lien authored
      Snapshot creation is expected to be fast. But if there are writers
      steadily creating dirty pages in our subvolume, the snapshot may take
      a very long time to complete. To fix the problem, use tagged
      writepage for the snapshot flusher, as the generic
      write_cache_pages() does, so that we can skip pages dirtied after
      the snapshot command.
      
      This does not change the semantics of which data get into the
      snapshot if there are pages being dirtied during the snapshotting
      operation. A sync is called before the snapshot is taken in both the
      old and the new code; any IO in flight just after that may end up in
      the snapshot, but this depends on other system effects that might
      still sync the IO.
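
      The mechanism is the same one used by the generic write_cache_pages();
      in outline (a simplified fragment, with the btrfs-specific wiring
      omitted):

        /* Pages that are dirty right now get re-tagged TOWRITE; the
         * writeback walk then only visits that tag, so pages dirtied
         * while the flush runs cannot livelock the snapshot flusher. */
        if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
                tag = PAGECACHE_TAG_TOWRITE;    /* snapshot flusher now here */
        else
                tag = PAGECACHE_TAG_DIRTY;      /* plain background writeback */

        if (tag == PAGECACHE_TAG_TOWRITE)
                tag_pages_for_writeback(mapping, index, end);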
      
      We do a simple snapshot speed test on an Intel D-1531 box:
      
      fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
      --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
      --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
      time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
      
      original: 1m58sec
      patched:  6.54sec
      
      This is the best case for this patch, since for a sequential write
      case we skip nearly all pages dirtied after the snapshot command.
      
      For a multi-writer, random write test:
      
      fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
      --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
      --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
      time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
      
      original: 15.83sec
      patched:  10.35sec
      
      The improvement is smaller compared to the sequential write case,
      since we skip only half of the pages dirtied after the snapshot command.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Ethan Lien <ethanlien@synology.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove unused extent_state argument from btrfs_writepage_endio_finish_ordered · c629732d
      Nikolay Borisov authored
      This parameter was never used, yet was part of the interface of the
      function ever since its introduction as
      extent_io_ops::writepage_end_io_hook in e6dcd2dc ("Btrfs: New
      data=ordered implementation"). Now that NULL is passed everywhere as
      a value for this parameter, let's remove it for good. No functional
      changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove extent_page_data argument from writepage_delalloc · 8cc0237a
      Nikolay Borisov authored
      The only remaining use of the 'epd' argument in writepage_delalloc is
      to reference the extent_io_tree which was set in extent_writepages.
      Since page->mapping of any page passed to writepage_delalloc (with
      __extent_writepage as the sole caller) is guaranteed to equal the
      mapping passed to extent_writepages, we can get the io_tree directly
      via the inode that is already passed in (which is also taken from
      page->mapping->host). No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Move epd::extent_locked check to writepage_delalloc's caller · 7789a55a
      Nikolay Borisov authored
      If epd::extent_locked is set then writepage_delalloc terminates. Make
      this a bit more apparent in the caller by simply bubbling the check up.
      This enables removing epd as an argument to writepage_delalloc in a
      future patch. No functional change.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Check for missing device before bio submission in btrfs_map_bio · fc8a168a
      Nikolay Borisov authored
      Before btrfs_map_bio submits all stripe bios, it does a number of
      checks to ensure the device for every stripe is present. However, it
      doesn't do a DEV_STATE_MISSING check; instead this is relegated to the
      lower level btrfs_schedule_bio (in the async submission case; sync
      submission doesn't check DEV_STATE_MISSING at all). Additionally,
      btrfs_schedule_bio duplicates the device->bdev check which has
      already been performed in btrfs_map_bio.
      
      This patch moves the DEV_STATE_MISSING check into btrfs_map_bio and
      removes the duplicate device->bdev check. Doing so ensures that no bio
      cloning/submission happens for either async or sync requests in the
      face of a missing device. This also makes the async io submission path
      slightly shorter in terms of instruction count. No functional changes.
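
      In outline, the consolidated check looks roughly like this (a
      simplified fragment of the stripe submission loop, not the exact
      code):

        for (dev_nr = 0; dev_nr < total_devs; dev_nr++) {
                dev = bbio->stripes[dev_nr].dev;
                /* absent, closed or missing device: complete the bio
                 * with an error and clone/submit nothing for it */
                if (!dev || !dev->bdev ||
                    test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state)) {
                        bbio_error(bbio, first_bio, logical);
                        continue;
                }
                /* ... clone the bio and submit it, directly or via
                 * btrfs_schedule_bio for async submission ... */
        }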
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant replace_state init · ab457246
      Anand Jain authored
      dev_replace::replace_state has already been set to
      BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED (0) earlier in the same
      function, so delete the line which sets replace_state = 0 again.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: remove no longer used io_err from btrfs_log_ctx · 6d4cbf79
      Filipe Manana authored
      The io_err field of struct btrfs_log_ctx is no longer used after the
      recent simplification of the fast fsync path, where we now wait for
      ordered extents to complete before logging the inode. We did this in
      commit b5e6c3e1 ("btrfs: always wait on ordered extents at fsync
      time") and commit a2120a47 ("btrfs: clean up the left over
      logged_list usage") removed its last use.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: simpler and more efficient cleanup of a log tree's extent io tree · 59b0713a
      Filipe Manana authored
      We currently loop finding each range (corresponding to a btree
      node/leaf) in a log root's extent io tree and then clean it up. This is
      a waste of time since we are traversing the extent io tree's rb_tree
      more times than needed (once for a range lookup and again for cleaning
      it up) without any good reason.
      
      We free the log trees when we are in the critical section of a
      transaction commit (the transaction state is set to
      TRANS_STATE_COMMIT_DOING), so it's important to do everything as fast
      as possible in order to reduce the time we block other tasks from
      starting a new transaction.
      
      So fix this by traversing the extent io tree once, cleaning up all its
      records in one go.
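
      Sketched as a before/after (paraphrased, not the exact diff):

        /* before: one lookup per range, then a second traversal of the
         * rb_tree to clear that same range */
        while (!find_first_extent_bit(&log->dirty_log_pages, 0, &start,
                                      &end, EXTENT_DIRTY | EXTENT_NEW,
                                      NULL))
                clear_extent_bits(&log->dirty_log_pages, start, end,
                                  EXTENT_DIRTY | EXTENT_NEW);

        /* after: clear all records in one traversal of the full range */
        clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
                          EXTENT_DIRTY | EXTENT_NEW);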
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Adjust loop in free_extent_buffer · 46cc775e
      Nikolay Borisov authored
      The loop construct in free_extent_buffer was added in
      242e18c7 ("Btrfs: reduce lock contention on extent buffer locks")
      as a means of reducing the number of times the eb lock is taken while
      a non-last ref count is decremented and the lock released. Now that
      the special handling of UNMAPPED extent buffers has been removed,
      there is only one decrement op left, which happens for the
      EXTENT_BUFFER_UNMAPPED case.
      
      This commit modifies the loop condition so that in the case of
      UNMAPPED buffers the eb's lock is taken only if we are 100% sure the
      eb is going to be freed by the current executor of the code.
      Additionally, remove superfluous ref count ops in the btrfs tests.
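
      The adjusted loop, roughly (simplified from free_extent_buffer):

        while (1) {
                refs = atomic_read(&eb->refs);
                /* for UNMAPPED buffers only refs == 1 falls through to
                 * the locked slow path, i.e. only when this caller is
                 * certainly the one freeing the eb */
                if ((!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) &&
                     refs <= 3) ||
                    (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) &&
                     refs == 1))
                        break;
                /* otherwise decrement locklessly and be done */
                if (atomic_cmpxchg(&eb->refs, refs, refs - 1) == refs)
                        return;
        }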
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove special handling of EXTENT_BUFFER_UNMAPPED while freeing · 9cfc8ba7
      Nikolay Borisov authored
      Now that the whole of the btrfs code has been audited for eb reference
      count management, it's time to remove the hunk in free_extent_buffer
      that essentially considered the condition
      
        "eb->ref == 2 && EXTENT_BUFFER_DUMMY"
      
      to equal "eb->ref == 1". Also remove the last location
      which takes an extra reference count in alloc_test_extent_buffer.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove unnecessary tree locking code in qgroup_rescan_leaf · df449714
      Nikolay Borisov authored
      In qgroup_rescan_leaf a copy is made of the target leaf by calling
      btrfs_clone_extent_buffer. The latter allocates a new buffer, attaches
      a new set of pages to it and copies the content of the source buffer.
      The new scratch buffer is only used to iterate its items; it's not
      published anywhere and cannot be accessed by a third party.
      
      Hence, it's not necessary to perform any locking on it whatsoever.
      Furthermore, remove the extra extent_buffer_get call since the new
      buffer is always allocated with a reference count of 1, which is
      sufficient here. No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove extra reference count bumps in btrfs_compare_trees · 8c7eeb65
      Nikolay Borisov authored
      When the two comparison tree roots are initialised, they are private
      to the function and already have reference counts of 1 each. There is
      no need to further increment the reference count since the cloned
      buffers are already accessed via struct btrfs_path. Eventually the two
      paths used for comparison are going to be released, effectively
      disposing of the cloned buffers.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove extraneous extent_buffer_get from tree_mod_log_rewind · 24cee18a
      Nikolay Borisov authored
      When a rewound buffer is created it already has a ref count of 1 and
      the dummy flag set. Then another ref is taken, bumping the count to 2.
      Finally, when this buffer is released from btrfs_release_path, the
      extra reference is decremented by the special handling code in
      free_extent_buffer.
      
      However, this special code is in fact redundant, since a ref count of
      1 is still correct: the buffer is only accessed via the btrfs_path
      struct. This paves the way for removing the special handling in
      free_extent_buffer.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove redundant extent_buffer_get in get_old_root · 6c122e2a
      Nikolay Borisov authored
      get_old_root is used only by btrfs_search_old_slot to initialise the
      path structure. The old root is always a cloned buffer (either via
      alloc dummy or via btrfs_clone_extent_buffer) and its reference count
      is 2: 1 from allocation and 1 from the extent_buffer_get call in
      get_old_root.
      
      This latter explicit ref count acquisition is in fact unnecessary,
      since the semantics are such that the newly allocated buffer is handed
      over to the btrfs_path for lifetime management. Considering this, just
      remove the extra extent_buffer_get in get_old_root.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove needless tree locking in iterate_inode_extrefs · 5c623d33
      Nikolay Borisov authored
      In iterate_inode_extrefs the eb is cloned via btrfs_clone_extent_buffer,
      which creates a private extent buffer with the dummy flag set and a ref
      count of 1. Then this buffer is locked for reading and its ref count is
      incremented by 1. Finally it's fed to the passed iterate_irefs_t
      function. The actual iterate callback is inode_to_path (coming from
      paths_from_inode), which feeds the eb to btrfs_ref_to_path. In this
      final function the passed eb is only read, by first assigning it to the
      local eb variable. This variable is only modified in the case where
      another eb is referenced from the passed path, that is, when the
      eb != eb_in check triggers.
      
      Considering this, there is no point in locking the cloned eb in
      iterate_inode_extrefs since it's never modified and is not published
      anywhere. Furthermore, the cloned eb is completely fine having its ref
      count be 1.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove needless tree locking in iterate_inode_refs · e5bba0b0
      Nikolay Borisov authored
      In iterate_inode_refs the eb is cloned via btrfs_clone_extent_buffer,
      which creates a private extent buffer with the dummy flag set and a ref
      count of 1. Then this buffer is locked for reading and its ref count is
      incremented by 1. Finally it's fed to the passed iterate_irefs_t
      function. The actual iterate callback is inode_to_path (coming from
      paths_from_inode), which feeds the eb to btrfs_ref_to_path. In this
      final function the passed eb is only read, by first assigning it to the
      local eb variable. This variable is only modified in the case where
      another eb is referenced from the passed path, that is, when the
      eb != eb_in check triggers.
      
      Considering this, there is no point in locking the cloned eb in
      iterate_inode_refs since it's never modified and is not published
      anywhere. Furthermore, the cloned eb is completely fine having its ref
      count be 1.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tests: Use BTRFS_MAX_EXTENT_SIZE to replace the intermediate number · d9cb2459
      Qu Wenruo authored
      In the extent-io self test, we need 2 ordered extents at their maximum
      size to do the test.
      
      Instead of using intermediate numbers, use BTRFS_MAX_EXTENT_SIZE for
      @max_bytes, and twice @max_bytes for @total_dirty.  This should explain
      why we need all these magic numbers and prevent people from modifying
      them by accident.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: support swap files · ed46ff3d
      Omar Sandoval authored
      Btrfs has not allowed swap files since commit 35054394 ("Btrfs: stop
      providing a bmap operation to avoid swapfile corruptions"). However, now
      that the proper restrictions are in place, Btrfs can support swap files
      through the swap file a_ops, similar to iomap in commit 67482129
      ("iomap: add a swapfile activation function").
      
      For Btrfs, activation needs to make sure that the file can be used as a
      swap file, which currently means that it must be fully allocated as
      NOCOW with no compression on one device. It must also do the proper
      tracking so that ioctls will not interfere with the swap file.
      Deactivation clears this tracking.
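
      In outline, the wiring adds the two swap hooks to the file's
      address_space_operations, analogous to the iomap swapfile support
      referenced above (a fragment; the existing ops are elided):

        static const struct address_space_operations btrfs_aops = {
                /* ... existing ops ... */
                /* walks the extents; only succeeds for a fully
                 * allocated, uncompressed, NOCOW file on one device */
                .swap_activate   = btrfs_swap_activate,
                /* drops the "this file is a swap file" tracking */
                .swap_deactivate = btrfs_swap_deactivate,
        };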
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: rename and export get_chunk_map · 60ca842e
      Omar Sandoval authored
      The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
      make it non-static.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: prevent ioctls from interfering with a swap file · eede2bf3
      Omar Sandoval authored
      A later patch will implement swap file support for Btrfs, but before we
      do that, we need to make sure that the various Btrfs ioctls cannot
      change a swap file.
      
      When a swap file is active, we must make sure that the extents of the
      file are not moved and that they don't become shared. That means that
      the following are not safe:
      
      - chattr +c (enable compression)
      - reflink
      - dedupe
      - snapshot
      - defrag
      
      Don't allow those to happen on an active swap file.
      
      Additionally, balance, resize, device remove, and device replace are
      also unsafe if they affect an active swapfile. Add a red-black tree of
      block groups and devices which contain an active swapfile. Relocation
      checks each block group against this tree and skips it or errors out for
      balance or resize, respectively. Device remove and device replace check
      the tree for the device they will operate on.
      
      Note that we don't have to worry about chattr -C (disable nocow), which
      we ignore for non-empty files, because an active swapfile must be
      non-empty and can't be truncated. We also don't have to worry about
      autodefrag because it's only done on COW files. Truncate and fallocate
      are already taken care of by the generic code. Device add doesn't do
      relocation so it's not an issue, either.
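
      The per-file guards boil down to a pattern like the following (a
      simplified sketch with an illustrative helper name, not the exact
      code; IS_SWAPFILE() is the generic VFS test for an active swap file):

        /* called at the start of reflink, dedupe, defrag, chattr +c ... */
        static int reject_if_swapfile(struct inode *inode)
        {
                if (IS_SWAPFILE(inode))
                        return -ETXTBSY; /* extents must not move or
                                            become shared */
                return 0;
        }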
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove extent_io_ops::split_extent_hook callback · abbb55f4
      Nikolay Borisov authored
      This is the counterpart to merge_extent_hook; similarly, it's used only
      for data/freespace inodes, so let's remove it, rename it and call it
      directly where necessary. No functional changes.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove extent_io_ops::merge_extent_hook callback · 5c848198
      Nikolay Borisov authored
      This callback is used only for data and free space inodes. Such inodes
      are guaranteed to have their extent_io_tree::private_data set to the
      inode struct. Exploit this fact to directly call the function. Also give
      it a more descriptive name. No functional changes.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove extent_io_ops::clear_bit_hook callback · a36bb5f9
      Nikolay Borisov authored
      This is the counterpart to the former set_bit_hook (now
      btrfs_set_delalloc_extent); similar to what was done before, remove
      clear_bit_hook and rename the function. No functional changes.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove extent_io_ops::set_bit_hook extent_io callback · e06a1fc9
      Nikolay Borisov authored
      This callback is used to properly account delalloc extents for data
      inodes (ordinary file inodes and freespace v1 inodes). Those can be
      easily identified since they have their extent_io trees' ->private_data
      member point to the inode. Let's exploit this fact to remove the
      needless indirection through extent_io_ops and directly call the
      function. Also give the function a name which reflects its purpose:
      btrfs_set_delalloc_extent.
      
      This patch also modifies test_find_delalloc so that the extent_io_tree
      used for testing doesn't have its ->private_data set, which would have
      caused a crash in btrfs_set_delalloc_extent due to the btrfs_inode->root
      member not being initialised. The old version of the code also didn't
      call set_bit_hook since the extent_io ops weren't set for the inode. No
      functional changes.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove extent_io_ops::check_extent_io_range callback · 65a680f6
      Nikolay Borisov authored
      This callback was only used in debug builds by btrfs_leak_debug_check.
      A better approach is to move its implementation into
      btrfs_leak_debug_check and ensure the latter is only executed for
      extent trees which have ->private_data set, i.e. relate to a data
      inode and not the btree one. No functional changes.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove extent_io_ops::writepage_end_io_hook · 7087a9d8
      Nikolay Borisov authored
      This callback is only ever called for data page writeout, so there is
      no need to actually abstract it via extent_io_ops. Let's just export
      it, remove the definition of the callback and call it directly in the
      functions that invoke the callback. Also rename the function to
      btrfs_writepage_endio_finish_ordered, since what it really does is
      account finished io in the ordered extent data structures. No
      functional changes.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove extent_io_ops::writepage_start_hook · d75855b4
      Nikolay Borisov authored
      This hook is called only from __extent_writepage_io which is already
      called only from the data page writeout path. So there is no need to
      make an indirect call via extent_io_ops. This patch just removes the
      callback definition, exports the callback function and calls it directly
      at the only call site. Also give the function a more descriptive name.
      No functional changes.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove extent_io_ops::fill_delalloc · 5eaad97a
      Nikolay Borisov authored
      This callback is called only from writepage_delalloc, which in turn is
      guaranteed to be called from the data page writeout path. In the end
      there is no reason for the call to this function to be indirected via
      the extent_io_ops structure. This patch removes the callback definition,
      exports the function and calls it directly. No functional changes.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ rename to btrfs_run_delalloc_range ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Add function to distinguish between data and btree inode · 06f2548f
      Nikolay Borisov authored
      This will be used in future patches that remove the optional
      extent_io_ops callbacks.
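
      The helper boils down to a single comparison, roughly (every inode
      except the fs-wide btree inode is a data inode):

        static inline bool is_data_inode(struct inode *inode)
        {
                return btrfs_sb(inode->i_sb)->btree_inode != inode;
        }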
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: volumes: Make sure no dev extent is beyond device boundary · 05a37c48
      Qu Wenruo authored
      Add an extra check that a dev extent's end does not go beyond the
      device boundary.
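
      Roughly, as a simplified fragment of the verification:

        /* a dev extent must end within the device it lives on */
        if (physical_offset + physical_len > device->disk_total_bytes) {
                btrfs_err(fs_info,
                          "dev extent devid %llu physical offset %llu len %llu beyond device boundary %llu",
                          devid, physical_offset, physical_len,
                          device->disk_total_bytes);
                return -EUCLEAN;
        }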
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: volumes: Make sure there is no overlap of dev extents at mount time · 5eb19381
      Qu Wenruo authored
      Enhance btrfs_verify_dev_extents() to remember the previously checked
      dev extent, so it can verify that dev extents do not overlap.
      
      Analysis from Hans:
      
      "Imagine allocating a DATA|DUP chunk.
      
       In the chunk allocator, we first set...
         max_stripe_size = SZ_1G;
         max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
       ... which is 10GiB.
      
       Then...
         /* we don't want a chunk larger than 10% of writeable space */
         max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
             		 max_chunk_size);
      
       Imagine we only have one 7880MiB block device in this filesystem. Now
       max_chunk_size is down to 788MiB.
      
       The next step in the code is to search for max_stripe_size * dev_stripes
       amount of free space on the device, which is in our example 1GiB * 2 =
       2GiB. Imagine the device has exactly 1578MiB free in one contiguous
       piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail
      
       Next we recalculate the stripe_size (which is actually the device extent
       length), based on the actual maximum amount of available raw disk space:
         stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
      
       stripe_size is now 789MiB
      
       Next we do...
         data_stripes = num_stripes / ncopies
       ...where data_stripes ends up as 1, because num_stripes is 2 (the amount
       of device extents we're going to have), and DUP has ncopies 2.
      
       Next there's a check...
         if (stripe_size * data_stripes > max_chunk_size)
       ...which matches because 789MiB * 1 > 788MiB.
      
       We go into the if code, and next is...
         stripe_size = div_u64(max_chunk_size, data_stripes);
       ...which resets stripe_size to max_chunk_size: 788MiB
      
       Next is a fun one...
         /* bump the answer up to a 16MB boundary */
         stripe_size = round_up(stripe_size, SZ_16M);
       ...which changes stripe_size from 788MiB to 800MiB.
      
       We're not done changing stripe_size yet...
         /* But don't go higher than the limits we found while searching
          * for free extents
          */
         stripe_size = min(devices_info[ndevs - 1].max_avail,
             	      stripe_size);
      
       This is bad. max_avail is twice the stripe_size (we need to fit 2 device
       extents on the same device for DUP).
      
       The result here is that 800MiB < 1578MiB, so it's unchanged. However,
       the resulting DUP chunk will need 1600MiB disk space, which isn't there,
       and the second dev_extent might extend into the next thing (next
       dev_extent? end of device?) for 22MiB.
      
       The last shown line of code relies on a situation where there's twice
       the value of stripe_size present as value for the variable stripe_size
       when it's DUP. This was actually the case before commit 92e222df
       "btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
         "[...] in the meantime there's a check to see if the stripe_size does
       not exceed max_chunk_size. Since during this check stripe_size is twice
       the amount as intended, the check will reduce the stripe_size to
       max_chunk_size if the actual correct to be used stripe_size is more than
       half the amount of max_chunk_size."
      
       In the previous version of the code, the 16MiB alignment (why is this
       done, by the way?) would result in a 50% chance that it would actually
       do an 8MiB alignment for the individual dev_extents, since it was
       operating on double the size. Does this matter?
      
       Does it matter that stripe_size can be set to anything which is not
       16MiB aligned because of the amount of remaining available disk space
       which is just taken?
      
       What is the main purpose of this round_up?
      
       The most straightforward thing to do seems something like...
         stripe_size = min(
             div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
             stripe_size
         )
       ..just putting half of the max_avail into stripe_size."
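
      Back to the commit itself: the enhanced verification amounts to
      remembering where the previous dev extent ended (a simplified
      fragment; the variable names are illustrative):

        /* dev extents are walked in (devid, physical offset) order, so
         * an extent may not start before the previous one on the same
         * device has ended */
        if (devid == prev_devid && physical_offset < prev_dev_ext_end) {
                btrfs_err(fs_info,
                          "dev extent devid %llu physical offset %llu overlaps with previous dev extent end %llu",
                          devid, physical_offset, prev_dev_ext_end);
                return -EUCLEAN;
        }
        prev_devid = devid;
        prev_dev_ext_end = physical_offset + physical_len;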
      
      Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/
      Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ add analysis from report ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Refactor find_free_extent loops update into find_free_extent_update_loop · e72d79d6
      Qu Wenruo authored
      We have a complex loop design in find_free_extent(), with different
      behavior for each loop stage, some stages even including new chunk
      allocation.
      
      Instead of putting such long code into find_free_extent() and making
      it harder to read, extract it into find_free_extent_update_loop().
      
      With all the cleanups, the main find_free_extent() should be pretty
      barebones:
      
      find_free_extent()
      |- Iterate through all block groups
      |  |- Get a valid block group
      |  |- Try to do clustered allocation in that block group
      |  |- Try to do unclustered allocation in that block group
      |  |- Check if the result is valid
      |  |  |- If valid, then exit
      |  |- Jump to next block group
      |
      |- Push harder to find free extents
         |- If not found, re-iterate all block groups
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
      [ copy callchain from changelog to function comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Refactor unclustered extent allocation into find_free_extent_unclustered() · e1a41848
      Qu Wenruo authored
      This patch extracts the unclustered extent allocation code into
      find_free_extent_unclustered().
      
      The helper function uses its return value to indicate what to do
      next.
      
      This should make find_free_extent() a little easier to read.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      [Update merge conflict with fb5c39d7 ("btrfs: don't use ctl->free_space for max_extent_size")]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Refactor clustered extent allocation into find_free_extent_clustered · d06e3bb6
      Qu Wenruo authored
      We have two main methods to find free extents inside a block group:
      
      1) clustered allocation
      2) unclustered allocation
      
      This patch extracts the clustered allocation into
      find_free_extent_clustered() to make it a little easier to read.
      
      Instead of jumping between different labels in find_free_extent(), the
      helper function uses its return value to indicate different behavior.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>