1. 16 May, 2022 40 commits
    • David Sterba's avatar
      btrfs: remove btrfs_delayed_extent_op::is_data · 0e3696f8
      David Sterba authored
      The value of btrfs_delayed_extent_op::is_data is always false, we can
      cascade the change and simplify code that depends on it, removing the
      structure member eventually.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0e3696f8
    • David Sterba's avatar
      btrfs: sink parameter is_data to btrfs_set_disk_extent_flags · 2fe6a5a1
      David Sterba authored
      The parameter was added in 2009 in the infamous monster commit
      5d4f98a2 ("Btrfs: Mixed back reference  (FORWARD ROLLING FORMAT
      CHANGE)") but has not been used since. We can sink it and allow further
      simplifications.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2fe6a5a1
    • Filipe Manana's avatar
      btrfs: fix deadlock between concurrent dio writes when low on free data space · f5585f4f
      Filipe Manana authored
      When reserving data space for a direct IO write we can end up deadlocking
      if we have multiple tasks attempting a write to the same file range, there
      are multiple extents covered by that file range, we are low on available
      space for data and the writes don't expand the inode's i_size.
      
      The deadlock can happen like this:
      
      1) We have a file with an i_size of 1M, at offset 0 it has an extent with
         a size of 128K and at offset 128K it has another extent also with a
         size of 128K;
      
      2) Task A does a direct IO write against file range [0, 256K), and because
         the write is within the i_size boundary, it takes the inode's lock (VFS
         level) in shared mode;
      
      3) Task A locks the file range [0, 256K) at btrfs_dio_iomap_begin(), and
         then gets the extent map for the extent covering the range [0, 128K).
         At btrfs_get_blocks_direct_write(), it creates an ordered extent for
         that file range ([0, 128K));
      
      4) Before returning from btrfs_dio_iomap_begin(), it unlocks the file
         range [0, 256K);
      
      5) Task A executes btrfs_dio_iomap_begin() again, this time for the file
         range [128K, 256K), and locks the file range [128K, 256K);
      
      6) Task B starts a direct IO write against file range [0, 256K) as well.
         It also locks the inode in shared mode, as it's within the i_size limit,
         and then tries to lock file range [0, 256K). It is able to lock the
         subrange [0, 128K) but then blocks waiting for the range [128K, 256K),
         as it is currently locked by task A;
      
      7) Task A enters btrfs_get_blocks_direct_write() and tries to reserve data
         space. Because we are low on available free space, it triggers the
         async data reclaim task, and waits for it to reserve data space;
      
      8) The async reclaim task decides to wait for all existing ordered extents
         to complete (through btrfs_wait_ordered_roots()).
         It finds the ordered extent previously created by task A for the file
         range [0, 128K) and waits for it to complete;
      
      9) The ordered extent for the file range [0, 128K) can not complete
         because it blocks at btrfs_finish_ordered_io() when trying to lock the
         file range [0, 128K).
      
         This results in a deadlock, because:
      
         - task B is holding the file range [0, 128K) locked, waiting for the
           range [128K, 256K) to be unlocked by task A;
      
         - task A is holding the file range [128K, 256K) locked and it's waiting
           for the async data reclaim task to satisfy its space reservation
           request;
      
         - the async data reclaim task is waiting for ordered extent [0, 128K)
           to complete, but the ordered extent can not complete because the
           file range [0, 128K) is currently locked by task B, which is waiting
           on task A to unlock file range [128K, 256K), while task A is waiting
           on the async data reclaim task.
      
         This results in a deadlock between 4 tasks: task A, task B, the async
         data reclaim task and the task doing ordered extent completion (a work
         queue task).
      
      This type of deadlock can sporadically be triggered by the test case
      generic/300 from fstests, and results in a stack trace like the following:
      
      [12084.033689] INFO: task kworker/u16:7:123749 blocked for more than 241 seconds.
      [12084.034877]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.035562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.036548] task:kworker/u16:7   state:D stack:    0 pid:123749 ppid:     2 flags:0x00004000
      [12084.036554] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
      [12084.036599] Call Trace:
      [12084.036601]  <TASK>
      [12084.036606]  __schedule+0x3cb/0xed0
      [12084.036616]  schedule+0x4e/0xb0
      [12084.036620]  btrfs_start_ordered_extent+0x109/0x1c0 [btrfs]
      [12084.036651]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [12084.036659]  btrfs_run_ordered_extent_work+0x1a/0x30 [btrfs]
      [12084.036688]  btrfs_work_helper+0xf8/0x400 [btrfs]
      [12084.036719]  ? lock_is_held_type+0xe8/0x140
      [12084.036727]  process_one_work+0x252/0x5a0
      [12084.036736]  ? process_one_work+0x5a0/0x5a0
      [12084.036738]  worker_thread+0x52/0x3b0
      [12084.036743]  ? process_one_work+0x5a0/0x5a0
      [12084.036745]  kthread+0xf2/0x120
      [12084.036747]  ? kthread_complete_and_exit+0x20/0x20
      [12084.036751]  ret_from_fork+0x22/0x30
      [12084.036765]  </TASK>
      [12084.036769] INFO: task kworker/u16:11:153787 blocked for more than 241 seconds.
      [12084.037702]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.038540] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.039506] task:kworker/u16:11  state:D stack:    0 pid:153787 ppid:     2 flags:0x00004000
      [12084.039511] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
      [12084.039551] Call Trace:
      [12084.039553]  <TASK>
      [12084.039557]  __schedule+0x3cb/0xed0
      [12084.039566]  schedule+0x4e/0xb0
      [12084.039569]  schedule_timeout+0xed/0x130
      [12084.039573]  ? mark_held_locks+0x50/0x80
      [12084.039578]  ? _raw_spin_unlock_irq+0x24/0x50
      [12084.039580]  ? lockdep_hardirqs_on+0x7d/0x100
      [12084.039585]  __wait_for_common+0xaf/0x1f0
      [12084.039587]  ? usleep_range_state+0xb0/0xb0
      [12084.039596]  btrfs_wait_ordered_extents+0x3d6/0x470 [btrfs]
      [12084.039636]  btrfs_wait_ordered_roots+0x175/0x240 [btrfs]
      [12084.039670]  flush_space+0x25b/0x630 [btrfs]
      [12084.039712]  btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
      [12084.039747]  process_one_work+0x252/0x5a0
      [12084.039756]  ? process_one_work+0x5a0/0x5a0
      [12084.039758]  worker_thread+0x52/0x3b0
      [12084.039762]  ? process_one_work+0x5a0/0x5a0
      [12084.039765]  kthread+0xf2/0x120
      [12084.039766]  ? kthread_complete_and_exit+0x20/0x20
      [12084.039770]  ret_from_fork+0x22/0x30
      [12084.039783]  </TASK>
      [12084.039800] INFO: task kworker/u16:17:217907 blocked for more than 241 seconds.
      [12084.040709]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.041398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.042404] task:kworker/u16:17  state:D stack:    0 pid:217907 ppid:     2 flags:0x00004000
      [12084.042411] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
      [12084.042461] Call Trace:
      [12084.042463]  <TASK>
      [12084.042471]  __schedule+0x3cb/0xed0
      [12084.042485]  schedule+0x4e/0xb0
      [12084.042490]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
      [12084.042539]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [12084.042551]  lock_extent_bits+0x37/0x90 [btrfs]
      [12084.042601]  btrfs_finish_ordered_io.isra.0+0x3fd/0x960 [btrfs]
      [12084.042656]  ? lock_is_held_type+0xe8/0x140
      [12084.042667]  btrfs_work_helper+0xf8/0x400 [btrfs]
      [12084.042716]  ? lock_is_held_type+0xe8/0x140
      [12084.042727]  process_one_work+0x252/0x5a0
      [12084.042742]  worker_thread+0x52/0x3b0
      [12084.042750]  ? process_one_work+0x5a0/0x5a0
      [12084.042754]  kthread+0xf2/0x120
      [12084.042757]  ? kthread_complete_and_exit+0x20/0x20
      [12084.042763]  ret_from_fork+0x22/0x30
      [12084.042783]  </TASK>
      [12084.042798] INFO: task fio:234517 blocked for more than 241 seconds.
      [12084.043598]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.044282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.045244] task:fio             state:D stack:    0 pid:234517 ppid:234515 flags:0x00004000
      [12084.045248] Call Trace:
      [12084.045250]  <TASK>
      [12084.045254]  __schedule+0x3cb/0xed0
      [12084.045263]  schedule+0x4e/0xb0
      [12084.045266]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
      [12084.045298]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [12084.045306]  lock_extent_bits+0x37/0x90 [btrfs]
      [12084.045336]  btrfs_dio_iomap_begin+0x336/0xc60 [btrfs]
      [12084.045370]  ? lock_is_held_type+0xe8/0x140
      [12084.045378]  iomap_iter+0x184/0x4c0
      [12084.045383]  __iomap_dio_rw+0x2c6/0x8a0
      [12084.045406]  iomap_dio_rw+0xa/0x30
      [12084.045408]  btrfs_do_write_iter+0x370/0x5e0 [btrfs]
      [12084.045440]  aio_write+0xfa/0x2c0
      [12084.045448]  ? __might_fault+0x2a/0x70
      [12084.045451]  ? kvm_sched_clock_read+0x14/0x40
      [12084.045455]  ? lock_release+0x153/0x4a0
      [12084.045463]  io_submit_one+0x615/0x9f0
      [12084.045467]  ? __might_fault+0x2a/0x70
      [12084.045469]  ? kvm_sched_clock_read+0x14/0x40
      [12084.045478]  __x64_sys_io_submit+0x83/0x160
      [12084.045483]  ? syscall_enter_from_user_mode+0x1d/0x50
      [12084.045489]  do_syscall_64+0x3b/0x90
      [12084.045517]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [12084.045521] RIP: 0033:0x7fa76511af79
      [12084.045525] RSP: 002b:00007ffd6d6b9058 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
      [12084.045530] RAX: ffffffffffffffda RBX: 00007fa75ba6e760 RCX: 00007fa76511af79
      [12084.045532] RDX: 0000557b304ff3f0 RSI: 0000000000000001 RDI: 00007fa75ba4c000
      [12084.045535] RBP: 00007fa75ba4c000 R08: 00007fa751b76000 R09: 0000000000000330
      [12084.045537] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
      [12084.045540] R13: 0000000000000000 R14: 0000557b304ff3f0 R15: 0000557b30521eb0
      [12084.045561]  </TASK>
      
      Fix this issue by always reserving data space before locking a file range
      at btrfs_dio_iomap_begin(). If we can't reserve the space, we don't
      error out immediately - instead, after locking the file range, we check
      if we can do a NOCOW write, and if we can we don't error out, since we
      don't need to allocate a data extent; however, if we can't NOCOW, we
      error out with -ENOSPC. This also implies that we may end up reserving
      space when it's not needed because the write ends up being done in NOCOW
      mode - in that case we just release the space after we notice we did a
      NOCOW write - the same type of logic used in the buffered IO write path.
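      
      The reordering can be illustrated with a minimal userspace sketch; every
      helper below (reserve_data_space(), range_can_nocow(), ...) is a
      hypothetical stand-in rather than a real btrfs function, and error
      handling is reduced to the bare control flow:
      
      /* Sketch only: model of the reordered reservation logic. */
      #include <stdbool.h>
      #include <stdio.h>
      #include <errno.h>
      
      static bool reserve_data_space(long long len)  { (void)len; return false; } /* pretend ENOSPC */
      static bool range_can_nocow(long long start, long long len) { (void)start; (void)len; return true; }
      static void release_data_space(long long len)  { (void)len; }
      static void lock_file_range(long long start, long long len)   { printf("lock   [%lld, %lld)\n", start, start + len); }
      static void unlock_file_range(long long start, long long len) { printf("unlock [%lld, %lld)\n", start, start + len); }
      
      /* Reserve *before* locking the range, so the flusher never waits on us. */
      static int dio_write_begin_sketch(long long start, long long len)
      {
          bool reserved = reserve_data_space(len);
      
          lock_file_range(start, len);
      
          if (!reserved) {
              /* No space for a new extent: only a NOCOW write can proceed. */
              if (!range_can_nocow(start, len)) {
                  unlock_file_range(start, len);
                  return -ENOSPC;
              }
          } else if (range_can_nocow(start, len)) {
              /* Reserved but not needed: give the space back, as buffered IO does. */
              release_data_space(len);
          }
      
          /* ... submit the write, then unlock when the ordered extent finishes ... */
          unlock_file_range(start, len);
          return 0;
      }
      
      int main(void)
      {
          return dio_write_begin_sketch(0, 128 * 1024) ? 1 : 0;
      }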
      
      Fixes: f0bfa76a ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range")
      CC: stable@vger.kernel.org # 5.17+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f5585f4f
    • Goldwyn Rodrigues's avatar
      btrfs: derive compression type from extent map during reads · 1d8fa2e2
      Goldwyn Rodrigues authored
      Derive the compression type from extent map as opposed to the bio flags
      passed. This makes it more precise and not reliant on function
      parameters.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1d8fa2e2
    • Qu Wenruo's avatar
      btrfs: scrub: move scrub_remap_extent() call into scrub_extent() · a13467ee
      Qu Wenruo authored
      [SUSPICIOUS CODE]
      When refactoring scrub code, I noticed a very strange behavior around
      scrub_remap_extent():
      
      	if (sctx->is_dev_replace)
      		scrub_remap_extent(fs_info, cur_logical, scrub_len,
      				   &cur_physical, &target_dev, &cur_mirror);
      
      Since the replace target is a 1:1 copy of the source device, the physical
      offset inside the target should be the same as the physical offset inside
      the source, so this remap call makes no sense to me.
      
      [REAL FUNCTIONALITY]
      After more investigation, the function name scrub_remap_extent()
      doesn't reflect what it really does, nor does its if () condition.
      
      The real story behind this function is that scrub_pages() never expects
      a missing device, even when replacing a missing device.
      
      What scrub_remap_extent() really does is find a live mirror and make the
      later scrub_pages() call read data from the good copy, rather than read
      from the missing device and increase the error counters unnecessarily.
      
      [IMPROVEMENT]
      There is no need to bother with scrub_remap_extent() in
      scrub_simple_mirror() at all; we only need to call it right before we
      call scrub_pages().
      
      Also rename the function to scrub_find_live_copy() and add extra
      comments to it.
      
      With this we can remove one parameter from scrub_extent() and reduce the
      unnecessary calls to scrub_remap_extent() for regular replace.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a13467ee
    • Qu Wenruo's avatar
      btrfs: scrub: use find_first_extent_item for extent item search · d483bfd2
      Qu Wenruo authored
      Since we have find_first_extent_item() to iterate the extent items of a
      certain range, there is no need to use the open-coded version.
      
      Replace the final scrub call site with find_first_extent_item().
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d483bfd2
    • Qu Wenruo's avatar
      btrfs: scrub: refactor scrub_raid56_parity() · 9ae53bf9
      Qu Wenruo authored
      Currently scrub_raid56_parity() has a large double loop, handling the
      following things at the same time:
      
      - Iterate each data stripe
      - Iterate each extent item in one data stripe
      
      Refactor this by:
      
      - Introduce a new helper to handle data stripe iteration
        The new helper is scrub_raid56_data_stripe_for_parity(), which
        only has one while() loop handling the extent items inside the
        data stripe.
      
        The code is still mostly the same as the old code.
      
      - Call cond_resched() for each extent
        Previously we only call cond_resched() under a complex if () check.
        I see no special reason to do that, and for other scrub functions,
        like scrub_simple_mirror() we're already doing the same cond_resched()
        after scrubbing one extent.
      
      - Add more comments
      
      Please note that this patch only addresses the double loop; there are
      incoming patches to do extra cleanups.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9ae53bf9
    • Qu Wenruo's avatar
      btrfs: scrub: use scrub_simple_mirror() to handle RAID56 data stripe scrub · 18d30ab9
      Qu Wenruo authored
      Although RAID56 has a complex repair mechanism, which involves reading
      the whole full stripe, inside one data stripe it's in fact no different
      from SINGLE/RAID1.
      
      The point here is that for a data stripe we just check the csum for each
      extent we hit.  Only in the csum mismatch case do the repair paths diverge.
      
      So we can still reuse scrub_simple_mirror() for RAID56 data stripes,
      which saves quite some code.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      18d30ab9
    • Qu Wenruo's avatar
      btrfs: scrub: cleanup the non-RAID56 branches in scrub_stripe() · e430c428
      Qu Wenruo authored
      Since we have moved all other profiles handling into their own
      functions, now the main body of scrub_stripe() is just handling RAID56
      profiles.
      
      There is no need to address other profiles in the main loop of
      scrub_stripe(), so we can remove those dead branches.
      
      Since we're here, also slightly change the timing of initialization of
      variables like @offset, @increment and @logical.
      
      Especially for @logical, we don't really need to initialize it for
      btrfs_extent_root()/btrfs_csum_root(), we can use bg->start for that
      purpose.
      
      Now those variables are only initialized for the RAID56 branches.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e430c428
    • Qu Wenruo's avatar
      btrfs: scrub: introduce dedicated helper to scrub simple-stripe based range · 8557635e
      Qu Wenruo authored
      The new entry point will iterate through each data stripe which belongs
      to the target device.
      
      And since inside each data stripe, RAID0 is just SINGLE, while RAID10 is
      just RAID1, we can reuse scrub_simple_mirror() to do the scrub properly.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8557635e
    • Qu Wenruo's avatar
      btrfs: scrub: introduce dedicated helper to scrub simple-mirror based range · 09022b14
      Qu Wenruo authored
      The new helper, scrub_simple_mirror(), will scrub all extents inside a
      range which only has simple mirror based duplication.
      
      This covers every range of SINGLE/DUP/RAID1/RAID1C*, and inside each
      data stripe for RAID0/RAID10.
      
      Currently we will use this function to scrub SINGLE/DUP/RAID1/RAID1C*
      profiles.  As one can see, the new entry point for those simple-mirror
      based profiles is small enough (with comments it barely reaches 100
      lines).
      
      This function will be the basis for the incoming scrub refactor.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      09022b14
    • Qu Wenruo's avatar
      btrfs: scrub: introduce a helper to locate an extent item · 416bd7e7
      Qu Wenruo authored
      The new helper, find_first_extent_item(), will locate an extent item
      (either EXTENT_ITEM or METADATA_ITEM) which covers any byte of the
      search range.
      
      This helper will later be used to refactor scrub code.
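      
      The lookup semantics can be modeled in userspace with a plain sorted
      array standing in for the extent tree search; the struct and function
      below are illustrative only, not the kernel implementation:
      
      #include <stdio.h>
      
      struct item { unsigned long long start, len; };
      
      /* Return the first item that covers any byte of [start, start + len). */
      static const struct item *find_first_covering(const struct item *items, int nr,
                                                    unsigned long long start,
                                                    unsigned long long len)
      {
          for (int i = 0; i < nr; i++) {
              /* Overlap test: the item ends after the range starts and
               * begins before the range ends. */
              if (items[i].start + items[i].len > start &&
                  items[i].start < start + len)
                  return &items[i];
          }
          return NULL;
      }
      
      int main(void)
      {
          const struct item tree[] = { { 0, 16384 }, { 65536, 16384 } };
          const struct item *hit = find_first_covering(tree, 2, 60000, 8192);
      
          if (hit)
              printf("first hit starts at %llu\n", hit->start); /* 65536 */
          return 0;
      }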
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      416bd7e7
    • Qu Wenruo's avatar
      btrfs: calculate physical_end using dev_extent_len directly in scrub_stripe() · 1194a824
      Qu Wenruo authored
      The variable @physical_end is the exclusive stripe end; currently it's
      calculated as @physical + @dev_extent_len / map->stripe_len * map->stripe_len.
      
      And since at allocation time we ensured dev_extent_len is stripe_len
      aligned, the result is the same as @physical + @dev_extent_len.
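      
      The identity being relied on can be checked in a couple of lines; the
      values below are just examples using the fixed 64K stripe length:
      
      #include <assert.h>
      
      int main(void)
      {
          const unsigned long long stripe_len = 64 * 1024;                  /* 64K stripes */
          const unsigned long long dev_extent_len = 1024 * 1024 * 1024ULL;  /* example, stripe aligned */
          const unsigned long long physical = 128 * 1024 * 1024ULL;         /* example start */
      
          /* The old formula rounded dev_extent_len down to a stripe boundary... */
          unsigned long long old_end = physical + dev_extent_len / stripe_len * stripe_len;
          /* ...which is a no-op when the length is already stripe aligned. */
          unsigned long long new_end = physical + dev_extent_len;
      
          assert(dev_extent_len % stripe_len == 0);
          assert(old_end == new_end);
          return 0;
      }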
      
      So this patch will just assign @physical and @physical_end early,
      without using @nstripes.
      
      This is especially helpful for any future user of the out: label, as now
      we only need to initialize @offset before going to the out: label.
      
      Since we're here, also make @physical_end constant.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1194a824
    • Gabriel Niebler's avatar
      btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray · 48b36a60
      Gabriel Niebler authored
      … rename it to simply fs_roots and adjust all usages of this object to use
      the XArray API, because it is notionally easier to use and understand, as
      it provides array semantics, and also takes care of locking for us,
      further simplifying the code.
      
      Also do some refactoring, esp. where the API change requires largely
      rewriting some functions, anyway.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      48b36a60
    • Gabriel Niebler's avatar
      btrfs: turn fs_info member buffer_radix into XArray · 8ee92268
      Gabriel Niebler authored
      … named 'extent_buffers'. Also adjust all usages of this object to use
      the XArray API, which greatly simplifies the code as it takes care of
      locking and is generally easier to use and understand, providing
      notionally simpler array semantics.
      
      Also perform some light refactoring.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8ee92268
    • Gabriel Niebler's avatar
      btrfs: turn name_cache radix tree into XArray in send_ctx · 40769420
      Gabriel Niebler authored
      … and adjust all usages of this object to use the XArray API for the sake
      of consistency.
      
      XArray API provides array semantics, so it is notionally easier to use and
      understand, and it also takes care of locking for us.
      
      None of this makes a real difference in this particular patch, but it does
      in other places where similar replacements are or have been made and we
      want to be consistent in our usage of data structures in btrfs.
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      40769420
    • Gabriel Niebler's avatar
      btrfs: turn delayed_nodes_tree into an XArray · 253bf575
      Gabriel Niebler authored
      … in the btrfs_root struct and adjust all usages of this object to use
      the XArray API, because it is notionally easier to use and understand,
      as it provides array semantics, and also takes care of locking for us,
      further simplifying the code.
      
      Also use the opportunity to do some light refactoring.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      253bf575
    • Qu Wenruo's avatar
      btrfs: use ilog2() to replace if () branches for btrfs_bg_flags_to_raid_index() · 719fae89
      Qu Wenruo authored
      In function btrfs_bg_flags_to_raid_index(), we use quite a few if ()
      branches to convert the BTRFS_BLOCK_GROUP_* bits to an index number.
      
      But the truth is, there is really no such need for so many branches at
      all.
      Since all BTRFS_BLOCK_GROUP_* profile flags are just one single bit set
      inside BTRFS_BLOCK_GROUP_PROFILE_MASK, we can easily use ilog2() to
      calculate their index values.
      
      This calculation has an anchor point, the lowest PROFILE bit, which is
      RAID0.
      
      Even though this is fixed by the on-disk format and should never change,
      here I added extra compile time checks to make it super safe:
      
      1. Make sure RAID0 is always the lowest bit in PROFILE_MASK
         This is done by finding the first (least significant) bit set of
         RAID0 and PROFILE_MASK & ~RAID0.
      
      2. Make sure the RAID0 bit is set beyond the highest bit of TYPE_MASK
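      
      A userspace sketch of the ilog2() mapping; the bit positions below are
      illustrative only (the real values come from the on-disk format
      definitions), what matters is that each profile is a single bit and
      RAID0 is the lowest profile bit:
      
      #include <stdio.h>
      
      #define BG_RAID0        (1ULL << 3)    /* illustrative positions */
      #define BG_RAID1        (1ULL << 4)
      #define BG_DUP          (1ULL << 5)
      #define BG_RAID10       (1ULL << 6)
      #define BG_PROFILE_MASK (BG_RAID0 | BG_RAID1 | BG_DUP | BG_RAID10)
      
      enum raid_index { RAID_SINGLE = 0, RAID_RAID0, RAID_RAID1, RAID_DUP, RAID_RAID10 };
      
      static int ilog2_u64(unsigned long long v)
      {
          return 63 - __builtin_clzll(v);
      }
      
      static enum raid_index bg_flags_to_raid_index(unsigned long long flags)
      {
          unsigned long long profile = flags & BG_PROFILE_MASK;
      
          if (!profile)
              return RAID_SINGLE;
          /* RAID0 is the anchor: its bit maps to index 1, the next bit to 2, ... */
          return ilog2_u64(profile) - ilog2_u64(BG_RAID0) + 1;
      }
      
      int main(void)
      {
          printf("%d %d\n", bg_flags_to_raid_index(BG_RAID10),
                 bg_flags_to_raid_index(0));    /* prints "4 0" */
          return 0;
      }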
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      719fae89
    • Qu Wenruo's avatar
      btrfs: move definition of btrfs_raid_types to volumes.h · f04fbcc6
      Qu Wenruo authored
      It's only used internally as another way to represent btrfs profiles;
      it's not exposed through any on-disk format, and in fact btrfs_raid_types
      diverges from the on-disk format values.
      
      Furthermore, since it's an internal structure, its definition can change
      in the future.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f04fbcc6
    • Christoph Hellwig's avatar
      btrfs: use a normal workqueue for rmw_workers · 385de0ef
      Christoph Hellwig authored
      rmw_workers doesn't need ordered execution or a thread disabling
      threshold (as the thresh parameter is less than DFT_THRESHOLD).
      
      Just switch to the normal workqueues that use far fewer resources,
      especially in the work_struct vs btrfs_work structures.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      385de0ef
    • Christoph Hellwig's avatar
      btrfs: use normal workqueues for scrub · be539518
      Christoph Hellwig authored
      All three scrub workqueues don't need ordered execution or a thread
      disabling threshold (as the thresh parameter is less than DFT_THRESHOLD).
      Just switch to the normal workqueues that use far fewer resources,
      especially in the work_struct vs btrfs_work structures.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      be539518
    • Christoph Hellwig's avatar
      btrfs: simplify WQ_HIGHPRI handling in struct btrfs_workqueue · a31b4a43
      Christoph Hellwig authored
      Just let the one caller that wants optional WQ_HIGHPRI handling allocate
      a separate btrfs_workqueue for that.  This allows us to rename struct
      __btrfs_workqueue to btrfs_workqueue, remove a pointer indirection and
      the separate allocation for all btrfs_workqueue users, and generally
      simplify the code.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a31b4a43
    • Qu Wenruo's avatar
      btrfs: raid56: enable subpage support for RAID56 · a7b8e39c
      Qu Wenruo authored
      Now that the btrfs RAID56 infrastructure has migrated to the sector_ptr
      interface, it should be safe to enable subpage support for RAID56.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a7b8e39c
    • Qu Wenruo's avatar
      btrfs: raid56: make alloc_rbio_essential_pages() subpage compatible · 3907ce29
      Qu Wenruo authored
      The only non-compatible part is the bitmap iteration, as the bitmap size
      is now extended to rbio::stripe_nsectors, not the old
      rbio::stripe_npages.
      
      Since we're here, also slightly improve the function by:
      
      - Rename @i to @stripe
      - Rename @bit to @sectornr
      - Move @page and @index into the inner loop
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3907ce29
    • Qu Wenruo's avatar
      btrfs: raid56: make steal_rbio() subpage compatible · d4e28d9b
      Qu Wenruo authored
      Function steal_rbio() will take all the uptodate pages from the source
      rbio to destination rbio.
      
      With the new stripe_sectors[] array, we also need to do the extra check:
      
      - Check sector::flags to make sure the full page is uptodate
        Now we don't use PageUptodate flag for subpage cases to indicate
        if the page is uptodate.
      
        Instead we need to check that all the sectors belonging to the page
        are uptodate before we can consider the full page uptodate.
      
        So here we introduce a new helper, full_page_sectors_uptodate(), to do
        the check (a minimal userspace model of the check follows the list
        below).
      
      - Update rbio::stripe_sectors[] to use the new page pointer
        We only need to change the page pointer, no need to change anything
        else.
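      
      A userspace model of what such a check amounts to; the data layout here
      (a plain flag per sector) is made up for illustration and is not the
      kernel's sector_ptr array:
      
      #include <stdbool.h>
      #include <stdio.h>
      
      /* A page is only "uptodate" when every sector backing it is. */
      static bool full_page_sectors_uptodate(const bool *uptodate, int sectors_per_page,
                                             int page_index)
      {
          for (int i = 0; i < sectors_per_page; i++) {
              if (!uptodate[page_index * sectors_per_page + i])
                  return false;
          }
          return true;
      }
      
      int main(void)
      {
          /* 64K page, 4K sectors: 16 sectors per page; one not yet read. */
          bool uptodate[16];
      
          for (int i = 0; i < 16; i++)
              uptodate[i] = (i != 15);
      
          printf("%d\n", full_page_sectors_uptodate(uptodate, 16, 0)); /* prints 0 */
          return 0;
      }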
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d4e28d9b
    • Qu Wenruo's avatar
      btrfs: raid56: make set_bio_pages_uptodate() subpage compatible · 5fdb7afc
      Qu Wenruo authored
      Unlike previous code, we can not directly set PageUptodate for stripe
      pages now.  Instead we have to iterate through all the sectors and set
      SECTOR_UPTODATE flag there.
      
      Introduce a new helper find_stripe_sector(), to do the work.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5fdb7afc
    • Qu Wenruo's avatar
      btrfs: raid56: remove btrfs_raid_bio::bio_pages array · ac26df8b
      Qu Wenruo authored
      The functionality is completely replaced by the new bio_sectors member,
      now it's time to remove the old member.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ac26df8b
    • Qu Wenruo's avatar
      btrfs: raid56: make raid56_add_scrub_pages() subpage compatible · 6346f6bf
      Qu Wenruo authored
      This requires one extra parameter @pgoff for the function.
      
      In the current code base, scrub is still one page per sector, thus the
      new parameter will always be 0.
      
      It needs the extra subpage scrub optimization code to fully take
      advantage.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6346f6bf
    • Qu Wenruo's avatar
      btrfs: raid56: open code rbio_stripe_page_index() · f77183dc
      Qu Wenruo authored
      There is only one caller for that helper now, and we're definitely fine
      to open-code it.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f77183dc
    • Qu Wenruo's avatar
      btrfs: raid56: make finish_rmw() subpage compatible · 1145059a
      Qu Wenruo authored
      With this function converted to subpage compatible sector interfaces,
      the following helper functions can be removed:
      
      - rbio_stripe_page()
      - rbio_pstripe_page()
      - rbio_qstripe_page()
      - page_in_rbio()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1145059a
    • Qu Wenruo's avatar
      btrfs: raid56: make __raid_recover_endio_io() subpage compatible · 07e4d380
      Qu Wenruo authored
      This involves:
      
      - Use sector_ptr interface to grab the pointers
      
      - Add sector->pgoff to pointers[]
      
      - Rebuild data using sectorsize instead of PAGE_SIZE
      
      - Use memcpy() to replace copy_page()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      07e4d380
    • Qu Wenruo's avatar
      btrfs: raid56: make finish_parity_scrub() subpage compatible · 46900662
      Qu Wenruo authored
      The core is to convert direct page usage into sector_ptr usage, and
      use memcpy() to replace copy_page().
      
      For pointers usage, we need to convert it to kmap_local_page() +
      sector->pgoff.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      46900662
    • Qu Wenruo's avatar
      btrfs: raid56: make rbio_add_io_page() subpage compatible · 3e77605d
      Qu Wenruo authored
      Make rbio_add_io_page() subpage compatible, which involves:
      
      - Rename rbio_add_io_page() to rbio_add_io_sector()
        We still rely on PAGE_SIZE == sectorsize though, so add a new
        ASSERT() inside rbio_add_io_sector() to make sure all pgoffs are 0.
      
      - Introduce rbio_stripe_sector() helper
        The equivalent of rbio_stripe_page().
      
        This new helper has extra ASSERT()s to validate the stripe and sector
        number.
      
      - Introduce sector_in_rbio() helper
        The equivalent of page_in_rbio().
      
      - Rename @pagenr variables to @sectornr
      
      - Use rbio::stripe_nsectors when iterating the bitmap
      
      Please note that this only changes the interface; the bios are still
      using full pages for IO.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3e77605d
    • Qu Wenruo's avatar
      btrfs: raid56: introduce btrfs_raid_bio::bio_sectors · 00425dd9
      Qu Wenruo authored
      This new member is going to fully replace bio_pages in the future, but
      for now let's keep them co-exist, until the full switch is done.
      
      Currently cache_rbio_pages() and index_rbio_pages() will also populate
      the new array.
      
      And cache_rbio_pages() needs to record which sectors are uptodate, so we
      also need to introduce the sector_ptr::uptodate bit.
      
      To avoid extra memory usage, we let the new @uptodate bit share bits
      with @pgoff.  Now pgoff only has at most 31 bits, which is already more
      than enough, as even for a 256K page size we only need 18 bits.
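      
      The arithmetic can be checked quickly; the bitfield layout below is a
      hypothetical illustration of the idea, not the kernel's actual struct:
      
      #include <assert.h>
      #include <stdio.h>
      
      /* pgoff gives up one bit to the uptodate flag and still keeps 31 bits,
       * far more than the 18 needed for offsets inside a 256K page. */
      struct sector_ptr_model {
          unsigned int pgoff:31;
          unsigned int uptodate:1;
      };
      
      int main(void)
      {
          struct sector_ptr_model s = { .pgoff = (256 * 1024) - 1, .uptodate = 1 };
      
          assert((256 * 1024 - 1) < (1u << 18));  /* 18 bits cover a 256K page */
          printf("sizeof=%zu pgoff=%u uptodate=%u\n",
                 sizeof(s), s.pgoff, s.uptodate); /* both fields share one word */
          return 0;
      }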
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      00425dd9
    • Qu Wenruo's avatar
      btrfs: raid56: introduce btrfs_raid_bio::stripe_sectors · eb357060
      Qu Wenruo authored
      The new member is an array of sector_ptr pointers, they will represent
      all sectors inside a full stripe (including P/Q).
      
      They co-operate with btrfs_raid_bio::stripe_pages:
      
      stripe_pages:   | Page 0, range [0, 64K)   | Page 1 ...
      stripe_sectors: |  |  | ...             |  |
                      |  |                    \- sector 15, page 0, pgoff=60K
                      |  \- sector 1, page 0, pgoff=4K
                      \---- sector 0, page 0, pgoff=0
      
      With such a structure, we can represent subpage sectors without using
      extra pages.
      
      Here we introduce a new helper, index_stripe_sectors(), to update
      stripe_sectors[] to point to correct page and pgoff.
      
      So every time rbio::stripe_pages[] pointer gets updated, the new helper
      should be called.
      
      The following functions have to call the new helper:
      
      - steal_rbio()
      - alloc_rbio_pages()
      - alloc_rbio_parity_pages()
      - alloc_rbio_essential_pages()
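      
      A userspace model of the indexing step: map every sector of the full
      stripe to its backing page and in-page offset, so updating the page
      array only requires re-running this loop.  The types and names are
      stand-ins for the real rbio structures, and the page/sector sizes are
      example values:
      
      #include <stdio.h>
      
      #define PAGE_SZ     (64 * 1024)     /* example: 64K pages */
      #define SECTOR_SZ   (4 * 1024)      /* example: 4K sectors */
      
      struct sector_ptr_model {
          int page_index;                 /* stands in for the struct page pointer */
          unsigned int pgoff;
      };
      
      /* Re-point every sector at the right page and in-page offset; the real
       * helper runs each time the stripe page array changes. */
      static void index_stripe_sectors_model(struct sector_ptr_model *sectors, int nr_sectors)
      {
          for (int i = 0; i < nr_sectors; i++) {
              unsigned long long offset = (unsigned long long)i * SECTOR_SZ;
      
              sectors[i].page_index = offset / PAGE_SZ;
              sectors[i].pgoff = offset % PAGE_SZ;
          }
      }
      
      int main(void)
      {
          struct sector_ptr_model sectors[32];    /* two 64K pages worth of sectors */
      
          index_stripe_sectors_model(sectors, 32);
          printf("sector 15 -> page %d pgoff %u\n",
                 sectors[15].page_index, sectors[15].pgoff);  /* page 0, pgoff 61440 (60K) */
          printf("sector 16 -> page %d pgoff %u\n",
                 sectors[16].page_index, sectors[16].pgoff);  /* page 1, pgoff 0 */
          return 0;
      }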
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      eb357060
    • Qu Wenruo's avatar
      btrfs: raid56: introduce new cached members for btrfs_raid_bio · 94efbe19
      Qu Wenruo authored
      The new members are all related to number of sectors, but the existing
      number of pages members are kept as is:
      
      - nr_sectors
        Total sectors of the full stripe including P/Q.
      
      - stripe_nsectors
        The sectors of a single stripe.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      94efbe19
    • Qu Wenruo's avatar
      btrfs: raid56: make btrfs_raid_bio more compact · 29b06838
      Qu Wenruo authored
      There are a lot of members in btrfs_raid_bio using much larger types
      than necessary, like nr_pages, which represents the total number of
      pages of a full stripe.
      
      Instead of int (which is at least 32 bits), u16 is already enough
      (the max stripe length will be 256MiB, already beyond the current RAID56
      device number limit).
      
      So this patch will reduce the width of the following members:
      
      - stripe_len to u32
      - nr_pages to u16
      - nr_data to u8
      - real_stripes to u8
      - scrubp to u8
      - faila/b to s8
        As -1 is used to indicate no corruption
      
      This will slightly reduce the size of btrfs_raid_bio from 272 bytes to
      256 bytes, saving 16 bytes.
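      
      Purely as an illustration of why narrowing member types shrinks a
      struct (the layouts below are made up and much smaller than
      btrfs_raid_bio):
      
      #include <stdio.h>
      #include <stdint.h>
      
      struct wide {          /* every counter as int */
          int nr_pages;
          int nr_data;
          int real_stripes;
          int scrubp;
          int faila;
          int failb;
      };
      
      struct narrow {        /* the same information in the smallest sufficient types */
          uint16_t nr_pages;
          uint8_t nr_data;
          uint8_t real_stripes;
          uint8_t scrubp;
          int8_t faila;      /* -1 still means "no corruption" */
          int8_t failb;
      };
      
      int main(void)
      {
          printf("wide=%zu narrow=%zu\n", sizeof(struct wide), sizeof(struct narrow));
          return 0;          /* typically prints "wide=24 narrow=8" */
      }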
      
      But please note that when using btrfs_raid_bio we allocate extra space
      for it to cover the various pointer arrays, so the reduced memory is not
      really a big saving overall.
      
      As we're here modifying the comments already, update existing comments
      to current code standard.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      29b06838
    • Qu Wenruo's avatar
      btrfs: raid56: open code rbio_nr_pages() · 843de58b
      Qu Wenruo authored
      The function rbio_nr_pages() is only called once inside alloc_rbio(), so
      there is no reason to keep it as a dedicated helper.
      
      Furthermore, the return type doesn't match: the function returns "unsigned
      long", which is not necessary, while the only caller only uses "int".
      
      Since we're cleaning up here, also fix the type to "const unsigned
      int" for all involved local variables.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      843de58b
    • Qu Wenruo's avatar
      btrfs: reduce width for stripe_len from u64 to u32 · cc353a8b
      Qu Wenruo authored
      Currently btrfs uses a fixed stripe length (64K), so u32 is wide enough
      for the usage.
      
      Furthermore, even if in the future we choose to enlarge the stripe length
      to larger values, I don't believe we would want a stripe as large as 4G
      or larger.
      
      So this patch will reduce the width for all in-memory structures and
      parameters, this involves:
      
      - RAID56 related function argument lists
        This allows us to do direct division related to stripe_len.
        Although we will use bits shift to replace the division anyway.
      
      - btrfs_io_geometry structure
        This involves one change to simplify the calculation of both @stripe_nr
        and @stripe_offset, using div64_u64_rem() (see the sketch after this
        list), and adds an extra sanity check to make sure @stripe_offset
        always fits in a u32.
      
        This saves 8 bytes for the structure.
      
      - map_lookup structure
        This converts @stripe_len to u32, which saves 8 bytes (4 bytes saved
        directly, plus a removed 4-byte hole).
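      
      A userspace model of the simplified calculation; div64_u64_rem() is
      approximated here with plain C division, and the offset is an example
      value:
      
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>
      
      int main(void)
      {
          const uint32_t stripe_len = 64 * 1024;              /* fixed 64K stripes */
          const uint64_t offset = 5ULL * 1024 * 1024 + 4096;  /* example offset in the chunk */
      
          /* One division with remainder yields both values at once. */
          uint64_t stripe_nr = offset / stripe_len;
          uint32_t stripe_offset = (uint32_t)(offset % stripe_len);
      
          assert(stripe_offset < stripe_len);  /* the sanity check mentioned above */
          printf("stripe_nr=%llu stripe_offset=%u\n",
                 (unsigned long long)stripe_nr, stripe_offset); /* 80 and 4096 */
          return 0;
      }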
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cc353a8b