1. 22 Jun, 2021 14 commits
    • Josef Bacik's avatar
      btrfs: rip out may_commit_transaction · c416a30c
      Josef Bacik authored
      may_commit_transaction was introduced before the ticketing
      infrastructure existed.  There was a problem where we'd legitimately be
      out of space, but every reservation would trigger a transaction commit
      and then fail.  Thus if you had 1000 things trying to make a
      reservation, they'd all do the flushing loop and thus commit the
      transaction 1000 times before they'd get their ENOSPC.
      
      This helper was introduced to short circuit this, if there wasn't space
      that could be reclaimed by committing the transaction then simply ENOSPC
      out.  This made true ENOSPC tests much faster as we didn't waste a bunch
      of time.
      
      However many of our bugs over the years have been from cases where we
      didn't account for some space that would be reclaimed by committing a
      transaction.  The delayed refs rsv space, delayed rsv, many pinned bytes
      miscalculations, etc.  And in the meantime the original problem has been
      solved with ticketing.  We no longer will commit the transaction 1000
      times.  Instead we'll get 1000 waiters, we will go through the flushing
      mechanisms, and if there's no progress after 2 loops we ENOSPC everybody
      out.  The ticketing infrastructure gives us a deterministic way to see
      if we're making progress or not, thus we avoid a lot of extra work.
      
      So simplify this step by simply unconditionally committing the
      transaction.  This removes what is arguably our most common source of
      early ENOSPC bugs and will allow us to drastically simplify many of the
      things we track because we simply won't need them with this stuff gone.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c416a30c
    • Filipe Manana's avatar
      btrfs: send: fix crash when memory allocations trigger reclaim · 35b22c19
      Filipe Manana authored
      When doing a send we don't expect the task to ever start a transaction
      after the initial check that verifies if commit roots match the regular
      roots. This is because after that we set current->journal_info with a
      stub (special value) that signals we are in send context, so that we take
      a read lock on an extent buffer when reading it from disk and verifying
      it is valid (its generation matches the generation stored in the parent).
      This stub was introduced in 2014 by commit a26e8c9f ("Btrfs: don't
      clear uptodate if the eb is under IO") in order to fix a concurrency issue
      between send and balance.
      
      However there is one particular exception where we end up needing to start
      a transaction and when this happens it results in a crash with a stack
      trace like the following:
      
      [60015.902283] kernel: WARNING: CPU: 3 PID: 58159 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x21/0x80
      [60015.902292] kernel: Modules linked in: uinput rfcomm snd_seq_dummy (...)
      [60015.902384] kernel: CPU: 3 PID: 58159 Comm: btrfs Not tainted 5.12.9-300.fc34.x86_64 #1
      [60015.902387] kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XN-WIFI, BIOS F6 12/24/2015
      [60015.902389] kernel: RIP: 0010:kfence_protect_page+0x21/0x80
      [60015.902393] kernel: Code: ff 0f 1f 84 00 00 00 00 00 55 48 89 fd (...)
      [60015.902396] kernel: RSP: 0018:ffff9fb583453220 EFLAGS: 00010246
      [60015.902399] kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9fb583453224
      [60015.902401] kernel: RDX: ffff9fb583453224 RSI: 0000000000000000 RDI: 0000000000000000
      [60015.902402] kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      [60015.902404] kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
      [60015.902406] kernel: R13: ffff9fb583453348 R14: 0000000000000000 R15: 0000000000000001
      [60015.902408] kernel: FS:  00007f158e62d8c0(0000) GS:ffff93bd37580000(0000) knlGS:0000000000000000
      [60015.902410] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [60015.902412] kernel: CR2: 0000000000000039 CR3: 00000001256d2000 CR4: 00000000000506e0
      [60015.902414] kernel: Call Trace:
      [60015.902419] kernel:  kfence_unprotect+0x13/0x30
      [60015.902423] kernel:  page_fault_oops+0x89/0x270
      [60015.902427] kernel:  ? search_module_extables+0xf/0x40
      [60015.902431] kernel:  ? search_bpf_extables+0x57/0x70
      [60015.902435] kernel:  kernelmode_fixup_or_oops+0xd6/0xf0
      [60015.902437] kernel:  __bad_area_nosemaphore+0x142/0x180
      [60015.902440] kernel:  exc_page_fault+0x67/0x150
      [60015.902445] kernel:  asm_exc_page_fault+0x1e/0x30
      [60015.902450] kernel: RIP: 0010:start_transaction+0x71/0x580
      [60015.902454] kernel: Code: d3 0f 84 92 00 00 00 80 e7 06 0f 85 63 (...)
      [60015.902456] kernel: RSP: 0018:ffff9fb5834533f8 EFLAGS: 00010246
      [60015.902458] kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
      [60015.902460] kernel: RDX: 0000000000000801 RSI: 0000000000000000 RDI: 0000000000000039
      [60015.902462] kernel: RBP: ffff93bc0a7eb800 R08: 0000000000000001 R09: 0000000000000000
      [60015.902463] kernel: R10: 0000000000098a00 R11: 0000000000000001 R12: 0000000000000001
      [60015.902464] kernel: R13: 0000000000000000 R14: ffff93bc0c92b000 R15: ffff93bc0c92b000
      [60015.902468] kernel:  btrfs_commit_inode_delayed_inode+0x5d/0x120
      [60015.902473] kernel:  btrfs_evict_inode+0x2c5/0x3f0
      [60015.902476] kernel:  evict+0xd1/0x180
      [60015.902480] kernel:  inode_lru_isolate+0xe7/0x180
      [60015.902483] kernel:  __list_lru_walk_one+0x77/0x150
      [60015.902487] kernel:  ? iput+0x1a0/0x1a0
      [60015.902489] kernel:  ? iput+0x1a0/0x1a0
      [60015.902491] kernel:  list_lru_walk_one+0x47/0x70
      [60015.902495] kernel:  prune_icache_sb+0x39/0x50
      [60015.902497] kernel:  super_cache_scan+0x161/0x1f0
      [60015.902501] kernel:  do_shrink_slab+0x142/0x240
      [60015.902505] kernel:  shrink_slab+0x164/0x280
      [60015.902509] kernel:  shrink_node+0x2c8/0x6e0
      [60015.902512] kernel:  do_try_to_free_pages+0xcb/0x4b0
      [60015.902514] kernel:  try_to_free_pages+0xda/0x190
      [60015.902516] kernel:  __alloc_pages_slowpath.constprop.0+0x373/0xcc0
      [60015.902521] kernel:  ? __memcg_kmem_charge_page+0xc2/0x1e0
      [60015.902525] kernel:  __alloc_pages_nodemask+0x30a/0x340
      [60015.902528] kernel:  pipe_write+0x30b/0x5c0
      [60015.902531] kernel:  ? set_next_entity+0xad/0x1e0
      [60015.902534] kernel:  ? switch_mm_irqs_off+0x58/0x440
      [60015.902538] kernel:  __kernel_write+0x13a/0x2b0
      [60015.902541] kernel:  kernel_write+0x73/0x150
      [60015.902543] kernel:  send_cmd+0x7b/0xd0
      [60015.902545] kernel:  send_extent_data+0x5a3/0x6b0
      [60015.902549] kernel:  process_extent+0x19b/0xed0
      [60015.902551] kernel:  btrfs_ioctl_send+0x1434/0x17e0
      [60015.902554] kernel:  ? _btrfs_ioctl_send+0xe1/0x100
      [60015.902557] kernel:  _btrfs_ioctl_send+0xbf/0x100
      [60015.902559] kernel:  ? enqueue_entity+0x18c/0x7b0
      [60015.902562] kernel:  btrfs_ioctl+0x185f/0x2f80
      [60015.902564] kernel:  ? psi_task_change+0x84/0xc0
      [60015.902569] kernel:  ? _flat_send_IPI_mask+0x21/0x40
      [60015.902572] kernel:  ? check_preempt_curr+0x2f/0x70
      [60015.902576] kernel:  ? selinux_file_ioctl+0x137/0x1e0
      [60015.902579] kernel:  ? expand_files+0x1cb/0x1d0
      [60015.902582] kernel:  ? __x64_sys_ioctl+0x82/0xb0
      [60015.902585] kernel:  __x64_sys_ioctl+0x82/0xb0
      [60015.902588] kernel:  do_syscall_64+0x33/0x40
      [60015.902591] kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [60015.902595] kernel: RIP: 0033:0x7f158e38f0ab
      [60015.902599] kernel: Code: ff ff ff 85 c0 79 9b (...)
      [60015.902602] kernel: RSP: 002b:00007ffcb2519bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [60015.902605] kernel: RAX: ffffffffffffffda RBX: 00007ffcb251ae00 RCX: 00007f158e38f0ab
      [60015.902607] kernel: RDX: 00007ffcb2519cf0 RSI: 0000000040489426 RDI: 0000000000000004
      [60015.902608] kernel: RBP: 0000000000000004 R08: 00007f158e297640 R09: 00007f158e297640
      [60015.902610] kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
      [60015.902612] kernel: R13: 0000000000000002 R14: 00007ffcb251aee0 R15: 0000558c1a83e2a0
      [60015.902615] kernel: ---[ end trace 7bbc33e23bb887ae ]---
      
      This happens because when writing to the pipe, by calling kernel_write(),
      we end up doing page allocations using GFP_HIGHUSER | __GFP_ACCOUNT as the
      gfp flags, which allow reclaim to happen if there is memory pressure. This
      allocation happens at fs/pipe.c:pipe_write().
      
      If the reclaim is triggered, inode eviction can be triggered and that in
      turn can result in starting a transaction if the inode has a link count
      of 0. The transaction start happens early on during eviction, when we call
      btrfs_commit_inode_delayed_inode() at btrfs_evict_inode(). This happens if
      there is currently an open file descriptor for an inode with a link count
      of 0 and the reclaim task gets a reference on the inode before that
      descriptor is closed, in which case the reclaim task ends up doing the
      final iput that triggers the inode eviction.
      
      When we have assertions enabled (CONFIG_BTRFS_ASSERT=y), this triggers
      the following assertion at transaction.c:start_transaction():
      
          /* Send isn't supposed to start transactions. */
          ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB);
      
      And when assertions are not enabled, it triggers a crash since after that
      assertion we cast current->journal_info into a transaction handle pointer
      and then dereference it:
      
         if (current->journal_info) {
             WARN_ON(type & TRANS_EXTWRITERS);
             h = current->journal_info;
             refcount_inc(&h->use_count);
             (...)
      
      Which obviously results in a crash due to an invalid memory access.
      
      The same type of issue can happen during other memory allocations we
      do directly in the send code with kmalloc (and friends) as they use
      GFP_KERNEL and therefore may trigger reclaim too, which started to
      happen since 2016 after commit e780b0d1 ("btrfs: send: use
      GFP_KERNEL everywhere").
      
      The issue could be solved by setting up a NOFS context for the entire
      send operation so that reclaim could not be triggered when allocating
      memory or pages through kernel_write(). However that is not very friendly
      and we can in fact get rid of the send stub because:
      
      1) The stub was introduced way back in 2014 by commit a26e8c9f
         ("Btrfs: don't clear uptodate if the eb is under IO") to solve an
         issue exclusive to when send and balance are running in parallel,
         however there were other problems between balance and send and we do
         not allow anymore to have balance and send run concurrently since
         commit 9e967495 ("Btrfs: prevent send failures and crashes due
         to concurrent relocation"). More generically the issues are between
         send and relocation, and that last commit eliminated only the
         possibility of having send and balance run concurrently, but shrinking
         a device also can trigger relocation, and on zoned filesystems we have
         relocation of partially used block groups triggered automatically as
         well. The previous patch that has a subject of:
      
         "btrfs: ensure relocation never runs while we have send operations running"
      
         Addresses all the remaining cases that can trigger relocation.
      
      2) We can actually allow starting and even committing transactions while
         in a send context if needed because send is not holding any locks that
         would block the start or the commit of a transaction.
      
      So get rid of all the logic added by commit a26e8c9f ("Btrfs: don't
      clear uptodate if the eb is under IO"). We can now always call
      clear_extent_buffer_uptodate() at verify_parent_transid() since send is
      the only case that uses commit roots without having a transaction open or
      without holding the commit_root_sem.
      Reported-by: default avatarChris Murphy <lists@colorremedies.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJCQCtRQ57=qXo3kygwpwEBOU_CA_eKvdmjP52sU=eFvuVOEGw@mail.gmail.com/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      35b22c19
    • Filipe Manana's avatar
      btrfs: ensure relocation never runs while we have send operations running · 1cea5cf0
      Filipe Manana authored
      Relocation and send do not play well together because while send is
      running a block group can be relocated, a transaction committed and
      the respective disk extents get re-allocated and written to or discarded
      while send is about to do something with the extents.
      
      This was explained in commit 9e967495 ("Btrfs: prevent send failures
      and crashes due to concurrent relocation"), which prevented balance and
      send from running in parallel but it did not address one remaining case
      where chunk relocation can happen: shrinking a device (and device deletion
      which shrinks a device's size to 0 before deleting the device).
      
      We also have now one more case where relocation is triggered: on zoned
      filesystems partially used block groups get relocated by a background
      thread, introduced in commit 18bb8bbf ("btrfs: zoned: automatically
      reclaim zones").
      
      So make sure that instead of preventing balance from running when there
      are ongoing send operations, we prevent relocation from happening.
      This uses the infrastructure recently added by a patch that has the
      subject: "btrfs: add cancellable chunk relocation support".
      
      Also it adds a spinlock used exclusively for the exclusivity between
      send and relocation, as before fs_info->balance_mutex was used, which
      would make an attempt to run send to block waiting for balance to
      finish, which can take a lot of time on large filesystems.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1cea5cf0
    • David Sterba's avatar
      btrfs: shorten integrity checker extent data mount option · cbeaae4f
      David Sterba authored
      Subjectively, CHECK_INTEGRITY_INCLUDING_EXTENT_DATA is quite long and
      calling it CHECK_INTEGRITY_DATA still keeps the meaning and matches the
      mount option name.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cbeaae4f
    • David Sterba's avatar
      btrfs: switch mount option bits to enums and use wider type · ccd9395b
      David Sterba authored
      Switch defines of BTRFS_MOUNT_* to an enum (the symbolic names are
      recorded in the debugging information for convenience).
      
      There are two more things done but separating them would not make much
      sense as it's touching the same lines:
      
      - Renumber shifts 18..31 to 17..30 to get rid of the hole in the
        sequence.
      
      - Use 1UL as the value that gets shifted because we're approaching the
        32bit limit and due to integer promotions the value of (1 << 31)
        becomes 0xffffffff80000000 when cast to unsigned long (eg. the option
        manipulating helpers).
      
        This is not causing any problems yet as the operations are in-memory
        and masking the 31st bit works, we don't have more than 31 bits so the
        ill effects of not masking higher bits don't happen. But once we have
        more, the problems will emerge.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ccd9395b
    • David Sterba's avatar
      btrfs: props: change how empty value is interpreted · 5548c8c6
      David Sterba authored
      Based on user feedback and actual problems with compression property,
      there's no support to unset any compression options, or to force no
      compression flag.
      
      Note: This has changed recently in e2fsprogs 1.46.2, 'chattr +m'
      (setting NOCOMPRESS).
      
      In btrfs properties, the empty value should really mean reset to
      defaults, for all properties in general. Right now there's only the
      compression one, so this change should not cause too many problems.
      
      Old behaviour:
      
        $ lsattr file
        ---------------------- file
        # the NOCOMPRESS bit is set
        $ btrfs prop set file compression ''
        $ lsattr file
        ---------------------m file
      
      This is equivalent to 'btrfs prop set file compression no' in current
      btrfs-progs as the 'no' or 'none' values are translated to an empty
      string.
      
      This is where the new behaviour is different: empty string drops the
      compression flag (-c) and nocompress (-m):
      
        $ lsattr file
        ---------------------- file
        # No change
        $ btrfs prop set file compression ''
        $ lsattr file
        ---------------------- file
        $ btrfs prop set file compression lzo
        $ lsattr file
        --------c------------- file
        $ btrfs prop get file compression
        compression=lzo
        $ btrfs prop set file compression ''
        # Reset to the initial state
        $ lsattr file
        ---------------------- file
        # Set NOCOMPRESS bit
        $ btrfs prop set file compression no
        $ lsattr file
        ---------------------m file
      
      This obviously brings problems with backward compatibility, so this
      patch should not be backported without making sure the updated
      btrfs-progs are also used and that scripts have been updated to use the
      new semantics.
      
      Summary:
      
      - old kernel:
        no, none, "" - set NOCOMPRESS bit
      - new kernel:
        no, none - set NOCOMPRESS bit
        "" - drop all compression flags, ie. COMPRESS and NOCOMPRESS
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5548c8c6
    • David Sterba's avatar
      btrfs: compression: don't try to compress if we don't have enough pages · f2165627
      David Sterba authored
      The early check if we should attempt compression does not take into
      account the number of input pages. It can happen that there's only one
      page, eg. a tail page after some ranges of the BTRFS_MAX_UNCOMPRESSED
      have been processed, or an isolated page that won't be converted to an
      inline extent.
      
      The single page would be compressed but a later check would drop it
      again because the result size must be at least one block shorter than
      the input. That can never work with just one page.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f2165627
    • Naohiro Aota's avatar
      btrfs: fix unbalanced unlock in qgroup_account_snapshot() · 44365827
      Naohiro Aota authored
      qgroup_account_snapshot() is trying to unlock the not taken
      tree_log_mutex in a error path. Since ret != 0 in this case, we can
      just return from here.
      
      Fixes: 2a4d84c1 ("btrfs: move delayed ref flushing for qgroup into qgroup helper")
      CC: stable@vger.kernel.org # 5.12+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      44365827
    • David Sterba's avatar
      btrfs: sysfs: export dev stats in devinfo directory · da658b57
      David Sterba authored
      The device stats can be read by ioctl, wrapped by command 'btrfs device
      stats'. Provide another source where to read the information in
      /sys/fs/btrfs/FSID/devinfo/DEVID/error_stats . The format is a list of
      'key value' pairs one per line, which is common in other stat files.
      The names are the same as used in other device stat outputs.
      
      The stats are all in one file as it's the snapshot of all available
      stats. The 'one value per file' format is not very suitable here. The
      stats should be valid right after the stats item is read from disk,
      shortly after initializing the device.
      
      In case the stats are not yet valid, print just 'invalid' as the file
      contents.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      da658b57
    • David Sterba's avatar
      btrfs: fix typos in comments · 1a9fd417
      David Sterba authored
      Fix typos that have snuck in since the last round. Found by codespell.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1a9fd417
    • Qu Wenruo's avatar
      btrfs: remove a stale comment for btrfs_decompress_bio() · c86bdc9b
      Qu Wenruo authored
      Since commit 8140dc30 ("btrfs: btrfs_decompress_bio() could accept
      compressed_bio instead"), btrfs_decompress_bio() accepts
      "struct compressed_bio" other than open-coded parameter list.
      
      Thus the comments for the parameter list is no longer needed.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c86bdc9b
    • Baokun Li's avatar
      btrfs: send: use list_move_tail instead of list_del/list_add_tail · bb930007
      Baokun Li authored
      Use list_move_tail() instead of list_del() + list_add_tail() as it's
      doing the same thing and allows further cleanups.  Open code
      name_cache_used() as there is only one user.
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bb930007
    • Christophe Leroy's avatar
      btrfs: disable build on platforms having page size 256K · b05fbcc3
      Christophe Leroy authored
      With a config having PAGE_SIZE set to 256K, BTRFS build fails
      with the following message
      
        include/linux/compiler_types.h:326:38: error: call to
        '__compiletime_assert_791' declared with attribute error:
        BUILD_BUG_ON failed: (BTRFS_MAX_COMPRESSED % PAGE_SIZE) != 0
      
      BTRFS_MAX_COMPRESSED being 128K, BTRFS cannot support platforms with
      256K pages at the time being.
      
      There are two platforms that can select 256K pages:
       - hexagon
       - powerpc
      
      Disable BTRFS when 256K page size is selected. Supporting this would
      require changes to the subpage mode that's currently being developed.
      Given that 256K is many times larger than page sizes commonly used and
      for what the algorithms and structures have been tuned, it's out of
      scope and disabling build is a reasonable option.
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Signed-off-by: default avatarChristophe Leroy <christophe.leroy@csgroup.eu>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b05fbcc3
    • Filipe Manana's avatar
      btrfs: send: fix invalid path for unlink operations after parent orphanization · d8ac76cd
      Filipe Manana authored
      During an incremental send operation, when processing the new references
      for the current inode, we might send an unlink operation for another inode
      that has a conflicting path and has more than one hard link. However this
      path was computed and cached before we processed previous new references
      for the current inode. We may have orphanized a directory of that path
      while processing a previous new reference, in which case the path will
      be invalid and cause the receiver process to fail.
      
      The following reproducer triggers the problem and explains how/why it
      happens in its comments:
      
        $ cat test-send-unlink.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV >/dev/null
        mount $DEV $MNT
      
        # Create our test files and directory. Inode 259 (file3) has two hard
        # links.
        touch $MNT/file1
        touch $MNT/file2
        touch $MNT/file3
      
        mkdir $MNT/A
        ln $MNT/file3 $MNT/A/hard_link
      
        # Filesystem looks like:
        #
        # .                                     (ino 256)
        # |----- file1                          (ino 257)
        # |----- file2                          (ino 258)
        # |----- file3                          (ino 259)
        # |----- A/                             (ino 260)
        #        |---- hard_link                (ino 259)
        #
      
        # Now create the base snapshot, which is going to be the parent snapshot
        # for a later incremental send.
        btrfs subvolume snapshot -r $MNT $MNT/snap1
        btrfs send -f /tmp/snap1.send $MNT/snap1
      
        # Move inode 257 into directory inode 260. This results in computing the
        # path for inode 260 as "/A" and caching it.
        mv $MNT/file1 $MNT/A/file1
      
        # Move inode 258 (file2) into directory inode 260, with a name of
        # "hard_link", moving first inode 259 away since it currently has that
        # location and name.
        mv $MNT/A/hard_link $MNT/tmp
        mv $MNT/file2 $MNT/A/hard_link
      
        # Now rename inode 260 to something else (B for example) and then create
        # a hard link for inode 258 that has the old name and location of inode
        # 260 ("/A").
        mv $MNT/A $MNT/B
        ln $MNT/B/hard_link $MNT/A
      
        # Filesystem now looks like:
        #
        # .                                     (ino 256)
        # |----- tmp                            (ino 259)
        # |----- file3                          (ino 259)
        # |----- B/                             (ino 260)
        # |      |---- file1                    (ino 257)
        # |      |---- hard_link                (ino 258)
        # |
        # |----- A                              (ino 258)
      
        # Create another snapshot of our subvolume and use it for an incremental
        # send.
        btrfs subvolume snapshot -r $MNT $MNT/snap2
        btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
      
        # Now unmount the filesystem, create a new one, mount it and try to
        # apply both send streams to recreate both snapshots.
        umount $DEV
      
        mkfs.btrfs -f $DEV >/dev/null
      
        mount $DEV $MNT
      
        # First add the first snapshot to the new filesystem by applying the
        # first send stream.
        btrfs receive -f /tmp/snap1.send $MNT
      
        # The incremental receive operation below used to fail with the
        # following error:
        #
        #    ERROR: unlink A/hard_link failed: No such file or directory
        #
        # This is because when send is processing inode 257, it generates the
        # path for inode 260 as "/A", since that inode is its parent in the send
        # snapshot, and caches that path.
        #
        # Later when processing inode 258, it first processes its new reference
        # that has the path of "/A", which results in orphanizing inode 260
        # because there is a a path collision. This results in issuing a rename
        # operation from "/A" to "/o260-6-0".
        #
        # Finally when processing the new reference "B/hard_link" for inode 258,
        # it notices that it collides with inode 259 (not yet processed, because
        # it has a higher inode number), since that inode has the name
        # "hard_link" under the directory inode 260. It also checks that inode
        # 259 has two hardlinks, so it decides to issue a unlink operation for
        # the name "hard_link" for inode 259. However the path passed to the
        # unlink operation is "/A/hard_link", which is incorrect since currently
        # "/A" does not exists, due to the orphanization of inode 260 mentioned
        # before. The path is incorrect because it was computed and cached
        # before the orphanization. This results in the receiver to fail with
        # the above error.
        btrfs receive -f /tmp/snap2.send $MNT
      
        umount $MNT
      
      When running the test, it fails like this:
      
        $ ./test-send-unlink.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: unlink A/hard_link failed: No such file or directory
      
      Fix this by recomputing a path before issuing an unlink operation when
      processing the new references for the current inode if we previously
      have orphanized a directory.
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d8ac76cd
  2. 21 Jun, 2021 26 commits
    • David Sterba's avatar
      btrfs: inline wait_current_trans_commit_start in its caller · ae5d29d4
      David Sterba authored
      Function wait_current_trans_commit_start is now fairly trivial so it can
      be inlined in its only caller.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ae5d29d4
    • David Sterba's avatar
      btrfs: sink wait_for_unblock parameter to async commit · 32cc4f87
      David Sterba authored
      There's only one caller left btrfs_ioctl_start_sync that passes 0, so we
      can remove the switch in btrfs_commit_transaction_async.
      
      A cleanup 9babda9f ("btrfs: Remove async_transid from
      btrfs_mksubvol/create_subvol/create_snapshot") removed calls that passed
      1, so this is a followup.
      
      As this removes last call of wait_current_trans_commit_start_and_unblock,
      remove the function as well.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      32cc4f87
    • Nathan Chancellor's avatar
      btrfs: remove total_data_size variable in btrfs_batch_insert_items() · bfaa324e
      Nathan Chancellor authored
      clang warns:
      
        fs/btrfs/delayed-inode.c:684:6: warning: variable 'total_data_size' set
        but not used [-Wunused-but-set-variable]
      	  int total_data_size = 0, total_size = 0;
      	      ^
        1 warning generated.
      
      This variable's value has been unused since commit fc0d82e1 ("btrfs:
      sink total_data parameter in setup_items_for_insert"). Eliminate it.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/1391Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarNathan Chancellor <nathan@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bfaa324e
    • Nikolay Borisov's avatar
      btrfs: eliminate insert label in add_falloc_range · 77d25534
      Nikolay Borisov authored
      By way of inverting the list_empty conditional the insert label can be
      eliminated, making the function's flow entirely linear.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      77d25534
    • Qu Wenruo's avatar
      btrfs: subpage: fix a rare race between metadata endio and eb freeing · 3d078efa
      Qu Wenruo authored
      [BUG]
      There is a very rare ASSERT() triggering during full fstests run for
      subpage rw support.
      
      No other reproducer so far.
      
      The ASSERT() gets triggered for metadata read in
      btrfs_page_set_uptodate() inside end_page_read().
      
      [CAUSE]
      There is still a small race window for metadata only, the race could
      happen like this:
      
                      T1                  |              T2
      ------------------------------------+-----------------------------
      end_bio_extent_readpage()           |
      |- btrfs_validate_metadata_buffer() |
      |  |- free_extent_buffer()          |
      |     Still have 2 refs             |
      |- end_page_read()                  |
         |- if (unlikely(PagePrivate())   |
         |  The page still has Private    |
         |                                | free_extent_buffer()
         |                                | |  Only one ref 1, will be
         |                                | |  released
         |                                | |- detach_extent_buffer_page()
         |                                |    |- btrfs_detach_subpage()
         |- btrfs_set_page_uptodate()     |
            The page no longer has Private|
            >>> ASSERT() triggered <<<    |
      
      This race window is super small, thus pretty hard to hit, even with so
      many runs of fstests.
      
      But the race window is still there, we have to go another way to solve
      it other than relying on random PagePrivate() check.
      
      Data path is not affected, as it will lock the page before reading,
      while unlocking the page after the last read has finished, thus no race
      window.
      
      [FIX]
      This patch will fix the bug by repurposing btrfs_subpage::readers.
      
      Now btrfs_subpage::readers will be a member shared by both metadata and
      data.
      
      For metadata path, we don't do the page unlock as metadata only relies
      on extent locking.
      
      At the same time, teach page_range_has_eb() to take
      btrfs_subpage::readers into consideration.
      
      So that even if the last eb of a page gets freed, page::private won't be
      detached as long as there still are pending end_page_read() calls.
      
      By this we eliminate the race window, this will slight increase the
      metadata memory usage, as the page may not be released as frequently as
      usual.  But it should not be a big deal.
      
      The code got introduced in ("btrfs: submit read time repair only for
      each corrupted sector"), but the fix is in a separate patch to keep the
      problem description and the crash is rare so it should not hurt
      bisectability.
      Signed-off-by: default avatarQu Wegruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3d078efa
    • Qu Wenruo's avatar
      btrfs: don't clear page extent mapped if we're not invalidating the full page · bcd77455
      Qu Wenruo authored
      [BUG]
      With current btrfs subpage rw support, the following script can lead to
      fs hang:
      
        $ mkfs.btrfs -f -s 4k $dev
        $ mount $dev -o nospace_cache $mnt
        $ fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt
      
      The fs will hang at btrfs_start_ordered_extent().
      
      [CAUSE]
      In above test case, btrfs_invalidate() will be called with the following
      parameters:
      
        offset = 0 length = 53248 page dirty = 1 subpage dirty bitmap = 0x2000
      
      Since @offset is 0, btrfs_invalidate() will try to invalidate the full
      page, and finally call clear_page_extent_mapped() which will detach
      subpage structure from the page.
      
      And since the page no longer has subpage structure, the subpage dirty
      bitmap will be cleared, preventing the dirty range from being written
      back, thus no way to wake up the ordered extent.
      
      [FIX]
      Just follow other filesystems, only to invalidate the page if the range
      covers the full page.
      
      There are cases like truncate_setsize() which can call
      btrfs_invalidatepage() with offset == 0 and length != 0 for the last
      page of an inode.
      
      Although the old code will still try to invalidate the full page, we are
      still safe to just wait for ordered extent to finish.
      So it shouldn't cause extra problems.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bcd77455
    • Qu Wenruo's avatar
      btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() · 0528476b
      Qu Wenruo authored
      [BUG]
      With current subpage RW support, the following script can hang the fs
      with 64K page size.
      
       # mkfs.btrfs -f -s 4k $dev
       # mount $dev -o nospace_cache $mnt
       # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt
      
      The kernel will do an infinite loop in btrfs_punch_hole_lock_range().
      
      [CAUSE]
      In btrfs_punch_hole_lock_range() we:
      
      - Truncate page cache range
      - Lock extent io tree
      - Wait any ordered extents in the range.
      
      We exit the loop until we meet all the following conditions:
      
      - No ordered extent in the lock range
      - No page is in the lock range
      
      The latter condition has a pitfall, it only works for sector size ==
      PAGE_SIZE case.
      
      While can't handle the following subpage case:
      
        0       32K     64K     96K     128K
        |       |///////||//////|       ||
      
      lockstart=32K
      lockend=96K - 1
      
      In this case, although the range crosses 2 pages,
      truncate_pagecache_range() will invalidate no page at all, but only zero
      the [32K, 96K) range of the two pages.
      
      Thus filemap_range_has_page(32K, 96K-1) will always return true, thus we
      will never meet the loop exit condition.
      
      [FIX]
      Fix the problem by doing page alignment for the lock range.
      
      Function filemap_range_has_page() has already handled lend < lstart
      case, we only need to round up @lockstart, and round_down @lockend for
      truncate_pagecache_range().
      
      This modification should not change any thing for sector size ==
      PAGE_SIZE case, as in that case our range is already page aligned.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0528476b
    • Qu Wenruo's avatar
      btrfs: reflink: make copy_inline_to_page() to be subpage compatible · 3115deb3
      Qu Wenruo authored
      The modifications are:
      
      - Page copy destination
        For subpage case, one page can contain multiple sectors, thus we can
        no longer expect the memcpy_to_page()/btrfs_decompress() to copy
        data into page offset 0.
        The correct offset is offset_in_page(file_offset) now, which should
        handle both regular sectorsize and subpage cases well.
      
      - Page status update
        Now we need to use subpage helper to handle the page status update.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3115deb3
    • Qu Wenruo's avatar
      btrfs: make btrfs_page_mkwrite() to be subpage compatible · 2d8ec40e
      Qu Wenruo authored
      Only set_page_dirty() and SetPageUptodate() is not subpage compatible.
      Convert them to subpage helpers, so that __extent_writepage_io() can
      submit page content correctly.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2d8ec40e
    • Qu Wenruo's avatar
      btrfs: make btrfs_truncate_block() to be subpage compatible · 6c9ac8be
      Qu Wenruo authored
      btrfs_truncate_block() itself is already mostly subpage compatible, the
      only missing part is the page dirtying code.
      
      Currently if we have a sector that needs to be truncated, we set the
      sector aligned range delalloc, then set the full page dirty.
      
      The problem is, current subpage code requires subpage dirty bit to be
      set, or __extent_writepage_io() won't submit bio, thus leads to ordered
      extent never to finish.
      
      So this patch will make btrfs_truncate_block() to call
      btrfs_page_set_dirty() helper to replace set_page_dirty() to fix the
      problem.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6c9ac8be
    • Qu Wenruo's avatar
      btrfs: make __extent_writepage_io() only submit dirty range for subpage · c5ef5c6c
      Qu Wenruo authored
      __extent_writepage_io() function originally just iterates through all
      the extent maps of a page, and submits any regular extents.
      
      This is fine for sectorsize == PAGE_SIZE case, as if a page is dirty, we
      need to submit the only sector contained in the page.
      
      But for subpage case, one dirty page can contain several clean sectors
      with at least one dirty sector.
      
      If __extent_writepage_io() still submit all regular extent maps, it can
      submit data which is already written to disk.
      And since such already written data won't have corresponding ordered
      extents, it will trigger a BUG_ON() in btrfs_csum_one_bio().
      
      Change the behavior of __extent_writepage_io() by finding the first
      dirty byte in the page, and only submit the dirty range other than the
      full extent.
      
      Since we're also here, also modify the following calls to be subpage
      compatible:
      
      - SetPageError()
      - end_page_writeback()
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c5ef5c6c
    • Qu Wenruo's avatar
      btrfs: make btrfs_set_range_writeback() subpage compatible · d2a91064
      Qu Wenruo authored
      Function btrfs_set_range_writeback() currently just sets the page
      writeback unconditionally.
      
      Change it to call the subpage helper so that we can handle both cases
      well.
      
      Since the subpage helpers needs btrfs_fs_info, also change the parameter
      to accept btrfs_inode.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d2a91064
    • Qu Wenruo's avatar
      btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() · 4750af3b
      Qu Wenruo authored
      In cow_file_range(), after we have succeeded creating an inline extent,
      we unlock the page with extent_clear_unlock_delalloc() by passing
      locked_page == NULL.
      
      For sectorsize == PAGE_SIZE case, this is just making the page lock and
      unlock harder to grab.
      
      But for incoming subpage case, it can be a big problem.
      
      For incoming subpage case, page locking have two entry points:
      
      - __process_pages_contig()
        In that case, we know exactly the range we want to lock (which only
        requires sector alignment).
        To handle the subpage requirement, we introduce btrfs_subpage::writers
        to page::private, and will update it in __process_pages_contig().
      
      - Other directly lock/unlock_page() call sites
        Those won't touch btrfs_subpage::writers at all.
      
      This means, page locked by __process_pages_contig() can only be unlocked
      by __process_pages_contig().
      Thankfully we already have the existing infrastructure in the form of
      @locked_page in various call sites.
      
      Unfortunately, extent_clear_unlock_delalloc() in cow_file_range() after
      creating an inline extent is the exception.
      It intentionally call extent_clear_unlock_delalloc() with locked_page ==
      NULL, to also unlock current page (and clear its dirty/writeback bits).
      
      To co-operate with incoming subpage modifications, and make the page
      lock/unlock pair easier to understand, this patch will still call
      extent_clear_unlock_delalloc() with locked_page, and only unlock the
      page in __extent_writepage().
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4750af3b
    • Qu Wenruo's avatar
      btrfs: update locked page dirty/writeback/error bits in __process_pages_contig · a33a8e9a
      Qu Wenruo authored
      When __process_pages_contig() gets called for
      extent_clear_unlock_delalloc(), if we hit the locked page, only Private2
      bit is updated, but dirty/writeback/error bits are all skipped.
      
      There are several call sites that call extent_clear_unlock_delalloc()
      with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK
      
      - cow_file_range()
      - run_delalloc_nocow()
      - cow_file_range_async()
        All for their error handling branches.
      
      For those call sites, since we skip the locked page for
      dirty/error/writeback bit update, the locked page will still have its
      subpage dirty bit remaining.
      
      Normally it's the call sites which locked the page to handle the locked
      page, but it won't hurt if we also do the update.
      
      Especially there are already other call sites doing the same thing by
      manually passing NULL as locked_page.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a33a8e9a
    • Qu Wenruo's avatar
      btrfs: make page Ordered bit to be subpage compatible · b945a463
      Qu Wenruo authored
      This involves the following modification:
      
      - Ordered extent creation
        This is done in process_one_page(), now PAGE_SET_ORDERED will call
        subpage helper to do the work.
      
      - endio functions
        This is done in btrfs_mark_ordered_io_finished().
      
      - btrfs_invalidatepage()
      
      - btrfs_cleanup_ordered_extents()
        Use the subpage page helper, and add an extra branch to exit if the
        locked page have covered the full range.
      
      Now the usage of page Ordered flag for ordered extent accounting is fully
      subpage compatible.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b945a463
    • Qu Wenruo's avatar
      btrfs: introduce helpers for subpage ordered status · 6f17400b
      Qu Wenruo authored
      This patch introduces the following functions to handle btrfs subpage
      ordered (Private2) status:
      
      - btrfs_subpage_set_ordered()
      - btrfs_subpage_clear_ordered()
      - btrfs_subpage_test_ordered()
        These helpers can only be called when the range is ensured to be
        inside the page.
      
      - btrfs_page_set_ordered()
      - btrfs_page_clear_ordered()
      - btrfs_page_test_ordered()
        These helpers can handle both regular sector size and subpage without
        problem.
      
      These functions are here to coordinate btrfs_invalidatepage() with
      btrfs_writepage_endio_finish_ordered(), to make sure only one of those
      functions can finish the ordered extent.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6f17400b
    • Qu Wenruo's avatar
      btrfs: make process_one_page() to handle subpage locking · 1e1de387
      Qu Wenruo authored
      Introduce a new data inodes specific subpage member, writers, to record
      how many sectors are under page lock for delalloc writing.
      
      This member acts pretty much the same as readers, except it's only for
      delalloc writes.
      
      This is important for delalloc code to trace which page can really be
      freed, as we have cases like run_delalloc_nocow() where we may exit
      processing nocow range inside a page, but need to exit to do cow half
      way.
      In that case, we need a way to determine if we can really unlock a full
      page.
      
      With the new btrfs_subpage::writers, there is a new requirement:
      - Page locked by process_one_page() must be unlocked by
        process_one_page()
        There are still tons of call sites manually lock and unlock a page,
        without updating btrfs_subpage::writers.
        So if we lock a page through process_one_page() then it must be
        unlocked by process_one_page() to keep btrfs_subpage::writers
        consistent.
      
        This will be handled in next patch.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1e1de387
    • Qu Wenruo's avatar
      btrfs: make end_bio_extent_writepage() to be subpage compatible · 9047e317
      Qu Wenruo authored
      Now in end_bio_extent_writepage(), the only subpage incompatible code is
      the end_page_writeback().
      
      Just call the subpage helpers.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9047e317
    • Qu Wenruo's avatar
      btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status · e38992be
      Qu Wenruo authored
      For __process_pages_contig() and process_one_page(), to handle subpage
      we only need to pass bytenr in and call subpage helpers to handle
      dirty/error/writeback status.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e38992be
    • Qu Wenruo's avatar
      btrfs: make btrfs_dirty_pages() to be subpage compatible · f02a85d2
      Qu Wenruo authored
      Since the extent io tree operations in btrfs_dirty_pages() are already
      subpage compatible, we only need to make the page status update to use
      subpage helpers.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f02a85d2
    • Qu Wenruo's avatar
      btrfs: only require sector size alignment for end_bio_extent_writepage() · 321a02db
      Qu Wenruo authored
      Just like read page, for subpage support we only require sector size
      alignment.
      
      So change the error message condition to only require sector alignment.
      
      This should not affect existing code, as for regular sectorsize ==
      PAGE_SIZE case, we are still requiring page alignment.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      321a02db
    • Qu Wenruo's avatar
      btrfs: provide btrfs_page_clamp_*() helpers · 60e2d255
      Qu Wenruo authored
      In the coming subpage RW supports, there are a lot of page status update
      calls which need to be converted to subpage compatible version, which
      needs @start and @len.
      
      Some call sites already have such @start/@len and are already in
      page range, like various endio functions.
      
      But there are also call sites which need to clamp the range for subpage
      case, like btrfs_dirty_pagse() and __process_contig_pages().
      
      Here we introduce new helpers, btrfs_page_clamp_*(), to do and only do the
      clamp for subpage version.
      
      Although in theory all existing btrfs_page_*() calls can be converted to
      use btrfs_page_clamp_*() directly, but that would make us to do
      unnecessary clamp operations.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      60e2d255
    • Qu Wenruo's avatar
      btrfs: refactor page status update into process_one_page() · ed8f13bf
      Qu Wenruo authored
      In __process_pages_contig() we update page status according to page_ops.
      
      That update process is a bunch of 'if' branches, which lie inside
      two loops, this makes it pretty hard to expand for later subpage
      operations.
      
      So this patch will extract these operations into its own function,
      process_one_pages().
      
      Also since we're refactoring __process_pages_contig(), also move the new
      helper and __process_pages_contig() before the first caller of them, to
      remove the forward declaration.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ed8f13bf
    • Qu Wenruo's avatar
      btrfs: pass bytenr directly to __process_pages_contig() · 98af9ab1
      Qu Wenruo authored
      As a preparation for incoming subpage support, we need bytenr passed to
      __process_pages_contig() directly, not the current page index.
      
      So change the parameter and all callers to pass bytenr in.
      
      With the modification, here we need to replace the old @index_ret with
      @processed_end for __process_pages_contig(), but this brings a small
      problem.
      
      Normally we follow the inclusive return value, meaning @processed_end
      should be the last byte we processed.
      
      If parameter @start is 0, and we failed to lock any page, then we would
      return @processed_end as -1, causing more problems for
      __unlock_for_delalloc().
      
      So here for @processed_end, we use two different return value patterns.
      If we have locked any page, @processed_end will be the last byte of
      locked page.
      Or it will be @start otherwise.
      
      This change will impact lock_delalloc_pages(), so it needs to check
      @processed_end to only unlock the range if we have locked any.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      98af9ab1
    • Qu Wenruo's avatar
      btrfs: fix hang when run_delalloc_range() failed · 968f2566
      Qu Wenruo authored
      [BUG]
      When running subpage preparation patches on x86, btrfs/125 will hang
      forever with one ordered extent never finished.
      
      [CAUSE]
      The test case btrfs/125 itself will always fail as the fix is never merged.
      
      When the test fails at balance, btrfs needs to cleanup the ordered
      extent in btrfs_cleanup_ordered_extents() for data reloc inode.
      
      The problem is in the sequence how we cleanup the page Order bit.
      
      Currently it works like:
      
        btrfs_cleanup_ordered_extents()
        |- find_get_page();
        |- btrfs_page_clear_ordered(page);
        |  Now the page doesn't have Ordered bit anymore.
        |  !!! This also includes the first (locked) page !!!
        |
        |- offset += PAGE_SIZE
        |  This is to skip the first page
        |- __endio_write_update_ordered()
           |- btrfs_mark_ordered_io_finished(NULL)
              Except the first page, all ordered extents are finished.
      
      Then the locked page is cleaned up in __extent_writepage():
      
        __extent_writepage()
        |- If (PageError(page))
        |- end_extent_writepage()
           |- btrfs_mark_ordered_io_finished(page)
              |- if (btrfs_test_page_ordered(page))
              |-  !!! The page gets skipped !!!
                  The ordered extent is not decreased as the page doesn't
                  have ordered bit anymore.
      
      This leaves the ordered extent with bytes_left == sectorsize, thus never
      finish.
      
      [FIX]
      The fix is to ensure we never clear page Ordered bit without running the
      ordered extent accounting.
      
      Here we choose to skip the locked page in
      btrfs_cleanup_ordered_extents() so that later end_extent_writepage() can
      properly finish the ordered extent.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      968f2566
    • Qu Wenruo's avatar
      btrfs: rename PagePrivate2 to PageOrdered inside btrfs · f57ad937
      Qu Wenruo authored
      Inside btrfs we use Private2 page status to indicate we have an ordered
      extent with pending IO for the sector.
      
      But the page status name, Private2, tells us nothing about the bit
      itself, so this patch will rename it to Ordered.
      And with extra comment about the bit added, so reader who is still
      uncertain about the page Ordered status, will find the comment pretty
      easily.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f57ad937