1. 23 Aug, 2021 40 commits
    • Naohiro Aota's avatar
      btrfs: zoned: add asserts on splitting extent_map · 63fb5879
      Naohiro Aota authored
      We call split_zoned_em() on an extent_map on submitting a bio for it. Thus,
      we can assume the extent_map is PINNED, not LOGGING, and in the modified
      list. Add ASSERT()s to ensure the extent_maps after the split also has the
      proper flags set and are in the modified list.
      Suggested-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      63fb5879
    • Naohiro Aota's avatar
      btrfs: zoned: fix block group alloc_offset calculation · 0ae79c6f
      Naohiro Aota authored
      alloc_offset is offset from the start of a block group and @offset is
      actually an address in logical space. Thus, we need to consider
      block_group->start when calculating them.
      
      Fixes: 011b41bf ("btrfs: zoned: advance allocation pointer after tree log node")
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0ae79c6f
    • Naohiro Aota's avatar
      btrfs: zoned: suppress reclaim error message on EAGAIN · ba86dd9f
      Naohiro Aota authored
      btrfs_relocate_chunk() can fail with -EAGAIN when e.g. send operations are
      running. The message can fail btrfs/187 and it's unnecessary because we
      anyway add it back to the reclaim list.
      
      btrfs_reclaim_bgs_work()
      `-> btrfs_relocate_chunk()
          `-> btrfs_relocate_block_group()
              `-> reloc_chunk_start()
                  `-> if (fs_info->send_in_progress)
                      `-> return -EAGAIN
      
      CC: stable@vger.kernel.org # 5.13+
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ba86dd9f
    • Johannes Thumshirn's avatar
      btrfs: zoned: allow disabling of zone auto reclaim · 77233c2d
      Johannes Thumshirn authored
      Automatically reclaiming dirty zones might not always be desired for all
      workloads, especially as there are currently still some rough edges with
      the relocation code on zoned filesystems.
      
      Allow disabling zone auto reclaim on a per filesystem basis by writing 0
      as the threshold value.
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      77233c2d
    • Filipe Manana's avatar
      btrfs: update comment at log_conflicting_inodes() · 1f295373
      Filipe Manana authored
      A comment at log_conflicting_inodes() mentions that we check the inode's
      logged_trans field instead of using btrfs_inode_in_log() because the field
      last_log_commit is not updated when we log that an inode exists and the
      inode has the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) set. The part
      about the full sync flag is not true anymore since commit 9acc8103
      ("btrfs: fix unpersisted i_size on fsync after expanding truncate"), so
      update the comment to not mention that part anymore.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1f295373
    • Filipe Manana's avatar
      btrfs: remove no longer needed full sync flag check at inode_logged() · d135a533
      Filipe Manana authored
      Now that we are checking if the inode's logged_trans is 0 to detect the
      possibility of the inode having been evicted and reloaded, the test for
      the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) is no longer needed at
      tree-log.c:inode_logged(). Its purpose was to detect the possibility
      of a previous eviction as well, since when an inode is loaded the full
      sync flag is always set on it (and only cleared after the inode is
      logged).
      
      So just remove the check and update the comment. The check for the inode's
      logged_trans being 0 was added recently by the patch with the subject
      "btrfs: eliminate some false positives when checking if inode was logged".
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d135a533
    • Filipe Manana's avatar
      btrfs: remove unnecessary NULL check for the new inode during rename exchange · 1c167b87
      Filipe Manana authored
      At the very end of btrfs_rename_exchange(), in case an error happened, we
      are checking if 'new_inode' is NULL, but that is not needed since during a
      rename exchange, unlike regular renames, 'new_inode' can never be NULL,
      and if it were, we would have a crashed much earlier when we dereference it
      multiple times.
      
      So remove the check because it is not necessary and because it is causing
      static checkers to emit a warning. I probably introduced the check by
      copy-pasting similar code from btrfs_rename(), where 'new_inode' can be
      NULL, in commit 86e8aa0e ("Btrfs: unpin logs if rename exchange
      operation fails").
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1c167b87
    • Goldwyn Rodrigues's avatar
      btrfs: allocate backref_ctx on stack in find_extent_clone · dce28150
      Goldwyn Rodrigues authored
      Instead of using kmalloc() to allocate backref_ctx, allocate backref_ctx
      on stack. The size is reasonably small.
      
      sizeof(backref_ctx) = 48
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dce28150
    • Goldwyn Rodrigues's avatar
      btrfs: allocate btrfs_ioctl_defrag_range_args on stack · c853a578
      Goldwyn Rodrigues authored
      Instead of using kmalloc() to allocate btrfs_ioctl_defrag_range_args,
      allocate btrfs_ioctl_defrag_range_args on stack, the size is reasonably
      small and ioctls are called in process context.
      
      sizeof(btrfs_ioctl_defrag_range_args) = 48
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c853a578
    • Goldwyn Rodrigues's avatar
      btrfs: allocate btrfs_ioctl_quota_rescan_args on stack · 0afb603a
      Goldwyn Rodrigues authored
      Instead of using kmalloc() to allocate btrfs_ioctl_quota_rescan_args,
      allocate btrfs_ioctl_quota_rescan_args on stack, the size is reasonably
      small and ioctls are called in process context.
      
      sizeof(btrfs_ioctl_quota_rescan_args) = 64
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0afb603a
    • Goldwyn Rodrigues's avatar
      btrfs: allocate file_ra_state on stack in readahead_cache · 98caf953
      Goldwyn Rodrigues authored
      Instead of allocating file_ra_state using kmalloc, allocate on stack.
      sizeof(struct readahead) = 32 bytes.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      98caf953
    • Marcos Paulo de Souza's avatar
      btrfs: introduce btrfs_search_backwards function · 0ff40a91
      Marcos Paulo de Souza authored
      It's a common practice to start a search using offset (u64)-1, which is
      the u64 maximum value, meaning that we want the search_slot function to
      be set in the last item with the same objectid and type.
      
      Once we are in this position, it's a matter to start a search backwards
      by calling btrfs_previous_item, which will check if we'll need to go to
      a previous leaf and other necessary checks, only to be sure that we are
      in last offset of the same object and type.
      
      The new btrfs_search_backwards function does the all these steps when
      necessary, and can be used to avoid code duplication.
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0ff40a91
    • David Sterba's avatar
      btrfs: print if fsverity support is built in when loading module · ea3dc7d2
      David Sterba authored
      As fsverity support depends on a config option, print that at module
      load time like we do for similar features.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ea3dc7d2
    • Boris Burkov's avatar
      btrfs: verity metadata orphan items · 70524253
      Boris Burkov authored
      Writing out the verity data is too large of an operation to do in a
      single transaction. If we are interrupted before we finish creating
      fsverity metadata for a file, or fail to clean up already created
      metadata after a failure, we could leak the verity items that we already
      committed.
      
      To address this issue, we use the orphan mechanism. When we start
      enabling verity on a file, we also add an orphan item for that inode.
      When we are finished, we delete the orphan. However, if we are
      interrupted midway, the orphan will be present at mount and we can
      cleanup the half-formed verity state.
      
      There is a possible race with a normal unlink operation: if unlink and
      verity run on the same file in parallel, it is possible for verity to
      succeed and delete the still legitimate orphan added by unlink. Then, if
      we are interrupted and mount in that state, we will never clean up the
      inode properly. This is also possible for a file created with O_TMPFILE.
      Check nlink==0 before deleting to avoid this race.
      
      A final thing to note is that this is a resurrection of using orphans to
      signal an operation besides "delete this inode". The old case was to
      signal the need to do a truncate. That case still technically applies
      for mounting very old file systems, so we need to take some care to not
      clobber it. To that end, we just have to be careful that verity orphan
      cleanup is a no-op for non-verity files.
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      70524253
    • Boris Burkov's avatar
      btrfs: initial fsverity support · 14605409
      Boris Burkov authored
      Add support for fsverity in btrfs. To support the generic interface in
      fs/verity, we add two new item types in the fs tree for inodes with
      verity enabled. One stores the per-file verity descriptor and btrfs
      verity item and the other stores the Merkle tree data itself.
      
      Verity checking is done in end_page_read just before a page is marked
      uptodate. This naturally handles a variety of edge cases like holes,
      preallocated extents, and inline extents. Some care needs to be taken to
      not try to verity pages past the end of the file, which are accessed by
      the generic buffered file reading code under some circumstances like
      reading to the end of the last page and trying to read again. Direct IO
      on a verity file falls back to buffered reads.
      
      Verity relies on PageChecked for the Merkle tree data itself to avoid
      re-walking up shared paths in the tree. For this reason, we need to
      cache the Merkle tree data. Since the file is immutable after verity is
      turned on, we can cache it at an index past EOF.
      
      Use the new inode ro_flags to store verity on the inode item, so that we
      can enable verity on a file, then rollback to an older kernel and still
      mount the file system and read the file. Since we can't safely write the
      file anymore without ruining the invariants of the Merkle tree, we mark
      a ro_compat flag on the file system when a file has verity enabled.
      Acked-by: default avatarEric Biggers <ebiggers@google.com>
      Co-developed-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      14605409
    • Boris Burkov's avatar
      btrfs: add ro compat flags to inodes · 77eea05e
      Boris Burkov authored
      Currently, inode flags are fully backwards incompatible in btrfs. If we
      introduce a new inode flag, then tree-checker will detect it and fail.
      This can even cause us to fail to mount entirely. To make it possible to
      introduce new flags which can be read-only compatible, like VERITY, we
      add new ro flags to btrfs without treating them quite so harshly in
      tree-checker. A read-only file system can survive an unexpected flag,
      and can be mounted.
      
      As for the implementation, it unfortunately gets a little complicated.
      
      The on-disk representation of the inode, btrfs_inode_item, has an __le64
      for flags but the in-memory representation, btrfs_inode, uses a u32.
      David Sterba had the nice idea that we could reclaim those wasted 32 bits
      on disk and use them for the new ro_compat flags.
      
      It turns out that the tree-checker code which checks for unknown flags
      is broken, and ignores the upper 32 bits we are hoping to use. The issue
      is that the flags use the literal 1 rather than 1ULL, so the flags are
      signed ints, and one of them is specifically (1 << 31). As a result, the
      mask which ORs the flags is a negative integer on machines where int is
      32 bit twos complement. When tree-checker evaluates the expression:
      
        btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK)
      
      The mask is something like 0x80000abc, which gets promoted to u64 with
      sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
      all the upper bits zeroed, and we can't detect unexpected flags.
      
      This suggests that we can't use those bits after all. Luckily, we have
      good reason to believe that they are zero anyway. Inode flags are
      metadata, which is always checksummed, so any bit flips that would
      introduce 1s would cause a checksum failure anyway (excluding the
      improbable case of the checksum getting corrupted exactly badly).
      
      Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
      inode flag should preserve its value and not add leading zeroes
      (at least for twos complement). The only place that flag
      (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
      the root item, and indeed for that inode we see 0xffffffff80000000 as
      the flags on disk. However, that inode is never seen by tree checker,
      nor is it used in a context where verity might be meaningful.
      Theoretically, a future ro flag might cause trouble on that inode, so we
      should proactively clean up that mess before it does.
      
      With the introduction of the new ro flags, keep two separate unsigned
      masks and check them against the appropriate u32. Since we no longer run
      afoul of sign extension, this also stops writing out 0xffffffff80000000
      in root_item inodes going forward.
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      77eea05e
    • Anand Jain's avatar
      btrfs: simplify return values in btrfs_check_raid_min_devices · efc222f8
      Anand Jain authored
      Function btrfs_check_raid_min_devices() returns error code from the enum
      btrfs_err_code and it starts from 1. So there is no need to check if ret
      is > 0. So drop this check and also drop the local variable ret.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      efc222f8
    • Qu Wenruo's avatar
      btrfs: remove the dead comment in writepage_delalloc() · 7361b4ae
      Qu Wenruo authored
      When btrfs_run_delalloc_range() failed, we will error out.
      
      But there is a strange comment mentioning that
      btrfs_run_delalloc_range() could have returned value >0 to indicate the
      IO has already started.
      
      Commit 40f76580 ("Btrfs: split up __extent_writepage to lower stack
      usage") introduced the comment, but unfortunately at that time, we were
      already using @page_started to indicate that case, and still return 0.
      
      Furthermore, even if that comment was right (which is not), we would
      return -EIO if the IO had already started.
      
      By all means the comment is incorrect, just remove the comment along
      with the dead check.
      
      Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to
      make sure we either return 0 or error, no positive return value.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7361b4ae
    • David Sterba's avatar
      btrfs: allow degenerate raid0/raid10 · b2f78e88
      David Sterba authored
      The data on raid0 and raid10 are supposed to be spread over multiple
      devices, so the minimum constraints are set to 2 and 4 respectively.
      This is an artificial limit and there's some interest to remove it.
      
      Change this to allow raid0 on one device and raid10 on two devices. This
      works as expected eg. when converting or removing devices.
      
      The only difference is when raid0 on two devices gets one device
      removed. Unpatched would silently create a single profile, while newly
      it would be raid0.
      
      The motivation is to allow to preserve the profile type as long as it
      possible for some intermediate state (device removal, conversion), or
      when there are disks of different size, with raid0 the otherwise
      unusable space of the last device will be used too. Similarly for
      raid10, though the two largest devices would need to be the same.
      
      Unpatched kernel will mount and use the degenerate profiles just fine
      but won't allow any operation that would not satisfy the stricter device
      number constraints, eg. not allowing to go from 3 to 2 devices for
      raid10 or various profile conversions.
      
      Example output:
      
        # btrfs fi us -T .
        Overall:
            Device size:                  10.00GiB
            Device allocated:              1.01GiB
            Device unallocated:            8.99GiB
            Device missing:                  0.00B
            Used:                        200.61MiB
            Free (estimated):              9.79GiB      (min: 9.79GiB)
            Free (statfs, df):             9.79GiB
            Data ratio:                       1.00
            Metadata ratio:                   1.00
            Global reserve:                3.25MiB      (used: 0.00B)
            Multiple profiles:                  no
      
      		Data      Metadata  System
        Id Path       RAID0     single    single   Unallocated
        -- ---------- --------- --------- -------- -----------
         1 /dev/sda10   1.00GiB   8.00MiB  1.00MiB     8.99GiB
        -- ---------- --------- --------- -------- -----------
           Total        1.00GiB   8.00MiB  1.00MiB     8.99GiB
           Used       200.25MiB 352.00KiB 16.00KiB
      
        # btrfs dev us .
        /dev/sda10, ID: 1
           Device size:            10.00GiB
           Device slack:              0.00B
           Data,RAID0/1:            1.00GiB
           Metadata,single:         8.00MiB
           System,single:           1.00MiB
           Unallocated:             8.99GiB
      
      Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per
      profile is printed.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b2f78e88
    • Filipe Manana's avatar
      btrfs: do not pin logs too early during renames · bd54f381
      Filipe Manana authored
      During renames we pin the logs of the roots a bit too early, before the
      calls to btrfs_insert_inode_ref(). We can pin the logs after those calls,
      since those will not change anything in a log tree.
      
      In a scenario where we have multiple and diverse filesystem operations
      running in parallel, those calls can take a significant amount of time,
      due to lock contention on extent buffers, and delay log commits from other
      tasks for longer than necessary.
      
      So just pin logs after calls to btrfs_insert_inode_ref() and right before
      the first operation that can update a log tree.
      
      The following script that uses dbench was used for testing:
      
        $ cat dbench-test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-m single -d single"
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $DEV &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 120 16
      
        umount $MNT
      
      The tests were run on a machine with 12 cores, 64G of RAN, a NVMe device
      and using a non-debug kernel config (Debian's default config).
      
      The results compare a branch without this patch and without the previous
      patch in the series, that has the subject:
      
       "btrfs: eliminate some false positives when checking if inode was logged"
      
      Versus the same branch with these two patches applied.
      
      dbench with 8 clients, results before:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4391359     0.009   249.745
       Close        3225882     0.001     3.243
       Rename        185953     0.065   240.643
       Unlink        886669     0.049   249.906
       Deltree          112     2.455   217.433
       Mkdir             56     0.002     0.004
       Qpathinfo    3980281     0.004     3.109
       Qfileinfo     697579     0.001     0.187
       Qfsinfo       729780     0.002     2.424
       Sfileinfo     357764     0.004     1.415
       Find         1538861     0.016     4.863
       WriteX       2189666     0.010     3.327
       ReadX        6883443     0.002     0.729
       LockX          14298     0.002     0.073
       UnlockX        14298     0.001     0.042
       Flush         307777     2.447   303.663
      
      Throughput 1149.6 MB/sec  8 clients  8 procs  max_latency=303.666 ms
      
      dbench with 8 clients, results after:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4269920     0.009   213.532
       Close        3136653     0.001     0.690
       Rename        180805     0.082   213.858
       Unlink        862189     0.050   172.893
       Deltree          112     2.998   218.328
       Mkdir             56     0.002     0.003
       Qpathinfo    3870158     0.004     5.072
       Qfileinfo     678375     0.001     0.194
       Qfsinfo       709604     0.002     0.485
       Sfileinfo     347850     0.004     1.304
       Find         1496310     0.017     5.504
       WriteX       2129613     0.010     2.882
       ReadX        6693066     0.002     1.517
       LockX          13902     0.002     0.075
       UnlockX        13902     0.001     0.055
       Flush         299276     2.511   220.189
      
      Throughput 1187.33 MB/sec  8 clients  8 procs  max_latency=220.194 ms
      
      +3.2% throughput, -31.8% max latency
      
      dbench with 16 clients, results before:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5978334     0.028   156.507
       Close        4391598     0.001     1.345
       Rename        253136     0.241   155.057
       Unlink       1207220     0.182   257.344
       Deltree          160     6.123    36.277
       Mkdir             80     0.003     0.005
       Qpathinfo    5418817     0.012     6.867
       Qfileinfo     949929     0.001     0.941
       Qfsinfo       993560     0.002     1.386
       Sfileinfo     486904     0.004     2.829
       Find         2095088     0.059     8.164
       WriteX       2982319     0.017     9.029
       ReadX        9371484     0.002     4.052
       LockX          19470     0.002     0.461
       UnlockX        19470     0.001     0.990
       Flush         418936     2.740   347.902
      
      Throughput 1495.31 MB/sec  16 clients  16 procs  max_latency=347.909 ms
      
      dbench with 16 clients, results after:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5711833     0.029   131.240
       Close        4195897     0.001     1.732
       Rename        241849     0.204   147.831
       Unlink       1153341     0.184   231.322
       Deltree          160     6.086    30.198
       Mkdir             80     0.003     0.021
       Qpathinfo    5177011     0.012     7.150
       Qfileinfo     907768     0.001     0.793
       Qfsinfo       949205     0.002     1.431
       Sfileinfo     465317     0.004     2.454
       Find         2001541     0.058     7.819
       WriteX       2850661     0.017     9.110
       ReadX        8952289     0.002     3.991
       LockX          18596     0.002     0.655
       UnlockX        18596     0.001     0.179
       Flush         400342     2.879   293.607
      
      Throughput 1565.73 MB/sec  16 clients  16 procs  max_latency=293.611 ms
      
      +4.6% throughput, -16.9% max latency
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bd54f381
    • Filipe Manana's avatar
      btrfs: eliminate some false positives when checking if inode was logged · 6e8e777d
      Filipe Manana authored
      When checking if an inode was previously logged in the current transaction
      through the helper inode_logged(), we can return some false positives that
      can be easily eliminated. These correspond to the cases where an inode has
      a ->logged_trans value that is not zero and its value is smaller then the
      ID of the current transaction. This means we know exactly that the inode
      was never logged before in the current transaction, so we can return false
      and avoid the callers to do extra work:
      
      1) Having btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log()
         unnecessarily join a log transaction and do deletion searches in a log
         tree that will not find anything. This just adds unnecessary contention
         on extent buffer locks;
      
      2) Having btrfs_log_new_name() unnecessarily log an inode when it is not
         needed. If the inode was not logged before, we don't need to log it in
         LOG_INODE_EXISTS mode.
      
      So just make sure that any false positive only happens when ->logged_trans
      has a value of 0.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6e8e777d
    • Naohiro Aota's avatar
      btrfs: drop unnecessary ASSERT from btrfs_submit_direct() · 42b5d73b
      Naohiro Aota authored
      When on SINGLE block group, btrfs_get_io_geometry() will return "the
      size of the block group - the offset of the logical address within the
      block group" as geom.len. Since we allow up to 8 GiB zone size on zoned
      filesystem, we can have up to 8 GiB block group, so can have up to 8 GiB
      geom.len as well. With this setup, we easily hit the "ASSERT(geom.len <=
      INT_MAX);".
      
      The ASSERT looks like to guard btrfs_bio_clone_partial() and bio_trim()
      which both take "int" (now u64 due to the previous patch). So to be
      precise the ASSERT should check if clone_len <= UINT_MAX. But actually,
      clone_len is already capped by bio.bi_iter.bi_size which is unsigned
      int. So the ASSERT is not necessary.
      
      Drop the ASSERT and properly compare submit_len and geom.len in u64.
      Then, let the implicit casting to convert it to u64.
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      42b5d73b
    • Chaitanya Kulkarni's avatar
      btrfs: fix argument type of btrfs_bio_clone_partial() · 21dda654
      Chaitanya Kulkarni authored
      The offset and can never be negative use unsigned int instead of int
      type for them.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      21dda654
    • Chaitanya Kulkarni's avatar
      block: fix argument type of bio_trim() · e83502ca
      Chaitanya Kulkarni authored
      The function bio_trim has offset and size arguments that are declared
      as int.
      
      The callers of this function use sector_t type when passing the offset
      and size, e.g. drivers/md/raid1.c:narrow_write_error() and
      drivers/md/raid1.c:narrow_write_error().
      
      Change offset and size arguments to sector_t type for bio_trim(). Also,
      add WARN_ON_ONCE() to catch their overflow.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e83502ca
    • Josef Bacik's avatar
      fs: kill sync_inode · 5662c967
      Josef Bacik authored
      Now that all users of sync_inode() have been deleted, remove
      sync_inode().
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5662c967
    • Josef Bacik's avatar
      9p: migrate from sync_inode to filemap_fdatawrite_wbc · 25d23cd0
      Josef Bacik authored
      We're going to remove sync_inode, so migrate to filemap_fdatawrite_wbc
      instead.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      25d23cd0
    • Josef Bacik's avatar
      btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking · b3776305
      Josef Bacik authored
      sync_inode() has some holes that can cause problems if we're under heavy
      ENOSPC pressure.  If there's writeback running on a separate thread
      sync_inode() will skip writing the inode altogether.  What we really
      want is to make sure writeback has been started on all the pages to make
      sure we can see the ordered extents and wait on them if appropriate.
      Switch to this new helper which will allow us to accomplish this and
      avoid ENOSPC'ing early.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b3776305
    • Josef Bacik's avatar
      fs: add a filemap_fdatawrite_wbc helper · 5a798493
      Josef Bacik authored
      Btrfs sometimes needs to flush dirty pages on a bunch of dirty inodes in
      order to reclaim metadata reservations.  Unfortunately most helpers in
      this area are too smart for us:
      
      1) The normal filemap_fdata* helpers only take range and sync modes, and
         don't give any indication of how much was written, so we can only
         flush full inodes, which isn't what we want in most cases.
      2) The normal writeback path requires us to have the s_umount sem held,
         but we can't unconditionally take it in this path because we could
         deadlock.
      3) The normal writeback path also skips inodes with I_SYNC set if we
         write with WB_SYNC_NONE.  This isn't the behavior we want under heavy
         ENOSPC pressure, we want to actually make sure the pages are under
         writeback before returning, and if another thread is in the middle of
         writing the file we may return before they're under writeback and
         miss our ordered extents and not properly wait for completion.
      4) sync_inode() uses the normal writeback path and has the same problem
         as #3.
      
      What we really want is to call do_writepages() with our wbc.  This way
      we can make sure that writeback is actually started on the pages, and we
      can control how many pages are written as a whole as we write many
      inodes using the same wbc.  Accomplish this with a new helper that does
      just that so we can use it for our ENOSPC flushing infrastructure.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5a798493
    • Josef Bacik's avatar
      btrfs: wait on async extents when flushing delalloc · e1646070
      Josef Bacik authored
      I've been debugging an early ENOSPC problem in production and finally
      root caused it to this problem.  When we switched to the per-inode in
      38d715f4 ("btrfs: use btrfs_start_delalloc_roots in
      shrink_delalloc") I pulled out the async extent handling, because we
      were doing the correct thing by calling filemap_flush() if we had async
      extents set.  This would properly wait on any async extents by locking
      the page in the second flush, thus making sure our ordered extents were
      properly set up.
      
      However when I switched us back to page based flushing, I used
      sync_inode(), which allows us to pass in our own wbc.  The problem here
      is that sync_inode() is smarter than the filemap_* helpers, it tries to
      avoid calling writepages at all.  This means that our second call could
      skip calling do_writepages altogether, and thus not wait on the pagelock
      for the async helpers.  This means we could come back before any ordered
      extents were created and then simply continue on in our flushing
      mechanisms and ENOSPC out when we have plenty of space to use.
      
      Fix this by putting back the async pages logic in shrink_delalloc.  This
      allows us to bulk write out everything that we need to, and then we can
      wait in one place for the async helpers to catch up, and then wait on
      any ordered extents that are created.
      
      Fixes: e076ab2a ("btrfs: shrink delalloc pages instead of full inodes")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e1646070
    • Josef Bacik's avatar
      btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc · 03fe78cc
      Josef Bacik authored
      We have been hitting some early ENOSPC issues in production with more
      recent kernels, and I tracked it down to us simply not flushing delalloc
      as aggressively as we should be.  With tracing I was seeing us failing
      all tickets with all of the block rsvs at or around 0, with very little
      pinned space, but still around 120MiB of outstanding bytes_may_used.
      Upon further investigation I saw that we were flushing around 14 pages
      per shrink call for delalloc, despite having around 2GiB of delalloc
      outstanding.
      
      Consider the example of a 8 way machine, all CPUs trying to create a
      file in parallel, which at the time of this commit requires 5 items to
      do.  Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
      size waiting on reservations.  Now assume we have 128MiB of delalloc
      outstanding.  With our current math we would set items to 20, and then
      set to_reclaim to 20 * 256k, or 5MiB.
      
      Assuming that we went through this loop all 3 times, for both
      FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
      twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
      leave a fair bit of delalloc reservations still hanging around by the
      time we go to ENOSPC out all the remaining tickets.
      
      Fix this two ways.  First, change the calculations to be a fraction of
      the total delalloc bytes on the system.  Prior to this change we were
      calculating based on dirty inodes so our math made more sense, now it's
      just completely unrelated to what we're actually doing.
      
      Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
      gone through the flush states at least once.  This will empty the system
      of all delalloc so we're sure to be truly out of space when we start
      failing tickets.
      
      I'm tagging stable 5.10 and forward, because this is where we started
      using the page stuff heavily again.  This affects earlier kernel
      versions as well, but would be a pain to backport to them as the
      flushing mechanisms aren't the same.
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      03fe78cc
    • Josef Bacik's avatar
      btrfs: enable a tracepoint when we fail tickets · fcdef39c
      Josef Bacik authored
      When debugging early enospc problems it was useful to have a tracepoint
      where we failed all tickets so I could check the state of the enospc
      counters at failure time to validate my fixes.  This adds the tracpoint
      so you can easily get that information.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fcdef39c
    • Josef Bacik's avatar
      btrfs: include delalloc related info in dump space info tracepoint · 8197766d
      Josef Bacik authored
      In order to debug delalloc flushing issues I added delalloc_bytes and
      ordered_bytes to this tracepoint to see if they were non-zero when we
      were going ENOSPC. This was valuable for me and showed me cases where we
      weren't waiting on ordered extents properly. In order to add this to the
      tracepoint we need to take away the const modifier for fs_info, as
      percpu_sum_counter_positive() will change the counter when it adds up
      the percpu buckets.  This is needed to make sure we're getting accurate
      information at these tracepoints, as the wrong information could send us
      down the wrong path when debugging problems.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8197766d
    • Josef Bacik's avatar
      btrfs: wake up async_delalloc_pages waiters after submit · ac98141d
      Josef Bacik authored
      We use the async_delalloc_pages mechanism to make sure that we've
      completed our async work before trying to continue our delalloc
      flushing.  The reason for this is we need to see any ordered extents
      that were created by our delalloc flushing.  However we're waking up
      before we do the submit work, which is before we create the ordered
      extents.  This is a pretty wide race window where we could potentially
      think there are no ordered extents and thus exit shrink_delalloc
      prematurely.  Fix this by waking us up after we've done the work to
      create ordered extents.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ac98141d
    • Qu Wenruo's avatar
      btrfs: unify regular and subpage error paths in __extent_writepage() · 963e4db8
      Qu Wenruo authored
      [BUG]
      When running btrfs/160 in a loop for subpage with experimental
      compression support, it has a high chance to crash (~20%):
      
       BTRFS critical (device dm-7): panic in __btrfs_add_ordered_extent:238: inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/ordered-data.c:238!
       Internal error: Oops - BUG: 0 [#1] SMP
       pc : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
       lr : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
       Call trace:
        __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
        btrfs_add_ordered_extent+0x2c/0x50 [btrfs]
        run_delalloc_nocow+0x81c/0x8fc [btrfs]
        btrfs_run_delalloc_range+0xa4/0x390 [btrfs]
        writepage_delalloc+0xc0/0x1ac [btrfs]
        __extent_writepage+0xf4/0x370 [btrfs]
        extent_write_cache_pages+0x288/0x4f4 [btrfs]
        extent_writepages+0x58/0xe0 [btrfs]
        btrfs_writepages+0x1c/0x30 [btrfs]
        do_writepages+0x60/0x110
        __filemap_fdatawrite_range+0x108/0x170
        filemap_fdatawrite_range+0x20/0x30
        btrfs_fdatawrite_range+0x34/0x4dc [btrfs]
        __btrfs_write_out_cache+0x34c/0x480 [btrfs]
        btrfs_write_out_cache+0x144/0x220 [btrfs]
        btrfs_start_dirty_block_groups+0x3ac/0x6b0 [btrfs]
        btrfs_commit_transaction+0xd0/0xbb4 [btrfs]
        btrfs_sync_fs+0x64/0x1cc [btrfs]
        sync_fs_one_sb+0x3c/0x50
        iterate_supers+0xcc/0x1d4
        ksys_sync+0x6c/0xd0
        __arm64_sys_sync+0x1c/0x30
        invoke_syscall+0x50/0x120
        el0_svc_common.constprop.0+0x4c/0xd4
        do_el0_svc+0x30/0x9c
        el0_svc+0x2c/0x54
        el0_sync_handler+0x1a8/0x1b0
        el0_sync+0x198/0x1c0
       ---[ end trace 336f67369ae6e0af ]---
      
      [CAUSE]
      For subpage case, we can have multiple sectors inside a page, this makes
      it possible for __extent_writepage() to have part of its page submitted
      before returning.
      
      In btrfs/160, we are using dm-dust to emulate write error, this means
      for certain pages, we could have everything running fine, but at the end
      of __extent_writepage(), one of the submitted bios fails due to dm-dust.
      
      Then the page is marked Error, and we change @ret from 0 to -EIO.
      
      This makes the caller extent_write_cache_pages() to error out, without
      submitting the remaining pages.
      
      Furthermore, since we're erroring out for free space cache, it doesn't
      really care about the error and will update the inode and retry the
      writeback.
      
      Then we re-run the delalloc range, and will try to insert the same
      delalloc range while previous delalloc range is still hanging there,
      triggering the above error.
      
      [FIX]
      The proper fix is to handle errors from __extent_writepage() properly,
      by ending the remaining ordered extent.
      
      But that fix needs the following changes:
      
      - Know at exactly which sector the error happened
        Currently __extent_writepage_io() works for the full page, can't
        return at which sector we hit the error.
      
      - Grab the ordered extent covering the failed sector
      
      As a hotfix for subpage case, here we unify the error paths in
      __extent_writepage().
      
      In fact, the "if (PageError(page))" branch never get executed if @ret is
      still 0 for non-subpage cases.
      
      As for non-subpage case, we never submit current page in
      __extent_writepage(), but only add current page into bio.
      The bio can only get submitted in next page.
      
      Thus we never get PageError() set due to IO failure, thus when we hit
      the branch, @ret is never 0.
      
      By simply removing that @ret assignment, we let subpage case ignore the
      IO failure, thus only error out for fatal errors just like regular
      sectorsize.
      
      So that IO error won't be treated as fatal error not trigger the hanging
      OE problem.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      963e4db8
    • Qu Wenruo's avatar
      btrfs: allow read-write for 4K sectorsize on 64K page size systems · 95ea0486
      Qu Wenruo authored
      Since now we support data and metadata read-write for subpage, remove
      the RO requirement for subpage mount.
      
      There are some extra limitations though:
      
      - For now, subpage RW mount is still considered experimental
        Thus that mount warning will still be there.
      
      - No compression support
        There are still quite some PAGE_SIZE hard coded and quite some call
        sites use extent_clear_unlock_delalloc() to unlock locked_page.
        This will screw up subpage helpers.
      
        Now for subpage RW mount, no matter what mount option or inode attr is
        set, all writes will not be compressed.  Although reading compressed
        data has no problem.
      
      - No defrag for subpage case
        The defrag support for subpage case will come in later patches, which
        will also rework the defrag workflow.
      
      - No inline extent will be created
        This is mostly due to the fact that filemap_fdatawrite_range() will
        trigger more write than the range specified.
        In fallocate calls, this behavior can make us to writeback which can
        be inlined, before we enlarge the i_size.
      
        This is a very special corner case, and even current btrfs check won't
        report error on such inline extent + regular extent.
        But considering how much effort has been put to prevent such inline +
        regular, I'd prefer to cut off inline extent completely until we have
        a good solution.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      95ea0486
    • Qu Wenruo's avatar
      btrfs: subpage: fix relocation potentially overwriting last page data · 9d9ea1e6
      Qu Wenruo authored
      [BUG]
      When using the following script, btrfs will report data corruption after
      one data balance with subpage support:
      
        mkfs.btrfs -f -s 4k $dev
        mount $dev -o nospace_cache $mnt
        $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
        sync
        btrfs balance start -d $mnt
        btrfs scrub start -B $mnt
      
      Similar problem can be easily observed in btrfs/028 test case, there
      will be tons of balance failure with -EIO.
      
      [CAUSE]
      Above fsstress will result the following data extents layout in extent
      tree:
        item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
          refs 2 gen 7 flags DATA
          extent data backref root FS_TREE objectid 259 offset 1339392 count 1
          extent data backref root FS_TREE objectid 259 offset 647168 count 1
        item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
          block group used 102400 chunk_objectid 256 flags DATA
        item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
          refs 1 gen 7 flags DATA
          extent data backref root FS_TREE objectid 259 offset 729088 count 1
      
      Then when creating the data reloc inode, the data reloc inode will look
      like this:
      
      	0	32K	64K	96K 100K	104K
      	|<------ Extent A ----->|   |<- Ext B ->|
      
      Then when we first try to relocate extent A, we setup the data reloc
      inode with i_size 96K, then read both page [0, 64K) and page [64K, 128K).
      
      For page 64K, since the i_size is just 96K, we fill range [96K, 128K)
      with 0 and set it uptodate.
      
      Then when we come to extent B, we update i_size to 104K, then try to read
      page [64K, 128K).
      Then we find the page is already uptodate, so we skip the read.
      But range [96K, 128K) is filled with 0, not the real data.
      
      Then we writeback the data reloc inode to disk, with 0 filling range
      [96K, 128K), corrupting the content of extent B.
      
      The behavior is caused by the fact that we still do full page read for
      subpage case.
      
      The bug won't really happen for regular sectorsize, as one page only
      contains one sector.
      
      [FIX]
      This patch will fix the problem by invalidating range [i_size, PAGE_END]
      in prealloc_file_extent_cluster().
      
      So that if above example happens, when we preallocate the file extent
      for extent B, we will clear the uptodate bits for range [96K, 128K),
      allowing later relocate_one_page() to re-read the needed range.
      
      There is a special note for the invalidating part.
      
      Since we're not calling real btrfs_invalidatepage(), but just clearing
      the subpage and page uptodate bits, we can leave a page half dirty and
      half out of date.
      
      Reading such page can cause a deadlock, as we normally expect a dirty
      page to be fully uptodate.
      
      Thus here we flush and wait the data reloc inode before doing the hacked
      invalidating.  This won't cause extra overhead, as we're going to
      writeback the data later anyway.
      Reported-by: default avatarRitesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9d9ea1e6
    • Qu Wenruo's avatar
      btrfs: subpage: fix false alert when relocating partial preallocated data extents · e3c62324
      Qu Wenruo authored
      [BUG]
      When relocating partial preallocated data extents (part of the
      preallocated extent is written) for subpage, it can cause the following
      false alert and make the relocation to fail:
      
        BTRFS info (device dm-3): balance: start -d
        BTRFS info (device dm-3): relocating block group 13631488 flags data
        BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
        BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
        BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
        BTRFS info (device dm-3): balance: ended with status: -5
      
      The minimal script to reproduce looks like this:
      
        mkfs.btrfs -f -s 4k $dev
        mount $dev -o nospace_cache $mnt
        xfs_io -f -c "falloc 0 8k" $mnt/file
        xfs_io -f -c "pwrite 0 4k" $mnt/file
        btrfs balance start -d $mnt
      
      [CAUSE]
      Function btrfs_verify_data_csum() checks if the full range has
      EXTENT_NODATASUM bit for data reloc inode, if *all* bytes of the range
      have EXTENT_NODATASUM bit, then it skip the range.
      
      This works pretty well for regular sectorsize, as in that case
      btrfs_verify_data_csum() is called for each sector, thus no problem at
      all.
      
      But for subpage case, btrfs_verify_data_csum() is called on each bvec,
      which can contain several sectors, and since it checks *all* bytes for
      EXTENT_NODATASUM bit, if we have some range with csum, then we will
      continue checking all the sectors.
      
      For the preallocated sectors, it doesn't have any csum, thus obviously
      the csum won't match and cause the false alert.
      
      [FIX]
      Move the EXTENT_NODATASUM check into the main loop, so that we can check
      each sector for EXTENT_NODATASUM bit for subpage case.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e3c62324
    • Qu Wenruo's avatar
      btrfs: subpage: fix a potential use-after-free in writeback helper · 7c11d0ae
      Qu Wenruo authored
      [BUG]
      There is a possible use-after-free bug when running generic/095.
      
       BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
       Faulting instruction address: 0xc000000000283654
       c000000000283078 do_raw_spin_unlock+0x88/0x230
       c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
       c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
       c0000000009e0458 end_bio_extent_writepage+0x158/0x270
       c000000000b6fd14 bio_endio+0x254/0x270
       c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
       c000000000b6fd14 bio_endio+0x254/0x270
       c000000000b781fc blk_update_request+0x46c/0x670
       c000000000b8b394 blk_mq_end_request+0x34/0x1d0
       c000000000d82d1c lo_complete_rq+0x11c/0x140
       c000000000b880a4 blk_complete_reqs+0x84/0xb0
       c0000000012b2ca4 __do_softirq+0x334/0x680
       c0000000001dd878 irq_exit+0x148/0x1d0
       c000000000016f4c do_IRQ+0x20c/0x240
       c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
      
      [CAUSE]
      There is very small race window like the following in generic/095.
      
      	Thread 1		|		Thread 2
      --------------------------------+------------------------------------
        end_bio_extent_writepage()	| btrfs_releasepage()
        |- spin_lock_irqsave()	| |
        |- end_page_writeback()	| |
        |				| |- if (PageWriteback() ||...)
        |				| |- clear_page_extent_mapped()
        |				|    |- kfree(subpage);
        |- spin_unlock_irqrestore().
      
      The race can also happen between writeback and btrfs_invalidatepage(),
      although that would be much harder as btrfs_invalidatepage() has much
      more work to do before the clear_page_extent_mapped() call.
      
      [FIX]
      Here we "wait" for the subapge spinlock to be released before we detach
      subpage structure.
      So this patch will introduce a new function, wait_subpage_spinlock(), to
      do the "wait" by acquiring the spinlock and release it.
      
      Since the caller has ensured the page is not dirty nor writeback, and
      page is already locked, the only way to hold the subpage spinlock is
      from endio function.
      Thus we only need to acquire the spinlock to wait for any existing
      holder.
      Reported-by: default avatarRitesh Harjani <riteshh@linux.ibm.com>
      Tested-by: default avatarRitesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7c11d0ae
    • Qu Wenruo's avatar
      btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage() · e0467866
      Qu Wenruo authored
      [BUG]
      When running generic/095, there is a high chance to crash with subpage
      data RW support:
      
       assertion failed: PagePrivate(page) && page->private
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/ctree.h:3403!
       Internal error: Oops - BUG: 0 [#1] SMP
       CPU: 1 PID: 3567 Comm: fio Tainted: 5.12.0-rc7-custom+ #17
       Hardware name: Khadas VIM3 (DT)
       Call trace:
        assertfail.constprop.0+0x28/0x2c [btrfs]
        btrfs_subpage_assert+0x80/0xa0 [btrfs]
        btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
        btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
        btrfs_dirty_pages+0x160/0x270 [btrfs]
        btrfs_buffered_write+0x444/0x630 [btrfs]
        btrfs_direct_write+0x1cc/0x2d0 [btrfs]
        btrfs_file_write_iter+0xc0/0x160 [btrfs]
        new_sync_write+0xe8/0x180
        vfs_write+0x1b4/0x210
        ksys_pwrite64+0x7c/0xc0
        __arm64_sys_pwrite64+0x24/0x30
        el0_svc_common.constprop.0+0x70/0x140
        do_el0_svc+0x28/0x90
        el0_svc+0x2c/0x54
        el0_sync_handler+0x1a8/0x1ac
        el0_sync+0x170/0x180
       Code: f0000160 913be042 913c4000 955444bc (d4210000)
       ---[ end trace 3fdd39f4cccedd68 ]---
      
      [CAUSE]
      Although prepare_pages() calls find_or_create_page(), which returns the
      page locked, but in later prepare_uptodate_page() calls, we may call
      btrfs_readpage() which will unlock the page before it returns.
      
      This leaves a window where btrfs_releasepage() can sneak in and release
      the page, clearing page->private and causing above ASSERT().
      
      [FIX]
      In prepare_uptodate_page(), we should not only check page->mapping, but
      also PagePrivate() to ensure we are still holding the correct page which
      has proper fs context setup.
      Reported-by: default avatarRitesh Harjani <riteshh@linux.ibm.com>
      Tested-by: default avatarRitesh Harjani <riteshh@linux.ibm.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e0467866
    • Qu Wenruo's avatar
      btrfs: subpage: reject raid56 filesystem and profile conversion · c8050b3b
      Qu Wenruo authored
      RAID56 is not only unsafe due to its write-hole problem, but also has
      tons of hardcoded PAGE_SIZE.
      
      Disable it for subpage support for now.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c8050b3b