1. 05 Dec, 2022 40 commits
    • Filipe Manana's avatar
      btrfs: avoid duplicated resolution of indirect backrefs during fiemap · 877c1476
      Filipe Manana authored
      During fiemap, when determining if a data extent is shared or not, if we
      don't find the extent is directly shared, then we need to determine if
      it's shared through subtrees. For that we need to resolve the indirect
      reference we found in order to figure out the path in the inode's fs tree,
      which is a path starting at the fs tree's root node and going down to the
      leaf that contains the file extent item that points to the data extent.
      We then proceed to determine if any extent buffer in that path is shared
      with other trees or not.
      
      Currently whenever we find the data extent that a file extent item points
      to is not directly shared, we always resolve the path in the fs tree, and
      then check if any extent buffer in the path is shared. This is a lot of
      work and when we have file extent items that belong to the same leaf, we
      have the same path, so we only need to calculate it once.
      
      This change does that, it keeps track of the current and previous leaf,
      and when we find that a data extent is not directly shared, we try to
      compute the fs tree path only once and then use it for every other file
      extent item in the same leaf, using the existing cached path result for
      the leaf as long as the cache results are valid.
      
      This saves us from doing expensive b+tree searches in the fs tree of our
      target inode, as well as other minor work.
      
      The following test was run on a non-debug kernel (Debian's default kernel
      config):
      
         $ cat test-with-snapshots.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         # Use compression to quickly create files with a lot of extents
         # (each with a size of 128K).
         mount -o compress=lzo $DEV $MNT
      
         # 40G gives 327680 extents, each with a size of 128K.
         xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
      
         # Add some more files to increase the size of the fs and extent
         # trees (in the real world there's a lot of files and extents
         # from other files).
         xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
         xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
         xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3
      
         # Create a snapshot so all the extents become indirectly shared
         # through subtrees, with a generation less than or equals to the
         # generation used to create the snapshot.
         btrfs subvolume snapshot -r $MNT $MNT/snap1
      
         umount $MNT
         mount -o compress=lzo $DEV $MNT
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata not cached)"
         echo
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata cached)"
      
         umount $MNT
      
      Result before applying this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 1204 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 729 milliseconds (metadata cached)
      
      Result after applying this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 732 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 421 milliseconds (metadata cached)
      
      That's a -46.1% total reduction for the metadata not cached case, and
      a -42.2% reduction for the cached metadata case.
      
      The test is somewhat limited in the sense the gains may be higher in
      practice, because in the test the filesystem is small, so we have small
      fs and extent trees, plus there's no concurrent access to the trees as
      well, therefore no lock contention there.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      877c1476
    • Filipe Manana's avatar
      btrfs: move up backref sharedness cache store and lookup functions · 583f4ac5
      Filipe Manana authored
      Move the static functions to lookup and store sharedness check of an
      extent buffer to a location above find_all_parents(), because in the
      next patch the lookup function will be used by find_all_parents().
      The store function is also moved just because it's the counter part
      to the lookup function and it's best to have their definitions close
      together.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      583f4ac5
    • Filipe Manana's avatar
      btrfs: cache sharedness of the last few data extents during fiemap · 73e339e6
      Filipe Manana authored
      During fiemap we process all the file extent items of an inode, by their
      file offset order (left to right b+tree order), and then check if the data
      extent they point at is shared or not. Until now we didn't cache those
      results, we only did it for b+tree nodes/leaves since for each unique
      b+tree path we have access to hundreds of file extent items. However, it
      is also common to repeat checking the sharedness of a particular data
      extent in a very short time window, and the cases that lead to that are
      the following:
      
      1) COW writes.
      
         If have a file extent item like this:
      
                        [ bytenr X, offset = 0, num_bytes = 512K ]
         file offset    0                                        512K
      
         Then a 4K write into file offset 64K happens, we end up with the
         following file extent item layout:
      
                        [ bytenr X, offset = 0, num_bytes = 64K ]
         file offset    0                                       64K
      
                        [ bytenr Y, offset = 0, num_bytes = 4K ]
         file offset   64K                                     68K
      
                        [ bytenr X, offset = 68K, num_bytes = 444K ]
         file offset   68K                                         512K
      
         So during fiemap we well check for the sharedness of the data extent
         with bytenr X twice. Typically for COW writes and for at least
         moderately updated files, we end up with many file extent items that
         point to different sections of the same data extent.
      
      2) Writing into a NOCOW file after a snapshot is taken.
      
         This happens if the target extent was created in a generation older
         than the generation where the last snapshot for the root (the tree the
         inode belongs to) was made.
      
         This leads to a scenario like the previous one.
      
      3) Writing into sections of a preallocated extent.
      
         For example if a file has the following layout:
      
         [ bytenr X, offset = 0, num_bytes = 1M, type = prealloc ]
         0                                                       1M
      
         After doing a 4K write into file offset 0 and another 4K write into
         offset 512K, we get the following layout:
      
            [ bytenr X, offset = 0, num_bytes = 4K, type = regular ]
            0                                                      4K
      
            [ bytenr X, offset = 4K, num_bytes = 508K, type = prealloc ]
           4K                                                          512K
      
            [ bytenr X, offset = 512K, num_bytes = 4K, type = regular ]
         512K                                                         516K
      
            [ bytenr X, offset = 516K, num_bytes = 508K, type = prealloc ]
         516K                                                            1M
      
         So we end up with 4 consecutive file extent items pointing to the data
         extent at bytenr X.
      
      4) Hole punching in the middle of an extent.
      
         For example if a file has the following file extent item:
      
         [ bytenr X, offset = 0, num_bytes = 8M ]
         0                                      8M
      
         And then hole is punched for the file range [4M, 6M[, we our file
         extent item split into two:
      
         [ bytenr X, offset = 0, num_bytes = 4M  ]
         0                                       4M
      
         [ 2M hole, implicit or explicit depending on NO_HOLES feature ]
         4M                                                            6M
      
         [ bytenr X, offset = 6M, num_bytes = 2M  ]
         6M                                       8M
      
         Again, we end up with two file extent items pointing to the same
         data extent.
      
      5) When reflinking (clone and deduplication) within the same file.
         This is probably the least common case of all.
      
      In cases 1, 2, 4 and 4, when we have multiple file extent items that point
      to the same data extent, their distance is usually short, typically
      separated by a few slots in a b+tree leaf (or across sibling leaves). For
      case 5, the distance can vary a lot, but it's typically the less common
      case.
      
      This change caches the result of the sharedness checks for data extents,
      but only for the last 8 extents that we notice that our inode refers to
      with multiple file extent items. Whenever we want to check if a data
      extent is shared, we lookup the cache which consists of doing a linear
      scan of an 8 elements array, and if we find the data extent there, we
      return the result and don't check the extent tree and delayed refs.
      
      The array/cache is small so that doing the search has no noticeable
      negative impact on the performance in case we don't have file extent items
      within a distance of 8 slots that point to the same data extent.
      
      Slots in the cache/array are overwritten in a simple round robin fashion,
      as that approach fits very well.
      
      Using this simple approach with only the last 8 data extents seen is
      effective as usually when multiple file extents items point to the same
      data extent, their distance is within 8 slots. It also uses very little
      memory and the time to cache a result or lookup the cache is negligible.
      
      The following test was run on non-debug kernel (Debian's default kernel
      config) to measure the impact in the case of COW writes (first example
      given above), where we run fiemap after overwriting 33% of the blocks of
      a file:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         FILE_SIZE=$((1 * 1024 * 1024  * 1024))
      
         # Create the file full of 1M extents.
         xfs_io -f -s -c "pwrite -b 1M -S 0xab 0 $FILE_SIZE" $MNT/foobar
      
         block_count=$((FILE_SIZE / 4096))
         # Overwrite about 33% of the file blocks.
         overwrite_count=$((block_count / 3))
      
         echo -e "\nOverwriting $overwrite_count 4K blocks (out of $block_count)..."
         RANDOM=123
         for ((i = 1; i <= $overwrite_count; i++)); do
             off=$(((RANDOM % block_count) * 4096))
             xfs_io -c "pwrite -S 0xcd $off 4K" $MNT/foobar > /dev/null
             echo -ne "\r$i blocks overwritten..."
         done
         echo -e "\n"
      
         # Unmount and mount to clear all cached metadata.
         umount $MNT
         mount $DEV $MNT
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds"
      
         umount $MNT
      
      Result before applying this patch:
      
         fiemap took 128 milliseconds
      
      Result after applying this patch:
      
         fiemap took 92 milliseconds   (-28.1%)
      
      The test is somewhat limited in the sense the gains may be higher in
      practice, because in the test the filesystem is small, so we have small
      fs and extent trees, plus there's no concurrent access to the trees as
      well, therefore no lock contention there.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      73e339e6
    • Filipe Manana's avatar
      btrfs: remove useless logic when finding parent nodes · 56f5c199
      Filipe Manana authored
      At find_parent_nodes(), at its last step, when iterating over all direct
      references, we are checking if we have a share context and if we have
      a reference with a different root from the one in the share context.
      However that logic is pointless because of two reasons:
      
      1) After the previous patch in the series (subject "btrfs: remove roots
         ulist when checking data extent sharedness"), the roots argument is
         always NULL when using a share check context (struct share_check), so
         this code is never triggered;
      
      2) Even before that previous patch, we could not hit this code because
         if we had a reference with a root different from the one in our share
         context, then we would have exited earlier when doing either of the
         following:
      
            - Adding a second direct ref to the direct refs red black tree
              resulted in extent_is_shared() returning true when called from
              add_direct_ref() -> add_prelim_ref(), after processing delayed
              references or while processing references in the extent tree;
      
            - When adding a second reference to the indirect refs red black
              tree (same as above, extent_is_shared() returns true);
      
            - If we only have one indirect reference and no direct references,
              then when resolving it at resolve_indirect_refs() we immediately
              return that the target extent is shared, therefore never reaching
              that loop that iterates over all direct references at
              find_parent_nodes();
      
            - If we have 1 indirect reference and 1 direct reference, then we
              also exit early because extent_is_shared() ends up returning true
              when called through add_prelim_ref() (by add_direct_ref() or
              add_indirect_ref()) or add_delayed_refs(). Same applies as when
              having a combination of direct, indirect and indirect with missing
              key references.
      
         This logic had been obsoleted since commit 3ec4d323 ("btrfs:
         allow backref search checks for shared extents"), which introduced the
         early exits in case an extent is shared.
      
      So just remove that logic, and assert at find_parent_nodes() that when we
      have a share context we don't have a roots ulist and that we haven't found
      the extent to be directly shared after processing delayed references and
      all references from the extent tree.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      56f5c199
    • Filipe Manana's avatar
      btrfs: remove roots ulist when checking data extent sharedness · b6296858
      Filipe Manana authored
      Currently btrfs_is_data_extent_shared() is passing a ulist for the roots
      argument of find_parent_nodes(), however it does not use that ulist for
      anything and for this context that list always ends up with at most one
      element.
      
      Since find_parent_nodes() is able to deal with a NULL ulist for its roots
      argument, make btrfs_is_data_extent_shared() pass it NULL and avoid the
      burden of allocating memory for the unnused roots ulist, initializing it,
      releasing it and allocating one struct ulist_node for it during the call
      to find_parent_nodes().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b6296858
    • Filipe Manana's avatar
      btrfs: move ulists to data extent sharedness check context · 84a7949d
      Filipe Manana authored
      When calling btrfs_is_data_extent_shared() we pass two ulists that were
      allocated by the caller. This is because the single caller, fiemap, calls
      btrfs_is_data_extent_shared() multiple times and the ulists can be reused,
      instead of allocating new ones before each call and freeing them after
      each call.
      
      Now that we have a context structure/object that we pass to
      btrfs_is_data_extent_shared(), we can move those ulists to it, and hide
      their allocation and the context's allocation in a helper function, as
      well as the freeing of the ulists and the context object. This allows to
      reduce the number of parameters passed to btrfs_is_data_extent_shared(),
      the need to pass the ulists from extent_fiemap() to fiemap_process_hole()
      and having the caller deal with allocating and releasing the ulists.
      
      Also rename one of the ulists from 'tmp' / 'tmp_ulist' to 'refs', since
      that's a much better name as it reflects what the list is used for (and
      matching the argument name for find_parent_nodes()).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      84a7949d
    • Filipe Manana's avatar
      btrfs: turn the backref sharedness check cache into a context object · 61dbb952
      Filipe Manana authored
      Right now we are using a struct btrfs_backref_shared_cache to pass state
      across multiple btrfs_is_data_extent_shared() calls. The structure's name
      closely follows its current purpose, which is to cache previous checks
      for the sharedness of metadata extents. However we will start using the
      structure for more things other than caching sharedness checks, so rename
      it to struct btrfs_backref_share_check_ctx.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      61dbb952
    • Filipe Manana's avatar
      btrfs: directly pass the inode to btrfs_is_data_extent_shared() · ceb707da
      Filipe Manana authored
      Currently we pass a root and an inode number as arguments for
      btrfs_is_data_extent_shared() and the inode number is always from an
      inode that belongs to that root (it wouldn't make sense otherwise).
      In every context that we call btrfs_is_data_extent_shared() (fiemap only),
      we have an inode available, so directly pass the inode to the function
      instead of a root and inode number. This reduces the number of parameters
      and it makes the function's signature conform to most other functions we
      have.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ceb707da
    • Filipe Manana's avatar
      btrfs: remove checks for a 0 inode number during backref walking · a0a5472a
      Filipe Manana authored
      When doing backref walking to determine if an extent is shared, we are
      testing if the inode number, stored in the 'inum' field of struct
      share_check, is 0. However that can never be case, since the all instances
      of the structure are created at btrfs_is_data_extent_shared(), which
      always initializes it with the inode number from a fs tree (and the number
      for any inode from any tree can never be 0). So remove the checks.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a0a5472a
    • Filipe Manana's avatar
      btrfs: remove checks for a root with id 0 during backref walking · c9024219
      Filipe Manana authored
      When doing backref walking to determine if an extent is shared, we are
      testing the root_objectid of the given share_check struct is 0, but that
      is an impossible case, since btrfs_is_data_extent_shared() always
      initializes the root_objectid field with the id of the given root, and
      no root can have an objectid of 0. So remove those checks.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c9024219
    • Filipe Manana's avatar
      btrfs: drop redundant bflags initialization when allocating extent buffer · 206c1d32
      Filipe Manana authored
      When allocating an extent buffer, at __alloc_extent_buffer(), there's no
      point in explicitly assigning zero to the bflags field of the new extent
      buffer because we allocated it with kmem_cache_zalloc().
      
      So just remove the redundant initialization, it saves one mov instruction
      in the generated assembly code for x86_64 ("movq $0x0,0x10(%rax)").
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      206c1d32
    • Filipe Manana's avatar
      btrfs: drop pointless memset when cloning extent buffer · b98c6cd5
      Filipe Manana authored
      At btrfs_clone_extent_buffer(), before allocating the pages array for the
      new extent buffer we are calling memset() to zero out the pages array of
      the extent buffer. This is pointless however, because the extent buffer
      already has every element in its pages array pointing to NULL, as it was
      allocated with kmem_cache_zalloc(). The memset() was introduced with
      commit dd137dd1 ("btrfs: factor out allocating an array of pages"),
      but even before that commit we already depended on the pages array being
      initialized to NULL for the error paths that need to call
      btrfs_release_extent_buffer().
      
      So remove the memset(), it's useless and slightly increases the object
      text size.
      
      Before this change:
      
         $ size fs/btrfs/extent_io.o
            text	   data	    bss	    dec	    hex	filename
           70580	   5469	     40	  76089	  12939	fs/btrfs/extent_io.o
      
      After this change:
      
         $ size fs/btrfs/extent_io.o
            text	   data	    bss	    dec	    hex	filename
           70564	   5469	     40	  76073	  12929	fs/btrfs/extent_io.o
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b98c6cd5
    • Filipe Manana's avatar
      btrfs: skip unnecessary delalloc search during fiemap and lseek · a2853ffc
      Filipe Manana authored
      During fiemap and lseek (hole and data seeking), there's no point in
      iterating the inode's io tree to count delalloc bits if the inode's
      delalloc bytes counter has a value of zero, as that counter is updated
      whenever we set a range for delalloc or clear a range from delalloc.
      
      So skip the counting and io tree iteration if the inode's delalloc bytes
      counter has a value of zero. This helps save time when processing a file
      range corresponding to a hole or prealloc (unwritten) extent.
      
      This patch is part of a series comprised of the following patches:
      
        btrfs: get the next extent map during fiemap/lseek more efficiently
        btrfs: skip unnecessary extent map searches during fiemap and lseek
        btrfs: skip unnecessary delalloc search during fiemap and lseek
      
      The following test was performed on a release kernel (Debian's default
      kernel config) before and after applying those 3 patches.
      
         # Wrapper to call fiemap in extent count only mode.
         # (struct fiemap::fm_extent_count set to 0)
         $ cat fiemap.c
         #include <stdio.h>
         #include <unistd.h>
         #include <stdlib.h>
         #include <fcntl.h>
         #include <errno.h>
         #include <string.h>
         #include <sys/ioctl.h>
         #include <linux/fs.h>
         #include <linux/fiemap.h>
      
         int main(int argc, char **argv)
         {
                  struct fiemap fiemap = { 0 };
                  int fd;
      
                  if (argc != 2) {
                          printf("usage: %s <path>\n", argv[0]);
                          return 1;
                  }
                  fd = open(argv[1], O_RDONLY);
                  if (fd < 0) {
                          fprintf(stderr, "error opening file: %s\n",
                                  strerror(errno));
                          return 1;
                  }
      
                  /* fiemap.fm_extent_count set to 0, to count extents only. */
                  fiemap.fm_length = FIEMAP_MAX_OFFSET;
                  if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
                          fprintf(stderr, "fiemap error: %s\n",
                                  strerror(errno));
                          return 1;
                  }
                  close(fd);
                  printf("fm_mapped_extents = %d\n", fiemap.fm_mapped_extents);
      
                  return 0;
         }
      
         $ gcc -o fiemap fiemap.c
      
      And the wrapper shell script that creates a file with many holes and runs
      fiemap against it:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         FILE_SIZE=$((1 * 1024 * 1024 * 1024))
         echo -n > $MNT/foobar
         for ((off = 0; off < $FILE_SIZE; off += 8192)); do
                 xfs_io -c "pwrite -S 0xab $off 4K" $MNT/foobar > /dev/null
         done
      
         # flush all delalloc
         sync
      
         start=$(date +%s%N)
         ./fiemap $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds"
      
         umount $MNT
      
      Result before applying patchset:
      
         fm_mapped_extents = 131072
         fiemap took 63 milliseconds
      
      Result after applying patchset:
      
         fm_mapped_extents = 131072
         fiemap took 39 milliseconds   (-38.1%)
      
      Running the same test for a 512M file instead of a 1G file, gave the
      following results.
      
      Result before applying patchset:
      
         fm_mapped_extents = 65536
         fiemap took 29 milliseconds
      
      Result after applying patchset:
      
         fm_mapped_extents = 65536
         fiemap took 20 milliseconds    (-31.0%)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a2853ffc
    • Filipe Manana's avatar
      btrfs: skip unnecessary extent map searches during fiemap and lseek · 013f9c70
      Filipe Manana authored
      If we have no outstanding extents it means we don't have any extent maps
      corresponding to delalloc that is flushing, as when an ordered extent is
      created we increment the number of outstanding extents to 1 and when we
      remove the ordered extent we decrement them by 1. So skip extent map tree
      searches if the number of outstanding ordered extents is 0, saving time as
      the tree is not empty if we have previously made some reads or flushed
      delalloc, as in those cases it can have a very large number of extent maps
      for files with many extents.
      
      This helps save time when processing a file range corresponding to a hole
      or prealloc (unwritten) extent.
      
      The next patch in the series has a performance test in its changelog and
      its subject is:
      
          "btrfs: skip unnecessary delalloc search during fiemap and lseek"
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      013f9c70
    • Filipe Manana's avatar
      btrfs: get the next extent map during fiemap/lseek more efficiently · d47704bd
      Filipe Manana authored
      At find_delalloc_subrange(), when we need to get the next extent map, we
      do a full search on the extent map tree (a red black tree). This is fine
      but it's a lot more efficient to simply use rb_next(), which typically
      requires iterating over less nodes of the tree and never needs to compare
      the ranges of nodes with the one we are looking for.
      
      So add a public helper to extent_map.{h,c} to get the extent map that
      immediately follows another extent map, using rb_next(), and use that
      helper at find_delalloc_subrange().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d47704bd
    • Qu Wenruo's avatar
      btrfs: raid56: make it more explicit that cache rbio should have all its data sectors uptodate · 88074c8b
      Qu Wenruo authored
      For Btrfs RAID56, we have a caching system for btrfs raid bios (rbio).
      
      We call cache_rbio_pages() to mark a qualified rbio ready for cache.
      
      The timing happens at:
      
      - finish_rmw()
      
        At this timing, we have already read all necessary sectors, along with
        the rbio sectors, we have covered all data stripes.
      
      - __raid_recover_end_io()
      
        At this timing, we have rebuild the rbio, thus all data sectors
        involved (either from stripe or bio list) are uptodate now.
      
      Thus at the timing of cache_rbio_pages(), we should have all data
      sectors uptodate.
      
      This patch will make it explicit that all data sectors are uptodate at
      cache_rbio_pages() timing, mostly to prepare for the incoming
      verification at RMW time.
      
      This patch will add:
      
      - Extra ASSERT()s in cache_rbio_pages()
        This is to make sure all data sectors, which are not covered by bio,
        are already uptodate.
      
      - Extra ASSERT()s in steal_rbio()
        Since only cached rbio can be stolen, thus every data sector should
        already be uptodate in the source rbio.
      
      - Update __raid_recover_end_io() to update recovered sector->uptodate
        Previously __raid_recover_end_io() will only mark failed sectors
        uptodate if it's doing an RMW.
      
        But this can trigger new ASSERT()s, as for recovery case, a recovered
        failed sector will not be marked uptodate, and trigger ASSERT() in
        later cache_rbio_pages() call.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      88074c8b
    • Qu Wenruo's avatar
      btrfs: raid56: allocate memory separately for rbio pointers · 797d74b7
      Qu Wenruo authored
      Currently inside alloc_rbio(), we allocate a larger memory to contain
      the following members:
      
      - struct btrfs_raid_rbio itself
      - stripe_pages array
      - bio_sectors array
      - stripe_sectors array
      - finish_pointers array
      
      Then update rbio pointers to point the extra space after the rbio
      structure itself.
      
      Thus it introduced a complex CONSUME_ALLOC() macro to help the thing.
      
      This is too hacky, and is going to make later pointers expansion harder.
      
      This patch will change it to use regular kcalloc() for each pointer
      inside btrfs_raid_bio, making the later expansion much easier.
      
      And introduce a helper free_raid_bio_pointers() to free up all the
      pointer members in btrfs_raid_bio, which will be used in both
      free_raid_bio() and error path of alloc_rbio().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      797d74b7
    • Qu Wenruo's avatar
      btrfs: raid56: cleanup for function __free_raid_bio() · ff2b64a2
      Qu Wenruo authored
      The cleanup involves two things:
      
      - Remove the "__" prefix
        There is no naming confliction.
      
      - Remove the forward declaration
        There is no special function call involved.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ff2b64a2
    • Josef Bacik's avatar
      btrfs: introduce BTRFS_RESERVE_FLUSH_EMERGENCY · 765c3fe9
      Josef Bacik authored
      Inside of FB, as well as some user reports, we've had a consistent
      problem of occasional ENOSPC transaction aborts.  Inside FB we were
      seeing ~100-200 ENOSPC aborts per day in the fleet, which is a really
      low occurrence rate given the size of our fleet, but it's not nothing.
      
      There are two causes of this particular problem.
      
      First is delayed allocation.  The reservation system for delalloc
      assumes that contiguous dirty ranges will result in 1 file extent item.
      However if there is memory pressure that results in fragmented writeout,
      or there is fragmentation in the block groups, this won't necessarily be
      true.  Consider the case where we do a single 256MiB write to a file and
      then close it.  We will have 1 reservation for the inode update, the
      reservations for the checksum updates, and 1 reservation for the file
      extent item.  At some point later we decide to write this entire range
      out, but we're so fragmented that we break this into 100 different file
      extents.  Since we've already closed the file and are no longer writing
      to it there's nothing to trigger a refill of the delalloc block rsv to
      satisfy the 99 new file extent reservations we need.  At this point we
      exhaust our delalloc reservation, and we begin to steal from the global
      reserve.  If you have enough of these cases going in parallel you can
      easily exhaust the global reserve, get an ENOSPC at
      btrfs_alloc_tree_block() time, and then abort the transaction.
      
      The other case is the delayed refs reserve.  The delayed refs reserve
      updates its size based on outstanding delayed refs and dirty block
      groups.  However we only refill this block reserve when returning
      excess reservations and when we call btrfs_start_transaction(root, X).
      We will reserve 2*X credits at transaction start time, and fill in X
      into the delayed refs reserve to make sure it stays topped off.
      Generally this works well, but clearly has downsides.  If we do a
      particularly delayed ref heavy operation we may never catch up in our
      reservations.  Additionally running delayed refs generates more delayed
      refs, and at that point we may be committing the transaction and have no
      way to trigger a refill of our delayed refs rsv.  Then a similar thing
      occurs with the delalloc reserve.
      
      Generally speaking we well over-reserve in all of our block rsvs.  If we
      reserve 1 credit we're usually reserving around 264k of space, but we'll
      often not use any of that reservation, or use a few blocks of that
      reservation.  We can be reasonably sure that as long as you were able to
      reserve space up front for your operation you'll be able to find space
      on disk for that reservation.
      
      So introduce a new flushing state, BTRFS_RESERVE_FLUSH_EMERGENCY.  This
      gets used in the case that we've exhausted our reserve and the global
      reserve.  It simply forces a reservation if we have enough actual space
      on disk to make the reservation, which is almost always the case.  This
      keeps us from hitting ENOSPC aborts in these odd occurrences where we've
      not kept up with the delayed work.
      
      Fixing this in a complete way is going to be relatively complicated and
      time consuming.  This patch is what I discussed with Filipe earlier this
      year, and what I put into our kernels inside FB.  With this patch we're
      down to 1-2 ENOSPC aborts per week, which is a significant reduction.
      This is a decent stop gap until we can work out a more wholistic
      solution to these two corner cases.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      765c3fe9
    • Josef Bacik's avatar
      btrfs: move the btrfs_verity_descriptor_item defs up in ctree.h · 7a66eda3
      Josef Bacik authored
      These are wrapped in CONFIG_FS_VERITY, but we can have the definitions
      without verity enabled.  Move these definitions up with the other
      accessor helpers.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7a66eda3
    • Josef Bacik's avatar
      btrfs: move btrfs_next_old_item into ctree.c · 890d2b1a
      Josef Bacik authored
      This uses btrfs_header_nritems, which I will be moving out of ctree.h.
      In order to avoid needing to include the relevant header in ctree.h,
      simply move this helper function into ctree.c.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ rename parameters ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      890d2b1a
    • Josef Bacik's avatar
      btrfs: move free space cachep's out of ctree.h · eda517fd
      Josef Bacik authored
      This is local to the free-space-cache.c code, remove it from ctree.h and
      inode.c, create new init/exit functions for the cachep, and move it
      locally to free-space-cache.c.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eda517fd
    • Josef Bacik's avatar
      btrfs: move btrfs_path_cachep out of ctree.h · 226463d7
      Josef Bacik authored
      This is local to the ctree code, remove it from ctree.h and inode.c,
      create new init/exit functions for the cachep, and move it locally to
      ctree.c.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      226463d7
    • Josef Bacik's avatar
      btrfs: move trans_handle_cachep out of ctree.h · 956504a3
      Josef Bacik authored
      This is local to the transaction code, remove it from ctree.h and
      inode.c, create new helpers in the transaction to handle the init work
      and move the cachep locally to transaction.c.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      956504a3
    • Josef Bacik's avatar
      btrfs: move btrfs_print_data_csum_error into inode.c · f60acad3
      Josef Bacik authored
      This isn't used outside of inode.c, there's no reason to define it in
      btrfs_inode.h. Drop the inline and add __cold as it's for errors that
      are not in any hot path.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f60acad3
    • Josef Bacik's avatar
      btrfs: move flush related definitions to space-info.h · f1e5c618
      Josef Bacik authored
      This code is used in space-info.c, move the definitions to space-info.h.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f1e5c618
    • Josef Bacik's avatar
      btrfs: move btrfs_should_fragment_free_space into block-group.c · 06d61cb1
      Josef Bacik authored
      This function uses functions that are not defined in block-group.h, move
      it into block-group.c in order to keep the header clean.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      06d61cb1
    • Josef Bacik's avatar
      btrfs: move discard stat defs to free-space-cache.h · 390d89cc
      Josef Bacik authored
      These definitions are used for discard statistics, move them out of
      ctree.h and put them in free-space-cache.h.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      390d89cc
    • Josef Bacik's avatar
      btrfs: move BTRFS_MAX_MIRRORS into scrub.c · ed4c491a
      Josef Bacik authored
      This is only used locally in scrub.c, move it out of ctree.h into
      scrub.c.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ed4c491a
    • Josef Bacik's avatar
      btrfs: move maximum limits to btrfs_tree.h · ad4b63ca
      Josef Bacik authored
      We have maximum link and name length limits, move these to btrfs_tree.h
      as they're on disk limitations.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ reformat comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad4b63ca
    • Josef Bacik's avatar
      btrfs: move btrfs_get_block_group helper out of disk-io.h · 51129b33
      Josef Bacik authored
      This inline helper calls btrfs_fs_compat_ro(), which is defined in
      another header.  To avoid weird header dependency problems move this
      helper into disk-io.c with the rest of the global root helpers.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      51129b33
    • Josef Bacik's avatar
      btrfs: move btrfs on-disk definitions out of ctree.h · 4300c58f
      Josef Bacik authored
      The bulk of our on-disk definitions exist in btrfs_tree.h, which user
      space can use.  Keep things consistent and move the rest of the on disk
      definitions out of ctree.h into btrfs_tree.h.  Note I did have to update
      all u8's to __u8, but otherwise this is a strict copy and paste.
      
      Most of the definitions are mainly for internal use and are not
      guaranteed stable public API and may change as we need. Compilation
      failures by user applications can happen.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ reformat comments, style fixups ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4300c58f
    • Josef Bacik's avatar
      btrfs: remove unused BTRFS_IOPRIO_READA · 4ce76e8e
      Josef Bacik authored
      The last user of this definition was removed in patch f26c9238
      ("btrfs: remove reada infrastructure") so we can remove this definition.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4ce76e8e
    • Josef Bacik's avatar
      btrfs: remove unused BTRFS_TOTAL_BYTES_PINNED_BATCH · ea206640
      Josef Bacik authored
      This hasn't been used since 138a12d8 ("btrfs: rip out
      btrfs_space_info::total_bytes_pinned") so it is safe to remove.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ea206640
    • Josef Bacik's avatar
      btrfs: remove unused set/clear_pending_info helpers · d60d956e
      Josef Bacik authored
      The last users of these helpers were removed in 5297199a ("btrfs:
      remove inode number cache feature") so delete these helpers.
      
      The point was for mount options that were applicable after transaction
      commit so they could not be applied immediately. We don't have such
      options anymore and if we do the patch can be reverted.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d60d956e
    • Peng Hao's avatar
      btrfs: simplify cleanup after error in btrfs_create_tree · c1b07854
      Peng Hao authored
      Since leaf is already NULL, and no other branch will go to fail_unlock,
      the fail_unlock label is useless and can be removed
      Signed-off-by: default avatarPeng Hao <flyingpeng@tencent.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c1b07854
    • Josef Bacik's avatar
      btrfs: add cached_state to read_extent_buffer_subpage · e5e886ba
      Josef Bacik authored
      We don't use a cached state here at all, which generally makes sense as
      async reads are going to unlock at endio time.  However for blocking
      reads we will call wait_extent_bit() for our range.  Since the
      lock_extent() stuff will return the cached_state for the start of the
      range this is a helpful optimization to have for this case, we'll have
      the exact state we want to wait on.  Add a cached state here and simply
      throw it away if we're a non-blocking read, otherwise we'll get a small
      improvement by eliminating some tree searches.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e5e886ba
    • Josef Bacik's avatar
      btrfs: cache the failed state when locking extents · 123a7f00
      Josef Bacik authored
      Currently if we fail to lock a range we'll return the start of the range
      that we failed to lock.  We'll then search down to this range and wait
      on any extent states in this range.
      
      However we can avoid this search altogether if we simply cache the
      extent_state that had the contention.  We can pass this into
      wait_extent_bit() and start from that extent_state without doing the
      search.  In the most optimistic case we can avoid all searches, more
      likely we'll avoid the initial search and have to perform the search
      after we wait on the failed state, or worst case we must search both
      times which is what currently happens.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      123a7f00
    • Josef Bacik's avatar
      btrfs: use a cached_state everywhere in relocation · 9c5c9604
      Josef Bacik authored
      All of the relocation code avoids using the cached state, despite
      everywhere using the normal
      
        lock_extent()
        // do something
        unlock_extent()
      
      pattern.  Fix this by plumbing a cached state throughout all of these
      functions in order to allow for less tree searches.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c5c9604
    • Josef Bacik's avatar
      btrfs: use cached_state for btrfs_check_nocow_lock · 632ddfa2
      Josef Bacik authored
      Now that try_lock_extent() takes a cached_state, plumb the cached_state
      through btrfs_try_lock_ordered_range() and then use a cached_state in
      btrfs_check_nocow_lock everywhere to avoid extra tree searches on the
      extent_io_tree.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      632ddfa2