1. 30 May, 2018 16 commits
    • Ethan Lien's avatar
      btrfs: balance dirty metadata pages in btrfs_finish_ordered_io · e73e81b6
      Ethan Lien authored
      [Problem description and how we fix it]
      We should balance dirty metadata pages at the end of
      btrfs_finish_ordered_io, since a small, unmergeable random write can
      potentially produce dirty metadata which is multiple times larger than
      the data itself. For example, a small, unmergeable 4KiB write may
      produce:
      
          16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
          16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
          16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
      
      Although we do call balance dirty pages in write side, but in the
      buffered write path, most metadata are dirtied only after we reach the
      dirty background limit (which by far only counts dirty data pages) and
      wakeup the flusher thread. If there are many small, unmergeable random
      writes spread in a large btree, we'll find a burst of dirty pages
      exceeds the dirty_bytes limit after we wakeup the flusher thread - which
      is not what we expect. In our machine, it caused out-of-memory problem
      since a page cannot be dropped if it is marked dirty.
      
      Someone may worry about we may sleep in btrfs_btree_balance_dirty_nodelay,
      but since we do btrfs_finish_ordered_io in a separate worker, it will not
      stop the flusher consuming dirty pages. Also, we use different worker for
      metadata writeback endio, sleep in btrfs_finish_ordered_io help us throttle
      the size of dirty metadata pages.
      
      [Reproduce steps]
      To reproduce the problem, we need to do 4KiB write randomly spread in a
      large btree. In our 2GiB RAM machine:
      
      1) Create 4 subvolumes.
      2) Run fio on each subvolume:
      
         [global]
         direct=0
         rw=randwrite
         ioengine=libaio
         bs=4k
         iodepth=16
         numjobs=1
         group_reporting
         size=128G
         runtime=1800
         norandommap
         time_based
         randrepeat=0
      
      3) Take snapshot on each subvolume and repeat fio on existing files.
      4) Repeat step (3) until we get large btrees.
         In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
         metadata in each subvolume tree and 12GiB of metadata in extent tree.
      5) Stop all fio, take snapshot again, and wait until all delayed work is
         completed.
      6) Start all fio. Few seconds later we hit OOM when the flusher starts
         to work.
      
      It can be reproduced even when using nocow write.
      Signed-off-by: default avatarEthan Lien <ethanlien@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e73e81b6
    • Ethan Lien's avatar
      btrfs: lift some btrfs_cross_ref_exist checks in nocow path · 78d4295b
      Ethan Lien authored
      In nocow path, we check if the extent is snapshotted in
      btrfs_cross_ref_exist(). We can do the similar check earlier and avoid
      unnecessary search into extent tree.
      
      A fio test on a Intel D-1531, 16GB RAM, SSD RAID-5 machine as follows:
      
      [global]
      group_reporting
      time_based
      thread=1
      ioengine=libaio
      bs=4k
      iodepth=32
      size=64G
      runtime=180
      numjobs=8
      rw=randwrite
      
      [file1]
      filename=/mnt/nocow/testfile
      
      IOPS result:   unpatched     patched
      
      1 fio round:     46670        46958
      snapshot
      2 fio round:     51826        54498
      3 fio round:     59767        61289
      
      After snapshot, the first fio get about 5% performance gain. As we
      continually write to the same file, all writes will resume to nocow mode
      and eventually we have no performance gain.
      Signed-off-by: default avatarEthan Lien <ethanlien@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      78d4295b
    • Lu Fengqi's avatar
      btrfs: Remove fs_info argument from btrfs_uuid_tree_rem · d1957791
      Lu Fengqi authored
      This function always takes a transaction handle which contains a
      reference to the fs_info. Use that and remove the extra argument.
      Signed-off-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      [ rename the function ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d1957791
    • Lu Fengqi's avatar
      btrfs: Remove fs_info argument from btrfs_uuid_tree_add · cdb345a8
      Lu Fengqi authored
      This function always takes a transaction handle which contains a
      reference to the fs_info. Use that and remove the extra argument.
      Signed-off-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cdb345a8
    • Liu Bo's avatar
      Btrfs: remove unused check of skip_locking · f9ddfd05
      Liu Bo authored
      The check is superfluous since all callers who set search_for_commit
      also have skip_locking set.
      
      ASSERT() is put in place to ensure skip_locking is set by new callers.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f9ddfd05
    • Liu Bo's avatar
      Btrfs: remove always true check in unlock_up · d80bb3f9
      Liu Bo authored
      As unlock_up() is written as
      
      for () {
         if (!path->locks[i])
             break;
         ...
         if (... && path->locks[i]) {
         }
      }
      
      Apparently, @path->locks[i] is always true at this 'if'.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d80bb3f9
    • Liu Bo's avatar
      Btrfs: grab write lock directly if write_lock_level is the max level · 662c653b
      Liu Bo authored
      Typically, when acquiring root node's lock, btrfs tries its best to get
      read lock and trade for write lock if @write_lock_level implies to do so.
      
      In case of (cow && (p->keep_locks || p->lowest_level)), write_lock_level
      is set to BTRFS_MAX_LEVEL, which means we need to acquire root node's
      write lock directly.
      
      In this particular case, the dance of acquiring read lock and then trading
      for write lock can be saved.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      662c653b
    • Liu Bo's avatar
      Btrfs: move get root out of btrfs_search_slot to a helper · 1fc28d8e
      Liu Bo authored
      It's good to have a helper instead of having all get-root details
      open-coded.  The new helper locks (if necessary) and sets root node of
      the path.
      
      Also invert the checks to make the code flow easier to read.  There is
      no functional change in this commit.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1fc28d8e
    • Liu Bo's avatar
      Btrfs: use more straightforward extent_buffer_uptodate check · e6a1d6fd
      Liu Bo authored
      If parent_transid "0" is passed to btrfs_buffer_uptodate(),
      btrfs_buffer_uptodate() is equivalent to extent_buffer_uptodate(), but
      extent_buffer_uptodate() is preferred since we don't have to look into
      verify_parent_transid().
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e6a1d6fd
    • Liu Bo's avatar
      Btrfs: remove superfluous free_extent_buffer in read_block_for_search · ca19b4a6
      Liu Bo authored
      read_block_for_search() can be simplified as:
      
      tmp = find_extent_buffer();
      if (tmp)
         return;
      
      ...
      
      free_extent_buffer();
      read_tree_block();
      
      Apparently, @tmp must be NULL at this point, free_extent_buffer() is not
      needed.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca19b4a6
    • Lu Fengqi's avatar
      btrfs: drop unused space_info parameter from create_space_info · 4ca61683
      Lu Fengqi authored
      Since commit dc2d3005 ("btrfs: remove dead create_space_info
      calls"), there is only one caller btrfs_init_space_info. However, it
      doesn't need create_space_info to return space_info at all.
      Signed-off-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4ca61683
    • Liu Bo's avatar
      Btrfs: add parent_transid parameter to veirfy_level_key · ff76a864
      Liu Bo authored
      As verify_level_key() is checked after verify_parent_transid(), i.e.
      
      if (verify_parent_transid())
         ret = -EIO;
      else if (verify_level_key())
         ret = -EUCLEAN;
      
      if parent_transid is 0, verify_parent_transid() skips verifying
      parent_transid and considers eb as valid, and if verify_level_key()
      reports something wrong, we're not going to know if it's caused by
      corrupted metadata or non-checkecd eb (e.g. stale eb).
      
      The stale eb can be from an outdated raid1 mirror after a degraded
      mount, see eg "btrfs: fix reading stale metadata blocks after degraded
      raid1 mounts" (02a3307a) for more details.
      
      @parent_transid is able to tell whether the eb's generation has been
      verified by the caller.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ff76a864
    • Qu Wenruo's avatar
      btrfs: qgroup: show more meaningful qgroup_rescan_init error message · 9593bf49
      Qu Wenruo authored
      Error message from qgroup_rescan_init() mostly looks like:
      
        BTRFS info (device nvme0n1p1): qgroup_rescan_init failed with -115
      
      Which is far from meaningful, and sometimes confusing as for above
      -EINPROGRESS it's mostly (despite the init race) harmless, but sometimes
      it can also indicate problem if the return value is -EINVAL.
      
      Change it to some more meaningful messages like:
      
        BTRFS info (device nvme0n1p1): qgroup rescan is already in progress
      
      And
      
        BTRFS err(device nvme0n1p1): qgroup rescan init failed, qgroup is not enabled
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      [ update the messages and level ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9593bf49
    • Omar Sandoval's avatar
      Btrfs: fix memory and mount leak in btrfs_ioctl_rm_dev_v2() · fd4e994b
      Omar Sandoval authored
      If we have invalid flags set, when we error out we must drop our writer
      counter and free the buffer we allocated for the arguments. This bug is
      trivially reproduced with the following program on 4.7+:
      
      	#include <fcntl.h>
      	#include <stdint.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/ioctl.h>
      	#include <sys/stat.h>
      	#include <sys/types.h>
      	#include <linux/btrfs.h>
      	#include <linux/btrfs_tree.h>
      
      	int main(int argc, char **argv)
      	{
      		struct btrfs_ioctl_vol_args_v2 vol_args = {
      			.flags = UINT64_MAX,
      		};
      		int ret;
      		int fd;
      
      		if (argc != 2) {
      			fprintf(stderr, "usage: %s PATH\n", argv[0]);
      			return EXIT_FAILURE;
      		}
      
      		fd = open(argv[1], O_WRONLY);
      		if (fd == -1) {
      			perror("open");
      			return EXIT_FAILURE;
      		}
      
      		ret = ioctl(fd, BTRFS_IOC_RM_DEV_V2, &vol_args);
      		if (ret == -1)
      			perror("ioctl");
      
      		close(fd);
      		return EXIT_SUCCESS;
      	}
      
      When unmounting the filesystem, we'll hit the
      WARN_ON(mnt_get_writers(mnt)) in cleanup_mnt() and also may prevent the
      filesystem to be remounted read-only as the writer count will stay
      lifted.
      
      Fixes: 6b526ed7 ("btrfs: introduce device delete by devid")
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fd4e994b
    • Qu Wenruo's avatar
      btrfs: lzo: Harden inline lzo compressed extent decompression · de885e3e
      Qu Wenruo authored
      For inlined extent, we only have one segment, thus less things to check.
      And further more, inlined extent always has the csum in its leaf header,
      it's less probable to have corrupted data.
      
      Anyway, still check header and segment header.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      de885e3e
    • Qu Wenruo's avatar
      btrfs: lzo: Add header length check to avoid potential out-of-bounds access · 314bfa47
      Qu Wenruo authored
      James Harvey reported that some corrupted compressed extent data can
      lead to various kernel memory corruption.
      
      Such corrupted extent data belongs to inode with NODATASUM flags, thus
      data csum won't help us detecting such bug.
      
      If lucky enough, KASAN could catch it like:
      
      BUG: KASAN: slab-out-of-bounds in lzo_decompress_bio+0x384/0x7a0 [btrfs]
      Write of size 4096 at addr ffff8800606cb0f8 by task kworker/u16:0/2338
      
      CPU: 3 PID: 2338 Comm: kworker/u16:0 Tainted: G           O      4.17.0-rc5-custom+ #50
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
      Call Trace:
       dump_stack+0xc2/0x16b
       print_address_description+0x6a/0x270
       kasan_report+0x260/0x380
       memcpy+0x34/0x50
       lzo_decompress_bio+0x384/0x7a0 [btrfs]
       end_compressed_bio_read+0x99f/0x10b0 [btrfs]
       bio_endio+0x32e/0x640
       normal_work_helper+0x15a/0xea0 [btrfs]
       process_one_work+0x7e3/0x1470
       worker_thread+0x1b0/0x1170
       kthread+0x2db/0x390
       ret_from_fork+0x22/0x40
      ...
      
      The offending compressed data has the following info:
      
      Header:			length 32768		(looks completely valid)
      Segment 0 Header:	length 3472882419	(obviously out of bounds)
      
      Then when handling segment 0, since it's over the current page, we need
      the copy the compressed data to temporary buffer in workspace, then such
      large size would trigger out-of-bounds memory access, screwing up the
      whole kernel.
      
      Fix it by adding extra checks on header and segment headers to ensure we
      won't access out-of-bounds, and even checks the decompressed data won't
      be out-of-bounds.
      Reported-by: default avatarJames Harvey <jamespharvey20@gmail.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ updated comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      314bfa47
  2. 29 May, 2018 10 commits
  3. 28 May, 2018 14 commits