1. 05 Nov, 2015 2 commits
    • Justin Maggard's avatar
      btrfs: qgroup: exit the rescan worker during umount · 7343dd61
      Justin Maggard authored
      I was hitting a consistent NULL pointer dereference during shutdown that
      showed the trace running through end_workqueue_bio().  I traced it back to
      the endio_meta_workers workqueue being poked after it had already been
      destroyed.
      
      Eventually I found that the root cause was a qgroup rescan that was still
      in progress while we were stopping all the btrfs workers.
      
      Currently we explicitly pause balance and scrub operations in
      close_ctree(), but we do nothing to stop the qgroup rescan.  We should
      probably be doing the same for qgroup rescan, but that's a much larger
      change.  This small change is good enough to allow me to unmount without
      crashing.
      Signed-off-by: default avatarJustin Maggard <jmaggard@netgear.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      7343dd61
    • Filipe Manana's avatar
      Btrfs: fix extent accounting for partial direct IO writes · 9c9464cc
      Filipe Manana authored
      When doing a write using direct IO we can end up not doing the whole write
      operation using the direct IO path, in that case we fallback to a buffered
      write to do the remaining IO. This happens for example if the range we are
      writing to contains a compressed extent.
      When we do a partial write and fallback to buffered IO, due to the
      existence of a compressed extent for example, we end up not adjusting the
      outstanding extents counter of our inode which ends up getting decremented
      twice, once by the DIO ordered extent for the partial write and once again
      by btrfs_direct_IO(), resulting in an arithmetic underflow at
      extent-tree.c:drop_outstanding_extent(). For example if we have:
      
        extents        [ prealloc extent ] [ compressed extent ]
        offsets        A        B          C       D           E
      
      and at the moment our inode's outstanding extents counter is 0, if we do a
      direct IO write against the range [B, D[ (which has a length smaller than
      128Mb), we end up bumping our inode's outstanding extents counter to 1, we
      create a DIO ordered extent for the range [B, C[ and then fallback to a
      buffered write for the range [C, D[. The direct IO handler
      (inode.c:btrfs_direct_IO()) decrements the outstanding extents counter by
      1, leaving it with a value of 0, through a call to
      btrfs_delalloc_release_space() and then shortly after the DIO ordered
      extent finishes and calls btrfs_delalloc_release_metadata() which ends
      up to attempt to decrement the inode's outstanding extents counter by 1,
      resulting in an assertion failure at drop_outstanding_extent() because
      the operation would result in an arithmetic underflow (0 - 1). This
      produces the following trace:
      
        [125471.336838] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5526
        [125471.338844] ------------[ cut here ]------------
        [125471.340745] kernel BUG at fs/btrfs/ctree.h:4173!
        [125471.340745] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        [125471.340745] Modules linked in: btrfs f2fs xfs libcrc32c dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpufreq psmouse i2c_piix4 parport pcspkr serio_raw microcode processor evdev i2c_core button ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci virtio_ring floppy libata virtio e1000 scsi_mod [last unloaded: btrfs]
        [125471.340745] CPU: 10 PID: 23649 Comm: kworker/u32:1 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
        [125471.340745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
        [125471.340745] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        [125471.340745] task: ffff8804244fcf80 ti: ffff88040a118000 task.ti: ffff88040a118000
        [125471.340745] RIP: 0010:[<ffffffffa0550da1>]  [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
        [125471.340745] RSP: 0018:ffff88040a11bc78  EFLAGS: 00010296
        [125471.340745] RAX: 0000000000000075 RBX: 0000000000005000 RCX: 0000000000000000
        [125471.340745] RDX: ffffffff81098f93 RSI: ffffffff8147c619 RDI: 00000000ffffffff
        [125471.340745] RBP: ffff88040a11bc78 R08: 0000000000000001 R09: 0000000000000000
        [125471.340745] R10: ffff88040a11bc08 R11: ffffffff81651000 R12: ffff8803efb4a000
        [125471.340745] R13: ffff8803efb4a000 R14: 0000000000000000 R15: ffff8802f8e33c88
        [125471.340745] FS:  0000000000000000(0000) GS:ffff88043dd40000(0000) knlGS:0000000000000000
        [125471.340745] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        [125471.340745] CR2: 00007fae7ca86095 CR3: 0000000001a0b000 CR4: 00000000000006e0
        [125471.340745] Stack:
        [125471.340745]  ffff88040a11bc88 ffffffffa04ca0cd ffff88040a11bcc8 ffffffffa04ceeb1
        [125471.340745]  ffff8802f8e33940 ffff8802c93eadb0 ffff8802f8e0bf50 ffff8803efb4a000
        [125471.340745]  0000000000000000 ffff8802f8e33c88 ffff88040a11bd38 ffffffffa04eccfa
        [125471.340745] Call Trace:
        [125471.340745]  [<ffffffffa04ca0cd>] drop_outstanding_extent+0x3d/0x6d [btrfs]
        [125471.340745]  [<ffffffffa04ceeb1>] btrfs_delalloc_release_metadata+0x51/0xdd [btrfs]
        [125471.340745]  [<ffffffffa04eccfa>] btrfs_finish_ordered_io+0x420/0x4eb [btrfs]
        [125471.340745]  [<ffffffffa04ecdda>] finish_ordered_fn+0x15/0x17 [btrfs]
        [125471.340745]  [<ffffffffa050e6e8>] normal_work_helper+0x14c/0x32a [btrfs]
        [125471.340745]  [<ffffffffa050e9c8>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
        [125471.340745]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
        [125471.340745]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
        [125471.340745]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
        [125471.340745]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
        [125471.340745]  [<ffffffff8106904d>] kthread+0xef/0xf7
        [125471.340745]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
        [125471.340745]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
        [125471.340745]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
        [125471.340745] Code: a5 55 a0 48 89 e5 e8 42 50 bc e0 0f 0b 55 89 f1 48 c7 c2 f0 a8 55 a0 48 89 fe 31 c0 48 c7 c7 14 aa 55 a0 48 89 e5 e8 22 50 bc e0 <0f> 0b 0f 1f 44 00 00 55 31 c9 ba 18 00 00 00 48 89 e5 41 56 41
        [125471.340745] RIP  [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
        [125471.340745]  RSP <ffff88040a11bc78>
        [125471.539620] ---[ end trace 144259f7838b4aa4 ]---
      
      So fix this by ensuring we adjust the outstanding extents counter when we
      do the fallback just like we do for the case where the whole write can be
      done through the direct IO path.
      
      We were also adjusting the outstanding extents counter by a constant value
      of 1, which is incorrect because we were ignorning that we account extents
      in BTRFS_MAX_EXTENT_SIZE units, o fix that as well.
      
      The following test case for fstests reproduces this issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_xfs_io_command "falloc"
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount "-o compress"
      
        # Create a compressed extent covering the range [700K, 800K[.
        $XFS_IO_PROG -f -s -c "pwrite -S 0xaa -b 100K 700K 100K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Create prealloc extent covering the range [600K, 700K[.
        $XFS_IO_PROG -c "falloc 600K 100K" $SCRATCH_MNT/foo
      
        # Write 80K of data to the range [640K, 720K[ using direct IO. This
        # range covers both the prealloc extent and the compressed extent.
        # Because there's a compressed extent in the range we are writing to,
        # the DIO write code path ends up only writing the first 60k of data,
        # which goes to the prealloc extent, and then falls back to buffered IO
        # for writing the remaining 20K of data - because that remaining data
        # maps to a file range containing a compressed extent.
        # When falling back to buffered IO, we used to trigger an assertion when
        # releasing reserved space due to bad accounting of the inode's
        # outstanding extents counter, which was set to 1 but we ended up
        # decrementing it by 1 twice, once through the ordered extent for the
        # 60K of data we wrote using direct IO, and once through the main direct
        # IO handler (inode.cbtrfs_direct_IO()) because the direct IO write
        # wrote less than 80K of data (60K).
        $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 80K 640K 80K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now similar test as above but for very large write operations. This
        # triggers special cases for an inode's outstanding extents accounting,
        # as internally btrfs logically splits extents into 128Mb units.
        $XFS_IO_PROG -f -s \
            -c "pwrite -S 0xaa -b 128M 258M 128M" \
            -c "falloc 0 258M" \
            $SCRATCH_MNT/bar | _filter_xfs_io
        $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 256M 3M 256M" $SCRATCH_MNT/bar \
            | _filter_xfs_io
      
        # Now verify the file contents are correct and that they are the same
        # even after unmounting and mounting the fs again (or evicting the page
        # cache).
        #
        # For file foo, all bytes in the range [0, 640K[ must have a value of
        # 0x00, all bytes in the range [640K, 720K[ must have a value of 0xbb
        # and all bytes in the range [720K, 800K[ must have a value of 0xaa.
        #
        # For file bar, all bytes in the range [0, 3M[ must havea value of 0x00,
        # all bytes in the range [3M, 259M[ must have a value of 0xbb and all
        # bytes in the range [259M, 386M[ must have a value of 0xaa.
        #
        echo "File digests before remounting the file system:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/bar | _filter_scratch
        _scratch_remount
        echo "File digests after remounting the file system:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/bar | _filter_scratch
      
        status=0
        exit
      
      Fixes: e1cbbfa5 ("Btrfs: fix outstanding_extents accounting in DIO")
      Fixes: 3e05bde8 ("Btrfs: only adjust outstanding_extents when we do a short write")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      9c9464cc
  2. 03 Nov, 2015 3 commits
    • Filipe Manana's avatar
      Btrfs: fix hole punching when using the no-holes feature · 2959a32a
      Filipe Manana authored
      When we are using the no-holes feature, if we punch a hole into a file
      range that already contains a hole which overlaps the range we are passing
      to fallocate(), we end up removing the extent map that represents the
      existing hole without adding a new one. This happens because with the
      no-holes feature we do not have explicit extent items to represent holes
      and therefore the call to __btrfs_drop_extents(), made from
      btrfs_punch_hole(), returns an end offset to the variable drop_end that
      is smaller than the end of the range passed to fallocate(), while it
      drops all existing extent maps in that range.
      Normally having a missing extent map is not a problem, for example for
      a readpages() operation we just end up building the extent map by
      looking at the fs/subvol tree for a matching extent item (or a lack of
      one for implicit holes). However for an fsync that uses the fast path,
      which needs to look at the list of modified extent maps, this means
      the fsync will not record information about the complete hole we had
      before the fallocate() call into the log tree, resulting in a file with
      content/layout that does not match what we had neither before nor after
      the hole punch operation.
      
      The following test case for fstests reproduces the issue. It fails without
      this change because we get a file with a different digest after the fsync
      log replay and also with a different extent/hole layout.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
           _cleanup_flakey
           rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/punch
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_xfs_io_command "fpunch"
        _require_xfs_io_command "fiemap"
        _require_dm_target flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        # This test was motivated by an issue found in btrfs when the btrfs
        # no-holes feature is enabled (introduced in kernel 3.14). So enable
        # the feature if the fs being tested is btrfs.
        if [ $FSTYP == "btrfs" ]; then
            _require_btrfs_fs_feature "no_holes"
            _require_btrfs_mkfs_feature "no-holes"
            MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
        fi
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create out test file with some data and then fsync it.
        # We do the fsync only to make sure the last fsync we do in this test
        # triggers the fast code path of btrfs' fsync implementation, a
        # condition necessary to trigger the bug btrfs had.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 128K" \
                        -c "fsync"                  \
                        $SCRATCH_MNT/foobar | _filter_xfs_io
      
        # Now punch a hole against the range [96K, 128K[.
        $XFS_IO_PROG -c "fpunch 96K 32K" $SCRATCH_MNT/foobar
      
        # Punch another hole against a range that overlaps the previous range
        # and ends beyond eof.
        $XFS_IO_PROG -c "fpunch 64K 128K" $SCRATCH_MNT/foobar
      
        # Punch another hole against a range that overlaps the first range
        # ([96K, 128K[) and ends at eof.
        $XFS_IO_PROG -c "fpunch 32K 96K" $SCRATCH_MNT/foobar
      
        # Fsync our file. We want to verify that, after a power failure and
        # mounting the filesystem again, the file content reflects all the hole
        # punch operations.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
      
        echo "File digest before power failure:"
        md5sum $SCRATCH_MNT/foobar | _filter_scratch
      
        echo "Fiemap before power failure:"
        $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap
      
        # Silently drop all writes and umount to simulate a crash/power failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file
        # contents.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        echo "File digest after log replay:"
        # Must match the same digest we got before the power failure.
        md5sum $SCRATCH_MNT/foobar | _filter_scratch
      
        echo "Fiemap after log replay:"
        # Must match the same extent listing we got before the power failure.
        $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap
      
        _unmount_flakey
      
        status=0
        exit
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      2959a32a
    • chandan's avatar
      Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state · 13a0db5a
      chandan authored
      When executing generic/001 in a loop on a ppc64 machine (with both sectorsize
      and nodesize set to 64k), the following call trace is observed,
      
      WARNING: at /root/repos/linux/fs/btrfs/locking.c:253
      Modules linked in:
      CPU: 2 PID: 8353 Comm: umount Not tainted 4.3.0-rc5-13676-ga5e681d9 #54
      task: c0000000f2b1f560 ti: c0000000f6008000 task.ti: c0000000f6008000
      NIP: c000000000520c88 LR: c0000000004a3b34 CTR: 0000000000000000
      REGS: c0000000f600a820 TRAP: 0700   Not tainted  (4.3.0-rc5-13676-ga5e681d9)
      MSR: 8000000102029032 <SF,VEC,EE,ME,IR,DR,RI>  CR: 24444884  XER: 00000000
      CFAR: c0000000004a3b30 SOFTE: 1
      GPR00: c0000000004a3b34 c0000000f600aaa0 c00000000108ac00 c0000000f5a808c0
      GPR04: 0000000000000000 c0000000f600ae60 0000000000000000 0000000000000005
      GPR08: 00000000000020a1 0000000000000001 c0000000f2b1f560 0000000000000030
      GPR12: 0000000084842882 c00000000fdc0900 c0000000f600ae60 c0000000f070b800
      GPR16: 0000000000000000 c0000000f3c8a000 0000000000000000 0000000000000049
      GPR20: 0000000000000001 0000000000000001 c0000000f5aa01f8 0000000000000000
      GPR24: 0f83e0f83e0f83e1 c0000000f5a808c0 c0000000f3c8d000 c000000000000000
      GPR28: c0000000f600ae74 0000000000000001 c0000000f3c8d000 c0000000f5a808c0
      NIP [c000000000520c88] .btrfs_tree_lock+0x48/0x2a0
      LR [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
      Call Trace:
      [c0000000f600aaa0] [c0000000f600ab80] 0xc0000000f600ab80 (unreliable)
      [c0000000f600ab80] [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
      [c0000000f600ac00] [c0000000004a99dc] .btrfs_search_slot+0xa8c/0xc00
      [c0000000f600ad40] [c0000000004ab878] .btrfs_insert_empty_items+0x98/0x120
      [c0000000f600adf0] [c00000000050da44] .btrfs_finish_chunk_alloc+0x1d4/0x620
      [c0000000f600af20] [c0000000004be854] .btrfs_create_pending_block_groups+0x1d4/0x2c0
      [c0000000f600b020] [c0000000004bf188] .do_chunk_alloc+0x3c8/0x420
      [c0000000f600b100] [c0000000004c27cc] .find_free_extent+0xbfc/0x1030
      [c0000000f600b260] [c0000000004c2ce8] .btrfs_reserve_extent+0xe8/0x250
      [c0000000f600b330] [c0000000004c2f90] .btrfs_alloc_tree_block+0x140/0x590
      [c0000000f600b440] [c0000000004a47b4] .__btrfs_cow_block+0x124/0x780
      [c0000000f600b530] [c0000000004a4fc0] .btrfs_cow_block+0xf0/0x250
      [c0000000f600b5e0] [c0000000004a917c] .btrfs_search_slot+0x22c/0xc00
      [c0000000f600b720] [c00000000050aa40] .btrfs_remove_chunk+0x1b0/0x9f0
      [c0000000f600b850] [c0000000004c4e04] .btrfs_delete_unused_bgs+0x434/0x570
      [c0000000f600b950] [c0000000004d3cb8] .close_ctree+0x2e8/0x3b0
      [c0000000f600ba20] [c00000000049d178] .btrfs_put_super+0x18/0x30
      [c0000000f600ba90] [c000000000243cd4] .generic_shutdown_super+0xa4/0x1a0
      [c0000000f600bb10] [c0000000002441d8] .kill_anon_super+0x18/0x30
      [c0000000f600bb90] [c00000000049c898] .btrfs_kill_super+0x18/0xc0
      [c0000000f600bc10] [c0000000002444f8] .deactivate_locked_super+0x98/0xe0
      [c0000000f600bc90] [c000000000269f94] .cleanup_mnt+0x54/0xa0
      [c0000000f600bd10] [c0000000000bd744] .task_work_run+0xc4/0x100
      [c0000000f600bdb0] [c000000000016334] .do_notify_resume+0x74/0x80
      [c0000000f600be30] [c0000000000098b8] .ret_from_except_lite+0x64/0x68
      Instruction dump:
      fba1ffe8 fbc1fff0 fbe1fff8 7c791b78 f8010010 f821ff21 e94d0290 81030040
      812a04e8 7d094a78 7d290034 5529d97e <0b090000> 3b400000 3be30050 3bc3004c
      
      The above call trace is seen even on x86_64; albeit very rarely and that too
      with nodesize set to 64k and with nospace_cache mount option being used.
      
      The reason for the above call trace is,
      btrfs_remove_chunk
        check_system_chunk
          Allocate chunk if required
        For each physical stripe on underlying device,
          btrfs_free_dev_extent
            ...
            Take lock on Device tree's root node
            btrfs_cow_block("dev tree's root node");
              btrfs_reserve_extent
                find_free_extent
      	    index = BTRFS_RAID_DUP;
      	    have_caching_bg = false;
      
                  When in LOOP_CACHING_NOWAIT state, Assume we find a block group
      	    which is being cached; Hence have_caching_bg is set to true
      
                  When repeating the search for the next RAID index, we set
      	    have_caching_bg to false.
      
      Hence right after completing the LOOP_CACHING_NOWAIT state, we incorrectly
      skip LOOP_CACHING_WAIT state and move to LOOP_ALLOC_CHUNK state where we
      allocate a chunk and try to add entries corresponding to the chunk's physical
      stripe into the device tree. When doing so the task deadlocks itself waiting
      for the blocking lock on the root node of the device tree.
      
      This commit fixes the issue by introducing a new local variable to help
      indicate as to whether a block group of any RAID type is being cached.
      Signed-off-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      13a0db5a
    • Qu Wenruo's avatar
      btrfs: Fix a data space underflow warning · 485290a7
      Qu Wenruo authored
      Even with quota disabled, generic/127 will trigger a kernel warning by
      underflow data space info.
      
      The bug is caused by buffered write, which in case of short copy, the
      start parameter for btrfs_delalloc_release_space() is wrong, and
      round_up/down() in btrfs_delalloc_release() extents the range to page
      aligned, decreasing one more page than expected.
      
      This patch will fix it by passing correct start.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      485290a7
  3. 27 Oct, 2015 10 commits
  4. 25 Oct, 2015 2 commits
    • Filipe Manana's avatar
      Btrfs: fix regression running delayed references when using qgroups · b06c4bf5
      Filipe Manana authored
      In the kernel 4.2 merge window we had a big changes to the implementation
      of delayed references and qgroups which made the no_quota field of delayed
      references not used anymore. More specifically the no_quota field is not
      used anymore as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field actually prevents delayed references from
      getting merged, which in turn cause the following BUG_ON(), at
      fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens on a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2 is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3 is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.
      
        7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
           BUG_ON because Ref1->red_mod != 1 (equals 2).
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Also, this fixes deadlock issue when using the clone ioctl with qgroups
      enabled, as reported by Elias Probst in the mailing list. The deadlock
      happens because after calling btrfs_insert_empty_item we have our path
      holding a write lock on a leaf of the fs/subvol tree and then before
      releasing the path we called check_ref() which did backref walking, when
      qgroups are enabled, and tried to read lock the same leaf. The trace for
      this case is the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eleminating check_ref(), which no longer is
      needed as its purpose was to get a value for the no_quota field of
      a delayed reference (this patch removes the no_quota field as mentioned
      earlier).
      Reported-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: default avatarElias Probst <mail@eliasprobst.eu>
      Reported-by: default avatarPeter Becker <floyd.net@gmail.com>
      Reported-by: default avatarMalte Schröder <malte@tnxip.de>
      Reported-by: default avatarDerek Dongray <derek@valedon.co.uk>
      Reported-by: default avatarErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      b06c4bf5
    • Filipe Manana's avatar
      Btrfs: fix regression when running delayed references · 2c3cf7d5
      Filipe Manana authored
      In the kernel 4.2 merge window we had a refactoring/rework of the delayed
      references implementation in order to fix certain problems with qgroups.
      However that rework introduced one more regression that leads to the
      following trace when running delayed references for metadata:
      
      [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
      [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
      [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
      [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
      [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
      [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
      [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
      [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
      [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
      [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
      [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
      [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
      [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
      [35908.065201] Stack:
      [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
      [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
      [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
      [35908.065201] Call Trace:
      [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
      [35908.065201]  [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
      [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
      [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
      [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
      [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
      [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
      [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
      [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
      [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
      [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
      [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
      [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
      [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
      [35908.065201] RIP  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
      [35908.065201]  RSP <ffff88010c4cbb08>
      [35908.310885] ---[ end trace fe4299baf0666457 ]---
      
      This happens because the new delayed references code no longer merges
      delayed references that have different sequence values. The following
      steps are an example sequence leading to this issue:
      
      1) Transaction N starts, fs_info->tree_mod_seq has value 0;
      
      2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
         bytenr A is created, with a value of 1 and a seq value of 0;
      
      3) fs_info->tree_mod_seq is incremented to 1;
      
      4) Extent buffer A is deleted through btrfs_del_items(), which calls
         btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
         later returns the metadata extent associated to extent buffer A to
         the free space cache (the range is not pinned), because the extent
         buffer was created in the current transaction (N) and writeback never
         happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
         in the extent buffer).
         This creates the delayed reference Ref2 for bytenr A, with a value
         of -1 and a seq value of 1;
      
      5) Delayed reference Ref2 is not merged with Ref1 when we create it,
         because they have different sequence numbers (decided at
         add_delayed_ref_tail_merge());
      
      6) fs_info->tree_mod_seq is incremented to 2;
      
      7) Some task attempts to allocate a new extent buffer (done at
         extent-tree.c:find_free_extent()), but due to heavy fragmentation
         and running low on metadata space the clustered allocation fails
         and we fall back to unclustered allocation, which finds the
         extent at offset A, so a new extent buffer at offset A is allocated.
         This creates delayed reference Ref3 for bytenr A, with a value of 1
         and a seq value of 2;
      
      8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
         all have different seq values;
      
      9) We start running the delayed references (__btrfs_run_delayed_refs());
      
      10) The delayed Ref1 is the first one being applied, which ends up
          creating an inline extent backref in the extent tree;
      
      10) Next the delayed reference Ref3 is selected for execution, and not
          Ref2, because select_delayed_ref() always gives a preference for
          positive references (that have an action of BTRFS_ADD_DELAYED_REF);
      
      11) When running Ref3 we encounter alreay the inline extent backref
          in the extent tree at insert_inline_extent_backref(), which makes
          us hit the following BUG_ON:
      
              BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
      
          This is always true because owner corresponds to the level of the
          extent buffer/btree node in the btree.
      
      For the scenario described above we hit the BUG_ON because we never merge
      references that have different seq values.
      
      We used to do the merging before the 4.2 kernel, more specifically, before
      the commmits:
      
        c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
        c43d160f ("btrfs: delayed-ref: Cleanup the unneeded functions.")
      
      This issue became more exposed after the following change that was added
      to 4.2 as well:
      
        cffc3374 ("Btrfs: fix order by which delayed references are run")
      
      Which in turn fixed another regression by the two commits previously
      mentioned.
      
      So fix this by bringing back the delayed reference merge code, with the
      proper adaptations so that it operates against the new data structure
      (linked list vs old red black tree implementation).
      
      This issue was hit running fstest btrfs/063 in a loop. Several people have
      reported this issue in the mailing list when running on kernels 4.2+.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Fixes: c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
      Reported-by: default avatarPeter Becker <floyd.net@gmail.com>
      Reported-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: default avatarMalte Schröder <malte@tnxip.de>
      Reported-by: default avatarDerek Dongray <derek@valedon.co.uk>
      Reported-by: default avatarErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      2c3cf7d5
  5. 22 Oct, 2015 23 commits