1. 01 Jun, 2017 3 commits
    • Jeff Mahoney's avatar
      btrfs: fix race with relocation recovery and fs_root setup · a9b3311e
      Jeff Mahoney authored
      If we have to recover relocation during mount, we'll ultimately have to
      evict the orphan inode.  That goes through the reservation dance, where
      priority_reclaim_metadata_space and flush_space expect fs_info->fs_root
      to be valid.  That's the next thing to be set up during mount, so we
      crash, almost always in flush_space trying to join the transaction
      but priority_reclaim_metadata_space is possible as well.  This call
      path has been problematic in the past WRT whether ->fs_root is valid
      yet.  Commit 957780eb (Btrfs: introduce ticketed enospc
      infrastructure) added new users that are called in the direct path
      instead of the async path that had already been worked around.
      
      The thing is that we don't actually need the fs_root, specifically, for
      anything.  We either use it to determine whether the root is the
      chunk_root for use in choosing an allocation profile or as a root to pass
      btrfs_join_transaction before immediately committing it.  Anything that
      isn't the chunk root works in the former case and any root works in
      the latter.
      
      A simple fix is to use a root we know will always be there: the
      extent_root.
      
      Cc: <stable@vger.kernel.org> # v4.8+
      Fixes: 957780eb (Btrfs: introduce ticketed enospc infrastructure)
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a9b3311e
    • Jeff Mahoney's avatar
      btrfs: fix memory leak in update_space_info failure path · 896533a7
      Jeff Mahoney authored
      If we fail to add the space_info kobject, we'll leak the memory
      for the percpu counter.
      
      Fixes: 6ab0a202 (btrfs: publish allocation data in sysfs)
      Cc: <stable@vger.kernel.org> # v3.14+
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      896533a7
    • David Sterba's avatar
      btrfs: use correct types for page indices in btrfs_page_exists_in_range · cc2b702c
      David Sterba authored
      Variables start_idx and end_idx are supposed to hold a page index
      derived from the file offsets. The int type is not the right one though,
      offsets larger than 1 << 44 will get silently trimmed off the high bits.
      (1 << 44 is 16TiB)
      
      What can go wrong, if start is below the boundary and end gets trimmed:
      - if there's a page after start, we'll find it (radix_tree_gang_lookup_slot)
      - the final check "if (page->index <= end_idx)" will unexpectedly fail
      
      The function will return false, ie. "there's no page in the range",
      although there is at least one.
      
      btrfs_page_exists_in_range is used to prevent races in:
      
      * in hole punching, where we make sure there are not pages in the
        truncated range, otherwise we'll wait for them to finish and redo
        truncation, but we're going to replace the pages with holes anyway so
        the only problem is the intermediate state
      
      * lock_extent_direct: we want to make sure there are no pages before we
        lock and start DIO, to prevent stale data reads
      
      For practical occurence of the bug, there are several constaints.  The
      file must be quite large, the affected range must cross the 16TiB
      boundary and the internal state of the file pages and pending operations
      must match.  Also, we must not have started any ordered data in the
      range, otherwise we don't even reach the buggy function check.
      
      DIO locking tries hard in several places to avoid deadlocks with
      buffered IO and avoids waiting for ranges. The worst consequence seems
      to be stale data read.
      
      CC: Liu Bo <bo.li.liu@oracle.com>
      CC: stable@vger.kernel.org	# 3.16+
      Fixes: fc4adbff ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking")
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cc2b702c
  2. 16 May, 2017 3 commits
    • Colin Ian King's avatar
      btrfs: fix incorrect error return ret being passed to mapping_set_error · bff5baf8
      Colin Ian King authored
      The setting of return code ret should be based on the error code
      passed into function end_extent_writepage and not on ret. Thanks
      to Liu Bo for spotting this mistake in the original fix I submitted.
      
      Detected by CoverityScan, CID#1414312 ("Logically dead code")
      
      Fixes: 5dca6eea ("Btrfs: mark mapping with error flag to report errors to userspace")
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bff5baf8
    • Jan Kara's avatar
      btrfs: Make flush bios explicitely sync · 8d910125
      Jan Kara authored
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
      definitions.  generic_make_request_checks() however strips REQ_FUA and
      REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
      write cache and thus write effectively becomes asynchronous which can
      lead to performance regressions
      
      Fix the problem by making sure all bios which are synchronous are
      properly marked with REQ_SYNC.
      
      CC: David Sterba <dsterba@suse.com>
      CC: linux-btrfs@vger.kernel.org
      Fixes: b685d3d6Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8d910125
    • Qu Wenruo's avatar
      btrfs: fiemap: Cache and merge fiemap extent before submit it to user · 4751832d
      Qu Wenruo authored
      [BUG]
      Cycle mount btrfs can cause fiemap to return different result.
      Like:
       # mount /dev/vdb5 /mnt/btrfs
       # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
       # xfs_io -c "fiemap -v" /mnt/btrfs/file
       /mnt/test/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        25088..25215       128   0x1
       # umount /mnt/btrfs
       # mount /dev/vdb5 /mnt/btrfs
       # xfs_io -c "fiemap -v" /mnt/btrfs/file
       /mnt/test/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..31]:         25088..25119        32   0x0
         1: [32..63]:        25120..25151        32   0x0
         2: [64..95]:        25152..25183        32   0x0
         3: [96..127]:       25184..25215        32   0x1
      But after above fiemap, we get correct merged result if we call fiemap
      again.
       # xfs_io -c "fiemap -v" /mnt/btrfs/file
       /mnt/test/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        25088..25215       128   0x1
      
      [REASON]
      Btrfs will try to merge extent map when inserting new extent map.
      
      btrfs_fiemap(start=0 len=(u64)-1)
      |- extent_fiemap(start=0 len=(u64)-1)
         |- get_extent_skip_holes(start=0 len=64k)
         |  |- btrfs_get_extent_fiemap(start=0 len=64k)
         |     |- btrfs_get_extent(start=0 len=64k)
         |        |  Found on-disk (ino, EXTENT_DATA, 0)
         |        |- add_extent_mapping()
         |        |- Return (em->start=0, len=16k)
         |
         |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
         |
         |- get_extent_skip_holes(start=0 len=64k)
         |  |- btrfs_get_extent_fiemap(start=0 len=64k)
         |     |- btrfs_get_extent(start=16k len=48k)
         |        |  Found on-disk (ino, EXTENT_DATA, 16k)
         |        |- add_extent_mapping()
         |        |  |- try_merge_map()
         |        |     Merge with previous em start=0 len=16k
         |        |     resulting em start=0 len=32k
         |        |- Return (em->start=0, len=32K)    << Merged result
         |- Stripe off the unrelated range (0~16K) of return em
         |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
            ^^^ Causing split fiemap extent.
      
      And since in add_extent_mapping(), em is already merged, in next
      fiemap() call, we will get merged result.
      
      [FIX]
      Here we introduce a new structure, fiemap_cache, which records previous
      fiemap extent.
      
      And will always try to merge current fiemap_cache result before calling
      fiemap_fill_next_extent().
      Only when we failed to merge current fiemap extent with cached one, we
      will call fiemap_fill_next_extent() to submit cached one.
      
      So by this method, we can merge all fiemap extents.
      
      It can also be done in fs/ioctl.c, however the problem is if
      fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
      extent.
      So I choose to merge it in btrfs.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4751832d
  3. 04 May, 2017 1 commit
  4. 27 Apr, 2017 1 commit
  5. 26 Apr, 2017 7 commits
    • Filipe Manana's avatar
      Btrfs: fix reported number of inode blocks · a7e3b975
      Filipe Manana authored
      Currently when there are buffered writes that were not yet flushed and
      they fall within allocated ranges of the file (that is, not in holes or
      beyond eof assuming there are no prealloc extents beyond eof), btrfs
      simply reports an incorrect number of used blocks through the stat(2)
      system call (or any of its variants), regardless of mount options or
      inode flags (compress, compress-force, nodatacow). This is because the
      number of blocks used that is reported is based on the current number
      of bytes in the vfs inode plus the number of dealloc bytes in the btrfs
      inode. The later covers bytes that both fall within allocated regions
      of the file and holes.
      
      Example scenarios where the number of reported blocks is wrong while the
      buffered writes are not flushed:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)
      
        # The following should have reported 64K...
        $ du -h /mnt/sdc/foo1
        128K	/mnt/sdc/foo1
      
        $ sync
      
        # After flushing the buffered write, it now reports the correct value.
        $ du -h /mnt/sdc/foo1
        64K	/mnt/sdc/foo1
      
        $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
        wrote 65536/65536 bytes at offset 65536
        64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)
      
        # The following should have reported 128K...
        $ du -h /mnt/sdc/foo2
        192K	/mnt/sdc/foo2
      
        $ sync
      
        # After flushing the buffered write, it now reports the correct value.
        $ du -h /mnt/sdc/foo2
        128K	/mnt/sdc/foo2
      
      So the number of used file blocks is simply incorrect, unlike in other
      filesystems such as ext4 and xfs for example, but only while the buffered
      writes are not flushed.
      
      Fix this by tracking the number of delalloc bytes that fall within holes
      and beyond eof of a file, and use instead this new counter when reporting
      the number of used blocks for an inode.
      
      Another different problem that exists is that the delalloc bytes counter
      is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from
      the respective range in the inode's iotree) and the vfs inode's bytes
      counter is only incremented when writeback finishes (through
      insert_reserved_file_extent()). Therefore while writeback is ongoing we
      simply report a wrong number of blocks used by an inode if the write
      operation covers a range previously unallocated. While this change does
      not fix this problem, it does minimizes it a lot by shortening that time
      window, as the new dealloc bytes counter (new_delalloc_bytes) is only
      decremented when writeback finishes right before updating the vfs inode's
      bytes counter. Fully fixing this second problem is not trivial and will
      be addressed later by a different patch.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      a7e3b975
    • Filipe Manana's avatar
      Btrfs: send, fix file hole not being preserved due to inline extent · e1cbfd7b
      Filipe Manana authored
      Normally we don't have inline extents followed by regular extents, but
      there's currently at least one harmless case where this happens. For
      example, when the page size is 4Kb and compression is enabled:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount -o compress /dev/sdb /mnt
        $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
        $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar
      
      In this case we get a compressed inline extent, representing 4Kb of
      data, followed by a hole extent and then a regular data extent. The
      inline extent was not expanded/converted to a regular extent exactly
      because it represents 4Kb of data. This does not cause any apparent
      problem (such as the issue solved by commit e1699d2d
      ("btrfs: add missing memset while reading compressed inline extents"))
      except trigger an unexpected case in the incremental send code path
      that makes us issue an operation to write a hole when it's not needed,
      resulting in more writes at the receiver and wasting space at the
      receiver.
      
      So teach the incremental send code to deal with this particular case.
      
      The issue can be currently triggered by running fstests btrfs/137 with
      compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      e1cbfd7b
    • Filipe Manana's avatar
      Btrfs: fix extent map leak during fallocate error path · be2d253c
      Filipe Manana authored
      If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
      extent map structure. The failure can happen either due to an -ENOMEM
      condition or, when quotas are enabled, due to -EDQUOT for example.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      be2d253c
    • Filipe Manana's avatar
      Btrfs: fix incorrect space accounting after failure to insert inline extent · 1c81ba23
      Filipe Manana authored
      When using compression, if we fail to insert an inline extent we
      incorrectly end up attempting to free the reserved data space twice,
      once through extent_clear_unlock_delalloc(), because we pass it the
      flag EXTENT_DO_ACCOUNTING, and once through a direct call to
      btrfs_free_reserved_data_space_noquota(). This results in a trace
      like the following:
      
      [  834.576240] ------------[ cut here ]------------
      [  834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
      [  834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2
      [  834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
      [  834.596103] Call Trace:
      [  834.596103]  dump_stack+0x67/0x90
      [  834.596103]  __warn+0xc2/0xdd
      [  834.596103]  warn_slowpath_null+0x1d/0x1f
      [  834.596103]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  834.596103]  compress_file_range.constprop.42+0x2fa/0x3fc [btrfs]
      [  834.596103]  ? submit_compressed_extents+0x3a7/0x3a7 [btrfs]
      [  834.596103]  async_cow_start+0x32/0x4d [btrfs]
      [  834.596103]  btrfs_scrubparity_helper+0x187/0x3e7 [btrfs]
      [  834.596103]  btrfs_delalloc_helper+0xe/0x10 [btrfs]
      [  834.596103]  process_one_work+0x273/0x4e4
      [  834.596103]  worker_thread+0x1eb/0x2ca
      [  834.596103]  ? rescuer_thread+0x2b6/0x2b6
      [  834.596103]  kthread+0x100/0x108
      [  834.596103]  ? __list_del_entry+0x22/0x22
      [  834.596103]  ret_from_fork+0x2e/0x40
      [  834.611656] ---[ end trace 719902fe6bdef08f ]---
      
      So fix this by not calling directly btrfs_free_reserved_data_space_noquota()
      if an error happened.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      1c81ba23
    • Filipe Manana's avatar
      Btrfs: fix invalid attempt to free reserved space on failure to cow range · a315e68f
      Filipe Manana authored
      When attempting to COW a file range (we are starting writeback and doing
      COW), if we manage to reserve an extent for the range we will write into
      but fail after reserving it and before creating the respective ordered
      extent, we end up in an error path where we attempt to decrement the
      data space's bytes_may_use counter after we already did it while
      reserving the extent, leading to a warning/trace like the following:
      
      [  847.621524] ------------[ cut here ]------------
      [  847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg
      [  847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2
      [  847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  847.648601] Call Trace:
      [  847.648601]  dump_stack+0x67/0x90
      [  847.648601]  __warn+0xc2/0xdd
      [  847.648601]  warn_slowpath_null+0x1d/0x1f
      [  847.648601]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  847.648601]  btrfs_clear_bit_hook+0x140/0x258 [btrfs]
      [  847.648601]  clear_state_bit+0x87/0x128 [btrfs]
      [  847.648601]  __clear_extent_bit+0x222/0x2b7 [btrfs]
      [  847.648601]  clear_extent_bit+0x17/0x19 [btrfs]
      [  847.648601]  extent_clear_unlock_delalloc+0x3b/0x6b [btrfs]
      [  847.648601]  cow_file_range.isra.39+0x387/0x39a [btrfs]
      [  847.648601]  run_delalloc_nocow+0x4d7/0x70e [btrfs]
      [  847.648601]  ? arch_local_irq_save+0x9/0xc
      [  847.648601]  run_delalloc_range+0xa7/0x2b5 [btrfs]
      [  847.648601]  writepage_delalloc.isra.31+0xb9/0x15c [btrfs]
      [  847.648601]  __extent_writepage+0x249/0x2e8 [btrfs]
      [  847.648601]  extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs]
      [  847.648601]  ? arch_local_irq_save+0x9/0xc
      [  847.648601]  ? mark_lock+0x24/0x201
      [  847.648601]  extent_writepages+0x4b/0x5c [btrfs]
      [  847.648601]  ? btrfs_writepage_start_hook+0xed/0xed [btrfs]
      [  847.648601]  btrfs_writepages+0x28/0x2a [btrfs]
      [  847.648601]  do_writepages+0x23/0x2c
      [  847.648601]  __filemap_fdatawrite_range+0x5a/0x61
      [  847.648601]  filemap_fdatawrite_range+0x13/0x15
      [  847.648601]  btrfs_fdatawrite_range+0x20/0x46 [btrfs]
      [  847.648601]  start_ordered_ops+0x19/0x23 [btrfs]
      [  847.648601]  btrfs_sync_file+0x136/0x42c [btrfs]
      [  847.648601]  vfs_fsync_range+0x8c/0x9e
      [  847.648601]  vfs_fsync+0x1c/0x1e
      [  847.648601]  do_fsync+0x31/0x4a
      [  847.648601]  SyS_fsync+0x10/0x14
      [  847.648601]  entry_SYSCALL_64_fastpath+0x18/0xad
      [  847.648601] RIP: 0033:0x7f5b05200800
      [  847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
      [  847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800
      [  847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003
      [  847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f
      [  847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046
      [  847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740
      [  847.648601]  ? trace_hardirqs_off_caller+0x3f/0xaa
      [  847.685787] ---[ end trace 2a4a3e15382508e8 ]---
      
      So fix this by not attempting to decrement the data space info's
      bytes_may_use counter if we already reserved the extent and an error
      happened before creating the ordered extent. We are already correctly
      freeing the reserved extent if an error happens, so there's no additional
      measure needed.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      a315e68f
    • Qu Wenruo's avatar
      btrfs: Handle delalloc error correctly to avoid ordered extent hang · 52427260
      Qu Wenruo authored
      [BUG]
      If run_delalloc_range() returns error and there is already some ordered
      extents created, btrfs will be hanged with the following backtrace:
      
      Call Trace:
       __schedule+0x2d4/0xae0
       schedule+0x3d/0x90
       btrfs_start_ordered_extent+0x160/0x200 [btrfs]
       ? wake_atomic_t_function+0x60/0x60
       btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
       btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
       btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
       process_one_work+0x2af/0x720
       ? process_one_work+0x22b/0x720
       worker_thread+0x4b/0x4f0
       kthread+0x10f/0x150
       ? process_one_work+0x720/0x720
       ? kthread_create_on_node+0x40/0x40
       ret_from_fork+0x2e/0x40
      
      [CAUSE]
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
      |<>|                       |<---------- cleanup range --------->|
       ||
       \_=> First page handled by end_extent_writepage() in __extent_writepage()
      
      The problem is caused by error handler of run_delalloc_range(), which
      doesn't handle any created ordered extents, leaving them waiting on
      btrfs_finish_ordered_io() to finish.
      
      However after run_delalloc_range() returns error, __extent_writepage()
      won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
      except the first page, and btrfs_finish_ordered_io() won't be triggered
      for created ordered extents either.
      
      So OE 2~n will hang forever, and if OE 1 is larger than one page, it
      will also hang.
      
      [FIX]
      Introduce btrfs_cleanup_ordered_extents() function to cleanup created
      ordered extents and finish them manually.
      
      The function is based on existing
      btrfs_endio_direct_write_update_ordered() function, and modify it to
      act just like btrfs_writepage_endio_hook() but handles specified range
      other than one page.
      
      After fix, delalloc error will be handled like:
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
      |<>|<--------  ----------->|<------ old error handler --------->|
       ||          ||
       ||          \_=> Cleaned up by cleanup_ordered_extents()
       \_=> First page handled by end_extent_writepage() in __extent_writepage()
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      52427260
    • Qu Wenruo's avatar
      btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error · 4dbd80fb
      Qu Wenruo authored
      [BUG]
      When btrfs_reloc_clone_csum() reports error, it can underflow metadata
      and leads to kernel assertion on outstanding extents in
      run_delalloc_nocow() and cow_file_range().
      
       BTRFS info (device vdb5): relocating block group 12582912 flags data
       BTRFS info (device vdb5): found 1 extents
       assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
      
      Currently, due to another bug blocking ordered extents, the bug is only
      reproducible under certain block group layout and using error injection.
      
      a) Create one data block group with one 4K extent in it.
         To avoid the bug that hangs btrfs due to ordered extent which never
         finishes
      b) Make btrfs_reloc_clone_csum() always fail
      c) Relocate that block group
      
      [CAUSE]
      run_delalloc_nocow() and cow_file_range() handles error from
      btrfs_reloc_clone_csum() wrongly:
      
      (The ascii chart shows a more generic case of this bug other than the
      bug mentioned above)
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
                          |<----------- cleanup range --------------->|
      |<-----------  ----------->|
                   \/
       btrfs_finish_ordered_io() range
      
      So error handler, which calls extent_clear_unlock_delalloc() with
      EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
      will both cover OE n, and free its metadata, causing metadata under flow.
      
      [Fix]
      The fix is to ensure after calling btrfs_add_ordered_extent(), we only
      call error handler after increasing the iteration offset, so that
      cleanup range won't cover any created ordered extent.
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
      |<-----------  ----------->|<---------- cleanup range --------->|
                   \/
       btrfs_finish_ordered_io() range
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      4dbd80fb
  6. 18 Apr, 2017 25 commits
    • Anand Jain's avatar
      btrfs: check if the device is flush capable · c2a9c7ab
      Anand Jain authored
      The block layer call chain from submit_bio will check if the write cache
      is enabled for the given queue before submitting the flush. This will
      add a code to fail fast if its not.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ updated changelog to reflect current code stat, blkdev_issue_flush is
        not used yet ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c2a9c7ab
    • Anand Jain's avatar
      btrfs: delete unused member nobarriers · 13e88e15
      Anand Jain authored
      The last consumer of nobarriers is removed by the commit [1] and sync
      won't fail with EOPNOTSUPP anymore. Thus, now when write cache is write
      through it just return success without actually transpiring such a
      request to the block device/lun.
      
      [1]
      commit b25de9d6
      block: remove BIO_EOPNOTSUPP
      
      And, as the device/lun write cache state may change dynamically saving
      such as state won't help either. So deleting the member nobarriers.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      13e88e15
    • Qu Wenruo's avatar
      btrfs: scrub: Fix RAID56 recovery race condition · 28d70e23
      Qu Wenruo authored
      When scrubbing a RAID5 which has recoverable data corruption (only one
      data stripe is corrupted), sometimes scrub will report more csum errors
      than expected. Sometimes even unrecoverable error will be reported.
      
      The problem can be easily reproduced by the following steps:
      1) Create a btrfs with RAID5 data profile with 3 devs
      2) Mount it with nospace_cache or space_cache=v2
         To avoid extra data space usage.
      3) Create a 128K file and sync the fs, unmount it
         Now the 128K file lies at the beginning of the data chunk
      4) Locate the physical bytenr of data chunk on dev3
         Dev3 is the 1st data stripe.
      5) Corrupt the first 64K of the data chunk stripe on dev3
      6) Mount the fs and scrub it
      
      The correct csum error number should be 16 (assuming using x86_64).
      Larger csum error number can be reported in a 1/3 chance.
      And unrecoverable error can also be reported in a 1/10 chance.
      
      The root cause of the problem is RAID5/6 recover code has race
      condition, due to the fact that full scrub is initiated per device.
      
      While for other mirror based profiles, each mirror is independent with
      each other, so race won't cause any big problem.
      
      For example:
              Corrupted       |       Correct          |      Correct        |
      |   Scrub dev3 (D1)     |    Scrub dev2 (D2)     |    Scrub dev1(P)    |
      ------------------------------------------------------------------------
      Read out D1             |Read out D2             |Read full stripe     |
      Check csum              |Check csum              |Check parity         |
      Csum mismatch           |Csum match, continue    |Parity mismatch      |
      handle_errored_block    |                        |handle_errored_block |
       Read out full stripe   |                        | Read out full stripe|
       D1 csum error(err++)   |                        | D1 csum error(err++)|
       Recover D1             |                        | Recover D1          |
      
      So D1's csum error is accounted twice, just because
      handle_errored_block() doesn't have enough protection, and race can happen.
      
      On even worse case, for example D1's recovery code is re-writing
      D1/D2/P, and P's recovery code is just reading out full stripe, then we
      can cause unrecoverable error.
      
      This patch will use previously introduced lock_full_stripe() and
      unlock_full_stripe() to protect the whole scrub_handle_errored_block()
      function for RAID56 recovery.
      So no extra csum error nor unrecoverable error.
      Reported-by: default avatarGoffredo Baroncelli <kreijack@libero.it>
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      28d70e23
    • Qu Wenruo's avatar
      btrfs: scrub: Introduce full stripe lock for RAID56 · 0966a7b1
      Qu Wenruo authored
      Unlike mirror based profiles, RAID5/6 recovery needs to read out the
      whole full stripe.
      
      And if we don't do proper protection, it can easily cause race condition.
      
      Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe()
      for RAID5/6.
      Which store a rb_tree of mutexes for full stripes, so scrub callers can
      use them to lock a full stripe to avoid race.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ minor comment adjustments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0966a7b1
    • Deepa Dinamani's avatar
      btrfs: Use ktime_get_real_ts for root ctime · fa7aede2
      Deepa Dinamani authored
      btrfs_root_item maintains the ctime for root updates.  This is not part
      of vfs_inode.
      
      Since current_time() uses struct inode* as an argument as Linus
      suggested, this cannot be used to update root times unless, we modify
      the signature to use inode.
      
      Since btrfs uses nanosecond time granularity, it can also use
      ktime_get_real_ts directly to obtain timestamp for the root. It is
      necessary to use the timespec time api here because the same
      btrfs_set_stack_timespec_*() apis are used for vfs inode times as well.
      These can be transitioned to using timespec64 when btrfs internally
      changes to use timespec64 as well.
      Signed-off-by: default avatarDeepa Dinamani <deepa.kernel@gmail.com>
      Acked-by: default avatarDavid Sterba <dsterba@suse.com>
      Reviewed-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa7aede2
    • Dan Carpenter's avatar
      Btrfs: handle only applicable errors returned by btrfs_get_extent · 9986277e
      Dan Carpenter authored
      btrfs_get_extent() never returns NULL pointers, so this code introduces
      a static checker warning.
      
      The btrfs_get_extent() is a bit complex, but trust me that it doesn't
      return NULLs and also if it did we would trigger the BUG_ON(!em) before
      the last return statement.
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      [ updated subject ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9986277e
    • Qu Wenruo's avatar
      btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option · 82bafb38
      Qu Wenruo authored
      [BUG]
      The easist way to reproduce the bug is:
      ------
       # mkfs.btrfs -f $dev -n 16K
       # mount $dev $mnt -o inode_cache
       # btrfs quota enable $mnt
       # btrfs quota rescan -w $mnt
       # btrfs qgroup show $mnt
      qgroupid         rfer         excl
      --------         ----         ----
      0/5          32.00KiB     32.00KiB
                   ^^ Twice the correct value
      ------
      
      And fstests/btrfs qgroup test group can easily detect them with
      inode_cache mount option.
      Although some of them are false alerts since old test cases are using
      fixed golden output.
      While new test cases will use "btrfs check" to detect qgroup mismatch.
      
      [CAUSE]
      Inode_cache mount option will make commit_fs_roots() to call
      btrfs_save_ino_cache() to update fs/subvol trees, and generate new
      delayed refs.
      
      However we call btrfs_qgroup_prepare_account_extents() too early, before
      commit_fs_roots().
      This makes the "old_roots" for newly generated extents are always NULL.
      For freeing extent case, this makes both new_roots and old_roots to be
      empty, while correct old_roots should not be empty.
      This causing qgroup numbers not decreased correctly.
      
      [FIX]
      Modify the timing of calling btrfs_qgroup_prepare_account_extents() to
      just before btrfs_qgroup_account_extents(), and add needed delayed_refs
      handler.
      So qgroup can handle inode_map mount options correctly.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      82bafb38
    • Anand Jain's avatar
      btrfs: use q which is already obtained from bdev_get_queue · e884f4f0
      Anand Jain authored
      We have already assigned q from bdev_get_queue() so use it.
      And rearrange the code for better view.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e884f4f0
    • Liu Bo's avatar
      Btrfs: switch to div64_u64 if with a u64 divisor · 42c61ab6
      Liu Bo authored
      This is fixing code pieces where we use div_u64 when passing a u64 divisor.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      42c61ab6
    • Liu Bo's avatar
      Btrfs: update scrub_parity to use u64 stripe_len · 972d7219
      Liu Bo authored
      Commit 3d8da678 ("Btrfs: fix divide error upon chunk's stripe_len")
      changed stripe_len in struct map_lookup to u64, but didn't update
      stripe_len in struct scrub_parity.
      
      This updates the type and switches to div64_u64_rem to match u64 divisor.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      972d7219
    • Liu Bo's avatar
      Btrfs: enable repair during read for raid56 profile · c725328c
      Liu Bo authored
      Now that scrub can fix data errors with the help of parity for raid56
      profile, repair during read is able to as well.
      
      Although the mirror num in raid56 scenario has different meanings, i.e.
      0 or 1: read data directly
      > 1:    do recover with parity,
      it could be fit into how we repair bad block during read.
      
      The trick is to use BTRFS_MAP_READ instead of BTRFS_MAP_WRITE to get the
      device and position on it.
      
      Cc: David Sterba <dsterba@suse.cz>
      Tested-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c725328c
    • David Sterba's avatar
      btrfs: use clear_page where appropriate · 619a9742
      David Sterba authored
      There's a helper to clear whole page, with a arch-specific optimized
      code. The replaced cases do not seem to be in performace critical code,
      but we still might get some percent gain.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      619a9742
    • Qu Wenruo's avatar
      btrfs: Prevent scrub recheck from racing with dev replace · e501bfe3
      Qu Wenruo authored
      scrub_setup_recheck_block() calls btrfs_map_sblock() and then accesses
      bbio without protection of bio_counter.
      
      This can lead to use-after-free if racing with dev replace cancel.
      
      Fix it by increasing bio_counter before calling btrfs_map_sblock() and
      decreasing the bio_counter when corresponding recover is finished.
      
      Cc: Liu Bo <bo.li.liu@oracle.com>
      Reported-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e501bfe3
    • Qu Wenruo's avatar
      btrfs: Wait for in-flight bios before freeing target device for raid56 · ae6529c3
      Qu Wenruo authored
      When raid56 dev-replace is cancelled by running scrub, we will free
      target device without waiting for in-flight bios, causing the following
      NULL pointer deference or general protection failure.
      
       BUG: unable to handle kernel NULL pointer dereference at 00000000000005e0
       IP: generic_make_request_checks+0x4d/0x610
       CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G  O    4.11.0-rc2 #72
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
       Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
       task: ffff88002875b4c0 task.stack: ffffc90001334000
       RIP: 0010:generic_make_request_checks+0x4d/0x610
       Call Trace:
        ? generic_make_request+0xc7/0x360
        generic_make_request+0x24/0x360
        ? generic_make_request+0xc7/0x360
        submit_bio+0x64/0x120
        ? page_in_rbio+0x4d/0x80 [btrfs]
        ? rbio_orig_end_io+0x80/0x80 [btrfs]
        finish_rmw+0x3f4/0x540 [btrfs]
        validate_rbio_for_rmw+0x36/0x40 [btrfs]
        raid_rmw_end_io+0x7a/0x90 [btrfs]
        bio_endio+0x56/0x60
        end_workqueue_fn+0x3c/0x40 [btrfs]
        btrfs_scrubparity_helper+0xef/0x620 [btrfs]
        btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
        process_one_work+0x2af/0x720
        ? process_one_work+0x22b/0x720
        worker_thread+0x4b/0x4f0
        kthread+0x10f/0x150
        ? process_one_work+0x720/0x720
        ? kthread_create_on_node+0x40/0x40
        ret_from_fork+0x2e/0x40
       RIP: generic_make_request_checks+0x4d/0x610 RSP: ffffc90001337bb8
      
      In btrfs_dev_replace_finishing(), we will call
      btrfs_rm_dev_replace_blocked() to wait bios before destroying the target
      device when scrub is finished normally.
      
      However when dev-replace is aborted, either due to error or cancelled by
      scrub, we didn't wait for bios, this can lead to use-after-free if there
      are bios holding the target device.
      
      Furthermore, for raid56 scrub, at least 2 places are calling
      btrfs_map_sblock() without protection of bio_counter, leading to the
      problem.
      
      This patch fixes the problem:
      1) Wait for bio_counter before freeing target device when canceling
         replace
      2) When calling btrfs_map_sblock() for raid56, use bio_counter to
         protect the call.
      
      Cc: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ae6529c3
    • Qu Wenruo's avatar
      btrfs: scrub: Don't append on-disk pages for raid56 scrub · 9a33944b
      Qu Wenruo authored
      In the following situation, scrub will calculate wrong parity to
      overwrite the correct one:
      
      RAID5 full stripe:
      
      Before
      |     Dev 1      |     Dev  2     |     Dev 3     |
      | Data stripe 1  | Data stripe 2  | Parity Stripe |
      --------------------------------------------------- 0
      | 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
      --------------------------------------------------- 4K
      |     0xcdcd     |     0xcdcd     |     0x0000    |
      ...
      |     0xcdcd     |     0xcdcd     |     0x0000    |
      --------------------------------------------------- 64K
      
      After scrubbing dev3 only:
      
      |     Dev 1      |     Dev  2     |     Dev 3     |
      | Data stripe 1  | Data stripe 2  | Parity Stripe |
      --------------------------------------------------- 0
      | 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
      --------------------------------------------------- 4K
      |     0xcdcd     |     0xcdcd     |     0x0000    |
      ...
      |     0xcdcd     |     0xcdcd     |     0x0000    |
      --------------------------------------------------- 64K
      
      The reason is that after raid56 read rebuild rbio->stripe_pages are all
      correctly recovered (0xcd for data stripes).
      
      However when we check and repair parity in
      scrub_parity_check_and_repair(), we will append pages in sparity->spages
      list to rbio->bio_pages[], which contains old on-disk data.
      
      And when we submit parity data to disk, we calculate parity using
      rbio->bio_pages[] first, if rbio->bio_pages[] not found, then fallback
      to rbio->stripe_pages[].
      
      The patch fix it by not appending pages from sparity->spages.
      So finish_parity_scrub() will use rbio->stripe_pages[] which is correct.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9a33944b
    • Qu Wenruo's avatar
      btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint · d51ea5dd
      Qu Wenruo authored
      Newly introduced qgroup reserved space trace points are normally nested
      into several common qgroup operations.
      
      While some other trace points are not well placed to co-operate with
      them, causing confusing output.
      
      This patch re-arrange trace_btrfs_qgroup_release_data() and
      trace_btrfs_qgroup_free_delayed_ref() trace points so they are triggered
      before reserved space ones.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarJeff Mahoney <jeffm@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d51ea5dd
    • Qu Wenruo's avatar
      btrfs: qgroup: Add trace point for qgroup reserved space · 3159fe7b
      Qu Wenruo authored
      Introduce the following trace points:
      qgroup_update_reserve
      qgroup_meta_reserve
      
      These trace points are handy to trace qgroup reserve space related
      problems.
      
      Also export btrfs_qgroup structure, as now we directly pass btrfs_qgroup
      structure to trace points, so that structure needs to be exported.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3159fe7b
    • David Sterba's avatar
      btrfs: drop redundant parameters from btrfs_map_sblock · 825ad4c9
      David Sterba authored
      All callers pass 0 for mirror_num and 1 for need_raid_map.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      825ad4c9
    • David Sterba's avatar
      btrfs: sink GFP flags parameter to tree_mod_log_insert_root · bcc8e07f
      David Sterba authored
      All (1) callers pass the same value.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bcc8e07f
    • David Sterba's avatar
      btrfs: sink GFP flags parameter to tree_mod_log_insert_move · 176ef8f5
      David Sterba authored
      All (1) callers pass the same value.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      176ef8f5
    • Liu Bo's avatar
      Btrfs: fix wrong failed mirror_num of read-repair on raid56 · abad60c6
      Liu Bo authored
      In raid56 scenario, after trying parity recovery, we didn't set
      mirror_num for btrfs_bio with failed mirror_num, hence
      end_bio_extent_readpage() will report a random mirror_num in dmesg
      log.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      abad60c6
    • Liu Bo's avatar
      Btrfs: set scrub page's io_error if failing to submit io · 1bcd7aa1
      Liu Bo authored
      Scrub repairs data by the unit called scrub_block, which may contain
      several pages.  Scrub always tries to look up a good copy of a whole
      block, but if there's no such copy, it tries to do repair page by page.
      
      If we don't set page's io_error when checking this bad copy, in the last
      step, we may skip this page when repairing bad copy from good copy.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1bcd7aa1
    • David Sterba's avatar
      btrfs: track exclusive filesystem operation in flags · 171938e5
      David Sterba authored
      There are several operations, usually started from ioctls, that cannot
      run concurrently. The status is tracked in
      mutually_exclusive_operation_running as an atomic_t. We can easily track
      the status as one of the per-filesystem flag bits with same
      synchronization guarantees.
      
      The conversion replaces:
      
      * atomic_xchg(..., 1)    ->   test_and_set_bit(FLAG, ...)
      * atomic_set(..., 0)     ->   clear_bit(FLAG, ...)
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      171938e5
    • Goldwyn Rodrigues's avatar
      btrfs: qgroups: Retry after commit on getting EDQUOT · 48a89bc4
      Goldwyn Rodrigues authored
      We are facing the same problem with EDQUOT which was experienced with
      ENOSPC. Not sure if we require a full ticketing system such as ENOSPC, but
      here is a quick fix, which may be too big a hammer.
      
      Quotas are reserved during the start of an operation, incrementing
      qg->reserved. However, it is written to disk in a commit_transaction
      which could take as long as commit_interval. In the meantime there
      could be deletions which are not accounted for because deletions are
      accounted for only while committed (free_refroot). So, when we get
      a EDQUOT flush the data to disk and try again.
      
      This fixes fstests btrfs/139.
      
      Here is a sample script which shows this issue.
      
      DEVICE=/dev/vdb
      MOUNTPOINT=/mnt
      TESTVOL=$MOUNTPOINT/tmp
      QUOTA=5
      PROG=btrfs
      DD_BS="4k"
      DD_COUNT="256"
      RUN_TIMES=5000
      
      mkfs.btrfs -f $DEVICE
      mount -o commit=240 $DEVICE $MOUNTPOINT
      $PROG subvolume create $TESTVOL
      $PROG quota enable $TESTVOL
      $PROG qgroup limit ${QUOTA}G $TESTVOL
      
      typeset -i DD_RUN_GOOD
      typeset -i QUOTA
      
      function _check_cmd() {
              if [[ ${?} > 0 ]]; then
                      echo -n "$(date) E: Running previous command"
                      echo ${*}
                      echo "Without sync"
                      $PROG qgroup show -pcreFf ${TESTVOL}
                      echo "With sync"
                      $PROG qgroup show -pcreFf --sync ${TESTVOL}
                      exit 1
              fi
      }
      
      while true; do
        DD_RUN_GOOD=$RUN_TIMES
      
        while (( ${DD_RUN_GOOD} != 0 )); do
              dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}
              _check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}"
              DD_RUN_GOOD=(${DD_RUN_GOOD}-1)
        done
      
        $PROG qgroup show -pcref $TESTVOL
        echo "----------- Cleanup ---------- "
        rm $TESTVOL/quotatest*
      
      done
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      48a89bc4
    • Edmund Nadolski's avatar
      btrfs: replace hardcoded value with SEQ_LAST macro · de47c9d3
      Edmund Nadolski authored
      Define the SEQ_LAST macro to replace (u64)-1 in places where said
      value triggers a special-case ref search behavior.
      Signed-off-by: default avatarEdmund Nadolski <enadolski@suse.com>
      Reviewed-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      de47c9d3