1. 16 May, 2017 1 commit
    • btrfs: fiemap: Cache and merge fiemap extent before submit it to user · 4751832d
      Qu Wenruo authored
      [BUG]
      Cycle mounting btrfs can cause fiemap to return different results.
      Like:
       # mount /dev/vdb5 /mnt/btrfs
       # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
       # xfs_io -c "fiemap -v" /mnt/btrfs/file
       /mnt/test/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        25088..25215       128   0x1
       # umount /mnt/btrfs
       # mount /dev/vdb5 /mnt/btrfs
       # xfs_io -c "fiemap -v" /mnt/btrfs/file
       /mnt/test/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..31]:         25088..25119        32   0x0
         1: [32..63]:        25120..25151        32   0x0
         2: [64..95]:        25152..25183        32   0x0
         3: [96..127]:       25184..25215        32   0x1
      But after the fiemap call above, we get the correct merged result if we
      call fiemap again.
       # xfs_io -c "fiemap -v" /mnt/btrfs/file
       /mnt/test/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        25088..25215       128   0x1
      
      [REASON]
      Btrfs tries to merge extent maps when inserting a new extent map.
      
      btrfs_fiemap(start=0 len=(u64)-1)
      |- extent_fiemap(start=0 len=(u64)-1)
         |- get_extent_skip_holes(start=0 len=64k)
         |  |- btrfs_get_extent_fiemap(start=0 len=64k)
         |     |- btrfs_get_extent(start=0 len=64k)
         |        |  Found on-disk (ino, EXTENT_DATA, 0)
         |        |- add_extent_mapping()
         |        |- Return (em->start=0, len=16k)
         |
         |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
         |
         |- get_extent_skip_holes(start=0 len=64k)
         |  |- btrfs_get_extent_fiemap(start=0 len=64k)
         |     |- btrfs_get_extent(start=16k len=48k)
         |        |  Found on-disk (ino, EXTENT_DATA, 16k)
         |        |- add_extent_mapping()
         |        |  |- try_merge_map()
         |        |     Merge with previous em start=0 len=16k
         |        |     resulting em start=0 len=32k
         |        |- Return (em->start=0, len=32K)    << Merged result
         |- Strip off the unrelated range (0~16K) of the returned em
         |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
            ^^^ Causing split fiemap extent.
      
      And since the em is already merged in add_extent_mapping(), the next
      fiemap() call gets the merged result.
      
      [FIX]
      Here we introduce a new structure, fiemap_cache, which records the
      previous fiemap extent.

      We always try to merge the current fiemap extent into the fiemap_cache
      before calling fiemap_fill_next_extent().
      Only when we fail to merge the current fiemap extent with the cached one
      do we call fiemap_fill_next_extent() to submit the cached one.

      This way, we can merge all fiemap extents.

      It could also be done in fs/ioctl.c; however, the problem is that if
      fieinfo->fi_extents_max == 0, we have no space to cache the previous
      fiemap extent.
      So I chose to merge it in btrfs.
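
The cache-and-merge idea can be modelled in a few lines of user-space Python. This is only a sketch under stated assumptions, not the kernel implementation: the names `FiemapCache`, `emit_fiemap_extent`, `flush_fiemap_cache` and `fiemap` are hypothetical, and the rule that the "last extent" flag is ignored when comparing flags is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

FIEMAP_EXTENT_LAST = 0x1  # flag marking the last extent of a file

Extent = Tuple[int, int, int, int]  # (file offset, physical offset, length, flags)


@dataclass
class FiemapCache:
    """Holds the most recent extent that has not been submitted yet."""
    offset: int = 0
    phys: int = 0
    length: int = 0
    flags: int = 0
    cached: bool = False


def emit_fiemap_extent(out: List[Extent], cache: FiemapCache,
                       offset: int, phys: int, length: int, flags: int) -> None:
    """Merge the new extent into the cache, or submit the cached one first."""
    if cache.cached:
        # Assumption: extents merge when they are contiguous both in the file
        # and on disk and carry the same flags apart from the LAST flag.
        mergeable = (cache.offset + cache.length == offset
                     and cache.phys + cache.length == phys
                     and (cache.flags & ~FIEMAP_EXTENT_LAST)
                     == (flags & ~FIEMAP_EXTENT_LAST))
        if mergeable:
            cache.length += length
            cache.flags |= flags
            return
        # Not mergeable: submit the cached extent (stands in for
        # fiemap_fill_next_extent()) before caching the new one.
        out.append((cache.offset, cache.phys, cache.length, cache.flags))
    cache.offset, cache.phys = offset, phys
    cache.length, cache.flags = length, flags
    cache.cached = True


def flush_fiemap_cache(out: List[Extent], cache: FiemapCache) -> None:
    """Submit whatever is still cached once the walk is over."""
    if cache.cached:
        out.append((cache.offset, cache.phys, cache.length, cache.flags))
        cache.cached = False


def fiemap(extents: List[Extent]) -> List[Extent]:
    """Run a list of raw extents through the cache and return what is reported."""
    out: List[Extent] = []
    cache = FiemapCache()
    for offset, phys, length, flags in extents:
        emit_fiemap_extent(out, cache, offset, phys, length, flags)
    flush_fiemap_cache(out, cache)
    return out
```

Feeding the four contiguous 16K extents from the example above through `fiemap()` yields a single 64K extent carrying the LAST flag, regardless of how the extent maps happened to be merged in memory.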
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 04 May, 2017 1 commit
  3. 27 Apr, 2017 1 commit
  4. 26 Apr, 2017 7 commits
    • Btrfs: fix reported number of inode blocks · a7e3b975
      Filipe Manana authored
      Currently when there are buffered writes that were not yet flushed and
      they fall within allocated ranges of the file (that is, not in holes or
      beyond eof assuming there are no prealloc extents beyond eof), btrfs
      simply reports an incorrect number of used blocks through the stat(2)
      system call (or any of its variants), regardless of mount options or
      inode flags (compress, compress-force, nodatacow). This is because the
      number of blocks used that is reported is based on the current number
      of bytes in the vfs inode plus the number of delalloc bytes in the btrfs
      inode. The latter covers bytes that fall both within allocated regions
      of the file and within holes.
      
      Example scenarios where the number of reported blocks is wrong while the
      buffered writes are not flushed:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)
      
        # The following should have reported 64K...
        $ du -h /mnt/sdc/foo1
        128K	/mnt/sdc/foo1
      
        $ sync
      
        # After flushing the buffered write, it now reports the correct value.
        $ du -h /mnt/sdc/foo1
        64K	/mnt/sdc/foo1
      
        $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
        wrote 65536/65536 bytes at offset 65536
        64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)
      
        # The following should have reported 128K...
        $ du -h /mnt/sdc/foo2
        192K	/mnt/sdc/foo2
      
        $ sync
      
        # After flushing the buffered write, it now reports the correct value.
        $ du -h /mnt/sdc/foo2
        128K	/mnt/sdc/foo2
      
      So the number of used file blocks is simply incorrect, unlike in other
      filesystems such as ext4 and xfs for example, but only while the buffered
      writes are not flushed.
      
      Fix this by tracking the number of delalloc bytes that fall within holes
      and beyond eof of a file, and use this new counter instead when reporting
      the number of used blocks for an inode.
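
The arithmetic behind the two example scenarios can be checked with a small model. This is a simplified sketch, not kernel code: the helper names are hypothetical, files are described as lists of `(start, length)` byte ranges, allocated ranges are assumed not to overlap each other, and preallocated space is assumed to already count towards the inode's bytes.

```python
from typing import List, Tuple

KIB = 1024
Range = Tuple[int, int]  # (start, length) in bytes


def covered(start: int, length: int, ranges: List[Range]) -> int:
    """Bytes of [start, start + length) that fall inside the given ranges."""
    end = start + length
    total = 0
    for r_start, r_len in ranges:
        lo = max(start, r_start)
        hi = min(end, r_start + r_len)
        if hi > lo:
            total += hi - lo
    return total


def inode_bytes(allocated: List[Range]) -> int:
    """Bytes already accounted in the vfs inode (allocated or preallocated)."""
    return sum(length for _, length in allocated)


def reported_old(allocated: List[Range], delalloc: List[Range]) -> int:
    """Old behaviour: add every delalloc byte, even over allocated ranges."""
    return inode_bytes(allocated) + sum(length for _, length in delalloc)


def reported_new(allocated: List[Range], delalloc: List[Range]) -> int:
    """New behaviour: add only delalloc bytes in holes or beyond eof."""
    new_delalloc = sum(length - covered(start, length, allocated)
                       for start, length in delalloc)
    return inode_bytes(allocated) + new_delalloc
```

For foo1 (64K allocated, 64K overwritten in place) the old formula gives 128K and the new one 64K; for foo2 (128K preallocated, 64K written into the preallocated range) the old formula gives 192K and the new one 128K, matching the `du` output shown above.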
      
      A separate problem is that the delalloc bytes counter
      is reset when writeback starts (by clearing the EXTENT_DELALLOC flag from
      the respective range in the inode's iotree) and the vfs inode's bytes
      counter is only incremented when writeback finishes (through
      insert_reserved_file_extent()). Therefore while writeback is ongoing we
      simply report a wrong number of blocks used by an inode if the write
      operation covers a range previously unallocated. While this change does
      not fix this problem, it does minimize it a lot by shortening that time
      window, as the new delalloc bytes counter (new_delalloc_bytes) is only
      decremented when writeback finishes right before updating the vfs inode's
      bytes counter. Fully fixing this second problem is not trivial and will
      be addressed later by a different patch.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: send, fix file hole not being preserved due to inline extent · e1cbfd7b
      Filipe Manana authored
      Normally we don't have inline extents followed by regular extents, but
      there's currently at least one harmless case where this happens. For
      example, when the page size is 4Kb and compression is enabled:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount -o compress /dev/sdb /mnt
        $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
        $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar
      
      In this case we get a compressed inline extent, representing 4Kb of
      data, followed by a hole extent and then a regular data extent. The
      inline extent was not expanded/converted to a regular extent exactly
      because it represents 4Kb of data. This does not cause any apparent
      problem (such as the issue solved by commit e1699d2d
      ("btrfs: add missing memset while reading compressed inline extents"))
      except that it triggers an unexpected case in the incremental send code
      path that makes us issue an operation to write a hole when it's not
      needed, resulting in more writes and wasted space at the receiver.
      
      So teach the incremental send code to deal with this particular case.
      
      The issue can currently be triggered by running fstests btrfs/137 with
      compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: fix extent map leak during fallocate error path · be2d253c
      Filipe Manana authored
      If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
      extent map structure. The failure can happen either due to an -ENOMEM
      condition or, when quotas are enabled, due to -EDQUOT for example.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix incorrect space accounting after failure to insert inline extent · 1c81ba23
      Filipe Manana authored
      When using compression, if we fail to insert an inline extent we
      incorrectly end up attempting to free the reserved data space twice,
      once through extent_clear_unlock_delalloc(), because we pass it the
      flag EXTENT_DO_ACCOUNTING, and once through a direct call to
      btrfs_free_reserved_data_space_noquota(). This results in a trace
      like the following:
      
      [  834.576240] ------------[ cut here ]------------
      [  834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
      [  834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2
      [  834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
      [  834.596103] Call Trace:
      [  834.596103]  dump_stack+0x67/0x90
      [  834.596103]  __warn+0xc2/0xdd
      [  834.596103]  warn_slowpath_null+0x1d/0x1f
      [  834.596103]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  834.596103]  compress_file_range.constprop.42+0x2fa/0x3fc [btrfs]
      [  834.596103]  ? submit_compressed_extents+0x3a7/0x3a7 [btrfs]
      [  834.596103]  async_cow_start+0x32/0x4d [btrfs]
      [  834.596103]  btrfs_scrubparity_helper+0x187/0x3e7 [btrfs]
      [  834.596103]  btrfs_delalloc_helper+0xe/0x10 [btrfs]
      [  834.596103]  process_one_work+0x273/0x4e4
      [  834.596103]  worker_thread+0x1eb/0x2ca
      [  834.596103]  ? rescuer_thread+0x2b6/0x2b6
      [  834.596103]  kthread+0x100/0x108
      [  834.596103]  ? __list_del_entry+0x22/0x22
      [  834.596103]  ret_from_fork+0x2e/0x40
      [  834.611656] ---[ end trace 719902fe6bdef08f ]---
      
      So fix this by not directly calling
      btrfs_free_reserved_data_space_noquota() if an error happened.
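
The double free can be illustrated with a toy counter. This is a hedged sketch only: `DataSpaceInfo` and `handle_inline_failure` are hypothetical names, and the assumption that the warning fires whenever more bytes are freed than are currently reserved is a simplification of the real check.

```python
class DataSpaceInfo:
    """Toy model of a data space reservation counter."""

    def __init__(self) -> None:
        self.bytes_may_use = 0
        self.warnings = 0

    def reserve(self, num_bytes: int) -> None:
        self.bytes_may_use += num_bytes

    def free_reserved(self, num_bytes: int) -> None:
        # Assumption: freeing more than is reserved is what produces the
        # warning seen in the trace; the counter is clamped at zero.
        if self.bytes_may_use < num_bytes:
            self.warnings += 1
            self.bytes_may_use = 0
        else:
            self.bytes_may_use -= num_bytes


def handle_inline_failure(space: DataSpaceInfo, num_bytes: int, fixed: bool) -> None:
    """Error path after failing to insert an inline extent."""
    # The range cleanup with accounting enabled already releases the
    # reservation once (stands in for extent_clear_unlock_delalloc()).
    space.free_reserved(num_bytes)
    if not fixed:
        # The old code released it a second time through a direct call.
        space.free_reserved(num_bytes)
```

With `fixed=False` the second release finds nothing left to free and the model records one warning; with `fixed=True` the counter returns to zero without any warning.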
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: fix invalid attempt to free reserved space on failure to cow range · a315e68f
      Filipe Manana authored
      When attempting to COW a file range (we are starting writeback and doing
      COW), if we manage to reserve an extent for the range we will write into
      but fail after reserving it and before creating the respective ordered
      extent, we end up in an error path where we attempt to decrement the
      data space's bytes_may_use counter after we already did it while
      reserving the extent, leading to a warning/trace like the following:
      
      [  847.621524] ------------[ cut here ]------------
      [  847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg
      [  847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2
      [  847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  847.648601] Call Trace:
      [  847.648601]  dump_stack+0x67/0x90
      [  847.648601]  __warn+0xc2/0xdd
      [  847.648601]  warn_slowpath_null+0x1d/0x1f
      [  847.648601]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  847.648601]  btrfs_clear_bit_hook+0x140/0x258 [btrfs]
      [  847.648601]  clear_state_bit+0x87/0x128 [btrfs]
      [  847.648601]  __clear_extent_bit+0x222/0x2b7 [btrfs]
      [  847.648601]  clear_extent_bit+0x17/0x19 [btrfs]
      [  847.648601]  extent_clear_unlock_delalloc+0x3b/0x6b [btrfs]
      [  847.648601]  cow_file_range.isra.39+0x387/0x39a [btrfs]
      [  847.648601]  run_delalloc_nocow+0x4d7/0x70e [btrfs]
      [  847.648601]  ? arch_local_irq_save+0x9/0xc
      [  847.648601]  run_delalloc_range+0xa7/0x2b5 [btrfs]
      [  847.648601]  writepage_delalloc.isra.31+0xb9/0x15c [btrfs]
      [  847.648601]  __extent_writepage+0x249/0x2e8 [btrfs]
      [  847.648601]  extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs]
      [  847.648601]  ? arch_local_irq_save+0x9/0xc
      [  847.648601]  ? mark_lock+0x24/0x201
      [  847.648601]  extent_writepages+0x4b/0x5c [btrfs]
      [  847.648601]  ? btrfs_writepage_start_hook+0xed/0xed [btrfs]
      [  847.648601]  btrfs_writepages+0x28/0x2a [btrfs]
      [  847.648601]  do_writepages+0x23/0x2c
      [  847.648601]  __filemap_fdatawrite_range+0x5a/0x61
      [  847.648601]  filemap_fdatawrite_range+0x13/0x15
      [  847.648601]  btrfs_fdatawrite_range+0x20/0x46 [btrfs]
      [  847.648601]  start_ordered_ops+0x19/0x23 [btrfs]
      [  847.648601]  btrfs_sync_file+0x136/0x42c [btrfs]
      [  847.648601]  vfs_fsync_range+0x8c/0x9e
      [  847.648601]  vfs_fsync+0x1c/0x1e
      [  847.648601]  do_fsync+0x31/0x4a
      [  847.648601]  SyS_fsync+0x10/0x14
      [  847.648601]  entry_SYSCALL_64_fastpath+0x18/0xad
      [  847.648601] RIP: 0033:0x7f5b05200800
      [  847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
      [  847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800
      [  847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003
      [  847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f
      [  847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046
      [  847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740
      [  847.648601]  ? trace_hardirqs_off_caller+0x3f/0xaa
      [  847.685787] ---[ end trace 2a4a3e15382508e8 ]---
      
      So fix this by not attempting to decrement the data space info's
      bytes_may_use counter if we already reserved the extent and an error
      happened before creating the ordered extent. We are already correctly
      freeing the reserved extent if an error happens, so there's no additional
      measure needed.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    • btrfs: Handle delalloc error correctly to avoid ordered extent hang · 52427260
      Qu Wenruo authored
      [BUG]
      If run_delalloc_range() returns an error and some ordered extents have
      already been created, btrfs will hang with the following backtrace:
      
      Call Trace:
       __schedule+0x2d4/0xae0
       schedule+0x3d/0x90
       btrfs_start_ordered_extent+0x160/0x200 [btrfs]
       ? wake_atomic_t_function+0x60/0x60
       btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
       btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
       btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
       process_one_work+0x2af/0x720
       ? process_one_work+0x22b/0x720
       worker_thread+0x4b/0x4f0
       kthread+0x10f/0x150
       ? process_one_work+0x720/0x720
       ? kthread_create_on_node+0x40/0x40
       ret_from_fork+0x2e/0x40
      
      [CAUSE]
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
      |<>|                       |<---------- cleanup range --------->|
       ||
       \_=> First page handled by end_extent_writepage() in __extent_writepage()
      
      The problem is caused by the error handler of run_delalloc_range(), which
      doesn't handle any created ordered extents, leaving them waiting on
      btrfs_finish_ordered_io() to finish.

      However, after run_delalloc_range() returns an error, __extent_writepage()
      won't submit a bio, so btrfs_writepage_end_io_hook() won't be triggered
      except for the first page, and btrfs_finish_ordered_io() won't be
      triggered for the created ordered extents either.

      So OE 2~n will hang forever, and if OE 1 is larger than one page, it
      will also hang.
      
      [FIX]
      Introduce the btrfs_cleanup_ordered_extents() function to clean up
      created ordered extents and finish them manually.

      The function is based on the existing
      btrfs_endio_direct_write_update_ordered() function, modified to
      act just like btrfs_writepage_end_io_hook() but handling a specified
      range rather than one page.

      After the fix, a delalloc error will be handled like:
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
      |<>|<--------  ----------->|<------ old error handler --------->|
       ||          ||
       ||          \_=> Cleaned up by cleanup_ordered_extents()
       \_=> First page handled by end_extent_writepage() in __extent_writepage()
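
The diagrams above can be turned into a small simulation. This is a simplified model under stated assumptions, not the kernel code: an ordered extent is considered finished once all of its bytes have been accounted as done, and `OrderedExtent`, `account_range` and `cleanup_ordered_extents` are hypothetical stand-ins for the real structures and helpers.

```python
from typing import List

PAGE_SIZE = 4096


class OrderedExtent:
    """Toy ordered extent that finishes once every byte is accounted for."""

    def __init__(self, start: int, length: int) -> None:
        self.start = start
        self.length = length
        self.bytes_left = length

    @property
    def finished(self) -> bool:
        return self.bytes_left == 0


def account_range(extents: List[OrderedExtent], start: int, length: int) -> None:
    """Mark [start, start + length) as done in every overlapping ordered extent."""
    end = start + length
    for oe in extents:
        lo = max(start, oe.start)
        hi = min(end, oe.start + oe.length)
        if hi > lo:
            oe.bytes_left -= hi - lo


def cleanup_ordered_extents(extents: List[OrderedExtent],
                            range_start: int, created_end: int) -> None:
    """Finish the created ordered extents, skipping the first page.

    The first page is assumed to have been handled already by the normal
    end-of-writepage path, so it must not be accounted twice.
    """
    first = range_start + PAGE_SIZE
    account_range(extents, first, created_end - first)
```

In this model, handling only the first page leaves every ordered extent larger than one page unfinished, which corresponds to the hang; calling `cleanup_ordered_extents()` afterwards brings `bytes_left` to zero for all of them.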
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error · 4dbd80fb
      Qu Wenruo authored
      [BUG]
      When btrfs_reloc_clone_csum() reports an error, it can underflow metadata
      and lead to a kernel assertion on outstanding extents in
      run_delalloc_nocow() and cow_file_range().
      
       BTRFS info (device vdb5): relocating block group 12582912 flags data
       BTRFS info (device vdb5): found 1 extents
       assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
      
      Currently, due to another bug blocking ordered extents, the bug is only
      reproducible under a certain block group layout and using error injection.

      a) Create one data block group with one 4K extent in it.
         This avoids the bug that hangs btrfs due to an ordered extent which
         never finishes
      b) Make btrfs_reloc_clone_csum() always fail
      c) Relocate that block group
      
      [CAUSE]
      run_delalloc_nocow() and cow_file_range() handle errors from
      btrfs_reloc_clone_csum() wrongly:

      (The ASCII chart shows a more generic case of this bug than the
      bug mentioned above)
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
                          |<----------- cleanup range --------------->|
      |<-----------  ----------->|
                   \/
       btrfs_finish_ordered_io() range
      
      So the error handler, which calls extent_clear_unlock_delalloc() with the
      EXTENT_DELALLOC and EXTENT_DO_ACCOUNTING bits, and
      btrfs_finish_ordered_io() will both cover OE n and free its metadata,
      causing metadata underflow.
      
      [FIX]
      The fix is to ensure that after calling btrfs_add_ordered_extent(), we
      only call the error handler after increasing the iteration offset, so
      that the cleanup range won't cover any created ordered extent.
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
      |<-----------  ----------->|<---------- cleanup range --------->|
                   \/
       btrfs_finish_ordered_io() range
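
The overlap shown in the charts can be counted with a toy model. This sketch makes a strong simplifying assumption that is not taken from the kernel: the delalloc range is split into equal chunks, each chunk holds exactly one outstanding extent, and both the error handler and the ordered extent completion release one outstanding extent per chunk they cover. The function name `simulate_error` is hypothetical.

```python
from typing import Optional


def simulate_error(num_chunks: int, fail_index: Optional[int],
                   advance_offset_first: bool) -> int:
    """Return the outstanding extent count left after the delalloc run.

    fail_index is the chunk whose checksum cloning fails, or None for no error.
    advance_offset_first selects the fixed behaviour (offset moved past the
    new ordered extent before the error handler runs).
    """
    outstanding = num_chunks          # one outstanding extent per chunk
    created = 0                       # ordered extents created so far
    cleanup_from = num_chunks         # no error: nothing for the handler to do

    for i in range(num_chunks):
        created += 1                  # stands in for btrfs_add_ordered_extent()
        if i == fail_index:           # stands in for a failing csum clone
            cleanup_from = i + 1 if advance_offset_first else i
            break

    # Error handler: releases every chunk from cleanup_from to the end.
    outstanding -= num_chunks - cleanup_from
    # Ordered extent completion: releases every created ordered extent.
    outstanding -= created
    return outstanding
```

With the old ordering the chunk of OE n is released by both paths and the counter drops to -1, which is the underflow; with the offset advanced first, the two ranges no longer overlap and the counter ends at zero.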
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
  5. 18 Apr, 2017 30 commits