1. 26 Apr, 2017 6 commits
    • Filipe Manana's avatar
      Btrfs: send, fix file hole not being preserved due to inline extent · e1cbfd7b
      Filipe Manana authored
      Normally we don't have inline extents followed by regular extents, but
      there's currently at least one harmless case where this happens. For
      example, when the page size is 4Kb and compression is enabled:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount -o compress /dev/sdb /mnt
        $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
        $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar
      
      In this case we get a compressed inline extent, representing 4Kb of
      data, followed by a hole extent and then a regular data extent. The
      inline extent was not expanded/converted to a regular extent exactly
      because it represents 4Kb of data. This does not cause any apparent
      problem (such as the issue solved by commit e1699d2d
      ("btrfs: add missing memset while reading compressed inline extents"))
      except trigger an unexpected case in the incremental send code path
      that makes us issue an operation to write a hole when it's not needed,
      resulting in more writes at the receiver and wasting space at the
      receiver.
      
      So teach the incremental send code to deal with this particular case.
      
      The issue can be currently triggered by running fstests btrfs/137 with
      compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      e1cbfd7b
    • Filipe Manana's avatar
      Btrfs: fix extent map leak during fallocate error path · be2d253c
      Filipe Manana authored
      If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
      extent map structure. The failure can happen either due to an -ENOMEM
      condition or, when quotas are enabled, due to -EDQUOT for example.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      be2d253c
    • Filipe Manana's avatar
      Btrfs: fix incorrect space accounting after failure to insert inline extent · 1c81ba23
      Filipe Manana authored
      When using compression, if we fail to insert an inline extent we
      incorrectly end up attempting to free the reserved data space twice,
      once through extent_clear_unlock_delalloc(), because we pass it the
      flag EXTENT_DO_ACCOUNTING, and once through a direct call to
      btrfs_free_reserved_data_space_noquota(). This results in a trace
      like the following:
      
      [  834.576240] ------------[ cut here ]------------
      [  834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
      [  834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2
      [  834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
      [  834.596103] Call Trace:
      [  834.596103]  dump_stack+0x67/0x90
      [  834.596103]  __warn+0xc2/0xdd
      [  834.596103]  warn_slowpath_null+0x1d/0x1f
      [  834.596103]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  834.596103]  compress_file_range.constprop.42+0x2fa/0x3fc [btrfs]
      [  834.596103]  ? submit_compressed_extents+0x3a7/0x3a7 [btrfs]
      [  834.596103]  async_cow_start+0x32/0x4d [btrfs]
      [  834.596103]  btrfs_scrubparity_helper+0x187/0x3e7 [btrfs]
      [  834.596103]  btrfs_delalloc_helper+0xe/0x10 [btrfs]
      [  834.596103]  process_one_work+0x273/0x4e4
      [  834.596103]  worker_thread+0x1eb/0x2ca
      [  834.596103]  ? rescuer_thread+0x2b6/0x2b6
      [  834.596103]  kthread+0x100/0x108
      [  834.596103]  ? __list_del_entry+0x22/0x22
      [  834.596103]  ret_from_fork+0x2e/0x40
      [  834.611656] ---[ end trace 719902fe6bdef08f ]---
      
      So fix this by not calling directly btrfs_free_reserved_data_space_noquota()
      if an error happened.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      1c81ba23
    • Filipe Manana's avatar
      Btrfs: fix invalid attempt to free reserved space on failure to cow range · a315e68f
      Filipe Manana authored
      When attempting to COW a file range (we are starting writeback and doing
      COW), if we manage to reserve an extent for the range we will write into
      but fail after reserving it and before creating the respective ordered
      extent, we end up in an error path where we attempt to decrement the
      data space's bytes_may_use counter after we already did it while
      reserving the extent, leading to a warning/trace like the following:
      
      [  847.621524] ------------[ cut here ]------------
      [  847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg
      [  847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2
      [  847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  847.648601] Call Trace:
      [  847.648601]  dump_stack+0x67/0x90
      [  847.648601]  __warn+0xc2/0xdd
      [  847.648601]  warn_slowpath_null+0x1d/0x1f
      [  847.648601]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  847.648601]  btrfs_clear_bit_hook+0x140/0x258 [btrfs]
      [  847.648601]  clear_state_bit+0x87/0x128 [btrfs]
      [  847.648601]  __clear_extent_bit+0x222/0x2b7 [btrfs]
      [  847.648601]  clear_extent_bit+0x17/0x19 [btrfs]
      [  847.648601]  extent_clear_unlock_delalloc+0x3b/0x6b [btrfs]
      [  847.648601]  cow_file_range.isra.39+0x387/0x39a [btrfs]
      [  847.648601]  run_delalloc_nocow+0x4d7/0x70e [btrfs]
      [  847.648601]  ? arch_local_irq_save+0x9/0xc
      [  847.648601]  run_delalloc_range+0xa7/0x2b5 [btrfs]
      [  847.648601]  writepage_delalloc.isra.31+0xb9/0x15c [btrfs]
      [  847.648601]  __extent_writepage+0x249/0x2e8 [btrfs]
      [  847.648601]  extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs]
      [  847.648601]  ? arch_local_irq_save+0x9/0xc
      [  847.648601]  ? mark_lock+0x24/0x201
      [  847.648601]  extent_writepages+0x4b/0x5c [btrfs]
      [  847.648601]  ? btrfs_writepage_start_hook+0xed/0xed [btrfs]
      [  847.648601]  btrfs_writepages+0x28/0x2a [btrfs]
      [  847.648601]  do_writepages+0x23/0x2c
      [  847.648601]  __filemap_fdatawrite_range+0x5a/0x61
      [  847.648601]  filemap_fdatawrite_range+0x13/0x15
      [  847.648601]  btrfs_fdatawrite_range+0x20/0x46 [btrfs]
      [  847.648601]  start_ordered_ops+0x19/0x23 [btrfs]
      [  847.648601]  btrfs_sync_file+0x136/0x42c [btrfs]
      [  847.648601]  vfs_fsync_range+0x8c/0x9e
      [  847.648601]  vfs_fsync+0x1c/0x1e
      [  847.648601]  do_fsync+0x31/0x4a
      [  847.648601]  SyS_fsync+0x10/0x14
      [  847.648601]  entry_SYSCALL_64_fastpath+0x18/0xad
      [  847.648601] RIP: 0033:0x7f5b05200800
      [  847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
      [  847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800
      [  847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003
      [  847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f
      [  847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046
      [  847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740
      [  847.648601]  ? trace_hardirqs_off_caller+0x3f/0xaa
      [  847.685787] ---[ end trace 2a4a3e15382508e8 ]---
      
      So fix this by not attempting to decrement the data space info's
      bytes_may_use counter if we already reserved the extent and an error
      happened before creating the ordered extent. We are already correctly
      freeing the reserved extent if an error happens, so there's no additional
      measure needed.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      a315e68f
    • Qu Wenruo's avatar
      btrfs: Handle delalloc error correctly to avoid ordered extent hang · 52427260
      Qu Wenruo authored
      [BUG]
      If run_delalloc_range() returns error and there is already some ordered
      extents created, btrfs will be hanged with the following backtrace:
      
      Call Trace:
       __schedule+0x2d4/0xae0
       schedule+0x3d/0x90
       btrfs_start_ordered_extent+0x160/0x200 [btrfs]
       ? wake_atomic_t_function+0x60/0x60
       btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
       btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
       btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
       process_one_work+0x2af/0x720
       ? process_one_work+0x22b/0x720
       worker_thread+0x4b/0x4f0
       kthread+0x10f/0x150
       ? process_one_work+0x720/0x720
       ? kthread_create_on_node+0x40/0x40
       ret_from_fork+0x2e/0x40
      
      [CAUSE]
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
      |<>|                       |<---------- cleanup range --------->|
       ||
       \_=> First page handled by end_extent_writepage() in __extent_writepage()
      
      The problem is caused by error handler of run_delalloc_range(), which
      doesn't handle any created ordered extents, leaving them waiting on
      btrfs_finish_ordered_io() to finish.
      
      However after run_delalloc_range() returns error, __extent_writepage()
      won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
      except the first page, and btrfs_finish_ordered_io() won't be triggered
      for created ordered extents either.
      
      So OE 2~n will hang forever, and if OE 1 is larger than one page, it
      will also hang.
      
      [FIX]
      Introduce btrfs_cleanup_ordered_extents() function to cleanup created
      ordered extents and finish them manually.
      
      The function is based on existing
      btrfs_endio_direct_write_update_ordered() function, and modify it to
      act just like btrfs_writepage_endio_hook() but handles specified range
      other than one page.
      
      After fix, delalloc error will be handled like:
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
      |<>|<--------  ----------->|<------ old error handler --------->|
       ||          ||
       ||          \_=> Cleaned up by cleanup_ordered_extents()
       \_=> First page handled by end_extent_writepage() in __extent_writepage()
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      52427260
    • Qu Wenruo's avatar
      btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error · 4dbd80fb
      Qu Wenruo authored
      [BUG]
      When btrfs_reloc_clone_csum() reports error, it can underflow metadata
      and leads to kernel assertion on outstanding extents in
      run_delalloc_nocow() and cow_file_range().
      
       BTRFS info (device vdb5): relocating block group 12582912 flags data
       BTRFS info (device vdb5): found 1 extents
       assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
      
      Currently, due to another bug blocking ordered extents, the bug is only
      reproducible under certain block group layout and using error injection.
      
      a) Create one data block group with one 4K extent in it.
         To avoid the bug that hangs btrfs due to ordered extent which never
         finishes
      b) Make btrfs_reloc_clone_csum() always fail
      c) Relocate that block group
      
      [CAUSE]
      run_delalloc_nocow() and cow_file_range() handles error from
      btrfs_reloc_clone_csum() wrongly:
      
      (The ascii chart shows a more generic case of this bug other than the
      bug mentioned above)
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
                          |<----------- cleanup range --------------->|
      |<-----------  ----------->|
                   \/
       btrfs_finish_ordered_io() range
      
      So error handler, which calls extent_clear_unlock_delalloc() with
      EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
      will both cover OE n, and free its metadata, causing metadata under flow.
      
      [Fix]
      The fix is to ensure after calling btrfs_add_ordered_extent(), we only
      call error handler after increasing the iteration offset, so that
      cleanup range won't cover any created ordered extent.
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
      |<-----------  ----------->|<---------- cleanup range --------->|
                   \/
       btrfs_finish_ordered_io() range
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      4dbd80fb
  2. 11 Apr, 2017 4 commits
    • Liu Bo's avatar
      Btrfs: fix potential use-after-free for cloned bio · a967efb3
      Liu Bo authored
      KASAN reports that there is a use-after-free case of bio in btrfs_map_bio.
      
      If we need to submit IOs to several disks at a time, the original bio
      would get cloned and mapped to the destination disk, but we really should
      use the original bio instead of a cloned bio to do the sanity check
      because cloned bios are likely to be freed by its endio.
      Reported-by: default avatarDiego <diegocg@gmail.com>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a967efb3
    • Liu Bo's avatar
      Btrfs: fix segmentation fault when doing dio read · 97bf5a55
      Liu Bo authored
      Commit 2dabb324 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
      introduced this bug during iterating bio pages in dio read's endio hook,
      and it could end up with segment fault of the dio reading task.
      
      So the reason is 'if (nr_sectors--)', and it makes the code assume that
      there is one more block in the same page, so page offset is increased and
      the bio which is created to repair the bad block then has an incorrect
      bvec.bv_offset, and a later access of the page content would throw a
      segmentation fault.
      
      This also adds ASSERT to check page offset against page size.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      97bf5a55
    • Liu Bo's avatar
      Btrfs: fix invalid dereference in btrfs_retry_endio · 2e949b0a
      Liu Bo authored
      When doing directIO repair, we have this oops:
      
      [ 1458.532816] general protection fault: 0000 [#1] SMP
      ...
      [ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs]
      [ 1458.536893] task: ffff88082a42d100 task.stack: ffffc90002b3c000
      [ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs]
      ...
      [ 1458.543261] Call Trace:
      [ 1458.543958]  ? rcu_read_lock_sched_held+0xc4/0xd0
      [ 1458.544374]  bio_endio+0xed/0x100
      [ 1458.544750]  end_workqueue_fn+0x3c/0x40 [btrfs]
      [ 1458.545257]  normal_work_helper+0x9f/0x900 [btrfs]
      [ 1458.545762]  btrfs_endio_repair_helper+0x12/0x20 [btrfs]
      [ 1458.546224]  process_one_work+0x34d/0xb70
      [ 1458.546570]  ? process_one_work+0x29e/0xb70
      [ 1458.546938]  worker_thread+0x1cf/0x960
      [ 1458.547263]  ? process_one_work+0xb70/0xb70
      [ 1458.547624]  kthread+0x17d/0x180
      [ 1458.547909]  ? kthread_create_on_node+0x70/0x70
      [ 1458.548300]  ret_from_fork+0x31/0x40
      
      It turns out that btrfs_retry_endio is trying to get inode from a directIO
      page.
      
      This fixes the problem by using the saved inode pointer, done->inode.
      btrfs_retry_endio_nocsum has the same problem, and it's fixed as well.
      
      Also cleanup unused @start (which is too trivial for a separate patch).
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2e949b0a
    • Adam Borowski's avatar
      btrfs: drop the nossd flag when remounting with -o ssd · 951e7966
      Adam Borowski authored
      The opposite case was already handled right in the very next switch entry.
      And also when turning on nossd, drop ssd_spread.
      Reported-by: default avatarHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: default avatarAdam Borowski <kilobyte@angband.pl>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      951e7966
  3. 29 Mar, 2017 4 commits
    • Chris Mason's avatar
      Merge branch 'for-chris-4.11-rc5' of... · 41a75a6e
      Chris Mason authored
      Merge branch 'for-chris-4.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.11
      41a75a6e
    • Dan Carpenter's avatar
      Btrfs: fix an integer overflow check · 457ae726
      Dan Carpenter authored
      This isn't super serious because you need CAP_ADMIN to run this code.
      
      I added this integer overflow check last year but apparently I am
      rubbish at writing integer overflow checks...  There are two issues.
      First, access_ok() works on unsigned long type and not u64 so on 32 bit
      systems the access_ok() could be checking a truncated size.  The other
      issue is that we should be using a stricter limit so we don't overflow
      the kzalloc() setting ctx->clone_roots later in the function after the
      access_ok():
      
      	alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1);
      	sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
      
      Fixes: f5ecec3c ("btrfs: send: silence an integer overflow warning")
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ added comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      457ae726
    • Goldwyn Rodrigues's avatar
      btrfs: Change qgroup_meta_rsv to 64bit · ce0dcee6
      Goldwyn Rodrigues authored
      Using an int value is causing qg->reserved to become negative and
      exclusive -EDQUOT to be reached prematurely.
      
      This affects exclusive qgroups only.
      
      TEST CASE:
      
      DEVICE=/dev/vdb
      MOUNTPOINT=/mnt
      SUBVOL=$MOUNTPOINT/tmp
      
      umount $SUBVOL
      umount $MOUNTPOINT
      
      mkfs.btrfs -f $DEVICE
      mount /dev/vdb $MOUNTPOINT
      btrfs quota enable $MOUNTPOINT
      btrfs subvol create $SUBVOL
      umount $MOUNTPOINT
      mount /dev/vdb $MOUNTPOINT
      mount -o subvol=tmp $DEVICE $SUBVOL
      btrfs qgroup limit -e 3G $SUBVOL
      
      btrfs quota rescan /mnt -w
      
      for i in `seq 1 44000`; do
        dd if=/dev/zero of=/mnt/tmp/test_$i bs=10k count=1
        if [[ $? > 0 ]]; then
           btrfs qgroup show -pcref $SUBVOL
           exit 1
        fi
      done
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      [ add reproducer to changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ce0dcee6
    • Liu Bo's avatar
      Btrfs: bring back repair during read · 9d0d1c8b
      Liu Bo authored
      Commit 20a7db8a ("btrfs: add dummy callback for readpage_io_failed
      and drop checks") made a cleanup around readpage_io_failed_hook, and
      it was supposed to keep the original sematics, but it also
      unexpectedly disabled repair during read for dup, raid1 and raid10.
      
      This fixes the problem by letting data's inode call the generic
      readpage_io_failed callback by returning -EAGAIN from its
      readpage_io_failed_hook in order to notify end_bio_extent_readpage to
      do the rest.  We don't call it directly because the generic one takes
      an offset from end_bio_extent_readpage() to calculate the index in the
      checksum array and inode's readpage_io_failed_hook doesn't offer that
      offset.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ keep the const function attribute ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9d0d1c8b
  4. 17 Mar, 2017 2 commits
    • Zygo Blaxell's avatar
      btrfs: add missing memset while reading compressed inline extents · e1699d2d
      Zygo Blaxell authored
      This is a story about 4 distinct (and very old) btrfs bugs.
      
      Commit c8b97818 ("Btrfs: Add zlib compression support") added
      three data corruption bugs for inline extents (bugs #1-3).
      
      Commit 93c82d57 ("Btrfs: zero page past end of inline file items")
      fixed bug #1:  uncompressed inline extents followed by a hole and more
      extents could get non-zero data in the hole as they were read.  The fix
      was to add a memset in btrfs_get_extent to zero out the hole.
      
      Commit 166ae5a4 ("btrfs: fix inline compressed read err corruption")
      fixed bug #2:  compressed inline extents which contained non-zero bytes
      might be replaced with zero bytes in some cases.  This patch removed an
      unhelpful memset from uncompress_inline, but the case where memset is
      required was missed.
      
      There is also a memset in the decompression code, but this only covers
      decompressed data that is shorter than the ram_bytes from the extent
      ref record.  This memset doesn't cover the region between the end of the
      decompressed data and the end of the page.  It has also moved around a
      few times over the years, so there's no single patch to refer to.
      
      This patch fixes bug #3:  compressed inline extents followed by a hole
      and more extents could get non-zero data in the hole as they were read
      (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
      The fix is the same:  zero out the hole in the compressed case too,
      by putting a memset back in uncompress_inline, but this time with
      correct parameters.
      
      The last and oldest bug, bug #0, is the cause of the offending inline
      extent/hole/extent pattern.  Bug #0 is a subtle and mostly-harmless quirk
      of behavior somewhere in the btrfs write code.  In a few special cases,
      an inline extent and hole are allowed to persist where they normally
      would be combined with later extents in the file.
      
      A fast reproducer for bug #0 is presented below.  A few offending extents
      are also created in the wild during large rsync transfers with the -S
      flag.  A Linux kernel build (git checkout; make allyesconfig; make -j8)
      will produce a handful of offending files as well.  Once an offending
      file is created, it can present different content to userspace each
      time it is read.
      
      Bug #0 is at least 4 and possibly 8 years old.  I verified every vX.Y
      kernel back to v3.5 has this behavior.  There are fossil records of this
      bug's effects in commits all the way back to v2.6.32.  I have no reason
      to believe bug #0 wasn't present at the beginning of btrfs compression
      support in v2.6.29, but I can't easily test kernels that old to be sure.
      
      It is not clear whether bug #0 is worth fixing.  A fix would likely
      require injecting extra reads into currently write-only paths, and most
      of the exceptional cases caused by bug #0 are already handled now.
      
      Whether we like them or not, bug #0's inline extents followed by holes
      are part of the btrfs de-facto disk format now, and we need to be able
      to read them without data corruption or an infoleak.  So enough about
      bug #0, let's get back to bug #3 (this patch).
      
      An example of on-disk structure leading to data corruption found in
      the wild:
      
              item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
                      inode generation 50 transid 50 size 47424 nbytes 49141
                      block group 0 mode 100644 links 1 uid 0 gid 0
                      rdev 0 flags 0x0(none)
              item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
                      inode ref index 3 namelen 10 name: DB_File.so
              item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
                      inline extent data size 1341 ram 4085 compress(zlib)
              item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
                      extent data disk byte 5367308288 nr 20480
                      extent data offset 0 nr 45056 ram 45056
                      extent compression(zlib)
      
      Different data appears in userspace during each read of the 11 bytes
      between 4085 and 4096.  The extent in item 63 is not long enough to
      fill the first page of the file, so a memset is required to fill the
      space between item 63 (ending at 4085) and item 64 (beginning at 4096)
      with zero.
      
      Here is a reproducer from Liu Bo, which demonstrates another method
      of creating the same inline extent and hole pattern:
      
      Using 'page_poison=on' kernel command line (or enable
      CONFIG_PAGE_POISONING) run the following:
      
      	# touch foo
      	# chattr +c foo
      	# xfs_io -f -c "pwrite -W 0 1000" foo
      	# xfs_io -f -c "falloc 4 8188" foo
      	# od -x foo
      	# echo 3 >/proc/sys/vm/drop_caches
      	# od -x foo
      
      This produce the following on my box:
      
      Correct output:  file contains 1000 data bytes followed
      by zeros:
      
      	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
      	*
      	0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000
      	0001760 0000 0000 0000 0000 0000 0000 0000 0000
      	*
      	0020000
      
      Actual output:  the data after the first 1000 bytes
      will be different each run:
      
      	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
      	*
      	0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
      	0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
      	0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
      	(...)
      Signed-off-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e1699d2d
    • Liu Bo's avatar
      Btrfs: fix regression in lock_delalloc_pages · 49d4a334
      Liu Bo authored
      The bug is a regression after commit
      (da2c7009 "btrfs: teach __process_pages_contig about PAGE_LOCK operation")
      and commit
      (76c0021d "Btrfs: use helper to simplify lock/unlock pages").
      
      So if the dirty pages which are under writeback got truncated partially
      before we lock the dirty pages, we couldn't find all pages mapping to the
      delalloc range, and the bug didn't return an error so it kept going on and
      found that the delalloc range got truncated and got to unlock the dirty
      pages, and then the ASSERT could caught the error, and showed
      
      -----------------------------------------------------------------------------
      assertion failed: page_ops & PAGE_LOCK, file: fs/btrfs/extent_io.c, line: 1716
      -----------------------------------------------------------------------------
      
      This fixes the bug by returning the proper -EAGAIN.
      
      Cc: David Sterba <dsterba@suse.com>
      Reported-by: default avatarDave Jones <davej@codemonkey.org.uk>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      49d4a334
  5. 07 Mar, 2017 1 commit
  6. 28 Feb, 2017 23 commits