1. 24 Feb, 2017 10 commits
    • Btrfs: fix data loss after truncate when using the no-holes feature · 76b42abb
      Filipe Manana authored
      If we have a file with an implicit hole (NO_HOLES feature enabled) that
      is followed by an extent, there are delayed (buffered) writes against
      regions of the file before the hole that were not yet flushed, and we
      then truncate the file to a smaller size that lies inside the hole, we
      end up persisting a wrong disk_i_size value for the inode. That leads to
      data loss after unmounting and mounting the filesystem again, or after
      the inode is evicted and loaded again.
      
      This happens because at inode.c:btrfs_truncate_inode_items() we end up
      setting last_size to the offset of the extent that we deleted and that
      followed the hole. We then pass that value to btrfs_ordered_update_i_size()
      which updates the inode's disk_i_size to a value smaller than the offset
      of the buffered (delayed) writes.
      
      Example reproducer:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
      
       $ xfs_io -f -c "pwrite -S 0x01 0K 32K" /mnt/foo
       $ xfs_io -d -c "pwrite -S 0x02 -b 32K 64K 32K" /mnt/foo
       $ xfs_io -c "truncate 60K" /mnt/foo
         --> inode's disk_i_size updated to 0
      
       $ md5sum /mnt/foo
       3c5ca3c3ab42f4b04d7e7eb0b0d4d806  /mnt/foo
      
       $ umount /dev/sdb
       $ mount /dev/sdb /mnt
      
       $ md5sum /mnt/foo
       d41d8cd98f00b204e9800998ecf8427e  /mnt/foo
         --> Empty file, all data lost!
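
The second checksum in the reproducer is the well-known MD5 digest of empty input, which is how the output shows that the file became empty. The sketch below is only an illustration under assumptions: it rebuilds the content the file is expected to hold after the truncate (32K of 0x01 followed by a 28K hole that reads as zeros) and hashes both cases. The names `expected_content`, `lost_content` and `md5_hex` are hypothetical helpers, and the digest of the expected content is printed rather than asserted.

```python
import hashlib

KIB = 1024

# Content expected after "truncate 60K" in the reproducer:
# 32K of 0x01 from the buffered write, then a hole (read back as zeros) up to 60K.
expected_content = b"\x01" * (32 * KIB) + b"\x00" * (28 * KIB)

# Content seen after the remount, once disk_i_size was persisted as 0.
lost_content = b""


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hex string, like md5sum prints it."""
    return hashlib.md5(data).hexdigest()


print(len(expected_content))       # 61440 bytes, i.e. 60K
print(md5_hex(expected_content))   # digest of the intact file content
print(md5_hex(lost_content))       # d41d8cd98f00b204e9800998ecf8427e
```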
      
      Cc: <stable@vger.kernel.org>  # 3.14+
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: incremental send, fix unnecessary hole writes for sparse files · 82bfb2e7
      Filipe Manana authored
      When using the NO_HOLES feature, during an incremental send we often issue
      write operations for holes when we should not, because that range is already
      a hole in the destination snapshot. While that does not change the contents
      of the file at the receiver, it prevents file holes from being preserved,
      leading to wasted disk space and extra IO during send/receive.
      
      A couple of examples where the holes are not preserved follow.
      
       $ mkfs.btrfs -O no-holes -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ xfs_io -f -c "pwrite -S 0xaa 0 4K" /mnt/foo
       $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "pwrite -S 0xbb 1028K 4K" /mnt/bar
       $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      
       # Now add one new extent to our first test file, increasing its size and
       # leaving a 1Mb hole between the first extent and this new extent.
       $ xfs_io -c "pwrite -S 0xbb 1028K 4K" /mnt/foo
      
       # Now overwrite the last extent of our second test file.
       $ xfs_io -c "pwrite -S 0xcc 1028K 4K" /mnt/bar
      
       $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      
       $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo
       /mnt/snap2/foo:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          25088..25095         8 0x2000
         1: [8..2055]:       hole              2048
         2: [2056..2063]:    24576..24583         8 0x2001
      
       $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar
       /mnt/snap2/bar:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          25096..25103         8 0x2000
         1: [8..2055]:       hole              2048
         2: [2056..2063]:    24584..24591         8 0x2001
      
        $ btrfs send /mnt/snap1 -f /tmp/1.snap
        $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
      
        $ umount /mnt
        # It's not relevant to enable no-holes in the new filesystem.
        $ mkfs.btrfs -O no-holes -f /dev/sdc
        $ mount /dev/sdc /mnt
        $ btrfs receive /mnt -f /tmp/1.snap
        $ btrfs receive /mnt -f /tmp/2.snap
      
        $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo
        /mnt/snap2/foo:
        EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
          0: [0..7]:          24576..24583         8 0x2000
          1: [8..2063]:       25624..27679      2056   0x1
      
        $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar
        /mnt/snap2/bar:
        EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
          0: [0..7]:          24584..24591         8 0x2000
          1: [8..2063]:       27680..29735      2056   0x1
      
      The holes do not exist in the second filesystem and they were replaced
      with extents filled with the byte 0x00, making each file take 1032Kb of
      space instead of 8Kb.
      
      So fix this by not issuing the write operations consisting of buffers
      filled with the byte 0x00 when the destination snapshot already has a
      hole for the respective range.
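
The space figures quoted above can be checked from the fiemap output. This is a small arithmetic sketch, assuming the ranges printed by `fiemap -v` are in 512-byte blocks (which matches a 4K write showing up as `[0..7]`); `range_bytes` is a hypothetical helper name.

```python
SECTOR = 512  # assumed unit of the fiemap FILE-OFFSET ranges
KIB = 1024


def range_bytes(first: int, last: int) -> int:
    """Size in bytes of an inclusive block range [first..last]."""
    return (last - first + 1) * SECTOR


# Source filesystem: two 4K extents with a hole in between.
hole = range_bytes(8, 2055)
data_before = range_bytes(0, 7) + range_bytes(2056, 2063)

# Destination filesystem: the hole and the last extent became one big extent.
data_after = range_bytes(0, 7) + range_bytes(8, 2063)

print(hole // KIB)         # 1024
print(data_before // KIB)  # 8
print(data_after // KIB)   # 1032
```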
      
      A test case for fstests will follow soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: fix use-after-free due to wrong order of destroying work queues · a9b9477d
      Filipe Manana authored
      Before we destroy all work queues (and wait for their tasks to complete)
      we were destroying the work queues used for metadata I/O operations, which
      can result in a use-after-free problem because most tasks from all work
      queues do metadata I/O operations. For example, the tasks from the caching
      workers work queue (fs_info->caching_workers), which is destroyed only
      after the work queue used for metadata reads (fs_info->endio_meta_workers)
      is destroyed, do metadata reads, which result in attempts to queue tasks
      into the latter work queue, triggering a use-after-free with a trace like
      the following:
      
      [23114.613543] general protection fault: 0000 [#1] PREEMPT SMP
      [23114.614442] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic
      acpi_cpufreq tpm_tis tpm_tis_core tpm ppdev parport_pc parport i2c_piix4 processor sg evdev i2c_core psmouse pcspkr serio_raw button loop autofs4 ext4 crc16
      jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: scsi_debug]
      [23114.616932] CPU: 9 PID: 4537 Comm: kworker/u32:8 Not tainted 4.9.0-rc7-btrfs-next-36+ #1
      [23114.616932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [23114.616932] Workqueue: btrfs-cache btrfs_cache_helper [btrfs]
      [23114.616932] task: ffff880221d45780 task.stack: ffffc9000bc50000
      [23114.616932] RIP: 0010:[<ffffffffa037c1bf>]  [<ffffffffa037c1bf>] btrfs_queue_work+0x2c/0x190 [btrfs]
      [23114.616932] RSP: 0018:ffff88023f443d60  EFLAGS: 00010246
      [23114.616932] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000102
      [23114.616932] RDX: ffffffffa0419000 RSI: ffff88011df534f0 RDI: ffff880101f01c00
      [23114.616932] RBP: ffff88023f443d80 R08: 00000000000f7000 R09: 000000000000ffff
      [23114.616932] R10: ffff88023f443d48 R11: 0000000000001000 R12: ffff88011df534f0
      [23114.616932] R13: ffff880135963868 R14: 0000000000001000 R15: 0000000000001000
      [23114.616932] FS:  0000000000000000(0000) GS:ffff88023f440000(0000) knlGS:0000000000000000
      [23114.616932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [23114.616932] CR2: 00007f0fb9f8e520 CR3: 0000000001a0b000 CR4: 00000000000006e0
      [23114.616932] Stack:
      [23114.616932]  ffff880101f01c00 ffff88011df534f0 ffff880135963868 0000000000001000
      [23114.616932]  ffff88023f443da0 ffffffffa03470af ffff880149b37200 ffff880135963868
      [23114.616932]  ffff88023f443db8 ffffffff8125293c ffff880149b37200 ffff88023f443de0
      [23114.616932] Call Trace:
      [23114.616932]  <IRQ> [23114.616932]  [<ffffffffa03470af>] end_workqueue_bio+0xd5/0xda [btrfs]
      [23114.616932]  [<ffffffff8125293c>] bio_endio+0x54/0x57
      [23114.616932]  [<ffffffffa0377929>] btrfs_end_bio+0xf7/0x106 [btrfs]
      [23114.616932]  [<ffffffff8125293c>] bio_endio+0x54/0x57
      [23114.616932]  [<ffffffff8125955f>] blk_update_request+0x21a/0x30f
      [23114.616932]  [<ffffffffa0022316>] scsi_end_request+0x31/0x182 [scsi_mod]
      [23114.616932]  [<ffffffffa00235fc>] scsi_io_completion+0x1ce/0x4c8 [scsi_mod]
      [23114.616932]  [<ffffffffa001ba9d>] scsi_finish_command+0x104/0x10d [scsi_mod]
      [23114.616932]  [<ffffffffa002311f>] scsi_softirq_done+0x101/0x10a [scsi_mod]
      [23114.616932]  [<ffffffff8125fbd9>] blk_done_softirq+0x82/0x8d
      [23114.616932]  [<ffffffff814c8a4b>] __do_softirq+0x1ab/0x412
      [23114.616932]  [<ffffffff8105b01d>] irq_exit+0x49/0x99
      [23114.616932]  [<ffffffff81035135>] smp_call_function_single_interrupt+0x24/0x26
      [23114.616932]  [<ffffffff814c7ec9>] call_function_single_interrupt+0x89/0x90
      [23114.616932]  <EOI> [23114.616932]  [<ffffffffa0023262>] ? scsi_request_fn+0x13a/0x2a1 [scsi_mod]
      [23114.616932]  [<ffffffff814c5966>] ? _raw_spin_unlock_irq+0x2c/0x4a
      [23114.616932]  [<ffffffff814c596c>] ? _raw_spin_unlock_irq+0x32/0x4a
      [23114.616932]  [<ffffffff814c5966>] ? _raw_spin_unlock_irq+0x2c/0x4a
      [23114.616932]  [<ffffffffa0023262>] scsi_request_fn+0x13a/0x2a1 [scsi_mod]
      [23114.616932]  [<ffffffff8125590e>] __blk_run_queue_uncond+0x22/0x2b
      [23114.616932]  [<ffffffff81255930>] __blk_run_queue+0x19/0x1b
      [23114.616932]  [<ffffffff8125ab01>] blk_queue_bio+0x268/0x282
      [23114.616932]  [<ffffffff81258f44>] generic_make_request+0xbd/0x160
      [23114.616932]  [<ffffffff812590e7>] submit_bio+0x100/0x11d
      [23114.616932]  [<ffffffff81298603>] ? __this_cpu_preempt_check+0x13/0x15
      [23114.616932]  [<ffffffff812a1805>] ? __percpu_counter_add+0x8e/0xa7
      [23114.616932]  [<ffffffffa03bfd47>] btrfsic_submit_bio+0x1a/0x1d [btrfs]
      [23114.616932]  [<ffffffffa0377db2>] btrfs_map_bio+0x1f4/0x26d [btrfs]
      [23114.616932]  [<ffffffffa0348a33>] btree_submit_bio_hook+0x74/0xbf [btrfs]
      [23114.616932]  [<ffffffffa03489bf>] ? btrfs_wq_submit_bio+0x160/0x160 [btrfs]
      [23114.616932]  [<ffffffffa03697a9>] submit_one_bio+0x6b/0x89 [btrfs]
      [23114.616932]  [<ffffffffa036f5be>] read_extent_buffer_pages+0x170/0x1ec [btrfs]
      [23114.616932]  [<ffffffffa03471fa>] ? free_root_pointers+0x64/0x64 [btrfs]
      [23114.616932]  [<ffffffffa0348adf>] readahead_tree_block+0x3f/0x4c [btrfs]
      [23114.616932]  [<ffffffffa032e115>] read_block_for_search.isra.20+0x1ce/0x23d [btrfs]
      [23114.616932]  [<ffffffffa032fab8>] btrfs_search_slot+0x65f/0x774 [btrfs]
      [23114.616932]  [<ffffffffa036eff1>] ? free_extent_buffer+0x73/0x7e [btrfs]
      [23114.616932]  [<ffffffffa0331ba4>] btrfs_next_old_leaf+0xa1/0x33c [btrfs]
      [23114.616932]  [<ffffffffa0331e4f>] btrfs_next_leaf+0x10/0x12 [btrfs]
      [23114.616932]  [<ffffffffa0336aa6>] caching_thread+0x22d/0x416 [btrfs]
      [23114.616932]  [<ffffffffa037bce9>] btrfs_scrubparity_helper+0x187/0x3b6 [btrfs]
      [23114.616932]  [<ffffffffa037c036>] btrfs_cache_helper+0xe/0x10 [btrfs]
      [23114.616932]  [<ffffffff8106cf96>] process_one_work+0x273/0x4e4
      [23114.616932]  [<ffffffff8106d6db>] worker_thread+0x1eb/0x2ca
      [23114.616932]  [<ffffffff8106d4f0>] ? rescuer_thread+0x2b6/0x2b6
      [23114.616932]  [<ffffffff81072a81>] kthread+0xd5/0xdd
      [23114.616932]  [<ffffffff810729ac>] ? __kthread_unpark+0x5a/0x5a
      [23114.616932]  [<ffffffff814c6257>] ret_from_fork+0x27/0x40
      [23114.616932] Code: 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 49 89 f4 48 8b 46 70 a8 04 74 09 48 8b 5f 08 48 85 db 75 03 48 8b 1f 49 89 5c 24 68 <83> 7b
      64 ff 74 04 f0 ff 43 58 49 83 7c 24 08 00 74 2c 4c 8d 6b
      [23114.616932] RIP  [<ffffffffa037c1bf>] btrfs_queue_work+0x2c/0x190 [btrfs]
      [23114.616932]  RSP <ffff88023f443d60>
      [23114.689493] ---[ end trace 6e48b6bc707ca34b ]---
      [23114.690166] Kernel panic - not syncing: Fatal exception in interrupt
      [23114.691283] Kernel Offset: disabled
      [23114.691918] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
      
      The following diagram shows the sequence of operations that lead to the
      use-after-free problem from the above trace:
      
              CPU 1                               CPU 2                                     CPU 3
      
                                             caching_thread()
       close_ctree()
         btrfs_stop_all_workers()
           btrfs_destroy_workqueue(
            fs_info->endio_meta_workers)
      
                                               btrfs_search_slot()
                                                read_block_for_search()
                                                 readahead_tree_block()
                                                  read_extent_buffer_pages()
                                                   submit_one_bio()
                                                    btree_submit_bio_hook()
                                                     btrfs_bio_wq_end_io()
                                                      --> sets the bio's
                                                          bi_end_io callback
                                                          to end_workqueue_bio()
                                                     --> bio is submitted
                                                                                        bio completes
                                                                                        and its bi_end_io callback
                                                                                        is invoked
                                                                                         --> end_workqueue_bio()
                                                                                             --> attempts to queue
                                                                                                 a task on fs_info->endio_meta_workers
      
           btrfs_destroy_workqueue(
            fs_info->caching_workers)
      
      So fix this by destroying the queues used for metadata I/O tasks only
      after destroying all the other queues.
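
The ordering problem can be pictured with a toy model. This is not kernel code: `WorkQueue` and `shutdown` are hypothetical names, destroying a queue simply runs its pending tasks, and queuing onto a destroyed queue raises an error in place of the use-after-free.

```python
class WorkQueue:
    """Toy work queue: destroy() drains pending tasks, then marks the queue gone."""

    def __init__(self, name):
        self.name = name
        self.destroyed = False
        self.pending = []

    def queue(self, task):
        if self.destroyed:
            # Stands in for the use-after-free seen in the trace.
            raise RuntimeError(f"use after destroy: {self.name}")
        self.pending.append(task)

    def destroy(self):
        while self.pending:
            self.pending.pop(0)()
        self.destroyed = True


def shutdown(order_fixed: bool) -> str:
    endio_meta = WorkQueue("endio_meta_workers")
    caching = WorkQueue("caching_workers")
    # A caching task does a metadata read; its completion is queued on endio_meta.
    caching.queue(lambda: endio_meta.queue(lambda: None))
    order = [caching, endio_meta] if order_fixed else [endio_meta, caching]
    try:
        for wq in order:
            wq.destroy()
    except RuntimeError as err:
        return str(err)
    return "clean shutdown"


print(shutdown(order_fixed=False))  # use after destroy: endio_meta_workers
print(shutdown(order_fixed=True))   # clean shutdown
```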
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: fix assertion failure when freeing block groups at close_ctree() · 5cdd7db6
      Filipe Manana authored
      At close_ctree() we free the block groups and only afterwards wait for
      any running worker kthreads to finish and shut down the workqueues. This
      behaviour is racy and it triggers an assertion failure when freeing block
      groups, because while we are doing it we can have, for example, a block
      group caching kthread running. In that case the block group's reference
      count can still be greater than 1 by the time we assert its reference count
      is 1, leading to an assertion failure:
      
      [19041.198004] assertion failed: atomic_read(&block_group->count) == 1, file: fs/btrfs/extent-tree.c, line: 9799
      [19041.200584] ------------[ cut here ]------------
      [19041.201692] kernel BUG at fs/btrfs/ctree.h:3418!
      [19041.202830] invalid opcode: 0000 [#1] PREEMPT SMP
      [19041.203929] Modules linked in: btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic ppdev sg psmouse acpi_cpufreq pcspkr parport_pc evdev tpm_tis parport tpm_tis_core i2c_piix4 i2c_core tpm serio_raw processor button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
      [19041.208082] CPU: 6 PID: 29051 Comm: umount Not tainted 4.9.0-rc7-btrfs-next-36+ #1
      [19041.208082] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [19041.208082] task: ffff88015f028980 task.stack: ffffc9000ad34000
      [19041.208082] RIP: 0010:[<ffffffffa03e319e>]  [<ffffffffa03e319e>] assfail.constprop.41+0x1c/0x1e [btrfs]
      [19041.208082] RSP: 0018:ffffc9000ad37d60  EFLAGS: 00010286
      [19041.208082] RAX: 0000000000000061 RBX: ffff88015ecb4000 RCX: 0000000000000001
      [19041.208082] RDX: ffff88023f392fb8 RSI: ffffffff817ef7ba RDI: 00000000ffffffff
      [19041.208082] RBP: ffffc9000ad37d60 R08: 0000000000000001 R09: 0000000000000000
      [19041.208082] R10: ffffc9000ad37cb0 R11: ffffffff82f2b66d R12: ffff88023431d170
      [19041.208082] R13: ffff88015ecb40c0 R14: ffff88023431d000 R15: ffff88015ecb4100
      [19041.208082] FS:  00007f44f3d42840(0000) GS:ffff88023f380000(0000) knlGS:0000000000000000
      [19041.208082] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [19041.208082] CR2: 00007f65d623b000 CR3: 00000002166f2000 CR4: 00000000000006e0
      [19041.208082] Stack:
      [19041.208082]  ffffc9000ad37d98 ffffffffa035989f ffff88015ecb4000 ffff88015ecb5630
      [19041.208082]  ffff88014f6be000 0000000000000000 00007ffcf0ba6a10 ffffc9000ad37df8
      [19041.208082]  ffffffffa0368cd4 ffff88014e9658e0 ffffc9000ad37e08 ffffffff811a634d
      [19041.208082] Call Trace:
      [19041.208082]  [<ffffffffa035989f>] btrfs_free_block_groups+0x17f/0x392 [btrfs]
      [19041.208082]  [<ffffffffa0368cd4>] close_ctree+0x1c5/0x2e1 [btrfs]
      [19041.208082]  [<ffffffff811a634d>] ? evict_inodes+0x132/0x141
      [19041.208082]  [<ffffffffa034356d>] btrfs_put_super+0x15/0x17 [btrfs]
      [19041.208082]  [<ffffffff8118fc32>] generic_shutdown_super+0x6a/0xeb
      [19041.208082]  [<ffffffff8119004f>] kill_anon_super+0x12/0x1c
      [19041.208082]  [<ffffffffa0343370>] btrfs_kill_super+0x16/0x21 [btrfs]
      [19041.208082]  [<ffffffff8118fad1>] deactivate_locked_super+0x3b/0x68
      [19041.208082]  [<ffffffff8118fb34>] deactivate_super+0x36/0x39
      [19041.208082]  [<ffffffff811a9946>] cleanup_mnt+0x58/0x76
      [19041.208082]  [<ffffffff811a99a2>] __cleanup_mnt+0x12/0x14
      [19041.208082]  [<ffffffff81071573>] task_work_run+0x6f/0x95
      [19041.208082]  [<ffffffff81001897>] prepare_exit_to_usermode+0xa3/0xc1
      [19041.208082]  [<ffffffff81001a23>] syscall_return_slowpath+0x16e/0x1d2
      [19041.208082]  [<ffffffff814c607d>] entry_SYSCALL_64_fastpath+0xab/0xad
      [19041.208082] Code: c7 ae a0 3e a0 48 89 e5 e8 4e 74 d4 e0 0f 0b 55 89 f1 48 c7 c2 0b a4 3e a0 48 89 fe 48 c7 c7 a4 a6 3e a0 48 89 e5 e8 30 74 d4 e0 <0f> 0b 55 31 d2 48 89 e5 e8 d5 b9 f7 ff 5d c3 48 63 f6 55 31 c9
      [19041.208082] RIP  [<ffffffffa03e319e>] assfail.constprop.41+0x1c/0x1e [btrfs]
      [19041.208082]  RSP <ffffc9000ad37d60>
      [19041.279264] ---[ end trace 23330586f16f064d ]---
      
      This started happening as of kernel 4.8, since commit f3bca802
      ("Btrfs: add ASSERT for block group's memory leak") introduced these
      assertions.
      
      So fix this by freeing the block groups only after waiting for all
      worker kthreads to complete and shutting down the workqueues.
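
The race can be illustrated with a deterministic toy model of the reference count. It is only a sketch under assumptions: `BlockGroup` and `close_ctree_model` are hypothetical names, and a single caching worker is assumed to hold one extra reference until it is stopped.

```python
class BlockGroup:
    def __init__(self):
        self.count = 1  # reference held by the filesystem itself

    def get(self):
        self.count += 1

    def put(self):
        self.count -= 1


def close_ctree_model(stop_workers_first: bool) -> bool:
    """Return whether the 'count == 1' check holds when block groups are freed."""
    bg = BlockGroup()
    bg.get()  # a caching worker takes a reference while it runs
    worker_running = True

    def stop_workers():
        nonlocal worker_running
        if worker_running:
            bg.put()  # the worker finishes and drops its reference
            worker_running = False

    if stop_workers_first:
        stop_workers()
    assertion_holds = bg.count == 1  # the check done when freeing block groups
    stop_workers()
    return assertion_holds


print(close_ctree_model(stop_workers_first=False))  # False
print(close_ctree_model(stop_workers_first=True))   # True
```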
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: do not create explicit holes when replaying log tree if NO_HOLES enabled · 3168021c
      Filipe Manana authored
      We log holes explicitly by using file extent items. However, when replaying
      a log tree, if a logged file extent item corresponds to a hole and the
      NO_HOLES feature is enabled we do not need to copy the file extent item
      into the fs/subvolume tree, as the absence of such file extent items is
      the purpose of the NO_HOLES feature. So skip the copying of file extent
      items representing holes when the NO_HOLES feature is enabled.
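
A minimal sketch of the skipping rule, not the kernel implementation. It assumes a simplified `FileExtentItem` record in which `disk_bytenr == 0` marks a hole; `items_to_replay` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileExtentItem:
    offset: int       # file offset in bytes
    length: int       # length in bytes
    disk_bytenr: int  # 0 is assumed to mean "this item represents a hole"


def items_to_replay(logged, no_holes: bool):
    """Pick the logged file extent items to copy into the fs/subvolume tree."""
    if not no_holes:
        return list(logged)
    # With NO_HOLES, holes are implicit, so hole items are not copied.
    return [item for item in logged if item.disk_bytenr != 0]


logged = [
    FileExtentItem(offset=0, length=4096, disk_bytenr=12582912),
    FileExtentItem(offset=4096, length=1048576, disk_bytenr=0),
    FileExtentItem(offset=1052672, length=4096, disk_bytenr=12587008),
]

print(len(items_to_replay(logged, no_holes=False)))  # 3
print(len(items_to_replay(logged, no_holes=True)))   # 2
```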
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: fix leak of subvolume writers counter · 91e1f56a
      Robbie Ko authored
      When falling back from a nocow write to a regular cow write, we were
      leaking the subvolume writers counter in 2 situations, preventing
      snapshot creation from ever completing in the future, as it waits
      for that counter to go down to zero before the snapshot creation
      starts.
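
The effect of such a leak can be sketched with a toy counter. This is an illustration under assumptions, not the kernel code paths: `Subvolume`, `nocow_write`, `write_leaky` and `write_fixed` are hypothetical names, and the fallback to cow is modelled with an exception.

```python
from contextlib import contextmanager


class Subvolume:
    def __init__(self):
        self.writers = 0

    def can_start_snapshot(self) -> bool:
        # Snapshot creation waits for the writers counter to reach zero.
        return self.writers == 0


class FallBackToCow(Exception):
    """Raised in this model when a nocow write is not possible."""


@contextmanager
def nocow_write(subvol):
    subvol.writers += 1
    try:
        yield
    finally:
        subvol.writers -= 1  # always dropped, even on the fallback path


def write_leaky(subvol) -> str:
    subvol.writers += 1
    try:
        raise FallBackToCow
    except FallBackToCow:
        return "cow"  # returns without decrementing: the leak


def write_fixed(subvol) -> str:
    try:
        with nocow_write(subvol):
            raise FallBackToCow
    except FallBackToCow:
        return "cow"


leaky, fixed = Subvolume(), Subvolume()
write_leaky(leaky)
write_fixed(fixed)
print(leaky.can_start_snapshot())  # False
print(fixed.can_start_snapshot())  # True
```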
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      [Improved changelog and subject]
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: bulk delete checksum items in the same leaf · 6f546216
      Filipe Manana authored
      Very often the checksums for an extent are spread across multiple items
      in the checksums tree. Currently the algorithm to delete them looks for
      them one by one and deletes them one by one, which is not optimal: each
      deletion involves shifting all the other items in the leaf and, when the
      leaf reaches some low threshold, moving items off the leaf into its left
      and right neighbor leaves. Also, after each item deletion we release our
      search path and start a new search for other checksum items.
      
      So optimize this by deleting in bulk all the items in the same leaf that
      contain checksums for the extent being freed.
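
The cost difference can be sketched by counting how many items have to be shifted to close the gap in a leaf. This models a leaf as a plain Python list, which is an assumption made for illustration; `delete_one_by_one` and `delete_bulk` are hypothetical names, and leaf rebalancing and repeated searches are left out.

```python
def delete_one_by_one(leaf, start, count):
    """Delete `count` items at index `start`, one at a time.

    Returns the number of item moves needed to close the gaps.
    """
    moves = 0
    for _ in range(count):
        moves += len(leaf) - start - 1  # every item after the deleted one shifts
        del leaf[start]
    return moves


def delete_bulk(leaf, start, count):
    """Delete `count` adjacent items in one operation and return the item moves."""
    moves = len(leaf) - start - count  # the tail shifts only once
    del leaf[start:start + count]
    return moves


leaf_a = list(range(10))
leaf_b = list(range(10))
print(delete_one_by_one(leaf_a, start=2, count=3))  # 18
print(delete_bulk(leaf_b, start=2, count=3))        # 5
print(leaf_a == leaf_b)                             # True
```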
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: incremental send, do not issue invalid rmdir operations · 01914101
      Robbie Ko authored
      When both the parent and send snapshots have a directory inode with the
      same number but different generations (therefore they are different
      inodes) and both have an entry with the same name, an incremental send
      stream will contain an invalid rmdir operation that refers to the
      orphanized name of the inode from the parent snapshot.
      
      The following example scenario shows how this happens.
      
      Parent snapshot:
      
       .
       |---- d259_old/               (ino 259, gen 9)
       |         |---- d1/           (ino 258, gen 9)
       |
       |---- f                       (ino 257, gen 9)
      
      Send snapshot:
      
       .
       |---- d258/                   (ino 258, gen 7)
       |---- d259/                   (ino 259, gen 7)
               |---- d1/             (ino 257, gen 7)
      
      When the kernel is processing inode 258 it notices that in both snapshots
      there is an inode numbered 259 that is a parent of an inode 258. However
      it ignores the fact that the inodes numbered 259 have different generations
      in both snapshots, which means they are effectively different inodes.
      Then it checks that both inodes 259 have a dentry named "d1" and because
      of that it issues a rmdir operation with the orphanized name of the inode 258
      from the parent snapshot. This happens at send.c:process_recorded_refs(),
      which calls send.c:did_overwrite_first_ref() that returns true and because
      of that later on at process_recorded_refs() such rmdir operation is issued
      because the inode being currently processed (258) is a directory and it
      was deleted in the send snapshot (and replaced with another inode that has
      the same number and is a directory too).
      
      Fix this issue by comparing the generations of parent directory inodes
      that have the same number and make send.c:did_overwrite_first_ref() return
      false when the generations are different.
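
The identity rule used by the fix can be sketched as follows. This is a toy illustration, not the send.c code: `Inode`, `same_inode` and `orphan_name` are hypothetical names, and the `o<ino>-<gen>-<seq>` naming pattern is inferred from the receive output shown in this message.

```python
from typing import NamedTuple


class Inode(NamedTuple):
    ino: int  # inode number
    gen: int  # generation


def same_inode(a: Inode, b: Inode) -> bool:
    """Two inodes are the same only if both number and generation match."""
    return a.ino == b.ino and a.gen == b.gen


def orphan_name(inode: Inode, seq: int = 0) -> str:
    """Orphanized name in the pattern seen in the receive output."""
    return f"o{inode.ino}-{inode.gen}-{seq}"


dir_in_parent_snapshot = Inode(259, 9)
dir_in_send_snapshot = Inode(259, 7)

print(same_inode(dir_in_parent_snapshot, dir_in_send_snapshot))  # False
print(orphan_name(Inode(258, 9)))                                # o258-9-0
```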
      
      The following steps reproduce the problem.
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ touch /mnt/f
       $ mkdir /mnt/d1
       $ mkdir /mnt/d259_old
       $ mv /mnt/d1 /mnt/d259_old/d1
       $ btrfs subvolume snapshot -r /mnt /mnt/snap1
       $ btrfs send /mnt/snap1 -f /tmp/1.snap
       $ umount /mnt
      
       $ mkfs.btrfs -f /dev/sdc
       $ mount /dev/sdc /mnt
       $ mkdir /mnt/d1
       $ mkdir /mnt/dir258
       $ mkdir /mnt/dir259
       $ mv /mnt/d1 /mnt/dir259/d1
       $ btrfs subvolume snapshot -r /mnt /mnt/snap2
       $ btrfs receive /mnt/ -f /tmp/1.snap
       # Take note that once the filesystem is created, its current
       # generation has value 7 so the inodes from the second snapshot all have
       # a generation value of 7. And after receiving the first snapshot
       # the filesystem is at a generation value of 10, because the call to
       # create the second snapshot bumps the generation to 8 (the snapshot
       # creation ioctl does a transaction commit), the receive command calls
       # the snapshot creation ioctl to create the first snapshot, which bumps
       # the filesystem's generation to 9, and finally when the receive
       # operation finishes it calls an ioctl to transition the first snapshot
       # (snap1) from RW mode to RO mode, which does another transaction commit
       # and bumps the filesystem's generation to 10. This means all the inodes
       # in the first snapshot (snap1) have a generation value of 9.
       $ rm -f /tmp/1.snap
       $ btrfs send /mnt/snap1 -f /tmp/1.snap
       $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
       $ umount /mnt
      
       $ mkfs.btrfs -f /dev/sdd
       $ mount /dev/sdd /mnt
       $ btrfs receive /mnt -f /tmp/1.snap
       $ btrfs receive -vv /mnt -f /tmp/2.snap
       receiving snapshot mysnap2 uuid=9c03962f-f620-0047-9f98-32e5a87116d9, ctransid=7 parent_uuid=d17a6e3f-14e5-df4f-be39-a7951a5399aa, parent_ctransid=9
       utimes
       unlink f
       mkdir o257-7-0
       mkdir o259-7-0
       rename o257-7-0 -> o259-7-0/d1
       chown o259-7-0/d1 - uid=0, gid=0
       chmod o259-7-0/d1 - mode=0755
       utimes o259-7-0/d1
       rmdir o258-9-0
       ERROR: rmdir o258-9-0 failed: No such file or directory
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      [Rewrote changelog to be more precise and clear]
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: incremental send, do not delay rename when parent inode is new · fe9c798d
      Filipe Manana authored
      When we are checking if we need to delay the rename operation for an
      inode, we are not checking if a parent inode that exists in the send and
      parent snapshots is really the same inode or not, that is, we are not
      comparing the generation number of the parent inode in the send and
      parent snapshots. Not only does this result in unnecessarily delaying a
      rename operation, it can also later on make us generate an incorrect
      name for a new inode in the send snapshot that has the same number
      as another inode in the parent snapshot but a different generation.
      
      Here follows an example where this happens.
      
      Parent snapshot:
      
       .                                                  (ino 256, gen 3)
       |--- dir258/                                       (ino 258, gen 7)
       |       |--- dir257/                               (ino 257, gen 7)
       |
       |--- dir259/                                       (ino 259, gen 7)
      
      Send snapshot:
      
       .                                                  (ino 256, gen 3)
       |--- file258                                       (ino 258, gen 10)
       |
       |--- new_dir259/                                   (ino 259, gen 10)
                |--- dir257/                              (ino 257, gen 7)
      
      The following steps happen when computing the incremental send stream:
      
      1) When processing inode 257, its new parent is created using its orphan
         name (o259-10-0), and the rename operation for inode 257 is delayed
         because its new parent (inode 259) was not yet processed - this
         decision to delay the rename operation does not make much sense
         because the inode 259 in the send snapshot is a new inode, it's not
         the same as inode 259 in the parent snapshot.
      
      2) When processing inode 258 we end up delaying its rmdir operation,
         because inode 257 was not yet renamed (moved away from the directory
         inode 258 represents). We also create the new inode 258 using its
         orphan name "o258-10-0", then rename it to its final name of "file258"
         and then issue a truncate operation for it. However this truncate
         operation contains an incorrect name, which corresponds to the orphan
         name and not to the final name, which makes the receiver fail. This
         happens because when we attempt to compute the inode's current name
         we verify that there's another inode with the same number (258) that
         has its rmdir operation pending and because of that we generate an
         orphan name for the new inode 258 (we do this in the function
         get_cur_path()).
      
      Fix this by not delaying the rename operation of an inode if it has parents
      with the same number but different generations in both snapshots.
      
      The following steps reproduce this example scenario.
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ mkdir /mnt/dir257
       $ mkdir /mnt/dir258
       $ mkdir /mnt/dir259
       $ mv /mnt/dir257 /mnt/dir258/dir257
       $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      
       $ mv /mnt/dir258/dir257 /mnt/dir257
       $ rmdir /mnt/dir258
       $ rmdir /mnt/dir259
      
       # Remount the filesystem so that the next created inodes will have the
       # numbers 258 and 259. This is because when a filesystem is mounted,
       # btrfs sets the subvolume's inode counter to a value corresponding to
       # the highest inode number in the subvolume plus 1. This inode counter
       # is used to assign a unique number to each new inode and it's
       # incremented by 1 after every inode creation.
       # Note: we unmount and then mount instead of doing a mount with
       # "-o remount" because otherwise the inode counter remains at value 260.
       $ umount /mnt
       $ mount /dev/sdb /mnt
       $ touch /mnt/file258
       $ mkdir /mnt/new_dir259
       $ mv /mnt/dir257 /mnt/new_dir259/dir257
       $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      
       $ btrfs send /mnt/snap1 -f /tmp/1.snap
       $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
      
       $ umount /mnt
       $ mkfs.btrfs -f /dev/sdc
       $ mount /dev/sdc /mnt
       $ btrfs receive /mnt -f /tmp/1.snap
       $ btrfs receive /mnt -f /tmp/2.snap -vv
       receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7
       utimes
       mkdir o259-10-0
       rename dir258 -> o258-7-0
       utimes
       mkfile o258-10-0
       rename o258-10-0 -> file258
       utimes
       truncate o258-10-0 size=0
       ERROR: truncate o258-10-0 failed: No such file or directory
      Reported-by: Robbie Ko <robbieko@synology.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: send, fix failure to rename top level inode due to name collision · 4dd9920d
      Robbie Ko authored
      Under certain situations, an incremental send operation can fail due to a
      premature attempt to create a new top level inode (a direct child of the
      subvolume/snapshot root) whose name collides with another inode that was
      removed from the send snapshot.
      
      Consider the following example scenario.
      
      Parent snapshot:
      
        .                 (ino 256, gen 8)
        |---- a1/         (ino 257, gen 9)
        |---- a2/         (ino 258, gen 9)
      
      Send snapshot:
      
        .                 (ino 256, gen 3)
        |---- a2/         (ino 257, gen 7)
      
      In this scenario, when receiving the incremental send stream, the btrfs
      receive command fails like this (run in verbose mode, -vv argument):
      
        rmdir a1
        mkfile o257-7-0
        rename o257-7-0 -> a2
        ERROR: rename o257-7-0 -> a2 failed: Is a directory
      
      What happens when computing the incremental send stream is:
      
      1) An operation to remove the directory with inode number 257 and
         generation 9 is issued.
      
      2) An operation to create the inode with number 257 and generation 7 is
         issued. This creates the inode with an orphanized name of "o257-7-0".
      
      3) An operation to rename the new inode 257 to its final name, "a2", is
         issued. This is incorrect because inode 258, which has the same name
         and is a child of the same parent (root inode 256), was not yet
         processed and therefore no rmdir operation for it was yet issued.
         The rename operation is issued because we fail to detect that the
         name of the new inode 257 collides with inode 258, because their
         parent, a subvolume/snapshot root (inode 256) has a different
         generation in both snapshots.
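
      The orphanized names seen in the steps above follow the pattern
      o<ino>-<gen>-<counter>. As an illustration only, a minimal sketch of
      how such a name could be built is shown below. The helper name
      format_orphan_name() is hypothetical (it is not the kernel's function),
      and treating the last field as a disambiguating counter is an
      assumption based on the "-0" suffix in the traces.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, for illustration only: build a temporary "orphan"
 * name of the form o<ino>-<gen>-<counter>, as seen in the receive traces.
 * Returns the number of characters written (excluding the terminating NUL),
 * or -1 if the buffer is too small.
 */
int format_orphan_name(char *buf, size_t len, uint64_t ino,
                       uint64_t gen, uint64_t counter)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len, "o%" PRIu64 "-%" PRIu64 "-%" PRIu64,
                 ino, gen, counter);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}
```

      For the new inode 257 with generation 7 and a counter of 0, this
      sketch produces the name "o257-7-0" used in the failing rename.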
      
      So fix this by ignoring the generation value of a parent directory that
      matches a root inode (number 256) when we are checking if the name of the
      inode currently being processed collides with the name of some other
      inode that was not yet processed.
      
      We can achieve this scenario of different inodes with the same number but
      different generation values either by mounting a filesystem with the inode
      cache option (-o inode_cache) or by creating and sending snapshots across
      different filesystems, like in the following example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ mkdir /mnt/a1
        $ mkdir /mnt/a2
        $ btrfs subvolume snapshot -r /mnt /mnt/snap1
        $ btrfs send /mnt/snap1 -f /tmp/1.snap
        $ umount /mnt
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
        $ touch /mnt/a2
        $ btrfs subvolume snapshot -r /mnt /mnt/snap2
        $ btrfs receive /mnt -f /tmp/1.snap
        # Take note that once the filesystem is created, its current
        # generation has value 7 so the inode from the second snapshot has
        # a generation value of 7. And after receiving the first snapshot
        # the filesystem is at a generation value of 10, because the call to
        # create the second snapshot bumps the generation to 8 (the snapshot
        # creation ioctl does a transaction commit), the receive command calls
        # the snapshot creation ioctl to create the first snapshot, which bumps
        # the filesystem's generation to 9, and finally when the receive
        # operation finishes it calls an ioctl to transition the first snapshot
        # (snap1) from RW mode to RO mode, which does another transaction commit
        # and bumps the filesystem's generation to 10.
        $ rm -f /tmp/1.snap
        $ btrfs send /mnt/snap1 -f /tmp/1.snap
        $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
        $ umount /mnt
      
        $ mkfs.btrfs -f /dev/sdd
        $ mount /dev/sdd /mnt
        $ btrfs receive /mnt -f /tmp/1.snap
        # Receive of snapshot snap2 used to fail.
        $ btrfs receive /mnt -f /tmp/2.snap
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      [Rewrote changelog to be more precise and clear]
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      4dd9920d
  2. 22 Feb, 2017 2 commits
    • Liu Bo's avatar
      Btrfs: use the correct type when creating cow dio extent · 6288d6ea
      Liu Bo authored
      'BTRFS_ORDERED_REGULAR' was introduced for the cow case in patch
      'Btrfs: specify a new ordered extent type for create_io_em',
      but it missed the directIO cow case.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      6288d6ea
    • Filipe Manana's avatar
      Btrfs: fix deadlock between dedup on same file and starting writeback · b1517622
      Filipe Manana authored
      If we are deduping two ranges of the same file we need to make sure that
      we lock all pages in ascending order, that is, lock first the pages from
      the range with lower offset and then the pages from the other range, as
      otherwise we can deadlock with a concurrent task that is starting delalloc
      (writeback). Example trace:
      
      [74073.052218] INFO: task kworker/u32:10:17997 blocked for more than 120 seconds.
      [74073.053889]       Tainted: G        W       4.9.0-rc7-btrfs-next-36+ #1
      [74073.055071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [74073.056696] kworker/u32:10  D    0 17997      2 0x00000000
      [74073.058606] Workqueue: writeback wb_workfn (flush-btrfs-53176)
      [74073.061370]  ffff880031e79858 ffff8802159d2580 ffff880237004580 ffff880031e79240
      [74073.064784]  ffff88023f4978c0 ffffc9000817b638 ffffffff814c15e1 0000000000000000
      [74073.068386]  ffff88023f4978d8 ffff88023f4978c0 000000000017b620 ffff880031e79240
      [74073.071712] Call Trace:
      [74073.072884]  [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4
      [74073.075395]  [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f
      [74073.077511]  [<ffffffff814c18d2>] schedule+0x8c/0xa0
      [74073.079440]  [<ffffffff814c4b36>] schedule_timeout+0x43/0xff
      [74073.081637]  [<ffffffff8110953e>] ? time_hardirqs_on+0x9/0x14
      [74073.083809]  [<ffffffff81095c67>] ? trace_hardirqs_on_caller+0x16/0x197
      [74073.086314]  [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32
      [74073.100654]  [<ffffffff810be048>] ? ktime_get+0x41/0x52
      [74073.102619]  [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102
      [74073.104771]  [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102
      [74073.106969]  [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39
      [74073.108954]  [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99
      [74073.110981]  [<ffffffff8112b692>] __lock_page+0x6b/0x6d
      [74073.112833]  [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a
      [74073.115010]  [<ffffffffa031178b>] lock_page+0x2f/0x32 [btrfs]
      [74073.116999]  [<ffffffffa0311d9f>] lock_delalloc_pages+0xc7/0x1a0 [btrfs]
      [74073.119243]  [<ffffffffa0313d15>] find_lock_delalloc_range+0xc3/0x1a4 [btrfs]
      [74073.121636]  [<ffffffffa0313e81>] writepage_delalloc.isra.31+0x8b/0x134 [btrfs]
      [74073.124229]  [<ffffffffa0315d69>] __extent_writepage+0x1c1/0x2bf [btrfs]
      [74073.126372]  [<ffffffffa03160f2>] extent_write_cache_pages.isra.30.constprop.49+0x28b/0x36c [btrfs]
      [74073.129371]  [<ffffffffa03165b9>] extent_writepages+0x4b/0x5c [btrfs]
      [74073.131440]  [<ffffffffa02fcb59>] ? insert_reserved_file_extent.constprop.42+0x261/0x261 [btrfs]
      [74073.134303]  [<ffffffff811b4ce4>] ? writeback_sb_inodes+0xe0/0x4a1
      [74073.136298]  [<ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs]
      [74073.138248]  [<ffffffff81138200>] do_writepages+0x23/0x2c
      [74073.139910]  [<ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2
      [74073.142003]  [<ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1
      [74073.143911]  [<ffffffff811b511b>] __writeback_inodes_wb+0x76/0xae
      [74073.145787]  [<ffffffff811b53ca>] wb_writeback+0x1cc/0x4d7
      [74073.147452]  [<ffffffff811b60cd>] wb_workfn+0x194/0x37d
      [74073.149084]  [<ffffffff811b60cd>] ? wb_workfn+0x194/0x37d
      [74073.150726]  [<ffffffff8106ce77>] ? process_one_work+0x154/0x4e4
      [74073.152694]  [<ffffffff8106cf96>] process_one_work+0x273/0x4e4
      [74073.154452]  [<ffffffff8106d6db>] worker_thread+0x1eb/0x2ca
      [74073.156138]  [<ffffffff8106d4f0>] ? rescuer_thread+0x2b6/0x2b6
      [74073.157837]  [<ffffffff81072a81>] kthread+0xd5/0xdd
      [74073.159339]  [<ffffffff810729ac>] ? __kthread_unpark+0x5a/0x5a
      [74073.161088]  [<ffffffff814c6257>] ret_from_fork+0x27/0x40
      [74073.162680] INFO: lockdep is turned off.
      [74073.163855] INFO: task fdm-stress:30264 blocked for more than 120 seconds.
      [74073.181180]       Tainted: G        W       4.9.0-rc7-btrfs-next-36+ #1
      [74073.181180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [74073.185296] fdm-stress      D    0 30264  29974 0x00000000
      [74073.186810]  ffff880089595118 ffff880211b8eac0 ffff880237030380 ffff880089594b00
      [74073.188998]  ffff88023f2978c0 ffffc900063abb68 ffffffff814c15e1 0000000000000000
      [74073.191070]  ffff88023f2978d8 ffff88023f2978c0 00000000003abb50 ffff880089594b00
      [74073.193286] Call Trace:
      [74073.193990]  [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4
      [74073.195418]  [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f
      [74073.196796]  [<ffffffff814c18d2>] schedule+0x8c/0xa0
      [74073.198163]  [<ffffffff814c4b36>] schedule_timeout+0x43/0xff
      [74073.199621]  [<ffffffff81095df5>] ? trace_hardirqs_on+0xd/0xf
      [74073.201100]  [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32
      [74073.202686]  [<ffffffff810be048>] ? ktime_get+0x41/0x52
      [74073.204051]  [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102
      [74073.205585]  [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102
      [74073.207123]  [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39
      [74073.208238]  [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99
      [74073.208871]  [<ffffffff8112b692>] __lock_page+0x6b/0x6d
      [74073.209430]  [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a
      [74073.210101]  [<ffffffff8112b800>] lock_page+0x2f/0x32
      [74073.210636]  [<ffffffff8112c502>] pagecache_get_page+0x5e/0x153
      [74073.211270]  [<ffffffffa03257eb>] gather_extent_pages+0x4e/0x109 [btrfs]
      [74073.212166]  [<ffffffffa032a04c>] btrfs_dedupe_file_range+0x1e1/0x4dd [btrfs]
      [74073.213257]  [<ffffffff8118d9b5>] vfs_dedupe_file_range+0x1c1/0x221
      [74073.214086]  [<ffffffff8119e0c4>] do_vfs_ioctl+0x442/0x600
      [74073.214767]  [<ffffffff811a7874>] ? rcu_read_unlock+0x5b/0x5d
      [74073.215619]  [<ffffffff811a7953>] ? __fget+0x6b/0x77
      [74073.216338]  [<ffffffff8119e2d9>] SyS_ioctl+0x57/0x79
      [74073.217149]  [<ffffffff814c5fea>] entry_SYSCALL_64_fastpath+0x18/0xad
      [74073.218102]  [<ffffffff81109552>] ? time_hardirqs_off+0x9/0x14
      [74073.218968]  [<ffffffff810938ce>] ? trace_hardirqs_off_caller+0x1f/0xaa
      [74073.219938] INFO: lockdep is turned off.
      
      What happened was the following:
      
            CPU 1                                       CPU 2
      
                                                   btrfs_dedupe_file_range()
                                                     --> using same inode as source
                                                         and target
                                                     --> src range is [768K, 1Mb[
                                                     --> dst range is [0, 256K[
                                                    btrfs_cmp_data_prepare()
                                                     --> calls gather_extent_pages()
                                                         for range [768K, 1Mb[ and
                                                         locks all pages in that range
      
       do_writepages()
        btrfs_writepages()
         extent_writepages()
          extent_write_cache_pages()
           __extent_writepage()
            writepage_delalloc()
             find_lock_delalloc_range()
               --> finds range [0, 1Mb[
               lock_delalloc_pages()
                --> locks all pages in the
                    range [0, 768K[
                --> tries to lock page at
                    offset 768K
                      --> deadlock
      
                                                     --> calls gather_extent_pages()
                                                         to lock pages in the range
                                                         [0, 256K[
                                                          --> deadlock, task at CPU 1
                                                              already locked that
                                                              range and it's trying
                                                              to lock the range we
                                                              locked previously
      
      So fix this by making sure that during a dedup we always lock first the
      pages from the range with lower offset.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      b1517622
  3. 17 Feb, 2017 28 commits