1. 14 Jun, 2019 1 commit
  2. 12 Jun, 2019 1 commit
    • Filipe Manana's avatar
      Btrfs: fix race between block group removal and block group allocation · 8eaf40c0
      Filipe Manana authored
      If a task is removing the block group that currently has the highest start
      offset amongst all existing block groups, there is a short time window
      where it races with a concurrent block group allocation, resulting in a
      transaction abort with an error code of EEXIST.
      
      The following diagram explains the race in detail:
      
            Task A                                                        Task B
      
       btrfs_remove_block_group(bg offset X)
      
         remove_extent_mapping(em offset X)
           -> removes extent map X from the
              tree of extent maps
              (fs_info->mapping_tree), so the
              next call to find_next_chunk()
              will return offset X
      
                                                         btrfs_alloc_chunk()
                                                           find_next_chunk()
                                                             --> returns offset X
      
                                                           __btrfs_alloc_chunk(offset X)
                                                             btrfs_make_block_group()
                                                               btrfs_create_block_group_cache()
                                                                 --> creates btrfs_block_group_cache
                                                                     object with a key corresponding
                                                                     to the block group item in the
                                                                     extent, the key is:
                                                                     (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
      
                                                               --> adds the btrfs_block_group_cache object
                                                                   to the list new_bgs of the transaction
                                                                   handle
      
                                                         btrfs_end_transaction(trans handle)
                                                           __btrfs_end_transaction()
                                                             btrfs_create_pending_block_groups()
                                                               --> sees the new btrfs_block_group_cache
                                                                   in the new_bgs list of the transaction
                                                                   handle
                                                               --> its call to btrfs_insert_item() fails
                                                                   with -EEXIST when attempting to insert
                                                                   the block group item key
                                                                   (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
                                                                   because task A has not removed that key yet
                                                               --> aborts the running transaction with
                                                                   error -EEXIST
      
         btrfs_del_item()
           -> removes the block group's key from
              the extent tree, key is
              (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
      
      A sample transaction abort trace:
      
        [78912.403537] ------------[ cut here ]------------
        [78912.403811] BTRFS: Transaction aborted (error -17)
        [78912.404082] WARNING: CPU: 2 PID: 20465 at fs/btrfs/extent-tree.c:10551 btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
        (...)
        [78912.405642] CPU: 2 PID: 20465 Comm: btrfs Tainted: G        W         5.0.0-btrfs-next-46 #1
        [78912.405941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [78912.406586] RIP: 0010:btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
        (...)
        [78912.407636] RSP: 0018:ffff9d3d4b7e3b08 EFLAGS: 00010282
        [78912.407997] RAX: 0000000000000000 RBX: ffff90959a3796f0 RCX: 0000000000000006
        [78912.408369] RDX: 0000000000000007 RSI: 0000000000000001 RDI: ffff909636b16860
        [78912.408746] RBP: ffff909626758a58 R08: 0000000000000000 R09: 0000000000000000
        [78912.409144] R10: ffff9095ff462400 R11: 0000000000000000 R12: ffff90959a379588
        [78912.409521] R13: ffff909626758ab0 R14: ffff9095036c0000 R15: ffff9095299e1158
        [78912.409899] FS:  00007f387f16f700(0000) GS:ffff909636b00000(0000) knlGS:0000000000000000
        [78912.410285] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [78912.410673] CR2: 00007f429fc87cbc CR3: 000000014440a004 CR4: 00000000003606e0
        [78912.411095] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [78912.411496] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [78912.411898] Call Trace:
        [78912.412318]  __btrfs_end_transaction+0x5b/0x1c0 [btrfs]
        [78912.412746]  btrfs_inc_block_group_ro+0xcf/0x160 [btrfs]
        [78912.413179]  scrub_enumerate_chunks+0x188/0x5b0 [btrfs]
        [78912.413622]  ? __mutex_unlock_slowpath+0x100/0x2a0
        [78912.414078]  btrfs_scrub_dev+0x2ef/0x720 [btrfs]
        [78912.414535]  ? __sb_start_write+0xd4/0x1c0
        [78912.414963]  ? mnt_want_write_file+0x24/0x50
        [78912.415403]  btrfs_ioctl+0x17fb/0x3120 [btrfs]
        [78912.415832]  ? lock_acquire+0xa6/0x190
        [78912.416256]  ? do_vfs_ioctl+0xa2/0x6f0
        [78912.416685]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [78912.417116]  do_vfs_ioctl+0xa2/0x6f0
        [78912.417534]  ? __fget+0x113/0x200
        [78912.417954]  ksys_ioctl+0x70/0x80
        [78912.418369]  __x64_sys_ioctl+0x16/0x20
        [78912.418812]  do_syscall_64+0x60/0x1b0
        [78912.419231]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [78912.419644] RIP: 0033:0x7f3880252dd7
        (...)
        [78912.420957] RSP: 002b:00007f387f16ed68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [78912.421426] RAX: ffffffffffffffda RBX: 000055f5becc1df0 RCX: 00007f3880252dd7
        [78912.421889] RDX: 000055f5becc1df0 RSI: 00000000c400941b RDI: 0000000000000003
        [78912.422354] RBP: 0000000000000000 R08: 00007f387f16f700 R09: 0000000000000000
        [78912.422790] R10: 00007f387f16f700 R11: 0000000000000246 R12: 0000000000000000
        [78912.423202] R13: 00007ffda49c266f R14: 0000000000000000 R15: 00007f388145e040
        [78912.425505] ---[ end trace eb9bfe7c426fc4d3 ]---
      
      Fix this by calling remove_extent_mapping(), at btrfs_remove_block_group(),
      only at the very end, after removing the block group item key from the
      extent tree (and removing the free space tree entry if we are using the
      free space tree feature).
      
      Fixes: 04216820 ("Btrfs: fix race between fs trimming and block group remove/allocation")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8eaf40c0
  3. 07 Jun, 2019 1 commit
    • Nikolay Borisov's avatar
      btrfs: Always trim all unallocated space in btrfs_trim_free_extents · 8103d10b
      Nikolay Borisov authored
      This patch removes support for range parameters of FITRIM ioctl when
      trimming unallocated space on devices. This is necessary since ranges
      passed from user space are generally interpreted as logical addresses,
      whereas btrfs_trim_free_extents used to interpret them as device
      physical extents. This could result in counter-intuitive behavior for
      users so it's best to remove that support altogether.
      
      Additionally, the existing range support had a bug where if an offset
      was passed to FITRIM which overflows u64 e.g. -1 (parsed as u64
      18446744073709551615) then wrong data was fed into btrfs_issue_discard,
      which in turn leads to wrap-around when aligning the passed range and
      results in wrong regions being discarded which leads to data corruption.
      
      Fixes: c2d1b3aa ("btrfs: Honour FITRIM range constraints during free space trim")
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8103d10b
  4. 28 May, 2019 9 commits
    • Filipe Manana's avatar
      Btrfs: fix race updating log root item during fsync · 06989c79
      Filipe Manana authored
      When syncing the log, the final phase of a fsync operation, we need to
      either create a log root's item or update the existing item in the log
      tree of log roots, and that depends on the current value of the log
      root's log_transid - if it's 1 we need to create the log root item,
      otherwise it must exist already and we update it. Since there is no
      synchronization between updating the log_transid and checking it for
      deciding whether the log root's item needs to be created or updated, we
      end up with a tiny race window that results in attempts to update the
      item to fail because the item was not yet created:
      
                    CPU 1                                    CPU 2
      
        btrfs_sync_log()
      
          lock root->log_mutex
      
          set log root's log_transid to 1
      
          unlock root->log_mutex
      
                                                     btrfs_sync_log()
      
                                                       lock root->log_mutex
      
                                                       sets log root's
                                                       log_transid to 2
      
                                                       unlock root->log_mutex
      
          update_log_root()
      
            sees log root's log_transid
            with a value of 2
      
              calls btrfs_update_root(),
              which fails with -EUCLEAN
              and causes transaction abort
      
      Until recently the race lead to a BUG_ON at btrfs_update_root(), but after
      the recent commit 7ac1e464 ("btrfs: Don't panic when we can't find a
      root key") we just abort the current transaction.
      
      A sample trace of the BUG_ON() on a SLE12 kernel:
      
        ------------[ cut here ]------------
        kernel BUG at ../fs/btrfs/root-tree.c:157!
        Oops: Exception in kernel mode, sig: 5 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        (...)
        Supported: Yes, External
        CPU: 78 PID: 76303 Comm: rtas_errd Tainted: G                 X 4.4.156-94.57-default #1
        task: c00000ffa906d010 ti: c00000ff42b08000 task.ti: c00000ff42b08000
        NIP: d000000036ae5cdc LR: d000000036ae5cd8 CTR: 0000000000000000
        REGS: c00000ff42b0b860 TRAP: 0700   Tainted: G                 X  (4.4.156-94.57-default)
        MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 22444484  XER: 20000000
        CFAR: d000000036aba66c SOFTE: 1
        GPR00: d000000036ae5cd8 c00000ff42b0bae0 d000000036bda220 0000000000000054
        GPR04: 0000000000000001 0000000000000000 c00007ffff8d37c8 0000000000000000
        GPR08: c000000000e19c00 0000000000000000 0000000000000000 3736343438312079
        GPR12: 3930373337303434 c000000007a3a800 00000000007fffff 0000000000000023
        GPR16: c00000ffa9d26028 c00000ffa9d261f8 0000000000000010 c00000ffa9d2ab28
        GPR20: c00000ff42b0bc48 0000000000000001 c00000ff9f0d9888 0000000000000001
        GPR24: c00000ffa9d26000 c00000ffa9d261e8 c00000ffa9d2a800 c00000ff9f0d9888
        GPR28: c00000ffa9d26028 c00000ffa9d2aa98 0000000000000001 c00000ffa98f5b20
        NIP [d000000036ae5cdc] btrfs_update_root+0x25c/0x4e0 [btrfs]
        LR [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs]
        Call Trace:
        [c00000ff42b0bae0] [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] (unreliable)
        [c00000ff42b0bba0] [d000000036b53610] btrfs_sync_log+0x2d0/0xc60 [btrfs]
        [c00000ff42b0bce0] [d000000036b1785c] btrfs_sync_file+0x44c/0x4e0 [btrfs]
        [c00000ff42b0bd80] [c00000000032e300] vfs_fsync_range+0x70/0x120
        [c00000ff42b0bdd0] [c00000000032e44c] do_fsync+0x5c/0xb0
        [c00000ff42b0be10] [c00000000032e8dc] SyS_fdatasync+0x2c/0x40
        [c00000ff42b0be30] [c000000000009488] system_call+0x3c/0x100
        Instruction dump:
        7f43d378 4bffebb9 60000000 88d90008 3d220000 e8b90000 3b390009 e87a01f0
        e8898e08 e8f90000 4bfd48e5 60000000 <0fe00000> e95b0060 39200004 394a0ea0
        ---[ end trace 8f2dc8f919cabab8 ]---
      
      So fix this by doing the check of log_transid and updating or creating the
      log root's item while holding the root's log_mutex.
      
      Fixes: 7237f183 ("Btrfs: fix tree logs parallel sync")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      06989c79
    • Filipe Manana's avatar
      Btrfs: fix wrong ctime and mtime of a directory after log replay · 5338e43a
      Filipe Manana authored
      When replaying a log that contains a new file or directory name that needs
      to be added to its parent directory, we end up updating the mtime and the
      ctime of the parent directory to the current time after we have set their
      values to the correct ones (set at fsync time), efectivelly losing them.
      
      Sample reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/file
      
        # fsync of the directory is optional, not needed
        $ xfs_io -c fsync /mnt/dir
        $ xfs_io -c fsync /mnt/dir/file
      
        $ stat -c %Y /mnt/dir
        1557856079
      
        <power failure>
      
        $ sleep 3
        $ mount /dev/sdb /mnt
        $ stat -c %Y /mnt/dir
        1557856082
      
          --> should have been 1557856079, the mtime is updated to the current
              time when replaying the log
      
      Fix this by not updating the mtime and ctime to the current time at
      btrfs_add_link() when we are replaying a log tree.
      
      This could be triggered by my recent fsync fuzz tester for fstests, for
      which an fstests patch exists titled "fstests: generic, fsync fuzz tester
      with fsstress".
      
      Fixes: e02119d5 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5338e43a
    • Filipe Manana's avatar
      Btrfs: fix fsync not persisting changed attributes of a directory · 60d9f503
      Filipe Manana authored
      While logging an inode we follow its ancestors and for each one we mark
      it as logged in the current transaction, even if we have not logged it.
      As a consequence if we change an attribute of an ancestor, such as the
      UID or GID for example, and then explicitly fsync it, we end up not
      logging the inode at all despite returning success to user space, which
      results in the attribute being lost if a power failure happens after
      the fsync.
      
      Sample reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ chown 6007:6007 /mnt/dir
      
        $ sync
      
        $ chown 9003:9003 /mnt/dir
        $ touch /mnt/dir/file
        $ xfs_io -c fsync /mnt/dir/file
      
        # fsync our directory after fsync'ing the new file, should persist the
        # new values for the uid and gid.
        $ xfs_io -c fsync /mnt/dir
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ stat -c %u:%g /mnt/dir
        6007:6007
      
          --> should be 9003:9003, the uid and gid were not persisted, despite
              the explicit fsync on the directory prior to the power failure
      
      Fix this by not updating the logged_trans field of ancestor inodes when
      logging an inode, since we have not logged them. Let only future calls to
      btrfs_log_inode() to mark inodes as logged.
      
      This could be triggered by my recent fsync fuzz tester for fstests, for
      which an fstests patch exists titled "fstests: generic, fsync fuzz tester
      with fsstress".
      
      Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      60d9f503
    • Qu Wenruo's avatar
      btrfs: qgroup: Check bg while resuming relocation to avoid NULL pointer dereference · 57949d03
      Qu Wenruo authored
      [BUG]
      When mounting a fs with reloc tree and has qgroup enabled, it can cause
      NULL pointer dereference at mount time:
      
        BUG: kernel NULL pointer dereference, address: 00000000000000a8
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP NOPTI
        RIP: 0010:btrfs_qgroup_add_swapped_blocks+0x186/0x300 [btrfs]
        Call Trace:
         replace_path.isra.23+0x685/0x900 [btrfs]
         merge_reloc_root+0x26e/0x5f0 [btrfs]
         merge_reloc_roots+0x10a/0x1a0 [btrfs]
         btrfs_recover_relocation+0x3cd/0x420 [btrfs]
         open_ctree+0x1bc8/0x1ed0 [btrfs]
         btrfs_mount_root+0x544/0x680 [btrfs]
         legacy_get_tree+0x34/0x60
         vfs_get_tree+0x2d/0xf0
         fc_mount+0x12/0x40
         vfs_kern_mount.part.12+0x61/0xa0
         vfs_kern_mount+0x13/0x20
         btrfs_mount+0x16f/0x860 [btrfs]
         legacy_get_tree+0x34/0x60
         vfs_get_tree+0x2d/0xf0
         do_mount+0x81f/0xac0
         ksys_mount+0xbf/0xe0
         __x64_sys_mount+0x25/0x30
         do_syscall_64+0x65/0x240
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      In btrfs_recover_relocation(), we don't have enough info to determine
      which block group we're relocating, but only to merge existing reloc
      trees.
      
      Thus in btrfs_recover_relocation(), rc->block_group is NULL.
      btrfs_qgroup_add_swapped_blocks() hasn't taken this into consideration,
      and causes a NULL pointer dereference.
      
      The bug is introduced by commit 3d0174f7 ("btrfs: qgroup: Only trace
      data extents in leaves if we're relocating data block group"), and
      later qgroup refactoring still keeps this optimization.
      
      [FIX]
      Thankfully in the context of btrfs_recover_relocation(), there is no
      other progress can modify tree blocks, thus those swapped tree blocks
      pair will never affect qgroup numbers, no matter whatever we set for
      block->trace_leaf.
      
      So we only need to check if @bg is NULL before accessing @bg->flags.
      Reported-by: default avatarJuan Erbes <jerbes@gmail.com>
      Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1134806
      Fixes: 3d0174f7 ("btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group")
      CC: stable@vger.kernel.org # 4.20+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      57949d03
    • Qu Wenruo's avatar
      btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON() · 30d40577
      Qu Wenruo authored
      [BUG]
      When a fs has orphan reloc tree along with unfinished balance:
        ...
              item 16 key (TREE_RELOC ROOT_ITEM FS_TREE) itemoff 12090 itemsize 439
                      generation 12 root_dirid 256 bytenr 300400640 level 1 refs 0 <<<
                      lastsnap 8 byte_limit 0 bytes_used 1359872 flags 0x0(none)
                      uuid 7c48d938-33a3-4aae-ab19-6e5c9d406e46
              item 17 key (BALANCE TEMPORARY_ITEM 0) itemoff 11642 itemsize 448
                      temporary item objectid BALANCE offset 0
                      balance status flags 14
      
      Then at mount time, we can hit the following kernel BUG_ON():
        BTRFS info (device dm-3): relocating block group 298844160 flags metadata|dup
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/relocation.c:1413!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 1 PID: 897 Comm: btrfs-balance Tainted: G           O      5.2.0-rc1-custom #15
        RIP: 0010:create_reloc_root+0x1eb/0x200 [btrfs]
        Call Trace:
         btrfs_init_reloc_root+0x96/0xb0 [btrfs]
         record_root_in_trans+0xb2/0xe0 [btrfs]
         btrfs_record_root_in_trans+0x55/0x70 [btrfs]
         select_reloc_root+0x7e/0x230 [btrfs]
         do_relocation+0xc4/0x620 [btrfs]
         relocate_tree_blocks+0x592/0x6a0 [btrfs]
         relocate_block_group+0x47b/0x5d0 [btrfs]
         btrfs_relocate_block_group+0x183/0x2f0 [btrfs]
         btrfs_relocate_chunk+0x4e/0xe0 [btrfs]
         btrfs_balance+0x864/0xfa0 [btrfs]
         balance_kthread+0x3b/0x50 [btrfs]
         kthread+0x123/0x140
         ret_from_fork+0x27/0x50
      
      [CAUSE]
      In btrfs, reloc trees are used to record swapped tree blocks during
      balance.
      Reloc tree either get merged (replace old tree blocks of its parent
      subvolume) in next transaction if its ref is 1 (fresh).
      Or is already merged and will be cleaned up if its ref is 0 (orphan).
      
      After commit d2311e69 ("btrfs: relocation: Delay reloc tree deletion
      after merge_reloc_roots"), reloc tree cleanup is delayed until one block
      group is balanced.
      
      Since fresh reloc roots are recorded during merge, as long as there
      is no power loss, those orphan reloc roots converted from fresh ones are
      handled without problem.
      
      However when power loss happens, orphan reloc roots can be recorded
      on-disk, thus at next mount time, we will have orphan reloc roots from
      on-disk data directly, and ignored by clean_dirty_subvols() routine.
      
      Then when background balance starts to balance another block group, and
      needs to create new reloc root for the same root, btrfs_insert_item()
      returns -EEXIST, and trigger that BUG_ON().
      
      [FIX]
      For orphan reloc roots, also queue them to rc->dirty_subvol_roots, so
      all reloc roots no matter orphan or not, can be cleaned up properly and
      avoid above BUG_ON().
      
      And to cooperate with above change, clean_dirty_subvols() will check if
      the queued root is a reloc root or a subvol root.
      For a subvol root, do the old work, and for a orphan reloc root, clean it
      up.
      
      Fixes: d2311e69 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
      CC: stable@vger.kernel.org # 5.1
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      30d40577
    • Filipe Manana's avatar
      Btrfs: incremental send, fix emission of invalid clone operations · 3c850b45
      Filipe Manana authored
      When doing an incremental send we can now issue clone operations with a
      source range that ends at the source's file eof and with a destination
      range that ends at an offset smaller then the destination's file eof.
      If the eof of the source file is not aligned to the sector size of the
      filesystem, the receiver will get a -EINVAL error when trying to do the
      operation or, on older kernels, silently corrupt the destination file.
      The corruption happens on kernels without commit ac765f83
      ("Btrfs: fix data corruption due to cloning of eof block"), while the
      failure to clone happens on kernels with that commit.
      
      Example reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt/sdb
      
        $ xfs_io -f -c "pwrite -S 0xb1 0 2M" /mnt/sdb/foo
        $ xfs_io -f -c "pwrite -S 0xc7 0 2M" /mnt/sdb/bar
        $ xfs_io -f -c "pwrite -S 0x4d 0 2M" /mnt/sdb/baz
        $ xfs_io -f -c "pwrite -S 0xe2 0 2M" /mnt/sdb/zoo
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
      
        $ btrfs send -f /tmp/base.send /mnt/sdb/base
      
        $ xfs_io -c "reflink /mnt/sdb/bar 1560K 500K 100K" /mnt/sdb/bar
        $ xfs_io -c "reflink /mnt/sdb/bar 1560K 0 100K" /mnt/sdb/zoo
        $ xfs_io -c "truncate 550K" /mnt/sdb/bar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
      
        $ btrfs send -f /tmp/incr.send -p /mnt/sdb/base /mnt/sdb/incr
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ btrfs receive -f /tmp/base.send /mnt/sdc
        $ btrfs receive -vv -f /tmp/incr.send /mnt/sdc
        (...)
        truncate bar size=563200
        utimes bar
        clone zoo - source=bar source offset=512000 offset=0 length=51200
        ERROR: failed to clone extents to zoo
        Invalid argument
      
      The failure happens because the clone source range ends at the eof of file
      bar, 563200, which is not aligned to the filesystems sector size (4Kb in
      this case), and the destination range ends at offset 0 + 51200, which is
      less then the size of the file zoo (2Mb).
      
      So fix this by detecting such case and instead of issuing a clone
      operation for the whole range, do a clone operation for smaller range
      that is sector size aligned followed by a write operation for the block
      containing the eof. Here we will always be pessimistic and assume the
      destination filesystem of the send stream has the largest possible sector
      size (64Kb), since we have no way of determining it.
      
      This fixes a recent regression introduced in kernel 5.2-rc1.
      
      Fixes: 040ee612 ("Btrfs: send, improve clone range")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c850b45
    • Filipe Manana's avatar
      Btrfs: incremental send, fix file corruption when no-holes feature is enabled · 6b1f72e5
      Filipe Manana authored
      When using the no-holes feature, if we have a file with prealloc extents
      with a start offset beyond the file's eof, doing an incremental send can
      cause corruption of the file due to incorrect hole detection. Such case
      requires that the prealloc extent(s) exist in both the parent and send
      snapshots, and that a hole is punched into the file that covers all its
      extents that do not cross the eof boundary.
      
      Example reproducer:
      
        $ mkfs.btrfs -f -O no-holes /dev/sdb
        $ mount /dev/sdb /mnt/sdb
      
        $ xfs_io -f -c "pwrite -S 0xab 0 500K" /mnt/sdb/foobar
        $ xfs_io -c "falloc -k 1200K 800K" /mnt/sdb/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
      
        $ btrfs send -f /tmp/base.snap /mnt/sdb/base
      
        $ xfs_io -c "fpunch 0 500K" /mnt/sdb/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
      
        $ btrfs send -p /mnt/sdb/base -f /tmp/incr.snap /mnt/sdb/incr
      
        $ md5sum /mnt/sdb/incr/foobar
        816df6f64deba63b029ca19d880ee10a   /mnt/sdb/incr/foobar
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ btrfs receive -f /tmp/base.snap /mnt/sdc
        $ btrfs receive -f /tmp/incr.snap /mnt/sdc
      
        $ md5sum /mnt/sdc/incr/foobar
        cf2ef71f4a9e90c2f6013ba3b2257ed2   /mnt/sdc/incr/foobar
      
          --> Different checksum, because the prealloc extent beyond the
              file's eof confused the hole detection code and it assumed
              a hole starting at offset 0 and ending at the offset of the
              prealloc extent (1200Kb) instead of ending at the offset
              500Kb (the file's size).
      
      Fix this by ensuring we never cross the file's size when issuing the
      write operations for a hole.
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      CC: stable@vger.kernel.org # 3.14+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b1f72e5
    • Dennis Zhou's avatar
      btrfs: correct zstd workspace manager lock to use spin_lock_bh() · fee13fe9
      Dennis Zhou authored
      The btrfs zstd workspace manager uses a background timer to reclaim not
      recently used workspaces. I used spin_lock() from this context which
      should have been caught with lockdep, but was not. This deadlock was
      reported in bugzilla. The fix is to switch the zstd wsm lock to use
      spin_lock_bh() from the softirq context.
      
      This happened quite relibably on ppc64, unlike on other architectures.
      
        [  313.402874] ================================
        [  313.402875] WARNING: inconsistent lock state
        [  313.402879] 5.1.0-rc7 #1 Not tainted
        [  313.402880] --------------------------------
        [  313.402882] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        [  313.402885] swapper/5/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
        [  313.402888] 0000000080d1120c (&(&wsm.lock)->rlock){+.?.}, at: .zstd_reclaim_timer_fn+0x40/0x230
        [  313.402895] {SOFTIRQ-ON-W} state was registered at:
        [  313.402899]   .lock_acquire+0xd0/0x240
        [  313.402903]   ._raw_spin_lock+0x34/0x60
        [  313.402906]   .zstd_get_workspace+0xd0/0x360
        [  313.402908]   .end_compressed_bio_read+0x3b8/0x540
        [  313.402911]   .bio_endio+0x174/0x2c0
        [  313.402914]   .end_workqueue_fn+0x4c/0x70
        [  313.402917]   .normal_work_helper+0x138/0x7e0
        [  313.402920]   .process_one_work+0x324/0x790
        [  313.402922]   .worker_thread+0x68/0x570
        [  313.402925]   .kthread+0x19c/0x1b0
        [  313.402928]   .ret_from_kernel_thread+0x58/0x78
        [  313.402930] irq event stamp: 2629216
        [  313.402933] hardirqs last  enabled at (2629216): [<c0000000009da738>] ._raw_spin_unlock_irq+0x38/0x60
        [  313.402936] hardirqs last disabled at (2629215): [<c0000000009da4c4>] ._raw_spin_lock_irq+0x24/0x70
        [  313.402939] softirqs last  enabled at (2629212): [<c0000000000af9fc>] .irq_enter+0x8c/0xd0
        [  313.402942] softirqs last disabled at (2629213): [<c0000000000afb58>] .irq_exit+0x118/0x170
        [  313.402944]
      		 other info that might help us debug this:
        [  313.402945]  Possible unsafe locking scenario:
      
        [  313.402947]        CPU0
        [  313.402948]        ----
        [  313.402949]   lock(&(&wsm.lock)->rlock);
        [  313.402951]   <Interrupt>
        [  313.402952]     lock(&(&wsm.lock)->rlock);
        [  313.402954]
      		  *** DEADLOCK ***
      
        [  313.402957] 1 lock held by swapper/5/0:
        [  313.402958]  #0: 000000004b612042 ((&wsm.timer)){+.-.}, at: .call_timer_fn+0x0/0x3c0
        [  313.402963]
      		 stack backtrace:
        [  313.402967] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.1.0-rc7 #1
        [  313.402968] Call Trace:
        [  313.402972] [c0000007fa262e70] [c0000000009b3294] .dump_stack+0xe0/0x15c (unreliable)
        [  313.402975] [c0000007fa262f10] [c000000000125548] .print_usage_bug+0x348/0x390
        [  313.402978] [c0000007fa262fd0] [c000000000125cb4] .mark_lock+0x724/0x930
        [  313.402981] [c0000007fa263080] [c000000000126c20] .__lock_acquire+0xc90/0x16a0
        [  313.402984] [c0000007fa2631b0] [c000000000128040] .lock_acquire+0xd0/0x240
        [  313.402987] [c0000007fa263280] [c0000000009da2b4] ._raw_spin_lock+0x34/0x60
        [  313.402990] [c0000007fa263300] [c00000000054b0b0] .zstd_reclaim_timer_fn+0x40/0x230
        [  313.402993] [c0000007fa2633d0] [c000000000158b38] .call_timer_fn+0xc8/0x3c0
        [  313.402996] [c0000007fa2634a0] [c000000000158f74] .expire_timers+0x144/0x260
        [  313.402999] [c0000007fa263550] [c000000000159178] .run_timer_softirq+0xe8/0x230
        [  313.403002] [c0000007fa263680] [c0000000009db288] .__do_softirq+0x188/0x5d4
        [  313.403004] [c0000007fa263790] [c0000000000afb58] .irq_exit+0x118/0x170
        [  313.403008] [c0000007fa263800] [c000000000028d88] .timer_interrupt+0x158/0x430
        [  313.403012] [c0000007fa2638b0] [c0000000000091d4] decrementer_common+0x134/0x140
        [  313.403017] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
      		     LR = .arch_local_irq_restore.part.0+0x68/0x80
        [  313.403020] [c0000007fa263bb0] [c00000000001a3ac] .arch_local_irq_restore.part.0+0x2c/0x80 (unreliable)
        [  313.403024] [c0000007fa263c30] [c0000000007bbbcc] .cpuidle_enter_state+0xec/0x670
        [  313.403027] [c0000007fa263d00] [c0000000000f5130] .call_cpuidle+0x40/0x90
        [  313.403031] [c0000007fa263d70] [c0000000000f554c] .do_idle+0x2dc/0x3a0
        [  313.403034] [c0000007fa263e30] [c0000000000f59ac] .cpu_startup_entry+0x2c/0x30
        [  313.403037] [c0000007fa263ea0] [c000000000045674] .start_secondary+0x644/0x650
        [  313.403041] [c0000007fa263f90] [c00000000000ad5c] start_secondary_prolog+0x10/0x14
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203517
      Fixes: 3f93aef5 ("btrfs: add zstd compression level support")
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fee13fe9
    • Nikolay Borisov's avatar
      btrfs: Ensure replaced device doesn't have pending chunk allocation · debd1c06
      Nikolay Borisov authored
      Recent FITRIM work, namely bbbf7243 ("btrfs: combine device update
      operations during transaction commit") combined the way certain
      operations are recoded in a transaction. As a result an ASSERT was added
      in dev_replace_finish to ensure the new code works correctly.
      Unfortunately I got reports that it's possible to trigger the assert,
      meaning that during a device replace it's possible to have an unfinished
      chunk allocation on the source device.
      
      This is supposed to be prevented by the fact that a transaction is
      committed before finishing the replace oepration and alter acquiring the
      chunk mutex. This is not sufficient since by the time the transaction is
      committed and the chunk mutex acquired it's possible to allocate a chunk
      depending on the workload being executed on the replaced device. This
      bug has been present ever since device replace was introduced but there
      was never code which checks for it.
      
      The correct way to fix is to ensure that there is no pending device
      modification operation when the chunk mutex is acquire and if there is
      repeat transaction commit. Unfortunately it's not possible to just
      exclude the source device from btrfs_fs_devices::dev_alloc_list since
      this causes ENOSPC to be hit in transaction commit.
      
      Fixing that in another way would need to add special cases to handle the
      last writes and forbid new ones. The looped transaction fix is more
      obvious, and can be easily backported. The runtime of dev-replace is
      long so there's no noticeable delay caused by that.
      Reported-by: default avatarDavid Sterba <dsterba@suse.com>
      Fixes: 391cd9df ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      debd1c06
  5. 16 May, 2019 6 commits
    • Filipe Manana's avatar
      Btrfs: tree-checker: detect file extent items with overlapping ranges · 4e9845ef
      Filipe Manana authored
      Having file extent items with ranges that overlap each other is a
      serious issue that leads to all sorts of corruptions and crashes (like a
      BUG_ON() during the course of __btrfs_drop_extents() when it traims file
      extent items). Therefore teach the tree checker to detect such cases.
      This is motivated by a recently fixed bug (race between ranged full
      fsync and writeback or adjacent ranges).
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4e9845ef
    • Filipe Manana's avatar
      Btrfs: fix race between ranged fsync and writeback of adjacent ranges · 0c713cba
      Filipe Manana authored
      When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
      inode) that happens to be ranged, which happens during a msync() or writes
      for files opened with O_SYNC for example, we can end up with a corrupt log,
      due to different file extent items representing ranges that overlap with
      each other, or hit some assertion failures.
      
      When doing a ranged fsync we only flush delalloc and wait for ordered
      exents within that range. If while we are logging items from our inode
      ordered extents for adjacent ranges complete, we end up in a race that can
      make us insert the file extent items that overlap with others we logged
      previously and the assertion failures.
      
      For example, if tree-log.c:copy_items() receives a leaf that has the
      following file extents items, all with a length of 4K and therefore there
      is an implicit hole in the range 68K to 72K - 1:
      
        (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
      
      It copies them to the log tree. However due to the need to detect implicit
      holes, it may release the path, in order to look at the previous leaf to
      detect an implicit hole, and then later it will search again in the tree
      for the first file extent item key, with the goal of locking again the
      leaf (which might have changed due to concurrent changes to other inodes).
      
      However when it locks again the leaf containing the first key, the key
      corresponding to the extent at offset 72K may not be there anymore since
      there is an ordered extent for that range that is finishing (that is,
      somewhere in the middle of btrfs_finish_ordered_io()), and it just
      removed the file extent item but has not yet replaced it with a new file
      extent item, so the part of copy_items() that does hole detection will
      decide that there is a hole in the range starting from 68K to 76K - 1,
      and therefore insert a file extent item to represent that hole, having
      a key offset of 68K. After that we now have a log tree with 2 different
      extent items that have overlapping ranges:
      
       1) The file extent item copied before copy_items() released the path,
          which has a key offset of 72K and a length of 4K, representing the
          file range 72K to 76K - 1.
      
       2) And a file extent item representing a hole that has a key offset of
          68K and a length of 8K, representing the range 68K to 76K - 1. This
          item was inserted after releasing the path, and overlaps with the
          extent item inserted before.
      
      The overlapping extent items can cause all sorts of unpredictable and
      incorrect behaviour, either when replayed or if a fast (non full) fsync
      happens later, which can trigger a BUG_ON() when calling
      btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a
      trace like the following:
      
        [61666.783269] ------------[ cut here ]------------
        [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182!
        [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP
        (...)
        [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000
        [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs]
        [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246
        [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000
        [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937
        [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000
        [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418
        [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000
        [61666.786253] FS:  00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000
        [61666.786253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0
        [61666.786253] Call Trace:
        [61666.786253]  __btrfs_drop_extents+0x5e3/0xaad [btrfs]
        [61666.786253]  ? time_hardirqs_on+0x9/0x14
        [61666.786253]  btrfs_log_changed_extents+0x294/0x4e0 [btrfs]
        [61666.786253]  ? release_extent_buffer+0x38/0xb4 [btrfs]
        [61666.786253]  btrfs_log_inode+0xb6e/0xcdc [btrfs]
        [61666.786253]  ? lock_acquire+0x131/0x1c5
        [61666.786253]  ? btrfs_log_inode_parent+0xee/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs]
        [61666.786253]  btrfs_log_inode_parent+0x223/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? lockref_get_not_zero+0x2c/0x34
        [61666.786253]  ? rcu_read_unlock+0x3e/0x5d
        [61666.786253]  btrfs_log_dentry_safe+0x60/0x7b [btrfs]
        [61666.786253]  btrfs_sync_file+0x317/0x42c [btrfs]
        [61666.786253]  vfs_fsync_range+0x8c/0x9e
        [61666.786253]  SyS_msync+0x13c/0x1c9
        [61666.786253]  entry_SYSCALL_64_fastpath+0x18/0xad
      
      A sample of a corrupt log tree leaf with overlapping extents I got from
      running btrfs/072:
      
            item 14 key (295 108 200704) itemoff 2599 itemsize 53
                    extent data disk bytenr 0 nr 0
                    extent data offset 0 nr 458752 ram 458752
            item 15 key (295 108 659456) itemoff 2546 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 606208 nr 163840 ram 770048
            item 16 key (295 108 663552) itemoff 2493 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 610304 nr 155648 ram 770048
            item 17 key (295 108 819200) itemoff 2440 itemsize 53
                    extent data disk bytenr 4334788608 nr 4096
                    extent data offset 0 nr 4096 ram 4096
      
      The file extent item at offset 659456 (item 15) ends at offset 823296
      (659456 + 163840) while the next file extent item (item 16) starts at
      offset 663552.
      
      Another different problem that the race can trigger is a failure in the
      assertions at tree-log.c:copy_items(), which expect that the first file
      extent item key we found before releasing the path exists after we have
      released path and that the last key we found before releasing the path
      also exists after releasing the path:
      
        $ cat -n fs/btrfs/tree-log.c
        4080          if (need_find_last_extent) {
        4081                  /* btrfs_prev_leaf could return 1 without releasing the path */
        4082                  btrfs_release_path(src_path);
        4083                  ret = btrfs_search_slot(NULL, inode->root, &first_key,
        4084                                  src_path, 0, 0);
        4085                  if (ret < 0)
        4086                          return ret;
        4087                  ASSERT(ret == 0);
        (...)
        4103                  if (i >= btrfs_header_nritems(src_path->nodes[0])) {
        4104                          ret = btrfs_next_leaf(inode->root, src_path);
        4105                          if (ret < 0)
        4106                                  return ret;
        4107                          ASSERT(ret == 0);
        4108                          src = src_path->nodes[0];
        4109                          i = 0;
        4110                          need_find_last_extent = true;
        4111                  }
        (...)
      
      The second assertion implicitly expects that the last key before the path
      release still exists, because the surrounding while loop only stops after
      we have found that key. When this assertion fails it produces a stack like
      this:
      
        [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107
        [139590.037406] ------------[ cut here ]------------
        [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546!
        [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
        [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G        W         5.0.0-btrfs-next-46 #1
        (...)
        [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs]
        (...)
        [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282
        [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000
        [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868
        [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000
        [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013
        [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001
        [139590.042501] FS:  00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000
        [139590.042847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0
        [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [139590.044250] Call Trace:
        [139590.044631]  copy_items+0xa3f/0x1000 [btrfs]
        [139590.045009]  ? generic_bin_search.constprop.32+0x61/0x200 [btrfs]
        [139590.045396]  btrfs_log_inode+0x7b3/0xd70 [btrfs]
        [139590.045773]  btrfs_log_inode_parent+0x2b3/0xce0 [btrfs]
        [139590.046143]  ? do_raw_spin_unlock+0x49/0xc0
        [139590.046510]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        [139590.046872]  btrfs_sync_file+0x3b6/0x440 [btrfs]
        [139590.047243]  btrfs_file_write_iter+0x45b/0x5c0 [btrfs]
        [139590.047592]  __vfs_write+0x129/0x1c0
        [139590.047932]  vfs_write+0xc2/0x1b0
        [139590.048270]  ksys_write+0x55/0xc0
        [139590.048608]  do_syscall_64+0x60/0x1b0
        [139590.048946]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [139590.049287] RIP: 0033:0x7f2efc4be190
        (...)
        [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190
        [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003
        [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60
        [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003
        [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000
        (...)
        [139590.055128] ---[ end trace 193f35d0215cdeeb ]---
      
      So fix this race between a full ranged fsync and writeback of adjacent
      ranges by flushing all delalloc and waiting for all ordered extents to
      complete before logging the inode. This is the simplest way to solve the
      problem because currently the full fsync path does not deal with ranges
      at all (it assumes a full range from 0 to LLONG_MAX) and it always needs
      to look at adjacent ranges for hole detection. For use cases of ranged
      fsyncs this can make a few fsyncs slower but on the other hand it can
      make some following fsyncs to other ranges do less work or no need to do
      anything at all. A full fsync is rare anyway and happens only once after
      loading/creating an inode and once after less common operations such as a
      shrinking truncate.
      
      This is an issue that exists for a long time, and was often triggered by
      generic/127, because it does mmap'ed writes and msync (which triggers a
      ranged fsync). Adding support for the tree checker to detect overlapping
      extents (next patch in the series) and trigger a WARN() when such cases
      are found, and then calling btrfs_check_leaf_full() at the end of
      btrfs_insert_file_extent() made the issue much easier to detect. Running
      btrfs/072 with that change to the tree checker and making fsstress open
      files always with O_SYNC made it much easier to trigger the issue (as
      triggering it with generic/127 is very rare).
      
      CC: stable@vger.kernel.org # 3.16+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0c713cba
    • Filipe Manana's avatar
      Btrfs: avoid fallback to transaction commit during fsync of files with holes · ebb92906
      Filipe Manana authored
      When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a
      file that has holes and has file extent items spanning two or more leafs,
      we can end up falling to back to a full transaction commit due to a logic
      bug that leads to failure to insert a duplicate file extent item that is
      meant to represent a hole between the last file extent item of a leaf and
      the first file extent item in the next leaf. The failure (EEXIST error)
      leads to a transaction commit (as most errors when logging an inode do).
      
      For example, we have the two following leafs:
      
      Leaf N:
      
        -----------------------------------------------
        | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) |
        -----------------------------------------------
        The file extent item at the end of leaf N has a length of 4Kb,
        representing the file range from 64K to 68K - 1.
      
      Leaf N + 1:
      
        -----------------------------------------------
        | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... |
        -----------------------------------------------
        The file extent item at the first slot of leaf N + 1 has a length of
        4Kb too, representing the file range from 72K to 76K - 1.
      
      During the full fsync path, when we are at tree-log.c:copy_items() with
      leaf N as a parameter, after processing the last file extent item, that
      represents the extent at offset 64K, we take a look at the first file
      extent item at the next leaf (leaf N + 1), and notice there's a 4K hole
      between the two extents, and therefore we insert a file extent item
      representing that hole, starting at file offset 68K and ending at offset
      72K - 1. However we don't update the value of *last_extent, which is used
      to represent the end offset (plus 1, non-inclusive end) of the last file
      extent item inserted in the log, so it stays with a value of 68K and not
      with a value of 72K.
      
      Then, when copy_items() is called for leaf N + 1, because the value of
      *last_extent is smaller then the offset of the first extent item in the
      leaf (68K < 72K), we look at the last file extent item in the previous
      leaf (leaf N) and see it there's a 4K gap between it and our first file
      extent item (again, 68K < 72K), so we decide to insert a file extent item
      representing the hole, starting at file offset 68K and ending at offset
      72K - 1, this insertion will fail with -EEXIST being returned from
      btrfs_insert_file_extent() because we already inserted a file extent item
      representing a hole for this offset (68K) in the previous call to
      copy_items(), when processing leaf N.
      
      The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(),
      which falls back to a full transaction commit.
      
      Fix this by adjusting *last_extent after inserting a hole when we had to
      look at the next leaf.
      
      Fixes: 4ee3fad3 ("Btrfs: fix fsync after hole punching when using no-holes feature")
      Cc: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ebb92906
    • Qu Wenruo's avatar
      btrfs: extent-tree: Fix a bug that btrfs is unable to add pinned bytes · 14ae4ec1
      Qu Wenruo authored
      Commit ddf30cf0 ("btrfs: extent-tree: Use btrfs_ref to refactor
      add_pinned_bytes()") refactored add_pinned_bytes(), but during that
      refactor, there are two callers which add the pinned bytes instead
      of subtracting.
      
      That refactor misses those two caller, causing incorrect pinned bytes
      calculation and resulting unexpected ENOSPC error.
      
      Fix it by adding a new parameter @sign to restore the original behavior.
      Reported-by: default avatarkernel test robot <rong.a.chen@intel.com>
      Fixes: ddf30cf0 ("btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      14ae4ec1
    • Tobin C. Harding's avatar
      btrfs: sysfs: don't leak memory when failing add fsid · e3277335
      Tobin C. Harding authored
      A failed call to kobject_init_and_add() must be followed by a call to
      kobject_put().  Currently in the error path when adding fs_devices we
      are missing this call.  This could be fixed by calling
      btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or
      by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid().
      Here we choose the second option because it prevents the slightly
      unusual error path handling requirements of kobject from leaking out
      into btrfs functions.
      
      Add a call to kobject_put() in the error path of kobject_add_and_init().
      This causes the release method to be called if kobject_init_and_add()
      fails.  open_tree() is the function that calls btrfs_sysfs_add_fsid()
      and the error code in this function is already written with the
      assumption that the release method is called during the error path of
      open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the
      fail_fsdev_sysfs label).
      
      Cc: stable@vger.kernel.org # v4.4+
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarTobin C. Harding <tobin@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e3277335
    • Tobin C. Harding's avatar
      btrfs: sysfs: Fix error path kobject memory leak · 450ff834
      Tobin C. Harding authored
      If a call to kobject_init_and_add() fails we must call kobject_put()
      otherwise we leak memory.
      
      Calling kobject_put() when kobject_init_and_add() fails drops the
      refcount back to 0 and calls the ktype release method (which in turn
      calls the percpu destroy and kfree).
      
      Add call to kobject_put() in the error path of call to
      kobject_init_and_add().
      
      Cc: stable@vger.kernel.org # v4.4+
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarTobin C. Harding <tobin@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      450ff834
  6. 09 May, 2019 2 commits
    • Filipe Manana's avatar
      Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path · 72bd2323
      Filipe Manana authored
      Currently when we fail to COW a path at btrfs_update_root() we end up
      always aborting the transaction. However all the current callers of
      btrfs_update_root() are able to deal with errors returned from it, many do
      end up aborting the transaction themselves (directly or not, such as the
      transaction commit path), other BUG_ON() or just gracefully cancel whatever
      they were doing.
      
      When syncing the fsync log, we call btrfs_update_root() through
      tree-log.c:update_log_root(), and if it returns an -ENOSPC error, the log
      sync code does not abort the transaction, instead it gracefully handles
      the error and returns -EAGAIN to the fsync handler, so that it falls back
      to a transaction commit. Any other error different from -ENOSPC, makes the
      log sync code abort the transaction.
      
      So remove the transaction abort from btrfs_update_log() when we fail to
      COW a path to update the root item, so that if an -ENOSPC failure happens
      we avoid aborting the current transaction and have a chance of the fsync
      succeeding after falling back to a transaction commit.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203413
      Fixes: 79787eaa ("btrfs: replace many BUG_ONs with proper error handling")
      Cc: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      72bd2323
    • Josef Bacik's avatar
      btrfs: use the existing reserved items for our first prop for inheritance · d7400ee1
      Josef Bacik authored
      We're now reserving an extra items worth of space for property
      inheritance.  We only have one property at the moment so this covers us,
      but if we add more in the future this will allow us to not get bitten by
      the extra space reservation.  If we do add more properties in the future
      we should re-visit how we calculate the space reservation needs by the
      callers.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ refreshed on top of prop/xattr cleanups ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d7400ee1
  7. 03 May, 2019 2 commits
    • Josef Bacik's avatar
      btrfs: don't double unlock on error in btrfs_punch_hole · 8fca9550
      Josef Bacik authored
      If we have an error writing out a delalloc range in
      btrfs_punch_hole_lock_range we'll unlock the inode and then goto
      out_only_mutex, where we will again unlock the inode.  This is bad,
      don't do this.
      
      Fixes: f27451f2 ("Btrfs: add support for fallocate's zero range operation")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8fca9550
    • Johnny Chang's avatar
      btrfs: Check the compression level before getting a workspace · 2b90883c
      Johnny Chang authored
      When a file's compression property is set as zlib or zstd but leave
      the compression mount option not be set, that means btrfs will try
      to compress the file with default compression level. But in
      btrfs_compress_pages(), it calls get_workspace() with level = 0.
      This will return a workspace with a wrong compression level.
      For zlib, the compression level in the workspace will be 0
      (that means "store only"). And for zstd, the compression in the
      workspace will be 1, not the default level 3.
      
      How to reproduce:
        mkfs -t btrfs /dev/sdb
        mount /dev/sdb /mnt/
        mkdir /mnt/zlib
        btrfs property set /mnt/zlib/ compression zlib
        dd if=/dev/zero of=/mnt/zlib/compression-friendly-file-10M bs=1M count=10
        sync
        btrfs-debugfs -f /mnt/zlib/compression-friendly-file-10M
      
      btrfs-debugfs output:
      * before:
        ...
        (258 9961472): ram 524288 disk 1106247680 disk_size 524288
        file: ... extents 20 disk size 10485760 logical size 10485760 ratio 1.00
      
      * after:
       ...
       (258 10354688): ram 131072 disk 14217216 disk_size 4096
       file: ... extents 80 disk size 327680 logical size 10485760 ratio 32.00
      
      The steps for zstd are similar, but need to put a debugging message to
      show the level of the return workspace in zstd_get_workspace().
      
      This commit adds a check of the compression level before getting a
      workspace by set_level().
      
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: default avatarJohnny Chang <johnnyc@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2b90883c
  8. 02 May, 2019 9 commits
  9. 29 Apr, 2019 9 commits