1. 02 Jul, 2019 9 commits
    • David Sterba's avatar
      btrfs: add mask for all RAID1 types · c7369b3f
      David Sterba authored
      Preparatory patch for additional RAID1 profiles with more copies. The
      mask will contain 3-copy and 4-copy, most of the checks for plain RAID1
      work the same for the other profiles.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c7369b3f
    • Qu Wenruo's avatar
      btrfs: qgroup: Don't hold qgroup_ioctl_lock in btrfs_qgroup_inherit() · e88439de
      Qu Wenruo authored
      [BUG]
      Lockdep will report the following circular locking dependency:
      
        WARNING: possible circular locking dependency detected
        5.2.0-rc2-custom #24 Tainted: G           O
        ------------------------------------------------------
        btrfs/8631 is trying to acquire lock:
        000000002536438c (&fs_info->qgroup_ioctl_lock#2){+.+.}, at: btrfs_qgroup_inherit+0x40/0x620 [btrfs]
      
        but task is already holding lock:
        000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (&fs_info->tree_log_mutex){+.+.}:
               __mutex_lock+0x76/0x940
               mutex_lock_nested+0x1b/0x20
               btrfs_commit_transaction+0x475/0xa00 [btrfs]
               btrfs_commit_super+0x71/0x80 [btrfs]
               close_ctree+0x2bd/0x320 [btrfs]
               btrfs_put_super+0x15/0x20 [btrfs]
               generic_shutdown_super+0x72/0x110
               kill_anon_super+0x18/0x30
               btrfs_kill_super+0x16/0xa0 [btrfs]
               deactivate_locked_super+0x3a/0x80
               deactivate_super+0x51/0x60
               cleanup_mnt+0x3f/0x80
               __cleanup_mnt+0x12/0x20
               task_work_run+0x94/0xb0
               exit_to_usermode_loop+0xd8/0xe0
               do_syscall_64+0x210/0x240
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #1 (&fs_info->reloc_mutex){+.+.}:
               __mutex_lock+0x76/0x940
               mutex_lock_nested+0x1b/0x20
               btrfs_commit_transaction+0x40d/0xa00 [btrfs]
               btrfs_quota_enable+0x2da/0x730 [btrfs]
               btrfs_ioctl+0x2691/0x2b40 [btrfs]
               do_vfs_ioctl+0xa9/0x6d0
               ksys_ioctl+0x67/0x90
               __x64_sys_ioctl+0x1a/0x20
               do_syscall_64+0x65/0x240
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #0 (&fs_info->qgroup_ioctl_lock#2){+.+.}:
               lock_acquire+0xa7/0x190
               __mutex_lock+0x76/0x940
               mutex_lock_nested+0x1b/0x20
               btrfs_qgroup_inherit+0x40/0x620 [btrfs]
               create_pending_snapshot+0x9d7/0xe60 [btrfs]
               create_pending_snapshots+0x94/0xb0 [btrfs]
               btrfs_commit_transaction+0x415/0xa00 [btrfs]
               btrfs_mksubvol+0x496/0x4e0 [btrfs]
               btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
               btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
               btrfs_ioctl+0xa90/0x2b40 [btrfs]
               do_vfs_ioctl+0xa9/0x6d0
               ksys_ioctl+0x67/0x90
               __x64_sys_ioctl+0x1a/0x20
               do_syscall_64+0x65/0x240
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        other info that might help us debug this:
      
        Chain exists of:
          &fs_info->qgroup_ioctl_lock#2 --> &fs_info->reloc_mutex --> &fs_info->tree_log_mutex
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&fs_info->tree_log_mutex);
                                       lock(&fs_info->reloc_mutex);
                                       lock(&fs_info->tree_log_mutex);
          lock(&fs_info->qgroup_ioctl_lock#2);
      
         *** DEADLOCK ***
      
        6 locks held by btrfs/8631:
         #0: 00000000ed8f23f6 (sb_writers#12){.+.+}, at: mnt_want_write_file+0x28/0x60
         #1: 000000009fb1597a (&type->i_mutex_dir_key#10/1){+.+.}, at: btrfs_mksubvol+0x70/0x4e0 [btrfs]
         #2: 0000000088c5ad88 (&fs_info->subvol_sem){++++}, at: btrfs_mksubvol+0x128/0x4e0 [btrfs]
         #3: 000000009606fc3e (sb_internal#2){.+.+}, at: start_transaction+0x37a/0x520 [btrfs]
         #4: 00000000f82bbdf5 (&fs_info->reloc_mutex){+.+.}, at: btrfs_commit_transaction+0x40d/0xa00 [btrfs]
         #5: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]
      
      [CAUSE]
      Due to the delayed subvolume creation, we need to call
      btrfs_qgroup_inherit() inside commit transaction code, with a lot of
      other mutex hold.
      This hell of lock chain can lead to above problem.
      
      [FIX]
      On the other hand, we don't really need to hold qgroup_ioctl_lock if
      we're in the context of create_pending_snapshot().
      As in that context, we're the only one being able to modify qgroup.
      
      All other qgroup functions which needs qgroup_ioctl_lock are either
      holding a transaction handle, or will start a new transaction:
        Functions will start a new transaction():
        * btrfs_quota_enable()
        * btrfs_quota_disable()
        Functions hold a transaction handler:
        * btrfs_add_qgroup_relation()
        * btrfs_del_qgroup_relation()
        * btrfs_create_qgroup()
        * btrfs_remove_qgroup()
        * btrfs_limit_qgroup()
        * btrfs_qgroup_inherit() call inside create_subvol()
      
      So we have a higher level protection provided by transaction, thus we
      don't need to always hold qgroup_ioctl_lock in btrfs_qgroup_inherit().
      
      Only the btrfs_qgroup_inherit() call in create_subvol() needs to hold
      qgroup_ioctl_lock, while the btrfs_qgroup_inherit() call in
      create_pending_snapshot() is already protected by transaction.
      
      So the fix is to detect the context by checking
      trans->transaction->state.
      If we're at TRANS_STATE_COMMIT_DOING, then we're in commit transaction
      context and no need to get the mutex.
      Reported-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e88439de
    • Johannes Thumshirn's avatar
      btrfs: correctly validate compression type · aa53e3bf
      Johannes Thumshirn authored
      Nikolay reported the following KASAN splat when running btrfs/048:
      
      [ 1843.470920] ==================================================================
      [ 1843.471971] BUG: KASAN: slab-out-of-bounds in strncmp+0x66/0xb0
      [ 1843.472775] Read of size 1 at addr ffff888111e369e2 by task btrfs/3979
      
      [ 1843.473904] CPU: 3 PID: 3979 Comm: btrfs Not tainted 5.2.0-rc3-default #536
      [ 1843.475009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [ 1843.476322] Call Trace:
      [ 1843.476674]  dump_stack+0x7c/0xbb
      [ 1843.477132]  ? strncmp+0x66/0xb0
      [ 1843.477587]  print_address_description+0x114/0x320
      [ 1843.478256]  ? strncmp+0x66/0xb0
      [ 1843.478740]  ? strncmp+0x66/0xb0
      [ 1843.479185]  __kasan_report+0x14e/0x192
      [ 1843.479759]  ? strncmp+0x66/0xb0
      [ 1843.480209]  kasan_report+0xe/0x20
      [ 1843.480679]  strncmp+0x66/0xb0
      [ 1843.481105]  prop_compression_validate+0x24/0x70
      [ 1843.481798]  btrfs_xattr_handler_set_prop+0x65/0x160
      [ 1843.482509]  __vfs_setxattr+0x71/0x90
      [ 1843.483012]  __vfs_setxattr_noperm+0x84/0x130
      [ 1843.483606]  vfs_setxattr+0xac/0xb0
      [ 1843.484085]  setxattr+0x18c/0x230
      [ 1843.484546]  ? vfs_setxattr+0xb0/0xb0
      [ 1843.485048]  ? __mod_node_page_state+0x1f/0xa0
      [ 1843.485672]  ? _raw_spin_unlock+0x24/0x40
      [ 1843.486233]  ? __handle_mm_fault+0x988/0x1290
      [ 1843.486823]  ? lock_acquire+0xb4/0x1e0
      [ 1843.487330]  ? lock_acquire+0xb4/0x1e0
      [ 1843.487842]  ? mnt_want_write_file+0x3c/0x80
      [ 1843.488442]  ? debug_lockdep_rcu_enabled+0x22/0x40
      [ 1843.489089]  ? rcu_sync_lockdep_assert+0xe/0x70
      [ 1843.489707]  ? __sb_start_write+0x158/0x200
      [ 1843.490278]  ? mnt_want_write_file+0x3c/0x80
      [ 1843.490855]  ? __mnt_want_write+0x98/0xe0
      [ 1843.491397]  __x64_sys_fsetxattr+0xba/0xe0
      [ 1843.492201]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [ 1843.493201]  do_syscall_64+0x6c/0x230
      [ 1843.493988]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 1843.495041] RIP: 0033:0x7fa7a8a7707a
      [ 1843.495819] Code: 48 8b 0d 21 de 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 be 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ee dd 2b 00 f7 d8 64 89 01 48
      [ 1843.499203] RSP: 002b:00007ffcb73bca38 EFLAGS: 00000202 ORIG_RAX: 00000000000000be
      [ 1843.500210] RAX: ffffffffffffffda RBX: 00007ffcb73bda9d RCX: 00007fa7a8a7707a
      [ 1843.501170] RDX: 00007ffcb73bda9d RSI: 00000000006dc050 RDI: 0000000000000003
      [ 1843.502152] RBP: 00000000006dc050 R08: 0000000000000000 R09: 0000000000000000
      [ 1843.503109] R10: 0000000000000002 R11: 0000000000000202 R12: 00007ffcb73bda91
      [ 1843.504055] R13: 0000000000000003 R14: 00007ffcb73bda82 R15: ffffffffffffffff
      
      [ 1843.505268] Allocated by task 3979:
      [ 1843.505771]  save_stack+0x19/0x80
      [ 1843.506211]  __kasan_kmalloc.constprop.5+0xa0/0xd0
      [ 1843.506836]  setxattr+0xeb/0x230
      [ 1843.507264]  __x64_sys_fsetxattr+0xba/0xe0
      [ 1843.507886]  do_syscall_64+0x6c/0x230
      [ 1843.508429]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [ 1843.509558] Freed by task 0:
      [ 1843.510188] (stack is not available)
      
      [ 1843.511309] The buggy address belongs to the object at ffff888111e369e0
                      which belongs to the cache kmalloc-8 of size 8
      [ 1843.514095] The buggy address is located 2 bytes inside of
                      8-byte region [ffff888111e369e0, ffff888111e369e8)
      [ 1843.516524] The buggy address belongs to the page:
      [ 1843.517561] page:ffff88813f478d80 refcount:1 mapcount:0 mapping:ffff88811940c300 index:0xffff888111e373b8 compound_mapcount: 0
      [ 1843.519993] flags: 0x4404000010200(slab|head)
      [ 1843.520951] raw: 0004404000010200 ffff88813f48b008 ffff888119403d50 ffff88811940c300
      [ 1843.522616] raw: ffff888111e373b8 000000000016000f 00000001ffffffff 0000000000000000
      [ 1843.524281] page dumped because: kasan: bad access detected
      
      [ 1843.525936] Memory state around the buggy address:
      [ 1843.526975]  ffff888111e36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.528479]  ffff888111e36900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.530138] >ffff888111e36980: fc fc fc fc fc fc fc fc fc fc fc fc 02 fc fc fc
      [ 1843.531877]                                                        ^
      [ 1843.533287]  ffff888111e36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.534874]  ffff888111e36a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.536468] ==================================================================
      
      This is caused by supplying a too short compression value ('lz') in the
      test-case and comparing it to 'lzo' with strncmp() and a length of 3.
      strncmp() read past the 'lz' when looking for the 'o' and thus caused an
      out-of-bounds read.
      
      Introduce a new check 'btrfs_compress_is_valid_type()' which not only
      checks the user-supplied value against known compression types, but also
      employs checks for too short values.
      Reported-by: default avatarNikolay Borisov <nborisov@suse.com>
      Fixes: 272e5326 ("btrfs: prop: fix vanished compression property after failed set")
      CC: stable@vger.kernel.org # 5.1+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      aa53e3bf
    • Filipe Manana's avatar
      Btrfs: fix data loss after inode eviction, renaming it, and fsync it · d1d832a0
      Filipe Manana authored
      When we log an inode, regardless of logging it completely or only that it
      exists, we always update it as logged (logged_trans and last_log_commit
      fields of the inode are updated). This is generally fine and avoids future
      attempts to log it from having to do repeated work that brings no value.
      
      However, if we write data to a file, then evict its inode after all the
      dealloc was flushed (and ordered extents completed), rename the file and
      fsync it, we end up not logging the new extents, since the rename may
      result in logging that the inode exists in case the parent directory was
      logged before. The following reproducer shows and explains how this can
      happen:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/foo
        $ touch /mnt/dir/bar
      
        # Do a direct IO write instead of a buffered write because with a
        # buffered write we would need to make sure dealloc gets flushed and
        # complete before we do the inode eviction later, and we can not do that
        # from user space with call to things such as sync(2) since that results
        # in a transaction commit as well.
        $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar
      
        # Keep the directory dir in use while we evict inodes. We want our file
        # bar's inode to be evicted but we don't want our directory's inode to
        # be evicted (if it were evicted too, we would not be able to reproduce
        # the issue since the first fsync below, of file foo, would result in a
        # transaction commit.
        $ ( cd /mnt/dir; while true; do :; done ) &
        $ pid=$!
      
        # Wait a bit to give time for the background process to chdir.
        $ sleep 0.1
      
        # Evict all inodes, except the inode for the directory dir because it is
        # currently in use by our background process.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # fsync file foo, which ends up persisting information about the parent
        # directory because it is a new inode.
        $ xfs_io -c fsync /mnt/dir/foo
      
        # Rename bar, this results in logging that this inode exists (inode item,
        # names, xattrs) because the parent directory is in the log.
        $ mv /mnt/dir/bar /mnt/dir/baz
      
        # Now fsync baz, which ends up doing absolutely nothing because of the
        # rename operation which logged that the inode exists only.
        $ xfs_io -c fsync /mnt/dir/baz
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/dir/baz
        0000000
      
          --> Empty file, data we wrote is missing.
      
      Fix this by not updating last_sub_trans of an inode when we are logging
      only that it exists and the inode was not yet logged since it was loaded
      from disk (full_sync bit set), this is enough to make btrfs_inode_in_log()
      return false for this scenario and make us log the inode. The logged_trans
      of the inode is still always setsince that alone is used to track if names
      need to be deleted as part of unlink operations.
      
      Fixes: 257c62e1 ("Btrfs: avoid tree log commit when there are no changes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d1d832a0
    • David Sterba's avatar
      btrfs: raid56: clear incompat block group flags after removing the last one · 6d58a55a
      David Sterba authored
      The incompat bit for RAID56 is set either at mount time or automatically
      when the profile is used by balance. The part where the bit is removed
      is missing and can be unexpected or undesired when an older kernel is
      needed.
      
      This patch will drop the incompat bit after this command, assuming
      that RAID5 profile is not used by system or metadata:
      
       $ btrfs balance start -dconvert=raid5 /mnt
       $ btrfs balance start -dconvert=raid1 /mnt
      
      This will print "clearing 128 feature flag" to the system log.
      
      The patch is safe for backporting to older kernels.
      Reported-by: default avatarHugo Mills <hugo@carfax.org.uk>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6d58a55a
    • David Sterba's avatar
      btrfs: switch extent_buffer write_locks from atomic to int · 00801ae4
      David Sterba authored
      The write_locks is either 0 or 1 and always updated under the lock,
      so we don't need the atomic_t semantics.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      00801ae4
    • David Sterba's avatar
      btrfs: switch extent_buffer spinning_writers from atomic to int · f3dc24c5
      David Sterba authored
      The spinning_writers is either 0 or 1 and always updated under the lock,
      so we don't need the atomic_t semantics.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f3dc24c5
    • David Sterba's avatar
      btrfs: switch extent_buffer blocking_writers from atomic to int · 06297d8c
      David Sterba authored
      The blocking_writers is either 0 or 1 and always updated under the lock,
      so we don't need the atomic_t semantics.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      06297d8c
    • David Sterba's avatar
      btrfs: assert delayed ref lock in btrfs_find_delayed_ref_head · 38e9372e
      David Sterba authored
      Turn the comment about required lock into an assertion.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      38e9372e
  2. 01 Jul, 2019 31 commits