1. 23 Jan, 2020 13 commits
    • Nikolay Borisov's avatar
      btrfs: Fix split-brain handling when changing FSID to metadata uuid · 1362089d
      Nikolay Borisov authored
      Current code doesn't correctly handle the situation which arises when
      a file system that has METADATA_UUID_INCOMPAT flag set and has its FSID
      changed to the one in metadata uuid. This causes the incompat flag to
      disappear.
      
      In case of a power failure we could end up in a situation where part of
      the disks in a multi-disk filesystem are correctly reverted to
      METADATA_UUID_INCOMPAT flag unset state, while others have
      METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS.
      
      This patch corrects the behavior required to handle the case where a
      disk of the second type is scanned first, creating the necessary
      btrfs_fs_devices. Subsequently, when a disk which has already completed
      the transition is scanned it should overwrite the data in
      btrfs_fs_devices.
      Reported-by: default avatarSu Yue <Damenly_Su@gmx.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1362089d
    • Nikolay Borisov's avatar
      btrfs: Handle another split brain scenario with metadata uuid feature · 05840710
      Nikolay Borisov authored
      There is one more cases which isn't handled by the original metadata
      uuid work. Namely, when a filesystem has METADATA_UUID incompat bit and
      the user decides to change the FSID to the original one e.g. have
      metadata_uuid and fsid match. In case of power failure while this
      operation is in progress we could end up in a situation where some of
      the disks have the incompat bit removed and the other half have both
      METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags.
      
      This patch handles the case where a disk that has successfully changed
      its FSID such that it equals METADATA_UUID is scanned first.
      Subsequently when a disk with both
      METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned
      find_fsid_changed won't be able to find an appropriate btrfs_fs_devices.
      This is done by extending find_fsid_changed to correctly find
      btrfs_fs_devices whose metadata_uuid/fsid are the same and they match
      the metadata_uuid of the currently scanned device.
      
      Fixes: cc5de4e7 ("btrfs: Handle final split-brain possibility during fsid change")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reported-by: default avatarSu Yue <Damenly_Su@gmx.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      05840710
    • Su Yue's avatar
      btrfs: Factor out metadata_uuid code from find_fsid. · c6730a0e
      Su Yue authored
      find_fsid became rather hairy with the introduction of metadata uuid
      changing feature. Alleviate this by factoring out the metadata uuid
      specific code in a dedicated function which deals with finding
      correct fsid for a device with changed uuid.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarSu Yue <Damenly_Su@gmx.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c6730a0e
    • Su Yue's avatar
      btrfs: Call find_fsid from find_fsid_inprogress · c0d81c7c
      Su Yue authored
      Since find_fsid_inprogress should also handle the case in which an fs
      didn't change its FSID make it call find_fsid directly. This makes the
      code in device_list_add simpler by eliminating a conditional call of
      find_fsid. No functional changes.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarSu Yue <Damenly_Su@gmx.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c0d81c7c
    • Filipe Manana's avatar
      Btrfs: fix infinite loop during fsync after rename operations · b5e4ff9d
      Filipe Manana authored
      Recently fsstress (from fstests) sporadically started to trigger an
      infinite loop during fsync operations. This turned out to be because
      support for the rename exchange and whiteout operations was added to
      fsstress in fstests. These operations, unlike any others in fsstress,
      cause file names to be reused, whence triggering this issue. However
      it's not necessary to use rename exchange and rename whiteout operations
      trigger this issue, simple rename operations and file creations are
      enough to trigger the issue.
      
      The issue boils down to when we are logging inodes that conflict (that
      had the name of any inode we need to log during the fsync operation), we
      keep logging them even if they were already logged before, and after
      that we check if there's any other inode that conflicts with them and
      then add it again to the list of inodes to log. Skipping already logged
      inodes fixes the issue.
      
      Consider the following example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/testdir                           # inode 257
      
        $ touch /mnt/testdir/zz                        # inode 258
        $ ln /mnt/testdir/zz /mnt/testdir/zz_link
      
        $ touch /mnt/testdir/a                         # inode 259
      
        $ sync
      
        # The following 3 renames achieve the same result as a rename exchange
        # operation (<rename_exchange> /mnt/testdir/zz_link to /mnt/testdir/a).
      
        $ mv /mnt/testdir/a /mnt/testdir/a/tmp
        $ mv /mnt/testdir/zz_link /mnt/testdir/a
        $ mv /mnt/testdir/a/tmp /mnt/testdir/zz_link
      
        # The following rename and file creation give the same result as a
        # rename whiteout operation (<rename_whiteout> zz to a2).
      
        $ mv /mnt/testdir/zz /mnt/testdir/a2
        $ touch /mnt/testdir/zz                        # inode 260
      
        $ xfs_io -c fsync /mnt/testdir/zz
          --> results in the infinite loop
      
      The following steps happen:
      
      1) When logging inode 260, we find that its reference named "zz" was
         used by inode 258 in the previous transaction (through the commit
         root), so inode 258 is added to the list of conflicting indoes that
         need to be logged;
      
      2) After logging inode 258, we find that its reference named "a" was
         used by inode 259 in the previous transaction, and therefore we add
         inode 259 to the list of conflicting inodes to be logged;
      
      3) After logging inode 259, we find that its reference named "zz_link"
         was used by inode 258 in the previous transaction - we add inode 258
         to the list of conflicting inodes to log, again - we had already
         logged it before at step 3. After logging it again, we find again
         that inode 259 conflicts with him, and we add again 259 to the list,
         etc - we end up repeating all the previous steps.
      
      So fix this by skipping logging of conflicting inodes that were already
      logged.
      
      Fixes: 6b5fc433 ("Btrfs: fix fsync after succession of renames of different files")
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b5e4ff9d
    • Josef Bacik's avatar
      btrfs: set trans->drity in btrfs_commit_transaction · d62b23c9
      Josef Bacik authored
      If we abort a transaction we have the following sequence
      
      if (!trans->dirty && list_empty(&trans->new_bgs))
      	return;
      WRITE_ONCE(trans->transaction->aborted, err);
      
      The idea being if we didn't modify anything with our trans handle then
      we don't really need to abort the whole transaction, maybe the other
      trans handles are fine and we can carry on.
      
      However in the case of create_snapshot we add a pending_snapshot object
      to our transaction and then commit the transaction.  We don't actually
      modify anything.  sync() behaves the same way, attach to an existing
      transaction and commit it.  This means that if we have an IO error in
      the right places we could abort the committing transaction with our
      trans->dirty being not set and thus not set transaction->aborted.
      
      This is a problem because in the create_snapshot() case we depend on
      pending->error being set to something, or btrfs_commit_transaction
      returning an error.
      
      If we are not the trans handle that gets to commit the transaction, and
      we're waiting on the commit to happen we get our return value from
      cur_trans->aborted.  If this was not set to anything because sync() hit
      an error in the transaction commit before it could modify anything then
      cur_trans->aborted would be 0.  Thus we'd return 0 from
      btrfs_commit_transaction() in create_snapshot.
      
      This is a problem because we then try to do things with
      pending_snapshot->snap, which will be NULL because we didn't create the
      snapshot, and then we'll get a NULL pointer dereference like the
      following
      
      "BUG: kernel NULL pointer dereference, address: 00000000000001f0"
      RIP: 0010:btrfs_orphan_cleanup+0x2d/0x330
      Call Trace:
       ? btrfs_mksubvol.isra.31+0x3f2/0x510
       btrfs_mksubvol.isra.31+0x4bc/0x510
       ? __sb_start_write+0xfa/0x200
       ? mnt_want_write_file+0x24/0x50
       btrfs_ioctl_snap_create_transid+0x16c/0x1a0
       btrfs_ioctl_snap_create_v2+0x11e/0x1a0
       btrfs_ioctl+0x1534/0x2c10
       ? free_debug_processing+0x262/0x2a3
       do_vfs_ioctl+0xa6/0x6b0
       ? do_sys_open+0x188/0x220
       ? syscall_trace_enter+0x1f8/0x330
       ksys_ioctl+0x60/0x90
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x4a/0x1b0
      
      In order to fix this we need to make sure anybody who calls
      commit_transaction has trans->dirty set so that they properly set the
      trans->transaction->aborted value properly so any waiters know bad
      things happened.
      
      This was found while I was running generic/475 with my modified
      fsstress, it reproduced within a few runs.  I ran with this patch all
      night and didn't see the problem again.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d62b23c9
    • Josef Bacik's avatar
      btrfs: drop log root for dropped roots · 889bfa39
      Josef Bacik authored
      If we fsync on a subvolume and create a log root for that volume, and
      then later delete that subvolume we'll never clean up its log root.  Fix
      this by making switch_commit_roots free the log for any dropped roots we
      encounter.  The extra churn is because we need a btrfs_trans_handle, not
      the btrfs_transaction.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      889bfa39
    • Anand Jain's avatar
      btrfs: sysfs, add devid/dev_state kobject and device attributes · 668e48af
      Anand Jain authored
      New sysfs attributes that track the filesystem status of devices, stored
      in the per-filesystem directory in /sys/fs/btrfs/FSID/devinfo . There's
      a directory for each device, with name corresponding to the numerical
      device id.
      
        in_fs_metadata    - device is in the list of fs metadata
        missing           - device is missing (no device node or block device)
        replace_target    - device is target of replace
        writeable         - writes from fs are allowed
      
      These attributes reflect the state of the device::dev_state and created
      at mount time.
      
      Sample output:
        $ pwd
         /sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo/1/
        $ ls
          in_fs_metadata  missing  replace_target  writeable
        $ cat missing
          0
      
      The output from these attributes are 0 or 1. 0 indicates unset and 1
      indicates set.  These attributes are readonly.
      
      It is observed that the device delete thread and sysfs read thread will
      not race because the delete thread calls sysfs kobject_put() which in
      turn waits for existing sysfs read to complete.
      
      Note for device replace devid swap:
      
      During the replace the target device temporarily assumes devid 0 before
      assigning the devid of the soruce device.
      
      In btrfs_dev_replace_finishing() we remove source sysfs devid using the
      function btrfs_sysfs_remove_devices_attr(), so after that call
      kobject_rename() to update the devid in the sysfs.  This adds and calls
      btrfs_sysfs_update_devid() helper function to update the device id.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      668e48af
    • Nikolay Borisov's avatar
      btrfs: Refactor btrfs_rmap_block to improve readability · 1776ad17
      Nikolay Borisov authored
      Move variables to appropriate scope. Remove last BUG_ON in the function
      and rework error handling accordingly. Make the duplicate detection code
      more straightforward. Use in_range macro. And give variables more
      descriptive name by explicitly distinguishing between IO stripe size
      (size recorded in the chunk item) and data stripe size (the size of
      an actual stripe, constituting a logical chunk/block group).
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1776ad17
    • Nikolay Borisov's avatar
      btrfs: Add self-tests for btrfs_rmap_block · bf2e2eb0
      Nikolay Borisov authored
      Add RAID1 and single testcases to verify that data stripes are excluded
      from super block locations and that the address mapping is valid.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bf2e2eb0
    • Nikolay Borisov's avatar
      btrfs: selftests: Add support for dummy devices · b3ad2c17
      Nikolay Borisov authored
      Add basic infrastructure to create and link dummy btrfs_devices. This
      will be used in the pending btrfs_rmap_block test which deals with
      the block groups.
      
      Calling btrfs_alloc_dummy_device will link the newly created device to
      the passed fs_info and the test framework will free them once the test
      is finished.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b3ad2c17
    • Nikolay Borisov's avatar
      btrfs: Move and unexport btrfs_rmap_block · 96a14336
      Nikolay Borisov authored
      It's used only during initial block group reading to map physical
      address of super block to a list of logical ones. Make it private to
      block-group.c, add proper kernel doc and ensure it's exported only for
      tests.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      96a14336
    • David Sterba's avatar
      btrfs: separate definition of assertion failure handlers · 68c467cb
      David Sterba authored
      There's a report where objtool detects unreachable instructions, eg.:
      
        fs/btrfs/ctree.o: warning: objtool: btrfs_search_slot()+0x2d4: unreachable instruction
      
      This seems to be a false positive due to compiler version. The cause is
      in the ASSERT macro implementation that does the conditional check as
      IS_DEFINED(CONFIG_BTRFS_ASSERT) and not an #ifdef.
      
      To avoid that, use the ifdefs directly.
      
      There are still 2 reports that aren't fixed:
      
        fs/btrfs/extent_io.o: warning: objtool: __set_extent_bit()+0x71f: unreachable instruction
        fs/btrfs/relocation.o: warning: objtool: find_data_references()+0x4e0: unreachable instruction
      Co-developed-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Reported-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      68c467cb
  2. 20 Jan, 2020 27 commits