• Filipe Manana's avatar
    Btrfs: fix fs corruption on transaction abort if device supports discard · 678886bd
    Filipe Manana authored
    When we abort a transaction we iterate over all the ranges marked as dirty
    in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
    from those trees, add them back (unpin) to the free space caches and, if
    the fs was mounted with "-o discard", perform a discard on those regions.
    Also, after adding the regions to the free space caches, a fitrim ioctl call
    can see those ranges in a block group's free space cache and perform a discard
    on the ranges, so the same issue can happen without "-o discard" as well.
    
    This causes corruption, affecting one or multiple btree nodes (in the worst
    case leaving the fs unmountable) because some of those ranges (the ones in
    the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
    referred by the last committed super block - breaking the rule that anything
    that was committed by a transaction is untouched until the next transaction
    commits successfully.
    
    I ran into this while running in a loop (for several hours) the fstest that
    I recently submitted:
    
      [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
    
    The corruption always happened when a transaction aborted and then fsck complained
    like this:
    
       _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
       *** fsck.btrfs output ***
       Check tree block failed, want=94945280, have=0
       Check tree block failed, want=94945280, have=0
       Check tree block failed, want=94945280, have=0
       Check tree block failed, want=94945280, have=0
       Check tree block failed, want=94945280, have=0
       read block failed check_tree_block
       Couldn't open file system
    
    In this case 94945280 corresponded to the root of a tree.
    Using frace what I observed was the following sequence of steps happened:
    
       1) transaction N started, fs_info->pinned_extents pointed to
          fs_info->freed_extents[0];
    
       2) node/eb 94945280 is created;
    
       3) eb is persisted to disk;
    
       4) transaction N commit starts, fs_info->pinned_extents now points to
          fs_info->freed_extents[1], and transaction N completes;
    
       5) transaction N + 1 starts;
    
       6) eb is COWed, and btrfs_free_tree_block() called for this eb;
    
       7) eb range (94945280 to 94945280 + 16Kb) is added to
          fs_info->pinned_extents (fs_info->freed_extents[1]);
    
       8) Something goes wrong in transaction N + 1, like hitting ENOSPC
          for example, and the transaction is aborted, turning the fs into
          readonly mode. The stack trace I got for example:
    
          [112065.253935]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
          [112065.254271]  [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
          [112065.254567]  [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
          [112065.261674]  [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
          [112065.261922]  [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
          [112065.262211]  [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
          [112065.262545]  [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
          [112065.262771]  [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
          [112065.263105]  [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
          (...)
          [112065.264493] ---[ end trace dd7903a975a31a08 ]---
          [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
          [112065.264997] BTRFS info (device sdc): forced readonly
    
       9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
          fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
          turn calls btrfs_destroy_pinned_extent();
    
       10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
           marked as dirty in fs_info->freed_extents[], and for each one
           it calls discard, if the fs was mounted with "-o discard", and
           adds the range to the free space cache of the respective block
           group;
    
       11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
           sees the free space entries and performs a discard;
    
       12) After an umount and mount (or fsck), our eb's location on disk was full
           of zeroes, and it should have been untouched, because it was marked as
           dirty in the fs_info->pinned_extents tree, and therefore used by the
           trees that the last committed superblock points to.
    
    Fix this by not performing a discard and not adding the ranges to the free space
    caches - it's useless from this point since the fs is now in readonly mode and
    we won't write free space caches to disk anymore (otherwise we would leak space)
    nor any new superblock. By not adding the ranges to the free space caches, it
    prevents other code paths from allocating that space and write to it as well,
    therefore being safer and simpler.
    
    This isn't a new problem, as it's been present since 2011 (git commit
    acce952b).
    
    Cc: stable@vger.kernel.org  # any kernel released after 2011-01-06
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarChris Mason <clm@fb.com>
    678886bd
extent-tree.c 261 KB