• Filipe Manana's avatar
    Btrfs: fix race leading to incorrect item deletion when dropping extents · b4f5eab6
    Filipe Manana authored
    commit aeafbf84 upstream.
    
    While running a stress test I got the following warning triggered:
    
      [191627.672810] ------------[ cut here ]------------
      [191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]()
      (...)
      [191627.701485] Call Trace:
      [191627.702037]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
      [191627.702992]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
      [191627.704091]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
      [191627.705380]  [<ffffffffa0664499>] ? __btrfs_drop_extents+0x391/0xa50 [btrfs]
      [191627.706637]  [<ffffffff8104b46d>] warn_slowpath_null+0x1a/0x1c
      [191627.707789]  [<ffffffffa0664499>] __btrfs_drop_extents+0x391/0xa50 [btrfs]
      [191627.709155]  [<ffffffff8115663c>] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0
      [191627.712444]  [<ffffffff81155007>] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18
      [191627.714162]  [<ffffffffa06570c9>] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs]
      [191627.715887]  [<ffffffffa065422b>] ? start_transaction+0x3bb/0x610 [btrfs]
      [191627.717287]  [<ffffffffa065b604>] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs]
      [191627.728865]  [<ffffffffa065b888>] finish_ordered_fn+0x15/0x17 [btrfs]
      [191627.730045]  [<ffffffffa067d688>] normal_work_helper+0x14c/0x32c [btrfs]
      [191627.731256]  [<ffffffffa067d96a>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
      [191627.732661]  [<ffffffff81061119>] process_one_work+0x24c/0x4ae
      [191627.733822]  [<ffffffff810615b0>] worker_thread+0x206/0x2c2
      [191627.734857]  [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
      [191627.736052]  [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
      [191627.737349]  [<ffffffff810669a6>] kthread+0xef/0xf7
      [191627.738267]  [<ffffffff810f3b3a>] ? time_hardirqs_on+0x15/0x28
      [191627.739330]  [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
      [191627.741976]  [<ffffffff81465592>] ret_from_fork+0x42/0x70
      [191627.743080]  [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
      [191627.744206] ---[ end trace bbfddacb7aaada8d ]---
    
      $ cat -n fs/btrfs/file.c
      691  int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
      (...)
      758                  btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
      759                  if (key.objectid > ino ||
      760                      key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
      761                          break;
      762
      763                  fi = btrfs_item_ptr(leaf, path->slots[0],
      764                                      struct btrfs_file_extent_item);
      765                  extent_type = btrfs_file_extent_type(leaf, fi);
      766
      767                  if (extent_type == BTRFS_FILE_EXTENT_REG ||
      768                      extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
      (...)
      774                  } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
      (...)
      778                  } else {
      779                          WARN_ON(1);
      780                          extent_end = search_start;
      781                  }
      (...)
    
    This happened because the item we were processing did not match a file
    extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this
    case we cast the item to a struct btrfs_file_extent_item pointer and
    then find a type field value that does not match any of the expected
    values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
    due to a tiny time window where a race can happen as exemplified below.
    For example, consider the following scenario where we're using the
    NO_HOLES feature and we have the following two neighbour leafs:
    
                   Leaf X (has N items)                    Leaf Y
    
    [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
              slot N - 2         slot N - 1              slot 0
    
    Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
    than explicit because NO_HOLES is enabled). Now if our inode has an
    ordered extent for the range [4K, 8K[ that is finishing, the following
    can happen:
    
              CPU 1                                       CPU 2
    
      btrfs_finish_ordered_io()
        insert_reserved_file_extent()
          __btrfs_drop_extents()
             Searches for the key
              (257 EXTENT_DATA 4096) through
              btrfs_lookup_file_extent()
    
             Key not found and we get a path where
             path->nodes[0] == leaf X and
             path->slots[0] == N
    
             Because path->slots[0] is >=
             btrfs_header_nritems(leaf X), we call
             btrfs_next_leaf()
    
             btrfs_next_leaf() releases the path
    
                                                      inserts key
                                                      (257 INODE_REF 4096)
                                                      at the end of leaf X,
                                                      leaf X now has N + 1 keys,
                                                      and the new key is at
                                                      slot N
    
             btrfs_next_leaf() searches for
             key (257 INODE_REF 256), with
             path->keep_locks set to 1,
             because it was the last key it
             saw in leaf X
    
               finds it in leaf X again and
               notices it's no longer the last
               key of the leaf, so it returns 0
               with path->nodes[0] == leaf X and
               path->slots[0] == N (which is now
               < btrfs_header_nritems(leaf X)),
               pointing to the new key
               (257 INODE_REF 4096)
    
             __btrfs_drop_extents() casts the
             item at path->nodes[0], slot
             path->slots[0], to a struct
             btrfs_file_extent_item - it does
             not skip keys for the target
             inode with a type less than
             BTRFS_EXTENT_DATA_KEY
             (BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY)
    
             sees a bogus value for the type
             field triggering the WARN_ON in
             the trace shown above, and sets
             extent_end = search_start (4096)
    
             does the if-then-else logic to
             fixup 0 length extent items created
             by a past bug from hole punching:
    
               if (extent_end == key.offset &&
                   extent_end >= search_start)
                   goto delete_extent_item;
    
             that evaluates to true and it ends
             up deleting the key pointed to by
             path->slots[0], (257 INODE_REF 4096),
             from leaf X
    
    The same could happen for example for a xattr that ends up having a key
    with an offset value that matches search_start (very unlikely but not
    impossible).
    
    So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
    skipped, never casted to struct btrfs_file_extent_item and never deleted
    by accident. Also protect against the unexpected case of getting a key
    for a lower inode number by skipping that key and issuing a warning.
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    [bwh: Backported to 3.2: drop use of ASSERT()]
    Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
    b4f5eab6
file.c 47.8 KB