• Ross Zwisler's avatar
    radix tree: fix multi-order iteration race · 9f418224
    Ross Zwisler authored
    Fix a race in the multi-order iteration code which causes the kernel to
    hit a GP fault.  This was first seen with a production v4.15 based
    kernel (4.15.6-300.fc27.x86_64) utilizing a DAX workload which used
    order 9 PMD DAX entries.
    
    The race has to do with how we tear down multi-order sibling entries
    when we are removing an item from the tree.  Remember for example that
    an order 2 entry looks like this:
    
      struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
    
    where 'entry' is in some slot in the struct radix_tree_node, and the
    three slots following 'entry' contain sibling pointers which point back
    to 'entry.'
    
    When we delete 'entry' from the tree, we call :
    
      radix_tree_delete()
        radix_tree_delete_item()
          __radix_tree_delete()
            replace_slot()
    
    replace_slot() first removes the siblings in order from the first to the
    last, then at then replaces 'entry' with NULL.  This means that for a
    brief period of time we end up with one or more of the siblings removed,
    so:
    
      struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
    
    This causes an issue if you have a reader iterating over the slots in
    the tree via radix_tree_for_each_slot() while only under
    rcu_read_lock()/rcu_read_unlock() protection.  This is a common case in
    mm/filemap.c.
    
    The issue is that when __radix_tree_next_slot() => skip_siblings() tries
    to skip over the sibling entries in the slots, it currently does so with
    an exact match on the slot directly preceding our current slot.
    Normally this works:
    
                                          V preceding slot
      struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
                                                  ^ current slot
    
    This lets you find the first sibling, and you skip them all in order.
    
    But in the case where one of the siblings is NULL, that slot is skipped
    and then our sibling detection is interrupted:
    
                                                 V preceding slot
      struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
                                                        ^ current slot
    
    This means that the sibling pointers aren't recognized since they point
    all the way back to 'entry', so we think that they are normal internal
    radix tree pointers.  This causes us to think we need to walk down to a
    struct radix_tree_node starting at the address of 'entry'.
    
    In a real running kernel this will crash the thread with a GP fault when
    you try and dereference the slots in your broken node starting at
    'entry'.
    
    We fix this race by fixing the way that skip_siblings() detects sibling
    nodes.  Instead of testing against the preceding slot we instead look
    for siblings via is_sibling_entry() which compares against the position
    of the struct radix_tree_node.slots[] array.  This ensures that sibling
    entries are properly identified, even if they are no longer contiguous
    with the 'entry' they point to.
    
    Link: http://lkml.kernel.org/r/20180503192430.7582-6-ross.zwisler@linux.intel.com
    Fixes: 148deab2 ("radix-tree: improve multiorder iterators")
    Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
    Reported-by: default avatarCR, Sapthagirish <sapthagirish.cr@intel.com>
    Reviewed-by: default avatarJan Kara <jack@suse.cz>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    9f418224
radix-tree.c 62.2 KB