• Nikolay Aleksandrov's avatar
    net: fix for a race condition in the inet frag code · 24b9bf43
    Nikolay Aleksandrov authored
    I stumbled upon this very serious bug while hunting for another one,
    it's a very subtle race condition between inet_frag_evictor,
    inet_frag_intern and the IPv4/6 frag_queue and expire functions
    (basically the users of inet_frag_kill/inet_frag_put).
    
    What happens is that after a fragment has been added to the hash chain
    but before it's been added to the lru_list (inet_frag_lru_add) in
    inet_frag_intern, it may get deleted (either by an expired timer if
    the system load is high or the timer sufficiently low, or by the
    fraq_queue function for different reasons) before it's added to the
    lru_list, then after it gets added it's a matter of time for the
    evictor to get to a piece of memory which has been freed leading to a
    number of different bugs depending on what's left there.
    
    I've been able to trigger this on both IPv4 and IPv6 (which is normal
    as the frag code is the same), but it's been much more difficult to
    trigger on IPv4 due to the protocol differences about how fragments
    are treated.
    
    The setup I used to reproduce this is: 2 machines with 4 x 10G bonded
    in a RR bond, so the same flow can be seen on multiple cards at the
    same time. Then I used multiple instances of ping/ping6 to generate
    fragmented packets and flood the machines with them while running
    other processes to load the attacked machine.
    
    *It is very important to have the _same flow_ coming in on multiple CPUs
    concurrently. Usually the attacked machine would die in less than 30
    minutes, if configured properly to have many evictor calls and timeouts
    it could happen in 10 minutes or so.
    
    An important point to make is that any caller (frag_queue or timer) of
    inet_frag_kill will remove both the timer refcount and the
    original/guarding refcount thus removing everything that's keeping the
    frag from being freed at the next inet_frag_put.  All of this could
    happen before the frag was ever added to the LRU list, then it gets
    added and the evictor uses a freed fragment.
    
    An example for IPv6 would be if a fragment is being added and is at
    the stage of being inserted in the hash after the hash lock is
    released, but before inet_frag_lru_add executes (or is able to obtain
    the lru lock) another overlapping fragment for the same flow arrives
    at a different CPU which finds it in the hash, but since it's
    overlapping it drops it invoking inet_frag_kill and thus removing all
    guarding refcounts, and afterwards freeing it by invoking
    inet_frag_put which removes the last refcount added previously by
    inet_frag_find, then inet_frag_lru_add gets executed by
    inet_frag_intern and we have a freed fragment in the lru_list.
    
    The fix is simple, just move the lru_add under the hash chain locked
    region so when a removing function is called it'll have to wait for
    the fragment to be added to the lru_list, and then it'll remove it (it
    works because the hash chain removal is done before the lru_list one
    and there's no window between the two list adds when the frag can get
    dropped). With this fix applied I couldn't kill the same machine in 24
    hours with the same setup.
    
    Fixes: 3ef0eb0d ("net: frag, move LRU list maintenance outside of
    rwlock")
    
    CC: Florian Westphal <fw@strlen.de>
    CC: Jesper Dangaard Brouer <brouer@redhat.com>
    CC: David S. Miller <davem@davemloft.net>
    Signed-off-by: default avatarNikolay Aleksandrov <nikolay@redhat.com>
    Acked-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    24b9bf43
inet_fragment.c 8.61 KB