• Willem de Bruijn's avatar
    net/packet: tpacket_rcv: avoid a producer race condition · 61fad681
    Willem de Bruijn authored
    PACKET_RX_RING can cause multiple writers to access the same slot if a
    fast writer wraps the ring while a slow writer is still copying. This
    is particularly likely with few, large, slots (e.g., GSO packets).
    
    Synchronize kernel thread ownership of rx ring slots with a bitmap.
    
    Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL
    while holding the sk receive queue lock. They release this lock before
    copying and set tp_status to TP_STATUS_USER to release to userspace
    when done. During copying, another writer may take the lock, also see
    TP_STATUS_KERNEL, and start writing to the same slot.
    
    Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a
    slot, test and set with the lock held. To release race-free, update
    tp_status and owner bit as a transaction, so take the lock again.
    
    This is the one of a variety of discussed options (see Link below):
    
    * instead of a shadow ring, embed the data in the slot itself, such as
    in tp_padding. But any test for this field may match a value left by
    userspace, causing deadlock.
    
    * avoid the lock on release. This leaves a small race if releasing the
    shadow slot before setting TP_STATUS_USER. The below reproducer showed
    that this race is not academic. If releasing the slot after tp_status,
    the race is more subtle. See the first link for details.
    
    * add a new tp_status TP_KERNEL_OWNED to avoid the transactional store
    of two fields. But, legacy applications may interpret all non-zero
    tp_status as owned by the user. As libpcap does. So this is possible
    only opt-in by newer processes. It can be added as an optional mode.
    
    * embed the struct at the tail of pg_vec to avoid extra allocation.
    The implementation proved no less complex than a separate field.
    
    The additional locking cost on release adds contention, no different
    than scaling on multicore or multiqueue h/w. In practice, below
    reproducer nor small packet tcpdump showed a noticeable change in
    perf report in cycles spent in spinlock. Where contention is
    problematic, packet sockets support mitigation through PACKET_FANOUT.
    And we can consider adding opt-in state TP_KERNEL_OWNED.
    
    Easy to reproduce by running multiple netperf or similar TCP_STREAM
    flows concurrently with `tcpdump -B 129 -n greater 60000`.
    
    Based on an earlier patchset by Jon Rosen. See links below.
    
    I believe this issue goes back to the introduction of tpacket_rcv,
    which predates git history.
    
    Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.htmlSuggested-by: default avatarJon Rosen <jrosen@cisco.com>
    Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
    Signed-off-by: default avatarJon Rosen <jrosen@cisco.com>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    61fad681
internal.h 3.3 KB