    packet: rollover lock contention avoidance
    Rollover has to call packet_rcv_has_room on sockets in the fanout
    group to find a socket to migrate to. This operation is expensive,
    especially if the packet sockets use rings, in which case a lock
    has to be acquired.
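
    For reference, the probe loop has roughly the following shape (a
    simplified sketch of fanout_demux_rollover, not the actual kernel
    code):

      #include <stdbool.h>

      /*
       * Starting at the socket the packet originally mapped to, probe
       * each socket in the fanout group until one reports room in its
       * receive ring; otherwise stay with the original socket.
       */
      unsigned int rollover_pick(bool (*has_room)(unsigned int idx),
                                 unsigned int start, unsigned int num)
      {
          for (unsigned int i = 0; i < num; i++) {
              unsigned int idx = (start + i) % num;
              if (has_room(idx))     /* expensive: may take a ring lock */
                  return idx;        /* migrate the packet here */
          }
          return start;              /* no socket in the group has room */
      }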
    
    Avoid all sockets pounding on the lock by temporarily marking a
    socket as "under memory pressure" when such pressure is detected.
    While the flag is set, only the socket owner may call
    packet_rcv_has_room on the socket; once the owner detects normal
    conditions again, it clears the flag. In the meantime, no other
    socket in the group selects it as a rollover victim.
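
    A minimal userspace model of the flag (names and types here are
    illustrative, not the kernel's):

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdbool.h>

      struct ring_sock {
          pthread_mutex_t lock;     /* models the rx_ring lock */
          unsigned int used, size;  /* frames in use / ring capacity */
          atomic_bool pressure;     /* set when the ring looked full */
      };

      /* Expensive check: needs the ring lock, as a tpacket ring does. */
      static bool ring_has_room(struct ring_sock *rs)
      {
          pthread_mutex_lock(&rs->lock);
          bool room = rs->used < rs->size;
          pthread_mutex_unlock(&rs->lock);
          return room;
      }

      /* Rollover probe: peers skip a socket that has flagged itself, so
       * only the owner pays for the locked check and refreshes the flag. */
      static bool rollover_has_room(struct ring_sock *rs, bool is_owner)
      {
          if (!is_owner && atomic_load(&rs->pressure))
              return false;                    /* skip: avoid its ring lock */
          bool room = ring_has_room(rs);
          atomic_store(&rs->pressure, !room);  /* record pressure state */
          return room;
      }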
    
    Under reasonably balanced load, each socket writer frequently calls
    packet_rcv_has_room and clears its own pressure field. As a backup
    for when the socket is rarely written to, also clear the flag on
    reading (packet_recvmsg, packet_poll) if this can be done cheaply
    (i.e., without calling packet_rcv_has_room). This is only for
    edge cases.
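
    Continuing the model above, the read-side clear can be as simple as
    dropping the flag once a frame has been consumed, without touching
    the ring lock (illustrative only, not the exact kernel logic):

      /* Read side: after the owner consumes a frame the ring cannot be
       * full, so it may clear its own pressure flag lock-free. */
      static void reader_frame_consumed(struct ring_sock *rs)
      {
          if (atomic_load(&rs->pressure))
              atomic_store(&rs->pressure, false);
      }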
    
    Tested:
      Ran bench_rollover: a process with 8 sockets in a single fanout
      group, each pinned to a single CPU that receives one NIC receive
      interrupt. RPS and RFS are disabled. The benchmark uses a packet
      rx_ring, which has to take a lock when determining whether a
      socket has room.
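
      The benchmark source is not included in this patch; a socket in
      such a setup is typically created along these lines (ring sizes
      here are arbitrary):

        #include <arpa/inet.h>
        #include <linux/if_ether.h>
        #include <linux/if_packet.h>
        #include <sys/socket.h>

        /* Open one packet socket with a PACKET_RX_RING and join a fanout
         * group that hashes flows and rolls over to a peer when a ring
         * fills up. */
        static int open_fanout_ring_socket(int group_id)
        {
            int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
            if (fd < 0)
                return -1;

            struct tpacket_req req = {
                .tp_block_size = 1 << 16,
                .tp_block_nr   = 64,
                .tp_frame_size = 1 << 11,
                .tp_frame_nr   = 64 * ((1 << 16) / (1 << 11)),
            };
            if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)))
                return -1;

            /* low 16 bits: group id; high 16 bits: mode and flags */
            int fanout = group_id |
                         ((PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_ROLLOVER) << 16);
            if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &fanout, sizeof(fanout)))
                return -1;

            return fd;
        }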
    
      Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
      uniformly across the packet sockets (and inserted an iptables
      rule that drops the packets in PREROUTING to avoid protocol
      stack processing).
    
      Without this patch, all sockets try to migrate traffic to
      neighbors, causing lock contention when searching for a neighbor
      with room. Contention for this lock accounts for the top nine
      entries in the profile below.
    
        perf record -a -g sleep 5
    
        -  17.82%   bench_rollover  [kernel.kallsyms]    [k] _raw_spin_lock
           - _raw_spin_lock
              - 99.00% spin_lock
                 + 81.77% packet_rcv_has_room.isra.41
                 + 18.23% tpacket_rcv
              + 0.84% packet_rcv_has_room.isra.41
        +   5.20%      ksoftirqd/6  [kernel.kallsyms]    [k] _raw_spin_lock
        +   5.15%      ksoftirqd/1  [kernel.kallsyms]    [k] _raw_spin_lock
        +   5.14%      ksoftirqd/2  [kernel.kallsyms]    [k] _raw_spin_lock
        +   5.12%      ksoftirqd/7  [kernel.kallsyms]    [k] _raw_spin_lock
        +   5.12%      ksoftirqd/5  [kernel.kallsyms]    [k] _raw_spin_lock
        +   5.10%      ksoftirqd/4  [kernel.kallsyms]    [k] _raw_spin_lock
        +   4.66%      ksoftirqd/0  [kernel.kallsyms]    [k] _raw_spin_lock
        +   4.45%      ksoftirqd/3  [kernel.kallsyms]    [k] _raw_spin_lock
        +   1.55%   bench_rollover  [kernel.kallsyms]    [k] packet_rcv_has_room.isra.41
    
      On net-next with this patch, this lock contention is no longer a
      top entry. Most time is spent in the actual read function; the top
      of the profile now looks as follows, with the remaining lock costs
      further down:
    
        +  15.52%  bench_rollover  bench_rollover     [.] reader
        +   4.68%         swapper  [kernel.kallsyms]  [k] memcpy_erms
        +   2.77%         swapper  [kernel.kallsyms]  [k] packet_lookup_frame.isra.51
        +   2.56%     ksoftirqd/1  [kernel.kallsyms]  [k] memcpy_erms
        +   2.16%         swapper  [kernel.kallsyms]  [k] tpacket_rcv
        +   1.93%         swapper  [kernel.kallsyms]  [k] mlx4_en_process_rx_cq
    
      Looking closer at the remaining _raw_spin_lock, the cost of probing
      in rollover is now comparable to the cost of taking the lock later
      in tpacket_rcv.
    
        -   1.51%         swapper  [kernel.kallsyms]  [k] _raw_spin_lock
           - _raw_spin_lock
              + 33.41% packet_rcv_has_room
              + 28.15% tpacket_rcv
              + 19.54% enqueue_to_backlog
              + 6.45% __free_pages_ok
              + 2.78% packet_rcv_fanout
              + 2.13% fanout_demux_rollover
              + 2.01% netif_receive_skb_internal
    Signed-off-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>