• Jarno Rajahalme's avatar
    openvswitch: Per NUMA node flow stats. · 63e7959c
    Jarno Rajahalme authored
    Keep kernel flow stats for each NUMA node rather than each (logical)
    CPU.  This avoids using the per-CPU allocator and removes most of the
    kernel-side OVS locking overhead otherwise on the top of perf reports
    and allows OVS to scale better with higher number of threads.
    
    With 9 handlers and 4 revalidators netperf TCP_CRR test flow setup
    rate doubles on a server with two hyper-threaded physical CPUs (16
    logical cores each) compared to the current OVS master.  Tested with
    non-trivial flow table with a TCP port match rule forcing all new
    connections with unique port numbers to OVS userspace.  The IP
    addresses are still wildcarded, so the kernel flows are not considered
    as exact match 5-tuple flows.  This type of flows can be expected to
    appear in large numbers as the result of more effective wildcarding
    made possible by improvements in OVS userspace flow classifier.
    
    Perf results for this test (master):
    
    Events: 305K cycles
    +   8.43%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
    +   5.64%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
    +   4.75%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
    +   3.32%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
    +   2.61%     ovs-vswitchd  [kernel.kallsyms]   [k] pcpu_alloc_area
    +   2.19%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
    +   2.03%          swapper  [kernel.kallsyms]   [k] intel_idle
    +   1.84%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
    +   1.64%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
    +   1.58%     ovs-vswitchd  libc-2.15.so        [.] 0x7f4e6
    +   1.07%     ovs-vswitchd  [kernel.kallsyms]   [k] memset
    +   1.03%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
    +   0.92%          swapper  [kernel.kallsyms]   [k] __ticket_spin_lock
    ...
    
    And after this patch:
    
    Events: 356K cycles
    +   6.85%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
    +   4.63%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
    +   3.06%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
    +   2.81%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
    +   2.51%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
    +   2.27%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
    +   1.84%     ovs-vswitchd  libc-2.15.so        [.] 0x15d30f
    +   1.74%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
    +   1.47%          swapper  [kernel.kallsyms]   [k] intel_idle
    +   1.34%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask
    +   1.33%     ovs-vswitchd  ovs-vswitchd        [.] rule_actions_unref
    +   1.16%     ovs-vswitchd  ovs-vswitchd        [.] hindex_node_with_hash
    +   1.16%     ovs-vswitchd  ovs-vswitchd        [.] do_xlate_actions
    +   1.09%     ovs-vswitchd  ovs-vswitchd        [.] ofproto_rule_ref
    +   1.01%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
    ...
    
    There is a small increase in kernel spinlock overhead due to the same
    spinlock being shared between multiple cores of the same physical CPU,
    but that is barely visible in the netperf TCP_CRR test performance
    (maybe ~1% performance drop, hard to tell exactly due to variance in
    the test results), when testing for kernel module throughput (with no
    userspace activity, handful of kernel flows).
    
    On flow setup, a single stats instance is allocated (for the NUMA node
    0).  As CPUs from multiple NUMA nodes start updating stats, new
    NUMA-node specific stats instances are allocated.  This allocation on
    the packet processing code path is made to never block or look for
    emergency memory pools, minimizing the allocation latency.  If the
    allocation fails, the existing preallocated stats instance is used.
    Also, if only CPUs from one NUMA-node are updating the preallocated
    stats instance, no additional stats instances are allocated.  This
    eliminates the need to pre-allocate stats instances that will not be
    used, also relieving the stats reader from the burden of reading stats
    that are never used.
    Signed-off-by: default avatarJarno Rajahalme <jrajahalme@nicira.com>
    Acked-by: default avatarPravin B Shelar <pshelar@nicira.com>
    Signed-off-by: default avatarJesse Gross <jesse@nicira.com>
    63e7959c
flow_table.c 14.1 KB