1. 29 Mar, 2023 4 commits
    • Jakub Kicinski's avatar
      Merge branch 'net-refcount-address-dst_entry-reference-count-scalability-issues' · 2600badf
      Jakub Kicinski authored
      Thomas Gleixner says:
      
      ====================
      net, refcount: Address dst_entry reference count scalability issues
      
      This is version 3 of this series. Version 2 can be found here:
      
           https://lore.kernel.org/lkml/20230307125358.772287565@linutronix.de
      
      Wangyang and Arjan reported a bottleneck in the networking code related to
      struct dst_entry::__refcnt. Performance tanks massively when concurrency on
      a dst_entry increases.
      
      This happens when there are a large amount of connections to or from the
      same IP address. The memtier benchmark when run on the same host as
      memcached amplifies this massively. But even over real network connections
      this issue can be observed at an obviously smaller scale (due to the
      network bandwith limitations in my setup, i.e. 1Gb). How to reproduce:
      
        Run memcached with -t $N and memtier_benchmark with -t $M and --ratio=1:100
        on the same machine. localhost connections amplify the problem.
      
        Start with the defaults for $N and $M and increase them. Depending on
        your machine this will tank at some point. But even in reasonably small
        $N, $M scenarios the refcount operations and the resulting false sharing
        fallout becomes visible in perf top. At some point it becomes the
        dominating issue.
      
      There are two factors which make this reference count a scalability issue:
      
         1) False sharing
      
            dst_entry:__refcnt is located at offset 64 of dst_entry, which puts
            it into a seperate cacheline vs. the read mostly members located at
            the beginning of the struct.
      
            That prevents false sharing vs. the struct members in the first 64
            bytes of the structure, but there is also
      
            	    dst_entry::lwtstate
      
            which is located after the reference count and in the same cache
            line. This member is read after a reference count has been acquired.
      
            The other problem is struct rtable, which embeds a struct dst_entry
            at offset 0. struct dst_entry has a size of 112 bytes, which means
            that the struct members of rtable which follow the dst member share
            the same cache line as dst_entry::__refcnt. Especially
      
            	  rtable::rt_genid
      
            is also read by the contexts which have a reference count acquired
            already.
      
            When dst_entry:__refcnt is incremented or decremented via an atomic
            operation these read accesses stall and contribute to the performance
            problem.
      
         2) atomic_inc_not_zero()
      
            A reference on dst_entry:__refcnt is acquired via
            atomic_inc_not_zero() and released via atomic_dec_return().
      
            atomic_inc_not_zero() is implemted via a atomic_try_cmpxchg() loop,
            which exposes O(N^2) behaviour under contention with N concurrent
            operations. Contention scalability is degrading with even a small
            amount of contenders and gets worse from there.
      
            Lightweight instrumentation exposed an average of 8!! retry loops per
            atomic_inc_not_zero() invocation in a inc()/dec() loop running
            concurrently on 112 CPUs.
      
            There is nothing which can be done to make atomic_inc_not_zero() more
            scalable.
      
      The following series addresses these issues:
      
          1) Reorder and pad struct dst_entry to prevent the false sharing.
      
          2) Implement and use a reference count implementation which avoids the
             atomic_inc_not_zero() problem.
      
             It is slightly less performant in the case of the final 0 -> -1
             transition, but the deconstruction of these objects is a low
             frequency event. get()/put() pairs are in the hotpath and that's
             what this implementation optimizes for.
      
             The algorithm of this reference count is only suitable for RCU
             managed objects. Therefore it cannot replace the refcount_t
             algorithm, which is also based on atomic_inc_not_zero(), due to a
             subtle race condition related to the 0 -> -1 transition and the final
             verdict to mark the reference count dead. See details in patch 2/3.
      
             It might be just my lack of imagination which declares this to be
             impossible and I'd be happy to be proven wrong.
      
             As a bonus the new rcuref implementation provides underflow/overflow
             detection and mitigation while being performance wise on par with
             open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in
             the non-contended case.
      
      The combination of these two changes results in performance gains in micro
      benchmarks and also localhost and networked memtier benchmarks talking to
      memcached. It's hard to quantify the benchmark results as they depend
      heavily on the micro-architecture and the number of concurrent operations.
      
      The overall gain of both changes for localhost memtier ranges from 1.2X to
      3.2X and from +2% to %5% range for networked operations on a 1Gb connection.
      
      A micro benchmark which enforces maximized concurrency shows a gain between
      1.2X and 4.7X!!!
      
      Obviously this is focussed on a particular problem and therefore needs to
      be discussed in detail. It also requires wider testing outside of the cases
      which this is focussed on.
      
      Though the false sharing issue is obvious and should be addressed
      independent of the more focussed reference count changes.
      
      The series is also available from git:
      
        git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rcuref
      
      Changes vs. V2:
      
        - Rename __refcnt to __rcuref (Linus)
      
        - Fix comments and changelogs (Mark, Qiuxu)
      
        - Fixup kernel doc of generated atomic_add_negative() variants
      
      I want to say thanks to Wangyang who analyzed the issue and provided the
      initial fix for the false sharing problem. Further thanks go to Arjan
      Peter, Marc, Will and Borislav for valuable input and providing test
      results on machines which I do not have access to, and to Linus and
      Eric, Qiuxu and Mark for helpful feedback.
      ====================
      
      Link: https://lore.kernel.org/r/20230323102649.764958589@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      2600badf
    • Thomas Gleixner's avatar
      net: dst: Switch to rcuref_t reference counting · bc9d3a9f
      Thomas Gleixner authored
      Under high contention dst_entry::__refcnt becomes a significant bottleneck.
      
      atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into
      high retry rates on contention.
      
      Switch the reference count to rcuref_t which results in a significant
      performance gain. Rename the reference count member to __rcuref to reflect
      the change.
      
      The gain depends on the micro-architecture and the number of concurrent
      operations and has been measured in the range of +25% to +130% with a
      localhost memtier/memcached benchmark which amplifies the problem
      massively.
      
      Running the memtier/memcached benchmark over a real (1Gb) network
      connection the conversion on top of the false sharing fix for struct
      dst_entry::__refcnt results in a total gain in the 2%-5% range over the
      upstream baseline.
      Reported-by: default avatarWangyang Guo <wangyang.guo@intel.com>
      Reported-by: default avatarArjan Van De Ven <arjan.van.de.ven@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20230307125538.989175656@linutronix.de
      Link: https://lore.kernel.org/r/20230323102800.215027837@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      bc9d3a9f
    • Wangyang Guo's avatar
      net: dst: Prevent false sharing vs. dst_entry:: __refcnt · d288a162
      Wangyang Guo authored
      dst_entry::__refcnt is highly contended in scenarios where many connections
      happen from and to the same IP. The reference count is an atomic_t, so the
      reference count operations have to take the cache-line exclusive.
      
      Aside of the unavoidable reference count contention there is another
      significant problem which is caused by that: False sharing.
      
      perf top identified two affected read accesses. dst_entry::lwtstate and
      rtable::rt_genid.
      
      dst_entry:__refcnt is located at offset 64 of dst_entry, which puts it into
      a seperate cacheline vs. the read mostly members located at the beginning
      of the struct.
      
      That prevents false sharing vs. the struct members in the first 64
      bytes of the structure, but there is also
      
        dst_entry::lwtstate
      
      which is located after the reference count and in the same cache line. This
      member is read after a reference count has been acquired.
      
      struct rtable embeds a struct dst_entry at offset 0. struct dst_entry has a
      size of 112 bytes, which means that the struct members of rtable which
      follow the dst member share the same cache line as dst_entry::__refcnt.
      Especially
      
        rtable::rt_genid
      
      is also read by the contexts which have a reference count acquired
      already.
      
      When dst_entry:__refcnt is incremented or decremented via an atomic
      operation these read accesses stall. This was found when analysing the
      memtier benchmark in 1:100 mode, which amplifies the problem extremly.
      
      Move the rt[6i]_uncached[_list] members out of struct rtable and struct
      rt6_info into struct dst_entry to provide padding and move the lwtstate
      member after that so it ends up in the same cache line.
      
      The resulting improvement depends on the micro-architecture and the number
      of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached
      benchmark.
      
      [ tglx: Rearrange struct ]
      Signed-off-by: default avatarWangyang Guo <wangyang.guo@intel.com>
      Signed-off-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20230323102800.042297517@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      d288a162
    • Jakub Kicinski's avatar
  2. 28 Mar, 2023 19 commits
  3. 27 Mar, 2023 17 commits