1. 21 Dec, 2019 13 commits
    • Alexei Starovoitov's avatar
      Merge branch 'xsk-cleanup' · ce3cec27
      Alexei Starovoitov authored
      Magnus Karlsson says:
      
      ====================
      This patch set cleans up the ring access functions of AF_XDP in hope
      that it will now be easier to understand and maintain. I used to get a
      headache every time I looked at this code in order to really understand it,
      but now I do think it is a lot less painful.
      
      The code has been simplified a lot and as a bonus we get better
      performance in nearly all cases. On my new 2.1 GHz Cascade Lake
      machine with a standard default config plus AF_XDP support and
      CONFIG_PREEMPT on I get the following results in percent performance
      increases with this patch set compared to without it:
      
      Zero-copy (-N):
                rxdrop        txpush        l2fwd
      1 core:    -2%            0%            3%
      2 cores:    4%            0%            3%
      
      Zero-copy with poll() (-N -p):
                rxdrop        txpush        l2fwd
      1 core:     3%            0%            1%
      2 cores:   21%            0%            9%
      
      Skb mode (-S):
      Shows a 0% to 5% performance improvement over the same benchmarks as
      above.
      
      Here 1 core means that we are running the driver processing and the
      application on the same core, while 2 cores means that they execute on
      separate cores. The applications are from the xdpsock sample app.
      
      On my older 2.0 Ghz Broadwell machine that I used for the v1, I get
      the following results:
      
      Zero-copy (-N):
                rxdrop        txpush        l2fwd
      1 core:     4%            5%            4%
      2 cores:    1%            0%            2%
      
      Zero-copy with poll() (-N -p):
                rxdrop        txpush        l2fwd
      1 core:     1%            3%            3%
      2 cores:   22%            0%            5%
      
      Skb mode (-S):
      Shows a 0% to 1% performance improvement over the same benchmarks as
      above.
      
      When a results says 21 or 22% better, as in the case of poll mode with
      2 cores and rxdrop, my first reaction is that it must be a
      bug. Everything else shows between 0% and 5% performance
      improvement. What is giving rise to 22%? A quick bisect indicates that
      it is patches 2, 3, 4, 5, and 6 that are giving rise to most of this
      improvement. So not one patch in particular, but something around 4%
      improvement from each one of them. Note that exactly this benchmark
      has previously had an extraordinary slow down compared to when running
      without poll syscalls. For all the other poll tests above, the
      slowdown has always been around 4% for using poll syscalls. But with
      the bad performing test in question, it was above 25%. Interestingly,
      after this clean up, the slow down is 4%, just like all the other poll
      tests. Please take an extra peek at this so I have not messed up
      something.
      
      The 0% for several txpush results are due to the test bottlenecking on
      a non-CPU HW resource. If I eliminated that bottleneck on my system, I
      would expect to see an increase there too.
      
      Changes v1 -> v2:
      * Corrected textual errors in the commit logs (Sergei and Martin)
      * Fixed the functions that detect empty and full rings so that they
        now operate on the global ring state (Maxim)
      
      This patch has been applied against commit a352a824 ("Merge branch 'libbpf-extern-followups'")
      
      Structure of the patch set:
      
      Patch 1: Eliminate the lazy update threshold used when preallocating
               entries in the completion ring
      Patch 2: Simplify the detection of empty and full rings
      Patch 3: Consolidate the two local producer pointers into one
      Patch 4: Standardize the naming of the producer ring access functions
      Patch 5: Eliminate the Rx batch size used for the fill ring
      Patch 6: Simplify the functions xskq_nb_avail and xskq_nb_free
      Patch 7: Simplify and standardize the naming of the consumer ring
               access functions
      Patch 8: Change the names of the validation functions to improve
               readability and also the return value of these functions
      Patch 9: Change the name of xsk_umem_discard_addr() to
               xsk_umem_release_addr() to better reflect the new
               names. Requires a name change in the drivers that support AF_XDP
               zero-copy.
      Patch 10: Remove unnecessary READ_ONCE of data in the ring
      Patch 11: Add overall function naming comment and reorder the functions
                for easier reference
      Patch 12: Use the struct_size helper function when allocating rings
      ====================
      Reviewed-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Tested-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Acked-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ce3cec27
    • Magnus Karlsson's avatar
      xsk: Use struct_size() helper · 1d9cb1f3
      Magnus Karlsson authored
      Improve readability and maintainability by using the struct_size()
      helper when allocating the AF_XDP rings.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1576759171-28550-13-git-send-email-magnus.karlsson@intel.com
      1d9cb1f3
    • Magnus Karlsson's avatar
      xsk: Add function naming comments and reorder functions · 15d8c916
      Magnus Karlsson authored
      Add comments on how the ring access functions are named and how they
      are supposed to be used for producers and consumers. The functions are
      also reordered so that the consumer functions are in the beginning and
      the producer functions in the end, for easier reference. Put this in a
      separate patch as the diff might look a little odd, but no
      functionality has changed in this patch.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1576759171-28550-12-git-send-email-magnus.karlsson@intel.com
      15d8c916
    • Magnus Karlsson's avatar
      xsk: Remove unnecessary READ_ONCE of data · c34787fc
      Magnus Karlsson authored
      There are two unnecessary READ_ONCE of descriptor data. These are not
      needed since the data is written by the producer before it signals
      that the data is available by incrementing the producer pointer. As the
      access to this producer pointer is serialized and the consumer always
      reads the descriptor after it has read and synchronized with the
      producer counter, the write of the descriptor will have fully
      completed and it does not matter if the consumer has any read tearing.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1576759171-28550-11-git-send-email-magnus.karlsson@intel.com
      c34787fc
    • Magnus Karlsson's avatar
      xsk: ixgbe: i40e: ice: mlx5: Xsk_umem_discard_addr to xsk_umem_release_addr · f8509aa0
      Magnus Karlsson authored
      Change the name of xsk_umem_discard_addr to xsk_umem_release_addr to
      better reflect the new naming of the AF_XDP queue manipulation
      functions. As this functions is used by drivers implementing support
      for AF_XDP zero-copy, it requires a name change to these drivers. The
      function xsk_umem_release_addr_rq has also changed name in the same
      fashion.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1576759171-28550-10-git-send-email-magnus.karlsson@intel.com
      f8509aa0
    • Magnus Karlsson's avatar
      xsk: Change names of validation functions · 03896ef1
      Magnus Karlsson authored
      Change the names of the validation functions to better reflect what
      they are doing. The uppermost ones are reading entries from the rings
      and only the bottom ones validate entries. So xskq_cons_read_ is a
      better prefix name.
      
      Also change the xskq_cons_read_ functions to return a bool
      as the the descriptor or address is already returned by reference
      in the parameters. Everyone is using the return value as a bool
      anyway.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1576759171-28550-9-git-send-email-magnus.karlsson@intel.com
      03896ef1
    • Magnus Karlsson's avatar
      xsk: Simplify the consumer ring access functions · c5ed924b
      Magnus Karlsson authored
      Simplify and refactor consumer ring functions. The consumer first
      "peeks" to find descriptors or addresses that are available to
      read from the ring, then reads them and finally "releases" these
      descriptors once it is done. The two local variables cons_tail
      and cons_head are turned into one single variable called
      cached_cons. cached_tail referred to the cached value of the
      global consumer pointer and will be stored in cached_cons. For
      cached_head, we just use cached_prod instead as it was not used
      for a consumer queue before. It also better reflects what it
      really is now: a cached copy of the producer pointer.
      
      The names of the functions are also renamed in the same manner as
      the producer functions. The new functions are called xskq_cons_
      followed by what it does.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1576759171-28550-8-git-send-email-magnus.karlsson@intel.com
      c5ed924b
    • Magnus Karlsson's avatar
      xsk: Simplify xskq_nb_avail and xskq_nb_free · df0ae6f7
      Magnus Karlsson authored
      At this point, there are no users of the functions xskq_nb_avail and
      xskq_nb_free that take any other number of entries argument than 1, so
      let us get rid of the second argument that takes the number of
      entries.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1576759171-28550-7-git-send-email-magnus.karlsson@intel.com
      df0ae6f7
    • Magnus Karlsson's avatar
      xsk: Eliminate the RX batch size · 4b638f13
      Magnus Karlsson authored
      In the xsk consumer ring code there is a variable called RX_BATCH_SIZE
      that dictates the minimum number of entries that we try to grab from
      the fill and Tx rings. In fact, the code always try to grab the
      maximum amount of entries from these rings. The only thing this
      variable does is to throw an error if there is less than 16 (as it is
      defined) entries on the ring. There is no reason to do this and it
      will just lead to weird behavior from user space's point of view. So
      eliminate this variable.
      
      With this change, we will be able to simplify the xskq_nb_free and
      xskq_nb_avail code in the next commit.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1576759171-28550-6-git-send-email-magnus.karlsson@intel.com
      4b638f13
    • Magnus Karlsson's avatar
      xsk: Standardize naming of producer ring access functions · 59e35e55
      Magnus Karlsson authored
      Adopt the naming of the producer ring access functions to have a
      similar naming convention as the functions in libbpf, but adapted to
      the kernel. You first reserve a number of entries that you later
      submit to the global state of the ring. This is much clearer, IMO,
      than the one that was in the kernel part. Once renamed, we also
      discover that two functions are actually the same, so remove one of
      them. Some of the primitive ring submission operations are also the
      same so break these out into __xskq_prod_submit that the upper level
      ring access functions can use.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1576759171-28550-5-git-send-email-magnus.karlsson@intel.com
      59e35e55
    • Magnus Karlsson's avatar
      xsk: Consolidate to one single cached producer pointer · d7012f05
      Magnus Karlsson authored
      Currently, the xsk ring code has two cached producer pointers:
      prod_head and prod_tail. This patch consolidates these two into a
      single one called cached_prod to make the code simpler and easier to
      maintain. This will be in line with the user space part of the the
      code found in libbpf, that only uses a single cached pointer.
      
      The Rx path only uses the two top level functions
      xskq_produce_batch_desc and xskq_produce_flush_desc and they both use
      prod_head and never prod_tail. So just move them over to
      cached_prod.
      
      The Tx XDP_DRV path uses xskq_produce_addr_lazy and
      xskq_produce_flush_addr_n and unnecessarily operates on both prod_tail
      and prod_head, so move them over to just use cached_prod by skipping
      the intermediate step of updating prod_tail.
      
      The Tx path in XDP_SKB mode uses xskq_reserve_addr and
      xskq_produce_addr. They currently use both cached pointers, but we can
      operate on the global producer pointer in xskq_produce_addr since it
      has to be updated anyway, thus eliminating the use of both cached
      pointers. We can also remove the xskq_nb_free in xskq_produce_addr
      since it is already called in xskq_reserve_addr. No need to do it
      twice.
      
      When there is only one cached producer pointer, we can also simplify
      xskq_nb_free by removing one argument.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1576759171-28550-4-git-send-email-magnus.karlsson@intel.com
      d7012f05
    • Magnus Karlsson's avatar
      xsk: Simplify detection of empty and full rings · 11cc2d21
      Magnus Karlsson authored
      In order to set the correct return flags for poll, the xsk code has to
      check if the Rx queue is empty and if the Tx queue is full. This code
      was unnecessarily large and complex as it used the functions that are
      used to update the local state from the global state (xskq_nb_free and
      xskq_nb_avail). Since we are not doing this nor updating any data
      dependent on this state, we can simplify the functions. Another
      benefit from this is that we can also simplify the xskq_nb_free and
      xskq_nb_avail functions in a later commit.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1576759171-28550-3-git-send-email-magnus.karlsson@intel.com
      11cc2d21
    • Magnus Karlsson's avatar
      xsk: Eliminate the lazy update threshold · 484b1653
      Magnus Karlsson authored
      The lazy update threshold was introduced to keep the producer and
      consumer some distance apart in the completion ring. This was
      important in the beginning of the development of AF_XDP as the ring
      format as that point in time was very sensitive to the producer and
      consumer being on the same cache line. This is not the case
      anymore as the current ring format does not degrade in any noticeable
      way when this happens. Moreover, this threshold makes it impossible
      to run the system with rings that have less than 128 entries.
      
      So let us remove this threshold and just get one entry from the ring
      as in all other functions. This will enable us to remove this function
      in a later commit. Note that xskq_produce_addr_lazy followed by
      xskq_produce_flush_addr_n are still not the same function as
      xskq_produce_addr() as it operates on another cached pointer.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1576759171-28550-2-git-send-email-magnus.karlsson@intel.com
      484b1653
  2. 20 Dec, 2019 16 commits
  3. 19 Dec, 2019 11 commits